Cover illustration: The Rayleigh quotient with its three basins of attraction for a symmetric
three-dimensional matrix. The Rayleigh quotient method, an iterative technique for
finding eigenvalue–eigenvector pairs, is developed on page 99. The solution to the Swift–
Hohenberg equation with random initial conditions, which models Rayleigh–Bénard
convection, is discussed on page 475.
$(x^2 + y^2 - 1)^3 - x^2 y^3 = 0$
$(x^2 + y^2)^2 - 2(x^2 - y^2) = 0$
This book was born out of a set of lecture notes I wrote for a sequence of
three numerical methods courses taught at the Air Force Institute of Technology.
The courses Numerical Linear Algebra, Numerical Analysis, and Numerical
Methods for Partial Differential Equations were taken primarily by mathematics,
physics, and engineering graduate students. My goals in these courses were to
present the foundational principles and essential tools of scientific computing,
provide practical applications of the principles, and generate interest in the
topic. These notes were themselves inspired by lectures from a two-sequence
numerical analysis course taught by my doctoral advisor Shi Jin at the University
of Wisconsin.
The purpose of my notes was first to guide the lectures and discussion, and
second, to provide students with a bridge to more rigorous and complete, but
also more mathematically dense textbooks and references. To this end, the notes
acted as a summary to help students learn the key mathematical ideas and explain
the principles intuitively. I favored simple picture proofs and explanations over
more rigorous but abstruse analytic derivations. Students who wanted the details
could find them in any number of other numerical mathematics texts.
In moving from a set of supplementary lecture notes to a stand-alone book, I
wondered whether to make them into a handbook, a guidebook, or a textbook.
My goal in writing and subsequently revising this book was to provide a concise
treatment of the core ideas, algorithms, proofs, and pitfalls of numerical methods
for scientific computing. I aimed to present the topics in a way that might be
consumed in bits and pieces—a handbook. I wanted them weaved together
into a grand mathematical journey, with enough detail to elicit an occasional
“aha” but not so much as to become overbearing—a guidebook. Finally, I
wanted to present the ideas using a pedagogical framework to help anyone with a
good understanding of multivariate calculus and linear algebra learn valuable
mathematical skills—a textbook. Ultimately, I decided on a tongue-in-cheek
xiv Preface
“definitive manual for math geeks.” To be clear, it’s definitely not definitive. And,
it need not be. If a person knows the right questions to ask, they can find a dozen
answers through a Google search, a GitHub tutorial, a Wikipedia article, a Stack
Exchange snippet, or a YouTube video.
When I published the first edition of this book several years ago, I did so
with an agile development mindset. Get it out quickly with minimal bugs so
that it can be of use to others. Then iterate and improve. Make it affordable.
When textbooks often sell for over a hundred dollars, keep the print version
under twenty and the electronic one free. Publish it independently to be able
to make rapid improvements and to keep costs down. While I had moderate
aspirations, the Just Barely Good Enough (JBGE) first edition was panned. A
reviewer named Ben summarized it on Goodreads: “Typos, a terrible index.
Pretty sure it was self published. It does a good job as a primer, but functionally
speaking, it’s a terrible textbook.” Another reviewer who goes by Nineball stated
on Amazon: “This text is riddled with typos and errors. You can tell the author
had great objectives for making concise book for numerical methods, however
the number of errors significantly detracts from the message and provides a
substantial barrier to understanding.” This JBGE++ edition fixes hundreds of
typos, mistakes, and coding glitches. Still, there are undoubtedly errors I missed
and ones I inadvertently introduced during revision. Truth be told, I’d rather
spend my time watching The Great British Bake Off with my beautiful wife than
hunting down every last typo. I also learned that designing a good index is a
real challenge. The index in this book is still a work in progress. I apologize
to anyone who struggled with the JBGE edition. And I apologize in advance
for any mistakes that appear in this one. I understand that Donald Knuth would
personally send a check for $2.56 to anyone who found errors in his books. I
can’t do that, but if you ever find yourself in my town, I’ll happily buy you a
beer to ease your frustration. That offer goes out to you, especially, Ben and Nineball.
Paul Halmos once advised writers of mathematics to “pretend that you are
explaining the subject to a friend on a long walk in the woods, with no paper
available.” I wonder what he would have said about writing about numerical
methods. Would he have said “with no computer available?” I believe so.
Understanding Newton’s method, for instance, has more to do with the Banach
fixed-point theorem than addition assignment operators. While a mathematics
book should be agnostic about scientific programming languages, snippets of
code can help explain the underlying theory and make it more practical. In the
first edition of this book, I focused entirely on Matlab. With this edition, I’ve
embraced Julia. To help understand the switch, one need only consider that
Matlab was first released forty years ago, placing its development closer to the
1940s ENIAC than today’s grammar of graphics, dataframes, reactive notebooks,
and cloud computing. Python’s NumPy and SciPy are twenty years old. Julia is
barely ten. Still, Matlab and Python are both immensely important languages.
Scientific computing is the study and use of computers to solve scientific problems.
Numerical methods are techniques designed to solve those problems efficiently
and accurately using numerical approximation. Numerical methods have been
around for centuries. Indeed, some four thousand years ago, Babylonians used
a technique for approximating square roots. And over two thousand years
ago, Archimedes developed a method for approximating 𝜋 by inscribing and
circumscribing a circle with polygons. With the progress of mathematical
thought, with the discovery of new scientific problems requiring novel and
efficient approaches, and more recently, with the proliferation of cheap and
powerful computers, numerical methods have worked their way into all aspects
of our lives, although mostly hidden from view.
Eighteenth-century mathematicians Joseph Raphson and Thomas Simpson
used Newton’s then recently invented calculus to develop numerical methods
for finding the zeros of functions. Their approaches are at the heart of gradient
descent, making today’s deep learning algorithms possible. In solving problems
of heat transfer, nineteenth-century mathematician Joseph Fourier discovered
that any continuous function could be written as an infinite series of sines
and cosines. At the same time, Carl Friedrich Gauss invented though never
published a numerical method to compute the coefficients of Fourier’s series
recursively. It wasn’t until the 1960s that mathematicians James Cooley and John
Tukey reinvented Gauss’ fast Fourier transform, this time spurred by a Cold War
necessity of detecting nuclear detonations. Computers themselves have enabled
mathematical discovery. Shortly after World War I, French mathematicians Pierre
Fatou and Gaston Julia developed a new field of mathematics that examined
the dynamical structure of iterated complex functions. Sixty years later, Benoit
Mandelbrot, a researcher at IBM, used computers to discover the intricate fractal
worlds hidden in these simple recursive formulas.
Today, numerical methods are accelerating the convergence of different
xx Introduction
There are two fundamental problems in linear algebra: solving a system of linear
equations and solving the eigenvalue problem. Succinctly, the first problem says
where 𝑓 (𝑥, 𝑦) is the source term and 𝑔(𝑥, 𝑦) is the boundary value. Numerically,
we can solve this problem by first considering a discrete approximation and then
solving the often large resultant system of linear equations. When solving the
problem, getting the solution is primarily left up to a black box. Part One of this
book breaks open that black box.
Numerically, one can solve the problem Ax = b either directly or iteratively.
The primary direct solver is Gaussian elimination. We can get more efficient
methods such as banded solvers, block solvers, and Cholesky decomposition by
taking advantage of a matrix’s properties, such as low bandwidth and symmetry.
Chapter 2 looks at direct methods. Large, sparse matrices often arise in the
developed in the 1960s to solve problems for which finite difference methods
were not well adapted, such as handling complicated boundaries. Hence, finite
element methods provide attractive solutions to engineering problems. Finite
element methods often require more numerical and mathematical machinery
than finite difference methods. Fourier spectral methods were also developed
in the 1960s following the (re)discovery of the fast Fourier transform. They are
important in fluid modeling and analysis where boundary effects can be assumed
to be periodic.
Finally, Part Three explores three mathematical requirements for a numerical
method: consistency, stability, and convergence. Consistency says that the
discrete numerical method is a correct approximation to the continuous problem.
Stability says that the numerical solution does not blow up. Finally, convergence
says you can always reduce error in the numerical solution by refining the mesh.
In other words, convergence says you get the correct solution.
Doing mathematics
Mathematician Paul Halmos once remarked, “the only way to learn mathematics
is to do mathematics.” Doing mathematics involves visualizing complex data
structures, thinking logically and creatively, and gaining conceptual understanding
and insight. Doing mathematics is about problem-solving and pattern recognition.
It is recognizing that the same family of equations that models the response
of tumor cells to chemical signals also models an ice cube melting in a glass
of water and the patterning of spots on a leopard. It is appreciating that the
equations of fluid dynamics can apply in one instance to tsunamis, in the next
to traffic flow, and in a third to supersonic flight. Ultimately, mathematics is
about understanding the world, and doing mathematics is learning to see that.
Of course, one must learn the mechanics and structure of mathematics to have
the familiarization, technique, and confidence to start doing mathematics. Each
chapter concludes with a set of problems. Solutions to problems marked with a
b are provided in the Back Matter.
Julia code has a matching snippet of Python and Matlab code. The code may not
be entirely Julian, Pythonic, or Matlabesque, as some effort was taken to bridge
all of the languages with similar syntax to elucidate the underlying mathematical
concepts. The code is likely not the fastest implementation—in some cases, it is
downright slow. Furthermore, the code may overlook some coding best practices,
such as exception handling, in favor of brevity. And, let’s be honest, long blocks
of code are dull. All of the code is available as a Jupyter notebook:
The code in this book was written and tested in Julia version 1.9.2. To cut down
on redundancy, the LinearAlgebra.jl and Plots.jl packages are implicitly assumed
always to be imported and available, while other packages will be explicitly
imported in the code blocks. Additionally, we’ll use the variable bucket to
reference the GitHub directory of data files.
using LinearAlgebra, Plots
bucket = "https://raw.githubusercontent.com/nmfsc/data/master/";
QR links
When I was a young boy, I would thumb through my father’s copy of The Restless
Universe by physicist Max Born. (Max Born is best known as a Nobel laureate
for his contributions to fundamental research in quantum mechanics and lesser
known as a grandfather to 80s pop icon Olivia Newton-John.) While I didn’t
understand much of the book, I marveled at the illustrations. The book, first
published in 1936, featured in its margins a set of what Born called “films,”
what the publisher called “mutoscopic pictures,” and what we today would call
flipbooks. By flicking the pages with one’s thumb, simple animations emerged
from the pages that helped the reader visualize the physics a little bit better. In
designing this book, I wanted to repurpose Max Born’s idea. I decided to use
QR codes at the footers of several pages as a hopefully unobtrusive version of
a digital flipbook to animate an illustration or concept discussed on that page.
As a starting example, the QR code on this page contains one of Max Born’s
original films animating gas molecules. Simply unlock the code using your
smartphone—I promise no Rick Astley. But you can also ignore them altogether.
We’ll start by reviewing some essential concepts and definitions of linear algebra.
This chapter is brief, so please check out other linear algebra texts such as Peter
Lax’s book Linear Algebra and Its Applications, Gilbert Strang’s identically
titled book, or David Watkins’ book Fundamentals of Matrix Computations for
missing details.
Simply stated, a vector space is a set 𝑉 that is closed under linear combinations
of its elements called vectors. If any two vectors of 𝑉 are added together, the
resultant vector is also in 𝑉; and if any vector is multiplied by a scalar, the
resultant vector is also in $V$. The scalar is often an element of the real numbers $\mathbb{R}$
or the complex numbers $\mathbb{C}$, but it could also be an element of any other field $\mathbb{F}$.
We often say that $V$ is a vector space over the field $\mathbb{F}$ to remove the ambiguity.
Once we have the vector space’s basis—which we’ll come to shortly—we can
express and manipulate vectors as arrays of elements of the field with a computer.
What are some examples of vector spaces? The set of points in an $n$-dimensional
space is a vector space. The set of polynomials $p_n(x) = \sum_{k=0}^{n} a_k x^k$
is another vector space. The set of piecewise linear functions over the interval
[0, 1] is yet another vector space—this one is important for the finite-element
method. The set of all differentiable functions over the interval [0, 1] that vanish
at the endpoints is also a vector space. Another is the vector space over the
Galois field $\mathrm{GF}(p^n)$ with characteristic $p$ and $n$ terms. For example, $\mathrm{GF}(2^8)$
4 A Review of Linear Algebra
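Let $V$ and $W$ be vector spaces over $\mathbb{C}$. A linear map from $V$ into $W$ is a function
$f: V \to W$ such that $f(\alpha\mathbf{x} + \beta\mathbf{y}) = \alpha f(\mathbf{x}) + \beta f(\mathbf{y})$ for all $\alpha, \beta \in \mathbb{C}$ and $\mathbf{x}, \mathbf{y} \in V$.
For any linear map $f$, there exists a unique matrix $\mathbf{A} \in \mathbb{C}^{m\times n}$ such that $f(\mathbf{x}) = \mathbf{A}\mathbf{x}$
for all $\mathbf{x} \in \mathbb{C}^n$. Every $n$-dimensional vector space over $\mathbb{C}$ is isomorphic to $\mathbb{C}^n$.
And every linear map from an $n$-dimensional vector space into an $m$-dimensional
vector space can be represented by an $m \times n$ matrix.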
A column vector can be formed using the syntax [1,2,3,4], [1;2;3;4],
[(1:4)...] , or [i for i ∈1:4]. A row vector has the syntax [1 2 3 4].
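As a quick check of the syntax above, the following sketch confirms that the four column-vector forms produce the same object and that the space-separated form is a 1 × 4 matrix. The variable names are illustrative only.

```julia
# Four equivalent ways to build the column vector (1, 2, 3, 4)
a = [1,2,3,4]
b = [1;2;3;4]
c = [(1:4)...]
d = [i for i ∈ 1:4]
@show a == b == c == d      # all four are the same Vector{Int}

# Space-separated entries give a 1×4 matrix (a row vector), not a Vector
r = [1 2 3 4]
@show size(a) size(r)
@show vec(r) == a           # same entries once the row is flattened
```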
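$$
\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
= \begin{bmatrix} b \\ 2c \\ 0 \end{bmatrix}
\quad\text{so that}\quad
\frac{\mathrm{d}}{\mathrm{d}x} \equiv
\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{bmatrix}
$$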
Example. Consider the derivative operator over the space of quadratic poly-
nomials using the monomial basis {1, 𝑥, 𝑥 2 }. The derivative operator maps
constants to zero, so the null space of the derivative operator is span{1} or
equivalently span{(1, 0, 0)}. The derivative of a quadratic polynomial spans
{1, 𝑥}, so the column space is span{1, 𝑥} or equivalently span{(1, 0, 0), (0, 1, 0)}.
The left null space is the complement to the column space, so it follows that the
left null space is span{𝑥 2 } or span{(0, 0, 1)}. J
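The subspaces in this example can be checked numerically. The sketch below assumes the matrix representation of the derivative operator in the monomial basis {1, x, x²}, acting on coefficient vectors (a, b, c); the test polynomial is an arbitrary choice.

```julia
using LinearAlgebra

# Derivative operator on quadratics a + bx + cx² in the basis {1, x, x²}
D = Float64[0 1 0; 0 0 2; 0 0 0]

p  = [3.0, 5.0, 7.0]            # the polynomial 3 + 5x + 7x²
dp = D*p                        # coefficients of its derivative 5 + 14x

N     = nullspace(D)            # null space: spanned by (1, 0, 0)
Nleft = nullspace(Matrix(D'))   # left null space: spanned by (0, 0, 1)
r     = rank(D)                 # dimension of the column space

@show dp vec(N) vec(Nleft) r
```

The basis vectors returned by `nullspace` are normalized and may differ from the ones in the text by a sign.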
1.2 Eigenspaces
column space, and left null space) is often called the fundamental theorem of linear algebra, a term
popularized by Gilbert Strang.
1 1 1 −2
01 1 1
Similar matrices
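Symmetric: $\mathbf{A}^T = \mathbf{A}$.
Hermitian or self-adjoint: $\mathbf{A}^H = \mathbf{A}$.
Positive definite: a Hermitian matrix $\mathbf{A}$ such that $\mathbf{x}^H \mathbf{A}\mathbf{x} > 0$ for any nonzero
vector $\mathbf{x}$.
Orthogonal: $\mathbf{Q}^T \mathbf{Q} = \mathbf{I}$.
Projection: $\mathbf{P}^2 = \mathbf{P}$.
Tridiagonal: $a_{ij} = 0$ if $|i - j| > 1$.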
[Figure: sparsity patterns of tridiagonal, upper Hessenberg, upper triangular, and block matrices.]
𝜆 ∗ ···
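$$
\mathbf{U}_2 = \begin{bmatrix} 1 & 0 & \cdots \\ 0 & & \\ \vdots & & \hat{\mathbf{U}}_1 \end{bmatrix}.
$$
Then $\mathbf{U}_2$ is unitary and
$$
\mathbf{U}_2^H \mathbf{A}_1 \mathbf{U}_2 =
\begin{bmatrix} \lambda & * & \cdots \\ 0 & & \\ \vdots & & \hat{\mathbf{U}}_1^H \hat{\mathbf{A}}_1 \hat{\mathbf{U}}_1 \end{bmatrix}
= \begin{bmatrix} \lambda & * & \cdots \\ 0 & & \\ \vdots & & \hat{\mathbf{T}} \end{bmatrix}
$$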
is an upper triangular matrix. Call it T. So
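Let’s look just at the diagonal elements of the product $\boldsymbol{\Lambda}^H \boldsymbol{\Lambda} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^H$. Starting
with the (1, 1)-element:
$$
|t_{11}|^2 = \sum_{k=1}^{n} |t_{1k}|^2 .
$$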
× × × ×
× ××× = ××× × .
×× ×× ×× ××
××× × × ×××
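where $\mathbf{q}_k \mathbf{q}_k^H$ is an orthogonal projection matrix onto the unit eigenvector $\mathbf{q}_k$. This
decomposition is known as the spectral decomposition of $\mathbf{A}$.
$$\mathbf{U}^H \mathbf{A}\mathbf{V} = \boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$$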
[Figure 1.1, panels 1 to 4; labels: $\mathbf{v}_2$, $\mathbf{A}$, $\sigma_1\mathbf{u}_1$, $\sigma_2\mathbf{u}_2$.]
Figure 1.1: A matrix maps points on the unit circle to points on an ellipse. Such
a mapping has two, one, or zero real eigenvectors (thick line segments) running
along the radial directions.
to the left and right singular vectors 1 . If we adjust the right singular vector
slightly by twisting the unit circle, the eigenvectors move toward one another 2
until they are colinear 3 , and finally they become complex 4 . A matrix that does
not have a complete basis of eigenvectors is called defective. J
Inner products
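An inner product on the vector space $V$ is any map $(\cdot, \cdot)$ from $V \times V$ into $\mathbb{R}$ or $\mathbb{C}$
with three properties (linearity, Hermiticity, and positive definiteness):
1. $(a\mathbf{u} + b\mathbf{v}, \mathbf{w}) = a(\mathbf{u}, \mathbf{w}) + b(\mathbf{v}, \mathbf{w})$ for all $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$ and $a, b \in \mathbb{C}$;
2. $(\mathbf{u}, \mathbf{v}) = \overline{(\mathbf{v}, \mathbf{u})}$; and
3. $(\mathbf{u}, \mathbf{u}) \ge 0$ and $(\mathbf{u}, \mathbf{u}) = 0$ if and only if $\mathbf{u} = \mathbf{0}$.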
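Two vectors $\mathbf{u}$ and $\mathbf{v}$ are orthogonal if $(\mathbf{u}, \mathbf{v}) = 0$. Important examples of inner
products include the Euclidean inner product $(\mathbf{u}, \mathbf{v})_2 = \mathbf{u}^T \mathbf{v}$ and the $\mathbf{A}$-energy
inner product $(\mathbf{u}, \mathbf{v})_{\mathbf{A}} = \mathbf{u}^T \mathbf{A}\mathbf{v}$ where $\mathbf{A}$ is symmetric, positive definite. Vectors
that are orthogonal under the $\mathbf{A}$-energy inner product are said to be $\mathbf{A}$-orthogonal
or $\mathbf{A}$-conjugate. We can visualize two vectors $\mathbf{u}$ and $\mathbf{v}$ that are orthogonal under
the Euclidean inner product as being perpendicular in space. What can we make
of $\mathbf{A}$-orthogonal vectors? Because $\mathbf{A}$ is a symmetric, positive-definite matrix, it
has the spectral decomposition $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1}$, where $\mathbf{Q}$ is an orthogonal matrix
of eigenvectors. Then $(\mathbf{u}, \mathbf{v})_{\mathbf{A}}$ is simply $(\boldsymbol{\Lambda}^{1/2}\mathbf{Q}^{-1}\mathbf{u})^T(\boldsymbol{\Lambda}^{1/2}\mathbf{Q}^{-1}\mathbf{v})$. That is,
$\mathbf{u}$ and $\mathbf{v}$ are perpendicular in the scaled eigenbasis of $\mathbf{A}$.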
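To make the scaled-eigenbasis picture concrete, here is a minimal sketch. The matrix and the two vectors are illustrative choices, picked so that the A-energy inner product vanishes while the Euclidean one does not.

```julia
using LinearAlgebra

A = [4.0 1.0; 1.0 3.0]          # symmetric, positive definite (illustrative)
u = [1.0, 0.0]
v = [-A[1,2], A[1,1]]           # chosen so that u'Av = 0

λ, Q = eigen(Symmetric(A))      # A = QΛQ⁻¹ with orthogonal Q
S = Diagonal(sqrt.(λ))*Q'       # the map Λ^(1/2)Q⁻¹ into the scaled eigenbasis

@show dot(u, v)                 # nonzero: not perpendicular in the usual sense
@show dot(u, A*v)               # zero: u and v are A-orthogonal
@show dot(S*u, S*v)             # zero up to round-off: perpendicular after scaling
```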
Vector norms
A vector norm on the vector space $V$ is any map $\|\cdot\|$ from $V$ into $\mathbb{R}$ with
three properties (positivity, homogeneity, and subadditivity):
Example. Let’s first show that the 1- and 2-norms are equivalent and then show
that the 2- and ∞-norms are equivalent. We’ll take $V = \mathbb{R}^n$.
The hypercube $\|\mathbf{x}\|_1 = 1$ can be inscribed in the hypersphere
$\|\mathbf{x}\|_2 = 1$, which can itself be inscribed in the hypercube
$\|\mathbf{x}\|_1 = \sqrt{n}$. So $\tfrac{1}{\sqrt{n}}\|\mathbf{x}\|_1 \le \|\mathbf{x}\|_2 \le \|\mathbf{x}\|_1$.
The hypercube $\|\mathbf{x}\|_\infty = 1/\sqrt{n}$ can be inscribed in the hypersphere
$\|\mathbf{x}\|_2 = 1$, which can be inscribed in the hypercube
$\|\mathbf{x}\|_\infty = 1$. So $\|\mathbf{x}\|_\infty \le \|\mathbf{x}\|_2 \le \sqrt{n}\|\mathbf{x}\|_\infty$. J
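The two chains of inequalities are easy to test numerically. The following sketch draws an arbitrary random vector (the dimension 5 is just an example) and checks both bounds.

```julia
using LinearAlgebra

n = 5
x = randn(n)                    # an arbitrary test vector

# (1/√n)‖x‖₁ ≤ ‖x‖₂ ≤ ‖x‖₁
@show norm(x,1)/sqrt(n) ≤ norm(x,2) ≤ norm(x,1)

# ‖x‖∞ ≤ ‖x‖₂ ≤ √n‖x‖∞
@show norm(x,Inf) ≤ norm(x,2) ≤ sqrt(n)*norm(x,Inf)
```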
Matrix norms
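A matrix norm is a map $\|\cdot\|$ from $\mathbb{R}^{m\times n}$ into $\mathbb{R}$ with three properties (positivity,
homogeneity, and subadditivity):
1. $\|\mathbf{A}\| \ge 0$ and $\|\mathbf{A}\| = 0$ if and only if $\mathbf{A} = \mathbf{0}$;
2. $\|\alpha\mathbf{A}\| = |\alpha|\,\|\mathbf{A}\|$; and
3. $\|\mathbf{A} + \mathbf{B}\| \le \|\mathbf{A}\| + \|\mathbf{B}\|$.
A matrix norm is said to be submultiplicative if it has the additional property
4. $\|\mathbf{A}\mathbf{B}\| \le \|\mathbf{A}\|\,\|\mathbf{B}\|$.
A matrix norm $\|\cdot\|$ is said to be compatible or consistent with a vector norm $\|\cdot\|$
if $\|\mathbf{A}\mathbf{x}\| \le \|\mathbf{A}\|\,\|\mathbf{x}\|$ for all $\mathbf{x} \in \mathbb{R}^n$. One might ask “what is the smallest matrix
norm that is compatible with a vector norm?” We call such a norm the induced
matrix norm:
$$
\|\mathbf{A}\| = \sup_{\mathbf{x}\ne\mathbf{0}} \frac{\|\mathbf{A}\mathbf{x}\|}{\|\mathbf{x}\|} = \sup_{\|\mathbf{x}\|=1} \|\mathbf{A}\mathbf{x}\|.
$$
The induced norm is the most that a matrix can stretch a vector in a vector norm.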
Theorem 3. An induced norm is compatible with its associated vector norm. An
induced norm of the identity matrix is one. An induced norm is submultiplicative.
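1. $\displaystyle \|\mathbf{A}\mathbf{x}\| = \frac{\|\mathbf{A}\mathbf{x}\|}{\|\mathbf{x}\|}\,\|\mathbf{x}\| \le \sup_{\mathbf{x}\ne\mathbf{0}} \frac{\|\mathbf{A}\mathbf{x}\|}{\|\mathbf{x}\|}\,\|\mathbf{x}\| = \|\mathbf{A}\|\,\|\mathbf{x}\|$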
Arguably, the most important induced matrix norms are the 𝑝-norms
In particular, the 2-, 1-, and ∞-norms arise frequently in numerical methods. So,
it’s good to get a geometric intuition about each one. See Figure 1.3.
We’ll start with the 2-norm or the spectral norm. A circle ($\mathbf{x}$) is mapped to
an ellipse ($\mathbf{A}\mathbf{x}$) under a linear transformation ($\mathbf{A}$), and $\sup \|\mathbf{A}\mathbf{x}\|_2$ corresponds to
the largest radius of the ellipse. So $\|\mathbf{A}\|_2 = \sigma_{\max}(\mathbf{A})$.
Next, consider the unit circle of the 1-norm. The square ($\mathbf{x}$) is mapped to
a parallelogram ($\mathbf{A}\mathbf{x}$). The supremum of $\|\mathbf{A}\mathbf{x}\|_1$ must come from one of the
vertices of the parallelogram. The vertices of the parallelogram are mapped from
the corners of the square, each of which corresponds to one of the standard basis
elements $\pm\boldsymbol{\xi}_j$, e.g., in three dimensions $(\pm 1, 0, 0)$, $(0, \pm 1, 0)$, or $(0, 0, \pm 1)$. Then
$\mathbf{A}\boldsymbol{\xi}_j$ returns (plus or minus) the $j$th column of $\mathbf{A}$. Therefore,
$$
\|\mathbf{A}\|_1 = \max_{1\le j\le n} \sum_{i} |a_{ij}|.
$$
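Finally, consider the unit circle for the ∞-norm. Again, a square ($\mathbf{x}$) is mapped
to a parallelogram ($\mathbf{A}\mathbf{x}$). The supremum of $\|\mathbf{A}\mathbf{x}\|_\infty$ must originate from the corners
of the square. This time the corners are at $(\pm 1, \pm 1, \dots, \pm 1)$. Explicitly,
$$
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{bmatrix} \pm 1 \\ \pm 1 \\ \pm 1 \end{bmatrix}
= \begin{bmatrix} \pm a_{11} \pm a_{12} \pm a_{13} \\ \pm a_{21} \pm a_{22} \pm a_{23} \\ \pm a_{31} \pm a_{32} \pm a_{33} \end{bmatrix}.
$$
By taking the largest possible element, we have
$$
\|\mathbf{A}\|_\infty = \max_{1\le i\le n} \sum_{j} |a_{ij}|.
$$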
Measuring vectors and matrices 17
Note that $\|\mathbf{A}\|_\infty = \|\mathbf{A}^T\|_1$. You can use the subscripts of the 1- and ∞-norms
as a mnemonic to remember in which direction to take the sums. The “1” runs
vertically, so take the maximum of the sums down the columns. The “∞” lies
horizontally, so take the maximum of the sums along the rows. For example,
1 3
A = 2 1 1 , kAk 1 = 7, kAk ∞ = 6.
1 0
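The column-sum and row-sum rules can be compared against the `opnorm` function of LinearAlgebra.jl. The matrix in this sketch is an illustrative one, not the matrix from the example above.

```julia
using LinearAlgebra

A = Float64[1 -2 3; 4 5 -6; -7 8 9]     # illustrative matrix

colsums = sum(abs.(A), dims=1)          # absolute column sums: 12, 15, 18
rowsums = sum(abs.(A), dims=2)          # absolute row sums: 6, 15, 24

@show maximum(colsums) == opnorm(A,1)   # 1-norm: largest column sum
@show maximum(rowsums) == opnorm(A,Inf) # ∞-norm: largest row sum
@show opnorm(A,2) ≈ maximum(svdvals(A)) # 2-norm: largest singular value
@show opnorm(A,Inf) == opnorm(A',1)     # ‖A‖∞ = ‖Aᵀ‖₁
```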
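$$
\rho(\mathbf{A}) = \max_{\lambda\in\lambda(\mathbf{A})} |\lambda| \le \sup_{\mathbf{x}\ne\mathbf{0}} \frac{\|\mathbf{A}\mathbf{x}\|}{\|\mathbf{x}\|} = \|\mathbf{A}\|.
$$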
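A standard basis vector $\boldsymbol{\xi}_j$ is mapped to the $j$th column of $\mathbf{A}$. So $\|\mathbf{A}\|_{\max}$ must
be equal to or smaller than the largest component of $\mathbf{A}\boldsymbol{\xi}_j$ for some $\boldsymbol{\xi}_j$. The largest
component of $\mathbf{A}\boldsymbol{\xi}_j$ must be equal to or smaller than the length of $\mathbf{A}\boldsymbol{\xi}_j$. So
$\|\mathbf{A}\|_{\max} \le \max_j \|\mathbf{A}\boldsymbol{\xi}_j\|_2 \le \sigma_{\max}(\mathbf{A})$.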
Example. We can put a lower and upper bound on the spectral radius of a
symmetric matrix using $\|\mathbf{A}\|_{\max} \le \rho(\mathbf{A}) \le \|\mathbf{A}\|_\infty$. For the following matrix
5 1
1 2
1 3
2 2
2 2
2 4
1 6
3 2
Quadratic forms
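Proof. Suppose that $\mathbf{A}$ is positive definite. Then $\mathbf{x}^T\mathbf{A}\mathbf{x} > 0$ for any nonzero
vector $\mathbf{x}$. If $\mathbf{x}$ is an eigenvector, $\mathbf{x}^T\mathbf{A}\mathbf{x} = \mathbf{x}^T\lambda\mathbf{x} = \lambda\|\mathbf{x}\|^2$. It follows that $\lambda > 0$. On
the other hand, suppose that $\mathbf{A}$ has only positive eigenvalues $\lambda_i$. Because $\mathbf{A}$ is
symmetric, any vector $\mathbf{x}$ can be written as a linear combination of orthogonal
unit eigenvectors $\mathbf{x} = c_1\mathbf{q}_1 + c_2\mathbf{q}_2 + \cdots + c_n\mathbf{q}_n$. Then the quadratic form
$$
\begin{aligned}
\mathbf{x}^T\mathbf{A}\mathbf{x} &= (c_1\mathbf{q}_1 + c_2\mathbf{q}_2 + \cdots + c_n\mathbf{q}_n)^T(c_1\lambda_1\mathbf{q}_1 + c_2\lambda_2\mathbf{q}_2 + \cdots + c_n\lambda_n\mathbf{q}_n) \\
&= c_1^2\lambda_1\|\mathbf{q}_1\|^2 + c_2^2\lambda_2\|\mathbf{q}_2\|^2 + \cdots + c_n^2\lambda_n\|\mathbf{q}_n\|^2 = \sum_{i=1}^{n} \lambda_i c_i^2
\end{aligned}
$$
by orthogonality of the eigenvectors. Because $\lambda_i > 0$ for all $i$, it follows that
$\mathbf{x}^T\mathbf{A}\mathbf{x} > 0$ for any nonzero $\mathbf{x}$.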
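The argument in the proof can be mirrored numerically. This sketch uses an illustrative symmetric matrix and an arbitrary random vector; it checks that the eigenvalues are positive and that the quadratic form equals the weighted sum of squared eigenbasis coefficients.

```julia
using LinearAlgebra

A = [2.0 -1.0 0.0; -1.0 2.0 -1.0; 0.0 -1.0 2.0]   # illustrative symmetric matrix

λ, Q = eigen(Symmetric(A))      # eigenvalues and orthonormal eigenvectors
@show all(λ .> 0)               # every eigenvalue is positive
@show isposdef(A)               # built-in test based on a Cholesky factorization

x = randn(3)                    # arbitrary vector
c = Q'*x                        # coefficients of x in the eigenvector basis
@show x'*A*x ≈ sum(λ .* c.^2)   # quadratic form equals Σ λᵢcᵢ²
```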
Example. Linear algebra used in quantum physics has its own unique notation.
When quantum theory was developed in the early twentieth century, linear algebra
was not widely taught in physics (or mathematics) departments, so the quantum
pioneer Paul Dirac invented his own notation. A solution 𝜓(𝑥) to Schrödinger’s
equation H𝜓 = 𝐸𝜓, with the Hamiltonian operator H = −(ℏ2 /2𝑚)Δ + 𝑉 (𝑥) and
the real-valued energy 𝐸, is a vector of an infinite-dimensional vector space 𝑊.
In Dirac or bra–ket notation, a vector $\psi$ is denoted by a special symbol $|\;\rangle$. We
can then express the vector $\alpha\psi$ as $\alpha|\;\rangle$. Because this notation is ambiguous when
working with more than one vector, say $\psi_1$ and $\psi_2$, we use an additional label, say
1 and 2, to write the vectors as $|1\rangle$ and $|2\rangle$. Note that the label is simply enclosed
in $|$ and $\rangle$. Such an object is called a “ket.” We can then express $\alpha\psi_1$ as $\alpha|1\rangle$.
Bear in mind that anything between $|$ and $\rangle$ is simply a label, so $|\alpha 1\rangle \ne \alpha|1\rangle$
and $|1 + 2\rangle \ne |1\rangle + |2\rangle$. The space of kets spans the vector space $W$, and we can
choose a basis of kets.
A linear functional mapping 𝑊 → C is often called a covector. The space
of linear functionals, which forms the dual space to the vector space, is itself
States are the solutions to the Schrödinger equation H𝜓 = 𝐸𝜓. Recognize that
the Schrödinger equation is simply an eigenvalue equation (an eigenequation).
The eigenvalue solutions of it are called eigenstates. The complex-valued
$\psi$ are called wave functions, and by convention are normalized over
space $\int_{-\infty}^{\infty} \psi^*\psi\,\mathrm{d}x = 1$. This gives a wave function $\psi$ the interpretation of a
probability distribution $\rho(x) = |\psi(x)|^2$, the probability density of a particle at
position 𝑥. Because the Schrödinger equation is an eigenvalue problem, we can
find the spectral decomposition of the Hamiltonian operator
1.4 Stability
[Figure panels: $\kappa_2 = 1$, $\kappa_2 = 1$, $\kappa_2 = 2$, $\kappa_2 = 10$]
Figure 1.4: Images of unit circles under different linear transformations and their
associated condition numbers.
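Furthermore, since
$$
\begin{aligned}
\|\mathbf{A}^{-1}\| &= \sup_{\|\mathbf{x}\|=1} \|\mathbf{A}^{-1}\mathbf{x}\| = \sup_{\mathbf{x}\ne\mathbf{0}} \frac{\|\mathbf{A}^{-1}\mathbf{x}\|}{\|\mathbf{x}\|} = \sup_{\mathbf{y}\ne\mathbf{0}} \frac{\|\mathbf{A}^{-1}\mathbf{y}\|}{\|\mathbf{y}\|} \quad\text{where } \mathbf{y} = \mathbf{A}\mathbf{x} \\
&= \sup_{\mathbf{x}\ne\mathbf{0}} \frac{\|\mathbf{A}^{-1}\mathbf{A}\mathbf{x}\|}{\|\mathbf{A}\mathbf{x}\|} = \sup_{\mathbf{x}\ne\mathbf{0}} \frac{\|\mathbf{x}\|}{\|\mathbf{A}\mathbf{x}\|} = \sup_{\|\mathbf{x}\|=1} \frac{1}{\|\mathbf{A}\mathbf{x}\|} = \frac{1}{\inf_{\|\mathbf{x}\|=1} \|\mathbf{A}\mathbf{x}\|},
\end{aligned}
$$
it follows that
$$
\kappa(\mathbf{A}) = \frac{\sup_{\|\mathbf{x}\|=1} \|\mathbf{A}\mathbf{x}\|}{\inf_{\|\mathbf{x}\|=1} \|\mathbf{A}\mathbf{x}\|}.
$$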
$$
\kappa(\mathbf{A}) = \frac{\text{maximum magnification}}{\text{minimum magnification}}.
$$
Specifically, the condition number in the Euclidean norm is 𝜅 2 (A) = 𝜎max /𝜎min ,
the ratio of radii of the semi-major to semi-minor axes of an ellipsoid image of a
matrix. See the figure above.
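A small numerical check ties these expressions together. The matrix below is an illustrative choice; `cond`, `svdvals`, and `opnorm` are functions of LinearAlgebra.jl.

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]          # illustrative matrix
σ = svdvals(A)                  # singular values in decreasing order

@show cond(A)                   # 2-norm condition number
@show σ[1]/σ[end]               # σmax/σmin, the same ratio
@show opnorm(A)*opnorm(inv(A))  # ‖A‖‖A⁻¹‖, again the same number

Q, _ = qr(randn(3,3))           # a random orthogonal matrix
@show cond(Matrix(Q))           # approximately one
```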
Another way to think of the condition number is as a measure of linear
dependency of the columns. A matrix $\mathbf{A}$ is ill-conditioned if $\kappa(\mathbf{A}) \gg 1$. If $\mathbf{A}$ is
singular, 𝜅(A) = ∞. If A is a unitary or orthogonal matrix, then 𝜅 2 (A) = 1. This
and display the density plot of H−1 H by explicitly converting it to type Gray:
using Images
[Gray.(1 .- abs.(hilbert(n)\hilbert(n))) for n ∈ (10,15,20,25,50)]
[Figure panels: $n = 10$, $n = 15$, $n = 20$, $n = 25$, $n = 50$]
Zero values are white, values with magnitude one or greater are black, and
intermediate values are shades of gray. An identity matrix is represented by a
diagonal of black in a field of white. While the numerical computation appears
to be more-or-less correct for 𝑛 ≤ 10, it produces substantial round-off error for
even moderately low dimensions such as 𝑛 = 15. J
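The loss of accuracy can also be quantified without a picture. The sketch below assumes that `hilbert(n)` builds the Hilbert matrix with elements 1/(i+j−1); that definition is restated here so the block runs on its own. It prints the condition number and the distance of the computed H⁻¹H from the identity for a few sizes.

```julia
using LinearAlgebra

# Assumed definition of the Hilbert matrix, with elements 1/(i+j-1)
hilbert(n) = [1/(i+j-1) for i in 1:n, j in 1:n]

for n in (5, 10, 15)
    H = hilbert(n)
    err = opnorm(H\H - I)       # how far the computed H⁻¹H is from the identity
    println("n = $n: cond = ", cond(H), ", ‖H⁻¹H − I‖ = ", err)
end
```

The condition number grows rapidly with n, which is consistent with the visible round-off error in the density plots.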
while the 1- and ∞-norms are not. We can extend the concept of a norm to a
matrix. An induced norm is the most that any vector could be stretched by the
matrix. Norms also help us measure the condition of a matrix. An ill-conditioned
matrix maps a circle to a very eccentric ellipse. The 2-condition number is the
ratio of the semi-major to semi-minor radii.
1.6 Exercises
1.2. Prove that similar matrices have the same spectrum of eigenvalues.
(b) What is the dimension of the Krylov subspace if x is the sum of two linearly
independent eigenvectors of A?
(c) What is the maximum possible dimension that the Krylov subspace can
have if A is a projection operator?
(d) What is the maximum possible dimension that the Krylov subspace can
have if the nullity of A is 𝑚? b
1.4. A (0,1)-matrix is a matrix whose elements are either zero or one. Such
matrices are important in graph theory and combinatorics. How many of them
are invertible? This is an easy problem when the
is small. For example,
there are two 1 × 1 (0,1)-matrices, namely, [0] and [1]. So half are invertible.
There are sixteen 2 × 2 matrices whose entries are ones and zeros:
Singular: [0 0; 0 0], [1 0; 0 0], [0 1; 0 0], [0 0; 1 0], [0 0; 0 1], [1 1; 0 0], [0 0; 1 1], [1 0; 1 0], [0 1; 0 1], [1 1; 1 1]
Invertible: [1 0; 0 1], [0 1; 1 0], [1 1; 0 1], [1 0; 1 1], [1 1; 1 0], [0 1; 1 1]
(0, 1)-matrices. Test a large number (perhaps 10,000) of such matrices for each
𝑛 = 1, 2, . . . , 20. b
1.5. Consider the $n \times n$ matrix used to approximate a second-derivative
$$
\mathbf{D} = \begin{bmatrix} -2 & 1 & & \\ 1 & -2 & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & -2 \end{bmatrix}.
$$
(d) Prove that $\|\mathbf{A}\|_F = \sqrt{\sum_i \sigma_i^2}$, where $\sigma_i$ is the $i$th singular value of $\mathbf{A}$.
Hint: Don’t solve this problem sequentially. Spoiler: Start with (e). b
1.10. Cleve Moler and Charles Van Loan’s article “Nineteen dubious ways to
compute the exponential of a matrix” presents several methods of computing
$e^{\mathbf{A}}$ for a matrix $\mathbf{A}$. Discuss or implement one of the methods. How is a matrix
exponential computed in practice?
Chapter 2
The macro @less shows the Julia code for a method. The command can help you
know what’s happening inside the black box.
30 Direct Methods for Linear Systems
The macros @time and @elapsed provide run times. Use a begin...end block
for compound statements. Julia compiles your code just in time on the first run, so run it twice
and take the time from the second run.
Elementary row operations leave a matrix in row equivalent form, i.e., they do
not fundamentally change the underlying system of equations but instead return
an equivalent system of equations. The three elementary row operations are
The RowEchelon.jl function rref returns the reduced row echelon form.
The first stage (forward elimination) requires the most effort because we need
to compute using entire rows. We need to operate on $n^2$ elements to zero out
the elements below the first pivot, $(n - 1)^2$ elements to zero out the elements
below the second pivot, and so forth. The second stage (backward elimination)
is relatively fast because we only need to work on one column at each iteration.
Zeroing out the elements above the last pivot requires changing 𝑛 elements,
Gaussian elimination 31
zeroing out the elements above the second pivot requires 𝑛 − 1 elements, and so
forth. By saving the intermediate terms of forward elimination, we can express
a matrix A as the product A = LU, where L is a unit lower triangular matrix
(with ones along the diagonal) and U is an upper triangular matrix. If such a
decomposition exists, it is unique. This decomposition allows us to solve the
problem Ax = b in three steps:
1. Compute L and U.
2. Solve Ly = b for y.
3. Solve Ux = y for x.
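The three steps above can be sketched with the built-in factorization. The matrix and right-hand side are illustrative, and `NoPivot()` (available in Julia 1.7 and later) is used so that the factors satisfy A = LU with no row exchanges, as in the text.

```julia
using LinearAlgebra

A = [4.0 3.0 2.0; 8.0 8.0 5.0; 4.0 7.0 9.0]   # illustrative matrix
b = [1.0, 2.0, 3.0]

F = lu(A, NoPivot())       # step 1: A = LU without row exchanges
y = F.L\b                  # step 2: solve Ly = b
x = F.U\y                  # step 3: solve Ux = y

@show F.L*F.U ≈ A
@show A*x ≈ b
```

By default `lu(A)` uses partial pivoting, in which case the right-hand side would first need to be permuted by `F.p`.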
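$$
\begin{bmatrix} 1 & & & \\ l_{21} & 1 & & \\ \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & \cdots & 1 \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
= \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}
$$
can be solved using forward elimination, starting from the top and working down.
From the $i$th row
$$
l_{i1} y_1 + \cdots + l_{i,i-1} y_{i-1} + y_i = b_i,
$$
we have
$$
y_i = b_i - \sum_{j=1}^{i-1} l_{ij} y_j,
$$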
where each 𝑦 𝑖 is determined from the 𝑦 𝑗 coming before it. It is often convenient
to overwrite the array b with the values y to save computer memory.
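A minimal sketch of this forward elimination follows. The function name `forward_substitution!` and the test values are hypothetical; the function overwrites its second argument with y, as suggested above.

```julia
# Solve Ly = b for a unit lower triangular L, overwriting b with y
function forward_substitution!(L, b)
    n = length(b)
    for i in 2:n, j in 1:i-1
        b[i] -= L[i,j]*b[j]     # subtract the contribution of each earlier yⱼ
    end
    return b
end

L = [1.0 0.0 0.0; 2.0 1.0 0.0; 4.0 3.0 1.0]   # illustrative values
b = [1.0, 4.0, 13.0]
y = forward_substitution!(L, copy(b))          # copy keeps b for the check below
@show y
@show L*y == b
```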
$$
x_i = \frac{1}{u_{ii}} \left( y_i - \sum_{j=i+1}^{n} u_{ij} x_j \right).
$$
As before, we can overwrite the array b with the values x. Note that we never
need to change the elements of L or U when solving the triangular systems. So,
once we have the LU decomposition of a matrix, we can use it over and over
LU decomposition
Now, let’s implement the LU decomposition itself (step 1). On a practical note,
computer memory is valuable when 𝑛 is large. So it’s important to make efficient
use of it. Storing the full matrices $\mathbf{A}$, $\mathbf{L}$, and $\mathbf{U}$ in computer memory takes $3n^2$
floating-point numbers (at eight bytes each). The matrices $\mathbf{L}$ and $\mathbf{U}$ together have
the same effective information as $\mathbf{A}$, so keeping $\mathbf{A}$ is unnecessary. Also, $\mathbf{L}$ and $\mathbf{U}$
are almost half-filled with zeros—wasted memory. Now, the smart idea! Note
that except along the diagonal, the nonzero elements of L and U are mutually
exclusive. In fact, since the diagonal of L is defined to be all ones, we can store
the information in these two matrices in the same array as
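$$
\begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ l_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & u_{nn} \end{bmatrix}.
$$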
Now, the really smart idea! Overwrite A with the elements of L and U. When
performing LU decomposition, we work from the top-left to the bottom-right.
We zero out the elements of the 𝑖th column below the 𝑖th row and fill up the
corresponding elements in L. There is no conflict if we store this information in
the same matrix. Furthermore, since we are moving diagonally down the matrix
A, there is no conflict if we simply overwrite the elements of A. For example,
2 3 2 2 2
4 2 4 2 3 4 2 3 4 2 3
−2 −5 −3 −2
→ −1 −1 −1 1 → −1 −1 −1 1 → −1 −1 −1 1 .
4 2 1 2
7 6 8 −1 2 2 3 1 1 3 1
6 10 1 12 3 3 1 3
−2 −5 3 2 −3 2 −1 2
Let’s implement this algorithm. Starting with the first column, we move right.
We zero out each element in the 𝑗th column below the 𝑗th row using elementary
row operations. And here’s the switch—we backfill those zeros with the values
from L.
for 𝑗 = 1, . . . , 𝑛                                      [LU decomposition]
    for 𝑖 = 𝑗 + 1, . . . , 𝑛
        a_ij ← a_ij / a_jj
        for 𝑘 = 𝑗 + 1, . . . , 𝑛
            a_ik ← a_ik − a_ij a_jk
for 𝑖 = 2, . . . , 𝑛                                      [forward elimination]
    b_i ← b_i − \sum_{j=1}^{i-1} a_ij b_j
for 𝑖 = 𝑛, 𝑛 − 1, . . . , 1                               [backward elimination]
    b_i ← ( b_i − \sum_{j=i+1}^{n} a_ij b_j ) / a_ii
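function gaussian_elimination(A,b)
  n = size(A,1)
  # LU decomposition, overwriting A with the elements of L and U
  for j in 1:n
    A[j+1:n,j] /= A[j,j]
    A[j+1:n,j+1:n] -= A[j+1:n,j:j].*A[j:j,j+1:n]
  end
  # forward elimination, overwriting b with y
  for i in 2:n
    b[i:i] -= A[i:i,1:i-1]*b[1:i-1]
  end
  # backward elimination, overwriting b with x
  for i in n:-1:1
    b[i:i] = ( b[i] .- A[i:i,i+1:n]*b[i+1:n] )/A[i,i]
  end
  return b
end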
Operation count
$$\sum_{j=1}^{n} \sum_{i=j+1}^{n} \Bigl( 1 + \sum_{k=j+1}^{n} 2 \Bigr) = \tfrac{2}{3} n^3 - \tfrac{1}{2} n^2 - \tfrac{1}{6} n$$
$$\sum_{i=1}^{n} \Bigl( 2 + \sum_{j=1}^{i-1} 2 \Bigr) = n^2 + n$$
Gaussian elimination fails if the pivot 𝑎 𝑖𝑖 is zero at any step because we are
dividing by zero. In practice, even if a pivot is close to zero, Gaussian elimination
may be unstable because round-off errors get amplified. Consider the following
matrix and its LU decomposition without row exchanges
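$$\mathbf{A} = \begin{bmatrix} 1 & 1 & 2 \\ 2 & 2+\varepsilon & 0 \\ 4 & 14 & 4 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 10/\varepsilon & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 2 \\ 0 & \varepsilon & -4 \\ 0 & 0 & 40/\varepsilon - 4 \end{bmatrix}.$$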
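Let 𝜀 = 10⁻¹⁵, or about five times machine epsilon. The 2-norm condition
number of this matrix is about 10.3, so it is numerically pretty well-conditioned.
But because of round-off error, the computed LU decomposition is
$$\tilde{\mathbf{A}} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 10/\varepsilon & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 2 \\ 0 & \varepsilon & -4 \\ 0 & 0 & 40/\varepsilon \end{bmatrix}
= \begin{bmatrix} 1 & 1 & 2 \\ 2 & 2+\varepsilon & 0 \\ 4 & 14 & 8 \end{bmatrix}.$$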
The original matrix A and the reconstructed matrix à differ in the bottom-right
elements. Let’s see how well our naïve function gaussian_elimination does in
solving Ax = b when b = (−5, 10, 0).
Cholesky decomposition 35
The function fails horribly, returning a solution of about (1, 4, −5) instead of the
correct answer (5, 0, −5). The numerical instability results from dividing by a
small pivot. We can avoid it by permuting the rows or columns at each iteration
so that a large number is in the pivot position.
In partial pivoting, we permute the rows of the matrix to put the largest
element in the pivot position 𝑎ᵢᵢ. At step 𝑖 of the LU factorization, we find
𝑟 = arg max_{𝑘 ≥ 𝑖} |𝑎_{𝑘𝑖}| and interchange row 𝑖 and row 𝑟. To permute the rows,
we left-multiply by a permutation matrix P that keeps track of all the row
interchanges, giving PA = LU. As a result of pivoting, the magnitudes of all of the
values of matrix L are less than or equal to one.
In complete pivoting, we permute both the rows and the columns to put the
maximum element in the pivot position 𝑎 𝑖𝑖 . With each step of LU factorization,
find 𝑟, 𝑐 = arg max 𝑘,𝑙 ≥𝑖 |𝑎 𝑘𝑙 |, and interchange row 𝑖 and row 𝑟 and interchange
column 𝑖 and column 𝑐. To interchange rows, we left-multiply by a permutation
matrix P, and to interchange columns, we right-multiply by a permutation matrix
Q, i.e., PAQ = LU. Applied to the problem Ax = b, complete pivoting yields
LU(Q⁻¹x) = Pb. Generally, complete pivoting is unnecessary, and partial
pivoting is sufficient.
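The following Julia sketch revisits the 𝜀 example with and without partial pivoting. The helper `solve_without_pivoting` is an assumption written for this illustration (a compact restatement of naive elimination), while `lu` is the LinearAlgebra standard-library factorization, which pivots by rows and stores the permutation in the field `p`.

```julia
using LinearAlgebra

# Gaussian elimination without pivoting (illustrative only).
function solve_without_pivoting(A, b)
    A, b = float(copy(A)), float(copy(b))
    n = size(A, 1)
    for j in 1:n-1, i in j+1:n
        A[i, j] /= A[j, j]
        A[i, j+1:n] .-= A[i, j] .* A[j, j+1:n]
    end
    for i in 2:n
        b[i] -= dot(A[i, 1:i-1], b[1:i-1])
    end
    for i in n:-1:1
        b[i] = (b[i] - dot(A[i, i+1:n], b[i+1:n])) / A[i, i]
    end
    return b
end

ε = 1e-15
A = [1 1 2; 2 2+ε 0; 4 14 4]
b = [-5.0, 10.0, 0.0]

x_naive = solve_without_pivoting(A, b)   # divides by the tiny pivot ε
F = lu(A)                                # partial pivoting: A[F.p, :] = F.L * F.U
x_pivot = F \ b
```

Comparing `x_naive` and `x_pivot` with the exact solution (5, 0, −5) shows how much accuracy the small pivot costs. The factor `F.L` has no entry larger than one in magnitude, as the text states.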
Proof. The proof has several steps. First, we show that the inverse of a unit lower
triangular matrix is a unit lower triangular matrix. It is easy to confirm the block
matrix identity
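$$\begin{bmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{C} & \mathbf{D} \end{bmatrix}^{-1}
= \begin{bmatrix} \mathbf{A}^{-1} & \mathbf{0} \\ -\mathbf{D}^{-1} \mathbf{C} \mathbf{A}^{-1} & \mathbf{D}^{-1} \end{bmatrix}.$$
The hypothesis is certainly true for the 1 × 1 matrix [1]. Suppose that the hypothesis
is true for an 𝑛 × 𝑛 unit lower triangular matrix. By letting D be such a matrix
and A = [1], the claim follows for an (𝑛 + 1) × (𝑛 + 1) matrix.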
Next, we show that if A is symmetric and invertible, then A has the decomposition
A = LDLᵀ, where L is a unit lower triangular matrix and D is a diagonal
matrix. A has the decomposition LDMᵀ, where M is a unit lower triangular
Note that M⁻¹AM⁻ᵀ is a symmetric matrix, and by theorem 7, the matrix M⁻¹LD
is lower triangular with diagonal D. Hence, M⁻¹AM⁻ᵀ must be the diagonal
matrix D. So,
$$\mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{M}^T = \mathbf{L}(\mathbf{M}^{-1}\mathbf{A}\mathbf{M}^{-T})\mathbf{M}^T = \mathbf{L}\mathbf{M}^{-1}\mathbf{A},$$
from which it follows that LM⁻¹ = I, or equivalently L = M.
Now, we show that if A is an 𝑛 × 𝑛 symmetric, positive-definite matrix and
X is an 𝑛 × 𝑘 matrix with rank(X) = 𝑘, then B = XᵀAX is symmetric, positive
definite. (We say that B is congruent to A if there exists an invertible X such that
B = XᵀAX.) Take x ∈ ℝᵏ. Then
$$\mathbf{x}^T \mathbf{B} \mathbf{x} = (\mathbf{X}\mathbf{x})^T \mathbf{A} (\mathbf{X}\mathbf{x}) \ge 0$$
Starting with 𝑎₁₁:
$$r_{11} r_{11} = a_{11} \quad\text{from which}\quad r_{11} = \sqrt{a_{11}}$$
•              • ◦ ◦ ◦       × × × ×
◦ ◦              ◦ ◦ ◦       × × × ×
◦ ◦ ◦              ◦ ◦   =   × × × ×
◦ ◦ ◦ ◦              ◦       × × × ×
where • is now known and can be used in subsequent steps and ◦ are still
unknown. Once we have 𝑟₁₁, we can find all other elements of the first column:
$$r_{11} r_{1i} = a_{1i} \quad\text{from which}\quad r_{1i} = a_{1i} / r_{11}$$
Linear programming and the simplex method 37
•              • • • •       × × × ×
• ◦              ◦ ◦ ◦       × × × ×
• ◦ ◦              ◦ ◦   =   × × × ×
• ◦ ◦ ◦              ◦       × × × ×
We continue for each of the remaining 𝑛 − 1 columns of R by first finding the 𝑖th
diagonal element:
$$\sum_{k=1}^{i} r_{ki} r_{ki} = a_{ii} \quad\text{from which}\quad r_{ii} = \Bigl( a_{ii} - \sum_{k=1}^{i-1} r_{ki}^2 \Bigr)^{1/2}$$
•              • • • •       × × × ×
• •              • ◦ ◦       × × × ×
• ◦ ◦              ◦ ◦   =   × × × ×
• ◦ ◦ ◦              ◦       × × × ×
Once we have the diagonal, we fill in the remaining elements from that column:
$$\sum_{k=1}^{i} r_{kj} r_{ki} = a_{ji} \quad\text{from which}\quad r_{ij} = \frac{1}{r_{ii}} \Bigl( a_{ji} - \sum_{k=1}^{i-1} r_{kj} r_{ki} \Bigr)$$
•              • • • •       × × × ×
• •              • • •       × × × ×
• • ◦              ◦ ◦   =   × × × ×
• • ◦ ◦              ◦       × × × ×
We can write this algorithm as the following pseudocode:
for 𝑖 = 1, . . . , 𝑛
    r_ii ← ( a_ii − \sum_{k=1}^{i-1} r_ki^2 )^{1/2}
    for 𝑗 = 𝑖 + 1, . . . , 𝑛
        r_ij ← ( a_ji − \sum_{k=1}^{i-1} r_kj r_ki ) / r_ii
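A direct Julia transcription of this procedure might look like the sketch below. It assumes the convention A = RᵀR with R upper triangular and that A is symmetric positive definite; the function name `cholesky_rows` and the 3 × 3 test matrix are assumptions made for the illustration, and no error checking is done.

```julia
using LinearAlgebra

# Cholesky factorization A = R'R, filling R one row at a time.
function cholesky_rows(A)
    n = size(A, 1)
    R = zeros(n, n)
    for i in 1:n
        # diagonal element from the entries already known above it
        R[i, i] = sqrt(A[i, i] - dot(R[1:i-1, i], R[1:i-1, i]))
        # remaining elements to the right of the diagonal
        for j in i+1:n
            R[i, j] = (A[i, j] - dot(R[1:i-1, i], R[1:i-1, j])) / R[i, i]
        end
    end
    return R
end

A = [4.0 2 2; 2 5 3; 2 3 6]
R = cholesky_rows(A)
```

The result can be compared against `cholesky(A).U` from the LinearAlgebra standard library, which returns the same upper triangular factor.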
solving a system of inequalities. Just a few years after its introduction, physicist Phillip M. Morse
lamented, “it seems to me that the term ‘linear programming’ is a most unfortunate phrase for this
promising technique, particularly since many possible extensions appear to be in nonlinear directions.
A more general yet more descriptive term, such as ‘bounded optimization,’ might have been a happier
Figure 2.1: A two-dimensional simplex. (a) The objective function reaches its
maximum and minimum at vertices of the simplex. (b) The simplex method
steps along edges from vertex to vertex in a direction that increases the objective
function, just as an ant might crawl, until it finally reaches a maximum.
he then interpreted as the weight of food minus the weight of water content. He
collected data on over 500 different foods, punched onto cards and fed into an
IBM 701 computer. Dantzig recalled that the optimal diet was “a bit weird but
conceivable,” except that it also included several hundred gallons of vinegar.
Reexamining the data, he noted that vinegar was listed as a weak acid with zero
water content. Fixing the data, he tried to solve the problem again. This time it
called for hundreds of bouillon cubes per day. No one at the time had thought
to put upper limits on sodium intake. So, he added an upper constraint for salt.
But now, the program demanded two pounds of bran each day. And when he
added an upper constraint on bran, it prescribed an equal amount of blackstrap
molasses. At this point, he gave up. (Dantzig [1990])
constraints are not inconsistent, the set Ax ≤ b carves out a convex 𝑛-polytope,
also known as a simplex, in the positive orthant (a convex polygon in the upper
right quadrant in two dimensions). The objective function cT x is itself easy
enough to describe. The objective function increases linearly in the direction
of its gradient c. Imagine water slowly filling a convex polyhedral vessel in the
direction opposite of a gravitational force vector c. The water’s surface is a level
set given by 𝑧 = cT x. As the water finally fills the vessel, its surface will either
come to a vertex of the polyhedron (a unique solution) or to a face or edge of the
polyhedron (infinitely many solutions). This characterization is often called the
fundamental theorem of linear programming.
Theorem 8. The maxima of a linear functional over a convex polytope occur at
its vertices. If the maximum is attained at 𝑘 vertices, then it is attained everywhere
along the 𝑘-cell between them.
The diet problem that George Stigler solved had 15 decision variables along
with 9 constraints. So, there are as many as $\binom{15}{9} = 5005$ possible vertices from
which to choose a solution. That's not so many that the problem couldn't be solved
using clever heuristics. But the original problem with 77 decision variables has
as many as $\binom{77}{9}$, or over 160 billion, possibilities. How can we systematically pick
the solution among all of these possibilities? George Dantzig’s simplex solution
was to pick any vertex and examine the edges leading away from that vertex,
choosing any edge along which the cost variable was positive. If the edge is
finite, then it connects to another vertex. Here you examine its edges and choose
yet another one along which the cost variable is positive. Now continue like that,
traversing from vertex to vertex along edges much as an ant might do as it climbs
the simplex until arriving at a vertex for which no direction is strictly increasing.
This vertex is the maximum. See Figure 2.1.
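The vertex counts quoted above are binomial coefficients, which can be checked with Julia's built-in `binomial` function. The variable names are assumptions for this illustration.

```julia
# Upper bounds on the number of vertices a brute-force search would examine.
stigler_vertices  = binomial(15, 9)   # 15 decision variables, 9 constraints
original_vertices = binomial(77, 9)   # 77 decision variables, 9 constraints
```

The jump from thousands to hundreds of billions is why walking from vertex to vertex beats enumerating them.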
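$$\begin{bmatrix} \mathbf{A} & \mathbf{I} & \mathbf{0} \\ \mathbf{c}^T & \mathbf{0}^T & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{x} \\ \mathbf{s} \\ -z \end{bmatrix}
= \begin{bmatrix} \mathbf{b} \\ 0 \end{bmatrix}
\quad\text{where } \mathbf{x}, \mathbf{s} \ge 0.$$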
The matrix I is a diagonal matrix of {+1, −1} depending on whether the initial
constraint was a “less than” inequality (leading to a slack variable) or a “greater
than” inequality (leading to a surplus variable). Equality constraints lead to an
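equation with a slack variable and an equation with a surplus variable. Consider
the equivalent system
$$\begin{bmatrix} \bar{\mathbf{A}} & \mathbf{0} \\ \bar{\mathbf{c}}^T & 1 \end{bmatrix}
\begin{bmatrix} \bar{\mathbf{x}} \\ -z \end{bmatrix}
= \begin{bmatrix} \mathbf{b} \\ 0 \end{bmatrix}
\quad\text{where } \bar{\mathbf{A}} = \begin{bmatrix} \mathbf{A} & \mathbf{I} \end{bmatrix},\
\bar{\mathbf{x}} = \begin{bmatrix} \mathbf{x} \\ \mathbf{s} \end{bmatrix},\
\bar{\mathbf{c}} = \begin{bmatrix} \mathbf{c} \\ \mathbf{0} \end{bmatrix},\
\text{and } \bar{\mathbf{x}} \ge 0.$$
$$\left[\begin{array}{cc|c} \bar{\mathbf{A}} & \mathbf{0} & \mathbf{b} \\ \bar{\mathbf{c}}^T & 1 & 0 \end{array}\right]$$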
The matrix Ā has 𝑚 rows and 𝑚 + 𝑛 columns, so its rank is at most 𝑚. We'll
assume that the rank is 𝑚 to simplify the discussion. Otherwise, we need to
consider a few extra steps. We can choose 𝑚 linearly independent columns
as the basis for the column space of Ā. Let AB be the submatrix formed by
these columns. The variables xB associated with these columns are called basic
variables. The remaining 𝑛 columns form a submatrix AN , and the variables
xN associated with it are called the nonbasic variables. Altogether, we have
Āx̄ = AB xB + AN xN = b. By setting all nonbasic variables to zero, the solution
is simply given by the basic variables. Such a solution is called a basic feasible
function row_reduce!(tableau)
  (i,j) = get_pivot(tableau)
  # scale the pivot row, then eliminate the pivot column from every row
  G = tableau[i:i,:]/tableau[i,j]
  tableau[:,:] -= tableau[:,j:j]*G
  tableau[i,:] = G
end
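function get_pivot(tableau)
  # pivot column: largest entry in the bottom row
  j = argmax(tableau[end,1:end-1])
  a, b = tableau[1:end-1,j], tableau[1:end-1,end]
  # pivot row: smallest ratio b/a among the positive entries of the column
  k = findall(a.>0)
  i = k[argmin(b[k]./a[k])]
  return(i,j)
end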
to the while loop. The initial tableau consists of six blocks A, I, b, cᵀ, 0 and 0:
2.00 0.00 1.00 1.00 0.00 0.00 3.00
4.00 1.00 2.00 0.00 1.00 0.00 2.00
1.00 1.00 0.00 0.00 0.00 1.00 1.00
2.00 1.00 1.00 0.00 0.00 0.00 0.00
The simplex algorithm scans the bottom row for the largest value and chooses
that column as pivot column. It next determines the pivot row by selecting the
largest positive element in that column relative to the matching element in the
last column.
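To see the pivoting rule in action, the sketch below runs it to completion on the tableau above. The two helper functions restate the ones in the text so that the block is self-contained. The driver `simplex_sketch` is a hypothetical wrapper written for this illustration, not the author's listing, and it does not handle unbounded or degenerate problems.

```julia
# Pivot column: largest entry of the bottom row; pivot row: ratio test.
function get_pivot(tableau)
    j = argmax(tableau[end, 1:end-1])
    a, b = tableau[1:end-1, j], tableau[1:end-1, end]
    k = findall(a .> 0)
    i = k[argmin(b[k] ./ a[k])]
    return (i, j)
end

function row_reduce!(tableau)
    (i, j) = get_pivot(tableau)
    G = tableau[i:i, :] / tableau[i, j]
    tableau[:, :] -= tableau[:, j:j] * G
    tableau[i, :] = vec(G)
end

# Hypothetical driver: build the tableau from the blocks A, I, b, c', 0, 0
# and pivot until no entry of the bottom row is positive.
function simplex_sketch(c, A, b)
    m, n = size(A)
    tableau = zeros(m + 1, n + m + 1)
    tableau[1:m, 1:n] = A
    for i in 1:m
        tableau[i, n+i] = 1          # identity block for the slack variables
    end
    tableau[1:m, end] = b
    tableau[end, 1:n] = c
    while any(tableau[end, 1:end-1] .> 0)
        row_reduce!(tableau)
    end
    return tableau
end

A = [2.0 0 1; 4 1 2; 1 1 0]
b = [3.0, 2.0, 1.0]
c = [2.0, 1.0, 1.0]
T = simplex_sketch(c, A, b)
z = -T[end, end]                     # value of the objective at the optimum
```

The bottom-right entry of the reduced tableau holds −𝑧, and the last column holds the values of the basic variables.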
The bottom row of the reduced tableau also gives us the solution to a related
problem. An LP problem has two related formulations—the primal problem and
the dual problem. Every variable of the primal LP is a constraint in the dual LP,
every constraint of the primal LP is a variable in the dual LP, and a maximum
objective function in the primal is a minimum in the dual (and vice versa). Until
now, we’ve only considered the primal problem:
The strong duality theorem states four possibilities: 1. both the primal and dual
are infeasible (no feasible solutions); 2. the primal is infeasible and the dual is
unbounded; 3. the dual is infeasible and the primal is unbounded; or 4. both the
primal and dual have feasible solutions and their values are the same. This means
that we can solve the dual problem instead of the primal problem.
James H. Wilkinson defines a sparse matrix as “any matrix with enough zeros
that it pays to take advantage of them” in terms of memory and computation
time. If we are smart about implementing Gaussian elimination, each zero in a
matrix means potentially many saved computations. So it is worthwhile to use
algorithms that maintain sparsity by preventing fill-in. In Chapter 5, we’ll look
at iterative methods for sparse matrices.
Suppose that we have the 4 × 6 full matrix
$$\mathbf{A} = \begin{bmatrix}
11 & 0 & 0 & 14 & 0 & 16 \\
0 & 22 & 0 & 0 & 25 & 26 \\
0 & 0 & 33 & 34 & 0 & 36 \\
41 & 0 & 43 & 44 & 0 & 46
\end{bmatrix}.$$
If the number of nonzero elements is small, we could store the matrix more
efficiently by just recording the locations and values of the nonzero elements.
For example,
column 1 1 2 3 3 4 4 4 5 6 6 6 6
row 1 4 2 3 4 1 3 4 2 1 2 3 4
value 11 41 22 33 43 14 34 44 25 16 26 36 46
To store this array in memory, we can go a step further and use compressed sparse
column (CSC) format. Rather than saving explicit column numbers, the compressed
column data structure saves only pointers to where each new column begins:
index 1 3 4 6 9 10 14
row 1 4 2 3 4 1 3 4 2 1 2 3 4
value 11 41 22 33 43 14 34 44 25 16 26 36 46
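Julia's SparseArrays standard library stores matrices in exactly this layout. The sketch below builds the example from its coordinate lists with `sparse` and then reads the three CSC arrays from the `colptr`, `rowval`, and `nzval` fields of the resulting `SparseMatrixCSC`. The variable names are assumptions for this illustration.

```julia
using SparseArrays

rows = [1, 4, 2, 3, 4, 1, 3, 4, 2, 1, 2, 3, 4]
cols = [1, 1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 6, 6]
vals = [11, 41, 22, 33, 43, 14, 34, 44, 25, 16, 26, 36, 46]

A = sparse(rows, cols, vals, 4, 6)   # stored as a SparseMatrixCSC

index = A.colptr    # where each column starts in the row and value arrays
row   = A.rowval    # row number of each stored element
value = A.nzval     # value of each stored element
```

The pointer array has one more entry than the number of columns. Its last entry marks one position past the final stored element.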
Similarly, the compressed sparse row (CSR) format indexes arrays by rows instead
of columns. A CSC format is most efficient for languages like Julia, Matlab, R,
and Fortran, which use a column-major order convention to store data. A CSR
format is most efficient for languages like Python and C, which use a row-major
order convention.
SparseArrays.jl includes several utilities for constructing sparse matrices.
Julia’s backslash and forward slash operators use UMFPACK (pronounced “Umph
Pack”) to solve nonsymmetric, sparse systems directly.
3 Euler published his paper in 1741. A hundred years later, Königsberg native Gustav Kirchhoff
published his eponymous electrical circuits laws. One can wonder if walking along the bridges
inspired him. Königsberg is now Kaliningrad, Russia. Only two of the original bridges remain today.
One was completely rebuilt in 1935, two were destroyed during the bombing of Königsberg in World
War II, and two others were later demolished to make way for a modern highway.
Sparse matrices 47
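[Figure: the seven bridges of Königsberg, labeled a through g, joining the land masses A, B, C, and D, alongside the equivalent graph with vertices A, B, C, and D and edges a through g.]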
Euler found the solution by replacing the islands and mainland river banks with
points called vertices or nodes and the bridges with line segments called edges
that linked the vertices together. By counting the degree of each vertex—that is,
the number of edges associated with each vertex—Euler was able to confidently
answer “no.” A path or walk is a sequence of edges that join a sequence of
vertices. An Eulerian path is a path that visits every edge exactly once, the
type of path that the Königsberg bridge puzzle sought to find. An Eulerian path
can have at most two vertices with odd degree: one for the start of the walk
and one for the end of it. The vertices of the Königsberg graph have degrees
5, 3, 3, and 3. Because all four degrees are odd, no such path exists. This puzzle
laid the foundation for the mathematical discipline
called graph theory, which examines mathematical structures that model pairwise
relationships between objects.
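Euler's degree count is easy to reproduce. In the sketch below, the multigraph is encoded as a matrix whose (𝑖, 𝑗) entry counts the bridges joining land masses 𝑖 and 𝑗. The ordering of the vertices (the central island first) is an assumption of this illustration.

```julia
# Bridge counts between the four land masses of Königsberg.
M = [0 2 2 1;
     2 0 0 1;
     2 0 0 1;
     1 1 1 0]

degrees = vec(sum(M, dims=2))            # number of bridges touching each vertex
odd_vertices = count(isodd, degrees)
has_eulerian_path = odd_vertices <= 2    # necessary condition from the text
```

All four degrees are odd, so the test fails and no walk crosses every bridge exactly once.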
Nowadays, we use graph theory to describe any number of things. When
describing the internet, vertices are websites, and edges are hyperlinks connecting
those websites together. In social networks, vertices are people, and edges are
the social bonds between pairs of individuals. In an electrical circuit, vertices are
junctions, and edges are resistors, inductors, capacitors, and other components.
We can use sparse matrices to analyze graph structure. Similarly, we can
use graphs to visualize the structure of sparse matrices. An adjacency matrix
A is a symmetric (0, 1)-matrix indicating whether or not pairs of vertices are
adjacent. The square of an adjacency matrix A² tells us the number of paths of
length two joining a particular vertex with another particular one. The 𝑘th power
of an adjacency matrix Aᵏ tells us the number of paths of length 𝑘 joining a
particular vertex with another particular one. A graph is said to be connected if a
path exists between every two vertices. Otherwise, the graph consists of multiple
components, the sets of all vertices connected by paths and connected to no other
vertices of the graph.
One way to measure the connectivity of a graph is by examining the Laplacian
matrix of a graph. The Laplacian matrix (sometimes called the graph Laplacian
or Kirchhoff matrix) for a graph is L = D − A, where the degree matrix D is a
diagonal matrix that counts the degrees of each vertex and A is the adjacency
matrix. The matrix L is symmetric and positive semidefinite. Because L is
symmetric, its eigenvectors are all orthogonal. The row and column sums of L
are zero. It follows that the vector of ones is in the nullspace, i.e., the vector of
ones is an eigenvector. Finally, the number of connected components equals the
dimension of the nullspace, i.e., the multiplicity of the zero eigenvalue.
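These properties can be checked numerically on a small graph. The sketch below uses a hypothetical five-vertex graph with two components (the path 1-2-3 and the single edge 4-5). It builds the Laplacian and then counts the zero eigenvalues with `eigvals` from the LinearAlgebra standard library. The tolerance `1e-10` is an arbitrary choice.

```julia
using LinearAlgebra

# Adjacency matrix of a graph with two components: 1-2-3 and 4-5.
A = [0 1 0 0 0;
     1 0 1 0 0;
     0 1 0 0 0;
     0 0 0 0 1;
     0 0 0 1 0]

D = Diagonal(vec(sum(A, dims=2)))   # degree matrix
L = D - A                           # graph Laplacian

walks2 = (A^2)[1, 3]                # number of length-two paths from vertex 1 to 3
λ = eigvals(Symmetric(float(L)))
components = count(x -> abs(x) < 1e-10, λ)
```

The row sums of `L` vanish, its eigenvalues are nonnegative, and the zero eigenvalue appears once per connected component.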
As its name might suggest, the Laplacian matrix also has a physical inter-
pretation. Imagine an initial heat distribution u across a graph where 𝑢 𝑖 is
the temperature at vertex 𝑖. Suppose that heat moves from vertex 𝑖 to vertex 𝑗
proportional to 𝑢 𝑖 − 𝑢 𝑗 if the two vertices share an edge. We can model the heat
flow in the network as
$$\frac{d\mathbf{u}}{dt} = -(\mathbf{D} - \mathbf{A})\mathbf{u} = -\mathbf{L}\mathbf{u},$$
which is simply the heat equation if we replace −L with the Laplacian Δ. We
can write the solution u(𝑡) using the eigenvector basis of L. Namely,
$$\mathbf{u}(t) = \sum_{i=1}^{n} c_i(t) \mathbf{v}_i$$
for eigenvectors vᵢ, where 𝑐ᵢ(0) is the spectral projection of the initial state u(0).
The solution to the decoupled system is 𝑐ᵢ(𝑡) = 𝑐ᵢ(0)e^{−𝜆ᵢ𝑡}, where the eigenvalues
of L are 0 = 𝜆₁ ≤ 𝜆₂ ≤ · · · ≤ 𝜆ₙ. What happens to the solution as 𝑡 → ∞? If the
graph is connected, the equilibrium solution is given by the eigenvector
associated with the zero eigenvalue, i.e., the null space of L. In other words, the
distribution across the vertices converges to the average value of the initial state
across all vertices—what we should expect from the heat equation.4
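The spectral solution of the graph heat equation takes only a few lines of Julia. The sketch below uses a path graph on four vertices with all of the heat starting at the first vertex. Both choices are assumptions made for the illustration. The function `eigen` from the LinearAlgebra standard library returns the eigenvalues in ascending order along with orthonormal eigenvectors.

```julia
using LinearAlgebra

# Laplacian of the path graph 1-2-3-4.
A = [0 1 0 0; 1 0 1 0; 0 1 0 1; 0 0 1 0]
L = Diagonal(vec(sum(A, dims=2))) - A

λ, V = eigen(Symmetric(float(L)))    # eigenvalues and eigenvectors of L
u0 = [1.0, 0.0, 0.0, 0.0]            # initial heat distribution
c0 = V' * u0                         # spectral projection of u(0)

# u(t) = Σ c_i(0) exp(-λ_i t) v_i
u(t) = V * (exp.(-λ .* t) .* c0)
```

Evaluating `u` at a large time gives a nearly uniform vector whose entries all equal the average of `u0`, while the total heat stays constant for every 𝑡.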
In moving from a perturbed initial state to the equilibrium state, the large-
eigenvalue components decay rapidly and the small-eigenvalue components
linger. The zero eigenvector is the smoothest and corresponds to all constants.
The next eigenvector is the next smoothest, and so on. The solution can be
approximated using the components of the smallest nonzero eigenvalues 𝜆₂ and 𝜆₃
near equilibrium. This approximation gives us a way to draw the graph in two
dimensions in a meaningful way: compute the two smallest nonzero eigenvalues of
the graph Laplacian and the unit eigenvectors associated with these eigenvalues.
Place vertex 𝑖 at the coordinates given by the 𝑖th component of v₂ and the 𝑖th
component of v₃. This
approach is called spectral layout drawing. To dig deeper, see Fan Chung and
Ronald Graham’s Spectral Graph Theory.
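A spectral layout can be sketched with the same eigendecomposition. The example below uses a cycle graph on six vertices, an arbitrary test graph chosen for this illustration, and takes the second and third eigenvectors of its Laplacian as the horizontal and vertical coordinates.

```julia
using LinearAlgebra

# Adjacency matrix of a cycle graph on six vertices.
n = 6
A = zeros(Int, n, n)
for i in 1:n
    j = mod1(i + 1, n)
    A[i, j] = A[j, i] = 1
end
L = Diagonal(vec(sum(A, dims=2))) - A

λ, V = eigen(Symmetric(float(L)))   # eigenvalues in ascending order
xy = V[:, 2:3]                      # row i holds the coordinates of vertex i
```

Because these eigenvectors are orthogonal to the vector of ones, the drawing is automatically centered at the origin. Passing the two columns of `xy` to any scatter-plot routine and joining adjacent vertices produces the picture.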
Another popular method for drawing graphs also uses an equilibrium solution
to a physical problem. The force-directed graph drawing (developed by Thomas
Fruchterman and Edward Reingold in 1991) models edges as springs that obey
Hooke’s law and vertices as electrical charges that obey Coulomb’s law. The
force at each node is given by 𝑓ᵢⱼ = 𝑎ᵢⱼ + 𝑟ᵢⱼ, where
$$a_{ij} = k^{-1} \|\mathbf{x}_i - \mathbf{x}_j\| \quad\text{if } i \text{ connects to } j,
\qquad
r_{ij} = -\alpha k \|\mathbf{x}_i - \mathbf{x}_j\|^{-2} \quad\text{if } i \ne j,$$
4 Alternatively, we can imagine the graph as a set of nodes linked together by springs. In this
case, the governing equation is the wave equation u𝑡𝑡 = −Lu. The solution that minimizes the
Dirichlet energy is the one that minimizes oscillations making the solution as smooth as possible.
and the parameters 𝑘 and 𝛼 adjust the optimal distance between vertices. To keep
the graph to a two-dimensional unit square, we might choose the parameter 𝑘 to
be one over the square root of the number of vertices. Because we are simply
trying to get a visually pleasing graph with separated vertices, we don’t need to
solve the problem precisely. We start with random placement of the nodes. Then
we iterate, updating the position of each vertex sequentially
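$$\mathbf{x}_i \leftarrow \mathbf{x}_i + \Delta t \sum_{j \ne i} f_{ij} \frac{\mathbf{x}_j - \mathbf{x}_i}{\|\mathbf{x}_j - \mathbf{x}_i\|},$$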
where the step size 𝛥𝑡 is chosen sufficiently small to ensure stability. Because the
force on each vertex is affected by the position of every other vertex and because
it may potentially take hundreds of iterations to reach the equilibrium solution,
force-directed graph drawing may require significant computation time.
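The iteration can be sketched in a few lines of Julia. This is a rough illustration under stated assumptions rather than a tuned implementation: the function name `force_sweep!`, the default parameters, the number of sweeps, and the sign convention (springs pull a vertex toward its neighbors, charges push it away from every other vertex) are all choices made for this example.

```julia
using LinearAlgebra, Random

# One sweep of sequential position updates. X is 2×n with one column per
# vertex, and A is the adjacency matrix of the graph.
function force_sweep!(X, A; k = 1 / sqrt(size(A, 1)), α = 1.0, Δt = 0.01)
    n = size(A, 1)
    for i in 1:n, j in 1:n
        i == j && continue
        d = X[:, j] - X[:, i]            # points from vertex i toward vertex j
        r = norm(d)
        f = -α * k / r^2                 # Coulomb-like repulsion
        A[i, j] != 0 && (f += r / k)     # Hooke-like attraction along edges
        X[:, i] += Δt * f * d / r
    end
    return X
end

Random.seed!(1)
A = [0 1 0 0; 1 0 1 0; 0 1 0 1; 0 0 1 0]    # path graph 1-2-3-4
X = rand(2, 4)                               # random initial placement
for sweep in 1:200
    force_sweep!(X, A)
end
```

Every sweep visits all pairs of vertices, so the cost per iteration grows quadratically with the number of vertices, which is the expense the text warns about.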
The Graphs.jl and GraphPlot.jl libraries provide utilities for constructing, analyzing,
and visualizing graphs.
5 The collection is a curated set of around 3000 sparse matrices that occur in applications such
as computational fluid dynamics, financial modeling, and structural engineering. SuiteSparse also
maintains the software library of sparse matrix routines such as UMFPACK, SSMULT, and SPQR
used in Julia, Python, and Matlab.
Figure 2.2: Graph drawing layouts for the bottlenose dolphin network.
Preserving sparsity
[Sparsity pattern: an arrow matrix with nonzero elements only in the first row, in the first column, and along the diagonal]
has the LU decomposition
[Sparsity pattern: L with a completely filled lower triangle and U with a completely filled upper triangle]
with a tremendous fill-in. But, if we reorder the rows and columns first by moving
the first row to the bottom and the first column to the end, then
[Sparsity pattern: an arrow matrix with nonzero elements only along the diagonal, in the last row, and in the last column]
has the LU decomposition
[Sparsity pattern: L with nonzero elements only along the diagonal and in the last row, and U with nonzero elements only along the diagonal and in the last column].
Figure 2.3: The original ordering and the reverse Cuthill–McKee ordering of the
sparsity plot for the Laplacian operator in a finite element solution for a pacman
domain. The Cholesky decompositions for each ordering are directly below.
In the 𝑛 × 𝑛 case, the original matrix requires 𝑂(𝑛³) operations to compute the LU
decomposition plus another 𝑂(𝑛²) operations to solve the problem. Furthermore,
we need to store all 𝑛² terms of the LU decomposition. The reordered matrix, on
the other hand, suffers from no fill-in. It only needs 4𝑛 operations (additions and
multiplications) for the LU decomposition and 8𝑛 operations to solve the system.
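The effect of the reordering on fill-in can be measured directly. The sketch below builds a 6 × 6 arrow matrix, factors it in place without pivoting, and counts the nonzero entries of the combined L and U factors for both orderings. The function name `lu_inplace!`, the matrix size, and the choice of a dominant diagonal (so that no pivot vanishes) are assumptions of this illustration.

```julia
# In-place LU without pivoting (U on and above the diagonal, L below it).
function lu_inplace!(A)
    n = size(A, 1)
    for j in 1:n-1, i in j+1:n
        A[i, j] /= A[j, j]
        for k in j+1:n
            A[i, k] -= A[i, j] * A[j, k]
        end
    end
    return A
end

n = 6
A = zeros(n, n)
A[1, :] .= 1
A[:, 1] .= 1
for i in 1:n
    A[i, i] = n              # dominant diagonal so that no pivot vanishes
end

p = [2:n; 1]                 # move the first row and column to the end
fill_original  = count(!iszero, lu_inplace!(copy(A)))
fill_reordered = count(!iszero, lu_inplace!(A[p, p]))
```

The original ordering fills all 𝑛² positions, while the reordered matrix keeps the 3𝑛 − 2 nonzeros it started with.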
We can lower the bandwidth of a sparse matrix and reduce the fill-in by
reordering the rows and columns to get the nonzero elements as close to the
diagonal as possible. The Cuthill–McKee algorithm does this for symmetric
matrices by using a breadth-first search to traverse the associated graph.6 The
algorithm builds a permutation sequence starting with an empty set:
6 Dijkstra’s algorithm also uses breadth-first search—this time to determine the shortest path
between nodes on a graph. Imagine that nodes represent intersections on a street map and the
weighted edges are the travel times between intersections. Then, Dijkstra’s algorithm can help route
1. Choose a vertex with the lowest degree and add it to the permutation set.
2. For each vertex added to the permutation set, progressively add its adjacent
vertices from lowest degree to highest degree, skipping any vertex already added.
3. If the graph is connected, repeating step 2 will eventually run through
all vertices—at which point we’re done. If the graph is disconnected,
repeating step 2 will terminate without reaching all of the vertices. In this
case, we’ll need to go back to step 1 with the remaining vertices.
Consider the 12 × 12 matrix in the figure on the facing page. We start the
Cuthill–McKee algorithm by selecting a column of the matrix with the fewest
off-diagonal nonzero elements. Column 5 has one nonzero element in row 10,
so we start with that one. We now find the uncounted nonzero elements in
column 10: rows 3 and 12. We repeat the process, looking at column 3 and
skipping any vertices we’ve already counted. We continue in this fashion until
we’ve run through all the columns to build a tree. To get the permutation, start
with the root node and subsequently collect vertices at each lower level to get
{5, 10, 12, 3, 1, 7, 8, 9, 2, 4, 11, 6}. In practice, one typically uses the reverse
Cuthill–McKee algorithm, which simply reverses the ordering of the relabeled
vertices {6, 11, 4, 2, 9, 8, 7, 1, 3, 12, 10, 5}. The new graph is isomorphic to the
original graph—the labels have changed but nothing else. Relabeling the vertices
of the graph is identical to permuting the rows and columns of the symmetric
matrix. Other sorting methods include minimum degree and nested dissection.
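The three steps of the algorithm can be sketched in Julia as follows. This is a minimal version written for illustration, not the routine from any of the packages mentioned: the function names `cuthill_mckee` and `bandwidth`, the tie-breaking by vertex number, and the five-vertex test graph (a path whose vertices have been labeled out of order) are all assumptions.

```julia
# Cuthill–McKee ordering of a symmetric adjacency matrix using
# breadth-first search. Ties in degree are broken by vertex number.
function cuthill_mckee(A)
    n = size(A, 1)
    neighbors(v) = [w for w in 1:n if w != v && A[v, w] != 0]
    degree = [length(neighbors(v)) for v in 1:n]
    perm, visited = Int[], falses(n)
    while length(perm) < n
        # Step 1: start from an unvisited vertex of lowest degree.
        start = argmin([visited[v] ? n + 1 : degree[v] for v in 1:n])
        push!(perm, start); visited[start] = true
        head = length(perm)
        # Step 2: append unvisited neighbors in order of increasing degree.
        while head <= length(perm)
            nbrs = filter(w -> !visited[w], neighbors(perm[head]))
            for w in sort(nbrs, by = w -> degree[w])
                push!(perm, w); visited[w] = true
            end
            head += 1
        end
        # Step 3: the outer loop restarts if the graph is disconnected.
    end
    return perm
end

bandwidth(A) = maximum(abs(i - j) for i in axes(A, 1), j in axes(A, 2) if A[i, j] != 0)

# A path graph whose vertices are labeled out of order: 1-4-2-5-3.
A = zeros(Int, 5, 5)
for (i, j) in [(1, 4), (4, 2), (2, 5), (5, 3)]
    A[i, j] = A[j, i] = 1
end
perm = cuthill_mckee(A)
rcm = reverse(perm)          # reverse Cuthill–McKee ordering
```

Permuting with `A[perm, perm]` or `A[rcm, rcm]` relabels the vertices so that the matrix becomes tridiagonal.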
Finite element packages such as JuliaFEM.jl and FinEtools.jl include functions for
the Cuthill–McKee algorithm.
Figure 2.4: Cuthill–McKee sorting: (a) original sparsity pattern; (b) associated
graph; (c) completed breadth-first search tree; (d) reordered graph; (e) sparsity
pattern of column-and-row-permuted matrix.
At its simplest level, the simplex method involves sorting the basic variables
from the nonbasic variables. We can decompose the augmented matrix into basis
and nonbasis columns Ā = [A_N A_B] such that A_N x_N + A_B x_B = b. Although
we represent them as grouped together, the columns of A_N and the columns of
A_B may come from anywhere in Ā as long as they are mutually exclusive. As
with the regular simplex method, we'll again assume that Ā has full rank. In this
case, A_B is invertible and
$$\mathbf{A}_B^{-1} \mathbf{A}_N \mathbf{x}_N + \mathbf{x}_B = \mathbf{A}_B^{-1} \mathbf{b}.$$
At each iteration, we need to keep track of which variables are basic and which
ones are nonbasic, but it will save us quite a bit of computing time if we also
keep track of A_B⁻¹.
We start with A_N = A and A_B = I. At each iteration, we choose a new
entering variable, associated with the pivot column, by looking for any (the
first) positive element among the reduced cost variables c_Nᵀ − c_Bᵀ A_B⁻¹ A_N. Let q
be that 𝑗th column of A_N and q̂ be the 𝑗th column of A_B⁻¹ A_N, i.e., q̂ = A_B⁻¹ A_N 𝝃ⱼ,
where 𝝃ⱼ is the 𝑗th standard basis vector. With our pivot column in hand, we
now look for a pivot row 𝑖 that will correspond to our leaving variable. We take
𝑖 = arg minᵢ 𝑥ᵢ/𝑞̂ᵢ with 𝑞̂ᵢ > 0 to ensure that x_B = A_B⁻¹ b remains nonnegative.
Let p be the 𝑖th column of A_B, the column associated with the leaving variable.
After swapping columns p and q between A_B and A_N, we'll need to update
A_B⁻¹. Fortunately, we don't need to recompute the inverse entirely. Instead, we
can use the Sherman–Morrison formula, which gives a rank-one perturbation of
the inverse of a matrix:
$$(\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1} \mathbf{u} \mathbf{v}^T \mathbf{A}^{-1}}{1 + \mathbf{v}^T \mathbf{A}^{-1} \mathbf{u}}.$$
By taking u = q − p and v = 𝝃ᵢ, we have
$$(\mathbf{A}_B + (\mathbf{q} - \mathbf{p}) \boldsymbol{\xi}_i^T)^{-1} = \mathbf{A}_B^{-1} - \hat{q}_i^{-1} (\hat{\mathbf{q}} - \boldsymbol{\xi}_i) \boldsymbol{\xi}_i^T \mathbf{A}_B^{-1},$$
where q̂ = A_B⁻¹ q, 𝑞̂ᵢ = 𝝃ᵢᵀ A_B⁻¹ q, and A_B⁻¹ p = 𝝃ᵢ. Note that this is equivalent to
row reduction.
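The rank-one update can be verified numerically. In the sketch below, the 3 × 3 basis matrix `AB`, the entering column `q`, and the position `i` of the leaving column are arbitrary values chosen for this illustration. The updated inverse is compared against a freshly computed one.

```julia
using LinearAlgebra

AB = [4.0 1 0; 1 3 1; 0 1 2]       # current basis matrix (arbitrary example)
ABinv = inv(AB)
q = [1.0, 2.0, 3.0]                 # entering column (arbitrary example)
i = 2                               # position of the leaving column p
ξ = [0.0, 1.0, 0.0]                 # ith standard basis vector

qhat = ABinv * q
# Sherman–Morrison update with u = q - p and v = ξ
updated = ABinv - ((qhat - ξ) / qhat[i]) * (ξ' * ABinv)

AB_new = copy(AB)
AB_new[:, i] = q                    # swap column p for column q
```

The update costs a matrix-vector product and an outer product, which is much cheaper than recomputing `inv(AB_new)` from scratch when the basis is large.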
Let's implement the revised simplex method in Julia. We'll take ABinv to be
A_B⁻¹ and B and N to be the list of columns of Ā in A_B and A_N, respectively. We
define a method ξ(i) to return a unit vector. In practice, we can save and update
A_B⁻¹ b instead of simply A_B⁻¹ and stop as soon as we find the first 𝑗.
using SparseArrays, LinearAlgebra
function revised_simplex(c,A,b)
  (m,n) = size(A)
  ξ = i -> (z=spzeros(m);z[i]=1;z)
  N = Vector(1:n); B = Vector(n .+ (1:m))
  A = [A sparse(I, m, m)]
  ABinv = sparse(I, m, m)
  c = [c;spzeros(m,1)]
  while(true)
    # entering variable: first positive reduced cost
    j = findfirst(x->x>0,(c[N]'.-(c[B]'*ABinv)*A[:,N])[:])
    if isnothing(j); break; end
    q = ABinv*A[:,N[j]]
    # leaving variable: ratio test over the positive entries of q
    k = findall(q.>0)
    i = k[argmin(ABinv[k,:]*b./q[k])]
    B[i], N[j] = N[j], B[i]
    # Sherman–Morrison update of the basis inverse
    ABinv -= ((q - ξ(i))/q[i])*ABinv[i:i,:]
  end
  i = findall(B.≤n)
  x = zeros(n,1)
  x[B[i]] = ABinv[i,:]*b
  return((z=c[1:n]'*x, x=x, y=c[B]'*ABinv))
end
2.5 Exercises
2.7. The nematode Caenorhabditis elegans is the first and only organism whose
complete connectome, consisting of 277 interconnected neurons, has been
mapped. The adjacency matrix for the connectome is available as edges of a
bipartite graph celegans.csv from https://github.com/nmfsc/data (from Choe et al.
[2004]). Use the Cuthill–McKee algorithm to order the nodes to minimize the
bandwidth. See Reid and Scott's article “Reducing the total bandwidth of a
sparse unsymmetric matrix” for different approaches.
2.8. Use the simplex method to solve the Stigler diet problem to find the diet
that meets minimum nutritional requirements at the lowest cost. The data from
Stigler's 1945 paper is available as diet.csv from https://github.com/nmfsc/data
(nutritional values of foods are normalized per dollar of expenditure).
2.9. In 1929 Hungarian author Frigyes Karinthy hypothesized that any person
could be connected to any other person through a chain consisting of at most five
intermediary acquaintances, a hypothesis now called six degrees of separation.7
In a contemporary version of Karinthy’s game called “six degrees of Kevin
Bacon,” players attempt to trace an arbitrary actor to Kevin Bacon by identifying
a chain of movie co-appearances.8 The length of the shortest path is called the
Bacon number. Kevin Bacon’s Bacon number is zero. Anyone who has acted
in a movie with him has a Bacon number of one. Anyone who has acted with
someone who has acted with Kevin Bacon has a Bacon number of two. For
example, John Belushi starred in the movie Animal House with Kevin Bacon, so
his Bacon number is one. Danny DeVito never appeared in a movie with Kevin
Bacon. But he did appear in the movie Goin’ South with John Belushi, so his
Bacon number is two. The path connecting Danny DeVito to Kevin Bacon isn’t
unique. Danny DeVito also acted in The Jewel of the Nile with Holland Taylor,
who acted with Kevin Bacon in She’s Having a Baby.
Write a graph search algorithm to find the path connecting two arbitrary
actors. You can use the corpus of narrative American feature films from 1972 to
2012 to test your algorithm. The data, scraped from Wikipedia, is available at
https://github.com/nmfsc/data/. The file actors.txt is a list of the actors, the file
movies.txt is a list of the movies, and the CSV file actor-movie.csv identifies the
edges of the bipartite graph. Because Wikipedia content is crowd-sourced across
thousands of editors, there are inconsistencies across articles.
2.10. The CSC and CSR formats require some overhead to store a sparse matrix.
What is the break-even point in terms of the density?
7 In Karinthy’s short story “Láncszemek” (“Chains”), the author explains how he is two steps
removed from Nobel laureate Selma Lagerlöf and no more than four steps away from an anonymous
riveter at the Ford Motor Company.
8 Mathematicians sometimes play a nerdier game by tracking the chain of co-authors on research
Inconsistent Systems
Perhaps we want to solve a problem with noisy input data. The problem might be
curve fitting, image recognition, or statistical modeling. We may have sufficient
data, but they are inconsistent, so the problem is ill-posed. The solution is to
filter out the noise leaving us with consistent data. It is not difficult for us to
mentally filter out the noise in the figures on the next page and recognize the
letter on the left or the slope of the line on the right.
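$$\|\mathbf{A}(\tilde{\mathbf{x}} + \mathbf{y}) - \mathbf{b}\|_2^2 \ge \|\mathbf{A}\tilde{\mathbf{x}} - \mathbf{b}\|_2^2.$$
Expanding the left-hand side and canceling ‖Ax̃ − b‖₂² from both sides gives
$$\|\mathbf{A}\mathbf{y}\|_2^2 + 2\mathbf{y}^T \mathbf{A}^T \mathbf{A} \tilde{\mathbf{x}} - 2\mathbf{y}^T \mathbf{A}^T \mathbf{b} \ge 0,$$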
1 In statistics and machine learning applications, the 2-norm arises from the maximum likelihood
estimation using a Gaussian distribution exp(−𝑥 2 ). The 1-norm comes from maximum likelihood
estimation using the Laplace distribution exp(− |𝑥 |).
Overdetermined systems 59
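or equivalently
$$2\mathbf{y}^T (\mathbf{A}^T \mathbf{A} \tilde{\mathbf{x}} - \mathbf{A}^T \mathbf{b}) \ge -\|\mathbf{A}\mathbf{y}\|_2^2.$$
Because this expression holds for any y and because AᵀAx̃ − Aᵀb is independent
of y, it must follow that
$$\mathbf{A}^T \mathbf{A} \tilde{\mathbf{x}} - \mathbf{A}^T \mathbf{b} = \mathbf{0}.$$
From this, we get the normal equation
$$\mathbf{A}^T \mathbf{A} \tilde{\mathbf{x}} = \mathbf{A}^T \mathbf{b}.$$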
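Here's another way to get this equation. The residual ‖r(x̃)‖₂ is minimized
when the gradient of ‖r(x̃)‖₂ is zero, or equivalently when the gradient of ‖r(x̃)‖₂²
is zero. Taking the gradient of
$$\|\mathbf{r}(\tilde{\mathbf{x}})\|_2^2 = \|\mathbf{b} - \mathbf{A}\tilde{\mathbf{x}}\|_2^2 = \tilde{\mathbf{x}}^T \mathbf{A}^T \mathbf{A} \tilde{\mathbf{x}} - 2 \tilde{\mathbf{x}}^T \mathbf{A}^T \mathbf{b} + \mathbf{b}^T \mathbf{b}$$
gives us
$$\mathbf{0} = 2\mathbf{A}^T \mathbf{A} \tilde{\mathbf{x}} - 2\mathbf{A}^T \mathbf{b}.$$
Therefore, AᵀAx̃ = Aᵀb.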
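The normal equation can be tried on a small curve-fitting problem. The sketch below fits a line to four noisy points, an arbitrary data set invented for this illustration. It solves AᵀAx̃ = Aᵀb directly and compares the result with Julia's backslash operator, which solves rectangular systems in the least-squares sense.

```julia
using LinearAlgebra

# Fit a line c1 + c2*t to four points: an overdetermined 4×2 system.
A = [1.0 0; 1 1; 1 2; 1 3]
b = [0.1, 0.9, 2.1, 2.9]

x_normal = (A' * A) \ (A' * b)    # solve the normal equation
x_qr = A \ b                      # Julia's least-squares solve
r = b - A * x_normal              # residual
```

The product `A' * r` vanishes to round-off, which is the statement that the residual is orthogonal to the column space of A.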
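Let's examine the normal equation by looking at the error e = x − x̃ and the
residual r = Ae = b − b̃, where b̃ = Ax̃. The solution to the normal equation is
$$\tilde{\mathbf{x}} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{b},$$
from which
$$\tilde{\mathbf{b}} = \mathbf{A}\tilde{\mathbf{x}} = \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{b} = \mathbf{P}_{\mathbf{A}} \mathbf{b},$$
where P_A is the orthogonal projection matrix into the column space of A. The
residual is simply
$$\mathbf{r} = \mathbf{b} - \tilde{\mathbf{b}} = (\mathbf{I} - \mathbf{P}_{\mathbf{A}})\mathbf{b} = \mathbf{P}_{N(\mathbf{A}^T)} \mathbf{b},$$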
where P 𝑁 (AT ) is the orthogonal projection matrix into the left null space of
A. Whatever of the origins of the term normal equation,2 it serves as a useful
pneumonic that the residual b − Ax̃ is normal to the column space of A. Recall
that we can write Ax = b as
The system is inconsistent because bleft null ≠ 0. By simply zeroing out bleft null ,
we get a solution. Put another way, we want to solve the problem Ax = b. But
we can’t because it has no solutions. So instead, we adju st the problem to be
2 According to William Kruskal and Stephen Stigler, Gauss introduced term (Normalgleichungen)
in an 1822 paper without motive. They offer possibilities about its origin but nothing dominates.
60 Inconsistent Systems
Ax = b − 𝛿b where the 𝛿b puts the right-hand side into the column space of A.
The residual 𝛿b is none other than the left null space component of b. Pretty
remarkable! Furthermore, zeroing out xnull will get a unique solution. We’ll
come back to this idea later in the chapter.
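As a quick numerical check of these statements, here is a minimal Julia sketch. The matrix A and vector b are made-up illustrative values, not data from the text. It solves the normal equation, forms the projection onto the column space, and confirms that the residual is normal to every column of A.

```julia
using LinearAlgebra

# Hypothetical data: fit a line through four points that are not collinear.
A = [1.0 0.0; 1.0 1.0; 1.0 2.0; 1.0 3.0]
b = [1.0, 2.0, 2.0, 4.0]

xls = (A'*A) \ (A'*b)             # solution of the normal equation AᵀAx = Aᵀb
P = A * ((A'*A) \ Matrix(A'))     # orthogonal projection onto the column space of A
bcol = P*b                        # column space component of b
r = b - bcol                      # residual: the left null space component of b

@show xls
@show norm(A'*r)                  # ≈ 0: the residual is normal to the columns of A
@show norm(A*xls - bcol)          # ≈ 0: xls solves the adjusted problem Ax = b − r
```

Forming P explicitly is only for illustration; in practice we would rarely build the projection matrix.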
Direct implementation of the normal equation can make a difficult problem
even worse because the condition number 𝜅(AT A) equals 𝜅(A) 2 , and the condition
number 𝜅((AT A) −1 AT ) may be even larger. A moderately ill-conditioned matrix
can lead to a very ill-conditioned problem. Instead, let’s look at an alternative
means of solving an overdetermined system by applying an orthogonal matrix to
the original problem. Such an approach is more robust (numerically stable) and
is almost as efficient as solving the normal equation.
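The squaring of the condition number is easy to observe numerically. The sketch below uses a small Vandermonde matrix chosen purely for illustration (its size and the known solution of all ones are assumptions) and compares the normal equation with Julia's built-in QR-based least squares solve.

```julia
using LinearAlgebra

# Hypothetical ill-conditioned example: a 20×8 Vandermonde matrix on [0,1].
t = range(0, 1, length=20)
A = [s^j for s in t, j in 0:7]
x = ones(8)
b = A*x                          # consistent right-hand side with a known solution

κ = cond(A)
@show κ, κ^2, cond(A'*A)         # cond(AᵀA) is roughly κ(A)²

x_normal = (A'*A) \ (A'*b)       # normal equation
x_qr = qr(A) \ b                 # orthogonal (QR) approach
@show norm(x_normal - x), norm(x_qr - x)
```

On a problem like this one, we would expect the QR solution to retain several more digits than the normal equation solution, although the exact errors depend on the machine and library.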
QR decomposition
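Substituting a QR decomposition A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix, into the normal equation AᵀAx = Aᵀb gives RᵀQᵀQRx = RᵀQᵀb,
which reduces to
RᵀRx = RᵀQᵀb.
Note that we simply have the Cholesky decomposition RᵀR of AᵀA on the left-hand side. By rearranging the terms of this equality, we have
Rᵀ(Rx − Qᵀb) = 0,
which says Rx − Qᵀb is in the null space of Rᵀ (the left null space of R):
\[
\begin{bmatrix}
\bullet & 0 & 0 & 0 & 0 & 0 & 0 \\
\bullet & \bullet & 0 & 0 & 0 & 0 & 0 \\
\bullet & \bullet & \bullet & 0 & 0 & 0 & 0 \\
\bullet & \bullet & \bullet & \bullet & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \bullet \\ \bullet \\ \bullet \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
\]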
In the next chapter, we’ll use orthogonal matrices to help us find eigenvalues.
One reason orthogonal matrices are particularly lovely is that they don’t change
the lengths of vectors in the 2-norm. The 2-condition number of an orthogonal
matrix is one. So the errors don’t blow up when we iterate. How do we find
the orthogonal matrix Q? We typically use Gram–Schmidt orthogonalization, a
series of Givens rotations, or a series of Householder reflections. Let’s examine
each of these methods of QR decomposition.
Gram–Schmidt orthogonalization
\[ w_k = v_k - \sum_{j=1}^{k-1} r_{jk} q_j \quad\text{where}\quad r_{jk} = \langle q_j, v_k \rangle. \]
The classical Gram–Schmidt method, which subtracts the projections all at once, can be numerically unstable: unless the vectors are already close to orthogonal, the orthogonal component is small, and computing it as the difference of nearly equal quantities amplifies rounding error. A better implementation is the modified Gram–Schmidt process, which subtracts the projections successively from the working vector:
for k = 1, . . . , n
    qₖ ← vₖ
    for j = 1, . . . , k − 1
        rⱼₖ ← ⟨qⱼ, qₖ⟩
        qₖ ← qₖ − rⱼₖ qⱼ
    rₖₖ ← ‖qₖ‖₂
    qₖ ← qₖ/rₖₖ
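The modified Gram–Schmidt process translates almost line for line into Julia. The following is a minimal sketch, not a production routine: the function name mgs and the test matrix are assumptions made for illustration, and the matrix is assumed to have full column rank. Each inner product uses the current working column rather than the original one.

```julia
using LinearAlgebra

# Modified Gram–Schmidt: returns Q with orthonormal columns and upper triangular R with A = QR.
function mgs(A)
    m, n = size(A)
    Q = float(copy(A))
    R = zeros(eltype(Q), n, n)
    for k in 1:n
        for j in 1:k-1
            R[j,k] = dot(Q[:,j], Q[:,k])   # project the current working vector onto qⱼ
            Q[:,k] -= R[j,k]*Q[:,j]        # subtract that projection immediately
        end
        R[k,k] = norm(Q[:,k])
        Q[:,k] /= R[k,k]                   # assumes R[k,k] ≠ 0 (full column rank)
    end
    Q, R
end

A = [1.0 1.0 0.0; 1.0 0.0 1.0; 0.0 1.0 1.0; 1.0 1.0 1.0]   # hypothetical full-rank matrix
Q, R = mgs(A)
@show norm(Q'*Q - I)     # ≈ 0: the columns of Q are orthonormal
@show norm(Q*R - A)      # ≈ 0: the factors reproduce A
```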
Givens rotations
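By choosing 𝜃 appropriately, we can rotate a vector to zero out either the first or the second component. For example, for a vector x = (x₁, x₂), we can get
\[ \hat{x} = Qx = \begin{bmatrix} \sqrt{x_1^2 + x_2^2} \\ 0 \end{bmatrix} \]
by taking
\[ \cos\theta = \frac{x_1}{\sqrt{x_1^2 + x_2^2}} \quad\text{and}\quad \sin\theta = -\frac{x_2}{\sqrt{x_1^2 + x_2^2}}. \]
By applying a suitable rotation Q to a 2 × 2 matrix A, with c = cos 𝜃 and s = sin 𝜃, we can get
\[ \begin{bmatrix} c & -s \\ s & c \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} * & * \\ 0 & * \end{bmatrix}. \]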
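By applying three Givens rotations, we can zero out the elements below the first pivot:
\[ Q_{41} Q_{31} Q_{21} A = \begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \end{bmatrix}. \]
By applying another two Givens rotations, we can zero out the elements below the second pivot:
\[ Q_{42} Q_{32} Q_{41} Q_{31} Q_{21} A = \begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{bmatrix}. \]
And by applying one more Givens rotation, we can zero out the element below the third pivot:
\[ Q_{43} Q_{42} Q_{32} Q_{41} Q_{31} Q_{21} A = \begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix} = R. \]
So, formally, A = QR where Q = (Q₄₃Q₄₂Q₃₂Q₄₁Q₃₁Q₂₁)ᵀ.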
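The sequence of rotations can be sketched in a few lines of Julia. This is an illustrative implementation under stated assumptions (the function name givens_qr and the 4 × 4 test matrix are made up), and it forms each 2 × 2 rotation explicitly for clarity rather than efficiency.

```julia
using LinearAlgebra

# QR by Givens rotations: zero out subdiagonal entries one at a time, column by column.
function givens_qr(A)
    m, n = size(A)
    R = float(copy(A))
    Q = Matrix{eltype(R)}(I, m, m)
    for j in 1:n, i in j+1:m
        ρ = hypot(R[j,j], R[i,j])
        ρ == 0 && continue                 # nothing to rotate
        c, s = R[j,j]/ρ, -R[i,j]/ρ         # cos θ and sin θ as in the text
        G = [c -s; s c]
        R[[j,i],:] = G*R[[j,i],:]          # rotate rows j and i so that R[i,j] becomes zero
        Q[:,[j,i]] = Q[:,[j,i]]*G'         # accumulate the transposes so that A = QR
    end
    Q, R
end

A = [4.0 1 2 3; 1 3 0 1; 2 0 5 1; 3 1 1 6]   # hypothetical 4×4 matrix
Q, R = givens_qr(A)
@show norm(Q*R - A)          # ≈ 0
@show norm(tril(R, -1))      # ≈ 0: R is upper triangular up to rounding
```

Because each rotation touches only two rows, Givens rotations are attractive when most subdiagonal entries are already zero.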
Householder reflections
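While Givens rotations zero out one element at a time, Householder reflections speed things up by changing whole columns all at once:
\[
\overset{A}{\begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \end{bmatrix}}
\to
\overset{Q_1 A}{\begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \end{bmatrix}}
\to
\overset{Q_2 Q_1 A}{\begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{bmatrix}}
\to
\overset{Q_3 Q_2 Q_1 A}{\begin{bmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix}}.
\]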
We start with the first column x of the 𝑛 × 𝑛 matrix. Let’s find an orthogonal matrix Q₁ such that Q₁x = y, where y = ±‖x‖₂𝝃 with 𝝃 = (1, 0, . . . , 0) and the sign ± taken from the first component of x. We can determine Q₁ geometrically. Let u = (x + y)/2 and v = (x − y)/2. Then u is orthogonal to v and span{u, v} = span{x, y}, and we can get y by subtracting twice v from x. That is to say, we can get y by subtracting twice the projection of x in the v direction: y = (I − 2P_v)x. In general, for any vector in the direction of v, we define the Householder reflection matrix:
Hₙ = I − 2P_v = I − 2vvᵀ/‖v‖₂².
Because we want y to be ±‖x‖₂𝝃, we’ll take v = x ± ‖x‖₂𝝃. Technically, we ought to take v = (x ± ‖x‖₂𝝃)/2, but the one-half scaling factor drops out when we compute P_v. Geometrically, the Householder matrix Hₙ reflects a vector across the subspace span{v}⊥.
In the second step, we want to zero out the second column’s subdiagonal elements, leaving the first column unaltered. To do this, we left multiply by the Householder matrix
\[ Q_2 = \begin{bmatrix} 1 & 0 \\ 0 & H_{n-1} \end{bmatrix} \]
where H₍ₙ₋₁₎ is the (𝑛 − 1) × (𝑛 − 1) Householder matrix that maps an (𝑛 − 1)-dimensional vector x to the (𝑛 − 1)-dimensional vector ±‖x‖₂𝝃.
In the 𝑘th step, we use the Householder matrix
\[ Q_k = \begin{bmatrix} I_{k-1} & 0 \\ 0 & H_{n-k+1} \end{bmatrix}. \]
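The Householder steps can be sketched in Julia as follows. This is a minimal illustration under stated assumptions: the function name householder_qr and the test matrix are hypothetical, the sign of v is taken from the first component of x to avoid cancellation, and the reflections are applied as rank-one updates instead of forming each Householder matrix.

```julia
using LinearAlgebra

# QR by Householder reflections: each step zeros a whole column below the diagonal.
function householder_qr(A)
    m, n = size(A)
    R = float(copy(A))
    Q = Matrix{eltype(R)}(I, m, m)
    for k in 1:min(m-1, n)
        x = R[k:m, k]
        v = copy(x)
        v[1] += (x[1] ≥ 0 ? 1 : -1)*norm(x)   # v = x ± ‖x‖₂ξ with the sign of x₁
        norm(v) == 0 && continue              # column is already zero
        v /= norm(v)
        R[k:m, :] -= 2v*(v'*R[k:m, :])        # apply H = I − 2vvᵀ to rows k through m
        Q[:, k:m] -= 2(Q[:, k:m]*v)*v'        # accumulate Q = Q₁Q₂⋯ (each factor is symmetric)
    end
    Q, R
end

A = [4.0 1 2 3; 1 3 0 1; 2 0 5 1; 3 1 1 6]    # hypothetical 4×4 matrix
Q, R = householder_qr(A)
@show norm(Q*R - A)          # ≈ 0
@show norm(tril(R, -1))      # ≈ 0
```

With this sign choice the diagonal of R may contain negative entries, which is a valid QR decomposition.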
The 2-norm of the residual is
‖r‖₂ = ‖Ax − b‖₂.
Substituting A = QR and using the fact that multiplying by the orthogonal matrix Qᵀ does not change the 2-norm,
‖r‖₂ = ‖QRx − b‖₂ = ‖Rx − Qᵀb‖₂.
From this expression, we see that the problem of solving Ax = b can be replaced with the equivalent problem of solving the upper triangular system Rx = c where c = Qᵀb. The residual r of this system is
\[ r = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} - \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x = \begin{bmatrix} c_1 - R_1 x \\ c_2 \end{bmatrix}, \]
from which ‖r‖₂² = ‖c₁ − R₁x‖₂² + ‖c₂‖₂².
The term ‖c₂‖₂² is independent of x, so it contributes the same amount to ‖r‖₂² regardless of the value of x. So the x that minimizes ‖c₁ − R₁x‖₂ also minimizes ‖r‖₂. And, if R₁ has full rank, then ‖c₁ − R₁x‖₂ is minimized precisely when
R₁x = c₁.
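Here is a short Julia sketch of this procedure using the built-in qr function from LinearAlgebra. The data A and b are made-up values for illustration. It solves the triangular system for the top block and confirms that the minimum residual norm equals the norm of the bottom block.

```julia
using LinearAlgebra

A = [1.0 0.0; 1.0 1.0; 1.0 2.0; 1.0 3.0]    # hypothetical 4×2 full-rank matrix
b = [1.0, 2.0, 2.0, 4.0]
m, n = size(A)

F = qr(A)
c = F.Q' * b                     # c = Qᵀb, a vector of length m
c1, c2 = c[1:n], c[n+1:m]        # top and bottom blocks of c
R1 = F.R                         # n×n upper triangular block
x = UpperTriangular(R1) \ c1     # back substitution for R₁x = c₁

@show x
@show norm(A*x - b), norm(c2)    # the two agree: the minimum residual norm is ‖c₂‖₂
```

In everyday use, the one-liner A\b does the same thing for a tall dense matrix.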
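On the other hand, what if R₁ does not have full rank? In this case, R₁x = c₁ is
\[ \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \]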
[Figure 3.1: log-log plot of word count against word rank, annotated with sample words ranging from “he, was” and “his, is” near a count of 10,000 down to “binomial, dunderheads, monstrosities, . . .” at a count of 1.]
Example. George Zipf was a linguist who noticed that a few words, like the,
is, of, and and, are used quite frequently while most words, like anhydride,
embryogenesis, and jackscrew, are rarely used at all. By examining word
frequency rankings across several corpora, Zipf made a statistical observation,
now called Zipf’s law. The empirical law states that in any natural language
corpus the frequency of any word is inversely proportional to its rank in a
frequency table. The most common word is twice as likely to appear as the
second most common word, three times as likely to appear as the third most
common word, and 𝑛 times as likely as the 𝑛th most common word. For example,
18,951 of the 662,817 words that appear in the canon of Sherlock Holmes by Sir Arthur Conan Doyle are distinct. The word the appears 36,125 times. Holmes appears 3051 times, Watson 1038 times, and Moriarty3 appears 54 times. And there are 6610 words like binomial, dunderheads, monstrosities, sunburnt, and vendetta that appear only once. See Figure 3.1.
3 Professor James Moriarty, criminal mastermind and archenemy of Sherlock Holmes, was a brilliant mathematician who at the age of twenty-one wrote a treatise on the binomial theorem, winning him the mathematical chair at a small university. His book on asteroid dynamics is described as one “which ascends to such rarefied heights of pure mathematics that it is said that there was no man in the scientific press capable of criticizing it.”
Underdetermined systems 67
Zipf’s power law states that the frequency of the 𝑘th most common word is
approximately 𝑛 𝑘 = 𝑛1 𝑘 −𝑠 where 𝑠 is some parameter and 𝑛1 is the frequency of
the most common word. Let’s find the power 𝑠 for the words in Sherlock Holmes
using ordinary least squares and total least squares. First, we’ll write the power
law in log-linear form: log 𝑛 𝑘 = −𝑠 log 𝑘 + log 𝑛1 . The Julia code is
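using DelimitedFiles
# `bucket` is assumed to hold the base URL of the data repository, defined elsewhere
T = readdlm(download(bucket*"sherlock.csv"), '\t')[:,2]   # second column, assumed to be counts by rank
n = length(T)
A = [ones(n,1) log.(1:n)]   # columns for the intercept log n₁ and the slope −s
B = log.(T)
c = A\B                     # c[1] estimates log n₁ and c[2] estimates −s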
We find that the slope −𝑠 is −1.165, so 𝑠 is about 1.165, using least squares. Zipf’s law applies to more than just words. We can use it to model the size of cities, the magnitude of earthquakes, and even the distances between galaxies. For example, Figure 3.2 shows the log-log plot of the population of U.S. cities against their associated ranks.4 We find that the slope −𝑠 is −0.770, so 𝑠 is about 0.770.
4 See UScities.csv at https://github.com/nmfsc/data
So far, we’ve assumed that if Ax = b has a least squares solution, then it has a unique least squares solution. This assumption is not always valid. If the null space of A has more than the zero vector, then AᵀA is not invertible. And the upper triangular submatrix R₁ discussed in the previous section does not have full rank. In this case, the problem has infinitely many solutions. So then, how do we choose which of the infinitely many solutions is the “best” solution? One way is to take the solution x with the smallest norm ‖x‖. This approach is known as Tikhonov regularization or ridge regression. It works exceptionally well in problems where x has a zero mean, which frequently arise in statistics when variables are standardized. Let’s change our original problem
Find the x ∈ ℝⁿ that minimizes Φ(x) = ‖Ax − b‖₂²
to the new problem
Find the x ∈ ℝⁿ that minimizes Φ(x) = ‖Ax − b‖₂² + 𝛼²‖x‖₂²
where 𝛼 is some regularizing parameter. The result is a solution that fits Ax to b but penalizes solutions with large norms. Minimizing ‖Ax − b‖₂² + 𝛼²‖x‖₂² is equivalent to solving the stacked-matrix equation
\[ \begin{bmatrix} A \\ \alpha I \end{bmatrix} x = \begin{bmatrix} b \\ 0 \end{bmatrix}, \tag{3.1} \]
which we can do using least squares methods such as QR factorization.
Let’s also examine the normal equation solution to the regularized minimization problem. As before, solve the minimization problem by finding the x such that ∇Φ(x) = 0. We have 2AᵀAx − 2Aᵀb + 2𝛼²x = 0, which we can solve to get
x = (AᵀA + 𝛼²I)⁻¹Aᵀb = R_𝛼 b. (3.2)
When A has many more columns than rows, computing the QR factorization of the stacked-matrix equation (3.1) or solving the regularized normal equation (3.2) can be inefficient because we are working with even larger matrices. However, by noting that Aᵀ(AAᵀ + 𝛼²I) = (AᵀA + 𝛼²I)Aᵀ, we have
(AᵀA + 𝛼²I)⁻¹Aᵀ = Aᵀ(AAᵀ + 𝛼²I)⁻¹.
So we can instead compute x = Aᵀ(AAᵀ + 𝛼²I)⁻¹b.
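The three formulations should give the same regularized solution, which we can confirm with a small Julia sketch. The matrix, right-hand side, and value of 𝛼 below are arbitrary illustrative choices.

```julia
using LinearAlgebra

A = [1.0 2.0 3.0 4.0; 2.0 4.0 6.0 8.5]     # hypothetical 2×4 matrix (more columns than rows)
b = [1.0, 2.0]
α = 0.1
m, n = size(A)

x_stacked = [A; Matrix(α*I, n, n)] \ [b; zeros(n)]   # least squares on the stacked system (3.1)
x_normal  = (A'*A + α^2*I) \ (A'*b)                  # regularized normal equation (3.2): n×n solve
x_dual    = A' * ((A*A' + α^2*I) \ b)                # equivalent form with only an m×m solve

@show x_stacked
@show norm(x_stacked - x_normal), norm(x_normal - x_dual)   # both ≈ 0
```

When 𝑛 is much larger than 𝑚, the last form is the cheapest because the linear system it solves is only 𝑚 × 𝑚.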
Example. Photographs sometimes have motion or focal blur and noise, some-
thing we might want to remove or minimize. Suppose that x is some original
image and y = Ax + n is the observed image, where A is a blurring matrix and n
is a noise vector. The blurring matrix A does not have a unique inverse, so the
problem is ill-posed. Compare the following images:
[Figure: five images labeled (a) through (e).]
The first image (a) shows the original image x. Blur and noise have been added to this image to get the second image (b), y = Ax + n. Image (c) shows the solution using the least squares method (AᵀA)⁻¹Aᵀy. Because the blurring matrix A has a nontrivial null space, there are infinitely many possible solutions for the least squares algorithm. The method fails. Image (d) shows the regularized least squares solution (AᵀA + 𝛼²I)⁻¹Aᵀy for a well-chosen 𝛼. I think that this one is the winner. Image (e) shows the pseudoinverse A⁺y. This one is also good, although perhaps not as good as the regularized least squares solution.
The weight 𝛼² determines the relative importance of each objective ‖r₁‖₂² and ‖r₂‖₂². We then minimize ‖r‖₂² = ‖r₁‖₂² + 𝛼²‖r₂‖₂² by solving
\[ \begin{bmatrix} A \\ \alpha C \end{bmatrix} x = \begin{bmatrix} b \\ \alpha d \end{bmatrix}. \]
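function constrained_lstsq(A,b,C,d)
  x = [A'*A C'; C zeros(size(C,1),size(C,1))]\[A'*b;d]   # solve [AᵀA Cᵀ; C 0][x; λ] = [Aᵀb; d]
  x[1:size(A,2)]   # assumed completion: return x without the multipliers λ
end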
QR decomposition and Gram matrices both fill in sparse matrices. We can avoid fill-in by rewriting the normal equation AᵀAx = Aᵀb as an expanded block system using the residual. Because the residual r = b − Ax is in the left null space of A, it follows that Aᵀr = 0. From these two equations, we have
\[ \begin{bmatrix} 0 & A^T \\ A & I \end{bmatrix} \begin{bmatrix} x \\ r \end{bmatrix} = \begin{bmatrix} 0 \\ b \end{bmatrix}. \]
This new matrix maintains the sparsity and can be solved using an iterative method such as the symmetric conjugate-gradient method that will be discussed in Chapter 5. Similarly, a sparse constrained least squares problem can be rewritten as
\[ \begin{bmatrix} 0 & A^T & C^T \\ A & I & 0 \\ C & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ r \\ \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ b \\ d \end{bmatrix}. \]
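To see the expanded block system in action, here is a minimal Julia sketch on a tiny made-up problem. For brevity it uses a sparse direct LU solve in place of the iterative method mentioned above, so it illustrates the block structure rather than the recommended solver.

```julia
using LinearAlgebra, SparseArrays

A = sparse([1.0 0.0; 1.0 1.0; 1.0 2.0; 1.0 3.0])   # hypothetical sparse 4×2 matrix
b = [1.0, 2.0, 2.0, 4.0]
m, n = size(A)

K = [spzeros(n, n) sparse(A'); A sparse(1.0I, m, m)]   # block matrix [0 Aᵀ; A I]
z = lu(K) \ [zeros(n); b]                              # direct solve, swapped in for an iterative method
x, r = z[1:n], z[n+1:end]

@show x
@show norm(r - (b - A*x))     # ≈ 0: the second block is the residual
@show norm(A'*r)              # ≈ 0: the residual lies in the left null space of A
```

The block matrix is symmetric but indefinite, which is worth keeping in mind when choosing an iterative solver.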
Singular value decomposition 71
Another way to solve the least squares problem is by singular value decomposition. Recall that for the inconsistent and underdetermined problem
A(x_row + x_null) = b_column + b_left null,
we minimize ‖b − Ax‖₂ by forcing b_left null = 0 and ensure a unique solution that minimizes ‖x‖₂ by forcing x_null = 0. We can use singular value decomposition to both zero out the left null space and zero out the null space. A matrix’s singular value decomposition is A = U𝚺Vᵀ, where U and V are orthogonal matrices and 𝚺 is a diagonal matrix of singular values. By convention, the singular values 𝜎ᵢ are in descending order 𝜎₁ ≥ 𝜎₂ ≥ · · · ≥ 𝜎ₙ ≥ 0. This ordering is also a natural result of the SVD algorithm, which we study in the next chapter after developing tools for finding eigenvalues. In this section, we look at applications of singular value decomposition.
The SVD of an 𝑚 × 𝑛 matrix A is an 𝑚 × 𝑚 matrix U, an 𝑚 × 𝑛 diagonal matrix
𝚺, and an 𝑛 × 𝑛 matrix VT . The columns of U are called left singular vectors and
the columns of V are called the right singular vectors. When 𝑚 > 𝑛, only the
first 𝑛 left singular vectors are needed to reconstruct A. The other columns are
superfluous. In this case, it’s more efficient to use the thin or economy SVD—an
𝑚 × 𝑛 matrix U, an 𝑛 × 𝑛 diagonal matrix 𝚺, and an 𝑛 × 𝑛 matrix VT . See
Figure 3.3 on the next page. The same holds for low-rank matrices. For a matrix
A with rank 𝑟,
• the first 𝑟 columns of U span the column space of A,
• the last 𝑚 − 𝑟 columns of U span the left null space of A,
• the first 𝑟 columns of V span the row space of A, and
• the last 𝑛 − 𝑟 columns of V span the null space of A.
We can reconstruct A using an 𝑚 × 𝑟 matrix U, an 𝑟 × 𝑟 diagonal matrix 𝚺, and
an 𝑟 × 𝑛 matrix Vᵀ. This type of reduced SVD is called a compact SVD.
Small singular values do not contribute much information to A. By keeping
the 𝑘 largest singular values and the corresponding singular vectors, we get the
closest rank 𝑘 approximation to A in the Frobenius norm. The truncated SVD of
A is given by an 𝑚 × 𝑘 matrix U𝑘, a 𝑘 × 𝑘 diagonal matrix 𝚺𝑘, and a 𝑘 × 𝑛
matrix VT𝑘 . Truncated SVDs have several different uses—reducing the memory
needed to store a matrix, making computations faster, simplifying problems by
projecting the data to lower-dimensional subspaces, and regularizing problems
by reducing a matrix’s condition number.
The LinearAlgebra.jl function svd(A) returns the SVD of matrix A as an object.
Singular values are in descending order. When the option full=false (default), the
function returns an economy SVD. The function svds(A,k) returns an object containing
the first k (default is 6) singular values and associated singular vectors.
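The following Julia sketch shows the economy and full SVDs from LinearAlgebra and builds a truncated SVD by hand. The 4 × 3 matrix is an arbitrary illustrative choice, and the name Ak for the rank-𝑘 approximation is ours.

```julia
using LinearAlgebra

A = [3.0 2.0 2.0; 2.0 3.0 -2.0; 1.0 0.0 1.0; 0.0 1.0 1.0]   # hypothetical 4×3 matrix
F = svd(A)                             # economy SVD by default
@show size(F.U), size(svd(A, full=true).U)
@show F.S                              # singular values in descending order

k = 2
Ak = F.U[:,1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k,:]   # truncated SVD of rank k
@show norm(A - Ak) ≈ norm(F.S[k+1:end])   # Frobenius error comes from the discarded singular values
@show opnorm(A - Ak) ≈ F.S[k+1]           # spectral-norm error is the first discarded singular value
```

Note that norm applied to a matrix returns the Frobenius norm, while opnorm returns the spectral norm.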
Figure 3.3: The SVD breaks a matrix into the orthonormal bases of its four
fundamental subspaces—its column space and left null space along with its row
space and null space—coupled through a diagonal matrix of singular values.
dimensions      A⁺b
𝑚 = 𝑛 = 𝑟       A⁻¹b
𝑚 > 𝑛 = 𝑟       arg minₓ ‖b − Ax‖₂ and A⁺ = (AᵀA)⁻¹Aᵀ
𝑛 > 𝑚 = 𝑟       rowspace projection of b and A⁺ = Aᵀ(AAᵀ)⁻¹
𝑛 ≥ 𝑚 > 𝑟       arg minₓ ‖b − Ax‖₂ using the rowspace projection of b
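The two closed-form expressions for full-rank matrices in the table can be checked against the LinearAlgebra function pinv. The matrices below are made-up examples with full row rank and full column rank, respectively.

```julia
using LinearAlgebra

A = [1.0 2.0 3.0; 4.0 5.0 6.5]     # hypothetical 2×3 matrix with full row rank (n > m = r)
B = Matrix(A')                     # 3×2 matrix with full column rank (m > n = r)

@show pinv(A) ≈ A' / (A*A')        # A⁺ = Aᵀ(AAᵀ)⁻¹
@show pinv(B) ≈ (B'*B) \ B'        # A⁺ = (AᵀA)⁻¹Aᵀ
@show A * pinv(A) ≈ I(2)           # for full row rank, A⁺ is a right inverse
```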
Example. Consider the 𝑛×𝑛 Hilbert matrix, whose ij-element equals (𝑖+ 𝑗 −1) −1 .
We examined this matrix and its inverse using Gaussian elimination on page 23.
Hilbert matrices are poorly conditioned even for relatively small 𝑛. The density
plots of the matrices pinv(hilbert(n))*hilbert(n) for 𝑛 = 10, 15, 20, 25, and 50 are below. Zero values are white, values with magnitude one or greater are black, and intermediate values are shades of gray.
[Figure: density plots for 𝑛 = 10, 𝑛 = 15, 𝑛 = 20, 𝑛 = 25, and 𝑛 = 50.]
The pseudoinverse has noticeable error even for small matrices. But unlike Gaussian elimination, which falls flat on its face, the pseudoinverse solution fails with grace. The pseudoinverse truncates all singular values less than 10⁻¹⁵, which happens near 𝜎₈, effectively reducing the problem to using a rank-8 matrix.
Tikhonov regularization
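where 𝚺_𝛼⁻¹ is a diagonal matrix with diagonal components
\[ \frac{\sigma_i}{\sigma_i^2 + \alpha^2}. \]
It is useful to rewrite these components as
\[ \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2} \frac{1}{\sigma_i} = w_\alpha(\sigma_i) \frac{1}{\sigma_i} \]
where 𝑤_𝛼(𝜎ᵢ) is the Wiener weight. From this, we see that Tikhonov regularization is simply a smooth filter applied to the singular values. The truncated SVD filter
\[ w_\alpha(\sigma) = \begin{cases} 0, & \sigma \le \alpha \\ 1, & \sigma > \alpha \end{cases} \]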
Figure 3.4: Comparison of the Tikhonov filter for 𝛼 = 10⁻³ and the truncated SVD filter with singular values zeroed out below 10⁻³.
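Both filters are easy to apply once we have the SVD. In the Julia sketch below, the nearly rank-deficient matrix, the right-hand side, and the value of 𝛼 are made-up illustrative choices. It builds the two sets of filter weights and checks that the Tikhonov-filtered solution agrees with the regularized normal equation.

```julia
using LinearAlgebra

A = [1.0 2.0; 2.0 4.001; 3.0 6.0]      # hypothetical nearly rank-deficient matrix
b = [1.0, 2.0, 3.0]
α = 1e-2

U, σ, V = svd(A)
w_tikhonov = σ.^2 ./ (σ.^2 .+ α^2)     # Wiener weights: a smooth filter
w_truncate = Float64.(σ .> α)          # truncated SVD filter: a sharp cutoff at α

x_tikhonov = V * (w_tikhonov ./ σ .* (U'*b))
x_truncate = V * (w_truncate ./ σ .* (U'*b))

@show σ
@show w_tikhonov, w_truncate
@show x_tikhonov ≈ (A'*A + α^2*I) \ (A'*b)
```

Here the second singular value falls below 𝛼, so the truncated filter discards it entirely while the Tikhonov filter only damps it.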
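Proof. Let’s start with the spectral norm. A rank-𝑘 matrix B can be expressed as the product XYᵀ where X and Y have 𝑘 columns. Then because Y has only rank 𝑘, there is a unit vector w = 𝛾₁v₁ + · · · + 𝛾ₖ₊₁vₖ₊₁ such that Yᵀw = 0. Furthermore,
\[ Aw = \sum_{i=1}^{k+1} \gamma_i \sigma_i u_i \quad\text{from which}\quad \|Aw\|_2^2 = \sum_{i=1}^{k+1} |\gamma_i \sigma_i|^2. \]
Then
\[ \|A - B\|_2^2 \ge \|(A - B)w\|_2^2 = \|Aw\|_2^2 = \sum_{i=1}^{k+1} |\gamma_i \sigma_i|^2 \ge \sigma_{k+1}^2 = \|A - A_k\|_2^2. \]
For the Frobenius norm, let (A − B)ᵢ₋₁ denote the rank-(𝑖 − 1) truncation of A − B. Then
\[ \sigma_i(A - B) = \|(A - B) - (A - B)_{i-1}\|_2 = \|A - (B + (A - B)_{i-1})\|_2 \ge \sigma_{k+i}(A), \]
where the inequality follows from the results about the spectral norm above because rank (B + (A − B)ᵢ₋₁) ≤ rank Aₖ₊ᵢ₋₁. So,
\[ \|A - B\|_F^2 = \sum_{i=1}^{r} \sigma_i(A - B)^2 \ge \sum_{i=k+1}^{r} \sigma_i(A)^2 = \|A - A_k\|_F^2. \]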
[Figure: four panels labeled (a) through (d).]
(a) The data in A lie for the most part along a direction v₁ with a small orthogonal component in direction v₂. (b) The matrix AV is a rotation of the heights and weights onto the x- and y-axes. (c) We keep only the first principal component, now along the x-axis, and discard the second principal component, the offsets in the y-direction. (d) After rotating back to our original basis, all the data lie along the v₁-direction.
Unlike a least squares fit, which would project the data “vertically” onto the height-weight line, principal component analysis projects the data orthogonally onto the height-weight line. Consider the scatter plots below.
[Figure: three scatter plots, each with both axes running from 0 to 10.]
The original data are shown in the scatter plot on the left. The least squares approach shown in the center figure is an orthogonal projection of the y-components onto the column space of the 15 × 2 Vandermonde matrix. Principal component analysis is an orthogonal projection onto the first right singular vector. Note the difference in the solutions using ordinary least squares fit and orthogonal least squares fit. Neither the points nor the line are the same in either figure.
Let’s dig deeper into orthogonal least squares, also known as total least squares. We solved the ordinary least squares problem Ax = b by determining the solution to a similar problem: choose the solution to Ax = b + 𝛿b that has the smallest residual ‖𝛿b‖₂. We can generalize the problem of finding the solution to AX = B by minimizing the norm of the residual not only with respect to the dependent terms B but also with respect to the independent terms A. That is, find the smallest 𝛿A and 𝛿B that satisfy (A + 𝛿A)X = B + 𝛿B. Such a problem is called the total least squares problem. In two dimensions, total least squares regression is called Deming regression or orthogonal regression. Let’s examine the general case and take A to be an 𝑚 × 𝑛 matrix, X to be an 𝑛 × 𝑝 matrix, and B to be an 𝑚 × 𝑝 matrix.
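We can write the total least squares problem (A + 𝛿A)X = B + 𝛿B using block matrices
\[ \begin{bmatrix} A + \delta A & B + \delta B \end{bmatrix} \begin{bmatrix} X \\ -I \end{bmatrix} = 0. \tag{3.3} \]
We want to find
\[ \arg\min_{\delta A,\, \delta B} \left\| \begin{bmatrix} \delta A & \delta B \end{bmatrix} \right\|_F. \]
Take the singular value decomposition U𝚺Vᵀ = [A B]:
\[ \begin{bmatrix} A & B \end{bmatrix} = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}^T, \]
where U₁, 𝚺₁, V₁₁, and V₂₁ have 𝑛 columns like A and U₂, 𝚺₂, V₁₂, and V₂₂ have 𝑝 columns like B. By the Eckart–Young–Mirsky theorem, the terms 𝛿A and 𝛿B are those for which
\[ \begin{bmatrix} A + \delta A & B + \delta B \end{bmatrix} = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}^T. \]
By linearity,
\[
\begin{aligned}
\begin{bmatrix} \delta A & \delta B \end{bmatrix}
&= -\begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} 0 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}^T \\
&= -U_2 \Sigma_2 \begin{bmatrix} V_{12} \\ V_{22} \end{bmatrix}^T \\
&= -\begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} V_{12} \\ V_{22} \end{bmatrix} \begin{bmatrix} V_{12} \\ V_{22} \end{bmatrix}^T.
\end{aligned}
\]
Multiplying on the right by the block column of V₁₂ and V₂₂, which has orthonormal columns, gives
\[ \begin{bmatrix} \delta A & \delta B \end{bmatrix} \begin{bmatrix} V_{12} \\ V_{22} \end{bmatrix} = -\begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} V_{12} \\ V_{22} \end{bmatrix}, \]
so
\[ \begin{bmatrix} A + \delta A & B + \delta B \end{bmatrix} \begin{bmatrix} V_{12} \\ V_{22} \end{bmatrix} = 0. \]
If V₂₂ is nonsingular, then
\[ \begin{bmatrix} A + \delta A & B + \delta B \end{bmatrix} \begin{bmatrix} V_{12} V_{22}^{-1} \\ I \end{bmatrix} = 0. \]
Because (A + 𝛿A)X = B + 𝛿B, it follows that X = −V₁₂V₂₂⁻¹.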
function tls(A,B)
  n = size(A,2)
  V = svd([A B]).V
  -V[1:n,n+1:end]/V[n+1:end,n+1:end]   # assumed completion: X = −V₁₂V₂₂⁻¹
end
[Figure: three plots of lines in the plane, each with a horizontal axis from 0 to 2 and a vertical axis from −1 to 1.]
The figure on the left above shows the original lines. The middle one shows the ordinary least squares solution to Ax = b + 𝛿b of around (0.777, 0.196). The solution slides the lines around the plane to find the intersection but does not change their slopes. The one on the right shows the total least squares solution to (A + 𝛿A)x = b + 𝛿b of around (0.877, 0.230). Now, lines can slide around the plane and change their slopes.
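As a sanity check on the total least squares construction, here is a self-contained Julia sketch. The function body follows the formula X = −V₁₂V₂₂⁻¹, and the data are made-up noisy samples of a line, not the example from the figure. The check at the end confirms that the perturbed system built from the closest lower-rank matrix is exactly consistent.

```julia
using LinearAlgebra

function tls(A, B)
    n = size(A, 2)
    V = svd([A B]).V
    -V[1:n, n+1:end] / V[n+1:end, n+1:end]    # X = −V₁₂V₂₂⁻¹, assuming V₂₂ is nonsingular
end

# Hypothetical noisy samples of the line y = 2t + 1
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
A = [t ones(5)]

x_ols = A \ y
x_tls = tls(A, y)
@show x_ols, vec(x_tls)

# Remove the smallest singular component of [A y] to get [A+δA  y+δy]
F = svd([A y])
C = [A y] - F.S[3] * F.U[:,3] * F.Vt[3:3,:]
@show norm(C[:,1:2]*x_tls - C[:,3])           # ≈ 0: (A + δA)x = y + δy holds exactly
```

Note that this formulation also perturbs the column of ones, so for practical line fitting we would typically center the data first.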
replace frequently occurring sequences of bits with short ones and Lempel–Ziv–Storer–Szymanski
(LZSS) compression that replaces repeated sequences of bits with pointers to earlier occurrences.
6 A GIF only supports an 8-bit color palette for each image, which results in posterization and
makes it less suitable for color photographs. On the other hand, a PNG will support 16 bits per
values. A color image consists of three channels, and the corresponding matrix
is 𝑚 × 3𝑛. Let’s consider the SVD of a grayscale image A = UΣVT :
using Images
A = load(download(bucket*"laura.png")) .|> Gray
U, σ, V = svd(A);
See the figure on the following page and the QR code at the bottom of this page. The Frobenius norm, which measures the root mean squared error between two images (up to a constant factor) by comparing them pixel-wise, can be computed from the singular values:
\[ \|A - A_k\|_F^2 = \sum_{i=1}^{n} \sigma_i^2 - \sum_{i=1}^{k} \sigma_i^2 = \sum_{i=k+1}^{n} \sigma_i^2. \]
We can see this by checking that norm(A-Aₖ)≈norm(σ[k+1:end]) returns the value true. Let’s plot the relative error as a function of rank, shown in the figure on the next page:
ϵ² = 1 .- cumsum(σ.^2)/sum(σ.^2); scatter(ϵ²,xaxis=:log)   # squared singular values, matching the formula above; scatter assumes Plots.jl
Image recognition
Image recognition algorithms have been around since the late 1950s, when
psychologist Frank Rosenblatt built the Perceptron, a device with 400 photocells
and a rudimentary mathematical model capable of classifying basic shapes.
In 1987, mathematicians Lawrence Sirovich and Michael Kirby developed a
methodology to characterize human faces. The methodology, initially called
eigenpictures, was subsequently popularized as eigenfaces or eigenimages. Two
years later, computer scientist Yann LeCun and fellow researchers developed a
neural network image classifier called LeNet. We’ll return to LeNet in Chapter 10.
Here, we examine Sirovich and Kirby’s method.
The eigenimage method applies singular value decomposition to training
images to produce a smaller set of orthogonal images. Take a set of 𝑛 𝑝 × 𝑞-pixel
grayscale training images, and reshape each image into a 𝑝𝑞 × 1 array. We’ll
denote each as d 𝑗 for 𝑗 = 1, 2, . . . , 𝑛. It is common practice to mean center and
U 𝑘 UT𝑘 D = U 𝑘 W 𝑘 ,
The eigenimages u𝑖 and the signatures 𝜎𝑖 vT𝑖 contain the information needed to
reconstruct the training images.
Take the 8-dimensional projection of D using the first eight eigenimages and
their respective signatures: D8 = E1 + E2 + · · · + E8 . The projected images are
Suppose you want to find similar books in a library or a journal article that best
matches a given query. One way to do this is by making a list of all possible
terms and counting the frequency of each term in the documents as components
of a vector (sometimes called a “bag of words”). The 𝑗th element of a vector d𝑖 tells us the number of times that term 𝑗 appears in document 𝑖. The 𝑚 × 𝑛 document-term matrix D = [d₁ d₂ · · · dₙ] maps words to the documents containing those words.
[Figure: the matrix D relating terms to documents.]
Given a query q, we can find the most relevant documents to the query by finding the closest vector d𝑖 to q. Using cosine similarity, we identify the document 𝑖 that maximizes cos 𝜃𝑖 = qᵀd𝑖/(‖q‖ ‖d𝑖‖). Values close to one represent a good match, and values close to zero represent a poor match. This approach is called explicit semantic analysis.
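Cosine similarity takes only a line or two of Julia. The tiny document-term matrix and query below are invented for illustration, with each column of D holding the term counts of one document.

```julia
using LinearAlgebra

# Hypothetical counts for 4 terms (rows) in 3 documents (columns).
D = [2.0 0.0 1.0;
     1.0 0.0 0.0;
     0.0 3.0 1.0;
     0.0 1.0 2.0]
q = [1.0, 1.0, 0.0, 0.0]            # query containing terms 1 and 2

cosθ = [dot(q, d)/(norm(q)*norm(d)) for d in eachcol(D)]
best = argmax(cosθ)                 # index of the best-matching document

@show cosθ
@show best
```

Document 2 shares no terms with the query, so its cosine similarity is exactly zero, which is the limitation that motivates working with latent variables.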
Searching for a document using only terms is limiting because you need
to explicitly match terms. For example, when searching for books or articles
about Barcelona, terms like Spain, Gaudi, Picasso, Catalonia, city, and soccer
are all conceptually relevant. In this case, rather than clustering documents
by explicit variables like terms, it would be better to cluster documents using
latent variables like concepts. This is the idea behind latent semantic analysis,
introduced by computer scientist Susan Dumais in the late 1980s. We can think
of concepts as intermediaries between text and documents. Take the singular
Figure 3.6: These 120 × 120 pixel images span a 26-dimensional subspace
of R14400 . Eigenimages (ordered by their associated singular values) form an
orthogonal basis for the subspace spanned by the training images. Also, see the
QR code below.
Nonnegative matrix factorization 83
Figure 3.7: Left: The two-dimensional latent space of the letters with the v1
and v2 -axes of the right-singular vectors shown. Notice how letters with similar
appearances are quite close together in the latent space such as the O and Q
and the T, I, and J. Right: The directions of the same letters. Also, see the
three-dimensional latent space of letters at the QR code at the bottom of this page.
We don’t need or want the full rank of D, so instead, let’s take a low rank-𝑘 approximation D𝑘 and its singular value decomposition U𝑘𝚺𝑘V𝑘ᵀ. Then the latent space vectors d̂𝑖 are the columns of V𝑘ᵀ. In other words, d̂𝑖 = 𝚺𝑘⁻¹U𝑘ᵀd𝑖. We can similarly map a query vector q to its latent space representation by taking the projection q̂ = 𝚺𝑘⁻¹U𝑘ᵀq. In this case, we look for the column 𝑖 that maximizes
Singular value decomposition breaks a matrix into orthogonal factors that we can
use for rank reduction, such as principal component analysis, or feature extraction,
such as latent semantic indexing. One issue with singular value decomposition is
that the factors contain negative components. For some applications, negative
components may not be meaningful. Another issue is that the factors typically
do not preserve sparsity, a possibly undesirable trait. It will be helpful to have a
method of factorization that maintains nonnegativity and sparsity. Take an 𝑚 × 𝑛
H ← H ∘ (WᵀX) ⊘ (WᵀWH),
where ∘ and ⊘ denote elementwise multiplication and division.
The method is relatively slow to converge and not guaranteed to converge, but its
implementation is simple. Also, we may need to take care to avoid 0/0 if W has
a zero row or H has a zero column.
For a given sparse, nonnegative matrix X, we start with two random, non-
negative matrices W and H. A naïve Julia implementation of nonnegative matrix
factorization using multiplicative updates is
function nmf(X,p=6)
  W = rand(Float64, (size(X,1), p))
  H = rand(Float64, (p,size(X,2)))
  for i in 1:50
    W = W.*(X*H')./(W*(H*H') .+ (W.≈0))   # adding (W.≈0) avoids 0/0
    H = H.*(W'*X)./((W'*W)*H .+ (H.≈0))
  end
  (W,H)   # assumed completion: return both factors
end
To set a stopping criterion, we can look for convergence of the residual ‖X − WH‖²_F, stopping if the change in the residual from one iteration to the next falls below some threshold. Alternatively, we can stop if ‖W ∘ ∇_W 𝐹‖_F + ‖H ∘ ∇_H 𝐹‖_F falls below a threshold or if W or H becomes stationary. To go deeper into nonnegative matrix factorization, see Nicolas Gillis’ article “The Why and How of Nonnegative Matrix Factorization.”
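The first stopping criterion can be added to the multiplicative updates with a few extra lines. The sketch below is one possible arrangement, not a tuned implementation: the function name nmf_tol, the relative tolerance, the iteration cap, and the random test matrix are all assumptions made for illustration.

```julia
using LinearAlgebra, Random

# Multiplicative-update NMF that stops when the relative change in ‖X − WH‖²_F falls below tol.
function nmf_tol(X, p; tol=1e-6, maxiter=5000)
    W = rand(size(X,1), p)
    H = rand(p, size(X,2))
    res = norm(X - W*H)^2
    for i in 1:maxiter
        W = W .* (X*H') ./ (W*(H*H') .+ (W .≈ 0))   # (W .≈ 0) guards against 0/0
        H = H .* (W'*X) ./ ((W'*W)*H .+ (H .≈ 0))
        res_new = norm(X - W*H)^2
        abs(res - res_new) ≤ tol*res && return W, H, i
        res = res_new
    end
    W, H, maxiter
end

Random.seed!(1)                    # fixed seed so that the run is repeatable
X = rand(8, 3) * rand(3, 10)       # hypothetical nonnegative matrix of rank at most 3
W, H, iters = nmf_tol(X, 3)
@show iters
@show norm(X - W*H)/norm(X)        # relative residual
@show all(W .≥ 0) && all(H .≥ 0)   # the updates preserve nonnegativity
```

A relative rather than absolute threshold is used here so that the criterion does not depend on the scale of X.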
3.5 Exercises
\[ \theta = \cos^{-1} \frac{(u, v)}{\|u\| \, \|v\|} \]
to show that the space of monomials {1, 𝑥, 𝑥², . . . , 𝑥ⁿ} is far from orthogonal when 𝑛 is large. Note that the components ℎᵢⱼ = (𝑥ⁱ, 𝑥ʲ) form the Hilbert matrix.
3.4. Fill in the steps missing from the example on page 68. Hint: let Y = AXB+N,
where X and Y are image arrays, A and B are appropriately sized Gaussian blur
(Toeplitz) matrices, and N is a random matrix. The matrix A acts on the columns
of X, and the matrix B acts on the rows of X. b
3.5. The Filippelli problem was contrived to benchmark statistical software
packages. The National Institute of Standards and Technology (NIST) “Filip”
Statistical Reference Dataset7 consists of 82 (𝑦, 𝑥) data points along with a
certified degree-10 polynomial
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝛽2 𝑥 2 + · · · + 𝛽10 𝑥 10 .
and the pseudoinverse to solve the system. What happens if you construct the Vandermonde matrix in increasing versus decreasing order of powers? What happens if you choose different cut-off values for the pseudoinverse? What happens if you standardize the data first by subtracting the respective means and dividing them by their standard deviations? The Filip dataset and certified parameters are available as filip.csv and filip_coeffs.csv at https://github.com/nmfsc/data. b
7 http://www.itl.nist.gov/div898/strd/lls/data/Filip.shtml
The SpecialMatrices.jl function Vandermonde generates a Vandermonde matrix for input (𝑥₀, 𝑥₁, . . . , 𝑥ₙ) with rows given by [1 𝑥ᵢ · · · 𝑥ᵢ^(𝑝−1) 𝑥ᵢ^𝑝].
3.6. The daily temperature of a city over several years can be modeled as a
sinusoidal function. Use linear least squares to develop such a model using
historical data recorded in Washington, DC, between 1967 and 1971, available as
dailytemps.csv from https://github.com/nmfsc/data. Or choose your own data
from https://www.ncdc.noaa.gov. b
3.7. Principal component analysis is often used for dimensionality reduction and
as the basis of the eigenface method of image classification. Suppose that we
want to identify handwritten digits for automatic mail sorting. Variations in the
writing styles would require us to collect a wide range of handwriting samples to
serve as a training set. The MNIST (Modified National Institute of Standards
and Technology) dataset is a mixture of handwriting samples from American
Census Bureau employees and high school students. (LeCun et al. [1998]) Each
image is scaled to a 28×28-pixel grayscale image. For reference, the following
9s are sampled from the dataset:
Use the MLDatasets.jl package to load the MNIST dataset.8 The training set
has sixty-thousand images with labels, and the test set has ten-thousand images
with labels. Sort and reshape the training data to form ten 28² × 𝑛𝑖 matrices D𝑖
using labels 𝑖 = {0, 1, . . . , 9} where 𝑛𝑖 is the number of images for each label.
Now, compute a low-rank SVD of D𝑖 to get V𝑖 . The columns of V𝑖 form the
basis for the low-rank subspace. To classify a new test image d, we find the
subspace to which d is closest by comparing d with its projection V𝑖V𝑖ᵀd.
Check how well your method properly identifies the test images. b
3.8. Apply Zipf’s law to the frequency of surnames in the United States. You can
find a 2010 census data set as surname.csv from https://github.com/nmfsc/data.
8 Alternatively, you can download the training and test data as the MAT-file emnist-mnist.mat from https://www.nist.gov/itl/products-and-services/emnist-dataset or https://github.com/nmfsc/data.
3.9. A model is a simplification of the real world. Take Hollywood actors, for
example. We can characterize actors by the roles they often play. Russell Crowe
is primarily thought of as an action film actor, Ben Stiller as a comedic actor, and
Kate Winslet as a dramatic actor. Others have crossed over between comedic
roles and serious ones. Build an actor similarity model using film genres to
determine which actors are most similar to a given actor. Use the data sets from
exercise 2.9 along with the adjacency matrix movie-genre.csv that relates movies
to genres listed in the file genres.txt. Use cosine similarity to determine which
actors are most similar to a chosen actor. How well does this approach work
in clustering actors? What are the limitations of this model? How does the
dimension of the latent space change the model?
3.10. Multilateration determines an object’s position by
comparing the arrival times of a signal at different monitoring
stations. Multilateration can locate a gunshot in an
urban neighborhood using a network of microphones or
an earthquake’s epicenter using seismographs scattered
around the world. Consider a two-dimensional problem
with 𝑛 stations, each located at (𝑥ᵢ, 𝑦ᵢ) with a time of
arrival 𝑡ᵢ. The source (𝑥, 𝑦) and time of transmission 𝑡
are computed from the intersection of the 𝑛 circles
$$(x - x_i)^2 + (y - y_i)^2 = c^2 (t_i - t)^2$$
for 𝑖 = 1, 2, . . . , 𝑛, where 𝑐 is the signal speed. This system can be solved by first
reducing it to a system of 𝑛 − 1 linear equations. The system will be inconsistent
because of variations in signal speed, the diffraction of waves around objects,
measurement errors, etc. We can use least squares to find a best solution. Take
𝑐 = 1 and determine (𝑥, 𝑦, 𝑡) for the data
(𝑥ᵢ, 𝑦ᵢ, 𝑡ᵢ) = {(3, 3, 12), (1, 15, 14), (10, 2, 13), (12, 15, 14), (0, 11, 12)}.
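One possible way to set up the reduction is sketched below. Subtracting the first circle equation from each of the others cancels the quadratic terms and leaves 2(𝑥ᵢ − 𝑥₁)𝑥 + 2(𝑦ᵢ − 𝑦₁)𝑦 − 2𝑐²(𝑡ᵢ − 𝑡₁)𝑡 = (𝑥ᵢ² + 𝑦ᵢ² − 𝑐²𝑡ᵢ²) − (𝑥₁² + 𝑦₁² − 𝑐²𝑡₁²). The variable names are illustrative, and choosing the first station as the reference is an assumption of this sketch, not a requirement of the exercise.

```julia
using LinearAlgebra

# Station data (xᵢ, yᵢ, tᵢ) from the exercise, with signal speed c = 1.
stations = [3 3 12; 1 15 14; 10 2 13; 12 15 14; 0 11 12]
c = 1.0
x, y, t = stations[:, 1], stations[:, 2], stations[:, 3]

# Subtract the first circle equation from the others to cancel the
# quadratic terms, leaving n − 1 linear equations in (x, y, t).
A = hcat(2 .* (x[2:end] .- x[1]),
         2 .* (y[2:end] .- y[1]),
         -2 .* c^2 .* (t[2:end] .- t[1]))
s = x.^2 .+ y.^2 .- c^2 .* t.^2
b = s[2:end] .- s[1]

# The backslash operator returns the least squares solution of the
# overdetermined system.
p = A \ b
println("estimated (x, y, t) = ", p)
```

The least squares solution satisfies the normal equations, so `A' * (A * p - b)` should vanish up to rounding error.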
3.11. ShotSpotter® is a U.S.-based company that provides gunfire locator services
to law enforcement agencies. The company validated the accuracy of their system
using a series of live fire tests.⁹ A dataset for one round of a 9 mm automatic
handgun with signal-to-noise ratios greater than 12 dB at the reporting sensors
is available as shotspotter.csv at https://github.com/nmfsc/data. The first eleven
rows list the three-dimensional locations of surveyed reporting sensors in meters
along with incident times in seconds. The final row lists the surveyed location
of the handgun. Modify the method from the previous exercise to locate the
handgun using the sensor locations and incident times. Use 328.9 m/s for the
sound speed for an air temperature of −4 ℃.
9 https://github.com/ShotSpotter/research.accuracy-of-gunshot-location
Chapter 4
Computing Eigenvalues
Using pencil and paper, one often finds the eigenvalues of a matrix by determining
the zeros of the characteristic polynomial det(A − 𝜆I). This approach works
for small matrices, but we cannot hope to apply it to a general matrix larger
than 4 × 4. The Abel–Ruffini theorem states that no general solutions
exist to degree-five or higher polynomial equations using a finite number of
algebraic operations (adding, subtracting, multiplying, dividing, and taking an
integer or fractional power). Consequently, there can be no direct method for
determining eigenvalues of a general matrix of rank five or higher. We must find
the eigenvalues using iterative methods instead.
Of course, it is possible to compute the characteristic polynomial and then
use a numerical method such as the Newton–Horner method to approximate
its roots. However, in general, this approach is unstable. In fact, a common
technique to find the roots of a polynomial 𝑝(𝑥) is to generate the companion
matrix of 𝑝(𝑥), i.e., the matrix whose characteristic polynomial is 𝑝(𝑥), and
then determine the eigenvalues of the companion matrix using the QR method
introduced in this chapter.
The Polynomials.jl function roots finds the roots of a polynomial 𝑝(𝑥) by computing
the eigenvalues for the companion matrix of 𝑝(𝑥).
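To make the companion matrix idea concrete, here is a minimal sketch that builds the matrix by hand and uses only the LinearAlgebra standard library. The helper `companion` and its coefficient ordering are assumptions of this sketch, not the Polynomials.jl implementation.

```julia
using LinearAlgebra

# Companion matrix of the monic polynomial
# p(x) = xⁿ + c[n] xⁿ⁻¹ + ⋯ + c[2] x + c[1].
function companion(c)
    n = length(c)
    C = zeros(n, n)
    for i in 1:n-1
        C[i+1, i] = 1       # ones on the subdiagonal
    end
    C[:, n] = -c            # last column holds the negated coefficients
    return C
end

# p(x) = x³ − 6x² + 11x − 6 = (x − 1)(x − 2)(x − 3)
C = companion([-6.0, 11.0, -6.0])
r = eigvals(C)              # the eigenvalues of C are the roots of p
println(r)
```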
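Before discussing methods for determining eigenvalues, let’s look at the estimation
and stability of eigenvalues. Suppose that A is nondefective, with the
diagonalization 𝚲 = S⁻¹AS. Let 𝛿A be some change in A. Then the perturbation
in the eigenvalues is
$$\mathbf{\Lambda} + \delta\mathbf{\Lambda} = \mathbf{S}^{-1}(\mathbf{A} + \delta\mathbf{A})\mathbf{S}$$
and hence
$$\delta\mathbf{\Lambda} = \mathbf{S}^{-1}\,\delta\mathbf{A}\,\mathbf{S}.$$
Taking the matrix norm, we have
$$\|\delta\mathbf{\Lambda}\| \le \|\mathbf{S}^{-1}\|\,\|\mathbf{S}\|\,\|\delta\mathbf{A}\|.$$
For a single eigenvalue 𝜆 with right eigenvector x and left eigenvector y, the
perturbation is, to first order,
$$\delta\lambda = \frac{\mathbf{y}^{\mathsf H}\,\delta\mathbf{A}\,\mathbf{x}}{\mathbf{y}^{\mathsf H}\mathbf{x}},
\qquad\text{so that}\qquad
|\delta\lambda| \le \frac{\|\mathbf{y}\|\,\|\mathbf{x}\|}{|\mathbf{y}^{\mathsf H}\mathbf{x}|}\,\|\delta\mathbf{A}\|.$$
We define the eigenvalue condition number as
$$\kappa(\lambda, \mathbf{A}) = \frac{\|\mathbf{y}\|\,\|\mathbf{x}\|}{|\mathbf{y}^{\mathsf H}\mathbf{x}|},$$
that is, one over the cosine of the angle between the right and left eigenvectors. We
can compute the eigenvalue condition number by using the spectral decomposition
A = S𝚲S⁻¹: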
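using LinearAlgebra
function condeig(A)
    Sr = eigvecs(A)                           # right eigenvectors (unit columns)
    Sl = inv(Sr')                             # left eigenvectors as columns
    Sl ./= sqrt.(sum(abs2.(Sl), dims=1))      # normalize each left eigenvector
    1 ./ abs.(sum(conj.(Sl) .* Sr, dims=1))   # 1/|yᴴx| for each eigenvalue
end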
Eigenvalue sensitivity and estimation 91
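$$\lambda(\mathbf{A}) = \{z \in \mathbb{C} \mid \mathbf{A} - z\mathbf{I} \text{ is singular}\}.$$
$$\lambda_\varepsilon(\mathbf{A}) = \{z \in \mathbb{C} \mid \|(\mathbf{A} - z\mathbf{I})^{-1}\| \ge \varepsilon^{-1}\}.$$
In other words, 𝑧 is an eigenvalue of A + E with ‖E‖ < 𝜀. For the Euclidean norm,
$$\lambda_\varepsilon(\mathbf{A}) = \{z \in \mathbb{C} \mid \sigma_{\min}(\mathbf{A} - z\mathbf{I}) \le \varepsilon\}.$$
$$\lambda(\mathbf{A}) \subseteq S_R = \bigcup_{i=1}^{n} R_i
\quad\text{where}\quad
R_i = \Big\{z \in \mathbb{C} : |z - a_{ii}| \le \sum_{j=1,\, j\ne i}^{n} |a_{ij}|\Big\}.$$
The sets R𝑖 are called Gershgorin circles.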
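$$\mathbf{B}_\lambda = \mathbf{A} - \lambda\mathbf{I} = (\mathbf{D} - \lambda\mathbf{I}) + \mathbf{M},$$
where D holds the diagonal elements of A and M holds the off-diagonal elements.
Because B𝜆 is singular, there is a nonzero vector x with (D − 𝜆I)x = −Mx. If 𝜆 is
not a diagonal element of A, then x = −(D − 𝜆I)⁻¹Mx, and taking the ∞-norm gives
$$1 \le \|(\mathbf{D} - \lambda\mathbf{I})^{-1}\mathbf{M}\|_\infty
= \sup_{1\le i\le n} \sum_{j=1,\, j\ne i}^{n} \frac{|a_{ij}|}{|a_{ii} - \lambda|}.$$
It follows that
$$|a_{ii} - \lambda| \le \sum_{j=1,\, j\ne i}^{n} |a_{ij}|$$
for some 𝑖. Hence, 𝜆 ∈ R𝑖.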
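The theorem is easy to check numerically. The sketch below computes the centers and radii of the row circles and confirms that every eigenvalue falls inside at least one of them. The function name `gershgorin` and the test matrix are hypothetical choices made for illustration.

```julia
using LinearAlgebra

# Centers and radii of the Gershgorin row circles of A.
function gershgorin(A)
    centers = diag(A)
    radii = vec(sum(abs, A, dims=2)) .- abs.(centers)   # off-diagonal row sums
    return centers, radii
end

A = [10.0 2.0 3.0; -1.0 0.0 2.0; 1.0 -2.0 1.0]   # hypothetical test matrix
centers, radii = gershgorin(A)

# For each eigenvalue, test whether it lies in at least one row circle.
in_some_circle = [any(abs.(λ .- centers) .<= radii .+ 1e-12) for λ in eigvals(A)]
println(all(in_some_circle))
```

Calling `gershgorin(transpose(A))` gives the column circles in the same way.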
Example. The Gershgorin circles are plotted for the following matrix:
10 3
A = 1 2 1
0 3
Row circles and column circles are shaded in different colors, and the eigenvalues
lie in their intersections.
The power method 93
method x (𝑘) = Ax (𝑘−1) for an initial guess x (0) . As long as x (0) is not orthogonal
to a dominant eigenvector, the sequence x (0) , x (1) , x (2) , . . . will converge to a
dominant eigenvector.
Because A is diagonalizable, its eigenvectors {x𝑖 } form a basis of C𝑛 , and
every vector x (0) can be represented as a linear combination of the eigenvectors
x (0) = 𝛼1 x1 + 𝛼2 x2 + · · · + 𝛼𝑛 x𝑛 .
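After 𝑘 iterations,
$$\mathbf{A}^k \mathbf{x}^{(0)} = \alpha_1\lambda_1^k \mathbf{x}_1 + \alpha_2\lambda_2^k \mathbf{x}_2 + \cdots + \alpha_n\lambda_n^k \mathbf{x}_n
= \alpha_1\lambda_1^k\Big(\mathbf{x}_1 + \sum_{i=2}^{n} \frac{\alpha_i}{\alpha_1}\Big(\frac{\lambda_i}{\lambda_1}\Big)^{k}\mathbf{x}_i\Big).$$
The power method converges linearly as 𝑂(|𝜆₂/𝜆₁|ᵏ). Let x⁽ᵏ⁾ equal the
normalized 𝑘th iterate and x𝑖 a unit eigenvector. Then the error at the 𝑘th iterate is
$$\big\|\mathbf{x}^{(k)} - \mathbf{x}_1\big\|
= \Big\|\sum_{i=2}^{n} \frac{\alpha_i}{\alpha_1}\Big(\frac{\lambda_i}{\lambda_1}\Big)^{k}\mathbf{x}_i\Big\|
\le \Big|\frac{\lambda_2}{\lambda_1}\Big|^{k}\, \Big\|\sum_{i=2}^{n} \frac{\alpha_i}{\alpha_1}\mathbf{x}_i\Big\|
\le C\,\Big|\frac{\lambda_2}{\lambda_1}\Big|^{k}.$$
For the Euclidean norm, 𝐶 = 𝛼₁⁻¹ (𝛼₂² + 𝛼₃² + ⋯ + 𝛼ₙ²)^{1/2}.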
In practice, the power method is straightforward. We multiply the current
eigenvector approximation by our matrix and then normalize to prevent overflow or
underflow at each iteration. Starting with an initial guess x⁽⁰⁾,
    multiply     x⁽ᵏ⁺¹⁾ = Ax⁽ᵏ⁾
    normalize    x⁽ᵏ⁺¹⁾ ← x⁽ᵏ⁺¹⁾ / ‖x⁽ᵏ⁺¹⁾‖₂
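The two steps translate almost directly into code. The following is a minimal sketch, assuming a fixed iteration count in place of a convergence test. The function name `power_method` and the small symmetric test matrix are illustrative only.

```julia
using LinearAlgebra

function power_method(A, x; iterations=100)
    x = x / norm(x)
    for _ in 1:iterations
        x = A * x          # multiply
        x /= norm(x)       # normalize to prevent overflow or underflow
    end
    return x' * A * x, x   # eigenvalue estimate and unit eigenvector
end

A = [2.0 1.0; 1.0 3.0]
λ, x = power_method(A, [1.0, 0.0])
println(λ)
```

The eigenvalue estimate returned here is the Rayleigh quotient of the final iterate, which is a convenient byproduct of the normalization.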
Player      Sza  Tai  Gli  Kor  Cas  Ros  Cas  Kle   Pts.   S.B.
Szabó        -    1    ½    ½    ½    1    1    1    5½     16½
Taimanov     0    -    1    1    1    ½    1    1    5½     16
Gligoric     ½    0    -    ½    1    1    1    1    5      12½
Kortchnoi    ½    0    ½    -    ½    1    1    1    4½     11½
Casas        ½    0    0    ½    -    1    ½    0    2½     7¾
Rossetto     0    ½    0    0    0    -    ½    1    2      5
Casabella    0    0    0    0    ½    ½    -    ½    1½     3
Klein        0    0    0    0    1    0    ½    -    1½     3¼
Figure 4.2: Crosstable of the 1960 Santa Fe chess tournament with the combined
scores (Pts.) and the tie-breaking Sonneborn–Berger score (S.B.)
player’s tournament score is given by the sum of the scores along the rows.
Mathematically, for a matrix M the combined scores are r = M1, where 1 is
a vector of ones. These scores determine ranking and prize money. But what
happens when there is a tie?
In 1873, Austrian chess master Oscar Gelbfuhs proposed using a weighted
row sum, multiplying each game score by the opponent’s cumulative score. In
this way, a player earns more points in defeating a strong opponent than a weak
one. Take the 1960 Santa Fe tournament (the table above), in which László
Szabó and Mark Taimanov both had combined raw scores of 5½. With Gelbfuhs’
improved scoring system (now called the Sonneborn–Berger system), Szabó
would have
$$1\cdot 5\tfrac12 + \tfrac12\cdot 5 + \tfrac12\cdot 4\tfrac12 + \tfrac12\cdot 2\tfrac12 + 1\cdot 2 + 1\cdot 1\tfrac12 + 1\cdot 1\tfrac12 = 16\tfrac12,$$
and precise” but warned “doubtless nevertheless, an accurate estimate of the relative value of games
would give extraordinary trouble to a chess player without mathematical training; and such an
estimate, even if competently made, would meet with universal distrust."
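The weighted row sum can be written as two matrix-vector products: r = M1 for the combined scores and Mr for the Sonneborn–Berger scores. The sketch below enters the crosstable of Figure 4.2 by hand, with zeros on the diagonal, which is an assumption of this sketch since the table leaves those entries blank.

```julia
# Crosstable of Figure 4.2 with the diagonal set to zero. Rows and columns
# are ordered Szabó, Taimanov, Gligoric, Kortchnoi, Casas, Rossetto,
# Casabella, Klein.
M = [0   1   0.5 0.5 0.5 1   1   1
     0   0   1   1   1   0.5 1   1
     0.5 0   0   0.5 1   1   1   1
     0.5 0   0.5 0   0.5 1   1   1
     0.5 0   0   0.5 0   1   0.5 0
     0   0.5 0   0   0   0   0.5 1
     0   0   0   0   0.5 0.5 0   0.5
     0   0   0   0   1   0   0.5 0]

r  = M * ones(8)    # combined scores (Pts.)
sb = M * r          # Sonneborn–Berger scores (S.B.)
println(hcat(r, sb))
```

The two columns printed should reproduce the Pts. and S.B. columns of the crosstable.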
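[Figure 4.3: graph with nodes 1 through 5; image not reproduced.]
$$\mathbf{H} = \begin{bmatrix}
0 & 0 & 0 & 0 & 1/2\\
1/3 & 0 & 0 & 0 & 0\\
1/3 & 0 & 0 & 0 & 1/2\\
1/3 & 0 & 1/2 & 0 & 0\\
0 & 0 & 1/2 & 1 & 0
\end{bmatrix}$$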
you might not trust a “friend of a friend” as much as you trust a friend.
page and E is an 𝑛 × 𝑛 matrix whose elements all equal 1/𝑛. A typical value for
the damping factor 𝛼 is 0.85. The matrix G is often called the Google matrix.
A nonnegative vector x is a probability vector if its column sum equals one.
The 𝑖th element of x is the probability of being on page 𝑖, and the 𝑖th element
of Gx is the probability of being on page 𝑖 a short time later. Since G is a
stochastic matrix, lim 𝑘→∞ G 𝑘 will converge to a steady-state operator. By
the Perron–Frobenius theorem, the eigenvector corresponding to the dominant
eigenvalue 𝜆 = 1 gives the steady-state probability of being on a given webpage.
This eigenvector is the PageRank.
Because we only need to find the dominant eigenvector, we can use the power
method x⁽ᵏ⁺¹⁾ ← Gx⁽ᵏ⁾. The matrix G is dense and requires 𝑂(𝑛²) operations
for every matrix-vector multiplication, but the matrix H is quite sparse and only
requires 𝑂(𝑛) operations. Let’s reformulate the problem in terms of H. Because
‖x⁽ᵏ⁾‖₁ = 1, it follows that Ex⁽ᵏ⁾ = e. Now we have
The following Julia code computes the PageRank of the graph in Figure 4.3.
H = [0 0 0 0 1; 1 0 0 0 0; 1 0 0 0 1; 1 0 1 0 0; 0 0 1 1 0]
v = all(H.==0,dims=1)         # flags the dangling nodes (zero columns)
H = H ./ (sum(H,dims=1)+v)    # normalize the columns
n = size(H,1)
d = 0.85                      # damping factor
x = ones(n,1)/n
for i in 1:9
    global x = d*(H*x) .+ d/n*(v*x) .+ (1-d)/n
end
The full matrix form is fine for smallish matrices like this one, but for larger
matrices we should use the sparse form. After a few iterations, the solution x
converges to (0.176, 0.097, 0.227, 0.193, 0.307), which we can use to order the
nodes as {5, 3, 4, 1, 2}. If you want to go even deeper, see Vigna [2016] and
Langville and Meyer [2004]. J
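For larger graphs, the same update can be run on a sparse H. The sketch below rebuilds the five-node example from a list of links read off the matrix in the code above and stores H with the SparseArrays standard library. The function name `pagerank`, its keyword arguments, and the fixed iteration count are assumptions of this sketch.

```julia
using SparseArrays

# Links (from → to) read off the matrix H in the code above.
from = [1, 1, 1, 3, 3, 4, 5, 5]
to   = [2, 3, 4, 4, 5, 5, 1, 3]
n = 5

outdegree = zeros(n)
for j in from
    outdegree[j] += 1
end
H = sparse(to, from, 1 ./ outdegree[from], n, n)   # column-normalized link matrix
dangling = outdegree .== 0                          # pages with no outgoing links

function pagerank(H, dangling; α=0.85, iterations=100)
    n = size(H, 1)
    x = fill(1/n, n)
    for _ in 1:iterations
        # Sparse product plus the uniform contribution from dangling
        # pages and from random jumps.
        x = α * (H * x) .+ (α * sum(x[dangling]) + (1 - α)) / n
    end
    return x
end

x = pagerank(H, dangling)
println(round.(x, digits=3))
```

Only the nonzero entries of H are stored and multiplied, so each iteration costs a number of operations proportional to the number of links.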
The power method itself only gets us the dominant eigenvector. Getting just this
one may be enough for some problems, but often we may need more of them or
all of them. We can extend the power method to get other eigenvectors.
Suppose that the spectrum of A is
𝜆(A) = {𝜆 1 , 𝜆2 , · · · , 𝜆 𝑛−1 , 𝜆 𝑛 }
Figure 4.4: Power method and some of its variations: the power method, inverse
power method, shifted power method, Rayleigh quotient method, naïve QR method,
and QR method.
with the same associated eigenspace. We can get the eigenvector associated with
the eigenvalue of A closest to 0 by applying the power method to A−1 . The
convergence ratio is 𝑂 (|𝜆 𝑛 /𝜆 𝑛−1 |).
So, now we have the largest and the smallest eigenvalues. What about the
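rest of them? Recall that for any value 𝜌, the spectrum of A − 𝜌I is
$$\lambda(\mathbf{A} - \rho\mathbf{I}) = \{\lambda_1 - \rho, \lambda_2 - \rho, \ldots, \lambda_n - \rho\}.$$
Applying the inverse power method to A − 𝜌I gives the shift-and-invert iteration
$$(\mathbf{A} - \rho\mathbf{I})\,\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)},
\quad\text{where}\quad
\mathbf{x}^{(k)} \leftarrow \frac{\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|},$$
which converges to the eigenvector whose eigenvalue is closest to 𝜌.
Computation requires ⅔𝑛³ flops for LU-decomposition and then only 2𝑛² flops
for each iteration.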
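The cost split between one factorization and many cheap solves is visible in code. This is a minimal sketch, assuming a fixed shift and a fixed number of iterations. The function name `shift_invert` and the test matrix are illustrative.

```julia
using LinearAlgebra

function shift_invert(A, ρ, x; iterations=50)
    F = lu(A - ρ*I)              # factor once: about (2/3)n³ flops
    for _ in 1:iterations
        x = F \ (x / norm(x))    # each solve reuses the factors: about 2n² flops
    end
    x /= norm(x)
    return x' * A * x, x
end

A = [2.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 4.0]   # eigenvalues 3 − √3, 3, 3 + √3
λ, x = shift_invert(A, 2.8, ones(3))
println(λ)
```

With the shift 𝜌 = 2.8, the iteration homes in on the eigenvalue 3, the one nearest the shift.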
Figure 4.5: Rayleigh quotient 𝜌A (x)x for the matrices in Figure 1.1. Eigenvectors
are depicted with thick line segments.
The convergence rate for the shift-and-invert method depends on having a good
initial guess for the shift 𝜌. It seems reasonable that we could improve the method
by updating the shift 𝜌 at each iteration. One way we can do this is by using the
eigenvalue equation Ax = 𝜌x. Unless (𝜌, x) is already an eigenvalue-eigenvector
pair of A, this equation is inconsistent. Instead, we will find the “best” 𝜌. Let’s
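determine which 𝜌 minimizes the 2-norm of the residual r = Ax − 𝜌x.
At the minimum,
$$0 = \frac{\mathrm{d}}{\mathrm{d}\rho}\|\mathbf{r}\|_2^2
= \frac{\mathrm{d}}{\mathrm{d}\rho}\|\mathbf{A}\mathbf{x} - \rho\mathbf{x}\|_2^2
= -2\mathbf{x}^{\mathsf H}\mathbf{A}\mathbf{x} + 2\rho\,\mathbf{x}^{\mathsf H}\mathbf{x}.$$
Therefore, the “best” choice for 𝜌 is
$$\rho_{\mathbf{A}}(\mathbf{x}) = \frac{\mathbf{x}^{\mathsf H}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\mathsf H}\mathbf{x}}.$$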
This number, called the Rayleigh quotient, can be thought of as an eigenvalue
approximation. The Courant–Fischer theorem, also known as the min-max
theorem, establishes bounds on the Rayleigh quotient.
Figure 4.6: Basins of attraction for Rayleigh iteration in the (𝜑, 𝜃)-plane. Initial
guesses in each shaded region converge to one of the eigenvectors at ①, ②, and ③.
See the front cover and the QR code at the bottom of this page.
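xᴴAx = Σᵢ 𝜆ᵢ𝑥ᵢ² (summing over 𝑖 = 1, . . . , 𝑛) and 𝑆ₖ = span{𝝃₁, 𝝃₂, . . . , 𝝃ₖ}, the first 𝑘 standard basis
vectors of ℝⁿ. If x ∈ 𝑆ₖ, then
$$\mathbf{x}^{\mathsf H}\mathbf{A}\mathbf{x} = \sum_{i=1}^{n} \lambda_i x_i^2 = \sum_{i=1}^{k} \lambda_i x_i^2 \ge \lambda_k \sum_{i=1}^{k} x_i^2 = \lambda_k \|\mathbf{x}\|_2^2.$$
$$\mathbf{x}^{\mathsf H}\mathbf{A}\mathbf{x} = \sum_{i=1}^{n} \lambda_i x_i^2 \le \lambda_1 \sum_{i=1}^{n} x_i^2 = \lambda_1 \|\mathbf{x}\|_2^2$$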
The QR method 101
    calculate    𝜌⁽ᵏ⁾ = x⁽ᵏ⁾ᴴ A x⁽ᵏ⁾
    solve        (A − 𝜌⁽ᵏ⁾I) x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾
    normalize    x⁽ᵏ⁺¹⁾ ← x⁽ᵏ⁺¹⁾ / ‖x⁽ᵏ⁺¹⁾‖₂
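The three steps above might be coded as follows. This is a sketch under a few assumptions: the stopping rule uses the residual norm, the factorization is allowed to fail quietly when the shift lands exactly on an eigenvalue, and the function name `rayleigh_iteration` and the test matrix are illustrative.

```julia
using LinearAlgebra

function rayleigh_iteration(A, x; tol=1e-12, maxiter=50)
    x = x / norm(x)
    ρ = x' * A * x                      # Rayleigh quotient (x has unit length)
    for _ in 1:maxiter
        norm(A*x - ρ*x) < tol && break  # stop once (ρ, x) is nearly an eigenpair
        F = lu(A - ρ*I; check=false)    # A − ρI is close to singular near convergence
        issuccess(F) || break           # an exactly singular shift means ρ is an eigenvalue
        x = F \ x
        x /= norm(x)
        ρ = x' * A * x
    end
    return ρ, x
end

A = [2.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 4.0]   # symmetric test matrix
ρ, x = rayleigh_iteration(A, ones(3))
println(ρ)
```

Unlike the shift-and-invert method, the matrix must be refactored at every step because the shift changes, which is the price paid for the faster convergence.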
Shortly after publishing his results, he left the field and never looked back. It wasn’t until 2007 that
Francis was tracked down and learned of his achievements. By then, he was retired and living in a
seaside resort on the southeast coast of England, sailing and working on a degree in mathematics.
While Francis had briefly attended the University of Cambridge in the 1950s, he never completed a
degree. He subsequently received an honorary doctorate from the University of Sussex in 2015.
It’s not yet clear what this equation means. Let’s change the basis to Qₖ. We
simply need to “rotate” by Qₖ⁻¹ = Qₖᵀ to do so. Multiplying (4.2) by Qₖᵀ yields
$$\mathbf{Q}_k^{\mathsf T}\mathbf{A}\mathbf{Q}_k = \mathbf{Q}_k^{\mathsf T}\mathbf{Q}_{k+1}\mathbf{R}_{k+1}.$$
This equation says that QₖᵀQₖ₊₁Rₖ₊₁ is unitarily similar to A. So QₖᵀQₖ₊₁Rₖ₊₁
has the same eigenvalues as A. If the iteration converges, then Qₖ → Qₖ₊₁ and
QₖᵀQₖ₊₁ → I, and we are left with Rₖ₊₁ on the left-hand side. The matrix Rₖ₊₁
is an upper triangular matrix with the eigenvalues along the diagonal, giving us
the Schur decomposition of A.
We are not done yet. Define Q̂ₖ₊₁ to be the change from Qₖ to Qₖ₊₁. That is,
let Qₖ₊₁ = QₖQ̂ₖ₊₁. Then
$$\hat{\mathbf{Q}}_{k+1}\mathbf{R}_{k+1} = \mathbf{Q}_k^{\mathsf T}\mathbf{Q}_{k+1}\mathbf{R}_{k+1} = \hat{\mathbf{M}}_{k+1},$$
where M̂ₖ₊₁ = QₖᵀMₖ₊₁. From (4.3) we have Rₖ₊₁Qₖᵀ = Qₖ₊₁ᵀA, and hence
RₖQₖ₋₁ᵀ = QₖᵀA. So,
$$\hat{\mathbf{M}}_{k+1} = \mathbf{Q}_k^{\mathsf T}\mathbf{A}\mathbf{Q}_k = \mathbf{R}_k\mathbf{Q}_{k-1}^{\mathsf T}\mathbf{Q}_k = \mathbf{R}_k\hat{\mathbf{Q}}_k.$$
We are left with the simple implementation. Starting with T₀ = A, for each
iteration,
    factor    Tₖ → QₖRₖ
    form      Tₖ₊₁ ← RₖQₖ
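The factor-and-form loop is only a few lines of Julia. The sketch below assumes a fixed number of iterations and a symmetric test matrix with well-separated eigenvalues, for which the unshifted iteration converges. The function name `qr_method` is illustrative.

```julia
using LinearAlgebra

function qr_method(A; iterations=200)
    T = Matrix{Float64}(A)
    for _ in 1:iterations
        Q, R = qr(T)           # factor  Tₖ → QₖRₖ
        T = R * Matrix(Q)      # form    Tₖ₊₁ ← RₖQₖ
    end
    return T
end

A = [2.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 4.0]
T = qr_method(A)
println(diag(T))               # approximate eigenvalues of A
```

Each new iterate equals QₖᵀTₖQₖ, so the eigenvalues are preserved while the entries below the diagonal decay.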
Because the QR method uses the power method, convergence is generally slow.
The diagonal elements of Rₖ converge to the eigenvalues linearly. Let A ∈ ℝ^{𝑛×𝑛}
have eigenvalues such that |𝜆₁| > |𝜆₂| > ⋯ > |𝜆ₙ|. Then
$$\lim_{k\to\infty} \mathbf{R}_k = \begin{bmatrix}
\lambda_1 & r_{12} & \cdots & r_{1n}\\
0 & \ddots & \ddots & \vdots\\
\vdots & \ddots & \ddots & \vdots\\
0 & \cdots & 0 & \lambda_n
\end{bmatrix}$$
with convergence
$$\big|r^{(k)}_{i,i-1}\big| = O\!\left(\left|\frac{\lambda_i}{\lambda_{i-1}}\right|^{k}\right).$$
Each QR decomposition takes (4/3)𝑛³ operations, and each matrix multiplication
takes ½𝑛³ operations. We need to repeat it 𝑂(𝑛) times to get the 𝑛 eigenvalues.
Hence, the net cost of the QR method is 𝑂(𝑛⁴). We can improve the QR algorithm
by making a few simple modifications: reducing the matrix to upper Hessenberg
form, shifting the matrix in each iteration to speed up convergence, and deflating.
Continuing this way, we will form an upper Hessenberg matrix unitarily similar
to A. Since it is similar, it has the same eigenvalues. If A is a Hermitian matrix,
the upper Hessenberg matrix is a tridiagonal matrix.
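The LinearAlgebra standard library provides `hessenberg` for this reduction. The short sketch below applies it to a hypothetical symmetric matrix and checks the claims in the paragraph above: the result is similar to A, it is upper Hessenberg, and in the symmetric case it is numerically tridiagonal.

```julia
using LinearAlgebra

A = [4.0 1.0 2.0 3.0; 1.0 3.0 0.0 1.0; 2.0 0.0 2.0 1.0; 3.0 1.0 1.0 1.0]
F = hessenberg(A)
H, Q = Matrix(F.H), Matrix(F.Q)   # A = Q H Qᵀ with H upper Hessenberg

println(norm(Q * H * Q' - A))     # similarity holds up to rounding error
println(norm(triu(H, 2)))         # near zero: H is tridiagonal because A is symmetric
```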
Shifted QR iteration
Once 𝑎ᵢᵢ⁽ᵏ⁾ has converged to an eigenvalue, move to the previous diagonal element
(𝑖 → 𝑖 − 1) and repeat the process. Note that
$$\mathbf{A}_{k+1} = \mathbf{R}_{k+1}\mathbf{Q}_{k+1} + \rho\mathbf{I}
= \mathbf{Q}_{k+1}^{\mathsf T}(\mathbf{A}_k - \rho\mathbf{I})\mathbf{Q}_{k+1} + \rho\mathbf{I}
= \mathbf{Q}_{k+1}^{\mathsf T}\mathbf{A}_k\mathbf{Q}_{k+1}.$$
So, the new Aₖ₊₁ has the same eigenvalues as the old Aₖ.
The convergence with each shift is 𝑂(|𝜆ᵢ − 𝜌|/|𝜆ᵢ₋₁ − 𝜌|) where 𝜌 ≈ 𝜆ᵢ.
Overall, we get quadratic convergence. We get cubic convergence for Hermitian
matrices, where the eigenvectors are orthogonal. This type of shifting is called
Rayleigh quotient shifting. See the figure on the following page or the QR code
below, which shows the convergence of the QR method and convergence of the
shifted QR method to eigenvalues of a 6 × 6 complex matrix. The standard QR
method takes 344 iterations to converge to an error of 10−3 , while the shifted
QR method requires only 16 iterations. Notice that convergence is slower for
eigenvalues whose magnitudes are close.
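A compact sketch of Rayleigh quotient shifting with deflation is given below. It assumes a real symmetric matrix, so that deflating the last row also decouples the last column, and it skips the Hessenberg reduction for brevity. The function name `shifted_qr`, the tolerance, and the iteration cap are illustrative choices.

```julia
using LinearAlgebra

function shifted_qr(A; tol=1e-12, maxiter=1000)
    T = Matrix{Float64}(A)
    n = size(T, 1)
    for m in n:-1:2                          # deflate from the bottom up
        for _ in 1:maxiter
            norm(T[m, 1:m-1]) < tol && break # T[m, m] has converged
            ρ = T[m, m]                      # Rayleigh quotient shift
            Q, R = qr(T[1:m, 1:m] - ρ*I)
            T[1:m, 1:m] = R * Matrix(Q) + ρ*I
        end
    end
    return diag(T)
end

A = [2.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 4.0]   # symmetric test matrix
λ = shifted_qr(A)
println(λ)
```

The iteration cap matters because, as discussed next, this choice of shift does not converge for every matrix.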
Figure 4.7: Convergence of the QR method (left, 344 iterations) and the shifted QR
method (right, 16 iterations) to eigenvalues of a 6 × 6 complex matrix. The eigenvectors
are depicted by • and line segments show the paths of convergence.
Occasionally, Rayleigh quotient shifting will fail to converge. For example,
using 𝜌 = 2 on the matrix
$$\mathbf{A} = \begin{bmatrix} 2 & 1\\ 1 & 2 \end{bmatrix}$$
gives the shifted matrix
$$\mathbf{A} - \rho\mathbf{I} = \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix},$$
which has eigenvalues ±1, each of equal magnitude. Each eigenvector pulls
equally, and the algorithm cannot decide which eigenvalue to go to. One fix to
this problem is to use the Wilkinson shift. The Wilkinson shift defines 𝜌 to be
the eigenvalue of the submatrix
$$\begin{bmatrix}
a^{(k-1)}_{i-1,i-1} & a^{(k-1)}_{i-1,i}\\
a^{(k-1)}_{i,i-1} & a^{(k-1)}_{i,i}
\end{bmatrix}$$
that is closest to 𝑎ᵢ,ᵢ⁽ᵏ⁻¹⁾. The eigenvalues of a 2 × 2 matrix are easily computed
by using the quadratic formula to solve the characteristic equation.
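The quadratic formula computation might look like the following sketch. The function name `wilkinson_shift` and its four scalar arguments (the entries of the trailing 2 × 2 block, read row by row) are assumptions made here for illustration. The result is returned as a complex number so that a negative discriminant is handled.

```julia
# Wilkinson shift: the eigenvalue of the trailing 2×2 block [a b; c d]
# that is closest to d, found from the characteristic equation
# μ² − (a + d)μ + (ad − bc) = 0.
function wilkinson_shift(a, b, c, d)
    τ, Δ = a + d, a*d - b*c            # trace and determinant
    s = sqrt(complex(τ^2 - 4Δ))        # complex() allows a negative discriminant
    μ₁, μ₂ = (τ + s)/2, (τ - s)/2
    return abs(μ₁ - d) <= abs(μ₂ - d) ? μ₁ : μ₂
end

ρ = wilkinson_shift(4.0, 1.0, 1.0, 3.0)
println(ρ)
```

For the matrix in the example above, both eigenvalues are equally close to 𝑎₂₂ = 2, so either one may be returned. Either choice moves the shift off 2 and breaks the tie.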
4.4 Implicit QR
In the next step, we need a Givens rotation Q₂ᵀ to zero out 𝑎₃₂. Exactly the same
Q₂ᵀ works for both A and A − 𝜌I. The same is true for Q₃ᵀ and so on.
All we need to do is start the implicit QR step using the same column q as we
would have used to start the explicit QR method and then continue by reducing
the matrix into upper Hessenberg form. The resulting upper Hessenberg matrix
using the implicit method will equal the upper Hessenberg matrix using the
explicit method. This idea is closely related to the Implicit Q theorem.
1. Let H̃ = H − 𝜌I.
3. Compute H̃ ← QT H̃Q.
By this point, the QR method bears little resemblance to the underlying power
method. But just as we could speed up the convergence of the power method
by taking multiple steps (A2 x versus Ax, for example), we can speed up the QR
method using multiple steps. Namely, we can take (H − 𝜌1 I) (H − 𝜌2 I) · · · (H −
𝜌 𝑘 I) in one QR step. In practice, 𝑘 is typically 1, 2, 4, or 6. Note that the bulge
we need to chase is size 𝑘.
Taking multiple steps is especially appealing when working with real matrices
with complex eigenvalues because we can avoid complex arithmetic (cutting
computation by half). We can formulate a double-step QR method by taking
𝜌₁ = 𝜌 and 𝜌₂ = 𝜌̄. In this case, Ã = H² − 2 Re(𝜌)H + |𝜌|²I.
The LinearAlgebra.jl function eigen returns an Eigen object containing eigenvalues
and eigenvectors. The function eigvecs returns a matrix of eigenvectors and eigvals
returns an array of eigenvalues.
We still have not explicitly computed the eigenvectors using the QR method.
While we could keep track of the rotator or reflector matrices Q𝑖 as they
accumulate, this approach is inefficient. Instead, computing the eigenvectors
after finding the eigenvalues is typically a simple task. This section looks at two
methods: using a shifted power method and using the Schur form.
In the shifted power method, we start again from the upper Hessenberg matrix
form of the matrix H = QᵀAQ. Let 𝜌 be our approximation to an eigenvalue.
Then using the shifted power method
$$(\mathbf{H} - \rho\mathbf{I})\,\mathbf{z}^{(k+1)} = \mathbf{q}^{(k)}
\quad\text{with}\quad
\mathbf{q}^{(k)} = \frac{\mathbf{z}^{(k)}}{\|\mathbf{z}^{(k)}\|}$$
will return an approximation to the associated eigenvector q of H. Only one
iteration is typically needed when 𝜌 is a good approximation. The eigenvector of
A is then Qq.
We can use an ultimate shift strategy to get the eigenvectors using the Schur
form QH AQ = T without accumulating the matrix Q. First, we use the QR
method to find only the eigenvalues. Once we have the approximate eigenvalues,
we rerun the QR method using these eigenvalues as shifts and accumulate the
matrix Q, using two steps per shift to ensure convergence. If the matrix is
symmetric, T is diagonal, and Q gives us the eigenvectors.
Let’s look at the nonsymmetric case. Suppose that 𝜆 is the 𝑘th eigenvalue of
the upper triangular matrix
$$\mathbf{T} = \begin{bmatrix}
\mathbf{T}_{11} & \mathbf{v} & \mathbf{T}_{13}\\
0 & \lambda & \mathbf{w}^{\mathsf T}\\
0 & 0 & \mathbf{T}_{33}
\end{bmatrix},$$
where T₁₁ is a (𝑘−1)×(𝑘−1) upper triangular matrix and T₃₃ is an (𝑛−𝑘)×(𝑛−𝑘)
upper triangular matrix. Furthermore, suppose that 𝜆 is a simple eigenvalue
so that 𝜆 ∉ 𝜆(T₁₁) ∪ 𝜆(T₃₃). Then the eigenvector problem (T − 𝜆I)y = 0 can be
written as
$$\begin{aligned}
(\mathbf{T}_{11} - \lambda\mathbf{I}_{k-1})\mathbf{y}_{k-1} + \mathbf{v}y + \mathbf{T}_{13}\mathbf{y}_{n-k} &= 0\\
\mathbf{w}^{\mathsf T}\mathbf{y}_{n-k} &= 0\\
(\mathbf{T}_{33} - \lambda\mathbf{I}_{n-k})\mathbf{y}_{n-k} &= 0,
\end{aligned}$$
where y = (yₖ₋₁, 𝑦, yₙ₋ₖ). Because 𝜆 is simple, T₁₁ − 𝜆Iₖ₋₁ and T₃₃ − 𝜆Iₙ₋ₖ
are both nonsingular. So, yₙ₋ₖ = 0 and yₖ₋₁ = −𝑦(T₁₁ − 𝜆Iₖ₋₁)⁻¹v. Since 𝑦 is
arbitrary, we will set it to one. Then
$$\mathbf{y} = \begin{bmatrix} -(\mathbf{T}_{11} - \lambda\mathbf{I}_{k-1})^{-1}\mathbf{v}\\ 1\\ 0 \end{bmatrix},$$
which can be evaluated using backward substitution. The eigenvector for A is
then x = Qy.
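The back-substitution step can be tried out with the `schur` function from LinearAlgebra, which returns the factors T and Q as the fields `T` and `Z`. The sketch below assumes a real matrix with real, distinct eigenvalues so that T is truly upper triangular. The helper name `schur_eigvec` and the test matrix are hypothetical.

```julia
using LinearAlgebra

# Eigenvector of A = Q T Qᵀ for the k-th diagonal element of the upper
# triangular matrix T, using y = (−(T₁₁ − λI)⁻¹v, 1, 0) and x = Qy.
function schur_eigvec(T, Q, k)
    λ = T[k, k]
    y = zeros(eltype(T), size(T, 1))
    y[k] = 1
    if k > 1
        T11 = T[1:k-1, 1:k-1]
        v = T[1:k-1, k]
        y[1:k-1] = -(UpperTriangular(T11 - λ*I) \ v)   # backward substitution
    end
    return Q * y
end

A = [4.0 1.0 0.0; 2.0 3.0 0.0; 1.0 1.0 6.0]   # eigenvalues 2, 5, and 6
F = schur(A)
residuals = [norm(A * schur_eigvec(F.T, F.Z, k) - F.T[k, k] * schur_eigvec(F.T, F.Z, k))
             for k in 1:3]
println(residuals)
```

Wrapping the shifted block in `UpperTriangular` tells Julia to solve by backward substitution instead of forming a general factorization.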
problem in this space, and finally express the solutions in the canonical basis.
That is, take R = VᵀAV, solve Rvᵢ = 𝜆̃ᵢvᵢ, and finally calculate x̃ᵢ = Vvᵢ. We still
need a way to find a subspace 𝑉 close to an eigenspace of A. An effective way to
do this is by constructing a Krylov subspace. A Rayleigh–Ritz method that uses
such a subspace is called an Arnoldi method. The Arnoldi method combines
the power method and the Gram–Schmidt process to get the first 𝑚 eigenvectors.
The power method preserves the sparsity of the matrix. Still, it only gives
us one eigenvalue at a time because it throws away information about the other
eigenvectors by only keeping the most recent vector. What if we were to keep
a history of the vectors at each iteration? An initial guess q can be expressed
in terms of an eigenvector basis {v1 , v2 , . . . , v𝑛 }. Let’s repeatedly multiply this
initial guess by A:
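$$\begin{aligned}
\mathbf{q} &= c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_n\mathbf{v}_n\\
\mathbf{A}\mathbf{q} &= c_1\lambda_1\mathbf{v}_1 + c_2\lambda_2\mathbf{v}_2 + \cdots + c_n\lambda_n\mathbf{v}_n\\
\mathbf{A}^2\mathbf{q} &= c_1\lambda_1^2\mathbf{v}_1 + c_2\lambda_2^2\mathbf{v}_2 + \cdots + c_n\lambda_n^2\mathbf{v}_n\\
&\;\;\vdots\\
\mathbf{A}^{k-1}\mathbf{q} &= c_1\lambda_1^{k-1}\mathbf{v}_1 + c_2\lambda_2^{k-1}\mathbf{v}_2 + \cdots + c_n\lambda_n^{k-1}\mathbf{v}_n,
\end{aligned}$$
where 𝑘 is bigger than 𝑚 but much smaller than 𝑛. For |𝜆₁| > |𝜆₂| > ⋯ > |𝜆ₙ|,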
we see that (or at least hope that) the subspace
span{q, Aq, A2 q, . . . , A 𝑘−1 q}
is close to the subspace spanned by the first 𝑚 eigenvectors
span{v1 , v2 , v3 , . . . , v 𝑘 }.
The Krylov subspace K𝑘 (A, q) is the subspace spanned by
q, Aq, A2 q, . . . , A 𝑘−1 q.
These vectors are the columns of the Krylov matrix K 𝑘 . This 𝑘-dimensional
Krylov subspace is much smaller than the original 𝑛-dimensional subspace,
so it is unlikely to contain our desired eigenvectors. Instead of finding the
true eigenvector-eigenvalue pairs in all of R𝑛 , we will instead find approximate
eigenvector-eigenvalue pairs in the Krylov subspace K𝑘 (A, q).
The 𝑘-dimensional Krylov subspace K𝑘 ⊂ R𝑛 does not contain the 𝑚-
dimensional eigenspace 𝑉—only its projection into K𝑘 . If the starting vector q
happens to be already an eigenvector of A, then the Krylov subspace will only
consist of the space spanned by q. This is typically not an issue because any
vector chosen at random will likely contain components of all eigenvectors of
A. By taking a larger Krylov subspace K𝑘 ∗ with 𝑘 ∗ > 𝑘, we can get a better
approximation at the cost of more computing time:
[Figure: the eigenspace 𝑉 in ℝⁿ and its projections onto the Krylov subspaces Kₖ and Kₖ∗.]
$$\mathbf{q}_{j+1} = \mathbf{A}\mathbf{q}_j - \sum_{i=1}^{j} \mathbf{q}_i h_{ij},$$
where ℎᵢⱼ is the Gram–Schmidt coefficient ℎᵢⱼ = ⟨Aqⱼ, qᵢ⟩. In practice, we
implement this step using the modified Gram–Schmidt with reorthogonalization.
Complete the step by normalizing qⱼ₊₁ ← qⱼ₊₁/ℎⱼ₊₁,ⱼ where ℎⱼ₊₁,ⱼ = ‖qⱼ₊₁‖₂.
The Arnoldi method can be written as the following algorithm,
    choose q₁ with ‖q₁‖₂ = 1
    for 𝑗 = 1, 2, . . . , 𝑘
        qⱼ₊₁ ← Aqⱼ
        for 𝑖 = 1, 2, . . . , 𝑗
            ℎᵢⱼ ← ⟨qⱼ₊₁, qᵢ⟩
            qⱼ₊₁ ← qⱼ₊₁ − ℎᵢⱼqᵢ
        ℎⱼ₊₁,ⱼ ← ‖qⱼ₊₁‖₂
        qⱼ₊₁ ← qⱼ₊₁/ℎⱼ₊₁,ⱼ
After 𝑘 steps, the iteration can be written in matrix form as
$$\mathbf{A}\mathbf{Q}_k = \mathbf{Q}_k\mathbf{H}_k + \mathbf{q}_{k+1}\begin{bmatrix} 0 & \cdots & 0 & h_{k+1,k} \end{bmatrix}. \tag{4.6}$$
Because the columns of Qₖ are orthogonal to qₖ₊₁, multiplying (4.6) by Qₖᵀ zeros out the last
term and leaves us with QₖᵀAQₖ = Hₖ. This means that in the projection, the
upper Hessenberg matrix Hₖ has the same eigenvalues as A, i.e., the Ritz values.
More precisely, if (x, 𝜇) is an eigenpair of Hₖ, then v = Qₖx satisfies
$$\|\mathbf{A}\mathbf{v} - \mu\mathbf{v}\|_2 = \|\mathbf{A}\mathbf{Q}_k\mathbf{x} - \mathbf{Q}_k\mathbf{H}_k\mathbf{x}\|_2 = h_{k+1,k}\,|x_k|,$$
where 𝑥ₖ is the last element of x.
The residual norm ℎₖ₊₁,ₖ|𝑥ₖ| is called the Ritz estimate. The Ritz estimate is
small if (𝜇, v) is a good approximation to an eigenpair of A. See the figure on
the next page and the QR code at the bottom of this page.
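A bare-bones Arnoldi iteration, together with the Ritz estimates, might look like the sketch below. It applies the modified Gram–Schmidt sweep twice as a simple form of reorthogonalization and assumes that no breakdown occurs (the new vector never has zero norm). The function name `arnoldi` and the tridiagonal test matrix are hypothetical.

```julia
using LinearAlgebra

function arnoldi(A, q, k)
    n = length(q)
    Q = zeros(n, k + 1)
    H = zeros(k + 1, k)
    Q[:, 1] = q / norm(q)
    for j in 1:k
        w = A * Q[:, j]
        for pass in 1:2, i in 1:j        # modified Gram–Schmidt, applied twice
            h = dot(Q[:, i], w)
            H[i, j] += h
            w -= h * Q[:, i]
        end
        H[j+1, j] = norm(w)              # assumes no breakdown (norm(w) > 0)
        Q[:, j+1] = w / H[j+1, j]
    end
    return Q, H
end

A = Matrix(SymTridiagonal(collect(1.0:30), ones(29)))   # hypothetical test matrix
k = 10
Q, H = arnoldi(A, ones(30), k)

μ, X = eigen(H[1:k, 1:k])                        # Ritz values and eigenvectors of Hₖ
ritz_estimates = H[k+1, k] .* abs.(X[k, :])      # hₖ₊₁,ₖ |xₖ| for each Ritz pair
println(ritz_estimates)
```

The function returns the (𝑘+1) × 𝑘 matrix with ℎₖ₊₁,ₖ in its last row, so that A times the first 𝑘 columns of Q equals Q times H, which is equation (4.6) in compact form.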
It is mathematically intuitive that the success of Arnoldi iteration depends
on choosing a good starting vector q. A vector chosen at random likely has
significant components in all eigenvector directions, not just the first 𝑘. So the
Krylov subspace is not a great match. In practice, how do we get a good starting
vector q for the Krylov subspace?
One solution is to run the Arnoldi method to get a good starting vector and
restart using this better guess. This method is called the implicitly restarted
Arnoldi method. Suppose that we want to get the 𝑚 eigenvectors corresponding to
the largest eigenvalues. Let’s take the Arnoldi subspace to have 𝑘 = 𝑗 + 𝑚 ≈ 2𝑚
dimensions. First, we run the Arnoldi method to get 𝑘 Arnoldi vectors Q 𝑘 and H 𝑘 .
Then, we suppress the components of the eigenvectors of 𝜆 𝑚+1 , 𝜆 𝑚+2 , . . . , 𝜆 𝑘 in
our 𝑘-dimensional Arnoldi subspace. To do this, we run 𝑗 steps of the shifted
QR method, one step for each of the 𝑗 eigenvectors we are trying to filter out.
Figure 4.8: The location of the Ritz values of K₄₀(A, q) and the eigenvalues of A
in the complex plane. The matrix A is a 144 × 144 complex-valued tridiagonal
matrix and q is a random initial guess. Notice that the Ritz values are closer to
eigenvalues with large magnitudes.
For each shift, we use the approximate Ritz value {𝜇ₘ₊₁, . . . , 𝜇ₖ}. We take the
accumulated unitary matrix Vₖ from the 𝑗 QR steps to change the basis to our
Arnoldi subspace. Namely,
    H ← VₖᵀHₖVₖ
    Q ← QₖVₖ
The first 𝑚 columns of Q are our new first 𝑚 Arnoldi vectors. We zero out the
last 𝑗 Arnoldi vectors (and the last 𝑗 rows and columns of H) and find new ones
by running the Arnoldi method starting with qₘ₊₁. We continue like this until
We can use the Lanczos method of finding eigenvalues to find the approximate
singular values and singular vectors of a large, sparse matrix. If u is a left singular
vector of A and v is a right singular vector of A, then combined, they are the
eigenvectors of the symmetric block matrix
$$\underbrace{\begin{bmatrix} 0 & \mathbf{A}\\ \mathbf{A}^{\mathsf T} & 0 \end{bmatrix}}_{\mathbf{M}}
\begin{bmatrix} \mathbf{u}\\ \mathbf{v} \end{bmatrix}
= \lambda \begin{bmatrix} \mathbf{u}\\ \mathbf{v} \end{bmatrix}.$$
The matrix M is Hermitian, so we can apply the Lanczos method to get the
largest eigenvalues.
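A quick numerical check of this connection is sketched below with a small dense matrix, using `eigvals` in place of a Lanczos solver. The test matrix is hypothetical. The positive eigenvalues of M should match the singular values of A.

```julia
using LinearAlgebra

A = [3.0 0.0; 4.0 5.0; 0.0 1.0]
m, n = size(A)
M = [zeros(m, m) A; A' zeros(n, n)]      # symmetric block matrix

σ = svdvals(A)                           # singular values, largest first
λ = eigvals(Symmetric(M))                # eigenvalues, smallest first
println(σ)
println(λ)
```

The eigenvalues of M come in pairs ±𝜎ᵢ, padded with zeros when A is not square.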
The randomized SVD algorithm, which uses the general Rayleigh–Ritz
method, is another technique for computing an approximate, low-rank SVD
of a large matrix. We start by choosing a rank-𝑘 approximation to the
column space of A with an orthonormal basis: A ≈ QQᵀA. The 𝑘 columns of
Q are chosen at random and orthogonalized using Gram–Schmidt projections,
Householder reflections, or Givens rotations. Then we compute the SVD of
QᵀA as usual. So,
$$\mathbf{A} \approx \mathbf{Q}\mathbf{Q}^{\mathsf T}\mathbf{A} = \mathbf{Q}\big(\tilde{\mathbf{U}}\mathbf{\Sigma}\mathbf{V}^{\mathsf T}\big) = (\mathbf{Q}\tilde{\mathbf{U}})\mathbf{\Sigma}\mathbf{V}^{\mathsf T}.$$
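A minimal sketch of the idea follows. It makes one specific choice that the text leaves open: the random columns are obtained by multiplying A by a Gaussian test matrix and then orthogonalizing with a QR factorization. The function name `randomized_svd`, the `rng` keyword, and the test matrix are assumptions of this sketch.

```julia
using LinearAlgebra, Random

function randomized_svd(A, k; rng=Random.default_rng())
    Ω = randn(rng, size(A, 2), k)      # Gaussian test matrix (an assumption of this sketch)
    Q = Matrix(qr(A * Ω).Q)            # orthonormal basis for the sampled column space
    F = svd(Q' * A)                    # SVD of the small k × n matrix QᵀA
    return Q * F.U, F.S, F.V
end

rng = MersenneTwister(1)
B = randn(rng, 20, 3) * randn(rng, 3, 15)     # hypothetical matrix of exact rank 3
U, S, V = randomized_svd(B, 3; rng=rng)
println(norm(U * Diagonal(S) * V' - B))
```

Because B has exact rank 3, a rank-3 sample captures its whole column space and the reconstruction error is at the level of rounding. For a general matrix the error depends on how quickly the singular values decay.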
4.8 Exercises
4.1. Consider the 𝑛 × 𝑛 matrix A, where the elements are normally distributed
random numbers with variance 1. Conjecture about the distribution of eigenvalues
𝜆(A) for arbitrary 𝑛.
4.2. A square matrix is diagonally dominant if the magnitude of the diagonal
element of each row is greater than the sum of the magnitudes of all other
elements in that row. Use the Gershgorin circle theorem to prove that a diagonally
dominant matrix is always invertible.
For small and medium-sized matrices, Gaussian elimination is fast. Even for
a 1000 × 1000 matrix, Gaussian elimination takes less than a second on my
exceptionally ordinary laptop. Now, suppose we want to solve a partial differential
equation in three dimensions. We make a discrete approximation to the problem
using 100 grid points in each dimension. Such a problem requires that we solve
a system of a million equations. Now, Gaussian elimination would take over
thirty years on my laptop. Fortunately, such a system is sparse—only 0.001
percent of the elements in the matrix are nonzero. Gaussian elimination is still
fast if the matrix has a narrow bandwidth or can be reordered to have narrow
bandwidth. But for a matrix with no redeeming features other than its sparsity, a
better approach is an iterative method. Iterative methods use repeated matrix
multiplications in the place of matrix inversion. Gaussian elimination takes
𝑂(𝑛³) operations. Matrix-vector multiplication takes only 𝑂(𝑛²) operations and
as little as 𝑂 (𝑛) operations if the matrix is sparse. An iterative method will beat
a direct method as long as the number of iterations needed for convergence is
much smaller than 𝑛.
There are other reasons to use an iterative method. While a direct method
must always solve each problem from scratch with zero knowledge, we can
give an iterative method a good initial guess. Take a time-dependent partial
differential equation. At each new time step, the solution may not change much
from the solution computed at the previous time step. Using a direct solver with
Cholesky decomposition and Cuthill–McKee reordering to avoid fill in, we only
need 𝑂 (𝑛2 ) operations to compute the solution. On the other hand, an iterative
method can use the solution from the previous step as a good initial guess for the
new time step, and it automatically preserves sparsity.
Sometimes we may not need a solution as accurate as the one provided
by a direct solver. For example, a finite difference approximation to a partial
120 Iterative Methods for Linear Systems
20 21 22 23 24
15 16 17 18 19
10 11 12 13 14
5 6 7 8 9
0 1 2 3 4
Figure 5.1: The finite difference stencil for the discrete two-dimensional Laplacian
on a 5 × 5 mesh. By labeling the points in lexicon fashion, we can construct a
25 × 25 block tridiagonal matrix.
with Dirichlet boundary conditions 𝑢(𝑥, 𝑦) = 0 on the unit square. The two-
dimensional Poisson equation models several physical systems, such as the
steady-state heat distribution 𝑢(𝑥, 𝑦) given a source 𝑓 (𝑥, 𝑦) or the shape of an
elastic membrane 𝑧 = 𝑢(𝑥, 𝑦) given a load 𝑓 (𝑥, 𝑦).
The finite difference method is a simple yet effective numerical method for
solving the Poisson equation. Consider partitioning a square domain into 𝑛
intervals each of length ℎ. For the unit square, we would take ℎ = (𝑛−1)⁻¹. Using
Taylor series approximation, the discrete Laplacian operator can be represented as
$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} \approx
\frac{4u_{ij} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1}}{h^2}$$
at each grid point (𝑥ᵢ, 𝑦ⱼ). Hence we have a linear system of 𝑛² equations Ax = b.
The matrix A is a sparse, block-tridiagonal matrix with only 5𝑛² nonzero entries
out of 𝑛⁴ elements. See Figure 5.1 on the facing page.
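One common way to assemble this matrix is with Kronecker products of the one-dimensional second-difference matrix, sketched below using the SparseArrays standard library. The function name `laplacian2d` is hypothetical, and the factor 1/ℎ² is left out so that the stencil weights 4 and −1 are visible.

```julia
using SparseArrays, LinearAlgebra

function laplacian2d(n)
    # One-dimensional second-difference matrix tridiag(−1, 2, −1).
    T = spdiagm(-1 => fill(-1.0, n-1), 0 => fill(2.0, n), 1 => fill(-1.0, n-1))
    Id = sparse(1.0I, n, n)
    return kron(Id, T) + kron(T, Id)     # n² × n² block-tridiagonal matrix
end

A = laplacian2d(5)
println(size(A), "  ", nnz(A))
```

For the 5 × 5 mesh of Figure 5.1, this gives a 25 × 25 matrix. Rows for boundary points have fewer than five nonzero entries, so the total count falls a little below 5𝑛².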
The elements of the unknown x appear on both sides of the equation. One way
to solve this problem is by updating 𝑥ᵢ iteratively as
$$x_i^{(k+1)} = \frac{1}{a_{ii}}\Big(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\Big). \tag{5.1}$$
In this method, called the Jacobi method, each element of x is updated independently,
which makes vectorization easy.
Another idea is using the newest and best approximations available, overwriting
the elements of x. At each iteration, we compute
$$x_i^{(k+1)} = \frac{1}{a_{ii}}\Big(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\Big). \tag{5.2}$$
This approach is the Gauss–Seidel method. Unlike the Jacobi method, which
needs to store two copies of x, the Gauss–Seidel method allows us to use the same
array for each iteration, overwriting each element sequentially. Each iteration of
the Gauss–Seidel algorithm looks like this:
    for 𝑖 = 1, 2, . . . , 𝑛
        𝑥ᵢ ← (𝑏ᵢ − Σ_{𝑗=1}^{𝑖−1} 𝑎ᵢⱼ𝑥ⱼ − Σ_{𝑗=𝑖+1}^{𝑛} 𝑎ᵢⱼ𝑥ⱼ)/𝑎ᵢᵢ
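Both updates are sketched below as plain loops so that the difference between them is easy to see: the Jacobi step writes into a second array, while the Gauss–Seidel step overwrites x in place. The function names, the fixed iteration count, and the small diagonally dominant test system are illustrative assumptions.

```julia
using LinearAlgebra

# One Jacobi sweep, equation (5.1): every update uses the old x.
function jacobi_step(A, b, x)
    xnew = similar(x)
    for i in eachindex(x)
        s = sum(A[i, j] * x[j] for j in eachindex(x) if j != i)
        xnew[i] = (b[i] - s) / A[i, i]
    end
    return xnew
end

# One Gauss–Seidel sweep, equation (5.2): x is overwritten in place, so
# elements j < i already hold their new values.
function gauss_seidel_step!(A, b, x)
    for i in eachindex(x)
        s = sum(A[i, j] * x[j] for j in eachindex(x) if j != i)
        x[i] = (b[i] - s) / A[i, i]
    end
    return x
end

function iterate_method(step, A, b; iterations=100)
    x = zeros(length(b))
    for _ in 1:iterations
        x = step(A, b, x)
    end
    return x
end

A = [4.0 -1.0 0.0; -1.0 4.0 -1.0; 0.0 -1.0 4.0]
b = [1.0, 2.0, 3.0]
x_jacobi = iterate_method(jacobi_step, A, b)
x_gs = iterate_method(gauss_seidel_step!, A, b)
println(x_jacobi, "  ", x_gs)
```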
At this point, we should ask two questions. When do the Jacobi and Gauss–
Seidel methods converge? And how quickly does each converge? To answer these
questions, let’s look at the Jacobi and Gauss–Seidel methods more generally.
The Jacobi and Gauss–Seidel methods are examples of linear iterative methods.
Consider using the splitting A = P − N for some P and N. In this case, the linear
system Ax = b is the same as
Px = Nx + b. (5.3)
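$$\mathbf{e}^{(0)} = c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_n\mathbf{v}_n.$$
$$\mathbf{e}^{(k)} = \big(\mathbf{P}^{-1}\mathbf{N}\big)^{k}\mathbf{e}^{(0)} = c_1\lambda_1^k\mathbf{v}_1 + c_2\lambda_2^k\mathbf{v}_2 + \cdots + c_n\lambda_n^k\mathbf{v}_n.$$
It follows that
$$\|\mathbf{e}^{(k)}\| \le \rho(\mathbf{P}^{-1}\mathbf{N})^{k}\,\big(|c_1|\,\|\mathbf{v}_1\| + |c_2|\,\|\mathbf{v}_2\| + \cdots + |c_n|\,\|\mathbf{v}_n\|\big).$$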
So the method converges if the spectral radius 𝜌(P−1 N) is less than one. On the
other hand, if P−1 N has an eigenvalue with magnitude greater than or equal to one
and the initial error e (0) has a component in the direction of the corresponding
eigenvector, the error will never converge to zero. The error will grow if the
magnitude of the eigenvalue is greater than one. We can summarize the discussion above with the
following theorem.
Theorem 15. The iteration Px (𝑘+1) = Nx (𝑘) + b converges if and only if the
spectral radius 𝜌(P−1 N) < 1.
Theorem 16. If a matrix is diagonally dominant, then the Jacobi and Gauss–
Seidel methods converge.
Proof. From theorem 5, all induced matrix norms are bounded below by the
spectral radius. Hence, for the Jacobi method,
$$\rho(\mathbf{D}^{-1}\mathbf{M}) \le \|\mathbf{D}^{-1}\mathbf{M}\|_\infty = \max_i \sum_{j\ne i}\left|\frac{a_{ij}}{a_{ii}}\right| < 1.$$
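So the Jacobi method converges. Put another way, a method $\mathbf{P}\mathbf{e}^{(k+1)} = \mathbf{N}\mathbf{e}^{(k)}$
converges if it is a contraction mapping on the error. For the Jacobi method
$a_{ii}e_i^{(k+1)} = -\sum_{j\ne i} a_{ij}e_j^{(k)}$, from which
$$\|\mathbf{e}^{(k+1)}\|_\infty \le \frac{\sum_{j\ne i}|a_{ij}|}{|a_{ii}|}\,\|\mathbf{e}^{(k)}\|_\infty,$$
where $i$ is the index of the largest-magnitude component of $\mathbf{e}^{(k+1)}$.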
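Intuitively, the Gauss–Seidel method should converge even faster because it has even
more terms in the denominator. But, because some of the terms might be negative,
it’s not completely obvious. So, let’s be explicit:
$$\|\mathbf{e}^{(k+1)}\|_\infty \le \frac{\sum_{j>i}|a_{ij}|}{|a_{ii}| - \sum_{j<i}|a_{ij}|}\,\|\mathbf{e}^{(k)}\|_\infty < \frac{|a_{ii}| - \sum_{j<i}|a_{ij}|}{|a_{ii}| - \sum_{j<i}|a_{ij}|}\,\|\mathbf{e}^{(k)}\|_\infty = \|\mathbf{e}^{(k)}\|_\infty,$$
where $i$ is the row at which $|e_i^{(k+1)}|$ is largest.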
Example. Let’s examine the convergence rate of the Jacobi and Gauss–Seidel
methods on the 𝑛 × 𝑛 discrete Laplacian
$$\mathbf{A} = \begin{bmatrix} -2 & 1 & & \\ 1 & -2 & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & -2 \end{bmatrix}. \tag{5.5}$$
Let’s start with the Jacobi method. The spectral radius of $\mathbf{D}^{-1}(\mathbf{L}+\mathbf{U})$ is
$\cos\frac{\pi}{n+1}$, which is approximately $1 - \frac12\pi^2(n+1)^{-2}$ when $n$ is large. This
value gives us the convergence factor of the Jacobi method. The asymptotic
convergence rate (again for large $n$) is about $\frac12\pi^2(n+1)^{-2}$, using the approximation
$\log(1+z) \approx z - \frac12 z^2 + \cdots$.
The spectral radius of $(\mathbf{L}+\mathbf{D})^{-1}\mathbf{U}$ is $\cos^2\frac{\pi}{n+1}$. So, the convergence factor of
the Gauss–Seidel method is approximately $1 - \pi^2(n+1)^{-2}$ when $n$ is large. The
asymptotic convergence rate of the Gauss–Seidel method is $\pi^2(n+1)^{-2}$, twice
that of the Jacobi method.
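These spectral radii are easy to check numerically. The following is a small sketch, assuming dense matrices and the standard-library `eigvals`; the helper name `spectral_radius` is ours.

```julia
using LinearAlgebra

n = 20
A = Matrix(SymTridiagonal(fill(-2.0, n), fill(1.0, n - 1)))   # discrete Laplacian (5.5)
D = Diagonal(diag(A))
L = tril(A, -1)      # strictly lower triangular part
U = triu(A, 1)       # strictly upper triangular part

# Largest eigenvalue magnitude of an iteration matrix
spectral_radius(M) = maximum(abs.(eigvals(M)))

ρ_jacobi = spectral_radius(D \ (L + U))
ρ_gs = spectral_radius((L + D) \ U)

# Number of iterations needed to shrink the error by a factor of 100
iters_jacobi = log(0.01) / log(ρ_jacobi)
iters_gs = log(0.01) / log(ρ_gs)
```

The sign of the iteration matrix does not affect the magnitudes of its eigenvalues, so we drop it here. Because the Gauss–Seidel radius is the square of the Jacobi radius, the iteration count is halved.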
If A is a smallish 20 × 20 matrix, then the convergence factor for the Jacobi
method is 0.99. It is 0.98 for the Gauss–Seidel method. The asymptotic
convergence rates are 0.01 and 0.02, respectively. Hence, it may conceivably
take roughly 460 iterations and 230 iterations, respectively, to get the error to
one-hundredth of the initial error.1 Pretty lousy, especially when naïve Gaussian
elimination gives an exact solution in less than one-tenth of the time. For a larger
1 The missing arithmetic is log 0.01/log 0.99 ≈ 458 and log 0.01/log 0.98 ≈ 228.
Successive over relaxation 125
Figure 5.2: Left: eigenvalues of the weighted Jacobi method for the 20 × 20
discrete Laplacian (5.5) for different weights 𝜔. Right: root mean squared values
of the eigenvalues as a function of 𝜔.
matrix, it’s much worse. If we double the number of grid points, we’ll need to
quadruple the number of iterations to achieve the same accuracy. What can we
do to speed up convergence?
The preconditioners in the Jacobi and Gauss–Seidel methods are easy to invert,
but both approaches converge slowly. Our goal in this section will be to find a
splitting $\mathbf{A} = \mathbf{P} - \mathbf{N}$ that quickly shrinks the error $\mathbf{e}^{(k)} = (\mathbf{P}^{-1}\mathbf{N})^k\mathbf{e}^{(0)}$. The
initial error $\mathbf{e}^{(0)}$ is a linear combination of the eigenvectors of $\mathbf{P}^{-1}\mathbf{N}$. Each of
its components is scaled by its respective eigenvalue at every iteration. A
component with a small eigenvalue will swiftly fade away. A component whose
eigenvalue has magnitude near one will linger like a drunken, boorish guest after a party. If
we happen to split the matrix A in a way that leaves the spectral radius of $\mathbf{P}^{-1}\mathbf{N}$
greater than one, then we’ll have an unstable method. While the best splitting
will be problem-specific, we can think of a good splitting as one that minimizes
the magnitudes of all eigenvalues. In the next section, we’ll look at a method that systematically
targets groups of the eigenvalues at a time.
Let’s start by modifying the Jacobi method. Perhaps we can speed things up
by weighting the preconditioner P−1 = 𝜔D−1 . This approach is called a weighted
Jacobi method with the splitting
$$\mathbf{A} = \mathbf{P} - \mathbf{N} = \omega^{-1}\mathbf{D} + \left(\mathbf{A} - \omega^{-1}\mathbf{D}\right).$$
Figure 5.3: The spectral radius 𝜌(P−1 N) of the weighted Jacobi and SOR methods
for the 20 × 20 discrete Laplacian as functions of the relaxation factor 𝜔.
    for i = 1, 2, . . . , n
        x* ← (b_i − Σ_{j=1}^{i−1} a_ij x_j − Σ_{j=i+1}^{n} a_ij x_j) / a_ii
        δ ← x* − x_i
        x_i ← x_i + ωδ
The constant 𝜔 is called the relaxation factor. The SOR algorithm simply
computes the step size 𝛿 determined by the Gauss–Seidel algorithm and scales
that step size by 𝜔. When 𝜔 = 1, the SOR method is simply the Gauss–Seidel
method. In general, the SOR method has the splitting
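$$\mathbf{A} = \mathbf{P} - \mathbf{N} = \left(\mathbf{L} + \omega^{-1}\mathbf{D}\right) + \left((1 - \omega^{-1})\mathbf{D} + \mathbf{U}\right).$$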
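The SOR sweep translates almost line for line into code. Here is a minimal sketch; the name `sor_sweep!`, the test system, and the choice ω = 1.2 are illustrative assumptions rather than recommended values.

```julia
# One SOR sweep, overwriting x in place. With ω = 1 this reduces to Gauss–Seidel.
function sor_sweep!(x, A, b, ω)
    n = length(b)
    for i in 1:n
        s = b[i]
        for j in 1:n
            j != i && (s -= A[i, j] * x[j])
        end
        δ = s / A[i, i] - x[i]   # step that Gauss–Seidel would take for component i
        x[i] += ω * δ            # scale that step by the relaxation factor
    end
    return x
end

# Small symmetric, positive-definite system (illustrative values); A * [1, 1, 1] = b
A = [4.0 1.0 0.0; 1.0 4.0 1.0; 0.0 1.0 4.0]
b = [5.0, 6.0, 5.0]
x = zeros(3)
for k in 1:100
    sor_sweep!(x, A, b, 1.2)
end
```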
To design a method that works well on big matrices, we need to contend better with
the slowly decaying, low-frequency eigenvector components. The solution to the
Poisson equation takes so many iterations to converge because the finite difference
stencil for the discrete Laplacian only allows mesh points to communicate with
neighboring mesh points. Information is shared across adjacent mesh points
in the five-point stencil shown in Figure 5.1. A domain with 𝑛 mesh points in
each dimension requires 𝑛/2 iterations to pass information from one side to the
other. As a result, convergence is slow. Of course, we could use a larger stencil.
Information would travel more quickly across the domain, but we are adding
more nonzero elements to our sparse matrix.
Another idea is to use a coarser mesh. For an 𝑛 × 𝑛 matrix, an optimal SOR
method needs 𝑂 (𝑛) iterations for convergence. The suboptimal Gauss–Seidel
and Jacobi methods need $O(n^2)$. So, reducing the size of our system will have a
tremendous impact on reducing the number of iterations. But, the solution would
not be as accurate with a coarser mesh.
The multigrid method attempts to get both speed and accuracy by iterating on
different grid refinements, using a coarse grid to speed up convergence and a fine
grid to get higher accuracy. The multigrid method embodies the following idea:
• solve the problem on a coarse mesh to kill off the residual from the
low-frequency components;
• use the solution on the coarse mesh as a starting guess for the solution on
the fine mesh (and vice versa).
[Figure: mesh restriction and mesh interpolation]
3. Solve (iterate a few times) $\mathbf{A}_{2h}\mathbf{e}_{2h} = \mathbf{r}_{2h}$ where $\mathbf{e}_{2h} = \mathbf{u}_h - \mathbf{u}_{2h}$. Not
surprisingly, $\mathbf{A}_{2h} = \mathbf{R}_h^{2h}\mathbf{A}_h$. Because the size of the matrix has changed,
we may also need to adjust the relaxation factor 𝜔 accordingly.
4. Interpolate the error $\mathbf{e}_{2h}$ back to the fine grid by taking $\mathbf{e}_h = \mathbf{I}_{2h}^{h}\mathbf{e}_{2h}$ where
the interpolation is defined as
2 1 T
1 2 1
1 1 2 1
= .
5. Add eℎ to uℎ .
6. Iterate Aℎ uℎ = bℎ .
Steps 1–6 make up a single cycle multigrid. In practice, one typically uses
multiple cycles, restricting to coarser and coarser meshes, before interpolating
back. Or, one might start multigrid on a coarse grid. Here are different cycle
designs used to restrict and interpolate between an original mesh ℎ and a coarser
mesh 16ℎ:
[Figure: single cycle, V cycle, and full multigrid cycle]
Example. Let’s compare the Jacobi, Gauss–Seidel, SOR, and multigrid methods
on a model problem. Take the Poisson equation $u'' = f(x)$ with zero boundary
conditions, where the source function is a combination of derivatives of the Dirac
delta distribution $f(x) = \delta'(x + \frac12) - \delta'(x - \frac12)$. The solution is a rectangle
function $u(x) = 1$ for $x \in (-\frac12, +\frac12)$ and $u(x) = 0$ otherwise. We’ll take $n = 129$
grid points and choose the relaxation factor 𝜔 optimal for the SOR and multigrid
methods. For the multigrid method, use a V-cycle with two restrictions and two
interpolations and compute one SOR iteration at each step. Finally, take an initial
guess of $u(x) = \frac12$. Figure 5.4 plots the solution. Notice that
the SOR and multigrid methods converge relatively quickly, while the Jacobi
and Gauss–Seidel methods still have significant errors after many iterations and
convergence slows down.
Figure 5.4: Iterative solutions to the Poisson equation converging to the exact
solution, the rectangle function. Snapshots are taken every twentieth iteration for
a total of 160 iterations. Also, see the QR code at the bottom of this page.
One of the difficulties of the SOR method is that you need to know the optimal
relaxation parameter 𝜔. A better approach for symmetric, positive-definite
matrices is recasting the problem as a minimization problem. In Chapter 3,
we solved an overdetermined system as an equivalent minimization problem.
And in the last chapter, we approximated the eigenvalues of sparse matrices by
finding the solution that minimized the error in the Krylov subspace. The Poisson
equation $-\Delta u = f$ has an equivalent formulation of finding the solution $u$ that
minimizes the energy functional $\int_\Omega \frac12|\nabla u|^2 - uf \,\mathrm{d}V$. This section examines how
to frame an iterative method as a minimization problem.
Solving the minimization problem 131
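$$\begin{aligned}
\Phi(\mathbf{y}) - \Phi(\mathbf{x}) &= \tfrac12\mathbf{y}^T\mathbf{A}\mathbf{y} - \mathbf{y}^T\mathbf{b} - \tfrac12\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{x}^T\mathbf{b} \\
&= \tfrac12\mathbf{y}^T\mathbf{A}\mathbf{y} - \mathbf{y}^T\mathbf{A}\mathbf{x} - \tfrac12\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{x}^T\mathbf{A}\mathbf{x} \\
&= \tfrac12\mathbf{y}^T\mathbf{A}\mathbf{y} - \tfrac12\mathbf{y}^T\mathbf{A}\mathbf{x} - \tfrac12\mathbf{y}^T\mathbf{A}\mathbf{x} + \tfrac12\mathbf{x}^T\mathbf{A}\mathbf{x} \\
&= \tfrac12\mathbf{y}^T\mathbf{A}\mathbf{y} - \tfrac12\mathbf{y}^T\mathbf{A}\mathbf{x} - \tfrac12\mathbf{x}^T\mathbf{A}\mathbf{y} + \tfrac12\mathbf{x}^T\mathbf{A}\mathbf{x} \\
&= \tfrac12(\mathbf{y} - \mathbf{x})^T\mathbf{A}(\mathbf{y} - \mathbf{x}) > 0
\end{aligned}$$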
The negative gradient is simply the residual. So we can rewrite the 𝑘th subsequent
estimate (5.6) as
x (𝑘+1) = x (𝑘) + 𝛼 𝑘 r (𝑘) . (5.7)
For now, consider the more general iterative method (5.6) for an arbitrary direction
vector p (𝑘) , keeping in mind that p (𝑘) = r (𝑘) for the method of gradient descent.
Let’s find 𝛼 𝑘 . Since the evaluation of Ax (𝑘) to get the residual can require a lot
of computer processing, we try to reduce the number of times we compute an
update to the direction. We will only change direction when we are closest to
the solution x along our current direction. A vector x is said to be optimal with
respect to a nonzero direction p if Φ(x) ≤ Φ(x + 𝛾p) for all 𝛾 ∈ R. That is, x
is optimal with respect to p if Φ increases as we move away from x along the
direction p. Put another way, x is optimal with respect to a direction p if the
directional derivative of Φ at x in the direction p is zero, i.e., ∇Φ(x) · p = 0. If
x is optimal with respect to every direction in a vector space 𝑉, we say that x is
optimal with respect to 𝑉. Looking at (5.7), we want to know when $\mathbf{x}^{(k+1)}$ is
optimal with respect to a direction $\mathbf{r}^{(k)}$.
Because ∇Φ(x (𝑘+1) ) = −r (𝑘+1) , it follows that x (𝑘+1) is optimal with respect
to p (𝑘) —in the direction of r (𝑘) —if and only if r (𝑘+1) · p (𝑘) = 0, i.e., if and only
if r (𝑘+1) is orthogonal to p (𝑘) .
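$$\begin{aligned}
\mathbf{r}^{(k+1)} &= \mathbf{b} - \mathbf{A}\mathbf{x}^{(k+1)} \\
&= \mathbf{b} - \mathbf{A}\left(\mathbf{x}^{(k)} + \alpha_k\mathbf{p}^{(k)}\right) \\
&= \mathbf{b} - \mathbf{A}\mathbf{x}^{(k)} - \alpha_k\mathbf{A}\mathbf{p}^{(k)} \\
&= \mathbf{r}^{(k)} - \alpha_k\mathbf{A}\mathbf{p}^{(k)}.
\end{aligned} \tag{5.8}$$
When $\mathbf{r}^{(k+1)} \perp \mathbf{p}^{(k)}$:
$$0 = \mathbf{p}^{(k)T}\mathbf{r}^{(k+1)} = \mathbf{p}^{(k)T}\mathbf{r}^{(k)} - \alpha_k\,\mathbf{p}^{(k)T}\mathbf{A}\mathbf{p}^{(k)}.$$
Therefore, $\mathbf{x}^{(k+1)}$ is optimal with respect to $\mathbf{p}^{(k)}$ when
$$\alpha_k = \frac{\mathbf{p}^{(k)T}\mathbf{r}^{(k)}}{\mathbf{p}^{(k)T}\mathbf{A}\mathbf{p}^{(k)}}.$$
By taking $\mathbf{p}^{(k)} = \mathbf{r}^{(k)}$ (for the gradient descent method), we have
$$\alpha_k = \frac{\mathbf{r}^{(k)T}\mathbf{r}^{(k)}}{\mathbf{r}^{(k)T}\mathbf{A}\mathbf{r}^{(k)}}.$$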
At this new vector x (𝑘+1) = x (𝑘) + 𝛼 𝑘 r (𝑘) , we change directions, again taking
the steepest descent in the direction of the negative gradient −∇Φ(x (𝑘+1) ). We
continue like this until either we find the minimum x or we are close enough.
That’s the idea. Let’s write it out as an algorithm:
make an initial guess x
for 𝑘 = 0, 1, 2, . . .
r ← b − Ax
if ‖r‖ < tolerance, then we’re done
𝛼 ← rT r/rT Ar
x ← x + 𝛼r
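The pseudocode above maps directly to Julia. This is a minimal sketch for a symmetric, positive-definite matrix; the function name `gradient_descent`, the keyword defaults, and the 2 × 2 test matrix are our own choices for illustration.

```julia
using LinearAlgebra

function gradient_descent(A, b; x = zeros(length(b)), tol = 1e-10, maxiter = 10_000)
    for k in 1:maxiter
        r = b - A * x                      # residual, which is also the negative gradient
        norm(r) < tol && return x          # stop once the residual is small enough
        α = dot(r, r) / dot(r, A * r)      # optimal step length along r
        x = x + α * r
    end
    return x
end

# Small symmetric, positive-definite example (illustrative values)
A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
x = gradient_descent(A, b)
```

Note that each pass recomputes `A * x` and `A * r`, so this version costs two matrix-vector products per iteration; updating the residual with (5.8) would save one of them.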
Consider taking the direction p (𝑘) successively along each coordinate axis 𝝃 𝑖 .
For example, if A is 3 × 3, we take
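$$\mathbf{p}^{(1)} = \boldsymbol{\xi}_1 = \begin{bmatrix}1\\0\\0\end{bmatrix},\quad \mathbf{p}^{(2)} = \boldsymbol{\xi}_2 = \begin{bmatrix}0\\1\\0\end{bmatrix},\quad \mathbf{p}^{(3)} = \boldsymbol{\xi}_3 = \begin{bmatrix}0\\0\\1\end{bmatrix},\quad \mathbf{p}^{(4)} = \boldsymbol{\xi}_1, \text{ and so on.}$$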
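Then for an 𝑛-dimensional space
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \frac{\boldsymbol{\xi}_i^T\mathbf{r}^{(k)}}{\boldsymbol{\xi}_i^T\mathbf{A}\boldsymbol{\xi}_i}\,\boldsymbol{\xi}_i \quad\text{with } i = k \ (\mathrm{mod}\ n) + 1.$$
We can rewrite this expression in component form by first noting that
$$\boldsymbol{\xi}_i^T\mathbf{r}^{(k)} = b_i - \sum_{j=1}^{n} a_{ij}x_j^{(k)} \quad\text{and}\quad \boldsymbol{\xi}_i^T\mathbf{A}\boldsymbol{\xi}_i = a_{ii}$$
and then noting that $\mathbf{x}^{(k)} + \alpha_k\boldsymbol{\xi}_i$ only updates the 𝑖th component of $\mathbf{x}^{(k)}$. In other words,
$$\begin{aligned}
x_i^{(k+1)} &= x_i^{(k)} + \frac{1}{a_{ii}}\left(-\sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i}^{n} a_{ij}x_j^{(k)} + b_i\right) \\
&= \frac{1}{a_{ii}}\left(-\sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)} + b_i\right),
\end{aligned}$$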
which is simply the Gauss–Seidel method (5.2). Depending on the shape of our
“bowl,” it may be advantageous to take a shorter or longer stepsize than optimal.
By taking our directions successively along the coordinate axes with the scaling
factor 0 < 𝜔 < 2, we have the SOR method. See Figure 5.6.
Figure 5.6: Directions taken by different iterative methods to get to the solution.
Panels: Gauss–Seidel, SOR, steepest descent, and conjugate gradient.
Locally, gradient descent gives the best choice of directions. Globally, it doesn’t—
especially when the bowl is very eccentric, i.e., when the matrix’s condition
number is large. So the gradient descent method doesn’t always work well in
practice. We can improve it by taking globally optimal directions, not necessarily
along the residuals.
What went wrong with the gradient method? Each new search direction p (𝑘)
was determined using only information from the most recent search direction
p (𝑘−1) . Let x (𝑘+1) = x (𝑘) + 𝛼 𝑘 p (𝑘) for some direction p (𝑘) , and suppose that
x (𝑘+1) is optimal with respect to p (𝑘) —just as we did in developing the gradient
method. So p (𝑘) ⊥ ∇Φ(x (𝑘+1) ) or equivalently p (𝑘) ⊥ r (𝑘+1) . Let’s impose
the further condition that x (𝑘+1) is also optimal with respect to p ( 𝑗) for all
𝑗 = 0, 1, . . . , 𝑘. That is, r (𝑘+1) ⊥ p ( 𝑗) for all 𝑗 = 0, 1, . . . , 𝑘.
What does this new optimality condition mean? If the position x (𝑘+1)
is optimal with respect to each of the directions p (𝑘) , p (𝑘−1) , . . . , p (0) , then
x (𝑘+1) is optimal with respect to the entire span{p (𝑘) , p (𝑘−1) , . . . , p (0) } by the
linearity of the gradient. As we iterate, the subspace spanned by the directions
p (𝑘) , p (𝑘−1) , . . . , p (0) will continue to grow as long as each new direction
is linearly independent. That sounds pretty good. Let’s find the directions
p (𝑘) , p (𝑘−1) , . . . , p (0) and show that they are linearly independent.
From the updated residual, we have
As before, we’ll take the first direction equal to the gradient at x (0) , i.e., we’ll
take p (0) = r (0) . Then each subsequent direction will be A-conjugate to this
initial gradient. Hence, the name conjugate gradient method.
It still makes sense to take the directions as close to the residuals as possible
yet still mutually conjugate. Take
It may seem almost magical that by choosing p (𝑘+1) with explicit dependence
only on p (𝑘) , we ensure that p (𝑘+1) is A-conjugate to all p ( 𝑗) for 𝑗 = 0, 1, 2, . . . , 𝑘.
Let’s see why it works. (We’ll better clarify this mathematical sleight-of-hand
when discussing Krylov subspaces in the next section.) If we multiply (5.10) by
Ap (𝑘) and enforce (5.11) when 𝑗 = 𝑘, then it follows that
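$$\beta_k = \frac{\mathbf{r}^{(k+1)T}\mathbf{A}\mathbf{p}^{(k)}}{\mathbf{p}^{(k)T}\mathbf{A}\mathbf{p}^{(k)}} = \frac{\langle\mathbf{p}^{(k)}, \mathbf{r}^{(k+1)}\rangle_{\mathbf{A}}}{\langle\mathbf{p}^{(k)}, \mathbf{p}^{(k)}\rangle_{\mathbf{A}}}.$$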
Now, let’s show that this choice for 𝛽 𝑘 ensures that (5.11) holds for all 𝑗 < 𝑘. Let
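$$V_k = \operatorname{span}\left\{\mathbf{p}^{(0)}, \mathbf{p}^{(1)}, \ldots, \mathbf{p}^{(k-1)}\right\}.$$
So, $\mathbf{A}\mathbf{p}^{(k-1)} \in V_{k+1}$ and hence $\mathbf{A}\mathbf{p}^{(j)} \in V_{k+1}$ for all $j = 0, 1, \ldots, k-1$. Therefore,
$\mathbf{A}\mathbf{p}^{(j)} \perp \mathbf{p}^{(k+1)}$ or equivalently $\langle\mathbf{p}^{(k+1)}, \mathbf{p}^{(j)}\rangle_{\mathbf{A}} = 0$ for $j = 0, 1, \ldots, k-1$. Note
that we are performing Arnoldi iteration to generate an A-conjugate set of directions.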
Let’s summarize the conjugate gradient method by explicitly writing out
its algorithm. Compare the following algorithm with the one for the gradient
method on page 132.
make an initial guess x
p ← r ← b − Ax
for 𝑘 = 0, 1, 2, . . .
𝛼 ← rT p/pT Ap
x ← x + 𝛼p
r ← r − 𝛼Ap
if ‖r‖ < tolerance, then we’re done
𝛽 ← rT Ap/pT Ap
p ← r − 𝛽p
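Here is the same algorithm as a short Julia sketch, keeping the sign convention of the pseudocode (β ← rᵀAp/pᵀAp and p ← r − βp). The function name `conjugate_gradient`, the keyword defaults, and the test system are assumptions for illustration.

```julia
using LinearAlgebra

function conjugate_gradient(A, b; x = zeros(length(b)), tol = 1e-10, maxiter = 10 * length(b))
    r = b - A * x
    norm(r) < tol && return x          # the initial guess is already good enough
    p = copy(r)                        # first direction is the residual
    for k in 1:maxiter
        Ap = A * p                     # the only matrix-vector product per iteration
        α = dot(r, p) / dot(p, Ap)
        x = x + α * p
        r = r - α * Ap                 # residual update (5.8)
        norm(r) < tol && break
        β = dot(r, Ap) / dot(p, Ap)
        p = r - β * p                  # next direction, A-conjugate to the previous one
    end
    return x
end

# Small symmetric, positive-definite example (illustrative values)
A = [4.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0]
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(A, b)
```

Unlike the gradient descent loop, the residual is updated rather than recomputed, so each iteration needs just one product with A.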
Krylov methods 137
The iterative methods discussed in this chapter can be viewed as Krylov methods.
Consider a classical iterative method such as Jacobi or Gauss–Seidel with
Px (𝑘+1) = Nx (𝑘) + b.
The error
e (𝑘) = P−1 Ne (𝑘−1) = (I − P−1 A)e (𝑘−1) ,
and so
e (𝑘) = (I − P−1 A) 𝑘 e (0) .
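$$\begin{aligned}
\mathbf{x}^{(k)} - \mathbf{x}^{(0)} = \mathbf{e}^{(k)} - \mathbf{e}^{(0)} &= \left[(\mathbf{I} - \mathbf{P}^{-1}\mathbf{A})^k - \mathbf{I}\right]\mathbf{e}^{(0)} \\
&= \left[\sum_{j=1}^{k}\binom{k}{j}(-1)^j(\mathbf{P}^{-1}\mathbf{A})^j\right]\mathbf{e}^{(0)}.
\end{aligned}$$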
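This expression says that $\mathbf{x}^{(k)} - \mathbf{x}^{(0)}$ lies in the Krylov space
$$\mathcal{K}_k(\mathbf{P}^{-1}\mathbf{A}, \mathbf{z}) = \operatorname{span}\left\{\mathbf{z}, \mathbf{P}^{-1}\mathbf{A}\mathbf{z}, \ldots, (\mathbf{P}^{-1}\mathbf{A})^k\mathbf{z}\right\}$$
with $\mathbf{z} = \mathbf{P}^{-1}\mathbf{A}\mathbf{e}^{(0)} = \mathbf{P}^{-1}\mathbf{r}^{(0)}$. The gradient descent method is also a Krylov
method. Equation (5.8) says that the residuals are elements of the Krylov space
$$\mathcal{K}_k(\mathbf{A}, \mathbf{r}^{(0)}) = \operatorname{span}\left\{\mathbf{r}^{(0)}, \mathbf{A}\mathbf{r}^{(0)}, \ldots, \mathbf{A}^k\mathbf{r}^{(0)}\right\}.$$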
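$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k\mathbf{p}^{(k)} = \mathbf{x}^{(0)} + \sum_{j=0}^{k}\alpha_j\mathbf{p}^{(j)}$$
for some directions $\{\mathbf{p}^{(0)}, \mathbf{p}^{(1)}, \ldots, \mathbf{p}^{(k)}\}$. For the gradient descent method,
we took the directions along residuals $\mathbf{p}^{(j)} = \mathbf{r}^{(j)}$. For the conjugate gradient
method, we showed that the directions formed a subspace
$$V_{k+1} = \operatorname{span}\left\{\mathbf{p}^{(0)}, \mathbf{p}^{(1)}, \ldots, \mathbf{p}^{(k)}\right\} = \operatorname{span}\left\{\mathbf{r}^{(0)}, \mathbf{r}^{(1)}, \ldots, \mathbf{r}^{(k)}\right\}.$$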
In this section, we’ll expand on the idea of using a Krylov subspace to generate
an iterative method x (𝑘+1) = x (0) + 𝜹 (𝑘) , where 𝜹 (𝑘) ∈ K𝑘+1 (A, r (0) ) with an
initial guess x (0) . Let’s consider two methods
The first approach is often called the Arnoldi method or the Full Orthogonalization
Method (FOM). When A is a symmetric, positive-definite matrix, FOM is
equivalent to the conjugate gradient method. The second approach is the generalized
minimal residual method (GMRES). It is simply called the minimal residual
method (MINRES) when A is symmetric.
Before looking at either of the two methods, let’s recall the Arnoldi process
from the previous chapter. The Arnoldi process is a variation of the Gram–
Schmidt method that generates an orthogonal basis for a Krylov subspace. By
choosing an orthogonal basis, we ensure that the system is well-conditioned.
Starting with the residual r (0) , take
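Let’s examine the first Krylov method, in which we choose the $\mathbf{x}^{(k+1)}$ that
minimizes the error $\|\mathbf{e}^{(k+1)}\|_{\mathbf{A}}$. At the 𝑘th iteration, we choose the approximation
$\mathbf{x}^{(k+1)}$ with $\mathbf{x}^{(k+1)} - \mathbf{x}^{(0)} \in \mathcal{K}_k(\mathbf{A}, \mathbf{r}^{(0)})$ to minimize the error in the energy norm
$$\|\mathbf{e}^{(k+1)}\|_{\mathbf{A}}^2 = \mathbf{e}^{(k+1)T}\mathbf{A}\mathbf{e}^{(k+1)}.$$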
Any vector in the Krylov subspace can be expressed as a linear combination of
the basis elements Q 𝑘 z (𝑘) for some z (𝑘) . Take
x (𝑘+1) = x (0) + Q 𝑘 z (𝑘)
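$$\|\mathbf{r}^{(k)}\|_2 = \|\mathbf{b} - \mathbf{A}\mathbf{x}^{(k)}\|_2 = h_{k+1,k}\cdot\left|\boldsymbol{\xi}_k^T\mathbf{z}^{(k)}\right| = h_{k+1,k}\cdot\left|z_k^{(k)}\right|,$$
which we can use as a stopping criterion when the value is less than $\varepsilon\|\mathbf{r}^{(0)}\|_2$ for
some tolerance 𝜀.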
It may not be obvious, but FOM outlined above is just the conjugate gradient
method when A is a symmetric, positive-definite matrix. Start with
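where the columns of $\mathbf{P}_k$ are the directions $\mathbf{p}^{(k)}$. For a symmetric matrix A, we have
$$\underbrace{\mathbf{P}_k^T\mathbf{A}\mathbf{P}_k}_{\text{symmetric}} = \mathbf{U}_k^{-T}\mathbf{Q}_k^T\mathbf{A}\mathbf{Q}_k\mathbf{U}_k^{-1} = \mathbf{U}_k^{-T}\mathbf{H}_k\mathbf{U}_k^{-1} = \underbrace{\mathbf{U}_k^{-T}\mathbf{L}_k}_{\text{lower triangular}}.$$
$$\langle\mathbf{p}^{(i)}, \mathbf{A}\mathbf{p}^{(j)}\rangle_2 = \langle\mathbf{p}^{(i)}, \mathbf{p}^{(j)}\rangle_{\mathbf{A}} = 0 \quad\text{for } i \ne j,$$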
This time we will choose $\mathbf{x}^{(k+1)}$ that minimizes the residual $\|\mathbf{r}^{(k+1)}\|_2$. As before,
we take
x (𝑘+1) = x (0) + Q 𝑘 z (𝑘)
for some vector z (𝑘) to be determined. From r (𝑘) = b − Ax (𝑘) , we have
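where $\boldsymbol{\xi}_1 = (1, 0, \ldots, 0)$. We now have the problem of finding $\mathbf{z}^{(k)}$ to minimize
$$\|\mathbf{r}^{(k+1)}\|_2 = \left\|\, \|\mathbf{r}^{(0)}\|_2\,\boldsymbol{\xi}_1 - \mathbf{H}_{k+1,k}\mathbf{z}^{(k)} \right\|_2.$$
We stop when $\|\mathbf{r}^{(k)}\|_2 < \varepsilon\|\mathbf{r}^{(0)}\|_2$ for some tolerance 𝜀. Otherwise, we continue
by finding the next basis vector $\mathbf{q}^{(k+2)}$ of the Krylov subspace using the Arnoldi process.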
The IterativeSolvers.jl function gmres implements the generalized minimum residual
method, and minres implements the minimum residual method.
5.6 Exercises
5.1. Confirm that the spectral radius of the Jacobi method is $\cos\frac{\pi}{n+1}$ and
the spectral radius of the Gauss–Seidel method is $\cos^2\frac{\pi}{n+1}$ for the discrete
Laplacian (5.5).
5.2. Consider the 2-by-2 matrix
$$\mathbf{A} = \begin{bmatrix} 1 & \sigma \\ -\sigma & 1 \end{bmatrix}.$$
Under what conditions will Gauss–Seidel converge? For what range 𝜔 will the
SOR method converge? What is the optimal choice of 𝜔?
5.3. Show that the conjugate gradient method requires no more than 𝑘 + 1 steps to
converge for a symmetric, positive-definite matrix with 𝑘 distinct eigenvalues.
Even though A has over 15 billion elements, only 0.005 percent of them are
nonzero—still, it has almost a million nonzero elements.
Consider the source term
𝑓 (𝑥, 𝑦, 𝑧) = (𝑥 − 𝑥 2 ) (𝑦 − 𝑦 2 ) + (𝑥 − 𝑥 2 ) (𝑧 − 𝑧2 ) + (𝑦 − 𝑦 2 ) (𝑧 − 𝑧2 ).
In this case, the finite difference method will produce an exact solution
$$u(x, y, z) = \tfrac12(x - x^2)(y - y^2)(z - z^2).$$
Implement the finite difference scheme using Jacobi, Gauss–Seidel, SOR, and
conjugate gradient methods starting with an initial solution 𝑢(𝑥, 𝑦, 𝑧) ≡ 0. Take
𝜔 = 1.9 for the SOR method. Plot the error of the methods for the first 500
iterations and comment on the convergence. Hint: An easy way to build the
finite difference matrix is by using a Kronecker tensor product
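$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}.$$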
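As a hedged illustration of the hint, the sketch below assembles a three-dimensional discrete Laplacian from one-dimensional second-difference matrices. We assume the unit cube with zero boundary values, the equation −Δ𝑢 = 𝑓, and a deliberately tiny grid; the variable names are ours.

```julia
using LinearAlgebra, SparseArrays

n = 4                                  # interior grid points per dimension (kept small here)
h = 1 / (n + 1)                        # mesh spacing on the unit interval
D = spdiagm(-1 => ones(n - 1), 0 => -2 * ones(n), 1 => ones(n - 1)) / h^2
Id = sparse(1.0I, n, n)

# Three-dimensional discrete Laplacian built from one-dimensional pieces
A = kron(kron(Id, Id), D) + kron(kron(Id, D), Id) + kron(kron(D, Id), Id)

# Grid values of the source term and of the exact solution from the exercise
g(t) = t - t^2
x = h * (1:n)
f = [g(a) * g(b) + g(a) * g(c) + g(b) * g(c) for a in x, b in x, c in x]
u = [0.5 * g(a) * g(b) * g(c) for a in x, b in x, c in x]
```

With this ordering, `-A * vec(u)` should reproduce `vec(f)` up to rounding, which is a convenient check before handing A to an iterative solver.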
It’s hard to overstate the impact that the fast Fourier transform (FFT) has had
in science and technology. It has been called “the most important numerical
algorithm of our lifetime” by prominent mathematician Gilbert Strang, and it is
invariably included in top-ten lists of algorithms. The FFT is an essential tool
for signal processing, data compression, and partial differential equations. It
is used in technologies as varied as digital media, medical imaging, and stock
market analysis. At its most basic level, the FFT is a recursive implementation
of the discrete Fourier transform. This implementation, which puts the “fast” in
fast Fourier transform, takes what would typically be an $O(n^2)$-operation method
and makes it an $O(n\log_2 n)$-operation one. In this chapter, we’ll examine the
algorithm and a few of its applications.
144 The Fast Fourier Transform
Now, suppose that we restrict the {𝑧 𝑗 } to equally spaced points on the unit circle.
That is, take
$$z_j = \mathrm{e}^{-\mathrm{i}2\pi j/n} = \omega_n^j$$
by choosing nodes clockwise around the unit circle starting with $z_0 = 1$. The
value $\omega_n = \exp(-\mathrm{i}2\pi/n)$ is called an 𝑛th root of unity because it solves the
equation $z^n = 1$. (In fact, $\omega_n^k$ are all 𝑛th roots of unity because they all are
solutions to $z^n = 1$.) In this case, the Vandermonde matrix is
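$$\mathbf{F}_n = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega_n & \omega_n^2 & \cdots & \omega_n^{n-1} \\
1 & \omega_n^2 & \omega_n^4 & \cdots & \omega_n^{2(n-1)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \cdots & \omega_n^{(n-1)^2}
\end{bmatrix}. \tag{6.1}$$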
The system y = F𝑛 c expressed in index notation is
$$y_k = \sum_{j=0}^{n-1} c_j\,\mathrm{e}^{-\mathrm{i}2\pi jk/n} = \sum_{j=0}^{n-1} c_j\,\omega_n^{jk}.$$
The matrix F𝑛 , called the discrete Fourier transform (DFT), is a scaled unitary
matrix—that is, FH𝑛 F𝑛 = 𝑛I. This fact follows from a simple application of the
geometric series. Recall that if 𝑧 ≠ 1, then
$$\sum_{j=0}^{n-1} z^j = \frac{z^n - 1}{z - 1}.$$
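Computing $\mathbf{F}_n^{\mathsf{H}}\mathbf{F}_n$,
$$\sum_{j=0}^{n-1}\omega_n^{jk}\,\overline{\omega}_n^{jl} = \sum_{j=0}^{n-1}\omega_n^{j(k-l)} = \begin{cases} \dfrac{\omega_n^{n(k-l)} - 1}{\omega_n^{k-l} - 1} = \dfrac{1 - 1}{\omega_n^{k-l} - 1} = 0, & \text{when } k \ne l \\[2ex] \displaystyle\sum_{j=0}^{n-1} 1 = n, & \text{when } k = l \end{cases}$$
where $\overline{\omega}$ is the complex conjugate of 𝜔.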
It also follows that the inverse discrete Fourier transform (IDFT) is simply
$\mathbf{F}_n^{-1} = \mathbf{F}_n^{\mathsf{H}}/n = \overline{\mathbf{F}}_n/n$. The DFT is symmetric but not Hermitian. Unlike the
usual Vandermonde matrix, which is often ill-conditioned for large 𝑛, the DFT is
perfectly conditioned because it is a scaled unitary matrix.
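These properties are quick to confirm numerically. The sketch below builds the DFT matrix entry by entry from (6.1); the helper name `dft_matrix` and the size 𝑛 = 8 are illustrative choices.

```julia
using LinearAlgebra

# DFT matrix with entries ωₙ^(jk), where ωₙ = exp(-2πi/n) and j, k = 0, ..., n-1
dft_matrix(n) = [exp(-2im * π * j * k / n) for j in 0:n-1, k in 0:n-1]

n = 8
F = dft_matrix(n)

gram = F' * F                  # should equal n times the identity
Finv = conj(F) / n             # inverse DFT from the conjugate
condition_number = cond(F)     # a scaled unitary matrix has condition number one
```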
The 𝑘th column (and row) of $\mathbf{F}_n$ forms a subgroup¹ generated by $\omega_n^k$. That
¹ A group is a set with an operation that possesses closure, associativity, an identity, and an
inverse. A subgroup is any subset of a group, which is itself a group under the same operation.
Discrete Fourier transform 145
Figure 6.1: Subgroups using the generator $\omega_{12}^{j}$ for $j = 0, 1, \ldots, 11$ (one panel per value of $j$).
{2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24} (mod 12) ≡ {2, 4, 6, 8, 10, 0}.
See Figure 6.1 above. Using 10 as a generator gives us the same group because
10 ≡ −2 (mod 12). In particular, note that this subgroup corresponds to the six
6th roots of unity. That is, the subgroup generated by $\omega_{12}^{2}$ equals the group
generated by $\omega_{6}^{1}$.
If the generator 𝑘 is relatively prime with respect to 𝑛, i.e., if 𝑘 and 𝑛 are
coprime, then the number of elements of the group is 𝑛. We say the group
has order 𝑛. For example, both 5 and 7 are coprime to 12, and they generate
subgroups of order 12. Otherwise, 𝑘 generates a subgroup of order 𝑛/gcd(𝑘, 𝑛).
Associated with the subgroup {0, 2, 4, 6, 8, 10} is a coset formed by taking
each element of the group and adding one. So,
For 𝑘 = 3, the group {0, 3, 6, 9} has three cosets: itself, {1, 4, 7, 10}, and
{2, 5, 8, 11}. For 𝑘 = 5, there is only one coset. The number of cosets of the
subgroup generated by 𝑘 under addition modulo 𝑛 equals gcd(𝑘, 𝑛). The number
of cosets determines the radix, which is the basis for the Cooley–Tukey algorithm.
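The subgroup and coset counts above can be reproduced with a few lines of integer arithmetic. This is a small sketch working with exponents under addition modulo 𝑛; the helper names `subgroup` and `cosets` are ours.

```julia
# Subgroup of the integers modulo n (under addition) generated by k
subgroup(k, n) = sort(unique(mod.(k .* (0:n-1), n)))

# Cosets of that subgroup: shift it by 0, 1, ..., gcd(k, n) - 1
cosets(k, n) = [mod.(subgroup(k, n) .+ s, n) for s in 0:gcd(k, n)-1]

evens = subgroup(2, 12)        # the six 6th roots of unity
same = subgroup(10, 12)        # 10 ≡ -2 (mod 12) generates the same subgroup
full = subgroup(5, 12)         # 5 is coprime to 12, so the order is 12
three = cosets(3, 12)          # gcd(3, 12) = 3 cosets
```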
There are several FFT algorithms. Each works by taking advantage of the group
structure of the DFT matrix. The prototypical one, the Cooley–Tukey algorithm,
was developed by James Cooley and John Tukey in 1965—a rediscovery of an
unpublished discovery by Gauss in 1805. The development of their FFT was
prompted by physicist Richard Garwin, designer of the first hydrogen bomb and
researcher at IBM Watson laboratory, who at the time was interested in verifying
the Soviet Union’s compliance with the Nuclear Test Ban Treaty. Because
the Soviet Union would not agree to inspections within their borders, Garwin
turned to remote seismic monitoring of nuclear explosions. But, time series
analysis of the off-shore seismological sensors would require a fast algorithm for
computing the DFT, which normally took 𝑂 (𝑛2 ) operations. Garwin prompted
Cooley to develop such an algorithm along with Tukey. And with a few months
of effort, they had developed a recursive algorithm that took only 𝑂 (𝑛 log2 𝑛)
operations. When the sensors were finally deployed, they were able to locate
nuclear explosions to within 15 kilometers.
In this section, we look at the Cooley–Tukey radix-2 (base-2) algorithm.
We’ll start with the Danielson–Lanczos lemma,2 which says that a discrete
Fourier transform of even length 𝑛 can be rewritten as the sum of two discrete
Fourier transforms of length 𝑛/2 (one formed using the even-numbered nodes
and the other formed using the odd-numbered ones). By applying this procedure
repeatedly, we can recursively break down any power-of-two length discrete
Fourier transform. Consider a DFT whose size 𝑛 is divisible by two:
$$y_j = \sum_{k=0}^{n-1}\omega_n^{kj}c_k = \sum_{k=0}^{n/2-1}\omega_n^{2kj}c_{2k} + \sum_{k=0}^{n/2-1}\omega_n^{(2k+1)j}c_{2k+1}.$$
$$y_j = \sum_{k=0}^{m-1}\omega_m^{kj}c'_k + \omega_n^{j}\sum_{k=0}^{m-1}\omega_m^{kj}c''_k$$
where $c'_k = c_{2k}$ and $c''_k = c_{2k+1}$. So, the output can be computed by using the
sum of two smaller DFTs of size $m = n/2$:
$$y_j = y'_j + \omega_n^{j}y''_j.$$
2 Cooley and Tukey rediscovered the lemma twenty-three years after Gordon Danielson and
Cornelius Lanczos published it in the Journal of the Franklin Institute, a highly read journal but
apparently one that did not enjoy a wide circulation among numerical analysts.
Cooley–Tukey algorithm 147
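We only need to compute $j = 0, 1, \ldots, m-1$ because $\omega_m^{k(m+j)} = \omega_m^{km}\omega_m^{kj} = \omega_m^{kj}$
and $\omega_n^{m+j} = \omega_n^{m}\omega_n^{j} = -\omega_n^{j}$. So
$$y_{m+j} = y'_j - \omega_n^{j}y''_j.$$
$$\begin{cases} y_j = y'_j + \omega_n^{j}y''_j \\ y_{m+j} = y'_j - \omega_n^{j}y''_j \end{cases} \quad\text{for } j = 0, 1, \ldots, m-1. \tag{6.2}$$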
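The butterfly relations (6.2) lead directly to a recursive implementation. The following is a minimal sketch for power-of-two lengths only; the name `fft_radix2` is hypothetical and chosen to avoid confusion with the book's own routines.

```julia
function fft_radix2(c)
    n = length(c)
    n == 1 && return complex(float(c))            # a length-one DFT is the identity
    @assert iseven(n) "length must be a power of two"
    m = n ÷ 2
    y_even = fft_radix2(c[1:2:end])               # y′: DFT of c₀, c₂, c₄, ...
    y_odd = fft_radix2(c[2:2:end])                # y″: DFT of c₁, c₃, c₅, ...
    twiddle = [exp(-2im * π * j / n) for j in 0:m-1]   # ωₙʲ for j = 0, ..., m-1
    return [y_even .+ twiddle .* y_odd; y_even .- twiddle .* y_odd]
end

# Compare against a direct matrix-vector product with the DFT matrix
c = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.5, 7.0]
F = [exp(-2im * π * j * k / 8) for j in 0:7, k in 0:7]
y = fft_radix2(c)
```

Each level of recursion does 𝑂(𝑛) work and there are log₂ 𝑛 levels, which is where the 𝑂(𝑛 log₂ 𝑛) operation count comes from.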
To get a better picture of the math behind the algorithm, let’s look at the
matrix formulation of the radix-2 FFT. The 𝑛 = 4 DFT is
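$$\mathbf{F}_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega & \omega^2 & \omega^3 \\ 1 & \omega^2 & \omega^4 & \omega^6 \\ 1 & \omega^3 & \omega^6 & \omega^9 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -\mathrm{i} & -1 & \mathrm{i} \\ 1 & -1 & 1 & -1 \\ 1 & \mathrm{i} & -1 & -\mathrm{i} \end{bmatrix}.$$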
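$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
$$\mathbf{F}_4\mathbf{P}^{-1} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & -\mathrm{i} & \mathrm{i} \\ 1 & 1 & -1 & -1 \\ 1 & -1 & \mathrm{i} & -\mathrm{i} \end{bmatrix} = \begin{bmatrix} \mathbf{F}_2 & \boldsymbol{\Omega}_2\mathbf{F}_2 \\ \mathbf{F}_2 & -\boldsymbol{\Omega}_2\mathbf{F}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{I}_2 & \boldsymbol{\Omega}_2 \\ \mathbf{I}_2 & -\boldsymbol{\Omega}_2 \end{bmatrix}\begin{bmatrix} \mathbf{F}_2 & \\ & \mathbf{F}_2 \end{bmatrix},$$
$$\mathbf{F}_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \quad\text{and}\quad \boldsymbol{\Omega}_2 = \begin{bmatrix} 1 & 0 \\ 0 & -\mathrm{i} \end{bmatrix} = \operatorname{diag}(1, \omega_4)$$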
The matrix
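$$\mathbf{B}_4 = \begin{bmatrix} \mathbf{I}_2 & \boldsymbol{\Omega}_2 \\ \mathbf{I}_2 & -\boldsymbol{\Omega}_2 \end{bmatrix}$$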
3 To dig deeper into shuffling, see the article “The mathematics of perfect shuffles” by mathema-
gician Persi Diaconis and others. When he was fourteen, Diaconis dropped out of school and ran off
to follow a legendary sleight-of-hand magician. After learning that probability could improve his
card skills and that mathematical analysis was necessary to understand probability, Diaconis took up
mathematics and later went on to earn a PhD in mathematical statistics from Harvard.
Figure 6.2: F128 (left), after one permutation, after a second permutation, and
after a third permutation (right) of the columns.
And in general,
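$$\mathbf{B}_{2n} = \begin{bmatrix} \mathbf{I}_n & \boldsymbol{\Omega}_n \\ \mathbf{I}_n & -\boldsymbol{\Omega}_n \end{bmatrix},$$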
ifftx2(y) = conj(fftx2(conj(y)))/length(y);
The radix-2 FFT works by recursively breaking the DFT matrix into smaller and
smaller pieces. If 𝑛 is a power of two or a composite of small primes such as
$7200 = 2^5 \times 3^2 \times 5^2$, the recursive algorithm is efficient. But if 𝑛 is a large prime
number or a composite with a large prime factor, such as $7060 = 2^2 \times 5 \times 353$,
the recursion breaks down. In this case, other algorithms must be used. We’ll
examine two such algorithms in section 6.4.
That is,
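$$\mathbf{C} = \begin{bmatrix} \mathbf{c} & \mathbf{R}\mathbf{c} & \mathbf{R}^2\mathbf{c} & \cdots & \mathbf{R}^{n-1}\mathbf{c} \end{bmatrix} \quad\text{with}\quad \mathbf{c} = \begin{bmatrix} c_0 & c_1 & \cdots & c_{n-1} \end{bmatrix}^T.$$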
Theorem 19. A circulant matrix C, whose first column is c, has the diagonaliza-
tion C = F−1 diag (Fc) F, where F is the discrete Fourier transform.
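A quick numerical check of Theorem 19 may help fix the idea. The sketch below builds a small circulant matrix column by column and compares it with the diagonalization; the vector c and the size 𝑛 = 5 are arbitrary illustrative values, and the DFT matrix is formed explicitly rather than through an FFT library.

```julia
using LinearAlgebra

n = 5
c = [2.0, -1.0, 0.5, 3.0, 1.0]                     # first column of the circulant matrix

# Entry (j, k) of C is c_{(j - k) mod n}, so each column is a downshift of the previous one
C = [c[mod(j - k, n) + 1] for j in 0:n-1, k in 0:n-1]

# DFT matrix with entries ωₙ^(jk), ωₙ = exp(-2πi/n)
F = [exp(-2im * π * j * k / n) for j in 0:n-1, k in 0:n-1]

# Right-hand side of the theorem: F⁻¹ diag(Fc) F
C_from_dft = F \ (Diagonal(F * c) * F)
```

In practice one never forms F; a product Cx would be computed as an inverse FFT of the elementwise product of the FFTs of c and x.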
Toeplitz and circulant matrices 151
$$\mathbf{C} = c_0\mathbf{I} + c_1\mathbf{R} + c_2\mathbf{R}^2 + \cdots + c_{n-1}\mathbf{R}^{n-1}$$
Fast convolution
We can think of the convolution of two functions or arrays as their shifted inner
products. For continuous functions 𝑓 (𝑡) and 𝑔(𝑡), we have the convolution
$$(f * g)(t) = \int_{-\infty}^{\infty} f(s)\,g(t - s)\,\mathrm{d}s.$$
When $\int_{-\infty}^{\infty} f(s)\,\mathrm{d}s = 1$, you can think of $f * g$ as a moving average of $g$ using
$f$ as a window function. For arrays u and v, the 𝑘th element of the circular
convolution u ∗ v is simply
$$(\mathbf{u} * \mathbf{v})_k = \sum_{j=0}^{n-1} u_j\,v_{k-j \ (\mathrm{mod}\ n)}.$$
              1   2   3
      ×       2   4   1
      -----------------
              1   2   3
          4   8  12
      2   4   6
      -----------------
      2   8  15  14   3    (and carrying. . . )
      2   9   6   4   3
So 123 × 241 = 29643.
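The same calculation can be written as a convolution of digit arrays followed by a carry step. This sketch uses a direct double loop rather than FFTs, so it illustrates the structure and not the speed; the function name `multiply_digits` is ours.

```julia
# Digits are stored least-significant first, e.g. 123 -> [3, 2, 1].
function multiply_digits(u, v)
    w = zeros(Int, length(u) + length(v))
    # Convolution: column j + k - 1 collects every product u[j] * v[k]
    for j in eachindex(u), k in eachindex(v)
        w[j + k - 1] += u[j] * v[k]
    end
    # Carrying: reduce each column to a single digit
    carry = 0
    for i in eachindex(w)
        total = w[i] + carry
        w[i] = total % 10
        carry = total ÷ 10
    end
    return w
end

product_digits = multiply_digits(digits(123), digits(241))
product = sum(product_digits .* 10 .^ (0:length(product_digits)-1))
```

Replacing the double loop with FFT-based convolution is what makes multiplication of very large integers fast.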
The previous section explained how to quickly evaluate circulant matrix multipli-
cation using FFTs. We can do the same for a general Toeplitz matrix by padding
it out to become a circulant matrix. For example,
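$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} = \begin{bmatrix} a & d & e \\ b & a & d \\ c & b & a \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} \quad\longrightarrow\quad \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ * \\ * \end{bmatrix} = \begin{bmatrix} a & d & e & c & b \\ b & a & d & e & c \\ c & b & a & d & e \\ e & c & b & a & d \\ d & e & c & b & a \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ 0 \\ 0 \end{bmatrix}$$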
where we discard the ∗ terms in the answer. In general, we need to pad out an 𝑛 ×𝑛
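Toeplitz matrix by at least 𝑛 − 1 rows and columns to create a (2𝑛 − 1) × (2𝑛 − 1)
circulant matrix. It may often be more efficient to overpad the matrix to be a
power of two or some other product of small primes. For example,
$$\mathbf{T} = \begin{bmatrix} a & d & e \\ b & a & d \\ c & b & a \end{bmatrix} \quad\longrightarrow\quad \mathbf{C} = \begin{bmatrix}
a & d & e & 0 & 0 & 0 & c & b \\
b & a & d & e & 0 & 0 & 0 & c \\
c & b & a & d & e & 0 & 0 & 0 \\
0 & c & b & a & d & e & 0 & 0 \\
0 & 0 & c & b & a & d & e & 0 \\
0 & 0 & 0 & c & b & a & d & e \\
e & 0 & 0 & 0 & c & b & a & d \\
d & e & 0 & 0 & 0 & c & b & a
\end{bmatrix}.$$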
Bluestein and Rader factorization 153
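function fasttoeplitz(c,r,x)
  n = length(x)
  Δ = nextpow(2,n) - n
  x1 = [c; zeros(Δ); r[end:-1:2]]    # first column of the padded circulant matrix
  x2 = [x; zeros(Δ+n-1)]             # zero-pad x to the same length
  ifftx2(fftx2(x1).*fftx2(x2))[1:n]  # circular convolution; keep the first n entries
end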
Figure 6.3: FFT run times (in milliseconds) on vectors of length 𝑛 (in thousands). The lower curve shows
5-smooth numbers and the upper curve shows prime numbers. The inset shows
both curves relative to the run time of directly computing a DFT.
The figure above shows the plots with several outliers removed. Note that while
the FFT is significantly slower on rough numbers than on smooth numbers, it is
still orders of magnitude faster than directly computing the DFT.
Bluestein factorization
By noting that $(j - k)^2 = j^2 - 2jk + k^2$, we can replace $jk$ in the DFT by
$(j^2 - (j - k)^2 + k^2)/2$:
$$y_j = \sum_{k=0}^{n-1} c_k\,\omega_n^{jk} = \omega_n^{j^2/2}\sum_{k=0}^{n-1} c_k\,\omega_n^{-(j-k)^2/2}\,\omega_n^{k^2/2}.$$
Furthermore, since $\omega_n = \omega_{2n}^2$, it follows that
$$y_j = \omega_{2n}^{j^2}\sum_{k=0}^{n-1} c_k\,\omega_{2n}^{-(j-k)^2}\,\omega_{2n}^{k^2}.$$
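This says that the DFT matrix F equals WTW, where W is a diagonal matrix
whose diagonal elements are given by
$$\mathbf{w} = \begin{bmatrix} 1 & \omega_{2n} & \omega_{2n}^{4} & \omega_{2n}^{9} & \cdots & \omega_{2n}^{(n-1)^2} \end{bmatrix}$$
and T is a symmetric Toeplitz matrix whose first row is given by $\overline{\mathbf{w}}$. For
example, for $\mathbf{F}_5$, with $\omega = \omega_{10}$ and exponents reduced modulo 10,
$$\mathbf{w} = \begin{bmatrix} 1 & \omega & \omega^4 & \omega^9 & \omega^6 \end{bmatrix} \quad\text{and}\quad \overline{\mathbf{w}} = \begin{bmatrix} 1 & \omega^9 & \omega^6 & \omega^1 & \omega^4 \end{bmatrix}$$
$$\mathbf{F}_5 = \begin{bmatrix} 1 & & & & \\ & \omega & & & \\ & & \omega^4 & & \\ & & & \omega^9 & \\ & & & & \omega^6 \end{bmatrix}\begin{bmatrix} 1 & \omega^9 & \omega^6 & \omega^1 & \omega^4 \\ \omega^9 & 1 & \omega^9 & \omega^6 & \omega^1 \\ \omega^6 & \omega^9 & 1 & \omega^9 & \omega^6 \\ \omega^1 & \omega^6 & \omega^9 & 1 & \omega^9 \\ \omega^4 & \omega^1 & \omega^6 & \omega^9 & 1 \end{bmatrix}\begin{bmatrix} 1 & & & & \\ & \omega & & & \\ & & \omega^4 & & \\ & & & \omega^9 & \\ & & & & \omega^6 \end{bmatrix}.$$
So, y = Fc equals $\mathbf{w} \circ (\overline{\mathbf{w}} * (\mathbf{w} \circ \mathbf{c}))$ where ◦ denotes the Hadamard product.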
When computing a Toeplitz product, we can pad the vectors with any number
of zeros to get a length for which the Cooley–Tukey algorithm is fast. We need
to compute three DFTs over a vector of at least 2𝑛 − 1 elements. The following
Julia code implements the Bluestein algorithm using the fasttoeplitz function,
defined on page 153:
function bluestein_fft(x)
  n = length(x)
  ω = exp.((1im*π/n)*(0:n-1).^2)         # ω is the conjugate of w
  conj(ω).*fasttoeplitz(ω,ω,conj(ω).*x)  # w ∘ (w̄ ∗ (w ∘ x))
end
Rader factorization
Example. The figure on the next page shows the finite cyclic groups of order
𝑛 = 13. We can see that 𝑟 = 2 is a primitive root modulo 13 because
Integers 6 and 11 are also primitive roots modulo 13. The order of the multiplica-
tive group of integers modulo 𝑛 is given by Euler’s totient 𝜑(𝑛), which equals
𝑛 − 1 when 𝑛 is prime. The number of elements in the groups must be a factor
of 𝜑(13) = 12, i.e., {2, 3, 4, 6, 12}. We see that 12 generates a group of order 2;
3 and 9 generate groups of order 3; 5 and 8 generate groups of order 4; 4 and
10 generate groups of order 6; and the primitive roots 2, 6, 7, and 11 generate
groups of order 12. J
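The orders quoted in the example can be reproduced with a few lines of Julia. This is a small sketch; the helper name `multorder` is introduced here for illustration and is not from the text.

```julia
# order of g in the multiplicative group of integers modulo the prime p
function multorder(g, p)
    k, x = 1, mod(g, p)
    while x != 1
        x = mod(x*g, p)
        k += 1
    end
    return k
end

orders = Dict(g => multorder(g, 13) for g in 1:12)
primitive_roots = sort([g for (g, k) in orders if k == 12])
println(primitive_roots)
```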
\[ y_k = \sum_{j=0}^{n-1} c_j\, \omega_n^{jk} \]
\[ y_0 = \sum_{j=0}^{n-1} c_j \quad\text{and}\quad y_{(r^{-k})} = c_0 + \sum_{j=1}^{n-1} c_{(r^{j})}\, \omega_n^{(r^{j-k})}. \]
Bluestein and Rader factorization 157
[Figure: the finite cyclic groups of order 𝑛 = 13.]
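We can rewrite the sum as $\mathbf{P}_s^{-1} \mathbf{C}_{n-1} \mathbf{P}_r \mathbf{c}$, where $\mathbf{C}_{n-1}$ is the circulant matrix
whose first row is given by
\[ \begin{bmatrix} \omega_n^{r} & \omega_n^{r^2} & \cdots & \omega_n^{r^{n-1}} \end{bmatrix}, \]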
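function rader_fft(x)   # function name assumed; the original header is missing
  # assumes fft and ifft from FFTW.jl and a function primitiveroot(n)
  # that returns a primitive root modulo the prime n
  n = length(x)
  r = primitiveroot(n)
  P₊ = powermod.(r, 0:n-2, n)
  P₋ = circshift(reverse(P₊),1)
  ω = exp.((2im*π/n)*P₋)
  c = x[1] .+ ifft(fft(ω).*fft(x[2:end][P₊]))
  [sum(x); c[reverse(invperm(P₋))]]
end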
6.5 Applications
Julia’s fft and ifft functions are all multi-dimensional transforms unless an
4 Codelets
are small-sized (typically 𝑛 ≤ 64), hard-coded transforms that use split-radix,
Cooley–Tukey, and a prime factor algorithm.
The FFTW.jl functions fftshift and ifftshift shift the zero-frequency compo-
nent to the middle of the array.
Discrete cosine transforms (DCTs) and discrete sine transforms (DSTs) can
be constructed using the DFT by extending even functions and odd functions
as periodic functions. Because there are different ways to define this even/odd
extension, there are eight variants of each of the DST and the DCT (called
type-1 through type-8), although only the first four of them
are commonly used in practice. Each of these four types is available through
the FFTW.jl real-input/real-output transform FFTW.r2r(A, kind [, dims]). The
parameter kind specifies the type: FFTW.REDFTxy where xy, which equals 00, 01,
10, or 11, selects type-1 to type-4 DCT. Similarly, FFTW.RODFTxy selects a type-1
to type-4 DST. Because the DCT and DST are scaled orthogonal matrices, we
can reuse the same functions for the inverse DCT and inverse DST by dividing
by a normalization constant.
The FFTW.jl functions dct and idct compute the type-2 DCT and inverse DCT
\[ (\mathbf{M} \otimes \mathbf{N})(\mathbf{x}_i \otimes \mathbf{y}_j) = \lambda_i \mu_j\, (\mathbf{x}_i \otimes \mathbf{y}_j). \]
We can use an FFT to compute the discrete sine transform by first noting
Data filtering
Suppose that we have noisy data. We can filter out the high frequencies and
regularize the data using a convolution with a smooth kernel function, such as
a Gaussian distribution. A kernel with a support of 𝑚 points applied over a
domain with 𝑛 points requires 𝑂 (𝑚𝑛) operations applied directly. It requires
𝑂 (𝑛) operations to compute the equivalent product in the Fourier domain, with
an additional two 𝑂 (𝑛 log 𝑛) operations to put it in the Fourier domain and bring
it back. The extra overhead in computing Fourier transforms may be small when
𝑛 and 𝑚 are large. In the example below, noise was added to the data on the left
to get the center figure. A Gaussian distribution was used as a convolution kernel
to filter the noise and get the figure on the right.
[Figure: the original data (left), the data with added noise (center), and the filtered data (right).]
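The equivalence between direct convolution and multiplication in the Fourier domain can be sketched in base Julia. The example below is illustrative only: it uses a small recursive radix-2 FFT (`fft2`, written here so the block is self-contained) in place of FFTW.jl, a periodic Gaussian kernel, and a single high-frequency cosine as a stand-in for noise. All names and parameter values are assumptions made for this sketch.

```julia
# recursive radix-2 FFT; the length of x must be a power of two
function fft2(x)
    n = length(x)
    n == 1 && return ComplexF64.(x)
    even, odd = fft2(x[1:2:end]), fft2(x[2:2:end])
    twiddle = [exp(-2im*π*k/n) for k in 0:n÷2-1] .* odd
    return [even .+ twiddle; even .- twiddle]
end
ifft2(y) = conj.(fft2(conj.(y))) ./ length(y)

n = 256
t = (0:n-1)/n
clean = sin.(2π*t)
noisy = clean .+ 0.3*cos.(2π*50*t)      # high-frequency stand-in for noise

σ = 0.02
d = min.(t, 1 .- t)                      # periodic distance from zero
kernel = exp.(-d.^2/(2σ^2))
kernel ./= sum(kernel)                   # normalize so the mean is preserved

# direct circular convolution: O(mn) operations with m = n here
direct = [sum(noisy[k]*kernel[mod(j-k, n)+1] for k in 1:n) for j in 1:n]
# the same convolution in the Fourier domain: O(n log n) operations
smooth = real(ifft2(fft2(noisy) .* fft2(kernel)))

println(maximum(abs.(direct .- smooth)))
println(maximum(abs.(smooth .- clean)))
```

In practice, a kernel with small support 𝑚 would be applied directly or zero-padded before transforming; the full-length periodic kernel is used here only to keep the comparison simple.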
his namesake series as a solution to the heat equation, and a hundred years later,
Louis de Broglie suggested that all matter is waves.
The Fourier transform of a smooth function decays rapidly in the frequency
domain. So for sufficiently smooth data, high-frequency information can be
discarded without introducing substantial error. Since the DFT is a scaled unitary
matrix, both the 2-norm and the Frobenius norm are preserved by the DFT up to that scaling:
‖e‖₂ = ‖Fe‖₂ and ‖e‖_F = ‖Fe‖_F when F is normalized to be unitary. The pixelwise root mean squared error of an
image equals the pixelwise root mean squared error of the DCT of an image.
Figure 6.5 above shows the magnitudes of the discrete cosine transform of
a cartoon image. The image isn’t an exceptionally smooth one—it has several
sharp edges, particularly the eyes, mouth, and chin. The most significant Fourier
coefficients are associated with low-frequency components in the upper left-hand
corner, and in general, the Fourier coefficients decay as their frequencies increase.
By cropping out the high-frequency components in the Fourier domain, we
can keep relevant information with only a marginal increase in the error if the
data is smooth. However, if the data is not smooth, important information resides
in the higher frequency components and discarding it can result in substantial
error. Notice in Figure 6.5 that several slowly decaying rays corresponding to
sharp edges and lines in the cartoon emanate from the upper left-hand corner.
The so-called Gibbs phenomenon associated with lossy compression of data with
discontinuities can result in noticeable JPEG ringing artifacts at sharp edges. As
an alternative approach, we can zero out only those Fourier components whose
magnitudes are less than some given tolerance regardless of their frequencies.
We can store the information efficiently as a sparse array by zeroing out enough
components. We’ll leave the first approach as an exercise and consider the second
approach below.
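As a rough illustration of the second approach, the sketch below thresholds the coefficients of a two-dimensional DCT. It is not the book's implementation: it assumes an explicitly built orthonormal type-2 DCT matrix (slow but self-contained), a relative tolerance, and a synthetic smooth test array instead of an image. The names `dctmatrix` and `threshold_compress` are hypothetical.

```julia
using LinearAlgebra

# orthonormal type-2 DCT matrix, built explicitly for clarity rather than speed
dctmatrix(n) = [sqrt((k == 0 ? 1 : 2)/n)*cos(π*(j + 0.5)*k/n) for k in 0:n-1, j in 0:n-1]

# zero out DCT coefficients smaller than tol times the largest magnitude
function threshold_compress(A, tol)
    C, D = dctmatrix(size(A, 1)), dctmatrix(size(A, 2))
    B = C*A*D'                                   # two-dimensional DCT
    keep = abs.(B) .>= tol*maximum(abs.(B))
    Bkept = B .* keep
    return C'*Bkept*D, count(keep)/length(B), norm(B - Bkept)
end

A = [exp(-((x - 0.3)^2 + (y - 0.6)^2)/0.05) for x in (0:31)/32, y in (0:31)/32]
Ac, density, discarded = threshold_compress(A, 0.01)
println("fraction of coefficients kept: ", density)
println("error in the image: ", norm(A - Ac), ", norm of discarded coefficients: ", discarded)
```

Because the transform is orthogonal, the Frobenius norm of the error in the reconstructed array matches the norm of the discarded coefficients.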
A = Gray.(load(download(bucket*"laura_square.png")))
[A dctcompress(A,0.05)]
By computing the errors from several values of 𝑑, we can determine the relative
efficiency of DCT compression. See the figure on the next page and the QR code
6.6 Exercises
6.1. Modify the function fftx2 on page 149 for a radix-3 FFT algorithm. Verify
your algorithm using a vector of size 3⁴ = 81.
6.2. Fast multiplication of large integers. Consider two polynomials
When 𝑥 = 10, the polynomials 𝑝(𝑥) and 𝑞(𝑥) return the decimal numbers 𝑝(10)
and 𝑞(10), respectively. We can represent the polynomials as coefficient vectors
by using the basis {1, 𝑥, 𝑥 2 , . . . , 𝑥 2𝑛 }:
p = ( 𝑝 0 , 𝑝 1 , . . . , 𝑝 𝑛 , 0, . . . , 0) and q = (𝑞 0 , 𝑞 1 , . . . , 𝑞 𝑛 , 0, . . . , 0).
[Margin link: an image with increasing levels of DCT compression.]
164 The Fast Fourier Transform
Figure 6.6: The relative error of a DCT-compressed image. The dotted line
shows equivalent lossless PNG compression.
The product of these two numbers is RSA-129, the value of which can be
easily found online to verify that your algorithm works. RSA is one of the first
commonly used public-key cryptologic algorithms.
6.3. We can compute the fast discrete cosine transform using an FFT by
doubling the domain and mirroring the signal into the domain extension as
we did with the fast sine transform. Alternatively, rather than positioning
nodes at the mesh points {1, 2, . . . , 𝑛 − 1}, we can arrange the nodes between
the mesh points at {1/2, 3/2, . . . , 𝑛 − 1/2}, counting over every second node
and then reflecting the signal back at the boundary. For example, with eight
nodes, the mirrored sequence 0 1 2 3 4 5 6 7 7 6 5 4 3 2 1 0 would have the reordering
0 2 4 6 7 5 3 1. We need only 𝑛 points rather than 2𝑛 points to compute
the DCT. In two and three dimensions, the number of points is reduced to a quarter and
an eighth, respectively. This DCT
\[ \sum_{j=0}^{n-1} f_j \cos\left[\frac{\pi}{n}\left(j + \tfrac{1}{2}\right) k\right] \quad\text{for } k = 0, 1, \dots, n-1 \]
is called the type-2 DCT or simply DCT-2 and is perhaps the most commonly
used form of DCT. Write an algorithm that reproduces an 𝑛-point DCT-2 and
inverse DCT-2 using 𝑛-point FFTs.
6.4. Compare two approaches to image compression using a DCT—cropping high-
frequency components and removing low magnitude components—discussed on
page 162.
Numerical Methods for
Chapter 7
This chapter briefly introduces several key ideas that we will use over the next
four chapters. Specifically, we examine how functions, problems, methods, and
computational error impact getting a good numerical solution.
Throughout this book, unless otherwise stated, we’ll assume that all functions
are nice functions to allow us to handwave around formalities. Mathematically,
the words nice and well-behaved are used in an ad-hoc way to allow for a
certain laziness of expression. We might reason that an analytic function is
better-behaved than a merely continuous function, which itself is better-behaved
than a discontinuous one. And that one is surely better-behaved than a function
that blows up. But, what is a well-behaved function? Perhaps the easiest way
to introduce well-behaved functions is with examples of ones that are not so
well-behaved, some rather naughty functions.
170 Preliminaries
Figure 7.1: The nowhere differentiable Weierstrass function. The callouts depict
the fractal nature of this pathological function.
[1916]). The Weierstrass function is also one of the earliest examples of a fractal,
a set with repeating self-similarity at every scale and noninteger dimensions.
See Figure 7.1 above or the QR link at the bottom of this one. J
Well-behaved functions 171
Landau notation
Landau notation, often called big O and little o notation, is used to describe the
limiting behavior of a function:
One often says that 𝑓(𝑥) is on the order of 𝑔(𝑥) or 𝑓(𝑥) is 𝑂(𝑔(𝑥)) or simply
𝑓(𝑥) = 𝑂(𝑔(𝑥)) to mean 𝑓(𝑥) ∈ 𝑂(𝑔(𝑥)). For example, (𝑛 + 1)² = 𝑛² + 𝑂(𝑛) =
𝑂(𝑛²) and (𝑛 + 1)² = 𝑜(𝑛³). Big O notation is often used to describe the
complexity of an algorithm. Gaussian elimination, used to solve a system
of 𝑛 linear equations, requires roughly ⅔𝑛³ + 2𝑛² operations (additions and
multiplications). So, Gaussian elimination is 𝑂(𝑛³).
Big O and little o can also be used to notate infinitesimal asymptotics. In this
For example, $\mathrm{e}^x = 1 + x + O(x^2)$ and $\mathrm{e}^x = 1 + x + o(x)$. We can also use big O
notation to summarize the truncation error of a finite difference approximation to
differentiation. For a small offset ℎ, the Taylor series approximation
$f(x+h) = f(x) + h f'(x) + O(h^2)$ gives
\[ f'(x) = \frac{f(x+h) - f(x)}{h} + O(h), \]
where we’ve taken $O(h^2)/h = O(h)$. So forward differencing provides an
𝑂(ℎ) approximation (or first-order approximation) to the derivative.
Example. Just as there are ill-behaved functions, there are also ill-behaved
problems. Mathematician Nick Trefethen once offered $100 as a prize for solving
ten numerical mathematics problems to ten significant digits. One of them was
“A photon moving at speed 1 in the xy-plane starts at t = 0 at (x, y) = (0.5, 0.1)
heading due east. Around every integer lattice point (𝑖, 𝑗) in the plane, a circular
mirror of radius 1/3 has been erected. How far from the origin is the photon
at t = 10?” We can solve the problem using the geometry of intersections of
lines and circles. At each intersection, we need to evaluate a square root. And
each numerical evaluation of a square root adds round-off error to the solution,
repeatedly compounding. The solution loses about a digit of precision with
every reflection. While the original problem asked for the solution at 𝑡 = 10 to
ten digits of accuracy, we could also think about what happens at 𝑡 = 20. The
figure on the left below shows the starting position.
The middle figure shows the solution for a photon at 𝑡 = 20. Instead of just one
photon, let’s solve the problem using 100 photons each with a minuscule initial
velocity angle spread of 10⁻¹⁶ for each photon (machine floating-point precision).
To put it into perspective, if these photons left from a point on the sun, they would
all land on earth in an area covered by the period at the end of this sentence. The
figure on the right shows the solution for these hundred photons. Also, see the
QR code at the bottom of this page. It’s hard to tell which solution is the right
one or how accurate our original solution in the middle actually was. J
So, what makes a problem a nice problem? To answer this question, let’s
first think about what makes up a problem. Consider the statement “𝐹 (𝑦, 𝑥) = 0
for some input data 𝑥 and some output data 𝑦.” For example, 𝑦 could equal the
value of polynomial 𝐹 with coefficients {𝑐 𝑛 , 𝑐 𝑛−1 , . . . , 𝑐 0 } of the variable 𝑥, e.g.,
𝑦 = 𝑐 𝑛 𝑥 𝑛 + 𝑐 𝑛−1 𝑥 𝑛−1 + · · · + 𝑐 1 𝑥 + 𝑐 0 . Or, 𝐹 (𝑦, 𝑥) = 0 could be a differential
equation where 𝑦 is some function and 𝑥 is the initial condition or the boundary
value, e.g., 𝑦′ = 𝑓(𝑡, 𝑦) with 𝑦(0) = 𝑥. Or, 𝐹(𝑦, 𝑥) = 0 could be an image
classifier where 𝑥 is a set of training images and 𝑦 is a set of labels. In a simple
diagram, our problem consists of input data, a model, and output data.
Well-posed problems 173
1. Direct problem: 𝐹 and 𝑥 are given and 𝑦 is unknown. We know the model
and the input—determine the output.
2. Inverse problem: 𝐹 and 𝑦 are given and 𝑥 is unknown. We know the model
and the output—determine the input.
3. Identification problem: 𝑥 and 𝑦 are given and 𝐹 is unknown. We know the input
and the output; determine the model.
A direct problem is typically easier to solve than an inverse problem, which itself
is easier to solve than an identification problem. What makes a problem difficult
to solve (either analytically or numerically)? The function 𝐹 (𝑦, 𝑥) may be an
implicit function of 𝑦 and 𝑥. The problem may have multiple solutions. The
problem may have no solution. Perhaps the problem is so sensitive that we cannot
replicate the output if the input changes by even the slightest amount. How do
we know whether a problem has a meaningful solution? To answer this question,
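French mathematician Jacques Hadamard introduced the notion of mathematical
well-posedness in 1923. He said that a problem was well-posed if
1. Existence: a solution exists,
2. Uniqueness: the solution is unique, and
3. Stability: the solution depends continuously on the input data.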
Condition number
Existence and uniqueness are straightforward. Let’s dig into stability. We say
that the output 𝑦 depends continuously on the input 𝑥 if small perturbations in
the input cause small changes in the output. That is to say, there are no sudden
jumps in the solution. Let 𝑥 + 𝛿𝑥 be a perturbation of the input and 𝑦 + 𝛿𝑦 be the
new output, and let ‖·‖ be some norm. Lipschitz continuity says that there is
some 𝜅 such that ‖𝛿𝑦‖ ≤ 𝜅‖𝛿𝑥‖ for any 𝛿𝑥. We call 𝜅 the Lipschitz constant or
the condition number. It is simply an upper bound on the ratio of the change in
the output data to the change in the input data. We define the absolute condition number as
\[ \kappa_{\mathrm{abs}}(x) = \sup_{\delta x} \frac{\|\delta y\|}{\|\delta x\|}. \]
For now, consider a problem with a unique solution 𝑦 = 𝑓 (𝑥). Perturbing the
input data 𝑥 yields 𝑦 + 𝛿𝑦 = 𝑓 (𝑥 + 𝛿𝑥). If 𝑓 is differentiable, then Taylor series
expansion of 𝑓 about 𝑥 gives us
So, if 𝑎 and 𝑏 have the same sign, then the relative condition number is 2. But,
if 𝑎 and 𝑏 have opposite signs and are close in magnitude, then the condition
number can be quite large. For example, for 𝑎 = 1000 and 𝑏 = −999, the sum is 1 but
𝜅(𝑥) = 3998. Subtraction can potentially lead to the cancellation of significant digits.
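A quick numerical experiment shows the amplification. This sketch perturbs only 𝑎 by a relative amount 𝛿 (an assumption made for simplicity; the condition number above bounds perturbations in both inputs) and measures how much larger the relative change in the sum is.

```julia
a, b = 1000.0, -999.0
δ = 1e-6                                   # relative perturbation applied to a only
exact = a + b
perturbed = a*(1 + δ) + b
amplification = abs(perturbed - exact)/abs(exact)/δ
println(amplification)                     # roughly |a|/|a + b|
```

A relative change of one part in a million in 𝑎 becomes a change of about one part in a thousand in the sum, which is consistent with a condition number in the thousands.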
The quadratic formula
\[ x_\pm = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]
is a well-known and frequently memorized solution to the quadratic equation
𝑎𝑥² + 𝑏𝑥 + 𝑐 = 0. However, when 𝑏² ≫ 4𝑎𝑐, the quadratic formula can be
Well-posed methods 175
\[ x_\pm = \frac{2c}{-b \mp \sqrt{b^2 - 4ac}}. \]
The combined formula is
\[ x_- = \frac{-b - \operatorname{sign}(b)\sqrt{b^2 - 4ac}}{2a} \quad\text{and}\quad x_+ = \frac{c}{a x_-}. \]
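The combined formula translates directly into code. The sketch below assumes real roots and a nonzero 𝑥₋; it uses `copysign` in place of sign(𝑏) so that 𝑏 = 0 does not zero out the square root. The function name `quadratic_roots` is chosen here for illustration.

```julia
function quadratic_roots(a, b, c)
    s = sqrt(b^2 - 4a*c)                   # assumes real roots: b² ≥ 4ac
    xminus = (-b - copysign(s, b))/(2a)    # both terms share a sign, so no cancellation
    xplus = c/(a*xminus)                   # the product of the roots is c/a
    return xminus, xplus
end

a, b, c = 1.0, 1e8, 1.0
naive = (-b + sqrt(b^2 - 4a*c))/(2a)       # suffers from cancellation
xminus, xplus = quadratic_roots(a, b, c)
println("naive: ", naive, "  combined: ", xplus)
```

For this choice of coefficients the small root is very nearly −10⁻⁸. The naive formula loses most of its significant digits, while the combined formula does not.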
for nonzero 𝑥. The inverse problem is ill-posed when 𝜑′(𝑥) = 0, e.g., when 𝜑(𝑥)
has a double root. Furthermore, it is ill-conditioned when 𝜑′(𝑥) is small and
well-conditioned if 𝜑′(𝑥) is large. One can derive an analogous definition for
the condition number of the problem 𝐹(𝑦, 𝑥) = 0 using the implicit function theorem.
Just as there are pathological functions and pathological problems, there are also
pathological solutions, even to well-behaved problems.
1 As you may have deduced, citardauq is simply quadratic spelled backward.
[Figure: the solution $s(t) = \bigl(1 - (1 - 1/s(0))\,\mathrm{e}^{-t}\bigr)^{-1}$ of the logistic equation for 0 ≤ 𝑡 ≤ 4.]
\[ \frac{s^{(k+1)} - s^{(k)}}{\Delta t} = s^{(k)} \bigl(1 - s^{(k)}\bigr), \tag{7.3} \]
where 𝛥𝑡 is the time step between subsequent evolutions of the solution $s^{(k)}$. By
substituting $s^{(k)} = \bigl((1 + \Delta t)/\Delta t\bigr)\, x^{(k)}$ and solving for $x^{(k+1)}$, we get an equivalent
equation called the logistic map:
\[ x^{(k+1)} = r x^{(k)} \bigl(1 - x^{(k)}\bigr) \quad\text{where } r = 1 + \Delta t. \]
This equation has the equilibrium solution $x = 1 - r^{-1}$. Let’s examine the solution
to (7.3) for 𝑥(0) = 0.25 with 𝛥𝑡 given by 1, 1.9, 2.5, and 2.8:
[Figure: four panels showing the first 20 iterates for each of the four values of 𝛥𝑡.]
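The behavior for the four time steps can be reproduced by iterating the logistic map $x^{(k+1)} = r x^{(k)}(1 - x^{(k)})$ directly. The sketch below assumes 𝑟 = 1 + 𝛥𝑡, which follows from rescaling the forward Euler scheme; the function name `logistic_map` is introduced here.

```julia
function logistic_map(r, x0, n)
    x = zeros(n + 1)
    x[1] = x0
    for k in 1:n
        x[k+1] = r*x[k]*(1 - x[k])
    end
    return x
end

for Δt in (1, 1.9, 2.5, 2.8)
    r = 1 + Δt
    x = logistic_map(r, 0.25, 1000)
    println("Δt = $Δt: equilibrium ", round(1 - 1/r, digits=4),
            ", last iterates ", round.(x[end-3:end], digits=4))
end
```

With 𝛥𝑡 = 1 and 1.9 the iterates settle on the equilibrium. With 𝛥𝑡 = 2.5 they settle into a cycle of period four, and with 𝛥𝑡 = 2.8 they wander without settling.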
2 The term logistic was coined by Belgian mathematician Pierre François Verhulst as logistique
in his 1845 article on population growth. The name refers to the “log-like” similarities of the logistic
function 1/(1 + e−𝑡 ), with time 𝑡 as the dependent variable, to the logarithm function near the origin.
output. That is, if ‖𝛿𝑦ₙ‖ ≤ 𝜅‖𝛿𝑥ₙ‖ for some 𝜅, whenever ‖𝛿𝑥ₙ‖ ≤ 𝜂 for some 𝜂.
We analogously define the absolute condition number as
\[ \kappa_{\mathrm{abs},n}(x) = \sup_{\delta x_n} \frac{\|\delta y_n\|}{\|\delta x_n\|}. \]
One would hope that for any good numerical method, the approximate
solution 𝑦 𝑛 would become a better approximation to 𝑦 as 𝑛 gets larger. A
numerical method is convergent if the numerical solution can be made as close
to the exact solution as we want—for example, by refining the mesh size and
limiting round-off error. That is, for every 𝜀 > 0, there is a 𝛿 > 0 such that
for all sufficiently large 𝑛, ‖𝑦(𝑥) − 𝑦ₙ(𝑥 + 𝛿𝑥ₙ)‖ ≤ 𝜀 whenever ‖𝛿𝑥ₙ‖ < 𝛿. It
should come as no surprise that consistency and stability imply convergence. The
following theorem, known as the Lax equivalence theorem or the Lax–Richtmyer
theorem, is so important that it is often called the fundamental theorem of
numerical analysis.
Theorem 21. A consistent method is stable if and only if it is convergent.
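where the Jacobian (𝜕𝐹ₙ/𝜕𝑦) is evaluated at some point in the neighborhood of
𝑦(𝑥). Therefore,
\[ y(x) - y_n(x) = \left(\frac{\partial F_n}{\partial y}\right)^{-1} \bigl[ F_n(y(x), x) - F_n(y_n(x), x) \bigr]. \]
Since 𝐹ₙ(𝑦ₙ(𝑥), 𝑥) = 0 and 𝐹(𝑦(𝑥), 𝑥) = 0, it follows that
\[ \|y(x) - y_n(x)\| \le M\, \|F_n(y(x), x) - F(y(x), x)\|, \]
where 𝑀 = ‖(𝜕𝐹ₙ/𝜕𝑦)⁻¹‖. Because the method is consistent, for every 𝜀 > 0
there is an 𝑁 such that ‖𝐹ₙ(𝑦(𝑥), 𝑥) − 𝐹(𝑦(𝑥), 𝑥)‖ < 𝜀 for all 𝑛 > 𝑁. So,
‖𝑦(𝑥) − 𝑦ₙ(𝑥)‖ ≤ 𝑀𝜀 for all sufficiently large 𝑛.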
We will use this result to show that convergence implies stability. If a
numerical method is consistent and convergent, then
\[ \begin{aligned}
\|\delta y_n\| &= \|y_n(x) - y_n(x + \delta x_n)\| \\
&= \|y_n(x) - y(x) + y(x) - y_n(x + \delta x_n)\| \\
&\le \|y_n(x) - y(x)\| + \|y(x) - y_n(x + \delta x_n)\| \\
&\le M\varepsilon + \varepsilon'
\end{aligned} \]
where the bound 𝑀𝜀 comes from consistency and the bound 𝜀′ comes from
convergence. Choose 𝜀 and 𝜀′ to be less than ‖𝛿𝑥ₙ‖. Then
\[ \|\delta y_n\| \le (M + 1)\, \|\delta x_n\|, \]
which says that the method is stable. Now, we’ll show stability implies convergence.
If a numerical method is consistent and stable, then
\[ \|y(x) - y_n(x + \delta x_n)\| \le \|y(x) - y_n(x)\| + \|y_n(x) - y_n(x + \delta x_n)\| \le M\varepsilon + \kappa \|\delta x_n\| \]
where the bound 𝑀𝜀 comes from consistency and the bound 𝜅‖𝛿𝑥ₙ‖ comes from
stability. By choosing ‖𝛿𝑥ₙ‖ ≤ 𝜀/𝜅, we have
\[ \|y(x) - y_n(x + \delta x_n)\| \le (M + 1)\varepsilon. \]
problem        𝐹(𝑦, 𝑥) = 0
method         𝐹ₙ(𝑦ₙ, 𝑥ₙ) = 0
consistency    𝐹ₙ(𝑦, 𝑥) → 𝐹(𝑦, 𝑥) as 𝑛 → ∞
stability      if ‖𝛿𝑥ₙ‖ ≤ 𝜂, then ‖𝛿𝑦ₙ‖ ≤ 𝜅‖𝛿𝑥ₙ‖ for some 𝜅
convergence    for any 𝜀 there are 𝜂 and 𝑁 such that if 𝑛 > 𝑁
               and ‖𝛿𝑥ₙ‖ ≤ 𝜂, then ‖𝑦(𝑥) − 𝑦ₙ(𝑥 + 𝛿𝑥ₙ)‖ ≤ 𝜀
equivalence    consistency & stability ↔ convergence
Programming languages support several numerical data types: often 64-bit and
32-bit floating-point numbers for real and complex data; 64-bit, 32-bit, and
16-bit integers and unsigned integers; 8-bit ASCII and up to 32-bit Unicode
characters, and 1-bit Booleans for “true” and “false” information. A complex
number is typically stored as an array of two floating-point numbers. Other
data types exist, including arbitrary-precision arithmetic (also called multiple-precision
and bignum), often used in cryptography and in computer algebra
systems like Mathematica. For example, Python implements arbitrary-precision
integers for all integer arithmetic—the only limitation is computer memory. And
it does not support 32-bit floating-point numbers, only 64-bit numbers. The
Julia data types BigInt and BigFloat implement arbitrary-precision integer and
floating-point arithmetic using the GNU MPFR and GNU Multiple Precision
Arithmetic Libraries. The function BigFloat(x,precision=256) converts the
value x to a BigFloat type with precision set by an optional argument. The
command BigFloat(π,precision=p) returns roughly the first 𝑝/log₂ 10 digits of 𝜋. The command
precision(x) returns the precision of a variable x.
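A short session illustrates these types. This is a minimal sketch that assumes the keyword form `precision=` of the `BigFloat` constructor and a 64-bit default integer type.

```julia
x = BigFloat(π, precision=256)
digits10 = floor(Int, precision(x)/log2(10))   # roughly how many decimal digits are kept
println(x)
println(digits10)

# fixed-width integers wrap around on overflow, while BigInt does not
println(Int64(2)^64)
println(big(2)^64)
```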
It is important to emphasize that computers cannot mix data types. They
instead either implicitly or explicitly convert one data type into another. For
example, Julia interprets a=3 as an integer and b=1.5 as a floating-point number.
To compute c=a*b, a language like Julia automatically converts (or promotes)
3 to the floating-point number 3.0 so that precision is not lost and returns 4.5.
Such automatic, implicit promotion lends itself to convenient syntax. Still, you
should be cautious to avoid gotchas. In Julia, the calculation 4/5 returns 0.8. In
C, the calculation 4.0/5.0 returns 0.8, but 4/5 returns 0. By explicitly defining
x=uint64(4) as an integer in Matlab, it is cast as a floating-point number to
perform the calculation x/5, returning an integer value 1. In this manner, x/5*4
returns 4 but 4*x/5 returns 3.
The function typeof returns the data type of a variable.
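The promotion rules described above can be inspected directly. The following is a small sketch using only base Julia functions.

```julia
a, b = 3, 1.5
c = a*b                                   # a is promoted to 3.0 before multiplying
println((typeof(a), typeof(b), typeof(c)))
println(promote(a, b))                    # the explicit form of the same conversion

# three kinds of division on integer arguments
println((4/5, 4÷5, 4//5))                 # floating-point, truncated integer, rational
```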
[Diagram: bit fields for the sign 𝑠, exponent 𝑒, and mantissa 𝑚 of a single-precision number.]
and 64 bits (8 bytes) with 1 sign bit, 11 exponent bits, and 52 fraction mantissa
bits to represent a double-precision number.
[Diagram: bit fields for the sign 𝑠, exponent 𝑒, and mantissa 𝑚 of a double-precision number.]
Floating-point arithmetic 181
The exponents are biased to range between −126 and 127 in single precision
and −1022 and 1023 in double precision. With 24 and 53 significant bits (counting
the hidden bit), single and double
precision numbers have about 7 and 16 decimal digits of precision, respectively.
The floating-point representation is $(-1)^s \times (1 + \text{mantissa}) \times 2^{(\text{exponent} - b)}$ where
the bias 𝑏 = 127 for single precision and 𝑏 = 1023 for double precision. For
example, the single-precision numbers
The mantissa is normalized between 0 and 1 and the leading “hidden bit” in
the mantissa is implicitly a one. Normalizing the numbers in this way gives the
floating-point representation an extra bit of precision.
The bitstring function converts a floating-point number to a string of bits.
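To make the representation concrete, the sketch below unpacks the three bit fields of a normalized single-precision number and reassembles its value. The helper name `decode` is introduced here; subnormal numbers, infinities, and NaN are not handled.

```julia
function decode(x::Float32)
    bits = reinterpret(UInt32, x)
    s = bits >> 31                        # 1 sign bit
    e = Int((bits >> 23) & 0xff)          # 8 exponent bits
    m = (bits & 0x007fffff)/2^23          # 23 mantissa bits scaled to [0, 1)
    return (s == 0 ? 1 : -1)*(1 + m)*2.0^(e - 127)
end

println(bitstring(0.15625f0))
println(decode(0.15625f0))
println(decode(-3.5f0))
```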
Only a finite set of numbers can be represented exactly. All other numbers must
be approximated. The machine epsilon or unit round-off is the distance between
1 and the next exactly representable number greater than 1.
The function eps() returns the double-precision machine epsilon of 2⁻⁵². The
function eps(x) returns the distance from 𝑥 to the next larger representable number. For example, eps(0.0) returns
5.0e-324 and eps(Float32(0)) returns 1.0f-45.
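A few one-line checks, using only base Julia, show how the spacing between floating-point numbers behaves.

```julia
println(eps() == 2.0^-52)                 # machine epsilon in double precision
println(nextfloat(1.0) - 1.0 == eps())    # the gap between 1 and its successor
println(1.0 + eps()/2 == 1.0)             # half a gap is rounded away
println(eps(1.0e16))                      # the gap grows with the magnitude of x
```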
Example. One algorithm buried in the source code of the 1999 Quake III
Arena video game has attracted particular notoriety. The cryptic and seemingly
obfuscated C code is translated to Julia code below.3
function Q_rsqrt(x::Float32)
  i = reinterpret(Int32,Float32(x))        # line 1
  i = Int32(0x5f3759df) - (i>>1)           # line 2
  y = reinterpret(Float32,i)               # line 3
  y = y * (1.5f0 - (0.5f0 * x * y * y))    # line 4
end
This function, often called the fast inverse square root function, approximates the
reciprocal square root of a single-precision floating-point number, i.e., 1/sqrt(x).
Video game developers use the reciprocal square root to normalize vectors for
shading and reflection in 3D generated scenes. In the late 1990s, Q_rsqrt(x) was
considerably faster than directly computing 1.0/sqrt(x) and its approximation
was good enough for a video game. Subsequent advances in computer hardware
in the early 2000s have eliminated the need for such a hack, even for video games.
How does the fast inverse square root algorithm work? Lines 1, 2, and
3 make an initial approximation of $x^{-1/2}$, and line 4 performs one iteration of
Newton’s method.
3 The developer’s comments that accompanied the source code offered little clarity: “evil floating
point bit level hacking” for line 1 , “what the fuck?” for the line 2 , and “1st iteration” for line 4 .
The inspiration for the algorithm can be traced through Cleve Moler to mathematicians William
Kahan (the principal architect behind the IEEE 754 standard) and K. C. Ng in the mid-1980s.
approximation is
\[ \delta = 1 - \frac{1 + \log(\log 2)}{\log 2}. \]
Choosing the shift 𝜀 = ½𝛿 ≈ 0.0430 will minimize the uniform error across
𝑠 ∈ [0, 1]. Now we just need to compute (7.6) by approximating
log₂(1 + 𝑠) with 𝑠 + 𝜀. By letting [0 | 𝑒_𝑦 | 𝑚_𝑦] and [0 | 𝑒_𝑥 | 𝑚_𝑥] be the
single-precision floating-point representations of 𝑦 and 𝑥, respectively, we only
need to express 𝑒_𝑦 in terms of 𝑒_𝑥 and 𝑚_𝑦 in terms of 𝑚_𝑥:
\[ \begin{aligned}
(e_y - 127) + 2^{-23} m_y + \varepsilon
&= -\tfrac{1}{2}\bigl[(e_x - 127) + 2^{-23} m_x + \varepsilon\bigr] \\
&= \bigl(-\tfrac{1}{2} e_x - 127\bigr) + 2^{-23}\bigl(-\tfrac{1}{2} m_x\bigr) + \varepsilon + \tfrac{3}{2}(127 - \varepsilon) \\
&= \bigl(-\tfrac{1}{2} e_x - 127\bigr) + 2^{-23}\bigl[-\tfrac{1}{2} m_x + 2^{23} \cdot \tfrac{3}{2}(127 - \varepsilon)\bigr] + \varepsilon.
\end{aligned} \]
relative error after one Newton iteration, not the error going to the Newton iteration as is done here.
Moroz et al. find the optimal magic number 5f37642f₁₆ using an exhaustive search.
using BenchmarkTools
@btime Q_rsqrt(Float32(x)) setup=(x=rand()) seconds=3
@btime 1/sqrt(x) setup=(x=rand()) seconds=3
We find that both methods take around two nanoseconds, and Q_rsqrt is significantly
less accurate. The average relative error is 2 × 10⁻² after the first
approximation. It is 1 × 10⁻³ after the Newton iteration, good enough for a late
1990s shooter game. Were we to add another Newton iteration, the relative error
would drop to 2 × 10⁻⁶.
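The quoted error levels can be estimated with a short experiment. The sketch below redefines `Q_rsqrt` with a hypothetical `iterations` keyword (not part of the original code) so that the error can be measured after zero, one, and two Newton iterations. The sampling interval is an arbitrary choice.

```julia
function Q_rsqrt(x::Float32; iterations=1)
    i = reinterpret(Int32, x)
    i = Int32(0x5f3759df) - (i >> 1)
    y = reinterpret(Float32, i)
    for _ in 1:iterations                  # Newton iterations for 1/y² - x = 0
        y = y * (1.5f0 - 0.5f0*x*y*y)
    end
    return y
end

relerr(x; kw...) = abs(Q_rsqrt(x; kw...)*sqrt(Float64(x)) - 1)
xs = Float32.(range(0.01, 100, length=10_000))

for k in 0:2
    errs = [relerr(x; iterations=k) for x in xs]
    println("iterations = $k: mean ", sum(errs)/length(errs), ", max ", maximum(errs))
end
```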
Round-off error
Example. Siegfried Rump of the Institute for Reliable Computing posited the
following example of catastrophic cancellation. Consider the calculation
\[ y = 333.75 b^6 + a^2 (11 a^2 b^2 - b^6 - 121 b^4 - 2) + 5.5 b^8 + \frac{a}{2b} \]
with
a = 77617; b = 33096
Entering this calculation into Julia returns approximately 1.641 × 10²¹. Matlab and R both return about
−1.181 × 10²¹, Python returns about +1.181 × 10²¹, and Fortran computes it to
be about 1.173. What’s going on? If we rewrite the calculation as
𝑦 = 𝑧 + 𝑥 + 𝑎/(2𝑏), where 𝑧 = 333.75𝑏⁶ + 𝑎²(11𝑎²𝑏² − 𝑏⁶ − 121𝑏⁴ − 2) and 𝑥 = 5.5𝑏⁸, then
𝑧 = −7917111340668961361101134701524942850 and
𝑥 = +7917111340668961361101134701524942848.
The condition number of the sum 𝑧 + 𝑥 is
\[ \kappa = \frac{|z| + |x|}{|z + x|} \approx 10^{36}, \]
which is considerably larger than the capacity of the mantissa for double-precision
floating-point numbers.
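Exact rational arithmetic sidesteps the cancellation. The sketch below evaluates the same expression with `Rational{BigInt}` values, writing 333.75 as 1335/4 and 5.5 as 11/2, and compares it with a plain double-precision evaluation. The helper name `rump` is introduced here.

```julia
a, b = big(77617), big(33096)
z = 1335//4*b^6 + a^2*(11a^2*b^2 - b^6 - 121b^4 - 2)
x = 11//2*b^8
y = z + x + a//(2b)                        # exact rational result

rump(a, b) = 333.75b^6 + a^2*(11a^2*b^2 - b^6 - 121b^4 - 2) + 5.5b^8 + a/(2b)
naive = rump(77617.0, 33096.0)             # double precision

println("z + x = ", z + x)
println("exact y = ", y, " ≈ ", Float64(y))
println("double precision y = ", naive)
```

The exact value is close to −0.827, which none of the double-precision results quoted above come near.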
Because the number of bits reserved for the exponent of a floating-point number
is limited, floating-point arithmetic has an upper bound (closest to infinity) and a
lower bound (closest to zero). Any computation outside this bound results in
either an overflow, producing inf, or an underflow, producing 0.
The command floatmax(Float64) returns the largest floating-point number, which is one
bit less than 2¹⁰²⁴ or about 1.8 × 10³⁰⁸. The command floatmin(Float64) returns
the smallest normalized floating-point number, which is 2⁻¹⁰²² or about 2.2 × 10⁻³⁰⁸.
IEEE 754 uses gradual underflow rather than abrupt underflow by using denormalized
numbers. Below the smallest normalized floating-point number 2⁻¹⁰²², values gradually
underflow as $(-1)^s \times (0 + \text{mantissa}) \times 2^{1-b}$ and lose bits in the mantissa. Numbers
can be as small as eps(0.0) = floatmin(Float64)*eps(1.0) before complete
underflow. Operations that produce NaN include NaN*inf, inf-inf, inf*0, and inf/inf. Note that while
mathematically 0⁰ is not defined, IEEE 754 defines it as 1.
The commands isinf and isnan check for overflows and NaN. The value NaN can
be used to break up a plot, e.g., plot([1,2,2,2,3],[1,2,NaN,1,2]).
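These limits are easy to probe. The following sketch uses only base Julia functions; note that Julia spells infinity as `Inf`.

```julia
overflow  = 2*floatmax(Float64)           # exceeds the largest number
subnormal = floatmin(Float64)/2           # below the smallest normalized number
smallest  = nextfloat(0.0)                # the smallest positive subnormal number
underflow = smallest/2                    # nothing left to represent it

println((overflow, subnormal, smallest, underflow))
println((Inf - Inf, Inf*0, Inf/Inf))
println(0.0^0.0)
```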
Rounding error
Example. When 𝑥 is almost zero, the quantity $\mathrm{e}^x$ is very close to one, and
calculating $\mathrm{e}^x - 1$ can be affected by rounding error. We can avoid rounding
error when 𝑥 = 𝑜(eps) by instead computing its Taylor series approximation
\[ \mathrm{e}^x - 1 = \bigl(1 + x + \tfrac{1}{2} x^2 + \tfrac{1}{6} x^3 + \cdots\bigr) - 1 = x + \tfrac{1}{2} x^2 + o(\mathrm{eps}^2). \]
The functions expm1 and log1p compute e 𝑥 − 1 and log(𝑥 + 1) more precisely than
exp(x)-1 and log(x+1) in a neighborhood of zero.
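The difference is easy to see for a small argument. This sketch uses the first two Taylor terms as a reference value, which is an assumption that holds here because the next term is far below double precision for 𝑥 = 10⁻¹⁰.

```julia
x = 1e-10
reference = x + x^2/2                     # the next term x³/6 is negligible here
naive_error = abs(exp(x) - 1 - reference)/reference
expm1_error = abs(expm1(x) - reference)/reference
println("relative error of exp(x)-1: ", naive_error)
println("relative error of expm1(x): ", expm1_error)
```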
Truncation error
The truncation error of a numerical method is the error that results from using
a finite or discrete approximation to an infinite series or continuous function.
For example, we can compute the approximation to the derivative of a function
using a first-order finite difference approximation. Consider the Taylor series
\[ f(x + h) = f(x) + h f'(x) + \tfrac{1}{2} h^2 f''(\xi) \]
5 The building of the defunct Vancouver Stock Exchange now has a swanky cocktail bar.
Figure 7.2: The total error for a first-order finite difference approximation. The
truncation error decreases as ℎ gets smaller and the round-off error increases as ℎ
gets smaller. The minimum is near ℎ = 10⁻⁸.
\[ f'(x) = \frac{f(x + h) - f(x)}{h} - \tfrac{1}{2} h f''(\xi) \]
\[ f'(x) \approx \frac{f(x + h) - f(x)}{h} \]
\[ \varepsilon'(h) = \tfrac{1}{2} m - 2\,\mathrm{eps}/h^2 = 0. \]
So, the error has a minimum at $h = 2\sqrt{\mathrm{eps}/m}$.
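The trade-off between truncation error and round-off error can be observed by sweeping the step size. The sketch below uses 𝑓 = sin at 𝑥 = 1 as an arbitrary test case and takes 𝑚 = |𝑓″(𝑥)| as the bound on the second derivative.

```julia
f, df, x = sin, cos, 1.0
hs = 10.0 .^ (-15:-1)
errors = [abs((f(x + h) - f(x))/h - df(x)) for h in hs]

m = abs(sin(x))                            # |f″(x)| for f = sin
hpredicted = 2*sqrt(eps()/m)

for (h, e) in zip(hs, errors)
    println("h = ", h, "  error = ", e)
end
println("smallest observed error at h = ", hs[argmin(errors)])
println("predicted minimizer: ", hpredicted)
```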
7.6 Exercises
192 Solutions to Nonlinear Equations
c = (a+b)/2
sign(f(c)) == sign(f(a)) ? a = c : b = c
half the size of the previous one. The error at the 𝑘th step is 𝑒 (𝑘) = 𝑥 (𝑘) − 𝑥 ∗ . So
\[ 0 = f(x^*) = f(x) + (x^* - x) f'(x) + \tfrac{1}{2} (x^* - x)^2 f''(\xi) \]
\[ x^* = x - \frac{f(x)}{f'(x)} + o(x^* - x). \]
\[ x^{(k+1)} = x^{(k)} - \frac{f\bigl(x^{(k)}\bigr)}{f'\bigl(x^{(k)}\bigr)}. \tag{8.1} \]
The correct digits are in boldface. The first iteration gets one correct digit. The
second iteration gets three. Then six. Then 12, 25, and finally 48 at the sixth
iteration. As we should expect for a quadratic method, the number of correct
decimals doubles with each iteration. J
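Quadratic convergence is easy to observe in extended precision. The sketch below applies Newton's method to 𝑓(𝑥) = 𝑥² − 2 (an arbitrary test problem, not necessarily the one tabulated above) using BigFloat at its default 256-bit precision. The function name `newton` is introduced here.

```julia
function newton(f, df, x, iterations)
    history = [x]
    for _ in 1:iterations
        x -= f(x)/df(x)
        push!(history, x)
    end
    return history
end

xs = newton(x -> x^2 - 2, x -> 2x, big(1.0), 7)
errors = abs.(xs .- sqrt(big(2)))
ratios = errors[2:end] ./ errors[1:end-1].^2   # tends to |f″/(2f′)| at the root

for (k, e) in enumerate(errors)
    println("iteration ", k - 1, ": error ≈ ", Float64(e))
end
```

The exponent of the error roughly doubles with each iteration until it reaches the working precision.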
Note from (8.2) that if 𝑥* is a double root, then 𝑓′(𝑥*) = 0 and the error
$e^{(k+1)} \approx \tfrac{1}{2} e^{(k)}$. In this case, Newton’s method only converges linearly: the
Newton’s method 195
Secant method
Newton’s method finds the point where the line of slope $f'\bigl(x^{(k)}\bigr)$
passing through the point $\bigl(x^{(k)}, f\bigl(x^{(k)}\bigr)\bigr)$ intersects the 𝑥-axis. See Figure 8.2 above. Determining
the derivative 𝑓′ can be difficult analytically and numerically.
But we can use
other slopes $q^{(k)}$ that are suitable approximations to $f'\bigl(x^{(k)}\bigr)$ to get a method:
\[ x^{(k+1)} = x^{(k)} - f\bigl(x^{(k)}\bigr)/q^{(k)}. \tag{8.3} \]
Higher-order methods
We could design methods that converge faster than second order. For example,
there are several higher-order extensions of Newton’s method called Householder
methods. But, it’s often unnecessary to go through the trouble of implementing
higher-order methods because we can achieve the same net result with multiple
iterations of a superlinear method. For instance, two iterations of the second-order
Newton’s method is a fourth-order method.
The secant method uses linear interpolation to find $x^{(k+1)}$. One way to improve
convergence is by interpolating with a quadratic function instead. Consider the
parabola 𝑦 = 𝑎𝑥² + 𝑏𝑥 + 𝑐 which passes
through the points $\bigl(x^{(k-2)}, f\bigl(x^{(k-2)}\bigr)\bigr)$,
$\bigl(x^{(k-1)}, f\bigl(x^{(k-1)}\bigr)\bigr)$, and $\bigl(x^{(k)}, f\bigl(x^{(k)}\bigr)\bigr)$. Because the parabola likely crosses the
𝑥-axis at two points, we’ll need to choose the best one. This method, called
Müller’s method, has an order of convergence of about 1.84. A similar method
called inverse quadratic approximation uses the parabola 𝑥 = 𝑎𝑦² + 𝑏𝑦 + 𝑐, which
crosses the 𝑥-axis at $x^{(k+1)} = c$. These methods need three points rather than just
the two needed for the secant method or the one needed for Newton’s method.
[Figure: one step of the secant method, Müller’s method, and inverse quadratic approximation.]
Dekker–Brent method
Newton’s method and the secant method are not globally convergent. An iterated
initial guess can converge to a different root, go off to infinity, or get trapped in
Example. The function $\operatorname{sign}(x)\sqrt{|x|}$ is continuous everywhere and differentiable everywhere
except at the zero 𝑥 = 0. Applying Newton’s method gives the iteration
The bisection method is robust, but it is only linearly convergent. The secant
method has an order of convergence of about 1.62, but it is only locally
convergent. The Dekker–Brent method, a hybrid of the bisection and secant
methods, takes advantage of the strengths of both.
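The robustness and the linear rate of the bisection method are both visible in a few lines. The following is a minimal sketch, assuming 𝑎 < 𝑏 and a sign change of 𝑓 over the bracket. The function name `bisection` and its return of an iteration count are choices made here for illustration.

```julia
function bisection(f, a, b; tol=1e-12)
    sign(f(a)) == sign(f(b)) && error("the root is not bracketed")
    count = 0
    while b - a > tol
        c = (a + b)/2
        if sign(f(c)) == sign(f(a))
            a = c                          # the root lies in the right half
        else
            b = c                          # the root lies in the left half
        end
        count += 1
    end
    return (a + b)/2, count
end

root, count = bisection(x -> x^2 - 2, 1.0, 2.0)
println("root ≈ ", root, " after ", count, " iterations")
```

Each iteration halves the bracket, so reducing a bracket of width one to 10⁻¹² takes 40 iterations regardless of the function.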
1 3
0 0
0 0.5 1 1.5 0 1 2 3
Figure 8.3: The fixed point iterations for 𝜙(𝑥) = cos 𝑥, which converges, and for
𝜙(𝑥) = 3 sin 𝑥, which does not converge.
The NLsolve.jl function nlsolve(f,x0) uses the trust region method to find the zero
of the input function. The Roots.jl library has several procedures, such as fzero(f,x0)
for finding roots.
If you type any number into a calculator and repeatedly hit the cosine button,
the display eventually stops changing and gives you 0.7390851…. We call this
number a fixed point. A fixed point of a function 𝜙(𝑥) is any point 𝑥 such that
𝑥 = 𝜙(𝑥). The simple geometric interpretation of a fixed point is the intersection
of the line 𝑦 = 𝑥 and the curve 𝑦 = 𝜙(𝑥). We can study Newton’s method as a
special case of a broader class of numerical methods called fixed-point iterations:
\[ x^{(k+1)} = \phi\bigl(x^{(k)}\bigr). \]
In fact, for any function 𝑓 , we can always write the problem 𝑓 (𝑥) = 0 as a
fixed-point iteration 𝑥 = 𝜙(𝑥) where 𝜙(𝑥) = 𝑥 − 𝑓 (𝑥). But we are not guaranteed
that a fixed-point iteration will converge. Suppose we take 𝜙(𝑥) = 3 sin 𝑥. The
function has fixed points at 0, at roughly +2.279, and at roughly −2.279. A fixed-
point iteration for any point that does not already start precisely at these three
points will never converge. Instead, it will bounce around in the neighborhood of
either of the two nonzero fixed points. See the figure above. Let’s identify the
conditions that do guarantee convergence.
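A minimal sketch of the two iterations in Figure 8.3, assuming a hypothetical helper `fixed_point` that applies $\phi$ a fixed number of times (the name and the iteration count are illustrative and not from the text):

```julia
# Apply the map ϕ repeatedly, starting from x (hypothetical helper).
function fixed_point(ϕ, x; n=100)
    for _ in 1:n
        x = ϕ(x)
    end
    return x
end

x_cos = fixed_point(cos, 1.0)            # settles near 0.7390851
x_sin = fixed_point(x -> 3sin(x), 1.0)   # keeps wandering, no convergence expected

println((x_cos, abs(cos(x_cos) - x_cos)))    # residual is essentially zero
println((x_sin, abs(3sin(x_sin) - x_sin)))   # residual typically stays large
```

The residual $|\phi(x) - x|$ is a quick check of whether the iteration has actually landed on a fixed point.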
We say that $\phi(x)$ is a contraction mapping over the interval $[a, b]$ if $\phi$ is a
differentiable function with $\phi : [a, b] \to [a, b]$ and there is a $K < 1$ such that
$|\phi'(x)| \le K$ for all $x \in [a, b]$. The definition of a contraction mapping can be
generalized as a map of a metric space into itself with Lipschitz constant $K < 1$.
In this generalization, the following contraction mapping theorem is
known as the Banach fixed-point theorem.
198 Solutions to Nonlinear Equations
Theorem 22. If the function $\phi$ is a contraction map over $[a, b]$, then $\phi$ has
a unique fixed point $x^* \in [a, b]$ and the fixed-point method $x^{(k+1)} = \phi(x^{(k)})$
converges to $x^*$ for any $x^{(0)} \in [a, b]$.
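for some $\xi^{(k)} \in [x^{(k)}, x^*]$. By definition $x^{(k+1)} = \phi(x^{(k)})$ and $x^* = \phi(x^*)$, so
$$x^{(k+1)} - x^* = \frac{1}{p!}\,\phi^{(p)}\bigl(\xi^{(k)}\bigr)\bigl(x^{(k)} - x^*\bigr)^p.$$
Dividing by $(x^{(k)} - x^*)^p$ and taking the limit,
$$\lim_{k\to\infty} \frac{e^{(k+1)}}{\bigl(e^{(k)}\bigr)^p} = \lim_{k\to\infty} \frac{x^{(k+1)} - x^*}{\bigl(x^{(k)} - x^*\bigr)^p} = \lim_{k\to\infty} \frac{1}{p!}\,\phi^{(p)}\bigl(\xi^{(k)}\bigr) = \frac{1}{p!}\,\phi^{(p)}(x^*).$$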
method had earlier been described by the Japanese mathematician Seki Kōwa in his 1642 book Hatubi
Sanpō. Kōwa used enri (円理, circle theory) to approximate $\pi$, taking $c_{16} + \frac{(c_{16}-c_{15})(c_{17}-c_{16})}{(c_{16}-c_{15})-(c_{17}-c_{16})}$,
where $c_n$ is the perimeter of a $2^n$-gon inscribed in a circle of unit diameter.
2 Alexander Aitken had a prodigious memory. He would recall numbers and long text several
decades later. Aitken would also relive the war trauma from his experiences as an infantry soldier.
He coped with recurrent psychological breakdowns by writing a memoir Gallipoli to The Somme.
The acclaimed book was the inspiration behind Anthony Ritchie’s 2016 oratorio.
Stopping criteria
We can stop an iterative method once the error is within our desired tolerance.
But without explicitly knowing the solution, we can’t explicitly compute the error.
Instead, we might use the residual or the increment as a proxy for the error and
stop when one of them falls below a prescribed tolerance $\varepsilon$. The residual $\phi(x^{(k)}) - x^{(k)}$
is the same as the increment $\Delta x^{(k+1)} = x^{(k+1)} - x^{(k)}$ for a fixed-point method.
The error of the fixed-point method is
$$e^{(k+1)} = x^* - x^{(k+1)} = \phi(x^*) - \phi(x^{(k)}) \approx \phi'(x^*)\,e^{(k)}.$$
So the increment is
$$\Delta x^{(k+1)} = e^{(k)} - e^{(k+1)} \approx (1 - \phi'(x^*))\,e^{(k)},$$
and hence
$$e^{(k)} \approx \frac{\Delta x^{(k+1)}}{1 - \phi'(x^*)}.$$
For the fixed-point function $\phi(x) = x - f(x)$, we have $\phi'(x^*) = 1 - f'(x^*)$, and the error is
approximately $\Delta x^{(k+1)} / f'(x^*)$. If $\phi'(x^*)$ is close to 1 and equivalently $f'(x^*)$
is close to 0, the error could be quite large relative to the tolerance $\varepsilon$. In this case, using the
increment will not be a reliable stopping criterion.
Designing robust stopping criteria is not trivial. For example, consider
Newton’s method applied to the function $f(x) = \mathrm{e}^{-\lambda x}$ for a positive value $\lambda$.
While there are no solutions, the method will evaluate $x^{(k+1)} = x^{(k)} + \lambda^{-1}$ at
each step. In this case, the increment is always $\lambda^{-1}$, and using the increment as a
stopping criterion will result in an endless loop. Using the residual $\mathrm{e}^{-\lambda x}$ as a
stopping criterion will eventually terminate the method. Still, the solution will
hardly be correct, and if $\lambda$ is really small, the method may require an unreasonably
large number of iterations before it terminates. So, it is advisable to use brackets
or some other exception handling when using an iterative method.
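The $\mathrm{e}^{-\lambda x}$ example can be tried directly. The sketch below is one possible arrangement, not the text's own code: the name `newton_stop`, the returned fields, and the choice $\lambda = 0.5$ are all assumptions made for illustration. It checks the increment and the residual separately and caps the iteration count as a simple safeguard.

```julia
# Newton's method with two stopping tests and an iteration cap (illustrative).
function newton_stop(f, df, x; tol=1e-10, maxiter=1000)
    for k in 1:maxiter
        Δx = -f(x) / df(x)
        x += Δx
        abs(Δx) < tol   && return (x=x, iterations=k, reason=:increment)
        abs(f(x)) < tol && return (x=x, iterations=k, reason=:residual)
    end
    return (x=x, iterations=maxiter, reason=:maxiter)
end

λ = 0.5
r1 = newton_stop(x -> exp(-λ*x), x -> -λ*exp(-λ*x), 0.0)
r2 = newton_stop(x -> x^2 - 2, x -> 2x, 1.0)
println(r1)   # stops on the residual test, far from any actual zero
println(r2)   # converges to √2
```

For the exponential, every step is exactly $\lambda^{-1}$, so only the residual test (or the cap) can end the loop, and the returned point is not a root.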
Dynamical systems 201
Suppose that we want to solve the simple equation $x^2 - x + c = 0$ for a given value
$c$ using the fixed-point method. While we could easily compute the solution
using the quadratic formula, a toy problem like this will help us better understand
the dynamics of the fixed-point method.3 Start by rewriting the equation as
$x = \phi_c(x)$ where $\phi_c(x) = x^2 + c$, and then iterate $x^{(k+1)} = \phi_c(x^{(k)})$. This iterated
equation is known as the quadratic map. The quadratic map is equivalent to the
logistic map $x^{(k+1)} = r x^{(k)}(1 - x^{(k)})$ introduced on page 176 as a simple model
of rabbit population dynamics. If we start with the logistic map and first scale
the variables by $-r^{-1}$ and next translate them by $\tfrac12$, we have the quadratic map
$x^{(k+1)} = (x^{(k)})^2 + c$ where $c = \tfrac14 r(2 - r)$.
By the fixed-point theorem, as long as $|\phi_c'(x)| < 1$ in a neighborhood of
a fixed point $x^*$, any initial guess in that neighborhood will converge to that
fixed point. Because $|\phi_c'(x)| = |2x|$, any fixed point $x^*$ in the interval $(-\tfrac12, \tfrac12)$
is attractive. This happens when $-\tfrac34 < c < \tfrac14$. When $c > \tfrac14$, the equation
$x^2 - x + c = 0$ no longer has a real solution. What about when $c < -\tfrac34$? Let’s
look at a few cases.
[Figure: cases 1–4]
Why does the solution have this periodic limiting behavior? Take the iterated
function compositions $\phi_c^2 = \phi_c \circ \phi_c$, $\phi_c^3 = \phi_c \circ \phi_c \circ \phi_c$, and so on. The forward
orbit $x^{(0)}, \phi_c(x^{(0)}), \phi_c^2(x^{(0)}), \phi_c^3(x^{(0)}), \dots$ can be decomposed as
$$\{x^{(0)}, \phi_c^2(x^{(0)}), \phi_c^4(x^{(0)}), \dots\} \cup \{x^{(1)}, \phi_c^2(x^{(1)}), \phi_c^4(x^{(1)}), \dots\}$$
where $x^{(1)} = \phi_c(x^{(0)})$. Plot 2 below shows $y = \phi_c^2(x)$ and $y = x$.
3 Robert May’s influential 1976 review article “Simple mathematical models with very compli-
cated dynamics,” which popularized the logistic map, examined how biological feedback loops can
lead to chaos.
[Figure: plots 1–4]
Notice that the function $\phi_c^2(x)$ has three fixed points, two of which are attractive
with $|(\phi_c^2)'(x)| < 1$. Starting with $x^{(0)}$ leads us to one fixed point under the
mapping $\phi_c^2$, and starting with $x^{(1)}$ leads us to the other fixed point. Similarly,
we can further decompose
$$\begin{aligned}
&\{x^{(0)}, \phi_c^2(x^{(0)}), \phi_c^4(x^{(0)}), \dots\} \cup \{x^{(1)}, \phi_c^2(x^{(1)}), \phi_c^4(x^{(1)}), \dots\} = \\
&\quad \{x^{(0)}, \phi_c^4(x^{(0)}), \phi_c^8(x^{(0)}), \dots\} \cup \{x^{(2)}, \phi_c^4(x^{(2)}), \phi_c^8(x^{(2)}), \dots\} \cup {} \\
&\quad \{x^{(1)}, \phi_c^4(x^{(1)}), \phi_c^8(x^{(1)}), \dots\} \cup \{x^{(3)}, \phi_c^4(x^{(3)}), \phi_c^8(x^{(3)}), \dots\}.
\end{aligned}$$
Plot 3 above shows $y = \phi_c^4(x)$ and $y = x$. Notice that the function $\phi_c^4(x)$ has
four fixed points where $|(\phi_c^4)'(x)| < 1$.
As $c$ is made more and more negative, the period of the limit cycle of the
forward orbit of $x^{(0)} = 0$ continues to double at a faster and faster rate. The first
doubling happens at $c = -0.75$ and the second doubling at $c = -1.25$. Then at
about $-1.368$, the period doubles again. The ratio of the distances between
successive period-doubling values of $c$ approaches a value called the Feigenbaum
constant, and when $c \approx -1.401155$, the period of
the limit cycle approaches infinity. At this point, the map is chaotic. But as $c$ is
made still more negative, a finite period cycle re-emerges only to again double
and re-double. See the QR code at the bottom of this page and the bifurcation
diagram on the next page showing the positions of the attractive fixed points as
a function of the parameter $c$. Finally, when $c < -2$, the sequence escapes to infinity.
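The period doubling can be observed numerically. The following is a rough sketch, assuming a hypothetical helper `cycle_period` (the transient length, tolerance, and maximum period are arbitrary illustrative choices) that iterates the quadratic map from $x^{(0)} = 0$, discards the transient, and reports the smallest period it detects:

```julia
# Quadratic map and a simple period detector (illustrative helper names).
ϕ(x, c) = x^2 + c

function cycle_period(c; transient=10_000, maxperiod=64, tol=1e-8)
    x = 0.0
    for _ in 1:transient          # discard the transient part of the orbit
        x = ϕ(x, c)
    end
    y = x
    for p in 1:maxperiod          # smallest p with ϕᵖ(x) ≈ x
        y = ϕ(y, c)
        abs(y - x) < tol && return p
    end
    return nothing                # no short cycle found (possibly chaotic or divergent)
end

for c in (0.0, -1.0, -1.3)
    println("c = $c: period ", cycle_period(c))
end
```

Values of $c$ close to a bifurcation point converge slowly, so a detector this simple may need a longer transient there.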
Often examining a more general problem provides greater insight into a
specific problem. Generalization is one of the techniques that mathematician
George Pólya discusses in his classic book How to Solve It. We have examined
the quadratic map when the parameter $c$ is real-valued. What happens when
$c$ is complex-valued?
Let’s take $z^{(k+1)} = \phi_c(z^{(k)}) = (z^{(k)})^2 + c$. For what values $c$ does the
fixed-point method work? That is to say, when does $z^{(k)}$ converge to a fixed point
$z^*$? By the fixed-point theorem, $z$ is attractive when $|\phi_c'(z)| < 1$. That is, when
$|z| < \tfrac12$. So, let’s take the boundary $z = \tfrac12 \mathrm{e}^{\mathrm{i}\theta}$ and determine which values $c$
correspond to this boundary. From the original equation $z^2 - z + c = 0$, we
have that $c = z - z^2 = \tfrac12 \mathrm{e}^{\mathrm{i}\theta} - \tfrac14 \mathrm{e}^{2\mathrm{i}\theta}$. The plot of this function $c(z)$ is a cardioid,
the shape traced by a point on a circle rolling around another circle of the
same diameter.4 For values $c$ inside the cardioid, the orbit $\{0, \phi_c(0), \phi_c^2(0), \dots\}$
converges to a fixed point.
4 There is an interesting related puzzle: “A coin is rolled around a similar coin without slipping.
forward orbit of
quadratic map
Figure 8.4: Top: The bifurcation diagram showing the limit cycle $\{\phi_c^\infty(0)\}$
of the quadratic map as a function of 𝑐. Bottom: The Mandelbrot set is the
set of complex values 𝑐 for which 𝜙 𝑐 (0) has a limit cycle. The cases 1 – 4
from page 201 are indicated. Pay attention to changes in the period-doubling,
bifurcation points, and matching cusps along the real axis of the Mandelbrot set.
If we extend 𝑐 to include orbits {0, 𝜙 𝑐 (0), 𝜙2𝑐 (0), . . . } whose limit sets are
bounded, the set of 𝑐 is called the Mandelbrot set. See the figure on the preceding
page. At the heart of the Mandelbrot set is the cardioid generated by the fixed
points of the quadratic map. There is an infinite set of bulbs for limit sets of
period $q$ tangent to the main cardioid at $c = \tfrac12 \mathrm{e}^{\mathrm{i}\theta} - \tfrac14 \mathrm{e}^{2\mathrm{i}\theta}$ with $\theta = 2\pi p/q$
where $p$ and $q$ are relatively prime integers. The Mandelbrot set is one of the
best-known examples of fractal geometry, exhibiting recursive detail and intricate
mathematical beauty. See the figure above.
If the Mandelbrot set is the set of parameters 𝑐 that results in bounded orbits
given 𝑧 (0) = 0, what can we say about the limit sets themselves? We call the
forward orbit for a given parameter 𝑐 a Julia set. Whereas the Mandelbrot set
lives in the complex 𝑐 parameter space, the Julia set lives in the complex 𝑧-plane.
If the orbit of 𝑧 is bounded, then the Julia set is connected. If the orbit of 𝑧
is unbounded, then the Julia set is disconnected and called a Cantor set. See
Figure 8.6 on the facing page.
Imaging the Mandelbrot set is straightforward. The following functions take
the parameters bb for the lower-left and upper-right corners of a bounding box, xn
for the number of horizontal pixels, and n for the maximum number of iterations.
The function escape counts the iterations $k$ until $|z^{(k)}| > 2$ for a given $c$.
function escape(n, c=0, z=0)
  for k in 1:n
    z = z^2 + c
    abs2(z) > 4 && return k   # |z| > 2, so the orbit escapes
  end
  return n
end
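The text describes a companion imaging function with parameters bb, xn, and n. One way it might look is sketched below. The name `mandelbrot`, the ordering bb = (xmin, ymin, xmax, ymax), and the default sizes are assumptions made for illustration. The function `escape` is repeated so that the block runs on its own.

```julia
# Escape-time count for the quadratic map z ← z² + c.
function escape(n, c=0, z=0)
    for k in 1:n
        z = z^2 + c
        abs2(z) > 4 && return k
    end
    return n
end

# Hypothetical layout: bb = (xmin, ymin, xmax, ymax).
function mandelbrot(bb, xn=800, n=200)
    yn = round(Int, xn * (bb[4] - bb[2]) / (bb[3] - bb[1]))
    x = range(bb[1], bb[3], length=xn)
    y = range(bb[4], bb[2], length=yn)      # top row first, as in an image
    return [escape(n, complex(a, b)) for b in y, a in x]
end

M = mandelbrot((-2.0, -1.25, 0.5, 1.25), 200, 100)
println(size(M))
```

The resulting integer matrix can be passed to any image or heatmap routine. Pixels whose count equals n are treated as belonging to the set.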
What is the coin’s orientation when it is halfway around the stationary coin?” The answer is
counter-intuitive, but it is evident in the equation of the cardioid.
Figure 8.6: Julia sets for −0.123 + 0.745i (left) and −0.513 + 0.521i (right).
Imaging the Julia set uses almost identical code. The Mandelbrot set lives in
the 𝑐-domain with a given value 𝑧 (0) = 0, and the Julia set lives in the 𝑧-domain
with a given value 𝑐. So the code for the Julia set requires only swapping the
variables z and c.
Example. Orbits that are chaotic, erratic, and unpredictable sometimes have
applications. Such behavior makes elliptic curves especially useful in Diffie–
Hellman public-key encryption5 and pseudorandom number generators. An
5 Suppose that Alice and Bob want to communicate securely. If they both have the same secret
key, they can both encrypt and decrypt messages using this key. But first, they’ll need to agree on
a key, and they’ll need to do so securely. They can use the Diffie–Hellman key exchange protocol.
Both Alice and Bob use a published elliptic curve with a published 𝑃. Alice secretly chooses a
private key 𝑚 and sends Bob her public key 𝑚𝑃. Bob secretly chooses his private key 𝑛 and sends
Alice his public key 𝑛𝑃. When Alice applies her private key to Bob’s public key and Bob applies his
private key to Alice’s public key, they both get the same secret key 𝑛𝑚𝑃. With a shared secret key,
Alice and Bob can each generate the same cryptographic key to encrypt or decrypt a message. The
multiplication is typically modulo some prime.
[Figure 8.7: an elliptic curve with the points $P$, $Q$, $R$, and $2P$ marked.]
$$y^2 = x^3 + ax + b \tag{8.9}$$
where $x^3 + ax + b$ has distinct roots. Such a curve has two useful geometric
properties. First, it is symmetric about the 𝑥-axis. And second, a nonvertical
straight line through two points on the curve (a secant line) will pass through
a third point on the curve. Similarly, a tangent line—the limit of a secant
line as two points approach one another—also passes through one other point.
Let’s label the first two points $P = (x_0, y_0)$ and $Q = (x_1, y_1)$ on the curve, and
label the reflection of the third point across the $x$-axis $R = (x_2, -y_2)$. We can
denote $R = P \oplus Q$. Points generated in this way using $\oplus$ (the reflection of the
third collinear point on the curve) form a group. The point $\mathcal{O}$ at infinity is the
identity element in this group. We define $2P = P \oplus P$, $3P = P \oplus 2P$, and so on.
See Figure 8.7 above.
The arithmetic is straightforward. To compute $2P$, start by implicitly
differentiating (8.9). We have $2y\,\mathrm{d}y = (3x^2 + a)\,\mathrm{d}x$ from which the slope of the
tangent at $P$ is $\lambda = (3x_0^2 + a)/(2y_0)$. Now, substituting the equation for points
along the line $y = \lambda(x - x_0) + y_0$ into (8.9) gives
$$x^3 - (\lambda(x - x_0) + y_0)^2 + ax + b = 0.$$
The coefficient of the $x^2$ term of the resulting monic cubic equation is $-\lambda^2$.
The coefficient of a monic polynomial’s second term is the negative sum of
the polynomial’s roots. Because the polynomial has a double root at $x_0$, the
third root $x = \lambda^2 - 2x_0$. And after reflecting about the $x$-axis we have that
$y = -\lambda(x - x_0) - y_0$.
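The doubling rule translates into a few lines of modular arithmetic. The sketch below is only an illustration. The function name `ecadd`, the use of `nothing` for the point at infinity, and the small test curve $y^2 = x^3 + 2x + 2 \pmod{17}$ are assumptions that do not come from the text. For two distinct points it uses the slope of the secant line and the same sum-of-roots argument, which gives the third root $\lambda^2 - x_0 - x_1$.

```julia
# Point addition on y² = x³ + ax + b over the integers modulo a prime r.
# The point at infinity is represented by `nothing` (an illustrative choice).
function ecadd(P, Q, a, r)
    P === nothing && return Q
    Q === nothing && return P
    (x0, y0), (x1, y1) = P, Q
    if x0 == x1
        mod(y0 + y1, r) == 0 && return nothing            # vertical line: P ⊕ (−P)
        λ = mod((3x0^2 + a) * invmod(2y0, r), r)          # tangent slope
    else
        λ = mod((y1 - y0) * invmod(mod(x1 - x0, r), r), r) # secant slope
    end
    x2 = mod(λ^2 - x0 - x1, r)          # third root of the cubic
    y2 = mod(-λ * (x2 - x0) - y0, r)    # reflect about the x-axis
    return (x2, y2)
end

P  = (5, 1)                   # a point on y² = x³ + 2x + 2 (mod 17)
P2 = ecadd(P, P, 2, 17)       # 2P
P3 = ecadd(P, P2, 2, 17)      # 3P
println((P2, P3))
```

For a cryptographic-size prime, the coordinates would need to be `BigInt` values to avoid overflow.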
We can similarly compute 𝑃 ⊕ 𝑄. The slope of a line through two points
is 𝜆 = (𝑦 1 − 𝑦 0 )/(𝑥1 − 𝑥0 ). And from the argument above, the third root is
Roots of polynomials 207
The quadratic formula has been known in various forms for the past two millennia.
During the Renaissance, mathematicians Scipione del Ferro and Niccolò Fontana
Tartaglia independently discovered techniques for finding the solutions to the
cubic equation.6 At the time, a mathematician’s fame (and fortune?) was tied to
winning competitions with other mathematicians, duels of a sort to prove who
was the better mathematician. So, while Tartaglia knew a formula for solving
a cubic equation, he kept his knowledge of it secret and even obfuscated his
formula in a poem, “Quando chel cubo con le cose appresso. . . .”
A contemporary, Girolamo Cardano, persuaded Tartaglia to tell him his secret
formula with the promise that he would never publish it. Of course, Cardano
did exactly that, and it is now known as Cardano’s formula (although some give
a nod to its original inventor and call it the Cardano–Tartaglia formula). Later,
Cardano’s student Lodovico de Ferrari developed a general technique for solving
quartic equations by reducing them first to cubic equations and then applying
Cardano’s formula.
And that’s where it stops. There is no general algebraic formula for quintic
polynomials, nor can there be one. In 1824, the 21-year-old mathematician Niels
Henrik Abel proved what would later be known as the Abel–Ruffini theorem or
Abel’s impossibility theorem by providing a counterexample that demonstrated
that there are polynomials for which no root-finding formula can exist. Six years
later, 18-year-old Évariste Galois generalized Abel’s theorem to characterize
which polynomials were solvable by radicals, thereby creating Galois theory.7
To dig deeper, see Mario Livio’s book The Equation That Couldn’t Be Solved.
6 As a young boy Niccolò Fontana was maimed when an invading French Army massacred the
citizens of Brescia, leaving him with a speech impediment, after which people started calling him
Tartaglia, meaning “the stutterer.”
7 Évariste Galois died tragically at age 20 in a duel stemming from a failed love affair. During his
short life, Galois’ theory was largely rejected as incomprehensible. His work would not be published
until twelve years after his death. Abel also died tragically young, at the age of 26, from tuberculosis.
Horner’s deflation
Note that we can write the polynomial $p(x) = x^3 - 6x^2 + 11x - 6$ in the Horner
form $x(x(x - 6) + 11) - 6$. This new form allows us to evaluate a polynomial more
quickly, requiring fewer multiplications while reducing the propagated round-off
error. Synthetic multiplication provides a convenient means of record-keeping
when evaluating a polynomial in Horner form. For a general polynomial
$p(x) = a_n x^n + a_{n-1}x^{n-1} + \dots + a_0$ evaluated at $z$, the record is
    z | a_n   a_{n−1}   ···   a_0
      |       b_n z     ···   b_1 z
      | b_n   b_{n−1}   ···   b_0
with $b_n = a_n$ and $b_k = a_k + b_{k+1}z$, so that $b_0 = p(z)$. For example, for $p(4)$ we have
    4 | 1  −6  11  −6
      |     4  −8  12
      | 1  −2   3   6
So, 𝑝(4) = 6.
The function evalpoly(x,p) uses Horner’s method (or Goertzel’s algorithm for
complex values) to evaluate a polynomial with coefficients p = [𝑎 0 , 𝑎 1 , . . . , 𝑎 𝑛 ] at x.
where $q(x) = b_n x^{n-1} + b_{n-1}x^{n-2} + \dots + b_2 x + b_1$. This says that if $z$ is a root of
$p(x)$, then $b_0 = p(z) = 0$. Therefore, $p(x)$ can be factored as $p(x) = (x - z)q(x)$
for root $z$. So, $q(x)$ is a polynomial factor. This method of factorization is called
synthetic division or Ruffini’s rule.
Let’s see Ruffini’s rule in action by factoring 𝑝(𝑥), starting with the root 2:
    2 | 1  −6  11  −6
      |     2  −8   6
      | 1  −4   3   0
    3 | 1  −4   3
      |     3  −3
      | 1  −1   0
Newton–Horner method
We can use Newton’s method to find a root 𝑧 and then use Horner’s method to
deflate the polynomial. Newton’s method for polynomials is simple. Using (8.10),
we have $p(x) = (x - z)q(x) + b_0$ for some $z$. So, $p'(x) = q(x) + (x - z)q'(x)$, and
therefore the derivative of $p$ evaluated at $x = z$ is simply $p'(z) = q(z)$. Newton’s
method is then simply
Example. Let’s use Newton’s method to compute $x^{(1)}$ starting with $x^{(0)} = 4$ for
the polynomial $x^3 - 6x^2 + 11x - 6$. Let’s determine $p(x^{(0)})$ and $p'(x^{(0)})$:
    4 | 1  −6  11  −6
      |     4  −8  12
      | 1  −2   3   6
    4 | 1  −2   3
      |     4   8
      | 1   2  11
So, $p(4) = 6$ and $p'(4) = 11$. It follows that $x^{(1)} = 4 - \tfrac{6}{11}$.
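The two synthetic multiplications of the example can be written as a short routine. This is a sketch with hypothetical helper names (`horner`, `newton_horner_step`), and it assumes coefficients ordered from highest to lowest degree to match the tables. Note that Julia's built-in `evalpoly` expects the opposite order.

```julia
# Horner's method: returns the deflated coefficients q and the remainder p(z).
# Coefficients run from highest to lowest degree.
function horner(a, z)
    b = similar(a, promote_type(eltype(a), typeof(z)))
    b[1] = a[1]
    for i in 2:length(a)
        b[i] = a[i] + z * b[i-1]
    end
    return b[1:end-1], b[end]
end

# One Newton step, using p′(x) = q(x) from a second Horner pass.
function newton_horner_step(a, x)
    q, p  = horner(a, x)
    _, dp = horner(q, x)
    return x - p / dp
end

a = [1, -6, 11, -6]              # x³ − 6x² + 11x − 6
println(horner(a, 4))            # deflated coefficients and p(4)
println(newton_horner_step(a, 4))
```

Calling `horner(a, 2)` returns a zero remainder, which reproduces the first deflation step of Ruffini's rule for this polynomial.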
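Another way to find the roots of a polynomial $p(x)$ is by finding the eigenvalues
of a matrix whose characteristic polynomial is $p(x)$. The companion matrix of
the polynomial
$$p(x) = a_0 + a_1 x + \dots + a_{n-1}x^{n-1} + x^n$$
is the matrix
$$\mathbf{C}_n = \begin{bmatrix}
0 & 0 & \cdots & 0 & -a_0 \\
1 & 0 & \cdots & 0 & -a_1 \\
0 & 1 & \cdots & 0 & -a_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & -a_{n-1}
\end{bmatrix}.$$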
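As a quick numerical check, the sketch below builds the companion matrix of the monic polynomial $x^n + a_{n-1}x^{n-1} + \dots + a_0$ and computes its eigenvalues. The helper name `companion` is hypothetical, and the test polynomial is the cubic used earlier in this section.

```julia
using LinearAlgebra

# Companion matrix for coefficients a = [a₀, a₁, …, aₙ₋₁] of a monic polynomial.
function companion(a)
    n = length(a)
    C = zeros(n, n)
    for i in 2:n
        C[i, i-1] = 1        # ones on the subdiagonal
    end
    C[:, n] .= -a            # last column holds −a₀, …, −aₙ₋₁
    return C
end

a = [-6, 11, -6]             # x³ − 6x² + 11x − 6
r = sort(real.(eigvals(companion(a))))
println(r)                   # approximately 1, 2, 3
```

For polynomials with complex roots, `eigvals` returns complex values and the `real.` call should be dropped.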
To show that the polynomial 𝑝(𝑥) is indeed the characteristic polynomial of C𝑛 ,
we can use Laplace expansion (also known as cofactor expansion) to evaluate
det(𝜆I − C𝑛 ). For instance, consider the Laplace expansion using cofactors from
Systems of nonlinear equations 211
Other methods
Systems of equations often arise when finding a nonlinear least squares fit or
implementing an implicit solver for a partial differential equation. Adding
dimensions adds complexity, and the well-posedness of the problem becomes
more complicated.
Newton’s method
8 The FastPolynomialRoots.jl package uses a Fortran wrapper of the algorithm by Aurentz, Mach,
Vandebril, and Watkins and can be significantly faster for large polynomials by using an eigenvalue
solver that is 𝑂 (𝑛2 ) instead of 𝑂 (𝑛3 ).
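and since $\mathbf{x}^* = \mathbf{x} + \Delta\mathbf{x}$
$$\mathbf{x}^* = \mathbf{x} - \mathbf{J}_f^{-1}(\mathbf{x})\, f(\mathbf{x}) + O(\|\Delta\mathbf{x}\|^2).$$
We can use this second-order approximation to develop an iterative method:
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \mathbf{J}_f^{-1}(\mathbf{x}^{(k)})\, f(\mathbf{x}^{(k)}). \tag{8.11}$$
In practice, we do not need to or necessarily want to compute the inverse of the
Jacobian matrix explicitly. Rather, at each step of the iteration, we
    evaluate $\mathbf{J}_f(\mathbf{x}^{(k)})$, (8.12a)
    solve $\mathbf{J}_f(\mathbf{x}^{(k)})\,\Delta\mathbf{x}^{(k)} = -f(\mathbf{x}^{(k)})$, (8.12b)
    update $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \Delta\mathbf{x}^{(k)}$. (8.12c)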
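The evaluate, solve, and update steps can be sketched as follows. The function name `newton`, the tolerance, and the starting point are illustrative assumptions. The test system is the pair $x^3 - 3xy^2 - 1 = 0$, $y^3 - 3x^2y = 0$ that appears later in this section.

```julia
using LinearAlgebra

f(v)  = ((x, y) = v; [x^3 - 3x*y^2 - 1, y^3 - 3x^2*y])
Jf(v) = ((x, y) = v; [3x^2-3y^2 -6x*y; -6x*y 3y^2-3x^2])

function newton(f, Jf, x; tol=1e-12, maxiter=50)
    for k in 1:maxiter
        Δx = -(Jf(x) \ f(x))     # solve J Δx = −f instead of forming J⁻¹
        x = x + Δx
        norm(Δx) < tol && return x
    end
    return x                     # returned even if the tolerance was not met
end

x = newton(f, Jf, [1.0, 0.5])
println(x)                       # close to the root (1, 0)
```

The backslash operator factors the Jacobian internally, so no explicit inverse is formed.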
If the Jacobian matrix is dense, it will take $O(n^2)$ operations to evaluate
$\mathbf{J}_f(\mathbf{x}^{(k)})$ and an additional $O(n^3)$ operations to solve the system. If the system is
sparse, the number of operations is $O(n)$ and $O(n^2)$, respectively. There are a
few techniques we can use to speed things up.
First, we could reuse the Jacobian. It’s not necessary to compute the Jacobian
matrix at every iteration. Instead, we can evaluate it every few steps. We can also
reuse the upper and lower triangular factors of the LU decomposition of (8.12b),
so that each subsequent iteration takes $O(n^2)$ operations.
Alternatively, we could use an iterative method like the SOR or conjugate
gradient method to solve the linear system (8.12b). Iterative methods multiply
instead of factor, requiring $O(n^2)$ operations per iteration for dense matrices and
only $O(n)$ operations per iteration for sparse matrices. They may take many
iterations to converge, but because we only want an approximate solution x (𝑘+1)
with each step, we don’t need full convergence for the iterative method to be
effective. Chapter 5 discusses these methods.
Another way to speed up Newton’s method is using a finite difference
approximation of the Jacobian matrix. The $ij$-element of the Jacobian matrix is
$J_{ij} = \partial f_i/\partial x_j$. So the $j$th column of the Jacobian matrix is the derivative of $f(\mathbf{x})$
with respect to $x_j$. A second-order approximation of the $j$th column of $\mathbf{J}_f(\mathbf{x})$ is
$$[\mathbf{J}_f(\mathbf{x})]_j = \frac{f(\mathbf{x} + h\boldsymbol{\xi}_j) - f(\mathbf{x} - h\boldsymbol{\xi}_j)}{2h} + O(h^2), \tag{8.13}$$
where $h$ is the step size and $\boldsymbol{\xi}_j$ is the $j$th vector in the standard basis of $\mathbb{R}^n$.
We can typically take the step size $h$ roughly equal to the cube root of machine
epsilon for a well-conditioned problem.9 If $f(\mathbf{x})$ is a real-valued function, we
can also approximate the Jacobian using a complex-step derivative
$$[\mathbf{J}_f(\mathbf{x})]_j = \frac{\operatorname{Im} f(\mathbf{x} + \mathrm{i}h\boldsymbol{\xi}_j)}{h} + O(h^2),$$
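Both approximations are easy to compare against an exact Jacobian. The sketch below uses hypothetical helper names (`jacobian_fd`, `jacobian_cs`). The default $h$ for the central difference follows the cube-root rule from the text, while the tiny step for the complex-step variant is an assumption that works because that formula involves no subtraction.

```julia
# Test system and its exact Jacobian, for comparison.
f(v)  = ((x, y) = v; [x^3 - 3x*y^2 - 1, y^3 - 3x^2*y])
Jf(v) = ((x, y) = v; [3x^2-3y^2 -6x*y; -6x*y 3y^2-3x^2])

# Central differences, one column per coordinate direction.
function jacobian_fd(f, x; h=cbrt(eps()))
    n = length(x)
    J = zeros(length(f(x)), n)
    for j in 1:n
        ξ = zeros(n); ξ[j] = 1
        J[:, j] = (f(x + h*ξ) - f(x - h*ξ)) / (2h)
    end
    return J
end

# Complex-step derivative; f must accept complex arguments.
function jacobian_cs(f, x; h=1e-20)
    n = length(x)
    J = zeros(length(f(x)), n)
    for j in 1:n
        ξ = zeros(n); ξ[j] = 1
        J[:, j] = imag(f(x + im*h*ξ)) / h
    end
    return J
end

x = [1.0, 0.5]
println(maximum(abs.(jacobian_fd(f, x) - Jf(x))))
println(maximum(abs.(jacobian_cs(f, x) - Jf(x))))
```

The complex-step version only applies when $f$ is built from operations that extend analytically to complex arguments, as this polynomial system does.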
9 See exercise 7.1.
Figure 8.8: Newton fractal (left) for $z^3 - 1$. Darker shades of gray require more
iterations to converge. Homotopy continuation (right) for the same problem.
$$\begin{aligned} x^3 - 3xy^2 - 1 &= 0 \\ y^3 - 3x^2y &= 0. \end{aligned}$$
Secant-like methods
The secant method used two subsequent iterates 𝑥 (𝑘) and 𝑥 (𝑘−1) to approximate
the tangent slope used in Newton’s method. Just as we extended Newton’s method
to a system of equations, we can also formally extend the secant method to a
system of equations. We have
and $\Delta\mathbf{f}^{(k)} = f(\mathbf{x}^{(k)}) - f(\mathbf{x}^{(k-1)})$. We just need to choose a suitable $\mathbf{J}^{(k)}$.
One way to interpret (8.14) is, “Given an output vector $\Delta\mathbf{f}^{(k)}$ and input vector
$\Delta\mathbf{x}^{(k)}$, find the matrix $\mathbf{J}^{(k)}$.” The matrix $\mathbf{J}^{(k)}$ has $n^2$ unknowns, and we have
only $n$ equations, a bit of a conundrum. How can we determine $\mathbf{J}^{(k)}$? It seems
reasonable that $\mathbf{J}^{(k)}$ should be close to the previous $\mathbf{J}^{(k-1)}$, so perhaps we can
approximate $\mathbf{J}^{(k)}$ using $\mathbf{J}^{(k-1)}$ with (8.14) as a constraint. In other words, let’s
minimize $\|\mathbf{J}^{(k)} - \mathbf{J}^{(k-1)}\|$ subject to the constraint $\mathbf{J}^{(k)}\Delta\mathbf{x}^{(k)} = \Delta\mathbf{f}^{(k)}$. What
norm $\|\cdot\|$ do we choose? The Frobenius norm is a smart choice because it is an
extension of the $\ell^2$-norm for vectors, and minimizing the $\ell^2$-norm results in a
linear problem.
We can use the method of Lagrange multipliers. Take 𝝀 ∈ R𝑛 and define
We just need to solve for 𝝀. Right multiply the first equation by 𝛥x (𝑘) :
    choose an initial guess x
    set J ← J_f(x)
    solve JΔx = −f(x)
    update x ← x + Δx
    while ‖Δx‖ > tolerance
        compute J using (8.15)
        solve JΔx = −f(x)
        update x ← x + Δx
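A runnable version of this loop might look like the sketch below. It assumes that (8.15) is the standard rank-one Broyden update $\mathbf{J} \leftarrow \mathbf{J} + (\Delta\mathbf{f} - \mathbf{J}\Delta\mathbf{x})\Delta\mathbf{x}^{\mathsf T}/\|\Delta\mathbf{x}\|^2$, which is not shown in this excerpt. The function name `broyden`, the tolerance, the iteration cap, and the starting point are also illustrative choices.

```julia
using LinearAlgebra

f(v)  = ((x, y) = v; [x^3 - 3x*y^2 - 1, y^3 - 3x^2*y])
Jf(v) = ((x, y) = v; [3x^2-3y^2 -6x*y; -6x*y 3y^2-3x^2])

function broyden(f, J, x; tol=1e-10, maxiter=100)
    fx = f(x)
    Δx = -(J \ fx)                    # first step is a plain Newton step
    x = x + Δx
    k = 0
    while norm(Δx) > tol && k < maxiter
        fnew = f(x)
        Δf = fnew - fx
        J = J + (Δf - J*Δx) * Δx' / dot(Δx, Δx)   # assumed form of (8.15)
        Δx = -(J \ fnew)
        x, fx = x + Δx, fnew
        k += 1
    end
    return x
end

x0 = [1.1, 0.1]
x = broyden(f, Jf(x0), x0)
println(x)                            # expected to approach (1, 0)
```

Only one new evaluation of $f$ is needed per pass through the loop, and the exact Jacobian is evaluated just once at the start.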
Broyden’s method requires only $O(n)$ evaluations of $f(\mathbf{x}^{(k)})$ at each step to
update the slope rather than $O(n^2)$ evaluations that are necessary to recompute
the Jacobian matrix. In practice, errors in approximating an updated Jacobian
Homotopy continuation 215
grow with each iteration, and one should periodically restart the method with a
fresh Jacobian.
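Rather than solving $\mathbf{J}^{(k)}\Delta\mathbf{x}^{(k+1)} = -f(\mathbf{x}^{(k)})$ at each iteration, we can compute
$\Delta\mathbf{x}^{(k+1)} = -(\mathbf{J}^{(k)})^{-1} f(\mathbf{x}^{(k)})$ directly with the help of the Sherman–Morrison formula
$$(\mathbf{A} + \mathbf{u}\mathbf{v}^{\mathsf T})^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^{\mathsf T}\mathbf{A}^{-1}}{1 + \mathbf{v}^{\mathsf T}\mathbf{A}^{-1}\mathbf{u}}$$
to update $(\mathbf{J}^{(k)})^{-1}$ given $(\mathbf{J}^{(k-1)})^{-1}$:
$$(\mathbf{J}^{(k)})^{-1} = (\mathbf{J}^{(k-1)})^{-1} + \frac{\Delta\mathbf{x}^{(k)} - (\mathbf{J}^{(k-1)})^{-1}\Delta\mathbf{f}^{(k)}}{\|\Delta\mathbf{f}^{(k)}\|^2}\,\Delta\mathbf{f}^{(k)\mathsf T}.$$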
𝑓 (x)
at 𝑡 = 0 : ℎ(0, x) = 𝑔(x) = 0
at 𝑡 = 1 : ℎ(1, x) = 𝑓 (x) = 0.
Simply put, when the parameter 𝑡 = 0, the solution x is a zero of 𝑔; and when
𝑡 = 1, the solution x is a zero of 𝑓 . We’ve exchanged the difficult problem of
solving 𝑓 (x) = 0 for the easy problem of solving 𝑔(x) = 0. It sounds a little too
simple—what’s the catch? We still need to track the solutions along the paths
ℎ(𝑡, x) for the parameter 𝑡 ∈ [0, 1]. To do this, we may need to solve a nonlinear
differential equation.
Take the homotopy ℎ(𝑡, x(𝑡)) = 0 where x(𝑡) is a zero of the system ℎ for
the parameter 𝑡. Let’s assume that x(𝑡) exists and is unique for all 𝑡 ∈ [0, 1].
Otherwise, the method is ill-posed. This assumption is not trivial—there may
be no path connecting a zero of 𝑔 with a zero of 𝑓 , or the path may become
multivalued as the parameter 𝑡 is changed:
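The Jacobian matrix $\mathbf{J}_h(\mathbf{x}) = \mathbf{J}_f(\mathbf{x})$ and $\partial h/\partial t = f(\mathbf{x}_0)$. We’ll need to solve the
differential equation
$$\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = -\mathbf{J}_f^{-1}(\mathbf{x})\, f(\mathbf{x}_0)$$
with the initial conditions $\mathbf{x}(0) = \mathbf{x}_0$. Solving such a differential equation often requires
a numerical solver.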
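$$\begin{aligned} x^3 - 3xy^2 - 1 &= 0 \\ y^3 - 3x^2y &= 0. \end{aligned}$$
If we take $\mathbf{x}_0 = (1, 1)$, then $f(\mathbf{x}_0) = (-3, -2)$, and we need to solve
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = -\begin{bmatrix} 3x^2 - 3y^2 & -6xy \\ -6xy & 3y^2 - 3x^2 \end{bmatrix}^{-1} \begin{bmatrix} -3 \\ -2 \end{bmatrix}$$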
with initial conditions (𝑥(0), 𝑦(0)) = (1, 1). We can solve this problem using
the following code:
using DifferentialEquations
# f is the nonlinear system; df is the right-hand side −J(z)⁻¹p with p = f(z0)
f = z -> ((x,y)=tuple(z...); [x^3-3x*y^2-1; y^3-3x^2*y])
df = (z,p,_) -> ((x,y)=tuple(z...);
  -[3x^2-3y^2 -6x*y; -6x*y 3y^2-3x^2]\p)
z0 = [1.0,1.0]
sol = solve(ODEProblem(df,z0,(0.0,1.0),f(z0)))
The numerical solution given by sol.u[end] is 0.999, 0.001. To reduce the error,
we can finish off the homotopy method by applying a step of Newton’s method to
𝑓 (x) using sol.u[end] as the initial guess to get the root (1, 0). See Figure 8.8
on page 213. J
Like Newton’s method and other fixed-point methods, Euler’s method has stability
restrictions. One way to handle the stability restrictions is by taking a smaller step
size 𝛥𝑡. Smaller steps, of course, mean more iterations. Homotopy continuation,
too, effectively reduces the step size to maintain stability. In Chapter 12, we
examine the forward Euler method and other differential equation solvers, many
of which rely on Newton’s method for their stability properties.
Julia’s HomotopyContinuation.jl package can solve systems of polynomial equations.
Derivative-free methods
The bisection method, introduced at the beginning of this chapter to find the
zero of a univariate function, looked for sign changes to successively choose
between two brackets. To find a local minimum (or maximum), we’ll see where
the function is increasing and where it’s decreasing. To do this, we’ll need to
consider three brackets.
Suppose that 𝑓 (𝑥) is unimodal (with a unique minimum) on the interval
[𝑎, 𝑏]. Divide this interval into three brackets [𝑎, 𝑥1 ], [𝑥1 , 𝑥2 ], and [𝑥 2 , 𝑏] for
some 𝑥 1 and 𝑥 2 . If 𝑓 (𝑥 1 ) < 𝑓 (𝑥 2 ), then the minimum is in the superbracket
[𝑎, 𝑥2 ]. We’ll move 𝑏 to 𝑥2 and select a new 𝑥2 in the new interval [𝑎, 𝑏]. On the
other hand, if 𝑓 (𝑥 1 ) > 𝑓 (𝑥2 ), then the minimum is in the superbracket [𝑥 1 , 𝑏].
This time, we’ll move 𝑎 to 𝑥1 and select a new 𝑥 1 in the new interval [𝑎, 𝑏]. So
how should we select 𝑥1 and 𝑥2 ?
Without any additional knowledge about 𝑓 (𝑥), we can argue by symmetry
that the size of bracket [𝑎, 𝑥1 ] should equal the size of bracket [𝑥2 , 𝑏] and that
the relative sizes of these brackets should be the same regardless of the scale
of the interval [𝑎, 𝑏]. To simplify calculations, consider the initial interval
$[0, 1]$ with points $x_1$ and $x_2$ at $x$ and $1 - x$ for an unknown $x$. The relation
$1 : (1 - x) = (1 - x) : x$ maintains equal scaling at every iteration. Equivalently,
$x^2 - 3x + 1 = 0$ or $x = (3 - \sqrt5)/2$. The ratio $\varphi = 1 : (1 - x) = (1 + \sqrt5)/2$ is the
golden ratio, also called the golden section. At each iteration, the intervals are
scaled in length by $1/\varphi \approx 0.618$, so the golden section search method has linear
convergence with a convergence rate of around 0.6.
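The bracketing rules above can be sketched as a short routine. The function name `golden_section` and the tolerance are illustrative assumptions. Because $1/\varphi^2 = 1 - 1/\varphi$, one of the two interior points can be reused at every step, so each iteration costs a single new function evaluation.

```julia
# Golden section search for the minimum of a unimodal f on [a, b].
function golden_section(f, a, b; tol=1e-8)
    invφ = (sqrt(5) - 1) / 2            # 1/φ ≈ 0.618
    x1 = b - invφ * (b - a)
    x2 = a + invφ * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol
        if f1 < f2                      # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1      # old x1 becomes the new x2
            x1 = b - invφ * (b - a)
            f1 = f(x1)
        else                            # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2      # old x2 becomes the new x1
            x2 = a + invφ * (b - a)
            f2 = f(x2)
        end
    end
    return (a + b) / 2
end

println(golden_section(x -> (x - 2)^2, 0, 5))   # close to 2
println(golden_section(cos, 0, 2π))             # close to π
```

Near a smooth minimum the function is flat, so the attainable accuracy in $x$ is roughly the square root of machine epsilon regardless of the tolerance requested.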
A common derivative-free search technique for multivariable functions is the
Nelder–Mead simplex method. The Nelder–Mead method uses a simplex, a polytope with $n + 1$ vertices
in $n$-dimensional space: a triangle in two-dimensional space, a tetrahedron in
three-dimensional space, and so on. The function is evaluated at each of the
vertices of the polytope. Through a sequence of reflections, expansions, and
Finding a minimum (or maximum) 219
contractions, the polytope crawls like an amoeba across the domain, seeking a
downhill direction that takes it to the bottom.
Newton’s method
A faster way to get the minimum of a function $f(x)$ is by using Newton’s method
to find where $f'(x) = 0$:
$$x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})}.$$
Geometrically, each iteration of Newton’s method for finding the zero of a function
determines a straight line tangent to the function and returns the zero crossing.
Each iteration of Newton’s method for finding a minimum now determines a
parabola tangent to the function and returns its vertex. If the function is convex,
then Newton’s method has quadratic convergence.
In higher dimensions, we formally replace the Jacobian matrix with the
Hessian matrix and vector function with its gradient in (8.11) to get
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \mathbf{H}_f^{-1}(\mathbf{x}^{(k)})\,\nabla f(\mathbf{x}^{(k)}).$$
The Hessian matrix $\mathbf{H}_f(\mathbf{x})$ has elements $h_{ij} = \partial^2 f/\partial x_i \partial x_j$. When the Hessian
matrix is not positive-definite, we can use Levenberg–Marquardt regularization
by substituting $(\mathbf{H}_f + \alpha\mathbf{I})$ for $\mathbf{H}_f$ for some positive $\alpha$.
Computing the Hessian can be costly, particularly when working with many
variables. The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is a quasi-
Newton (secant-like) method that approximates the Hessian and updates this
approximation with each iteration. Alternatively, we can disregard the Hessian
entirely by using a general class of gradient descent methods. Gradient descent
methods are standard in machine learning and neural nets, where the number of
variables can be pretty large and the data is noisy.
where the positive parameter 𝛼, called the learning rate, controls the relative
step size. Imagine descending into a deep valley shrouded in fog. Without any
additional knowledge of the terrain, you opt to take the steepest path forward (the
negative gradient) some distance. You then reorient yourself and move forward
in the direction of the new steepest path. You continue in this manner until,
hopefully, you find your way to the bottom. If you choose 𝛼 to be too small,
you’ll frequently check your gradient and your progress will be slow. If 𝛼 is too
Figure 8.9: The gradient descent method with a learning rate $\alpha = 10^{-3}$ on the
Rosenbrock function for 500 iterations without and with momentum.
large and the valley is relatively narrow, you’ll likely zigzag back and forth, up
and down the valley walls.
In general, we don’t need to walk in the direction of the gradient. Instead,
we take x (𝑘+1) = x (𝑘) + 𝛼p (𝑘) where the vector p (𝑘) points us in some direction.
One approach introduced by Russian mathematician Boris Polyak in the 1960s,
often referred to as momentum or the heavy ball method, adds persistence or
inertia into the walk
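$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha\mathbf{p}^{(k)} \quad\text{with}\quad \mathbf{p}^{(k)} = -\nabla f(\mathbf{x}^{(k)}) + \beta\mathbf{p}^{(k-1)},$$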
where 𝛽 is a parameter usually between zero and one. We are less likely to change
directions, a bit like a heavy ball rolling down into the valley. The new factor
reduces the oscillations and causes acceleration in the same direction resulting in
larger step sizes.
$$f(x, y) = (1 - x)^2 + 100(y - x^2)^2$$
is a narrow valley with rather steep sides and a relatively flat floor along the
parabola 𝑦 = 𝑥 2 . It has a global minimum at (𝑥, 𝑦) = (1, 1). When we descend
into the valley using gradient descent, the steep valley sides force us to take tiny
steps to maintain stability. Once traveling along the valley floor, these tiny steps
make for excruciatingly slow progress.
The following code uses the gradient descent method to find the minimum of
the Rosenbrock function:
∇f = x -> [-2(1-x[1])-400x[1]*(x[2]-x[1]^2); 200(x[2]-x[1]^2)]
x = [-1.8,3.0]; α = 0.001; p = [0,0]; β = 0.9
for i = 1:100
  p = -∇f(x) + β*p   # momentum: blend the new gradient with the previous direction
  x += α*p
end
The figure on the facing page shows the trace without momentum (𝛽 = 0) and
with momentum (𝛽 = 0.9). Notice that the gradient descent without momentum
initially bounces back and forth across the valley and never makes significant
progress toward the minimum.
In practice, we can use the Optim.jl library to solve the problem:
using Optim
f = x -> (1-x[1])^2 + 100(x[2] - x[1]^2)^2
x0 = [-1.8,3.0]
result = optimize(f, x0, GradientDescent())
x = Optim.minimizer(result)
The Optim.jl library includes several methods for finding a local minimum.
8.9 Exercises
The regula falsi method subdivides brackets the same way. Discuss the conver-
gence of the regula falsi method. When does it outperform the bisection method?
When does it underperform?
where $p$ is a positive integer and $(1/f)^{(p)}(x)$ denotes the $p$th derivative of
$1/f(x)$. Under suitable conditions, a Householder method has order $p + 1$
convergence. Show that when $p = 1$, we simply have Newton’s method. Use a
Householder method to extend the Babylonian method for calculating a square
root to get a method with cubic order convergence.
8.5. Edmond Halley, a contemporary of Isaac Newton, invented his own root-
finding method:
$$x^{(k+1)} = x^{(k)} - \frac{2 f\bigl(x^{(k)}\bigr) f'\bigl(x^{(k)}\bigr)}{2\bigl[f'\bigl(x^{(k)}\bigr)\bigr]^2 - f\bigl(x^{(k)}\bigr) f''\bigl(x^{(k)}\bigr)}.$$
8.10. Suppose that we want to solve $f(x) = 0$. Taking $x = x - \alpha f(x)$ for some $\alpha$,
we get the fixed-point method $x^{(k+1)} = \phi(x^{(k)})$ where $\phi(x) = x - \alpha f(x)$. How
does $\alpha$ affect the stability and convergence of the method?
8.11. Derive the Newton iteration step of the Q_rsqrt algorithm on page 182.
(𝑥 2 + 𝑦 2 ) 2 − 2(𝑥 2 − 𝑦 2 ) = 0
(𝑥 2 + 𝑦 2 − 1) 3 − 𝑥 2 𝑦 3 = 0
8.13. The elliptic curve secp256k1 $y^2 = x^3 + 7 \pmod{r}$ was popularized when
Satoshi Nakamoto used it in Bitcoin’s public-key digital signature algorithm.
Implement the elliptic curve Diffie–Hellman (ECDH) algorithm using secp256k1.
The order of the field is the prime number $r = 2^{256} - 2^{32} - 977$, and the published
generator point $P$ can be found in the “Standards for Efficient Cryptography 2
(SEC 2)” at https://www.secg.org/sec2-v2.pdf.
8.14. What are the convergence conditions for the gradient method and Polyak’s
momentum method applied to the quadratic function $f(x) = \tfrac{1}{2} m x^2$?
Chapter 9
In this chapter, we’ll look at interpolation. In the next one, we examine best
226 Interpolation
Canonical: $\varphi_k(x) = x^k$
Lagrange: $\ell_k(x) = \prod_{j=0,\, j \ne k}^{n} \dfrac{x - x_j}{x_k - x_j}$
Newton: $\nu_k(x) = \prod_{j=0}^{k-1} (x - x_j)$
That is,
$$c = \frac{y_k - p_{k-1}(x_k)}{(x_k - x_0)(x_k - x_1) \cdots (x_k - x_{k-1})}.$$
We now prove uniqueness. Suppose that there were two polynomials 𝑝 𝑛 and
𝑞 𝑛 . Then the polynomial 𝑝 𝑛 − 𝑞 𝑛 has the property ( 𝑝 𝑛 − 𝑞 𝑛 ) (𝑥𝑖 ) = 0 for all
𝑖 = 0, . . . , 𝑛. So, 𝑝 𝑛 − 𝑞 𝑛 has 𝑛 + 1 roots. But because 𝑝 𝑛 − 𝑞 𝑛 ∈ P𝑛 , it can have
at most 𝑛 roots unless it is identically zero. So, 𝑝 𝑛 must equal 𝑞 𝑛 .
Canonical basis
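$$\begin{bmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^n \\
1 & x_1 & x_1^2 & \cdots & x_1^n \\
1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^n
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$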
Solving the system takes 𝑂 (𝑛3 ) operations. Moreover, the Vandermonde matrix
is increasingly ill-conditioned as it gets larger.
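As a rough illustration of the canonical-basis approach, the following sketch builds the Vandermonde matrix and solves for the coefficients. The function names `vandermonde` and `interp_coeffs` and the sample data are chosen here for illustration and are not from the text.

```julia
using LinearAlgebra

# Vandermonde matrix with entries V[i,j] = x_i^(j-1), one row per node.
vandermonde(x) = [xi^j for xi in x, j in 0:length(x)-1]

# Coefficients c_0, ..., c_n of the interpolating polynomial in the canonical basis.
interp_coeffs(x, y) = vandermonde(x) \ y

# Hypothetical sample data.
x = [1.0, 3.0, 5.0, 6.0]
y = [0.0, 4.0, 5.0, 7.0]
c = interp_coeffs(x, y)

# evalpoly evaluates c[1] + c[2]*t + c[3]*t^2 + ... using Horner's rule.
p = t -> evalpoly(t, c)

# The condition number grows quickly with the number of nodes.
κ = cond(vandermonde(range(0, 1, length=15)))
```

Evaluating `p.(x)` should reproduce `y` up to rounding error, while `κ` gives a sense of how poorly conditioned even a modest Vandermonde system can be.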
That is, ℓ𝑖 (𝑥 𝑗 ) = 𝛿𝑖 𝑗 . For a quick confirmation of this, note where ℓ𝑖 (𝑥) is 0 and
1 in the Lagrange basis for {𝑥 0 , 𝑥1 , 𝑥2 , 𝑥3 } = {0, 1, 2, 4}:
[Figure: the Lagrange basis polynomials $\ell_0$, $\ell_1$, $\ell_2$, and $\ell_3$, each plotted over the nodes 0, 1, 2, and 4.]
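$$\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$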
Hence, it is trivial to find the approximating polynomial using the Lagrange
form. But the Lagrange form does have some drawbacks. Each evaluation of
a Lagrange polynomial requires $O(n^2)$ additions and multiplications. Lagrange
polynomials are difficult to differentiate and integrate. If you add a new data pair
$(x_n, y_n)$, you must start from scratch. And the computations can be numerically unstable.
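A minimal sketch of evaluating the Lagrange form directly may help make the $O(n^2)$ cost concrete: each of the basis polynomials is itself a product over the other nodes. The function names below are illustrative, not from the text.

```julia
# k-th Lagrange basis polynomial (1-based k) for the nodes xs, evaluated at t.
function lagrange_basis(xs, k, t)
    ℓ = one(t)
    for j in eachindex(xs)
        j == k && continue
        ℓ *= (t - xs[j])/(xs[k] - xs[j])
    end
    return ℓ
end

# Interpolant in Lagrange form: the coefficients are simply the data values.
lagrange_interp(xs, ys, t) = sum(ys[k]*lagrange_basis(xs, k, t) for k in eachindex(xs))

xs = [0.0, 1.0, 2.0, 4.0]      # the nodes used in the figure above
ys = xs.^2                     # hypothetical data sampled from x^2
```

Because the data here come from a quadratic, `lagrange_interp(xs, ys, 3.0)` should return a value close to 9.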
The Newton polynomial basis fixes these limitations of the Lagrange form by
recursively defining the basis elements $\nu_i(x)$:
$$1,\quad (x - x_0),\quad (x - x_0)(x - x_1),\quad \dots,\quad \prod_{i=0}^{n-1} (x - x_i).$$
Polynomial interpolation 229
grandfather of the modern computer, invented a mechanical device he called the “difference engine.”
The steel and brass contraption was intended to automate and make error-free the mathematical
drudgery of computing divided difference tables. Ada Lovelace, enamored by Babbage’s even grander
designs, went on to propose the world’s first computer program.
Then, the coefficients 𝑐 𝑖 come from each column of the first row
Example. Let’s find the interpolating polynomial for the function 𝑓 (𝑥) where
𝑥 = 1 3 5 6
𝑓 (𝑥) = 0 4 5 7
We first build the divided-difference table. We’ll use ? to denote unknown values.
From
    1   0   ?     ?      ?
    3   4   ?     ?
    5   5   ?
    6   7
we have
    1   0   2     −3/8   7/40
    3   4   1/2   1/2
    5   5   2
    6   7
To fill in the upper left value, we take $(4 - 0)/(3 - 1)$ to get 2. The value below
it is $(5 - 4)/(5 - 3)$ or $1/2$. Now, we can use these new values to compute the
unknown value to their right: $(1/2 - 2)/(5 - 1) = -3/8$. The rest of the table is
filled out similarly. From the first row, we have the coefficients of the interpolating
polynomial in the Newton basis: $p(x) = 2(x - 1) - \tfrac{3}{8}(x - 1)(x - 3) + \tfrac{7}{40}(x - 1)(x - 3)(x - 5)$.
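The table-filling procedure above can be sketched in a few lines. This is one possible in-place arrangement, using rational arithmetic so the results can be compared exactly with the worked example; the function names are illustrative.

```julia
# Newton coefficients f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n],
# computed column by column of the divided-difference table.
function divided_differences(x, y)
    n = length(x)
    c = collect(y)                  # column 0 of the table
    for k in 1:n-1                  # build column k in place, bottom-up
        for i in n:-1:k+1
            c[i] = (c[i] - c[i-1])/(x[i] - x[i-k])
        end
    end
    return c
end

# Evaluate the Newton form with nested multiplication.
function newton_eval(c, x, t)
    p = c[end]
    for k in length(c)-1:-1:1
        p = c[k] + (t - x[k])*p
    end
    return p
end

x = [1, 3, 5, 6]
y = [0//1, 4//1, 5//1, 7//1]        # rationals keep the arithmetic exact
c = divided_differences(x, y)
```

For the data of the example, `c` should match the first row of the completed table, and `newton_eval(c, x, t)` should reproduce the data at each node.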
Hermite interpolation
Let’s continue our work from the previous section by constructing interpolating
polynomials that match a function’s values and derivatives at a set of nodes. That
is, let’s find a unique polynomial $p_n \in \mathbb{P}_n$ such that $p_n^{(j)}(x_i) = f^{(j)}(x_i)$ for all
$i = 0, 1, \dots, k$ and $j = 0, 1, \dots, \alpha_i$. The polynomial $p_n$ is called the Hermite
interpolation of $f$ at the points $x_i$.
We are already familiar with Hermite interpolation when we have only one
node—it’s simply the Taylor polynomial approximation. When we have multiple
nodes, we can compute the Newton form of the Hermite interpolation by using
a table of divided differences. We’ll use the mean value theorem for divided
Proof. Let $p_n(x)$ be the interpolating polynomial of $f(x)$ using the Newton
basis. The leading term of $p_n(x)$ is $f[x_0, \dots, x_n](x - x_0) \cdots (x - x_{n-1})$. Take
$g(x) = f(x) - p_n(x)$, which has $n + 1$ zeros, one at each of the nodes $\{x_0, x_1, \dots, x_n\}$.
By Rolle’s theorem, $g'(x)$ has at least $n$ zeros on the interval, $g''(x)$ has at least
$n - 1$ zeros, and so on until $g^{(n)}$ has at least one zero on the interval. We’ll take
$\xi$ to be one of these zeros. So,
For example, $f[1] = f(1)$, $f[1, 1] = f'(1)$, $f[1, 1, 1] = f''(1)/2$, and so on.
To do this, we compute the divided differences. First note that $p[1, 1] = p'(1) = 3$,
$p[2, 2] = p'(2) = 7$, and $p[2, 2, 2] = p''(2)/2! = 8/2 = 4$. Then we fill in the
divided difference table, using ? to denote unknown values.
From
    1   2   3   ?   ?   ?
    1   2   ?   ?   ?
    2   6   7   4
    2   6   7
    2   6
we have
    1   2   3   1   2   −1
    1   2   4   3   1
    2   6   7   4
    2   6   7
    2   6
By taking the coefficients from the first row, we conclude that $p(x)$ is
$$p(x) = 2 + 3(x - 1) + (x - 1)^2 + 2(x - 1)^2(x - 2) - (x - 1)^2(x - 2)^2.$$
Proof. If $x$ is a node, then $f(x) - p_n(x) = 0$, and the claim is clearly true. So,
let $x$ be any point other than a node $\{x_0, x_1, \dots, x_n\}$. Define
$$g(z) = \bigl(f(z) - p_n(z)\bigr) - \bigl(f(x) - p_n(x)\bigr) \prod_{i=0}^{n} \frac{z - x_i}{x - x_i}.$$
From the expression for pointwise error, we can see that it is affected by the
smoothness of the function $f(x)$ and the number and placement of the nodes.
We can’t control the smoothness of $f(x)$, but we can control the number and
position of the nodes. Take a look at the plot of $\prod_{i=0}^{8} (x - x_i)$ for nine equally
spaced nodes:
The value $\prod_{i=0}^{8} (x - x_i)$ is quite large near both ends of the interval. And
if we add more uniformly spaced nodes, the overshoot becomes larger. This
behavior, called Runge’s phenomenon, explains why simply adding nodes does
not necessarily improve accuracy. We can instead choose the position of the
nodes to minimize the uniform norm2 $\bigl\| \prod_{i=0}^{8} (x - s_i) \bigr\|_\infty$. We do exactly this by
choosing nodes $s_i$ that are zeros of the Chebyshev polynomial:
Now, the black curve of $\prod_{i=0}^{8} (x - s_i)$ with Chebyshev nodes $s_i$ overlays the lighter
gray curve of $\prod_{i=0}^{8} (x - x_i)$ with uniformly spaced nodes $x_i$. In exchange for
moderately increasing the error near the center of the interval, using Chebyshev
nodes significantly decreases the error near its ends.
Let’s peek at Chebyshev polynomials and their zeros—we’ll come back to
them in the next chapter. Chebyshev polynomials can be defined using
$$\mathrm{T}_n(x) = \cos(n \arccos x)$$
over the interval $[-1, 1]$. While this definition might seem a bit tightly packed,
Forman Acton interprets it in a parenthetical comment in his classic 1970 book
Numerical Methods that Work: “they are actually cosine curves with a somewhat
disturbed horizontal scale, but the vertical scale has not been touched.” See the
figure on the following page. Several important properties can be derived directly
from this definition. Chebyshev polynomials can also be defined using the recursion
$$\mathrm{T}_{n+1}(x) = 2x\,\mathrm{T}_n(x) - \mathrm{T}_{n-1}(x),$$
where $\mathrm{T}_0(x) = 1$ and $\mathrm{T}_1(x) = x$. The roots of the Chebyshev polynomial, called
Chebyshev nodes, occur at
$$s_i = \cos\left(\frac{\left(i - \tfrac{1}{2}\right)\pi}{n}\right) \quad\text{for } i = 1, 2, \dots, n.$$
[Figure: the Chebyshev nodes $s_1, \dots, s_8$ on $[-1, 1]$ shown as the abscissas of equispaced points along the unit semicircle.]
These Chebyshev nodes are also the abscissas of equispaced points along the unit
semicircle. See the figure above. The Chebyshev polynomial’s extrema occur at
$$\hat{s}_i = \cos\left(\frac{i\pi}{n}\right) \quad\text{for } i = 0, 1, \dots, n,$$
and take the values $\mathrm{T}_n(\hat{s}_i) = (-1)^i$. It follows that Chebyshev polynomials
are bounded below by $-1$ and above by $+1$ for $x \in [-1, 1]$. Finally, the leading
coefficient of $\mathrm{T}_n(x)$ is $2^{n-1}$ for $n \ge 1$.
2 Also called the Chebyshev norm, the infinity norm, and the supremum norm.
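The recursion and the closed-form expressions for the zeros and extrema can be checked against one another numerically. The following is a small sketch; the function names are chosen here and are not from the text.

```julia
# Evaluate T_n(x) with the three-term recursion T_{n+1} = 2x T_n - T_{n-1}.
function chebyshev(n, x)
    n == 0 && return one(x)
    Tprev, T = one(x), x
    for _ in 2:n
        Tprev, T = T, 2x*T - Tprev
    end
    return T
end

# Zeros cos((i - 1/2)π/n) for i = 1..n and extrema cos(iπ/n) for i = 0..n.
chebyshev_nodes(n) = [cos((i - 1/2)*π/n) for i in 1:n]
chebyshev_extrema(n) = [cos(i*π/n) for i in 0:n]

n = 8
at_nodes = chebyshev.(n, chebyshev_nodes(n))       # should be near zero
at_extrema = chebyshev.(n, chebyshev_extrema(n))   # should alternate between +1 and -1
```

The values at the extrema alternate in sign starting from $+1$ at $x = 1$, which also confirms the bounds $-1 \le \mathrm{T}_n(x) \le 1$ at those points.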
Example. In 1885, at the age of seventy, Karl Weierstrass proved his famous
theorem that any continuous function could be uniformly approximated by a
sequence of polynomials. Specifically, the Weierstrass approximation theorem
states that if 𝑓 is a continuous function on the interval [𝑎, 𝑏], then for every
positive 𝜀, there exists a polynomial 𝑝 𝑛 such that | 𝑓 (𝑥) − 𝑝 𝑛 (𝑥)| < 𝜀 for all
𝑥 ∈ [𝑎, 𝑏]. In 1901, Weierstrass’ former student Carl Runge demonstrated that
the theorem need not hold for polynomials interpolated through equidistant nodes.
Namely, a polynomial interpolant of the function (𝑥 2 + 1) −1 using equidistant
nodes produces large oscillations near the endpoints that grow unbounded as the
polynomial degree increases.
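Runge's observation is easy to reproduce numerically. The sketch below compares the maximum interpolation error for equidistant and Chebyshev nodes; the interval $[-4, 4]$, the degree 20, and the helper names are assumptions made here for illustration, and the interpolant is evaluated with a plain Lagrange formula rather than anything more refined.

```julia
# Runge's function from the example.
runge(x) = 1/(x^2 + 1)

# Evaluate the Lagrange-form interpolant through (xs, f.(xs)) at t.
function interp(f, xs, t)
    p = 0.0
    for k in eachindex(xs)
        ℓ = 1.0
        for j in eachindex(xs)
            j == k || (ℓ *= (t - xs[j])/(xs[k] - xs[j]))
        end
        p += f(xs[k])*ℓ
    end
    return p
end

# Largest error on a fine grid over [a, b].
max_error(f, xs, a, b) =
    maximum(abs(f(t) - interp(f, xs, t)) for t in range(a, b, length=1001))

a, b, n = -4.0, 4.0, 20
equi = collect(range(a, b, length=n+1))
# Chebyshev nodes mapped from [-1, 1] to [a, b].
cheb = [(a + b)/2 + (b - a)/2*cos((i - 1/2)*π/(n + 1)) for i in 1:n+1]

err_equi = max_error(runge, equi, a, b)
err_cheb = max_error(runge, cheb, a, b)
```

With these settings, `err_cheb` should be far smaller than `err_equi`, and repeating the experiment with larger `n` shows the gap widening.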
9.3 Splines
Cubic splines
it in her 1748 book Foundations of Analysis for Italian Youth. The name comes from John Colson’s
mistranslation of aversiera (“versine curve”) as avversiera (“witch”). The versine (or versed sine)
function is versin 𝜃 = 1 − cos 𝜃. It’s one of several oddball trigonometric functions rarely used in
modern mathematics: versine, coversine, haversine, archacovercosine, . . . .
has four degrees of freedom. Hence, we have four unknowns for each of the $n$
subintervals, for a total of $4n$ unknowns. Let’s look at these constraints. The spline
interpolates each of the $n + 1$ knots, giving us $n + 1$ constraints. Because $s(x) \in C^2$,
it follows that $s_{j-1}(x_j) = s_j(x_j)$, $s_{j-1}'(x_j) = s_j'(x_j)$, and $s_{j-1}''(x_j) = s_j''(x_j)$ for
$j = 1, 2, \dots, n - 1$. This gives us another $3(n - 1)$ constraints for a total of $4n - 2$
constraints. We have two degrees of freedom remaining.
We’ll need to enforce two additional constraints to remove these remaining
degrees of freedom and get a unique solution. Common constraints include the
complete or clamped boundary condition where $s'(a)$ and $s'(b)$ are specified;
natural boundary condition where $s''(a) = s''(b) = 0$; periodic boundary
condition where $s'(a) = s'(b)$ and $s''(a) = s''(b)$; and not-a-knot condition.
The not-a-knot condition uses the same cubic polynomial across the first two
subintervals by setting $s_0'''(x_1) = s_1'''(x_1)$, effectively making $x_1$ “not a knot.” It
does the same across the last two subintervals by setting $s_{n-2}'''(x_{n-1}) = s_{n-1}'''(x_{n-1})$.
Before computer-aided graphic design programs, yacht designers and aero-
space engineers used thin strips of wood called splines to help draw curves. The
thin beams were fixed to the knots by hooks attached to heavy lead weights
and either clamped at the boundary to prescribe the slope (complete splines)
or unclamped so that the beam would be straight outside the interval (natural
splines). If $y(x)$ describes the position of the thin wood beam, then the curvature
is given by
$$\kappa(x) = \frac{y''(x)}{\bigl(1 + y'(x)^2\bigr)^{3/2}},$$
and the deformation energy of the beam is
$$E = \int_a^b \kappa^2(x)\,\mathrm{d}x = \int_a^b \left(\frac{y''(x)}{\bigl(1 + y'(x)^2\bigr)^{3/2}}\right)^2 \mathrm{d}x.$$
For small deformations, $E \approx \int_a^b [y''(x)]^2\,\mathrm{d}x = \|y''\|_2^2$. Physically, the beam
takes a shape that minimizes the energy, which implies that the cubic spline is
the smoothest interpolating polynomial.
The last term is nonnegative, so we just need to show that the middle term
vanishes. Integrating by parts,
$$\int_a^b s''(y'' - s'')\,\mathrm{d}x = s''(y' - s')\Big|_a^b - \int_a^b s'''(y' - s')\,\mathrm{d}x.$$
The boundary terms are zero by the imposed boundary conditions. Over the
subinterval $x \in (x_i, x_{i+1})$ the cubic spline $s(x)$ is a cubic polynomial $s_i(x)$ with
$s'''(x) = s_i'''(x) = d_i$ for some constant $d_i$. Then
$$\int_a^b s'''(y' - s')\,\mathrm{d}x = \sum_{i=0}^{n-1} d_i \int_{x_i}^{x_{i+1}} (y' - s_i')\,\mathrm{d}x
= \sum_{i=0}^{n-1} d_i \Bigl[\bigl(y(x_{i+1}) - s_i(x_{i+1})\bigr) - \bigl(y(x_i) - s_i(x_i)\bigr)\Bigr] = 0.$$
Let’s find an efficient method to construct a cubic spline interpolating the values
$y_i = f(x_i)$ for $a = x_0 < x_1 < \cdots < x_n = b$. If $s(x)$ is a $C^2$-piecewise
cubic polynomial, then $s''(x)$ is a piecewise linear function. The second-order
derivative of $s_i(x)$ over the subinterval $[x_i, x_{i+1}]$ is the line
$$s_i''(x) = m_{i+1}\frac{x - x_i}{h_i} + m_i\frac{x_{i+1} - x}{h_i},$$
where $h_i = x_{i+1} - x_i$ and $m_i = s''(x_i)$.
[Figure: the piecewise linear function $s''(x)$ with values $m_0, \dots, m_5$ at the knots $x_0, \dots, x_5$ and spacings $h_0, \dots, h_4$.]
Integrating twice gives
$$s_i(x) = \frac{1}{6}\frac{m_{i+1}}{h_i}(x - x_i)^3 + \frac{1}{6}\frac{m_i}{h_i}(x_{i+1} - x)^3 + A_i(x - x_i) + B_i, \tag{9.3}$$
where the constants $A_i$ and $B_i$ are determined by imposing the values at the ends
of the subintervals $s_i(x_i) = y_i$ and $s_i(x_{i+1}) = y_{i+1}$:
$$y_i = \frac{1}{6}\frac{m_i}{h_i}h_i^3 + B_i \quad\text{and}\quad y_{i+1} = \frac{1}{6}\frac{m_{i+1}}{h_i}h_i^3 + A_i h_i + B_i.$$
The following function evaluates the interpolating cubic spline at n+1 equally
spaced points, given the knots in the arrays xi and yi and the second derivatives
m at the knots.
function evaluate_spline(xi,yi,m,n)
  h = diff(xi)                                    # subinterval lengths
  B = yi[1:end-1] .- m[1:end-1].*h.^2/6           # constants B_i of (9.3)
  A = diff(yi)./h - h./6 .*diff(m)                # constants A_i of (9.3)
  x = range(minimum(xi),maximum(xi),length=n+1)
  i = sum(xi'.≤x,dims=2)                          # subinterval index of each x
  i[end] = length(xi)-1                           # right endpoint uses the last subinterval
  y = @. (m[i]*(xi[i+1]-x)^3 + m[i+1]*(x-xi[i])^3)/(6h[i]) +
    A[i]*(x-xi[i]) + B[i]
  return (x,y)
end
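The evaluation above needs the second derivatives $m_i$ at the knots. As a sketch under the assumption of natural boundary conditions ($m_0 = m_n = 0$), matching first derivatives at the interior knots leads to the tridiagonal system $h_{i-1}m_{i-1} + 2(h_{i-1} + h_i)m_i + h_i m_{i+1} = 6\bigl[(y_{i+1} - y_i)/h_i - (y_i - y_{i-1})/h_{i-1}\bigr]$. The function names and sample knots below are illustrative, and the scalar evaluator is a self-contained stand-in that applies equation (9.3) at a single point.

```julia
using LinearAlgebra

# Second derivatives m_0, ..., m_n at the knots of a natural cubic spline.
function natural_spline_moments(xi, yi)
    h = float.(diff(xi))
    n = length(h)                       # number of subintervals
    d = diff(diff(yi)./h)               # jumps in the secant slopes at interior knots
    m = zeros(n + 1)                    # natural conditions: m[1] = m[end] = 0
    if n > 1
        T = SymTridiagonal(2*(h[1:end-1] .+ h[2:end]), h[2:end-1])
        m[2:end-1] = T \ (6d)
    end
    return m
end

# Evaluate the spline at a single point t using equation (9.3).
function spline_value(xi, yi, m, t)
    i = clamp(searchsortedlast(xi, t), 1, length(xi) - 1)
    h = xi[i+1] - xi[i]
    A = (yi[i+1] - yi[i])/h - h*(m[i+1] - m[i])/6
    B = yi[i] - m[i]*h^2/6
    return (m[i]*(xi[i+1] - t)^3 + m[i+1]*(t - xi[i])^3)/(6h) + A*(t - xi[i]) + B
end

xi = [0.0, 1.0, 2.5, 4.0]               # hypothetical knots
yi = [1.0, 3.0, 2.0, 5.0]
m = natural_spline_moments(xi, yi)
```

The spline should pass through every knot, and for data lying on a straight line all of the $m_i$ should vanish.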
We can adapt the work from the previous section to compute a spline through
points in a plane. Consider a plane curve in parametric form $f(t) = (x(t), y(t))$
with $t \in [0, T]$ for some $T$. Take a set of points $(x_i, y_i)$ for $i = 0, 1, \dots, n$ and
introduce the partition $0 = t_0 < t_1 < \cdots < t_n = T$. A simple way to parameterize
the spline is by using the lengths of each straight-line segment
$$\ell_i = \sqrt{(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2}$$
for $i = 1, 2, \dots, n$ and setting $t_0 = 0$ and $t_i = \sum_{j=1}^{i} \ell_j$. We simply need to
compute splines for $x(t)$ and $y(t)$. This function is called the cumulative length
spline and is a good approximation as long as the curvature of $f(t)$ is small.
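The cumulative length parameterization itself takes only a couple of lines. The sketch below computes the parameters $t_i$ for a small set of hypothetical points; the function name is chosen here for illustration.

```julia
# Cumulative chord-length parameters: t_0 = 0 and t_i = ℓ_1 + ... + ℓ_i.
function chord_parameters(x, y)
    ℓ = hypot.(diff(x), diff(y))      # lengths of the straight-line segments
    return [0.0; cumsum(ℓ)]
end

# Hypothetical points in the plane.
x = [0.0, 3.0, 3.0]
y = [0.0, 4.0, 0.0]
t = chord_parameters(x, y)
```

With `t` in hand, one would fit one spline to the pairs $(t_i, x_i)$ and another to the pairs $(t_i, y_i)$ using any of the spline routines discussed in this section.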
The Dierckx.jl function Spline1D returns a cubic spline.
The Julia Dierckx.jl package is a wrapper of the dierckx Fortran library developed
by Paul Dierckx, providing an easy interface to building splines. Let’s create a
function and select some knots.
g = x-> @. max(1-abs(3-x),0)
xi = 0:5; yi = g(xi )
x = LinRange(0,5,101);
[Figure: the cubic B-spline basis functions $B_{0,3}, B_{1,3}, \dots, B_{9,3}$.]
We can interpolate a spline curve $s(x) = \sum_{i=0}^{n} c_i B_{i,p}(x)$ through a set of
knots $(x_0, f(x_0)), (x_1, f(x_1)), \dots, (x_n, f(x_n))$ by determining the coefficients
$c_0, c_1, \dots, c_n$ that satisfy the system of equations $f(x_j) = \sum_{i=0}^{n} c_i B_{i,p}(x_j)$ for
$j = 0, 1, \dots, n$. B-splines have the following properties: each is nonnegative
with local support; each is a piecewise polynomial of degree less than or equal to
$p$ over $p + 1$ subintervals; and together, they form a partition of unity. Degree-0
B-splines are piecewise constant, degree-1 B-splines are piecewise linear, degree-2
B-splines are piecewise quadratic, and so forth.
We can recursively generate splines of any degree, starting with degree-0
splines using de Boor’s algorithm. Let 𝑎 ≤ 𝑥0 < · · · < 𝑥 𝑛 ≤ 𝑏. The degree-0
We can modify these bases to fit an arbitrary interval $[x_i, x_{i+1}]$ by taking
$$p(x) = p_i H_{00}(t) + m_i h_i H_{10}(t) + p_{i+1} H_{01}(t) + m_{i+1} h_i H_{11}(t)$$
where $h_i = x_{i+1} - x_i$, $t = (x - x_i)/h_i$, and the coefficients $p_i = f(x_i)$ and
$m_i = f'(x_i)$. The coefficients $m_i$ can be approximated by finite difference
$$\frac{p_{i+1} - p_{i-1}}{x_{i+1} - x_{i-1}} \quad\text{or}\quad \frac{1}{2}\left(\frac{p_{i+1} - p_i}{x_{i+1} - x_i} + \frac{p_i - p_{i-1}}{x_i - x_{i-1}}\right),$$
using a one-sided derivative at the boundaries.
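The interval formula above translates directly into code. The sketch below assumes the standard cubic Hermite basis functions on the unit interval (their definitions are written out here as an assumption) and takes the slopes as given; the function name is illustrative.

```julia
# Standard cubic Hermite basis functions on [0, 1] (assumed forms).
H00(t) = 2t^3 - 3t^2 + 1
H10(t) = t^3 - 2t^2 + t
H01(t) = -2t^3 + 3t^2
H11(t) = t^3 - t^2

# Evaluate the piecewise cubic Hermite interpolant at x, given the knots xi,
# the values p at the knots, and the slopes m at the knots.
function hermite_value(xi, p, m, x)
    i = clamp(searchsortedlast(xi, x), 1, length(xi) - 1)
    h = xi[i+1] - xi[i]
    t = (x - xi[i])/h
    return p[i]*H00(t) + m[i]*h*H10(t) + p[i+1]*H01(t) + m[i+1]*h*H11(t)
end

# Hypothetical test data sampled from f(x) = x^3 with exact slopes f'(x) = 3x^2.
xi = [0.0, 1.0, 3.0]
p = xi.^3
m = 3 .* xi.^2
```

Because each piece is a cubic that matches values and slopes at both ends, the interpolant should reproduce $x^3$ exactly, up to rounding. A PCHIP would replace the exact slopes with limited finite-difference estimates.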
One practical difference between a PCHIP and a cubic spline interpolant
is that the PCHIP has fewer wiggles and doesn’t overshoot the knots like the
spline is apt to do. On the other hand, the spline is smoother than the PCHIP. A
cubic spline is 𝐶 2 -continuous, whereas a PCHIP is only 𝐶 1 -continuous. A cubic
spline is designed by matching the first and second derivatives of the polynomial
segments at knots without explicit regard to the values of those derivatives. A
PCHIP is designed by prescribing the first derivatives at knots, which are the
same for each polynomial segment sharing a knot, while their second derivatives
may be different. Take a look at both methods applied to the knots below:
[Figure: two panels, labeled “spline” and “PCHIP,” showing each interpolant through the same knots.]
Bézier curves 243
Another way to create splines is by using composite Bézier curves. While splines
are typically constructed by fitting polynomials through nodes, Bézier curves are
constructed by using a set of control points to form a convex hull. A Bézier curve
doesn’t go through control points except at the endpoints. A Bézier spline is built
by matching the position and slopes of several Bézier curves at their terminal
control points:
Bézier curves are all around us—just pick up a newspaper, and you are
bound to see Bézier curves. That’s because composite Bézier curves are the
mathematical encodings for glyphs used in virtually every modern font, from
quadratic ones in TrueType fonts to the cubic ones in the Times font used in
each letter of text you are reading right now. Bézier curves are used to draw
curves in design software like Inkscape and Adobe Illustrator, and they are used
to describe paths in an SVG, the standard web vector image format.
A linear Bézier curve is a straight line segment through two control points
$\mathbf{p}_0 = (x_0, y_0)$ and $\mathbf{p}_1 = (x_1, y_1)$:
$$\mathbf{q}_{10}(t) = (1 - t)\mathbf{p}_0 + t\mathbf{p}_1$$
where $t \in [0, 1]$. Note that $\mathbf{q}_{10}(0) = \mathbf{p}_0$ and $\mathbf{q}_{10}(1) = \mathbf{p}_1$. With an additional
control point $\mathbf{p}_2$, we can build a quadratic Bézier curve. First, define the linear
Bézier curve
$$\mathbf{q}_{11}(t) = (1 - t)\mathbf{p}_1 + t\mathbf{p}_2.$$
String art uses threads strung between nails hammered into a board to create
geometric representations, such as a ship’s sail. You can think of the Bézier
curve as the convex hull formed connecting threads from nails at q10 (𝑡) to
corresponding nails at q11 (𝑡). The threads or line segments are tangent to the
Bézier curve q20 (𝑡). If we expand the expression for q20 (𝑡), we have
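with a derivative
$$\begin{aligned}
\mathbf{q}_{20}'(t) &= -2(1 - t)\mathbf{p}_0 + 2(1 - 2t)\mathbf{p}_1 + 2t\mathbf{p}_2 \\
&= -2\bigl[(1 - t)\mathbf{p}_0 + t\mathbf{p}_1\bigr] + 2\bigl[(1 - t)\mathbf{p}_1 + t\mathbf{p}_2\bigr] \\
&= 2\bigl[\mathbf{q}_{11}(t) - \mathbf{q}_{10}(t)\bigr].
\end{aligned}$$
Note that $\mathbf{q}_{20}(0) = \mathbf{p}_0$ and $\mathbf{q}_{20}(1) = \mathbf{p}_2$, which says that the quadratic Bézier
curve goes through $\mathbf{p}_0$ and $\mathbf{p}_2$. And, $\mathbf{q}_{20}'(0) = 2(\mathbf{p}_1 - \mathbf{p}_0)$ and $\mathbf{q}_{20}'(1) = 2(\mathbf{p}_2 - \mathbf{p}_1)$,
which says that the Bézier curve is tangent to the line segments $\mathbf{p}_0\mathbf{p}_1$ and $\mathbf{p}_1\mathbf{p}_2$ at
$\mathbf{p}_0$ and $\mathbf{p}_2$, respectively.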
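We can continue in this fashion by adding another control point $\mathbf{p}_3$ to build a
cubic Bézier curve. Take
$$\begin{aligned}
\mathbf{q}_{12}(t) &= (1 - t)\mathbf{p}_2 + t\mathbf{p}_3 \\
\mathbf{q}_{21}(t) &= (1 - t)\mathbf{q}_{11}(t) + t\mathbf{q}_{12}(t) \\
\mathbf{q}_{30}(t) &= (1 - t)\mathbf{q}_{20}(t) + t\mathbf{q}_{21}(t).
\end{aligned}$$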
This time, the cubic Bézier curve q30 (𝑡) is created from the convex hull formed
by line segments connecting corresponding points on q20 (𝑡) and q21 (𝑡). These,
in turn, are created from the segments connecting q10 (𝑡) and q11 (𝑡), and q11 (𝑡)
and q12 (𝑡), respectively. See the QR code at the bottom of this page.
This recursive method of constructing the Bézier curve starting with linear
Bézier curves, then quadratic Bézier curves, then cubic (and higher-order) Bézier
curves is called de Casteljau’s algorithm:
                          q30
                  (1 − t)     (t)
                q20               q21
          (1 − t)   (t)     (1 − t)   (t)
        q10           q11             q12
  (1 − t)   (t)   (1 − t)   (t)   (1 − t)   (t)
p0            p1              p2              p3
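The triangular scheme above can be sketched as a short routine that repeatedly blends neighboring points with weights $(1 - t)$ and $t$ until a single point remains. The function name and the sample control points are chosen here for illustration.

```julia
# Evaluate a Bézier curve at parameter t by repeated linear interpolation
# of neighboring control points (de Casteljau's scheme).
function decasteljau(points, t)
    q = collect(points)                 # working copy: the bottom row of the diagram
    for level in 1:length(q)-1          # each pass builds the next row up
        for i in 1:length(q)-level
            q[i] = (1 - t).*q[i] .+ t.*q[i+1]
        end
    end
    return q[1]
end

# Hypothetical control points of a quadratic Bézier curve.
p = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]]
midpoint = decasteljau(p, 0.5)
```

The curve passes through the first and last control points at $t = 0$ and $t = 1$ but not through the middle one: here the midpoint of the curve is $(1, 1)$, halfway down from the control point $(1, 2)$.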
The bases $\{B_{n0}(t), B_{n1}(t), \dots, B_{nn}(t)\}$ form a partition of unity, meaning
that the points along a Bézier curve are the weighted averages of the control
points. From this, we know that the curve is contained in the convex hull formed
by the control points. The Bernstein polynomials are plotted below as a function of
the parameter $t \in [0, 1]$. Notice how the weight changes on each control point as
$t$ moves from zero on the left of each plot to one on the right.
[Figure: four panels showing the Bernstein polynomials $B_{10}, B_{11}$; $B_{20}, B_{21}, B_{22}$; $B_{30}, B_{31}, B_{32}, B_{33}$; and $B_{40}, B_{41}, B_{42}, B_{43}, B_{44}$.]
Renault engineer Pierre Bézier popularized Bézier curves in the 1960s for
computer-aided geometric design. Paul de Casteljau’s contributions occurred
around the same time at the competing automobile manufacturer Citroën. Bern-
stein polynomials, however, appeared much earlier. In 1912, Serge Bernstein
introduced his eponymous polynomials in a probabilistic proof of the Weierstrass
theorem. We’ll conclude the chapter with Bernstein’s proof. The proof is elegant,
and it provides a nice segue to the material in the next chapter.
It’s no coincidence that the terms of the Bernstein polynomial bear a striking
similarity to the probability mass function of a binomial distribution. Bernstein’s
proof invokes Bernoulli’s theorem, also known as the weak law of large numbers,
which states that the frequency of success of a sequence of Bernoulli trials
approaches the probability of success as the number of trials approaches infinity.
A Bernoulli trial is a random experiment with two possible outcomes—a success
or a failure. A coin flip is either heads or tails. A soccer shot is either a goal or a
miss. A student’s random guess on a multiple-choice question is either right or
Theorem 30 (Weierstrass approximation theorem). If 𝑓 (𝑥) is a continuous
function in the interval [0, 1], then for any 𝜀 > 0, there is a polynomial 𝑝 𝑛 (𝑥)
with a sufficiently large degree 𝑛 such that | 𝑓 (𝑥) − 𝑝 𝑛 (𝑥)| < 𝜀 for every 𝑥 ∈ [0, 1].
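$$E_n = \sum_{k=0}^{n} f\left(\frac{k}{n}\right) \binom{n}{k} x^k (1 - x)^{n-k}$$
maximum and $\underline{f}(x)$ denote the minimum of $f(x)$ over the interval $(x - \delta, x + \delta)$.
Then $\overline{f}(x) - f(x) < \tfrac{1}{2}\varepsilon$ and $f(x) - \underline{f}(x) < \tfrac{1}{2}\varepsilon$.
Let $q$ be the probability that $|x - k/n| > \delta$, and let $L$ be the maximum of
$|f(x)|$ in $[0, 1]$. Then
$$\underline{f}(x)(1 - q) - Lq < E_n < \overline{f}(x)(1 - q) + Lq,$$
or, after rearranging,
$$f(x) + \bigl(\underline{f}(x) - f(x)\bigr) - q\bigl(L + \underline{f}(x)\bigr) < E_n < f(x) + \bigl(\overline{f}(x) - f(x)\bigr) + q\bigl(L - \overline{f}(x)\bigr).$$
By Bernoulli’s theorem, we can choose $n$ sufficiently large such that $q < \varepsilon/4L$.
It follows that
$$f(x) - \frac{\varepsilon}{2} - \frac{2L}{4L}\varepsilon < E_n < f(x) + \frac{\varepsilon}{2} + \frac{2L}{4L}\varepsilon,$$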
The Bernstein approximation of a function $f(x)$ is $\sum_{k=0}^{n} f(k/n) B_{nk}(x)$. To
be clear, a Bernstein approximant does not interpolate given values, even if $f(x)$
is itself a polynomial. Let’s plot the approximation of $f(x) = \sqrt{x}$:
This code does produce a polynomial approximation to 𝑓 (𝑥), but it’s not a very
good one. While Bernstein polynomials are useful in de Casteljau’s algorithm to
generate Bézier curves, they are not practical in approximating arbitrary functions.
Namely, they converge extremely slowly: | 𝑓 (𝑥) − 𝑝 𝑛 (𝑥)| = 𝑂 (𝑛−1 ). In the next
chapter, we will examine methods that converge quickly. To dig deeper into the
Bernstein polynomials, check out Rida Farouki’s centennial retrospective article.
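To get a feel for how slowly the Bernstein approximation converges, the following sketch evaluates it for $f(x) = \sqrt{x}$ and records the largest error on a grid for a few values of $n$. The function names are chosen here, and the degrees are kept modest so that `binomial` stays within 64-bit integer range.

```julia
# Bernstein basis polynomial: binomial(n,k) x^k (1-x)^(n-k).
bernstein_basis(n, k, x) = binomial(n, k)*x^k*(1 - x)^(n - k)

# Bernstein approximation of f on [0, 1]: the sum of f(k/n) B_{nk}(x) over k.
bernstein(f, n, x) = sum(f(k/n)*bernstein_basis(n, k, x) for k in 0:n)

# Largest error on a grid for f(x) = √x as the degree doubles.
xs = range(0, 1, length=201)
errors = [maximum(abs(sqrt(x) - bernstein(sqrt, n, x)) for x in xs) for n in (10, 20, 40)]
```

The errors shrink as $n$ grows, but only gradually. The approximant also misses polynomials: for $f(x) = x^2$, the Bernstein approximation equals $x^2 + x(1 - x)/n$ rather than $x^2$ itself.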
9.5 Exercises
Compare interpolation using a polynomial basis with the Gaussian and cubic
radial basis functions $\phi(x) = \exp(-x^2/2\sigma^2)$ and $\phi(x) = |x|^3$ for the Heaviside
function using 20 equally spaced nodes in $[-1, 1]$. The Heaviside function
$\mathrm{H}(x) = 0$ when $x < 0$ and $\mathrm{H}(x) = 1$ when $x \ge 0$.
9.4. Collocation is a method of solving a problem by considering candidate
solutions from a finite-dimensional space, determined by a set of points called
collocation points. The finite-dimensional problem is solved exactly at the
collocation points to get an approximate solution to the original problem.
Suppose that we have the differential equation $\mathrm{L}\,u(x) = f(x)$ with boundary
conditions $u(a) = u_a$ and $u(b) = u_b$ for some linear differential operator $\mathrm{L}$.
We approximate the solution $u(x)$ by $\sum_{j=0}^{n} c_j v_j(x)$ for some basis $v_j$. Take the
collocation points $x_i \in [a, b]$ with $i = 0, 1, \dots, n$. Then solve the linear system
of equations
$$\sum_{j=0}^{n} c_j v_j(x_0) = u_a, \qquad \sum_{j=0}^{n} c_j\,\mathrm{L}\,v_j(x_i) = f(x_i), \qquad \sum_{j=0}^{n} c_j v_j(x_n) = u_b.$$
circle with endpoints at (0, 1) and (1, 0), this means that we need to choose
control points at (𝑐, 1) and (1, 𝑐). What value 𝑐 should we take to ensure that the
radial deviation from the unit circle is minimized? How close an approximation
does this value give us?
Chapter 10
Approximating Functions
The last chapter used interpolation to find polynomials that coincide with a
function at a set of nodes or knots. These methods allowed us to model data and
functions using smooth curves. This chapter considers best approximation, which
is frequently used in data compression, optimization, and machine learning. For
a function 𝑓 in some vector space 𝐹, we will find the function 𝑢 in a subspace 𝑈
that is closest to 𝑓 with respect to some norm:
When the norm is the $L^1$-norm, i.e., $\int_a^b |f(x) - g(x)|\,\mathrm{d}x$, the problem is one
of minimizing the average absolute deviation, the one that minimizes the area
between curves. When the norm is the $L^2$-norm, i.e., $\int_a^b (f(x) - g(x))^2\,\mathrm{d}x$,
the problem is called least squares approximation or Hilbert approximation.
When the norm is the $L^\infty$-norm, i.e., $\sup_{x \in [a,b]} |f(x) - g(x)|$, the problem is
called uniform approximation, minimax, or Chebyshev approximation. The least
squares approximation is popular largely because it is robust and typically the
easiest to implement. So, we will concentrate on the least squares approximation.
An inner product is a mapping (·, ·) of any two vectors of a vector space to the
field over which that vector space is defined (typically real or complex numbers),
satisfying three properties: linearity, positive-definiteness, and symmetry.
1. Linearity says that for any 𝑓 , 𝑔, and ℎ in the vector space and any scalar 𝛼
of the field, ( 𝑓 , 𝛼𝑔) = 𝛼 · ( 𝑓 , 𝑔) and ( 𝑓 , 𝑔 + ℎ) = ( 𝑓 , 𝑔) + ( 𝑓 , ℎ).
Euclidean vector space. The Euclidean inner product $(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} x_i y_i$ has the
geometric interpretation as a measure of the magnitude and closeness of two
vectors: $(\mathbf{x}, \mathbf{y}) = \|\mathbf{x}\|\,\|\mathbf{y}\| \cos\theta$. In particular, the inner product of any two unit
vectors is simply the cosine of the angle between those two vectors. In this chapter, we will
be interested in the space of continuous functions over an interval $[a, b]$ and its
associated inner product
$$(f, g) = \int_a^b f(x) g(x) w(x)\,\mathrm{d}x,$$
where 𝑤(𝑥) is a specified positive weight function. We can easily verify that the
three properties of an inner product space hold. Like the Euclidean vector space,
the weighted inner product for the space of functions can also be interpreted as a
measure of the magnitude and closeness of the functions.
We say that 𝑓 is orthogonal to 𝑔 if ( 𝑓 , 𝑔) = 0. Sometimes, we use the term
w-orthogonal to explicitly indicate that a weighting function is included in the
definition of the inner product. We’ll write this as 𝑓 ⊥ 𝑔. If 𝑓 ⊥ 𝑔 for all 𝑔 ∈ 𝐺,
we say that 𝑓 is orthogonal to 𝐺 and write it as 𝑓 ⊥ 𝐺. For example, consider
the space of square-integrable functions on the interval [−1, 1] with the inner
product $(f, g) = \int_{-1}^{1} f(x) g(x)\,\mathrm{d}x$. In this inner product space, even functions like
$\mathrm{e}^{-x^2}$ are orthogonal to odd functions like $\sin x$. Even functions form a subspace
of square-integrable functions; odd functions form another subspace. In fact, the
subspace of even functions is the orthogonal complement to the subspace of odd
functions. That is to say, every function is the sum of an even function and an
odd function.
Least squares approximation 251
$$\begin{aligned}
\|f - g\|^2 &= \|(f - u) + (u - g)\|^2 \\
&= \|f - u\|^2 + 2(f - u, u - g) + \|u - g\|^2 \ge \|f - u\|^2.
\end{aligned}$$
To prove the converse, suppose that $u$ is a best approximation to $f$. Let $g \in U$ and
take $\lambda > 0$. Then $\|f - u + \lambda g\|^2 \ge \|f - u\|^2$. So,
$$\begin{aligned}
0 &\le \|f - u + \lambda g\|^2 - \|f - u\|^2 \\
&= \|f - u\|^2 + 2\lambda (f - u, g) + \lambda^2 \|g\|^2 - \|f - u\|^2 \\
&= \lambda \bigl(2 (f - u, g) + \lambda \|g\|^2\bigr).
\end{aligned}$$
This system of equations is known as the normal equation. In matrix form, the
normal equation is
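$$\begin{bmatrix}
(u_1, u_1) & (u_2, u_1) & \cdots & (u_n, u_1) \\
(u_1, u_2) & (u_2, u_2) & \cdots & (u_n, u_2) \\
\vdots & \vdots & \ddots & \vdots \\
(u_1, u_n) & (u_2, u_n) & \cdots & (u_n, u_n)
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} (f, u_1) \\ (f, u_2) \\ \vdots \\ (f, u_n) \end{bmatrix}.$$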
The matrix on the left is called the Gram matrix.
Example. Suppose that we want to find the least squares approximation to 𝑓 (𝑥)
over the interval [0, 1] using degree-𝑛 polynomials. If we take the canonical
basis {𝜑𝑖 } = {1, 𝑥, 𝑥 2 , . . . , 𝑥 𝑛 }, then
$$p_n(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_n x^n = \sum_{i=0}^{n} c_i \varphi_i(x).$$
into Julia returns around 3.89. We can get an intuitive idea of why the canonical
basis results in an ill-conditioned problem by just looking at it. The first eight
elements of the canonical basis and of the Legendre basis are shown below:
[Figure: the first eight elements of the canonical basis and of the Legendre basis, each plotted over $[0, 1]$.]
Differences between the elements of the canonical basis get smaller and smaller
as the degree of the monomial increases. It is easy to distinguish the plots of 1
from 𝑥 and 𝑥 from 𝑥 2 , but the plots of 𝑥 7 and 𝑥 8 become almost indistinguishable.
On the other hand, the plots of the Legendre polynomial basis elements are all
largely different. So, heuristically, it seems that orthogonal polynomials like
Legendre polynomials are simply better.
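The ill-conditioning can also be seen directly from the Gram matrix. For the canonical basis on $[0, 1]$ with unit weight, the entries are $(x^i, x^j) = \int_0^1 x^{i+j}\,\mathrm{d}x = 1/(i + j + 1)$. The sketch below builds this matrix and tracks its condition number; the function name is chosen here for illustration.

```julia
using LinearAlgebra

# Gram matrix of {1, x, ..., x^n} on [0, 1] with unit weight:
# entry (i, j) is the integral of x^(i+j), which equals 1/(i + j + 1).
gram_canonical(n) = [1/(i + j + 1) for i in 0:n, j in 0:n]

# Condition numbers for polynomial degrees 1 through 8.
conds = [cond(gram_canonical(n)) for n in 1:8]
```

The condition numbers grow roughly geometrically with the degree, so each additional basis element costs more than a digit of accuracy when solving the normal equation in this basis.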
Orthonormal systems 253
We call this best approximation the orthogonal projection of 𝑓 . The inner product
( 𝑓 , 𝑢 𝑖 ) is simply the component of 𝑓 in the 𝑢 𝑖 direction. We’ll denote the
orthogonal projection operator by P𝑛 :
$$\mathrm{P}_n f = \sum_{i=1}^{n} (f, u_i)\,u_i.$$
We can summarize everything into the following theorem, which will be a major
tool for the rest of the chapter.
Theorem 32. Let $\{u_1, u_2, \dots, u_n\}$ be an orthonormal system in an inner product
space $F$. Then the best approximation to $f$ is the element $\mathrm{P}_n f$, where the
orthogonal projection operator $\mathrm{P}_n = \sum_{i=1}^{n} (\cdot, u_i)\,u_i$.
So far, we’ve only used a finite basis. Let’s turn to infinite-dimensional vector
spaces. To do this, we’ll need a few extra terms. A Cauchy sequence is any
sequence whose elements become arbitrarily close as the sequence progresses.
That is to say, if we have a sequence of projections P0 𝑓 , P1 𝑓 , P2 𝑓 . . . , then the
sequence is a Cauchy sequence if given some 𝜀 > 0, there is some integer 𝑁
such that k P𝑛 𝑓 − P𝑚 𝑓 k < 𝜀 for all 𝑛, 𝑚 > 𝑁. A normed vector space is said to
be complete if every Cauchy sequence of elements in that vector space converges
to an element in that vector space. Such a vector space is called a Banach
space. Similarly, a complete, inner product space is called a Hilbert space.
An orthonormal basis {𝑢 1 , 𝑢 2 , . . . } of a Hilbert space 𝐹 is called a complete,
orthonormal system of 𝐹 or a Hilbert basis.
Theorem 33 (Parseval’s theorem). If $\{u_1, u_2, \dots\}$ is a complete, orthonormal
system in $F$, then $\|f\|^2 = \sum_{i=1}^{\infty} |(f, u_i)|^2$ for every $f \in F$.
Proof. Because $\{u_1, u_2, \dots\}$ are orthonormal, we can define the projection
operator $\mathrm{P}_n = \sum_{i=1}^{n} (\cdot, u_i)\,u_i$. In the induced norm, we simply have
$$\|\mathrm{P}_n f\|^2 = \Bigl\|\sum_{i=1}^{n} (f, u_i)\,u_i\Bigr\|^2 = \Bigl(\sum_{i=1}^{n} (f, u_i)\,u_i,\ \sum_{j=1}^{n} (f, u_j)\,u_j\Bigr).$$
We expand the inner product to get
$$\|\mathrm{P}_n f\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} (f, u_i)\,\overline{(f, u_j)}\,(u_i, u_j) = \sum_{i=1}^{n} |(f, u_i)|^2,$$
because $(u_i, u_j) = \delta_{ij}$. Now, because the system is complete,
$$\|f\|^2 = \lim_{n \to \infty} \|\mathrm{P}_n f\|^2 = \sum_{i=1}^{\infty} |(f, u_i)|^2.$$
We can generate a Hilbert basis (an orthonormal basis for an inner product
space) by using the Gram–Schmidt method. The Gram–Schmidt method starts
with an arbitrary basis and sequentially subtracts out the non-orthogonal components. Let $\{v_1, v_2, v_3, \dots\}$ be a basis for $F$ with a prescribed inner product.
1. Start by taking $u_1 = v_1$ and normalize $\hat{u}_1 = u_1/\|u_1\|$.
2. Next, define $u_2 = v_2 - \mathrm{P}_1 v_2$ where the projection operator $\mathrm{P}_1 = (\cdot, \hat{u}_1)\, \hat{u}_1$, and normalize $\hat{u}_2 = u_2/\|u_2\|$.
3. Now, define $u_3 = v_3 - \mathrm{P}_2 v_3$ where the projection operator $\mathrm{P}_2 = \mathrm{P}_1 + (\cdot, \hat{u}_2)\, \hat{u}_2$, and normalize $\hat{u}_3 = u_3/\|u_3\|$.
𝑛. We continue in this manner by defining $u_n = v_n - \mathrm{P}_{n-1} v_n$ where the projection operator $\mathrm{P}_{n-1} = \sum_{i=1}^{n-1} (\cdot, \hat{u}_i)\, \hat{u}_i$ and normalizing $\hat{u}_n = u_n/\|u_n\|$.
Then $\{\hat{u}_1, \hat{u}_2, \hat{u}_3, \dots\}$ is an orthonormal basis for $F$. Directly applying the Gram–Schmidt process can be a lot of work. Luckily, there’s a shortcut to
generate orthogonal polynomials. We can use a three-term recurrence relation
known as Favard’s theorem.1
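As a rough illustration of the Gram–Schmidt steps above, the following sketch orthonormalizes the monomials 1, 𝑥, 𝑥², 𝑥³ on [−1, 1]. The midpoint-rule inner product, the grid size `m`, and the names `inner` and `gram_schmidt` are assumptions made for this example only.

```julia
# Midpoint-rule approximation of the L² inner product on [-1,1] (an assumed choice).
m = 2000
t = [-1 + (2i - 1)/m for i in 1:m]           # midpoints of m equal subintervals
inner(f, g) = sum(f .* g) * (2/m)

# Classical Gram–Schmidt: subtract the projections onto the previously
# computed orthonormal elements, then normalize.
function gram_schmidt(V)
    U = Vector{Vector{Float64}}()
    for v in V
        u = copy(v)
        for q in U
            u -= inner(v, q) * q
        end
        push!(U, u / sqrt(inner(u, u)))
    end
    U
end

V = [t.^k for k in 0:3]                       # monomial basis sampled on the grid
U = gram_schmidt(V)
gram = [inner(p, q) for p in U, q in U]       # should be close to the identity
```

Up to quadrature error, the second element is the normalized Legendre polynomial √(3/2)·𝑥.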
Theorem 34 (Favard’s theorem). For each weighted inner product, there are uniquely determined orthogonal polynomials with leading coefficient one satisfying the three-term recurrence relation
$$p_{n+1}(x) = (x + a_n)\, p_n(x) + b_n\, p_{n-1}(x), \qquad p_0(x) = 1, \quad p_1(x) = x + a_1$$
1 Jean
Favard proved the theorem in 1935, although it was previously demonstrated by Aurel
Wintner in 1926 and Marshall Stone in 1932.
Orthonormal systems 255
where
$$a_n = -\frac{(x p_n, p_n)}{(p_n, p_n)} \quad\text{and}\quad b_n = -\frac{(p_n, p_n)}{(p_{n-1}, p_{n-1})}.$$
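Proof. We’ll use induction. First, the only polynomial of degree zero with leading coefficient one is $p_0(x) \equiv 1$. Suppose that the claim holds for the orthogonal polynomials $p_0, p_1, \dots, p_n$. Let $p_{n+1}$ be an arbitrary polynomial of degree $n+1$ and leading coefficient one. Then $p_{n+1} - x p_n$ is a polynomial whose degree is less than or equal to $n$. Because $p_0, p_1, \dots, p_n$ form an orthogonal basis of $\mathbb{P}_n$,
$$p_{n+1} - x p_n = \sum_{j=0}^{n} c_j p_j \quad\text{with}\quad c_j = \frac{(p_{n+1} - x p_n, p_j)}{(p_j, p_j)}.$$
For $p_{n+1}$ to be orthogonal to every $p_j$ with $j \le n$, we need $c_j = -(x p_n, p_j)/(p_j, p_j) = -(p_n, x p_j)/(p_j, p_j)$. When $j < n-1$, the polynomial $x p_j$ has degree less than $n$, so $c_j = 0$. When $j = n-1$, the polynomial $x p_{n-1}$ equals $p_n$ plus a polynomial of degree less than $n$, so $(p_n, x p_{n-1}) = (p_n, p_n)$. Hence,
$$a_n = c_n = -\frac{(x p_n, p_n)}{(p_n, p_n)}, \qquad b_n = c_{n-1} = -\frac{(p_n, p_n)}{(p_{n-1}, p_{n-1})}.$$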
function legendre(x,n)
  n==0 && return(one.(x))
  n==1 && return(x)
  x.*legendre(x,n-1) .- (n-1)^2/(4(n-1)^2-1)*legendre(x,n-2)
end
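The recurrence coefficient (𝑛−1)²/(4(𝑛−1)²−1) used in `legendre` can be checked against the formulas for 𝑎𝑛 and 𝑏𝑛 in Favard's theorem. The sketch below does this numerically for the unit weight on [−1, 1]. The midpoint-rule inner product and the helper name `favard` are assumptions, and the recurrence is started from 𝑝₋₁ ≡ 0 and 𝑝₀ ≡ 1.

```julia
# Midpoint-rule approximation of (f,g) = ∫ f g dx on [-1,1] (an assumed choice).
m = 4000
t = [-1 + (2i - 1)/m for i in 1:m]
inner(f, g) = sum(f .* g) * (2/m)

# Build monic orthogonal polynomials with the three-term recurrence,
# computing a_n and b_n from inner products at each step.
function favard(nmax)
    p_prev, p = zero(t), one.(t)              # p₋₁ ≡ 0 and p₀ ≡ 1
    a, b = Float64[], Float64[]
    for n in 0:nmax
        an = -inner(t .* p, p) / inner(p, p)
        bn = n == 0 ? 0.0 : -inner(p, p) / inner(p_prev, p_prev)
        push!(a, an); push!(b, bn)
        p_prev, p = p, (t .+ an) .* p .+ bn .* p_prev
    end
    a, b
end

a, b = favard(4)
b_expected = [-n^2/(4n^2 - 1) for n in 1:4]   # coefficients used in legendre(x,n)
```

For this symmetric weight, every 𝑎𝑛 vanishes, and `b[2:5]` should agree with `b_expected` up to quadrature error.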
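$$\mathrm{P}_n f = \sum_{k=-n}^{n} c_k \mathrm{e}^{\mathrm{i}kx} \quad\text{with}\quad c_k = \big(f, \mathrm{e}^{\mathrm{i}kx}\big) = \frac{1}{2\pi}\int_0^{2\pi} f(x)\,\mathrm{e}^{-\mathrm{i}kx}\,\mathrm{d}x. \tag{10.2}$$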
The projection P𝑛 𝑓 is called the truncated Fourier series, the set of coefficients
{𝑐 𝑘 } is called the Fourier coefficients, and the inner product ( 𝑓 , 𝜙 𝑘 ) is called the
continuous Fourier transform. In practice, the continuous Fourier transform is
approximated numerically by taking a finite set of points on the mesh 𝑥 𝑗 = 2𝜋 𝑗/𝑛
for 𝑗 = 0, 1, . . . , 𝑛 − 1 and using the trapezoidal method to approximate the
integral. By using the piecewise-constant approximation of 𝑓 and 𝑔 at {𝑥 𝑗 } in
theorem 35 we arrive at a simple corollary:
Theorem 36. The exponential basis elements $\phi_k(x) = \mathrm{e}^{\mathrm{i}kx}$ are orthonormal in the discrete inner product space² $(f, g) = \frac{1}{n}\sum_{j=0}^{n-1} f(x_j)\,\overline{g(x_j)}$.
2 The bilinear form $(f, g) = \frac{1}{n}\sum_{j=0}^{n-1} f(x_j)\,\overline{g(x_j)}$ is technically a pseudo-inner product. A pseudo-inner product satisfies all requirements to be an inner product except definiteness (nondegeneracy). The bilinear form is degenerate because $(f, g) = 0$ for any $f(x)$ that is zero at all nodes $2\pi j/n$, like $\sin nx$. The corresponding “norm” is a semi-norm.
Fourier polynomials 257
Figure 10.1: A polynomial over the complex plane $\sum_{k=0}^{n-1} c_k z^k$ restricted to the unit circle is a Fourier polynomial $\sum_{k=0}^{n-1} c_k \mathrm{e}^{2\pi\mathrm{i}k/n}$.
A projection operator in the discrete inner product space, called the discrete Fourier transform, can be defined analogously to the continuous Fourier transform (10.2) by taking a discrete inner product at the meshpoints $x_j$:
$$\mathrm{P}_n f(x_j) = \sum_{k=0}^{n-1} c_k \mathrm{e}^{\mathrm{i}k x_j} \quad\text{with}\quad c_k = \big(f, \mathrm{e}^{\mathrm{i}kx}\big) = \frac{1}{n}\sum_{j=0}^{n-1} f(x_j)\,\mathrm{e}^{-\mathrm{i}k x_j}. \tag{10.3}$$
Let’s briefly revisit the topic of interpolation discussed in the previous chapter.
Suppose that we have equally spaced nodes at 𝑥 𝑗 = 2𝜋 𝑗/𝑛 with values 𝑓 (𝑥 𝑗 ) for
𝑗 = 0, 1, . . . , 𝑛 − 1. Take the exponential basis functions 𝜙 𝑘 (𝑥) = ei𝑘 𝑥 . Then the
interpolating Fourier polynomial satisfies the system of equations
$$f(x_j) = \sum_{k=0}^{n-1} c_k \phi_k(x_j) = \sum_{k=0}^{n-1} c_k \mathrm{e}^{\mathrm{i}2\pi jk/n} = \sum_{k=0}^{n-1} c_k \omega^{jk} \tag{10.4}$$
for some coefficients 𝑐 𝑘 . The value 𝜔 = ei2 𝜋/𝑛 is an nth root of unity, and we
recognize from (10.3) that this transformation is the inverse discrete Fourier
transform. So, the coefficients of the Fourier polynomial that interpolates a
function 𝑓 at equally spaced nodes 𝑥 𝑗 = 2𝜋 𝑗/𝑛 are given by 𝑐 𝑘 = ( 𝑓 , 𝜙 𝑘 ). That
is, the solution to the interpolation problem using a uniform mesh is the same as
the solution to the best approximation problem using the discrete inner product.
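A small numerical check of this equivalence, as a sketch: compute the coefficients 𝑐𝑘 = (𝑓, 𝜙𝑘) with the discrete inner product and confirm that the resulting Fourier polynomial reproduces 𝑓 at the nodes. The test function exp(sin 𝑥), the value 𝑛 = 8, and the helper name `inner` are assumptions for illustration.

```julia
n = 8
x = 2π .* (0:n-1) ./ n                       # mesh x_j = 2πj/n
f(x) = exp(sin(x))                           # a smooth 2π-periodic test function (assumed)

# Discrete inner product (u, v) = (1/n) Σ u_j conj(v_j) on the mesh.
inner(u, v) = sum(u .* conj.(v)) / n

ϕ(k) = exp.(im .* k .* x)                    # basis element ϕ_k sampled on the mesh
c = [inner(f.(x), ϕ(k)) for k in 0:n-1]      # coefficients c_k = (f, ϕ_k)

# Evaluate the Fourier polynomial Σ c_k ϕ_k at the nodes.
p = sum(c[k+1] .* ϕ(k) for k in 0:n-1)
interpolation_error = maximum(abs.(p .- f.(x)))
```

In practice, one would compute the same coefficients with a fast Fourier transform rather than with 𝑛 separate inner products.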
We can also establish a connection between Fourier polynomials and algebraic
polynomials. Suppose that we want to find the algebraic polynomial interpolant
𝑝(𝑧) = 𝑐 0 + 𝑐 1 𝑧 + 𝑐 2 𝑧 2 + · · · + 𝑐 𝑛−1 𝑧 𝑛−1
passing through the nodes {(𝑧0 , 𝑦 0 ), (𝑧1 , 𝑦 1 ), . . . , (𝑧 𝑛−1 , 𝑦 𝑛−1 )} in C × C. In
the canonical basis {1, 𝑧, . . . , 𝑧 𝑛−1 }, we must solve the Vandermonde system
$y_j = \sum_{k=0}^{n-1} c_k z_j^k$ to get the coefficients of the polynomial $c_k$. Now, restrict the
nodes to equally spaced points around the unit circle—i.e., take 𝑧 𝑘 = 𝜔 𝑘 = ei2 𝜋 𝑘/𝑛 ,
running counterclockwise starting with 𝑧0 = 1. In this case, we have (10.4),
which means that an algebraic polynomial in the complex plane is a Fourier
polynomial on the unit circle. See the figure on the preceding page.
Approximation error
Let’s finally examine the behavior of the Fourier coefficients for the continuous
inner product
$$c_k = \big(f, \mathrm{e}^{\mathrm{i}kx}\big) = \frac{1}{2\pi}\int_0^{2\pi} f(x)\,\mathrm{e}^{-\mathrm{i}kx}\,\mathrm{d}x,$$
where 𝑓 (𝑥) is periodic. The discrete inner product follows analogously. Note
that 𝑐 0 is simply the mean value of 𝑓 (𝑥) over (0, 2𝜋). Also, note that if 𝑓 (𝑥) is
differentiable, then by integration by parts
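$$\big(f', \mathrm{e}^{\mathrm{i}kx}\big) = \frac{1}{2\pi}\int_0^{2\pi} f'(x)\,\mathrm{e}^{-\mathrm{i}kx}\,\mathrm{d}x = \frac{\mathrm{i}k}{2\pi}\int_0^{2\pi} f(x)\,\mathrm{e}^{-\mathrm{i}kx}\,\mathrm{d}x = \mathrm{i}k c_k$$
because $f(x)$ is periodic. In general, we have $\big(f^{(p)}, \mathrm{e}^{\mathrm{i}kx}\big) = (\mathrm{i}k)^p c_k$ if the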
function 𝑓 (𝑥) is sufficiently smooth. A Sobolev space is the space of square-
integrable functions whose derivatives are also square-integrable functions. Let
𝐻 𝑝 (0, 2𝜋) denote the Sobolev space of periodic functions whose zeroth through
𝑝th derivatives are also square-integrable. We can think of Sobolev space as a
refinement of the space of 𝐶 𝑝 -differentiable functions.
Theorem 37. If $f \in H^p(0, 2\pi)$, then its Fourier coefficients $|c_k| = o(k^{-p-1/2})$ and the $L^2$-error of its truncated Fourier series is $o(n^{-p})$.
Proof. By Parseval’s theorem,
$$\frac{1}{2\pi}\int_0^{2\pi} |f^{(p)}(x)|^2\,\mathrm{d}x = \sum_{k=-\infty}^{\infty} |k^p c_k|^2.$$
$$x = \operatorname{Re} z = \tfrac{1}{2}(z + z^{-1}) = \cos\theta.$$
For a general 𝑘th order monomial 𝑧 𝑘 , we can define the Chebyshev polynomial
$$\mathrm{T}_k''(x) = -\frac{k^2 \cos k\theta}{1 - x^2} + \frac{k x \sin k\theta}{(1 - x^2)^{3/2}} \tag{10.6}$$
an approach has the benefit of being easy to manipulate and explore. We’ll
develop the Chebyshev differentiation matrix over the remainder of this section.
See Trefethen’s book Approximation Theory and Approximation Practice for an
in-depth discussion.3
In the previous chapter, we used Lagrange basis functions to fit a polynomial
through a set of points. A polynomial 𝑝 𝑛 (𝑥) passing through the points (𝑥0 , 𝑦 0 ),
(𝑥1 , 𝑦 1 ), . . . , (𝑥 𝑛 , 𝑦 𝑛 ) can be expressed using a Lagrange polynomial basis
$$p_n(x) = \sum_{i=0}^{n} y_i \ell_i(x) \quad\text{with}\quad \ell_i(x) = \prod_{\substack{k=0 \\ k \ne i}}^{n} \frac{x - x_k}{x_i - x_k}.$$
$$\ell_i'(x) = \ell_i(x) \cdot [\log \ell_i(x)]' = \ell_i(x) \sum_{\substack{k=0 \\ k \ne i}}^{n} \frac{1}{x - x_k},$$
3 The book is available at http://www.chebfun.org/ATAP as a set of m-files consisting of LaTeX markup and Matlab code. A pdf can be generated from them using the publish command in Matlab.
Chebyshev polynomials 261
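When $x_j$ is an interior Chebyshev point, we take the limit $x \to x_j$ and use (10.6) to get the expression
$$\prod_{\substack{k=0 \\ k \ne i,j}}^{n} (x_j - x_k) = \frac{(x_j^2 - 1)\,\mathrm{T}_n''(x_j)}{2^{n-1}(x_j - x_i)} = \frac{n^2 (-1)^{j-1}}{2^{n-1}(x_j - x_i)}.$$
At an endpoint $x_j = \pm 1$,
$$\prod_{\substack{k=0 \\ k \ne i,j}}^{n} (x_j - x_k) = \frac{(x_j^2 - 1)\,\mathrm{T}_n'(x_j)}{(x_j - x_i)(x_j \mp 1)\,2^{n-1}} = \frac{(\pm 2)\,\mathrm{T}_n'(\pm 1)}{(x_j - x_i)\,2^{n-1}} = \frac{(\pm 1)^j n^2}{2^{n-2}(x_j - x_i)}.$$
The diagonal elements are
$$d_{ii} = -\sum_{\substack{k=0 \\ k \ne i}}^{n} d_{ik}.$$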
using LinearAlgebra
function chebdiff(n)
  x = -cos.(LinRange(0,π,n))
  c = [2;ones(n-2);2].*(-1).^(0:n-1)
  D = c./c'./(x .- x' + I)
  D - Diagonal([sum(D,dims=2)...]), x
end
Figure 10.2: Left: Numerical solution and exact solution to the Airy equation (10.7). Right: ℓ∞-error as a function of the number of Chebyshev points.
n = 15
D,x = chebdiff(n)
u = exp.(-4x.^2)
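As a sanity check on the differentiation matrix, the sketch below applies it to 𝑢 = exp(−4𝑥²) and compares the result with the exact derivative −8𝑥 exp(−4𝑥²). The block repeats `chebdiff` so that it runs on its own. The choice 𝑛 = 30 and the names `du` and `err` are assumptions.

```julia
using LinearAlgebra

# Chebyshev differentiation matrix on n points, following the listing above.
function chebdiff(n)
    x = -cos.(LinRange(0, π, n))
    c = [2; ones(n-2); 2] .* (-1).^(0:n-1)
    D = c ./ c' ./ (x .- x' + I)             # off-diagonal entries (diagonal is a placeholder)
    D - Diagonal(vec(sum(D, dims=2))), x     # diagonal set to minus the off-diagonal row sums
end

n = 30
D, x = chebdiff(n)
u  = exp.(-4x.^2)
du = -8x .* u                                # exact derivative of exp(-4x²)
err = maximum(abs.(D*u - du))
```

Because the test function is smooth, the error should fall off rapidly as `n` grows, until rounding error dominates.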
We’ll prescribe two boundary conditions and then solve the system:
n = 15; k2 = 256
D,x = chebdiff(n)
L = (D^2 - k2 *Diagonal(x))
L[[1,n],:] .= 0; L[1,1] = L[n,n] = 1
y = L\[2;zeros(n-2);1]
How accurate is the solution? Figure 10.2 above shows a semi-log plot of the
ℓ ∞ -error as a function of the number of nodes. The solution is already quite good
with just 15 points. After that, the solution gains one digit of accuracy for every
two points added. The error is $O(10^{-n/2})$. Such a convergence rate is often said
Wavelets 263
10.5 Wavelets
ending in “let.”
1 3 −3 −1 5 9 8 10 23 25 21 19 16 16 19 21
First combine them pairwise, recording the averages and the differences:
2, −2 7, 9 24, 20 16, 20
0 ∓ −2 8∓1 22 ∓ −2 18 ∓ 2
And again:
0, 8 22, 18
4∓4 20 ∓ −2
And one more time:
4, 20
12 ∓ 8
6 Wavelets were originally called ondelettes by the French pioneers Morlet and Grossman after a
The value 12 is the average of all the initial numbers. By adding the differences
to it, we can reconstruct the original numbers.
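The averaging and differencing above can be sketched in a few lines. The function names `haar_decompose` and `haar_reconstruct` are hypothetical, and the sign convention for each pair, (average − difference, average + difference), is chosen to match the ∓ notation used in the example.

```julia
# Split each pair (s₁, s₂) into an average a and a difference d so that
# (s₁, s₂) = (a - d, a + d), and repeat on the averages.
function haar_decompose(s)
    s = float.(s)
    details = Vector{Vector{Float64}}()
    while length(s) > 1
        a = (s[1:2:end] .+ s[2:2:end]) ./ 2
        d = (s[2:2:end] .- s[1:2:end]) ./ 2
        pushfirst!(details, d)               # coarsest differences end up first
        s = a
    end
    s[1], details
end

# Rebuild the sequence from the overall average and the stored differences.
function haar_reconstruct(avg, details)
    s = [avg]
    for d in details
        t = zeros(2length(s))
        t[1:2:end] = s .- d
        t[2:2:end] = s .+ d
        s = t
    end
    s
end

v = [1, 3, -3, -1, 5, 9, 8, 10, 23, 25, 21, 19, 16, 16, 19, 21]
avg, details = haar_decompose(v)
```

Here `avg` is the overall average 12, `details[1]` holds the coarsest difference 8, and `haar_reconstruct(avg, details)` returns the original sequence.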
Scaling function
Let’s start with a formal definition and then make it actionable. Take the space of
square-integrable functions 𝐿 2 . A multiresolution approximation is a sequence
of subspaces ∅ ⊂ · · · ⊂ 𝑉−1 ⊂ 𝑉0 ⊂ 𝑉1 ⊂ · · · ⊂ 𝐿 2 such that
From this definition, there should exist a basis function 𝜙(𝑥) ∈ 𝑉0 , called the
scaling function or father wavelet. Let’s denote the translated copy of the father
wavelet as $\phi_{0,k}(x) = \phi(x - k)$ for integer $k$. Then we can express any function $f \in V_0$ as $f(x) = \sum_{k=-\infty}^{\infty} a_k \phi_{0,k}(x)$ for some coefficients $a_k$. In particular, we can write the scaled function $\phi(x/2) \in V_{-1} \subset V_0$ as $\phi(x/2) = \sum_{k=-\infty}^{\infty} c_k \phi(x - k)$ for some $c_k$, and hence the father wavelet satisfies a dilation equation:
$$\phi(x) = \sum_{k=-\infty}^{\infty} c_k \phi(2x - k). \tag{10.8}$$
In general, we’ll only consider scaling functions with compact support, and
therefore the sum is effectively only over finite 𝑘. If we integrate both sides of
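the dilation equation (10.8), we get
$$\int_{-\infty}^{\infty} \phi(x)\,\mathrm{d}x = \sum_{k=-\infty}^{\infty} c_k \int_{-\infty}^{\infty} \phi(2x - k)\,\mathrm{d}x = \frac{1}{2} \sum_{k=-\infty}^{\infty} c_k \int_{-\infty}^{\infty} \phi(s)\,\mathrm{d}s,$$
where we’ve made the change of variable $s = 2x - k$. If $\int_{-\infty}^{\infty} \phi(x)\,\mathrm{d}x \ne 0$, then we have the condition
$$\sum_{k=-\infty}^{\infty} c_k = 2. \tag{10.9}$$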
Example. Let’s find solutions to the dilation equation (10.8) that satisfy the
constraint (10.9). One approach is to use splines. When 𝑐 0 = 2 and all other 𝑐 𝑘
are zero, the solution is a delta function 𝜙(𝑥) = 𝛿(𝑥). When 𝑐 0 = 𝑐 1 = 1 and
𝑐 𝑘 = 0 otherwise, the solution is the rectangle function—𝜙(𝑥) = 1 if 𝑥 ∈ [0, 1]
and 𝜙(𝑥) = 0 otherwise. This function is the father wavelet of the Haar wavelets.
If we set $\{c_0, c_1, c_2\} = \{\tfrac12, 1, \tfrac12\}$, we get a triangular function as the father wavelet. With coefficients $\{\tfrac14, \tfrac34, \tfrac34, \tfrac14\}$, we get the quadratic B-spline. And with $\{\tfrac18, \tfrac12, \tfrac34, \tfrac12, \tfrac18\}$, we have the cubic B-spline. All of these can be demonstrated by construction. For example, the scaled cubic B-spline (depicted with a dotted line) is the sum of the five translated cubic B-splines:
[Figure: a scaled cubic B-spline drawn as the sum of five translated cubic B-splines weighted by the coefficients 𝑐𝑘.]
Other than the $B_0$ Haar wavelet, B-spline wavelets are not orthogonal, limiting their usefulness.
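The dilation equation (10.8) can be checked pointwise for the two simplest cases in the example. In this sketch, the names `hat`, `box`, and `dilate` are hypothetical. The rectangle function is taken on the half-open interval [0, 1) so that the identity also holds exactly at the breakpoints.

```julia
hat(x) = max(0.0, 1 - abs(x - 1))            # triangular function supported on [0, 2]
box(x) = 0 <= x < 1 ? 1.0 : 0.0              # rectangle function; half-open interval assumed

# Right-hand side of the dilation equation ϕ(x) = Σ c_k ϕ(2x - k), with k starting at 0.
dilate(ϕ, c, x) = sum(c[k+1] * ϕ(2x - k) for k in 0:length(c)-1)

xs = -1:0.01:3
err_hat = maximum(abs(hat(x) - dilate(hat, [1/2, 1, 1/2], x)) for x in xs)
err_box = maximum(abs(box(x) - dilate(box, [1, 1], x)) for x in xs)
```

Both coefficient sets also satisfy the constraint (10.9), since 1/2 + 1 + 1/2 = 2 and 1 + 1 = 2.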
for all integers 𝑚. What additional constraints do we need to put on the coefficients
𝑐 𝑘 ? To answer this question, we’ll need to take a short detour to examine a few
properties of the scaling function. The quickest route is through Fourier space.
If we take the Fourier transform of the scaling function
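$$\hat{\phi}(\xi) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(x)\,\mathrm{e}^{-\mathrm{i}x\xi}\,\mathrm{d}x,$$
$$\cdots = \int |\hat{\phi}(\xi)|^2\,\mathrm{e}^{-\mathrm{i}m\xi}\,\mathrm{d}\xi.$$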
The last line results from splitting the sum over even and odd values of 𝑘 and
applying the 2𝜋-periodicity of 𝐻ˆ (𝜉).
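Corollary 40. The following conditions on the dilation equation are necessary for an orthogonal scaling function:
$$\sum_{k=-\infty}^{\infty} (-1)^k c_k = 0, \quad \sum_{k=-\infty}^{\infty} c_k^2 = 2, \quad\text{and}\quad \sum_{k=-\infty}^{\infty} c_k c_{k-2m} = 0 \text{ for } m \ne 0. \tag{10.11}$$
$$\frac{1}{4} \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} c_k c_j\,\mathrm{e}^{-\mathrm{i}(k-j)\xi} + \frac{1}{4} \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} c_k c_j (-1)^{k-j}\,\mathrm{e}^{-\mathrm{i}(k-j)\xi} = 1.$$
Only the terms with even $k - j = 2m$ survive, so
$$\frac{1}{2} \sum_{m=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} c_k c_{k-2m}\,\mathrm{e}^{-2\mathrm{i}m\xi} = 1.$$
This expression is true for all $\xi$, so it follows that $\sum_{k=-\infty}^{\infty} c_k c_{k-2m} = 2\delta_{0m}$.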
which says that $(\phi(1), \phi(2))$ is an eigenvector of the matrix with elements $(c_1, c_0, c_3, c_2)$. The eigenvectors of this matrix are $(-1, 1)$ and $(1 + \sqrt{3}, 1 - \sqrt{3})$. Which one do we choose? Here, we’ll state a property without proof: the scaling functions form a partition of unity, i.e., the sum of any scaling function taken at integer values equals one: $\sum_{k=-\infty}^{\infty} \phi(k) = 1$. So $(-1, 1)$ cannot be the right choice. Instead, we take $\phi(1) = \tfrac12(1 + \sqrt{3})$ and $\phi(2) = \tfrac12(1 - \sqrt{3})$. The Daubechies $D_4$ scaling function is plotted in the figure below using
c = [1+√3, 3+√3, 3-√3, 1-√3]/4
z = [0, 1+√3, 1-√3, 0]/2
(x,ϕ) = scaling(c, z, 8)
[Figure: the Daubechies 𝐷4 scaling function 𝜙(𝑥) (left) and wavelet function 𝜓(𝑥) (right).]
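The 𝐷4 coefficients can be checked directly against the constraint (10.9), the conditions of Corollary 40, and the eigenvector property used to pick 𝜙(1) and 𝜙(2). The variable names in this sketch are hypothetical.

```julia
c = [1+√3, 3+√3, 3-√3, 1-√3] / 4             # c₀, c₁, c₂, c₃

sum_c     = sum(c)                            # condition (10.9): should equal 2
alt_sum   = c[1] - c[2] + c[3] - c[4]         # Σ (-1)^k c_k, should vanish
sum_sq    = sum(abs2, c)                      # Σ c_k², should equal 2
shift_dot = c[3]*c[1] + c[4]*c[2]             # Σ c_k c_{k-2}, should vanish

# Values at the integers: (ϕ(1), ϕ(2)) is an eigenvector of [c₁ c₀; c₃ c₂].
M = [c[2] c[1]; c[4] c[3]]
v = [1+√3, 1-√3] / 2                          # scaled so that ϕ(1) + ϕ(2) = 1
residual = M*v - v                            # zero if v has eigenvalue one
```

The other eigenvector (−1, 1) sums to zero, so it cannot satisfy the partition of unity.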
$$\mathrm{P}_n f = \sum_{k=-\infty}^{\infty} (f, \phi_{nk})\, \phi_{nk} = \sum_{k=-\infty}^{\infty} a_{nk} \phi_{nk} \tag{10.12}$$
[Figure panels: 𝑉0, 𝑉1, 𝑉2, 𝑉3, 𝑉4, 𝑉5]
Wavelet function
[Figure panels: 𝑉0, 𝑊0, 𝑊1, 𝑊2, 𝑊3, 𝑊4]
P𝑛+1 𝑓 = P𝑛 𝑓 + Q𝑛 𝑓 ,
Notice how the mother wavelet differs structurally from the father wavelet—
alternating the signs of the coefficients, reversing the order of the terms, and shifting
them left along the 𝑥-axis. Also, unlike the father wavelet, which integrates to
one, the mother wavelet integrates to zero. Using this definition, we can generate
the wavelet function (in the dotted line below) of the cubic B-spline scaling
function on page 266:
[Figure: the wavelet function as a sum of translated scaling functions weighted by 𝑐3, 𝑐1, −𝑐4, −𝑐0, and −𝑐2.]
Using the scaling function ϕ from the code on page 269, we can compute the
wavelet function ψ with
ψ = zero(ϕ); n = length(c)-1; ℓ = length(ψ)÷2n
for k∈0:n
  ψ[(k*ℓ+1):(k+n)*ℓ] += (-1)^k*c[n-k+1]*ϕ[1:2:end]
end
which we can use to compute the Daubechies 𝐷 4 wavelet function in the figure
on the preceding page.
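$$\psi_{nk}(x) = \frac{1}{\sqrt{2}} \sum_{j=-\infty}^{\infty} (-1)^{1-j} c_{1-j+2k}\, \phi_{n+1,j}(x).$$
From these,
$$\big(\phi_{ni}, \phi_{n+1,j}\big) = \frac{1}{\sqrt{2}} c_{j-2i} \quad\text{and}\quad \big(\psi_{ni}, \phi_{n+1,j}\big) = \frac{(-1)^{1-j}}{\sqrt{2}} c_{1-j+2i}.$$
$$\begin{aligned}
a_{n+1,k} &= \big(f, \phi_{n+1,k}\big) = \big(\mathrm{P}_{n+1} f, \phi_{n+1,k}\big) = \big(\mathrm{P}_n f + \mathrm{Q}_n f, \phi_{n+1,k}\big) \\
&= \big(\mathrm{P}_n f, \phi_{n+1,k}\big) + \big(\mathrm{Q}_n f, \phi_{n+1,k}\big) \\
&= \Big( \sum_{i=-\infty}^{\infty} a_{ni} \phi_{ni}, \phi_{n+1,k} \Big) + \Big( \sum_{i=-\infty}^{\infty} d_{ni} \psi_{ni}, \phi_{n+1,k} \Big) \\
&= \frac{1}{\sqrt{2}} \sum_{i=-\infty}^{\infty} c_{k-2i}\, a_{ni} + \frac{1}{\sqrt{2}} \sum_{i=-\infty}^{\infty} (-1)^{1-k+2i} c_{1-k+2i}\, d_{ni}.
\end{aligned} \tag{10.14}$$
Similarly, we can write
$$d_{n+1,k} = \frac{1}{\sqrt{2}} \sum_{i=-\infty}^{\infty} c_{k-2i}\, a_{ni} - \frac{1}{\sqrt{2}} \sum_{i=-\infty}^{\infty} (-1)^{1-k+2i} c_{1-k+2i}\, d_{ni}.$$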
These two expressions allow us to recursively reconstruct a function from its
wavelet components. In practice, the algorithm can be computed recursively
much like a fast Fourier transform. Reversing these steps allows us to deconstruct
a function into its wavelet components.
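We can represent (10.14) in matrix notation as
$$\mathbf{a}_{n+1} = \begin{bmatrix} \mathbf{H}^{\mathsf{T}} & \mathbf{G}^{\mathsf{T}} \end{bmatrix} \begin{bmatrix} \mathbf{a}_n \\ \mathbf{d}_n \end{bmatrix} \quad\text{or alternatively}\quad \begin{bmatrix} \mathbf{H} \\ \mathbf{G} \end{bmatrix} \mathbf{a}_{n+1} = \begin{bmatrix} \mathbf{a}_n \\ \mathbf{d}_n \end{bmatrix}.$$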
entire real line. We can restrict ourselves to a finite domain while maintaining
orthogonality by imposing periodic boundaries. Because the scaling function has
compact support, the coefficients 𝑐 𝑘 are nonzero for finite values of 𝑘. Consider
the case when 𝑐 𝑘 ≠ 0 for 𝑘 = 0, 1, 2, 3. The matrix structure is the same for other
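$$\mathbf{H} = \frac{1}{\sqrt{2}} \begin{bmatrix}
c_0 & c_1 & c_2 & c_3 & & & & \\
 & & c_0 & c_1 & c_2 & c_3 & & \\
 & & & & c_0 & c_1 & c_2 & c_3 \\
c_2 & c_3 & & & & & c_0 & c_1
\end{bmatrix}$$
$$\mathbf{G} = \frac{1}{\sqrt{2}} \begin{bmatrix}
c_3 & -c_2 & c_1 & -c_0 & & & & \\
 & & c_3 & -c_2 & c_1 & -c_0 & & \\
 & & & & c_3 & -c_2 & c_1 & -c_0 \\
c_1 & -c_0 & & & & & c_3 & -c_2
\end{bmatrix}$$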
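It’s common to write the discrete wavelet transform as a square matrix
$$\mathbf{T} = \mathbf{P} \begin{bmatrix} \mathbf{H} \\ \mathbf{G} \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix}
c_0 & c_1 & c_2 & c_3 & & & & \\
c_3 & -c_2 & c_1 & -c_0 & & & & \\
 & & c_0 & c_1 & c_2 & c_3 & & \\
 & & c_3 & -c_2 & c_1 & -c_0 & & \\
 & & & & c_0 & c_1 & c_2 & c_3 \\
 & & & & c_3 & -c_2 & c_1 & -c_0 \\
c_2 & c_3 & & & & & c_0 & c_1 \\
c_1 & -c_0 & & & & & c_3 & -c_2
\end{bmatrix},$$
where P is the “perfect shuffle” permutation matrix that interweaves the rows of H and G. Notice that for T to be orthogonal, we must have $\tfrac12(c_0^2 + c_1^2 + c_2^2 + c_3^2) = 1$ and $c_2 c_0 + c_3 c_1 = 0$. Furthermore, $c_0 + c_1 + c_2 + c_3 = 2$. And, for a vanishing zeroth moment, $c_0 - c_1 + c_2 - c_3 = 0$, which together with the previous sum gives $c_0 + c_2 = 1$ and $c_1 + c_3 = 1$.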
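To make the orthogonality claim concrete, the sketch below assembles periodic H and G matrices for the 𝐷4 coefficients on eight points and checks that the stacked matrix is orthogonal. The helper `periodic_filter_matrix` is hypothetical. The rows are stacked rather than shuffled, which changes only the row order and not the orthogonality.

```julia
using LinearAlgebra

c = [1+√3, 3+√3, 3-√3, 1-√3] / 4             # Daubechies D4 coefficients c₀,…,c₃
g = [c[4], -c[3], c[2], -c[1]]               # wavelet filter (c₃, -c₂, c₁, -c₀)

# Each row holds the filter shifted right by two columns, wrapping periodically.
function periodic_filter_matrix(h, N)
    M = zeros(N ÷ 2, N)
    for i in 1:N÷2, k in eachindex(h)
        j = mod(2(i - 1) + (k - 1), N) + 1
        M[i, j] += h[k] / √2
    end
    M
end

N = 8
H = periodic_filter_matrix(c, N)
G = periodic_filter_matrix(g, N)
T = [H; G]                                   # stacked transform (no perfect shuffle)

a = collect(1.0:N)
coeffs = T * a                               # one level of the transform: [aₙ; dₙ]
recovered = T' * coeffs                      # the inverse is the transpose
```

Because T is orthogonal, applying the transpose recovers the original vector without solving a linear system.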
The Wavelets.jl package includes several utilities for discrete wavelet transforms.
See Figure 10.4 above. Beyond being a little trippy, what’s going on in it? Recall
the structure of the one-dimensional multiresolution decomposition discussed on
page 265
This decomposition starts with the sum of the values of the original sequence (the
contribution of the father wavelet). The next term tells us the gross fluctuation
(the contribution of the mother wavelet). After this, each resolution provides
Figure 10.5: The relative error of a DWT (Haar) compressed image as a function of the file size. The dotted line shows equivalent lossless PNG compression.
finer detail with each subsequent generation of daughter wavelets. Figure 10.4
shows a similar decomposition in two dimensions. The tiny pixel in the top-left
corner has a value of 325, the sum of the pixel values across the original image.
It’s surrounded by three one-pixel-sized components that provide the horizontal,
vertical, and diagonal details to the next higher resolution. Around these are
2 × 2-pixel details. Then 4 × 4, and so on until we get to 256 × 256-pixel blocks.
These three largest blocks show the fluctuations that occur at the pixel level of
the original image.
We can compress an image by filtering or zeroing out the coefficients of the
DWT of that image whose magnitudes fall below a given threshold. Let’s set up
such a function.
function dwtfilter(channels,wt,k)
  map(1:3) do i
    A = dwt(channels[i,:,:], wavelet(wt))
    threshold!(A, HardTH(), k)
    clamp01!(idwt(A, wavelet(wt)))
  end
end
We’ll examine the effect of varying the filter threshold on lossy compression
error. The following code creates an interactive widget to compare different
wavelets and thresholds:
using Interact, Wavelets, Images
img = float.(load(download(bucket*"laura_square.png")))
channels = channelview(img)
and Russell’s Principia Mathematica. When he was 15, he ran away from home and showed up
at the University of Chicago, where Bertrand Russell was teaching. He remained at the university
for several years, never registering as a student, often homeless and working menial jobs. It’s here
he met Warren McCulloch and they developed the foundational theory of neural nets. Walter Pitts,
enticed to continue his research on modeling the brain, left Chicago to complete his PhD at MIT.
Over time he grew increasingly lonely and distraught, eventually setting fire to his dissertation and
all of his notes. He tragically died of alcoholism at age 46. To dig deeper, read Amanda Gefter’s
article in Nautilus Quarterly.
9 Biomimicry looks to nature for insights and inspiration in solving engineering problems. Early
unsuccessful pioneers of artificial flight made wings from wood and feathers attached to their arms.
Leonardo da Vinci’s study of birds led him to devise his ornithopter. Otto Lilienthal did the same in
developing his glider. By the time Orville and Wilbur Wright engineered the first successful powered
aircraft, it looked very little like a bird other than having wings.
an image compressed
using Haar wavelets
Neural networks 277
input values by stretching and sliding the activation function along the 𝑥-axis.
The second affine transformation affects the output values by stretching and
sliding the activation function in the 𝑦-direction. An activation function could be
any number of functions—a radial basis function such as a B-spline or Gaussian,
a monotonically increasing sigmoid function, a threshold function, a rectifier
function, and so on.
McCulloch and Pitts used a step function for an activation function to model
neurons as logical operators. Sigmoid activation functions, which naturally arise from logistic regression, were the most popular until fairly recently. Now, rectified linear units or ReLUs are in vogue, largely due to their simplicity and effectiveness in deep learning.
By taking a linear combination of scaled, translated activation functions
$$f_n(x) = \sum_{i=1}^{n} c_i\, \phi(w_i x + b_i)$$
$$|f_n(x) - f(x)| = \left| \tfrac12 f''(\xi_i)\,(x - x_i)(x_{i+1} - x) \right| \le \tfrac18 |f''(\xi)|\, h^2$$
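To see the bound at work, the sketch below builds the piecewise-linear interpolant of sin 𝑥 on [0, π] as a constant plus a linear combination of shifted ReLU functions and compares its error with ℎ²/8. The test function, the number of knots, and all names are assumptions. The weights are taken as 𝑤𝑖 = 1 with biases 𝑏𝑖 = −𝑥𝑖.

```julia
relu(x) = max(x, 0)

f(x) = sin(x)                                # test function (assumed); |f''| ≤ 1
n = 20
knots = LinRange(0, π, n + 1)
h = step(knots)

slopes = diff(f.(knots)) ./ h                # slope of the interpolant on each subinterval
coef = diff([0; slopes])                     # c_i: change in slope at each knot

# f_n(x) = f(x₀) + Σ c_i ϕ(w_i x + b_i) with w_i = 1, b_i = -x_i, and ϕ = ReLU
fn(x) = f(knots[1]) + sum(coef[i] * relu(x - knots[i]) for i in 1:n)

xs = LinRange(0, π, 1001)
err = maximum(abs(fn(x) - f(x)) for x in xs)
bound = h^2 / 8                              # (1/8) max|f''| h² with max|f''| = 1
```

Doubling the number of neurons halves ℎ, so the bound drops by a factor of four.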
[Figure 10.6: five neural network graphs, numbered 1 to 5, with input nodes 𝑥 and output nodes 𝑦.]
The function 𝑓 (𝑥) need not be smooth—the proof of a similar result for
Lipschitz functions is left as an exercise. George Cybenko’s influential 1989
paper “Approximation by superpositions of a sigmoidal function” generalizes
the universal approximation theorem for an arbitrary sigmoid function. From the
universal approximation theorem, it may appear that neural networks are little
different than directly using B-splines, radial basis functions, or any other method
we have already discussed. Namely, we can better approximate a function by
adding additional neurons into a single layer. The real magic of neural nets
emerges by using multiple layers. We can feed the output of one neuron into
another neuron. And, we can feed that neuron into yet another one, and that one
into another, and so on.
Consider the graphs in Figure 10.6. Input and output nodes are denoted with
. The neurons are denoted with . The weighted edges, often called synapses, connect the nodes together. The simplest artificial neuron is $y = w_2 \phi(w_1 x + b)$. A neuron can have multiple inputs and multiple outputs $\mathbf{y} = \mathbf{w}_2 \phi(\mathbf{w}_1^{\mathsf{T}} \mathbf{x} + b)$. For simplicity of notation and implementation, the bias can be incorporated by prepending $b$ into a new zeroth element of $\mathbf{w}_1$ and a corresponding 1 into a zeroth element of $\mathbf{x}$, giving us $\mathbf{y} = \mathbf{w}_2 \phi(\mathbf{w}_1^{\mathsf{T}} \mathbf{x})$. We can join neurons in a layer to form a single-layer neural net $\mathbf{y} = \mathbf{W}_2 \phi(\mathbf{W}_1 \mathbf{x})$ where $\mathbf{W}_1$ and $\mathbf{W}_2$ are matrices of synaptic weights. The input and output nodes form input and output layers, and the neurons form a single hidden layer. By feeding one neuron into another, we create a multi-layer neural net called a deep neural net.
While McCulloch and Pitts forged their neural nets out of the mathematical
certainty of formal logic, modern neural nets reign as tools of statistical uncertainty
and inference. Deep neural networks are able to classify immense and complicated
data sets, thanks in large part to specialized computing hardware and stochastic
Let’s partition the points using the neural net 𝑦 = 𝜙(W3 𝜙(W2 𝜙(W1 x))), with
two inputs, two hidden layers (each with four neurons), and one output for the
label. We’ll use 𝜙(𝑥) = tanh 𝑥 as the activation function.
A single neuron partitions the plane with a line, like a crisp fold in a sheet of
paper, called a decision boundary.10 Each of the four neurons in the first hidden
layer produces different partitions in the plane—four separate folds in the paper.
When combined linearly together, these decision boundaries partition a space into
a simple, possibly unbounded polytope. In a deep neural net, these initial partitions
are the building blocks for more complicated geometries with each subsequent
layer. The second hidden layer forms curved boundaries by combining the output
of the first neural layer using smooth sigmoid and tanh activation functions. The
ReLU activation function produces polytopes with flat sides.
The neural network uses these new shapes to construct the final partition, an
almost circular blob that approximates the original data. Had we included more
data, used more neurons or more layers, or trained the neural network longer, the
blob would have been more circular.
10 Not to be misled by metaphor, the real decision boundary is the result of a smooth sigmoid
function, so there is a layer of fuzziness surrounding the line.
There’s no straight line that can partition this data. But, by embedding the data
in higher dimensions, we can homeomorphically stretch and fold the space until
it can be partitioned with a hyperplane. Imagine that you print the original light
and dark dot pattern onto a spandex handkerchief. How might you remove the
dark dots with one straight cut using a pair of scissors? Push the middle of the
handkerchief up to make a poof, and then cut across that poof. This is what our
neural net does. We can visualize the actions of the neural net by examining
intermediate outputs ( 𝑦ˆ 1 , 𝑦ˆ 2 ) obtained from splitting W3 into a product of 1 × 2
and 2 × 4 matrices. The rightmost figure above shows how the neural net has
11 For example, handwritten digits might be digitized as 28-by-28-pixel images, which are
embedded in a 784-dimensional space (one dimension for each pixel). The intrinsic dimension—the
minimum degrees of freedom required to describe an object in space—is much smaller than the
extrinsic dimensionality. A “3” looks like another “3,” and a “7” looks like another “7.” Neural
networks manipulate low-intrinsic-dimensional data in higher-dimensional space.
Data fitting 281
stretched, folded, and squashed the data points into the plane, like a used tissue
thoughtlessly tossed onto the roadway. The decision boundary can be determined
using the 1 × 2 matrix along with the bias term. Just as the circular blob didn’t
perfectly partition the original space, the decision boundary doesn’t provide a
perfect cut between light and dark dots.
We’ll return to neural networks when we discuss fitting data in the next section. To dig deeper, see Goodfellow, Bengio, and Courville’s text Deep Learning, Friedman, Hastie, and Tibshirani’s text Elements of Statistical Learning, the open-source book project Dive into Deep Learning (https://d2l.ai) developed by several Amazon computer scientists, or Christopher Olah’s blog post “Neural Networks, Manifolds, and Topology.” Also, check out Daniel Smilkov and Shan Carter’s interactive visualization of neural networks at http://playground.tensorflow.org.
We often want to find the “best” function that fits a data set of, say, 𝑚 input-output
values (𝑥1 , 𝑦 1 ), (𝑥 2 , 𝑦 2 ), . . . , (𝑥 𝑚 , 𝑦 𝑚 ). Suppose that we choose a model function
𝑦 = 𝑓 (𝑥; 𝑐 1 , 𝑐 2 , . . . , 𝑐 𝑛 ) that has 𝑛 parameters 𝑐 1 , 𝑐 2 , . . . , 𝑐 𝑛 and that the number
of observations 𝑚 is at least as large as the number of parameters 𝑛. With a set
of inputs and corresponding outputs, we can try to determine the parameters 𝑐 𝑖
that fit the model. Still, the input and output data likely come from observations
with noise or hidden variables that overcomplicate our simplified model. So, the
problem is inconsistent unless we perhaps (gasp!) overfit the data by changing
the model function. Fortunately, we can handle the problem nicely using the
least squares best approximation.
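$$(f, \varphi_i) = \sum_{j=1}^{n} c_j\, (\varphi_j, \varphi_i) = (y, \varphi_i) \quad\text{for } i = 1, 2, \dots, n,$$
where the inner product $(f, \varphi_i) = \sum_{k=1}^{m} f(x_k)\, \varphi_i(x_k)$. The matrix form of the normal equation is
$$\begin{bmatrix}
(\varphi_1, \varphi_1) & (\varphi_2, \varphi_1) & \dots & (\varphi_n, \varphi_1) \\
(\varphi_1, \varphi_2) & (\varphi_2, \varphi_2) & \dots & (\varphi_n, \varphi_2) \\
\vdots & \vdots & \ddots & \vdots \\
(\varphi_1, \varphi_n) & (\varphi_2, \varphi_n) & \dots & (\varphi_n, \varphi_n)
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} (y, \varphi_1) \\ (y, \varphi_2) \\ \vdots \\ (y, \varphi_n) \end{bmatrix},$$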
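which in vector notation is simply $\mathbf{A}^{\mathsf{T}}\mathbf{A}\mathbf{c} = \mathbf{A}^{\mathsf{T}}\mathbf{y}$, where $\mathbf{A}$ is an $m \times n$ matrix with elements given by $A_{ij} = \varphi_j(x_i)$. In the solution $\mathbf{c} = (\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1}\mathbf{A}^{\mathsf{T}}\mathbf{y}$, the term $(\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1}\mathbf{A}^{\mathsf{T}}$ is called the pseudoinverse of $\mathbf{A}$ and denoted $\mathbf{A}^{+}$. Formally, we have replaced the solution $\mathbf{c} = \mathbf{A}^{-1}\mathbf{y}$ with $\mathbf{c} = \mathbf{A}^{+}\mathbf{y}$. In practice, we don’t need to compute the pseudoinverse explicitly. Instead, we can solve $\mathbf{A}\mathbf{c} = \mathbf{y}$ in the least squares sense using the QR method. For more discussion of linear least squares and the pseudoinverse, refer back to Chapter 3.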
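The three routes to the least squares solution can be compared on a small problem. This sketch fits a quadratic to noise-free samples. The data and variable names are assumptions, and the backslash operator is used because Julia solves rectangular systems with a QR factorization.

```julia
using LinearAlgebra

x = collect(0:0.1:1)
y = 1 .+ 2x .- 3x.^2                         # data from a quadratic, without noise
A = [x.^0 x x.^2]                            # A[i,j] = φ_j(x_i) with φ_j(x) = x^(j-1)

c_qr     = A \ y                             # least squares solution via QR
c_normal = (A' * A) \ (A' * y)               # normal equations AᵀA c = Aᵀy
c_pinv   = pinv(A) * y                       # explicit pseudoinverse, for comparison only
```

All three recover the coefficients (1, 2, −3). The QR route avoids squaring the condition number, which matters for larger or nearly dependent bases.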
𝑟 1 = 𝑦 1 − 𝑓1 (𝑐 1 , 𝑐 2 , . . . , 𝑐 𝑛 )
𝑟 2 = 𝑦 2 − 𝑓2 (𝑐 1 , 𝑐 2 , . . . , 𝑐 𝑛 )
𝑟 𝑚 = 𝑦 𝑚 − 𝑓𝑚 (𝑐 1 , 𝑐 2 , . . . , 𝑐 𝑛 ).
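where $\Delta\mathbf{c}^{(k+1)} = \mathbf{c}^{(k+1)} - \mathbf{c}^{(k)}$ and $\mathbf{J}_r(\mathbf{c})$ is the $m \times n$ Jacobian matrix whose $(i, j)$-component is $\partial r_i/\partial c_j$. The Jacobian matrix $\mathbf{J}_r(\mathbf{c}) = -\mathbf{J}_f(\mathbf{c})$, so we equivalently have
$$\mathbf{J}_f(\mathbf{c}^{(k)})\, \Delta\mathbf{c}^{(k+1)} = \mathbf{r}^{(k)} \tag{10.15}$$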
where the $(i, j)$-component of the Jacobian $\mathbf{J}_f(\mathbf{c})$ is $\partial f_i/\partial c_j$. We can solve this overdetermined system using QR factorization to update $\mathbf{c}^{(k+1)} = \mathbf{c}^{(k)} + \Delta\mathbf{c}^{(k+1)}$ at each iteration. The method is called the Gauss–Newton method.
By multiplying both sides of (10.15) by $\mathbf{J}_f^{\mathsf{T}}(\mathbf{c}^{(k)})$ and then formally solving for $\mathbf{c}^{(k+1)}$, we get an explicit formulation for the Gauss–Newton method:
$$\mathbf{c}^{(k+1)} = \mathbf{c}^{(k)} + \mathbf{J}_f^{+}(\mathbf{c}^{(k)})\, \mathbf{r}^{(k)},$$
where the pseudoinverse $\mathbf{J}_f^{+}$ is $(\mathbf{J}_f^{\mathsf{T}} \mathbf{J}_f)^{-1} \mathbf{J}_f^{\mathsf{T}}$. The Gauss–Newton method may be unstable. We can regularize it by modifying the $\mathbf{J}_f^{\mathsf{T}} \mathbf{J}_f$ term of the pseudoinverse to be diagonally dominant. This regularization has the effect of weakly decoupling the system of equations. The Levenberg–Marquardt method is given by
Example. Suppose that we’ve been handed some noisy data collected from an experiment, which, from theory, we expect to be modeled by two Gaussian functions
$$f(x; \mathbf{c}) = c_1 \mathrm{e}^{-c_2 (x - c_3)^2} + c_4 \mathrm{e}^{-c_5 (x - c_6)^2} \tag{10.18}$$
for some unknown parameters $\mathbf{c} = \{c_1, c_2, \dots, c_6\}$. Let’s use Newton’s method to recover the parameters. Newton’s method is $\mathbf{c}^{(k+1)} = \mathbf{c}^{(k)} - (\mathbf{J}^{(k)})^{+} (\mathbf{f}(\mathbf{x}; \mathbf{c}^{(k)}) - \mathbf{y})$, where $(\mathbf{J}^{(k)})^{+}$ is the pseudoinverse of the Jacobian. The numerical approximation to the Jacobian using the complex-step derivative is
function jacobian(f,x,c)
  J = zeros(length(x),length(c))
  for k in (n = 1:length(c))
    J[:,k] .= imag(f(x,c+1e-8im*(k .== n)))/1e-8
  end
  return J
end
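For readers who want to experiment, here is a minimal sketch of a damped Gauss–Newton (Levenberg–Marquardt style) loop. It is not the `gauss_newton` routine called in this example. The damping term λ𝐈, the rule for accepting or rejecting a step, the function names, and the single-exponential test model with its analytic Jacobian are all assumptions chosen to keep the sketch short and reproducible.

```julia
using LinearAlgebra

# Damped Gauss–Newton iteration: solve (JᵀJ + λI) Δc = Jᵀr and keep the step
# only if it reduces the residual norm. The damping strategy is an assumed choice.
function levenberg_marquardt(f, jac, x, y, c; λ=1e-2, iterations=100)
    c = float.(c)
    r = y - f(x, c)
    for _ in 1:iterations
        J = jac(x, c)
        Δc = (J' * J + λ * I) \ (J' * r)
        r_new = y - f(x, c + Δc)
        if norm(r_new) < norm(r)
            c, r, λ = c + Δc, r_new, λ / 10   # accept the step and relax the damping
        else
            λ *= 10                           # reject the step and increase the damping
        end
    end
    c
end

# Assumed test model: a single decaying exponential with an analytic Jacobian.
f(x, c) = c[1] .* exp.(-c[2] .* x)
function jac(x, c)
    e = exp.(-c[2] .* x)
    hcat(e, -c[1] .* x .* e)
end

x = collect(0:0.1:2)
y = f(x, [2.0, 1.5])                          # noise-free synthetic data
c = levenberg_marquardt(f, jac, x, y, [1.5, 1.0])
```

With λ driven toward zero, the update approaches the plain Gauss–Newton step. With large λ, it behaves like a short gradient-descent step.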
Figure 10.7: The least squares solution for (10.18). Solutions are depicted at
intermediate iterations. The input data is represented by •.
Let’s take {2, 0.3, 2, 1, 0.3, 7} as the initial guess. The problem is solved using
c0 = [2, 0.3, 2, 1, 0.3, 7]
c = gauss_newton(f,x,y,c0 )
The plot above shows the intermediate solutions at each iteration, ending with
the parameters c = {0.98, 3.25.98, 3.00, 1.96, 2.88, 5.99}.
We can also control the stability of Newton’s method by shortening the step
size by a factor 𝛼, called the learning rate. For example, we could substitute
a Newton step c += α*G\r with α=0.5 in the loop in place of the Levenberg–
Marquardt step. Alternatively, we could combine a faster learning rate with the
Levenberg–Marquardt method. Because the Levenberg–Marquardt method is
more stable than the Gauss–Newton method, we can take a learning rate as large
as 2.
In practice, we might use the LsqFit.jl package, which solves nonlinear least
squares problems:
using LsqFit
cf = curve_fit(f,x,y,c0 )
The parameters are given by cf.param. The residuals are given by cf.resid.
Logistic regression
for a probability 𝑝(x𝑖 , w). The log-likelihood function is given by the logarithm
of the likelihood:
$$\ell(\mathbf{w}) = \sum_{i} \big[\, y_i \log p + (1 - y_i) \log(1 - p) \,\big]. \tag{10.19}$$
The function 𝑓 is called a link function. We want a link function that is smooth,
that is monotonically increasing, and that maps the unit interval (0, 1) to the real
line (−∞, ∞). The link function most commonly associated with the Bernoulli
distribution is the logit function
$$\operatorname{logit} \mu = \log \frac{\mu}{1 - \mu}.$$
While there are other link functions,12 the logit function is easy to manipulate,
it has the nice interpretation of being the log odds ratio, and it is the inverse of
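the logistic function $\sigma(\eta) = 1/(1 + \mathrm{e}^{-\eta})$. Using the inverse logit (the logistic function) in the log-likelihood function (10.19) yields
$$\ell(\mathbf{w}) = \sum_{i} \big[\, y_i \log \sigma\big(\mathbf{w}^{\mathsf{T}} \mathbf{x}_i\big) + (1 - y_i) \log\big(1 - \sigma\big(\mathbf{w}^{\mathsf{T}} \mathbf{x}_i\big)\big) \,\big].$$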
Let’s see what the method looks like in Julia. We’ll first define the logistic
function and generate some synthetic data.
12 Statistician and physician Joseph Berkson introduced the logit function in 1944 as a close
approximation to the inverse cumulative normal distribution used ten years earlier by biologist Chester
Bliss. Bliss called his curve the probit, a portmanteau of probability and unit, and Berkson followed
analogously with logit, a portmanteau of logistical and unit.
Figure 10.8: Classification data (left). The function fitting the data (right).
σ = x -> @. 1/(1+exp(-x))
x = rand(30); y = ( rand(30) .< σ(10x.-5) );
A more than minimal working example would set a condition to break out of the
loop once the magnitude of the change in w is less than some threshold. The
solution is plotted in Figure 10.8 above. In practice, we might use Julia’s GLM
library to solve the logistic regression problem.
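As a self-contained sketch of the idea, the following block maximizes the log-likelihood with Newton's method on a tiny fixed data set. The data, the function name `logistic_newton`, and the fixed iteration count are assumptions. The data set is deliberately not separable so that the maximum is finite.

```julia
using LinearAlgebra

σ(η) = 1 / (1 + exp(-η))

# Small, non-separable data set (assumed for illustration).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
X = [one.(x) x]                               # rows are xᵢᵀ, with a leading 1 for the bias

function logistic_newton(X, y; iterations=25)
    w = zeros(size(X, 2))
    for _ in 1:iterations
        p = σ.(X * w)
        S = Diagonal(p .* (1 .- p))
        w += (X' * S * X) \ (X' * (y - p))    # Newton step on the log-likelihood
    end
    w
end

w = logistic_newton(X, y)
score = X' * (y - σ.(X * w))                  # gradient of the log-likelihood at w
```

At the maximum, the gradient `score` vanishes, which gives a convenient stopping test in place of the fixed iteration count.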
Neural networks
Let’s pick up from the previous section and examine how to solve a data fitting
problem using neural networks. Consider a two-layer neural net y = W2 𝜙(W1 x∗ )
for some data x∗ and y∗ and some unknown matrices of parameters W1 and
W2 . The vector x has 𝑚 elements—𝑚 − 1 of these elements contain the input
variables (called features) and one element, set to 1, is for the bias. The vector
y has 𝑛 dimensions for the labels. The parameters W1 and W2 are 𝑘 × 𝑚 and
𝑛 × 𝑘 matrices, where 𝑘 is a hyperparameter of the algorithm, independent of the
problem. All-in-all, we need to determine 𝑘(𝑚 + 𝑛) parameters, although many of
these parameters may be zero.
$$L(\mathbf{y}) = N^{-1} \|\mathbf{y} - \mathbf{y}^{*}\|_2^2,$$
which is minimized when its gradient $\partial L/\partial \mathbf{y} = 2N^{-1}(\mathbf{y} - \mathbf{y}^{*})$ is zero. We’ll
use the notation · to emphasize that 𝜕𝐿/𝜕y is a matrix or a vector.
We can solve the minimization problem using the gradient descent method.
The gradient descent method iteratively marches along the local gradient some
distance and then reevaluates the direction. Start with some random initial
guess for the parameters W1 and W2 . Use these values to compute y(x∗ ) =
W2 𝜙(W1 x∗ ). Then use y to determine the loss and the gradient of the loss.
Update the values of the parameters
W1 ← W1 − 𝛼1 𝜕𝐿/𝜕W1
W2 ← W2 − 𝛼2 𝜕𝐿/𝜕W2 .
The constants 𝛼1 and 𝛼2 , called learning rates, are chosen to ensure the stability
of the iterative method. Iterate until the method converges to a local minimum.
Hopefully, that local minimum is also a global minimum.
We’ll use two identities for matrix derivatives in computing the gradients. If
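$f(\mathbf{C}) = f(\mathbf{AB})$, then
\[ \frac{\partial f}{\partial \mathbf{A}} = \frac{\partial f}{\partial \mathbf{C}}\,\mathbf{B}^\mathsf{T} \quad\text{and}\quad \frac{\partial f}{\partial \mathbf{B}} = \mathbf{A}^\mathsf{T}\,\frac{\partial f}{\partial \mathbf{C}}, \]
\[ \frac{\partial L}{\partial \mathbf{W}_2} = \frac{\partial L}{\partial \mathbf{y}}\,\phi(\mathbf{W}_1\mathbf{x}^*)^\mathsf{T} \]
\[ \frac{\partial L}{\partial \mathbf{W}_1} = \phi'(\mathbf{W}_1\mathbf{x}^*)\,\mathbf{W}_2^\mathsf{T}\,\frac{\partial L}{\partial \mathbf{y}}\,\mathbf{x}^{*\mathsf{T}}. \]
The Jacobian matrix $\phi'$ is diagonal. In practice, we’ll use a training set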
consisting of 𝑁 data points x∗ and y∗ . We can treat x∗ as an 𝑚 × 𝑁 matrix and
y(x∗ ) and y∗ both as 𝑛 × 𝑁 matrices.
Example. Let’s use neural networks to model the data given by the noisy
semicircle $y = \sqrt{1 - x^2} + \sigma(x)$, where $\sigma(x)$ is some random perturbation.13
We’ll pick 100 points.
N = 100; θ = LinRange(0,π,N)'
x = cos.(θ); x̃ = [one.(x);x]
y = sin.(θ) + 0.05*randn(1,N)
We’ll select a neural network with one hidden layer with 𝑛 neurons, an input
layer with two nodes (one for the bias), and one output node. We’ll also use a
ReLU activation function.
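The training loop that follows needs the weights and the activation function to be defined first. A minimal sketch of that setup is below; the hidden-layer size `n = 20` and the random initialization are assumptions made for illustration rather than values taken from the text.

```julia
n = 20                  # number of hidden neurons (an assumed value)
W1 = rand(n, 2)         # hidden-layer weights; first column multiplies the bias row of x̃
W2 = randn(1, n)        # output-layer weights
ϕ  = x -> max.(x, 0)    # ReLU activation
dϕ = x -> (x .> 0)      # derivative of ReLU (taken to be 0 at x = 0)
```

Different random initializations will lead gradient descent to different local minima, so repeated runs need not give identical fits.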
We solve the problem using gradient descent with a sufficiently large number of
iterations, called epochs. To ensure stability, we take a learning rate 𝛼 = 0.1.
α = 0.1
for epoch = 1:1000
  ŷ = W2 * ϕ(W1*x̃)
  ∂L∂y = 2(ŷ-y)/N
  ∂L∂W1 = dϕ(W1*x̃) .* (W2' * ∂L∂y) * x̃'
  ∂L∂W2 = ∂L∂y * ϕ(W1*x̃)'
  W1 -= 0.1α * ∂L∂W1
  W2 -= α * ∂L∂W2
end
Now that we have our trained parameters W1 and W2 , we construct the solution
using our neural network model.
x̂ = LinRange(-1.2,1.2,200)'; x̂ = [one.(x̂);x̂]; ŷ = W2 * ϕ(W1 *x̂)
The figure on the next page and the QR code at the bottom of this page show the
solution. We can also solve the problem using a sigmoid activation function
ϕ = x -> @. 1 / (1 + exp(-x))
dϕ = x -> @. ϕ(x)*(1-ϕ(x))
inefficient. Any number of universal approximators like Chebyshev polynomials or B-splines would
provide a better solution in a fraction of the compute time. However, these methods quickly get
bogged down as the size of the problem increases. We’ll return to the curse of dimensionality when
we discuss Monte Carlo methods in the next chapter.
Figure 10.9: The neural network approximation for the data using ReLU
activation functions (left) and sigmoid activation functions (right).
In practice, one uses a machine learning package. Let’s use Flux.jl to solve
the same problem. The recipe is as before: build a model with initially unknown
parameters, prescribe a loss function, add some data, choose an optimizer such
as gradient descent, and then iterate.
using Flux
model = Chain(Dense(1, n, relu), Dense(n, 1))
loss(x,y) = Flux.Losses.mse(model(x), y)
parameters = Flux.params(model)
data = [(x,y)]
optimizer = Descent(0.1)
for epochs = 1:10000
  Flux.train!(loss, parameters, data, optimizer)
end
The variable parameters is an array of four things: the weights and bias terms
of W1 and the weights and bias terms of W2. The object model is the trained
model, which we can now use as a function model(x) for an arbitrary input x.
parameters. One of the largest deep neural nets, GPT-3 (Generative Pre-trained
Transformer 3) used for natural language generation, has 96 layers with over 175
billion parameters. On top of that, gradient descent is slow. It can stall on any
number of local minimums and saddle points. And the problem is increasingly
ill-conditioned as the number of parameters rises. Fortunately, there are several
approaches to these challenges.
We can drastically reduce the number of parameters using convolutional
layers and pooling layers rather than fully connected layers. Convolutional neural
nets (CNNs), frequently used in image recognition, use convolution matrices
with relatively small kernels. Such kernels—typical in image processing as edge
detection, sharpen, blur, and emboss filters—consist of tunable parameters for
novel pattern filtering. Pooling layers reduce the dimensionality by taking a local
mean or maximum of the input parameters.
Significant research has led to the development of specialized model archi-
tectures. Residual networks (ResNets), used in image processing, allow certain
connections to skip over intermediate layers. Recurrent neural networks (RNNs),
used in text processing or speech recognition, use sequential data to update
intermediate layers. Recent research has explored implicitly defined layers and
neural differential equations.
Another way to speed up model training is by starting with a better initial
guess of the parameters than simply choosing a random initial state. Transfer
learning uses similar, previously trained models to start the parameters closer
to a solution. Alternatively, we can embrace randomness. Machine learning
often requires massive amounts of data. Stochastic gradient descent uses several
batches of smaller, randomly selected subsets of the training data.
We can modify gradient descent to overcome its shortcomings, such as adding
momentum to overcome shallow local minima and saddle points or using
adaptive learning rates such as AdaGrad and Adam to speed up convergence.
Finally, to speed up the computation of gradients and avoid numerical instability,
we can use automatic differentiation, which we discuss in the next chapter.
While the focus of this chapter has been on finding the “best” estimate in the
ℓ 2 -norm, for completeness, let’s conclude with a comment on the “best” estimate
of sample data in a general ℓ 𝑝 -norm. Suppose that we have a set of data
x = (𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ) consisting of real values. And, suppose that we’d like to find
a value 𝑥¯ that is the “best” estimate of that data set. In this case, we find 𝑥¯ that is
closest to each element of x in the ℓ 𝑝 -norm
\[ \bar{x} = \arg\min_{s} \|\mathbf{x} - s\|_p = \arg\min_{s} \left( \sum_{i=1}^{n} |x_i - s|^p \right)^{1/p} \]
Notice the locations of the mode, median, mean, and midrange relative to the
data. How well does each do in summarizing the distribution of the data? Where
might they be located on other data? What are the potential biases that arise
from using a specific average?
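As a rough numerical check of the $\ell^p$ characterization, the sketch below minimizes $\|\mathbf{x} - s\|_p$ by brute force over a grid of candidate values $s$ and compares the minimizers for $p = 1$, $2$, and $\infty$ with the median, mean, and midrange. The sample data and the helper name `best` are made up for illustration.

```julia
using LinearAlgebra, Statistics

x = [1.0, 2.0, 2.0, 3.0, 7.0]                    # made-up sample data
s = range(minimum(x), maximum(x), length=6001)   # candidate values of s

# Brute-force minimizer of ‖x .- s‖ₚ over the grid.
best(p) = s[argmin([norm(x .- t, p) for t in s])]

best(1),   median(x)                          # p = 1: the median
best(2),   mean(x)                            # p = 2: the mean
best(Inf), (minimum(x) + maximum(x)) / 2      # p = ∞: the midrange
```

Each pair should agree to within the grid spacing. The single large value 7.0 pulls the midrange far more than the mean and leaves the median untouched.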
10.8 Exercises
(a) Find the closest quadratic polynomial to $x^3$ over $(0, 1)$ with the $L^2$-inner product
\[ (f, g) = \int_0^1 f(x) g(x) \,\mathrm{d}x. \]
(b) Use the definition of the angle between two subspaces to show that when
𝑛 is large, 𝑥 𝑛+1 is close to the subspace spanned by {1, 𝑥, 𝑥 2 , . . . , 𝑥 𝑛 }.
14 Not to be confused with the generalized mean $n^{-1/p} \|\mathbf{x}\|_p$, which in the limit as $p \to 0$ is the
geometric mean of x.
Exercises 293
10.2. Show that the coefficients in the three-term recurrence relation of the
Legendre polynomials are given by $a_n = 0$ and $b_n = -n^2/(4n^2 - 1)$.
10.3. When we learn arithmetic, we are first taught integer values. Only later are
we introduced to fractions and then irrational numbers, which fill out the number
line. Learning calculus is similar. We start by computing with whole derivatives
and rules of working with them, such as
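\[ \frac{\mathrm{d}}{\mathrm{d}x} \frac{\mathrm{d}}{\mathrm{d}x} f(x) = \frac{\mathrm{d}^2}{\mathrm{d}x^2} f(x). \]
In advanced calculus, we may be introduced to fractional derivatives. For example,
\[ \frac{\mathrm{d}^{1/2}}{\mathrm{d}x^{1/2}} \frac{\mathrm{d}^{1/2}}{\mathrm{d}x^{1/2}} f(x) = \frac{\mathrm{d}}{\mathrm{d}x} f(x). \]
One way to compute a fractional derivative is with the help of a Fourier
transform. If $\hat{f}(k)$ is the Fourier transform of $f(x)$, then $(\mathrm{i}k)^p \hat{f}(k)$ is the Fourier
transform of the $p$th derivative of $f(x)$. Compute the fractional derivatives of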
the sine, piecewise-quadratic, piecewise-linear, and Gaussian functions:
implement LeNet-5 to classify the digits in the MNIST dataset. The dataset is
available through the MLDatasets.jl package.
10.5. Multilateration finds the source $(x, y, t)$ of a signal by solving a system of equations
\[ (x - x_i)^2 + (y - y_i)^2 = c^2 (t_i - t + \varepsilon_i)^2, \]
where (𝑥 𝑖 , 𝑦 𝑖 ) are the locations of listening stations, 𝑡 𝑖 is the signal arrival time
at station 𝑖, 𝑐 is the signal speed, and 𝜀 𝑖 is an unknown offset that accounts for
reflection or diffraction by obstacles in the signal pathway. Suppose that the
offset in the transmission times 𝜀 𝑖 is normally distributed. Solve 𝜺(𝑥, 𝑦, 𝑡) = 0
using nonlinear least squares to determine (𝑥, 𝑦, 𝑡) for locations (𝑥 𝑖 , 𝑦 𝑖 ) given
in exercise 3.10. Take 𝑐 = 1. Modify your method to locate the source of the
gunfire described in exercise 3.4.
Chapter 11
Suppose that we want to find the derivative of some sufficiently smooth function
𝑓 at 𝑥. Let ℎ be a small displacement. From Taylor series expansion, we have
\[ f(x + h) = f(x) + h f'(x) + \tfrac{1}{2} h^2 f''(x) + O(h^3) \]
𝑓 (𝑥) = 𝑓 (𝑥).
296 Differentiation and Integration
where 𝛿𝑖 scales the step size ℎ. We want to find a set of coefficients 𝑐 𝑘𝑖 that
satisfy the system of equations for each 𝑘 = 0, 1, . . . , 𝑛 − 1:
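\[ \sum_{i=0}^{n-1} c_{ki} f(x + \delta_i h) = h^k f^{(k)}(x) + O(h^n). \]
That is,
\[ \sum_{i=0}^{n-1} c_{ki} \sum_{j=0}^{n-1} \frac{\delta_i^j}{j!} h^j f^{(j)}(x) = h^k f^{(k)}(x) + O(h^n). \]
Hence, we have the system CV = I, where C is the matrix of coefficients $c_{ki}$
and V is the scaled Vandermonde matrix
\[ \mathbf{V} = \begin{bmatrix} 1 & \delta_0 & \cdots & \delta_0^{n-1} \\ 1 & \delta_1 & \cdots & \delta_1^{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \delta_{n-1} & \cdots & \delta_{n-1}^{n-1} \end{bmatrix} \begin{bmatrix} 1/0! & & & \\ & 1/1! & & \\ & & \ddots & \\ & & & 1/(n-1)! \end{bmatrix}. \]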
Julia has a rational number type Rational that expresses the ratio of integers. For
example, 3//4 is the Julia representation for 34 . You can also convert the floating-point
number 0.75 to a rational using rationalize(0.75).
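The system CV = I can be solved exactly with rational arithmetic. The sketch below builds the scaled Vandermonde matrix for a given stencil and inverts it; the function name `fd_coefficients` and the three-point stencil are chosen here for illustration.

```julia
# Finite-difference coefficients on the stencil x + δᵢh, using exact rationals.
function fd_coefficients(δ)
    n = length(δ)
    # Scaled Vandermonde matrix with entries δᵢʲ / j!
    V = [δ[i]^j // factorial(j) for i in 1:n, j in 0:n-1]
    inv(V)    # row k+1 holds the coefficients c_{ki} that reproduce hᵏ f⁽ᵏ⁾(x)
end

C = fd_coefficients([-1, 0, 1])
C[2, :]    # coefficients for h f′(x):  -1//2, 0//1, 1//2
C[3, :]    # coefficients for h² f″(x): 1//1, -2//1, 1//1
```

The second and third rows are the familiar central difference formulas for the first and second derivatives.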
Richardson extrapolation
\[ \phi(h) = a_0 + a_1 h^p + a_2 h^{2p} + a_3 h^{3p} + \cdots \tag{11.1} \]
for which we want to determine the value $a_0$ in terms of the function on the left-hand
side of the equation. For example, we could have the central difference
\[ \phi(h) = \frac{f(x + h) - f(x - h)}{2h} = f'(x) + \frac{1}{3!} f'''(x) h^2 + \frac{1}{5!} f^{(5)}(x) h^4 + \cdots. \]
Richardson extrapolation allows us to successively eliminate the $h^p$, $h^{2p}$, and
higher-order terms. Take
\[ \phi(\delta h) = a_0 + a_1 \delta^p h^p + a_2 \delta^{2p} h^{2p} + a_3 \delta^{3p} h^{3p} + \cdots, \]
from which
\[ \frac{\phi(\delta h) - \delta^p \phi(h)}{1 - \delta^p} = a_0 + \tilde{a}_1 h^{2p} + \tilde{a}_2 h^{3p} + \tilde{a}_3 h^{4p} + \cdots \]
for some new coefficients $\tilde{a}_1, \tilde{a}_2, \dots$. This new equation has the same form as
the original equation (11.1), so we can repeat the process to eliminate the $h^{2p}$
term. We now have a recursion formula for $a_0 = \mathrm{D}_{m,n} + o((\delta^m h)^{pn})$ where
Automatic differentiation 299
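\[ \mathrm{D}_{m,0} = \phi(\delta^m h), \qquad \mathrm{D}_{m,n} = \frac{\mathrm{D}_{m,n-1} - \delta^{pn} \mathrm{D}_{m-1,n-1}}{1 - \delta^{pn}} \quad\text{for } m \ge n, \]
which fills in the triangular table
\[ \begin{matrix} \mathrm{D}_{0,0} & & \\ \mathrm{D}_{1,0} & \mathrm{D}_{1,1} & \\ \mathrm{D}_{2,0} & \mathrm{D}_{2,1} & \mathrm{D}_{2,2} \end{matrix} \]
Julia code for Richardson extrapolation taking $\delta = 1/2$ and $p = 2$ (so that $\delta^{-pn} = 4^n$) is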
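# ϕ(f,x,n) is the approximation being extrapolated; n sets the step size (defined separately)
function richardson(f,x,m,n=m)
  n > 0 ?
  (4^n*richardson(f,x,m,n-1) - richardson(f,x,m-1,n-1))/(4^n-1) :
  ϕ(f,x,2^m)
end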
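The recursion can also be organized as a table that is filled in row by row, which avoids recomputing entries. The sketch below is one such tabular variant applied to the central difference, assuming $\delta = 1/2$ and $p = 2$; the names `central_difference` and `richardson_table` and the starting step `h = 1.0` are illustrative choices.

```julia
# Central difference with step h; its error expansion has p = 2.
central_difference(f, x, h) = (f(x + h) - f(x - h)) / (2h)

# Fill the lower-triangular table D[m+1, n+1] = D_{m,n} for 0 ≤ n ≤ m ≤ M.
function richardson_table(f, x, M; h=1.0)
    D = zeros(M + 1, M + 1)
    for m in 0:M
        D[m+1, 1] = central_difference(f, x, h / 2^m)   # D_{m,0} = ϕ(δᵐh)
        for n in 1:m
            D[m+1, n+1] = (4^n * D[m+1, n] - D[m, n]) / (4^n - 1)
        end
    end
    return D
end

D = richardson_table(sin, 1.0, 4)
D[end, 1] - cos(1.0), D[end, end] - cos(1.0)   # errors before and after extrapolation
```

The first column holds the plain central differences, and each column to the right removes one more term of the error expansion, so the bottom-right entry should be far more accurate than the bottom-left one.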
There are three common methods of calculating the derivative using a computer:
numerical differentiation, symbolic differentiation, and automatic differentiation.
Numerical differentiation, which is discussed extensively throughout this book,
evaluates the derivative of a function by calculating finite difference approx-
imations of that function or by computing the actual derivative of a simpler
approximation of the original function. Because numerical differentiation takes
where all $\varepsilon^2$ and higher terms are zero. This expression is the formal infinitesimal
definition of the derivative using nonstandard analysis
\[ f'(x) = \frac{f(x + \varepsilon) - f(x)}{\varepsilon}. \]
Note that the derivative is not defined as the limit as a real number 𝜀 goes to zero.
Instead, it is defined entirely with respect to an infinitesimal 𝜀. Dual numbers
follow the same addition and multiplication rules with the additional rule that
$\varepsilon^2 = 0$. For example, for addition,
\[ (f, f') + (g, g') \equiv (f + f'\varepsilon) + (g + g'\varepsilon) = (f + g) + (f' + g')\varepsilon \equiv (f + g, f' + g'); \]
for multiplication,
\[ (f, f') \cdot (g, g') \equiv (f + f'\varepsilon) \cdot (g + g'\varepsilon) = fg + (f'g + fg')\varepsilon \equiv (fg, f'g + fg'); \]
The results are the familiar addition, product, and quotient rules of calculus. We
can compute the values of the derivatives at the same time that the values of the
functions are evaluated with only a few more operations. When implemented in a
programming language such as Julia or Matlab, each function can be overloaded
to return its value and derivative. For the chain rule,
\[ \begin{aligned} v_2 &\leftarrow \sin v_1, & v_2' &\leftarrow (\cos v_1)\, v_1' \\ v_3 &\leftarrow v_1 + v_2, & v_3' &\leftarrow v_1' + v_2' \end{aligned} \]
We’ll define 𝑥 = 0 as a dual number and 𝑦 as the function of that dual number:
x = Dual(0)
y = x + sin(x)
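For readers who want something runnable, here is a minimal sketch of a dual number type that supports the operations used in this section. It is an illustrative stand-in, not necessarily the definition used in the text: the field names `value` and `deriv` and the default seed of 1 are assumptions, and only `+`, `-`, `*`, and `sin` are overloaded.

```julia
# A minimal dual number: a value paired with its derivative.
struct Dual
    value
    deriv
end
Dual(x) = Dual(x, 1)     # seed the derivative with dx/dx = 1

Base.:+(u::Dual, v::Dual) = Dual(u.value + v.value, u.deriv + v.deriv)
Base.:-(u::Dual, v::Dual) = Dual(u.value - v.value, u.deriv - v.deriv)
Base.:*(u::Dual, v::Dual) = Dual(u.value * v.value, u.deriv * v.value + u.value * v.deriv)
Base.sin(u::Dual) = Dual(sin(u.value), cos(u.value) * u.deriv)

x = Dual(0)
y = x + sin(x)
(y.value, y.deriv)       # the value 0 + sin 0 and the derivative 1 + cos 0
```

Because the fields are untyped, the derivative can also be a row vector of partial derivatives, as in `Dual(2, [1 0])`, which lets one forward pass carry several seeds at once.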
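\[ y_1 = x_1 x_2 + \sin x_2, \qquad y_2 = x_1 x_2 - \sin x_2 \tag{11.2} \]
or equivalently
\[ \mathbf{y}' = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} v_2 & v_1 \\ 0 & \cos v_2 \end{bmatrix} \mathbf{x}' = \mathbf{J}\mathbf{x}'. \]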
Using a standard unit vector 𝝃 𝑖 as our input for x0 will return a column of the
Jacobian matrix evaluated at x. In general, for 𝑚 input variables, we’ll need to
compute 𝑚 forward accumulation passes to compute the 𝑚 columns. Because
there are two input variables 𝑥1 and 𝑥2 in the system (11.2), we’ll need to do two
sweeps. We first start with $(x_1', x_2') = (1, 0)$ to get the first column of the Jacobian
matrix. We next start with $(x_1', x_2') = (0, 1)$ to get the second column of the
Jacobian matrix. Figure 11.1 shows each node $(v_i, v_i')$ of the computational graph
starting with $(x_1, x_2) = (2, \pi)$. The term $D_{ij}$ represents the partial derivative
$\partial v_i/\partial v_j$.

Figure 11.1:
variables    v               v′                                      x′=(1,0)   x′=(0,1)
(v1, v1′)    x1 = 2          x1′                                     1          0
(v2, v2′)    x2 = π          x2′                                     0          1
(v3, v3′)    v1 v2 = 2π      D31 v1′ + D32 v2′ → v2 v1′ + v1 v2′     π          2
(v4, v4′)    sin v2 = 0      D42 v2′ → v2′ cos v2                    0          −1
(v5, v5′)    v3 + v4 = 2π    D53 v3′ + D54 v4′ → v3′ + v4′           π          1
(v6, v6′)    v3 − v4 = 2π    D63 v3′ + D64 v4′ → v3′ − v4′           π          3
Let’s use the code developed in the previous example on system (11.2). We
can define the dual numbers as:
x1 = Dual(2,[1 0])
x2 = Dual(π,[0 1])
y1 = x1*x2 + sin(x2)
y2 = x1*x2 - sin(x2)
where x̄ is called the adjoint value. Using a standard unit vector 𝝃 𝑖 as our input
for x̄ will return a row of the Jacobian matrix evaluated at x. In general, for 𝑛
output variables, we’ll need to compute 𝑛 reverse accumulation passes to compute
the 𝑛 rows. When the number of input variables 𝑚 is much larger than the
number of output variables 𝑛, which is often the case in neural networks, reverse
accumulation is much more efficient than forward accumulation. We define the
adjoint $\bar{v}_i = \partial y/\partial v_i$ as the derivative of the dependent variable $y$ with respect to $v_i$.
To compute the Jacobian for a point (𝑥1 , 𝑥2 ), we first run through a forward pass
to get the dependencies and graph structure. The forward pass is the same as
forward accumulation and results in the same expression for 𝑦 1 and 𝑦 2 . We then
run backward once for each dependent variable, first taking $(y_1', y_2') = (1, 0)$ to
get the first row of the Jacobian matrix and then taking $(y_1', y_2') = (0, 1)$ to get
the second row of the Jacobian matrix. See Figure 11.2 on the following page.

Figure 11.2:
variables    v               v̄                                            y′=(1,0)   y′=(0,1)
(v1, v̄1)     x1 = 2          D31 v̄3 → v2 v̄3                               π          π
(v2, v̄2)     x2 = π          D32 v̄3 + D42 v̄4 → v1 v̄3 + v̄4 cos v2          1          3
(v3, v̄3)     v1 v2 = 2π      D53 v̄5 + D63 v̄6 → v̄5 + v̄6                    1          1
(v4, v̄4)     sin v2 = 0      D54 v̄5 + D64 v̄6 → v̄5 − v̄6                    1          −1
(v5, v̄5)     v3 + v4 = 2π    y1′                                          1          0
(v6, v̄6)     v3 − v4 = 2π    y2′                                          0          1
Julia has several packages that implement forward and reverse automatic differentia-
tion, including Zygote.jl, ForwardDiff.jl, and ReverseDiff.jl.
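To make the bookkeeping of reverse accumulation concrete, the sketch below hand-codes one forward pass and one backward pass for the system $y_1 = x_1 x_2 + \sin x_2$, $y_2 = x_1 x_2 - \sin x_2$. It is written by hand for this one system rather than with any of the packages above, and the function name `reverse_pass` and the `bar`-suffixed variable names are illustrative.

```julia
# One reverse-accumulation pass; returns a row of adjoints [x̄₁ x̄₂]
# for the seed (ȳ₁, ȳ₂) = (y1bar, y2bar).
function reverse_pass(x1, x2, y1bar, y2bar)
    # Forward pass: record the intermediate values.
    v1, v2 = x1, x2
    v3 = v1 * v2
    v4 = sin(v2)
    # Backward pass: propagate adjoints from the outputs to the inputs.
    v5bar, v6bar = y1bar, y2bar
    v3bar = v5bar + v6bar              # v3 feeds y1 = v3 + v4 and y2 = v3 - v4
    v4bar = v5bar - v6bar
    v2bar = v1 * v3bar + cos(v2) * v4bar
    v1bar = v2 * v3bar
    [v1bar v2bar]
end

# Each pass gives one row of the Jacobian at (x₁, x₂) = (2, π).
J = vcat(reverse_pass(2, π, 1, 0), reverse_pass(2, π, 0, 1))
```

The rows of `J` can be compared with the columns built by forward accumulation: both approaches assemble the same Jacobian, one row at a time here and one column at a time there.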
Numerical integration is often called quadrature, a term that seems archaic like
many others in mathematics. Historically, the quadrature of a figure described the
process of constructing a square with the same area as some given figure. The
English word quadrature comes from the Latin quadratura (to square). It wasn’t
until the late nineteenth century that the classical problem of ancient geometers
of “squaring the circle”—finding a square with the same area as a circle using
only a compass and straightedge—was finally proved to be impossible when the
Lindemann–Weierstrass theorem demonstrated that 𝜋 is a transcendental number.
The name quadrature stuck with us, and most scientific programming languages
now use some variant of the term quad for numerical integration.1 It wasn’t until
recently that Matlab started to use the function name integral, because many
users were having trouble finding quad.
The typical method for numerically integrating a smooth function is to
approximate the function by a polynomial interpolant and then integrate the
polynomial exactly. Recall from Chapter 9 that given a set of 𝑛 + 1 nodes we can
determine the Lagrange basis for a polynomial
\[ f(x) \approx p_n(x) = \sum_{i=0}^{n} f(x_i) \ell_i(x) \quad\text{where}\quad \ell_i(x) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j}. \]
1 The even more archaic-sounding term cubature (finding the volume of solids) is sometimes
used in reference to multiple integrals.
Newton–Cotes quadrature 305
for some 𝜉 ∈ [0, ℎ]. Now, consider Simpson’s rule over [−ℎ, ℎ]—two subinter-
vals each of length ℎ. Simpson’s rule is derived using a quadratic interpolating
polynomial, so it is exact if 𝑓 (𝑥) is a quadratic polynomial. Simpson’s rule is, in
fact, also exact if 𝑓 (𝑥) is cubic because 𝑥 3 is an odd function, and its contribution
integrates out to zero. So let’s consider a cubic polynomial over the interval
[−ℎ, ℎ] with nodes at −ℎ, 0, and ℎ and a fourth node at an arbitrary point 𝑐. The
error of Simpson’s rule is
\[ \int_{-h}^{h} f(x) - p_3(x) \,\mathrm{d}x = \frac{1}{24} f^{(4)}(\xi) \int_{-h}^{h} x(x - c)(x^2 - h^2) \,\mathrm{d}x = -\frac{1}{90} h^5 f^{(4)}(\xi) \]
for some 𝜉 ∈ [−ℎ, ℎ]. While we can get more accurate methods by adding nodes
and using higher-order polynomials such as Boole’s rule, such an approach is
not generally good because of the Runge phenomenon. Instead, we can either
use a piecewise polynomial approximation (a spline) or pick nodes to minimize
the uniform error (such as using Chebyshev nodes). We’ll look at both of these
options in turn.
Composite methods
Composite methods use splines to approximate the integrand. The simplest type
of spline is a constant spline. Using these, we get Riemann sums. Linear splines
give us the composite trapezoidal rule, a more accurate method that applies the
trapezoidal rule to each subinterval. For $n + 1$ equally spaced nodes
\[ \int_a^b f(x) \,\mathrm{d}x \approx h \left[ \tfrac{1}{2} f(a) + \sum_{i=1}^{n-1} f(a + ih) + \tfrac{1}{2} f(b) \right] \tag{11.4} \]
where $h = (b - a)/n$.
function trapezoidal(f,x,n)
  F = f.(LinRange(x[1],x[2],n+1))
  (F[1]/2 + sum(F[2:end-1]) + F[end]/2)*(x[2]-x[1])/n
end
The error for the trapezoidal rule is bounded by $\frac{1}{12} h^3 |f''(\xi)|$ for some $\xi$ in
a subinterval. The error for the composite trapezoidal rule is bounded by
$\frac{1}{12} (b - a)^3 |f''(\xi)|/n^2$ for some $\xi \in (a, b)$. We can derive a much better error
estimate with a bit of extra work, which we’ll do in a couple of pages. Because
the composite trapezoidal rule is a second-order method, if we use twice as many
intervals, we can expect the error to be reduced by about a factor of four. And
with four times as many intervals, we can reduce the error by about a factor of 16.
With a quadratic spline, Simpson’s rule becomes the composite Simpson’s
rule. For $n + 1$ equally spaced nodes with spacing $h$, Simpson’s rule says
\[ \int_{x_{i-1}}^{x_{i+1}} f(x) \,\mathrm{d}x \approx \frac{h}{3} \big( f(x_{i-1}) + 4 f(x_i) + f(x_{i+1}) \big). \]
The error of Simpson’s rule is $\frac{1}{90} h^5 f^{(4)}(\xi)$, so the error of the composite
Simpson’s rule is bounded by $\frac{1}{180} (b - a)^5 |f^{(4)}(\xi)|/n^4$ for some $\xi$ over an interval
(𝑎, 𝑏). By using twice as many intervals, we should expect the error to decrease
by a factor of 16. In practice, one often uses an adaptive composite trapezoidal
or Simpson’s rule. But, if we can choose the positions of the nodes at which we
integrate, then it’s much better to choose Gaussian quadrature.
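A composite Simpson’s rule can be written in the same style as the trapezoidal function. The sketch below is one possible version; the function name `simpson` and the requirement that `n` be even are choices made here for illustration.

```julia
# Composite Simpson's rule on the interval x = (a, b) with n subintervals (n even).
function simpson(f, x, n)
    iseven(n) || throw(ArgumentError("n must be even"))
    F = f.(LinRange(x[1], x[2], n + 1))
    h = (x[2] - x[1]) / n
    # Weights 1, 4, 2, 4, ..., 2, 4, 1 scaled by h/3
    (F[1] + 4sum(F[2:2:end-1]) + 2sum(F[3:2:end-2]) + F[end]) * h / 3
end

simpson(x -> x^3, (0, 2), 2)     # Simpson's rule is exact for cubics

# Doubling the number of intervals should cut the error by roughly 16.
exact = exp(1) - 1
ratio = (simpson(exp, (0, 1), 10) - exact) / (simpson(exp, (0, 1), 20) - exact)
```

The ratio gives a quick empirical check of the fourth-order convergence described above.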
We can also apply Richardson extrapolation to the composite trapezoidal rule.
Such an approach is called Romberg’s method. Take 𝜙(ℎ) to be the composite
trapezoidal rule using nodes separated by a distance ℎ. Then D𝑚,0 extrapolation
is equivalent to the composite trapezoidal rule, D𝑚,1 extrapolation is equivalent
to the composite Simpson’s rule, and D𝑚,2 extrapolation is equivalent to the
composite Boole’s rule—all with $2^m$ subsegments. We can implement a naïve
Romberg’s method by replacing the definition of ϕ in the Julia code on page 299
with a composite trapezoidal method:
ϕ = (f,x,n) -> trapezoidal(f,x,n)
𝑝    slope
1    -2.00
2    -4.00
3    -3.99
4    -6.00
5    -5.95
6    -7.97
7    -7.79
Example. Let’s examine the error of the composite trapezoidal rule applied to
the function $f(x) = x + x^p (2 - x)^p$ over the interval $[0, 2]$ with $p = 1, 2, \dots, 7$.
See Figure 11.3 above. We can determine the convergence rate by computing
$[\mathbf{x}\ \mathbf{1}]^{-1}\mathbf{y}$ in the least-squares sense, where x is a vector of logarithms of the number of nodes, y is a vector
of logarithms of the errors, and 1 is an appropriately-sized vector of ones. We
had earlier determined that the error of a composite trapezoidal rule should be
$O(n^{-2})$ or better for a smooth function. We find that the error of the trapezoidal
rule is $O(n^{-2})$ when $p = 1$. We find that the error is $O(n^{-4})$ when $p$ is 2 or 3,
Let’s reexamine the quadrature error of the composite trapezoidal rule a little
more rigorously. Apply the composite trapezoidal rule to a function 𝑓 (𝑥) on the
The interval [0, 𝜋] might seem a little arbitrary, and in fact, any finite interval
would do, but using [0, 𝜋] will help the mathematical feng shui further on. If we
expand 𝑓 (𝑥) as a cosine series
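\[ f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} a_k \cos kx \quad\text{with}\quad a_k = \frac{2}{\pi} \int_0^{\pi} f(x) \cos kx \,\mathrm{d}x, \]
then
\[ S = \int_0^{\pi} f(x) \,\mathrm{d}x = \frac{a_0 \pi}{2} \]
\[ S_n = \frac{a_0 \pi}{2} + \frac{\pi}{n} \sum_{k=1}^{\infty} a_k \left[ \frac{1 + (-1)^k}{2} + \sum_{i=1}^{n-1} \cos \frac{ik\pi}{n} \right] = S + \pi \sum_{k=1}^{\infty} a_{2nk}. \]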
This identity itself follows from computing the real part of the geometric series
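\[ \sum_{j=0}^{n-1} \mathrm{e}^{\mathrm{i} jk\pi/n} = \frac{1 - (-1)^k}{1 - \mathrm{e}^{\mathrm{i} k\pi/n}} \]
and noting that the real part of $(1 - \mathrm{e}^{\mathrm{i}x})^{-1}$ equals $\frac{1}{2}$ for any $x$.
The quadrature error $S_n - S$ is $\pi \sum_{k=1}^{\infty} a_{2nk}$. So the convergence rate of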
the composite trapezoidal method is determined by the convergence rate of
coefficients of the cosine series. If 𝑓 (𝑥) is a smooth function, then by repeatedly
integrating by parts, we get
\[ a_{2nk} = \frac{2}{\pi} \int_0^{\pi} f(x) \cos 2nkx \,\mathrm{d}x = \frac{2}{\pi} \sum_{j=1}^{\infty} (-1)^{j+1} \left. \frac{f^{(2j-1)}(x)}{(2nk)^{2j}} \right|_0^{\pi}. \]
Therefore, the error of the trapezoidal rule can be characterized by the odd
derivatives at the endpoints $x = 0$ and $x = \pi$. If $f'(0) \ne f'(\pi)$, then we can
expect a convergence rate of $O(n^{-2})$. If $f'(0) = f'(\pi)$ and $f'''(0) \ne f'''(\pi)$,
then we can expect a convergence rate of $O(n^{-4})$. If the function $f(x)$ is periodic
or if each of the odd derivatives of $f(x)$ vanishes at the endpoints, then the
composite trapezoidal rule gives us exponential or spectral convergence.2 If the
function $f(x)$ is not periodic or does not have matching odd derivatives at the
endpoints, we can still make a change of variable to put it in a form that does.
This approach, called Clenshaw–Curtis or Fejér quadrature, will allow us to
integrate any smooth function with spectral convergence.
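The difference between algebraic and spectral convergence is easy to see numerically. The sketch below repeats a composite trapezoidal function so that it is self-contained and compares how much the estimate changes when the number of intervals doubles from 16 to 32, once for a smooth periodic integrand and once for one whose first derivative differs at the endpoints. Both test functions are chosen here purely for illustration.

```julia
function trapezoidal(f, x, n)
    F = f.(LinRange(x[1], x[2], n + 1))
    (F[1]/2 + sum(F[2:end-1]) + F[end]/2) * (x[2] - x[1]) / n
end

periodic  = x -> exp(cos(x))    # smooth and 2π-periodic
aperiodic = x -> exp(x)         # f′ differs at the two endpoints

# Change in the estimate when n doubles from 16 to 32
Δ_periodic  = abs(trapezoidal(periodic,  (0, 2π), 32) - trapezoidal(periodic,  (0, 2π), 16))
Δ_aperiodic = abs(trapezoidal(aperiodic, (0, 1),  32) - trapezoidal(aperiodic, (0, 1),  16))
```

For the periodic integrand the two estimates should already agree to roughly machine precision, while for the exponential the change is many orders of magnitude larger, consistent with an $O(n^{-2})$ error.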
Clenshaw–Curtis quadrature
Take any smooth function over an interval [−1, 1] and make a change of variables
$x = \cos\theta$,
\[ \int_{-1}^{1} f(x) \,\mathrm{d}x = \int_0^{\pi} f(\cos\theta) \sin\theta \,\mathrm{d}\theta. \]
We can express 𝑓 (cos 𝜃) as a cosine series
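\[ f(\cos\theta) = \frac{a_0}{2} + \sum_{k=1}^{\infty} a_k \cos(k\theta) \quad\text{where}\quad a_k = \frac{2}{\pi} \int_0^{\pi} f(\cos\theta) \cos k\theta \,\mathrm{d}\theta \]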
where we now need to evaluate the coefficients 𝑎 2𝑘 . The integrand 𝑓 (cos 𝜃) cos 𝑘𝜃
is an even function at 𝜃 = 0 and 𝜃 = 𝜋, so we can expect spectral convergence.
The trapezoidal rule is just the (type-1) discrete cosine transform
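\[ a_{2k} = \frac{2}{n} \left[ \frac{f_0}{2} + \sum_{j=1}^{n-1} f_j \cos \frac{2\pi jk}{n} + \frac{f_n}{2} \right] \quad\text{where}\quad f_j = f\!\left( \cos \frac{\pi j}{n} \right). \]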
Extend the function $f(\cos\theta)$ as an even function at $\pi$: $f_{n+j} = f_{n-j}$. Then the
discrete cosine transform over $n$ nodes is exactly one-half the discrete Fourier transform
\[ \sum_{j=0}^{2n-1} f_j \exp\!\left( -\frac{\mathrm{i} 2\pi jk}{n} \right) = \sum_{j=0}^{n-1} f_j \cos \frac{2\pi jk}{n} + \sum_{j=n}^{2n-1} f_{2n-j} \cos\!\left( -\frac{2\pi jk}{n} \right) \]
2 The Euler–Maclaurin formula, which relates a sum to an integral, provides an alternative
method for estimating the error of the composite trapezoidal rule. The Euler–Maclaurin formula
states that if $f \in C^p(a, b)$ where $a$ and $b$ are integers, then
\[ \sum_{i=a+1}^{b} f(i) = \int_a^b f(x) \,\mathrm{d}x + \sum_{k=0}^{p-1} \frac{\mathrm{B}_{k+1}}{(k+1)!} \left( f^{(k)}(b) - f^{(k)}(a) \right) + R_p \]
where $\mathrm{B}_k$ are the Bernoulli numbers and $R_p$ is a remainder term. Odd Bernoulli numbers are zero
except for $\mathrm{B}_1$. Ada Lovelace is credited with writing the first computer program in 1843 to calculate
the Bernoulli numbers using Charles Babbage’s never-built Analytical Engine.
after the sine terms cancel each other out. We can now easily compute the sum
using a fast Fourier transform that can be evaluated in 𝑂 (𝑛 log 𝑛) operations.
We can implement the Clenshaw–Curtis quadrature in Julia by computing
the discrete cosine transform in
Julia’s FFTW package does not have a named type-1 discrete cosine transform—
the dct function is a type-2 DCT. We can formulate a type-1 DCT using
using FFTW
function dctI(f)
  g = FFTW.r2r!([f...],FFTW.REDFT00)
  [g[1]/2; g[2:end-1]; g[end]/2]
end
# alternative formulation: take the FFT of the even extension of f
function dctI(f)
  n = length(f)
  g = real(fft(f[[1:n; n-1:-1:2]]))
  [g[1]/2; g[2:n-1]; g[n]/2]
end
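For small problems, the whole quadrature can be sketched without FFTW by evaluating the cosine sums directly. The version below swaps the fast transform for an $O(n^2)$ direct summation, so it illustrates the formula rather than the efficient algorithm. It uses the standard Clenshaw–Curtis sum $a_0 + \sum_{k=1}^{n/2-1} 2a_{2k}/(1 - 4k^2) + a_n/(1 - n^2)$, which comes from integrating each $\cos(k\theta)\sin\theta$ term; the function name `clenshaw_curtis` and the restriction to even `n` are choices made here.

```julia
# Clenshaw–Curtis quadrature of f over [-1, 1] using n+1 nodes (n even).
# The cosine coefficients are computed by direct summation instead of an FFT.
function clenshaw_curtis(f, n)
    iseven(n) || throw(ArgumentError("n must be even"))
    θ = π * (0:n) / n
    F = f.(cos.(θ))
    # Trapezoidal (type-1 DCT) approximation of the cosine coefficient a_k
    a(k) = (2/n) * (F[1]/2 + sum(F[j+1] * cos(k * θ[j+1]) for j in 1:n-1) + (-1)^k * F[end]/2)
    # Integrate the cosine series term by term against sin θ
    a(0) + sum(2 * a(2k) / (1 - 4k^2) for k in 1:n÷2-1; init=0.0) + a(n) / (1 - n^2)
end

clenshaw_curtis(x -> x^2, 4)      # compare with the exact value 2/3
clenshaw_curtis(exp, 16)          # compare with e - 1/e
```

The `init` keyword of `sum` assumes a reasonably recent Julia version. Replacing the direct sums with one of the `dctI` functions would recover the $O(n \log n)$ cost.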
\[ f(\cos\theta) = \tfrac{1}{2} a_0 + \sum_{k=1}^{\infty} a_k \cos(k \cos^{-1} x) = \tfrac{1}{2} a_0 + \sum_{k=1}^{\infty} a_k \mathrm{T}_k(x), \]
[Figure panels: 8 nodes, 12 nodes, 16 nodes]
Choosing quadrature nodes from the zeros of orthogonal polynomials does more
than mitigate Runge’s phenomenon. Remarkably, using these carefully selected
nodes allows us to determine a polynomial interpolant whose degree is almost
double the number of nodes.
Suppose that we let the nodes 𝑥𝑖 vary freely, still computing the weights 𝑤 𝑖
based on their positions using the Newton–Cotes formula (11.5). One might ask,
“what is the maximum degree of a polynomial 𝑝(𝑥) that can be constructed?”
This is precisely the question Carl Friedrich Gauss raised in 1814. With 𝑛 + 1
nodes and 𝑛 + 1 weights along with 2𝑛 + 2 constraints, one might rightfully
conjecture that 𝑝(𝑥) would be uniquely determined using 2𝑛 + 2 coefficients.
Hence, the maximum degree is 2𝑛 + 1. A quadrature that carefully chooses
nodes to construct a maximal polynomial is called Gaussian quadrature (and
sometimes Gauss–Christoffel quadrature).
From Chapter 10, we can generate orthogonal polynomials by using different
inner product spaces. These orthogonal polynomials are the most important
classes of polynomials for Gaussian quadrature and lend their names to Gauss–
Legendre, Gauss–Chebyshev, Gauss–Hermite, Gauss–Laguerre, and Gauss–
Jacobi quadrature. See the table on the facing page. Gauss–Jacobi quadrature,
Gaussian quadrature 313
where $q(x) = \prod_{i=0}^{n} (x - x_i)$ and $\xi$ is some point in the interval of integration.
\[ f(x) - p(x) = \frac{f^{(2n+2)}(\xi(x))}{(2n+2)!} q^2(x) \]
from which it follows (after integrating both sides and applying the mean value
theorem for integrals to the right-hand side):
\[ \int_a^b f(x)\omega(x) \,\mathrm{d}x - \int_a^b p(x)\omega(x) \,\mathrm{d}x = \frac{f^{(2n+2)}(\xi)}{(2n+2)!} \int_a^b q^2(x)\omega(x) \,\mathrm{d}x. \]
We still need to determine the nodes 𝑥 𝑖 and the weights 𝑤 𝑖 . Let’s first find
the nodes for an arbitrary family of orthogonal polynomials. Recall Favard’s
theorem on page 254. It says that for each weighted inner product, there exist
uniquely determined orthogonal polynomials 𝑝 𝑘 ∈ P 𝑘 with leading coefficient
one satisfying the three-term recurrence relation
\[ p_{k+1}(x) = (x - a_k) p_k(x) - b_k p_{k-1}(x), \qquad p_0(x) = 1, \quad p_1(x) = x - a_1 \]
where
\[ a_k = \frac{(x p_k, p_k)}{(p_k, p_k)} \quad\text{and}\quad b_k = \frac{(p_k, p_k)}{(p_{k-1}, p_{k-1})}. \]
It will be useful to use an orthonormal basis. Take 𝑝ˆ 𝑘 = 𝑝 𝑘 /k 𝑝 𝑘 k. Then we can
rewrite the three-term recurrence relation as
\[ \frac{\|p_{k+1}\|}{\|p_k\|} \hat{p}_{k+1}(x) = (x - a_k) \hat{p}_k(x) - b_k \frac{\|p_{k-1}\|}{\|p_k\|} \hat{p}_{k-1}(x), \]
and by noting that $\|p_{k+1}\|/\|p_k\| = \sqrt{b_{k+1}}$, we have
\[ \sqrt{b_{k+1}}\, \hat{p}_{k+1}(x) = (x - a_k) \hat{p}_k(x) - \sqrt{b_k}\, \hat{p}_{k-1}(x). \]
\[ \sum_{k=0}^{n} w_k \hat{p}_i(x_k) \hat{p}_j(x_k) = \int_a^b \hat{p}_i(x) \hat{p}_j(x) \omega(x) \,\mathrm{d}x = (\hat{p}_i, \hat{p}_j) = \delta_{ij}, \]
because the degree of $\hat{p}_i(x)\hat{p}_j(x)$ is strictly less than $2n$. But this simply says
that $\mathbf{Q}\mathbf{Q}^\mathsf{T} = \mathbf{I}$ where the elements $Q_{ik} = \sqrt{w_k}\, \hat{p}_i(x_k)$. Hence, Q is an orthogonal
matrix. So, $\mathbf{Q}^\mathsf{T}\mathbf{Q}$ also equals the identity operator. Therefore,
\[ \sum_{k=0}^{n} \sqrt{w_i} \sqrt{w_j}\, \hat{p}_k(x_i) \hat{p}_k(x_j) = \delta_{ij} \]
\[ w_j \sum_{k=0}^{n} \hat{p}_k^2(x_j) = 1. \]
So $\sqrt{w_j}\, [\hat{p}_0(x_j), \hat{p}_1(x_j), \dots, \hat{p}_n(x_j)]^\mathsf{T}$ is a unit eigenvector of $\mathbf{J}_n$. Let’s examine
the first component of this unit eigenvector $\sqrt{w_j}\, \hat{p}_0(x_j)$. Because $p_0(x) \equiv 1$, we
have $\hat{p}_0(x) \equiv 1/\|1\|$. Therefore, $\sqrt{w_j}$ simply equals $\|1\|$ times the first component
of the unit eigenvector of $\mathbf{J}_n$. We can summarize all of this in the following
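\[ \begin{bmatrix} a_0 & \sqrt{b_1} & & & \\ \sqrt{b_1} & a_1 & \sqrt{b_2} & & \\ & \ddots & \ddots & \ddots & \\ & & \sqrt{b_{n-1}} & a_{n-1} & \sqrt{b_n} \\ & & & \sqrt{b_n} & a_n \end{bmatrix}. \]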
The weights are k1k 2 times the square of the first element of the unit eigenvectors.
Parameters for common Gaussian quadrature rules are summarized in Fig-
ure 11.4 on page 313. The position of the nodes of Gauss–Legendre and Gauss–
Hermite quadrature are shown in the figure above. We can directly compute the
nodes and weights of Gauss–Chebyshev quadrature using 𝑥 𝑘 = cos((2𝑘 −1)𝜋/2𝑛)
and $w_k = \pi/n$, so we don’t need the Golub–Welsch algorithm in this case. The
Golub–Welsch algorithm takes $O(n^2)$ operations to compute the eigenvalue-
eigenvector pairs of a symmetric matrix. More recently, faster $O(n)$ algorithms
have been developed that use Newton’s method to solve $p_n(x) = 0$ and evaluate
$p_n(x)$ and $p_n'(x)$ by three-term recurrence or use asymptotic expansion.3
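Because the Gauss–Chebyshev nodes and weights are available in closed form, the rule takes only a few lines. The sketch below builds the $n$-point rule for integrals of the form $\int_{-1}^{1} f(x)/\sqrt{1 - x^2}\,\mathrm{d}x$ and checks it on $f(x) = x^2$, whose weighted integral is $\pi/2$; the function name `gauss_chebyshev` is chosen here for illustration.

```julia
using LinearAlgebra

# n-point Gauss–Chebyshev rule for ∫ f(x)/√(1 - x²) dx over [-1, 1]
function gauss_chebyshev(n)
    nodes = [cos((2k - 1) * π / (2n)) for k in 1:n]
    weights = fill(π / n, n)
    nodes, weights
end

nodes, weights = gauss_chebyshev(3)
dot(nodes .^ 2, weights)      # compare with the exact value π/2
```

With three nodes the rule is exact for polynomials up to degree five, so the result should match $\pi/2$ to rounding error.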
We can implement a naïve Gauss–Legendre quadrature with n nodes using
using LinearAlgebra
f = x -> cos(x)*exp(-x^2)
nodes, weights = gauss_legendre(n)
f.(nodes) ⋅ weights
where the nodes and weights are computed using the Golub–Welsch algorithm
function gauss_legendre(n)
  a = zeros(n)
  b = (1:n-1).^2 ./ (4*(1:n-1).^2 .- 1)
  𝟙2 = 2    # ‖1‖², the integral of the unit weight over [-1,1]
  λ, v = eigen(SymTridiagonal(a, sqrt.(b)))
  (λ, 𝟙2*v[1,:].^2)
end
3 Algorithms using Ignace Bogaert’s explicit asymptotic formulas can easily generate over a
billion Gauss–Legendre nodes and weights to 16 digits of accuracy. See Townsend [2015].
Gauss–Kronrod quadrature
for some 𝑠(𝑥) ∈ P𝑛 and 𝑟 (𝑥) ∈ P2𝑛 because the number of coefficients of 𝑓 (𝑥)
must equal the number of coefficients of 𝑟 (𝑥) plus the number of coefficients of
𝑠(𝑥). Furthermore, if
\[ \int_a^b p(x) q(x) x^k \omega(x) \,\mathrm{d}x = 0 \quad\text{for each } k = 0, 1, \dots, n \tag{11.6} \]
We just need to find a suitable 𝑞(𝑥) so that (11.6) holds. In the 1960s,
Aleksandr Kronrod examined this problem for the interval [−1, 1] and unit
weight function by taking 𝑝(𝑥) = P𝑛 (𝑥) to be a degree-𝑛 Legendre polynomial
and 𝑞(𝑥) = K𝑛+1 (𝑥) to be the degree-(𝑛 + 1) Stieltjes polynomial associated
with P𝑛 (𝑥). In general, if 𝑃𝑛 (𝑥) is any degree-𝑛 orthogonal polynomial over the
interval (𝑎, 𝑏) with the associated weight 𝜔(𝑥), then the Stieltjes polynomial
𝐾𝑛+1 (𝑥) associated with 𝑃𝑛 (𝑥) is a degree-(𝑛 + 1) polynomial defined by the
orthogonality condition
\[ \int_a^b K_{n+1}(x) P_n(x) x^k \omega(x) \,\mathrm{d}x = 0 \quad\text{for } k = 0, 1, \dots, n. \]
We can generate a Kronrod quadrature rule once we determine the 2𝑛+1 nodes
𝑥𝑖 of K𝑛+1 (𝑥)P𝑛 (𝑥) along with the associated quadrature weights 𝑤 𝑖 . One method
4 Another is the Stieltjes polynomials associated with the Chebyshev polynomials of the first kind
T𝑛 ( 𝑥), which happens to be (1− 𝑥 2 )U𝑛−2 ( 𝑥), where U𝑛 ( 𝑥) is a Chebyshev polynomial of the second
kind—with weight 𝜔 ( 𝑥) = (1 − 𝑥 2 ) over [−1, 1]. A third—the associated Stieltjes polynomials for
the Chebyshev polynomials of the second kind U𝑛 ( 𝑥)—are the Chebyshev polynomials of the first
kind T𝑛 ( 𝑥).
Gaussian quadrature nodes are strictly in the interior of the interval of integration.
We can modify Gaussian quadrature to add a node to one endpoint, called Gauss–
Radau quadrature, or to both endpoints, called Gauss–Lobatto quadrature.
The nodes and weights for Gauss–Radau and Gauss–Lobatto quadratures can be
determined by using the Golub–Welsch algorithm with a modified Jacobi matrix
or by using Newton’s method. (See Gautschi [2006].)
The Golub–Welsch algorithm can generate the nodes and weights of an $(n + 1)$-
point Gauss–Legendre–Radau quadrature by replacing $a_n = \pm n/(2n - 1)$ for
a node at $\pm 1$. And, the algorithm can generate the nodes and weights of an
$(n + 1)$-point Gauss–Legendre–Lobatto quadrature by replacing $b_n = n/(2n - 1)$
for nodes at $\pm 1$.
Because Gauss–Radau quadrature fixes a node to be an endpoint, it removes
one degree of freedom from the set of nodes and weights. By using it, we can
fit at most a degree-2𝑛 polynomial with a set of 𝑛 + 1 nodes. Gauss–Lobatto
quadrature fixes nodes at both endpoints, removing two degrees of freedom. So
we can fit at most a degree-(2𝑛 − 1) polynomial with a set of 𝑛 + 1 nodes.
Quadrature in practice
When the length of a needle is exactly half the width of a strip of wood, the
probability equals 1/𝜋, giving us an experimental way of computing 𝜋, albeit not
a very efficient one. Take a large number of needles, perhaps a thousand or more.
Drop them on the floor and count those that cross a line. Then, the number of
needles divided by that count should give us a rough approximation of 𝜋. The
more needles we use, the better the approximation.
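The needle experiment is easy to imitate on a computer. The following is only a sketch: the function name `buffon`, the unit needle length, and the way each drop is sampled (distance from the needle's center to the nearest line, and the acute angle to the lines) are our own choices for illustration. The simulation itself uses 𝜋 to draw the angle, so it demonstrates the statistics rather than offering a practical way to compute 𝜋.

```julia
using Random

# Estimate π by dropping N needles of length ℓ on strips of width d = 2ℓ.
function buffon(N; ℓ=1.0, d=2.0, rng=Random.default_rng())
    crossings = 0
    for _ in 1:N
        y = d/2 * rand(rng)              # distance from needle center to nearest line
        θ = π/2 * rand(rng)              # acute angle between the needle and the lines
        crossings += (ℓ/2)*sin(θ) ≥ y    # the needle crosses a line
    end
    return N / crossings                 # crossing probability is 1/π when d = 2ℓ
end

estimate = buffon(10^6; rng=MersenneTwister(1))
```

With a million needles, the estimate is typically correct to only two or three digits, which illustrates how slowly the approach converges.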
Monte Carlo integration 321
by Ulam’s uncle, who would borrow money from relatives because he “just had to go to Monte
Carlo,” the world-famous gambling destination. The probabilistic programming language Stan (with
interfaces in Julia, Python, and Matlab) was named in honor of Stanislaw Ulam.
322 Differentiation and Integration
Now, we just need to estimate the expected value $\mathrm{E}[f(X)]$, which we can do in
practice by sampling. In general, we want to compute $F = \int_\Omega f(\mathbf{x})\,\mathrm{d}\mathbf{x}$. We’ll
define the 𝑁-sample Monte Carlo estimate for 𝐹 as
$$F_N = \frac{1}{N}\sum_{i=0}^{N-1} \frac{f(X_i)}{p_X(X_i)}.$$
The law of large numbers says that the sample average converges to the expected
value as the sample size approaches infinity. For arbitrarily small $\varepsilon > 0$, the probability
$$\lim_{N\to\infty} P\bigl(\,|F_N - \mathrm{E}[F_N]| > \varepsilon\,\bigr) = 0.$$
where 𝜎 2 [𝑌𝑖 ] is the sample variance of 𝑌𝑖 = 𝑓 (𝑋𝑖 )/𝑝 𝑋 (𝑋𝑖 ). The second
equality comes from two properties of variance—that 𝜎 2 [𝑎𝑋] = 𝑎 2 𝜎 2 [𝑋] and
that 𝜎 2 [𝑋1 + 𝑋2 ] = 𝜎 2 [𝑋1 ] + 𝜎 2 [𝑋2 ] when 𝑋1 and 𝑋2 are independent random
variables. The last equality holds because 𝑋𝑖 are independent and identically
distributed variables. The standard error of the mean is then
$$\sigma[F_N] = \frac{\sigma[Y]}{\sqrt{N}},$$
and the convergence rate is 𝑂 (𝑁 −1/2 ). Such a convergence rate is relatively poor
compared to even a simple Riemann sum with a convergence rate of 𝑂 (𝑁 −1 ) for
one-dimensional problems. But, the error of the Monte Carlo method does not
depend on the number of dimensions!
Using 𝑁 nodes to integrate a function over a one-dimensional domain using
a simple Riemann sum approximation results in an error of 𝑂 (1/𝑁). For
the 𝑑-dimensional problem, we would need 𝑁 𝑑 quadrature points to get an
error 𝑂 (1/𝑁). That is, for 𝑁 points, we get an error of 𝑂 (𝑁 −1/𝑑 ). When the
dimension is more than two, the Monte Carlo method wins out over simple
first-order Riemann sums. Furthermore, we can use techniques to reduce the
sampling variance $\sigma^2[Y]$. Specifically, we can choose samples from a probability
density function $p_X(\mathbf{x})$ that has a similar shape to $f(\mathbf{x})$.
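To make the estimate and its standard error concrete, here is a minimal sketch that integrates a function over the unit hypercube using uniform samples, so that the density is identically one. The function name `montecarlo` and the test integrand are our own choices for illustration.

```julia
using Random, Statistics

# N-sample Monte Carlo estimate of the integral of f over [0,1]^d
# using the uniform density p_X = 1.
function montecarlo(f, d, N; rng=Random.default_rng())
    Y = [f(rand(rng, d)) for _ in 1:N]   # Yᵢ = f(Xᵢ)/p_X(Xᵢ) with p_X = 1
    return mean(Y), std(Y)/sqrt(N)       # estimate and standard error of the mean
end

f(x) = sum(abs2, x)                      # exact integral over [0,1]^d is d/3
F, se = montecarlo(f, 6, 10^5; rng=MersenneTwister(2))
```

The reported standard error shrinks like $N^{-1/2}$ whether the dimension is six or sixty. A tensor-product Riemann sum with the same number of points in six dimensions would have fewer than seven points per axis.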
a sampled value will be less than 𝑥. The inverse distribution function $P^{-1}(y)$
allows us to sample from $p(x)$ by sampling $y$ from a uniform distribution. For
example, the standard normal distribution $p(x) = \mathrm{e}^{-x^2/2}/\sqrt{2\pi}$ has the
cumulative distribution $P(x) = \tfrac12\bigl[1 + \operatorname{erf}(x/\sqrt2)\bigr]$, a rescaled error
function. To sample from a normal distribution, we would compute
$P^{-1}(y) = \sqrt2\,\operatorname{erf}^{-1}(2y - 1)$, where $y$ is sampled from a
uniform distribution. To sample from an exponential distribution $p(x) = \alpha\mathrm{e}^{-\alpha x}$,
we first calculate $P(x) = \int_0^x \alpha\mathrm{e}^{-\alpha s}\,\mathrm{d}s = 1 - \mathrm{e}^{-\alpha x}$. Then taking its inverse, we
get $P^{-1}(y) = -\alpha^{-1}\log(1 - y)$.
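As a quick check of the exponential case, the following sketch pushes uniform samples through the inverse distribution function. The function name `sample_exponential` is our own.

```julia
using Random, Statistics

# Inversion sampling for p(x) = αe^(-αx) using P⁻¹(y) = -log(1 - y)/α.
sample_exponential(α, N; rng=Random.default_rng()) = -log.(1 .- rand(rng, N)) ./ α

x = sample_exponential(2.0, 10^5; rng=MersenneTwister(3))
```

The sample mean should be close to 1/𝛼, the mean of the exponential distribution.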
2. Now, sample 𝑦 from the uniform distribution 𝑈 (0, 1). If 𝑦 < 𝑓 (𝑥)/𝑀𝑔(𝑥),
then accept the proposal as a sample drawn from 𝑝(𝑥). Otherwise, reject
it and start again with step one, repeating until you have a proposal 𝑥 that
is accepted.
3. To draw another sample from 𝑝(𝑥), start over with step one.
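The steps above can be sketched in a few lines. We assume that step one draws a proposal 𝑥 from 𝑔(𝑥), and for simplicity we take 𝑔 to be the uniform density on [0, 1]. The function name `acceptreject` and the target density are our own choices for illustration.

```julia
using Random, Statistics

# Draw one sample from p ∝ f using proposals from the uniform density g = 1 on [0,1].
function acceptreject(f, M; rng=Random.default_rng())
    while true
        x = rand(rng)                 # step 1 (assumed): propose x from g(x)
        y = rand(rng)                 # step 2: sample y from U(0,1)
        y < f(x)/M && return x        # accept when y < f(x)/(M g(x)), here g(x) = 1
    end
end

f(x) = 6x*(1 - x)                     # Beta(2,2) density, bounded by M = 1.5
rng = MersenneTwister(4)
samples = [acceptreject(f, 1.5; rng=rng) for _ in 1:10^5]
```

On average, one in every 𝑀 proposals is accepted when 𝑓 and 𝑔 are both normalized, so here about a third of the draws are thrown away.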
Markov chain Monte Carlo The accept-reject method draws each sample
from 𝑔(𝑥). When 𝑔(𝑥) is not similar to 𝑝(𝑥), we may need to choose a huge 𝑀
to ensure 𝑓 (𝑥) is bounded by 𝑀𝑔(𝑥) for all 𝑥. Consequently, we may frequently
reject our draws from 𝑔(𝑥), and the accept-reject method will be inefficient.
However, when we accept a draw, that draw is sampled from 𝑝(𝑥). Intuitively,
where we’ve already found a successful draw, we are more likely to find subsequent
successful draws. A clever idea would be using the previous accepted draw to
choose where to draw next rather than sampling from 𝑔(𝑥)—that’s a Markov chain.
But, by doing so, we are not drawing the samples independently because they
depend on the previous draw! Unless we draw from the stationary distribution of
the Markov chain!
we have
$$f(x)\,g(x^* \mid x)\,\alpha(x \to x^*) = f(x^*)\,g(x \mid x^*)\,\alpha(x^* \to x).$$
Equivalently,
$$\frac{\alpha(x \to x^*)}{\alpha(x^* \to x)} = \frac{f(x^*)}{f(x)} \cdot \frac{g(x \mid x^*)}{g(x^* \mid x)}.$$
The right-hand side is the product of two ratios $r_f$ and $r_g$.
$$\alpha(x \to x^*) = \min(1, r_f \cdot r_g)$$
the notation of the rest of the book. In practice, the convention is to use right stochastic matrices
(whose rows sum to one), i.e., p𝑛 = p𝑛−1 T, because we often want 𝑇𝑖 𝑗 to represent the transition
probability from state 𝑖 to state 𝑗.
If the density 𝑝(𝑥) at the candidate point 𝑥∗ is greater than the density at the
current point 𝑥𝑖, then we always move to the candidate point. But, if the density
at the candidate point is less than the density at the current point, we only move
with probability 𝛼. In other words, we move if the probability of drawing 𝑥∗ as a
sample is greater than that of drawing 𝑥 as a sample. Otherwise, we move with
probability equal to the ratio 𝑝(𝑥∗)/𝑝(𝑥).
Suppose that 𝑔(𝑥 ∗ | 𝑥) is a normal distribution centered at 𝑥 with a standard
deviation 𝜎. The parameter 𝜎 controls the size of the random walk step. Larger
steps may move us into areas of low probability and smaller steps may not move
around the domain much at all. Consequently, we need to carefully choose 𝜎 to
make the algorithm efficient. See Figure 11.7 on the next page.
Let’s write out the steps of this Markov chain Monte Carlo (MCMC) approach,
called the Metropolis–Hastings algorithm or the Metropolis algorithm when
𝑔(𝑥 | 𝑥 ∗ ) is symmetric.7 Choose an arbitrary 𝑥0 . For each iteration,
3. To draw another sample, set 𝑥 𝑛+1 to 𝑥 ∗ and start over with step one.
Because our initial guess 𝑥 0 might not be representative of a sample drawn from
the distribution (and by random walk neither would 𝑥1 , 𝑥2 , and so on), it’s often
advised to throw out the first several iterations, called the burn-in or warm-up period.
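A random-walk Metropolis sampler with burn-in might look like the following sketch. The function name `metropolis`, the keyword arguments, and the choice of an unnormalized standard normal target are our own assumptions for illustration. Because the normal proposal is symmetric, the ratio $r_g$ equals one and only $r_f$ appears.

```julia
using Random, Statistics

# Random-walk Metropolis sampler for an unnormalized density f with a
# symmetric normal proposal g(x*|x) of standard deviation σ.
function metropolis(f, x₀, N; σ=1.0, burnin=1000, rng=Random.default_rng())
    x, chain = x₀, Float64[]
    for n in 1:(N + burnin)
        xstar = x + σ*randn(rng)               # propose a candidate x*
        if rand(rng) < min(1, f(xstar)/f(x))   # accept with probability α = min(1, r_f)
            x = xstar
        end
        n > burnin && push!(chain, x)          # discard the burn-in iterations
    end
    return chain
end

f(x) = exp(-x^2/2)                             # unnormalized standard normal
chain = metropolis(f, 5.0, 10^5; rng=MersenneTwister(5))
```

The chain starts far out in the tail at 𝑥 = 5, so the early iterations are not representative and are dropped. Successive draws are correlated, so the effective sample size is smaller than the length of the chain.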
The original von Neumann Monte Carlo method teleports across the domain
seeking out new samples. The Metropolis Markov chain Monte Carlo method
takes a random walk around the domain. Newer Hamiltonian Monte Carlo
methods follow smooth trajectories of systems in equilibrium determined by
the terrain of the logarithm of the distribution. Such methods use automatic
differentiation to compute gradients and symplectic integrators, which we’ll
discuss in the next chapter, to compute the trajectories.
Example. We don’t need to directly compute the inverse error function to sample
from the standard normal distribution. We could instead use the Box–Muller
7 The Metropolis algorithm is named after the 1953 article “Equation of State Calculations
by Fast Computing Machines,” by Nicholas Metropolis and the husband and wife duos Arianna
and Marshall Rosenbluth and Augusta and Edward Teller. While the authors’ names appeared
alphabetically by convention, the Rosenbluths made the primary contributions and the final code was
entirely written by Arianna Rosenbluth.
Figure 11.7: The left plot shows a distribution (bold line) with a histogram of
2000 samples drawn from that distribution using the Metropolis algorithm. The
right plot shows the trace plot of the sample draw at each iteration.
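transform.8 Consider
$$p(x)\,p(y) = \frac{1}{2\pi}\,\mathrm{e}^{-(x^2+y^2)/2}.$$
By making the change of variable $x = r\cos\theta$ and $y = r\sin\theta$, we have the
differential element
$$p(x)\,p(y)\,\mathrm{d}x\,\mathrm{d}y = \frac{1}{2\pi}\,\mathrm{e}^{-r^2/2}\,r\,\mathrm{d}r\,\mathrm{d}\theta = \Bigl(r\mathrm{e}^{-r^2/2}\,\mathrm{d}r\Bigr)\Bigl(\frac{1}{2\pi}\,\mathrm{d}\theta\Bigr).$$
We want to sample $r$ from the density $r\mathrm{e}^{-r^2/2}$ and $\theta$ from $1/2\pi$. To do this, we
will use inversion sampling:
$$u = \int_0^r r\mathrm{e}^{-r^2/2}\,\mathrm{d}r = 1 - \mathrm{e}^{-r^2/2} \quad\text{and}\quad v = \int_0^\theta \frac{1}{2\pi}\,\mathrm{d}\theta = \frac{\theta}{2\pi}.$$
Solving for $r$ and $\theta$ gives us $r = \sqrt{-2\log(1-u)}$ and $\theta = 2\pi v$ where $u$ and $v$
are independently sampled from the uniform distribution $U(0,1)$. Because $u$
is taken uniformly from $(0,1)$, we could also sample $1 - u$ and instead take
$r = \sqrt{-2\log u}$. Therefore,
$$x = \sqrt{-2\log u}\,\cos(2\pi v) \quad\text{and}\quad y = \sqrt{-2\log u}\,\sin(2\pi v)$$
gives two independent normally distributed samples.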
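A direct transcription of the transform into Julia could look like this sketch. The function name `boxmuller` is our own, and we use `1 - rand()` for 𝑢 only to keep the logarithm finite.

```julia
using Random, Statistics

# Box–Muller transform: two uniform samples u and v give two
# independent standard normal samples.
function boxmuller(rng=Random.default_rng())
    u, v = 1 - rand(rng), rand(rng)     # 1 - rand lies in (0,1], so log(u) is finite
    r = sqrt(-2log(u))
    return r*cos(2π*v), r*sin(2π*v)
end

rng = MersenneTwister(6)
z = Float64[]
for _ in 1:50_000
    x, y = boxmuller(rng)
    push!(z, x, y)
end
```

The collected samples should have a mean near zero and a standard deviation near one.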
8 The transform was introduced in 1958 by statistician George E. P. Box, famously known for his
aphorism “all models are wrong, but some are useful,” and mathematician Mervin Muller.
11.6 Exercises
11.1. Occasionally, one may need to compute a one-sided derivative to, say, enforce
a boundary condition or model a discontinuity.
1. Find a third-order approximation to the derivative 𝑓′(𝑥) with nodes at 𝑥,
𝑥 + ℎ, 𝑥 + 2ℎ and 𝑥 + 3ℎ for some stepsize ℎ.
2. What choice of stepsize ℎ minimizes the total (round-off and truncation)
error of the derivative of sin 𝑥 using this approximation with double-precision
floating-point numbers?
3. Find a second-order approximation for the second derivative 𝑓″(𝑥) for the
same set of nodes.
9 Xoshiro is a portmanteau of XOR, shift, and rotate.
Exercises 329
11.9. Two cubes, each with equal size and equal uniform density,
are in contact along one face. The gravitational force between
two point masses separated by a distance 𝑟 is
$$F = G\,\frac{m_1 m_2}{r^2},$$
where 𝐺 is the gravitational constant and 𝑚 1 and 𝑚 2 are the magnitudes of the
two masses. What is the total gravitational force between the two cubes? (This
question was proposed by mathematician Nick Trefethen as one of his “ten-digit
10 Finite element methods work similarly. Whereas spectral methods use global basis functions
that span the entire computational domain, finite element methods use basis functions that are only
locally non-zero, such as B-splines.
linear operator
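$$\begin{bmatrix} p'(t_1) \\ p'(t_2) \\ \vdots \\ p'(t_n) \end{bmatrix} = \mathbf{M}\left( \begin{bmatrix} p(t_1) \\ p(t_2) \\ \vdots \\ p(t_n) \end{bmatrix} - \begin{bmatrix} p(t_0) \\ p(t_0) \\ \vdots \\ p(t_0) \end{bmatrix} \right), \tag{11.8}$$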
where 𝑡0 , 𝑡1 , 𝑡2 , . . . , 𝑡 𝑛 are Legendre–Gauss–Lobatto nodes or Chebyshev nodes
with 𝑡 0 = 0 and 𝑡 𝑛 = 1. Once we’ve determined the differentiation matrix M, we
solve the system M(p − 𝑝(𝑡0 )1) = 𝑓 (p, 𝑡).
Any study of numerical methods for partial differential equations (PDEs) likely
starts by examining numerical methods for ordinary differential equations (ODEs).
One reason is that numerical techniques used to solve time-dependent partial
differential equations often employ the method of lines. The method of lines
discretizes the spatial component and solves the resulting semi-discrete problem
using standard ODE solvers. Consider the heat equation
$\frac{\partial}{\partial t}u(\mathbf{x},t) = \Delta u(\mathbf{x},t)$,
where $\Delta = \sum_i \partial^2/\partial x_i^2$ is the Laplacian operator. The heat equation can be
formally solved as a first-order linear ordinary differential equation to get the
solution 𝑢(x, 𝑡) = e𝑡Δ 𝑢(x, 0), where e𝑡Δ is a linear operator acting on the
initial conditions 𝑢(x, 0). A typical numerical solution involves discretizing the
Laplacian operator Δ in space to create a system of ODEs and then using an
ODE solver in time. A second reason to dig deep into ODEs is that the theory
of PDEs builds on the theory of ODEs. Understanding stability, consistency,
and convergence of numerical schemes for ordinary differential equations will
help understand stability, consistency, and convergence of numerical schemes
for partial differential equations. To dig even deeper, see Hairer, Nørsett, and
Wanner’s two-volume compendium Solving Ordinary Differential Equations.
Consider the initial value problem $u' = f(t,u)$ with $u(0) = u_0$. An initial value
problem is said to be well-posed if (1) the solution exists, (2) the solution is
unique, and (3) the solution depends continuously on the initial conditions. Let’s
examine these conditions for well-posedness.
336 Ordinary Differential Equations
𝑢(𝑡) = .
The solution blows up at 𝑡 = 1, so it does not even exist on the interval [1, 2].
Now, consider the problem $u' = \sqrt{u}$ with $u(0) = 0$ over the interval $t \in [0, 2]$.
This problem has the solution $u(t) = \tfrac14 t^2$. It also has the solution $u(t) = 0$. In
fact, it has infinitely many solutions of the form
$$u(t) = \begin{cases} 0, & t < t_c \\ \tfrac14 (t - t_c)^2, & t \ge t_c \end{cases}$$
where the critical parameter $t_c$ is arbitrary. So, when 𝑓 (𝑡, 𝑢) is not Lipschitz
continuous in 𝑢, the solution may fail to exist; or if it exists, it may fail to be
unique. We state the following theorem without proof.
Theorem 46. If 𝑓 (𝑡, 𝑢) is Lipschitz continuous in 𝑢, then the initial value problem
$u'(t) = f(t, u(t))$ with $u(0) = u_0$ is stable.
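Let $\varepsilon$ be the maximum of $\varepsilon_0$ and the supremum of $|\delta(t)|$ over $[0, T]$. Then
$|e'(t)| \le L|e(t)| + \varepsilon$ and $|e(0)| \le \varepsilon$. It follows that
$$|e(t)| \le \frac{\varepsilon}{L}\Bigl[(L+1)\mathrm{e}^{Lt} - 1\Bigr] \le \frac{\varepsilon}{L}\Bigl[(L+1)\mathrm{e}^{LT} - 1\Bigr] = \varepsilon K.$$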
The error is bounded for 0 ≤ 𝑡 ≤ 𝑇, so the initial value problem is stable.
Remember the following takeaways over the rest of this chapter and the
remainder of this book. First, well-posedness requires existence, uniqueness, and
stability. Second, if 𝑓 (𝑡, 𝑢) is continuous in 𝑡 and Lipschitz continuous in 𝑢, then
the problem $u' = f(t,u)$ is well-posed.
Let’s discretize 𝑡 by taking 𝑛 uniform steps of size 𝑘, i.e., 𝑡 𝑛 = 𝑛𝑘. Except for
Chapter 16, the remainder of this book will adopt a fairly standard notation with
𝑘 representing a step size in time and ℎ representing a step size in space. Using
𝑘 in place of 𝛥𝑡 and ℎ in place of 𝛥𝑥 will help tidy some otherwise messier
expressions. The derivative $u'$ can be approximated by difference operators:
$$\begin{array}{ll}
\text{forward difference} & \delta_+ U^n = \dfrac{U^{n+1} - U^n}{k} \\[1.5ex]
\text{backward difference} & \delta_- U^n = \dfrac{U^n - U^{n-1}}{k} \\[1.5ex]
\text{central difference} & \delta_0 U^n = \dfrac{U^{n+1} - U^{n-1}}{2k}
\end{array}$$
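$$\begin{array}{lll}
\text{Forward Euler (explicit)} & O(k) & \dfrac{U^{n+1} - U^n}{k} = f(U^n) \quad (12.2) \\[1.5ex]
\text{Backward Euler (implicit)} & O(k) & \dfrac{U^{n+1} - U^n}{k} = f(U^{n+1}) \quad (12.3) \\[1.5ex]
\text{Leapfrog (explicit)} & O(k^2) & \dfrac{U^{n+1} - U^{n-1}}{2k} = f(U^n) \quad (12.4) \\[1.5ex]
\text{Trapezoidal (implicit)} & O(k^2) & \dfrac{U^{n+1} - U^n}{k} = \dfrac{f(U^{n+1}) + f(U^n)}{2} \quad (12.5)
\end{array}$$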
Figure 12.1: Integral curves and numerical solutions using the forward Euler,
backward Euler, leapfrog, and trapezoidal methods.
The forward Euler method uses the slope of the integral curve at ① to take a
step forward. We are bumped off the original curve and find ourselves at ② on
a neighboring one. The backward Euler method uses the slope of the integral curve at
② to take a step forward. Finding the slope of the integral curve at ② without
explicitly knowing the value of the integral curve at ② is not trivial. Like the
forward Euler method, we also find ourselves bumped off the integral curve when
using the backward Euler method. The trapezoidal method averages the slopes
of the forward Euler and backward Euler to take a step forward. We still get
bumped off the original integral curve, but it’s not as far this time. The leapfrog
method is not a single-step method. It uses the slope of the integral curve at ②
to take a double step from ① to ③. Like the trapezoidal method, the leapfrog
method performs better than either of the Euler methods.
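We can check these observations numerically. The following sketch applies the four schemes (12.2)–(12.5) to the linear problem $u' = \lambda u$ with $u(0) = 1$, for which the implicit schemes can be solved for $U^{n+1}$ by hand. The function name `scheme_errors` and the choice to start the leapfrog scheme from the exact value at $t_1$ are our own assumptions.

```julia
# Apply the four schemes to u′ = λu with u(0) = 1 and return the errors at t = 1.
function scheme_errors(n; λ=-1.0)
    k = 1/n
    fe = be = tr = 1.0
    for _ in 1:n
        fe = fe + k*λ*fe                      # forward Euler
        be = be/(1 - k*λ)                     # backward Euler, solved for Uⁿ⁺¹
        tr = tr*(1 + k*λ/2)/(1 - k*λ/2)       # trapezoidal, solved for Uⁿ⁺¹
    end
    lf_prev, lf = 1.0, exp(λ*k)               # leapfrog needs two starting values
    for _ in 2:n
        lf_prev, lf = lf, lf_prev + 2k*λ*lf   # leapfrog double step
    end
    return abs.((fe, be, tr, lf) .- exp(λ))
end

e1, e2 = scheme_errors(100), scheme_errors(200)
orders = log2.(e1 ./ e2)    # observed orders: Euler methods, trapezoidal, leapfrog
```

Halving the step size should roughly halve the error of the two Euler methods and quarter the error of the trapezoidal and leapfrog methods.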
To determine 𝑢(𝑡 + 𝑘) for some time step 𝑘 given a known value 𝑢(𝑡), we can
use the Taylor series approximation of 𝑢(𝑡 + 𝑘):
$$u(t + k) = u(t) + k u'(t) + \tfrac12 k^2 u''(t) + O(k^3).$$
Figure 12.2: Plot of the multiplier 𝑟 (𝜆𝑘) for several numerical schemes along
the real 𝜆𝑘-axis compared to the exact solution multiplier e𝜆𝑘 .
The exact solution e𝜆𝑘 is depicted as a dotted curve. Approximations are good
when 𝜆𝑘 is close to zero. The backward Euler and forward Euler methods are
both first-order methods. The leapfrog and trapezoidal methods as second-order
methods are better approximations. The leapfrog method has two solutions—one
that matches the exact solution fairly well and another one that doesn’t—the
absolute value of this other solution is depicted using a dashed line. This second
branch leads to stability issues for the leapfrog method when the real part of
𝜆𝑘 is negative and not close to zero. The backward Euler method blows up at
𝜆𝑘 = 1, and the trapezoidal method blows up at 𝜆𝑘 = 2. Note that the backward
Euler multiplier tends to 0 and the trapezoidal multiplier tends to −1 as 𝜆𝑘 → −∞. We’ll
return to this limiting behavior when discussing L-stable and A-stable methods.
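The multipliers are simple enough to tabulate directly. In this sketch, the formulas follow from applying (12.2)–(12.5) to $f(u) = \lambda u$ and writing $z = \lambda k$. The function names are our own.

```julia
# Multipliers r(z), with z = λk, of each scheme applied to u′ = λu.
r_forward(z)     = 1 + z
r_backward(z)    = 1/(1 - z)
r_trapezoidal(z) = (1 + z/2)/(1 - z/2)
r_leapfrog(z)    = (z + sqrt(z^2 + 1), z - sqrt(z^2 + 1))  # roots of r² - 2zr - 1 = 0

z = -0.1
gaps = abs.((r_forward(z), r_backward(z), r_trapezoidal(z), r_leapfrog(z)[1]) .- exp(z))
```

Near zero, the second-order multipliers sit much closer to $\mathrm{e}^{z}$ than the first-order ones, while the second leapfrog root already has magnitude greater than one.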
We’ll examine consistency in more depth when we look at multistep methods.
A numerical method is stable if, for any sufficiently small step size 𝑘, a small
perturbation in the numerical solution (perhaps the initial value) produces a
bounded change in the numerical solution at a given subsequent time. Further-
more, a numerical method is absolutely stable if, for any sufficiently small step
size 𝑘, the change due to a perturbation of size 𝛿 is not larger than 𝛿 at any of the
subsequent time steps. In other words, a method is stable if perturbations are
bounded over a finite time, and it is absolutely stable if perturbations shrink or at
least do not grow over time. See the figure on the next page.
Do not equate stability with consistency—a method can be both quite stable
and quite wrong. Still, absolute stability is desirable because numerical errors will
not only not grow, they will diminish over time. Let’s determine a condition under
which a numerical method is absolutely stable. We’ll limit the discussion to linear
methods 𝑈 𝑛+1 = L(𝑈 𝑛 , 𝑈 𝑛−1 , . . . ). The discussion applies to nonlinear methods
insofar as those methods can be linearized in a meaningful and representative way.
Let 𝑈 0 be an initial condition and 𝑉 0 be a perturbation of that initial condition.
The error in the initial condition is 𝑒 0 = 𝑉 0 − 𝑈 0 , and the error at time 𝑡 𝑛 = 𝑛𝑘 is
𝑒 𝑛 = 𝑉 𝑛 − 𝑈 𝑛 . Since
$$|e^{n+1}| = \left|\frac{e^{n+1}}{e^n}\right| \cdot \left|\frac{e^n}{e^{n-1}}\right| \cdots \left|\frac{e^1}{e^0}\right| \cdot |e^0|,$$
a sufficient condition for the error to be nonincreasing is for $|e^{n+1}/e^n| \le 1$ for all
$n$. Because the method is linear,
Consider the simple test problem $u'(t) = \lambda u(t)$ for an arbitrary complex value 𝜆.
For what time step 𝑘 is a numerical scheme absolutely stable? We can answer
this question by determining the values 𝑧 = 𝜆𝑘 for which the growth factor
|𝑟 | = |𝑈 𝑛+1 /𝑈 𝑛 | ≤ 1. The region in the 𝜆𝑘-plane where |𝑟 | ≤ 1 is called the
region of absolute stability. We’ll use it to identify an appropriate numerical
method. Let’s find the regions of absolute stability for the forward Euler scheme,
the backward Euler scheme, the leapfrog scheme, and the trapezoidal scheme.
Forward Euler. The forward Euler method for $u' = \lambda u$ is
$$\frac{U^{n+1} - U^n}{k} = \lambda U^n,$$
Single-step methods 343
[Figure: rows of plots for the forward Euler method, the backward Euler method, and two further methods.]
$$\frac{U^{n+1} - U^n}{k} = \lambda U^{n+1}.$$
is a damped sine wave that dissipates within a few oscillations when 𝑘 is large.
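Leapfrog. We can rewrite
$$\frac{U^{n+1} - U^{n-1}}{k} = 2\lambda U^n$$
in terms of the growth factor $r = U^{n+1}/U^n = U^n/U^{n-1}$ as
$$r^2 - 2\lambda k r - 1 = 0.$$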
Lax equivalence theorem 345
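We want to find when $|r| \le 1$ in the complex plane, so let’s take $r = \mathrm{e}^{\mathrm{i}\theta}$ as the
boundary. We have
$$\mathrm{e}^{2\mathrm{i}\theta} - 2\lambda k\,\mathrm{e}^{\mathrm{i}\theta} - 1 = 0.$$
So $\mathrm{e}^{\mathrm{i}\theta} - \mathrm{e}^{-\mathrm{i}\theta} = 2\lambda k$, from which $\mathrm{i}\sin\theta = \lambda k$. Therefore, the leapfrog scheme is
absolutely stable only on the imaginary axis $\lambda k \in [-\mathrm{i}, +\mathrm{i}]$.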
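A short numerical check makes the thinness of this region tangible. The sketch below computes the larger of the two root magnitudes of $r^2 - 2zr - 1 = 0$ for a few values of $z = \lambda k$. The function name `leapfrog_growth` is our own.

```julia
# Largest root modulus of r² - 2zr - 1 = 0 for complex z = λk.
function leapfrog_growth(z)
    s = sqrt(complex(z)^2 + 1)
    return max(abs(z + s), abs(z - s))
end

on_axis  = leapfrog_growth(0.5im)          # z = i sin θ with |z| ≤ 1
off_axis = leapfrog_growth(-0.1 + 0.5im)   # small negative real part
beyond   = leapfrog_growth(1.5im)          # on the imaginary axis but |z| > 1
```

Because the product of the two roots is −1, one root must have magnitude greater than one whenever the other has magnitude less than one. Only when both lie on the unit circle is the scheme absolutely stable.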
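Trapezoidal. Letting $r = U^{n+1}/U^n$ in the trapezoidal method
$$\frac{U^{n+1} - U^n}{k} = \lambda\,\frac{U^{n+1} + U^n}{2}$$
gives us $r - 1 = \tfrac12 \lambda k (r + 1)$ or equivalently
$$\lambda k = 2\,\frac{r - 1}{r + 1}.$$
As before, take $r = \mathrm{e}^{\mathrm{i}\theta}$ to identify the boundary of the region of absolute stability.
$$\lambda k = 2\,\frac{\mathrm{e}^{\mathrm{i}\theta} - 1}{\mathrm{e}^{\mathrm{i}\theta} + 1} = 2\mathrm{i}\tan\tfrac12\theta.$$
That is, $\lambda k \in (-\mathrm{i}\infty, +\mathrm{i}\infty)$. So $|r| = 1$ along the entire imaginary axis, and the
region of absolute stability includes the entire left half 𝜆𝑘-plane. Therefore, the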
trapezoidal scheme is A-stable. A method is stable if and only if the region
of absolute stability contains the origin 𝜆𝑘 = 0. We call such a condition zero-stability.
Before discussing specific numerical methods in depth, let’s consider the Lax
equivalence theorem, which says that a consistent and stable method converges.
“Consistency” means that we can get two equations to agree enough, “stability”
means that we can control the size of the error enough, and “convergence” means
that we can get two solutions to agree enough. The theorem itself is important
enough that it’s sometimes called the fundamental theorem of numerical analysis.
It tells us when we can trust a numerical method for solving linear ordinary or
partial differential equations to give us the right results.
Consider the initial value problem 𝑢 𝑡 = L 𝑢, where L is a linear operator.
In the next chapter, we’ll consider L to be the Laplacian operator, and we’ll
get the heat equation 𝑢 𝑡 = Δ𝑢. In Chapter 14, we’ll consider L = c · ∇, and
we’ll get the advection equation 𝑢 𝑡 = c · ∇𝑢. Suppose that 𝑢(𝑡, ·) is the analytic
solution to 𝑢 𝑡 = L 𝑢. We’ll use the dot · as a nonspecific placeholder. It could
stand in for some spatial dimensions if we have a partial differential equation.
Or, it could be nothing at all if we have an ordinary differential equation. In this
case, 𝑢 is simply a function of 𝑡. Let 𝑡 𝑛 = 𝑛𝑘 be the uniform discretization of
time, and let 𝑈 𝑛 denote the finite difference approximation of 𝑢(𝑡 𝑛 , ·). Define
$$\tau_k(t,\cdot) = \frac{u(t + k,\cdot) - \mathcal{H}_k u(t,\cdot)}{k}.$$
A method is consistent if $\|\tau_k(t,\cdot)\| \to 0$ as $k \to 0$. Furthermore, a method is
consistent of order $p$ if, for all sufficiently smooth initial data, there exists a
constant $c_\tau$ such that $\|\tau_k(t,\cdot)\| \le c_\tau k^p$ for all sufficiently small step sizes $k$.
Define the operator norm $\|\mathcal{H}_k\|$ as the relative maximum $\|\mathcal{H}_k u\|/\|u\|$ over all
vectors $u$. A method is stable if, for each time $T$, there exists a $c_s$ and $k_s > 0$ such
that $\|\mathcal{H}_k^n\| \le c_s$ for all $nk \le T$ and $k \le k_s$. In particular, if $\|\mathcal{H}_k\| \le 1$, then
$\|\mathcal{H}_k^n\| \le \|\mathcal{H}_k\|^n \le 1$, which implies stability. But, we can also relax conditions
on $\mathcal{H}_k$ to $\|\mathcal{H}_k\| \le 1 + \alpha k$ for any constant $\alpha$, since in this case
$$\|\mathcal{H}_k^n\| \le \|\mathcal{H}_k\|^n \le (1 + \alpha k)^n \le \mathrm{e}^{\alpha n k} \le \mathrm{e}^{|\alpha| T}.$$
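$$\begin{aligned}
e_k(t_n,\cdot) &= U^n - u(t_n,\cdot) \\
&= \mathcal{H}_k U^{n-1} - \mathcal{H}_k u(t_{n-1},\cdot) - k\tau_k(t_{n-1},\cdot) \\
&= \mathcal{H}_k e_k(t_{n-1},\cdot) - k\tau_k(t_{n-1},\cdot) \\
&= \mathcal{H}_k\bigl[\mathcal{H}_k e_k(t_{n-2},\cdot) - k\tau_k(t_{n-2},\cdot)\bigr] - k\tau_k(t_{n-1},\cdot),
\end{aligned}$$
and so on. Taking norms gives
$$\|e_k(t_n,\cdot)\| \le \|\mathcal{H}_k^n\| \cdot \|e_k(0,\cdot)\| + k\sum_{j=1}^{n} \|\mathcal{H}_k^{n-j}\| \cdot \|\tau_k(t_{j-1},\cdot)\|.$$
From the stability condition, we have $\|\mathcal{H}_k^j\| \le c_s$ for all $j = 1, 2, \dots, n$ and all
$nk \le T$. Therefore,
$$\|e_k(t_n,\cdot)\| \le c_s\Bigl(\|e_k(0,\cdot)\| + k\sum_{j=1}^{n} \|\tau_k(t_{j-1},\cdot)\|\Bigr).$$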
$$\begin{aligned}
u(t) &= u(t) \\
u(t - k) &= u(t) - k u'(t) + \tfrac12 k^2 u''(t) + O(k^3) \\
u(t - 2k) &= u(t) - 2k u'(t) + \tfrac42 k^2 u''(t) + O(k^3)
\end{aligned}$$
$$\begin{aligned}
a_{-1} u(t) &= a_{-1} u(t) \\
a_0 u(t - k) &= a_0 u(t) - a_0 k u'(t) + \tfrac12 a_0 k^2 u''(t) + O(k^3) \\
a_1 u(t - 2k) &= a_1 u(t) - 2 a_1 k u'(t) + \tfrac42 a_1 k^2 u''(t) + O(k^3).
\end{aligned}$$
$$\begin{aligned}
a_{-1} + a_0 + a_1 &= 0 \\
0 - a_0 - 2a_1 &= k^{-1} \\
0 + \tfrac12 a_0 + \tfrac42 a_1 &= 0.
\end{aligned}$$
$$a_{-1} = \tfrac32 k^{-1}, \quad a_0 = -2k^{-1}, \quad a_1 = \tfrac12 k^{-1}.$$
$$\frac{\tfrac32 u(t_{n+1}) - 2u(t_n) + \tfrac12 u(t_{n-1})}{k} = u'(t_{n+1}) + O(k^2).$$
Using this approximation for $u'(t)$, we obtain the second-order backward differentiation
formula (BDF2)
$$\tfrac32 U^{n+1} - 2U^n + \tfrac12 U^{n-1} = k f(U^{n+1}).$$
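The three conditions on the coefficients form a small linear system, which we can hand to Julia instead of solving by hand. This sketch scales the unknowns by 𝑘 so that the right-hand side is (0, 1, 0), and it uses rational entries so the answer comes back as exact fractions. The variable names are our own.

```julia
# Solve the Taylor-matching conditions for k·(a₋₁, a₀, a₁) in exact arithmetic.
# Rows: the coefficients of u(t), k u′(t), and k² u″(t).
A = [1 1 1; 0 -1 -2; 0 1//2 2]
a = A \ [0, 1, 0]
```

The solution reproduces the BDF2 coefficients 3/2, −2, and 1/2.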
An 𝑟-step (and 𝑟th order) backward differentiation formula takes the form
Figure 12.5: Left: internal contours of the regions of absolute stability for
BDF1–BDF6. The regions of absolute stability are the areas outside these
contours. Right: a close-up of the same contours.
Now, let’s determine the region of absolute stability for BDF2. We take
𝑓 (𝑢) = 𝜆𝑢. Then letting 𝑟 = 𝑈 𝑛+1 /𝑈 𝑛 in
$$\tfrac32 U^{n+1} - 2U^n + \tfrac12 U^{n-1} = k f(U^{n+1}),$$
we have
$$\tfrac32 r^2 - 2r + \tfrac12 = k\lambda r^2.$$
For which 𝜆𝑘 is |𝑟 | ≤ 1? While we can easily compute this answer analytically
using the quadratic formula, it becomes difficult or impossible for higher-order
methods. So, instead, we will simply plot the boundary of the region of absolute
stability. Since we want to know when |𝑟 | ≤ 1, i.e., when 𝑟 is inside the unit disk
in the complex plane, we will consider 𝑟 = ei𝜃 . Then the variable 𝜆𝑘 can be
expressed as the rational function
$$\lambda k = \frac{\tfrac32 r^2 - 2r + \tfrac12}{r^2}.$$
We can plot this function in Julia using
using Plots
r = exp.(2im*π*(0:.01:1))   # points on the unit circle
plot(@. (1.5r^2 - 2r + 0.5)/r^2); plot!(aspect_ratio=:equal)
and it is used to quantify the error for one time step. Error accumulates at each
iteration. Because 𝑡 𝑛 = 𝑛𝑘, it follows that 𝑛 = 𝑂 (1/𝑘). The global truncation
error of a finite difference approximation is defined as 𝜏global = 𝜏local /𝑘 and
quantifies the error over several time steps. For example, the local truncation
error of the leapfrog scheme is 𝑂 𝑘 3 , and the global truncation error is 𝑂 𝑘 2 .
From Taylor series expansion with $t_{n-j} = t_n - jk$, we have
$$\begin{aligned}
u(t_{n-j}) &= u(t_n) - jk\,u'(t_n) + \tfrac12 (jk)^2 u''(t_n) + \cdots \\
f(u(t_{n-j})) = u'(t_{n-j}) &= u'(t_n) - jk\,u''(t_n) + \tfrac12 (jk)^2 u'''(t_n) + \cdots.
\end{aligned}$$
Then the local truncation error at $t_n$ is
$$\sum_{j=-1}^{s-1} a_j\,u(t_n) - k\sum_{j=-1}^{s-1} (j a_j + b_j)\,u'(t_n) + k^2\sum_{j=-1}^{s-1} \bigl(\tfrac12 j^2 a_j + j b_j\bigr)\,u''(t_n) - \cdots + \frac{(-1)^p}{p!}\,k^p \sum_{j=-1}^{s-1} \bigl(j^p a_j + p j^{p-1} b_j\bigr)\,u^{(p)}(t_n) + \cdots$$
For a consistent method, the global truncation error must tend to zero as the step
size tends to zero. Therefore,
$$\sum_{j=-1}^{s-1} a_j = 0 \quad\text{and}\quad \sum_{j=-1}^{s-1} (j a_j + b_j) = 0. \tag{12.6a}$$
Multistep methods 351
To further get a method that is $O(k^p)$ accurate, we need to set the first $p + 1$
coefficients of the truncation error to 0:
$$\sum_{j=-1}^{s-1} \bigl(j^i a_j + i j^{i-1} b_j\bigr) = 0 \quad\text{for } i = 1, 2, \dots, p. \tag{12.6b}$$
The input m is a row vector indicating nonzero (free) indices of {𝑎 𝑗 } and n is a row
vector of nonzero indices of {𝑏 𝑗 }. The value 𝑎 −1 is always implicitly one, and 0
should not be included in m. By formatting B as an array of rational numbers using
// and formatting the ones array as integers, the coefficients c will be formatted
as rational numbers. For example, the input m = [1 2] and n = [1 2 3], which
corresponds to the third-order Adams–Moulton method, results in the two arrays
[1 -1] and [5/12 2/3 -1/12]. We can use these coefficients to plot the boundary
of the region of absolute stability:
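using LinearAlgebra, Plots
function plotstability(a,b)
  # boundary locus: λk as a function of r on the unit circle
  λk(r) = (a · r.^-(0:length(a)-1)) ./ (b · r.^-(0:length(b)-1))
  r = exp.(im*LinRange(0,2π,200))
  plot(λk.(r), aspect_ratio=:equal)   # assumed plotting step
end
m = [0 1]; n = [0 1 2]
a, b = zeros(maximum(m)+1), zeros(maximum(n)+1)
a[m.+1], b[n.+1] = multistepcoefficients(m,n)
plotstability(a,b)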
Adams methods
$$U^{n+1} = U^n + k\sum_{j=-1}^{s-1} b_j f(U^{n-j}).$$
Figure 12.6: External contours of the regions of absolute stability for the
explicit Adams–Bashforth methods AB2–AB5 (left) and implicit Adams–Moulton
methods AM3–AM7 (right). The plot region of the Adams–Bashforth is shown
with the gray rectangle overlaying the plot region of the Adams–Moulton.
of implicit methods requires significantly more effort. Still, the region of absolute
stability of an Adams–Moulton method is about three times larger than that of
the corresponding Adams–Bashforth method.
Predictor-corrector methods
predictor: 𝑈˜ 𝑛+1 = 𝑈 𝑛 + 𝑘 𝑏 𝑗 𝑓 (𝑈 𝑛− 𝑗 ) (12.7a)
corrector: 𝑈 𝑛+1 = 𝑈 𝑛 + 𝑘 𝑏 ∗0 𝑓 (𝑈˜ 𝑛+1 ) + 𝑘 𝑏 ∗𝑗 𝑓 (𝑈 𝑛− 𝑗 ). (12.7b)
Padé approximation
$$R_{3,4}(x) = \frac{x - \frac{31}{294}x^3}{1 + \frac{3}{49}x^2 + \frac{11}{5880}x^4}$$
[Figure: plot comparing sin 𝑥 with 𝑅3,4 (𝑥).]
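To get a feel for how well this rational function tracks sin 𝑥, we can evaluate it at a few points. The sketch below compares it against the cubic Taylor polynomial, which has the same numerator degree. The names `R34`, `pade_err`, and `taylor_err` are our own.

```julia
# Padé approximant R₃,₄(x) of sin x and its error at a few points.
R34(x) = (x - 31/294*x^3)/(1 + 3/49*x^2 + 11/5880*x^4)

xs = [0.5, 1.0, 2.0]
pade_err   = abs.(R34.(xs) .- sin.(xs))
taylor_err = abs.(xs .- xs.^3 ./ 6 .- sin.(xs))   # cubic Taylor polynomial for comparison
```

The rational approximation stays accurate noticeably farther from the origin than the polynomial does.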
Another way to compute the Padé approximant to log 𝑟 is by starting with the
Taylor polynomial 𝑇 (𝑥) of log(𝑥 + 1); then determining
the coefficients of 𝑃𝑚 (𝑥)
and 𝑄 𝑛 (𝑥) from 𝑃𝑚 (𝑥) = 𝑄 𝑛 (𝑥)𝑇 (𝑥) + 𝑂 𝑥 𝑚+𝑛+1 ; and finally, substituting 𝑟 − 1
back in for 𝑥 and grouping coefficients by powers of 𝑟. But, perhaps the easiest
way to compute the Padé approximant is simply by solving the linear system
outlined for general multistep methods using multistepcoefficients on page 351.
Implicit methods correspond to the Padé approximants of $\log r$. An order-$p$
BDF method corresponds to $R_{p,0}(r)$ and an order-$p$ Adams–Moulton method
corresponds to $R_{1,p-1}(r)$. Explicit methods are associated with $r^{-1}C_{m,n}(r)$, where
$C_{m,n}(r)$ is the Padé approximant of $r\log r$. The leapfrog method is $r^{-1}C_{2,0}(r)$,
and an order-$p$ Adams–Bashforth method corresponds to $r^{-1}C_{1,p-1}(r)$. The
figure on the facing page shows the Padé approximants corresponding to the
backward Euler, the third-order Adams–Moulton, and the forward Euler methods.
Each of these rational functions approximates log 𝑟 at 𝑟 = 1 when 𝜆𝑘 = 0.
2 WolframCloud (https://www.wolframcloud.com) provides free, online access to Mathematica.
Runge–Kutta methods 355
[Figure: log 𝑟 near 𝑟 = 1 compared with the Padé approximants $R_{1,0}(r)$, $R_{1,2}(r)$, and $r^{-1}C_{1,0}(r)$.]
By directly integrating $u'(t) = f(t, u)$, we can change the differential equation
into an integral equation
$$u(t_{n+1}) - u(t_n) = \int_{t_n}^{t_{n+1}} f(t, u(t))\,\mathrm{d}t.$$
The challenge now is to evaluate the integral numerically. One way of doing
this is to use the Runge–Kutta method. Unlike multistep methods, Runge–Kutta
methods can get high-order results without saving the solutions of previous
time steps, making them good choices for adaptive step size. Because of this,
Runge–Kutta methods are often methods of choice.
We can approximate the integral using quadrature
$$\int_{t_n}^{t_{n+1}} f(t, u(t))\,\mathrm{d}t \approx k\sum_{i=1}^{s} b_i f(t_i^*, u(t_i^*))$$
We’ll need to approximate $f(t_{n+1/2}, U^{n+1/2})$ this time. We can use the forward
Euler method. Take $K_1 = f(t_n, U^n)$ and
$$K_2 = f\bigl(t_n + \tfrac12 k,\, U^n + \tfrac12 k f(t_n, U^n)\bigr) = f\bigl(t_n + \tfrac12 k,\, U^n + \tfrac12 k K_1\bigr).$$
To compute $U^{n+1}$ given $U^n$, we’ll need to approximate $f(t_{n+1}, U^{n+1})$ using the
forward Euler method. Take $K_1 = f(t_n, U^n)$ and
$$K_2 = f\bigl(t_n + k,\, U^n + k f(t_n, U^n)\bigr) = f\bigl(t_n + k,\, U^n + k K_1\bigr).$$
Combining these yields a second-order rule $U^{n+1} = U^n + k\bigl(\tfrac12 K_1 + \tfrac12 K_2\bigr)$.
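Both two-stage rules take only a few lines of code. The following sketch implements one step of each and measures the error at 𝑡 = 1 for a test problem with a known solution. The function names `midpoint`, `explicit_trapezoidal`, and `integrate`, as well as the test problem, are our own choices for illustration.

```julia
# One step of each two-stage rule for u′ = f(t, u).
function midpoint(f, t, U, k)
    K1 = f(t, U)
    K2 = f(t + k/2, U + k/2*K1)      # slope at the midpoint, predicted by forward Euler
    return U + k*K2
end

function explicit_trapezoidal(f, t, U, k)
    K1 = f(t, U)
    K2 = f(t + k, U + k*K1)          # slope at the endpoint, predicted by forward Euler
    return U + k*(K1/2 + K2/2)
end

# Integrate from t = 0 to t = 1 in n steps and return U at t = 1.
function integrate(step, f, u0, n)
    k, U = 1/n, u0
    for i in 0:n-1
        U = step(f, i*k, U, k)
    end
    return U
end

f(t, u) = -2t*u                      # exact solution is u(t) = exp(-t²) for u(0) = 1
err(step, n) = abs(integrate(step, f, 1.0, n) - exp(-1))
```

Comparing `err(midpoint, 50)` with `err(midpoint, 100)`, and likewise for the other rule, should show the error dropping by roughly a factor of four as the step size is halved.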
Example. To get high-order methods, we add quadrature points. The next step
up is Simpson’s rule, which says
$$\int_a^b f(x)\,\mathrm{d}x \approx (b - a)\left[\tfrac16 f(a) + \tfrac23 f\!\left(\tfrac{b + a}{2}\right) + \tfrac16 f(b)\right].$$
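the coefficients $a_{ij}$ is not a trivial exercise. We need to take
$c_i = \sum_{j=1}^{s} a_{ij}$ and $\sum_{i=1}^{s} b_i = 1$ for consistency. These coefficients are often
conveniently given as a Butcher tableau
$$\begin{array}{c|c} \mathbf{c} & \mathbf{A} \\ \hline & \mathbf{b}^{\mathsf T} \end{array}$$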
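$$\begin{aligned}
K_1 &= f(t_n, U^n) \\
K_2 &= f\bigl(t_n + \tfrac12 k,\, U^n + \tfrac12 k K_1\bigr) \\
U^{n+1} &= U^n + k K_2
\end{aligned}
\qquad
\begin{array}{c|cc} 0 & & \\ \tfrac12 & \tfrac12 & \\ \hline & 0 & 1 \end{array}$$
$$\begin{aligned}
K_1 &= f(t_n, U^n) \\
K_2 &= f\bigl(t_n + k,\, U^n + k K_1\bigr) \\
U^{n+1} &= U^n + k\bigl(\tfrac12 K_2 + \tfrac12 K_1\bigr)
\end{aligned}
\qquad
\begin{array}{c|cc} 0 & & \\ 1 & 1 & \\ \hline & \tfrac12 & \tfrac12 \end{array}$$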
The trapezoidal rule is a DIRK method with an explicit first stage. The fourth-
order Runge–Kutta (RK4), sometimes called the classical Runge–Kutta method
or simply “the” Runge–Kutta method, is derived from Simpson’s rule.
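$$\begin{aligned}
K_1 &= f(t_n, U^n) \\
K_2 &= f\bigl(t_n + \tfrac12 k,\, U^n + \tfrac12 k K_1\bigr) \\
K_3 &= f\bigl(t_n + \tfrac12 k,\, U^n + \tfrac12 k K_2\bigr) \\
K_4 &= f(t_n + k,\, U^n + k K_3) \\
U^{n+1} &= U^n + \tfrac16 k (K_1 + 2K_2 + 2K_3 + K_4).
\end{aligned}
\qquad
\begin{array}{c|cccc} 0 & & & & \\ \tfrac12 & \tfrac12 & & & \\ \tfrac12 & 0 & \tfrac12 & & \\ 1 & 0 & 0 & 1 & \\ \hline & \tfrac16 & \tfrac13 & \tfrac13 & \tfrac16 \end{array}$$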
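The four stages translate line for line into code. The sketch below implements one RK4 step and estimates the observed order of accuracy on the test problem $u' = u$. The function names `rk4` and `rk4error` are our own.

```julia
# One step of the classical fourth-order Runge–Kutta method for u′ = f(t, u).
function rk4(f, t, U, k)
    K1 = f(t, U)
    K2 = f(t + k/2, U + k/2*K1)
    K3 = f(t + k/2, U + k/2*K2)
    K4 = f(t + k, U + k*K3)
    return U + k/6*(K1 + 2K2 + 2K3 + K4)
end

# Error at t = 1 for u′ = u with u(0) = 1 using n steps.
function rk4error(n)
    k, U = 1/n, 1.0
    for i in 0:n-1
        U = rk4((t, u) -> u, i*k, U, k)
    end
    return abs(U - exp(1))
end

order = log2(rk4error(10)/rk4error(20))   # observed order of accuracy
```

Halving the step size should cut the error by about a factor of sixteen.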
Example. Plot the regions of stability for the Runge–Kutta methods given by
the following tableaus:
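$$\begin{array}{c|c} 0 & 0 \\ \hline & 1 \end{array}
\qquad
\begin{array}{c|cc} 0 & & \\ \tfrac12 & \tfrac12 & \\ \hline & 0 & 1 \end{array}
\qquad
\begin{array}{c|ccc} 0 & & & \\ \tfrac12 & \tfrac12 & & \\ 1 & -1 & 2 & \\ \hline & \tfrac16 & \tfrac23 & \tfrac16 \end{array}
\qquad
\begin{array}{c|cccc} 0 & & & & \\ \tfrac12 & \tfrac12 & & & \\ \tfrac12 & 0 & \tfrac12 & & \\ 1 & 0 & 0 & 1 & \\ \hline & \tfrac16 & \tfrac13 & \tfrac13 & \tfrac16 \end{array}$$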
Figure 12.8: External contours of the regions of absolute stability for the explicit
Runge–Kutta methods RK1–RK4. The regions of stability get progressively
larger with increasing order.
$$U^{n+1} = U^n + k\,\mathbf{b}^{\mathsf T}\mathbf{K},$$
$$\mathbf{K} = \lambda U^n (\mathbf{I} - \lambda k\mathbf{A})^{-1}\mathbf{E}$$
We can use Julia to compute the |𝑟 | = 1 contours of (12.10) for each tableau.
The regions of absolute stability are plotted in Figure 12.8.
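Before plotting contours, it helps to evaluate the growth factor at a few points. Substituting the expression for $\mathbf{K}$ into the update gives $r = 1 + \lambda k\,\mathbf{b}^{\mathsf T}(\mathbf{I} - \lambda k\mathbf{A})^{-1}\mathbf{E}$, where we take $\mathbf{E}$ to be a vector of ones. The sketch below evaluates this for the RK4 tableau. The function name `stabilityfunction` is our own.

```julia
using LinearAlgebra

# Growth factor r(z) = 1 + z bᵀ(I - zA)⁻¹E for a Butcher tableau (A, b),
# where E is a vector of ones and z = λk.
stabilityfunction(A, b, z) = 1 + z*(transpose(b)*((I - z*A) \ ones(length(b))))

# Classical RK4 tableau
A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0]
b = [1/6, 1/3, 1/3, 1/6]

taylor4(z) = 1 + z + z^2/2 + z^3/6 + z^4/24   # degree-four Taylor polynomial of exp(z)
z = -1.2 + 0.7im
```

For RK4, the growth factor coincides with the degree-four Taylor polynomial of $\mathrm{e}^{z}$, so `stabilityfunction(A, b, z)` and `taylor4(z)` should agree to rounding error. Evaluating along the negative real axis shows that the method is absolutely stable at 𝜆𝑘 = −2 but not at 𝜆𝑘 = −3.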
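$$\begin{aligned}
K_1 &= f(U^n) \\
K_2 &= f\bigl(U^n + \tfrac14 k f(U^n) + \tfrac14 k f(U^{n+1/2})\bigr) \\
K_3 &= f\bigl(U^n + \tfrac13 k f(U^n) + \tfrac13 k f(U^{n+1/2}) + \tfrac13 k f(U^{n+1})\bigr) \\
U^{n+1} &= U^n + k\bigl(\tfrac13 K_1 + \tfrac13 K_2 + \tfrac13 K_3\bigr)
\end{aligned}
\qquad
\begin{array}{c|ccc} 0 & 0 & & \\ \tfrac12 & \tfrac14 & \tfrac14 & \\ 1 & \tfrac13 & \tfrac13 & \tfrac13 \\ \hline & \tfrac13 & \tfrac13 & \tfrac13 \end{array}$$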
Instead of breaking the time step into two equal subintervals, we could also take
the intermediate point at some fraction 𝛼 of the time step. In this case, we take
the trapezoidal method over a subinterval 𝛼𝑘 and the BDF2 method over the two
subintervals 𝛼𝑘 and (1 − 𝛼)𝑘 with 0 < 𝛼 < 1. Finding an optimal $\alpha = 2 - \sqrt{2}$ is
left as an exercise.
One advantage that Runge–Kutta methods have over multistep methods is the
relative ease that the step size can be varied, taking smaller steps when needed
to minimize error and maintain stability and larger steps otherwise. Often, a
Runge–Kutta method embeds a lower-order method inside a higher-order method,
sharing the same Butcher tableau. For example, the Bogacki–Shampine method
(implemented in Julia using the function BS3) combines a second-order and
third-order Runge–Kutta method using the same quadrature points:
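$$\begin{array}{c|cccc}
0 & & & & \\
\tfrac12 & \tfrac12 & & & \\
\tfrac34 & 0 & \tfrac34 & & \\
1 & \tfrac29 & \tfrac13 & \tfrac49 & \\ \hline
 & \tfrac29 & \tfrac13 & \tfrac49 & 0 \\
 & \tfrac{7}{24} & \tfrac14 & \tfrac13 & \tfrac18
\end{array}$$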
The 𝐾𝑖 are the same for the second-order and the third-order methods. By
subtracting the solutions, we can approximate the truncation error. If the
truncation error is above a given tolerance, we set the step size 𝑘 to 𝑘/2. If the
truncation error is below another given tolerance, we set the step size 𝑘 to 2𝑘.

[Figure: three panels illustrating the last three rows of the Bogacki–Shampine tableau over 0 ≤ 𝑡 ≤ 1.]
The third row of the Butcher tableau says we now try to find the solution at
𝑡 = 3/4 using the slope we just found at ② and nothing from ①. Again, we use
a simple linear extrapolation.
The fourth row says we now try to find the solution at 𝑡 = 1 by first traveling 2/9
with a slope given by ①, then a distance 1/3 with a slope given by ②, and then a
distance 4/9 with slope from ③.
The final row tells us to first travel 7/24 with a slope from ①, then 1/4 with a
slope from ②, then 1/3 with a slope from ③, and finally 1/8 with a slope from
④. The solution provides a third-order correction over the solution from ④.
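The halving and doubling rule can be sketched in a few lines of Julia for a scalar problem. This is a rough illustration rather than the BS3 implementation: the function names `bs23_step` and `integrate` and the two tolerances `tol_hi` and `tol_lo` are assumptions made for the example.

```julia
# One Bogacki–Shampine step for u' = f(t, u). Returns the third-order solution
# and the size of its difference from the embedded second-order solution.
function bs23_step(f, t, u, k)
    K1 = f(t, u)
    K2 = f(t + k/2, u + k/2*K1)
    K3 = f(t + 3k/4, u + 3k/4*K2)
    u3 = u + k*(2/9*K1 + 1/3*K2 + 4/9*K3)             # third-order solution
    K4 = f(t + k, u3)
    u2 = u + k*(7/24*K1 + 1/4*K2 + 1/3*K3 + 1/8*K4)   # second-order solution
    return u3, abs(u3 - u2)
end

# Adaptive loop: halve k and retry when the estimate is too large,
# double k for the next step when the estimate is very small.
function integrate(f, u, tspan; k=0.1, tol_hi=1e-5, tol_lo=1e-7)
    t, tend = tspan
    steps = 0
    while t < tend
        k = min(k, tend - t)          # do not step past the end of the interval
        unew, err = bs23_step(f, t, u, k)
        if err > tol_hi
            k /= 2                    # reject the step and try again
            continue
        end
        t += k; u = unew; steps += 1
        err < tol_lo && (k *= 2)
    end
    return u, steps
end

u, steps = integrate((t, u) -> -u, 1.0, (0.0, 2.0))
```

Production solvers scale the step size continuously using the error estimate and the order of the method instead of switching between k/2 and 2k, but the accept-or-reject structure is the same.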
The notation RK𝑚(𝑛) is occasionally used to describe a Runge–Kutta method
where 𝑚 is the order of the method to obtain the solution and 𝑛 is the order of
the method to obtain the error estimate. This notation isn’t consistent, and some
authors use (𝑚, 𝑛) or 𝑛(𝑚) instead. At other times a method may have multiple
error estimators. For example, the Dormand–Prince 8(5,3) method uses a fifth-order
error estimator and bootstraps a third-order estimator.
for 𝑖 = 1, 2, . . . , 𝑠 where 𝑢(𝑡0 ) = 𝑢 0 and the collocation points 𝑐 𝑖 ∈ [0, 1]. The
solution is given by 𝑢(𝑡 0 + 𝑘). For example, when 𝑠 = 1, the polynomial is
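When $c_1 = 0$ we have the explicit Euler method, when $c_1 = 1$ we have the implicit
Euler method, and when $c_1 = \tfrac12$ we have the implicit midpoint method:
$$\begin{array}{c|c} 0 & 0 \\ \hline & 1 \end{array}
\qquad
\begin{array}{c|c} 1 & 1 \\ \hline & 1 \end{array}
\qquad
\begin{array}{c|c} \tfrac12 & \tfrac12 \\ \hline & 1 \end{array}.$$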
Proof. Let $u(t)$ be the collocation polynomial and let $K_i = u'(t_0 + c_i k)$. The
Lagrange polynomial is
$$u'(t_0 + zk) = \sum_{j=1}^{s} K_j \ell_j(z) \quad\text{where}\quad \ell_i(z) = \prod_{\substack{l=1 \\ l \ne i}}^{s} \frac{z - c_l}{c_i - c_l}.$$
for nodes $x_0, x_1, \dots, x_{s-1}$ and for some $\xi \in [a, b]$. Over an interval of length
$k$, the integral is bounded by $k^{2s+1}$. So the local truncation error is $O(k^{2s+1})$,
and the global truncation error is $O(k^{2s})$. Therefore, an $s$-stage Gauss–Legendre
method is order $2s$.
Radau methods are Gauss–Legendre methods that include one of the endpoints
as quadrature points—either 𝑐 1 = 0 (Radau IA methods) or 𝑐 𝑠 = 1 (Radau IIA
methods). Such methods have maximum order 2𝑠 − 1. Gauss–Lobatto quadrature
chooses the quadrature points that include both ends of the interval. Lobatto
methods (Lobatto IIIA) have both 𝑐 1 = 0 and 𝑐 𝑠 = 1 and have maximum order
2𝑠 − 2.
Implicit Runge–Kutta methods offer higher accuracy and greater stability
profiles, but they are more difficult to implement because they require solving a
system of nonlinear equations. Diagonally implicit Runge–Kutta methods limit
the Runge–Kutta method (12.8) to one implicit term
$$K_i = f\Bigl(U^n + k\sum_{j=1}^{i} a_{ij} K_j\Bigr).$$
where ∇f (u∗ ) is the Jacobian matrix 𝜕 𝑓𝑖 /𝜕𝑢 𝑗 evaluated at u∗ . The spectral radius
of the Jacobian matrix (the magnitude of its largest eigenvalue) now determines
the stability condition.
Implicit methods are more difficult to implement, but they are more stable.
When the system of differential equations is linear, this often means inverting
diffusion equations are stiff. The solution evolves quickly in the presence of a
large spatial gradient and then slows to a crawl once it smooths out. Because
numerical errors often result in large gradients, the problem remains stiff even
after the fast scale is no longer apparent in the solution dynamics.
𝜆∗ = 𝑢 ∗ (1 − 𝑢 ∗ )
𝑢 0 = 100
While the dynamics of the equation near 𝑢 = 1 are relatively benign, we need to
take many tiny steps using the Runge–Kutta method—124 in total. But, the BDF
method takes just 11. And almost all of these steps are early on—the last step is an
3 Shampine himself credits Robert O’Malley, who himself credits Edward Reiss. An explicit
solution, given in terms of a Lambert W function, can be found in an article by Corless et al.
enormous leap of 𝑘 = 157. Both methods adjust the step size to ensure stability.
To determine the stability conditions for a numerical method, we examine the
derivative of the right-hand side $f(u) = u^2(1 - u)$, which is $f'(u) = 2u - 3u^2$.
Specifically, when $u(t) \approx 1$, the derivative $f'(u) \approx -1$. The lower limit of the
region of absolute stability of the Dormand–Prince Runge–Kutta method is −3.
If we use this method, we can take steps as large as $k = 3$ near $u = 1$ and still
maintain absolute stability (the solver took an average step size of 1.6). On the
other hand, if we use the BDF scheme, which has no lower limit on its region of
absolute stability, we can take as large a step as we’d like.
$$\kappa = \frac{\max |\operatorname{Re} \lambda(\mathbf{A})|}{\min |\operatorname{Re} \lambda(\mathbf{A})|},$$
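As a small illustration, the ratio above can be computed directly from the eigenvalues of a matrix. The helper name `stiffness_ratio` and the test matrix are made up for this sketch.

```julia
using LinearAlgebra

# ratio of the largest to the smallest |Re λ| over the eigenvalues of A
function stiffness_ratio(A)
    μ = abs.(real.(eigvals(A)))
    return maximum(μ)/minimum(μ)
end

# a triangular matrix with eigenvalues -1 and -1000: one slow and one fast scale
A = [-1.0 1.0; 0.0 -1000.0]
κ = stiffness_ratio(A)
```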
4 Söderlind, Jay, and Calvo’s review article “Stiffness 1952–2012: Sixty years in search of a
definition” addresses the lack of consensus on an adequate definition.
Stiff equations 367
Because the trapezoidal method is A-stable, we can take a large time step size 𝑘
and still ensure stability. But when 𝜆𝑘 is large, the multiplying factor 𝑟 (𝜆𝑘) ≈ −1.
Consequently, the numerical solution 𝑈 𝑛 ≈ (−1) 𝑛𝑈 0 will oscillate for a long
time as it slowly decays. See the figure on this page.
We can write the backward Euler method $U^n - U^{n-1} = -\lambda k U^n$ as
$$U^n = \left(\frac{1}{1 + \lambda k}\right)^{n} U^0 = (r(\lambda k))^n\, U^0.$$
Now, when 𝜆𝑘 is large, the multiplying factor 𝑟 (𝜆𝑘) ≈ 0, and the numerical
solution decays quickly.
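A quick numerical comparison makes the contrast concrete. The sketch below assumes the model problem u′ = −λu with λ > 0 and writes z = λk; the trapezoidal factor (1 − z/2)/(1 + z/2) follows from applying the trapezoidal method to that problem, and the function names are illustrative.

```julia
# multiplying factors for u' = -λu with z = λk > 0
r_trapezoidal(z) = (1 - z/2)/(1 + z/2)
r_backward_euler(z) = 1/(1 + z)

z = 100.0                                     # a large step on a fast-decaying mode
U_trap = [r_trapezoidal(z)^n for n in 0:5]    # alternates in sign, decays slowly
U_be = [r_backward_euler(z)^n for n in 0:5]   # decays almost immediately
```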
A method is L-stable if it is A-stable and 𝑟 (𝜆𝑘) → 0 as 𝜆𝑘 → −∞. We
can weaken the definition to say that a method is almost L-stable if it is almost
A-stable and |𝑟 (𝜆𝑘)| < 1 as 𝜆𝑘 → −∞. In other words, a method is almost
L-stable if it is zero-stable and the interior of its region of absolute stability
contains the point −∞. Backward differentiation formulas BDF1 and BDF2 are
L-stable, and BDF3–BDF6 are almost L-stable. Several third- and fourth-order
DIRK and Rosenbrock methods are also L-stable.
Partial differential equations often have stiff, linear terms coupled with nonstiff,
nonlinear terms. Examples include
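Navier–Stokes equation: $\dfrac{\partial \mathbf{v}}{\partial t} + (\mathbf{v}\cdot\nabla)\mathbf{v} = -\dfrac{1}{\rho}\nabla p + \nu\Delta\mathbf{v} + \mathbf{g}$
reaction-diffusion equation: $\dfrac{\partial u}{\partial t} = \Delta u + u(1 - u^2)$
nonlinear Schrödinger equation: $\mathrm{i}\dfrac{\partial \psi}{\partial t} = -\tfrac12\Delta\psi + |\psi|^2\psi$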
We can treat the linear terms implicitly to handle stiffness and the nonlinear terms
explicitly to simplify implementation—an approach called splitting.
Operator splitting
The natural question is, “how close are these approximations to the exact solution?”
Both of these solutions are plotted along curves 1 and 5 in Figure 12.11 on the
next page.
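Take the problem $u' = \mathrm{A}u + \mathrm{B}u$, where A and B are arbitrary operators.
Consider a simple splitting where we alternate by solving $u' = \mathrm{A}u$ and then
solving $u' = \mathrm{B}u$ each for one time step $k$. We can write and implement this
procedure as $U^* = \mathrm{S}_1(k)U^n$ followed by $U^{n+1} = \mathrm{S}_2(k)U^*$, or equivalently as
$U^{n+1} = \mathrm{S}_2(k)\,\mathrm{S}_1(k)U^n$, where $\mathrm{S}_1(k)$ and $\mathrm{S}_2(k)$ are solution operators.
What is the splitting error of applying the operators A and B successively
rather than concurrently? If $\mathrm{S}_1(k)$ and $\mathrm{S}_2(k)$ are exact solution operators
$\mathrm{S}_1(k)U^n = e^{k\mathrm{A}}U^n$ and $\mathrm{S}_2(k)U^n = e^{k\mathrm{B}}U^n$, then
concurrently: $U^{n+1} = e^{k(\mathrm{A}+\mathrm{B})}U^n$
successively: $\tilde{U}^{n+1} = e^{k\mathrm{B}}e^{k\mathrm{A}}U^n$
and hence
$$\begin{aligned}
e^{k\mathrm{B}}e^{k\mathrm{A}} &= \mathrm{I} + k(\mathrm{A}+\mathrm{B}) + \tfrac12 k^2\left(\mathrm{A}^2 + \mathrm{B}^2 + 2\mathrm{B}\mathrm{A}\right) + O(k^3) \\
e^{k(\mathrm{A}+\mathrm{B})} &= \mathrm{I} + k(\mathrm{A}+\mathrm{B}) + \tfrac12 k^2\left(\mathrm{A}^2 + \mathrm{B}^2 + \mathrm{A}\mathrm{B} + \mathrm{B}\mathrm{A}\right) + O(k^3).
\end{aligned}$$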
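Subtracting the two expansions leaves ½k²(BA − AB) + O(k³), so the error over one step is second order in k unless A and B commute. A minimal Julia check, using a pair of small noncommuting matrices chosen only for illustration (the names `split_error` and `commutator_term` are hypothetical):

```julia
using LinearAlgebra

# two matrices that do not commute
A = [0.0 1.0; 0.0 0.0]
B = [0.0 0.0; 1.0 0.0]

# one-step error of applying e^{kA} and then e^{kB} instead of e^{k(A+B)}
split_error(k) = opnorm(exp(k*B)*exp(k*A) - exp(k*(A + B)))

# size of the leading term ½k²(BA - AB) predicted by the expansions
commutator_term(k) = opnorm(k^2/2*(B*A - A*B))

# halving k should reduce the one-step error by a factor of about 4
ratio = split_error(0.1)/split_error(0.05)
```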
Strang splitting
We can reduce the error in operator splitting by using Strang splitting. For each
time step, we solve $u' = \mathrm{A}u$ for a half time step, then solve $u' = \mathrm{B}u$ for a full
time step, and finally solve $u' = \mathrm{A}u$ for a half time step. Because
$$e^{\frac12 k\mathrm{A}}\, e^{k\mathrm{B}}\, e^{\frac12 k\mathrm{A}} - e^{k(\mathrm{A}+\mathrm{B})} = O(k^3),$$
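Strang splitting is second order. (The proof of this is left as an exercise.) Note that
$$\begin{aligned}
U^{n+1} &= \mathrm{S}_1(\tfrac12 k)\,\mathrm{S}_2(k)\,\mathrm{S}_1(\tfrac12 k)\,U^n \\
&= \mathrm{S}_1(\tfrac12 k)\,\mathrm{S}_2(k)\,\mathrm{S}_1(\tfrac12 k)\,\mathrm{S}_1(\tfrac12 k)\,\mathrm{S}_2(k)\,\mathrm{S}_1(\tfrac12 k)\,U^{n-1}.
\end{aligned}$$
But since $\mathrm{S}_1(\tfrac12 k)\,\mathrm{S}_1(\tfrac12 k) = \mathrm{S}_1(k)$, this expression becomes
$$= \mathrm{S}_1(\tfrac12 k)\,\mathrm{S}_2(k)\,\mathrm{S}_1(k)\,\mathrm{S}_2(k)\,\mathrm{S}_1(\tfrac12 k)\,U^{n-1}.$$
So, to implement Strang splitting, we only need to solve $u' = \mathrm{A}u$ for a half time
step on the initial and final time steps. In between, we can use simple operator
splitting. In this manner, the global splitting error is $O(k^2)$ without much more
computation. It is impossible to have an operator splitting method with $O(k^3)$ or
smaller error.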
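Both claims can be checked numerically on a small example: the one-step Strang error scales like k³, and n Strang steps coincide with a half step of A, alternating full steps, and a closing half step of A. The matrices and function names below are assumptions made for this sketch.

```julia
using LinearAlgebra

A = [0.0 1.0; 0.0 0.0]
B = [0.0 0.0; 1.0 0.0]

# one Strang step with exact solution operators
strang(k) = exp(k/2*A)*exp(k*B)*exp(k/2*A)
strang_error(k) = opnorm(strang(k) - exp(k*(A + B)))

# n Strang steps applied one after another
function strang_steps(k, n)
    S = Matrix{Float64}(I, 2, 2)
    for _ in 1:n
        S = strang(k)*S
    end
    return S
end

# the same n steps with adjacent half steps of A merged into full steps
function strang_combined(k, n)
    S = exp(k/2*A)                  # initial half step of A
    for _ in 1:n-1
        S = exp(k*A)*exp(k*B)*S     # full step of B, then full step of A
    end
    return exp(k/2*A)*exp(k*B)*S    # last full step of B and final half step of A
end

# halving k should reduce the one-step error by a factor of about 8
ratio = strang_error(0.1)/strang_error(0.05)
```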
IMEX methods
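$$\frac{U^{n+1} - U^n}{k} = \mathrm{L}\,U^{n+1/2} + \mathrm{N}\,U^{n+1/2} \approx \tfrac12 \mathrm{L}\,U^{n+1} + \tfrac12 \mathrm{L}\,U^n + \tfrac32 \mathrm{N}\,U^n - \tfrac12 \mathrm{N}\,U^{n-1}.$$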
Alternatively, we can build a second-order IMEX scheme using the BDF2
method for the linear part $u' = \mathrm{L}u$:
$$\frac{\tfrac32 U^{n+1} - 2U^n + \tfrac12 U^{n-1}}{k} = \mathrm{L}\,U^{n+1}.$$
A BDF method approximates the derivative at 𝑡 𝑛+1 . To avoid a splitting error,
we should also evaluate the nonlinear term N 𝑈 at 𝑡 𝑛+1 by extrapolating the
terms 𝑈 𝑛 and 𝑈 𝑛−1 to approximate 𝑈 𝑛+1 . We can use Taylor series to find the
$$\begin{aligned}
u(t_n) &= u(t_{n+1}) - k u'(t_{n+1}) + \tfrac12 k^2 u''(t_{n+1}) + O(k^3) \\
u(t_{n-1}) &= u(t_{n+1}) - 2k u'(t_{n+1}) + 2k^2 u''(t_{n+1}) + O(k^3).
\end{aligned}$$
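Subtracting the second expansion from twice the first eliminates the first-derivative term and gives 2u(tₙ) − u(tₙ₋₁) = u(tₙ₊₁) + O(k²), which suggests evaluating the nonlinear term at the extrapolated value 2Uⁿ − Uⁿ⁻¹. The following is a minimal sketch of the resulting scheme for a scalar problem u′ = Lu + N(u). The function name `imex_bdf2`, the IMEX Euler start-up step, and the test problem are assumptions made for illustration.

```julia
# Second-order IMEX scheme: BDF2 for the linear term, extrapolation for N
#   (3/2 U^{n+1} - 2 U^n + 1/2 U^{n-1})/k = L U^{n+1} + N(2 U^n - U^{n-1})
function imex_bdf2(L, N, u0, k, nsteps)
    Uold = u0
    U = (Uold + k*N(Uold))/(1 - k*L)     # start-up: one IMEX Euler step
    for n in 2:nsteps
        Uext = 2U - Uold                 # extrapolated approximation of U^{n+1}
        Unew = (2U - Uold/2 + k*N(Uext))/(3/2 - k*L)
        Uold, U = U, Unew
    end
    return U
end

# test problem u' = -u + u^2 with u(0) = 1/2, whose exact solution is 1/(1 + e^t)
u = imex_bdf2(-1.0, u -> u^2, 0.5, 0.01, 100)
```

For a system, the division by 3/2 − kL becomes a linear solve with the matrix (3/2)I − kL, which can be factored once and reused at every step.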
Integrating factors
Integrating factors provide another way to deal with stiff problems. Consider the
problem $u' = \mathrm{L}u + \mathrm{N}u$ with $u(0) = u_0$, where L is a linear operator and N is a
nonlinear operator. If we set $v = e^{-t\mathrm{L}}u$, then
There is no splitting error, but such an approach may only lessen the stiffness.
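The substitution can be spelled out in one line. This is the standard calculation, written here under the assumption that L does not depend on time, so that L commutes with its own exponential:

```latex
v' = -\mathrm{L}\,e^{-t\mathrm{L}}u + e^{-t\mathrm{L}}u'
   = -\mathrm{L}\,e^{-t\mathrm{L}}u + e^{-t\mathrm{L}}\left(\mathrm{L}u + \mathrm{N}u\right)
   = e^{-t\mathrm{L}}\,\mathrm{N}\!\left(e^{t\mathrm{L}}v\right),
\qquad v(0) = u_0 .
```

The linear term no longer appears explicitly, but the factors $e^{\pm t\mathrm{L}}$ still carry the fast scales, which is one way to see why the stiffness is reduced and not necessarily removed.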
by the length of the pendulum. To simplify the discussion, we’ll take unit mass,
length, and gravitational acceleration. In this nondimensionalized system with
𝜅 = 1 and with the position 𝑞(𝑡) and the angular velocity 𝑝(𝑡), we have the
system of equations⁵
$$\frac{\mathrm{d}q}{\mathrm{d}t} = p \quad\text{and}\quad \frac{\mathrm{d}p}{\mathrm{d}t} = -\sin q.$$
To find the trajectory in phase space $(q, p)$, we use conservation of energy.
The sum of the kinetic energy $T$ and potential energy $V$, given by the Hamiltonian
$H = T + V = \tfrac12 p^2 - \cos q$, is constant along a trajectory. For initial conditions
$(q_0, p_0)$, the total energy is given by $H_0 = \tfrac12 p_0^2 - \cos q_0$. Because total energy
is constant, $p^2 = 2H_0 + 2\cos q$, from which we can plot the trajectories in the
phase plane. Suppose that we solve the problem using the forward Euler and
backward Euler methods taking $q_0 = \pi/3$ and $p_0 = 0$ with timestep $k = 0.1$.
See the figures below or the QR code at the bottom of this page. In both cases,
the pendulum starts in the same position. When the pendulum has sufficient
momentum | 𝑝 0 | > 2, it has enough energy to swing over the top. Otherwise, the
pendulum swings back and forth with |𝑞(𝑡)| < 𝜋.
[Figure: phase-plane trajectories for the forward Euler and backward Euler methods.]
The forward Euler method is unstable, and the total energy grows exponentially
in time. Even if the pendulum starts with a tiny swing, it will speed up over
time, getting higher and higher with each swing until, eventually, it swings over
the top. The backward Euler method is stable, but the total energy decays over
time. Even if the pendulum starts with enough energy to swing over the top, over
time, energy dissipates until the pendulum all but stops at the bottom of its swing.
Energy is not conserved in either method, resulting in unphysical solutions. In
problems such as planetary motion and molecular dynamics, a scheme that
conserves energy over a long period is essential to getting a physically correct
solution. Smaller timesteps and higher-order methods can help, but energy may
still not be conserved over time. Let’s look at a class of solvers designed to
conserve the Hamiltonian.
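The energy drift described above is easy to reproduce. The sketch below applies both Euler methods to the pendulum with q₀ = π/3, p₀ = 0, and k = 0.1, and compares the Hamiltonian after 100 steps. The function names are illustrative, and the implicit backward Euler update is solved with a simple fixed-point iteration, which is an assumption that is adequate only because k is small.

```julia
H(q, p) = p^2/2 - cos(q)

function forward_euler(q, p, k, n)
    for _ in 1:n
        q, p = q + k*p, p - k*sin(q)
    end
    return q, p
end

function backward_euler(q, p, k, n)
    for _ in 1:n
        Q = q
        for _ in 1:50                   # fixed-point iteration for Q = q + k(p - k sin Q)
            Q = q + k*(p - k*sin(Q))
        end
        q, p = Q, p - k*sin(Q)
    end
    return q, p
end

q0, p0, k = π/3, 0.0, 0.1
H0 = H(q0, p0)
H_forward = H(forward_euler(q0, p0, k, 100)...)    # energy has grown
H_backward = H(backward_euler(q0, p0, k, 100)...)  # energy has decayed
```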
The pendulum belongs to the general class of Hamiltonian systems
$$\frac{\mathrm{d}p}{\mathrm{d}t} = -\nabla_q H \quad\text{and}\quad \frac{\mathrm{d}q}{\mathrm{d}t} = +\nabla_p H.$$
5 This problem can be solved analytically in terms of elliptic functions. See Karlheinz Ochs’
article “A comprehensive analytical solution of the nonlinear pendulum.”
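By defining $\mathbf{r} \equiv (p, q)$:
$$\frac{\mathrm{d}\mathbf{r}}{\mathrm{d}t} = \mathrm{D}_H \mathbf{r} \equiv \mathbf{J}\nabla H(\mathbf{r}) \quad\text{where}\quad \mathbf{J} = \begin{bmatrix} 0 & -\mathbf{I} \\ +\mathbf{I} & 0 \end{bmatrix}.$$
$$\frac{\mathrm{d}H}{\mathrm{d}t} = \nabla_p H \frac{\mathrm{d}p}{\mathrm{d}t} + \nabla_q H \frac{\mathrm{d}q}{\mathrm{d}t} = 0,$$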
or alternatively
$$\begin{aligned}
P^{n+1/2} &= P^n - \tfrac12 k \nabla_q V(Q^n) \\
Q^{n+1} &= Q^n + k \nabla_p T(P^{n+1/2}) \\
P^{n+1} &= P^{n+1/2} - \tfrac12 k \nabla_q V(Q^{n+1}).
\end{aligned}$$
This second-order approach, called the Verlet method, was reintroduced by Loup
Verlet in 1967 and can be viewed as an application of Strang splitting.
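For the pendulum, with T(p) = ½p² and V(q) = −cos q, the three Verlet updates translate directly into code. This is a minimal sketch with illustrative names. It measures how far the Hamiltonian has moved after many steps.

```julia
H(q, p) = p^2/2 - cos(q)

# Verlet method for the pendulum: ∇V(q) = sin q and ∇T(p) = p
function verlet(q, p, k, n)
    for _ in 1:n
        p = p - k/2*sin(q)      # half step in momentum
        q = q + k*p             # full step in position
        p = p - k/2*sin(q)      # half step in momentum
    end
    return q, p
end

q0, p0 = π/3, 0.0
drift = abs(H(verlet(q0, p0, 0.1, 10_000)...) - H(q0, p0))
```

The energy is not conserved exactly, but its error stays bounded over long times instead of growing or decaying as it does with the Euler methods.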
We can continue to build a higher-order method by composing Verlet methods
together. For a fourth-order method, take a Verlet with a timestep of 𝛼𝑘 followed
by a Verlet with timestep (1 − 2𝛼)𝑘 followed by a Verlet with timestep 𝛼𝑘 where
$\alpha = (2 - 2^{1/3})^{-1}$. To dig deeper, see Hairer, Lubich, and Wanner’s treatise on
geometric numerical integration.
For a complete account of the development of Matlab’s ODE suite, see any of
the articles authored by Lawrence Shampine in the References. When Lawrence
Shampine developed Matlab’s ODE suite, he had a principal design goal: the
suite should have a uniform interface so the different specialized and robust
solvers could be called exactly the same way. It’s this trade-off between efficiency
and convenience that he felt differentiated a “problem solving environment” such
as Matlab from “general scientific computation.” Instead of simply returning
the solution at each timestep, for example, Shampine designed Matlab’s ODE
suite to provide a smooth interpolation between timesteps (especially) when the
high-order solvers took large timesteps. Such a design consideration indeed
simplifies plotting nice-looking solutions.
Matlab has four explicit Runge–Kutta methods designed for nonstiff problems.
The low-order ode23 routine is the third-order Bogacki–Shampine Runge–Kutta
method presented on page 359. The medium-order ode45 routine is the popular
Dormand–Prince Runge–Kutta method (sometimes called the DOPRI method),
using six function evaluations to compute a fourth-order solution with a fifth-order
error correction. This routine is sometimes recommended as the go-to solver for
nonstiff problems. The high-order ode78 and ode89 routines—Verner’s “most
efficient” 7(6) and 8(9) Runge–Kutta methods–often outperform other methods,
particularly on problems with smooth solutions. Another routine designed for
nonstiff problems, ode113, is a variable-step, variable-order Adams–Bashforth–
Moulton PECE. It computes a solution up to order 12 and an error estimate up to
order 13 to control the variable step size. The ode15s routine uses a variation on
Klopfenstein–Shampine numerical differentiation formulas. These formulas are
modifications of BDF methods with variable-step, variable-order to order 5 and
have good stability properties, especially at higher orders. The ode15i routine is
6 For a complete list of Julia’s ODE solvers, see https://diffeq.sciml.ai/latest/solvers/ode_solve/.
Figure 12.12: Equivalent routines in Julia, Matlab, and Python. The table
includes only a fraction of all methods available in Julia. Matlab and Python
routines are available in Julia using wrappers. † Only MATLAB. ‡ Octave uses
lsode instead.
National Laboratory (LLNL). It uses BDF methods for stiff problems and Adams–
Moulton methods for nonstiff problems, using functional iteration to evaluate the
implicit terms.
Python’s scipy.integrate library has several ODE solvers. The RK45 (default), RK23,
and BDF routines are equivalent to Matlab’s ode45, ode23, and ode15s routines.
The DOP853 routine implements the Dormand–Prince 8(5,3) Runge–Kutta. LSODA
is similar to the lsode solver available in Octave, except that it automatically
and dynamically switches between the nonstiff Adams–Moulton and stiff BDF
solvers. The Radau routine is an order-5 Radau IIA (fully-implicit Runge–Kutta)
method. The Radau, BDF, and LSODA routines all require a Jacobian matrix of
the right-hand side of the system. If none is provided, the Jacobian will be
approximated using a finite difference approximation. These routines can also
be called from solve_ivp, a generic interface that uses adaptive step size over a
prescribed interval of integration, through the method option. These methods can
also be called for one time step, allowing them to be integrated into time-splitting
The scikits.odes package provides a few more routines. Two of these routines,
CVODE and IDA, come from LLNL’s Sundials7 library. The Sundials routine CVODE 8
includes Adams–Moulton formulas (with orders up to 12) for nonstiff problems
and BDFs (with orders up to 5)—similar to LSODA. The Sundials IDA routine
solves differential-algebraic equation systems in the form $f(t, u, u') = 0$.
Runge–Kutta (ARK) methods are IMEX methods that combine an ERK for
nonstiff terms and a DIRK for stiff terms. The routines found in ARKode, a class of
ARK methods developed by Christopher Kennedy and Mark Carpenter, are also
available through a class of DifferentialEquations.jl routines, such as KenCarp4.
Let’s look at a recipe using DifferentialEquations.jl. Suppose that we wish to
solve the equation for a pendulum $u'' = -\sin u$ with initial conditions $u(0) = 8\pi/9$
and $u'(0) = 0$ over $t \in [0, 8\pi]$. For this problem, we’ll want to use a symplectic
or almost symplectic method, so let’s use the trapezoidal method. We can
solve the problem in Julia using the following steps (see the Back Matter for
implementation in Python and Matlab):
1. Load the module          using DifferentialEquations, Plots
2. Set up the parameters    pendulum(u,p,t) = [u[2]; -sin(u[1])]
                            u0 = [8π/9, 0]; tspan = (0.0, 8π)
3. Define the problem       problem = ODEProblem(pendulum, u0, tspan)
4. Choose the method        method = Trapezoid()
5. Solve the problem        solution = solve(problem, method)
6. Present the solution     plot(solution, xaxis="t", label=["θ" "ω"])
If a method is not explicitly stated, Julia will automatically choose one when it
solves the problem. You can help Julia choose a method by telling it whether the
problem is stiff (alg_hints = [:stiff]) or nonstiff (alg_hints = [:nonstiff]).
The solution can be addressed either as an array where solution[i] indicates
the 𝑖th element or as a function where solution(t) is the interpolated value at
time 𝑡. Implicit-explicit (IMEX) problems such as 𝑢 𝑡 = 𝑓 (𝑢, 𝑡) + 𝑔(𝑢, 𝑡) can
be specified using the SplitODEProblem function and choosing an appropriate
solver. When solving a semilinear system of equations, you can define a function
𝑓 (𝑢, 𝑡) ≡ L 𝑢 as a linear operator using the DiffEqArrayOperator function from
the DiffEqOperators.jl library.
DifferentialEquations.jl is a suite of utilities and hundreds of ODE solvers.
No one solver works best for all problems in all scenarios. One routine may be
relatively efficient for one class of problems; another one may be better suited for
a different class. As we’ve seen throughout this chapter, a routine’s accuracy and
computing time are a combination of several competing and compounding factors.
Benchmarking different solvers on various standard problems is one approach to
measuring performance. Figure 12.12 summarizes the relative efficiencies of
routines determined in the following benchmarking example.
population model
The variable 𝑥 is the population of the prey (e.g., rabbits) and the variable
𝑦 is the population of the predators (e.g., lynxes). We’ll take the parameters
{𝛼, 𝛽, 𝛾, 𝛿} = {1.5, 1, 3, 1}. Notice how the populations of prey and predators
fluctuate symbiotically over time:
$$x'' - \mu(1 - x^2)x' + x = 0,$$
where the nonnegative parameter 𝜇 controls damping. When 𝜇 is zero, the Van
der Pol equation is a simple linear oscillator. When 𝜇 is large, the Van der Pol
equation exhibits slow decay punctuated by rapid phase shifts.9 The following
plot shows the Van der Pol oscillations when 𝜇 = 100 with 𝑥(0) = 2:
which required that 𝜇 be smaller than one. Van der Pol later studied behavior for large values of 𝜇
and coined the term “relaxation oscillation.”
10 The code for this example is available in the Jupyter notebook that accompanies this book. Chris
Rackauckas’ SciMLBenchmarks.jl project also presents summaries and plots of several benchmarks
using the DiffEqDevTools.jl package.
Differential-algebraic equations 379
Figure 12.13: Computing time versus precision for the Lotka–Volterra equation
(nonstiff) and the Van der Pol equation with 𝜇 = 1000 (stiff).
$$\begin{aligned}
x' &= u \\
y' &= v \\
m u' &= -\tau x \\
m v' &= -\tau y + mg \\
0 &= x^2 + y^2 - \ell^2
\end{aligned}$$
This system presents a few challenges. We don’t explicitly know the tension 𝜏(𝑡)
on the rod. Also, five variables and one constraint leave us with four degrees of
freedom—that’s too many. For a pendulum, we only need to prescribe an initial
angle and an initial angular velocity. To get down to two degrees of freedom,
we’ll need two additional constraints.
Let’s start by differentiating the equation $0 = x^2 + y^2 - \ell^2$ with respect to
time. We get $0 = 2xx' + 2yy'$, from which we have $xu + yv = 0$, or $\mathbf{x}^{\mathsf T}\mathbf{u} = 0$. This
expression says that the motion is perpendicular to the rod. Okay, that’s obvious.
Let’s take another derivative. After making a few substitutions, we have
$$0 = m(u^2 + v^2) - \tau\ell^2 + mgy. \tag{12.15}$$
This expression says that the centripetal force $m\|\mathbf{v}\|^2/\ell$ equals the difference
between the tension $\tau\ell$ and the radial component of the bob’s weight $mgy/\ell = mg\cos\theta$.
Let’s take one more derivative. After a few substitutions, we have
$$0 = -\ell^2\tau' + 3mgv, \tag{12.16}$$
or $\tau' = 3mgv/\ell^2$, a differential equation for the tension. Note that this equation
is the conservation of energy. The total energy is the sum of kinetic and potential
energy. Using (12.15) to replace $m\|\mathbf{v}\|^2$ with $\tau\ell^2 - mgy$, we have
$$E = \tfrac12 m\|\mathbf{v}\|^2 - mgy = \tfrac12\left(\tau\ell^2 - mgy\right) - mgy = \tfrac12\tau\ell^2 - \tfrac32 mgy.$$
The conservation of energy says that the total energy is constant in time:
$$0 = E' = \tfrac12\left(\ell^2\tau' - 3mgv\right).$$
We now have many equivalent systems. The original problem combines four
differential equations with a constraint for the rod length. We can alternatively
combine the differential equations with the constraint on motion direction. We
can combine them with the constraint balancing the forces on the rod. Or we can
combine them with a fifth differential equation for the conservation of energy.
$$x' = u, \qquad y' = v, \qquad m u' = -\tau x, \qquad m v' = -\tau y + mg$$
and one of
$x^2 + y^2 = \ell^2$ (index-3)
$ux + vy = 0$ (index-2)
$m(u^2 + v^2) = \tau\ell^2 - mgy$ (index-1)
$\ell^2\tau' = 3mgv$ (index-0)
We can reduce a DAE to a system of purely differential equations using
repeated differentiation as long as the problem is not singular. The index
of a DAE is the minimum number of differentiations needed to do so. The
original pendulum DAE required three differentiations to reduce it to a system of
differential equations, so it is index-3. A system of purely differential equations
has index-0. We can immediately solve the index-0 pendulum problem
$$x' = u, \quad y' = v, \quad u' = -\tau x/m, \quad v' = -\tau y/m + g, \quad \tau' = 3mgv/\ell^2, \tag{12.17}$$
as long as the five initial conditions for $(x, y, u, v, \tau)$ are consistent. Alternatively,
we can solve the index-1 pendulum problem, first using the algebraic equation to
find the tension $\tau$ as a function of $(x, y, u, v)$ to get a system of four differential equations
$$x' = u, \quad y' = v, \quad u' = -\tau x/m, \quad\text{and}\quad v' = -\tau y/m + g, \tag{12.18}$$
where $\tau = m(u^2 + v^2 + gy)/\ell^2$. Let’s solve this pendulum problem in Julia
Figure 12.14: Trace (𝑥(𝑡), 𝑦(𝑡)) of the pendulum. The drift (left) is removed
using manifold projection or mass matrix (right).
using DifferentialEquations, Plots
θ₀ = π/3; ℓ = 1; tspan = (0,30.0)
U₀ = [ℓ*sin(θ₀), -ℓ*cos(θ₀), 0, 0]
# right-hand side of the index-1 system (12.18) with parameters p = (ℓ,g,m)
function pendulum(dU,U,p,t)
  x,y,u,v = U; ℓ,g,m = p
  τ = m*(u^2 + v^2 + g*y)/ℓ^2
  dU .= (u, v, -τ*x/m, -τ*y/m + g)
end
problem = ODEProblem(pendulum, U₀, tspan, (ℓ,1,1))
solution = solve(problem, Tsit5())
plot(solution, idxs=(1,2), yflip=true, aspect_ratio=:equal)
The leftmost plot of Figure 12.14 shows the solution (𝑥(𝑡), 𝑦(𝑡)). Also, see the
QR code at the bottom of this page. While the solution initially traces the arc of
the pendulum correctly, it slowly drifts from that arc (a mild instability), and it
finally ends in utter disagreement. Drift isn’t specific to the problem (12.18). We also
get drift solving (12.17).
We can stabilize the method by controlling the step size, e.g., setting
reltol=1e-8 and abstol=1e-8 instead of the default reltol=1e-3 and abstol=1e-6.
Alternatively, we can stabilize the problem by orthogonally projecting the solution
onto the constraint manifold 𝑥 2 + 𝑦 2 − ℓ 2 = 0 and 𝑥𝑢 + 𝑦𝑣 = 0 to prevent the
solution from drifting off. We solve the problem using an explicit solver and
use Newton’s method at every step (or at every few steps after the residual drifts
beyond a tolerance) to find a zero residual.11 Manifold projection is built into
Julia’s DifferentialEquations library using a callback function.
# residual of the two constraints; the last two entries are unused
function residual(r,w,p,t)
  x,y,u,v = w; ℓ,g,m = p
  r .= (x^2+y^2-ℓ^2, x*u+y*v, 0, 0)
end
cb = ManifoldProjection(residual)
solution = solve(problem, Tsit5(), callback=cb)
11 Note that a Newton solver isn’t technically necessary for the pendulum problem because we
can explicitly solve the residual equations to add the conditions x = x/ℓ; y = y/ℓ; u = -v*y/x
directly into the function pendulum.
The rightmost plot in Figure 12.14 shows the solution using manifold
projection. Manifold projection becomes computationally expensive as the
system grows larger. Let’s instead examine a general strategy for solving DAEs
by reducing them to index-1 (or index-0) problems so that they can be solved.
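$$\begin{aligned} \mathbf{u}' &= \mathbf{f}(t, \mathbf{u}, \mathbf{v}) \\ 0 &= \mathbf{g}(t, \mathbf{u}, \mathbf{v}), \end{aligned}$$
$$\begin{aligned} \mathbf{u}' &= \mathbf{f}(t, \mathbf{u}, \mathbf{v}) \\ \varepsilon\mathbf{v}' &= \mathbf{g}(t, \mathbf{u}, \mathbf{v}) \end{aligned}$$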
in the limit as 𝜀 goes to zero. In this way, we can think of a DAE as a singularly
perturbed ODE, in which some variables evolve with infinitely fast dynamics
(the constraints are instantaneously enforced). When 𝜀 is small, the system
behaves like a stiff ODE. So we will want to use stiff solvers when solving DAEs.
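The semi-explicit form can be written as
$$\mathbf{M}\begin{bmatrix} \mathbf{u} \\ \mathbf{v} \end{bmatrix}' = \begin{bmatrix} \mathbf{f}(t, \mathbf{u}, \mathbf{v}) \\ \mathbf{g}(t, \mathbf{u}, \mathbf{v}) \end{bmatrix}, \quad\text{where}\quad \mathbf{M} = \begin{bmatrix} \mathbf{I} & 0 \\ 0 & 0 \end{bmatrix}$$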
using DifferentialEquations, LinearAlgebra, Plots
θ₀ = π/3; ℓ = 1; tspan = (0,30.0)
u₀ = [ℓ*sin(θ₀), -ℓ*cos(θ₀), 0, 0, -ℓ*cos(θ₀)]
# four differential equations followed by the algebraic constraint (12.15)
function pendulum(dU,U,p,t)
  x,y,u,v,τ = U; ℓ,g,m = p
  dU .= (u, v, -τ*x/m, -τ*y/m + g, -ℓ^2*τ + m*g*y + m*(u^2 + v^2))
end
# the zero on the diagonal marks the fifth equation as algebraic
M = Diagonal([1,1,1,1,0])
f = ODEFunction(pendulum, mass_matrix=M)
problem = ODEProblem(f, u₀, tspan, (ℓ,1,1))
solution = solve(problem, Rodas5(), reltol=1e-8, abstol=1e-8)
plot(solution, idxs=(1,2), yflip=true, aspect_ratio=:equal)
Index reduction
Not all DAEs can be solved using ODE solvers. However, there is a class of
DAEs called Hessenberg forms that can be solved systematically.12 Differential-
algebraic equations in Hessenberg form can be reduced to purely differential
Let’s start with a DAE in semi-explicit form
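$$\begin{aligned} \mathbf{u}' &= \mathbf{f}(t, \mathbf{u}, \mathbf{v}) \\ 0 &= \mathbf{g}(t, \mathbf{u}, \mathbf{v}). \end{aligned}$$
12 Hessenberg is a reference to upper Hessenberg matrices, which are zero below the first subdiagonal.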
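$$0 = \frac{\partial \mathbf{g}}{\partial t} + \frac{\partial \mathbf{g}}{\partial \mathbf{u}}\mathbf{u}' + \frac{\partial \mathbf{g}}{\partial \mathbf{v}}\mathbf{v}',$$
which we can solve for $\mathbf{v}'$ as long as the Jacobian $\partial\mathbf{g}/\partial\mathbf{v}$ is not singular, leaving
us with the system of differential equations
$$\begin{aligned}
\mathbf{u}' &= \mathbf{f}(t, \mathbf{u}, \mathbf{v}) \\
\mathbf{v}' &= -\left(\frac{\partial \mathbf{g}}{\partial \mathbf{v}}\right)^{-1}\left(\frac{\partial \mathbf{g}}{\partial t} + \frac{\partial \mathbf{g}}{\partial \mathbf{u}}\mathbf{u}'\right).
\end{aligned}$$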
An index-2 Hessenberg DAE has the form
$$\begin{aligned} \mathbf{u}' &= \mathbf{f}(t, \mathbf{u}, \mathbf{v}) \\ 0 &= \mathbf{g}(t, \mathbf{u}), \end{aligned}$$
where the function g is independent of v and the Jacobians 𝜕g/𝜕u and 𝜕f/𝜕v are
nonsingular. The incompressible Navier–Stokes equation
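$$\begin{aligned} \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} &= \nu\Delta\mathbf{u} - \nabla p \\ \nabla\cdot\mathbf{u} &= 0. \end{aligned}$$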
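$$\begin{aligned} \mathbf{u}' &= \mathbf{f}(t, \mathbf{u}, \mathbf{v}, \mathbf{w}) \\ \mathbf{v}' &= \mathbf{g}(t, \mathbf{u}, \mathbf{v}) \\ 0 &= \mathbf{h}(t, \mathbf{v}), \end{aligned}$$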
where the Jacobians 𝜕h/𝜕v, 𝜕g/𝜕u, and 𝜕f/𝜕w are nonsingular. These are
second-order ODEs, like the pendulum. We need to reduce the differential order
to one. Otherwise, the Jacobian is singular, and implicit solvers fail.
While we can reduce the index of a small system like the pendulum equation
by hand, manual manipulation is impractical on large systems. Instead, we can use
automated methods like the Pantelides algorithm, which uses a graph theoretical
approach, to reduce the index. The following Julia code uses ModelingToolkit.jl
to reduce the order with the Pantelides algorithm:
Let’s look at one more approach to solving DAEs. Rather than considering a
DAE as a system of differential equations with algebraic constraints, we can view
it as an algebraic system that incorporates derivative terms.13 By approaching
the problem from this angle, we can use orthogonal collocation, also known
as a spectral method, which involves approximating the space of solutions as
orthogonal polynomials. Exercise 11.10 used Gauss–Lobatto nodes to define a
differentiation operator for Legendre polynomials. Given 𝑛 nodes (including one
endpoint) and 𝑚 variables, we would need to solve a system of 𝑚𝑛 equations. To
manage the complexity and stability, we can divide the time domain into smaller
steps and reduce the number of nodes over each step to get a pseudospectral
Let’s solve the pendulum problem in Julia using the Gauss pseudospectral
method. First, we will define the differentiation matrix for Legendre–Lobatto
polynomials. See exercise 11.10 on page 330 and its solution on page 527 for
using FastGaussQuadrature
function differentiation_matrix(n,Δt=1)
  nodes, _ = gausslobatto(n+1)
  t = (nodes[2:end].+1)/2       # map the nodes to (0,1], dropping the left endpoint
  A = t.^(0:n-1)'.*(1:n)'
  B = t.^(1:n)'
  (A/B)/Δt, t
end
13 To ponder: Is a zebra white with black stripes or black with white stripes?
Now, let’s solve the DAE using the NonlinearSolve library by solving a nonlinear
system at each time step. We’ll use the intermediate nodes as part of our solution.
using NonlinearSolve, Plots
θ₀ = π/3; ℓ = 1; tspan = (0,30.0)
u₀ = [ℓ*sin(θ₀), -ℓ*cos(θ₀), 0, 0, 0]
n = 5; N = 100; Δt = 30/N
M,_ = differentiation_matrix(n,Δt)
D = (u,u₀) -> M*(u .- u₀)
u = u₀
# the residual function pendulumGLL is not shown in this listing
for i = 2:N
  problem = NonlinearProblem(pendulumGLL,[ones(n)*u₀'...],(u₀,D,n))
  solution = solve(problem,NewtonRaphson(),abstol=1e-12)
  u = [u reshape(solution.u,5,n)']
  u₀ = u[:,end]
end
plot(u[1,:], u[2,:], yflip=true, aspect_ratio=:equal)
12.12 Exercises
12.3. Show that applying one step of the trapezoidal method to a linear differential
equation is the same as applying a forward Euler method for one half time step
followed by the backward Euler method for one half time step.
12.4. Plot the region of stability for the Merson method with Butcher tableau
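$$\begin{array}{c|ccccc}
0 & & & & & \\
\tfrac13 & \tfrac13 & & & & \\
\tfrac13 & \tfrac16 & \tfrac16 & & & \\
\tfrac12 & \tfrac18 & 0 & \tfrac38 & & \\
1 & \tfrac12 & 0 & -\tfrac32 & 2 & \\ \hline
 & \tfrac16 & 0 & 0 & \tfrac23 & \tfrac16
\end{array}$$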
12.5. Show that $e^{\frac12 k\mathrm{A}}\, e^{k\mathrm{B}}\, e^{\frac12 k\mathrm{A}} - e^{k(\mathrm{A}+\mathrm{B})} = O(k^3)$ for two arbitrary linear
operators A and B. From this, it follows that the error using Strang splitting is
$O(k^2)$.
12.6. The multistage trapezoidal and BDF2 introduced in the example on page
358 are both implicit methods. Implementing each stage typically requires
computing a Jacobian matrix for Newton’s method when the right-hand-side
function 𝑓 (𝑢) is nonlinear. Thankfully, by choosing 𝛼 appropriately, we can
reuse the same Jacobian matrix across both stages.
1. Write the formulas for the trapezoidal method over a subinterval 𝛼𝑘 and the
BDF2 method over the two subintervals 𝛼𝑘 and (1 − 𝛼)𝑘 with 0 < 𝛼 < 1.
2. Determine the Jacobian matrices for these methods.
3. Find an 𝛼 such that the Jacobian matrices are proportional.
4. Plot the region of absolute stability.
5. Show that the multistage trapezoidal–BDF2 method is L-stable for this
choice of 𝛼.
12.7. Develop a third-order L-stable IMEX scheme. Study the method’s stability
by writing the characteristic polynomial and plotting the regions of absolute
stability in the complex plane for the implicit part and the explicit part of the
IMEX scheme. b
12.8. Plot the regions of absolute stability for the Adams–Bashforth–Moulton
predictor-corrector methods: AB1-AM2, AB2-AM3, AB3-AM4, and AB4-AM5
with PE(CE)$^m$ for $m = 0, 1, \dots, 4$. b
12.9. Compute the Padé approximant 𝑅3,2 (𝑥) of log 𝑟 starting with a Taylor
polynomial as outlined on page 354. b
12.10. Show that the Verlet method for the system $u' = p$ and $p' = f(u)$ is
the same as the central difference scheme $U^{n+1} - 2U^n + U^{n-1} = k^2 f(U^n)$ for
$u'' = f(u)$.
12.13. The SIR model is a simple epidemiological model for infectious diseases
that tracks the change over time in the percentage of susceptible, infected, and
recovered individuals of a population
\[ \frac{dS}{dt} = -\beta I S, \qquad \frac{dI}{dt} = \beta I S - \gamma I, \qquad \frac{dR}{dt} = \gamma I, \]
where 𝑆(𝑡) + 𝐼(𝑡) + 𝑅(𝑡) = 1, 𝛽 > 0 is the infection rate, and 𝛾 > 0 is the recovery
rate. The basic reproduction number 𝑅0 = 𝛽/𝛾 tells us how many other people,
on average, an infectious person might infect at the onset of the epidemic at time
𝑡 = 0. Plot the curves {𝑆(𝑡), 𝐼(𝑡), 𝑅(𝑡)} of the SIR model over 𝑡 ∈ [0, 15] starting
with an initial state {0.99, 0.01, 0} with 𝛽 = 2 and 𝛾 = 0.4. b
12.14. The Duffing equation
12.15. One way to solve a boundary value problem 𝑦″(𝑥) = 𝑓(𝑦, 𝑦′, 𝑥) with
boundary values 𝑦(𝑥0) = 𝑦0 and 𝑦(𝑥1) = 𝑦1 is by using a shooting method. A
shooting method solves the initial value problem 𝑦″(𝑥) = 𝑓(𝑦, 𝑦′, 𝑥) using one
of the boundary conditions 𝑦(𝑥0) = 𝑦0 and 𝑦′(𝑥0) = 𝑠 for some guess 𝑠. We
then compare the solution at 𝑥1 with the prescribed boundary
value 𝑦1 to update our guess 𝑠 for the initial slope. With this new
guess, we again solve the initial value problem and compare its solution with the
correct boundary value to get another updated guess. We continue in this manner
until our solution 𝑦(𝑥1) converges to 𝑦1. This problem is equivalent to finding
the zeros 𝑠 of the error function 𝑒(𝑠) = 𝑦(𝑥1; 𝑠) − 𝑦1. Use the Roots.jl function
find_zero along with an ODE solver to find the solution to the Airy equation
𝑦″ + 𝑥𝑦 = 0 with 𝑦(−12) = 1 and 𝑦(0) = 1. b
12.16. Two balls of equal mass are attached in series to a wall using two
springs, one soft and one stiff.
[Figure: a wall, a spring with constant 𝜅1, a ball at displacement 𝑥1, a spring with constant 𝜅2 stretched by 𝑥2 − 𝑥1, and a second ball]
\[ x_1'' = -\kappa_1 x_1 + \kappa_2 (x_2 - x_1), \qquad x_2'' = -\kappa_2 (x_2 - x_1) \]
Suppose that the soft spring is initially stretched to 𝑥1(0) = 1 and the stiff spring
is in equilibrium at 𝑥2(0) = 𝑥1(0). Discuss the limiting behavior when 𝜅2 → ∞.
Solve the problem numerically when 𝜅2 ≫ 𝜅1.
Chapter 13
Parabolic Equations
Fickian diffusion says that the flux J(𝑡, x) is proportional to the gradient of
the concentration 𝑢(𝑡, x). Imagine a one-dimensional random walk in which
particles move equally to the right or the left. Suppose that we have two adjacent
cells, one at 𝑥 − ½ℎ with a particle concentration 𝑢(𝑥 − ½ℎ) and one at 𝑥 + ½ℎ
with a concentration 𝑢(𝑥 + ½ℎ), separated by an interface at 𝑥. More particles
from the cell with higher concentration will pass through the interface than
particles from the cell with lower concentration. If a particle moves through the
interface with a positive diffusivity 𝛼(𝑥), then the flux across the interface can
be approximated by
\[ J(t, x) = -\alpha(x) \frac{u(t, x + \tfrac12 h) - u(t, x - \tfrac12 h)}{h}. \]
In the limit as ℎ → 0, we have
\[ J = -\alpha(x) \frac{\partial u}{\partial x}. \]
And, in general,
\[ \frac{\partial u}{\partial t} = \nabla \cdot (\alpha(\mathbf{x}) \nabla u). \]
Consider an insulated heat-conducting bar. The temperature along the bar
can be modeled as 𝑢 𝑡 = (𝛼(𝑥)𝑢 𝑥 ) 𝑥 , where 𝛼(𝑥) > 0 is the heat conductivity
and 𝑢(0, 𝑥) = 𝑢 0 (𝑥) is the initial temperature distribution. Because the bar
is insulated, the heat flux at the ends of the bar is zero. For a uniform heat
conductivity 𝛼(𝑥) = 𝛼, the heat equation is simply
\[ u_t = \alpha u_{xx}, \quad u(0, x) = u_0(x), \quad u_x(t, x_L) = 0, \quad u_x(t, x_R) = 0. \qquad (13.1) \]
This problem is well-posed and can be solved analytically by the method of
separation of variables or Fourier series. In the next section, we’ll use it as a
starting place to examine numerical methods for parabolic partial differential
equations, equations that have one derivative in time and two derivatives in
space. For now, let’s examine the behavior of the solution over time. Take the
heat conductivity 𝛼 = 1, and take an initial distribution with 𝑢(0, 𝑥) = 1 for
𝑥 ∈ [−1, 1] and 𝑢(0, 𝑥) = 0 otherwise along [−2, 2]. The following figure and
the QR link at the bottom of this page show snapshots of the solution taken at
equal time intervals over 𝑡 ∈ [0, 0.1]:
[Figure: snapshots of the solution for 𝑥 from −2 to 2]
Notice how initially the solution rapidly evolves and then gradually slows down
over time. The height of the distribution decreases and its width increases,
steep gradients become shallow ones, sharp edges become smooth curves, and
the distribution seems to seek a constant average temperature along the entire
There are several prototypical properties of the heat equation to keep in
mind. It obeys a maximum principle—the maximum value of a solution always
decreases in time unless it lies on a boundary. Equivalently, the minimum value
of a solution always increases unless it lies on a boundary. Simply stated, warm
regions get cooler while cool regions get warmer. If the boundaries are insulating,
then the total heat or energy is conserved, i.e., the 𝐿 1 -norm of the solution is
constant in time. The heat equation is also dissipative, i.e., the 𝐿 2 -norm of the
solution—often confusingly called the “energy”—decreases in time. A third
property is that the solution at any positive time is smooth and grows smoother in
time. For a complete review, see Lawrence Evans’ textbook Partial Differential
Equations.
Parabolic equations are more than just the heat equation. One important class
of parabolic equations is the reaction-diffusion equation that couples a reactive
growth term with a diffusive decay term. These equations often model the
transitions between disorder and order in physical systems. Reaction-diffusion
equations are also used to model the pattern formation in biological systems. Take
chemotaxis. Animals and bacteria often communicate by releasing chemicals.
Not only do the bacteria spread out due to normal diffusion mechanisms, but the
bacteria may also move along a direction of a concentration gradient, following
the strongest attractant, which itself is subject to diffusion. The dispersive
mechanisms behind parabolic equations also drive stabilizing behavior in viscous
fluid dynamics like the Navier–Stokes equation, ensuring that it doesn’t blow up
like its inviscid cousin, the Euler equation.1
Ladyzhenskaya, Jacques–Louis Lions, and Giovanni Prodi independently proved the existence of
smooth solutions to the two-dimensional Navier–Stokes equation in the 1960s. In 2000, the Clay
Mathematics Institute added the three-dimensional Navier–Stokes equation as a one-million-dollar
Millennium Prize. It remains unsolved.
\[ D^2 U_j = D\left(\frac{U_{j+1/2} - U_{j-1/2}}{h}\right) = \frac{U_{j+1} - 2U_j + U_{j-1}}{h^2} \]
to get a second-order approximation of the second derivative. Hence, 𝑢𝑡 = 𝛼𝑢𝑥𝑥 becomes
\[ \frac{\partial}{\partial t} U_j = \alpha \frac{U_{j+1} - 2U_j + U_{j-1}}{h^2} \qquad (13.2) \]
with 𝑗 = 0, 1, . . . , 𝑚. We now have a system of 𝑚 + 1 ordinary differential
equations, and we can use any numerical methods for ODEs to solve the problem.
Note that the system is not closed. In addition to the interior mesh points at
𝑗 = 1, 2, . . . , 𝑚 − 1 and the explicit boundary mesh points at 𝑗 = 0 and 𝑗 = 𝑚, we
also have mesh points outside of the domain at 𝑗 = −1 and 𝑗 = 𝑚 + 1. We will
use the boundary conditions to remove these ghost points and close the system.
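As a rough illustration of time-marching the semidiscrete system (13.2), the following Julia sketch integrates it with forward Euler. It assumes zero Dirichlet data, so the two boundary values are simply held fixed rather than eliminated through ghost points, and it uses the test problem 𝑢(0, 𝑥) = sin 𝜋𝑥 on [0, 1], whose exact solution is e^(−𝜋²𝛼𝑡) sin 𝜋𝑥. The names heat_rhs and march are hypothetical helpers, not part of any library.

```julia
# Right-hand side of the semidiscrete system (13.2) at the interior points.
# The boundary entries are left at zero so that U[1] and U[end] stay fixed
# (an assumption: zero Dirichlet boundary conditions).
function heat_rhs(U, α, h)
    dU = zeros(length(U))
    for j in 2:length(U)-1
        dU[j] = α*(U[j+1] - 2U[j] + U[j-1])/h^2
    end
    return dU
end

# March the system of ODEs forward in time with the forward Euler method.
function march(U0, α, h, k, nsteps)
    U = copy(U0)
    for _ in 1:nsteps
        U += k*heat_rhs(U, α, h)
    end
    return U
end

m = 20; h = 1/m; x = (0:m)*h
α = 1.0; k = 0.4*h^2            # time step chosen small relative to h^2
nsteps = 400; t = nsteps*k
U = march(sin.(π*x), α, h, k, nsteps)
err = maximum(abs.(U - exp(-π^2*α*t)*sin.(π*x)))
```

Any other ODE method could be substituted for the loop in march; only heat_rhs encodes the spatial discretization.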
This approach of employing the numerical method of lines to solve a
partial differential equation is often called time-marching. We can combine any
consistent ODE solver with any consistent discretization of the Laplacian. The
previous chapter discussed the forward Euler, the backward Euler, the leapfrog,
and the trapezoidal methods to solve ODEs. Let’s apply these methods to the
linear diffusion equation. The leapfrog method applied to the diffusion equation is
called the Richardson method, and the trapezoidal method applied to the diffusion
equation is called the Crank–Nicolson method. British mathematicians John
Crank and Phyllis Nicolson developed their method in 1948 as a replacement for
the unstable method proposed by Lewis Fry Richardson in 1910.2 The following
stencils correspond to the stencils on page 337.
Forward Euler 𝑂(𝑘 + ℎ²)
\[ \frac{U_j^{n+1} - U_j^n}{k} = \alpha \frac{U_{j+1}^n - 2U_j^n + U_{j-1}^n}{h^2} \qquad (13.3) \]
2 Lewis Fry Richardson, hailed as the father of numerical weather prediction, published his
visionary book Weather Forecasting by Numerical Process in 1923, decades before the invention of
digital computers. In this work, Richardson outlines how the planet’s weather could be modeled
using finite difference approximations of nonlinear partial differential equations. He imagines a
fantastic “central-forecast factory” with 64,000 human computers arranged around the inner walls of
an enormous painted globe. Each computer works on an equation of the part of the map where they
sit, supplied with meteorological data from weather balloons spaced all over the world and informed
by numerous little signs displaying the instantaneous values from neighboring computers. A man sits
at a pulpit suspended on a tall pillar rising from the pit of the hall “like the conductor of an orchestra
in which the instruments are slide-rules and calculating machines. . . instead of waving a baton he
turns a beam of rosy light upon any region that is running ahead of the rest, and a beam of blue light
upon those who are behind hand.” Surely, the hand calculations of the human computers would be
tedious, and Richardson commiserates with them, saying, “Let us hope for their sakes that they are
moved on from time to time to new operations.”
Backward Euler 𝑂(𝑘 + ℎ²)
\[ \frac{U_j^{n+1} - U_j^n}{k} = \alpha \frac{U_{j+1}^{n+1} - 2U_j^{n+1} + U_{j-1}^{n+1}}{h^2} \qquad (13.4) \]
Richardson (unstable!) 𝑂(𝑘² + ℎ²)
\[ \frac{U_j^{n+1} - U_j^{n-1}}{2k} = \alpha \frac{U_{j+1}^n - 2U_j^n + U_{j-1}^n}{h^2} \qquad (13.5) \]
Crank–Nicolson 𝑂(𝑘² + ℎ²)
\[ \frac{U_j^{n+1} - U_j^n}{k} = \tfrac12 \alpha \frac{U_{j+1}^{n+1} - 2U_j^{n+1} + U_{j-1}^{n+1}}{h^2} + \tfrac12 \alpha \frac{U_{j+1}^n - 2U_j^n + U_{j-1}^n}{h^2} \qquad (13.6) \]
Boundary conditions
The method of lines converted the heat equation into a system of 𝑚 + 1 differential
equations with 𝑚 + 3 unknowns. In addition to the mesh points at 𝑥0 , 𝑥1 , . . . , 𝑥 𝑚 ,
we also have ghost points outside the domain. We use the boundary constraints
to explicitly remove the ghost points and close the system by eliminating the two
unknowns 𝑈−1 and 𝑈𝑚+1 . There are several common boundary conditions.3
A Dirichlet boundary condition specifies the value of the solution on the
boundary. For example, for a heat-conducting bar in contact with heat sinks of
temperatures 𝑢 𝐿 and 𝑢 𝑅 at the ends of the bar, we have the boundary conditions
𝑢(𝑡, 0) = 𝑢 𝐿 and 𝑢(𝑡, 1) = 𝑢 𝑅 . We can eliminate the left ghost point 𝑈−1 by using
a second-order approximation ½(𝑈−1 + 𝑈1) ≈ 𝑈0 = 𝑢𝐿. In this case, substituting
𝑈−1 = 2𝑢𝐿 − 𝑈1 into (13.2) gives us
\[ \frac{\partial}{\partial t} U_0 = -2\alpha \frac{U_0}{h^2} + 2\alpha \frac{u_L}{h^2}. \qquad (13.7) \]
We can similarly eliminate the right ghost point.
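The substitution can be written out term by term. The following is a sketch of the algebra for both ends, where the right-hand ghost value 𝑈𝑚+1 = 2𝑢𝑅 − 𝑈𝑚−1 is the assumed mirror image of the left-hand one:

```latex
\begin{align*}
\frac{\partial}{\partial t} U_0
  &= \alpha \frac{U_1 - 2U_0 + U_{-1}}{h^2}
   = \alpha \frac{U_1 - 2U_0 + (2u_L - U_1)}{h^2}
   = -2\alpha \frac{U_0}{h^2} + 2\alpha \frac{u_L}{h^2}, \\
\frac{\partial}{\partial t} U_m
  &= \alpha \frac{U_{m+1} - 2U_m + U_{m-1}}{h^2}
   = \alpha \frac{(2u_R - U_{m-1}) - 2U_m + U_{m-1}}{h^2}
   = -2\alpha \frac{U_m}{h^2} + 2\alpha \frac{u_R}{h^2}.
\end{align*}
```

In both rows the neighboring interior value cancels, so the boundary equations decouple from the interior unknowns.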
Example. Let’s implement a backward Euler scheme for (13.1) that uses
Dirichlet boundary conditions. Let 𝜈 = 𝛼𝑘/ℎ². From (13.4) and (13.7) we have
\[ U_0^{n+1} - U_0^n = -2\nu U_0^{n+1} + 2\nu u_L \]
\[ U_j^{n+1} - U_j^n = \nu \left( U_{j+1}^{n+1} - 2U_j^{n+1} + U_{j-1}^{n+1} \right) \]
\[ U_m^{n+1} - U_m^n = -2\nu U_m^{n+1} + 2\nu u_R. \]
3 “Science is a differential equation. Religion is a boundary condition.” —Alan Turing
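Move \( U^{n+1} \) terms to the left-hand side and \( U^n \) terms to the right-hand side:
\[
\begin{bmatrix}
1 + 2\nu \\
-\nu & 1 + 2\nu & -\nu \\
& \ddots & \ddots & \ddots \\
& & -\nu & 1 + 2\nu & -\nu \\
& & & & 1 + 2\nu
\end{bmatrix}
\begin{bmatrix} U_0^{n+1} \\ U_1^{n+1} \\ \vdots \\ U_{m-1}^{n+1} \\ U_m^{n+1} \end{bmatrix}
=
\begin{bmatrix} U_0^n \\ U_1^n \\ \vdots \\ U_{m-1}^n \\ U_m^n \end{bmatrix}
+
\begin{bmatrix} 2\nu u_L \\ 0 \\ \vdots \\ 0 \\ 2\nu u_R \end{bmatrix}.
\]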
This system is simply (I − 𝜈D)U𝑛+1 = U𝑛 + b where D is the modified tridiagonal
matrix (1, −2, 1). We need to invert a tridiagonal matrix at each step by using a
tridiagonal solver. A tridiagonal solver efficiently uses Gaussian elimination
by only operating on the diagonal and two off-diagonals, taking only 𝑂(3𝑚)
operations to solve a linear system instead of 𝑂(⅔𝑚³) operations for general
Gaussian elimination.
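To make the operation count concrete, here is a minimal sketch of such a tridiagonal solver (often called the Thomas algorithm) in Julia. It performs Gaussian elimination without pivoting, which is an assumption that is reasonable for diagonally dominant matrices such as I − 𝜈D but not in general. The function name thomas is hypothetical; in practice the backslash operator applied to a Tridiagonal matrix does this work.

```julia
using LinearAlgebra

# Solve a tridiagonal system with subdiagonal a (length n-1), diagonal b
# (length n), superdiagonal c (length n-1), and right-hand side d (length n).
# No pivoting is performed (assumes a diagonally dominant matrix).
function thomas(a, b, c, d)
    n = length(b)
    b = float.(b); d = float.(d)       # work on copies
    for i in 2:n                       # forward elimination
        w = a[i-1]/b[i-1]
        b[i] -= w*c[i-1]
        d[i] -= w*d[i-1]
    end
    x = similar(d)
    x[n] = d[n]/b[n]
    for i in n-1:-1:1                  # back substitution
        x[i] = (d[i] - c[i]*x[i+1])/b[i]
    end
    return x
end

n = 6; ν = 0.5
a = fill(-ν, n-1); b = fill(1 + 2ν, n); c = fill(-ν, n-1)
d = collect(1.0:n)
x = thomas(a, b, c, d)
residual = maximum(abs.(Tridiagonal(a, b, c)*x - d))
```

Each loop touches every row once, so the work grows linearly with the number of unknowns.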
The following Julia code implements the backward Euler method over the
domain [−2, 2] with zero Dirichlet boundary conditions and initial conditions
𝑢(0, 𝑥) = 1 for |𝑥| < 1 and 𝑢(0, 𝑥) = 0 otherwise.
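using LinearAlgebra
Δx = .01; Δt = .01; L = 2; ν = Δt/Δx^2; ul = 0; ur = 0
x = -L:Δx:L; m = length(x)
u = float.(abs.(x) .< 1)                  # initial condition as a Float64 vector
b = zeros(m); b[1] = 2ν*ul; b[m] = 2ν*ur  # boundary contribution to the right-hand side
D = Tridiagonal(ones(m-1), -2ones(m), ones(m-1))
D[1,2] = 0; D[m,m-1] = 0                  # decouple the two boundary rows
A = I - ν*D
for i in 1:20
  u = A\(u + b)                           # one backward Euler step
end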
We can compute the runtime by adding @time begin before and end after the
code (or the similar @btime macro from the BenchmarkTools.jl package). This
code takes roughly 0.0003 seconds on a typical laptop. If a nonsparse matrix
is used instead of the Tridiagonal one by setting D = Matrix(D), the runtime is
0.09 seconds—significantly slower but not unreasonably slow. However, for a
larger system with Δx = .001, the runtime using a Tridiagonal matrix is still
about 0.004 seconds, whereas for a nonsparse matrix, it is almost 15 seconds. J
The LinearAlgebra.jl library contains types such as Tridiagonal that have efficient
specialized linear solvers. Such matrices can be converted to regular matrices using the
command Matrix. The identity operator is simply I regardless of size.
at both ends, we set the heat flux 𝐽 (𝑡, 𝑥) to be zero on the boundary giving
𝑢 𝑥 (𝑡, 0) = 𝑢 𝑥 (𝑡, 1) = 0. Zero Neumann boundary conditions are also called
reflecting boundary conditions.
− (2 + 2𝜈)𝑈𝑚
= 2𝜈𝑈𝑚−1
+ (2 − 2𝜈)𝑈𝑚
Compare this code with that of the backward Euler method on the facing page,
noting similarities and differences. Running this code, we observe transient
spikes at 𝑥 = −1 and 𝑥 = 1, where the initial distribution had discontinuities.
Also, see the QR code at the bottom of this page.
[Figure: solution to the heat equation for 𝑥 from −2 to 2]
These are clearly numerical artifacts. What causes them, and how can we get a
better solution? The Crank–Nicolson method is A-stable but not L-stable. Large
eigenvalue components, associated with steep gradients, only slowly decay. See
Figure 12.10 on page 367 for comparison. We’ll examine numerical stability
further in the next section and in the next chapter when we discuss dispersive
and dissipative schemes. As a practical matter, we can regularize a problem by
replacing a discontinuity in an initial condition with a smooth, rapidly-changing
surrogate function. For example, instead of the discontinuous step function, we
can take ½ + ½ tanh(𝜀⁻¹𝑥), where the thickness of the interface is 𝑂(𝜀). J
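A minimal Julia sketch of this regularization follows. The function smooth_step is the surrogate described above; smooth_rect, a smoothed version of the rectangular initial condition built as the difference of two shifted steps, is a hypothetical helper added here for illustration, and the value 𝜀 = 0.05 is an arbitrary choice.

```julia
# Smooth surrogate for the unit step with interface thickness O(ε)
smooth_step(x, ε) = 1/2 + 1/2*tanh(x/ε)

# Hypothetical regularized rectangle: close to 1 for |x| < 1 and close to 0 elsewhere
smooth_rect(x, ε) = smooth_step(x + 1, ε) - smooth_step(x - 1, ε)

x = -2:0.01:2
u0 = smooth_rect.(x, 0.05)
```

The interface should span at least a few grid points, so 𝜀 is best chosen relative to the mesh spacing.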
The previous chapter found the region of absolute stability of a numerical method
for the problem 𝑢′ = 𝜆𝑢 by determining the values 𝜆𝑘 for which perturbations
of the solution decrease over time. We now want to determine under what
conditions a numerical method for a partial differential equation is stable. We
can apply a discrete Fourier transform to decouple a linear PDE into a system
of ODEs. Because eigenvalues are unaffected by a change in basis, we can use
the linear stability analysis we developed in the previous chapter. Furthermore,
periodic boundary conditions will not significantly change the analysis because
numerical stability is a local behavior.
Take the domain [0, 2𝜋] and uniform discretization in space 𝑥 𝑗 = 𝑗 ℎ with
ℎ = 2𝜋/(2𝑚 + 1). The discrete Fourier transform is defined as
\[ \hat{u}(t, \xi) = \frac{1}{2m + 1} \sum_{j=0}^{2m} u(t, x_j) e^{-\mathrm{i} \xi j h}, \]
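\[ \frac{\partial}{\partial t} \sum_{\xi=-m}^{m} \hat{u}(t, \xi) e^{\mathrm{i} \xi j h} = \sum_{\xi=-m}^{m} \alpha \frac{e^{\mathrm{i} \xi h} - 2 + e^{-\mathrm{i} \xi h}}{h^2} \hat{u}(t, \xi) e^{\mathrm{i} \xi j h}. \]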
[Figure panels: forward Euler, backward Euler, leapfrog, trapezoidal]
Figure 13.1: Regions of absolute stability (shaded) in the 𝜆𝑘-plane for the forward
Euler, backward Euler, leapfrog, and trapezoidal methods along with eigenvalues
𝜆𝜉 of the central difference approximation with 𝛼𝑘/ℎ² = 3/2.
Note that (13.9) is now a linear ordinary differential equation \( \hat{u}' = \lambda_\xi \hat{u} \), where
the eigenvalues \( \lambda_\xi = -(4\alpha/h^2) \sin^2(\xi h/2) \) are all nonpositive real numbers.
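The closed form of the eigenvalues follows from Euler's formula and the half-angle identity. A short worked version:

```latex
\begin{align*}
e^{\mathrm{i}\xi h} - 2 + e^{-\mathrm{i}\xi h}
  &= 2\cos \xi h - 2
   = -4 \sin^2 \frac{\xi h}{2}, \\
\lambda_\xi
  &= \alpha \frac{e^{\mathrm{i}\xi h} - 2 + e^{-\mathrm{i}\xi h}}{h^2}
   = -\frac{4\alpha}{h^2} \sin^2 \frac{\xi h}{2},
\qquad\text{so}\qquad
-\frac{4\alpha}{h^2} \le \lambda_\xi \le 0.
\end{align*}
```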
The regions of absolute stability for the forward Euler, backward Euler,
leapfrog, and trapezoidal methods are shown in Figure 13.1 above. A method is
stable as long as all 𝜆 𝜉 𝑘 lie in the region of absolute stability. The eigenvalues
𝜆 𝜉 can be as large as |𝜆 𝜉 | = 4𝛼/ℎ2 . We can determine the stability conditions
for the different numerical schemes using this bound and the associated regions
of absolute stability.
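This criterion can be checked numerically. The following Julia sketch computes the scaled eigenvalues 𝜆𝜉𝑘 and tests whether they all fall inside a region of absolute stability. The membership tests |1 + 𝑧| ≤ 1 for forward Euler and |1 − 𝑧| ≥ 1 for backward Euler are the standard regions for those methods; the helper names and the particular values of 𝑚 and 𝑘 are illustrative assumptions.

```julia
# Eigenvalues λ_ξ = -(4α/h^2) sin^2(ξh/2) of the central difference operator
m = 50; h = 2π/(2m + 1); α = 1.0
λ = [-(4α/h^2)*sin(ξ*h/2)^2 for ξ in -m:m]

# Membership tests for two regions of absolute stability
in_forward_euler(z) = abs(1 + z) <= 1
in_backward_euler(z) = abs(1 - z) >= 1

# A method is stable for time step k if every λ_ξ k lies in its region
stable(region, k) = all(region, λ*k)

k_small = 0.49*h^2/α
k_large = 0.60*h^2/α
fe_small = stable(in_forward_euler, k_small)
fe_large = stable(in_forward_euler, k_large)
be_large = stable(in_backward_euler, k_large)
```

With these values, forward Euler passes the test only for the smaller time step, while backward Euler passes for both.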
Let’s start by determining the stability conditions for the forward Euler
method. In the previous chapter, we found that the region of absolute stability for
the forward Euler method is the unit circle centered at −1, containing the interval
[−2, 0] along the real axis. A numerical method is stable if 𝜆 𝜉 𝑘 is in the region
of absolute stability for all 𝜉. Because the magnitude of 𝜆 𝜉 can be as large as
4𝛼/ℎ2 , it follows that 𝜆 𝜉 𝑘 stays within the region of absolute stability as long as
𝑘 ≤ ℎ2 /2𝛼. Such a stability condition is called the Courant–Friedrichs–Lewy
condition or CFL condition. This CFL condition says that if we take a small grid
spacing ℎ, we need to take a much smaller time step 𝑘 to maintain stability. If
we double the gridpoints to reduce the error, we need to quadruple the number of
time steps to maintain stability. If we increase the number of gridpoints tenfold,
we need to increase the number of time steps a hundredfold. Such a restriction is
There is no constraint on the time step 𝑘 for the backward Euler method
when the eigenvalues are on the negative real axis. The backward Euler method
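\[ \frac{U_j^{n+1} - U_j^{n-1}}{2k} = \alpha \frac{U_{j+1}^n - 2U_j^n + U_{j-1}^n}{h^2}. \]
Replace \( U_j^n \) with \( A^n e^{\mathrm{i} \xi j h} \):
\[ \frac{A^2 - 1}{2k} = -\alpha \frac{4}{h^2} \sin^2 \frac{\xi h}{2} \, A, \]
from which we have
\[ A^2 + 8\alpha \frac{k}{h^2} \sin^2 \frac{\xi h}{2} \, A - 1 = 0. \]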
From the constant term of the quadratic, we know that the product of the roots
of this equation is −1. Hence, both roots are real because complex roots would
appear in conjugate pairs, the product of which is positive. Since {+1, −1} are not
the roots, the absolute value of one root must be greater than one. Therefore, the
Richardson method is unconditionally unstable! That the Richardson method is
unconditionally unstable should be quite clear from Figure 13.1 on the preceding
page. The eigenvalues 𝜆 𝜉 of the second-order central difference approximation
are negative real numbers, but the region of absolute stability of the leapfrog
method is a line segment along the imaginary axis. So, no eigenvalue other
than the zero eigenvalue is ever in that region regardless of how small we take
𝑘. We can slightly modify the Richardson method to get the Dufort–Frankel
method. As we’ll see below, the Dufort–Frankel method is an explicit method
that is unconditionally stable!
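The instability of the Richardson method can also be seen by computing the roots of the quadratic directly. In the Julia sketch below, 𝜈 = 𝛼𝑘/ℎ² and 𝜃 = 𝜉ℎ are shorthand introduced here, and richardson_roots is a hypothetical helper that applies the quadratic formula to 𝐴² + 8𝜈 sin²(𝜃/2) 𝐴 − 1 = 0.

```julia
# Roots of A^2 + bA - 1 = 0 with b = 8ν sin^2(θ/2), where θ = ξh
function richardson_roots(ν, θ)
    b = 8ν*sin(θ/2)^2
    s = sqrt(b^2 + 4)
    return (-b + s)/2, (-b - s)/2
end

# Largest root magnitude over a few values of ν and nonzero θ
growth = [maximum(abs, richardson_roots(ν, θ)) for ν in (1e-3, 0.1, 1.0), θ in (0.1, 1.0, π)]
r1, r2 = richardson_roots(0.1, 1.0)
```

Every entry of growth exceeds one, even for the smallest value of 𝜈, and the product r1*r2 equals −1 up to rounding.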
Dufort–Frankel (inconsistent!) 𝑂(𝑘² + ℎ² + 𝑘²/ℎ²)
\[ \frac{U_j^{n+1} - U_j^{n-1}}{2k} = \alpha \frac{U_{j+1}^n - (U_j^{n+1} + U_j^{n-1}) + U_{j-1}^n}{h^2} \qquad (13.10) \]
\[ \frac{U_j^{n+1} - U_j^{n-1}}{2k} = \alpha \frac{U_{j+1}^n - (U_j^{n+1} + U_j^{n-1}) + U_{j-1}^n}{h^2} \]
\[ \frac{U_j^{n+1} - U_j^{n-1}}{2k} = \alpha \frac{U_{j+1}^n - 2U_j^n + U_{j-1}^n}{h^2} - \alpha \frac{U_j^{n+1} - 2U_j^n + U_j^{n-1}}{h^2}. \qquad (13.11) \]
4 It also violates the second Dahlquist barrier.
𝑂 (1), and the method is inconsistent. In this case, we are actually finding the
solution to the equation 𝑢 𝑡 = 𝛼𝑢 𝑥 𝑥 − 𝛼𝑢 𝑡𝑡 . So, while the Dufort–Frankel scheme
is absolutely stable, it is only consistent when 𝑘 ≪ ℎ. Furthermore, we don’t get
second-order accuracy unless 𝑘 = 𝑂 (ℎ2 ).
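The origin of the extra term can be sketched with a Taylor expansion in time of the last term of (13.11), assuming the solution is smooth enough for the expansions to hold:

```latex
\begin{align*}
U_j^{n+1} - 2U_j^n + U_j^{n-1}
  &= k^2 u_{tt} + \tfrac{1}{12} k^4 u_{tttt} + \cdots, \\
\alpha \frac{U_j^{n+1} - 2U_j^n + U_j^{n-1}}{h^2}
  &= \alpha \frac{k^2}{h^2} u_{tt} + O\!\left(\frac{k^4}{h^2}\right).
\end{align*}
```

The scheme therefore approximates \( u_t = \alpha u_{xx} - \alpha (k^2/h^2) u_{tt} \) up to terms of order \( k^2 + h^2 + k^4/h^2 \). If 𝑘 and ℎ shrink at the same rate, the ratio 𝑘²/ℎ² stays fixed and the 𝑢𝑡𝑡 term does not vanish.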
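\[ \delta_x^2 U_{ij} = \left( U_{i+1,j} - 2U_{i,j} + U_{i-1,j} \right)/h^2 \qquad (13.12a) \]
\[ \delta_y^2 U_{ij} = \left( U_{i,j+1} - 2U_{i,j} + U_{i,j-1} \right)/h^2 \qquad (13.12b) \]
where 𝛥𝑥 = 𝛥𝑦 = ℎ, the Crank–Nicolson method is
\[ \frac{U_{ij}^{n+1} - U_{ij}^n}{k} = \tfrac12 \left( \delta_x^2 U_{ij}^{n+1} + \delta_x^2 U_{ij}^n + \delta_y^2 U_{ij}^{n+1} + \delta_y^2 U_{ij}^n \right). \qquad (13.13) \]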
For von Neumann analysis in two dimensions, we substitute \( \hat{U}_{ij}^n = A^n e^{\mathrm{i}(i \xi h + j \eta h)} \)
and determine the CFL condition such that | 𝐴| ≤ 1. The Crank–Nicolson method
is second-order in time and space and unconditionally stable.
We no longer have a simple tridiagonal system when moving to two or three
dimensions. If we have 100 grid points in the 𝑥- and 𝑦-directions, then we will
need to invert a 10⁴ × 10⁴ block tridiagonal matrix. And we will need to invert
a 10⁶ × 10⁶ matrix in three dimensions. A more efficient approach is to use
operator splitting. Let’s examine two ways of splitting the right-hand operator
of (13.13)
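\[ \tfrac12 \Bigl( \underbrace{\delta_x^2 U_{ij}^{n+1}}_{1} + \underbrace{\delta_x^2 U_{ij}^{n}}_{2} + \underbrace{\delta_y^2 U_{ij}^{n+1}}_{3} + \underbrace{\delta_y^2 U_{ij}^{n}}_{4} \Bigr) \]
that will ensure that the method remains implicit. The fractional step method
splits terms 1 + 2 from terms 3 + 4
\[ \frac{U_{ij}^* - U_{ij}^n}{k} = \tfrac12 \left( \delta_x^2 U_{ij}^* + \delta_x^2 U_{ij}^n \right) \quad\text{and}\quad \frac{U_{ij}^{n+1} - U_{ij}^*}{k} = \tfrac12 \left( \delta_y^2 U_{ij}^* + \delta_y^2 U_{ij}^{n+1} \right), \]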
and the alternate direction implicit (ADI) method splits terms 1 + 4 from terms 2 + 3
and so
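\[ U^{n+i/p} = \left( I + \tfrac12 k D_i + \tfrac14 k^2 D_i^2 + O(k^3) \right) \left( I + \tfrac12 k D_i \right) U^{n+(i-1)/p} \]
\[ \phantom{U^{n+i/p}} = \left( I + k D_i + \tfrac12 k^2 D_i^2 + O(k^3) \right) U^{n+(i-1)/p}. \]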
The ADI method was proposed by Donald Peaceman and Henry Rachford in
1955, while working at Humble Oil and Refining Company, in a short-lived effort
to simulate petroleum reserves. The ADI method applied to the heat equation is
\[ U^{n+1/2} - U^n = \tfrac12 \nu \left( \delta_x^2 U^{n+1/2} + \delta_y^2 U^n \right), \qquad (13.14a) \]
\[ U^{n+1} - U^{n+1/2} = \tfrac12 \nu \left( \delta_x^2 U^{n+1/2} + \delta_y^2 U^{n+1} \right), \qquad (13.14b) \]
where 𝜈 = 𝑘/ℎ2 . We need two tridiagonal solvers for this two-stage method.
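To examine stability, take the Fourier transforms (𝑥 ↦ 𝜉 and 𝑦 ↦ 𝜂)
\[ \hat{U}^{n+1/2} = \hat{U}^n + \tfrac12 \nu \left( -4 \sin^2 \tfrac12 \xi h \right) \hat{U}^{n+1/2} + \tfrac12 \nu \left( -4 \sin^2 \tfrac12 \eta h \right) \hat{U}^n \]
\[ \hat{U}^{n+1} = \hat{U}^{n+1/2} + \tfrac12 \nu \left( -4 \sin^2 \tfrac12 \xi h \right) \hat{U}^{n+1/2} + \tfrac12 \nu \left( -4 \sin^2 \tfrac12 \eta h \right) \hat{U}^{n+1} \]
from which
\[ \hat{U}^{n+1} = \frac{1 - 2\nu \sin^2 \frac12 \xi h}{1 + 2\nu \sin^2 \frac12 \eta h} \hat{U}^{n+1/2} \quad\text{and}\quad \hat{U}^{n+1/2} = \frac{1 - 2\nu \sin^2 \frac12 \eta h}{1 + 2\nu \sin^2 \frac12 \xi h} \hat{U}^n. \]
Hence,
\[ \hat{U}^{n+1} = \underbrace{\frac{1 - 2\nu \sin^2 \frac12 \xi h}{1 + 2\nu \sin^2 \frac12 \xi h} \cdot \frac{1 - 2\nu \sin^2 \frac12 \eta h}{1 + 2\nu \sin^2 \frac12 \eta h}}_{\lambda} \hat{U}^n. \]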
The denominator of 𝜆 is always larger than the numerator, so |𝜆| ≤ 1 from which
it follows that the method is unconditionally stable.
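The bound on the amplification factor can be spot-checked numerically. The Julia sketch below evaluates the factor as the product of the two one-dimensional ratios (1 − 2𝜈 sin²(𝜃/2))/(1 + 2𝜈 sin²(𝜃/2)), one for each direction, over a grid of Fourier angles 𝜃 = 𝜉ℎ and 𝜑 = 𝜂ℎ and several values of 𝜈. The function name adi_factor and the sampled values are illustrative assumptions.

```julia
# Amplification factor of the two-stage ADI scheme for Fourier angles θ = ξh and φ = ηh
function adi_factor(ν, θ, φ)
    sx = 2ν*sin(θ/2)^2
    sy = 2ν*sin(φ/2)^2
    return (1 - sx)*(1 - sy)/((1 + sx)*(1 + sy))
end

angles = range(-π, π, length=101)
worst = maximum(abs(adi_factor(ν, θ, φ)) for ν in (0.1, 1.0, 10.0, 1000.0), θ in angles, φ in angles)
```

The largest magnitude never exceeds one, no matter how large 𝜈 is taken.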
To determine the order of the ADI method, we take the difference and sum
of (13.14)
Consider the heat equation along a rod for which the heat conductivity 𝛼(𝑢)
changes as a positive function of the temperature. In this case,
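\[ \frac{\partial}{\partial t} u = \frac{\partial}{\partial x} \left( \alpha(u) \frac{\partial}{\partial x} u \right) \qquad (13.16a) \]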
𝑢(0, 𝑥) = 𝑢 0 (𝑥) (13.16b)
𝑢(𝑡, 0) = 𝑢(𝑡, 1) = 0. (13.16c)
Because the diffusion equation is stiff, the ODE solver should be a stiff solver.
Otherwise, we will be forced to take many tiny steps to ensure stability. Because
the problem is nonlinear, we will need to solve a system of nonlinear equations,
typically using Newton’s method. In practice, this means that Julia (or whatever
language we might use) must numerically compute and invert a Jacobian matrix.
Fortunately, our system of nonlinear equations is sparse, so the Jacobian matrix
is also sparse. To help our solver, we can explicitly tell it that the Jacobian is
sparse by passing it the sparsity pattern for the Jacobian; otherwise, our solver
may foolishly build a nonsparse Jacobian matrix. In this case, the sparsity pattern
is a tridiagonal matrix of ones.
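As a small illustration, such a sparsity pattern can be built with the SparseArrays standard library. This is only a sketch of constructing the pattern: how it is handed to a particular solver depends on the package in use and is not shown here. The size 𝑚 = 400 is an assumed value.

```julia
using SparseArrays

m = 400
# Tridiagonal sparsity pattern of ones for an m-by-m Jacobian
pattern = spdiagm(-1 => ones(m-1), 0 => ones(m), 1 => ones(m-1))
```

The pattern stores 3𝑚 − 2 entries instead of the 𝑚² entries of a nonsparse matrix.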
Figure 13.2: Solving the porous medium equation with 400 mesh points.
u0 = (abs.(x).<1)
problem = ODEProblem(Du,u0 ,(0,2),Δx)
method = CVODE_BDF(linear_solver=:Band, jac_lower=1, jac_upper=1)
solution = solve(problem, method);
Because \( u^2 u_x = \tfrac13 (u^3)_x \), we could solve \( u_t = \tfrac13 (u^3)_{xx} \) instead. In this case, we
would simply replace the appropriate lines of code with the following:
α = (u -> u.^3/3)
Du(u,Δx,t) = [0;diff(diff(α(u)))/Δx^2;0]
There are different ways to visualize time-dependent data in Julia. One is with
the @animate macro in Plots.jl, which combines snapshots as a gif
anim = @animate for t in LinRange(0,2,200)
  plot(x,solution(t), fill = (0, 0.4, :red))
end
gif(anim, "porous.gif", fps = 15)
Another way is with the @manipulate macro in Interact.jl, which creates a slider
widget allowing a user to move the solution forward and backward in time.
using Interact
@manipulate for t∈slider(0:0.01:2; value=0, label="time")
  plot(x,solution(t), fill = (0, 0.4, :red))
end
[Figure: snapshots of the solution to the porous medium equation for 𝑥 from −2 to 2]
The figure above and the QR link below show the solution to the porous medium
equation with an initial distribution given by a rectangular function. The snapshots
are taken at equal intervals from 𝑡 ∈ [0, 2]. Compare this solution with that of
the one-dimensional heat equation on page 392.
It’s worthwhile to benchmark the performance of the BDF method against
other methods. While using an implicit method may reduce the number of steps
in time, each step may require significant computation to invert a Jacobian. The
runtimes using different stiff and nonstiff methods are shown in Figure 13.2 on
the preceding page. J
After integrating the right-hand side by parts and applying boundary terms,
\[ \frac12 \frac{\partial}{\partial t} \int_0^1 u^2 \, dx = -\int_0^1 \alpha(u) \left( \frac{\partial u}{\partial x} \right)^2 dx. \]
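\[ \sum_{j=1}^{m-1} U_j \frac{\partial}{\partial t} U_j = \frac{1}{h^2} \sum_{j=1}^{m-1} \left[ \alpha_{j+1/2} (U_{j+1} - U_j) U_j - \alpha_{j-1/2} (U_j - U_{j-1}) U_j \right], \]
\[ \frac12 \frac{\partial}{\partial t} \sum_{j=1}^{m-1} (U_j)^2 = \frac{1}{h^2} \sum_{j=1}^{m-1} \left[ \alpha_{j+1/2} (U_{j+1} - U_j) U_j - \alpha_{j-1/2} (U_j - U_{j-1}) U_j \right] = -\frac{1}{h^2} \sum_{j=1}^{m} \alpha_{j-1/2} (U_j - U_{j-1})^2 \le 0. \]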
This inequality tells us that the ℓ 2 -norm of the solution is nonincreasing in time.
Because discrete norms are equivalent, we can use other norms besides the
ℓ 2 -norm to confirm stability. As an example, let’s use the ℓ ∞ -norm to confirm the
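CFL condition of the forward Euler method (13.3). Take \( U_j^0 \) to be nonnegative
for all 𝑗. Then letting 𝜈 = 𝑘/ℎ²,
\[ U_j^{n+1} - U_j^n = \nu \alpha_{j+1/2} (U_{j+1}^n - U_j^n) - \nu \alpha_{j-1/2} (U_j^n - U_{j-1}^n), \]
from which
\[ U_j^{n+1} = \nu \alpha_{j+1/2} U_{j+1}^n + \left( 1 - \nu \alpha_{j+1/2} - \nu \alpha_{j-1/2} \right) U_j^n + \nu \alpha_{j-1/2} U_{j-1}^n. \]
If \( \nu \alpha_{j+1/2} \le \tfrac12 \) for all 𝑗, then
\[ |U_j^{n+1}| = \nu \alpha_{j+1/2} |U_{j+1}^n| + \left( 1 - \nu (\alpha_{j+1/2} + \alpha_{j-1/2}) \right) |U_j^n| + \nu \alpha_{j-1/2} |U_{j-1}^n| \]
\[ \le \left( \nu \alpha_{j+1/2} + 1 - \nu \alpha_{j+1/2} - \nu \alpha_{j-1/2} + \nu \alpha_{j-1/2} \right) \|U^n\|_\infty = \|U^n\|_\infty \]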
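The maximum-norm bound can be observed in a small experiment. The Julia sketch below applies the forward Euler update with a variable diffusivity sampled at the midpoints and zero Dirichlet boundaries, choosing 𝜈 so that 𝜈𝛼 ≤ ½ at every midpoint. The diffusivity profile, the initial data, and the helper names step_variable and max_norms are assumptions made for illustration.

```julia
# One forward Euler step for u_t = (α(x) u_x)_x with fixed (zero) boundary values.
# αhalf[j] holds α at the midpoint between nodes j and j+1.
function step_variable(U, αhalf, ν)
    V = copy(U)
    for j in 2:length(U)-1
        V[j] = ν*αhalf[j]*U[j+1] + (1 - ν*αhalf[j] - ν*αhalf[j-1])*U[j] + ν*αhalf[j-1]*U[j-1]
    end
    return V
end

# Record the maximum norm after each step
function max_norms(U, αhalf, ν, nsteps)
    norms = [maximum(abs, U)]
    for n in 1:nsteps
        U = step_variable(U, αhalf, ν)
        push!(norms, maximum(abs, U))
    end
    return norms
end

m = 101; h = 1/(m - 1); x = range(0, 1, length=m)
αhalf = [1 + 0.5*sin(2π*(x[j] + h/2)) for j in 1:m-1]  # positive diffusivity at midpoints
ν = 0.5/maximum(αhalf)                                 # ensures ν α ≤ 1/2 everywhere
U0 = float.(abs.(x .- 0.5) .< 0.25)                    # nonnegative initial data
norms = max_norms(U0, αhalf, ν, 200)
```

The recorded norms form a nonincreasing sequence, up to rounding error.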
13.5 Exercises
13.1. Show that the heat equation 𝑢 𝑡 = 𝛼𝑢 𝑥 𝑥 with a bounded initial distribution
𝑢(0, 𝑥) = 𝑢 0 (𝑥) has the solution
\[ u(t, x) = \int_{-\infty}^{\infty} G(t, x - s) u_0(s) \, ds \quad\text{with}\quad G(t, s) = \frac{1}{\sqrt{4\pi\alpha t}} e^{-s^2/4\alpha t}. \]
This expression is known as the fundamental solution to the heat equation.
13.2. Use Taylor series expansion to determine the truncation error of the
Crank–Nicolson scheme for 𝑢 𝑡 = 𝑢 𝑥 𝑥 . b
13.3. Suppose we want to solve the heat equation, but instead of using the
values {𝑈 𝑗−1 , 𝑈 𝑗 , 𝑈 𝑗+1 } to approximate 𝑢 𝑥 𝑥 at 𝑥 𝑗 , we decide to use the values
{𝑈 𝑗 , 𝑈 𝑗+1 , 𝑈 𝑗+2 }. For example, the stencil using the backward Euler method
would be
Our approach is terrible for several reasons. First, it provides only a first-order
approximation, whereas the central difference approximation is second-order.
More importantly, it affects stability. Discuss the stability conditions of using
such a scheme with various time-stepping methods. b
13.4. The Applied Mathematics Panel, an organization established
by Vannevar Bush to conduct mathematical research
for the U.S. military during World War II, proposed the
stencil on the right to solve the heat equation. John Crank
and Phyllis Nicolson later dismissed the method as being
“very cumbersome.” Discuss the stability restrictions of the
method. b
13.5. Prove that the Dufort–Frankel scheme is unconditionally stable by showing
that the eigenvalues of the space discretization all lie within the region of absolute
stability for the time-marching scheme. Demonstrate that although the scheme is
unconditionally stable, it gets the wrong solution when 𝑘 = 𝑂 (ℎ). b
13.6. Solve the heat equation 𝑢 𝑡 = 𝑢 𝑥 𝑥 over the domain [0, 1] with initial
conditions 𝑢(0, 𝑥) = sin 𝜋𝑥 and boundary conditions 𝑢(𝑡, 0) = 𝑢(𝑡, 1) = 0 using
the forward Euler scheme. Use 20 grid points in space and use the CFL condition
to determine the stability requirement on the time step size 𝑘.
2. Use the slope of the log-log plot of the error at 𝑡 = 1 to verify the order
of convergence. Use several values for the grid spacing ℎ while keeping
the time step 𝑘 ≪ ℎ constant. Then, use several values for the time step 𝑘,
keeping the mesh ℎ ≪ 𝑘.
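\[ \mathrm{i} \varepsilon \frac{\partial \psi}{\partial t} = -\frac{\varepsilon^2}{2} \frac{\partial^2 \psi}{\partial x^2} + V(x) \psi \]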
models the quantum behavior of a particle in a potential field 𝑉 (𝑥). The parameter
𝜀 is the scaled Planck constant, which gives the relative length and time scales.
The probability of finding a particle at a position 𝑥 is 𝜌(𝑡, 𝑥) = |𝜓(𝑡, 𝑥)| 2 for a
normalized wave function 𝜓(𝑡, 𝑥). While the Schrödinger equation is a parabolic
PDE, it exhibits behaviors closely related to the wave equation—namely, the
𝐿 2 -norm of the solution is constant in time, and the equation has no maximum
2. The Crank–Nicolson method does a good job conserving the 𝐿 2 -norm of the
solution. Use it to solve the initial value problem for the harmonic oscillator
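with \( V(x) = \tfrac12 x^2 \) with initial conditions \( \psi(0, x) = (\pi\varepsilon)^{-1/4} e^{-(x - x_0)^2/2\varepsilon} \).
The solution for this problem
\[ \psi(t, x) = (\pi\varepsilon)^{-1/4} e^{-(x - x_0 \exp(-\mathrm{i} t))^2/2\varepsilon} \, e^{-(1 - \exp(-2\mathrm{i} t)) x_0^2/4\varepsilon} \, e^{-\mathrm{i} t/2} \]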
3. What happens when the mesh spacing ℎ or the time step 𝑘 is about equal
to or larger than 𝜀? b
Develop a numerical method for this problem and implement it using the
regularized step function tanh 30(1 − 𝑟) as initial conditions. Use reflecting
boundary conditions at 𝑟 = 2 and symmetry at 𝑟 = 0 to determine an appropriate
boundary condition. Implement the method and plot the solution. b
5 Coherent states refer to Gaussian wave packet solutions of the harmonic oscillator. Schrödinger
derived them in 1926 in response to criticism that the wave function 𝜓 (𝑡 , 𝑥) did not display classical
motion and satisfy the correspondence principle.
13.9. Suppose we want to model the heat equation 𝑢 𝑡 = 𝑢 𝑥 𝑥 with open boundary
conditions. Over time a function will gradually decrease in height and broaden
in width while conserving area. The interesting dynamics may all occur near
the origin, but we’ll still need to solve the equation far from the origin to get an
accurate solution. Rather than using equally-spaced gridpoints, we can closely
space points near the origin and spread them out away from the origin.
𝑢 𝑡 = Δ𝑢 + 𝜀 −1 𝑢(1 − 𝑢 2 )
Hyperbolic Equations
\[ \frac{\partial u}{\partial t} + \frac{\partial}{\partial x} f(u) = 0. \]
In the previous chapter, the flux was given by Fick’s Law 𝑓(𝑢) = −𝛼(𝑥) 𝜕𝑢/𝜕𝑥,
which led to diffusion that characterized parabolic equations. Hyperbolic
equations are characterized by advection. Let’s start with the simplest flux
𝑓 (𝑢) = 𝑐𝑢 and look at the more general nonlinear case later. Information
propagating with velocity 𝑐 is described by the linear advection equation
\[ \frac{\partial u}{\partial t} + c \frac{\partial u}{\partial x} = 0 \quad\text{with}\quad u(0, x) = u_0(x). \qquad (14.1) \]
This problem can be solved by the method of characteristics. Compare the total
derivative of 𝑢(𝑡, 𝑥) and the advection equation:
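\[ \frac{d}{dt} u(t, x(t)) = \frac{\partial u}{\partial t} + \frac{dx}{dt} \frac{\partial u}{\partial x} \quad\text{and}\quad \frac{\partial u}{\partial t} + c \frac{\partial u}{\partial x} = 0. \]
By taking d𝑥/d𝑡 = 𝑐, the total derivative
\[ \frac{d}{dt} u(t, x(t)) = 0, \]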
which says that the solution 𝑢(𝑡, 𝑥(𝑡)) is constant along the characteristic curves
𝑥(𝑡) for which d𝑥/d𝑡 = 𝑐. The characteristic curves are exactly those for which
𝑥(𝑡) = 𝑐𝑡 + 𝑥0 for some 𝑥0. The solution to (14.1) is then given by
\[ u(t, x) = u_0(x - ct). \]
We can find the solution at time 𝑡 by tracing the characteristic curve back to the
initial conditions. The value 𝑐 is called the characteristic speed, and 𝑢, which is
constant along the characteristic curve, is called the Riemann invariant. It’s not
necessary that 𝑐 be a constant: the method of characteristics also works for
general 𝑓(𝑢). We’ll come back to the nonlinear case later in the chapter.
[Figure: a characteristic curve in the (𝑥, 𝑡)-plane carrying the initial value 𝑢0 to 𝑢(𝑡, 𝑥(𝑡))]
Upwind method
A natural first question about such a scheme might be, “what is its stability
condition?” We can answer this question using the von Neumann analysis
developed in the previous chapter. Consider the semidiscrete system of equations
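\[ \frac{\partial}{\partial t} u(t, x_j) = c \, \frac{u(t, x_{j-1}) - u(t, x_j)}{h}. \]
We find the corresponding system in the Fourier domain by formally replacing
\( u(t, x_j) \) with \( \hat{u}(t, \xi) e^{\mathrm{i} \xi j h} \). In this case, we have the decoupled system
\[ \frac{\partial}{\partial t} \hat{u}(t, \xi) = c \, \frac{e^{-\mathrm{i} \xi h} - 1}{h} \, \hat{u}(t, \xi), \]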
whose eigenvalues are (𝑐/ℎ) · (e−i 𝜉 ℎ − 1). The locus of e−i 𝜉 ℎ − 1 is a circle of
radius one centered at −1. The region of absolute stability for the forward Euler
method is also a circle of radius one centered at −1. If the characteristic speed
𝑐 > 0, then (𝑐/ℎ) · (e−i 𝜉 ℎ − 1) is simply a circle of radius 𝑐/ℎ centered at −𝑐/ℎ.
Therefore, the upwind method (14.3) is stable if the CFL condition 𝑘 < ℎ/𝑐
holds. We call |𝑐|𝑘/ℎ the CFL number for the upwind scheme. A useful method
of visualizing the CFL condition of a numerical method is by overlaying the
eigenvalues of the right-hand-side operator over the region of absolute stability
of the ODE solver:
[Figure panels: 𝑐 > 0 and 𝑐 < 0, each showing the eigenvalues overlaid on the region of absolute stability]
\[ \frac{U_j^{n+1} - U_j^n}{k} + c \, \frac{U_j^n - U_{j-1}^n}{h} = 0 \qquad (14.3) \]
\[ \frac{U_j^{n+1} - U_j^n}{k} + c \, \frac{U_{j+1}^n - U_j^n}{h} = 0 \qquad (14.4) \]
To implement the schemes, we must first determine the sign of 𝑐. We can combine
the two varieties to get a scheme that does this automatically:
$$\frac{U^{n+1}_j - U^n_j}{k} + \frac{c + |c|}{2}\,\frac{U^n_j - U^n_{j-1}}{h} + \frac{c - |c|}{2}\,\frac{U^n_{j+1} - U^n_j}{h} = 0.$$
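As a rough illustration of the combined scheme, here is a minimal Julia sketch of one time step. The function name `upwind_step`, the periodic boundary treatment via `circshift`, and the Gaussian test profile are assumptions made for this example, not part of the text.

```julia
# One step of the combined (sign-switching) upwind scheme with periodic boundaries.
# c: advection speed, k: time step, h: grid spacing (names chosen for this sketch).
function upwind_step(U, c, k, h)
    Um = circshift(U, 1)     # U[j-1], wrapping around periodically
    Up = circshift(U, -1)    # U[j+1]
    cp = (c + abs(c))/2      # equals c when c > 0, otherwise 0
    cm = (c - abs(c))/2      # equals c when c < 0, otherwise 0
    return U .- (k/h) .* (cp .* (U .- Um) .+ cm .* (Up .- U))
end

h  = 0.01
x  = 0:h:1-h
U0 = exp.(-100 .* (x .- 0.5).^2)
U1 = upwind_step(U0, 1.0, h, h)     # CFL number one, c > 0
U2 = upwind_step(U0, -1.0, h, h)    # CFL number one, c < 0
```

With a CFL number of exactly one, each step should simply shift the profile by one grid point in the direction of travel, whichever sign 𝑐 has.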
To get a sense of why we need two upwind schemes (or one combined
automatic-switching scheme), just think about the weather. To predict the
weather, we need to look in the upwind direction from which it is coming, not
the downwind direction in which it is going. Information propagates at a finite
speed 𝑐 along a characteristic of the advection equation (14.1). In a finite time,
the solution at any specific point in space can only be affected by the initial
conditions in some bounded region of space. This region is called the domain of
dependence. Similarly, that same point in space can only influence a bounded
region of space over a finite time. This region is called the domain of influence.
The domain of dependence looks into the past, and the domain of influence looks
into the future.
Consider the solution to the advection equation: $u(t_n, x_j) = u_0(x_j - ct_n)$.
Take a uniform step size 𝑘 in time and a uniform grid spacing ℎ. For 𝑐 > 0, the domain of
dependence of the true solution is $[x_j - cnk, x_j]$. The domain of dependence
of the numerical solution is $[x_j - nh, x_j]$. By keeping 𝑐𝑘/ℎ < 1, we ensure
that the domain of dependence of the numerical solution contains the domain of
dependence of the true solution.
For parabolic equations, like the heat equation, information propagates
instantaneously throughout the domain, although its effect may be infinitesimal.
If we put a flame to one end of a heat-conducting bar, the temperature at the other
end immediately begins to rise, albeit almost undetectably.1 When we solved
the heat equation using an explicit scheme like the forward Euler method, we
found a rather restrictive CFL condition. With such a scheme, information at
one grid point can only propagate to its nearest neighbor in one time step. The
more grid points we have, the greater the number of time steps needed to send
the information from one side of the domain to the other. The matrix $\mathbf{I} + (k/h^2)\mathbf{D}$
associated with the forward Euler method for the heat equation is tridiagonal.
On the other hand, the matrix $(\mathbf{I} - (k/h^2)\mathbf{D})^{-1}$ associated with the backward
Euler method is completely filled in. With such a scheme, information at one
grid point is propagated to every other grid point in one time step. The normalized
plots below show the values of a column taken from the matrices $\mathbf{I} + (k/h^2)\mathbf{D}$ with
$h = \tfrac12 k^2$ and $(\mathbf{I} - (k/h^2)\mathbf{D})^{-1}$ with ℎ = 𝑘:
[Figure: two panels showing the normalized column entries of each matrix, with row indices running from 1 to 50.]
1 Such a model is clearly not physical because heat is conducted at a finite velocity and definitely
less than the speed of light. But, what mathematical models are perfect?
Lax–Friedrichs method
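$$\frac{\partial}{\partial t} \hat{u}(t, \xi) = -c\,\frac{\mathrm{e}^{\mathrm{i}\xi h} - \mathrm{e}^{-\mathrm{i}\xi h}}{2h}\,\hat{u}(t, \xi) = -\frac{c}{h}(\mathrm{i}\sin\xi h)\,\hat{u}(t, \xi)$$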
in the Fourier domain. The eigenvalues of this system are purely imaginary, but
the region of stability for the forward Euler scheme is a unit circle in the left
half-plane. So the method is unconditionally unstable.
Centered-difference (unstable!) $O(k + h^2)$
$$\frac{U^{n+1}_j - U^n_j}{k} + c\,\frac{U^n_{j+1} - U^n_{j-1}}{2h} = 0 \tag{14.5}$$
We can modify the central difference scheme to make it stable. One way
to do this is by choosing an ODE solver whose region of absolute stability
includes part of the imaginary axis, like the leapfrog scheme. Another way is
to somehow push the eigenspectrum into the left half-plane, where we can then
shrink them to fit in the region of absolute stability of the forward Euler scheme.
The Lax–Friedrichs method approximates the $U^n_j$ term in the time derivative
of (14.5) as $\tfrac12(U^n_{j+1} + U^n_{j-1})$.
Lax–Friedrichs $O(k + h^2/k)$
$$\frac{U^{n+1}_j - \tfrac12(U^n_{j+1} + U^n_{j-1})}{k} + c\,\frac{U^n_{j+1} - U^n_{j-1}}{2h} = 0 \tag{14.6}$$
Lax–Wendroff method
Let’s look at another way to stabilize the unstable central difference scheme (14.5).
Using the Taylor series expansion about $U^n_j = u(t_n, x_j)$, we have
$$U^{n+1}_j = u + ku_t + \tfrac12 k^2 u_{tt} + O(k^3)$$
$$U^n_{j+1} = u + hu_x + \tfrac12 h^2 u_{xx} + \tfrac16 h^3 u_{xxx} + O(h^4)$$
$$U^n_{j-1} = u - hu_x + \tfrac12 h^2 u_{xx} - \tfrac16 h^3 u_{xxx} + O(h^4).$$
Substituting these expansions into (14.5) and using $u_{tt} = c^2 u_{xx}$ gives
$$u_t + cu_x = -\tfrac12 c^2 k u_{xx} + O(k^2 + h^2).$$
Lax–Wendroff Method 𝑂 (𝑘 2 + ℎ2 )
When 𝑘 < ℎ/|𝑐|, the 𝑘-scaled ellipse is entirely inside the unit circle, and when
𝑘 > ℎ/|𝑐|, the 𝑘-scaled ellipse is entirely outside the unit circle. The CFL
numbers for the Lax–Wendroff method and the Lax–Friedrichs method are both
|𝑐|𝑘/ℎ. The following figure shows the eigenvalues for the different methods
overlaid on the region of absolute stability of the forward Euler method:
The eigenvalues of the central difference scheme are outside the region of absolute
stability, so the scheme is unconditionally unstable. The eigenvalues for the
Lax–Friedrichs scheme extend farther left than those of the upwind scheme,
which themselves extend farther left than those of the Lax–Wendroff method.
Consequently, the Lax–Friedrichs method is more dissipative than the upwind
method, which itself is more dissipative than the Lax–Wendroff method.
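The ordering of dissipation can be checked numerically. The sketch below evaluates the standard von Neumann amplification factors of the four schemes for 𝑐 > 0. These formulas are textbook results written out here, not copied from this section, and the function names are hypothetical.

```julia
# Amplification factor g(θ) of one time step, where θ = ξh is the scaled
# wavenumber and ν = ck/h is the CFL number (c > 0 assumed).
g_central(θ, ν)        = 1 - im*ν*sin(θ)
g_upwind(θ, ν)         = 1 - ν*(1 - exp(-im*θ))
g_lax_friedrichs(θ, ν) = cos(θ) - im*ν*sin(θ)
g_lax_wendroff(θ, ν)   = 1 - im*ν*sin(θ) - ν^2*(1 - cos(θ))

ν, θ = 0.5, 0.3    # a well-resolved mode at half the CFL limit
damping = (abs(g_lax_friedrichs(θ, ν)), abs(g_upwind(θ, ν)), abs(g_lax_wendroff(θ, ν)))
growth  = abs(g_central(θ, ν))
```

For this well-resolved mode, the magnitudes in `damping` increase from Lax–Friedrichs to upwind to Lax–Wendroff, while `growth` exceeds one. The ordering is only claimed for small θ; near the grid-scale mode θ = π the comparison changes.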
𝑢 𝑡 + 𝑐𝑢 𝑥 = 0, (14.11)
(𝑘 = ℎ/𝑐), then not only does the 𝑢 𝑥 𝑥 term of (14.10) disappear, but the whole
right side disappears. That is, the upwind scheme exactly solves 𝑢 𝑡 + 𝑐𝑢 𝑥 = 0
when 𝑘 = ℎ/𝑐.
The Lax–Wendroff scheme is a second-order approximation to (14.11), but it
is a third-order approximation to
$$u_t + cu_x = -\tfrac16 ch^2\left(1 - \left(\frac{ck}{h}\right)^2\right)u_{xxx}. \tag{14.13}$$
from which we have 𝜔 = 𝑐𝜉. We can plug this relation back into the ansatz to
get a solution
$$u(t, x) = \mathrm{e}^{\mathrm{i}(\omega t - \xi x)} = \mathrm{e}^{\mathrm{i}\xi(ct - x)}.$$
Each Fourier component moves at a velocity 𝑐 = 𝜔/𝜉, the ratio of the angular
frequency and the wavenumber. We call this ratio the phase velocity.
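Now, let’s use the same ansatz (14.14) in the modified upwind equation (14.12)
$u_t + cu_x = \alpha u_{xx}$. We get
$$\mathrm{i}\omega u - \mathrm{i}c\xi u = -\alpha\xi^2 u,$$
from which we have $\mathrm{i}\omega - \mathrm{i}c\xi = -\alpha\xi^2$. From this equation, we can determine a
dispersion relation
$$\omega = c\xi + \mathrm{i}\alpha\xi^2.$$
Plugging this 𝜔 back into our ansatz gives us
$$u(t, x) = \mathrm{e}^{\mathrm{i}(\omega t - \xi x)} = \mathrm{e}^{\mathrm{i}c\xi t}\,\mathrm{e}^{-\alpha\xi^2 t}\,\mathrm{e}^{-\mathrm{i}\xi x} = \underbrace{\mathrm{e}^{\mathrm{i}\xi(ct - x)}}_{\text{advection}} \cdot \underbrace{\mathrm{e}^{-\alpha\xi^2 t}}_{\text{dissipation}}.$$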
Figure 14.1: Solution to the advection equation for an initial rectangle function, with $k \approx \tfrac23 h$.
Solutions move from left to right. See the QR code at the bottom of the page.
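$$\frac{\mathbf{V}^{n+1}_j - \mathbf{V}^n_j}{k} + \boldsymbol{\Lambda}^+\frac{\mathbf{V}^n_j - \mathbf{V}^n_{j-1}}{h} + \boldsymbol{\Lambda}^-\frac{\mathbf{V}^n_{j+1} - \mathbf{V}^n_j}{h} = 0.$$
By making the substitution $\mathbf{U}^n_j = \mathbf{T}\mathbf{V}^n_j$ and defining $\mathbf{A}^\pm = \mathbf{T}\boldsymbol{\Lambda}^\pm\mathbf{T}^{-1}$, we can
rewrite the above equation as
$$\frac{\mathbf{U}^{n+1}_j - \mathbf{U}^n_j}{k} + \mathbf{A}^+\frac{\mathbf{U}^n_j - \mathbf{U}^n_{j-1}}{h} + \mathbf{A}^-\frac{\mathbf{U}^n_{j+1} - \mathbf{U}^n_j}{h} = 0.$$
We call $\mathbf{A}^\pm = \mathbf{T}\boldsymbol{\Lambda}^\pm\mathbf{T}^{-1}$ the characteristic decomposition.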
Let’s extend the ideas developed for linear hyperbolic equations to nonlinear
hyperbolic equations. Consider the equation
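$$\frac{\partial}{\partial t}u + \frac{\partial}{\partial x}f(u) = 0 \tag{14.17}$$
for a function 𝑢(𝑡, 𝑥) with $u(0, x) = u_0(x)$ and a function 𝑓(𝑢) called the flux.
Such an equation is called a conservation law. If we integrate it with respect to
𝑥, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\int_a^b u\,\mathrm{d}x + \int_a^b \frac{\partial}{\partial x}f(u)\,\mathrm{d}x = \frac{\mathrm{d}}{\mathrm{d}t}\int_a^b u\,\mathrm{d}x + f(b) - f(a) = 0.$$
If there is no net flux through the boundary, then $\frac{\mathrm{d}}{\mathrm{d}t}\int_a^b u\,\mathrm{d}x = 0$, from which it
follows that $\int_a^b u\,\mathrm{d}x$ is constant. So, 𝑢 is a conserved quantity.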
If the flux 𝑓(𝑢) is differentiable, then we can apply the chain rule, and (14.17) becomes
$$\frac{\partial u}{\partial t} + f'(u)\frac{\partial u}{\partial x} = 0,$$
which can be solved using the method of characteristics
$$\frac{\mathrm{d}}{\mathrm{d}t}u(t, x(t)) = \frac{\partial u}{\partial t} + \frac{\mathrm{d}x}{\mathrm{d}t}\frac{\partial u}{\partial x} = \frac{\partial u}{\partial t} + f'(u)\frac{\partial u}{\partial x} = 0.$$
This expression says that the solution 𝑢(𝑡, 𝑥) is constant (a Riemann invariant)
along the characteristics given by
$$\frac{\mathrm{d}x}{\mathrm{d}t} = f'(u).$$
Because 𝑢 is invariant along 𝑥(𝑡), it follows that
$$\frac{\mathrm{d}x}{\mathrm{d}t} = f'(u(t, x(t))) = f'(u(0, x(0))) = f'(u_0(x_0)).$$
Integrating this equation, we see that the characteristics are straight lines
$x(t) = x_0 + f'(u_0(x_0))\,t$, along which
$$u(t, x) = u_0(x - f'(u_0)t).$$
Burgers’ equation
the QR code at the bottom of this page. The solution at this point would then be
multivalued because it comes from two (or possibly more) characteristic curves.
Furthermore, at the moment when a multivalued solution in 𝑢(𝑡, 𝑥) appears, its
derivative blows up, and the solution overturns.
We can find the point when the solution overturns by computing when the
derivative first becomes infinite.
$$\frac{\mathrm{d}u}{\mathrm{d}x} = \frac{\mathrm{d}u}{\mathrm{d}x_0}\frac{\mathrm{d}x_0}{\mathrm{d}x} = u_0'(x_0)\left(\frac{\mathrm{d}x}{\mathrm{d}x_0}\right)^{-1} = \frac{u_0'(x_0)}{u_0'(x_0)\,t + 1} \tag{14.18}$$
If we set the right-hand side to infinity and solve for 𝑡, we have $t = -(u_0'(x_0))^{-1}$.
So, the time at which the solution first becomes multivalued is the infimum of the
positive values of $-(u_0'(x_0))^{-1}$. The solution to Burgers’ equation becomes singular if and only if
the gradient of the initial distribution $u_0(x)$ is ever negative.
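To make the breaking-time formula concrete, here is a small Julia sketch. The helper `breaking_time` is hypothetical: it takes the derivative of the initial data as a function and a set of sample points, so its answer is only as accurate as the sampling.

```julia
# Breaking time of Burgers' equation, t = inf(-1/u₀′(x₀)) over points with u₀′(x₀) < 0.
# du0: derivative of the initial data; xs: sample of starting points x₀.
function breaking_time(du0, xs)
    steepest = minimum(du0, xs)          # most negative initial slope on the sample
    return steepest < 0 ? -1/steepest : Inf
end

xs = range(0, 2π, length=2001)
t_sine = breaking_time(cos, xs)          # u₀(x) = sin(x), so u₀′(x) = cos(x)
t_ramp = breaking_time(x -> 1.0, xs)     # u₀(x) = x has no negative slope
```

For the sine profile the steepest negative slope is −1, so the solution should first overturn near 𝑡 = 1. The increasing ramp never forms a shock, which the sketch reports as an infinite breaking time.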
Can a faster-moving part really overtake a slower-moving part? No. A
multivalued solution is both unphysical and mathematically ill-posed. So,
something is wrong with either our formulation or our reasoning. We relied
on the fact that 𝑢(𝑡, 𝑥) is smooth to apply the method of characteristics. But as
soon as the derivative of 𝑢(𝑡, 𝑥) blows up, the method is invalidated. In the next
section, we will look at how to solve the ill-posed problem.3
Weak solutions
[Figure: multivalued and weak solutions to Burgers’ equation. Top row: 𝑢(𝑥); bottom row: 𝑡(𝑥); each panel spans −2 ≤ 𝑥 ≤ 4.]
If this equation holds for all $\varphi \in C_0^1(\Omega)$, then we say that 𝑢(𝑡, 𝑥) is a weak
solution of $u_t + f(u)_x = 0$. If 𝑢(𝑡, 𝑥) is smooth, the weak solution is the classical
or strong solution.
Consider an arbitrary curve 𝑃𝑄 dividing Ω into subdomains Ω1 and Ω2 .
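[Figure: the curve 𝑃𝑄 in the (𝑥, 𝑡)-plane, with Ω₁ and the state $u_L$ on its left and Ω₂ and the state $u_R$ on its right.]
By Green’s theorem:
$${} = \int_{\partial\Omega_1} -\varphi u\,\mathrm{d}x + \varphi f(u)\,\mathrm{d}t + \int_{\partial\Omega_2} -\varphi u\,\mathrm{d}x + \varphi f(u)\,\mathrm{d}t.$$
Because $\varphi(\partial\Omega) = 0$:
$${} = \int_P^Q -\varphi u_L\,\mathrm{d}x + \varphi f(u_L)\,\mathrm{d}t + \int_Q^P -\varphi u_R\,\mathrm{d}x + \varphi f(u_R)\,\mathrm{d}t$$
$${} = \int_P^Q \varphi(u_R - u_L)\,\mathrm{d}x - \varphi\,(f(u_R) - f(u_L))\,\mathrm{d}t,$$
where $u_L$ is the value of 𝑢 along 𝑃𝑄 from the left and $u_R$ is the value of 𝑢 along
𝑃𝑄 from the right. So, for all test functions $\varphi \in C_0^1(\Omega)$,
$$(u_R - u_L)\,\mathrm{d}x - (f(u_R) - f(u_L))\,\mathrm{d}t = 0$$
along 𝑃𝑄, and hence
$$s = \frac{\mathrm{d}x}{\mathrm{d}t} = \frac{f(u_R) - f(u_L)}{u_R - u_L}.$$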
This expression is called the Rankine–Hugoniot jump condition for shocks, and 𝑠
is called the shock speed. Note that in the limit as 𝑢 𝐿 → 𝑢 𝑅 , the shock speed
$s \to f'(u)$. To simplify notation, one often uses brackets to indicate difference:
$$s = \frac{f(u_R) - f(u_L)}{u_R - u_L} \equiv \frac{[f]}{[u]}.$$
We can now continue to use the PDE with the weak solution.
A Riemann problem is an initial value problem with initial data given by two
constant states separated by a discontinuity. For the advection equation, it is
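$$\frac{\partial u}{\partial t} + \frac{\partial}{\partial x}f(u) = 0 \quad\text{with}\quad u(0, x) = \begin{cases} u_L, & x < 0 \\ u_R, & x > 0. \end{cases} \tag{14.19}$$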
This symmetry suggests an ansatz for the problem. We’ll look for a self-similar
solution to Burgers’ equation to simplify the two-variable PDE into a one-variable
ODE. Let $\zeta = x/t = \hat{x}/\hat{t}$. Taking $u(t, x) = u(\zeta)$ in Burgers’ equation,
$$0 = u_t + \left(\tfrac12 u^2\right)_x = \zeta_t u' + \zeta_x uu' = -\frac{x}{t^2}u' + \frac{1}{t}uu'.$$
Multiplying by 𝑡 gives
$$-\zeta u' + uu' = u'(u - \zeta) = 0.$$
$$s = \frac{\tfrac12 u_R^2 - \tfrac12 u_L^2}{u_R - u_L} = \tfrac12(u_R + u_L),$$
$$u(t, x) = \begin{cases} u_L, & \text{if } x/t < s \\ u_R, & \text{if } x/t > s \end{cases}$$

$$u(t, x) = \begin{cases} u_L, & \text{if } x/t < s \\ u_R, & \text{if } x/t > s \end{cases}$$
where $s = (u_R + u_L)/2$ is given by the Rankine–Hugoniot condition.
Another solution is
$$u(t, x) = \begin{cases} u_L, & \text{if } x/t < u_L \\ x/t, & \text{if } u_L < x/t < u_R \\ u_R, & \text{if } x/t > u_R \end{cases}$$
A third solution is
$$u(t, x) = \begin{cases} u_L, & \text{if } x/t < u_L \\ x/t, & \text{if } u_L < x/t < u_M \\ u_M, & \text{if } u_M < x/t < s \\ u_R, & \text{if } x/t > s \end{cases}$$
where $s = (u_R + u_M)/2$ is given by the Rankine–Hugoniot condition.
In fact, there are infinitely many weak solutions. What is the physically permissible
weak solution? To find it, we need to impose another condition. Physically, there
is a quantity called entropy, which is a constant along smooth particle trajectories
but jumps to a higher value across a discontinuity. The mathematical definition of
entropy 𝜑(𝑢) is the negative of physical entropy. For the problem 𝑢 𝑡 + 𝑓 (𝑢) 𝑥 = 0
where 𝑓 (𝑢) is the convex flux, a solution satisfies the Lax entropy condition
$$f'(u_L) > s \equiv \frac{f(u_R) - f(u_L)}{u_R - u_L} > f'(u_R).$$
The Lax entropy condition says that the characteristics with speeds $f'(u_R)$ and
$f'(u_L)$ must intersect the shock with speed 𝑠 from both sides. For Burgers’
equation $f(u) = \tfrac12 u^2$, so $f'(u) = u$.
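The Rankine–Hugoniot speed and the Lax entropy condition translate almost directly into code. The following Julia sketch applies them to Burgers’ flux; the function names `shock_speed` and `lax_admissible` are assumptions for this illustration.

```julia
# Rankine–Hugoniot shock speed for a scalar flux f between states uL and uR.
shock_speed(f, uL, uR) = (f(uR) - f(uL))/(uR - uL)

# Lax entropy condition: characteristics on both sides run into the shock.
lax_admissible(f, df, uL, uR) = df(uL) > shock_speed(f, uL, uR) > df(uR)

f(u)  = u^2/2    # Burgers' flux
df(u) = u        # its derivative

s           = shock_speed(f, 1.0, 0.0)
compressive = lax_admissible(f, df, 1.0, 0.0)   # uL > uR
expansive   = lax_admissible(f, df, 0.0, 1.0)   # uL < uR
```

With $u_L = 1$ and $u_R = 0$ the jump travels at speed ½ and passes the entropy check. With the states swapped, the same jump is still a weak solution but fails the check, which singles out the rarefaction as the admissible solution.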
Let 𝜌(𝑡, 𝑥) be the density of cars measured in cars per car length. A density
𝜌 = 0 says that the road is completely empty, and a density 𝜌 = 1 says there is
bumper-to-bumper traffic. Then
$$\frac{\partial\rho}{\partial t} + \frac{\partial}{\partial x}f(\rho) = 0$$
models the traffic flow, where the flux 𝑓(𝜌) tells us the number of cars passing a
given point in a given time interval (number of cars per second). We can define
flux $f(\rho) = u(\rho)\rho$, where 𝑢(𝜌) simply tells us the speed of each car as a function
of traffic density. Let’s make a couple of reasonable assumptions to model 𝑢(𝜌):
1. Everyone travels as fast as possible while still obeying the speed limit 𝑢 max .
2. Everyone keeps a safe distance behind the car ahead of them and adjusts
their speed accordingly. Drive as fast as possible if there are no other cars
on the road. Stop moving if traffic is bumper-to-bumper.
In this case, the simplest model for 𝑢(𝜌) is $u(\rho) = u_{\max}(1 - \rho)$. Notice that the
flux $f(\rho) = u(\rho)\rho = u_{\max}(\rho - \rho^2)$ is zero when the roads are empty (there are
no cars to move) and when the roads are full (no one is moving). The flux 𝑓(𝜌)
is maximum where $f'(\rho) = u_{\max}(1 - 2\rho) = 0$, that is, when $\rho = \tfrac12$ (the roads are half-full
with the cars traveling at $\tfrac12 u_{\max}$, half the posted speed limit). Our traffic equation
is now
$$\frac{\partial\rho}{\partial t} + \frac{\partial}{\partial x}\left(u_{\max}(1 - \rho)\rho\right) = 0.$$
This equation is very similar in form to Burgers’ equation. Consider two simple
examples: cars approaching a red stoplight and cars leaving after the stoplight
turns green. Both of these examples are Riemann problems.
Red light. We imagine several cars traveling with density 𝜌 𝐿 approaching a
queue of stopped cars with density 𝜌 𝑅 = 1. In this case
$$\rho(x, 0) = \begin{cases} \rho_L, & x < 0 \\ \rho_R = 1, & x \ge 0. \end{cases}$$
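For the red-light data, the Rankine–Hugoniot condition gives the speed at which the back of the queue moves. A minimal Julia sketch follows, assuming a speed limit normalized to one; the names `traffic_flux` and `queue_speed` are hypothetical.

```julia
u_max = 1.0                               # speed limit, normalized (an assumption)
traffic_flux(ρ) = u_max*(1 - ρ)*ρ         # f(ρ) = u(ρ)ρ with u(ρ) = u_max(1 − ρ)

# Rankine–Hugoniot speed of the jump between densities ρL (behind) and ρR (ahead).
queue_speed(ρL, ρR) = (traffic_flux(ρR) - traffic_flux(ρL))/(ρR - ρL)

s_red = queue_speed(0.5, 1.0)             # half-full road meeting stopped traffic
```

Because 𝑓(1) = 0, the speed simplifies to $-u_{\max}\rho_L$. The negative sign says the back of the queue moves upstream, toward the oncoming cars, and it moves faster when the incoming traffic is denser.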
where 𝑔 is the gravitational acceleration, ℎ is the height of the water, and 𝑢 is the
velocity of the water. The mass of a column of water is proportional to its height,
and we can let 𝑚 = ℎ𝑢 denote the momentum. Then (14.21) becomes
$$h_t + m_x = 0$$
$$m_t + \frac{2m}{h}m_x - \frac{m^2}{h^2}h_x + ghh_x = 0,$$
Hyperbolic systems of conservation laws 433
Riemann invariants
Let’s recap what we’ve discussed so far about Riemann invariants. Information
is advected along characteristics, and the quantity that is constant along a
characteristic curve is called the Riemann invariant. For the advection equation
𝑢 𝑡 + 𝑐𝑢 𝑥 = 0, the characteristics are given by d𝑥/d𝑡 = 𝑐, and the Riemann
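invariant is 𝑢. For a general scalar conservation law $u_t + f(u)_x = 0$, we have
$u_t + f'(u)u_x = 0$ when 𝑢 is differentiable. The characteristics are given by
$\mathrm{d}x/\mathrm{d}t = f'(u)$, and it follows that
$$\frac{\mathrm{d}}{\mathrm{d}t}u(t, x(t)) = \frac{\partial u}{\partial t} + \frac{\mathrm{d}x}{\mathrm{d}t}\frac{\partial u}{\partial x} = \frac{\partial u}{\partial t} + f'(u)\frac{\partial u}{\partial x} = 0.$$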
So 𝑢(𝑡, 𝑥) is the Riemann invariant.
For systems of conservation laws, it’s a little more complicated. Once again
consider the system
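$$\frac{\partial\mathbf{u}}{\partial t} + \frac{\partial}{\partial x}\mathbf{f}(\mathbf{u}) = 0$$
with $\mathbf{u} \in \mathbb{R}^n$. If $\mathbf{u}$ is differentiable,
$$\frac{\partial\mathbf{u}}{\partial t} + \mathbf{f}'(\mathbf{u})\frac{\partial\mathbf{u}}{\partial x} = 0 \tag{14.23}$$
where $\mathbf{f}'$ is the Jacobian matrix of $\mathbf{f}$ with real eigenvalues $\{\lambda_1, \lambda_2, \dots, \lambda_n\}$. Let
$\nabla_{\mathbf{u}} w_i$, for some $w_i(\mathbf{u})$, be the left eigenvector of $\mathbf{f}'(\mathbf{u})$ corresponding to the
eigenvalue $\lambda_i$. Then left multiplying (14.23) by $\nabla_{\mathbf{u}} w_i$ yields
$$\nabla_{\mathbf{u}} w_i\,\frac{\partial\mathbf{u}}{\partial t} + \nabla_{\mathbf{u}} w_i\,\mathbf{f}'(\mathbf{u})\frac{\partial\mathbf{u}}{\partial x} = 0.$$
Because $\nabla_{\mathbf{u}} w_i$ is the left eigenvector of $\mathbf{f}'(\mathbf{u})$, it follows that
$$\nabla_{\mathbf{u}} w_i\,\frac{\partial\mathbf{u}}{\partial t} + \lambda_i\nabla_{\mathbf{u}} w_i\,\frac{\partial\mathbf{u}}{\partial x} = 0.$$
And by the chain rule, we have
$$\frac{\partial w_i}{\partial t} + \lambda_i\frac{\partial w_i}{\partial x} = 0, \tag{14.24}$$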
which is simply the scalar advection equation. So 𝑤 𝑖 is a Riemann invariant
because it is constant along the characteristic curve given by d𝑥/d𝑡 = 𝜆𝑖 . Notice
that (14.24) is the diagonalization of the original system.
Example. Let’s find the Riemann invariants for the shallow water equations.
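The Jacobian matrix
$$\mathbf{f}' = \begin{bmatrix} 0 & 1 \\ gh - u^2 & 2u \end{bmatrix}$$
has eigenvalues $\lambda_\pm = u \pm \sqrt{gh}$. We compute the left eigenvectors to be
$(-u \pm \sqrt{gh},\, 1)$. Recall that we defined $\mathbf{u} = (h, m)$ with $m = hu$. In terms of ℎ
and 𝑚, the eigenvectors are $(-mh^{-1} \pm \sqrt{gh},\, 1)$. We need these vectors to be exact
differentials, so let’s rescale them by 1/ℎ. We now have
$$\nabla_{\mathbf{u}} w_\pm = \left(-\frac{m}{h^2} \pm \sqrt{\frac{g}{h}},\ \frac{1}{h}\right).$$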
$$\frac{U^{n+1}_j - \tfrac12(U^n_{j+1} + U^n_{j-1})}{k} + \frac{f(U^n_{j+1}) - f(U^n_{j-1})}{2h} = 0$$
in conservative form
$$\frac{U^{n+1}_j - U^n_j}{k} + \frac{F^n_{j+1/2} - F^n_{j-1/2}}{h} = 0$$
where the numerical flux
$$F^n_{j+1/2} = \frac{f(U^n_{j+1}) + f(U^n_j)}{2} - \frac{h^2}{2k}\,\frac{U^n_{j+1} - U^n_j}{h}.$$
Recall from the discussion on page 419 that the second term adds artificial
viscosity to an otherwise unstable central difference scheme. However, the
diffusion coefficient 𝛼 = ℎ2 /2𝑘 tends to make the Lax–Friedrichs scheme overly
diffusive. The coefficient of the diffusion term is bounded below by the CFL
condition for the Lax–Friedrichs method, which states that ℎ/𝑘 must be greater
than the characteristic speed. In other words, the diffusion coefficient is bounded
below by the norm of the Jacobian matrix, $\tfrac12 h\|\mathbf{f}'(\mathbf{u})\|$. We can take the magnitude
of the largest eigenvalue as a suitable matrix norm. Because we need just enough
viscosity to counter the effect of $f'(U_{j+1/2})$, we can lessen the artificial viscosity
in the Lax–Friedrichs method by choosing a local diffusion coefficient $\alpha_{j+1/2}$
determined by the maximum of $\|\mathbf{f}'(\mathbf{u})\|$ for $\mathbf{u}$ in the interval $(U_j, U_{j+1})$. Such an approach
to defining the numerical flux,
$$F_{j+1/2} = \frac{f(U^n_{j+1}) + f(U^n_j)}{2} - \alpha_{j+1/2}\,\frac{U^n_{j+1} - U^n_j}{h}$$
with
$$\alpha_{j+1/2} = \tfrac12 h \max_{\mathbf{u} \in (U_j, U_{j+1})} \|\mathbf{f}'(\mathbf{u})\|,$$
is known as the local Lax–Friedrichs method.
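A scalar version of this numerical flux might be sketched in Julia as follows. The names `llf_flux` and `llf_step` are hypothetical, the boundaries are taken to be periodic, and the maximum of |𝑓′| is evaluated only at the two cell values, which is valid when the flux is convex.

```julia
# Local Lax–Friedrichs flux at the interface between cell values Ul and Ur.
# The h in α = ½h·max|f′| cancels against the 1/h of the difference quotient.
function llf_flux(f, df, Ul, Ur)
    a = max(abs(df(Ul)), abs(df(Ur)))    # max |f′| at the endpoints (convex f assumed)
    return (f(Ul) + f(Ur))/2 - a*(Ur - Ul)/2
end

# One conservative update U_j ← U_j − (k/h)(F_{j+1/2} − F_{j−1/2}), periodic in space.
function llf_step(f, df, U, k, h)
    F = llf_flux.(f, df, U, circshift(U, -1))    # F[j] holds F_{j+1/2}
    return U .- (k/h) .* (F .- circshift(F, 1))
end

burgers(u)  = u^2/2
dburgers(u) = u
U    = [1.0, 0.5, 0.0, 0.0, 0.5]
Unew = llf_step(burgers, dburgers, U, 0.05, 0.1)
```

Two sanity checks are worth noting: the flux reduces to 𝑓(𝑢) when both states are equal, and for linear advection with 𝑐 > 0 it reduces to the upwind flux $cU_j$. Because the update is in conservative form, the sum of the cell values is unchanged by a step.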
Rather than solving the problem using Taylor series expansion in the form of
a finite difference scheme, we can instead use an integral formulation called a
finite volume method. In such a method, we consider breaking the domain into
cells centered at $x_{j+1/2} = \tfrac12(x_{j+1} + x_j)$ with boundaries at $x_j$ and $x_{j+1}$. We then
compute the flux through the cell boundaries to determine the change inside the
cell. Finite volume methods are especially useful in two and three dimensions
when the mesh can be irregularly shaped.
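The problem $u_t + f(u)_x = 0$ where 𝑢 ≡ 𝑢(𝑡, 𝑥) is equivalent to
$$\frac{1}{kh}\int_{t_n}^{t_{n+1}}\int_{x_{j-1/2}}^{x_{j+1/2}} \left(u_t + f(u)_x\right)\mathrm{d}x\,\mathrm{d}t = 0,$$
or simply
$$\int_{x_{j-1/2}}^{x_{j+1/2}} u(t_{n+1}, x)\,\mathrm{d}x - \int_{x_{j-1/2}}^{x_{j+1/2}} u(t_n, x)\,\mathrm{d}x + \int_{t_n}^{t_{n+1}} f(u(t, x_{j+1/2}))\,\mathrm{d}t - \int_{t_n}^{t_{n+1}} f(u(t, x_{j-1/2}))\,\mathrm{d}t = 0. \tag{14.26}$$
Physically, this says that the change in mass of a cell from time $t_n$ to time $t_{n+1}$
equals the flux through the boundaries at $x_{j+1/2}$ and $x_{j-1/2}$. Let’s define
$$U^n_j = \frac{1}{h}\int_{x_{j-1/2}}^{x_{j+1/2}} u(t_n, x)\,\mathrm{d}x \quad\text{and}\quad F^{n+1/2}_{j+1/2} = \frac{1}{k}\int_{t_n}^{t_{n+1}} f(u(t, x_{j+1/2}))\,\mathrm{d}t.$$
Think of 𝑢 as density and 𝑈 as mass, and think of 𝑓 as flux density and 𝐹 as flux.
Then we have
$$\frac{U^{n+1}_j - U^n_j}{k} + \frac{F^{n+1/2}_{j+1/2} - F^{n+1/2}_{j-1/2}}{h} = 0. \tag{14.27}$$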
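Now, we just need appropriate numerical approximations for these integrals.
For a first-order approximation, we simply take piecewise-constant approximations of 𝑢(𝑡, 𝑥) in the cells. Then
$$U^n_j = \tfrac12\left(U^n_{j-1/2} + U^n_{j+1/2}\right)$$
by the midpoint rule and
$$F_{j+1/2} = f\left(U^n_{j+1/2}\right).$$
$$P_j(x) = U^n_j + \tfrac{\partial}{\partial x}U^n_j \cdot (x - x_j)$$
where $\tfrac{\partial}{\partial x}U^n_j$ is a slope and $x \in [x_j, x_{j+1}]$. We can approximate
$$U^n_{j+1/2} = \frac{1}{h}\int_{x_j}^{x_{j+1/2}} P_j(x)\,\mathrm{d}x + \frac{1}{h}\int_{x_{j+1/2}}^{x_{j+1}} P_{j+1}(x)\,\mathrm{d}x = \tfrac12\left(U^n_j + U^n_{j+1}\right) + \tfrac18 h\left(\tfrac{\partial}{\partial x}U^n_j - \tfrac{\partial}{\partial x}U^n_{j+1}\right).$$
From (14.27), we have
$$U^{n+1}_{j+1/2} = \tfrac12\left(U^n_j + U^n_{j+1}\right) + \tfrac18 h\left(\tfrac{\partial}{\partial x}U^n_j - \tfrac{\partial}{\partial x}U^n_{j+1}\right) - \frac{k}{h}\left(F^{n+1/2}_{j+1} - F^{n+1/2}_j\right).$$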
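We still need a second-order approximation for the flux $F^{n+1/2}_j$. Using the
midpoint rule,
$$F^{n+1/2}_j = \frac{1}{k}\int_{t_n}^{t_{n+1}} f(u(t, x_j))\,\mathrm{d}t \approx f(u(t_{n+1/2}, x_j)),$$
and
$$u(t_{n+1/2}, x_j) \approx u(t_n, x_j) + \frac{k}{2}\frac{\partial}{\partial t}u(t_n, x_j) = u(t_n, x_j) - \frac{k}{2}\frac{\partial}{\partial x}f(u(t_n, x_j)),$$
where we use $u_t + f(u)_x = 0$ for the equality above. This approximation gives
us the staggered, two-step predictor-corrector method
$$U^{n+1/2}_j = U^n_j - \frac{k}{2}\frac{\partial}{\partial x}f(U^n_j)$$
$$U^{n+1}_{j+1/2} = \tfrac12\left(U^n_j + U^n_{j+1}\right) + \frac{h}{8}\left(\tfrac{\partial}{\partial x}U^n_j - \tfrac{\partial}{\partial x}U^n_{j+1}\right) - \frac{k}{h}\left(f\left(U^{n+1/2}_{j+1}\right) - f\left(U^{n+1/2}_j\right)\right).$$
[Figure: piecewise reconstructions in three panels: first-order, second-order, and slope-limited.]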
ΔU = [0;diff(U)]
s = @. ΔU*ϕ( [ΔU[2:end];0]/(ΔU + (ΔU==0)) )
Notice that ΔU is padded with a zero on the left and then on the right so that s has
the same number of elements as U.
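To see these two lines in action, here is a small self-contained run. The limiter ϕ used below is the van Leer limiter, which is one common choice and an assumption of this sketch; the sample data `U` are made up for illustration.

```julia
ϕ(θ) = (abs(θ) + θ)/(1 + abs(θ))    # van Leer limiter (an assumed choice of ϕ)

U  = [0, 0, 1, 2, 3, 3, 3]          # a ramp between two plateaus
ΔU = [0; diff(U)]                   # backward differences, padded on the left
s  = @. ΔU*ϕ( [ΔU[2:end]; 0]/(ΔU + (ΔU==0)) )
```

The term `(ΔU==0)` replaces zero denominators by one so that the ratio of consecutive differences is always defined. For this data the limited slopes are nonzero only in the interior of the ramp and vanish on the plateaus and at the corner where the ramp flattens.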
Julia has several packages for modeling and solving hyperbolic problems. Trixi.jl
is a numerical simulation framework for hyperbolic conservation laws. Kinetic.jl is a
toolbox for the study of computational fluid dynamics and scientific machine learning.
Oceananigans.jl is a fluid flow solver for modeling ocean dynamics. ViscousFlow.jl
models viscous incompressible flows around bodies of various shapes.
14.7 Exercises
14.1. Show that the error of the Lax–Friedrichs scheme is 𝑂 (𝑘 + ℎ2 /𝑘). Compute
the dispersion relation. Taking the CFL condition and numerical viscosity into
consideration, how should we best choose ℎ and 𝑘 to minimize error?
14.2. Solve the advection equation $u_t + u_x = 0$, with initial conditions 𝑢(0, 𝑥) = 1
if $|x - \tfrac12| < \tfrac14$ and 𝑢(0, 𝑥) = 0 otherwise, using the upwind scheme and the
Lax–Wendroff scheme. Use the domain 𝑥 ∈ [0, 1] with periodic boundary
conditions. Compare the solutions at 𝑡 = 1.
14.3. Consider a finite difference scheme for the linear hyperbolic equation
𝑢 𝑡 + 𝑐𝑢 𝑥 = 0 that uses the following stencil (leapfrog in time and central
difference in space):
Determine the order and stability conditions for a consistent scheme. Discuss
whether the scheme is dispersive or dissipative. How does dispersion or
dissipation affect the scheme’s stability? Show that you can derive this scheme
starting with (14.5) and counteracting the unstable term.
14.5. Derive the conservative form of the Lax–Wendroff scheme (page 438).
where 𝜌 is the density, 𝑢 is the velocity, 𝐸 is the total energy, and 𝑝 is the pressure.
This system is not a closed system—there are more unknowns than equations.
To close the system, we add the equation of state for an ideal gas
𝐸 = 12 𝜌𝑢 2 + .
$$h_t + (hu)_x = 0 \tag{14.29a}$$
$$(hu)_t + \left(hu^2 + \tfrac12 gh^2\right)_x = 0, \tag{14.29b}$$
Elliptic Equations
This chapter introduces the finite element method (FEM) and its applications in
solving elliptic equations. Typical elliptic equations are the Laplace equation
and the Poisson equation, which model the steady-state distribution of electrical
charge or temperature either with a source term (the Poisson equation) or without
(the Laplace equation), and the steady-state Schrödinger equation. Finite element
analysis is also frequently applied to modeling strain on structures such as beams.
Finite element methods, pioneered in the 1960s, are widely used for engineering
problems because the flexible geometry allows one to use irregular elements and
the variational formulation makes it easy to deal with boundary conditions and
compute error estimates.
Theorem 49. Under suitable conditions, the boundary value problem D, minimization
problem M, and variational problem V are all equivalent.
The function 𝑣 ℎ is called the test function, and 𝑢 ℎ is called the trial function.
Let’s apply the Galerkin formulation to the original boundary value problem.
Consider 𝑉ℎ to be the subspace of continuous piecewise-linear polynomials with
nodes at {𝑥𝑖 }. (We might alternatively consider other higher-order splines as
basis elements.) We’ll restrict ourselves to a uniform partition with nodes at
𝑥 𝑗 = 𝑗 ℎ and with 𝑥 1 = 0 and 𝑥 𝑚 = 1 to simplify the derivations. The finite
element solution for a nonuniform partition can similarly be derived. A basis
element 𝜑 𝑗 (𝑥) ∈ 𝑉ℎ is
$$\varphi_j(x) = \begin{cases} 1 + t_j(x), & t_j(x) \in (-1, 0] \\ 1 - t_j(x), & t_j(x) \in (0, 1] \\ 0, & \text{otherwise} \end{cases} \tag{15.1}$$
with $t_j(x) = (x - x_j)/h$. Note that $\{\varphi_j(x)\}$ forms a partition of unity, i.e.,
$\sum_{j=1}^m \varphi_j(x) = 1$ for all 𝑥. Then $v_h \in V_h$ is defined by
$$v_h(x) = \sum_{j=1}^m v_j\varphi_j(x)$$
and
$$u_h(x) = \sum_{i=1}^m \xi_i\varphi_i(x),$$
where $\xi_i = u(x_i)$ are unknowns. The Galerkin problem to find $u_h$ such that
$\langle u_h', v_h' \rangle = \langle f, v_h \rangle$ now becomes the problem to find $\xi_i$ such that
$$\left\langle \sum_{i=1}^m \xi_i\varphi_i'(x),\ \sum_{j=1}^m v_j\varphi_j'(x) \right\rangle = \left\langle f,\ \sum_{j=1}^m v_j\varphi_j(x) \right\rangle.$$
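A compact Julia sketch of the resulting linear system is given below. It assumes the boundary value problem is −𝑢″ = 𝑓 on [0, 1] with 𝑢(0) = 𝑢(1) = 0, numbers only the 𝑚 interior nodes (a simplification of the node numbering above), and approximates the load integrals by ℎ𝑓(𝑥ⱼ). The function name `fem_poisson_1d` is hypothetical.

```julia
using LinearAlgebra

# Piecewise-linear Galerkin solution of −u″ = f on [0,1] with u(0) = u(1) = 0,
# using m interior nodes and hat-function basis elements.
function fem_poisson_1d(f, m)
    h = 1/(m + 1)
    x = h .* (1:m)
    A = SymTridiagonal(fill(2/h, m), fill(-1/h, m - 1))   # entries ⟨φᵢ′, φⱼ′⟩
    b = h .* f.(x)                                        # ⟨f, φⱼ⟩ ≈ h f(xⱼ)
    return x, A\b
end

x, ξ = fem_poisson_1d(x -> 1.0, 49)
err  = maximum(abs.(ξ .- x .* (1 .- x) ./ 2))   # compare with u(x) = x(1 − x)/2
```

For the constant source 𝑓 = 1 the load integrals are exact, and the computed nodal values agree with the true solution 𝑥(1 − 𝑥)/2 to rounding error. For a general 𝑓 one would expect a discretization error that shrinks as the mesh is refined.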
Let’s extend the framework we developed in the previous section to the two-
dimensional Poisson equation
$$-\Delta u = f \quad\text{where}\quad u\big|_{\partial\Omega} = 0 \tag{15.2}$$
for a region $\Omega \subset \mathbb{R}^2$. We’ll need a few more concepts before writing the Poisson
equation in variational form. A mapping $\mathcal{L} : V \to \mathbb{R}$ is a linear functional or
linear form if
$$\mathcal{L}(\alpha u + \beta v) = \alpha\mathcal{L}(u) + \beta\mathcal{L}(v)$$
for all real numbers 𝛼 and 𝛽 and 𝑢, 𝑣, 𝑤 ∈ 𝑉. A bilinear form 𝑎(·, ·) is symmetric
if 𝑎(𝑢, 𝑣) = 𝑎(𝑣, 𝑢), and it is positive definite if 𝑎(𝑢, 𝑢) > 0 for all nonzero 𝑢.
An inner product is a bilinear form that is symmetric and positive definite.
The space of 𝐿 2 -functions, also called square-integrable functions, is
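$$L^2(\Omega) = \left\{ v \;\middle|\; \iint_\Omega |v|^2\,\mathrm{d}\mathbf{x} < \infty \right\}.$$
In this case, we define the $L^2$-inner product as $\langle v, w \rangle_{L^2} = \iint_\Omega vw\,\mathrm{d}\mathbf{x}$ and the
$L^2$-norm as $\|v\|_{L^2} = \sqrt{\langle v, v \rangle}$. A Sobolev space is the space of $L^2$-functions
whose derivatives are also $L^2$-functions:
$$H^1(\Omega) = \left\{ v \in L^2(\Omega) \mid \nabla v \in L^2(\Omega) \right\}.$$
We can further define Sobolev spaces $H^k(\Omega)$ as the space of functions whose
𝑘th derivatives are also $L^2$-functions. We define the Sobolev inner product as
$$\langle v, w \rangle_{H^1} = \iint_\Omega vw + \nabla v \cdot \nabla w\,\mathrm{d}\mathbf{x},$$
where $\partial u/\partial n = \nabla u \cdot \hat{\mathbf{n}}$ is a directional derivative with unit outer normal $\hat{\mathbf{n}}$.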
∇ · (𝑣∇𝑢) = ∇𝑣 · ∇𝑢 + 𝑣Δ𝑢.
Now, let’s get back to the two-dimensional Poisson equation. Let $V = H_0^1(\Omega)$,
the space of Sobolev functions over Ω that vanish on the boundary 𝜕Ω. Then
from (15.2), for all 𝑣 ∈ 𝑉
$$-\iint_\Omega v\Delta u\,\mathrm{d}x\,\mathrm{d}y = \iint_\Omega fv\,\mathrm{d}x\,\mathrm{d}y. \tag{15.3}$$
The stiffness matrix A is sparse because 𝑛𝑖 can only be a vertex for a small
number of triangles.
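Example. Let’s write the following problem with Neumann boundary conditions
as an FEM problem:
$$-\Delta u + u = f \quad\text{with}\quad \left.\frac{\partial u}{\partial n}\right|_{\partial\Omega} = g.$$
On the boundary $\partial u/\partial n = g$, so
$$\iint_\Omega \nabla v \cdot \nabla u\,\mathrm{d}x\,\mathrm{d}y + \iint_\Omega vu\,\mathrm{d}x\,\mathrm{d}y = \iint_\Omega fv\,\mathrm{d}x\,\mathrm{d}y + \int_{\partial\Omega} gv\,\mathrm{d}s.$$
Let
$$a(u, v) = \iint_\Omega \nabla u \cdot \nabla v + uv\,\mathrm{d}x\,\mathrm{d}y$$
and let
$$\mathcal{L}(v) = \langle f, v \rangle + \langle\langle g, v \rangle\rangle \equiv \iint_\Omega fv\,\mathrm{d}x\,\mathrm{d}y + \int_{\partial\Omega} gv\,\mathrm{d}s.$$
The variational form of the Neumann problem is
V Find 𝑢 ∈ 𝑉 such that $a(u, v) = \mathcal{L}(v)$ for all 𝑣 ∈ 𝑉.
Let $V_h$ be the space of continuous, piecewise-linear functions on Ω with a triangulation
$T_h$, and let $\{\varphi_j\}$ be the set of basis functions. Recasting the problem as
a finite element approximation:
V Find $u_h \in V_h$ such that $a(u_h, v_h) = \mathcal{L}(v_h)$ for all $v_h \in V_h$.
By taking the finite elements $u_h(x, y) = \sum_{i=1}^m \xi_i\varphi_i(x, y)$, we have
$$\sum_{i=1}^m \xi_i\,a(\varphi_i, \varphi_j) = \mathcal{L}(\varphi_j),$$
where the load vector has elements
$$b_j = \iint_{T_h} f\varphi_j\,\mathrm{d}x\,\mathrm{d}y + \int_{\partial T_h} g\varphi_j\,\mathrm{d}s.$$
Note that this time our stiffness matrix A and load vector b include the boundary
elements.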
Now that we know how a finite element method works, let’s examine when it
works and how well it works. To do this, we’ll use the Lax–Milgram theorem
and Céa’s lemma. The Lax–Milgram theorem provides sufficient conditions for
the existence and uniqueness of a weak solution of a PDE. Céa’s lemma gives a
bound on the error between the weak solution of a PDE and the solution of a
finite element approximation of the PDE.
Let’s begin with a few definitions. A Hilbert space is a complete inner product
space—a vector space with an inner product, that is complete with respect to the
corresponding norm. A space is complete if every Cauchy sequence of functions
in that space converges to a function in that space. A sequence is Cauchy if its
elements eventually become arbitrarily close to one another. In other words, the
sequence $u_1(x), u_2(x), u_3(x), \dots$ is Cauchy if, given any positive 𝜀, the norm
$\|u_m - u_n\| < \varepsilon$ for all sufficiently large 𝑚 and 𝑛.
Example. The space of continuous functions over the interval [−1, 1] is not
complete in the $L^2$-norm. Take the scaled sigmoid functions $u_n(x) = \tanh nx$.
These functions form a Cauchy sequence
that converges to a discontinuous step function. The broader space of square-integrable
functions over the same interval is a Hilbert space. The space of
Sobolev functions is also a Hilbert space, though $\{u_1, u_2, u_3, \dots\}$ is no longer
Cauchy in the Sobolev norm because it doesn’t converge in the Sobolev norm.
Hilbert spaces provide the framework for the finite element method. The
solution to the original PDE is approximated by representing it as a linear
combination of basis functions using linear functionals and bilinear forms. And
completeness of Hilbert spaces allows for convergence of approximations towards
the true solution as the mesh is refined.
A few more definitions. A bilinear form 𝑎 is bounded or continuous if there
exists a positive 𝛾 such that $|a(u, v)| \le \gamma\|u\|\,\|v\|$ for all 𝑢, 𝑣 ∈ 𝑉. On the other
hand, a bilinear form 𝑎 is coercive or elliptic if there exists a positive 𝛼 such
that $|a(u, u)| \ge \alpha\|u\|^2$ for all 𝑢 ∈ 𝑉. In other words, a bilinear form is coercive
if the magnitude of the functional increases at least as fast as the norm of the
function. Likewise, a linear functional $\mathcal{L}$ is bounded if there exists a positive Λ
such that $|\mathcal{L}(u)| \le \Lambda\|u\|$ for all 𝑢 ∈ 𝑉.
Proof. Note that for any real 𝛼, $0 \le \|v - \alpha w\|^2 = \|v\|^2 - 2\alpha\langle v, w \rangle + |\alpha|^2\|w\|^2$. By
taking $\alpha = \langle v, w \rangle/\|w\|^2$, we have $0 \le \|v\|^2\|w\|^2 - |\langle v, w \rangle|^2$.
Example. Let’s show that the variational formulation of the Poisson equation
$-u'' = f$ with 𝑢(0) = 0 and 𝑢(1) = 0 is well-posed (a unique solution exists).
By setting
$$a(u, v) = \int_0^1 u'v'\,\mathrm{d}x \quad\text{and}\quad \mathcal{L}(v) = \int_0^1 fv\,\mathrm{d}x,$$
the variational formulation for the Poisson equation is
V Find $u \in H_0^1([0, 1])$ such that $a(u, v) = \mathcal{L}(v)$ for all $v \in H_0^1([0, 1])$.
To show well-posedness, we’ll need to check that the conditions of boundedness
and coercivity of the Lax–Milgram lemma hold. By the Cauchy–Schwarz inequality,
$$|a(u, v)| = \left|\int_0^1 u'v'\,\mathrm{d}x\right| \le \sqrt{\int_0^1 |u'|^2\,\mathrm{d}x}\,\sqrt{\int_0^1 |v'|^2\,\mathrm{d}x} \le \|u\|_{H^1}\|v\|_{H^1},$$
where the Sobolev norm $\|u\|_{H^1} = \left(\int_0^1 u^2 + |u'|^2\,\mathrm{d}x\right)^{1/2}$. So 𝑎 is bounded. Similarly,
$$|\mathcal{L}(v)| = \left|\int_0^1 fv\,\mathrm{d}x\right| \le \|f\|_{L^2}\|v\|_{L^2} \le \|f\|_{L^2}\|v\|_{H^1}.$$
It follows that $\mathcal{L}$ is bounded. Finally, from the Poincaré inequality, it follows that
$a(u, u) = \int_0^1 (u')^2\,\mathrm{d}x \ge \alpha\int_0^1 u^2 + (u')^2\,\mathrm{d}x$. So 𝑎 is coercive.
Proof. Let 𝑉 be a Hilbert space, and consider the problem of finding the element
𝑢 ∈ 𝑉 such that 𝑎(𝑢, 𝑣) = L (𝑣) for all 𝑣 ∈ 𝑉. Consider the similar problem of
finding the element 𝑢 ℎ ∈ 𝑉ℎ such that 𝑎(𝑢 ℎ , 𝑣 ℎ ) = L (𝑣 ℎ ) for all 𝑣 ℎ ∈ 𝑉ℎ for
the finite-dimensional subspace 𝑉ℎ of 𝑉. The Lax–Milgram theorem says that
unique solutions 𝑢 and 𝑢 ℎ exist. Then, 𝑎(𝑢 − 𝑢 ℎ , 𝑣 ℎ ) = 0 for all 𝑣 ℎ ∈ 𝑉ℎ . It
follows that
Céa’s lemma says that the finite element solution 𝑢 ℎ is the best solution
up to the constant 𝛾/𝛼. We can use the lemma to estimate the approximation
error. Suppose that 𝑢 ∈ 𝐻 2 is a solution to the Galerkin problem. Let 𝑝 ℎ be a
piecewise-linear polynomial interpolant of the solution 𝑢. Then by Céa’s lemma
$$\|u - u_h\|_{H^1} \le \frac{\gamma}{\alpha}\|u - p_h\|_{H^1}.$$
By Taylor’s theorem, there is a constant 𝐶 such that $\|u' - p_h'\|_{L^2} \le Ch\|u''\|_{L^2}$. So,
$$\|u - u_h\|_{H^1} \le \frac{\gamma Ch}{\alpha}\|u''\|_{L^2} \le \frac{\gamma Ch}{\alpha}\|u\|_{H^2}.$$
For degree-𝑘 piecewise-polynomial basis functions, the $H^1$-error is $O(h^k)$ if
$u \in H^{k+1}$.
We can also use the finite element method to solve time-dependent problems.
Consider the one-dimensional heat equation with a source term 𝑢 𝑡 − 𝑢 𝑥 𝑥 = 𝑓 (𝑥).
Let the initial condition 𝑢(𝑥, 0) = 𝑢 0 (𝑥) and boundary values 𝑢(0, 𝑡) = 𝑢(1, 𝑡) = 0.
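Then for all $v \in H_0^1([0, 1])$,
$$\langle u_t, v \rangle - \langle u_{xx}, v \rangle = \langle f, v \rangle$$
where $\langle u, v \rangle = \int_0^1 uv\,\mathrm{d}x$. From this we have the Galerkin formulation:
V Find $u_h \in V_h$ such that $\langle u_t, v \rangle + \langle u_x, v_x \rangle = \langle f, v \rangle$ for all $v \in V_h$.
Let $\{\varphi_1, \varphi_2, \dots, \varphi_m\}$ be a basis of the finite element space $V_h \subset H_0^1([0, 1])$. Let
$u_h(x, t) = \sum_{i=1}^m \xi_i(t)\varphi_i(x)$ and take $v = \varphi_j$. Then we have the system
$$\sum_{i=1}^m \xi_i'(t)\langle \varphi_i, \varphi_j \rangle + \sum_{i=1}^m \xi_i(t)\langle \varphi_i', \varphi_j' \rangle = \langle f, \varphi_j \rangle,$$
which in matrix form is
$$\mathbf{C}\boldsymbol{\xi}'(t) + \mathbf{A}\boldsymbol{\xi}(t) = \mathbf{b},$$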
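$$\left\langle \frac{u_h^{n+1} - u_h^n}{\Delta t},\ v \right\rangle + a\left(u_h^{n+1}, v\right) = 0$$
for all $v \in V_h$ where $a(u, v) = \int_0^1 u'v'\,\mathrm{d}x$. Take $v = u_h^{n+1}$. Then
$$\left\langle u_h^{n+1}, u_h^{n+1} \right\rangle - \left\langle u_h^n, u_h^{n+1} \right\rangle = -\Delta t\,a\left(u_h^{n+1}, u_h^{n+1}\right) \le 0.$$
So, $\|u_h^{n+1}\|_{L^2}^2 \le \langle u_h^n, u_h^{n+1} \rangle \le \|u_h^n\|_{L^2}\|u_h^{n+1}\|_{L^2}$ by the Cauchy–Schwarz inequality.
Therefore, $\|u_h^{n+1}\|_{L^2} \le \|u_h^n\|_{L^2}$ for all 𝑛 and all stepsizes 𝛥𝑡. Hence, the
backward Euler scheme is unconditionally stable.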
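The norm bound can be observed numerically. The Julia sketch below assumes piecewise-linear elements on a uniform mesh with 𝑚 interior nodes, for which the mass and stiffness matrices are tridiagonal; the helper names `l2norm` and `backward_euler`, the initial data, and the time step are all choices made for this illustration.

```julia
using LinearAlgebra

m = 49
h = 1/(m + 1)
x = h .* (1:m)
C = SymTridiagonal(fill(2h/3, m), fill(h/6, m - 1))    # mass matrix ⟨φᵢ, φⱼ⟩
A = SymTridiagonal(fill(2/h, m), fill(-1/h, m - 1))    # stiffness matrix ⟨φᵢ′, φⱼ′⟩
l2norm(ξ) = sqrt(dot(ξ, C*ξ))                          # L² norm of u_h with coefficients ξ

# Backward Euler for Cξ′ + Aξ = 0: solve (C + Δt A) ξⁿ⁺¹ = C ξⁿ at each step.
function backward_euler(ξ, Δt, nsteps)
    norms = [l2norm(ξ)]
    for _ in 1:nsteps
        ξ = (C + Δt*A) \ (C*ξ)
        push!(norms, l2norm(ξ))
    end
    return norms
end

norms = backward_euler(sin.(π .* x) .+ sin.(20π .* x), 10.0, 5)   # deliberately huge Δt
```

Even with a time step far larger than any explicit method would tolerate, the recorded norms do not increase from one step to the next, in line with the inequality derived above.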
In practice, one will typically use a computational software package to manage the
tedious, low-level steps of the finite element method. Once we have formulated
the weak form of the problem, we can implement a solution by breaking the
implementation into the following steps:
Example. Let’s solve the Poisson equation $-\Delta u = f$ over the domain constructed by subtracting the unit square centered at (0, 0.5) from the unit circle centered at (0, 0). Take the source term $f(x,y) = \mathrm{e}^{-(2x+1)^2-(2y+1)^2}$, which models a heat source near (−0.5, −0.5), with Dirichlet boundary conditions $u(x,y) = 0$. The variational form is $a(u_h, v) = \mathcal{L}(v)$ where
$$a(u,v) = \iint \nabla u\cdot\nabla v\,\mathrm{d}x\,\mathrm{d}y \quad\text{and}\quad \mathcal{L}(v) = \iint vf\,\mathrm{d}x\,\mathrm{d}y.$$
We’ll first define the geometry, the finite element space consisting of Lagrange
polynomials, and the trial and test functions. Given a function space V, FEniCS
uses TestFunction(V) for a test function, TrialFunction(V) to define the trial
function, and FeFunction(V) to define the solution function where the result will
be stored.
1 The letters “FE” stand for “finite element,” the letters “CS” stand for “computational software,”
and the letters “ni” are simply used to string the others together into a homophone of phoenix, the
mascot of University of Chicago, where the project originated.
2 https://fenicsproject.org/
Figure 15.2: FEM mesh (left), solution of the Poisson equation with a heat source
centered at (−0.5, −0.5) (middle), and solution of the heat equation with initial
conditions centered at (−0.5, −0.5) (right). Each contour depicts a 10 percent
change in value.
using FEniCS
square = Rectangle(Point([-0.5, 0]), Point([0.5, 1]))
circle = Circle(Point([0.0, 0.0]), 1)
mesh = generate_mesh(circle - square,16)
V = FunctionSpace(mesh,"Lagrange",1)
u, v, w = TrialFunction(V), TestFunction(V), FeFunction(V)
The mesh is shown in the figure above. Now, define the boundary conditions
and the source term. Note that FEniCS requires functions to be entered as C++ expressions.
bc = DirichletBC(V,Constant(0),"on_boundary")
f = Expression("exp(-pow(2*x[0]+1,2)-pow(2*x[1]+1,2))", degree=2)
The solution w can be exported as a visualization toolkit file (VTK) and later
visualized using ParaView:3
The figure above shows the solution. Notice that the peak temperature is to the
right of the heat source peak.
Example. Let’s modify the example above to solve the heat equation 𝑢 𝑡 = Δ𝑢
over the same domain with Neumann boundary conditions. We’ll take the initial
3 ParaView is an open-source application for visualizing scientific data sets such as VTK.
Practical implementation 457
condition $u(0,x,y) = f(x,y) = \mathrm{e}^{-(4x+2)^2-(4y+2)^2}$. The variational form of the
heat equation is h𝑢 𝑡 , 𝑣i + h∇𝑢, ∇𝑣i = 0. We define the geometry, the finite
element space, and the test functions as before, adding an expression for the
initial condition.
We define variables for the initial condition, trial function, test function, and
solution function.
u, v = interpolate(f,V), TestFunction(V)
ut , wt = TrialFunction(V), FeFunction(V)
We define the variational form of the heat equation and pass it along with the
variables to an in-place function using a named tuple p. The function get_array
maps a variable in function space to a one-dimensional array, and the function
assign maps the one-dimensional array to a variable in function space. Neumann
boundary conditions are the default boundary conditions in FEniCS, and we
prescribe them by setting bc = [].
Finally, let’s solve the problem for 𝑡 ∈ [0, 0.5] using a fourth-order semi-implicit
ODE solver and export the solution.
15.6 Exercises
for the deflection 𝑢(𝑥) of a uniform beam under the load 𝑓 (𝑥). Formulate a finite
element method and find the corresponding linear system of equations in the case
of a uniform partition. Determine the solution for 𝑓 (𝑥) = 384. You may wish to
use the basis elements $\phi_j(t) = H_{00}(|t|)$ and $\psi_j(t) = H_{10}(|t|)$ for $t \in [-1, 1]$ and
equal zero otherwise. The Hermite polynomials are $H_{00}(t) = 2t^3 - 3t^2 + 1$ and
$H_{10}(t) = t^3 - 2t^2 + t$.
15.3. Show that the bounding constant in Céa’s lemma can be improved such that
$\|u - u_h\| \le \sqrt{\gamma/\alpha}\,\|u - v_h\|$ for all $v_h \in V_h$ when the bilinear form $a(\cdot,\cdot)$ is
symmetric and positive-definite.
Chapter 16
We can improve the accuracy of finite difference methods by using wider stencils
that provide better approximations to the derivatives. With each additional grid
point added to the stencil, we increase the order of accuracy by one. One might
reason that by using all available grid points to approximate the derivatives, we
could build a highly accurate method. This is precisely what spectral methods
do. If the solution is smooth, the error shrinks faster than any power of grid
spacing—something sometimes called exponential or spectral accuracy. If the
solution is at most 𝑝-times differentiable, convergence is at most 𝑂 (ℎ 𝑝+1 ). This
chapter examines an essential class of spectral methods—the Fourier spectral
method—which uses trigonometric interpolation over a periodic domain to
approximate a solution.
Similar spectral methods, such as the Chebyshev spectral method discussed
in Chapter 10, can be used on problems with nonperiodic solutions. Some
tricks may also allow us to use Fourier spectral methods if the problem isn’t
periodic. For example, we can extend the domain so that the solution does not
have appreciable boundary interaction. Or we might add absorbing “sponge”
layers or reflecting barriers to the problem to prevent the solution from wrapping
around the boundaries.
The Fourier transform has already been discussed in sections 6.1 regarding the
fast Fourier transform and 10.3 regarding function approximation. This section
460 Fourier Spectral Methods
Trigonometric interpolation
Let the function $u(x)$ be periodic with period $2\pi$, i.e., $u(x + 2\pi) = u(x)$. We
can modify the period of $u(x)$ by rescaling $x$. We can approximate $u(x)$ by a
Fourier polynomial $p(x) = \sum_{\xi=-m}^{m-1} c_\xi \mathrm{e}^{\mathrm{i}\xi x}$ by choosing coefficients $c_\xi$ so that
$u(x_j) = p(x_j)$ at equally spaced nodes $x_j = j\pi/m$ for $j = 0, 1, \dots, 2m-1$. To
find the coefficients $c_\xi$, we’ll need the help of an identity.
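Theorem 55. $\sum_{j=0}^{2m-1} \mathrm{e}^{\mathrm{i}\xi x_j}$ equals zero for nonzero integers $\xi$ with $|\xi| < 2m$ and equals $2m$ for $\xi = 0$.
Proof.
$$\sum_{j=0}^{2m-1} \mathrm{e}^{\mathrm{i}\xi x_j} = \sum_{j=0}^{2m-1} \mathrm{e}^{\mathrm{i}\xi j\pi/m} = \sum_{j=0}^{2m-1} \omega^{\xi j} = \begin{cases} \dfrac{\omega^{2m\xi} - 1}{\omega^{\xi} - 1}, & \text{if } \xi \ne 0 \\[1ex] 2m, & \text{if } \xi = 0 \end{cases}$$
where $\omega = \mathrm{e}^{\mathrm{i}\pi/m}$. Note that $\omega^{2m\xi} = \mathrm{e}^{\mathrm{i}\xi 2\pi} = 1$.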
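A quick numerical check of the identity can be reassuring. The short sketch below evaluates the sum directly at the nodes $x_j = j\pi/m$; the helper name nodesum is hypothetical. It also shows why the frequency must stay below $2m$ in magnitude: at $\xi = 2m$ the exponentials alias to one and the sum is again $2m$.

```julia
# Sum of exp(iξxⱼ) over the nodes xⱼ = jπ/m for j = 0, 1, …, 2m-1
nodesum(ξ, m) = sum(exp(im*ξ*j*π/m) for j in 0:2m-1)

m = 8
sums = [(ξ, round(abs(nodesum(ξ, m)), digits=10)) for ξ in (-3, 0, 1, 5, 15, 16)]
```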
It follows that
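$$\sum_{j=0}^{2m-1} u(x_j)\mathrm{e}^{-\mathrm{i}\xi x_j} = \sum_{j=0}^{2m-1}\sum_{l=-m}^{m-1} c_l \mathrm{e}^{\mathrm{i}(l-\xi)x_j} = \sum_{l=-m}^{m-1} c_l \sum_{j=0}^{2m-1} \mathrm{e}^{\mathrm{i}(l-\xi)x_j} = 2m c_\xi.$$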
From this expression, we have the discrete Fourier transform and the inverse
discrete Fourier transform
$$c_\xi = \frac{1}{2m}\sum_{j=0}^{2m-1} u(x_j)\mathrm{e}^{-\mathrm{i}\xi x_j} \quad\text{and}\quad u(x_j) = \sum_{\xi=-m}^{m-1} c_\xi \mathrm{e}^{\mathrm{i}\xi x_j}. \tag{16.1}$$
We can think of the discrete Fourier transform as a Riemann sum and the
continuous Fourier transform as a Riemann integral.
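To make the transform pair (16.1) concrete, here is a minimal sketch that evaluates both sums directly, at $O(m^2)$ cost, rather than with a fast Fourier transform. The function names dft and idft are chosen for illustration and are not library functions; coefficients are stored in the order $\xi = -m, \dots, m-1$.

```julia
# Direct evaluation of the pair (16.1) on the nodes xⱼ = jπ/m, j = 0,…,2m-1
function dft(u)
    m = length(u) ÷ 2
    x = (0:2m-1)*π/m
    [sum(u .* exp.(-im*ξ*x))/(2m) for ξ in -m:m-1]
end

function idft(c)
    m = length(c) ÷ 2
    x = (0:2m-1)*π/m
    [sum(c .* exp.(im*(-m:m-1)*xj)) for xj in x]
end

m = 8
x = (0:2m-1)*π/m
u = @. exp(sin(x))
roundtrip_error = maximum(abs.(idft(dft(u)) - u))
```

For a test such as $u(x) = \cos 3x$, the only nonzero coefficients are $c_{\pm 3} = \frac{1}{2}$, which sit at positions m+1±3 of the returned array.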
The 2𝜋-periodization of a function 𝑢(𝑥) is the function
$$u_p(x) = \sum_{\xi=-\infty}^{\infty} u(x + 2\pi\xi)$$
Discrete Fourier transform 461
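$$u_p(x) = \sum_{\xi=-\infty}^{\infty} c_\xi \mathrm{e}^{\mathrm{i}\xi x} \tag{16.2}$$
$$c_\xi = \frac{1}{2\pi}\int_0^{2\pi} u_p(x)\mathrm{e}^{-\mathrm{i}\xi x}\,\mathrm{d}x = \frac{1}{2\pi}\int_{-\infty}^{\infty} u(x)\mathrm{e}^{-\mathrm{i}\xi x}\,\mathrm{d}x,$$
$$\sum_{\xi=-\infty}^{\infty} u(x + 2\pi\xi) = \sum_{\xi=-\infty}^{\infty} \hat{u}(\xi)\mathrm{e}^{\mathrm{i}\xi x},$$
known as the Poisson summation formula. If $u(x)$ vanishes outside of $[0, 2\pi)$,
$$u(x) = \sum_{\xi=-\infty}^{\infty} \hat{u}(\xi)\mathrm{e}^{\mathrm{i}\xi x}.$$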
Spectral differentiation
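$$\frac{\mathrm{d}}{\mathrm{d}x}u(x) = \frac{\mathrm{d}}{\mathrm{d}x}\sum_{\xi=-m}^{m-1} c_\xi \mathrm{e}^{\mathrm{i}\xi x} = \sum_{\xi=-m}^{m-1} (\mathrm{i}\xi c_\xi)\mathrm{e}^{\mathrm{i}\xi x}.$$
The discrete Fourier transform of the derivative has the memorable form
$$\mathrm{F}\left[\frac{\mathrm{d}}{\mathrm{d}x}u\right] = \mathrm{i}\xi\,\mathrm{F}[u].$$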
From this equation, we can see the origin of the term “spectral method.” When
we expand a solution 𝑢(𝑡, 𝑥) as a series of orthogonal eigenfunctions of a linear
operator, that solution is related to the spectrum (the set of eigenvalues) of that
operator. A pseudospectral method solves the problem on a discrete mesh.
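When the function $u(x)$ has period $\ell$, we make the transform $x \mapsto (2\pi/\ell)x$
and $\mathrm{d}x \mapsto (2\pi/\ell)\,\mathrm{d}x$, and the discrete Fourier transform of $\frac{\mathrm{d}}{\mathrm{d}x}u$ is $\mathrm{i}(2\pi/\ell)\xi\hat{u}$.
Of course, as long as $u(x)$ is smooth, we can differentiate any number of times.
So, the Fourier transform of the $p$th derivative of $u(x)$ is
$$\mathrm{F}\left[\frac{\mathrm{d}^p}{\mathrm{d}x^p}u\right] = \left(\mathrm{i}\frac{2\pi}{\ell}\xi\right)^{p}\mathrm{F}[u]. \tag{16.3}$$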
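Equation (16.3) translates almost directly into code. The sketch below is self-contained under two assumptions: it builds a dense DFT matrix instead of calling an FFT library (fine for small $m$, but $O(m^2)$), and it stores frequencies with the zero frequency first, followed by the positive and then the negative frequencies. The names dftmatrix and spectral_derivative are hypothetical.

```julia
# Dense DFT matrix with the zero frequency first; its inverse is F'/m
dftmatrix(m) = [exp(-2π*im*j*k/m) for k in 0:m-1, j in 0:m-1]

# p-th derivative of samples u of an ℓ-periodic function, following (16.3)
function spectral_derivative(u, ℓ, p=1)
    m = length(u)
    F = dftmatrix(m)
    iξ = im*[0:(m-1)÷2; -m÷2:-1]*(2π/ℓ)
    real(F' * (iξ.^p .* (F*u))) / m
end

m = 32; ℓ = 4
x = ℓ*(0:m-1)/m
u  = @. exp(sin(2π*x/ℓ))
du = @. (2π/ℓ)*cos(2π*x/ℓ)*exp(sin(2π*x/ℓ))    # exact derivative
derivative_error = maximum(abs.(spectral_derivative(u, ℓ) - du))
```

With only 32 nodes the error for this smooth periodic function is already near machine precision, which is the spectral accuracy described at the start of the chapter.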
Consider the heat equation $u_t = u_{xx}$ with initial conditions $u(0, x)$ and
periodic boundary conditions over a domain of length $2\pi$. The Fourier transform
$\hat{u}(t, \xi) = \mathrm{F}\,u(t, x)$ of the heat equation is $\hat{u}_t = -\xi^2\hat{u}$, which has the solution
$\hat{u}(t, \xi) = \mathrm{e}^{-\xi^2 t}\hat{u}(0, \xi)$.
We can write the formal solution to the heat equation from this expression as
$$u(t, x) = \mathrm{F}^{-1}\left[\mathrm{e}^{-\xi^2 t}\,\mathrm{F}\,u(0, x)\right]. \tag{16.4}$$
We’ll use a fast Fourier transform (FFT) to compute the discrete transforms in
practice. Let’s examine the numerical implementation of the discrete solution in
Julia. We need to exercise some care in ordering the indices of 𝜉. Most scientific
computing languages, including Julia, Matlab, and Python, define the coefficients
of the discrete Fourier transform 𝑐 𝜉 by starting with the zero-frequency term 𝑐 0 .
For $m$ nodes, we’ll take
$$\xi = 0, 1, 2, \dots, \tfrac{1}{2}(m-1), -\tfrac{1}{2}m, -\tfrac{1}{2}m+1, \dots, -2, -1,$$
where we round half toward zero. For a domain of length $\ell$, the derivative
operator $\mathrm{i}\xi\cdot(2\pi/\ell)$ is
iξ = im*[0:(m-1)÷2;-m÷2:-1]*(2π/ℓ)
iξ = im*ifftshift(-m÷2:(m-1)÷2)*(2π/ℓ)
iξ = im*fftfreq(m,m)*(2π/ℓ)
using FFTW
m = 256; ℓ = 4
# squared wavenumbers, ordered with the zero frequency first to match fft
ξ2 = (ifftshift(-m÷2:(m-1)÷2)*(2π/ℓ)).^2
u = (t,u0) -> real(ifft(exp.(-ξ2*t).*fft(u0)))
If a problem repeatedly uses Fourier transforms on arrays with the same size
as an array u, as is the case in time-marching schemes, then it can be advantageous
to have FFTW try out all possible algorithms initially to see which is fastest.
In this case, we can define F = plan_fft(u) and then use F*u or F\u where the
multiplication and inverse operators are overloaded. For example, we replace the
last line in the code above with
F = plan_fft(u0)
u = (t,u0) -> real(F\(exp.(-ξ2*t).*(F*u0)))
The fft and ifft functions in Julia are all multi-dimensional transforms
unless an additional parameter is provided. For example, fft(x) performs an
FFT over all dimensions of x, whereas fft(x,2) only performs an FFT along the
second dimension of x.
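It is worth confirming that the ordering of $\xi$ is right before trusting a spectral solver, because a misordered frequency vector still produces plausible-looking output. The following sketch checks the formal solution (16.4) against a known exact solution made of two Fourier modes. To stay self-contained it uses a dense DFT matrix in place of FFTW, which is an illustrative substitution and not how one would compute in practice; the names dftmatrix, heat, and exact are hypothetical.

```julia
# Dense DFT matrix with the zero frequency first; its inverse is F'/m
dftmatrix(m) = [exp(-2π*im*j*k/m) for k in 0:m-1, j in 0:m-1]

m = 64; ℓ = 4
x = ℓ*(0:m-1)/m
F = dftmatrix(m)
ξ = [0:(m-1)÷2; -m÷2:-1]*(2π/ℓ)

# u(t,x) = F⁻¹[exp(-ξ²t) F u(0,x)]
heat(t, u0) = real(F' * (exp.(-ξ.^2*t) .* (F*u0))) / m

k = 2π/ℓ
u0 = @. sin(k*x) + 0.5*cos(3k*x)
# each mode decays at the rate given by the square of its wavenumber
exact(t) = @. exp(-k^2*t)*sin(k*x) + 0.5*exp(-(3k)^2*t)*cos(3k*x)
heat_error = maximum(abs.(heat(0.1, u0) - exact(0.1)))
```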
[Figure: the rectangular function $B_0(x)$ (left) and its Fourier transform $\hat{B}_0(\xi)$ (right).]
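$$B_1(x) = \int f(x - h/2) - f(x + h/2)\,\mathrm{d}x = \begin{cases} 1 + x/h, & x \in [-h, 0] \\ 1 - x/h, & x \in [0, h]. \end{cases}$$
The Fourier transform of the triangular function $B_1(x)$ is
$$\hat{B}_1(\xi) = (\mathrm{i}\xi)^{-1}\left(\mathrm{e}^{\mathrm{i}\xi/2h} - \mathrm{e}^{-\mathrm{i}\xi/2h}\right)\frac{1}{h}\operatorname{sinc}\frac{\xi}{2h} = \frac{1}{h^2}\operatorname{sinc}^2\frac{\xi}{2}.$$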
1 British mathematician Phillip Woodward introduced the sinc function (pronounced “sink”) in
his 1952 paper “Information theory and inverse probability in telecommunication,” saying that the
function (sin 𝜋 𝑥)/ 𝜋 𝑥 “occurs so often in Fourier analysis and its applications that it does seem
to merit some notation of its own.” It’s sometimes called “cardinal sine,” not to be confused with
“cardinal sin.”
The triangular function and the sinc-squared function are plotted below.
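$$f(x) = \sum_{\xi=-\infty}^{\infty} c_\xi \mathrm{e}^{\mathrm{i}\xi x} = \sum_{\xi=-m}^{m} c_\xi \mathrm{e}^{\mathrm{i}\xi x} + \tau_m(x) = \mathrm{P}_m f(x) + \tau_m(x)$$
and
$$\|\tau_m\|^2 = \sum_{|\xi| > m} |c_\xi|^2 = \sum_{|\xi| > m} O\left(|\xi|^{-2p-2}\right) = O\left(m^{-2p-1}\right).$$
Therefore, the truncation error $\|\tau_m\| = O\left(m^{-p-1/2}\right)$.
For a rectangular function or any function with a discontinuity, $p = 0$ and
$\|\tau_m\| = O\left(m^{-1/2}\right)$. We should expect slow convergence, as we see in the next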
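The $O(m^{-1/2})$ rate for a discontinuous function can be observed directly from the coefficients. As an illustration, take the square wave $\operatorname{sign}(\sin x)$, whose Fourier coefficients satisfy $|c_\xi|^2 = 4/(\pi\xi)^2$ for odd $\xi$ and vanish for even $\xi$. The sketch below sums the tail $\sum_{|\xi|>m}|c_\xi|^2$, cut off at a large frequency $K$; the cutoff, the choice of test function, and the name tail are assumptions made for this example.

```julia
# Truncation error ‖τₘ‖ = sqrt(Σ_{|ξ|>m} |c_ξ|²) for the square wave sign(sin x)
function tail(m; K=10^6)
    s = 0.0
    for ξ in K:-1:m+1                  # add the smallest terms first
        isodd(ξ) && (s += 4/(π*ξ)^2)
    end
    sqrt(2s)                           # the factor 2 accounts for ±ξ
end

rates = [(m, tail(m), tail(m)*sqrt(m)) for m in (10, 40, 160, 640)]
```

The last entry of each tuple, $\|\tau_m\|\sqrt{m}$, settles to a constant, so quadrupling $m$ only halves the error.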
Time splitting
Time splitting methods are familiar from chapters 12 and 13. Consider the
equation 𝑢 𝑡 = L 𝑢 + N 𝑢, where L is a linear operator and N is a nonlinear operator.
Take the splitting
1. 𝑣 𝑡 = L 𝑣 where 𝑣(0, 𝑥) = 𝑢(0, 𝑥),
2. 𝑤 𝑡 = N 𝑤 where 𝑤(0, 𝑥) = 𝑣( 𝛥𝑡, 𝑥).
Recall from the discussion on page 368 that after one time step, the solution
$u(\Delta t, x) = w(\Delta t, x) + O\left(\Delta t^2\right)$. The method has a second-order splitting error
with each time step, resulting in a first-order global splitting error. We can
increase the order of accuracy by using Strang splitting.
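The difference in order between the two splittings shows up even on a tiny linear problem. The sketch below is an illustration, not the PDE setting of this chapter: it replaces the operators with two 2×2 matrices A and B that do not commute, so that $u' = (A + B)u$ has a known solution, and it measures the error at time $T = 1$ when the number of steps is doubled. The name splitting_error is hypothetical.

```julia
using LinearAlgebra

A = [0.0 1.0; 0.0 0.0]
B = [0.0 0.0; -1.0 0.0]          # A*B ≠ B*A, so splitting introduces an error
u0 = [1.0, 0.0]; T = 1.0
exact = exp(T*(A + B))*u0

function splitting_error(n; strang=false)
    Δt = T/n
    # one step: solve with A then B, or a half step of A on either side of B
    S = strang ? exp(Δt/2*A)*exp(Δt*B)*exp(Δt/2*A) : exp(Δt*B)*exp(Δt*A)
    norm(S^n*u0 - exact)
end

ratio_simple = splitting_error(50)/splitting_error(100)
ratio_strang = splitting_error(50, strang=true)/splitting_error(100, strang=true)
```

Halving the step roughly halves the error of the simple splitting and roughly quarters the error of Strang splitting, consistent with first and second order.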
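Let’s apply time splitting to a spectral method. Let $\hat{u} \equiv \mathrm{F}\,u$ and $\hat{v} \equiv \mathrm{F}\,v$
be the Fourier transforms of $u$ and $v$. Then the Fourier transform of the equation
$v_t = \mathrm{L}\,v$ is $\hat{v}_t = \hat{\mathrm{L}}\hat{v}$, where $\hat{\mathrm{L}}$ is the Fourier transform of the operator $\mathrm{L}$, assuming
it exists. This differential equation has the formal solution $\hat{v}(\Delta t, \xi) = \mathrm{e}^{\Delta t\hat{\mathrm{L}}}\hat{u}(0, \xi)$.
Applying the inverse Fourier transform gives us
$$v(\Delta t, x) = \mathrm{F}^{-1}\left[\mathrm{e}^{\Delta t\hat{\mathrm{L}}}\,\mathrm{F}\,u(0, x)\right].$$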
Nonlinear stiff equations 467
𝑢 𝑡 = 𝑢 𝑥 𝑥 + 𝜀 −1 𝑢(1 − 𝑢 2 )
where the initial conditions 𝑢 0 (𝑥) = 𝑢(0, 𝑥). While having an analytic solution
is convenient, we could have also solved the differential equation numerically if
one didn’t exist.
For each time step $t_n$, we have
1. $v(x) = \mathrm{F}^{-1}\left[\mathrm{e}^{-\xi^2\Delta t}\,\mathrm{F}[u(t_n, x)]\right]$
2. $u(t_{n+1}, x) = u_0\Big/\sqrt{u_0^2 - (u_0^2 - 1)\mathrm{e}^{-2\Delta t/\varepsilon}}$ where $u_0 = v(x)$.
Each time step has an $O\left(\Delta t^2\right)$ splitting error, so we have $O(\Delta t)$ error after $n$
iterations. By using Strang splitting, we can achieve $O\left(\Delta t^2\right)$ error.
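The closed-form update in the second step is the exact flow of the reaction equation $w_t = \varepsilon^{-1}w(1 - w^2)$, and it can be checked against an independent numerical integration. The sketch below does so with a classical fourth-order Runge–Kutta method using many small substeps; the names reaction and rk4 and the sample values are chosen for illustration.

```julia
# Exact flow of w′ = w(1 - w²)/ε over a time step Δt
reaction(w0, Δt, ε) = w0/sqrt(w0^2 - (w0^2 - 1)*exp(-2Δt/ε))

# Classical fourth-order Runge–Kutta with n substeps, used only as a check
function rk4(f, w, Δt, n)
    h = Δt/n
    for _ in 1:n
        k1 = f(w); k2 = f(w + h/2*k1); k3 = f(w + h/2*k2); k4 = f(w + h*k3)
        w += h/6*(k1 + 2k2 + 2k3 + k4)
    end
    w
end

ε = 0.1; Δt = 0.05
f(w) = w*(1 - w^2)/ε
check = [(w0, reaction(w0, Δt, ε), rk4(f, w0, Δt, 1000)) for w0 in (-1.5, -0.2, 0.3, 2.0)]
```

The formula preserves the sign of $w_0$ and drives every nonzero value toward ±1, so the reaction step can never overshoot the stable states no matter how large the time step.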
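with the scaled Planck constant $\varepsilon$. We can separate the spatial operator into a
kinetic term $\tfrac{1}{2}\mathrm{i}\varepsilon u_{xx}$ and a potential energy term $-\mathrm{i}\varepsilon^{-1}V(x)u$. These terms are
linear in $u$, and we can solve the problem using Strang splitting
$$u(x, t_{n+1}) = \mathrm{e}^{-\mathrm{i}V(x)\Delta t/2\varepsilon}\,\mathrm{F}^{-1}\left[\mathrm{e}^{-\mathrm{i}\varepsilon\xi^2\Delta t/2}\,\mathrm{F}\left[\mathrm{e}^{-\mathrm{i}V(x)\Delta t/2\varepsilon}u(x, t_n)\right]\right]$$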
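Each factor in the Strang step has unit modulus, so the discrete scheme conserves the norm of the solution. The sketch below demonstrates this on a coarse grid. It is an illustration under stated assumptions: a dense DFT matrix stands in for an FFT, the grid is too coarse to resolve the solution accurately for small $\varepsilon$, and the harmonic potential, initial condition, and function names are chosen for the example.

```julia
using LinearAlgebra

# Dense DFT matrix with the zero frequency first; its inverse is F'/m
dftmatrix(m) = [exp(-2π*im*j*k/m) for k in 0:m-1, j in 0:m-1]

m = 64; ℓ = 16; ε = 0.1; Δt = 0.01
x = ℓ*(0:m-1)/m .- ℓ/2
ξ = [0:(m-1)÷2; -m÷2:-1]*(2π/ℓ)
F = dftmatrix(m)
V = x.^2

# half step of the potential, full kinetic step, half step of the potential
function strang_step(u)
    u = exp.(-im*V*Δt/(2ε)) .* u
    u = F' * (exp.(-im*ε*ξ.^2*Δt/2) .* (F*u)) / m
    exp.(-im*V*Δt/(2ε)) .* u
end

function evolve(u, nsteps)
    for _ in 1:nsteps
        u = strang_step(u)
    end
    u
end

u0 = exp.(-(x .- 3).^2/ε) .+ 0im
norm_drift = norm(evolve(u0, 100)) - norm(u0)
```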
Integrating factors
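Let $\hat{w} = \mathrm{e}^{-t\hat{\mathrm{L}}}\hat{u}$. Then the solution is
$$u(t, x) = \mathrm{F}^{-1}\left[\mathrm{e}^{t\hat{\mathrm{L}}}\hat{w}(t, \xi)\right]$$
where $\hat{w}$ solves
$$\frac{\partial}{\partial t}\hat{w}(t, \xi) = \mathrm{e}^{-t\hat{\mathrm{L}}}\,\mathrm{F}\left[\mathrm{N}\left[\mathrm{F}^{-1}\left[\mathrm{e}^{t\hat{\mathrm{L}}}\hat{w}(t, \xi)\right]\right]\right] \tag{16.6}$$
with initial conditions $\hat{w}(0, \xi) = \mathrm{e}^{0\cdot\hat{\mathrm{L}}}\,\mathrm{F}[u(0, x)] = \mathrm{F}[u(0, x)]$.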
We can break the time interval into several increments 𝛥𝑡 = 𝑡 𝑛+1 − 𝑡 𝑛 and
implement integrating factors over each interval. In this case, we have for each
interval starting at 𝑡 𝑛 and ending at 𝑡 𝑛+1 the following set-solve-set procedure
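set $\hat{w}(t_n, \xi) = \mathrm{F}[u(t_n, x)]$
solve $\dfrac{\partial}{\partial t}\hat{w}(t, \xi) = \mathrm{e}^{-\Delta t\hat{\mathrm{L}}}\,\mathrm{F}\left[\mathrm{N}\left[\mathrm{F}^{-1}\left[\mathrm{e}^{\Delta t\hat{\mathrm{L}}}\hat{w}(t, \xi)\right]\right]\right]$
set $u(t_{n+1}, x) = \mathrm{F}^{-1}\left[\mathrm{e}^{\Delta t\hat{\mathrm{L}}}\hat{w}(t_{n+1}, \xi)\right]$.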
It isn’t necessary to switch back to the variable 𝑢(𝑡, 𝑥) with each iteration. Instead,
we can solve the problem in the variable 𝑢(𝑡, ˆ 𝜉) and only switch back when we
want to observe the solution.
The benefit of such integrating factors is that there is no splitting error. But,
integrating factors only lessen the effect of stiffness. Cox and Matthews developed
a class of exponential time differencing Runge–Kutta (ETDRK) methods that
solve (16.5) using matrix exponentials
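$$u(t_{n+1}) = \mathrm{e}^{\Delta t\hat{\mathrm{L}}}u(t_n) + \int_0^{\Delta t} \mathrm{e}^{(\Delta t - s)\hat{\mathrm{L}}}\,\hat{\mathrm{N}}u(t_n + s)\,\mathrm{d}s$$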
IMEX methods
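can be written as
$$\hat{u}_t = (-\xi^4 + \xi^2)\hat{u} - \tfrac{1}{2}\mathrm{i}\xi\,\mathrm{F}\left[\left(\mathrm{F}^{-1}\hat{u}\right)^2\right].$$
Let’s solve the KSE using an initial condition $u(0, x) = \mathrm{e}^{-x^2}$ over a periodic
domain [−50, 50] and a timespan of 200.²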
using FFTW, DifferentialEquations, SciMLOperators
ℓ = 100; T = 200.0; n = 512
x = LinRange(-ℓ/2,ℓ/2,n+1)[2:end]
2 In moving from Julia 1.7 to 1.9 and upgrading several packages, this code no longer works. :(
u0 = exp.(-x.^2)
iξ = im*[0:(n÷2);(-n÷2+1):-1]*2π/ℓ
F = plan_fft(u0)
L = DiagonalOperator(-iξ.^2-iξ.^4)
N = (u,_,_) -> -0.5iξ.*(F*((F\u).^2))
problem = SplitODEProblem(L,N,F*u0,(0,T))
method = KenCarp3(linsolve=KrylovJL(),autodiff=false)
solution = solve(problem,method,reltol=1e-12)
u = t -> real(F\solution(t))
We can visualize the time evolution of the solution as a grayscale image with
time along the horizontal axis and space along the vertical axis.
using ImageShow,ColorSchemes
s = hcat(u.(LinRange(0,T,500))...)
Black pixels depict magnitudes close to the maximum, and white pixels depict
values close to zero. See the figure above and the animation at the QR code at
the bottom of this page.
Incompressible Navier–Stokes equation 471
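$$\frac{\partial\mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} = -\nabla p + \varepsilon\Delta\mathbf{u} \tag{16.8}$$
$$\nabla\cdot\mathbf{u} = 0, \tag{16.9}$$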
which models viscous fluid flow. The vector u = (𝑢(𝑡, 𝑥, 𝑦), 𝑣(𝑡, 𝑥, 𝑦)) is the
velocity, 𝑝(𝑡, 𝑥, 𝑦) is the pressure, and 𝜀 is the inverse of the Reynolds number.
Explicitly, the two-dimensional Navier–Stokes equation says
$$u_t + uu_x + vu_y = -p_x + \varepsilon(u_{xx} + u_{yy})$$
$$v_t + uv_x + vv_y = -p_y + \varepsilon(v_{xx} + v_{yy})$$
$$u_x + v_y = 0.$$
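$$\frac{\mathbf{u}^* - \mathbf{u}^n}{\Delta t} = -\tfrac{1}{2}\nabla p^n - \tfrac{3}{2}\mathbf{H}^n + \tfrac{1}{2}\mathbf{H}^{n-1} + \tfrac{1}{2}\varepsilon(\Delta\mathbf{u}^* + \Delta\mathbf{u}^n) \tag{16.10a}$$
$$\frac{\mathbf{u}^{n+1} - \mathbf{u}^*}{\Delta t} = -\tfrac{1}{2}\nabla p^{n+1} + \tfrac{1}{2}\varepsilon(\Delta\mathbf{u}^{n+1} - \Delta\mathbf{u}^*). \tag{16.10b}$$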
We’ll be able to use the intermediate solution so that we don’t explicitly need
to compute the pressure 𝑝. First, solve (16.10a). Formally, we have
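$$\mathrm{M}_-\mathbf{u}^* = -\tfrac{1}{2}\nabla p^n - \tfrac{3}{2}\mathbf{H}^n + \tfrac{1}{2}\mathbf{H}^{n-1} + \mathrm{M}_+\mathbf{u}^n, \tag{16.11}$$
$$-\tfrac{1}{2}\nabla p^n = \mathrm{M}_-\left(\mathbf{u}^n - \mathbf{u}^{*-1}\right), \tag{16.12}$$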
where u∗−1 is the intermediate solution at the previous time step. Substituting
− 12 ∇𝑝 𝑛 back into (16.11) yields
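$$\mathbf{u}^* = \mathrm{M}_-^{-1}\left[\mathrm{M}_-\left(\mathbf{u}^n - \mathbf{u}^{*-1}\right) - \tfrac{3}{2}\mathbf{H}^n + \tfrac{1}{2}\mathbf{H}^{n-1} + \mathrm{M}_+\mathbf{u}^n\right] = \mathbf{u}^n - \mathbf{u}^{*-1} + \mathrm{M}_-^{-1}\left[-\tfrac{3}{2}\mathbf{H}^n + \tfrac{1}{2}\mathbf{H}^{n-1} + \mathrm{M}_+\mathbf{u}^n\right].$$
$$\mathbf{u}^{n+1} = \mathbf{u}^* - \nabla\Delta^{-1}\nabla\cdot\mathbf{u}^*,$$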
We now have
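$$\mathbf{u}^* = \mathbf{u}^n - \mathbf{u}^{*-1} + \mathrm{M}_-^{-1}\left[-\tfrac{3}{2}\mathbf{H}^n + \tfrac{1}{2}\mathbf{H}^{n-1} + \mathrm{M}_+\mathbf{u}^n\right]$$
$$\mathbf{u}^{n+1} = \mathbf{u}^* - \nabla\Delta^{-1}\nabla\cdot\mathbf{u}^*.$$
$$\mathbf{u}^{-1} = \mathbf{u}^{*-1} = \mathbf{u}^0.$$
$$\hat{\mathrm{M}}_- = \frac{1}{\Delta t} + \tfrac{1}{2}\varepsilon|\boldsymbol{\xi}|^2 \quad\text{and}\quad \hat{\mathrm{M}}_+ = \frac{1}{\Delta t} - \tfrac{1}{2}\varepsilon|\boldsymbol{\xi}|^2.$$
$$H(u, v) = uu_x + vu_y,$$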
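The projection $\mathbf{u}^{n+1} = \mathbf{u}^* - \nabla\Delta^{-1}\nabla\cdot\mathbf{u}^*$ is simple to apply in Fourier space, where $\nabla$ becomes multiplication by $\mathrm{i}\boldsymbol{\xi}$ and $\Delta$ becomes multiplication by $-|\boldsymbol{\xi}|^2$. The sketch below works directly on arrays of Fourier coefficients, so no transform is needed, and checks that the projected field is divergence-free. The variable names and the random test field are assumptions for illustration; the zero frequency is treated separately because the symbol of the Laplacian vanishes there.

```julia
n = 16; ℓ = 2
ξ = [0:(n-1)÷2; -n÷2:-1]*(2π/ℓ)
iξ1 = im*ξ                      # derivative in x, varying along the first index
iξ2 = im*ξ'                     # derivative in y, varying along the second index
lap = iξ1.^2 .+ iξ2.^2          # Fourier symbol of the Laplacian, -|ξ|²
lap[1,1] = 1                    # avoid 0/0; the divergence is zero there anyway

# apply u ← u - ∇Δ⁻¹∇·u to the Fourier coefficients (uhat, vhat)
function project(uhat, vhat)
    ϕ = (iξ1.*uhat .+ iξ2.*vhat) ./ lap
    uhat .- iξ1.*ϕ, vhat .- iξ2.*ϕ
end

uhat, vhat = randn(ComplexF64, n, n), randn(ComplexF64, n, n)
p, q = project(uhat, vhat)
divergence = maximum(abs.(iξ1.*p .+ iξ2.*q))
```

Applying the projection a second time leaves the field unchanged, as expected of a projection.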
To visualize the solution, we use a tracer. Imagine dropping ink (or smoke)
into a moving fluid and following the ink’s motion. Numerically, we can simulate
a tracer by solving the advection equation
$$\frac{\partial Q}{\partial t} + \mathbf{u}\cdot\nabla Q = 0 \tag{16.15}$$
using the Lax–Wendroff method, where u is the solution to the Navier–Stokes
equation at each time step.
Consider a stratified fluid moving in the 𝑥-direction separated from a stationary
fluid by a narrow interface. Suppose we further modulate the speed of the fluid
in the flow direction. Think of wind blowing over water. The pressure decreases
in regions where the moving fluid moves faster, pulling on the stationary fluid
and causing it to bulge. This situation leads to Kelvin–Helmholtz instability,
producing surface waves in water and turbulence in otherwise laminar fluids.
Kelvin–Helmholtz instability can be seen in the cloud bands of Jupiter, the sun’s
corona, and billow clouds here on Earth. Some think that the Kelvin–Helmholtz
instability in billow clouds may have inspired the swirls in van Gogh’s painting
“The Starry Night.” See the figure on the next page. To model an initial stratified
flow in a 2 × 2 square, we will take zero vertical velocity 𝑣 and
$$u = \tfrac{1}{4}(2 + \sin 2\pi x)\,Q(x, y) \quad\text{where}\quad Q(x, y) = \tfrac{1}{2} + \tfrac{1}{2}\tanh(10 - 20|y - 1|).$$
The Julia code to solve the Navier–Stokes equation has three parts: defining
the functions, initializing the variables, and iterating over time. We start by
defining functions for Ĥ and for the flux used in the Lax–Wendroff method.
Δ◦(Q,step=1) = Q - circshift(Q,(step,0))
flux(Q,c) = c.*Δ◦(Q) - 0.5c.*(1 .- c).*(Δ◦(Q,1)+Δ◦(Q,-1))
H = (u,v,iξ1,iξ2) -> F*((F\u).*(F\(iξ1.*u)) + (F\v).*(F\(iξ2.*u)))
Exercises 475
16.4 Exercises
16.1. Solve the Schrödinger equation discussed on page 467 using Strang splitting
and using integrating factors. Take a harmonic potential $V(x) = x^2$ with initial
conditions $u(0, x) = \mathrm{e}^{-(x-3)^2/\varepsilon}$ with $\varepsilon = 0.1$ over a domain [−8, 8].
16.2. Suppose you use a fourth-order Runge–Kutta scheme to solve the Burgers
equation $u_t + \left(\tfrac{1}{2}u^2\right)_x = 0$ discretized with a Fourier spectral method. What order
can you expect? What else can you say about the numerical solution?
16.3. The Korteweg–de Vries (KdV) equation
$$u_t + 3(u^2)_x + u_{xxx} = 0 \tag{16.16}$$
for 𝜀 = 1 over a square with sides of length 100. Assume periodic boundary
conditions with 256 grid points in 𝑥 and 𝑦. Choose 𝑢(0, 𝑥, 𝑦) randomly from the
uniform distribution [−1, 1]. Output the solution at time 𝑡 = 100 and several
intermediate times.
16.5. Sudden phase separation can set in if a homogeneous molten alloy of two
metals is rapidly cooled.4 The Cahn–Hilliard equation
$$\frac{\partial\phi}{\partial t} = \Delta\left(f'(\phi) - \varepsilon^2\Delta\phi\right) \quad\text{with}\quad f'(\phi) = \phi^3 - \phi$$
models such phase separation. The dynamics of the Cahn–Hilliard equation
are driven by surface tension flow, in which interfaces move with velocities
proportional to mean curvature, and the total volumes of the binary regions are
constant over time. A convexity-splitting method
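$$\frac{\partial\phi}{\partial t} = \mathrm{L}\,\phi + \mathrm{N}\,\phi = \left(-\varepsilon^2\Delta^2 + \alpha\Delta\right)\phi^{(n+1)} + \Delta\left(f'(\phi^{(n)}) - \alpha\phi^{(n)}\right)$$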
balances the terms to ensure energy stability—see Wu et al. [2014] or Lee et al.
[2014]. Solve the two-dimensional Cahn–Hilliard equation with 𝜀 = 0.1, starting
with random initial conditions over a square periodic domain of length 8 over a
time [0, 8]. Take $\alpha = 1$.
16.6. The two-dimensional Kuramoto–Sivashinsky equation can be written as
$$\frac{\partial v}{\partial t} + \Delta v + \Delta^2 v + \tfrac{1}{2}|\nabla v|^2 = 0,$$
where $\mathbf{u} = \nabla v$ is the two-dimensional vector-field extension of the variable $u$ in
the one-dimensional KSE (16.7). To solve this equation, we’ll further impose a
mean-zero restriction $\int_\Omega |\nabla v|^2\,\mathrm{d}\Omega = 0$. Solve the two-dimensional KSE using a
Gaussian bump or normal random initial conditions over a periodic domain of
length 50 and a timespan $t = 150$.
We can enforce the mean-zero restriction by zeroing out the zero-frequency
component in the Fourier domain. Kalogirou [2014] develops a second-order
IMEX BDF method using the splitting
$$\frac{\partial v}{\partial t} = \mathrm{L}\,v + \mathrm{N}\,v = -(\Delta^2 + \Delta + \alpha)v - \left(\tfrac{1}{2}|\nabla v|^2 - \alpha v\right),$$
where $\alpha$ is a positive splitting-stabilization constant added to ensure that the operator
on the implicit term is positive-definite and thus invertible.
4 Alternatively, imagine vigorously shaking a bottle of vinaigrette and then watching the oil and
vinegar separate, first into tiny globules and then coarsening into larger and larger ones.
Back Matter
Appendix A
1.4. The number of invertible (0,1)-matrices has been the subject of some research
and is an as-yet unsolved problem. There is an entry in the Online Encyclopedia of
Integer Sequences (often called Sloane in reference to its curator Neil Sloane)
at http://oeis.org/A055165. We can estimate the number by generating random
480 Solutions
(0,1)-matrices and testing them for singularity. The following Julia code generates
the accompanying plot:
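As a minimal sketch of the sampling idea (not the listing that produced the plot), the following estimates the fraction of nonsingular $n \times n$ (0,1)-matrices by Monte Carlo. It assumes small $n$, so that the floating-point determinant of an integer matrix is accurate enough to distinguish zero from a nonzero integer; the function name, the sample size, and the fixed random seed are arbitrary choices made here.

```julia
using LinearAlgebra, Random

# Fraction of N random n×n (0,1)-matrices that are nonsingular. The determinant
# of an integer matrix is an integer, so |det| > 1/2 means nonsingular.
function fraction_invertible(n, N=10_000; rng=MersenneTwister(1))
    count(_ -> abs(det(rand(rng, 0:1, n, n))) > 0.5, 1:N) / N
end

estimates = [(n, fraction_invertible(n)) for n in 1:8]
```

Multiplying each fraction by $2^{n^2}$ gives an estimate of the number of invertible matrices. For $n = 2$, six of the sixteen matrices are invertible, so the estimate should be close to 0.375.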
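$$p_3(\lambda) = \begin{vmatrix} -\lambda & 1 & 0 \\ 1 & -\lambda & 1 \\ 0 & 1 & -\lambda \end{vmatrix} = -\lambda\begin{vmatrix} -\lambda & 1 \\ 1 & -\lambda \end{vmatrix} - \begin{vmatrix} 1 & 1 \\ 0 & -\lambda \end{vmatrix},$$
$$\rho(\mathbf{D}_n + 2\mathbf{I}) = \max_{i=1,\dots,n}\left|-2\cos\frac{i\pi}{n+1}\right| = 2\cos\frac{\pi}{n+1}.$$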
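(c) The 2-norm condition number for a symmetric matrix is the ratio of the
largest eigenvalue to the smallest eigenvalue in magnitude.
$$\kappa_2(\mathbf{D}_n) = \frac{1 + \cos\frac{\pi}{n+1}}{1 - \cos\frac{\pi}{n+1}} \approx \frac{2 - \frac{1}{2}\left(\frac{\pi}{n+1}\right)^2}{\frac{1}{2}\left(\frac{\pi}{n+1}\right)^2} \approx \frac{4}{\pi^2}(n+1)^2$$
when $n$ is large. So the condition number of $\mathbf{D}_n$ is $O\left(n^2\right)$.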
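We can compare the estimate with a computed condition number. The sketch below assumes that $\mathbf{D}_n$ is the tridiagonal matrix with −2 on the diagonal and 1 on the off-diagonals, which is consistent with the eigenvalues used above; the helper names D and κ are chosen for illustration.

```julia
using LinearAlgebra

# Assumed form of Dₙ: tridiagonal with -2 on the diagonal and 1 off the diagonal
D(n) = SymTridiagonal(fill(-2.0, n), fill(1.0, n-1))
κ(n) = cond(Matrix(D(n)))

comparison = [(n, κ(n), 4(n+1)^2/π^2) for n in (10, 100, 1000)]
```

The agreement improves as $n$ grows because the estimate drops terms that stay bounded while the condition number itself grows quadratically.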
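(e) The trace of a matrix equals the sum of its diagonal elements. The $i$th
diagonal element of $\mathbf{A}^{\mathsf{T}}\mathbf{A}$ equals $\sum_{j=1}^n a_{ji}^2$. So $\operatorname{tr}(\mathbf{A}^{\mathsf{T}}\mathbf{A}) = \sum_{i,j=1}^n a_{ij}^2$, which
equals $\|\mathbf{A}\|_F^2$.
(d) Suppose that $\mathbf{A}$ has the singular value decomposition $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf{T}}$. Then
using the result of (c),
$$\|\mathbf{A}\|_F^2 = \|\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf{T}}\|_F^2 = \|\boldsymbol{\Sigma}\mathbf{V}^{\mathsf{T}}\|_F^2 = \|\mathbf{V}\boldsymbol{\Sigma}^{\mathsf{T}}\|_F^2 = \|\boldsymbol{\Sigma}^{\mathsf{T}}\|_F^2 = \sum_{i}\sigma_i^2.$$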
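Both identities are easy to confirm on a random matrix. In this short check, norm applied to a matrix is the Frobenius norm, and the matrix size is arbitrary.

```julia
using LinearAlgebra

A = randn(5, 3)
frobenius = (norm(A)^2, tr(A'A), sum(svdvals(A).^2))   # three equal values
```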
Let B = (I − 𝑧A). We can use Cramer’s rule to compute B−1 one entry at a time:
$$w_{ij}(z) = \frac{(-1)^{i+j}\det\mathbf{M}_{ij}}{\det\mathbf{B}},$$
where M𝑖 𝑗 is the 𝑖 𝑗-minor of B, obtained by removing the 𝑖th row and 𝑗th column.
The function 𝑤 𝑖 𝑗 (𝑧) is the generating function for walks from 𝑖 → 𝑗. Let’s
determine 𝑤 13 (𝑧).
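$$\det\mathbf{B} = \begin{vmatrix} 1 & -z & 0 & -z \\ -z & 1 & -2z & -z \\ 0 & -2z & 1 & 0 \\ -z & -z & 0 & 1 \end{vmatrix} = -(-2z)\begin{vmatrix} 1 & -z & -z \\ 0 & -2z & 0 \\ -z & -z & 1 \end{vmatrix} + \begin{vmatrix} 1 & -z & -z \\ -z & 1 & -z \\ -z & -z & 1 \end{vmatrix}$$
$$= 2z(-2z + 2z^3) + (1 - 2z^3 - 3z^2) = 4z^4 - 2z^3 - 7z^2 + 1,$$
and
$$\det\mathbf{M}_{13} = \begin{vmatrix} -z & 0 & -z \\ 1 & -2z & -z \\ -z & 0 & 1 \end{vmatrix} = 2z^3 + 2z^2.$$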
Linear algebra 483
$$w_{13}(z) = \frac{2z^3 + 2z^2}{4z^4 - 2z^3 - 7z^2 + 1} = \frac{2z^2}{4z^3 - 6z^2 - z + 1}.$$
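We can check the generating function by comparing its power series with walk counts from powers of the adjacency matrix. In the sketch below, the matrix A is read off from $\mathbf{B} = \mathbf{I} - z\mathbf{A}$ in the determinant above, and the series coefficients of $2z^2/(1 - z - 6z^2 + 4z^3)$ come from the recurrence $w_k = w_{k-1} + 6w_{k-2} - 4w_{k-3}$, with an extra 2 at $k = 2$, obtained by multiplying through by the denominator. The function name series is hypothetical.

```julia
using LinearAlgebra

# Adjacency matrix read off from B = I - zA
A = [0 1 0 1; 1 0 2 1; 0 2 0 0; 1 1 0 0]

# Power-series coefficients of 2z²/(1 - z - 6z² + 4z³); w[k+1] multiplies zᵏ
function series(K)
    w = zeros(Int, K+1)
    for k in 0:K
        w[k+1] = (k == 2 ? 2 : 0) +
                 (k ≥ 1 ? w[k] : 0) + (k ≥ 2 ? 6w[k-1] : 0) - (k ≥ 3 ? 4w[k-2] : 0)
    end
    w
end

walks = [(A^k)[1,3] for k in 0:10]     # number of walks of length k from 1 to 3
matches = series(10) == walks
```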
$$\det\mathbf{A} = \det\mathbf{P}\,\det\mathbf{L}\,\det\mathbf{U} = \operatorname{sign}(\mathbf{P})\prod_i u_{ii}.$$
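The formula is straightforward to try out with the LU factorization in the LinearAlgebra standard library, which returns the row permutation as a vector. The one ingredient not provided directly is the sign of the permutation; the helper permsign below, a hypothetical name, computes it by counting cycles of even length.

```julia
using LinearAlgebra

# Sign of a permutation vector: each cycle of even length flips the sign
function permsign(p)
    seen = falses(length(p)); s = 1
    for i in eachindex(p)
        seen[i] && continue
        j = i; len = 0
        while !seen[j]
            seen[j] = true; j = p[j]; len += 1
        end
        iseven(len) && (s = -s)
    end
    s
end

A = randn(6, 6)
F = lu(A)                                  # F.L*F.U equals A[F.p,:]
d = permsign(F.p)*prod(diag(F.U))          # compare with det(A)
```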
2.5. Let’s fill in the steps of the Cuthill–McKee algorithm outlined on page 51.
Start with a (0,1)-adjacency matrix A, and create a list 𝑟 of vertices ordered from
lowest to highest degree. Create another initially empty list 𝑞 that will serve as
a temporary first-in-first-out queue for the vertices before finally adding them
to the permutation list 𝑝, also initially empty. Start by moving the first element
from 𝑟 to 𝑞. As long as the queue 𝑞 is not empty, move the first element of 𝑞 to
the end of 𝑝, and move any vertex in 𝑟 adjacent to this element to the end of 𝑞. If
𝑞 is ever empty, go back and move the first element of 𝑟 to 𝑞. Continue until all
the elements from 𝑟 have been moved through 𝑞 to 𝑝. Finally, we reverse the
order of 𝑝. The following Julia function produces a permutation vector p for a
sparse matrix A using the reverse Cuthill–McKee algorithm.
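function rcuthillmckee(A)
  r = sortperm(vec(sum(A.!=0,dims=2)))  # vertices from lowest to highest degree
  p = Int64[]
  while ~isempty(r)
    q = [popfirst!(r)]
    while ~isempty(q)
      q1 = popfirst!(q)
      append!(p,q1)
      k = findall(!iszero, A[q1,r])
      append!(q,splice!(r,k))           # move the adjacent vertices from r to q
    end
  end
  reverse(p)
end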
2.6. Let’s plot the graph of the Doubtful Sound dolphins reusing the code on
page 49. The figure on the facing page shows the sparsity plots and graphs before
and after Cuthill–McKee resorting.
Figure A.1: Sparsity plots and graph drawings of the dolphins of Doubtful Sound.
p = rcuthillmckee(A)
2.8. Loading the CSV file as a dataframe will allow us to inspect and interpret
the data.
using CSV, DataFrames
diet = DataFrame(CSV.File(download(bucket*"diet.csv")))
We can solve the diet problem as a dual LP problem: “Find the minimum of
the objective function 𝑧 = bT y subject to constraint AT y ≥ c and non-negativity
restriction y, c ≥ 0.” The nutritional value for each of the commodities is given
by A, the nutrient minimums are given by b, and because the nutritional values
of foods are normalized per dollar of expenditure, c is a vector of ones.
A = Array(diet[2:end,4:end])'
c = ones(size(A,2))
b = Array(diet[1,4:end])
food = diet[2:end,1]
solution = simplex(b,A',c)
print("foods: ", food[solution.y .!= 0], "\n")
print("daily cost: ", solution.z)
We can alternatively use JuMP.jl along with GLPK.jl. The JuMP function value,
which returns the solution’s values, has a name common to other libraries. We’ll
explicitly call it using JuMP.value to avoid confusion or conflicts.
using JuMP, GLPK
model = Model(GLPK.Optimizer)
@variable(model, x[1:size(A,2)] ≥ 0)
@objective(model, Min, c' * x)
@constraint(model, A * x .≥ b)
optimize!(model)
print("foods: ", food[JuMP.value.(x) .!= 0], "\n")
print("daily cost: ", objective_value(model))
2.9. Let’s write a breadth-first search algorithm; we can modify the code for the
Cuthill–McKee algorithm in exercise 2.5. We’ll keep track of three arrays: an
array q that records the unique indices of the visited nodes and their neighboring
nodes, an array p of pointers to the parent nodes recorded in this first array, and an
array r of nodes that haven’t yet been added to array q. Start with a source node
a and find all nearest neighbors using the adjacency matrix A restricted to rows r.
Then with each iteration, march along q looking for new neighbors until we’ve
either found the destination node b or run out of new nodes to visit, indicating no
path. Once we’ve found b, we backtrack along p to determine the path.
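function findpath(A,a,b)
  r = collect(1:size(A,2))
  q = [a]; p = [-1]; i = 0
  deleteat!(r,a)                        # the source node has already been visited
  while i<length(q)
    k = findall(!iszero, A[q[i+=1],r])
    any(r[k].==b) && return(append!(q[backtrack(p,i)],b))
    append!(q,splice!(r,k))             # record the newly visited neighbors
    append!(p,fill(i,length(k)))        # and point each one to its parent
  end
  display("No path.")
end
function backtrack(p,i)
  s = Int[]; while (i!=-1); append!(s,i); i = p[i]; end
  return(reverse(s))
end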
Let’s import the data and build an adjacency matrix. We’ll use the function
get_adjacency_matrix defined on page 49 to construct the adjacency matrix, and
we’ll use get_names to build a list of the names.
get_names(filename) = readdlm(download(bucket*filename*".txt"),'\n')
actors = get_names("actors")
movies = get_names("movies")
B = get_adjacency_matrix("actor-movie");
Using A will tell us both the actors and their connecting movies.
(m,n) = size(B)
A = [spzeros(n,n) B' ; B spzeros(m,m)];
actormovie = [ actors ; movies ];
Let’s find the link between actors Emma Watson and Kevin Bacon.
Emma Watson appeared in Harry Potter and the Chamber of Secrets with John
Cleese, who appeared in The Big Picture with Kevin Bacon. Let’s try another:
Bruce Lee appeared in Enter the Dragon with Jackie Chan, who appeared in
Rush Hour with Barry Shabaka Henley, who was in Patch Adams with Greg
Sestero, who appeared in The Room with the great Tommy Wiseau. One might
wonder which actor is the center of Hollywood. We’ll come back to answer this
question in exercise 4.6.
d = 0.5; A = sprand(800,600,d)
Figure A.2: Solutions to Exercise 3.5. The left plot shows solutions using the
Vandermonde matrix built using the original 𝑥-data and the right plot shows
solutions using the Vandermonde matrix built using the mean-centered 𝑥-data.
3.4. Take X to be a grayscale image (with pixel intensity between 0 and 1), A and B
to be Gaussian blurring matrices with standard deviations of 20 pixels, and N
to be a matrix of random values from the uniform distribution over the interval
(0, 0.01).
We’ll compare three deblurring methods: inverse, Tikhonov regularization, and the
pseudoinverse. We can find a good value for the regularization parameter 𝛼 = 0.05
with some trial and error.
α = 0.05
X1 = A\Y/B
X2 = (A'*A+α^2*I)\A'*Y*B'/(B*B'+α^2*I)
X3 = pinv(A,α)*Y*pinv(B,α)
Gray.([X Y X1 X2 X3])
vandermonde(t,n) = vec(t).^(0:n-1)'
build_poly(c,X) = vandermonde(X,length(c))*c
Now, we’ll make a function that determines the coefficients and residuals using
three different methods. The command Matrix(Q) returns the thin version of the
QRCompactWYQ object returned by the qr function.
function solve_filip(x,y,n)
  V = vandermonde(x, n)
  c = Array{Float64}(undef, 3, n)
  c[1,:] = (V'*V)\(V'*y)          # normal equation
  Q,R = qr(V)
  c[2,:] = R\(Matrix(Q)'*y)       # QR decomposition
  c[3,:] = pinv(V,1e-9)*y         # pseudoinverse
  r = [norm(V*c[i,:]-y) for i in 1:3]
  return(c,r)
end
Let’s download the NIST Filip dataset, solve the problem, and plot the solutions:
using DelimitedFiles
data = readdlm(download(bucket*"filip.csv"),',',Float64)
coef = readdlm(download(bucket*"filip-coeffs.csv"),',')
(x,y) = (data[:,2],data[:,1])
β,r = solve_filip(x, y, 11)
X = LinRange(minimum(x),maximum(x),200)
Y = [build_poly(β[i,:],X) for i in 1:3]
plot(X,Y); scatter!(x,y,opacity=0.5)
[coef β']
Let’s also solve the problem and plot the solutions for the standardized data:
using Statistics
zscore(X,x=X) = (X .- mean(x))/std(x)
c,r = solve_filip(zscore(x), zscore(y), 11)
Y = [build_poly(c[i,:],zscore(X,x))*std(y).+mean(y) for i in 1:3]
plot(X,Y); scatter!(x,y,opacity=0.5)
The 2-condition number of the Vandermonde matrix is $1.7 \times 10^{15}$, which makes
it very ill-conditioned. The residuals are $\|\mathbf{r}\|_2 = 0.97998$ using the normal
equation and $\|\mathbf{r}\|_2 = 0.028211$ using QR decomposition, which matches the
residual for the coefficients provided by NIST. The relative errors are roughly 11
and $4 \times 10^{-7}$, respectively. The normal equation solution clearly fails, which is
evident in Figure A.2 on the preceding page.
Mean-centering the $x$-data reduces the 2-condition number of the new
Vandermonde matrix to $1.3 \times 10^{5}$. Further scaling the $x$-data between −1 and 1
reduces the 2-condition number to $\kappa_2 = 4610$. So the matrix is well-conditioned
enough that even the normal equation method works. The residuals of both methods are
$\|\mathbf{r}\|_2 = 0.028211$, and both methods perform well.
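The effect of centering and scaling on the condition number can be reproduced without the dataset. The sketch below uses synthetic, equally spaced abscissas over roughly the same interval as the Filip $x$-data. These are not the NIST values, so the condition numbers will differ from those quoted above, but the ordering should be the same.

```julia
using LinearAlgebra, Statistics

vandermonde(t, n) = vec(t).^(0:n-1)'

x = collect(LinRange(-8.8, -3.1, 82))      # synthetic abscissas, not the NIST data
n = 11
xc = x .- mean(x)                          # mean-centered
xs = xc/maximum(abs.(xc))                  # further scaled into [-1, 1]
κ_raw, κ_centered, κ_scaled = (cond(vandermonde(t, n)) for t in (x, xc, xs))
```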
3.6. Let’s start with the model 𝑢(𝑡) = 𝑎 1 sin(𝑎 2 𝑡 + 𝑎 3 ) + 𝑎 4 . This formulation is
nonlinear, but we can work with it to make it linear. First, we know that the
period is one year. Second, we can write sin(𝑥 + 𝑦) as sin 𝑥 cos 𝑦 + sin 𝑦 cos 𝑥.
In doing so, we have a linear model
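A minimal sketch of the linearized fit follows, using synthetic noise-free data in place of the exercise’s dataset. It assumes time is measured in years, so the known period fixes $a_2 = 2\pi$, and it writes the model as $c_1\sin 2\pi t + c_2\cos 2\pi t + c_3$ with $c_1 = a_1\cos a_3$, $c_2 = a_1\sin a_3$, and $c_3 = a_4$. The parameter values are arbitrary.

```julia
# Synthetic data from u(t) = a₁ sin(2πt + a₃) + a₄ with t in years
a1, a3, a4 = 10.0, 0.7, 15.0
t = (0:364)/365
u = @. a1*sin(2π*t + a3) + a4

# Linear least squares for the coefficients of sin 2πt, cos 2πt, and 1
A = [sin.(2π*t) cos.(2π*t) one.(t)]
c = A\u

# Recover the amplitude, phase, and offset of the nonlinear model
amplitude, phase, offset = hypot(c[1], c[2]), atan(c[2], c[1]), c[3]
```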
3.7. The following Julia code reads the MNIST training data and computes the
economy SVD of the training set to get singular vectors:
                                  actual class
predicted class     0     1     2     3     4     5     6     7     8     9
       0          984     8     0     1     6     0     1     0     0     0
       1            0   983     2     1     5     0     0     3     6     0
       2           12     3   945    10     3     0     7     7    12     1
       3            2     1     9   952     0     8     0     5    19     4
       4            3     8     5     0   944     1     3     7     4    25
       5            2     3     0    46     3   922     7     1    14     2
       6            5     3     0     0     1     5   986     0     0     0
       7            4     7     6     0     4     1     0   933     5    40
       8            6    22     6    16     5    13     5     6   910    11
       9           11     4     5     8    12     3     0    20    12   925
Figure A.4: Confusion matrix for PCA identification of digits in MNIST dataset.
pix = -abs.(reshape(V[3+1],28,:))
rescale = scaleminmax(minimum(pix), maximum(pix))
pix .|> rescale .|> Gray
The image of any 3 in the training data is largely a linear combination of these
basis vectors. We can now predict the best digit associated with each test image
using these ten subspaces V. We’ll finally build a confusion matrix to check the
method’s accuracy. Each row of the confusion matrix represents the predicted
class, and each column represents the actual class. See Figure A.4. For example,
element (7, 9) is 40, meaning that 40 of the test images identified as a “7” were
actually a “9.” Overall, the method is about 95 percent accurate.
image_test, label_test = MNIST(split=:test)[:]
image_test = reshape(permutedims(image_test,[3,2,1]),10000,:)
r = [( q = image_test*V[i]*V[i]' .- image_test;
sum(q.^2,dims=2) ) for i∈1:10]
r = hcat(r...)
prediction = [argmin(r[i,:]) for i∈1:10000] .- 1
[sum(prediction[label_test.== i].== j) for j∈0:9, i∈0:9]
We’ll use the Arpack.jl package to compute the SVD of a sparse matrix, with a
suitable small dimensional latent space:
using Arpack
X,_ = svds(A*B,nsv=10)
U = X.U; Σ = Diagonal(X.S); VT = X.Vt
Q = VT ./sqrt.(sum(VT .^2,dims=1));
The ten actors most similar to Steve Martin are Steve Martin (of course), Lisa
Kudrow, Rob Reiner, Christine Baranski, Patrick Cranshaw, George Gaynes, Jay
Chandrasekhar, Alexandra Holden, Luke Wilson, and David Spade. Using the
following code, we determine that Steve Martin’s signature is comedy (40%),
drama (10%), romance (9%), crime (5%), romantic comedy (4%),. . . :
p = U*Σ*q
r = sortperm(-p)
[genres[r] p[r]/sum(p)]
Statistician George Box once quipped: “All models are wrong, but some
are useful.” How is this actor similarity model wrong? It doesn’t differentiate
the type or size of roles an actor has in a movie, e.g., a character
actor appearing in a serious film for comic relief. Actors change over their
careers and the data set is limited to the years 1972–2012. Classification of
movies into genres is simplistic and sometimes arbitrary. Perhaps using more
keywords would add more fidelity. The information was crowdsourced through
Wikipedia editors and contains errors. How does the dimension of the latent
space change the model? By choosing a large-dimensional latent space, we keep
much of the information about the movies and actors. But we are unable to make
generalizations and comparisons. Such a model is overfit. On the other hand,
choosing a small-dimensional space effectively removes the underrepresented
genres, and we are left with the main genres. Such a model is underfit.
3.10. To simplify the calculations, we’ll pick station one (although any station
would do) as a reference station and measure all other stations relative to it
(𝑥𝑖 , 𝑦 𝑖 , 𝑡𝑖 ) ← (𝑥 𝑖 , 𝑦 𝑖 , 𝑡𝑖 ) − (𝑥 1 , 𝑦 1 , 𝑡1 ). For this station, the reference circle is now
𝑥 2 + 𝑦 2 = (𝑐𝑡) 2 . We can remove the quadratic terms from the other equations
(𝑥 − 𝑥𝑖 ) 2 + (𝑦 − 𝑦 𝑖 ) 2 = (𝑐𝑡 − 𝑐𝑡 𝑖 ) 2 by subtracting the reference equation from
each of them to get a system of 𝑛 − 1 linear equations
2𝑥ᵢ𝑥 + 2𝑦ᵢ𝑦 − 2𝑐²𝑡ᵢ𝑡 = 𝑥ᵢ² + 𝑦ᵢ² − 𝑐²𝑡ᵢ².
Geometrically, this new equation represents the radical axis of the two circles—
the common chord if the circles intersect. The system in matrix form is Ax = b,
which we can solve using ordinary least squares or total least squares. We’ll use
the function tls defined on page 77.
X = [3 3 12; 1 15 14; 10 2 13; 12 15 14; 0 11 12]
reference = X[1,:]; X .-= reference'
A = [2 2 -2].*X; b = (X.^2)*[1; 1; -1]
xols = A\b + reference
xtls = tls(A,b) + reference
The solution is about (6.53, 8.80, 7.20). Both methods are within a percent of
one another.
3.11. We’ll start by forming two arrays: X for the (𝑥, 𝑦, 𝑧, 𝑡) data of the sensors
and xo for the (𝑥, 𝑦, 𝑧, ·) data of the signal source.
using CSV, DataFrames
df = DataFrame(CSV.File(download(bucket*"shotspotter.csv")))
X = Array(df[1:end-1,:]); xo = Array(df[end,:]);
We find that the error is approximately −3.8 m and 3.2 m horizontally and 54.6 m vertically.
[Figure: eigenvalues in the complex plane for 𝑛 = 4, 𝑛 = 16, and 𝑛 = 64.]
4.1. Let’s plot the eigenvalues of five thousand random 𝑛 × 𝑛 real matrices:
E = collect(Iterators.flatten([eigvals(randn(n,n)) for i∈1:2000]))
The figure above shows the distribution for 𝑛 = 4, 16, and 64. The distribution
follows Girko’s circular law, which says that as 𝑛 → ∞, eigenvalues are uniformly
distributed in a disk of radius √𝑛.
9 1
0 0
2 0
A =
5 0
1 2 0 1
3 0
0 −2
The row circles and the column circles are shaded differently, and the eigenvalues
lie in their intersections.
4.4. The following Julia code finds an eigenpair using Rayleigh iteration starting
with a random initial vector. We can compute all four eigenpairs by running
the code repeatedly for different initial guesses. The method converges rapidly,
and A − 𝜌I becomes poorly conditioned as 𝜌 approaches one of the eigenvalues.
So we break out of the iteration before committing the cardinal sin of using an
ill-conditioned operator.
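function rayleigh(A)
  x = randn(size(A,1),1)
  while true
    x = x/norm(x)
    ρ = (x'*A*x)[1]
    M = A - ρ*I
    # stop once the shifted matrix is nearly singular (the threshold is an assumption)
    cond(M,1) < 1e12 ? x = M\x : return(ρ,x)
  end
end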
2. One QR-cycle is
function implicitqr(A)
  n = size(A,1)
  tolerance = 1E-12
  H = Matrix(hessenberg(A).H)
  while true
    if abs(H[n,n-1])<tolerance
      if (n-=1)<2; return(diag(H)); end
    end
    Q = givens([H[1,1]-H[n,n];H[2,1]],1,2)[1]
    H[1:2,1:n] = Q*H[1:2,1:n]
    H[1:n,1:2] = H[1:n,1:2]*Q'
    for i = 2:n-1
      Q = givens([H[i,i-1];H[i+1,i-1]],1,2)[1]
      H[i:i+1,1:n] = Q*H[i:i+1,1:n]
      H[1:n,i:i+1] = H[1:n,i:i+1]*Q'
    end
  end
end
4.6. Let’s compute the eigenvector centrality of the actor-actor adjacency matrix.
We’ll start by importing the data and building the adjacency matrix using the
functions get_names and get_adjacency_matrix defined on page 486.
actors = get_names("actors")
B = get_adjacency_matrix("actor-movie")
M = (B'*B .!= 0) - I
v = ones(size(M,1))
for k = 1:8
  v = M*v; v /= norm(v);
end
r = sortperm(-v); actors[r][1:10]
Eigenvector centrality tells us those actors who appeared in movies with a lot
of other actors, who themselves appeared in a lot of movies with other actors.
Samuel L. Jackson, M. Emmet Walsh, Gene Hackman, Bruce Willis, Christopher
Walken, Robert De Niro, Steve Buscemi, Dan Hedaya, Whoopi Goldberg, and
Frank Welker are the ten most central actors. Arguably, Samuel L. Jackson is the
center of Hollywood.
findfirst(actors[r] .== "Kevin Bacon")
Kevin Bacon is the 92nd most connected actor in Hollywood. Let’s also look at
degree centrality, which ranks each node by its number of edges, i.e., the number
of actors with whom an actor has appeared.
actors[ -sum(M,dims=1)[:] |> sortperm ][1:10]
The ten most central actors are Frank Welker, Samuel L. Jackson, M. Emmet
Walsh, Whoopi Goldberg, Bruce Willis, Robert De Niro, Steve Buscemi, Gene
Hackman, Christopher Walken, and Dan Aykroyd. The noticeable change is that
voice actor Frank Welker moves from tenth to first position.
4.7. The following Julia code implements the randomized SVD algorithm:
function randomizedsvd(A,k)
  Z = rand(size(A,2),k);
  Q = Matrix(qr(A*Z).Q)
  for i in 1:3
    Q = Matrix(qr(A'*Q).Q)
    Q = Matrix(qr(A*Q).Q)
  end
  W = svd(Q'*A)
  return(Q*W.U,W.S,W.V)
end
using Images
img = load(download(bucket*"red-fox.jpg"))
A = Float64.(Gray.(img))
U,S,V = randomizedsvd(A,10)
Gray.([A U*Diagonal(S)*V'])
Let’s also compare the elapsed time for the randomized SVD, the sparse SVD in
Arpack.jl, and the full SVD in LinearAlgebra.jl.
using Arpack
@time randomizedsvd(A,10)
@time svds(A, nsv=10)
@time svd(A);
With rank 𝑘 = 10, the elapsed time for the randomized SVD is about 0.05 seconds
compared to 0.1 seconds for a sparse SVD and 3.4 seconds for a full SVD svd(A).
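We can compute the relative error using
$$\varepsilon_k = \frac{\sqrt{\sum_{i=k+1}^{n} \sigma_i^2}}{\sqrt{\sum_{i=1}^{n} \sigma_i^2}} = \frac{\sqrt{\sum_{i=k+1}^{n} \sigma_i^2}}{\|\mathbf{A}\|_F}$$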
ϵ = 1 .- sqrt.(cumsum(S.^2))/norm(A)
The relative error in the first ten singular values ranges from 2 percent down to
0.5 percent. The image on the left is the original image A, and the image on the
right is the projected image Gray.(U*Diagonal(S)*V').
using Arpack
m = 5000
A = [1/(1 + (i+j-1)*(i+j)/2 - j) for i∈1:m, j∈1:m]
5.1. The Jacobi method’s spectral radius was derived in exercise 1.5(b). We will
confirm the Gauss–Seidel method’s spectral radius using Julia.
n = 20; D = SymTridiagonal(-2ones(n),ones(n-1))
P = -diagm(0 => diag(D,0), -1 => diag(D,-1))
N = diagm(1 => diag(D,1))
eigvals(N,P)[end], cos(pi/(n+1))^2
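5.2. The Gauss–Seidel method converges if and only if the spectral radius of the
matrix (L + D)⁻¹U is strictly less than one. The eigenvalues of
$$(\mathbf{L} + \mathbf{D})^{-1}\mathbf{U} = \begin{bmatrix} 1 & 0 \\ -\sigma & 1 \end{bmatrix}^{-1} \begin{bmatrix} 0 & \sigma \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & \sigma \\ 0 & \sigma^2 \end{bmatrix}$$
are 0 and 𝜎². So, the Gauss–Seidel method converges when |𝜎| < 1.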
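The spectral radius can be checked numerically. The sketch below assumes the 2 × 2 system matrix with unit diagonal and off-diagonal entries 𝜎 and −𝜎 (an assumption read off the matrices above), and the right-hand side b is an arbitrary illustrative choice.

```julia
using LinearAlgebra

σ = 0.6
A = [1 σ; -σ 1]                          # assumed 2×2 system matrix
L, D, U = tril(A,-1), Diagonal(A), triu(A,1)

G = -(L + D) \ U                         # Gauss–Seidel iteration matrix
ρ = maximum(abs.(eigvals(G)))            # spectral radius, expected to equal σ^2

# Run the iteration itself on an arbitrary right-hand side
b = [1.0, 2.0]
x = zeros(2)
for k in 1:100
    global x = (L + D) \ (b - U*x)
end
```

With 𝜎 = 0.6, each sweep reduces the error by a factor of about 0.36, so one hundred sweeps are far more than needed.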
5.4. Let’s start by setting up the problem and defining the exact solution ue . We
can define kron as an infix operator using the Unicode character ⊗ within Julia to
improve readability.
using SparseArrays, LinearAlgebra
⊗(x,y) = kron(x,y); ϕ = x -> x-x^2
n = 50 ; x = (1:n)/(n+1); Δx = 1/(n+1)
J = sparse(I, n, n)
D = spdiagm(-1 => ones(n-1), 0 => -2ones(n), 1 => ones(n-1) )
A = ( D⊗J⊗J + J⊗D⊗J + J⊗J⊗D ) / Δx^2
f = [ϕ(x)*ϕ(y) + ϕ(x)*ϕ(z) + ϕ(y)*ϕ(z) for x∈x, y∈x, z∈x][:]
ue = [ϕ(x)*ϕ(y)*ϕ(z)/2 for x∈x, y∈x, z∈x][:];
P = (1-ω)*sparse(Diagonal(A)) + ω*sparse(LowerTriangular(A))
for i=1:n
u += P\(b-A*u)
append!(ϵ,norm(u - ue ,1))
Each method takes approximately two seconds to complete 400 iterations. The
figure on the next page shows the errors. Notice that the conjugate gradient
converges quite quickly relative to the stationary methods—the error is at machine
precision after about 150 iterations or about 0.7 seconds. For comparison, using
the conjugate gradient method cg from IterativeSolvers.jl takes roughly 0.4
seconds. We can compute the convergence rate by taking the slopes of the lines
k = 1:120; ([one.(k) k]\log10.(ϵ[k,:]))[2,:]
The slopes are 0.0008, 0.0016, 0.019, and 0.1. The inverse of these slopes tells us
the number of iterations required to gain one digit of accuracy—roughly 1200
iterations using a Jacobi method and 10 iterations using the conjugate gradient
method. Convergence of the Jacobi method is extremely slow, requiring around
[Figure: error as a function of iteration number (100 to 400) for the Jacobi through conjugate gradient methods.]
5.5. The (1, 1)-entry of A−1 is the first element of the solution x to Ax = 𝝃 for
𝝃 = (1, 0, 0, . . . , 0). We’ll use a preconditioned conjugate gradient method to
solve the problem.
using Primes, LinearAlgebra, SparseArrays, IterativeSolvers
n = 20000
d = 2 .^ (0:14); d = [-d;d]
P = Diagonal(primes(224737))
B = [ones(n - abs(d)) for d∈d]
A = sparse(P) + spdiagm(n ,n, (d .=> B)...)
b = zeros(n); b[1] = 1
cg(A, b; Pl=P)[1]
The computed solution 0.7250783 . . . agrees with the Sloane number A117237
to machine precision. Using the BenchmarkTools.jl function @btime, we find that
the conjugate gradient method takes 11 milliseconds on a typical laptop. If we
don’t specify the preconditioner, the solver takes 1500 times longer.
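6.1. Let 𝑚 = 𝑛/3. Then a DFT can be written as a sum of three DFTs
$$y_j = \sum_{k=0}^{n-1} \omega_n^{kj} c_k = \sum_{k=0}^{m-1} \omega_m^{kj} c'_k + \omega_n^{j} \sum_{k=0}^{m-1} \omega_m^{kj} c''_k + \omega_n^{2j} \sum_{k=0}^{m-1} \omega_m^{kj} c'''_k$$
with $c'_k = c_{3k}$, $c''_k = c_{3k+1}$, and $c'''_k = c_{3k+2}$. So, $y_j = y'_j + \omega_n^{j} y''_j + \omega_n^{2j} y'''_j$. By the periodicity of $y'_j$, $y''_j$, and $y'''_j$ in $j$ with period $m$ and by $\omega_n^{m} = \omega_3$, where $\omega_3 = e^{-2i\pi/3}$, $\omega_3^2 = e^{2i\pi/3}$, and $\omega_3^4 = e^{-2i\pi/3}$, we have the system
$$\begin{aligned}
y_j &= y'_j + \omega_n^{j} y''_j + \omega_n^{2j} y'''_j \\
y_{m+j} &= y'_j + \omega_3 \omega_n^{j} y''_j + \omega_3^2 \omega_n^{2j} y'''_j \\
y_{2m+j} &= y'_j + \omega_3^2 \omega_n^{j} y''_j + \omega_3^4 \omega_n^{2j} y'''_j.
\end{aligned}$$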
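function fftx3(c)
  n = length(c)
  ω = exp(-2im*π/n)
  if mod(n,3) == 0
    k = collect(0:n/3-1)
    # the second and third rows and the return values are assumed from the system above
    u = [transpose(fftx3(c[1:3:n-2]));
         transpose((ω.^k).*fftx3(c[2:3:n-1]));
         transpose((ω.^2k).*fftx3(c[3:3:n]))]
    F = exp(-2im*π/3).^([0;1;2]*[0;1;2]')
    return(reshape(transpose(F*u),:,1))
  else
    F = ω.^(collect(0:n-1)*collect(0:n-1)')
    return(F*c)
  end
end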
6.2. The following function takes two integers as strings, converts them to padded
arrays, computes a convolution, and then carries the digits:
using FFTW
function multiply(p_,q_)
p = [parse.(Int,split(reverse(p_),""));zeros(length(q_),1)]
q = [parse.(Int,split(reverse(q_),""));zeros(length(p_),1)]
pq = Int.(round.(real.(ifft(fft(p).*fft(q)))))
carry = pq .÷ 10
while any(carry.>0)
pq -= carry*10 - [0;carry[1:end-1]]
carry = pq .÷ 10
n = findlast(x->x>0, pq)
6.3. We’ll split the DCT into two series, one with even indices running forward
{0, 2, 4, . . . } and the other with odd indices running backward {. . . , 5, 3, 1}.
Doing so makes the DFT naturally pop out after we express the cosine as the
real part of the complex exponential. If we were computing a DST instead of a
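DCT, we would subtract the second series because sine is an odd function at the
$$\begin{aligned}
\hat f_\xi &= \sum_{k=0}^{n-1} f_k \cos\left(\frac{(k+\frac12)\xi\pi}{n}\right) \\
&= \sum_{k=0}^{n/2-1} f_{2k} \cos\left(\frac{(2k+\frac12)\xi\pi}{n}\right) + \sum_{k=n/2}^{n-1} f_{2(n-k)-1} \cos\left(\frac{(2n-2k-\frac12)\xi\pi}{n}\right) \\
&= \mathrm{Re}\left( e^{-i\xi\pi/2n} \sum_{k=0}^{n/2-1} f_{2k}\, e^{-i2k\xi\pi/n} + e^{-i\xi\pi/2n} \sum_{k=n/2}^{n-1} f_{2(n-k)-1}\, e^{-i2k\xi\pi/n} \right).
\end{aligned}$$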
Because cosine is an even function, it doesn’t matter which sign we choose as
long as we are consistent between the two series. We’ll choose negative so
that the DCT is in terms of a DFT rather than an inverse DFT. We then have
$\mathrm{DCT}(f_k) = \mathrm{Re}\big(e^{-i\xi\pi/2n} \cdot \mathrm{DFT}(f_{P(k)})\big)$, where P(𝑘) is a permutation of the
index 𝑘. For example, P({0, 1, 2, 3, 4, 5, 6, 7, 8}) is {0, 7, 2, 5, 4, 3, 6, 1, 8}. We
can build an inverse DCT by running the steps backward.
Let’s define the functions dctII and idctII. Note that the FFTW.jl functions
fft and ifft are multi-dimensional unless we specify a dimension.
function dctII(f)
  n = size(f,1)
  ω = exp.(-0.5im*π*(0:n-1)/n)
  return real(ω.*fft(f[[1:2:n; n-mod(n,2):-2:2],:],1))
end
function idctII(f)
  n = size(f,1)
  ω = exp.(-0.5im*π*(0:n-1)/n)
  f[1,:] = f[1,:]/2
  f[[1:2:n; n-mod(n,2):-2:2],:] = 2*real(ifft(f./ω,1))
  return f
end
Alternatively, we can use the FFTW.jl function dct, which includes the additional
y2 = dct(f)/sqrt(2/n); y2[1] *= sqrt(2)
6.4. The following function modifies the one on page 163, which removes
low-magnitude Fourier components, to instead crop high-frequency Fourier
components and keep a full subarray. The compressed image A0 is recovered by
padding the cropped components with zeros.
function dctcompress2(A,d)
B = dct(Float64.(A))
m,n = size(A)
m0 ,n0 = sqrt(d).*(m,n) .|> floor .|> Int
B0 = B[1:m0 ,1:n0 ]
A0 = idct([B0 zeros(m0 ,n-n0 ); zeros(m-m0 ,n)])
A0 = A0 |> clamp01! .|> Gray
sizeof(B0 )/sizeof(B), sqrt(1-(norm(B0 )/norm(B))^2), A0
page and page 163. Cropping high-frequency components results in the Gibbs
phenomenon that appears as local, parallel ripples around the edges in the image
that diminish away from the edges. On the other hand, by zeroing out low-
magnitude components, instead of the Gibbs phenomenon, we see high-frequency
graininess throughout the entire image.
So, the error is minimum at ℎ = (3 eps/𝑚)^(1/3). For 𝑓(𝑥) = e^𝑥 at 𝑥 = 0, the error
reaches a minimum of about eps^(2/3) ≈ 10⁻¹¹ when ℎ = (3 eps)^(1/3) ≈ 8 × 10⁻⁶.
By taking the Taylor series expansion of 𝑓(𝑥 + iℎ), we have
$$f(x + ih) = f(x) + ih f'(x) - \tfrac12 h^2 f''(x) - \tfrac16 ih^3 f'''(\xi)$$
where |𝜉 − 𝑥| < ℎ. From the imaginary part, 𝑓′(𝑥) = Im 𝑓(𝑥 + iℎ)/ℎ + 𝜀_trunc with
truncation error 𝜀_trunc ≈ (1/6)ℎ²|𝑓‴(𝑥)|. The round-off error is 𝜀_round ≈ eps|𝑓(𝑥)|.
For 𝑓(𝑥) = e^𝑥 at 𝑥 = 0, the total error is 𝜀(ℎ) = (1/6)ℎ² + eps, which is smallest when
ℎ < √(6 eps) ≈ 4 × 10⁻⁸. See Figure A.8 on the next page.
Figure A.8: The total error as a function of the stepsize ℎ in the central difference
and complex step approximations of 𝑓 0 (𝑥).
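As a rough illustration of the two approximations compared in the figure, the sketch below evaluates both for 𝑓(𝑥) = e^𝑥 at 𝑥 = 0. The function names central and complexstep and the step sizes are illustrative choices, not the book's.

```julia
f(x) = exp(x)

# Central difference: subtractive cancellation limits how small h can be
central(f, x, h) = (f(x + h) - f(x - h)) / (2h)

# Complex step: no subtraction, so h can be taken extremely small
complexstep(f, x, h) = imag(f(x + im*h)) / h

err_central = abs(central(f, 0.0, 1e-5) - 1)
err_complex = abs(complexstep(f, 0.0, 1e-10) - 1)
```

The complex-step error sits at round-off level, while the central difference error stays near its minimum of roughly eps^(2/3).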
7.2. Let’s write the 7/3, 4/3, and 1 in binary format. For double-precision
floating-point numbers, we keep 52 significant bits, not including the leading
1, and round off the trailing bits. For 7/3 the trailing bits 0101 . . . round up
to 1000 . . . , and for 4/3 the trailing bits 0010 . . . round down to 0000 . . . .
Therefore, we have
7/3 = 10.010101010101010101010101010101010101010101010101011
- 4/3 = 1.0101010101010101010101010101010101010101010101010101
1 = 1.0000000000000000000000000000000000000000000000000000
Because binary 10 − 1 = 1, the double-precision representation of 7/3 − 4/3 − 1
is 2⁻⁵², which is machine epsilon.
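A one-line check of this identity, assuming IEEE double-precision arithmetic (the variable name δ is arbitrary):

```julia
# Each division rounds to 52 fractional bits before the subtractions happen
δ = 7/3 - 4/3 - 1
is_machine_epsilon = (δ == eps(Float64))
```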
7.3. See Figure A.9 on the following page. The round-off error dominates the
total error when ℎ < 10−5 in 𝑥 3 − 3𝑥 2 + 3𝑥 − 1 and when ℎ < 10−9 in (𝑥 − 1) 3
8.3. To speed up convergence for multiple roots, we can try taking a larger step
$$x^{(k+1)} = x^{(k)} - m\,\frac{f(x^{(k)})}{f'(x^{(k)})}$$
𝑓 (𝑛) (𝑥 ∗ ) (𝑘) 2
𝑒 (𝑘+1) ≈ 𝑒 .
𝑛(𝑛 + 1)
8.4. When 𝑝 = 1,
$$e^{(k+1)} \approx M e^{(k)} e^{(k-1)}.$$
Substituting $|e^{(k+1)}| \approx r|e^{(k)}|^p$ and $|e^{(k)}| \approx r|e^{(k-1)}|^p$ into the expression for
error yields $r|e^{(k)}|^p \approx |M|\, r^{-1/p}\, |e^{(k)}|\,|e^{(k)}|^{1/p}$, from which $r^{1+1/p} = |M|$ and 𝑝 = 1 + 1/𝑝.
Therefore 𝑝 = (√5 + 1)/2 ≈ 1.618.
8.7. Formulations (8.7) and (8.6) for Aitken’s extrapolation are
aitken1(x1,x2,x3) = x3 - (x3-x2)^2/(x3 - 2x2 + x1)
aitken2(x1,x2,x3) = (x1*x3 - x2^2)/(x3 - 2x2 + x1)
Figure A.10: Error in the 𝑛th partial sum of Leibniz’s formula (dotted) and
Aitken’s extrapolation of that sum (solid), using (8.7) on the left (aitken1) and (8.6) on the right (aitken2).
The left plot of Figure A.10 shows the errors in Leibniz’s formula and Aitken’s
correction. The log-log slopes are 1.0 and 3.0. The errors after the 𝑛th term
are approximately 1/𝑛 and 1/(4𝑛³), respectively. The methods are both linearly
convergent, but using Aitken’s acceleration is significantly faster. Instead of a
trillion terms to compute the first 13 digits of 𝜋, we need nine thousand terms. It’s
still an exceptionally slow way of approximating 𝜋. The right plot of Figure A.10
shows the numerically less-stable Aitken’s method.
We can express Aitken’s extrapolation of Leibniz’s formula explicitly as
$$\pi_n = \sum_{i=1}^{n} \frac{(-1)^{i+1}\,4}{2i-1} + r_n \quad\text{where}\quad r_n = (-1)^n \frac{2n-3}{(n-1)(2n-1)}.$$
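The closed-form correction can be checked against a direct application of Aitken's formula. This is a small sketch; the helper names leibniz and r and the choice 𝑛 = 50 are illustrative assumptions.

```julia
aitken(x1, x2, x3) = x3 - (x3 - x2)^2 / (x3 - 2x2 + x1)

# nth partial sum of Leibniz's series for π
leibniz(n) = sum((-1)^(i+1) * 4 / (2i - 1) for i in 1:n)

# Correction term for the extrapolated sum
r(n) = (-1)^n * (2n - 3) / ((n - 1) * (2n - 1))

n = 50
πn_aitken = aitken(leibniz(n-2), leibniz(n-1), leibniz(n))
πn_closed = leibniz(n) + r(n)
```

Both expressions give the same value, and its error is close to the 1/(4𝑛³) estimate quoted above.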
Let’s further approximate the slope $\phi'(x^{(k)})$ as
$$\phi'(x^{(k)}) \approx \frac{\phi(x^{(k+1)}) - \phi(x^{(k)})}{x^{(k+1)} - x^{(k)}} = \frac{\phi(\phi(x^{(k)})) - \phi(x^{(k)})}{\phi(x^{(k)}) - x^{(k)}}.$$
8.10. For the fixed-point method to converge, we need |𝜙′(𝑥)| < 1 in a
neighborhood of the fixed point 𝑥*. If 𝜙(𝑥) = 𝑥 − 𝛼𝑓(𝑥), then |𝜙′(𝑥)| = |1 − 𝛼𝑓′(𝑥)| < 1
when 0 < 𝛼𝑓′(𝑥) < 2. Taking 𝛼 < 2/𝑓′(𝑥*) should ensure convergence. The
asymptotic convergence factor is given by |𝜙′(𝑥)|, so it’s best to choose 𝛼 close to
1/𝑓′(𝑥*). The solution 𝑥* is unknown, so we instead could use a known $f'(x^{(k)})$
2 Wolfram Cloud (https://www.wolframcloud.com) has a free basic plan that provides complete
access to Mathematica with a few limitations.
8.12. Define the homotopy as ℎ(𝑡, x) = 𝑓 (x) + (𝑡 − 1) 𝑓 (x0 ) where x = (𝑥, 𝑦).
The Jacobian matrix is
$$\frac{\partial h}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} 4x(x^2+y^2-1) & 4y(x^2+y^2+1) \\ 6x(x^2+y^2-1)^2 - 2xy^3 & 6y(x^2+y^2-1)^2 - 3x^2y^2 \end{bmatrix}.$$
function newton(f,df,x)
  for i in 1:100
    Δx = -df(x)\f(x)
    norm(Δx) > 1e-8 ? x += Δx : return(x)
  end
end
The routines take a function f, its Jacobian matrix df, and an initial guess x, and
they return one of the zeroes. For our problem, we’ll define f and df as
f = z -> ((x,y)=tuple(z...);
[(x^2+y^2)^2-2(x^2-y^2); (x^2+y^2-1)^3-x^2*y^3])
df = z -> ((x,y)=tuple(z...);
[4x*(x^2+y^2-1) 4y*(x^2+y^2+1);
6x*(x^2+y^2-1)^2-2x*y^3 6y*(x^2+y^2-1)^2-3x^2*y^2])
8.13. Let’s implement the ECDH algorithm for 𝑦² = 𝑥³ + 7 (mod 𝑟). We’ll first
define the group operator ⊕. Given points 𝑃 = (𝑥₀, 𝑦₀) and 𝑄 = (𝑥₁, 𝑦₁), the
new point 𝑃 ⊕ 𝑄 is (𝑥, 𝑦) with
𝑥 = 𝜆² − 𝑥₀ − 𝑥₁
𝑦 = −𝜆(𝑥 − 𝑥₀) − 𝑦₀,
where 𝜆 = (𝑦₁ − 𝑦₀)/(𝑥₁ − 𝑥₀) when 𝑃 ≠ 𝑄 and 𝜆 = 3𝑥₀²/(2𝑦₀) when 𝑃 = 𝑄, with all arithmetic carried out modulo 𝑟.
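To make the chord-and-tangent rule concrete, here is a minimal sketch of point addition on 𝑦² = 𝑥³ + 7 over a small prime field. The function names ec_add and on_curve, the prime 𝑟 = 17, and the point (1, 5) are illustrative assumptions. The sketch ignores the point at infinity and the case 𝑃 ⊕ (−𝑃), so it is not a complete group law.

```julia
# Add two points on y^2 = x^3 + 7 (mod r); division becomes a modular inverse
function ec_add(P, Q, r)
    x0, y0 = P
    x1, y1 = Q
    if P == Q
        λ = mod(3x0^2 * invmod(2y0, r), r)                   # tangent slope
    else
        λ = mod((y1 - y0) * invmod(mod(x1 - x0, r), r), r)   # chord slope
    end
    x = mod(λ^2 - x0 - x1, r)
    y = mod(-λ*(x - x0) - y0, r)
    return (x, y)
end

on_curve(P, r) = mod(P[2]^2 - P[1]^3 - 7, r) == 0

r = 17                  # small illustrative prime, not a secure choice
P = (1, 5)              # 5^2 = 25 ≡ 8 ≡ 1^3 + 7 (mod 17)
Q = ec_add(P, P, r)     # doubling
R = ec_add(P, Q, r)     # adding two distinct points
```

For a cryptographic-size modulus, the same arithmetic would need BigInt coordinates.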
We’ll use the double-and-add method to compute the product 𝑚𝑃. The
double-and-add method is the Horner form applied to a polynomial ring. Consider the
polynomial
$$x^n + a_{n-1}x^{n-1} + \cdots + a_1 x + a_0.$$
Note that by taking 𝑥 = 2 and 𝑎ᵢ ∈ {0, 1}, we have a binary representation of
a number $m = 1a_{n-1}a_{n-2}\ldots a_2 a_1 a_0$. We can write the polynomial in Horner
form
$$x(x(x(\cdots(x + a_{n-1}) + \cdots) + a_2) + a_1) + a_0.$$
Similarly, the representation of the number 𝑚 in Horner form is
$$2(2(2(\cdots(2 + a_{n-1}) + \cdots) + a_2) + a_1) + a_0.$$
Formally substituting the group operator ⊕ and the point 𝑃 into this expression
gives us
$$2(2(2(\cdots(2P \oplus a_{n-1}P) \oplus \cdots) \oplus a_2 P) \oplus a_1 P) \oplus a_0 P.$$
We can evaluate this expression from the inside out using an iterative function
or from the outside in using a recursive one. Let’s do the latter. This approach
allows us to right-shift through 𝑚 = 1𝑎 𝑛−1 𝑎 𝑛−2 . . . 𝑎 2 𝑎 1 𝑎 0 and inspect the least
significant bit with each function call.
Base.isodd(m::BigInt) = ((m&1)==1)
function dbl_add(m,P)
  if m > 1
    Q = dbl_add(m>>1,P)
    return isodd(m) ? (Q⊕Q)⊕P : Q⊕Q
  else
    return P
  end
end
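The recursion does not depend on the elliptic curve itself, only on having an associative operation. As a sanity check, the sketch below passes the group operation as an extra argument (the name dbl_add_sketch and the argument op are assumptions made for illustration) and tries it on two familiar groups.

```julia
# Generic double-and-add: right-shift through the bits of m, doubling at each level
function dbl_add_sketch(m, P, op)
    m > 1 || return P
    Q = dbl_add_sketch(m >> 1, P, op)
    return isodd(m) ? op(op(Q, Q), P) : op(Q, Q)
end

# Integers under addition: "m times P" is ordinary multiplication
thirteen_fives = dbl_add_sketch(13, 5, +)

# Multiplication modulo 101: "m times P" is modular exponentiation
mulmod(a, b) = mod(a*b, 101)
power = dbl_add_sketch(10, 3, mulmod)
```

The number of group operations grows with the number of bits of 𝑚 rather than with 𝑚 itself, which is what makes the method practical for very large keys.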
Alice chooses a private key 𝑚 (say 1829442) and sends Bob her public key 𝑚𝑃.
Similarly, Bob chooses a private key 𝑛 (say 3727472) and sends Alice his public
key 𝑛𝑃. Now, both can generate the same cipher using a shared secret key 𝑛𝑚𝑃.
P1 = big"0x79BE667EF9DCBBAC55A06295CE87
P2 = big"0x483ADA7726A3C4655DA4FBFC0E11
P = [P1 ; P2 ]
m, n = 1829442, 3727472
mP = dbl_add(m,P)
nmP = dbl_add(n,mP)
The convergence factor for this fixed-point method is |1 − 𝛼𝑚|. The method
converges if and only if the learning rate 0 < 𝛼 < 2/𝑚. Convergence is
monotonic when the learning rate is less than 1/𝑚 and zigzagging when it is
greater than 1/𝑚. Convergence is fastest when the learning rate is close to 1/𝑚,
the reciprocal of the second-derivative of 𝑓 (𝑥)—i.e., the inverse of the Hessian.
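These three regimes are easy to see on a quadratic. The sketch below assumes 𝑓(𝑥) = 𝑚𝑥²/2 with 𝑚 = 4, so each step multiplies the iterate by exactly 1 − 𝛼𝑚; the function name descend and the learning rates are illustrative.

```julia
m = 4.0
df(x) = m*x                       # derivative of f(x) = m*x^2/2

# Fixed-step gradient descent, returning the whole history of iterates
function descend(df, x, α, steps)
    xs = [x]
    for _ in 1:steps
        x -= α*df(x)
        push!(xs, x)
    end
    return xs
end

xs_mono = descend(df, 1.0, 0.2, 20)   # α < 1/m: factor 0.2, monotone decay
xs_zig  = descend(df, 1.0, 0.4, 20)   # 1/m < α < 2/m: factor -0.6, zigzag
xs_div  = descend(df, 1.0, 0.6, 20)   # α > 2/m: factor -1.4, divergence
```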
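The momentum method
$$x^{(k+1)} = x^{(k)} + \alpha p^{(k)} \quad\text{with}\quad p^{(k)} = -f'(x^{(k)}) + \beta p^{(k-1)}$$
[Diagram: periodic nodes 𝑥₅, 𝑥₀, 𝑥₁, 𝑥₂, 𝑥₃, 𝑥₄, 𝑥₅, 𝑥₀ separated by the spacings ℎ₄, ℎ₀, ℎ₁, ℎ₂, ℎ₃, ℎ₄, ℎ₀.]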
function spline_periodic(x,y)
  h = diff(x)
  d = 6*diff(diff([y[end-1];y])./[h[end];h])
  α = h[1:end-1]
  β = 2*(h+circshift(h,1))
  C = Matrix(SymTridiagonal(β,α))
  C[1,end]=h[end]; C[end,1]=h[end]
  m = C\d
  return([m;m[1]])   # assumed return value: the closing node repeats the first
end
Now, we can compute a parametric spline interpolant with nx*n points through a
set of n random points using the function evaluate_spline defined on page 239.
One solution is shown in Figure A.11 on the next page.
n = 5; nx = 30
x = rand(n); y = rand(n)
x = [x;x[1]]; y = [y;y[1]]
t = [0;cumsum(sqrt.(diff(x).^2 + diff(y).^2))]
X = evaluate_spline(t,x,spline_periodic(t,x),nx*n)
Y = evaluate_spline(t,y,spline_periodic(t,y),nx*n)
scatter(x,y); plot!(X[2],Y[2],legend=false)
9.3. First, let’s set up the domain x and the function y, which we are interpolating.
n = 20; N = 200
x = LinRange(-1,1,n)
y = float(x .> 0);
Next, define the radial basis functions and the polynomial bases.
Figure A.12: Interpolation using polynomial and |𝑥|³ radial basis functions.
ϕ1(x,a) = abs.(x.-a).^3
ϕ2(x,a) = exp.(-20(x.-a).^2)
ϕ3(x,a) = x.^a
X = LinRange(-1,1,N)
interp(ϕ,a) = ϕ(X,a')*(ϕ(x,a')\y)
Y1 = interp(ϕ1,x)
Y2 = interp(ϕ2,x)
Y3 = interp(ϕ3,(0:n-1))
scatter(x,y,seriestype = :scatter, marker = :o, legend = :none)
plot!(X,[Y1,Y2,Y3],ylim=[-0.5,1.5])
As seen in Figure A.12 on the preceding page, the polynomial interpolant suffers
from the Runge phenomenon with a maximum error of about 500. On the other
hand, the radial basis function gives us a good approximation. The radial basis
function 𝜙(𝑥) = |𝑥| 3 generates a cubic spline because the resulting function is
cubic on the subintervals and has continuous second derivatives at the nodes.
9.4. Let $v_j(x) = B(h^{-1}x - j)$ be B-splines with nodes equally spaced at $x_i = ih$:
$$B(x) = \begin{cases} \tfrac23 - \tfrac12(2 - |x|)x^2, & |x| \in [0, 1) \\ \tfrac16(2 - |x|)^3, & |x| \in [1, 2) \\ 0, & \text{otherwise.} \end{cases}$$
Every node in the domain needs to be covered by three B-splines, so we’ll need
to have an additional B-spline centered just outside either boundary to cover each
of the boundary nodes. Take the approximation $\sum_{i=-1}^{n+1} c_i v_i(x)$ for 𝑢(𝑥). Bessel’s
equation 𝑥𝑢″ + 𝑢′ + 𝑥𝑢 = 0 at the collocation points {𝑥ᵢ} is now
$$\sum_{j=-1}^{n+1} \left( x_i v''_j(x_i) + v'_j(x_i) + x_i v_j(x_i) \right) c_j = 0$$
                  𝑣″ⱼ(𝑥ᵢ)     𝑣′ⱼ(𝑥ᵢ)       𝑣ⱼ(𝑥ᵢ)
when 𝑗 = 𝑖        −2ℎ⁻²       0             2/3
when 𝑗 = 𝑖 ± 1    ℎ⁻²         ±(1/2)ℎ⁻¹     1/6
otherwise         0           0             0
We have the system Ac = d, where
from which
and 𝑑₀ = 1.
Once we have the coefficients 𝑐ⱼ, we construct the solution $\sum_{j=-1}^{n+1} c_j v_j(x)$.
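The tabulated values come straight from evaluating the cubic B-spline and its derivative at the integers. The following sketch checks them; the function names B and dB are assumptions, and the spline is written in the expanded form 2/3 − 𝑥² + |𝑥|³/2 on [−1, 1].

```julia
# Cubic B-spline with support [-2, 2] and unit node spacing
function B(x)
    a = abs(x)
    a < 1 ? 2/3 - (2 - a)*a^2/2 :
    a < 2 ? (2 - a)^3/6 : 0.0
end

# Its first derivative
function dB(x)
    a = abs(x)
    a < 1 ? -2x + 3x*a/2 :
    a < 2 ? -sign(x)*(2 - a)^2/2 : 0.0
end

# Since v_j(x_i) = B(i - j), the neighbor j = i + 1 samples the spline at -1
center, neighbor = B(0), B(1)
slope_right_neighbor = dB(-1)     # j = i + 1
slope_left_neighbor  = dB(1)      # j = i - 1

# The shifted B-splines form a partition of unity
total = sum(B(0.3 - j) for j in -3:3)
```

Dividing the derivative values by ℎ gives the entries of the middle column for a general node spacing.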
We start by defining a general collocation solver for L 𝑢(𝑥) = 𝑓 (𝑥). This solver
takes boundary conditions bc and an array of equally-spaced collocation points x.
It returns an array of coefficients c for each node, including two elements for the
two B-splines just outside the domain.
function collocation_solve(L,f,bc,x)
  h = x[2]-x[1]
  S = L(x)*([1 -1/2 1/6; -2 0 2/3; 1 1/2 1/6]./[h^2 h 1])'
  d = [bc[1]; f(x); bc[2]]
  A = Matrix(Tridiagonal([S[:,1];0], [0;S[:,2];0], [0;S[:,3]]))
  A[1,1:3], A[end,end-2:end] = [1 4 1]/6, [1 4 1]/6
  return(A\d)
end
We could have kept matrix A as a Tridiagonal matrix by first using the second row
to zero out A[1,3] and the second to last row to zero out A[end,end-2]. Next, we
define a function that will interpolate between collocation points:
function collocation_build(c,x,N)
X = LinRange(x[1],x[end],N)
h = x[2] - x[1]
i = Int32.(X .÷ h .+ 1); i[N] = i[N-1]
C = [c[i] c[i.+1] c[i.+2] c[i.+3]]'
B = (x->[(1-x).^3;4-3*(2-x).*x.^2;4-3*(1+x).*(1-x).^2;x.^3]/6)
Y = sum(C.*hcat(B.((X.-x[i])/h)...),dims=1)
The solution is plotted in Figure A.13 on the next page using 15 collocation points.
Even with only 15 points, the numerical solution is fairly accurate. Finally, let’s
measure the convergence rate by increasing the collocation points. A method is
order 𝑝 if the error is 𝜀 = 𝑂 (𝑛− 𝑝 ), where 𝑛 is the number of points. This means
that log 𝜀 ≈ −𝑝 log 𝑛 + log 𝑚 for some 𝑚. We’ll plot the error as a function of
collocation points.
N = 10*2 .^(1:7); ϵ = []
for n in N
  x = LinRange(0,b,n)
  c = collocation_solve(L,f,[1,0],x)
  X,Y = collocation_build(c,x,n)
  append!(ϵ, norm(Y-besselj0.(X)) / n)
end
plot(N, ϵ, xaxis=:log, yaxis=:log, marker=:o)
Figure A.13: Numerical solution (solid) and exact solution (dashed) to Bessel’s
equation for 15 collocation points (left). The log-log plot of the error as a function
of number of points (right).
The log-log slope will give us the order of convergence. The Julia code
([log.(N) one.(N)]\log.(ϵ))[1]
We find that 𝑐 ≈ 0.551915 is the optimal solution, resulting in less than 0.02
percent maximum deviation from the unit circle—not too bad!
Note that the 𝑝th derivative for sin 𝑥 is sin(𝑥 + 𝑝𝜋/2), which is simply a translation
of the original function. A Julia solution follows.
using FFTW
n = 256; ℓ = 2
x = (0:n-1)/n*ℓ .- ℓ/2
ξ = [0:(n/2-1); (-n/2):-1]*(2π/ℓ)
f1 = exp.(-16*x.^2)
f2 = sin.(π*x)
f3 = x.*(1 .- abs.(x))
deriv(f,p) = real(ifft((im*ξ).^p.*fft(f)))
using Interact
func = Dict("Gaussian"=>f1 ,"sine"=>f2 ,"spline"=>f3 )
@manipulate for f in togglebuttons(func; label="function"),
p in slider(0:0.01:2; value=0, label="derivative")
Alternatively, we could use the Plots.jl @animate macro to produce a gif. See the
QR code at the bottom of this page.
10.4. The LeNet-5 model consists of several hidden layers—three sets of convolu-
tional layers combined by pooling layers, which are then flattened and followed by
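two fully-connected layers. The diagram below shows the LeNet-5 architecture,
where C is a convolutional layer, P a pooling layer, F a flattening layer, and D a fully-connected layer:
image (28 × 28 × 1) → C → 28 × 28 × 6 → P → 14 × 14 × 6 → C → 10 × 10 × 16 → P → 5 × 5 × 16 → C → 1 × 1 × 120 → F → 120 → D → 84 → D → 10 → label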
Let’s load the MNIST training and test data as single-precision floating-point
numbers. Using single-precision numbers can significantly speed performance
over double-precision numbers because the memory usage is halved. We’ll
convert each 28 × 28-pixel image into a 28 × 28 × 1-element array. We’ll also
convert each of the labels into one hot arrays—arrays whose elements equal one
in positions matching the respective labels and equal zero otherwise.
image_train, label_train = MLDatasets.MNIST(split=:train)[:]
image_test, label_test = MLDatasets.MNIST(split=:test)[:]
image_train = Flux.unsqueeze(image_train, 3)
image_test = Flux.unsqueeze(image_test, 3)
Now, we can train the model. Let’s loop over five epochs. It might take several
minutes to train—adding the ProgressMeter.jl macro @showprogress before the
for loop will display a progress bar.
using ProgressMeter
@showprogress for epochs = 1:5
  Flux.train!(loss, parameters, data, optimizer)
end
We use the test data to see how well the model performed. The onecold function
is the inverse of the onehot function, returning the position of the largest element.
accuracy(x,y) = sum(Flux.onecold(x) .== Flux.onecold(y))/size(y,2)
The LeNet-5 model achieves about a 2 percent error rate—better than the 5 percent
error rate obtained using principal component analysis in exercise 3.7. The figure
on the next page shows the intermediate layer inputs. The first convolutional
layer appears to detect edges in the image. The second convolutional layer then
uses this edge detection to identify gross structure in the image.
10.5. We’ll minimize the residual $\|\boldsymbol{\varepsilon}\|_2^2 = \sum_{i=1}^{n} \varepsilon_i^2$, where
$$\varepsilon_i = \sqrt{(x - x_i)^2 + (y - y_i)^2} - c(t_i - t). \tag{A.4}$$
Figure A.14: An input image and the intermediate transformations of that image
following the three convolutional matrices of LeNet-5.
using LsqFit
ϵ = (x,p) -> @. sqrt((x[:,1]-p[1])^2 + (x[:,2]-p[2])^2) - (x[:,3]-p[3])
x = [3 3 12; 1 15 14; 10 2 13; 12 15 14; 0 11 12]
The solution (𝑥, 𝑦, 𝑡) is approximately (6.30, 8.57, 5.46). The formulation of the
problem is different from that in exercise 3.10, and we should expect a slightly
different answer.
Now, let’s apply the same method to the ShotSpotter data. We start by forming
two arrays: X for the (𝑥, 𝑦, 𝑧, 𝑡) data of the sensors and xo for the (𝑥, 𝑦, 𝑧, ·) data
of the signal source.
using CSV, DataFrames
df = DataFrame(CSV.File(download(bucket*"shotspotter.csv")))
X = Array(df[1:end-1,:]); xo = Array(df[end,:]);
The error is approximately −4.7 m and 0.4 m horizontally and 54.0 m vertically.
where 𝜉 is some point in the interval [𝑥, 𝑥 + 3ℎ]. To find the coefficients 𝑐₁₀,
𝑐₁₁, 𝑐₁₂, 𝑐₁₃, and 𝑚₁, we define d = [0,1,2,3] from nodes at 𝑥, 𝑥 + ℎ, 𝑥 + 2ℎ,
and 𝑥 + 3ℎ and invert the scaled Vandermonde matrix with entries $d_i^{\,j}/j!$. The
coefficients of the truncation error are given by C*d.^n/factorial(n):
d = [0,1,2,3]; n = length(d)
C = inv( d.^(0:n-1)' .// factorial.(0:n-1)' )
[C C*d.^n/factorial(n)]
    1       0      0       0        0
 −11/6      3    −3/2     1/3      1/4
    2      −5      4      −1     −11/12
   −1       3     −3       1       3/2
where eps is machine epsilon. The truncation error decreases and the round-off error
increases as ℎ gets smaller. Following the discussion on page 187, the total
error is minimum when the two are equal, i.e., when ℎ = (28 eps)^(1/4) ≈ 3 × 10⁻⁴.
From the third row, we have
$$f''(x) \approx \frac{1}{h^2}\big( 2f(x) - 5f(x+h) + 4f(x+2h) - f(x+3h) \big)$$
with a truncation error $\tfrac{11}{12} h^2 f^{(4)}(\xi)$.
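A quick numerical check of the one-sided second-derivative formula, using 𝑓(𝑥) = e^𝑥 at 𝑥 = 0 so that every derivative equals one. The function name d2_forward and the step size are illustrative choices.

```julia
# Four-point forward difference approximation of the second derivative
d2_forward(f, x, h) = (2f(x) - 5f(x + h) + 4f(x + 2h) - f(x + 3h)) / h^2

h = 1e-3
approx = d2_forward(exp, 0.0, h)
err = abs(approx - 1)         # expected to be close to (11/12)*h^2
```

At this step size the truncation term dominates the round-off error, so the measured error tracks the (11/12)ℎ² estimate.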
11.2. In practice, we don’t need to save every intermediate term. Instead, by
taking 𝑖 = 𝑚, 𝑗 = 𝑚 − 𝑛, and $\bar{D}_j = D_{m,n}$, we have the update
$$\bar{D}_j \leftarrow \frac{\bar{D}_{j+1} - \delta^{p(i-j)} \bar{D}_j}{1 - \delta^{p(i-j)}} \quad\text{where } j = i-1, \ldots, 1 \text{ and } \bar{D}_i \leftarrow \phi(\delta^i h).$$
The Julia code for Richardson extrapolation taking 𝛿 = 1/2 and 𝑝 = 2 is
function richardson(f,x,m)
  D = []
  for i in 1:m
    append!(D, ϕ(f,x,2^i))
    for j in i-1:-1:1
      D[j] = (4^(i-j)*D[j+1] - D[j])/(4^(i-j) - 1)
    end
  end
  D[1]
end
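For a self-contained trial of the in-place update, the sketch below supplies a central difference for the base approximation. The helper ϕ with step 1/𝑛 is a hypothetical stand-in for the book's definition, and richardson_sketch is an illustrative name.

```julia
# Hypothetical base approximation: central difference with step 1/n
ϕ(f, x, n) = (f(x + 1/n) - f(x - 1/n)) * n / 2

function richardson_sketch(f, x, m)
    D = Float64[]
    for i in 1:m
        push!(D, ϕ(f, x, 2^i))            # new, finer approximation
        for j in i-1:-1:1                 # sweep back through the stored column
            D[j] = (4^(i-j)*D[j+1] - D[j]) / (4^(i-j) - 1)
        end
    end
    return D[1]                           # most extrapolated value
end

approx = richardson_sketch(exp, 0.0, 6)   # derivative of exp at 0 is 1
crude  = ϕ(exp, 0.0, 2^6)                 # finest central difference alone
```

Only one array of length 𝑚 is kept, which is the point of the single-index update.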
11.3. We’ll extend the dual class on page 301 by adding methods for the square
root, division, and cosine.
function get_zero(f,x)
  ϵ = 1e-12; δ = 1
  while abs(δ) > ϵ
    fx = f(Dual(x))
    δ = value(fx)/deriv(fx)
    x -= δ
  end
  return(x)
end
function get_extremum(f,x)
  ϵ = 1e-12; δ = 1
  while abs(δ)>ϵ
    fx = f(Dual(Dual(x)))
    δ = deriv(value(fx))/deriv(deriv(fx))
    x -= δ
  end
  return(x)
end
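11.4. Let’s take 𝛾 to be a circle centered at 𝑧 = 𝑎. That is, take $z = a + \varepsilon e^{i\theta}$ with
$dz = i\varepsilon e^{i\theta}\,d\theta$. Then
$$f^{(p)}(a) = \frac{p!}{2\pi\varepsilon^p} \int_0^{2\pi} f\big(a + \varepsilon e^{i\theta}\big)\, e^{-ip\theta}\, d\theta.$$
Discretizing the integral with an 𝑛-point trapezoidal rule gives
$$f^{(p)}(a) = \frac{p!}{n\varepsilon^p} \sum_{k=0}^{n-1} f\big(a + \varepsilon e^{2i\pi k/n}\big)\, e^{-2i\pi pk/n} + O\big((\varepsilon/n)^n\big).$$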
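The discrete formula takes only a few lines. The sketch below applies it to the third derivative of e^𝑧 at the origin; the function name cauchy_derivative and the default values of n and ε are illustrative assumptions.

```julia
# p-th derivative of f at a from n samples on a circle of radius ε
function cauchy_derivative(f, a, p; n=32, ε=0.5)
    s = sum(f(a + ε*exp(2im*π*k/n)) * exp(-2im*π*p*k/n) for k in 0:n-1)
    return factorial(p) * s / (n * ε^p)
end

d3 = cauchy_derivative(exp, 0.0, 3)    # complex result; the exact value is 1
```

Because the integrand is smooth and periodic, the trapezoidal sum converges very quickly in 𝑛, and no subtraction of nearly equal values occurs.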
11.7. We’ll make a change of variables $\xi = (x - s)/\sqrt{4t}$ with $d\xi = -ds/\sqrt{4t}$. Then
$$u(t, x) = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} u_0\big(x - 2\xi\sqrt{t}\big)\, e^{-\xi^2}\, d\xi.$$
The log-log slope of the error is −0.468, confirming the expected $O(N^{-1/2})$
half-order of convergence. The Monte Carlo method would need roughly $10^{25}$
mc_π(n,d=2) = sum(sum(rand(d,n).^2,dims=1).<1)/n*2^d
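$$\mathbf{M} = \begin{bmatrix} 1 & 2t_1 & \cdots & n t_1^{n-1} \\ 1 & 2t_2 & \cdots & n t_2^{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 2t_n & \cdots & n t_n^{n-1} \end{bmatrix} \begin{bmatrix} t_1 & t_1^2 & \cdots & t_1^n \\ t_2 & t_2^2 & \cdots & t_2^n \\ \vdots & \vdots & \ddots & \vdots \\ t_n & t_n^2 & \cdots & t_n^n \end{bmatrix}^{-1},$$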
and 𝑐 0 = 𝑝(𝑡 0 ) for 𝑡0 = 0. Now, we can solve M(p − 𝑝(𝑡 0 )1) = 𝑓 (p, 𝑡) using a
standard nonlinear solver.
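To see what the matrix M does, the sketch below builds it on a few equally spaced nodes (an illustrative choice, not the Lobatto nodes used next) and applies it to a cubic polynomial, for which the result should be exact up to round-off. The test polynomial p and the name residual are assumptions.

```julia
n = 4
t = collect(1:n) ./ n                  # illustrative nodes in (0, 1]
A = t.^(0:n-1)' .* (1:n)'              # derivatives of t, t^2, ..., t^n at the nodes
B = t.^(1:n)'                          # values of t, t^2, ..., t^n at the nodes
M = A / B                              # maps p(t) - p(0) to p'(t) at the nodes

p(s)  = 1 + 2s - s^3                   # test polynomial of degree at most n
dp(s) = 2 - 3s^2
residual = maximum(abs.(M*(p.(t) .- p(0)) .- dp.(t)))
```

The value at 𝑡₀ = 0 is subtracted first because the basis 𝑡, 𝑡², …, 𝑡ⁿ has no constant term.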
Let’s solve the differential equation 𝑢′(𝑡) = 𝛼𝑢². First, we’ll generate the
Gauss–Legendre–Lobatto nodes shifted from [−1, 1] to [0, 1].
using FastGaussQuadrature
function differentiation_matrix(n)
  nodes, _ = gausslobatto(n+1)
  t = (nodes[2:end].+1)/2
  A = t.^(0:n-1)'.*(1:n)'
  B = t.^(1:n)'
  return(A/B,t)
end
n = 20
M,t = differentiation_matrix(n)
D = (u,u0 ) -> M*(u .- u0 )
using NLsolve
u0 = 1.0; α = 0.9
F = (u,u0 ,α) -> D(u,u0 ) .- α*u.^2
u = nlsolve(u->F(u,u0 ,α),u0 *ones(n)).zero
plot([0;t], [u0 ;u], marker=:circle, legend=false)
The exact solution is 𝑢(𝑡) = (1 − 𝛼𝑡)⁻¹. The value 𝑢(1) for 𝛼 = 0.9 and 𝛼 = 0.99
should equal 10 and 100, respectively. Using 20 nodes, we have numerical
solutions of approximately 9.984 (not bad) and 21.7 (awful). If we try to increase
the number of nodes to get better accuracy, we’ll run into trouble because the
condition number of the differentiation matrix explodes. When 𝑛 = 23, the condition
number is 610; when 𝑛 = 27, the condition number is 1.35 × 10⁶.
Instead, let’s use the pseudospectral method by scaling the differentiation
operator and iterating over time. This time, we’ll take 20 segments with 5 nodes
for each segment (or about 50 nodes overall).
N = 30; Δt = 1.0/N; n = 8
M,t = differentiation_matrix(n)
D = (u,u0 ) -> M*(u .- u0 )/Δt
u0 = 1.0; U = [u0 ]; T = [0.0]; α = 0.9
for i = 0:N-1
u = nlsolve(u->F(u,u0 ,α),u0 *ones(n)).zero
u0 = u[end]
plot(T, U, marker=:circle, legend=false)
Now, the relative error at $t = 1$ is $1.3 \times 10^{-9}$ and 0.019 for $\alpha = 0.9$ and $\alpha = 0.99$, respectively.
12.1. The $\theta$-scheme corresponds to the forward Euler scheme, the trapezoidal
scheme, and the backward Euler scheme when $\theta = 1$, $\theta = \tfrac12$, and $\theta = 0$,
respectively. So the regions of absolute stability of the $\theta$-scheme will correspond
with those three schemes when $\theta$ is any of those three values. Let’s find the region
of absolute stability for a general $\theta$. Take $f(U^n) = \lambda U^n$ and take $r = U^{n+1}/U^n$.
$$\frac{r - 1}{k} = (1 - \theta)\lambda r + \theta\lambda.$$
We want to determine the values 𝜆𝑘 for which |𝑟 | ≤ 1. To do this, we’ll find the
boundary of the region 𝜆𝑘 = 𝑥 + i𝑦 where |𝑟 | = 1.
$$1 = |r|^2 = r\bar{r} = \frac{1 + \theta(x + \mathrm{i}y)}{1 - (1-\theta)(x + \mathrm{i}y)} \cdot \frac{1 + \theta(x - \mathrm{i}y)}{1 - (1-\theta)(x - \mathrm{i}y)} = \frac{1 + 2\theta x + \theta^2(x^2 + y^2)}{1 - 2(1-\theta)x + (1-\theta)^2(x^2 + y^2)},$$
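The multiplier behind this boundary is easy to probe numerically. The following is a minimal sketch, assuming only the relation $r = (1 + \theta z)/(1 - (1-\theta)z)$ with $z = \lambda k$ that follows from the scheme above; the name `multiplier` is introduced here for illustration.

```julia
# Stability multiplier of the θ-scheme, r = (1 + θz)/(1 - (1-θ)z) with z = λk.
multiplier(z, θ) = (1 + θ*z) / (1 - (1 - θ)*z)

# Trapezoidal scheme (θ = 1/2): |r| = 1 everywhere on the imaginary axis.
on_axis = [abs(multiplier(im*y, 0.5)) for y in -4:0.5:4]

# Backward Euler (θ = 0) damps the decaying mode z = -3,
# while forward Euler (θ = 1) amplifies it.
r_backward = abs(multiplier(-3.0, 0.0))
r_forward  = abs(multiplier(-3.0, 1.0))
```

With $z = -3$, backward Euler gives $|r| = 1/4$ and forward Euler gives $|r| = 2$, matching the known stability regions of the two limiting schemes.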
using LinearAlgebra, Plots
function rk_stability_plot(A,b)
  E = ones(length(b),1)
  r(λk) = abs.(1 .+ λk * b*((I - λk*A)\E))
  x,y = LinRange(-4,4,100),LinRange(-4,4,100)
  s = reshape(vcat(r.(x'.+im*y)...),(100,100))
  contour(x,y,s,levels=[1],legend=false)   # assumed final step: draw the |r| = 1 contour
end
12.7. To solve $u' = f(u) + g(u)$ with a third-order L-stable IMEX method, we’ll
pair a BDF3 scheme with an appropriate explicit scheme. The BDF3 method
approximates $u'$ at time $t_{n+1}$, so we’ll need to evaluate the explicit scheme at
the same time $t_{n+1}$ to avoid a splitting error. To do this, we need a third-order
approximation of $g(U^{n+1})$.
The coefficients of the BDF3 scheme can be computed using the function
multistepcoefficients from page 351 with an input m = [0 1 2 3] and n = [0].
Alternatively, we can compute them by solving the Taylor polynomial system
keeping the $u'$ term while eliminating the $u$, $u''$, and $u'''$ terms.
i = 0:3; c = ((-i)'.^i.//factorial.(i))\[0;1;0;0]
The coefficients are [11//6 -3//1 3//2 -1//3], and the corresponding BDF3 scheme is
$$\tfrac{11}{6}U^{n+1} - 3U^n + \tfrac32 U^{n-1} - \tfrac13 U^{n-2} = k f(U^{n+1}).$$
To approximate $g(U^{n+1})$, we again solve (A.5). This time, we keep the $u$ term
and eliminate the $u'$, $u''$, and $u'''$ terms.
i = 0:3; c = ((-(i.+1))'.^i.//factorial.(i))\[1;0;0;0]
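Both sets of weights can be checked against monomials in exact rational arithmetic. The sketch below assumes unit time step and measures time from $t_{n+1}$; the extrapolation weights `[4, -6, 4, -1]` are worked out by hand here as the expected solution of the system above, not quoted from the text.

```julia
# BDF3 weights on U^{n+1}, U^n, U^{n-1}, U^{n-2}, i.e. nodes 0, -1, -2, -3.
nodes = 0:-1:-3
bdf3  = [11//6, -3//1, 3//2, -1//3]

# Applied to t^p, the weights should return the derivative of t^p at t = 0.
derivative(p) = sum(bdf3 .* (nodes .// 1).^p)
bdf3_exact = [derivative(p) for p in 0:3] == [0, 1, 0, 0]

# Extrapolation to t_{n+1} from U^n, ..., U^{n-3} (nodes -1, -2, -3, -4).
# Assumed weights: they should reproduce the value of any cubic at t = 0.
extrap = [4, -6, 4, -1]
value(p) = sum(extrap .* (-1:-1:-4).^p)
extrap_exact = [value(p) for p in 0:3] == [1, 0, 0, 0]
```

Keeping the $u'$ row gives a derivative stencil; keeping the $u$ row gives an extrapolation stencil. The two checks differ only in which monomial is allowed to survive.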
We express 𝜆𝑘 (𝑟) as a rational function and find the boundary of the domain
|𝑟 | ≤ 1 by taking 𝑟 = ei𝜃 . Figure A.15 on the facing page shows the regions of
$$\alpha = \frac{1}{b^*_{-1}} \sum_{j=0}^{s-1} b_j U^{n-j} \quad\text{and}\quad \beta = \frac{1}{b^*_{-1}} \sum_{j=0}^{s-1} b^*_j U^{n-j}.$$
Figure A.15: The stencil and regions of absolute stability (in gray) for the implicit
and explicit parts of the third-order IMEX method.
$$\tilde{U}^{n+1} = U^n + \alpha z \tag{A.6a}$$
$$U^{n+1} = U^n + (\tilde{U}^{n+1} + \beta)z, \tag{A.6b}$$
For an additional corrector iteration, we substitute this new expression for $\tilde{U}^{n+1}$
in (A.6b), giving us
$$U^{n+1} = U^n + (U^n + \beta)(z + z^2) + \alpha z^3.$$
Repeating the substitution gives
$$U^{n+1} = U^n + (z + z^2 + \cdots + z^s)(U^n + \beta) + z^{s+1}\alpha.$$
To find the boundary of the region of absolute stability, we let $r = U^n/U^{n+1}$ and
look for the solutions to the following equation when $|r| = 1$:
function PECE(n,m)
_,a = multistepcoefficients([0 1],hcat(1:n-1...))
_,b = multistepcoefficients([0 1],hcat(0:n-1...))
α(r) = a · r.^(1:n-1)/b[1]
β(r) = b[2:end] · r.^(1:n-1)/b[1]
z = [(c = [r-1; repeat([r + β(r)],m); α(r)];
for r in exp.(im*LinRange(0,2π,200))]
After splicing the solution together and snipping off unwanted sections of the
curves, we get the regions in Figure A.16 above.
12.9. Let $T(x)$ be the Taylor series for some function. To compute the Padé
approximation $P_m(x)/Q_n(x) = T(x) + O(x^{m+n+1})$, we’ll find the coefficients of
$P_m(x) = Q_n(x)T(x) + O(x^{m+n+1})$:
$$\sum_{i=0}^{m} p_i x^i = \left(1 + \sum_{i=1}^{n} q_i x^i\right)\left(\sum_{i=0}^{\infty} a_i x^i\right) + O(x^{m+n+1}).$$
Let’s compute the coefficients of $P_3(x)$ and $Q_2(x)$ of the Padé approximation of
$\log(x + 1)$. Its Taylor series is $x - \tfrac12 x^2 + \tfrac13 x^3 - \tfrac14 x^4 + \tfrac15 x^5 - \cdots$.
m = 3; n = 2
a = [0; ((-1).^(0:m+n).//(1:m+n+1))]
(p,q) = pade(a,m,n)
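The function pade is not shown in this solution. As a rough illustration of what such a routine has to do, here is a hypothetical stand-in, `pade_sketch`, that matches coefficients order by order with $q_0 = 1$. It is a sketch of the idea under that normalization, not the book's implementation.

```julia
# Hypothetical stand-in for `pade`. The vector a holds Taylor coefficients
# with a[j+1] = a_j; the result is (p, q) with q[1] = q_0 = 1.
function pade_sketch(a, m, n)
    coef(j) = j >= 0 ? a[j+1] : zero(eltype(a))
    # Orders m+1, ..., m+n of Q(x)T(x) must vanish: sum_i q_i a_{k-i} = -a_k.
    A = [coef(k - i) for k in m+1:m+n, i in 1:n]
    q = [one(eltype(a)); A \ (-a[m+2:m+n+1])]
    # Orders 0, ..., m give the numerator coefficients.
    p = [sum(q[i+1]*coef(k - i) for i in 0:min(k, n)) for k in 0:m]
    return p, q
end

m = 3; n = 2
a = [0; ((-1).^(0:m+n) .// (1:m+n+1))]   # Taylor coefficients of log(1 + x)
p, q = pade_sketch(a, m, n)
```

Because the coefficients are rationals, the result is exact: $P_3(x) = x + \tfrac{7}{10}x^2 + \tfrac{1}{30}x^3$ and $Q_2(x) = 1 + \tfrac65 x + \tfrac{3}{10}x^2$.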
12.13. The SIR system of equations can be easily solved using a standard explicit
ODE solver. While we can write the code that solves this problem succinctly,
Julia also lets us write code that clarifies the mathematics. Note that SIR! is an
in-place function.
function SIR!(du,u,p,t)
  S,I,R = u
  β,γ = p
  du[1] = dS = -β*S*I
  du[2] = dI = +β*S*I - γ*I
  du[3] = dR = +γ*I
end
Figure A.18: The phase map of the Duffing equation in problem 12.14.
Now, we set up the problem, solve it, and present the solution:
using DifferentialEquations
u0 = [0.99; 0.01; 0]
tspan = (0.0,15.0)
p = (2.0,0.4)
problem = ODEProblem(SIR!,u0,tspan,p)
solution = solve(problem)
plot(solution,labels=["susceptible" "infected" "recovered"])
12.14. We can write the Duffing equation as the system of equations $x' = v$ and
$v' = -\gamma v - \alpha x - \beta x^3 + \delta\cos\omega t$, which can be solved using a standard, high-order
explicit ODE solver:
Now, we can solve the boundary value problem using the shooting method with
boundary condition bc and an initial guess guess. Because find_zero takes only
two arguments, we’ll use an anonymous function shoot_airy to let us format and
pass our additional parameters to solveIVP.
using DifferentialEquations, Roots
airy(y,p,x) = [y[2];x*y[1]]
domain = (-12.0,0.0); bc = (1.0,1.0); guess = 5
shoot_airy = (guess -> solveIVP(airy,[bc[1];guess],domain)-bc[2])
v = find_zero(shoot_airy, guess)
We can see the shooting method in action by plotting each intermediate solution
of solveIVP. In the following figure, the lines grow thicker with each iteration
until we reach the solution—the widest line.
$$\begin{aligned}
U^n_j &: & u(t,x) &= u \\
U^{n+1}_j &: & u(t+k,x) &= u + ku_t + \tfrac12 k^2 u_{tt} + O(k^3) \\
U^n_{j+1} &: & u(t,x+h) &= u + hu_x + \tfrac12 h^2 u_{xx} + \tfrac16 h^3 u_{xxx} + \tfrac{1}{24} h^4 u_{xxxx} + O(h^5) \\
U^n_{j-1} &: & u(t,x-h) &= u - hu_x + \tfrac12 h^2 u_{xx} - \tfrac16 h^3 u_{xxx} + \tfrac{1}{24} h^4 u_{xxxx} + O(h^5).
\end{aligned}$$
$$\frac{U^{n+1}_j - U^n_j}{k} = \frac{U^{n+1}_{j+1} - 2U^{n+1}_j + U^{n+1}_{j-1}}{2h^2} + \frac{U^n_{j+1} - 2U^n_j + U^n_{j-1}}{2h^2}$$
$$u_t + \tfrac12 k u_{tt} + O(k^2) = u_{xx} + \tfrac{1}{12} h^2 u_{xxxx} + O(h^3),$$
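$$U^n_j = \hat{u}(t_n, \xi)\, e^{\mathrm{i}jh\xi}$$
in the expression
$$\frac{U_{j+2} - 2U_{j+1} + U_j}{h^2}$$
we have
$$\frac{1}{h^2}\left(e^{\mathrm{i}2h\xi} - 2e^{\mathrm{i}h\xi} + 1\right)\hat{u} = \frac{1}{h^2} e^{\mathrm{i}h\xi}\left(e^{\mathrm{i}h\xi/2} - e^{-\mathrm{i}h\xi/2}\right)^2 \hat{u} = -\frac{4}{h^2} e^{\mathrm{i}h\xi} \sin^2\frac{\xi h}{2}\, \hat{u}$$
as the right-hand side of $\hat{u}_t = \lambda\hat{u}$. The eigenvalues form a cardioid.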
Examining the regions of stability for the methods in Figures 12.5, 12.6, and 12.8,
we see that BDF1–BDF5 are the only time-difference schemes that would be
absolutely stable and only if we take a sufficiently large timestep. For example,
for the backward Euler (BDF1) method, we would need to take $k > h^2/2$. And
for the BDF5 method, we would need to take $k > 5h^2$. Every other method
we’ve examined is unconditionally unstable.
m, n = [0 1 2 3 4], [1]
a, b = zeros(maximum(m)+1), zeros(maximum(n)+1)
a[m.+1], b[n.+1] = multistepcoefficients(m,n)
The stencil has the following approximation for the time derivative:
$$\tfrac14 U^{n+1} + \tfrac56 U^n - \tfrac32 U^{n-1} + \tfrac12 U^{n-2} - \tfrac{1}{12} U^{n-3}. \tag{A.9}$$
We can examine the stability using the multiplier 𝑟 (𝜆𝑘) along the negative real
axis similar to Figure 12.2 on page 341.
λk = r -> (a · r.^-(1:length(a))) ./ (b · r.^-(1:length(b)))
r = LinRange(0.2,6,100)
plot([λk.(r) λk.(-r)],[r r],xlim=[-2,2])
13.5. Start with the Dufort–Frankel scheme as (13.11) and move the second
term on the right over to the left-hand side:
$$\frac{U^{n+1}_j - U^{n-1}_j}{2k} + a\,\frac{U^{n+1}_j - 2U^n_j + U^{n-1}_j}{h^2} = a\,\frac{U^n_{j+1} - 2U^n_j + U^n_{j-1}}{h^2}.$$
Then from (13.9), the eigenvalues from the right-hand side are
$$\lambda_\xi = \frac{2a}{h^2}(\cos\xi h - 1).$$
The boundary of the region of absolute stability determined by the
left-hand side is
$$\lambda = \frac{1}{k}\,\mathrm{i}\sin\theta + \frac{2a}{h^2}(\cos\theta - 1),$$
which is an ellipse in the negative half-plane that extends from 0 to $-4a/h^2$,
exactly the size that we need to fit the eigenspectrum.
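The fit between the ellipse and the eigenspectrum suggests that the scheme is stable for any mesh ratio. A quick numerical check is sketched below, assuming $a = 1$ and writing $\nu = k/h^2$; the function names are introduced here for illustration.

```julia
# Roots of the Dufort–Frankel characteristic polynomial
#   (1/2 + ν) r² - 2ν cos(ξh) r - (1/2 - ν) = 0,   ν = k/h² (taking a = 1).
function df_multipliers(ν, ξh)
    c = cos(ξh)
    d = sqrt(complex(ν^2*c^2 + 0.25 - ν^2))   # complex sqrt handles a negative discriminant
    return (ν*c + d)/(0.5 + ν), (ν*c - d)/(0.5 + ν)
end

# Largest multiplier magnitude over all wavenumbers for several mesh ratios.
largest(ν) = maximum(maximum(abs.(df_multipliers(ν, θ))) for θ in LinRange(0, π, 201))
worst = [largest(ν) for ν in (0.1, 1.0, 100.0)]
```

For each tested ratio the largest magnitude stays at 1 (up to rounding), including $\nu = 100$, which corresponds to $k = h = 0.01$.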
The Dufort–Frankel requires two starting values 𝑈 0 and 𝑈 1 . In practice, we
might use a forward Euler step to generate 𝑈 1 from 𝑈 0 . But, because we are
merely demonstrating the consistency of the Dufort–Frankel, we’ll simply set 𝑈 1
equal to 𝑈 0 .
Δx = 0.01; Δt = 0.01
ℓ = 1; x = -ℓ:Δx:ℓ; m = length(x)
Uⁿ = exp.(-8*x.^2); Uⁿ⁻¹ = Uⁿ
ν = Δt/Δx^2; α = 0.5 + ν; γ = 0.5 - ν
B = ν*Tridiagonal(ones(m-1),zeros(m),ones(m-1))
B[1,2] *=2; B[end,end-1] *=2
@gif for i = 1:300
  global Uⁿ, Uⁿ⁻¹ = (B*Uⁿ+γ*Uⁿ⁻¹)/α, Uⁿ
  plot(x,Uⁿ,ylim=(0,1),label=:none,fill=(0, 0.3, :red))
end
The snapshots above and the QR code at the bottom of the page show the
solution to the heat equation using the Dufort–Frankel scheme. While ultimately
dissipative, the solution is mainly oscillatory, behaving more like a viscous fluid
than a heat-conducting bar.
13.7. Let’s determine the CFL condition. We can write the constant-potential
Schrödinger equation as
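$$\frac{\partial\psi}{\partial t} = \frac{\mathrm{i}\varepsilon}{2}\frac{\partial^2\psi}{\partial x^2} - \mathrm{i}\varepsilon^{-1}V\psi.$$
Using (13.9), we have
$$\frac{\partial\hat{u}}{\partial t} = -\mathrm{i}\frac{2\varepsilon}{h^2}\sin^2\frac{\xi h}{2}\,\hat{u} - \mathrm{i}\varepsilon^{-1}V\hat{u}. \tag{A.10}$$
These eigenvalues lie along the imaginary axis, bounded by $|\lambda_\xi| \le 2\varepsilon/h^2 + \varepsilon^{-1}V$.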
We’ll need to keep ℎ much smaller than 𝜀 to model the dynamics of the
Schrödinger equation accurately. Consequently, 2𝜀/ℎ2 will dominate 𝜀 −1𝑉. So
let’s only consider the contribution of 2𝜀/ℎ2 on stability. Note where the regions
of absolute stability intersect the imaginary axis for different methods in the
following figures:
We see that the backward Euler and trapezoidal methods are unconditionally
stable and that the forward Euler method is unconditionally unstable. The
leapfrog method is stable when $|k\lambda_\xi| < 1$, meaning its CFL condition is
$k < h^2/(2\varepsilon)$. Figure 12.8 shows that the Runge–Kutta method (RK4) is stable
when $|k\lambda_\xi| < 2.5$, so its CFL condition is $k < 1.25h^2/\varepsilon$. Figures 12.5 and 12.6
show that the BDF2 method is unconditionally stable and that higher-order BDF
and Adams methods are conditionally stable.
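These bounds can be turned into numbers. The sketch below assumes the classical RK4 stability polynomial and the dominant bound $|\lambda_\xi| \approx 2\varepsilon/h^2$; the values of ε and h are illustrative only.

```julia
# RK4 stability function R(z) = 1 + z + z²/2 + z³/6 + z⁴/24 on the imaginary axis.
R(z) = 1 + z + z^2/2 + z^3/6 + z^4/24
stable_at(y) = abs(R(im*y)) <= 1

# Time-step limits from |kλ| with |λ| ≈ 2ε/h² (illustrative ε and h).
ε, h = 0.3, 0.01
λmax = 2ε/h^2
k_leapfrog = 1/λmax      # k < h²/(2ε)
k_rk4      = 2.5/λmax    # k < 1.25h²/ε
```

Evaluating `stable_at` shows the RK4 multiplier is inside the unit disk at $y = 2.5$ and outside it at $y = 3$, so 2.5 is a slightly conservative reading of the intersection with the imaginary axis.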
The following Julia code solves the Schrödinger equation with $V(x) = \tfrac12 x^2$
using the Crank–Nicolson method over the domain $[-4, 4]$ with initial conditions
$\psi(0, x) = (\pi\varepsilon)^{-1/4} e^{-(x-1)^2/2\varepsilon}$:
function schroedinger(m,n,ε)
  x = LinRange(-4,4,m); Δx = x[2]-x[1]; Δt = 2π/n; V = x.^2/2
  ψ = exp.(-(x.-1).^2/2ε)/(π*ε)^(1/4)
  diags = 0.5im*ε*[1 -2 1]/Δx^2 .- im/ε*[0 1 0].*V
  D = Tridiagonal(diags[2:end,1], diags[:,2], diags[1:end-1,3])
  D[1,2] *= 2; D[end,end-1] *= 2
  A = I + 0.5Δt*D
  B = I - 0.5Δt*D
  for i ∈ 1:n
    ψ = B\(A*ψ)
  end
  return ψ
end
To verify the convergence rate of the method, we’ll take a sufficiently small time
step 𝑘 so that its contribution to the error is negligible. Then we compute the
error for several decreasing values of ℎ. We repeat this process by taking ℎ
sufficiently small and computing the error for decreasing values of 𝑘.
ε = 0.3; m = 20000; eₓ = []; eₜ = []
N = floor.(Int,exp10.(LinRange(2,3.7,6)))
x = LinRange(-4,4,m)
ψm = -exp.(-(x.-1).^2/2ε)/(π*ε)^(1/4)
for n ∈ N
  x = LinRange(-4,4,n)
  ψn = -exp.(-(x.-1).^2/2ε)/(π*ε)^(1/4)
  append!(eₜ,norm(ψm - schroedinger(m,n,ε))/m)
  append!(eₓ,norm(ψn - schroedinger(n,m,ε))/n)
end
plot(2π./N,eₜ,shape=:circle, xaxis=:log, yaxis=:log)
plot!(8 ./N,eₓ,shape=:circle)
The error is plotted in Figure A.19 on the next page. We can compute the slopes
of the lines using polyfit(log(n),log(error_x),1). The slope for 𝑘 is 2.0, and
for ℎ, it is 2.5.
Figure A.19: Left: the error in solution 𝜌(𝑡, 𝑥) as a function of step size ℎ or
time step 𝑘 . Middle: the error versus step size ℎ for various 𝜀. Right: the error
versus step size as a multiple of 𝜀.
Finally, let’s examine the effect $\varepsilon$ and $h$ have on the solution $\rho(t, x) = |\psi(t, x)|^2$.
We find the solution at $t = \pi$ using $10^4$ time steps. We compute the
error for several values of 𝜀 logarithmically spaced between 0.1 and 0.01 and
several values of ℎ logarithmically spaced between 0.2 and 0.002. The time step
𝑘 is sufficiently small, and its contribution to the error is negligible. The middle
plot of Figure A.19 shows the error as a function of ℎ for each of the four values
of 𝜀, decreasing from right to left. The log-log curves are linear and parallel
until they reach a maximum of 2.
How can we explain such behavior? As the mesh spacing ℎ increases relative
to 𝜀, the numerical solution doesn’t adequately resolve the dynamics of the
wave packet. Consequently, the wave packet gradually slows until the numerical
solution no longer coincides with the true solution. If we plot the error against
ℎ/𝜀, we see that all of the curves line up with a turning point when ℎ is greater
than $\varepsilon$. In general, the Crank–Nicolson method is order $O\bigl((k/\varepsilon)^2 + (h/\varepsilon)^2\bigr)$.
$$\frac{\partial u}{\partial t} = 2\,\frac{\partial^2 u}{\partial r^2}.$$
The figure above and the QR code below show the solution with an initial
distribution given by a step function. The snapshots are at equal intervals from
$t \in [0, \tfrac12]$ of the solution reflected about $x = 0$. Compare this solution to the
one-dimensional heat equation on page 392.
The parameter 0 < 𝑝 ≤ 1 will control how linear the function behaves. In the
limit as 𝑝 → 0, logitspace(x,n,p) will behave like LinRange(-x,x,n). Now,
we define a one-dimensional Laplacian operator for gridpoints given by x:
function laplacian(x)
  Δx = diff(x); Δx₁ = Δx[1:end-1]; Δx₂ = Δx[2:end]
  d₋ = @. 2/[Δx₁*(Δx₁+Δx₂); Δx₁[1]^2]
  d₀ = @. -2/[Δx₂[end].^2; Δx₁*Δx₂; Δx₁[1].^2]
  d₊ = @. 2/[Δx₂[end].^2; Δx₂*(Δx₁+Δx₂)]
  Tridiagonal(d₋,d₀,d₊)
end
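The interior weights in this operator are the standard three-point second-derivative weights on a nonuniform grid, which are exact for quadratics. A small self-contained check, using an arbitrary irregular grid chosen here for illustration:

```julia
# Three-point second-derivative weights on a nonuniform grid, with
# h₁ = x[i] - x[i-1] and h₂ = x[i+1] - x[i].
weights(h₁, h₂) = (2/(h₁*(h₁+h₂)), -2/(h₁*h₂), 2/(h₂*(h₁+h₂)))

# Apply the stencil at the interior points of an irregular grid.
x = [0.0, 0.1, 0.25, 0.6, 1.0]
second_derivative(f) = [sum(weights(x[i]-x[i-1], x[i+1]-x[i]) .* f.(x[i-1:i+1]))
                        for i in 2:length(x)-1]

d2_quadratic = second_derivative(s -> 3s^2 - s + 2)   # exact second derivative is 6
```

On uneven spacing the stencil is only first-order accurate for general functions, which is one reason to choose node distributions that vary smoothly.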
We’ve used Neumann boundary conditions, but we could also have chosen
Dirichlet boundary conditions. We can solve the heat equation using a Crank–
Nicolson method.
function heat_equation(x,t,n,u)
  Δt = t/n
  D2 = laplacian(x)
  A = I - 0.5Δt*D2
  B = I + 0.5Δt*D2
  for i ∈ 1:n
    u = A\(B*u)
  end
  return u
end
The plots above compare the solution using equally-spaced nodes (left) and the
solution using the inverse-sigmoid-spaced nodes (right) with the exact solution.
13.10. We’ll solve the Allen–Cahn equation using Strang splitting. The differential
equation $u'(t) = \varepsilon^{-1}u(1 - u^2)$ has the analytical solution
$u(t) = (u_0^2 - (u_0^2 - 1)e^{-2t/\varepsilon})^{-1/2}u_0$. We’ll combine this solution with a Crank–Nicolson
method for the heat equation applied first in the $x$-direction and then in the $y$-direction.
L = 16; m = 400; Δx = L/m
T = 4; n = 1600; Δt = T/n
x = LinRange(-L/2,L/2,m)'
u = @. tanh(x^4 - 16*(2*x^2-x'^2))
D = Tridiagonal(ones(m-1),-2ones(m),ones(m-1))/Δx^2
D[1,2] *= 2; D[end,end-1] *= 2
A = I + 0.5Δt*D
B = I - 0.5Δt*D
f = (u,Δt) -> @. u/sqrt(u^2 - (u^2-1)*exp(-50*Δt))
u = f(u,Δt/2)
for i = 1:n
u = (B\(A*(B\(A*u))'))'
(i<n) && (u = f(u,Δt))
u = f(u,Δt/2); Gray.(u)
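The reaction step relies on a closed-form flow for $u' = \varepsilon^{-1}u(1 - u^2)$. Substituting $w = u^{-2}$ gives $w' = -2(w - 1)/\varepsilon$, so the exponential decays at rate $2/\varepsilon$. The sketch below checks that form with a centered difference; the sample values of ε, $u_0$, and $t$ are arbitrary.

```julia
# Closed-form flow of u' = u(1 - u²)/ε, with decay rate 2/ε in the exponent.
flow(u0, t, ε) = u0 / sqrt(u0^2 - (u0^2 - 1)*exp(-2t/ε))

# Centered-difference check that the flow satisfies the ODE at a sample point.
ε, u0, t, δ = 0.04, 0.3, 0.01, 1e-6
lhs = (flow(u0, t + δ, ε) - flow(u0, t - δ, ε)) / (2δ)
u   = flow(u0, t, ε)
rhs = u*(1 - u^2)/ε
```

The flow also returns $u_0$ at $t = 0$ and tends to $\pm 1$ as $t \to \infty$, which is the bistable behavior the splitting is meant to preserve.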
To watch the solution’s evolution, we add the following three commands (the first
before the loop, the second inside the loop, and the third after the loop) to the
code above.
anim = Animation()
(i%10)==1 && (plot(Gray.(u),border=:none); frame(anim))
gif(anim, "allencahn.gif", fps = 30)
The figures below show the solutions to the Allen–Cahn equation at times
𝑡 = 0.05, 0.5, 2.0, and 4.0. Also, see the QR code at the bottom of the page.
$$\frac{U^{n+1}_j - \tfrac12(U^n_{j+1} + U^n_{j-1})}{k} + c\,\frac{U^n_{j+1} - U^n_{j-1}}{2h} = 0.$$
Let’s substitute the Taylor series approximation
$$\begin{aligned}
u(t,x) &= u \\
u(t+k,x) &= u + ku_t + \tfrac12 k^2 u_{tt} + \tfrac16 k^3 u_{ttt} + O(k^4) \\
u(t,x+h) &= u + hu_x + \tfrac12 h^2 u_{xx} + \tfrac16 h^3 u_{xxx} + O(h^4) \\
u(t,x-h) &= u - hu_x + \tfrac12 h^2 u_{xx} - \tfrac16 h^3 u_{xxx} + O(h^4)
\end{aligned}$$
in for $U^n_j$, $U^{n+1}_j$, $U^n_{j+1}$, and $U^n_{j-1}$ in terms of the Lax–Friedrichs method
$$\frac{U^{n+1}_j - \tfrac12(U^n_{j+1} + U^n_{j-1})}{k} = u_t + \tfrac12 ku_{tt} + O(k^2) - \tfrac12\frac{h^2}{k}u_{xx} + O\!\left(\frac{h^4}{k}\right)$$
and
$$c\,\frac{U^n_{j+1} - U^n_{j-1}}{2h} = cu_x + \tfrac16 ch^2 u_{xxx} + O(h^4).$$
By combining these terms, we have
$$u_t + cu_x + \tfrac12 ku_{tt} - \tfrac12\frac{h^2}{k}u_{xx} + \tfrac16 ch^2 u_{xxx} + \{\text{higher order terms}\} = 0,$$
from which we see that the truncation error is $O(k + h^2/k)$.
We examine the leading order truncation
$$u_t + cu_x + \tfrac12 ku_{tt} - \tfrac12\frac{h^2}{k}u_{xx} = 0 \tag{A.11}$$
to find the dispersion relation. We will eliminate the 𝑢 𝑡𝑡 term by first differenti-
ating (A.11) with respect to 𝑡 and 𝑥 and then substituting the expression for 𝑢 𝑡𝑡
back into the original equation, discarding the higher-order terms. We have
$$u_{tt} = -cu_{xt} + O(k + h^2/k)$$
$$u_{tx} = -cu_{xx} + O(k + h^2/k)$$
from which
$$u_{tt} = c^2 u_{xx} + O(k + h^2/k).$$
After eliminating higher-order terms, equation (A.11) becomes
$$u_t + cu_x + \tfrac12 k\left(c^2 - \frac{h^2}{k^2}\right)u_{xx} = 0.$$
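Substituting the Fourier component $u(t, x) = e^{\mathrm{i}(\omega t - \xi x)}$ into this expression as an
ansatz yields
$$\left[\mathrm{i}\omega + c(-\mathrm{i}\xi) - \tfrac12\left(c^2 k - \frac{h^2}{k}\right)\xi^2\right] e^{\mathrm{i}(\omega t - \xi x)} = 0.$$
So
$$\mathrm{i}\omega + c(-\mathrm{i}\xi) - \tfrac12\left(c^2 k - \frac{h^2}{k}\right)\xi^2 = 0,$$
and the dispersion relation $\omega(\xi)$ is given by
$$\omega = c\xi - \mathrm{i}\,\tfrac12 k\left(c^2 - \frac{h^2}{k^2}\right)\xi^2.$$
Plugging this $\omega$ back into our ansatz gives us
$$u(t, x) = e^{\mathrm{i}(\omega t - \xi x)} = e^{\mathrm{i}c\xi t}\, e^{-\alpha\xi^2 t}\, e^{-\mathrm{i}\xi x} = \underbrace{e^{\mathrm{i}\xi(ct - x)}}_{\text{advection}} \cdot \underbrace{e^{-\alpha\xi^2 t}}_{\text{dissipation}}$$
where
$$\alpha = \tfrac12 k\left(\frac{h^2}{k^2} - c^2\right).$$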
So, the Lax–Friedrichs scheme is dissipative, especially when $k \ll h$.
We can minimize the dissipation (and decrease the error) by choosing $k$ and
$h$ so that
$$\frac{h^2}{k^2} - c^2 = 0,$$
that is, by taking $k = h/|c|$. Note that if $k > h/|c|$, then $\alpha$ is negative and $e^{-\alpha\xi^2 t}$
grows, causing the solution to blow up. This condition is the CFL condition for
the Lax–Friedrichs scheme.
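The same threshold appears directly in the scheme's amplification factor. A minimal sketch, assuming the standard von Neumann factor $g(\theta) = \cos\theta - \mathrm{i}(ck/h)\sin\theta$ for Lax–Friedrichs with $\theta = \xi h$; the helper names are introduced here.

```julia
# Amplification factor of Lax–Friedrichs for u_t + c u_x = 0:
#   g(θ) = cos θ - i (ck/h) sin θ,  θ = ξh.
g(θ, cfl) = cos(θ) - im*cfl*sin(θ)
largest_growth(cfl) = maximum(abs(g(θ, cfl)) for θ in LinRange(0, 2π, 401))

# Dissipation coefficient of the modified equation, α = (k/2)(h²/k² - c²).
dissipation(c, h, k) = 0.5k*(h^2/k^2 - c^2)
```

For a CFL number of 0.8 no mode grows, while for 1.2 some modes do, and `dissipation` changes sign exactly at $k = h/|c|$.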
14.3. The scheme is
$$\frac{U^{n+1}_j - U^{n-1}_j}{2k} + c\,\frac{U^n_{j+1} - U^n_{j-1}}{2h} = 0.$$
By Taylor series expansion
$$\begin{aligned}
u(t+k,x) &= u + ku_t + \tfrac12 k^2 u_{tt} + \tfrac16 k^3 u_{ttt} + O(k^4) \\
u(t-k,x) &= u - ku_t + \tfrac12 k^2 u_{tt} - \tfrac16 k^3 u_{ttt} + O(k^4) \\
u(t,x+h) &= u + hu_x + \tfrac12 h^2 u_{xx} + \tfrac16 h^3 u_{xxx} + O(h^4) \\
u(t,x-h) &= u - hu_x + \tfrac12 h^2 u_{xx} - \tfrac16 h^3 u_{xxx} + O(h^4)
\end{aligned}$$
from which
$$u_t + \tfrac16 k^2 u_{ttt} + cu_x + \tfrac16 ch^2 u_{xxx} = O(k^3 + h^3).$$
So, the method is $O(k^2 + h^2)$.
The scheme is leapfrog in time and central difference in space. The Fourier
transform of
$$\frac{\partial}{\partial t}U_j = -c\,\frac{U_{j+1} - U_{j-1}}{2h}$$
is
$$\frac{\partial}{\partial t}\hat{u}(t, \xi) = -c\,\frac{e^{\mathrm{i}\xi h} - e^{-\mathrm{i}\xi h}}{2h}\,\hat{u}(t, \xi) = -\mathrm{i}\frac{c}{h}\sin(\xi h)\,\hat{u}(t, \xi).$$
The leapfrog scheme is absolutely stable only on the imaginary axis $\lambda k \in [-\mathrm{i}, +\mathrm{i}]$.
The factor $|\sin(\xi h)| \le 1$, so the scheme is stable when $(|c|/h)k \le 1$. That is,
when $k \le h/|c|$.
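14.6. We can express the compressible Euler equations
$$\rho_t + (\rho u)_x = 0 \tag{A.12a}$$
$$(\rho u)_t + (\rho u^2 + p)_x = 0 \tag{A.12b}$$
$$E_t + ((E + p)u)_x = 0 \tag{A.12c}$$
in quasilinear form by expanding the derivatives. Equations (A.12a) and (A.12b) become
$$\rho_t + u\rho_x + \rho u_x = 0 \tag{A.13}$$
$$u\rho_t + \rho u_t + u^2\rho_x + 2\rho uu_x + p_x = 0. \tag{A.14}$$
We can eliminate the $u\rho_t$ term in (A.14) by first multiplying (A.13) by $u$ to get
$$u\rho_t + u^2\rho_x + \rho uu_x = 0$$
and then subtracting it from (A.14) to get
$$\rho u_t + \rho uu_x + p_x = 0.$$
Dividing by $\rho$ gives
$$u_t + uu_x + \frac{1}{\rho}p_x = 0. \tag{A.15}$$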
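To derive an equation for $p$, we will use the equation of state
$$E = \tfrac12\rho u^2 + \frac{p}{\gamma - 1}. \tag{A.16}$$
Differentiating with respect to $t$ gives
$$E_t = \tfrac12 u^2\rho_t + \rho uu_t + \frac{1}{\gamma - 1}p_t.$$
Substituting (A.13) and (A.15) for $\rho_t$ and $u_t$ gives
$$E_t = -\tfrac12 u^3\rho_x - \tfrac32\rho u^2 u_x - up_x + \frac{1}{\gamma - 1}p_t. \tag{A.17}$$
By expanding $((E + p)u)_x$ with the equation of state, we have
$$((E + p)u)_x = \left(\tfrac12 u^3\rho + \frac{\gamma}{\gamma - 1}pu\right)_x = \tfrac12 u^3\rho_x + \tfrac32\rho u^2 u_x + \frac{\gamma}{\gamma - 1}pu_x + \frac{\gamma}{\gamma - 1}up_x. \tag{A.18}$$
Adding (A.17) and (A.18) gives
$$E_t + ((E + p)u)_x = \frac{1}{\gamma - 1}p_t + \frac{\gamma}{\gamma - 1}pu_x + \frac{1}{\gamma - 1}up_x.$$
So (A.12c) becomes
$$p_t + \gamma pu_x + up_x = 0. \tag{A.19}$$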
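We now can express (A.13), (A.15) and (A.19) in quasilinear matrix form
$$\begin{bmatrix} \rho \\ u \\ p \end{bmatrix}_t + \begin{bmatrix} u & \rho & 0 \\ 0 & u & 1/\rho \\ 0 & \gamma p & u \end{bmatrix} \begin{bmatrix} \rho \\ u \\ p \end{bmatrix}_x = 0.$$
The system is hyperbolic if the eigenvalues $\lambda$ of the Jacobian matrix are real:
$$\begin{vmatrix} u - \lambda & \rho & 0 \\ 0 & u - \lambda & 1/\rho \\ 0 & \gamma p & u - \lambda \end{vmatrix} = (u - \lambda)^3 - \gamma p\rho^{-1}(u - \lambda) = (u - \lambda)\bigl((u - \lambda)^2 - c^2\bigr) = 0$$
where $c^2 = \gamma p/\rho$.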
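The factored characteristic polynomial has roots $u - c$, $u$, and $u + c$ with $c^2 = \gamma p/\rho$. As a quick numerical confirmation, the sketch below computes the eigenvalues of the Jacobian for an illustrative air-like state; the particular numbers are assumptions, and any $\rho > 0$, $p > 0$ would do.

```julia
using LinearAlgebra

# Quasilinear Jacobian of the 1-D Euler equations in (ρ, u, p) variables.
jacobian(ρ, u, p, γ) = [u ρ 0; 0 u 1/ρ; 0 γ*p u]

# Illustrative state.
ρ, u, p, γ = 1.2, 50.0, 101325.0, 1.4
c = sqrt(γ*p/ρ)                                   # sound speed
λ = sort(real.(eigvals(jacobian(ρ, u, p, γ))))    # expect u - c, u, u + c
```

All three eigenvalues are real whenever $\gamma p/\rho > 0$, which is the hyperbolicity condition.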
14.7. The solution starts with a shock beginning at $x = 1$ and moving with speed
$\tfrac12$ (half the height, as given by the Rankine–Hugoniot condition). At the same
time, a rarefaction propagates from 𝑥 = 0 moving at speed 1 until the rarefaction
catches up with the shock at location 𝑥 = 2 and time 𝑡 = 2. We can imagine the
trailing edge of a right-angled trapezoid catching up to the leading edge to form
a right triangle. Now, the leading edge of the right triangle continues to move
forward with speed equal to half the height (again by the Rankine–Hugoniot
condition). To compute the leading edge position, we only need to know the
height $u(t)$. The area of the triangle is conserved, so $A = \tfrac12 x(t)u(t, x) = 1$. That
is, $u = 2/x$. The speed is given by
$$\frac{dx}{dt} = \tfrac12 u(t, x) = \frac{1}{x}.$$
The solution to this differential equation is $x(t) = \sqrt{2t + c}$ for some constant $c$.
At $t = 2$, the leading edge is at $x = 2$. So $c = 0$. Altogether,
$$\text{when } t < 2:\quad u(x, t) = \begin{cases} x/t, & 0 < x < t \\ 1, & t < x < 1 + \tfrac12 t \\ 0, & \text{otherwise} \end{cases}$$
$$\text{when } t \ge 2:\quad u(x, t) = \begin{cases} x/t, & 0 < x < \sqrt{2t} \\ 0, & \text{otherwise} \end{cases}$$
because the area is conserved. We can write the exact solution as the function:
U = (x,t) -> @. t<2 ?
(0<x<t)*(x/t) + (t<x<1+t/2) : (x/t)*(0<x<sqrt(2t))
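Since the construction rests on conservation of area, it is worth confirming that the piecewise formula keeps $\int u\,dx = 1$ before and after the rarefaction meets the shock. The sketch below uses a scalar version of the exact solution (named `exact` here to avoid clashing with `U`) and a midpoint rule on $[-1, 4]$.

```julia
# Exact solution of Burgers' equation for the unit box initial condition on [0, 1].
function exact(x, t)
    if t < 2
        0 < x < t ? x/t : (t <= x < 1 + t/2 ? 1.0 : 0.0)
    else
        0 < x < sqrt(2t) ? x/t : 0.0
    end
end

# Midpoint-rule estimate of the conserved area on [-1, 4].
function area(t; n = 200_000)
    Δx = 5/n
    sum(exact(-1 + (i - 0.5)*Δx, t) for i in 1:n) * Δx
end

areas = [area(t) for t in (0.5, 1.0, 2.0, 3.0, 4.0)]
```

For $t < 2$ the area is a triangle of area $t/2$ plus a rectangle of width $1 - t/2$; for $t \ge 2$ it is a triangle with base $\sqrt{2t}$ and height $\sqrt{2t}/t$.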
We can implement the local Lax–Friedrichs method using the following code:
m = 80; x = LinRange(-1,3,m); Δx = x[2]-x[1]; j = 1:m-1
n = 100; Lₜ = 4; Δt = Lₜ/n
f = u -> u.^2/2; df = u -> u
u = (x.>=0).*(x.<=1)
anim = @animate for i = 1:n
  α = 0.5*max.(abs.(df(u[j])),abs.(df(u[j.+1])))
  F = (f(u[j])+f(u[j.+1]))/2 - α.*(u[j.+1]-u[j])
  global u -= Δt/Δx*[0;diff(F);0]
  plot(x,u, fill = (0, 0.3, :blue))
  plot!(x,U(x,i*Δt), fill = (0, 0.3, :red))
  plot!(legend=:none, ylim=[0,1])
end
gif(anim, "burgers.gif", fps = 15)
The figure below and the QR code at the bottom of the page show the analytical
and numerical solutions.
Figure panels: $t = 1$, $t = 2$, and $t = 3$.
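$$U^{n+1/2}_j = U^n_j - \tfrac12 k\,\partial_x f(U^n_j)$$
$$U^{n+1}_{j+1/2} = \tfrac12\left(U^n_j + U^n_{j+1}\right) - \tfrac18 h\left(\partial_x U^n_{j+1} - \partial_x U^n_j\right) - \frac{k}{h}\left(f\bigl(U^{n+1/2}_{j+1}\bigr) - f\bigl(U^{n+1/2}_j\bigr)\right),$$
where we approximate $\partial_x$ using the slope limiter $\sigma_j$
$$\sigma_j(U_j) = \frac{U_{j+1} - U_j}{h}\,\phi(\theta_j) \quad\text{where}\quad \theta_j = \frac{U_j - U_{j-1}}{U_{j+1} - U_j}$$
with the van Leer limiter
$$\phi(\theta) = \frac{|\theta| + \theta}{1 + |\theta|}.$$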
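A few sample values make the behavior of the van Leer limiter concrete: it switches the slope off at extrema ($\theta \le 0$), leaves it untouched for smooth data ($\theta = 1$), and saturates below 2. This is a stand-alone illustration using the formula above.

```julia
# van Leer limiter φ(θ) = (|θ| + θ)/(1 + |θ|)
ϕ(θ) = (abs(θ) + θ)/(1 + abs(θ))

vals = ϕ.([-2.0, 0.0, 1.0, 3.0, 1e8])
```

The limiter is zero at the first two sample points, returns 1 at $\theta = 1$ and 1.5 at $\theta = 3$, and approaches 2 for very large $\theta$.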
We can implement the solution using Julia as
δ = u -> diff(u,dims=1)
ϕ = t -> (abs(t)+t)./(1+abs(t))
fixnan(u) = isnan(u)||isinf(u) ? 0 : u
θ = δu -> fixnan.(δu[1:end-1,:]./δu[2:end,:])
∂ₓ(u) = (δu=δ(u);[[0 0];δu[2:end,:].*ϕ.(θ(δu));[0 0]])
F = u -> [u[:,1].*u[:,2] u[:,1].*u[:,2].^2+0.5u[:,1].^2]
The figure and QR code below show the numerical solution for ℎ(𝑡, 𝑥).
Notice the shock wave moving to the right and the rarefaction wave moving to
the left from the initial discontinuity.
15.1. Multiply $-u'' - u = f(x)$ by $v$ and integrate: $\int_0^1 (-u'' - u)v\,dx = \int_0^1 vf\,dx$
where $f = -8x^2$. If we integrate by parts once and move the boundary
terms to the right-hand side of the equation, then we have the variational form
$a(u_h, v_h) = \mathcal{L}(v_h)$ where
$$a(u, v) = \int_0^1 u'v' - uv\,dx \quad\text{and}\quad \mathcal{L}(v) = \int_0^1 vf\,dx + vu'\Big|_0^1.$$
The boundary conditions for this problem are $u'(0) = 0$ and $u'(1) = 1$. Take the
finite elements $u_h(x) = \sum_{i=0}^{m+1}\xi_i\varphi_i(x)$ and $v_h(x) = \sum_{j=0}^{m+1}v_j\varphi_j(x)$ where $\varphi_j(x)$
are the piecewise basis elements defined in (15.1). Then
$$\sum_{j=0}^{m+1}\sum_{i=0}^{m+1} v_j\xi_i\,a(\varphi_j, \varphi_i) = \sum_{j=0}^{m+1} v_j\,\mathcal{L}(\varphi_j).$$
Because the expression holds for any set $\{v_j\}$, it follows that for all $j$
$$\sum_{i=0}^{m+1}\xi_i\,a(\varphi_j, \varphi_i) = \mathcal{L}(\varphi_j).$$
We can compute the integrals using (15.1): $a(\varphi_j, \varphi_i) = -h^{-1} - \tfrac16 h$ when
$i \ne j$ and $a(\varphi_i, \varphi_i) = 2h^{-1} - \tfrac23 h$ except on the two boundaries where it is
Figure A.20: Solutions to the Neumann problem 15.1. The marker • is used for
the finite element solution and ◦ for the analytical solution.
Figure A.20 shows the finite element solution and the exact solution. The finite
element solution has little error even with only ten nodes.
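The interior entries of the stiffness matrix quoted above can be checked by brute-force quadrature. The sketch below assumes the usual piecewise-linear hat functions for the basis in (15.1) and approximates $a(\varphi, \psi) = \int_0^1 \varphi'\psi' - \varphi\psi\,dx$ with a midpoint rule; `hat`, `dhat`, and `bilinear` are names introduced here.

```julia
# Piecewise-linear hat function centered at xc with mesh spacing h,
# and its piecewise-constant derivative.
hat(x, xc, h)  = max(0.0, 1 - abs(x - xc)/h)
dhat(x, xc, h) = abs(x - xc) < h ? -sign(x - xc)/h : 0.0

# Midpoint-rule approximation of a(φ, ψ) = ∫ (φ'ψ' - φψ) dx over [lo, hi].
function bilinear(xc1, xc2, h; lo = 0.0, hi = 1.0, n = 100_000)
    Δ = (hi - lo)/n
    sum(dhat(x, xc1, h)*dhat(x, xc2, h) - hat(x, xc1, h)*hat(x, xc2, h)
        for x in lo .+ ((1:n) .- 0.5) .* Δ) * Δ
end

h = 0.1
diagonal    = bilinear(0.5, 0.5, h)   # compare with  2/h - 2h/3
offdiagonal = bilinear(0.5, 0.6, h)   # compare with -1/h - h/6
```

The off-diagonal value applies to neighboring nodes only; basis functions that are two or more nodes apart do not overlap, so those entries are zero.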
15.2. Let $V$ be the space of twice differentiable functions that take the same
boundary conditions as $u(x)$. Multiply the equation $u'''' = f$ by $v \in V$ and
integrate by parts twice to get a symmetric bilinear form $\int_0^1 u''v''\,dx = \int_0^1 fv\,dx$.
(The boundary terms vanish when we enforce the boundary conditions.) Define
$a(u, v) = \int_0^1 u''v''\,dx$ and $\mathcal{L}(v) = \int_0^1 fv\,dx$. Then the finite element problem is
(V) Find $u_h \in V_h$ such that $a(u_h, v_h) = \mathcal{L}(v_h)$ for all $v_h \in V_h$
The finite elements are
$$u_h(x) = \sum_{i=0}^{m+1} \xi_i\phi_i(x) + \eta_i\psi_i(x) \quad\text{and}\quad v_h(x) = \sum_{j=0}^{m+1} \alpha_j\phi_j(x) + \beta_j\psi_j(x),$$
where the basis elements 𝜙 𝑗 (𝑥) prescribe the value of 𝑢(𝑥) at the nodes, and
the basis elements 𝜓 𝑗 (𝑥) prescribe the value of the derivative of 𝑢(𝑥) at the
nodes. The coefficients 𝜉 𝑗 and 𝜂 𝑗 are unknown, and the coefficients 𝛼 𝑗 and 𝛽 𝑗 are
Figure A.21: Solutions to the beam problem. The marker • is used for the finite
element solution and ◦ for the analytical solution.
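with elements $a_{ij} = \langle\phi_i'', \phi_j''\rangle$, $b_{ij} = \langle\psi_i'', \phi_j''\rangle$, and $c_{ij} = \langle\psi_i'', \psi_j''\rangle$ along with
$f_j^{(1)} = \langle f, \phi_j\rangle = h^{-1}$ and $f_j^{(2)} = \langle f, \psi_j\rangle = 0$.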
We can compute the solution in Julia using
m = 8; x = LinRange(0,1,m+2); h = x[2]-x[1]
σ = (a,b,c) -> Tridiagonal(fill(a,m-1),fill(b,m),fill(c,m-1))/h^3
Figure A.21 shows the solution using 8 interior nodes. Because the problem
has a constant load, the finite element solution matches the analytical solution
Figure A.23: Solutions of the KdV equation in exercise 16.3. The top row shows
independent solitons. The bottom row shows the interacting solitons.
Figure A.22 and the QR code on the bottom of this page show the solution.
Fourier spectral methods have spectral accuracy in space so long as the solution
is a smooth function. Space and time derivatives are coupled through Burgers’
equation, so we can expect a method that is fourth-order in time and space.
However, solutions to Burgers’ equation, which may initially be smooth, become
discontinuous over time. As a result, truncation error is half-order in space
and time once a discontinuity develops. This is bad—and it gets worse. Gibbs
oscillations develop around the discontinuity. These oscillations will spread
and grow because Burgers’ equation is dispersive. Ultimately, the oscillations
overwhelm the solution.
16.3. The following Julia code solves the KdV equation using integrating factors.
We first set the initial conditions and parameters.
Figure A.24: Solution to the Swift–Hohenberg equation at 𝑡 = 3, 10, 25, and 100.
Figure A.23 shows the soliton solution to the KdV equation. The top row shows the
two independent solutions with 𝑢(𝑥, 0) = 𝜙(𝑥; −4, 4) and 𝑢(𝑥, 0) = 𝜙(𝑥; −9, 9)
as different initial conditions. The bottom row shows the two-soliton solution
for 𝑢(𝑥, 0) = 𝜙(𝑥; −4, 4) + 𝜙(𝑥; −9, 9). After colliding, the tall, fast soliton in
the bottom row is about 1 unit in front of the corresponding soliton in the top
row. The small, slow soliton in the bottom row is about 1.5 units behind the
corresponding soliton in the top row. Each soliton has the same energy after the
collision as it had before the collision—its position is merely shifted.
16.4. Let’s use Strang splitting on $u' = \mathcal{N}u + \mathcal{L}u$. There are two reasonable
ways to define $\mathcal{N}$ and $\mathcal{L}$. The first with $\mathcal{N}u = \varepsilon u - u^3$ and $\mathcal{L}u = -(\Delta + 1)^2 u$,
We can animate the solution by adding the following code inside the loop.
We can observe the solution over time using the Interact.jl library.
using Interact, ColorSchemes, Images
@manipulate for t in slider(0:0.1:8; value=0, label="time")
get(colorschemes[:binary], u(t), :extrema)
Figure A.25 and the QR code on this page show the solution to the Cahn–Hilliard
equation. Observe how the dynamics slow down over time.
N = (u,_,_) ->
( v = reshape(u,n,n);
v = -0.5*abs.((F\(im*ξ.*v)).^2 + (F\(im*ξ'.*v)).^2);
w = (F*v)[:]; w[1] = 0.0; return w )
problem = SplitODEProblem(L,N,(F*u0 )[:],(0,T))
solution = solve(problem, ETDRK4(), dt=0.05, saveat=0.5);
u = t -> real(F\reshape(solution(t),n,n))
We can observe the evolution of the solution using the Interact.jl library. See
Figure A.26 and the QR code on the current page.
using Interact, ColorSchemes, Images
@manipulate for t in slider(0:0.5:T; value=0, label="time")
get(colorschemes[:magma], -u(t), :extrema)
Appendix B
The trends
• The exponential growth in computing speed and memory, along with the
considerable drop in computing cost;
• A paradigm shift towards open-source software and open access research;
• A widespread and fast internet, spurring greater information sharing and
cloud computing;
• The shift from digital immigrants to digital natives and increase in diversity;
• The production and monetization of massive quantities of data; and
• The emergence of specialized fields of computational science, like data
science, bioinformatics, computational neuroscience, and computational
closet. Cray-1 had 8 million bytes of memory and could execute 160 million
floating-point operations per second. Today, a typical smartphone costs $500,
weighs less than 200 grams, and fits in a pocket. Smartphones often have 6
billion bytes of memory and can execute 10 billion floating-point operations per
second. High-performance cloud computing, which can be accessed with those
smartphones, has up to 30 trillion bytes of memory and can achieve a quintillion
(a billion billion) floating-point operations per second using specialized GPUs.
And even now, scientists are developing algorithms that use nascent quantum
computers to solve intractable high-dimensional problems in quantum simulation,
cryptanalysis, and optimization.
The growth in open data, open standards, open access and reproducible
research, and open-source software has further accelerated the evolution of
scientific computing languages. Unlike proprietary programming languages
and libraries, the open-source movement creates a virtuous innovation feedback
loop powered through open collaboration, improved interoperability, and greater
affordability. Even traditionally closed-source software companies such as
IBM, Microsoft, and Google have embraced the open-source movement through
Red Hat, GitHub, and Android. Former World Bank Chief Economist and
Nobel laureate Paul Romer has noted, “The more I learn about the open source
community, the more I trust its members. The more I learn about proprietary
software, the more I worry that objective truth might perish from the earth.”
Python, Julia, Matlab, and R all have a robust community of user-developers. The
Comprehensive R Archive Network (CRAN) features 17 thousand contributed
packages, the Python Package Index (PyPI) has over 200 thousand packages,
and GitHub has over 100 million repositories. Nonprofit foundations, like
NumFOCUS (Numerical Foundation for Open Code and Useable Science), have
further supported open-source scientific software projects.
In 1984, Donald Knuth, author of The Art of Computer Programming and
the creator of TeX, introduced literate programming, in which natural language
exposition and source code are combined by considering programs to be “works
of literature.” Knuth explained that “instead of imagining that our main task is to
instruct a computer what to do, let us concentrate rather on explaining to human
beings what we want a computer to do.” Mathematica, first released in the late
1980s, used notebooks that combined formatted text, typeset mathematics, and
Wolfram Language code into a series of interactive cells. In 2001, Fernando
Pérez developed a notebook for the Python programming language called IPython
that allowed code, narrative text, mathematical expressions, and inline plots
in a browser-based interface. In 2014, Pérez spun off the open-source Project
Jupyter. Jupyter (a portmanteau of Julia, Python, and R and an homage to
the notebooks in which Galileo recorded his observations of the moons of Jupiter)
extends IPython to dozens of programming languages. Projects like Pluto for
Julia and Observable for JavaScript have further developed notebooks to make
The languages
LAPACK. The GNU Scientific Library (GSL) was initiated in the mid-1990s to
provide a modern replacement to SLATEC.
Cleve Moler, one of the developers of LINPACK and EISPACK, created
MATLAB (a portmanteau of Matrix Laboratory) in the 1970s to give his students
access to these libraries without needing to know Fortran.1 You can appreciate
how much easier it is to simply write
[1 2; 3 4]\[5;6]
and have the solution printed automatically, rather than to write in Fortran
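program main
  implicit none
  external :: sgesv
  real :: A(2, 2)
  real :: b(2)
  integer :: pivot(2)
  integer :: return_code
  ! The listing is cut off here; the remaining lines are a reconstruction.
  A = reshape([1., 3., 2., 4.], [2, 2])
  b = [5., 6.]
  call sgesv(2, 1, A, 2, pivot, b, 2, return_code)
  print *, b
end program main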
which would then still need to be compiled, linked, and run. MATLAB also made
data visualization and plotting easy. While at first MATLAB was little more
than an interactive matrix calculator, Moler later developed MATLAB into a
full computing environment in the early 1980s. MATLAB’s commercial growth
and evolution coincided with the introduction of affordable personal computers,
which were initially barely powerful enough to run software like MATLAB. Over
time, other features, like sparse matrix operations and ODE solvers, were added
to make MATLAB a complete scientific programming language. Still, matrices
are treated as the fundamental data type.
GNU Octave, developed by John Eaton and named after his former professor
Octave Levenspiel, was initiated in the 1980s and first released in the 1990s as
1 In this book, “MATLAB” indicates the commercial numerical computing environment developed
by MathWorks, and “Matlab” indicates the programming language inclusive of Octave. In modern
orthography, uppercase often comes across as SHOUTY CAPS, but before the 1960s, computer
memory and printing limitations made uppercase a necessity. In the 1990s, “FORTRAN,” on which
MATLAB was based and styled, adopted the naming “Fortran,” and “MATLAB” stayed “MATLAB.”
Scientific programming languages 563
Finally, Python (like C) uses 0-based indexing, whereas many other languages,
including Fortran, Matlab, R, and Julia, use 1-based indexing. Arguments can
be made supporting either 0-indexing or 1-indexing. When asked why Python
uses 0-based indexing, Guido van Rossum stated “I think I was swayed by the
elegance of half-open intervals.”
Julia debuted in the early 2010s. When asked about the inspiration behind
Julia’s name, one of the cocreators Stefan Karpinski replied, “There’s no good
reason, really. It just seemed like a pretty name.” What Fortran is to the Silent
Generation, what Matlab is to Baby Boomers, and what Python is to Generation
X, one might say Julia is to Millennials. While Fortran and Matlab are both
certainly showing their age, Julia is a true digital native. It’s designed in an era
of cloud computing and GPUs. It uses Unicode and emojis that permit more
expressive and more readable mathematical notation. The expression 2*pi in
Julia is simply 2π, inputted using TEX conventions followed by a tab autocomplete.
Useful binary operators, like ≈ for “approximately equal to,” are built-in, and
you can define custom binary operators using Unicode symbols. Julia is still
young, and its packages are still evolving.
While Fortran and C are compiled ahead of time, and Python and Matlab
are interpreted scripts, Julia is compiled just in time (JIT) using the LLVM
infrastructure. Because Julia code gets compiled the first time it is run, the first
run can be slow.3 But after that, Julia runs a much faster, cached compiled
version. Julia has arbitrary precision arithmetic (arbitrary-precision integers and
2 The Python operator ^ is a bit-wise XOR. So 7^3 is 4 (111 XOR 011 = 100).
3 Even excruciatingly slow when precompiling some libraries. This compile-time latency is
sometimes referred to as “time to first plot” (TTFP), given how long it can take to load Plots.jl. But
other packages are often slower. For example, DifferentialEquations.jl can take a minute or so on the
first run and around a second after that. Julia continues to improve, and compile-time latency tends to
decrease with each version.
Python 565
B.2 Python
This section provides a Python supplement to the Julia commentary and code
in the body of this book. The code is written and tested using Python version
3.9.7. Page references to the Julia commentary are listed in the left margins. For
brevity, the following packages are assumed to have been imported as needed:
import matplotlib.pyplot as plt
import numpy as np
import scipy.linalg as la
import scipy.sparse as sps
Finally, we’ll define a few helper functions for downloading and displaying images:
import PIL.Image, requests, io
def rgb2gray(rgb): return np.dot(rgb[...,:3], [0.2989,0.5870,0.1140])
def getimage(url):
    response = requests.get(url)
    return np.asarray(PIL.Image.open(io.BytesIO(response.content)))
def showimage(img):
    # uint8 (not int8) keeps gray levels above 127 from wrapping around
    display(PIL.Image.fromarray(np.uint8(img.clip(0,255)), mode='L'))
The Jupyter viewer has a link in its menu to execute the notebook on Binder, which
you can also access using the inner QR code at the bottom of this page. Note that
it may take a few minutes for Binder to launch the repository. Alternatively,
you can download the python.ipynb file and run it on a local computer with
Jupyter and Python. Or, you can upload the notebook to Google Colaboratory
(Colab), Kaggle, or CoCalc, among others. Perhaps the easiest way to run the
code is directly using Colab through the outer QR code at the bottom of this page.
5 Define the array x = np.arange(n). There are several different ways to create a
column vector using NumPy:
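For instance, a few common options can be sketched as follows. The length n = 4 and the variable names a through d are arbitrary choices made for illustration.

```python
import numpy as np

n = 4
x = np.arange(n)          # one-dimensional array of shape (4,)

a = x[:, None]            # indexing with None appends an axis of length one
b = x[:, np.newaxis]      # np.newaxis is an alias for None
c = x.reshape(-1, 1)      # -1 lets NumPy infer the number of rows
d = np.c_[x]              # column-stacks a single one-dimensional array
```

Each result has shape (4, 1), so expressions such as a + a.T broadcast to a 4 × 4 array.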
        A[j+1:,j] /= A[j,j]
        A[j+1:,j+1:] -= np.outer(A[j+1:,j],A[j,j+1:])
    for i in range(1,n):
        b[i] = b[i] - A[i,:i]@b[:i]
    for i in reversed(range(n)):
        b[i] = ( b[i] - A[i,i+1:]@b[i+1:] )/A[i,i]
    return b
42 The following Python code implements the simplex method. We start by defining
functions used for pivot selection and row reduction in the simplex algorithm.
def get_pivot(tableau):
    j = np.argmax(tableau[-1,:-1]>0)
    a, b = tableau[:-1,j], tableau[:-1,-1]
    k = np.argwhere(a > 0)
    i = k[np.argmin(b[k]/a[k])]
    return i,j
def row_reduce(tableau):
    i,j = get_pivot(tableau)
    G = tableau[i,:]/tableau[i,j]
    tableau -= tableau[:,j].reshape(-1,1)*G
    tableau[i,:] = G
46 We’ll construct a sparse, random matrix, use the nnz method to get the number
of nonzeros, and the matplotlib.pyplot function spy to draw the sparsity plot.
A = sps.rand(60,80,density=0.2)
print(A.nnz), plt.spy(A,markersize=1)
49 The NetworkX package provides tools for constructing, analyzing, and visualizing graphs.
49 The following code draws the dolphin network of Doubtful Sound. We’ll
use dolphin gray (#828e84) to color the nodes.
import pandas as pd, networkx as nx
df = pd.read_csv(bucket+'dolphins.csv', header=None)
G = nx.from_pandas_edgelist(df,0,1)
plt.axis('equal'); plt.show()
solution = namedtuple("solution",["z","x","y"])
return solution(z=x@c[:n],x=x,y=y)
57 For a 2000 × 1999 matrix, la.lstsq(A1,b) takes 3.4 seconds and la.pinv(A1)@b
takes 6.3 seconds.
64 Python does not have a dedicated function for the Givens rotation matrix. Instead,
use the scipy.linalg QR decomposition Q,_ = qr([[x],[y]]).
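The following sketch compares the QR-based approach with an explicitly constructed Givens rotation. The sample values x = 3 and y = 4 are illustrative. Because LAPACK builds Q from a Householder reflection, Q may be a reflection rather than a proper rotation, but it is still orthogonal and still zeros the second component.

```python
import numpy as np
import scipy.linalg as la

x, y = 3.0, 4.0

# Orthogonal factor from the QR decomposition of the column [x, y].
Q, _ = la.qr([[x], [y]])
via_qr = Q.T @ np.array([x, y])      # second component is zeroed

# Explicit Givens rotation for comparison.
r = np.hypot(x, y)
c, s = x/r, y/r
G = np.array([[c, s], [-s, c]])
via_givens = G @ np.array([x, y])    # equals [r, 0]
```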
65 The scipy.linalg function qr uses LAPACK routines to compute the QR
factorization of a matrix using Householder reflections.
66 The numpy.linalg and scipy.linalg function lstsq solves an overdetermined
system using the LAPACK routine gelsd (singular value decomposition with divide and conquer).
In SciPy, the lapack_driver argument can instead select gelsy (QR factorization with
column pivoting) or gelss (singular value decomposition).
67 The Zipf’s law coefficients c for the canon of Sherlock Holmes computed using
ordinary least squares are
import pandas as pd
data = pd.read_csv(bucket+'sherlock.csv', sep='\t', header=None)
T = np.array(data[1])
n = len(T)
A = np.c_[np.ones((n,1)),np.log(np.arange(1,n+1)[:, np.newaxis])]
B = np.log(T)[:, np.newaxis]
c = la.lstsq(A,B)[0]
def constrained_lstsq(A,b,C,d):
    x = la.solve(np.r_[np.c_[A.T@A,C.T],
        np.c_[C,np.zeros((C.shape[0],C.shape[0]))]], np.r_[A.T@b,d] )
    return x[:A.shape[1]]
def tls(A,B):
    n = A.shape[1]
    _,_,VT = la.svd(np.c_[A,B])
570 Computing in Python and Matlab
77
A = np.array([[2,4],[1,-1],[3,1],[4,-8]])
b = np.array([[1,1,4,1]]).T
x_ols = la.lstsq(A,b)[0]
x_tls = tls(A,b)
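For reference, here is a self-contained sketch of a total least squares solver built on the SVD of the augmented matrix [A B]. The name tls_sketch and the block partition V12, V22 are illustrative choices and are not taken from the listing above. The solution is X = −V12 V22⁻¹, where V12 and V22 are the upper-right and lower-right blocks of V.

```python
import numpy as np
import scipy.linalg as la

def tls_sketch(A, B):
    """Total least squares solution of A X ≈ B (illustrative sketch)."""
    n = A.shape[1]
    _, _, VT = la.svd(np.c_[A, B])
    V = VT.T
    V12, V22 = V[:n, n:], V[n:, n:]
    # X = -V12 @ inv(V22), computed with a solve instead of an inverse
    return -la.solve(V22.T, V12.T).T

A = np.array([[2, 4], [1, -1], [3, 1], [4, -8]], dtype=float)
b = np.array([[1, 1, 4, 1]], dtype=float).T
x_ols = la.lstsq(A, b)[0]
x_tls = tls_sketch(A, b)
```

When the data are exactly consistent, the smallest singular value of [A B] is zero and the total least squares solution coincides with the exact solution.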
79 We’ll use the helper functions getimage and showimage defined on page 565.
A = rgb2gray(getimage(bucket+'red-fox.jpg'))
U,σ,VT = la.svd(A)
Finally, we’ll show the compressed image and plot the error curve.
showimage(np.c_[A,Ak ])
r = np.sum(A.shape)/np.prod(A.shape)*range(1,min(A.shape)+1)
error = 1 - np.sqrt(np.cumsum(σ**2))/la.norm(σ)
plt.semilogx(r,error,'.-'); plt.show()
def nmf(X,p=6):
    W = np.random.rand(X.shape[0],p)
    H = np.random.rand(p,X.shape[1])
    for i in range(50):
        W = W*(X@H.T)/(W@(H@H.T) + (W==0))
        H = H*(W.T@X)/((W.T@W)@H + (H==0))
    return (W,H)
87 The NumPy function vander generates a Vandermonde matrix with rows given
by $\begin{bmatrix} x_i^{p} & x_i^{p-1} & \cdots & x_i & 1 \end{bmatrix}$ for input $(x_0, x_1, \dots, x_n)$.
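A short illustration of the column ordering, using the arbitrary sample nodes 1, 2, 3:

```python
import numpy as np

x = np.array([1, 2, 3])

V = np.vander(x)                     # powers decrease from left to right
W = np.vander(x, increasing=True)    # powers increase from left to right
M = x[:, None]**np.arange(3)         # same as W, built by broadcasting
```

Here V has rows [1, 1, 1], [4, 2, 1], and [9, 3, 1]. The optional second argument of vander sets the number of columns.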
89 The NumPy function roots finds the roots of a polynomial 𝑝(𝑥) by computing
the eigenvalues of the companion matrix of 𝑝(𝑥).
90 Python does not have a function to compute the eigenvalue condition number.
Instead, we can compute it using:
def condeig(A):
    w, vl, vr = la.eig(A, left=True, right=True)
    # conjugate the left eigenvectors and take magnitudes for complex pairs
    c = 1/np.abs(np.sum(vl.conj()*vr,axis=0))
    return (c, vr, w)
H = np.array([[0,0,0,0,1],[1,0,0,0,0], \
v = ~np.any(H,0)
H = H/(np.sum(H,0)+v)
n = len(H)
d = 0.85;
x = np.ones((n,1))/n
for i in range(9):
    x = d*(H@x) + d/n*(v@x) + (1-d)/n
104 The scipy.linalg function hessenberg computes the unitarily similar upper Hessenberg
form of a matrix.
109 Both numpy.linalg and scipy.linalg have the function eig that computes the
eigenvalues and eigenvectors of a matrix.
114 The scipy.sparse.linalg command eigs computes several eigenvalues of a sparse
matrix using the implicitly restarted Arnoldi process.
137 The scipy.sparse.linalg function cg implements the conjugate gradient method, or the
preconditioned conjugate gradient method if a preconditioner is also provided.
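As a minimal sketch of how cg might be called with a preconditioner, the example below solves a small symmetric positive definite tridiagonal system. The matrix, its size, and the Jacobi (diagonal) preconditioner are assumptions chosen only for illustration.

```python
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import cg

n = 50
# Symmetric, diagonally dominant tridiagonal matrix (hence positive definite).
A = sps.diags([-1, 3, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)

# Jacobi preconditioner: an approximation of the inverse of A.
M = sps.diags(1/A.diagonal())

x, info = cg(A, b, M=M)              # info == 0 signals convergence
residual = np.linalg.norm(A@x - b)/np.linalg.norm(b)
```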
141 The scipy.sparse.linalg function gmres implements the generalized minimum
residual method and minres implements the minimum residual method.
148 The NumPy function kron(A,B) returns the Kronecker product A ⊗ B.
def fftx2(c):
    n = len(c)
    ω = np.exp(-2j*π/n);
    if np.mod(n,2) == 0:
        k = np.arange(n/2)
        u = fftx2(c[:-1:2])
        v = (ω**k)*fftx2(c[1::2])
        return np.concatenate((u+v, u-v))
    k = np.arange(n)[:,None]
    F = ω**(k*k.T);
    return F @ c
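As a quick sanity check, the recursive transform can be compared against NumPy's built-in FFT. The sketch below repeats fftx2 so that it runs on its own and takes π from np.pi. It defines ifftx2 through complex conjugation, which is an assumption about how the inverse might be written and not necessarily the original definition.

```python
import numpy as np

def fftx2(c):
    n = len(c)
    ω = np.exp(-2j*np.pi/n)
    if n % 2 == 0:
        # radix-2 step: combine transforms of even- and odd-indexed entries
        k = np.arange(n//2)
        u = fftx2(c[:-1:2])
        v = (ω**k)*fftx2(c[1::2])
        return np.concatenate((u+v, u-v))
    # odd length: fall back to a direct DFT matrix product
    k = np.arange(n)[:, None]
    return ω**(k*k.T) @ c

def ifftx2(y):
    # inverse transform via conjugation (assumed form)
    return np.conj(fftx2(np.conj(y)))/len(y)

rng = np.random.default_rng(0)
c = rng.standard_normal(12)
difference = np.max(np.abs(fftx2(c) - np.fft.fft(c)))
```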
152 The scipy.signal function convolve returns the convolution of two n-dimensional
arrays using either FFTs or directly, depending on which is faster.
153 The following function returns the fast Toeplitz multiplication of the Toeplitz
matrix with c as its first column and r as its first row and a vector x. We’ll use
fftx2 and ifftx2 that we developed earlier, although in practice it is much faster
to use the built-in NumPy or SciPy functions fft and ifft or the scipy.signal
function convolve.
def fasttoeplitz(c,r,x):
    n = len(x)
    m = (1<<(n-1).bit_length())-n
    # r[:0:-1] is r[n-1],...,r[1], so x1 and x2 both have length 2n+m-1
    x1 = np.concatenate((np.pad(c,(0,m)),r[:0:-1]))
    x2 = np.pad(x,(0,m+n-1))
    return ifftx2(fftx2(x1)*fftx2(x2))[:n]
158 The libraries numpy.fft and scipy.fft both have several functions for computing
FFTs using Cooley–Tukey and Bluestein algorithms. These include fft and ifft
for one-dimensional transforms, fft2 and ifft2 for two-dimensional transforms,
and fftn and ifftn for multi-dimensional transforms. When fft and ifft are
applied to multidimensional arrays, the transforms are along the last dimension
by default. (In Matlab, the transforms are along the first dimension by default.)
Similar functions are available to compute FFTs for real inputs and discrete sine
and discrete cosine transforms. The PyFFTW package provides a Python wrapper
to the FFTW routines. According to the SciPy documentation: “users for whom
the speed of FFT routines is critical should consider installing PyFFTW.”
159 The scipy.fft functions fftshift and ifftshift shift the zero frequency compo-
nent to the center of the array.
160 The following code solves the Poisson equation using a naive fast Poisson solver
and then compares the solution with the exact solution.
from scipy.fft import dstn, idstn
163 The scipy.fft functions dct and dst return the type 1–4 DCT and DST, respectively.
The following function returns a sparse matrix of Fourier coefficients along with
the reconstructed compressed image:
from scipy.fft import dctn, idctn
def dctcompress(A,d):
    B = dctn(A)
    idx = int(d*np.prod(A.size))
    tol = np.sort(abs(B.flatten()))[-idx]
    B[abs(B)<tol] = 0
    return sps.csr_matrix(B), idctn(B)
We’ll test the function on an image and compare it before and after using getimage
and showimage defined on page 565.
A = rgb2gray(getimage(bucket+"red-fox.jpg"))
_, A0 = dctcompress(A,0.001)
181 The NumPy function finfo returns machine limits for floating-point types. For
example, np.finfo(float).eps returns the double-precision machine epsilon
$2^{-52}$ and np.finfo(float).precision returns 15, the approximate precision in
decimal digits.
182 The following function implements the Q_rsqrt algorithm to approximate the
reciprocal square root of a number. Because NumPy may upcast intermediate results to
64-bit integers, we’ll explicitly recast them to 32-bit types.
def Q_rsqrt(x):
    i = np.float32(x).view(np.int32)
    i = (0x5f3759df - (i>>1)).astype(np.int32)
    y = i.view(np.float32)
    return y * (1.5 - (0.5 * x * y * y))
185 The sys command sys.float_info.max returns the largest floating-point number.
The command sys.float_info.min returns the smallest normalized floating-point number.
186 To check for overflows and NaN, use the NumPy commands isinf and isnan. You
can use NaN to lift the “pen off the paper” in a matplotlib plot as in
187 The NumPy functions expm1 and log1p compute $e^x - 1$ and $\log(x + 1)$ more
precisely than exp(x)-1 and log(x+1) in a neighborhood of zero.
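The loss of precision is easy to see for a small argument. In the sketch below, the value x = 1e-10 is an arbitrary choice and the two-term Taylor series x + x²/2 serves as a reference.

```python
import numpy as np

x = 1e-10
naive = np.exp(x) - 1        # subtracts two nearly equal numbers
stable = np.expm1(x)         # avoids the cancellation
reference = x + x**2/2       # Taylor series, accurate to about x**3

naive_error = abs(naive - reference)/reference
stable_error = abs(stable - reference)/reference
```

The naive expression retains only about half of the available significant digits, while expm1 is accurate to roughly machine precision.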
191 A Python implementation of the bisection method for a function f is
def bisection(f,a,b,tolerance):
    while abs(b-a)>tolerance:
        c = (a+b)/2
        if np.sign(f(c))==np.sign(f(a)): a = c
        else: b = c
    return (a+b)/2
197 The scipy.optimize function fsolve(f,x0) uses Powell’s hybrid method to find
the zero of the input function.
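A minimal usage sketch, with the test function cos(x) − x and the starting guess 1.0 chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import fsolve

def f(x):
    return np.cos(x) - x

# fsolve returns an array of roots, one entry per unknown.
root = fsolve(f, 1.0)[0]
```

The result depends on the starting guess when a function has several zeros, so it is worth checking that f(root) is close to zero.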
204 The following function takes the parameters bb for the lower-left and upper-right
corners of the bounding box, xn for the number of horizontal pixels, and n for the
maximum number of iterations. The function returns a two-dimensional array M
that counts the number of iterations $k$ to escape $\{z \in \mathbb{C} \mid |z^{(k)}| > 2\}$.
def escape(n,z,c):
    M = np.zeros_like(c,dtype=int)
    for k in range(n):
        mask = np.abs(z)<2
        M[mask] += 1
        z[mask] = z[mask]**2 + c[mask]
    return M
The following commands produce image (c) of Figure 8.5 on page 204:
import matplotlib.image as mpimg
M = mandelbrot([-0.1710,1.0228,-0.1494,1.0443])
mpimg.imsave('mandelbrot.png', -M, cmap='magma')
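The commands above call a function mandelbrot that builds a grid of complex numbers and passes it to escape. One way such a wrapper might look is sketched below. The name mandelbrot_sketch, the default values xn=200 and n=100, and the ordering of bb as [x_min, y_min, x_max, y_max] are assumptions inferred from the description of the parameters.

```python
import numpy as np

def escape(n, z, c):
    M = np.zeros_like(c, dtype=int)
    for k in range(n):
        mask = np.abs(z) < 2          # points that have not yet escaped
        M[mask] += 1
        z[mask] = z[mask]**2 + c[mask]
    return M

def mandelbrot_sketch(bb, xn=200, n=100):
    # vertical pixel count chosen to preserve the aspect ratio of the box
    yn = max(1, round(xn*(bb[3] - bb[1])/(bb[2] - bb[0])))
    x = np.linspace(bb[0], bb[2], xn)
    y = np.linspace(bb[1], bb[3], yn)
    c = x[None, :] + 1j*y[:, None]    # grid of complex parameters
    return escape(n, np.zeros_like(c), c)

M = mandelbrot_sketch([-2, -1, 1, 1], xn=31, n=50)
```

Entries of M equal to n correspond to points that never escaped within the iteration limit.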
221 The following code uses the gradient descent method to find the minimum of the
Rosenbrock function:
def df(x): return np.array([-2*(1-x[0])-400*x[0]*(x[1]-x[0]**2),
x = np.array([-1.8,3.0]); p = np.array([0.0,0.0])
α = 0.001; β = 0.9
for i in range(500):
    p = -df(x) + β*p
    x += α*p
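The momentum update can be wrapped in a small self-contained function, sketched below. The function names and the quadratic test problem are illustrative assumptions. The gradient uses the standard Rosenbrock function $(1-x_1)^2 + 100(x_2 - x_1^2)^2$.

```python
import numpy as np

def momentum_descent(df, x0, α=0.001, β=0.9, iterations=500):
    """Gradient descent with momentum (illustrative sketch)."""
    x = np.array(x0, dtype=float)
    p = np.zeros_like(x)
    for _ in range(iterations):
        p = -df(x) + β*p      # blend the new gradient with the old direction
        x += α*p
    return x

def rosenbrock_gradient(x):
    return np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                     200*(x[1] - x[0]**2)])

def quadratic_gradient(x):
    # gradient of (x[0]**2 + 10*x[1]**2)/2, whose minimum is the origin
    return np.array([1.0, 10.0])*x

x_rosenbrock = momentum_descent(rosenbrock_gradient, [-1.8, 3.0])
x_quadratic = momentum_descent(quadratic_gradient, [1.0, 1.0], α=0.05, β=0.5)
```

The step size α and momentum β usually need tuning for each problem. Values that work for the quadratic are not necessarily stable for the Rosenbrock function, and vice versa.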
227 The NumPy function vander constructs a Vandermonde matrix. Or you can build
one yourself with x**np.arange(n), where x is a column vector.
238 The following pair of functions determines the coefficients of a cubic spline with
natural boundary conditions given arrays of nodes and then evaluates the spline
using those coefficients. Because C is a tridiagonal matrix, it is more efficient to
use the banded Hermitian solver solveh_banded from scipy.linalg than the
general solver in numpy.linalg.
def spline_natural(x,y):
    h = np.diff(x)
    gamma = 6*np.diff(np.diff(y)/h)
    C = [h[:-1],2*(h[:-1]+h[1:])]
    m = np.pad(la.solveh_banded(C,gamma),(1, 1))
    return m
def evaluate_spline(x,y,m,n):
    h = np.diff(x)
    B = y[:-1] - m[:-1]*h**2/6
    A = np.diff(y)/h-h/6*np.diff(m)
    X = np.linspace(np.min(x),np.max(x),n+1)
    i = np.array([np.argmin(i>=x)-1 for i in X])
    i[-1] = len(x)-2
    Y = (m[i]*(x[i+1]-X)**3 + m[i+1]*(X-x[i])**3)/(6*h[i]) \
        + A[i]*(X-x[i]) + B[i]
    return (X,Y)
239 The scipy.interpolate function make_interp_spline returns a cubic (or any other order)
spline; the older function spline has been removed from recent SciPy releases.
The splprep and splev commands break up the steps of finding the coefficients
for an interpolating spline and evaluating the values of that spline. Perhaps
the easiest approach is the function CubicSpline that returns a cubic spline
interpolant class. The Spline classes in scipy.interpolate are wrappers for the
Dierckx Fortran library.
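A brief sketch of CubicSpline with natural boundary conditions follows. The nine equally spaced samples of sin(x) are an arbitrary choice for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 2*np.pi, 9)
y = np.sin(x)

# bc_type='natural' sets the second derivative to zero at both ends.
cs = CubicSpline(x, y, bc_type='natural')

X = np.linspace(0, 2*np.pi, 101)
Y = cs(X)                 # spline values
dY = cs(X, 1)             # first derivative of the spline
```

The returned object is callable, and its second argument selects the order of the derivative to evaluate.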
241 The scipy.signal function bspline(x,p) evaluates a 𝑝th order B-spline. The
scipy.interpolate function BSpline(t,c,p) constructs a spline using 𝑝th order
B-splines with knots t and coefficients c.
242 The scipy.interpolate function pchip returns a cubic Hermite spline.
245 We can build a Bernstein matrix using the following function, which takes a
column vector such as t = np.linspace(0,1,50)[:,None]:
def bernstein(n,t):
    from scipy.special import comb
    k = np.arange(n+1)[None,:]
    return comb(n,k)*t**k*(1-t)**(n-k)
261 We’ll construct a Chebyshev differentiation matrix and use the matrix to solve a
few simple problems.
def chebdiff(n):
    x = -np.cos(np.linspace(0,π,n))[:,None]
    c = np.outer(np.r_[2,np.ones(n-2),2],(-1)**np.arange(n))
    D = c/c.T/(x - x.T + np.eye(n))
    return D - np.diag(np.sum(D,axis=1)), x
n = 15
D,x = chebdiff(n)
u = np.exp(-4*x**2);
n = 15; k2 = 256
D,x = chebdiff(n)
L = D@D - k2*np.diag(x.flatten())
263 The ChebPy package, a Python implementation of the Matlab Chebfun package,
includes methods for manipulating functions in Chebyshev space.
269 The following function returns 𝑥 and 𝜙(𝑥) of a scaling function with coefficients
given by c and values at integer values of 𝑥 given by z:
def scaling(c,z,n):
    m = len(c); L = 2**n
    ϕ = np.zeros(2*m*L)
    ϕ[0:m*L:L] = z
    for j in range(n):
        for i in range(m*2**j):
            x = (2*i+1)*2**(n-1-j)
            ϕ[x] = sum([c[k]*ϕ[(2*x-k*L)%(2*m*L)] for k in range(m)])
    return np.arange((m-1)*L)/L,ϕ[:(m-1)*L]
sqrt3 = np.sqrt(3)
c = np.array([1+sqrt3,3+sqrt3,3-sqrt3,1-sqrt3])/4
z = np.array([0,1+sqrt3,1-sqrt3,0])/2
x,ϕ = scaling(c,z,8)
plt.plot(x,ϕ); plt.show()
273 The scipy.signal library includes several utilities for wavelet transforms. The
PyWavelets (pywt) package provides a comprehensive suite of utilities.
274 We’ll use the PyWavelets package. The helper functions getimage and showimage
are defined on page 565. The following code gives the two-dimensional Haar
DWT of a grayscale image:
import pywt
def adjustlevels(x): return 1-np.clip(np.sqrt(np.abs(x)),0,1)
A = rgb2gray(getimage(bucket+"laura_square.png"))/255
A = pywt.wavedec2(A,'haar')
c, slices = pywt.coeffs_to_array(A)
Let’s choose a threshold level of 0.5 and plot the resultant image after taking the
inverse DWT.
level = 0.5
c = pywt.threshold(c,level)
c = pywt.array_to_coeffs(c,slices,output_format='wavedec2')
B = pywt.waverec2(c,'haar')
def jacobian(f,x,c):
    J = np.empty([len(x), len(c)])
    n = np.arange(len(c))
    for i in n:
        J[:,i] = np.imag(f(x,c+1e-8j*(i==n)))/1e-8
    return J
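The jacobian function relies on the complex-step approximation $f'(x) \approx \operatorname{Im} f(x + \mathrm{i}h)/h$, which avoids the subtractive cancellation of finite differences. The scalar sketch below illustrates the idea. The helper names and the test function sin are chosen for illustration only.

```python
import numpy as np

def complex_step(f, x, h=1e-8):
    # no subtraction of nearby values, so h can be very small
    return np.imag(f(x + 1j*h))/h

def central_difference(f, x, h=1e-8):
    # suffers from cancellation when h is this small
    return (f(x + h) - f(x - h))/(2*h)

exact = np.cos(1.0)
error_complex = abs(complex_step(np.sin, 1.0) - exact)
error_central = abs(central_difference(np.sin, 1.0) - exact)
```

The approach requires that f accept complex arguments and be analytic near x.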
def gauss_newton(f,x,y,c):
    r = y - f(x,c)
    for j in range(100):
        G = jacobian(f,x,c)
        M = G.T @ G
        c += la.solve(M + np.diag(np.diag(M)),G.T@r)
        r, r0 = y - f(x,c), r
        if la.norm(r-r0) < 1e-10: return c
    print('Gauss-Newton did not converge.')
Assuming that the method converges, we can plot the results using
X = np.linspace(0,8,100)
if c is not None: plt.plot(x,y,'.',X,f(X,c))
In practice, we can use the scipy.optimize function curve_fit, which uses the
Levenberg–Marquardt method for unconstrained problems.
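A minimal sketch of curve_fit on synthetic data follows. The exponential model, the parameter values 2.5 and 0.7, and the initial guess are assumptions made for illustration. Note that curve_fit expects the model parameters as separate arguments rather than as a single array c.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a*np.exp(-b*x)

x = np.linspace(0, 8, 40)
y = model(x, 2.5, 0.7)               # noise-free synthetic data

# c holds the fitted parameters; covariance estimates their uncertainty.
c, covariance = curve_fit(model, x, y, p0=[1.0, 1.0])
```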
286 We’ll first define the logistic function and generate synthetic data. Then, we
apply Newton’s method.
def σ(x): return 1/(1+np.exp(-x))
x = np.random.rand(30)[:,None]
y = (np.random.rand(30)[:,None] < σ(10*x-5))
X, w = np.c_[np.ones_like(x),x], np.zeros((2,1))
for i in range(10):
    S = σ(X@w)*(1-σ(X@w))
    w += la.solve(X.T@(S*X), X.T@(y-σ(X@w)))
After determining the parameters W1 and W2, we can plot the results.
x̂ = np.linspace(-1.2,1.2,200); x̂ = np.c_[np.ones_like(x̂),x̂].T
ŷ = W2 @ ϕ(W1@x̂)
L = la.norm(ŷ-y)
dLdy = 2*(ŷ-y)/L
Python has several open-source deep learning frameworks. The most popular
include TensorFlow (developed by Google), PyTorch (developed by Facebook),
and MXNet (developed by Apache). We’ll use TensorFlow and the Keras library,
which provides a high-level interface to TensorFlow. First, let’s load Keras and
TensorFlow and generate some training data.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
n = 20; N = 100;
θ = tf.linspace(0.0,π,N)
x = tf.math.cos(θ); y = tf.math.sin(θ) + 0.05*tf.random.normal([N])
Now, we can define the model, train it, and examine the outputs.
model = keras.Sequential(
layers.Dense(n, input_dim=1, activation='relu'),
model.compile(loss='mean_squared_error', optimizer='SGD')
ŷ = model.predict(x)
plt.plot(x,ŷ,color='#000000'); plt.scatter(x,y,color='#ff000050')
This example uses the default stochastic gradient descent optimizer. We can also
choose a specialized optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
import math
d = np.array([-1,0,1,2])[:,None]
n = len(d)
V = d**np.arange(n) / [math.factorial(i) for i in range(n)]
C = la.inv(V)
def richardson(f,x,m,n):
    if n==0: return ϕ(f,x,2**m)
    return (4**n*richardson(f,x,m,n-1)-richardson(f,x,m-1,n-1))/(4**n-1)
where we define
def ϕ(f,x,n): return (f(x+1/n) - f(x-1/n))/(2/n)
301 The Python package JAX and the older autograd implement forward and reverse
automatic differentiation. We can build a minimal working example of forward
accumulation automatic differentiation by defining a class and overloading the
base operators:
# helpers so that plain numbers can be mixed with Dual objects
def value(u): return u.value if isinstance(u, Dual) else u
def deriv(u): return u.deriv if isinstance(u, Dual) else 0
class Dual:
    def __init__(self, value, deriv=1):
        self.value = value
        self.deriv = deriv
    def __add__(u, v):
        return Dual(value(u) + value(v), deriv(u) + deriv(v))
    __radd__ = __add__
    def __sub__(u, v):
        return Dual(value(u) - value(v), deriv(u) - deriv(v))
    def __rsub__(u, v): # computes v - u when v is not a Dual
        return Dual(value(v) - value(u), deriv(v) - deriv(u))
    def __mul__(u, v):
        return Dual(value(u)*value(v),
            value(v)*deriv(u) + value(u)*deriv(v))
    __rmul__ = __mul__
def sin(u):
    return Dual(np.sin(value(u)),np.cos(value(u))*deriv(u))
x1 = Dual(2,np.array([1,0]))
x2 = Dual(π,np.array([0,1]))
y1 = x1*x2 + sin(x2)
y2 = x1*x2 - sin(x2)
304 Python has several packages for automatic differentiation, including dedicated
libraries like JAX and machine learning frameworks like TensorFlow and PyTorch.
306 We can use the following trapezoidal quadrature to make a Romberg method
using Richardson extrapolation:
def trapezoidal(f,x,n):
    F = f(np.linspace(x[0],x[1],n+1))
    return (F[0]/2 + sum(F[1:-1]) + F[-1]/2)*(x[1]-x[0])/n
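One way the trapezoidal rule might be combined with Richardson extrapolation is sketched below. The function name romberg_sketch and the default of six levels are assumptions. The sketch repeats trapezoidal so that it runs on its own.

```python
import numpy as np

def trapezoidal(f, x, n):
    F = f(np.linspace(x[0], x[1], n+1))
    return (F[0]/2 + sum(F[1:-1]) + F[-1]/2)*(x[1] - x[0])/n

def romberg_sketch(f, x, levels=6):
    # first column: trapezoidal estimates with 1, 2, 4, ... subintervals
    R = [trapezoidal(f, x, 2**k) for k in range(levels)]
    # each pass cancels the next even power of the step size
    for j in range(1, levels):
        R = [(4**j*R[i+1] - R[i])/(4**j - 1) for i in range(len(R) - 1)]
    return R[0]

estimate = romberg_sketch(np.sin, (0, np.pi))   # exact integral is 2
```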
308 The following code examines the convergence rate of the composite trapezoidal
rule using the function $x + x^p(2-x)^p$ on the interval $[0,2]$:
n = np.logspace(1,2,num=10).astype(int)
error = np.zeros((10,7))
def f(x,p): return ( x + x**p*(2-x)**p )
for p in range(1,8):
    S = trapezoidal(lambda x: f(x,p),(0,2),10**6)
    for i in range(len(n)):
        Sn = trapezoidal(lambda x: f(x,p),(0,2),n[i])
        error[i,p-1] = abs(Sn - S)/S
A = np.c_[np.log(n),np.ones_like(n)]
x = np.log(error)
s = np.linalg.lstsq(A,x,rcond=None)[0][0]
info = ['{0}: slope={1:0.1f}'.format(k+1,s[k]) for k in range(7)]
lines = plt.loglog(n,error)
plt.legend(lines, info); plt.show()
def gauss_legendre(n):
    a = np.zeros(n)
    b = np.arange(1,n)**2 / (4*np.arange(1,n)**2 - 1)
    scaling = 2
    nodes, v = la.eigh_tridiagonal(a, np.sqrt(b))
    return ( nodes, scaling*v[0,:]**2 )
349
r = np.exp(2j*π*np.linspace(0,1,100))
z = (3/2*r**2 - 2*r + 0.5)/r**2
plt.plot(z.real,z.imag); plt.axis('equal'); plt.show()
351 The following function determines the coefficients for the stencil given by m and n:
def multistepcoefficients(m,n):
    s = len(m) + len(n) - 1
    A = (np.array(m)+1)**np.c_[range(s)]
    B = [[i*((j+1)**max(0,i-1)) for j in n] for i in range(s)]
    c = la.solve(-np.c_[A[:,1:],B],np.ones((s,1))).flatten()
    return ( np.r_[1,c[:len(m)-1]], c[len(m)-1:] )
def plotstability(a,b):
    r = np.exp(1j*np.linspace(0,2*π,200))
    z = [np.dot(a,r**np.arange(len(a))) / \
        np.dot(b, r**np.arange(len(b))) for r in r]
    plt.plot(np.real(z),np.imag(z)); plt.axis('equal')
377 The Python recipe for solving a differential equation is quite similar to Julia’s
recipe, except that the “define the problem” step is merged into the “solve the
problem” step. SciPy’s solve_ivp does not provide the trapezoidal method, so the
Bogacki–Shampine method (RK23) is used instead.
If a method is not explicitly stated, Python will use the default solver RK45. By
default, the values of sltn.t are those determined and used for adaptive time
stepping. Because such steps can be quite large and far from uniform, especially
for high-order methods, plots using sltn.t and sltn.y may not look smooth.
Counterintuitively, higher-order methods such as DOP853 that use smoother
interpolating polynomials produce rougher (though still accurate) plots than
lower-order methods such as RK23:
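[Figure: solutions of the pendulum problem computed with RK23 (left) and DOP853 (right), plotted at the adaptive time steps for 0 ≤ t ≤ 10.]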
sltn = solve_ivp(pndlm,tspan,u0,method=mthd,dense_output=True)
t = np.linspace(tspan[0],tspan[1],200)
y = sltn.sol(t)
plt.legend(labels=['θ','ω']); plt.show()
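The dense-output workflow can be sketched end to end as follows. The right-hand side function pendulum, the time span, and the initial condition are assumptions chosen for illustration and need not match the problem used in the figure.

```python
import numpy as np
from scipy.integrate import solve_ivp

def pendulum(t, u):
    θ, ω = u
    return [ω, -np.sin(θ)]            # nondimensional simple pendulum

tspan = (0, 10)
u0 = [8*np.pi/9, 0]

# dense_output=True attaches a continuous interpolant sltn.sol.
sltn = solve_ivp(pendulum, tspan, u0, method='RK23', dense_output=True)

t = np.linspace(tspan[0], tspan[1], 200)
y = sltn.sol(t)                       # shape (2, 200): rows are θ and ω
```

Plotting y against the uniform grid t gives a smooth curve even when the adaptive steps stored in sltn.t are large and uneven.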
380 Let’s first define the DAE for the pendulum problem. We’ll use manifold
projection: x = x/ℓ; y = y/ℓ; u = -v*y/x
def pendulum(t,U,p):
    x,y,u,v = U; ℓ,g,m = p
    x = x/ℓ; y = y/ℓ; u = -v*y/x # manifold projection
    τ = m*(u**2 + v**2 + g*y)/ℓ**2
    return (u, v, -τ*x/m, -τ*y/m + g)
385 The following code solves the pendulum problem using orthogonal collocation.
We’ll use the gauss_legendre function and differentiation_matrix function
defined on page 610 to compute the Gauss–Lobatto nodes and differentiation matrix.
from scipy.optimize import fsolve
n = 6; N = 100; Δt = 30/N
θ = π/3; ℓ = 1
u0 = [ℓ*np.sin(θ), -ℓ*np.cos(θ), 0, 0, 0]
u = np.concatenate([np.tile(i,n) for i in u0])
M,t = differentiation_matrix(n,Δt)
def D(u,u0): return M@(u - u0)
def pendulum(U,U0,n):
    x,y,u,v,τ = U[:n],U[n:2*n],U[2*n:3*n],U[3*n:4*n],U[4*n:5*n]
    x0,y0,u0,v0,τ0 = U0
    ℓ,g,m = (1,1,1)
    return np.r_[D(x,x0) - u,
        D(y,y0) - v,
        D(u,u0) + τ*x/m,
        D(v,v0) + τ*y/m - g,
        x**2 + y**2 - ℓ**2]
U = u.reshape(5,-1)
for i in range(N):
    u0 = U[:,-1]
    u = fsolve(pendulum,u,args=(u0,n))
    U = np.c_[U,u.reshape(5,-1)]
plt.plot(U[0,:],U[1,:]); plt.gca().axis('equal')
plt.gca().invert_yaxis(); plt.show()
model.Equation(m*v.dt()==-τ*y + m*g)
model.time = np.linspace(0,30,200)
model.options.IMODE=4 # dynamic mode
model.options.NODES=3 # number of collocation nodes
plt.plot(x.value,y.value); plt.gca().axis('equal')
plt.gca().invert_yaxis(); plt.show()
after it. This code takes roughly 0.003 seconds on a typical laptop.
396 The scipy.linalg package has several routines for specialized matrices includ-
ing solve_banded for banded matrices, solveh_banded for symmetric, banded
matrices, and solve_circulant for circulant matrices (by dividing in Fourier
space). The scipy.sparse.linalg function spsolve can be used to solve general
sparse systems.
397 The following code solves the heat equation using the Crank–Nicolson method.
While we can use numpy.dot and scipy.linalg.solve_banded for tridiagonal
multiplication and inversion, let’s use Python’s sparse matrix routines instead.
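A compact sketch of a Crank–Nicolson step built from sparse matrices is shown below. The function name, the homogeneous Dirichlet boundary conditions, the grid, and the sine initial condition are all assumptions chosen so that the result can be compared with the exact solution $e^{-\pi^2 t}\sin \pi x$.

```python
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import spsolve

def crank_nicolson_heat(u0, dx, dt, steps, alpha=1.0):
    """Advance u_t = alpha*u_xx on interior nodes with zero boundary values."""
    n = len(u0)
    nu = alpha*dt/dx**2
    D = sps.diags([1, -2, 1], [-1, 0, 1], shape=(n, n), format='csc')
    I = sps.identity(n, format='csc')
    A = I - 0.5*nu*D                  # implicit half of the step
    B = I + 0.5*nu*D                  # explicit half of the step
    u = u0.copy()
    for _ in range(steps):
        u = spsolve(A, B@u)
    return u

n = 99
dx = 1/(n + 1)
x = dx*np.arange(1, n + 1)            # interior grid points of [0, 1]
u = crank_nicolson_heat(np.sin(np.pi*x), dx, dt=0.001, steps=100)
exact = np.exp(-np.pi**2*0.1)*np.sin(np.pi*x)
```

For many time steps, it is usually worth factoring A once, for example with scipy.sparse.linalg.splu, instead of calling spsolve repeatedly.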
406 We’ll use the LSODA solver, which automatically switches between Adams–Moulton
routines for nonstiff problems and BDF routines for stiff problems. Using the ipywidgets
package, we can plot the results in a Jupyter notebook as an interactive animation.
462 The numpy.fft function fftfreq returns the DFT sample frequencies.
iξ = 1j*np.fft.fftfreq(n,1/n)*(2*π/L)
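As an illustration of how these frequencies are used, the sketch below differentiates a periodic function spectrally. The grid size n = 64, the period L = 2π, and the test function sin(3x) are assumptions made for the example.

```python
import numpy as np

n, L = 64, 2*np.pi
x = np.arange(n)*L/n                          # periodic grid on [0, L)

# fftfreq(n, 1/n) returns integer wave numbers in FFT ordering.
iξ = 1j*np.fft.fftfreq(n, 1/n)*(2*np.pi/L)

u = np.sin(3*x)
du = np.real(np.fft.ifft(iξ*np.fft.fft(u)))   # approximates 3*cos(3*x)
```

Multiplying by iξ in Fourier space corresponds to differentiating in physical space, so higher derivatives follow from higher powers of iξ.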
473 The Python code to solve the Navier–Stokes equation has three parts: define the
functions, initialize the variables, and iterate over time. We start by defining
functions for Ĥ and for the flux used in the Lax–Wendroff scheme.
from numpy.fft import fft2,ifft2,fftfreq
def cdiff(Q,step=1): return Q-np.roll(Q,step,0)
def flux(Q,c): return c*cdiff(Q,1) - \
def H(u,v,iξx,iξy): return fft2(ifft2(u)*ifft2(iξx*u) + \
for i in range(1200):
Q -= flux(Q,(Δt/Δx)*np.real(ifft2(v))) + \
Hxo, Hyo = Hx, Hy
Hx, Hy = H(u,v,iξx,iξy), H(v,u,iξy,iξx)
us = u - us + (-1.5*Hx + 0.5*Hxo + M1*u)/M2
vs = v - vs + (-1.5*Hy + 0.5*Hyo + M1*v)/M2
ϕ = (iξx*us + iξy*vs)/(ξ2+(ξ2==0))
u, v = us - iξx*ϕ, vs - iξy*ϕ
plt.imshow(Q,'seismic'); plt.show()
D = np.diag(np.ones(n-1),1) \
- 2*np.diag(np.ones(n),0) + np.diag(np.ones(n-1),-1)
483
A = np.array([[0,1,0,1],[1,0,2,1],[0,2,0,0],[1,1,0,0]])
def walks(A,i,j,N):
    return [np.linalg.matrix_power(A,n+1)[i-1,j-1] for n in range(N)]
484 The following function implements a naïve reverse Cuthill–McKee algorithm for
symmetric matrices:
def rcuthillmckee(A):
    r = np.argsort(np.bincount(A.nonzero()[0]))
    while r.size:
        q = np.atleast_1d(r[0])
        r = np.delete(r,0)
        while q.size:
            try: p = np.append(p,q[0])
            except: p = np.atleast_1d(q[0])
            k = sps.find(A[q[0],[r]])[1]
            q = np.append(q[1:],r[k])
            r = np.delete(r,k)
    return np.flip(p)
The Julia command A[p,p] permutes the rows and columns of A with the
permutation array p. But, the similar-looking Python command A[p,p] selects
elements along the diagonal. Instead, we should take A[p[:,None],p]. We can
test the solution using the following code:
A = sps.random(1000,1000,0.001); A += A.T
p = rcuthillmckee(A)
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.spy(A,ms=1); ax2.spy(A[p[:,None],p],ms=1)
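The difference between the two indexing forms is easiest to see on a small dense array. The 3 × 3 matrix and the permutation below are arbitrary examples.

```python
import numpy as np

A = np.arange(9).reshape(3, 3)       # [[0,1,2],[3,4,5],[6,7,8]]
p = np.array([2, 0, 1])

diagonal = A[p, p]                   # pairs the indices: A[2,2], A[0,0], A[1,1]
permuted = A[p[:, None], p]          # broadcasts to every (row, column) pair
same = A[np.ix_(p, p)]               # equivalent form for dense NumPy arrays
```

Here A[p,p] is a one-dimensional array, while A[p[:,None],p] is the full 3 × 3 matrix with rows and columns permuted.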
A = nx.adjacency_matrix(G)
p = rcuthillmckee(A)
A = A[p[:,None],p]
G = nx.from_scipy_sparse_array(A)
plt.axis('equal'); plt.show()
485 The following code solves the Stigler diet problem using the naïve simplex
algorithm developed on page 42.
import pandas as pd
diet = pd.read_csv(bucket+'diet.csv')
A = diet.values[1:,3:].T
b = diet.values[0,3:][:,None]
c = np.ones(((A.shape)[1],1))
food = diet.values[1:,0]
solution = simplex(b,A.T,c)
print("value: ", solution.z)
i = np.argwhere(solution.y!=0).flatten()
print("foods: ", food[i])
486 Let’s first define a couple of helper functions to import the data.
def get_names(filename):
    return np.genfromtxt(bucket+filename+'.txt',delimiter='\n',\
def get_adjacency_matrix(filename):
    i = np.genfromtxt(bucket+filename+'.csv',delimiter=',',dtype=int)
    return sps.csr_matrix((np.ones_like(i[:,0]), (i[:,0]-1,i[:,1]-1)))
We use the scipy.sparse command bmat to form a block matrix. Then we define
functions to find the shortest path between nodes a and b.
actors = get_names("actors")
movies = get_names("movies")
B = get_adjacency_matrix("actor-movie")
A = sps.bmat([[None,B.T],[B,None]],format='csr')
actormovie = np.r_[actors,movies]
def findpath(A,a,b):
    p = -np.ones(A.shape[1],dtype=np.int64)
    q = [a]; p[a] = -9999; i = 0
    while i<len(q):
        k = sps.find(A[q[i],:])[1]
        k = k[p[k]==-1]
        q.extend(k) # enqueue the newly discovered nodes
        p[k] = q[i]; i += 1
        if any(k==b): return backtrack(p,b)
    display("No path.")
def backtrack(p,b):
    s = [b]; i = p[b]
    while i != -9999: s.append(i); i = p[i]
    return s[::-1]
488
X = rgb2gray(getimage(bucket+"laura.png"))
m,n = X.shape
def blur(x): return np.exp(-(x/20)**2/2)
A = [[blur(i-j) for i in range(m)] for j in range(m)]
A /= np.sum(A,axis=1)
B = [[blur(i-j) for i in range(n)] for j in range(n)]
B /= np.sum(B,axis=0)
N = 0.01*np.random.rand(m,n)
Y = A@X@B + N
α = 0.05
X1 = la.lstsq(A,Y)[0]
X2 = la.solve(A.T@A+α**2*np.eye(m),A.T@Y).T
X2 = la.solve(B@B.T+α**2*np.eye(n),B@X2).T
X3 = la.pinv(A,α) @ Y @ la.pinv(B,α)
import pandas as pd
df = pd.read_csv(bucket+'filip.csv',header=None)
y,x = np.array(df[0]),np.array(df[1])
coef = pd.read_csv(bucket+'filip-coeffs.csv',header=None)
β = np.array(coef[0])[None,:]
Let’s also solve the problem and plot the solutions for the standardized data.
def zscore(X,x): return (X - x.mean())/x.std()
k1 = np.linalg.cond(vandermonde(x,11))
k2 = np.linalg.cond(vandermonde(zscore(x,x),11))
print("Condition numbers of the Vandermonde matrix:")
c,r = solve_filip(zscore(x,x),zscore(y,y))
for i in range(3):
    Y = build_poly(c[i],zscore(X,x))*y.std() + y.mean()
    plt.plot(X, Y)
490 We’ll use pandas to download a CSV file. Pandas is built on top of NumPy, so
converting from a pandas data frame to a NumPy array is straightforward.
import pandas as pd
df = pd.read_csv(bucket+'dailytemps.csv')
t = pd.to_datetime(df["date"]).values
day = (t - t[0])/np.timedelta64(365, 'D')
u = df["temperature"].values[:,None]
def tempsmodel(t): return np.c_[np.sin(2*π*t),\
np.cos(2*π*t), np.ones_like(t)]
c = la.lstsq(tempsmodel(day),u)[0]
plt.plot(day,tempsmodel(day)@c,'k'); plt.show()
490 We’ll use the Keras library to load the MNIST data. We first compute the sparse
SVD of the training set to get singular vectors.
from keras.datasets import mnist
from scipy.sparse.linalg import svds
(image_train,label_train),(image_test,label_test) = mnist.load_data()
image_train = np.reshape(image_train, (60000,-1));
V = np.zeros((12,784,10))
for i in range(10):
    D = sps.csr_matrix(image_train[label_train==i], dtype=float)
    U,S,V[:,:,i] = svds(D,12)
We predict the best digit associated with each test image and build a confusion
matrix to check the method’s accuracy.
image_test = np.reshape(image_test, (10000,-1));
r = np.zeros((10,10000))
for i in range(10):
    q = V[:,:,i].T@(V[:,:,i] @ image_test.T) - image_test.T
    r[i,:] = np.sum(q**2,axis=0)
prediction = np.argmin(r,axis=0)
confusion = np.zeros((10,10)).astype(int)
for i in range(10):
    confusion[i,:] = np.bincount(prediction[label_test==i],minlength=10)
q = Q[:,actors.index("Steve Martin")]
z = Q.T @ q
r = np.argsort(-z)
[actors[i] for i in r[:10]]
p = (U*S) @ q
r = np.argsort(-p)
[(genres[i],p[i]/p.sum()) for i in r[:10]]
493
X = np.array([[3,3,12],[1,15,14],[10,2,13],[12,15,14],[0,11,12]])
reference = X[0,:]; X = X - reference
A = np.array([2,2,-2])*X
b = (X**2)@np.array([[1],[1],[-1]])
x_ols = la.lstsq(A,b)[0] + reference[:,None]
x_tls = tls(A,b) + reference[:,None]
495 We’ll define a function for the implicit QR method and verify it on a matrix with
known eigenvalues.
def implicitqr(A):
    tolerance = 1e-12
    n = len(A)
    H = la.hessenberg(A)
    while True:
        if abs(H[n-1,n-2]) < tolerance:
            n -= 1
            if n<2: return np.diag(H)
        Q,_ = la.qr([[H[0,0]-H[n-1,n-1]], [H[1,0]]])
        H[:2,:n] = Q @ H[:2,:n]
        H[:n,:2] = H[:n,:2] @ Q.T
        for i in range(1,n-1):
            Q,_ = la.qr([[H[i,i-1]], [H[i+1,i-1]]])
            H[i:i+2,:n] = Q @ H[i:i+2,:n]
            H[:n,i:i+2] = H[:n,i:i+2] @ Q.T
n = 20; S = np.random.randn(n,n);
D = np.diag(np.arange(1,n+1)); A = S @ D @ la.inv(S)
496 We’ll use the helper functions get_names and get_adjacency_matrix from
page 593.
actors = get_names("actors")
B = get_adjacency_matrix("actor-movie")
r,c = (B.T@B).nonzero()
M = sps.csr_matrix((np.ones(len(r)),(r,c)))
v = np.ones(M.shape[0])
for k in range(10):
    v = M@v; v /= np.linalg.norm(v)
r = np.argsort(-v)
[actors[i] for i in r[:10]]
496 The following function approximates the SVD by starting with a set of 𝑘 random
vectors and performing a few steps of the naïve QR method to generate a 𝑘-dimensional
subspace that is relatively close to the space of dominant singular vectors:
def randomizedsvd(A,k):
    Z = np.random.rand(A.shape[1],k)
    Q,R = la.qr(A@Z, mode='economic')
    for i in range(4):
        Q,R = la.qr(A.T @ Q, mode='economic')
        Q,R = la.qr(A @ Q, mode='economic')
    W,S,VT = la.svd(Q.T @ A,full_matrices=False)
    U = Q @ W
    return (U,S,VT)
Let’s convert an image to an array, compute its randomized SVD, and then display
the original image side-by-side with the rank-reduced version.
A = rgb2gray(getimage(bucket+'red-fox.jpg'))
U, S, VT = randomizedsvd(A,10);
img = np.c_[A, np.minimum(np.maximum((U*S)@VT ,0),255)]
sps.kron(I,sps.kron(I,D)) )/Δx**2
f = np.array([(x-x**2)*(y-y**2) + (x-x**2)*(z-z**2)+(y-y**2)*(z-z**2)
for x in ξ for y in ξ for z in ξ])
ue = np.array([(x-x**2)*(y-y**2)*(z-z**2)/2
for x in ξ for y in ξ for z in ξ])
def conjugategradient(A,b,n=400):
    ϵ = []; u = np.zeros_like(b)
    r = b - A@u; p = np.copy(r)
    for i in range(n):
        Ap = A@p
        α = np.dot(r,p)/np.dot(Ap,p)
        u += α*p; r -= α*Ap
        β = np.dot(r,Ap)/np.dot(Ap,p)
        p = r - β*p
        ϵ = np.append(ϵ,la.norm(u - ue,1))
    return ϵ
ϵ = np.zeros((400,4))
ϵ[:,0] = stationary(A,-f,0)
ϵ[:,1] = stationary(A,-f,1)
ϵ[:,2] = stationary(A,-f,1.9)
ϵ[:,3] = conjugategradient(A,-f)
plt.legend(["Jacobi","Gauss-Seidel","SOR","Conj. Grad."]); plt.show()
The three stationary methods take about 13, 34, and 34 seconds to complete 400
iterations. In contrast, the conjugate gradient method takes around 1.2 seconds to
complete the same number of iterations—and just 0.5 seconds to reach machine
precision. The SciPy routine spsolve uses approximate minimum degree column
ordering as the default, which would destroy the lower triangular structure.
Instead, stationary sets the option permc_spec = 'NATURAL' to maintain the
natural ordering. While SciPy has a dedicated sparse triangular matrix routine
spsolve_triangular, it is excruciatingly slow, taking almost a second to complete
just one iteration of the stationary method! And directly solving the problem
using spsolve(A,-f) takes over 12 minutes.
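The function stationary is not reproduced here. As a rough sketch of the idea, assuming the standard SOR splitting with the lower triangular matrix P = D/ω + L, each sweep solves a sparse triangular system with spsolve while keeping the natural ordering. The function name sor, its default arguments, and the small one-dimensional test matrix are our own choices and may differ from stationary.

```python
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import spsolve

def sor(A, b, omega=1.0, n=400):
    """Run n sweeps of successive over-relaxation on A u = b (sketch only)."""
    D = sps.diags(A.diagonal())
    # lower triangular matrix: diagonal scaled by 1/omega plus the strict lower part
    P = (D/omega + sps.tril(A, -1)).tocsc()
    u = np.zeros(len(b))
    for _ in range(n):
        r = b - A @ u
        # 'NATURAL' keeps the row and column order, so P stays lower triangular
        u += spsolve(P, r, permc_spec='NATURAL')
    return u

# small one-dimensional Poisson test problem (an assumed example)
m = 20
A = sps.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m), format='csr')
b = np.ones(m)
u = sor(A, b, omega=1.5)
print(np.max(np.abs(A @ u - b)))
```

Setting omega=1 in this sketch gives Gauss–Seidel; omega must be nonzero here, so Jacobi would need a separate branch that uses D alone.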
500 Python doesn’t have a convenient function for generating primes. We could write
a function that implements the Sieve of Eratosthenes, but it will be easier to
import a list of the primes (pregenerated using Julia).
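For readers who would rather generate the primes locally, here is a minimal sketch of the Sieve of Eratosthenes mentioned above. The function name primes_up_to is our own and is not part of any library.

```python
def primes_up_to(n):
    """Return a list of all primes p <= n using the Sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = [True]*(n + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p*p <= n:
        if is_prime[p]:
            # cross out the multiples of p, starting at p*p
            for multiple in range(p*p, n + 1, p):
                is_prime[multiple] = False
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]

print(primes_up_to(30))
```

The sieve needs memory proportional to n, which is fine for the modest ranges used in the exercises.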
def fftx3(c):
    n = len(c)
    ω = np.exp(-2j*π/n);
    if np.mod(n,3) == 0:
        k = np.arange(n/3)
        u = np.stack((fftx3(c[:-2:3]),\
            ω**k * fftx3(c[1:-1:3]),\
            ω**(2*k) * fftx3(c[2::3])))
        F = np.exp(-2j*π/3)**np.array([[0,0,0],[0,1,2],[0,2,4]])
        return (F @ u).flatten()
    k = np.arange(n)[:,None]
    F = ω**(k*k.T);
    return F @ c
def multiply(p_,q_):
    from scipy.signal import fftconvolve
    p = np.flip(np.array([int(i) for i in list(p_)]))
    q = np.flip(np.array([int(i) for i in list(q_)]))
    pq = np.rint(fftconvolve(p,q)).astype(int)
    pq = np.r_[pq,0]
    carry = pq//10
    while (np.any(carry)):
        pq -= carry*10
        pq[1:] += carry[:-1]
        carry = pq//10
    return ''.join([str(i) for i in np.flip(pq)]).lstrip('0')
We can alternatively use the numpy.fft commands rfft and irfft. We just need
to ensure that the zero-padded array has an even length n. In this case, we use
def idct(f):
    n = f.shape[0]
    ω = np.exp(-0.5j*π*np.arange(n)/n).reshape(-1,1)
    i = [n-(i+1)//2 if i%2 else i//2 for i in range(n)]
    f[0,:] = f[0,:]/2
    return np.real(ifft(f/ω,axis=0))[i,:]*2
503 The following function returns a sparse matrix of Fourier coefficients along with
the reconstructed compressed image:
510 We’ll find the solution to the system of equations in exercise 8.12. Let’s first
define the system and the gradient.
def f(x,y): return ( np.array([(x**2+y**2)**2-2*(x**2-y**2),
(x**2+y**2-1)**3-x**2*y**3]) )
def df(x,y): return(np.array([ \
[4*x*(x**2+y**2-1), 4*y*(x**2+y**2+1)],
[6*x*(x**2+y**2-1)**2-2*x*y**3, \
We’ll write a function that computes point addition and point doubling. 511
def addpoint(P,Q):
    a = 0
    r = (1<<256) - (1<<32) - 977
    if P[0] == Q[0]:
        d = pow(2*P[1], -1, r)
        λ = ((3*pow(P[0],2,r)+a)*d) % r
    else:
        d = pow(Q[0] - P[0], -1, r)
        λ = ((Q[1] - P[1])*d) % r
    x = (pow(λ,2,r) - P[0] - Q[0]) % r
    y = (-λ*(x - P[0]) - P[1]) % r
    return [x,y]
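A quick way to check point addition and doubling is to confirm that the results stay on the curve and that different routes to the same multiple agree. The sketch below assumes the curve is secp256k1, 𝑦² = 𝑥³ + 7, which matches a = 0 and the prime above. The helpers on_curve, find_point, and scalar_multiply are hypothetical names of our own, and addpoint is restated (with the modulus named p) so the block runs by itself. It requires Python 3.8 or later for the modular inverse pow(x, -1, p).

```python
p = (1 << 256) - (1 << 32) - 977    # field prime (called r in addpoint above)

def addpoint(P, Q):
    # doubling when the x-coordinates match, ordinary addition otherwise;
    # the case Q = -P (point at infinity) is not handled in this sketch
    if P[0] == Q[0]:
        d = pow(2*P[1], -1, p)
        lam = (3*pow(P[0], 2, p)*d) % p
    else:
        d = pow(Q[0] - P[0], -1, p)
        lam = ((Q[1] - P[1])*d) % p
    x = (pow(lam, 2, p) - P[0] - Q[0]) % p
    y = (-lam*(x - P[0]) - P[1]) % p
    return [x, y]

def on_curve(P):
    # assumed curve equation: y^2 = x^3 + 7 (mod p)
    return (P[1]**2 - P[0]**3 - 7) % p == 0

def find_point():
    # since p % 4 == 3, a square root of s (if one exists) is s**((p+1)//4) mod p
    x = 1
    while True:
        s = (x**3 + 7) % p
        y = pow(s, (p + 1)//4, p)
        if (y*y) % p == s and y != 0:
            return [x, y]
        x += 1

def scalar_multiply(k, P):
    # left-to-right double-and-add for an integer k >= 1
    R = P
    for bit in bin(k)[3:]:
        R = addpoint(R, R)
        if bit == '1':
            R = addpoint(R, P)
    return R

P = find_point()
P2 = addpoint(P, P)
P3 = addpoint(P2, P)
print(on_curve(P), on_curve(P2), on_curve(P3))
print(addpoint(P2, P2) == addpoint(P3, P))
```

The last line compares 2𝑃 + 2𝑃 with 3𝑃 + 𝑃; both should equal 4𝑃 if the arithmetic is right.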
514 The following function modifies spline_natural on page 576 to compute the
coefficients $\{m_0, m_1, \dots, m_{n-1}\}$ for a spline with periodic boundary conditions:
def spline_periodic(x,y):
    h = np.diff(x)
    d = 6*np.diff(np.diff(np.r_[y[-2],y])/np.r_[h[-1],h])
    α = h[:-1]
    β = h + np.r_[h[-1],h[:-1]]
    C = np.diag(β)+np.diag(α,1)
    C[0,-1]=h[-1]; C += C.T
    m = la.solve(C,d)
    return np.r_[m,m[0]]
Now we can compute a parametric spline interpolant with nx*n points through a
set of n random points using the function evaluate_spline defined on page 576:
n, nx = 10, 20
x, y = np.random.rand(n), np.random.rand(n)
x, y = np.r_[x,x[0]], np.r_[y,y[0]]
t = np.cumsum(np.sqrt(np.diff(x)**2+np.diff(y)**2))
t = np.r_[0,t]
T,X = evaluate_spline(t,x,spline_periodic(t,x),nx*n)
T,Y = evaluate_spline(t,y,spline_periodic(t,y),nx*n)
plt.plot(X,Y,x,y,'o'); plt.show()
514 The following code computes and plots the interpolating curves:
n = 20; N = 200
x = np.linspace(-1,1,n)[:,None]
X = np.linspace(-1,1,N)[:,None]
y = (x>0)
Y1 = interp(ϕ1,x)
Y2 = interp(ϕ2,x)
Y3 = interp(ϕ3,np.arange(n))
plt.ylim((-.5,1.5)); plt.show()
def solve(L,f,bc,x):
h = x[1]-x[0]
S = np.array([[1,-1/2,1/6],[-2,0,2/3],[1,1/2,1/6]]) \
S = np.r_[np.zeros((1,3)),L(x)@S.T,np.zeros((1,3))]
d = np.r_[bc[0], f(x), bc[1]]
A = np.diag(S[1:,0],-1) + np.diag(S[:,1]) + np.diag(S[:-1,2],1)
A[0,:3] , A[-1,-3:] = np.array([1,4,1])/6 , np.array([1,4,1])/6
return la.solve(A,d)
n = 15; N = 141
L = lambda x: np.c_[x,np.ones_like(x),x]
f = lambda x: np.zeros_like(x)
b = jn_zeros(0,4)[-1]
x = np.linspace(0,b,n)
c = solve(L,f,[1,0],x)
X,Y = build(c,x,N)
plt.plot(X,Y,X,j0(X)); plt.show()
519 Let’s compute the fractional derivatives of e−16𝑥 and 𝑥(1 − |𝑥|).
import tensorflow as tf
from tensorflow import keras
from keras.layers import Conv2D, AvgPool2D, Dense, Flatten
model = keras.models.Sequential([
Conv2D(16, 5, activation='tanh'),
Conv2D(120, 5, activation='tanh'),
Dense(84, activation='tanh'),
Dense(10, activation='sigmoid')
Let’s load the MNIST training and test data. We’ll convert each 28 × 28-pixel
image into a 28 × 28 × 1-element floating-point array with values between zero
and one.
from keras.datasets import mnist
(image_train,label_train),(image_test,label_test) = mnist.load_data()
image_train = tf.expand_dims(image_train/255.0, 3)
image_test = tf.expand_dims(image_test/255.0, 3)
We compile the model, defining the loss function, the optimizer, and the metrics.
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
Finally, let’s see how well the model works on an independent set of similar data.
The LeNet-5 model has about a two percent error rate. With more epochs, it has
about a one percent error rate.
We’ll first solve the toy problem. 521
Now, let’s apply the method to the ShotSpotter data. First, we get the data.
import pandas as pd
df = pd.read_csv(bucket+'shotspotter.csv')
X = df.iloc[:-1].to_numpy()
x_nls[0] - df.iloc[-1].to_numpy()
524 We’ll extend the Dual class on page 582 by adding methods for division, cosine,
and square root to the class definition.
525 The following function computes the nodes and weights for Gauss–Legendre
quadrature by using Newton’s method to find the roots of $\mathrm{P}_n(x)$:
def gauss_legendre(n):
x = -np.cos((4*np.arange(n)+3)*π/(4*n+2))
Δ = np.ones_like(x)
dP = 0
P0, P1 = x, np.ones_like(x)
for k in range(2,n+1):
P0, P1 = ((2*k - 1)*x*P0-(k-1)*P1)/k, P0
dP = n*(x*P0 - P1)/(x**2-1)
Δ = P0 / dP
x -= Δ
return ( x, 2/((1-x**2)*dP**2) )
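As a cross-check, here is a self-contained sketch of the same Newton iteration with an explicit stopping rule. The name gauss_legendre_newton and the parameters tol and maxiter are our own assumptions; the three-term recurrence, the derivative formula, and the initial guesses follow the listing above.

```python
import numpy as np

def gauss_legendre_newton(n, tol=1e-15, maxiter=100):
    # initial guesses for the roots of the Legendre polynomial P_n
    x = -np.cos((4*np.arange(n) + 3)*np.pi/(4*n + 2))
    for _ in range(maxiter):
        # recurrence k P_k = (2k-1) x P_{k-1} - (k-1) P_{k-2}
        P_prev, P = np.ones_like(x), x.copy()      # P_0 and P_1
        for k in range(2, n + 1):
            P_prev, P = P, ((2*k - 1)*x*P - (k - 1)*P_prev)/k
        # derivative from (x^2 - 1) P_n' = n (x P_n - P_{n-1})
        dP = n*(x*P - P_prev)/(x**2 - 1)
        dx = P/dP
        x = x - dx                                  # Newton update
        if np.max(np.abs(dx)) < tol:
            break
    return x, 2/((1 - x**2)*dP**2)

nodes, weights = gauss_legendre_newton(5)
# a five-point rule integrates x^4 exactly; the integral over [-1,1] is 2/5
print(weights @ nodes**4)
```

NumPy's numpy.polynomial.legendre.leggauss returns the same nodes and weights and makes a convenient reference.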
def mc_π(n,d,m):
    return sum(sum(np.random.rand(d,n,m)**2)<1)/n*2**d
def differentiation_matrix(n,Δt=1):
    nodes, _ = gauss_legendre(n+1,lobatto=True)
    t = (nodes[1:]+1)/2
    A = np.vander(t,increasing=True)*np.arange(1,n+1)
    B = np.diag(t)@np.vander(t,increasing=True)
    return (A@la.inv(B)/Δt, t)
We’ll use the scipy.optimize function fsolve to solve the nonlinear system
from scipy.optimize import fsolve
N = 20; Δt = 1.0/N; n = 8; α
M,t = differentiation_matrix(n,Δt)
def D(u,u0): return M@(u - u0)
u0 = 1.0; U = np.array(u0); T = np.array(0)
for i in range(N):
    u = fsolve(F,u0*np.ones(n),args=(u0,α))
    u0 = u[-1]
    T = np.append(T,(i + t)*Δt)
    U = np.append(U,u)
plt.plot(T, U, marker="o")
529 The following code plots the region of absolute stability for a Runge–Kutta
method with tableau A and b:
N = 100; n = b.shape[1]
r = np.zeros((N,N))
E = np.ones((n,1))
x,y = np.linspace(-4,4,N),np.linspace(-4,4,N)
for i in range(N):
    for j in range(N):
        z = x[i] + 1j*y[j]
        r[j,i] = abs(1 + z*b@(la.solve(np.eye(n) - z*A,E)))
plt.contour(x,y,r,[1]); plt.show()
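The code above expects b to be a two-dimensional row array. As an illustration, assuming the classical fourth-order Runge–Kutta tableau (our choice of example), the quantity inside abs should reduce to the degree-four Taylor polynomial of e^𝑧. The helper name stability_function is our own.

```python
import numpy as np

# classical fourth-order Runge-Kutta tableau (an assumed example)
A = np.array([[0, 0, 0, 0],
              [0.5, 0, 0, 0],
              [0, 0.5, 0, 0],
              [0, 0, 1, 0]])
b = np.array([[1, 2, 2, 1]])/6     # row vector, so b.shape[1] is the number of stages

def stability_function(z, A, b):
    # R(z) = 1 + z b (I - zA)^{-1} E, the same expression used in the plotting loop
    n = b.shape[1]
    E = np.ones((n, 1))
    return (1 + z*b @ np.linalg.solve(np.eye(n) - z*A, E)).item()

z = -2.0
print(abs(stability_function(z, A, b)))
print(abs(1 + z + z**2/2 + z**3/6 + z**4/24))
```

Both printed values should agree, since an explicit four-stage, fourth-order method has a polynomial stability function.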
530
i = np.arange(4)[:,None]
def factorial(k): return np.cumprod(np.r_[1,range(1,k)])
c1 = la.solve(((-i)**i.T/factorial(4)).T,np.array([0,1,0,0]))
c2 = la.solve(((-(i+1))**i.T/factorial(4)).T,np.array([1,0,0,0]))
531 The following function returns the orbit of points in the complex plane for
an 𝑛th-order Adams–Bashforth–Moulton PE(CE)$^m$ method. It calls the function
multistepcoefficients defined on page 585.
def PECE(n,m):
_,a = multistepcoefficients([0,1],range(1,n))
_,b = multistepcoefficients([0,1],range(0,n))
for i in range(5):
z = PECE(4,i)
plt.axis('equal'); plt.show()
def pade(a,m,n):
    A = np.eye(m+n+1);
    for i in range(n): A[i+1:,m+i+1] = -a[:m+n-i]
    pq = la.solve(A,a[:m+n+1])
    return pq[:m+1], np.r_[1,pq[m+1:]]
Now, compute the coefficients of $P_m(x)$ and $Q_n(x)$ for the Taylor polynomial
approximation of log(𝑥 + 1).
m = 3; n = 2
a = np.r_[0, (-1)**np.arange(m+n)/(1+np.arange(m+n))]
p,q = pade(a,m,n)
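Before shifting the coefficients, it can be reassuring to check that the rational approximation beats the Taylor polynomial it was built from. The following sketch restates pade using numpy.linalg.solve so that it runs without SciPy; the evaluation point x = 0.5 is an arbitrary choice of ours.

```python
import numpy as np

def pade(a, m, n):
    # same linear system as above, solved with NumPy instead of SciPy
    A = np.eye(m + n + 1)
    for i in range(n):
        A[i+1:, m+i+1] = -a[:m+n-i]
    pq = np.linalg.solve(A, a[:m+n+1])
    return pq[:m+1], np.r_[1, pq[m+1:]]

m, n = 3, 2
a = np.r_[0, (-1)**np.arange(m+n)/(1 + np.arange(m+n))]   # Taylor coefficients of log(1+x)
p, q = pade(a, m, n)

x = 0.5                                            # test point (our choice)
# coefficients are stored lowest degree first, so reverse them for polyval
rational = np.polyval(p[::-1], x)/np.polyval(q[::-1], x)
taylor = np.polyval(a[::-1], x)
exact = np.log1p(x)
print(abs(rational - exact), abs(taylor - exact))
```

The first error should be noticeably smaller than the second, even though both use the same six Taylor coefficients.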
Finally, shift and combine the coefficients using upper inverse Pascal matrices.
535 The following code solves the Airy equation over the domain 𝑥 = (−12, 0) using
the shooting method. The function shoot_airy solves the initial value problem
using two initial conditions: the given boundary condition 𝑦(−12) and our guess
for 𝑦′(−12). The function returns the difference between the value 𝑦(0) computed by
solve_ivp and our given boundary condition 𝑦(0). We then use fsolve to
find the zero-error initial value.
from scipy.integrate import solve_ivp
from scipy.optimize import fsolve
def airy(x,y): return (y[1],x*y[0])
domain = [-12,0]; bc = [1,1]; guess = 5
def shoot_airy(guess):
    sol = solve_ivp(airy,domain,[bc[0],guess[0]])
    return sol.y[0,-1] - bc[1]
v = fsolve(shoot_airy,guess)[0]
m = [0,1,2,3,4]; n = [1]
a,b = multistepcoefficients(m,n)
b = np.r_[0,b]
537 The following code implements the Dufort–Frankel method to solve the heat
equation. (As mentioned elsewhere, this approach is not recommended.)
dx = 0.01; dt = 0.01; n = 400
L = 1; x = np.arange(-L,L,dx); m = len(x)
U = np.empty((n,m))
U[0,:] = np.exp(-8*x**2); U[1,:] = U[0,:]
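The time-stepping loop itself is not shown here. A minimal sketch of one way to write it follows, assuming the heat equation 𝑢ₜ = 𝑢ₓₓ with unit diffusivity and periodic boundary conditions (both assumptions of ours). The variable nu and the use of np.roll are our own choices; the setup lines repeat the parameters above so the block runs by itself.

```python
import numpy as np

dx = 0.01; dt = 0.01; n = 400
L = 1; x = np.arange(-L, L, dx); m = len(x)
U = np.empty((n, m))
U[0, :] = np.exp(-8*x**2); U[1, :] = U[0, :]

nu = dt/dx**2
for j in range(1, n - 1):
    # sum of left and right neighbors at time level j (periodic wraparound assumed)
    neighbors = np.roll(U[j, :], 1) + np.roll(U[j, :], -1)
    # Dufort-Frankel: the center value is replaced by the average of levels j-1 and j+1
    U[j + 1, :] = ((1 - 2*nu)*U[j - 1, :] + 2*nu*neighbors)/(1 + 2*nu)

print(U[-1, :].sum()*dx)
```

With periodic boundaries, the discrete integral of the solution is conserved from step to step, which gives a simple sanity check on the update.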
We can use the ipywidgets library to build an interactive plot of the solution.
We’ll loop over several values for time steps and mesh sizes and plot the error.
541 We’ll solve a radially symmetric heat equation. Although we divide by zero at
𝑟 = 0 when constructing the Laplacian operator, we subsequently overwrite the
resulting inf term when we apply the boundary condition.
def logitspace(x,n,p):
    return x*np.arctanh(np.linspace(-p,p,n))/np.arctanh(p)
def ϕ(x,t,s):
    return np.exp(-s*x**2/(1+4*s*t))/np.sqrt(1+4*s*t)
t = 15; m = 40
x = logitspace(20,m,.999)
u = heat_equation(x,t,ϕ(x,0,10))
plt.plot(x,u,'.-',x,ϕ(x,t,10),'k'); plt.show()
x = np.linspace(-20,20,m)
u = heat_equation(x,t,ϕ(x,0,10))
plt.plot(x,u,'.-',x,ϕ(x,t,10),'k'); plt.show()
F = (fu[1:-1]+fu[:-2])/2 - α*(u[1:]-u[:-1])/2
u -= c*(np.diff(np.r_[0,F,0]))
554 The following code solves the KdV equation using integrating factors. We first
set the parameters.
from scipy.fft import fftshift, fft, ifft
def ϕ(x,x0,c): return 0.5*c/np.cosh(np.sqrt(c)/2*(x-x0))**2
L = 30; T = 2.0; m = 256
x = np.linspace(-L/2,L/2,m,endpoint=False)
iξ = 1j*fftshift(np.arange(-m/2,m/2))*(2*π/L)
Next, we define the integrating factor G and the right-hand side function f.
def G(t): return np.exp(-iξ**3*t)
def f(t,w): return -(3*iξ*fft(ifft(G(t)*w)**2))/G(t)
The first line of this code block displays the animation as HTML using JavaScript.
Alternatively, we can replace this line with
plt.rcParams["animation.html"] = "html5"
to convert the animation to HTML5 (which requires the ffmpeg video codecs).
B.3 Matlab
This section provides a Matlab supplement to the Julia commentary and code
elsewhere in the book. This book uses Matlab to describe the scientific program-
ming language and syntax common to both MATLAB,4 the proprietary computer
software sold by MathWorks, and GNU Octave, the open-source computer
software developed by John Eaton and others. The book explicitly uses MATLAB
or Octave when referring to a particular implementation of the language.
The code is written and tested using Octave version 6.4.0. Page references
to the Julia commentary are listed in the left margins. For brevity, the variable
bucket is assumed throughout. The function rgb2gray is a standard function
in MATLAB. It is also available in Octave in the image toolbox, or we can write it ourselves:
bucket = "https://raw.githubusercontent.com/nmfsc/data/master/";
rgb2gray = @(x) 0.2989*x(:,:,1) + 0.5870*x(:,:,2) + 0.1140*x(:,:,3);
You can download the IPYNB file and run it on a local computer with Octave
and Jupyter. You can also run the code directly on Binder using the QR link at
the bottom of this page. Note that it may take a few minutes for Binder to
launch the repository.
of the Python Software Foundation, Julia® is a registered trademark of Julia Computing, Inc., and
GNU® is a registered trademark of Free Software Foundation.
Octave notebook
on Binder
5 While Julia’s syntax tends to prefer column vectors, Matlab’s syntax tends to
prefer row vectors. The syntax for a row vector includes [1 2 3 4], [1,2,3,4],
and 1:4. The syntax for a column vector is [1;2;3;4].
23
n = [10,15,20,25,50];
for i = 1:5,
imshow(1-abs(hilb(n(i))\hilb(n(i))),[0 1])
29 The type and edit commands display the contents of an m-file, if available.
Many basic functions are built-in and do not have an m-file.
30 The Matlab command rref returns the reduced row echelon form.
function b = gaussian_elimination(A,b)
  n = length(A);
  for j=1:n
    A(j+1:n,j) = A(j+1:n,j)/A(j,j);
    A(j+1:n,j+1:n) = A(j+1:n,j+1:n) - A(j+1:n,j).*A(j,j+1:n);
  end
  for i=2:n
    b(i) = b(i) - A(i,1:i-1)*b(1:i-1);
  end
  for i=n:-1:1
    b(i) = ( b(i) - A(i,i+1:n)*b(i+1:n) )/A(i,i);
  end
end
42 The following Matlab code implements the simplex method. We start by defining
functions used for pivot selection and row reduction in the simplex algorithm.
45 The MATLAB function linprog solves the LP problem. The built-in Octave
function glpk (GNU Linear Programming Kit) can solve LP problems using
either the revised simplex method or the interior point method.
46 Matlab’s backslash and forward slash operators use UMFPACK to solve sparse
linear systems.
46 We’ll construct a sparse, random matrix, get the number of zeros, and draw the
sparsity plot.
49 The function gplot plots the vertices and edges of an adjacency matrix. MATLAB,
but not Octave, has additional functions for constructing and analyzing graphs.
49 Matlab lacks many of the tools for drawing graphs, but we can construct them.
First, let’s write naïve functions that will draw force-directed, spectral, and
circular graph layouts. Although the force-directed layout can be implemented
using a forward Euler scheme, it may be easier to use an adaptive timestep ODE
solver to manage stability.
function f = spring(A,z)
  n = length(z); k = 2*sqrt(1/n);
  d = z - z.'; D = abs(d)/k;
  F = -(A.*D - 1./(D+eye(n)).^2);
  f = sum(F.*d,2);
end
function xy = spring_layout(A,z)
  n = size(A,1);
  z = rand(n,1) + 1i*rand(n,1);
  [t,z] = ode45(@(t,z) spring(A,z),[0,10],z);
  xy = [real(z(end,:));imag(z(end,:))]';
end
function xy = spectral_layout(A)
  D = diag(sum(A,2));
  [v,d] = eig(D - A);
  [~,i] = sort(diag(d));
  xy = v(:,i(2:3));
end
function xy = circular_layout(A)
  n = size(A,1); t = (2*pi*(1:n)/n)';
  xy = [cos(t) sin(t)];
end
52 The Matlab function symrcm returns the permutation vector using the reverse
Cuthill–McKee algorithm for symmetric matrices, and symamd returns the permutation
vector using the symmetric approximate minimum degree permutation.
For a 2000 × 1999 matrix A\b takes 3.8 seconds and pinv(A)*b takes almost 56 57
The function givens(x,y) returns the Givens rotation matrix for (𝑥, 𝑦). 64
67 The Zipf’s law coefficients c for the canon of Sherlock Holmes computed using
ordinary least squares are
data = urlread([bucket 'sherlock.csv']);
T = cell2mat(textscan(data,'%s\t%d')(2));
n = length(T);
A = [ones(n,1) log(1:n)'];
B = log(T);
c1 = A\B
function x = constrained_lstsq(A,b,C,d)
x = [A'*A C'; C zeros(size(C,1))]\([A'*b;d])
x = x(1:size(A,2))
function X = tls(A,B)
n = size(A,2);
[~,~,V] = svd([A B],0);
X = -V(1:n,n+1:end)/V(n+1:end,n+1:end);
A = rgb2gray(imread([bucket "laura.png"]));
[U, S, V] = svd(A,0);
sigma = diag(S);
We can confirm that the error $\|A - A_k\|_F^2$ matches $\sum_{i=k+1} \sigma_i^2$ by computing
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
norm(double(A)-Ak,'fro') - norm(sigma(k+1:end))
The values of the compressed image no longer lie between 0 and 1—instead
ranging between −0.11 and 1.18. The command imshow(Ak) clamps the values
of a floating-point array between 0 and 1. Finally, let’s plot the error:
r = sum(size(A))/prod(size(A))*(1:min(size(A)));
error = 1 - sqrt(cumsum(sigma.^2))/norm(sigma);
89 The function roots finds the roots of a polynomial 𝑝(𝑥) by computing eig for
the companion matrix of 𝑝(𝑥).
The following code computes the PageRank of the graph in Figure 4.3. 97
H = [0 0 0 0 1; 1 0 0 0 0; 1 0 0 0 1; 1 0 1 0 0; 0 0 1 1 0];
v = ~any(H);
H = H./(sum(H)+v);
n = length(H);
d = 0.85;
x = ones([n 1])/n;
for i = 1:9
  x = d*(H*x) + d/n*(v*x) + (1-d)/n;
end
104 The function hess returns the unitarily similar upper Hessenberg form of a matrix
using LAPACK.
109 The function eig computes the eigenvalues and eigenvectors of a matrix with a
LAPACK library that implements the shifted QR method.
114 The function eigs computes several eigenvalues of a sparse matrix using the
implicitly restarted Arnoldi process.
141 The function gmres implements the generalized minimum residual method and
minres implements the minimum residual method.
function y = fftx2(c)
  n = length(c);
  omega = exp(-2i*pi/n);
  if mod(n,2) == 0
    k = (0:n/2-1)';
    u = fftx2(c(1:2:n-1));
    v = (omega.^k).*fftx2(c(2:2:n));
    y = [u+v; u-v];
  else
    k = (0:n-1)';
    F = omega.^(k*k');
    y = F*c;
  end
end
158 The Matlab function fft implements the Fastest Fourier Transform in the
West (FFTW) library. The inverse FFT can be computed using ifft and
multidimensional FFTs can be computed using fftn and ifftn.
159 The functions fftshift and ifftshift shift the zero frequency component to the
center of the array.
160 Matlab does not have a three-dimensional DST function, so we’ll need to build
one ourselves. The function dst3 computes the three-dimensional DST (6.4) by
computing the DST over each dimension. The function dstx3 computes a DST in
one dimension for a three-dimensional array. The real command in the function
dstx3 is only needed to remove the round-off error from the FFT. By distributing
the scaling constant $\sqrt{2/(n+1)}$ across both the DST and the inverse DST, we
only need to implement one function for both.
function a = dst3(a)
a = dstx3(shiftdim(a,1));
a = dstx3(shiftdim(a,1));
a = dstx3(shiftdim(a,1));
function a = dstx3(a)
n = size(a); o = zeros(1,n(2),n(3));
y = [o;a;o;-a(end:-1:1,:,:)];
y = fft(y);
a = imag(y(2:n(1)+1,:,:)*(sqrt(2*(n(1)+1))));
The program takes less than a third of a second to run for 𝑛 = 50.
163 The MATLAB function dct returns the type 1–4 DCT. The equivalent function
in Octave’s signal package returns a DCT-2. The following function returns a
sparse matrix of Fourier coefficients along with the reconstructed compressed image:
We’ll test the function on an image and compare it before and after.
A = rgb2gray(imread([bucket "laura.png"]));
[~,A0] = dctcompress(A,0.01);
function b = float_to_bin(x)
b = sprintf("%s",dec2bin(hex2dec(num2cell(num2hex(x))),4)');
if x<0, b(1) = '1'; end
181 The constant eps returns the double-precision machine epsilon of $2^{-52} \approx 10^{-16}$.
The function eps(x) returns the double-precision round-off error at 𝑥. For
example, eps(0) returns $2^{-1074} \approx 10^{-323}$. Of course, eps(1) is the same as eps.
185 The command realmax returns the largest floating-point number. The command
realmin returns the smallest normalized floating-point number.
186 To check for overflows and NaN, use the Matlab logical commands isinf and
isnan. You can use NaN to lift the “pen off the paper” in a Matlab plot as in
187 The functions expm1 and log1p compute $e^x - 1$ and $\log(x + 1)$ more precisely
than exp(x)-1 and log(x+1) in a neighborhood of zero.
197 The function fzero(f,x0) uses the Dekker–Brent method to find a zero of the
input function. The function f can either be a string (which is a function of x),
an anonymous function, or the name of an m-file.
function M = mandelbrot(bb,xn,n,z)
yn = round(xn*(bb(4)-bb(2))/(bb(3)-(bb(1))));
z = z*ones(yn,xn);
c = linspace(bb(1),bb(3),xn) + 1i*linspace(bb(4),bb(2),yn)';
M = escape(n,z,c);
M = mandelbrot([-0.1710,1.0228,-0.1494,1.0443],800,200,0);
208 The function polyval(p,x) uses Horner’s method to evaluate a polynomial with
coefficients p = [𝑎𝑛, . . . , 𝑎1, 𝑎0], ordered from the highest degree to the constant term, at x.
211 The function roots returns the roots of a polynomial by finding the eigenvalues
of the companion matrix.
221 The following code uses the gradient descent method to find the minimum of the
Rosenbrock function:
p = -df(x) + b*p;
x = x + a*p;
In practice, we can use the optimization toolbox. The fminbnd function uses the
golden section search method to find the minimum of a univariate function. The
fminsearch function uses the Nelder–Mead method to find the minimum of a
multivariate function. The fminunc function uses the BFGS method to find the
minimum of a multivariate function.
227 The function vander generates a Vandermonde matrix with rows given by
$[x_i^{n} \;\; x_i^{n-1} \;\; \cdots \;\; 1]$. Add fliplr to reverse them: $[1 \;\; \cdots \;\; x_i^{n-1} \;\; x_i^{n}]$.
238 The following function computes the coefficients m of a cubic spline with natural
boundary conditions through the nodes given by the arrays x and y.
function m = spline_natural(x,y)
h = diff(x(:));
gamma = 6*diff(diff(y(:))./h);
alpha = diag(h(2:end-1),-1);
beta = 2*diag(h(1:end-1)+h(2:end),0);
m = [0;(alpha+beta+alpha')\gamma;0];
We can then use (9.3) and (9.4) to evaluate that spline using 𝑛 points.
The function takes column-vector inputs for x and y and outputs column vectors
X and Y with length N. Note that the line i = sum(x<=X'); broadcasts the
element-wise operator <= across a row and a column array.
239 The function spline returns a cubic spline with the not-a-knot condition.
241 The function pp = mkpp(b,C) returns a structured array pp that can be used to
build piecewise polynomials of order 𝑛 using the function y = ppval(pp,x). The
input b is a vector of breaks (including endpoints) with length 𝑝 + 1 and C is a
𝑝 × 𝑛 array of the polynomial coefficients in each segment.
245 We can build a Bernstein matrix in Matlab from a column vector t with the
following function.
B = @(n,t)[1 cumprod((n:-1:1)./(1:n))].*t.^(0:n).*(1-t).^(n:-1:0);
261 We’ll construct a Chebyshev differentiation matrix and use the matrix to solve a
few simple problems.
function [D,x] = chebdiff(n)
x = -cos(linspace(0,pi,n))';
c = [2;ones(n-2,1);2].*(-1).^(0:n-1);
D = c./c'./(x - x' + eye(n));
D = D - diag(sum(D,2));
n = 15;
[D,x] = chebdiff(n);
u = exp(-4*x.^2);
B = zeros(1,n); B(1) = 1;
n = 15; k2 = 256;
[D,x] = chebdiff(n);
L = (D^2 - k2*diag(x));
273 The Large Time-Frequency Analysis Toolbox (LTFAT) includes several utilities
for wavelet transforms and is available from http://ltfat.github.io under the GNU
General Public License.
A = rgb2gray(double(imread([bucket "laura_square.png"])))/255;
c = fwt2(A,"db2",9);
Let’s choose a threshold level of 0.5 and plot the resultant image after taking the
inverse DWT.
c = thresh(c,0.5);
B = ifwt2(c,"db2",9);
function J = jacobian(f,x,c)
for k = (n = 1:length(c))
J(:,k) = imag(f(x,c+1e-8i*(k==n)'))/1e-8;
Assuming that the Gauss–Newton method converged, we can plot the results.
X = linspace(0,8,100);
286 We’ll first define the logistic function and generate synthetic data. Then, we
apply Newton’s method.
Let’s start by generating some training data and setting up the neural network 289
N = 100; t = linspace(0,pi,N);
x = cos(t); x = [ones(1,N);x];
y = sin(t) + 0.05*randn(1,N);
n = 20; W1 = rand(n,2); W2 = randn(1,n);
phi = @(x) max(x,0);
dphi = @(x) (x>0);
alpha = 1e-3;
for epoch = 1:10000
  s = W2 * phi(W1*x);
  dLdy = 2*(s-y);
  dLdW1 = dphi(W1*x) .* (W2' * dLdy) * x';
  dLdW2 = dLdy * phi(W1*x)';
  W1 -= 0.1 * alpha * dLdW1;
  W2 -= alpha * dLdW2;
end
After determining the parameters W1 and W2, we can plot the results.
s = W2 * phi(W1*x);
scatter(x(2,:),y,'r','filled'); hold on
L = norm(s- y);
dLdy = 2*(s-y)/L;
Coefficients are given by rats(C) and the coefficients of the truncation error are
given by rats(C*d.^n/factorial(n)):
298 The command rats returns the rational approximation of a number as a string by
using continued fraction expansion. For example, rats(0.75) returns 3/4.
function D = richardson(f,x,m,n)
  if n > 0
    D = (4^n*richardson(f,x,m,n-1) - richardson(f,x,m-1,n-1))/(4^n-1);
  else
    D = phi(f,x,2^m);
  end
end
define a class, which we’ll call dual, with the properties value and deriv for
the value and derivative of a variable. We can overload the built-in functions +
(plus), * (mtimes), and sin to operate on the dual class. We save the following
code as the m-file dual.m. When working in Jupyter, we can add %%file dual.m
to the top of a cell to write the cell to a file.
%%file dual.m
classdef dual
  properties
    value
    deriv
  end
  methods
    function obj = dual(a,b)
      obj.value = a;
      obj.deriv = b;
    end
    function h = plus(u,v)
      if ~isa(u,'dual'), u = dual(u,0); end
      if ~isa(v,'dual'), v = dual(v,0); end
      h = dual(u.value + v.value, u.deriv + v.deriv);
    end
    function h = mtimes(u,v)
      if ~isa(u,'dual'), u = dual(u,0); end
      if ~isa(v,'dual'), v = dual(v,0); end
      h = dual(u.value*v.value, u.deriv*v.value + u.value*v.deriv);
    end
    function h = sin(u)
      h = dual(sin(u.value), cos(u.value)*u.deriv);
    end
    function h = minus(u,v)
      h = u + (-1)*v;
    end
  end
end
Then y.value returns 0 and y.deriv returns 2, which agrees with 𝑦(0) = 0 and
𝑦′(0) = 2.
x1 = dual(2,[1 0]);
x2 = dual(pi,[0 1]);
y1 = x1*x2 + sin(x2);
y2 = x1*x2 - sin(x2);
308
n = floor(logspace(1,2,10));
for p = 1:7,
  f = @(x) x + x.^p.*(2-x).^p;
  S = trapezoidal(f,[0,2],1e6);
  for i = 1:length(n)
    Sn = trapezoidal(f,[0,2],n(i));
    error(i) = abs(Sn - S)/S;
  end
  slope(p) = [log(error)/[log(n);ones(size(n))]](1);
  loglog(n,error,'.-k'); hold on;
end
function a = dctI(f)
n = length(f);
g = real( fft( [f;f(n-1:-1:2)] ) ) / 2;
a = [g(1)/2; g(2:n-1); g(n)/2];
weights = scaling*v(1,:).^2;
nodes = diag(s);
f = @(x) cos(x).*exp(-x.^2);
[nodes,weights] = gauss_legendre(n);
weights * f(nodes)
349
r = exp(2i*pi*(0:0.01:1));
plot((1.5*r.^2 - 2*r + 0.5)./r.^2); axis equal;
The following function returns the coefficients for a stencil given by m and n: 351
function plotstability(a,b)
r = exp(1i*linspace(.01,2*pi-0.01,400));
z = (r'.^(1:length(a))*a) ./ (r'.^(1:length(b))*b);
plot(z); axis equal
377 The Matlab recipe for solving a differential equation is similar to the Julia recipe.
But rather than calling a general solver and specifying the method, each method
in Matlab is its own solver. Fortunately, the syntax for each solver is almost
The parameter tspan may be either a two-element vector specifying initial and
final times or a longer, strictly increasing or decreasing vector. In the two-vector
case, the solver returns the intermediate solutions at each adaptive time step. In
the longer-vector case, the solver interpolates the values between the adaptive
time steps to provide solutions at the points given in tspan. In MATLAB, we
can alternatively request a structure sol as the output of the ODE solver. We can
subsequently evaluate sol at an arbitrary array of values t using the function
u = deval(sol,t). This approach is similar to the modern approaches in Julia
and Python. Octave does not have the function deval.
380 Let’s first define the DAE for the pendulum problem. We’ll use manifold
projection: 𝑥 = 𝑥/ℓ, 𝑦 = 𝑦/ℓ, and 𝑢 = −𝑣𝑦/𝑥.
We can also solve semi-explicit DAEs using ode15s and fully-implicit DAEs
using ode15i. We’ll set the parameters ℓ, 𝑚, and 𝑔 to unit values to better see
what’s happening in the code.
The solution gradually drifts off and is eventually entirely wrong. Let’s also solve
the fully implicit problem using ode15i.
function F = pendulum(t,U,dU)
x = U(1); y = U(2); u = U(3); v = U(4); tau = U(5);
dx = dU(1); dy = dU(2); du = dU(3); dv = dU(4);
F = [dx - u;
dy - v;
du + tau*x;
dv + tau*y - 1;
-tau + y + (u^2+v^2)];
We’ll use the function decic to determine consistent initial conditions for ode15i.
theta = pi/3; tspan = [0,30];
u0 = [sin(theta); -cos(theta); 0; 0; 0];
du0 = [0;0;0;0;0];
opt = odeset ('RelTol', 1e-8,'AbsTol', 1e-8);
[u0,du0] = decic(@pendulum,0,u0,[1;1;1;1;0],du0,[0;0;0;0;0]);
[t,y] = ode15i (@pendulum, tspan,u0, du0, opt);
plot(y(:,1),y(:,2)); axis ij; axis equal
385 The following code solves the pendulum problem using orthogonal collocation.
We’ll use gauss_lobatto and differentiation_matrix from page 661 to compute
the Gauss–Lobatto nodes and differentiation matrix.
function dU = pendulum(U,U0,D,n)
x = U(1:n); y = U(n+1:2*n); u = U(2*n+1:3*n); v = U(3*n+1:4*n);
tau = U(4*n+1:5*n); x0 = U0(1); y0 = U0(2); u0 = U0(3); v0 = U0(4);
dU = [D(x,x0) - u,
D(y,y0) - v,
D(u,u0) + tau.*x,
D(v,v0) + tau.*y - 1,
x.^2 + y.^2 - 1];
The following Matlab code implements the backward Euler method: 396
The runtime using Octave on a typical laptop is about 0.35 seconds. The runtime
using a non-sparse matrix is about 0.45 seconds, so there doesn’t appear to be a
great advantage in using a sparse solver. But, for a larger system with dx = .001,
the runtime using sparse matrices is still about 0.35 seconds, whereas for a
non-sparse matrix it is almost 30 seconds.
The command mldivide (\) will automatically implement a tridiagonal solver on
a tridiagonal matrix if the matrix is formatted as a sparse matrix. The command
help sparfun returns a list of matrix functions for working with sparse matrices.
The following code solves the heat equation using the Crank–Nicolson method:
The following code implements (13.17) with 𝑚(𝑢) = 𝑢² using ode15s, which
uses a variation of the adaptive-step BDF methods:
[t,U] = ode15s(Du,linspace(0,2,10),u0,options);
for i=1:size(U,1), plot(x,U(i,:),'r'); hold on; ylim([0 1]); end
We could also define the right-hand side of the porous medium equation as
m = @(u) u.^3/3;
Du = @(t,u) [0;diff(m(u),2)/dx^2;0];
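The alternative works because the flux is an exact derivative. A short derivation, assuming the porous medium equation is written as 𝑢_𝑡 = (𝑚(𝑢)𝑢_𝑥)_𝑥 with 𝑚(𝑢) = 𝑢²:

```latex
\frac{\partial}{\partial x}\left(u^2\,\frac{\partial u}{\partial x}\right)
= \frac{\partial}{\partial x}\left(\frac{\partial}{\partial x}\,\frac{u^3}{3}\right)
= \frac{\partial^2}{\partial x^2}\left(\frac{u^3}{3}\right)
```

In this form the function m in the code holds the antiderivative 𝑢³/3, so a single second difference, diff(m(u),2)/dx^2, approximates the whole right-hand side at the interior points.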
ik = 1i*[0:floor(n/2) floor(-n/2+1):-1]*(2*pi/L);
ik = 1i*ifftshift([floor(-n/2+1):floor(n/2)])*(2*pi/L);
The Matlab code that solves the Navier–Stokes equation has three parts: define
the functions, initialize the variables, and iterate over time. We start by defining
functions for Ĥ and for the flux used in the Lax–Wendroff scheme.
flux = @(Q,c) c.*diff([Q(end,:);Q]) ...
+ 0.5*c.*(1-c).*diff([Q(end,:);Q;Q(1,:)],2);
H = @(u,v,ikx,iky) fft2(ifft2(u).*ifft2(ikx.*u) ...
+ ifft2(v).*ifft2(iky.*u));
Now define initial conditions. The variable e is used for the scaling parameter
𝜀, and k = 𝜉 is a temporary variable used to construct ikx = i𝜉_𝑥, iky = i𝜉_𝑦, and
k2 = −|𝜉|² = −𝜉_𝑥² − 𝜉_𝑦². We’ll take the 𝑥- and 𝑦-dimensions to be the same, but
they can easily be modified.
L = 2; n = 128; e = 0.001; dt = .001; dx = L/n;
x = linspace(dx,L,n); y = x';
Q = 0.5*(1+tanh(10*(1-abs(L/2 -y)/(L/4)))).*ones(size(x));
u = Q.*(1+0.5*sin(L*pi*x));
v = zeros(size(u));
u = fft2(u); v = fft2(v);
us = u; vs = v;
ikx = 1i*[0:n/2 (-n/2+1):-1]*(2*pi/L);
iky = ikx.';
k2 = ikx.^2+iky.^2;
Hx = H(u,v,ikx,iky); Hy = H(v,u,iky,ikx);
M1 = 1/dt + (e/2)*k2;
M2 = 1/dt - (e/2)*k2;
Now, we iterate. At each time step, we evaluate (16.14) and use the Lax–Wendroff
method to evaluate the advection equation (16.15). The variable Q gives the
density 𝑄 of the tracer in the advection equation. The variable Hx, computed
using the function H, is the 𝑥-component of Hⁿ and Hxo is the 𝑥-component of Hⁿ⁻¹.
Similarly, Hy and Hyo are the 𝑦-components of Hⁿ and Hⁿ⁻¹. The variables u and
v are the 𝑥- and 𝑦-components of the velocity uⁿ = (𝑢ⁿ, 𝑣ⁿ). Similarly, us and
vs are 𝑢* and 𝑣*, respectively. The variable phi is an intermediate variable for
Δ⁻¹∇ · u*. The tracer 𝑄 is plotted in Figure 16.2 using a contour plot.
for i = 1:1200
  % advect the tracer with Lax–Wendroff in each direction
  Q = Q - flux(Q,(dt/dx)*real(ifft2(v))) ...
    - flux(Q',(dt/dx)*real(ifft2(u))')';
  Hxo = Hx; Hyo = Hy;
  Hx = H(u,v,ikx,iky); Hy = H(v,u,iky,ikx);
  us = u - us + (-1.5*Hx + 0.5*Hxo + M1.*u)./M2;
  vs = v - vs + (-1.5*Hy + 0.5*Hyo + M1.*v)./M2;
  % project the intermediate velocity onto divergence-free fields
  phi = (ikx.*us + iky.*vs)./(k2+(k2==0));
  u = us - ikx.*phi;
  v = vs - iky.*phi;
end
contourf(x,y,Q,[.5 .5]); axis equal;
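The last three assignments in the loop are a projection onto divergence-free velocity fields. A small stand-alone check of that step, using a made-up random intermediate velocity and a smaller grid than the one above, is sketched here.

```matlab
n = 16; L = 2;                              % small grid, for illustration only
ikx = 1i*[0:n/2 (-n/2+1):-1]*(2*pi/L);      % same wavenumber layout as above
iky = ikx.';
k2 = ikx.^2 + iky.^2;
us = fft2(rand(n)); vs = fft2(rand(n));     % arbitrary intermediate velocity (made up)
phi = (ikx.*us + iky.*vs)./(k2+(k2==0));    % inverse Laplacian of the divergence
u = us - ikx.*phi;
v = vs - iky.*phi;
divergence = max(max(abs(ikx.*u + iky.*v)))
```

The term (k2==0) only guards against dividing by zero at the zero wavenumber, where the divergence is already zero.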
function D = detx(A)
  [L,U,P] = lu(A);
  s = 1;
  for i = 1:length(P)
    % undo one row swap of P at a time, flipping the sign for each swap
    m = find(P(i+1:end,i));
    if m, P([i i+m],:) = P([i+m i],:); s = -1*s; end
  end
  D = s * prod(diag(U));
end
The following function implements a naïve reverse Cuthill–McKee algorithm for
symmetric matrices:
function p = rcuthillmckee(A)
  A = spones(A);
  [~,r] = sort(sum(A));       % unvisited nodes, ordered by degree
  p = [];
  while ~isempty(r)
    q = r(1); r(1) = [];      % start a new component from the lowest-degree node
    while ~isempty(q)
      p = [p q(1)];
      [~,k] = find(A(q(1),r));
      q = [q(2:end) r(k)]; r(k) = [];
    end
  end
  p = fliplr(p);
end
A = sprand(1000,1000,0.001); A = A + A';
p = rcuthillmckee(A);
subplot(1,2,1); spy(A,'.',2); axis equal
subplot(1,2,2); spy(A(p,p),'.',2); axis equal
p = rcuthillmckee(A);
axis equal; axis off;
Reading mixed data into MATLAB or Octave can be frustrating even for
relatively simple structures. In Octave, we can use the csvcell function (from
the io package) to convert a CSV file into a cell. Then, we can use a couple of
loops to divide the cell up into meaningful pieces. The file diet.mat contains
these variables. The solution to the diet problem is
urlwrite([bucket "diet.mat"],"diet.mat");
load diet.mat
A = values'; b = minimums; c = ones(size(A,2),1);
[z,x,y] = simplex(b,A',c);
cost = z, food{find(y~=0)}
We’ll reuse the function get_adjacency_matrix from page 622. Let’s also define
a function to get the names.
function r = get_names(bucket,filename)
  data = urlread([bucket filename '.txt']);
  c = textscan(data,'%s','delimiter', '\n');
  r = c{1};    % cell array with one name per line
end
We’ll create a block sparse matrix and define a function to find the shortest path
between nodes a and b.
actors = get_names(bucket,"actors");
movies = get_names(bucket,"movies");
B = get_adjacency_matrix(bucket,"actor-movie");
[m,n] = size(B);
A = [sparse(n,n) B';B sparse(m,m)];
actormovie = [actors;movies];
if any(k==b),
path = backtrack(p,b); return;
display("No path.");
a = find(ismember(actors,"Bruce Lee"));
b = find(ismember(actors,"Tommy Wiseau"));
d = 0.5; A = sprand(60,80,d);
s = whos('A'); nbytes = s.bytes;
nbytes, 8*(2*d*prod(size(A)) + size(A,2) + 1)
a = 0.05;
X1 = A\Y/B;
X2 = (A'*A+a^2*eye(m))\A'*Y*B'/(B*B'+a^2*eye(n));
X3 = pinv(A,a)*Y*pinv(B,a);
imshow(max(min([X Y X1 X2 X3],1),0))
The Matlab code for exercise 3.5 follows. We’ll first download the CSV files.
Now, let’s define a function that determines the coefficients and residuals
using three different methods and one that evaluates a polynomial given those
coefficients. The command qr(V,0) returns an economy-size QR decomposition.
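As a rough illustration of how such a comparison can be set up, the sketch below fits a polynomial three ways: the normal equations, an economy-size QR decomposition, and the pseudoinverse. The choice of these three methods and the synthetic data are assumptions; the sketch does not use the Filip data set.

```matlab
x = linspace(-1,1,50)'; y = 1 + 2*x - 3*x.^2;   % synthetic data (made up)
n = 3;                                          % number of coefficients
V = x.^(0:n-1);                                 % Vandermonde matrix
c1 = (V'*V)\(V'*y);                             % normal equations
[Q,R] = qr(V,0); c2 = R\(Q'*y);                 % economy-size QR
c3 = pinv(V)*y;                                 % pseudoinverse
r = [norm(V*c1-y) norm(V*c2-y) norm(V*c3-y)]    % residual of each method
```

On well-conditioned data like this the three residuals agree; the differences show up only when the Vandermonde matrix is badly conditioned.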
n = 11;
[c,r] = solve_filip(x,y,n);
X = linspace(min(x),max(x),200);
Y = build_poly(c,X);
Let’s also solve the problem and plot the solutions for the standardized data.
We’ll first load the variables into the workspace and build a matrix Dᵢ for each
𝑖 ∈ {0, 1, …, 9}. Then we’ll use svds to compute the first 𝑘 = 12 singular
matrices. We only need to keep {Vᵢ}, which is a low-dimensional column space
of the space of digits.
k = 12; V = [];
urlwrite([bucket "emnist-mnist.mat"],"emnist-mnist.mat")
load emnist-mnist.mat
for i = 1:10
  D = dataset.train.images(dataset.train.labels==i-1,:);
  [~,~,V(:,:,i)] = svds(double(D),k);
end
Let’s display the principal components for the “3” image subspace.
pix = reshape(V(:,:,3+1),[28,28*k]);
imshow(pix,[]); axis off
Now, let’s find the closest column spaces to our test images d by finding the 𝑖
with the smallest residual ‖Vᵢ Vᵢᵀ d − d‖₂.
r = [];
d = double(dataset.test.images)';
for i = 1:10
  r(i,:) = sum((V(:,:,i)*(V(:,:,i)'*d) - d).^2);
end
[c,predicted] = min(r);
Each row of the matrix represents the predicted class and each column represents
the actual class.
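One way to tabulate such a matrix is with accumarray. The sketch below uses a handful of made-up labels with three classes rather than the digit data, and it assumes labels start at zero, so one is added to form indices.

```matlab
actual    = [0 1 2 2 1 0 2]';       % made-up true labels
predicted = [0 1 2 1 1 0 2]';       % made-up predictions
% rows index the predicted class, columns index the actual class
C = accumarray([predicted+1 actual+1],1,[3 3]);
accuracy = trace(C)/sum(C(:))
```

Correct predictions fall on the diagonal, so the accuracy is the trace divided by the total count.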
[p,r] = sort(U*S*q,'descend');
for i=1:10, printf("%s: %4.3f\n",genres(r(i)){:},p(i)/sum(p)), end
The Matlab function hess(A) returns an upper Hessenberg matrix that is unitarily
similar to A. The function givens(a,b) returns the Givens rotation matrix for the
vector (𝑎, 𝑏). The following Matlab code implements the implicit QR method:
function eigenvalues = implicitqr(A)
  n = length(A);
  tolerance = 1e-12;
  H = hess(A);
  while true
    % deflate once the last subdiagonal element is negligible
    if abs(H(n,n-1))<tolerance,
      n = n-1; if n<2, break; end;
    end
    Q = givens(H(1,1)-H(n,n),H(2,1));
    H(1:2,1:n) = Q*H(1:2,1:n);
    H(1:n,1:2) = H(1:n,1:2)*Q';
    % chase the bulge down the subdiagonal
    for i = 2:n-1
      Q = givens(H(i,i-1),H(i+1,i-1));
      H(i:i+1,1:n) = Q*H(i:i+1,1:n);
      H(1:n,i:i+1) = H(1:n,i:i+1)*Q';
    end
  end
  eigenvalues = diag(H);
end
n = 20; S = randn(n);
D = diag(1:n); A = S*D*inv(S);
actors = get_names(bucket,"actors");
B = get_adjacency_matrix(bucket,"actor-movie");
M = sparse(B'*B);
v = ones(size(M,1),1);
for k = 1:10,
  v = M*v; v = v/norm(v);
end
[~,r] = sort(v,'descend');
We approximate the SVD by starting with a set of 𝑘 random vectors and
performing a few steps of the naïve QR method to generate a 𝑘-dimensional
subspace that is relatively close to the space of dominant singular values:
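A minimal sketch of this idea, written as a script on a synthetic low-rank matrix, is given below. The number of QR steps and the test matrix are assumptions, and the script is a stand-in for illustration, not the function randomizedsvd used later.

```matlab
A = rand(200,10)*rand(10,150);      % synthetic rank-10 test matrix (made up)
k = 10;
Z = rand(size(A,2),k);              % k random starting vectors
[Q,~] = qr(A*Z,0);
for i = 1:3                         % a few steps of the naive QR iteration (assumed count)
  [Q,~] = qr(A'*Q,0);
  [Q,~] = qr(A*Q,0);
end
[W,S,V] = svd(Q'*A,'econ');         % SVD of the small k-by-n matrix
U = Q*W;                            % lift back to the original space
relerr = norm(A - U*S*V','fro')/norm(A,'fro')
```

The expensive decomposition is applied only to the small matrix Q'*A, which is where the savings come from when 𝑘 is much smaller than the dimensions of A.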
Let’s convert an image to an array, compute its randomized SVD, and then display
the original image side-by-side with the rank-reduced version.
A = double(rgb2gray(imread([bucket "red-fox.jpg"])));
[U,S,V] = randomizedsvd(A,10);
imshow([A U*S*V'])
function e = stationary(A,b,w,n,ue)
  e = zeros(n,1); u = zeros(size(b));
  P = tril(A,0) - (1-w)*tril(A,-1);
  for i=1:n
    u = u + P\(b-A*u);
    e(i) = norm(u - ue, 1);
  end
end
function e = conjugategradient(A,b,n,ue)
  e = zeros(n,1); u = zeros(size(b));
  r = b-A*u; p = r;
  for i=1:n
    Ap = A*p;
    a = (r'*p)/(Ap'*p);
    u = u + a.*p; r = r - a.*Ap;
    b = (r'*Ap)/(Ap'*p);
    p = r - b.*p;
    e(i) = norm(u - ue, 1);
  end
end
Each function takes around two seconds to complete 400 iterations. Direct
computation tic; A\(-f); toc takes around 20 seconds, considerably shorter
than either Julia or Python.
n = 20000;
d = 2 .^ (0:14); d = [-d;d];
P = spdiags(primes(224737)',0,n,n);
B = spdiags(ones(n,length(d)),d,n,n);
A = P + B;
b = sparse(1,1,1,n,1);
x = pcg(A, b, 1e-15, 100, P); x(1)
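The Matlab code for the radix-3 FFT in exercise 6.1 is
function y = fftx3(c)
  n = length(c);
  omega = exp(-2i*pi/n);
  if mod(n,3) == 0
    k = 0:n/3-1;
    % transform each third of the data and apply the twiddle factors
    u = [ fftx3(c(1:3:n-2)).';
          (omega.^k).*fftx3(c(2:3:n-1)).';
          (omega.^(2*k)).*fftx3(c(3:3:n)).' ];
    F = exp(-2i*pi/3).^((0:2)'*(0:2));
    y = reshape((F*u).',[],1);
  else
    % fall back to the direct DFT when n is not divisible by three
    F = omega.^([0:n-1]'*[0:n-1]);
    y = F*c;
  end
end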
Matlab typically works with floating-point numbers, so we can’t easily input
a large integer directly. Instead, we can input it as a string and then convert
the string to an array by subtracting '0'. For example, '5472' - '0' becomes
[5 4 7 2]. Because fft operates on floating-point numbers, we use round to
convert the ifft to the nearest integer. Putting everything together, we get the
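A minimal sketch of these steps, written as a script with two short example inputs, could look like this. The variable names and the carry loop are assumptions made for illustration.

```matlab
p = '5472'; q = '12';                          % example inputs (made up)
a = fliplr(p - '0'); b = fliplr(q - '0');      % digits, least significant first
n = length(a) + length(b);                     % enough room for the product
c = round(real(ifft(fft(a,n).*fft(b,n))));     % digit-wise convolution
carry = 0;
for i = 1:n                                    % propagate the carries
  t = c(i) + carry;
  c(i) = mod(t,10); carry = floor(t/10);
end
s = char(fliplr(c) + '0');
k = find(s ~= '0',1); if isempty(k), k = length(s); end
s = s(k:end)                                   % strip leading zeros
```

Zero-padding both digit arrays to length n turns the circular convolution computed by the FFT into the ordinary convolution that long multiplication needs.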
function f = idct(f)
  n = size(f,1);
  f(1,:) = f(1,:)/2;
  w = exp(-0.5i*pi*(0:n-1)'/n);
  f([1:2:n n-mod(n,2):-2:2],:) = 2*real(ifft(f./w));
end
Two-dimensional DCT and IDCT operators apply the DCT or IDCT once in each
The following function returns a sparse matrix of Fourier coefficients along with
the reconstructed compressed image:
We’ll test the function on an image and compare it before and after.
We’ll find the solution to the system of equations in exercise 8.12. Let’s first
define the system and the gradient.
function x = homotopy(f,df,x)
  f0 = f(x(1),x(2));    % residual at the starting point, held fixed along the path
  dxdt = @(t,x) -df(x(1),x(2))\f0;
  [t,y] = ode45(dxdt,[0;1],x);
  x = y(end,:);
end
function x = newton(f,df,x)
  for i = 1:100
    dx = -df(x(1),x(2))\f(x(1),x(2));
    x = x + dx;
    if norm(dx)<1e-8, return; end
  end
end
The following function computes the coefficients {𝑚₀, 𝑚₁, …, 𝑚ₙ₋₁} for a spline
with periodic boundary conditions. It is assumed that 𝑦₀ = 𝑦ₙ.
function m = spline_periodic(x,y)
  h = diff(x);
  C = circshift(diag(h),[1 0]) + 2*diag(h+circshift(h,[1 0])) ...
    + circshift(diag(h),[0 1]);
  d = 6.*diff(diff([y(end-1);y])./[h(end);h]);
  m = C\d; m = [m;m(1)];
end
The following code computes and plots the interpolating curves:
n = 20; N = 200;
x = linspace(-1,1,n)'; X = linspace(-1,1,N)';
y = (x>0);
Y1 = interp(phi1,x);
Y2 = interp(phi2,x);
Y3 = interp(phi3,(0:n-1)');
plot(x,y,X,Y1,X,Y2,X,Y3); ylim([-.5 1.5]);
function c = solve(L,f,bc,x)
h = x(2)-x(1); n = length(x);
S = ([1 -1/2 1/6; -2 0 2/3; 1 1/2 1/6]./[h^2 h 1])*L(x);
Let’s also define a function that will interpolate between collocation points.
n = 15; N = 141
L = @(x) [x;ones(size(x));x];
f = @(x) zeros(size(x));
b = fzero(@(z) besselj(0, z), 11);
x = linspace(0,b,n);
c = solve(L,f,[1,0],x);
[X,Y] = build(c,x,N);
n = 10*2.^(1:6);
for i = 1:length(n)
  x = linspace(0,b,n(i));
  c = solve(L,f,[1,0],x);
  [X,Y] = build(c,x,n(i));
  e(i) = sqrt(sum((Y-besselj(0,X)).^2)/n(i));
end
s = polyfit(log(n),log(e),1);
printf("slope: %f",s(1))
n = 128; L = 2;
x = (0:n-1)/n*L-L/2;
k = [0:(n/2-1) (-n/2):-1]*(2*pi/L);
f = exp(-6*x.^2);
for p = 0:0.1:1
  d = real(ifft((1i*k).^p.*fft(f)));
  plot(x,d,'m'); hold on;
end
Now, let’s apply the method to the ShotSpotter data. First, we get the data.
bucket = 'https://raw.githubusercontent.com/nmfsc/data/master/';
data = urlread([bucket 'shotspotter.csv']);
df = cell2mat(textscan(data,'%f,%f,%f,%f','HeaderLines', 1));
X = df(1:end-1,:);
x_nls - df(end,:)'
We’ll extend the dual class on page 637 by adding methods for division, square
root, and cosine.
function h = mrdivide(u,v)
  if ~isa(u,'dual'), u = dual(u,0); end
  if ~isa(v,'dual'), v = dual(v,0); end
  % quotient rule for the derivative part
  h = dual(u.value/v.value, ...
    (v.value*u.deriv-u.value*v.deriv)/(v.value*v.value));
end
function h = cos(u)
  h = dual(cos(u.value), -1*sin(u.value)*u.deriv);
end
function h = sqrt(u)
  h = dual(sqrt(u.value), u.deriv/(2*sqrt(u.value)));
end
function x = get_zero(f,x)
  tolerance = 1e-12; delta = 1;
  while abs(delta)>tolerance,
    fx = f(dual(x,1));
    delta = fx.value/fx.deriv;
    x = x - delta;
  end
end
fx = f(dual(dual(x,1),1));
delta = fx.deriv.value/fx.deriv.deriv;
The following function computes the nodes and weights for Gauss–Legendre
quadrature by using Newton’s method to find the roots of 𝑃ₙ(𝑥):
function [x,w] = gauss_legendre(n)
  x = -cos((4*(1:n)-1)*pi/(4*n+2))';
  dx = ones(n,1);
  dP = 0;
  while max(abs(dx)) > 1e-12    % stopping tolerance is an assumed value
    P = [x ones(n,1)];
    for k = 2:n
      P = [((2*k - 1)*x.*P(:,1)-(k-1)*P(:,2))/k, P(:,1)];
    end
    dP = n*(x.*P(:,1) - P(:,2))./(x.^2-1);
    dx = P(:,1) ./ dP(:,1);
    x = x - dx;
  end
  w = 2./((1-x.^2).*dP(:,1).^2);
end
We’ll modify the implementation of the Golub–Welsch algorithm from page 638
to Gauss–Hermite polynomials.
function [nodes,weights] = gauss_hermite(n)
  b = (1:n-1)/2;
  a = zeros(n,1);
  scaling = sqrt(pi);
  [v,s] = eig(diag(sqrt(b),1) + diag(a) + diag(sqrt(b),-1));
  weights = scaling*v(1,:).^2;
  nodes = diag(s);
end
[s,w] = gauss_hermite(20);
u0 = @(x) sin(x);
u = @(t,x) w * u0(x-2*sqrt(t)*s)/sqrt(pi);
x = linspace(-12,12,100);
The following code plots the region of absolute stability for a Runge–Kutta
method with tableau A and b:
n = length(b);
N = 100;
x = linspace(-4,4); y = x'; r = zeros(N,N);
lk = x + 1i*y;
E = ones(n,1);
for i = 1:N, for j=1:N
r(i,j) = 1+ lk(i,j) * b*(( eye(n) - lk(i,j)*A)\E);
end, end
contour(x,y,abs(r),[1 1],'k');
axis([-4 4 -4 4]);
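As a usage sketch, the classical fourth-order Runge–Kutta tableau can be supplied for A and b. The check below evaluates the same stability function at one arbitrary test point and compares it with the degree-four Taylor polynomial of the exponential, which is the known stability function of that method.

```matlab
A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0];   % classical RK4 tableau
b = [1/6 1/3 1/3 1/6];
z = -1 + 2i;                                    % arbitrary test point
r = 1 + z*b*((eye(4) - z*A)\ones(4,1));         % same formula as in the loop above
p = polyval(1./factorial(4:-1:0),z);            % 1 + z + z^2/2 + z^3/6 + z^4/24
difference = abs(r - p)
```

With these definitions of A and b in the workspace, the plotting code above traces the boundary |𝑟| = 1 of the RK4 stability region.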
i = (0:3)';
a = ((-(i+1)').^i./factorial(i))\[1;0;0;0];
b = ((-i').^i./factorial(i))\[0;1;0;0];
The following function returns the orbit of points in the complex plane for
an 𝑛th order Adams–Bashforth–Moulton PE(CE)ᵐ. It calls the function
multistepcoefficients defined on page 639.
function z = PECE(n,m)
  [~,a] = multistepcoefficients([0,1],1:n-1);
  [~,b] = multistepcoefficients([0,1],0:n-1);
  for i = 1:200
    r = exp(2i*pi*(i/200));
    c(1) = r - 1;
    c(2:m+1) = r + r.^(1:n-1)*b(2:end)'/b(1);
    c(m+2) = r.^(1:n-1)*a/b(1);
    z(i,:) = roots(fliplr(c))'/b(1);
  end
end
for i= 1:4
  plot(PECE(2,i),'.k'); hold on; axis equal
end
Now, compute the coefficients of 𝑃ₘ(𝑥) and 𝑄ₙ(𝑥) for the Taylor polynomial
approximation of log(𝑥 + 1).
m = 3; n = 2;
a = [0 ((-1).^(0:m+n)./(1:m+n+1))]';
[p,q] = pade(a,m,n)
Finally, shift and combine the coefficients using upper inverse Pascal matrices.
S = @(n) inv(pascal(n,-1)');
S(m+1)*p, S(n+1)*q
The following code solves the initial value problem and computes the error at the
second boundary point:
The function fsolve calls the function shooting that solves an initial value
problem and computes the error in the solutions at the second boundary point.
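The shooting idea can be sketched on a simple stand-in problem. The boundary value problem below (𝑦″ = −𝑦 with 𝑦(0) = 0 and 𝑦(π/2) = 1) is hypothetical, and a hand-written secant iteration is swapped in for fsolve so that the script needs no extra toolbox.

```matlab
f = @(t,y) [y(2); -y(1)];                  % y'' = -y as a first-order system
opts = odeset('RelTol',1e-10,'AbsTol',1e-10);
s = [0 2]; e = [0 0];                      % two initial guesses for the slope y'(0)
for k = 1:2
  [~,y] = ode45(f,[0 pi/2],[0; s(k)],opts);
  e(k) = y(end,1) - 1;                     % error at the second boundary point
end
for iter = 1:20                            % secant iteration on the boundary error
  snew = s(2) - e(2)*(s(2)-s(1))/(e(2)-e(1));
  [~,y] = ode45(f,[0 pi/2],[0; snew],opts);
  s = [s(2) snew]; e = [e(2) y(end,1) - 1];
  if abs(e(2)) < 1e-8, break; end
end
slope = s(2)
```

For this linear problem the exact solution is sin 𝑡, so the recovered initial slope should be close to one.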
m = [0 1 2 3 4]; n = [1];
a = zeros(max(m)+1,1); b = zeros(max(n)+1,1);
[a(m+1),b(n+1)] = multistepcoefficients(m,n);
dx = 0.01; dt = 0.01;
L = 1; x = (-L:dx:L)'; m = length(x);
V = exp(-8*x.^2); U = V;
c = dt/dx^2; a = 0.5 + c; b = 0.5 - c;
B = c*spdiags([ones(m,1),zeros(m,1),ones(m,1)],-1:1,m,m);
B(1,2) = B(1,2)*2; B(end,end-1) = B(end,end-1)*2;
for i = 1:420,
  if mod(i,3)==1, area(x, (V-i/300),-1,'facecolor','w'); hold on; end
  Vo = V; V = (B*V+b*U)/a; U = Vo;
end
ylim([-1,1]); set(gca,'xtick',[],'ytick',[])
We’ll loop over several values for time steps and mesh sizes and plot the error.
eps = 0.3; m = 20000; n = floor(logspace(2,3.7,6));
x = linspace(-4,4,m)';
psi_m = -exp(-(x-1).^2/(2*eps))/(pi*eps)^(1/4);
for i = 1:length(n),
  x = linspace(-4,4,n(i))';
  psi_n = -exp(-(x-1).^2/(2*eps))/(pi*eps)^(1/4);
  error_t(i) = norm(psi_m - schroedinger(m,n(i),eps))/m;
  error_x(i) = norm(psi_n - schroedinger(n(i),m,eps))/n(i);
end
Let’s solve a radially symmetric heat equation. Although we divide by zero at
𝑟 = 0 when constructing the Laplacian operator, we subsequently overwrite the
resulting inf term when we apply the boundary condition.
T = 0.5; m = 100; n = 100;
r = linspace(0,2,m)'; dr = r(2)-r(1); dt = T/n;
u = tanh(32*(1-r));
tridiag = [1 -2 1]/dr^2 + (1./r).*[-1 0 1]/(2*dr);
D = spdiags(tridiag,-1:1,m,m)';
D(1,1:2) = [-4 4]/dr^2; D(m,m-1:m) = [2 -2]/dr^2;
A = speye(m) - 0.5*dt*D;
B = speye(m) + 0.5*dt*D;
for i = 1:n
  u = A\(B*u);
end
area(r,u,-1,"edgecolor",[1 .5 .5],"facecolor",[1 .8 .8]);
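The two overwritten rows of D can be motivated with ghost points. The derivation below assumes symmetry about 𝑟 = 0 and a zero-flux condition at the outer boundary, which is one reading of the code; nodes are numbered from 0 to 𝑚 − 1.

```latex
\lim_{r\to 0}\left(u_{rr} + \frac{1}{r}\,u_r\right) = 2u_{rr}
\approx 2\,\frac{u_{-1} - 2u_0 + u_1}{\Delta r^2}
= \frac{4\,(u_1 - u_0)}{\Delta r^2}
\quad\text{with } u_{-1} = u_1,
\qquad
u_{rr}\Big|_{r=2} \approx \frac{u_{m-2} - 2u_{m-1} + u_m}{\Delta r^2}
= \frac{2\,(u_{m-2} - u_{m-1})}{\Delta r^2}
\quad\text{with } u_m = u_{m-2}.
```

These are the stencils [-4 4]/dr^2 and [2 -2]/dr^2 assigned to the first and last rows; at the outer boundary the term 𝑢_𝑟/𝑟 drops out because 𝑢_𝑟 = 0 there.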
Alternatively, a much slower but high-order BDF routine can be used in place of
the Crank–Nicolson routine:
options = odeset('JPattern',spdiags(ones([m 3]),-1:1,m,m));
[t,u] = ode15s(@(t,u) D*u,[0 T],u,options);
Next, we’ll define a general discrete Laplacian operator. The following function
returns a sparse matrix in diagonal format (DIA). Two inconsequential elements
of array d are replaced with nonzero numbers to avoid divide-by-zero warnings.
function D = laplacian(x)
  h = diff(x); h1 = h(1:end-1); h2 = h(2:end); n = length(x);
  diags = 2./[h1(1).^2, -h1(1).^2, 0;
    h2.*(h1+h2), -h1.*h2, h1.*(h1+h2);
    0, -h2(end).^2, h2(end).^2];
  D = spdiags(diags,-1:1,n,n)';
end
function u = heat_equation(x,t,u)
  n = 40; dt = t/n;      % number of time steps and step size
  m = length(x);         % the matrices must match the size of the grid
  D = laplacian(x);
  A = speye(m) - 0.5*dt*D;
  B = speye(m) + 0.5*dt*D;
  for i = 1:n
    u = A\(B*u);
  end
end
We can animate the time evolution of the solution by adding an imwrite command
inside the loop to save each iteration to a stack of sequential PNGs and then using
a program such as ffmpeg to compile the PNGs into an MP4.
function s = slope(u)
  limiter = @(t) (abs(t)+t)./(1+abs(t));
  du = diff(u);
  s = [[0 0];du(2:end,:).*limiter(du(1:end-1,:)./(du(2:end,:) ...
    + (du(2:end,:)==0)));[0 0]];
end
Other slope limiters we might try include the superbee and the minmod:
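As a sketch, the standard minmod and superbee limiters can be written as anonymous functions of the slope ratio, in the same form as the limiter used in slope above. The sample ratios are chosen only to show the behavior.

```matlab
minmod   = @(t) max(0,min(1,t));
superbee = @(t) max(0,max(min(2*t,1),min(t,2)));
t = [-1 0.25 0.5 1 2 4];               % sample slope ratios
values = [minmod(t); superbee(t)]
```

Either function can replace the definition of limiter in slope; both return zero for negative ratios, that is, at local extrema.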
m=10;
x=linspace(0,1,m)'; h=x(2)-x(1);
A=diag(repmat(-1/h-h/6,[m-1 1]),-1)+diag(repmat(1/h-h/3,[m 1]));
A = A + A'; A(1,1)=A(1,1)/2; A(m,m)=A(m,m)/2;
m = 8;
x = linspace(0,1,m+2); h = x(2)-x(1);
D = @(a,b,c) (diag(a*ones(m-1,1),-1) + ...
diag(b*ones(m,1),0) + diag(c*ones(m-1,1),1))/h^3;
M = [D(-12,24,-12) D(-6,0,6);D(6,0,-6) D(2,8,2)];
b = [ones([m 1])*h*384;zeros([m 1])];
u = M\b;
plot(x,16*(x.^4 - 2*x.^3 + x.^2),'o-',x,[0;u(1:m);0],'.-');
The following Matlab code solves the KdV equation using integrating factors.
We first set the parameters:
We define the integrating factor and right-hand side of the differential equation.
G = @(t) exp(-k.^3*t);
f = @(t,w) -G(t).\(3*k.*fft(ifft(G(t).*w).^2));
Index
deflation
  of a matrix, 106
  of a polynomial, 208
degree
  of a polynomial, 207, 225, 232, 312
  of a vertex, 47
  of freedom, 108, 236, 280
  of separation, 56
degree matrix, 47
Dekker–Brent method, 196
Deming regression, 76
density, 46
detailed balance, 325
determinant, 24, 55, 483
de Broglie, Louis, 162
de Casteljau, Paul, 245
DFT, see discrete Fourier transform
Diaconis, Persi, 147
diagonally dominant, 123
diet problem, see Stigler diet problem
differentiation matrix, 331
Diffie–Hellman key exchange, 205
Dijkstra’s algorithm, 51
Dirac notation, 19
direct problem, 173
Dirichlet boundary condition, 395
discrete Fourier transform, 144
discrete wavelet transform, 272–276
dispersion relation, 421
divided differences, 229
dolphins of Doubtful Sound, 49
domain of dependence/influence, 416
dominant eigenvalue, 93
Dormand–Prince method, 361, 365, 366, 374, 407
double-and-add, 207, 511
drift, 381
dual number, 300
dual problem, 44
dual space, 19
Duffing equation, 388, 534, 612, 664
Dufort–Frankel method, 401–403, 537
DWT, see discrete wavelet transform
E
Eaton, John, 562
Eckart–Young–Mirsky theorem, 74
eigenimage, 79
eigenpictures, 79
eigenspace, 7
eigenvalue, 6, 24
  condition number, 90
  estimation, 91
  methods for computing, 93–114
eigenvector, 6, 24
  left eigenvector, 90
eigenvector centrality, 96
elementary row operations, 30
elliptic, 451
elliptic curve, 206
energy method, 408
energy norm, 14
ENIAC, xiv, 321, 559
entropy, 430
epoch, 289
equivalent matrices, 8, 12
equivalent norms, 14, 15, 409
Erdös, Paul, 56
Euclidean inner product, 13
Euclidean norm, 14, 24
Euler equations, 442
Euler method, see backward Euler method, forward Euler method
Euler, Leonhard, 46
Euler–Maclaurin formula, 310
explicit Euler method, see forward Euler method
exponential convergence, 258
F
fast Fourier transform, 143–163, 259, 462
fast inverse square root, 182, 189
fast Toeplitz multiplication, 153
father wavelet, see scaling function
Favard’s theorem, 254, 314
Feigenbaum constant, 202
FEniCS, 455–457, 589
FFT, see fast Fourier transform
FFTW, 158, 160, 463, 572, 627
Fickian diffusion, 392
Filippelli problem, 86
finite volume method, 438
fixed point, 93, 201, 509
Vandermonde matrix, 86, 87, 143, 226, 228, 247, 297, 488, 513, 522, 570, 625
van Gogh, Vincent, 473
van Loan, Charles, 27
van Rossum, Guido, 563
variational problem, 443
vector norm, 13, 24
vector space, 3
Verlet method, 373
versine function, 235
vertex, 47
von Neumann analysis, 398–403, 415, 418
von Neumann, John, 31, 37, 324, 328
W
w-orthogonal, 250
wavelet function, 271
wavelets
  B-spline, 266
  Cohen–Daubechies–Feauveau, 273
  Coiflet, 276
  Daubechies, 269
  Haar, 264, 274
  LeGall–Tabatabai, 273
weak solution, 427
Weierstrass approximation theorem, 245
Weierstrass function, 169
Weierstrass, Karl, 234
well-conditioned, 122, 138, 174
well-posed, 21–23, 175–179, 335
Wickham, Hadley, 563
Wiener weight, 73
Wikipedia, xiv, 56, 493, 561
Wilkinson shift, 106
Wilkinson, James, 45
wisdom, 158
witch of Agnesi, 235
Wolfram Language, 211, 354, 483, 509, 518, 560
X
Xoshiro, 328
Z
zero-based indexing, 564
zero-one matrix, 25, 47
zero-stability, 345
zeros of polynomials
  companion matrix method, 210
  Newton–Horner method, 209
Zipf’s law, 66
Julia Index
704 Python Index
706 Matlab Index