Num1 Skript
These notes are a short presentation of the material presented in my lecture. They follow
the notes “Numerik 1: Numerik gewöhnlicher Differentialgleichungen” by Rannacher (in
German) [Ran17b], as well as the books by Hairer, Nørsett, and Wanner [HNW09] and
Hairer and Wanner [HW10]. Furthermore, the book by Deuflhard and Bornemann [DB08]
was used. Historical remarks are in part taken from the article by Butcher [But96].
Thanks go to Dörte Jando, Markus Schubert, Lukas Schubotz, and David Stronczek for
their help with writing and editing these notes.
Chapter 1
Example 1.1.1 (Exponential growth). Bacteria are living on a substrate with ample nutrients. Each bacterium splits into two after a certain time ∆t. The time span for splitting is fixed and independent of the individual. Then, given the amount u0 of bacteria at time t0, the amount at t1 = t0 + ∆t is u1 = 2u0. Generalizing, we obtain

u_n = u(t_n) = 2^n u_0,   t_n = t_0 + n∆t.
After a short time, the number of bacteria will be huge, such that counting is not a good
idea anymore. Also, the cell division does not run on a very sharp clock, such that after
some time, divisions will not only take place at the discrete times t0 + n∆t, but at any
time between these as well. Therefore, we apply the continuum hypothesis, that is, u is
not a discrete quantity anymore, but a continuous one that can take any real value. In
order to accommodate for the continuum in time, we make a change of variables:
u(t) = 2^{(t−t_0)/∆t} u_0.
Here, we have already written down the solution of the problem, which is hard to generalize.
The original description of the problem involved the change of u from one point in time
to the next. In the continuum description, this becomes the derivative, which we can now
compute from our last formula:
d/dt u(t) = (ln 2)/∆t · 2^{(t−t_0)/∆t} u_0 = (ln 2)/∆t · u(t).
We see that the derivative of u at a certain time depends on u itself at the same time and a constant factor, which we call the growth rate α. Thus, we have arrived at our first differential equation

u′(t) = αu(t),   α = (ln 2)/∆t.   (1.1)
Figure 1.1: Plot of a solution to the predator-prey system with parameters α = 2/3, β = 4/3, δ = γ = 1 and initial values u(0) = 3, v(0) = 1. Solved with a Runge-Kutta method of order five and step size h = 10^−5.
We have also seen that we had to start with some bacteria to get the process going. Indeed, any function of the form

u(t) = ce^{αt}

is a solution to equation (1.1). It is the initial value u0 which anchors the solution and makes it unique.
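The doubling law and its continuous extension are easy to check numerically. The following sketch (plain Python; the doubling time ∆t = 1 and the tolerances are illustrative choices, not taken from the notes) verifies that u(t) = 2^{(t−t0)/∆t} u0 reproduces the discrete sequence u_n = 2^n u0 and that its derivative satisfies u′ ≈ αu with α = ln 2/∆t:

```python
import math

def u_continuous(t, u0=1.0, t0=0.0, dt=1.0):
    """Continuous-time population: u(t) = 2**((t - t0)/dt) * u0."""
    return 2.0 ** ((t - t0) / dt) * u0

# At the discrete times t0 + n*dt we recover the doubling sequence u_n = 2**n * u0.
for n in range(5):
    assert abs(u_continuous(n * 1.0) - 2 ** n) < 1e-12

# Growth rate alpha = ln(2)/dt: check u'(t) ~ alpha * u(t) via a central difference.
alpha = math.log(2.0) / 1.0
t, h = 0.7, 1e-6
deriv = (u_continuous(t + h) - u_continuous(t - h)) / (2 * h)
assert abs(deriv - alpha * u_continuous(t)) < 1e-5
```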
Example 1.1.2 (Predator-prey systems). We add a second species to our bacteria example. Let's say we replace the bacteria by sardines living in a nutrient-rich sea, and we add sardine-eating tuna. The amount of sardines eaten depends on the likelihood that a sardine and a tuna are in the same place, and on the hunting efficiency β of the tuna. Thus, equation (1.1) is augmented by a negative change in population depending on the product of sardines u and tuna v:

u′ = αu − βuv.
In addition, we need an equation for the amount of tuna. In this simple model, we will
make two assumptions: first, tuna die of natural causes at a death rate of γ. Second, tuna
procreate if there is enough food (sardines), and the procreation rate is proportional to the
amount of food. Thus, we obtain
v′ = δuv − γv.
Again, we will need initial populations at some point in time to evolve them to later times
from that point.
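To see the oscillations this system produces, one can integrate it numerically. The sketch below uses a crude explicit time-stepping scheme (the Euler method, introduced in Chapter 2) with the parameters of Figure 1.1; the method, step size, and final time are illustrative choices, not the high-order scheme used for the figure:

```python
def lotka_volterra_rhs(u, v, alpha=2/3, beta=4/3, gamma=1.0, delta=1.0):
    """Right hand sides u' = alpha*u - beta*u*v and v' = delta*u*v - gamma*v."""
    return alpha * u - beta * u * v, delta * u * v - gamma * v

def simulate(u0=3.0, v0=1.0, h=1e-4, T=10.0):
    """Explicit Euler time stepping; the small step size compensates for the low order."""
    u, v = u0, v0
    traj = [(u, v)]
    for _ in range(int(round(T / h))):
        du, dv = lotka_volterra_rhs(u, v)
        u, v = u + h * du, v + h * dv
        traj.append((u, v))
    return traj

traj = simulate()
us = [u for u, _ in traj]
vs = [v for _, v in traj]
# Both populations stay positive; with these values the prey initially
# declines (u'(0) = 2 - 4 < 0) while the predators initially grow.
assert min(us) > 0 and min(vs) > 0
assert min(us) < 3.0 and max(vs) > 1.0
```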
numerically (simulated): Lotka and Volterra became interested in this system because they had found that the amount of predatory fish caught had increased during World War I, while during the war years there was a strong decrease in fishing effort. From this they concluded that there had to be more prey fish.
A (far too rarely) applied consequence is that in order to diminish the amount of, e.g.,
foxes one should hunt rabbits, since foxes feed on rabbits.
Newton's second law of motion, on the other hand, relates forces and acceleration:

F = mx″.
Combining these, we obtain equations for the positions of the two bodies:

x″_i = −G (m_{3−i}/r³) (x_i − x_{3−i}),   i = 1, 2.

This is a system of 6 independent variables. However, it can be reduced to three by noting that the distance vector r is the only quantity that needs to be computed:

r″ = −G ((m_1 + m_2)/r³) r.
Intuitively, it is clear that we need an initial position and an initial velocity for the two
bodies. Later on, we will see that this can actually be justified mathematically.
Example 1.1.5 (Celestial mechanics). Now we extend the two-body system to a many-body system. Again, we subtract the center of mass, such that we obtain n sets of 3 equations for an (n+1)-body system. Since forces simply add up, this system becomes

x″_i = −G Σ_{j≠i} (m_j / r_ij³) r_ij.   (1.2)
https://ssd.jpl.nasa.gov/?horizons
1.2 Introduction to initial value problems
Remark 1.2.2. A differential equation (DE) that is not ordinary is called partial. These are equations or systems of equations which involve partial derivatives with respect to several independent variables. While the functions in an ordinary differential equation may depend on additional parameters, derivatives are only taken with respect to one variable. Often, but not exclusively, this variable is time. This manuscript only deals with ordinary differential equations, and so the adjective will be omitted in the following.
1.2.4 Lemma: Every differential equation (of arbitrary order) can be written as
a system of first-order differential equations. If the equation is explicit, then the
system is explicit.
Proof. We introduce the additional variables u_0(t) = u(t), u_1(t) = u′(t), …, u_{n−1}(t) = u^{(n−1)}(t). Then, the differential equation in (1.3) can be reformulated as the system
In the case of an explicit equation, the system has the form

\begin{pmatrix} u_0(t) \\ u_1(t) \\ \vdots \\ u_{n-2}(t) \\ u_{n-1}(t) \end{pmatrix}' = \begin{pmatrix} u_1(t) \\ u_2(t) \\ \vdots \\ u_{n-1}(t) \\ f(t, u_0(t), u_1(t), \ldots, u_{n-1}(t)) \end{pmatrix}.   (1.6)
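As a concrete instance of this reduction, the second-order equation u″ = −u becomes the first-order system (u_0, u_1)′ = (u_1, −u_0). The following sketch integrates this system with a simple explicit scheme (an illustrative choice; the step size and tolerance are ad hoc) and recovers u(t) = sin t from the initial values u(0) = 0, u′(0) = 1:

```python
import math

# u'' = -u rewritten with u0 := u, u1 := u' as the first-order system
# (u0, u1)' = (u1, -u0), following the construction in the lemma.
def rhs(state):
    u0, u1 = state
    return (u1, -u0)

def integrate(state, h, n):
    """Crude explicit Euler time stepping, enough to exercise the reduction."""
    for _ in range(n):
        d = rhs(state)
        state = (state[0] + h * d[0], state[1] + h * d[1])
    return state

# Initial values u(0) = 0, u'(0) = 1 give the exact solution u(t) = sin(t).
u_end, v_end = integrate((0.0, 1.0), h=1e-4, n=10_000)
assert abs(u_end - math.sin(1.0)) < 1e-3
```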
u′ = f(u).   (1.10)

A method which provides the same solution for the autonomous differential equation as for the original IVP is called invariant under autonomization.
Differential equations usually provide sets of solutions from which we have to choose a
solution. An important selection criterion is setting an initial value which leads to a well-
posed problem (see below).
1.2.7 Definition: Given a point (t0, u0) ∈ R × R^d and a function f(t, u) with values in R^d, defined in a neighborhood I × U ⊂ R × R^d of (t0, u0). Then, an initial value problem (IVP) is defined as follows: find a function u(t), such that

u′(t) = f(t, u(t)),   (1.11a)
u(t0) = u0.   (1.11b)
1.2.8 Definition: We call a continuously differentiable function u(t) with u(t0) = u0 a local solution of the IVP (1.11), if there exists a neighborhood J ⊂ R of t0, such that u(t) and f(t, u(t)) are defined and the equation (1.11a) holds for all t ∈ J.
Remark 1.2.9. We introduced the IVP deliberately in a "local" form because the notion of a local solution is the most useful one for our purposes. Since the neighborhood J in the definition above can be arbitrarily small, we will have to deal with the extension to larger intervals below.
1.2.11 Lemma: Let f be continuous in both arguments. Then, the function u(t) is a solution of the initial value problem (1.11) if and only if it is a solution of the Volterra integral equation (VIE)

u(t) = u0 + ∫_{t0}^{t} f(s, u(s)) ds.   (1.12)
Remark 1.2.12. The formulation as an integral equation, on the other hand, allows a more general notion of solution, because the problem is already well-posed for functions f(t, u) which are merely integrable with respect to t. (In that case, the solution u would be just absolutely continuous and not continuously differentiable.) Both the theoretical analysis of the IVP and the numerical methods in these lecture notes (with the exception of the BDF methods) in fact consider the associated integral equation (1.12) and not the IVP (1.11).
1.2.13 Theorem (Peano's existence theorem): Let α, β > 0 and let the function f(t, u) be continuous on the closed set

D = {(t, u) ∈ R × R^d : |t − t0| ≤ α, |u − u0| ≤ β}.

Then, the IVP (1.11) has at least one local solution.
The proof of this theorem is of little consequence for the remainder of these notes. For
its verification, we refer to textbooks on the theory of ordinary differential equations or to
[Ran17b, Satz 1.1].
Remark 1.2.14. The Peano existence theorem does not make any statements about the
uniqueness of a solution and guarantees only local existence. The second limitation is
addressed by the following theorem. Uniqueness will be discussed in section 1.4.
1.2.15 Theorem (Peano's continuation theorem): Let the assumptions of Theorem 1.2.13 hold. Then, the solution can be extended to an interval Im = [t−, t+] such that the points (t−, u(t−)) and (t+, u(t+)) are on the boundary of D. Note that neither the values of t nor the values of u(t) need to be bounded as long as f remains bounded.
Example 1.2.16. The IVP

u′ = 2√|u|,   u(0) = 0,

has solutions u(t) = t² and u(t) = 0 that both exist for all t ∈ R (global existence, but non-uniqueness).
Example 1.2.17. The IVP

u′ = −u²,   u(0) = 1,

has the unique solution u(t) = 1/(1 + t). This solution has a singularity for t → −1 (no global existence, but uniqueness). However, it exists for all t > −1 and thus in particular for all t ≥ t0 = 0, which is all that matters for an IVP.
1.3.1. The study of linear differential equations turns out to be particularly simple, and results obtained here will provide us with important statements for general non-linear IVPs. Therefore, we pay particular attention to the linear case.
1.3.2 Definition: An IVP according to Definition 1.2.7 is called linear if the right hand side f is an affine function of u and the IVP can be written in the form

u′(t) = A(t)u(t) + b(t),   (1.13a)
u(t0) = u0.   (1.13b)
1.3.3 Definition: Let the matrix function A : I → C^{d×d} be continuous. Then the function defined by

M(t) = exp(−∫_{t0}^{t} A(s) ds)   (1.14)

is called the integrating factor of equation (1.13a). It satisfies

M(t0) = I,   (1.15)
M′(t) = −M(t)A(t).   (1.16)
1.3.5 Lemma: Let M(t) be the integrating factor of the equation (1.13a) defined in (1.14). Then, the function

u(t) = M(t)^{−1} (u0 + ∫_{t0}^{t} M(s)b(s) ds)   (1.17)

solves the IVP (1.13). In particular, the IVP is globally solvable.
Proof. We consider the auxiliary function w(t) = M(t)u(t) with the integrating factor M(t) defined as in eqn. (1.14). It follows by using the product rule that

w′(t) = M(t)b(t).

According to Lemma A.2.3 about the matrix exponential, M(t) is invertible for all t. Thus we can apply M(t)^{−1} to w(t) to obtain the solution u(t) of (1.13) as given in equation (1.17).
The global solvability follows since the solution is defined for arbitrary t ∈ R.
Example 1.3.6. The equation in example 1.2.5 is linear and can be written in the form of (1.13) with

A(t) = A = \begin{pmatrix} 0 & ω \\ −ω & 0 \end{pmatrix}   and   b(t) = \begin{pmatrix} 0 \\ f(t) \end{pmatrix}.

Let now f(t) ≡ 0, t0 = 0 and u(0) = u0. It is easy to see that A has eigenvalues iω and −iω, so that we can write

A = C^{−1} \begin{pmatrix} ωi & 0 \\ 0 & −ωi \end{pmatrix} C

with a suitable transformation matrix C (DIY). Using the properties of the matrix exponential, the integrating factor is

M(t) = e^{−At} = C^{−1} \begin{pmatrix} e^{−iωt} & 0 \\ 0 & e^{iωt} \end{pmatrix} C = \begin{pmatrix} \cos ωt & \sin ωt \\ −\sin ωt & \cos ωt \end{pmatrix}.

The missing details in this argument and the case of an inhomogeneity f(t) = cos αt are left as an exercise (DIY).
Remark 1.3.7. If the function b(t) in (1.13a) is only integrable, the function u(t) defined
in (1.17) is absolutely continuous and thus differentiable almost everywhere. The chain
rule (1.18) is applicable in all points of differentiability and w(t) solves the Volterra integral
equation corresponding to (1.13). Thus, the representation formula (1.17) holds generally
for solutions of linear Volterra integral equations.
1.3.8 Lemma (Grönwall): Let w(t), a(t) and b(t) be nonnegative, integrable functions, such that a(t)w(t) is integrable. Furthermore, let b(t) be monotonically non-decreasing and let w(t) satisfy the integral inequality

w(t) ≤ b(t) + ∫_{t0}^{t} a(s)w(s) ds,   t ≥ t0.   (1.19)

Then, there holds for t ≥ t0

w(t) ≤ b(t) exp(∫_{t0}^{t} a(s) ds).   (1.20)
Proof. We define m(t) := exp(−∫_{t0}^{t} a(r) dr) and the auxiliary function

v(t) := m(t) ∫_{t0}^{t} a(s)w(s) ds.

This function is absolutely continuous, and since m′(t) = −a(t)m(t), we have almost everywhere

v′(t) = m(t)a(t) (w(t) − ∫_{t0}^{t} a(s)w(s) ds).

Using assumption (1.19), the bracket on the right can be bounded by b(t). Thus,

v′(t) ≤ m(t)a(t)b(t).
Integrating this inequality and using v(t0) = 0, and finally, since b(t) is nondecreasing, we obtain almost everywhere

∫_{t0}^{t} a(s)w(s) ds = v(t)/m(t) ≤ (b(t)/m(t)) ∫_{t0}^{t} a(s) exp(−∫_{t0}^{s} a(r) dr) ds
= (b(t)/m(t)) [−exp(−∫_{t0}^{s} a(r) dr)]_{s=t0}^{s=t}   (the exponential under the brackets being m(s))
= (b(t)/m(t)) (m(t0) − m(t)) = b(t)/m(t) − b(t).
Combining this bound with the integral inequality (1.19), we obtain

w(t) ≤ b(t) + ∫_{t0}^{t} a(s)w(s) ds ≤ b(t)/m(t),

which proves the lemma.
Remark 1.3.9. As we can see from the form of assumption (1.19) and estimate (1.20),
the purpose of Grönwall’s inequality is to construct a majorant for w(t) that satisfies a
linear IVP. The bound is particularly simple when a, b ≥ 0 are constant.
1.3.10 Corollary: If two solutions u(t) and v(t) of the linear differential equation (1.13a) coincide in a point t0, then they are identical.

Proof. The difference w(t) = v(t) − u(t) solves the integral equation

w(t) = ∫_{t0}^{t} A(s)w(s) ds.
Hence, for an arbitrary vector norm ‖·‖ (and induced matrix norm also denoted by ‖·‖), we obtain the integral inequality

‖w(t)‖ ≤ ∫_{t0}^{t} ‖A(s)w(s)‖ ds ≤ ∫_{t0}^{t} ‖A(s)‖ ‖w(s)‖ ds.
Now, applying Grönwall’s inequality (1.20) with a(t) = kA(t)k and b(t) = 0, we can
conclude that kw(t)k = 0 and therefore u(t) = v(t), for all t.
Corollary 1.3.11. The representation formula (1.17) in Lemma 1.3.5 defines the unique
solution to the IVP (1.13). In particular, solutions of linear IVPs are always defined for
all t ∈ R.
Example 1.3.12. Let A ∈ C^{d×d} be diagonalizable with (possibly repeated) eigenvalues λ1, …, λd and corresponding eigenvectors ψ^(1), …, ψ^(d). The linear IVP

u′ = Au,   u(0) = u0,

has the unique solution u(t) = e^{At} u0. Using the properties of the matrix exponential (see Appendix A.2.1), with Ψ ∈ C^{d×d} denoting the matrix with ith column ψ^(i), we get

u(t) = e^{Ψ diag(λ1,…,λd) Ψ^{−1} t} u0 = Ψ \begin{pmatrix} e^{λ_1 t} & & \\ & \ddots & \\ & & e^{λ_d t} \end{pmatrix} Ψ^{−1} u0.
1.3.13 Lemma: The solutions of the homogeneous, linear differential equation

u′(t) = A(t)u(t)   (1.21)

form a vector space of dimension d.
Proof. At first we observe that, due to linearity of the derivative and the right hand side,
for two solutions u(t) and v(t) of the equation (1.21) and for α ∈ R, αu(t) + v(t) is also
a solution of (1.21) with initial condition αu(0) + v(0) ∈ Rd . Therefore, the vector space
structure follows from the vector space structure of Rd .
Let now {φ^(i)(t)} be solutions of the IVP with linearly independent initial values {ψ^(i)}. As a consequence, the functions are linearly independent as well.
Assume that w(t) is a solution of equation (1.21) that cannot be written as a linear combination of the functions {φ^(i)(t)}. Then, w(0) is not a linear combination of the vectors ψ^(i): otherwise, if there existed {α_i}_{i=1,…,d} with w(0) = Σ_i α_i ψ^(i), then w(t) = Σ_i α_i φ^(i)(t) due to the uniqueness of any solution of equation (1.21) proven in Corollary 1.3.10, which would lead to a contradiction. However, since {ψ^(i)} was assumed to form a basis of R^d, every initial value is such a linear combination, so that no such w exists. It follows that the solution space has dimension d and that {φ^(i)(t)} forms a basis.
Since in the above argument t ∈ R was arbitrary, the ϕ(i) (t) are linearly independent for
all t ∈ R.
1.3.14 Definition: A basis {φ^(1), …, φ^(d)} of the solution space of the linear differential equation (1.21), in particular the basis with initial values φ^(i)(0) = e_i, is called a fundamental system of solutions. The matrix function

Y(t) = (φ^(1)(t), …, φ^(d)(t))

is called the fundamental matrix.
1.3.15 Corollary: The fundamental matrix Y(t) is regular for all t ∈ R and it solves the IVP

Y′(t) = A(t)Y(t),   Y(0) = I.
Proof. The initial value is part of the definition. On the other hand, splitting the matrix-
valued IVP into its column vectors, we obtain the original family of IVPs defining the
solution space. Regularity follows from the linear independence of the solutions for any t.
1.4 Well-posedness of the IVP
1. A solution exists.

2. The solution is unique.

3. The solution depends continuously on the data of the problem.
The third condition in this form is purely qualitative. Typically, in order to characterize
problems with good approximation properties, we will require Lipschitz continuity, which
has a more quantitative character.
with c ∈ R. Thus, the solution is not unique and therefore the IVP is not well-posed. Let now the initial value be nonzero but slightly positive. Then, the solution is unique, i.e., u(t) ≈ (2t/3)^{3/2}. In contrast, when the initial value is slightly negative, there exists no real-valued solution. Hence, a small perturbation of the initial condition has a dramatic effect on the solution; this is what the third condition for a well-posed problem in Definition 1.4.1 excludes.
It satisfies a local Lipschitz condition if (1.23) holds for all compact subsets of D.
Example 1.4.4. Let f(t, y) ∈ C¹(R × R^d) and let all partial derivatives with respect to the components of u be bounded, such that

max_{t ∈ R, y ∈ R^d, 1 ≤ i,j ≤ d} |∂f_j/∂y_i (t, y)| ≤ K.
Then, f satisfies the Lipschitz condition (1.23) with L = Kd. Indeed, by using the Fundamental Theorem of Calculus, we see that

f_j(t, y) − f_j(t, x) = ∫_0^1 d/ds f_j(t, x + s(y − x)) ds = ∫_0^1 Σ_{i=1}^{d} ∂f_j/∂y_i (t, x + s(y − x)) (y_i − x_i) ds.
Now, exploiting the fact that |Ax| ≤ ‖A‖_F |x|, where ‖A‖_F := (Σ_{i,j=1}^{d} a_ij²)^{1/2} is the Frobenius norm of the matrix A, we get

|f(t, y) − f(t, x)| ≤ ∫_0^1 |Σ_{i=1}^{d} ∂f/∂y_i (t, x + s(y − x)) (y_i − x_i)| ds
≤ ∫_0^1 (Σ_{i,j=1}^{d} |∂f_j/∂y_i (t, x + s(y − x))|²)^{1/2} |y − x| ds ≤ Kd |y − x|.
1.4.5 Theorem (Stability): Let f(t, y) and g(t, y) be two continuous functions on a cylinder D = I × Ω, where the interval I contains t0 and Ω is a convex set in R^d. Furthermore, let f admit a Lipschitz condition with constant L on D. Let u and v be solutions to the IVPs

u′ = f(t, u),   u(t0) = u0,
v′ = g(t, v),   v(t0) = v0.

Then

|u(t) − v(t)| ≤ e^{L|t−t0|} (|u0 − v0| + ∫_{t0}^{t} max_{x∈Ω} |f(s, x) − g(s, x)| ds).   (1.26)
Proof. Both u(t) and v(t) solve their respective Volterra integral equations. Taking the difference, we obtain

u(t) − v(t) = u0 − v0 + ∫_{t0}^{t} (f(s, u(s)) − g(s, v(s))) ds
= u0 − v0 + ∫_{t0}^{t} (f(s, u(s)) − f(s, v(s))) ds + ∫_{t0}^{t} (f(s, v(s)) − g(s, v(s))) ds.

Taking norms and using the Lipschitz condition on f, we obtain

|u(t) − v(t)| ≤ |u0 − v0| + ∫_{t0}^{t} max_{x∈Ω} |f(s, x) − g(s, x)| ds + ∫_{t0}^{t} L |u(s) − v(s)| ds.

This inequality is in the form of the assumption in Grönwall's lemma with a ≡ L and w(t) := |u(t) − v(t)|, and its application yields the stability result (1.26).
1.4.6 Theorem (Picard-Lindelöf): Let f(t, y) be continuous on a cylinder

D = {(t, y) ∈ R × R^d : |t − t0| ≤ a, |y − u0| ≤ b}.

Let f be bounded by M := max_{(t,y)∈D} |f(t, y)| and satisfy the Lipschitz condition (1.23) with constant L on D. Then the IVP

u′ = f(t, u),   u(t0) = u0,

has a unique solution on the interval I = [t0 − T, t0 + T], where T = min(a, b/M).
Proof. On the space C(I) of continuous functions on I, define the operator F by

(F u)(t) := u0 + ∫_{t0}^{t} f(s, u(s)) ds.

Obviously, u is a solution of the Volterra integral equation (1.12) if and only if u is a fixed point of F, i.e., u = F u. We can obtain such a fixed point by the iteration u^{(k+1)} = F(u^{(k)}) with some initial guess u^{(0)} : I → Ω.
From the boundedness of f, we obtain for all t ∈ I that

|u^{(k+1)}(t) − u0| = |∫_{t0}^{t} f(s, u^{(k)}(s)) ds| ≤ |∫_{t0}^{t} |f(s, u^{(k)}(s))| ds| ≤ T M ≤ b.
and by multiplying both sides with e^{−2Lt} it follows that

‖F(u) − F(v)‖_e ≤ (1/2) ‖u − v‖_e.
Thus, we have shown that F is a contraction on (C(I), k·ke ). Therefore, we can apply
theorem A.3.1, the Banach Fixed-Point Theorem, and conclude that F has exactly one
fixed-point, which completes the proof.
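The Picard iteration u^{(k+1)} = F(u^{(k)}) used in this proof can also be carried out numerically. The following sketch (NumPy; the trapezoidal quadrature, the grid, and the iteration count are illustrative choices, not part of the proof) applies it to u′ = u, u(0) = 1 on [0, 1/2], where it converges to e^t:

```python
import numpy as np

def picard_iterate(f, u0, t_grid, n_iter):
    """Fixed-point iteration u_{k+1}(t) = u0 + int_{t0}^t f(s, u_k(s)) ds,
    with the integral approximated by the trapezoidal rule on t_grid."""
    u = np.full_like(t_grid, u0)  # initial guess: the constant function u0
    for _ in range(n_iter):
        fu = f(t_grid, u)
        increments = (fu[1:] + fu[:-1]) / 2 * np.diff(t_grid)
        u = u0 + np.concatenate(([0.0], np.cumsum(increments)))
    return u

t = np.linspace(0.0, 0.5, 501)
u = picard_iterate(lambda s, y: y, 1.0, t, n_iter=20)
# For u' = u, u(0) = 1 the iteration converges to e^t.
assert np.max(np.abs(u - np.exp(t))) < 1e-5
```

The contraction argument above guarantees geometric convergence of this iteration on a short enough interval; here the interval length 1/2 already suffices.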
Remark 1.4.7. The norm ‖·‖_e has been chosen with regard to Grönwall's inequality, even though the latter was not used explicitly in the proof. It is equivalent to the norm ‖·‖_∞ because e^{−2Lt} is strictly positive and bounded. One could have performed the proof (with some extra calculations) also with respect to the ordinary maximum norm ‖·‖_∞.
If f satisfies a Lipschitz condition everywhere then this leads to the following corollary.
Corollary 1.4.9. Let the function f (t, u) admit the Lipschitz condition on R × Cd . Then,
the IVP has a unique solution on the whole real axis.
Proof. The boundedness was used in order to guarantee that u(t) ∈ Ω for any t. This is not
necessary anymore, if Ω = Cd . The fixed point argument does not depend on boundedness
of the set. (See Exercise Sheet 4 for a more detailed proof.)
Chapter 2
2.1 Introduction
Example 2.1.1 (Euler’s method). We begin this section with the method that serves as
the prototype for a whole class of schemes for solving IVPs numerically.
(As always for problems with infinite-dimensional solution spaces, numerical solution refers to finding an approximation by means of a discretization method and the study of the associated discretization error.)
Note first that at the initial point 0, not only the value u(0) = u0 of u, but also the derivative u′(0) = f(0, u0) is known. Thus, near 0 we are able to approximate the solution u(t) (in blue in Figure 2.1) by a straight line y(t) (in red in Figure 2.1, left) using a first-order Taylor series expansion, i.e.

y(t) = u0 + t f(0, u0).
The figure suggests that in general the accuracy of this method may not be very good for
t far from 0. The first improvement is that we do not draw the line through the whole
interval from 0 to T . Instead, we insert intermediate points and apply the method to each
subinterval, using the result of the previous interval as the initial point for the next. As a
result we obtain a continuous chain of straight lines (in red in Figure 2.1, right) and the
so-called Euler method (details below).
Figure 2.1: Derivation of the Euler method. Left: approximation of the solution of the
IVP by a line that agrees in slope and value with the solution at t = 0. Right: Euler
method with three subintervals.
Figure: partition of the time interval I = [0, T] into subintervals I_1, …, I_n with grid points t_0 = 0 < t_1 < … < t_n = T and I_k = [t_{k−1}, t_k].
Definition 2.1.3 (Notation). In the following chapters we will regularly compare the
solution of an IVP with the results of discretization methods. Therefore, we introduce the
following convention for notation and symbols.
The solution of the IVP is called the exact or continuous solution, to emphasize that it is the solution of the non-discretized problem. We denote it in general by u and in addition we use the abbreviation

u_k = u(t_k).

If u is vector-valued, we also use the alternative superscript u^(k) and write u_i^(k) for the ith component of the vector u(t_k).
The discrete solution is in general denoted by y and we write yk or y (k) for the value of
the discrete solution at time tk . In contrast to the continuous solution, y is only defined
at discrete time steps (except for special methods discussed later).
2.1.4 Definition (Explicit one-step methods): An explicit one-step method is a method which, given u0 at t0 = 0, computes a sequence of approximations y1, …, yn to the solution of an IVP at the time steps t1, …, tn using an update formula of the form

y_k = y_{k−1} + h_k F_{h_k}(t_{k−1}, y_{k−1}).   (2.1)

The function F_{h_k}() is called the increment function.^a We will often omit the index h_k on F_{h_k}() because it is clear that the method is always applied to time intervals. The method is called a one-step method because the value y_k explicitly depends only on the values y_{k−1} and f(t_{k−1}, y_{k−1}), not on previous values.

^a The adjective 'explicit' is here in contrast to 'implicit' one-step methods, where the increment function depends also on y_k and equation (2.1) typically leads to a nonlinear equation for y_k.
Remark 2.1.5. For one-step methods, every step is by definition identical. Therefore, it is sufficient to define and analyze methods by stating the dependence of y1 on y0, which can then be transferred to the general step from y_{n−1} to y_n. The general one-step method above then reduces to

y1 = y0 + h0 F_{h0}(t0, y0).

This implies that the values y_k with k ≥ 2 are computed through formula (2.1) with the respective h_k and the same increment function (but evaluated at t_{k−1}, y_{k−1}).
2.1.6 Example: The simplest choice for the increment function is F_{h_k}(t, u) := f(t, u), leading to the Euler method

y1 = y0 + h f(t0, y0).   (2.2)

Consider the IVP

u′ = u,   u(0) = 1,

which has the exact solution u(t) = e^t. In that case, the Euler method (with uniform time steps) reads

y1 = (1 + h) y0.
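The Euler method is only a few lines of code. The sketch below (plain Python; the step size and tolerances are illustrative choices) implements the update (2.2) for a general right hand side f and checks both the closed form y_n = (1 + h)^n for u′ = u and the approach to e = u(1):

```python
import math

def euler(f, t0, y0, h, n):
    """Explicit Euler method: y_k = y_{k-1} + h * f(t_{k-1}, y_{k-1})."""
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)
        t = t + h
    return y

# For u' = u, u(0) = 1 each step multiplies by (1 + h), so after n = 1/h steps
# y_n = (1 + h)**n, which approaches e = u(1) as h -> 0.
h, n = 0.001, 1000
y = euler(lambda t, y: y, 0.0, 1.0, h, n)
assert abs(y - (1 + h) ** n) < 1e-11
assert abs(y - math.e) < 2e-3
```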
Figure 2.2: Local and accumulated errors. Exact solution in black, the Euler method in
red. On the left, in blue the exact solution of an IVP on the second interval with initial
value y1 . On the right, in purple the second step of the Euler method, but with exact
initial value u1 .
Remark 2.2.1. In Figure 2.1, we observe that, at a given time t_{k+1}, the error consists of two parts: (i) due to replacing the differential equation by the discrete method on the interval I_k and (ii) due to the initial value y_k already being inexact. This is illustrated more clearly in Figure 2.2. Therefore, in our analysis we split the error into the local error and an accumulated error. The local error compares continuous and discrete solutions on a single interval with the same initial value. In the analysis, we will have the options of using the exact (right figure) or the approximated initial value (left figure).
i.e., the difference between the solution un of the differential equation at tn and the
result of the one-step method at tn .
2.2.3 Definition: Let u be a solution of the differential equation u′ = f(t, u) on the interval I_k = [t_{k−1}, t_k]. Then, the local error of a discrete method F_{h_k} is

η_k = η_k(u) = u_k − (u_{k−1} + h_k F_{h_k}(t_{k−1}, u_{k−1})),   (2.4)

i.e., the difference between u_k = u(t_k) and the result of one time step (2.1) with this method with exact initial value u_{k−1} = u(t_{k−1}).
The truncation error is the quotient of the local error and h_k:

τ_k = τ_k(u) = η_k/h_k = (u_k − u_{k−1})/h_k − F_{h_k}(t_{k−1}, u_{k−1}).   (2.5)

The one-step method F_{h_k}(t, y) is said to have consistency of order p, if for all sufficiently regular functions f there exists a constant c independent of h := max_{k=1}^{n} h_k, such that for h → 0:

max_{k=1}^{n} |τ_k| ≤ c h^p.   (2.6)
Example 2.2.4 (Euler method). To find the order of consistency of the Euler method, where F_{h_k}(t, y) = f(t, y), consider the Taylor expansion of u at t_{k−1}:

u(t_k) = u(t_{k−1}) + h_k u′(t_{k−1}) + (1/2) h_k² u″(ζ),   for some ζ ∈ I_k.

As a result, the truncation error reduces to

τ_k = (u_k − u_{k−1})/h_k − F_{h_k}(t_{k−1}, u(t_{k−1}))
= (h_k f(t_{k−1}, u_{k−1}) + (1/2) h_k² u″(ζ))/h_k − f(t_{k−1}, u_{k−1}) = (1/2) u″(ζ) h_k.

If f ∈ C¹(D) on a compact set D around the graph of u (slightly more than Lipschitz continuity), we can bound the right hand side:

|τ_k| ≤ (1/2) max_{t∈I_k} |u″(t)| h_k = (1/2) max_{t∈I_k} |∂f/∂t(t, u(t)) + ∇_y f(t, u(t)) u′(t)| h_k
≤ (1/2) max_{(t,y)∈D} |∂f/∂t(t, y) + ∇_y f(t, y) f(t, y)| h_k =: c h_k.

Hence, under this smoothness assumption on f, the Euler method is consistent of order 1.
Next we consider the stability of explicit one-step methods. For the analysis, we first need a discrete version of Grönwall's inequality.
2.2.5 Lemma (Discrete Grönwall inequality): Let (w_k), (a_k), (b_k) be nonnegative sequences of real numbers, such that (b_k) is monotonically nondecreasing. Then, it follows from

w_0 ≤ b_0   and   w_n ≤ Σ_{k=0}^{n−1} a_k w_k + b_n   for all n ≥ 1,   (2.7)

that

w_n ≤ exp(Σ_{k=0}^{n−1} a_k) b_n.   (2.8)
Proof. For k ∈ N, define the piecewise constant functions w(t), a(t), and b(t) by

w(t) = w_{k−1},   a(t) = a_{k−1},   b(t) = b_{k−1}   for all t ∈ [k − 1, k).

These functions are bounded and piecewise continuous on any finite interval. Thus, they are integrable on [0, n]. Therefore, the continuous Grönwall inequality of Lemma 1.3.8 applies and proves the result.
we obtain
The estimate now follows from the discrete Grönwall inequality in Lemma 2.2.5.
2.2.7 Corollary (One-step methods with finite precision): Let the one-step method F_{h_k} be run on a computer, yielding a sequence (z_k), such that each time step is executed in finite precision arithmetic. Let (y_k) be the mathematically correct solution of the one-step method. Then, the difference equation (2.1) is fulfilled only up to machine accuracy eps, i.e., there exists a c > 0 such that

|z_k − z_{k−1} − h_k F_{h_k}(t_{k−1}, z_{k−1})| ≤ c |z_k| eps.

Then, the error between the true solution of the one-step method y_n and the computed solution is bounded by

|y_n − z_n| ≤ c e^{LT} n max_{k=0}^{n} |z_k| eps.
Now, since F_{h_k} is Lipschitz continuous in its second argument, we can apply the discrete stability theorem 2.2.6 and use the bound in (2.10) to obtain

|u_n − y_n| ≤ e^{LT} Σ_{k=1}^{n} |η_k(u) − η_k(y)| ≤ e^{LT} Σ_{k=1}^{n} h_k 2c h^p = 2cT e^{LT} h^p.
Corollary 2.2.9. If f is in C 1 in a compact set D around the graph of u over [0, T ], then
the convergence order of the global error in the Euler method is one.
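This first-order convergence can be observed numerically: halving the step size should roughly halve the global error at the final time. A minimal check (plain Python, on the model problem u′ = u, u(0) = 1 from Example 2.1.6; the step sizes and the acceptance window are ad hoc choices):

```python
import math

def euler_error(h):
    """Global error at T = 1 of the Euler method for u' = u, u(0) = 1."""
    n = round(1.0 / h)
    y = 1.0
    for _ in range(n):
        y *= 1.0 + h  # one Euler step: y_k = y_{k-1} + h * y_{k-1}
    return abs(y - math.e)

# First-order convergence: error(h) / error(h/2) should be close to 2.
e1, e2 = euler_error(0.01), euler_error(0.005)
assert 1.8 < e1 / e2 < 2.2
```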
2.3 Runge-Kutta methods
2.3.1. Since the IVP is equivalent to the Volterra integral equation (1.12), we can consider its numerical solution as a quadrature problem. The difficulty, however, is that the integrand is not known. It will need to be approximated on each interval from values at earlier times, leading to a class of methods for IVPs called Runge-Kutta methods.
We present the formulas again only for the calculation of y1 from y0 on the interval from t0 to t1 = t0 + h. The formula for a later time step k is obtained by replacing y0, t0 and h by y_{k−1}, t_{k−1} and h_k, respectively, to obtain y_k. (The coefficients a_ij, b_j, c_j remain fixed.)
g_i = y_0 + h Σ_{j=1}^{i−1} a_ij k_j,   i = 1, …, s,   (2.11a)
k_i = f(t_0 + c_i h, g_i),   i = 1, …, s,   (2.11b)
y_1 = y_0 + h Σ_{i=1}^{s} b_i k_i.   (2.11c)
Remark 2.3.3. In typical implementations, the intermediate values gi are not stored
separately. However, they are useful for highlighting the structure of the method.
0   |
c_2 | a_21
c_3 | a_31  a_32
⋮   | ⋮     ⋮    ⋱                      (2.12)
c_s | a_s1  a_s2  ⋯  a_{s,s−1}
----+------------------------------
    | b_1   b_2   ⋯  b_{s−1}   b_s
Remark 2.3.5. The first row of the tableau should be read such that c1 = 0, g1 = y0 and
that k1 is computed directly by f (t0 , y0 ). The method is explicit since the computation
of ki only involves coefficients with index less than i. The last row below the line is the
short form of formula (2.11c) and lists the quadrature weights in the increment function
Fh (t, y0 ).
Considering the coefficients aij as the entries of a square s × s matrix A, we see that A is
strictly lower triangular. This is the defining feature of an explicit RK method. We will
later see RK methods that also have entries on the diagonal or even in the upper part.
Those methods will be called “implicit”, because the computation of a stage value also
involves the values at the current or future stages. We will also write b = (b1 , . . . , bs )T and
c = (0, c2 , . . . , cs )T , such that the Butcher tableau in (2.12) simplifies to
c | A
--+----
  | bᵀ
Example 2.3.6. The Euler method has the Butcher tableau

0 |
--+--
  | 1

and reads

y1 = y0 + h f(t0, y0).

The values b1 = 1 and c1 = 0 indicate that this is a quadrature rule with a single point at the left end of the interval. As a quadrature rule, such a rule is exact for constant polynomials and thus of order 1.
Example 2.3.7. The modified Euler method reads

k1 = f(t0, y0),
k2 = f(t0 + h/2, y0 + (h/2) k1),
y1 = y0 + h k2,

with Butcher tableau

0   |
1/2 | 1/2
----+---------
    | 0    1

The Heun method of order 2 reads

k1 = f(t0, y0),
k2 = f(t0 + h, y0 + h k1),
y1 = y0 + h (k1 + k2)/2,

with Butcher tableau

0 |
1 | 1
--+---------
  | 1/2  1/2
Remark 2.3.8. The modified Euler method uses one quadrature node at t0 + h/2 = (t0 + t1)/2 and an approximation to f(t0 + h/2, u(t0 + h/2)) in F_h, corresponding to the midpoint quadrature rule. The Heun method of order 2 is constructed based on the trapezoidal rule. Both quadrature rules are of second order, and so are these one-step methods. Both methods were discussed by Runge in his article of 1895 [Run95].
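A whole family of such schemes can be driven by one function that takes the Butcher tableau as data. The sketch below (plain Python; the helper names, test problem, and tolerance are illustrative choices) implements the generic explicit step (2.11) and instantiates it with the tableaus of the modified Euler and Heun methods:

```python
import math

def erk_step(A, b, c, f, t0, y0, h):
    """One step of an explicit Runge-Kutta method with Butcher tableau (A, b, c),
    A strictly lower triangular: g_i = y0 + h*sum_j a_ij*k_j,
    k_i = f(t0 + c_i*h, g_i), y1 = y0 + h*sum_i b_i*k_i."""
    s = len(b)
    k = []
    for i in range(s):
        g = y0 + h * sum(A[i][j] * k[j] for j in range(i))
        k.append(f(t0 + c[i] * h, g))
    return y0 + h * sum(b[i] * k[i] for i in range(s))

# Tableaus of the modified Euler and Heun (order 2) methods from the text.
modified_euler = ([[0, 0], [0.5, 0]], [0.0, 1.0], [0.0, 0.5])
heun2 = ([[0, 0], [1.0, 0]], [0.5, 0.5], [0.0, 1.0])

def integrate(tableau, f, y0, T, n):
    A, b, c = tableau
    h, t, y = T / n, 0.0, y0
    for _ in range(n):
        y = erk_step(A, b, c, f, t, y, h)
        t += h
    return y

# Both methods are second order: for u' = u they are already far more
# accurate than the Euler method at the same step size.
for tab in (modified_euler, heun2):
    assert abs(integrate(tab, lambda t, y: y, 1.0, 1.0, 100) - math.e) < 1e-3
```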
2.3.9 Lemma: For sufficiently smooth f, the Heun method of order 2 and the modified Euler method have consistency order two.^a

^a Here and in the following proofs of consistency order, we will always assume that all necessary derivatives of f exist and are bounded, and simply write "f is sufficiently smooth".
Proof. The proof uses Taylor expansion of the continuous solution u1 and the discrete solution y1 around t0 with y0 = u0. W.l.o.g. we choose t0 = 0. Considering first only the case d = 1 and using the abbreviations f := f(t0, u0), f_t := ∂_t f(t0, u0), f_y := ∂_y f(t0, u0), etc., we have

u1 = u(h) = u0 + h f + (h²/2)(f_t + f_y f) + (h³/6)(f_tt + 2f_ty f + f_yy f² + f_y f_t + f_y² f) + …   (2.13)

For the discrete solution of the modified Euler step, on the other hand, Taylor expanding f around (t0, u0) leads to

y1 = u0 + h f(t0 + h/2, u0 + (h/2) f(t0, u0))
= u0 + h f + (h²/2)(f_t + f_y f) + (h³/8)(f_tt + 2f_ty f + f_yy f²) + …

Thus, |τ1| = h^{−1} |u1 − y1| = O(h²). Since the truncation error at the kth step can be estimated identically, the method has consistency order two. The proof for the Heun method is left as an exercise.
For d > 1, the derivatives with respect to y are no longer scalars, but tensors of increasing
rank, or equivalently multilinear operators. Thus, to be precise in d dimensions, ∂y f (t0 , u0 )
is a d × d matrix that is applied to the vector f (t0 , u0 ) and we should write more carefully
fy (f ) = ∂y f (t0 , u0 )f (t0 , u0 ).
Similarly, ∂yy f (t0 , u0 ) is a d × d × d rank-3 tensor, or more simply a bilinear operator and
to stress this we write fyy (f, f ) instead of fyy f 2 . (However, we will not dwell on this issue
in this course.)
Example 2.3.10. The following 3-stage method is based on Simpson's quadrature rule:

    k1 = f(t0, y0)
    k2 = f(t0 + h/2, y0 + (h/2) k1)
    k3 = f(t0 + h, y0 - h k1 + 2h k2)
    y1 = y0 + h (k1/6 + 4 k2/6 + k3/6)

with Butcher tableau

     0  |
    1/2 | 1/2
     1  | -1    2
    ----+---------------
        | 1/6  4/6  1/6
Remark 2.3.11. These Taylor series expansions become tedious very fast. For autonomous ODEs u' = f(u) the analysis simplifies significantly. The Runge-Kutta method (2.11) reduces to

    gi = y0 + h Σ_{j=1}^{i-1} aij f(gj),   i = 1, . . . , s,
                                                               (2.14)
    y1 = y0 + h Σ_{j=1}^{s} bj f(gj).
Each (non-autonomous) ODE can be autonomized (see Def. 1.2.6) using the transformation

    U := (u, t)^T,   U' = (f(t, u), 1)^T =: F(U).     (2.15)
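The transformation (2.15) is mechanical and can be written as a small wrapper; the following is a sketch (names and the example right-hand side are ours):

```python
import numpy as np

def autonomize(f):
    """Wrap y' = f(t, y) as an autonomous system U' = F(U) with U = (y, t)."""
    def F(U):
        y, t = U[:-1], U[-1]
        return np.append(f(t, y), 1.0)   # appended last component: t' = 1
    return F

# example: the scalar, non-autonomous right-hand side f(t, y) = t*y
F = autonomize(lambda t, y: t * y)
```

Evaluating F at U = (y, t) = (2, 3) returns (f(3, 2), 1) = (6, 1), i.e. the original slope plus the trivial equation t' = 1.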
Proof. Considering the last components of the vectors gi in (2.14) when applied to the autonomized ODE (2.15) with right hand side F(·), we obtain

    t0 + h Σ_{j=1}^{i-1} aij.

For the ERK to be invariant under autonomization, we require that f is evaluated at t0 + h ci in the ith stage, leading to the first condition in (2.16). Similarly, the second condition in (2.16) follows from the last component of y1 when applying (2.14) to (2.15).
b1 c1 + · · · + bs cs = 1/2 (2.17b)
Proof. We consider the autonomous ODE u' = f(u) with u(t0) = u0, where we assume w.l.o.g. again t0 = 0. As in the proof of lemma 2.3.9, we first expand u1 around t0, using u'(t0) = f(u0) = f. Also, since f now only depends on one argument, we abbreviate

    d/dt f(u(t0)) = ∇f(u0) f(u0) =: f'(f),   d^2/dt^2 f(u(t0)) =: f''(f, f) + f'(f'(f)),   . . .

Thus, we obtain

    u1 = u0 + h f + (h^2/2) f'(f) + (h^3/6) (f''(f, f) + f'(f'(f)))                     (2.18)
            + (h^4/24) (f'''(f, f, f) + 3 f''(f'(f), f) + f'(f''(f, f)) + f'(f'(f'(f)))) + O(h^5).
Using (2.20) and the definition of an ERK for an autonomous ODE in (2.14) we get

    y1^(q)(0) = 0 + Σ_{i=1}^{s} bi [d^q/dh^q (h f(gi(h)))]_{h=0} = q Σ_{i=1}^{s} bi [d^{q-1}/dh^{q-1} f(gi(h))]_{h=0},     (2.21)

    gi^(q)(0) = 0 + Σ_{j=1}^{s} aij [d^q/dh^q (h f(gj(h)))]_{h=0} = q Σ_{j=1}^{s} aij [d^{q-1}/dh^{q-1} f(gj(h))]_{h=0}     (2.22)

(where we have assumed again for simplicity that aij = 0 for j ≥ i).
Finally, we need to apply the chain rule to compute the derivatives of f(gi(h)) needed in (2.21) and (2.22). First for q = 1, using again the shorthand notation for the higher derivatives of f as above, it follows from (2.16), (2.19) and (2.22) that

    [d/dh f(gi(h))]_{h=0} = f'(gi'(0)) = f'(Σ_{j=1}^{s} aij f(gj(0))) = Σ_{j=1}^{s} aij f'(f) = ci f'(f).
The case q = 3 can be derived in a similar way and is left as an exercise DIY .
(4)
as well as a similar formula for y1 (0), which is again left as an exercise DIY .
Inserting these into the Taylor expansion

    y1(h) = y1(0) + h y1'(0) + (h^2/2) y1''(0) + (h^3/6) y1'''(0) + (h^4/24) y1^(4)(0) + O(h^5)
and comparing coefficients with the coefficients in the expansion of u1 in (2.18), we obtain
the order conditions in (2.17).
Remark 2.3.14. Butcher introduced a graph theoretical method for order conditions
based on trees. While this simplifies the process of deriving these conditions for higher
order methods considerably, it is beyond the scope of this course.
Example 2.3.15 (Classical Runge-Kutta method). The classical Runge-Kutta method of order 4 is given by

    k1 = f(t0, y0)
    k2 = f(t0 + h/2, y0 + (h/2) k1)
    k3 = f(t0 + h/2, y0 + (h/2) k2)
    k4 = f(t0 + h, y0 + h k3)
    y1 = y0 + h (k1/6 + 2 k2/6 + 2 k3/6 + k4/6)

with Butcher tableau

     0  |
    1/2 | 1/2
    1/2 |  0   1/2
     1  |  0    0    1
    ----+--------------------
        | 1/6  2/6  2/6  1/6
Like the 3-stage method in example 2.3.10, this formula is based on Simpson’s
quadrature rule, but it uses two approximations for the value in the center point
and is of fourth order.
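The fourth order is easy to observe numerically; a sketch (test problem and names are our choices):

```python
import math

def rk4_step(f, t, y, h):
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def solve(f, y0, h, n):
    t, y = 0.0, y0
    for _ in range(n):
        y, t = rk4_step(f, t, y, h), t + h
    return y

# u' = u, u(0) = 1 on [0, 1]: halving h should reduce the error by about 2^4 = 16
f = lambda t, y: y
ratio = abs(solve(f, 1.0, 0.2, 5) - math.e) / abs(solve(f, 1.0, 0.1, 10) - math.e)
```

The observed ratio is close to 16, matching consistency (and hence convergence) order four.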
Remark 2.3.16 (Order conditions and quadrature). The order conditions derived by
recursive Taylor expansion have a very natural interpretation via the analysis of quadrature
formulae for the Volterra integral equation, where ci h, i = 1, . . . , s, are the quadrature
points on [0, h] and the other coefficients are quadrature weights. First, we observe that
    h Σ_{i=1}^{s} bi f(ci h, gi)   approximates   ∫_0^h f(s, u(s)) ds.

In this view, conditions (2.17a)-(2.17c) and (2.17e) state that the quadrature formula Σ_i bi p(ci h) is exact for polynomials p of degree up to 3. This implies (see Numerik 0) that the convergence of the quadrature rule is of 4th order.
The condition (2.16), which guarantees that the method is autonomizable, simply translates
to the quadrature rule being exact for constant functions.
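For the classical fourth-order Runge-Kutta method this exactness is easy to verify numerically (a sketch; b and c are the standard weights and nodes of that method):

```python
import numpy as np

# weights and nodes of the classical fourth-order Runge-Kutta method
b = np.array([1.0, 2.0, 2.0, 1.0]) / 6.0
c = np.array([0.0, 0.5, 0.5, 1.0])

# sum_i b_i c_i^q should equal int_0^1 s^q ds = 1/(q+1) for q = 0, ..., 3
exact_up_to_3 = [abs(b @ c**q - 1.0 / (q + 1)) < 1e-14 for q in range(4)]
```

For q = 4 the identity fails, consistent with the rule being exact only up to degree 3.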
For higher order, the accuracy of the value of gi only implicitly enters the accuracy of the
Runge-Kutta method as an approximation of the integrand in another quadrature rule.
Thus, we actually look at approximations of nested integrals of the form
    ∫_0^h ϕ(s) ∫_0^s ψ(r) dr ds.
Condition (2.17d) for 3rd order states that this condition must be true for linear polynomials ψ(r) and constant ϕ(s); thus, after the inner integration, the outer rule again integrates a polynomial of second order. Equally, conditions (2.17h) and (2.17f) for fourth order state this for linear polynomials ψ(r) with linear ϕ(s) and for quadratic polynomials ψ(r) with constant ϕ(s), respectively. Finally, condition (2.17g) states that the quadrature has to be exact for any linear polynomial ϕ(τ) in
    ∫_0^h ∫_0^s ∫_0^r ϕ(τ) dτ dr ds.
Remark 2.3.17 (Butcher barriers). The maximal order of an explicit Runge-Kutta method
is limited through the number of stages, or vice versa, a minimum number of stages is re-
quired for a certain order. The Butcher barriers state that in order to achieve consistency
of order p one requires s stages, where p and s are related as follows:

    p       | 1  2  3  4   5    6    7    8
    # cond. | 1  2  4  8  17   37   85  200
    s       | p  p  p  p  p+1  p+1  p+2  p+3
2.3.18 Lemma: Let f (t, y) admit a uniform Lipschitz condition on [0, T ] × Ω with
{u(t) : t ∈ [0, T ]} ⊂ Ω. Then every ERK that is invariant under autonomization
admits a uniform Lipschitz condition on [0, T ] × Ω.
with gi defined recursively by

    gi(y; h) = y + h Σ_{j=1}^{i-1} aij f(t + cj h, gj(y; h)).
Let Lf be the Lipschitz constant of f and let q := hLf . Then, for any x, y ∈ Ω, using
(2.16), we get
for general step size h and with constant L = 2Lf for h ≤ (2Lf )−1 .
Corollary 2.3.19. Let f (t, y) admit a uniform Lipschitz condition on [0, T ] × Ω with
{u(t) : t ∈ [0, T ]} ⊂ Ω and let Fhk (., .) be an ERK that is invariant under autonomization.
Then consistency of order p implies convergence with order p and
where L is the Lipschitz constant of Fh from lemma 2.3.18 and the constant c is independent
of h.
2.4 Estimates of the local error and time step control
2.4.1. The analysis in the last section used a crude a priori bound of the local error based
on high-order derivatives of the right hand side f (t, u). In the case of a complex nonlinear
system, such an estimate is bound to be inefficient, since it involves global bounds on the
derivatives. Obviously, the local error cannot be computed exactly either, because that
would require or imply the knowledge of the exact solution.
In this section, we discuss two methods that allow an estimate of the truncation error
from computed solutions. These estimates are local in nature and therefore usually much
sharper. Thus, they can be used to control the step size, which in turn gives good control
over the balance of accuracy and effort.
Nevertheless, in these estimates there is the implicit assumption that the true solution u is
sufficiently regular and the step size is sufficiently small, such that the local error already
follows the theoretically predicted order.
The main idea is to use two numerical estimates of u(tk ) that converge with different order
to estimate the leading order term of the local error for the lower-order method. Given
this estimate for the local error, we can then devise an algorithm for step size control that
guarantees that the local error of a one-step method remains below a threshold ε in every
time step.
Algorithm 2.4.2 (Adaptive step size control). Let yk and ŷk be two approximations of uk of consistency order p and p̂ ≥ p + 1, respectively, and let ε > 0 be given.
3. If hopt < hk the time step is rejected; choose hk = hopt and recompute yk and ybk .
4. If the time step was accepted, let hk+1 = min(hopt , tn − tk ).
5. Increase k by one and return to Step 1.
Remark 2.4.3. When tk is close to tn, the choice hk+1 = hopt in Step 4 leads to tn − tk+1 ≈ eps (machine epsilon). To avoid round-off errors in the next time step, it is advisable to choose hk+1 = tn − tk already for tn − tk ≤ c hk+1, where c is a moderate constant of size around 1.1. This way we avoid that hk+2 ≈ eps.
Remark 2.4.4. This algorithm controls and equilibrates the local error. It does not control
the accumulated global error. The global error estimate still retains the exponential term.
Global error estimation techniques involve considerably more effort and are beyond the
scope of this course.
The algorithm does not provide an estimate for the leading-order term in the local error of ŷk. However, since it is a higher-order approximation than yk, we should use ŷk as the approximation of u at tk and as the initial value for the next time step.
2.4.1 Extrapolation methods
2.4.5. Here, we estimate the local error by a method called Richardson extrapolation (cf.
Numerik 0). It is based on computing two approximations with the same method, but
different step size. In particular, we will use an approximation yk+1 with two steps of size
hk and an approximation Yk+1 with one step of size 2hk , both starting with the same initial
value at tk−1 .
Proof. The exact form of the leading-order term in the local error of an ERK of order p can be obtained by explicitly calculating the leading order term in the Taylor expansion in Lemma 2.3.13, i.e. there exists a constant vector ζk = ζk(f_{k-1}, f'_{k-1}, . . . , f^(p)_{k-1}) ∈ R^d independent of h such that
Furthermore, we can also use Taylor series expansion of Fh around (t1, u1) to obtain

    Fh(t1, y1) = Fh(t1, u1) + ∇y Fh(t1, u1) η1(u) + O(|η1(u)|^2)
               = Fh(t1, u1) + h^{p+1} ∇y Fh(t1, u1) ζ1 + O(h^{p+2}).     (2.29)
Thus, following the same proof technique as in theorem 2.2.6, we obtain for the error after two steps of size h,

    u2 - y2 = (u0 - y0) + Σ_{k=1}^{2} [ ηk(u) - ηk(y) + h Fh(t_{k-1}, u_{k-1}) - h Fh(t_{k-1}, y_{k-1}) ].

Here u0 - y0 = 0 and ηk(y) = 0, and the Fh-difference vanishes for k = 1 since y0 = u0, so that

    u2 - y2 = Σ_{k=1}^{2} ηk(u) + h [Fh(t1, u1) - Fh(t1, y1)]
            = 2 ζ1 h^{p+1} + O(h^{p+2}) + h [h^{p+1} ∇y Fh(t1, u1) ζ1 + O(h^{p+2})]
            = 2 ζ1 h^{p+1} + O(h^{p+2}).                                              (2.30)
On the other hand,
u2 − Y2 = ζ1 (2h)p+1 + O(hp+2 ) = 2p+1 ζ1 hp+1 + O(hp+2 ). (2.31)
Taking 2p times equation (2.30) and subtracting equation (2.31), we can eliminate the
leading order term and obtain
    O(h^{p+2}) = 2^p (u2 - y2) - (u2 - Y2) = (2^p - 1) u2 - (2^p y2 - Y2) = (2^p - 1)(u2 - ŷ2),
which completes the proof.
Remark 2.4.7. Thus, ŷ2 provides an approximation of u2 with consistency order p + 1 > p and can thus be used in Algorithm 2.4.2 above to control the step size in each step. In particular, ŷk+1 can be computed cheaply from yk+1 and Yk+1 via formula (2.26) (with index k + 1 instead of 2). As mentioned in remark 2.4.4, in practice we expect a better global accuracy if we use ŷk−1 instead of yk−1 as the initial value at tk−1 for computing yk+1 and Yk+1.
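In the notation used above, formula (2.26) reads ŷ2 = (2^p y2 − Y2)/(2^p − 1); a sketch with explicit Euler (p = 1) as the base method (names are ours):

```python
import math

def euler_step(f, t, y, h):
    return y + h * f(t, y)

def richardson(f, t, y, h, p=1):
    """Two steps of size h, one step of size 2h, and the extrapolated value."""
    y2 = euler_step(f, t + h, euler_step(f, t, y, h), h)   # two small steps
    Y2 = euler_step(f, t, y, 2 * h)                        # one double step
    y_hat = (2**p * y2 - Y2) / (2**p - 1)                  # eliminates the h^{p+1} term
    err_est = (y2 - Y2) / (2**p - 1)                       # estimate of the error in y2
    return y2, Y2, y_hat, err_est

y2, Y2, y_hat, est = richardson(lambda t, y: y, 0.0, 1.0, 0.05)
```

For u' = u this gives y2 = 1.05^2 and Y2 = 1.1; the extrapolated value lies visibly closer to the exact e^{0.1}.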
Instead of estimating the local error by doubling the step size, embedded Runge-Kutta
methods use two methods of different order to achieve the same effect. The key to efficiency
is here, that the computed stages gi are the same for both methods, and only the quadrature
weights bi differ.
Definition 2.4.8 (Embedded Runge-Kutta methods). An embedded s-stage Runge-Kutta method with consistency orders p and p̂ computes two approximations yk and ŷk of uk with the same function evaluations. For this purpose, the stage values gi and ki at t0 + ci hk are identical for all i = 1, . . . , s, i.e. both methods have the same coefficients aij and ci. To compute the final approximations at time step tk, we use two different quadrature rules,
i.e.

    yk = y_{k-1} + hk Σ_i bi ki,
                                       (2.32)
    ŷk = y_{k-1} + hk Σ_i b̂i ki,

such that yk and ŷk are consistent of order p and p̂ > p, respectively. Typically, p̂ = p + 1.
2.4.9 Definition: The Butcher tableau for the embedded method has the form:

    0   |
    c2  | a21
    c3  | a31  a32
    ..  |  ..   ..   ..
    cs  | as1  as2  · · ·  as,s-1
    ----+-------------------------------
        | b̂1   b̂2   · · ·  b̂s-1    b̂s
        | b1   b2   · · ·  bs-1    bs
Remark 2.4.10. For higher order methods or functions f (t, u) with complicated eval-
uation, most of the work lies in the computation of the stages. Thus, the additional
quadrature for the computation of yk is almost for free. Nevertheless, due to the different orders of approximation, ŷk is much more accurate and we obtain

Thus, ŷk − yk is a good estimate for the local error in yk. This is the error which is used in step size control below. However, as in the Richardson extrapolation above, we use the more accurate value ŷk as the final approximation at tk and as the initial value for the next time step, even if we do not have a computable estimate for its local error.
The embedded method of Dormand and Prince has the Butcher tableau

    0    |
    1/5  | 1/5
    3/10 | 3/40        9/40
    4/5  | 44/45      -56/15      32/9
    8/9  | 19372/6561 -25360/2187 64448/6561  -212/729
    1    | 9017/3168  -355/33     46732/5247   49/176   -5103/18656
    1    | 35/384      0          500/1113     125/192  -2187/6784    11/84
    -----+----------------------------------------------------------------------------
    ŷk   | 35/384      0          500/1113     125/192  -2187/6784    11/84     0
    yk   | 5179/57600  0          7571/16695   393/640  -92097/339200 187/2100  1/40

It has become a standard tool for the integration of IVP and it is the backbone of ode45 in Matlab.
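The Dormand-Prince tableau is lengthy, but the embedding idea can be shown with the simplest possible pair: explicit Euler (order 1) inside Heun (order 2), which share the stages k1, k2. This is a sketch of the principle, not a method from the notes:

```python
import math

def heun_euler_step(f, t, y, h):
    """Embedded pair: both approximations reuse the same stage values."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    y_low = y + h * k1                     # order 1, weights b = (1, 0)
    y_high = y + h * (k1 + k2) / 2         # order 2, weights b^ = (1/2, 1/2)
    return y_low, y_high, y_high - y_low   # last entry estimates the error in y_low

y_low, y_high, est = heun_euler_step(lambda t, y: y, 0.0, 1.0, 0.1)
```

For u' = u with h = 0.1 the estimate 0.005 is close to the true local Euler error e^{0.1} − 1.1 ≈ 0.00517, at no extra function evaluations.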
Chapter 3
In the previous chapter, we studied methods for the solution of IVP and the analysis of their convergence with shrinking step size h. We could gain a priori error estimates from consistency and stability for sufficiently small h.
All of these error estimates are based on Grönwall’s inequality. Therefore, they contain
a term of the form eLt which increases fast with increasing length of the time interval
[t0 , T ]. Thus, the analysis is unsuitable for the study of long-term integration, since the
exponential term will eventually outweigh any term of the form hp .
On the other hand, for instance our solar system has been moving on stable orbits for
several billion years and we do not observe an exponential increase of velocities. Thus,
there are in fact applications for which the simulation of long time periods is worthwhile
and where exponential growth of the discrete solution would be extremely disturbing.
This chapter first studies conditions on differential equations with bounded long term
solutions, and then discusses numerical methods mimicking this behavior.
Example 3.1.1. We consider for λ ∈ C the (scalar) linear initial value problem

    u' = λu,   u(0) = 1.     (3.1)

Splitting λ = Re(λ) + i Im(λ) into its real and imaginary part, the (complex valued) solution to this problem is

    u(t) = e^{λt} = e^{Re(λ)t} (cos(Im(λ)t) + i sin(Im(λ)t)).
Remark 3.1.2. Since we deal in the following again and again with eigenvalues of real-
valued matrices and these eigenvalues can be complex, we will always consider complex
valued IVP hereafter.
Remark 3.1.3. Due to Grönwall’s inequality and the stability Theorem 1.4.5, the solution
to the IVP above admits the estimate |u(t)| ≤ e|λ|t |u(0)|. This is seen easily by applying the
comparison function v(t) ≡ 0. As soon as λ 6= 0 has a non-positive real part, this estimate
is still correct but very pessimistic and therefore useless for large t. Since problems with
bounded long-term behavior are quite important in applications, we will have to introduce
an improved notation of stability.
holds with a constant ν for all (t, x), (t, y) ∈ D. Moreover such a function is called
monotonic if ν = 0, thus
Remark 3.1.5. The term monotonic from the previous definition is consistent with the
term monotonically decreasing, which we know from scalar, real-valued functions. We can
see this by observing that, for y > x,
    (f(t, y) - f(t, x))(y - x) ≤ 0   ⇔   f(t, y) - f(t, x) ≤ 0.
3.1.6 Theorem: Let u(t) and v(t) be two solutions of the equation
with initial values u(t0 ) = u0 and v(t0 ) = v0 , resp. Let the function f be continuous
and let the one-sided Lipschitz condition (3.3) hold. Then we have for t > t0 :
Proof. We consider the auxiliary function m(t) = |v(t) - u(t)|^2 and its derivative

    m'(t) = 2 Re⟨v'(t) - u'(t), v(t) - u(t)⟩
          = 2 Re⟨f(t, v(t)) - f(t, u(t)), v(t) - u(t)⟩
          ≤ 2ν |v(t) - u(t)|^2
          = 2ν m(t).

According to Grönwall's inequality (lemma 1.3.8 on page 10) we obtain for t > t0:

    m(t) ≤ m(t0) e^{2ν(t - t0)}.

Taking the square root yields the stability estimate (3.5).
Remark 3.1.7. As in example 3.1.1, we obtain from the stability estimate, that for the
difference of two solutions u(t) and v(t) of the differential equation u0 = f (t, u) (with
different initial conditions) we obtain in the limit t → ∞:
3.1.8 Lemma: Let A(t) ∈ C^{d×d} be a diagonalizable matrix function with eigenvalues λj(t), j = 1, . . . , d. Then the linear function f(t, y) := A(t)y admits the one-sided Lipschitz condition (3.3) on all of R × C^d with the constant

    ν = max_{j=1,...,d; t∈R} Re(λj(t)).

Proof. We compute

    Re⟨A(t)y - A(t)x, y - x⟩ = Re( ⟨A(t)(y - x), y - x⟩ / |y - x|^2 ) |y - x|^2 ≤ max_{j=1,...,d} Re(λj(t)) |y - x|^2.

This shows that ν ≤ max_{j=1,...,d; t∈R} Re(λj(t)). If we now insert for y - x an eigenvector of an eigenvalue λj(t) for which the maximum is attained, then we obtain equality and therefore ν = max_{j=1,...,d; t∈R} Re(λj(t)).
The eigenvalues of A are λ1 = -2 and λ2,3 = -40 ± 40i. The exact solution is

    u(t) = ( (1/2)e^{-2t} + (1/2)e^{-40t}[cos 40t + sin 40t],
             (1/2)e^{-2t} - (1/2)e^{-40t}[cos 40t + sin 40t],
             . . . )^T  →  0   as t → ∞.

For small time 0 ≤ t ≤ 0.2 all three components are changing rapidly due to the trigonometric terms, since the factor e^{-40t} in front of them is still fairly big. Thus, it is necessary to choose small time step sizes h ≪ 1.
For t > 0.2, we have u3 ≈ 0 and u1 ≈ u2 , and both those components change fairly slowly,
so we could choose a larger time step size h ≥ 0.1.
However, if we consider the explicit Euler method applied to (3.8) we get
y (n) = y (n−1) + hAy (n−1)
and thus
y (n) = (I + hA)n u0 .
Now, if we choose a time step size of h = 0.01 the matrix I + hA has eigenvalues μj = 0.98, 0.6 + 0.4i, 0.6 - 0.4i, so that

    |y^(n)| = |(I + hA)^n u0| ≤ ‖I + hA‖^n |u0| = 0.98^n √2  →  0   as n → ∞,
which is, at least qualitatively, the correct behaviour.
For h = 0.1, I + hA has eigenvalues μj = 0.8, -3 + 4i, -3 - 4i. It is easy to see that the first eigenvector is v1 = (1, 1, 0)^T/√2; the other two eigenvectors are orthogonal to v1. Thus, if we apply (I + hA)^n to the second or third eigenvector, v2 or v3, we get

    |(I + hA)^n vj| = |-3 ± 4i|^n |vj| = 5^n  →  ∞   as n → ∞,   for j = 2, 3.
Since u0 contains components in the direction of v2 and v3 , this means that |y (n) | → ∞ as
n → ∞, very much in contrast to the behaviour of the exact solution u(t) → 0 for t → ∞.
So, even when u3 ≈ 0 and u2 − u1 ≈ 0 and the perturbations are very small, the instability
of the explicit Euler method with time step size h = 0.1 will lead to an exponential increase
in these perturbations.
Remark 3.1.10. The important message here is that from a point of view of approxi-
mation error (or consistency), it would be possible to increase the time step significantly
at later times, but due to stability problems with the explicit Euler method we cannot
increase h beyond a certain stability threshold.
This phenomenon only arises for monotonic ODEs, or for ODEs that satisfy a one-sided
Lipschitz condition with constant 0 < ν ≪ 1 and that are monotonic for all t ≥ t∗, for
some t∗ ≥ t0 . The consistency error is closely linked to the Lipschitz constant L of f ,
while the stability is linked to the ratio of L and the constant ν in the one-sided Lipschitz
condition. In the following definition, we will only focus on monotonic IVPs.
3.1.11 Definition: Let f be Lipschitz continuous with constant L > 0 and one-
sided Lipschitz continuous with constant ν ∈ R. An initial value problem is called
stiff , if it has the following characteristic properties:
2. The time scales on which different solution components are evolving differ a lot, i.e.,

    L ≫ |ν|.

3. The time scales which are of interest for the application are much longer than the fastest time scales of the equation, i.e.,
Remark 3.1.12. Note that for the linear IVP in lemma 3.1.8 and when f is monotonic, we have

    L := max_{j=1,...,d; t∈R} |λj(t)| ≥ max_{j=1,...,d; t∈R} |Re(λj(t))|   and   |ν| = min_{j=1,...,d; t∈R} |Re(λj(t))|.
Remark 3.1.13. Even though we used the term definition, the notion of "stiffness of an IVP" has something vague or even imprecise about it. In fact, that is due to the very nature of the problems and cannot be fixed. Instead, we are forced to sharpen our understanding by means of a few examples.
Example 3.1.14. First of all we will have a look at equation (3.8) in example 3.1.9.
Studying the eigenvalues of the matrix A, we clearly see that ν = −2 and thus the problem
is monotonic. We can also find that the Lipschitz constant is L = kAk ≈ 72.5 so that the
second condition holds as well.
According to the discussion of example 3.1.9, the third condition depends on the purpose of the computation. If we want to compute the solution at time T = 0.01, we would not call the problem stiff. On the other hand, if one is interested in the solution at time T = 1, at which the terms containing e^{-40t} are already below typical machine accuracy, the problem is stiff indeed. Here, we have seen that Euler's method requires disproportionately small time steps.
Remark 3.1.15. The definition of stiffness and the discussion of the examples reveal that numerical methods are needed which are not just convergent in the limit h → 0, but which also produce solutions with the correct limit behavior for t → ∞ at a fixed step size h, even in the presence of time scales clearly below h.
Example 3.1.16. The implicit Euler method is defined by the one-step formula
which in general involves solving a nonlinear system of equations. Applied to our linear
example (3.8), we get
For all h > 0, the eigenvalues μj of the iteration matrix (I - hA)^{-1} satisfy

    |μj| ≤ 1/(1 + 2h), 1/(1 + 40h), 1/(1 + 40h),

which are all strictly less than 1, such that we get

    |y^(n)| → 0   as n → ∞,
independently of h. Thus, although the implicit Euler method requires in general the
solution of a nonlinear system in each step, it allows for much larger time steps than the
explicit Euler method, when applied to a stiff problem.
For a visualization see the programming exercise on the last problem sheet and the ap-
pendix.
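The contrast is already visible for the scalar model problem u' = λu with λ = −40, one of the stiff eigenvalues above, and step size h = 0.1; a sketch:

```python
lam, h, n = -40.0, 0.1, 50

amp_explicit = 1 + h * lam        # explicit Euler amplification factor: -3
amp_implicit = 1 / (1 - h * lam)  # implicit Euler amplification factor: 1/5

y_explicit = amp_explicit ** n    # |.| grows like 3^n, although e^{lam t} decays
y_implicit = amp_implicit ** n    # decays monotonically, like the exact solution
```

One step size, two qualitatively opposite behaviours: the explicit iterate explodes while the implicit one reproduces the decay of the exact solution.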
3.2 A-, B- and L-stability
3.2.1. In this section, we will investigate desirable properties of one-step methods for stiff
IVP (3.11). We will first study linear problems of the form

    u' = Au,   u(t0) = u0,     (3.11)
and the related notion of A-stability in detail. From the conditions for stiffness we derive
the following problem characteristics:
1. All eigenvalues of the matrix A lie in the left half-plane of the complex plane.
With (3.2) all solutions are bounded for t → ∞.
2. There are eigenvalues close to zero and eigenvalues with a large negative real part.
3. We are interested in time spans which make it necessary that the product hλ can become large, for an arbitrary eigenvalue and an arbitrary time step size.
For this case we now want to derive criteria for the boundedness of the discrete solution
for t → ∞. The important part is not to derive an estimate holding for h → 0, but one
that holds for any value of hλ in the left half-plane of the complex numbers.
A one-step method y1 = y0 + h Fh(t0, y0, y1), applied to the model problem u' = λu, u(t0) = u0, yields

    y1 = R(hλ) u0,     (3.12)

and after n steps

    y^(n) = R(hλ)^n u0,     (3.13)

for some function R : C → C, which is denoted the stability function of the one-step method Fh. The stability region of the one-step method is the set

    S = { z ∈ C : |R(z)| ≤ 1 }.     (3.14)
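For the two Euler methods one finds the stability functions R(z) = 1 + z (explicit) and R(z) = 1/(1 − z) (implicit), whose regions are shown in figure 3.1; testing membership in S is then a one-liner (a sketch, names ours):

```python
def in_stability_region(R, z):
    return abs(R(z)) <= 1.0

R_explicit = lambda z: 1 + z        # explicit Euler: disc of radius 1 around -1
R_implicit = lambda z: 1 / (1 - z)  # implicit Euler: contains the left half-plane
```

For example, z = −3 lies outside the explicit Euler region (this is exactly hλ for h = 0.1, λ = −40 in the stiff example), while arbitrarily far-left points such as z = −100 + 5i lie inside the implicit Euler region.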
Figure 3.1: Stability regions for explicit and implicit Euler (blue stable, red unstable)
{z ∈ C | Re(z) ≤ 0} ⊂ S (3.17)
    u' = Au,   u(t0) = u0
with a diagonalizable matrix A and initial value y (0) = u0 . The stability of a one-step
method with stability region S applied to this vector-valued problem is inherited
from the scalar equation.
In particular, let (y^(k))_{k=0}^∞ be the sequence of approximations generated by an A-stable one-step method with step size h for this IVP. If all eigenvalues of A have a non-positive real part, then the sequence is uniformly bounded for all h.
Remark 3.2.7. The term "A-stability" was deliberately chosen neutrally by Dahlquist. In particular, note that A-stability does not stand for asymptotic stability.
Proof (only for ERKs). Since A is diagonalizable, there exists an invertible matrix V ∈ C^{d×d} and a diagonal matrix Λ ∈ C^{d×d} such that A = V^{-1}ΛV. Let w := V u. Then

    w' = (V u)' = V u' = V Au = ΛV u = Λw     (3.18)

and w(t0) = V u(t0), and so the system of ODEs decouples into d independent ODEs
Similarly, the stage values of an ERK decouple into d independent components:

    gi = y0 + h Σ_{j=1}^{s} aij V^{-1}ΛV gj
    ⇒  γi := V gi = V y0 + h Σ_{j=1}^{s} aij ΛV gj = w0 + h Σ_{j=1}^{s} aij Λ γj,

or equivalently

    (γi)_ℓ = (w0)_ℓ + h Σ_{j=1}^{s} aij λ_ℓ (γj)_ℓ,   ℓ = 1, . . . , d.
Finally, if we denote by ηj := V yj the transformed numerical solution at the jth time step, we get for the next iterate

    η1 = V y1 = V y0 + h Σ_{i=1}^{s} bi V gi = η0 + h Σ_{i=1}^{s} bi γi.

Thus, the ERK applied to a vector-valued problem decouples into d scalar problems solved by the same ERK. But for each of the scalar problems, the definition of A-stability implies boundedness of the solution, if Re(λ_ℓ) ≤ 0 for all ℓ = 1, . . . , d, and thus the sequence (y^(k)) is uniformly bounded as well.
Proof. We show that for such methods R(z) is a polynomial. Since the absolute value of a non-constant polynomial tends to infinity as the absolute value of its argument does, there then exists z ∈ {z ∈ C | Re(z) ≤ 0} with |R(z)| > 1 and thus z ∉ S, which implies the result of the theorem.
Consider an arbitrary ERK applied to the scalar problem u' = λu, u(t0) = u0. From equation (2.11b) it follows that ki = λ gi, for all i = 1, . . . , s. If we insert this into equation (2.11a), we obtain

    gi = y0 + hλ Σ_{j=1}^{i-1} aij gj.
Remark 3.2.9. The notion of A-stability is only applicable to linear problems with di-
agonalizable matrices. Now we are considering its extension to nonlinear problems with
monotonic right hand sides.
3.2.10 Definition: A one-step method applied to a monotonic initial value problem
u0 = f (t, u) with arbitrary initial values y0 and ỹ0 is called B-stable if
3.2.11 Theorem: Let f be monotonic and such that f (t, 0) = 0, for all t ∈ R, and
consider the IVP
Proof. The theorem follows immediately by setting ỹ0 = 0 and iterating over the definition
of B-stability, since the assumptions of the theorem guarantee that ỹk = 0, for all k. (Note
that f (t, 0) = 0 implies Fh (t, 0) = 0 for all Runge-Kutta methods.)
Proof. Apply the method to the scalar, linear problem u0 = λu, which is monotonic for
Re(λ) ≤ 0. Now, the definition of B-stability implies |R(z)| ≤ 1, and thus, the method is
A-stable.
Thus, a method, which has exactly the left half-plane of C as its stability domain, seemingly
a desirable property, has the undesirable property that components in eigenspaces corre-
sponding to very large negative eigenvalues, and thus decaying very fast in the continuous
problem, are decaying very slowly if such a method is applied.
This gave rise to the following notion of L-stability. However, note that L-stable methods
are not always better than A-stable ones. Similarly, it is also not always necessary to
require A-stability. Judgment must be applied according to the problem being solved.
3.3 General Runge-Kutta methods
We point out immediately that the main drawback of these methods is the fact that they
typically require the solution of nonlinear systems of equations and thus involve much
higher computational effort. Therefore, careful judgment should be applied to determine
whether a problem is really stiff or whether it is better to use an explicit method.
Example 3.3.4 (Two-stage SDIRK). The following two SDIRK methods are of order three:

    1/2 - √3/6 | 1/2 - √3/6      0              1/2 + √3/6 | 1/2 + √3/6      0
    1/2 + √3/6 |   √3/3      1/2 - √3/6         1/2 - √3/6 |  -√3/3      1/2 + √3/6     (3.23)
    -----------+-----------------------         -----------+-----------------------
               |   1/2          1/2                        |   1/2          1/2
3.3.5 Lemma: Let I be the s × s identity matrix and let e := (1, . . . , 1)^T ∈ R^s. The stability function of a (general) s-stage Runge-Kutta method with coefficient matrix A = (aij) ∈ R^{s×s} and weight vector b = (b1, . . . , bs)^T is

    R(z) = 1 + z b^T (I - zA)^{-1} e = det(I - zA + z e b^T) / det(I - zA).     (3.24)
Proof. Applying the method to the scalar test problem with f(u) = λu, the definition of the stages gi leads to the system of linear equations

    gi = y0 + h Σ_{j=1}^{s} aij λ gj,   i = 1, . . . , s.

In matrix notation, with z = hλ, we obtain (I - zA)g = y0 e, where g is the vector (g1, . . . , gs)^T. Equally, we obtain

    R(z) y0 = y1 = y0 + h Σ_{i=1}^{s} bi λ gi
            = y0 + z b^T g
            = y0 + z b^T (I - zA)^{-1} y0 e
            = (1 + z b^T (I - zA)^{-1} e) y0.
In order to prove the second representation, we write the whole Runge-Kutta method as a single system of equations of dimension s + 1:

    ( I - zA   0 ) ( g  )         ( e )
    ( -z b^T   1 ) ( y1 )  =  y0  ( 1 ).

Solving for y1 by Cramer's rule yields the determinant representation in (3.24).
3.3.6 Example: The stability functions of the modified Euler method, of the classical Runge-Kutta method of order 4 and of the Dormand-Prince method of order 5 are

    R2(z) = 1 + z + z^2/2,
    R4(z) = 1 + z + z^2/2 + z^3/6 + z^4/24,
    R5(z) = 1 + z + z^2/2 + z^3/6 + z^4/24 + z^5/120 + z^6/600.
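On the negative real axis, the edge of the stability region of R4 can be located numerically; the boundary lies near z ≈ −2.785 (a sketch; the scan resolution is our choice):

```python
def R4(z):
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24

# march along the negative real axis until |R4| first exceeds 1
z, dz = 0.0, 1e-3
while abs(R4(z - dz)) <= 1.0:
    z -= dz
# z now approximates the left end of the real stability interval (about -2.785)
```

This makes the stability restriction of Remark 3.1.10 quantitative: for a real eigenvalue λ < 0 the classical method requires roughly h|λ| < 2.785.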
Figure 3.2: Stability regions of the modified Euler method, the classical Runge-Kutta
method of order 4 and the Dormand/Prince method of order 5 (blue stable, red unstable)
3.3.7 Definition: The ϑ-scheme is the one-step method, defined for ϑ ∈ [0, 1] by
    y1 = y0 + h ((1 - ϑ) f(y0) + ϑ f(y1)).     (3.25)
Proof. DIY (The stability regions for different ϑ are shown in figure 3.3.)
While it was clear that the steps of an explicit Runge-Kutta method can always be exe-
cuted, implicit methods require the solution of a possibly nonlinear system of equations.
The solvability of such a system is not always clear. We will investigate several cases here:
First, Lemma 3.3.9 based on a Lipschitz condition on the right hand side. Since this result
suffers from a severe step size constraint, we add Lemma 3.3.10 for DIRK methods based on
right hand sides with a one-sided Lipschitz condition. Finally, we present Theorem 3.3.11
for general Runge-Kutta methods with one-sided Lipschitz condition.
Figure 3.3: Stability regions of the ϑ-scheme with ϑ = 0.5 (Crank-Nicolson), ϑ = 0.6,
ϑ = 0.7, and ϑ = 1 (implicit Euler).
3.3.9 Lemma: Let f : R × R^d → R^d be continuous and satisfy the Lipschitz condition with constant L. If

    h L max_{i=1,...,s} Σ_{j=1}^{s} |aij| < 1,     (3.27)

then, for any y0 ∈ R^d, the Runge-Kutta method (3.22) has a unique solution y1 ∈ R^d.
Proof. We prove existence and uniqueness by a fixed-point argument. To this end, given
y0 ∈ Rd , we define the matrix of stage values K = [k1 , . . . , ks ] ∈ Rd×s in (3.22).
Given some initial K^(0) ∈ R^{d×s}, we consider the fixed-point iteration K^(m) = Ψ(K^(m-1)), m = 1, 2, . . ., defined columnwise by

    ki^(m) = Ψi(K^(m-1)) = f(t0 + ci h, y0 + h Σ_{j=1}^{s} aij kj^(m-1)),   i = 1, . . . , s,
which clearly has the matrix of stage values K as a fixed point. Using on R^{d×s} the norm ‖K‖ = max_{i=1,...,s} |ki|, where |·| is the regular Euclidean norm on R^d, it follows from the Lipschitz continuity of f in its second argument that

    ‖Ψ(K) - Ψ(K')‖ ≤ hL max_{i=1,...,s} ( Σ_{j=1}^{s} |aij| ) ‖K - K'‖.

Under assumption (3.27), the factor hL max_i Σ_j |aij| is strictly less than one and thus the mapping Ψ is a contraction. Then the Banach fixed-point theorem (cf. theorem A.3.1) yields the unique existence of K, and hence of y1.
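The iteration from the proof in code (a sketch; interface and names are ours). For the implicit Euler method viewed as a one-stage method with A = (1), c = (1), applied to u' = −2u with h = 0.1, the contraction factor is hL·|a11| = 0.2 and the iteration converges to the stage slope k = λ y0/(1 − hλ):

```python
import numpy as np

def solve_stages(f, t0, y0, h, A, c, tol=1e-12, max_iter=100):
    """Fixed-point iteration K^(m) = Psi(K^(m-1)) for the stage slopes k_i."""
    s = len(c)
    K = np.zeros((s, np.size(y0)))          # K^(0) = 0
    for _ in range(max_iter):
        K_new = np.array([f(t0 + c[i] * h, y0 + h * (A[i] @ K))
                          for i in range(s)])
        if np.max(np.abs(K_new - K)) < tol:
            return K_new
        K = K_new
    return K

K = solve_stages(lambda t, y: -2.0 * y, 0.0, np.array([1.0]),
                 0.1, np.array([[1.0]]), np.array([1.0]))
```

Here k = −2/(1 + 0.2) = −5/3, and the iterate reaches it well within the step-size constraint of (3.27).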
then, for any y0 ∈ Cd , each of the (decoupled) nonlinear equations in (3.22a) has a
solution gi ∈ Cd .
Proof. The proof simplifies compared to the general case of an IRK, since each stage
depends explicitly on the previous stages and implicitly only on itself. Thus, we can write
    g_i = y_0 + v_i + h a_ii f(g_i),    with    v_i = h Σ_{j=1}^{i−1} a_ij f(g_j).    (3.29)
For linear IVPs with f (t, y) := M y with diagonalizable system matrix M , we have
(I − haii M ) gi = y0 + vi .
Since ν = maxj=1,...,d Re(λj (M )) (cf. lemma 3.1.8), assumption (3.28) implies that all
eigenvalues of (I − haii M ) have positive real part. Thus, the inverse exists and we obtain
a unique solution.
In the nonlinear case, we use a homotopy argument. To this end, we introduce the param-
eter τ ∈ [0, 1] and set up the family of equations
g(τ ) = y0 + τ vi + haii f (g(τ )) + (τ − 1)haii f (y0 ).
For τ = 0 this equation has the solution g(0) = y0 , and for τ = 1 the solution g(1) = gi .
Now, provided g′ is bounded on [0, 1], we can conclude that a solution exists, since

    g(1) = g(0) + ∫_0^1 g′(s) ds.    (3.30)

To show that g′ is bounded, note first that, since f was assumed to be differentiable in the
second argument,

    ⟨f_y(t, y)h + o(|h|), h⟩ = ⟨f(t, y + h) − f(t, y), h⟩ ≤ ν|h|².
Hence, differentiating the homotopy with respect to τ, we obtain

    g′(τ) = v_i + h a_ii f_y(t, g(τ)) g′(τ) + h a_ii f(y_0),

and, taking the inner product with g′(τ),

    |g′(τ)|² = ⟨v_i + h a_ii f(y_0), g′(τ)⟩ + h a_ii ⟨f_y(t, g(τ)) g′(τ), g′(τ)⟩
             ≤ |v_i + h a_ii f(y_0)| |g′(τ)| + h a_ii ν |g′(τ)|².

Since h a_ii ν < 1 by assumption (3.28), this yields the bound

    |g′(τ)| ≤ |v_i + h a_ii f(y_0)| / (1 − h a_ii ν),

so g′ is indeed bounded on [0, 1].
If the DIRK method in lemma 3.3.10 is A- or B-stable, then the gi are unique.
3.3.11 Theorem: Let f be continuously differentiable and let it satisfy the one-
sided Lipschitz condition (3.3) with constant ν. If the Runge-Kutta matrix A is
invertible and if there exists a diagonal matrix D = diag(d1 , . . . , ds ) with positive
entries, such that
    h ν < ⟨x, A⁻¹x⟩_D / ⟨x, x⟩_D    for all x ∈ Rs, x ≠ 0,    (3.31)

then the nonlinear system (3.22a) has a solution (g_1, . . . , g_s). Here, ⟨x, y⟩_D := ⟨Dx, y⟩.
Proof. We omit the proof here and refer to [HW10, Theorem IV.14.2].
3.3.2 Considerations on the implementation of Runge-Kutta methods
3.3.12. As we have seen in the proof of lemma 3.3.9, implicit Runge-Kutta methods require
the solution of a nonlinear system of size s · d, where s is the number of stages and d the
dimension of the system of ODEs. DIRK methods are simpler and only require the solution
of systems of dimension d. Thus, we should prefer this class of methods, were it not for
the following theorem.
Remark 3.3.14. In each step of an IRK, we have to solve a (non-)linear system for the
quantities gi . In order to reduce round-off errors, it is advantageous to solve for zi = gi −y0 .
Especially for small time steps, zi is expected to be much smaller than gi . Thus, we have
to solve the system
s
X
zi = h aij f (t0 + cj h, y0 + zj ), i = 1, . . . , s. (3.32)
j=1
y1 = y0 + bT A−1 z, (3.34)
which again is numerically much more stable than evaluating f (with a possibly large
Lipschitz constant).
3.4.2 Theorem: Consider a (general) Runge-Kutta method that satisfies condition
B(p) in (3.35a), condition C(ξ) in (3.35b), and condition D(η) in (3.35c) with ξ ≥
p/2 − 1 and η ≥ p − ξ − 1. Then the method has consistency order p.
Proof. For the proof, we refer to [HNW09, Ch. II, Theorem 7.4]. Here, we only observe
that

    ∫_0^1 t^{q−1} dt = 1/q,        ∫_0^{c_i} t^{q−1} dt = c_i^q / q.

If we now insert the monomial t^{q−1} at the nodes c_i into the quadrature formula with the
quadrature weights b_i, then we obtain (3.35a). Similarly, we obtain (3.35b) if we insert
t^{q−1} at the nodes c_j into the quadrature formulas with weights a_ij, j = 1, . . . , s. In
both cases we carry this out for all monomials until the desired degree is reached. Due
to the linearity of the formulas, the exactness holds for all polynomials up to that degree.
3.4.3. In this subsection, we review some of the basic facts of quadrature formulas based
on orthogonal polynomials (cf. Numerik 0 for details).
    L_n(t) = d^n/dt^n [ t^n (t − 1)^n ].

A quadrature formula for ∫_0^1 f dx that uses the n roots of L_n as its quadrature points
and the integrals of the Lagrange interpolating polynomials at those points as its
weights is called a Gauss rule.
3.4.5 Lemma: The Gauss quadrature formula Q_{n−1}^{[a,b]}(f) with n points for approxi-
mating the integral ∫_a^b f dx is exact for polynomials of degree 2n − 1. If f ∈ C^{2n}[a, b]
and h := b − a, then

    Q_{n−1}^{[a,b]}(f) − ∫_a^b f dx = O(h^{2n+1}).
Proof. See Numerik 0. (Please note that in Numerik 0 we numbered the quadrature nodes
x0 , . . . , xn−1 and thus n here is n − 1 in the notes to Numerik 0.)
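The Gauss rule on [0, 1] can be obtained from standard library routines for [−1, 1] by an affine change of variables. A short sketch in Python, using NumPy's Legendre module; the exactness check mirrors the lemma (the choice n = 3 is arbitrary):

```python
import numpy as np

def gauss_rule_01(n):
    """Gauss rule with n points on [0, 1], mapped from NumPy's
    Gauss-Legendre rule on [-1, 1]."""
    x, w = np.polynomial.legendre.leggauss(n)
    return (x + 1) / 2, w / 2

n = 3
pts, wts = gauss_rule_01(n)
# exact for all monomials t^q with q <= 2n - 1 = 5
errs = [abs(wts @ pts**q - 1.0 / (q + 1)) for q in range(2 * n)]
```

For q = 2n the rule is no longer exact, in agreement with the degree bound 2n − 1 of the lemma.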
Remark 3.4.6. An important alternative set of quadrature formulae are the Radau and
Lobatto formulas.
The Radau quadrature formulae are similar to the Gauss rules, but they use one end
point of the interval [0, 1] and the roots of orthogonal polynomials of degree n − 1 as their
abscissas. We distinguish left and right Radau quadrature formulae, depending on which
end is included. Lobatto quadrature formulae use both end points and the roots of a
polynomial of degree n − 2. The polynomials are
    Radau left:    p_n(t) = d^{n−1}/dt^{n−1} [ t^n (t − 1)^{n−1} ],        (3.36)
    Radau right:   p_n(t) = d^{n−1}/dt^{n−1} [ t^{n−1} (t − 1)^n ],        (3.37)
    Lobatto:       p_n(t) = d^{n−2}/dt^{n−2} [ t^{n−1} (t − 1)^{n−1} ].    (3.38)
A Radau quadrature formula with n points is exact for polynomials of degree 2n − 2. A
Lobatto quadrature formula with n points is exact for polynomials of degree 2n − 3. The
quadrature weights of these formulae are positive.
However, as we have seen in Numerik 0, polynomials are not suited for high-order
interpolation over large intervals. Therefore, we again apply them only on subintervals, in
the form of Runge-Kutta methods: the subintervals correspond to the time steps and the
quadrature points to the stages.
    y(t_0) = y_0,                                                      (3.39a)
    y′(t_0 + c_i h) = f( t_0 + c_i h, y(t_0 + c_i h) ),    i = 1, . . . , s.    (3.39b)
Proof. The polynomial y′(t) is of degree s − 1 and thus uniquely defined by the s interpo-
lation conditions in equation (3.39b). Setting y′(t_0 + c_i h) = f( t_0 + c_i h, y(t_0 + c_i h) ) = k_i,
we obtain

    y′(t_0 + th) = Σ_{j=1}^{s} k_j L_j(t),    (3.41)

which, after integration from 0 to c_i and comparison with (3.22a), defines the coefficients a_ij.
Integrating from 0 to 1 instead, we obtain the coefficients b_j by comparison with (3.22c).
Applying this to the Lagrange interpolation polynomials Lj (t), we obtain the coefficients
of equation (3.40), which were in turn computed from the collocation polynomial, proving
the equivalence.
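The identities a_ij = ∫_0^{c_i} L_j(t) dt and b_j = ∫_0^1 L_j(t) dt can be evaluated by solving small Vandermonde systems for the moments of the nodes. A sketch in Python; feeding in the two Gauss points reproduces the known 2-stage Gauss tableau:

```python
import numpy as np

def collocation_tableau(c):
    """Butcher coefficients of the collocation method with nodes c:
    a_ij = int_0^{c_i} L_j(t) dt,  b_j = int_0^1 L_j(t) dt."""
    s = len(c)
    V = np.vander(c, s, increasing=True)      # V[i, q] = c_i^q
    # exactness for t^q, q = 0..s-1, determines the Lagrange integrals
    rhs_b = np.array([1.0 / (q + 1) for q in range(s)])
    rhs_a = np.array([[ci**(q + 1) / (q + 1) for q in range(s)] for ci in c])
    b = np.linalg.solve(V.T, rhs_b)
    A = np.linalg.solve(V.T, rhs_a.T).T
    return A, b

c = np.array([(3 - np.sqrt(3)) / 6, (3 + np.sqrt(3)) / 6])   # 2-stage Gauss nodes
A, b = collocation_tableau(c)
```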
It follows from theorem 3.4.2 that a Runge-Kutta method that satisfies B(s) and C(s) has
consistency order (at least) s.
3.4.11 Theorem: Consider a collocation method with s pairwise different support
points c_i and define

    π(t) = Π_{i=1}^{s} (t − c_i).    (3.43)

If π is orthogonal in L²(0, 1) to all polynomials of degree up to r − 1, then the method
has consistency order p = s + r.
Proof. We have already shown in the proof of Lemma 3.4.10, that for any collocation
method with s stages, B(s) and C(s) hold.
The condition on π implies that on the interval [0, 1] the quadrature rule is in fact exact
for polynomials of degree s + r − 1 (cf. Numerik 0 for the case r = s), so that we have
B(s + r). Therefore, to prove consistency order p = s + r, it remains to show D(r).
First, we observe that due to C(s) and B(s + r), for all p ≤ s and q ≤ r, we have

    Σ_{i=1}^{s} Σ_{j=1}^{s} b_i c_i^{q−1} a_ij c_j^{p−1}
        = Σ_{i=1}^{s} b_i c_i^{q−1} (c_i^p / p)
        = (1/p) Σ_{i=1}^{s} b_i c_i^{p+q−1}
        = 1 / (p(p+q)),

and, again by B(s + r),

    (1/q) Σ_{j=1}^{s} b_j (1 − c_j^q) c_j^{p−1} = (1/q) ( 1/p − 1/(p+q) ) = 1 / (p(p+q)).

Subtracting the second result from the first, we get

    0 = Σ_{j=1}^{s} c_j^{p−1} ( Σ_{i=1}^{s} b_i c_i^{q−1} a_ij − b_j (1 − c_j^q)/q ) =: Σ_{j=1}^{s} c_j^{p−1} ξ_j.

This holds for p = 1, . . . , s and thus amounts to a homogeneous, linear system in the
variables ξ_j with the transposed Vandermonde matrix of the pairwise different points c_j
as system matrix, which is invertible. Thus, ξ_j = 0 for all j, which is exactly condition D(r),
and the theorem holds.
s ≤ p ≤ 2s.
On the other hand, we know from Numerik 0 that there exists no quadrature rule such that
B(2s + 1) is satisfied, otherwise the degree s polynomial π(t) would have to be orthogonal
to itself. In particular, if we consider the scalar model equation u′ = λu with exact solution

    u(t) = e^{λt} = Σ_{j=0}^{∞} (λt)^j / j!,

the best we can hope for is that the collocation polynomial y(t) matches the first 2s − 1
terms in this infinite sum, such that

    |u_1 − y_1| = O(h^{2s}).
Hence, it is clear that the consistency order of an s-stage collocation method satisfies p ≤ 2s.
The lower bound has already been proved in lemma 3.4.10.
The Butcher tableaux of the Gauss collocation methods with s = 2 (left) and s = 3 (right)
read:

    (3−√3)/6 | 1/4          1/4 − √3/6
    (3+√3)/6 | 1/4 + √3/6   1/4
    ---------+------------------------
             | 1/2          1/2

    (5−√15)/10 | 5/36            2/9 − √15/15   5/36 − √15/30
    1/2        | 5/36 + √15/24   2/9            5/36 − √15/24
    (5+√15)/10 | 5/36 + √15/30   2/9 + √15/15   5/36
    -----------+--------------------------------------------
               | 5/18            4/9            5/18
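For the scalar model equation u′ = λu, the stages of an implicit Runge-Kutta method reduce to a linear solve, so the order 2s of Gauss collocation can be observed numerically. A sketch in Python using the stability function R(z) = 1 + z bᵀ(I − zA)⁻¹1 of the 2-stage method (the test problem and step counts are arbitrary choices):

```python
import numpy as np

# 2-stage Gauss tableau (order 4)
r3 = np.sqrt(3)
A = np.array([[1/4, 1/4 - r3/6],
              [1/4 + r3/6, 1/4]])
b = np.array([1/2, 1/2])

def gauss2_solve(lam, T, n):
    """Integrate u' = lam*u, u(0) = 1 with n steps; for a linear problem
    each step multiplies by the stability function R(h*lam)."""
    h = T / n
    I = np.eye(2); one = np.ones(2)
    R = 1 + h * lam * (b @ np.linalg.solve(I - h * lam * A, one))
    return R**n

err1 = abs(gauss2_solve(-1.0, 1.0, 10) - np.exp(-1.0))
err2 = abs(gauss2_solve(-1.0, 1.0, 20) - np.exp(-1.0))
rate = np.log2(err1 / err2)          # observed convergence order
```

Halving h divides the error by roughly 2⁴ = 16, i.e. the observed rate is close to 4 = 2s.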
Proof. Let f be monotonic and let y(t) and z(t) be the collocation polynomials according
to (3.39) with respect to the initial values y_0 and z_0, respectively. Analogous to the proof of
theorem 3.1.6, we introduce the auxiliary function m(t) = |z(t) − y(t)|². In the collocation
points ξ_i = t_0 + c_i h we have
Figure 3.4: Stability domains of right Radau-collocation methods with one (implicit Euler),
two, and three collocation points (left to right). Note the different scaling of coordinate
axes in comparison with previous figures.
To show that the stability region is exactly the left half-plane of C we refer to the problem
sheet.
Remark 3.4.17. Similarly, we can construct collocation rules based on Radau- and
Lobatto-quadrature. As in the proof of theorem 3.4.15, it can be shown that the s-stage
Radau- and Lobatto-collocation methods are of orders 2s − 1 and 2s − 2, respectively.
Also as in the case of Gauß-quadrature it can be shown that collocation methods based
on Radau- and Lobatto quadrature are B-stable (cf. [HW10]). In fact, Radau-collocation
methods with right end point of the interval [0, 1] included in the quadrature set are L-
stable.
The first right Radau collocation method with s = 1 is simply the implicit Euler method.
The definitions of the next two are given in example 3.4.18. The stability regions of the
first three are shown in Figure 3.4.
Observe that the stability domains shrink as the order of the method increases. Also observe
that the computation of y_1 coincides with that of g_s, such that we can save a few operations.
Example 3.4.18: The Butcher tableaux of the right Radau collocation methods with s = 2
(left) and s = 3 (right) read:

    1/3 | 5/12   −1/12
    1   | 3/4    1/4
    ----+-------------
        | 3/4    1/4

    (4−√6)/10 | (88−7√6)/360       (296−169√6)/1800   (−2+3√6)/225
    (4+√6)/10 | (296+169√6)/1800   (88+7√6)/360       (−2−3√6)/225
    1         | (16−√6)/36         (16+√6)/36         1/9
    ----------+-----------------------------------------------------
              | (16−√6)/36         (16+√6)/36         1/9
Chapter 4

Newton and quasi-Newton methods
    ‖x^(k+1) − x*‖ ≤ q ‖x^(k) − x*‖^p.

For p = 1, we require in addition that q < 1. In that case, q is called the convergence rate.
We have already seen in the proof of lemma 3.3.9 that the fixed-point iteration, e.g. for the
implicit Euler method,

    y_1^(m) = y_0 + h f(t_1, y_1^(m−1)),    m = 1, 2, . . . ,

converges to y_1 provided hL < 1, but the convergence is only of order p = 1 (linear) and
the convergence rate is q := hL, which may be close to 1. Moreover, the iteration may fail
if hL ≥ 1.
In Numerik 0, we have already seen a faster converging algorithm, the Newton method,
and proved there that it converges with order p = 2 for a sufficiently good initial guess.
4.1.3 Definition: The Newton method for finding the root of the nonlinear
equation f (x) = 0 with f : Rd → Rd reads: given an initial value x(0) ∈ Rd ,
compute iterates x(k) ∈ Rd , k = 1, 2, . . . as follows:
    J = ∇f(x^(k)),
    d^(k) = −J^{−1} f(x^(k)),    (4.2)
    x^(k+1) = x^(k) + d^(k).
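A minimal sketch of the plain Newton iteration (4.2) in Python (the scalar test equation x² − 2 = 0, the tolerance, and the iteration cap are illustrative choices):

```python
import numpy as np

def newton(f, jac, x0, tol=1e-12, maxit=50):
    """Plain Newton iteration (4.2): x <- x - J(x)^{-1} f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        # solve J d = -f instead of forming the inverse
        x = x + np.linalg.solve(jac(x), -fx)
    return x

# example: root of x^2 - 2 (a scalar problem written as a 1-d system)
f = lambda x: np.array([x[0]**2 - 2.0])
jac = lambda x: np.array([[2.0 * x[0]]])
root = newton(f, jac, [1.0])
```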
Remark 4.1.5. The proof of this theorem can be found in the lecture notes for Numerik 0.
There are also versions that do not require the existence of the root a priori, such as the
Newton-Kantorovich Theorem [Ran17a, Satz 5.5], but we will only discuss some of the
main assumptions and features.
The Lipschitz condition on ∇f can be seen as measuring the deviation of f from being linear.
Indeed, if f were linear, then L = 0 and, provided M ≠ 0, the method converges in a single
step for any initial value.
The larger the constant M , the smaller one of the eigenvalues of the Jacobian J. Therefore,
the function becomes flat in that direction and the root finding problem becomes unstable.
Most importantly, for an arbitrary initial guess the method may fail to converge entirely,
but close enough to the solution the convergence is very fast, much faster than the fixed-point
iteration above.
While we assume for most of this discussion that F is known, we will see at the end that
the Newton method with line search does not require it.
Obviously, by choosing f = ∇F, Newton's method also solves the optimisation problem
(4.5). An alternative family of methods for (4.5) are the following:
    F(x + αs) ≤ F(x) − (ϑα/2) |∇F(x)|.    (4.7)
In particular, a positive scaling factor α for the descent method, and thus a strict
decrease in the function value, can always be found.
Proof. Skipped.
4.2.3 Definition: The gradient method for finding minimizers of F(x) reads:
given an initial value x^(0) ∈ Rd, compute iterates x^(k), k = 1, 2, . . ., by the rule

    d^(k) = −∇F(x^(k)),
    α^(k) = argmin_{α>0} F(x^(k) + α d^(k)),    (4.8)
    x^(k+1) = x^(k) + α^(k) d^(k).

It is also called the method of steepest descent. The minimization process used
to compute α^(k), also called line search, is one-dimensional and therefore simple. It
is sufficient to find only an approximate minimum α̃^(k).
    K = { x ∈ Rd : F(x) ≤ F(x^(0)) }

is compact. Then, each sequence defined by the gradient method has at least one
accumulation point, and each accumulation point is a stationary point of F(x).
Proof. First, we observe that at any point x^(k) with ∇F(x^(k)) ≠ 0, it follows from lemma 4.2.2
that there exists γ > 0 such that condition (4.7) holds for all α ∈ (0, γ]. We conclude that,
for such x^(k), the line search obtains a positive value of α^(k). Thus, the sequence of the
gradient iteration is monotonically decreasing and stays within the set K. Since K was
assumed to be compact, the sequence x^(k) has at least one accumulation point x*, and the
preceding discussion implies that ∇F(x*) = 0.
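A sketch of the gradient method in Python; the exact line search is replaced by crude halving, which is sufficient since only an approximate minimizer is needed (the quadratic test functional and all numerical parameters are arbitrary choices):

```python
import numpy as np

def gradient_descent(F, gradF, x0, maxit=200):
    """Steepest descent with a crude backtracking search: halve alpha
    until the function value actually decreases."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        d = -gradF(x)                        # steepest descent direction
        if np.linalg.norm(d) < 1e-10:
            break
        alpha = 1.0
        while F(x + alpha * d) >= F(x) and alpha > 1e-16:
            alpha *= 0.5
        x = x + alpha * d
    return x

# example: F(x) = x1^2 + 10 x2^2 has the unique minimizer 0
F = lambda x: x[0]**2 + 10 * x[1]**2
gradF = lambda x: np.array([2 * x[0], 20 * x[1]])
xmin = gradient_descent(F, gradF, [1.0, 1.0])
```

The mild ill-conditioning (ratio 10 between the eigenvalues of the Hessian) already shows the typical slow, zigzagging but globally safe behavior of the method.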
Remark 4.2.5. More generally, we can choose d^(k) = −B^(k) ∇F(x^(k)) in (4.8), for any
positive definite matrix B^(k), leading to generalised steepest descent methods that
minimise the descent direction in (4.6) in the weighted inner product (x, y)_{B^(k)} = x^T B^(k) y
instead of the Euclidean inner product (x, y) = x^T y.

In particular, if the Hessian D²F(x^(k)) is positive definite, we can choose B^(k) = (D²F(x^(k)))^{−1}
and α^(k) = 1, which reduces to the Newton method. This link is derived in a different way
in the next section.
4.3.1. The convergence of the Newton method is only local: the closer to the solution we
start, the faster it converges. Thus, finding good initial guesses is an important task.

A reasonable initial guess for finding y_1 in a one-step method seems to be y_0, but on
closer inspection, this is a good guess only if the time step is small. The convergence
requirements of Newton's method would thus impose a new time step restriction, which we
want to avoid. Therefore, we present methods which guarantee global convergence while
still converging locally quadratically.
As a rule, Newton’s method should never be used without some globalization strategy!
4.3.2 Lemma: Under the assumptions of theorem 4.1.4, Newton’s Method applied
to the root finding problem f (x) = 0 for f : Rd → Rd is a descent method applied
to the functional F (x) = |f (x)|2 .
Now assume that f(x^(k)) ≠ 0. Then, choosing s^(k) := d^(k) / |d^(k)| and omitting the
arguments x^(k), we have

    ⟨∇F, s^(k)⟩ = − ( 2 f^T ∇f (∇f)^{−1} f ) / |(∇f)^{−1} f|
                ≤ − 2|f|² / ( ‖(∇f)^{−1}‖ ‖f‖ )
                = − 2|f| / ‖(∇f)^{−1}‖ < 0,

and thus s^(k) is a descent direction that satisfies (4.6). Here, we used the fact that for x^(k)
sufficiently close to x* we have ‖(∇f(x^(k)))^{−1}‖ < ∞ (Perturbation Theorem, Numerik 0).
Finally, choosing α^(k) := |d^(k)|, we have established the equivalence.
4.3.3 Definition: The Newton method with line search for finding the root
of the nonlinear equation f (x) = 0 reads: given an initial value x(0) ∈ Rd , compute
iterates x(k) ∈ Rd , k = 1, 2, . . . by the rule
    J = ∇f(x^(k)),
    d^(k) = −J^{−1} f(x^(k)),
    α^(k) = argmin_{γ>0} |f(x^(k) + γ d^(k))|²,    (4.9)
    x^(k+1) = x^(k) + α^(k) d^(k).
4.3.4 Definition: A variant most often used in practice is the Newton method
with step size control (backtracking line search): given an initial value x^(0) ∈
Rd, compute iterates x^(k) ∈ Rd, k = 1, 2, . . ., by the rule

    J = ∇f(x^(k)),
    d^(k) = −J^{−1} f(x^(k)),    (4.10)
    x^(k+1) = x^(k) + 2^{−j} d^(k),

where j ≥ 0 is the smallest integer such that the residual norm decreases, cf. remark 4.3.5.
Remark 4.3.5. The step size control algorithm can be implemented with very low over-
head. In fact, in each Newton step we only have to monitor the norm of the residual,
|f(x^(k) + d^(k))|, which is typically needed for the stopping criterion anyway. If the residual
grows, i.e. |f(x^(k) + d^(k))| ≥ |f(x^(k))|, we halve the step size, recompute the residual norm,
and check again. A modification of the plain Newton method and additional work are only
needed when the original method was likely to fail anyway.
Under certain assumptions on f it can be shown [NW06] that this backtracking line search
algorithm terminates after a finite (typically very small) number of steps. Also, the step
size control typically only triggers within the first few steps, then the quadratic convergence
of the Newton method starts.
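The step size control of definition 4.3.4 adds only a few lines to the plain method. A sketch in Python; the equation arctan(x) = 0 is a classical example where the undamped Newton method diverges for large initial values, while the damped version converges (the starting value 10 and all tolerances are illustrative choices):

```python
import numpy as np

def newton_backtracking(f, jac, x0, tol=1e-12, maxit=100, jmax=30):
    """Newton method with step size control (4.10): halve the update
    until the residual norm decreases."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        d = np.linalg.solve(jac(x), -fx)
        step = 1.0
        for _ in range(jmax):
            if np.linalg.norm(f(x + step * d)) < np.linalg.norm(fx):
                break
            step *= 0.5                      # backtracking
        x = x + step * d
    return x

f = lambda x: np.array([np.arctan(x[0])])
jac = lambda x: np.array([[1.0 / (1.0 + x[0]**2)]])
root = newton_backtracking(f, jac, [10.0])   # plain Newton diverges here
```

Once the iterate is close to the root, every trial step is accepted immediately and the quadratic convergence of the plain method takes over.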
4.4.1. Quadratic convergence is an asymptotic statement, which for any practical purpose
can be replaced by “fast” convergence. Most of the effort spent in a single Newton step
consists of setting up the Jacobian J and solving the linear system in the second line
of (4.2). Therefore, we will consider techniques here, which avoid some of this work. We
will have to consider two cases
1. Small systems with d ≲ 1000. For such systems, a direct method like the LU- or QR-
decomposition is advisable in order to solve the linear system. To this end, we
compute the whole Jacobian and compute its decomposition, an effort of order d³
operations. Compared to the d² operations for applying the inverse and order d for all
other tasks, this step must be avoided as much as possible.
2. Large systems, where the Jacobian is typically sparse (most of its entries are zero).
For such a system, factorising the matrix at a cost of order d3 is typically not afford-
able. Therefore, the linear problem is solved by an iterative method and we avoid
the computation of the Jacobian when possible.
Remark 4.4.2. In order to save numerical effort constructing and inverting Jacobians,
the following strategies have been successful.
• Fix a threshold 0 < η < 1 which will be used as a bound for the error reduction. In each
Newton step, first compute the tentative update x̂ using the Jacobian J_{k−1} of the previous
step. This yields the modified method

    x̂ = x^(k) − J_{k−1}^{−1} f(x^(k)),

    if |f(x̂)| ≤ η |f(x^(k))|:    J_k = J_{k−1},    x^(k+1) = x̂,    (4.12)
    else:                        J_k = ∇f(x^(k)),    x^(k+1) = x^(k) − J_k^{−1} f(x^(k)).
Thus, an old Jacobian and its inverse are used until convergence rates deteriorate.
This method is a quasi-Newton method which will not converge quadratically. How-
ever, we can obtain linear convergence at any rate η.
• If Newton’s method is used within a time stepping scheme, the Jacobian of the last
Newton step in the previous time step is often a good approximation for the Jacobian
of the first Newton step in the new time step. This holds in particular for small time
steps and constant extrapolation. Therefore, the previous method should also be
extended over the bounds of time steps.
• An improvement of the method above can be achieved by so-called low rank updates,
e.g. the rank-1 update: let J_0 = ∇f(x^(0)) or J_0 = I. Then, at the k-th step,
given x^(k) and x^(k−1), compute

    p = x^(k) − x^(k−1),
    q = f(x^(k)) − f(x^(k−1)),    (4.13)
    J_k = J_{k−1} + (1/|p|²) (q − J_{k−1} p) p^T.
The fact that the rank of Jk − Jk−1 is at most one can be used to avoid computing
and storing matrices at all. The inverse of such a matrix can be computed via the
Sherman-Morrison formula. The practically most efficient and used methods use
rank-2 updates, such as the Broyden methods [NW06].
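A sketch of the rank-1 update (4.13) in Python, in the variant where the updated Jacobian approximation is used directly in the next step (the small 2 × 2 test system is an illustrative choice; in practice one would apply the Sherman-Morrison formula instead of refactorizing J in every step):

```python
import numpy as np

def broyden(f, x0, J0=None, tol=1e-10, maxit=200):
    """Broyden's rank-1 quasi-Newton method: the Jacobian approximation
    is updated from differences (4.13), no derivatives are needed."""
    x = np.asarray(x0, dtype=float)
    J = np.eye(len(x)) if J0 is None else np.asarray(J0, dtype=float)
    fx = f(x)
    for _ in range(maxit):
        if np.linalg.norm(fx) < tol:
            break
        p = np.linalg.solve(J, -fx)          # p = x_new - x
        x = x + p
        fx_new = f(x)
        q = fx_new - fx                      # q = f(x_new) - f(x)
        J = J + np.outer(q - J @ p, p) / (p @ p)
        fx = fx_new
    return x

# example: intersection of the unit circle with the line x1 = x2
f = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
sol = broyden(f, [0.8, 0.6])
```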
Remark 4.4.3. For problems leading to large, sparse Jacobians, typically space discretiza-
tions of partial differential equations, computing inverses or LU-decompositions is infeasi-
ble. These matrices typically only feature a few nonzero elements per row, while the inverse
and the LU-decomposition are fully populated, thus increasing the amount of memory from
order d to order d².
Linear systems like this are often solved by iterative methods, leading for instance to so
called Newton-Krylov methods. Iterative methods approximate the solution of a linear
system
Jd = f
only using multiplications of a vector with the matrix J. On the other hand, for any vector
v ∈ Rd, the term Jv is the directional derivative of f in the direction v. Thus, it can be
approximated easily by the difference quotient

    Jv ≈ ( f(x^(k) + εv) − f(x^(k)) ) / ε.

The term f(x^(k)) must be calculated anyway, as it is the current Newton residual. Thus,
each step of the iterative linear solver requires one evaluation of the nonlinear function,
and no derivatives are computed.
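The difference quotient above is a one-liner. A sketch in Python (the choice ε = 10⁻⁷ and the test function are arbitrary):

```python
import numpy as np

def jac_vec(f, x, v, eps=1e-7):
    """Finite-difference approximation of the Jacobian-vector product
    J(x) v, as used in Jacobian-free Newton-Krylov methods."""
    return (f(x + eps * v) - f(x)) / eps

# example: f(x) = (x1^2, x1*x2) has the Jacobian [[2 x1, 0], [x2, x1]]
f = lambda x: np.array([x[0]**2, x[0] * x[1]])
x = np.array([1.0, 2.0]); v = np.array([1.0, 1.0])
approx = jac_vec(f, x, v)        # exact value of Jv is (2, 3)
```

Inside a Krylov solver, every call to this routine costs one evaluation of the nonlinear function, which is exactly the trade-off described above.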
The efficiency of such a method depends on the number of linear iteration steps which is
determined by two factors: the gain in accuracy and the contraction speed. It turns out
that typically gaining two digits in accuracy is sufficient to ensure fast convergence of the
Newton iteration. The contraction number is a more difficult issue and typically requires
preconditioning, which is problem-dependent and as such must be discussed when needed.
Chapter 5
Instead of computing the next time step from only the one initial value at the beginning of
the current time interval, possibly with the help of intermediate steps, we can also use the
values from several previous time steps. Intuitively, this could be more efficient, since
function values at these points have already been computed.
Such methods that use values of several time steps in the past in order to achieve a higher
order are called multistep methods. We will begin this chapter by introducing some of
the common formulae, before studying their stability and convergence properties.
Basically, there are two construction principles for the multistep methods: Quadrature and
numerical differentiation.
Example 5.1.1 (Adams-Moulton formulae). Here, the integral from point tk−1 to point
tk is approximated by an interpolatory quadrature rule based on the points tk−s to tk , i.e.,
    y_k = y_{k−1} + Σ_{r=0}^{s} f_{k−r} ∫_{t_{k−1}}^{t_k} L_r(t) dt,    (5.1)
where fj denotes the function value f (tj , yj ) and Lr (t), r = 0, . . . , s, the Lagrange inter-
polation polynomials associated with the points tk−r , r = 0, . . . , s.
Since the integral involves the function evaluated at the time step that is being computed,
these methods are implicit. The first four members of this family are

    s = 0:  y_k = y_{k−1} + h f_k,
    s = 1:  y_k = y_{k−1} + (h/2) ( f_k + f_{k−1} ),
    s = 2:  y_k = y_{k−1} + (h/12) ( 5 f_k + 8 f_{k−1} − f_{k−2} ),
    s = 3:  y_k = y_{k−1} + (h/24) ( 9 f_k + 19 f_{k−1} − 5 f_{k−2} + f_{k−3} ).
Example 5.1.2 (Adams-Bashforth formulae). With the same principle we obtain explicit
methods by omitting the point in time tk in the definition of the interpolation polynomial.
This yields quadrature formulae of the form
    y_k = y_{k−1} + Σ_{r=1}^{s} f_{k−r} ∫_{t_{k−1}}^{t_k} L_r(t) dt.    (5.2)
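The weights ∫_{t_{k−1}}^{t_k} L_r(t) dt can be computed once and for all in the normalized variable τ = (t − t_{k−1})/h, where the node t_{k−r} sits at τ = 1 − r. A sketch in Python; for s = 2 it reproduces the classical Adams-Bashforth coefficients 3/2, −1/2:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def adams_bashforth_weights(s):
    """Weights w_r with y_k = y_{k-1} + h * sum_{r=1}^s w_r f_{k-r},
    obtained by integrating the Lagrange polynomials over one step."""
    nodes = [1 - r for r in range(1, s + 1)]    # t_{k-r} at tau = 1 - r
    w = []
    for r, tr in enumerate(nodes):
        num = np.array([1.0])                   # build L_r in coefficient form
        denom = 1.0
        for m, tm in enumerate(nodes):
            if m != r:
                num = P.polymul(num, np.array([-tm, 1.0]))
                denom *= tr - tm
        antider = P.polyint(num / denom)        # antiderivative of L_r
        w.append(P.polyval(1.0, antider) - P.polyval(0.0, antider))
    return np.array(w)

w2 = adams_bashforth_weights(2)     # expected: [3/2, -1/2]
w3 = adams_bashforth_weights(3)     # expected: [23/12, -16/12, 5/12]
```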
Example 5.1.3 (BDF methods). Backward differencing formulas (BDF) are also based
on Lagrange interpolation at the points tk−s to tk . However, in contrast to Adams formu-
lae they do not use quadrature for the right hand side, but rather the derivative of the
interpolation polynomial in the point tk for the left hand side.
The interpolation polynomial is

    y(t) = Σ_{r=0}^{s} y_{k−r} L_{k−r}(t),

where y_k is yet to be determined. Now we require that y(t) satisfies the ODE at t_k. Thus,

    Σ_{r=0}^{s} y_{k−r} L′_{k−r}(t_k) = y′(t_k) = f(t_k, y_k).
Remark 5.1.4. Recall from Numerik 0 (or any other introductory course in numerical
analysis) that numerical differentiation and extrapolation of interpolation polynomials (i.e.
their evaluation outside the interval spanned by the interpolation points) are both
numerically unstable. Therefore, we expect stability problems for all of these methods.
Secondly, recall that Lagrange interpolation with equidistant support points is unstable for
higher polynomial degrees. Therefore, we also expect all of the above methods to perform
well only at moderate order.
5.2 General definition and consistency of LMMs
Remark 5.2.2. The LMM was defined for constant step size h. In principle, it is possible
to implement the method with variable step size, but we restrict ourselves to the constant
case. Notes on step size control can be found later in this chapter.
5.2.3 Definition: The local error of an LMM at t_k is defined as

    u_k − y_k,

where y_k is the numerical solution obtained from (5.3) using the exact initial values
y_{k−r} = u_{k−r} for r = 1, . . . , s.
The truncation error of an LMM, on the other hand, is defined as

    L_h u(t_k) := Σ_{r=0}^{s} ( α_{s−r} u(t_{k−r}) − h β_{s−r} f(t_{k−r}, u(t_{k−r})) ).    (5.6)
Lemma 5.2.4. For h sufficiently small, the two local errors satisfy the relation

    u_k − y_k = ( I − h β_s Df_k )^{−1} (L_h u)(t_k),    (5.7)

where

    Df_k := ∫_0^1 Df( t_k, u_k + ϑ(y_k − u_k) ) dϑ.
0
Proof. Since we assumed y_{k−r} = u_{k−r} for r = 1, . . . , s in the definition of the local error,
and α_s = 1, (5.3) is equivalent to

    0 = y_k − h β_s f(t_k, y_k) + Σ_{r=1}^{s} ( α_{s−r} u_{k−r} − h β_{s−r} f(t_{k−r}, u_{k−r}) ).
Finally, the result follows by applying the Integral Mean Value Theorem (see, e.g., [Nu-
merik 0, Hilfssatz 5.8]) and the fact that for h sufficiently small I−hβs Df k is invertible.
and that the higher-order term is exactly zero for explicit LMMs.
5.2.7 Theorem: An LMM with constant step size h is consistent of order p if and
only if

    Σ_{r=0}^{s} α_{s−r} = 0    and    Σ_{r=0}^{s} ( α_{s−r} r^q + q β_{s−r} r^{q−1} ) = 0,    q = 1, . . . , p.    (5.9)
Proof. We start with the Taylor expansion of the ODE solution u around t_k:

    u(t) = Σ_{q=0}^{p} ( u^(q)(t_k) / q! ) (t − t_k)^q + R_u(t),    R_u(t) := ( u^(p+1)(ξ) / (p+1)! ) (t − t_k)^{p+1},

where ξ is a point between t and t_k that depends on t. It follows from f(t, u) = u′ that
the corresponding right hand side can be expanded as

    f(t, u(t)) = Σ_{q=1}^{p} ( u^(q)(t_k) / (q−1)! ) (t − t_k)^{q−1} + R_f(t),    R_f(t) := ( u^(p+1)(η) / p! ) (t − t_k)^p.
Substituting the two expansions into (5.6), we get

    L_h u(t_k) = Σ_{r=0}^{s} [ α_{s−r} ( Σ_{q=0}^{p} ( u^(q)(t_k) / q! ) (−rh)^q + R_u(t_{k−r}) )
                               − β_{s−r} h ( Σ_{q=1}^{p} ( u^(q)(t_k) / (q−1)! ) (−rh)^{q−1} + R_f(t_{k−r}) ) ]

               = u(t_k) Σ_{r=0}^{s} α_{s−r}
                 + Σ_{q=1}^{p} (−1)^q ( Σ_{r=0}^{s} ( α_{s−r} r^q + q β_{s−r} r^{q−1} ) ) ( u^(q)(t_k) / q! ) h^q
                 + C h^{p+1},

where

    C := Σ_{r=0}^{s} ( (−1)^{p+1} r^p / (p+1)! ) ( α_{s−r} r u^(p+1)(ξ_r) + (p + 1) β_{s−r} u^(p+1)(η_r) )

and ξ_r, η_r ∈ [t_{k−r}, t_k], for r = 0, . . . , s, which in general may all be different.
Since the right hand side f was arbitrary, Lh u(tk ) = O(hp+1 ) if and only if the conditions
in (5.9) hold. In that case we have
    |L_h u(t_k)| ≤ ( ‖u^(p+1)‖_∞ / (p + 1)! ) ( Σ_{r=0}^{s} ( |α_{s−r}| r^{p+1} + (p + 1) |β_{s−r}| r^p ) ) h^{p+1}.
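Conditions (5.9) are easy to check mechanically. A sketch in Python, assuming the coefficient convention of (5.3) with α_s = 1; both the 2-step Adams-Bashforth method and the trapezoidal rule (the 1-step Adams-Moulton method) come out with order 2:

```python
def lmm_order(alpha, beta, pmax=10):
    """Largest p such that the consistency conditions (5.9) hold;
    alpha[j] and beta[j] are the coefficients of y_{k-s+j} and f_{k-s+j}."""
    s = len(alpha) - 1
    if abs(sum(alpha)) > 1e-12:          # condition sum alpha_{s-r} = 0
        return 0
    p = 0
    for q in range(1, pmax + 1):
        c = sum(alpha[s - r] * r**q + q * beta[s - r] * r**(q - 1)
                for r in range(s + 1))
        if abs(c) > 1e-12:
            break
        p = q
    return p

# 2-step Adams-Bashforth: y_k = y_{k-1} + h (3/2 f_{k-1} - 1/2 f_{k-2})
ab2_order = lmm_order([0.0, -1.0, 1.0], [-0.5, 1.5, 0.0])
# trapezoidal rule: y_k = y_{k-1} + h/2 (f_k + f_{k-1})
trap_order = lmm_order([-1.0, 1.0], [0.5, 0.5])
```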
Remark 5.2.8. A consistent LMM is not necessarily convergent. To understand this and
to develop criteria for convergence, we digress into the theory of difference equations.
5.3.1. The stability of LMM can be understood by employing the fairly old theory of
difference equations. In order to keep the presentation simple in this section, we use a
different notation for numbering indices in the equations. Nevertheless, the coefficients of
the characteristic polynomial are the same as for LMM.
5.3.3 Lemma: The solutions of equation (5.10) form a vector space of dimension s.
Proof. Since equation (5.10) is linear and homogeneous, it is obvious that if two solution
sequences (y^(1)) and (y^(2)) satisfy the equation, then (αy^(1) + y^(2)) also satisfies
(5.10), for any α ∈ R (or C).

As soon as the initial values y_0 to y_{s−1} are chosen, all other sequence members are uniquely
defined. Moreover, it holds that

    y_0 = y_1 = · · · = y_{s−1} = 0    =⇒    y_n = 0,    n ≥ 0.

Therefore, it is sufficient to consider the first s elements. Since those can be chosen arbi-
trarily, they span an s-dimensional vector space.
5.3.4 Lemma: For each root ξ of the generating polynomial χ(x) the sequence
yn = ξ n is a solution of the difference equation (5.10).
5.3.5 Theorem: Let {ξ_j}_{j=1,...,J} be the roots of the generating polynomial χ with
multiplicities μ_j. Then, the sequences of the form

    y_n^(j,k) = n^k ξ_j^n,    k = 0, . . . , μ_j − 1,    j = 1, . . . , J,    (5.12)

form a basis of the solution space of the difference equation (5.10).

Proof. First we observe that the sum of the multiplicities of the roots has to equal the
degree of the polynomial:

    s = Σ_{j=1}^{J} μ_j.
Moreover, we know from Lemma 5.3.3, that s is the dimension of the solution space.
However, the sequences (y_n^(j,k)) are also linearly independent. This is clear for sequences
with the same index j, since the monomials n^k are linearly independent. It is also clear for
different roots, because as n → ∞ the exponential factors dominate the influence of the
polynomial factors.

It remains to show that the sequences (y_n^(j,k)) are in fact solutions of the difference
equation. For k = 0 we have proven this already in lemma 5.3.4. We prove the fact here for
k = 1 and a double root ξ_j; the principle for roots of higher multiplicity should then be clear.
Equation (5.10) applied to the sequence (n ξ_j^n) results in

    Σ_{r=0}^{s} α_r (n + r) ξ_j^{n+r} = n ξ_j^n Σ_{r=0}^{s} α_r ξ_j^r + ξ_j^{n+1} Σ_{r=1}^{s} α_r r ξ_j^{r−1}
                                     = n ξ_j^n χ(ξ_j) + ξ_j^{n+1} χ′(ξ_j) = 0.

Here, the term with α_0 vanishes in the second sum, because it is multiplied by r = 0, and
χ(ξ_j) = χ′(ξ_j) = 0 because ξ_j is a multiple root.
5.3.6 Corollary (Root test): All solutions {yn } of the difference equation (5.10)
are bounded for n → ∞ if and only if:
• all roots of the generating polynomial χ(x) lie in the closed unit disk
  { z ∈ C : |z| ≤ 1 }, and
• all roots on the boundary of the unit circle are simple.
Proof. According to theorem 5.3.5 we can write all solutions as linear combinations of the
sequences (y (j,k) ) in equation (5.12). Therefore, for n → ∞,
In contrast to one-step methods, the Lipschitz condition (1.23) for the RHS f of the
differential equation is not sufficient to ensure that consistency of a multistep method
implies convergence. As for A-stability, stability conditions will again be deduced by
means of a simple model problem.
Remark 5.4.1. In the following, we investigate the solution at a fixed point in time t under
a shrinking step size h. Therefore, we choose n steps of step size h = t/n and let n go
to infinity.
5.4.2 Definition: An LMM is zero-stable (or simply stable) if, applied to the
trivial ODE

    u′ = 0,    (5.13)

all numerical solutions {y_k} remain bounded as k → ∞, for any choice of starting
values y_0, . . . , y_{s−1}.
5.4.3 Theorem: An LMM is zero-stable if and only if all roots of the first generating
polynomial ϱ(x) of equation (5.4) satisfy the root test in corollary 5.3.6.
Proof. The application of the LMM to the ODE (5.13) results in the difference equation

    Σ_{r=0}^{s} α_{s−r} y_{k−r} = Σ_{r′=0}^{s} α_{r′} y_{n+r′} = 0    (5.14)
with n = k − s. Thus, the generating polynomial ϱ(x) coincides with the generating
polynomial χ(x) of the difference equation in (5.10), which is independent of h.

If α_s α_0 ≠ 0, the result then follows directly from corollary 5.3.6. Otherwise, (5.14) reduces
to a difference equation with generating polynomial ϱ_m(x) of order s − m, for some
1 ≤ m ≤ s − 1, and ϱ(x) = x^m ϱ_m(x). Thus, ϱ satisfying the root test is equivalent to ϱ_m
satisfying the root test, and the result follows again from corollary 5.3.6.
Proof. For all of these methods, the first generating polynomial is ϱ(x) = x^s − x^{s−1}. It has
the simple root ξ_1 = 1 and the (s − 1)-fold root 0.
5.4.5 Theorem: The BDF methods are zero-stable for s ≤ 6 and not zero-stable
for s ≥ 7.
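The root test can be verified numerically for the BDF family. A sketch in Python, assuming the classical representation ϱ(ζ) = Σ_{j=1}^{s} (1/j) ζ^{s−j}(ζ − 1)^j of the BDF generating polynomial (this normalization differs from α_s = 1 only by a constant factor, which does not affect the roots):

```python
import numpy as np
from numpy.polynomial import polynomial as P

def bdf_rho(s):
    """Ascending coefficients of rho(z) = sum_{j=1}^s (1/j) z^(s-j) (z-1)^j."""
    rho = np.zeros(s + 1)
    for j in range(1, s + 1):
        zpow = np.concatenate([np.zeros(s - j), [1.0]])        # z^(s-j)
        rho += P.polymul(P.polypow([-1.0, 1.0], j), zpow) / j  # * (z-1)^j / j
    return rho

def satisfies_root_test(s):
    roots = np.roots(bdf_rho(s)[::-1])            # np.roots wants descending order
    spurious = roots[np.abs(roots - 1.0) > 1e-6]  # drop the consistency root z = 1
    return bool(np.all(np.abs(spurious) < 1.0))

stable = [satisfies_root_test(s) for s in range(1, 8)]   # s = 1, ..., 7
```

In agreement with the theorem, the spurious roots stay inside the unit disk for s ≤ 6 and leave it for s = 7.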
5.4.6 Definition: An LMM is convergent of order p, if for any IVP with sufficiently
smooth right hand side f there exists a constant h0 > 0 such that, for all h ≤ h0 and
all starting values satisfying

    |y_r − u(t_r)| ≤ c_0 h^p,    r = 0, . . . , s − 1,    (5.16)

the numerical solution satisfies max_k |y_k − u(t_k)| = O(h^p).
Here, u is the continuous solution of the IVP and y is the LMM approximation.
To prove convergence, we will for simplicity only consider the scalar case d = 1. The case
d > 1 can be proved similarly.
where

    Y_k = ( y_k, y_{k−1}, . . . , y_{k−s+1} )^T ∈ Rs,

          ( −α_{s−1}  −α_{s−2}  · · ·  −α_0 )
          (     1         0     · · ·    0  )
    A  =  (               ⋱             ⋮  )  ∈ Rs×s,    (5.18)
          (               1              0  )
Proof. From the general form of the LMM, we obtain

    Σ_{r=0}^{s} α_{s−r} y_{k−r} = h Σ_{r=1}^{s} β_{s−r} f(t_{k−r}, y_{k−r}) + h β_s f(t_k, y_k).

We rewrite this as

    y_k = − Σ_{r=1}^{s} α_{s−r} y_{k−r} + h ψ_k,

where ψ_k := Σ_{r=1}^{s} β_{s−r} f(t_{k−r}, y_{k−r}) + β_s f(t_k, y_k), and this formula is also entered
implicitly as the value for y_k in the computation of f(t_k, y_k). This is the first equation
in (5.17). The remaining equations simply shift the entries in the vector Y_{k−1}, i.e.
(Y_k)_{i+1} = (Y_{k−1})_i = y_{k−i}, for i = 1, . . . , s − 1.
5.4.8 Lemma: Let d = 1 and let u(t) be the exact solution of the IVP. Suppose Ŷ_k
is the solution of a single step of (5.17) starting from the exact values
U_{k−1} = (u_{k−1}, . . . , u_{k−s})^T. Then ‖U_k − Ŷ_k‖ = O(h^{p+1}).

Proof. The first component of U_k − Ŷ_k is the local error u_k − y_k of step k, as defined in
definition 5.2.3, which is of order h^{p+1} by the assumption. The other components vanish
by the definition of the method.
5.4.9 Lemma: Assume that an LMM is zero-stable. Then, there exists a vector
norm ‖·‖ on Cs such that the induced operator norm of the matrix A satisfies

    ‖A‖ ≤ 1.    (5.21)
By the root test we know that simple roots, which correspond to irreducible blocks of
dimension one have maximal modulus one. Furthermore, every Jordan block of dimension
greater than one corresponds to a multiple root, which by assumption has modulus strictly
less than one. Let λ_i be such a multiple root with multiplicity μ_i. Such a block admits a
modified canonical form

          ( λ_i  1−|λ_i|                  )
          (       λ_i     ⋱              )
    J_i = (               ⋱   1−|λ_i|    )  ∈ C^{μ_i × μ_i}.
          (                     λ_i       )
Thus, the canonical form J = T^{−1}AT has norm ‖J‖_∞ ≤ 1. If we define the vector norm

    ‖x‖ := ‖T^{−1}x‖_∞,

it follows that

    ‖Ax‖ = ‖T^{−1}Ax‖_∞ = ‖J T^{−1}x‖_∞ ≤ ‖J‖_∞ ‖T^{−1}x‖_∞ ≤ ‖x‖,

which is (5.21).
Proof. As already stated, we only prove the case d = 1 explicitly; see the original notes
by Guido Kanschat for the general proof. Since f was assumed to be sufficiently smooth,
Fh satisfies a uniform Lipschitz condition with Lipschitz constant Lh .
Let Yk−1 and Zk−1 be two initial values for the interval Ik . By the previous lemma, we
have in the norm defined there, for sufficiently small h, that
By lemma 5.4.8, the local error ηk = ‖Uk − Ŷk‖ at step k is bounded by M h^{p+1} (where
M also contains the equivalence constant γ between the Euclidean norm and the norm
defined in the previous lemma). Thus:

    C := c0 γ e^{T Lh} + (M/Lh) ( e^{T Lh} − 1 ).
5.5 Starting procedures
5.5.1. In contrast to one-step methods, where the numerical solution is obtained solely
from the differential equation and the initial value, multistep methods require more than
one start value. An LMM with s steps requires s known start values yk−s , . . . , yk−1 . Mostly,
they are not provided by the IVP itself. Thus, general LMM decompose into two parts:
• a starting phase where the start values are computed in a suitable way and
It is crucial that the starting procedure provides a suitable order corresponding to the
LMM of the run phase, recall condition (5.16) in definition 5.4.6. Possible choices for the
starting phase include multistep methods with variable order and one-step methods.
Example 5.5.2 (Self starter). A 2-step BDF method requires y0 and y1 to be known. y0
is given by the initial value while y1 is unknown so far. To guarantee that the method has
order 2, y1 needs to be at least locally of order 2, i.e.,
|u(t1 ) − y1 | ≤ c0 h2 . (5.24)
This is ensured, for example, by one step of the 1-step BDF method (implicit Euler).
However, starting an LMM with s > 2 steps by a first-order method and then successively
increasing the order until s is reached does not provide the desired global order. That is
due to the fact that a one-step method cannot have more than order 2, limiting the overall
convergence order to 2. Nevertheless, self starters are often used in practice.
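The effect described in example 5.5.2 can be observed numerically. A minimal sketch (the model problem u′ = λu with λ = −1 is my own choice; for a linear problem both implicit equations can be solved in closed form): BDF(2) started with one implicit Euler step retains global order 2.

```python
import numpy as np

# Sketch: BDF(2) for u' = lam*u, u(0) = 1, started with one implicit Euler
# step as in Example 5.5.2.  Halving h should reduce the error by ~4.

lam = -1.0

def bdf2(n):
    h = 1.0 / n
    y = np.zeros(n + 1)
    y[0] = 1.0
    y[1] = y[0] / (1 - h * lam)               # implicit Euler start, local O(h^2)
    for k in range(2, n + 1):                 # BDF(2) step solved for y_k
        y[k] = (4 * y[k-1] - y[k-2]) / (3 - 2 * h * lam)
    return y[n]

errs = [abs(bdf2(n) - np.exp(lam)) for n in (50, 100, 200)]
rates = [errs[i] / errs[i + 1] for i in range(2)]   # approach 4 for order 2
```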
Example 5.5.3 (Runge-Kutta starter). One can use Runge-Kutta methods to start LMMs.
Since only a fixed number of starting steps are performed, the local order of the Runge-
Kutta approximation is crucial. For an implicit LMM with convergence order p and stepsize
h one could use an RK method with consistency order p − 1 with the same step size h.
Consider a 3-step BDF method. Thus, apart from y0 , we need start values y1 , y2 with
errors less than c0 h3 . They can be computed by RK methods of consistency order 2, for
example by two steps of the 1-stage Gauß collocation method with step size h since it has
consistency order 2s = 2, see theorem 3.4.15.
Remark 5.5.4. In practice it is not the order of a procedure that is crucial but rather
the fact that the errors of all approximations (the start values and all approximations of
the run phase) are bounded by the user-given tolerance; compare Section 2.4. Generally, LMMs
are applied with variable step sizes and orders in practice (see e.g. Exercise 7.2).
Thus, the step sizes of all steps are in practice usually controlled using local
error estimates. Hence, self starting procedures usually start with very small step sizes
and increase them successively. Due to their higher orders, RK starters are usually allowed
to use moderate step sizes in the beginning.
5.6 LMM and stiff problems
To study A-stability of LMMs we consider again the model equation u′ = λu. Applying a
general LMM (5.3) to this model equation leads to the linear model difference equation

    Σ_{r=0}^{s} ( αs−r − z βs−r ) yk−r = 0,        (5.25)

with z = λh.
Note that this definition is equivalent to the definition of A-stability for one-step methods
in definition 3.2.5.
Remark 5.6.3. Instead of the simple amplification function R(z) of the one-step methods,
we get here a function of two variables: the point z for which we want to show stability
and the artificial variable x from the analysis of the method.
5.6.4 Lemma: Let {ξ1 (z), . . . , ξs (z)} be the set of roots of the stability polynomial
Rz (x) as functions of z. A point z ∈ C is in the stability region of a LMM, if these
roots satisfy the root test in corollary 5.3.6.
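The root test of this lemma can be sketched in code for BDF(2), for which ρ(x) = x² − (4/3)x + 1/3 and σ(x) = (2/3)x². The helper names are my own, and the modulus check below ignores the simplicity condition for roots on the unit circle, so it is only a sketch:

```python
import numpy as np

# Sketch of the root test for BDF(2): roots of
# R_z(x) = rho(x) - z*sigma(x) = (1 - 2z/3) x^2 - 4/3 x + 1/3.

def bdf2_roots(z):
    return np.roots([1 - 2 * z / 3, -4 / 3, 1 / 3])

def in_stability_region(z, tol=1e-12):
    # simplified: only checks that all root moduli are at most one
    return bool(max(abs(bdf2_roots(z))) <= 1 + tol)

print(in_stability_region(-1.0))   # True: BDF(2) is A-stable
print(in_stability_region(0.5))    # False: small positive real z is unstable
```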
5.6.1 A(α)-stability
5.6.6. Motivated by the fact that there are no A-stable LMMs of higher order, people have
introduced relaxed concepts of A-stability.
Figure 5.1: Boundaries of the stability regions for BDF(1) to BDF(6); the unstable region
is right of the origin. The right figure shows a zoom near the origin.
k 1 2 3 4 5 6
α 90◦ 90◦ 86.03◦ 73.35◦ 51.84◦ 17.84◦
D 0 0 0.083 0.667 2.327 6.075
Table 5.1: Values for A(α)- and stiff stability for BDF methods of order k.
5.6.7 Definition: A LMM is called A(α)-stable, for α ∈ [0°, 90°], if its stability
region contains the sector

    { z ∈ C : Re(z) < 0 and |Im(z)/Re(z)| ≤ tan α }.

It is called A(0)-stable, if the stability region contains the negative real axis.
It is called stiffly stable, if it contains the set {Re(z) < −D}.
Similarly, A(α)-stable LMMs are suitable for linear problems in which high-frequency
vibrations (large |Im λ|) decay fast (large −Re λ).
LMMs behave similarly for nonlinear problems if the Jacobian matrix Dy f satisfies corre-
sponding properties.
Example 5.6.9. The stability regions of the stable BDF methods are in Figure 5.1. The
corresponding values for A(α)-stability and stiff stability are in Table 5.1. (Recall from
theorem 5.4.5 that BDF(7) is not even zero-stable.)
Chapter 6
This chapter deals with problems of a fundamentally different type than the problems we
examined in chapter 1, namely boundary value problems. Here, we have prescribed values
at the beginning and at the end of an interval of interest. They will require the design of
different numerical methods. We will only consider the most classical one.
Due to lemma 1.2.4 we know that every ODE can be written as a system of first-order
ODEs. Thus, we make the following definition (restricting our attention to explicit ODEs).
    u′(t) = f (t, u(t)),   t ∈ (a, b),        (6.1a)
    r( u(a), u(b) ) = 0.                       (6.1b)
6.1.2 Definition: A BVP (6.1) is called linear, if the right hand side f as well as
the boundary conditions are linear in u, i.e.: Find u : [a, b] → Rd such that
Remark 6.1.3. Since boundary values are imposed at two different points in time, the
concept of local solutions from Definition 1.2.8 is not applicable. Thus, tricks such as
going forward from interval to interval, as is done in the proof of Peano's theorem using
Euler's method, are not applicable here. For this reason, nothing can be concluded from
the local properties of the solution and the right hand side f. In fact, it is hard in general
even to establish that a solution exists.
Example 6.1.4. Consider the linear BVP

    ( u1′(t) )   ( 0 1 ) ( u1(t) )   ( 0 )
    (        ) = (     ) (       ) + (   )
    ( u2′(t) )   ( 0 0 ) ( u2(t) )   ( 1 )

together with two boundary conditions, for which we consider three different choices.
By substitution, we can easily see that this first-order system of ODEs is in fact equivalent
to the second-order ODE u1′′ = 1, which can be explicitly solved by integration to give

    u2(t) = u1′(t) = t + c1   and   u1(t) = t²/2 + c1 t + c2 .
But the BVP is not solvable for all three choices of boundary conditions (BCs). Using the
BCs we get

(i) c2 = 0 and c1 = −1, i.e., u(t) = ( t(t/2 − 1), t − 1 )ᵀ,

(ii) c2 = 0 and c1 = −1/2, i.e., u(t) = ( t(t − 1)/2, t − 1/2 )ᵀ,

(iii) but here the two BCs lead to c1 = 0 and c1 = −1, respectively, which cannot be
satisfied simultaneously.
Remark 6.1.5. Note that due to lemma 1.3.13 we know that the solution space of the
linear ODE in (6.2a) is d-dimensional. Hence, we need d additional pieces of information
to determine the solution uniquely. However, whether the d boundary conditions in (6.2b)
are sufficient is more subtle than in the case of an IVP, as we have just seen.
We will not discuss this further and instead consider an important subclass of linear BVPs,
as well as the most classical numerical method for them. For more details on the general
solution theory, see chapter 6 in the original notes by G. Kanschat or [Ran17b, Chap. 8].
Let us consider the following linear, second-order BVP of finding u : [a, b] → R such that
for some functions β, γ, f : [a, b] → R and two boundary values ua , ub ∈ R. (As is common
practice, we use x instead of t to denote the independent variable here.)
Then, we can see the LHS of (6.3) as a differential operator applied to u, mapping B to
the set of continuous functions. Namely, we define

    L : B → C[a, b],   u ↦ −u′′ + βu′ + γu.        (6.4)
To simplify our life we can (without loss of generality) get rid of the inhomogeneous
boundary values ua and ub . To this end, let

    ψ(x) = ( (b − x) ua + (x − a) ub ) / (b − a),

and introduce the new function u0 := u − ψ. Then, u0 solves the BVP

    −u0′′(x) + β(x) u0′(x) + γ(x) u0(x) = f(x) − β(x) (ub − ua)/(b − a) − γ(x) ψ(x) =: f0(x),
    u0(a) = u0(b) = 0.
such that for a differential operator of second order as defined in (6.4) above and a
right hand side f ∈ C[a, b] there holds
Lu = f. (6.6)
We subdivide the interval I = [a, b] again into subintervals as in Definition 2.1.2 and
consider the solution only at the partitioning points a = x0 < x1 < . . . < xn = b. As with
one-step and multistep timestepping methods, we denote the approximate solution values
at those partitioning points by yk , k = 0, . . . , n.
While one-step methods directly discretize the Volterra integral equation in order to com-
pute the solution at every new step, finite difference methods discretize the differen-
tial equation on the whole interval at once and then solve the resulting discrete (finite-
dimensional) system of equations. We have accomplished the first step and decided that
instead of function values u(x) in every point x of the interval I, we only approximate
u(xk ) in the points of the partition by yk , k = 0, . . . , n. What is left is the definition of
the discrete operator representing the equation.
    Forward difference    Dh+ u(x) = ( u(x + h) − u(x) ) / h ,              (6.7)

    Backward difference   Dh− u(x) = ( u(x) − u(x − h) ) / h ,              (6.8)

    Central difference    Dhc u(x) = ( u(x + h) − u(x − h) ) / (2h) .        (6.9)

For second derivatives we introduce the

    3-point stencil       Dh² u(x) = ( u(x + h) − 2u(x) + u(x − h) ) / h² .  (6.10)
Remark 6.3.2. Note that the 3-point stencil is the composition of the forward and backward
difference operators:

    Dh² u(x) = Dh+ ( Dh− u(x) ) = Dh− ( Dh+ u(x) ) .
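The consistency orders stated below can also be observed numerically. A small sketch (the function u = sin and the point x = 0.5 are my own choices):

```python
import numpy as np

# Numerical check of the difference operators (6.7)-(6.10) on u = sin:
# halving h should reduce the errors by factors ~2, ~4, ~4.

x = 0.5

def fd_errors(h):
    fwd = (np.sin(x + h) - np.sin(x)) / h                        # D_h^+
    cen = (np.sin(x + h) - np.sin(x - h)) / (2 * h)              # D_h^c
    d2 = (np.sin(x + h) - 2 * np.sin(x) + np.sin(x - h)) / h**2  # D_h^2
    # u' = cos, u'' = -sin
    return abs(fwd - np.cos(x)), abs(cen - np.cos(x)), abs(d2 + np.sin(x))

e1, e2 = fd_errors(1e-2), fd_errors(5e-3)
rates = [e1[i] / e2[i] for i in range(3)]   # about 2, 4, 4
```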
For simplicity, we only present finite differences on uniform subdivisions. Nevertheless,
the definitions of the operators can be extended easily to h changing between intervals.
6.3.4 Lemma: The forward and backward difference operators Dh+ and Dh− in
definition 6.3.1 are consistent of order 1 with the first derivative, i.e. for any x ∈
[a, b − h] (resp. x ∈ [a + h, b]),

    |Dh+ u(x) − u′(x)| ≤ ch   (resp. |Dh− u(x) − u′(x)| ≤ ch).

The central difference operator Dhc and the 3-point stencil Dh² are consistent of
order 2 with the first and second derivative, respectively, i.e. for any x ∈ [a+h, b−h],

    |Dhc u(x) − u′(x)| ≤ ch²   and   |Dh² u(x) − u′′(x)| ≤ ch².        (6.13)
Proof. Taylor expansion: Let x ∈ [a, b − h]. Then there exists ξ ∈ (x, x + h) such that

    Dh+ u(x) − u′(x) = ( u(x + h) − u(x) )/h − u′(x)
                     = ( u(x) + h u′(x) + (h²/2) u′′(ξ) − u(x) )/h − u′(x) = (h/2) u′′(ξ).

Thus, the result for Dh+ holds with c := (1/2) max_{x∈(a,b)} |u′′(x)|. The same computation can
be applied to Dh− u(x).
The final result for the 3-point stencil Dh² follows in a similar way (exercise).
Remark 6.3.5. When applied to the equation u′ = f (t, u) the solutions obtained by
forward and backward differences correspond to the explicit and implicit Euler methods,
respectively.
6.3.6 Definition: The finite difference method (with uniform subdivisions) for
the discretization of the BVP Lu = f on the interval I = [a, b] with homogeneous
boundary conditions, i.e., for u ∈ V as in definition 6.2.1, is defined by
Example 6.3.7. Using the 3-point stencil for u′′ and central differences for u′, the BVP
and the abbreviations βk = β(xk), γk = γ(xk), fk = f (xk), we obtain the discrete system
of equations
    ( −yk+1 + 2yk − yk−1 )/h² + βk ( yk+1 − yk−1 )/(2h) + γk yk = fk ,   for k = 1, . . . , n − 1,   (6.14)

with y0 = yn = 0, or in matrix notation

           ( λ1  ν1                )  ( y1   )   ( f1   )
           ( µ2  λ2  ν2            )  (  .   )   (  .   )
    Lh y = (     ...  ...  ...     )  (  .   ) = (  .   ) = fh ,        (6.15)
           (         µn−2 λn−2 νn−2)  (  .   )   (  .   )
           (              µn−1 λn−1)  ( yn−1 )   ( fn−1 )

where

    λk = 2/h² + γk ,   µk = −1/h² − βk/(2h) ,   and   νk = −1/h² + βk/(2h) .   (6.16)
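The assembly and solution of this system can be sketched directly. The test problem is my own choice: −u′′ + u = (1 + π²) sin(πx) on [0, 1] with u(0) = u(1) = 0 and exact solution u(x) = sin(πx), i.e. β = 0 and γ = 1:

```python
import numpy as np

# Sketch: assemble the tridiagonal system (6.14)-(6.16) and solve it with
# dense linear algebra (fine for small n; a tridiagonal solver would be
# used in practice).

def solve_fd(n):
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    beta = np.zeros(n + 1)
    gamma = np.ones(n + 1)
    f = (1 + np.pi**2) * np.sin(np.pi * x)

    k = np.arange(1, n)                        # interior points
    lam = 2 / h**2 + gamma[k]                  # lambda_k (diagonal)
    mu = -1 / h**2 - beta[k] / (2 * h)         # mu_k (subdiagonal)
    nu = -1 / h**2 + beta[k] / (2 * h)         # nu_k (superdiagonal)

    L = np.diag(lam) + np.diag(mu[1:], -1) + np.diag(nu[:-1], 1)
    y = np.linalg.solve(L, f[k])
    return np.max(np.abs(y - np.sin(np.pi * x[k])))

err_h, err_h2 = solve_fd(32), solve_fd(64)     # halving h: error drops by ~4
```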
Remark 6.3.8. Just as our view of the continuous BVP has changed compared to IVPs, the
discrete problem is now a fully coupled linear system which has to be solved by methods of
linear algebra, rather than by time stepping. In fact, we have n − 1 unknown variables
y1 , . . . , yn−1 and n − 1 equations, such that existence and uniqueness of solutions are
equivalent.
6.4.1. Since the solution of the discretized boundary value problem is a problem in linear
algebra, we have to study properties of the matrix Lh . The shortest and most elegant
way to prove stability is through the properties of M-matrices, which we present here very
shortly. We are not dwelling on this approach too long, since it is sufficient for stability,
but by far not necessary and particular to low order methods.
The fact that Lh is an M-matrix requires some knowledge of irreducible weakly diagonal
dominant matrices, which we have already come across in the last chapter in Numerik 0,
in the context of stationary iterative methods.
6.4.2 Definition: A quadratic n × n-matrix A is called an M-matrix if it satisfies
the following properties: its off-diagonal entries are nonpositive, i.e., aij ≤ 0 for
i ≠ j, and it is invertible with entrywise nonnegative inverse C = A⁻¹ = (cij), i.e.,

    cij ≥ 0,   i, j = 1, . . . , n.        (6.18)
Proof. It is easy to verify that the two conditions in (6.19) are sufficient for the first M-
matrix property.
The proof of positivity of the inverse is based on irreducible diagonal dominance, which
is too long and too specialized at this point and thus we will omit it. See, e.g., [Ran17b,
Hilfssatz 10.2].
Remark 6.4.4. The finite element method, discussed next semester in “Numerik 2 – Finite
Elements”, provides a much more powerful theory to deduce solvability and stability of the
discrete problem.
6.4.5 Lemma: Let A be an M-matrix. If there is a vector w such that for the
vector v = Aw there holds

    vi ≥ 1,   i = 1, . . . , n,

then

    ‖A⁻¹‖∞ ≤ ‖w‖∞ .

Thus,

    ‖A⁻¹‖∞ = sup_{x∈Rⁿ} ‖A⁻¹x‖∞ / ‖x‖∞ ≤ ‖w‖∞ .
6.4.6 Theorem: Assume that (6.19) holds and that there exists a constant δ < 2
such that

    |βk| ≤ δ/(b − a).        (6.21)

Then, the matrix Lh defined in (6.15) is invertible and

    ‖Lh⁻¹‖∞ ≤ (b − a)²/(8 − 4δ).        (6.22)

Proof. Consider the polynomial p(x) = (x − a)(b − x)
with derivatives p′(x) = a + b − 2x and p′′(x) = −2, and a maximum of (b − a)²/4 at
x = (a+b)/2. Choose the values pk = p(xk). Due to the consistency results in lemma 6.3.4,
we know that Dh² p ≡ p′′ and Dhc p ≡ p′ are exact, such that, for all k = 1, . . . , n − 1,
Remark 6.4.7. The assumptions of the previous theorem involve two sets of conditions
on the parameters βk and γk . Since

    δ/(b − a) < 2/(b − a) ≤ 2n/(b − a) = 2/h ,
condition (6.21) actually implies the second condition in (6.19). It is in fact not necessary
in this form, but a better estimate requires more advanced analysis.
The condition on γk in (6.19) is indeed necessary, as will be seen when we study partial
differential equations. The second condition in (6.19), on the other hand, relates the
coefficients βk to the mesh size and can be avoided, as seen in the next example.
Example 6.4.8. By changing the discretization of the first order term to an upwind
finite difference method, we obtain an M-matrix independent of the relation of βk and h.
To this end define

    β(x) Dh↑ u(x) = { β(x) Dh− u(x)   if β(x) > 0,
                      β(x) Dh+ u(x)   if β(x) < 0.        (6.23)
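The case distinction at a single grid point can be sketched as follows (function name and arguments are my own, not from the notes):

```python
# Sketch of the upwind choice (6.23): for beta > 0 use the backward
# difference, for beta < 0 the forward difference.

def upwind_term(beta, u_left, u_center, u_right, h):
    if beta > 0:
        return beta * (u_center - u_left) / h    # backward difference
    return beta * (u_right - u_center) / h       # forward difference

# for u(x) = x both branches reproduce beta * u'(x) = beta exactly
print(upwind_term(2.0, 1.0, 1.5, 2.0, 0.5))    # 2.0
print(upwind_term(-2.0, 1.0, 1.5, 2.0, 0.5))   # -2.0
```

The resulting off-diagonal entries are nonpositive for any h, which is why the M-matrix property no longer depends on the relation of βk and h.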
6.4.9 Theorem: Consider the boundary value problem defined in definition 6.2.1
with β, γ, f ∈ C⁴(a, b) and γ(x) ≥ 0 for all x ∈ [a, b]. Let y ∈ R^{n−1} be the finite
difference approximation for this problem in Example 6.3.7. If there exists a δ < 2
such that max_{x∈[a,b]} |β(x)| ≤ δ/(b − a), then there exists a constant c independent
of h such that

    max_{k=1,...,n−1} |u(xk) − yk| ≤ c h².

For the solution y↑ ∈ R^{n−1} of the upwind finite difference approximation in Exam-
ple 6.4.8 there exists a constant c independent of h such that

    max_{k=1,...,n−1} |u(xk) − yk↑| ≤ c h.
Proof. Let n ∈ N (and thus h > 0) be arbitrary but fixed and let U = (uk)_{k=1}^{n−1} be the
vector containing the values of the exact solution at x1 , . . . , xn−1 . Considering first the
discretisation in Example 6.3.7 and denoting by
τk := (Lh U )k − (Lu)(xk ), k = 1, . . . , n − 1,
with c independent of h. Since β, γ satisfy the assumptions of theorem 6.4.6 (for arbitrary
h > 0), we can conclude that there exists a c0 > 0 independent of h such that
    ‖U − y‖∞ = ‖Lh⁻¹ Lh (U − y)‖∞ ≤ ‖Lh⁻¹‖∞ ‖τ‖∞ ≤ c0 h².
The proof for the upwind discretisation in Example 6.4.8 is identical, but as discussed
does not require any boundedness of β to guarantee stability and due to the use of the
forward/backward difference quotients, the consistency error is only of O(h).
Remark 6.4.10. Finite differences can be generalized to higher order by extending the
stencils by more than one point to the left and right of the current point. Whenever we
add two points to the symmetric difference formulas, we can gain two orders of consistency.
[Diagram: nested symmetric stencils — the base stencils give u′ + O(h²) and u′′ + O(h²);
widening each by two points gives u′ + O(h⁴) and u′′ + O(h⁴), and once more
u′ + O(h⁶) and u′′ + O(h⁶).]
Similarly, we can define one-sided difference formulas, which get us close to multistep
methods. The matrices generated by these formulas are not M-matrices anymore, although
you can show for the 4th order formula for the second derivative that it yields a product
of two M-matrices. While this rescues the theory in a particular instance, M-matrices do
not provide a theoretical framework for general high order finite differences anymore.
Very much like the starting procedures for high order multistep methods, high order finite
differences can lead to difficulties at the boundaries. Here, the formulas must be truncated
and for instance be replaced by one-sided formulas of equal order.
All these issues motivate the study of different discretization methods in the next course.
Chapter 7
Finite difference methods for two-point boundary value problems have a natural extension
to higher dimensions. There, we deal with partial derivatives ∂/∂x1 , ∂/∂x2 , ∂/∂x3 and ∂/∂t.
As an outlook towards topics in the numerical analysis of partial differential equations, we
close these notes with a short introduction by means of some examples.
7.1.1 Definition: The Laplacian in two (three) space dimensions is the sum of the
second partial derivatives

    ∆u = ∂²u/∂x1² + ∂²u/∂x2² ( + ∂²u/∂x3² ).        (7.1)

The Laplace equation is

    −∆u = 0,        (7.2)

and, given a right hand side f , the Poisson equation is

    −∆u = f.        (7.3)
where ∂Br(x) ⊂ Ω is the sphere of radius r around x and ω(d) is the area of the
unit sphere ∂B1(0) in R^d.
Proof. First, we rescale the problem to

    Φ(r) = 1/(r^{d−1} ω(d)) ∫_{∂Br(x)} u(y) ds = 1/ω(d) ∫_{∂B1(0)} u(x + rz) ds.

Then, it follows by the Gauß theorem for the vector valued function ∇u that

    Φ′(r) = 1/ω(d) ∫_{∂B1(0)} ∇u(x + rz) · z ds_z
          = 1/(r^{d−1} ω(d)) ∫_{∂Br(x)} ∇u(y) · (y − x)/r ds_y
          = 1/(r^{d−1} ω(d)) ∫_{∂Br(x)} ∂u(y)/∂n ds_y
          = 1/(r^{d−1} ω(d)) ∫_{Br(x)} ∆u(y) dy = 0.
u(x0 ) ≥ u(x) ∀x ∈ U,
Proof. Let x0 be such a local maximum and let r > 0 be such that Br (x0 ) ⊂ Ω. Assume
that there is a point x on ∂Br (x0 ), such that u(x) < u(x0 ). Then, this holds for points
y in a neighborhood of x. Thus, in order that the mean value property holds, there must
be a subset of ∂Br (x0 ) where u(y) > u(x0 ), contradicting that x0 is a maximum. Thus,
u(x) = u(x0 ) for all x ∈ Br (x0 ) for all r such that Br (x0 ) ⊂ Ω.
Let now x ∈ Ω be arbitrary. Then, there is a (compact) path from x0 to x in Ω. Thus, the
path can be covered by a finite set of overlapping balls inside Ω, and the argument above
can be used iteratively to conclude u(x) = u(x0 ).
Corollary 7.1.4. Let u ∈ C 2 (Ω) be a solution to the Laplace equation. Then, its maximum
and its minimum lie on the boundary, that is, there are points x, x ∈ ∂Ω, such that
Proof. If the maximum of u is attained in an interior point, the maximum principle yields
a constant solution and the theorem holds trivially. On the other hand, theorem 7.1.3 does
not make any prediction on points at the boundary, which therefore can be maxima. The
same holds for the minimum, since −u is also a solution to the Laplace equation.
Corollary 7.1.5. The Poisson equation with homogeneous boundary condition, u ≡ 0 on
∂Ω, has a unique solution.
Proof. Assume there are two functions u, v ∈ C 2 (Ω) with u = v = 0 on ∂Ω such that
−∆u = −∆v = f.
Then, w = u − v solves the Laplace equation with w = 0 on ∂Ω. Due to the maximum
principle, w ≡ 0 and u = v.
[Figure: the square (a, b)² and the cube (a, b)³ with labeled corner points.]
As for two-point boundary value problems, we can reduce our considerations to homoge-
neous boundary conditions uB ≡ 0 by changing the right hand side in the Poisson equation.
[Figure: Cartesian grids over the square and the cube, formed by the intersection
points of axis-parallel lines (planes).]
The grid is called uniform, if all lines (planes) are at equal distances.
For the remainder of this discussion let us restrict ourselves to the two-dimensional case,
d = 2, and to uniform Cartesian grids.
7.2.4 Definition: The vector y of discrete values is defined in grid points which
run in x1 - and x2 -direction. In order to obtain a single index for every entry of this
vector in linear algebra, we use lexicographic numbering.
[Figure: lexicographic numbering — the bottom row of grid points is numbered
1, 2, . . . , n − 1, the next row n, . . . , 2(n − 1), and so on, up to
(n − 1)(n − 2) + 1, . . . , (n − 1)² in the top row.]
7.2.5 Definition: The 5-point stencil consists of the sum of a 3-point stencil in
x1 - and a 3-point stencil in x2 -direction. Its graphical representation is

         1
     1  -4   1
         1

For a generic row of the linear system, where the associated point is not neighboring
the boundary, this leads to
Example 7.2.6. The matrix Lh obtained for the Laplacian on Ω = [0, 1]² using the 5-
point stencil on a uniform Cartesian mesh of mesh spacing h = 1/n with lexicographic
numbering is in R^{N×N} with N = (n − 1)² and has the structure

             ( D −I            )          (  4 −1           )
             ( −I  D −I        )          ( −1  4 −1        )
    Lh = n²  (    ...  ... ... ) ,    D = (    ... ... ...  )  ∈ R^{(n−1)×(n−1)} .
             (       −I  D −I  )          (      −1  4 −1   )
             (           −I  D )          (         −1  4   )
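This block matrix can be assembled compactly via Kronecker products, Lh = I⊗T + T⊗I with T the 1D 3-point stencil matrix. This is an equivalent construction, not the notes' notation, and the test problem u(x1, x2) = sin(πx1) sin(πx2) with −∆u = 2π²u is my own choice:

```python
import numpy as np

# Sketch: assemble the 5-point-stencil matrix of Example 7.2.6 via
# Kronecker products and solve a Poisson problem with known solution.

n = 8
m = n - 1                                      # interior points per direction
T = n**2 * (2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1))
Lh = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

x = np.arange(1, n) / n
X1, X2 = np.meshgrid(x, x, indexing="ij")      # lexicographic via ravel()
u_exact = np.sin(np.pi * X1) * np.sin(np.pi * X2)
f = 2 * np.pi**2 * u_exact                     # -Laplace(u_exact)

y = np.linalg.solve(Lh, f.ravel())
err = np.max(np.abs(y - u_exact.ravel()))      # O(h^2)
```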
7.2.7 Theorem: The matrix Lh obtained by discretising the Laplace operator via
the 5-point stencil formula is an M-matrix and the solution of the discrete problem

    Lh y = f

is stable, i.e.

    ‖Lh⁻¹‖∞ ≤ c.

Proof. The proof is identical to the proof for 2-point boundary value problems. To show
boundedness of ‖Lh⁻¹‖∞ we can use in a similar way the function
7.2.8 Theorem: The finite difference approximation in Example 7.2.6 for the Pois-
son equation in (7.3) on the unit square Ω = [0, 1]2 with homogeneous Dirichlet
conditions is convergent of second order, i.e.
Proof. We apply the consistency bound in (6.13) in the x1 - and x2 -direction separately,
obtaining
and deduce the second-order consistency of the 5-point stencil by the triangle inequality.
The remainder of the proof is identical to the proof of theorem 6.4.9.
7.2.9 Theorem: Let y be the solution to the finite difference method for the Laplace
equation with the 5-point stencil. Then, the maximum principle holds for y, namely,
if there is k ∈ {1, . . . , (n − 1)²} such that yk ≥ yj for all j ≠ k and yk ≥ yB for any
boundary value yB , then y is constant.
Proof. From equation (7.6), it is clear that a discrete mean value property holds, that is,
yk is the mean value of its four neighbors. Therefore, if yk ≥ yj , for all neighboring indices
j of k, we have yj = yk . We conclude by following a path through the grid points.
7.3 Evolution equations
After an excursion to second order differential equations depending on more than one
spatial variable, we are now returning to problems depending on time. But this time, on time
and space. As for the nomenclature, we have encountered ordinary differential equations
as equations or systems depending on a time variable only, then partial differential equa-
tions (PDE) with several, typically spatial, independent variables. While the problems
considered here are covered by the definition of PDE, time and space are fundamentally
different. Therefore, we introduce the concept of evolution equations.
While the problems in definition 7.1.1 are PDEs of elliptic type, the following problems
can be either parabolic or hyperbolic.
Example 7.3.2. Consider the case of one spatial variable, i.e. Ω = [a, b] ⊂ R, and the
differential operator L as defined in (6.4), i.e. a general, linear second order differential
operator with respect to the spatial variable x, for simplicity with β = β(x) and γ = γ(x)
independent of t. Furthermore, let u(t, a) = u(t, b) = 0.
We can now discretise the right hand side of (7.7), for every fixed t ≥ 0 on a spatial grid
x0 , . . . , xn , as in Example 6.3.7, to obtain a system of ODEs
    y′(t) = Lh y(t)
for the unknown (semi-discrete) vector y(t) ∈ Rn−1 of approximations to the solution u(t, ·)
of (7.7) at time t. By choosing as the initial condition
yk (0) = u0 (xk ), k = 1, . . . , n − 1
we obtain an autonomous linear IVP for y : [0, T ] → Rn−1 that we can now solve with our
favourite time stepping method.
For γk ≥ 0 and |βk| sufficiently small, the eigenvalues of Lh have negative real part and
vary strongly in size, e.g. for β = γ = 0 we have λ1 = −4n² sin²(π/(2n)) ≈ −π² and
λn−1 = −4n² sin²((n − 1)π/(2n)) ≈ −4n². Thus, the problem is stiff, especially for large n,
and we should use a stable time stepping method.
From theorem 6.4.9 we know that the spatial discretisation is of second order. Thus, a
common time stepping method to use is the Crank-Nicolson method, which is the second
order A-stable LMM with the smallest error constant. To distinguish between spatial grid
points and time steps, choose m ∈ N and let η = T /m be the time step size. We denote
the approximation of y(tj ) at the jth time step tj , j = 1, . . . , m, by Y (j) ∈ Rn−1 . Applying
the Crank-Nicolson method we finally obtain the fully discrete system
    Y(j) = Y(j−1) + (η/2) Lh Y(j) + (η/2) Lh Y(j−1)   ⇔   ( I − (η/2) Lh ) Y(j) = ( I + (η/2) Lh ) Y(j−1)
for the jth time step. Since the real part of the spectrum of Lh is negative, the matrix on
the left hand side is invertible, so that we can uniquely solve this system for any η > 0.
7.3.3 Theorem: Consider the problem in definition 7.3.1 in one space dimension,
i.e. Ω = [a, b] ⊂ R, and the differential operator L as defined in (6.4) with β = β(x)
and γ = γ(x) independent of t. Furthermore, let u(t, a) = u(t, b) = 0. Then,
with central finite difference discretisation of L with mesh width h and applying the
Crank-Nicolson method to discretise in time, as described in Example 7.3.2 with
step size η ≤ h, there exists a constant c > 0 independent of h such that
    max_{j=0,...,m} max_{k=0,...,n} | Yk(j) − u(tj , xk) | ≤ c h².
Appendix A
Appendix
For a first order differential equation, Lipschitz continuity of f is a sufficient but not, as
one might think, a necessary condition for uniqueness of the solution. The following
theorem and proof show that it is indeed possible to have uniqueness without assuming
Lipschitz continuity.
    u′(t) = f ( u(t) ),        (A.1a)
    u(t0) = u0 .               (A.1b)

    1 = ϕ′(t)/f(ϕ(t)) = ψ′(t)/f(ψ(t))   for all t ∈ I.        (A.2)
Also, for all t ∈ I, it follows from (A.2) that

    F(ϕ(t)) = ∫_{t0}^{t} ϕ′(s)/f(ϕ(s)) ds = ∫_{t0}^{t} ψ′(s)/f(ψ(s)) ds = F(ψ(t)).
Thus, since F is injective, we have ϕ(t) = ψ(t) for all t ∈ I. In conclusion, the IVP (A.1)
has a unique solution.
Lemma A.2.2. The power series (A.3) converges for each matrix A. It is therefore valid
to write

    e^A = lim_{m→∞} Σ_{k=0}^{m} A^k/k! = Σ_{k=0}^{∞} A^k/k! .        (A.4)

Proof. Let ‖·‖ be a submultiplicative matrix norm on R^{d×d}. We want to show that the
sequence of partial sums (Sn)_{n∈N0} with Sn = Σ_{k=0}^{n} A^k/k! converges to
S := e^A = lim_{m→∞} Σ_{k=0}^{m} A^k/k! . Consider therefore

    ‖S − Sn‖ = ‖ lim_{m→∞} Σ_{k=n+1}^{m} A^k/k! ‖ = lim_{m→∞} ‖ Σ_{k=n+1}^{m} A^k/k! ‖ .        (A.5)

Using the triangle inequality and the fact that ‖·‖ is submultiplicative yields

    lim_{m→∞} ‖ Σ_{k=n+1}^{m} A^k/k! ‖ ≤ lim_{m→∞} Σ_{k=n+1}^{m} ‖A‖^k/k! .        (A.6)
Lemma A.2.3 (Properties of the matrix exponential). The following relations hold true:

    e^0 = I,                                                                  (A.7)
    e^{αA} e^{βA} = e^{(α+β)A}   ∀A ∈ R^{d×d}, ∀α, β ∈ R,                     (A.8)
    e^A e^{−A} = I               ∀A ∈ R^{d×d},                                (A.9)
    e^{T⁻¹AT} = T⁻¹ e^A T        ∀A ∈ R^{d×d}, T ∈ R^{d×d} invertible,        (A.10)
    e^{diag(λ1,...,λd)} = diag(e^{λ1}, . . . , e^{λd})   ∀λi ∈ R, i = 1, . . . , d.   (A.11)

Moreover, e^A is invertible for arbitrary quadratic matrices A with (e^A)⁻¹ = e^{−A}.
Proof. The equality (A.7) follows directly from the definition.
Then
giving us an IVP for ϕ(α) with unique solution ϕ(α) = eαA ϕ(0) = 0, and the identity in
(A.8) follows.
Equation (A.9) is a special case of (A.8) with parameters α = 1 and β = −1, which in
combination with (A.7) leads to the result.
For (A.10) note that R^{d×d} forms a ring and is thus associative. Then, for k ∈ N0, we have

    (T⁻¹AT)^k = (T⁻¹AT)(T⁻¹AT) · · · (T⁻¹AT)(T⁻¹AT)
              = T⁻¹A(T T⁻¹)A(T · · · T⁻¹)A(T T⁻¹)AT = T⁻¹ A^k T

and thus

    e^{T⁻¹AT} = Σ_{k=0}^{∞} (T⁻¹AT)^k/k! = Σ_{k=0}^{∞} T⁻¹ (A^k/k!) T
              = T⁻¹ ( Σ_{k=0}^{∞} A^k/k! ) T = T⁻¹ e^A T.

Here, we have used the absolute convergence of the series and that these matrices are
elements of the ring R^{d×d}.
As the matrix exponential of a diagonal matrix is simply a diagonal matrix with the
exponential of the entries, we diagonalize A.
A = Ψ−1 DΨ.
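This diagonalization route can be sketched numerically and compared against a truncated power series (A.4). The example matrix is my own choice; note that numpy's `eig` returns A = V diag(w) V⁻¹, i.e. Ψ = V⁻¹ in the notation above:

```python
import numpy as np

# Sketch: exp(A) via diagonalization, using (A.10) and (A.11), compared
# with a truncated partial sum of the series (A.4).

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])                   # eigenvalues -1 and -2

w, V = np.linalg.eig(A)
expA_diag = V @ np.diag(np.exp(w)) @ np.linalg.inv(V)

expA_series = np.zeros((2, 2))
term = np.eye(2)                               # A^0 / 0!
for k in range(1, 40):
    expA_series += term
    term = term @ A / k                        # A^k / k!
```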
Proof. Let x0 ∈ Ω and define xk+1 = f (xk ). First, we prove existence using the Cauchy
criterion. Let k, n ∈ N0 and consider
Iteratively, we get
Since γ ∈ (0, 1) we immediately obtain |x∗ − y ∗ | = 0. Using that |a| = 0 if and only if
a = 0 yields y ∗ = x∗ . This concludes the proof.
A.4 The implicit and explicit Euler-method
Clearly, the explicit Euler is a rather easy calculation since all one needs are f , h and y0 .
The implicit Euler is more difficult to compute since for calculating y1 we need the value of
f at y1 . The goal of this section is to visualize and give an intuition for the two algorithms.
[Figure: one step of the explicit Euler method (left, using the slope u′(t0)) and of the
implicit Euler method (right, using the slope u′(t1)), showing y and u over t up to t1 .]

For the explicit Euler we take u0 and u′(t0): y1 , our approximated solution for u1 , is
chosen as the intersection point of t1 and the affine function g(t) = y0 + t · u′(t0). For
the implicit Euler we go backwards: on the t1-axis we are looking for an affine function g
that fulfills g(0) = u0 and g′(t1) = f (t1). Then we set y1 = g(t1).
The BDF formulae use the approximations of the solution at the previous time steps
tk − sh, . . . , tk − h and the unknown value yk at tk that we would like to determine. With
the Lagrange polynomials given by

    Li(t) = Π_{j=0, j≠i}^{s} (t − tj)/(ti − tj)

we let y(t) = Σ_{j=0}^{s} yk−j Ls−j(t).
Then, we will assume that y solves the IVP in the point tk and obtain a linear system from
which we derive the desired value yk .
We now aim to derive the scheme for BDF(2): Let the points tk − 2h, tk − h and tk be
given.

[Figure: the three interpolation points tk − 2h, tk − h and tk on the t-axis.]
The derivatives of the Lagrange polynomials are

    L0′(t) = (2t − 2tk + h)/(2h²),   L1′(t) = −(2t − 2tk + 2h)/h²,   L2′(t) = (2t − 2tk + 3h)/(2h²);

evaluation at t = tk yields

    fk = (1/(2h)) yk−2 − (2/h) yk−1 + (3/(2h)) yk .

The final BDF(2)-scheme is obtained by multiplication with 2h/3:

    yk − (4/3) yk−1 + (1/3) yk−2 = (2/3) h fk .
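The derived weights can be checked quickly (my own sketch): since the interpolating polynomial through three points of a quadratic is the quadratic itself, the combination p(tk − 2h)/(2h) − 2 p(tk − h)/h + 3 p(tk)/(2h) must reproduce p′(tk) exactly for any quadratic p.

```python
# Sketch: verify the BDF(2) derivative weights 1/(2h), -2/h, 3/(2h)
# on an arbitrary quadratic polynomial.

def weights_apply(p, tk, h):
    return p(tk - 2 * h) / (2 * h) - 2 * p(tk - h) / h + 3 * p(tk) / (2 * h)

p = lambda t: 2 * t**2 - 3 * t + 1
dp = lambda t: 4 * t - 3          # exact derivative
tk, h = 1.0, 0.5

print(weights_apply(p, tk, h), dp(tk))   # both 1.0
```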
Bibliography
[Heu86] H. Heuser. Lehrbuch der Analysis. Teil 2. Teubner, 3rd edition, 1986.
[HW10] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II: Stiff and
Differential-Algebraic Problems, volume 14 of Springer Series in Computational
Mathematics. Springer, Berlin, second edition, 2010.
[NW06] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in
Operations Research and Financial Engineering. Springer, New York, second
edition, 2006.