
Math 361S Lecture Notes

Numerical solution of ODEs


Holden Lee, Jeffrey Wong
April 1, 2020

Contents

1 Overview

2 Basics
   2.1 First-order ODE; Initial value problems
   2.2 Converting a higher-order ODE to a first-order ODE
   2.3 Other types of differential equations
   2.4 Comparison to integration
   2.5 Visualization using vector fields
   2.6 Sensitivity

3 Euler's method
   3.1 Forward Euler's method
   3.2 Convergence
   3.3 Euler's method: error analysis
       3.3.1 Local truncation error
       3.3.2 From local to global error
   3.4 Interpreting the error bound
   3.5 Order
   3.6 Backward Euler's method

4 Consistency, stability, and convergence
   4.1 Consistency
   4.2 Stability
   4.3 General convergence theorem
   4.4 Example of an unstable method

5 Runge-Kutta methods
   5.1 Setup: one step methods
   5.2 A better way: Explicit Runge-Kutta methods
       5.2.1 Explicit midpoint rule (Modified Euler's method)
       5.2.2 Higher-order explicit methods
   5.3 Implicit methods
       5.3.1 Example: deriving the trapezoidal method
   5.4 Summary of Runge-Kutta methods

6 Absolute stability and stiff equations
   6.1 An example of a stiff problem
   6.2 The test equation: Analysis for Euler's method
   6.3 Analysis for a general ODE
   6.4 Stability regions
   6.5 A-stability and L-stability; Stability in practice
   6.6 Absolute stability for systems

7 Adaptive time stepping
   7.1 Using two methods of different order
   7.2 Embedded RK methods
   7.3 Step doubling

8 Multistep methods
   8.1 Adams methods
   8.2 Properties of the Adams methods
   8.3 Other multistep methods
   8.4 Analysis of multistep methods
       8.4.1 Consistency
       8.4.2 Zero stability
       8.4.3 Strongly stable methods
   8.5 Absolute stability

A Nonlinear systems
   A.1 Taylor's theorem
   A.2 Newton's method
   A.3 Application to ODEs

B Boundary value problems

C Difference equations: review

1 Overview
In this section of the course we will derive methods to numerically solve ordinary differential
equations (ODEs), analyze them, and gain experience with using them in practice. We'll
apply what we have learned from interpolation, differentiation, and integration. We will cover
the following topics.

• Basics: We will focus on first-order ODEs in standard form, and the problems we
will consider are initial value problems (IVPs). How can we convert a higher-order
ODE into a first-order ODE? How can we visualize the solution to an ODE?
• Algorithms: We will derive and analyze a variety of algorithms, such as forward and
backward Euler, the family of Runge-Kutta methods, and multistep methods.
• Convergence: What is the relationship between local error and global error? How
can we prove convergence? We will see that there are two ingredients: consistency and
stability. How do we get quantitative estimates?
• Stability: Stability is one of the most important considerations in choosing an ODE
solver. An important analysis is to find the region of stability for a numerical method.
Stability is especially important for “stiff” ODEs.
In practice, we will have to manage trade-offs between accuracy and stability.
• Explicit vs. implicit methods: Numerical methods can be classified as explicit or
implicit. Implicit methods often have better stability properties, but require an extra
step of solving nonlinear equations using, e.g., Newton's method.
• Adaptive methods: Similarly to integration, it is more efficient to vary the step size.
To do this effectively, we need to derive methods with error estimates.
• Using ODE solvers in MATLAB and Python: For example, ode45 is an adaptive
method in MATLAB that is a workhorse for solving ODEs and often "just works."
How do we use it? How can we choose which solver is appropriate for the problem?
What are the tradeoffs?
See sections 7.1-7.3 of Moler’s book or any standard text on ODEs for a review of ODEs.
The bare minimum will be presented here. Essentially no ODE theory is required to solve
ODEs numerically, but the theory does provide important intuition, so it will greatly enhance
your understanding of the numerics.

2 Basics
In this section, you will learn:
• What is the standard form of a first-order ODE? How can we convert a higher-order
ODE into a first-order ODE?
• What is an initial value problem (IVP) vs. a boundary value problem (BVP)?
For IVPs, when do solutions exist and when are they unique?
• How does solving ODEs compare to integration? Integration can be viewed as a special
case of solving an ODE.
• Visualization: Plot vector fields and trajectories.
• Sensitivity: Derive a bound on how solutions with differing initial conditions diverge.

2.1 First-order ODE; Initial value problems

We consider an ODE in the following standard form:

    y'(t) = dy/dt = f(t, y),    t ∈ [a, b],    y(t) ∈ R^d,

where f is a function f : [a, b] × R^d → R^d. Here t is thought of as the independent variable,
which can be time but does not have to be. Time is a useful analogy since it suggests the
direction (forward in time) in which the ODE is to be solved. For simplicity, we will often
consider the 1-D case when y(t) ∈ R (y(t) is scalar)[1], but the theory extends to the general
case. We say that this ODE is first-order because the highest derivative is first-order. Note
the ODE does not have a unique solution until we impose some more conditions.
We will focus on solving initial value problems (IVPs) in the form

    y'(t) = f(t, y),    t ∈ [a, b],    y(t) ∈ R^d,    (2.1a)
    y(a) = y0.    (2.1b)

The equation (2.1a) is the ODE for y(t) and (2.1b) is the initial condition. We seek a
function y(t) that satisfies (2.1a) for all t in the given interval and whose value at a is y0.
The following is a fundamental theorem about existence and uniqueness for ODEs.

Theorem 2.1. If f : [a, b] × R^d → R^d is continuously differentiable, then in a neighborhood
[a, a + ε) around a, the solution to (2.1a)–(2.1b) exists and is unique.

Note that the solution may not exist for all t ∈ [a, b] because the solution may diverge.
An example is y'(t) = y², y(0) = 1/c, which has the solution y(t) = −1/(t − c) for t < c.
For our purposes, we will attempt to construct numerical solutions where the actual
solution exists, so the theory is just there to ensure that the problem to solve is well-defined.
Throughout, we will further assume that f has partial derivatives of all orders required
for any derivation (mostly for Taylor expansions).

2.2 Converting a higher-order ODE to a first-order ODE


If we have an nth-order ODE involving x, x', ..., x^(n), in the form

    x^(n) = F(t, x, x', ..., x^(n−1)),

we can rewrite it as a first-order ODE as follows. First, define

    y1 = x,  y2 = x',  ...,  yn = x^(n−1).

We can rewrite the nth-order ODE as the first-order system

    y1' = y2
      ⋮
    y_{n−1}' = yn
    yn' = F(t, y1, ..., yn).

[1] We will often abuse notation slightly by writing y ∈ R.

 
If x(t) ∈ R^d, then we can collect y1, ..., yn into a vector y(t) = (y1(t), ..., yn(t))^T ∈ R^{nd}.
Then letting

    F(t, y) = (y2, ..., yn, F(t, y1, ..., yn))^T,

we have

    y' = F(t, y).

To specify initial conditions for this problem, we need to specify the values of y1, ..., yn
at a, i.e., we need to specify the values of x(a), x'(a), ..., x^(n−1)(a).

Example 2.2: For example, the ODE governing the motion of a pendulum (without air
resistance, etc.) is

    θ'' = d²θ/dt² = −g sin(θ),

where θ is the angle from the negative vertical axis and g is the gravitational constant.
The initial conditions θ(0) and θ'(0) would give the initial position and velocity. The
corresponding first-order ODE is

    y1' = y2
    y2' = −g sin(y1).

This is in the form (2.1a)–(2.1b) with y(t) = (y1(t), y2(t))^T and f(t, y) = (y2, −g sin(y1))^T.

The fact that we can rewrite higher-order ODEs as first-order ODEs means that it
suffices to derive methods for first-order ODEs. Note that the standard ODE solvers in
MATLAB require you to input a first-order ODE in standard form, so you will need to carry
out this transformation before using them.
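To make the conversion concrete, here is a minimal Python sketch (our own; the function name pendulum_rhs is illustrative) of the pendulum system from Example 2.2 written in standard first-order form, the shape that solvers such as MATLAB's ode45 expect:

```python
import math

def pendulum_rhs(t, y, g=9.81):
    """Standard-form right-hand side for theta'' = -g*sin(theta),
    with y = (y1, y2) = (theta, theta')."""
    y1, y2 = y
    return [y2, -g * math.sin(y1)]

# At the bottom (theta = 0) with angular velocity 1, there is no
# restoring force: theta' = 1 and theta'' = 0.
state = pendulum_rhs(0.0, [0.0, 1.0])
```

Any higher-order ODE is handled the same way: stack the unknown and its derivatives into a vector and return the vector of their derivatives.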

Problem 2.3: Write the ODE for the van der Pol oscillator

    d²x/dt² − µ(1 − x²) dx/dt + x = 0

as a first-order ODE in standard form.

2.3 Other types of differential equations
If conditions are given at more than one point, then the problem is a boundary value
problem (BVP). For an ODE, where the independent variable t is 1-dimensional, this
means that conditions are given on both y(a) and y(b).
One common case of this is that for a second-order ODE, rather than giving the initial
conditions y(a) = y0 and y 0 (a) = y00 , we are given the boundary conditions

y(a) = y0 y(b) = y1 .

Unlike for IVPs, there is no simple existence and uniqueness theorem like Theorem 2.1.
BVPs tend to be more challenging to solve numerically than IVPs, so we will not consider
them here.
Finally, differential equations that involve more than one independent variable (and
derivatives in those variables) are partial differential equations (PDEs).

2.4 Comparison to integration


Integration is a special case of solving anR ODE. To see this, note that by the fundamental
t
theorem of calculus, the integral F (t) = a f (s) ds satisfies the ODE

F 0 (t) = f (t), F (a) = 0.

This is the special case when the function f depends only on a.


We can also make the comparison by looking at the integral form of an ODE. If

    y'(t) = f(t, y),    y(a) = y0,

then y(t) satisfies

    y(t) = y0 + ∫_a^t f(s, y(s)) ds.

Note two key differences with numerical integration:

1. The integrand f (s, y(s)) depends on the function value y(s). This means that any error
in the current function value y(s) propagates. In contrast, for numerical integration,
the errors on the different intervals are independent. This makes getting good error
bounds for ODEs more challenging. Indeed, we will see (Lemma 2.8) that error can
accumulate exponentially.

2. When solving an ODE, we often want the entire trajectory y(t), rather than just the
value y(b).
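The first difference can be seen in code: for F' = f(t), stepping F_{j+1} = F_j + h f(t_j) is exactly the left Riemann sum, so in this special case the errors on different subintervals simply add up. A short Python sketch of our own (not from the notes):

```python
def integrate_left(f, a, b, n):
    """Approximate F(b) for F' = f(t), F(a) = 0 by stepping
    F_{j+1} = F_j + h*f(t_j): the left Riemann sum."""
    h = (b - a) / n
    F = 0.0
    for j in range(n):
        F += h * f(a + j * h)
    return F

# The integral of t^2 over [0, 1] is 1/3; the error is O(h).
err_n = abs(integrate_left(lambda t: t**2, 0.0, 1.0, 100) - 1 / 3)
err_2n = abs(integrate_left(lambda t: t**2, 0.0, 1.0, 200) - 1 / 3)
ratio = err_n / err_2n  # roughly 2: halving h halves the error
```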

2.5 Visualization using vector fields
Slope fields are a good way to visualize the solution to an ODE.
Suppose we are given a scalar ODE (y ∈ R),

    y' = f(t, y).

A solution (t, y(t)) forms a curve in the (t, y) plane. The ODE tells us the direction of the
curve at any given point, since the tangent vector to the curve is parallel to

    (1, y') = (1, f).

In this sense, solutions to the ODE follow the "slope field", which is the vector field

    (1, f(t, y))

in the (t, y) plane. To find a solution to the IVP starting at (t0, y0), we may follow the
slope field to construct the curve; this is the basis of the simplest numerical method, which
is detailed in the next section.
The MATLAB function for plotting vector fields is quiver.[2]

Example 2.4: An example of plotting the vector field for y' = ty on [0, 2] × [0, 4] is
given below. Here, t and y contain the t- and y-coordinates of the grid points, and u, v are
the components of the vector field at those grid points. Make sure that u, v are defined
as componentwise functions applied to t, y (for example, using .* for componentwise
multiplication). The solution y = (1/2) e^{t²/2} is also plotted.

    [t, y] = meshgrid(0:0.2:2, 0:0.5:4);
    u = ones(size(t));
    v = t.*y;

    hold on
    % Set window
    xlim([0, 2.2])
    ylim([-0.25, 4.25])
    % Plot vector field
    quiver(t, y, u, v, 'b')
    % Also plot trajectory
    T = linspace(0, 2);
    Y = 0.5*exp(T.^2/2);
    plot(T, Y, 'r');

Another case we can easily visualize is where y ∈ R² and the ODE is autonomous, that
is, not depending on t:

    y' = f(y),    y ∈ R².

We can instead plot the vector field given by f : R² → R², i.e., with the vector f(y) ∈ R²
attached at the point y ∈ R².

[2] Documentation: https://www.mathworks.com/help/matlab/ref/quiver.html

Example 2.5: We plot the vector field corresponding to Example 2.2 on [−π/2, π/2] ×
[−2, 2]. For simplicity, g = 1 here.

    [x, y] = meshgrid(-pi/2:pi/8:pi/2, -2:0.5:2);
    u = y;
    v = -sin(x);

    xlim([-1.7, 1.7])
    ylim([-2.2, 2.2])
    quiver(x, y, u, v, 'b')

Figure 1: Outputs of Examples 2.4 and 2.5

Problem 2.6: For different values of µ, plot the vector field for the van der Pol
oscillator in Problem 2.3.

2.6 Sensitivity
The slope field gives geometric intuition for some important concepts for numerical methods,
such as the notion of sensitivity.
Sensitivity means: if y(t) gets perturbed by some amount ∆y at time t0 , how far apart
are the original and perturbed trajectories after some time? Put another way, how sensitive
is the ODE to changes in the initial condition?

Figure 2: Sensitivity

Suppose we have two solutions x(t) and y(t) to the same ODE,

    y' = f(t, y),    y(0) = y0,
    x' = f(t, x),    x(0) = x0.

We will use a Lipschitz bound on f to derive a bound on the difference z = y − x.

Definition 2.7: Let f : [a, b] × R → R be a function. We say that f(t, y) is L-Lipschitz in
y (or the Lipschitz constant in y is at most L) if

    |f(t, y1) − f(t, y2)| ≤ L|y1 − y2|    for all t ∈ [a, b] and y1, y2 ∈ R.    (2.2)

When f is differentiable, this is equivalent to

    max_{t∈[a,b], y∈R} |∂f/∂y (t, y)| ≤ L.    (2.3)

To see that (2.3) implies (2.2), note that by the mean value theorem, for some ξ between
y1 and y2,

    f(t, y1) − f(t, y2) = ∂f/∂y (t, ξ) (y1 − y2),

and take the absolute value of both sides.[3]


Going back to the problem, if f(t, y) is L-Lipschitz in y, then the difference z = y − x
satisfies

    |z'(t)| = |f(t, y(t)) − f(t, x(t))| ≤ L|y(t) − x(t)| = L|z(t)|.    (2.4)

To obtain a bound on z(t) given z(0), we use the following useful lemma.

Lemma 2.8 (Grönwall's Lemma). Let u ∈ C¹([0, t]). If u'(s) ≤ Lu(s) for each s ∈ [0, t],
then u(t) ≤ u(0)e^{Lt}.

[3] This proof only works in 1 dimension. For y ∈ R^d, (2.3) becomes max_{t∈[a,b], y∈R^d} ‖∇_y f(t, y)‖ ≤ L.
We can conclude (2.2) from f(t, y1) − f(t, y2) = ∫_0^1 ⟨∇_y f(t, y1 + s(y2 − y1)), y2 − y1⟩ ds.
Proof. We show that u(t)e^{−Lt} is decreasing. This follows from the product rule and the
assumption u' ≤ Lu:

    d/dt (u e^{−Lt}) = u' e^{−Lt} − L u e^{−Lt} = (u' − Lu) e^{−Lt} ≤ 0.

Thus u(t)e^{−Lt} ≤ u(0), giving the conclusion.


Applying Lemma 2.8 to u and −u, we get the following from (2.4):

    |z'| ≤ L|z|  ⟹  |z(t)| ≤ |z(0)| e^{Lt}.

Thus L (the maximum of the variation of f with respect to y) is the exponential rate, at
worst, at which the two solutions can move apart. We have proved the following.

Theorem 2.9 (Sensitivity of ODE). Suppose

    y' = f(t, y),    y(0) = y0,
    x' = f(t, x),    x(0) = x0

are solutions to the same ODE with different initial conditions, where f ∈ C¹([0, t] ×
R). If f(t, y) is L-Lipschitz in y, then

    |x(t) − y(t)| ≤ e^{Lt} |x0 − y0|.

The idea of sensitivity is very useful: The different initial conditions can come from some
“natural” perturbation, but they can also be error that is built up from previous steps of a
numerical algorithm.
However, the bound in Theorem 2.9 is sometimes pessimistic. Taking absolute values
discards information about the sign, so if z' ≈ −Lz then the bound is the same, even though
z then decays exponentially. This is shown in Figure 3.
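As a numerical sanity check of Theorem 2.9 (our own, in Python): for y' = ty on [0, 2] the exact solutions are y(t) = y0 e^{t²/2}, and |∂f/∂y| = |t| ≤ 2, so L = 2 works; the actual gap between two solutions stays below the bound e^{Lt}|x0 − y0|.

```python
import math

L, t = 2.0, 2.0                            # Lipschitz constant on [0, 2]; final time
y0, x0 = 0.10, 0.11                        # two nearby initial conditions
gap = abs(x0 - y0) * math.exp(t**2 / 2)    # actual |x(t) - y(t)| from the exact solutions
bound = math.exp(L * t) * abs(x0 - y0)     # Theorem 2.9 bound e^{Lt} |x0 - y0|
ok = gap <= bound                          # holds, with plenty of slack here
```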

Problem 2.10 (Generalization of Lemma 2.8): Prove the following.

Let u ∈ C¹([0, t]), and suppose a ∈ C([0, t]) is nonnegative. If u'(s) ≤ a(s)u(s) for
each s ∈ [0, t], then u(t) ≤ u(0) e^{∫_0^t a(s) ds}.

3 Euler’s method
In this section we derive and analyze the simplest numerical method for ODEs, (forward)
Euler’s method; we will also briefly consider the backward Euler’s method. Through
analyzing Euler’s method, we introduce general concepts which are useful for understanding
and analyzing all kinds of numerical methods. This includes:

• Local truncation error.


Figure 3: Sketch of the difference in two solutions that start at nearby points (t0 , x0 ) and
(t0 , y0 ) and numerical examples for y 0 = ty and y 0 = −ty.

• Proving convergence (for global error): After getting a bound for the local trunca-
tion error, use induction to give a quantitative bound for the global error.

• The order of a method.


This gives a general framework to analyze numerical methods. The goal is to be able to go
through this analysis for other (better!) numerical methods as well. In the next section, we
will elaborate on this framework.
We summarize our notation in the following box; each item will be explained further as it
is introduced.

Summary of notation in this section:

• t0, ..., tN: the points where the approximate solution is defined

• yj = y(tj): the exact solution at tj

• ỹj: the approximation at tj

• τj: the local truncation error in obtaining ỹj

• h or ∆t: the "step size" tj − tj−1 (if it is constant); otherwise hj or ∆tj

We seek a numerical solution to the scalar IVP

    y' = f(t, y),    y(a) = y0

(with y ∈ R) up to time t = b. The approximation will take the form of values ỹj defined
on a grid

    a = t0 < t1 < · · · < tN = b

Figure 4: Numerical solution of an IVP forward in time from t = a to t = b. The actual
values are y1, ..., yN and the estimated values are ỹ1, ..., ỹN.

such that
ỹj ≈ y(tj ).
For convenience, denote by yj the exact solution at tj and let the “error” at each point be

ej = yj − ỹj .

It will be assumed that we have a free choice of the tj ’s. The situation is sketched in Figure 4.

3.1 Forward Euler's method

Assume for simplicity that the step size (in time)

    h = tj − tj−1

is constant. This is not necessary, but it is convenient.

Suppose that we have the exact value of y(t). To get y(t + h) from y(t), expand in a
Taylor series and use the ODE to simplify the derivatives:

    y(t + h) = y(t) + h y'(t) + O(h²)
             = y(t) + h f(t, y) + O(h²).

This gives, for any point tj, the formula

    y(tj + h) = y(tj) + h f(tj, y(tj)) + τj+1    (3.1)

where τj+1 is the local truncation error defined below. We could derive a formula, but
the important thing is that

    τj+1 = O(h²).

Dropping the error in (3.1) and iterating this formula, we get the (forward) Euler's
method:

    ỹj+1 = ỹj + h f(tj, ỹj).    (3.2)

Algorithm 3.1 (Forward Euler's method): The forward Euler's method for solving
the IVP

    y' = f(t, y),    y(a) = y0

is given by

    ỹj+1 = ỹj + h f(tj, ỹj).

The initial point is, ideally,

    ỹ0 = y0,

since the initial value (at t0) is given. However, in practice, there may be some initial error
in this quantity as well.

Definition 3.2 (Local truncation error, or LTE): The local truncation error τj+1
is the error incurred in obtaining ỹj+1 when the previous data ỹj = yj is known exactly:

    yj+1 = ỹj+1 + τj+1.

In other words, it is the amount by which the exact solution fails to satisfy the
equation given by the numerical method.

The LTE is "local" in the sense that it does not include errors created at previous steps.
For example, for Euler's method,

    ỹj+1 = ỹj + h f(tj, ỹj).

The LTE is the error between this and the true value yj+1 when ỹj = yj:

    yj+1 = ỹj + h f(tj, ỹj) + τj+1
         = yj + h f(tj, yj) + τj+1.

! Notice that the total error is not just the sum of the truncation errors: when
starting the numerical method from y0, at step j the function f is evaluated at the
approximation ỹj, and ỹj ≠ yj. The truncation error propagates through the iteration,
as a careful analysis will show.

3.2 Convergence
Suppose we use Euler's method to generate an approximation

    (tj, ỹj),    j = 0, ..., N,

to the solution y(t) on the interval [a, b] (with t0 = a and tN = b). The "error" in the
approximation that matters in practice is the global error

    E = max_{0≤j≤N} |yj − ỹj| = max_{0≤j≤N} |ej|,

where ej = yj − ỹj is the error at tj. This is a measure of how well the approximation ỹj
agrees with the true solution over the whole interval.[4]

Definition 3.3: The method is convergent if the global error approaches 0 as h approaches
0.

More precisely, we would like to show that, given an interval [a, b], the global error has
the form

    max_{0≤j≤N} |ej| = O(h^p)

for some integer p, the order of the approximation.

As an example, consider

    y' = ty,    y(0) = 0.1,

which has exact solution y(t) = 0.1 e^{t²/2}. Below, we plot some approximations for various
time steps h; on the right is the max. error in the interval. The log-log plot has a slope of 1,
indicating the error should be O(h).

[Figure: left, the exact solution together with approximations for h = 0.4, 0.2, 0.1; right,
log-log plot of the max. error against h.]

[4] Note that this is not precisely true since the approximation is not defined for all t; we would need to
interpolate, and that would have its own error bounds. But in practice we typically consider the error at the
points where the approximation is computed.
The definition of convergent means that the approximation as a piecewise-defined function, e.g. piecewise
linear, converges to y(t) as h → 0. Since the points get arbitrarily close together as h → 0, the distinction
between "max error at the tj's" and "max error as functions" is not of much concern here.
3.3 Euler's method: error analysis
The details of the proof are instructive, as they illustrate how error propagates in the
"worst case". Assume that, as before, we have a fixed step size h = tj − tj−1, points

    a = t0 < t1 < · · · < tN = b,

and seek a solution in [a, b].

We proceed in two steps: bounding the local truncation error, and then tracking the
accumulation of the local errors to bound the global error.

3.3.1 Local truncation error

The bound for the local truncation error is a direct consequence of Taylor's formula.

Lemma 3.4. Suppose tj+1 = tj + h, and the solution y(t) to the scalar IVP

    y' = f(t, y),    y(tj) = yj

satisfies max_{[tj, tj+1]} |y''| ≤ M. Let the truncation error τj+1 for Euler's method be defined by

    y(tj+1) = y(tj) + h f(tj, yj) + τj+1.

Then

    |τj+1| ≤ M h² / 2.

Proof. By Taylor's formula with remainder,

    y(tj+1) = y(tj) + h y'(tj) + (h²/2) y''(ξ)

for some ξ ∈ [tj, tj+1]. Now use the fact that y'(tj) = f(tj, yj) and |y''(ξ)| ≤ M to conclude
the bound.

3.3.2 From local to global error

Given a bound on the local truncation errors, we can obtain a bound on the global error.

Lemma 3.5 (Euler's method: from local to global error). Suppose that f(t, y) is L-Lipschitz
in y, and let τj be the local truncation error at step j. Let ỹj be the result of applying Euler's
method (3.2) starting at (t0 = a, ỹ0). Then

    |ej| ≤ e^{L(tj−t0)} |e0| + [(e^{L(tj−t0)} − 1)/L] · (max_k |τk|)/h,    (3.3)

    max_{0≤j≤N} |ej| ≤ e^{L(b−a)} |e0| + [(e^{L(b−a)} − 1)/L] · (max_k |τk|)/h,    (3.4)

where ej = yj − ỹj is the error at tj.

In particular, if max_j |τj| = O(h²) as h → 0 and e0 = 0 (no error in y0), then

    max_{0≤j≤N} |ej| = O(h)    (3.5)

as h → 0, where the constant in the O(·) depends on the interval size and L but not on h.
Note also that the amplification of the initial error e0 is similar to that in Theorem 2.9.
Combining Lemmas 3.4 and 3.5, we obtain the following.

Theorem 3.6 (Convergence of Euler's method). Suppose:

1. The actual solution y(t) satisfies max_{[a,b]} |y''| ≤ M.

2. f(t, y) is L-Lipschitz in y.

Let ỹj be the result of applying Euler's method (3.2) starting at (t0 = a, ỹ0 = y0).
Then

    |ej| ≤ (Mh/2L) [e^{L(tj−t0)} − 1],    (3.6)

    max_{0≤j≤N} |ej| ≤ (Mh/2L) [e^{L(b−a)} − 1],    (3.7)

where ej = yj − ỹj is the error at tj.

! The global error is O(h) as h → 0, but the O-notation hides a constant that is
exponential in L(b − a) (though independent of h).
As with the bound on sensitivity, this bound can be quite pessimistic.

Proof of Lemma 3.5. To start, recall the definition of the truncation error and the formula
for the method:

    yj+1 = yj + h f(tj, yj) + τj+1,    (3.8)
    ỹj+1 = ỹj + h f(tj, ỹj).    (3.9)

Subtracting (3.9) from (3.8) gives

    ej+1 = ej + h [f(tj, yj) − f(tj, ỹj)] + τj+1.

Because f is L-Lipschitz in y, we have h|f(tj, yj) − f(tj, ỹj)| ≤ hL|yj − ỹj| = Lh|ej|. By
the triangle inequality,

    |ej+1| ≤ (1 + Lh)|ej| + |τj+1|.

Iterating, we get

    |e1| ≤ (1 + Lh)|e0| + |τ1|,
    |e2| ≤ (1 + Lh)²|e0| + (1 + Lh)|τ1| + |τ2|,

and in general

    |ej| ≤ (1 + Lh)^j |e0| + (1 + Lh)^{j−1}|τ1| + · · · + (1 + Lh)|τ_{j−1}| + |τj|
         = (1 + Lh)^j |e0| + Σ_{k=1}^{j} (1 + Lh)^{j−k} |τk|.

Bounding each |τk| by the maximum and evaluating the geometric sum,

    |ej| ≤ (1 + Lh)^j |e0| + (max_k |τk|) · [(1 + Lh)^j − 1]/(Lh).

Now we want to take the maximum over j, so the RHS must be written to be independent
of j. To fix this problem, we use the crude estimate

    1 + Lh ≤ e^{Lh}

to obtain

    |ej| ≤ e^{Ljh} |e0| + (max_k |τk|) · (e^{Ljh} − 1)/(Lh).

Since jh = tj − t0, this is exactly (3.3). But jh = tj − t0 ≤ b − a (with equality when
j = N), so

    |ej| ≤ e^{L(b−a)} |e0| + [(e^{L(b−a)} − 1)/L] · (max_k |τk|)/h.

Taking the maximum over j (note that the RHS is independent of j), we get (3.4).

Proof of Theorem 3.6. By Lemma 3.4, the local truncation errors satisfy |τj| ≤ Mh²/2. Hence
(max_k |τk|)/h ≤ Mh/2. Plug this into Lemma 3.5.

3.4 Interpreting the error bound

A few observations on what the error bound tells us:

• The LTE introduced at each step grows at a rate of e^{Lt} at worst.

• An initial error (in y0) also propagates in the same way.

• The LTE is O(h²) and O(1/h) steps are taken, so the total error is O(h); the propa-
gation does not affect the order of the error on a finite interval.

Note that the factors L and (1 + Lh) are method-dependent; for other methods, the factors
may be other expressions related to L.
Our two examples,

    (a) y' = ty,    (b) y' = −ty,

illustrate the propagation issue. As with actual solutions, the error in a numerical solution
(or the difference between two nearby numerical solutions) can grow like e^{Lt} at worst. Indeed,
for (a), the error grows in this way; the error bound is good here.
However, for (b), the numerical solutions actually converge to the true solution as t
increases; in fact the error behaves more like e^{−Lt}. But the error bound cannot
distinguish between the two cases, so it is pessimistic for (b).
In either case, the global error over [a, b] is O(h) as h → 0.


Figure 5: Numerical solutions to y 0 = ty and y 0 = −ty with different values of N ; note the
behavior of the error as t increases.

3.5 Order
The order p of a numerical method for an ODE is the order of the global error. Euler's
method, for instance, has order 1 since the global error is O(h).

Definition 3.7 (Order): A numerical method with time step h is said to be convergent
with order p if, on an interval [a, b],

    max_{0≤j≤N} |ỹj − y(tj)| = O(h^p) as h → 0.

The same 1/h factor applies to (most) other methods, so as a rule of thumb,

    LTE = O(h^{p+1})  ⟹  global error = O(h^p).

The interpretation here is that to get from a to b we take ~1/h steps, so the error is on
the order of the number of steps times the error at each step, (1/h) · O(h^{p+1}) = O(h^p). The
careful analysis shows that the order is not further worsened by the propagation of the errors.

! Some texts define the LTE with an extra factor of h so that it lines up with the global
error, in which case the rule is that the LTE and global error have the same order.
For this reason it is safest to say that the error is O(h^p) rather than to use the term
"order p", but either is fine in this class.
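The rule of thumb can be checked empirically in a few lines of Python (our own sketch), using y' = ty, y(0) = 0.1 from above: halving h should roughly halve the global error, consistent with order p = 1.

```python
import math

def euler_max_error(n):
    """Max error of forward Euler for y' = t*y, y(0) = 0.1 on [0, 2];
    the exact solution is y(t) = 0.1*exp(t^2/2)."""
    h = 2.0 / n
    t, y, err = 0.0, 0.1, 0.0
    for _ in range(n):
        y += h * t * y                          # Euler step, f(t, y) = t*y
        t += h
        err = max(err, abs(y - 0.1 * math.exp(t**2 / 2)))
    return err

ratio = euler_max_error(200) / euler_max_error(400)  # close to 2^1
```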

3.6 Backward Euler's method

The forward Euler's method is an explicit method: the approximation ỹj+1 can be evaluated
directly in terms of ỹj and other known quantities. In contrast, an implicit method is one
where we are only given a (nonlinear) equation that is satisfied by ỹj+1, and we have to solve
for ỹj+1.
The simplest example of an implicit method is backward Euler, which has the iteration

    ỹj+1 = ỹj + h f(tj+1, ỹj+1).

Note that the function is evaluated at ỹj+1 rather than at ỹj as in the forward Euler's method.
To solve for ỹj+1, we can iterate Newton's method until convergence; ỹj makes for a good
initial guess.
Using a "backward" Taylor expansion around tj+1 rather than around tj, you can prove
that the LTE is also O(h²).
This seems more complicated; why would you want to use the backward method? It
turns out that it has better stability properties; we will come back to this point.

Algorithm 3.8 (Backward Euler's method): The backward Euler's method is given
by

    ỹj+1 = ỹj + h f(tj+1, ỹj+1).    (3.10)

Here, you can solve for ỹj+1 using Newton's method with ỹj as an initial guess.

If Newton's method is used, the code must know f and ∂f/∂y. The function in Matlab
may be written, for instance, in the form
[T,Y] = beuler(f,fy,[a b],y0,h)
where f,fy are both functions of t and y. At step j, we would like to set ỹ_{j+1} equal to the
zero of

g(z) = z − ỹ_j − h f(t_{j+1}, z).

We compute

g′(z) = 1 − h f_y(t_{j+1}, z),

so the Newton iteration is

z_{k+1} = z_k − g(z_k)/g′(z_k).

This is iterated until convergence; then ỹ_{j+1} is set to the resulting z.
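The notes give the Matlab interface; a Python sketch of the same scheme might look as follows (the function name mirrors the Matlab version, but the tolerance and iteration defaults are my own, and the test problem is the stiff IVP used later in these notes):

```python
import math

def beuler(f, fy, a, b, y0, h, tol=1e-12, maxit=20):
    """Backward Euler on [a, b] with step h.  f(t, y) is the right-hand
    side and fy(t, y) = df/dy.  Each step solves
        g(z) = z - y_j - h*f(t_{j+1}, z) = 0
    by Newton's method, starting from the previous value y_j."""
    n = round((b - a) / h)
    T = [a + j * h for j in range(n + 1)]
    Y = [y0]
    for j in range(n):
        tn = T[j + 1]
        z = Y[j]                      # initial guess: previous value
        for _ in range(maxit):
            step = (z - Y[j] - h * f(tn, z)) / (1.0 - h * fy(tn, z))
            z -= step
            if abs(step) < tol:
                break
        Y.append(z)
    return T, Y

# Illustrative stiff IVP (used again in Section 6): y' = -20(y - sin t) + cos t,
# y(0) = 1, with exact solution y(t) = e^{-20t} + sin t.
f = lambda t, y: -20.0 * (y - math.sin(t)) + math.cos(t)
fy = lambda t, y: -20.0
T, Y = beuler(f, fy, 0.0, 3.0, 1.0, 0.05)
print(abs(Y[-1] - math.sin(3.0)))   # small: the transient has long since decayed
```

For this problem g is linear in z, so Newton essentially converges in one iteration; the loop is written for general f.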

Problem 3.9: Prove an analogue of Theorem 3.6 for the backward Euler's method.
You can assume that the numerical solution ỹ_{j+1} obtained from (3.10) is exact.

4 Consistency, stability, and convergence


Convergence is the property any good numerical method for an ODE must have. However,
as the proof shows, establishing it takes some effort (and is harder for more complicated methods). To

better understand it, we must identify the key properties that guarantee convergence and
how they can be controlled.
This strategy has two steps: showing consistency and stability. Both are necessary
for convergence. We’ll look at each of these notions in turn, and give an informal “theorem”
which says that they are sufficient for convergence. Finally, we’ll consider an example of
what can go wrong: a method which seems reasonable at first glance but is disastrously
unstable.
As before, we solve an IVP in an interval [a, b] starting at t0 = a with step size h.

4.1 Consistency
A method is called consistent if

the LTE at each step is o(h) as h → 0.

That is,

lim_{h→0} τ_j/h = 0    for all j.

To check consistency, we may assume the result of the previous step is exact (since
this is how the LTE is defined). This is a benefit, as there is no need to worry about the
accumulation of errors at earlier steps.

Example (Checking consistency): Euler’s method,

ỹ_{j+1} = ỹ_j + h f(t_j, ỹ_j)

is consistent since the truncation error is


τ_{j+1} = y(t_{j+1}) − y(t_j) − h f(t_j, y(t_j)) = (h²/2) y″(ξ_j),

where ξ_j lies between t_j and t_{j+1}. For each j, the error is O(h²) as h → 0. From the
point of view of consistency, j is fixed, so y″(ξ_j) is just some constant; we do not need a
uniform bound on y″ that is true for all j.

4.2 Stability
In contrast, stability refers to the sensitivity of solutions to initial conditions. We derived
a bound on stability in Theorem 2.9.
We would like to have a corresponding notion of “numerical” stability.

Definition 4.1 (Zero stability): Suppose {yn } and {zn } are approximate solutions
to
y 0 = f (t, y), y(t0 ) = y0 (4.1)
in [a, b]. If it holds that

|y_n − z_n| ≤ C|y_0 − z_0| + O(h^p)    (4.2)

where C is independent of n, then the method is called zero stable.

Note that the best we can hope for is C = e^{L(t−t_0)} since the numerical method will never
be more stable than the actual IVP. In what follows, we will try to determine the right
notions of stability for the numerical method.
As written, the stability condition is not easy to check. However, one can derive
easy-to-verify conditions that imply zero stability. We have the following informal result.

Proposition: A “typical” numerical method is zero stable in the sense (4.2) if it
is numerically stable when used to solve the trivial ODE

y′ = 0.

Here “typical” includes any of the methods we consider in class (like Euler’s method)
and covers most methods for ODEs one encounters in practice. “Numerical stability” means
that a perturbation at some step will not cause the solution to blow up. Numerical stability
for y 0 = 0 is much easier to check than the original definition of zero stability.

4.3 General convergence theorem


With some effort, one can show that this notion of stability is exactly the minimum required
for the method to converge.

Theorem 4.2 (Convergence theorem, informal (Dahlquist)). A “typical” numerical
method converges if it is consistent and zero stable. Moreover, it is true that

LTE = O(h^{p+1}) =⇒ global error = O(h^p).

This assertion was proven for Euler’s method directly. Observe that the theorem lets
us verify two simple conditions (easy to prove) to show that a method converges (hard to
prove).

Example: convergence by theorem. The backward Euler method (studied in more
detail later) is

ỹ_{j+1} = ỹ_j + h f(t_{j+1}, ỹ_{j+1}).

Consistency: The local truncation error is defined by

y_{j+1} = y_j + h f(t_j + h, y_{j+1}) + τ_{j+1}.

Expanding around t_j + h and using the ODE we get

y_j = y_{j+1} − y′(t_{j+1}) h + O(h²) = y_{j+1} − h f(t_j + h, y_{j+1}) + O(h²).

Plugging this into the formula yields τ_{j+1} = O(h²), so the method is consistent.

Zero stability: This part is trivial. When the ODE is y′ = 0,

ỹ_{j+1} = ỹ_j,

which is clearly numerically stable.

The theorem then guarantees that the method is convergent, and that the order
of convergence is 1 (the global error is O(h)).

4.4 Example of an unstable method


Obviously, any method we propose should be consistent (i.e. the truncation error is small
enough). As the theorem asserts, consistency is not enough for convergence! A simple
example illustrates what can go wrong when a method is consistent but not stable.
Euler's method can be derived by replacing y′ in the ODE with a forward difference:

(y(t + h) − y(t))/h = y′ = f(t, y).
One might hope, then, that a more accurate method can be obtained by using a second-order
forward difference

y′(t) = (−y(t + 2h) + 4y(t + h) − 3y(t))/(2h) + O(h²).

Plugging this in, we obtain the method

ỹ_{j+2} = 4ỹ_{j+1} − 3ỹ_j − 2h f(t_j, ỹ_j)    (4.3)

which is consistent with an O(h³) LTE. However, this method is not zero stable!
It suffices to show numerical instability for the trivial ODE y′ = 0. The iteration reduces
to

ỹ_{j+2} = 4ỹ_{j+1} − 3ỹ_j.

Plugging in ỹ_j = r^j, we get a solution when

r² − 4r + 3 = 0 =⇒ r = 1, 3,

so the general solution is⁵

ỹ_j = a + b · 3^j.

If initial values are chosen so that ỹ_0 = ỹ_1, then ỹ_j = ỹ_0 for all j with exact arithmetic.
However, if there are any errors (ỹ_0 ≠ ỹ_1) then b ≠ 0 and |ỹ_j| will grow exponentially.
Thus, the method is unstable, and is not convergent.
Obtaining a second order method therefore requires a different approach.
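The blow-up is easy to observe numerically. A short Python experiment with the scheme (4.3), on the illustrative problem y′ = −y seeded with exact starting values (my own choices): the exact data do not lie exactly on the decaying root r = 1 − h + O(h²), so the 3^j mode is excited.

```python
# The consistent-but-unstable two-step scheme
#   y_{j+2} = 4 y_{j+1} - 3 y_j - 2h f(t_j, y_j)
# applied to y' = -y, y(0) = 1, seeded with the exact values y_0 = 1,
# y_1 = e^{-h}.  The small mismatch with the decaying characteristic root
# excites the 3^j mode, so the iterates explode while the true solution decays.
import math

h = 0.01
f = lambda t, y: -y
y = [1.0, math.exp(-h)]       # two starting values for the two-step method
t = 0.0
for j in range(100):
    y.append(4 * y[-1] - 3 * y[-2] - 2 * h * f(t, y[-2]))
    t += h

print(abs(y[-1]))   # enormous, although the exact solution satisfies |y| <= 1
```

Early iterates look fine; the exponential 3^j growth takes over after a few dozen steps.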

5 Runge-Kutta methods
In this section the most popular general-purpose formulas are introduced, which can be
constructed to be of any order.

• How might we derive a higher-order method? Considering the Taylor expansion, a first
idea is to include higher-order derivatives. This leads us to Taylor’s method.

• However, Taylor’s method is inconvenient as it involves derivatives of f . A better


way to get higher-order is to use Runge-Kutta methods, which use evaluations of f
at multiple points to achieve higher-order accuracy. We’ll first consider the second-order
explicit midpoint method or modified Euler method, and then the classic RK-4
method.

• Finally, we’ll look at implicit Runge-Kutta methods, and derive the second-order
trapezoidal method.

5.1 Setup: one step methods


Euler's method is the simplest of the class of one step methods, which are methods that
involve only ỹ_j and intermediate quantities to compute the next value ỹ_{j+1}. In contrast,
multi-step methods use more than the previous point, e.g.

y_{j+1} = y_j + h(a_1 f(t_j, y_j) + a_2 f(t_{j−1}, y_{j−1})),

which will be addressed later.

Definition 5.1 (One step method): A general explicit one step method has the form

ỹ_{j+1} = ỹ_j + h ψ(t_j, ỹ_j)    (5.1)

where ψ is some function we can evaluate at (t_j, y_j). The truncation error is defined
by

y_{j+1} = y_j + h ψ(t_j, y_j) + τ_{j+1}.    (5.2)

⁵See Appendix C for a review of solving these recurrences.

To improve on the accuracy of Euler's method with a one-step method, we may try to
include higher order terms. To start, write (5.2) as

y_{j+1} = y_j + h ψ(t_j, y_j) + τ_{j+1},

and call the left side the "LHS" and the right side the "RHS". For a p-th order method,
we want the LHS to equal the RHS up to O(h^{p+1}). Now expand the LHS in a Taylor
series around t_j:

LHS: y_{j+1} = y(t_j) + h y′(t_j) + (h²/2) y″(t_j) + · · ·

A p-th order formula is therefore obtained by taking

ψ(t_j, y_j) = y′(t_j) + (h/2) y″(t_j) + · · · + (h^{p−1}/p!) y^{(p)}(t_j).

The key point is that the derivatives of y(t) can be expressed in terms of f and its partial
derivatives, which we presumably know. Simply differentiate the ODE y′ = f(t, y(t)) in t,
being careful with the chain rule. If G(t, y) is any function of t and y evaluated on the
solution y(t), then

d/dt (G(t, y(t))) = G_t + G_y y′(t) = G_t + f G_y,

with subscripts denoting partial derivatives and G_t etc. evaluated at (t, y(t)).

It follows that

y′(t) = f(t, y(t)),
y″(t) = f_t + f_y f,
y‴(t) = (f_t + f_y f)′ = f_tt + f_ty f + · · · (see HW).

In operator form,

y^{(p)} = (∂/∂t + f ∂/∂y)^{p−1} f.

Taylor's method: The p-th order one-step formula

y_{j+1} = y_j + h y′(t_j) + (h²/2) y″(t_j) + · · · + (h^p/p!) y^{(p)}(t_j) + O(h^{p+1}).

Note that y′, y″, . . . are replaced by formulas involving f and its partials by repeatedly
differentiating the ODE.

This method is generally not used due to the convenience of the (more or less equiv-
alent) Runge-Kutta methods.
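As a concrete illustration, here is a Python sketch of the second-order Taylor method on an IVP of my own choosing, y′ = t + y, y(0) = 1 (exact solution y = 2e^t − t − 1); differentiating the ODE gives y″ = 1 + y′ = 1 + t + y.

```python
# Second-order Taylor method for the illustrative IVP y' = t + y, y(0) = 1.
# Both derivatives in the formula are computable from the ODE:
#   y'  = t + y,    y'' = 1 + t + y.
import math

def taylor2(a, b, y0, h):
    t, y = a, y0
    for _ in range(round((b - a) / h)):
        yp = t + y               # y'  = f(t, y)
        ypp = 1.0 + t + y        # y'' = f_t + f*f_y
        y = y + h * yp + h**2 / 2.0 * ypp
        t += h
    return y

exact = 2 * math.exp(1.0) - 2.0          # y(1) = 2e - 2
err_h = abs(taylor2(0.0, 1.0, 1.0, 0.1) - exact)
err_h2 = abs(taylor2(0.0, 1.0, 1.0, 0.05) - exact)
print(err_h / err_h2)   # roughly 4, as expected for a second-order method
```

The error ratio under halving h confirms the O(h²) global error.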

5.2 A better way: Explicit Runge-Kutta methods
Taylor’s method is inconvenient because it involves derivatives of f . Ideally, we want a
method that needs to know f (t, y) and nothing more.
The key observation is that the choice of ψ in Taylor’s method is not unique. We can
replace ψ with anything else that has the same order error. The idea of a Runge-
Kutta method is to replace the expression with function evaluations at “intermediate”
points involving computable values starting with f (tj , yj ).

5.2.1 Explicit midpoint rule (Modified Euler’s method)


Let us illustrate this by deriving a second-order one-step method of the form

y_{j+1} = y_j + w_1 h f_1 + w_2 h f_2 + O(h³)

(call the left side the "LHS" and the right side the "RHS"), where

f_1 = f(t_j, y_j),
f_2 = f(t_j + h/2, y_j + h β f_1)

and w_1, w_2, β are constants to be found.

Aside (integration): You may notice that this resembles an integration formula using two
points; this is not a coincidence since
Z tj+1
0
y = f (t, y) =⇒ yj+1 − yj = f (t, y(t)) dt
tj

so we are really estimating the integral of f (t, y(t)) using points at tj and tj+1/2 . The problem
is more complicated than just integrating f (t) because the argument depends on the unknown
y(t), so that also has to be approximated.

To find the coefficients, expand everything in a Taylor series, keeping terms up to order h²:

LHS = y_j + h y′_j + (h²/2) y″_j + O(h³)
    = y_j + h f + (h²/2)(f_t + f f_y) + O(h³),

where f etc. are all evaluated at (t_j, y_j). For the f_i's, we only need to expand f_2 (using
Taylor's Theorem A.1 for multivariate functions):

h f_2 = h f + (h²/2) f_t + h² f_y (β f_1) + O(h³)
      = h f + (h²/2) f_t + β h² f f_y + O(h³).

Plugging this into the RHS gives

RHS = y_j + h(w_1 + w_2) f + (w_2 h²/2) f_t + w_2 β h² f f_y + O(h³),
LHS = y_j + h f + (h²/2)(f_t + f f_y) + O(h³).

Comparing, the LHS and RHS are equal up to O(h³) if

w_1 + w_2 = 1,  w_2 = 1,  w_2 β = 1/2,

which gives

w_1 = 0,  w_2 = 1,  β = 1/2.
We have therefore obtained the formula

yj+1 = yj + hf2 + O(h3 ),

f1 = f (tj , yj ), f2 = f (tj + h/2, yj + hf1 /2).


This gives the explicit midpoint rule or the modified Euler’s method.

Algorithm 5.2 (Explicit midpoint rule, or modified Euler's method): This is a
second-order method given by

ỹ_{j+1} = ỹ_j + h f_2,
f_1 = f(t_j, ỹ_j),  f_2 = f(t_j + h/2, ỹ_j + h f_1/2).

Remark (integration connection): In this case one can interpret the formula as using
the midpoint rule to estimate

y(t_{j+1}) = y(t_j) + ∫_{t_j}^{t_{j+1}} f(t, y(t)) dt ≈ y(t_j) + h f(t_j + h/2, y(t_j + h/2)),

but using Euler's method to estimate the midpoint value:

y(t_j + h/2) ≈ y_j + (h/2) f(t_j, y_j).

In fact, when f = f(t), the method reduces exactly to the composite midpoint rule for
∫ f(t) dt. However, in general, the various intermediate quantities do not have a clear in-
terpretation, and the formulas can appear somewhat mysterious. Deriving methods from
integration formulas is done in a different way (multistep methods).
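A minimal Python sketch of Algorithm 5.2, with an empirical second-order check on an illustrative test problem of my own (y′ = −y):

```python
# Modified Euler (explicit midpoint) on y' = -y, y(0) = 1.
# Halving h should cut the global error by roughly 4.
import math

def midpoint_step(f, t, y, h):
    f1 = f(t, y)
    f2 = f(t + h / 2, y + h * f1 / 2)
    return y + h * f2

def solve(f, a, b, y0, h):
    t, y = a, y0
    for _ in range(round((b - a) / h)):
        y = midpoint_step(f, t, y, h)
        t += h
    return y

f = lambda t, y: -y
exact = math.exp(-1.0)
e1 = abs(solve(f, 0.0, 1.0, 1.0, 0.1) - exact)
e2 = abs(solve(f, 0.0, 1.0, 1.0, 0.05) - exact)
print(e1 / e2)   # close to 4 for a second-order method
```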

5.2.2 Higher-order explicit methods
The modified Euler method belongs in the class of Runge-Kutta methods.

Definition 5.3 (Explicit RK method): A general explicit Runge-Kutta (RK)
method uses m "substeps" or "stages" to get from y_j to y_{j+1} and has the form

f_1 = f(t_j, y_j)
f_2 = f(t_j + c_2 h, y_j + h a_{21} f_1)
f_3 = f(t_j + c_3 h, y_j + h a_{31} f_1 + h a_{32} f_2)
⋮
f_m = f(t_j + c_m h, y_j + h a_{m1} f_1 + · · · + h a_{m,m−1} f_{m−1})
y_{j+1} = y_j + h(w_1 f_1 + · · · + w_m f_m),

where c_i = Σ_{k=1}^{i−1} a_{ik} (in order for the time to correspond to the y-estimates). Each
f_i is an evaluation of f at a y-value obtained as y_j plus a linear combination of the
previous f_k's. The "next" value y_{j+1} is a linear combination of all the f_i's.

• The best possible local truncation error is O(h^{p+1}) where p ≤ m. For each p, the
system is underdetermined and has a family of solutions (see HW for the p = 2 case).

• Unfortunately, it is not always true that p = m is achievable. That is, to get a high
order method - fifth order and above - we need more substeps per iteration than the order.

• Deriving RK methods past third order is quite tedious and a mess of algebra, since
the system for the coefficients is non-linear and the Taylor series expansions become
complicated.

Thankfully, just about every useful set of coefficients - at least for general purpose methods
- has been calculated already, so in practice one can just look them up. They are typically
arranged in the Butcher tableau, defined by

c_1 | a_{11} · · · a_{1m}
 ⋮  |   ⋮    ⋱    ⋮
c_m | a_{m1} · · · a_{mm}
----+----------------
    | w_1   · · · w_m

Note that for an explicit method, the "matrix" A in the table only has nonzeros in the
strictly lower triangular part.
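The tableau is more than bookkeeping: an explicit RK step can be driven directly by the arrays (A, w, c). A generic Python sketch (my own implementation, shown here with the modified Euler tableau from earlier); because A is strictly lower triangular, each stage uses only previously computed stages:

```python
# One generic explicit RK step driven by a Butcher tableau (A, w, c).
def rk_step(f, t, y, h, A, w, c):
    m = len(w)
    stages = []
    for i in range(m):
        # strictly lower triangular A: only stages[0..i-1] are used
        yi = y + h * sum(A[i][k] * stages[k] for k in range(i))
        stages.append(f(t + c[i] * h, yi))
    return y + h * sum(w[i] * stages[i] for i in range(m))

# Modified Euler: c = (0, 1/2), a21 = 1/2, w = (0, 1)
A = [[0.0, 0.0],
     [0.5, 0.0]]
w = [0.0, 1.0]
c = [0.0, 0.5]

y = rk_step(lambda t, y: -y, 0.0, 1.0, 0.1, A, w, c)
print(y)   # one explicit-midpoint step on y' = -y: 1 - h + h^2/2 = 0.905
```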

Algorithm 5.4 (The classical RK-4 method): One four stage method of note is the
classical "RK-4" method

f_1 = f(t_n, y_n)
f_2 = f(t_n + h/2, y_n + (h/2) f_1)
f_3 = f(t_n + h/2, y_n + (h/2) f_2)
f_4 = f(t_n + h, y_n + h f_3)
y_{n+1} = y_n + (h/6)(f_1 + 2f_2 + 2f_3 + f_4).

This method has a good balance of efficiency and accuracy (only four function evalu-
ations per step, and O(h⁵) LTE). The method would be a good first choice for solving
ODEs, except that there is a more popular variant that is better for error estimation,
the Runge-Kutta-Fehlberg method. (The formula is rather hairy so I will not copy it
here; see e.g. http://maths.cnam.fr/IMG/pdf/RungeKuttaFehlbergProof.pdf.)
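A Python sketch of Algorithm 5.4, with an empirical check on an illustrative problem of my own (y′ = −y) that halving h cuts the global error by about 2⁴ = 16:

```python
# Classical RK-4.  With an O(h^5) LTE the global error is O(h^4).
import math

def rk4_solve(f, a, b, y0, h):
    t, y = a, y0
    for _ in range(round((b - a) / h)):
        f1 = f(t, y)
        f2 = f(t + h / 2, y + h / 2 * f1)
        f3 = f(t + h / 2, y + h / 2 * f2)
        f4 = f(t + h, y + h * f3)
        y = y + h / 6 * (f1 + 2 * f2 + 2 * f3 + f4)
        t += h
    return y

f = lambda t, y: -y
exact = math.exp(-1.0)
e1 = abs(rk4_solve(f, 0.0, 1.0, 1.0, 0.1) - exact)
e2 = abs(rk4_solve(f, 0.0, 1.0, 1.0, 0.05) - exact)
print(e1 / e2)   # close to 16 for a fourth-order method
```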

5.3 Implicit methods


The explicit RK methods are nice because we simply iterate to compute successive values,
ending up with ỹ_{j+1}. That is, all quantities in the formula can be computed explicitly.
However, we could also include y_{j+1} in the formula as well. For a one step method, the
formula would take the form

y_{j+1} = y_j + h ψ(t_j, y_j, y_{j+1}) + τ_{j+1}.

We have already seen an example of this, the backward Euler's method.

Nothing is different in the theory; the truncation error is calculated in the same way and
the same remarks on convergence apply. In practice, however, more work is required because
ỹ_{j+1} is defined implicitly in the formula:

ỹ_{j+1} = ỹ_j + h ψ(t_j, ỹ_j, ỹ_{j+1}).

Suppose we have computed values up to ỹ_j. Define

g(z) = z − ỹ_j − h ψ(t_j, ỹ_j, z).

Then ỹ_{j+1} is a root of g(z) (which is computable for any z). Thus ỹ_{j+1} can be computed by
applying Newton's method (ideally) or some other root-finder to g(z).

Practical note: The obvious initial guess is ỹ_j, which is typically close to the root. If h is
small, then ỹ_j is almost guaranteed to be close to the root, and moreover

ỹ_{j+1} → ỹ_j as h → 0.

Thus, if Newton's method fails to converge, h can be reduced to make it work. Since the
initial guess is close, quadratic convergence ensures that the Newton iteration will only take
a few steps to achieve very high accuracy - so each step is only a few times more work than
an equally accurate explicit method.

You may wonder why we would bother with an implicit method when the explicit methods
are more efficient per step; the reason is that they have other desirable properties to be
explored in the next section. For some problems, implicit methods can use much larger h
values than explicit ones.

5.3.1 Example: deriving the trapezoidal method


Here we derive a second-order method that uses f(t_j, y_j) and f(t_{j+1}, y_{j+1}). The formula is

y_{j+1} = y_j + h w_1 f(t_j, y_j) + h w_2 f(t_{j+1}, y_{j+1}) + τ_{j+1}.

First, note that for the RHS, we only need to expand y_{j+1} up to an O(h²) error. Using

y_{j+1} = y_j + h f + O(h²),

we have that

f(t_{j+1}, y_{j+1}) = f(t_j + h, y_j + A)

where A = h f + O(h²). Since A² = O(h²), the result is

RHS = y_j + h w_1 f + h w_2 f(t_j + h, y_j + A)
    = y_j + h w_1 f + h w_2 (f + h f_t + f_y A + O(A²))
    = y_j + h(w_1 + w_2) f + w_2 h² (f_t + f f_y) + O(h³).

Comparing to the LHS,

LHS = y_{j+1} = y_j + h f + (h²/2)(f_t + f f_y) + O(h³),

we find that w_1 = w_2 = 1/2. This gives

y_{j+1} = y_j + (h/2)(f_j + f_{j+1}) + O(h³)

where f_j = f(t_j, y_j) and f_{j+1} = f(t_{j+1}, y_{j+1}).

Algorithm 5.5 (Implicit trapezoidal rule): This is a second-order method given by

ỹ_{j+1} = ỹ_j + (h/2)(f_j + f_{j+1}),
f_j = f(t_j, ỹ_j),  f_{j+1} = f(t_{j+1}, ỹ_{j+1}).

Note that when f = f(t), the formula reduces to the composite trapezoidal rule.
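A Python sketch of Algorithm 5.5 with the Newton solve written out (the test problem y′ = λy and the tolerances are my own; for a linear problem Newton converges in essentially one iteration, but the loop is written for general f):

```python
# Implicit trapezoidal rule; each step solves
#   g(z) = z - y_j - (h/2)*(f(t_j, y_j) + f(t_{j+1}, z)) = 0
# by Newton's method, starting from the previous value.
import math

def trap_solve(f, fy, a, b, y0, h, tol=1e-12, maxit=20):
    t, y = a, y0
    for _ in range(round((b - a) / h)):
        fj = f(t, y)
        z = y                          # initial guess: previous value
        for _ in range(maxit):
            step = (z - y - h / 2 * (fj + f(t + h, z))) / (1.0 - h / 2 * fy(t + h, z))
            z -= step
            if abs(step) < tol:
                break
        y, t = z, t + h
    return y

lam = -2.0
f = lambda t, y: lam * y
fy = lambda t, y: lam
e1 = abs(trap_solve(f, fy, 0.0, 1.0, 1.0, 0.1) - math.exp(lam))
e2 = abs(trap_solve(f, fy, 0.0, 1.0, 1.0, 0.05) - math.exp(lam))
print(e1 / e2)   # close to 4: the method is second order
```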

5.4 Summary of Runge-Kutta methods


A general (possibly implicit) Runge-Kutta method has the following form.

Definition 5.6 (General RK method): A general Runge-Kutta (RK) method
uses m "substeps" or "stages" to get from y_j to y_{j+1} and has the form

f_1 = f(t_j + c_1 h, y_j + h Σ_{k=1}^m a_{1k} f_k)
⋮
f_m = f(t_j + c_m h, y_j + h Σ_{k=1}^m a_{mk} f_k)
y_{j+1} = y_j + h(w_1 f_1 + · · · + w_m f_m),

where c_i = Σ_{k=1}^m a_{ik} (in order for the time to correspond to the y-estimates). Each
f_i is an evaluation of f at a y-value obtained as y_j plus a linear combination of the
f_k's (now possibly all of them, not just the previous ones). The "next" value y_{j+1} is
a linear combination of all the f_i's.

As before, they can be arranged in the tableau

c_1 | a_{11} · · · a_{1m}
 ⋮  |   ⋮    ⋱    ⋮
c_m | a_{m1} · · · a_{mm}
----+----------------
    | w_1   · · · w_m

The formulas we discussed, and several others, are given in Figure 6.

Problem 5.7: Based on the tableau, write out the formulas for the explicit trape-
zoidal and the implicit midpoint rules. Show that they are second-order accurate.

Forward Euler (order 1):
0 | 0
--+--
  | 1

Explicit Midpoint / Modified Euler (order 2):
0   | 0    0
1/2 | 1/2  0
----+---------
    | 0    1

Explicit Trapezoidal (order 2):
0 | 0    0
1 | 1    0
--+---------
  | 1/2  1/2

RK-4 (order 4):
0   | 0    0    0    0
1/2 | 1/2  0    0    0
1/2 | 0    1/2  0    0
1   | 0    0    1    0
----+------------------
    | 1/6  1/3  1/3  1/6

Backward Euler (order 1):
1 | 1
--+--
  | 1

Implicit Midpoint (order 2):
1/2 | 1/2
----+----
    | 1

Implicit Trapezoidal (order 2):
0 | 0    0
1 | 1/2  1/2
--+---------
  | 1/2  1/2

Figure 6: Summary of RK methods

6 Absolute stability and stiff equations


6.1 An example of a stiff problem
Convergence (and zero stability and consistency) are only the bare minimum required for a
method to work. This does not, however, guarantee that the method works well, just that
it works when h is small enough. To motivate the discussion, consider the ODE
y′ = −20(y − sin t) + cos t,  y(0) = 1    (6.1)

for t ∈ [0, 3]. The solution is

y(t) = e^{−20t} + sin t.

At early times (see Figure 7), there is a rapidly decaying transient; then it settles down
and looks like sin t (the "long term" solution). To obtain a reasonable solution for small t
(up to about 1/20), we need to pick h small enough that the approximation captures the
variations in y(t). To get a qualitatively correct solution, h must be small enough to resolve
the solution features.
However, once t is large enough (t ≳ 1/20) the initial transient e^{−20t} disappears, and we
are left with the much smoother part y(t) ≈ sin t. We would like to use larger time steps to
compute the solution.
What happens if Euler's method is used? The situation is shown in Figure 7. Some
observations:

• If h > 1/10, then the numerical solution oscillates and diverges (the amplitude grows
exponentially). If h = 1/10 exactly, the oscillations stay bounded.

• If h < 1/10, the iteration suddenly settles down and behaves correctly.

There is a threshold of h∗ = 1/10; the step size must be less than h∗ to have a qualitatively
correct solution. Note that this is a separate issue from convergence, which only guarantees
that the error is O(h) for sufficiently small h.
The requirement that h < 1/10 is a stability constraint: the step size must be below
a certain threshold to avoid numerical instability. (Even in the absence of truncation error,
a larger step size would cause the solution to diverge.)
The threshold depends on the problem. An ODE that has such a constraint is called
stiff. Our first "practical" definition is as follows:⁶

⁶The definition of stiffness and example are borrowed from Ascher & Petzold, Computer methods for
ordinary differential equations.

Figure 7: Euler’s method applied to the stiff IVP (6.1) and the slope field (blue lines).

Definition 6.1 (Stiffness, definition I): An IVP in some interval [a, b] is called stiff
if Euler’s method requires a much smaller time step h to be stable than is needed to
represent the solution accurately.

We will see shortly that "Euler's method" can be replaced by "typical explicit method".
Note that this is a practical definition in that the meaning of "accurate" and "much
smaller" depends on what is needed. The ODE from the example is not stiff in [0, 0.01]
by this definition, since accuracy alone already requires h < 1/10 there, but it is stiff in
[0, 3]. On the interval [0, 3], to represent the solution accurately means representing the
sin t part accurately, and we know that it is possible to obtain the curve sin t with much
larger step size (e.g., if the −20(y − sin t) part were removed from the ODE, and we just
had y′ = cos t, we could use a fairly large step size and still obtain y ≈ sin t).
At present, the "stability" constraint remains mysterious; we will develop some theory
shortly to explain it.

A geometric interpretation
Nearby solutions with some other initial condition rapidly converge to y(t) (if the solution is
perturbed, it will quickly return). The numerical method should behave the same way, but
does not always do so.
The figure (Figure 7) is revealing here, recalling that Euler's method follows the vector
field. Observe that if h is not small enough, Euler's method will overshoot due to the large
slope. This process will continue, causing aggressive oscillations that will diverge. If h is at
a certain value, then the oscillations will stay bounded.
Below this value, the overshoot is not severe and the method provides a reasonable
approximation.

Observation: An IVP with solution y(t) (in an interval) is stiff if y(t) is well-behaved
but nearby trajectories to y(t) vary rapidly compared to y(t) itself.

This is not the same as instability of the IVP itself; the solution here is in fact quite stable.
Indeed, all solution curves will approach sin t as t → ∞.

6.2 The test equation: Analysis for Euler’s method


A simple analysis explains the phenomenon. Define the "test equation"

y′ = λy,  y(0) = y_0    (6.2)

where λ is a (complex) number. The solution is just y(t) = y_0 e^{λt}, so

• If Re(λ) < 0 then y(t) → 0 as t → ∞.

• If Re(λ) > 0 then y(t) grows exponentially.

A good numerical method should have the property that

Re(λ) < 0 =⇒ ỹ_j → 0 as j → ∞.

That is, if the true solution decays, the approximation should also decay (exponentially). In
particular, it should not grow exponentially when the true solution does not.

What does the numerical method do? For Euler's method⁷, we have

y_{n+1} = y_n + h f(t_n, y_n) = (1 + hλ) y_n.

The test equation is simple enough that the iterates have an exact formula:

y_n = (1 + hλ)^n y_0.

Now define a region R of the complex plane as the set of hλ for which the iteration has
|y_n| → 0 as n → ∞ (i.e. the set of hλ such that Euler's method applied to (6.2) gives a
sequence that converges to zero). Then

R = {z ∈ C : |1 + z| < 1}.

This is a disk of radius 1 centered at z = −1. In particular, for real λ < 0, |y_n| → 0 if

h < 2/|λ|.    (6.3)

Moreover, if h > 2/|λ| then |y_n| will increase exponentially. This means that in order to
have y_n decay exponentially when the exact solution does, the condition (6.3) must hold.
Otherwise, y_n will grow exponentially even though it should go to zero.

Note that λ > 0 is not as much of a concern here; if λ > 0 then 1 + hλ > 1 for
any positive h, so y_n will grow exponentially, as desired.
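The threshold h < 2/|λ| is easy to observe numerically. A short Python sketch on the test equation with λ = −20 (so the bound is h < 0.1; the particular step sizes below are my own choices):

```python
# Euler's method on y' = lam*y.  The stability bound h < 2/|lam| = 0.1
# separates decay from exponential blow-up.
def euler_final(lam, h, n):
    y = 1.0
    for _ in range(n):
        y = (1 + h * lam) * y
    return abs(y)

lam = -20.0
print(euler_final(lam, 0.09, 200))   # h < 0.1: |1 + h*lam| = 0.8, decays
print(euler_final(lam, 0.15, 200))   # h > 0.1: |1 + h*lam| = 2, blows up
```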

6.3 Analysis for a general ODE


For the first-order ODE

y′ = f(t, y)

the behavior on the test equation tells us how the numerical method will behave locally.
Consider a point (t_0, y_0), the solution y(t) and a nearby solution

w = y + εu,

where εu is a small perturbation (so w starts at w_0 = y_0 + εu_0). Plugging into w′ = f(t, w),

y′ + εu′ = f(t, y + εu) = f(t, y) + ε (∂f/∂y) u + O(ε²),
⁷For convenience, the tildes have been dropped.

Figure 8: Sketches of the stability regions (shaded area) for Euler’s method (left) and Back-
ward Euler (right) in the complex plane.

so the perturbation u evolves according to

u′ = (∂f/∂y) u + O(ε) ≈ ((∂f/∂y)(t_0, y_0)) u

near (t_0, y_0). That is, for nearby trajectories w(t) to the solution y(t), the difference will
grow/decay exponentially at a rate ∂f/∂y.

Key point: Locally, for a perturbed solution w(t) to the ODE y 0 = f (t, y), the
difference will evolve like the test equation u0 = λu with λ = ∂f /∂y. This determines
the local stability of the numerical method.

Indeed, this matches the observed behavior from the motivating example, where ∂f/∂y = −20.
Solutions that are nearby to y(t) want to be pushed onto y(t) exponentially fast (with rate
20) as shown by the vector field in Figure 7. The numerical method must therefore behave
well on the test equation y′ = −20y in order to be stable.
The test equation analysis then yields a new, more precise identification of stiffness:

Stiffness (definition II): A first order ODE is stiff if |∂f/∂y| is large, in the sense that
1/|∂f/∂y| is much less than the typical width of a change in y(t).

Geometrically, this typically means that nearby trajectories to y(t) converge
rapidly to y(t) compared to the variation in y(t) itself.

6.4 Stability regions


Now consider a method that produces a sequence yn of approximations. The stability con-
straint turns out to depend only on the location of the product hλ.

Definition 6.2 (stability region): The region of absolute stability R is the set of
complex numbers z ∈ C with z = hλ such that for the test equation

y′ = λy

the approximation {y_n} with a step size h converges to zero as n → ∞.

The interval of absolute stability is the real portion of R (that is, the
same definition but with λ real).

Note: Typically, the part of R that matters is the region where Re(λ) < 0.

We found that the region of absolute stability for Euler’s method is

R = {z : |1 + z| < 1}

and its interval of absolute stability is (−2, 0).


For the backward Euler method

y_{n+1} = y_n + h f(t_{n+1}, y_{n+1})

we get

y_{n+1} = y_n / (1 − hλ),

so |y_n| → 0 when

|1 − hλ| > 1.

The region of absolute stability is then the set

R = {z ∈ C : |1 − z| > 1},

which is the complement of a disk centered at z = 1 of radius 1 (see Figure 8). In
particular, R contains the entire half plane {Re(z) < 0}, which means that if Re λ < 0 then
|y_n| → 0 for any positive value of h.
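This can be checked numerically: for λ < 0, the backward Euler amplification factor 1/|1 − hλ| is below 1 for any h > 0. A short Python sketch (the specific λ and step sizes are my own illustrative choices):

```python
# Backward Euler on y' = lam*y: y_{n+1} = y_n / (1 - h*lam).
# For lam < 0 the iterates decay for ANY positive h.
def beuler_final(lam, h, n):
    y = 1.0
    for _ in range(n):
        y = y / (1 - h * lam)
    return abs(y)

lam = -20.0
for h in (0.1, 1.0, 10.0):      # even absurdly large steps are stable
    print(beuler_final(lam, h, 50))
```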

6.5 A-stability and L-stability; Stability in practice


So, what does this analysis say about what methods to use for stiff equations? Observe that
if R contains all points with Re(z) < 0 then there is no stability constraint, because hλ ∈ R
for all λ’s with negative real part.
If the interval contains (−∞, 0), then hλ ∈ R for all real λ’s, which is often still good
enough. Such a method can be applied to a stiff equation without any constraint on h for
stability: it will always be true that the approximation stays bounded when it should.


Figure 9: Backward Euler to the stiff IVP (6.4). The approximation stays bounded for all
h, although for the initial transient, h must be taken smaller (h ∼ 1/200) to be accurate.

Definition 6.3: A method for which R contains all of {Re(z) < 0} is called A-
stable. For such a method, there is no stability condition required when integrating
a stiff equation.

For example, the backward Euler method is A-stable. Indeed, if it is used on the example
problem, the approximation will always be reasonable, even when h > 1/10. Below, Euler’s
method and Backward Euler are used to solve

y 0 = −100(y − sin t) + cos t, y(0) = 1. (6.4)

Backwards Euler does fine in the stiff interval, but Euler’s method would require h < 1/50:
the stiffness imposes a severe constraint here and there is a clear winner away from the initial
transient.
The trapezoidal method is also A-stable (see homework). However, there are a few points
worth noting here:

• Not all implicit methods are A-stable, or even have intervals of absolute stability con-
taining (−∞, 0). Only certain classes of implicit methods are good for stiff problems.

• All explicit methods have finite intervals of absolute stability (−a, 0), and the value of
a is typically not that large. The regions R for RK methods up to order 4 are shown
in Figure 10. The stability constraint for RK-4 is more or less the same as for Euler's
method.

There is much more to the story than just absolute stability. Many other notions of
stability exist, describing other constraints and finer properties of numerical solutions. For

Figure 10: Stability regions (the area inside the curve) for RK methods up to order 4. Note
that all RK methods of the same order have the same stability region.

example, we may want it to be true that

the rate at which |y_n| → 0 increases as Re(λ) → −∞.

For instance, if λ = −1000 we would want the approximation to decay to zero much faster
than if λ = −10. Backward Euler has this property (called L-stability or "stiff decay") since

|y_{n+1}| = |y_n| / |1 − hλ|,

so as Re λ → −∞, the coefficient goes to zero in magnitude. The property is nice because it
means that fast-decaying transients are damped out quickly by the method.

6.6 Absolute stability for systems


For systems of ODEs, the analysis for absolute stability must be modified slightly. The
definition of the region of absolute stability is the same as before; we apply the method to
the scalar test equation.
In principle, one might want to consider the linear system

y′ = Ay

instead. However, if A is diagonalizable, then A = V Λ V^{−1} where Λ is the diagonal matrix
of eigenvalues, and setting x = V^{−1} y we get the decoupled equations

x′_i = λ_i x_i

for each component of x. Thus it suffices to know how the method behaves on the scalar
test equation to know how it behaves on the “system” version.
For a general system

y′ = F(t, y),

and a perturbation w = y + u near (t_0, y_0), the difference u evolves according to

u′ ≈ (DF(t_0, y_0)) u,

where DF is the Jacobian of F (with respect to y). Letting

λ* = the eigenvalue of DF with most negative real part,

it follows that the stability constraint should be

hλ* ∈ R.

That is, the numerical stability constraint is determined by the component that decays
fastest (the “most stiff” component). As a trivial example,

x′ = −5x,
y′ = −100y,

using Euler’s method has the stability constraint −100h ∈ (−2, 0) =⇒ h < 1/50 because
the y-component is stiff. Note that in general, “component” means in the eigenvector basis of
DF, so it cannot typically be seen by just looking at the equations in the system individually.
The "stiffness ratio" is the ratio of the largest/smallest real parts (in absolute value) of the
eigenvalues of the Jacobian; if this ratio is large, the system is typically stiff, since it has some
components that change slowly and some that change very fast (e.g. the ratio is 20 in the example above).
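A small Python experiment with the trivial system above shows the fastest component setting the constraint (the specific step sizes below are my own illustrative choices on either side of h = 1/50):

```python
# For x' = -5x, y' = -100y, the Jacobian eigenvalues are -5 and -100, so
# Euler needs h*(-100) in (-2, 0), i.e. h < 1/50, even though the
# x-component alone would allow h < 2/5.
def euler_system(h, n):
    x, y = 1.0, 1.0
    for _ in range(n):
        x, y = x + h * (-5 * x), y + h * (-100 * y)
    return abs(x), abs(y)

x_ok, y_ok = euler_system(0.015, 400)    # h < 1/50: both components decay
x_bad, y_bad = euler_system(0.025, 400)  # 1/50 < h < 2/5: y blows up, x is fine
print(y_ok, y_bad)
```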

7 Adaptive time stepping


Notice that one step methods use only the values at the previous time step n to update to
time n + 1. For this reason, once we are at n + 1, we are free to pick a new value of h. It
is desirable to have an algorithm that can adjust the time step h to ensure that the local
truncation error remains below a certain tolerance ε:

|τi| < ε for all i.

To do so, we need an estimate for τi at each step, and a way to select a step size h that will ensure that the estimated error is acceptably small.

On global error: We will keep the goal modest and not discuss strategies for bounding
the global error, which is more difficult.

One might, for instance, want the global error to remain below a tolerance ε:

max_{t∈[a,b]} |ỹ(t) − y(t)| < ε.

Bounding the τ's individually does not lead directly to a bound on the global error, since error propagation must be accounted for as well. But local control is good enough in practice, so long as one is cautious (e.g. setting local tolerances somewhat smaller than strictly necessary).

7.1 Using two methods of different order


A good strategy is to use two methods of different order. Informally, we use the more accurate one as a stand-in for the exact solution, in order to estimate the error of the less accurate method.
Assume the value yn at the n-th step is exact, fix a step size h, and suppose we have two one-step methods:

i) Method A: Order p, producing ỹn+1 from yn, with truncation error τn+1(h) for step size h

ii) Method B: Order p + 1, producing ŷn+1 from yn

We seek a step size hnew such that

τn+1(hnew) < ε.

To derive the estimate, suppose that the value at time n is exact. We have that

ỹn+1 = yn + hψ(tn, yn),
ŷn+1 = yn + hψ̂(tn, yn).
By the definition of the truncation error,

y(tn+1) = ỹn+1 + τn+1, (7.1)

y(tn+1) = ŷn+1 + O(h^{p+2}). (7.2)

Now further approximate (7.1) by assuming that the error comes from a series with some leading-order term:

τn+1(h) = C h^{p+1} + O(h^{p+2}).

Then

y(tn+1) = ỹn+1 + C h^{p+1} + O(h^{p+2}).
Now subtract the more accurate method (7.2) from this to eliminate y(tn+1), leaving

|ỹn+1 − ŷn+1| = |C| h^{p+1} + O(h^{p+2}).

This gives an error estimate:

|τ(h)| ≈ |C| h^{p+1} ≈ |ỹn+1 − ŷn+1|. (7.3)

At this point, we have enough to get the updated time step hnew . It must satisfy

|τ(hnew)| ≈ |C| (hnew)^{p+1} < ε.

Taking the ratio of this and the estimate (7.3) (or solving for C) yields

(hnew / h)^{p+1} < ε / |ỹn+1 − ŷn+1|, (7.4)

which is a formula for the next step hnew in terms of known quantities: the two approximations from the two methods, the last step h and the tolerance ε.
In practice, because this is just an estimate, one puts in an extra safety factor, typically something like

hnew = 0.8 h (ε / |ỹn+1 − ŷn+1|)^{1/(p+1)}.

This formula decreases the step size if the error is too large (to stay accurate) and increases
the step size if the error is too small (to stay efficient). Sometimes, other controls are added
(like not decreasing or increasing h by too much per time step).
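The update rule, with the safety factor and a simple version of the extra controls just mentioned, can be sketched as follows (the function name and clamping factors are our own choices, not from the notes):

```python
def new_step_size(h, err, tol, p, safety=0.8, grow=5.0, shrink=0.1):
    """Step-size update h_new = safety * h * (tol/err)**(1/(p+1)),
    clamped so h never changes by more than a fixed factor per step.
    Here p is the order of the lower-order method and err is the
    estimate |y_tilde - y_hat| of its local error."""
    if err == 0:
        return grow * h  # error estimate vanished; just grow the step
    factor = safety * (tol / err) ** (1.0 / (p + 1))
    factor = min(grow, max(shrink, factor))
    return factor * h

# Error well below tolerance -> step grows; well above -> step shrinks.
assert new_step_size(0.1, err=1e-8, tol=1e-6, p=1) > 0.1
assert new_step_size(0.1, err=1e-3, tol=1e-6, p=1) < 0.1
```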

! The estimate is based on some strong assumptions that may not be true. While
it typically works, there is no rigorous guarantee that the error estimate is good.
Moreover, as noted earlier, this strategy bounds the local error, which does not
always translate into a reasonable global error bound.

7.2 Embedded RK methods


Each step with the error estimate appears to cost about twice as much as one fixed step (since
two methods are used). However, with a judicious choice, we can create some overlap in the
computations, saving work.
One of the main approaches is to use an embedded pair of Runge-Kutta methods,
which is a pair of RK formulas where most of the fi ’s are the same for both.
For example, suppose we wish to create a pair for estimating the error in Euler’s method.
We need a method of order 2 here. Recall that the modified Euler method is

f1 = f(tn, yn)
f2 = f(tn + h/2, yn + (h/2) f1)
yn+1 = yn + h f2 + O(h^3).

Figure 11: Left: error estimate using two methods of different order (red/blue). Right:
sketch of step doubling using one method with one step of size h and two half-steps of size
h/2.

The function evaluation needed for Euler's method is already there, so once modified Euler is applied, computing

ỹn+1 = yn + h f1 + O(h^2)

requires essentially no extra work; the value of f1 is used for both.

Following (7.4), the timestep would then be chosen to be something like

hnew ≈ h (ε / |ỹn+1 − ŷn+1|)^{1/2},

where ỹn+1 and ŷn+1 are the results of Euler (order 1) and modified Euler (order 2).
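One adaptive step with this embedded pair can be sketched in code as follows (our own minimal implementation; advancing with the higher-order value, a common choice called local extrapolation, is an assumption here, not something the notes prescribe):

```python
def embedded_euler_step(f, t, y, h, tol):
    """One adaptive step with the Euler / modified-Euler embedded pair.
    Returns (accepted, y_new, h_next). The single evaluation f1 is
    shared by both formulas."""
    f1 = f(t, y)
    f2 = f(t + h / 2, y + (h / 2) * f1)
    y_euler = y + h * f1          # order 1
    y_mod = y + h * f2            # order 2 (modified Euler)
    err = abs(y_euler - y_mod)    # estimates the Euler step's error
    # Step-size formula (7.4) with p = 1 and a 0.8 safety factor.
    h_next = 0.8 * h * (tol / err) ** 0.5 if err > 0 else 2 * h
    return err < tol, y_mod, h_next

# y' = -y, y(0) = 1: one step should be accepted with a sensible value.
accepted, y1, h_next = embedded_euler_step(lambda t, y: -y, 0.0, 1.0, 0.01, 1e-3)
assert accepted and 0.0 < y1 < 1.0
```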
Embedded methods of higher order can be constructed by the right choice of coefficients.
We saw that fourth-order RK methods have the best balance of accuracy and efficiency.
There are a handful of good fourth/fifth order pairs.
One popular embedded pair is the Runge-Kutta-Fehlberg method, which uses a
fourth-order and fifth-order formula that share most of the fi ’s. A formula of this form with
step size selected by (7.4) is the strategy employed, for instance, by MATLAB’s ode45.8

7.3 Step doubling


Now suppose instead that we have only a single method that can take a step size h (for any
h) and wish to choose a new step size hnew such that the local error is at most . By using
(Richardson) extrapolation, an error estimate and a new step size can be obtained.

The idea is to use step doubling. Suppose, for the sake of example, that our method is a one-step method of the form

yn+1 = yn + hψ(tn, yn) + O(h^{p+1}).


8. Minor note for completeness: ode45 uses a different set of coefficients than RKF, the ‘Dormand-Prince’ pair, which offers slightly better accuracy.

Assume yn is exact; then the next step is

ỹn+1 = yn + hψ(tn, yn).

Now take two steps of size h/2 (see Figure 11, right) to get a new approximation:

ŷn+1/2 = yn + (h/2) ψ(tn, yn),
ŷn+1 = ŷn+1/2 + (h/2) ψ(tn + h/2, ŷn+1/2).
Assume that each application of a step creates a LTE

τ ≈ C h^{p+1}

and that C is a single constant.9 Then, if yn is exact, we have

ỹn+1 ≈ y(tn+1) + C h^{p+1},
ŷn+1 ≈ y(tn+1) + 2C (h/2)^{p+1},

i.e. the ‘doubled’ method accumulates two truncation errors from steps of size h/2. Subtracting the two approximations gives

|ỹn+1 − ŷn+1| ≈ (1 − 2^{−p}) |C h^{p+1}|.

Thus the error estimate is

|C h^{p+1}| ≈ |ỹn+1 − ŷn+1| / (1 − 2^{−p}),

from which we can choose a new time step hnew such that |C| (hnew)^{p+1} < ε.
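The step-doubling estimate can be checked numerically. A sketch with Euler's method (p = 1), under the model τ ≈ Ch² assumed above (our own code):

```python
import math

def euler_step(f, t, y, h):
    """One Euler step for y' = f(t, y)."""
    return y + h * f(t, y)

def step_doubling_error(f, t, y, h, p=1):
    """Estimate the local error of one Euler step (order p = 1) of
    size h by comparing against two half-steps:
    |y_big - y_two| / (1 - 2**-p)."""
    y_big = euler_step(f, t, y, h)
    y_half = euler_step(f, t, y, h / 2)
    y_two = euler_step(f, t + h / 2, y_half, h / 2)
    est = abs(y_big - y_two) / (1 - 2.0 ** (-p))
    return y_big, y_two, est

# y' = y, y(0) = 1: the exact local error of one Euler step of size h
# is e^h - (1 + h), so the estimate should match it to leading order.
f = lambda t, y: y
y_big, y_two, est = step_doubling_error(f, 0.0, 1.0, 0.01)
true_err = abs(math.exp(0.01) - y_big)
assert abs(est - true_err) / true_err < 0.05
```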

8 Multistep methods

A linear multistep method has the form

∑_{j=0}^{m} aj yn−j = h ∑_{j=0}^{m} bj fn−j (8.1)

where fn−j = f (tn−j , yn−j ). The methods are so named because they involve values from the
m previous steps, and are linear in f (unlike Runge-Kutta methods). For example, Euler’s
method
yn = yn−1 + hfn−1
and the Backward Euler method
yn = yn−1 + hfn
are both (trivially) linear multistep methods that only involve the previous step; we call
these one step methods. Note that the method (8.1) is implicit if b0 ≠ 0 and explicit
otherwise. We are interested here in methods that use more than just the previous step.
For simplicity, we will assume throughout that the discrete times are t0 , t1 , . . . with fixed
timestep h.
9. This can be made more precise by assuming that the LTE for a step of size h starting at t is τ(h; t) ≈ C(t) h^{p+1}, where C(t) is a smooth function of t.

8.1 Adams methods

The starting point is the integrated form of the ODE:

y(tn) = y(tn−1) + ∫_{tn−1}^{tn} y′(t) dt.

Note that y′(t) = f(t, y(t)), but it is convenient to leave it as y′.

Change variables to t = tn + sh. Then

y(tn) = y(tn−1) + h ∫_{−1}^{0} y′(tn + sh) ds. (8.2)

To get an explicit method using the previous m values, we estimate the integral using the previous times tn−1, tn−2, . . . , tn−m, or equivalently s = −1, −2, . . . , −m:

∫_{−1}^{0} g(s) ds = ∑_{j=1}^{m} bj g(−j) + C g^{(m)}(ξ).

We already know how to derive such formulas. For example, for m = 2,

∫_{−1}^{0} g(s) ds = (3/2) g(−1) − (1/2) g(−2) + (5/12) g′′(ξ).

Now plugging y′(tn + sh) into the formula, we get

y(tn) = y(tn−1) + (3h/2) y′(tn−1) − (h/2) y′(tn−2) + (5/12) h^3 y′′′(tn + ξh)

for some ξ ∈ (−m, 0). The numerical method is then

yn = yn−1 + (3h/2) fn−1 − (h/2) fn−2.

In the notation of (8.1), we have m = 2 and a0 = 1, a1 = −1, a2 = 0. The b-values are b0 = 0, b1 = 3/2, b2 = −1/2.
A method that uses the m previous steps to estimate the integral is called an Adams-Bashforth method. For a general m, it takes the form

yn = yn−1 + h ∑_{j=1}^{m} bj fn−j.

The local truncation error is O(h^{m+1}), so the method has order m. There is a trick for deriving the coefficients; see the example below.
We can also derive a method from (8.2) by including tn (or s = 0). With m = 1 this would be the trapezoidal rule:

∫_{−1}^{0} g(s) ds = (1/2) g(−1) + (1/2) g(0) + C g′′(ξ).

Note that this method uses m + 1 points (the m previous points plus tn). The result is then the trapezoidal method

yn = yn−1 + (h/2)(fn + fn−1),

which has order 2. In general, the method will have the form

yn = yn−1 + h ∑_{j=0}^{m} bj fn−j

and is called an Adams-Moulton method. The local truncation error is O(h^{m+2}), so the method has order m + 1. Note that the method is implicit, since fn appears on the RHS.

Examples: The Adams-Bashforth method for m = 3 requires a formula

∫_{−1}^{0} g(s) ds ≈ b1 g(−1) + b2 g(−2) + b3 g(−3).

We will omit the details of the error term here. The easiest way to derive the coefficients is to use undetermined coefficients on the Newton basis:

1, s + 1, (s + 1)(s + 2), . . .

The reason is that the resulting linear system will be triangular and easy to solve (in fact, there is a nice general formula). The method should have degree of accuracy 2, so we require that it be exact for 1, s + 1 and (s + 1)(s + 2). Plugging these in, we get

1 = b1 + b2 + b3
1/2 = −b2 − 2b3
5/6 = 2b3,

so b1 = 23/12, b2 = −16/12 and b3 = 5/12. This gives the method

yn = yn−1 + (h/12)(23fn−1 − 16fn−2 + 5fn−3),
which has order of accuracy 3. The Adams-Moulton method with the same order of accuracy has m = 2 and requires the formula

∫_{−1}^{0} g(s) ds ≈ b0 g(0) + b1 g(−1) + b2 g(−2).

Requiring that the formula be exact for 1, s, s(s + 1) (same trick as before) yields

1 = b0 + b1 + b2
−1/2 = −b1 − 2b2
−1/6 = 2b2.

After solving, the result is the method

yn = yn−1 + (h/12)(5fn + 8fn−1 − fn−2)

with order of accuracy 3.
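The exactness conditions above can be solved mechanically. The sketch below (our own code) sets them up on plain monomials, which is equivalent to the Newton-basis trick but yields a dense linear system solved with numpy:

```python
import numpy as np

def adams_bashforth_weights(m):
    """Weights b_1..b_m such that int_{-1}^0 g(s) ds = sum_j b_j g(-j)
    is exact for polynomials of degree < m (monomials 1, s, ..., s^{m-1})."""
    nodes = -np.arange(1, m + 1)                 # s = -1, -2, ..., -m
    A = np.vander(nodes, m, increasing=True).T   # A[k, j] = nodes[j]**k
    # int_{-1}^0 s^k ds = (-1)^k / (k + 1)
    rhs = np.array([(-1.0) ** k / (k + 1) for k in range(m)])
    return np.linalg.solve(A, rhs)

# Recover the AB3 coefficients 23/12, -16/12, 5/12 derived in the text,
# and the AB2 coefficients 3/2, -1/2.
assert np.allclose(adams_bashforth_weights(3), [23 / 12, -16 / 12, 5 / 12])
assert np.allclose(adams_bashforth_weights(2), [3 / 2, -1 / 2])
```

The same idea (adding the node s = 0) produces the Adams-Moulton weights.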

8.2 Properties of the Adams methods


The main benefit of these methods is efficiency: one function evaluation is required per step.
It is also true that

• The order p Adams-Moulton method has a (much) smaller error than the order p Adams-Bashforth method (the constant in the LTE is much smaller).

• When m ≥ 2, the Adams-Moulton methods have a bounded stability region of moderate size. (When m = 0 or m = 1 the method is A-stable.)

• Adams-Bashforth methods have a bounded stability region that is very small.

For this reason, neither type of method (even the implicit one!) is useful for stiff problems.
The Adams-Moulton method is superior in terms of accuracy and (absolute) stability, but
it is implicit, so it requires much more work per step.
In practice, the two are combined to form an explicit method that has some of the
stability of the implicit method, but is easy to compute. The implicit term fn = f (tn , yn ) is
estimated using the result ỹn from an explicit method. This strategy is called a predictor-
corrector method. For example, we can combine the two-step explicit formula with the
one-step implicit formula:
h
ỹn = yn−1 + (3fn−1 − fn−2 ) (8.3)
2
˜
fn = f (tn , ỹn ) (8.4)
h
yn = yn−1 + (f˜n + fn−1 ). (8.5)
2
Now the formula is not implicit (and as a bonus, the error can be estimated using yn and ỹn
by standard techniques). It turns out that this trick gives a method that has a reasonably
good stability region (not as good as the implicit one!) and is essentially strictly better than
the pure explicit method (8.3) alone.
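One step of the predictor-corrector pair (8.3)–(8.5) can be sketched as follows (our own code; the argument names are ours, and we use exact starting values for the test):

```python
import math

def pc_step(f, t_new, y_prev, f_prev, f_prev2, h):
    """One predictor-corrector step: AB2 predictor (8.3), evaluate (8.4),
    trapezoidal corrector (8.5). Here y_prev = y_{n-1}, f_prev = f_{n-1},
    f_prev2 = f_{n-2}, t_new = t_n."""
    y_pred = y_prev + (h / 2) * (3 * f_prev - f_prev2)  # predict
    f_pred = f(t_new, y_pred)                           # evaluate
    return y_prev + (h / 2) * (f_pred + f_prev)         # correct

# y' = -y with exact starting values y_0 = 1, y_1 = e^{-h}.
h = 0.1
f = lambda t, y: -y
y = math.exp(-h)
fs = [-1.0, -math.exp(-h)]          # f_0, f_1
for n in range(2, 20):
    y = pc_step(f, n * h, y, fs[-1], fs[-2], h)
    fs.append(f(n * h, y))
# Second-order accuracy: close to the exact value e^{-1.9}.
assert abs(y - math.exp(-19 * h)) < 1e-3
```

Note that only one new f-evaluation per step is needed beyond the predictor's, since fn−1 and fn−2 are stored from earlier steps.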

Practical note (starting values): To start, we need m previous values, which means y1, . . . , ym−1 must be computed by some other method (not a multistep formula). A high-order RK method is the simplest choice, because it only requires one previous point.

8.3 Other multistep methods

The Adams methods only use previous fn's and not yn's. There are other classes of multistep methods of the form (8.1) that are derived in different ways. One such class is the Backward differentiation formulas (BDFs). These are derived by approximating y′ using a backward difference:

hy′(tn) ≈ c0 y(tn) + c1 y(tn−1) + · · · + cm y(tn−m).

For example, the first order BDF is Backward Euler; the second order BDF is

(3yn − 4yn−1 + yn−2) / (2h) = f(tn, yn),

which rearranges to

yn = (4/3) yn−1 − (1/3) yn−2 + (2h/3) fn.
This method is sometimes called Gear's method. BDFs are valuable because, for orders up to 6, they do well on stiff problems (see next section) compared to Adams-Moulton methods, and moreover have the stiff decay property.
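For the scalar test equation y′ = λy the implicit solve is just a division: rearranging the BDF2 formula with fn = λyn gives yn = (4yn−1 − yn−2)/(3 − 2hλ). A sketch (our own code; the first step is taken from the exact solution for simplicity) shows the stiff decay in action:

```python
import math

def bdf2_test_equation(lam, h, n, y0=1.0):
    """BDF2 applied to y' = lam*y: y_n = (4y_{n-1} - y_{n-2})/(3 - 2h*lam).
    Returns the approximation to y(n*h); y_1 is taken exact."""
    y_prev2, y_prev = y0, y0 * math.exp(lam * h)
    for _ in range(n - 1):
        y_prev2, y_prev = y_prev, (4 * y_prev - y_prev2) / (3 - 2 * h * lam)
    return y_prev

# Very stiff decay: lambda = -1000 with a "large" step h = 0.1.
# An explicit method would need roughly h < 2/1000; BDF2 stays stable
# and damps the solution rapidly.
assert abs(bdf2_test_equation(-1000.0, 0.1, 50)) < 1e-20

# Mild problem: second-order accuracy at t = 1 for lambda = -1, h = 0.01.
assert abs(bdf2_test_equation(-1.0, 0.01, 100) - math.exp(-1.0)) < 1e-3
```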

8.4 Analysis of multistep methods


8.4.1 Consistency

The general multistep method (8.1) can be shown to be consistent by Taylor expanding y(tn−j) and y′(tn−j) (in place of f) around tn, then collecting terms to get the local truncation error. We have

y(tn−j) = y(tn) − jh y′(tn) + (1/2)(jh)^2 y′′(tn) + · · ·

and

y′(tn−j) = y′(tn) − jh y′′(tn) + (1/2)(jh)^2 y′′′(tn) + · · · .

The truncation error is

τn = ∑_{j=0}^{m} aj y(tn−j) − h ∑_{j=0}^{m} bj y′(tn−j)
   = (∑_{j=0}^{m} aj) y(tn) − h (∑_{j=0}^{m} j aj) y′(tn) − h (∑_{j=0}^{m} bj) y′(tn) + O(h^2)
   = y(tn) ∑_{j=0}^{m} aj − h y′(tn) ∑_{j=0}^{m} (j aj + bj) + O(h^2).

Thus the method (8.1) is consistent if and only if

∑_{j=0}^{m} aj = 0 and ∑_{j=0}^{m} (j aj + bj) = 0.

Further conditions can be derived by taking more terms. Of course, for the Adams formulas
and BDFs, we derived the method and proved consistency by other (more elegant) means,
but with the above we can construct more general methods.

8.4.2 Zero stability

One necessary condition for convergence is that when the method is applied to the trivial ODE y′ = 0, solutions remain bounded. If f = 0 then the method reduces to

∑_{j=0}^{m} aj yn−j = 0, (8.6)

which is a linear recurrence for yn. Combined with consistency, this boundedness condition turns out to be sufficient for convergence; the following theorem holds.

Theorem 8.1 (Dahlquist equivalence theorem). A linear multistep method of the form (8.1) is convergent if and only if it is consistent and all solutions to (8.6) remain bounded as n → ∞.

The equation (8.6) can be solved directly (see Section C for a review), leading to the following condition.

Theorem 8.2. All solutions of the linear recurrence (8.6) are bounded if and only if all the roots of the characteristic polynomial

p(r) = ∑_{j=0}^{m} aj r^{m−j} (8.7)

have magnitude ≤ 1, and every root with magnitude exactly 1 is simple (not repeated).

The Dahlquist theorem lets us verify that a multistep method is convergent simply by proving consistency and checking the roots of the characteristic polynomial (8.7).
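Checking the root condition numerically is straightforward. A sketch (our own code; the a-coefficients follow the convention of (8.1), and multiplicity is detected crudely via the derivative of p):

```python
import numpy as np

def zero_stable(a, tol=1e-10):
    """Check the root condition for p(r) = sum_j a[j] * r**(m-j):
    all |r| <= 1, and any root on the unit circle must be simple.
    A root r is flagged as repeated when p'(r) also vanishes."""
    roots = np.roots(a)
    deriv = np.polyder(np.array(a, dtype=float))
    for r in roots:
        if abs(r) > 1 + tol:
            return False
        if abs(abs(r) - 1) <= tol and abs(np.polyval(deriv, r)) <= tol:
            return False  # repeated root on the unit circle
    return True

# AB2 has p(r) = r^2 - r, roots 0 and 1: zero stable (indeed strongly).
assert zero_stable([1, -1, 0])
# A "method" with p(r) = (r - 1)^2 = r^2 - 2r + 1 violates the condition.
assert not zero_stable([1, -2, 1])
```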

8.4.3 Strongly stable methods

From the consistency result, we have that

if the method is consistent, then r = 1 is a root of p(r).

For zero stability, this is the “good” root, since it means that if y′ = 0 then yn = const. is a solution. The other roots are “bad”, in that they are spurious terms that appear due to errors. If one of these roots also has magnitude 1, the bad term will stay bounded but not decay, which is not ideal.
A method is called strongly stable if r = 1 is the only root of magnitude 1 and all other roots are strictly less than 1 in magnitude. The method is weakly stable if there are multiple roots with magnitude 1.
Weakly stable methods tend not to damp out errors, which makes them undesirable unless they are needed for some other compelling reason.

8.5 Absolute stability

The definition of absolute stability is the same. When more than one previous y-value is involved, the recurrence has more than two terms, so the analysis is more involved. When the general method (8.1) is applied to y′ = λy, we get

∑_{j=0}^{m} aj yn−j = hλ ∑_{j=0}^{m} bj yn−j.

Let z = hλ as before. The above is a linear recurrence with characteristic polynomial

p(r; z) = ∑_{j=0}^{m} (aj − z bj) r^{m−j}.

To have solutions of the recurrence go to zero as n → ∞, we need all the roots of p(r; z) to have magnitude less than one. Thus, with R the region of absolute stability,

z ∈ R if and only if the roots of p(r; z) = 0 all have |r| < 1.

This is now a condition that can, with some effort, be computed and we can plot the stability
region.

Example: Stability region for Adams methods. The two-step Adams-Bashforth method is

yn = yn−1 + h((3/2) fn−1 − (1/2) fn−2).

Applied to y′ = λy, this becomes

yn = yn−1 + z((3/2) yn−1 − (1/2) yn−2)

with z = hλ. The characteristic polynomial for this recurrence is

p(r; z) = r^2 − r − z((3/2) r − 1/2).

Rearranging, we get (equivalently) that the roots satisfy

2r^2 − (2 + 3z)r + z = 0.

Solving, we get

r = (2 + 3z ± √((2 + 3z)^2 − 8z)) / 4.

Thus z is in the region of absolute stability if and only if the two roots above are less than 1 in magnitude. This is not a nice-looking condition. However, we can determine the largest b such that (−b, 0) is in the interval of absolute stability.

Observe that on the boundary of R, one of the roots must have |r| = 1. If
z < 0 is real then both roots are real (by the calculations above). Thus we need only
find the values of z for which r = ±1 is a root, which occurs when

z = 0 or z = −1.

Thus the interval of absolute stability is (−1, 0), which is half the size of Euler’s method!
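The boundary value z = −1 is easy to confirm numerically by computing the root magnitudes of 2r² − (2 + 3z)r + z directly (a sketch, using numpy's polynomial root finder):

```python
import numpy as np

def ab2_stable(z):
    """True if both roots of 2r^2 - (2 + 3z)r + z have |r| < 1,
    i.e. z lies in the region of absolute stability of AB2."""
    roots = np.roots([2.0, -(2.0 + 3.0 * z), z])
    return bool(np.all(np.abs(roots) < 1.0))

# Points just inside and outside the interval (-1, 0):
assert ab2_stable(-0.5)
assert not ab2_stable(-1.5)
```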

Some facts about absolute stability regions for multistep methods:

• A linear multistep method with order ≥ 2 cannot be A-stable. (This is the second
Dahlquist barrier.)

• The interval of absolute stability for BDFs up to order 6 contains (−∞, 0).

• The stability region for Adams-Moulton methods is bounded for m > 2.

• The stability region for Adams-Bashforth methods shrinks as the order increases.

It follows that higher order Adams-Moulton methods are not to be used on stiff problems
even though they are implicit, and that Adams-Bashforth methods should not be used on
anything remotely stiff.10

A Nonlinear systems

Solving implicit systems of ODEs requires solving nonlinear systems, which is a problem in itself. Here we briefly consider numerical solution of the nonlinear system F(x) = 0 where

F : Rn → Rn, F = (F1(x), F2(x), . . . , Fn(x)).

This describes a general system of n equations in n variables. Let DF be the Jacobian matrix with entries

(DF)ij = ∂Fi/∂xj.
We will consider finding a zero x∗ with the property that

F(x∗ ) = 0 and DF(x∗ ) is invertible,

analogous to the f 0 (x∗ ) 6= 0 assumption in 1D. This guarantees that Newton’s method (see
below) works and that the zero is isolated.11

10. More to the point, predictor-corrector methods using Adams-Bashforth and Adams-Moulton formulas work much better, so there is not much reason to use the Adams-Bashforth formula alone.
11. The result is a consequence of the inverse function theorem, which ensures that F is invertible in a neighborhood of x∗.

Simple example: The system of two equations

0 = e^{x1}(x2 − 1)
0 = x1^2 − x2^2

can be written in the form F(x) = 0 where

x = (x1, x2), F(x) = (e^{x1}(x2 − 1), x1^2 − x2^2).

It has a zero at x∗ = (1, 1). The Jacobian is

DF = [ e^{x1}(x2 − 1)   e^{x1} ]
     [ 2x1              −2x2  ].

Finally, we check that

DF(x∗) = [ 0   e ]
         [ 2  −2 ]

is invertible (its determinant is −2e ≠ 0).

A.1 Taylor’s theorem

For the informal treatment here, we need only a simplified version of Taylor’s theorem.

Theorem A.1 (Taylor’s theorem, simple variant). Suppose x0 ∈ Rn and F : Rn → Rm is C^2 in a neighborhood of x0. Then

F(x) = F(x0) + DF(x0)(x − x0) + R(x)

where the remainder satisfies

‖R(x)‖ ≤ (M/2) ‖x − x0‖^2

for a constant M depending on the size of the second partial derivatives of F.

Much more precise statements can be made about the error, but this version is enough
for our purposes.

A.2 Newton’s method

Newton’s method extends to Rn directly. Let x^{(k)} denote the sequence of iterates. As before, we use Taylor’s theorem to get a linear approximation of F near x^{(k)}:

F(x) = F(x^{(k)}) + DF(x^{(k)})(x − x^{(k)}) + O(‖x − x^{(k)}‖^2).

Using this ‘locally linear’ approximation, we pick x^{(k+1)} so that

0 ≈ F(x^{(k+1)}) ≈ F(x^{(k)}) + DF(x^{(k)})(x^{(k+1)} − x^{(k)}),

leading to

x^{(k+1)} = x^{(k)} − Jk^{−1} F(x^{(k)}), Jk := DF(x^{(k)}).

In practice, we solve the linear system

Jk v = −F(x^{(k)})

for the correction v and then update x^{(k+1)} = x^{(k)} + v.

The convergence results transfer to the n-dimensional case. If DF(x∗) is invertible then Newton’s method converges quadratically when ‖x^{(0)} − x∗‖ is small enough (see the Newton-Kantorovich theorem).

A good guess is even more valuable in Rn, since there are more dimensions to work with, which means more possibilities for failure. Without a good guess, Newton’s method will likely not work. Moreover, searching for a root is much more difficult (there is no analogue of bisection).
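A sketch of the Newton iteration on the simple example above, F(x) = (e^{x1}(x2 − 1), x1² − x2²), starting from a guess near the zero (1, 1) (our own code; the starting point and iteration counts are illustrative choices):

```python
import numpy as np

def F(x):
    """The example system: zeros include x* = (1, 1)."""
    return np.array([np.exp(x[0]) * (x[1] - 1), x[0] ** 2 - x[1] ** 2])

def DF(x):
    """Jacobian of F with respect to x."""
    return np.array([[np.exp(x[0]) * (x[1] - 1), np.exp(x[0])],
                     [2 * x[0], -2 * x[1]]])

def newton(F, DF, x0, tol=1e-12, max_iter=50):
    """Plain Newton iteration: solve J v = -F(x) and update x <- x + v."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        v = np.linalg.solve(DF(x), -F(x))
        x = x + v
        if np.linalg.norm(v) < tol:
            break
    return x

x_star = newton(F, DF, [0.8, 1.2])
assert np.allclose(x_star, [1.0, 1.0], atol=1e-8)
```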

A.3 Application to ODEs

Nonlinear systems arise in solving systems of ODEs implicitly. As an example, consider the pendulum equation from earlier,

θ′′ + sin θ = f(t),

which was written as the system

x′ = F(t, x), F(t, x) = (x2, − sin x1 + f(t))

with x = (θ, θ′). Suppose we apply Backward Euler to this system:

xj+1 = xj + hF(tj+1, xj+1).

To obtain xj+1, define the function

G(z) = z − xj − hF(tj+1, z).

The Jacobian of G is

DG = I − h DxF(tj+1, z) = [ 1          −h ]
                          [ h cos z1    1 ]

where DxF denotes the Jacobian of F as a function of x (not t):

DxF(t, x) = [ 0          1 ]
            [ − cos x1   0 ].

At step j, we find a root of G using Newton’s method:

(DG(zk)) (zk+1 − zk) = −G(zk) (A.1)

and then use the result as xj+1. Newton’s method is particularly nice here because we have access to a good initial guess (z0 = xj).

Practical note: A general routine would require the user to input the ODE function F and
its Jacobian Dx F, both as functions of (t, x). If the dimension is large, it may be much more
efficient to write a solver specific to that problem that knows how to solve the Newton’s
method system (A.1) efficiently (e.g. Appendix B).

B Boundary value problems

We give an example of using Newton’s method to solve boundary value problems.
The system of N equations

2y1 − y2 = sin(y1)
−yi−1 + 2yi − yi+1 = sin(yi), i = 2, · · · , N − 1
−yN−1 + 2yN = sin(yN)

is a discretization of the boundary value problem

−y′′ = sin(y), y(0) = y(1) = 0

for equally spaced data {yj} at grid points in [0, 1] (the factor of h^2 that would multiply sin(yi) in the standard finite-difference discretization is suppressed here for simplicity). To write it as a system, set

y = (y1, · · · , yN),

and define the function F with i-th component

Fi(y) = −yi−1 + 2yi − yi+1 − sin(yi), i = 2, · · · , N − 1,

and the exceptional cases

F1(y) = 2y1 − y2 − sin(y1),
FN(y) = −yN−1 + 2yN − sin(yN).

Note that the Jacobian of F is a tridiagonal matrix.

To apply Newton’s method, we first compute the Jacobian J. Since ∂Fi/∂yi = 2 − cos(yi) and the off-diagonal entries are −1,

         [ 2 − cos(y1)   −1                                          ]
         [ −1            2 − cos(y2)    −1                           ]
J(y) =   [               ...            ...             ...          ]
         [                −1            2 − cos(yN−1)   −1           ]
         [                              −1              2 − cos(yN)  ].
To apply Newton’s method, pick some starting vector y^{(0)}. Then, at step k, solve

J(y^{(k)}) v^{(k)} = −F(y^{(k)}),

which is a tridiagonal system (so it can be solved efficiently), and then update

y^{(k+1)} = y^{(k)} + v^{(k)}.
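This iteration can be sketched in a few lines. For concreteness we keep the h² factor from the finite-difference discretization of −y′′ = sin(y) (an assumption; the system displayed in the text suppresses it), and we use a dense solve via numpy for brevity, whereas a real implementation would exploit the tridiagonal structure (e.g. with the Thomas algorithm):

```python
import numpy as np

N = 20
h = 1.0 / (N + 1)

def F(y):
    """Residuals -y_{i-1} + 2y_i - y_{i+1} - h^2 sin(y_i), with the
    boundary values y_0 = y_{N+1} = 0 padded in."""
    yp = np.concatenate(([0.0], y, [0.0]))
    return -yp[:-2] + 2 * yp[1:-1] - yp[2:] - h ** 2 * np.sin(y)

def J(y):
    """Tridiagonal Jacobian: diagonal 2 - h^2 cos(y_i), off-diagonals -1."""
    return (np.diag(2 - h ** 2 * np.cos(y)) +
            np.diag(-np.ones(N - 1), 1) +
            np.diag(-np.ones(N - 1), -1))

y = 0.1 * np.ones(N)                  # starting guess
for _ in range(20):
    v = np.linalg.solve(J(y), -F(y))  # Newton correction
    y = y + v
assert np.max(np.abs(F(y))) < 1e-10   # converged to a solution
```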

C Difference equations: review

Consider the linear difference equation

an = c1 an−1 + c2 an−2 + · · · + cm an−m.

We solve by looking for solutions of the form

an = r^n.

This is a solution if r is a root of the characteristic polynomial

p(r) := r^m − c1 r^{m−1} − c2 r^{m−2} − · · · − cm−1 r − cm.

There are m (complex) roots r1, · · · , rm. Since the equation is linear, any linear combination of solutions is a solution. If the roots are distinct, the general solution is

an = ∑_{j=1}^{m} bj rj^n

for constants b1, · · · , bm; a repeated root r of multiplicity k instead contributes the solutions r^n, n r^n, . . . , n^{k−1} r^n. It follows that the solution is bounded for all initial conditions if and only if

|rj| ≤ 1 for all j, with every root of magnitude 1 simple,

and that an → 0 for all initial conditions if and only if

|rj| < 1 for all j.
