Numerical Methods
Syllabus
1. Introduction
Background
1. Name
4. Background
5. Scientific Interest
7. Programming experience/language
Reserve List
• Dale R. Durran, Numerical Methods for Wave Equations in Geophysical Fluid Dynamics, Springer, New York, 1998. (CGFD)
Useful References
• K.W. Morton and D.F. Mayers, Numerical Solution of Partial Differential Equations: An Introduction, Cambridge University Press, New York, 1994. (FDM)
• Roger Peyret and Thomas D. Taylor, Computational Methods for Fluid Flow, Springer-Verlag, New York, 1990. (Num. Sol. of PDE's)
• Joel H. Ferziger and M. Peric, Computational Methods for Fluid Dynamics, Springer-Verlag, New York, 1996.
• C. Canuto, M.Y. Hussaini, A. Quarteroni and T.A. Zang, Spectral Methods in Fluid Dynamics, Springer-Verlag, New York, 1991. (Spectral Methods)
• John P. Boyd, Chebyshev and Fourier Spectral Methods, Dover Publications, 2000. (Spectral Methods)
• O.C. Zienkiewicz and R.L. Taylor, The Finite Element Method, 4th edition, McGraw-Hill, 1989.
• Michel O. Deville, Paul F. Fischer and E.H. Mund, High-Order Methods for Incompressible Fluid Flow, Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press, Cambridge, 2002.
Useful Software
Useful Software
December 2, 2010
Contents

1 Introduction
1.1 Justification of CFD
1.2 Discretization
1.2.1 Finite Difference Method
1.2.2 Finite Element Method
1.2.3 Spectral Methods
1.2.4 Finite Volume Methods
1.2.5 Computational Cost
1.3 Initialization and Forcing
1.4 Turbulence
1.5 Examples of systems of equations and their properties

2 Basics of PDEs
2.1 Classification of second order PDEs
2.1.1 Hyperbolic Equation: b² − 4ac > 0
2.1.2 Parabolic Equation: b² − 4ac = 0
2.1.3 Elliptic Equation: b² − 4ac < 0
2.2 Well-Posed Problems
2.3 First Order Systems
2.3.1 Scalar Equation
2.3.2 System of Equations in one-space dimension
2.3.3 System of equations in multi-space dimensions

6.4.1 Remarks
6.5 Lax Wendroff scheme
6.5.1 Remarks
6.6 Numerical Dispersion
6.6.1 Analytical Dispersion Relation
6.6.2 Numerical Dispersion Relation: Spatial Differencing

16 Debuggers
16.1 Preparing the code for debugging
16.2 Running the debugger
Chapter 1
Introduction
1.2 Discretization
The central process in CFD is the process of discretization, i.e. the process of taking
differential equations with an infinite number of degrees of freedom and reducing them
to a system with a finite number of degrees of freedom. Hence, instead of determining the solution
everywhere and for all times, we will be satisfied with its calculation at a finite number
of locations and at specified time intervals. The partial differential equations are then
reduced to a system of algebraic equations that can be solved on a computer.
Errors creep in during the discretization process. The nature and characteristics
of the errors must be controlled in order to ensure that 1) we are solving the correct
equations (consistency property), and 2) the error can be decreased as we increase
the number of degrees of freedom (stability and convergence). Once these two criteria are
established, the power of computing machines can be leveraged to solve the problem in
a numerically reliable fashion.
Various discretization schemes have been developed to cope with a variety of issues.
The most notable for our purposes are: finite difference methods, finite volume methods,
finite element methods, and spectral methods.
The finite difference method, for instance, replaces the infinitesimal limiting process in the definition of the derivative,

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}, \qquad (1.1)$$

with a finite limiting process, i.e.

$$f'(x) \approx \frac{f(x + \Delta x) - f(x)}{\Delta x} + O(\Delta x) \qquad (1.2)$$
The term O(∆x) gives an indication of the magnitude of the error as a function of the
mesh spacing. In this instance, the error is halved if the grid spacing ∆x is halved, and
we say that this is a first order method. Most FDM used in practice are at least second
order accurate except in very special circumstances. We will concentrate mostly on
finite difference methods, since they are still among the most popular numerical methods
for the solution of PDE's because of their simplicity, efficiency, low computational cost,
and ease of analysis. Their major drawback is their geometric inflexibility, which
complicates their application to general complex domains. This can be alleviated by
the use of mapping techniques and/or masking to fit the computational mesh to
the computational domain.
Figure 1.2: Elemental partition of the global ocean as seen from the eastern and western equatorial Pacific. The inset shows the master element in the computational plane. The location of the interpolation points is marked with a circle, and the structured nature of this local grid is evident from the predictable adjacency pattern between collocation points.
Finite volume methods are primarily used in aerodynamics applications where strong
shocks and discontinuities in the solution occur. The finite volume method solves an integral
form of the governing equations, so that local continuity properties do not have to hold.
The CPU time needed to solve the system of equations differs substantially from method to
method. Finite differences are usually the cheapest on a per grid point basis, followed
by the finite element method and the spectral method. However, a per grid point
comparison is a little like comparing apples and oranges: spectral methods deliver more
accuracy on a per grid point basis than either FEM or FDM. The comparison is more
meaningful if the question is recast as “what is the computational cost to achieve a given
error tolerance?”. The problem then becomes one of defining the error measure, which is a
complicated task in general situations.
1.4 Turbulence
Most flows occurring in nature are turbulent, in that they contain energy at all scales,
ranging from hundreds of kilometers to a few centimeters. It is obviously not possible to
model all these scales at once. It is often sufficient to model the “large scale physics”
and to relegate the small unresolvable scales to a subgrid model. Subgrid modeling is
a large discipline in CFD; we will use the simplest of these models, which rely on a simple
eddy diffusivity coefficient.
1.5 Examples of systems of equations and their properties
The two-dimensional incompressible Navier-Stokes equations provide a concrete example:

$$\mathbf{v}_t + \mathbf{v}\cdot\nabla\mathbf{v} = -\frac{1}{\rho}\nabla p + \nu\,\nabla^2\mathbf{v} \qquad (1.3)$$

$$\nabla\cdot\mathbf{v} = 0 \qquad (1.4)$$

supplemented with proper boundary and initial conditions. The unknowns are the two
components of the velocity v and the pressure p, so that we have 3 unknown functions
to determine. The parameters of the problem are the density ρ and the kinematic viscosity ν,
which are assumed constant in this example. Equation (1.3) expresses the conservation of
momentum, and equation (1.4), also referred to as the continuity equation, the conservation
of mass which, for a constant density fluid, amounts to conservation of volume. The form
of equations (1.3)-(1.4) is called primitive because it uses velocity and pressure as its
dependent variables.
In the early days of computing, memory was expensive, and modelers had to use their
ingenuity to reduce the model's complexity as much as possible. Furthermore, the incompressibility
constraint complicates the solution process substantially since there is
no simple evolution equation for the pressure. The streamfunction-vorticity formulation
of the two-dimensional Navier-Stokes equations addresses both difficulties by enforcing
the continuity requirement implicitly and reducing the number of unknown functions.
The streamfunction-vorticity formulation introduces other complications in the solution
process that we will ignore for the time being. Alas, this is a typical occurrence where a
cure to one set of concerns raises a few new, and sometimes irritating, side-effects. The
streamfunction is defined as follows:
u = −ψy , v = ψx . (1.5)
Any velocity derivable from such a streamfunction is thus guaranteed to conserve mass
since

$$u_x + v_y = (-\psi_y)_x + (\psi_x)_y = -\psi_{yx} + \psi_{xy} = 0.$$
To simplify the equations further we introduce the vorticity ζ = vx − uy (a scalar in 2D
flows), which in terms of the streamfunction is
∇2 ψ = ζ. (1.6)
We can derive an evolution equation for the vorticity by taking the curl of the momentum
equation, Eq. (1.3). The final form, after using various vector identities, is:

$$\begin{array}{ccc}
\underbrace{\zeta_t}_{\text{time rate of change}} \;+ & \underbrace{\mathbf{v}\cdot\nabla\zeta}_{\text{advection}} \;= & \underbrace{\nu\nabla^2\zeta}_{\text{diffusion}} \\[6pt]
\dfrac{\Omega}{T} & \dfrac{U\Omega}{L} & \dfrac{\nu\Omega}{L^2} \\[6pt]
1 & \dfrac{UT}{L} & \dfrac{\nu T}{L^2} \\[6pt]
1 & 1 & \dfrac{\nu}{UL} = \dfrac{1}{Re}
\end{array} \qquad (1.7)$$
Note that the equation simplifies considerably since the pressure does not appear in the
equation (the curl of a gradient is always zero). Equations (1.6) and (1.7) are now a sys-
tem of two equations in two unknowns: the vorticity and streamfunction. The pressure
has disappeared as an unknown, and its role has been assumed by the streamfunction.
The two physical processes governing the evolution of vorticity are advection by the flow
and diffusion by viscous action. Equation (1.7) is an example of a parabolic partial differential
equation requiring initial and boundary data for its unique solution. Equation
(1.6) is an example of an elliptic partial differential equation; in this instance it is a
Poisson equation linking a given vorticity distribution to a streamfunction. Occasionally
the terms prognostic and diagnostic are applied to the vorticity and streamfunction,
respectively, to mean that the vorticity evolves dynamically in time to balance the conservation
of momentum, while the streamfunction responds instantly to changes in vorticity
to enforce kinematic constraints. A numerical solution of this coupled set of equations
would proceed in the following manner: given an initial distribution of vorticity, the corresponding
streamfunction is computed from the solution of the Poisson equation, along
with the associated velocity; the vorticity equation is then integrated in time using the
previous values of the unknown fields; the new streamfunction is subsequently updated.
The process is repeated until the desired time is reached; a sketch of this loop is given below.
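Below is a minimal Python sketch of this solution procedure, assuming a doubly periodic domain so that FFT-based differentiation and Poisson inversion can be used; the grid size, viscosity, time step, forward Euler update, and initial vorticity are all illustrative choices, not prescriptions from the text.

```python
import numpy as np

# Sketch of the vorticity-streamfunction loop: diagnose psi from zeta via the
# Poisson equation (1.6), then advance zeta with the vorticity equation (1.7).
N, L, nu, dt, nsteps = 64, 2 * np.pi, 1e-3, 1e-3, 100
x = np.linspace(0.0, L, N, endpoint=False)
X, Y = np.meshgrid(x, x)
k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)
KX, KY = np.meshgrid(k, k)
K2 = KX**2 + KY**2
K2[0, 0] = 1.0  # avoid 0/0 for the mean mode

def ddx(f, K):
    """Spectral derivative of f along the direction selected by K."""
    return np.real(np.fft.ifft2(1j * K * np.fft.fft2(f)))

zeta = np.exp(-((X - np.pi) ** 2 + (Y - np.pi) ** 2))  # illustrative initial vorticity
for n in range(nsteps):
    # Diagnostic step: solve del^2 psi = zeta (mean of psi set to zero).
    psi_hat = -np.fft.fft2(zeta) / K2
    psi_hat[0, 0] = 0.0
    psi = np.real(np.fft.ifft2(psi_hat))
    u, v = -ddx(psi, KY), ddx(psi, KX)            # velocities from (1.5)
    adv = u * ddx(zeta, KX) + v * ddx(zeta, KY)   # advection term v.grad(zeta)
    lap = np.real(np.fft.ifft2(-K2 * np.fft.fft2(zeta)))
    zeta = zeta + dt * (nu * lap - adv)           # prognostic forward Euler step
```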
In order to gauge which process, advection or diffusion, dominates the vorticity evolution,
we proceed to non-dimensionalize the variables with time, length
and velocity scales T, L and U, respectively. The vorticity scale is then Ω = U/L from
the vorticity definition. The time rate of change, advection and diffusion terms then scale as
Ω/T, UΩ/L, and νΩ/L², as shown in the third line of equation (1.7). Line 4 shows the
relative sizes of the terms after multiplying line 3 by T/Ω. If the time scale is chosen to be
the advective time scale, i.e. T = L/U, then we obtain line 5, which shows a single dimensionless
parameter, the Reynolds number, controlling the evolution of ζ. When Re ≪ 1
diffusion dominates and the equation reduces to the heat equation ζ_t = ν∇²ζ. If
Re ≫ 1 advection dominates almost everywhere in the domain. Notice that dropping the
viscous term entirely is problematic since it has the highest order derivative, and hence controls
the imposition of boundary conditions. Diffusion then has to become dominant near the
boundary through an increase of the vorticity gradient in thin boundary layers where
advection and viscous action become balanced.
What are the implications of the above analysis for the numerical solution? By carefully
analysing the vorticity dynamics we have shown that a low Reynolds number simulation
requires attention to the viscous operator, whereas advection dominates in high
Reynolds number flows. Furthermore, close attention must be paid to the boundary layers
forming near the edges of the domain. Further checks on the solution can be
obtained by spatially integrating various forms of the vorticity equation to show that
energy (kinetic energy here) and enstrophy ζ²/2 should be conserved in the inviscid case,
Re = ∞, when the domain is closed.
Chapter 2
Basics of PDEs
Partial differential equations are used to model a wide variety of physical phenomena.
As such, we expect their solution methodology to depend on the physical principles used
to derive them. A number of properties can be used to distinguish the different types of
differential equations encountered. In order to give concrete examples of the discussion
to follow, we will use the following partial differential equation:

$$a\,u_{xx} + b\,u_{xy} + c\,u_{yy} + d\,u_x + e\,u_y + f\,u = g \qquad (2.1)$$

The unknown function in equation (2.1) is u, which depends on the two independent
variables x and y. The coefficients a, b, …, f of the equation are as yet unspecified.
The following properties of a partial differential equation are useful in determining
its character, properties, method of analysis, and numerical solution:
Order: The order of a PDE is the order of the highest occurring derivative. The order
of equation (2.1) is 2. A more detailed description of the equation would require
the specification of the order for each independent variable; equation (2.1) is second
order in both x and y. Most equations derived from physical principles are usually
first order in time, and first or second order in space.
Linear: The PDE is linear if none of its coefficients depend on the unknown function.
In our case this is tantamount to requiring that the functions a, b, …, f are independent
of u. Linear PDEs are important since linear combinations of their solutions
form yet another solution. More mathematically, if u and v are solutions
of the PDE, then so is w = αu + βv, where α and β are constants independent
of u, x and y. The Laplace equation

$$u_{xx} + u_{yy} = 0$$

is linear, while the equation

$$u_t + u\,u_x = 0$$

is nonlinear. The majority of numerical analysis results are valid for linear
equations, and little is known or generalizable to nonlinear equations.
Quasi-linear: A PDE is quasi-linear if it is linear in its highest order derivative, i.e.
the coefficients of the PDE multiplying the highest order derivative depend at
most on the function and its lower order derivatives. Thus, in our example, a, b and c
may depend on u, u_x and u_y but not on u_xx, u_yy or u_xy. Quasi-linear equations
form an important subset of the larger nonlinear class because methods of analysis
and solutions developed for linear equations can safely be applied to quasi-linear
equations. The vorticity transport equation of quasi-geostrophic motion:

$$\frac{\partial \nabla^2\psi}{\partial t} + \frac{\partial \psi}{\partial y}\frac{\partial \nabla^2\psi}{\partial x} - \frac{\partial \psi}{\partial x}\frac{\partial \nabla^2\psi}{\partial y} = 0, \qquad (2.2)$$

where ∇²ψ = ψ_xx + ψ_yy, is a third order quasi-linear PDE for the streamfunction ψ.
The equation can be simplified if ξ and η can be chosen such that A = C = 0, which
in terms of the transformation factors requires:

$$\begin{cases} a\,\xi_x^2 + b\,\xi_x\xi_y + c\,\xi_y^2 = 0 \\ a\,\eta_x^2 + b\,\eta_x\eta_y + c\,\eta_y^2 = 0 \end{cases} \qquad (2.16)$$
Assuming ξ_y and η_y are not equal to zero, we can rearrange the above equations into
the form ar² + br + c = 0, where r = ξ_x/ξ_y or η_x/η_y. The number of roots of this
quadratic depends on the sign of the discriminant b² − 4ac. Before considering the
different cases, we note that the sign of the discriminant is independent of the choice of
the coordinate system. Indeed, it can easily be shown that the discriminant in the new
system is B² − 4AC = (b² − 4ac)(ξ_x η_y − ξ_y η_x)², and hence has the same sign as the discriminant
in the old system, since the quantity (ξ_x η_y − ξ_y η_x) is nothing but the Jacobian
of the mapping between (x, y) and (ξ, η) space, and the Jacobian has to be one-signed
for the mapping to be valid.
The interpretation of the above relation can easily be done by focussing on constant-ξ
surfaces, where dξ = ξ_x dx + ξ_y dy = 0, and hence:

$$\left.\frac{dy}{dx}\right|_{\xi} = -\frac{\xi_x}{\xi_y} = \frac{b - \sqrt{b^2 - 4ac}}{2a} \qquad (2.19)$$

$$\left.\frac{dy}{dx}\right|_{\eta} = -\frac{\eta_x}{\eta_y} = \frac{b + \sqrt{b^2 - 4ac}}{2a} \qquad (2.20)$$

The roots of the quadratic are hence nothing but the slopes of the constant-ξ and constant-η
curves. These curves are called the characteristic curves; they are the preferred
directions along which information propagates in a hyperbolic system.
In the (ξ, η) system the equation takes its canonical form, in which u_ξη is the only second order term.
The solution can be easily obtained in the case D = E = F = 0, and takes the form

$$u(\xi, \eta) = G(\xi) + H(\eta)$$

where G and H are functions determined by the boundary and initial conditions.
The second order wave equation

$$u_{tt} = \kappa^2 u_{xx},$$

where κ is the wave speed, is an example of a hyperbolic equation, since its b² − 4ac =
4κ² > 0. The slopes of the characteristic curves are given by

$$\frac{dx}{dt} = \pm\kappa, \qquad (2.24)$$

which, for constant κ, gives the two families of characteristics:

$$\xi = x - \kappa t, \qquad \eta = x + \kappa t \qquad (2.25)$$
Initial conditions are needed to solve this problem; assuming they are of the form

$$u(x, 0) = f(x), \qquad u_t(x, 0) = g(x),$$

and writing the solution as u = F(ξ) + G(η), the initial conditions, after integrating the
condition on u_t from an arbitrary point x_0 (with α an integration variable), provide two
equations in the two unknowns F and G, and the system can be solved to get:

$$F(x) = \frac{f(x)}{2} - \frac{1}{2\kappa}\int_{x_0}^{x} g(\alpha)\, d\alpha - \frac{F(x_0) - G(x_0)}{2} \qquad (2.30)$$

$$G(x) = \frac{f(x)}{2} + \frac{1}{2\kappa}\int_{x_0}^{x} g(\alpha)\, d\alpha + \frac{F(x_0) - G(x_0)}{2} \qquad (2.31)$$
Figure 2.1 shows the solution of the wave equation for the case where κ = 1, g = 0, and
f (x) is a square wave. The time increases from left to right. The succession of plots
shows two travelling waves, going in the positive and negative x-direction respectively
at the speed κ = 1. Notice that after the crossing, the two square waves travel with no
change of shape.
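The travelling-wave behavior is easy to reproduce numerically. The following sketch evaluates d'Alembert's solution u(x, t) = [f(x − κt) + f(x + κt)]/2, which is what (2.30)-(2.31) reduce to when g = 0; the square-wave profile mimics the initial condition of Figure 2.1, and the sample points are arbitrary.

```python
import numpy as np

kappa = 1.0

def f(x):
    """Square (top-hat) initial condition, as in Figure 2.1."""
    return np.where(np.abs(x) < 1.0, 1.0, 0.0)

def u(x, t):
    # With g = 0 the solution is two half-amplitude copies of f
    # travelling left and right at speed kappa.
    return 0.5 * (f(x - kappa * t) + f(x + kappa * t))

x = np.linspace(-5.0, 5.0, 11)
for t in (0.0, 0.4, 0.8, 1.2, 1.6, 2.0):  # the times plotted in Figure 2.1
    print(t, u(x, t))
```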
Figure 2.1: Solution to the second order wave equation. The top left figure shows the
initial conditions, and the subsequent ones the solution at t = 0.4, 0.8, 1.2, 1.6 and 2.0.
2.1.2 Parabolic Equation: b² − 4ac = 0
Since the two characteristic curves coincide, the Jacobian of the mapping, ξ_x η_y − ξ_y η_x,
vanishes identically. The coefficients of the PDE become A = B = 0. The canonical
form of the parabolic equation is then

$$u_t = \kappa^2 u_{xx} \qquad (2.35)$$
where κ now stands for a diffusion coefficient. The solution in an infinite domain can be
obtained by Fourier transforming the PDE to get:

$$\tilde{u}_t = -k^2 \kappa^2\, \tilde{u} \qquad (2.36)$$

where ũ is the Fourier transform of u:

$$\tilde{u}(k, t) = \int_{-\infty}^{\infty} u(x, t)\, e^{-ikx}\, dx, \qquad (2.37)$$

and k is the wavenumber. The transformed PDE is simply a first order ODE, and can
be easily solved with the help of the initial condition u(x, 0) = u_0(x):

$$\tilde{u}(k, t) = \tilde{u}_0\, e^{-k^2\kappa^2 t} \qquad (2.38)$$
The final solution is obtained by inverse Fourier transform, u(x, t) = F⁻¹(ũ). The latter
can be written as a convolution, since the inverse transforms u_0 = F⁻¹(ũ_0) and

$$\mathcal{F}^{-1}\!\left( e^{-k^2\kappa^2 t} \right) = \frac{1}{2\kappa\sqrt{\pi t}}\, e^{-\frac{x^2}{4\kappa^2 t}}$$

are known:

$$u(x, t) = \frac{1}{2\kappa\sqrt{\pi t}} \int_{-\infty}^{\infty} u_0(X)\, e^{-\frac{(x-X)^2}{4\kappa^2 t}}\, dX \qquad (2.39)$$
Figure 2.2: Solution to the heat equation. The figure shows the initial
condition (the top hat profile), and the solution at times t = 0.01, 0.1, 0.2.
As an example we show the solution of the heat equation using the same square initial
condition as for the wave equation. The solution can be written in terms of the error
function:

$$u(x, t) = \frac{1}{2\kappa\sqrt{\pi t}} \int_{-1}^{1} e^{-\frac{(x-X)^2}{4\kappa^2 t}}\, dX = \frac{1}{2}\left[ \mathrm{erf}\!\left( \frac{x+1}{2\kappa\sqrt{t}} \right) - \mathrm{erf}\!\left( \frac{x-1}{2\kappa\sqrt{t}} \right) \right], \qquad (2.40)$$

where erf(z) = (2/√π) ∫₀ᶻ e^{−s²} ds. The solution is shown in figure 2.2. Instead of a travelling
wave the solution shows a smearing of the two discontinuities as time increases accompa-
nied by a decrease of the maximum amplitude. As its name indicates, the heat equation
is an example of a diffusion equation where gradients in the solution tend to be smeared.
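A short sketch evaluating the error-function solution (2.40) with scipy's erf reproduces this smearing behavior; κ and the output times follow Figure 2.2, while the sample points are arbitrary.

```python
import numpy as np
from scipy.special import erf

kappa = 1.0

def u(x, t):
    """Heat-equation solution (2.40) for the top-hat initial condition."""
    s = 2.0 * kappa * np.sqrt(t)
    return 0.5 * (erf((x + 1.0) / s) - erf((x - 1.0) / s))

x = np.linspace(-5.0, 5.0, 11)
for t in (0.01, 0.1, 0.2):  # the times plotted in Figure 2.2
    print(t, np.round(u(x, t), 4))  # maximum amplitude decays as t grows
```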
Clearly, the PDEs are hyperbolic if γ > 0 and elliptic if γ < 0. To solve this system
of equations we require boundary conditions. To continue with our example, we look for
periodic solutions in x and impose the following boundary conditions:

$$u(x, 0) = \frac{\sin(Nx)}{N^2}, \qquad v(x, 0) = 0 \qquad (2.45)$$
1. Elliptic, γ = −β² < 0.
The solution for this case is then:

$$u(x, y) = \frac{\sin(Nx)}{N^2}\cosh(\beta N y), \qquad v(x, y) = \beta\,\frac{\cos(Nx)}{N^2}\sinh(\beta N y) \qquad (2.46)$$
Notice that even though the PDEs are identical, the two solutions u and v are
different because of the boundary conditions. For N → ∞ the boundary conditions
become identical, and hence one would expect the solutions to become identical.
However, it is easy to verify that |u − v| → ∞ for any finite y > 0 as N → ∞.
Hence small changes in the boundary conditions lead to large changes in the solution,
and the continuity between the boundary data and the solution has been lost. The
problem in this case is that no boundary condition has been imposed as y → ∞,
as required by the elliptic nature of the problem.
2. Hyperbolic, γ = β² > 0.
The solution is then

$$u(x, y) = \frac{\sin(Nx)}{N^2}\cos(\beta N y), \qquad v(x, y) = -\gamma\,\frac{\cos(Nx)}{N^2}\sin(\beta N y) \qquad (2.47)$$
Notice that in this case u, v → 0 when N → ∞ for any finite value of y.
2.3 First Order Systems
The second order wave equation can be recast as a first order system by defining η = κu_x
and v = u_t. Note that it is possible to consider v and η
as the components of a vector unknown w and to write the equations in vector notation as
shown in equation (2.48). In the next few sections we will look primarily at the conditions
under which a first order system is hyperbolic.
2.3.1 Scalar Equation
The simplest first order example is the scalar advection equation:

$$u_t + c\,u_x = 0, \qquad 0 \le x \le L \qquad (2.49)$$
$$u(x, t=0) = u_0(x), \qquad u(x=0, t) = u_l(t)$$
where c is the advecting velocity. Let us define c = dx/dt, so that the equation becomes:

$$\frac{\partial u}{\partial t}\, dt + \frac{\partial u}{\partial x}\, dx = du = 0 \qquad (2.50)$$
Figure 2.3: Characteristic lines for the linear advection equation. The solid lines are the
characteristics emanating from different locations on the initial axis. The dashed line
represents the signal at time t = 0 and t = 1. If the solution at (x, t) is desired, we first
need to find the foot of the characteristic x0 = x − ct, and the value of the function there
at the initial time is copied to time t.
where du is the total differential of the function u. Since the right hand side is zero,
the function must be constant along the lines dx/dt = c, and this constant must be
equal to the value of u at the initial time. The solution can then be written as:

$$u(x, t) = u_0(x_0) \qquad \text{along} \qquad \frac{dx}{dt} = c \qquad (2.51)$$
where x_0 is the location of the foot of the characteristic, the intersection of the characteristic
with the t = 0 line. The simplest way to visualize this picture is to consider
the case where c is constant. The characteristic lines can then be obtained analytically:
they are the straight lines x = x_0 + ct. A family of characteristic lines is shown
in figure 2.3, where c is assumed positive. In this example the information is travelling
from left to right at the constant speed c, and the initial hump translates to the right
without change of shape.
If the domain is of finite extent, say 0 ≤ x ≤ L, and a characteristic intersects the
line x = 0 (assuming c > 0), then a boundary condition is needed on the left boundary
to provide the information needed to determine the solution uniquely. That is, we need
to provide the variation of u at the “inlet” boundary in the form u(x = 0, t) = g(t). The
solution can now be written as:

$$u(x, t) = \begin{cases} u_0(x - ct) & \text{for } x - ct > 0 \\ g(t - x/c) & \text{for } x - ct < 0 \end{cases} \qquad (2.52)$$
Figure 2.4: Characteristics for Burgers' equation (left panel), and the solution (right
panel) at different times for a periodic problem with initial condition u(x, 0) = 1 − sin(πx).
The black line is the initial condition, the red line the solution at t = 1/8, the blue at
t = 1/4, and the magenta at t = 3/4. The solution becomes discontinuous when
characteristics intersect.

Note that since the information travels from left to right, the boundary condition is
needed at the left boundary and not on the right. The solution would be impossible
to determine had the boundary condition been given on the right boundary x = L;
the problem would then be ill-posed for lack of proper boundary conditions. The right
boundary may be called an “outlet” boundary since information is leaving the domain.
No boundary condition can be imposed there since the solution is determined from
“upstream” information. In the case where c < 0 the information travels from right to
left, and the boundary condition must be imposed at x = L.
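The characteristic construction of solution (2.52) translates directly into code: trace the characteristic through (x, t) back in time, and read off either the initial condition or the inflow data. The profiles u0 and g below are illustrative assumptions, not functions from the text.

```python
import numpy as np

c, L = 1.0, 10.0                          # advection speed (c > 0) and domain length
u0 = lambda x: np.exp(-((x - 2.0) ** 2))  # assumed initial condition
g = lambda t: np.sin(t)                   # assumed inflow data at x = 0

def u(x, t):
    x0 = x - c * t  # foot of the characteristic on the initial line
    # Characteristics with x0 > 0 carry initial data; the others carry
    # the boundary data injected at the inlet x = 0.
    return np.where(x0 > 0.0, u0(x0), g(t - x / c))

x = np.linspace(0.0, L, 6)
print(u(x, 4.0))
```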
If the advection speed c varies in x and t, then the characteristics are not necessarily
straight lines of constant slope, but are in general curves. Since the slopes of the curves
vary, characteristic lines may intersect. These intersection points are places where the
solution is discontinuous, with sharp jumps. At these locations the slopes are infinite and
the space and time derivatives become meaningless, i.e. the PDE is no longer valid. This
breakdown occurs usually because of the neglect of important physical terms, such as
dissipative terms, that act to prevent truly discontinuous solutions.
An example of an equation that can lead to discontinuous solutions is the Burgers
equation:

$$u_t + u\,u_x = 0 \qquad (2.53)$$

where c = u. This equation is nonlinear, with a variable advection speed that depends on
the solution. The characteristics are given by the lines:

$$\frac{dx}{dt} = u \qquad (2.54)$$

along which the PDE takes the form du = 0. Hence u is constant along characteristics,
which means that their slopes are also constant according to equation (2.54), and hence
the characteristics must be straight lines. Even in this nonlinear case the characteristics
are straight lines, but with varying slopes. The behavior of the solution can become quite
complicated as characteristic lines intersect, as shown in figure 2.4. The solution of
hyperbolic equations in the presence of discontinuities can become quite complicated;
we refer the interested reader to Whitham (1974); Durran (1999) for further discussions.
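Since each characteristic of Burgers' equation is the straight line x = x0 + u0(x0) t carrying the constant value u0(x0), tracking the characteristics only requires evaluating the initial condition. The sketch below uses the initial profile u(x, 0) = 1 − sin(πx) quoted in Figure 2.4.

```python
import numpy as np

u0 = lambda x: 1.0 - np.sin(np.pi * x)  # initial condition of Figure 2.4

x0 = np.linspace(-1.0, 1.0, 9)          # feet of the characteristics
for t in (0.0, 0.125, 0.25):            # times t = 0, 1/8, 1/4
    x = x0 + u0(x0) * t                 # characteristic positions at time t
    print(t, np.round(x, 3))            # non-monotone x signals crossing characteristics
```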
Example 5 The linearized equations governing tidal flow in a channel of constant cross
section and depth are the shallow water equations:

$$\frac{\partial}{\partial t}\begin{pmatrix} u \\ \eta \end{pmatrix} + \begin{pmatrix} 0 & g \\ h & 0 \end{pmatrix} \frac{\partial}{\partial x}\begin{pmatrix} u \\ \eta \end{pmatrix} = 0, \qquad A = \begin{pmatrix} 0 & g \\ h & 0 \end{pmatrix} \qquad (2.57)$$
where u and η are the unknown velocity and surface elevation, g is the gravitational
acceleration, and h the water depth. The eigenvalues of the matrix A can be found by
solving the equation:

$$\det(A - \lambda I) = 0 \quad\Leftrightarrow\quad \det\begin{pmatrix} -\lambda & g \\ h & -\lambda \end{pmatrix} = \lambda^2 - gh = 0. \qquad (2.58)$$
The two real roots of the equation are λ = ±c, where c = √(gh); hence the eigenvalues
are the familiar gravity wave speeds. Two eigenvectors of the system are u₁ and u₂,
corresponding to the positive and negative roots, respectively:

$$\mathbf{u}_1 = \begin{pmatrix} 1 \\ \frac{c}{g} \end{pmatrix}, \qquad \mathbf{u}_2 = \begin{pmatrix} 1 \\ -\frac{c}{g} \end{pmatrix}. \qquad (2.59)$$
The eigenvectors are the columns of the transformation matrix T, and we have

$$T = \begin{pmatrix} 1 & 1 \\ \frac{c}{g} & -\frac{c}{g} \end{pmatrix}, \qquad T^{-1} = \frac{1}{2}\begin{pmatrix} 1 & \frac{g}{c} \\ 1 & -\frac{g}{c} \end{pmatrix}. \qquad (2.60)$$
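These results are easy to verify numerically; the sketch below forms A for illustrative values of g and h and checks that T⁻¹AT is the diagonal matrix of wave speeds.

```python
import numpy as np

g, h = 9.81, 100.0                 # illustrative gravity and depth values
A = np.array([[0.0, g], [h, 0.0]])

lam, T = np.linalg.eig(A)          # eigenvalues and eigenvector matrix
print(lam, np.sqrt(g * h))         # eigenvalues are +/- sqrt(g h)
print(np.round(np.linalg.inv(T) @ A @ T, 10))  # diagonalized system
```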
Figure 2.5: Characteristics for the one-dimensional tidal equations. The new variables
û and η̂ are constant along the right, and left going characteristic, respectively. The
solution at the point (x, t) can be computed by finding the foot of two characteristic
curves at the initial time, xa and xb and applying the conservation of û and η̂.
$$\frac{\partial \mathbf{w}}{\partial t} + A\,\frac{\partial \mathbf{w}}{\partial x} + B\,\frac{\partial \mathbf{w}}{\partial y} = 0 \qquad (2.65)$$
where I is the identity matrix. Equation (2.67) has the form of an eigenvalue problem,
where ω represents the eigenvalues of the matrix kA + lB. The system is classified as
hyperbolic if and only if the eigenvalues ω are real for real choices of the wavenumber
vector (k, l). The extension above hinges on the matrices A and B being constant in
space and time; in spite of this limitation it does show intuitively how the general behavior
of wave-like solutions can be extended to multiple spatial dimensions.
For the case where the matrices are not constant, the definition can be extended by
requiring the existence of bounded matrices T such that the matrix T⁻¹(kA + lB)T
is a diagonal matrix with real eigenvalues for all points within a neighborhood of (x, y).
Chapter 3
Finite Difference Approximation of Derivatives
3.1 Introduction
The standard definition of the derivative in elementary calculus is

$$u'(x) = \lim_{\Delta x \to 0} \frac{u(x + \Delta x) - u(x)}{\Delta x}$$

Computers, however, cannot deal with the limit ∆x → 0, and hence a discrete analogue
of the continuous case needs to be adopted. In a discretization step, the set of points on
which the function is defined is finite, and the function value is available on a discrete set
of points. Approximations to the derivative will have to come from this discrete table of
the function.
Figure 3.1 shows the discrete set of points xi where the function is known. We
will use the notation ui = u(xi ) to denote the value of the function at the i-th node
of the computational grid. The nodes divide the axis into a set of intervals of width
∆xi = xi+1 − xi . When the grid spacing is fixed, i.e. all intervals are of equal size, we
will refer to the grid spacing as ∆x. There are definite advantages to a constant grid
spacing as we will see later.
Figure 3.1: Computational grid and examples of backward, forward, and central approximations
to the derivative at point x_i. The dash-dot line shows the centered parabolic
interpolation, while the dashed lines show the backward (blue), forward (red) and centered
(magenta) linear interpolations to the function.
The forward difference above is not the only approximation possible; two equally valid approximations are
the backward difference

$$u'(x_i) \approx \frac{u(x_i) - u(x_i - \Delta x)}{\Delta x} = \frac{u_i - u_{i-1}}{\Delta x} \qquad (3.3)$$

and the centered difference

$$u'(x_i) \approx \frac{u(x_i + \Delta x) - u(x_i - \Delta x)}{2\Delta x} = \frac{u_{i+1} - u_{i-1}}{2\Delta x} \qquad (3.4)$$
The error in these approximations can be quantified with Taylor series, starting from the identity

$$u(x) = u(x_i) + \int_{x_i}^{x} u'(s)\, ds \qquad (3.5)$$

Since u(x) is arbitrary, the formula should hold with u(x) replaced by u′(x), i.e.,

$$u'(x) = u'(x_i) + \int_{x_i}^{x} u''(s)\, ds \qquad (3.6)$$
Replacing this expression in the original formula and carrying out the integration (since
u(x_i) is constant) we get:

$$u(x) = u(x_i) + (x - x_i)\, u'(x_i) + \int_{x_i}^{x}\!\!\int_{x_i}^{x} u''(s)\, ds\, ds \qquad (3.7)$$
The process can be repeated with

$$u''(x) = u''(x_i) + \int_{x_i}^{x} u'''(s)\, ds \qquad (3.8)$$

to get:

$$u(x) = u(x_i) + (x - x_i)\, u'(x_i) + \frac{(x - x_i)^2}{2!} u''(x_i) + \int_{x_i}^{x}\!\!\int_{x_i}^{x}\!\!\int_{x_i}^{x} u'''(s)\, ds\, ds\, ds \qquad (3.9)$$
This process can be repeated under the assumption that u(x) is sufficiently differen-
tiable, and we find:
$$u(x) = u(x_i) + (x - x_i)\, u'(x_i) + \frac{(x - x_i)^2}{2!} u''(x_i) + \cdots + \frac{(x - x_i)^n}{n!} u^{(n)}(x_i) + R_{n+1} \qquad (3.10)$$

where the remainder is given by:

$$R_{n+1} = \int_{x_i}^{x} \cdots \int_{x_i}^{x} u^{(n+1)}(x)\, (ds)^{n+1} \qquad (3.11)$$
Equation 3.10 is known as the Taylor series of the function u(x) about the point xi .
Notice that the series is a polynomial in (x − xi ) (the signed distance of x to xi ), and
the coefficients are the (scaled) derivatives of the function evaluated at xi .
If the (n+1)-th derivative of the function u has minimum m and maximum M over
the interval [x_i, x], then we can write:

$$\int_{x_i}^{x} \cdots \int_{x_i}^{x} m\, (ds)^{n+1} \;\le\; R_{n+1} \;\le\; \int_{x_i}^{x} \cdots \int_{x_i}^{x} M\, (ds)^{n+1} \qquad (3.12)$$

$$m\,\frac{(x - x_i)^{n+1}}{(n+1)!} \;\le\; R_{n+1} \;\le\; M\,\frac{(x - x_i)^{n+1}}{(n+1)!} \qquad (3.13)$$
which shows that the remainder is bounded by the values of the derivative and by the
distance of the point x from the expansion point x_i raised to the power (n+1). If we
further assume that u^(n+1) is continuous, then it must take all values between m and M;
that is,

$$R_{n+1} = u^{(n+1)}(\xi)\, \frac{(x - x_i)^{n+1}}{(n+1)!} \qquad (3.14)$$

for some ξ in the interval [x_i, x].
The Taylor series expansion can be used to get an expression for the truncation error
of the backward difference formula:

$$u(x_i - \Delta x_{i-1}) = u(x_i) - \Delta x_{i-1} \left.\frac{\partial u}{\partial x}\right|_{x_i} + \frac{\Delta x_{i-1}^2}{2!} \left.\frac{\partial^2 u}{\partial x^2}\right|_{x_i} - \frac{\Delta x_{i-1}^3}{3!} \left.\frac{\partial^3 u}{\partial x^3}\right|_{x_i} + \ldots \qquad (3.18)$$

where ∆x_{i−1} = x_i − x_{i−1}. We can now get an expression for the error corresponding to
the backward difference approximation of the first derivative:

$$\frac{u(x_i) - u(x_i - \Delta x_{i-1})}{\Delta x_{i-1}} - \left.\frac{\partial u}{\partial x}\right|_{x_i} = \underbrace{-\frac{\Delta x_{i-1}}{2!} \left.\frac{\partial^2 u}{\partial x^2}\right|_{x_i} + \frac{\Delta x_{i-1}^2}{3!} \left.\frac{\partial^3 u}{\partial x^3}\right|_{x_i} + \ldots}_{\text{Truncation Error}} \qquad (3.19)$$
It is now clear that the truncation error of the backward difference, while not the same
as that of the forward difference, behaves similarly in terms of order of magnitude analysis, and
is linear in ∆x:

$$\left.\frac{\partial u}{\partial x}\right|_{x_i} = \frac{u_i - u_{i-1}}{\Delta x_{i-1}} + O(\Delta x) \qquad (3.20)$$

Notice that in both cases we have used the information provided at just two points to
derive the approximation, and the error behaves linearly in both instances.
Higher order approximations of the first derivative can be obtained by combining the
two Taylor series expansions (3.15) and (3.18). Notice first that the high order derivatives
of the function u are all evaluated at the same point x_i, and are the same in both
expansions. We can now form a linear combination of the equations whereby the leading
error term is made to vanish. In the present case this can be done by inspection of
equations (3.16) and (3.19): multiplying the first by ∆x_{i−1} and the second by ∆x_i and
adding both equations, we get:

$$\frac{1}{\Delta x_i + \Delta x_{i-1}} \left[ \Delta x_{i-1}\, \frac{u_{i+1} - u_i}{\Delta x_i} + \Delta x_i\, \frac{u_i - u_{i-1}}{\Delta x_{i-1}} \right] - \left.\frac{\partial u}{\partial x}\right|_{x_i} = \frac{\Delta x_{i-1}\Delta x_i}{3!} \left.\frac{\partial^3 u}{\partial x^3}\right|_{x_i} + \ldots \qquad (3.21)$$
There are several points to note about the preceding expression. First, the approximation
uses information about the function u at three points: x_{i−1}, x_i and x_{i+1}. Second, the
truncation error is T.E. ∼ O(∆x_{i−1}∆x_i) and is second order, that is, if the grid spacing is
decreased by 1/2, the T.E. decreases by a factor of 2². Thirdly, the previous point can
be made clearer by focussing on the important case where the grid spacing is constant:
with ∆x_{i−1} = ∆x_i = ∆x, the expression simplifies to:

$$\frac{u_{i+1} - u_{i-1}}{2\Delta x} - \left.\frac{\partial u}{\partial x}\right|_{x_i} = \frac{\Delta x^2}{3!} \left.\frac{\partial^3 u}{\partial x^3}\right|_{x_i} + \ldots \qquad (3.22)$$
Hence, for an equally spaced grid the centered difference approximation converges quadratically
as ∆x → 0:

$$\left.\frac{\partial u}{\partial x}\right|_{x_i} = \frac{u_{i+1} - u_{i-1}}{2\Delta x} + O(\Delta x^2) \qquad (3.23)$$
Note that like the forward and backward Euler difference formula, the centered differ-
ence uses information at only two points but delivers twice the order of the other two
methods. This property will hold in general whenever the grid spacing is constant and
the computational stencil, i.e. the set of points used in approximating the derivative, is
symmetric.
Equation (3.22) provides us with the simplest way to derive a fourth order approximation.
An important property of this centered formula is that its truncation error
contains only odd derivative terms.
3.3.3 Remarks
The validity of the Taylor series analysis of the truncation error depends on the existence
of higher order derivatives. If these derivatives do not exist, then the higher order
approximations cannot be expected to hold. To demonstrate the issue more clearly we
will look at specific examples.
Example 6 The function u(x) = sin πx is infinitely smooth and differentiable, and its
first derivative is given by u_x = π cos πx. Given the smoothness of the function, we expect
the Taylor series analysis of the truncation error to hold. We set about verifying this claim
in a practical calculation. We lay down a computational grid on the interval −1 ≤ x ≤ 1
with constant grid spacing ∆x = 2/M. The approximation points are then x_i = i∆x − 1,
i = 0, 1, …, M.
Figure 3.2: Finite difference approximation to the derivative of the function sin πx. The
top left panel shows the function as a function of x. The top right panel shows the
spatial distribution of the error using the forward difference (black line), the backward
difference (red line), and the centered differences of various order (magenta lines) for the
case M = 1024; the centered difference curves lie atop each other because their errors
are much smaller than those of the first order schemes. The lower panels are convergence
curves showing the rate of decrease of the rms and maximum errors as the number of
grid cells increases.
The numerical approximation ũ_x will be computed using the forward difference, equation
(3.17), the backward difference, equation (3.20), and the centered difference approximations
of order 2, 4 and 6, equations (3.22), (3.27), and (3.29). We will use two measures
to characterize the error ε_i and to measure its rate of decrease as the number of grid
points is increased. One is a bulk measure, consisting of the root mean square error,
and the other consists of the maximum error magnitude. We will use the following
notations for the rms and max errors:

$$\|\epsilon\|_2 = \left( \Delta x \sum_{i=0}^{M} \epsilon_i^2 \right)^{\frac{1}{2}} \qquad (3.31)$$

$$\|\epsilon\|_\infty = \max_{0 \le i \le M} |\epsilon_i| \qquad (3.32)$$
The top right panel of figure 3.2 shows the variation of ε as a function of x for the case
M = 1024 for several finite difference approximations to u_x. For the first order schemes
the errors peak at ±1/2 and reach 0.01. The error is much smaller for the higher order
centered difference schemes. The lower panels of figure 3.2 show the decrease of the rms
error (‖ε‖₂ on the left) and maximum error (‖ε‖∞ on the right) as a function of the
number of cells M. It is seen that the convergence rate increases with an increase in
the order of the approximation, as predicted by the Taylor series analysis. The slopes
on this log-log plot are −1 for the forward and backward differences, and −2, −4 and −6 for the
centered difference schemes of order 2, 4 and 6, respectively. Notice that the maximum
error decreases at the same rate as the rms error even though it reports a higher error.
Finally, if one were to gauge the efficiency with which information is used, it is
evident that for a given M the high order methods achieve the lowest error.
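The convergence test of this example is compact enough to reproduce; the sketch below compares the forward, backward, and second order centered differences for u = sin πx and prints the maximum error on a sequence of grids (the error ratios between successive grids exhibit the first and second order rates).

```python
import numpy as np

for M in (16, 64, 256, 1024):
    dx = 2.0 / M
    x = -1.0 + dx * np.arange(M + 1)        # grid points on [-1, 1]
    u = np.sin(np.pi * x)
    ux = np.pi * np.cos(np.pi * x)          # exact derivative
    fwd = (np.roll(u, -1) - u) / dx         # forward difference
    bwd = (u - np.roll(u, 1)) / dx          # backward difference
    ctr = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)  # centered difference
    i = slice(1, M)                         # interior points only
    errs = [np.max(np.abs(d[i] - ux[i])) for d in (fwd, bwd, ctr)]
    print(M, ["%.2e" % e for e in errs])
```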
Notice that the function and its first two derivatives are continuous at x = 0, but the
third derivative is discontinuous. An examination of the graph of the function in figure
3.3 shows a smooth curve, at least visually (the so-called eye-ball norm). The error distribution
is shown in the top right panel of figure 3.3 for the case M = 1024 and the fourth order
centered difference scheme. Notice that the error is very small except for the spike near
the discontinuity. The error curves (in the lower panels) show that the second order
centered difference converges faster than the forward and backward Euler schemes, but
Figure 3.3: Finite difference approximation to the derivative of a function with discon-
tinuous third derivative. The top left panel shows the function u(x) which, to the eyeball
norm, appears to be quite smooth. The top right panel shows the spatial distribution
of the error (M = 1024) using the fourth order centered difference: notice the spike at
the discontinuity in the derivative. The lower panels are convergence curves showing the
rate of decrease of the rms and maximum errors as the number of grid cells increases.
that the convergence rates of the fourth and sixth order centered schemes are no better
than that of the second order one. This is a direct consequence of the discontinuity in
the third derivative, whereby the Taylor expansion is valid only up to the third term. The
effects of the discontinuity are more clearly seen in the maximum error plot (lower right
panel) than in the mean error one (lower left panel). The main message of this example
is that for functions with a finite number of derivatives, the Taylor series prediction for
the high order schemes does not hold. Notice that the errors for the fourth and sixth
order schemes are lower than those of the other three, but their rate of convergence is the same
as that of the second order scheme. This is largely coincidental and would change according to the
function.
and hence we need at least four points in order to determine its solution uniquely. More
generally, if we need the k-th derivative, then the highest derivative to be neglected must
be of order k + p − 1, and hence k + p − 1 points are needed. The equations will then
have the form:

$$b_q = \sum_{m=-l,\; m \neq 0}^{r} m^q\, a_m = \delta_{qk}, \qquad q = 1, 2, \ldots, k + p - 1 \qquad (3.35)$$

where δ_{qk} is the Kronecker delta: δ_{qk} = 1 if q = k and 0 otherwise. For the solution to
exist and be unique we must have l + r = k + p. Once the solution is obtained we can
determine the leading order truncation term by calculating the coefficient multiplying
the next higher derivative in the truncation error series:

$$\sum_{m=-l,\; m \neq 0}^{r} m^{k+p}\, a_m. \qquad (3.36)$$
Example 8 As an example of the application of the previous procedure, let us fix the
stencil to r = 1 and l = 3, i.e. the stencil points are x_{i−3}, x_{i−2}, x_{i−1} and x_{i+1}. Notice
that this is an off-centered scheme. The system of equations then reads as follows in
matrix form:

$$\begin{pmatrix} -3 & -2 & -1 & 1 \\ (-3)^2 & (-2)^2 & (-1)^2 & 1^2 \\ (-3)^3 & (-2)^3 & (-1)^3 & 1^3 \\ (-3)^4 & (-2)^4 & (-1)^4 & 1^4 \end{pmatrix} \begin{pmatrix} a_{-3} \\ a_{-2} \\ a_{-1} \\ a_1 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} \qquad (3.37)$$

If the first derivative is desired to fourth order accuracy, we would set b_1 = 1 and
b_{2,3,4} = 0, while if the second derivative is required to third order accuracy we would set
b_{1,3,4} = 0 and b_2 = 1. The coefficients for the first case would be:

$$\begin{pmatrix} a_{-3} \\ a_{-2} \\ a_{-1} \\ a_1 \end{pmatrix} = \frac{1}{12}\begin{pmatrix} -1 \\ 6 \\ -18 \\ 3 \end{pmatrix} \qquad (3.38)$$
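The system (3.37) can be solved mechanically; the sketch below builds the small Vandermonde-like matrix and recovers the coefficients of equation (3.38).

```python
import numpy as np

m = np.array([-3.0, -2.0, -1.0, 1.0])        # stencil offsets
V = np.vstack([m**q for q in (1, 2, 3, 4)])  # rows: offsets to the powers q = 1..4
b = np.array([1.0, 0.0, 0.0, 0.0])           # first derivative, fourth order accuracy
a = np.linalg.solve(V, b)
print(a * 12)  # expect [-1, 6, -18, 3], i.e. (1/12)(-1, 6, -18, 3)
```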
It is convenient to define the centered difference operators:

$$\delta_{nx} u_i = \frac{u_{i+\frac{n}{2}} - u_{i-\frac{n}{2}}}{n\,\Delta x} \qquad (3.39)$$

so that, in particular,

$$\delta_x u_i = \frac{u_{i+\frac{1}{2}} - u_{i-\frac{1}{2}}}{\Delta x} = u_x + O(\Delta x^2) \qquad (3.40)$$

$$\delta_{2x} u_i = \frac{u_{i+1} - u_{i-1}}{2\Delta x} = u_x + O(\Delta x^2) \qquad (3.41)$$
$$\delta_{2x} u_i = u_x + \frac{\Delta x^2}{3!}\, u_{xxx} + O(\Delta x^4) \qquad (3.46)$$

If I can estimate the third order derivative to second order, then I can substitute this
estimate in the above formula to get a fourth order estimate. Applying the δ_x² operator
to both sides of the above equation we get:

$$\delta_x^2 (\delta_{2x} u_i) = \delta_x^2 \left( u_x + \frac{\Delta x^2}{3!}\, u_{xxx} + O(\Delta x^4) \right) = u_{xxx} + O(\Delta x^2) \qquad (3.47)$$

Thus we have

$$\delta_{2x} u_i = u_x + \frac{\Delta x^2}{3!}\, \delta_x^2 \left[ \delta_{2x} u_i + O(\Delta x^2) \right] \qquad (3.48)$$

Rearranging the equation we have:

$$\left. u_x \right|_{x_i} = \left( 1 - \frac{\Delta x^2}{3!}\, \delta_x^2 \right) \delta_{2x} u_i + O(\Delta x^4) \qquad (3.49)$$
A Taylor series analysis will show this approximation to be linear. Likewise, if a linear
interpolation is used to interpolate the function in x_{i−1} ≤ x ≤ x_i, we get the backward
difference formula.
Notice that these expressions are identical to the formulae obtained earlier. A Taylor
series analysis would confirm that both expressions are second order accurate.
Figure 3.4: Illustration of the Runge phenomenon for equally-spaced Lagrangian interpolation
(upper figures). The upper right figure illustrates the worsening amplitude of
the oscillations as the degree is increased. The Runge oscillations are suppressed if an
unequally spaced set of interpolation points is used (lower panel); here one based on the
Gauss-Lobatto roots of Chebyshev polynomials. The solid black line refers to the
exact solution and the dashed lines to the Lagrangian interpolants. The location of the
interpolation points can be guessed from the crossings of the dashed lines and the solid black
line.
At a fixed point near the boundary, the oscillations' amplitude becomes bigger as the
polynomial degree is increased: the amplitude for the 16th degree polynomial reaches a value
of 17 and has to be plotted separately for clarity of presentation. This is not the case
when a non-uniform grid is used for the interpolation, as shown in the lower left panel of
figure 3.4. The interpolants approach the true function in the center and at the edges of
the interval. The points used in this case are the Gauss-Lobatto roots of the Chebyshev
polynomial of degree N − 1, where N is the number of points.
where u^(n) is the n-th derivative of u with respect to x at x_i, and m is an arbitrary
integer. From this expression it is easy to obtain the following sum and difference formulas:

$$u_{i+m} \pm u_{i-m} = \sum_{n=0}^{\infty} \frac{(m\Delta x)^n}{n!} \left( 1 \pm (-1)^n \right) u^{(n)} \qquad (3.57)$$

$$\frac{u_{i+m} + u_{i-m}}{2} = \sum_{n=0,2,4}^{\infty} \frac{(m\Delta x)^n}{n!}\, u^{(n)} \qquad (3.58)$$

$$\frac{u_{i+m} - u_{i-m}}{2m\Delta x} = \sum_{n=0,2,4}^{\infty} \frac{(m\Delta x)^n}{(n+1)!}\, u^{(n+1)} \qquad (3.59)$$
These expansions apply to arbitrary functions u as long as the expansion is valid, so they
apply in particular to the derivatives of u. If we substitute u^(1) for u in the sum
expansion we obtain an expression for the expansion of the derivatives:

$$\frac{u^{(1)}_{i+m} + u^{(1)}_{i-m}}{2} = \sum_{n=0,2,4}^{\infty} \frac{(m\Delta x)^n}{n!}\, u^{(n+1)} \qquad (3.60)$$
A compact difference scheme seeks an approximation of the form

$$\alpha\, u^{(1)}_{i-1} + u^{(1)}_i + \alpha\, u^{(1)}_{i+1} = a_1\, \delta_{2x} u_i + a_2\, \delta_{4x} u_i + a_3\, \delta_{6x} u_i \qquad (3.61)$$

where α and the a_i are unknown constants. The Taylor series expansions of the left and
right hand sides can be matched as follows:

$$u^{(1)}_i + 2\alpha \sum_{n=0,2,4}^{\infty} \frac{\Delta x^n}{n!}\, u^{(n+1)} = \sum_{n=0,2,4}^{\infty} \frac{a_1 + 2^n a_2 + 3^n a_3}{(n+1)!}\, \Delta x^n\, u^{(n+1)} \qquad (3.62)$$

or

$$\sum_{n=0,2,4}^{\infty} \frac{(a_1 + 2^n a_2 + 3^n a_3) - (n+1)(\delta_{n0} + 2\alpha)}{(n+1)!}\, \Delta x^n\, u^{(n+1)} = 0 \qquad (3.63)$$
Here δ_{n0} refers to the Kronecker delta: δ_{nm} = 0 for n ≠ m and 1 if n = m. This leads to
the following constraints on the constants a_i and α:

$$a_1 + a_2 + a_3 = 1 + 2\alpha \qquad \text{for } n = 0 \qquad (3.64)$$

$$a_1 + 2^n a_2 + 3^n a_3 = 2(n+1)\alpha \qquad \text{for } n = 2, 4, \ldots, N \qquad (3.65)$$

with leading truncation error term

$$\frac{a_1 + 2^{N+2} a_2 + 3^{N+2} a_3 - 2(N+3)\alpha}{(N+3)!}\, \Delta x^{N+2}\, u^{(N+3)} \qquad (3.66)$$

Since we only have a handful of parameters we cannot hope to satisfy all these constraints
for all n. However, we can derive progressively better approximations by matching higher-order
terms. Indeed, with 4 parameters at our disposal we can only satisfy 4 constraints.
Let us explore some of the possible options.
A fourth order scheme is obtained by setting a_2 = a_3 = 0 and matching the n = 0 and n = 2 constraints:

$$\begin{cases} a_1 - 2\alpha = 1 \\ \dfrac{a_1}{3!} - \dfrac{2\alpha}{2!} = 0 \end{cases} \qquad \text{with solution} \qquad \alpha = \frac{1}{4}, \quad a_1 = \frac{3}{2} \qquad (3.67)$$
A family of fourth order schemes can be obtained if we allow a wider stencil on the
right hand side of the equations, and allow a_2 ≠ 0. This family of schemes can be
generated by the single parameter α:

$$\begin{cases} a_1 + a_2 = 1 + 2\alpha \\ a_1 + 4 a_2 = 6\alpha \end{cases} \qquad \text{with solution} \qquad a_1 = \frac{2}{3}(\alpha + 2), \quad a_2 = \frac{1}{3}(4\alpha - 1) \qquad (3.68)$$

The truncation error of this family is

$$\mathrm{T.E.} = \frac{a_1 + 2^4 a_2 - 2\cdot 5\alpha}{5!}\, \Delta x^4 u^{(5)} + \frac{a_1 + 2^6 a_2 - 2\cdot 7\alpha}{7!}\, \Delta x^6 u^{(7)} \qquad (3.69)$$

$$= \frac{4(3\alpha - 1)}{5!}\, \Delta x^4 u^{(5)} + \frac{4(18\alpha - 5)}{7!}\, \Delta x^6 u^{(7)} \qquad (3.70)$$

This family of compact schemes can be made unique by, for example, requiring the scheme
to be sixth-order and setting α = 1/3; the leading truncation error term would then be
4/(7!) ∆x⁶ u^(7).
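In practice the compact scheme requires a tridiagonal solve for the derivative values. The sketch below applies the fourth order scheme of equation (3.67) (α = 1/4, a1 = 3/2) on a periodic grid, using a dense solver for clarity; a production code would use a cyclic tridiagonal algorithm.

```python
import numpy as np

M = 64
dx = 2.0 / M
x = -1.0 + dx * np.arange(M)   # periodic grid on [-1, 1)
u = np.sin(np.pi * x)
alpha, a1 = 0.25, 1.5          # the alpha = 1/4 scheme of (3.67)

# Left-hand side: alpha u'_{i-1} + u'_i + alpha u'_{i+1}, with periodic wrap.
A = np.eye(M) + alpha * (np.eye(M, k=1) + np.eye(M, k=-1))
A[0, -1] = A[-1, 0] = alpha
# Right-hand side: a1 * delta_{2x} u_i.
rhs = a1 * (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
du = np.linalg.solve(A, rhs)
print(np.max(np.abs(du - np.pi * np.cos(np.pi * x))))  # error ~ O(dx^4)
```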
Figure 3.5: Convergence curves for the compact (solid lines) and explicit centered difference
(dashed lines) schemes for the sample function u = sin πx, |x| ≤ 1. The dash-dot
lines refer to lines of slope −2, −4, −6, and −8. The left panel shows the 2-norm of the error
while the right panel shows the ∞-norm.
Chapter 4
Application of Finite Differences to ODEs

In this chapter we explore the application of finite differences in the simplest setting
possible, namely where there is only one independent variable. The equations are then
referred to as ordinary differential equations (ODEs). We will use the setting of ODEs to
introduce several concepts and tools that will be useful in the numerical solution of partial
differential equations. Furthermore, time-marching schemes are almost universally reliant
on finite difference methods for discretization, and hence a study of ODEs is time well
spent.
4.1 Introduction
Here we show how an ODE may be obtained in the process of solving a partial
differential equation numerically. Let us consider the problem of solving the advection-diffusion equation

$$u_t + c\,u_x = \nu\,u_{xx} \qquad (4.1)$$

with periodic boundary conditions on the interval [0, L]. The solution can be expanded in a Fourier series:

$$u(x, t) = \sum_{n=-\infty}^{\infty} \hat{u}_n(t)\, e^{i k_n x} \qquad (4.2)$$

where the û_n are the complex amplitudes and depend only on the time variable, whereas the e^{ik_n x}
are the Fourier functions with wavenumber k_n. Because of the periodicity requirement
we have k_n = 2πn/L, where n is an integer. The Fourier functions form what is called an
orthonormal basis, and the amplitudes can be determined as follows: multiply the two sides of equation
(4.2) by e^{−ik_m x}, where m is an integer, and integrate over the interval [0, L] to get:

$$\int_0^L u\, e^{-i k_m x}\, dx = \sum_{n=-\infty}^{\infty} \hat{u}_n(t) \int_0^L e^{i(k_n - k_m)x}\, dx \qquad (4.3)$$
Now notice that the integral on the right hand side of equation (4.3) satisfies the orthogonality
property:

$$\int_0^L e^{i(k_n - k_m)x}\, dx = \begin{cases} \dfrac{e^{i(k_n - k_m)L} - 1}{i(k_n - k_m)} = \dfrac{e^{i 2\pi(n-m)} - 1}{i(k_n - k_m)} = 0, & n \neq m \\[8pt] L, & n = m \end{cases} \qquad (4.4)$$
The role of the integration is to pick out the m-th Fourier component, since all the other
integrals are zero. We end up with the following expression for the Fourier coefficients:

$$\hat{u}_m = \frac{1}{L} \int_0^L u(x)\, e^{-i k_m x}\, dx \qquad (4.5)$$
Equation (4.5) would allow us to calculate the Fourier coefficients of a known function
u. Note that for a real function the Fourier coefficients satisfy

$$\hat{u}_{-m} = \hat{u}_m^* \qquad (4.6)$$

where the * superscript stands for the complex conjugate. Thus, only the positive Fourier
components need to be determined, and the negative ones are simply the complex conjugates
of the positive components.
The Fourier series (4.2) can now be differentiated term by term to get expressions
for the derivatives of u, namely:

$$u_x = \sum_{n=-\infty}^{\infty} i k_n\, \hat{u}_n(t)\, e^{i k_n x} \qquad (4.7)$$

$$u_{xx} = \sum_{n=-\infty}^{\infty} -k_n^2\, \hat{u}_n(t)\, e^{i k_n x} \qquad (4.8)$$

$$u_t = \sum_{n=-\infty}^{\infty} \frac{d\hat{u}_n}{dt}\, e^{i k_n x} \qquad (4.9)$$
Replacing these expressions for the derivatives in the original equation and collecting
terms we arrive at the following equation:

$$\sum_{n=-\infty}^{\infty} \left[ \frac{d\hat{u}_n}{dt} + (i c k_n + \nu k_n^2)\, \hat{u}_n \right] e^{i k_n x} = 0 \qquad (4.10)$$

Note that the above equation has to be satisfied for all x, and hence each Fourier amplitude
must vanish individually (just remember the orthogonality property (4.4), and replace u by
zero). Each Fourier component can be studied separately thanks to the linearity and the
constant coefficients of the PDE.
The governing equation for the Fourier amplitude is now

$$\frac{d\hat{u}}{dt} = \underbrace{-(i c k + \nu k^2)}_{\kappa}\, \hat{u} \qquad (4.11)$$
where we have removed the subscript n to simplify the notation, and have introduced
the complex number κ. The solution to this simple ODE is:

$$\hat{u}(t) = \hat{u}_0\, e^{\kappa t} \qquad (4.12)$$

where û₀ = û(t = 0) is the Fourier amplitude at the initial time. Taking the ratio of the
solution between times t and t + ∆t, we can see the expected behavior of the solution
between two consecutive times:

$$\frac{\hat{u}(t + \Delta t)}{\hat{u}(t)} = e^{\kappa\Delta t} = e^{\mathrm{Re}(\kappa)\Delta t}\, e^{i\,\mathrm{Im}(\kappa)\Delta t} \qquad (4.13)$$

where Re(κ) and Im(κ) refer to the real and imaginary parts of κ. It is now easy to
follow the evolution of the amplitude of the Fourier components:

$$\left| \frac{\hat{u}(t + \Delta t)}{\hat{u}(t)} \right| = e^{\mathrm{Re}(\kappa)\Delta t} \qquad (4.14)$$

4.2 Forward Euler Approximation
The simplest discretization of equation (4.11) replaces the time derivative with a forward
difference and evaluates the right hand side at the known time level:

$$\frac{u^{n+1} - u^n}{\Delta t} \approx \kappa u^n \qquad (4.15)$$
where the superscript indicates the time level: u^n = u(n∆t), and where we have removed
the hat for simplicity. Equation (4.15) is an explicit approximation to the original differential
equation since no information about the unknown function at the future time
(n + 1)∆t has been used on the right hand side of the equation. In order to derive the
error committed in the approximation we rely again on Taylor series. Expanding u^{n+1}
about time level n∆t, and inserting in the forward difference expression (4.15), we get:
$$u_t - \kappa u = \underbrace{-\frac{\Delta t}{2}\, u_{tt} - \frac{\Delta t^2}{3!}\, u_{ttt} - \ldots}_{\text{truncation error}\; \sim O(\Delta t)} \qquad (4.16)$$
The terms on the right hand side are the truncation errors of the forward Euler ap-
proximation. The formal definition of the truncation error is that it is the difference
between the analytical and approximate representation of the differential equation. The
leading error term (for sufficiently small ∆t) is linear in ∆t and hence we expect the
errors to decrease linearly. Most importantly, the approximation is consistent in that
the truncation error goes to zero as ∆t → 0.
Given the initial condition u(t = 0) = u₀ we can advance the solution in time to get:

$$\begin{aligned} u^1 &= (1 + \kappa\Delta t)\, u^0 \\ u^2 &= (1 + \kappa\Delta t)\, u^1 = (1 + \kappa\Delta t)^2 u^0 \\ u^3 &= (1 + \kappa\Delta t)\, u^2 = (1 + \kappa\Delta t)^3 u^0 \\ &\;\vdots \\ u^n &= (1 + \kappa\Delta t)\, u^{n-1} = (1 + \kappa\Delta t)^n u^0 \end{aligned} \qquad (4.17)$$
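The recursion is trivial to evaluate numerically. The sketch below takes κ purely imaginary (Re(κ) = 0, the oscillatory case discussed below), for which the forward Euler amplitude grows even though the exact amplitude stays equal to 1; the values of ∆t and the number of steps are arbitrary.

```python
import numpy as np

kappa = 1j          # purely imaginary: neutral analytic solution
dt, nsteps = 0.1, 100

u = 1.0 + 0.0j
for n in range(nsteps):
    u = (1.0 + kappa * dt) * u           # forward Euler update, as in (4.17)

print(abs(u))                            # numerical amplitude: > 1, grows
print(abs(np.exp(kappa * dt * nsteps)))  # exact amplitude: exactly 1
```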
Let us study what happens when we let ∆t → 0 for a fixed integration time t_n = n∆t.
The only factor we need to worry about is the numerical amplification factor (1 + κ∆t):

$$\lim_{\Delta t \to 0} (1 + \kappa\Delta t)^{\frac{t_n}{\Delta t}} = \lim_{\Delta t \to 0} e^{\frac{t_n}{\Delta t}\ln(1 + \kappa\Delta t)} = \lim_{\Delta t \to 0} e^{\frac{t_n}{\Delta t}\left(\kappa\Delta t - \frac{\kappa^2\Delta t^2}{2} + \ldots\right)} = e^{\kappa t_n} \qquad (4.18)$$

where we have used the logarithm Taylor series ln(1 + ε) = ε − ε²/2 + …, assuming that
κ∆t is small. Hence we have proven convergence of the numerical solution to the analytic
solution in the limit ∆t → 0. The question is: what happens for finite ∆t?
Notice that in analogy with the analytic solution we can define an amplification factor
associated with the numerical solution, namely:

$$A = \frac{u^n}{u^{n-1}} = |A|\, e^{i\theta} \qquad (4.19)$$

where θ is the argument of the complex number A. The amplitude of A determines
whether the numerical solution is amplifying or decaying, and its argument determines
the change in phase. The numerical amplification factor should mimic the analytical
amplification factor, and should lead to an analogous increase or decrease of the solution.
For small κ∆t it can be seen that A = 1 + κ∆t consists of just the first two terms of the Taylor
series expansion of e^{κ∆t}, and is hence only first order accurate. Let us investigate the magnitude of A in
terms of κ, a problem parameter, and ∆t, the numerical parameter; we have:

$$|A|^2 = A A^* = 1 + 2\,\mathrm{Re}(\kappa)\Delta t + |\kappa|^2 \Delta t^2 \qquad (4.20)$$
We focus in particular on the condition under which the amplification factor is less than 1.
The following condition needs then to be fulfilled (assuming ∆t > 0):

$$\Delta t \le -2\, \frac{\mathrm{Re}(\kappa)}{|\kappa|^2} \qquad (4.21)$$
There are two cases to consider, depending on the sign of Re(κ). If Re(κ) > 0, then |A| > 1
for ∆t > 0, and the finite difference solution will grow like the analytical solution. For
Re(κ) = 0, the solution will also grow in amplitude whereas the analytical solution
predicts a neutral amplification. If Re(κ) < 0, then |A| > 1 for ∆t > −2 Re(κ)/|κ|², whereas
the analytical solution predicts a decay. The moral of the story is that the numerical
solution can behave in unexpected ways. We can rewrite the amplification factor in the
following form:

$$|A|^2 = A A^* = [\mathrm{Re}(z) + 1]^2 + [\mathrm{Im}(z)]^2 \qquad (4.22)$$

where z = κ∆t. The above equation can be interpreted as the equation of a circle
centered at (−1, 0) in the complex plane with radius |A|. Thus z must be within the
unit circle centered at (−1, 0) for |A| ≤ 1.
The error satisfies the recursion e^n = A e^{n−1} − T^{n−1}∆t, where e^n = U^n − u^n is the total
error at time t_n = n∆t and T^{n−1} is the truncation error. The reapplication of this formula
to e^{n−1} transforms it to:

$$e^n = A\left( A e^{n-2} - T^{n-2}\Delta t \right) - T^{n-1}\Delta t = A^2 e^{n-2} - \Delta t \left[ A\, T^{n-2} + T^{n-1} \right] \qquad (4.28)$$

where A² = A·A, and the remaining superscripts indicate time levels. Repeated
application of this formula shows that:

$$e^n = A^n e^0 - \Delta t \left[ A^{n-1} T^0 + A^{n-2} T^1 + \ldots + A\, T^{n-2} + T^{n-1} \right] \qquad (4.29)$$

Equation (4.29) shows that the error at time level n depends on the initial error, on
the history of the truncation error, and on the discretization through the factor A. We
will now attempt to bound this error and show that this is possible if the truncation error
can be made arbitrarily small (the consistency condition), and if the scheme is stable
according to the definition shown above. A simple application of the triangle inequality
shows that

$$|e^n| \le |A|^n |e^0| + \Delta t \left[ |A|^{n-1} |T^0| + |A|^{n-2} |T^1| + \ldots + |A|\, |T^{n-2}| + |T^{n-1}| \right] \qquad (4.30)$$
In order to proceed further we need to introduce a maximum bound on the
amplification factor and all its powers. So let us assume that the scheme is stable, i.e.
there is a positive constant C, independent of ∆t and n, such that |A^m| ≤ C for all powers
m ≤ n. Let T denote the maximum truncation error, T = max_k |T^k|. Since the individual
entries in the summation of (4.30) are then smaller than CT and the sum involves n
terms, the sum must be smaller than nCT. The right hand side is then bounded above
by C|e^0| + nT∆tC, and we arrive at:
$$|e^n| \le C\left( |e^0| + t_n\, T \right) \qquad (4.33)$$

where t_n = n∆t is the final integration time. The right hand side of (4.33) can be made
arbitrarily small by the following argument. First, the initial condition is known and so
the initial error is (save for round-off errors) zero. Second, since the scheme is consistent,
the maximum truncation error T can be made arbitrarily small by choosing smaller and
smaller ∆t. The end result is that the bound on the error can be made arbitrarily
small if the approximation is stable and consistent. Hence the scheme converges to
the solution as ∆t → 0.
where we have used the Taylor series of the exponential in arriving at the final expression.
This is the least restrictive condition on the amplification factor that will permit us to
bound the error growth for finite times. Thus the modulus of the amplification factor
may be greater than 1 by an amount proportional to a positive power of ∆t. This gives
plenty of latitude for the numerical solution to grow, but prevents this growth from
depending on the time step or the number of time steps.
In practice, this stability criterion is too generous, particularly when we know the
solution is bounded. The growth of the numerical solution should then be bounded at all
times by setting C = 1. In this case the Von Neumann stability criterion reduces to

$$|A| \le 1 \qquad (4.37)$$
4.4 Backward Difference

For the model problem du/dt = κu, the backward difference scheme advances the solution
using the slope at the new time level: u^{n+1} = u^n + κ∆t u^{n+1}. This is an example of an
implicit method since the unknown u^{n+1} has been used in evaluating the right hand side;
this is not a problem in this scalar and linear case. For more complicated situations, like a
nonlinear right hand side or a system of equations, a nonlinear system of equations may have
to be inverted. It is easy to show via Taylor series analysis that the truncation error for
the backward difference scheme is O(∆t), and the scheme is hence consistent and of first
order. The numerical solution can be updated according to:
u^{n+1} = u^n / (1 − κ∆t)   (4.42)
and the amplification factor is simply A = 1/(1 − κ∆t). Its magnitude is given by:
|A|² = 1 / (1 − 2Re(κ)∆t + |κ|²∆t²)   (4.43)
The condition |A| ≤ 1 is met whenever

∆t ≥ 2Re(κ)/|κ|²   (4.44)

so the stability properties again depend on the sign of Re(κ). If Re(κ) < 0, then |A| < 1
for all ∆t > 0; this is an instance of unconditional stability. The numerical solution is
also damped when Re(κ) = 0, whereas the analytical solution is neutral. The numerical
amplification factor can be rewritten as:
[1 − Re(z)]² + [Im(z)]² = 1/|A|²   (4.45)

which shows that contours of constant amplification factor are circles centered at (1, 0)
and of radius 1/|A|.
4.6 Trapezoidal scheme

The trapezoidal scheme averages the slopes at the old and new time levels:

(u^{n+1} − u^n)/∆t + O(∆t²) = κ(u^{n+1} + u^n)/2 + O(∆t²)   (4.47)
using simple Taylor series expansions about time n + 1/2. The truncation error is O(∆t²)
and the method is hence second order accurate. It is implicit since u^{n+1} is used in the
evaluation of the right hand side. The unknown function can be updated as:

u^{n+1} = [(1 + κ∆t/2) / (1 − κ∆t/2)] u^n   (4.48)
The modulus of the forward Euler solution cycles outward, indicative of the unstable
nature of the scheme. The backward Euler solution cycles inward, indicative of a
numerical solution that is too damped. Finally, the trapezoidal scheme has neutral
amplification: its solution remains on the unit circle and tracks the analytical solution quite
closely. Notice however that the trapezoidal solution seems to lag behind the analytical
solution, and this lag seems to increase with time. This is symptomatic of lagging phase
errors.
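The behavior shown in Figure 4.1 can be reproduced qualitatively in a few lines. The sketch below (Python with NumPy assumed; the time step is an arbitrary illustrative choice) integrates the oscillation equation du/dt = iu with the three schemes and prints the solution moduli:

    import numpy as np

    kappa = 1j                     # oscillation equation: the exact |u| stays at 1
    dt, nsteps = 0.2, 100
    uf = ub = ut = 1.0 + 0.0j

    for _ in range(nsteps):
        uf = (1.0 + kappa * dt) * uf                               # forward Euler
        ub = ub / (1.0 - kappa * dt)                               # backward Euler
        ut = (1.0 + kappa * dt / 2) / (1.0 - kappa * dt / 2) * ut  # trapezoidal

    print(abs(uf), abs(ub), abs(ut))   # > 1 (growing), < 1 (damped), ~1 (neutral)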
Figure 4.1: Solution of the oscillation equation using the forward (x), backward (+) and
trapezoidal schemes (◦). The analytical solution is indicated by a red asterisk.
Figure 4.2: Phase errors for several two-level schemes when Re(κ) = 0. The forward
and backward differencing schemes (FD and BD) have the same decelerating relative
phase. The trapezoidal scheme (TZ) has lower phase error for the same κ∆t than the two
first order schemes. The Runge-Kutta schemes of order 2 and 3 are accelerating. The
best performance is for RK4, which stays closest to the analytical curve for the largest
portion of the spectrum.
In general it is hard to get a simple formula for the phase error since the expressions often
involve the tan⁻¹ function with complicated arguments. Figure 4.2 shows the relative
phase as a function of κ∆t for the case where Re(κ) = 0 for several time integration
schemes. The solid black line (R = 1) is the reference for an exact phase. The forward,
backward and trapezoidal differencing have negative phase errors (and hence these schemes
are decelerating), while the RK schemes (to be presented below) have an accelerating
phase.
where a21, b1, b2 and c2 are constants that will be determined on the basis of accuracy.
Variants of this approach have the constants determined on the basis of stability
considerations, but in the following we follow the accuracy criterion. The key to determining
these constants is to match the Taylor series expansion of the ODE solution with that of the
approximation. Expanding u as a Taylor series in time we get:
u^{n+1} = u^n + ∆t du/dt + (∆t²/2!) d²u/dt² + O(∆t³)   (4.56)
Notice that the ODE provides the information necessary to compute the derivative in
the Taylor series. Thus we have:
du/dt = f(u, t)   (4.57)

d²u/dt² = df/dt = ∂f/∂t + (∂f/∂u)(du/dt)   (4.58)
        = f_t + f_u f   (4.59)

d³u/dt³ = df_t/dt + (df_u/dt) f + f_u (df/dt)   (4.60)
        = f_tt + f_ut f + f_ut f + f_uu f² + f_u f_t + f_u² f   (4.61)
        = f_tt + 2f_ut f + f_uu f² + f_u f_t + f_u² f   (4.62)
f(u^n + a21∆tf, t_n + c2∆t) = f(u^n, t_n + c2∆t) + a21∆tf f_u + [(a21∆tf)²/2!] f_uu + O(∆t³)   (4.64)
Now each term in expansion (4.64) is expanded in the t variable about time tn to get:
f(u^n, t_n + c2∆t) = f(u^n, t_n) + c2∆t f_t + [(c2∆t)²/2!] f_tt + O(∆t³)   (4.65)

f_u(u^n, t_n + c2∆t) = f_u(u^n, t_n) + f_ut(u^n, t_n) c2∆t + O(∆t²)   (4.66)

f_uu(u^n, t_n + c2∆t) = f_uu(u^n, t_n) + O(∆t)   (4.67)
Substituting these expressions in expansion (4.64) we get the two-variable Taylor series
expansion for f . The whole expression is then inserted in (4.55) to get:
Matching the expansions (4.68) and (4.63) term by term we get the following equations
for the different constants:

b1 + b2 = 1
2 b2 a21 = 1   (4.69)
2 b2 c2 = 1

A solution can be found in terms of the parameter b2, and it is as follows:

b1 = 1 − b2
a21 = 1/(2b2)   (4.70)
c2 = 1/(2b2)
A family of second order Runge-Kutta schemes can be obtained by varying b2 (a code
sketch of the family follows the list). Two common choices are:

• Midpoint rule with b2 = 1, so that b1 = 0 and a21 = c2 = 1/2. The scheme becomes:

u^(1) = u^n + (∆t/2) f(u^n, t_n)   (4.71)

u^{n+1} = u^n + ∆t f(u^(1), t_n + ∆t/2)   (4.72)

The first phase of the midpoint rule is a forward Euler half step, followed by a
centered approximation at the mid-time level.
• Heun rule with b2 = 1/2, so that b1 = 1/2 and a21 = c2 = 1.
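A minimal sketch of the RK2 family (4.70), parameterized by b2, is given below (Python with NumPy assumed; the function name rk2_step and the test problem are illustrative choices). Setting b2 = 1 recovers the midpoint rule and b2 = 1/2 the Heun rule:

    import numpy as np

    def rk2_step(f, u, t, dt, b2=1.0):
        """One step of the second order Runge-Kutta family (4.69)-(4.70)."""
        b1 = 1.0 - b2
        a21 = c2 = 1.0 / (2.0 * b2)
        k1 = f(u, t)
        k2 = f(u + a21 * dt * k1, t + c2 * dt)
        return u + dt * (b1 * k1 + b2 * k2)

    # Both members integrate du/dt = i*u with second order accuracy.
    f = lambda u, t: 1j * u
    u_mid, u_heun, dt = 1.0 + 0j, 1.0 + 0j, 0.1
    for n in range(100):
        u_mid = rk2_step(f, u_mid, n * dt, dt, b2=1.0)    # midpoint rule
        u_heun = rk2_step(f, u_heun, n * dt, dt, b2=0.5)  # Heun rule
    exact = np.exp(1j * 10.0)
    print(abs(u_mid - exact), abs(u_heun - exact))        # both errors are small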
Higher order Runge-Kutta schemes can be derived by introducing additional stages
between the two time levels; their derivation is however very complicated (for more
information see Butcher (1987), and see Dormand (1996) for a more readable account).
Here we limit ourselves to listing the algorithms for common third and fourth order
Runge-Kutta schemes.
• RK3

q1 = ∆t f(u^n, t_n),                           u^(1) = u^n + q1/3
q2 = ∆t f(u^(1), t_n + ∆t/3) − 5q1/9,          u^(2) = u^(1) + 15q2/16   (4.75)
q3 = ∆t f(u^(2), t_n + 3∆t/4) − 153q2/128,     u^{n+1} = u^(2) + 8q3/15
• RK4 The fourth order RK scheme is the most well-known for its accuracy and
large stability region. It is:

q1 = ∆t f(u^n, t_n)
q2 = ∆t f(u^n + q1/2, t_n + ∆t/2)
q3 = ∆t f(u^n + q2/2, t_n + ∆t/2)   (4.76)
q4 = ∆t f(u^n + q3, t_n + ∆t)
u^{n+1} = u^n + (q1 + 2q2 + 2q3 + q4)/6
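The RK4 algorithm translates directly into code; the sketch below (Python with NumPy assumed; the convergence test is an illustrative addition) checks the expected fourth order accuracy on the oscillation equation:

    import numpy as np

    def rk4_step(f, u, t, dt):
        """One step of the classical fourth order Runge-Kutta scheme (4.76)."""
        q1 = dt * f(u, t)
        q2 = dt * f(u + q1 / 2, t + dt / 2)
        q3 = dt * f(u + q2 / 2, t + dt / 2)
        q4 = dt * f(u + q3, t + dt)
        return u + (q1 + 2 * q2 + 2 * q3 + q4) / 6

    f = lambda u, t: 1j * u
    for n in [8, 16, 32]:
        u, dt = 1.0 + 0j, 1.0 / n
        for m in range(n):
            u = rk4_step(f, u, m * dt, dt)
        print(n, abs(u - np.exp(1j)))   # error drops ~16x per doubling of n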
2. Runge-Kutta offers high order integration with only information at two time levels.
Automatic step size control is easy since the order of the method does not depend
on maintaining the same step size as the calculation proceeds. Several strategies
are then possible to maximize accuracy and reduce CPU cost.

3. For orders less than or equal to 4, only one stage is required per additional order,
which is optimal. For higher order, it is known that the number of stages exceeds
the order of the method. For example, a fifth order method requires 6 stages, and
an eighth order RK scheme requires a minimum of 11 stages.

4. Runge-Kutta schemes require multiple evaluations of the right hand side per time
step. This can be quite costly if a large system of simultaneous ODEs is involved.
Alternatives to the RK steps are multi-level schemes.
Figure 4.3: Stability region for the Runge-Kutta methods of order 1, 2, 3, and 4 (left
figure), and for the Adams-Bashforth schemes of order 2 and 3 (right figure). The RK2
and AB2 stability curves are tangent to the imaginary axis at the origin, and hence these
methods are not stable for purely imaginary κ∆t.
5. A new family of Runge-Kutta schemes was devised in recent years to cope with the
requirements of Total Variation Diminishing (TVD) schemes. For second order
methods, the Heun scheme is TVD. The third order TVD Runge-Kutta scheme, in
the standard form due to Shu and Osher, is:

u^(1) = u^n + ∆t f(u^n, t_n)
u^(2) = (3/4) u^n + (1/4) [u^(1) + ∆t f(u^(1), t_n + ∆t)]
u^{n+1} = (1/3) u^n + (2/3) [u^(2) + ∆t f(u^(2), t_n + ∆t/2)]
The leap-frog scheme approximates the time derivative with a centered difference spanning
the levels n − 1 and n + 1. It is easy to show that the truncation error is of size O(∆t²).
Moreover, unlike the trapezoidal scheme, it is explicit in the unknown u^{n+1}, and hence
does not involve nonlinear complications, nor systems of equations. For our model equation,
the leap-frog scheme takes the form:

u^{n+1} = u^{n−1} + 2κ∆t u^n   (4.79)
The determination of the amplification factor is complicated by the fact that two time
levels are involved in the calculations. Nevertheless, let us assume that the amplification
factor is the same for each time step, i.e. un = Aun−1 , and un+1 = Aun . We then arrive
at the following equation:
A² − 2zA − 1 = 0   (4.80)

There are two solutions to this quadratic equation:

A± = z ± √(1 + z²)   (4.81)
In the limit of good resolution, |z| → 0, we have A+ → 1 and A− → −1. The numerical
solution is capable of behaving in two different ways, or modes. The mode associated
with A+ is referred to as the physical mode because it approximates the solution to
the original differential equation. The mode associated with A− is referred to as the
computational mode since it arises solely as an artifact of the numerical procedure.
The origin of the computational mode can be traced back to the fact that the leap-frog
scheme is an approximation to a higher order equation that requires more initial
conditions than necessary for the original ODE. To show this consider the trivial case
of κ = 0, where the analytical solution is simply given by u_a = u^0; here u^0 is the initial
condition. The amplification factors for the leap-frog scheme are A+ = 1 and A− = −1,
and hence the computational mode is expected to keep its amplitude but switch sign at
every time step. Applying the leap-frog scheme we see that all even time levels will have
the correct value: u^2 = u^4 = . . . = u^0. The odd time levels will be contaminated by the
error in estimating the second initial condition needed to jump start the calculations. If
u^1 = u^0 + ε, where ε is the initial error committed, the solution at all odd time levels
will then be u^{2n+1} = u^0 + ε. The numerical solution for the present simple case can be
written entirely in terms of the physical (initial condition) and computational (initial
condition error) modes:
u^n = u^0 + ε/2 − (ε/2)(−1)^n   (4.82)
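The splitting (4.82) into physical and computational modes is easy to exhibit numerically. In the sketch below (plain Python; ε is a deliberately introduced startup error, as in the discussion above), the even time levels hold the exact value while every odd level carries the full error:

    # Leap-frog for du/dt = kappa*u with kappa = 0 reduces to u^{n+1} = u^{n-1}.
    u0, eps = 1.0, 0.01
    u_prev, u_curr = u0, u0 + eps          # second initial condition off by eps
    history = [u_prev, u_curr]
    for _ in range(6):
        u_prev, u_curr = u_curr, u_prev    # u^{n+1} = u^{n-1} since kappa = 0
        history.append(u_curr)
    print(history)   # even levels: 1.0 exactly; odd levels: 1.01 = u0 + eps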
Absolute stability requires that |A±| ≤ 1; notice however that the product of the
two roots is A+A− = −1, which implies |A+||A−| = 1. Hence, if one root, say A+, is
stable with |A+| < 1, the other one must be unstable with |A−| = 1/|A+| > 1; the
only exception is when both amplification factors have a neutral amplification, |A+| =
|A−| = 1. For real z, Im(z) = 0, one of the two roots has modulus exceeding 1, and
the scheme is always unstable. Let us for a moment assume that z = iλ; we then have
A± = iλ ± √(1 − λ²). If |λ| ≤ 1 then the quantity under the square root sign is positive and
we have two roots such that |A+| = |A−| = 1. To make further progress on studying the
stability of the leap frog scheme, let z = sinh w, where w is a complex number. Using
the identity cosh²w − sinh²w = 1 we arrive at the expression A± = sinh w ± cosh w.
Setting w = a + ib where a, b are real, substituting in the previous expression for the
amplification factor, and calculating its modulus, we get |A±| = e^{±a}. Hence a = 0 is required for
both amplification factors to be stable. The region of stability is hence z = i sin b where
b is real, and is confined to the unit slit along the imaginary axis, |Im(z)| ≤ 1.
The leap frog scheme is a popular scheme for integrating PDEs of primarily hyperbolic
type in spite of the existence of the computational mode. The reason lies primarily in
its neutral stability and good phase properties. The computational mode can be
effectively controlled either with an Asselin time filter (see Durran (1999)) or by
periodically discarding the solution at level n − 1 and taking a two time level scheme.
Multi-Step schemes
A family of multi-step schemes can be built by interpolating the right hand side of the
ODE on the interval [t_n, t_{n+1}] and performing the integral. The derivation starts from the
exact solution to the ODE:

u^{n+1} = u^n + ∫_{t_n}^{t_{n+1}} f(u, t) dt   (4.83)
Since the integrand is unknown in [t_n, t_{n+1}] we need to find a way to approximate it given
information at specific time levels. A simple way to achieve this is to use a polynomial
that interpolates the integrand at the points (t_k, f^k), n − p ≤ k ≤ n, where the solution
is known. If we write:
f (u, t) = Vp (t) + Ep (t) (4.84)
where Vp is the polynomial approximation and Ep the error associated with it, then the
numerical scheme becomes:
u^{n+1} = u^n + ∫_{t_n}^{t_{n+1}} V_p(t) dt + ∫_{t_n}^{t_{n+1}} E_p(t) dt   (4.85)
If the integration of V_p is performed exactly, then the only approximation errors present
are due to the integration of the interpolation error term; this term can be bounded by
max(|E_p|)∆t.
The explicit family of Adams-Bashforth schemes relies on Lagrange interpolation.
Specifically,

V_p(t) = Σ_{k=0}^{p} h_{pk}(t) f^{n−k},   (4.86)

h_{pk}(t) = Π_{m=0, m≠k}^{p} (t − t_{n−m}) / (t_{n−k} − t_{n−m})   (4.87)

         = [(t − t_n)/(t_{n−k} − t_n)] . . . [(t − t_{n−(k−1)})/(t_{n−k} − t_{n−(k−1)})]
           × [(t − t_{n−(k+1)})/(t_{n−k} − t_{n−(k+1)})] . . . [(t − t_{n−p})/(t_{n−k} − t_{n−p})]   (4.88)

It is easy to verify that h_{pk}(t) is a polynomial of degree p in t, and that h_{pk}(t_{n−m}) = 0
for m ≠ k while h_{pk}(t_{n−k}) = 1. These last two properties ensure that V_p(t_{n−k}) = f^{n−k}.
The error associated with Lagrange interpolation with p + 1 points is O(∆t^{p+1}).
Inserting the expression for V_p in the numerical scheme we get:

u^{n+1} = u^n + Σ_{k=0}^{p} f^{n−k} ∫_{t_n}^{t_{n+1}} h_{pk}(t) dt + ∆t O(∆t^{p+1})   (4.89)
Note that the error appearing in the above formula is only the local error; the global
error is one order less, i.e. it is O(∆t^{p+1}).
We illustrate the application of this procedure by considering the derivation of its
second and third order variants. The second order scheme requires p = 1. Hence, we
write:
V_1(t) = [(t − t_{n−1})/(t_n − t_{n−1})] f^n + [(t − t_n)/(t_{n−1} − t_n)] f^{n−1}   (4.90)
The integral over the interval [t_n, t_{n+1}] is

∫_{t_n}^{t_{n+1}} V_1(t) dt = f^n ∫_{t_n}^{t_{n+1}} (t − t_{n−1})/(t_n − t_{n−1}) dt + f^{n−1} ∫_{t_n}^{t_{n+1}} (t − t_n)/(t_{n−1} − t_n) dt   (4.91)

                     = ∆t [(3/2) f^n − (1/2) f^{n−1}]   (4.92)
The final expression for the second order Adams-Bashforth formula is:

u^{n+1} = u^n + ∆t [(3/2) f^n − (1/2) f^{n−1}] + O(∆t³)   (4.93)
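In code, AB2 needs one single-step method to generate u^1 before the recursion (4.93) can take over. The sketch below (Python with NumPy assumed; names and the test problem are illustrative) uses a Heun step as the starter so that the overall second order accuracy is preserved:

    import numpy as np

    def ab2_integrate(f, u0, t0, dt, nsteps):
        """Second order Adams-Bashforth (4.93), started with one Heun (RK2) step."""
        f_prev = f(u0, t0)
        u = u0 + dt / 2 * (f_prev + f(u0 + dt * f_prev, t0 + dt))   # starter
        for n in range(1, nsteps):
            f_curr = f(u, t0 + n * dt)
            u, f_prev = u + dt * (1.5 * f_curr - 0.5 * f_prev), f_curr
        return u

    # On du/dt = -u the error at t = 1 drops ~4x per halving of dt (second order).
    for n in [20, 40, 80]:
        u = ab2_integrate(lambda v, t: -v, 1.0, 0.0, 1.0 / n, n)
        print(n, abs(u - np.exp(-1.0)))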
A third order formula can be designed similarly, starting with the quadratic
interpolation polynomial V_2(t); its integral can be evaluated and plugged into equation (4.89)
to get:

u^{n+1} = u^n + ∆t [(23/12) f^n − (16/12) f^{n−1} + (5/12) f^{n−2}]   (4.95)
The stability of the AB2 scheme can be easily determined for the sample problem.
The amplification factors are the roots of the equation

A² − (1 + 3z/2) A + z/2 = 0   (4.96)

and the two roots are:

A± = (1/2) [ (1 + 3z/2) ± √( (1 + 3z/2)² − 2z ) ]   (4.97)
Like the leap frog scheme, AB2 suffers from the existence of a computational mode. In the
limit of good resolution, z → 0, we have A+ → 1 and A− → 0; that is, the computational
mode is heavily damped. Figure 4.4 shows the modulus of the physical and computational
modes for Re(z) = 0 and Im(z) < 1. The modulus of the computational mode amplification
factor is quite small for the entire range of z considered. On the other hand the physical
mode is unstable for purely imaginary z, as the modulus of its amplification factor exceeds 1.
Figure 4.4: Modulus of amplification factor for the physical and computational modes
of AB2 when Re(κ) = 0.
Note however that a series expansion of |A+| for small z = iλ∆t shows that |A+| =
1 + (λ∆t)⁴/4, and hence the instability grows very slowly for sufficiently small ∆t. It
can be anticipated that AB3 will have one physical mode and two computational modes,
since its stability analysis leads to a third order equation for the amplification factor.
Like AB2, AB3 strongly damps the two computational modes; it has the added benefit
of providing conditional stability for Im(z) ≠ 0. The complete stability regions for AB2
and AB3 are shown in the right panel of figure 4.3.
Like all multilevel schemes, Adams-Bashforth methods have some disadvantages.
First, a starting method is required to jump start the calculations. Second, the
stability region shrinks with the order of the method. The good news is that although
AB2 is unstable for imaginary κ, its instability is small and tolerable for finite integration
times. The third order Adams-Bashforth scheme, on the other hand, includes a portion of
the imaginary axis, which makes AB3 quite valuable for the integration of advection-like
operators. The main advantage of AB schemes over Runge-Kutta is that they require
but one evaluation of the right hand side per time step and use a similar amount of
storage.
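The stability boundaries plotted in the right panel of Figure 4.3 can be traced with the boundary-locus method: set A = e^{iθ} in the characteristic equation of the scheme and solve for z = κ∆t. A hedged sketch (Python with NumPy assumed):

    import numpy as np

    theta = np.linspace(0.0, 2.0 * np.pi, 721)
    A = np.exp(1j * theta)      # neutral roots trace the stability boundary

    # AB2 characteristic equation: A^2 - A = z (3A - 1)/2
    z_ab2 = (A**2 - A) / ((3.0 * A - 1.0) / 2.0)
    # AB3 characteristic equation: A^3 - A^2 = z (23 A^2 - 16 A + 5)/12
    z_ab3 = (A**3 - A**2) / ((23.0 * A**2 - 16.0 * A + 5.0) / 12.0)

    # Real-axis extent of the stability regions: about -1 (AB2) and -6/11 (AB3).
    print(z_ab2.real.min(), z_ab3.real.min())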
The table below lists the BDF coefficients for orders p = 1 to 4; the last column is the
coefficient Σ_{k=1}^{p} k a_k that multiplies ∆t u_t|^{n+1}.

p    a1       a2       a3      a4       Σ k a_k
1    1                                  1
2    4/3     −1/3                       2/3
3    18/11   −9/11    2/11              6/11
4    48/25   −36/25   16/25   −3/25     12/25
For a p-th order expression, we require the higher order derivatives, 2 through p, to vanish;
this yields p − 1 homogeneous algebraic equations. For a non-trivial solution we need
to append one more condition, which we choose to be that the sum of the unknown
coefficients is equal to one. This yields the following system of equations (solved
numerically in the sketch below):

Σ_{k=1}^{p} a_k = 1   (4.100)

Σ_{k=1}^{p} k^m a_k = 0,   m = 2, 3, . . . , p   (4.101)
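Conditions (4.100)-(4.101) form a small linear system that can be solved directly. The sketch below (Python with NumPy assumed, written for illustration) reproduces the a_k entries of the table above and the coefficient Σ k a_k for p = 1, . . . , 4:

    import numpy as np

    for p in range(1, 5):
        M = np.zeros((p, p))
        rhs = np.zeros(p)
        M[0, :] = 1.0                  # sum of the a_k equals 1, eq. (4.100)
        rhs[0] = 1.0
        for row, m in enumerate(range(2, p + 1), start=1):
            M[row, :] = [(k + 1) ** m for k in range(p)]   # eq. (4.101)
        a = np.linalg.solve(M, rhs)
        beta = sum((k + 1) * a[k] for k in range(p))       # multiplies dt*u_t
        print(p, np.round(a, 4), round(float(beta), 4))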
Figure 4.5: Stability regions for the Backward Differencing schemes of order 1, 2 and 3.
The schemes are unstable within the enclosed region and stable everywhere else. The
instability regions grow with the order. The stability regions are symmetric about the
real axis.
u^{n+1} = (48/25) u^n − (36/25) u^{n−1} + (16/25) u^{n−2} − (3/25) u^{n−3} + (12/25) ∆t u_t|^{n+1} + O(∆t⁴)   (4.106)
Notice that we have shown explicitly the time level at which the time derivative is
approximated. The BDF schemes lead to implicit expressions for updating the solution at
time level n + 1. Like the Adams-Bashforth formulae, the BDF schemes require a starting
method. They also generate computational modes whose number depends on how many
previous time levels have been used. Their most important advantage is their stability
regions in the complex plane, which are much larger than those of equivalent explicit schemes.
5.1 Introduction
Suppose we are given a well-posed problem that consists of a partial differential equation
∂u/∂t = Lu   (5.1)
where L is a differential operator, along with appropriate initial and boundary conditions.
As a concrete example, consider the advection equation u_t + cu_x = 0, c > 0, discretized
with a forward difference in time and a backward difference in space (FTBS):

(u_j^{n+1} − u_j^n)/∆t + c (u_j^n − u_{j−1}^n)/∆x = 0   (5.4)
Equation (5.4) provides a simple formula for updating the solution at time level n + 1 from
the values at time n:

u_j^{n+1} = (1 − µ) u_j^n + µ u_{j−1}^n,   where µ = c∆t/∆x   (5.5)
The variable µ is known as the Courant number and will figure prominently in the study
of the stability of the scheme. Equation (5.5) can be written as a matrix operation

u^{n+1} = A u^n   (5.6)

where A is the N × N lower bidiagonal matrix whose first row is (1, 0, . . . , 0), enforcing
the boundary condition, and whose row j, for j = 2, . . . , N, has the entry µ in column
j − 1 and 1 − µ in column j. Here we have assumed that the boundary condition is given
by u(x_1, t) = u_0(x_1).
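In code the matrix A of (5.6) need not be formed explicitly; the update can be applied row by row. A hedged sketch (Python with NumPy assumed; grid size and initial profile are illustrative):

    import numpy as np

    N, mu = 101, 0.5                          # Courant number mu = c dt / dx
    x = np.linspace(0.0, 1.0, N)
    u = np.exp(-200.0 * (x - 0.25) ** 2)      # initial Gaussian profile

    for _ in range(50):
        u[1:] = (1.0 - mu) * u[1:] + mu * u[:-1]   # interior rows of (5.6)
        # u[0] is left untouched: the inflow boundary condition of the first row

    print(x[np.argmax(u)])   # the peak has moved right by about 50*mu*dx = 0.25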
The following legitimate questions can now be asked:

3. Errors: What are the errors committed by the approximation, and how should one
expect them to behave as the numerical resolution is increased?

4. Stability: Does the numerical solution remain bounded by the data specifying
the problem, or do the numerical errors increase as the computations are carried
out?
We will now turn to the issue of defining these concepts more precisely, and hint at
the role they play in devising finite difference schemes. We will return to the issue of
illustrating their applications in practical situations later.
5.1.1 Convergence
Let e_j^n = u_j^n − v_j^n denote the error between the numerical and analytical solutions of the
PDE at time n∆t and point j∆x. If this error tends to 0 as the grid and time steps are
decreased, the finite difference solution converges to the analytical solution. Moreover,
a finite difference scheme is said to be convergent of order (p, q) if ‖e‖ = O(∆t^p, ∆x^q) as
∆t, ∆x → 0.
5.1.3 Consistency
Loosely speaking the notion of consistency addresses the problem of whether the finite
difference approximation is really representing the partial differential equations. We say
that a finite difference approximation is consistent with a differential equation if the
FD equations converge to the original equations as the time and space grids are refined.
Hence, if the truncation error goes to zero as the time and space grids are refined we
conclude that the scheme is consistent.
5.1.4 Stability
The notion of stability is a little more complicated to define. Our primary concern here
is to make sure that numerical errors do not swamp the analytical solution. One way
to ensure that is to require the solution to remain bounded by the initial data. Hence a
definition of stability is to require the following
‖u^n‖ ≤ C‖u^0‖   (5.7)

where C is a positive constant that may depend on the final integration time t_n but not
on the time or space increments. Notice that this definition of stability is very
general and does not refer to the behavior of the continuum equation. If the latter is
known to preserve the norm of the solution, then the more restrictive condition
‖u^n‖ ≤ ‖u^0‖   (5.8)
is more practical, particularly for finite ∆t.
v_j^{n+1} = v_j^n + ∆t v_t|_j^n + (∆t²/2) v_tt|_j^n + O(∆t³)   (5.10)

v_{j−1}^n = v_j^n − ∆x v_x|_j^n + (∆x²/2) v_xx|_j^n + O(∆x³)   (5.11)
Substituting these expressions in equation (5.9) we get:

v_t + c v_x = −(∆t/2) v_tt + (c∆x/2) v_xx + O(∆t², ∆x²)   (5.12)
The terms on the left hand side of eq. 5.12 are the original PDE; all terms on the
right hand side are part of the truncation error. They represent the residual by which
the exact solution fails to satisfy the difference equation. For sufficiently small ∆t and
∆x, the leading term in the truncation series is linear in both ∆t and ∆x. Notice also
that one can regard equation 5.12 as the true partial differential equation represented by
the finite difference equation for finite ∆t and ∆x. The analysis of the different terms
appearing in the truncation error series can give valuable insight into the behavior of the
numerical approximation, and forms the basis of the modified equation analysis. We
will return to this issue later. For now it is sufficient to notice that the truncation error
tends to 0 as ∆t, ∆x → 0, and hence the finite difference approximation is consistent.
The Lax-Richtmeyer theorem can be illustrated by writing the update of the numerical
solution in the generic matrix form

u^n = A u^{n−1} + b   (5.13)

where A is the evolution matrix, and b is a vector containing forcing terms and the
effects of boundary conditions. The vector u^n holds the solution values at time
n. The truncation error at a specific time level can be obtained by applying the above
matrix operation to the vector of exact solution values:

v^n = A v^{n−1} + b − z^{n−1}∆t   (5.14)

where z^{n−1} is the vector of truncation errors at time level n − 1. Subtracting equation (5.14)
from (5.13), we get an evolution equation for the error e^n = u^n − v^n, namely:

e^n = A e^{n−1} + z^{n−1}∆t   (5.15)
Equation 5.15 shows that the error at time level n is made up of two parts. The first
one is the evolution of the error inherited from the previous time level, the first term on
the right hand side of eq. 5.15, and the second part is the truncation error committed
at the present time level. Since, this expression applies to a generic time level, the same
expression holds for en−1 :
e^{n−1} = A e^{n−2} + z^{n−2}∆t   (5.16)
where we have assumed that the matrix A does not change with time to simplify the
discussion (this is tantamount to assuming constant coefficients for the PDE). By repeated
application of this argument we get:

e^n = A²e^{n−2} + (A z^{n−2} + z^{n−1})∆t   (5.17)
    = A³e^{n−3} + (A² z^{n−3} + A z^{n−2} + z^{n−1})∆t   (5.18)
    ...
    = A^n e^0 + (A^{n−1} z^0 + A^{n−2} z^1 + . . . + A z^{n−2} + z^{n−1})∆t   (5.19)

Equation (5.19) shows that the error growth depends on the truncation error at all time
levels, and on the discretization through the matrix A. We can use the triangle inequality
to get a bound on the norm of the error. Thus,

‖e^n‖ ≤ ‖A^n‖ ‖e^0‖ + (‖A^{n−1}‖ ‖z^0‖ + ‖A^{n−2}‖ ‖z^1‖ + . . . + ‖A‖ ‖z^{n−2}‖ + ‖z^{n−1}‖)∆t   (5.20)
In order to make further progress we assume that the norm of the truncation error at
any time is bounded by a constant ε, so that ‖z^m‖ ≤ ε for all m; then

‖e^n‖ ≤ ‖A^n‖ ‖e^0‖ + ε∆t ( Σ_{m=0}^{n−1} ‖A^m‖ )   (5.22)
The initial errors and the subsequent truncation errors are thus modulated by the
evolution matrices A^m. In order to prevent the unbounded growth of the error norm as
n → ∞, we need to put a limit on the norm of these matrices. This is in effect the
stability property needed for convergence:

‖A^m‖ ≤ C,   m = 1, 2, . . . , n   (5.23)

where C is a constant independent of n, ∆t and ∆x. The sum in brackets can then be bounded
by the factor nC; the final expression becomes:

‖e^n‖ ≤ C (‖e^0‖ + t_n ε)   (5.24)

where t_n = n∆t is the final integration time. When ∆x → 0, the initial error ‖e^0‖ can be
made as small as desired. Furthermore, by consistency, the truncation error ε → 0 when
∆t, ∆x → 0. The global error is hence guaranteed to go to zero as the computational grid
is refined, and the scheme is convergent.
‖A^m‖ ≤ C   (5.25)

Since ‖A^m‖ ≤ ‖A‖^m, it suffices to require

‖A‖^m ≤ C   (5.26)

i.e., with t_m = m∆t,

‖A‖ ≤ C^{1/m} = e^{(∆t/t_m) ln C} = 1 + (ln C/t_m)∆t + . . . = 1 + O(∆t)   (5.27)

The von Neumann stability condition is hence that

‖A‖ ≤ 1 + O(∆t)   (5.28)
Note that this stability condition does not make any reference to whether the
continuous (exact) solution grows or decays in time. Furthermore, the stability condition is
established for finite integration times in the limit ∆t → 0. In practice the computations
are necessarily carried out with a small but finite ∆t, and it is frequently the case that the
evolution equation puts a bound on the growth of the solution. Since the numerical
solution and its errors are subject to the same growth factors via the matrix A, it is
reasonable, and in most cases essential, to require the stronger condition ‖A‖ ≤ 1 for
non-growing solutions.
A final practical detail still needs to be ironed out: what norm should be
used to measure the error? From the properties of the matrix norm it is immediately
clear that the spectral radius satisfies ρ(A) ≤ ‖A‖, hence ρ(A) ≤ 1 is a necessary condition
for stability but not a sufficient one. There are classes of matrices A for which it is sufficient,
for example those that possess a complete set of linearly independent eigenvectors, such as
those that arise from the discretization of hyperbolic equations. If the 1- or ∞-norms are
used, the condition ‖A‖ ≤ 1 is sufficient for stability.
Example 10 In the case of the advection equation, the matrix A given in equation (5.6)
has norm:

‖A‖₁ = ‖A‖∞ = |µ| + |1 − µ|   (5.29)

For stability we thus require that |µ| + |1 − µ| ≤ 1. Two cases need to be considered:

1. 0 ≤ µ ≤ 1: ‖A‖ = µ + 1 − µ = 1, stable.
where k is the wavenumber and û_k^n is its (complex) Fourier amplitude. This expression
can then be inserted back in the finite difference equation, and an expression for the
amplification factor A can be obtained, where A depends on k, ∆x and ∆t. Stability of
every Fourier mode will guarantee the stability of the entire solution, and hence |A| ≤ 1
for all Fourier modes is the necessary and sufficient stability condition for non-growing
solutions.
Example 11 Inserting the Fourier series in the finite difference approximation for the
advection equation we end up with the following equation:

û_k^{n+1} e^{ikx_j} = (1 − µ) û_k^n e^{ikx_j} + µ û_k^n e^{ikx_{j−1}}   (5.31)

Since x_{j−1} = x_j − ∆x, the exponential factor drops out of the picture and we end up
with the following expression for the growth of the Fourier coefficients:

û_k^{n+1} = [(1 − µ) + µ e^{−ik∆x}] û_k^n   (5.32)

The expression in brackets is nothing but the amplification factor A for Fourier mode k.
Stability requires that |A| ≤ 1 for all k:
|A|² = AA∗ = [(1 − µ) + µ e^{−ik∆x}][(1 − µ) + µ e^{ik∆x}]   (5.33)
     = (1 − µ)² + µ(1 − µ)(e^{ik∆x} + e^{−ik∆x}) + µ²   (5.34)
     = 1 − 2µ + 2µ(1 − µ) cos k∆x + 2µ²   (5.35)
     = 1 − 2µ(1 − cos k∆x) + 2µ²(1 − cos k∆x)   (5.36)
     = 1 − 4µ(1 − µ) sin²(k∆x/2)   (5.37)
It is now clear that |A|² ≤ 1 if µ(1 − µ) ≥ 0, i.e. 0 ≤ µ ≤ 1. This is the same stability
criterion derived via the matrix analysis procedure.
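The whole von Neumann analysis can be checked by brute force: scan the amplification factor (5.32) over all resolvable wavenumbers for several Courant numbers. A sketch (Python with NumPy assumed):

    import numpy as np

    kdx = np.linspace(0.0, np.pi, 181)               # k dx over [0, pi]
    for mu in [0.25, 0.5, 0.75, 1.0, 1.25]:
        A = (1.0 - mu) + mu * np.exp(-1j * kdx)      # amplification factor (5.32)
        print(mu, np.abs(A).max())   # max |A| <= 1 for 0 <= mu <= 1, > 1 beyond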
This is motivated by the observation that the finite difference scheme is in fact solving a
perturbed form of the original equation. Equations (5.9) and (5.12) establish that the FTBS
scheme approximates the advection equation to first order, O(∆t, ∆x). They also show
that FTBS approximates the following equation to second order in time and space:

v_t + c v_x = −(∆t/2) v_tt + (c∆x/2) v_xx + O(∆t², ∆x²)   (5.38)
The second term on the right hand side of equation (5.38) has the form of a diffusion-like
operator, and hence we expect it to lead to a gradual decrease in the amplitude
of the solution. The interpretation of the time derivative term is not as simple. The way
to proceed is to derive an expression for v_tt in terms of spatial derivatives. This is
achieved by differentiating equation (5.38) once with respect to time and once with respect
to space to obtain:

v_tt + c v_xt = −(∆t/2) v_ttt + (c∆x/2) v_txx + O(∆t², ∆x²)   (5.39)

v_tx + c v_xx = −(∆t/2) v_ttx + (c∆x/2) v_xxx + O(∆t², ∆x²)   (5.40)

Multiplying equation (5.40) by −c and adding it to equation (5.39) we get:

v_tt = c² v_xx + (∆t/2)(−v_ttt + c v_ttx) + (c∆x/2)(v_txx − c v_xxx) + O(∆t², ∆x²)   (5.41)
Inserting this first order approximation to v_tt back in equation (5.38) we obtain the
following modified equation:

v_t + c v_x = (c/2)(∆x − c∆t) v_xx + O(∆x², ∆x∆t, ∆t²)   (5.42)
Equation (5.42) is more informative than its earlier version, equation (5.38). It tells us that
the leading error term in ∆t, ∆x behaves like a second order spatial derivative whose
coefficient is the pseudo-, or numerical, viscosity ν_n, where

ν_n = (c/2)(∆x − c∆t)   (5.43)
If ν_n > 0 we expect the solution to be damped to leading order: the numerical scheme
behaves like an advection-diffusion equation, one whose viscous coefficient is purely an
artifact of the finite difference discretization. If the numerical viscosity is negative,
ν_n < 0, the solution will be amplified exponentially fast. The condition ν_n > 0 is nothing
but the usual stability criterion we have encountered earlier, namely that c > 0 and
µ = c∆t/∆x < 1.
A more careful analysis of the truncation error that includes higher powers of ∆t, ∆x
yields the following form:

v_t + c v_x = (c∆x/2)(1 − µ) v_xx − (c∆x²/6)(2µ² − 3µ + 1) v_xxx + O(∆t³, ∆t²∆x, ∆t∆x², ∆x³)   (5.44)
The third derivative term is indicative of the presence of dispersive errors in the numerical
solution; the magnitude of these errors is controlled by the coefficient multiplying the third
order derivative. Since 2µ² − 3µ + 1 = (2µ − 1)(µ − 1), this coefficient is negative for
0 ≤ µ < 1/2, and one can then expect a lagging phase error with respect to the analytical
solution. Notice also that the coefficients of the higher order derivatives on the right hand
side go to zero for µ = 1. This "ideal" value of the time step makes the scheme at least third
order accurate according to the modified equation; in fact it is easy to convince oneself, on
the basis of the characteristic analysis, that the exact solution is recovered.
Notice that the derivation of the modified equation uses the Taylor series form of
the finite difference scheme, equation (5.9), rather than the original partial differential
equation, to derive the estimates for the high order derivatives. This is essential to
account for the discretization errors. The book by Tannehill et al. (1997) discusses a
systematic procedure for deriving the higher order terms in the modified equation.
Chapter 6

Numerical Solution of the Advection Equation

6.1 Introduction
We devote this chapter to applying the notions discussed in the previous chapter to
several finite difference schemes for solving the simple advection equation. This equation
was taken as an example to illustrate the abstract concepts that frame most of the
discussion of finite difference methods from a theoretical perspective; these concepts,
we repeat, are consistency, convergence and stability. We will investigate several common
schemes found in the literature, looking at their amplitude and phase errors more closely.
6.2 Donor Cell Scheme

The donor cell scheme is

(u_j^{n+1} − u_j^n)/∆t + [(c + |c|)/2] (u_j^n − u_{j−1}^n)/∆x + [(c − |c|)/2] (u_{j+1}^n − u_j^n)/∆x = 0   (6.1)
For c > 0 the scheme simplifies to a FTBS, and for c < 0 it becomes a FTFS (forward
time and forward space) scheme. Here we will consider solely the case c > 0 to simplify
things. Figure 6.1 shows plots of the amplification factor for the donor cell scheme. Prior
to discussing the figures we would like to make the following remarks.
6.2.1 Remarks

1. The scheme is conditionally stable since the time step cannot be chosen
independently of the spatial discretization and must satisfy ∆t ≤ ∆t_max = ∆x/c.
Figure 6.1: Amplitude and phase diagram of the donor cell scheme as a function of the
wavenumber
2. The wavelength appearing in the von Neumann stability analysis has not been
specified yet. Small values of k correspond to very long wavelengths, i.e. Fourier
modes that are well represented on the computational grid. Large values of k
correspond to very short wavelengths. This correspondence is evident from the expression
k∆x = 2π∆x/λ, where λ is the wavelength of the Fourier mode. For example, a
twenty kilometer wave represented on a grid with ∆x = 2 kilometers would have
10 points per wavelength and its k∆x = 2π·2/20 = π/5.

3. There is a lower limit on the wavelength of the shortest wave representable on a discrete
grid. This wave has a wavelength equal to 2∆x and takes the form of a see-saw
function; its k∆x = π. Any wavelength shorter than this limit will be aliased
into a longer wavelength. This phenomenon is similar to the one encountered in the
Fourier analysis of time series, where the Nyquist limit sets a lower bound on the
smallest measurable wave period.
4. In the previous chapter we focussed primarily on the magnitude of the
amplification factor, as it is the one that impacts the issue of stability. However, additional
information is contained in the expression for the amplification factor that relates to
the dispersive properties of the finite difference scheme. The analytical expression
for the amplification factor of a Fourier mode is

A_a = e^{−ick∆t}   (6.2)

Thus the analytical solution expects a unit amplification per time step, |A_a| = 1,
and a change of phase of φ_a = −ck∆t = −µk∆x. The amplification factor for the
donor cell scheme is however:

A = |A| e^{iφ},   (6.3)

|A| = [1 − 4µ(1 − µ) sin²(k∆x/2)]^{1/2},   (6.4)

φ = tan⁻¹[ −µ sin k∆x / (1 − µ(1 − cos k∆x)) ]   (6.5)

where φ is the argument of the complex number A. The ratio φ/φ_a gives the
relative error in the phase. A ratio less than 1 means that the numerical phase
change is less than the analytical one, and the scheme is decelerating, while a ratio
greater than 1 indicates an accelerating scheme. We will return to phase errors later
when we look at the dispersive properties of the scheme.
5. The donor cell scheme for positive c can be written in the form:

u_j^{n+1} = (1 − µ) u_j^n + µ u_{j−1}^n   (6.6)

which is a linear, convex (for 0 ≤ µ ≤ 1) combination of the two values at the
previous time level upstream of the point (j, n). Since the two factors are positive
we have

min(u_j^n, u_{j−1}^n) ≤ u_j^{n+1} ≤ max(u_j^n, u_{j−1}^n),   (6.7)
In plain words, the value at the next time level cannot exceed the maximum of the
two values upstream, nor be less than the minimum of these two values. This is
referred to as the monotonicity property. It plays an important role in devising
schemes which do not generate spurious oscillations because of under-resolved
gradients. We will return to this point several times when discussing dispersive errors
and special advection schemes.
Figure 6.1 shows |A| and φ/φ_a for the donor cell scheme as a function of k∆x for
several values of the Courant number µ. The long waves (small k∆x) are damped the
least for 0 ≤ µ ≤ 1, whereas the high wavenumbers (k∆x → π) are damped the most. The
most vigorous damping occurs for the shortest wavelength at µ = 1/2, where the donor
cell scheme reduces to an average of the two upstream values; the amplification factor
magnitude is then |A| = 0, i.e. 2∆x waves are eliminated after a single time step. The
amplification curves are symmetric about µ = 1/2, and damping lessens as µ becomes
smaller for a fixed wavelength. The dispersive errors are small for long waves; the scheme
is decelerating for all wavelengths for µ < 1/2 and accelerating for 1/2 ≤ µ ≤ 1; the
acceleration peaks at µ = 3/4.
6.3 Backward Time Centered Space (BTCS)

The BTCS scheme uses a backward (implicit) difference in time and a centered difference
in space.

6.3.1 Remarks

1. Truncation error: The Taylor series analysis (expansion about time level n + 1)
leads to the following equation:

u_t + c u_x = (∆t/2) u_tt − (c∆x²/6) u_xxx − (∆t²/6) u_ttt + O(∆t³, ∆x⁴)   (6.9)

The leading truncation error term is O(∆t, ∆x²), and hence the scheme is first
order in time and second order in space. Moreover, the truncation error goes to
zero as ∆t, ∆x → 0, and hence the scheme is consistent.
2. The von Neumann stability analysis leads to the following amplification factor:

A = (1 − iµ sin k∆x) / (1 + µ² sin² k∆x)   (6.10)

|A| = 1 / √(1 + µ² sin² k∆x) ≤ 1,   for all µ, k∆x   (6.11)

φ/φ_a = tan⁻¹(−µ sin k∆x) / (−µk∆x)   (6.12)
Figure 6.2: Amplitude and phase diagrams of the BTCS scheme as a function of the
wavenumber
The scheme is unconditionally stable since |A| < 1 irrespective of the time step
∆t. By the Lax-Richtmeyer theorem, the consistency and stability of the scheme
guarantee that it is also convergent.

The numerical viscosity is hence always positive and lends the scheme its stable
and damping character. Notice that the damping increases with increasing c and
∆t.
4. Notice that the scheme cannot update the solution values one grid point at a time,
since the values u_j^{n+1} and u_{j±1}^{n+1} are unknown and must be determined
simultaneously. This is an example of an implicit scheme, which requires the inversion of a
system of equations. Segregating the unknowns on the left hand side of the equation
we get:

−(µ/2) u_{j−1}^{n+1} + u_j^{n+1} + (µ/2) u_{j+1}^{n+1} = u_j^n   (6.14)

which constitutes a matrix equation for the vector of unknowns at the next time
level. In matrix form the equations are:
M u^{n+1} = u^n   (6.15)

where M is the N × N tridiagonal matrix with −µ/2 on its first subdiagonal, 1 on
its main diagonal, and µ/2 on its first superdiagonal.
The special structure of the matrix is that the only non-zero entries are those along
the diagonal and on the first upper and lower diagonals. This special structure is
referred to as a tridiagonal matrix. Its inversion is far cheaper than that of a full
matrix and can be done in O(N) additions and multiplications through the Thomas
algorithm for tridiagonal matrices (a sketch is given after this list); in contrast, a
full matrix would require O(N³) operations. Finally, the first and last rows of the
matrix have to be modified to take into account boundary conditions. We will
return to the issue of boundary conditions later.
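A standard textbook version of the Thomas algorithm is sketched below (Python with NumPy assumed; this is an illustration, not code from the notes), applied to the tridiagonal BTCS system with zero boundary values:

    import numpy as np

    def thomas(a, b, c, d):
        """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal,
        d = right hand side. O(N) forward elimination plus back substitution."""
        n = len(b)
        cp, dp = np.empty(n), np.empty(n)
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):
            denom = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / denom if i < n - 1 else 0.0
            dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
        x = np.empty(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    # One BTCS step (6.14) on N interior points with mu = c*dt/dx.
    N, mu = 8, 0.5
    a = np.full(N, -mu / 2); b = np.ones(N); c = np.full(N, mu / 2)
    rhs = np.random.rand(N)
    u_new = thomas(a, b, c, rhs)
    M = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    print(np.allclose(M @ u_new, rhs))    # True: the solve is correct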
Figure 6.2 shows the magnitude of the amplification factor |A| for several Courant
numbers. The curves are symmetric about k∆x = π/2. The high and low
wavenumbers are the least damped, whereas the intermediate wavenumbers are the
most damped. The departure of |A| from 1 grows with increasing µ. Finally,
the scheme is decelerating for all wavenumbers and Courant numbers, and the
deceleration worsens for the shorter wavelengths.
Figure 6.3: Phase diagrams of the CTCS scheme as a function of the wavenumber
6.4.1 Remarks

1. Truncation error: The Taylor series analysis leads to the following equation:

u_t + c u_x = −(∆t²/6) u_ttt − (c∆x²/6) u_xxx + O(∆t⁴, ∆x⁴)   (6.17)

The leading truncation error term is O(∆t², ∆x²), and hence the scheme is second
order in time and space. Moreover, the truncation error goes to zero as ∆t, ∆x →
0, and hence the scheme is consistent.
2. The von Neumann stability analysis leads to a quadratic equation for the
amplification factor: A² + 2iµ sin(k∆x) A − 1 = 0. Its two solutions are:

A± = −iµ sin k∆x ± √(1 − µ² sin² k∆x)   (6.18)

|A±| = 1,   for all |µ| < 1   (6.19)

φ/φ_a = tan⁻¹[ −µ sin k∆x / √(1 − µ² sin² k∆x) ] / (−µk∆x)   (6.20)

The scheme is conditionally stable for |µ| < 1, and its amplification is neutral
since |A| = 1 within the stability region. An attribute of the CTCS scheme is
that its amplification factor mirrors the neutral amplification of the analytical
solution. By the Lax-Richtmeyer theorem, the consistency and stability of the scheme
guarantee that it is also convergent.
3. The modified equation for CTCS is

u_t + c u_x = (c∆x²/6)(µ² − 1) u_xxx − (c∆x⁴/120)(9µ⁴ − 10µ² + 1) u_xxxxx + . . .   (6.21)

The even derivatives are absent from the modified equation, indicating the total
absence of numerical dissipation. The only errors are dispersive in nature, due to
the presence of odd derivatives in the modified equation.
4. The scheme requires a starting procedure to kick off the computations. It also
has a computational mode that must be damped.

5. Figure 6.3 shows the phase errors of CTCS for several Courant numbers. All
wavenumbers are decelerating, and the shortest waves are decelerated more than
the long waves.
Figure 6.4: Amplitude and phase diagrams of the Lax-Wendroff scheme as a function of
the wavenumber
All that remains to be done is to use high order approximations for the spatial derivatives
u_x and u_xx. We use centered differences for both terms, as they are second order accurate,
to get the final expression:

(u_j^{n+1} − u_j^n)/∆t + c (u_{j+1}^n − u_{j−1}^n)/(2∆x) − (c²∆t/2)(u_{j+1}^n − 2u_j^n + u_{j−1}^n)/∆x² = 0   (6.25)
6.5.1 Remarks

1. Truncation error: The Taylor series analysis (expansion about time level n)
leads to the following equation:

u_t + c u_x = −(∆t²/6) u_ttt − (c∆x²/6) u_xxx + O(∆t⁴, ∆x⁴)   (6.26)

The leading truncation error term is O(∆t², ∆x²), and hence the scheme is second
order in time and space. Moreover, the truncation error goes to zero as ∆t, ∆x →
0, and hence the scheme is consistent.
2. The von Neumann stability analysis leads to:

A = 1 − µ²(1 − cos k∆x) − iµ sin k∆x   (6.27)

|A|² = [1 − µ²(1 − cos k∆x)]² + µ² sin² k∆x,   (6.28)

φ/φ_a = tan⁻¹[ −µ sin k∆x / (1 − µ²(1 − cos k∆x)) ] / (−µk∆x)   (6.29)
The scheme is conditionally stable for |µ| < 1. By the Lax-Richtmeyer theorem
the consistency and stability of the scheme guarantee it is also convergent.
3. The modified equation for Lax-Wendroff is

u_t + c u_x = (c∆x²/6)(µ² − 1) u_xxx − (c∆x³/8) µ(1 − µ²) u_xxxx + . . .   (6.30)
4. Figure 6.4 shows the amplitude and phase errors of the Lax-Wendroff scheme.
The phase errors are predominantly lagging; the only accelerating errors are those
of the short waves at relatively high values of the Courant number. (A code
comparison with the donor cell scheme follows this list.)
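The different error characters of the two schemes, diffusive for the donor cell and dispersive for Lax-Wendroff, are easy to see in a direct experiment. The sketch below (Python with NumPy assumed; periodic domain and parameters are illustrative) advects a narrow Gaussian exactly once around the domain with both schemes:

    import numpy as np

    N, mu, nsteps = 200, 0.8, 250          # nsteps*mu*dx = 1: one full revolution
    x = np.arange(N) / N
    u0 = np.exp(-300.0 * (x - 0.5) ** 2)
    udc, ulw = u0.copy(), u0.copy()

    for _ in range(nsteps):
        # donor cell (6.6): u_j <- (1 - mu) u_j + mu u_{j-1}
        udc = (1.0 - mu) * udc + mu * np.roll(udc, 1)
        # Lax-Wendroff (6.25), periodic in space
        ulw = (ulw - 0.5 * mu * (np.roll(ulw, -1) - np.roll(ulw, 1))
                   + 0.5 * mu**2 * (np.roll(ulw, -1) - 2.0 * ulw + np.roll(ulw, 1)))

    print(udc.max(), ulw.max())   # donor cell is heavily damped; LW keeps the peak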
4. Donor cell

u_t + c (u_j − u_{j−1})/∆x = 0   (6.37)

5. Third order upwind

u_t + c (2u_{j+1} + 3u_j − 6u_{j−1} + u_{j−2})/(6∆x) = 0   (6.38)
The dispersion relation for these numerical schemes can also be derived on the basis of
periodic solutions of the form u_j = ũ e^{i(kx_j − σt)}. The biggest difference is of course that
the Fourier expansion is discrete in space. The following expressions for the phase velocity
can be derived for the different schemes:

σ/k |_{CD2} = c sin k∆x / (k∆x)

σ/k |_{CD4} = c (8 sin k∆x − sin 2k∆x) / (6k∆x)

σ/k |_{CD6} = c (45 sin k∆x − 9 sin 2k∆x + sin 3k∆x) / (30k∆x)   (6.39)

σ/k |_{Donor} = c [sin k∆x − i(1 − cos k∆x)] / (k∆x)

σ/k |_{Third Upwind} = c [(8 sin k∆x − sin 2k∆x) − 2i(1 − cos k∆x)²] / (6k∆x)
Several things stand out in the numerical dispersion of the various schemes. First, all
of them are dispersive, and hence one expects that a wave form made up of the sum of
individual Fourier components will evolve such that the fast travelling waves pass the
slower moving ones. Second, all the centered difference schemes have a real frequency,
i.e. they introduce no amplitude errors. The off-centered schemes, on the other hand,
have real and imaginary parts: the former influences the phase speed whereas the
latter influences the amplitude. The amplitude decays if Im(σ) < 0 and increases if
Im(σ) > 0. Furthermore, the upwind-biased schemes have the same real part as the next
higher order centered scheme; thus their dispersive properties are as good as those of the
higher order centered scheme, except for the damping associated with their imaginary part
(this is not necessarily a bad thing, at least for the short waves).
Figure 6.5 shows the dispersion curves of the various schemes discussed in this section
versus the analytical dispersion curve (the solid straight line). One can immediately see
the impact of higher order spatial differencing in improving the propagation
characteristics in the intermediate wavenumber range. As the order is increased, the dispersion
curves rise further towards the analytical curve, particularly near k∆x = π/2; hence a
larger portion of the spectrum propagates correctly. The lower panel shows the impact
of biasing the differencing towards the upstream side. The net effect is the introduction
of numerical dissipation, which is strongest for the short waves and decreases with
the order of the scheme.

Figure 6.6 shows the phase speed (upper panel) and group velocity (lower panel) of
the various schemes. Again it is evident that a larger portion of the wave spectrum
propagates correctly as the order is increased. None of the schemes allows the
shortest wave to propagate. The same trend can be seen in the group velocity plot.
However, the impact of the numerical error is more dramatic there, since the short waves
have negative group velocities, i.e. they propagate in the opposite direction. This
trend worsens as the accuracy is increased.
Figure 6.5: Dispersion relation for various semi-discrete schemes. The upper panel
shows the real part of the frequency whereas the bottom panel shows the imaginary part
for the first and third order upwind schemes. The real part of the frequency for these
two schemes is identical to that of the second and fourth order centered schemes.
Figure 6.6: Phase speed (upper panel) and Group velocity (lower panel) for various
semi-discrete schemes.
Chapter 7

Numerical Dispersion of Linearized SWE
This chapter is concerned with the impact of FDA and variable staggering on the fidelity
of wave propagation in numerical models. We will use the shallow water equations as the
model equations on which to compare various approximations. These equations are the
simplest for describing wave motions in the ocean and atmosphere, and they are simple
enough to be tractable with pencil and paper. By comparing the dispersion relation of
the continuous and discrete systems, we can decide which scales of motions are faithfully
represented in the model, and which are distorted. Conversely the diagrams produced
can be used to decide on the number of points required to resolve specific wavelengths.
The two classes of wave motions encountered here are inertia-gravity waves, and Rossby
waves. The main reference for the present work is Dukowicz (1995).
The plan is to look at dynamical systems of increasing complexity in order to highlight
various aspects of the discrete systems. We start by looking at 1D versions of the
linearized shallow water equations, and at unstaggered and staggered versions of the discrete
approximation; in particular we contrast these two approaches for several high order
centered difference schemes and show the superiority of the staggered system. Second,
we look at the impact of including a second spatial dimension and include rotation,
but restrict ourselves to second order schemes; the discussion focuses instead on the
impact of the various staggerings on the dispersive properties. Lastly, we look at the
dispersion relation for Rossby waves.
7.1 Linearized SWE in 1D

The linearized one-dimensional shallow water equations are:

u_t + g η_x = 0   (7.1)
η_t + H u_x = 0   (7.2)
Hence we have for the time derivative u_t = −iσ û e^{i(kx_j − σt)}, and for the spatially discrete
difference

u_{j+m} − u_{j−m} = û [e^{i(kx_{j+m} − σt)} − e^{i(kx_{j−m} − σt)}] = û e^{i(kx_j − σt)} 2i sin mk∆x   (7.7)

Hence the FDA of the spatial derivative has the following expression:

Σ_{m=1}^{M} α_m (u_{j+m} − u_{j−m}) = 2i û e^{i(kx_j − σt)} Σ_{m=1}^{M} α_m sin mk∆x   (7.8)
Similar expressions can be written for the η variable. Inserting the periodic solutions in
(7.5) we get the homogeneous system of equations for the amplitudes û and η̂:

−iσ û + gi ( Σ_{m=1}^{M} α_m sin mk∆x / ∆x ) η̂ = 0   (7.9)

Hi ( Σ_{m=1}^{M} α_m sin mk∆x / ∆x ) û − iσ η̂ = 0   (7.10)
For a non-trivial solution we require the determinant of the system to be equal to zero, a
condition that yields the following dispersion relation:

σ = ±c ( Σ_{m=1}^{M} α_m sin mk∆x ) / ∆x   (7.11)
where c = √(gH) is the gravity wave speed of the continuous system. The phase speed is
then

C_{A,M} = c ( Σ_{m=1}^{M} α_m sin mk∆x ) / (k∆x)   (7.12)

and clearly the departure of the term in brackets from unity determines the FDA phase
fidelity of a given order M. We thus have the following relations for schemes of order 2,
4 and 6:

C_{A,2} = c sin k∆x / (k∆x)   (7.13)

C_{A,4} = c (8 sin k∆x − sin 2k∆x) / (6k∆x)   (7.14)

C_{A,6} = c (45 sin k∆x − 9 sin 2k∆x + sin 3k∆x) / (30k∆x)   (7.15)
Going back to the dispersion relation, we now look for periodic solutions of the form:

u_{j+1/2} = û e^{i(kx_{j+1/2} − σt)}   and   η_j = η̂ e^{i(kx_j − σt)}   (7.24)

Inserting these expressions in the FDA for the C-grid and eliminating the exponential
factors, we obtain the dispersion equations:

−iσ û + 2gi ( Σ_{m=0}^{M} β_m sin[(1 + 2m)k∆x/2] / ∆x ) η̂ = 0   (7.29)

2Hi ( Σ_{m=0}^{M} β_m sin[(1 + 2m)k∆x/2] / ∆x ) û − iσ η̂ = 0   (7.30)
We thus have the following phase speeds for schemes of order 2, 4 and 6:

C_{C,2} = c sin(k∆x/2) / (k∆x/2)   (7.32)

C_{C,4} = c [27 sin(k∆x/2) − sin(3k∆x/2)] / (24 k∆x/2)   (7.33)

C_{C,6} = c [2250 sin(k∆x/2) − 125 sin(3k∆x/2) + 9 sin(5k∆x/2)] / (1920 k∆x/2)   (7.34)
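The phase speed curves of Figure 7.1 follow directly from (7.13)-(7.15) and (7.32)-(7.34). As an illustration (Python with NumPy assumed), the sketch below evaluates them at the 4∆x wave, k∆x = π/2, where the advantage of staggering is already evident:

    import numpy as np

    kdx = np.pi / 2          # a 4*dx wave
    s = np.sin

    # Unstaggered (A-grid) phase speeds, eqs. (7.13)-(7.15), normalized by c:
    a2 = s(kdx) / kdx
    a4 = (8 * s(kdx) - s(2 * kdx)) / (6 * kdx)
    a6 = (45 * s(kdx) - 9 * s(2 * kdx) + s(3 * kdx)) / (30 * kdx)

    # Staggered (C-grid) phase speeds, eqs. (7.32)-(7.34), normalized by c:
    h = kdx / 2
    c2 = s(h) / h
    c4 = (27 * s(h) - s(3 * h)) / (24 * h)
    c6 = (2250 * s(h) - 125 * s(3 * h) + 9 * s(5 * h)) / (1920 * h)

    print("A-grid:", a2, a4, a6)
    print("C-grid:", c2, c4, c6)   # the C-grid values lie closer to 1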
Figure 7.1 compares the shallow water phase speed of the staggered and unstaggered
configurations for various orders of centered difference approximations. The unstaggered
schemes display a familiar pattern: with increasing order the phase speed at intermediate
wavelengths is improved, but there is a rapid deterioration for the marginally resolved
waves, k∆x ≥ 0.6π. The staggered schemes, on the other hand, display a more accurate
representation of the phase speed over the entire spectrum. Notice that the second order
Figure 7.1: Phase speed of the spatially discrete linearized shallow water equation. The
solid lines show the phase speed for the A-grid configuration for centered schemes of order
2, 4 and 6, while the dashed lines show the phase speed of the staggered configuration
for orders 2, 4 and 6.
[Diagram: placement of the variables u, v and η on the unstaggered and staggered grid configurations (not reproduced).]
staggered approximation provides a phase fidelity which is comparable to that of the fourth
order approximation at intermediate wavelengths, 0.2π ≤ k∆x ≤ 0.6π, and superior
for wavelengths k∆x ≥ 0.6π. Finally, and most importantly, the unstaggered schemes
possess a null mode with C = 0, which can manifest itself as a non-propagating 2∆x
spurious mode; the staggered schemes have no null mode.
ut − f v + gηx = 0 (7.35)
vt + f u + gηy = 0 (7.36)
ηt + H(ux + vy ) = 0 (7.37)
where (k, l) are the wavenumbers in the x and y directions, we obtain the following
eigenvalue problem for the frequency σ:

det [ −iσ  −f   gik ;  f  −iσ  gil ;  iHk  iHl  −iσ ] = −iσ [ −σ² + f² + c²(k² + l²) ] = 0   (7.38)

where c = √(gH) is the gravity wave speed. The non-inertial roots can be written in the
following form:

(σ/f)² = 1 + a² [(kd)² + (ld)²]   (7.39)

where d is the grid spacing and a = c/(fd) measures the Rossby radius of deformation in
units of the grid spacing.
2. The C and D grids have a null mode for all zonal wavenumbers when ld = π.

3. For a ≥ 2 the B, C and D grids perform similarly over the resolved portion of the
spectrum, kd ≤ 2π/5.
Figure 7.3: Comparison of the dispersion relation on the Arakawa A, B, C and D grids.
The top figure shows the dispersion relation while the bottom one shows the relative
error. The parameter a = 8.
[Figures analogous to Figure 7.3 compare the dispersion relation on the A, B, C and D grids for additional, smaller values of the parameter a (panels not reproduced).]
Figure 7.8: Comparison of Rossby wave dispersion on the different Arakawa grids. The
top figures show the dispersion while the bottom ones show the relative error. Here
a = 8.
[Further figures repeat the Rossby wave dispersion comparison of Figure 7.8 for the remaining parameter values, r = 4, 2, 1 and 0.5 (panels not reproduced).]
[Figure 7.13: Rossby wave frequency σ versus k∆x for, from top to bottom, a = 8, 4, 2, 1
and 1/2. The left figures show the case l = 0 and the right figures the case l = k. The
black line refers to the continuous case and the colored lines to the A (red), B (blue),
C (green), and D (magenta) grids.]
Chapter 8

Solving the Poisson Equations

Consider the Poisson equation

    ∇²u = f,   x ∈ Ω        (8.1)

subject to appropriate boundary conditions on ∂Ω, the boundary of the domain. The
right hand side f is a known function. We can approximate the above equation using
standard second order finite differences:

    (u_{j+1,k} − 2u_{j,k} + u_{j−1,k})/∆x² + (u_{j,k+1} − 2u_{j,k} + u_{j,k−1})/∆y² = f_{j,k}        (8.2)
The finite difference representation 8.2 of the Poisson equation results in a coupled system
of algebraic equations that must be solved simultaneously. In matrix notation the system
can be written in the form Ax = b, where x represents the vector of unknowns, b
represents the right hand side, and A the matrix representing the system. Boundary
conditions must be applied prior to solving the system of equations.
[Figure 8.1 (sketch): 5 × 5 grid of points, j = 1, . . . , 5 horizontally and k = 1, . . . , 5
vertically, for the square domain of Example 12.]
Example 12 For a square domain divided into 4×4 cells, as shown in figure 8.1, subject
to Dirichlet boundary conditions on all boundaries, there are 9 unknowns u_{j,k} with
j, k = 2, 3, 4. The finite difference equations applied at these points provide us with
the system:

    | −4  1  0  1  0  0  0  0  0 | |u_{2,2}|      |f_{2,2}|   |u_{2,1} + u_{1,2}|
    |  1 −4  1  0  1  0  0  0  0 | |u_{3,2}|      |f_{3,2}|   |u_{3,1}          |
    |  0  1 −4  0  0  1  0  0  0 | |u_{4,2}|      |f_{4,2}|   |u_{4,1} + u_{5,2}|
    |  1  0  0 −4  1  0  1  0  0 | |u_{2,3}|      |f_{2,3}|   |u_{1,3}          |
    |  0  1  0  1 −4  1  0  1  0 | |u_{3,3}| = ∆² |f_{3,3}| − |0                |        (8.3)
    |  0  0  1  0  1 −4  0  0  1 | |u_{4,3}|      |f_{4,3}|   |u_{5,3}          |
    |  0  0  0  1  0  0 −4  1  0 | |u_{2,4}|      |f_{2,4}|   |u_{1,4} + u_{2,5}|
    |  0  0  0  0  1  0  1 −4  1 | |u_{3,4}|      |f_{3,4}|   |u_{3,5}          |
    |  0  0  0  0  0  1  0  1 −4 | |u_{4,4}|      |f_{4,4}|   |u_{4,5} + u_{5,4}|

where ∆x = ∆y = ∆ and the last vector collects the known boundary values. Notice that
the system is symmetric and pentadiagonal (5 non-zero diagonals). This last property
precludes the solution of the system with the efficient tridiagonal solver.
The crux of the work in solving elliptic PDEs is the need to update the unknowns
simultaneously by inverting the system Ax = b. The solution methodologies fall under 2
broad categories:

1. Direct solvers: calculate the solution x = A⁻¹b exactly (up to round-off errors).
These methods can be further classified as:

(a) Matrix methods: the most general type of solvers; they work for arbitrary
non-singular matrices A by factoring the matrix into lower and upper triangular
matrices that can be easily inverted. The most general and robust algorithm for
this factorization is Gaussian elimination with partial pivoting. For real
symmetric systems a slightly faster version relies on the Cholesky algorithm.
The main drawback of matrix methods is that their storage and CPU costs grow
rapidly with the number of points. In particular, the CPU cost grows as O(M³),
where M is the number of unknowns. If the grid has N points in each direction,
then for a 2D problem this cost scales like N⁶, and like N⁹ for 3D problems.

(b) FFT solvers, also referred to as fast solvers. These methods take advantage
of the structure of separable equations to diagonalize the system using Fast
Fourier Transforms. The efficiency of the method rests on the fact that FFT
costs grow like N² ln N in 2D, a substantial reduction compared to the N⁶ cost
of matrix methods.
2. Iterative solvers: these exploit the sparsity of the system of equations to reduce
the CPU cost and accelerate convergence. Here we mention a few of the more common
iterative schemes:
[Figure 8.2: Magnitude of the amplification factor for the Jacobi method (left) and
Gauss-Seidel method (right) as a function of the (x, y) Fourier components κ∆x/π and
λ∆y/π.]
The amplification factor of the error components in the Jacobi iteration (for ∆x = ∆y)
is given by

    G = (cos κ∆x + cos λ∆y)/2        (8.9)

where (κ, λ) are the wavenumbers in the (x, y) directions. A plot of |G| is shown in
figure 8.2. It is clear that the shortest (κ∆x → π) and longest (κ∆x → 0) error
components are damped the least, while the intermediate wavelengths (κ∆x = π/2) are
damped most effectively.
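A minimal sketch of the two updates (my Python rendering, assuming a square grid with ∆x = ∆y = h and Dirichlet values stored on the boundary rows and columns of the array) makes the structural difference explicit: Jacobi sweeps from a frozen copy of the previous iterate, while Gauss-Seidel uses freshly updated values as soon as they are available.

import numpy as np

def jacobi_sweep(u, f, h):
    """One Jacobi sweep for u_xx + u_yy = f; dx = dy = h."""
    un = u.copy()                       # frozen previous iterate
    u[1:-1, 1:-1] = 0.25 * (un[2:, 1:-1] + un[:-2, 1:-1] +
                            un[1:-1, 2:] + un[1:-1, :-2] - h**2 * f[1:-1, 1:-1])
    return u

def gauss_seidel_sweep(u, f, h):
    """One Gauss-Seidel sweep: identical stencil, but new values are
    re-used in place (hence the explicit loops)."""
    nj, nk = u.shape
    for j in range(1, nj - 1):
        for k in range(1, nk - 1):
            u[j, k] = 0.25 * (u[j+1, k] + u[j-1, k] +
                              u[j, k+1] + u[j, k-1] - h**2 * f[j, k])
    return u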
[Figure 8.3: Magnitude of the amplification factor for the Gauss-Seidel method by rows
(left) and Gauss-Seidel by rows and columns (right) as a function of the (x, y) Fourier
components.]
    u^{n+1}_{j,k} = u^n_{j,k} + ω(u*_{j,k} − u^n_{j,k})        (8.12)

where u*_{j,k} is the provisional Gauss-Seidel value and ω is the correction factor. For
ω = 1 we revert to the Gauss-Seidel update, for ω < 1 the correction is under-relaxed,
and for ω > 1 it is over-relaxed. For convergence, it can be shown that 0 < ω < 2. The
optimal ω, ωo, can be quite hard to compute and depends on the number of points in each
direction and the boundary conditions applied. Analytic values for ωo can be obtained
for a Dirichlet problem:

    ωo = 2 (1 − √(1 − β²))/β²,   β = [cos(π/(M−1)) + (∆x²/∆y²) cos(π/(N−1))] / (1 + ∆x²/∆y²)        (8.13)

where M and N are the number of points in the x and y directions, respectively.
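A point-SOR solver built on the update 8.12 and the optimal factor 8.13 might look as follows (a sketch of mine, assuming ∆x = ∆y = h so that β reduces to the average of the two cosines; the rms stopping test anticipates equation 8.20):

import numpy as np

def sor(u, f, h, tol=1e-13, itmax=7000):
    """Point-SOR for the five-point Laplacian; u carries the Dirichlet
    values on its boundary. Returns the solution and iteration count."""
    M, N = u.shape
    beta = 0.5 * (np.cos(np.pi / (M - 1)) + np.cos(np.pi / (N - 1)))
    omega = 2.0 * (1.0 - np.sqrt(1.0 - beta**2)) / beta**2    # eq. 8.13
    for it in range(itmax):
        err = 0.0
        for j in range(1, M - 1):
            for k in range(1, N - 1):
                ustar = 0.25 * (u[j+1, k] + u[j-1, k] + u[j, k+1]
                                + u[j, k-1] - h**2 * f[j, k])
                du = omega * (ustar - u[j, k])                # eq. 8.12
                u[j, k] += du
                err += du * du
        if np.sqrt(err / (M * N)) < tol:                      # rms change
            break
    return u, it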
In the Gauss-Seidel method by rows, all the unknowns on row k are updated simultaneously
within the iteration while u^n_{j,k+1} is still lagged in time. Hence a simple
tridiagonal solver can be used to update the rows one by one. The amplification factor
for this variation on the Gauss-Seidel method is given by:

    |G|² = r⁴ / { [2(1 + r² − cos α)]² − 4(1 + r² − cos α) r² cos β + r⁴ }        (8.15)

A plot of |G| for r = 1 is shown in figure 8.3. The areas with small |G| have expanded
with respect to those shown in figure 8.2. In order to symmetrize the iterations along
the two directions, it is natural to follow a sweep by rows with a sweep by columns. The
amplification factor for this iteration is shown in the right panel of figure 8.3 and
shows a substantial reduction in error amplitude for all wavelengths except the longest
ones.
Example 13 In order to illustrate the efficiency of the different methods outlined above
we solve the following Laplace equation:

    ∇²u = 0,   0 ≤ x, y ≤ 1        (8.16)
    u(0, y) = u(1, y) = 0        (8.17)
    u(x, 1) = sin πx        (8.18)
    u(x, 0) = e^{−16(x−1/4)²} sin πx        (8.19)

We divide the unit square into M × N grid points and we use the following methods:
Jacobi, Gauss-Seidel, SOR, SOR by line in the x-direction, and SOR by line in both
directions. We monitor the convergence history with the rms change in u from one
iteration to the next:

    ‖ǫ‖₂ = [ (1/MN) Σ_{j,k} (u^{n+1}_{jk} − u^n_{jk})² ]^{1/2}        (8.20)

The stopping criterion is ‖ǫ‖₂ < 10⁻¹³, and we limit the maximum number of iterations
to 7,000. We start all iterations with u = 0 (save for the boundary conditions) as an
initial guess. The convergence history is shown in figure 8.4 for M = N = 65. The Jacobi
and Gauss-Seidel iterations have similar convergence histories, except near the end
where Gauss-Seidel converges faster. The SOR iterations are the fastest, reducing the
number of iterations required by almost a factor of 100; we have used the optimal
relaxation factor since it was computable in our case. The SOR variants are quite
similar, showing a slow decrease of the error in the initial stages but a very rapid
decrease in the final stages. The criteria for the selection of an iteration algorithm
should not rely solely on the algorithm's rate of convergence; they should also account
for the operation count needed to complete each iteration.
The convergence history for the above example shows that the 2-way line SOR is the most
efficient per iteration. However, table 8.1 shows that the total CPU time is cheapest
for the point-SOR: the overhead of the tridiagonal solver is not compensated by the
higher per-iteration efficiency of the line-SOR iterations. Table 8.1 also shows that,
where applicable, the FFT-based fast solvers are the most efficient and cheapest.
[Figure 8.4: Convergence history ‖ǫ‖₂ versus iteration count n for the Laplace equation.
The system of equations is solved with: Jacobi (green), Gauss-Seidel (red), SOR (black),
line SOR in x (solid blue), and line SOR in x and y (dashed blue). Here M = N = 65.]
                   33      65      129
    Jacobi        0.161   0.682    2.769
    Gauss-Seidel  0.131   2.197   10.789
    SOR           0.009   0.056    0.793
    SOR-Line      0.013   0.164    1.291
    SOR-Line 2    0.014   0.251    1.403
    FFT           0.000   0.001    0.004

Table 8.1: CPU time in seconds to solve the Laplace equation versus the number of points
(top row).
The iterative methods can be cast in a common framework by splitting the matrix as
A = N − P, where N and P are matrices of the same order as A. The system of equations
becomes:

    N x = P x + b        (8.22)

Starting with an arbitrary vector x⁽⁰⁾, we define a sequence of vectors x⁽ᵛ⁾ by the
recursion

    N x⁽ᵛ⁾ = P x⁽ᵛ⁻¹⁾ + b,   v = 1, 2, . . .        (8.23)

It is now clear what kind of restrictions need to be imposed on the matrices in order to
solve for x, namely: the matrix N must be non-singular, det(N) ≠ 0, and N must be easily
invertible so that computing y from N y = z is computationally efficient.

In order to study how fast the iterations converge to the correct solution, we introduce
the matrix M = N⁻¹P and the error vectors e⁽ⁿ⁾ = x⁽ⁿ⁾ − x. Subtracting equation 8.22
from equation 8.23, we obtain an equation governing the evolution of the error:

    e⁽ⁿ⁾ = M e⁽ⁿ⁻¹⁾ = M² e⁽ⁿ⁻²⁾ = . . . = Mⁿ e⁽⁰⁾        (8.24)

where e⁽⁰⁾ is the initial error. Thus, a sufficient condition for convergence, i.e. that
lim_{n→∞} e⁽ⁿ⁾ = 0, is that lim_{n→∞} Mⁿ = O; this is also necessary for the method to
converge for all e⁽⁰⁾. The condition for a matrix to be convergent is that its spectral
radius satisfy ρ(M) < 1. (Reminder: the spectral radius of a matrix M is the maximum
eigenvalue in magnitude, ρ(M) = maxᵢ|λᵢ|.) Since computing the eigenvalues is usually
difficult, and since the spectral radius is a lower bound for any matrix norm, we often
revert to imposing conditions on a matrix norm to enforce convergence, ρ(M) ≤ ‖M‖ < 1.
In particular, it is common to use either the 1- or infinity-norms since these are the
simplest to calculate.

The spectral radius is also useful in defining the rate of convergence of the method.
Indeed, using equation 8.24 one can bound the norm of the error by
‖e⁽ⁿ⁾‖ ≲ ρ(M)ⁿ ‖e⁽⁰⁾‖. Thus the number of iterations needed to reduce the initial error
by a factor of α is n ≥ ln α / ln[ρ(M)], and a small spectral radius reduces the number
of iterations (and hence the CPU cost) needed for convergence.
Jacobi Method

The Jacobi method derived for the Poisson equation can be generalized by defining the
matrix N as the diagonal of matrix A:

    N = D,   P = D − A        (8.28)

where D = aᵢᵢδᵢⱼ and δᵢⱼ is the Kronecker delta. The iteration matrix is
M = D⁻¹(D − A) = I − D⁻¹A. In component form the update takes the form:

    xᵢⁿ = (1/aᵢᵢ) [ bᵢ − Σ_{j=1, j≠i}^{K} aᵢⱼ xⱼⁿ⁻¹ ]        (8.29)

The procedure can be employed if aᵢᵢ ≠ 0, i.e. if all the diagonal elements of A are
different from zero. The rate of convergence is in general difficult to obtain since the
eigenvalues are not easily available; however, the infinity- and/or 1-norms of M are
easily computed and can be used to bound ρ(M).
Gauss-Seidel Method

A change of splitting leads to the Gauss-Seidel method. Here N is taken to be the lower
triangular part of A, diagonal included:

        | a₁₁                    |
    N = | a₂₁  a₂₂               | ,   P = N − A        (8.34)
        |  ⋮         ⋱           |
        | a_K1  a_K2  ⋯   a_KK   |
8.2 Krylov Method — Conjugate Gradient (CG)

For a symmetric positive definite matrix A, solving Ax = b is equivalent to minimizing
the quadratic functional Φ(x) = ½xᵀAx − xᵀb. The iterates are updated along search
directions pₖ,

    xₖ = xₖ₋₁ + α pₖ,

where xₖ is the k-th iterate and α is a scalar; the two parameters at our disposal are α
and pₖ. We also define the residual vector rₖ = b − Axₖ. We can now relate Φ(xₖ) to
Φ(xₖ₋₁):

    Φ(xₖ) = ½xₖᵀAxₖ − xₖᵀb
          = Φ(xₖ₋₁) + α xₖ₋₁ᵀApₖ + (α²/2) pₖᵀApₖ − α pₖᵀb        (8.39)

For an efficient iteration algorithm, the 2nd and 3rd terms on the right hand side of
equation 8.39 have to be minimized separately. The task is considerably simplified if we
require the search directions pₖ to be A-orthogonal to the current iterate:

    xₖ₋₁ᵀ A pₖ = 0        (8.40)

The remaining task is to choose α such that the last term in 8.39 is minimized. It is a
simple matter to show that the optimal α occurs for

    α = pₖᵀb / (pₖᵀApₖ),        (8.41)

    Φ(xₖ) = Φ(xₖ₋₁) − ½ (pₖᵀb)² / (pₖᵀApₖ).        (8.42)

We can use the orthogonality requirement 8.40 to rewrite the above two equations with b
replaced by the residual rₖ₋₁ (since pₖᵀb = pₖᵀrₖ₋₁ when xₖ₋₁ᵀApₖ = 0):

    α = pₖᵀrₖ₋₁ / (pₖᵀApₖ),   Φ(xₖ) = Φ(xₖ₋₁) − ½ (pₖᵀrₖ₋₁)² / (pₖᵀApₖ)        (8.43)

What remains in defining the iteration is the algorithm needed to update the search
vectors pₖ; the latter must satisfy the orthogonality condition 8.40 and must maximize
the decrease in the functional. Let us denote by Pₖ the matrix
formed by the (k−1) column vectors pᵢ. Since the iterates are linear combinations of the
search vectors, we can write:

    xₖ₋₁ = Σ_{i=1}^{k−1} αᵢ pᵢ = Pₖ₋₁ y        (8.44)

    Pₖ₋₁ = [ p₁ p₂ . . . pₖ₋₁ ]        (8.45)

    y = (α₁, α₂, . . . , αₖ₋₁)ᵀ        (8.46)

We note that the solution vector xₖ₋₁ belongs to the space spanned by the search vectors
pᵢ, i = 1, . . . , k−1. The orthogonality property can now be written as
yᵀPₖ₋₁ᵀApₖ = 0. This property is easy to satisfy if the new search vector pₖ is
A-orthogonal to all the previous search vectors, i.e. if Pₖ₋₁ᵀApₖ = 0. The algorithm can
now be summarized as follows: first we initialize the computation by defining an initial
guess and its residual; second we perform the iterations (the complete loop is
summarized in the sketch at the end of this section).
A vector pₖ which is A-orthogonal to all previous search vectors, and such that
pₖᵀrₖ₋₁ ≠ 0, can always be found. Note that if pₖᵀrₖ₋₁ = 0, then the functional does not
decrease and the minimum has been reached, i.e. the system has been solved. To bring
about the largest decrease in Φ(xₖ), we must maximize the inner product pₖᵀrₖ₋₁. This
can be done by minimizing the angle between the two vectors pₖ and rₖ₋₁, i.e. minimizing
‖rₖ₋₁ − pₖ‖.

Consider the following update for the search direction:

    pₖ = rₖ₋₁ − A Pₖ₋₁ zₖ₋₁        (8.47)

where zₖ₋₁ is chosen to minimize J = ‖rₖ₋₁ − APₖ₋₁z‖². It is easy to show that the
minimum occurs for

    Pₖ₋₁ᵀ Aᵀ A Pₖ₋₁ z = Pₖ₋₁ᵀ Aᵀ rₖ₋₁,        (8.48)

and under this condition pₖᵀAPₖ₋₁ = 0 and ‖pₖ − rₖ₋₁‖ is minimized. We have the
following property:

    Pₖᵀ rₖ = 0,        (8.49)

i.e. the search vectors are orthogonal to the residual vectors.
i.e. the search vectors are orthogonal to the residual vectors. We note that
i.e. these different basis sets are spanning the same vector space. The final steps in the
conjugate gradient algorithm is that the search vectors can be written in the simple form:
It can be shown that the error in the CG algorithm after k iteration is bounded by:
√ !k
κ−1
kxk − xkA ≤ 2kx0 − xkA √ (8.54)
κ+1
maxi (|λi |)
κ(A) = kAkkA−1 k = , (8.55)
mini (|λi |)
Hence the number of iterations needed to reach convergence increases. For efficient
iterations κ must be close to 1, i.e. the eigenvalues cluster around the unit circle. The
problem becomes one of converting the original problem Ax = b into Ãx̃ = b̃ with
κ(Ã) ≈ 1.
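With the short recurrence, the whole algorithm collapses to a few lines; the sketch below is a standard matrix-free conjugate gradient iteration (my rendering — the function name matvec is my own, and only products Av are required):

import numpy as np

def conjugate_gradient(matvec, b, x0, tol=1e-10, itmax=1000):
    """Textbook CG for symmetric positive definite A, supplied as a
    function matvec(v) returning A@v. The search directions are
    A-orthogonal and the residuals mutually orthogonal (eq. 8.49)."""
    x = x0.copy()
    r = b - matvec(x)                 # initial residual
    p = r.copy()                      # first search direction
    rho = r @ r
    for k in range(itmax):
        Ap = matvec(p)
        alpha = rho / (p @ Ap)        # optimal step along p (cf. eq. 8.43)
        x += alpha * p
        r -= alpha * Ap
        rho_new = r @ r
        if np.sqrt(rho_new) < tol:
            break
        p = r + (rho_new / rho) * p   # two-term recurrence for p_k
        rho = rho_new
    return x

# usage: x = conjugate_gradient(lambda v: A @ v, b, np.zeros_like(b))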
8.3 Direct Methods

For separable problems on uniform grids, fast direct solvers can be built on the FFT.
For a doubly periodic domain the unknowns can be expanded in a discrete Fourier series:

    u_{jk} = (1/MN) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} û_{mn} e^{−i2πjm/M} e^{−i2πkn/N}        (8.57)

where û_{mn} are the discrete Fourier coefficients; a similar expression can be written
for the right hand side function f. Replacing the Fourier expansions in the finite
difference Poisson equation 8.2 (with ∆x = ∆y = ∆) we get:

    (1/MN) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} û_{mn} [ e^{−i2πm/M} + e^{i2πm/M} + e^{−i2πn/N} + e^{i2πn/N} − 4 ] e^{−i2πjm/M} e^{−i2πkn/N}
        = (1/MN) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} ∆² f̂_{mn} e^{−i2πjm/M} e^{−i2πkn/N}        (8.58)

Since the Fourier functions form an orthogonal basis, the Fourier coefficients must
match individually. Thus, one can obtain the following expression for the unknowns
û_{mn}:

    û_{mn} = ∆² f̂_{mn} / ( 2[cos(2πm/M) + cos(2πn/N) − 2] ),   m = 0, 1, . . . , M−1,   n = 0, 1, . . . , N−1        (8.59)
For homogeneous Dirichlet boundary conditions the expansion is in sine functions
instead:

    u_{jk} = (1/MN) Σ_{m=1}^{M−1} Σ_{n=1}^{N−1} û_{mn} sin(πjm/M) sin(πkn/N)        (8.60)

Again, the sine basis functions are orthogonal and hence the Fourier coefficients can be
computed as

    û_{mn} = ∆² f̂_{mn} / ( 2[cos(πm/M) + cos(πn/N) − 2] ),   m = 1, . . . , M−1,   n = 1, . . . , N−1        (8.61)

Again, the efficiency of the method rests on the FFT algorithm; specialized routines for
sine transforms are available.
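A compact fast solver for homogeneous Dirichlet conditions can be written with SciPy's type-I discrete sine transform. The sketch below is my implementation of equation 8.61, assuming ∆x = ∆y = dx and f given at the (M−1)×(N−1) interior points:

import numpy as np
from scipy.fft import dstn, idstn

def fft_poisson_dirichlet(f, dx):
    """Solve u_xx + u_yy = f with u = 0 on the boundary via DST-I.
    The sine modes diagonalize the five-point Laplacian, with 1D
    eigenvalues 2*cos(pi*m/M) - 2."""
    M = f.shape[0] + 1                 # number of intervals in x
    N = f.shape[1] + 1                 # number of intervals in y
    fhat = dstn(f, type=1, norm='ortho')
    m = np.arange(1, M)[:, None]       # sine mode numbers in x
    n = np.arange(1, N)[None, :]       # sine mode numbers in y
    denom = 2.0 * (np.cos(np.pi * m / M) + np.cos(np.pi * n / N) - 2.0)
    uhat = dx**2 * fhat / denom        # equation 8.61
    return idstn(uhat, type=1, norm='ortho')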
Chapter 9

Nonlinear equations

Linear stability analysis is not sufficient to establish the stability of finite
difference approximations to nonlinear PDEs. The nonlinearities add severe complications
to the equations by providing a continuous source for the generation of small scales.
Here we investigate how to approach nonlinear problems, and ways to mitigate and control
the growth of nonlinear instabilities.
9.1 Aliasing

In a constant coefficient linear PDE, no new Fourier components are created that are not
present either in the initial and boundary conditions, or in the forcing functions. This
is not the case if nonlinear terms are present or if the coefficients of a linear PDE
are not constant. For example, if two periodic functions, φ = e^{ik₁xⱼ} and
ψ = e^{ik₂xⱼ}, are multiplied during the course of a calculation, a new Fourier mode
with wavenumber k₁ + k₂ is generated:

    φψ = e^{i(k₁+k₂)xⱼ}        (9.1)

The new wave will be shorter than its parents if k₁ and k₂ have the same sign, i.e.
2π/(k₁+k₂) < 2π/k_{1,2}. The representation of this new wave on the finite difference
grid becomes problematic if its wavelength is smaller than twice the grid spacing; in
this case the wave can be mistaken for a longer wave via aliasing.
Aliasing occurs because a function defined on a discrete grid has a limit on the
shortest wavelength it can represent; all shorter waves appear as longer ones. The
shortest wavelength representable on a finite difference grid with step size ∆x is
λₛ = 2∆x, and hence the largest wavenumber is k_max = 2π/λₛ = π/∆x. Figure 9.1 shows an
example of a long and a short wave aliased on a finite difference grid consisting of
6 cells: the function sin(6πx/4∆x) is indistinguishable from the function
−sin(2πx/4∆x), as the two functions coincide at all points of the grid. This coincidence
can be explained by rewriting each Fourier mode as:

    e^{ikxⱼ} = e^{ikj∆x} = e^{ikj∆x} [e^{i2πn}]^j = e^{i(k + 2πn/∆x) j∆x},        (9.2)

where n = 0, ±1, ±2, . . .

[Figure 9.1: Aliasing of the function sin(6πx/4∆x) (solid line) and the function
−sin(2πx/4∆x) (dashed line). The two functions have the same values at the FD grid
points j∆x (solid circles).]

Relation 9.2 is satisfied at all the FD grid points xⱼ = j∆x; it shows that all waves
with wavenumbers k + 2πn/∆x are indistinguishable on a finite difference grid with grid
size ∆x. In the case shown in figure 9.1, the long wave has length 4∆x and the short
wave has length 4∆x/3, so that equation 9.2 applies with n = −1.
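The coincidence is easy to check numerically; this little script (mine) samples the two waves of figure 9.1 at the grid points xⱼ = j∆x:

import numpy as np

dx = 1.0                                    # grid spacing (value immaterial)
x = np.arange(7) * dx                       # grid points of the 6-cell grid

short = np.sin(6 * np.pi * x / (4 * dx))    # wavelength 4*dx/3
alias = -np.sin(2 * np.pi * x / (4 * dx))   # wavelength 4*dx, the n = -1 alias
print(np.max(np.abs(short - alias)))        # ~1e-16: identical on the grid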
Going back to the example of the quadratic nonlinearity φψ: although the individual
functions φ and ψ are representable on the FD grid, i.e. |k_{1,2}| ≤ π/∆x, their product
may not be, since now |k₁ + k₂| ≤ 2π/∆x. In particular, if
π/∆x ≤ |k₁ + k₂| ≤ 2π/∆x, the product will be unresolvable on the discrete grid and will
be aliased into the wavenumber k̃ given by

    k̃ = k₁ + k₂ − 2π/∆x   if k₁ + k₂ >  π/∆x
    k̃ = k₁ + k₂ + 2π/∆x   if k₁ + k₂ < −π/∆x        (9.3)

Note that very short waves are aliased to very long waves: k̃ → 0 when |k_{1,2}| → π/∆x.

[Figure 9.2 (sketch): the wavenumber axis from 0 to 2π/∆x, showing the cutoff kc, the
product wavenumber k₁ + k₂ beyond π/∆x, and the aliased wavenumber |k̃|.]

Consider the quadratic interaction of two waves at a cutoff wavenumber kc: when 2kc
exceeds k_max the product is aliased into |k̃| = k_max − (2kc − k_max). Requiring that
the aliased wave fall outside the retained portion of the spectrum, |k̃| > kc, we end up
with

    kc < (2/3) k_max        (9.4)

For a finite difference grid this is equivalent to kc < 2π/(3∆x).
Notice that the terms within the summation sign do not cancel out, and hence energy is
not conserved. Likewise, a finite difference approximation to the conservative form,

    duⱼ/dt + (u²ⱼ₊₁ − u²ⱼ₋₁)/(4∆x) = 0,        (9.10)

conserves the momentum Σuⱼ (the flux terms telescope under periodic boundary conditions)
but not the energy Σu²ⱼ/2.

[Figure 9.3: Left: solution of the inviscid Burgers equation at t = 0.318 < 1/π using
the advective form (black), momentum-conserving form (blue), and energy-conserving form
(red); the analytical solution is superimposed in green. The initial conditions are
u(x, 0) = −sin πx, the boundary conditions are periodic, the time step is ∆t = 0.01325,
and ∆x = 2/16; RK4 was used for the time integration. Right: energy budget for the
different Burgers schemes: red is the energy-conserving form, blue the momentum-
conserving form, and black the advective form.]
The only energy-conserving form available for the Burgers equation is the skew-symmetric
combination

    duⱼ/dt + (1/3)[ (u²ⱼ₊₁ − u²ⱼ₋₁)/(2∆x) + uⱼ(uⱼ₊₁ − uⱼ₋₁)/(2∆x) ] = 0,

for which the terms inside the summation sign of the energy budget do cancel out.
Figure 9.3 shows solutions of the Burgers equation using the 3 schemes listed above. The
advective form, shown in black, does not conserve energy and exhibits oscillations near
the front region. The oscillations are absent in both the flux and energy-conserving
forms. The flux form, equation 9.10, exhibits a decrease in the energy and in the
amplitude of the waves. Note that the solution is shown just prior to the formation of
the shock at time t = 0.318 < 1/π.
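The experiment is straightforward to reproduce. The sketch below is my reconstruction of the setup, with the three spatial forms written as centered differences on a periodic grid and the energy-conserving form taken as the skew-symmetric combination above; it integrates to t = 0.318 with RK4 and prints the discrete energy:

import numpy as np

M, L = 16, 2.0
dx, dt = L / M, 0.01325
x = -1.0 + dx * np.arange(M)

def ddx(a):                        # centered difference on a periodic grid
    return (np.roll(a, -1) - np.roll(a, 1)) / (2 * dx)

def rhs(u, form):
    if form == 'advective':        # u u_x
        return -u * ddx(u)
    if form == 'flux':             # (u^2/2)_x, momentum conserving
        return -ddx(0.5 * u**2)
    # skew-symmetric form: conserves sum(u^2)/2 under exact time stepping
    return -(ddx(u**2) + u * ddx(u)) / 3.0

def rk4(u, form):
    k1 = rhs(u, form); k2 = rhs(u + 0.5*dt*k1, form)
    k3 = rhs(u + 0.5*dt*k2, form); k4 = rhs(u + dt*k3, form)
    return u + dt * (k1 + 2*k2 + 2*k3 + k4) / 6.0

for form in ('advective', 'flux', 'energy'):
    u = -np.sin(np.pi * x)
    for n in range(24):            # 24 steps: t = 0.318
        u = rk4(u, form)
    print(form, 'energy =', dx * np.sum(u**2) / 2)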
    ∇ · v = 0,        (9.27)
    Tₜ + ∇ · (vT) = 0.        (9.28)

Equation 9.26 is called the advective form, and equation 9.28 is called the flux (or
conservative) form. The two forms are equivalent in the continuum provided the flow is
divergence-free. Note that the above statement holds regardless of the linearity or
dimensionality of the system. Integration of the flux form over the domain Ω shows that

    d/dt ∫_Ω T dV = −∮_{∂Ω} n · vT dS        (9.29)

where n is the unit outward normal, ∂Ω the boundary of the flow domain, and dS an
elemental surface on this boundary; the boundary integral measures the amount of T
entering Ω. Equation 9.29 shows that the total inventory of T in Ω depends only on the
amount of T entering through the boundary. In particular the total budget of T is
constant if the boundary is closed (v · n = 0) or if the domain is periodic.

The above conservation law implies higher order conservation laws. To wit, equation 9.26
can be multiplied by T^m (where m ≥ 0) and the following equation derived:

    ∂T^{m+1}/∂t + v · ∇T^{m+1} = 0,        (9.30)

i.e. the conservation law implies that all moments T^{m+1} are also conserved. Since the
above equation has the same form as the original one, the total inventory of T^{m+1}
will also be conserved under the same conditions as equation 9.29.
The relevant question is: is it possible to come up with a finite difference scheme that
conserves both the first and the second moment? Let us look at the following
approximation of the flux form:

    ∇ · (vT) = δ_x(\overline{u}^x \overline{T}^x) + δ_y(\overline{v}^y \overline{T}^y).        (9.32)
Can the term T ∇·(vT) also be written in flux form (for then it will be conserved upon
summation)? We concentrate on the x-component for simplicity:

    T δ_x(\overline{u}^x \overline{T}^x)
      = δ_x[\overline{u}^x (\overline{T}^x)²] − \overline{\overline{u}^x \overline{T}^x δ_x T}^x        (9.33)
      = δ_x[\overline{u}^x (\overline{T}^x)²] − \overline{\overline{u}^x δ_x(T²/2)}^x        (9.34)
      = δ_x[\overline{u}^x (\overline{T}^x)²] − δ_x[\overline{u}^x \overline{T²/2}^x] + (T²/2) δ_x(\overline{u}^x)        (9.35–9.36)
      = δ_x[\overline{u}^x \widetilde{T²/2}^x] + (T²/2) δ_x(\overline{u}^x),   \widetilde{T²/2}^x ≡ (\overline{T}^x)² − \overline{T²/2}^x        (9.37)

Equality 9.33 follows from property 9.20 and 9.34 from 9.23; the passage to 9.35–9.36
relies on properties 9.19, 9.21 and 9.24, the ∆x²/4 correction terms generated in the
intermediate steps cancelling each other. The final form 9.37 merely combines the two
flux terms (the operators are linear) using 9.25. A similar derivation can be carried
out for the y-component of the divergence:

    T δ_y(\overline{v}^y \overline{T}^y) = δ_y[\overline{v}^y \widetilde{T²/2}^y] + (T²/2) δ_y(\overline{v}^y)        (9.38)

Thus, the semi-discrete second moment conservation equation becomes:

    ∂(T²/2)/∂t = −δ_x[\overline{u}^x \widetilde{T²/2}^x] − δ_y[\overline{v}^y \widetilde{T²/2}^y] − (T²/2)[δ_x(\overline{u}^x) + δ_y(\overline{v}^y)]        (9.39)
The first and second terms in the semi-discrete conservation equation are in flux form,
and hence cancel out upon summation. The third term on the right hand side is nothing
but the discrete divergence constraint. Thus, the second moment of T will be conserved
provided that the velocity field is discretely divergence-free.
The following is a FD approximation to v·∇T consistent with the above derivation:

    u ∂T/∂x = ∂(uT)/∂x − T ∂u/∂x        (9.40)
           → δ_x(\overline{u}^x \overline{T}^x) − T δ_x(\overline{u}^x)        (9.41)
            = T δ_x(\overline{u}^x) + \overline{\overline{u}^x δ_x T}^x − T δ_x(\overline{u}^x)        (9.42)
            = \overline{\overline{u}^x δ_x T}^x        (9.43)

Thus, we have

    v·∇T = \overline{\overline{u}^x δ_x T}^x + \overline{\overline{v}^y δ_y T}^y        (9.44)
9.5 Conservation in Vorticity-Streamfunction Formulation

In the vorticity-streamfunction formulation of two-dimensional incompressible flow the
velocity derives from a streamfunction,

    u = −ψ_y,   v = ψ_x        (9.45)

and the vorticity is

    ζ = v_x − u_y = ψ_xx + ψ_yy = ∇²ψ        (9.46)

The vorticity advection equation can be obtained by taking the curl of the momentum
equation, thus:

    ∂∇²ψ/∂t = J(∇²ψ, ψ)        (9.47)

where J stands for the Jacobian operator:

    J(a, b) = a_x b_y − b_x a_y        (9.48)

The Jacobian satisfies the identities

    J(a, b) = ∇a · (∇b × k)        (9.50)
            = ∇ · (a ∇b × k)        (9.51)
            = −∇ · (b ∇a × k)        (9.52)

where k is the vertical unit vector.
3. The integral of the Jacobian over a closed domain can be turned into a boundary
integral thanks to the above identities:

    ∫_Ω J(a, b) dA = ∮_{∂Ω} a (∂b/∂s) ds = −∮_{∂Ω} b (∂a/∂s) ds        (9.53)

where s is the direction tangential to the boundary. Hence, the integral of the Jacobian
vanishes if either a or b is constant along ∂Ω. In particular, if the boundary is a
streamline or a vorticity isoline, the Jacobian integral vanishes; the area-averaged
vorticity is hence conserved.

4. The following relations hold:

    a J(a, b) = J(a²/2, b),   b J(a, b) = J(a, b²/2)        (9.54–9.55)
Thus, the area integrals of aJ(a, b) and bJ(a, b) are zero if either a or b is constant
along the boundary. It is then easy to show that enstrophy, ζ²/2, and kinetic energy,
|∇ψ|²/2, are conserved if the boundary is closed. We would like to investigate whether
vorticity, energy and enstrophy can also be conserved in the discrete equations. We
begin by noting that the Jacobian in the continuum can be written in one of 3 ways:

    J₁(ζ, ψ) = ζ_x ψ_y − ζ_y ψ_x        (9.56)
    J₂(ζ, ψ) = (ζψ_y)_x − (ζψ_x)_y        (9.57)
    J₃(ζ, ψ) = (ψζ_x)_y − (ψζ_y)_x        (9.58)
It is obvious that discrete analogues of J₂ and J₃ conserve vorticity since they are in
flux form; the centered discrete J₁ can also be shown to conserve the first moment, as
repeated use of the discrete product identities rearranges it into pure flux form
(equations 9.62–9.63), so that vorticity conservation is ensured. Now we turn our
attention to the conservation of quadratic quantities, namely kinetic energy and
enstrophy. It is easy to show that J₂ conserves kinetic energy, since ψJ₂(ζ, ψ) can
likewise be rearranged into flux form plus terms that cancel upon summation (equations
9.64–9.66). The finite difference Jacobians furthermore satisfy an exchange property
under summation by parts (equation 9.67). Hence, from equation 9.66, ζJ₃(ζ, ψ) can be
written in flux form, and from equation 9.67 ψ(J₁ + J₃)/2 can also be written in flux
form. These results can be tabulated: J₂ conserves energy, J₃ conserves enstrophy, the
symmetric combinations (J₁+J₃)/2 and (J₁+J₂)/2 conserve energy and enstrophy,
respectively, and the Arakawa Jacobian J_A = (J₁+J₂+J₃)/3 conserves both. Note that the
terms in ζ_{j,k} cancel out, so the expression for J_A can use ±ζ_{j,k} or no value at
all.
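For reference, here is the classical nine-point Arakawa Jacobian J_A = (J₁ + J₂ + J₃)/3 in my transcription for a doubly periodic grid (the indexing conventions are mine; consult Arakawa (1966) for the authoritative formulas):

import numpy as np

def arakawa_jacobian(z, p, dx, dy):
    """J_A(zeta, psi) on a doubly periodic grid: conserves mean
    vorticity, kinetic energy and enstrophy."""
    zip1 = np.roll(z, -1, 0); zim1 = np.roll(z, 1, 0)   # zeta at i+1, i-1
    zjp1 = np.roll(z, -1, 1); zjm1 = np.roll(z, 1, 1)
    pip1 = np.roll(p, -1, 0); pim1 = np.roll(p, 1, 0)   # psi at i+1, i-1
    pjp1 = np.roll(p, -1, 1); pjm1 = np.roll(p, 1, 1)
    pipjp = np.roll(pip1, -1, 1); pipjm = np.roll(pip1, 1, 1)   # corners
    pimjp = np.roll(pim1, -1, 1); pimjm = np.roll(pim1, 1, 1)
    zipjp = np.roll(zip1, -1, 1); zipjm = np.roll(zip1, 1, 1)
    zimjp = np.roll(zim1, -1, 1); zimjm = np.roll(zim1, 1, 1)

    j1 = (zip1 - zim1) * (pjp1 - pjm1) - (zjp1 - zjm1) * (pip1 - pim1)
    j2 = (zip1 * (pipjp - pipjm) - zim1 * (pimjp - pimjm)
          - zjp1 * (pipjp - pimjp) + zjm1 * (pipjm - pimjm))
    j3 = (zipjp * (pjp1 - pip1) - zimjm * (pim1 - pjm1)
          - zimjp * (pjp1 - pim1) + zipjm * (pip1 - pjm1))
    return (j1 + j2 + j3) / (12.0 * dx * dy)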
An important property of the Arakawa Jacobian is that it inhibits the pile-up of energy
at small scales, a consequence of conserving enstrophy and energy simultaneously. Since
both quantities are conserved, so is their ratio, which can be used to define an average
wavenumber κ:

    κ² = ∫_A (∇²ψ)² dA / ∫_A |∇ψ|² dA.        (9.73)

For the case of a periodic problem, the relationship between the above ratio and
wavenumbers can be easily demonstrated by expanding the streamfunction and vorticity in
terms of Fourier modes,
    ψ = Σ_m Σ_n ψ̂_{m,n} e^{i(mx+ny)},   ζ = ∇²ψ = −Σ_m Σ_n (m² + n²) ψ̂_{m,n} e^{i(mx+ny)}        (9.74–9.75)
where ψ̂_{m,n} are the complex Fourier coefficients and the computational domain has
been mapped into the square [0, 2π]². Using the orthogonality of the Fourier modes, it
is easy to show that the ratio κ² becomes

    κ² = Σ_m Σ_n (m² + n²)² |ψ̂_{m,n}|² / Σ_m Σ_n (m² + n²) |ψ̂_{m,n}|²        (9.76)

The implication of equation 9.76 is that there can be no one-way cascade of energy in
wavenumber space: if some “local” cascading takes place from one part of the spectrum to
another, there must be a compensating shift of energy in another part.
9.6 Conservation in Primitive Equations

In order to make use of the results obtained with the vorticity-streamfunction form, it
is useful to introduce a fictitious streamfunction in the primitive variables, using a
staggered grid akin to the Arakawa C-grid shown in figure 9.4.

[Figure 9.4 (sketch): Arakawa C-grid unit cell, with η at the cell center, u on the east
and west faces, v on the north and south faces, and ψ at the corners.]

The staggered velocities are defined with respect to the “fictitious streamfunction”:

    u = −δ_y ψ,   v = δ_x ψ,   ζ = δ_x v − δ_y u        (9.79)

Comparing equations 9.80 and 9.77 we can deduce the following energy-conserving momentum
advection operators:

    v·∇u = \overline{\overline{u}^x δ_x u}^x + \overline{\overline{v}^x δ_y u}^y
    v·∇v = \overline{\overline{u}^y δ_x v}^x + \overline{\overline{v}^y δ_y v}^y        (9.81)
    (J₁ + J₂)/2 = −δ_x(\overline{u}^{xy} δ_x \overline{v}^x + \overline{v}^{yy} δ_y \overline{v}^y) + δ_y(\overline{u}^{xx} δ_x \overline{u}^x + \overline{v}^{xy} δ_y \overline{u}^y),        (9.82)

and we can deduce the following enstrophy-conserving operators:

    v·∇u = \overline{u}^{xx} δ_x \overline{u}^x + \overline{v}^{xy} δ_y \overline{u}^y
    v·∇v = \overline{u}^{xy} δ_x \overline{v}^x + \overline{v}^{yy} δ_y \overline{v}^y        (9.83)

If either 9.81 or 9.83 is used in the momentum equation, and if the flow is discretely
divergence-free, then energy or enstrophy is conserved in the same manner as in the
vorticity equation through J₂ or (J₁+J₂)/2. Stated differently, only the divergent part
of the velocity field can disrupt these conservation properties.

9.7 Conservation for Divergent Flows

For divergent flows the advection operators can be generalized by mixing contributions
from the grid-aligned and diagonal directions:

    ∇·(vu) = (2/3)[δ_x(\overline{u}^{xyy} \overline{u}^x) + δ_y(\overline{v}^{xyy} \overline{u}^y)] + (1/3)[δ_{x′}(\overline{u′}^{y′} \overline{u}^{x′}) + δ_{y′}(\overline{v′}^{y′} \overline{u}^{y′})]        (9.84)
    ∇·(vv) = (2/3)[δ_x(\overline{u}^{xxy} \overline{v}^x) + δ_y(\overline{v}^{xxy} \overline{v}^y)] + (1/3)[δ_{x′}(\overline{u′}^{x′} \overline{v}^{x′}) + δ_{y′}(\overline{v′}^{x′} \overline{v}^{y′})]        (9.85)

where the primed velocities are defined through the streamfunction in the rotated
frame:

    u′ = −δ_{y′} ψ = (\overline{u}^x + \overline{v}^y)/√2,   v′ = δ_{x′} ψ = (−\overline{u}^x + \overline{v}^y)/√2        (9.86–9.87)

The (x′, y′) coordinate system is rotated 45 degrees counterclockwise with respect to
the (x, y) system, i.e. it lies along the diagonal directions of the original axes.
The corresponding potential vorticity flux terms are discretized as:

    Continuum        Discrete
    −v(ζ + f)        −\overline{\overline{V}^{xy} \overline{q}^y}
     u(ζ + f)         \overline{\overline{U}^{xy} \overline{q}^x}

where q is the potential vorticity and (U, V) the volume fluxes.
Chapter 10

Special Advection Schemes

10.1 Introduction

This chapter deals with specialized advection schemes designed to handle problems where,
in addition to consistency, stability and conservation, additional constraints on the
solution must be satisfied. For example, biological or chemical concentrations must be
non-negative for physical reasons; numerical errors, however, are capable of generating
negative values which are simply wrong and not amenable to physical interpretation.
These negative values can impact the solution adversely, particularly if there is a
feedback loop that exacerbates these spurious values by increasing their unphysical
magnitude. An example of a feedback loop is a reaction term, valid only for positive
values, that leads to moderate growth or decay of the quantity in question, whereas
negative values lead to unstable exponential growth. Another example is the equation of
state in ocean models, which intimately ties salt, temperature, and density. This
equation is empirical in nature and is valid for specific ranges of temperature, salt,
and density; the results of out-of-range inputs are unpredictable and lead quickly to
instabilities in the simulation.

The primary culprit in these numerical artifacts is the advection operator, as it is the
primary means by which tracers are moved around in a fluid environment. Molecular
diffusion is usually too weak to account for much of the transport, and what passes for
turbulent diffusion has its roots in “vigorous” advection in straining flow fields.
Advection transports a tracer from one place to another without change of shape, and as
such preserves the original extrema (maxima and minima) of the field for long times (in
the absence of other physical mechanisms). Problems occur when the gradients are too
steep to be resolved by the underlying computational grid. Examples include true
discontinuities, such as shock waves or tidal bores, or pseudo-discontinuities such as
temperature or salt fronts that are too narrow to be resolved on the grid (a few hundred
meters wide, whereas the computational grid is of the order of kilometers).

A number of special advection schemes were devised to address some or all of these
issues. They are known generically as Total Variation Diminishing (TVD) schemes, and
they occupy a prominent place in the study and numerical solution of hyperbolic
equations like the Euler equations of gas dynamics or the shallow water equations. Here
we confine ourselves to the pure advection equation, a scalar hyperbolic equation.
10.2 Monotone Schemes

A monotone scheme preserves an initial monotone ordering of the solution values,

    T^n_j ≥ T^n_{j+1}        (10.1)

for all j and n. A general advection scheme can be written in the form:

    T^{n+1}_j = Σ_{k=−p}^{q} α_k T^n_{j+k}        (10.2)

where the α_k are coefficients that depend on the specific scheme used. A linear scheme
is one where the coefficients α_k are independent of the solution T_j. For a scheme to
be monotone with respect to the T^n_k we need the condition

    ∂T^{n+1}_j / ∂T^n_{j+k} ≥ 0        (10.3)

Godunov has shown that the only linear monotone scheme is the first order (upstream)
donor-cell scheme. All higher-order linear schemes are not monotone and will permit
spurious extrema to be generated; high-order schemes must therefore be nonlinear in
order to preserve monotonicity.
10.3 Flux Corrected Transport (FCT)

10.3.1 One-Dimensional

Consider the advection of a tracer in one dimension written in conservation form:

    T_t + (uT)_x = 0        (10.4)

subject to appropriate initial and boundary conditions. The spatially integrated form of
this equation leads to the following:

    ∫_{x_{j−1/2}}^{x_{j+1/2}} T_t dx + f|_{x_{j+1/2}} − f|_{x_{j−1/2}} = 0        (10.5)

where f|_{x_{j+1/2}} = [uT]_{x_{j+1/2}} is the flux through the right edge of cell j.
This equation is nothing but a restatement of the partial differential equation: the
budget of T in cell j changes according to the advective fluxes in and out of the cell.
As a matter of fact the above equation can be reinterpreted as a finite volume method if
the integral is replaced by (∂T̄_j/∂t)∆x, where T̄_j refers to the average of T in cell
j whose size is ∆x. We now have:

    ∂T̄_j/∂t + (f_{j+1/2} − f_{j−1/2})/∆x = 0        (10.6)

If the analytical flux is now replaced by a numerical flux, F, we can generate a family
of discrete schemes. If we choose an upstream-biased scheme where the value within each
cell is considered constant, i.e. F_{j+1/2} = u_{j+1/2} T_j for u_{j+1/2} > 0 and
F_{j+1/2} = u_{j+1/2} T_{j+1} for u_{j+1/2} < 0, we get the donor cell scheme. Note that
the two cases can be re-written (and programmed) without branching as:

    F_{j+1/2} = u_{j+1/2}(T_j + T_{j+1})/2 + |u_{j+1/2}|(T_j − T_{j+1})/2        (10.7)
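In code the branch-free donor-cell flux is a one-liner; the snippet below is a direct transcription of equation 10.7 (periodic wrap-around assumed):

import numpy as np

def donor_cell_flux(u_face, T):
    """Upwind flux F_{j+1/2}: u*T_j for u > 0, u*T_{j+1} for u < 0,
    written without branches; u_face[j] holds u_{j+1/2}."""
    Tr = np.roll(T, -1)                       # T_{j+1}
    return 0.5 * (u_face * (T + Tr) + np.abs(u_face) * (T - Tr))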
4. Update the solution using the low order fluxes to obtain a first-order, diffused but
monotonic approximation:

    T^d_j = T^n_j − (F^L_{j+1/2} − F^L_{j−1/2}) ∆t/∆x        (10.9)

5. Limit the anti-diffusive flux so that the corrected solution will be free of extrema
not found in T^n_j or T^d_j. The limiting is effected through a factor:
A^c_{j+1/2} = C_{j+1/2} A_{j+1/2}, where 0 ≤ C_{j+1/2} ≤ 1.
1. Optional step designed to eliminate the correction near extrema: set A_{j+1/2} = 0 if

    A_{j+1/2}(T^d_{j+1} − T^d_j) < 0   and   { A_{j+1/2}(T^d_{j+2} − T^d_{j+1}) < 0  or  A_{j+1/2}(T^d_j − T^d_{j−1}) < 0 }        (10.12)

2. Compute the permissible extrema T^max_j (and T^min_j) from T^n and T^d over cell j
and its immediate neighbors.

3. Compute the total anti-diffusive flux into cell j:

    P⁺_j = max(0, A_{j−1/2}) − min(0, A_{j+1/2})

4. Compute the maximum permissible incoming flux that will keep T^{n+1}_j ≤ T^max_j.
From the corrective step in equation 10.10 this is given by

    Q⁺_j = (T^max_j − T^d_j) ∆x/∆t        (10.15)

5. Define the ratio R⁺_j = min(1, Q⁺_j/P⁺_j) if P⁺_j > 0, and R⁺_j = 0 otherwise.
6. Steps 3, 4 and 5 must be repeated to enforce the lower bound T^min_j ≤ T^{n+1}_j. We
define the anti-diffusive fluxes away from cell j:

    P⁻_j = max(0, A_{j+1/2}) − min(0, A_{j−1/2})        (10.17)

    Q⁻_j = (T^d_j − T^min_j) ∆x/∆t        (10.18)

    R⁻_j = min(1, Q⁻_j/P⁻_j)  if P⁻_j > 0;   R⁻_j = 0  if P⁻_j = 0        (10.19)

7. We now choose the limiting factors so as to enforce the extrema constraints
simultaneously on adjacent cells:

    C_{j+1/2} = min(R⁺_{j+1}, R⁻_j)   if A_{j+1/2} ≥ 0
    C_{j+1/2} = min(R⁺_j, R⁻_{j+1})   if A_{j+1/2} < 0        (10.20)

A complete single-step limiter assembled from these pieces is sketched below.
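The following routine is my condensed transcription of the steps above for a one-dimensional periodic grid; A holds the antidiffusive fluxes A_{j+1/2} (high-order minus low-order flux), and the returned array holds the limited fluxes A^c_{j+1/2}:

import numpy as np

def fct_limit(A, Tn, Td, dx, dt):
    """Zalesak-style limiting of antidiffusive fluxes (eqs 10.12-10.20)."""
    Tdp = np.roll(Td, -1)
    # optional prelimiting near extrema (equation 10.12)
    s = A * (Tdp - Td)
    A = np.where((s < 0) & ((A * (np.roll(Td, -2) - Tdp) < 0) |
                            (A * (Td - np.roll(Td, 1)) < 0)), 0.0, A)
    # permissible extrema from the old and low-order solutions
    Tmax = np.maximum(Tn, Td)
    Tmax = np.maximum(np.maximum(np.roll(Tmax, 1), Tmax), np.roll(Tmax, -1))
    Tmin = np.minimum(Tn, Td)
    Tmin = np.minimum(np.minimum(np.roll(Tmin, 1), Tmin), np.roll(Tmin, -1))
    Am, Ap = np.roll(A, 1), A                     # A_{j-1/2}, A_{j+1/2}
    Pp = np.maximum(0, Am) - np.minimum(0, Ap)    # incoming flux
    Qp = (Tmax - Td) * dx / dt                    # eq 10.15
    Rp = np.where(Pp > 0, np.minimum(1, Qp / np.where(Pp > 0, Pp, 1)), 0.0)
    Pm = np.maximum(0, Ap) - np.minimum(0, Am)    # outgoing flux (eq 10.17)
    Qm = (Td - Tmin) * dx / dt                    # eq 10.18
    Rm = np.where(Pm > 0, np.minimum(1, Qm / np.where(Pm > 0, Pm, 1)), 0.0)
    C = np.where(A >= 0, np.minimum(np.roll(Rp, -1), Rm),
                 np.minimum(Rp, np.roll(Rm, -1)))  # eq 10.20
    return C * A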
[Figure 10.1: Uncorrected centered difference scheme of 8th order with RK3, ∆t = 10⁻³;
the five panels show the solution for M = 50, 100, 200, 400 and 2000 cells.]
[Figure 10.2: Uncorrected 1st order (donor-cell) scheme with RK1, ∆t = 10⁻³; panels show
M = 50, 100, 200, 400 and 2000 cells.]
[Figure 10.3: FCT-corrected centered difference scheme of 8th order with RK3,
∆t = 10⁻³; panels show M = 50, 100, 200, 400 and 2000 cells.]
As anticipated from the theoretical discussion of the donor-cell scheme, its errors are
primarily dissipative. At coarse resolution the signals in the solution have almost
disappeared and the solution has been smeared excessively. Even at the highest
resolution used here the smooth peaks have lost much of their amplitude, while the
discontinuities (of various strengths) have also been smeared. On the up-side, the
donor-cell scheme has delivered solutions that are oscillation-free and that respect the
TVD properties of the continuous PDE.
The FCT approach, as well as other limiting methods, aims to achieve a happy medium
between diffusive-but-oscillation-free and high-order-but-oscillatory. Figure 10.3
illustrates FCT's benefits (and drawbacks). First, the oscillations have been eliminated
at all resolutions. Second, at the coarsest resolution, there is a significant loss of
peak amplitude accompanied by the so-called terracing effect, whereby the limiter tends
to flatten smooth peaks and to introduce intermediate “steps” in the solution where
previously there were none. This effect continues well into the well-resolved regime of
∆x = 20/400, where the smooth Gaussian peak has still not recovered its full amplitude
and is experiencing a continuing terracing effect. This contrasts with the uncorrected
solution, where the smooth peak was recovered at the more modest resolution of
∆x = 20/200. This is a testament to an overactive FCT that cannot distinguish between
smooth peaks and discontinuities; as a result, smooth peaks that ought not to be limited
are flattened out. It is possible to mitigate an overzealous limiter by appending a
“discriminator” that turns off the limiting near smooth extrema. In general the
discriminator is built upon checking whether the second derivative of the solution is
one-signed in neighborhoods where the solution's slope changes sign. Furthermore, the
terracing effect can be eliminated by introducing a small amount of scale-selective
dissipation flux (essentially a hyperviscosity). For further details see Shchepetkin and
McWilliams (1998) and Zalesak (2005).
The FCT procedure has advantages and disadvantages. The primary benefit is that it is a
practical procedure to prevent the generation of spurious extrema; it is also flexible
in the definition of the high order fluxes and of the extrema of the fields. Most
importantly, the algorithm can be extended to multiple dimensions in a relatively
straightforward manner. The disadvantage is that the procedure is costly in CPU compared
to unlimited methods — but nothing comes free.

In two dimensions the extrema of the solution are defined in two passes over the data.
First we set T^max_{j,k} = max(T^n_{j,k}, T^d_{j,k}); the final permissible values are
then determined by taking the extrema over each cell and its nearest neighbors.
Finally, the incoming and outgoing fluxes in cell (j, k) are given by:

    P⁺_{j,k} = max(0, A_{j−1/2,k}) − min(0, A_{j+1/2,k}) + max(0, A_{j,k−1/2}) − min(0, A_{j,k+1/2})        (10.24)
    P⁻_{j,k} = max(0, A_{j+1/2,k}) − min(0, A_{j−1/2,k}) + max(0, A_{j,k+1/2}) − min(0, A_{j,k−1/2})        (10.25)
    T^{n+1/2} = (T^n + T̂^{n+1})/2        (10.27)

and then used to compute the high order fluxes

    F^H_{j+1/2} = F^H_{j+1/2}(T^{n+1/2}).        (10.28)
It is these last high order fluxes that are limited using the FCT algorithm.
10.4 Slope/Flux Limiter Methods

[Figure 10.4: Graph of the different limiters C(r) — minmod, Van Leer, MC, and
Superbee — as a function of the slope ratio r.]
where the limiting factor C = C(r) is a function of the slope ratio in neighboring
cells:

    r_{j+1/2} = (T_j − T_{j−1}) / (T_{j+1} − T_j).        (10.30)

For a slowly varying smooth function the slope ratio is close to 1; the slopes change
sign near an extremum and the ratio becomes negative. A family of methods can be
generated based entirely on the choice of limiting function; the standard choices are:

    minmod:   C(r) = max[0, min(1, r)]
    superbee: C(r) = max[0, min(2r, 1), min(r, 2)]
    Van Leer: C(r) = (r + |r|)/(1 + |r|)
    MC:       C(r) = max[0, min(2r, (1 + r)/2, 2)]
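In code (my vectorized transcription of these standard formulas):

import numpy as np

def minmod(r):   return np.maximum(0.0, np.minimum(1.0, r))
def superbee(r): return np.maximum.reduce([np.zeros_like(r),
                                           np.minimum(2*r, 1.0),
                                           np.minimum(r, 2.0)])
def van_leer(r): return (r + np.abs(r)) / (1.0 + np.abs(r))
def mc(r):       return np.maximum(0.0, np.minimum.reduce(
                        [2*r, 0.5*(1.0 + r), 2.0*np.ones_like(r)]))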
The graph of these limiters is shown in figure 10.4. The different functions have a
number of common features. All limiters set C = 0 near extrema (r ≤ 0). They all
asymptote to 2 when the function changes rapidly (r ≫ 1), save for the minmod limiter
which asymptotes to 1. The minmod limiter is the most stringent: it prevents the
solution gradient from changing quickly between neighboring cells, and is known to be
diffusive. The other limiters are more lenient — the MC limiter being the most
lenient — and permit the gradient in a cell to be up to twice as large as that in the
neighboring cell. The Van Leer limiter is the smoothest of the limiters and asymptotes
to 2 for r → ∞.
10.5 MPDATA

The Multidimensional Positive Definite Advection Transport Algorithm (MPDATA) was
presented by Smolarkiewicz (1983) as an algorithm to preserve the positivity of the
advected field throughout the simulation. The motivation behind his work is that
chemical tracers must remain positive. Non-oscillatory schemes like FCT are positive
definite but were deemed too expensive, particularly since oscillations are tolerable as
long as they do not involve negative values. MPDATA is built on the monotone donor cell
scheme and on its modified equation; the latter is used to determine the diffusive
errors of the scheme and to correct for them near the zero values of the field. The
scheme is presented here in its one-dimensional form for simplicity. The modified
equation for the donor cell scheme, with the fluxes defined as in equation 10.7, is:

    ∂T/∂t + ∂(uT)/∂x = ∂/∂x (κ ∂T/∂x) + O(∆x²)        (10.32)

where κ is the numerical diffusion generated by the donor cell scheme:

    κ = (|u|∆x − u²∆t)/2        (10.33)

The donor cell scheme produces a first estimate of the field which is guaranteed to be
non-negative if the field is initially non-negative. This estimate, however, is too
diffused and must be corrected to eliminate the first order errors. MPDATA achieves the
correction by casting the second order derivative in the modified equation 10.32 as
another transport step with a pseudo-velocity ũ:

    ∂T/∂t = −∂(ũT)/∂x,   ũ = (κ/T) ∂T/∂x  for T > 0,   ũ = 0  for T = 0        (10.34)
and re-using the donor cell scheme to discretize it. The velocity ũ plays the role of an
anti-diffusion velocity that tries to compensate for the diffusive error of the first
step. The correction step takes the form:

    ũ_{j+1/2} = (|u_{j+1/2}|∆x − u²_{j+1/2}∆t) (T^d_{j+1} − T^d_j) / [ (T^d_{j+1} + T^d_j + ǫ)∆x ]        (10.35)

    F̃_{j+1/2} = [(ũ_{j+1/2} + |ũ_{j+1/2}|)/2] T^d_j + [(ũ_{j+1/2} − |ũ_{j+1/2}|)/2] T^d_{j+1}        (10.36)

    T^{n+1}_j = T^d_j − (F̃_{j+1/2} − F̃_{j−1/2}) ∆t/∆x        (10.37)

where T^d_j is the diffused solution from the donor-cell step, and ǫ is a small positive
number, e.g. 10⁻¹⁵, meant to prevent the denominator from vanishing when
T^d_j = T^d_{j+1} = 0. The second donor cell step is stable provided the original one
is, and hence the correction does not penalize the stability of the scheme. The
procedure to derive the two-dimensional version of the scheme is similar; the major
difficulty lies in deriving the modified equation and the corresponding anti-diffusion
velocities. It turns out that the x-component of the anti-diffusion velocity remains the
same, while the y-component takes a similar form with u replaced by v and ∆x by ∆y.
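A complete one-dimensional MPDATA step, combining the donor-cell pass with one antidiffusive correction (equations 10.35–10.37), can be sketched as follows (my transcription, constant velocity and periodic boundaries assumed):

import numpy as np

def mpdata_step(T, u, dx, dt, eps=1e-15):
    """One MPDATA step for T_t + (uT)_x = 0: a donor-cell pass followed
    by one antidiffusive donor-cell correction."""
    def upwind(T, u):
        Tr = np.roll(T, -1)
        F = 0.5 * (u * (T + Tr) + np.abs(u) * (T - Tr))  # flux F_{j+1/2}
        return T - (F - np.roll(F, 1)) * dt / dx

    Td = upwind(T, u)                                    # diffused estimate
    Tdr = np.roll(Td, -1)
    # antidiffusion velocity at j+1/2 (equation 10.35)
    ut = (np.abs(u) * dx - u**2 * dt) * (Tdr - Td) / ((Tdr + Td + eps) * dx)
    return upwind(Td, ut)                                # corrective pass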
10.6 WENO Schemes in the Vertical

[Figure 10.5: Sketch of the stencil S(i; k, l): cells i−l, . . . , i−1, i, i+1, . . . ,
i+s along the z-axis, with cell edges z_{i−1/2}, z_{i+1/2} and width ∆z_i. The stencil
is associated with cell i, has left shift l, and contains k = l + s + 1 cells.]
We first focus on the issue of computing function values from cell averages. We divide
the vertical into a number of finite volumes, which we also refer to as cells, and
define cells, cell centers and cell sizes by:

    I_i = [ z_{i−1/2}, z_{i+1/2} ]        (10.38)
    z_i = (z_{i−1/2} + z_{i+1/2})/2        (10.39)
    ∆z_i = z_{i+1/2} − z_{i−1/2}        (10.40)

The reconstruction problem can be stated as follows. Given the cell averages of a
function T(z),

    T̄_i = (1/∆z_i) ∫_{z_{i−1/2}}^{z_{i+1/2}} T(z′) dz′,   i = 1, 2, . . . , N        (10.41)
find a polynomial p_i(z), of degree at most k−1, for each cell i, such that it is a k-th
order accurate approximation to the function T(z) inside I_i:

    p_i(z) = T(z) + O(∆z^k),   z ∈ I_i        (10.42)

The polynomial p_i(z) interpolates the function within cells. It also provides a
discontinuous interpolation at cell boundaries, since a cell boundary is shared by more
than one cell; we thus write:

    T⁺_{i−1/2} = p_i(z_{i−1/2}),   T⁻_{i+1/2} = p_i(z_{i+1/2})        (10.43)

Given the cell I_i and the order of accuracy k, we first choose a stencil S(i; k, l)
based on I_i, with l cells to the left of I_i and s cells to the right, where
l + s + 1 = k. S(i) consists of the cells:

    S(i) = { I_{i−l}, I_{i−l+1}, . . . , I_{i+s} }        (10.44)

There is a unique polynomial p(z) of degree k−1 = l+s whose cell average in each of the
cells of S(i) agrees with that of T(z):

    T̄_j = (1/∆z_j) ∫_{z_{j−1/2}}^{z_{j+1/2}} p(z′) dz′,   j = i−l, . . . , i+s.        (10.45)
The polynomial in question is nothing but the derivative of the Lagrange interpolant of
the primitive function of T(z) at the cell boundaries. To see this, consider the
primitive function

    𝒯(z) = ∫_{−∞}^{z} T(z′) dz′,        (10.46)

where the choice of lower integration limit is immaterial. The function 𝒯(z) at cell
edges can be expressed in terms of the cell averages:

    𝒯(z_{i+1/2}) = Σ_{j=−∞}^{i} ∫_{z_{j−1/2}}^{z_{j+1/2}} T(z′) dz′ = Σ_{j=−∞}^{i} T̄_j ∆z_j        (10.47)

Thus, the cell averages define the primitive function at the cell boundaries. If we
denote by P(z) the unique polynomial of degree at most k which interpolates 𝒯 at the
k+1 points z_{i−l−1/2}, . . . , z_{i+s+1/2}, and denote its derivative by p(z), it is
easy to verify that:

    (1/∆z_j) ∫_{z_{j−1/2}}^{z_{j+1/2}} p(z′) dz′ = (1/∆z_j) ∫_{z_{j−1/2}}^{z_{j+1/2}} P′(z′) dz′        (10.48)
      = [ P(z_{j+1/2}) − P(z_{j−1/2}) ] / ∆z_j        (10.49)
      = [ 𝒯(z_{j+1/2}) − 𝒯(z_{j−1/2}) ] / ∆z_j        (10.50)
      = (1/∆z_j) ∫_{z_{j−1/2}}^{z_{j+1/2}} T(z′) dz′        (10.51)
      = T̄_j,   j = i−l, . . . , i+s        (10.52)
This implies that p(z) is the desired polynomial. Standard approximation theory tells us
that P′(z) = 𝒯′(z) + O(∆z^k) for z ∈ I_i, which is the accuracy requirement.

The construction of the polynomial p(z) is now straightforward: we write the Lagrange
interpolant of 𝒯 on the k+1 cell boundaries and differentiate with respect to z to
obtain:

    p(z) = Σ_{m=0}^{k} Σ_{j=0}^{m−1} T̄_{i−l+j} ∆z_{i−l+j}
           × [ Σ_{n=0, n≠m}^{k} Π_{q=0, q≠m,n}^{k} (z − z_{i−l+q−1/2}) ] / [ Π_{n=0, n≠m}^{k} (z_{i−l+m−1/2} − z_{i−l+n−1/2}) ]        (10.53)

The order of the sums can be exchanged to obtain an alternative form which may be
computationally more practical:

    p(z) = Σ_{j=0}^{k−1} C_{lj}(z) T̄_{i−l+j}        (10.54)

The coefficients C_{lj} need not be computed at each time step if the computational grid
is fixed; instead they can be precomputed and stored to save CPU time. The expression
for the C_{lj} simplifies (because many terms vanish) when the point z coincides with a
cell edge and/or when the grid is equally spaced (∆z_j = ∆z for all j).
ENO reconstruction

The accuracy estimate holds only if the function is smooth inside the entire stencil
S(i; k, l) used in the interpolation; if the function is not smooth, Gibbs oscillations
appear. The idea behind ENO reconstruction is to vary the stencil S(i; k, l), by
changing the left shift l, so as to choose a discontinuity-free stencil; this choice of
S(i; k, l) is called an “adaptive stencil”. A smoothness criterion is needed to choose
the smoothest stencil, and ENO uses Newton divided differences: the stencil with the
smallest divided differences is chosen.

ENO properties:

1. The accuracy condition is valid for any cell that does not contain a discontinuity.

ENO disadvantages:

1. The choice of stencil is sensitive to round-off errors near the roots of the solution
and its derivatives.

2. The numerical flux is not smooth, as the stencil pattern may change at neighboring
points.

3. In the stencil choosing process k stencils are considered, covering 2k−1 cells, but
only one of the stencils is actually used. If the information from all the cells were
used, one could potentially achieve (2k−1)-th order accuracy in smooth regions.

4. ENO stencil choosing is not computationally efficient because of the repeated use of
“if” constructs in the code.
The weighted ENO (WENO) reconstruction instead combines the k candidate stencils,
p(z) = Σ_{l=0}^{k−1} ω_l p_l(z), where the ω_l are weights satisfying the following
requirements for consistency and stability:

    ω_l ≥ 0,   Σ_{l=0}^{k−1} ω_l = 1        (10.58)

Furthermore, when the solution has a discontinuity in one or more of the stencils, we
would like the corresponding weights to be essentially 0 so as to emulate the ENO idea.
The weights should also be smooth functions of the cell averages; the weights described
below are in fact C^∞.

Shu et al. propose the following form for the weights:

    ω_l = α_l / Σ_{n=0}^{k−1} α_n,   α_l = d_l / (ǫ + β_l)²,   l = 0, . . . , k−1        (10.59)
Here the d_l are the factors needed to maximize the accuracy of the estimate, i.e. to
make the truncation error O(∆z^{2k−1}); the weights must approach d_l in smooth regions,
and in fact ω_l = d_l + O(∆z^k) there. The factor ǫ is introduced to avoid division by
zero; a value of ǫ = 10⁻⁶ seems standard. Finally, the β_l are the smoothness indicators
of the stencils S(i; k, l). These factors are responsible for the success of WENO; they
also account for much of the CPU cost increase over traditional methods. The
requirements on the smoothness indicator are that β_l = O(∆z²) in smooth regions and
O(1) in the presence of discontinuities; this translates into α_l = O(1) in smooth
regions and O(∆z⁴) in the presence of discontinuities. The smoothness measures advocated
by Shu et al. look like weighted H^{k−1} norms of the interpolating polynomials:

    β_l = Σ_{n=1}^{k−1} ∫_{z_{i−1/2}}^{z_{i+1/2}} ∆z^{2n−1} (∂ⁿp_l/∂zⁿ)² dz        (10.60)

The right hand side is just the sum of the squares of the scaled L² norms of all the
derivatives of the polynomial p_l over the interval [z_{i−1/2}, z_{i+1/2}]. The factor
∆z^{2n−1} is introduced to remove any ∆z dependence of the derivatives in order to
preserve self-similarity: the smoothness indicators are the same regardless of the
underlying grid. The smoothness indicators for the case k = 2 are:

    β₀ = (T̄_{i+1} − T̄_i)²,   β₁ = (T̄_i − T̄_{i−1})²        (10.61)
Higher order formulae can be found in Shu (1998) and Balsara and Shu (2000). The
formulae given here have a one-point upwind bias in the optimal linear stencil, suitable
for a problem with the wind blowing from left to right; if the wind blows the other way,
the procedure should be modified symmetrically with respect to z_{i+1/2}.
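For k = 2 the whole construction fits in a few lines. The sketch below (mine) reconstructs the interface value T_{i+1/2} from cell averages on a uniform periodic grid, using the optimal weights d₀ = 2/3, d₁ = 1/3 appropriate for wind blowing from left to right:

import numpy as np

def weno3_interface(Tbar, eps=1e-6):
    """WENO (k = 2) reconstruction of T at z_{i+1/2} from cell averages
    on a uniform periodic grid, upwind-biased for left-to-right wind."""
    Tm = np.roll(Tbar, 1)                    # T_{i-1}
    Tp = np.roll(Tbar, -1)                   # T_{i+1}
    p0 = 0.5 * (Tbar + Tp)                   # stencil {i, i+1}, left shift 0
    p1 = 0.5 * (3.0 * Tbar - Tm)             # stencil {i-1, i}, left shift 1
    b0 = (Tp - Tbar)**2                      # smoothness indicators (eq 10.61)
    b1 = (Tbar - Tm)**2
    a0 = (2.0 / 3.0) / (eps + b0)**2         # alpha_l = d_l/(eps + beta_l)^2
    a1 = (1.0 / 3.0) / (eps + b1)**2
    return (a0 * p0 + a1 * p1) / (a0 + a1)   # omega-weighted combination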
[Figure 10.6: Convergence rate (in the maximum norm) for ENO (left panel) and WENO
(right panel) reconstruction. The dashed curves are for a left shift of 0, while the
solid curves are for centered interpolation; the numbers refer to the interpolation
order.]
Limiters have been devised that attempt to preserve the high order accuracy of WENO near
discontinuities and smooth extrema, and as such include a peak discriminator that
separates smooth extrema from discontinuous ones. Even so, I think such a scheme will
fail to preserve the peaks of the original shape and will allow some new extrema to be
generated, because there is no foolproof discriminator. Consider what happens to a
square wave advected by a uniform velocity field. The discontinuity is initially
confined to one cell; the discriminator will rightly flag it as a discontinuous extremum
and will apply the limiter at full strength. Subsequently, numerical dissipation smears
the front across a few cells so that it occupies a wider stencil. The discriminator,
which works by comparing the curvature over a fixed number of cells, will then fail to
flag the widening front as a discontinuity, and will permit out-of-range values to be
mistaken for permissible smooth extrema.
In order to test the effectiveness of the limiter, I have tried the 1D advection of a
square wave using the limited and unlimited WENO5 (5-th order) coupled with RK2. Figure
10.8 compares the negative minima obtained with the limited (black) and unlimited (red)
schemes; the x-axis represents time (the profile has undergone 4 rotations by the end of
the simulation). The different panels show the results using 16, 32, 64 and 128 cells.
The trend is similar in all cases for the unlimited scheme: a rapid growth of the
negative extremum before it reaches a quasi-steady state. The trend for the limited case
is different. Initially the negative values are suppressed entirely, as evidenced by the
black curves starting away from time 0; this initial period lengthens with better
resolution. After the first negative values appear, there is a rapid deterioration of
the minimum value before it too reaches a steady state. This steady state value
decreases with the number of points, and becomes negligible for the 128-cell case.
Finally, note that the unlimited case produces a slightly better minimum for the 16-cell
case, but does not improve substantially as the number of points is increased. For this
experiment the domain is the unit interval and the hat profile is confined to
|x| < 1/4; the time step is held fixed at ∆t = 1/80, so the Courant number increases
with the number of cells used.
[Figure 10.7: Advection of several Shchepetkin profiles. The black solid line refers to
the analytical solution, the red crosses to WENO3 (RK2 time-stepping), and the blue
stars to WENO5 with RK3. The WENO5 solution is indistinguishable from the analytical
solution for the narrow profile.]
I have repeated the experiments for the narrow-profile case (Shchepetkin's profile), and
confirmed that the limiter is indeed able to suppress the generation of negative values,
even for a resolution as low as 64 cells (the reference case uses 256 cells). The
discriminator, however, allows a very slight and occasional increase of the peak value.
By and large, the limiter does a good job. The 2D cone experiments with the limiters are
shown in the Cone section.
10.7 UTOPIA

The Uniform Third Order Polynomial Interpolation Algorithm (UTOPIA) was derived
explicitly to be a multi-dimensional, two-time-level, conservative advection scheme. The
formulation is based on a finite volume form of the advection equation:

    (∆V T̄)_t + ∮_{∂∆V} F · n dS = 0        (10.62)

where T̄ is the average of T over the cell ∆V and F · n is the flux passing through the
surface ∂∆V of the control volume. A further integration in time reduces the solution
[Figure 10.8: Negative minima of the unlimited (red) and limited (black) WENO schemes on
a square hat profile. Top left: 16 points; top right: 32 points; bottom left: 64 points;
bottom right: 128 points.]
to the following:

    T̄^{n+1} = T̄^n − (1/∆V) ∫₀^{∆t} ∮_{∂V} F · n dS dt        (10.63)

A further definition helps in interpreting the above formula: if we denote the
time-averaged flux through the surface bounding ∆V by F̄^{∆t} = (1/∆t) ∫₀^{∆t} F dt, we
end up with the following two-time-level expression:

    T̄^{n+1} = T̄^n − (∆t/∆V) ∮_{∂V} F̄^{∆t} · n dS        (10.64)
[Figure 10.9: Sketch of the particles entering cell (j, k) through its left edge
(j − 1/2, k), assuming positive velocity components u and v.]

UTOPIA takes a Lagrangian point of view in tracing the fluid particles crossing each
face. The situation is depicted in figure 10.9: the set of particles crossing the left
face of a rectangular cell occupies the area within the quadrilateral ABCD, and this is
effectively the contribution of edge AD to the boundary integral
∆t ∮_{∂V_{AD}} F · n dS. UTOPIA makes the assumption that the advecting velocity is
locally constant across the face in space
162 CHAPTER 10. SPECIAL ADVECTION SCHEMES
and time; this amount to approximating the curved edges of the area by straight lines
as shown in the figure. The distance from the left edge to the straight line BC is u∆t,
and can be expressed as p∆x where p is the courant number for the x-component of the
velocity. Likewise, the vertical distance between point B and edge k + 12 is v∆t = q∆y,
where q is the Courant number in the y direction.
We now turn to the issue of computing the integral of the boundary fluxes; we will
illustrate this for edge AD of cell (j, k). Owing to UTOPIA's Lagrangian estimate of the
flux we have:
\[ \frac{\Delta t}{\Delta V} \int_{\partial V_{AD}} \mathbf{F}\cdot\mathbf{n}\; dS = \frac{1}{\Delta x\,\Delta y} \int_{ABCD} T\; dx\, dy. \tag{10.65} \]
The right hand side integral is an area integral of T over portions of the upstream neighboring
cells. Its evaluation requires us to assume a form for the spatial variation of T. Several
choices are available, and Leonard et al. (1995) discuss several options.
UTOPIA is built on assuming a quadratic variation; for cell (j, k), the interpolation is:
\[ \begin{aligned} T_{j,k}(\xi,\eta) = {} & \overline{T}_{j,k} - \frac{1}{24}\left( \overline{T}_{j+1,k} + \overline{T}_{j,k-1} + \overline{T}_{j-1,k} + \overline{T}_{j,k+1} - 4\overline{T}_{j,k} \right) \\ & + \frac{\overline{T}_{j+1,k} - \overline{T}_{j-1,k}}{2}\,\xi + \frac{\overline{T}_{j+1,k} - 2\overline{T}_{j,k} + \overline{T}_{j-1,k}}{2}\,\xi^2 \\ & + \frac{\overline{T}_{j,k+1} - \overline{T}_{j,k-1}}{2}\,\eta + \frac{\overline{T}_{j,k+1} - 2\overline{T}_{j,k} + \overline{T}_{j,k-1}}{2}\,\eta^2. \end{aligned} \tag{10.66} \]
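As a consistency check (ours, not in the original notes): integrating (10.66) over the cell, with the local coordinates ξ and η defined below each spanning [−1/2, 1/2], the linear terms vanish by symmetry and \(\int_{-1/2}^{1/2}\xi^2\,d\xi = 1/12\), so
\[ \int_{-1/2}^{1/2}\!\int_{-1/2}^{1/2} T_{j,k}\; d\xi\, d\eta = \overline{T}_{j,k} - \frac{\delta_x^2\overline{T} + \delta_y^2\overline{T}}{24} + \frac{\delta_x^2\overline{T}}{24} + \frac{\delta_y^2\overline{T}}{24} = \overline{T}_{j,k}, \]
where \(\delta_x^2\overline{T} = \overline{T}_{j+1,k} - 2\overline{T}_{j,k} + \overline{T}_{j-1,k}\) (and similarly for \(\delta_y^2\)) is a shorthand we introduce here; the interpolant thus preserves the cell average, as claimed below.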
Here, ξ and η are scaled local coordinates:
\[ \xi = \frac{x}{\Delta x} - j, \qquad \eta = \frac{y}{\Delta y} - k, \tag{10.67} \]
so that the center of the box is located at (0, 0) and the left and right edges at (±1/2, 0),
respectively. The interpolation formula is designed so as to produce the proper cell
averages when the function is integrated over cells (j, k), (j ± 1, k) and (j, k ± 1). The area
integral in equation 10.65 must be broken into several integrals: first, the area ABCD
straddles two cells, (j − 1, k) and (j − 1, k − 1), with two different interpolations for T; and
second, the trapezoidal area integral can be simplified. We now have
\[ \frac{1}{\Delta x\,\Delta y} \int_{ABCD} T\; dx\, dy = I_1(j,k) - I_2(j,k) + I_2(j,k-1) \tag{10.68} \]
where the individual contributions from each area are given by:
\[ I_1(j,k) = \int_{AEFD} T_{j,k}(\xi,\eta)\; d\eta\, d\xi = \int_{\frac{1}{2}-p}^{\frac{1}{2}} \int_{-\frac{1}{2}}^{\frac{1}{2}} T_{j,k}(\xi,\eta)\; d\eta\, d\xi \tag{10.69} \]
\[ I_2(j,k) = \int_{AEB} T_{j,k}(\xi,\eta)\; d\eta\, d\xi = \int_{\frac{1}{2}-p}^{\frac{1}{2}} \int_{\eta_{AB}(\xi)}^{\frac{1}{2}} T_{j,k}(\xi,\eta)\; d\eta\, d\xi. \tag{10.70} \]
using the local coordinates of cell (j, k). Performing the integration is rather tedious; the
output of a symbolic computer program is shown in figure 10.10.

Figure 10.10: Flux for the finite volume form of the UTOPIA algorithm.
A different derivation of the UTOPIA scheme can be obtained if we consider the cell
values to be function values rather than cell averages. The finite difference form is then given
by the equation shown in figure 10.11.
The scheme is built from the advection equation
\[ T_t + \nabla\cdot(\mathbf{u}T) = 0. \tag{10.74} \]
The starting point is the time Taylor series expansion, which we carry out to fourth order:
\[ T^{n+1} = T^n + \frac{\Delta t}{1!} T_t + \frac{\Delta t^2}{2!} T_{tt} + \frac{\Delta t^3}{3!} T_{ttt} + \frac{\Delta t^4}{4!} T_{tttt} + O(\Delta t^5). \tag{10.75} \]
The next step is the replacement of the time derivatives above with spatial derivatives
using the original PDE. It is easy to derive the following identities:
\[ T_t = -\nabla\cdot[\mathbf{u}T] \tag{10.76} \]
\[ T_{tt} = -\nabla\cdot[\mathbf{u}_t T + \mathbf{u}T_t] = -\nabla\cdot[\mathbf{u}_t T - \mathbf{u}\,\nabla\cdot(\mathbf{u}T)] \tag{10.77} \]
\[ T_{ttt} = -\nabla\cdot[\mathbf{u}_{tt} T - 2\mathbf{u}_t\,\nabla\cdot(\mathbf{u}T) - \mathbf{u}\,\nabla\cdot(\mathbf{u}_t T) + \mathbf{u}\,\nabla\cdot(\mathbf{u}\,\nabla\cdot(\mathbf{u}T))] \tag{10.78} \]
Figure 10.11: Flux of UTOPIA when the variables represent function values. This is the
finite difference form of the scheme.
Chapter 11

Finite Element Methods
The discretization of complicated flow domains with finite difference methods is quite
cumbersome. Their straightforward application requires the flow to occur in logically
rectangular domains, a constraint that severely limits their capability to simulate flows
in realistic geometries. The finite element method was developed in large part to address
the issue of solving PDE's in arbitrarily complex regions. Briefly, the FEM
works by dividing the flow region into cells referred to as elements. The solution within
each element is approximated using some form of interpolation, most often polynomial
interpolation, and the weights of the interpolation polynomials are adjusted so that
the residual obtained after applying the PDE is minimized. There are a number of
FE approaches in existence today; they differ primarily in their choice of interpolation
procedure, the type of elements used in the discretization, and the sense in which the residual
is minimized. Finlayson (1972) showed how the different approaches can be
unified via the perspective of the Mean Weighted Residual (MWR) method.
11.1 MWR
Consider the following problem: find the function u such that
\[ L(u) = 0 \tag{11.1} \]
and let the approximate solution be written as the finite expansion
\[ \tilde{u} = \sum_{j=0}^{N} \hat{u}_j\, \phi_j(x) \tag{11.2} \]
Here ũ stands for the approximate solution of the problem, the û_j are the N + 1 degrees
of freedom available to minimize the error, and the φ's are the interpolation or trial
functions. Equation 11.2 can be viewed as an expansion of the solution in terms of the basis
defined by the functions φ. Applying this series to the operator L we obtain
\[ L(\tilde{u}) = R(x) \tag{11.3} \]
where R(x) is a residual which is different from zero unless ũ is the exact solution of
equation 11.1. The degrees of freedom û_j can now be chosen to minimize R. In order to
determine the problem uniquely, we can impose N + 1 constraints. For MWR we require
that R be orthogonal to the N + 1 test functions v_j:
(R, vj ) = 0, j = 0, 1, 2, . . . , N. (11.4)
Recalling the chapter on linear analysis, this is equivalent to saying that the projection of
R on the set of functions v_j is zero. In the case of the inner product defined in equation
12.13 this is equivalent to
\[ \int_{\Omega} R\, v_j\; dx = 0, \qquad j = 0, 1, 2, \ldots, N. \tag{11.5} \]
11.1.1 Collocation
If the test functions are defined as the Dirac delta functions v_j = δ(x − x_j), then constraint
11.4 becomes:
\[ R(x_j) = 0 \tag{11.6} \]
i.e., it requires the residual to be identically zero at the collocation points x_j. Finite
differences can thus be cast as a MWR with collocation points defined on the finite
difference grid. The residual is free to oscillate between the collocation points, where it is
pinned to zero; the amplitude of the oscillations will decrease with the number of collocation
points if the residual is a smooth function.
11.1.3 Galerkin
In the Galerkin method the test functions are taken to be the same as the trial functions,
so that v_j = φ_j. This is the most popular choice in the FE community and will be the one
we concentrate on. There are variations on the Galerkin method where the test functions
are perturbations of the trial functions; this method is usually referred to as the Petrov-
Galerkin method. The perturbations are introduced to improve the numerical stability of
the scheme, for example to introduce upwinding in the solution of advection-dominated
flows.
11.2 FEM Example in 1D

To illustrate the method, consider the one-dimensional problem
\[ \frac{\partial^2 u}{\partial x^2} - \lambda u + f = 0 \tag{11.10} \]
subject to the boundary conditions
\[ u(x=0) = 0, \tag{11.11} \]
\[ \left.\frac{\partial u}{\partial x}\right|_{x=1} = q. \tag{11.12} \]
Equation 11.11 is a Dirichlet boundary condition and is usually referred to as an essential
boundary condition, while equation 11.12 is usually referred to as a natural boundary
condition. The origin of these terms will become clearer shortly.
The weighted residual statement reads
\[ \int_0^1 \left( \frac{\partial^2 u}{\partial x^2} - \lambda u + f \right) v\; dx = 0. \tag{11.13} \]
The only condition we impose on the test function is that it be zero on those portions
of the boundary where Dirichlet boundary conditions are applied; in this case v(0) = 0.
Equation 11.13 is called the strong form of the PDE, as it requires the second derivative
of the function to exist and be integrable. Imposing this constraint in geometrically
complex regions is difficult, and we seek to reformulate the problem in a "weak" form
such that only lower order derivatives are needed. We do this by integrating the second
derivative in equation 11.13 by parts to obtain:
\[ \int_0^1 \left( \frac{\partial u}{\partial x}\frac{\partial v}{\partial x} + \lambda u v \right) dx = \int_0^1 f v\; dx + \left.\frac{\partial u}{\partial x} v\right|_1 - \left.\frac{\partial u}{\partial x} v\right|_0 \tag{11.14} \]
The essential boundary condition on the left edge eliminates the third term on the right
hand side of the equation since v(0) = 0, and the Neumann boundary condition at the
right edge can be substituted in the second term on the right hand side. The final form
is thus:
\[ \int_0^1 \left( \frac{\partial u}{\partial x}\frac{\partial v}{\partial x} + \lambda u v \right) dx = \int_0^1 f v\; dx + q\, v(1) \tag{11.15} \]
For the weak form to be sensible, we must require that the integrals appearing in the
formulation be finite. The most severe restriction stems from the first order derivatives
appearing on the left hand side of 11.15. For this term to be finite we must require that
the functions ∂u/∂x and ∂v/∂x be integrable, i.e. piecewise continuous with finite jump
discontinuities; this translates into the requirement that the functions u and v be continuous,
the so-called C⁰ continuity requirement.
Note that the matrix K is symmetric: K_{ji} = K_{ij}, so that only half the matrix entries
need to be evaluated. The Galerkin formulation of the weak variational statement 11.15
will always produce a symmetric matrix regardless of the choice of expansion functions;
the necessary condition for the symmetry is that the left hand side operator in equation
11.15 be symmetric with respect to the u and v variables. The matrix K is usually
referred to as the stiffness matrix, a legacy term dating to the early applications of the
finite element method to problems in solid mechanics.
Had the boundary condition on the right edge of the domain been of Dirichlet type,
we would have to add the restriction φ_i(1) = 0 on the trial functions for 1 ≤ i ≤ N − 1.
The imposition of Dirichlet conditions on both sides is considerably simplified if we
further request that φ_0(1) = φ_N(0) = 0 and φ_0(0) = φ_N(1) = 1. Under these conditions
û_0 = u(0) = u_0 and û_N = u(1) = u_N. We end up with the following (N − 1) × (N − 1)
system of algebraic equations:
\[ \sum_{i=1}^{N-1} K_{ji} u_i = c_j, \qquad c_j = b_j - K_{j0} u_0 - K_{jN} u_N, \qquad j = 1, 2, \ldots, N-1 \tag{11.20} \]
Figure 11.1: 2-element discretization of the interval [0, 1] showing the interpolation
functions φ_0, φ_1, φ_2 and the associated coefficients û_0, û_1, û_2.
If our expansion functions are Lagrange interpolants, then the coefficients û_i
represent the values of the function at the collocation points x_j:
\[ u(x_j) = u_j = \sum_{i=0}^{N} \hat{u}_i\, \phi_i(x_j) = \hat{u}_j, \qquad j = 0, 1, \ldots, N \tag{11.22} \]
We will omit the circumflex accents on the coefficients whenever the expansion functions
are Lagrangian interpolation functions. The use of Lagrangian interpolation simplifies
the imposition of the C⁰ continuity requirement, and the function values at the collocation
points are obtained directly without further processing.
There are other expansion functions in use in the FE community. For example,
Hermitian interpolation functions are used when the solution and its derivatives must be
continuous across element boundaries (the solution is then C¹ continuous), or Hermitian
expansions are used to model infinite elements. These expansion functions are usually
reserved for special situations and we will not address them further.
In the following 3 sections we will illustrate how the FEM solution of equation 11.15
proceeds. We will approach the problem from 3 different perspectives in order to highlight
the algorithmic steps of the finite element method. The first approach considers a
small expansion for the approximate solution; the matrix equation can then be
written and inverted manually. The second approach repeats this procedure using a
longer expansion; the matrix entries are derived, but the solution of the larger system
must be done numerically. The third approach considers the same large problem as
the second, but introduces the local coordinate and numbering systems, and the
mapping between the local and global systems. This local approach to constructing the
FE stiffness matrix is key to its success and versatility since it localizes the computational
details to elements and subroutines. A great variety of local finite element approximations
can then be introduced at the elemental level with little additional complexity at
the global level.
where the interpolation functions and their derivatives are tabulated below:

               |  φ0(x)    φ1(x)     φ2(x)   |  ∂φ0/∂x   ∂φ1/∂x   ∂φ2/∂x
 0 ≤ x ≤ 1/2   |  1 − 2x    2x        0      |   −2        2        0          (11.24)
 1/2 ≤ x ≤ 1   |    0      2(1 − x)  2x − 1  |    0       −2        2
and shown in figure 11.1. It is easy to verify that the φ_i are Lagrangian interpolation
functions at the 3 collocation points x = 0, 1/2, 1, i.e., φ_i(x_j) = δ_ij. Furthermore, the
expansion functions are continuous across element interfaces, so that the C⁰ continuity
requirement is satisfied, but their derivatives are discontinuous. It is easy to show that the
interpolation 11.23 amounts to a linear interpolation of the solution within each element.
Since the boundary condition at x = 0 is of Dirichlet type, we need only test with
functions that satisfy v(0) = 0; in our case the functions φ_1 and φ_2 are the only candidates.
Notice also that we have only 2 unknowns, u_1 and u_2, u_0 being known from
the Dirichlet boundary condition; thus only 2 equations are needed to determine the
solution. The matrix entries can now be determined. We illustrate this for two of the
entries, assuming λ is constant for simplicity:
\[ K_{10} = \int_0^{\frac{1}{2}} \left( \frac{\partial\phi_1}{\partial x}\frac{\partial\phi_0}{\partial x} + \lambda\,\phi_1\phi_0 \right) dx = \int_0^{\frac{1}{2}} \left[ -4 + \lambda(2x - 4x^2) \right] dx = -2 + \frac{\lambda}{12} \tag{11.25} \]
Notice that the integral over the entire domain reduces to an integral over a single element
because the interpolation and test functions φ_0 and φ_1 are non-zero only over element 1.
This property, which localizes the operations needed to build the matrix equation, is key
to the success of the method.
The entry K_{11} requires integration over both elements:
\[ \begin{aligned} K_{11} &= \int_0^1 \left( \frac{\partial\phi_1}{\partial x}\frac{\partial\phi_1}{\partial x} + \lambda\,\phi_1\phi_1 \right) dx \\ &= \int_0^{\frac{1}{2}} \left[ 4 + 4\lambda x^2 \right] dx + \int_{\frac{1}{2}}^1 \left[ 4 + 4\lambda(1-x)^2 \right] dx \\ &= \left( 2 + \frac{2\lambda}{12} \right) + \left( 2 + \frac{2\lambda}{12} \right) = 4 + \frac{4\lambda}{12} \end{aligned} \tag{11.26--11.28} \]
The remaining entries can be evaluated in a similar manner. The final matrix equation
takes the form:
\[ \begin{pmatrix} -2 + \frac{\lambda}{12} & 4 + \frac{4\lambda}{12} & -2 + \frac{\lambda}{12} \\ 0 & -2 + \frac{\lambda}{12} & 2 + \frac{2\lambda}{12} \end{pmatrix} \begin{pmatrix} u_0 \\ u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \tag{11.29} \]
Note that since the value of u_0 is known we can move it to the right hand side to obtain
the following system of equations:
\[ \begin{pmatrix} 4 + \frac{4\lambda}{12} & -2 + \frac{\lambda}{12} \\ -2 + \frac{\lambda}{12} & 2 + \frac{2\lambda}{12} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} b_1 + \left( 2 - \frac{\lambda}{12} \right) u_0 \\ b_2 \end{pmatrix} \tag{11.30} \]
where ∆ = 8(1 + λ/12)² − (λ/12 − 2)² is the determinant of the matrix. The only missing
piece is the evaluation of the right hand side. This is easy since the function f and the
flux q are known. It is possible to evaluate the integrals directly if the global expression
for f is available. However, more often than not, f is either a complicated function or is
known only at the collocation points. The interpolation methodology that was used for
the unknown function can then be used to interpolate the forcing function and evaluate its
associated integrals. Thus we write:
\[ \begin{aligned} b_j &= \int_0^1 f\,\phi_j\; dx + q\,\phi_j(1) \qquad &(11.32) \\ &= \int_0^1 \left( \sum_{i=0}^{2} f_i\,\phi_i \right) \phi_j\; dx + q\,\phi_j(1) \qquad &(11.33) \\ &= \sum_{i=0}^{2} \left( \int_0^1 \phi_i\,\phi_j\; dx \right) f_i + q\,\phi_j(1) \qquad &(11.34) \\ &= \frac{1}{12} \begin{pmatrix} 1 & 4 & 1 \\ 0 & 1 & 2 \end{pmatrix} \begin{pmatrix} f_0 \\ f_1 \\ f_2 \end{pmatrix} + \begin{pmatrix} 0 \\ q \end{pmatrix} \qquad &(11.35) \end{aligned} \]
For f = 0 and λ = 0 the system yields u_1 = q/2 and u_2 = q, i.e. u = qx,
which is the exact solution of the PDE. The FEM procedure produces the exact result
because the solution of the PDE is linear in x. Notice that the FE solution is then exact at
the interpolation points x = 0, 1/2, 1 and inside the elements.
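As a sanity check, the reduced 2 × 2 system (11.30) can be solved directly; the short program below is our illustration, not part of the original notes (the program and variable names are ours). For λ = 0 and f = 0 it reproduces u1 = q/2 and u2 = q.

! Illustrative sketch (not from the notes): solve the reduced 2x2 system
! (11.30) of the two-element example by Cramer's rule, with lambda=0, f=0.
program two_element_fem
  implicit none
  real :: K(2,2), b(2), u(2), det, q, lambda, u0
  lambda = 0.0; q = 1.0; u0 = 0.0
  K(1,1) =  4.0 + 4.0*lambda/12.0       ! entries from (11.25)-(11.28)
  K(1,2) = -2.0 + lambda/12.0
  K(2,1) =  K(1,2)                      ! the stiffness matrix is symmetric
  K(2,2) =  2.0 + 2.0*lambda/12.0
  b(1) = 0.0 + (2.0 - lambda/12.0)*u0   ! right hand side of (11.30) with f = 0
  b(2) = q
  det  = K(1,1)*K(2,2) - K(1,2)*K(2,1)  ! Cramer's rule for the 2x2 system
  u(1) = ( b(1)*K(2,2) - K(1,2)*b(2) )/det
  u(2) = ( K(1,1)*b(2) - b(1)*K(2,1) )/det
  print *, 'u1 =', u(1), '  exact q/2 =', q/2.0
  print *, 'u2 =', u(2), '  exact q   =', q
end program two_element_fem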
If f = −1, and the remaining parameters are unchanged, the exact solution is
quadratic, u_e = x²/2 + (q − 1)x, and the finite element solution is
\[ \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \frac{1}{4} \begin{pmatrix} 2 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} -\frac{1}{2} \\ q - \frac{1}{4} \end{pmatrix} = \begin{pmatrix} \frac{4q-3}{8} \\ q - \frac{1}{2} \end{pmatrix} \tag{11.38} \]
Notice that the FE procedure yields the exact value of the solution at the 3 interpolation
points. The errors committed are due to the interpolation of the function within the
elements; the solution is quadratic whereas the FE interpolation provides only for a linear
variation.

Figure 11.2: Comparison of analytical (solid line) and FE (dashed line) solutions of
the equation u_xx + f = 0 with a homogeneous Dirichlet condition on the left edge and the
Neumann condition u_x = 1 on the right edge, for the cases f = 0, f = −x and f = −x².
The circles indicate the location of the interpolation points. Two linear finite elements
are used in this example.
For f = −x, the exact solution is u_e = x³/6 + (q − 1/2)x, and the FE solution is:
\[ \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \frac{1}{4} \begin{pmatrix} 2 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} -\frac{1}{4} \\ q - \frac{5}{24} \end{pmatrix} = \begin{pmatrix} \frac{24q-11}{48} \\ q - \frac{1}{3} \end{pmatrix} \tag{11.39} \]
The solution is again exact at the interpolation points but in error within the elements due
to the linear interpolation. This fortuitous state of affairs is due to the exact evaluation
of the forcing term f, which is exactly interpolated by the linear functions.
For f = −x², the exact solution is u = x⁴/12 + (q − 1/3)x, and the FE solution is:
\[ \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \frac{1}{4} \begin{pmatrix} 2 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} -\frac{1}{12} \\ q - \frac{9}{48} \end{pmatrix} = \begin{pmatrix} \frac{48q-13}{96} \\ q - \frac{22}{96} \end{pmatrix} \tag{11.40} \]
This time the solution is in error at the interpolation points also. Figure 11.2 compares
the analytical and FE solutions for the 3 cases after setting q = 1.
Figure 11.3: An element partition of the domain into N linear elements. The element
edges are indicated by the filled dots.
In order to decrease the error further for cases where f is a more complex function,
we need to increase the number of elements. This will increase the size of the matrix
system and the computational cost. We now go through the process of developing the stiffness
equations for this system since it will be useful for understanding general FE
concepts. Suppose that we have divided the interval into N elements (not necessarily of
equal size); the interpolation formula then becomes:
\[ u(x) = \sum_{i=0}^{N} u_i\, \phi_i(x) \tag{11.41} \]
Element number j, shown in figure 11.3, spans the interval [x_{j−1}, x_j], for j = 1, 2, . . . , N;
its left neighbor is element j − 1 and its right neighbor is element j + 1. The length of
each element is ∆x_j = x_j − x_{j−1}. The linear interpolation function associated with node
j is
\[ \phi_j(x) = \begin{cases} 0 & x < x_{j-1} \\[2pt] \dfrac{x - x_{j-1}}{x_j - x_{j-1}} & x_{j-1} \le x \le x_j \\[6pt] \dfrac{x_{j+1} - x}{x_{j+1} - x_j} & x_j \le x \le x_{j+1} \\[2pt] 0 & x_{j+1} < x \end{cases} \tag{11.42} \]
Let us now focus on building the stiffness matrix row by row. The j-th row of K
corresponds to setting the test function to φ_j. Since the function φ_j is non-zero only
on the interval [x_{j−1}, x_{j+1}], the integral in the stiffness matrix reduces to an integration
over that interval. We thus have:
\[ K_{ji} = \int_{x_{j-1}}^{x_{j+1}} \left( \frac{\partial\phi_j}{\partial x}\frac{\partial\phi_i}{\partial x} + \lambda\,\phi_i\phi_j \right) dx, \tag{11.43} \]
\[ b_j = \int_{x_{j-1}}^{x_{j+1}} f\,\phi_j\; dx + q\,\phi_j(1) \tag{11.44} \]
Likewise, the function φ_i is non-zero only on elements i and i + 1, and hence K_{ji} = 0
unless i = j − 1, j, j + 1; the system of equations is hence tridiagonal. We now derive
explicit expressions for the stiffness matrix entries for i = j, j ± 1:
\[ \begin{aligned} K_{j,j-1} &= \int_{x_{j-1}}^{x_j} \left( \frac{\partial\phi_j}{\partial x}\frac{\partial\phi_{j-1}}{\partial x} + \lambda\,\phi_{j-1}\phi_j \right) dx \qquad &(11.45) \\ &= \int_{x_{j-1}}^{x_j} \left[ \frac{-1}{(x_j - x_{j-1})^2} + \lambda\, \frac{x - x_{j-1}}{x_j - x_{j-1}}\, \frac{x_j - x}{x_j - x_{j-1}} \right] dx \qquad &(11.46) \\ &= -\frac{1}{\Delta x_j} + \lambda\, \frac{\Delta x_j}{6} \qquad &(11.47) \end{aligned} \]
The entry K_{j,j+1} can be deduced immediately by using symmetry and applying the
above formula with j replaced by j + 1; thus:
\[ K_{j,j+1} = K_{j+1,j} = -\frac{1}{\Delta x_{j+1}} + \lambda\, \frac{\Delta x_{j+1}}{6} \tag{11.48} \]
The sole remaining entry is i = j; in this case the integral spans the elements j and j + 1:
\[ \begin{aligned} K_{j,j} &= \int_{x_{j-1}}^{x_j} \left( \frac{\partial\phi_j}{\partial x}\frac{\partial\phi_j}{\partial x} + \lambda\,\phi_j\phi_j \right) dx + \int_{x_j}^{x_{j+1}} \left( \frac{\partial\phi_j}{\partial x}\frac{\partial\phi_j}{\partial x} + \lambda\,\phi_j\phi_j \right) dx \qquad &(11.49) \\ &= \frac{1}{\Delta x_j^2} \int_{x_{j-1}}^{x_j} \left[ 1 + \lambda(x - x_{j-1})^2 \right] dx + \frac{1}{\Delta x_{j+1}^2} \int_{x_j}^{x_{j+1}} \left[ 1 + \lambda(x_{j+1} - x)^2 \right] dx \qquad &(11.50) \\ &= \left( \frac{1}{\Delta x_j} + \lambda\, \frac{2\Delta x_j}{6} \right) + \left( \frac{1}{\Delta x_{j+1}} + \lambda\, \frac{2\Delta x_{j+1}}{6} \right) \end{aligned} \]
Note that all interior rows of the matrix equation are identical; only the rows associated
with the end points have different diagonal entries. It is easy to show that we
must have:
\[ K_{0,0} = \frac{1}{\Delta x_1} + \lambda\, \frac{2\Delta x_1}{6} \tag{11.51} \]
\[ K_{N,N} = \frac{1}{\Delta x_N} + \lambda\, \frac{2\Delta x_N}{6} \tag{11.52} \]
The evaluation of the right hand side leads to the following equations for b_j:
\[ \begin{aligned} b_j &= \int_{x_{j-1}}^{x_{j+1}} f\,\phi_j\; dx + q\,\phi_j(1) \qquad &(11.53) \\ &= \sum_{i=j-1}^{j+1} \left( \int_{x_{j-1}}^{x_{j+1}} \phi_i\,\phi_j\; dx \right) f_i + q\,\phi_N(1)\,\delta_{Nj} \qquad &(11.54) \\ &= \frac{1}{6} \left[ \Delta x_j f_{j-1} + 2(\Delta x_j + \Delta x_{j+1}) f_j + \Delta x_{j+1} f_{j+1} \right] + q\,\phi_N(1)\,\delta_{Nj} \qquad &(11.55) \end{aligned} \]
Again, special consideration must be given to the edge points to account
for the boundary conditions properly. In the present case b_0 is not needed since a Dirichlet
boundary condition is applied on the left boundary. On the right boundary the right
hand side term is given by:
\[ b_N = \frac{1}{6} \left[ \Delta x_N f_{N-1} + 2\Delta x_N f_N \right] + q \tag{11.56} \]
Note that the flux boundary condition affects only the last entry of the right hand side.
If the grid spacing is constant, a typical row of the matrix equation is:
\[ \left( -\frac{1}{\Delta x} + \frac{\lambda\Delta x}{6} \right) u_{j-1} + 2\left( \frac{1}{\Delta x} + \frac{2\lambda\Delta x}{6} \right) u_j + \left( -\frac{1}{\Delta x} + \frac{\lambda\Delta x}{6} \right) u_{j+1} = \frac{\Delta x}{6} \left( f_{j-1} + 4 f_j + f_{j+1} \right) \tag{11.57} \]
For λ = 0 it is easy to show that the left hand side reduces to the centered finite difference
approximation of the second order derivative. The finite element discretization produces
a more complex approximation for the right hand side, involving a weighting of the function
at several neighboring points.
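To make the procedure concrete, the following short program is our sketch, not from the notes (the program and variable names such as nel are ours). It assembles the tridiagonal system (11.57) for f = −1 and λ = 0, applies the boundary rows (11.52) and (11.56), and solves it with the Thomas algorithm; since f = −1 is interpolated exactly by the linear functions, the computed nodal values should match the exact solution u = x²/2 + (q − 1)x.

! Illustrative sketch (not from the notes): assemble and solve the
! tridiagonal linear-element system for u_xx - lambda*u + f = 0,
! u(0) = 0, u_x(1) = q, on a uniform grid.
program fem1d_tridiag
  implicit none
  integer, parameter :: nel = 8          ! number of linear elements
  integer :: j
  real :: dx, lam, q, m, x
  real :: a(nel), d(nel), c(nel), r(nel), u(nel), f(0:nel)
  dx = 1.0/nel; lam = 0.0; q = 1.0
  f = -1.0                               ! nodal values of the forcing f = -1
  do j = 1, nel-1                        ! interior rows, eq. (11.57)
     a(j) = -1.0/dx + lam*dx/6.0
     d(j) =  2.0/dx + 2.0*lam*dx/3.0
     c(j) = -1.0/dx + lam*dx/6.0
     r(j) = dx/6.0*(f(j-1) + 4.0*f(j) + f(j+1))
  end do
  a(nel) = -1.0/dx + lam*dx/6.0          ! last row, eqs. (11.52) and (11.56)
  d(nel) =  1.0/dx + lam*dx/3.0
  r(nel) = dx/6.0*(f(nel-1) + 2.0*f(nel)) + q
  do j = 2, nel                          ! Thomas algorithm: forward sweep
     m = a(j)/d(j-1)
     d(j) = d(j) - m*c(j-1)
     r(j) = r(j) - m*r(j-1)
  end do
  u(nel) = r(nel)/d(nel)                 ! back substitution
  do j = nel-1, 1, -1
     u(j) = (r(j) - c(j)*u(j+1))/d(j)
  end do
  do j = 1, nel                          ! FE solution vs exact nodal values
     x = j*dx
     print *, x, u(j), 0.5*x*x + (q-1.0)*x
  end do
end program fem1d_tridiag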
[Sketch: the local linear interpolation functions h1(ξ) and h2(ξ) over element j, with local nodes u₁ʲ at ξ1 = −1 and u₂ʲ at ξ2 = +1.]
We also introduce a local numbering system so the unknowns can be identified locally.
The superscript j, whenever it appears, indicates that a local numbering system is used to
refer to entities defined over element j. In the present instance u₁ʲ refers to the global
unknown u_{j−1} and u₂ʲ refers to the global unknown u_j. Finally, the global expansion
functions φ_j are transformed into local expansion functions, so that the solution u within
element j is interpolated in terms of h1(ξ) and h2(ξ).
Notice that the local stiffness matrix has a small dimension, 2 × 2 for linear interpolation
functions, and is symmetric. We evaluate these integrals in computational space
by breaking them up into two pieces, D^j_{m,n} and M^j_{m,n}, defined as follows:
\[ D^j_{m,n} = \int_{x_{j-1}}^{x_j} \frac{\partial h_m}{\partial x}\frac{\partial h_n}{\partial x}\; dx = \int_{-1}^{1} \frac{\partial h_m}{\partial\xi}\frac{\partial h_n}{\partial\xi} \left( \frac{\partial\xi}{\partial x} \right)^2 \frac{\partial x}{\partial\xi}\; d\xi \tag{11.66} \]
\[ M^j_{m,n} = \int_{x_{j-1}}^{x_j} h_m h_n\; dx = \int_{-1}^{1} h_m h_n\, \frac{\partial x}{\partial\xi}\; d\xi \tag{11.67} \]
The integrals in physical space have been replaced by integrals in computational space
in which the metric of the mapping appears. For the linear mapping and interpolation
functions, these integrals are easily evaluated:
\[ M^j = \frac{\Delta x_j}{8} \int_{-1}^{1} \begin{pmatrix} (1-\xi)^2 & (1-\xi^2) \\ (1-\xi^2) & (1+\xi)^2 \end{pmatrix} d\xi = \frac{\Delta x_j}{6} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \tag{11.68} \]
The left hand sides in the above equations are the global entries while those on the
right hand sides are the local entries. The process of adding up the local contributions is
called the stiffness assembly. Note that some global entries require contributions from
different elements.
In practical computer codes, the assembly is effected most efficiently by keeping
track of the map between local and global numbers in an array imap(2,j), where
imap(1,j) gives the global node number of local node 1 in element j, and imap(2,j)
gives the global node number of local node 2 in element j. For the one-dimensional case
a simple implementation is shown in the first loop of figure 11.5, where P
stands for the number of collocation points within each element; for linear interpolation,
P = 2. The scatter, gather and assembly operations between local and global nodes can
then be easily coded as shown in the second and third loops of figure 11.5.
! Assign Connectivity in 1D
do j = 1,N ! element loop
do m = 1,P ! loop over collocation points
imap(m,j) = (j-1)*(P-1)+m ! assign global node numbers
enddo
enddo
! Gather/Scatter operation
do j = 1,N ! element loop
do m = 1,P ! loop over local node numbers
mg = imap(m,j) ! global node number of node m in element j
ul(m,j) = u(mg) ! local gathering operation
v(mg) = vl(m,j) ! local scattering
enddo
enddo
! Assembly operation
K(1:Nt,1:Nt) = 0 ! global stiffness matrix
do j = 1,N ! element loop
do n = 1,P
ng = imap(n,j) ! global node number of local node n
do m = 1,P
mg = imap(m,j) ! global node number of local node m
K(mg,ng) = K(mg,ng) + Kl(m,n,j)
enddo
enddo
enddo
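For example, with quadratic elements (P = 3) the connectivity loop assigns global nodes (1, 2, 3) to element 1, (3, 4, 5) to element 2, and so on: adjacent elements share the node on their common edge, and the assembly loop accumulates both elemental contributions into that shared row and column of K.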
The assembly procedure can now be carried out as before, with the proviso that the local node
number m runs from 1 to 3. In the present instance the global system of equations is
pentadiagonal, and is more expensive to invert than the tridiagonal system obtained with
the linear interpolation functions. One would expect improved accuracy, however.
L_{P−1} denotes the Legendre polynomial of degree P − 1 and L′_{P−1} denotes its derivative.
The P collocation points ξ_n are the P Gauss-Lobatto roots of the Legendre polynomial
of degree P − 1, i.e. they are the roots of the equation:
\[ (1 - \xi^2)\, L'_{P-1}(\xi) = 0. \]
Figure 11.6: Plot of the Lagrangian interpolants for different polynomial degrees: linear
(top left), quadratic (top right), cubic (bottom left), and fifth degree (bottom right). The
collocation points are shown by circles, and are located at the Gauss-Lobatto roots of
the Legendre polynomial. The superscript indicates the total number of collocation
points, and the subscript the collocation point with which the polynomial interpolant is
associated.
and are shown in figure 11.7. Equation 11.79 shows the two different forms in which we can
express the Lagrangian interpolant; the traditional notation expresses h_m as a product
of P − 1 factors chosen so as to guarantee the Lagrangian property; the second form is
particular to the choice of Legendre Gauss-Lobatto points Boyd (1989); Canuto et al.
(1988). It is easy to show that h_m(ξ_n) = δ_mn, and that the h_m are polynomials of degree P − 1.
Note that, unlike the previous cases, the collocation points are not equally spaced within
each element but tend to cluster near the boundaries of the element: the
collocation spacing is O(1/(P − 1)²) near the boundary and O(1/(P − 1)) near the center
of the element. These functions are shown in figure 11.6 for P = 4 and 6. The P − 2
internal points are localized to a single element and do not interact with the interpolation
functions of neighboring elements; the edge interpolation polynomials have support in two
neighboring elements.
The evaluation of the derivative of the solution at specified points η_n is equivalent to:
\[ u'(\eta_n) = \sum_{m=1}^{P} h'_m(\eta_n)\, u_m \tag{11.81} \]
and can be cast in the form of a matrix-vector product, where the matrix entries are the
derivatives of the interpolation polynomials at the specified points η_n.
The only difficulty arises from the more complicated form of the integration formula.
For this reason, it is common to evaluate the integrals numerically using high order
Gaussian quadrature; see section 11.2.10. Once the local matrices are computed, the
assembly procedure can be performed with the local node numbering m running from 1
to P.
where Q is the order of the quadrature formula and ξ_p^G are the Gauss quadrature points;
the superscript is meant to distinguish the quadrature points from the collocation points.
R_Q is the remainder of approximating the integral with a weighted sum; it is usually
proportional to a high order derivative of the function g.
Gauss Quadrature
Gauss quadrature is one of the most commonly used quadrature formulas. Its quadrature
points are given by the roots ξ_p^G of the Qth degree Legendre polynomial, and its weights by
\[ \omega_p^G = \frac{2}{\left(1 - (\xi_p^G)^2\right)\left[ L'_Q(\xi_p^G) \right]^2}, \qquad p = 1, 2, \ldots, Q. \]
Gauss quadrature omits the end points of the interval ξ = ±1 and considers only interior
points. Notice that if the integrand is a polynomial of degree 2Q − 1, its 2Q-th derivative is
zero and the remainder vanishes identically. Thus a Q-point Gauss quadrature integrates
all polynomials of degree less than 2Q exactly.
Gauss-Lobatto Quadrature
The Gauss-Lobatto quadrature formula includes the end points in its estimate of the
integral. The roots, weights, and remainder of a Q-point Gauss-Lobatto quadrature are:
\[ \left( 1 - (\xi_p^{GL})^2 \right) L'_{Q-1}(\xi_p^{GL}) = 0 \tag{11.85} \]
\[ \omega_p^{GL} = \frac{2}{Q(Q-1)\left[ L_{Q-1}(\xi_p^{GL}) \right]^2}, \qquad p = 1, 2, \ldots, Q \tag{11.86} \]
\[ R_Q = \frac{-Q(Q-1)^3\, 2^{2Q-1}\, [(Q-2)!]^4}{(2Q-1)\, [(2Q-2)!]^3}\; \left.\frac{\partial^{2Q-2} g}{\partial\xi^{2Q-2}}\right|_{\xi}, \qquad |\xi| < 1 \tag{11.87} \]
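For example (a worked check we add here, not in the original notes), the Q = 3 Gauss-Lobatto rule has nodes ξ = −1, 0, 1 and weights 1/3, 4/3, 1/3, and is exact for polynomials of degree 2Q − 3 = 3:
\[ \int_{-1}^{1} \xi^2\; d\xi = \frac{2}{3} = \frac{1}{3}(-1)^2 + \frac{4}{3}(0)^2 + \frac{1}{3}(1)^2, \]
whereas for ξ⁴ the rule returns 2/3 instead of the exact 2/5, since degree 4 exceeds 2Q − 3.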
Although the order of Gauss quadrature is higher, it is not always the optimal choice;
other considerations may favor Gauss-Lobatto quadrature and reduced (inexact) integration.
Consider a quadrature rule where the collocation and quadrature points are
identical; such a rule is possible if one chooses the Gauss-Lobatto quadrature of order P,
where P is the number of points in each element; then ξ_m = ξ_m^{GL} for m = 1, . . . , P.
Figure 11.8: Convergence curves for the finite element solution of uxx − 4u = 0. The
red and blue lines show the solutions obtained using linear and quadratic interpolation,
respectively, using exact integration; the black lines show the spectral element solution.
The spectral element solution used Gauss-Lobatto quadrature to evaluate the integrals, whereas exact integration was used
for the linear and quadratic finite elements. The inexact quadrature does not destroy the
spectral character of the method.
Here |û|_{H²} is the so-called H² semi-norm (essentially a measure of the "size" of the
second derivative), and C is a generic positive constant independent of h. If the solution
admits integrable second derivatives, then the H¹-norm of the error decays linearly with
the grid size h. The L²-norm of the error, however, decreases quadratically according to:
\[ \| u_n - \hat{u} \| \le \tilde{C}\, h^2\, |\hat{u}|_{H^2}, \qquad |\hat{u}|_{H^2} = \left( \int_0^1 (u_{xx})^2\; dx \right)^{\frac{1}{2}} \tag{11.97} \]
The difference between the rates of convergence of the two error norms is due to the
fact that the H¹-norm takes the derivatives of the function into account: the first
derivative of û is approximated to first order in h while û itself is approximated to
second order in h.
For an interpolation using polynomials of degree k the L²-norm of the error is given
by
\[ \| u_n - \hat{u} \| \le \hat{C}\, h^{k+1}\, |\hat{u}|_{H^{k+1}}, \qquad |\hat{u}|_{H^{k+1}} = \left( \int_0^1 \left( \frac{d^{k+1} u}{dx^{k+1}} \right)^2 dx \right)^{\frac{1}{2}} \tag{11.98} \]
provided that the solution is smooth enough, i.e., the (k + 1)-th derivative of the solution is
square-integrable.
For the spectral element approximation using a single element, the error depends on
N and the regularity of the solution. If û ∈ H^m with m > 0, then the approximation
error in the L² norm is bounded by
\[ \| u_n - \hat{u} \| \le C\, N^{-m}\, |\hat{u}|_{H^m}. \tag{11.99} \]
The essential difference between the p-version of the finite element method and the
spectral element method lies in the exponent of the error (note that h ∼ N^{−1}). In the
p-case the exponent of N is limited by the degree of the interpolating polynomial; in
the spectral element case it is limited by the smoothness of the solution. If the latter is
infinitely smooth then m ≈ N and hence the decay is exponential, like N^{−N}.
11.4 Two Dimensional Problems

Consider the two-dimensional version of the model problem:
\[ \nabla^2 u - \lambda u + f = 0, \qquad \mathbf{x} \in \Omega \tag{11.100} \]
\[ u(\mathbf{x}) = u_b(\mathbf{x}), \qquad \mathbf{x} \in \Gamma_D \tag{11.101} \]
\[ \nabla u\cdot\mathbf{n} = q, \qquad \mathbf{x} \in \Gamma_N \tag{11.102} \]
where Ω is the region of interest, ∂Ω its boundary, Γ_D the portion of
∂Ω where Dirichlet conditions are imposed, and Γ_N the portion where Neumann
conditions are imposed. We suppose that Γ_D ∪ Γ_N = ∂Ω.
The variational statement comes from multiplying the PDE by test functions v(x)
that satisfy v(x ∈ Γ_D) = 0, integrating over the domain Ω, and applying the boundary
conditions; the variational statement boils down to:
\[ \int_{\Omega} \left( \nabla u\cdot\nabla v + \lambda u v \right) dA = \int_{\Omega} f v\; dA + \int_{\Gamma_N} v q\; ds, \qquad \forall v \in H_0^1 \tag{11.103} \]
where H₀¹ is the space of square integrable functions on Ω that satisfy homogeneous
boundary conditions on Γ_D, and ds is the arclength along the boundary. The Galerkin
formulation comes from restricting the test and interpolation functions to a finite set:
\[ u = \sum_{i=1}^{N} u_i\, \phi_i(\mathbf{x}), \qquad v = \phi_j, \quad j = 1, 2, \ldots, N \tag{11.104} \]
where the φ_i(x) are now two-dimensional interpolation functions (here we restrict ourselves
to Lagrangian interpolants). The Galerkin formulation becomes: find u_i such that
\[ \sum_{i=1}^{N} \int_{\Omega} \left( \nabla\phi_i\cdot\nabla\phi_j + \lambda\,\phi_i\phi_j \right) dA\; u_i = \int_{\Omega} f\,\phi_j\; dA + \int_{\Gamma_N} \phi_j\, q\; ds, \qquad j = 1, 2, \ldots, N \tag{11.105} \]
We thus recover the matrix formulation Ku = b, where the stiffness matrix and forcing
functions are given by:
\[ K_{ji} = \int_{\Omega} \left( \nabla\phi_i\cdot\nabla\phi_j + \lambda\,\phi_i\phi_j \right) dA \tag{11.106} \]
\[ b_j = \int_{\Omega} f\,\phi_j\; dA + \int_{\Gamma_N} \phi_j\, q\; ds \tag{11.107} \]
In two space dimensions we have a greater choice of element topology (shape) than in
the simple 1D case. Triangular elements, the simplex elements of 2D space, are very
common since they offer great flexibility and allow the discretization of very complicated
domains. In addition to triangular elements, quadrilateral elements with either straight
or curved edges are extensively used. In the following, we explore the interpolation
functions for each of these elements, and the practical issues that need to be overcome in
order to compute the integrals of the Galerkin formulation.
Consider a triangle with vertices i, j and k, and an interior point P. The point P divides
the triangle into three sub-triangles with areas A_i, A_j and A_k, each labelled after the
vertex it faces. Denoting the total area of the triangle by A, the area coordinates of P
are defined as:
\[ a_i = \frac{A_i}{A}, \qquad a_j = \frac{A_j}{A}, \qquad a_k = \frac{A_k}{A}, \qquad a_i + a_j + a_k = 1 \tag{11.108} \]
The area of the triangle is given by the determinant:
\[ A = \frac{1}{2} \begin{vmatrix} 1 & x_i & y_i \\ 1 & x_j & y_j \\ 1 & x_k & y_k \end{vmatrix} \tag{11.109} \]
The other areas A_i, A_j and A_k can be obtained similarly; their dependence on the
coordinates (x, y) of the point P is linear. It is now easy to verify that if we set the
local interpolation functions to φ_i = a_i we obtain the linear Lagrangian interpolant
on the triangle, i.e. that φ_i(x_m) = δ_{i,m}, where m can take the values i, j, or k. The
linear Lagrangian interpolant for point i can be easily expressed in terms of the global
coordinate system (x, y):
\[ \phi_i(x,y) = \frac{1}{2A} \left( \alpha_i + \beta_i x + \gamma_i y \right) \tag{11.110} \]
\[ \alpha_i = x_j y_k - x_k y_j \tag{11.111} \]
\[ \beta_i = y_j - y_k \tag{11.112} \]
\[ \gamma_i = -(x_j - x_k) \tag{11.113} \]
The other interpolation functions can be obtained by a simple permutation of the indices.
Note that the derivatives of the linear interpolation functions are constant over the element.
The linear interpolation now takes the simple form:
\[ u(x,y) = u_i\,\phi_i + u_j\,\phi_j + u_k\,\phi_k \tag{11.114} \]
where u_i, u_j and u_k are the values of the solution at the nodes i, j and k. The interpolation
formula for the triangle guarantees the continuity of the solution across element
boundaries. Indeed, on edge j–k, for example, the interpolation does not involve node
i and is essentially a linear combination of the function values at nodes j and k, thus
ensuring continuity. The linear interpolation functions for the triangle are shown in figure
11.10 as a three-dimensional plot.

Figure 11.10: Linear interpolation functions over linear triangular elements. The triangle
is shown in dark shades. The three interpolation functions are shown in a light shade
and appear as inclined planes in the plot.
The usefulness of the area coordinates stems from the existence of the following
integration formula over the triangle:
\[ \int_A a_i^p\, a_j^q\, a_k^r\; dA = 2A\, \frac{p!\, q!\, r!}{(p+q+r+2)!} \tag{11.115} \]
where the notation p! = 1·2 ⋯ p stands for the factorial of the integer p. It is now easy
to verify that the local mass matrix is given by
\[ M^e = \int_A \begin{pmatrix} \phi_i\phi_i & \phi_i\phi_j & \phi_i\phi_k \\ \phi_j\phi_i & \phi_j\phi_j & \phi_j\phi_k \\ \phi_k\phi_i & \phi_k\phi_j & \phi_k\phi_k \end{pmatrix} dA = \frac{A}{12} \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} \tag{11.116} \]
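Indeed, applying (11.115) entry by entry (our check, added here):
\[ \int_A a_i a_j\; dA = 2A\, \frac{1!\,1!\,0!}{(1+1+0+2)!} = \frac{A}{12}, \qquad \int_A a_i^2\; dA = 2A\, \frac{2!}{(2+0+0+2)!} = \frac{A}{6}. \]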
The entries of the matrix arising from the discretization of the Laplace operator are
easy to compute since the gradients of the interpolation and test functions are constant
over an element; thus we have:
\[ D^e = \int_A \begin{pmatrix} \nabla\phi_i\cdot\nabla\phi_i & \nabla\phi_i\cdot\nabla\phi_j & \nabla\phi_i\cdot\nabla\phi_k \\ \nabla\phi_j\cdot\nabla\phi_i & \nabla\phi_j\cdot\nabla\phi_j & \nabla\phi_j\cdot\nabla\phi_k \\ \nabla\phi_k\cdot\nabla\phi_i & \nabla\phi_k\cdot\nabla\phi_j & \nabla\phi_k\cdot\nabla\phi_k \end{pmatrix} dA \tag{11.117} \]
\[ = \frac{1}{4A} \begin{pmatrix} \beta_i\beta_i + \gamma_i\gamma_i & \beta_i\beta_j + \gamma_i\gamma_j & \beta_i\beta_k + \gamma_i\gamma_k \\ \beta_j\beta_i + \gamma_j\gamma_i & \beta_j\beta_j + \gamma_j\gamma_j & \beta_j\beta_k + \gamma_j\gamma_k \\ \beta_k\beta_i + \gamma_k\gamma_i & \beta_k\beta_j + \gamma_k\gamma_j & \beta_k\beta_k + \gamma_k\gamma_k \end{pmatrix} \tag{11.118} \]

Figure 11.11: FEM grid and contours of the solution to the Laplace equation in a circular
annulus.
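The following short program is our sketch, not part of the notes (the program and array names are ours); it evaluates M^e and D^e from (11.116) and (11.118) for a given triangle, and for the unit right triangle it reproduces the familiar stiffness entries.

! Illustrative sketch (not from the notes): element mass and Laplacian
! matrices for a linear triangle via eqs. (11.109)-(11.118).
program tri_element
  implicit none
  real :: x(3), y(3), beta(3), gam(3), area
  real :: De(3,3), Me(3,3)
  integer :: m, n
  x = (/ 0.0, 1.0, 0.0 /)              ! example vertices, counterclockwise
  y = (/ 0.0, 0.0, 1.0 /)
  area = 0.5*( (x(2)-x(1))*(y(3)-y(1)) - (x(3)-x(1))*(y(2)-y(1)) )
  do m = 1, 3                          ! beta_i = y_j - y_k, gamma_i = -(x_j - x_k)
     beta(m) =  y(mod(m,3)+1) - y(mod(m+1,3)+1)
     gam(m)  = -(x(mod(m,3)+1) - x(mod(m+1,3)+1))
  end do
  do n = 1, 3
     do m = 1, 3
        De(m,n) = (beta(m)*beta(n) + gam(m)*gam(n))/(4.0*area)  ! eq. (11.118)
        Me(m,n) = area/12.0                                     ! eq. (11.116)
        if (m == n) Me(m,n) = area/6.0
     end do
  end do
  print *, 'area =', area
  do m = 1, 3
     print '(3f10.4)', De(m,:)
  end do
end program tri_element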
As an example of the application of the FEM we solve the Laplace
equation in a circular annulus, subject to Dirichlet boundary conditions, with
linear triangular elements. The FEM grid and contours of the solution are shown in figure
11.11. The grid contains 754 nodes and 1364 elements. The boundary conditions were
set to cos θ on the outer radius and sin θ on the inner radius, where θ is the azimuthal
angle. The inner and outer radii are 1/2 and 1, respectively. The contour lines were
obtained by interpolating the FEM solution to a high resolution (401×401) structured
grid prior to contouring it.
Figure 11.12: Mapping of a quadrilateral between the unit square in computational space
(left) and physical space (right).
Here i–k denotes the midpoint of edge i–k. There are 6 degrees of freedom associated
with each quadratic triangular element. Although it is possible to define even higher order
interpolations on the triangle by proceeding as before, this is not simple to implement.
The alternative is to use a mapping to a structured computational space and define
the interpolation and collocation functions in that space. It is important to choose the
collocation points appropriately in order to avoid Gibbs oscillations as the polynomial
degree increases. Spectral triangular elements are the optimal choice in this regard.
We will not discuss spectral element triangles here; we refer the interested reader to
Karniadakis and Sherwin (1999).
Here r denotes the position vector of a point P in space and (i, j) forms an orthonormal
basis in physical space. Inverting the above relationship one obtains
\[ \mathbf{i} = \frac{y_\eta}{J}\,\mathbf{e}_\xi - \frac{y_\xi}{J}\,\mathbf{e}_\eta, \qquad \mathbf{j} = -\frac{x_\eta}{J}\,\mathbf{e}_\xi + \frac{x_\xi}{J}\,\mathbf{e}_\eta \]
where J = x_ξ y_η − x_η y_ξ is the Jacobian of the mapping. The norms of e_ξ and e_η are given
by
\[ |\mathbf{e}_\xi|^2 = \mathbf{e}_\xi\cdot\mathbf{e}_\xi = x_\xi^2 + y_\xi^2, \qquad |\mathbf{e}_\eta|^2 = \mathbf{e}_\eta\cdot\mathbf{e}_\eta = x_\eta^2 + y_\eta^2. \]
The basis in the computational plane is orthogonal if e_ξ · e_η = x_ξ x_η + y_ξ y_η = 0; in general
the basis is not orthogonal unless the element is rectangular.
It is now possible to derive expressions for length and area segments in computational
space. These are needed in order to compute the boundary and area integrals arising from
the Galerkin formulation. Using the definition (ds)² = dr · dr with dr = r_ξ dξ + r_η dη,
we have:
\[ (ds)^2 = |\mathbf{e}_\xi|^2\, d\xi^2 + 2\,\mathbf{e}_\xi\cdot\mathbf{e}_\eta\; d\xi\, d\eta + |\mathbf{e}_\eta|^2\, d\eta^2. \]
The differential area of a curved surface is defined as the area of its tangent-plane
approximation (in 2D, the area is always flat). The area of the parallelogram defined by
the vectors dξ e_ξ and dη e_η is
\[ dA = \left\| d\xi\,\mathbf{e}_\xi \times d\eta\,\mathbf{e}_\eta \right\| = \left| \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ x_\xi & y_\xi & 0 \\ x_\eta & y_\eta & 0 \end{vmatrix} \right| d\xi\, d\eta = \left| x_\xi y_\eta - x_\eta y_\xi \right| d\xi\, d\eta = |J|\; d\xi\, d\eta \tag{11.129} \]
after using the definition of (eξ , eη ) in terms of (i, j).
Since x = x(ξ, η) and y = y(ξ, η), derivatives in physical space can be expressed
in terms of derivatives in computational space by using the chain rule of differentiation;
in matrix form this can be expressed as:
\[ \begin{pmatrix} u_x \\ u_y \end{pmatrix} = \begin{pmatrix} \xi_x & \eta_x \\ \xi_y & \eta_y \end{pmatrix} \begin{pmatrix} u_\xi \\ u_\eta \end{pmatrix} \tag{11.130} \]
Notice that the chain rule involves the derivatives of ξ, η with respect to x, y, whereas the
bilinear map readily delivers the derivatives of x, y with respect to ξ, η. In order to avoid
inverting the mapping from physical to computational space, we derive expressions for
∇ξ and ∇η in terms of x_ξ, x_η, etc. Applying the chain rule to x and y (noticing that
the two variables are independent) we obtain the system of equations:
\[ \begin{pmatrix} \xi_x & \eta_x \\ \xi_y & \eta_y \end{pmatrix} \begin{pmatrix} x_\xi & y_\xi \\ x_\eta & y_\eta \end{pmatrix} = \begin{pmatrix} x_x & y_x \\ x_y & y_y \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \tag{11.131} \]
The solution is
\[ \begin{pmatrix} \xi_x & \eta_x \\ \xi_y & \eta_y \end{pmatrix} = \frac{1}{J} \begin{pmatrix} y_\eta & -y_\xi \\ -x_\eta & x_\xi \end{pmatrix}, \qquad J = x_\xi y_\eta - x_\eta y_\xi \tag{11.132} \]
Figure 11.13: Bilinear shape functions in quadrilateral elements: the upper left panel
shows h1(ξ)h1(η), the upper right panel h2(ξ)h1(η), the lower left
panel h1(ξ)h2(η), and the lower right panel h2(ξ)h2(η).
For the bilinear map of equation 11.126, the metrics and Jacobian can be easily
computed by differentiation:
\[ x_\xi = \frac{1-\eta}{2}\,\frac{x_2 - x_1}{2} + \frac{1+\eta}{2}\,\frac{x_4 - x_3}{2}, \qquad y_\xi = \frac{1-\eta}{2}\,\frac{y_2 - y_1}{2} + \frac{1+\eta}{2}\,\frac{y_4 - y_3}{2} \tag{11.133} \]
\[ x_\eta = \frac{1-\xi}{2}\,\frac{x_3 - x_1}{2} + \frac{1+\xi}{2}\,\frac{x_4 - x_2}{2}, \qquad y_\eta = \frac{1-\xi}{2}\,\frac{y_3 - y_1}{2} + \frac{1+\xi}{2}\,\frac{y_4 - y_2}{2} \tag{11.134} \]
The remaining expressions can be obtained simply by plugging in the various expressions
derived earlier.
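As an illustration (ours, not part of the notes; the program and variable names are ours), the metrics (11.133)-(11.134) and the Jacobian (11.132) can be evaluated at any point (ξ, η) of an element:

! Illustrative sketch (not from the notes): metrics and Jacobian of the
! bilinear map at a point (xi,eta). Vertex numbering follows eq. (11.136):
! 1=(-1,-1), 2=(1,-1), 3=(-1,1), 4=(1,1) in computational space.
program bilinear_map
  implicit none
  real :: xv(4), yv(4), xi, eta
  real :: xxi, xeta, yxi, yeta, jac
  xv = (/ 0.0, 2.0, 0.2, 1.8 /)        ! an arbitrary straight-sided quadrilateral
  yv = (/ 0.0, 0.1, 1.0, 1.2 /)
  xi = 0.5; eta = -0.25
  xxi  = 0.5*(1.0-eta)*(xv(2)-xv(1))/2.0 + 0.5*(1.0+eta)*(xv(4)-xv(3))/2.0
  yxi  = 0.5*(1.0-eta)*(yv(2)-yv(1))/2.0 + 0.5*(1.0+eta)*(yv(4)-yv(3))/2.0
  xeta = 0.5*(1.0-xi )*(xv(3)-xv(1))/2.0 + 0.5*(1.0+xi )*(xv(4)-xv(2))/2.0
  yeta = 0.5*(1.0-xi )*(yv(3)-yv(1))/2.0 + 0.5*(1.0+xi )*(yv(4)-yv(2))/2.0
  jac  = xxi*yeta - xeta*yxi           ! J = x_xi y_eta - x_eta y_xi
  print *, 'J =', jac
  print *, 'xi_x  =',  yeta/jac, '  xi_y  =', -xeta/jac   ! eq. (11.132)
  print *, 'eta_x =', -yxi /jac, '  eta_y =',  xxi /jac
end program bilinear_map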
We can use the collocation points located at the vertices of the quadrilateral to define the
following interpolation:
\[ u(\xi,\eta) = \sum_{m=1}^{4} u_m\, \phi_m(\xi,\eta) = u_1\phi_1 + u_2\phi_2 + u_3\phi_3 + u_4\phi_4 \tag{11.135} \]
where the two-dimensional Lagrangian interpolants are tensorized products of the one-dimensional
interpolants defined in equation 11.61:
\[ \phi_1(\xi,\eta) = h_1(\xi)h_1(\eta), \quad \phi_2(\xi,\eta) = h_2(\xi)h_1(\eta), \quad \phi_3(\xi,\eta) = h_1(\xi)h_2(\eta), \quad \phi_4(\xi,\eta) = h_2(\xi)h_2(\eta), \tag{11.136} \]
as shown in figure 11.13. Note that the bilinear interpolation functions above satisfy
the C⁰ continuity requirement. This can be easily verified by noting that the interpolation
along an edge involves only the collocation points along that edge; hence
neighboring elements sharing an edge will interpolate the solution identically if the value
of the function at the collocation points is unique. Another important feature of the
bilinear interpolation is that, unlike the linear interpolation in triangular elements, it
contains a term of second degree, ξη. Hence the interpolation within an element is
nonlinear; it is only linear along edges.
Before proceeding further, we introduce a new notation for interpolation on quadrilaterals
to bring out explicitly the tensorized form of the interpolation functions. This
is accomplished by breaking the single two-dimensional index m in equation 11.135 into
two one-dimensional indices (i, j) such that m = 2(j − 1) + i, where i, j = 1, 2. The
index i runs along the ξ direction and the index j along the η direction; thus m = 1
becomes identified with (i, j) = (1, 1), m = 2 with (i, j) = (2, 1), etc. The interpolation
formula can now be written as
\[ u(\xi,\eta) = \sum_{j=1}^{2} \sum_{i=1}^{2} u_{ij}\, h_i(\xi)\, h_j(\eta) \tag{11.137} \]
The superscript N is introduced on the Lagrangian interpolants to stress that
they are polynomials of degree N − 1 and use N collocation points per direction. The
collocation points for 7th degree interpolation polynomials are shown in figure 11.14.
Note, finally, that sum factorization algorithms must be used to compute the various
quantities on structured sets of points (p, q) in order to reduce the computational overhead.
For example, the derivative of the function at point (p, q) can be computed as:
\[ u_\xi \Big|_{\xi_p,\eta_q} = \sum_{j=1}^{N} \left( \sum_{i=1}^{N} u_{ij}\, \left.\frac{dh_i^N}{d\xi}\right|_{\xi_p} \right) h_j^N(\eta_q). \tag{11.139} \]
Figure 11.14: Collocation points within a spectral element using 8 collocation points per
direction to interpolate the solution; there are 8² = 64 points in total per element.
First the term in parentheses is computed and saved in a temporary array; second, the
final expression is computed and saved. This reduces the operation count from O(N⁴)
to O(N³). Further reductions in the operation count can be obtained under special
circumstances. For instance, if η_q happens to be a collocation point, then h_j^N(η_q) = δ_{jq} and the
formula reduces to a single sum:
\[ u_\xi \Big|_{\xi_p,\eta_q} = \sum_{i=1}^{N} \left.\frac{dh_i^N}{d\xi}\right|_{\xi_p} u_{iq} \tag{11.140} \]
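The two-stage evaluation can be coded as follows (our sketch, not from the notes; the subroutine and array names are ours). D(p,i) holds h_i'(ξ_p) and H(q,j) holds h_j(η_q), both assumed precomputed:

! Illustrative sketch (not from the notes): sum factorization for eq. (11.139).
! Each stage costs O(N^3) when derivatives are needed at all N*N points.
subroutine dudxi(N, D, H, u, uxi)
  implicit none
  integer, intent(in) :: N
  real, intent(in)  :: D(N,N), H(N,N), u(N,N)
  real, intent(out) :: uxi(N,N)
  real :: tmp(N,N)
  integer :: p, q, i, j
  do j = 1, N                     ! stage 1: contract along the xi direction
     do p = 1, N
        tmp(p,j) = 0.0
        do i = 1, N
           tmp(p,j) = tmp(p,j) + D(p,i)*u(i,j)
        end do
     end do
  end do
  do q = 1, N                     ! stage 2: contract along the eta direction
     do p = 1, N
        uxi(p,q) = 0.0
        do j = 1, N
           uxi(p,q) = uxi(p,q) + tmp(p,j)*H(q,j)
        end do
     end do
  end do
end subroutine dudxi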
In the resulting quadrature formulas, J_{mn} denotes the Jacobian at the Gauss quadrature
points (ξ_m^G, η_n^G), and ω_m^G the Gauss quadrature weights. The only required operations are hence the evaluation of
the Lagrangian interpolants at the points ξ_m^G, and the summation of the terms in 11.142. If
Gauss-Lobatto integration is used, the roots and weights become those appropriate for
the Gauss-Lobatto quadrature; the number of quadrature points needed to evaluate the
integral exactly increases to Q ≥ P + 1.
As in the one-dimensional case, the mass matrix can be rendered diagonal if inexact
(but accurate enough) integration is acceptable: if the quadrature points and collocation
points coincide, h_k(ξ_m^G) = δ_{km}, and the mass matrix reduces to a diagonal matrix.
where the expressions in brackets are evaluated at the quadrature points (ξ_m, η_n) (we
have omitted the superscript G from the quadrature points). Again, substantial savings
can be achieved if the Gauss-Lobatto quadrature of the same order as the interpolation
polynomial is used. The expression for D_{ij,kl} reduces to:
\[ \begin{aligned} D_{ij,kl} = {} & \delta_{jl} \sum_{m=1}^{Q} h'_i(\xi_m)\, h'_k(\xi_m)\, \left[ \nabla\xi\cdot\nabla\xi\, |J| \right]_{m,j} \omega_m \omega_j \\ & + \delta_{ik} \sum_{n=1}^{Q} h'_j(\eta_n)\, h'_l(\eta_n)\, \left[ \nabla\eta\cdot\nabla\eta\, |J| \right]_{i,n} \omega_i \omega_n \\ & + h'_i(\xi_k)\, h'_l(\eta_j)\, \left[ \nabla\xi\cdot\nabla\eta\, |J| \right]_{k,j} \omega_k \omega_j \\ & + h'_k(\xi_i)\, h'_j(\eta_l)\, \left[ \nabla\xi\cdot\nabla\eta\, |J| \right]_{i,l} \omega_i \omega_l \end{aligned} \tag{11.149} \]
11.5 Time-Dependent Problem in 1D: The Advection Equation

Consider the one-dimensional advection equation
\[ u_t + c\, u_x = 0, \qquad x \in \Omega \tag{11.150} \]
subject to appropriate initial and boundary conditions. The variational statement of the
problem is: find u such that
\[ \int_{\Omega} \left( u_t + c\, u_x \right) v\; dx = 0, \qquad \forall v \tag{11.151} \]
The Galerkin formulation reduces the problem to a finite dimensional space by replacing
the solution with a finite expansion of the form
\[ u = \sum_{i=1}^{N} u_i(t)\, \phi_i(x) \tag{11.152} \]
and setting the test functions to v = φ_j, j = 1, 2, . . . , N. Note that, unlike the steady
state problems encountered earlier, u_i, the value of the function at the collocation points,
depends on time. Replacing the finite expansion in the variational form we get the
following system of ordinary differential equations (ODE):
\[ M \frac{d\mathbf{u}}{dt} + C\,\mathbf{u} = 0 \tag{11.153} \]
where u is the vector of solution values at the collocation points, M is the mass matrix,
and C the matrix resulting from discretization of the advection term using the Galerkin
procedure. The entries of M and C are given by:
\[ M_{j,i} = \int_0^L \phi_i\,\phi_j\; dx \tag{11.154} \]
\[ C_{j,i} = \int_0^L c\, \frac{\partial\phi_i}{\partial x}\, \phi_j\; dx \tag{11.155} \]
We follow the procedure outlined in section 11.2.7 to build the matrices M and C.
We again start with linear interpolation functions, as those defined in equation
11.60. The local mass matrix is given in equation 11.68; here we derive expressions for
the advection matrix C, assuming the advective velocity c is constant. The local advection
matrix is
\[ C_{ji} = \int_{x_{j-1}}^{x_j} c\, \frac{dh_i}{dx}\, h_j\; dx = c \int_{-1}^{1} h_j(\xi)\, \frac{dh_i(\xi)}{d\xi}\; d\xi, \qquad C^j = \frac{c}{2} \begin{pmatrix} -1 & 1 \\ -1 & 1 \end{pmatrix} \tag{11.156} \]
and the assembled equation at an interior node j reads
\[ \frac{\Delta x}{6} \frac{d}{dt}\left( u_{j-1} + 4 u_j + u_{j+1} \right) + \frac{c}{2}\left( u_{j+1} - u_{j-1} \right) = 0 \tag{11.157} \]
The approximation of the advective term using linear finite elements has resulted in a
centered-difference approximation for that term, whereas the time-derivative term has
produced the mass matrix. Notice that any integration scheme, even an explicit one
like leap-frog, would necessarily require the inversion of the mass matrix. Thus, one can
already anticipate that the computational cost of solving the advection equation with FEM will
be higher than that of a similarly configured explicit finite difference method. For linear elements
in 1D, the mass matrix is tridiagonal and the increase in cost is minimal since tridiagonal
solvers are very efficient. Quadratic elements in 1D lead to a pentadiagonal system,
which is costlier to solve. This increased cost may be justifiable if it is compensated by
a sufficient increase in accuracy. In multi-dimensions, the mass matrix is not tridiagonal
but has a limited bandwidth that depends on the global numbering of the nodes; thus
even linear elements would require a full matrix inversion.
Many solutions have been proposed to tackle the extra cost of the full matrix. One
solution is to use reduced integration with Gauss-Lobatto quadrature which, as we saw in
the Gaussian quadrature section, leads immediately to a diagonal matrix; this procedure
is often referred to as mass lumping. For low order elements, mass lumping significantly
degrades the accuracy of the finite element method, particularly in regards to its phase
properties. For the 1D advection equation, mass lumping of linear elements is equivalent
to a centered difference approximation. For high order interpolation, i.e. higher than
degree 3, the loss of accuracy due to the inexact quadrature is tolerable, and is of the
same order of accuracy as the interpolation formula. Another alternative revolves around
the use of discontinuous test functions and is appropriate for the solution of mostly
hyperbolic equations; this approach is dubbed the Discontinuous Galerkin method, and
will be examined in a following section.
The system of ODEs can now be integrated using one of the time-stepping algorithms,
for example second order leap-frog, third order Adams-Bashforth, or one of the Runge-
Kutta schemes. For linear finite elements, the stability limit can be easily studied with
the help of Von Neumann stability analysis. For example, it is easy to show that a leap-frog
scheme applied to equation 11.157 results in a stability limit of the form µ = c∆t/∆x <
1/√3, and hence is much more restrictive than the finite difference scheme, which merely
requires that the Courant number be less than 1. However, an examination of the phase
properties of the linear FE scheme will reveal its superiority over centered differences.
Figure 11.15: Comparison of the dispersive properties of linear finite elements with
centered differences. The left panel shows the ratio of numerical phase speed to analytical
phase speed, and the right panel the corresponding ratio for the group velocity.
We study the phase properties implied by the linear finite element discretization by
looking for periodic solutions, in space and time, of the system of equations 11.157:
u(x, t) = e^{i(kx_j − σt)}. We thus get the following dispersion relation and phase speed:
\[ \sigma = \frac{3c}{\Delta x}\, \frac{\sin k\Delta x}{2 + \cos k\Delta x} \tag{11.158} \]
\[ \frac{c_F}{c} = \frac{3 \sin k\Delta x}{k\Delta x\, (2 + \cos k\Delta x)} \tag{11.159} \]
The numerical phase speed should be contrasted with those obtained from centered second
and fourth order finite differences:
\[ \frac{c_{CD2}}{c} = \frac{\sin k\Delta x}{k\Delta x} \tag{11.160} \]
\[ \frac{c_{CD4}}{c} = \frac{1}{3} \left( 4\, \frac{\sin k\Delta x}{k\Delta x} - \frac{\sin 2k\Delta x}{2k\Delta x} \right) \tag{11.161} \]
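A few lines of code (our sketch, not from the notes; the program and variable names are ours) suffice to tabulate the ratios (11.159)-(11.161) and reproduce the trends of figure 11.15:

! Illustrative sketch (not from the notes): phase-speed ratios of linear FE
! versus 2nd- and 4th-order centered differences as functions of k*dx.
program dispersion
  implicit none
  real, parameter :: pi = 3.14159265
  real :: kdx, fe, cd2, cd4
  integer :: n
  do n = 1, 9
     kdx = 0.1*n*pi
     fe  = 3.0*sin(kdx)/(kdx*(2.0 + cos(kdx)))              ! eq. (11.159)
     cd2 = sin(kdx)/kdx                                     ! eq. (11.160)
     cd4 = (4.0*sin(kdx)/kdx - sin(2.0*kdx)/(2.0*kdx))/3.0  ! eq. (11.161)
     print '(4f10.5)', kdx/pi, fe, cd2, cd4
  end do
end program dispersion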
Figure 11.15 compares the dispersion of linear finite elements with that of centered difference
schemes of second, fourth and sixth order. It is immediately apparent that the FE formulation
yields a more accurate phase speed at all wavenumbers, and that the linear interpolation
is equivalent to, if not slightly better than, a sixth-order centered FD approximation; in
particular the intermediate to short waves travel slightly faster than in FD. The group
velocity, shown in the right panel of figure 11.15, shows similar results for the long to
intermediate waves. The group velocities of the short waves are, however, in serious error
for the FE formulation; in particular they are negative, so these waves propagate upstream
of the signal, and at a faster speed than in the finite difference schemes. A mass-lumped
version of the FE method would collapse the FE curve onto that of the centered second
order method.
The initial condition consists of an infinitely smooth Gaussian hill centered at x = −1/2.
The solution at time t = 1 should be the same Gaussian hill but centered at x = 1/2. The
FEM solutions are shown in figure 11.16, where the consistent mass (exact integration)
and "lumped" mass solutions are compared for various interpolation orders; the number
of elements was kept fixed, so the convergence study can be considered a p-refinement
strategy. The number of elements was chosen so that the hill is barely resolved for
linear elements (the case m = 2). It is obvious that the solution improves rapidly as m
(the number of interpolation points within an element) is increased. The lumped mass
solution is very poor indeed in the linear case, where its poor dispersive properties have
generated substantial oscillatory behavior. The situation improves substantially when
the interpolation is increased to quadratic, the biggest discrepancy occurring near the
upstream foot of the hill. The lumped and consistent mass solutions are indistinguishable
for m = 5.
One may object that the improvements are solely due to the increased resolution.
We have therefore repeated the experiment while keeping the total number of degrees of
freedom fixed; in this case we are trying to isolate the effects of improved interpolation
order. The first case considered is an under-resolved case using 61 total degrees of
freedom, and the comparison is shown in figure 11.17. The figure shows indeed the
improvement in the dispersive characteristics of the lumped mass approximation as the
degree of the polynomial is increased. The consistent and lumped solutions overlap over
the main portion of the signal for m ≥ 4, but differ over the (numerically) dispersed waves
trailing the hill: evidence that the two approximations are still under-resolved.
The solution in the resolved case is shown in figure 11.18, where the total number of
degrees of freedom is increased to 121. In this case the two solutions overlap for m ≥ 3,
and are indistinguishable for m = 5 over the entire domain, evidence that
the solution is now well-resolved. Notice that the dispersive error of the lumped mass solution in
the linear interpolation case is still quite large, and a further increase in the number of
elements would be required; these errors are entirely due to mass-lumping, as the consistent mass
solution is free of dispersive ripples.
The lesson of the previous example is that the dispersive properties of the lumped-mass
solution are quite good provided enough resolution is used. The number of elements
needed to reach the resolution threshold decreases dramatically as the polynomial order
is increased. The dispersive properties of the lumped-mass linear interpolation FEM
seem to be the worst, and seem to require a very well resolved solution before the effect
of the mass-lumping is eliminated. One then has to weigh the increased cost of a mass-lumped
large calculation against that of a smaller consistent-mass calculation to decide
whether lumping is cost-effective or not.
Figure 11.16: Solution of the advection equation with FEM. The black lines show the result
of the consistent mass matrix calculation and the red lines that of the "lumped"
mass matrix calculation. The discretization consists of 40 equally spaced elements on
the interval [−1, 1], and linear (m=2), quadratic (m=3), cubic (m=4) and quartic (m=5)
interpolation. The time stepping is done with a TVD-RK3 scheme (no explicit dissipation
included).
Figure 11.17: Consistent versus lumped solution of the advection equation with FEM
for a coarse resolution with the total number of degrees of freedom fixed, N ≈ 61. The
black and red lines refer to the consistent and lumped mass solutions, respectively.
Figure 11.18: Consistent versus lumped solution of the advection equation with FEM
for a fine resolution with the total number of degrees of freedom fixed, N ≈ 121. The
black and red lines refer to the consistent and lumped mass solutions, respectively.
11.6 The Discontinuous Galerkin Method (DGM)
Consider the advection equation written in the form of a conservation law:
\[ T_t + \nabla\cdot\mathbf{F} = 0, \qquad \mathbf{F} = \mathbf{v} T \tag{11.164} \]
where F is the flux, a function of T. We suppose that the domain of interest has been divided into elements.
Then the following variational statement applies inside each element:
\[ \int_E \left( T_t + \nabla\cdot\mathbf{F} \right) w\; dV = 0 \tag{11.165} \]
where w are the test functions and E is the element of interest. If w is sufficiently smooth,
we can integrate the divergence term by parts using the Gauss theorem to obtain:
\[ \int_E \left( T_t\, w - \mathbf{F}\cdot\nabla w \right) dV + \oint_{\partial E} w\, \mathbf{F}\cdot\mathbf{n}\; dS = 0 \tag{11.166} \]
where ∂E is the boundary of element E and n is the unit outward normal. The boundary
integral represents a weighted sum of the fluxes leaving/entering the element. The discretization
steps consist of replacing the infinite dimensional space of the test functions
by a finite space, and representing the solution by a finite expansion T_h. Since T_h is discontinuous
at the edges of elements, we must also replace the flux F(T) by a numerical
flux G that depends on the values of T_h outside and inside the element:
\[ G = G(T^i, T^o) \tag{11.167} \]
where T^i and T^o represent the values of the function on the edge as seen from inside the
element and from the neighboring element, respectively. Physical intuition dictates that
the right flux is the one obtained by following the characteristic (the Lagrangian trajectory).
For the advection equation this means that the outside value is used if n · v < 0 and the
inside value is used if n · v > 0.
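In code, the upwind choice is a one-line test (our sketch, not from the notes; the function name is ours):

! Illustrative sketch (not from the notes): upwind numerical flux (11.167)
! for the advection equation; the value is taken from the upstream side.
function upwind_flux(un, Ti, To) result(G)
  implicit none
  real, intent(in) :: un   ! n . v, velocity component along the outward normal
  real, intent(in) :: Ti   ! trace of T_h from inside the element
  real, intent(in) :: To   ! trace of T_h from the neighboring element
  real :: G
  if (un >= 0.0) then
     G = un*Ti             ! outflow: use the inside value
  else
     G = un*To             ! inflow: use the outside value
  end if
end function upwind_flux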
Figure 11.19: Flow field and initial condition for Gaussian Hill experiment
In the resulting system, M is the local mass matrix and no assembly is required. The great value of DGM
is that the variational statement operates one element at a time, and hence the mass
matrix arising from the time-derivative is purely local. The only linkage to neighboring
elements comes from the fluxes along the boundary. Thus, even if the matrix
M is full, it is a small system that can be readily inverted, and the process of time
integration becomes considerably cheaper. Another benefit of DGM is that it satisfies the
local conservation property so dear to many oceanographic/coastal modelers, since the
flux defined on each edge is unique. Finally, by dropping the continuity requirement
it becomes quite feasible to build locally and dynamically adaptive solution strategies
without worrying about global continuity of the function. In the next section we compare
the performance of DGM with that of the continuous formulation on several problems;
all simulations are performed using spectral element interpolation.
where ω is set to 2π. The initial condition is infinitely smooth and given by a Gaussian
distribution

φ = e^{−r²/l²},   r = √[(x − 1/4)² + (y − 1/2)²]   (11.173)
with an e-folding length scale l = 1/16. Periodic boundary conditions are imposed on all sides. All models were integrated for one rotation, at which time the solution should be identical to the initial condition.
The convergence curves for the l₂ norm of the error are displayed in figure 11.20 for the different formulations. The time integration consists of RK4 for the traditional Galerkin method (which we will refer to as CGM) and for DGM; the time step was chosen so that spatial errors dominate. The convergence curves for CGM and DGM are similar and indicate an exponential decrease of the error as the spectral truncation is increased for a constant elemental partition.
In order to compare the benefits of h versus p refinement, we plot in the right panels of figure 11.20 the l₂ error versus the total number of collocation points in each direction. The benefit of p-refinement for this infinitely smooth problem is readily apparent for CGM: given a fixed number of collocation points, the error is smallest for the smallest number of elements, and hence the highest spectral truncation. (The number of collocation points is given by K(N − 1) + 1, where N is the number of points per element and K is the number of elements.) The situation is not as clear for DGM, where the different curves tend to overlap. This is probably due to the discontinuous character of the interpolation: adjacent elements hold duplicate information about the solution, and the total number of degrees of freedom grows like KN.
The cone has a peak at r = 0 and decreases linearly to 0; there is a slope discontinuity at r = 1/8. The initial condition contours are similar to those depicted in figure 11.19. The same numerical parameters were used as for the Gaussian Hill problem.
The presence of the discontinuity ruins the spectral convergence property of the spectral element method. This is borne out in the convergence curves (not shown), which display only a 1/N convergence rate in the l₂ norm for a fixed number of elements; h-refinement is a more effective way to reduce the errors in the present case. In the following, we compare the performance of the different schemes using a single resolution, a 10×10 elemental partition with 6 points per element. Figure 11.21 compares the solution
[Figure panels: ε versus the number of points per element, N (left), and ε versus the total number of degrees of freedom, Ndf (right); the curve labels give the number of elements in each direction.]
Figure 11.20: Convergence curves in the L₂ norm for the Gaussian Hill initial condition using, from top to bottom, CGM and DGM. The labels indicate the number of elements in each direction. The abscissa on the left graphs represents the spectral truncation, and on the right the total number of collocation points.
Figure 11.21: Contour lines of the rotating cone problem after one revolution for CG (left), and DG
(right), using the 10×6 grid. The contours are irregularly spaced to highlight the Gibbs oscillations.
for the different schemes at the end of the simulations. We note that the contour levels are irregularly spaced and were chosen to highlight the presence of Gibbs oscillations around the 0-level contour.
For CG, the oscillations are present in the entire computational region and have peaks that reach −0.03. Although the DG solution also exhibits Gibbs oscillations, these oscillations are confined to the immediate neighborhood of the cone, and their largest amplitude is one third of that observed with CG. Further reduction of these oscillations requires some form of dissipation, e.g. a Laplacian, high order filters, or slope limiters. We observe that CG and DG show a similar decay in the peak amplitude of the cone, with DG doing a slightly better job at preserving the peak. Figure 11.22 shows the evolution of the largest negative T as a function of the grid resolution for CG and DG. We notice that the DG simulation produces smaller negative values, by up to a factor of 5, than CG at the same resolution.
Figure 11.22: −min(T) as a function of the number of elements K at the end of the simulation. The red lines are for DG and the blue lines for CG. The number of points per element is fixed at 5, 7 and 9 as indicated by the labels.
Chapter 12
Linear Analysis
(a) commutative: x + y = y + x
(b) associative: (x + y) + z = x + (y + z)
(c) neutral element: there exists a null vector 0 such that x + 0 = x
(d) for each vector x ∈ V there exists a vector y such that x + y = 0; we denote this vector by y = −x.
With the help of the norm we can now define the distance between 2 vectors x and y as ‖x − y‖, i.e. the norm of their difference. So 2 vectors are equal or identical if their distance is zero. Furthermore, we can now talk about the convergence of a vector sequence. Specifically, we say that a vector sequence x_n converges to x as n → ∞ if for any ε > 0 there is an N such that ‖x_n − x‖ < ε for all n > N.
1. conjugate symmetry: (x, y) = \overline{(y, x)}, where the overbar denotes the complex conjugate.
12.1.4 Basis
A set of vectors e_1, e_2, ..., e_N different from 0 is called a basis for the vector space V if it has the following 2 properties:

1. Linear independence:

∑_{i=1}^N α_i e_i = 0 ⇔ α_i = 0 ∀i   (12.2)
If at least one of the αi is non-zero, the set is called linearly dependent, and one of
the vectors can be written as a linear combination of the others.
The number of vectors needed to form a basis is called the dimension of the space V. Put another way, V is N-dimensional if it contains a set of N independent vectors but no set of (N + 1) independent vectors. If N independent vectors can be found for each N, no matter how large, we say that the vector space is infinite dimensional.
A basis is very useful since it allows us to describe any element x of the vector space. Thus any vector x can be “expanded” in the basis e_1, e_2, ..., e_N as:

x = ∑_{i=1}^N α_i e_i = α_1 e_1 + α_2 e_2 + ... + α_N e_N   (12.3)

This representation is also unique by the independence of the basis vectors. The question of course is how to find the coefficients α_i, which are nothing but the coordinates of x in the basis e_i. Taking the inner product of both sides of the expansion with each basis vector e_j leads to the following linear system of algebraic equations for the α_i:

∑_{i=1}^N α_i (e_i, e_j) = (x, e_j),   j = 1, 2, ..., N   (12.4)
The coupling between the equations makes it hard to compute the coordinates of a vector in a general basis, particularly for large N. Suppose however that the basis set is mutually orthogonal, that is, every vector e_i is orthogonal to every other vector e_j: (e_i, e_j) = 0 for all i ≠ j. Another way of denoting this mutual orthogonality is by using the Kronecker delta function δ_ij:

δ_ij = 1 if i = j, and δ_ij = 0 if i ≠ j.   (12.5)
The orthogonality property can then be written as (e_i, e_j) = δ_ij (e_j, e_j), and the basis is called orthogonal. For an orthogonal basis the system reduces to the uncoupled (diagonal) system
(e_j, e_j) α_j = (x, e_j)   (12.6)

α_j = (x, e_j)/(e_j, e_j) = (x, e_j)/‖e_j‖²   (12.7)
The basis set can be made orthonormal by normalizing the basis vectors (i.e. rescaling each e_j so that its norm is 1); then α_j = (x, e_j).
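For an orthogonal basis, each coordinate thus costs a pair of inner products. A minimal sketch (the function and argument names are ours), using the Fortran intrinsic dot_product for real vectors:

! Coordinate of x along the basis vector e: alpha = (x,e)/(e,e).
! For an orthonormal basis the denominator is 1 and alpha = (x,e).
real function coordinate(x, e, n)
   implicit none
   integer, intent(in) :: n
   real, intent(in) :: x(n), e(n)
   coordinate = dot_product(x, e)/dot_product(e, e)
end function coordinate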
‖x‖_p = ( ∑_{i=1}^N |ξ_i|^p )^{1/p}   (12.8)
A particularly useful norm is the 2-norm (p = 2), also called the Euclidean norm. Other useful norms are the 1-norm (p = 1) and the infinity norm:

‖x‖_∞ = max_{1≤i≤N} |ξ_i|   (12.9)
The inner product can be defined by:

(x, y) = ∑_{i=1}^N ξ_i η̄_i   (12.10)
It can be easily verified that this inner product satisfies all the needed requirements. Furthermore, the norm induced by this inner product is nothing but the 2-norm mentioned above.
An orthonormal basis for the vector space V is given by:

e_1 = (1, 0, 0, 0, ..., 0)
e_2 = (0, 1, 0, 0, ..., 0)
⋮
e_N = (0, 0, 0, 0, ..., 1)   (12.11)
Let us now represent two functions x(t) and y(t) on the interval [a, b] (i.e. in discrete space) by their pointwise values x_i and y_i at the points t_i, and let us define the discrete inner product as the Riemann-type sum:

(x, y) = (b − a)/N ∑_{i=1}^N x_i y_i   (12.12)
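A sketch of this discrete inner product (names assumed):

! Riemann-sum inner product (12.12) of two functions sampled at N points.
real function discrete_inner(x, y, n, a, b)
   implicit none
   integer, intent(in) :: n
   real, intent(in) :: x(n), y(n), a, b
   discrete_inner = (b - a)/real(n)*sum(x*y)
end function discrete_inner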
L²(a, b) is an example of a Hilbert space: a complete inner product space with the norm ‖x‖ = (x, x)^{1/2}.
The issue of defining a basis for a function space is complicated by the infinite dimension of the space. Assume I have an infinite set of linearly independent vectors; if I remove a single element of that set, the set is still infinite but clearly cannot generate the space. It turns out that it is possible to prove completeness, but we will defer the discussion until later. For the moment, we assume that it is possible to define such a
basis. Furthermore, this basis can be made to be orthogonal. The Legendre polynomials
Pn are an orthogonal set of functions over the interval [−1, 1], and the trigonometric
functions e^{inx} are orthogonal over [−π, π].
Suppose that an orthogonal and complete basis e_i(t) has been defined; then we can expand a vector in this basis:

x = ∑_{i=1}^∞ α_i e_i   (12.17)
The above expansion is referred to as a generalized Fourier series, and the α_i are the Fourier coefficients of x in the basis e_i. We can follow the procedure outlined for the finite-dimensional spaces to compute the α_i by taking the inner product of both sides of the expansion. The determination of the coordinates is, again, particularly easy if the basis is orthogonal:

α_i = (x, e_i)/(e_i, e_i)   (12.18)

In particular, if the basis is orthonormal, α_i = (x, e_i).
In the following we show that the Fourier coefficients are the best approximation to the function in the 2-norm, i.e. the coefficients α_i minimize the norm ‖x − ∑_i α_i e_i‖₂. We have:

‖x − ∑_i α_i e_i‖₂² = (x − ∑_i α_i e_i, x − ∑_i α_i e_i)   (12.19)
= (x, x) − ∑_i ᾱ_i (x, e_i) − ∑_i α_i (e_i, x) + ∑_i ∑_j α_i ᾱ_j (e_i, e_j)   (12.20)

The orthonormality of the basis functions can be used to simplify the last term on the right hand side to ∑_i |α_i|². If we furthermore define a_i = (x, e_i) we have:

‖x − ∑_i α_i e_i‖₂² = ‖x‖² + ∑_i [α_i ᾱ_i − a_i ᾱ_i − ā_i α_i]   (12.21)
= ‖x‖² + ∑_i [(a_i − α_i)(ā_i − ᾱ_i) − a_i ā_i]   (12.22)
= ‖x‖² + ∑_i |a_i − α_i|² − ∑_i |a_i|²   (12.23)
Note that since the first and last terms are fixed and the middle term is always greater than or equal to zero, the left hand side is minimized by the choice α_i = a_i = (x, e_i). The minimum norm has the value ‖x‖² − ∑_i |a_i|². Since this value must always be positive, then

∑_i |a_i|² ≤ ‖x‖²,   (12.24)

a result known as the Bessel inequality. If the basis set is complete, the minimum norm tends to zero as the number of basis functions increases to infinity, and we have Parseval's equality:

∑_i |a_i|² = ‖x‖².   (12.25)
Both functions belong to C(−1, 1) and are identical for all t except t = 0. If we use the 2-norm to judge the distance between the 2 functions, we get ‖x − y‖₂ = 0, and hence the functions are the same. In the maximum norm, however, the 2 functions are not identical, since ‖x − y‖_∞ = 1. This example makes it apparent that the choice of norm is critical in deciding whether 2 functions are the same or different; 2 functions that are considered identical in one norm can be different in another norm. This is only an apparent contradiction and reflects the fact that different norms measure different things. The 2-norm, for example, looks at the global picture and asks whether the 2 functions are the same over the whole interval; this is so-called mean-square convergence. The infinity norm, on the other hand, measures pointwise convergence.
where w(t) > 0 on a < t < b (this is a more general inner product than the one defined earlier, which corresponds to w(t) = 1). Let the operator L be defined as follows:

L y = (1/w(t)) d/dt[ p(t) dy/dt ] + r(t) y   (12.37)

with the boundary conditions

αy(a) + βy′(a) = 0,   γy(b) + δy′(b) = 0   (12.38)
Figure 12.1: Left: Fourier fit to the step function f = 1; the blue line is for N = 1, the black line for N = 7, and the red line for N = 15. Right: Fourier expansion of f = e^{sin πx}; the blue line is for N = 1, red for N = 2, and black for N = 3; the circles show the function f.
Table 12.1: Fourier coefficients A_m and B_m of the expansion of f = e^{sin πx}.

 m   A_m                           B_m
 0    0.000000000000000            0.532131755504017
 1    1.13031820798497             4.097072104153782×10^−17
 2    4.968164449421373×10^−18    −0.271495339534077
 3   −4.433684984866381×10^−02    −5.790930955238388×10^−17
 4   −3.861917048379140×10^−17     5.474240442093724×10^−03
 5    5.429263119140263×10^−04     2.457952426063564×10^−17
 6   −3.898111864745672×10^−16    −4.497732295430222×10^−05
 7   −3.198436462370136×10^−06     5.047870580187706×10^−17
 8    3.363168791013840×10^−16     1.992124806619515×10^−07
 9    1.103677177269183×10^−08    −1.091272736411650×10^−16
10   −4.748619297285671×10^−17    −5.505895970193617×10^−10
11   −2.497959777896152×10^−11     1.281761515677824×10^−16
and that the Fourier coefficients are given by f_m = 0 for even m, and f_m = 4/(mπ) for odd m. Figure 12.1 illustrates the convergence of the series as the number of Fourier functions retained, N, increases.
Example 16 The function f = e^{sin πx} is periodic over the interval [−1, 1] and can be expanded into a Fourier series of the following form:

f(x) = ∑_{m=0}^N (A_m cos mπx + B_m sin mπx)   (12.41)

Since the integrals cannot be evaluated analytically, we compute them numerically using a very high order method. The first few Fourier coefficients are listed in table 12.1. Notice in particular the rapid decrease of |A_m| and |B_m| as m increases, and the fact that with 3 Fourier modes the series expansion and the original function are visually identical.
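For periodic, smooth integrands the equal-weight trapezoidal rule over one full period is itself spectrally accurate, so a plain sum is a reasonable way to approximate the coefficient integrals. A minimal sketch (the normalization A_0 = ½∫f dx, A_m = ∫f cos mπx dx, B_m = ∫f sin mπx dx over [−1, 1] is a standard convention we assume here; the notes' exact normalization is in the omitted integrals):

program fourier_coef
   implicit none
   integer, parameter :: n = 512           ! number of quadrature points (assumed)
   integer :: m, j
   real(8), parameter :: pi = 3.14159265358979324d0
   real(8) :: x, h, am, bm, f
   h = 2.0d0/n                             ! grid spacing on [-1,1]
   do m = 0, 8
      am = 0.0d0
      bm = 0.0d0
      do j = 0, n-1                        ! periodic trapezoidal rule = equal-weight sum
         x = -1.0d0 + j*h
         f = exp(sin(pi*x))
         am = am + f*cos(m*pi*x)*h
         bm = bm + f*sin(m*pi*x)*h
      end do
      if (m == 0) am = 0.5d0*am            ! A_0 carries the extra factor of 1/2
      write(6,*) m, am, bm
   end do
end program fourier_coef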
Example 17 Let us take the example of the Laplace equation ∇²u = 0 defined on the rectangular domain 0 ≤ x ≤ a and 0 ≤ y ≤ b, and subject to the boundary conditions that u = 0 on all boundaries except the top boundary, where u(x, y = b) = v(x). Separation of variables assumes that the solution can be written as the product of functions that each depend on a single independent variable: u(x, y) = X(x)Y(y). When this trial solution is substituted in the PDE, we can derive the identity:

X_xx/X = −Y_yy/Y   (12.43)

Since the left hand side is a function of x only, the right hand side a function of y only, and the equality must hold for arbitrary x and y, the two ratios must be equal to a constant, which we set to −λ². We end up with the 2 equations

X_xx + λ²X = 0   (12.44)
Y_yy − λ²Y = 0   (12.45)
Notice that the 2 equations are in the form of a Sturm-Liouville problem. The solutions of the above 2 systems are the following set of functions:
where we have set E_n = B_n D_n. The last, unused boundary condition determines the constants E_n through the equation

v(x) = ∑_{n=1}^∞ E_n sin λ_n x sinh λ_n b   (12.51)
It is easy to show that the functions sin λ_n x are orthogonal over the interval 0 ≤ x ≤ a, i.e.

∫_0^a sin λ_n x sin λ_m x dx = (a/2) δ_nm   (12.52)
which leads to

E_n = (2/a) ∫_0^a sin λ_n x v(x) dx.   (12.53)
The procedure outlined above can be reinterpreted as follows: the sin λ_n x are the eigenfunctions of the Sturm-Liouville problem 12.44 and the eigenvalues are given by 12.49; the basis is complete and hence can be used to generate the solution as in equation 12.50; the coordinates of the solution in that basis are determined by 12.53. The eigenvalues and eigenfunctions depend on the partial differential equation, the boundary conditions, and the shape of the domain. The next few examples illustrate this dependence.
Example 18 The heat equation u_t = ν(u_xx + u_yy) in the same domain as example 17, subject to homogeneous Dirichlet boundary conditions on all sides, will generate the eigenvalue-eigenfunction pairs sin(mπx/a) in the x-direction and sin(nπy/b) in the y-direction. The solution can be written as the double series:

u(x, y, t) = ∑_{n=1}^∞ ∑_{m=1}^∞ A_mn sin(mπx/a) sin(nπy/b) e^{−α_mn t},   α_mn = ν( m²π²/a² + n²π²/b² )   (12.54)
T_tt + κ²c²T = 0   (12.55)
Θ_θθ + λ²Θ = 0   (12.56)
r²R_rr + rR_r + (κ²r² − λ²)R = 0   (12.57)
Since the domain is periodic in the azimuthal direction, we should expect a periodic solution, and hence λ must be an integer. The radial equation is nothing but the Bessel equation in the variable κr; its solutions are given by R = A_n J_n(κr) + B_n Y_n(κr). B_n = 0 must be imposed if the solution is to be finite at r = 0. The eigenvalues κ are determined by imposing a boundary condition at r = a. For homogeneous Dirichlet conditions, the eigenvalues are determined by the roots ξ_mn of the Bessel functions, J_n(ξ_mn) = 0, and hence κ_mn = ξ_mn/a. Note that κ_mn is the radial wavenumber. The solution can now be expanded as:

u(r, θ, t) = ∑_{n=0}^∞ ∑_{m=0}^∞ J_n(κ_mn r) [ (A_mn cos nθ + B_mn sin nθ) cos σ_mn t + (C_mn cos nθ + D_mn sin nθ) sin σ_mn t ]   (12.58)
where σ_mn = κ_mn c is the time frequency. The integration constants must be determined from the initial conditions, of which there must be 2 since we have a second derivative in time. In the present example the radial eigenfunctions are given by the Bessel functions of the first kind of order n, and the eigenvalues are determined by the roots of J_n. Notice that the Bessel equation is also a Sturm-Liouville problem, and hence the basis J_n(κ_mn r) must be complete and orthogonal. Periodicity in the azimuthal direction yields the trigonometric functions and quantizes the eigenvalues λ to the set of integers.
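The eigenvalues κ_mn = ξ_mn/a can be computed numerically by bracketing and bisecting the zeros of J_n. A sketch using the Fortran 2008 intrinsic bessel_jn (assumed available; substitute any Bessel routine if your compiler predates it); the bracket [2, 3] for the first zero of J_0 (ξ ≈ 2.4048) is chosen by inspection:

program bessel_root
   implicit none
   integer :: n, it
   real(8) :: xl, xr, xm
   n = 0                                   ! order of J_n
   xl = 2.0d0                              ! J_0(2) > 0
   xr = 3.0d0                              ! J_0(3) < 0, so a zero lies in [2,3]
   do it = 1, 60                           ! bisection: halve the bracket 60 times
      xm = 0.5d0*(xl + xr)
      if (bessel_jn(n, xl)*bessel_jn(n, xm) <= 0.0d0) then
         xr = xm                           ! the zero is in the left half
      else
         xl = xm                           ! the zero is in the right half
      end if
   end do
   write(6,*) 'first zero of J_0:', xm     ! kappa_10 = xm/a
end program bessel_root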
Chapter 13

Rudiments of Linear Algebra

1. positivity: ‖u‖ ≥ 0, and if ‖u‖ = 0, then u = 0. All norms are positive and only the null vector has 0 norm.
Here are some common matrix norms; some satisfy the compatibility condition and can be regarded as subordinate norms:

1. 1-norm: ‖L‖₁ = max_j ( ∑_{i=1}^N |l_ij| ). This is also referred to as the maximum column sum.

2. ∞-norm: ‖L‖_∞ = max_i ( ∑_{j=1}^N |l_ij| ). This is also referred to as the maximum row sum.

3. 2-norm: ‖L‖₂ = √(ρ(Lᵀ L)), where ρ is the spectral radius (see below). If the matrix L is symmetric, Lᵀ = L, then ‖L‖₂ = √(ρ(L²)) = ρ(L).
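The 1- and ∞-norms reduce to column and row sums and are straightforward to code. A sketch (names assumed):

! Maximum column sum (1-norm) and maximum row sum (infinity-norm) of A.
subroutine matrix_norms(a, n, anorm1, anorminf)
   implicit none
   integer, intent(in) :: n
   real, intent(in) :: a(n,n)
   real, intent(out) :: anorm1, anorminf
   integer :: i, j
   anorm1 = 0.0
   anorminf = 0.0
   do j = 1, n
      anorm1 = max(anorm1, sum(abs(a(:,j))))     ! column sums
   end do
   do i = 1, n
      anorminf = max(anorminf, sum(abs(a(i,:)))) ! row sums
   end do
end subroutine matrix_norms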
L u = λ u,   u ≠ 0   (13.5)

• The matrix L² = LL has the same eigenvectors as L and has the eigenvalues λ_i².

• If the inverse L⁻¹ of the matrix exists, then its eigenvectors are the same as those of L and its eigenvalues are 1/λ_i.
The spectral radius is a lower bound for all compatible matrix norms. The proof is simple. Let the matrix L have the eigenvalues λ_i and eigenvectors u_i. Then

|λ_i| ‖u_i‖ = ‖λ_i u_i‖ = ‖L u_i‖ ≤ ‖L‖ ‖u_i‖

and hence |λ_i| ≤ ‖L‖. This result holds for all eigenvalues, and in particular for the eigenvalue with the largest magnitude, i.e. the spectral radius:

ρ(L) ≤ ‖L‖

The above result holds for all compatible matrix norms. For the case of the 1- or ∞-norms, we have Gershgorin's first theorem: the spectral radius is less than the largest sum of the absolute values of the row or column entries, namely ρ(L) ≤ ‖L‖₁ and ρ(L) ≤ ‖L‖_∞.
Gershgorin's second theorem puts a limit on where the eigenvalues of a matrix can be found in the complex plane. Each eigenvalue lies within at least one of the circles centered at the diagonal entries l_ii of the matrix and of radius R_i:

|λ − l_ii| ≤ ∑_{j=1, j≠i}^N |l_ij| = |l_i,1| + |l_i,2| + ... + |l_i,i−1| + |l_i,i+1| + ... + |l_i,N|   (13.9)

for i = 1, 2, ..., N.
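A sketch (names assumed) that prints the Gershgorin disc of each row; every eigenvalue of A lies in the union of these discs:

! Center and radius of each Gershgorin disc of the matrix A.
subroutine gershgorin_discs(a, n)
   implicit none
   integer, intent(in) :: n
   real, intent(in) :: a(n,n)
   integer :: i
   real :: radius
   do i = 1, n
      radius = sum(abs(a(i,:))) - abs(a(i,i))  ! off-diagonal row sum
      write(6,*) 'disc', i, 'center', a(i,i), 'radius', radius
   end do
end subroutine gershgorin_discs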
For periodic partial differential equations, the tridiagonal system is often of the circulant form

    ⎡ a  b           c ⎤
    ⎢ c  a  b          ⎥
L = ⎢    c  a  b       ⎥   (13.13)
    ⎢       ⋱  ⋱  ⋱    ⎥
    ⎢          c  a  b ⎥
    ⎣ b           c  a ⎦

with eigenvalues

λ_i = a + (b + c) cos( 2π(i−1)/N ) + i(c − b) sin( 2π(i−1)/N ),   i = 1, ..., N   (13.14)

(here the i multiplying (c − b) is the imaginary unit).
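Formula 13.14 is cheap to evaluate directly. The sketch below tabulates the eigenvalues for the (assumed) sample values a = 2, b = c = −1, i.e. the periodic second-difference operator, for which the imaginary part vanishes and λ_i = 2 − 2 cos(2π(i−1)/N):

program circulant_eig
   implicit none
   integer, parameter :: nn = 8            ! matrix size (assumed)
   integer :: i
   real(8), parameter :: pi = 3.14159265358979324d0
   real(8) :: a, b, c, theta
   complex(8) :: lambda
   a = 2.0d0
   b = -1.0d0
   c = -1.0d0
   do i = 1, nn
      theta = 2.0d0*pi*(i-1)/nn
      lambda = a + (b + c)*cos(theta) + (0.0d0,1.0d0)*(c - b)*sin(theta)
      write(6,*) i, lambda
   end do
end program circulant_eig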
A final useful result is the following: if a real tridiagonal matrix has either all its off-diagonal elements positive or all its off-diagonal elements negative, then all its eigenvalues are real.
Chapter 14
Fourier series
In this chapter we explore the issues arising in expressing functions as Fourier series.
where the overbar denotes the complex conjugate. The Fourier series of the function u is defined as:

S u = ∑_{k=−∞}^∞ û_k φ_k.   (14.3)

It represents the formal expansion of u in the Fourier orthogonal system. The Fourier coefficients û_k are:

û_k = (1/2π) ∫_{−π}^π u(x) e^{−ikx} dx.   (14.4)
It is also possible to re-write the Fourier series in terms of trigonometric functions by using the identities:

cos θ = (e^{iθ} + e^{−iθ})/2,   sin θ = (e^{iθ} − e^{−iθ})/(2i).   (14.5)

The Fourier series becomes:

S u = a_0 + ∑_{k=1}^∞ (a_k cos kx + b_k sin kx)   (14.6)
The Fourier coefficients of the trigonometric series are related to those of the complex exponential series by

û_k = a_{|k|} − i b_{|k|}.   (14.7)
If u is a real-valued function, a_k and b_k are real numbers, and û_{−k} is the complex conjugate of û_k. Often it is unnecessary to use the full Fourier expansion. If u is an even function, i.e. u(−x) = u(x), then all the sine coefficients b_k are zero and the series becomes what is called a cosine series. Likewise, if the function u is odd, u(−x) = −u(x), the cosine coefficients a_k are zero and the expansion becomes a sine series.
ũ_n = (1/N) ∑_{j=0}^{N−1} u_j e^{−inx_j},   n = −N/2, −N/2 + 1, ..., N/2 − 1.   (14.8)

Notice that ũ_{±N/2} = (1/N) ∑_{j=0}^{N−1} u_j e^{∓i(N/2)(2πj/N)} = (1/N) ∑_{j=0}^{N−1} (−1)^j u_j, and so ũ_{N/2} = ũ_{−N/2}. The inversion of the definition can be done easily: multiply equation 14.8 by e^{inx_k} and sum the resulting series over n to obtain:
∑_{n=0}^{N−1} ũ_n e^{inx_k} = ∑_{n=0}^{N−1} [ (1/N) ∑_{j=0}^{N−1} u_j e^{in(x_k−x_j)} ] = (1/N) ∑_{j=0}^{N−1} u_j ∑_{n=0}^{N−1} e^{in(x_k−x_j)}   (14.9)
The last sum can be written as a geometric series with factor e^{i(x_k−x_j)}. Its sum can then be expressed analytically as:

∑_{n=0}^{N−1} e^{in(x_k−x_j)} = 1 + e^{i(x_k−x_j)} + e^{i2(x_k−x_j)} + ... + e^{i(N−1)(x_k−x_j)}   (14.10)
= (1 − e^{iN(x_k−x_j)}) / (1 − e^{i(x_k−x_j)}) = (1 − e^{i2π(k−j)}) / (1 − e^{i2π(k−j)/N})   (14.11)
There are two cases to consider: if k = j then all the terms in the series are equal to 1 and hence sum up to N; if k ≠ j then the numerator in the last fraction is equal to 0. Thus we can write ∑_{n=0}^{N−1} e^{in(x_k−x_j)} = N δ_jk. The set of functions e^{inx} is said to be discretely orthogonal. This property can be substituted in equation 14.9 to obtain the inversion formula:

∑_{n=0}^{N−1} ũ_n e^{inx_k} = ∑_{j=0}^{N−1} u_j δ_jk = u_k   (14.12)
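The transform pair 14.8/14.12 can be checked with a direct O(N²) summation. The sketch below (names assumed; a production code would use an FFT library such as FFTPACK) transforms a sample and inverts it back:

program dft_demo
   implicit none
   integer, parameter :: n = 16
   integer :: j, m
   real(8), parameter :: pi = 3.14159265358979324d0
   complex(8) :: u(0:n-1), ut(0:n-1), v(0:n-1)
   do j = 0, n-1                           ! sample a test function at x_j = 2 pi j/N
      u(j) = cos(3.0d0*2.0d0*pi*j/n)
   end do
   do m = 0, n-1                           ! forward transform, eq. 14.8 (0..N-1 ordering)
      ut(m) = (0.0d0, 0.0d0)
      do j = 0, n-1
         ut(m) = ut(m) + u(j)*exp(cmplx(0.0d0, -2.0d0*pi*m*j/n, 8))
      end do
      ut(m) = ut(m)/n
   end do
   do j = 0, n-1                           ! inverse transform, eq. 14.12: recovers u
      v(j) = (0.0d0, 0.0d0)
      do m = 0, n-1
         v(j) = v(j) + ut(m)*exp(cmplx(0.0d0, 2.0d0*pi*m*j/n, 8))
      end do
   end do
   write(6,*) 'max inversion error:', maxval(abs(v - u))
end program dft_demo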
Here, and unlike the integral form, the wavenumbers k are not continuous but are quantized on account of periodicity. If the domain is 0 ≤ x ≤ a then the wavenumbers are given by

k_n = 2πn/a,   n = 0, 1, 2, 3, ...   (14.14)
In the discrete case the domain is divided into an equally-spaced set of points x_j = j∆x with ∆x = a/N, where N + 1 is the number of points. The discrete Fourier series then takes the form

u(x_j) = u_j = ∑_{n=−N_max}^{N_max} û_n e^{−ik_n x_j}   (14.15)

where N_max is the maximum wave mode that can be represented on the discrete grid. Note that k_n x_j = (2πn/a)(ja/N) = 2πnj/N. Since the smallest resolvable wavelength is 2∆x, the maximum wavenumber is

k_max = 2π/(2∆x) = 2πN_max/a,   so that   N_max = a/(2∆x) = N/2   (14.16)
Furthermore we have that

e^{−ik_{±N_max} x_j} = e^{∓i(2π/N)(N/2)j} = e^{∓iπj} = (−1)^j   (14.17)

Hence the two waves are identical and it is enough to retain the amplitude of only one of them.
The discrete Fourier series can then take the form:

u(x_j) = u_j = ∑_{n=−N/2}^{N/2−1} û_n e^{−i2πnj/N}   (14.18)
We now have parity between the N degrees of freedom in physical space, u_j, and the N degrees of freedom in Fourier space. Further manipulation can reduce the above expression to the standard form presented earlier:
u_j = ∑_{n=−N/2}^{N/2−1} û_n e^{−i2πnj/N}   (14.19)
= ∑_{n=−N/2}^{−1} û_n e^{−i2πnj/N} + ∑_{n=0}^{N/2−1} û_n e^{−i2πnj/N}   (14.20)
= ∑_{n=−N/2}^{−1} û_n e^{−i2πnj/N} e^{−i2πj} + ∑_{n=0}^{N/2−1} û_n e^{−i2πnj/N}   (14.21)
= ∑_{n=−N/2}^{−1} û_n e^{−i2π(n+N)j/N} + ∑_{n=0}^{N/2−1} û_n e^{−i2πnj/N}   (14.22)
= ∑_{n=N/2}^{N−1} û_{n−N} e^{−i2πnj/N} + ∑_{n=0}^{N/2−1} û_n e^{−i2πnj/N}   (14.23)
= ∑_{n=0}^{N−1} ũ_n e^{−i2πnj/N}   (14.24)

where we used e^{−i2πj} = 1 for integer j in 14.21, and shifted the summation index by N in 14.23.
where the new tilded coefficients are related to the old (hatted) coefficients by

ũ_n = û_n        for 0 ≤ n ≤ N/2 − 1,
ũ_n = û_{n−N}    for N/2 ≤ n ≤ N − 1.   (14.25)
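Equation 14.25 is the reordering typically performed by FFT libraries, which return the coefficients in the order n = 0, 1, ..., N − 1. A sketch of the map (names assumed):

! Reorder centered coefficients uhat(-N/2:N/2-1) into FFT ordering
! utilde(0:N-1), following eq. 14.25.
subroutine center_to_fft(uhat, utilde, n)
   implicit none
   integer, intent(in) :: n
   complex(8), intent(in) :: uhat(-n/2:n/2-1)
   complex(8), intent(out) :: utilde(0:n-1)
   integer :: k
   do k = 0, n/2-1
      utilde(k) = uhat(k)      ! non-negative wavenumbers are unchanged
   end do
   do k = n/2, n-1
      utilde(k) = uhat(k-n)    ! negative wavenumbers wrap into the top half
   end do
end subroutine center_to_fft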
Sine transforms
For problems that have homogeneous Dirichlet boundary conditions imposed, one ex-
pands the solution in terms of sine functions:
u_j = ∑_{n=1}^{N−1} û_n sin(nπx_j/L) = ∑_{n=1}^{N−1} û_n sin(nπj∆x/L) = ∑_{n=1}^{N−1} û_n sin(nπj/N)   (14.26)
It is easy to verify that u_0 = u_N = 0 no matter the values of û_n. To derive the inversion formula, multiply the above sum by sin(mπj/N) and sum over j to get:
∑_{j=1}^{N−1} u_j sin(mπj/N) = ∑_{j=1}^{N−1} ( ∑_{n=1}^{N−1} û_n sin(nπj/N) ) sin(mπj/N)
= ∑_{n=1}^{N−1} û_n ∑_{j=1}^{N−1} sin(nπj/N) sin(mπj/N)
= ∑_{n=1}^{N−1} (û_n/2) ∑_{j=1}^{N−1} [ cos((n−m)πj/N) − cos((n+m)πj/N) ]
= ∑_{n=1}^{N−1} (û_n/4) ∑_{j=1}^{N−1} [ e^{i(n−m)πj/N} + e^{−i(n−m)πj/N} − e^{i(n+m)πj/N} − e^{−i(n+m)πj/N} ]   (14.27)
The inner sums are geometric series of the form

S_k = r + r² + ... + r^k = r(1 + r + ... + r^{k−1}) = r (r^k − 1)/(r − 1) = (r^{k+1} − r)/(r − 1) = (r^{k+1} − 1)/(r − 1) − 1   (14.28)
with r = e^{ipπ/N} and p = ±(n ± m). Note also that S(0) = N − 1. Furthermore we have that

S(p) + S(−p) = [(−1)^p − e^{ipπ/N}] / [e^{ipπ/N} − 1] + [(−1)^p − e^{−ipπ/N}] / [e^{−ipπ/N} − 1]   (14.30)
= [ (−1)^p (2 cos(pπ/N) − 2) − (2 − 2 cos(pπ/N)) ] / [ 2 − 2 cos(pπ/N) ]   (14.31)
= −1 − (−1)^p   (14.32)
since n ± m is even if n and m are both odd or both even, and n ± m is odd if one is odd and the other even. For the case where m = n we have

û_n = (2/N) ∑_{j=1}^{N−1} u_j sin(πnj/N)   (14.36)
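The pair 14.26/14.36 can be verified directly. A sketch (names assumed) that transforms a grid function vanishing at both ends and inverts it back:

program dst_demo
   implicit none
   integer, parameter :: n = 16
   integer :: j, m
   real(8), parameter :: pi = 3.14159265358979324d0
   real(8) :: u(0:n), uhat(1:n-1), v(0:n)
   u = 0.0d0
   do j = 1, n-1                        ! test data; u(0) = u(n) = 0 automatically
      u(j) = sin(2.0d0*pi*j/n) + 0.5d0*sin(3.0d0*pi*j/n)
   end do
   do m = 1, n-1                        ! forward sine transform, eq. 14.36
      uhat(m) = 0.0d0
      do j = 1, n-1
         uhat(m) = uhat(m) + u(j)*sin(pi*m*j/n)
      end do
      uhat(m) = 2.0d0*uhat(m)/n
   end do
   v = 0.0d0
   do j = 0, n                          ! inverse, eq. 14.26: recovers u
      do m = 1, n-1
         v(j) = v(j) + uhat(m)*sin(pi*m*j/n)
      end do
   end do
   write(6,*) 'max inversion error:', maxval(abs(v - u))
end program dst_demo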
Chapter 15
Programming Tips
15.1 Introduction
The implementation of numerical algorithms requires familiarity with a number of software packages and utilities. Here is a short list of the minimum required to get started in programming:
1. Basics of the Operating System, such as file manipulation. The RSMAS library has UNIX: the basics. The library also has a number of electronic books on the subject. Two titles I came across are: Unix Unleashed, and Learning the Unix Operating System: Nutshell Handbook.
2. Text editor to write the computer program. Most Unix books would have a short
tutorial on using either vi or emacs for editing text files. There are a number of
simpler visual editors too such as jed. Web sites for vi or its close cousin vim are:
• http://www.asu.edu/it/fyi/dst/helpdocs/editing/vi/
This is actually a very concise and fast introduction to vi. Start with it and
then go to the other web sites for more in-depth information.
• http://docs.freebsd.org/44doc/usd/12.vi/paper.html
• http://www.vim.org/
• http://www.math.utah.edu/lab/unix/emacs.html
This seems like a good and brief introduction so that you can be editing files
with simple commands.
• http://cmgm.stanford.edu/classes/unix.emacs.html
• http://www.lib.chicago.edu/keith/tclcourse/emacs-tutorial.html
• http://www.xemacs.org/
This is the Graphical User Interface version of emacs. It is much like notepad in WINDOWS in that the commands can be entered visually.
4. Compiler to turn the program into machine instructions. Section 15.3 discusses
compiler issues and how to use the compiler options in helping you to track errors
in the coding.
7. Various mathematical libraries such as LAPACK for linear algebra routines and
FFTPACK for Fast Fourier Transform.
program waves                        ! reconstructed opening (the closing line below names the program)
implicit none                        ! every variable must be declared
integer, parameter :: M = 21         ! number of grid points: dx = 0.1 on [-1,1], matching the output below
integer :: i                         ! loop counter
real, parameter :: xmin = -1.0, xmax = 1.0 ! interval endpoints
real :: f(M)                         ! array holding the second function
real :: pi,x,y,dx
!.End of Variable Declaration
write(6,*) 'Hello World'             ! first line of the expected terminal output
open(9,file='waves.out')             ! unit 9 goes to the file loaded in matlab below
pi = 2.0*asin(1.0) ! initialize pi
dx = (xmax-xmin)/real(M-1) ! grid-size
do i = 1,M ! counter: starts at 1, increments by 1 and ends at M
   !...indent statements within loop for clarity
   x = (i-1)*dx + xmin ! location on interval
   y = sin(pi*x) ! compute function 1
   f(i) = cos(pi*x) ! compute function 2
   write(6,*) x,y ! write two columns to terminal
   write(9,*) x,y,f(i) ! write three columns to file
enddo ! end of do loop must be marked
write(6,*)'Done'
stop
end program waves
!
!
! Compiling the program and creating an executable called waves:
! $ f90 waves.f90 -o waves
! If "-o waves" is omitted the executable will be called a.out
! by default. The fortran 90 compiler (f90) may have a different name
! on your system. Possible alternatives are pgf90 (Portland Group
! compiler), ifc (Intel Fortran Compiler), and xlf90 (on IBMs).
!
!
! Running the program
! $ waves
240 CHAPTER 15. PROGRAMMING TIPS
!
!
! Expected Terminal output is:
! Hello World
! -1.000000 8.7422777E-08
! -0.9000000 -0.3090170
! -0.8000000 -0.5877852
! -0.7000000 -0.8090170
! -0.6000000 -0.9510565
! -0.5000000 -1.000000
! -0.4000000 -0.9510565
! -0.3000000 -0.8090171
! -0.2000000 -0.5877852
! -9.9999964E-02 -0.3090169
! 0.0000000E+00 0.0000000E+00
! 0.1000000 0.3090171
! 0.2000000 0.5877854
! 0.3000001 0.8090171
! 0.4000000 0.9510565
! 0.5000000 1.000000
! 0.6000000 0.9510565
! 0.7000000 0.8090169
! 0.8000001 0.5877850
! 0.9000000 0.3090170
! 1.000000 -8.7422777E-08
! Done
!
!
! Visualizing results with matlab:
! $ matlab
!> z = load(’waves.out’); % read file into array z
!> size(z) % get dimensions of z
!> plot(z(:,1),z(:,2),’k’) % plot second column versus first in black
!> hold on; % add additional lines
!> plot(z(:,1),z(:,3),’r’) % plot third column versus first in red
!> xlabel(’x’); % add labels to x-axis.
!> ylabel(’f(x)’); % add labels to y-axis.
!> title(’sine and cosine curves’); % add title to plot
!> legend(’sin’,’cos’,0) % add legend
!> print -depsc waves % save results to a color
!> % encapsulated postscript file
!> % called waves.eps. The extension
!> % eps will be added automatically.
! Viewing the postscript file:
! $ ghostscript waves.eps
15.3. DEBUGGING AND VALIDATION 241
!
! Printing the file to color printer
! $ lpr -Pmpocol waves.eps
• Do not use implicitly declared types. In fortran, variables whose names start with the letters i,j,k,l,m,n are integers by default and all others are reals by default. Use the statement implicit none in your code to force the compiler to list every undeclared variable and to catch mistyped variables. It is also possible to do this using compiler options (see the manual pages for the specific options; on UNIX machines it is usually -u).

• Make sure every variable is initialized properly before using it, and do not assume that the compiler does it automatically. Some compilers will allow you to initialize variables to a bogus value, a NaN (short for Not a Number), so that the code trips if that variable is used in an operation before the NaN is overwritten.
• Use the include header.h statement to make sure that common blocks are identical across program units. The common declaration would thus go into a separate file (header.h) which can subsequently be included in other subroutines without retyping it.

• Check that the argument list of a call statement matches in type and number the argument list of the subroutine declaration.
1. IMPLICIT NONE
2. Explicit interfaces
3. INTENT attributes
4. PRIVATE attributes for module-wide data that should not be accessible to the outside (a short module sketch illustrating these features follows)
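A minimal module sketch (hypothetical names) showing the four items above in use:

module grid_mod
   implicit none        ! item 1: no implicit typing anywhere in the module
   private              ! item 4: everything is hidden unless explicitly made public
   public :: interp
   real, save :: dx = 0.1  ! module-wide data, invisible to outside code
contains
   ! item 2: any routine that "use"s this module gets an explicit
   ! interface for interp, so argument lists are compiler-checked.
   subroutine interp(u, v, n)
      integer, intent(in)  :: n       ! item 3: intent documents and enforces usage
      real,    intent(in)  :: u(n)    ! read-only argument
      real,    intent(out) :: v(n)    ! write-only argument
      v = u                           ! placeholder body
   end subroutine interp
end module grid_mod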
• Use list files to catch compiler reports. When compiling, the compiler rapidly throws a list of errors at you; being out of context they are hard to understand, and even harder to remember when you return to the editor. Using two windows, one for editing and one for debugging, is helpful. You can also ask the compiler to generate a LIST FILE that you can look at (or print) in a separate window.

• Use modules and interfaces to double check the argument lists of calling subroutines.
1. Do not optimize the code in the first run. Rather, compile it with a debugging option (usually -g) to produce traceback information. The program can then let you know the statement number that caused the fatal error.

2. Do array bound checking. The code will crash if you are trying to access memory beyond that available for an array. The common flag for this is -C but it changes from compiler to compiler. This flag will slow down the performance of the code. However, your first concern should be a correct code rather than a fast code.

3. Some floating point operations can result in a NaN. The code should stop then and issue an error report. Various switches trap different floating point exceptions. You want to catch divisions by zero, overflows (where the number is too large to be represented in the machine precision), and underflows (similar but for very small numbers). Underflows are not as problematic as overflows and can usually be ignored. Check the manual for the right compiler switches.

4. Test routines individually to check if they are working according to their specification. Try first “trivial” cases where you know the answers to check the results you get.
The first rule of debugging is staying cool: treat the misbehaving program as an intellectual exercise, or as detective work.

1. You got some strange results that made you think there is a bug. Think again: are you sure they are not the correct output for some special input?

2. If you are not sure what causes the bug, DON'T try semi-random code modifications; that seldom works. Your aim should be to gather as much information as possible!

3. If you have a modular program, where each part does a clearly defined task, properly placed 'debug statements' can ISOLATE the malfunctioning procedure/code-section.

4. If you are familiar with a debugger use it, but be careful not to be carried away by the many options and start playing.
• Arithmetic
• Miscellaneous
• General
Chapter 16

Debuggers
A debugger is a very useful tool for developing code, and for finding and fixing the bugs introduced during code development. Most compiler providers include a debugger as part of their software bundle. Since we are using the Portland Group compiler for the class, the discussion here will center primarily on its debugger, even though a lot of the information applies equally to other compiler/debugger systems. Check the manual for the specific compiler/debugger in case of problems.
247
248 CHAPTER 16. DEBUGGERS
5. -Mchkstk: Check the stack for available space upon entry to and before the start
of a parallel region. Useful when many private variables are declared.
2. list 10,40 will list lines 10 to 40; list without an argument will list lines 10 to 20 from the current statement.
3. stop at xyz will put a breakpoint at line xyz. Execution will stop at the line so
the user can examine variables.
8. step will execute the next source line, including stepping into a function or sub-
routine.
9. stepi will execute a single instruction (as opposed to the entire source line); stepi <count> will execute <count> instructions.
10. display lists the expressions being printed at breakpoints. display <exp1>,<exp2> prints <exp1> and <exp2> at every breakpoint.
11. next will cause the debugger to skip over a function or subroutine, i.e. execute it in its entirety.
12. continue will cause the execution to resume from the point it stopped to the next
breakpoint.