
Lecture Notes Mathematics (LN)

NUMERICAL LINEAR ALGEBRA

Rolf Rannacher
Institute of Applied Mathematics
Heidelberg University

[Cover: Ax = b,  AᵀAx = Aᵀb,  Ax = λx]

Heidelberg University Publishing
About the author

Rolf Rannacher, retired professor of Numerical Mathematics at Heidelberg University;
study of Mathematics at the University of Frankfurt/Main, doctorate 1974, post-
doctorate (Habilitation) 1978 at Bonn University; 1979/1980 Vis. Assoc. Professor
at the University of Michigan (Ann Arbor, USA), thereafter Professor at Erlangen
and Saarbrücken, in Heidelberg since 1988; field of interest “Numerics of Partial
Differential Equations”, especially the “Finite Element Method” and its applications
in the Natural Sciences and Engineering; author of numerous scientific publications.

Bibliographic information published by the Deutsche Nationalbibliothek


The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

This book is published under the Creative Commons License 4.0


(CC BY-SA 4.0). The cover is subject to the Creative Commons
License CC-BY-ND 4.0.

The electronic, open access version of this work is permanently available


on Heidelberg University Publishing’s website: http://heiup.uni-heidelberg.de.
urn: urn:nbn:de:bsz:16-heiup-book-407-3
doi: https://doi.org/10.17885/heiup.407

Text © 2018, Rolf Rannacher

ISSN 2566-4816 (PDF)


ISSN 2512-4455 (Print)

ISBN 978-3-946054-99-3 (PDF)


ISBN 978-3-947732-00-5 (Softcover)
Contents

0 Introduction 1
0.1 Basic notation of Linear Algebra and Analysis . . . . . . . . . . . . . . . . 1
0.2 Linear algebraic systems and eigenvalue problems . . . . . . . . . . . . . . 2
0.3 Numerical approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.4 Applications and origin of problems . . . . . . . . . . . . . . . . . . . . . . 4
0.4.1 Gaussian equalization calculus . . . . . . . . . . . . . . . . . . . . . 4
0.4.2 Discretization of elliptic PDEs . . . . . . . . . . . . . . . . . . . . . 6
0.4.3 Hydrodynamic stability analysis . . . . . . . . . . . . . . . . . . . . 10

1 Linear Algebraic Systems and Eigenvalue Problems 13


1.1 The normed Euclidean space Kn . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.1 Vector norms and scalar products . . . . . . . . . . . . . . . . . . . 13
1.1.2 Linear mappings and matrices . . . . . . . . . . . . . . . . . . . . . 22
1.1.3 Non-quadratic linear systems . . . . . . . . . . . . . . . . . . . . . 26
1.1.4 Eigenvalues and eigenvectors . . . . . . . . . . . . . . . . . . . . . . 28
1.1.5 Similarity transformations . . . . . . . . . . . . . . . . . . . . . . . 31
1.1.6 Matrix analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.2 Spectra and pseudo-spectra of matrices . . . . . . . . . . . . . . . . . . . . 37
1.2.1 Stability of dynamical systems . . . . . . . . . . . . . . . . . . . . . 37
1.2.2 Pseudo-spectrum of a matrix . . . . . . . . . . . . . . . . . . . . . 41
1.3 Perturbation theory and conditioning . . . . . . . . . . . . . . . . . . . . . 45
1.3.1 Conditioning of linear algebraic systems . . . . . . . . . . . . . . . 46
1.3.2 Conditioning of eigenvalue problems . . . . . . . . . . . . . . . . . 48
1.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

2 Direct Solution Methods 55


2.1 Gaussian elimination, LR and Cholesky decomposition . . . . . . . . . . . 55
2.1.1 Gaussian elimination and LR decomposition . . . . . . . . . . . . . 55
2.1.2 Accuracy improvement by defect correction . . . . . . . . . . . . . 64
2.1.3 Inverse computation and the Gauß-Jordan algorithm . . . . . . . . 66
2.2 Special matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


2.2.1 Band matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


2.2.2 Diagonally dominant matrices . . . . . . . . . . . . . . . . . . . . . 72
2.2.3 Positive definite matrices . . . . . . . . . . . . . . . . . . . . . . . . 73
2.3 Irregular linear systems and QR decomposition . . . . . . . . . . . . . . . 76
2.3.1 Householder algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.4 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.5 “Direct” determination of eigenvalues . . . . . . . . . . . . . . . . . . . . . 88
2.5.1 Reduction methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.5.2 Hyman’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.5.3 Sturm’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

3 Iterative Methods for Linear Algebraic Systems 99


3.1 Fixed-point iteration and defect correction . . . . . . . . . . . . . . . . . . 99
3.1.1 Stopping criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.1.2 Construction of iterative methods . . . . . . . . . . . . . . . . . . . 105
3.1.3 Jacobi- and Gauß-Seidel methods . . . . . . . . . . . . . . . . . . . 108
3.2 Acceleration methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.2.1 SOR method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.2.2 Chebyshev acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.3 Descent methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.3.1 Gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.3.2 Conjugate gradient method (CG method) . . . . . . . . . . . . . . 130
3.3.3 Generalized CG methods and Krylov space methods . . . . . . . . . 136
3.3.4 Preconditioning (PCG methods) . . . . . . . . . . . . . . . . . . . . 138
3.4 A model problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

4 Iterative Methods for Eigenvalue Problems 153


4.1 Methods for the partial eigenvalue problem . . . . . . . . . . . . . . . . . . 153
4.1.1 The “Power Method” . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.1.2 The “Inverse Iteration” . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.2 Methods for the full eigenvalue problem . . . . . . . . . . . . . . . . . . . . 159

4.2.1 The LR and QR method . . . . . . . . . . . . . . . . . . . . . . . . 159


4.2.2 Computation of the singular value decomposition . . . . . . . . . . 164
4.3 Krylov space methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.3.1 Lanczos and Arnoldi method . . . . . . . . . . . . . . . . . . . . . . 167
4.3.2 Computation of the pseudo-spectrum . . . . . . . . . . . . . . . . . 172
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

5 Multigrid Methods 187


5.1 Multigrid methods for linear systems . . . . . . . . . . . . . . . . . . . . . 187
5.1.1 Multigrid methods in the “finite element” context . . . . . . . . . . 188
5.1.2 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.2 Multigrid methods for eigenvalue problems (a short review) . . . . . . . . . 202
5.2.1 Direct multigrid approach . . . . . . . . . . . . . . . . . . . . . . . 203
5.2.2 Accelerated Arnoldi and Lanczos method . . . . . . . . . . . . . . . 204
5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.3.1 General exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

A Solutions of exercises 209


A.1 Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
A.2 Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
A.3 Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
A.4 Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
A.5 Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
A.5.1 Solutions for the general exercises . . . . . . . . . . . . . . . . . . . 242

Bibliography 246

Index 250
0 Introduction
The subject of this course is numerical algorithms for solving problems in Linear Algebra,
such as linear algebraic systems and corresponding matrix eigenvalue problems. The
emphasis is on iterative methods suitable for large-scale problems arising, e. g., in the
discretization of partial differential equations and in network problems.

0.1 Basic notation of Linear Algebra and Analysis

At first, we introduce some standard notation in the context of (finite dimensional) vector
spaces of functions and their derivatives. Let K denote the field of real or complex
numbers R or C , respectively. Accordingly, for n ∈ N , let Kn denote the n-dimensional
vector space of n-tuples x = (x1 , . . . , xn ) with components xi ∈ K, i = 1, . . . , n . For
these addition and scalar multiplication are defined by:

x + y := (x1 + y1 , . . . , xn + yn ), αx := (αx1 , . . . , αxn ), α ∈ K.

The elements x ∈ Kn are, depending on the suitable interpretation, addressed as “points”


or “vectors” . Here, one may imagine x as the end point of a vector attached at the origin
of the chosen Cartesian1 coordinate system and the components xi as its “coordinates”
with respect to this coordinate system. In general, we consider vectors as “column vec-
tors”. Within the “vector calculus” its row version is written as (x1 , . . . , xn )T . The
null (or zero) vector (0, . . . , 0) may also be briefly written as 0 . Usually, we prefer
this coordinate-oriented notation over a coordinate-free notation because of its greater
clearness. A set of vectors {a1 , . . . , ak } in Kn is called “linearly independent” if


    Σ_{i=1}^{k} c_i a^i = 0,   c_i ∈ K   ⇒   c_i = 0,  i = 1, . . . , k.

Such a set of k = n linearly independent vectors is called a “basis” of Kn , which spans


all of Kn , i. e., each element x ∈ Kn can be (uniquely) written as a linear combination
of the form
    x = Σ_{i=1}^{n} c_i a^i ,   c_i ∈ K.

Each (finite dimensional) vector space, such as Kn , possesses a basis. The special “Carte-
sian basis” {e1 , . . . , en } is formed by the “Cartesian unit vectors” ei := (δ1i , . . . , δni ) ,
δii = 1 and δij = 0 for i ≠ j, being the usual Kronecker symbol. The elements of this
basis are mutually orthonormal, i. e., with respect to the Euclidian scalar product, there
holds (e^i, e^j)_2 := Σ_{k=1}^{n} e^i_k e^j_k = δij . “Matrices” A ∈ Kn×n are two-dimensional square
arrays of numbers from K written in the form A = (aij )ni,j=1 , where the first index, i ,

1
René Descartes (1596–1650): French mathematician and philosopher (“(ego) cogito ergo sum”);
worked in the Netherlands and later in Stockholm; first to recognize the close relation between geometry
and arithmetic and founded analytic geometry.


refers to the row and the second one, j , to the column (counted from the left upper corner
of the array) at which the element aij is positioned. Usually, matrices are square arrays,
but in some situations also rectangular matrices may occur. The set of (square) matri-
ces forms a vector space with addition and scalar multiplication defined in the natural
elementwise sense,

A = (aij )ni,j=1, B = (bij )ni,j=1, c ∈ K ⇒ cA + B = (caij + bij )ni,j=1 .

For matrices and vectors natural multiplications are defined by



    Ax = ( Σ_{k=1}^{n} a_ik x_k )_{i=1,…,n} ∈ Kn ,      AB = ( Σ_{k=1}^{n} a_ik b_kj )_{i,j=1,…,n} ∈ Kn×n .

Matrices are used to represent linear mappings in Kd with respect to a given basis, mostly
a Cartesian basis, ϕ(x) = Ax . By ĀT = (aTij )ni,j=1 , we denote the conjugate “transpose”
of a matrix A = (aij )ni,j=1 ∈ Kn×n with the elements aTij = āji . For matrices A, B ∈ Kn×n
there holds (AB)T = B T AT . Matrices for which A = ĀT are called “symmetric” in the
case K = R and “Hermitian” in the case K = C.

0.2 Linear algebraic systems and eigenvalue problems

Let A be an m × n-matrix and b an m-vector,


    A = (a_jk)_{j,k=1}^{m,n} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} ,
    \qquad
    b = (b_j)_{j=1}^{m} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} .

We seek an n-vector x = (xk )k=1,...,n such that

    a11 x1 + a12 x2 + · · · + a1n xn = b1
                        ⋮                                                      (0.2.1)
    am1 x1 + am2 x2 + · · · + amn xn = bm

or written in short as Ax = b . This is called a “linear system” (of equations). It is


called “underdetermined” for m < n , “quadratic” for m = n, and “overdetermined” for
m > n . The linear system is solvable if and only if rank(A) = rank([A, b]) (rank(A) =
number of linearly independent columns of A ) with the composed matrix
    [A, b] = \begin{pmatrix} a_{11} & \cdots & a_{1n} & b_1 \\ \vdots & & \vdots & \vdots \\ a_{m1} & \cdots & a_{mn} & b_m \end{pmatrix} .

In the “quadratic” case the solvability of the system (0.2.1) is equivalent to any one of
the following properties of the coefficient matrix A ∈ Kn×n :

- Ax = 0 implies x = 0 .
- rank(A) = n .
- det(A) ≠ 0 .
- All eigenvalues of A are nonzero.

A number λ ∈ C is called “eigenvalue” of the (quadratic) matrix A ∈ Kn×n if there


exists a corresponding vector w ∈ Kn \ {0}, called ”eigenvector”, such that

Aw = λw. (0.2.2)

Eigenvalues are just the zeros of the characteristic polynomial χA (z) := det(A − zI)
of A , so that by the fundamental theorem of Algebra each n × n-matrix has exactly
n eigenvalues counted according to their (algebraic) multiplicities. The corresponding
eigenvectors span linear subspaces of Kn called “eigenspaces”.
Eigenvalue problems play an important role in many problems from science and engi-
neering, e. g., they represent energy levels in physical models (e. g., Schrödinger equation
in Quantum Mechanics) or determine the stability or instability of solutions of dynamical
systems (e. g., Navier-Stokes equations in hydrodynamics).

0.3 Numerical approaches

We will mainly consider numerical methods for solving quadratic linear systems and asso-
ciated eigenvalue problems. The emphasis will be on medium- and large-scale problems,
i. e., problems of dimension n ≈ 10⁴ − 10⁹ , which at the upper end impose particularly
strong requirements on the algorithms with respect to storage and work efficiency. Prob-
lems of that size usually involve matrices with special structure such as “band structure”
and/or extreme “sparsity”, i. e., only very few matrix elements in each row are non-zero.
Most of the classical methods, which have originally been designed for “full” but smaller
matrices, cannot be realistically applied to such large problems. Therefore, modern meth-
ods extensively exploit the particular sparsity structure of the matrices. These methods
split into two classes, “direct methods” and “iterative methods”.
Definition 0.1: A “direct” method for the solution of a linear system Ax = b is an
algorithm, which (neglecting round-off errors) delivers the exact solution x in finitely
many arithmetic steps. “Gaussian elimination” is a typical example of such a “direct
method”. In contrast to that an “iterative method” constructs a sequence of approximate
solutions {xt }t∈N , which only in the limit t → ∞ converge to the exact solution, i. e.,
lim_{t→∞} x^t = x. “Richardson iteration” or, more generally, fixed-point methods of similar
kind are typical examples of such “iterative methods”. In analyzing a direct method, we are
mainly interested in the work count, i. e., the asymptotic number of arithmetic operations
needed for achieving the final result depending on the problem size, e. g., O(n3 ) , while

in an iterative method, we look at the work count needed for one iteration step and the
number of iteration steps for reducing the initial error by a certain fixed factor, e. g., 10⁻¹,
or the asymptotic speed of convergence (“linear”, “quadratic”, etc.).

However, there is no sharp separation between the two classes of “direct” or “iterative”
methods as many theoretically “direct” methods are actually used in “iterative” form in
practice. A typical method of this type is the classical “conjugate gradient (CG) method”,
which in principle is a direct method (after n iteration steps) but is usually terminated
like an iterative method already after m ≪ n steps.
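For illustration, the following minimal sketch (assuming NumPy; the matrix, right-hand side, damping parameter and iteration count are arbitrary illustrative choices, not taken from these notes) contrasts a direct solve with a simple damped Richardson iteration x^{t+1} = x^t + θ(b − Ax^t):

import numpy as np

# Illustrative symmetric positive definite test matrix and right-hand side.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

# "Direct" solution (Gaussian elimination inside numpy.linalg.solve).
x_direct = np.linalg.solve(A, b)

# "Iterative" solution: damped Richardson iteration x_{t+1} = x_t + theta*(b - A x_t).
theta = 1.0 / np.linalg.norm(A, 2)   # damping parameter, here 1/(largest singular value)
x = np.zeros_like(b)
for t in range(200):
    defect = b - A @ x
    if np.linalg.norm(defect) < 1e-12:
        break
    x = x + theta * defect

print(x_direct, x)   # both approximate the exact solution

For this small, well-conditioned example both approaches give essentially the same result; the point of the iterative variant is that each step only needs a matrix-vector product, which is cheap for sparse matrices.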

0.4 Applications and origin of problems

We present some applications from which large linear algebra problems originate. This
illustrates what the various possible structures of matrices may look like. Thereby, we have
to deal with scalar or vector-valued functions u = u(x) ∈ Kn for arguments x ∈ Kn . For
derivatives of differentiable functions, we use the notation

    ∂_x u := ∂u/∂x ,   ∂_x² u := ∂²u/∂x² ,  . . . ,   ∂_i u := ∂u/∂x_i ,   ∂_{ij}² u := ∂²u/(∂x_i ∂x_j) ,  . . . ,

and analogously also for higher-order derivatives. With the nabla operator ∇ the “gra-
dient” of a scalar function and the “divergence” of a vector function are written as
grad u = ∇u := (∂1 u, ..., ∂d u)T and div u = ∇ · u := ∂1 u1 + ... + ∂d ud , respectively.
For a vector β ∈ Rd the derivative in direction β is written as ∂β u := β · ∇u . Combi-
nation of gradient and divergence yields the so-called “Laplacian operator”

∇ · ∇u = Δu = ∂12 u + ... + ∂d2 u.

The symbol ∇m u denotes the “tensor” of all partial derivatives of order m of u , i. e., in
two dimensions u = u(x1 , x2 ) , ∇2 u = (∂1i ∂2j u)i+j=2.

0.4.1 Gaussian equalization calculus

A classical application in Astronomy is the Gaussian equalization calculus (method of


least error-squares): For given functions u1 , . . . , un and points (xj , yj ) ∈ R2 , j =
1, . . . , m , m > n, a linear combination

    u(x) = Σ_{k=1}^{n} c_k u_k(x)

is to be determined such that the “mean deviation”


    Δ_2 := ( Σ_{j=1}^{m} |u(x_j) − y_j|² )^{1/2}

becomes minimal. (The “Chebyshev2 equalization problem” in which the “maximal de-
viation” Δ∞ := maxj=1,...,m |u(xj ) − yj | is minimized poses much more severe difficulties
and is therefore used only for smaller n.) For the solution of the Gaussian equalization
problem, we set y := (y1 , . . . , ym ), c := (c1 , . . . , cn ) and

ak := (uk (x1 ), . . . , uk (xm )) , k = 1, . . . , n , A ≡ [a1 , . . . , an ].

Using this notation, now the quadratic functional



    F(c) = ( Σ_{j=1}^{m} |(Ac − y)_j|² )^{1/2}

is to be minimized with respect to c ∈ Rn . This is equivalent to solving the overde-


termined linear system Ac = y in the sense of finding a vector c with minimal mean
error-squares, i. e., with minimal “defect”. In case that rank(A) = n this “minimal-defect
solution” c is determined by the so-called “normal equation”

AT Ac = AT y, (0.4.3)

a linear n × n-system with a positive definite (and hence regular) coefficient matrix AT A.
In the particular case of polynomial fitting, i. e., uk (x) = xk−1 , the “optimal” solution


    u(x) = Σ_{k=1}^{n} c_k x^{k−1}
is called “Gaussian equalization parabola” for the points (x_j, y_j), j = 1, . . . , m . Because of
the regularity of the “Vandermondian3 determinant”

    det \begin{pmatrix} 1 & x_1 & \cdots & x_1^{n-1} \\ 1 & x_2 & \cdots & x_2^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^{n-1} \end{pmatrix} = \prod_{j,k=1,\, j<k}^{n} (x_k − x_j) ≠ 0,

for mutually distinct points xj there holds rank(A) = n , i. e., the equalization parabola
is uniquely determined.
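As an illustration, the following sketch (assuming NumPy; the data points and the number n of basis functions are arbitrary illustrative choices) assembles the matrix A column-wise from the monomial basis u_k(x) = x^{k−1} and solves the normal equation (0.4.3):

import numpy as np

# Least-squares fit of u(x) = c_1 + c_2 x + ... + c_n x^{n-1} to m > n data points.
xj = np.linspace(0.0, 1.0, 7)          # points x_1, ..., x_m (mutually distinct)
yj = np.sin(np.pi * xj)                # measured values y_j (here sampled from a smooth function)
n = 3                                  # number of basis functions u_k(x) = x^{k-1}

# Columns a^k = (u_k(x_1), ..., u_k(x_m)): a Vandermonde-type matrix of rank n.
A = np.vander(xj, N=n, increasing=True)    # shape (m, n)

# Normal equation A^T A c = A^T y with the positive definite matrix A^T A.
c = np.linalg.solve(A.T @ A, A.T @ yj)
print(c)                               # coefficients of the equalization parabola

For larger n the matrix AᵀA becomes ill-conditioned, which is why in practice one often prefers a QR-based solver (cf. Section 2.3) instead of forming AᵀA explicitly.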

2
Pafnuty Lvovich Chebyshev (1821–1894): Russian mathematician; prof. in St. Petersburg; contribu-
tions to number theory, probability theory and especially to approximation theory; developed the general
theory of orthogonal polynomials.
3
Alexandre-Théophile Vandermonde (1735–1796): French mathematician; gifted musician, came late
to mathematics and published here only four papers (nevertheless member of the Academy of Sciences
in Paris); contributions to theory of determinants and combinatorial problem (curiously enough the
determinant called after him does not appear explicitly in his papers).

0.4.2 Discretization of elliptic PDEs

The numerical solution of partial differential equations requires an appropriate “discretiza-


tion” of the differential operator, e. g., by a “difference approximation” of the derivatives.
Consider, for example, the “first boundary value problem of the Laplacian4 operator”,

Lu := −Δu = f in Ω, u = g on ∂Ω, (0.4.4)

posed on a domain Ω ⊂ Rn with boundary ∂Ω . Here, for a given (continuous) right-hand


side function f = f (x1 , x2 ) and boundary function g = g(x1 , x2 ) a function u = u(x1 , x2 )
is to be determined, which is twice differentiable on Ω and continuous on Ω̄, such that
(0.4.4) holds. The region Ω , e. g., the unit square, is covered by an equidistant Cartesian
mesh Ωh with “mesh boundary” ∂Ωh . The mesh points P ∈ Ωh may be numbered
row-wise.

[Figure 1: Finite difference mesh; “interior” mesh points numbered row-wise, mesh width h = 1/(m+1), n = m² “interior” mesh points.]

At “interior” mesh points P ∈ Ωh the differential operators in x1 - and x2 -direction are


approximated by second-order central difference quotients, which act on mesh functions
uh (P ) . This results in “difference equations” of the form

    L_h u_h(P) := Σ_{Q∈N(P)} σ(P, Q) u_h(Q) = f_h(P) ,     P ∈ Ω_h ,           (0.4.5)
    u_h(P) = g_h(P) ,     P ∈ ∂Ω_h ,                                            (0.4.6)

with certain mesh neighborhoods N(P ) ⊂ Ω∪∂Ωh of points P ∈ Ωh and approximations


fh (·) to f and gh (·) to g . We set the coefficients σ(P, Q) := 0 for points Q ∉ N(P).
The considered difference operator based on second-order central difference quotients for
approximating second derivatives is called “5-point difference operator” since it uses 5
points (Accordingly, its three-dimensional analogue is called “7-point difference opera-

4
Pierre Simon Marquis de Laplace (1749–1827): French mathematician and astronomer; prof. in Paris;
founded among other fields probability calculus.

tor”). Then, for P ∈ Ωh there holds


 
    Σ_{Q∈Ω_h} σ(P, Q) u_h(Q) = f_h(P) − Σ_{Q∈∂Ω_h} σ(P, Q) g_h(Q).             (0.4.7)

For any numbering of the mesh points in Ωh and ∂Ωh , Ωh = {Pi , i = 1, ..., n} , ∂Ωh =
{Pi , i = n+1, ..., n+m}, we obtain a quadratic linear system for the vector of approximate
mesh values U = (Ui )N i=1 , Ui := uh (Pi ) .

AU = F, (0.4.8)

with A = (a_ij)_{i,j=1}^{n} , F = (b_j)_{j=1}^{n} , where

    a_ij := σ(P_i, P_j) ,     b_j := f_h(P_j) − Σ_{i=n+1}^{n+m} σ(P_j, P_i) g_h(P_i).

In the considered special case of the unit square and row-wise numbering of the interior
mesh points Ωh the 5-point difference approximation of the Laplacian yields the following
sparse matrix of dimension n = m2 :
    A = \frac{1}{h^2} \begin{pmatrix}
            B_m  & -I_m &      &        \\
            -I_m & B_m  & -I_m &        \\
                 & -I_m & B_m  & \ddots \\
                 &      & \ddots & \ddots
        \end{pmatrix} \Bigg\}\, n ,
    \qquad
    B_m = \begin{pmatrix}
            4  & -1 &    &        \\
            -1 & 4  & -1 &        \\
               & -1 & 4  & \ddots \\
               &    & \ddots & \ddots
        \end{pmatrix} \Bigg\}\, m ,

where Im is the m×m-unit matrix. The matrix A is a very sparse band matrix with
half-band width m , symmetric and (irreducibly) diagonally dominant. This implies that
it is regular and positive definite. In three dimensions the corresponding matrix has di-
mension n = m³ and half-band width m² and shares all the other mentioned properties
of its two-dimensional analogue. In practice, n ≈ 10⁴ up to n ≈ 10⁷ in three dimensions.
If problem (0.4.4) is only part of a larger mathematical model involving complex domains
and several physical quantities such as (in chemically reacting flow models) velocity, pres-
sure, density, temperature and chemical species, the dimension of the complete system
may reach up to n ≈ 10⁷ − 10⁹ .
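The block structure described above can be reproduced directly with Kronecker products. The following sketch (assuming NumPy; a small dense matrix is built only for illustration, whereas in practice a sparse storage format would be used) assembles A for the unit square with row-wise numbering:

import numpy as np

# 5-point difference matrix A of (0.4.8) on the unit square, row-wise numbering,
# n = m^2 interior points (small m chosen only for illustration).
m = 4
h = 1.0 / (m + 1)

I = np.eye(m)
# Tridiagonal block B_m = tridiag(-1, 4, -1).
B = 4.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
# Block tridiagonal matrix (1/h^2) * tridiag(-I_m, B_m, -I_m) via Kronecker products.
A = (np.kron(I, B) - np.kron(np.eye(m, k=1), I) - np.kron(np.eye(m, k=-1), I)) / h**2

# A has at most 5 non-zero entries per row, is symmetric and positive definite:
print(A.shape)                             # (m*m, m*m)
print(np.allclose(A, A.T))                 # True
print(np.all(np.linalg.eigvalsh(A) > 0))   # True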
To estimate a realistic size of the algebraic problem oriented by the needs of a practical
application, we consider the above model problem (Poisson equation on the unit square)
with an adjusted right-hand side and boundary function such that the exact solution is
given by u(x, y) = sin(πx) sin(πy) ,

    −Δu = 2π²u =: f  in Ω,     u = 0  on ∂Ω.                                   (0.4.9)

For this setting the error analysis of the difference approximation yields the estimate

    max_{Ω_h} |u − u_h| ≈ (1/24) d_Ω² M_4(u) h² ≈ 8 h² ,                        (0.4.10)

where M_4(u) = max_{Ω̄} |∇⁴u| ≈ π⁴ (see the lecture notes Rannacher [3]). In order to
guarantee a relative error below TOL = 10⁻³ , we have to choose h ≈ 10⁻² corresponding
to n ≈ 10⁴ in two and n ≈ 10⁶ in three dimensions. The concrete structure of the matrix
A depends on the numbering of mesh points used:
i) Row-wise numbering: The lexicographical ordering of mesh points leads to a band
matrix with band width 2m + 1. The sparsity within the band would be largely reduced
by Gaussian elimination (so-called “fill-in”).

[Figure 2: Lexicographical (row-wise) ordering of the mesh points.]

ii) Diagonal numbering: The successive numbering diagonally to the Cartesian coor-
dinate directions leads to a band matrix with less band volume. This results in less fill-in
within Gaussian elimination.

[Figure 3: Diagonal mesh-point numbering.]

iii) Checker-board numbering: The staggered row-wise and column-wise numbering


leads to a 2 × 2-block matrix with diagonal main blocks and band width 2m + 1 ≈ h−1 .

[Figure 4: Checkerboard mesh-point numbering.]

For large linear systems of dimension n > 10⁵ direct methods such as Gaussian elim-
ination are difficult to realize since they are generally very storage and work demanding.
For a matrix of dimension n = 10⁶ and band width m = 10² Gaussian elimination re-
quires already about 10⁸ storage places. This is particularly undesirable if also the band
is sparse as in the above example with at most 5 non-zero elements per row. In this
case those iterative methods are more attractive, in which essentially only matrix-vector
multiplications occur with matrices of similar sparsity pattern as that of A .
As illustrative examples, we consider simple fixed-point iterations for solving a linear
system Ax = b with a regular n×n-coefficient matrix. The system is rewritten as

    a_jj x_j + Σ_{k=1, k≠j}^{n} a_jk x_k = b_j ,     j = 1, . . . , n.

If a_jj ≠ 0 , this is equivalent to

    x_j = (1/a_jj) ( b_j − Σ_{k=1, k≠j}^{n} a_jk x_k ) ,     j = 1, . . . , n.

Then, the so-called “Jacobi method” generates iterates xt ∈ Rn , t = 1, 2, . . . , by succes-


sively solving
    x_j^t = (1/a_jj) ( b_j − Σ_{k=1, k≠j}^{n} a_jk x_k^{t−1} ) ,     j = 1, . . . , n.          (0.4.11)

When computing xtj the preceding components xtr , r < j, are already known. Hence, in
order to accelerate the convergence of the method, one may use this new information in
the computation of xtj . This idea leads to the “Gauß-Seidel5 method”:

    x_j^t = (1/a_jj) ( b_j − Σ_{k<j} a_jk x_k^t − Σ_{k>j} a_jk x_k^{t−1} ) ,     j = 1, . . . , n.          (0.4.12)
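A minimal sketch of one sweep of each method (assuming NumPy; the function names and the small diagonally dominant test system are illustrative choices, not part of these notes):

import numpy as np

def jacobi_step(A, b, x):
    # One Jacobi sweep (0.4.11): uses only values of the previous iterate.
    D = np.diag(A)
    return (b - (A @ x - D * x)) / D

def gauss_seidel_step(A, b, x):
    # One Gauss-Seidel sweep (0.4.12): new components are used as soon as computed.
    x = x.copy()
    n = len(b)
    for j in range(n):
        s = A[j, :j] @ x[:j] + A[j, j+1:] @ x[j+1:]
        x[j] = (b[j] - s) / A[j, j]
    return x

# Small diagonally dominant test system (arbitrary illustrative choice).
A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x_j = np.zeros(3)
x_gs = np.zeros(3)
for t in range(50):
    x_j = jacobi_step(A, b, x_j)
    x_gs = gauss_seidel_step(A, b, x_gs)
print(np.linalg.norm(b - A @ x_j), np.linalg.norm(b - A @ x_gs))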

5
Philipp Ludwig von Seidel (1821–1896): German mathematician; Prof. in Munich; contributions to
analysis (method of least error-squares) and celestial mechanics and astronomy.

The Gauß-Seidel method has the same arithmetic complexity as the Jacobi method but
under certain conditions (satisfied in the above model situation) it converges twice as fast.
However, though very simple and maximal storage economical, both methods, Jacobi as
well as Gauß-Seidel, are by far too slow in practical applications. Much more efficient iter-
ative methods are the Krylov-space methods. The best known examples are the classical
“conjugate gradient method” (“CG method”) of Hestenes and Stiefel for solving linear
systems with positive definite matrices and the “Arnoldi method” for solving correspond-
ing eigenvalue problems. Iterative methods with minimal complexity can be constructed
using multi-scale concepts (e. g., geometric or algebraic “multigrid methods”). The latter
type of methods will be discussed below.

0.4.3 Hydrodynamic stability analysis

Another origin of large-scale eigenvalue problems is hydrodynamic stability analysis. Let


{v̂, p̂} be a solution (the “base flow”) of the stationary Navier-Stokes equation

    − νΔv̂ + v̂ · ∇v̂ + ∇p̂ = 0 ,   ∇ · v̂ = 0 ,   in Ω,
                                                                               (0.4.13)
    v̂|_{Γ_rigid} = 0 ,   v̂|_{Γ_in} = v^in ,   ν∂_n v̂ − p̂n|_{Γ_out} = P ,   ν∂_n v̂ − p̂n|_{Γ_Q} = q ,

where v̂ is the velocity vector field of the flow, p̂ its hydrostatic pressure, ν the kinematic
viscosity (for normalized density ρ ≡ 1), and q the control pressure. The flow is driven
by a prescribed flow velocity v in at the Dirichlet (inflow) boundary (at the left end), a
prescribed mean pressure P at the Neumann (outflow) boundary (at the right end) and
the mean pressure q at the control boundary ΓQ . The (artificial) “free outflow” (also
called “do nothing”) boundary condition in (0.4.13) has proven successful especially in
modeling pipe flow since it is satisfied by Poiseuille flow (see Heywood et al. [42]).

[Figure 5: Configuration of the flow control problem: channel with inflow boundary Γin, obstacle S, outflow boundary Γout, and control boundaries ΓQ.]

Fig. 5 shows the configuration of a channel flow around an obstacle controlled by pressure
prescription at ΓQ , and Figure 6 the computational mesh and streamline plots of two
flows for different Reynolds numbers and control values, one stable and one unstable.

Figure 6: Computational mesh (top), uncontrolled stable (middle) and controlled unstable
(bottom) stationary channel flow around an obstacle.

For deciding whether these base flows are stable or unstable, within the usual linearized
stability analysis, one investigates the following eigenvalue problem corresponding to the
Navier-Stokes operator linearized about the considered base flow:

− νΔv + v̂ · ∇v + v · ∇v̂ + ∇q = λv, ∇ · v = 0, in Ω,


(0.4.14)
v|Γrigid ∪Γin = 0, ν∂n v − qn|Γout ∪ΓQ = 0.

From the location of the eigenvalues in the complex plane, one can draw the following
conclusion: If an eigenvalue λ ∈ C of (0.4.14) has Re λ < 0 , the base flow is unstable,
otherwise it is said to be “linearly stable”. This means that the solution of the linearized
nonstationary perturbation problem

∂t w − νΔw + v̂ · ∇w + w · ∇v̂ + ∇q = 0, ∇ · w = 0, in Ω,
(0.4.15)
w|Γrigid ∪Γin = 0, ν∂n w − qn|Γout ∪ΓQ = 0

corresponding to an initial perturbation w|t=0 = w0 satisfies a bound

    sup_{t≥0} ‖w(t)‖ ≤ A ‖w_0‖ ,                                               (0.4.16)

with some constant A ≥ 1 . After discretization, the eigenvalue problem (0.4.14) in func-
tion space is translated into a nonsymmetric algebraic eigenvalue problem, which is
usually of high dimension n ≈ 10⁵ − 10⁶ . Therefore its solution can be achieved only by
iterative methods.
However, “linear stability” does not guarantee full “nonlinear stability” due to effects
caused by the “non-normality” of the operator governing problem (0.4.14), which may
cause the constant A to become large. This is related to the possible “deficiency” (dis-
crepancy of geometric and algebraic multiplicity) or a large “pseudo-spectrum” (range

of large resolvent norm) of the critical eigenvalue. This effect is commonly accepted as
explanation of the discrepancy in the stability properties of simple base flows such as
Couette flow and Poiseuille flow predicted by linear eigenvalue-based stability analysis
and experimental observation (see, e. g., Trefethen & Embree [22] and the literature cited
therein).
1 Linear Algebraic Systems and Eigenvalue Problems
In this chapter, we introduce the basic notation and facts about the normed real or
complex vector spaces Kn of n-dimensional vectors and Kn×n of corresponding n × n-
matrices. The emphasis is on square matrices as representations of linear mappings in
Kn and their spectral properties.

1.1 The normed Euclidean space Kn

1.1.1 Vector norms and scalar products

We recall some basic topological properties of the finite dimensional “normed” (vector)
space Kn , where depending on the concrete situation K = R (real space) or K = C
(complex space). In the following each point x ∈ Kn is expressed by its canonical coor-
dinate representation x = (x1 , . . . , xn ) in terms of a (fixed) Cartesian basis {e1 , . . . , en }
of Kn ,
    x = Σ_{i=1}^{n} x_i e^i .

Definition 1.1: A mapping ‖·‖ : Kn → R is a “(vector) norm” if it has the following
properties:

(N1) Definiteness: ‖x‖ ≥ 0, ‖x‖ = 0 ⇒ x = 0, x ∈ Kn .

(N2) Homogeneity: ‖αx‖ = |α| ‖x‖, α ∈ K, x ∈ Kn .

(N3) Triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖, x, y ∈ Kn .

The notion of a “norm” can be defined on any vector space V over K , finite or infinite
dimensional. The resulting pair {V, ‖·‖} is called “normed space”.

Remark 1.1: The property ‖x‖ ≥ 0 is a consequence of the other conditions. With (N2),
we obtain ‖0‖ = 0 and then with (N3) and (N2): 0 = ‖x − x‖ ≤ ‖x‖ + ‖−x‖ = 2‖x‖ .
With the help of (N3) we obtain the useful inequality

    | ‖x‖ − ‖y‖ | ≤ ‖x − y‖ ,     x, y ∈ Kn .                                   (1.1.1)

Example 1.1: The standard example of a vector norm is the “Euclidian norm”

    ‖x‖_2 := ( Σ_{i=1}^{n} |x_i|² )^{1/2} .


The first two norm properties, (N1) and (N2), are obvious, while the triangle inequality
is a special case of the “Minkowski inequality” provided below in Lemma 1.4. Other
examples of useful norms are the “maximum norm” (or “l∞ norm”) and the “l1 norm”


    ‖x‖_∞ := max_{i=1,...,n} |x_i| ,        ‖x‖_1 := Σ_{i=1}^{n} |x_i| .

The norm properties of ‖·‖_∞ and ‖·‖_1 are immediate consequences of the corresponding
properties of the modulus function. Between l1 norm and maximum norm there are the
so-called “lp norms” for 1 < p < ∞ :

    ‖x‖_p := ( Σ_{i=1}^{n} |x_i|^p )^{1/p} .

Again the first two norm properties, (N1) and (N2), are obvious and the triangle inequality
is the Minkowski inequality provided in Lemma 1.4, below.
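For a concrete vector the various norms can be compared directly; the following minimal sketch (assuming NumPy; the vector is an arbitrary illustrative choice) computes the l1, l2, l3 and maximum norms and checks one of the elementary comparison estimates:

import numpy as np

# A few l^p norms of the same vector (arbitrary illustrative choice).
x = np.array([3.0, -4.0, 1.0, 0.5])

norm_1   = np.linalg.norm(x, 1)          # sum of moduli
norm_2   = np.linalg.norm(x, 2)          # Euclidian norm
norm_3   = np.sum(np.abs(x)**3)**(1/3)   # l^3 norm, computed directly from the definition
norm_inf = np.linalg.norm(x, np.inf)     # maximum norm

# Elementary comparison: ||x||_inf <= ||x||_2 <= sqrt(n) * ||x||_inf
# (a special case of the norm equivalence discussed below).
assert norm_inf <= norm_2 <= np.sqrt(len(x)) * norm_inf
print(norm_1, norm_2, norm_3, norm_inf)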

With the aid of a norm ‖·‖ the “distance” d(x, x′) := ‖x − x′‖ of two vectors in
Kn is defined. This allows the definition of the usual topological terms “open”, “closed”,
“compact”, “diameter”, and “neighborhood” for point sets in Kn in analogy to the cor-
responding situation in K . We use the maximum norm ‖·‖_∞ in the following discussion,
but we will see later that this is independent of the chosen norm. For any a ∈ Kn and
r > 0 , we use the ball
    K_r(a) := {x ∈ Kn : ‖x − a‖_∞ < r}
as standard neighborhood of a with radius r . This neighborhood is “open” since for
each point x ∈ K_r(a) there exists a neighborhood K_δ(x) ⊂ K_r(a) ; accordingly the
complement K_r(a)^c is “closed”. The “closure” of K_r(a) is defined by K̄_r(a) := K_r(a) ∪
∂K_r(a) with the “boundary” ∂K_r(a) = {x ∈ Kn : ‖x − a‖_∞ = r} of K_r(a).

Definition 1.2: A sequence of vectors (xk )k∈N in Kn is called


- “bounded” if all its elements are contained in a ball K_R(0) , i. e., ‖x^k‖_∞ < R, k ∈ N,
- “Cauchy sequence” if for each ε ∈ R_+ there is an N_ε ∈ N such that ‖x^k − x^l‖_∞ < ε
for k, l ≥ N_ε ,
- “convergent” towards an x ∈ Kn if ‖x^k − x‖_∞ → 0 (k → ∞).

For a convergent sequence (xk )k∈N , we also write limk→∞ xk = x or xk → x (k → ∞).


Geometrically this means that any standard neighborhood Kε (x) of x contains almost all
(i. e., all but finitely many) of the elements xk . This notion of “convergence” is obviously
equivalent to the componentwise convergence:

    ‖x^k − x‖_∞ → 0 (k → ∞)   ⇔   x_i^k → x_i (k → ∞) ,  i = 1, . . . , n.

This allows the reduction of the convergence of sequences of vectors in Kn to that of


sequences of numbers in K . As basic results, we obtain n-dimensional versions of the
Cauchy criterion for convergence and the theorem of Bolzano-Weierstraß.

Theorem 1.1 (Theorems of Cauchy and Bolzano-Weierstraß):


i) Each Cauchy sequence in Kn is convergent, i. e., the normed space (Kn, ‖·‖_∞) is
complete (a so-called “Banach space”).
ii) Each bounded sequence in Kn contains a convergent subsequence.

Proof. i) For any Cauchy sequence (x^k)_{k∈N} , in view of |x_i| ≤ ‖x‖_∞ , i = 1, . . . , n , for
x ∈ Kn , also the component sequences (x_i^k)_{k∈N} , i = 1, . . . , n, are Cauchy sequences in
K and therefore converge to limits x_i ∈ K . Then, the vector x := (x_1 , . . . , x_n) ∈ Kn is
the limit of the vector sequence (x^k)_{k∈N} with respect to the maximum norm.
ii) For any bounded vector sequence (x^k)_{k∈N} the component sequences (x_i^k)_{k∈N} , i =
1, . . . , n, are likewise bounded. By successively applying the theorem of Bolzano-Weierstraß
in K , in the first step, we obtain a convergent subsequence (x_1^{k_{1j}})_{j∈N} of (x_1^k)_{k∈N} with
x_1^{k_{1j}} → x_1 (j → ∞) , in the next step a convergent subsequence (x_2^{k_{2j}})_{j∈N} of (x_2^{k_{1j}})_{j∈N}
with x_2^{k_{2j}} → x_2 (j → ∞), and so on. After n selection steps, we eventually obtain a sub-
sequence (x^{k_{nj}})_{j∈N} of (x^k)_{k∈N} , for which all component sequences (x_i^{k_{nj}})_{j∈N} , i = 1, . . . , n,
converge. Then, with the limit values x_i ∈ K, we set x := (x_1 , . . . , x_n) ∈ Kn and have
the convergence x^{k_{nj}} → x (j → ∞) . Q.E.D.
The following important result states that on the (finite dimensional) vector space Kn
the notion of convergence, induced by any norm ‖·‖ , is equivalent to the convergence
with respect to the maximum norm, i. e., to the componentwise convergence.

Theorem 1.2 (Equivalence of norms): All norms on the finite dimensional vector
space Kn are equivalent to the maximum norm, i. e., for each norm ‖·‖ there are
positive constants m, M such that

    m ‖x‖_∞ ≤ ‖x‖ ≤ M ‖x‖_∞ ,     x ∈ Kn .                                     (1.1.2)

Proof. Let ‖·‖ be a vector norm. For any vector x = Σ_{i=1}^{n} x_i e^i ∈ Kn there holds

    ‖x‖ ≤ Σ_{k=1}^{n} |x_k| ‖e^k‖ ≤ M ‖x‖_∞ ,     M := Σ_{k=1}^{n} ‖e^k‖ .

We set
    S_1 := {x ∈ Kn : ‖x‖_∞ = 1},     m := inf{‖x‖, x ∈ S_1} ≥ 0.
We want to show that m > 0 since then, in view of ‖x‖_∞^{-1} x ∈ S_1 , it follows that also
m ≤ ‖x‖_∞^{-1} ‖x‖ for x ≠ 0 , and consequently,

    0 < m ‖x‖_∞ ≤ ‖x‖ ,     x ∈ Kn .

Suppose m = 0 . Then, there is a sequence (x^k)_{k∈N} in S_1 with ‖x^k‖ → 0 (k → ∞) . Since


this sequence is bounded in the maximum norm, by the theorem of Bolzano-Weierstrass it
possesses a subsequence, likewise denoted by xk , which converges in the maximum norm

to some x ∈ Kn . Since

    | 1 − ‖x‖_∞ | = | ‖x^k‖_∞ − ‖x‖_∞ | ≤ ‖x^k − x‖_∞ → 0 (k → ∞),

we have x ∈ S_1 . On the other hand, for all k ∈ N , there holds

    ‖x‖ ≤ ‖x − x^k‖ + ‖x^k‖ ≤ M ‖x − x^k‖_∞ + ‖x^k‖ .

This implies for k → ∞ that ‖x‖ = 0 and therefore x = 0 , which contradicts x ∈ S_1 .


Q.E.D.

Remark 1.2: i) For the two foregoing theorems, the theorem of Bolzano-Weierstrass and
the theorem of norm equivalence, the finite dimensionality of Kn is decisive. Both theo-
rems do not hold in infinite-dimensional normed spaces such as the space l2 of (infinite)
l2 -convergent sequences or the space C[a, b] of continuous functions on [a, b].
ii) A subset M ⊂ Kn is called “compact” (or more precisely “sequentially compact”),
if each sequence of vectors in M possesses a convergent subsequence with limit in M .
Then, the theorem of Bolzano-Weierstrass implies that the compact subsets in Kn are
exactly the bounded and closed subsets in Kn .
iii) A point x ∈ Kn is called “accumulation point” of a set M ⊂ Kn if each neighborhood
of x contains at least one point from M \ {x} . The set of accumulation points of M is
denoted by H(M) (closed “hull” of M ). A point x ∈ M \ H(M) is called “isolated” .

Remark 1.3: In many applications there occur pairs {x, y} (or more generally tuples)
of points x, y ∈ Kn . These form the so-called “product space” V = Kn × Kn , which may
be equipped with the generic norm ‖{x, y}‖ := ( ‖x‖² + ‖y‖² )^{1/2} . Since this space may
be identified with the 2n-dimensional Euclidian space K2n all results on subsets of Kn
carry over to subsets of Kn × Kn . This can be extended to more general product spaces
of the form V = Kn1 × · · · × Knm .

The basic concept in the geometry of Kn is that of “orthogonality” of vectors or


subspaces. For its definition, we use a “scalar product”.

Definition 1.3: A mapping (·, ·) : Kn × Kn → K is called “scalar product” if it has the


following properties:
(S1) Conjugate Symmetry: (x, y) = (y, x) , x, y ∈ Kn .
(S2) Linearity: (αx + βy, z) = α(x, z) + β(y, z) , x, y, z ∈ Kn , α, β ∈ K.
(S3) Definiteness: (x, x) ∈ R, (x, x) > 0, x ∈ Kn \ {0}.

In the following, we will mostly use the “euclidian” scalar product



    (x, y)_2 = Σ_{j=1}^{n} x_j ȳ_j ,     (x, x)_2 = ‖x‖_2² .

Remark 1.4: i) If the strict definiteness (S3) is relaxed, (x, x) ∈ R, (x, x) ≥ 0, the
sesquilinear form becomes a so-called “semi-scalar product”.
ii) From property (S2) (linearity in the first argument) and (S1) (conjugate symmetry),
we obtain the conjugate linearity in the second argument. Hence, a scalar product is a
special kind of “sesquilinear form” (if K = C ) or “bilinear form” (if K = R ).

Lemma 1.1: For a scalar product on Kn there holds the “Cauchy-Schwarz inequality”

|(x, y)|2 ≤ (x, x)(y, y), x, y ∈ Kn . (1.1.3)

Proof. The assertion is obviously true for y = 0 . Hence, we can now assume that y ≠ 0 .
For arbitrary α ∈ K there holds

    0 ≤ (x + αy, x + αy) = (x, x) + α(y, x) + ᾱ(x, y) + αᾱ(y, y).

With α := −(x, y)(y, y)⁻¹ this implies

    0 ≤ (x, x) − (x, y)(y, y)⁻¹(y, x) − (y, x)(y, y)⁻¹(x, y) + (x, y)(y, x)(y, y)⁻¹
      = (x, x) − |(x, y)|²(y, y)⁻¹

and, consequently, 0 ≤ (x, x)(y, y) − |(x, y)|2. This is the asserted inequality. Q.E.D.

The Cauchy-Schwarz inequality in Kn is a special case of the “Hölder1 inequality”.

Corollary 1.1: Any scalar product (·, ·) on Kn generates a norm ‖·‖ on Kn by

    ‖x‖ := (x, x)^{1/2} ,     x ∈ Kn .

The “Euclidian” scalar product (·, ·)_2 corresponds to the “Euclidian” norm ‖x‖_2 .

Proof. The norm properties (N1) and (N2) are obvious. It remains to show (N3). Using
the Cauchy-Schwarz inequality, we obtain

    ‖x + y‖² = (x + y, x + y) = (x, x) + (x, y) + (y, x) + (y, y)
             ≤ ‖x‖² + 2|(x, y)| + ‖y‖² ≤ ‖x‖² + 2‖x‖‖y‖ + ‖y‖² = (‖x‖ + ‖y‖)² ,

what was to be shown. Q.E.D.


Next, we provide a useful inequality, which is a special case of so-called “Young2
inequalities”.

1
Ludwig Otto Hölder (1859–1937): German mathematician; Prof. in Tübingen; contributions first to
the theory of Fourier series and later to group theory; found 1884 the inequality named after him.
2
William Henry Young (1863–1942): English mathematician; worked at several universities world-
wide, e. g., in Calcutta, Liverpool and Wales; contributions to differential and integral calculus, topological
set theory and geometry.

Lemma 1.2 (Young inequality): For p, q ∈ R with 1 < p, q < ∞ and 1/p + 1/q = 1,
there holds the inequality

    |xy| ≤ |x|^p/p + |y|^q/q ,     x, y ∈ K.                                   (1.1.4)

Proof. The logarithm ln(x) is, in view of ln″(x) = −1/x² < 0 , a concave function on R_+ .
Hence, for x, y ∈ K there holds:

    ln( (1/p)|x|^p + (1/q)|y|^q ) ≥ (1/p) ln(|x|^p) + (1/q) ln(|y|^q) = ln(|x|) + ln(|y|).

Because of the monotonicity of the exponential function ex it further follows that for
x, y ∈ K :
     
    (1/p)|x|^p + (1/q)|y|^q ≥ exp( ln(|x|) + ln(|y|) ) = exp( ln(|x|) ) exp( ln(|y|) ) = |x||y| = |xy| ,

what was to be proven. Q.E.D.

Lemma 1.3 (Hölder inequality): For the Euclidian scalar product there holds, for ar-
bitrary p, q ∈ R with 1 < p, q < ∞ and 1/p + 1/q = 1, the so-called “Hölder inequality”

    |(x, y)_2| ≤ ‖x‖_p ‖y‖_q ,     x, y ∈ Kn .                                 (1.1.5)

This inequality also holds for the limit case p = 1, q = ∞ .

Proof. For x = 0 or y = 0 the asserted estimate is obviously true. Hence, we can
assume that ‖x‖_p ≠ 0 and ‖y‖_q ≠ 0 . First, there holds

    |(x, y)_2| / ( ‖x‖_p ‖y‖_q ) = (1/(‖x‖_p ‖y‖_q)) | Σ_{i=1}^{n} x_i ȳ_i | ≤ Σ_{i=1}^{n} |x_i||y_i| / ( ‖x‖_p ‖y‖_q ) .

Using the Young inequality it follows that


    |(x, y)_2| / ( ‖x‖_p ‖y‖_q ) ≤ Σ_{i=1}^{n} ( |x_i|^p/(p‖x‖_p^p) + |y_i|^q/(q‖y‖_q^q) )
                                 = (1/(p‖x‖_p^p)) Σ_{i=1}^{n} |x_i|^p + (1/(q‖y‖_q^q)) Σ_{i=1}^{n} |y_i|^q = 1/p + 1/q = 1.

This implies the asserted inequality. Q.E.D.


As a consequence of the Hölder inequality, we obtain the so-called “Minkowski³ inequal-
ity”, which is the triangle inequality for the lp norm.

3
Hermann Minkowski (1864–1909): Russian-German mathematician; Prof. in Göttingen; several
contributions to pure mathematics; introduced the non-euclidian 4-dimensional space-time continuum
(“Minkowski space”) for describing the theory of relativity of Einstein.

Lemma 1.4 (Minkowski inequality): For arbitrary p ∈ R with 1 ≤ p < ∞ as well


as for p = ∞ there holds the “Minkowski inequality”

    ‖x + y‖_p ≤ ‖x‖_p + ‖y‖_p ,     x, y ∈ Kn .                                (1.1.6)

Proof. For p = 1 and p = ∞ the inequality follows from the triangle inequality on R :

    ‖x + y‖_1 = Σ_{i=1}^{n} |x_i + y_i| ≤ Σ_{i=1}^{n} |x_i| + Σ_{i=1}^{n} |y_i| = ‖x‖_1 + ‖y‖_1 ,
    ‖x + y‖_∞ = max_{1≤i≤n} |x_i + y_i| ≤ max_{1≤i≤n} |x_i| + max_{1≤i≤n} |y_i| = ‖x‖_∞ + ‖y‖_∞ .

Let now 1 < p < ∞ and q be defined by 1/p + 1/q = 1 , i. e., q = p/(p − 1) . We set

ξi := |xi + yi |p−1 , i = 1, . . . , n, ξ := (ξi )ni=1 .

This implies that



    ‖x + y‖_p^p = Σ_{i=1}^{n} |x_i + y_i||x_i + y_i|^{p−1} ≤ Σ_{i=1}^{n} |x_i|ξ_i + Σ_{i=1}^{n} |y_i|ξ_i

and further by the Hölder inequality

    ‖x + y‖_p^p ≤ ‖x‖_p ‖ξ‖_q + ‖y‖_p ‖ξ‖_q = ( ‖x‖_p + ‖y‖_p ) ‖ξ‖_q .

Observing q = p/(p − 1), we conclude


    ‖ξ‖_q^q = Σ_{i=1}^{n} |ξ_i|^q = Σ_{i=1}^{n} |x_i + y_i|^p = ‖x + y‖_p^p ,

and consequently,

    ‖x + y‖_p^p ≤ ( ‖x‖_p + ‖y‖_p ) ‖x + y‖_p^{p/q} = ( ‖x‖_p + ‖y‖_p ) ‖x + y‖_p^{p−1} .

This implies the asserted inequality. Q.E.D.


Using the Euclidian scalar product, we can introduce a canonical notion of “orthogo-
nality”, i. e., two vectors x, y ∈ Kn are called “orthogonal” (in symbols x ⊥ y) if
(x, y)2 = 0.

Two subspaces N, M ⊂ Kn are called “orthogonal” (in symbols N ⊥ M) if


(x, y)2 = 0, x ∈ N, y ∈ M.

Accordingly, to each subspace M ⊂ Kn , we can assign its “orthogonal complement”


M ⊥ := {x ∈ Kn , span(x)⊥M}, which is uniquely determined. Then, Kn = M ⊕ M ⊥ ,
the “direct sum” of M and M ⊥ . Let M ⊂ Kn be a (nontrivial) subspace. Then, for
any vector x ∈ Kn the “orthogonal projection” PM x ∈ M is determined by the relation

    ‖x − P_M x‖_2 = min_{y∈M} ‖x − y‖_2 .                                      (1.1.7)

This “best approximation” property is equivalent to the relation

(x − PM x, y)2 = 0 ∀y ∈ M, (1.1.8)

which can be used to actually compute PM x.
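In coordinates, if the columns of a matrix M form a basis of the subspace M, relation (1.1.8) turns into a small linear system for the coefficients of P_M x. A minimal sketch (assuming NumPy; the subspace basis and the vector x are arbitrary illustrative choices):

import numpy as np

# Orthogonal projection P_M x onto M = span of the columns of M in R^4,
# computed from (1.1.8): (x - P_M x, m^j)_2 = 0 for all basis vectors m^j.
M = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [0.0, 2.0]])          # columns span the subspace
x = np.array([1.0, 2.0, 3.0, 4.0])

# Writing P_M x = M c, (1.1.8) becomes the Gram system (M^T M) c = M^T x.
c = np.linalg.solve(M.T @ M, M.T @ x)
PMx = M @ c

# Check the orthogonality relation (1.1.8): the defect x - P_M x is orthogonal to M.
print(M.T @ (x - PMx))              # approximately the zero vector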


For arbitrary vectors there holds the “parallelogram identity” (exercise)

    ‖x + y‖_2² + ‖x − y‖_2² = 2‖x‖_2² + 2‖y‖_2² ,     x, y ∈ Kn ,              (1.1.9)

and for orthogonal vectors the “Theorem of Pythagoras” (exercise):

    ‖x + y‖_2² = ‖x‖_2² + ‖y‖_2² ,     x, y ∈ Kn ,  x ⊥ y.                     (1.1.10)

A set of vectors {a^1, . . . , a^m}, a^i ≠ 0 , of Kn , which are mutually orthogonal, (a^k, a^l)_2 = 0
for k ≠ l , is necessarily linearly independent. Because for Σ_{k=1}^{m} c_k a^k = 0 , successively
taking the scalar product with a^l , l = 1, . . . , m , yields

    0 = Σ_{k=1}^{m} c_k (a^k, a^l)_2 = c_l (a^l, a^l)_2   ⇒   c_l = 0.

Definition 1.4: A set of vectors {a^1, . . . , a^m}, a^i ≠ 0 , of Kn , which are mutually orthog-
onal, (a^k, a^l)_2 = 0 , k ≠ l , is called “orthogonal system” (in short “ONS”) and in the case
m = n “orthogonal basis” (in short “ONB”). If (a^k, a^k)_2 = 1 , k = 1, . . . , m , one speaks of
an “orthonormal system” and an “orthonormal basis”, respectively. The Cartesian basis
{e^1, . . . , e^n} is obviously an orthonormal basis of Rn with respect to the Euclidian scalar
product. However, there are many other (actually infinitely many) such orthonormal
bases in Rn .

Lemma 1.5: Let {ai , i = 1, . . . , n} be an orthonormal basis of Kn (with respect to the


canonical Euclidian scalar product). Then, each vector x ∈ Kn possesses a representation
of the form (in analogy to the “Fourier expansion” with trigonometric functions)


    x = Σ_{i=1}^{n} (x, a^i)_2 a^i ,                                           (1.1.11)

and there holds the “Parseval 4 identity”



    ‖x‖_2² = Σ_{i=1}^{n} |(x, a^i)_2|² ,     x ∈ Kn .                          (1.1.12)

4
Marc-Antoine Parseval des Chênes (1755–1836): French mathematician; worked on partial differen-
tial equations in physics (only five mathematical publications); known by the identity named after him,
which he stated without proof and connection to Fourier series.


Proof. From the representation x = Σ_{j=1}^{n} α_j a^j , taking the product with a^i it follows
that
    (x, a^i)_2 = Σ_{j=1}^{n} α_j (a^j, a^i)_2 = α_i ,     i = 1, . . . , n,

and consequently the representation (1.1.11). Further there holds:

    ‖x‖_2² = (x, x)_2 = Σ_{i,j=1}^{n} (x, a^i)_2 \overline{(x, a^j)_2} (a^i, a^j)_2 = Σ_{i=1}^{n} |(x, a^i)_2|² ,

what was to be proven. Q.E.D.


By the following Gram⁵-Schmidt⁶ algorithm, we can orthonormalize an arbitrary basis
of Kn , i. e., construct an orthonormal basis.

Theorem 1.3 (Gram-Schmidt algorithm): Let {a1 , . . . , an } be any basis of Kn .


Then, the following so-called “Gram-Schmidt orthonormalization algorithm”,

    b^1 := ‖a^1‖_2^{-1} a^1 ,
                                                                               (1.1.13)
    b̃^k := a^k − Σ_{j=1}^{k−1} (a^k, b^j)_2 b^j ,     b^k := ‖b̃^k‖_2^{-1} b̃^k ,     k = 2, . . . , n,

yields an orthonormal basis {b1 , . . . , bn } of Kn .

Proof. First, we show that the construction process of the bk does not stop with k < n .
The vectors bk are linear combinations of the a1 , . . . , ak . If for some k ≤ n

k−1
ak − (ak , bj )2 bj = 0,
j=1

the vectors {a^1, . . . , a^k} would be linearly dependent, contradicting the a priori assumption
that {a^1, . . . , a^n} is a basis. Now, we show by induction that the Gram-Schmidt process
yields an orthonormal basis. Obviously ‖b^1‖_2 = 1 . Let now {b^1, . . . , b^k}, for k ≤ n, be
yields an orthonormal basis. Obviously b1 2 = 1 . Let now {b1 , . . . , bk }, for k ≤ n, be
an already constructed orthonormal system. Then, for l = 1, . . . , k, there holds

k
(bk+1 , bl )2 = (ak+1 , bl )2 − (ak+1 , bj )2 (bj , bl )2 = 0
  
j=1
= δjl

and b k+1
2 = 1 , i. e., {b , . . . , b
1 k+1
} is also an orthonormal system. Q.E.D.

5
Jørgen Pedersen Gram (1850–1916): Danish mathematician, employee and later owner of an insurance
company, contributions to algebra (invariants theory), probability theory, numerics and forestry; the
orthonormalization algorithm named after him had already been used before by Cauchy 1836.
6
Erhard Schmidt (1876–1959): German mathematician, Prof. in Berlin, there founder of the Institute
for Applied Mathematics 1920, after the war Director of the Mathematical Institute of the Academy of
Sciences of DDR; contributions to the theory of integral equations and Hilbert spaces and later to general
topology.

The Gram-Schmidt algorithm in its “classical” form (1.1.13) is numerically unstable


due to accumulation of round-off errors. Below, in Section 4.3.1, we will consider a stable
version, the so-called “modified Gram-Schmidt algorithm”, which for exact arithmetic
yields the same result.
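For the real case, both variants can be sketched as follows (assuming NumPy; the function names and the test basis are illustrative choices). The classical form subtracts all projections of the original vector a^k at once, the modified form updates the remaining vectors projection by projection:

import numpy as np

def gram_schmidt_classical(A):
    # Classical Gram-Schmidt (1.1.13): columns of A are the basis a^1, ..., a^n.
    n = A.shape[1]
    B = np.zeros_like(A, dtype=float)
    for k in range(n):
        v = A[:, k] - B[:, :k] @ (B[:, :k].T @ A[:, k])   # subtract all projections at once
        B[:, k] = v / np.linalg.norm(v)
    return B

def gram_schmidt_modified(A):
    # Modified Gram-Schmidt: projections are removed one after another from the
    # already updated vectors, which is much less sensitive to round-off.
    B = np.array(A, dtype=float)
    n = B.shape[1]
    for k in range(n):
        B[:, k] /= np.linalg.norm(B[:, k])
        for j in range(k + 1, n):
            B[:, j] -= (B[:, k] @ B[:, j]) * B[:, k]
    return B

# Arbitrary (real) basis of R^3 as columns; both variants return an orthonormal basis.
A = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
for gs in (gram_schmidt_classical, gram_schmidt_modified):
    B = gs(A)
    print(np.allclose(B.T @ B, np.eye(3)))   # True: columns are orthonormal

In the complex case the real dot products above would have to be replaced by the Euclidian scalar product with conjugation.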

1.1.2 Linear mappings and matrices

We now consider linear mappings from the n-dimensional vector space Kn into the m-
dimensional vector space Km , where not necessarily m = n . However, the special case
m = n plays the most important role. A mapping ϕ = (ϕ1 , . . . , ϕm ) : Kn → Km is called
“linear”, if for x, y ∈ Kn and α, β ∈ K there holds

ϕ(αx + βy) = αϕ(x) + βϕ(y). (1.1.14)

The action of a linear mapping ϕ on a vector space can be described in several ways. It
obviously suffices to prescribe the action of ϕ on the elements of a basis of the space,
e. g., a Cartesian basis {e^i, i = 1, . . . , n},

    x = Σ_{i=1}^{n} x_i e^i   →   ϕ(x) = ϕ( Σ_{i=1}^{n} x_i e^i ) = Σ_{i=1}^{n} x_i ϕ(e^i) .

Thereby, to each vector (or point) x ∈ Kn a “coordinate vector” x̂ = (x_i)_{i=1}^{n} is uniquely
associated. If the images ϕ(x) are expressed with respect to a Cartesian basis of Km ,

    ϕ(x) = Σ_{j=1}^{m} ϕ_j(x) e^j = Σ_{j=1}^{m} ( Σ_{i=1}^{n} ϕ_j(e^i) x_i ) e^j ,     ϕ_j(e^i) =: a_ji ,

with the coordinate vector ϕ̂(x) = (ϕ_j(x))_{j=1}^{m} , we can write the action of the mapping ϕ
on a vector x ∈ Kn in “matrix form” using the usual rules of matrix-vector multiplication
as follows:
    ϕ_j(x) = (Ax̂)_j := Σ_{i=1}^{n} a_ji x_i ,     j = 1, . . . , m,

with the m × n-array of numbers A = (a_ij)_{i,j=1}^{m,n} ∈ K^{m×n} , a “matrix”,

    \begin{pmatrix} ϕ_1(e^1) & \cdots & ϕ_1(e^n) \\ \vdots & & \vdots \\ ϕ_m(e^1) & \cdots & ϕ_m(e^n) \end{pmatrix}
    =: \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} = A ∈ K^{m×n} .

By this matrix A ∈ Km×n the linear mapping ϕ is uniquely described with respect to
the chosen bases of Kn and Km . In the following discussion, for simplicity, we identify
the point x ∈ Kn with its special cartesian coordinate representation x̂ . Here, we follow
the convention that in the notation Km×n for matrices the first parameter m stands for
the dimension of the target space Km , i. e., the number of rows in the matrix, while the

second one n corresponds to the dimension of the initial space Kn , i. e., the number of
columns. Accordingly, for a matrix entry aij the first index refers to the row number and
the second one to the column number of its position in the matrix. We emphasize that this
is only one of the possible concrete representations of the linear mapping ϕ : Kn → Km .
In this sense each quadratic matrix A ∈ Kn×n represents a linear mapping in Kn . The
identity map ϕ(x) = x is represented by the “identity matrix” I = (δij )ni,j=1 where
δij := 1 for i = j and δij = 0 else (the usual “Kronecker symbol”).
Clearly, two matrices A, A′ ∈ K^{m×n} are identical, i. e., a_ij = a′_ij , if and only if Ax =
A′x , x ∈ Kn . To a general matrix A ∈ K^{m×n} , we associate the “adjoint transpose”
Ā^T = (a^T_ij)_{i,j=1}^{n,m} ∈ K^{n×m} by setting a^T_ij := ā_ji . A quadratic matrix A ∈ K^{n×n} is called “regular”, if
the corresponding linear mapping is injective and surjective, i. e., bijective, with “inverse”
denoted by A−1 ∈ Kn×n . Further, to each matrix A ∈ Kn×n , we associate the following
quantities, which are uniquely determined by the corresponding linear mapping ϕ :

– “determinant” of A : det(A) .

– “adjugate” of A: adj(A) := C T , cij := (−1)i+j Aij (Aij the cofactors of A ).



– “trace” of A: trace(A) := Σ_{i=1}^{n} a_ii .

The following property of the determinant will be useful below: det(ĀT ) = det(A).

Lemma 1.6: For a square matrix A = (aij )ni,j=1 ∈ Kn×n the following statements are
equivalent:
i) A is regular with inverse A−1 .
ii) The equation Ax = 0 has only the zero solution x = 0 (injectivity).
iii) The equation Ax = b has for any b ∈ Kn a solution (surjectivity).
iv) det(A) ≠ 0 .
v) The adjoint transpose ĀT is regular with inverse (ĀT )−1 = (A−1 )T .

Proof. For the proof, we refer to the standard linear algebra literature. Q.E.D.

Lemma 1.7: For a general matrix A ∈ Km×n , we introduce its “range” and its “kernel”
(or “null space”)

range(A) := {y ∈ Km | y = Ax for some x ∈ Kn },


kern(A) := {x ∈ Kn | Ax = 0}.

There holds

    range(A) = kern(Ā^T)^⊥ ,     range(Ā^T) = kern(A)^⊥ ,                      (1.1.15)

i. e., the equation Ax = b has a solution if and only if (b, y)2 = 0 for all y ∈ kern(ĀT ) .

Proof. For the proof, we refer to the standard linear algebra literature. Q.E.D.
In many practical applications the governing matrices have special properties, which
require the use of likewise special numerical methods. Some of the most important prop-
erties are those of “symmetry” or “normality” and “definiteness”.

Definition 1.5: i) A quadratic matrix A ∈ Kn×n is called “Hermitian” if it satisfies

A = ĀT (⇔ aij = āji , i, j = 1, . . . , n), (1.1.16)

or equivalently,

(Ax, y)2 = (x, Ay)2 , x, y ∈ Kn . (1.1.17)

ii) It is called “normal” if ĀT A = AĀT .


iii) It is called “positive semi-definite” if

(Ax, x)2 ∈ R, (Ax, x)2 ≥ 0, x ∈ Kn . (1.1.18)

and “positive definite” if

(Ax, x)2 ∈ R, (Ax, x)2 > 0, x ∈ Kn \ {0}. (1.1.19)

iv) A real Hermitian matrix A ∈ Rn×n is called “symmetric” .

Lemma 1.8: For a Hermitian positive definite matrix A ∈ Kn×n the main diagonal
elements are real and positive, aii > 0 , and the element with largest modulus lies on the
main diagonal.

Proof. i) From aii = āii it follows that aii ∈ R. The positiveness follows via testing by
the Cartesian unit vector ei yielding aii = (Aei , ei )2 > 0.
ii) Let a_ij ≠ 0 be an element of A with maximal modulus and suppose that i ≠ j .
Testing now by x = e^i − sign(a_ij) e^j ≠ 0 , we obtain the following contradiction to the
definiteness of A :

0 < (Ax, x)2 = (Aei , ei )2 − 2 sign(aij )(Aei , ej )2 + sign(aij )2 (Aej , ej )2


= aii − 2 sign(aij )aij + ajj = aii − 2|aij | + ajj ≤ 0.

This completes the proof. Q.E.D.

Remark 1.5 (Exercises): i) If a matrix A ∈ Kn×n is positive definite (or more generally
just satisfies (Ax, x)2 ∈ R for x ∈ Cn ), then it is necessarily Hermitian. This does not
need to be true for real matrices A ∈ Rn×n .
ii) The general form of a scalar product (·, ·) on Kn is given by (x, y) = (Ax, y)2 with
a (Hermitian) positive definite matrix A ∈ Kn×n .
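In computations, the properties of Definition 1.5 are easy to test numerically. A minimal sketch (assuming NumPy; the matrix is an arbitrary illustrative choice, and the definiteness test uses the standard fact that a Hermitian matrix is positive definite exactly if all its eigenvalues are positive):

import numpy as np

# Checking the properties of Definition 1.5 for a given (complex) matrix.
A = np.array([[2.0, 1.0j], [-1.0j, 2.0]])

is_hermitian = np.allclose(A, A.conj().T)                  # A = conj(A)^T
is_normal    = np.allclose(A.conj().T @ A, A @ A.conj().T)

# eigvalsh returns the (real) eigenvalues of a Hermitian matrix.
eigs = np.linalg.eigvalsh(A)
is_pos_def = is_hermitian and np.all(eigs > 0)

print(is_hermitian, is_normal, is_pos_def, eigs)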

Definition 1.6 (Orthonormal matrix): A matrix Q ∈ Km×n is called “orthogonal”


or “orthonormal” if its column vectors form an orthogonal or orthonormal system in Km ,
respectively. In the case n = m such a matrix is called “unitary”.

Lemma 1.9: A unitary matrix Q ∈ Kn×n is regular and its inverse is Q−1 = Q̄T .
Further, there holds:

    (Qx, Qy)_2 = (x, y)_2 ,     x, y ∈ Kn ,                                    (1.1.20)

    ‖Qx‖_2 = ‖x‖_2 ,     x ∈ Kn .                                              (1.1.21)

Proof. First, we show that Q̄^T is the inverse of Q . Let q_i ∈ Kn denote the column
vectors of Q satisfying by definition (q_i, q_j)_2 = δ_{ij} . This implies:

Q̄^T Q = ( q̄_i^T q_j )_{i,j=1,...,n} = ( δ_{ij} )_{i,j=1,...,n} = I .

From this it follows that

(Qx, Qy)_2 = (x, Q̄^T Q y)_2 = (x, y)_2 ,   x, y ∈ Kn ,

and further

‖Qx‖_2 = (Qx, Qx)_2^{1/2} = ‖x‖_2 ,   x ∈ Kn ,
which completes the proof. Q.E.D.

Example 1.2: The real unitary matrix

Q_θ^{(ij)} =
⎡ 1      0        0       0        0 ⎤
⎢ 0    cos(θ)     0    −sin(θ)     0 ⎥   (row i)
⎢ 0      0        1       0        0 ⎥
⎢ 0    sin(θ)     0     cos(θ)     0 ⎥   (row j)
⎣ 0      0        0       0        1 ⎦   (shown here for n = 5)

describes a rotation in the (xi , xj )-plane about the origin x = 0 with angle θ ∈ [0, 2π) .

Remark 1.6: i) In view of the relations (1.1.20) and (1.1.21) Euclidian scalar product
and norm of vectors are invariant under unitary transformations. This explains why it is
the Euclidian norm, which is used for measuring length or distance of vectors in Rn .
ii) The Schwarz inequality (1.1.3) allows the definition of an “angle” between two vectors

in Rn . For any number α ∈ [−1, 1] there is exactly one θ ∈ [0, π] such that α = cos(θ).
Hence, by

cos(θ) = (x, y)_2 / ( ‖x‖_2 ‖y‖_2 ) ,   x, y ∈ Kn \ {0},

a θ ∈ [0, π] is uniquely determined. This is then the “angle” between the two vectors
x and y . The relation (1.1.20) states that the Euclidian scalar product of two vectors
in Kn is invariant under rotations. By some rotation Q in Rn , we can achieve that
Qx, Qy ∈ span{e^(1), e^(2)} and Qx = ‖x‖_2 e^(1) . Then, there holds

(x, y)_2 = (Qx, Qy)_2 = ‖x‖_2 (e^(1), Qy)_2 = ‖x‖_2 (Qy)_1 = ‖x‖_2 ‖Qy‖_2 cos(θ) = ‖x‖_2 ‖y‖_2 cos(θ),

i. e., θ is actually the “angle” between the two vectors in the sense of elementary geometry.

[Figure 1.1: Angle θ between two vectors x = ‖x‖_2 e^(1) and y in R2 , with cos(θ) = (Qy)_1 / ‖Qy‖_2 .]

1.1.3 Non-quadratic linear systems

Let A ∈ Rm×n be a not necessarily quadratic coefficient matrix and b ∈ Rm a given


vector. We concentrate on the case m ≠ n and consider the non-quadratic linear system

Ax = b, (1.1.22)

for x ∈ Rn . Here, rank(A) < rank[A, b] is allowed, i. e., the system does not need to
possess a solution in the normal sense. In this case an appropriately extended notion
of “solution” is to be used. In the following, we consider the so-called “method of least
error-squares”, which goes back to Gauss. In this approach a vector x̄ ∈ Rn is sought
with minimal defect norm ‖d‖_2 = ‖b − Ax̄‖_2 . Clearly, this extended notion of “solution”
coincides with the traditional one if rank(A) = rank([A, b]).

Theorem 1.4 (“Least error-squares” solution): There exists always a “solution”


x̄ ∈ Rn of (1.1.22) in the sense of least error-squares (“least error-squares” solution)

‖Ax̄ − b‖_2 = min_{x∈Rn} ‖Ax − b‖_2 .                                   (1.1.23)

This is equivalent to x̄ being solution of the so-called “normal equation”

AT Ax̄ = AT b. (1.1.24)

If m ≥ n and rank(A) = n the “least error-squares” solution x̄ is uniquely determined.


Otherwise each other solution has the form x̄ + y with y ∈ kern(A). In this case, there
always exists such a solution with minimal Euclidian norm, i. e., a “minimal” solution
with least error-squares,

‖x_min‖_2 = min{ ‖x̄ + y‖_2 , y ∈ kern(A) }.                             (1.1.25)

Proof. i) Let x̄ be a solution of the normal equation. Then, for arbitrary x ∈ Rn there
holds

‖b − Ax‖_2^2 = ‖b − Ax̄ + A(x̄ − x)‖_2^2
             = ‖b − Ax̄‖_2^2 + 2 (b − Ax̄, A[x̄ − x])_2 + ‖A(x̄ − x)‖_2^2 ≥ ‖b − Ax̄‖_2^2 ,

where the middle term vanishes since b − Ax̄ ∈ kern(A^T) and A[x̄ − x] ∈ range(A),

i. e., x̄ has least error-squares. In turn, for such a least error-squares solution x̄ there
holds

0 = ∂/∂x_i ‖Ax − b‖_2^2 |_{x=x̄} = ∂/∂x_i Σ_{j=1}^m ( Σ_{k=1}^n a_{jk} x_k − b_j )^2 |_{x=x̄}

  = 2 Σ_{j=1}^m a_{ji} ( Σ_{k=1}^n a_{jk} x̄_k − b_j ) = 2 ( A^T A x̄ − A^T b )_i ,

i. e., x̄ solves the normal equation.


ii) We now consider the solvability of the normal equation. The orthogonal complement
of range(A) in Rm is kern(AT ) . Hence the element b has a unique decomposition

b = s+r, s ∈ range(A) , r ∈ kern(AT ).

Then, for any x̄ ∈ Rn satisfying Ax̄ = s there holds

AT Ax̄ = AT s = AT s + AT r = AT b,

i. e., x̄ solves the normal equation. In case that rank(A) = n there holds kern(A) = {0} .
Then, A^T A x = 0 together with kern(A^T) ⊥ range(A) implies Ax = 0 and hence x = 0 ,
so that the matrix A^T A ∈ Rn×n is regular and consequently x̄ uniquely
determined. In case that rank(A) < n , for any other solution x_1 of the normal equation,
we have

b = Ax_1 + (b − Ax_1) ∈ range(A) + kern(A^T) = range(A) + range(A)^⊥ .



In view of the uniqueness of the orthogonal decomposition, we necessarily obtain Ax1 =


Ax̄ and x̄ − x1 ∈ kern(A) .
iii) We finally consider the case rank(A) < n. Among the solutions x̄ + kern(A) of the
normal equation, we can find one with minimal euclidian norm,

‖x_min‖_2 = min{ ‖x̄ + y‖_2 , y ∈ kern(A) }.

This follows from the non-negativity of the function F(y) := ‖x̄ + y‖_2 and its uniform
strict convexity, which also implies uniqueness of the minimal solution. Q.E.D.
For the computation of the “solution with smallest error-squares” of a non-quadratic
system Ax = b, we have to solve the normal equation AT Ax = AT b. Efficient methods
for this task will be discussed in the next chapter.
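As a numerical illustration (added here; the data of the small overdetermined system are made-up values), the following Python/NumPy sketch computes the least error-squares solution once via the normal equation (1.1.24) and once via NumPy's built-in least-squares routine:

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([1.1, 1.9, 3.2, 3.9])

    # normal equation A^T A x = A^T b (A has full column rank here)
    x_normal = np.linalg.solve(A.T @ A, A.T @ b)

    # QR/SVD-based routine, numerically preferable for ill-conditioned A
    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

    print(x_normal, x_lstsq)                     # both give the least error-squares solution
    print(np.linalg.norm(b - A @ x_normal))      # minimal defect norm ||b - A x||_2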

Lemma 1.10: For any matrix A ∈ Km×n the matrices ĀT A ∈ Kn×n and AĀT ∈
Km×m are Hermitian (symmetric) and positive semi-definite. In the case m ≥ n and if
rank(A) = n the matrix Ā^T A is even positive definite.

Proof. Following the rules of matrix arithmetic there holds


\overline{(Ā^T A)}^T = (A^T Ā)^T = Ā^T A ,    x̄^T (Ā^T A) x = \overline{(Ax)}^T (Ax) = ‖Ax‖_2^2 ≥ 0 ,

i. e., ĀT A is Hermitian and positive semi-definite. The argument for AĀT is analogous.
In case that m ≥ n and rank(A) = n the matrix viewed as mapping A : Rn → Rm is
injective, i. e., ‖Ax‖_2 = 0 implies x = 0 . Hence, the matrix Ā^T A is positive definite.
Q.E.D.

1.1.4 Eigenvalues and eigenvectors

In the following, we consider square matrices A = (aij )ni,j=1 ∈ Kn×n .

Definition 1.7: i) A number λ ∈ C is called “eigenvalue” of A , if there is a corre-


sponding “eigenvector” w ∈ Cn , w = 0 , such that the “eigenvalue equation” holds:

Aw = λw. (1.1.26)

ii) The vector space of all eigenvectors of an eigenvalue λ is called “eigenspace” and
denoted by Eλ . Its dimension is the “geometric multiplicity” of λ . The set of all eigen-
values of a matrix A ∈ Kn×n is called its “spectrum” and denoted by σ(A) ⊂ C . The
matrix function RA (z) := zI − A is called the “resolvent” of A and Res(A) := {z ∈
C | zI − A is regular} the corresponding “resolvent set”.
iii) The eigenvalues are just the zeros of the “characteristic polynomial” χA ∈ Pn of A ,

χ_A(z) := det(zI − A) = z^n + b_1 z^{n−1} + . . . + b_n .



Hence, by the fundamental theorem of algebra there are exactly n eigenvalues counted
accordingly to their multiplicity as zeros of χA , their so-called “algebraic multiplicities”.
The algebraic multiplicity is always greater or equal than the geometric multiplicity. If it
is strictly greater, then the eigenvalue is called “deficient” and the difference the “defect”
of the eigenvalue.
iv) The eigenvalues of a matrix can be determined independently of each other. One speaks
of the “partial eigenvalue problem” if only a small number of the eigenvalues (e. g., the
largest or the smallest one) and the corresponding eigenvectors are to be determined. In
the “full eigenvalue problem” one seeks all eigenvalues with corresponding eigenvectors.
For a given eigenvalue λ ∈ C (e. g., obtained as a zero of the characteristic polynomial)
a corresponding eigenvector can be determined as any solution of the (singular) problem

(A − λI)w = 0. (1.1.27)

Conversely, for a given eigenvector w ∈ Kn (e. g., obtained by the “power method” de-
scribed below), one obtains the corresponding eigenvalue by evaluating any of the quotients
(choosing w_i ≠ 0 )

λ = (Aw)_i / w_i ,   i = 1, . . . , n ,        λ = (Aw, w)_2 / ‖w‖_2^2 .

The latter quotient is called the “Rayleigh7 quotient”.
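A minimal Python/NumPy illustration (added here; the symmetric test matrix is an arbitrary choice) of these two ways of recovering an eigenvalue from a known eigenvector:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
    lam, W = np.linalg.eigh(A)        # reference eigenpairs (A is symmetric)
    w = W[:, -1]                      # eigenvector to the largest eigenvalue
    Aw = A @ w

    print(Aw / w)                           # componentwise quotients (Aw)_i / w_i
    print(np.dot(Aw, w) / np.dot(w, w))     # Rayleigh quotient, same value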

The characteristic polynomial of a matrix A ∈ Kn×n has the following representation


with its mutually distinct zeros λi :

χ_A(z) = Π_{i=1}^m (z − λ_i)^{σ_i} ,      Σ_{i=1}^m σ_i = n ,

where σi is the algebraic multiplicity of eigenvalue λi . Its geometric multiplicity is


ρi := dim(kern(A − λi I)) . We recall that generally ρi ≤ σi , i. e., the defect satisfies
αi := σi − ρi ≥ 0 . The latter corresponds to the largest integer α = α(λ) such that

kern((A − λI)^{α+1}) ≠ kern((A − λI)^α).                               (1.1.28)

Since
det(Ā^T − z̄I) = \overline{det(A^T − zI)} = \overline{det((A − zI)^T)} = \overline{det(A − zI)} ,
the eigenvalues of the matrices A and Ā^T are related by

λ(Ā^T) = λ̄(A).                                                        (1.1.29)

7
John William Strutt (Lord Rayleigh) (1842–1919): English mathematician and physicist; worked at
the beginning as (aristocratic) private scholar, 1879–1884 Professor for Experimental Physics in Cam-
bridge; fundamental contributions to theoretical physics: scattering theory, acoustics, electro-magnetics,
gas dynamics.

Hence, associated to a normalized “primal” (right) eigenvector w ∈ Kn , w2 = 1,


corresponding to an eigenvalue λ of A there is a “dual” (left) eigenvector w ∗ ∈ Kn \{0}
corresponding to the eigenvalue λ̄ of ĀT satisfying the “adjoint” eigenvalue equation

ĀT w ∗ = λ̄ w ∗ (⇔ w̄ ∗T A = λw̄ ∗T ). (1.1.30)

The dual eigenvector w^* may also be normalized by ‖w^*‖_2 = 1 or, which is more suitable
for numerical purposes, by (w, w^*)_2 = 1 . In the “degenerate” case (w, w^*)_2 = 0 , and
only then, the problem

Aw 1 − λw 1 = w (1.1.31)

has a solution w^1 ∈ Kn . This follows from the relations w^* ∈ kern(Ā^T − λ̄I), w ⊥
kern(Ā^T − λ̄I), and range(A − λI) = kern(Ā^T − λ̄I)^⊥ , the latter following from the result
of Lemma 1.7. The vector w 1 is called “generalized eigenvector (of level one)” of A (or
“Hauptvektor erster Stufe” in German) corresponding to the eigenvalue λ . Within this
notion, eigenvectors are “generalized eigenvectors” of level zero. By definition, there holds

(A − λI)2 w 1 = (A − λI)w = 0,

i. e., w 1 ∈ kern((A − λI)2 ) and, consequently, in view of the above definition, the eigen-
value λ has “defect” α(λ) ≥ 1 . If this construction can be continued, i. e., if (w 1 , w ∗ )2 =
0, such that also the problem Aw 2 − λw 2 = w 1 has a solution w 2 ∈ Kn , which is then
a “generalized eigenvector” of level two, by construction satisfying (A − λI)3 w 2 = 0 .
In this way, we may obtain “generalize eigenvectors” w m ∈ Kn of level m for which
(A − λI)m+1 w m = 0 and (w m , w ∗)2 = 0. Then, the eigenvalue λ has defect α(λ) = m .

Example 1.3: The following special matrices Cm (λ) occur as building blocks, so-called
“Jordan blocks”, in the “Jordan8 normal form” of a matrix A ∈ Kn×n (see below):
C_m(λ) =
⎡ λ  1             0 ⎤
⎢    λ  1            ⎥
⎢       ⋱   ⋱        ⎥
⎢            λ   1   ⎥
⎣ 0              λ   ⎦  ∈ Km×m ,   eigenvalue λ ∈ C ,

χ_{C_m(λ)}(z) = (z − λ)^m ⇒ σ = m ,    rank(C_m(λ) − λI) = m − 1 ⇒ ρ = 1 .

8
Marie Ennemond Camille Jordan (1838–1922): French mathematician; Prof. in Paris; contributions
to algebra, group theory, calculus and topology.
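For illustration (an addition to the notes; the values λ = 2 and m = 4 are arbitrary), the following Python/NumPy sketch builds such a Jordan block and confirms the multiplicities stated in Example 1.3:

    import numpy as np

    def jordan_block(lam, m):
        # C_m(lam): lam on the diagonal, ones on the superdiagonal
        return lam * np.eye(m) + np.diag(np.ones(m - 1), k=1)

    C = jordan_block(2.0, 4)
    print(np.linalg.eigvals(C))                         # 2 with algebraic multiplicity 4
    print(np.linalg.matrix_rank(C - 2.0 * np.eye(4)))   # rank 3, i.e., geometric multiplicity 1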

1.1.5 Similarity transformations

Definition 1.8: Two matrices A, B ∈ Kn×n are called “similar (to each other)”, if there
is a regular matrix T ∈ Kn×n such that

B = T −1 AT. (1.1.32)

The transition A → B is called “similarity transformation”.

Suppose that the matrix A ∈ Kn×n is the representation of a linear mapping ϕ : Kn → Kn
with respect to a basis {a^1, . . . , a^n} of Kn . Then, using the regular matrix T ∈ Kn×n ,
we obtain a second basis {T a^1, . . . , T a^n} of Kn and B is the representation of
the mapping ϕ with respect to this new basis. Hence, similar matrices are representations
of the same linear mapping and any two representations of the same linear mapping are
similar. In view of this fact, we expect that two similar matrices, representing the same
linear mapping, have several of their characteristic quantities as matrices in common.

Lemma 1.11: For any two similar matrices A, B ∈ Kn×n there holds:
a) det(A) = det(B).
b) σ(A) = σ(B).
c) trace(A) = trace(B).

Proof. i) The product theorem for determinants implies that det(AB) = det(A) det(B)
and further det(T −1 ) = det(T )−1 . This implies that

det(B) = det(T −1 AT ) = det(T −1 ) det(A) det(T ) = det(T )−1 det(A) det(T ) = det(A).

ii) Further, for any z ∈ C there holds

det(zI − B) = det(zT −1 T − T −1 AT ) = det(T −1 (zI − A)T )


= det(T −1 ) det(zI − A) det(T ) = det(zI − A),

which implies that A and B have the same eigenvalues.


iii) The trace of A is, up to the sign, the coefficient of the monomial z^{n−1} in the charac-
teristic polynomial χ_A(z) . Hence, by (ii), the trace of A equals that of B . Q.E.D.
Any matrix A ∈ Kn×n is similar to its “canonical form” (Jordan normal form) which
has the eigenvalues λi of A on its main diagonal counted accordingly to their algebraic
multiplicity. Hence, in view of Lemma 1.11 there holds

det(A) = Π_{i=1}^n λ_i ,      trace(A) = Σ_{i=1}^n λ_i .               (1.1.33)
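A quick numerical check of (1.1.33) in Python/NumPy (added here; the random test matrix is only an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    lam = np.linalg.eigvals(A)

    print(np.prod(lam).real, np.linalg.det(A))   # product of eigenvalues vs. det(A)
    print(np.sum(lam).real, np.trace(A))         # sum of eigenvalues vs. trace(A)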

Definition 1.9 (Normal forms): i) Any matrix A ∈ Kn×n is similar to its “canonical
normal form” JA (“Jordan normal form”) which is a block diagonal matrix with main

diagonal blocks, the “Jordan blocks”, of the form as shown in Example 1.3. Here, the
“algebraic” multiplicity of an eigenvalue corresponds to the number of occurrences of this
eigenvalue on the main diagonal of JA , while its “geometric” multiplicity corresponds to
the number of Jordan blocks containing λ .
ii) A matrix A ∈ Kn×n , which is similar to a diagonal matrix, then having its eigenvalues
on the main diagonal, is called “diagonalizable” ,

W^{−1} A W = Λ = diag(λ_i)     (λ_i eigenvalues of A).

This relation implies that the transformation matrix W = [w^1, . . . , w^n] has the eigenvec-
tors w^i corresponding to the eigenvalues λ_i as column vectors. This means that diago-
nalizability of a matrix is equivalent to the existence of a basis of eigenvectors.
iii) A matrix A ∈ Kn×n is called “unitarily diagonalizable” if it is diagonalizable with
a unitary transformation matrix. This is equivalent to the existence of an orthonormal
basis of eigenvectors.

Positive definite Hermitian matrices A ∈ Kn×n have very special spectral properties.
These are collected in the following lemma and theorem, the latter one being the basic
result of matrix analysis (“spectral theorem”).

Lemma 1.12: i) A Hermitian matrix has only real eigenvalues and eigenvectors to dif-
ferent eigenvalues are mutually orthogonal.
ii) A Hermitian matrix is positive definite if and only if all its (real) eigenvalues are pos-
itive.
iii) Two normal matrices A, B ∈ Kn×n commute, AB = BA, if and only if they possess
a common basis of eigenvectors.

Proof. For the proofs, we refer to the standard linear algebra literature. Q.E.D.

Theorem 1.5 (Spectral theorem): For square Hermitian matrices, A = Ā^T , or more
generally for “normal” matrices, Ā^T A = AĀ^T , algebraic and geometric multiplicities of
eigenvalues are equal, i. e., these matrices are diagonalizable. Further, they are even
unitarily diagonalizable, i. e., there exists an orthonormal basis of eigenvectors.

Proof. For the proof, we refer to the standard linear algebra literature. Q.E.D.

1.1.6 Matrix analysis

We now consider the vector space of all m×n-matrices A ∈ Km×n . This vector space may
be identified with the vector space of mn-vectors, Km×n ≅ Kmn . Hence, all statements for
vector norms carry over to norms for matrices. In particular, all norms for m×n-matrices

are equivalent and the convergence of sequences of matrices is again the componentwise
convergence

Ak → A (k → ∞) ⇐⇒ akij → aij (k → ∞) , i = 1, . . . , m , j = 1, . . . , n .

Now, we restrict the further discussion to square matrices A ∈ Kn×n . For an arbitrary
vector norm ‖ · ‖ on Kn a norm for matrices A ∈ Kn×n is generated by

‖A‖ := sup_{x∈Kn\{0}} ‖Ax‖ / ‖x‖ = sup_{x∈Kn, ‖x‖=1} ‖Ax‖ .

The definiteness and homogeneity are obvious and the triangle inequality follows from
that holding for the given vector norm. This matrix norm is called the “natural matrix
norm” corresponding to the vector norm ‖ · ‖ . In the following, for both norms, the
matrix norm and the generating vector norm, the same notation is used. For a natural
matrix norm there always holds ‖I‖ = 1 . Such a “natural” matrix norm is automatically
“compatible” with the generating vector norm, i. e., it satisfies

‖Ax‖ ≤ ‖A‖ ‖x‖ ,   x ∈ Kn , A ∈ Kn×n .                                  (1.1.34)

Further it is “submultiplicative”,

‖AB‖ ≤ ‖A‖ ‖B‖ ,   A, B ∈ Kn×n .                                        (1.1.35)

Not all matrix norms are “natural” in the above sense. For instance, the square-sum norm
(also called “Frobenius9 -norm”)

‖A‖_F := ( Σ_{j,k=1}^n |a_{jk}|^2 )^{1/2}

is compatible with the Euclidian norm and submultiplicative but cannot be a natural
matrix norm since ‖I‖_F = √n (for n ≥ 2 ). The natural matrix norm generated from
the Euclidian vector norm is called “spectral norm”. This name is suggested by the
following result.

Lemma 1.13 (Spectral norm): For an arbitrary square matrix A ∈ Kn×n the product
matrix Ā^T A ∈ Kn×n is always Hermitian and positive semi-definite. For the
spectral norm of A there holds

‖A‖_2 = max{ |λ|^{1/2} , λ ∈ σ(Ā^T A) }.                               (1.1.36)

If A is Hermitian (or symmetric), then,

‖A‖_2 = max{ |λ| , λ ∈ σ(A) }.                                          (1.1.37)

9
Ferdinand Georg Frobenius (1849–1917): German mathematician; Prof. in Zurich and Berlin; con-
tributions to the theory of differential equations, to determinants and matrices as well as to group theory.

Proof. i) Let the matrix A ∈ Kn×n be Hermitian. For any eigenvalue λ of A and
corresponding eigenvector x there holds

|λ| = ‖λx‖_2 / ‖x‖_2 = ‖Ax‖_2 / ‖x‖_2 ≤ ‖A‖_2 .

Conversely, let {a^i, i = 1, . . . , n} ⊂ Cn be an ONB of eigenvectors of A and
x = Σ_i x_i a^i ∈ Cn be arbitrary. Then,

‖Ax‖_2 = ‖A Σ_i x_i a^i‖_2 = ‖Σ_i λ_i x_i a^i‖_2 ≤ max_i |λ_i| ‖Σ_i x_i a^i‖_2 = max_i |λ_i| ‖x‖_2 ,

and consequently,

‖Ax‖_2 / ‖x‖_2 ≤ max_i |λ_i| .

ii) For a general matrix A ∈ Kn×n there holds

‖A‖_2^2 = max_{x∈Cn\{0}} ‖Ax‖_2^2 / ‖x‖_2^2 = max_{x∈Cn\{0}} (Ā^T Ax, x)_2 / ‖x‖_2^2
        ≤ max_{x∈Cn\{0}} ‖Ā^T Ax‖_2 / ‖x‖_2 = ‖Ā^T A‖_2 ,

and ‖Ā^T A‖_2 ≤ ‖Ā^T‖_2 ‖A‖_2 = ‖A‖_2^2 (observe that ‖A‖_2 = ‖Ā^T‖_2 , since Ā^T A
and AĀ^T have the same non-zero eigenvalues). This completes the proof. Q.E.D.
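The following Python/NumPy sketch (an added illustration with a random test matrix) verifies the characterization (1.1.36) of the spectral norm:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4))

    lam = np.linalg.eigvals(A.conj().T @ A)      # eigenvalues of A^T A (real, >= 0)
    print(np.sqrt(np.max(np.abs(lam))))          # max |lambda|^(1/2)
    print(np.linalg.norm(A, 2))                  # NumPy's spectral norm, same value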

Lemma 1.14 (Natural matrix norms): The natural matrix norms generated by the
l∞ norm ‖ · ‖_∞ and the l1 norm ‖ · ‖_1 are the so-called “maximal-row-sum norm” and
the “maximal-column-sum norm” , respectively,

‖A‖_∞ := max_{1≤i≤n} Σ_{j=1}^n |a_{ij}| ,    ‖A‖_1 := max_{1≤j≤n} Σ_{i=1}^n |a_{ij}| .      (1.1.38)

Proof. We give the proof only for the l∞ norm. For the l1 norm the argument is analogous.
i) The maximal row sum  · ∞ is a matrix norm. The norm properties (N1) - (N3) follow
from the corresponding properties of the modulus. For the matrix product AB there
holds
‖AB‖_∞ = max_{1≤i≤n} Σ_{j=1}^n | Σ_{k=1}^n a_{ik} b_{kj} | ≤ max_{1≤i≤n} Σ_{k=1}^n |a_{ik}| Σ_{j=1}^n |b_{kj}|
        ≤ max_{1≤i≤n} Σ_{k=1}^n |a_{ik}| · max_{1≤k≤n} Σ_{j=1}^n |b_{kj}| = ‖A‖_∞ ‖B‖_∞ .

ii) Further, in view of

‖Ax‖_∞ = max_{1≤j≤n} | Σ_{k=1}^n a_{jk} x_k | ≤ max_{1≤j≤n} Σ_{k=1}^n |a_{jk}| · max_{1≤k≤n} |x_k| = ‖A‖_∞ ‖x‖_∞ ,

the maximal row-sum is compatible with the maximum norm ‖ · ‖_∞ and there holds

sup_{‖x‖_∞=1} ‖Ax‖_∞ ≤ ‖A‖_∞ .

iii) In the case ‖A‖_∞ = 0 also A = 0 , i. e.,

‖A‖_∞ = sup_{‖x‖_∞=1} ‖Ax‖_∞ .

Therefore, let ‖A‖_∞ > 0 and m ∈ {1, . . . , n} an index such that

‖A‖_∞ = max_{1≤j≤n} Σ_{k=1}^n |a_{jk}| = Σ_{k=1}^n |a_{mk}| .

For k = 1, . . . , n, we set

z_k := |a_{mk}| / a_{mk}   for a_{mk} ≠ 0 ,       z_k := 0   otherwise,

i. e., z = (z_k)_{k=1}^n ∈ Kn , ‖z‖_∞ = 1 . For v := Az it follows that

v_m = Σ_{k=1}^n a_{mk} z_k = Σ_{k=1}^n |a_{mk}| = ‖A‖_∞ .

Consequently,

‖A‖_∞ = v_m ≤ ‖v‖_∞ = ‖Az‖_∞ ≤ sup_{‖y‖_∞=1} ‖Ay‖_∞ ,

what was to be shown. Q.E.D.
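As an added illustration, the norms of (1.1.38) can be evaluated directly (the small 3×3 test matrix is an arbitrary choice):

    import numpy as np

    A = np.array([[ 1.0, 0.1, -0.2],
                  [ 0.0, 2.0,  0.4],
                  [-0.2, 0.0,  3.0]])

    row_sum_norm = np.max(np.sum(np.abs(A), axis=1))   # ||A||_inf, maximal row sum
    col_sum_norm = np.max(np.sum(np.abs(A), axis=0))   # ||A||_1,  maximal column sum
    print(row_sum_norm, np.linalg.norm(A, np.inf))     # 3.2
    print(col_sum_norm, np.linalg.norm(A, 1))          # 3.6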

Let ‖ · ‖ be an arbitrary vector norm and ‖ · ‖ a corresponding compatible matrix
norm. Then, with a normalized eigenvector, ‖w‖ = 1 , corresponding to the eigenvalue λ
there holds

|λ| = |λ| ‖w‖ = ‖λw‖ = ‖Aw‖ ≤ ‖A‖ ‖w‖ = ‖A‖ ,                           (1.1.39)

i. e., all eigenvalues of A are contained in a circle in C with center at the origin and
radius ‖A‖ . Especially with ‖A‖_∞ , we obtain the eigenvalue bound

max_{λ∈σ(A)} |λ| ≤ ‖A‖_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_{ij}| .             (1.1.40)

Since the eigenvalues of ĀT and A are related by λ(ĀT ) = λ̄(A) , using the bound
(1.1.40) simultaneously for ĀT and A yields the following refined bound:

max_{λ∈σ(A)} |λ| ≤ min{ ‖A‖_∞ , ‖Ā^T‖_∞ }
                 = min{ max_{1≤i≤n} Σ_{j=1}^n |a_{ij}| , max_{1≤j≤n} Σ_{i=1}^n |a_{ij}| } .      (1.1.41)

The following lemma contains a useful result on the regularity of small perturbations
of the unit matrix.

Lemma 1.15 (Perturbation of unity): Let ‖ · ‖ be any natural matrix norm on Kn×n
and B ∈ Kn×n a matrix with ‖B‖ < 1 . Then, the perturbed matrix I + B is regular and
its inverse is given as the (convergent) “Neumann10 series”

(I + B)^{−1} = Σ_{k=0}^∞ (−B)^k .                                       (1.1.42)

Further, there holds

‖(I + B)^{−1}‖ ≤ 1 / (1 − ‖B‖) .                                        (1.1.43)

Proof. i) First, we show the regularity of I + B and the bound (1.1.43). For all x ∈ Kn
there holds

‖(I + B)x‖ ≥ ‖x‖ − ‖Bx‖ ≥ ‖x‖ − ‖B‖ ‖x‖ = (1 − ‖B‖) ‖x‖ .

In view of 1 − ‖B‖ > 0 this implies that I + B is injective and consequently regular.
Then, the following estimate implies (1.1.43):

1 = ‖I‖ = ‖(I + B)(I + B)^{−1}‖ = ‖(I + B)^{−1} + B(I + B)^{−1}‖
  ≥ ‖(I + B)^{−1}‖ − ‖B‖ ‖(I + B)^{−1}‖ = ‖(I + B)^{−1}‖ (1 − ‖B‖) > 0 .

ii) Next, we define

S := lim_{k→∞} S_k ,       S_k := Σ_{s=0}^k B^s .

S is well defined due to the fact that {Sn }n∈N is a Cauchy sequence with respect to the
matrix norm  ·  (and, by the norm equivalence in finite dimensional normed spaces,

10
Carl Gottfried Neumann (1832–1925): German mathematician; since 1858 “Privatdozent” and since
1863 apl. Prof. in Halle. After holding professorships in Basel and Tübingen he moved 1868 to Leipzig
where he worked for more than 40 years. He contributed to the theory of (partial) differential and integral
equations, especially to the Dirichlet problem. The “Neumann boundary condition” and the “Neumann
series” are named after him. In mathematical physics he worked on analytical mechanics and potential
theory. Together with A. Clebsch he founded the journal “Mathematische Annalen”.

with respect to any matrix norm). By employing the triangle inequality, using the matrix
norm property and the limit formula for the geometric series, we see that


‖S‖ = lim_{k→∞} ‖S_k‖ = lim_{k→∞} ‖ Σ_{s=0}^k B^s ‖ ≤ lim_{k→∞} Σ_{s=0}^k ‖B‖^s = lim_{k→∞} (1 − ‖B‖^{k+1}) / (1 − ‖B‖) = 1 / (1 − ‖B‖) .

Furthermore, S_k(I − B) = I − B^{k+1} and due to the fact that multiplication with I − B
is continuous,

I = lim_{k→∞} ( S_k(I − B) ) = ( lim_{k→∞} S_k ) (I − B) = S(I − B).

Hence, S = (I − B)^{−1} ; replacing B by −B yields the series representation (1.1.42) of
(I + B)^{−1} , and the proof is complete. Q.E.D.
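The following Python/NumPy sketch (an added illustration; B is an arbitrarily scaled random matrix) checks the series representation (1.1.42) and the bound (1.1.43) numerically:

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.standard_normal((4, 4))
    B *= 0.4 / np.linalg.norm(B, 2)          # scale such that ||B||_2 = 0.4 < 1

    S, P = np.eye(4), np.eye(4)              # partial sums of sum_k (-B)^k
    for _ in range(50):
        P = P @ (-B)
        S = S + P

    inv = np.linalg.inv(np.eye(4) + B)
    print(np.linalg.norm(S - inv, 2))                    # truncation error, tiny
    print(np.linalg.norm(inv, 2), 1.0 / (1.0 - 0.4))     # bound (1.1.43)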

Corollary 1.2: Let A ∈ Kn×n be a regular matrix and Ã another matrix such that

‖Ã − A‖ < 1 / ‖A^{−1}‖ .                                                (1.1.44)

Then, also Ã is regular. This means that the “resolvent set” Res(A) of a matrix A ∈
Kn×n is open in C and the only “singular” points are just the eigenvalues of A , i. e.,
there holds C = Res(A) ∪ σ(A) .

Proof. Notice that à = A + à − A = A(I + A−1 (à − A)) . In view of

A−1 (Ã − A) ≤ A−1  Ã − A < 1

by Lemma 1.15 the matrix I + A−1 (Ã − A) is regular. Then, also the product matrix
A(I + A−1 (à − A)) is regular, which implies the regularity of à . Q.E.D.

1.2 Spectra and pseudo-spectra of matrices

1.2.1 Stability of dynamical systems

We consider a finite dimensional dynamical system of the form

u′(t) = F(t, u(t)),   t ≥ 0,   u(0) = u^0 ,                             (1.2.45)

where u : [0, ∞) → Rn is a continuously differentiable vector function and the system


function F (·, ·) is assumed (for simplicity) to be defined on all of R × Rn and twice
continuously differentiable. The system (1.2.45) may originate from the discretization
of an infinite dimensional dynamical system such as the nonstationary Navier-Stokes
equations mentioned in the introductory Chapter 0. Suppose that u is a particular
solution of (1.2.45). We want to investigate its stability against small perturbations
u(t0 ) → u(t0 ) + w 0 =: v(t0 ) at any time t0 ≥ 0 . For this, we use the strongest concept of

stability, which is suggested by the corresponding properties of solutions of the Navier-


Stokes equations.

Definition 1.10: The solution u ∈ C^1[0, ∞; Rn) of (1.2.45) is called “exponentially
stable” if there are constants δ, K, κ ∈ R+ such that for any perturbation w^0 ∈ Rn ,
‖w^0‖_2 ≤ δ , at any time t_0 ≥ 0, there exists a secondary solution v ∈ C^1(t_0, ∞; Rn) of
the perturbed system

v′(t) = F(t, v(t)),   t ≥ t_0,   v(t_0) = u(t_0) + w^0 ,                (1.2.46)

and there holds

‖v(t) − u(t)‖_2 ≤ K e^{−κ(t−t_0)} ‖w^0‖_2 ,   t ≥ t_0 .                 (1.2.47)

For simplicity, we restrict the following discussion to the special situation of an au-
tonomous system, i. e., F (t, ·) ≡ F (·), and a stationary particular solution u(t) ≡ u ∈ Rn ,
i. e., to the solution of the nonlinear system

F (u) = 0. (1.2.48)

The investigation of the stability of u leads us to consider the so-called “perturbation


equation” for the perturbation w(t) := v(t) − u ,

w′(t) = F(v(t)) − F(u) = F′(u)w(t) + O(‖w(t)‖_2^2),   t ≥ 0,   w(0) = w^0 ,      (1.2.49)

where the higher-order term depends on bounds on u and u as well as on the smoothness
properties of F (·) .

Theorem 1.6: Suppose that the Jacobian A := F  (u) is diagonalizable and that all its
eigenvalues have negative real part. Then, the solution u of (1.2.48) is exponentially stable
in the sense of Definition 1.10 with the constants κ = |Reλmax | and K = cond2 (W ) ,
where λmax is the eigenvalue of A with largest (negative) real part and W = [w 1 , . . . , w n ]
the column matrix formed by the (normalized) eigenbasis of A . If A is normal then
K = cond2 (W ) = 1 .

Proof. i) Consider the linearized system (linearized perturbation equation)

w′(t) = Aw(t),   t ≥ 0,   w(0) = w^0 .                                  (1.2.50)

Since the Jacobian A is diagonalizable there exists an ONB {w 1 , . . . , w n } of eigenvectors


of A :
Aw i = λi w i , i = 1, . . . , n.
With the matrices W := [w 1 , . . . , w n ] and Λ := diag(λi ) there holds

W −1 AW = Λ, A = W ΛW −1 .

Using this notation the perturbation equation can be rewritten in the form

w′(t) = Aw(t)  ⇔  w′(t) = W ΛW^{−1} w(t)  ⇔  (W^{−1}w)′(t) = Λ W^{−1} w(t),

or for the transformed variable v := W^{−1}w componentwise:

v_i′(t) = λ_i v_i(t),   t ≥ 0,   v_i(0) = (W^{−1}w)_i(0).

The solution behavior is (observe that |e^{i Im λ_i t}| = 1 )

|v_i(t)| ≤ e^{Re λ_i t} |(W^{−1}w)_i(0)| ,   t ≥ 0 .

This implies:

‖v(t)‖_2^2 ≤ Σ_{i=1}^n |v_i(t)|^2 ≤ Σ_{i=1}^n e^{2 Re λ_i t} |(W^{−1}w)_i(0)|^2 ≤ e^{2 Re λ_max t} ‖(W^{−1}w)(0)‖_2^2 ,

and consequently,

‖w(t)‖_2 = ‖W v(t)‖_2 ≤ ‖W‖_2 ‖v(t)‖_2 ≤ ‖W‖_2 e^{Re λ_max t} ‖(W^{−1}w)(0)‖_2
         ≤ ‖W‖_2 e^{Re λ_max t} ‖W^{−1}‖_2 ‖w(0)‖_2                     (1.2.51)
         = cond_2(W) e^{Re λ_max t} ‖w(0)‖_2 .

The condition number of W can become arbitrarily large depending on the “non-orthogo-
nality” of the eigenbasis of the Jacobian A .
ii) The assertion now follows by combining (1.2.51) and (1.2.49) within a continuation
argument. The proof is complete. Q.E.D.
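A numerical illustration of the bound (1.2.51) (added here; it uses scipy.linalg.expm for the solution operator S(t) = e^{tA} of (1.2.50), and the non-normal test matrix is an arbitrary choice):

    import numpy as np
    from scipy.linalg import expm

    A = np.array([[-1.0, 5.0],
                  [ 0.0, -2.0]])              # eigenvalues -1 and -2, non-normal
    lam, W = np.linalg.eig(A)
    cond_W = np.linalg.cond(W, 2)
    re_max = np.max(lam.real)

    for t in [0.0, 0.5, 1.0, 2.0, 5.0]:
        St = np.linalg.norm(expm(t * A), 2)               # ||S(t)||_2
        print(t, St, cond_W * np.exp(re_max * t))         # bound from (1.2.51)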

Following the argument in the proof of Theorem 1.6, we see that the occurrence of
just one eigenvalue with Re λ > 0 inevitably causes dynamic instability of the solution
u , i. e., arbitrarily small perturbations may grow in time without bound. Denoting by
S : Rn → C 1 [0, ∞; Rn) the “solution operator” of the linearized perturbation equation
(1.2.50), i. e., w(t) = S(t)w 0 , this can be formulated as

max_{λ∈σ(A)} Re λ > 0   ⇒   sup_{t≥0} ‖S(t)‖_2 = ∞ .                    (1.2.52)

The result of Theorem 1.6 can be extended to the case of a non-diagonalizable Jacobian
A = F  (u) . In this case, one obtains a stability behavior of the form

‖S(t)‖_2 ≈ K (1 + t^α) e^{Re λ_max t} ,   t ≥ 0,                        (1.2.53)

where α ≥ 1 is the defect of the most critical eigenvalue λ_max , i. e., that eigenvalue with
largest real part Re λ_max < 0 . This implies that

sup_{t>0} ‖S(t)‖_2 ≈ (α/e)^α · 1/|Re λ_max|^α ,                          (1.2.54)

i. e., for −1 ≪ Re λ_max < 0 initially small perturbations may grow beyond a value at
which nonlinear instability is triggered. Summarizing, we are interested in the case that
all eigenvalues of A = F′(u) have negative real part, suggesting stability in the sense of
Theorem 1.6, and especially want to compute the most “critical” eigenvalue, i. e., that
λ ∈ σ(A) with maximal Re λ < 0 , to detect whether the corresponding solution operator
S(t) may behave in a critical way.
The following result, which is sometimes addressed as the “easy part of the Kreiss11
matrix theorem” indicates in which direction this analysis has to go.

Lemma 1.16: Let A := F′(u) and z ∈ C \ σ(A) with Re z > 0 . Then, for the solution
operator S(t) of the linearized perturbation equation (1.2.50), there holds

sup_{t≥0} ‖S(t)‖_2 ≥ |Re z| ‖(zI − A)^{−1}‖_2 .                          (1.2.55)

Proof. We continue using the notation from the proof of Theorem 1.6. If ‖S(t)‖_2 is
unbounded over [0, ∞) , the asserted estimate holds trivially. Hence, let us assume that

sup_{t≥0} ‖w(t)‖_2 = sup_{t≥0} ‖S(t)w^0‖_2 ≤ sup_{t≥0} ‖S(t)‖_2 ‖w^0‖_2 < ∞.

For z ∉ σ(A) the resolvent R_A(z) = zI − A is regular. Let w^0 ∈ Kn be an arbitrary
but nontrivial initial perturbation and w(t) = S(t)w^0 . We rewrite equation (1.2.50) in
the form

∂_t w − zw + (zI − A)w = 0,

and multiply by e^{−tz} , to obtain

∂_t (e^{−tz} w) + e^{−tz} (zI − A)w = 0.

Next, integrating this over 0 ≤ t < ∞ and observing Re z > 0 and lim_{t→∞} e^{−tz} w = 0
yields

(zI − A)^{−1} w^0 = ( ∫_0^∞ e^{−tz} S(t) dt ) w^0 .

From this, we conclude

‖(zI − A)^{−1}‖_2 ≤ ( ∫_0^∞ e^{−t|Re z|} dt ) sup_{t>0} ‖S(t)‖_2 ≤ |Re z|^{−1} sup_{t>0} ‖S(t)‖_2 ,

which implies the asserted estimate. Q.E.D.


The above estimate (1.2.55) for the solution operator S(t) can be interpreted as
follows: Even if all eigenvalues of the matrix A have negative real parts, which in view

11
Heinz-Otto Kreiss (1930–2015): Swedish/US-American mathematician; worked in Numerical Analy-
sis and in the new field Scientific Computing in the early 1960s; born in Hamburg, Germany, he studied
and worked at the Kungliga Tekniska Högskolan in Stockholm, Sweden; he published a number of books;
later he became Prof. at the California Institute of Technology and University of California, Los Angeles
(UCLA).

of Theorem 1.6 would indicate stability of solutions to (1.2.50), there may be points z in
the right complex half plane for which ‖(zI − A)^{−1}‖_2 ≫ |Re z|^{−1} and consequently,

sup_{t≥0} ‖S(t)‖_2 ≫ 1 .                                                (1.2.56)

Hence, even small perturbations of the particular solution u may be largely amplified
eventually triggering nonlinear instability.

1.2.2 Pseudo-spectrum of a matrix

The estimate (1.2.55) makes us search for points z ∈ C \ σ(A) with Re z > 0 and

‖(zI − A)^{−1}‖_2 ≫ |Re z|^{−1} .

This suggests the concept of the “pseudo-spectrum” of the matrix A , which goes back
to Landau [9] and has been extensively described and applied in the stability analysis of
dynamical systems, e. g., in Trefethen [20] and Trefethen & Embree [22].

Definition 1.11 (Pseudo-spectrum): For ε ∈ R+ the “ε-pseudo-spectrum” σ_ε(A) ⊂
C of a matrix A ∈ Kn×n is defined by

σ_ε(A) := { z ∈ C \ σ(A) | ‖(A − zI)^{−1}‖_2 ≥ ε^{−1} } ∪ σ(A).          (1.2.57)
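Since ‖(A − zI)^{−1}‖_2 is the reciprocal of the smallest singular value of A − zI , the ε-pseudo-spectrum can be probed on a grid in the complex plane. The following Python/NumPy sketch (an added illustration; matrix, grid and ε are arbitrary choices) does exactly this:

    import numpy as np

    def in_pseudospectrum(A, eps, re, im):
        # mark the grid points z with sigma_min(A - zI) <= eps, i.e., z in sigma_eps(A)
        n = A.shape[0]
        mask = np.zeros((len(im), len(re)), dtype=bool)
        for i, y in enumerate(im):
            for j, x in enumerate(re):
                z = x + 1j * y
                smin = np.linalg.svd(A - z * np.eye(n), compute_uv=False)[-1]
                mask[i, j] = smin <= eps
        return mask

    A = np.array([[0.0, 4.0],
                  [0.0, 0.0]])                 # non-normal test matrix
    re = np.linspace(-1.5, 1.5, 61)
    im = np.linspace(-1.5, 1.5, 61)
    mask = in_pseudospectrum(A, 0.1, re, im)
    print(mask.sum(), "grid points lie in the 0.1-pseudo-spectrum")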

Remark 1.7: The concept of a pseudo-spectrum is interesting only for non-normal opera-
tors, since for a normal operator σε (A) is just the union of ε-circles around its eigenvalues.
This follows from the estimate (see Dunford & Schwartz [8] or Kato [12])

‖(A − zI)^{−1}‖_2 ≥ dist(z, σ(A))^{−1} ,   z ∉ σ(A),                    (1.2.58)

where equality holds if A is normal.

Remark 1.8: The concept of the “pseudo-spectrum” can be introduced in much more
general situations, such as that of closed linear operators in abstract Hilbert or Banach
spaces (see Trefethen & Embree [22]). Typically hydrodynamic stability analysis concerns
differential operators defined on bounded domains. This situation fits into the Hilbert-
space framework of “closed unbounded operators with compact inverse”.

Using the notion of the pseudo-spectrum the estimate (1.2.55) can be expressed in the
following form
sup_{t≥0} ‖S(t)‖_2 ≥ sup{ |Re z| / ε  |  ε > 0, z ∈ σ_ε(A), Re z > 0 } ,      (1.2.59)

or

max_{λ∈σ_ε(A)} Re λ > Kε   ⇒   sup_{t≥0} ‖S(t)‖_2 > K .                  (1.2.60)

Below, we will present methods for computing estimates for the pseudo-spectrum of a
matrix. This will be based on related methods for solving the partial eigenvalue problem.
To this end, we provide some results on several basic properties of the pseudo-spectrum.

Lemma 1.17: i) For a matrix A ∈ Kn×n the following definitions of an ε-pseudo-
spectrum are equivalent:

a) σ_ε(A) := { z ∈ C \ σ(A) | ‖(A − zI)^{−1}‖_2 ≥ ε^{−1} } ∪ σ(A).

b) σ_ε(A) := { z ∈ C | z ∈ σ(A + E) for some E ∈ Kn×n with ‖E‖_2 ≤ ε }.

c) σ_ε(A) := { z ∈ C | ‖(A − zI)v‖_2 ≤ ε for some v ∈ Kn with ‖v‖_2 = 1 }.

ii) Let 0 ∉ σ(A) . Then, the ε-pseudo-spectra of A and that of its inverse A^{−1} are related
by

σ_ε(A) ⊂ { z ∈ C \ {0} | z^{−1} ∈ σ_{δ(z)}(A^{−1}) } ∪ {0},              (1.2.61)

where δ(z) := ε‖A^{−1}‖_2/|z| and, for 0 < ε < 1 , by

σ_ε(A^{−1}) ∩ B_1(0)^c ⊂ { z ∈ C \ {0} | z^{−1} ∈ σ_δ(A) },              (1.2.62)

where B_1(0) := {z ∈ C, |z| ≤ 1} and δ := ε/(1 − ε) .

Proof. The proof of part (i) can be found in Trefethen & Embree [22]. For completeness,
we recall a sketch of the argument. The proof of part (ii) is taken from Gerecht et al.
[35].
ia) In all three definitions, we have σ(A) ⊂ σ_ε(A) . Let z ∈ σ_ε(A) in the sense of definition
(a). There exists a w ∈ Kn with ‖w‖_2 = 1 , such that ‖(zI − A)^{−1}w‖_2 ≥ ε^{−1} . Hence,
there is a v ∈ Kn with ‖v‖_2 = 1 , and s ∈ (0, ε] , such that (zI − A)^{−1}w = s^{−1}v or
(zI − A)v = sw . Let Q(v, w) ∈ Kn×n denote the unitary matrix, which rotates the unit
vector v into the unit vector w , such that sw = sQ(v, w)v . Then, z ∈ σ(A + E) where
E := sQ(v, w) with ‖E‖_2 ≤ ε , i. e., z ∈ σ_ε(A) in the sense of definition (b). Let now
z ∈ σ_ε(A) in the sense of definition (b), i. e., there exists E ∈ Kn×n with ‖E‖_2 ≤ ε
such that (A + E)w = zw , with some w ∈ Kn , w ≠ 0 . Hence, (A − zI)w = −Ew , and
therefore,

‖(A − zI)^{−1}‖_2 = sup_{v∈Kn\{0}} ‖(A − zI)^{−1}v‖_2 / ‖v‖_2 = sup_{v∈Kn\{0}} ‖v‖_2 / ‖(A − zI)v‖_2
  = ( inf_{v∈Kn\{0}} ‖(A − zI)v‖_2 / ‖v‖_2 )^{−1} ≥ ( ‖(A − zI)w‖_2 / ‖w‖_2 )^{−1}
  = ( ‖Ew‖_2 / ‖w‖_2 )^{−1} ≥ ‖E‖_2^{−1} ≥ ε^{−1} .

Hence, z ∈ σ_ε(A) in the sense of (a). This proves the equivalence of definitions (a) and
(b).

ib) Next, let again z ∈ σ_ε(A) \ σ(A) in the sense of definition (a). Then,

ε ≥ ‖(A − zI)^{−1}‖_2^{−1} = ( sup_{w∈Kn\{0}} ‖(A − zI)^{−1}w‖_2 / ‖w‖_2 )^{−1} = inf_{v∈Kn\{0}} ‖(A − zI)v‖_2 / ‖v‖_2 .

Hence, there exists a v ∈ Kn with ‖v‖_2 = 1 , such that ‖(A − zI)v‖_2 ≤ ε , i. e., z ∈ σ_ε(A)
in the sense of definition (c). By the same argument, now used in the reversed direction,
we see that z ∈ σ_ε(A) in the sense of definition (c) implies that also z ∈ σ_ε(A) in the
sense of definition (a). Thus, definition (a) is also equivalent to condition (c).
iia) We use the definition (c) from part (i) for the ε-pseudo-spectrum. Let z ∈ σ_ε(A) and
accordingly v ∈ Kn , ‖v‖_2 = 1 , satisfying ‖(A − zI)v‖_2 ≤ ε . Then,

‖(A^{−1} − z^{−1}I)v‖_2 = ‖z^{−1} A^{−1} (zI − A)v‖_2 ≤ |z|^{−1} ‖A^{−1}‖_2 ε .

This proves the asserted relation (1.2.61).


iib) To prove the relation (1.2.62), we again use the definition (c) from part (i) for the
ε-pseudo-spectrum. Accordingly, for z ∈ σ_ε(A^{−1}) with |z| ≥ 1 there exists a unit vector
v ∈ Kn , ‖v‖_2 = 1 , such that

ε ≥ ‖(zI − A^{−1})v‖_2 = |z| ‖(A − z^{−1}I)A^{−1}v‖_2 .

Then, setting w := ‖A^{−1}v‖_2^{−1} A^{−1}v with ‖w‖_2 = 1 , we obtain

‖(A − z^{−1}I)w‖_2 ≤ |z|^{−1} ‖A^{−1}v‖_2^{−1} ε .

Hence, observing that

‖A^{−1}v‖_2 = ‖(A^{−1} − zI)v + zv‖_2 ≥ ‖zv‖_2 − ‖(A^{−1} − zI)v‖_2 ≥ |z| − ε,

we conclude that

‖(A − z^{−1}I)w‖_2 ≤ ε / ( |z| (|z| − ε) ) ≤ ε / (1 − ε) .
This completes the proof. Q.E.D.
The next proposition relates the size of the resolvent norm ‖(zI − A)^{−1}‖_2 to easily
computable quantities in terms of the eigenvalues and eigenvectors of the matrix A =
F′(u) .

Theorem 1.7: Let λ ∈ C be a non-deficient eigenvalue of the matrix A := F′(u) with
corresponding primal and dual eigenvectors v, v^* ∈ Kn normalized by ‖v‖_2 = (v, v^*)_2 = 1.
Then, there exists a continuous function ω : R+ → C with lim_{ε→0+} ω(ε) = 1 , such that
for λ_ε := λ − εω(ε)‖v^*‖_2 , there holds

‖(A − λ_ε I)^{−1}‖_2 ≥ ε^{−1} ,                                          (1.2.63)

i. e., the point λε lies in the ε-pseudo-spectrum of the matrix A .



Proof. The argument of the proof is recalled from Gerecht et. al. [35] where it is
developed within a function space setting and has therefore to be simplified here for the
finite dimensional situation.
i) Let B ∈ Kn×n be a matrix with B2 ≤ 1. We consider the perturbed eigenvalue
problem

(A + εB)vε = λε vε . (1.2.64)

Since this is a regular perturbation and λ non-deficient, there exist corresponding eigen-
values λ_ε ∈ C and eigenvectors v_ε ∈ Kn , ‖v_ε‖_2 = 1, such that

|λ_ε − λ| = O(ε) ,    ‖v_ε − v‖_2 = O(ε) .

Furthermore, from the relation

(A − λ_ε I)v_ε = −εBv_ε ,

we conclude that

‖(A − λ_ε I)v_ε‖_2 ≤ ε ‖B‖_2 ‖v_ε‖_2 ≤ ε ‖v_ε‖_2 ,

and from this, if λ_ε is not an eigenvalue of A ,

‖(A − λ_ε I)^{−1}‖_2^{−1} = ( sup_{y∈Kn} ‖(A − λ_ε I)^{−1}y‖_2 / ‖y‖_2 )^{−1} = ( sup_{x∈Kn} ‖x‖_2 / ‖(A − λ_ε I)x‖_2 )^{−1}
  = inf_{x∈Kn} ‖(A − λ_ε I)x‖_2 / ‖x‖_2 ≤ ‖(A − λ_ε I)v_ε‖_2 / ‖v_ε‖_2 ≤ ε.

This implies the asserted estimate

‖(A − λ_ε I)^{−1}‖_2 ≥ ε^{−1} .                                          (1.2.65)

ii) Next, we analyze the dependence of the eigenvalue λε on ε in more detail. Subtracting
the equation for v from that for vε , we obtain

A(vε − v) + εBvε = (λε − λ)vε + λ(vε − v).

Multiplying this by v^* yields

(A(v_ε − v), v^*)_2 + ε(Bv_ε, v^*)_2 = (λ_ε − λ)(v_ε, v^*)_2 + λ(v_ε − v, v^*)_2

and, using the equation satisfied by v^* ,

ε(Bv_ε, v^*)_2 = (λ_ε − λ)(v_ε, v^*)_2 .

This yields λ_ε = λ + εω(ε)(Bv, v^*)_2 , where, observing v_ε → v and (v, v^*)_2 = 1,

ω(ε) := (Bv_ε, v^*)_2 / ( (v_ε, v^*)_2 (Bv, v^*)_2 ) → 1 (ε → 0).

iii) It remains to construct an appropriate perturbation matrix B . For convenience, we
consider the renormalized dual eigenvector ṽ^* := v^* ‖v^*‖_2^{−1} , satisfying ‖ṽ^*‖_2 = 1 . With
the vector w := (v − ṽ^*) ‖v − ṽ^*‖_2^{−1} , we set for ψ ∈ Kn :

Sψ := ψ − 2 Re(ψ, w)_2 w ,     B := −S.

The unitary matrix S acts like a Householder transformation mapping v into ṽ^* (s. the
discussion in Section 2.3.1, below). In fact, observing ‖v‖_2 = ‖ṽ^*‖_2 = 1 , there holds

Sv = v − [ 2 Re(v, v − ṽ^*)_2 / ‖v − ṽ^*‖_2^2 ] (v − ṽ^*)
   = [ {2 − 2 Re(v, ṽ^*)_2} v − 2 Re(v, v − ṽ^*)_2 (v − ṽ^*) ] / [ 2 − 2 Re(v, ṽ^*)_2 ]
   = [ 2v − 2 Re(v, ṽ^*)_2 v − 2v + 2 Re(v, ṽ^*)_2 v + (2 − 2 Re(v, ṽ^*)_2) ṽ^* ] / [ 2 − 2 Re(v, ṽ^*)_2 ] = ṽ^* .

This implies that

(Bv, v^*)_2 = −(Sv, v^*)_2 = −(ṽ^*, v^*)_2 = −‖v^*‖_2 .

Further, observing ‖w‖_2 = 1 and

‖Sv‖_2^2 = ‖v‖_2^2 − 2 Re(v, w)_2 (v, w)_2 − 2 Re(v, w)_2 (w, v)_2 + 4 Re(v, w)_2^2 ‖w‖_2^2 = ‖v‖_2^2 ,

we have ‖B‖_2 = ‖S‖_2 = 1. Hence, for this particular choice of the matrix B , we have

λ_ε = λ − εω(ε)‖v^*‖_2 ,     lim_{ε→0} ω(ε) = 1 ,

as asserted. Q.E.D.

Remark 1.9: i) We note that the statement of Theorem 1.7 becomes trivial if the matrix
A is normal. In this case primal and dual eigenvectors coincide and, in view of Remark
1.7, σ_ε(A) is the union of ε-circles around its eigenvalues λ . Hence, observing ‖w^*‖_2 =
‖w‖_2 = 1 and setting ω(ε) ≡ 1 , we trivially have λ_ε := λ − ε ∈ σ_ε(A) as asserted.
ii) If A is non-normal it may have a nontrivial pseudo-spectrum. Then, a large norm
‖w^*‖_2 of the dual eigenvector corresponding to a critical eigenvalue λ_crit with −1 ≪
Re λ_crit < 0 indicates that the ε-pseudo-spectrum σ_ε(A) , even for small ε , reaches into
the right complex half plane.
iii) If the eigenvalue λ ∈ σ(A) considered in Theorem 1.7 is deficient, the normalization
(w, w^*)_2 = 1 is not possible. In this case, as discussed above, there is still another
mechanism for triggering nonlinear instability.

1.3 Perturbation theory and conditioning

First, we analyze the “conditioning” of quadratic linear systems. There are two main
sources of errors in solving an equation Ax = b :

a) errors in the “theoretical” solution caused by errors in the data, i. e., the elements
of A and b ,
b) errors in the “numerical” solution caused by round-off errors in the course of the
solution process.

1.3.1 Conditioning of linear algebraic systems

We give an error analysis for linear systems

Ax = b (1.3.66)

with regular coefficient matrix A ∈ Kn×n . The matrix A and the vector b are faulty by
small errors δA and δb , so that actually the perturbed system

Ãx̃ = b̃, (1.3.67)

is solved with à = A + δA , b̃ = b + δb and x̃ = x + δx . We want to estimate the error


δx in dependence of δA and δb . For this, we use an arbitrary vector norm  ·  and the
associated natural matrix norm likewise denoted by  · .

Theorem 1.8 (Perturbation theorem): Let the matrix A ∈ Kn×n be regular and the
perturbation satisfy ‖δA‖ < ‖A^{−1}‖^{−1} . Then, the perturbed matrix Ã = A + δA is also
regular and for the resulting relative error in the solution there holds

‖δx‖ / ‖x‖ ≤ cond(A) / (1 − cond(A) ‖δA‖/‖A‖) { ‖δb‖/‖b‖ + ‖δA‖/‖A‖ } ,      (1.3.68)

with the so-called “condition number” cond(A) := ‖A‖ ‖A^{−1}‖ of the matrix A .

Proof. The assumptions imply

‖A^{−1} δA‖ ≤ ‖A^{−1}‖ ‖δA‖ < 1 ,

such that also A + δA = A[I + A−1 δA] is regular by Lemma 1.15. From

(A + δA)x̃ = b + δb , (A + δA)x = b + δAx

it follows that then for δx = x̃ − x

(A + δA)δx = δb − δAx,

and consequently using the estimate of Lemma 1.15,



‖δx‖ ≤ ‖(A + δA)^{−1}‖ { ‖δb‖ + ‖δA‖ ‖x‖ }
     = ‖[ A(I + A^{−1}δA) ]^{−1}‖ { ‖δb‖ + ‖δA‖ ‖x‖ }
     = ‖(I + A^{−1}δA)^{−1} A^{−1}‖ { ‖δb‖ + ‖δA‖ ‖x‖ }
     ≤ ‖(I + A^{−1}δA)^{−1}‖ ‖A^{−1}‖ { ‖δb‖ + ‖δA‖ ‖x‖ }
     ≤ ‖A^{−1}‖ / (1 − ‖A^{−1}‖ ‖δA‖) { ‖δb‖ + ‖δA‖ ‖x‖ }
     ≤ ‖A^{−1}‖ ‖A‖ ‖x‖ / (1 − ‖A^{−1}‖ ‖δA‖) { ‖δb‖/(‖A‖ ‖x‖) + ‖δA‖/‖A‖ } .

Since ‖b‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖ it eventually follows that

‖δx‖ ≤ cond(A) / (1 − cond(A) ‖δA‖/‖A‖) { ‖δb‖/‖b‖ + ‖δA‖/‖A‖ } ‖x‖ ,

what was to be shown. Q.E.D.


The condition number cond(A) depends on the chosen vector norm in the estimate
(1.3.68). Most often the max-norm ‖ · ‖_∞ or the euclidian norm ‖ · ‖_2 are used. In the
first case there holds

cond_∞(A) := ‖A‖_∞ ‖A^{−1}‖_∞

with the maximal row sum ‖ · ‖_∞ . Especially for Hermitian matrices Lemma 1.13 yields

cond_2(A) := ‖A‖_2 ‖A^{−1}‖_2 = |λ_max| / |λ_min|

with the eigenvalues λ_max and λ_min of A with largest and smallest modulus, respectively.
Accordingly, the quantity cond_2(A) is called the “spectral condition (number)” of A . In
the case cond(A) ‖δA‖/‖A‖ ≪ 1 , the stability estimate (1.3.68) takes the form

‖δx‖ / ‖x‖ ≈ cond(A) { ‖δb‖/‖b‖ + ‖δA‖/‖A‖ } ,

i. e., cond(A) is the amplification factor by which relative errors in the data A and b
affect the relative error in the solution x .

Corollary 1.3: Let the condition of A be of size cond(A) ∼ 10^s . If the elements of
A and b are faulty with a relative error of size

‖δA‖/‖A‖ ≈ 10^{−k} ,   ‖δb‖/‖b‖ ≈ 10^{−k}   (k > s),

then the relative error in the solution can be at most of size

‖δx‖/‖x‖ ≈ 10^{s−k} .

In the case ‖ · ‖ = ‖ · ‖_∞ , one may lose s decimals in accuracy.

Example 1.4: Consider the following coefficient matrix A and its inverse A^{−1} :

A = ⎡ 1.2969  0.8648 ⎤           A^{−1} = 10^8 ⎡  0.1441  −0.8648 ⎤
    ⎣ 0.2161  0.1441 ⎦ ,                       ⎣ −0.2161   1.2969 ⎦ ,

‖A‖_∞ = 2.1617 ,   ‖A^{−1}‖_∞ = 1.513 · 10^8   ⇒   cond_∞(A) ≈ 3.3 · 10^8 .


In solving the linear system Ax = b, one may lose 8 decimals in accuracy by which the
elements ajk and bj are given. Hence, this matrix is very ill-conditioned.
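This behavior is easily reproduced numerically; in the following Python/NumPy sketch (an added illustration) the exact solution and the perturbation of the right-hand side are arbitrary choices:

    import numpy as np

    A = np.array([[1.2969, 0.8648],
                  [0.2161, 0.1441]])
    print(np.linalg.cond(A, np.inf))            # ~ 3.3e8, as computed above

    b  = A @ np.array([1.0, 1.0])               # exact solution x = (1, 1)^T
    db = 1e-8 * np.array([1.0, -1.0])           # tiny perturbation of b
    x  = np.linalg.solve(A, b)
    xp = np.linalg.solve(A, b + db)
    print(np.linalg.norm(xp - x) / np.linalg.norm(x))   # relative error of order one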

Finally, we demonstrate that the stability estimate (1.3.68) is essentially sharp. Let
A be a positive definite n × n-matrix with smallest and largest eigenvalues λ1 and λn
and corresponding normalized eigenvectors w1 and wn , respectively. We choose

δA ≡ 0 ,    b ≡ w_n ,    δb ≡ ε w_1    (ε ≠ 0).

Then, the equations Ax = b and Ax̃ = b + δb have the solutions

x = λ_n^{−1} w_n ,     x̃ = λ_n^{−1} w_n + ε λ_1^{−1} w_1 .

Consequently, for δx = x̃ − x there holds

‖δx‖_2 / ‖x‖_2 = ε (λ_n ‖w_1‖_2) / (λ_1 ‖w_n‖_2) = cond_2(A) ‖δb‖_2 / ‖b‖_2 ,

i. e., in this very special case the estimate (1.3.68) is sharp.

1.3.2 Conditioning of eigenvalue problems

The most natural way of computing eigenvalues of a matrix A ∈ Kn×n appears to go


via its definition as zeros of the characteristic polynomial χA (·) of A and to compute
corresponding eigenvectors by solving the singular system (A − λI)w = 0 . This approach
is not advisable in general since the determination of zeros of a polynomial may be highly
ill-conditioned, at least if the polynomial is given in canonical form as sum of monomials.
We will see that the determination of eigenvalues may be well- or ill-conditioned depending
on the properties of A , i. e., its deviation from being “normal”.

Example 1.5: A symmetric matrix A ∈ R20×20 with eigenvalues λj = j , j = 1, . . . , 20,


has the characteristic polynomial


χ_A(z) = Π_{j=1}^{20} (z − j) = z^{20} − 210 z^{19} + . . . + 20! ,

i. e., in the notation of Definition 1.7, b_1 = −210 and b_20 = 20! .

The coefficient b_1 is perturbed: b̃_1 = −210 + 2^{−23} ≈ −210.000000119 . . . , which results in

 b̃ − b 
 1 1
relative error   ∼ 10−10 .
b1
Then, the perturbed polynomial χ̃A (z) has two roots λ± ∼ 16.7 ± 2.8i, far away from
the trues.

The above example shows that via the characteristic polynomial eigenvalues may be
computed reliably only for very special matrices, for which χ_A(z) can be computed with-
out determining its monomial form. Examples of some practical importance are, e. g.,
“tridiagonal matrices” or more general “Hessenberg12 matrices”:

⎡ a_1  b_1               ⎤         ⎡ a_11   · · ·        a_1n    ⎤
⎢ c_2  a_2   ⋱            ⎥         ⎢ a_21    ⋱             ⋮     ⎥
⎢       ⋱    ⋱   b_{n−1}  ⎥         ⎢          ⋱        a_{n−1,n} ⎥
⎣            c_n   a_n    ⎦         ⎣  0    a_{n,n−1}     a_nn    ⎦
       tridiagonal matrix                 Hessenberg matrix
Next, we provide a useful estimate which will be the basis for estimating the condi-
tioning of the eigenvalue problem.

Lemma 1.18: Let A, B ∈ Kn×n be arbitrary matrices and ‖ · ‖ a natural matrix norm.
Then, for any eigenvalue λ of A , which is not an eigenvalue of B , there holds

‖(λI − B)^{−1}(A − B)‖ ≥ 1 .                                            (1.3.69)

Proof. If w is an eigenvector corresponding to the eigenvalue λ of A it follows that

(A − B) w = (λI − B) w,

and for λ not being an eigenvalue of B,

(λI − B)−1 (A − B) w = w.

Consequently,

1 ≤ sup_{x∈Kn\{0}} ‖(λI − B)^{−1}(A − B) x‖ / ‖x‖ = ‖(λI − B)^{−1}(A − B)‖ ,

what was to be shown. Q.E.D.


As consequence of Lemma 1.18, we obtain the following important inclusion theorem
of Gerschgorin13 (1931).

12
Karl Hessenberg (1904–1959): German mathematician; dissertation “Die Berechnung der Eigenwerte
und Eigenlösungen linearer Gleichungssysteme”, TU Darmstadt 1942.
13
Semyon Aranovich Gershgorin (1901–1933): Russian mathematician; since 1930 Prof. in Leningrad
(St. Petersburg); worked in algebra, complex function theory differential equations and numerics.

Theorem 1.9 (Theorem of Gerschgorin): All eigenvalues of a matrix A ∈ Kn×n are
contained in the union of the corresponding “Gerschgorin circles”

K_j := { z ∈ C : |z − a_{jj}| ≤ Σ_{k=1, k≠j}^n |a_{jk}| } ,   j = 1, . . . , n.      (1.3.70)

If the sets U ≡ ∪_{i=1}^m K_{j_i} and V ≡ ∪_{j=1}^n K_j \ U are disjoint, then U contains exactly m
and V exactly n − m eigenvalues of A (counted accordingly to their algebraic multiplic-
ities).

Proof. i) We set B ≡ D = diag(a_{jj}) in Lemma 1.18 and take the “maximal row sum”
as natural matrix norm. Then, it follows that for λ ≠ a_{jj} :

‖(λI − D)^{−1}(A − D)‖_∞ = max_{j=1,...,n} ( 1/|λ − a_{jj}| ) Σ_{k=1,k≠j}^n |a_{jk}| ≥ 1 ,
i. e., λ is contained in one of the Gerschgorin circles.


ii) For proving the second assertion, we set At ≡ D + t(A − D) . Obviously exactly m
eigenvalues of A0 = D are in U and n − m eigenvalues in V . The same then also
follows for A1 = A since the eigenvalues of At (ordered accordingly to their algebraic
multiplicities) are continuous functions of t . Q.E.D.
The theorem of Gerschgorin yields much more accurate information on the position of
eigenvalues λ of A than the rough estimate |λ| ≤ A∞ derived above. The eigenvalues
of the matrices A and ĀT are related by λ(ĀT ) = λ(A) . By applying the Gerschgorin
theorem simultaneously to A and ĀT , one may obtain a sharpening of the estimates for
the eigenvalues.

Example 1.6: Consider the 3 × 3-matrix

A = ⎡  1    0.1  −0.2 ⎤
    ⎢  0    2     0.4 ⎥ ,        ‖A‖_∞ = 3.2 ,   ‖A‖_1 = 3.6.
    ⎣ −0.2  0     3   ⎦

[Figure 1.2: Gerschgorin circles of A and A^T in the complex plane, centered at 1, 2 and 3.]



K1 = {z ∈ C : |z − 1| ≤ 0.3} K1T = {z ∈ C : |z − 1| ≤ 0.2}


K2 = {z ∈ C : |z − 2| ≤ 0.4} K2T = {z ∈ C : |z − 2| ≤ 0.1}
K3 = {z ∈ C : |z − 3| ≤ 0.2} K3T = {z ∈ C : |z − 3| ≤ 0.6}

|λ1 − 1| ≤ 0.2 , |λ2 − 2| ≤ 0.1 , |λ3 − 3| ≤ 0.2
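The following Python/NumPy sketch (added here) computes the Gerschgorin circles of this example matrix and of its transpose and compares them with the actual eigenvalues:

    import numpy as np

    def gerschgorin_circles(A):
        # (center, radius) pairs: center a_jj, radius = off-diagonal row sum
        radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
        return list(zip(np.diag(A), radii))

    A = np.array([[ 1.0, 0.1, -0.2],
                  [ 0.0, 2.0,  0.4],
                  [-0.2, 0.0,  3.0]])
    print(gerschgorin_circles(A))        # radii 0.3, 0.4, 0.2
    print(gerschgorin_circles(A.T))      # radii 0.2, 0.1, 0.6
    print(np.linalg.eigvals(A))          # eigenvalues close to 1, 2, 3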

Next, from the estimate of Lemma 1.18, we derive the following basic stability result
for the eigenvalue problem.

Theorem 1.10 (Stability theorem): Let A ∈ Kn×n be a diagonalizable matrix, i. e.,


one for which n linearly independent eigenvectors {w 1 , . . . , w n } exist, and let B ∈ Kn×n
be an arbitrary second matrix. Then, for each eigenvalue λ(B) of B there is a cor-
responding eigenvalue λ(A) of A such that with the matrix W = [w 1, . . . , w n ] there
holds

|λ(A) − λ(B)| ≤ cond_2(W) ‖A − B‖_2 .                                   (1.3.71)

Proof. The eigenvalue equation Aw i = λi (A)w i can be rewritten in matrix form AW =


W diag(λi (A)) with the regular matrix W = [w1 , . . . , wn ] . Consequently,

A = W diag(λi (A)) W−1 ,

i. e., A is “similar” to the diagonal matrix Λ = diag(λi (A)). Since λ = λ(B) is not an
eigenvalue of A ,

‖(λI − A)^{−1}‖_2 = ‖W (λI − Λ)^{−1} W^{−1}‖_2
                  ≤ ‖W^{−1}‖_2 ‖W‖_2 ‖(λI − Λ)^{−1}‖_2
                  = cond_2(W) max_{i=1,...,n} |λ − λ_i(A)|^{−1} .

Then, Lemma 1.18 yields the estimate

1 ≤ ‖(λI − A)^{−1}(B − A)‖_2 ≤ ‖(λI − A)^{−1}‖_2 ‖B − A‖_2
  ≤ cond_2(W) max_{i=1,...,n} |λ − λ_i(A)|^{−1} ‖B − A‖_2 ,

from which the assertion follows. Q.E.D.


For Hermitian matrices A ∈ Kn×n there exists an ONB in Kn of eigenvectors so that
the matrix W in the estimate (1.3.71) can be assumed to be unitary, W W̄^T = I . In this
special case there holds

cond_2(W) = ‖W̄^T‖_2 ‖W‖_2 = 1,                                          (1.3.72)

i. e., the eigenvalue problem of “Hermitian” (or more generally “normal”) matrices is well
conditioned. For general “non-normal” matrices the conditioning of the eigenvalue prob-
lem may be arbitrarily bad, cond_2(W) ≫ 1 .
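The different conditioning for normal and non-normal matrices is easily observed numerically; in the following Python/NumPy sketch (an added illustration) the two test matrices and the size of the perturbation are arbitrary choices:

    import numpy as np

    def max_eig_shift(A, E):
        # largest distance from an eigenvalue of A+E to the nearest eigenvalue of A
        lamA = np.linalg.eigvals(A)
        lamB = np.linalg.eigvals(A + E)
        return max(min(abs(mu - lam) for lam in lamA) for mu in lamB)

    rng = np.random.default_rng(3)
    E = 1e-8 * rng.standard_normal((2, 2))

    A_normal    = np.diag([1.0, 2.0])                    # cond_2(W) = 1
    A_nonnormal = np.array([[1.0, 1.0e6],
                            [0.0, 2.0]])                 # strongly non-normal

    print(max_eig_shift(A_normal, E))       # O(1e-8): well-conditioned
    print(max_eig_shift(A_nonnormal, E))    # several orders of magnitude larger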

1.4 Exercises

Exercise 1.1 (Some useful inequalities):
Verify the following inequalities:
a) ab ≤ ε a^2 + (1/(4ε)) b^2 ,   a, b ∈ R, ε ∈ R+ .
b) ( Σ_{i=1}^n x_i^{−1} λ_i )^{−1} ≤ Σ_{i=1}^n x_i λ_i ,   x_i ∈ R+ , 0 ≤ λ_i ≤ 1 , Σ_{i=1}^n λ_i = 1 .
c) max_{0≤x≤1} { x^2 (1 − x)^{2n} } ≤ (1 + n)^{−2} .

Exercise 1.2 (Some useful facts about norms and scalar products):
Verify the following claims for vectors x, y ∈ Rn and the Euclidean norm ‖ · ‖_2 and scalar
product (·, ·)_2 :
a) 2‖x‖_2^2 + 2‖y‖_2^2 = ‖x + y‖_2^2 + ‖x − y‖_2^2 (Parallelogram identity).
b) |(x, y)_2| ≤ ‖x‖_2 ‖y‖_2 (Schwarz inequality).
c) For any symmetric, positive definite matrix A ∈ Rn×n the bilinear form (x, y)_A :=
(Ax, y)_2 is a scalar product. i) Can any scalar product on Rn be written in this form?
ii) How has this to be formulated for complex matrices A ∈ Cn×n ?

Exercise 1.3 (Some useful facts about matrix norms):


Verify the following relations for matrices A, B ∈ Kn×n and the Euclidean norm ‖ · ‖_2 :
a) ‖A‖_2 := max{ ‖Ax‖_2/‖x‖_2 , x ∈ Kn , x ≠ 0 } = max{ ‖Ax‖_2 , x ∈ Kn , ‖x‖_2 = 1 } .
b) ‖Ax‖_2 ≤ ‖A‖_2 ‖x‖_2 .
c) ‖AB‖_2 ≤ ‖A‖_2 ‖B‖_2 (Is this relation true for any matrix norm?).
d) For Hermitian matrices A ∈ Cn×n there holds ‖A‖_2 = max{ |λ| , λ eigenvalue of A }.
e) For general matrices A ∈ Cn×n there holds ‖A‖_2 = max{ |λ|^{1/2} , λ eigenvalue of Ā^T A }.

Exercise 1.4 (Some useful facts about vector spaces and matrices):
a) Formulate the Gram-Schmidt algorithm for orthonormalizing a set of linearly indepen-
dent vectors {x1 , . . . , xm } ⊂ Rn :
b) How can one define the square root A1/2 of a symmetric, positive definite matrix
A ∈ Rn×n ?
c) Show that a positive definite matrix A ∈ Cn×n is automatically Hermitian, i. e.,
A = ĀT . This is not necessarily true for real matrices A ∈ Rn×n , i. e., for real matrices
the definition of positiveness usually goes together with the requirement of symmetry.

Exercise 1.5: Recall the definitions of the following quantities:


a) The “maximum-norm”  · ∞ and the “l1 -norm”  · 1 on Kn .
b) The “spectrum” Σ(A) of a matrix A ∈ Kn×n .
c) The “Gerschgorin circles” Ki ⊂ C, i = 1, . . . , n , of a matrix A ∈ Kn×n .

d) The “spectral radius” ρ(A) of a matrix A ∈ Kn×n .


e) The “spectral condition number” κ2 (A) of a matrix A ∈ Kn×n .

Exercise 1.6: Recall the proofs of the following facts about matrices:
a) The diagonal elements of a (Hermitian) positive definite matrix A ∈ Kn×n are real
and positive.

b) For the trace tr(A) := Σ_{i=1}^n a_{ii} of a Hermitian matrix A ∈ Kn×n with eigenvalues
λ_i ∈ Σ(A) there holds

tr(A) = Σ_{i=1}^n λ_i .

c) A strictly diagonally dominant matrix A ∈ Kn×n is regular. If it is also Hermitian


with (real) positive diagonal entries, then it is positive definite.

Exercise 1.7: Let B ∈ Kn×n be a matrix, which for some matrix norm ‖ · ‖ satisfies
‖B‖ < 1 . Prove that the matrix I − B is regular with inverse satisfying

‖(I − B)^{−1}‖ ≤ 1 / (1 − ‖B‖) .

Exercise 1.8: Prove that each connected component of k Gerschgorin circles (that are
disjoined to all other n−k circles) of a matrix A ∈ Cn×n contains exactly k eigenvalues of
A (counted accordingly to their algebraic multiplicities). This implies that such a matrix,
for which all Gerschgorin circles are mutually disjoint, has exactly n simple eigenvalues
and is therefore diagonalizable.

Exercise 1.9: Let A, B ∈ Kn×n be two Hermitian matrices. Then, the following state-
ments are equivalent:
i) A and B commute, i. e., AB = BA .
ii) A and B possess a common basis of eigenvectors.
iii) AB is Hermitian.
Does the above equivalence in an appropriate sense also hold for two general “normal”
matrices A, B ∈ Kn×n , i. e., if ĀT A = AĀT and B̄ T B = B B̄ T ?

Exercise 1.10: A “sesquilinear form” on Kn is a mapping ϕ(·, ·) : Kn × Kn → K ,
which is bilinear in the following sense:

ϕ(αx + βy, z) = ᾱϕ(x, z) + β̄ϕ(y, z),   ϕ(z, αx + βy) = αϕ(z, x) + βϕ(z, y),   α, β ∈ K.

i) Show that for any regular matrix A ∈ Kn×n the sesquilinear form ϕ(x, y) := (Ax, Ay)2
is a scalar product on Kn .
ii) In an earlier exercise, we have seen that each scalar product (x, y) on Kn can be written

in the form (x, y) = (x, Ay)2 with a (Hermitian) positive definite matrix A ∈ Kn×n . Why
does this statement not contradict (i)?

Exercise 1.11: Let A ∈ Kn×n be Hermitian.


i) Show that eigenvectors corresponding to different eigenvalues λ1 (A) and λ2 (A) are
orthogonal. Is this also true for (non-Hermitian) “normal” matrices, i. e., if ĀT A = AĀT ?

ii) Show that there holds

λ_min(A) = min_{x∈Kn\{0}} (Ax, x)_2 / ‖x‖_2^2 ≤ max_{x∈Kn\{0}} (Ax, x)_2 / ‖x‖_2^2 = λ_max(A) ,

where λmin (A) and λmax (A) denote the minimal and maximal (real) eigenvalues of A ,
respectively. (Hint: Use that a Hermitian matrix possesses an ONB of eigenvectors.)

Exercise 1.12: Let A ∈ Kn×n and 0 ∉ σ(A) . Show that the ε-pseudo-spectra of A
and that of its inverse A^{−1} are related by

σ_ε(A) ⊂ { z ∈ C \ {0} | z^{−1} ∈ σ_{δ(z)}(A^{−1}) } ∪ {0},

where δ(z) := ε‖A^{−1}‖/|z| and, for 0 < ε < 1 , by

σ_ε(A^{−1}) \ B_1(0) ⊂ { z ∈ C \ {0} | z^{−1} ∈ σ_δ(A) },

where B_1(0) := {z ∈ C, |z| ≤ 1} and δ := ε/(1 − ε) .


2 Direct Solution Methods

2.1 Gaussian elimination, LR and Cholesky decomposition

In this chapter, we collect some basic results on so-called “direct” methods for solving
linear systems and matrix eigenvalue problems. A “direct” method delivers the exact
solution theoretically in finitely many arithmetic steps, at least under the assumption of
“exact” arithmetic. However, to get useful results a “direct” method has to be carried
to its very end. In contrast to this, so-called “iterative” methods produce sequences of
approximate solutions of increasing accuracy, which theoretically converge to the exact
solution in infinitely many arithmetic steps. However, “iterative” methods may yield
useful results already after a small number of iterations. Usually “direct” methods are
very robust but, due to their usually high storage and work requirements, feasible only
for problems of moderate size. Here, the meaning of “moderate size” depends very much
on the currently available computer power, i. e., today reaches up to dimension n ≈
105 − 106 . Iterative methods need less storage and as multi-level algorithms may even
show optimal arithmetic complexity, i. e., a fixed improvement in accuracy is achieved in
O(n) arithmetic operations. These methods can be used for really large-scale problems
of dimension reaching up to n ≈ 10^6 − 10^9 but at the price of less robustness and higher
algorithmic complexity. Such modern “iterative” methods are the main subject of this
book and will be discussed in the next chapters.

2.1.1 Gaussian elimination and LR decomposition

In the following, we discuss “direct methods” for solving (real) quadratic linear systems

Ax = b . (2.1.1)

It is particularly easy to solve staggered systems, e. g., those with an upper triangular
matrix A = (ajk ) as coefficient matrix

a11 x1 + a12 x2 + . . . + a1n xn = b1


a22 x2 + . . . + a2n xn = b2
.. .
.
ann xn = bn

In case that a_jj ≠ 0 , j = 1, . . . , n , we obtain the solution by “backward substitution”:

x_n = b_n / a_nn ,      j = n − 1, . . . , 1 :   x_j = ( b_j − Σ_{k=j+1}^n a_jk x_k ) / a_jj .

This requires N_{backsubst} = n^2/2 + O(n) arithmetic operations. The same holds true if
the coefficient matrix is lower triangular and the system is solved by the corresponding
“forward substitution”.
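A direct Python/NumPy implementation of the backward substitution described above (added as an illustration; the triangular example system is arbitrary):

    import numpy as np

    def backward_substitution(R, c):
        # solve R x = c for upper triangular R with nonzero diagonal
        n = len(c)
        x = np.zeros(n)
        for j in range(n - 1, -1, -1):
            x[j] = (c[j] - R[j, j + 1:] @ x[j + 1:]) / R[j, j]
        return x

    R = np.array([[2.0, 1.0, -1.0],
                  [0.0, 3.0,  2.0],
                  [0.0, 0.0,  4.0]])
    c = np.array([1.0, 8.0, 8.0])
    print(backward_substitution(R, c))
    print(np.linalg.solve(R, c))            # reference solution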


Definition 2.1: For quantifying the arithmetic work required by an algorithm, i. e., its
“(arithmetic) complexity”, we use the notion “arithmetic operation” (in short “a. op.”),
which means the equivalent of “1 multiplication + 1 addition” or “1 division” (assuming
that the latter operations take about the same time on a modern computer).

The classical direct method for solving linear systems is the elimination method of
Gauß1 which transforms the system Ax = b in several “elimination steps” (assuming
“exact” arithmetic) into an equivalent upper triangular system Rx = c, which is then
solved by backward substitution. In practice, due to round-off errors, the resulting upper
triangular system is not exactly equivalent to the original problem and this unavoidable
error needs to be controlled by additional algorithmical steps (“final iteration”, or “Na-
chiteration”, in German). In the elimination process two elementary transformations are
applied to the matrix A , which do not alter the solution of system (2.1.1): “permutation
of two rows of the matrix” and “addition of a scalar multiple of a row to another row of
the matrix”. Also the “permutation of columns” of A is admissible if the unknowns xi
are accordingly renumbered.
In the practical realization of Gaussian elimination the elementary transformations
are applied to the composed matrix [A, b] . In the following, we assume the matrix A
(0)
to be regular. First, we set A(0) ≡ A, b(0) ≡ b and determine ar1 = 0, r ∈ {1, . . . , n} .
(Such an element exists since otherwise A would be singular.). Permute the 1-st and the
r-th row. Let the result be the matrix [Ã(0) , b̃(0) ] . Then, for j = 2, . . . , n, we multiply
the 1-st row by qj1 and subtract the result from the j-th row,
(0) (0) (0) (1) (0) (0) (1) (0) (0)
qj1 ≡ ãj1 /ã11 (= ar1 /a(0)
rr ), aji := ãji − qj1 ã1i , bj := b̃j − qj1 b̃1 .

The result is ⎡ ⎤
(0) (0) (0) (0)
ã11 ã12 . . . ã1n b̃1
⎢ ⎥
⎢ 0 (1) (1) (1) ⎥
⎢ a22 . . . a2n b2 ⎥
[A , b ] = ⎢ .
(1) (1)
.. ⎥.
⎢ .. . ⎥
⎣ ⎦
(1) (1) (1)
0 an2 . . . ann bn

The transition [A(0) , b(0) ] → [Ã(0) , b̃(0) ] → [A(1) , b(1) ] can be expressed in terms of matrix
multiplication as follows:

[Ã(0) , b̃(0) ] = P1 [A(0) , b(0) ] , [A(1) , b(1) ] = G1 [Ã(0) , b̃(0) ] ,

where P1 is a “permutation matrix” und G1 is a “Frobenius matrix” of the following


form:

1
Carl Friedrich Gauß (1777–1855): Eminent German mathematician, astronomer and physicist;
worked in Göttingen; fundamental contributions to arithmetic, algebra and geometry; founder of modern
number theory, determined the planetary orbits by his “equalization calculus”, further contributions to
earth magnetism and construction of an electro-magnetic telegraph.
2.1 Gaussian elimination, LR and Cholesky decomposition 57

1 r
⎡ ⎤
0 ··· 1 1
⎢ ⎥
⎢ 1 ⎥
⎢ ⎥ 1
⎢ . .. .. ⎥ ⎡ ⎤
⎢ .. . . ⎥
⎢ ⎥ 1
⎢ ⎥ ⎢ ⎥
⎢ 1 ⎥ ⎢ −q21 1 ⎥
⎢ ⎥ ⎢ ⎥ 1
P1 = ⎢
⎢ 1 ··· 0

⎥ r G1 = ⎢
⎢ ... .. ⎥

⎢ ⎥ ⎣ . ⎦
⎢ 1 ⎥
⎢ ⎥ −qn1 1
⎢ ⎥
⎢ ..
. ⎥
⎣ ⎦
1

Both matrices, P1 and G1 , are regular regular with determinants det(P1 ) = det(G1 ) = 1
and there holds ⎡ ⎤
1
⎢ ⎥
⎢ q21 1 ⎥
⎢ ⎥
P1−1 = P1 , G−1 = ⎢ . . ⎥.
1
⎢ .. .. ⎥
⎣ ⎦
qn1 1

The systems Ax = b and A(1) x = b(1) have obviously the same solution,

Ax = b ⇐⇒ A(1) x = G1 P1 Ax = G1 P1 b = b(1) .

(0)
Definition 2.2: The element ar1 = ã11 is called “pivot element” and the whole substep
of its determination “pivot search”. For reasons of numerical stability one usually makes
the choice

|ar1 | = max |aj1 | . (2.1.2)


1≤j≤n

The whole process incl. permutation of rows is called “column pivoting” . If the elements
of the matrix A are of very different size “total pivoting” is advisable. This consists in
the choice

|ars | = max |ajk |, (2.1.3)


1≤j,k≤n

and subsequent permutation of the 1-st row with the r-th row and the 1-st column with
the s-th column. According to the column permutation also the unknowns xk have to
be renumbered. However, “total pivoting” is costly so that simple “column pivoting” is
usually preferred.
58 Direct Solution Methods

i r
⎡ ⎤
1
⎢ ⎥
⎢ .. ⎥
⎢ . ⎥ i
⎢ ⎥ ⎡ ⎤
⎢ ⎥
⎢ 1 ⎥ 1
⎢ ⎥ ⎢ ⎥
⎢ 0 ··· 1 ⎥i ⎢ .. ⎥
⎢ ⎥ ⎢ . ⎥
⎢ ⎥ ⎢ ⎥
⎢ 1 ⎥ ⎢ ⎥i
⎢ ⎥ ⎢ 1 ⎥
Pi = ⎢

..
.
..
.
..
.

⎥ Gi = ⎢
⎢ −qi+1,i 1


⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ .. .. ⎥
⎢ 1 ⎥ ⎢ . ⎥
⎢ ⎥ ⎣ . ⎦
⎢ ··· ⎥r
⎢ 1 0 ⎥ −qni 1
⎢ ⎥
⎢ 1 ⎥
⎢ ⎥
⎢ .. ⎥
⎢ . ⎥
⎣ ⎦
1

The matrix A(1) generated in the first step is again regular. The same is true for
the reduced submatrix to which the next elimination step is applied. By repeating this
elimination process, one obtains in n − 1 steps a sequence of matrices,

[A, b] → [A(1) , b(1) ] → . . . → [A(n−1) , b(n−1) ] =: [R, c] ,

where
[A(i) , b(i) ] = Gi Pi [A(i−1) , b(i−1) ] , [A(0) , b(0) ] := [A, b] ,
with (unitary) permutation matrices Pi and (regular) Frobenius matrices Gi of the above
form. The end result

[R, c] = Gn−1 Pn−1 . . . G1 P1 [A, b] (2.1.4)

is an upper triangular system Rx = c , which has the same solution as the original system
Ax = b . By the i-th elimination step [A(i−1) , b(i−1) ] → [A(i) , b(i) ] the subdiagonal elements
in the i-th column are made zero. The resulting free places are used for storing the
elements qi+1,i , . . . , qn,i of the matrices G−1
i (i = 1, . . . , n − 1) . Since in this elimination
step the preceding rows 1 to i are not changed, one works with matrices of the form
⎡ ⎤
r11 r12 ··· r1i r1,i+1 ··· r1n c1
⎢ ⎥
⎢ λ21 r22 ··· r2i r2,i+1 ··· r2n c2 ⎥
⎢ ⎥
⎢ ⎥
⎢ λ31 λ32 r3i r3,i+1 ··· r3n c3 ⎥
⎢ ⎥
⎢ .. .. .. .. .. .. .. ⎥
⎢ . . . . . . . ⎥
⎢ ⎥.
⎢ ⎥
⎢ λi1 λi2 rii ri,i+1 ··· rin ci ⎥
⎢ ⎥
⎢ λi+1,1 λi+2,2 λi+1,i
(i)
ai+1,i+1 ···
(i)
ai+1,n
(i) ⎥
bi+1 ⎥

⎢ .. .. .. .. .. .. ⎥
⎢ . ⎥
⎣ . . . . . ⎦
(i) (i) (i)
λn,1 λn,2 · · · λn,i an,i+1 ··· an,n bn
2.1 Gaussian elimination, LR and Cholesky decomposition 59

Here, the subdiagonal elements λk+1,k , . . . , λnk in the k-th column are permutations of
the elements qk+1,k , . . . , qnk of G−1
k since the permutations of rows (and only those) are
applied to the whole composed matrix. As end result, we obtain the matrix
⎡ ⎤
r11 ··· r1n c1
⎢ ⎥
⎢ l21 r22 r2n c2 ⎥
⎢ ⎥
⎢ .. .. .. .. .. ⎥.
⎢ . . . . . ⎥
⎣ ⎦
ln1 · · · ln,n−1 rnn cn

Theorem 2.1 (LR decomposition): The matrices


⎡ ⎤ ⎡ ⎤
1 0 r11 r12 · · · r1n
⎢ ⎥ ⎢ ⎥
⎢ l21 1 ⎥ ⎢ r22 · · · r2n ⎥
⎢ ⎥ ⎢ ⎥
L=⎢ .. .. .. ⎥, R = ⎢ .. .. ⎥
⎢ . . . ⎥ ⎢ . . ⎥
⎣ ⎦ ⎣ ⎦
ln1 · · · ln,n−1 1 0 rnn

are the factors in a so-called (multiplicative) “LR decomposition” of the matrix P A,

P A = LR , P := Pn−1 · · · P1 . (2.1.5)

If the LR decomposition is possible with P = I, then it is uniquely determined. Once an


LR decomposition is computed the solution of the linear system Ax = b can be accom-
plished by successively solving two triangular systems,

Ly = P b, Rx = y, (2.1.6)

by forward and backward substitution, respectively.

Proof. i) We give the proof only for the case that pivoting is not necessary, i. e., Pi = I .
Then, R = Gn−1 · · · G1 A and G−1 −1 −1 −1
1 · · · Gn−1 R = A. In view of L = G1 · · · Gn−1 the first
assertion follows.
ii) To prove uniqueness let A = L1 R1 = L2 R2 be two LR decompositions. Then, L−1 2 L1 =
R2 R1−1 = I since L−12 L1 is lower triangular with ones on the main diagonal and R2 R1−1
is upper triangular. Consequently, L1 = L2 and R1 = R2 , what was to be shown. Q.E.D.

Lemma 2.1: The solution of a linear n × n system Ax = b by Gaussian elimination


requires

NGauß (n) = 31 n3 + O(n2 ) (2.1.7)

arithmetic operations. This is just the work count of computing the corresponding decom-
position P A = LR , while the solution of the two triangular systems (2.1.6) only requires
n2 + O(n) arithmetic operations.
60 Direct Solution Methods

Proof. The k-th elimination step


(k−1) (k−1)
(k) (k−1) aik (k−1) (k) (k−1) aik (k−1)
aij = aij − a
(k−1) kj
, bi = bi − b
(k−1) k
, i, j = k, . . ., n,
akk akk

requires n−k divisions and (n−k) + (n−k)2 combined multiplications and additions
resulting altogether in


n−1
k 2 + O(n2 ) = 13 n3 + O(n2 ) a. Op.
k=1

for the n−1 steps of forward elimination. By this all elements of the matrices L and R
are computed. The work count of the forward and backward elimination in (2.1.6) follows
by similar considerations. Q.E.D.

Example 2.1: The pivot elements are marked by · .

⎡ ⎤⎡ ⎤ ⎡ ⎤ pivoting
3 1 6 x1 2
⎢ ⎥⎢ ⎥ ⎢ ⎥ 3 1 6 2
⎢ 2 1 3 ⎥ ⎢ x2 ⎥ = ⎢ 7 ⎥ →
⎣ ⎦⎣ ⎦ ⎣ ⎦ 2 1 3 7
1 1 1 x3 4
1 1 1 4

elimination pivoting
3 1 6 2 3 1 6 2

2/3 1/3 −1 17/3 1/3 2/3 −1 10/3
1/3 2/3 −1 10/3 2/3 1/3 −1 17/3

elimination
x3 = −8
3 1 6 2
→ x2 = 2 ( 3 − x3 ) = −7
3 10
1/3 2/3 −1 10/3
x1 = 3 (2 − x2 − 6x3 ) =
1
19 .
2/3 1/2 −1/2 4

LR decomposition: ⎡ ⎤
1 0 0
⎢ ⎥
P2 = ⎢
P1 = I ,
⎣ 0 0 1 ⎥⎦,
0 1 0
⎡ ⎤ ⎡ ⎤⎡ ⎤
3 1 6 1 0 0 3 1 6
⎢ ⎥ ⎢ ⎥⎢ ⎥
PA = ⎢

⎥ ⎢
1 1 1 ⎦ = LR = ⎣ 1/3 1 ⎥ ⎢
0 ⎦ ⎣ 0 2/3 −1 ⎥.

2 1 3 2/3 1/2 1 0 0 −1/2
2.1 Gaussian elimination, LR and Cholesky decomposition 61

Example 2.2: For demonstrating the importance of the pivoting process, we consider
the following linear 2×2-system:
& '& ' & '
10−4 1 x1 1
= (2.1.8)
1 1 x2 2

with the exact solution x1 = 1.00010001, x2 = 0.99989999 . Using 3-decimal floating


point arithmetic with correct rounding yields
a) without pivoting: b) with pivoting:
x1 x2 x1 x2
0.1 · 10−3 0.1 · 101 0.1 · 101 0.1 · 101 0.1 · 101 0.2 · 101
0 −0.1 · 105 −0.1 · 105 0 0.1 · 101 0.1 · 101
x2 = 1 , x1 = 0 x2 = 1 , x1 = 1

Example 2.3: The positive effect of column pivoting is achieved only if all row sums of
the matrix A are of similar size. As an example, we consider the 2×2-system
& '& ' & '
2 20000 x1 20000
= ,
1 1 x2 2

which results from (2.1.8) by scaling the first row by the factor 20.000 . Since in the first
column the element with largest modulus is on the main diagonal the Gauß algorithm
with and without pivoting yields the same unacceptable result (x1 , x2 )T = (0, 1)T . To
avoid this effect, we apply an “equilibration” step before the elimination, i. e., we multiply
A by a diagonal matrix D,

n −1
Ax = b → DAx = Db , di = |aij | , (2.1.9)
j=1

such that all row sums of A are scaled to 1 . An even better stabilization in the case
of matrix elements of very different size is “total pivoting”. Here, an equilibration step,
row-wise and column-wise, is applied before the elimination.

Conditioning of Gaussian elimination

We briefly discuss the conditioning of the solution of a linear system by Gaussian elim-
ination. For any (regular) matrix A there exists an LR decomposition like P A = LR.
Then, there holds
R = L−1 P A, R−1 = (P A)−1 L.
Due to column pivoting the elements of the triangular matrices L and L−1 are all less
or equal one and there holds

cond∞ (L) = L∞ L−1 ∞ ≤ n2 .


62 Direct Solution Methods

Consequently,

cond∞ (R) = R∞ R−1 ∞ = L−1 P A∞ (P A)−1 L∞


≤ L−1 ∞ P A∞ (P A)−1∞ L∞ ≤ n2 cond∞ (P A).

Then, the general perturbation theorem, Theorem1.8, yields the following estimate for
the solution of the equation LRx = P b (considering only perturbations of the right-hand
side b ):

δx∞ δP b∞ δP b∞


≤ cond∞ (L)cond∞ (R) ≤ n4 cond∞ (P A) .
x∞ P b∞ P b∞

Hence the conditioning of the original system Ax = b is by the LR decomposition, in the


worst case, amplified by n4 . However, this is an extremely pessimistic estimate, which
can significantly be improved (see Wilkinson2 [23]).

Theorem 2.2 (Round-off error influence): The matrix A ∈ Rn×n be regular, and the
linear system Ax = b be solved by Gaussian elimination with column pivoting. Then, the
actually computed perturbed solution x+δx under the influence of round-off error is exact
solution of a perturbed system (A + δA)(x + δx) = b , where (eps = “machine accuracy”)

δA∞
≤ 1.01 · 2n−1 (n3 + 2n2 ) eps. (2.1.10)
A∞

In combination with the perturbation estimate of Theorem 1.8 Wilkinson’s result


yields the following bound on the effect of round-off errors in the Gaussian elimination:

δx∞ cond(A)
≤ {1.01 · 2n−1 (n3 + 2n2 ) eps} . (2.1.11)
x∞ 1 − cond(A)δA∞ /A∞

This estimate is, as practical experience shows, by far too pessimistic since it is oriented
at the worst case scenario and does not take into account round-off error cancellations.
Incorporating the latter effect would require a statistical analysis. Furthermore, the above
estimate applies to arbitrary full matrices. For “sparse” matrices with many zero entries
much more favorable estimates are to be expected. Altogether, we see that Gaussian
elimination is, in principle, a well-conditioned algorithm, i. e., the influence of round-off
errors is bounded in terms of the problem dimension n and the condition cond(A) , which
described the conditioning of the numerical problem to be solved.

Direct LR and Cholesky decomposition

The Gaussian algorithm for the computation of the LR decomposition A = LR (if it


exists) can also be written in direct form, in which the elements ljk of L and rjk of

2
James Hardy Wilkinson (1919–1986): English mathematician; worked at National Physical Labora-
tory in London (since 1946); fundamental contributions to numerical linear algebra, especially to round-off
error analysis; co-founder of the famous NAG software library (1970).
2.1 Gaussian elimination, LR and Cholesky decomposition 63

R are computed recursively. The equation A = LR yields n2 equations for the n2


unknown elements rjk , j ≤ k , ljk , j > k (ljj = 1) :


min(j,k)
ajk = lji rik . (2.1.12)
i=1

Here, the ordering of the computation of ljk , rjk is not prescribed a priori. In the so-called
“algorithm of Crout3 ” the matrix A = LR is tessellated as follows:

⎡ ⎤
1
⎢ ⎥
⎢ 3 ⎥
⎢ ⎥
⎢ ⎥
⎢ 5 ⎥
⎢ ⎥.
⎢ .. ⎥
⎢ . ⎥
⎣ ⎦
2 4 6 ···

The single steps of this algorithm are (lii ≡ 1) :


1
k = 1, · · · , n : a1k = l1i rik ⇒ r1k := a1k ,
i=1
1
−1
j = 2, · · · , n : aj1 = lji ri1 ⇒ lj1 := r11 aj1 ,
i=1
 2
k = 2, · · · , n : a2k = l2i rik ⇒ r2k := a2k − l21 r1k ,
i=1
..
.

and generally for j = 1, · · · , n :


j−1
rjk := ajk − lji rik , k = j, j + 1, · · · , n ,
i=1
( ) (2.1.13)

j−1
−1
lkj := rjj akj − lki rij , k = j + 1, j + 2, · · · , n .
i=1

The Gaussian elimination and the direct computation of the LR decomposition differ only
in the ordering of the arithmetic operations and are algebraically equivalent.

3
Prescott D. Crout (1907–1984): US-American mathematician and engineer; Prof. at Massachusetts
Institute of Technology (MIT); contributions to numerical linear algebra (“A short method for evaluating
determinants and solving systems of linear equations with real or complex coefficients”, Trans. Amer.
Inst. Elec. Eng. 60, 1235–1241, 1941) and to numerical electro dynamics.
64 Direct Solution Methods

2.1.2 Accuracy improvement by defect correction

The Gaussian elimination algorithm transforms a linear system Ax = b into an upper


triangular system Rx = c , from which the solution x can be obtained by simple back-
ward substitution. Due to Theorem 2.1 this is equivalent to the determination of the
decomposition P A = LR and the subsequent solution of the two triangular systems

Ly = P b , Rx = y . (2.1.14)

This variant of the Gaussian algorithm is preferable if the same linear system is succes-
sively to be solved for several right-hand sides b . Because of the unavoidable round-off
error one usually obtains an only approximate LR decomposition

L̃R̃ = P A

and using this in (2.1.14) an only approximate solution x(0) with (exact) “residual”
(negative “defect”)
dˆ(0) := b − Ax(0) = 0 .
Using the already computed approximate trianguler decomposition L̃R̃ ∼ P A, one solves
(again approximately) the so-called “correction equation”

Ak = dˆ(0) , L̃R̃k (1) = dˆ(0) , (2.1.15)

and from this obtains a correction k (1) for x(0) :

x(1) := x(0) + k (1) . (2.1.16)

Had the correction equation be solved exactly, i. e., k (1) ≡ k , then

Ax(1) = Ax(0) + Ak = Ax(0) − b + b + dˆ(0) = b,

i. e., x(1) = x would be the exact solution of the system Ax = b . In general, x(1)
is a better approximation to x than x(0) even if the defect equation is solved only
approximately. This, however, requires the computation of the residual (defect) d with
higher accuracy by using extended floating point arithmetic. This is supported by the
following error analysis.
For simplicity, let us assume that P = I . We suppose the relative error in the LR
decomposition of the matrix A to be bounded by a small number ε . Due to the general
perturbation result of Theorem 1.8 there holds the estimate

x(0) − x cond(A) A − L̃R̃


≤ .
x 1 − cond(A) A  A
A− L̃R̃
 
∼ε
Here, the loss of exact decimals corresponds to the condition cond(A) . Additionally
round-off errors are neglected. The exact residual dˆ(0) is replaced by the expression
2.1 Gaussian elimination, LR and Cholesky decomposition 65

d(0) := Ãx(0) − b where à is a more accurate approximation to A ,

A − Ã
≤ ε̃  ε .
A

By construction there holds

x(1) = x(0) + k (1) = x(0) + (L̃R̃)−1 [b − Ãx(0) ]


= x(0) + (L̃R̃)−1 [Ax − Ax(0) + (A − Ã) x(0) ],

and, consequently,

x(1) − x = x(0) − x − (L̃R̃)−1 A(x(0) − x) + (L̃R̃)−1 (A − Ã) x(0)


= (L̃R̃)−1 [L̃R̃ − A](x(0) − x) + (L̃R̃)−1 (A − Ã) x(0) .

Since  
L̃R̃ = A − A + L̃R̃ = A I − A−1 (A − L̃R̃) ,
we can use Lemma 1.15 to conclude

(L̃R̃)−1  ≤ A−1   [I − A−1 (A − L̃R̃)]−1 


A−1  A−1  A−1 
≤ ≤ = .
1 − A−1 (A − L̃R̃) 1 − A−1  A − L̃R̃ 1 − cond(A) A− L̃R̃
A

This eventually implies

x(1) − x * A − L̃R̃ x(0) − x A − Ã x(0)  +


∼ cond(A) + .
x A x A x
        
∼ε ∼ cond(A)ε ∼ ε̃

This correction procedure can be iterated to a “defect correction” iteration (“Nachitera-


tion” in German). It may be continued until the obtained solution has an error (usually
achieved after 2−3 steps) of the order of the defect computation, i. e., x(3) −x/x ∼ ε̃ .

Example 2.4: The linear system


& '& ' & '
1.05 1, 02 x1 1
=
1.04 1, 02 x2 2

has the exact solution x = (−100, 103.921 . . .)T . Gaussian elimination, with 3-decimal
arithmetic and correct rounding, yields the approximate triangular matrices
& ' & '
1 0 1.05 1.02
L̃ = , R̃ = ,
0.990 1 0 0.01
66 Direct Solution Methods

& '
0 0
L̃R̃ − A = (correct within machine accuracy).
5 · 10−4 2 · 10−4
The resulting “solution” x(0) = (−97, 1.101)T has the residual
,
(0, 0)T 3-decimal computation,
d(0) = b − Ax(0) = T
(0, 065, 0, 035) 6-decimal computation.

The approximate correction equation


& '& ' & (1) ' & '
1 0 1.05 1.02 k1 0.065
(1)
=
0.990 1 0 0.01 k2 0.035

has the solution k (1) = (−2.9, 102.899)T (obtained by 3 decimal computation). Hence,
one correction step yields the approximate solution

x(1) = x(0) + k (1) = (−99.9, 104)T ,

which is significantly more accurate than the first approximation x(0) .

2.1.3 Inverse computation and the Gauß-Jordan algorithm

In principle, the inverse A−1 of a regular matrix A can be computed as follows:

i) Computation of the LR decomposition of P A .

ii) Solution of the staggered systems

Ly (i) = P e(i) , Rx(i) = y (i) , i = 1, . . . , n,

with the Cartesian basis vectors ei of Rn .

iii) Then, A−1 = [x(1) , . . . , x(n) ].

More practical is the simultaneous elimination (without explicit determination of the


matrices L and R ), which directly leads to the inverse (without row perturbation):

forward elimination
1 0 r11 · · · r1n 1 0
.. → .. .. ..
A . . . .
0 1 rnn ∗ 1
2.1 Gaussian elimination, LR and Cholesky decomposition 67

backward elimination scaling


r11 0 1 0
.. → ..
. ∗ . A−1
0 rnn 0 1

Example 2.5: The pivot elements are marked by · .

⎡ ⎤ forward elimination
3 1 6
⎢ ⎥ 3 1 6 1 0 0
A=⎢ ⎥
⎣ 2 1 3 ⎦ : →
2 1 3 0 1 0
1 1 1
1 1 1 0 0 1

row permutation forward elimination


3 1 6 1 0 0 3 1 6 1 0 0
→ → →
0 1/3 −1 −2/3 1 0 0 2/3 −1 −1/3 0 1
0 2/3 −1 −1/3 0 1 0 1/3 −1 −2/3 1 0

backward elimination backward elimination


3 1 6 1 0 0 3 1 0 −5 12 −6
→ → →
0 2/3 −1 −1/3 0 1 0 2/3 0 2/3 −2 2
0 0 −1/2 −1/2 1 −1/2 0 0 −1/2 −1/2 1 −1/2

scaling
3 0 0 −6 15 −9 1 0 0 −2 5 −3
→ →
0 2/3 0 2/3 −2 2 0 1 0 1 −3 3
0 0 −1/2 −1/2 1 −1/2 0 0 1 1 −2 1
⎡ ⎤
−2 5 −3
⎢ ⎥
⇒ A−1 = ⎢
⎣ 1 −3 3 ⎥
⎦.
1 −2 1

An alternative method for computing the inverse of a matrix is the so-called “exchange
algorithm” (sometimes called “Gauß-Jordan algorithm”). Let be given a not necessarily
quadratic linear system

Ax = y, where A ∈ Rm×n , x ∈ Rn , y ∈ Rm . (2.1.17)

A solution is computed by successive substitution of components of x by those of y . If


a matrix element apq = 0 , then the p-th equation can be solved for xq :

ap1 ap,q−1 1 ap,q+1 apn


xq = − x1 − . . . − xq−1 + yp − xq+1 − . . . − xn .
apq apq apq apq apq
68 Direct Solution Methods

Substitution of xq into the other equations

aj1 x1 + . . . + aj,q−1 xq−1 + ajq xq + aj,q+1xq+1 + . . . + ajn xn = yj ,

yields for j = 1, . . . , m , j = p :
- . - .
ajq ap1 ajq ap,q−1 ajq
aj1 − x1 + . . . + aj,q−1 − xq−1 + yp +
apq apq apq
- . - .
ajq ap,q+1 ajq apn
+ aj,q+1 − xq+1 + . . . + ajn − xn = yj .
apq apq
The result is a new system, which is equivalent to the original one,
⎡ ⎤ ⎡ ⎤
x1 y1
⎢ . ⎥ ⎢ .. ⎥
⎢ . ⎥ ⎢ ⎥
⎢ . ⎥ ⎢ . ⎥
⎢ ⎥ ⎢ ⎥
à ⎢ ⎥ ⎢
⎢ yp ⎥ = ⎢ xq ⎥ ⎥, (2.1.18)
⎢ . ⎥ ⎢ .. ⎥
⎢ .. ⎥ ⎢ . ⎥
⎣ ⎦ ⎣ ⎦
xn ym

where the elements of the matrix à are determined as follows:

pivot element : ãpq = 1/apq ,


pivot row : ãpk = apk /apq , k = 1, . . . , n , k = q ,
pivot column : ãjq = ajq /apq , j = 1, . . . , m , j = p ,
others : ãjk = ajk − ajq apk /apq , j = 1, . . . , m, j = p, k = 1, . . . , n , k = q.

If we succeed with replacing all components of x by those of y the result is the solution
of the system y = A−1 x . In the case m = n , we obtain the inverse A−1 , but in general
with permutated rows and columns. In determining the pivot element it is advisable, for
stability reasons, to choose an element apq of maximal modulus.

Theorem 2.3 (Gauß-Jordan algorithm): In the Gauß-Jordan algorithm r = rank(A)


exchange steps can be done.

Proof. Suppose the algorithm stops after r exchange steps. Let at this point x1 , . . . , xr
be exchanged against y1 , . . . , yr so that the resulting system has the form
2.1 Gaussian elimination, LR and Cholesky decomposition 69

⎡ ⎤
⎧ ⎡ ⎤ ⎤ ⎡

⎪ ⎢ ⎥ y1 x1
⎨ ⎢ ∗ ∗ ⎥ ⎢ . ⎥ ⎢ . ⎥
⎢ ⎥ ⎢ . ⎥ ⎢ . ⎥
r ⎢ ⎥ ⎢ . ⎥ ⎢ . ⎥

⎪ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎩ ⎢ ⎥ ⎢ y ⎥ ⎢ x ⎥
⎧ ⎢



⎢ r

⎥ ⎢ r ⎥
⎥ = ⎢ ⎥.

⎪ ⎢ ⎥ ⎢ xr+1 ⎥ ⎢ yr+1 ⎥
⎨ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ . ⎥ ⎢ . ⎥
m−r ⎢ ∗ 0 ⎥ ⎢ .. ⎥ ⎢ .. ⎥

⎪ ⎢ ⎥ ⎣ ⎦ ⎣ ⎦
⎩ ⎣ ⎦
xn ym

     
r n−r

If one chooses now y1 = · · · = yr = 0, xr+1 = λ1 , · · · , xn = λn−r so are all x1 , · · · , xr


uniquely determined and it follows that yr+1 = · · · = ym = 0 . For arbitrary values
λ1 , · · · , λn−r there also holds
⎡ ⎤
x1 (λ1 , · · · , λn−r ) ⎡ ⎤
⎢ ⎥ 0
⎢ .. ⎥ ⎢ ⎥
⎢ . ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ x (λ , · · · , λ ) ⎥ ⎢ . ⎥
⎢ r 1 ⎥
⎥ = ⎢ ⎥.
n−r
A⎢ .
⎢ . ⎥
⎢ λ ⎥ ⎢ ⎥
⎢ 1 ⎥ ⎢ ⎥
⎢ .. ⎥ ⎣ ⎦
⎢ . ⎥
⎣ ⎦ 0
λn−r

Hence, dim(kern(A)) ≥ n−r . On the other hand, because y1 , · · · , yr can be freely chosen,
we have dim(range(A)) ≥ r . Further, observing dim(range(A)) + dim(kern(A)) = n it
follows that rank(A) = dim(range(A)) = r . This completes the proof. Q.E.D.
For a quadratic linear system with regular coefficient matrix A the Gauß-Jordan
algorithm for computing the inverse A−1 is always applicable.

Example 2.6: ⎡ ⎤⎡ ⎤ ⎡ ⎤
1 2 1 x1 y1
⎢ ⎥⎢ ⎥ ⎢ ⎥
⎢ −3 −5 −1 ⎥ ⎢ x2 ⎥ = ⎢ y2 ⎥
⎣ ⎦⎣ ⎦ ⎣ ⎦
−7 −12 −2 x3 y3

Exchange steps: The pivot elements are marked by · .

x1 x2 x3 x1 y3 x3
1 2 1 y1 −1/6 −1/6 2/3 y1
−3 −5 −1 y2 −1/12 5/12 −1/6 y2
−7 −12 −2 y3 −7/12 −1/12 −1/6 x2
70 Direct Solution Methods

x1 y3 y1 y2 y3 y1 ⎡ ⎤
−2 −8 3
1/4 1/4 3/2 x3 −2 1 1 x3 ⎢ ⎥
inverse: ⎢ 1 5 −2 ⎥
−1/8 −1/4 y2 ⎣ ⎦
3/8 −8 3 −2 x1
1 −2 1
−5/8 −1/8 −1/4 x2 5 −2 1 x2

Lemma 2.2: The inversion of a regular n × n-matrix by simultaneous elimination or the


Gauß-Jordan algorithmalgorithmus requires

NGauß-Jordan (n) = n3 + O(n2 ) a. op. (2.1.19)

Proof. i) The n − 1 steps of forward elimination at the matrix A require 13 n3 + O(n2 )


a. op. The simultaneous treatment of the columns of the identity matrix requires addi-
tional 16 n3 + O(n2 ) a. op. The backward elimination for generating the identity matrix
on the left requires again

(n − 1)n + (n − 2)n + . . . + n = 12 n(n − 1)n = 12 n3 + O(n2 )

multiplications and additions and subsequently n2 divisions. Hence the total work count
for computing the inverse is

Ninverse = 31 n3 + 16 n3 + 12 n3 + O(n2 ) = n3 + O(n2 ).

ii) In the Gauß-Jordan algorithm the k-th exchange step requires 2n + 1 divisions in
pivot row and column and (n − 1)2 multiplications and additions for the update of the
remaining submatrix, hence all together n2 +O(n) a. op.. The computation of the inverse
requires n exchange steps so that the total work count again becomes n3 + O(n2 ) a. op..
Q.E.D.

2.2 Special matrices

2.2.1 Band matrices

The application of Gaussian elimination for the solution of large linear systems of size
n > 104 poses technical difficulties if the primary main memory of the computer is not
large enough for storing the matrices occurring during the process (fill-in problem). In
this case secondary (external) memory has to be used, which increases run-time because
of slower data transfer. However, many large matrices occurring in practice have special
structures, which allow for memory saving in the course of Gaussian elimination.

Definition 2.3: A matrix A ∈ Rn,n is called “band matrix” of “band type” (ml , mr )
with 0 ≤ ml , mr ≤ n − 1 , if

ajk = 0, for k < j − ml or k > j + mr (j, k = 1, . . . , n),


2.2 Special matrices 71

i. e., the elements of A outside of the main diagonal and of ml + mr secondary diagonals
are zero. The quantity m = ml + mr + 1 is called the “band width” of A.

Example 2.7: We give some very simple examples of band matrices:


Typ (n − 1, 0) : lower triangular matrix
Typ (0, n − 1) : upper triangular matrix
Typ (1, 1) : tridiagonal matrix

Example of a (16 × 16)-band matrix of band type (4, 4) :


⎡ ⎤⎫ ⎡ ⎤⎫ ⎡ ⎤⎫
B −I ⎪
⎪ ⎪
4 −1
⎪ 1 ⎪

⎢ ⎥⎪⎪ ⎢ ⎥⎪⎪ ⎢ ⎪
⎥⎪
⎢ −I B −I ⎥⎪⎬ ⎢ −1 4 −1 ⎥ ⎪
⎬ ⎢ ⎥⎪

⎢ ⎥ ⎢ ⎥ ⎢ 1 ⎥
A= ⎢ ⎥ 16, B=⎢ ⎥ 4, I=⎢ ⎥ 4.
⎢ −I B −I ⎥ ⎪ ⎢ −1 4 −1 ⎥ ⎪ ⎢ ⎥⎪
⎣ ⎦⎪⎪
⎪ ⎣ ⎦⎪⎪
⎪ ⎣ 1 ⎦⎪


⎪ ⎪ ⎪
−I B ⎭ −1 4 ⎭ 1 ⎭

Theorem 2.4 (Band matrices): Let A ∈ Rn×n be a band matrix of band type (ml , mr ),
for which Gaussian elimination can be applied without pivoting, i. e., without permutation
of rows. Then, all reduced matrices are also band matrices of the same band type and
the matrix factors L and R in the triangular decomposition of A are band matrices of
type (ml , 0) and (0, mr ), respectively. The work count for the computation of the LR
decomposition A = LR is

NLR = 31 nml mr + O(n(ml + mr )) a. op. (2.2.20)

Proof. The assertion follows by direct computation (exercise). Q.E.D.


In Gaussian elimination applied to a band matrix it suffices to store the “band” of the
matrix. For n ≈ 105 and m ≈ 102 this makes Gaussian elimination feasible at all. For
the small model matrix from above (finite difference discretization of the Poisson problem)
this means a reduced memory requirement of 16 × 9 = 144 instead of 16 × 16 = 256 for
the full matrix. How the symmetry of A can be exploited for further memory reduction
will be discussed below.
An extreme storage saving is obtained for tridiagonal matrices
⎡ ⎤
a1 b1
⎢ ⎥
⎢ c ..
.
..
. ⎥
⎢ 2 ⎥
⎢ ⎥.
⎢ .. ..
. bn−1 ⎥
⎣ . ⎦
cn an
72 Direct Solution Methods

Here, the elements of the LR decomposition


⎡ ⎤ ⎡ ⎤
1 α1 β1
⎢ ⎥ ⎢ ⎥
⎢ γ ..
. ⎥ ⎢ .. .. ⎥
⎢ 2 ⎥ ⎢ . . ⎥
L=⎢ ⎥, R = ⎢ ⎥
⎢ .. ⎥ ⎢ αn−1 βn−1 ⎥
⎣ . 1 ⎦ ⎣ ⎦
γn 1 αn

are simply be obtained by short recursion formulas (sometimes called “Thomas4 algo-
rithm”),
α1 = a1 , β1 = b1 ,
i = 2, . . . , n − 1 : γi = ci /αi−1 , αi = ai − γi βi−1 , βi = bi ,
γn = cn /αn−1 , αn = an − γn βn−1 .

For this only 3n − 2 storage places and 2n − 2 a. op. are needed.


Frequently the band matrices are also sparse, i. e., most elements within the band
are zero. However, this property cannot be used within Gaussian elimination for storage
reduction because during the elimination process the whole band is filled with non-zero
entries.
It is essential for the result of Theorem 2.4 that the Gaussian elimination can be carried
out without perturbation of rows, i. e., without pivoting, since otherwise the bandwidth
would increase in the course of the algorithm. We will now consider two important classes
of matrices, for which this is the case.

2.2.2 Diagonally dominant matrices

Definition 2.4: A matrix A = (aij )ni,j=1 ∈ Rn×n is called “diagonally dominant”, if there
holds

n
|ajk | ≤ |ajj | , j = 1, . . . , n. (2.2.21)
k=1,k =j

Theorem 2.5 (Existence of LR decomposition): Let the matrix A ∈ Rn×n be reg-


ular and diagonally dominant. Then, the LR decomposition A = LR exists and can be
computed by Gaussian elimination without pivoting.

4
Llewellyn Thomas (1903–1992): British physicist and applied mathematician; studied at Cambridge
University, since 1929 Prof. of physics at Ohio State University, after the war, 1946, staff member at
Watson Scientific Computing Laboratory at Columbia University, since 1968 Visiting Professor at North
Carolina State University until retirement; best known for his contributions to Atomic Physics, thesis
(1927) “Contributions to the theory of the motion of electrified particles through matter and some effects
of that motion”; his name is frequently attached to an efficient version of the Gaussian elimination method
for tridiagonal matrices.
2.2 Special matrices 73

Proof. Since A is regular and diagonally dominant necessarily a11 = 0 . Consequently,


the first elimination step A := A(0) → A(1) can be done without (column) pivoting. The
(1) (1)
elements ajk are obtained by a1k = a1k , k = 1, . . . , n , and

(1) aj1
j = 2, . . . , n , k = 1, . . . , n : ajk = ajk − qj1 a1k , qj1 = .
a11
Hence, for j = 2, . . . , n, there holds

n
(1)

n 
n
|ajk | ≤ |ajk | + |qj1 | |a1k |
k=2,k =j k=2,k =j k=2,k =j
 n n
≤ |ajk | −|aj1 | + |qj1 | |a1k | −|qj1 ||a1j |

 a  k=2
k=1,k =j
    j1    
= 
≤ |ajj | a11 ≤ |a11 |
(1)
≤ |ajj | − |qj1 a1j | ≤ |ajj − qj1 a1j | = |ajj |.

The matrix A(1) = G1 A(0) is regular and obviously again diagonally dominant. Conse-
(1)
quently, a22 = 0 . This property is maintained in the course of the elimination process,
i. e., the elimination is possible without any row permutations. Q.E.D.

Remark 2.1: If in (2.2.21) for all j ∈ {1, . . . , n} the strict inequality holds, then the
matrix A is called “strictly diagonally dominant” . The proof of Theorem 2.5 shows that
for such matrices Gaussian elimination is applicable without pivoting, i. e., such a matrix
is necessarily regular. The above model matrix is diagonally dominant but not strictly
diagonally dominant. Its regularity will be shown later by other arguments based on a
slightly more restrictive assumption.

2.2.3 Positive definite matrices

We recall that a (symmetric) matrix A ∈ Rn×n with the property

(Ax, x)2 > 0 , x ∈ Rn \ {0},

is called “positive definite”.

Theorem 2.6 (Existence of LR decomposition): For positive definite matrices A ∈


Rn×n the Gaussian elimination algorithm can be applied without pivoting and all occurring
(i)
pivot elements aii are positive.

Proof. For the (symmetric) positive matrix A there holds a11 > 0 . The relation
(1) aj1 ak1 (1)
ajk = ajk − a1k = akj − a1j = akj ,
a11 a11
74 Direct Solution Methods

for j, k = 2, . . . , n, shows that the first elimination step yields an (n − 1) × (n − 1)-matrix


(1)
Ã(1) = (ajk )j,k=2,...,n , which ia again symmetric. We have to show that it is also positive
(1)
definite, i. e., a22 > 0. The elimination process can be continued with a positive pivot
element and the assertion follows by induction. Let x̃ = (x2 , . . . , xn )T ∈ Rn−1 \ {0} and
x = (x1 , x̃)T ∈ Rn with
1 
n
x1 = − a1k xk .
a11 k=2
Then,

n 
n 
n
0< ajk xj xk = ajk xj xk + 2x1 a1k xk + a11 x21
j,k=1 j,k=2 k=2

1  1  2
n n
− ak1 a1j xk xj + a1k xk
a11 j,k=2 a11 k=2
  
= 0 (ajk = akj )
n  ak1 a1j   1 
n 2
= ajk − xj xk + a11 x1 + a1k xk
a a11
j,k=2   11 
(1)
 
k=2

= ajk =0

and, consequently, x̃T Ã(1) x̃ > 0, what was to be proven. Q.E.D.


For positive definite matrices an LR decomposition A = LR exists with positive pivot
(i)
elements rii = aii > 0, i = 1, . . . , n. Since A = AT there also holds

A = AT = (LR)T = (LDR̃)T = R̃T DLT

with the matrices


⎡ ⎤
1 r12 /r11 · · · r1n /r11 ⎡ ⎤
⎢ ⎥ r11 0
⎢ .. .. .. ⎥ ⎢ ⎥
⎢ . . . ⎥
R̃ = ⎢ ⎥, D=⎢

..
. ⎥.

⎢ 1 rn−1,n /rn−1,n−1 ⎥
⎣ ⎦
0 rnn
0 1

In virtue of the uniqueness of the LR decomposition it follows that

A = LR = R̃T DLT ,

and, consequently, L = R̃T and R = DLT . This proves the following theorem.
2.2 Special matrices 75

Theorem 2.7: Positive definite matrices allow for a so-called “Cholesky5decomposition”.

A = LDLT = L̃L̃T , (2.2.22)

with the matrix L̃ := LD 1/2 . For computing the Cholesky decomposition it suffices to
compute the matrices D and L . This reduces the required work count to

NCholesky (n) = 16 n3 + O(n2) a. op. (2.2.23)

The so-called “Cholesky method” for computing the decomposition matrix


⎡ ⎤
˜l11 0
⎢ . ⎥
L̃ = ⎢
⎣ .
. ..
. ⎥

˜ln1 · · · ˜lnn

starts from the relation A = L̃L̃T , which can be viewed as a system of n(n + 1)/2
equations for the quantities ˜ljk , k ≤ j. Multiplicating this out,
⎡ ⎤⎡ ⎤ ⎡ ⎤
˜l11 0 ˜l11 · · · ˜ln1 a11 · · · a1n
⎢ . ⎥⎢ .. ⎥ ⎢ . .. ⎥
⎢ .. .. ⎥⎢ .. ⎥ ⎢ . ⎥
⎣ . ⎦⎣ . . ⎦ = ⎣ .. ⎦,
˜ln1 · · · ˜lnn 0 ˜lnn an1 · · · ann

yields in the first column of L̃ :


˜l2 = a11 , ˜l21 ˜l11 = a21 , ... , ˜ln1 ˜l11 = an1 ,
11

from which, we obtain

˜l11 = √a11 , j = 2, . . . , n : ˜lj1 = aj1 = √aj1 . (2.2.24)


˜l11 a11

Let now for some i ∈ {2, · · · , n} the elements ˜ljk , k = 1, . . . , i − 1, j = k, . . . , n be


already computed. Then, from
˜l2 + ˜l2 + . . . + ˜l2 = aii , ˜lii > 0 ,
i1 i2 ii
˜lj1˜li1 + ˜lj2 ˜li2 + . . . + ˜lji ˜lii = aji ,

the next elements ˜lii and ˜lji , j = i + 1, . . . , n can be obtained,


2
˜lii = aii − ˜l2 − ˜l2 − . . . − ˜l2 ,
i1 i2 i,i−1
# $
˜lji = ˜l−1 aji − ˜lj1˜li1 − ˜lj2˜li2 − . . . − ˜lj,i−1˜li,i−1 , j = i + 1, . . . , n,
ii

5
Andrè Louis Cholesky (1975–1918): French mathematician; military career as engineer officer; con-
tributions to numerical linear algebra, “Cholesky decomposition”; killed in battle shortly before the end of
World War I, his discovery was published posthumously in ”Bulletin Géodésique”.
76 Direct Solution Methods

Example 2.8: The 3 × 3-matrix


⎡ ⎤
4 12 −16
⎢ ⎥
A=⎢
⎣ 12 37−43 ⎥

−16 −43 98

has the following (uniquely determined) Cholesky decomposition A = LDL = L̃L̃T :


⎡ ⎤⎡ ⎤⎡ ⎤ ⎡ ⎤⎡ ⎤
1 0 0 4 0 0 1 3 −4 2 0 0 2 6 −8
⎢ ⎥⎢ ⎥⎢ ⎥ ⎢ ⎥⎢ ⎥
A=⎢ ⎥⎢ ⎥⎢ ⎥ ⎢
⎣ 3 1 0 ⎦⎣ 0 1 0 ⎦⎣ 0 1 5 ⎦ = ⎣ 6 1 0 ⎦⎣ 0 1 5 ⎦.
⎥⎢ ⎥
−4 5 1 0 0 9 0 0 1 −8 5 3 0 0 3

2.3 Irregular linear systems and QR decomposition

Let A ∈ Rm×n be a not necessarily quadratic coefficient matrix and b ∈ Rm a right-hand


side vector. We are mainly interested in the case m > n (more equations than unknowns)
but also allow the case m ≤ n. We consider the linear system

Ax = b, (2.3.25)

for x ∈ Rn . In the following, we again seek a vector x̂ ∈ Rn with minimal defect


norm d2 = b − Ax̂2 , which coincides with the usual solution concept if rank(A) =
rank([A, b]). In view of Theorem 1.4 such a generalized solution is characterized as solution
of the “normal equation”

AT Ax = AT b. (2.3.26)

In the rank-deficient case, rank(A) < n , a particular solution x̂ of the normal system is
not unique, but of the general form x̂ + y with any element y ∈ kern(A). In this case
uniqueness is achieved by requiring the “least error-squares” solution to have minimal
Euclidian norm, x̂2 .

We recall that the matrix AT A is symmetric and positive semi-definite, and even
positive definite if A has maximal rank rank(A) = n. In the latter case the normal
equation can, in principle, be solved by the Cholesky algorithm for symmetric positive
definite matrices. However, in general the matrix AT A is rather ill-conditioned. In fact,
for m = n, we have that

cond2 (AT A) ∼ cond2 (A)2 . (2.3.27)


2.3 Irregular linear systems and QR decomposition 77

Example 2.9: Using 3-decimal arithmetic, we have


⎡ ⎤
1.07 1.10 & '
⎢ ⎥ 3.43 3.60
A=⎢ ⎥
⎣ 1.07 1.11 ⎦ → A A = 3.60 3.76 .
T

1.07 1.15

But AT A is not positive definite: (−1, 1) · AT A · (−1, 1)T = −0.01 , i. e., in this case the
Cholesky algorithm will not yield a solution.

We will now describe a method, by which the normal equation can be solved without
explicitly forming the product AT A. For later purposes, from now on, we admit complex
matrices.

Theorem 2.8 (QR decomposition): Let A ∈ Km×n be any rectangular matrix with
m ≥ n and rank(A) = n . Then, there exists a uniquely determined orthonormal matrix
Q ∈ Km×n with the property

Q̄T Q = I (K = C) , QT Q = I (K = R), (2.3.28)

and a uniquely determined upper triangular matrix R ∈ Kn×n with real diagonal rii >
0 , i = 1, . . . , n , such that

A = QR. (2.3.29)

Proof. i) Existence: The matrix Q is generated by successive orthonormalization of the


column vectors ak , k = 1, . . . , n , of A by the Gram-Schmidt algorithm:


k−1
q1 ≡ a1 −1
2 a1 , k = 2, . . . , n : q̃k ≡ ak − (ak , qi )2 qi , qk ≡ q̃k −1
2 q̃k .
i=1

Since by assumption rank(A) = n the n column vectors {a1 , . . . , an } are linearly in-
dependent and the orthonormalization process does not terminate before k = n. By
construction the matrix Q ≡ [q1 , . . . , qn ] is orthonormal. Further, for k = 1, . . . , n, there
holds:

k−1 
k−1
ak = q̃k + (ak , qi )2 qi = q̃k 2 qk + (ak , qi )2 qi
i=1 i=1

and

k
ak = rik qk , rkk ≡ q̃k 2 ∈ R+ , rik ≡ (ak , qi )2 .
i=1

Setting rik ≡ 0, for i > k, this is equivalent to the equation A = QR with the upper
triangular matrix R = (rik ) ∈ Kn×n .
ii) Uniqueness: For proving the uniqueness of the QR decomposition let A = Q1 R1 and
78 Direct Solution Methods

A = Q2 R2 be two such decompositions. Since R1 and R2 are regular and (det(Ri ) > 0)
it follows that

Q := Q̄T2 Q1 = R2 R1−1 right upper triangular,


Q̄T = Q̄T1 Q2 = R1 R2−1 right upper triangular.

Since Q̄T Q = R1 R2−1 R2 R1−1 = I it follows that Q is orthonormal and diagonal with
|λi | = 1 . From QR1 = R2 , we infer that λi rii1 = rii2 > 0 and, consequently, λi ∈ R and
λi = 1. Hence, Q = I , i. e.,

R1 = R2 , Q1 = AR1−1 = AR2−1 = Q2 .

This completes the proof. Q.E.D.


In the case K = R, using the QR decomposition, the normal equation A Ax = AT b
T

transforms into
AT Ax = RT QT QRx = RT Rx = RT QT b,
and, consequently, in view of the regularity of RT ,

Rx = QT b. (2.3.30)

This triangular system can now be solved by backward substitution in O(n2 ) arithmetic
operations. Since

AT A = RT R (2.3.31)

with the triangular matrix R, we are given a Cholesky decomposition of AT A without


explicit computation of the matrix product AT A.

Example 2.10: The 3 × 3-matrix


⎡ ⎤
12 −51 4
⎢ ⎥
A=⎢
⎣ 6 167 −68 ⎥

−4 24 −41

has the following uniquely determined QR decomposition


⎡ ⎤ ⎡ ⎤
6/7 −69/175 −58/5 14 21 −14
⎢ ⎥ ⎢ ⎥
A = QR = ⎢ ⎥ ⎢
⎣ 3/7 158/175 6/175 ⎦ · ⎣ 0 175 −70 ⎦ .

−2/7 6/35 −33/35 0 0 35

2.3.1 Householder algorithm

The Gram-Schmidt algorithm used in the proof of Theorem 2.8 for orthonormalizing
the column vectors of the matrix A is not suitable in practice because of its inherent
2.3 Irregular linear systems and QR decomposition 79

instability. Due to strong round-off effects the orthonormality of the columns of Q is


quickly lost already after only few orthonormalization steps. A more stable algorithm for
this purpose is the “Householder6 algorithm”, which is described below.
For any vector v ∈ Km the “dyadic product” is defined as the matrix
⎡ ⎤ ⎡ ⎤
v1 |v1 |2 v1 v̄2 · · · v1 v̄m
⎢ ⎥
.. ⎥ ⎢ ⎥
vv̄ T := ⎢ ⎢ .. ⎥ ∈ Km×m ,
⎣ . ⎦ [v̄1 , . . . , v̄m ] = ⎣ . ⎦
vm vm v̄1 vm v̄2 · · · |vm |2

(not to be confused with the “scalar product” v̄ T v = v22 , which maps vectors to scalars).

Definition 2.5: For a normalized vector v ∈ Kn , v2 = 1, the Matrix

S = I − 2vv̄ T ∈ Km×m

is called “Householder transformation”. Obviously S = S̄ T = S −1 , i. e., S (and also S̄ T )


is Hermitian and unitary. Further, the product of two (unitary) Householder transforma-
tions is again unitary.

For the geometric interpretation of the Householder transformation S , we restrict us


to the real case, K = R. For an arbitrary normed vector v ∈ R2 , v2 = 1, consider the
basis {v, v ⊥ }, where v T v ⊥ = 0 . For an arbitrary vector u = αv + βv ⊥ ∈ R2 there holds

Su = (I − 2vv T ) (αv + βv ⊥ )
= αv + βv ⊥ − 2α (v v T ) v −2β(v v T ) v ⊥ = −αv + βv ⊥ .
     
=1 =0

Hence, the application of S = I − 2vv T to a vector u in the plane span{v, u} induces a


reflection of u with respect to the orthogonal axis span{v ⊥ } .
Starting from a matrix A ∈ Km×n the Householder algorithm in n steps generates a
sequence of matrices

A := A(0) → · · · → A(i−1) → · · · → A(n) := R̃ ,

where A(i−1) has the followimg form:

6
Alston Scott Householder (1904–1993): US-American mathematician; Director of Oak Ridge National
Laboratory (1948-1969), thereafter Prof. at the Univ. of Tennessee; worked in mathematical biology,
best known for his fundamental contributions to numerics, especially to numerical linear algebra.
80 Direct Solution Methods

⎡ ⎤
∗ ∗
⎢ .. ⎥
⎢ ..
. . ⎥
⎢ ⎥
⎢ ⎥
⎢ ∗ ⎥
A(i−1) = ⎢

⎥i

⎢ ∗ ··· ∗ ⎥
⎢ ⎥
⎢ 0 ⎥
⎣ ⎦
∗ ··· ∗
i
In the i-th step the Householder transformation Si ∈ Km×m is determined such that

Si A(i−1) = A(i) .

After n steps the result is


¯ T A,
R̃ = A(n) = Sn Sn−1 · · · S1 A =: Q̃
¯ ∈ Km×m as product of unitary matrices is also unitary and R̃ ∈ Km×n has the
where Q̃
form ⎡ ⎤⎫ ⎫
r11 · · · r1n ⎪ ⎪
⎢ ⎥ ⎬ ⎪
⎪ ⎪


⎢ .. .. ⎥
n ⎬
⎢ . . ⎥⎪
R̃ = ⎢ ⎥⎪⎭ ⎪ m.
⎢ 0 rnn ⎥ ⎪
⎣ ⎦ ⎪



0 ··· 0
This results in the representation

A = S̄1T · · · S̄nT R̃ = Q̃R̃.

From this, we obtain the desired QR decomposition of A simply by striking out the last
m − n columns in Q̃ and the last m − n rows in R̃ :
⎡ ⎤⎫
⎡ ⎤ ⎪

⎢ ⎥⎬
⎢ ⎥ ⎢ ⎢ ⎥
⎢ ⎥ ⎢
R ⎥⎪ n
⎢ ⎥ ⎢ ⎥⎪
⎢ ⎥ ⎢ ⎥⎭
A = ⎢ Q ∗ ⎥·⎢ ⎥⎫ = QR .
⎢ ⎥ ⎢ ⎥⎪
⎢ ⎥ ⎢ ⎥⎪⎬
⎣ ⎦ ⎢ ⎥

⎣ 0 ⎦⎪ m−n


     
n m-n
We remark that here the diagonal elements of R do not need to be positive, i. e., the
Householder algorithm does generally not yield the “uniquely determined” special QR
decomposition given by Theorem 2.8.
2.3 Irregular linear systems and QR decomposition 81

Now, we describe the transformation process in more detail. Let ak be the column
vectors of the matrix A .
Step 1: S1 is chosen such that S1 a1 ∈ span{e1 } . The vector a1 is reflected with respect
to one of the axes span{a1 + a1 e1 } or span{a1 − a1 e1 } ) into the x1 -axis. The choice
of the axis is oriented by sgn(a11 ) in order to minimize round-off errors. In case a11 ≥ 0
this choice is
a1 + a1 2 e1 a1 − a1 2 e1
v1 = , v1⊥ = .
a1 + a1 2 e1 2 a1 − a1 2 e1 2
Then, the matrix A(1) = (I − 2v1 v̄1T )A has the column vectors
(1) (1)
a1 = −a1 2 e1 , ak = ak − 2(ak , v1 )v1 , k = 2, . . . , n.

Spiegelungsachse
a1−||a1||2e1 a1 a1+||a1||2e1

−||a1||2e1 ||a1||2e1 x1

Figure 2.1: Scheme of the Householder transformation

Let now the transformed matrix A(i−1) be already computed.


i-th step: For Si we make the following ansatz:
⎡ ⎤ ⎡ ⎤⎫ ⎫
0 ⎪ ⎪ ⎪

⎢ ⎥ ⎢ . ⎥⎬ ⎪

⎢ I 0 ⎥ ⎢ . ⎥ i-1⎪⎬
⎢ ⎥ ⎢ . ⎥⎪
Si = ⎢

⎥ = I − 2vi v̄iT ,
⎥ vi = ⎢ ⎥⎪ m
⎢ ⎥ ⎢ 0 ⎥⎭ ⎪

⎣ I− 2ṽi ṽ¯iT ⎦ ⎣ ⎦ ⎪

0 ⎪

ṽi
  
i-1

The application of the (unitary) matrix Si to A(i−1) leaves the first i − 1 rows and
columns of A(i−1) unchanged. For the construction of vi , we use the considerations of
the 1-st step for the submatrix:
82 Direct Solution Methods

⎡ ⎤
(i−1) (i−1)
ãii · · · ãin
⎢ . .. ⎥ 3 (i−1) 4
Ã(i−1) =⎢

.. . ⎥⎦ = ãi , . . . , ãn(i−1) .
(i−1) (i−1)
ãmi · · · ãmn
It follows that
(i−1) (i−1) (i−1) (i−1)
ãi − ãi 2 ẽi ãi + ãi 2 ẽi
ṽi = , ṽi⊥ = ,
 . . . 2  . . . 2

and the matrix A(i) has the column vectors


(i) (i−1)
ak = ak , k = 1, . . . , i − 1 ,
(i) (i−1) (i−1) (i−1)
ai = (a1i , . . . , ai−1,i , ãi  , 0, . . . , 0)T ,
(i) (i−1) (i−1)
ak = ak − 2(ãk , ṽi )vi , k = i + 1, . . . , n.

Remark 2.2: For a quadratic matrix A ∈ Kn×n the computation of the QR decom-
position by the Householder algorithm costs about twice the work needed for the LR
decomposition of A , i. e., NQR = 32 n3 + O(n2 ) a. op.

2.4 Singular value decomposition

The methods for solving linear systems and equalization problems become numerically
unreliable if the matrices are very ill-conditioned. It may happen that a theoretically
regular matrix appears as singular for the (finite arithmetic) numerical computation or
vice versa. The determination of the rank of a matrix cannot be accomplished with suf-
ficient reliability by the LR or the QR decomposition. A more accurate approach for
treating rank-deficient matrices uses the so-called “singular value decomposition (SVD)”.
This is a special orthogonal decomposition, which transforms the matrix from both sides.
For more details, we refer to the literature, e. g., to the introductory textbook by Deufl-
hard & Hohmann [33].
Let A ∈ Km×n be given. Further let Q ∈ Km×m and Z ∈ Kn×n be orthonormal
matrices. Then, the holds

QAZ2 = A2 . (2.4.32)

Hence this two-sided transformation does not change the conditioning of the matrix A .
For suitable matrices Q and Z , we obtain precise information about the rank of A
and the equalization problem can by accurately solved also for a rank-deficient matrix.
However, the numerically stable determination of such transformations is costly as will
be seen below.
2.4 Singular value decomposition 83

Theorem 2.9 (Singular value decomposition): Let A ∈ Km×n be arbitrary real or


complex. Then, there exist unitary matrices V ∈ Kn×n and U ∈ Km×m such that

A = UΣV̄ T , Σ = diag(σ1 , . . . , σp ) ∈ Rm×n , p = min(m, n), (2.4.33)

where σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0. Depending on whether m ≤ n or m ≥ n the matrix Σ


has the form ⎛ ⎞
⎛ ⎞ σ1
σ1 0 ⎜ ⎟
⎜ ⎟ ⎜ .. ⎟
⎜ .. ⎟ ⎜ . ⎟
⎝ . 0 ⎠ or ⎜ ⎟.
⎜ σn ⎟
0 σm ⎝ ⎠
0

Remark 2.3: The singular value decomposition A = UΣV̄ T of a general matrix A ∈


Km×n is the natural generalization of the well-known decomposition

A = W ΛW̄ T (2.4.34)

of a square normal (and hence diagonalizable) matrix A ∈ Kn×n where Λ = diag(λi ) ,


λi the eigenvalues of A , and W = [w 1 , . . . , w n ] , {w 1, . . . , w n } an ONB of eigenvectors.
It allows for a representation of the inverse of a general square regular matrix A ∈ Kn×n
in the form

A−1 = (UΣV̄ T )−1 = V −1 Σ−1 Ū T , (2.4.35)

where the orthonormlity of U and V are used.

From (2.4.33), one sees that for the column vectors ui , v i of U, V , there holds

Av i = σi ui , ĀT ui = σi v i , i = 1, . . . , min(m, n).

This implies that


ĀT Av i = σi2 v i , AĀT ui = σi2 ui ,
which shows that the values σi , i = 1, . . . , min(m, n), are the square roots of eigenvalues
of the Hermitian, positive semi-definite matrices ĀT A ∈ Kn×n and AĀT ∈ Km×m corre-
sponding to the eigenvectors v i and ui , respectively. The σi are the so-called “singular
values” of the matrix A . In the case m ≥ n the matrix ĀT A ∈ Kn×n has the p = n
eigenvalues {σi2 , i = 1, . . . , n} , while the matrix AĀT ∈ Km×m has the m eigenvalues
{σ12 , . . . , σn2 , 0n+1 , . . . , 0m } . In the case m ≤ n the matrix ĀT A ∈ Kn×n has the n
eigenvalues {σi2 , . . . , σm 2
, 0m+1 , . . . , 0n } , while the matrix AĀT ∈ Rm×m has the p = m
eigenvalues {σ1 , . . . , σm } . The existence of a decomposition (2.4.33) will be concluded
2 2

by observing that ĀT A is orthonormally diagonalizable,

Q̄T (ĀT A)Q = diag(σi2 ).


84 Direct Solution Methods

Proof of Theorem 2.9. We consider only the real case K = R.


i) Case m ≥ n (overdetermined system): Let the eigenvalues of the symmetric, positive
semi-definite matrix AT A ∈ Rn×n be ordered like λ1 ≥ λ2 ≥ · · · ≥ λr > λr+1 = · · · =
λn = 0. Here, r is the rank of A and also of AT A. Further, let {v 1 , . . . , v n } be a
corresponding ONB of eigenvectors, AT Av i = λi v i , such that the associated matrix V :=
[v 1 , . . . , v n ] is unitary. We define the diagonal matrices Λ := diag(λi ) and Σ := diag(σi )
1/2
where σi := λi , i = 1 . . . , n, are the “singular values” of A . In matrix notation there
holds
AV = ΛV.
Next, we define the vectors ui := σi−1 Av i ∈ Rm , i = 1, . . . , n, which form an ONS in Rm ,

(ui , uj )2 = σi−1 σj−1 (Av i , Av j )2 = σi−1 σj−1 (v i , AT Av j )2


= σi−1 σj−1 λj (v i , v j )2 = δij , i, j = 1, . . . , n.

The ONS {u1, . . . , un } can be extended to an ONB {u1 , . . . , um} of Rm such that the
associated matrix U := [u1 , . . . , um ] is unitary. Then, in matrix notation there holds

AT U = Σ−1 AT AV = Σ−1 ΛV = ΣV, U T A = ΣV T , A = UΣV T .

ii) Case m ≤ n (underdetermined system): We apply the result of (i) to the transposed
matrix AT ∈ Rn×m , obtaining

AT = Ũ Σ̃Ṽ T , A = Ṽ Σ̃T Ũ T .

Then, setting U := Ṽ , V := Ũ , and observing that, in view of the above discussion, the
eigenvalues of (AT )T AT = AAT ∈ Rm×m are among those of AT A ∈ Rn×n besides n−m
zero eigenvalues. Hence, Σ̃T has the desired form. Q.E.D.
We now collect some important consequences of the decomposition (2.4.33). Suppose
that the singular values are ordered like σ1 ≥ · · · ≥ σr > σr+1 = · · · = σp = 0, p =
min(m, n) . Then, there holds (proof exercise):

- rank(A) = r ,

- kern(A) = span{v r+1 , . . . , v n } ,

- range(A) = span{u1 , . . . , ur } ,

- A = Ur Σr VrT ≡ ri=1 σi ui v iT (singular decomposition of A ),

- A2 = σ1 = σmax ,

- AF = (σ12 + · · · + σr2 )1/2 (Frobenius norm).

We now consider the problem of computing the “numerical rank” of a matrix. Let

rank(A, ε) = min rank(B) .


A−B 2 ≤ε
2.4 Singular value decomposition 85

The matrix is called “numerically rank-deficient” if

rank(A, ε) < min(m, n) , ε = epsA2 ,

where eps is the “machine accuracy” (maximal relative round-off error). If the matrix
elements come from experimental measurements, then the parameter ε should be related
to the measurement error. The concept of “numerically rank-deficient” has something in
common with that of the ε-pseudospectrum discussed above.

Theorem 2.10 (Error estimate): Let A, U, V, Σ be as in Theorem 2.9. If k < r =


rank(A), then in the truncated singular value decomposition,


k
Ak = σi ui v iT ,
i=1

there holds the estimate

min A − B2 = A − Ak 2 = σk+1 .


rank(B)=k

This implies for rε = rank(A, ε) the relation

σ1 ≥ · · · ≥ σrε > ε ≥ σrε +1 ≥ · · · ≥ σp , p = min(m, n).

Proof. Since
U T Ak V = diag(σ1 , . . . , σk , 0, . . . , 0)
it follows that rank(Ak ) = k . Further, we obtain

U T (A − Ak )V = diag(0, . . . , 0, σk+1, . . . , σp )

and because of the orthonormality of U and V that

A − Ak 2 = σk+1 .

It remains to show that for any other matrix B with rank k, the following inequality
holds
A − B2 ≥ σk+1 .
To this end, we choose an ONB {x1 , . . . , xn−k } of kern(B) . For dimensional reasons
there obviously holds

span{x1 , . . . , xn−k } ∩ span{v 1 , . . . , v k+1} = ∅.

Let z with z2 = 1 be from this set. Then, there holds


k+1
Bz = 0 , Az = σi (v iT z)ui
i=1
86 Direct Solution Methods

and, consequently,


k+1
A − B22 ≥ (A − B)z22 = Az22 = σi2 (v iT z)2 ≥ σk+1
2
.
i=1

k+1 iT
Here, we have used that z = i=1 (v z)v i and therefore


k+1
1 = z22 = (v iT z)2 .
i=1

This completes the proof. Q.E.D.


With the aid of the singular value decomposition, one can also solve the equalization
problem. In the following let again m ≥ n . We have already seen that any minimal
solution x,
Ax − b2 = min!
necessarily solves the normal equation AT Ax = AT b. But this solution is unique only in
the case of maximal rank(A) = n , which may be numerically hard to verify. In this case
AT A is invertible and there hold

x = (AT A)−1 AT b.

Now, knowing the (non-negative) eigenvalues λi , i = 1, . . . , n, of AT A with corresponding


1/2
ONB of eigenvectors {v 1 , . . . , v n } and setting Σ = diag(σi ), σi := λi , V = [v 1 , . . . , v n ],
i −1/2 i 1 n
u := λi Av , and U := [u , . . . , u ], we have

(AT A)−1 AT = (V Σ2 V T )−1 AT = V Σ−2 V T AT = V Σ−1 (AV )T = V Σ−1 U T .

This implies the solution representation



n
uiT b
x = V Σ−1 U T b = vi. (2.4.36)
i=1
σi

In the case rank(A) < n the normal equation has infinitely many solutions. Out of these
solutions, one selects one with minimal euclidian norm, which is then uniquely determined.
This particular solution is called “minimal solution” of the equalization problem. Using
the singular value decomposition the solution formula (2.4.36) can be extended to this
“irregular” situation.

Theorem 2.11 (Minimal solution): Let A = UΣV T be singular value decomposition


of the matrix A ∈ Rm×n and let r = rank(A) . Then,


r
uiT b
x̄ = vi
i=1
σi
2.4 Singular value decomposition 87

is the uniquely determined “minimal solution” of the normal equation. The corresponding
least squares error satisfies

m
ρ2 = Ax̄ − b22 = (uiT b)2 .
i=r+1

Proof. For any x ∈ Rn there holds

Ax − b22 = AV V T x − b22 = U T AV V T x − U T b22 = ΣV T x − U T b22 .

Setting z = V T x , we conclude

r 
m
Ax − b22 = Σz − U T b22 = (σi z i − uiT b)2 + (uiT b)2 .
i=1 i=r+1

Hence a minimal point necessarily satisfies

σi z i = uiT b, i = 1, . . . , r.

Among all z with this property z i = 0 , i = r + 1, . . . , m has minimal euclidian norm.


The identity for the least squares error is obvious. Q.E.D.
The uniquely determined minimal solution of the equalization problem has the follow-
ing compact representation

x̄ = A+ b , ρ = (I − AA+ )b2 , (2.4.37)

where
A+ = V Σ+ U T , Σ+ = diag(σ1−1 , . . . , σr−1 , 0, . . . , 0) ∈ Rn×m .
The matrix

A+ = V Σ+ U T (2.4.38)

is called “pseudo-inverse” of the matrix A (or “Penrose7 inverse” (1955)). The pseudo-
inverse is the unique solution of the matrix minimization problem

min AX − IF ,


X∈Rn×m

with the Frobenius norm  · F . Since the identity in (2.4.37) holds for all b it follows
that

7
Roger Penrose (1931–): English mathematician; Prof. at Birkbeck College in London (1964) and since
1973 Prof. at the Univ. of Oxford; fundamental contributions to the theory of half-groups, to matrix
calculus and to the theory of “tesselations” as well as in Theoretical Physics to Cosmology, Relativity
and Quantum Mechanics.
88 Direct Solution Methods

rank(A) = n ⇒ A+ = (AT A)−1 AT ,


rank(A) = n = m ⇒ A+ = A−1 .

In numerical practice the definition of the pseudo-inverse has to use the (suitably defined)
numerical rank. The numerically stable computation of the singular value decomposition
is rather costly. For details, we refer to the literature, e. g., the book by Golub & van Loan
[36].

2.5 “Direct” determination of eigenvalues

In the following, we again consider general square matrices A ∈ Kn×n . The direct way of
computing eigenvalues of A would be to follow the definition of what an eigenvalue is and
to compute the zeros of the corresponding characteristic polynomial χA (z) = det(zI − A)
by a suitable method such as, e. g., the Newton method. However, the mathematical task
of determining the zeros of a polynomial may be highly ill-conditioned if the polynomial is
given in “monomial expansion”, although the original task of determining the eigenvalues
of a matrix is mostly well-conditioned. This is another nice example of a mathematical
problem the conditioning of which significantly depends on the choice of its formulation.
In general the eigenvalues cannot be computed via the characteristic polynomial. This
is feasible only in special cases when the characteristic polynomial does not need to be
explicitly built up, such as for tri-diagonal matrices or so-called “Hessenberg8 matrices”.

Tridiagonal matrix Hessenberg matrix


⎡ ⎤ ⎡ ⎤
a1 b1 a11 · · · a1n
⎢ ⎥ ⎢ .. ⎥
⎢ c .. ..
. . ⎥ ⎢ .
a21 . . ⎥
⎢ 2 ⎥ ⎢ . ⎥
⎢ .. ⎥ ⎢ .. ⎥
⎢ . bn−1 ⎥ ⎢ . an−1,n ⎥
⎣ ⎦ ⎣ ⎦
cn an 0 an,n−1 ann

2.5.1 Reduction methods

We recall some properties related to the “similarity” of matrices. Two matrices A, B ∈


Cn×n are “similar”, in symbols A ∼ B , if with a regular matrix T ∈ Cn×n there holds
A = T −1 BT . In view of

det(A − zI) = det(T −1 [B − zI]T ) = det(T −1 ) det(B − zI) det(T ) = det(B − zI),

similar matrices A, B have the same characteristic polynomial and therefore also the
same eigenvalues. For any eigenvalue λ of A with a corresponding eigenvector w there

8
Karl Hessenberg (1904–1959): German mathematicians; dissertation “Die Berechnung der Eigenwerte
und Eigenl”osungen linearer Gleichungssysteme”, TU Darmstadt 1942.
2.5 “Direct” determination of eigenvalues 89

holds
Aw = T −1 BT w = λw,
i. e., T w is an eigenvector of B corresponding to the same eigenvalue λ . Further, al-
gebraic and geometric multiplicity of eigenvalues of similar matrices are the same. A
“reduction method” reduces a given matrix A ∈ Cn×n by a sequence of similarity trans-
formations to a simply structured matrix for which the eigenvalue problem is then easier
to solve,

A = A(0) = T1−1 A(1) T1 = Q . . . = Ti−1 A(i) Ti = . . . . (2.5.39)

In order to prepare for the following discussion of reduction methods, we recall (without
proof) some basic results on matrix normal forms.

Theorem 2.12 (Jordan normal form): Let the matrix A ∈ Cn×n have the (mutually
different) eigenvalues λi , i = 1, . . . , m, with algebraic and geometric multiplicities σi and
(i) (i) (i)
ρi , respectively. Then, there exist numbers rk ∈ N k = 1, . . . , ρi , σi = r1 + . . . + rρi ,
such that A is similar to the Jordan normal form
⎡ ⎤
Cr(1) (λ1 )
⎢ 1 ⎥
⎢ .. ⎥
⎢ . 0 ⎥
⎢ ⎥
⎢ Cr(1) (λ1 ) ⎥
⎢ ⎥
⎢ ρ1 ⎥
⎢ .. ⎥
JA = ⎢ . ⎥.
⎢ ⎥
⎢ ⎥
⎢ Cr(m) (λm ) ⎥
⎢ ⎥
⎢ ⎥
1

⎢ .. ⎥
⎣ 0 . ⎦
Cr(m) (λm )
ρm

(i)
Here, the numbers rk are up to their ordering uniquely determined.

The following theorem of Schur9 concerns the case that in the similarity transformation
only unitary matrices are allowed.

Theorem 2.13 (Schur normal form): Let the matrix A ∈ Cn×n have the eigenvalues
λi , i = 1, . . . , n (counted accordingly to their algebraic multiplicities). Then, there exists
a unitary matrix U ∈ Cn×n such that

9
Issai Schur (1875–1941): Russian-German mathematician; Prof. in Bonn (1911–1916) and in Berlin (1916–1935), where he founded a famous mathematical school; persecuted because of his Jewish origin, he emigrated in 1939 to Palestine; fundamental contributions especially to the Representation Theory of Groups and to Number Theory.
\[
\bar U^T A U = \begin{pmatrix} \lambda_1 & & * \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix}. \qquad (2.5.40)
\]

If A ∈ Cn×n is Hermitian, ĀT = A, then ŪT AU is Hermitian as well. Hence, Hermitian matrices A ∈ Cn×n are “unitarily similar” to a diagonal matrix ŪT AU = diag(λi), i. e., “diagonalizable”.

Lemma 2.3 (Diagonalization): For any matrix A ∈ Cn×n the following statements
are equivalent:
i) A is diagonalizable.
ii) There exists an ONB in Cn of eigenvectors of A .
iii) For all eigenvalues of A algebraic and geometric multiplicity coincide.

In general, the direct transformation of a given matrix into normal form in finitely
many steps is possible only if all its eigenvectors are a priori known. Therefore, first one
transforms the matrix in finitely many steps into a similar matrix of simpler structure
(e. g., Hessenberg form) and afterwards applies other mostly iterative methods of the form

A = A(0) → A(1) = T1−1 A(0) T1 → . . . A(m) = Tm−1 A(m−1) Tm .

Here, the transformation matrices Ti should be given explicitly in terms of the elements
of A(i−1) . Further, the eigenvalue problem of the matrix A(i) = Ti−1 A(i−1) Ti should not
be worse conditioned than that of A(i−1) .
Let ‖·‖ be any natural matrix norm generated by a vector norm ‖·‖ on Cn. For any two similar matrices, B ∼ A, there holds

B = T⁻¹AT,  B + δB = T⁻¹(A + δA)T,  δA = T δB T⁻¹,

and, therefore,

‖B‖ ≤ cond(T) ‖A‖,  ‖δA‖ ≤ cond(T) ‖δB‖.

This implies that

‖δA‖/‖A‖ ≤ cond(T)² ‖δB‖/‖B‖.     (2.5.41)

Hence, for large cond(T) ≫ 1 even small perturbations in B may affect its eigenvalues significantly more than those in A. In order to guarantee the stability of the reduction
approach, in view of

cond(T) = cond(T1 . . . Tm ) ≤ cond(T1 ) · . . . · cond(Tm ),

the transformation matrices Ti are to be chosen such that cond(Ti ) does not become too
large. This is especially achieved for the following three types of transformations:
a) Rotations (Givens transformation):

\[
T = \begin{pmatrix}
1 & & & & & & \\
& \ddots & & & & & \\
& & \cos(\varphi) & & -\sin(\varphi) & & \\
& & & \ddots & & & \\
& & \sin(\varphi) & & \cos(\varphi) & & \\
& & & & & \ddots & \\
& & & & & & 1
\end{pmatrix}
\quad\Longrightarrow\quad \operatorname{cond}_2(T) = 1.
\]

b) Reflections (Householder transformation):

T = I −2uūT =⇒ cond2 (T) = 1.

The Givens and the Householder transformations are unitary with spectral condition
cond2 (T) = 1.
c) Elimination:

\[
T = \begin{pmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & l_{i+1,i} & 1 & & \\
& & \vdots & & \ddots & \\
& & l_{n,i} & & & 1
\end{pmatrix},
\qquad |l_{jk}| \le 1 \;\Longrightarrow\; \operatorname{cond}_\infty(T) \le 4.
\]

In the following, we consider only the eigenvalue problem of real matrices. The fol-
lowing theorem provides the basis of the so-called “Householder algorithm”.

Theorem 2.14 (Hessenberg normal form): To each matrix A ∈ Rn×n there exists
a sequence of Householder matrices Ti , i = 1, . . . , n − 2, such that T AT T with T =
Tn−2 . . . T1 is a Hessenberg matrix . For symmetric A the transformed matrix T AT T is
tri-diagonal.

Proof. Let A = [a1, . . . , an] and ak, k = 1, . . . , n, the column vectors of A. In the first step u1 = (0, u12, . . . , u1n)ᵀ ∈ Rn, ‖u1‖₂ = 1, is determined such that with T1 = I − 2u1u1ᵀ there holds T1a1 ∈ span{e1, e2}. Then,
\[
A^{(1)} = T_1 A T_1 =
\underbrace{\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ * & & & \\ 0 & & * & \\ \vdots & & & \end{pmatrix}}_{T_1 A}\;
\underbrace{\begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & * & \\ 0 & & & \end{pmatrix}}_{T_1^T}
= \begin{pmatrix} a_{11} & & * & \\ * & & & \\ 0 & & \tilde A^{(1)} & \\ \vdots & & & \end{pmatrix}.
\]

In the next step, we apply the same procedure to the reduced matrix Ã(1) . After n−2
steps, we obtain a matrix A(n−2) which has Hessenberg form. With A also A(1) = T1 AT1
is symmetric and then also A(n−2) . The symmetric Hessenberg matrix A(n−2) is tri-
diagonal. Q.E.D.

Remark 2.4: For a symmetric matrix A ∈ Rn×n the Householder algorithm for reducing it to tri-diagonal form requires 2/3 n³ + O(n²) a. op. and the reduction of a general matrix to Hessenberg form 5/3 n³ + O(n²) a. op. For this purpose the alternative method of Wilkinson using Gaussian elimination steps and row permutation is more efficient as
it requires only half as many arithmetic operations. However, the row permutation de-
stroys the possible symmetry of the original matrix. The oldest method for reducing a
real symmetric matrix to tri-diagonal form goes back to Givens10 (1958). It uses (uni-
tary) Givens rotation matrices. Since this algorithm requires twice as many arithmetic
operations as the Householder algorithm it is not further discussed. For details, we refer
to the literature, e. g., the textbook by Stoer & Bulirsch II [50].
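To make the reduction step concrete, the following NumPy sketch implements the Householder reduction of a symmetric matrix to tri-diagonal form along the lines of the proof of Theorem 2.14 (an illustrative sketch; function and variable names are not from the text):

```python
import numpy as np

def householder_tridiagonalize(A):
    # Reduce a real symmetric matrix A to tri-diagonal form T A T^T
    # by successive Householder reflections (cf. Theorem 2.14).
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for i in range(n - 2):
        x = A[i+1:, i].copy()                    # entries below the subdiagonal
        alpha = -np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        v = x.copy()
        v[0] -= alpha                            # v = x - alpha * e_1
        norm_v = np.linalg.norm(v)
        if norm_v < 1e-15:
            continue                             # column already in the desired form
        u = v / norm_v
        # apply T_i = I - 2 u u^T from the left and from the right
        A[i+1:, i:] -= 2.0 * np.outer(u, u @ A[i+1:, i:])
        A[i:, i+1:] -= 2.0 * np.outer(A[i:, i+1:] @ u, u)
    return A

# Example: the transformed matrix is tri-diagonal up to round-off.
B = np.array([[4., 1., 2., 2.],
              [1., 4., 1., 2.],
              [2., 1., 4., 1.],
              [2., 2., 1., 4.]])
print(np.round(householder_tridiagonalize(B), 10))
```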

2.5.2 Hyman’s method

The classical method for computing the eigenvalues of a tri-diagonal or Hessenberg matrix
is based on the characteristic polynomial without explicitly determining the coefficients
in its monomial expansion. The method of Hyman11 (1957) computes the characteristic
polynomial χA(·) of a Hessenberg matrix A ∈ Rn×n. Let us assume that the matrix A does not separate into two submatrices of Hessenberg form, i. e., aj+1,j ≠ 0, j = 1, . . . , n−1. With a function c(·) still to be chosen, we consider the linear system

(a11 − z)x1 + a12 x2 + · · · + a1,n−1 xn−1 + a1n xn = −c (z)


a21 x1 + (a22 − z)x2 + · · · + a2,n−1 xn−1 + a2n xn = 0
..
.
an,n−1 xn−1 + (ann − z)xn = 0.

10
James Wallace Givens (1910–1993): US-American mathematician; worked at Oak Ridge National Laboratory; known for the matrix transformation named after him, the “Givens rotation” (“Computation of plane unitary rotations transforming a general matrix to triangular form”, SIAM J. Anal. Math. 6, 26–50, 1958).
11
Morton Allan Hyman: Dutch mathematician; PhD Techn. Univ. Delft 1953, Eigenvalues and
eigenvectors of general matrices, Twelfth National Meeting A.C.M., Houston, Texas, 1957.

Setting xn = 1 the values xn−1, . . . , x1 and c(z) can be successively determined. By Cramer’s rule there holds

\[
1 = x_n = \frac{(-1)^n c(z)\, a_{21} a_{32} \cdots a_{n,n-1}}{\det(A - zI)}.
\]

Consequently, c(z) = const. det(A − zI), and we obtain a recursion formula for deter-
mining the characteristic polynomial χA (z) = det(zI − A) .

Let now A ∈ Rn×n be a symmetric tri-diagonal matrix with entries bi ≠ 0, i = 1, . . . , n − 1:

\[
A = \begin{pmatrix}
a_1 & b_1 & & 0 \\
b_1 & \ddots & \ddots & \\
 & \ddots & \ddots & b_{n-1} \\
0 & & b_{n-1} & a_n
\end{pmatrix}.
\]
For the computation of the characteristic polynomial χA (·) , we have the recursion for-
mulas

p0 (z) = 1 , p1 (z) = a1 − z, pi (z) = (ai − z) pi−1 (z) − b2i−1 pi−2 (z), i = 2, . . . , n.

The polynomials pi ∈ Pi are the i-th principal minors of det(zI − A), i. e., pn = χA. To see this, we expand the (i + 1)-th principal minor with respect to the (i + 1)-th column:

\[
p_{i+1}(z) = \det \begin{pmatrix}
a_1 - z & b_1 & & & \\
b_1 & \ddots & \ddots & & \\
& \ddots & \ddots & b_{i-1} & \\
& & b_{i-1} & a_i - z & b_i \\
& & & b_i & a_{i+1} - z
\end{pmatrix}
= (a_{i+1} - z)\, p_i(z) - b_i^2\, p_{i-1}(z).
\]

Often it is useful to know the derivative χ′A(·) of χA(·) (e. g., in using the Newton method for computing the zeros of χA(·)). This is achieved by the recursion formula

q0(z) = 0,  q1(z) = −1,
qi(z) = −pi−1(z) + (ai − z) qi−1(z) − b_{i−1}² qi−2(z),  i = 2, . . . , n,
qn(z) = χ′A(z).
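For illustration, the two recursions can be evaluated simultaneously in a few lines of Python (an illustrative sketch; names are chosen freely):

```python
def char_poly_and_derivative(a, b, z):
    """Evaluate p_n(z) = chi_A(z) and q_n(z) = chi_A'(z) for the symmetric
    tridiagonal matrix with diagonal a[0..n-1] and off-diagonal b[0..n-2],
    using the three-term recursions given above."""
    n = len(a)
    p_prev, p = 1.0, a[0] - z          # p_0, p_1
    q_prev, q = 0.0, -1.0              # q_0, q_1
    for i in range(1, n):              # builds p_{i+1}, q_{i+1} in the 1-based notation
        p_new = (a[i] - z) * p - b[i-1] ** 2 * p_prev
        q_new = -p + (a[i] - z) * q - b[i-1] ** 2 * q_prev
        p_prev, p = p, p_new
        q_prev, q = q, q_new
    return p, q                        # p = chi_A(z), q = chi_A'(z)

# A Newton step for a zero of chi_A would then read: z_new = z - p / q.
```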

If the zero λ of χA, i. e., an eigenvalue of A, has been determined, a corresponding eigenvector w(λ) is given by

\[
w(z) = \begin{pmatrix} w_0(z) \\ \vdots \\ w_{n-1}(z) \end{pmatrix},
\qquad w_0(z) \equiv 1, \quad
w_i(z) := \frac{(-1)^i p_i(z)}{b_1 \cdots b_i}, \;\; i = 1, \ldots, n \quad (b_n := 1).
\qquad (2.5.42)
\]

For verifying this, we compute (A − zI) w(z). For i = 1, . . . , n−1 (b0 := 0) there holds

\[
\begin{aligned}
b_{i-1} w_{i-2}(z) &+ a_i w_{i-1}(z) + b_i w_i(z) - z\, w_{i-1}(z) \\
&= b_{i-1}(-1)^{i-2}\frac{p_{i-2}(z)}{b_1\cdots b_{i-2}} + a_i(-1)^{i-1}\frac{p_{i-1}(z)}{b_1\cdots b_{i-1}} + b_i(-1)^{i}\frac{p_{i}(z)}{b_1\cdots b_{i}} - z(-1)^{i-1}\frac{p_{i-1}(z)}{b_1\cdots b_{i-1}} \\
&= b_{i-1}^2(-1)^{i-2}\frac{p_{i-2}(z)}{b_1\cdots b_{i-1}} + a_i(-1)^{i-1}\frac{p_{i-1}(z)}{b_1\cdots b_{i-1}} + (-1)^{i}\frac{(a_i-z)p_{i-1}(z)-b_{i-1}^2 p_{i-2}(z)}{b_1\cdots b_{i-1}} - z(-1)^{i-1}\frac{p_{i-1}(z)}{b_1\cdots b_{i-1}} = 0.
\end{aligned}
\]

Further, for i = n (bn := 1):

\[
\begin{aligned}
b_{n-1} w_{n-2}(z) + a_n w_{n-1}(z) - z\, w_{n-1}(z)
&= b_{n-1}(-1)^{n-2}\frac{p_{n-2}(z)}{b_1\cdots b_{n-2}} + (a_n-z)(-1)^{n-1}\frac{p_{n-1}(z)}{b_1\cdots b_{n-1}} \\
&= -b_{n-1}^2(-1)^{n-1}\frac{p_{n-2}(z)}{b_1\cdots b_{n-1}} + (a_n-z)(-1)^{n-1}\frac{p_{n-1}(z)}{b_1\cdots b_{n-1}} \\
&= (-1)^{n-1}\frac{p_{n}(z)}{b_1\cdots b_{n-1}} = -w_n(z).
\end{aligned}
\]
Hence, we have
\[
(A - zI)\, w(z) = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ -w_n(z) \end{pmatrix}. \qquad (2.5.43)
\]

For an eigenvalue λ of A there holds wn(λ) = const · χA(λ) = 0, i. e., (A − λI) w(λ) = 0.

2.5.3 Sturm’s method

We will now describe a method for the determination of zeros of the characteristic polyno-
mial χA of a real symmetric (irreducible) tridiagonal matrix A ∈ Rn×n . Differentiating
in the identity (2.5.43) yields
\[
\big[(A - zI)\, w(z)\big]' = -w(z) + (A - zI)\, w'(z) = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ -w_n'(z) \end{pmatrix}.
\]

We set z = λ with some eigenvalue λ of A and multiply by −w(λ) to obtain

\[
0 < \|w(\lambda)\|_2^2 - \underbrace{([A - \lambda I]\, w(\lambda),\, w'(\lambda))}_{=0}
= w_{n-1}(\lambda)\, w_n'(\lambda)
= -\frac{p_{n-1}(\lambda)\, p_n'(\lambda)}{b_1^2 \cdots b_{n-1}^2}.
\]

Consequently, p′n(λ) ≠ 0, i. e., there generally holds:

(S1) All zeros of pn are simple.

Further:

(S2) For each zero λ of pn: pn−1(λ) p′n(λ) < 0.

(S3) For each real zero ζ of pi−1: pi(ζ) pi−2(ζ) < 0, i = 2, . . . , n;
since in this case pi(ζ) = −b_{i−1}² pi−2(ζ), and if pi(ζ) = 0 this would result in the contradiction
0 = pi(ζ) = pi−1(ζ) = pi−2(ζ) = . . . = p0(ζ) = 1.

Finally, there trivially holds:

(S4) p0 ≠ 0 does not change sign.

Definition 2.6: A sequence of polynomials p = pn, pn−1, . . . , p0 (or, more generally, of continuous functions fn, fn−1, . . . , f0) with the properties (S1)–(S4) is called a “Sturm12 chain” of p.

The preceding consideration has led us to the following result:

Theorem 2.15 (Sturm chain): Let A ∈ Rn×n be a symmetric, irreducible tri-diagonal matrix. Then, the principal minors pi(z) of the matrix A − zI form a Sturm chain of the characteristic polynomial χA(z) = pn(z) of A.

The value of the existence of a Sturm chain of a polynomial p consists in the following
result.

12
Jacques Charles François Sturm (1803–1855): French-Swiss mathematician; Prof. at École Polytechnique in Paris since 1840; contributions to Mathematical Physics, differential equations (“Sturm-Liouville problem”), and Differential Geometry.

Theorem 2.16 (Bisection method): Let p be a polynomial and p = pn , pn−1 , . . . , p0 a


corresponding Sturm chain. Then, the number of real zeros of p in an interval [a, b] equals
N(b) − N(a), where N(ζ) is the number of sign changes in the chain pn (ζ), . . . , p0 (ζ).

Proof. We consider the number of sign changes N(a) for increasing a . N(a) remains
constant as long as a does not pass a zero of one of the pi . Let now a be a zero of one
of the pi . We distinguish two cases:
i) Case pi(a) = 0 for i ≠ n: In this case pi+1(a) ≠ 0, pi−1(a) ≠ 0. Therefore, the sign of pj(a), j ∈ {i − 1, i, i + 1}, for sufficiently small h > 0 shows a behavior that is described by one of the following two tables:

         a−h   a   a+h                 a−h   a   a+h
 i−1      −    −    −          i−1      +    +    +
 i       +/−   0   −/+         i       +/−   0   −/+
 i+1      +    +    +          i+1      −    −    −

In each case N(a − h) = N(a) = N(a + h) and the number of sign changes does not
change.
ii) Case pn(a) = 0: In this case the behavior of pj(a), j ∈ {n − 1, n}, is described by one of the following two tables (because of (S2)):

         a−h   a   a+h                 a−h   a   a+h
 n        −    0    +          n        +    0    −
 n−1      −    −    −          n−1      +    +    +

Further, there holds N(a − h) = N(a) = N(a + h) − 1 , i. e., passing a zero of pn


causes one more sign change. For a < b and h > 0 sufficiently small the difference
N(b) − N(a) = N(b + h) − N(a − h) equals the number of zeros of pn in the interval
[a − h, b + h] . Since h can be chosen arbitrarily small the assertion follows. Q.E.D.
Theorem 2.15 suggests a simple bisection method for the approximation of roots of the characteristic polynomial χA of a symmetric, irreducible tridiagonal matrix A ∈ Rn×n. Obviously, A has only real, simple eigenvalues

λ1 < λ2 < · · · < λn .

For x → −∞ the chain

p0(x) = 1,  p1(x) = a1 − x,
pi(x) = (ai − x) pi−1(x) − b_{i−1}² pi−2(x),  i = 2, . . . , n,

has the sign distribution +, . . . , +, which shows that N(x) = 0. Consequently, N(ζ) corresponds to the number of zeros λ of χA with λ < ζ. For the eigenvalues λi of A

it follows that

λi < ζ ⇐⇒ N(ζ) ≥ i. (2.5.44)

In order to determine the i-th eigenvalue λi, one starts from an interval [a0, b0] containing λi, i. e., a0 < λ1 < λn < b0. Then, the interval is bisected and it is tested using the Sturm sequence which of the two new subintervals contains λi. Continuing this process for t = 0, 1, 2, . . ., one obtains:

\[
\mu_t := \frac{a_t + b_t}{2}, \qquad
a_{t+1} := \begin{cases} a_t, & \text{for } N(\mu_t) \ge i, \\ \mu_t, & \text{for } N(\mu_t) < i, \end{cases}
\qquad
b_{t+1} := \begin{cases} \mu_t, & \text{for } N(\mu_t) \ge i, \\ b_t, & \text{for } N(\mu_t) < i. \end{cases}
\qquad (2.5.45)
\]

By construction, we have λi ∈ [at+1 , bt+1 ] and

[a_{t+1}, b_{t+1}] ⊂ [a_t, b_t],   |a_{t+1} − b_{t+1}| = ½ |a_t − b_t|,

i. e., the points at converge monotonically increasing and bt monotonically decreasing to


λi . This algorithm is slow but very robust with respect to round-off perturbations and
allows for the determination of any eigenvalue of A independently of the others.
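A minimal Python sketch of this bisection method, combining the Sturm chain count N(ζ) with the scheme (2.5.45), might look as follows (illustrative only; no scaling against over-/underflow of the p_i is included, and exact zeros are handled by a simple sign convention):

```python
import numpy as np

def sturm_count(a, b, z):
    """N(z): number of sign changes in the Sturm chain p_0(z), ..., p_n(z)
    of the symmetric irreducible tridiagonal matrix with diagonal a and
    off-diagonal b, i.e. the number of eigenvalues smaller than z, cf. (2.5.44)."""
    chain = [1.0, a[0] - z]
    for i in range(1, len(a)):
        chain.append((a[i] - z) * chain[-1] - b[i - 1] ** 2 * chain[-2])
    # assign each member a sign; an exact zero gets the opposite sign of its predecessor
    signs = [np.sign(p) if p != 0.0 else -np.sign(q)
             for p, q in zip(chain, [1.0] + chain)]
    return sum(1 for s, t in zip(signs, signs[1:]) if s * t < 0)

def bisection_eigenvalue(a, b, i, a0, b0, tol=1e-12):
    """Approximate the i-th smallest eigenvalue lambda_i inside the start
    interval [a0, b0] by the bisection scheme (2.5.45)."""
    at, bt = a0, b0
    while bt - at > tol:
        mu = 0.5 * (at + bt)
        if sturm_count(a, b, mu) >= i:
            bt = mu                      # lambda_i < mu: keep the left half
        else:
            at = mu                      # otherwise: keep the right half
    return 0.5 * (at + bt)

# Example: tridiag(-1, 2, -1) of dimension 5 has eigenvalues 2 - 2*cos(k*pi/6).
a_diag, b_off = [2.0] * 5, [-1.0] * 4
print([round(bisection_eigenvalue(a_diag, b_off, k, 0.0, 4.0), 6) for k in range(1, 6)])
```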

2.6 Exercises

Exercise 2.1: a) Construct examples of real matrices, which are symmetric, diagonally
dominant and regular but indefinite (i. e. neither positive nor negative definite), and vice
versa those, which are positive (or negative) definite but not diagonally dominant. This
demonstrates that these two properties of matrices are independent of each other.
b) Show that a matrix A ∈ Kn×n for which the conjugate transpose ĀT is strictly
diagonally dominant is regular.
c) Show that a strictly diagonally dominant real matrix, which is symmetric and has
positive diagonal elements is positive definite.

Exercise 2.2: Let A = (aij )ni,j=1 ∈ Rn×n be a symmetric, positive definite matrix. The
Gaussian elimination algorithm (without pivoting) generates a sequence of matrices A =
A(0) → . . . → A(k) → . . . → A(n−1) = R, where R = (rij )ni,j=1 is the resulting upper-right
triangular matrix. Prove that the algorithm is “stable” in the following sense:
k = 1, . . . , n − 1:   a_{ii}^{(k)} ≤ a_{ii}^{(k−1)}, i = 1, . . . , n,    max_{1≤i,j≤n} |r_{ij}| ≤ max_{1≤i,j≤n} |a_{ij}|.

(Hint: Use the recursion formula and employ an induction argument.)

Exercise 2.3: The “LR decomposition” of a regular matrix A ∈ Rn×n is the representation of A as a product A = LR consisting of a lower-left triangular matrix L with


normalized diagonal (lii = 1, 1 ≤ i ≤ n) and an upper-right triangular matrix R.
i) Verify that the set of all (regular) lower-left triangular matrices L ∈ Rn×n , with normal-
ized diagonal (lii = 1, i = 1, . . ., n), as well as the set of all regular, upper-right triangular
matrices R ∈ Rn×n form groups with respect to matrix multiplication. Are these groups
Abelian?
ii) Use the result of (i) to prove that if the LR decomposition of a regular matrix A ∈ Rn×n
exists, it must be unique.

Exercise 2.4: Let A ∈ Rn×n be a regular matrix that admits an “LR decomposition”.
In the text it is stated that Gaussian elimination (without pivoting) has an algorithmic
complexity of 1/3 n³ + O(n²) a. op., and that in case of a symmetric matrix this reduces to 1/6 n³ + O(n²) a. op. Hereby, an “a. op.” (arithmetic operation) consists of exactly one
multiplication (with addition) or of a division.
Question: What are the algorithmic complexities of these algorithms in case of a band
matrix of type (m_l, m_r) with m_l = m_r = m? Give explicit numbers for the model matrix introduced in the text with m = 10², n = m² = 10⁴, and m = 10⁴, n = m² = 10⁸, respectively.

Exercise 2.5: Consider the linear system Ax = b where

\[
\begin{pmatrix} 1 & 3 & -4 \\ 3 & 9 & -2 \\ 4 & 12 & -6 \\ 2 & 6 & 2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}.
\]

a) Investigate whether this system is solvable (with argument).


b) Determine the least-error squares solution of the system (“minimal solution”).
c) Is this “solution” unique?
d) Are the matrices AᵀA and AAᵀ (symmetric) positive definite?
3 Iterative Methods for Linear Algebraic Systems
In this chapter, we discuss iterative methods for solving linear systems. The underlying
problem has the form

Ax = b, (3.0.1)

with a real square matrix A = (aij )ni,j=1 ∈ Rn×n and a vector b = (bj )nj=1 ∈ Rn . Here,
we concentrate on the higher-dimensional case n ≫ 10³, such that, besides arithmetical
complexity, also storage requirement becomes an important issue. In practice, high-
dimensional matrices usually have very special structure, e. g., band structure and extreme
sparsity, which needs to be exploited by the solution algorithms. The most cost-intensive
parts of the considered algorithms are simple matrix-vector multiplications x → Ax .
Most of the considered methods and results are also applicable in the case of matrices
and right-hand sides with complex entries.

3.1 Fixed-point iteration and defect correction

For the construction of cheap iterative methods for solving problem (3.0.1), one rewrites
it in form of an equivalent fixed-point problem,

Ax = b ⇔ Cx = Cx − Ax + b ⇔ x = (I − C −1 A)x + C −1 b,

with a suitable regular matrix C ∈ Rn×n , the so-called “preconditioner”. Then, starting
from some initial value x0 , one uses a simple fixed-point iteration,

\[
x^t = \underbrace{(I - C^{-1}A)}_{=:\,B}\, x^{t-1} + \underbrace{C^{-1} b}_{=:\,c}, \qquad t = 1, 2, \ldots. \qquad (3.1.2)
\]
Here, the matrix B = I − C −1 A is called the “iteration matrix” of the fixed-point
iteration. Its properties are decisive for the convergence of the method. In practice,
such a fixed-point iteration is organized in form of a “defect correction” iteration, which
essentially requires in each step only a matrix-vector multiplication and the solution of a
linear system with the matrix C as coefficient matrix:

dt−1 = b − Axt−1 (residual), Cδxt = dt−1 (correction), xt = xt−1 + δxt (update).
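A generic defect-correction loop can be sketched as follows (illustrative NumPy code; `solve_with_C` is a placeholder for whatever preconditioner solve is chosen and is not a routine from the text):

```python
import numpy as np

def defect_correction(A, b, solve_with_C, x0, tol=1e-10, max_iter=10000):
    """Generic preconditioned fixed-point ("defect correction") iteration:
    d = b - A x,  C dx = d,  x <- x + dx."""
    x = np.array(x0, dtype=float)
    for t in range(max_iter):
        d = b - A @ x                                  # residual (defect)
        if np.linalg.norm(d) <= tol * np.linalg.norm(b):
            return x, t
        x += solve_with_C(d)                           # correction and update
    return x, max_iter

# Jacobi preconditioning (C = D) as one concrete choice:
# x, steps = defect_correction(A, b, lambda d: d / np.diag(A), x0)
```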

Example 3.1: The simplest method of this type is the (damped) Richardson1 method,
which for a suitable parameter θ ∈ (0, 2λmax (A)−1 ] uses the matrices

C = θ−1 I, B = I − θA. (3.1.3)

1
Lewis Fry Richardson (1881–1953): English mathematician and physicist; worked at several institutions in England and Scotland; a typical “applied mathematician”; pioneered modeling and numerics
in weather prediction.


Starting from some initial value x0 the iteration looks like

xt = xt−1 + θ(b − Axt−1 ), t = 1, 2, . . . . (3.1.4)

In view of the Banach fixed-point theorem a sufficient criterion for the convergence
of the fixed-point iteration (3.1.2) is the contraction property of the corresponding fixed-
point mapping g(x) := Bx + c ,

‖g(x) − g(y)‖ = ‖B(x − y)‖ ≤ ‖B‖ ‖x − y‖,   ‖B‖ < 1,

in some vector norm ‖·‖. For a given iteration matrix B the property ‖B‖ < 1 may
depend on the particular choice of the norm. Hence, it is desirable to characterize the
convergence of this iteration in terms of norm-independent properties of B . For this, the
appropriate quantity is the “spectral radius”

spr(B) := max { |λ| : λ ∈ σ(B) }.

Obviously, spr(B) is the radius of the smallest circle in C around the origin, which
contains all eigenvalues of B . For any natural matrix norm  · , there holds

spr(B) ≤ B. (3.1.5)

For symmetric B , we even have

Bx2
spr(B) = B2 = sup . (3.1.6)
x∈Rn \{0} x2

However, we note that spr(·) does not define a norm on Rn×n since the triangle inequality
does not hold in general.

Theorem 3.1 (Fixed-point iteration): The fixed-point iteration (3.1.2) converges for
any starting value x0 if and only if

ρ := spr(B) < 1. (3.1.7)

In case of convergence the limit is the uniquely determined fixed point x . The asymptotic
convergence behavior with respect to any vector norm ‖·‖ is characterized by

\[
\sup_{x^0 \in \mathbb{R}^n} \limsup_{t \to \infty} \Big( \frac{\|x^t - x\|}{\|x^0 - x\|} \Big)^{1/t} = \rho. \qquad (3.1.8)
\]

Hence, the number of iteration steps necessary for an asymptotic error reduction by a
small factor TOL > 0 is approximately given by

t(TOL) ≈ ln(1/TOL) / ln(1/ρ).     (3.1.9)

Proof. Assuming the existence of a fixed point x , we introduce the notation et := xt −x.
Recalling that x = Bx + c, we find

et = xt − x = Bxt−1 + c − (Bx + c) = Bet−1 = · · · = B t e0 .

i) In case that spr(B) < 1, in view of Lemma 3.1 below, there exists a vector norm ‖·‖_{B,ε} depending on B and some ε > 0 chosen sufficiently small, such that the corresponding natural matrix norm ‖·‖_{B,ε} satisfies

‖B‖_{B,ε} ≤ spr(B) + ε = ρ + ε < 1.

Consequently, by the Banach fixed-point theorem, there exists a unique fixed point x and the fixed-point iteration converges for any starting value x0:

‖et‖_{B,ε} = ‖Bᵗ e0‖_{B,ε} ≤ ‖Bᵗ‖_{B,ε} ‖e0‖_{B,ε} ≤ ‖B‖ᵗ_{B,ε} ‖e0‖_{B,ε} → 0.

In view of the norm equivalence in Rn this means convergence xt → x (t → ∞).

ii) Now, we assume convergence for any starting value x0. Let λ be an eigenvalue of B such that |λ| = ρ and w ≠ 0 a corresponding eigenvector. Then, for the particular starting value x0 := x + w, we obtain

λᵗ e0 = λᵗ w = Bᵗ w = Bᵗ e0 = et → 0 (t → ∞).

This necessarily requires spr(B) = |λ| < 1. As a byproduct of this argument, we see that in this particular case

( ‖et‖ / ‖e0‖ )^{1/t} = ρ,  t ∈ N.
iii) For arbitrarily small ε > 0 let ‖·‖_{B,ε} again be the above special norm for which ‖B‖_{B,ε} ≤ ρ + ε. Then, by the norm equivalence, for any other vector norm ‖·‖ there exist positive numbers m = m(B, ε), M = M(B, ε) such that

m ‖x‖ ≤ ‖x‖_{B,ε} ≤ M ‖x‖,  x ∈ Rn.

Using this notation, we obtain

\[
\|e^t\| \le \frac{1}{m}\|e^t\|_{B,\varepsilon} = \frac{1}{m}\|B^t e^0\|_{B,\varepsilon} \le \frac{1}{m}\|B\|^t_{B,\varepsilon}\|e^0\|_{B,\varepsilon} \le \frac{M}{m}(\rho+\varepsilon)^t \|e^0\|,
\]

and, consequently, observing that (M/m)^{1/t} → 1 (t → ∞):

\[
\limsup_{t\to\infty} \Big( \frac{\|e^t\|}{\|e^0\|} \Big)^{1/t} \le \rho + \varepsilon.
\]

Since ε > 0 can be chosen arbitrarily small and recalling the last identity in (ii), we
obtain the asserted identity (3.1.8).
iv) Finally, requiring an error reduction by TOL > 0, we have to set

\[
\frac{\|x^t - x\|}{\|x^0 - x\|} \le (\rho+\varepsilon)^t \approx \mathrm{TOL}, \qquad t \ge t(\mathrm{TOL}),
\]

from which we obtain

t(TOL) ≈ ln(1/TOL) / ln(1/ρ).
This completes the proof. Q.E.D.
The spectral radius of the iteration matrix determines the general asymptotic con-
vergence behavior of the fixed-point iteration. The relation (3.1.9) can be interpreted as
follows: In case that ρ = spr(B) < 1 the error obtained in the t-th step ( t sufficiently
large) can be further reduced by a factor 10−1 , i. e., gaining one additional decimal in
accuracy, by

t(10⁻¹) ≈ ln(10) / ln(1/ρ)

more iterations. For example, for ρ ∼ 0.99, which is not at all unrealistic, we have t(10⁻¹) ∼ 230. For large systems with n ≫ 10⁶ this means substantial work even if each iteration step only requires O(n) arithmetic operations.
We have to provide the auxiliary lemma used in the proof of Theorem 3.1.

Lemma 3.1 (Spectral radius): For any matrix B ∈ Rn×n and any small ε > 0 there exists a natural matrix norm ‖·‖_{B,ε}, such that

spr(B) ≤ ‖B‖_{B,ε} ≤ spr(B) + ε.     (3.1.10)

Proof. The matrix B is similar to an upper triangular matrix (e. g., its Jordan normal form),

\[
B = T^{-1} R\, T, \qquad
R = \begin{pmatrix} r_{11} & \cdots & r_{1n} \\ & \ddots & \vdots \\ 0 & & r_{nn} \end{pmatrix},
\]

with the eigenvalues of B on its main diagonal. Hence,

spr(B) = max_{1≤i≤n} |r_{ii}|.

For an arbitrary δ ∈ (0, 1], we set

\[
S_\delta = \begin{pmatrix} 1 & & & 0 \\ & \delta & & \\ & & \ddots & \\ 0 & & & \delta^{n-1} \end{pmatrix},
\quad
R_0 = \begin{pmatrix} r_{11} & & 0 \\ & \ddots & \\ 0 & & r_{nn} \end{pmatrix},
\quad
Q_\delta = \begin{pmatrix} 0 & r_{12} & \delta r_{13} & \cdots & \delta^{n-2} r_{1n} \\ & \ddots & \ddots & \ddots & \vdots \\ & & \ddots & \ddots & \delta r_{n-2,n} \\ & & & \ddots & r_{n-1,n} \\ & & & & 0 \end{pmatrix},
\]

and, with this notation, have

\[
R_\delta := S_\delta^{-1} R\, S_\delta
= \begin{pmatrix} r_{11} & \delta r_{12} & \cdots & \delta^{n-1} r_{1n} \\ & \ddots & \ddots & \vdots \\ & & \ddots & \delta r_{n-1,n} \\ 0 & & & r_{nn} \end{pmatrix}
= R_0 + \delta Q_\delta.
\]

In view of the regularity of Sδ⁻¹T, a vector norm is defined by

‖x‖_δ := ‖Sδ⁻¹T x‖₂,  x ∈ Rn.

Then, observing R = Sδ Rδ Sδ⁻¹, there holds

B = T⁻¹RT = T⁻¹Sδ Rδ Sδ⁻¹T.

Hence, for all x ∈ Rn and y = Sδ⁻¹T x:

\[
\|Bx\|_\delta = \|T^{-1}S_\delta R_\delta S_\delta^{-1}Tx\|_\delta = \|R_\delta y\|_2
\le \|R_0 y\|_2 + \delta \|Q_\delta y\|_2
\le \big\{ \max_{1\le i\le n} |r_{ii}| + \delta\mu \big\}\, \|y\|_2
\le \big\{ \operatorname{spr}(B) + \delta\mu \big\}\, \|x\|_\delta,
\]

with the constant

\[
\mu = \Big( \sum_{i,j=1}^n |r_{ij}|^2 \Big)^{1/2}.
\]

This implies

\[
\|B\|_\delta = \sup_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Bx\|_\delta}{\|x\|_\delta} \le \operatorname{spr}(B) + \mu\delta,
\]

and setting δ := ε/μ the desired vector norm is given by ‖·‖_{B,ε} := ‖·‖_δ. Q.E.D.

3.1.1 Stopping criteria

In using an iterative method, one needs “stopping criteria”, which for some prescribed
accuracy TOL terminates the iteration, in the ideal case, once this required accuracy is
reached.

i) Strategy 1. From the Banach fixed-point theorem, we have the general error estimate

\[
\|x^t - x\| \le \frac{q}{1-q}\, \|x^t - x^{t-1}\|, \qquad (3.1.11)
\]

with the “contraction constant” q = ‖B‖ < 1. For a given error tolerance TOL > 0 the
iteration could be stopped when

\[
\frac{\|B\|}{1 - \|B\|}\, \frac{\|x^t - x^{t-1}\|}{\|x^t\|} \le \mathrm{TOL}. \qquad (3.1.12)
\]

The realization of this strategy requires a quantitatively correct estimate of the norm ‖B‖ or of spr(B). That has to be generated from the computed iterates xt, i. e., a
posteriori in the course of the computation. In general the iteration matrix B = I −C −1 A
cannot be computed explicitly with acceptable work. Methods for estimating spr(B) will
be considered in the chapter about the iterative solution of eigenvalue problems, below.
ii) Strategy 2. Alternatively, one can evaluate the “residual” ‖Axt − b‖. Observing that et = xt − x = A⁻¹(Axt − b) and x = A⁻¹b, it follows that

‖et‖ ≤ ‖A⁻¹‖ ‖Axt − b‖,   1/‖b‖ ≥ 1/(‖A‖ ‖x‖),

and further

\[
\frac{\|e^t\|}{\|x\|} \le \|A^{-1}\|\,\|A\|\, \frac{\|Ax^t - b\|}{\|b\|} = \operatorname{cond}(A)\, \frac{\|Ax^t - b\|}{\|b\|}.
\]
This leads us to the stopping criterion

cond(A) ‖Axt − b‖ / ‖b‖ ≤ TOL.     (3.1.13)

The evaluation of this criterion requires an estimate of cond(A), which may be as costly as the solution of the equation Ax = b itself. Using the spectral norm ‖·‖₂ the condition number is related to the singular values of A (square roots of the eigenvalues of AᵀA),

cond₂(A) = σ_max / σ_min.
Again, generating accurate estimates of these eigenvalues may require more work than the solution of Ax = b. This short discussion shows that designing useful stopping criteria for iterative methods is not at all an easy task. However, in the context of linear
systems originating from the “finite element discretization” (“FEM”) of partial differen-
tial equations there are approaches based on the concept of “Galerkin orthogonality”,
which allow for a systematic balancing of iteration and discretization errors. In this way,
practical stopping criteria can be designed, by which the iteration may be terminated
once the level of the discretization error is reached. Here, the criterion is essentially the
approximate solution’s “violation of Galerkin orthogonality” (s. Meidner et al. [43] and
Rannacher et al. [45] for more details).

3.1.2 Construction of iterative methods

The construction of concrete iterative methods for solving the linear system Ax = b by
defect correction requires the specification of the preconditioner C . For this task two
particular goals have to be observed:
– spr(I − C−1 A) should be as small as possible.
– The correction equation Cδxt = b − Axt−1 should be solvable with O(n) a. op.,
requiring storage space not much exceeding that for storing the matrix A itself.

Unfortunately, these requirements contradict each other. The two extreme cases are:

C=A ⇒ spr(I−C−1 A) = 0
C = θ−1 I ⇒ spr(I−C−1 A) ≈ 1.

The simplest preconditioners are defined using the natural additive decomposition of the
matrix, A = L + D + R , where
\[
D = \begin{pmatrix} a_{11} & & 0 \\ & \ddots & \\ 0 & & a_{nn} \end{pmatrix},
\qquad
L = \begin{pmatrix} 0 & & & 0 \\ a_{21} & \ddots & & \\ \vdots & \ddots & \ddots & \\ a_{n1} & \cdots & a_{n,n-1} & 0 \end{pmatrix},
\qquad
R = \begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ & \ddots & \ddots & \vdots \\ & & \ddots & a_{n-1,n} \\ 0 & & & 0 \end{pmatrix}.
\]

Further, we assume that the main diagonal elements of A are nonzero, aii ≠ 0.

1. Jacobi2 method (“Gesamtschrittverfahren” in German):

C = D, B = −D −1 (L + R) =: J (iteration matrix). (3.1.14)

The iteration of the Jacobi method reads

Dxt = b − (L+R)xt−1 , t = 1, 2, . . . , (3.1.15)

or written component-wise:

\[
a_{ii}\, x_i^t = b_i - \sum_{j=1,\, j \ne i}^n a_{ij}\, x_j^{t-1}, \qquad i = 1, \ldots, n.
\]

2. Gauß-Seidel method (“Einzelschrittverfahren” in German):

C = D + L, B = −(D + L)−1 R =: H1 (iteration matrix). (3.1.16)

2
Carl Gustav Jakob Jacobi (1804–1851): German mathematician; already as a child highly gifted;
worked in Königsberg and Berlin; contributions to many parts of mathematics: Number Theory, elliptic
functions, partial differential equations, functional determinants, and Theoretical Mechanics.

The iteration of the Gauß-Seidel method reads as follows:

(D + L) xt = b − R xt−1,  t = 1, 2, . . . .

Writing this iteration componentwise,

\[
a_{ii}\, x_i^t = b_i - \sum_{j<i} a_{ij}\, x_j^{t} - \sum_{j>i} a_{ij}\, x_j^{t-1}, \qquad i = 1, \ldots, n,
\]

one sees that Jacobi and Gauß-Seidel method have exactly the same arithmetic com-
plexity per iteration step and require the same amount of storage. However, since
the latter method uses a better approximation of the matrix A as preconditioner it
is expected to have an iteration matrix with smaller spectral radius, i. e., converges
faster. It will be shown below that this is actually the case for certain classes of
matrices A.
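For illustration, one sweep of each method may be written in NumPy as follows (an illustrative sketch, not an optimized implementation):

```python
import numpy as np

def jacobi_sweep(A, b, x):
    # One Jacobi step: a_ii x_i^new = b_i - sum_{j != i} a_ij x_j^old
    D = np.diag(A)
    return (b - A @ x + D * x) / D

def gauss_seidel_sweep(A, b, x):
    # One Gauss-Seidel step: new components are used as soon as they are available
    x = x.copy()
    for i in range(len(b)):
        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x
```

Both sweeps require essentially one matrix-vector product per iteration, i.e., O(n) a. op. for a sparse matrix A.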

3. SOR method (“Successive Over-Relaxation”): ω ∈ (0, 2)

C = ω⁻¹(D + ωL),   B = −(D + ωL)⁻¹[(ω − 1)D + ωR].     (3.1.17)
The SOR method is designed to accelerate the Gauß-Seidel method by introducing
a “relaxation parameter” ω ∈ R, which can be optimized in order to minimize the
spectral radius of the corresponding iteration matrix. Its iteration reads as follows:

(D + ωL)xt = ωb − [(ω − 1)D + ωR]xt−1 , t = 1, 2, . . . .

The arithmetic complexity is about that of Jacobi and Gauß-Seidel method. But
the parameter ω can be optimized for a certain class of matrices resulting in a
significantly faster convergence than that of the other two simple methods.

4. ILU method (“Incomplete LU Decomposition”):

C = L̃R̃, B = I − R̃−1 L̃−1 A. (3.1.18)

For a symmetric, positive definite matrix A the ILU method naturally becomes
the ILLT method (“Incomplete Cholesky decomposition”). The ILU decomposition
is obtained by the usual recursive process for the direct computation of the LU
decomposition from the relation LU = A by setting all matrix elements to zero,
which correspond to index pairs {i, j} for which aij = 0 :


\[
\begin{aligned}
i = 1, \ldots, n: \quad
& \tilde r_{il} = a_{il} - \sum_{k=1}^{i-1} \tilde l_{ik}\, \tilde r_{kl} \qquad (l = 1, \ldots, n), \\
& \tilde l_{ii} = 1, \qquad
\tilde l_{ki} = \tilde r_{ii}^{-1} \Big( a_{ki} - \sum_{l=1}^{i-1} \tilde l_{kl}\, \tilde r_{li} \Big) \qquad (k = i+1, \ldots, n), \\
& \tilde l_{ij} = 0, \;\; \tilde r_{ij} = 0, \;\; \text{for } a_{ij} = 0.
\end{aligned}
\]

If this process stops because some r̃ii = 0 , we set r̃ii := δ > 0 and continue. The
iteration of the ILU method reads as follows:

L̃R̃xt = (L̃R̃ − A)xt−1 + b, t = 1, 2, . . . .

We note that here, L and U stand for “lower” and “upper” triangular matrix,
respectively, in contrast to the notion L and R for “left” and “right” triangular
matrix as used before in the context of multiplicative matrix decomposition.
Again this preconditioner is cheap, for sparse matrices, O(n) a. op. per iteration
step, but its convergence is difficult to analyze and will not be discussed further.
However, in certain situations the ILU method plays an important role as a robust
“smoothing iteration” within “multigrid methods” to be discussed below.
5. ADI method (“Alternating-Direction Implicit Iteration”):

C = (A_x + ωI)(A_y + ωI),
B = (A_y + ωI)⁻¹(ωI − A_x)(A_x + ωI)⁻¹(ωI − A_y).     (3.1.19)

The ADI method can be applied to matrices A which originate from the discretiza-
tion of certain elliptic partial differential equations, in which the contributions from
the different spatial directions (x-direction and y-direction in 2D) are separated in
the form A = Ax + Ay . A typical example is the central difference approximation of
the Poisson equation described in Chapter 0.4.2. The iteration of the ADI method
reads as follows:
 
(A_x + ωI)(A_y + ωI) xt = [ (A_x + ωI)(A_y + ωI) − A ] xt−1 + b,  t = 1, 2, . . . .

Here, the matrices Ax + ωI and Ay + ωI are tri-diagonal, such that the second
goal “solution efficiency” is achieved, while the full matrix A is five-diagonal. This
method can be shown to converge for any choice of the parameter ω > 0 . For
certain classes of matrices the optimal choice of ω leads to convergence, which is
at least as fast as that of the optimal SOR method. We will not discuss this issue
further since the range of applicability of the ADI method is rather limited.

Remark 3.1 (Block-versions of fixed-point iterations): Sometimes the coefficient


matrix A has a regular block structure for special numberings of the unknowns (e. g., in
the discretization of the Navier-Stokes equations when grouping the velocity and pressure
unknowns together at each mesh point):
\[
A = \begin{pmatrix} A_{11} & \cdots & A_{1r} \\ \vdots & \ddots & \vdots \\ A_{r1} & \cdots & A_{rr} \end{pmatrix},
\]

where the submatrices A_{ij} are of small dimension, 3−10, such that the explicit inversion

of the diagonal blocks Aii is possible without spoiling the overall complexity of O(n) a. op.
per iteration step.

3.1.3 Jacobi- and Gauß-Seidel methods

In the following, we will give a complete convergence analysis of Jacobi and Gauß-Seidel
method. As already stated above, both methods have the same arithmetic cost (per
iteration step) and require not much more storage as needed for storing the matrix A.
This simplicity suggests that both methods may not be very fast, which will actually be
seen below at the model matrix in Example (2.7) of Section 2.2.

Theorem 3.2 (Strong row-sum criterion): If the row sums or the column sums of
the matrix A ∈ Rn×n satisfy the condition (strict diagonal dominance)


n 
n
|ajk | < |ajj | or |akj | < |ajj |, j = 1, . . . , n, (3.1.20)
k=1,k =j k=1,k =j

then, spr(J) < 1 and spr(H1 ) < 1 , i. e., Jacobi and Gauß-Seidel method converge.

Proof. First, assume that the matrix A is strictly diagonally dominant. Let λ ∈ σ(J) and μ ∈ σ(H₁) with corresponding eigenvectors v and w, respectively. Then, noting that a_{jj} ≠ 0, we have

λv = Jv = −D⁻¹(L+R)v

and

μw = H₁w = −(D+L)⁻¹Rw  ⇔  μw = −D⁻¹(μL+R)w.

From this it follows that for ‖v‖_∞ = ‖w‖_∞ = 1 and using the strict diagonal dominance of A:

\[
|\lambda| \le \|D^{-1}(L+R)\|_\infty = \max_{j=1,\ldots,n} \frac{1}{|a_{jj}|} \sum_{k=1,\, k \ne j}^n |a_{jk}| < 1.
\]

Hence, spr(J) < 1. Further,

\[
|\mu| \le \|D^{-1}(\mu L+R)\|_\infty \le \max_{1 \le j \le n} \frac{1}{|a_{jj}|} \Big\{ |\mu| \sum_{k<j} |a_{jk}| + \sum_{k>j} |a_{jk}| \Big\}.
\]

For |μ| ≥ 1, we would obtain the contradiction

|μ| ≤ |μ| ‖D⁻¹(L+R)‖_∞ < |μ|,

so that also spr(H1 ) < 1 . If instead of A its transpose AT is strictly diagonally dominant,
we can argue analogously since, in view of λ(ĀT ) = λ(A) , the spectral radii of these two
matrices coincide. Q.E.D.

Remark 3.2: We show an example of a non-symmetric matrix A, which satisfies the


strong column- but not the strong row-sum criterion:
\[
A = \begin{pmatrix} 4 & 4 & 1 \\ 2 & 5 & 3 \\ 1 & 0 & 5 \end{pmatrix},
\qquad
A^T = \begin{pmatrix} 4 & 2 & 1 \\ 4 & 5 & 0 \\ 1 & 3 & 5 \end{pmatrix}.
\]

Clearly, for symmetric matrices the two conditions are equivalent.

The strict diagonal dominance of A or AT required in Theorem 3.2 is a too restric-


tive condition for the needs of many applications. In most cases only simple “diagonal
dominance” is given as in the Example (2.7) of Section 2.2,
\[
A = \begin{pmatrix}
B & -I_4 & & \\
-I_4 & B & -I_4 & \\
& -I_4 & B & -I_4 \\
& & -I_4 & B
\end{pmatrix} \Bigg\}\, 16,
\qquad
B = \begin{pmatrix}
4 & -1 & & \\
-1 & 4 & -1 & \\
& -1 & 4 & -1 \\
& & -1 & 4
\end{pmatrix} \Bigg\}\, 4.
\]

However, this matrix is strictly diagonally dominant in some of its rows, which together
with an additional structural property of A can be used to guarantee convergence of
Jacobi and Gauß-Seidel method.

Definition 3.1: A matrix A ∈ Rn×n is called “reducible”, if there exists a permutation matrix P such that

\[
P A P^T = \begin{pmatrix} \tilde A_{11} & 0 \\ \tilde A_{21} & \tilde A_{22} \end{pmatrix}
\]

(simultaneous row and column permutation) with matrices Ã₁₁ ∈ Rp×p, Ã₂₂ ∈ Rq×q, Ã₂₁ ∈ Rq×p, p, q > 0, p + q = n. It is called “irreducible” if it is not reducible.

For a reducible matrix A the linear system Ax = b can be transformed into an


equivalent system of the form P AP T y = P b, x = P T y which is decoupled into two
separate parts such that it could be solved in two successive steps. The following lemma
provides a criterion for the irreducibility of the matrix A , which can be used in concrete
cases. For example, the above model matrix A is irreducible.

Lemma 3.2 (Irreducibility): A matrix A ∈ Rn×n is irreducible if and only if the associated directed graph

G(A) := { knots P₁, . . . , Pₙ;  edges Pⱼ → Pₖ  ⇔  aⱼₖ ≠ 0, j, k = 1, . . . , n }

is connected, i. e., for each pair of knots {Pⱼ, Pₖ} there exists a directed connection between Pⱼ and Pₖ.

Proof. The reducibility of A can be formulated as follows: There exists a non-trivial decomposition Nₙ = J ∪ K of the index set Nₙ = {1, . . . , n}, J, K ≠ ∅, J ∩ K = ∅, such that aⱼₖ = 0 for all pairs {j, k} ∈ J × K. Connectivity of the graph G(A) now means that for any pair of indices {j, k} there exists a chain of indices i₁, . . . , iₘ ∈ {1, . . . , n} such that

a_{j i₁} ≠ 0,  a_{i₁ i₂} ≠ 0,  . . . ,  a_{i_{m−1} i_m} ≠ 0,  a_{i_m k} ≠ 0.

From this, we can conclude the asserted characterization (left as exercise). Q.E.D.
For irreducible matrices the condition in the strong row-sum criterion can be relaxed.

Theorem 3.3 (Weak row-sum criterion): Let the matrix A ∈ Rn×n be irreducible
and diagonally dominant,

\[
\sum_{k=1,\, k \ne j}^n |a_{jk}| \le |a_{jj}|, \qquad j = 1, \ldots, n, \qquad (3.1.21)
\]

and let for at least one index r ∈ {1, . . . , n} the corresponding row sum satisfy

\[
\sum_{k=1,\, k \ne r}^n |a_{rk}| < |a_{rr}|. \qquad (3.1.22)
\]

Then, A is regular and spr(J) < 1 and spr(H1 ) < 1 , i. e., Jacobi and Gauß-Seidel
method converge. An analogous criterion holds in terms of the column sums of A .

Proof. i) Because of the assumed irreducibility of the matrix A there necessarily holds

\[
\sum_{k=1}^n |a_{jk}| > 0, \qquad j = 1, \ldots, n,
\]

and, consequently, by its diagonal dominance, a_{jj} ≠ 0, j = 1, . . . , n. Hence, Jacobi and Gauß-Seidel method are feasible. With the aid of the diagonal dominance, we conclude analogously as in the proof of Theorem 3.2 that

spr(J) ≤ 1,  spr(H₁) ≤ 1.

ii) Suppose now that there is an eigenvalue λ ∈ σ(J) with modulus |λ| = 1. Let v ∈ Cn be a corresponding eigenvector with a component v_s satisfying |v_s| = ‖v‖_∞ = 1. There holds

\[
|\lambda|\, |v_i| \le |a_{ii}|^{-1} \sum_{k \ne i} |a_{ik}|\, |v_k|, \qquad i = 1, \ldots, n. \qquad (3.1.23)
\]

By the assumed irreducibility of A in the sense of Lemma 3.2 there exists a chain of indices i₁, . . . , iₘ such that a_{s i₁} ≠ 0, . . . , a_{i_m r} ≠ 0. Hence, by multiple use of the inequality (3.1.23), we obtain the following contradiction (observe that |λ| = 1):



\[
\begin{aligned}
|v_r| = |\lambda v_r| &\le |a_{rr}|^{-1} \sum_{k \ne r} |a_{rk}|\, \|v\|_\infty < \|v\|_\infty, \\
|v_{i_m}| = |\lambda v_{i_m}| &\le |a_{i_m i_m}|^{-1} \Big\{ \sum_{k \ne i_m, r} |a_{i_m k}|\, \|v\|_\infty + \underbrace{|a_{i_m r}|}_{\ne 0}\, |v_r| \Big\} < \|v\|_\infty, \\
&\;\;\vdots \\
|v_{i_1}| = |\lambda v_{i_1}| &\le |a_{i_1 i_1}|^{-1} \Big\{ \sum_{k \ne i_1, i_2} |a_{i_1 k}|\, \|v\|_\infty + \underbrace{|a_{i_1 i_2}|}_{\ne 0}\, |v_{i_2}| \Big\} < \|v\|_\infty, \\
\|v\|_\infty = |\lambda v_s| &\le |a_{ss}|^{-1} \Big\{ \sum_{k \ne s, i_1} |a_{sk}|\, \|v\|_\infty + \underbrace{|a_{s i_1}|}_{\ne 0}\, |v_{i_1}| \Big\} < \|v\|_\infty.
\end{aligned}
\]

Consequently, there must hold spr(J) < 1 . Analogously, we also conclude spr(H1 ) < 1.
Finally, in view of A = D(I −J) the matrix A must be regular. Q.E.D.

3.2 Acceleration methods

For practical problems Jacobi and Gauß-Seidel method are usually much too slow. There-
fore, one tries to improve their convergence by several strategies, two of which will be
discussed below.

3.2.1 SOR method

The SOR method can be interpreted as combining the Gauß-Seidel method with an extra
“relaxation step”. Starting from a standard Gauß-Seidel step in the t-th iteration,
\[
\tilde x_j^t = \frac{1}{a_{jj}} \Big( b_j - \sum_{k<j} a_{jk}\, x_k^t - \sum_{k>j} a_{jk}\, x_k^{t-1} \Big),
\]

the next iterate xtj is generated as a convex linear combination (“relaxation”) of the form

x_j^t = ω x̃_j^t + (1 − ω) x_j^{t−1},

with a parameter ω ∈ (0, 2). For ω = 1 this is just the Gauß-Seidel iteration. For ω < 1,
one speaks of “underrelaxation” and for ω > 1 of “overrelaxation”. The iteration matrix
of the SOR method is obtained from the relation

xt = ωD −1 {b − Lxt − Rxt−1 } + (1 − ω)xt−1

as
Hω = −(D + ωL)−1 [(ω − 1) D + ωR].

Hence, the iteration reads

xt = Hω xt−1 + ω(D + ωL)−1b, (3.2.24)

or in componentwise notation:
\[
x_i^t = (1-\omega)\, x_i^{t-1} + \frac{\omega}{a_{ii}} \Big( b_i - \sum_{j<i} a_{ij}\, x_j^t - \sum_{j>i} a_{ij}\, x_j^{t-1} \Big), \qquad i = 1, \ldots, n. \qquad (3.2.25)
\]
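In code, one SOR sweep according to (3.2.25) differs from the Gauß-Seidel sweep only by the final relaxation step (again an illustrative NumPy sketch):

```python
import numpy as np

def sor_sweep(A, b, x, omega):
    # One SOR step according to (3.2.25)
    x = x.copy()
    for i in range(len(b)):
        gs = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        x[i] = (1.0 - omega) * x[i] + omega * gs
    return x
```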

The following lemma shows that the relaxation parameter has to be picked in the range 0 < ω < 2 if one wants to guarantee convergence.

Lemma 3.3 (Relaxation): For an arbitrary matrix A ∈ Rn×n with regular D there
holds

spr (Hω ) ≥ |ω − 1| , ω ∈ R. (3.2.26)

Proof. We have

\[
H_\omega = (D+\omega L)^{-1}\big[(1-\omega)D - \omega R\big]
= (\underbrace{I + \omega D^{-1}L}_{=:\,L'})^{-1} \underbrace{D^{-1}D}_{=I} \big[(1-\omega)I - \omega \underbrace{D^{-1}R}_{=:\,R'}\big].
\]

Then,

\[
\det(H_\omega) = \underbrace{\det(I+\omega L')^{-1}}_{=1} \cdot \underbrace{\det\big((1-\omega)I - \omega R'\big)}_{=(1-\omega)^n} = (1-\omega)^n.
\]

Since det(H_ω) = ∏ᵢ₌₁ⁿ λᵢ (λᵢ ∈ σ(H_ω)) it follows that

\[
\operatorname{spr}(H_\omega) = \max_{1\le i\le n} |\lambda_i| \ge \Big| \prod_{i=1}^n \lambda_i \Big|^{1/n} = |1-\omega|,
\]

which proves the asserted estimate. Q.E.D.


For positive definite matrices the assertion of Lemma 3.3 can be reversed in a certain
sense. This is the content of the following theorem of Ostrowski3 and Reich4.

3
Alexander Markowitsch Ostrowski (1893–1986): Russian-German-Swiss mathematician; studied at
Marburg, Göttingen (with D. Hilbert and E. Landau) and Hamburg, since 1927 Prof. in Basel; worked
on Dirichlet series, in Valuation Theory and especially in Numerical Analysis: “On the linear iteration
procedures for symmetric matrices”, Rend. Mat. Appl. 5, 140–163 (1954).
4
Edgar Reich (1927–2009): US-American mathematician of German origin; started as an Electrical Engineer at MIT (Massachusetts, USA) and Rand Corp., working there on numerical methods and Queuing Theory:
“On the convergence of the classical iterative method for solving linear simultaneous equations”, Ann.
Math. Statist. 20. 448–451 (1949); PhD at UCLA and 2-year postdoc at Princeton, since 1956 Prof.
at Univ. of Minnesota (Minneapolis, USA), work in Complex Analysis especially on quasi-conformal
mappings.

Theorem 3.4 (Theorem of Ostrowski-Reich): For a positive definite matrix A ∈


Rn×n there holds

spr(Hω ) < 1 , for 0 < ω < 2. (3.2.27)

Hence, especially the Gauß-Seidel method (ω = 1) is convergent. Its asymptotic conver-


gence speed can be estimated by

\[
\operatorname{spr}(H_1) \le 1 - \frac{2}{\mu} + \frac{2}{\mu(\mu+1)}, \qquad \mu := \frac{\lambda_{\max}(D)}{\lambda_{\min}(A)}, \qquad (3.2.28)
\]

assuming the quantity μ ≈ cond2 (A) to be large.

Proof. i) In view of the symmetry of A, we have R = LT , i. e., A = L + D + LT . Let


λ ∈ σ(Hω ) be arbitrary for 0 < ω < 2 , with some eigenvector v ∈ Rn , i. e., Hω v = λv .
Thus, there holds  
(1−ω) D−ωLT v = λ (D+ωL) v
and
ω (D+LT ) v = (1−λ) Dv − λωLv.
From this, we conclude that

ωAv = ω (D+LT ) v + ωLv


= (1−λ) Dv − λωLv + ωLv = (1−λ) Dv + ω (1−λ) Lv,

and

λωAv = λω (D+LT ) v + λωLv


= λω (D + LT ) v + (1−λ) Dv − ω (D+LT ) v
= (λ−1)ω(D+LT ) v + (1−λ) Dv = (1−λ)(1−ω) Dv − (1−λ) ωLT v.

Observing v T Lv = v T LT v implies

ωv T Av = (1−λ) v T Dv + ω (1−λ) v T Lv
λωv T Av = (1−λ)(1−ω) v T Dv − (1−λ) ωv T Lv,

and further by adding the two equations,

ω (1+λ) v T Av = (1−λ) (2−ω) v T Dv.

As with A also D is positive definite, there holds vᵀAv > 0, vᵀDv > 0. Consequently (observing 0 < ω < 2), λ ≠ ±1, and it follows that

\[
\mu := \frac{1+\lambda}{1-\lambda} = \frac{2-\omega}{\omega}\, \frac{v^T D v}{v^T A v} > 0.
\]
Resolving this for λ, we finally obtain the estimate

\[
|\lambda| = \Big| \frac{\mu - 1}{\mu + 1} \Big| < 1, \qquad (3.2.29)
\]

what was to be shown.


ii) To derive the quantitative estimate (3.2.28), we rewrite (3.2.29) in the form

\[
|\lambda| = \Big| \frac{\mu - 1}{\mu + 1} \Big| = \Big| \frac{1 - 1/\mu}{1 + 1/\mu} \Big| \le 1 - \frac{2}{\mu} + \frac{2}{\mu(\mu+1)},
\]

where

\[
\mu = \frac{v^T D v}{v^T A v} \le \frac{\max_{\|y\|_2=1} y^T D y}{\min_{\|y\|_2=1} y^T A y} \le \frac{\lambda_{\max}(D)}{\lambda_{\min}(A)}.
\]
This completes the proof. Q.E.D.

Remark 3.3: The estimate (3.2.28) for the convergence rate of the Gauß-Seidel method
in the case of a symmetric, positive definite matrix A has an analogue for the Jacobi
method,
spr(J) ≤ 1 − 1/μ,     (3.2.30)

where μ is defined as in (3.2.28). This is easily seen by considering any eigenvalue λ ∈ σ(J) with corresponding normalized eigenvector v, ‖v‖₂ = 1, satisfying

λ D v = D v − A v.

Multiplying by v and observing that A as well as D are positive definite then yields

\[
\lambda = 1 - \frac{v^T A v}{v^T D v} \le 1 - \frac{1}{\mu}.
\]
v Dv μ

Comparing this estimate with (3.2.28) and observing that

spr(J)2 = (1 − μ−1 )2 ≈ 1 − 2μ−1 ≈ spr(H1 ),

for μ ≫ 1, indicates that the Gauß-Seidel method may be almost twice as fast as the
Jacobi method. That this is actually the case will be seen below for a certain class of
matrices.

Definition 3.2: A matrix A ∈ Rn×n with the usual additive splitting A = L + D + R is


called “consistently ordered” if the eigenvalues of the matrices

J(α) = −D−1 {αL + α−1 R} , α ∈ C,

are independent of the parameter α, i. e., equal to the eigenvalues of the matrix J = J(1).

The importance of this property lies in the fact that in this case there are explicit
relations between the eigenvalues of J and those of Hω .

Example 3.2: Though the condition of “consistent ordering” appears rather strange
and restrictive, it is satisfied for a large class of matrices. Consider the model matrix in
Subsection 0.4.2 of Chapter 0. Depending on the numbering of the mesh points matrices
with different block structures are encountered.
i) If the mesh points are numbered in a checker-board manner a block-tridiagonal matrix

\[
A = \begin{pmatrix}
D_1 & A_{12} & & \\
A_{21} & D_2 & \ddots & \\
& \ddots & \ddots & A_{r-1,r} \\
& & A_{r,r-1} & D_r
\end{pmatrix}
\]

occurs, where the D_i are diagonal and regular. Such a matrix is consistently ordered,
which is seen by applying a suitable similarity transformation,

\[
T = \begin{pmatrix} I & & & \\ & \alpha I & & \\ & & \ddots & \\ & & & \alpha^{r-1} I \end{pmatrix},
\qquad
\alpha D^{-1} L + \alpha^{-1} D^{-1} R = T\, (D^{-1} L + D^{-1} R)\, T^{-1},
\]

and observing that similar matrices have the same eigenvalues.


ii) If the mesh points are numbered in a row-wise manner a block-tridiagonal matrix

\[
A = \begin{pmatrix}
A_1 & D_{12} & & \\
D_{21} & A_2 & \ddots & \\
& \ddots & \ddots & D_{r-1,r} \\
& & D_{r,r-1} & A_r
\end{pmatrix}
\]

occurs, where the A_i are tridiagonal and the D_{ij} diagonal. Such a matrix is consistently ordered, which is seen by first applying the same similarity transformation as above,
\[
T A T^{-1} = \begin{pmatrix}
A_1 & \alpha^{-1} D_{12} & & \\
\alpha D_{21} & A_2 & \ddots & \\
& \ddots & \ddots & \alpha^{-1} D_{r-1,r} \\
& & \alpha D_{r,r-1} & A_r
\end{pmatrix},
\]

and then a similarity transformation with the diagonal-block matrix

S = diag{S1 , . . . , Sm },

where Si = diag{1, α, α2, . . . , αr−1}, i = 1, . . . , m, resulting in


\[
S\, T A T^{-1} S^{-1} = \begin{pmatrix}
S_1 A_1 S_1^{-1} & \alpha^{-1} D_{12} & & \\
\alpha D_{21} & S_2 A_2 S_2^{-1} & \ddots & \\
& \ddots & \ddots & \alpha^{-1} D_{r-1,r} \\
& & \alpha D_{r,r-1} & S_r A_r S_r^{-1}
\end{pmatrix}.
\]
Here, it has been used that the blocks Dij are diagonal. Since the main-diagonal blocks
are tri-diagonal, they split like Ai = Di + Li + Ri and there holds

S_i A_i S_i^{-1} = D_i + α L_i + α^{-1} R_i.

This implies that the matrix A is consistently ordered.

Theorem 3.5 (Optimal SOR method): Let the matrix A ∈ Rn×n be consistently or-
dered and 0 ≤ ω ≤ 2. Then, the eigenvalues μ ∈ σ(J) and λ ∈ σ(Hω ) are related
through the identity

λ1/2 ωμ = λ + ω − 1. (3.2.31)

Proof. Let λ, μ ∈ C two numbers, which satisfy equation (3.2.31). If 0 = λ ∈ σ(Hω )


the relation Hω v = λv is equivalent to
 
(1 − ω)I − ωD −1R v = λ(I + ωD −1 L)v

and  
(λ + ω − 1)v = −λ1/2 ω λ1/2 D −1 L + λ−1/2 D −1 R v = λ1/2 ωJ(λ1/2 ) v.
Thus, v is an eigenvector of J(λ^{1/2}) corresponding to the eigenvalue

μ = (λ + ω − 1) / (λ^{1/2} ω).
Then, by the assumption on A also μ ∈ σ(J). In turn, for μ ∈ σ(J), by the same relation
we see that λ ∈ σ(Hω ). Q.E.D.
As direct consequence of the above result, we see that for consistently ordered matrices
the Gauß-Seidel matrix (case ω = 1 ) either has spectral radius spr(H1 ) = 0 or there holds

spr(H1 ) = spr(J)2 . (3.2.32)

In case spr(J) < 1 the Jacobi method converges. For reducing the error by the factor
10−1 the Gauß-Seidel method only needs half as many iterations than the Jacobi method
and is therefore to be preferred. However, this does not necessarily hold in general since
one can construct examples for which one or the other method converges or diverges.
For consistently ordered matrices from the identity (3.2.31), we can derive a formula
for the “optimal” relaxation parameter ωopt with spr(Hωopt ) ≤ spr(Hω ), ω ∈ (0, 2). If
there holds ρ := spr(J) < 1, then:
\[
\operatorname{spr}(H_\omega) =
\begin{cases}
\omega - 1, & \omega_{\mathrm{opt}} \le \omega, \\[1ex]
\tfrac14 \Big[ \rho\,\omega + \sqrt{\rho^2 \omega^2 - 4(\omega - 1)}\, \Big]^2, & \omega \le \omega_{\mathrm{opt}}.
\end{cases}
\]

Figure 3.1: Spectral radius spr(H_ω) of the SOR matrix H_ω as a function of ω ∈ (0, 2) (decreasing towards its minimum at ω_opt and then increasing linearly as ω − 1).
Then, there holds

\[
\omega_{\mathrm{opt}} = \frac{2}{1 + \sqrt{1 - \rho^2}}, \qquad
\operatorname{spr}(H_{\omega_{\mathrm{opt}}}) = \omega_{\mathrm{opt}} - 1 = \frac{1 - \sqrt{1 - \rho^2}}{1 + \sqrt{1 - \rho^2}} < 1. \qquad (3.2.33)
\]

In general the exact value of spr(J) is not known. Since the left-sided derivative of the function f(ω) = spr(H_ω) for ω → ω_opt is singular, in estimating ω_opt it is better to take a value slightly larger than the exact one. Using inclusion theorems for eigenvalues or simply the bound ρ ≤ ‖J‖_∞ one obtains estimates ρ̄ ≥ ρ. In case ρ̄ < 1 this yields an upper bound ω̄ ≥ ω_opt,

\[
\bar\omega := \frac{2}{1 + \sqrt{1 - \bar\rho^2}} \ge \frac{2}{1 + \sqrt{1 - \rho^2}} = \omega_{\mathrm{opt}},
\]

for which

\[
\operatorname{spr}(H_{\bar\omega}) = \bar\omega - 1 = \frac{1 - \sqrt{1 - \bar\rho^2}}{1 + \sqrt{1 - \bar\rho^2}} < 1. \qquad (3.2.34)
\]

However, this consideration requires the formula (3.2.33) to hold true.

Example 3.3: To illustrate the possible improvement of convergence by optimal overrelaxation, we note that

\[
\operatorname{spr}(H_1) = \operatorname{spr}(J)^2 = \begin{cases} 0.81 \\ 0.99 \end{cases}
\quad\Longrightarrow\quad
\operatorname{spr}(H_{\omega_{\mathrm{opt}}}) = \begin{cases} 0.39 \\ 0.8 \end{cases}
\]
This will be further discussed for the model matrix in Section 3.4, below.
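The numbers of Example 3.3 can be reproduced directly from (3.2.33), e.g., with the following small Python check (illustrative only):

```python
from math import sqrt

def sor_optimal(rho):
    # rho = spr(J); returns (omega_opt, spr(H_omega_opt)) according to (3.2.33)
    omega = 2.0 / (1.0 + sqrt(1.0 - rho ** 2))
    return omega, omega - 1.0

for spr_H1 in (0.81, 0.99):                 # spr(H_1) = spr(J)^2
    rho = sqrt(spr_H1)
    omega, spr_opt = sor_optimal(rho)
    print(f"spr(H_1) = {spr_H1:.2f}:  omega_opt = {omega:.3f},  spr(H_omega_opt) = {spr_opt:.3f}")
```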

3.2.2 Chebyshev acceleration

In the following, we discuss another method of convergence acceleration, termed “Cheby-


shev acceleration”, which can be used in the case of a symmetric coefficient matrix A,
for fixed-point iterations of the form

xt = Bxt−1 + c, t = 1, 2, . . . , (3.2.35)

with diagonalizable iteration matrix B . First, we describe the general principle of this
approach and then apply it to a symmetrized version of the SOR method. Suppose that
the above fixed-point iteration converges to the solution x ∈ Rn of the linear system

Ax = b ⇔ x = Bx + c, (3.2.36)

i. e., that spr(B) < 1 . The idea of Chebyshev acceleration is to construct linear combi-
nations

\[
y^t := \sum_{s=0}^t \gamma_{st}\, x^s, \qquad t \ge 1, \qquad (3.2.37)
\]

with certain coefficients γst , such that the new sequence (y t )t≥0 converges faster to the
fixed point x than the original sequence (xt )t≥0 . Once the fixed-point has been reached,
i. e., xt ≈ x , the new iterates should also be close to x . This imposes the consistency
condition

\[
\sum_{s=0}^t \gamma_{st} = 1. \qquad (3.2.38)
\]

Then, the corresponding error has the form

\[
y^t - x = \sum_{s=0}^t \gamma_{st}(x^s - x) = \sum_{s=0}^t \gamma_{st} B^s (x^0 - x) = p_t(B)(x^0 - x), \qquad (3.2.39)
\]
s=0 s=0

with the polynomial p_t ∈ P_t of degree t given by

\[
p_t(z) = \sum_{s=0}^t \gamma_{st}\, z^s, \qquad p_t(1) = 1. \qquad (3.2.40)
\]

This iteration may be viewed as one governed by a sequence of “iteration matrices”


pt (B), t = 1, 2, . . . , and therefore, we may try to characterize its convergence by the spec-
tral radius spr(pt(B)) as in the standard situation of a “stationary” fixed-point iteration
(i. e., one with a fixed iteration matrix). This requires us to relate the eigenvalues of
pt (B) to those of B,

λ(pt (B)) = pt (λ(B)). (3.2.41)


3.2 Acceleration methods 119

This leads us to consider the following optimization problem:

\[
\operatorname{spr}(p_t(B)) = \min_{p \in P_t,\, p(1)=1}\, \max_{\lambda \in \sigma(B)} |p(\lambda)|. \qquad (3.2.42)
\]

The eigenvalues λ ∈ σ(B) are usually not known, but rather the bound spr(B) ≤ 1 − δ with some small δ > 0 may be available. Hence, this optimization problem has to be relaxed to

\[
\operatorname{spr}(p_t(B)) \le \min_{p \in P_t,\, p(1)=1}\, \max_{|x| \le 1-\delta} |p(x)|. \qquad (3.2.43)
\]

This optimization problem can be explicitly solved in the case σ(B) ⊂ R. Therefore, we make the following assumption.

Assumption 3.1: The coefficient matrix A = L + D + LT is assumed to be symmetric


and the iteration matrix B of the base iteration (3.2.35) to be similar to a symmetric
matrix and, therefore, is diagonalizable with real eigenvalues,

σ(B) ⊂ R. (3.2.44)

Remark 3.4: In general the iteration matrix B cannot be assumed to be symmetric and
not even similar to a symmetric matrix (e. g., in the Gauß-Seidel method with H1 = −(D+
L)−1 LT ). But if this were the case (e. g., in the Richardson method with B = I − θA or
in the Jacobi method with J = −D −1 (L + LT ) ) the analysis of the new sequence (yt )t≥0
may proceed as follows. Taking spectral-norms, we obtain

‖yt − x‖₂ ≤ ‖pt(B)‖₂ ‖x0 − x‖₂.     (3.2.45)

Hence, the convergence can be improved by choosing the polynomial p_t such that the norm ‖p_t(B)‖₂ becomes minimal,

\[
\frac{\|y^t - x\|_2}{\|x^0 - x\|_2} \le \min_{p_t \in P_t,\, p_t(1)=1} \|p_t(B)\|_2 \le \|B^t\|_2 \le \|B\|_2^t. \qquad (3.2.46)
\]

Using the representation of the spectral norm, valid for symmetric matrices,

\[
\|p_t(B)\|_2 = \max_{\lambda \in \sigma(B)} |p_t(\lambda)|, \qquad (3.2.47)
\]

and observing σ(B) ⊂ [−1 + δ, 1 − δ], for some small δ > 0, the optimization problem takes the form

\[
\min_{p_t \in P_t,\, p_t(1)=1}\, \max_{|x| \le 1-\delta} |p_t(x)|. \qquad (3.2.48)
\]

The solution of the optimization problem (3.2.43) is given by the well-known Chebyshev polynomials (of the first kind), which are the orthogonal polynomials obtained by successively orthogonalizing (using the Gram-Schmidt algorithm with exact arithmetic) the monomial basis {1, x, x², . . . , xᵗ} with respect to the scalar product

\[
(p, q) := \int_{-1}^1 p(x)\, q(x)\, \frac{dx}{\sqrt{1 - x^2}}, \qquad p, q \in P_t,
\]

defined on the function space C[−1, 1]. These polynomials, named T_t ∈ P_t, are usually normalized to satisfy T_t(1) = 1,

\[
\int_{-1}^1 T_t(x)\, T_s(x)\, \frac{dx}{\sqrt{1 - x^2}} =
\begin{cases}
0, & t \ne s, \\
\pi, & t = s = 0, \\
\pi/2, & t = s \ne 0.
\end{cases}
\]

They can be written in explicit form as (see, e. g., Stoer & Bulirsch [50] or Rannacher [1]):

\[
T_t(x) =
\begin{cases}
(-1)^t \cosh(t\, \operatorname{arccosh}(-x)), & x \le -1, \\
\cos(t \arccos(x)), & -1 \le x \le 1, \\
\cosh(t\, \operatorname{arccosh}(x)), & x \ge 1.
\end{cases}
\qquad (3.2.49)
\]

Figure 3.2: Chebyshev polynomials T_t, t = 0, 1, . . . , 5, on the interval [−1, 1].

That the so defined functions are actually polynomials can be seen by induction.
Further, there holds the three-term recurrence relation

T0 (x) = 1, T1 (x) = x, Tt+1 (x) = 2xTt (x) − Tt−1 (x), t ≥ 1, (3.2.50)

which allows the numerically stable computation and evaluation of the Chebyshev poly-
nomials. Sometimes the following alternative global representation is useful:

\[
T_t(x) = \tfrac12 \Big( \big[x + \sqrt{x^2 - 1}\,\big]^t + \big[x - \sqrt{x^2 - 1}\,\big]^t \Big), \qquad x \in \mathbb{R}. \qquad (3.2.51)
\]
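The recurrence (3.2.50) is also the numerically stable way to evaluate T_t in practice; a small Python sketch with a consistency check against (3.2.49) (illustrative only):

```python
from math import cos, acos

def chebyshev_T(t, x):
    # Evaluate T_t(x) by the three-term recurrence (3.2.50)
    T_prev, T = 1.0, x
    if t == 0:
        return T_prev
    for _ in range(t - 1):
        T_prev, T = T, 2.0 * x * T - T_prev
    return T

# Consistency check on [-1, 1] with the trigonometric representation (3.2.49):
assert abs(chebyshev_T(5, 0.3) - cos(5 * acos(0.3))) < 1e-10
```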

With this notation, we have the following basic result.

Theorem 3.6 (Chebyshev polynomials): Let [a, b] ⊂ R be a non-empty interval and


let c ∈ R be any point outside this interval. Then, the minimum

\[
\min_{p \in P_t,\, p(c)=1}\, \max_{x \in [a,b]} |p(x)| \qquad (3.2.52)
\]

is attained by the uniquely determined polynomial

\[
p(x) := C_t(x) = \frac{T_t\big(1 + 2\tfrac{x-b}{b-a}\big)}{T_t\big(1 + 2\tfrac{c-b}{b-a}\big)}, \qquad x \in [a, b]. \qquad (3.2.53)
\]

Furthermore, for a < b < c there holds

\[
\min_{p \in P_t,\, p(c)=1}\, \max_{x \in [a,b]} |p(x)| = \frac{1}{T_t\big(1 + 2\tfrac{c-b}{b-a}\big)} = \frac{2\gamma^t}{1 + \gamma^{2t}} \le 2\gamma^t, \qquad (3.2.54)
\]

where

\[
\gamma := \frac{1 - 1/\sqrt{\kappa}}{1 + 1/\sqrt{\kappa}}, \qquad \kappa := \frac{c-a}{c-b}.
\]
1 + 1/ κ c−b

Proof. i) By affine transformation, which does not change the max-norm, we may restrict
ourselves to the standard case [a, b] = [−1, 1] and c ∈ R \ [−1, 1] . Then, Ct (x) = C̃Tt (x)
with constant C̃ = Tt (c)−1 . The Chebyshev polynomial Tt (x) = cos(t arccos(x)) attains
the values ±1 at the points xi = cos(iπ/t), i = 0, . . . , t, and it alternates between 1
and −1 , i. e., Tt (xi ) and Tt (xi+1 ) have opposite signs. Furthermore, max[−1,1] |Tt | = 1 ,
implying max[−1,1] |Ct | = |C̃| .
ii) Assume now the existence of q ∈ Pt such that max[−1,1] |q| < max[−1,1] |Ct | = |C̃|
and q(c) = 1 . Then, the polynomial r = Ct − q changes sign t-times in the interval
[−1, 1] since sign r(xi ) = sign Tt (xi ), i = 0, . . . , t. Thus, r has at least t zeros in [−1, 1] .
Additionally, r(c) = 0 . Hence, r ∈ Pt has at least t + 1 zeros; thus, r ≡ 0, which leads
to a contradiction.
iii) By definition, there holds |Tt (x)| ≤ 1, x ∈ [−1, 1] . This implies that
1
max |Ct (x)| = c−b
.
x∈[a,b] Tt (1 + 2 b−a )
122 Iterative Methods for Linear Algebraic Systems

The assertion then follows from the explicit representation of the Tt given above and
some manipulations (for details see the proof of Theorem 3.11, below). Q.E.D.

Practical use of Chebyshev acceleration

We now assume σ(B) ⊂ (−1, 1) , i. e., convergence of the primary iteration. Moreover,
we assume that a parameter ρ ∈ (−1, 1) is known such that σ(B) ⊂ [−ρ, ρ] . With
the parameters a = −ρ, b = ρ, and c = 1, we use the polynomials pt = Ct given in
Theorem 3.6 in defining the secondary iteration (3.2.37). This results in the “Chebyshev-
accelerated” iteration scheme. This is a consistent choice since Tt (1) = 1 .
The naive evaluation of the secondary iterates (3.2.37) would require to store the
whole convergence history of the base iteration (xt )t≥0 , which may not be possible for
large problems. Fortunately, the three-term recurrence formula (3.2.50) for the Chebyshev
polynomials carries over to the corresponding iterates (y t)t≥0 , making the whole process
feasible at all.
Since the T_t satisfy the three-term recurrence (3.2.50), so do the polynomials p_t = C_t from (3.2.53):

\[
\mu_{t+1}\, p_{t+1}(x) = \frac{2x}{\rho}\, \mu_t\, p_t(x) - \mu_{t-1}\, p_{t-1}(x), \quad t \ge 1, \qquad \mu_t := T_t(1/\rho), \qquad (3.2.55)
\]
with initial functions
T1 (x/ρ) x/ρ
p0 (x) ≡ 1, p1 (x) = = = x,
T1 (1/ρ) 1/ρ

i. e., a0,0 = 1 and a1,0 = 0, a1,1 = 1 . We also observe the important relation

2
μt+1 = μt − μt−1 , μ0 = 1, μ1 = 1/ρ. (3.2.56)
ρ

which can be concluded from (3.2.55) observing that pt (1) = 1 . With these preparations,
we can now implement the Chebyshev acceleration scheme. With the limit x := limt→∞ xt ,
we obtain for the error y t − x = ẽt = pt (B)e0 :
   yt+1 = x + ẽt+1 = x + pt+1(B)e0 = x + 2 (μt/(ρμt+1)) B pt(B)e0 − (μt−1/μt+1) pt−1(B)e0
        = x + 2 (μt/(ρμt+1)) B ẽt − (μt−1/μt+1) ẽt−1
        = x + 2 (μt/(ρμt+1)) B(yt − x) − (μt−1/μt+1)(yt−1 − x)
        = 2 (μt/(ρμt+1)) B yt − (μt−1/μt+1) yt−1 + (1/μt+1){ μt+1 − (2/ρ) μt B + μt−1 } x.

Now, using the fixed-point relation x = Bx+c and the recurrence (3.2.56), we can remove
the appearance of x in the above recurrence obtaining

   yt+1 = 2 (μt/(ρμt+1)) B yt − (μt−1/μt+1) yt−1 + 2 (μt/(ρμt+1)) c,   y0 = x0,  y1 = x1 = Bx0 + c.   (3.2.57)

Hence, the use of Chebyshev acceleration for the primary iteration (3.2.35) consists in
evaluating the three-term recurrences (3.2.56) and (3.2.57), which is of similar cost as
the primary iteration (3.2.35) itself, whose most costly step is the matrix-vector
multiplication Byt .
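A minimal Python sketch of the resulting scheme, assuming the primary iteration xt = Bxt−1 + c is given by a matrix B with σ(B) ⊂ [−ρ, ρ] and a vector c; it evaluates exactly the recurrences (3.2.56) and (3.2.57). The function name chebyshev_accelerated is illustrative only.

    import numpy as np

    def chebyshev_accelerated(B, c, rho, x0, nsteps):
        y_old = np.array(x0, dtype=float)           # y^0 = x^0
        y = B @ y_old + c                           # y^1 = x^1 = B x^0 + c
        mu_old, mu = 1.0, 1.0 / rho                 # mu_0, mu_1 from (3.2.56)
        for _ in range(1, nsteps):
            mu_new = (2.0 / rho) * mu - mu_old      # recurrence (3.2.56)
            y_new = (2.0 * mu / (rho * mu_new)) * (B @ y + c) \
                    - (mu_old / mu_new) * y_old     # recurrence (3.2.57)
            y_old, y = y, y_new
            mu_old, mu = mu, mu_new
        return y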
In order to quantify the acceleration effect of this process, we write the secondary
iteration in the form

   yt − x = Σ_{s=0}^{t} γst (xs − x) = pt(B)(x0 − x),

where γst are the coefficients of the polynomial pt . There holds

   pt(x) = Ct(x) = Tt(x/ρ)/Tt(1/ρ).

By the estimate (3.2.54) of Theorem 3.6 it follows that



   spr(pt(B)) = max_{λ∈σ(B)} |pt(λ)| ≤ 2γ^t/(1 + γ^{2t}) ≤ 2γ^t ,
   γ := (1 − 1/√κ)/(1 + 1/√κ),   κ := (1+ρ)/(1−ρ).

Hence, for the primary and the secondary iteration, we find the asymptotic error behavior

   lim sup_{t→∞} ( ‖et‖/‖e0‖ )^{1/t} = spr(B) ≤ ρ = 1 − δ,                            (3.2.58)

   lim sup_{t→∞} ( ‖ẽt‖/‖e0‖ )^{1/t} ≤ (1 − 1/√κ)/(1 + 1/√κ) ≤ 1 − c√δ,              (3.2.59)

i. e., in the case 0 < δ ≪ 1 Chebyshev acceleration yields a significant improvement of the
convergence speed.

Application for accelerating the SOR method

We want to apply the concept of Chebyshev acceleration to the SOR method with the
iteration matrix (recalling that A is symmetric)
 
   Hω = (D + ωL)^{-1} [ (1 − ω) D − ωL^T ],   ω ∈ (0, 2).

However, it is not obvious whether this matrix is diagonalizable. Therefore, one introduces
a symmetrized version of the SOR method, which is termed “SSOR method”,

   (D + ωL) yt   = [(1−ω)D − ωL^T] xt−1 + ωb,
   (D + ωL^T) xt = [(1−ω)D − ωL] yt + ωb,

or equivalently,

   xt = (D + ωL^T)^{-1}( [(1−ω)D − ωL](D + ωL)^{-1}( [(1−ω)D − ωL^T] xt−1 + ωb ) + ωb ),   (3.2.60)

with the iteration matrix

   Hω^SSOR := (D + ωL^T)^{-1} [(1−ω)D − ωL] (D + ωL)^{-1} [(1−ω)D − ωL^T].

The SSOR-iteration matrix is similar to a symmetric matrix, which is seen from the
relation

(D+ωLT )HωSSOR (D+ωLT )−1 = [(1−ω) D−ωL](D+ωL)−1[(1−ω)D−ωLT ](D+ωLT )−1


= [(1−ω) D−ωL](D+ωL)−1(D+ωLT )−1 [(1−ω)D−ωLT ].

The optimal relaxation parameter of the SSOR method is generally different from that of
the SOR method.

Remark 3.5: In one step of the SSOR method the SOR loop is successively applied
twice, once in the standard “forward” manner based on the splitting A = (L + D) + LT
and then in “backward” form based on A = L + (D + LT ) . Hence, it is twice as expensive
compared to the standard SOR method. But this higher cost is generally not compensated
by faster convergence. Hence, the SSOR method is attractive mainly in connection with
the Chebyshev acceleration as described above and not so much as a stand-alone method.
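For illustration, one SSOR step amounts to two triangular solves; the following Python sketch (dense matrices and scipy's triangular solver, for simplicity) implements the two half-steps stated above for a symmetric matrix A = L + D + L^T. The function name ssor_step is illustrative only.

    import numpy as np
    from scipy.linalg import solve_triangular

    def ssor_step(A, b, x, omega):
        D = np.diag(np.diag(A))
        L = np.tril(A, k=-1)                  # strict lower triangular part of A
        U = L.T                               # = strict upper part, since A is symmetric
        rhs = ((1 - omega) * D - omega * U) @ x + omega * b
        y = solve_triangular(D + omega * L, rhs, lower=True)       # forward sweep
        rhs = ((1 - omega) * D - omega * L) @ y + omega * b
        return solve_triangular(D + omega * U, rhs, lower=False)   # backward sweep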

3.3 Descent methods

In the following, we consider a class of iterative methods, which are especially designed for
linear systems with symmetric and positive definite coefficient matrices A , but can also
be extended to more general situations. In this section, we use the abbreviated notation
(·, ·) := (·, ·)2 and ‖·‖ := ‖·‖2 for the Euclidean scalar product and norm.
Let A ∈ Rn×n be a symmetric positive definite (and hence regular) matrix,

   (Ax, y) = (x, Ay),  x, y ∈ Rn,    (Ax, x) > 0,  x ∈ Rn \ {0}.                      (3.3.61)

This matrix generates the so-called “A-scalar product” and the corresponding “A-norm”,

   (x, y)A := (Ax, y),   ‖x‖A := (Ax, x)^{1/2},   x, y ∈ Rn.                          (3.3.62)

Accordingly, vectors with the property (x, y)A = 0 are called “A-orthogonal”. The
positive definite matrix A has important properties. Its eigenvalues are real and positive
0 < λ := λ1 ≤ . . . ≤ λn =: Λ and there exists an ONB of eigenvectors {w1 , . . . , wn } . For
its spectral radius and spectral condition number, there holds
   spr(A) = Λ ,   cond2(A) = Λ/λ.                                                     (3.3.63)

The basis for the descent methods discussed below is provided by the following theorem,
which characterizes the solution of the linear system Ax = b as the minimum of a
quadratic functional.

Theorem 3.7 (Minimization property): Let the matrix A be symmetric positive definite.
The uniquely determined solution of the linear system Ax = b is characterized by the
property

   Q(x) < Q(y)  ∀ y ∈ Rn \ {x},    Q(y) := ½ (Ay, y)2 − (b, y)2 .                     (3.3.64)

Proof. Let Ax = b . Then, in view of the definiteness of A , for y ≠ x there holds

   Q(y) − Q(x) = ½ { (Ay, y) − 2(b, y) − (Ax, x) + 2(b, x) }
               = ½ { (Ay, y) − 2(Ax, y) + (Ax, x) } = ½ (A[x − y], x − y) > 0.

In turn, if Q(x) < Q(y) for all y ≠ x , i. e., if x is a strict minimum of Q on Rn , there
must hold grad Q(x) = 0 . This means that (observe ajk = akj )

   ∂Q/∂xi (x) = ½ ∂/∂xi ( Σ_{j,k=1}^n ajk xj xk ) − ∂/∂xi ( Σ_{k=1}^n bk xk )
              = Σ_{k=1}^n aik xk − bi = 0,   i = 1, . . . , n,

i. e., Ax = b . Q.E.D.
We note that the gradient of Q in a point y ∈ Rn is given by

   grad Q(y) = ½ (A + A^T) y − b = Ay − b.                                            (3.3.65)

This coincides with the “defect” of the point y with respect to the equation Ax = b
(negative “residual” b−Ay ). The so-called “descent methods”, starting from some initial
point x0 ∈ Rn , determine a sequence of iterates xt , t ≥ 1, by the prescription

   xt+1 = xt + αt rt ,    Q(xt+1) = min_{α∈R} Q(xt + α rt).                           (3.3.66)

Here, the “descent directions” r t are a priori determined or adaptively chosen in the
course of the iteration. The prescription for choosing the “step length” αt is called “line
search”. In view of

   d/dα Q(xt + α rt) = grad Q(xt + α rt) · rt = (Axt − b, rt) + α (Art, rt),

we obtain the formula

   αt = − (gt, rt)/(Art, rt),    gt := Axt − b = grad Q(xt).

Definition 3.3: The general descent method, starting from some initial point x0 ∈ Rn ,
determines a sequence of iterates xt ∈ Rn , t ≥ 1, by the prescription

 i)   gradient           gt = Axt − b,
 ii)  descent direction  rt ,
 iii) step length        αt = − (gt, rt)/(Art, rt),
 iv)  descent step       xt+1 = xt + αt rt .

Each descent step as described in the above definition requires two matrix-vector
multiplications. By rewriting the algorithm in a slightly different way, one can save one
of these multiplications at the price of additionally storing the vector Ar t .
General descent algorithm:

   Starting values:   x0 ∈ Rn ,  g0 := Ax0 − b.
   Iterate for t ≥ 0: descent direction rt ,
        αt = − (gt, rt)/(Art, rt),   xt+1 = xt + αt rt ,   gt+1 = gt + αt Art .

Using the notation ‖y‖B := (By, y)^{1/2} there holds

   2 Q(y) = ‖Ay − b‖²_{A^{-1}} − ‖b‖²_{A^{-1}} = ‖y − x‖²_A − ‖x‖²_A ,                (3.3.67)

i. e., the minimization of the functional Q(·) is equivalent to the minimization of the
defect norm ‖Ay − b‖_{A^{-1}} or the error norm ‖y − x‖_A .

3.3.1 Gradient method

The various descent methods essentially differ by the choice of the descent directions
rt . One of the simplest a priori strategies uses in a cyclic way the Cartesian coordinate
directions {e1 , . . . , en } . The resulting method is termed “coordinate relaxation” and is
sometimes used in the context of nonlinear systems. For solving linear systems it is much
too slow as it is in a certain sense equivalent to the Gauß-Seidel method (exercise). A
more natural choice are the directions of steepest descent of Q(·) in the points xt :

r t = −gradQ(xt ) = −gt . (3.3.68)

Definition 3.4: The “gradient method” determines a sequence of iterates xt ∈ Rn , t ≥ 0,
by the prescription

   Starting values:   x0 ∈ Rn ,  g0 := Ax0 − b.
   Iterate for t ≥ 0: αt = ‖gt‖²/(Agt, gt) ,   xt+1 = xt − αt gt ,   gt+1 = gt − αt Agt .

In case that (Ag t , g t ) = 0 for some t ≥ 0 there must hold g t = 0 , i. e., the iteration can
only terminate with Axt = b .
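A minimal dense Python sketch of Definition 3.4; the tolerance-based stopping test is an addition for practical use and not part of the definition.

    import numpy as np

    def gradient_method(A, b, x0, tol=1e-10, maxit=10000):
        x = np.array(x0, dtype=float)
        g = A @ x - b                        # g^0 = A x^0 - b
        for _ in range(maxit):
            Ag = A @ g
            denom = g @ Ag
            if denom == 0.0:                 # then g = 0 and x already solves Ax = b
                break
            alpha = (g @ g) / denom
            x -= alpha * g                   # x^{t+1} = x^t - alpha_t g^t
            g -= alpha * Ag                  # g^{t+1} = g^t - alpha_t A g^t
            if np.linalg.norm(g) < tol:
                break
        return x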

Theorem 3.8 (Gradient method): For a symmetric positive definite matrix A ∈ Rn×n
the gradient method converges for any starting point x0 ∈ Rn to the solution of the linear
system Ax = b .

Proof. We introduce the “error functional”

   E(y) := ‖y − x‖²_A = (y − x, A[y − x]),   y ∈ Rn ,

and for abbreviation set et := xt − x . With this notation there holds

   (E(xt) − E(xt+1))/E(xt) = [ (et, Aet) − (et+1, Aet+1) ] / (et, Aet)
                           = [ (et, Aet) − (et − αt gt, A[et − αt gt]) ] / (et, Aet)
                           = [ 2αt (et, Agt) − αt² (gt, Agt) ] / (et, Aet),

and consequently, because of Aet = Axt − Ax = Axt − b = gt ,

   (E(xt) − E(xt+1))/E(xt) = [ 2αt ‖gt‖² − αt² (gt, Agt) ] / (gt, A^{-1}gt)
                           = ‖gt‖⁴ / [ (gt, Agt)(gt, A^{-1}gt) ].

For the positive definite matrix A there holds

   λ‖y‖² ≤ (y, Ay) ≤ Λ‖y‖² ,    Λ^{-1}‖y‖² ≤ (y, A^{-1}y) ≤ λ^{-1}‖y‖² ,

with λ = λmin(A) and Λ = λmax(A) . In the case xt ≠ x , i. e., E(xt) ≠ 0 and gt ≠ 0 ,
we conclude that

   ‖gt‖⁴ / [ (gt, Agt)(gt, A^{-1}gt) ] ≥ ‖gt‖⁴ / ( Λ‖gt‖² λ^{-1}‖gt‖² ) = λ/Λ ,
and, consequently,

E(xt+1 ) ≤ { 1 − κ−1 } E(xt ), κ := condnat (A).

Since 0 < 1 − 1/κ < 1 for any x0 ∈ Rn the error functional E(xt ) → 0 (t → ∞) , i. e.,
xt → x (t → ∞). Q.E.D.
For the quantitative estimation of the speed of convergence of the gradient method,
we need the following result of Kantorovich5 .

Lemma 3.4 (Lemma of Kantorovich): For a symmetric and positive definite matrix

5
Leonid Vitalyevich Kantorovich (1912–1986): Russian Mathematician; Prof. at the U of Leningrad
(1934–1960), at the Academy of Sciences (1961–1971) and at the U Moscow (1971-1976); fundamental
contributions to linear optimization in Economy, to Functional Analysis and to Numerics (Theorem of
Newton-Kantorovich).

A ∈ Rn×n with smallest and largest eigenvalues λ and Λ , respectively, there holds

   4λΛ/(λ + Λ)² ≤ ‖y‖⁴ / [ (y, Ay)(y, A^{-1}y) ] ,   y ∈ Rn .                         (3.3.69)

Proof. Let λ = λ1 ≤ . . . ≤ λn = Λ be the eigenvalues of A and {w1 , . . . , wn} a
corresponding ONB of eigenvectors. An arbitrary vector y ∈ Rn admits an expansion
y = Σ_{i=1}^n yi wi with the coefficients yi = (y, wi) . Then,

   ‖y‖⁴ / [ (y, Ay)(y, A^{-1}y) ] = ( Σ_{i=1}^n yi² )² / [ ( Σ_{i=1}^n λi yi² )( Σ_{i=1}^n λi^{-1} yi² ) ]
                                  = 1 / [ ( Σ_{i=1}^n λi ζi )( Σ_{i=1}^n λi^{-1} ζi ) ] = ϕ(ζ)/ψ(ζ),

with the notation

   ζ = (ζi)_{i=1,...,n} ,   ζi = yi² ( Σ_{i=1}^n yi² )^{-1} ,
   ψ(ζ) = Σ_{i=1}^n λi^{-1} ζi ,   ϕ(ζ) = ( Σ_{i=1}^n λi ζi )^{-1} .

Since the function f(λ) = λ^{-1} is convex, it follows from 0 ≤ ζi ≤ 1 and Σ_{i=1}^n ζi = 1
that
   Σ_{i=1}^n λi^{-1} ζi ≥ ( Σ_{i=1}^n λi ζi )^{-1} .

We set g(λ) := (λ1 + λn − λ)/(λ1 λn) .

Figure 3.3: Sketch to the proof of the Lemma of Kantorovich.

Obviously, for every argument ζ the point ( Σ_i λiζi , ϕ(ζ) ) lies on the curve f(λ) , while
( Σ_i λiζi , ψ(ζ) ) lies between the curves f(λ) and g(λ) (shaded area in Fig. 3.3). This
implies that

   ϕ(ζ)/ψ(ζ) ≥ min_{λ1≤λ≤λn} f(λ)/g(λ) = f([λ1 + λn]/2) / g([λ1 + λn]/2) = 4λ1λn/(λ1 + λn)² ,

which concludes the proof. Q.E.D.



Theorem 3.9 (Error estimate for gradient method): Let the matrix A ∈ Rn×n be
symmetric positive definite. Then, for the gradient method the following error estimate
holds:

   ‖xt − x‖_A ≤ ( (1 − 1/κ)/(1 + 1/κ) )^t ‖x0 − x‖_A ,   t ∈ N,                       (3.3.70)

with the spectral condition number κ = cond2(A) = Λ/λ of A . For reducing the initial
error by a factor TOL the following number of iterations is required:

   t(TOL) ≈ ½ κ ln(1/TOL).                                                            (3.3.71)

Proof. i) In the proof of Theorem 3.8 the following error identity was shown:

   E(xt+1) = ( 1 − ‖gt‖⁴ / [ (gt, Agt)(gt, A^{-1}gt) ] ) E(xt).

This together with the inequality (3.3.69) in the Lemma of Kantorovich yields

   E(xt+1) ≤ ( 1 − 4λΛ/(λ + Λ)² ) E(xt) = ( (Λ − λ)/(Λ + λ) )² E(xt).

From this, we conclude by successive use of the recurrence that

   ‖xt − x‖²_A ≤ ( (Λ − λ)/(Λ + λ) )^{2t} ‖x0 − x‖²_A ,   t ∈ N,

which proves the asserted estimate (3.3.70).
ii) To prove (3.3.71), we take the logarithm on both sides of the relations

   ( (1 − 1/κ)/(1 + 1/κ) )^{t(TOL)} = ( (κ − 1)/(κ + 1) )^{t(TOL)} < TOL,
   ( (κ + 1)/(κ − 1) )^{t(TOL)} > 1/TOL,

obtaining
   t(TOL) > ln(1/TOL) ( ln( (κ + 1)/(κ − 1) ) )^{-1} .

Since
   ln( (x + 1)/(x − 1) ) = 2 ( 1/x + 1/(3x³) + 1/(5x⁵) + . . . ) ≥ 2/x ,

this is satisfied for t(TOL) ≥ ½ κ ln(1/TOL) . Q.E.D.

The relation

   (gt+1, gt) = (gt − αt Agt, gt) = ‖gt‖² − αt (Agt, gt) = 0                          (3.3.72)

shows that the descent directions rt = −gt used in the gradient method in consecutive
steps are orthogonal to each other, while gt+2 may be far away from being orthogonal
to gt . This may lead to strong oscillations in the convergence behavior of the gradient
method, especially for matrices A with large condition number, i. e., λ ≪ Λ . In the two-

dimensional case this effect can be illustrated by the contour lines of the functional Q(·),
which are eccentric ellipses, leading to a zigzag path of the iteration (see Fig. 3.4).

Figure 3.4: Oscillatory convergence of the gradient method

3.3.2 Conjugate gradient method (CG method)

The gradient method utilizes the particular structure of the functional Q(·), i. e., the
distribution of the eigenvalues of the matrix A , only locally from one iterate xt to the next
one, xt+1 . It seems more appropriate to utilize the already obtained information about
the global structure of Q(·) in determining the descent directions, e. g., by choosing the
descent directions mutually orthogonal. This is the basic idea of the “conjugate gradient
method” (“CG method”) of Hestenes6 and Stiefel7 (1952), which successively generates
a sequence of descent directions dt which are mutually “A-orthogonal”, i. e., orthogonal
with respect to the scalar product (·, ·)A .
For developing the CG method, we start from the ansatz

Bt := span{d0 , · · · , dt−1 } (3.3.73)

with a set of linearly independent vectors di and seek to determine the iterates in the
form

   xt = x0 + Σ_{i=0}^{t−1} αi di ∈ x0 + Bt ,                                          (3.3.74)

such that

   Q(xt) = min_{y∈x0+Bt} Q(y)   ⇔   ‖Axt − b‖_{A^{-1}} = min_{y∈x0+Bt} ‖Ay − b‖_{A^{-1}} .   (3.3.75)

Setting the derivatives of Q(·) with respect to the αi to zero, we see that this is equivalent

6
Magnus R. Hestenes (1906–1991): US-American mathematician; worked at the National Bureau of
Standards (NBS) and the University of California at Los Angeles (UCLA); contributions to optimization
and control theory and to numerical linear algebra.
7
Eduard Stiefel (1909–1978): Swiss mathematician; since 1943 Prof. for Applied Mathematics at
the ETH Zurich; important contributions to Topology, Group Theory, Numerical Linear Algebra (CG
method), Approximation Theory and Celestial Mechanics.

to solving the so-called “Galerkin8 equations”:

(Axt − b, dj ) = 0 , j = 0, . . . , t − 1, (3.3.76)

or in compact form: Axt − b = gt ⊥ Bt . Inserting the above ansatz for xt into this
orthogonality condition, we obtain a regular linear system for the coefficients αi , i =
0, . . . , t−1,

   Σ_{i=0}^{t−1} αi (Adi, dj) = (b, dj) − (Ax0, dj),   j = 0, . . . , t − 1.          (3.3.77)

Remark 3.6: We note that (3.3.76) does not depend on the symmetry of the matrix A .
Starting from this relation one may construct CG-like methods for linear systems with
asymmetric and even indefinite coefficient matrices. Such methods are generally termed
“projection methods”. Methods of this type will be discussed in more detail below.

Recall that the Galerkin equations (3.3.76) are equivalent to minimizing the defect
norm Axt − bA−1 or the error norm xt − xA on x0 + Bt . Natural choices for the
spaces Bt are the so-called Krylov9 spaces

Bt = Kt (d0 ; A) := span{d0 , Ad0 , . . . , At−1 d0 }, (3.3.78)

with some vector d0 , e. g., the (negative) initial defect d0 = b−Ax0 of an arbitrary vector
x0 . This is motivated by the observation that, if At d0 ∈ Kt(d0; A) , we necessarily obtain

   −gt = b − Axt = d0 + A(x0 − xt) ∈ d0 + A Kt(d0; A) ⊂ Kt(d0; A).

Because gt ⊥ Kt(d0; A) , this implies gt = 0 by construction.


Now the CG method constructs a sequence of descent directions, which form an A-
orthogonal basis of the Krylov spaces Kt (d0 ; A) . We proceed in an inductive way: Start-
ing from an arbitrary point x0 with (negative) defect d0 = b − Ax0 , let iterates xi
and corresponding descent directions di (i = 0, ..., t − 1) already be determined such
that {d0 , ..., dt−1 } is an A-orthogonal basis of Kt (d0 ; A) . For the construction of the
next descent direction dt ∈ Kt+1 (d0 ; A) with the property dt ⊥A Kt (d0 ; A) we make the
ansatz

   dt = −gt + Σ_{j=0}^{t−1} βj^{t−1} dj ∈ Kt+1(d0; A).                                (3.3.79)

8
Boris Grigorievich Galerkin (1871–1945): Russian civil engineer and mathematician; Prof. in St.
Petersburg; contributions to Structural Mechanics especially Plate Bending Theory.
9
Aleksei Nikolaevich Krylov (1863–1945): Russian mathematician; Prof. at the Sov. Academy of
Sciences in St. Petersburg; contributions to Fourier Analysis and differential equations, applications in
ship building.

Here, we can assume that gt = Axt − b ∉ Kt(d0; A) , as otherwise gt = 0 and, consequently,
xt = x . Then, for i = 0, . . . , t − 1 there holds

   (dt, Adi) = (−gt, Adi) + Σ_{j=0}^{t−1} βj^{t−1} (dj, Adi) = (−gt + βi^{t−1} di, Adi).   (3.3.80)

For i < t − 1, we have (g t , Adi ) = 0 since Adi ∈ Kt (d0 ; A) and, consequently, βit−1 = 0.
For i = t − 1 , the condition

   0 = (−gt, Adt−1) + βt−1^{t−1} (dt−1, Adt−1)                                        (3.3.81)

leads us to the formulas

   βt−1 := βt−1^{t−1} = (gt, Adt−1)/(dt−1, Adt−1) ,    dt = −gt + βt−1 dt−1 .         (3.3.82)

The next iterates xt+1 and g t+1 = Axt+1 − b are then determined by

   αt = − (gt, dt)/(dt, Adt) ,    xt+1 = xt + αt dt ,    gt+1 = gt + αt Adt .         (3.3.83)

These are the recurrence equations of the CG method. By construction there holds

   (dt, Adi) = (gt, di) = 0,   i ≤ t − 1,    (gt, gt−1) = 0.                          (3.3.84)

From this, we conclude that

   ‖gt‖²   = (dt − βt−1 dt−1, −gt+1 + αt Adt) = αt (dt, Adt),                         (3.3.85)
   ‖gt+1‖² = (gt + αt Adt, gt+1) = αt (Adt, gt+1).                                    (3.3.86)

This allows for the following simplifications in the above formulas:

   αt = ‖gt‖²/(dt, Adt) ,    βt = ‖gt+1‖²/‖gt‖² ,                                     (3.3.87)

as long as the iteration does not terminate with gt = 0 .

Definition 3.5: The CG method determines a sequence of iterates xt ∈ Rn , t ≥ 0, by
the prescription

   Starting values:   x0 ∈ Rn ,  d0 = −g0 = b − Ax0 .
   Iterate for t ≥ 0: αt = ‖gt‖²/(dt, Adt) ,   xt+1 = xt + αt dt ,   gt+1 = gt + αt Adt ,
                      βt = ‖gt+1‖²/‖gt‖² ,    dt+1 = −gt+1 + βt dt .
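A minimal dense Python sketch of Definition 3.5; as in the gradient method sketch, the stopping test on ‖gt‖ is an addition for practical use.

    import numpy as np

    def cg(A, b, x0, tol=1e-10, maxit=None):
        x = np.array(x0, dtype=float)
        g = A @ x - b                        # g^0
        d = -g                               # d^0 = -g^0
        maxit = len(b) if maxit is None else maxit
        for _ in range(maxit):
            gg = g @ g
            if np.sqrt(gg) < tol:
                break
            Ad = A @ d
            alpha = gg / (d @ Ad)            # alpha_t = |g^t|^2 / (d^t, A d^t)
            x += alpha * d
            g += alpha * Ad                  # g^{t+1} = g^t + alpha_t A d^t
            beta = (g @ g) / gg              # beta_t = |g^{t+1}|^2 / |g^t|^2
            d = -g + beta * d
        return x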

By construction the CG method generates a sequence of descent directions dt , which



are automatically A-orthogonal. This implies that the vectors d0 , . . . , dt are linearly
independent and that therefore span{d0, . . . , dn−1 } = Rn . We formulate the properties of
the CG method derived so far in the following theorem.

Theorem 3.10 (CG method): Let the matrix A ∈ Rn×n be symmetric positive defi-
nite. Then, (assuming exact arithmetic) the CG method terminates for any starting vector
x0 ∈ Rn after at most n steps at xn = x . In each step there holds

   Q(xt) = min_{y∈x0+Bt} Q(y),                                                        (3.3.88)

and, equivalently,

   ‖xt − x‖_A = ‖Axt − b‖_{A^{-1}} = min_{y∈x0+Bt} ‖Ay − b‖_{A^{-1}} = min_{y∈x0+Bt} ‖y − x‖_A ,   (3.3.89)

where Bt := span{d0, . . . , dt−1} .

In view of the result of Theorem 3.10 the CG method formally belongs to the class of
“direct” methods. In practice, however, it is used like an iterative method, since:

1. Because of round-off errors the descent directions dt are not exactly A-orthogonal
such that the iteration does not terminate.
2. For large matrices one obtains accurate approximations already after t ≪ n itera-
tions.

As preparation for the main theorem about the convergence of the CG method, we provide
the following auxiliary lemma.

Lemma 3.5 (Polynomial norm bounds): Let A be a symmetric positive definite ma-
trix with spectrum σ(A) ⊂ [a, b] . Then, for any polynomial p ∈ Pt , p(0) = 1 there holds

   ‖xt − x‖_A ≤ M ‖x0 − x‖_A ,    M := sup_{μ∈[a,b]} |p(μ)| .                         (3.3.90)

Proof. Observing the relation

   ‖xt − x‖_A = min_{y∈x0+Bt} ‖y − x‖_A ,
   Bt = span{d0, . . . , dt−1} = span{g0, Ag0, . . . , A^{t−1}g0},

we find
   ‖xt − x‖_A = min_{p∈Pt−1} ‖x0 − x + p(A)g0‖_A .

Since g0 = Ax0 − b = A(x0 − x) it follows that

   ‖xt − x‖_A = min_{p∈Pt−1} ‖[I + Ap(A)](x0 − x)‖_A
              ≤ min_{p∈Pt−1} ‖I + Ap(A)‖_A ‖x0 − x‖_A
              ≤ min_{p∈Pt, p(0)=1} ‖p(A)‖_A ‖x0 − x‖_A ,

with the natural matrix norm ‖·‖_A generated from the A-norm ‖·‖_A . Let λi , i =
1, . . . , n, be the eigenvalues and {w1 , . . . , wn} a corresponding ONS of eigenvectors of the
symmetric, positive definite matrix A . Then, for arbitrary y ∈ Rn there holds

   y = Σ_{i=1}^n γi wi ,   γi = (y, wi),

and, consequently,

   ‖p(A)y‖²_A = Σ_{i=1}^n λi p(λi)² γi² ≤ M² Σ_{i=1}^n λi γi² = M² ‖y‖²_A .

This implies
   ‖p(A)‖_A = sup_{y∈Rn, y≠0} ‖p(A)y‖_A / ‖y‖_A ≤ M ,

which completes the proof. Q.E.D.

As a consequence of Lemma 3.5, we obtain the following a priori error estimate.

Theorem 3.11 (Error estimate for CG method): Let A be a symmetric positive
definite matrix. Then, for the CG method there holds the error estimate

   ‖xt − x‖_A ≤ 2 ( (1 − 1/√κ)/(1 + 1/√κ) )^t ‖x0 − x‖_A ,   t ∈ N,                   (3.3.91)

with the spectral condition number κ = cond2(A) = Λ/λ of A . For reducing the initial
error by a factor TOL the following number of iterations is required:

   t(TOL) ≈ ½ √κ ln(2/TOL).                                                           (3.3.92)

Proof. i) Setting [a, b] := [λ, Λ] in Lemma 3.5, we obtain

   ‖xt − x‖_A ≤ ( min_{p∈Pt, p(0)=1} sup_{λ≤μ≤Λ} |p(μ)| ) ‖x0 − x‖_A .

This yields the assertion if we can show that

   min_{p∈Pt, p(0)=1} sup_{λ≤μ≤Λ} |p(μ)| ≤ 2 ( (1 − √(λ/Λ))/(1 + √(λ/Λ)) )^t .

This is again a problem of approximation theory with respect to the max-norm (Chebyshev
approximation), which can be solved using the Chebyshev polynomials described above
in Subsection 3.2.2. The solution pt ∈ Pt is given by

   pt(μ) = Tt( (Λ + λ − 2μ)/(Λ − λ) ) Tt( (Λ + λ)/(Λ − λ) )^{-1} ,

with the t-th Chebyshev polynomial Tt on [−1, 1] . There holds

   sup_{λ≤μ≤Λ} |pt(μ)| = Tt( (Λ + λ)/(Λ − λ) )^{-1} .

From the representation

   Tt(μ) = ½ ( [μ + √(μ²−1)]^t + [μ − √(μ²−1)]^t ),   μ ∈ R,

for the Chebyshev polynomials and the identity

   (κ+1)/(κ−1) + √( ((κ+1)/(κ−1))² − 1 ) = (κ+1)/(κ−1) + 2√κ/(κ−1) = (√κ+1)²/(κ−1) = (√κ+1)/(√κ−1),

we obtain the estimate

   Tt( (Λ+λ)/(Λ−λ) ) = Tt( (κ+1)/(κ−1) ) = ½ ( ( (√κ+1)/(√κ−1) )^t + ( (√κ−1)/(√κ+1) )^t )
                     ≥ ½ ( (√κ+1)/(√κ−1) )^t .

Hence,
   sup_{λ≤μ≤Λ} |pt(μ)| ≤ 2 ( (√κ−1)/(√κ+1) )^t ,

which implies (3.3.91).
ii) For deriving (3.3.92), we require

   2 ( (√κ−1)/(√κ+1) )^{t(TOL)} ≤ TOL,

and, equivalently,
   t(TOL) > ln(2/TOL) ( ln( (√κ+1)/(√κ−1) ) )^{-1} .

Since
   ln( (x+1)/(x−1) ) = 2 ( 1/x + 1/(3x³) + 1/(5x⁵) + . . . ) ≥ 2/x ,

this is satisfied for t(TOL) ≥ ½ √κ ln(2/TOL) . Q.E.D.


Since κ = condnat(A) > 1 , we have √κ < κ . Observing that the function f(λ) =
(1 − λ^{-1})(1 + λ^{-1})^{-1} is strictly monotonically increasing for λ > 0 (f'(λ) > 0), there
holds

   (1 − 1/√κ)/(1 + 1/√κ) < (1 − 1/κ)/(1 + 1/κ) ,

implying that the CG method should converge faster than the gradient method. This is
actually the case in practice. Both methods converge the faster the smaller the condition
number is. However, in the case Λ ≫ λ , which is frequently met in practice, even the
CG method is too slow. An acceleration can be achieved by so-called “preconditioning”,
which will be described below.

3.3.3 Generalized CG methods and Krylov space methods

For solving a general linear system Ax = b , with regular but not necessarily symmetric
and positive definite matrix A ∈ Rn×n , by the CG method, one may consider the equivalent
system

   A^T A x = A^T b                                                                    (3.3.93)

with the symmetric, positive definite matrix A^TA . Applied to this system the CG method
takes the following form:

   Starting values:   x0 ∈ Rn ,  d0 = A^T(b − Ax0) = −g0 .
   Iterate for t ≥ 0: αt = ‖gt‖²/‖Adt‖² ,   xt+1 = xt + αt dt ,   gt+1 = gt + αt A^TA dt ,
                      βt = ‖gt+1‖²/‖gt‖² ,   dt+1 = −gt+1 + βt dt .

This approach is usually referred to as the CGNR method (“CG applied to the Normal
equations, Residual-minimizing”); it should not be confused with the CGS method (“Conjugate
Gradient Squared”) of P. Sonneveld (1989), which is a transpose-free variant of the BiCG
iteration. The convergence speed is characterized by cond2(A^TA) . The whole method
is equivalent to minimizing the functional

   Q(y) := ½ (A^TAy, y) − (A^Tb, y) = ½ ‖Ay − b‖² − ½ ‖b‖² .                          (3.3.94)

Since cond2 (AT A) ≈ cond2 (A)2 the convergence of this variant of the CG method may
be rather slow. However, its realization does not require the explicit evaluation of the
matrix product AT A but only the computation of the matrix-vector products z = Ay
and AT z .
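A minimal Python sketch of the above iteration; note that only products with A and A^T are formed, never the matrix A^TA itself. The function name cg_normal_equations is illustrative only.

    import numpy as np

    def cg_normal_equations(A, b, x0, tol=1e-10, maxit=1000):
        x = np.array(x0, dtype=float)
        g = A.T @ (A @ x - b)                # g^0 = A^T(A x^0 - b)
        d = -g
        for _ in range(maxit):
            gg = g @ g
            if np.sqrt(gg) < tol:
                break
            Ad = A @ d                       # one product with A ...
            alpha = gg / (Ad @ Ad)           # alpha_t = |g^t|^2 / |A d^t|^2
            x += alpha * d
            g += alpha * (A.T @ Ad)          # ... and one with A^T per step
            beta = (g @ g) / gg
            d = -g + beta * d
        return x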

On the basis of the formulation (3.3.75) the standard CG method is limited to linear
systems with symmetric, positive definite matrices. But starting from the (in this case
equivalent) Galerkin formulation (3.3.76) the method becomes meaningful also for more
general matrices. In fact, in this way one can derive effective generalizations of the CG
method also for nonsymmetric and even indefinite matrices. These modified CG methods
are based on the Galerkin equations (3.3.76) and differ in the choices of “ansatz spaces”

Kt and “test spaces” Kt∗ ,

xt ∈ x0 + Kt : (Axt − b, y) = 0 ∀ y ∈ Kt∗ . (3.3.95)

Here, one usually uses the Krylov spaces

Kt = span{d0 , Ad0 , ..., At−1 d0 },

combined with the test spaces Kt∗ = Kt , or

Kt∗ = span{d0 , AT d0 , ..., (AT )t−1 d0 }.

This leads to the general class of “Krylov space methods”. Most popular representatives
are the following methods, which share one or the other property with the original CG
method but generally do not allow for a similarly complete error analysis.

1. GMRES with or without restart (“Generalized Minimal Residual”) of Y. Saad and
   M. H. Schultz (1986): Kt = span{d0, Ad0, . . . , A^{t−1}d0} = Kt* ,

      ‖Axt − b‖ = min_{y∈x0+Kt} ‖Ay − b‖ .                                            (3.3.96)

   Since this method minimizes the residual over spaces of increasing dimension, like the
   CG method the GMRES method yields the exact solution after at most n steps. However,
   for general nonsymmetric matrices the iterates xt cannot be obtained by a simple
   three-term recurrence as in the CG method. It uses a full recurrence, which results
   in high storage requirements. Therefore, to limit the costs the GMRES method is
   stopped after a certain number of steps, say k steps, and then restarted with xk as
   new starting vector. The latter variant is denoted by “GMRES(k) method”.

2. BiCG and BiCGstab (“Biconjugate Gradient Stabilized”) of H. A. Van der Vorst,


(1992): Kt = span{d0 , Ad0 , ..., At−1 d0 }, Kt∗ = span{d0, AT d0 , ..., (AT )t−1 d0 },

xt ∈ x0 + Kt : (Axt − b, y) = 0, ∀y ∈ Kt∗ . (3.3.97)

In the BiCG method the iterates xt are obtained by a three-term recurrence but
for an unsymmetric matrix the residual minimization property gets lost and the
method may not even converge. Additional stability is provided in the “BiCGstab”
method.

Both methods, GMRES(k) and BiCGstab, are especially designed for unsymmetric but
definite matrices. They have their different pros and cons and are both not universally
applicable. One can construct matrices for which one or the other of the methods does
not work. The methods for the practical computation of the iterates xt in the Krylov
spaces Kt are closely related to the Lanczos and Arnoldi algorithms used for solving the
corresponding eigenvalue problems discussed in Chapter 4, below.

3.3.4 Preconditioning (PCG methods)

The error estimate (3.3.91) for the CG method indicates a particularly good convergence
if the condition number of the matrix A is close to one. In case of large cond2(A) ≫ 1 ,
one uses “preconditioning”, i. e., the system Ax = b is transformed into an equivalent
one, Ãx̃ = b̃ with a better conditioned matrix Ã. To this end, let C be a symmetric,
positive definite matrix, which is explicitly given in product form

C = KK T , (3.3.98)

with a regular matrix K . The system Ax = b can equivalently be written in the form

K −1 A (K T )−1 K T x = K −1 b . (3.3.99)
        
à x̃ b̃

Then, the CG method is formally applied to the transformed system Ãx̃ = b̃ , while it is
hoped that cond2(Ã) ≪ cond2(A) for an appropriate choice of C . The relation

(K T )−1 ÃK T = (K T )−1 K −1 A(K T )−1 K T = C −1 A (3.3.100)

shows that for C ≡ A the matrix à is similar to I, and thus cond2 (Ã) = cond2 (I) = 1.
Consequently, one chooses C = KK T such that C −1 is a good approximation to A−1 .
The CG method for the transformed system Ãx̃ = b̃ can then be written in terms of
the quantities A, b, and x as so-called “PCG method” (“Preconditioned CG” method)
as follows:

   Starting values:   x0 ∈ Rn ,  r0 = b − Ax0 ,  Cρ0 = r0 ,  d0 = ρ0 .
   Iterate for t ≥ 0: αt = (rt, ρt)/(Adt, dt) ,   xt+1 = xt + αt dt ,   rt+1 = rt − αt Adt ,
                      Cρt+1 = rt+1 ,   βt = (rt+1, ρt+1)/(rt, ρt) ,   dt+1 = ρt+1 + βt dt .
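A minimal Python sketch of the PCG iteration, where the application of C^{-1} is passed in as a function solve_C (e. g. two triangular solves with K and K^T for C = KK^T, or simply division by diag(A) for diagonal preconditioning); the parameter names are illustrative only.

    import numpy as np

    def pcg(A, b, x0, solve_C, tol=1e-10, maxit=1000):
        x = np.array(x0, dtype=float)
        r = b - A @ x                        # r^0 = b - A x^0
        rho = solve_C(r)                     # C rho^0 = r^0
        d = rho.copy()                       # d^0 = rho^0
        r_rho = r @ rho
        for _ in range(maxit):
            if np.linalg.norm(r) < tol:
                break
            Ad = A @ d
            alpha = r_rho / (Ad @ d)
            x += alpha * d
            r -= alpha * Ad
            rho = solve_C(r)                 # C rho^{t+1} = r^{t+1}
            r_rho_new = r @ rho
            beta = r_rho_new / r_rho         # beta_t = (r^{t+1},rho^{t+1}) / (r^t,rho^t)
            d = rho + beta * d
            r_rho = r_rho_new
        return x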

Compared to the normal CG method the PCG iteration in each step additionally requires
the solution of the system Cρt+1 = r t+1 , which is easily accomplished using the decompo-
sition C = KK T . In order to preserve the work complexity O(n) a. op. in each step the
triangular matrix K should have a sparsity pattern similar to that of the lower triangular
part L of A . This condition is satisfied by the following popular preconditioners:
1) Diagonal preconditioning (scaling): C := D = D 1/2 D 1/2 .
The scaling ensures that the elements of A are brought to approximately the same size,
especially with ãii = 1 . This reduces the condition number since
   cond2(A) ≥ max_{1≤i≤n} aii / min_{1≤i≤n} aii .                                     (3.3.101)

Example: The matrix A = diag{λ1 = ... = λn−1 = 1, λn = 10k } has the condition number

cond2 (A) = 10k , while the scaled matrix à = D −1/2 AD −1/2 has the optimal condition
number cond2 (Ã) = 1.
2) SSOR preconditioning: We choose

   C := (D + L)D^{-1}(D + L^T) = D + L + L^T + LD^{-1}L^T
      = (D^{1/2} + LD^{-1/2}) (D^{1/2} + D^{-1/2}L^T) = K K^T ,

or, more generally, involving a relaxation parameter ω ∈ (0, 2),

   C := 1/(2−ω) ( (1/ω)D + L ) ( (1/ω)D )^{-1} ( (1/ω)D + L^T )
      = [ ((2−ω)ω)^{-1/2} (D^{1/2} + ωLD^{-1/2}) ] [ ((2−ω)ω)^{-1/2} (D^{1/2} + ωD^{-1/2}L^T) ] = K K^T .

Obviously, the triangular matrix K has the same sparsity pattern as L . Each step of the
preconditioned iteration costs about twice as much work as the basic CG method. For an
optimal choice of the relaxation parameter ω (not easy to determine) there holds

   cond2(Ã) = √(cond2(A)) .

3) ICCG preconditioning (Incomplete Cholesky Conjugate Gradient): The symmetric,
positive definite matrix A has a Cholesky decomposition A = LL^T with a lower tri-
angular matrix L = (lij)_{i,j=1}^n . The elements of L are successively determined by the
recurrence formulas

   lii = ( aii − Σ_{k=1}^{i−1} lik² )^{1/2} ,   i = 1, . . . , n,
   lji = ( aji − Σ_{k=1}^{i−1} ljk lik ) / lii ,   j = i + 1, . . . , n.

The matrix L generally has nonzero elements in the whole band of A, which requires
much more memory than A itself. This can be avoided by performing (such as in the
ILU approach discussed in Subsection 3.1.2) only an “incomplete” Cholesky decomposition
where within the elimination process some of the lji are set to zero, e. g., those for which
aji = 0. This results in an incomplete decomposition

A = L̃L̃T + E (3.3.102)

with a lower triangular matrix L̃ = (˜lij )ni,j=1, which has a similar sparsity pattern as A .
In this case, one speaks of the “ICCG(0) variant”. In case of a band matrix A, one
may allow the elements of L̃ to be nonzero in further p off-diagonals resulting in the
so-called “ICCG(p) variant” of the ICCG method, which is hoped to provide a better
approximation C −1 ≈ A−1 for increasing p. Then, for preconditioning the matrix

C = KK T := L̃L̃T (3.3.103)

is used. Although there is no full theoretical justification yet for the success of the
ICCG preconditioning, practical tests show a significant improvement in the convergence
behavior. This may be due to the fact that, though the condition number is not necessar-
ily decreased, the eigenvalues of the corresponding transformed matrix Ã cluster more
around λ = 1 .
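A minimal dense Python sketch of the IC(0) factorization described above: the Cholesky recurrence is evaluated, but fill-in is suppressed by keeping entries only where A itself is nonzero. No safeguard against breakdown (negative pivots), which may occur for general s.p.d. matrices, is included; the function name is illustrative only.

    import numpy as np

    def incomplete_cholesky0(A):
        n = A.shape[0]
        Lt = np.zeros_like(A, dtype=float)
        for i in range(n):
            Lt[i, i] = np.sqrt(A[i, i] - np.sum(Lt[i, :i] ** 2))
            for j in range(i + 1, n):
                if A[j, i] != 0.0:           # keep the sparsity pattern of A
                    Lt[j, i] = (A[j, i] - Lt[j, :i] @ Lt[i, :i]) / Lt[i, i]
        return Lt                            # A = Lt @ Lt.T + E  (incomplete decomposition)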

3.4 A model problem

At the end of the discussion of the classical iterative methods for solving linear systems
Ax = b , we will determine their convergence rates for the model situation already de-
scribed in Section 0.4.2 of Chapter 0. We consider the so-called “1-st boundary value
problem of the Laplace operator”

   −∂²u/∂x²(x, y) − ∂²u/∂y²(x, y) = f(x, y)   for (x, y) ∈ Ω,
    u(x, y) = 0                               for (x, y) ∈ ∂Ω,                        (3.4.104)

on the unit square Ω = (0, 1) × (0, 1) ⊂ R2 . For solving this problem the domain Ω is
covered by a uniform mesh as shown in Fig. 3.5.

   (uniform mesh with row-wise numbering of the interior points; mesh size h = 1/(m+1),
    n = m² unknown mesh values)
Figure 3.5: Mesh for the discretization of the model problem

The “interior” mesh points are numbered row-wise. On this mesh the second deriva-
tives in the differential equation (3.4.104) are approximated by second-order central dif-
ference quotients leading to the following difference equations for the mesh unknowns
U(x, y) ≈ u(x, y) :
   −h^{-2} [ U(x+h, y) − 2U(x, y) + U(x−h, y) + U(x, y+h) − 2U(x, y) + U(x, y−h) ] = f(x, y).

Observing the boundary condition u(x, y) = 0 for (x, y) ∈ ∂Ω this set of difference

equations is equivalent to the linear system

Ax = b, (3.4.105)

for the vector x ∈ Rn of unknown mesh values xi ≈ u(Pi ) , Pi interior mesh point. The
matrix A has the already known form
   A = blocktridiag(−I, B, −I) ∈ Rn×n ,    B = tridiag(−1, 4, −1) ∈ Rm×m ,

with the m×m-unit matrix I . The right-hand side is given by b = h2 (f (P1 ), . . . , f (Pn ))T .
The matrix A has several special properties:
- “sparse band matrix” with bandwidth 2m + 1 ;
- “irreducible” and “strongly diagonally dominant”;
- “symmetric” and “positive definite”;
- “consistently ordered”;
- “of nonnegative type” (“M-matrix”): aii > 0, aij ≤ 0, i ≠ j.
The importance of this last property will be illustrated in an exercise.
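For experiments, the model matrix can be assembled conveniently via Kronecker products; the following Python sketch is equivalent to the block form displayed above (dense, for small m only).

    import numpy as np

    def model_matrix(m):
        I = np.eye(m)
        T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)   # 1-D second-difference matrix
        return np.kron(I, T) + np.kron(T, I)                   # = blocktridiag(-I, B, -I), B = T + 2I

    A = model_matrix(4)
    assert A.shape == (16, 16) and np.allclose(np.diag(A), 4.0)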

For this matrix eigenvalues and eigenvectors can be explicitly determined (h = 1/(m+1)):

   λkl = 4 − 2(cos[khπ] + cos[lhπ]),   wkl = ( sin[ikhπ] sin[jlhπ] )_{i,j=1,...,m},   k, l = 1, . . . , m,

i. e., Awkl = λkl wkl . Hence, for h ≪ 1 , we have

   Λ := λmax = 4 − 4 cos([1−h]π) ≈ 8,
   λ := λmin = 4 − 4 cos(hπ) = 4 − 4(1 − π²h²/2 + O(h⁴)) ≈ 2π²h²,

and consequently

   κ := cond2(A) ≈ 4/(π²h²).                                                          (3.4.106)
Then, the eigenvalues of the Jacobi iteration matrix J = −D^{-1}(L + R) are given by

   μkl(J) = ½ (cos[khπ] + cos[lhπ]),   k, l = 1, . . . , m.

Hence,
   ρ := spr(J) = μmax(J) = cos[hπ] = 1 − π²h²/2 + O(h⁴).                              (3.4.107)
For the iteration matrices H1 and Hωopt of the Gauß-Seidel method and the (optimal)
SOR method, respectively, there holds

   spr(H1) = ρ² = 1 − π²h² + O(h⁴),                                                   (3.4.108)

   spr(Hωopt) = (1 − √(1−ρ²))/(1 + √(1−ρ²)) = (1 − πh + O(h²))/(1 + πh + O(h²))
              = 1 − 2πh + O(h²).                                                      (3.4.109)

Comparison of convergence speed

Now, we make a comparison of the convergence speed of the various iterative methods
considered above. The reduction of the initial error ‖x0 − x‖2 in a fixed-point iteration
by the factor ε ≪ 1 requires about T(ε) iterations,

   T(ε) ≈ ln(1/ε)/ln(1/ρ),   ρ = spr(B),   B = I − C^{-1}A iteration matrix.          (3.4.110)

Using the above formulas, we obtain:

   TJ(ε)   ≈ −ln(1/ε)/ln(1 − π²h²/2) ≈ 2 ln(1/ε)/(π²h²) = (2/π²) n ln(1/ε),
   TGS(ε)  ≈ −ln(1/ε)/ln(1 − π²h²)   ≈ ln(1/ε)/(π²h²)   = (1/π²) n ln(1/ε),
   TSOR(ε) ≈ −ln(1/ε)/ln(1 − 2πh)    ≈ ln(1/ε)/(2πh)    = (1/(2π)) √n ln(1/ε).

The gradient method and the CG method require for the reduction of the initial error
‖x0 − x‖2 by the factor ε ≪ 1 the following numbers of iterations:

   TG(ε)  = ½ κ ln(2/ε)  ≈ 2 ln(1/ε)/(π²h²) ≈ (2/π²) n ln(1/ε),
   TCG(ε) = ½ √κ ln(2/ε) ≈ ln(2/ε)/(πh)     ≈ (1/π) √n ln(2/ε).
We see that the Jacobi method and the gradient method converge with about the same
speed. The CG method is only half as fast as the (optimal) SOR method, but it does
not require the determination of an optimal parameter (while the SOR method does not
require the matrix A to be symmetric). The Jacobi method with Chebyshev acceleration
is as fast as the “optimal” SOR method but also does not require the determination of
an optimal parameter (but a guess for spr(J) ).
For the special right-hand side function f (x, y) = 2π 2 sin(πx) sin(πy) the exact solu-
tion of the boundary value problem is given by

u(x, y) = sin(πx) sin(πy). (3.4.111)

The error caused by the finite difference discretization considered above can be estimated
as follows:

   max_{Pi} |u(Pi) − xi| ≤ (π⁴/12) h² + O(h⁴).                                        (3.4.112)

Hence, for achieving a relative accuracy of TOL = 10^{-3} (three decimals) a mesh size

   h ≈ (√12/π²) · 10^{-3/2} ≈ 10^{-2}

is required. This results in n ≈ 10⁴ unknowns. In this case, we obtain for the above
spectral radii, condition numbers and numbers of iterations required for error reduction
by ε = 10^{-4} (including a safety factor of 1/10) the following values (ln(1/ε) ∼ 10):

   spr(J)   ≈ 0.9995      TJ   ≈ 20,000
   spr(H1)  ≈ 0.999       TGS  ≈ 10,000
   spr(Hω*) ≈ 0.9372      TSOR ≈ 160
   cond2(A) ≈ 5,000       TG   ≈ 20,000,   TCG ≈ 340

For the comparison of the various solution methods, we also have to take into account
the work in each iteration step. For the number “OP” of “a. op.” (1 multiplication + 1
addition) per iteration step there holds:

OPJ ≈ OPH1 ≈ OPHω ≈ 6 n ,


OPG ≈ OPCG ≈ 10 n .

As final result, we see that the computation of the approximate solution of the boundary
value model problem (3.4.104) with a prescribed accuracy TOL by the Jacobi method,
the Gauß-Seidel method and the gradient method requires O(n2 ) a. op. In this case
a direct method such as the Cholesky algorithm requires O(n2 ) = O(m2 n) a. op. but
significantly more storage space. The (optimal) SOR method and the CG method only
require O(n3/2 ) a. op.
For the model problem with n = 10⁴ , we have the following total work “TW” required
for the solution of the system (3.4.105) to discretization accuracy ε = 10^{-4} :

   TWJ(TOL)   ≈ 4 · 3 n²      ≈ 1.2 · 10⁹  a. op.,
   TWGS(TOL)  ≈ 4 · 1.5 n²    ≈ 6 · 10⁸   a. op.,
   TWSOR(TOL) ≈ 4 · 2 n^{3/2}  ≈ 8 · 10⁶   a. op.,
   TWCG(TOL)  ≈ 4 · 10 n^{3/2} ≈ 4 · 10⁷   a. op.

Remark 3.7: Using an appropriate preconditioning, e. g., the ILU preconditioning, in


the CG method the work count can be reduced to O(n5/4 ) . The same complexity can
be achieved by Chebyshev acceleration of the (optimal) SOR method. Later, we will
discuss a more sophisticated iterative method based on the “multi-level concept”, which

has optimal solution complexity O(n) . For such a multigrid (“MG”) method, we can
expect work counts like TWM G ≈ 4 · 25n ≈ 106 a. op..

Remark 3.8: For the 3-dimensional version of the above model problem, we have

   λmax ≈ 12 h^{-2},   λmin ≈ 3π²,   κ ≈ 4/(π²h²),

and consequently the same estimates for ρJ , ρGS and ρSOR as well as for the iteration
numbers TJ , TGS , TSOR , TCG , as in the 2-dimensional case. In this case the total work
per iteration step is OPJ , OPGS , OPSOR ≈ 8 n , OPCG ≈ 12 n . Hence, the resulting
total work amounts to

   TWJ(TOL)   ≈ 4 · 4 n²      ≈ 1.6 · 10¹³ a. op.,
   TWGS(TOL)  ≈ 4 · 2 n²      ≈ 8 · 10¹²  a. op.,
   TWSOR(TOL) ≈ 4 · 3 n^{3/2}  ≈ 1.2 · 10¹⁰ a. op.,
   TWCG(TOL)  ≈ 4 · 12 n^{3/2} ≈ 4.8 · 10¹⁰ a. op.,

while that for the multigrid method increases only to TWMG ≈ 4 · 50 n ≈ 2 · 10⁸ a. op.

Remark 3.9: For the interpretation of the above work counts, we have to consider the
computing power of available computer cores, e. g., 200 MFlops (200 million “floating-
point” oper./sec.) of a standard desktop computer. Here, the solution of the 3-dimensional
model problem by the optimal SOR method takes about 1.5 minutes, while the multigrid
method only needs less than 1 second.

3.5 Exercises

Exercise 3.1: Investigate the convergence of the fixed-point iteration xt = Bxt−1 + c


with an arbitrary starting value x0 ∈ R3 for the following matrices
         ⎡ 0.6  0.3  0.1 ⎤               ⎡ 0  0.5  0 ⎤
   i)  B = ⎢ 0.2  0.5  0.7 ⎥ ,     ii)  B = ⎢ 1   0   0 ⎥ .
         ⎣ 0.1  0.1  0.1 ⎦               ⎣ 0   0   2 ⎦

What are the limits of the iterates in case of convergence? (Hint: The eigenvalues of the
matrices B are to be estimated. This can be done via appropriate matrix norms or also
via the determinants.)

Exercise 3.2: The linear system


   ⎡  3  −1 ⎤ ⎡ x1 ⎤   ⎡ −1 ⎤
   ⎣ −1   3 ⎦ ⎣ x2 ⎦ = ⎣  1 ⎦

is to be solved by the Jacobi and the Gauß-Seidel method. How many iterations are
approximately (asymptotically) required for reducing the initial error ‖x0 − x‖2 by the
factor 10−6 ? (Hint: Use the error estimate stated in the text.)

Exercise 3.3: Show that the two definitions of “irreducibility” of a matrix A ∈ Rn×n
given in the text are equivalent.
Hint: Use the fact that the definition of “reducibility” of the system Ax = b , i. e., the
existence of simultaneous row and column permutations resulting in
   P^T A P = Ã = ⎡ Ã11   0  ⎤ ,    Ã11 ∈ Rp×p ,  Ã22 ∈ Rq×q ,  n = p + q,
                  ⎣  0   Ã22 ⎦

is equivalent to the existence of a non-trivial index partitioning {J, K} of Nn = {1, . . ., n},


J ∪ K = Nn , J ∩ K = ∅, such that ajk = 0 for j ∈ J, k ∈ K.

Exercise 3.4: Examine the convergence of the Jacobi and Gauss-Seidel methods for
solving the linear system Ai x = b (i = 1, 2) for the following two matrices
        ⎡ 2  −1   2 ⎤          ⎡  5  5  0 ⎤
   A1 = ⎢ 1   2  −2 ⎥ ,   A2 = ⎢ −1  5  4 ⎥ .
        ⎣ 2   2   2 ⎦          ⎣  2  3  8 ⎦

(Hint: Use the convergence criteria stated in the text, or estimate the spectral radius)

Exercise 3.5: For the solution of the linear (2 × 2)-system


   ⎡  1  −a ⎤
   ⎢        ⎥ x = b ,   x, b ∈ R² ,
   ⎣ −a   1 ⎦

the following parameter-dependent fixed-point iteration is considered:

   ⎡  1    0 ⎤         ⎡ 1−ω   ωa ⎤
   ⎢         ⎥ xt  =   ⎢          ⎥ xt−1 + ωb ,   ω ∈ R.
   ⎣ −ωa   1 ⎦         ⎣  0   1−ω ⎦

a) For which a ∈ R is this method with ω = 1 convergent?


b) Determine for a = 0.5 the value

ω ∈ {0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4},

for which the spectral radius of the iteration matrix Bω becomes minimal and sketch the
graph of the function f (ω) = spr(Bω ) .

Exercise 3.6: Let A ∈ Rn×n be a symmetric (and therefore diagonalizable) matrix with
eigenvalues λi ∈ R, i = 1, . . . , n. Show that for any polynomial p ∈ Pk there holds

   spr(p(A)) = max_{i=1,...,n} |p(λi)| .

(Hint: Use the fact that there exists an ONB of eigenvectors of A .)

Exercise 3.7: For the computation of the inverse A−1 of a regular matrix A ∈ Rn×n
the following two fixed-point iterations are considered:

a) Xt = Xt−1 (I − AC) + C, t = 1, 2, . . . , C ∈ Rn×n a regular “preconditioner”,


b) Xt = Xt−1 (2I − AXt−1 ), t = 1, 2, . . . .

Give (sufficient) criteria for the convergence of these iterations. For this task (computation
of a matrix inverse), what would the Newton iteration look like?

Exercise 3.8: Let B be an arbitrary n × n-matrix, and let p be a polynomial. Show


that
σ(p(B)) = p(σ(B)),
i. e., for any λ ∈ σ(p(B)) there exists a μ ∈ σ(B) such that λ = p(μ) and vice versa.
(Hint: Recall the Schur or the Jordan normal form.)

Exercise 3.9: The method of Chebyshev acceleration can be applied to any convergent
fixed-point iteration
xt = Bxt−1 + c, t = 1, 2, . . . ,
with symmetric iteration matrix B. Here, the symmetry of B guarantees the relation
‖p(B)‖2 = spr(p(B)) = max_{λ∈σ(B)} |p(λ)| for any polynomial p ∈ Pk , which is crucial for
the analysis of the acceleration effect. In the text this has been carried out for the SSOR
(Symmetric Successive Over-Relaxation) method. Repeat the steps of this analysis for
the Jacobi method for solving the linear system Ax = b with symmetric matrix A ∈ Rn×n .

Exercise 3.10: Consider the following symmetric “saddle point system”


   ⎡ A    B ⎤ ⎡ x ⎤   ⎡ b ⎤
   ⎢        ⎥ ⎢   ⎥ = ⎢   ⎥ ,
   ⎣ B^T  O ⎦ ⎣ y ⎦   ⎣ c ⎦

with a symmetric positive definite matrix A ∈ Rn×n and a not necessarily quadratic
matrix B ∈ Rn×m , m ≤ n. The coefficient matrix cannot be positive definite since some
of its main diagonal elements are zero. Most of the iterative methods discussed in the
text can directly be applied for this system.
i) Assume that the coefficient matrix is regular. Can the damped Richardson method,
   ⎡ xt ⎤    ⎛ ⎡ I  O ⎤     ⎡ A    B ⎤ ⎞ ⎡ xt−1 ⎤     ⎡ b ⎤
   ⎢    ⎥ =  ⎜ ⎢      ⎥ − θ ⎢        ⎥ ⎟ ⎢      ⎥ + θ ⎢   ⎥ ,
   ⎣ yt ⎦    ⎝ ⎣ O  I ⎦     ⎣ B^T  O ⎦ ⎠ ⎣ yt−1 ⎦     ⎣ c ⎦

be made convergent in this case for appropriately chosen damping parameter θ ? (Hint:
Investigate whether the coefficient matrix may have positive AND negative eigenvalues.)
ii) A classical approach to solving this saddle-point system is based on the equivalent
“Schur-complement formulation”:

B T A−1 By = B T A−1 b − c, x = A−1 b − A−1 By,

in which the solution component y can be computed independently of x . The matrix


B T A−1 B is called the “Schur complement” of A in the full block matrix. Show that
the matrix B T A−1 B is symmetric and positive semi-definite and even positive definite
if B has maximal rank. Hence the symmetrized Gauß-Seidel method with Chebyshev
acceleration may be applied to this reduced system for y . Formulate this iteration!

Exercise 3.11: The general “descent method” for the iterative solution of a linear system
Ax = b with symmetric positive definite matrix A ∈ Rn×n has the form

   starting value:  x0 ∈ Rn ,  g0 := Ax0 − b ,
   for t ≥ 0:  descent direction rt ,
        αt = − (gt, rt)2/(Art, rt)2 ,
        xt+1 = xt + αt rt ,   gt+1 = gt + αt Art .

The so-called “coordinate relaxation” uses descent directions rt , which are obtained by
cycling through the Cartesian unit vectors {e1 , . . . , en } . Verify that a full n-cycle of
this method is equivalent to one step of the Gauß-Seidel iteration

x̂1 = D −1 b − D −1 (Lx̂1 + Rx0 ).

Exercise 3.12: The minimal squared-defect solution of an overdetermined linear system


Ax = b is characterized as solution of the normal equation

AT Ax = AT b.

The square matrix AT A is symmetric and also positive definite, provided A has full rank.
Formulate the CG method for solving the normal equation without explicitly computing
the matrix product AT A. How many matrix-vector products with A are necessary per
iteration (compared to the CG method applied to Ax = b)? Relate the convergence speed
of this iteration to the singular values of the matrix A.

Exercise 3.13: For solving a linear system Ax = b with symmetric positive definite co-
efficient matrix A one may use the Gauß-Seidel, the (optimal) SOR method, the gradient
method, or the CG method. Recall the estimates for the asymptotic convergence speed
of these iterations expressed in terms of the spectral condition number κ = cond2 (A) and
compare the corresponding performance results.

In order to derive convergence estimates for the Gauß-Seidel and (optimal) SOR method,
assume that A is consistently ordered and that the spectral radius of the Jacobi iteration
matrix is given by
   spr(J) = 1 − 1/κ .
Discuss the pros and cons of the considered methods.

Exercise 3.14: Consider the symmetric “saddle point system” from Exercise 3.10
   ⎡ A    B ⎤ ⎡ x ⎤   ⎡ b ⎤
   ⎢        ⎥ ⎢   ⎥ = ⎢   ⎥ ,
   ⎣ B^T  O ⎦ ⎣ y ⎦   ⎣ c ⎦

with a symmetric positive definite matrix A ∈ Rn×n and a not necessarily quadratic
matrix B ∈ Rn×m , m ≤ n with full rank. The coefficient matrix cannot be positive
definite since some of its main diagonal elements are zero.
A classical approach of solving this saddle-point system is based on the equivalent “Schur-
complement formulation”:

B T A−1 By = B T A−1 b − c, x = A−1 b − A−1 By,

in which the solution component y can be computed independently of x . The matrix


B T A−1 B is called the “Schur complement” of A in the full block matrix.
In Exercise 3.10 it was shown that a symmetric variant of the Gauß-Seidel method with
Chebyshev-acceleration can be applied to this system. However, this approach suffers
from the severe drawback that B T A−1 B has to be explicitly known in order to construct
the decomposition
B T A−1 B = L + D + R.
Verify that, in contrast, the CG method applied to the Schur complement does
not suffer from this defect, i. e., that an explicit construction of A^{-1} can be avoided.
Formulate the CG algorithm for the above Schur complement and explain how to efficiently
treat the explicit occurrence of A^{-1} in the algorithm.

Exercise 3.15: For the gradient method and the CG method for a symmetric, positive
definite matrix A there hold the error estimates

   ‖xt_grad − x‖_A ≤ ( (1 − 1/κ)/(1 + 1/κ) )^t ‖x0_grad − x‖_A ,
   ‖xt_cg − x‖_A ≤ 2 ( (1 − 1/√κ)/(1 + 1/√κ) )^t ‖x0_cg − x‖_A ,

with the condition number κ := cond2(A) = λmax/λmin . Show that for reducing the
initial error by a factor ε the following numbers of iterations are required:

   t_grad(ε) ≈ ½ κ ln(1/ε),    t_cg(ε) ≈ ½ √κ ln(2/ε).

Exercise 3.16: The SSOR preconditioning of the CG method for a symmetric, positive
definite matrix A with the usual additive decomposition A = L + D + L^T uses the
parameter-dependent matrix

   C := 1/(2−ω) ( (1/ω)D + L ) ( (1/ω)D )^{-1} ( (1/ω)D + L^T ) ,   ω ∈ (0, 2).

Write this matrix in the form C = KK^T with a regular, lower-triangular matrix K and
explain why C^{-1} may be viewed as an approximation to A^{-1} .

Exercise 3.17: The model matrix A ∈ Rn×n , n = m2 , originating from the 5-point
discretization of the Poisson problem on the unit square,
   A = blocktridiag(−I, B, −I) ∈ Rn×n ,    B = tridiag(−1, 4, −1) ∈ Rm×m ,

possesses an important property (it is of “nonnegative type”, i. e., a regular “Z-matrix”):

   aii > 0,   aij ≤ 0,  i ≠ j.

Show that the inverse A^{-1} = (a^{(−1)}_{ij})_{i,j=1}^n has nonnegative elements a^{(−1)}_{ij} ≥ 0 , i. e., A is a
so-called “M-matrix” (“(inverse) monotone” matrix). This implies that the solution x of
a linear system Ax = b with nonnegative right-hand side b , bi ≥ 0 , is also nonnegative,
xi ≥ 0 . (Hint: Consider the Jacobi matrix J = −D^{-1}(L + R) and the representation of
the inverse (I − J)^{-1} as a Neumann series.)

Exercise 3.18: In the text, we formulated the sequence of iterates {xt}t≥1 of the CG
method formally as the solution xt of the optimization problem

   Q(xt) = min_{y∈x0+Kt(d0;A)} Q(y)   ↔   ‖Axt − b‖_{A^{-1}} = min_{y∈x0+Kt(d0;A)} ‖Ay − b‖_{A^{-1}} ,

with the Krylov spaces Kt(d0; A) = span{d0, Ad0, · · · , A^{t−1}d0}. The so-called “Gener-
alized Minimal Residual method” (GMRES), instead, formally constructs a sequence of
iterates {xt_gmres}t≥1 by

   ‖Axt_gmres − b‖2 = min_{y∈x0+Kt(d0;A)} ‖Ay − b‖2 .

i) Prove that the GMRES method allows for an error inequality similar to the one that
was derived for the CG method:

   ‖Axt_gmres − b‖2 ≤ min_{p∈Pt, p(0)=1} ‖p(A)‖2 ‖Ax0 − b‖2 ,

where Pt denotes the space of polynomials up to order t.



ii) Prove that in case of A being a symmetric, positive definite matrix, this leads to the
same asymptotic convergence rate as for the CG method.
iii) Show that the result obtained in (i) can also be applied to the case of A being similar
to a diagonal matrix D = diagi (λi ) ∈ Cn×n , i. e.,

   A = T D T^{-1} ,

with a regular matrix T . In this case there holds

   ‖xt_gmres − x‖2 ≤ κ2(T) min_{p∈Pt, p(0)=1} max_i |p(λi)| ‖x0 − x‖2 .

What makes this result rather cumbersome in contrast to the case of a symmetric, positive
matrix discussed in (ii)?
Remark: The advantage of the GMRES method lies in the fact that it is, in principle,
applicable to any regular matrix A . However, good convergence estimates for the general
case are hard to prove.

Exercise 3.19: Repeat the analysis of the convergence properties of the various solution
methods for the 3-dimensional version of the model problem considered in the text. The
underlying boundary value problem has the form
   −( ∂²/∂x² + ∂²/∂y² + ∂²/∂z² ) u(x, y, z) = f(x, y, z),   (x, y, z) ∈ Ω = (0, 1)³ ⊂ R³,
     u(x, y, z) = 0,   (x, y, z) ∈ ∂Ω,

and the corresponding difference approximation (so-called “7-point approximation”) at
interior mesh points (x, y, z) ∈ {Pijk , i, j, k = 1, . . . , m} reads

   −h^{-2} [ U(x±h, y, z) + U(x, y±h, z) + U(x, y, z±h) − 6U(x, y, z) ] = f(x, y, z).

Using again row-wise numbering of the mesh points the resulting linear system for the
mesh values Uijk ≈ u(Pijk ) takes the form
   A = blocktridiag(−Im², B, −Im²) ∈ Rn×n (n = m³),
   B = blocktridiag(−Im, C, −Im) ∈ Rm²×m² ,    C = tridiag(−1, 6, −1) ∈ Rm×m .

In this case the corresponding eigenvalues and eigenvectors are explicitly given by

   λijk = 6 − 2( cos[ihπ] + cos[jhπ] + cos[khπ] ),   i, j, k = 1, . . . , m,
   wijk = ( sin[pihπ] sin[qjhπ] sin[rkhπ] )_{p,q,r=1,...,m} .

For the exact solution u(x, y, z) = sin(πx) sin(πy) sin(πz) there holds the error estimate
   max_{Ω} |Uijk − u(Pijk)| ≤ (π⁴/8) h² + O(h⁴),
which dictates a mesh size h = 10−2 in order to guarantee a desired relative discretization
accuracy of TOL = 10−3 .
a) Determine formulas for the condition number cond2 (A) and the spectral radius spr(J)
in terms of the mesh size h.
b) Give the number of iterations of the Jacobi, Gauß-Seidel and optimal SOR method as
well as the gradient and CG method approximately needed for reducing the initial error
to size ε = 10−4 (including a small safety factor).
c) Give a rough estimate (in terms of h) of the total number of a. op. per iteration step
for the methods considered.
4 Iterative Methods for Eigenvalue Problems
4.1 Methods for the partial eigenvalue problem

In this section, we discuss iterative methods for solving the partial eigenvalue problem of
a general matrix A ∈ Kn×n .

4.1.1 The “Power Method”

Definition 4.1: The “Power method” of v. Mises1 generates, starting from some initial
point z0 ∈ Cn with ‖z0‖ = 1 , a sequence of iterates zt ∈ Cn , t = 1, 2, . . . , by

   z̃t = Azt−1 ,    zt := ‖z̃t‖^{-1} z̃t .                                             (4.1.1)

The corresponding eigenvalue approximation is given by

   λt := (Azt)r / ztr ,    r ∈ {1, . . . , n} :  |ztr| = max_{j=1,...,n} |ztj| .       (4.1.2)

The normalization is commonly done using the norms ‖·‖ = ‖·‖∞ or ‖·‖ = ‖·‖2 . For
the convergence analysis of this method, we assume the matrix A to be diagonalizable,
i. e., to be similar to a diagonal matrix, which is equivalent to the existence of a basis
of eigenvectors {w1 , . . . , wn} of A . These eigenvectors are associated to the eigenvalues
ordered according to their modulus, 0 ≤ |λ1| ≤ . . . ≤ |λn| , and are assumed to be
normalized, ‖wi‖2 = 1 . Further, we assume that the initial vector z0 has a nontrivial
component with respect to the n-th eigenvector wn ,

   z0 = Σ_{i=1}^n αi wi ,   αn ≠ 0.                                                    (4.1.3)

In practice, this is not really a restrictive assumption since, due to round-off errors, it will
be satisfied in general.
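A minimal Python sketch of the power method (4.1.1), (4.1.2) with maximum-norm normalization; a fixed number of steps is used here instead of a stopping criterion, and the function name power_method is illustrative only.

    import numpy as np

    def power_method(A, z0, nsteps=100):
        z = np.array(z0, dtype=complex)
        z /= np.linalg.norm(z, np.inf)
        for _ in range(nsteps):
            zt = A @ z                       # z-tilde^t = A z^{t-1}
            z = zt / np.linalg.norm(zt, np.inf)
        Az = A @ z
        r = np.argmax(np.abs(z))             # component of largest modulus
        return Az[r] / z[r], z                # eigenvalue approximation (4.1.2)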

Theorem 4.1 (Power method): Let the matrix A be diagonalizable and let the eigen-
value with largest modulus be separated from the other eigenvalues, i. e., |λn | > |λi |, i =
1, . . . , n − 1. Further, let the starting vector z 0 have a nontrivial component with respect
to the eigenvector w n . Then, there are numbers σt ∈ C, |σt | = 1 such that

   ‖zt − σt wn‖ → 0   (t → ∞),                                                        (4.1.4)

1
Richard von Mises (1883–1953): Austrian mathematician; Prof. of applied Mathematics in Straßburg
(1909-1918), in Dresden and then founder of the new Institute of Applied Mathematics in Berlin (1919-
1933), emigration to Turkey (Istanbul) and eventually to the USA (1938); Prof. at Harvard University;
important contributions to Theoretical Fluid Mechanics (introduction of the “stress tensor”), Aerodynam-
ics, Numerics, Statistics and Probability Theory.


and the “maximum” eigenvalue λmax = λn is approximated with convergence speed

   λt = λmax + O( |λn−1/λmax|^t )   (t → ∞).                                          (4.1.5)

Proof. Let z0 = Σ_{i=1}^n αi wi be the basis expansion of the starting vector. For the iterates
zt there holds

   zt = z̃t/‖z̃t‖2 = Azt−1/‖Azt−1‖2 = Az̃t−1/‖Az̃t−1‖2 = . . . = A^t z0/‖A^t z0‖2 .

Furthermore,

   A^t z0 = Σ_{i=1}^n αi λi^t wi = λn^t αn { wn + Σ_{i=1}^{n−1} (αi/αn)(λi/λn)^t wi }

and consequently, since |λi/λn| < 1, i = 1, . . . , n − 1 ,

   A^t z0 = λn^t αn { wn + o(1) }   (t → ∞).

This implies

   zt = λn^t αn { wn + o(1) } / ( |λn^t αn| ‖wn + o(1)‖2 ) = σt wn + o(1) ,   σt := λn^t αn / |λn^t αn| ,

i. e., the iterates zt converge to span{wn} . Further, since αn ≠ 0 , it follows that

   λt = (Azt)r / ztr = (A^{t+1}z0)r / (A^t z0)r
      = λn^{t+1} { αn (wn)r + Σ_{i=1}^{n−1} αi (λi/λn)^{t+1} (wi)r } / ( λn^t { αn (wn)r + Σ_{i=1}^{n−1} αi (λi/λn)^t (wi)r } )
      = λn + O( |λn−1/λn|^t )   (t → ∞).

This completes the proof. Q.E.D.


For Hermitian matrices, one obtains improved eigenvalue approximations using the
“Rayleigh quotient”:

λt := (Az t , z t )2 , z t 2 = 1. (4.1.6)

In this case {w1 , . . . , wn } can be chosen as ONB of eigenvectors such that there holds
n
i=1 |αi | λi
2 2t+1
(At+1 z 0 , At z 0 )
t
λ = = 
At z 0 2 i=1 |αi | λi
n 2 2t
#   2t+1 $
λ2t+1 n−1
|αn |2 + i=1 |αi |2 λλni  λ 2t 
 = λmax + O  n−1  .
n
=    
|αn |2 + n−1 2 λi 2t λmax
λ2tn i=1 |αi | λn

Here, the convergence of the eigenvalue approximations is twice as fast as in the non-
Hermitian case.

Remark 4.1: The convergence of the power method is the better the more the modulus-
wise largest eigenvalue λn is separated from the other eigenvalues. The proof of conver-
gence can be extended to the case of diagonalizable matrices with multiple “maximum”
eigenvalue for which |λn | = |λi | necessarily implies λn = λi . For even more general,
non-diagonalizable matrices convergence is not guaranteed. The proof of Theorem 4.1
suggests that the constant in the convergence estimate (4.1.5) depends on the dimension
n and may therefore be very large for large matrices. The proof that this is actually not
the case is posed as an exercise.

4.1.2 The “Inverse Iteration”

For practical computation the power method is of only limited value, as its convergence is
very slow in general if |λn−1 /λn | ∼ 1. Further, it only delivers the “largest” eigenvalue. In
most practical applications the “smallest” eigenvalue is wanted, i. e., that which is closest
to zero. This is accomplished by the so-called “Inverse Iteration” of Wielandt2 . Here, it
is assumed that one already knows a good approximation λ̃ for an eigenvalue λk of the
matrix A to be computed (obtained by other methods, e. g., Lemma of Gershgorin, etc.)
such that

|\lambda_k - \tilde\lambda| \ll |\lambda_i - \tilde\lambda|, \quad i = 1,\dots,n,\ i \neq k.   (4.1.7)

In case \tilde\lambda \neq \lambda_k the matrix (A - \tilde\lambda I)^{-1} has the eigenvalues \mu_i = (\lambda_i - \tilde\lambda)^{-1}, i = 1,\dots,n,


and there holds

|\mu_k| = \Big|\frac{1}{\lambda_k - \tilde\lambda}\Big| \gg \Big|\frac{1}{\lambda_i - \tilde\lambda}\Big| = |\mu_i|, \quad i = 1,\dots,n,\ i \neq k.   (4.1.8)

Definition 4.2: The “Inverse Iteration” consists in the application of the power method
to the matrix (A − λ̃I)−1 , where the so-called “shift” λ̃ is taken as an approximation to
the desired eigenvalue λk . Starting from an initial point z 0 the method generates iterates
z t as solutions of the linear systems

(A - \tilde\lambda I)\tilde z^t = z^{t-1}, \qquad z^t = \|\tilde z^t\|^{-1}\tilde z^t, \quad t = 1, 2, \dots.   (4.1.9)

The corresponding eigenvalue approximation is determined by

\mu^t := \frac{[(A - \tilde\lambda I)^{-1} z^t]_r}{z^t_r}, \qquad r \in \{1,\dots,n\}:\ |z^t_r| = \max_{j=1,\dots,n} |z^t_j|,   (4.1.10)

or, in the Hermitian case, by the Rayleigh quotient

μt := ((A − λ̃I)−1 z t , z t )2 . (4.1.11)

2
Helmut Wielandt (1910–2001): German mathematician; Prof. in Mainz (1946-1951) and Tübingen
(1951-1977); contributions to Group Theory, Linear Algebra and Matrix Theory.

In the evaluation of the eigenvalue approximation in (4.1.10) and (4.1.11) the not yet
known vector z̃ t+1 := (A − λ̃I)−1 z t is needed. Its computation requires to carry the
iteration, possibly unnecessarily, one step further by solving the corresponding linear
system (A − λ̃I)z̃ t+1 = z t . This can be avoided by using the formulas

\lambda^t := \frac{(A z^t)_r}{z^t_r}, \quad \text{or in the symmetric case} \quad \lambda^t := (A z^t, z^t)_2,   (4.1.12)

instead. This is justified since z t is supposed to be an approximation to an eigenvector


of (A − λ̃I)−1 corresponding to the eigenvalue μk , which is also an eigenvector of A
corresponding to the desired eigenvalue λk .
In virtue of the above result for the simple power method, for any diagonalizable
matrix A the “Inverse Iteration” delivers any eigenvalue, for which a sufficiently accurate
approximation is known. There holds the error estimate

\mu^t = \mu_k + O\Big(\Big|\frac{\mu_{k-1}}{\mu_k}\Big|^t\Big) \quad (t\to\infty),   (4.1.13)

where μk−1 is the eigenvalue of (A − λ̃I)−1 closest to the “maximum” eigenvalue μk .


From this, we infer

\mu^t = \frac{1}{\lambda_k - \tilde\lambda} + O\Big(\Big|\frac{\lambda_k - \tilde\lambda}{\lambda_{k-1} - \tilde\lambda}\Big|^t\Big) \quad (t\to\infty),   (4.1.14)

where \lambda_{k-1} := 1/\mu_{k-1} + \tilde\lambda, and eventually,

\lambda^t_k := \frac{1}{\mu^t} + \tilde\lambda = \lambda_k + O\Big(\Big|\frac{\lambda_k - \tilde\lambda}{\lambda_{k-1} - \tilde\lambda}\Big|^t\Big) \quad (t\to\infty).   (4.1.15)

We collect the above results for the special case of the computation of the “smallest”
eigenvalue λmin = λ1 of a diagonalizable matrix A in the following theorem.

Theorem 4.2 (Inverse Iteration): Let the matrix A be diagonalizable and suppose
that the eigenvalue with smallest modulus is separated from the other eigenvalues, i. e.,
|λ1 | < |λi |, i = 2, . . . , n. Further, let the starting vector z 0 have a nontrivial component
with respect to the eigenvector w 1 . Then, for the “Inverse Iteration” (with shift λ̃ := 0)
there are numbers σt ∈ C, |σt | = 1 such that

\|z^t - \sigma_t w^1\| \to 0 \quad (t\to\infty),   (4.1.16)

and the “smallest” eigenvalue λmin = λ1 of A is approximated with convergence speed,


in the general non-Hermitian case using (4.1.10),

\lambda^t = \lambda_{\min} + O\Big(\Big|\frac{\lambda_{\min}}{\lambda_2}\Big|^t\Big) \quad (t\to\infty),   (4.1.17)

and with squared power 2t in the Hermitian case using (4.1.11).

Remark 4.2: The inverse iteration allows the approximation of any eigenvalue of A for
which a sufficiently good approximation is known, where “sufficiently good” depends on
the separation of the desired eigenvalue of A from the other ones. The price to be paid
for this flexibility is that each iteration step requires the solution of the nearly singular
system (A − λ̃I)z t = z t−1 . This means that the better the approximation λ̃ ≈ λk , i. e.,
the faster the convergence of the Inverse Iteration is, the more expensive is each iteration
step. This effect is further amplified if the Inverse Iteration is used with “dynamic shift”
λ̃ := λtk , in order to speed up its convergence.

The solution of the nearly singular linear systems (4.1.9),

(A − λ̃I)z̃ t = z t−1 ,

can be accomplished, for moderately sized matrices, by using an a priori computed LR or


Cholesky (in the Hermitian case) decomposition and, for large matrices, by the GMRES
or the BiCGstab method and the CG method (in the Hermitian case). The matrix A− λ̃I
is very ill-conditioned with condition number

\mathrm{cond}_2(A - \tilde\lambda I) = \frac{|\lambda_{\max}(A - \tilde\lambda I)|}{|\lambda_{\min}(A - \tilde\lambda I)|} = \frac{\max_{j=1,\dots,n} |\lambda_j - \tilde\lambda|}{|\lambda_k - \tilde\lambda|} \gg 1.

Therefore, preconditioning is mandatory. However, only the “direction” of the iterate


z̃ t is needed, which is a much better conditioned task, almost independent of the quality
of the approximation λ̃ to λk . In this case a good preconditioning is obtained by the
incomplete LR (or the incomplete Cholesky) decomposition.
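For moderately sized matrices the following sketch makes this concrete: the shifted matrix is factored once by an LU (LR) decomposition, which is then reused in every step, and the eigenvalue approximation is taken as in (4.1.12). The routines scipy.linalg.lu_factor/lu_solve, the shift and the tolerances are choices made only for this illustration; in the general non-Hermitian case the component quotient of (4.1.12) would replace the Rayleigh quotient used here.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_iteration(A, shift, z, maxit=100, tol=1e-12):
    # Inverse iteration (4.1.9): approximates the eigenvalue of A closest to
    # 'shift'; the nearly singular matrix (A - shift*I) is factored only once.
    n = A.shape[0]
    lu_piv = lu_factor(A - shift * np.eye(n))     # LR (LU) decomposition with pivoting
    z = z / np.linalg.norm(z)
    lam_old = None
    for t in range(maxit):
        z_tilde = lu_solve(lu_piv, z)             # solve (A - shift*I) z~ = z^{t-1}
        z = z_tilde / np.linalg.norm(z_tilde)     # normalization
        lam = np.vdot(z, A @ z)                   # symmetric-case variant of (4.1.12)
        if lam_old is not None and abs(lam - lam_old) < tol * abs(lam):
            break
        lam_old = lam
    return lam, z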

Example 4.1: We want to apply the considered methods to the eigenvalue problem of
the model matrix from Section 3.4. The determination of vibration mode and frequency of
a membrane over the square domain Ω = (0, 1)2 (drum) leads to the eigenvalue problem
of the Laplace operator

-\frac{\partial^2 w}{\partial x^2}(x,y) - \frac{\partial^2 w}{\partial y^2}(x,y) = \mu\, w(x,y) \quad \text{for } (x,y) \in \Omega, \qquad w(x,y) = 0 \quad \text{for } (x,y) \in \partial\Omega.   (4.1.18)

This eigenvalue problem in function space shares several properties with that of a sym-
metric, positive definite matrix in Rn . First, there are only countably many real, positive
eigenvalues with finite (geometric) multiplicities. The corresponding eigenspaces span
the whole space L2 (Ω) . The smallest of these eigenvalues, μmin > 0, and the associated
eigenfunction, wmin , describe the fundamental tone and the fundamental oscillation mode
of the drum. The discretization by the 5-point difference operator leads to the matrix
eigenvalue problem

Az = λz, λ = h2 μ, (4.1.19)

with the same block-tridiagonal matrix A as occurring in the corresponding discretization



of the boundary value problem discussed in Section 3.4. Using the notation from above,
the eigenvalues of A are explicitly given by

λkl = 4 − 2(cos(khπ) + cos(lhπ)) , k, l = 1, . . . , m.

We are interested in the smallest eigenvalue λmin of A , which by h−2 λmin ≈ μmin yields
an approximation to the smallest eigenvalue of problem (4.1.18). For λmin and the next
eigenvalue λ∗ > λmin there holds

\lambda_{\min} = 4 - 4\cos(h\pi) = 2\pi^2 h^2 + O(h^4), \qquad \lambda_* = 4 - 2(\cos(2h\pi) + \cos(h\pi)) = 5\pi^2 h^2 + O(h^4).

For computing λmin , we may use the inverse iteration with shift λ = 0 . This requires in
each iteration the solution of a linear system like

Az t = z t−1 . (4.1.20)

For the corresponding eigenvalue approximation

λt = (z̃ t+1 , z t )2 , (4.1.21)

there holds the convergence estimate

|\lambda^t - \lambda_{\min}| \approx \Big(\frac{\lambda_{\min}}{\lambda_*}\Big)^{2t} \approx \Big(\frac{2}{5}\Big)^{2t},   (4.1.22)
i. e., the convergence is independent of the mesh size h or the dimension n = m2 ≈ h−2
of A. However, in view of the relation μmin = h−2 λmin achieving a prescribed accuracy
in the approximation of μmin requires the scaling of the tolerance in computing λmin
by a factor h2 , which introduces a logarithmic h-dependence in the work count of the
algorithm,

t(\varepsilon) \approx \frac{\log(\varepsilon h^2)}{\log(2/5)} \approx \log(n).   (4.1.23)

This strategy for computing \mu_{\min} is not very efficient if the subproblems (4.1.20) are solved to full accuracy by the PCG method in each step. For reducing the work, one may use an iteration-dependent stopping criterion for the inner PCG iteration, by which its accuracy is balanced against that of the outer inverse iteration.
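The numbers above can be checked with a small, hedged NumPy experiment; the mesh size and the iteration count are arbitrary choices, and a production code would reuse a factorization or an (inner) iterative solver for (4.1.20) instead of calling a dense solver in every step.

import numpy as np

m = 31                                   # interior grid points per direction
h = 1.0 / (m + 1)                        # mesh size

# 5-point model matrix of Section 3.4 (dimension n = m^2)
T = 4 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = (np.kron(np.eye(m), T)
     - np.kron(np.eye(m, k=1), np.eye(m))
     - np.kron(np.eye(m, k=-1), np.eye(m)))

# inverse iteration with shift 0, i.e. repeated solution of (4.1.20)
z = np.ones(m * m)
z = z / np.linalg.norm(z)
for t in range(50):
    z_tilde = np.linalg.solve(A, z)
    z = z_tilde / np.linalg.norm(z_tilde)
lam_min = z @ (A @ z)                    # Rayleigh quotient

print(lam_min)                           # computed smallest eigenvalue of A
print(4 - 4 * np.cos(np.pi * h))         # exact value  4 - 4 cos(h*pi)
print(2 * np.pi**2 * h**2)               # leading-order term  2 pi^2 h^2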

Remark 4.3: Another type of iterative methods for computing single eigenvalues of sym-
metric or nonsymmetric large-scale matrices is the “Jacobi-Davidson method” (Davidson
[30]), which is based on the concept of defect correction. This method will not be dis-
cussed in these lecture notes, we rather refer to the literature, e. g., Crouzeix et al. [29]
and Sleijpen & Van der Vorst [48].

4.2 Methods for the full eigenvalue problem

In this section, we consider iterative methods for solving the full eigenvalue problem of
an arbitrary matrix A ∈ Rn×n . Since these methods use successive factorizations of
matrices, which for general full matrices have arithmetic complexity O(n3 ), they are only
applied to matrices with special sparsity pattern such as general Hessenberg or symmetric
tridiagonal matrices. In the case of a general matrix, therefore, at first a reduction to such
special structure has to be performed (e. g., by applying Householder transformations as
discussed in Section 2.5.1). As application of such a method, we discuss the computation
of the singular value decomposition of a general matrix. In order to avoid confusion
between “indexing” and “exponentiation”, in the following, we use the notation A(t)
instead of the short version At for elements in a sequence of matrices.

4.2.1 The LR and QR method

I) The “LR method” of Rutishauser3 (1958), starting from some initial guess A(1) := A,
generates a sequence of matrices A(t) , t ∈ N, by the prescription

A(t) = L(t) R(t) (LR decomposition), A(t+1) := R(t) L(t) . (4.2.24)

Since
A^{(t+1)} = R^{(t)} L^{(t)} = L^{(t)-1} L^{(t)} R^{(t)} L^{(t)} = L^{(t)-1} A^{(t)} L^{(t)},
all iterates A(t) are similar to A and therefore have the same eigenvalues as A. Under
certain conditions on A , one can show that, with the eigenvalues λi of A :
\lim_{t\to\infty} A^{(t)} = \lim_{t\to\infty} R^{(t)} = \begin{bmatrix} \lambda_1 & & * \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}, \qquad \lim_{t\to\infty} L^{(t)} = I.   (4.2.25)
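A naive sketch of the prescription (4.2.24) might look as follows; the LR (LU) decomposition is computed without pivoting by a simple elimination loop, which, as discussed below, may break down if the decomposition does not exist, and no use is made of any special (e.g. Hessenberg) structure.

import numpy as np

def lr_decomposition(A):
    # LR (LU) decomposition A = L R without pivoting; may break down.
    n = A.shape[0]
    L, R = np.eye(n), A.astype(float)
    for k in range(n - 1):
        if R[k, k] == 0.0:
            raise ZeroDivisionError("LR decomposition does not exist")
        L[k+1:, k] = R[k+1:, k] / R[k, k]
        R[k+1:, k:] -= np.outer(L[k+1:, k], R[k, k:])
    return L, R

def lr_method(A, steps=200):
    # LR method (4.2.24): A^(t) = L^(t) R^(t), A^(t+1) := R^(t) L^(t); all
    # iterates are similar to A, so the diagonal approaches the eigenvalues
    # under the conditions leading to (4.2.25).
    A = np.array(A, dtype=float)
    for t in range(steps):
        L, R = lr_decomposition(A)
        A = R @ L
    return np.diag(A)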

The LR method requires in each step the computation of an LR decomposition and is


consequently by far too costly for general full matrices. For Hessenberg matrices the work
is acceptable. The most severe disadvantage of the LR method is the necessary existence
of the LR decompositions A(t) = L(t) R(t) . If only a decomposition P (t) A(t) = L(t) R(t) exists with a permutation matrix P (t) ≠ I, the method may not converge. This problem
is avoided by the so-called “QR method”.
II) The “QR method” of Francis4 (1961) is considered as the currently most efficient
method for solving the full eigenvalue problem of Hessenberg matrices. Starting from

3
Heinz Rutishauser (1918–1970): Swiss mathematician and computer scientist; since 1962 Prof. at
ETH Zurich; contributions to Numerical Linear Algebra (LR method: “Solution of eigenvalue problems
with the LR transformation”, Appl. Math. Ser. nat. Bur. Stand. 49, 47-81(1958).) and Analysis as well
as to the foundation of Computer Arithmetic.
4
J. F. G. Francis: “The QR transformation. A unitary analogue to the LR transformation”, Computer
J. 4, 265-271 (1961/1962).

some initial guess A(1) = A a sequence of matrices A(t) , t ∈ N, is generated by the


prescription

A(t) = Q(t) R(t) (QR decomposition), A(t+1) := R(t) Q(t) , (4.2.26)

where Q(t) is unitary and R(t) is an upper triangular matrix with positive diagonal
elements (in order to ensure its uniqueness). The QR decomposition can be obtained,
e. g., by employing Householder transformations. Because of the high costs of this method
for a general full matrix the QR method is economical only for Hessenberg matrices or,
in the symmetric case, only for tridiagonal matrices. Since

A(t+1) = R(t) Q(t) = Q(t)T Q(t) R(t) Q(t) = Q(t)T A(t) Q(t) ,

all iterates A(t) are similar to A and therefore have the same eigenvalues as A. The
proof of convergence of the QR method will use the following auxiliary lemma.

Lemma 4.1: Let E (t) ∈ Rn×n , t ∈ N, be regular matrices, which satisfy limt→∞ E (t) = I
and possess the QR decompositions E (t) = Q(t) R(t) with rii > 0. Then, there holds

\lim_{t\to\infty} Q^{(t)} = I = \lim_{t\to\infty} R^{(t)}.   (4.2.27)

Proof. Since

\|E^{(t)} - I\|_2 = \|Q^{(t)} R^{(t)} - Q^{(t)} Q^{(t)T}\|_2 = \|Q^{(t)}(R^{(t)} - Q^{(t)T})\|_2 = \|R^{(t)} - Q^{(t)T}\|_2 \to 0,

it follows that q^{(t)}_{jk} \to 0\ (t\to\infty) for j < k. In view of

I = Q^{(t)T} Q^{(t)} = \begin{bmatrix} \ddots & & * \\ & \ddots & \\ \to 0 & & \ddots \end{bmatrix} \begin{bmatrix} \ddots & & \to 0 \\ & \ddots & \\ * & & \ddots \end{bmatrix},

we conclude that

q^{(t)}_{jj} \to \pm 1, \qquad q^{(t)}_{jk} \to 0 \quad (t\to\infty),\ j > k.

Hence Q^{(t)} \to \mathrm{diag}(\pm 1)\ (t\to\infty). Since

Q^{(t)} R^{(t)} = E^{(t)} \to I \quad (t\to\infty), \qquad r^{(t)}_{jj} > 0,

also \lim_{t\to\infty} Q^{(t)} = I. Then,

\lim_{t\to\infty} R^{(t)} = \lim_{t\to\infty} Q^{(t)T} E^{(t)} = I,

what was to be shown. Q.E.D.



Theorem 4.3 (QR method): Let the eigenvalues of the matrix A ∈ Rn×n be separated
with respect to their modulus, i. e., |\lambda_1| > |\lambda_2| > \dots > |\lambda_n|. Then, the matrices A^{(t)} = (a^{(t)}_{jk})_{j,k=1,\dots,n} generated by the QR method converge like

\{\lim_{t\to\infty} a^{(t)}_{jj} \mid j = 1,\dots,n\} = \{\lambda_1,\dots,\lambda_n\}.   (4.2.28)

Proof. The separation assumption implies that all eigenvalues of the matrix A are
simple. There holds

A(t) = R(t−1) Q(t−1) = Q(t−1)T Q(t−1) R(t−1) Q(t−1) = Q(t−1)T A(t−1) Q(t−1)
(4.2.29)
= . . . = [Q(1) . . . Q(t−1) ]T A[Q(1) . . . Q(t−1) ] =: P (t−1)T AP (t−1) .

The normalized eigenvectors w^i, \|w^i\| = 1, associated with the eigenvalues \lambda_i are linearly


independent. Hence, the matrix W = [w1 , . . . , wn ] is regular and there holds the relation
AW = W Λ with the diagonal matrix Λ = diag(λi ). Consequently,
A = W ΛW −1.

Let QR = W be a QR decomposition of W and LS = P W −1 an LR decomposition of


P W −1 (P an appropriate permutation matrix). In the following, we consider the simple
case that P = I . There holds

At = [W ΛW −1]t = W Λt W −1 = [QR]Λt [LS] = QR[Λt LΛ−t ]Λt S


= QR \begin{bmatrix} 1 & & 0 \\ & \ddots & \\ l_{jk}(\lambda_j/\lambda_k)^t & & 1 \end{bmatrix} \Lambda^t S = QR\,[I + N^{(t)}]\,\Lambda^t S = Q\,[R + R N^{(t)}]\,\Lambda^t S,

and, consequently,

At = Q[I + RN (t) R−1 ]RΛt S. (4.2.30)

By the assumption on the separation of the eigenvalues λi , we have |λj /λk | < 1, j > k ,
which yields
N (t) → 0 , RN (t) R−1 → 0 (t → ∞).
Then, for the (uniquely determined) QR decomposition Q̃(t) R̃(t) = I + RN (t) R−1 with
\tilde r^{(t)}_{ii} > 0, Lemma 4.1 implies

Q̃(t) → I , R̃(t) → I (t → ∞).

Further, recalling (4.2.30),

At = Q[I + RN (t) R−1 ]RΛt S = Q[Q̃(t) R̃(t) ]RΛt S = [QQ̃(t) ][R̃(t) RΛt S]

is obviously a QR decomposition of At (but with not necessarily positive diagonal ele-


ments of R). By (4.2.29) and Q(t) R(t) = A(t) there holds

\underbrace{[Q^{(1)} \cdots Q^{(t)}]}_{= P^{(t)}}\, \underbrace{[R^{(t)} \cdots R^{(1)}]}_{=: S^{(t)}} = \underbrace{[Q^{(1)} \cdots Q^{(t-1)}]}_{= P^{(t-1)}}\, A^{(t)}\, \underbrace{[R^{(t-1)} \cdots R^{(1)}]}_{=: S^{(t-1)}} = P^{(t-1)}\,[P^{(t-1)T} A P^{(t-1)}]\, S^{(t-1)} = A\, P^{(t-1)} S^{(t-1)},

and observing P (1) S (1) = A,

P (t) S (t) = AP (t−1) S (t−1) = . . . = At−1 P (1) S (1) = At . (4.2.31)

This yields another QR decomposition of At , i. e.,

[QQ̃(t) ][R̃(t) RΛt S] = At = P (t) S (t) .

Since the QR decomposition of a matrix is unique up to the scaling of the column vectors
of the unitary matrix Q , there must hold

P (t) = QQ̃(t) D (t) =: QT (t) ,

with certain diagonal matrices D (t) = diag(±1). Then, recalling again the relation
(4.2.29) and observing that

A = W ΛW −1 = QRΛ[QR]−1 = QRΛR−1 QT ,

we conclude that

A^{(t+1)} = P^{(t)T} A P^{(t)} = [Q T^{(t)}]^T A\, Q T^{(t)} = T^{(t)T} Q^T [Q R \Lambda R^{-1} Q^T] Q T^{(t)} = T^{(t)T} R \Lambda R^{-1} T^{(t)}

= T^{(t)T} \begin{bmatrix} \lambda_1 & & * \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} T^{(t)} = D^{(t)} \tilde Q^{(t)T} \begin{bmatrix} \lambda_1 & & * \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} \tilde Q^{(t)} D^{(t)}.

Since Q̃(t) → I (t → ∞) and D (t) D (t) = I, we obtain


D^{(t)} A^{(t+1)} D^{(t)} \to \begin{bmatrix} \lambda_1 & & * \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} \quad (t\to\infty).

If W^{-1} does not possess an LR decomposition, the eigenvalues \lambda_i do not appear ordered according to their modulus. Q.E.D.

Remark 4.4: The separation assumption |λ1 | > |λ2 | > . . . > |λn | means that all eigen-
values of A are simple, which implies that A is necessarily diagonalizable. For more
general matrices the convergence of the QR method is not guaranteed. However, conver-
gence in a suitable sense can be shown in case of multiple eigenvalues (such as in the model
problem of Section 3.4). For a more detailed discussion, we refer to the literature, e. g.,
Deuflhard & Hohmann [33], Stoer & Bulirsch [50], Golub & van Loan [36], and Parlett [44].

The speed of convergence of the QR method, i. e., the convergence of the off-diagonal
elements in A^{(t)} to zero, is determined by the size of the quotients

\Big|\frac{\lambda_j}{\lambda_k}\Big| < 1, \quad j > k.

The convergence is the faster the better the eigenvalues of A are modulus-wise separated.
This suggests using the QR algorithm with a “shift” σ for the matrix A − σI, such that

\Big|\frac{\lambda_j - \sigma}{\lambda_k - \sigma}\Big| \ll \Big|\frac{\lambda_j}{\lambda_k}\Big| < 1,

for the most interesting eigenvalues. The QR method with (dynamic) shift starts from
some initial guess A(1) = A and constructs a sequence of matrices A(t) , t ∈ N, by the
prescription

A(t) − σt I = Q(t) R(t) (QR decomposition), A(t+1) := R(t) Q(t) + σt I, (4.2.32)

This algorithm again produces a sequence of similar matrices:

A^{(t+1)} = R^{(t)} Q^{(t)} + \sigma_t I = Q^{(t)T} Q^{(t)} R^{(t)} Q^{(t)} + \sigma_t I = Q^{(t)T}[A^{(t)} - \sigma_t I]\,Q^{(t)} + \sigma_t I = Q^{(t)T} A^{(t)} Q^{(t)}.   (4.2.33)

For this algorithm a modified version of the proof of Theorem 4.3 yields a convergence
estimate

|a^{(t)}_{jk}| \leq c\,\Big|\frac{\lambda_j - \sigma_1}{\lambda_k - \sigma_1}\Big| \cdots \Big|\frac{\lambda_j - \sigma_t}{\lambda_k - \sigma_t}\Big|, \quad j > k,   (4.2.34)

for the lower off-diagonal elements of the iterates A^{(t)} = (a^{(t)}_{jk})^n_{j,k=1}.
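A compact sketch of the shifted iteration (4.2.32) reads as follows; the shift \sigma_t = a^{(t)}_{nn} used here is only one possible choice, the matrix is not pre-reduced to Hessenberg form, and no deflation of converged eigenvalues is performed, so the snippet illustrates the prescription rather than a production algorithm.

import numpy as np

def qr_method_shifted(A, steps=200):
    # Shifted QR method (4.2.32):
    #   A^(t) - sigma_t I = Q^(t) R^(t),   A^(t+1) := R^(t) Q^(t) + sigma_t I.
    # By (4.2.33) every iterate is unitarily similar to A.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    I = np.eye(n)
    for t in range(steps):
        sigma = A[n - 1, n - 1]              # simple choice: last diagonal entry
        Q, R = np.linalg.qr(A - sigma * I)   # QR decomposition
        A = R @ Q + sigma * I
    return np.diag(A), A

For real matrices with complex eigenvalues the iteration can only converge to a block triangular form with 2x2 blocks on the diagonal; practical codes combine a Wilkinson-type shift with deflation as soon as a subdiagonal entry becomes negligible.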

Remark 4.5: For positive definite matrices the QR method converges twice as fast as the corresponding LR method, but requires about twice as much work in each iteration. Under certain structural assumptions on the matrix A, one can show that the QR method with varying shifts converges with quadratic order for Hermitian tridiagonal matrices and even with cubic order for unitary Hessenberg matrices (see Wang & Gragg [56]),

|\lambda^{(t)} - \lambda| \leq c\,|\lambda^{(t-1)} - \lambda|^3.



As with the LR method, for economy reasons the QR method is also applied only to pre-reduced matrices for which the computation of the QR decomposition is of acceptable cost, e. g., Hessenberg matrices, symmetric tridiagonal matrices or more general band matrices with bandwidth 2m + 1 ≪ n = m^2 (e. g., the model matrix considered in Section 3.4).
This is justified by the following observation.

Lemma 4.2: If A is a Hessenberg matrix (or a symmetric 2m + 1-band matrix), then


the same holds true for the matrices A(t) generated by the QR method.

Proof. The proof is posed as exercise. Q.E.D.

4.2.2 Computation of the singular value decomposition

The numerically stable computation of the singular value decomposition (SVD) is rather
costly. For more details, we refer to the literature, e. g., the book by Golub & van Loan
[36]. The SVD of a matrix A ∈ Cn×k is usually computed by a two-step procedure. In the
first step, the matrix is reduced to a bidiagonal matrix. This requires O(kn2 ) operations,
assuming that k ≤ n . The second step is to compute the SVD of the bidiagonal matrix.
This step needs an iterative method since the problem to be solved is generically nonlinear.
For fixed accuracy requirements (e. g., round-off error level) this takes O(n) iterations,
each costing O(n) operations. Thus, the first step is more expensive and the overall
cost is O(kn2 ) operations (see Trefethen & Bau [54]). The first step can be done using
Householder reflections for a cost of O(kn2 + n3 ) operations, assuming that only the
singular values are needed and not the singular vectors.
The second step can then very efficiently be done by the QR algorithm. The LAPACK
subroutine DBDSQR[9] implements this iterative method, with some modifications to
cover the case where the singular values are very small. Together with a first step using
Householder reflections and, if appropriate, QR decomposition, this forms the LAPACK
DGESVD[10] routine for the computation of the singular value decomposition.
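In practice these two steps are rarely coded by hand, since they are available through LAPACK and its wrappers. As a small illustration (not part of these notes' own software), SciPy exposes the choice of the LAPACK driver, so that the QR-iteration based routine xGESVD can be selected explicitly; the test matrix is arbitrary.

import numpy as np
from scipy.linalg import svd

A = np.random.rand(300, 120)                             # tall test matrix, k <= n
U, s, Vh = svd(A, lapack_driver='gesvd')                 # bidiagonalization + QR iteration
s_dc = svd(A, compute_uv=False, lapack_driver='gesdd')   # divide-and-conquer variant
print(np.allclose(s, s_dc))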
If the matrix A is very large, i. e., n ≈ 10^4 - 10^8, the method described so far for
computing the SVD is too expensive. In this situation, particularly if A ∈ Cn×n is square
and regular, the matrix is first reduced to smaller dimension,

A → A(m) = Q(m)T AQ(m) ∈ Cm×m ,

with m  n, by using, e. g., the Arnoldi process described below in Section 4.3.1, and
then the above method is applied to this reduced matrix. For an appropriate choice
of the orthonormal transformation matrix Q(m) ∈ Cn×m the singular values of A(m)
are approximations of those of A , especially the “largest” ones (by modulus). If one is
interested in the “smallest” singular values of A , what is typically the case in applications,
the dimensional reduction process has to be applied to the inverse matrix A−1 .

4.3 Krylov space methods

“Krylov space methods” for solving eigenvalue problems follow essentially the same idea
as in the case of the solution of linear systems. The original high-dimensional problem
is reduced to smaller dimension by applying the Galerkin approximation in appropriate
subspaces, e. g., so-called “Krylov spaces”, which are successively constructed using the
given matrix and sometimes also its transpose. The work per iteration should amount to
about one matrix-vector multiplication. We will consider the two most popular variants
of such methods, the “Arnoldi5 method” for general, not necessarily Hermitian matrices,
and its specialization for Hermitian matrices, the “Lanczos6 method”.
First, we introduce the general concept of such a “model reduction” by “Galerkin
approximation”. Consider a general eigenvalue problem

Az = λz, (4.3.35)

with a high-dimensional matrix A ∈ Cn×n, n ≥ 10^4, which may have resulted from the discretization of the eigenvalue problem of a partial differential operator. This eigenvalue problem can equivalently be written in variational form as

z ∈ Cn , λ ∈ C : (Az, y)2 = λ(z, y)2 ∀y ∈ Cn . (4.3.36)

Let Km = span{q 1 , . . . , q m } be an appropriately chosen subspace of Cn of smaller di-


mension dim Km = m ≪ n. Then, the n-dimensional eigenvalue problem (4.3.36) is
approximated by the m-dimensional “Galerkin eigenvalue problem”

z ∈ Km , λ ∈ C : (Az, y)2 = λ(z, y)2 ∀y ∈ Km . (4.3.37)


Expanding the eigenvector z \in K_m with respect to the given basis, z = \sum_{j=1}^m \alpha_j q^j, the Galerkin system takes the form

\sum_{j=1}^m \alpha_j (A q^j, q^i)_2 = \lambda \sum_{j=1}^m \alpha_j (q^j, q^i)_2, \quad i = 1,\dots,m,   (4.3.38)

5
Walter Edwin Arnoldi (1917–1995): US-American engineer; graduated in Mechanical Engineering
at the Stevens Institute of Technology in 1937; worked at United Aircraft Corp. from 1939 to 1977;
main research interests included modelling vibrations, Acoustics and Aerodynamics of aircraft propellers;
mainly known for the “Arnoldi iteration”; the paper “The principle of minimized iterations in the solution
of the eigenvalue problem”, Quart. Appl. Math. 9, 17-29 (1951), is one of the most cited papers in
Numerical Linear Algebra.
6
Cornelius (Cornel) Lanczos (1893–1974): Hungarian mathematician and physicist; PhD in 1921 on
Relativity Theory; assistant to Albert Einstein 1928–1929; contributions to exact solutions of the Einstein
field equation; discovery of the fast Fourier transform (FFT) 1940; worked at the U.S. National Bureau
of Standards after 1949; invented the “Lanczos algorithm” for finding eigenvalues of large symmetric
matrices and the related conjugate gradient method; in 1952 he left the USA for the School of Theoretical
Physics at the Dublin Institute for Advanced Studies in Ireland, where he succeeded Schrödinger and
stayed until 1968; Lanczos was author of many classical text books.

Within the framework of Galerkin approximation this is usually written in compact form
as a generalized eigenvalue problem

Aα = λMα, (4.3.39)
for the vector \alpha = (\alpha_j)_{j=1}^m, involving the m \times m matrices A = \big((A q^j, q^i)_2\big)_{i,j=1}^m and M = \big((q^j, q^i)_2\big)_{i,j=1}^m.
In the following, we use another formulation. With the Cartesian representations of
the basis vectors q i = (qji )nj=1 the Galerkin eigenvalue problem (4.3.37) is written in the
form

\sum_{j=1}^m \alpha_j \sum_{k,l=1}^n a_{kl}\, q^j_k\, \bar q^i_l = \lambda \sum_{j=1}^m \alpha_j \sum_{k=1}^n q^j_k\, \bar q^i_k, \quad i = 1,\dots,m.   (4.3.40)

Then, using the matrix Q^{(m)} := [q^1, \dots, q^m] \in \mathbb{C}^{n\times m} and the vector \alpha = (\alpha_j)_{j=1}^m \in \mathbb{C}^m, this can be written in compact form as

Q̄(m)T AQ(m) α = λQ̄(m)T Q(m) α. (4.3.41)

If {q 1 , . . . , q m } were an ONB of Km this reduces to the normal eigenvalue problem

Q̄(m)T AQ(m) α = λα, (4.3.42)

of the reduced matrix H (m) := Q̄(m)T AQ(m) ∈ Cm×m . If the reduced matrix H (m) has a
particular structure, e. g., a Hessenberg matrix or a symmetric tridiagonal matrix, then,
the lower-dimensional eigenvalue problem (4.3.42) can efficiently be solved by the QR
method. Its eigenvalues may be considered as approximations to some of the dominant
eigenvalues of the original matrix A and are called “Ritz7 eigenvalues” of A . In view of
this preliminary consideration the “Krylov methods” consist in the following steps:

1. Choose an appropriate subspace Km ⊂ Cn, m ≪ n (a “Krylov space”), using the


matrix A and powers of it.
2. Construct an ONB {q 1 , . . . , q m } of Km by the stabilized version of the Gram-
Schmidt algorithm, and set Q(m) := [q 1 , . . . , q m ].
3. Form the matrix H (m) := Q̄(m)T AQ(m) , which then by construction is a Hessenberg
matrix or, in the Hermitian case, a Hermitian tridiagonal matrix.
4. Solve the eigenvalue problem of the reduced matrix H (m) ∈ Cm×m by the QR
method.
5. Take the eigenvalues of H (m) as approximations to the dominant (i. e., “largest”)
eigenvalues of A . If the “smallest” eigenvalues (i. e., those closest to the origin) are

7
Walter Ritz (1878–1909): Swiss physicist; Prof. in Zürich and Göttingen; contributions to Spectral
Theory in Nuclear Physics and Electromagnetism.

to be determined the whole process has to be applied to the inverse matrix A−1 ,
which possibly makes the construction of the subspace Km expensive.

Remark 4.6: In the above form the Krylov method for eigenvalue problems is analogous
to its version for (real) linear systems described in Section 3.3.3. Starting from the
variational form of the linear system

x ∈ Rn :
(Ax, y)2 = (b, y)2 ∀y ∈ Rn ,

we obtain the following reduced system for x_m = \sum_{j=1}^m \alpha_j q^j:

\sum_{j=1}^m \alpha_j \sum_{k,l=1}^n a_{kl}\, q^j_k\, q^i_l = \sum_{k=1}^n b_k\, q^i_k, \quad i = 1,\dots,m.

This is then equivalent to the m-dimensional algebraic system

Q(m)T AQ(m) α = Q(m)T b.

4.3.1 Lanczos and Arnoldi method

The “power method” for computing the largest eigenvalue of a matrix only uses the current iterate A^m q, m ≪ n, for some normalized starting vector q ∈ Cn, \|q\|_2 = 1, but ignores the information contained in the already obtained iterates \{q, Aq, \dots, A^{m-1} q\}.
This suggests to form the so-called “Krylov matrix”

Km = [q, Aq, A2 q, . . . , Am−1 q], 1 ≤ m ≤ n.

The columns of this matrix are not orthogonal. In fact, since At q converges to the
direction of the eigenvector corresponding to the largest (in modulus) eigenvalue of A ,
this matrix tends to be badly conditioned with increasing dimension m. Therefore, one
constructs an orthogonal basis by the Gram-Schmidt algorithm. This basis is expected to
yield good approximations of the eigenvectors corresponding to the m largest eigenvalues,
for the same reason that Am−1 q approximates the dominant eigenvector. However, in
this simplistic form the method is unstable due to the instability of the standard Gram-
Schmidt algorithm. Instead the “Arnoldi method” uses a stabilized version of the Gram-
Schmidt process to produce a sequence of orthonormal vectors, {q 1 , q 2 , q 3 , . . .} called
the “Arnoldi vectors”, such that for every m, the vectors {q 1 , . . . , q m } span the Krylov
subspace Km . For the following, we define the orthogonal projection operator

\mathrm{proj}_u(v) := \|u\|_2^{-2}\,(v, u)_2\, u,

which projects the vector v onto span{u}. With this notation the classical Gram-Schmidt
orthonormalization process uses the recurrence formulas:

q^1 = \|q\|_2^{-1} q, \qquad t = 2,\dots,m: \quad \tilde q^t = A^{t-1} q - \sum_{j=1}^{t-1} \mathrm{proj}_{q^j}(A^{t-1} q), \qquad q^t = \|\tilde q^t\|_2^{-1}\,\tilde q^t.   (4.3.43)

Here, the t-th step projects out the component of At−1 q in the directions of the already
determined orthonormal vectors {q 1 , . . . , q t−1 }. This algorithm is numerically unstable
due to round-off error accumulation. There is a simple modification, the so-called “mod-
ified Gram-Schmidt algorithm”, where the t-th step projects out the component of Aq t
in the directions of {q 1 , . . . , q t−1 }:

q^1 = \|q\|_2^{-1} q, \qquad t = 2,\dots,m: \quad \tilde q^t = A q^{t-1} - \sum_{j=1}^{t-1} \mathrm{proj}_{q^j}(A q^{t-1}), \qquad q^t = \|\tilde q^t\|_2^{-1}\,\tilde q^t.   (4.3.44)

Since q^t, \tilde q^t are aligned and \tilde q^t \perp K_{t-1}, we have

\|\tilde q^t\|_2 = (q^t, \tilde q^t)_2 = \Big(q^t,\ A q^{t-1} - \sum_{j=1}^{t-1} \mathrm{proj}_{q^j}(A q^{t-1})\Big)_2 = (q^t, A q^{t-1})_2.

Then, with the setting hi,t−1 := (Aq t−1 , q i )2 , from (4.3.44), we infer that


A q^{t-1} = \sum_{i=1}^{t} h_{i,t-1}\, q^i, \quad t = 2,\dots,m+1.   (4.3.45)

In practice the algorithm (4.3.44) is implemented in the following equivalent recursive


form:
q^1 = \|q\|_2^{-1} q, \qquad t = 2,\dots,m: \quad q^{t,1} = A q^{t-1}, \quad j = 1,\dots,t-1:\ q^{t,j+1} = q^{t,j} - \mathrm{proj}_{q^j}(q^{t,j}), \qquad q^t = \|q^{t,t}\|_2^{-1} q^{t,t}.   (4.3.46)

This algorithm gives the same result as the original formula (4.3.43) but introduces smaller
errors in finite-precision arithmetic. Its cost is asymptotically 2nm^2 a. op.

Definition 4.3 (Arnoldi algorithm): For a general matrix A ∈ Cn×n the Arnoldi
method determines a sequence of orthonormal vectors q t ∈ Cn , 1 ≤ t ≤ m  n
(“Arnoldi basis”), by applying the modified Gram-Schmidt method (4.3.46) to the basis
{q, Aq, . . . , Am−1 q} of the Krylov space Km :

Starting vector: q^1 = \|q\|_2^{-1} q.
Iterate for 2 \le t \le m: \quad q^{t,1} = A q^{t-1},
\qquad j = 1,\dots,t-1:\ h_{j,t} = (q^{t,j}, q^j)_2, \quad q^{t,j+1} = q^{t,j} - h_{j,t}\, q^j,
\qquad h_{t,t} = \|q^{t,t}\|_2, \quad q^t = h_{t,t}^{-1}\, q^{t,t}.

Let Q(m) denote the n×m-matrix formed by the first m Arnoldi vectors {q 1 , q 2 , . . . , q m },
and let H (m) be the (upper Hessenberg) m × m-matrix formed by the numbers hjk :
Q^{(m)} := [q^1, q^2, \dots, q^m], \qquad H^{(m)} = \begin{bmatrix} h_{11} & h_{12} & h_{13} & \dots & h_{1m} \\ h_{21} & h_{22} & h_{23} & \dots & h_{2m} \\ 0 & h_{32} & h_{33} & \dots & h_{3m} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & h_{m,m-1} & h_{mm} \end{bmatrix}.

The matrices Q(m) are orthonormal and in view of (4.3.45) satisfy (“Arnoldi relation”)

AQ(m) = Q(m) H (m) + hm+1,m [0, . . . , 0, q m+1]. (4.3.47)

Multiplying by Q̄(m)T from the left and observing Q̄(m)T Q(m) = I and Q̄(m)T q m+1 = 0 ,
we infer that
H (m) = Q̄(m)T AQ(m) . (4.3.48)

In the limit case m = n the matrix H (n) is similar to A and, therefore, has the same
eigenvalues. This suggests that even for m  n the eigenvalues of the reduced matrix
H (m) may be good approximations to some eigenvalues of A . When the algorithm stops
(in exact arithmetic) for some m < n by hm+1,m = 0, then the Krylov space Km is an
invariant subspace of the matrix A and the reduced matrix H (m) = Q̄(m)T AQ(m) has m
eigenvalues in common with A (exercise), i. e.,

σ(H (m) ) ⊂ σ(A).

The following lemma provides an a posteriori bound for the accuracy in approximating
eigenvalues of A by those of H (m) .

Lemma 4.3: Let \{\mu, w\} be an eigenpair of the Hessenberg matrix H^{(m)} and let v = Q^{(m)} w, so that \{\mu, v\} is an approximate eigenpair of A. Then, there holds

\|A v - \mu v\|_2 = |h_{m+1,m}|\, |w_m|,   (4.3.49)

where wm is the last component of the eigenvector w.


Proof. Multiplying in (4.3.47) by w yields

Av = AQ(m) w = Q(m) H (m) w + hm+1,m [0, . . . , 0, q m+1 ]w


= μQ(m) w + hm+1,m [0, . . . , 0, q m+1 ]w = μv + hm+1,m [0, . . . , 0, q m+1 ]w.

Consequently, observing \|q^{m+1}\|_2 = 1,

\|A v - \mu v\|_2 = |h_{m+1,m}|\, |w_m|,
which is the asserted identity. Q.E.D.

The relation (4.3.49) does not provide a priori information about the convergence of
the eigenvalues of H (m) against those of A for m → n , but in view of σ(H (n) ) = σ(A)
this is not the question. Instead, it allows for an a posteriori check on the basis of the
computed quantities h_{m+1,m} and w_m whether the obtained pair \{\mu, w\} is a reasonable
approximation.
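A minimal sketch of the Arnoldi process of Definition 4.3, including the evaluation of the a posteriori quantity |h_{m+1,m}| |w_m| from (4.3.49), may look as follows; the breakdown tolerance is an arbitrary choice, restarts are not implemented, and the matrix A, the starting vector q and the dimension m are placeholders.

import numpy as np

def arnoldi(A, q, m):
    # Arnoldi method (Definition 4.3): orthonormal basis Q of the Krylov space
    # and the extended Hessenberg matrix H with  A Q[:, :m] = Q[:, :m+1] H.
    n = A.shape[0]
    Q = np.zeros((n, m + 1), dtype=complex)
    H = np.zeros((m + 1, m), dtype=complex)
    Q[:, 0] = q / np.linalg.norm(q)
    for t in range(m):
        v = A @ Q[:, t]
        for j in range(t + 1):                 # modified Gram-Schmidt step
            H[j, t] = np.vdot(Q[:, j], v)
            v = v - H[j, t] * Q[:, j]
        beta = np.linalg.norm(v)
        H[t + 1, t] = beta
        if beta < 1e-14:                       # breakdown: Krylov space is A-invariant
            return Q[:, :t + 1], H[:t + 2, :t + 1]
        Q[:, t + 1] = v / beta
    return Q, H

# Ritz values and the a posteriori quantity of (4.3.49):
# Q, H = arnoldi(A, q, 100)
# mu, W = np.linalg.eig(H[:-1, :])             # eigenpairs of H^(m)
# res = np.abs(H[-1, -1]) * np.abs(W[-1, :])   # |h_{m+1,m}| |w_m| for each Ritz pair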

Remark 4.7: i) Typically, the Ritz eigenvalues converge to the extreme (“maximal”)
eigenvalues of A. If one is interested in the “smallest” eigenvalues, i. e., those which
are closest to zero, the method has to be applied to the inverse matrix A−1 , similar to
the approach used in the “Inverse Iteration”. In this case the main work goes into the
generation of the Krylov space Km = span{q, A−1 q, . . . , (A−1 )m−1 q}, which requires the
successive solution of linear systems,

v 0 := q, Av 1 = v 0 , ... Av m = v m−1 .

ii) Due to practical storage consideration, common implementations of Arnoldi methods


typically restart after some number of iterations. Theoretical results have shown that
convergence improves with an increase in the Krylov subspace dimension m. However,
an a priori value of m which would lead to optimal convergence is not known. Recently a
dynamic switching strategy has been proposed, which fluctuates the dimension m before
each restart and thus leads to acceleration of convergence.

Remark 4.8: The algorithm (4.3.46) can be used also for the stable orthonormalization
of a general basis {v 1 , . . . , v m } ⊂ Cn :

u^1 = \|v^1\|_2^{-1} v^1, \qquad t = 2,\dots,m: \quad u^{t,1} = v^t, \quad j = 1,\dots,t-1:\ u^{t,j+1} = u^{t,j} - \mathrm{proj}_{u^j}(u^{t,j}), \qquad u^t = \|u^{t,t}\|_2^{-1} u^{t,t}.   (4.3.50)

This “modified” Gram-Schmidt algorithm (with exact arithmetic) gives the same result
as its “classical” version (exercise)

u^1 = \|v^1\|_2^{-1} v^1, \qquad t = 2,\dots,m: \quad \tilde u^t = v^t - \sum_{j=1}^{t-1} \mathrm{proj}_{u^j}(v^t), \qquad u^t = \|\tilde u^t\|_2^{-1}\,\tilde u^t.   (4.3.51)

Both algorithms have the same arithmetic complexity (exercise). In each step a vector is
determined orthogonal to its preceding one and also orthogonal to any errors introduced
in the computation, which enhances stability. This is supported by the following stability
estimate for the resulting “orthonormal” matrix U = [u1 , . . . , um ]

\|U^T U - I\|_2 \leq \frac{c_1\,\mathrm{cond}_2(A)}{1 - c_2\,\mathrm{cond}_2(A)}\,\varepsilon.   (4.3.52)

The proof can be found in Björck & Paige [26].
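The difference between the two variants in finite-precision arithmetic can be observed with a small experiment; the ill-conditioned test matrix below is an arbitrary choice, and the printed quantity is the departure from orthogonality \|U^T U - I\|_2 appearing in (4.3.52).

import numpy as np

def gram_schmidt(V, modified=True):
    # Orthonormalizes the columns of V; modified=True corresponds to (4.3.50),
    # modified=False to the classical variant (4.3.51).
    n, m = V.shape
    U = np.zeros((n, m))
    for t in range(m):
        u = V[:, t].copy()
        for j in range(t):
            w = u if modified else V[:, t]      # project the updated / the original vector
            u = u - (U[:, j] @ w) * U[:, j]
        U[:, t] = u / np.linalg.norm(u)
    return U

n, m = 100, 10
V = np.array([[1.0 / (i + j + 1) for j in range(m)] for i in range(n)])  # ill-conditioned columns
for modified in (False, True):
    U = gram_schmidt(V, modified)
    print(modified, np.linalg.norm(U.T @ U - np.eye(m), 2))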



Remark 4.9: Other orthogonalization algorithms use Householder transformations or


Givens rotations. The algorithms using Householder transformations are more stable
than the stabilized Gram-Schmidt process. On the other hand, the Gram-Schmidt pro-
cess produces the t-th orthogonalized vector after the t-th iteration, while orthogonaliza-
tion using Householder reflections produces all the vectors only at the end. This makes
only the Gram-Schmidt process applicable for iterative methods like the Arnoldi itera-
tion. However, in Quantum Mechanics there are several orthogonalization schemes with
characteristics even better suited for applications than the Gram-Schmidt algorithm.

As in the solution of linear systems by Krylov space methods, e. g., the GMRES
method, the high storage needs for general matrices are avoided in the case of Hermitian
matrices due to the availability of short recurrences in the orthonormalization process.
This is exploited in the “Lanczos method”. Suppose that the matrix A is Hermitian.
Then, the recurrence formula of the Arnoldi method


\tilde q^t = A q^{t-1} - \sum_{j=1}^{t-1} (A q^{t-1}, q^j)_2\, q^j, \quad t = 2,\dots,m+1,

because of (Aq t−1 , q j )2 = (q t−1 , Aq j )2 = 0, j = 1, . . . , t − 3, simplifies to

\tilde q^t = A q^{t-1} - \underbrace{(A q^{t-1}, q^{t-1})_2}_{=:\ \alpha_{t-1}}\, q^{t-1} - \underbrace{(A q^{t-1}, q^{t-2})_2}_{=:\ \beta_{t-2}}\, q^{t-2} = A q^{t-1} - \alpha_{t-1} q^{t-1} - \beta_{t-2} q^{t-2}.

Clearly, αt−1 ∈ R since A is Hermitian. Further, multiplying this identity by q t yields

\|\tilde q^t\|_2 = (q^t, \tilde q^t)_2 = (q^t, A q^{t-1} - \alpha_{t-1} q^{t-1} - \beta_{t-2} q^{t-2})_2 = (q^t, A q^{t-1})_2 = (A q^t, q^{t-1})_2 = \beta_{t-1}.

This implies that also βt−1 ∈ R and βt−1 q t = q̃ t . Collecting the foregoing relations, we
obtain

Aq t−1 = βt−1 q t + αt−1 q t−1 + βt−2 q t−2 , t = 2, . . . , m + 1. (4.3.53)

These equations can be written in matrix form as follows:


A Q^{(m)} = Q^{(m)} \underbrace{\begin{bmatrix} \alpha_1 & \beta_2 & 0 & \dots & \dots & 0 \\ \beta_2 & \alpha_2 & \beta_3 & 0 & & \vdots \\ 0 & \beta_3 & \alpha_3 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \beta_{m-1} & 0 \\ \vdots & & 0 & \beta_{m-1} & \alpha_{m-1} & \beta_m \\ 0 & \dots & \dots & 0 & \beta_m & \alpha_m \end{bmatrix}}_{=:\ T^{(m)}} + \beta_m\,[0,\dots,0,\,q^{m+1}],

where the matrix T (m) ∈ Rm×m is real symmetric. From this so-called “Lanczos relation”,

we finally obtain

Q̄(m)T AQ(m) = T (m) . (4.3.54)

Definition 4.4 (Lanczos Algorithm): For a Hermitian matrix A ∈ Cn×n the Lanczos
method determines a set of orthonormal vectors {q 1 , . . . , q m }, m  n, by applying the
modified Gram-Schmidt method to the basis {q, Aq, . . . , Am−1 q} of the Krylov space Km :

Starting values: q^1 = \|q\|_2^{-1} q, \quad q^0 = 0, \quad \beta_1 = 0.
Iterate for 1 \le t \le m-1: \quad r^t = A q^t, \quad \alpha_t = (r^t, q^t)_2, \quad s^t = r^t - \alpha_t q^t - \beta_t q^{t-1}, \quad \beta_{t+1} = \|s^t\|_2, \quad q^{t+1} = \beta_{t+1}^{-1} s^t.
Finally: r^m = A q^m, \quad \alpha_m = (r^m, q^m)_2.

After the matrix T (m) is calculated, one can compute its eigenvalues λi and their
corresponding eigenvectors w i , e. g., by the QR algorithm. The eigenvalues and eigen-
vectors of T (m) can be obtained in as little as O(m2 ) work. It can be proven that the
eigenvalues are approximate eigenvalues of the original matrix A. The Ritz eigenvectors
v i of A can then be calculated by v i = Q(m) w i .
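A minimal sketch of the Lanczos algorithm of Definition 4.4 is given below; A is assumed to be Hermitian, the full basis Q^(m) is stored (only needed if Ritz vectors are wanted), and no reorthogonalization is performed, although in finite-precision arithmetic the short recurrence is known to lose orthogonality, so that practical codes reorthogonalize or restart.

import numpy as np

def lanczos(A, q, m):
    # Lanczos method (Definition 4.4) for Hermitian A: returns Q^(m) and the
    # real symmetric tridiagonal matrix T^(m) of the Lanczos relation (4.3.54).
    n = A.shape[0]
    Q = np.zeros((n, m), dtype=complex)
    alpha = np.zeros(m)
    beta = np.zeros(m)                            # beta[t] stores beta_{t+2} of the text
    Q[:, 0] = q / np.linalg.norm(q)
    for t in range(m):
        r = A @ Q[:, t]
        alpha[t] = np.real(np.vdot(Q[:, t], r))   # alpha is real for Hermitian A
        s = r - alpha[t] * Q[:, t]
        if t > 0:
            s = s - beta[t - 1] * Q[:, t - 1]
        if t < m - 1:
            beta[t] = np.linalg.norm(s)
            Q[:, t + 1] = s / beta[t]
    T = np.diag(alpha) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    return Q, T

# Ritz values:   lam = np.linalg.eigvalsh(T)
# Ritz vectors:  v = Q @ w  for an eigenvector w of T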

4.3.2 Computation of the pseudo-spectrum

As an application of the Krylov space methods described so far, we discuss the computa-
tion of the pseudo-spectrum of a matrix Ah ∈ Rn×n , which resulted from the discretization
of a dynamical system governed by a differential operator in the context of linearized sta-
bility analysis. Hence, we are interested in the most “critical” eigenvalues, i. e., in those
which are close to the origin or to the imaginary axis. This requires considering the inverse matrix, T = A_h^{-1}. Thereby, we follow ideas developed in Trefethen & Embree
[22], Trefethen [21], and Gerecht et al. [35]. The following lemma collects some useful
facts on the pseudo-spectra of matrices.

Lemma 4.4: i) The ε-pseudo-spectrum of a matrix T ∈ Cn×n can be equivalently defined


in the following way:

σε (T ) := {z ∈ C | σmin(zI − T ) ≤ ε}, (4.3.55)

where σmin (B) denotes the smallest singular value of the matrix B , i. e.,

\sigma_{\min}(B) := \min\{\lambda^{1/2} \mid \lambda \in \sigma(\bar B^T B)\},

with the (complex) adjoint B̄ T of B .


ii) The ε-pseudo-spectrum σε (T ) of a matrix T ∈ Cn×n is invariant under orthonormal
transformations, i. e., for any unitary matrix Q ∈ Cn×n there holds

σε (Q̄T T Q) = σε (T ). (4.3.56)

Proof. i) There holds

\|(zI - T)^{-1}\|_2 = \max\{\mu^{1/2} \mid \mu \text{ singular value of } (zI - T)^{-1}\} = \min\{\mu^{1/2} \mid \mu \text{ singular value of } zI - T\}^{-1} = \sigma_{\min}(zI - T)^{-1},

and, consequently,

\sigma_\varepsilon(T) = \{z \in \mathbb{C} \mid \|(zI - T)^{-1}\|_2 \ge \varepsilon^{-1}\} = \{z \in \mathbb{C} \mid \sigma_{\min}(zI - T)^{-1} \ge \varepsilon^{-1}\} = \{z \in \mathbb{C} \mid \sigma_{\min}(zI - T) \le \varepsilon\}.

ii) The proof is posed as exercise. Q.E.D.


There are several different though equivalent definitions of the ε-pseudo-spectrum
σε (T ) of a matrix T ∈ Cn×n , which can be taken as starting point for the computation
of pseudo-spectra (see Trefethen [21] and Trefethen & Embree [22]). Here, we use the
definition contained in Lemma 4.4. Let σε (T ) be determined in a whole section
D ⊂ C . We choose a sequence of grid points zi ∈ D, i = 1, 2, 3, . . . , and in each zi
determine the smallest ε for which zi ∈ σε (T ). By interpolating the obtained values, we
can then decide whether a point z ∈ C approximately belongs to σε (T ).

Remark 4.10: The characterization

\sigma_\varepsilon(T) = \cup\{\sigma(T + E) \mid E \in \mathbb{C}^{n\times n},\ \|E\|_2 \le \varepsilon\}   (4.3.57)

leads one to simply take a number of random matrices E of norm less than ε and to
plot the union of the usual spectra σ(T + E) . The resulting pictures are called the
“poor man’s pseudo-spectra”. This approach is rather expensive since in order to obtain
precise information of the ε-pseudo-spectrum a really large number of random matrices
are needed. It cannot be used for higher-dimensional matrices.

Remark 4.11: The determination of pseudo-spectra in hydrodynamic stability theory


requires the solution of eigenvalue problems related to the linearized Navier-Stokes equa-
tions as described in Section 0.4.3:
− νΔv + v̂ · ∇v + v · ∇v̂ + ∇q = λv, ∇ · v = 0, in Ω,
(4.3.58)
v|Γrigid ∪Γin = 0, ν∂n v − qn|Γout = 0,

where v̂ is the stationary “base flow” the stability of which is to be investigated. This
eigenvalue problem is posed on the linear manifold described by the incompressibility con-
straint ∇·v = 0. Hence after discretization the resulting algebraic eigenvalue problems in-
herit the saddle-point structure of (4.3.58). We discuss this aspect in the context of a finite
element Galerkin discretization with finite element spaces Hh ⊂ H01 (Ω)d and Lh ⊂ L2 (Ω).
Let {ϕih , i = 1, . . . , nv := dim Hh } and {χjh , j = 1, . . . , np := dim Lh } be standard nodal
bases of the finite element spaces Hh and Lh , respectively. nv i The np j vj h ∈ Hh
eigenvector
and the pressure qh ∈ Lh possess expansions vh = i=1 vh ϕh , qh = j=1 qh χh , where
i

the vectors of expansion coefficients are likewise denoted by vh = (vhi )ni=1 v


∈ Cnv and

q_h = (q_h^j)_{j=1}^{n_p} \in \mathbb{C}^{n_p}, respectively. With this notation the discretization of the eigenvalue
problem (4.3.58) results in a generalized algebraic eigenvalue problem of the form
\begin{pmatrix} S_h & B_h \\ B_h^T & 0 \end{pmatrix} \begin{pmatrix} v_h \\ q_h \end{pmatrix} = \lambda_h \begin{pmatrix} M_h & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} v_h \\ q_h \end{pmatrix},   (4.3.59)

with the so-called stiffness matrix S_h, gradient matrix B_h and mass matrix M_h defined by

S_h := \big(a(\hat v_h; \varphi_h^j, \varphi_h^i)\big)_{i,j=1}^{n_v}, \quad B_h := \big((\chi_h^j, \nabla\cdot\varphi_h^i)_{L^2}\big)_{i,j=1}^{n_v, n_p}, \quad M_h := \big((\varphi_h^j, \varphi_h^i)_{L^2}\big)_{i,j=1}^{n_v}.

For simplicity, we suppress terms stemming from pressure and transport stabilization.
The generalized eigenvalue problem (4.3.59) can equivalently be written in the form
\begin{pmatrix} M_h & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} S_h & B_h \\ B_h^T & 0 \end{pmatrix}^{-1} \begin{pmatrix} M_h & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} v_h \\ q_h \end{pmatrix} = \mu_h \begin{pmatrix} M_h & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} v_h \\ q_h \end{pmatrix},   (4.3.60)

where \mu_h = \lambda_h^{-1}. Since the pressure q_h only plays the role of a silent variable, (4.3.60)
reduces to the (standard) generalized eigenvalue problem

Th vh = μh Mh vh , (4.3.61)

with the matrix Th ∈ Rnv ×nv defined by


\begin{pmatrix} T_h & 0 \\ 0 & 0 \end{pmatrix} := \begin{pmatrix} M_h & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} S_h & B_h \\ B_h^T & 0 \end{pmatrix}^{-1} \begin{pmatrix} M_h & 0 \\ 0 & 0 \end{pmatrix}.

The approach described below for computing eigenvalues of general matrices T ∈ Rn×n
can also be applied to this non-standard situation.

Computation of eigenvalues

For computing the eigenvalues of a (general) matrix T ∈ Rn×n , we use the Arnoldi
process, which produces a lower-dimensional Hessenberg matrix the eigenvalues of which
approximate those of T :
H^{(m)} = \bar Q^{(m)T} T Q^{(m)} = \begin{pmatrix} h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,m} \\ h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,m} \\ 0 & h_{3,2} & h_{3,3} & \cdots & h_{3,m} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & h_{m,m-1} & h_{m,m} \end{pmatrix},

where the matrix Q(m) = [q 1 , . . . , q m ] is formed with the orthonormal basis {q 1 , . . . , q m }


of the Krylov space Km = span{q, T q, .., T m−1q} . The corresponding eigenvalue problem
is then efficiently solved by the QR method using only O(m^2) operations. The obtained
eigenvalues approximate those eigenvalues of T with largest modulus, which in turn are
related to the desired eigenvalues of the differential operator with smallest real parts.
Enlarging the dimension m of Km improves the accuracy of this approximation as well
as the number of the approximated “largest” eigenvalues. In fact, the pseudo-spectrum
of H (m) approaches that of T for m → n .
The construction of the Krylov space Km is the most cost-intensive part of the whole
process. It requires (m−1)-times the application of the matrix T , which, if T is the inverse
of a given system matrix, amounts to the consecutive solution of m linear systems of
dimension n ≫ m. This may be achieved by a multigrid method implemented in available
open source software (see Chapter 5). Since such software often does not support complex
arithmetic the linear system Sx = y needs to be rewritten in real arithmetic,
S x = y \quad\Longleftrightarrow\quad \begin{pmatrix} \mathrm{Re}\, S & \mathrm{Im}\, S \\ -\mathrm{Im}\, S & \mathrm{Re}\, S \end{pmatrix} \begin{pmatrix} \mathrm{Re}\, x \\ -\mathrm{Im}\, x \end{pmatrix} = \begin{pmatrix} \mathrm{Re}\, y \\ -\mathrm{Im}\, y \end{pmatrix}.

For the reliable approximation of the pseudo-spectrum of T in the subregion D ⊂ C it is


necessary to choose the dimension m of the Krylov space sufficiently large, such that all
eigenvalues of T and its perturbations located in D are well approximated by eigenvalues
of H (m) . Further, the QR method is to be used with maximum accuracy requiring the
corresponding error tolerance TOL to be set in the range of the machine accuracy. An
eigenvector w corresponding to an eigenvalue λ ∈ σ(H (m) ) is then obtained by solving
the singular system

(H (m) − λI)w = 0. (4.3.62)

By back-transformation of this eigenvector from the Krylov space Km into the space Rn ,
we obtain a corresponding approximate eigenvector of the full matrix T .

Practical computation of the pseudospectrum

We want to determine the “critical” part of the ε-pseudo-spectrum of the discrete operator
Ah , which approximates the unbounded differential operator A . As discussed above, this
requires the computation of the smallest singular value of the inverse matrix T = A−1 h .
Since the dimension n_h of T in practical applications is very high, n_h ≈ 10^4 - 10^8, the
direct computation of singular values of T or even a full singular value decomposition is
prohibitively expensive. Therefore, the first step is the reduction of the problem to lower
dimension by projection onto a Krylov space, resulting in a (complex) Hessenberg matrix H (m) ∈ Cm×m, the inverse of which, H (m)−1 , may then be viewed as a low-dimensional
approximation to Ah capturing the critical “smallest” eigenvalues of Ah and likewise its
pseudo-spectra. The pseudo-spectra of H (m) may then be computed using the approach
described in Section 4.2.2. By Lemma 1.17 the pseudo-spectrum of H (m) is closely related

to that of H (m)−1 but involving constants, which are difficult to control. Therefore, one
tends to prefer to directly compute the pseudo-spectra of H (m)−1 as an approximation
to that of Ah . This, however, is expensive for larger m since the inversion of the matrix
H (m) costs O(m3 ) operations. Dealing directly with the Hessenberg matrix H (m) looks
more attractive. Both procedures are discussed in the following. We choose a section
D ⊂ C (around the origin), in which we want to determine the pseudo-spectrum. Let
D := \{z \in \mathbb{C} \mid (\mathrm{Re}\, z, \mathrm{Im}\, z) \in [a_r, b_r] \times [a_i, b_i]\} for certain values a_r < b_r and a_i < b_i.
To determine the pseudo-spectrum in the complete rectangle D , we cover D by a grid
with spacing dr and di , such that k points lie on each grid line. For each grid point, we
compute the corresponding ε-pseudo-spectrum.
i) Computation of the pseudo-spectra σε (H (m)−1 ): For each z ∈ D \ σ(H (m)−1 ) the
quantity
\varepsilon(z, H^{(m)-1}) := \|(zI - H^{(m)-1})^{-1}\|_2^{-1} = \sigma_{\min}(zI - H^{(m)-1})
determines the smallest ε > 0 , such that z ∈ σε (H (m)−1 ). Then, for any point z ∈ D ,
by computing σmin (zI − H (m)−1 ) , we obtain an approximation of the smallest ε , such
that z ∈ σε (H (m)−1 ) . For computing σmin := σmin (zI − H (m)−1 ) , we recall its definition
as smallest (positive) eigenvalue of the Hermitian, positive definite matrix

S := \overline{(zI - H^{(m)-1})}^T\, (zI - H^{(m)-1})

and use the “inverse iteration”, z^0 \in \mathbb{C}^m, \|z^0\|_2 = 1,

t \ge 1: \quad S \tilde z^t = z^{t-1}, \quad z^t = \|\tilde z^t\|_2^{-1}\,\tilde z^t, \quad \sigma_{\min}^t := (S z^t, z^t)_2.   (4.3.63)

The linear systems in each iteration can be solved by pre-computing either directly an
LR decomposition of S , or if this is too ill-conditioned, first a QR decomposition
zI − H (m)−1 = QR,
which then yields a Cholesky decomposition of S :

S = \overline{(QR)}^T QR = \bar R^T \bar Q^T Q R = \bar R^T R.   (4.3.64)

This preliminary step costs another O(m3 ) operations.


ii) Computation of the pseudo-spectra σε (H (m) ): Alternatively, one may compute a sin-
gular value decomposition of the Hessenberg matrix zI − H (m) ,

zI − H (m) = UΣV̄ T ,

where U, V ∈ Cm×m are unitary matrices and Σ = diag{σi , i = 1, . . . , m}. Then,

σmin (zI − H (m) ) = min{σi , i = 1, . . . , m}.

For that, we use the LAPACK routine dgesvd within MATLAB. Since the operation count of the singular value decomposition grows like O(m^2), in our sample calculation, we limit the dimension of the Krylov space by m ≤ 200.
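A sketch of variant ii) could look as follows: for each point of a rectangular grid covering D the smallest singular value of zI − H^(m) is computed by a full singular value decomposition (here via numpy.linalg.svd rather than the LAPACK/MATLAB route used in the text); the section D, the grid resolution and the contour levels in the commented plotting lines are illustrative values only.

import numpy as np

def pseudospectrum_grid(H, ar, br, ai, bi, k=100):
    # sigma_min(z I - H) on a k x k grid covering D = [ar, br] x [ai, bi].
    m = H.shape[0]
    xs = np.linspace(ar, br, k)
    ys = np.linspace(ai, bi, k)
    sig = np.zeros((k, k))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            z = x + 1j * y
            sig[i, j] = np.linalg.svd(z * np.eye(m) - H, compute_uv=False)[-1]
    return xs, ys, sig

# contour lines of the eps-pseudo-spectra, e.g. for eps = 1e-4, ..., 1e-1:
# import matplotlib.pyplot as plt
# xs, ys, sig = pseudospectrum_grid(H, -2.0, 10.0, -5.0, 5.0)
# plt.contour(xs, ys, sig, levels=[1e-4, 1e-3, 1e-2, 1e-1])
# ev = np.linalg.eigvals(H)
# plt.plot(ev.real, ev.imag, 'k.')
# plt.show()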

Choice of parameters and accuracy issues

The described algorithm for computing the pseudo-spectrum of a differential operator at


various stages requires the appropriate choice of parameters:

- The mesh size h in the finite element discretization on the domain Ω ⊂ Rn for
reducing the infinite-dimensional problem to a matrix eigenvalue problem of di-
mension nh .

- The dimension of the Krylov space Km,h in the Arnoldi method for the reduction
of the nh -dimensional (inverse) matrix Th to the much smaller Hessenberg matrix
(m)
Hh .

- The size of the subregion D := [ar , br ] × [ai , bi ] ⊂ C in which the pseudospectrum


is to be determined and the mesh width k of interpolation points in D ⊂ C .

Only for an appropriate choice of these parameters one obtains a reliable approximation
to the pseudo-spectrum of the differential operator A . First, h is refined and m is in-
creased until no significant change in the boundaries of the ε-pseudo-spectrum is observed
anymore.

Example 1. Sturm-Liouville eigenvalue problem

As a prototypical example for the proposed algorithm, we consider the Sturm-Liouville


boundary value problem (see Trefethen [21])

A u(x) = -u''(x) - q(x) u(x), \quad x \in \Omega = (-10, 10),   (4.3.65)

with the complex potential q(x) := (3 + 3i)x^2 + \tfrac{1}{16} x^4, and the boundary condition u(-10) = 0 = u(10). Using the sesquilinear form

a(u, v) := (u', v') + (q u, v), \quad u, v \in H^1_0(\Omega),

the eigenvalue problem of the operator A reads in variational form

a(v, ϕ) = λ(v, ϕ) ∀ϕ ∈ H01 (Ω). (4.3.66)

First, the interval Ω = (−10, 10) is discretized by eightfold uniform refinement, resulting in the finest mesh size h = 20 · 2^{-8} ≈ 0.078 and n_h = 256. The Arnoldi algorithm
for the corresponding discrete eigenvalue problem of the inverse matrix A_h^{-1} generates a Hessenberg matrix H_h^{(m)} of dimension m = 200. The resulting reduced eigenvalue
problem is solved by the QR method. For the determination of the corresponding pseudo-
spectra, we export the Hessenberg matrix H_h^{(m)} into a MATLAB file. For this, we use the
routine DGESVD in LAPACK (singular value decomposition) on meshes with 10×10 and
with 100 × 100 points. The ε-pseudo-spectra are computed for ε = 10^{-1}, 10^{-2}, ..., 10^{-10},
leading to the results shown in Fig. 4.1. We observe that all eigenvalues have negative real

part but also that the corresponding pseudo-spectra reach far into the positive half-plane
of C , i. e., small perturbations of the matrix may have strong effects on the location of
the eigenvalues. Further, we see that already a grid with 10 × 10 points yields sufficiently
good approximations of the pseudo-spectrum of the matrix H_h^{(m)}.

Figure 4.1: Approximate eigenvalues and pseudo-spectra of the operator A computed from those of the inverse matrix A_h^{-1} on a 10 × 10 grid (left) and on a 100 × 100 grid (right): “dots” represent eigenvalues and the lines the boundaries of the ε-pseudo-spectra for ε = 10^{-1}, ..., 10^{-10}.

Example 2. Stability eigenvalue problem of the Burgers operator

A PDE test example is the two-dimensional Burgers equation

−νΔv + v · ∇v = 0, in Ω. (4.3.67)

This equation is sometimes considered as a simplified version of the Navier-Stokes equation


since both equations contain the same nonlinearity. We use this example for investigating
some questions related to the numerical techniques used, e. g., the required dimension of
the Krylov spaces in the Arnoldi method.
For simplicity, we choose Ω := (0, 2) × (0, 1) ⊂ R2 , and along the left-hand “inflow
boundary” Γin := ∂Ω ∩ {x1 = 0} as well as along the upper and lower boundary parts
Γrigid := ∂Ω∩({x2 = 0}∪{x2 = 1}) Dirichlet conditions and along the right-hand “outflow
boundary” Γout := ∂Ω ∩ {x1 = 2} Neumann conditions are imposed, such that the exact
solution has the form v̂(x) = (x2 , 0) of a Couette-like flow. We set ΓD := Γrigid ∪ Γin and
choose ν = 10^{-2}. Linearization around this stationary solution yields the nonsymmetric
stability eigenvalue problem for v = (v1 , v2 ) :

−νΔv1 + x2 ∂1 v1 + v2 = λv1 ,
(4.3.68)
−νΔv2 + x2 ∂1 v2 = λv2 ,

in Ω with the boundary conditions v|ΓD = 0, ∂n v|Γout = 0 . For discretizing this problem,
we use the finite element method described above with conforming Q1 -elements combined

with transport stabilization by the SUPG (streamline upwind Petrov-Galerkin) method.


We investigate the eigenvalues of the linearized (around Couette flow) Burgers operator
with Dirichlet or Neumann inflow conditions. We use the Arnoldi method described above
with Krylov spaces of dimension m = 100 or m = 200 . For generating the contour lines
of the ε-pseudospectra, we use a grid of 100 × 100 test points.
For testing the accuracy of the proposed method, we compare the quality of the
pseudo-spectra computed on meshes of width h = 2^{-7} (n_h ≈ 30,000) and h = 2^{-8} (n_h ≈ 130,000), using Krylov spaces of dimension m = 100 or m = 200. The results shown in Fig. 4.2 and Fig. 4.3 indicate that the choice h = 2^{-7} and m = 100 is sufficient for
the present example.

Figure 4.2: Computed pseudo-spectra of the linearized Burgers operator with Dirichlet inflow condition for ν = 0.01 and h = 2^{-7} (left) and h = 2^{-8} (right), computed by the Arnoldi method with m = 100. The “dots” represent eigenvalues and the lines the boundaries of the ε-pseudo-spectra for ε = 10^{-1}, ..., 10^{-4}.

Figure 4.3: Computed pseudo-spectra of the linearized Burgers operator with Dirichlet inflow condition for ν = 0.01 and h = 2^{-8}, computed by the Arnoldi method with m = 100 (left) and m = 200 (right). The “dots” represent eigenvalues and the lines the boundaries of the ε-pseudo-spectra for ε = 10^{-1}, ..., 10^{-4}.

Now, we turn to Neumann inflow conditions. In this particular case the first eigenval-

ues and eigenfunctions of the linearized Burgers operator can be determined analytically
as λ_k = ν k^2 π^2, v_k(x) = (sin(kπx_2), 0)^T, for k ∈ Z. All these eigenvalues are degenerate.
However, there exists another eigenvalue λ4 ≈ 1.4039 between the third and fourth one,
which is not of this form, but also degenerate.
We use this situation for studying the dependence of the proposed method for com-
puting pseudo-spectra on the size of the viscosity parameter, 0.001 ≤ ν ≤ 0.01 . Again
the discretization uses the mesh size h = 2^{-7}, Krylov spaces of dimension m = 100 and a grid of spacing k = 100. By varying these parameters, we find that only eigenvalues with Re λ ≤ 6 and corresponding ε-pseudo-spectra with ε ≥ 10^{-4} are reliably computed.
The results are shown in Fig. 4.4.

Figure 4.4: Computed pseudo-spectra of the linearized (around Couette flow) Burgers operator with Neumann inflow conditions for ν = 0.01 (left) and ν = 0.001 (right): The dots represent eigenvalues and the lines the boundaries of the ε-pseudo-spectra for ε = 10^{-1}, ..., 10^{-4}.

For Neumann inflow conditions the most critical eigenvalue is significantly smaller
than the corresponding most critical eigenvalue for Dirichlet inflow conditions, which
suggests weaker stability properties in the “Neumann case”. Indeed, in Fig. 4.4, we
see that the 0.1-pseudo-spectrum reaches into the negative complex half-plane indicating
instability for such perturbations. This effect is even more pronounced for ν = 0.001
with λ^N_crit ≈ 0.0098.

Example 3. Stability eigenvalue problem of the Navier-Stokes operator

In this last example, we present some computational results for the 2d Navier-Stokes
benchmark “channel flow around a cylinder” with the configuration shown in Section 0.4.3
(see Schäfer & Turek [65]). The geometry data are as follows: channel domain Ω :=
(0.00m, 2.2m) × (0.00m, 0.41m), diameter of circle D := 0.10m, center of circle at a :=
(0.20m, 0.20m) (slightly nonsymmetric position). The Reynolds number is defined in
terms of the diameter D and the maximum inflow velocity Ū = max |v in| = 0.3m/s
(parabolic profile), Re = Ū 2 D/ν . The boundary conditions are v|Γrigid = 0, v|Γin =
v in , ν∂n v − np|Γout = 0. The viscosity is chosen such that the Reynolds number is small
enough, 20 ≤ Re ≤ 40 , to guarantee stationarity of the base flow as shown in Fig. 4.5.
Already for Re = 60 the flow turns nonstationary (time periodic).
4.3 Krylov space methods 181

(0 m,0.41 m)

0.16 m

0.41 m
0.15 m
0.1 m x2
S
0.15 m
x1
(0 m,0 m)
2.2 m

Figure 4.5: Configuration of the “channel flow” benchmark and x1 -component of the ve-
locity for Re = 40 .

We want to investigate the stability of the computed base flow for several Reynolds
numbers in the range 20 ≤ Re ≤ 60 and inflow conditions imposed on the admissible
perturbations, Dirichlet or Neumann (“free”), by determining the corresponding critical
eigenvalues and pseudo-spectra. This computation uses a “stationary code” employing the
Newton method for linearization, which is known to potentially yield stationary solutions
even at Reynolds numbers for which such solutions may not be stable.

Perturbations satisfying Dirichlet inflow conditions

We begin with the case of perturbations satisfying (homogeneous) Dirichlet inflow condi-
tions. The pseudo-spectra of the critical eigenvalues for Re = 40 and Re = 60 are shown
in Fig. 4.6.

Figure 4.6: Computed pseudo-spectra of the linearized Navier-Stokes operator (“chan-


nel flow” benchmark) for different Reynolds numbers, Re = 40 (left) and Re = 60
(right), with Dirichlet inflow condition: The “dots” represent eigenvalues and the lines
the boundaries of ε-pseudospectra for ε = 10−2 , 10−2.5, 10−3 , 10−3.5.
182 Iterative Methods for Eigenvalue Problems

The computation has been done on meshes obtained by four to five uniform refinements
of the (locally adapted) meshes used for computing the base flow. In the Arnoldi method,
we use Krylov spaces of dimension m = 100 . Computations with m = 200 give almost
the same results. For Re = 40 the relevant 10−2 -pseudo-spectrum does not reach into
the negative complex half-plane indicating stability of the corresponding base solution in
this case, as expected in view of the result of nonstationary computations. Obviously the
transition from stationary to nonstationary (time periodic) solutions occurs in the range
40 ≤ Re ≤ 60 . However, for this “instability” the sign of the real part of the critical
eigenvalue seems to play the decisive role and not so much the size of the corresponding
pseudo-spectrum.

Perturbations satisfying Neumann (free) inflow conditions

Next, we consider the case of perturbations satisfying (homogeneous) Neumann (“free”)


inflow conditions, i. e., the space of admissible perturbations is larger than in the pre-
ceding case. In view of the observations made before for Couette flow and Poiseuille
flow, we expect weaker stability properties. The stationary base flow is again computed
using Dirichlet inflow conditions but the associated eigenvalue problem of the linearized
Navier-Stokes operator is considered with Neumann inflow conditions. In the case of
perturbations satisfying Dirichlet inflow conditions the stationary base flow turned out
to be stable up to Re = 45 . In the present case of perturbations satisfying Neumann
inflow conditions at Re = 40 the critical eigenvalue has positive but very small real
part, Reλmin ≈ 0.003. Hence, the precise stability analysis requires the determination of
the corresponding pseudo-spectrum. The results are shown in Fig. 4.3.2. Though, for
Re = 40 the real part of the most critical (positive) eigenvalue is rather small, the corre-
sponding 10−2 -pseudo-spectrum reaches only a little into the negative complex half-plane.

Figure 4.7: Computed pseudo-spectra of the linearized Navier-Stokes operator (“channel


flow”) with Neumann inflow conditions for different Reynolds numbers, Re = 40 (left)
and Re = 60 (right): The “dots” represent eigenvalues and the lines the boundaries of
the ε-pseudospectra for ε = 10−2, 10−2.5 , 10−3, 10−3.5 .
4.4 Exercises 183

4.4 Exercises

Exercise 4.1: The proof of convergence of the “power method” applied to a symmetric,
positive definite matrix A ∈ Rn×n resulted in the identity
#    $
(λn )2t+1 |αn |2 + n−1 2 λi 2t+1  λ 2t 
i=1 |αi |  n−1 
t t t
λ = (Az , z )2 = #  
λn
 2t $ = λ max + O   ,
(λn ) |αn | + i=1 |αi | λn
2t 2 n−1 2 λi λmax

where λi ∈ R, i = 1, . . . , n , are the eigenvalues of A , {w i , i = 1, . . . , n} a corresponding


ONB
n of eigenvectors and αi the coefficients in the expansion of the starting vector z 0 =
i=1 αi w . Show that, in case αn = 0, in the above identity the error term on the
i

right-hand side is uniformly bounded with respect to the dimension n of A but depends
linearly on |λn |.

Exercise 4.2: The “inverse iteration” may be accelerated by employing a dynamic “shift”
taken from the presceding eigenvalue approximation (λ0k ≈ λk ):

z̃ t−1 −1 t t 1
(A − λt−1 t
k I)z̃ = z
t−1
, zt = , μtk = ((A − λt−1
k I) z , z )2 , λtk = + λt−1
k .
z̃ t−1  μtk

Investigate the convergence of this method for the computation of the smallest eigenvalue
λ1 = λmin of a symmetric, positive definite matrix A ∈ Rn×n . In detail, show the
convergence estimate
t−1 

 λ1 − λj 2 z 0 22
|λ1 − λt | ≤ |λt − λt−1 |   .
j=0
λ2 − λj |α1 |2

(Hint: Show that


n 5
i=1 |αi |2 (λi − λt−1 )−1 t−1 j −2
j=0 (λi − λ )
μ =t
n 5
i=1 |αi |
t−1
j=0 (λi − λ )
2 j −2

and proceed in a similar way as in the preceding exercise.)

Exercise 4.3: Let A be a Hessenberg matrix or a symmetric tridiagonal matrix. Show


that in this case the same holds true for all iterates At generated by the QR method:

A(0) := A,
A(t+1) := R(t) Q(t) , with A(t) = Q(t) R(t) , t ≥ 0.

Exercise 4.4: Each matrix A ∈ Cn×n possesses a QR decomposition A = QR, with a


unitary matrix Q = [q 1 , . . . , q n ] and an upper triangular matrix R = (rij )ni,j=1. Clearly,
this decomposition is not uniquely determined. Show that for regular A there exists a
uniquely determined QR decomposition with the property rii ∈ R+ , i = 1, . . . , n.
184 Iterative Methods for Eigenvalue Problems

(Hint: Use the fact that the QR decomposition of A yields a Cholesky decomposition of
the matrix ĀT A.)

Exercise 4.5: For a matrix A ∈ Cn×n and an arbitrary vector q ∈ Cn , q = 0, form the
Krylov spaces Km := span{q, Aq, . . . , Am−1 q}. Suppose that for some 1 ≤ m ≤ n there
holds Km−1 = Km = Km+1 .
i) Show that then Km = Km+1 = · · · = Kn = Cn and dimKm = m.
ii) Let {q 1 , . . . , q m } be an ONB of Km and set Qm := [q 1 , . . . , q m ]. Show that there
holds σ(QmT AQm ) ⊂ σ(A) . In the case m = n there holds σ(QnT AQn ) = σ(A) .

Exercise 4.6: Recall the two versions of the Gram-Schmidt algorithm, the “classical”
one and the “modified” one described in the text, for the successive orthogonalization of
a general, linear independent set {v 1 , . . . , v m } ⊂ Rn .
i) Verify that both algorithms, used with exact arithmetic, yield the same result.
ii) Determine the computational complexity of these two algorithms, i. e., the number of
arithmetic operations for computing the corresponding orthonormal set {u1 , . . . , um }.

Exercise 4.7: Consider the nearly singular 3 × 3-matrix


⎡ ⎤
1 1 1
⎢ ⎥

A=⎣ ε ε 0 ⎥ = [a1 , a2 , a3 ],

ε 0 ε

where ε > 0 is small enough so that 1 + ε2 is rounded to 1 in the given floating-point


arithmetic. Compute the QR decomposition of A = [a1 , a2 , a3 ] by orthonormalization
of the set of its column vectors {a1 , a2 , a3 } using (i) the classical Gram-Schmidt algo-
rithm and (ii) its modified version. Compare the quality of the results by making the
“Householder Test”: QT Q − I∞ ≈ 0 .

Exercise 4.8: Consider the model eigenvalue problem from the text, which originates
from the 7-point discretization of the Poisson problem on the unit cube:
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
B −Im2 C −Im 6 −1
⎢ .. ⎥ ⎢ .. ⎥ ⎢ .. ⎥
A = h−2 ⎢
⎣ −Im2 B . ⎥
⎦ B=⎢ ⎣ −Im C . ⎥
⎦ C=⎢ ⎣ −1 6 . ⎥

.. .. .. .. .. ..
. . . . . .
        
n=m3 m2 m

where h = 1/(m + 1) is the mesh size. In this case the corresponding eigenvalues and
eigenvectors are explicitly given by
#  $
λhijk = h−2 6 − 2 cos[ihπ] + cos[jhπ] + cos[khπ] , i, j, k = 1, . . . , m,
 m
whijk = sin[pihπ] sin[qjhπ] sin[rkhπ] p,q,r=1.
4.4 Exercises 185

For this discretization, there holds the theoretical a priori error estimate

|λijk − λhijk | 1
≤ λijk h2 ,
|λijk | 12

where λijk = (i2 + j 2 + k 2 )π 2 are the exact eigenvalues of the Laplace operator (and h
sufficiently small).
i) Verify this error estimate using the given values for λijk and λhijk .
ii) Derive an estimate for the number of eigenvalues (not counting multiplicities) of the
Laplace operator that can be approximated reliably for a mesh size of h = 2−7 , if a
uniform relative accuracy of TOL = 10−3 is required.
iii) How small has the mesh size h to be chosen if the first 1.000 eigenvalues (counting
multiplicities for simplicity) of the Laplace operator have to be computed with relative
accuracy TOL = 10−3 ? How large would the dimension n of the resulting system matrix
A be in this case? (Hint: We are interested in an upper bound, so simplify accordingly.)

Exercise 4.9: Formulate the “inverse iteration” of Wielandt and the “Lanczos algo-
rithm” (combined with the QR method) for computing the smallest eigenvalue of a large
symmetric positive definite matrix A ∈ Rn×n . Suppose that matrix vector products as
well as solving linear systems occurring in these processes can be accomplished with O(n)
a. op.:
i) Compare the arithmetic work (# of a. op.) of these two approaches for performing 100
iterations.
ii) How do the two methods compare if not only the smallest but the 10 smallest eigen-
values are to be computed?

Exercise 4.10: The Krylov space method applied for general matrices A ∈ Cn×n re-
quires complex arithmetic, but many software packages provide only real arithmetic.
i) Verify that a (complex) linear system Ax = b can equivalently be written in the
following real “(2n × 2n)-block form”:
( )( ) ( )
Re A Im A Re x Re b
= .
−Im A Re A −Im x −Im b

ii) Formulate (necessary and sufficient) conditions on A, which guarantee that this (real)
coefficient block-matrix is a) regular, b) symmetric and c) positive definite?

Exercise 4.11: Show that the ε-pseudo-spectrum σε (T ) of a matrix T ∈ Cn×n is in-


variant under orthonormal transformations, i. e., for any unitary matrix Q ∈ Cn×n there
holds
σε (T ) = σε (Q−1 T Q).
(Hint: Pick a suitable one of the many equivalent definitions of ε-pseudo-spectrum pro-
vided in the text.)
5 Multigrid Methods
Multigrid methods belong to the class of preconditioned defect correction methods, in
which the preconditioning uses a hierarchy of problems of similar structure but decreas-
ing dimension. They are particularly designed for the solution of the linear systems
resulting from the discretization of partial differential equations by grid methods such as
finite difference or finite element schemes. But special versions of this method can also be
applied to other types of problems not necessarily originating from differential equations.
Its main concept is based on the existence of a superposed “continuous” problem of infi-
nite dimension, from which all the smaller problems are obtained in a nested way by some
projection process (e. g., a “finite difference discretization” or a “finite element Galerkin
method”). On the largest subspace (on the finest grid) the errors and the corresponding
defects are decomposed into “high-frequency” and “low-frequency” components, which
are treated separately by simple fixed-point iterations for “smoothing” out the former
and by correcting the latter on “coarser” subspaces (the “preconditioning” or “coarse-
space correction”). This “smoothing” and “coarse-space correcting” is applied recursively
on the sequence of nested spaces leading to the full “multigrid algorithm”. By an appro-
priate combination of all these algorithmic components one obtains an “optimal” solution
algorithm, which solves a linear system of dimension n , such as the model problem con-
sidered above, in O(n) arithmetic operations. In the following, for notational simplicity,
we will decribe and analyze the multigrid method within the framework of a low-order
finite element Galerkin discretization of the model problem of Section 3.4. In fact, on
uniform Cartesian meshes this discretization is closely related (almost equivalent) to the
special finite difference scheme considered in Section 3.4. For the details of such a finite
element scheme and its error analysis, we refer to the literature, e. g., Rannacher [3].

5.1 Multigrid methods for linear systems

For illustration, we consider the linear system

Ah xh = bh , (5.1.1)

representing the discretization of the two-dimensional model problem of Section 3.4 on a


finite difference mesh Th with mesh size h ≈ m−1 and dimension n = m2 ≈ h−4 . Here
and below, the quantities related to a fixed subspace (corresponding to a mesh Th ) are
labeled by the subscript h.
The solution of problem (5.1.1) is approximated by the damped Richardson iteration

xt+1
h = xth + θh (bh − Ah xth ) = (Ih − θh Ah )xth + θh bh , (5.1.2)

with a damping parameter 0 < θh ≤ 1 . The symmetric, positive definite matrix Ah pos-
sesses an ONS of eigenvectors {whi , i = 1, ..., nh } corresponding to the ordered eigenvalue

187
188 Multigrid Methods

λmin (Ah ) = λ1 ≤ ... ≤ λn = λmax (Ah ) =: Λh . The expansion of the initial error


nh
e0h := x0h − xh = εi whi ,
i=1

induces the corresponding expansion of the iterated errors



nh 
nh
eth = (Ih − θh Ah )t e0h = εi (Ih − θh Ah )t whi = εi (1 − θh λi )t whi .
i=1 i=1

Consequently, denoting by | · | the Euclidean norm on Rnh ,


nh
|eth |2 = ε2i (1 − θh λi )2t . (5.1.3)
i=1

The assumption 0 < θh ≤ Λ−1 h is sufficient for the convergence of the Richardson iteration.
Because of |1 − θh λi |  1 for larger λi and |1 − θh λ1 | ≈ 1 “high-frequency” components
of the error decay fast, but “low-frequency” components only very slowly. The same holds
for the residuum rht = bh − Ah xth = Ah eth . Hence already after a few iterations there holds


[N/2]
|rht |2 ≈ ε2i λ2i (1 − θh λi )2t , [n/2] := max{m ∈ N| m ≤ n/2}. (5.1.4)
i=1

This may be interpreted as follows: The iterated defect rht on the mesh Th is “smooth”.
Hence, it can be approximated well on the coarser mesh T2h with mesh size 2h . The
resulting defect equation for the computation of the correction to the approximation xth
on Th would be less costly because of its smaller dimension n2h ≈ nh /4 .
This defect correction process in combination with recursive coarsening can be carried
on to a coarsest mesh, on which the defect equation can finally be solved exactly. The
most important components of this multigrid process are the “smoothing iteration”, xνh =
Shν (x0h ) and certain transfer operations between functions defined on different meshes. The
smoothing operation Sh (·) is usually given in form of a simple fixed-point iteration (e. g.,
the Richardson iteration)

xν+1
h = Sh (xνh ) := (Ih − Ch−1 Ah )xνh + Ch−1 bh ,

with the iteration matrix Sh := Ih − Ch−1 Ah .

5.1.1 Multigrid methods in the “finite element” context

For the formulation of the multigrid process, we consider a sequence of nested grids
Tl = Thl , l = 0, ..., L , of increasing fineness h0 > ... > hl > ... > hL (for instance
obtained by successively refining a coarse starting grid) and corresponding finite element
spaces Vl := Vhl of increasing dimension nl , which are subspaces of the “continuous” solu-
5.1 Multigrid methods for linear systems 189

tion space V = H01 (Ω) (first-order Sobolev1 space on Ω including zero Dirichlet boundary
conditions). Here, we think of spaces of continuous, with respect to the mesh Th piece-
wise linear (on triangular meshes) or piecewise (isoparametric) bilinear (on quadrilateral
meshes) functions. For simplicity, we assume that the finite element spaces are hierachi-
cally ordered,

V0 ⊂ V1 ⊂ ... ⊂ Vl ⊂ ... ⊂ VL . (5.1.5)

This structural assumption eases the analysis of the multigrid process but is not essential
for its practical success.

The finite element Galerkin scheme

As usual, we write the continuous problem and its corresponding finite element Galerkin
approximation in compact variational form

a(u, ϕ) = (f, ϕ)L2 ∀ϕ ∈ V, (5.1.6)

and, analogously on the mesh Th

a(uh , ϕh ) = (f, ϕh )L2 ∀ϕh ∈ Vh . (5.1.7)

Here, a(u, ϕ) := (Lu, ϕ)L2 is the “energy bilinear form” corresponding to the (elliptic)
differential operator L and (f, ϕ)L2 the L2 -scalar product on the solution domain Ω . In
the model problem discussed above this notation has the explicit form Lu = −Δu and
" "
a(u, ϕ) = ∇u(x)∇ϕ(x) dx, (f, ϕ)L2 = f (x)ϕ(x) dx.
Ω Ω

The finite element subspace Vh ⊂ V has a natural so-called “nodal basis” (Lagrange basis)
{b1 , . . . , bnh } characterized by the interpolation property bi (aj ) = δij , i, j = 1, . . . , nh ,
where aj are the nodal points of the mesh Th . Between the finite element function
uh ∈ Vh and the corresponding nodal-value vector xh ∈ Rnh , we have the connection
uh (aj ) = xh,j , j = 1, ..., nh ,
nh nh
j
uh = xh,j b = uh (aj )bj .
j=1 j=1

Using this notation the discrete problems (5.1.7) can be written in the following form:


nh
xh,j a(bj , bi ) = (f, bi )L2 , i = 1, . . . , nh ,
j=1

1
Sergei Lvovich Sobolev (1908–1989): Russian mathematician; worked in Leningrad (St. Petersburg)
and later at the famous Steklov-Institute for Mathematics of the Academy of Sciences in Moscow; funda-
mental contributions to the theory of partial differential equations concept of generalized (distributional)
solutions, “Sobolev spaces”; worked also on numerical methods, numerical quadrature.
190 Multigrid Methods

which is equivalent to the linear system

Ah xh = bh , (5.1.8)

with the “system matrix” (“stiffness matrix”) Ah = (aij )ni,j=1


h
∈ Rnh ×nh and “load vector”
bh = (bj )j=1 ∈ R defined by
nh nh

aij := a(bj , bi ), bj := (f, bj )L2 , i, j = 1, . . . , nh .


 h  h
For finite element functions uh = ni=1 xh,i bi and vh = ni=1 yh,i bi there holds

a(uh , vh ) = (Ah xh , yh )2 .

The system matrix Ah is symmetric and positive definite by construction and has a
condition number of size cond2 (Ah ) = O(h−2 ). Additionally, we will use the so-called
“mass matrix” Mh = (mij )ni,j=1
h
defined by

mij := (bj , bi )L2 , i, j = 1, . . . , nh .


 h  h
For finite element functions uh = ni=1 xh,i bi and vh = ni=1 yh,i bi there holds

(uh , vh )L2 = (Mh xh , yh )2 .

The mass matrix Mh is also symmetric and positive definite by construction and has a
condition number of size cond2 (Ah ) = O(1).
For the exact “discrete” solution uh ∈ Vh there holds the error estimate

u − uh L2 ≤ c h2 f L2 . (5.1.9)

Now, we seek a solution process which produces an approximation ũh ∈ Vh to uh satis-


fying

uh − ũh L2 ≤ c h2 f L2 . (5.1.10)

This process is called “complexity-optimal” if the arithmetic work for achieving this accu-
racy is of size O(nh ) uniformly with respect to the mesh size h . We will see below that
the multigrid method is actually optimal is this sense if all its components are properly
chosen.

The multigrid process

Let u0L ∈ VL be an initial guess for the exact discrete solution uL ∈ VL on mesh level L
(For example, u0L = 0 or u0L = uL−1 if such a coarse-grid solution is available.). Then,
u0L is “smoothed” (“pre-smoothed”) by applying ν steps, e. g., of the damped Richardson
iteration starting from ū0L := u0L . This reads in variational from as follows:
5.1 Multigrid methods for linear systems 191

# $
L , ϕL )L2 + θL (f, ϕL )L2 − a(ūL , ϕL )
(ūkL , ϕL )L2 = (ūk−1 ∀ϕL ∈ VL ,
k−1
(5.1.11)

where θL = λmax (Ah )−1 . For the resulting smoothed approximation ūνL , we define the
“defect” dL ∈ VL (without actually computing it) as follows:

(dL , ϕL )L2 := (f, ϕL )L2 − a(ūνL , ϕL ), ϕL ∈ V L . (5.1.12)

Since VL−1 ⊂ VL , we obtain the “defect equation” on the next coarser mesh TL−1 as

a(qL−1 , ϕL−1 ) = (dL , ϕL−1 )L2 = (f, ϕL−1 )L2 − a(ūνL , ϕL−1 ) ∀ϕL−1 ∈ VL−1 . (5.1.13)

The correction qL−1 ∈ VL−1 is now computed either exactly (for instance by a direct
0
solver) or only approximately by a defect correction iteration qL−1 → qL−1
R
using the
sequence of coarser meshes TL−2 , ..., T0 . The result qL−1 ∈ VL−1 is then interpreted as
R

an element of VL and used for correcting the preliminary approximation ūνL :

ū¯0L := ūνL + ωL qL−1


R
. (5.1.14)

The correction step may involve a damping parameter ωL ∈ (0, 1] in order to minimize
the residual of ū¯L . This practically very useful trick will not be further discussed here,
i. e., in the following, we will mostly set ωL = 1 . The obtained corrected approximation
ū¯L is again smoothed (“post-smoothing”) by applying another μ steps of the damped
Richardson iteration starting from ū¯0L := ū¯L :
# $
L , ϕL )L2 + θL (f, ϕL )L2 − a(ūL , ϕL )
(ū¯kL , ϕL )L2 = (ū¯k−1 ∀ϕL ∈ VL .
¯k−1 (5.1.15)

The result is then accepted as the next multigrid iterate, u1L := ū¯μL , completing one step
of the multigrid iteration (“multigrid cycle”) on mesh level L. Each such cycle consists of
ν + μ Richardson smoothing steps (on level L), which each requires the inversion of the
mass matrix Mh , and the solution of the correction equation on the next coarser mesh.
Now, we will formulate the multigrid algorithm using a more abstract, functional
analytic notation, in order to better understand its structure and to ease its convergence
analysis. To the system matrices Al = Ahl on the sequence of meshes Tl , l = 0, 1, . . . , L,
we associate operators Al : Vl → Vl by setting

(Al vl , wl )L2 = a(vl , wl ) = (Al yl , zl )2 ∀vl , wl ∈ Vl , (5.1.16)

where vl = (yl,i )ni=1


l
, wl = (zl,i )ni=1
l
. Further, let Sl (·) denote the corresponding smoothing
operations with (linear) iteration operators (Richardson operator) Sl = Il −θl Al : Vl → Vl
where Al is the “system operator” defined above and Il denotes the identity operator
on Vl . Finally, we introduce the operators by which the transfers of functions between
consecutive subspaces are accomplished:

rll−1 : Vl → Vl−1 (“restriction”), pll−1 : Vl−1 → Vl (“prolongation”).

In the finite element context these operators are naturally chosen as rll−1 = Pl−1 , the L2
projection onto Vl−1 , and pll−1 = id., the natural embedding of Vl−1 ⊂ Vl into Vl .
192 Multigrid Methods

Now, using this notation, we reformulate the multigrid process introduced above for
solving the linear system on the finest mesh TL :

AL uL = fL := PL f. (5.1.17)

Multigrid process: Starting from an initial vector u0L ∈ VL iterates utL are constructed
by the recursive formula
(t+1) (t)
uL = MG(L, uL , fL ). (5.1.18)
(t)
Let the t-th multigrid iterate uL be determined.
Coarse grid solution: For l = 0 , the operation MG(0, 0, g0) yields the exact solution of
the system A0 v0 = g0 (obtained for instance by a direct method),

v0 = MG(0, ·, g0) = A−1


0 g0 . (5.1.19)

Recursion: Let for some 1 ≤ l ≤ L the system Al vl = gl to be solved. With parameter


values ν, μ ≥ 1 the value

MG(l, vl0 , gl ) := vl1 ≈ vl (5.1.20)

is recursively definined by the following steps:


i) Pre-smoothimg:
v̄l := Slν (vl0 ); (5.1.21)

ii) Defect formation:

dl := gl − Al v̄l ; (5.1.22)

iii) Restriction:

d˜l−1 := rll−1 dl ; (5.1.23)


0
iv) Defect equation: Starting from ql−1 := 0 for 1 ≤ r ≤ R the iteration proceeds as
follows:
r−1 ˜
r
ql−1 := MG(l − 1, ql−1 , dl−1 ); (5.1.24)
(5.1.25)

v) Prolongation:

ql := pll−1 ql−1
R
; (5.1.26)

vi) Correction: With a damping parameter ωl ∈ (0, 1],

v̄¯l := v̄l + ωl ql ; (5.1.27)


5.1 Multigrid methods for linear systems 193

vii) Post-smoothing:

vl1 := Slμ (v̄¯l ); (5.1.28)

In case l = L , one sets:

ut+1 1
L := vl . (5.1.29)

We collect the afore mentioned algorithmic steps into a compact systematics of the multi-
grid cycle utL → ut+1
L :

utL → ūtL = SLν (utL ) → dL = fL − AL ūtL

↓ d˜L−1 = rLL−1 dL−1 (restriction)

qL−1 = Ã−1 ˜
L−1 dL−1 (R-times defect correction)

↓ q̃L = pLL−1 qL−1 (prolongation)


ū¯tL = ūtL + ωL q̃L → ut+1
L = SLμ (ū¯tL ).

If the defect equation AL−1 qL−1 = d˜L−1 on the coarser mesh TL−1 is solved “exactly”, one
speaks of a “two-grid method”. In practice, the two-grid process is continued recursively
to the “multigrid method” up to the coarsest mesh T0 . This process can be organized
in various ways depending essentially on the choice of the iteration parameter R , which
determines how often the defect correction step is repeated on each mesh level. In practice,
for economical reasons, only the cases R = 1 and R = 2 play a role. The schemes of the
corresponding multigrid cycles, the “V-cycle” and the “W-cycle”, are shown in Fig. 5.1.
Here, “•” represent “smoothing” and “defect correction” on the meshes Tl , and lines
“−” stand for the transfer between consecutive mesh levels.
v4
v3
v2
v1
v0
v4
v3
v2
v1
v0
Figure 5.1: Scheme of a multigrid algorithm organized as V- (top left), F- (top right), and
W-cycle (bottom line).
194 Multigrid Methods

The V-cycle is very efficient (if it works at all), but often suffers from instabilities
caused by irregularities in the problem to be solved, such as strong anisotropies in the
differential operator, boundary layers, corner singularities, nonuniformities and deteriora-
tions in the meshes (local mesh refinement and cell stretching), etc.. In contrast to that,
the W-cycle is more robust but usually significantly more expensive. Multigrid methods
with R ≥ 3 are too inefficient. A good compromise between robustness and efficiency is
provided by the so-called “F-cycle” shown in Fig. 5.1. This process is usually started on
the finest mesh TL with an arbitrary initial guess u0L (most often u0L = 0). However,
for economical reasons, one may start the whole multigrid process on the coarsest mesh
T0 and then use the approximate solutions obtained on intermediate meshes as starting
values for the iteration on the next finer meshes. This “nested” version of the multigrid
method will be studied more systematically below.
Nested multigrid: Starting from some initial vector u0 := A−1 0 f0 on the coarsest mesh
T0 , for l = 1, ..., L, successively approximations ũl ≈ ul are computed by the following
recursion:

u0l = pll−1 ũl−1


utl = MG(l, ut−1
l , fl ), t = 1, ..., tl , utl l − ul L2 ≤ ĉ h2l f L2 ,
ũl = utl l .

Remark 5.1: Though the multigrid iteration in V-cycle modus may be unstable, it can
be used as preconditioners for an “outer” CG (in the symmetric case) or GMRES iteration
(in the nonsymmetric case). This approach combines the robustness of the Krylov space
method with the efficiency of the multigrid methods and has been used very successfully
for the solution of various nonstandard problems, involving singularities, indefiniteness,
saddle-point structure, and multi-physics coupling.

Remark 5.2: There is not something like the multigrid algorithm. The successful real-
ization of the multigrid concept requires a careful choice of the various components such
as the “smoother” Sl , the coarse-mesh operators Al , and the mesh-transfer operators
rll−1 , pll−1 , specially adapted to the particular structure of the problem considered. In the
following, we discuss these algorithmic componenents in the context of the finite element
discretization, e. g., of the model problem from above.

i) Smoothers: “Smoothers” are usually simple fixed-point iterations, which could princi-
pally also be used as “solvers”, but with a very bad convergence rate. They are applied
on each mesh level only a few times (ν, μ ∼ 1 − 4), for damping out the high-frequency
components in the errors or the residuals. In the following, we consider the damped
Richardson iteration with iteration matrix

Sl := Il − θl Al , θl = λmax (Al )−1 , (5.1.30)

as smoother, which, however, only works for very simple (scalar) and non-degenerate
problems.
5.1 Multigrid methods for linear systems 195

Remark 5.3: More powerful smoothers are based on the Gauß-Seidel and the ILU iter-
ation. These methods also work well for problems with certain pathologies. For example,
in case of strong advection in the differential equation, if the mesh points are numbered
in the direction of the transport, the system matrix has a dominant lower triangular part
L , for which the Gauß-Seidel method is nearly “exact”. For problems with degenerate
coefficients in one space direction or on strongly anisotropic meshes the system matrix has
a dominant tridiagonal part, for which the ILU method is nearly “exact”. For indefinite
saddle-point problems certain “block” iterations are used, which are specially adapted to
the structure of the problem. Examples are the so-called “Vanka-type” smoothers, which
are used in solving the “incompressible” Navier-Stokes equations in Fluid Mechanics.

ii) Grid transfers: In the context of a finite element discretization with nested subspaces
V0 ⊂ V1 ⊂ ... ⊂ Vl ⊂ ... ⊂ VL the generic choice of the prolongation pll−1 : Vl−1 → Vl
is the cellwise embedding, and of the restriction rll−1 : Vl → Vl−1 the L2 projection. For
other discretizations (e. g., finite difference schemes), one uses appropriate interpolation
operators (e. g., bi/trilinear interpolation).
iii) Corse-grid operators: The operators Al on the several spaces Vl do not need to
correspond to the same discretization of the original “continuous” problem. This as-
pect becomes important in the use of mesh-dependent numerical diffusion (“upwinding”,
“streamline diffusion”, etc.) for the treatment of stronger transport. Here, we only con-
sider the ideal case that all Al are defined by the same finite element discretization on
the mesh family {Tl }l=0,...,L . In this case, we have the following useful identity:

(Al−1vl−1 , wl−1)L2 = a(vl−1 , wl−1)


= a(pll−1 vl−1 , pll−1wl−1 ) (5.1.31)
= (Al pll−1 vl−1 , pll−1 wl−1 )L2 = (rll−1Al pll−1 vl−1 , wl−1 )L2 ,

for all wl−1 ∈ Vl−1 , which means that

Al−1 = rll−1 Al pll−1 . (5.1.32)

iv) Coarse-grid correction: The correction step contains a damping parameter ωl ∈


(0, 1]. It has proved very useful in practice to choose this parameter such that the defect
Al v̄¯l − d˜l−1 becomes minimal. This leads to the formula

(Al v̄¯l , d˜l−1 − Al v̄¯l )L2


ωl = . (5.1.33)
Al v̄¯l 2L2

In the following analysis of the multigrid process, for simplicity, we will make the choice
ωl = 1.

5.1.2 Convergence analysis

The classical analysis of the multigrid process is based on its interpretation as a defect-
correction iteration and the concept of recursive application of the two-grid method. For
196 Multigrid Methods

simplicity, we assume that only pre-smoothing is used, i. e., ν > 0, μ = 0, and that in the
correction step no damping is applied, i. e., ωl = 1 . Then, the two-grid algorithm can be
written in the following form:
−1 L−1
 
L = SL (uL ) + pL−1 AL−1 rL
ut+1 fL − AL SLν (utL )
ν t L
 
= SLν (utL ) + pLL−1 A−1
L−1 rL AL uL − SL (uL ) .
L−1 ν t

Hence, for the iteration error etL := utL − uL there holds


  ν t 
et+1
L = IL − pLL−1 A−1L−1 rL AL SL (uL ) − uL .
L−1
(5.1.34)

The smoothing operation is given in (affin)-lineare form as

SL (vL ) := SL vL + gL ,

and as fixed-point iteration satisfies SL (uL ) = uL . From this, we conclude that


 
SLν (utL ) − uL = SL SLν−1 (utL ) − uL = ... = SLν etL .

With the so-called “two-grid operator”

ZGL (ν) := (IL − pLL−1 A−1


L−1 rL AL )SL
L−1 ν

there consequently holds

et+1 t
L = ZGL (ν)eL . (5.1.35)

Theorem 5.1 (Two-grid convergence): For sufficiently frequent smoothimg, ν > 0 ,


the two-grid method converges with a rate independent of the mesh level L :

ZGL (ν)L2 ≤ ρZG (ν) = c ν −1 < 1 . (5.1.36)

Proof. We write

ZGL (ν) = (A−1 L −1 L−1


L − pL−1 AL−1 rL )AL SL
ν
(5.1.37)

and estimate as follows:

ZGL (ν)L2 ≤ A−1 L −1 L−1


L − pL−1 AL−1 rL L2 AL SL L2 .
ν
(5.1.38)

The first term on the right-hand side describes the quality of the approximation of the fine-
grid solution on the next coarser mesh, while the second term represents the smoothing
effect. The goal of the further analysis is now to show that the smoother SL (·) possesses
the so-called “smoothing property”,

AL SLν vL L2 ≤ cs ν −1 h−2


L vL L2 , vL ∈ VL , (5.1.39)
5.1 Multigrid methods for linear systems 197

and the coarse-grid correction possesses the so-called “approximation property” ,

(A−1 L −1 L−1
L − pL−1 AL−1 rL )vL L2 ≤ ca hL vL L2 ,
2
vL ∈ VL , (5.1.40)

with positive constants cs , ca independent of the mesh level L . Combination of these two
estimates then yields the asserted estimate (5.1.36). For sufficiently frequent smoothing,
we have ρZG := cν −1 < 1 and the two-grid algorithm converges with a rate uniformly
w.r.t. the mesh level L . All constants appearing in the following are independent of L .
i) Smoothing property: The selfadjoint operator AL possesses real, positive eigenvalues
0 < λ1 ≤ ... ≤ λi ≤ ... ≤ λnL =: ΛL and a corresponding  L2 -ONS of eigenfunctions
{w , ..., w } , such that each vL ∈ VL can be written as vL = ni=1
1 nL L
γi w i , γi = (vL , w i )L2 .
For the Richardson iteration operator,

SL := IL − θL AL : VL → VL , θL = Λ−1
L , (5.1.41)

there holds

nL  λi  ν i
AL SLν vL = γ i λi 1 − w, (5.1.42)
i=1
ΛL

and, consequently,

nL  λi 2ν
AL SLν vL 2L2 = γi2 λ2i 1 −
i=1
ΛL
 λ 2  λi 2ν  
nL
i
≤ Λ2L max 1− γi2
1≤i≤nL ΛL ΛL i=1
 λ 2  λi 2ν 
i
= Λ2L max 1− vL 2L2 .
1≤i≤nL ΛL ΛL
By the relation (exercise)

max {x2 (1 − x)2ν } ≤ (1 + ν)−2 (5.1.43)


0≤x≤1

it follows that

AL SLν vL 2L2 ≤ Λ2L (1 + ν)−2 vL 2L2 . (5.1.44)

The relation ΛL ≤ ch−2


L eventually implies the asserted inequality for the Richardson
iteration operator:

AL SLν L2 ≤ cs ν −1 h−2


L , ν ≥ 1. (5.1.45)

ii) Approximation property: We recall that in the present context of nested subspaces Vl
prolongationen and restriction operators are given by

pLL−1 = id. (identity), rLL−1 = PL−1 (L2 projection).


198 Multigrid Methods

Further, the operator AL : VL → VL satisfies

(AL vL , ϕL )L2 = a(vL , ϕL ), vL , ϕL ∈ VL .

For an arbitrary but fixed fL ∈ VL and functions vL := A−1 −1 L−1


L fL , vL−1 := AL−1 rL fL
there holds:

a(vL , ϕL ) = (fL , ϕL )L2 ∀ϕL ∈ VL ,


a(vL−1 , ϕL−1) = (fL , ϕL−1 )L2 ∀ϕL−1 ∈ VL−1 .

To the function vL ∈ VL , we associate a function v ∈ V ∩ H 2 (Ω) as solution of the


“continuous” boundary value problem

Lv = fL in Ω, v = 0 on ∂Ω, (5.1.46)

or in “weak” formulation

a(v, ϕ) = (fL , ϕ)L2 ∀ϕ ∈ V. (5.1.47)

For this auxiliary problem, we have the following a priori estimate

∇2 vL2 ≤ cfL L2 . (5.1.48)

There holds

a(vL , ϕL ) = (fL , ϕL )L2 = a(v, ϕL ), ϕL ∈ VL ,


a(vL−1 , ϕL−1 ) = (fL , ϕL−1 )L2 = a(v, ϕL−1 ), ϕL−1 ∈ VL−1 ,

i. e., vL and vL−1 are the Ritz projections of v into VL and VL−1 , respectively. For
these the following L2 -error estimates hold true:

vL − vL2 ≤ ch2L ∇2 vL2 , vL−1 − vL2 ≤ ch2L−1 ∇2 vL2 . (5.1.49)

This together with the a priori estimate (5.1.48) and observing hL−1 ≤ 4hL implies that

vL − vL−1 L2 ≤ ch2L ∇2 vL2 ≤ ch2L fL L2 , (5.1.50)

and, consequently,

A−1 L −1 L−1
L fL − pL−1 AL−1 rL fL L2 ≤ chL fL L2 .
2
(5.1.51)

From this, we obtain the desired estimate

A−1 L −1 L−1
L − pL−1 AL−1 rL L2 ≤ chL ,
2
(5.1.52)

which completes the proof. Q.E.D.


5.1 Multigrid methods for linear systems 199

The foregoing result for the two-grid algorithm will now be used for inferring the
convergence of the full multigrid method.

Theorem 5.2 (Multigrid conver- gence): Suppose that the two-grid algorithm con-
verges with rate ρZG (ν) → 0 for ν → ∞ , uniformly with respect to the mesh level L .
Then, for sufficiently frequent smoothing the multigrid method with R ≥ 2 (W-cycle)
converges with rate ρM G < 1 independent of the mesh level L,

uL − MG(L, utL , fL )L2 ≤ ρMG uL − utL L2 . (5.1.53)

Proof. The proof is given by induction with respect to the mesh level L . We consider
only the relevant case R = 2 (W-cycle) and, for simplicity, will not try to optimize the
constants occurring in the course of the argument. Let ν be chosen sufficiently large such
that the convergence rate of the two-grid algorithm is ρZG ≤ 1/8 . We want to show that
then the convergence rate of the full multigrid algorithm is ρM G ≤ 1/4, uniformly with
respect to the mesh level L . For L = 1 this is obviously fulfilled. Suppose now that
also ρM G ≤ 1/4 for mesh level L − 1. Then, on mesh level L , starting from the iterate
utL , with the approximative solution qL−1
2
(after 2-fold application of the coarse-mesh
correction) and the exact solution q̂L−1 of the defect equation on mesh level L − 1 , there
holds

L = MG(L, uL , fL ) = ZG(L, uL , fL ) + pL−1 (qL−1 − q̂L−1 ).


ut+1 t t L 2
(5.1.54)

According to the induction assumption (observing that the starting value of the multigrid
iteration on mesh level L − 1 is zero and that ρ̂L−1 = A−1 L−1
L−1 rL dL ) it follows that

q̂L−1 − qL−1
2
L2 ≤ ρ2MG q̂L−1 L2 = ρ2MG A−1
L−1 rL AL SL (uL − uL )L2 .
L−1 ν t
(5.1.55)

Combination of the last two relations implies for the iteration error etL := utL − uL that
 −1 L−1
 t
et+1
L L2 ≤ ρZG + ρMG AL−1 rL AL SL L2 eL L2 .
2 ν
(5.1.56)

The norm on the right-hand side has been estimated already in connection with the
convergence analysis of the two-grid algorithm. Recalling the two-grid operator

ZGL = (A−1 L −1 L−1 ν ν L−1 −1 L−1


L − pL−1 AL−1 rL )AL SL = SL − pL AL−1 rL AL SL ,
ν

there holds
A−1
L−1 rL AL SL = SL − ZGL ,
L−1 ν ν

und, consequently,

A−1
L−1 rL AL SL L2 ≤ SL L2 + ZGL L2 ≤ 1 + ρZG ≤ 2.
L−1 ν ν
(5.1.57)

This eventually implies


  t
L L2 ≤ ρZG + 2ρMG eL L2 .
et+1 2
(5.1.58)
200 Multigrid Methods

By the assumption on ρZG and the induction assumption, we conclude


1  t
L L2 ≤ 8 + 2 16 eL L2 ≤ 4 eL L2 ,
et+1 1 1 t
(5.1.59)

which completes the proof. Q.E.D.

Remark 5.4: For well-conditioned problems (symmetric, positive definite operator, reg-
ular coefficients, quasi-uniform meshes, etc.) one achieves multigrid convergence rates in
the range ρM G = 0, 05 − 0, 5 . The above analysis only applies to the W-cycle since in
part (ii), we need that R ≥ 2 . The V-cycle cannot be treated on the basis of the two-
grid analysis. In the literature there are more general approaches, which allow to prove
convergence of multigrid methods under much weaker conditions.

Next, we analyze the computational complexity of the full multigrid algorithm. For
this, we introduce the following notation:

OP(T ) = number of a. op. for performing the operation T,


R = number of defect-correction steps on the different mesh levels,
nl = dim Vl ≈ h−d
l (d = space dimension),
κ = max nl−1 /nl < 1,
1≤l≤L

C0 = OP(A−1
0 )/n0 ,
Cs = max {OP(Sl )/nl }, Cd = max {OP(dl )/nl },
1≤l≤L 1≤l≤L

Cr = max {OP(rl )/nl }, Cp = max {OP(pl )/nl }.


1≤l≤L 1≤l≤L

In practice mostly κ ≈ 2−d , Cs ≈ Cd ≈ Cr ≈ Cp ≈ #{anm = 0} and C0 n0  nL .

Theorem 5.3 (Multigrid complexity): Under the condition q := Rκ < 1, for the full
multigrid cycle MGL there holds

OP(MGL ) ≤ CL nL , (5.1.60)

where
(ν + μ)Cs + Cd + Cr + Cp
CL = + C0 q L .
1−q
The multigrid algorithm for approximating the nL -dimensional discrete solution uL ∈ VL
on the finest mesh TL within discretization accuracy O(h2L ) requires O(nL ln(nL )) a. op.,
and therefore has (almost) optimal complexity.

Proof. We set Cl := OP(MGl )/nl . One multigrid cycle contains the R-fold application
of the same algorithm on the next coarser mesh. Observing nl−1 ≤ κnl and setting
Ĉ := (ν + μ)Cs + Cd + Cr + Cp it follows that

CL nL = OP(MGL ) ≤ ĈnL + R · OP(MGL−1 ) = ĈnL + R · CL−1 nL−1 ≤ ĈnL + qCL−1 nL ,


5.1 Multigrid methods for linear systems 201

and consequently CL ≤ Ĉ + qCL−1 . Recursive use of this relation yields


 
CL ≤ Ĉ + q Ĉ + qCL−2 = Ĉ(1 + q) + q 2 CL−1
..
.

≤ Ĉ(1 + q + q 2 + ... + q L−1 ) + q L C0 ≤ + q L C0 .
1−q

This implies the asserted estimate (5.1.60). The total complexity of the algotrithm then
results from the relations
−2/d ln(nL )
ρtMG ≈ h2L ≈ nL , t≈− .
ln(ρMG )

The proof is complete. Q.E.D.


It should be emphasized that in the proof of (5.1.60) the assumption

q := Rκ = R max nl−1 /nl < 1 (5.1.61)


1≤l≤L

is essential. This means for the W-cycle (R = 2) that by the transition from mesh Tl−1
to the next finer mesh Tl the number of mesh points (dimension of spaces) sufficiently
increases, comparibly to the situation of uniform mesh refinement,

nl ≈ 4nl−1 . (5.1.62)

Remark 5.5: In an adaptively driven mesh refinement process with only local mesh
refinement the condition (5.1.61) is usually not satisfied. Mostly only nl ≈ 2nl−1 can be
expected. In such a case the multigrid process needs to be modified in order to preserve
good efficiency. This may be accomplished by applying the cost-intensive smoothing only
to those mesh points, which have been newly created by the transition from mesh Tl−1 to
mesh Tl . The implementation of a multigrid algorithm on locally refined meshes requires
much care in order to achieve optimal complexity of the overall algorithm.

The nested multigrid algorithm turns out to be really complexity optimal, as it requires
only O(nL ) a. op. for producing a sufficiently accurate approximation to the discrete
solution uL ∈ VL .

Theorem 5.4 (Nested multigrid): The nested multigrid algorithm is of optimal com-
plexity, i. e., it delivers an approximation to the discrete solution uL ∈ VL on the finest
mesh TL with discretization accuracy O(h2L ) with complexity O(nL ) a. op. independent
of the mesh level L.

Proof. The accuracy requirement for the multigrid algorithm on mesh level TL is

etL L2 ≤ ĉh2L f L2 . (5.1.63)


202 Multigrid Methods

i) First, we want to show that, under the assumptions of Theorem 5.2, the result (5.1.63)
is achievable by the nested multigrid algorithm on each mesh level L by using a fixed
(sufficiently large) number t∗ of multigrid cycles. Let etL := utL −uL be again the iteration
error on mesh level L . By assumption et0 = 0, t ≥ 1 . In case u0L := utL−1 there holds

etL L2 ≤ ρtMG e0L L2 = ρtMG utL−1 − uL L2


 
≤ ρtMG utL−1 − uL−1 L2 + uL−1 − uL2 + u − uL L2
 
≤ ρtMG etL−1 L2 + ch2L f L2 .

Recursive use of this relation for L ≥ l ≥ 1 then yields (observing hl ≤ κl−L hL )


  
etL L2 ≤ ρtMG ρtMG big(etL−2 L2 + ch2L−1 f L2 + ch2L f L2
..
.
 t 
≤ ρLt
MG e0 L2 + cρMG hL + cρMG hL−1 + ... + cρMG h1 f L2
t 2 2t 2 Lt 2
 
= ch2L κ2 ρtMG κ−2·1 + ρ2t
MG κ
−2·2
+ ... + ρLt
MG κ
−2L
f L2
−2 t
κ ρMG
≤ ch2L κ2 f L2 ,
1 − κ−2 ρtMG

provided that κ−2 ρtMG < 1. Obviously there exists a t∗ , such that (5.1.63) is satisfied for
t ≥ t∗ uniformly with respect to L .
ii) We can now carry out the complexity analysis. Theorem 5.3 states that one cycle
of the simple multigrid algorithm MG(l, ·, ·) on the l-th mesh level requires Wl ≤ c∗ nl
a. op. (uniformly with respect to l ). Let now Ŵl be the number of a. op. of the nested
multigrid algorithm on mesh level l . Then, by construction there holds

ŴL ≤ ŴL−1 + t∗ WL ≤ ŴL−1 + t∗ c∗ nL .

Iterating this relation, we obtain with κ := max1≤l≤L nl−1 /nl < 1 that

ŴL ≤ ŴL−1 + t∗ c∗ nL ≤ ŴL−2 + t∗ c∗ nL−1 + t∗ c∗ nL


..
.
ct∗ c∗
≤ t∗ c∗ {nL + ... + n0 } ≤ ct∗ c∗ nL {1 + ... + κL } ≤ nL ,
1−κ
what was to be shown. Q.E.D.

5.2 Multigrid methods for eigenvalue problems (a short review)

The application of the “multigrid concept” to the solution of high-dimensional eigenvalue


problems can follow different pathes. First, there is the possibility of using it directly
for the eigenvalue problem based on its reformulation as a nonlinear system of equations,
which allows for the formation of “residuals”. Second, the multigrid concept may be
5.2 Multigrid methods for eigenvalue problems (a short review) 203

used as components of other iterative methods, such as the Krylov space methods, for
accelerating certain computation-intensive substeps. In the following, we will only briefly
describe these different approaches.

5.2.1 Direct multigrid approach

The algebraic eigenvalue problem

Az = λz, λ ∈ C, z ∈ Cn , z2 = 1, (5.2.64)

is equivalent to the following nonlinear system of equations


, 8
Az − λz
= 0. (5.2.65)
z22 − 1

To this system, we may apply a nonlinear version of the multigrid method described in
Section 5.1 again yielding an algorithm of optimal complexity, at least in principly (for
details see, e. g., Brand et al. [27] and Hackbusch [37]). However, this approach suffers
from stability problems in case of irregularities of the underlying continuous problem,
such as anisotropies in the operator, the domain or the computational mesh, which may
spoil the convergence of the method or render it inefficient. One cause may be the lack
of approximation property in case that the continuous eigenvalue problem is not well
approximated on coarser meshes, which is essential for the convergence of the multigrid
method. The possibility of such a pathological situation is illustrated by the following
example, which suggests to use the multigrid concept not directly but rather for accel-
erating the cost-intensive components of other more robust methods such as the Krylov
space methods (or the Jacobi-Davidson method) described above.

Example 5.1: We consider the following non-symmetric convection-diffusion eigenvalue


problem on the unit square Ω = (0, 1)2 ∈ R2 :

−νΔu + b · ∇u = λu, in Ω, u = 0, on ∂Ω, (5.2.66)

with coefficients ν > 0 and c = (c1 , c2 ) ∈ R2 . The (real) eigenvalues are explicitly given
by
b2 + b22
λ= 1 + νπ 2 (n21 + n22 ), n1 , n2 ∈ N,

with corresponding (non-normalized) eigenfunctions
b x + b x 
1 1 2 2
w(x1 , x2 ) = exp sin(n1 πx1 ) sin(n2 πx2 ).

The corresponding adjoint eigenvalue problem has the eigenfunctions
 b x +b x 
w ∗ (x1 , x2 ) = exp −
1 1 2 2
sin(n1 πx1 ) sin(n2 πx2 ).

204 Multigrid Methods

This shows, first, that the underlying differential operator in (5.2.66) is non-normal and
secondly that the eigenfunctions develop strong boundary layers for small parameter val-
ues ν (transport-dominant case). In particular, the eigenvalues depend very strongly
on ν. For the “direct” application of the multigrid method to this problem, this means
that the “coarse-grid problems”, due to insufficient mesh resolution, have completely dif-
ferent eigenvalues than the “fine-grid” problem, leading to insufficient approximation for
computing meaningful corrections. This renders the multigrid iteration, being based on
successive smoothing and coarse-grid correction, inefficient and may even completely spoil
convergence.

5.2.2 Accelerated Arnoldi and Lanczos method

The most computation-intensive part of the Arnoldi and Lanczos methods in the case of
the approximation of the smallest (by modulus) eigenvalues of a high-dimensional matrix
A ∈ Kn×n is the generation of the Krylov space

Km = span{q, A−1q, . . . , (A−1 )m−1 q},

which requires the solution of a small number m  n but high-dimensional linear systems
with A as coefficient matrix. Even though the Krylov space does not need to be explicitly
constructed in the course of the modified Gram-Schmidt algorithm for the generation
of an orthonormal basis {q 1 , . . . , q m } of Km , this process requires the same amount
of computation. This computational “acceleration” by use of multigrid techniques is
exploited in Section 4.3.2 on the computation of pseudospectra. We want to illustrate this
for the simpler situation of the “inverse iteration” for computing the smallest eigenvalue
of a symmetric, positive definite matrix A ∈ Rn×n .
Recall Example 4.1 in Section 4.1.1, the eigenvalue problem of the Laplace operator
on the unit square:

∂2w ∂2w
− 2
(x, y) − (x, y) = μw(x, y) for (x, y) ∈ Ω,
∂x ∂y 2 . (5.2.67)
w(x, y) = 0 for (x, y) ∈ ∂ Ω.

The discretization of this eigenvalue problem by the 5-point difference scheme on a uniform
Cartesian mesh or the related finite element method with piecewise linear trial functions
leads to the matrix eigenvalue problem

Az = λz, λ = h2 μ, (5.2.68)

with the same block-tridiagonal matrix A as occurring in the corresponding discretization


of the boundary value problem discussed in Section 3.4. We are interested in the smallest
eigenvalue λ1 = λmin of A , which by h−2 λmin ≈ μmin yields an approximation to the
smallest eigenvalue of problem (5.2.67). For λ1 and the next eigenvalue λ2 > λ1 there
holds
λ1 = 2π 2 h2 + O(h4 ), λ2 = 5π 2 h2 + O(h4 ).
5.2 Multigrid methods for eigenvalue problems (a short review) 205

For computing λ1 , we may use the inverse iteration with shift λ = 0 . This requires in
each step the solution of a problem like

Az̃ t = z t−1 , z t := z̃ t −1 t


2 z̃ . (5.2.69)

For the corresponding eigenvalue approximation

1 (A−1 z t , z t )2
:= = (z̃ t+1 , z t )2 , (5.2.70)
λt1 z t 22

there holds the error estimate (see exercise in Section 4.1.1)


1 1   1  z 0 22  λ2 2t

 t − ≤  , (5.2.71)
λ1 λ1 λ1 |α1 |2 λ1

where α1 is the coefficient in the expansion of z 0 with respect to the eigenvector w 1.


From this relation, we infer that

z 0 22  λ2 2t
|λ1 − λt1 | ≤ λt1 . (5.2.72)
|α1 |2 λ1

Observing that λt1 ≈ λ1 ≈ h2 and h2 z 0 22 = h2 ni=1 |zi0 |2 ≈ v 02L2 , where v 0 ∈ H01 (Ω)
is the continuous eigenfunction corresponding to the eigenvector z 0 , we obtain
 λ 2t
2
|λ1 − λt1 | ≤ c ≤ c 0.42t . (5.2.73)
λ1

i. e., the convergence is independent of the mesh size h or the dimension n = m2 ≈ h−2
of A. However, in view of the relation μ1 = h−2 λ1 achieving a prescribed accuracy in the
approximation of μ1 requires the scaling of the tolerance in computing λ1 by a factor
h2 , which introduces a logarithmic h-dependence in the work count of the algorithm,

log(εh2 )
t(ε) ≈ ≈ log(n). (5.2.74)
log(2/5)

Now, using a multigrid solver of optimal complexity O(n) in each iteration step (4.1.20)
the total complexity of computing the smallest eigenvalue λ1 becomes O(n log(n)) .

Remark 5.6: For the systematic use of multigrid acceleration within the Jacobi-Davidson
method for nonsymmetric eigenvalue problems, we refer to Heuveline & Bertsch [41]. This
combination of a robust iteration and multigrid acceleration seems presently to be the
most efficient approach to solving large-scale symmetric or unsymmetric eigenvalue prob-
lems.
206 Multigrid Methods

5.3 Exercises

Exercise 5.1: Consider the discretization of the Poisson problem

−Δu = f, in Ω, u = 0, on ∂Ω,

on the unit square Ω = (0, 1)2 ⊂ R2 by the finite element Galerkin method using linear
shape and test functions on a uniform Cartesian triangulation Th = {K} with cells K
(rectangular triangles) of width h > 0. The lowest-order finite element space on the mesh
Th is given by

Vh = {vh ∈ C(Ω̄) | vh |K ∈ P1 (K), K ∈ Th , vh |∂Ω = 0}.

Its dimension is dimVh = nh , which coincides with the number of interior nodal points
ai , i = 1, . . . , nh , of the mesh Th . Let {ϕ1h , . . . , ϕnhh } denote the usual “nodal basis” (so-
called “Lagrange basis”) of the finite element subspace Vh defined by the interpolation
condition ϕih (aj ) = δij . Make a sketch of this situation, especially of the mesh Th and a
nodal basis function ϕih .
Then, the finite element Galerkin approximation in the space Vh as described in the text
results in the following linear system for the nodal value vector xh = (x1h , . . . , xnhh ) :

Ah xh = bh ,

with the matrix Ah = (aij )ni,j=1h


and right-hand side vector bh = (bi )ni=1
h
given by
aij = (∇ϕh , ∇ϕih )L2 and bi = (f, ϕih )L2 . Evaluate these elements aij and bi using
j

the trapezoidal rule for triangles


"
|K| 
3
w(x) dx ≈ w(ai ),
K 3 i=1

where ai , i = 1, 2, 3 , are the three vertices of the triangle K and |K| its area. This
quadrature rule is exact for linear polynomials. The result is a matrix and right-hand
side vector which are exactly the same as resulting from the finite difference discretization
of the Poisson problem on the mesh Th described in the text.

Exercise 5.2: Analyze the proof for the convergence of the two-grid algorithm given in
the text for its possible extension to the case the restriction rll−1 : Vl → Vl−1 is defined
by local bilinear interpolation rather than by global L2 -projection onto the coarser mesh
Tl−1 . What is the resulting difficulty? Do you have an idea how to get around it?

Exercise 5.3: The FE-discretization of the convection-diffusion problem

−Δu + ∂1 u = f in Ω, u = 0 on ∂Ω,

leads to asymmetric system matrices Ah . In this case the analysis of the multigrid
algorithm requires some modifications. Try to extend the proof given in the text for the
5.3 Exercises 207

convergence of the two-grid algorithm for this case if again the (damped) Richardson
iteration is chosen as smoother,

xt+1
h = xth − θt (Ah xth − bh ), t = 0, 1, 2, . . . .

What is the resulting difficulty and how can one get around it?

Exercise 5.4: Consider the discretization of the Poisson problem

−Δu = f, in Ω, u = 0, on ∂Ω,

on the unit square Ω = (0, 1)2 ⊂ R2 by the finite element Galerkin method using linear
shape and test functions. Let (Tl )l≥0 be a sequence of equidistant Cartesian meshes of
width hl = 2−l . The discrete equations on mesh level l are solved by a multigrid method
with (damped) Richardson smooting and the natural embedding as prolongation and the
L2 projection as restriction. The number of pre- and postsmoothing steps is ν = 2 and
μ = 0 , respectively. How many arithmetic operations are approximately required for a
V-cycle and a W-cycle depending on the dimension nl = dimVl ?

5.3.1 General exercises

Exercise 5.5: Give short answers to the following questions:


a) When is a matrix A ∈ Rn×n called “diagonalizable”?
b) When is a matrix A ∈ Rn×n called “diagonally dominant”?
c) What is a “normal” matrix and is a Hermitian matrix always “normal”?
d) What is the relation between the “power method” and the “inverse iteration”?
e) What is the Rayleigh quotient of a Hermitian matrix A ∈ Cn×n and a given vector
v ∈ Cn \ {0} . What is it used for?
f) What is the spectral condition number of a matrix A ∈ Cn×n ?
g) What is a “Gerschgorin circle”?
h) What is the use of “restriction” and “prolongation” within the multigrid method?
i) What is a “Krylov space” Km corresponding to a matrix A ∈ Cn×n ?
j) What does the adjective “damped” refer to within the Richardson method?
k) What is the difference between the “classical” and the “modified” Gram-Schmidt
algorithm for orthonormalization?

Exercise 5.6: Consider the following matrices:


⎡ ⎤ ⎡ ⎤ ⎡ ⎤
2 −1 0 2 −1 1 2 −1 1
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
A1 = ⎢
⎣ −1 2 −1 ⎦ ⎥ , A2 = ⎢
⎣ −1 2 −1
⎥,
⎦ A3 = ⎢ ⎥
⎣ −1 2 −1 ⎦ .
0 −1 2 1 −1 2 −1 −1 2
208 Multigrid Methods

For which of these matrices do the Jacobi method, the Gauß-Seidel method and the CG
method converge unconditionally for any given initial point u0 ?

Exercise 5.7: Derive best possible estimates for the eigenvalues of the matrix
⎡ ⎤
1 10−3 10−4
⎢ ⎥
A=⎢ ⎣ 10
−3
2 10−3 ⎥⎦
10−4 10−3 3

by the enclusion lemma of Gerschgorin. (Hint: Precondition the matrix by scaling, i. e.,
by using an appropriate similarity transformation with a diagonal matrix A → D −1 AD .)

Exercise 5.8: Formulate the power method for computing the largest (by modulus)
eigenvalue of a matrix A ∈ Cn×n . Distinguish between the case of a general matrix
and the special case of a Hermitian matrix.
i) Under which conditions is convergence guaranteed?
ii) Which of these conditions is the most critical one?
iii) State an estimate for the convergence speed.
A Solutions of exercises
In this section solutions are collected for the exercises at the end of the individual chapters.
These are not to be understood as ”‘blue-print”’ solutions but rather as suggestions in
sketchy form for stimulating the reader’s own work.

A.1 Chapter 1

Solution A.1.1: a) Follows directly from


√ 1 2
0≤ εa ± √ b , a, b ∈ R, ε ∈ R+ .
2 ε
n
b) The function f (x) = x−1 is for x > 0 convex and i=1 xi λi is a convex linear
combination. Hence, by a geometric argument, one may conclude the asserted estimate.
c) For n = 0 the statement is obvious. For n ∈ N observe that there are exactly three
1
local extrema, for x = 0 (maximum), x = 1 (maximum), and xmin = 1+n (minimum).
Furthermore,
1  1 2n 1 n2n 1
x2min (1 − xmin )2n = 1 − = ≤ .
(1 + n)2 1+n (1 + n)2 (1 + n)2n (1 + n)2

Solution A.1.2: a) Multiplying out yields

x + y2 + x − y2 = (x + y, x + y) + (x − y, x − y)
= (x, x) + (y, y) + (x, y) + (y, x) + (x, x) + (y, y) − (x, y) − (y, x)
= 2 x2 + 2 y2.

b) Let x, y ∈ Rn be arbitrary and (without loss of generality) x = y = 1.

0 ≤ (x − y, x − y) = x2 + y2 − 2 (x, y) = 2 − 2 (x, y).

Similarly

0 ≤ (x + y, x + y) = x2 + y2 + 2 (x, y) = 2 + 2 (x, y),

i. e., |(x, y)| ≤ 1.

c) The properties of a scalar product follow immediately from those of the Euclidean
scalar product and those assumed for the matrix A .
i) Yes. Let ·, · be an arbitrary scalar product and let {ei }1≤i≤n
 be a Cartesian
 basis of
Rn , such that any x, y ∈ Rn have the respresentations x = i xi ei , y = i xi ei ∈ Rn .
Define
 a matrix A ∈ Rn×n by aij := ej , ei . Then, there holds (Ax, y) = ij aij xj yi =
ij xj yi ej , ei  = x, y . Furthermore, A is obviously symmetric and positive definite
due to the same properties of the scalar product < ·, · >.

209
210 Solutions of exercises

ii) The following statements are equivalent:

1. < ·, · > : C × C → C is a (Hermitian) positive definite sesquilinear form (i.e. a


scalar product).

2. There exists a (Hermitian) positive definite matrix A ∈ Cn×n such that x, y =
(Ax, y), x, y ∈ Cn .

Solution A.1.3: a) The identity becomes obvious by replacing x by x̃ := xx−1 .


b) There holds, for x ∈ Rn \ 0:

Ax2 Ay2
Ax2 = x2 ≤ sup x = A2 x2 .
x y∈R n y2

c) There holds

ABx2 A2 Bx2 A2 B2 x2


AB2 = sup ≤ sup ≤ sup = A2 B2 .
x∈Rn x2 x∈Rn x2 x∈Rn x2

This relation is not true for any matrix norm. As a counter example, employ the elemen-
twise maximum norm Amax := maxi,j=1,··· ,n |aij | to
( ) ( ) ( )
1 1 1 0 2 0
· = .
0 0 1 0 0 0

d) Let λ be any eigenvalue of A and x a corresponding eigenvector. Then,

λx2 Ax2
|λ| = = ≤ A2 .
x2 x2

Conversely, let {ai , i = 1, · · · , n} ⊂ Cn be an ONB of eigenvectors of A and x =



i xi a ∈ C be arbitrary. Then,
i n

   
Ax2 = A xi ai 2 = λi xi ai 2 ≤ max |λi | xi ai 2 | = max |λi | x2 ,
i i
i i i

and consequently,
Ax2
≤ max |λi |.
x2 i

e) There holds

Ax22 (ĀT Ax, x)2 ĀT Ax2


A22 = max = max ≤ max = ĀT A2 .
n
x∈C \0 x2
2 x∈C n \0 x22 x∈C n \0 x2

and ĀT A2 ≤ ĀT 2 A2 = A22 (observe that A2 = ĀT 2 due to Ax2 =
ĀT x̄2 , x ∈ Cn ).
A.1 Chapter 1 211

Solution A.1.4: a) See the description in the text.


b) Let A ∈ Rn×n be symmetric and positive definite. Then, there exists an ONB
{a1 , . . . , an } of eigenvectors of A such that with the (regular) matrix B = [a1 . . . an ]
there holds A = B D B −1 with D = diagi (λi ) (λi > 0 the eigenvalues of A). Now define
 1/2 
A1/2 := B diagi λi B −1 .

This is well defined and independent of the concrete choice of B.


c) Let A ∈ Cn×n be positive definite, i. e., x̄T Ax ∈ R+ , x ∈ Cn . Then, A is necessarily
Hermitian since for x, y ∈ C arbitrary there holds:
!
(x + y)T A(x + y) ∈ R,
(x + iy)T A(x + iy) ∈ R
! T
x̄ Ax + ȳ T Ay + (x̄T Ay + ȳ T Ax) ∈ R,
=⇒
x̄T Ax + ȳ T Ay + i (x̄T Ay − ȳ T Ax) ∈ R.

Setting x = ei and y = ej , we see that aij + āji ∈ R and i(aij − āji ) ∈ R, i.e.,

Re(aji + aij ) + iIm(aji + aij ) ∈ R,


iRe(aji − aij ) + Im(aji − aij ) ∈ R.

Hence, aij = Re aij + iIm aij = Re aji − iIm aji = Re āji + iIm āji = āji .
Remark: The above argument only uses that x̄T Ax ∈ R, x ∈ Cn .
n
Solution A.1.5: a) v∞ = max1≤i≤n |vi | and v1 = i=1 |vi |.
b) The “spectrum” Σ(A) is defined as Σ(A) := {λ ∈ C, λ eigenvalue of A}.

 Ki ⊂$C, i = 1, . . . , n , are the closed discs


c) The#“Gerschgorin circles”
Ki := x ∈ C, |x − aii | ≤ j =i aij .
d) ρ(A) = max1≤i≤n {|λi |, λi eigenvalue of A}.
e) κ2 (A) = A2 A−1 2 = max1≤i≤n σi / min1≤i≤n σi , where σi are the “singular values”
of A , i. e., the square roots of the (nonnegative) eigenvalues of ĀT A).

Solution A.1.6: a) aii ∈ R follows directly from the property aii = āii of a Hermitian
matrix. Positiveness follows via testing by ei , which yields aii = ēTi Aei > 0.
b) The trace of a matrix is invariant under coordinate
 transformation,
 i. e. similarity
transformation (may be seen by direct calculation ijk bij ajk bki = i aii or by applying
the product formula for determinants to the characteristic polynomial. Observing that a
Hermitian matrix is similar to a diagonal matrix with its eigenvalues on the main diagonal
implies the asserted identity.
c) Assume that A is singular. Then ker(A) = ∅ and there exists x = 0 such that A x = 0,
i. e., zero is an eigenvalue of A. But this contradicts the statement of Gerschgorin’s Lemma
which bounds all eigenvalues away from zero due to the strict diagonal dominance. If
212 Solutions of exercises

all diagonal entries aii > 0 , then also by Gerschgorin’s lemma all Gerschgorin circles
(and consequently all eigenvalues) are contained in the right complex half-plane. If A
is additionally Hermitian, all these eigenvalues are real and positive and A consequently
positive definite.

Solution A.1.7: Define



n
S := lim Sn , Sn = Bs.
n→∞
s=0

S is well defined due to the fact that {Sn }n∈N is a Cauchy sequence with respect to the
matrix norm  ·  (and, by the norm equivalence in finite dimensional normed spaces,
with respect to any matrix norm). By employing the triangle inequality, using the matrix
norm property and the limit formula for the geometric series, one proofs that

n 
n
1 − Bn+1 1
S = lim Sn  = lim B s ≤ lim Bs = lim = .
n→∞ n→∞ n→∞ n→∞ 1 − B 1 − B
s=0 s=0

Furthermore, Sn (I − B) = I − B n+1 and due to the fact that multiplication with I − B


is continuous,
   
I = lim Sn (I − B) = lim Sn (I − B) = S(I − B).
n→∞ n→∞

Solution A.1.8: To proof the statement, we use a so-called “deformation argument”.


For t ∈ [0, 1] define the matrix

A(t) = (1 − t)diagi (aii ) + t A.

Obviously A(0) is a diagonal matrix with eigenvalues λi (0) = aii . Now observe that the
“evolution” of the ith eigenvalue λi (t) is a continuous function in t (This follows from
the fact that a root t0 of a polynomial pα is locally arbitrarily differentiable with respect
to its coefficients – a direct consequence of the implicit function theorem employed to
p(α, t) = pα (t) and a special treatment of multiple roots).
Furthermore, the Gerschgorin circles of A(t), 0 ≤ t ≤ 1 have all the same origin, only
the radii are strictly increasing. So, Gerschgorin’s Lemma implies that the image of the
function t → λi (t) lies entirely in the union of all Gerschgorin circles of A(1). And due
to the fact that it is continous obviously in the connected component containing aii .

Solution A.1.9: (i)⇒(ii): Suppose that A and B commute. First observe that for an
arbitrary eigenvector z of B with eigenvalue λ there holds:

ABz − BAz = 0 ⇒ BAz = λAz.

So, Az is either 0 or also an eigenvector of B with eigenvalue λ. Due to the fact that
B is Hermitian there exists an orthonormal basis {vi }ni=1 of eigenvectors of B . So we
A.1 Chapter 1 213

can transfrom B by a change of basis to a diagonal matrix. Furthermore, by virtue of


the observation above, A has block diagonal structure with respect to this basis, where a
single block solely acts on an eigenspace Eλ (B) for a specific
 eigenvalue λ of B. Due to the
fact that A is also hermitian, we can diagonalize AE (B) with respect to this subspace
λ 
by another change of basis. Now observe that the diagonal character of B E (B) = λI
λ
will be preserved.
(ii)⇒(i): Let O = {vi }ni=1 be the common basis of eigenvectors of A and B , one checks
that
ABvi = λA B B A
i λi vi = λi λi vi = BAvi , i = 1, . . . , n.

Consequently, ABx = BAx, x ∈ Kn , and therefore AB = BA.


(i)⇔(iii): For any two Hermitian matrices A and B there holds BA = B̄ T ĀT = AB T
and the asserted equivalence follows immediately.

Solution A.1.10: i) Let A ∈ Kn×n be an arbitrary, regular matrix and define ϕ(x, y) :=
(Ax, Ay)2 . It is clear that ϕ is a sesquilinear form. Furthermore symmetry and positivity
follow directly from the corresponding property of ( . ). For definiteness observe that
(Ax, Ax) = 0 ⇒ Ax = 0 ⇒ x = 0 due to the regularity of A.
ii) The earlier result does not contradict (i) because there holds

(Ax, Ay)2 = (x, ĀT Ay)2,

and ĀT A is a hermitian matrix.

Solution A.1.11: i) Let λ1 and λ2 be two pairs of eigenvalues with eigenvectors v 1


and v 2 . It holds:

0 = (v 1 , Av 2 ) − (v 1 , Av 2 ) = (v 1 , Av2 ) − (Av 1 , v 2 ) = (λ2 − λ1 )(v 1 , v 2 ).

So, if λ1 = λ2 it must hold that (v 1 , v 2 ) = 0. Yes, this result is true in general for normal
matrices (over C) and—more generally—known as the “spectral theorem for normal op-
erators” (see [Bosch, Lineare Algebra, p. 266, Satz 7.5/8], for details).
ii) Let v be an eigenvector for the eigenvalue λmax . There holds:

(Ax, x)2 (Av, v)2


max ≥ = λmax (A),
x∈Kn \{0} x22
v22

Conversely, for arbitrary x ∈ Kn \ {0} there exists a representation x = i xi vi with
respect to an orthonormal basis {vi }, so that
  
(Ax, x)2 (A i xi vi , i xi vi )2 λ x2
= = i i 2 i ≤ λmax (A)
x22
x22
i xi

The corresponding equality for λmin (A) follows by a similar argument. λmin(A) ≤
λmax (A) is obvious.
214 Solutions of exercises

Solution A.1.12: i) We use the definition (c) from the text for the ε-pseudo-spectrum.
Let z ∈ σε (A) and accordingly v ∈ Kn , v = 1 , satisfying (A − zI)v ≤ ε . Then,

(A−1 − z −1 I)v = z −1 A−1 (zI − A)v) ≤ |z|−1 A−1 ε.

This proves the asserted relation.


ii) To prove the asserted relation, we again use the definition (c) from the text for the
ε-pseudo-spectrum. Accordingly, for z ∈ σε (A−1 ) with |z| ≥ 1 there exists a unit vector
v ∈ Kn , v = 1 , such that

ε ≥ (zI − A−1 )v = |z|(A − z −1 I)A−1 v.

Hence, setting w := A−1 v−1 A−1 v ∈ Kn with w = 1 , we obtain

(A − z −1 I)w ≤ |z|−1 A−1 v−1ε.

Hence, observing that

A−1 v = (A−1 − zI)v + zv ≥ zv − (A−1 − zI)v ≥ |z| − ε,

we conclude that
ε ε
(A − z −1 I)w ≤ ≤ .
|z|(|z| − ε) 1−ε
This completes the proof.

A.2 Chapter 2

Solution A.2.1: a) An example for a symmetric, diagonally dominant matrix that is


indefinite is
( )
−2 1
A= .
1 2

On the other hand, a symmetric, positive definite but not (strictly) diagonally dominant
matrix is given by
⎛ ⎞
3 2 2
⎜ ⎟
B=⎜ ⎝2 3 2⎠ ,

2 2 3

or typically system matrices arising from higher order difference approximations, e. g. the
“stretched” 5-point stencil for the Laplace problem in 1D:
A.2 Chapter 2 215

⎛ ⎞
30 −16 1
⎜ ⎟
⎜−16 30 −16 1 ⎟
⎜ ⎟
⎜ 1 −16 30 −16 1 ⎟
⎜ ⎟
n 1 ⎜⎜ . .

⎟ ∈ Rn×n .
B = .
12h ⎜



⎜ 1 −16 30 −16 1 ⎟
⎜ ⎟
⎜ ⎟
⎝ 1 −16 30 −16⎠
1 −16 30

Note: To prove that the above B n ∈ Rn×n is positive definite, compute det(B k ) > 0 for
k = 1, · · · , 3 and derive a recursion formula of the form

det(B k+1 ) = 30 det(B k ) ± · · · det(B k−1 ) ± · · · det(B k−2 )

so that det(B k+1 ) > 0 follows by induction.


b) Apply the Gerschgorin lemma to the adjoint transpose ĀT . This yields that 0 ∈
/
σ(ĀT ) , i. e., ĀT is regular. Then, also A is regular.
c) All eigenvalues of the symmetric matrix A are real. Further, all Gerschgorin circles
have their centers on the positive real half axis. Hence the strict diagonal dominance
implies that all eigenvalues must be positive.

Solution A.2.2: The result of the first k − 1 elimination steps is a block matrix A(k−1)
of the form
& k−1 '
R ∗ k−1
A(k−1)
= k−1 , with A ∈ R(n−k)×(n−k) pos. def. (by induction).
0 A

The k-th elimination step reads:


(k−1) (k−1)
(k) (k−1) aik akj
aij = aij − (k−1)
, i, j = k, . . ., n.
akk
(k−1)
i) The main diagonal elements of positive definite matrices are positive, ajj > 0 . For
the diagonal elements it follows by symmetry:
(k−1) (k−1) (k−1) 2
(k) (k−1) aik aki (k−1) |aik | (k−1)
aii = aii − (k−1)
= aii − (k−1)
≤ aii , i = k, . . ., n.
akk akk
(k−1)
ii) The element with maximal modulus of a positive definite matrix A lies on the
main diagonal,
(k−1) (k−1)
max |aij | ≤ max |aii |.
k≤i,j≤n k≤i≤n
216 Solutions of exercises

(k)
The submatrix A obtained in the k-th step is again positive definite. Hence the result
(i) implies
(k) (k) (k−1) (k−1)
max |aij | ≤ max |aii | ≤ max |aii | ≤ max |aij |.
k≤i,j≤n k≤i≤n k≤i≤n k≤i,j≤n

Since in the k-th elimination step the first k − 1 rows are not changed anymore induction
with respect to k = 1, . . ., n yields:
(n−1) (0)
max |rij | = max |aij | ≤ max |aij | ≤ max |aij |.
1≤i,j≤n 1≤i,j≤n 1≤i,j≤n 1≤i,j≤n

Solution A.2.3: i) Let

L := {L ∈ Rn×n , L regular, lower-left triangular matrix mit lii = 1},


R := {R ∈ Rn×n , L regular upper-right triangular matrix}.

We have to show the following group properties for the matrix multiplication ◦ :
(G1) Closedness: L1 , L2 ∈ L ⇒ L1 ◦ L2 ∈ L .
(G2) Associative law: L1 , L2 , L3 ∈ L ⇒ L1 ◦ (L2 ◦ L3 ) = (L1 ◦ L2 ) ◦ L3 .
(G3) Neutral element I : L ∈ L ⇒ L ◦ I = L .
(G4) Inverse: L ∈ L ⇒ ∃L−1 ∈ L : L ◦ L−1 = I .
(G1) follows by computation. (G2) and (G3) follow from the properties of matrix multi-
plication. (G4) is seen through determination of the inverse by simultaneous elimination:
⎡ ⎤ ⎡ ⎤
1 0 1 0 1 0
⎢ ⎥ ⎢ ⎥
⎢ ..
.
..
. ⎥ ⇒ L−1 = ⎢ ..
. ⎥ ∈ L.
⎣ ⎦ ⎣ ⎦
0 1 ∗ 1 ∗ 1

The group L is not abelian as the following 3 × 3 example shows:


⎡ ⎤⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤⎡ ⎤
1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0
⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥
⎢ 1 1 0 ⎥⎢ 0 1 0 ⎥ = ⎢ 1 1 0 ⎥ = ⎢ 2 1 0 ⎥ ⎢ ⎥⎢ 1 0 ⎥
⎣ ⎦⎣ ⎦ ⎣ ⎦ ⎣ ⎦ = ⎣ 1 1 0 ⎦⎣ 1 ⎦.
0 0 1 0 −1 1 0 −1 1 −1 −1 1 0 0 1 −1 −1 1

The argument for R is analogous. The group R is also not abelian as the following 2 × 2
example shows:
& '& ' & ' & ' & '& '
1 1 −1 1 −1 2 −1 0 −1 1 1 1
= = = .
0 1 0 1 0 1 0 1 0 1 0 1

ii) For proving the uniqueness of the LR-decompositiong let for a regular matrix A ∈ Rn×n
two LR-decompositiongs A = L1 R1 = L2 R2 be given. Then, by (i) L1 , L2 ∈ L as well
as R1 , R2 ∈ R and consequently

R1 R2−1 = L−1 L = diag(dii ).


    1 2
∈R ∈L
A.2 Chapter 2 217

With L1 (and L2 ) also the inverse L−1


1 has ones on the main diagonal. Hence dii = 1
which finally implies R1 = R2 and L2 = L1 .

Solution A.2.4: Let A be a band matrix with ml = mr =: m (Make a sketch of this


situation.)
i) The k-th elimination step
(k−1) (k−1)
(k) (k−1) aik (k−1) (k) (k−1) aik (k−1)
aij = aij − a
(k−1) kj
, bi = bi − b
(k−1) k
, i, j = k + 1, . . ., k + m,
akk akk

requires essentially m divisions and m2 multiplications and additionas. Hence alltogether

Nband = nm2 + O(nm) a. op.,

for the n − 1 steps of the forward elimination for computing the matrix R and si-
multanously of the matrix L . For the sparse model matrix, we have Nband = 108 +
O(106) a. op. in contrast to N = 13 1012 + O(108) a. op. for a full matrix.
ii) If A is additionally symmetric (and positive definite) one obtaines the Cholesky de-
composition from the LR decomposition by

A = L̃L̃T , L̃ = LD 1/2 , D = diag(rii ).

Because of the symmetry of all resulting rduced submatrices only the elements on the
main diagonal and the upper diagonals need to be computed. This reduces the work to
Nband = 21 nm2 +O(nm) a. Op. , i. e., for the model matrix to Nband = 12 108 +O(106) a. op.,
and Nband = 12 1016 + O(1012 ) a. op., respectively.

Solution A.2.5: a) The first step of the Gaussian elimination applied on the extended
matrix [A|b] produces:
⎡ ⎤ ⎡ ⎤
1 3 −4 1 1 3 −4 1
⎢ ⎥ ⎢ ⎥
⎢ 3 9 −2 1 ⎥ ⎢ 0 0 −2 ⎥
⎢ ⎥ ⎢ 10 ⎥
⎢ ⎥ → ⎢ ⎥.
⎢ 4 12 −6 1 ⎥ ⎢ 0 0 10 −3 ⎥
⎣ ⎦ ⎣ ⎦
2 6 2 1 0 0 10 −1

The linear system is not solvable because of rank A = 2 = 3 = rank [A|b] . Observe in
particular, that A does not have full rank.
b) A straightforward calculation leads to the following normal equation:
⎡ ⎤ ⎡ ⎤
⎡ ⎤ 1 3 −4 ⎡ ⎤ ⎡ ⎤ 1
1 3 4 2 ⎢ ⎥ x1 1 3 4 2 ⎢ ⎥
⎢ ⎥ ⎢ 3 9 −2 ⎥ ⎢ ⎥ ⎢ ⎥⎢ 1 ⎥
⎢ 3 9 12 6 ⎥ ⎢ ⎥ ⎢ x ⎥ = ⎢ 3 9 12 6 ⎥⎢ ⎥.
⎣ ⎢
⎦ 4 12 −6 ⎣⎥ 2 ⎦ ⎣ ⎦⎢ 1 ⎥
⎣ ⎦ ⎣ ⎦
−4 −2 −6 2 x3 −4 −2 −6 2
2 6 2 1
218 Solutions of exercises

Hence, ⎡ ⎤⎡ ⎤ ⎡ ⎤
30 90 −30 x1 10
⎢ ⎥⎢ ⎥ ⎢ ⎥
⎢ 90 270 −90 ⎥ ⎢ x2 ⎥ = ⎢ 30 ⎥ .
⎣ ⎦⎣ ⎦ ⎣ ⎦
−30 −90 60 x3 −10
Because of Rank A = 2 < 3, the kernel of the matrix AT A ∈ R3×3 is one dimensional.
Gaussian elimination:
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
30 90 −30 10 30 90 −30 10 3 9 0 1
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ 90 270 −90 30 ⎥ → ⎢ 0 0 0 0 ⎥ → ⎢ 0 0 0 0 ⎥
⎣ ⎦ ⎣ ⎦ ⎣ ⎦
−30 −90 60 −10 0 0 30 0 0 0 1 0,

and the solution can be characetized by x = ( 13 − 3t, t, 0)T , t ∈ R .


c) The system of normal equations is solvable but the solution is not unique.
d) No. Due to the fact that A does not have full rank, the matrix AT A ∈ R3×3 cannot
be one-to-one, and can consequently be only semi-definite. (Counter example: x =
(−3, 1, 0)T is a non trivial element of the kernel of AT A.)

A.3 Chapter 3

Solution A.3.1: i) For the maximal absolute column sum it holds


3
B1 = max |aij | = 0.9 < 1.
j=1,2,3
i=1

This implies convergence due to the fact that spr(B) ≤ B1 < 1 and, hence, the iteration
is contractive. (Observe that the maximal absolute row sum does not imply convergence
because B||∞ = 1.4 > 1 .) The limit z = limt→∞ xt fullfills z = B z + c . Hence,

z = (I − B)−1 c.

ii) Let λi be the eigenvalues of B . It holds


3
λi = det(B) = −1.
i=1

This implies that at least for one of the eigenvalues there must hold |λ| ≥ 1. So, for
the choice (ii) the fixed point iteration cannot be convergent in general: In particular, if
x0 − x happens to be an eigenvector corresponding to the above eigenvalue λ it holds

xt − x = B t (x0 − x)|| = λt (x0 − x)|| = |λ|t x0 − x → 0 (t → ∞).

Solution A.3.2: For a general fixed point iteration xt+1 = Bxt + c the following error
A.3 Chapter 3 219

estimate holds true in case of convergence to a limit z:

xt − z ≤ spr(B)t x0 − z.

It follows by induction that in order to reduce the initial error by at least a factor of ε it
is necessary to perform the following number of iterations:
9 :
log10 (ε)
t= .
log10 (spr(B))

For the Jacobi- and Gauss-Seidel-Matrix it holds


& ' & '
0 1/3 0 −1/3
J= , H1 = ,
1/3 0 0 1/9

hence, spr(J) = 1/3 and spr(H1 ) = 1/9 . Therefore, the necessary number of iterations
is 9 : 9 :
6 6
tJ = + 1 = 13, and tH1 = = 7, respectively.
log10 (3) log10 (9)

Solution A.3.3: We restate the two definitions of “irreducibility”:


a) (With the help of the hint): A matrix A ∈ Rn×n is called “irreducible” if for every
partition J, K of {1, . . ., n} =: Nn with J ∪ K = Nn and J ∩ K = ∅ , so that ajk = 0
for all j ∈ J and all k ∈ K, it holds that either K = ∅ or J = ∅.
b) A matrix A ∈ Rn×n is called “irreducible” if for every pair of indices j, k ∈ Nn there
exists a set of indices {i1 , . . . , im } ∈ Nn such that aj,i1 = 0, ai1 ,i2 = 0, . . . , aim−1 ,im = 0,
aim ,k = 0.
i) (a) ⇒ (b): Let A be irreducible in the sense of (a). Furthermore, let i ∈ Nn be an
arbitrary index. Let J be the set of all indices l ∈ Nn , with the property that there
exists a sequence of indicies {i1 , . . . , im } ∈ Nn such that all ai,i1 , . . . , aim ,l = 0 . Define
its complement K := Nn \ J . In order to prove (b) we have to show that J = Nn , or
that K = ∅, respectively.
First of all, it holds that i ∈ J , so J is not empty. Furthermore, observe that for all
p ∈ K it must hold that al,p = 0 for all l ∈ J, otherwise there would exist a sequence from
i to p by expanding an arbitrary sequence ai,i1 , . . . , aim ,l from i to l (which exists by
virtue of l ∈ J) by al,p . So by irreducibility in the sense of (a) it must hold that K = ∅.
ii) (b) ⇒ (a): Let A be irreducible in the sense of (b). Let {J, K} be an arbitrary
partition in the sense of (a). Then for arbitrary index pairs {j, k} ∈ J × K there exists
a sequence {i1 , . . ., im } ∈ Nn with aj,i1 = 0, . . . , aim ,k = 0 . Inductively, because of
aiμ ,iν = 0, it follows that iμ ∈ J for μ = 1, . . . , m, and finally (because of aim ,k = 0) also
k ∈ J in contradiction to the original choice {j, k} ∈ J × K. So it must hold that either
J = ∅ or K = ∅.

Solution A.3.4: a) In case of the matrix A1 , the iteration matrices for the Jacobi and
220 Solutions of exercises

Gauss-Seidel methods are


⎡ ⎤ ⎡ ⎤
0 0.5 −1 0 0.5 −1
⎢ ⎥ ⎢ ⎥
J = −D −1 (L + R) = ⎢
⎣ −0.5 0 1 ⎥
⎦, H1 = −(D + L)−1 R = ⎢
⎣ 0 −0.25 1.5 ⎥

−1 −1 0 0 −0.25 −0.5

The eigenvalues λi of J fulfill λ1 λ2 λ3 = det(J) = −1 , hence spr(J) ≥ 1 . Therefore, the


Jacobi iteration cannot be convergent in general. The matrix H1 has the characteristic

polynomial χ(λ) = −λ(λ2 + 34 λ + 12 ) and the eigenvalues λ1 = 0, λ2/3 = ±1/ 2 .
Consequently, spr(H1 ) < 1 and the Gauss-Seidel method is convergent.
b) The matrix A2 fulfills the weak row sum criterion and is irreducible. Hence, the
Jacobi- and Gauss-Seidel methods converge

Solution A.3.5: First, we determine the iteration matrix: It holds


& '−1 & '
1 0 1 0
= .
−ωa 1 ωa 1

Hence, & '& ' & '


t 1 0 1−ω ωa t−1 1 0
x = x +ω b,
ωa 1 0 1−ω ωa 1
and therefore: & '
1−ω ωa
Bω = .
ωa(1 − ω) ω 2a2 + 1 − ω
Consequently, it is det(Bω − λI) = −λω 2 a2 + (1 − ω − λ)2 .
a) With ω = 1 it is

det(B1 − λI) = −λa2 + λ2 ⇒ spr(B1 ) = a2 ,

thereby, for |a| < 1 the system is convergent.


1
b) In case of a = 2
it holds
2
λ1,2 = 1 − ω + 18 ω 2 ± 12 ω 1−ω+ 1 2
16
ω .

In case of 1 − ω + 16 ω ≥ 0 , and ω ≤ 8 − 4 3 = 1.07179677. . . , respectively, both roots
1 2

are real valued. For any other choice of ω they are complex. Therefore:
, 2 √
1 − ω + 18 ω 2 + 12 ω 1 − ω + 16 ω , 0 ≤ ω ≤ 8 − 4 3,
1 2
spr(Bω ) = √
ω − 1, 8 − 4 3 < ω ≤ 2.
A.3 Chapter 3 221

Evaluating the formula for the stated values:

ω 0.8 0.9 1.0 1.1 1.2 1.3 1.4


spr(Bω ) 0.476 0.376 0.25 0.1 0.2 0.3 0.4

The graph of the function√ρ(ω) := spr(Bω ), 0 ≤ ω ≤ 2, starts with ρ(0) = 1 ; it has a


minimum at ωopt := 8 − 4 3 with a sharp, down-pointing cusp, behind that it increases
linear to ρ(2) = 1 .

1 ............................................................................................ .
...
..
...................... ...
.....................
................... ...
.
................... .
............... ...
.......... ..
........ ...
....... ...
0.8 spr(Bω ) ......
....
.... .
...
.
... ...
... ...
... ..
... ...
...
... ...
.
.
... ...
0.6 ...
... ....
.. ...
..
.. ..
.....
.

0.4

0.2

0 - ω
1 ωopt 2

Graph of the spectral radius spr(Hω ) plotted over ω ∈ [0, 2]

Solution A.3.6: Let T be a matrix consisting of a (row wise) ONB of A. It holds


⎛ ⎞
λ1
⎜ ⎟
T −1 A0 T = I, T −1 A1 T = ⎜

..
. ⎟

λd ,

and
⎛ ⎞
λk1
⎜ ⎟
T −1 Ak T = T −1 AT · T −1 AT · · · T −1 AT = ⎜

..
. ⎟
⎠ ∀k ∈ N.
λkd

So, by virtue of linearity,


⎛ ⎞
p(λ1 )
⎜ ⎟
T p(A)T = ⎜
−1

..
. ⎟

p(λd )

for an arbitrary polynomial p. So, the spectral radius of p(A) is exactly maxi=1,...,n |p(λi )|.
222 Solutions of exercises

Solution A.3.7: a) It holds

Xt = g(Xt−1 ), g(X) := X(I − AC) + C.

Hence,
g(X) − g(Y ) = (X − Y )(I − AC) ≤ X − Y || I − AC.
Therefore, if I − AC =: q < 1, then g is a contraction. The corresponding fixed-
point iteration converges for every initial value X0 . The limit Z fulfills the equation
Z = Z(I − AC) + C or ZAC = C . This is equivalent to Z = A−1 .
So, if q < 1 the fixed point iteration converges for every initial value X0 ∈ Rn×n to the
limit A with the a priori error estimate

Xt − A−1  ≤ q t X0 − A−1 , t ∈ N.

b) We have
Xt = g(Xt−1 ), g(X) := X(2I − AX).
Let Z be an arbitrary fixed point of g . It necessarily fulfills the equation Z = Z(2I−AZ)
or Z = ZAZ . Suppose that Z is regular, then Z = A−1 . Note that this assumption is
essential because the singular matrix Z = 0 is always a valid fixed point of g . To prove
convergence (under a yet to be stated assumption) we observe that:

Xt − Z = 2Xt−1 − Xt−1 AXt−1 − Z


= −Xt−1 AXt−.1 +  AZ − 
ZA Yt−1 + Xt−1  ZA Z
=I =I =I
= −(Xt−1 − Z)A(Xt−1 − Z).

This implies
Xt − Z ≤ A Xt−1 − Z2 .
So, for Z = A−1 and under the condition that
1
X0 − Z <
A

the iteration converges quadratically to Z:


 2  2t
A Xt − Z ≤ A Xt−1 − Z ≤ . . . ≤ A X0 − Z → 0 (t → ∞).

This iteration is exactly Newton’s method for calculating the inverse of a matrix.
Remark: It is sufficient to choose a starting value X0 that fulfills the convergence
criterion (for the preconditioner C ) in (a):
1
1 > I − AX0  = A A−1 − X0  ⇐⇒ A−1 − X0  < ,
A

so, also the criterion of (b) is fulfilled.


A.3 Chapter 3 223

Solution A.3.8: Let J be the Jordan normal form of B and T a corresponding trans-
formation matrix such that
T −1 BT = J.
Let p(X) = α0 + α1 X + α2 X 2 + · · · + αk X k be an arbitrary polynomial. Then,

T −1 p(B)T = α0 + α1 T −1 BT + α2 (T −1 BT )2 + · · · + αk (T −1 BT )k = p(J).

Furthermore, observe that multiplication (or addition) of upper triangular matrices yields
another upper triangular matrix, whose diagonal elements are formed by elementwise
multiplication (or addition) of the corresponding diagonal elements of the multiplicands.
Consequently, p(J) is an upper triangular matrix of the form
⎛ ⎞
p(λ1 )
⎜ ⎟
⎜ p(λ2 ) ∗ ⎟
⎜ ⎟
p(J) = ⎜ .. ⎟,
⎜ 0 . ⎟
⎝ ⎠
p(λn )

where λi are the eigenvalues of B. Hence,


 
χp(J) (λ) = det p(J) − λI = (p(λi ) − λ),
1≤i≤n

which proves the assertion.

Solution A.3.9: In case of a symmetric matrix A, the Jacobi methods reads

xt = −D −1 (L + LT )xt−1 + D −1 b,

with the iteration matrix


B = −D −1 (L + LT ).
The idea of the Chebyshev acceleration is now to construct a sequence of improved approx-
imations y t − x = pt (B)(x0 − x) (instead of the ordinary fixed point iteration xt = B t x0 )
by a smart choice of polynomials


t
pt (z) = γst z s , pt (1) = 1.
s=0

It holds
y t − x2 ≤ pt (B)2 x0 − x2 ,
with pt (B)2 = maxλ∈σ(B) |p(λ)|. So, the optimal choice for the polynomial pt (z) would
be the solution of the minimzation problem

min max |p(λ)|.


p∈Pt ,p(1)=1 λ∈σ(B)
224 Solutions of exercises

Unfortunately, this is practically impossible because σ(B) is usually unknown. But,


under the assumption that the Jacobi method is already convergent it holds that

max |p(λ)| ∈ [−1 + δ, 1 − δ],


λ∈σ(B)

due to the fact that the resulting iteration matrix B is similar to a symmetric matrix

D −1/2 (L + LT )D −1/2 .

This motivates the modified optimization problem

min max |p(x)|.


p∈Pt ,p(1)=1 |x|≤1−δ

This optimization problem can be solved analytically. The solutions are given by rescaled
Chebyshev polynomials:  x 
Tt 1−δ
pt (x) := Ct (x) =  δ
.
Tt 1 + 2−2δ

Solution A.3.10: a) No, the damped Richardson equation cannot be made convergent in
general. A necessary (and sufficient) condition for convergence of the damped Richardson
equation (applied to a symmetric coefficient matrix) for arbitrary starting values is that
(& ' & ')
I O A B
spr −θ < 1.
O I BT O

For this to hold true, it is necessary that the eigenvalues of the coefficient matrix are
sufficiently small – this can be controlled by θ and is therefore not a problem – and that
all eigenvalues are positive. But this does not need to be the case, consider, e. g.,
( ) ( )
1 0 −1
A= , B= .
0 1 0

In this case the coefficient matrix


⎛ ⎞
1 0 −1
⎜ ⎟
⎜0 1 0⎟
⎝ ⎠
−1 0 0

has two positive and one negative eigenvalue λ1 = 1, λ2,3 = 1
2
± 2
5
.
b) With A = (aij ), B = (bij ) and employing the fact that A is symmetric and positive
definite it holds
A.3 Chapter 3 225

   
B T AB)il = bji ajk bkl = bkl akj bkl = B T AB)li 1 ≤ i, j ≤ m,
jk jk

xT B T ABx ≥ Bx2 ≥ 0 ∀x ∈ Rm .

So, B T AB is symmetric, positive semidefinite. If B : Rm → Rn is a one to one mapping


(because m ≤ n), then, xT B T ABx = 0 implies Bx = 0. This in turn implies x = 0.
Therefore, if B has full rank, then B T AB is positive definite.
The Chebyshev acceleration is most efficiently realized by using the two step recursion
formula:
μt μt−1 t−1 μt
ξt = 2 H1sym ξ t−1 − ξ +2 ζ
ρ μt+1 μt+1 ρ μt+1
2
μt+1 = μt − μt−1
ρ

starting from the initial values ξ 0 = y 0 , y 1 = H1sym y 0 + ζ, μ0 = 1 and μ1 = 1/ρ.


Hereby, the symmetrized Gauß-Seidel iteration matrix reads

H1sym = (D + LT )−1 L (D + L)−1 LT

and the corresponding right hand side of the iterative procedure is

ζ = B T A−1 b − c.

We assume that we have an efficient method in estimating the additive splitting

B T A−1 B = L + D + LT

and that an estimate ρ ∈ (0, 1) with σ(H1sym ) ∈ (−ρ, ρ) is readily available.

Solution A.3.11: One step of the Gauß-Seidel method reads


( )
(1) 1  (1)
 (0)
x̂j = bj − ajk x̂k + ajk xk .
ajj k<j k>j

Due to the specific choice of decent directions r (t) = et+1 in the coordinate relaxation,
(t+1) (t)
there holds xj = xj for j = t + 1. Consequently, it suffices to show that in the step
t → t + 1 the (t + 1)-th component is set to the correct value. Inserting the step length
(t)
gt+1 1   
(t)
αt+1 = = bt+1 − at+1,k xk
at+1,t+1 at+1,t+1 k

into the iteration procedure gives


226 Solutions of exercises

(t+1) (t) bt+1 1  (t)


xt+1 = xt+1 + − at+1,k xk
at+1,t+1 at+1,t+1
1   k
  (1.3.1)
(t) (t)
= bt+1 − at+1,k xk − at+1,k xk .
at+1,t+1
k<t+1 k>t+1

(t) (1) (t) (0)


By induction it follows that xk = x̂k for k < t + 1. Furthermore, xk = xk for
k > t + 1, so that:
1    
(t+1) (1) (0)
xt+1 = bt+1 − at+1,k x̂k − at+1,k xk . (1.3.2)
at+1,t+1 k<t+1 k>t+1

Solution A.3.12: The CG method applied to the normal equation reads: Given an
initial value x0 and an initial decent direction

d(0) = AT (b − Ax0 ) = −g (0)

iterate by the prescription

(g (t) , g (t) )
αt = , y (t+1) = y (t) + αt d(t) , g (t+1) = g (t) + αt AT Ad(t) ,
(Ad(t) , Ad(t) )
(g (t+1) , g (t+1) )
βt = , d(t+1) = −g (t+1) + βt d(t) .
(g (t) , g (t) )

Remarkably, by efficiently storing and reusing intermediate computational results, there


is only one additional matrix-vector multiplication involved in contrast to the original CG
method – the term AT Ad(t) has to be computed instead of Ad(t) .
The convergence speed, however, is linked to the eigenvalues of AT A by the result (given
in the text) that in order to reduce the error by a factor of ε about
1√ 2
t(ε) ≈ κ ln
2 ε
steps are required. Now,

maxλ∈σ(AT A) |λ| maxs∈S(A) |s|2


κ = cond2 (AT A) = = ,
minλ∈σ(AT A) |λ| mins∈S(A) |s|2

with the set of singular values S(A) of A. This implies that for symmetric A the relation
κ(AT A) = κ(A)2 holds and therefore a much slower convergence speed has to be expected.

Solution A.3.13: The asymptotic convergence speed


 xt − x 1/t
lim sup
t→∞ x0 − x

for the different methods in terms of κ = cond2 (A) = Λ/λ (with Λ maximal absolute
eigenvalue and λ minimal absolute eigenvalue) are as follows:
A.3 Chapter 3 227

 1 2 1  1 2
Gauß-Seidel: spr(H1 ) = spr(J)2 = 1 − = 1−2 +O ,
6 κ κ κ
1 − 1 − spr(J)2 √ 1 1
Optimal SOR: spr(Hopt ) = 6 = 1 − 8√ + O ,
1 + 1 − spr(J)2 κ κ
 1 − 1/κ  1  1 2
Gradient method: = 1−2 +O ,
1 + 1/κ κ κ

 1 − 1/ κ  1 1
CG method: √ = 1−2√ +O .
1 + 1/ κ κ κ

Solution A.3.14: The CG method applied to the Schur complement

B T A−1 By = B T A−1 b − c

reads: Given an initial value y0 and an initial decent direction

d(0) = B T A−1 (b − By 0 ) − c = −g (0)

iterate by the prescription

(g (t) , g (t) )
αt = , y (t+1) = y (t) + αt d(t) , g (t+1) = g (t) + αt B T A−1 Bd(t) ,
(A−1 Bd(t) , Bd(t) )
(g (t+1) , g (t+1) )
βt = , d(t+1) = −g (t+1) + βt d(t) .
(g (t) , g (t) )

Observe that in each step it is only necessary to compute two matrix vector products
(one with B and one with B T ) and one matrix vector product with A−1 when eval-
uating A−1 Bd(t) . This can be done with the help of an iterative method, e. g. with a
preconditioned Richardson method (as introduced in the text)

ξ t = ξ t−1 + C −1 (b − Aξ t−1 ).

Different choices for the preconditioner C −1 are now possible, e. g. by choosing C =


1
ω
(D + ωL) with A = L + D + R, one ends up with the SOR method. In practice, it is
crucial to have a preconditioner that has good orthogonality preserving features, so one
might use another Krylov space method as a preconditioner instead.

Solution A.3.15: There holds


; <t(ε) ; <t(ε) ; <t(ε)
1 − 1/κ κ−1 κ+1 1
≤ ε ⇐⇒ ≤ε ⇐⇒ ≥ .
1 + 1/κ κ+1 κ−1 ε

Now, without loss of generality, both bases are greater than 1, so that
; <
κ+1 1
⇐⇒ t(ε) ≥ log .
κ+1 ε
228 Solutions of exercises

 κ+1  #1 $
Finally, observe that log κ−1
=2 κ
+ 1 1
3 κ3
+··· ≥ 2 κ1 . Hence,

1 1
⇐= 2 t(ε) ≥ log .
κ ε

The corresponding result for the CG method follows by replacing κ with κ.

Solution A.3.16: The matrix C can be written in the form C = KK T with the help
of 1 
1
K=6 D + L D −1/2 .
(2 − ω) ω ω
A close look reveals that the iteration matrix HωSSOR of the SSOR method can be ex-
pressed in terms of C and A :

HωSSOR = I − C −1 A.

In view of spr(HωSSOR ) < 1, the inverse C −1 can be viewed as an approximation to A−1


that is suitable for preconditioning.

Solution A.3.17: For the model problem matrix A it holds that spr(A) < 1. Hence,
the inverse (I − J)−1 is well defined and the Neumann series converges:



−1
(I − J) = J k.
k=0

Furthermore, with J = I − D −1 A it follows that (I − J)−1 = D A−1 . Then,



A−1 = D −1 J k.
k=0

Finally, observe that the multiplication of two arbitrary matrices with non-negative entries
yields another matrix with non-negative entries. Therefore the matrices J k = D −k (−L −
R)k are elementwise non-negative. So, A−1 viewed as the sum of elementwise non-negative
matrices has the same property.

Solution A.3.18: i) The stated inequality is solely a result of the special choice of x0 +
Kt (d0 ; A) as affine subspace for the optimization problem – it holds:
# $ # $
x0 + Kt (d0 , A) = x0 + span A0 d0 , · · · , At−1 d0 = x0 + p(A)d0 : p ∈ Pt−1 .

Furthermore, d0 = g 0 = Ax0 − b = A(x0 − x) , so


# $
x0 + Kt (d0 , A) = x0 + Ap(A)(x0 − x) : p ∈ Pt−1 .

Hence, it follows that


A.3 Chapter 3 229

Axtgmres − b2 = min A[I + Ap(A)](x0 − x)2 = min p(A)A(x0 − x)2


p∈Pt−1 p∈Pt ,p(0)=1

≤ min p(A)2 A(x0 − x)2 .


p∈Pt ,p(0)=1

ii) Due to the fact that A is symmetric and positive definite there exists an orthonormal
basis {oi } of eigenvectors of  A with corresponding eigenvalues {λi }. Let y ∈ R be an
n

arbitrary vector with y = i yi oi for suitable coefficients yi . It holds


  
p(A)y = p(A) y i oi 2 =  p(λi )yi oi 2 ≤ sup |p(λi )|  yi oi 2 = sup |p(λi )| y2.
i i
i i i

We conclude that p(A)2 ≤ supi |p(λi )| and consequently (Let λ be the smallest and Λ
be the biggest eigenvalue of A ):

Axtgmres − b2 ≤ min max |p(λi )| A(x0 − x)2


p∈Pt ,p(0)=1 i

≤ min max |p(μ)| A(x0 − x)2 .


p∈Pt ,p(0)=1 λ≤μ≤Λ

But this is (up to the different norms) the very same inequality that was derived for the
CG method. So, with the same line of reasoning one derives
;√ <t
κ−1
Axgmres − b2 ≤
t
√ A(x0 − x)2 .
κ+1

iii) Similarly to (ii):

p(A)y2 = T −1 T p(A)T −1 T y2 = T −1 p(D)T y2 ≤ T −1 2 p(D)2 T 2 y2.

Furthermore, p(D)2 = maxi |λi | , so one concludes that

p(A)2 ≤ κ2 (T ) max |λi |.


i

The difficulty of this result lies in the fact that the λi are generally complex valued, so
some a priori asumption has to be made in order to control maxi |λi |.

Solution A.3.19: a) It holds


 
λmax = 6 + 2 × 3 cos (1 − h)π ≈ 12,
λmin = 6 − 2 × 3 cos(hπ),

and hence,
4
cond2 (A) ≈ .
π 2 h2
In analogy to the text, it holds that the eigenvalues of the Jacobi iteration matrix J =
I − D −1 A are given by
230 Solutions of exercises

1 
μijk = cos[ihπ] + cos[jhπ] + cos[khπ] , i, j, k = 0, . . . , m.
2
Consequently,
π2 2
spr(J) = 1 −
h + O(h4 ).
2
b) Due to the fact that the matrix A is consistently ordered, it holds

spr(H1 ) = ρ2 = 1 − π 2 h2 + O(h4 ),
6
1 − 1 − ρ2 1 − πh + O(h2 )
spr(Hωopt ) = 6 = = 1 − 2πh + O(h2 ).
1+ 1−ρ 2 1 + πh + O(h2)

The number of required iterations T∗ (ε) ≈ ln(ε)/ ln(spr(∗)) is thus


2 1
TJ (ε) ≈ − ln(ε) ≈ 18 665, TH1 (ε) ≈ − ln(ε) ≈ 9 333,
π 2 h2 π 2 h2
1
THωopt (ε) ≈ − ln(ε) ≈ 147,
2πh
and for the gradient and CG method:
1 2
TG (ε) = − κ ln(ε) ≈ − 2 2 ln(ε) ≈ 18 665,
2 π h
1√ 1
TCG (ε) = − κ ln(ε/2) ≈ − ln(ε/2) ≈ 316.
2 πh
c) A matrix vector multiplication with A needs roughly 7h−3 a. op.. With that one
concludes that the number of required a. op. for Jacobi, Gauß-Seidel and SOR method
is approximately 8h−3 . Similarly the workload for the gradient method is 11h−3 a. op.,
whereas the CG method needs 12h−3 a. op.

A.4 Chapter 4
n
Solution A.4.1: There holds z 0 = i=1 αi w i and z t = At z 0 −1 t 0
2 A z and therefore
n
(At+1 z 0 , At z 0 )2 |αi |2 λ2t+1
λt = (Az t , z t )2 = = i=1 i
A z 2
t 0 2 n
i=1 |α i | 2 λ2t
i
#   
2 λi 2t+1
$
(λn )2t+1 |αn |2 + n−1 |α i |
=  i=1
n−1  λi 2t 
λn

(λn ) 2t |αn | + i=1 |αi | λn


2 2

  
2 λi 2t
   
2 λi 2t λi

|αn |2 + n−1 i=1 |αi | λn
+ n−1 i=1 |αi | λn λn
−1
= λn   
|αn |2 + n−1 2 λi 2t
i=1 |αi | λn
n−1   
2 λi 2t λi

i=1 |αi | λn λn
−1
= λn + λn  n−1  λ 2t =: λn + λn Et .
|αn |2 + i=1 |αi |2 λni
A.4 Chapter 4 231

The error term on the right can be estimated as follows:


 λ 2t n−1 |α |2  λ 2t z 0 2
n−1 i n−1
|Et | ≤ i=1
= 2
.
λn |αn |2 λn |αn |2

Hence,
z 0 22  λn−1 2t
|λt − λn | ≤ |λn | .
|αn |2 λn

Solution A.4.2: Let μi := (λi − λ)−1 be the eigenvalues of the matrix (A − λI)−1 .
Further, we note that μmax = (λmin − λ)−1 . The corresponding iterates generated by the
inverse iteration are μt = (λt − λ)−1 with λt := 1/μt + λ . We begin with the identity

z̃ t (A − λt−1 I)−1 z t−1 (A − λt−1 I)−1 (A − λt−2 I)−1 z t−2


zt = = =
z̃ 2
t (A − λ I) z 2
t−1 −1 t−1 (A − λt−1 I)−1 (A − λt−2 I)−1 z t−2 2
5t−1 j −1 0
j=0 (A − λ I) z
= · · · = 5t−1 ,
 j=0(A − λj I)−1 z 0 2

from which we conclude


 
μt = (A − λt−1 I)−1 z t , z t 2
 5 5t−1 
(A − λt−1 I)−1 t−1 j −1 0
j=0 (A − λ I) z ,
j −1 0
j=0 (A − λ I) z 2
= 5
 t−1 j=0 (A − λ I) z 2
j −1 0 2
n t−1 −1
5 t−1 j −2
i=1 |αi | (λi − λ j=0 (λi − λ )
2
)
= n 5 t−1 .
i=1 |αi | j=0 (λi − λ )
2 j −2

Next,
5t−1 n 5t−1
|α1 |2 (λ1 − λt−1 )−1 j=0 (λ1 − λj )−2 + i=2 |αi | (λi − λ
2 t−1 −1
) j=0 (λi − λj )−2
t
μ = 5t−1 n 5t−1
|α1 | 2
j=0 (λ1 − λj )−2 + i=2 |αi |
2
j=0 (λi − λ )
j −2

5 n 
2 λ1 −λt−1
 5t−1  λ −λj 2
(λ1 − λt−1 )−1 t−1j=0 (λ1 − λ )
j −2 |α |2 +
1 i=2 |αi | λi −λt−1
1
j=0 λi −λj
= 5t−1  5  
λ1 −λj 2
j=0 (λ1 − λ )
j −2
|α1 |2 + ni=2 |αi |2 t−1 j=0 λi −λj
  −λj 2 n 5  λ1 −λj 2   −λt−1 
1 |α1 |2 + ni=2 |αi |2 λλ1i −λ j + i=2 |αi |2 t−1 j=0 λi −λj 1 − λλ1i −λt−1
= n 5t−1  λ1 −λj 2
λ1 − λ t−1
|α1 | + i=2 |αi |
2 2
j=0 λi −λj
n 5t−1  λ1 −λj 2  −λt−1

i=2 |αi | 1 − λλ1i −λ
2
1 1 j=0 λi −λj t−1
= +  5  λ1 −λj 2
λ1 − λt−1 λ1 − λt−1 |α1 |2 + ni=2 |αi |2 t−1 j=0 λi −λj
1 1
=: + Et .
λ1 − λt−1 λ1 − λt−1
232 Solutions of exercises

The error term on the right-hand side can be estimates as follows:


n 5t−1  λ1 −λj 2  t−1 
  1 − λ1 −λt−1 
i=2 |αi |
2
j=0 λi −λj λi −λ
|Et | ≤  5  j  2
λ1 −λ
|α1 |2 + ni=2 |αi |2 t−1 j=0 λi −λj
t−1 
 n t−1  
 λ1 − λj 2 i=2 |αi |2  λ1 − λj 2 z 0 22
≤   =   .
j=0
λ2 − λj |α1 |2 j=0
λ2 − λj |α1 |2

This yields
  t−1 

 t 1  1  λ1 − λj 2 z 0 22
μ −  ≤   .
λ1 − λt−1 λ1 − λt−1 j=0 λ2 − λj |α1 |2

Observing μt = (λt − λt−1 )−1 or λt = 1/μt + λt−1 ,


   λ − λt−1 − λt + λt−1   λ1 − λt 
 1 1   1   
 −  =   =  ,
λ −λ
t t−1 λ1 − λ t−1 (λ − λ )(λ1 − λ )
t t−1 t−1 (λ − λ )(λ1 − λ )
t t−1 t−1

we obtain the desired estimate


t−1 

 λ1 − λj 2 z 0 22
|λ1 − λt | ≤ |λt − λt−1 |   .
j=0
λ2 − λj |α1 |2

Solution A.4.3: It suffices to prove the following two statements about the QR-iteration.
The assertion then follows by induction.

1. Let A be a Hessenberg matrix and A = QR its QR-decomposition. Then, Ã = RQ


is also a Hessenberg matrix.

2. Let A be a symmetric matrix and A = QR its QR-decomposition. Then, Ã = RQ


is also a symmetric matrix.

The QR decomposition of a Hessenberg matrix A can be expressed as

Gn−1 Gn−2 · · · G1 A = R

with ⎛ ⎞
Ii−1 0
⎜ ⎟
Gi = ⎜
⎝ G̃i ⎟,

0 In−i−1

and an orthogonal component G̃i ∈ R2×2 that eliminates the lower left off diagonal entry
of the block ( ) ( )
∗i,i ∗i,i+1 ∗ ∗
G̃i = .
ai+1,i ai+1,i+1 0 ∗
A.4 Chapter 4 233

Apart from eliminating the entry ai+1,i , the orthogonal matrix Gi only acts on the upper
right part of the (intermediate) matrix. Consequently, R is an upper triangular matrix
and it holds
à = RQ = RGT1 GT2 · · · GTn−1 .
Similarly, it follows by induction that multiplication with GTi from the right only intro-
duces at most one (lower-left) off-diagonal element at position ∗i+1,i , so à is indeed a
Hessenberg matrix.
Now, let A be symmetric. It holds QR = A = AT = RT QT and consequently R =
QT RT QT . We conclude that

à = RQ = QT RT QT Q = (RQ)T = ÃT .

Solution A.4.4: Let A = Q̃R̃ be an arbitrary QR-decomposition of A . Define a unitary


matrix H = diag(hi ) ∈ Cn×n by hi = |rr̄iiii | and set R = H R̃, Q = Q̃H̄.
Now, observe that ĀT A = R̄T Q̄T QR = R̄T R is the Cholesky decomposition of the real
valued, symmetric and positive definite matrix ĀT A . Since the Cholesky decomposition
(with positive diagonal) is uniquely determined it follows that R is unique and hence also
Q = AR−1 .

Solution A.4.5: i) From the definition of Km it follows

Km+1 = span{q, A Km}.

Now, if Km = Km+1 = span{q, A Km } one sees by induction that repeated applications


of this procedure yield the same space again, hence Kn = Km ∀ n ≥ m. On the other
hand, given the fact that Km−1 = Km it must hold Ki = Ki+1 for i = 1, . . . , m − 1.
Otherwise, this would already imply Km−1 = Km which is a contradiction.
It holds dim K1 = 1 because q = 0. Furthermore, Km is generated by m vectors.
Therefore, one sees by induction that dim Ki = i as long as Ki = Ki−1 , i. e. for
2 ≤ i ≤ m.
ii) Let λ ∈ σ(Qm T AQm ) be arbitrary. Then, there exists an eigenvector v ∈ Cm \ {0}
with Qm T AQm v = λv. Multiplication of Qm from the left and utilizing

m
Qm QmT . = q i (q i , .) = projKm .
i=1

yields
projKm A Qm v = λ Qm v.
But by definition of m it holds that Km is A-invariant, i. e. AKm ⊂ Km , hence
projKm (A Qm v) = A Qm v and therefore λ ∈ σ(A).
In case of m = n there is Km = Cn . Consequently, Qm ∈ Cn×n is a regular matrix and
234 Solutions of exercises

the matrices A and Qm T AQm are similar; specifically

σ(QmT AQm ) = σ(A).

Solution A.4.6: Let {v 1 , . . . , v m } ∈ Rn be a linearly independent set of vectors. The


classical Gram-Schmidt orthogonalization procedure reads: For i = 1, . . . , m:


i−1
α) ũi := v i − (uj , v i )uj ,
j=1

β) ui := ũi /ũi .

The modified Gram-Schmidt orthogonalization procedure takes the form: For i = 1, . . . , m:

α) ũi,1 := v i ,
ũi,k := ũi,k−1 − (uk−1, ũi,k−1)uk−1, for k = 2, . . . , i,
β) ui := ũi,i /ũi,i .

i) For the modified Gram-Schmidt algorithm we can assume by induction that

ũi,k−1 = v i − proju1 ,...,uk−2  (v i ),

hence

ũi,k = ũi,k−1 − (uk−1 , ũi,k−1)uk−1


= v i − proju1 ,...,uk−2  (v i ) − projuk−1 (ũi,k−1)
= v i − proju1 ,...,uk−2  (v i ) − projuk−1 (v i )

k−1
=v − i
(uj , v i )uj .
j=1

ii) By rewriting step (α) of the classical algorithm in the form

α) ũi,1 := v i ,
ũi,k := ũi,k−1 − (uk−1, v i )uk−1, for k = 2, . . . , i,

one observes that the algorithmic complexity of both variants are exactly the same. Both
consist of i − 1 scalar-products (with n a. op.) with vector scaling and vector addition
(with n a. op.) in step (α) which sums up to


m
(i − 1) (n + n) = n m(m − 1) a. op.
i=1

as well as m normalization steps with roughly 2n a. op. in (β). In total n m(m + 1)


a. op..
A.4 Chapter 4 235

Solution A.4.7: The result by the classical Gram-Schmidt algorithm is:


⎡ ⎤
1 0 0
⎢ √ ⎥
Q̃ = ⎢
⎣ ε 0 2 ⎥,
√2 ⎦
ε −1 22

with Q̃T Q̃ − I∞ = 2 ( 12 + ε) . The result by the modified Gram-Schmidt algorithm is:
⎡ ⎤
1 0 0
⎢ ⎥
Q̃ = ⎢
⎣ ε 0 −1 ⎦ ,

ε −1 0

with Q̃T Q̃ − I∞ ≈ 2 ε .

Solution A.4.8: i) With the help of the Taylor expansion of the cosine:

|λijkk − λhijk | =
 ! %
 2 (i2 + j 2 + k 2 )π 2 h2 (i4 + j 4 + k 4 )π 4 h4 
(i + j 2 + k 2 )π 2 − h−2 6 − 6 − − − O(h6 ) 
2! 4!
(i4 + j 4 + k 4 )π 4 h2 1 1
= + O(h4 ) ≤ λ2ijk h2 + O(h4 ) ≤ λ2ijk h2 (for h sufficiently small).
4! 4! 12

ii) The maximal eigenvalue λmax that can be reliably computed with a relative tolerance
TOL fulfills the relation
1 12 TOL
λmax h2 ≈ TOL =⇒ λmax ≈ .
12 h2
The number of reliably approximateable eigenvalues (not counting multiplicities) is the
cardinality of the set
#2 λmax $
i + j 2 + k 2 : (i2 + j 2 + k 2 ) ≤ 2 , i, j, k ∈ N, 1 ≤ i, j, k ≤ m .
π
For the concrete choice of numbers this leads to:
# $
# i2 + j 2 + k 2 : (i2 + j 2 + k 2 ) ≤ 19, i, j, k ∈ N ,

whose cardinality can be counted by hand:


# $
# (1, 1, 1), (2, 1, 1), (2, 2, 1), (2, 2, 2), (3, 1, 1), (3, 2, 1), (3, 2, 2), (3, 3, 1), (4, 1, 1) = 9.

iii) The number of reliably approximateable eigenvalues (counting multiplicities) is the


236 Solutions of exercises

cardinality of the set


 λmax 
(i, j, k) ∈ N3 : (i2 + j 2 + k 2 ) ≤ 2 , 1 ≤ i, j, k ≤ m .
π
For large numbers a reasonably large subset is given by
 √
λmax 
(i, j, k) ∈ N : 1 ≤ i, j, k ≤ √
3
,

which has the cardinality
= √λ >3 = 4 √TOL >3
√ max = .
3π πh
Therefore, h must be chosen such that
= 4 √TOL >3 √
6 10−3
≥ 1.000 ⇐⇒ h ≤ ≈ 6.0 × 10−3 .
πh 10 π
Approximately 7-times uniform refinement in 3D, i. e., nh ≈ h−3 ≈ 4.6 × 106 .

Solution A.4.9: i) The inverse iteration for determining the smallest eigenvalue (with
shift λ = 0 ) reads

Az̃ t = z t−1 , z t = z̃ t −1 z̃ t , t = 1, 2, . . .

with intermediate guesses μt = (A−1 z t , z t ) for the smallest eigenvalue. One iteration of
the inverse iteration consists of 1 solving step consisting of cn a. op and a normalization
step of roughly 2n a. op. Determining the final guess for the eigenvalue needs another
solving step and a scalar product, in total (c + 1) n a. op. So, for 100 iteration steps we
end up with
(101 c + 201)n a. op.
The Lanczos algorithm reads: Given initial q 0 = 0, q 1 = q−1 q, β1 = 0 compute for
1 ≤ t ≤ m − 1:

r t = A−1 q t , αt = (r t , q t ), st = r t − αt q t − βt q t−1
β t+1 = st , q t+1 = st /β t+1 ,

and a final step r m = A−1 q m , αm = (r m , q m ). This procedure takes cn a. op. for the
matrix vector product with additional 5n a. op. per round. In total (respecting initial
and final computations):
(101 c + 501) n a. op.
The Lanczos algorithm will construct a tridiagonal matrix T m (with m = 100 in our
case) of which we still have to compute the eigenvalues with the help of the QR method:
B (0) = T m ,
B (i) = Q(i) R(i) , B (i+1) = R(i) Q(i) .
A.4 Chapter 4 237

From a previous exercise we already know that the intermediate B (i) will retain the
tridiagonal matrix property, so that a total workload of O(m) a. op. per round of the
QR method can be assumed. For simplicity, we assume that the number of required QR
iterations (to achieve good accuracy) also scales with O(m). Then, the total workload of
QR method is O(m2 ) a. op.
ii) Assume that it is possible to start the inverse iteration with a suitable guess for each
of the 10 desired eigenvalues. Still, it is necessary to do the full 100 iterations for each
eigenvalue independently, resulting in

10 (101 c + 201) n a. op.

The Lanczos algorithm, in contrast, already approximates the first 10 eigenvalues simul-
taneoulsy for the choice m = 100 (see results of the preceding exercise). Hence, we end
up with the same number of a. op.:

(101 c + 501) n a. op.

(except for some possibly higher workload in the QR iteration). Given the fact that c is
usually o moderate size somewhere around 5, the Lanczos algorithm clearly wins.

Solution A.4.10: i) Let A ∈ Cn×n , x, b ∈ Cn . It is equivalent:

Ax = b
⇐⇒ (Re A + i Im A)(Re x + i Im x) = Re b + i Im b
!
Re A Re x − Im A Im x = Re b
⇐⇒
−Re A Im x − Im A Re x = −Im b
( )( ) ( )
Re A Im A Re x Re b
⇐⇒ = .
−Im A Re A −Im x −Im b

ii) For all three properties it holds that they are fullfilled by the block-matrix à if and
only if the correspondig complex valued matrix A has the analogous property (in the
complex sense):
a) From the above identity we deduce that the complex valued linear system of equations
(in the first line) is uniquely solvable for arbitrary b ∈ Cn if and only if the same holds
true for the real valued linear equation (in the last line) for arbitrary (Re b, Im b) ∈ R2n .
Thus à is regular iff A is regular.
b) Observe that

à symmetric
⇐⇒ Im A = −Im AT and Re A = Re AT
⇐⇒ Re A + Im A = Re A − Im AT
⇐⇒ A = ĀT .
238 Solutions of exercises

c) For arbitrary x ∈ Cn it holds


 
Re x̄T A x > 0
⇐⇒ Re xT Re A Re x + Im xT Re A Im x − Im xT Im A Re x − Re xT Im A Im x > 0
( )T ( )( )
Re x Re A Im A Re x
⇐⇒ > 0.
−Im x −Im A Re A −Im x

Solution A.4.11: The statement follows immediately from the equivalent definition
# $
σε (T ) = z ∈ C : σmin (zI − T ) ≤ ε , with
# $
σmin (T ) := min λ1/2 : λ ∈ σ(T̄ T T )

and by the observation that similar matrices yield the same set of eigenvalues:
# $
σmin (T ) = min λ1/2 : λ ∈ σ(T̄ T T )
# $
= min λ1/2 : λ ∈ σ(Q̄T T̄ T QQ̄T T Q)
  T

= min λ1/2 : λ ∈ σ (Q̄T T Q) (Q̄T T Q)
= σmin (Q−1 T Q).

A.5 Chapter 5

Solution A.5.1: Let ai be an arbitrary nodal point and ϕih be the corresponding nodal
basis function. Its support consists of 6 triangles T1 , · · · , T6 :

Outside of ∪6i=1 Ti the function ϕih is zero. Due to the fact that ϕih is continuous and
cellwise linear, its gradient is cellwise defined and constant with values
 ; <  ; <
 1 1  1 0
∇ϕih  = , ∇ϕih  = ,
K1 h 1 K2 h 1
 ; <  ; <
 1 −1  1 −1
∇ϕih  = , ∇ϕih  = ,
K3 h 0 K4 h −1
 ; <  ; <
 1 0  1 1
∇ϕih  = , ∇ϕih  = ,
K5 h −1 K6 h 0
A.5 Chapter 5 239

where h denotes the length of the catheti of the triangles. With these preliminaries it
follows immediately that


6
|Kμ | 
3
1 2
bi = f (aj )ϕih (aj ) = 6 h f (ai ) = h2 f (ai ).
μ=1
3 j=1
6

For the stiffness matrix aij = (∇ϕih , ∇ϕjh ), we have to consider three distinct cases: a)
where ai = aj , b) where ai and aj are endpoints of a cathetus, and c) where they are
endpoints of a hypotenuse:


6
|Kμ |  
3
 1
a) aii = ∇ϕih (aν ), ∇ϕih (aν ) = h2 3 (2 + 1 + 1 + 2 + 1 + 1) h−2 = 4.
μ=1
3 ν=1
6

6
|Kμ |  
3
 1
b) aij = ∇ϕih (aν ), ∇ϕjh (aν ) = h2 3 (−1 − 1) h−2 = −1.
μ=1
3 ν=1
6

6
|Kμ |  
3
 1
c) aij = ∇ϕih (aν ), ∇ϕjh (aν ) = h2 3 0 = 0.
μ=1
3 ν=1
6

In summary, the stencil has the form


⎛ ⎞
0 −1
⎜ ⎟
⎜−1 −1⎟
⎝ 4 ⎠.
−1 0

This is, up to a factor of h−2 exactly the stencil of the finite different discretization
described in the text.

Solution A.5.2: The principal idea for the convergence proof of the two-grid algorithm
was to prove a contraction property for
 
= ZGL (ν)eL , ZGL (ν) = A−1 −1
(t+1) (t)
L − pL−1 AL−1 rL
L L−1
eL AL SLν

This was done with the help of a so called smoothing property,

AL SLν  ≤ cs ν −1 h−2


L ,

and an approximation property,

A−1 −1 L−1
L − pL−1 AL−1 rL
L
≤ ca h2L .

The first property is completely independent of the choice of restriction that is used. The
second, however, poses major difficulties for our choice of restriction: In analogy to the
proof given in the text let ψL ∈ VL be arbitrary. Now, vL := A−1 L ψL is the solution of
the variational problem
240 Solutions of exercises

a(vL , ϕL ) = (ψL , ϕL ) ∀ϕL ∈ VL ,


and similarly vL−1 := pLL−1 A−1 L−1
L−1 rL ψL is the solution of

a (vL−1 , ϕL−1 ) = (rLL−1 ψL , ϕL−1 ) ∀ϕL−1 ∈ VL−1 .

Let v and ṽ be the solutions of the corresponding continuous problems:

a(v, ϕ) = (ψL , ϕ) ∀ϕ ∈ V,
a (ṽ, ϕ) = (rLL−1 ψL , ϕ) ∀ϕ ∈ V.

We can employ the usual a priori error estimate (for the Ritz-projection):

vL − vL−1  ≤ vL − v + vL−1 − ṽ + v − ṽ


 
≤ ch2 ψL  + rLL−1ψL  + v − ṽ.

Furthermore, exploiting the finit dimensionality of the spaces involved it is possible to


bound rLL−1 ψL  in terms of ψL , i. e.,

rLL−1 ψL  ≤ cψL .

But, now, rLL−1 is not the L2 -projection. So we have to assume that in general

(rLL−1 ψL , ϕL−1 ) = (ψL , ϕL−1) ,

and hence v = ṽ. This is a problem because a necessary bound of the form

v − ṽ ≤ ch2L ψL  .

does not hold in general.

Solution A.5.3: This time, the problem when trying to convert the proof to the given
problem arises in the smoothing property. The proof of the approximation property does
not need symmetry. We still have an inverse property of the form AL  ≤ ch−2 . So, it
remains to show that
SL  ≤ c < 1
for SL = IL − θAL with a constant c that is independent of L. Because AL is not
symmetric, it is not possible to copy the arguments (that utilize spectral theory) from the
text. We proceed differently: First of all observe that for all uL ∈ VL it holds that

(AL uL , uL ) = a(uL , uL ) = ∇u2 + (∂1 u, u)


"
1
= ∇u +
2
∂1 (u2 ) dx
2 Ω
"
1
= ∇u2 + n1 u2 ds
2 ∂Ω
= ∇u2.
A.5 Chapter 5 241

Hence, AL is positive definite – or, equivalently, for all (complex valued) eigenvalues λi ,
i = 1, ..., NL of AL it holds:

Reλi > 0 i = 1, ..., NL .

The eigenvalues of SL = IL − θAL are 1 − θλi , i = 1, ..., NL . Furthermore,


# $1/2
|1 − θλ| = |1 − θReλ − θImλ| = (1 − θReλ)2 + θ2 (Imλ)2
# $1
= 1 − 2θReλ + θ2 (Imλ)2 + (Reλ)2 ) 2

So finally, the choice


2Reλi
θ< min ,
i=1,...,NL |λi |2
leads to
spr(SL ) = max |1 − θλi | < c < 1.
i=1...NL

with a constant c independent of L. The smoothing property now follows with the
general observation that for every ε > 0 there exists an (operator, or induced matrix)
norm  · ∗ with
SL ∗ ≤ c + ε.
The question remains whether this extends to an L independent convergence rate in the
norm  · .

Solution A.5.4: Applying one step of the Richardson iteration x̄n+1 = x̄n + θ(b−AL x̄n )
needs essentially one matrix vector multiplication with a complexity of 9NL a. op. (due to
the fact that at most 9 matrix entries per row are non-zero). Together with the necessary
addition processes SLν needs 11 νNL a. op.
Calculating the defect dl = fl − Al xl needs another 10NL a. op. For the L2 projektion
onto the coarser grid, we need to calculate

d˜l−1 := rll−1 dl .

This can be done very efficiently: Let {ϕli } be the nodal basis on level l. The i-th
component of the L2 projection of d˜l−1 is given by

d˜l−1
i = (rll−1 dl , ϕl−1 l l−1
i ) = (d , ϕi ).

Due to the fact that Vl−1 ⊂ Vl , it is possible to express ϕl−1


i as


Nl
ϕl−1
i = μij ϕlj ,
j=1

where at most 9 values μij are non trivial. This reduces the computation of the L2
projection to
242 Solutions of exercises


Nl 
Nl
d˜l−1
i = μij (dl , ϕli ) = μij dli
j=1 j=1

and needs 9Nl a. op.. Contrary to this, the prolongation is relatively cheap with roughly
2Nl a. op. (interpolating intermediate values, neglecting the one in the middle and the
boundary, . . . ). Additionally, we account another Nl a. op. for adding the correction. In
total:
(2 · 11 + 10 + 9 + 2 + 1)Nl = 44Nl a. op. on level l.
The dimension of the subspaces behaves roughly like

Nl−k ≈ 2−2k Nl .

Within a V-cycle all operations have to be done exactly once on every level, hence (ne-
glecting the cost for solving on the coarsest level) we end up with:


l l
44 4   4
44Nl−k = N = 44Nl 1 − 2−(2k+2) ≤ 44Nl a. op..
2k l
k=0 k=0
2 3 3

Within a W-cycle, we have to do 2k steps on level l − k. This leads to


l 
l
44  
k
2 44Nl−k = Nl = 2 · 44Nl 1 − 2−k−1 ≤ 2 · 44Nl a. op.
k=0 k=0
2k

A.5.1 Solutions for the general exercises

Solution A.5.5: a) If there exists a regular T ∈ Rn×n and a diagonal matrix D ∈ Rn×n
such that
T −1 AT = D.
b) A matrix A = (aij ) is diagonally dominant if there holds


n
|aij | ≤ |aii |, i = 1, . . . , n.
j=1,j =i

c) A matrix A ∈ Cn×n is called normal if ĀT A = AĀT . Yes, if A is Hermitian, it is


automatically normal.
d) The Rayleigh quotient is defined as (Av, v)2 /v22 for a given vector v = 0 . It can be
used to calculate an eigenvalue approximation from a given eigenvector approximation.
e) cond2 = A2 A−1 2 = |σmax |/|σmin | , where  · 2 is the matrix norm induced by
 ·  2 : Cn → R+
0 , and σmin and σmax are the smallest and largest singular value of A .

g) A Gerschgorin circle is a closed disc, denoted by K̄ρ (aii ) , and associated with a row (or
column) of a matrix
 by the diagonal value aii and the absolute sum of the off-diagonal
elements ρ = j =i |aij | (or ρ = j =i |aji | , respectively). The union of all Gerschgorin
A.5 Chapter 5 243

circles of a matrix has the property that it contains all eigenvalues of the matrix.
h) The restriction rll−1 : Vl → Vl−1 is used to transfer an intermediate value vl ∈ Vl to
the next coarser level Vl−1 , typically a given finite element function to the next coarser
mesh. The prolongation operator pll−1 : Vl−1 → Vl does the exact opposite. It transfers
an intermediate result from Vl−1 to the next finer level.
i) Given an arbitrary b ∈ Cn it is defined as
# $
Km (b; A) = span b, Ab, . . . , Am−1 b .

j) It refers to the damping parameter θ ∈ (0, 1] in the Richardson iteration:


 
x(k+1) = x(k) + ω b − Ax(k) .

k) The difference lies in the evaluation of the term


k−1
ũk = v k − (v k , ui ) ui .
i=1

In the classical Gram-Schmidt method this is done in a straight forward manner, in the
modified version a slightly different algorithm is used:

ũk,1 := v k , ũk,i := ũk,i−1 − (ui−1 , ũk,i−1)ui−1 , for i = 2, . . . , k,

with uk = ũk,k /|ũk,k |. Both algorithms are equivalent in exact arithmetic, but the latter
is much more stable in floating point arithmetic.

Solution A.5.6: i) The matrix A1 fulfils the weak row-sum criterion. Therefore the
Jacobi and Gauß-Seidel methods converge. Furthermore, A1 is symmetric and positive
definite (because it is regular and diagonally dominant), hence the CG method is appli-
cable.
ii) For A2 the Jacobi matrix reads
⎛ ⎞
0 1
− 12
⎜ 2 ⎟
J =⎜

1
2
0 1
2


− 12 1
2
0

with eigenvalues f ulf ilsλ1 = −1, f ulf ilsλ2,3 = ± 2/2. Hence, no convergence in gen-
eral. The Gauß-Seidel matrix is
⎛ ⎞
0 4 −4
1⎜ ⎟
H1 = ⎜ ⎝ 0 2 2⎟⎠
8
0 −1 3
244 Solutions of exercises


with eigenvalues f ulf ilsλ1 = 0, f ulf ilsλ2,3 = − 16
5
± i167 . Hence, the Gauß-Seidel iteration
does converge. A2 is symmetric and positive definite (because it is regular and diagonally
dominant).
iii) The matrix A3 is not symmetric, so the CG method is not directly applicable. For
the Jacobi method: ⎛ ⎞
0 12 − 12
⎜ ⎟
J =⎜ 1 1 ⎟
⎝2 0 2 ⎠ ,
1 1
2 2
0
with corresponding eigenvalues f ulf ilsλ1 = 0, f ulf ilsλ2,3 = ± 12 . Hence, the method
does converge. Similarly for the Gauß-Seidel method:
⎛ ⎞
0 4 −4
1⎜ ⎟
H1 = ⎜ ⎝0 2 2⎟ ⎠,
8
0 3 −1

with eigenvalues f ulf ilsλ1 = 0, f ulf ilsλ2,3 = − 16
1
± 33
16
. The method does converge.

Solution A.5.7: Given a diagonal matrix D = diag(d, 1, 1) it holds

               ⎛    1       10⁻³ d⁻¹   10⁻⁴ d⁻¹ ⎞
    D⁻¹AD  =   ⎜ 10⁻³ d        2         10⁻³   ⎟
               ⎝ 10⁻⁴ d      10⁻³          3    ⎠

Now, we choose d ∈ R in such a way that the Gerschgorin circle defined by the first column has minimal radius but is still disjoint from the other two Gerschgorin circles. Therefore, a suitable choice of d must fulfil (the first two Gerschgorin circles must not touch):

    1 + 1.1 × 10⁻³ d < 2 − 10⁻³ − 10⁻³ d⁻¹.

Solving this quadratic inequality leads to a necessary condition d > 0.001001 (and . . . ), hence d = 0.0011 is a suitable choice. This improves the radius of the first Gerschgorin circle to

    ρ_1 = (1.1 × 10⁻³)² = 1.21 × 10⁻⁶ :   K_{1.21×10⁻⁶}(1).

Similarly, for the third Gerschgorin circle and with the choice D = diag(1, 1, d):

    3 − 1.1 × 10⁻³ d > 2 + 10⁻³ + 10⁻³ d⁻¹.

This is the same inequality as already discussed. Therefore:

    ρ_3 = (1.1 × 10⁻³)² = 1.21 × 10⁻⁶ :   K_{1.21×10⁻⁶}(3).

For the second eigenvalue and with the choice D = diag(1, d, 1) :



    2 − 2 × 10⁻³ d > 1 + 10⁻⁴ + 10⁻³ d⁻¹,   and
    2 + 2 × 10⁻³ d < 3 − 10⁻⁴ − 10⁻³ d⁻¹,

with an inequality of the form

    d > 22/(9999 + √99979201) ≈ 0.0011 . . .

and an (obviously) appropriate choice of d = 0.002. Hence:

    ρ_2 = (2 × 10⁻³)² = 4 × 10⁻⁶ :   K_{4×10⁻⁶}(2).
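
A small Python/NumPy sketch of this circle-sharpening technique; the matrix A below is the assumed form underlying the scaled matrix displayed above, and the column-wise Gerschgorin circles of D⁻¹AD are evaluated directly:

    import numpy as np

    def gerschgorin(A, by_columns=True):
        # Centers and radii of the Gerschgorin circles (row- or column-wise).
        B = A.T if by_columns else A
        centers = np.diag(B)
        radii = np.sum(np.abs(B), axis=1) - np.abs(centers)
        return centers, radii

    A = np.array([[1.0,  1e-3, 1e-4],       # assumed form of the exercise matrix
                  [1e-3, 2.0,  1e-3],
                  [1e-4, 1e-3, 3.0 ]])

    d = 1.1e-3
    D = np.diag([d, 1.0, 1.0])
    B = np.linalg.solve(D, A) @ D           # D^{-1} A D has the same eigenvalues as A
    centers, radii = gerschgorin(B)
    print(centers, radii)                   # first column circle: radius about 1.21e-6
    print(np.sort(np.linalg.eigvalsh(A)))   # eigenvalues for comparison (A is symmetric)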

Solution A.5.8: Let z^0 ∈ C^n with ‖z^0‖ = 1 be an arbitrary starting point. Then, construct a sequence z^t ∈ C^n, t = 1, 2, . . ., by

    z̃^t := A z^{t−1},   z^t := z̃^t/‖z̃^t‖.

In the case of a general matrix the corresponding eigenvalue approximation is given by

    λ_t := (A z^t)_r / z^t_r,

where r is an index such that |z^t_r| = max_{j=1,...,n} |z^t_j|. In the case of a Hermitian matrix A, the eigenvalue approximation can be determined with the help of the Rayleigh quotient:

    λ_t := (A z^t, z^t) / ‖z^t‖².

i) The power method converges if A is diagonalizable and the eigenvalue of largest modulus is separated from the other eigenvalues, i.e. |λ_n| > |λ_i| for i < n. Furthermore, the starting vector z^0 must have a non-trivial component in the direction of the eigenvector w_n corresponding to λ_n.
ii) The separation of the largest eigenvalue from the others is the most crucial restriction, because the convergence rate is directly connected to this property (see iii)); the other two conditions are usually fulfilled in practice (the latter due to round-off errors).
iii) The power method has the following a priori error estimate (for a general matrix):

    λ_t = λ_max + O( |λ_{n−1}/λ_n|^t ),   t → ∞.
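
A minimal Python/NumPy sketch of the normalized power iteration with the Rayleigh-quotient eigenvalue estimate; the symmetric test matrix is an arbitrary assumption, and for a general matrix one would use the component-wise quotient described above:

    import numpy as np

    def power_method(A, maxit=200, tol=1e-12, seed=0):
        # Normalized power iteration, eigenvalue estimated by the Rayleigh quotient
        # (appropriate for Hermitian A).
        rng = np.random.default_rng(seed)
        z = rng.standard_normal(A.shape[0])
        z /= np.linalg.norm(z)
        lam_old = 0.0
        for t in range(maxit):
            w = A @ z                         # z~^t = A z^{t-1}
            lam = np.dot(z, w)                # Rayleigh quotient, since ||z|| = 1
            z = w / np.linalg.norm(w)         # normalization
            if abs(lam - lam_old) < tol * max(abs(lam), 1.0):
                break
            lam_old = lam
        return lam, z

    A = np.array([[4.0, 1.0, 0.0],            # symmetric test matrix (assumption)
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    lam, z = power_method(A)
    print(lam, np.max(np.abs(np.linalg.eigvalsh(A))))   # compare with the exact value
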
Bibliography
[1] R. Rannacher: Numerik 0: Einführung in die Numerische Mathematik, Lecture
Notes Mathematik, Heidelberg University Publishing, Heidelberg, 2017,
https://doi.org/10.17885/heiup.206.281
[2] R. Rannacher: Numerik 1: Numerik Gewöhnlicher Differentialgleichungen, Lecture
Notes Mathematik, Heidelberg University Publishing, Heidelberg, 2017,
https://doi.org/10.17885/heiup.258.342
[3] R. Rannacher: Numerik 2: Numerik Partieller Differentialgleichungen, Lecture Notes
Mathematik, Heidelberg University Publishing, Heidelberg, 2017,
https://doi.org/10.17885/heiup.281.370

[4] R. Rannacher: Numerik 3: Probleme der Kontinuumsmechanik und ihre numerische


Behandlung, Lecture Notes Mathematik, Heidelberg University Publishing, Heidel-
berg, 2017, https://doi.org/10.17885/heiup.312.424

[5] R. Rannacher: Analysis 1: Differential- und Integralrechnung für Funktionen einer


reellen Veränderlichen, Lecture Notes Mathematik, Heidelberg University Publish-
ing, Heidelberg, 2017, https://doi.org/10.17885/heiup.317.431

[6] R. Rannacher: Analysis 2: Differential- und Integralrechnung für Funktionen mehre-


rer reeller Veränderlichen, Lecture Notes Mathematik, Heidelberg University Pub-
lishing, Heidelberg, 2018, https://doi.org/10.17885/heiup.381.542
[7] R. Rannacher: Analysis 3: Integralsätze, Lebesgue-Integral und Anwendungen, Lec-
ture Notes Mathematik, Heidelberg University Publishing, 2018,
https://doi.org/10.17885/heiup.391

(I) References on Functional Analysis, Linear Algebra and Matrix Analysis

[8] N. Dunford and J. T. Schwartz: Linear Operators I, II, III, Interscience Publishers
and Wiley, 1957, 1963, 1971.
[9] H. J. Landau: On Szegö’s eigenvalue distribution theory and non-Hermitian kernels,
J. Analyse Math. 28, 335–357 (1975).
[10] R. A. Horn and C. R. Johnson: Matrix Analysis, Cambridge University Press, 1985-
1999, 2007.

[11] P. R. Halmos: Finite-Dimensional Vector Spaces, Springer, 1974.


[12] T. Kato: Perturbation Theory for Linear Operators, Springer, 2nd ed., 1980.
[13] H.-J. Kowalsky: Lineare Algebra, De Gruyter, 1967.

[14] H.-O. Kreiss: Über die Stabilitätsdefinition für Differenzengleichungen, die partielle
Differentialgleichungen approximieren, BIT, 153–181 (1962).


[15] P. Lancaster and M. Tismenetsky: The Theory of Matrices with Applications, Aca-
demic Press, 1985.

[16] D. W. Lewis: Matrix Theory, World Scientific, 1991.

[17] J. M. Ortega: Matrix Theory, A Second Course, Springer, 1987.

[18] B. N. Parlett: The Symmetric Eigenvalue Problem, Prentice-Hall, 1980.

[19] R. Bhatia: Matrix Analysis, Springer, 1997.

[20] L. N. Trefethen: Pseudospectra of linear operators, SIAM Rev. 39, 383–406 (1997).

[21] L. N. Trefethen: Computation of pseudospectra, Acta Numerica 8, 247–295, 1999.

[22] L. N. Trefethen and M. Embree: Spectra and Pseudospectra, Princeton University


Press Europe, 2005.

[23] J. H. Wilkinson: Rounding Errors in Algebraic Processes, Prentice-Hall, 1963.

[24] J. H. Wilkinson: The Algebraic Eigenvalue Problem, Clarendon Press, 1965.

(II) References on Numerical Linear Algebra

[25] G. Allaire and S. M. Kaber: Numerical Linear Algebra, Springer, 2007.

[26] A. Björck and C. C. Paige: Loss and recapture of orthogonality in the modified Gram-
Schmidt algorithm, SIAM J. Matrix Anal. Appl. 13, 176–190 (1992).

[27] A. Brandt, S. McCormick, and J. Ruge: Multigrid method for differential eigenvalue
problems, J. Sci. Stat. Comput. 4, 244–260 (1983).

[28] Ph. G. Ciarlet: Introduction to Numerical Linear Algebra and Optimization, Cam-
bridge University Press, 1989.

[29] M. Crouzeix, B. Philippe, and M. Sadkane: The Davidson method, SIAM J. Sci.
Comput. 15, 62–76 (1994).

[30] E. R. Davidson: The iterative calculation of a few of the lowest eigenvalues and
corresponding eigenvectors of large real-symmetric matrices, J. Comput. Phys. 17,
87–94 (1975).

[31] B. N. Datta: Numerical Linear Algebra and Applications, Springer, 2008.

[32] J. W. Demmel: Applied Numerical Linear Algebra, SIAM, 1997.

[33] P. Deuflhard and A. Hohmann: Numerische Mathematik I, De Gruyter, 2002 (3rd


edition).

[34] D. K. Faddeev and W. N. Faddeeva: Numerische Methoden der linearen Algebra,


Deutscher Verlag der Wissenschaften, 1964.

[35] D. Gerecht, R. Rannacher and W. Wollner: Computational aspects of pseudospectra


in hydrodynamic stability analysis, J. Math. Fluid Mech. 14, 661–692 (2012).

[36] G. H. Golub and C. F. van Loan: Matrix Computations, Johns Hopkins University
Press, 1984.

[37] W. Hackbusch: Multi-Grid Methods and Applications, Springer, 1985.

[38] W. Hackbusch: Iterative Lösung großer schwachbesetzter Gleichungssysteme, Teub-


ner, 1991.

[39] G. Hämmerlin and K.-H. Hoffmann: Numerische Mathematik; Springer, 1989.

[40] W. W. Hager: Applied Numerical Linear Algebra, Prentice Hall, 1988.

[41] V. Heuveline and C. Bertsch: On multigrid methods for the eigenvalue computation
of nonselfadjoint elliptic operators, East-West J. Numer. Math. 8, 257–342 (2000).

[42] J. G. Heywood, R. Rannacher, and S. Turek: Artificial boundaries and flux and
pressure conditions for the incompressible Navier-Stokes equations, Int. J. Nu-
mer. Meth. Fluids 22, 325–352 (1996).

[43] D. Meidner, R. Rannacher, and J. Vihharev: Goal-oriented error control of the iter-
ative solution of finite element equations, J. Numer. Math. 17, 143-172 (2009).

[44] B. N. Parlett: Convergence of the QR algorithm, Numer. Math. 7, 187–193 (1965);


corr. in 10, 163-164 (1965).

[45] R. Rannacher, A. Westenberger, and W. Wollner: Adaptive finite element approxima-


tion of eigenvalue problems: balancing discretization and iteration error, J. Numer.
Math. 18, 303–327 (2010).

[46] Y. Saad: Numerical Methods for Large Eigenvalue Problems, Manchester University
Press, 1992.

[47] H. R. Schwarz, H. Rutishauser and E. Stiefel: Numerik symmetrischer Matrizen,


Teubner, 1968.

[48] G. L. G. Sleijpen and H. A. Van der Vorst: A Jacobi-Davidson iteration method for
linear eigenvalue problems, SIAM Review 42, 267–293 (2000).

[49] C. E. Soliverez and E. Gagliano: Orthonormalization on the plane: a geometric


approach, Mex. J. Phys. 31, 743-758 (1985).

[50] J. Stoer and R. Bulirsch: Numerische Mathematik 1/2, Springer, 2007 (10th edi-
tions).

[51] G. Strang: Linear Algebra and its Applications, Academic Press, 1980.

[52] G. W. Stewart: Introduction to Matrix Computations, Academic Press, 1973.



[53] J. Todd: Basic Numerical Mathematics, Vol. 2: Numerical Algebra, Academic Press,
1977.

[54] L. N. Trefethen and D. Bau, III: Numerical Linear Algebra, SIAM, 1997.

[55] R. S. Varga: Matrix Iterative Analysis, Springer, 2000 (2nd edition).

[56] T.-L. Wang and W. B. Gragg: Convergence of the shifted QR algorithm for unitary
Hessenberg matrices, Math. Comput. 71, 1473–1496 (2001).

[57] D. M. Young: Iterative Solution of Large Linear Systems, Academic Press, 1971.

(III) References on the Origin of Problems and Applications

[58] O. Axelsson and V. A. Barker: Finite Element Solution of Boundary Value Problems,
Theory and Computation, Academic Press, 1984.

[59] D. Braess: Finite Elemente, Springer 2003 (3rd edition).

[60] H. Goering, H.-G. Roos, and L. Tobiska: Finite-Elemente-Methode, Akademie-


Verlag, 1993 (3rd edition).

[61] C. Großmann, H.-G. Roos: Numerik partieller Differentialgleichungen, Teubner, 1992

[62] W. Hackbusch: Theorie und Numerik elliptischer Differentialgleichungen, Teubner,


1986.

[63] A. R. Mitchell and D. F. Griffiths: The Finite Difference Method in Partial Differ-
ential Equations, Wiley, 1980

[64] A. Quarteroni and A. Valli: Numerical Approximation of Partial Differential Equa-


tions, Springer, 1994.

[65] M. Schäfer and S. Turek: Benchmark computations of laminar flow around a cylinder,
in Flow Simulation with High-Performance Computer II, Notes on Numerical Fluid
Mechanics, vol. 48 (Hirschel, E. H., ed.), pp. 547–566, Vieweg, 1996.

[66] H. R. Schwarz: Numerische Mathematik, B. G. Teubner, 1986

[67] G. Strang and G. J. Fix: An Analysis of the Finite Element Method, Prentice-Hall,
1973.

[68] A. Tveito and R. Winther: Introduction to Partial Differential Equations: A Com-


putational Approach, Springer, 1998.
Index
ε-pseudo-spectrum, 41, 54, 185 Cholesky decomposition, 62, 75, 106
coarse-grid correction, 195
A-orthogonal, 124, 131 column pivoting, 57
A-scalar product, 124 complex arithmetic, 185
adjoint transpose, 23 condition number, 46
adjuncts, 23 conditioning, 45
algorithm contraction constant, 104
classical Gram-Schmidt, 21 coordinate relaxation, 127
Crout, 63 correction equation, 64
exchange, 67 Crout (1907–1984), 63
Gauß-Jordan, 66, 67
Gaussian elimination, 97 defect, 5, 26, 29, 64, 76
Givens, 92 defect correction, 64, 65, 99
Gram-Schmidt, 52, 77, 184 defect equation, 191
Householder, 78, 91 definiteness, 13
modified Gram-Schmidt, 22, 168 Descartes (1596–1650), 1
Thomas, 72 descent direction, 126
angle, 25 descent method, 147
ansatz space, 137 determinant, 23
approximation property, 197 deviation
arithmetic complexity, 56 maximal, 5
arithmetic operation, 56 mean, 4
Arnoldi (1917–1995), 165 difference approximation, 6
Arnoldi basis, 168 difference equation, 6
Arnoldi relation, 169 discretization, 6
dyadic product, 79
backward substitution, 55, 59 dynamic shift, 183
Banach space, 15
band matrix, 7 eigenspace, 3, 28
band type, 70 eigenvalue, 3, 28
band width, 71 deficient, 29
basis eigenvalue equation, 28
orthogonal, 20 eigenvalue problem
orthonormal, 20 full, 29
best approximation, 20 partial, 29
bilinear form, 17 eigenvector, 3, 28
Burgers equation, 178 generalized, 30
energy form, 189
Cartesian basis, 1 equalization parabola, 5
Cauchy sequence, 14 exponential stability, 38
central difference quotient, 141
characteristic polynomial, 28, 95 fill-in, 8
Chebyshev (1821–1894), 5 finite difference discretization, 206
Chebyshev equalization, 5 fixed-point iteration, 99, 145
Chebyshev polynomial, 119, 135 fixed-point problem, 99
Cholesky (1875–1918), 75 forward substitution, 55, 59


Frobenius (1849–1917), 33 Lagrange basis, 206


Lanczos (1893–1974), 165
Galerkin (1871–1945), 131 Lanczos algorithm, 185
Galerkin equation, 131 Lanczos relation, 171
Galerkin orthogonality, 104 Laplace (1749–1827), 6
Gauß (1777–1855), 56 line search, 126
Gaussian elimination, 55, 56, 91 linear mapping, 22
Gaussian equalization, 4 linear system, 2
Gerschgorin circle, 50, 53
overdetermined, 2
Gershgorin (1901–1933), 49
quadratic, 2
Givens (1910–1993), 92
underdetermined, 2
Givens transformation, 91
load vector, 190
Gram (1850–1916), 21
LR decomposition, 55, 59, 62, 72, 97
grid transfer, 195

Hölder (1859–1937), 17 M-matrix, 149


half-band width, 7 machine accuracy, 62
Hessenberg (1904–1959), 49, 88 mass matrix, 190
Hessenberg normal form, 91 matrix, 22
Hestenes (1906–1991), 131 irreducible, 145
homogeneity, 13 band, 70, 141
Householder (1904–1993), 79 consistently ordered, 114, 141
Householder transformation, 79, 91 diagonalizable, 32, 90, 156
diagonally dominant, 72
identity Frobenius, 56
parallelogram, 20 Hermitian, 24
Parseval, 20 Hessenberg, 49, 88
identity matrix, 23 ill-conditioned, 48
inequality inverse, 23
Cauchy-Schwarz, 17 irreducible, 109, 141
Hölder, 18 Jacobi, 142
Minkowski, 14, 18 lower triangular, 71
Young, 18 normal, 24, 54
inverse iteration, 155, 185 of nonnegative type, 149
iteration matrix, 99 orthogonal, 25
orthonormal, 25, 77
Jacobi (1804–1851), 105
permutation, 56
Jordan (1838–1922), 30
positive definite, 24, 73, 124
Jordan normal form, 30, 89
positive semi-definite, 24
Kantorovich (1912–1986), 128
kernel, 23 reducible, 109
Kreiss (1930–2015), 40 regular, 23
Kronecker symbol, 1, 23 similar, 31, 88
Krylov (1879–1955), 132 sparse, 72
Krylov matrix, 167 strictly diagonally dominant, 73
Krylov space, 132, 184 symmetric, 24, 124

triangular, 55 Richardson, 99, 187


tridiagonal, 49, 71, 88 SOR, 106, 111, 142
unitarily diagonalizable, 32 SSOR, 124
unitary, 25 two-grid, 193
upper triangular, 71 minimal solution, 27, 86
matrix norm Minkowski (1864–1909), 18
compatible, 33 Mises, von (1883–1953), 153
natural, 33 multigrid cycle, 191
mesh-point numbering multiplicity
checkerboard, 8 algebraic, 29
diagonal, 8 geometric, 28
row-wise, 8
Navier-Stokes equation, 10
method
neighborhood, 14
gradient, 149
nested multigrid, 194
Jacobi, 145
Neumann (1832–1925), 36
ADI, 107
Neumann series, 36
Arnoldi, 165, 167
nodal basis, 189, 206
bisection, 96
norm
CG, 131, 143, 149, 207
l1 , 14
Chebyshev acceleration, 146
l∞ , 14
Cholesky, 75
lp , 14
coordinate relaxation, 147
Euclidian, 13
damped Richardson, 147
Frobenius, 33, 87
descent, 126
maximal row-sum, 34
direct, 3
maximum, 14
finite element, 104
spectral, 33
finite element Galerkin, 206
submultiplicative, 33
Gauß-Seidel, 9, 105, 108, 113, 142, 145
normal equation, 27, 76, 148
gradient, 127, 143
normal form, 31
Hyman, 92
normed space, 13
ILU, 106
null space, 23
iterative, 3
null vector, 1
Jacobi, 9, 105, 108
numerical rank, 84
Jacobi-Davidson, 158
Krylov space, 137 operator
Lanczos, 165 5-point, 6
Least-Error Squares, 26 7-point, 7
LR, 159 coarse-grid, 195
multigrid, 187 divergence, 4
optimal SOR, 148 gradient, 4
PCG, 139 Laplace, 4, 6, 140, 157, 204
power, 153 nabla, 4
projection, 132 two-grid, 196
QR, 159 orthogonal
reduction, 89 complement, 19

projection, 19 rotation, 25, 91


system, 20 row sum criterion, 108, 110
Ostrowski (1893–1986), 112 Rutishauser (1918–1970), 159
overrelaxation, 111
saddle point system, 147, 148
Parseval (1755–1836), 20 scalar product, 16, 53
Penrose (1931–), 87 Euclidian, 16
perturbation equation, 38 semi, 17
pivot element, 57, 67 Schmidt (1876–1959), 21
pivot search, 57 Schur (1875–1941), 89
pivoting, 61 Schur complement, 147
point, 1 Schur normal form, 89
accumulation, 16 Seidel, von (1821–1896), 9
isolated, 16 sequence
Poisson problem, 206 bounded, 14
post-smoothing, 191 convergent, 14
power method, 183, 208 sesquilinear form, 17, 53
pre-smoothing, 190, 192 set
precondition closed, 14
diagonal scaling, 208 compact, 16
preconditioner, 99 open, 14
preconditioning, 136, 138 resolvent, 28
diagonal, 139 sequentially compact, 16
ICCG, 140 similarity transformation, 31
SSOR, 139, 149 singular value, 83
product space, 16 singular value decomposition, 82
prolongation, 191 smoother, 194
pseudo-spectrum, 41 smoothing operation, 194
pseudoinverse, 87 smoothing property, 196
Sobolev (1908–1989), 189
QR decomposition, 77 Sobolev space, 189
QR method, 183 solution operator, 39
spectral condition number, 47, 53
range, 23 spectral radius, 53, 100, 102
Rayleigh (1842–1919), 29 spectrum, 28, 52
Rayleigh quotient, 29, 154 step length, 126
reflection, 79, 91 Stiefel (1909–1978), 131
Reich (1927–2009), 112 stiffness matrix, 190
relaxation parameter, 106 stopping criterion, 103
relaxation step, 111 Sturm (1803–1855), 95
residual, 64, 104 Sturm chain, 95
resolvent, 28 Sturm-Liouville problem, 177
restriction, 191 system matrix, 190
Richardson (1881–1953), 99
Ritz (1878–1909), 166 test space, 137
Ritz eigenvalue, 166 theorem

Banach, 100 underrelaxation, 111


Cauchy, 15
Gerschgorin, 50 V-cycle, 193
Kantorovich, 128 Vandermonde (1735–1796), 5
multigrid complexity, 200 Vandermondian determinant, 5
multigrid convergence, 199 vector, 1
nested multigrid, 201 norm, 13
norm equivalence, 15 space, 1
Pythagoras, 20
W-cycle, 193
two-grid convergence, 196
Wielandt (1910–2001), 155
Thomas (1903–1992), 72
Wilkinson (1919–1986), 62
total pivoting, 57, 61
trace, 23, 53 Young (1863–1942), 17
This introductory text is based on lectures within a multi-semester
course on “Numerical Mathematics”, taught by the author at Heidelberg
University. The present volume treats algorithms for solving basic prob-
lems from Linear Algebra such as (large) systems of linear equations
and corresponding eigenvalue problems. Theoretical as well as practical
aspects are considered. Applications are found in the discretization
of partial differential equations and in spectral stability analysis. As
a prerequisite, only the prior knowledge that is usually taught in the introductory
Analysis and Linear Algebra courses is required. To support self-study, each chapter
contains exercises with solutions collected in the appendix.

About the author


Rolf Rannacher, retired professor of Numerical Mathematics at
Heidelberg University, study of Mathematics at the University of
Frankfurt/Main, doctorate 1974, postdoctorate (Habilitation) 1978
at Bonn University – 1979/1980 Vis. Assoc. Professor at the University
of Michigan (Ann Arbor, USA), thereafter Professor at Erlangen and
Saarbrücken, in Heidelberg since 1988 – field of interest “Numerics
of Partial Differential Equations”, especially the “Finite Element Method”
and its applications in the Natural Sciences and Engineering, more
than 160 scientific publications.

