Poulson Dissertation
by
Jack Lesly Poulson
2012
The Dissertation Committee for Jack Lesly Poulson
certifies that this is the approved version of the following dissertation:
Committee:
Björn Engquist
Sergey Fomel
Omar Ghattas
by
Jack Lesly Poulson
DISSERTATION
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
Acknowledgments

I would like to begin by thanking Lexing Ying for his immense amount
of help and advice. He has always made me feel comfortable discussing my
conceptual stumbling blocks, and I am now convinced that there are few things
he cannot teach with five minutes and a blackboard. Looking back, it is hard
to express how much he has helped me learn over the past three years. Modern
numerical analysis is far more intellectually satisfying than I could have ever
possibly known, and I look forward to continually expanding upon the many
insights he has been kind enough to share with me.
I would also like to thank Paul Tsuji for sharing his hard-earned experience in extending sweeping preconditioners to such a large class of problems and discretizations. He has been a good friend and a better colleague, and I
look forward to many fruitful future collaborations.
Abstract

Fast Parallel Solution of Heterogeneous
3D Time-harmonic Wave Equations
Publication No.
invariance of free-space Green’s functions in order to justify the replacement
of each dense matrix within a certain modified multifrontal method with the
sum of a small number of Kronecker products. For the sake of reproducibility,
every algorithm exercised within this dissertation is made available as part of
the open source packages Clique and Parallel Sweeping Preconditioner (PSP).
Table of Contents

Acknowledgments
Abstract
List of Tables

Chapter 1. Introduction
1.1 Fundamental problem
1.2 Time-harmonic wave equations
1.3 Solving Helmholtz equations
1.4 Background material
1.5 Contributions
1.6 Outline

2.4.1 Symbolic factorization
2.4.2 Numeric factorization
2.4.3 Standard triangular solves
2.4.4 Selective inversion
2.4.5 Solving with many right-hand sides
2.4.6 Graph partitioning
2.5 Summary

5.1.5 Reproducibility
5.2 Future work

Appendices
Bibliography
List of Tables

2.1 Storage and work upper bounds for the multifrontal method applied to d-dimensional grid graphs with n degrees of freedom in each direction (i.e., N = n^d total degrees of freedom)

3.1 The number of iterations required for convergence for four model problems (with four forcing functions per model). The grid sizes were 500^3 and roughly 50 wavelengths were spanned in each direction. Five grid points were used for all PML discretizations, four planes were processed per panel, and the damping factors were all set to 7.

3.2 Convergence rates and timings (in seconds) on TACC's Lonestar for the SEG/EAGE Overthrust model, where timings in parentheses do not make use of selective inversion. All cases used a complex double-precision second-order finite-difference stencil with five grid spacings for all PML (with a magnitude of 7.5), and a damping parameter of 2.25π. The preconditioner was configured with four planes per panel and eight processes per node. The 'apply' timings refer to a single application of the preconditioner to four right-hand sides.
List of Figures

3.2 (Left) A depiction of the portion of the domain involved in the computation of the Schur complement of an x1 x2 plane (marked with the dashed line) with respect to all of the planes to its left during execution of Alg. 3.1. (Middle) An equivalent auxiliary problem which generates the same Schur complement; the original domain is truncated just to the right of the marked plane and a homogeneous Dirichlet boundary condition is placed on the cut. (Right) A local auxiliary problem for generating an approximation to the relevant Schur complement; the radiation boundary condition of the exact auxiliary problem is moved next to the marked plane.

3.3 A separator-based elimination tree (right) over a quasi-2D subdomain (left)
3.4 Overlay of the owning process ranks of a 7 × 7 matrix distributed over a 2 × 3 process grid in the [MC, MR] distribution, where MC assigns row i to process row i mod 2, and MR assigns column j to process column j mod 3 (left). The process grid is shown on the right.

3.5 Overlay of the owning process ranks of a vector of height 7 distributed over a 2 × 3 process grid in the [VC, ?] vector distribution (left) and the [VR, ?] vector distribution (right).

3.6 Overlay of the owning process ranks of a vector of height 7 distributed over a 2 × 3 process grid in the [MC, ?] distribution (left) and the [MR, ?] distribution (right).
3.7 A single x2 x3 plane from each of the four analytical velocity models over a 500^3 grid with the smallest wavelength resolved with ten grid points. (Top-left) the three-shot solution for the barrier model near x1 = 0.7, (bottom-left) the three-shot solution for the two-layer model near x1 = 0.7, (top-right) the single-shot solution for the wedge model near x1 = 0.7, and (bottom-right) the single-shot solution for the waveguide model near x1 = 0.55.

3.8 Three cross-sections of the SEG/EAGE Overthrust velocity model, which represents an artificial 20 km × 20 km × 4.65 km domain, containing an overthrust fault, using samples every 25 m. The result is an 801 × 801 × 187 grid of wave speeds varying discontinuously between 2.179 km/sec (blue) and 6.000 km/sec (red).

3.9 Three planes from an 8 Hz solution with the Overthrust model at its native resolution, 801 × 801 × 187, with a single localized shot at the center of the x1 x2 plane at a depth of 456 m: (top) an x2 x3 plane near x1 = 14 km, (middle) an x1 x3 plane near x2 = 14 km, and (bottom) an x1 x2 plane near x3 = 0.9 km.
4.1 (Top) The real part of a 2D slice of the potential generated by a source in the obvious location and (bottom) the same potential superimposed with that of a mirror-image charge. The potential is identically zero along the vertical line directly between the two charges.

4.2 A set of six unevenly spaced points (in blue) along the line segment {0} × [0, 1] × {0} and their four evenly spaced translations (in red) in the direction ê3.

4.3 Convergence of the compressed moving PML sweeping preconditioner in GMRES(20) for a 250^3 waveguide problem, with ten points per wavelength and relative tolerances of 0.01 and 0.05 for the Kronecker product approximations of the top-left and bottom-left quadrants of each frontal matrix, respectively
Chapter 1
Introduction
While the details of effective inversion techniques are well beyond the
scope of this document, it is important to recognize a few common character-
istics:
In order to drive the last point home, let us recognize that the speed of sound
in water is roughly 1.5 km/sec, which implies that it would take roughly 10
seconds for sound to propagate across a distance of 15 km. If a sound wave
oscillates at a frequency of 30 Hz, then, throughout the course of the 10 seconds
it takes to travel this distance, it will oscillate 300 times.
1
It is important to combat so-called pollution effects [15], for instance using high-order
spectral elements [123, 124], but this issue is somewhat orthogonal to the focus of this
dissertation.
1.2 Time-harmonic wave equations
Consider the scalar analogue of the elastic wave equation, the 3D acous-
tic wave equation [121],
$$ \left[-\Delta + \frac{1}{c^2(x)}\frac{\partial^2}{\partial t^2}\right] U(x,t) = F(x,t), \tag{1.2} $$
where U (x, t) is the time and spatially-varying pressure field, F (x, t) is the
forcing function, and c(x) is the sound speed. If we were to perform a for-
mal Fourier transform [84] in the time variable of each side of this equation,
we would decouple the solution U (x, t) into a sum of solutions of the form
u(x)e−iωt , each driven by a forcing function of the form f (x)e−iωt , where ω is
the temporal frequency, which usually has units of radians per second.2 Since
$$ \frac{\partial^2}{\partial t^2}\, u(x)\, e^{-i\omega t} = -\omega^2\, u(x)\, e^{-i\omega t}, $$
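Carrying the substitution through makes the cancellation explicit (a sketch; the label (1.3) for the resulting Helmholtz equation is assumed from the later references to it):

$$ \left[-\Delta + \frac{1}{c^2(x)}\frac{\partial^2}{\partial t^2}\right] u(x)\,e^{-i\omega t} = \left[-\Delta - \frac{\omega^2}{c^2(x)}\right] u(x)\,e^{-i\omega t} = f(x)\,e^{-i\omega t}, $$

so that, after dividing out the common factor e^{-iωt},

$$ \left[-\Delta - \frac{\omega^2}{c^2(x)}\right] u(x) = f(x). $$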
Because of the implicit e−iωt time-dependence of the solution and forcing func-
tion, the Helmholtz equation is (the prototypical example of) a time-harmonic
wave equation.
2
Different communities have different conventions for Fourier transforms. It is also com-
mon to use a time-dependence of eiωt , but this essentially only requires conjugating the
appropriate terms in the following discussion.
infinity as the discretization is refined. The Helmholtz operator, −∆ − ω 2 ,
therefore has the same eigenvalues, but shifted left by a distance of ω 2 in the
complex plane. Thus, for any nonzero frequency ω and sufficiently fine dis-
cretization, its eigenvalues span both sides of the origin on the real line (and
are often quite close to the origin). This admittedly crude interpretation of the
spectrum of the Helmholtz operator explains the essential difficulty with solv-
ing Helmholtz equations via iterative methods. For instance, because of the
shape of the spectrum, it is not possible to construct an ellipse in the complex
plane which contains the spectrum but not the origin (see Definition B.4.4).
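This indefiniteness is easy to observe numerically; the following sketch (our own illustration, with an arbitrary frequency and a second-order 1D discretization) shifts the eigenvalues of a discrete Laplacian by ω² and shows that they straddle the origin once the grid is fine enough:

import numpy as np

def shifted_laplacian_eigs(n, omega):
    # Eigenvalues of the 1D Dirichlet operator -d^2/dx^2 - omega^2 on (0, 1),
    # discretized with n interior points and second-order finite differences.
    h = 1.0 / (n + 1)
    A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    return np.linalg.eigvalsh(A) - omega**2

omega = 40.0
for n in (50, 200, 800):
    lam = shifted_laplacian_eigs(n, omega)
    print(n, lam.min(), lam.max())   # eigenvalues appear on both sides of the origin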
where E(x, t) and H(x, t) respectively model the three-dimensional electric and
magnetic fields, and J(x, t) is the current distribution, which is also a three-
dimensional vector quantity. If we again perform a formal Fourier transform in
the time variable of each set of equations, then we may represent the solutions
E(x, t) and H(x, t) as sums of terms of the form E(x)e−iωt and H(x)e−iωt ,
which are each driven by a harmonic component J (x)e−iωt of the current
distribution, J(x, t). Substituting these harmonic quantities into Maxwell’s
equations yields
which may be readily proved with so-called index notation [104] via the prod-
uct rule and knowledge of the Levi-Civita symbol, which is also known as the
alternating symbol. In particular, if we take the curl of both sides of Equa-
tion (1.8), then our vector identity allows us to write
∇(∇ · E) − ∆E = iω∇ × H.
We may then substitute Equations (1.10) and (1.11) in order to arrive at the
result
$$ \left[-\Delta - \omega^2\right] E(x) = f_E(\mathcal{J}) \equiv i\omega \mathcal{J} - \frac{i}{\omega}\nabla(\nabla \cdot \mathcal{J}), \tag{1.13} $$
which was purposely written in a form similar to that of Equation (1.3). We
may similarly take the curl of Equation (1.9) in order to find
$$ \left[-\Delta - \omega^2\right] H(x) = f_H(\mathcal{J}) \equiv \nabla \times \mathcal{J}, \tag{1.14} $$
which is clearly also similar in structure to the Helmholtz equation. Note that
these time-harmonic equations for the electric and magnetic fields, which are
ostensibly decoupled, will typically interact through an appropriate choice of
boundary conditions.
which we have written in the usual index notation.3 The basic idea is that xi
refers to the i’th coordinate of the vector x, xi,j refers to the derivative of the
i’th coordinate in the j’th direction, and a repeated index implies that it is a
dummy summation variable, e.g., Aik Bkj is an expression of the (i, j) entry of
the matrix AB, and Aki Bkj represents the (i, j) entry of AT B.
3
For an introduction to index notation and linear elasticity, please consult a book on
continuum mechanics, such as [104].
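For readers who prefer code to index gymnastics, the summation convention corresponds directly to NumPy's einsum; a small illustration of the two expressions above (our example, not the dissertation's):

import numpy as np

A = np.arange(9.0).reshape(3, 3)
B = np.arange(9.0, 18.0).reshape(3, 3)

AB  = np.einsum('ik,kj->ij', A, B)   # A_{ik} B_{kj}: the (i, j) entry of A B
AtB = np.einsum('ki,kj->ij', A, B)   # A_{ki} B_{kj}: the (i, j) entry of A^T B

assert np.allclose(AB, A @ B)
assert np.allclose(AtB, A.T @ B)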
conditioners for Helmholtz equations without internal resonance [43, 44]. Both
approaches approximate a block LDLT factorization of the Helmholtz opera-
tor in block tridiagonal form in a manner which exploits a radiation boundary
condition [78]. The first approach performs a block tridiagonal factorization
algorithm in H-matrix arithmetic [59, 66], while the second approach approx-
imates the Schur complements of the factorization using auxiliary problems
with artificial radiation boundary conditions. Though the H-matrix sweeping
preconditioner has theoretical support for two-dimensional problems [43, 92],
there is not yet justification for three-dimensional problems.
1.4 Background material
A more detailed discussion of the sweeping preconditioner requires a
thorough knowledge of sparse-direct methods and at least a passing familiar-
ity with iterative methods and wave equations. Chapter 2 gives a detailed
introduction to multifrontal Cholesky factorization and triangular solves un-
der the assumption that the reader is well-versed in numerical linear algebra.
If this is not the case, it is recommended that the reader work through the
proofs in Appendix A leading up to the Hermitian spectral decomposition
(Corollary A.2). In addition, Appendix B is provided for those who would
like to familiarize themselves with the theory behind the Generalized Min-
imum Residual method (GMRES) [108] used in Chapters 3 and 4. Lastly,
Appendix C is provided for readers who are interested in the (parallel) dense
linear algebra algorithms used at the core of the multifrontal algorithms dis-
cussed in Chapter 2. Their implementations are all part of the Elemental
library [99], which is written in a style derived from FLAME notation [60]. It
is also strongly recommended that the reader familiarize themselves with the
fundamentals behind BLAS [37], LAPACK [5], and MPI [39]. In particular,
readers interested in gaining a better understanding of collective communica-
tion algorithms should consult [28].
1.5 Contributions
There are essentially four significant contributions in this dissertation:
2. a parallelization of the sweeping preconditioner is introduced and applied
to large-scale challenging heterogeneous 3D Helmholtz equations,
1.6 Outline
The remainder of this dissertation is organized as follows:
• Chapter 4 describes how the translation invariance of free-space Green’s
functions can be used to compress certain half-space Green’s functions
associated with the diagonal blocks of a modified multifrontal factoriza-
tion, as well as how these compressed matrices may be efficiently applied
as part of a multifrontal triangular solve.
Chapter 2
Multifrontal methods
frontal Cholesky factorization and solution [56, 63, 64, 113] in the following
two ways:
1. element-wise matrix distributions [71, 93, 99] will be used for parallelizing
dense linear algebra operations, and
non-negative, and Hermitian positive-definite (HPD) when every eigenvalue is
positive.
A = R^H R,
where R is upper-triangular.
A = V Λ V^H,
is also HPSD, and A^{1/2} A^{1/2} = A, which justifies the choice of notation.
Now, let us apply Theorem A.9 to A^{1/2} to show there exists some unitary matrix Q ∈ M_n and upper-triangular matrix R ∈ M_n such that A^{1/2} = QR. But A^{1/2} is Hermitian, and so A^{1/2} = R^H Q^H, and we find that
A = R^H Q^H Q R = R^H R.
Remark 2.1.1. We have now justified an interpretation of a Cholesky factor
of a Hermitian positive semi-definite matrix A as the square-root of A: it may
always be chosen to be the upper-triangular factor, R, of a QR decomposition
of A^{1/2}. We will now strengthen our assumptions and give a constructive proof
of uniqueness which will serve as the starting point for the remainder of this
chapter.
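This interpretation is easy to verify numerically; a small sketch with a randomly generated HPD matrix (our own check, not part of the original text):

import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = X.conj().T @ X + n * np.eye(n)            # Hermitian positive-definite

lam, V = np.linalg.eigh(A)                    # spectral decomposition A = V diag(lam) V^H
Ahalf = (V * np.sqrt(lam)) @ V.conj().T       # the Hermitian square root A^{1/2}

Q, R = np.linalg.qr(Ahalf)                    # A^{1/2} = Q R
assert np.allclose(R.conj().T @ R, A)         # so A = R^H Q^H Q R = R^H R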
Proof.
(y, X^H A X y) = ((Xy), A(Xy)) ≥ 0,
with equality if and only if Xy = 0. But, because the columns of X are linearly
independent, Xy must be nonzero.
A = LLH ,
that the top-left entry of A, α_{0,0} = |ρ_{0,0}|^2, is positive, and so we can legally take its square root and divide by it. Consider the partitioned matrix expression
$$
\begin{pmatrix} \alpha_{1,1} & a_{2,1}^H \\ a_{2,1} & A_{2,2} \end{pmatrix}
=
\begin{pmatrix} \lambda_{1,1} & 0 \\ \ell_{2,1} & L_{2,2} \end{pmatrix}
\begin{pmatrix} \lambda_{1,1} & \ell_{2,1}^H \\ 0 & L_{2,2}^H \end{pmatrix}
=
\begin{pmatrix} \lambda_{1,1}^2 & \lambda_{1,1}\,\ell_{2,1}^H \\ \lambda_{1,1}\,\ell_{2,1} & \ell_{2,1}\ell_{2,1}^H + L_{2,2}L_{2,2}^H \end{pmatrix},
$$
where L = \begin{pmatrix} \lambda_{1,1} & 0 \\ \ell_{2,1} & L_{2,2} \end{pmatrix} is the lower-triangular Cholesky factor we wish to show can be chosen to have a positive diagonal. Clearly we may set λ_{1,1} := √α_{1,1} and ℓ_{2,1} := a_{2,1}/λ_{1,1}, and then find that L_{2,2} L_{2,2}^H = A_{2,2} − ℓ_{2,1} ℓ_{2,1}^H, but A_{2,2} − ℓ_{2,1} ℓ_{2,1}^H = A_{2,2} − a_{2,1} a_{2,1}^H/α_{1,1} is the Schur complement of α_{1,1} within A. But this Schur complement is the result of the product X^H A X, where
$$ X = \begin{pmatrix} -a_{2,1}^H/\alpha_{1,1} \\ I \end{pmatrix} $$
has linearly independent columns. The preceding lemma therefore yields the result.
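The constructive argument translates directly into a (numerically naive) recursive routine; a minimal sketch for dense HPD inputs, for illustration only:

import numpy as np

def cholesky_lower(A):
    # Lower-triangular L with positive diagonal such that A = L L^H, built by
    # peeling off one row/column at a time as in the partitioned proof above.
    A = np.array(A, dtype=complex)
    n = A.shape[0]
    L = np.zeros((n, n), dtype=complex)
    lam = np.sqrt(A[0, 0].real)                      # alpha_{1,1} > 0 for HPD A
    L[0, 0] = lam
    if n > 1:
        l21 = A[1:, 0] / lam                         # ell_{2,1} := a_{2,1} / lambda_{1,1}
        L[1:, 0] = l21
        S = A[1:, 1:] - np.outer(l21, l21.conj())    # Schur complement, again HPD
        L[1:, 1:] = cholesky_lower(S)
    return L

A = np.array([[4.0, 2.0 + 1.0j], [2.0 - 1.0j, 10.0]])
L = cholesky_lower(A)
assert np.allclose(L @ L.conj().T, A)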
which can be directly computed with Algorithm 2.1 using the notation
$$ L_{1:2,0} \equiv \begin{pmatrix} L_{1,0} \\ L_{2,0} \end{pmatrix}, \qquad A_{1:2,1:2} \equiv \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}. $$
$$ A = \begin{pmatrix} A_{0,0} & 0 & A_{2,0}^H \\ 0 & A_{1,1} & A_{2,1}^H \\ A_{2,0} & A_{2,1} & A_{2,2} \end{pmatrix}. \tag{2.1} $$
We say that such a matrix has arrowhead [133] form.1 This class of matrices
is of interest because it allows us to easily motivate many of the basic ideas
behind the multifrontal method. In particular, the fact that A1,0 = 0 allows
us to decouple and reorganize many of the operations in Algorithm 2.1.
1
We should perhaps call A a block-arrowhead matrix for the same reason that a block-
diagonal matrix should not be called diagonal. Despite their usefulness for Hermitian eigen-
value problems, we will not make use of unblocked arrowhead matrices in this dissertation,
and so we will drop the block qualifier.
2.2.1 Factorization
a lower-order cost. For the sake of parallelism, it is then natural to form and
store these updates for later use. The result is Algorithm 2.3.
We can now recognize that Steps 2-4 of Algorithm 2.3 are equivalent
to the partial Cholesky factorization of the original frontal matrix
$$ \hat{F}_j = \begin{pmatrix} A_{j,j} & ? \\ A_{2,j} & 0 \end{pmatrix}, $$
Algorithm 2.3: reorganized arrowhead Cholesky
1 foreach j ∈ {0, 1} do
2   L_{j,j} := Cholesky(A_{j,j})
3   L_{2,j} := A_{2,j} L_{j,j}^{-H}
4   U_j := −L_{2,j} L_{2,j}^H
5 end
6 A_{2,2} := A_{2,2} + U_0 + U_1
7 L_{2,2} := Cholesky(A_{2,2})
where we have used a ? to denote a quadrant that will not be accessed due to implicit symmetry (i.e., A_{2,j} = A_{j,2}^H). If a right-looking Cholesky factorization
Algorithm 2.4: “multifrontal” arrowhead Cholesky
1 foreach j ∈ {0, 1} do
2 Fj := F̂j
3 Process Fj
4 end
5 F2 := F̂2
6 Add bottom-right quadrants of F0 and F1 onto F2
7 Process F2
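A dense NumPy sketch of Algorithms 2.3/2.4 for a single 3 × 3 block arrowhead matrix follows (ignoring sparsity within the blocks; block sizes and helper names are ours, not the dissertation's):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def process_front(F, k):
    # Partial Cholesky of a front whose top-left k x k quadrant is the pivot block:
    # overwrite the left half with [L_jj; L_2j] and the bottom-right with the update U_j.
    F = F.copy()
    Ljj = cholesky(F[:k, :k], lower=True)
    L2j = solve_triangular(Ljj, F[k:, :k].T, lower=True).T   # A_2j L_jj^{-H} (real case)
    F[:k, :k], F[k:, :k] = Ljj, L2j
    F[k:, k:] -= L2j @ L2j.T                                 # accumulates U_j = -L_2j L_2j^H
    return F

rng = np.random.default_rng(1)
n0, n1, n2 = 3, 4, 2
def spd(m):
    X = rng.standard_normal((m, m))
    return X @ X.T + m * np.eye(m)
A00, A11, A22 = spd(n0), spd(n1), spd(n2)
A20 = 0.1 * rng.standard_normal((n2, n0))
A21 = 0.1 * rng.standard_normal((n2, n1))

# Leaf fronts: pivot block, its coupling to the shared block, and a zero bottom-right quadrant.
F0 = process_front(np.block([[A00, A20.T], [A20, np.zeros((n2, n2))]]), n0)
F1 = process_front(np.block([[A11, A21.T], [A21, np.zeros((n2, n2))]]), n1)
# Root front: extend-add the children's bottom-right quadrants onto A_{2,2}, then factor.
L22 = cholesky(A22 + F0[n0:, n0:] + F1[n1:, n1:], lower=True)

A = np.block([[A00, np.zeros((n0, n1)), A20.T],
              [np.zeros((n1, n0)), A11, A21.T],
              [A20, A21, A22]])
L = np.block([[F0[:n0, :n0], np.zeros((n0, n1)), np.zeros((n0, n2))],
              [np.zeros((n1, n0)), F1[:n1, :n1], np.zeros((n1, n2))],
              [F0[n0:, :n0], F1[n1:, :n1], L22]])
assert np.allclose(L @ L.T, A)

A real multifrontal code would of course never assemble A or L densely; the sketch is only meant to exhibit the flow of Algorithm 2.4.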
can be performed with Algorithm 2.5, which simply executes a matrix version
of forward substitution (see Algorithm C.6). Clearly L1,0 = 0 implies that
Step 2 of Algorithm 2.5 can be replaced with X2 := X2 − L2,0 X0 , and the re-
sulting optimized algorithm is given in Algorithm 2.6. Just as we introduced
temporary storage in order to decouple the updates within an arrowhead fac-
torization, temporarily storing the update matrices Zj ≡ −L2,j Xj , j = 0, 1,
allows us to effectively decouple Steps 1-2 and 3-4 within Algorithm 2.6, re-
sulting in Algorithm 2.7.
Applying the inverse of L^H, say,
$$
\begin{pmatrix} X_0 \\ X_1 \\ X_2 \end{pmatrix} :=
\begin{pmatrix} L_{0,0}^H & 0 & L_{2,0}^H \\ 0 & L_{1,1}^H & L_{2,1}^H \\ 0 & 0 & L_{2,2}^H \end{pmatrix}^{-1}
\begin{pmatrix} X_0 \\ X_1 \\ X_2 \end{pmatrix}, \tag{2.5}
$$
2.2.3 Separators
only if the (i, j) entry of A is nonzero (it is therefore common to associate
vertex j with the j'th diagonal entry of A). Because A_{1,0} = A_{0,1}^H = 0 within a
[Figure: the nonzero pattern of an example 8 × 8 sparse matrix over vertices 0 through 7 (left) and its corresponding graph (right).]
clearly not in arrowhead form, but it turns out that investigating its graph
leads to a way to permute the matrix into the desired form: If we clipped all
external connections to vertices {3,7}, we would end up with the graph shown
in Figure 2.2b. It should be clear from the clipped graph that the remaining set
of vertices can be grouped into two subsets, V0 = {1, 2, 5} and V1 = {0, 4, 6},
which do not directly interact; we therefore refer to vertices V2 = {3, 7} as a
separator. If we now reorder the sparse matrix so that vertex set V0 appears
[Figure 2.2: (a) the matrix reordered so that V0 = {1, 2, 5} appears first, then V1 = {0, 4, 6}, and finally the separator V2 = {3, 7}; (b) the graph with all connections to the separator vertices {3, 7} clipped.]
first, followed by V1 , and finally the separator V2 , then the result, shown in
Figure 2.2a, is in arrowhead form.
Remark 2.2.1. The fact that the top-left and bottom-right blocks of the
matrix in Figure 2.2a are dense corresponds to their associated sets of vertices,
V0 and V2 , being fully connected subsets of the induced subgraph.
E|U = {{vi , vj } ∈ E : vi , vj ∈ U }.
Remark 2.2.2. We can now exercise our new terminology: the vertex subsets
V0 and V2 from the previous example are both cliques, and thus their induced
subgraphs, G|V0 and G|V2 , are complete/fully-connected.
1. V \V2 = V0 ∪ V1 , and
2. {v0 , v1 } ∈ E|V\V2 : v0 ∈ V0 , v1 ∈ V1 = ∅.
$$
A = \begin{pmatrix}
A_{0,0} & & A_{0,2} & & & & A_{0,6} \\
& A_{1,1} & A_{1,2} & & & & A_{1,6} \\
A_{2,0} & A_{2,1} & A_{2,2} & & & & A_{2,6} \\
& & & A_{3,3} & & A_{3,5} & A_{3,6} \\
& & & & A_{4,4} & A_{4,5} & A_{4,6} \\
& & & A_{5,3} & A_{5,4} & A_{5,5} & A_{5,6} \\
A_{6,0} & A_{6,1} & A_{6,2} & A_{6,3} & A_{6,4} & A_{6,5} & A_{6,6}
\end{pmatrix},
\qquad
L = \begin{pmatrix}
L_{0,0} & & & & & & \\
& L_{1,1} & & & & & \\
L_{2,0} & L_{2,1} & L_{2,2} & & & & \\
& & & L_{3,3} & & & \\
& & & & L_{4,4} & & \\
& & & L_{5,3} & L_{5,4} & L_{5,5} & \\
L_{6,0} & L_{6,1} & L_{6,2} & L_{6,3} & L_{6,4} & L_{6,5} & L_{6,6}
\end{pmatrix}
$$
Algorithms 2.7 and 2.8. It turns out that it is also possible to exploit sparsity
within the off-diagonal blocks, but it will help to introduce more notation be-
fore tackling this issue. Note that our following discussions will assume that
an n × n Hermitian positive-definite matrix A has already been reordered in
some beneficial manner, typically via nested dissection.
Definition 2.3.1 (supernode [11, 13, 91]). For any n × n Hermitian positive-
definite matrix A, if the n vertices are partitioned into m contiguous subsets,
say D = (D0 , D1 , . . . , Dm−1 ), such that j < k implies that every member of Dj
is less than every member of Dk , then each set Di is referred to as a (relaxed)
supernode.2
Remark 2.3.1. If possible, these supernodes should be chosen such that each
corresponding lower-triangular diagonal block of the Cholesky factor of A, say
L(Ds , Ds ), is sufficiently dense. We will frequently refer to supernodes by their
indices, e.g., by speaking of “supernode s ” rather than of “supernode Ds ”.
In all of the calculations performed within this dissertation, these supernodes
2
The usual graph-theoretic definition of a supernode is much more restrictive, as the
subset must be a clique whose edges satisfy a certain property. We must therefore make it
clear that our relaxed [11] definition is different.
correspond to the degrees of freedom of the separators produced during nested
dissection. Please glance ahead to Figures 2.5 and 2.6 for a visual interpreta-
tion of supernodes (and their associated elimination tree, which will soon be
introduced).
Each contribution in the right term, say L_j \ D_{0:s}, is meant to model the effects of an outer-product update on the portion of A below supernode s, and clearly the columns with indices D_s can only be affected when L_j intersects D_s.
Remark 2.3.2. When the supernodes each consist of a single vertex, this
definition yields the precise nonzero structure of the Cholesky factor of A,
assuming that no exact numerical cancellation took place. When the supern-
odes contain multiple vertices, this formula is meant to return a relatively tight
superset of the true nonzero structure.
Definition 2.3.4 (ancestors, parent, c.f. [112]). We may define the set of
ancestors of supernode s as all supernodes which have a nonzero intersection
with the factored structure of supernode s, that is, supernodes
We also say that the children of supernode s are the supernodes with Ds as
their parent, i.e.,
Definition 2.3.6 (elimination forest, c.f. [97, 112]). The structure we have
been describing is extremely important to multifrontal methods and was first
formalized in [112] in the context of supernodes composed of single vertices,
or, more simply, nodes. Simply connecting each supernode with its parent
supernode, when it exists, implies a graph, and parent(s) > s further implies
that this graph is actually a collection of trees. This collection of trees is
known as the elimination forest, which we will denote by E(A, D), and it can
be used to guide a multifrontal factorization.
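A small sketch of how the factored structures and parent pointers can be computed bottom-up, exploiting the fact that children always have smaller indices than their parents (the routine and its input format are ours, not Clique's):

def symbolic_factorization(supernodes, orig_struct):
    # supernodes[s]: the vertices of D_s (supernodes are ordered, so children precede parents).
    # orig_struct[s]: the vertices below D_s coupled to D_s in the original matrix (hat{L}_s).
    # Returns the factored structures L_s and the parent of each supernode.
    owner = {v: s for s, Ds in enumerate(supernodes) for v in Ds}
    m = len(supernodes)
    struct = [set(orig_struct[s]) for s in range(m)]
    parent = [None] * m
    children = [[] for _ in range(m)]
    for s in range(m):
        Ds = set(supernodes[s])
        for c in children[s]:
            struct[s] |= struct[c] - Ds      # L_s := hat{L}_s union (L_c \ D_s)
        if struct[s]:
            parent[s] = owner[min(struct[s])]
            children[parent[s]].append(s)
    return struct, parent

# The two-level arrowhead example: vertices 0-6, with separators {2}, {5} and root {6}.
supernodes = [[0], [1], [2], [3], [4], [5], [6]]
orig = [[2, 6], [2, 6], [6], [5, 6], [5, 6], [6], []]
struct, parent = symbolic_factorization(supernodes, orig)
print(parent)    # [2, 2, 6, 5, 5, 6, None]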
discussions of the multifrontal method. We will begin by showing that a much
simpler formula for the factored structure can be constructed if we think in
terms of the elimination forest.
Figure 2.4: (a) A symmetric two-level arrowhead matrix and (b) its elimination
tree
Now consider recursively defining the related sets
$$ G_s = \hat{L}_s \cup \bigcup_{j \in A^{-1}(s)} G_j, $$
We now have only to show that, for each c ∈ C(s), Lc \ D0:s = Lc \ Ds . But
this is equivalent to the requirement that, for each j such that c < j < s,
Lc ∩ Dj = ∅, which follows from the fact that s = parent(c).
2.3.2 Factorization
its descendants’ updates has been applied, and so a factorization algorithm
must work “up” the elimination forest. Since each elimination tree is indepen-
dent, from now on we will assume that all matrices are irreducible, with the
understanding that the following algorithms should be run on each tree in the
elimination forest.
Algorithm 2.10: Cholesky factorization via an elimination tree
Input: HPD matrix, A, supernodes, D, and root node, s
Output: a lower triangular matrix, L, such that A = LL^H
1 foreach c ∈ C(s) do Recurse(A,D,c)
2 L(D_s, D_s) := Cholesky(A(D_s, D_s))
3 L(L_s, D_s) := A(L_s, D_s) L(D_s, D_s)^{-H}
4 A(L_s, L_s) := A(L_s, L_s) − L(L_s, D_s) L(L_s, D_s)^H
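A dense NumPy transliteration of Algorithm 2.10 follows (a sketch only; a real implementation would accumulate the updates in frontal matrices rather than scattering them back into A):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def elim_tree_cholesky(A, L, D, struct, children, s):
    # Factor the HPD matrix A in place over the elimination tree rooted at s;
    # D[s] and struct[s] hold the index lists for D_s and the lower structure L_s.
    for c in children[s]:
        elim_tree_cholesky(A, L, D, struct, children, c)
    Ds, Ls = D[s], struct[s]
    L[np.ix_(Ds, Ds)] = cholesky(A[np.ix_(Ds, Ds)], lower=True)
    if Ls:
        # L(L_s, D_s) := A(L_s, D_s) L(D_s, D_s)^{-H}
        L[np.ix_(Ls, Ds)] = solve_triangular(
            L[np.ix_(Ds, Ds)], A[np.ix_(Ls, Ds)].conj().T, lower=True).conj().T
        # A(L_s, L_s) := A(L_s, L_s) - L(L_s, D_s) L(L_s, D_s)^H
        A[np.ix_(Ls, Ls)] -= L[np.ix_(Ls, Ds)] @ L[np.ix_(Ls, Ds)].conj().T

# Hypothetical usage on the 7 x 7 two-level arrowhead pattern (one vertex per supernode).
pattern = [[2, 6], [2, 6], [6], [5, 6], [5, 6], [6], []]
A = 8.0 * np.eye(7)
for i, js in enumerate(pattern):
    for j in js:
        A[i, j] = A[j, i] = 1.0              # symmetric and diagonally dominant, hence HPD
D = [[i] for i in range(7)]
children = [[], [], [0, 1], [], [], [3, 4], [2, 5]]
L, Awork = np.zeros((7, 7)), A.copy()
elim_tree_cholesky(Awork, L, D, pattern, children, s=6)
assert np.allclose(L @ L.T, A)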
in the elimination tree with the original frontal matrix
$$ \hat{F}_s = \begin{pmatrix} A(D_s, D_s) & A(D_s, L_s) \\ A(L_s, D_s) & 0 \end{pmatrix}, \tag{2.12} $$
then we can use the same language as Algorithm 2.4 to condense Algorithm 2.11 into Algorithm 2.12, which produces the factored frontal tree, F, with each front F_s equal to
$$ F_s = \begin{pmatrix} L(D_s, D_s) & ? \\ L(L_s, D_s) & ? \end{pmatrix}, \tag{2.13} $$
where the irrelevant quadrants have been marked with a ?.3
3
In fact, a practical implementation will free the memory required for these quadrants
the last time they are used within the multifrontal factorization.
that each supernode must only interact with its children and its parent (when
it exists). We will see that multifrontal lower triangular solves (X := L−1 X)
work up the elimination tree making use of extend-add operations, but the
adjoint operation (X := L−H X) works down the elimination tree and requires
more care. We will begin by explaining the simple case and will then move on
to adjoint solves.
It should be apparent that, for any block row s of L which is only nonzero at its diagonal block, L_{s,s}, the solution for that block row may be found by simply setting X_s := L_{s,s}^{-1} X_s. Since this condition on block row s holds precisely when
Once the solution X` has been computed for each leaf ` in the elimina-
tion tree, each set of degrees of freedom can be eliminated via the update
which we see, by definition, only affects the portions of X assigned to the
ancestors of leaf `. As was mentioned earlier, our goal is to find a way to ac-
cumulate these updates up the elimination tree so that, for instance, when we
reach the root node, its updates from the entire rest of the tree have all been
merged into the update matrices of its children. Such an approach is demon-
strated by Algorithm 2.13, which is quite similar to the verbose description of
multifrontal factorization in Algorithm 2.11, which can likewise be condensed
using the language of fronts. In particular, if we introduce a right-hand side
tree, say Y, which assigns to supernode s the matrix
$$ Y_s = \begin{pmatrix} X(D_s, :) \\ \mathrm{zeros}(|L_s|, k) \end{pmatrix}, $$
where k is the number of columns of X, then we can express the solve using
the compact language of Algorithm 2.14. The term trapezoidal elimination [61]
simply refers to viewing the last two steps of Algorithm 2.13 as a partial
elimination process with the matrices
$$ F_s = \begin{pmatrix} L(D_s, D_s) & ? \\ L(L_s, D_s) & ? \end{pmatrix}, \quad\text{and}\quad Y_s = \begin{pmatrix} X(D_s, :) \\ Z_s \end{pmatrix}. $$
Notice that the nonzero structure of the left half of F_s is trapezoidal: L(D_s, D_s) is lower-triangular and L(L_s, D_s) is, in general, dense.
Algorithm 2.13: Verbose multifrontal lower solve
Input: lower triangular matrix, L, right-hand side matrix of width k, X, elimination tree, E, and root node, s
Output: X := L^{-1} X
1 foreach c ∈ C(s) do Recurse(L,X,E,c)
2 Z_s := zeros(|L_s|, k)
3 foreach c ∈ C(s) do extend-add Z_c onto [X(D_s, :); Z_s]
4 X(D_s, :) := L(D_s, D_s)^{-1} X(D_s, :)
5 Z_s := Z_s − L(L_s, D_s) X(D_s, :)
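A dense sketch of this forward elimination, phrased with global index sets and with each update applied directly to X rather than merged up a right-hand-side tree (our simplification of Algorithm 2.13):

import numpy as np
from scipy.linalg import solve_triangular

def multifrontal_lower_solve(L, X, D, struct, children, s):
    # Overwrite X := L^{-1} X by working up the elimination tree rooted at s.
    for c in children[s]:
        multifrontal_lower_solve(L, X, D, struct, children, c)
    Ds, Ls = D[s], struct[s]
    X[Ds, :] = solve_triangular(L[np.ix_(Ds, Ds)], X[Ds, :], lower=True)
    if Ls:
        X[Ls, :] -= L[np.ix_(Ls, Ds)] @ X[Ds, :]   # the update -L(L_s, D_s) X(D_s, :)

Running this routine with the factor and elimination tree from the factorization sketch above (and a right-hand side B := A X) recovers L^{-1} B, which the adjoint solve described next then completes.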
This time, the only part of the solution which we can immediately compute corresponds to the root node: X_s := L_{s,s}^{-H} X_s (in this case, s = 6). These
One possible approach would be to update all of the descendants in this manner
and then to recurse on each of the children of the root node.
Since the set of ancestors of node s is precisely equal to the set of supernodes
which intersect its lower structure, this update simplifies to
We have now derived the main algorithms required for solving sparse
Hermitian positive-definite linear systems with the multifrontal algorithm (with
Algorithm 2.16: Multifrontal lower adjoint solve
Input: frontal tree, F, right-hand side tree, Y, and root node, s
1 if A(s) ≠ ∅ then Extract(Y_s, Y_parent(s))
2 Trapezoidal backward elimination of Y_s with F_s^H
3 foreach c ∈ C(s) do Recurse(F,Y,c)
While the separators chosen in Figure 2.6 are what we would choose
in practice, in order to keep our analysis simple we will choose so-called cross
separators [61], which are approximately twice as large as necessary but do
not change the asymptotics of the method. Figure 2.7 shows an elimination
Figure 2.6: A separator-based elimination tree (right) for a 15 × 15 grid graph
(left)
$$ w_\ell = \frac{n+1}{2^\ell} - 1. $$
We thus know that, if s ∈ level(ℓ) (that is, supernode s is on the ℓ'th level of the elimination tree), then |D_s| = 2w_ℓ − 1. We can also easily bound |L_s| from above by 4w_ℓ (the maximum number of nodes that can border the w_ℓ × w_ℓ box spanned by the cross separator and its descendants).
We can easily see that the work required for processing front F_s is approximately equal to
$$ \mathrm{Work}(F_s) = \frac{1}{3}|D_s|^3 + |D_s|^2 |L_s| + |D_s| |L_s|^2, \tag{2.14} $$
where the first term corresponds to the Cholesky factorization of the |Ds | ×
|Ds | diagonal block, the second term corresponds to a triangular solve of the
diagonal block against the |Ls | × |Ds | submatrix A(Ls , Ds ), and the third term
Figure 2.7: An elimination tree based on cross separators (right) of a 15 × 15
grid graph (left)
We have just shown that nested dissection of regular 2D grids only requires O(n^3) = O(N^{3/2}) work, where N = n^2 is the total number of degrees of freedom. Soon after George showed this result [54], Lipton and Tarjan generalized the approach to the entire class of planar graphs [90], which is simply the class of graphs which may be drawn on paper without any overlapping edges (that is, embedded in the plane).
our analysis to d-dimensional grid graphs with n vertices in each direction, for a total of N = n^d vertices, which only interact with their nearest neighbors in each of the d cardinal directions. This time, the cross separators will consist of d overlapping hyperplanes of dimension d − 1 (i.e., in 3D, the cross separators consist of three planes), and we can bound |D_s| from above by d w_ℓ^{d−1}, which is exact other than ignoring the overlap of the d hyperplanes. Also, a d-dimensional cube has 2d faces, and so our best bound for |L_s| is 2d w_ℓ^{d−1}. The resulting bound for d-dimensional nested dissection is
$$ \mathrm{Work}(\mathcal{F}) \le \sum_{\ell=0}^{\lfloor \log_2(n) \rfloor} 2^{d\ell} \left( \frac{1}{3}(d w_\ell^{d-1})^3 + 2 (d w_\ell^{d-1})^3 + 4 (d w_\ell^{d-1})^3 \right) = C \sum_{\ell=0}^{\lfloor \log_2(n) \rfloor} 2^{d\ell} w_\ell^{3(d-1)} \le C (n+1)^{3(d-1)} \sum_{\ell=0}^{\lfloor \log_2(n) \rfloor} 2^{\ell(3-2d)}. $$
Now let us recall that, after factorization, all that must be kept from each frontal matrix F_s is its top-left and bottom-left quadrants, i.e., L(D_s, D_s) and L(L_s, D_s), which respectively require (1/2)|D_s|^2 and |D_s||L_s| units of storage. We therefore write
$$ \mathrm{Mem}(F_s) = \frac{1}{2}|D_s|^2 + |D_s||L_s|. \tag{2.15} $$
If we make use of the same bounds on |D_s| and |L_s| as before, then we find that, for a d-dimensional grid graph, the memory required for storing the Cholesky
d        work                storage
1        O(N)                O(N)
2        O(N^{3/2})          O(N log N)
3        O(N^2)              O(N^{4/3})
d ≥ 3    O(N^{3(1−1/d)})     O(N^{2(1−1/d)})

Table 2.1: Storage and work upper bounds for the multifrontal method applied to d-dimensional grid graphs with n degrees of freedom in each direction (i.e., N = n^d total degrees of freedom)
factorization is
$$ \mathrm{Mem}(\mathcal{F}) \le \sum_{\ell=0}^{\lfloor \log_2(n) \rfloor} 2^{d\ell} \left( \frac{1}{2}(d w_\ell^{d-1})^2 + 2 (d w_\ell^{d-1})^2 \right) \le C (n+1)^{2(d-1)} \sum_{\ell=0}^{\lfloor \log_2(n) \rfloor} 2^{\ell(2-d)}. $$
When we set d = 1, Mem(F) is O(n), which agrees with the fact that the Cholesky factor should be bidiagonal; when d = 2, we see that the memory requirement is O(n^2 log n) = O(N log N); for d = 3 it is O(n^4) = O(N^{4/3}); and, for arbitrary d ≥ 3, it is O(n^{2(d−1)}), which is equal to the amount of space required for the storage of an n^{d−1} × n^{d−1} dense matrix. Because our multifrontal triangular solve algorithms perform O(1) flops per unit of storage, the work required for the triangular solves has the same asymptotic complexity as the memory requirements of the factorization.
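These level sums are easy to evaluate directly; a small sketch that checks the scalings in Table 2.1 numerically (grid sizes of the form 2^k − 1 keep the cross-separator widths integral):

import numpy as np

def multifrontal_bounds(n, d):
    # Evaluate the work and memory level sums using |D_s| <= d w_l^{d-1}, |L_s| <= 2 d w_l^{d-1}.
    work = mem = 0.0
    for l in range(int(np.floor(np.log2(n))) + 1):
        w = (n + 1) / 2**l - 1
        D, L = d * w**(d - 1), 2 * d * w**(d - 1)
        work += 2**(d * l) * (D**3 / 3 + D**2 * L + D * L**2)
        mem += 2**(d * l) * (D**2 / 2 + D * L)
    return work, mem

for d in (2, 3):
    for n in (63, 127, 255):
        work, mem = multifrontal_bounds(n, d)
        N = float(n)**d
        # The work ratio approaches a constant; the memory ratio does too for d = 3,
        # while for d = 2 it grows like log N, matching Table 2.1.
        print(d, n, work / N**(3 * (1 - 1 / d)), mem / N**(2 * (1 - 1 / d)))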
and, in d dimensions, the separator can be chosen as a hyperplane of n^{d−1} vertices. Interestingly, it was shown in [57] that, if a graph can be embedded in a surface of genus g (for example, the surface of a sphere with g handles attached), then a separator of size O(√(gN) + √N) may be found. This implies that d-dimensional grid graphs have separators comparable to those of graphs of genus N^{(d−2)/d}.
still an open question.
1. graph partitioning (reordering),
2. symbolic factorization,
3. numeric factorization, and
4. triangular solves.
We will begin with the easiest step, symbolic factorization, move on to nu-
meric factorization, spend a while discussing several options for the triangular
solves, and then briefly touch on remaining challenges with respect to parallel
reorderings.
Our approach to the parallelization of each of these stages (with the pos-
sible exception of the reordering phase) exploits the independence of siblings
within the elimination tree in order to expose as much trivial parallelism as
possible, and fine-grain parallelism is therefore only employed towards the top
of the tree, usually via distributed-memory dense linear algebra algorithms.
The fundamentals of this approach are due to George, Liu, and Ng, who
proposed a so-called subtree-to-subcube mapping [56] which splits the set of
processes into two disjoint teams at each junction of a binary elimination
tree. The term subcube was used because this decomposition maps naturally
Figure 2.8: An example of a (slightly imbalanced) elimination tree distributed
over four processes: the entire set of processes shares the root node, teams of
two processes share each of the two children, and each process is assigned an
entire subtree rooted at the dashed line.
Figure 2.9: Overlay of the process ranks (in binary) of the owning subteams
of each supernode from an elimination tree assigned to eight processes; a ‘*’ is
used to denote both 0 and 1, so that ‘00∗’ represents processes 0 and 1, ‘01∗’
represents processes 2 and 3, and ‘∗ ∗ ∗’ represents all eight processes.
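A toy sketch of this subtree-to-subcube splitting: each junction of a binary elimination tree halves the current team of processes, and once a team holds a single process it owns the entire remaining subtree (the routine and output format are ours):

def subtree_to_subcube(procs, node="root", out=None):
    # Recursively assign teams of processes to tree nodes: the root gets every
    # process and each child junction receives half of its parent's team.
    if out is None:
        out = []
    out.append((node, list(procs)))
    if len(procs) > 1:
        half = len(procs) // 2
        subtree_to_subcube(procs[:half], node + "0", out)
        subtree_to_subcube(procs[half:], node + "1", out)
    return out

for node, team in subtree_to_subcube(list(range(8))):
    print(node, team)
# e.g. 'root' -> [0..7], 'root0' -> [0..3], 'root00' -> [0, 1], ..., 'root000' -> [0]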
2.4.1 Symbolic factorization
Algorithm 2.17: Distributed symbolic factorization
Input: local original structure, L̂_loc, local elimination tree, E_loc, root node, s, number of processes, p, and process rank, r
Output: local factored structure, L_loc
1 if p > 1 then
2   if r < ⌊p/2⌋ then
3     Recurse(L̂_loc, E_loc, c_0(s), ⌊p/2⌋, r)
4     Exchange L_{c_0(s)} for L_{c_1(s)} with other team
5   else
6     Recurse(L̂_loc, E_loc, c_1(s), p−⌊p/2⌋, r−⌊p/2⌋)
7     Exchange L_{c_1(s)} for L_{c_0(s)} with other team
8 else
9   foreach c ∈ C(s) do Recurse(L̂_loc, E_loc, c, 1, 0)
10 L_s := L̂_s ∪ ⋃_{c∈C(s)} L_c \ D_s
able manner. And, in the words of Schreiber, a scalable dense factorization of
the root front is a sine qua non for the overall scalability of the multifrontal
method. As was discussed in the previous chapter, when memory is not con-
strained, it is worthwhile to consider so-called 2.5D [116] and 3D [9, 76] dense
factorizations, which replicate the distributed matrix across several teams of
processes in order to lower the communication volume.
The packages SPOOLES [12] and MUMPS [3] both implement distributed-
memory multifrontal methods, but the former makes use of one-dimensional
distributions for all of its fronts in order to simplify partial pivoting, while
the latter only makes use of two-dimensional distributions for the root front.
Thus, neither approach can be expected to exhibit comparable scalability to
the approach of [63]. In addition, MUMPS was originally designed to make
use of a master-slave paradigm, and a number of artifacts still remain that ob-
struct its usage for truly large-scale problems (for example, MUMPS requires
that right-hand sides be stored entirely on the root process at the beginning
of the solution process [1]).
The remaining two commonly used packages are PaStiX [72] and
SuperLU_Dist [87], both of which use alternatives to the multifrontal method.
PaStiX couples a column fan-in, or left-looking supernodal approach [10] with
an intelligent selection criteria for choosing between one-dimensional and two-
dimensional dense distributions, though it appears to only be competitive
with the approach of [63] for modest numbers of processors [72]. On the
other hand, SuperLU_Dist focuses on structurally non-symmetric matrices
and uses a row-pipelined approach [88] which resembles a sparse version of
the High Performance Linpack algorithm [95, 126]. One potential pitfall to
this approach is that the size of supernodes must be artificially restricted in
order to preserve the efficiency of the parallelization: for instance, if the root
separator arising from nested dissection of a 3D grid graph was treated as a
single supernode, then the runtime and memory usage of the parallel scheme
would at best be O(N^2/√p) and O(N^{4/3}/√p), respectively.
PSPASES and WSMP is which two-dimensional distribution is chosen for the
fronts. While PSPASES and WSMP make use of block-cyclic distributions,
Clique, the implementation which will be used within our parallel sweeping
preconditioner, makes use of element-wise distributions [71, 93, 99] and does
not alter the details of the data distribution so that parallel extend-add op-
erations only require communication with a single process [63]. The major
advantage of Clique is its flexibility with respect to triangular solve algorithms
(though a custom Kronecker-product compression scheme is also employed in
a later chapter).
and so Steps 6 and 7 should temporarily be interpreted as a dense triangular
solve with a single right-hand side and a dense matrix-vector multiply. Thus,
the distribution of each member of the right-hand side tree Y can be chosen
to maximize the efficiency of the dense triangular solves and matrix-vector
multiplication.
3D), the factorization can be expected to remain the dominant term. Exam-
ples of parallel multifrontal forward and backward eliminations based upon
element-wise 2D frontal distributions and 1D right-hand side distributions are
therefore given in Algorithms 2.19 and 2.20, though it is also reasonable to
redistribute the fronts into one-dimensional distributions and to make use of
multifrontal triangular solves based upon Algorithms C.7 and C.15.4 One-
dimensional distributions should be used for each right-hand side submatrix
Ys because, by assumption, there are only O(1) right-hand sides, and so one-
dimensional distributions must be used if the right-hand sides are to be dis-
tributed amongst O(p) processes.
Note that, when more than O(1) triangular solves must be performed
with each factorization, efficient parallelizations are important. There are
essentially two important situations:
4
We will not discuss pipelined dense triangular solves since they are less practical with
element-wise distributions. See [79] for a pipelined two-dimensional dense triangular solve.
Algorithm 2.20: Distributed multifrontal backward substitution
Input: F_loc^{2D}, Y_loc^{1D}, s, p, r
1 if p > 1 then
2   if A(s) ≠ ∅ then AllToAll to update Y_s with Y_parent(s)
3   X(D_s, :) := X(D_s, :) − L(L_s, D_s)^H Z_s via Algorithm C.2
4   X(D_s, :) := L(D_s, D_s)^{-H} X(D_s, :) via Algorithm C.12
5   if r < ⌊p/2⌋ then Recurse(F_loc^{2D}, Y_loc^{1D}, c_0(s), ⌊p/2⌋, r)
6   else Recurse(F_loc^{2D}, Y_loc^{1D}, c_1(s), p−⌊p/2⌋, r−⌊p/2⌋)
7 else
8   if A(s) ≠ ∅ then AllToAll to update Y_s with Y_parent(s)
9   X(D_s, :) := X(D_s, :) − L(L_s, D_s)^H Z_s
10  X(D_s, :) := L(D_s, D_s)^{-H} X(D_s, :)
11  foreach c ∈ C(s) do Recurse(F_loc^{2D}, Y_loc^{1D}, c, 1, 0)
The following subsection addresses the first case, which is important for iter-
ative methods which make use of sparse-direct triangular solves (usually on
subproblems) as part of a preconditioner.
Significantly less work is required for this inversion process than the origi-
nal multifrontal factorization and it has been observed that, for moderate to
large numbers of processes, the extra time spent within selective inversion is
recouped after a small number of triangular solves are performed [98, 102].
Large performance improvements from making use of selective inversion are
shown in Chapter 3. It is also important to note that a similar idea appeared
in earlier work on out-of-core dense solvers [114], where the diagonal blocks of
dense triangular matrices were inverted in place in order to speed up subse-
quent triangular solves.
6 else
7   foreach d ∈ A^{-1}(s) do G_d := F_d
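The essential trade made by selective inversion is easy to state in serial terms: after factorization, the stored triangular pivot blocks L(D_s, D_s) are explicitly inverted so that every later triangular solve with them becomes a (much more parallelizable) matrix multiply. A minimal sketch of that exchange, not of Algorithm 2.21 itself:

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(3)
k, nrhs = 6, 2
X = rng.standard_normal((k, k))
A = X @ X.T + k * np.eye(k)
L = cholesky(A, lower=True)

Linv = solve_triangular(L, np.eye(k), lower=True)   # performed once, after factorization
B = rng.standard_normal((k, nrhs))
assert np.allclose(Linv @ B, solve_triangular(L, B, lower=True))   # each later solve is now a GEMM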
Algorithm 2.22: Distributed multifrontal forward substitution after selective inversion
Input: G_loc^{2D}, Y_loc^{1D}, s, p, r
1 if p > 1 then
2   if r < ⌊p/2⌋ then Recurse(G_loc^{2D}, Y_loc^{1D}, c_0(s), ⌊p/2⌋, r)
3   else Recurse(G_loc^{2D}, Y_loc^{1D}, c_1(s), p−⌊p/2⌋, r−⌊p/2⌋)
4   foreach j ∈ {0, 1} do AllToAll to add Z_{c_j} into Y_s
5   X(D_s, :) := inv(L(D_s, D_s)) X(D_s, :) via Algorithm C.1
6   Z_s := Z_s − L(L_s, D_s) X(D_s, :) via Algorithm C.1
7 else
8   Apply Algorithm 2.14 to G_loc^{2D} and Y_loc^{1D} beginning at node s
When many right-hand sides are available at once, which is the case in
many seismic inversion [121] algorithms, there are many more paths to scal-
able multifrontal triangular solves. In particular, let us recall that the dense
operation TRSM [37], which performs triangular solves with multiple right-hand
sides, can be made scalable when there are sufficiently many right-hand sides
due to the independence of each solve. This is in stark contrast to triangular
solves with O(1) right-hand sides, which are difficult (if not impossible) to
efficiently parallelize, and so it is natural to consider building a multifrontal
triangular solve on top of a distributed dense TRSM algorithm, such as Algo-
rithm C.9.
efficiency of the most expensive operations which will be performed with it.
In the case of many right-hand sides, which we will loosely define to be at least O(√p) right-hand sides when p processes are to be used, Steps 8-9 and
5-6 from Algorithms 2.19 and 2.20 should each be thought of as a pair of
distributed TRSM and GEMM calls. If each matrix Ys in the right-hand side tree
is stored in a two-dimensional data distribution, then we may directly exploit
Algorithms C.9 and C.3 in order to build a high-performance multifrontal
triangular solve with many right-hand sides (and the adjoint dense kernels
may be used to construct an efficient adjoint solve). The resulting forward
elimination approach is listed in Algorithm 2.23.
The elephant in the room for parallel sparse direct solvers is that no
scalable implementations of effective heuristics for graph partitioning yet exist.
Although ParMETIS [82] is extremely widely used and useful for partitioning
graphs which cannot be stored in memory on a single node, there are still
difficulties in the parallelization of partition refinements, and, in practice [89],
Algorithm 2.24: Distributed multifrontal forward substitution with many right-hand sides after selective inversion
Input: G_loc^{2D}, Y_loc^{2D}, s, p, r
1 if p > 1 then
2   if r < ⌊p/2⌋ then Recurse(G_loc^{2D}, Y_loc^{2D}, c_0(s), ⌊p/2⌋, r)
3   else Recurse(G_loc^{2D}, Y_loc^{2D}, c_1(s), p−⌊p/2⌋, r−⌊p/2⌋)
4   foreach j ∈ {0, 1} do AllToAll to add Z_{c_j} into Y_s
5   X(D_s, :) := inv(L(D_s, D_s)) X(D_s, :) via Algorithm C.3
6   Z_s := Z_s − L(L_s, D_s) X(D_s, :) via Algorithm C.3
7 else
8   Apply Algorithm 2.14 to G_loc^{2D} and Y_loc^{2D} beginning at node s
2.5 Summary
An introduction to nested-dissection based multifrontal Cholesky fac-
torization and triangular solves was provided in the context of a reordered
dense right-looking Cholesky factorization. Parallelizations of these opera-
tions were then discussed, Raghavan’s selective inversion [102] was reviewed,
and then two new parallel multifrontal triangular solve algorithms targeted
towards simultaneously solving many right-hand sides were introduced. The
author expects that these algorithms will be a significant contribution to-
wards frequency-domain seismic inversion procedures, which typically require
the solution of roughly O(N 2/3 ) linear systems. In the following chapter, an
algorithm will be described which replaces a full 3D multifrontal factorization
with a preconditioner which only requires O(N 4/3 ) work to set up, yet only
requires O(N log N) work to solve usual seismic problems. It is easy to see that, if O(N^{2/3}) linear systems are to be solved, asymptotically more work is performed solving the linear systems than setting up the preconditioner.
Chapter 3
Sweeping preconditioners
The fundamental idea of both this and the following chapter is to or-
ganize our problem in such a manner that we may assign physical meanings
to the temporary matrices generated during a factorization process, and then
to exploit our knowledge of the physics in order to efficiently construct (or,
in the case of the next chapter, compress) an approximation to these tempo-
rary products. In particular, we will order our degrees of freedom so that the
Schur complements generated during a block tridiagonal LDLT factorization
of a discretized boundary value problem may be approximately generated by
the factorization of a boundary value problem posed over a much smaller sub-
domain. More specifically, if the original problem is three-dimensional, multi-
frontal factorizations may be efficiently employed on quasi-2D subdomains in
order to construct an effective preconditioner.
tion. It is therefore the responsibility of the boundary conditions1 to break
this symmetry so that we may describe cause and effect in a sensical manner.
For instance, if we create a disturbance at some point in the domain, the de-
sired response is for a wave front to radiate outwards from this point as time
progresses, rather than coming in “from infinity” at the beginning of time in
anticipation of our actions.
which can be seen to hold for the radiating 3D Helmholtz Green's function,
$$ G_r(x, y) = \frac{e^{i\omega|x-y|}}{4\pi|x-y|}, \tag{3.2} $$
$$ G_a(x, y) = \overline{G_r(x, y)} = \frac{e^{-i\omega|x-y|}}{4\pi|x-y|}. \tag{3.3} $$
1
some readers will perhaps insist that we speak of remote conditions instead of applying
boundary conditions “at infinity”
2
And thus, if a time-dependence of eiωt is desired, the radiation condition should be
conjugated and the definitions of Ga and Gr should be reversed (i.e., conjugated).
Essentially the same argument holds for 3D time-harmonic Maxwell’s
equations, which are typically equipped with the Silver-Müller radiation con-
ditions [124],
Likewise, the 3D time-harmonic linear elastic equations make use of the Kupradze-Sommerfeld radiation conditions [24, 123],
$$ \lim_{|x|\to\infty} |x|\left(\frac{\partial u_s}{\partial |x|} - i\kappa_s u_s\right) = 0, \quad\text{and} \tag{3.6} $$
$$ \lim_{|x|\to\infty} |x|\left(\frac{\partial u_p}{\partial |x|} - i\kappa_p u_p\right) = 0, \tag{3.7} $$
where us and up respectively refer to the solenoidal (divergence-free) and ir-
rotational (curl-free) components of the solution, u, and κs and κp refer to the
wave numbers associated with shear and pressure waves, respectively. Note
that the decomposition of the solution into its solenoidal and irrotational parts
is referred to as the Helmholtz decomposition [32].
the condition that waves propagate outwards and not reflect at the bound-
ary of the truncated domain. This concept is crucial to understanding the
Schur complement approximations that take place within the moving PML
sweeping preconditioner which is reintroduced in this document for the sake
of completeness.
$$
\begin{pmatrix}
A_{0,0} & A_{1,0}^T & & \\
A_{1,0} & A_{1,1} & \ddots & \\
& \ddots & \ddots & A_{n-1,n-2}^T \\
& & A_{n-1,n-2} & A_{n-1,n-1}
\end{pmatrix}
\begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_{n-1} \end{pmatrix}
=
\begin{pmatrix} f_0 \\ f_1 \\ \vdots \\ f_{n-1} \end{pmatrix}, \tag{3.8}
$$
where Ai,j propagates sources from the i’th x1 x2 plane into the j’th x1 x2 plane,
and the overall linear system is complex symmetric (not Hermitian) due to the
imaginary terms introduced by the PML [44].
If we were to ignore the sparsity within each block, then the naı̈ve
factorization and solve algorithms shown in Algorithms 3.1 and 3.2 would be
appropriate for a direct solver, where the application of Si−1 makes use of
the factorization of Si . While the computational complexities of these ap-
proaches are significantly higher than the multifrontal algorithms discussed in
Figure 3.1: An x1 x3 cross section of a cube with PML on its x3 = 0 face. The
domain is shaded in a manner which loosely corresponds to its extension into
the complex plane.
the previous chapter, which have an O(N 2 ) factorization cost and an O(N 4/3 )
solve complexity for regular three-dimensional meshes, they are the conceptual
starting points of the sweeping preconditioner.3
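The listing of Algorithm 3.1 does not survive in this excerpt, but the Schur complements it generates are presumably those of the standard block LDL^T recursion for a block tridiagonal matrix (a sketch under that assumption):

$$ S_0 = A_{0,0}, \qquad S_{i+1} = A_{i+1,i+1} - A_{i+1,i}\, S_i^{-1}\, A_{i+1,i}^T, \qquad i = 0, \ldots, n-2. $$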
Suppose that we paused Algorithm 3.1 after computing the i’th Schur
complement, Si , where the i’th x1 x2 plane is in the interior of the domain
(i.e., it is not in a PML region). Due to the ordering imposed on the degrees
of freedom of the discretization, the first i Schur complements are equivalent
3
In fact, they are the starting points of both classes of sweeping preconditioners. The
H-matrix approach essentially executes these algorithms with H-matrix arithmetic.
Algorithm 3.2: Naïve block LDL^T solve. O(n^5) = O(N^{5/3}) work is required.
// Apply L^{-1}
1 for i = 0, ..., n − 2 do
2   u_{i+1} := u_{i+1} − A_{i+1,i} (S_i^{-1} u_i)
3 end
// Apply D^{-1}
4 for i = 0, ..., n − 1 do
5   u_i := S_i^{-1} u_i
6 end
// Apply L^{-T}
7 for i = n − 2, ..., 0 do
8   u_i := u_i − S_i^{-1} (A_{i+1,i}^T u_{i+1})
9 end
to those that would have been produced from applying Algorithm 3.1 to an
auxiliary problem formed by placing a homogeneous Dirichlet boundary con-
dition on the (i + 1)’th x1 x2 plane and ignoring all of the subsequent planes
(see the left half of Fig. 3.2). While this observation does not immediately
yield any computational savings, it does allow for a qualitative description of
the inverse of the i’th Schur complement, Si−1 : it is the restriction of the half-
space Green’s function of the exact auxiliary problem onto the i’th x1 x2 plane
(recall that PML can be interpreted as approximating the effect of a domain
extending to infinity).
Figure 3.2: (Left) A depiction of the portion of the domain involved in the
computation of the Schur complement of an x1 x2 plane (marked with the
dashed line) with respect to all of the planes to its left during execution of
Alg. 3.1. (Middle) An equivalent auxiliary problem which generates the same
Schur complement; the original domain is truncated just to the right of the
marked plane and a homogeneous Dirichlet boundary condition is placed on
the cut. (Right) A local auxiliary problem for generating an approximation to
the relevant Schur complement; the radiation boundary condition of the exact
auxiliary problem is moved next to the marked plane.
If we use γ(ω) to denote the number of grid points of PML as a function
of the frequency, ω, then it is important to recognize that the depth of the
approximate auxiliary problems in the x3 direction is Ω(γ).4 If the depth of the
approximate auxiliary problems was O(1), then combining nested dissection
with the multifrontal method over the regular n × n × O(1) mesh would only
require O(n3 ) work and O(n2 log n) storage [54]. However, if the PML size is
a slowly growing function of frequency, then applying 2D nested dissection to
the quasi-2D domain requires O(γ 3 n3 ) work and O(γ 2 n2 log n) storage, as the
number of degrees of freedom in each front increases by a factor of γ and dense
factorizations have cubic complexity.
4
In all of the experiments in this chapter, γ(ω) was either 5 or 6, and the subdomain
depth never exceeded 10.
Algorithm 3.3: Application of S̃_i^{-1} to u_i given a multifrontal factorization of H_i. O(γ^2 n^2 log n) work is required.
1 Form û_i as the extension of u_i by zero over the artificial PML
2 Form v̂_i := H_i^{-1} û_i
3 Extract S̃_i^{-1} u_i from the relevant entries of v̂_i
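A serial sketch of Algorithm 3.3 in which a sparse LU factorization stands in for the multifrontal factorization of the auxiliary matrix H_i (the assembly of H_i and the index set panel_dofs are assumed given; all names are ours):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def make_schur_inverse_applier(H_i, panel_dofs):
    # Returns a function applying the approximate S_i^{-1} of the i'th panel:
    # extend by zero over the artificial PML, solve with H_i, and restrict back.
    H_i = sp.csc_matrix(H_i, dtype=complex)
    solve = spla.factorized(H_i)                 # factor H_i once, reuse for every application
    n_aux = H_i.shape[0]

    def apply(u_i):
        u_hat = np.zeros(n_aux, dtype=complex)
        u_hat[panel_dofs] = u_i                  # Step 1: extension by zero
        v_hat = solve(u_hat)                     # Step 2: v_hat := H_i^{-1} u_hat
        return v_hat[panel_dofs]                 # Step 3: extract the panel entries
    return apply

In the parallel preconditioner these solves are of course performed with the distributed multifrontal machinery of Chapter 2 rather than with a serial sparse LU.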
There are two more points to discuss before defining the full serial al-
gorithm. The first is that, rather than performing an approximate LDLT
factorization using a discretization of A, the preconditioner is instead built
from a discretization of an artificially damped version of the Helmholtz oper-
ator, say
$$ \mathcal{J} \equiv -\Delta - \frac{(\omega + i\alpha)^2}{c^2(x)}, \tag{3.11} $$
where α ≈ 2π is responsible for the artificial damping. This is in contrast
to shifted Laplacian preconditioners [18, 46], where α is typically O(ω) [47],
and our motivation is to avoid introducing large long-range dispersion error
by damping the long range interactions in the preconditioner. Just as A refers
to the discretization of the original Helmholtz operator, A, we will use J to
refer to the discretization of the artificially damped operator, J .
The second point is much easier to motivate: since the artificial PML
in each approximate auxiliary problem is of depth γ(ω), processing a single
plane at a time would require processing O(n) subdomains with O(γ 3 n3 ) work
each. Clearly, processing O(γ) planes at once has the same asymptotic com-
plexity as processing a single plane, and so combining O(γ) planes reduces
the setup cost from O(γ 3 N 4/3 ) to O(γ 2 N 4/3 ). Similarly, the memory usage
66
becomes O(γN log N ) instead of O(γ 2 N log N ).5 If we refer to these sets of
O(γ) contiguous planes as panels, which we label from 0 to m − 1, where
m = O(n/γ), and correspondingly redefine {ui }, {fi }, {Si }, {Ti }, and {Hi },
we have the following algorithm for setting up an approximate block LDLT
factorization of the discrete artificially damped Helmholtz operator:
5
Note that increasing the number of planes per panel provides a mechanism for interpo-
lating between the sweeping preconditioner and a full multifrontal factorization.
Algorithm 3.5: Application of the sweeping preconditioner. O(γN log N) work is required.
// Apply L^{-1} and D^{-1}
1 for i = 0, ..., m − 2 do
2   u_i := T_i u_i
3   u_{i+1} := u_{i+1} − J_{i+1,i} u_i
4 end
5 u_{m−1} := T_{m−1} u_{m−1}
// Apply L^{-T}
6 for i = m − 2, ..., 0 do
7   u_i := u_i − T_i (J_{i+1,i}^T u_{i+1})
8 end
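Given the panel appliers T_i produced by Algorithm 3.3 and the subdiagonal coupling blocks J_{i+1,i}, the whole preconditioner application is a short sweep down and back up; a serial sketch with our own naming:

import numpy as np

def apply_sweeping_preconditioner(T, J_sub, f_panels):
    # T[i] applies the approximate S_i^{-1} of panel i; J_sub[i] is the coupling
    # block J_{i+1,i}. The matrix is complex symmetric, so the return sweep uses
    # the transpose of J_{i+1,i} rather than its adjoint.
    u = [fi.copy() for fi in f_panels]
    m = len(u)
    for i in range(m - 1):               # apply L^{-1} and D^{-1}
        u[i] = T[i](u[i])
        u[i + 1] -= J_sub[i] @ u[i]
    u[m - 1] = T[m - 1](u[m - 1])
    for i in range(m - 2, -1, -1):       # apply L^{-T}
        u[i] -= T[i](J_sub[i].T @ u[i + 1])
    return u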
and current work supports its effectiveness for time-harmonic linear elasticity.
The rest of this document will be presented in the context of the Helmholtz
equation, but we emphasize that all parallelizations should carry over to more
general wave equations in a conceptually trivial way.
Another closely related method is the Analytic ILU factorization [52].
Like the sweeping preconditioner, it uses local approximations of the Schur
complements of the block LDLT factorization of the Helmholtz matrix rep-
resented in block tridiagonal form. There are two crucial differences between
the two methods:
These two improvements are responsible for transitioning from the O(ω) iter-
ations required with AILU to the O(1) iterations needed with the sweeping
preconditioner (for problems without internal resonance).
requirements. In particular, [27] demonstrates that, with a properly tuned
two-grid approach, large-scale heterogeneous 3D problems can be solved with
impressive timings.
There has also been a recent effort to extend the fast-direct meth-
ods presented in [134] from definite elliptic problems into the realm of low-
to-moderate frequency time-harmonic wave equations [130, 131]. While their
work has resulted in a significant constant speedup versus applying a clas-
sical multifrontal algorithm to the full 3D domain [131], their results have
so far still demonstrated the same O(N 2 ) asymptotic complexity as standard
sparse-direct methods.
6
While it is tempting to try to expose more parallelism with techniques like cyclic reduc-
tion (which is a special case of a multifrontal algorithm), their straightforward application
destroys the Schur complement properties that we exploit for our fast algorithm.
3.4.1 The need for scalable triangular solves
Figure 3.3: A separator-based elimination tree (right) over a quasi-2D subdo-
main (left)
7
In cases where the available solve parallelism has been exhausted but the problem cannot
be solved on fewer processes due to memory constraints, it would be preferable to solve with
subdomains stored on subsets of processes.
[Figure 3.4: overlay of the owning process ranks of a 7 × 7 matrix distributed over a 2 × 3 process grid in the [MC, MR] distribution (left), with the 2 × 3 process grid shown on the right.]
this notation may seem vapid when only working with a single distributed
matrix, it is useful when working with products of distributed matrices. For
instance, if a ‘?’ is used to represent rows/columns being redundantly stored
(i.e., not distributed), then the result of every process multiplying its local
submatrix of A[X, ?] with its local submatrix of B[?, Y ] forms a distributed
matrix C[X, Y ] = (AB)[X, Y ] = A[X, ?] B[?, Y ], where the last expression
refers to a per-process local multiplication.
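As a purely illustrative example of this identity, the following Python sketch simulates a hypothetical 2 × 3 process grid with numpy and checks that the per-process local products of A[X, ?] and B[?, Y] assemble into the distributed product C[X, Y]; none of the names or sizes below come from Clique or PSP.

import numpy as np

np.random.seed(0)
m, k, n = 6, 4, 9          # global matrix dimensions (test values)
A = np.random.rand(m, k)
B = np.random.rand(k, n)
pr, pc = 2, 3              # the assumed process grid

C = np.zeros((m, n))
for r in range(pr):
    for c in range(pc):
        rows = np.arange(r, m, pr)    # rows owned in A[X, ?]
        cols = np.arange(c, n, pc)    # columns owned in B[?, Y]
        # each "process" multiplies its local pieces with no communication
        C[np.ix_(rows, cols)] = A[rows, :] @ B[:, cols]

assert np.allclose(C, A @ B)          # the local products form C[X, Y]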
See the left side of Fig. 3.6 for an example of an [MC , ?] distribution of a 7 × 3
matrix.
Figure 3.5: Overlay of the owning process ranks of a vector of height 7 dis-
tributed over a 2 × 3 process grid in the [VC , ?] vector distribution (left) and
the [VR , ?] vector distribution (right).
Figure 3.6: Overlay of the owning process ranks of a vector of height 7 dis-
tributed over a 2 × 3 process grid in the [MC , ?] distribution (left) and the
[MR , ?] distribution (right).
provide each process with the data necessary to form xs[MR, ?]. Under reasonable assumptions, both of these redistributions can be shown to have per-process communication volume lower bounds of Ω(n/√p) (if F_{TL} is n × n) and latency lower bounds of Ω(log_2(√p)) [28]. We also note that translating between xs[VC, ?] and xs[VR, ?] simply requires permuting which process owns each local subvector, so the communication volume would be O(n/p), while the latency cost is O(1).
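The following small sketch (plain Python, with a hypothetical 2 × 3 process grid; not code from the dissertation's software) checks that the [VC, ?] and [VR, ?] distributions assign the same subvectors to the processes, only under different rank orderings, so that translating between them really is a pure permutation of owners.

pr, pc, n = 2, 3, 7
p = pr * pc
# Column-major ranks define [VC, ?]; row-major ranks define [VR, ?].
col_major = {(r, c): c * pr + r for r in range(pr) for c in range(pc)}
row_major = {(r, c): r * pc + c for r in range(pr) for c in range(pc)}
vc_local = {proc: tuple(i for i in range(n) if i % p == col_major[proc])
            for proc in col_major}
vr_local = {proc: tuple(i for i in range(n) if i % p == row_major[proc])
            for proc in row_major}
# Every [VC, ?] subvector appears, unchanged, as some process's [VR, ?]
# subvector, so only the owning ranks need to be permuted.
assert sorted(vc_local.values()) == sorted(vr_local.values())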
tion (3.12) also implies that each process's local data from a [VC, ?] distribution is a subset of its local data from an [MC, ?] distribution, a simultaneous summation and scattering of {z^(t)[MC, ?]}_{t=0}^{c−1} within process rows, per-
process i mod q, we can instead choose an alignment parameter, σ, where
0 ≤ σ < q, and assign entry i to process (i + σ) mod q. If we simply set σ = 0
for every supernode, as the discussion at the beginning of this subsection
implied, then at most O(γn) processes will store data for the root separator
supernodes of a global vector, as each root separator only has O(γn) degrees
of freedom by construction. However, there are m = O(n/γ) root separators,
so we can easily allow for up to O(n2 ) processes to share the storage of a
global vector if the alignments are carefully chosen. It is important to notice
that the top-left quadrants of the frontal matrices for the root separators can
each be distributed over O(γ 2 n2 ) processes, so O(γ 2 n2 ) processes can actively
participate in the corresponding triangular matrix-vector multiplications.
handle several right-hand sides at once. The approach taken in the following experiments is a straightforward modification of Algorithm B.1 which simply maintains a separate Krylov subspace and Rayleigh quotient for each right-hand side and iterates until all relative residuals are sufficiently small.
2. A wedge problem over the unit cube, where the wave speed is set to 2 if Z ≤ 0.4 + 0.1 x2, to 1.5 if, otherwise, Z ≤ 0.8 − 0.2 x2, and to 3 in all other cases.
3. A two-layer model defined over the unit cube, where c = 4 if x2 < 0.5,
and c = 1 otherwise.
4. A waveguide over the unit cube: c(x) = 1.25(1 − 0.4 e^{−32(|x1 − 0.5|^2 + |x2 − 0.5|^2)}).

1. a single shot centered at x0, f0(x) = n e^{−10n‖x − x0‖^2},

2. three shots, f1(x) = Σ_{i=0}^{2} n e^{−10n‖x − x_i‖^2},

3. a Gaussian beam centered at x2 in direction d, f2(x) = e^{iωx·d} e^{−4ω‖x − x2‖^2}, and

where x0 = (0.5, 0.5, 0.1), x1 = (0.25, 0.25, 0.1), x2 = (0.75, 0.75, 0.5), and d = (1, 1, −1)/√3. Note that, in the case of the Overthrust model, these
source locations should be interpreted proportionally (e.g., an x3 value of 0.1
means a depth which is 10% of the total depth of the model). Due to the
oscillatory nature of the solution, we simply choose the zero vector as our
initial guess in all experiments.
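For concreteness, the following short numpy sketch evaluates the three analytical forcing functions above on a uniform grid over the unit cube; the grid size n and the angular frequency ω below are placeholder values chosen only for illustration and do not reproduce any experiment from this dissertation.

import numpy as np

n, omega = 50, 2 * np.pi * 10          # assumed grid size and angular frequency
x0 = np.array([0.5, 0.5, 0.1])
x1 = np.array([0.25, 0.25, 0.1])
x2 = np.array([0.75, 0.75, 0.5])
d = np.array([1.0, 1.0, -1.0]) / np.sqrt(3.0)

# coordinates of an n x n x n uniform grid over the unit cube
grid = np.stack(np.meshgrid(*(np.linspace(0, 1, n),) * 3, indexing='ij'), axis=-1)

def gaussian(center, scale):
    return np.exp(-scale * np.sum((grid - center) ** 2, axis=-1))

f0 = n * gaussian(x0, 10 * n)                                  # single shot
f1 = sum(n * gaussian(xi, 10 * n) for xi in (x0, x1, x2))      # three shots
f2 = np.exp(1j * omega * grid @ d) * gaussian(x2, 4 * omega)   # Gaussian beam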
The first experiment was meant to test the convergence rate of the
sweeping preconditioner over domains spanning 50 wavelengths in each direc-
tion (resolved to ten points per wavelength) and each test made use of 256
nodes of Lonestar. During the course of the tests, it was noticed that a signifi-
cant amount of care must be taken when setting the amplitude of the derivative
of the PML takeoff function (i.e., the “C” variable in Equation (2.1) from [44]).
velocity model     barrier   wedge   two-layers   waveguide
Hz                   50        75        50          37.5
PML amplitude       3.0       4.0       4.0          2.0
# of its.            28        49        48           52
Table 3.1: The number of iterations required for convergence for four model
problems (with four forcing functions per model). The grid sizes were 5003
and roughly 50 wavelengths were spanned in each direction. Five grid points
were used for all PML discretizations, four planes were processed per panel,
and the damping factors were all set to 7.
For the sake of brevity, hereafter we refer to this variable as the PML ampli-
tude. A modest search was performed in order to find a near-optimal value for
the PML amplitude for each problem. These values are reported in Table 3.1
along with the number of iterations required for the relative residuals for all
four forcing functions to reduce to less than 10−5 .
It was also observed that, at least for the waveguide problem, the con-
vergence rate would be significantly improved if 6 grid points of PML were
used instead of 5. In particular, the 52 iterations reported in Table 3.1 reduce
to 27 if the PML size is increased by one. Interestingly, the same number
of iterations are required for convergence of the waveguide problem at half
the frequency (and half the resolution) with five grid points of PML. Thus,
it appears that the optimal PML size is a slowly growing function of the fre-
quency.8 We also note that, though it was not intentional, each of these first four velocity models is invariant in one or more directions, and so it would be straightforward to sweep in a direction such that only O(1) panel factorizations would need to be performed, effectively reducing the complexity of the
8 A similar observation is also made in [120].
Figure 3.7: A single x2 x3 plane from each of the four analytical velocity models
over a 5003 grid with the smallest wavelength resolved with ten grid points.
(Top-left) the three-shot solution for the barrier model near x1 = 0.7, (bottom-
left) the three-shot solution for the two-layer model near x1 = 0.7, (top-right)
the single-shot solution for the wedge model near x1 = 0.7, and (bottom-right)
the single-shot solution for the waveguide model near x1 = 0.55.
setup phase to O(γ^2 N).
3.6 Summary
A parallelization of the moving-PML sweeping preconditioner was pre-
sented and successfully applied to five challenging heterogeneous velocity mod-
els, including the SEG/EAGE Overthrust model. The primary challenges
were:
• ensuring that the subdomain multifrontal solves were scalable (e.g., through
selective inversion),
• ensuring that the PML profile was appropriately matched to the velocity
9 Other than Clique, the only other implementation appears to be in DSCPACK [103].
Figure 3.8: Three cross-sections of the SEG/EAGE Overthrust velocity model,
which represents an artificial 20 km × 20 km × 4.65 km domain, containing an
overthrust fault, using samples every 25 m. The result is an 801 × 801 × 187
grid of wave speeds varying discontinuously between 2.179 km/sec (blue) and
6.000 km/sec (red).
Figure 3.9: Three planes from an 8 Hz solution with the Overthrust model at its native resolution, 801 × 801 × 187, with a single localized shot at the center of the x1x2 plane at a depth of 456 m: (top) an x2x3 plane near x1 = 14 km, (middle) an x1x3 plane near x2 = 14 km, and (bottom) an x1x2 plane near x3 = 0.9 km.
number of processes   128            256            512             1024            2048
Hz                    3.175          4              5.04            6.35            8
grid                  319×319×75     401×401×94     505×505×118     635×635×145     801×801×187
setup                 48.40 (46.23)  66.33 (63.41)  92.95 (86.90)   130.4 (120.6)   193.0 (175.2)
apply                 1.87 (4.26)    2.20 (5.11)    2.58 (9.61)     2.80 (13.3)     3.52 (24.5)
3 digits              42             44             42              39              40
4 digits              54             57             57              58              58
5 digits              63             69             70              68              72
Table 3.2: Convergence rates and timings (in seconds) on TACC’s Lonestar
for the SEG/EAGE Overthrust model, where timings in parentheses do not
make use of selective inversion. All cases used a complex double-precision
second-order finite-difference stencil with five grid spacings for all PML (with
a magnitude of 7.5), and a damping parameter of 2.25π. The preconditioner
was configured with four planes per panel and eight processes per node. The
‘apply’ timings refer to a single application of the preconditioner to four right-
hand sides.
field, and
Chapter 4
Rather than listing the free-space Green’s functions for the most im-
portant classes of time-harmonic wave equations and showing that their trans-
lation invariance is self-evident, we will instead consider the translation invari-
ance as a direct result of the lack of a frame of reference for a problem posed
over an infinite homogeneous domain. We will use the 3D Helmholtz Green’s
function as an example, but it is important to keep in mind that the techniques
generalize to other time-harmonic wave equations.
4.1 Theory
Consider evaluating the Green's function for the free-space 3D Helmholtz equation,

    G(x, y) = exp(iω|x − y|)/(4π|x − y|)  if x ≠ y,  and  G(x, y) = +∞  if x = y,

with x and y restricted to a set of points evenly spaced over a line segment, say x, y ∈ {jhê}_{j=0}^{q−1} ⊂ R^3, where h > 0 and ê ∈ R^3 is an arbitrary unit vector. Despite the fact that there are q^2 different choices for the pair (x, y), |x − y| can only take on q different values, namely, {0, h, . . . , (q − 1)h} (and thus G can also only take on at most q unique values). This redundancy is primarily due to the translation invariance of G, i.e.,

    G(x, y) = G(x + t, y + t)  ∀ t ∈ R^3.
Our next step will be to discuss situations where G is only translation invariant
in a particular direction.
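A quick numerical sanity check of this counting argument is given below (illustrative only; the frequency, spacing, and direction are arbitrary choices, not values used elsewhere in this dissertation).

import numpy as np

omega, h, q = 5.0, 0.1, 8
e = np.array([1.0, 2.0, 2.0]) / 3.0            # an arbitrary unit vector
pts = [j * h * e for j in range(q)]

# |x - y| takes only the q values {0, h, ..., (q - 1)h} over these q^2 pairs
dists = {round(float(np.linalg.norm(x - y)), 12) for x in pts for y in pts}
assert len(dists) == q

def G(x, y):                                    # free-space Helmholtz kernel
    r = np.linalg.norm(x - y)
    return np.inf if r == 0 else np.exp(1j * omega * r) / (4 * np.pi * r)

# consequently the q x q sample matrix has at most q distinct entries
samples = np.array([[G(x, y) for y in pts] for x in pts])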
Proof. Due to the translation invariance of G in the direction ê,
where 0 ≤ |k − `| < q.
matrix
Proof. Using translation invariance,
and thus, since −q < j1 − j2 < q, there are at most 2q − 1 unique columns.
Remark 4.1.2. Although Lemma 4.1.2 may seem overly specific, transla-
tion invariance in a particular direction arises naturally for Green’s func-
tions of constant-coefficient problems posed over unbounded domains which
are invariant in a particular direction, such as the infinitely tall rectangle
[0, 1] × (−∞, +∞). We will now show how the method of mirror images [77]
can be used to extend these ideas to semi-infinite domains, e.g., [0, 1]×(−∞, 0]
with a zero Dirichlet boundary condition imposed over [0, 1] × {0}. Once we
have done so, we will have a strong argument for the compressibility of the
frontal matrices arising from the semi-infinite subdomain auxiliary problems
posed by the sweeping preconditioner.
Then we can use the original Green's function, G, to construct a half-space Green's function, say G̃, with domain (R^2 × (−∞, 0])^2 and boundary condition G̃(x, y) = 0 when x ∈ R^2 × {0}. In particular, we may set

    G̃(x, y) = G(x, y) − G(x, R(y)),

where R(v) is the reflection of v over the plane R^2 × {0} and represents the location of an artificial charge which counteracts the charge located at v over the plane R^2 × {0}.
Figure 4.1: (Top) The real part of a 2D slice of the potential generated by a
source in the obvious location and (bottom) the same potential superimposed
with that of a mirror-image charge. The potential is identically zero along the
vertical line directly between the two charges.
Proof. By definition,
then
It follows that the columns of K are again linear combinations of the vectors
and set G̃(x, y) = G(x, y) − G(x, R(y)), where R(y) = y − 2ê_3(ê_3, y). Then G̃ restricted to (R^2 × (−∞, 0])^2 is a Green's function for the constant-coefficient half-space Helmholtz problem,

    (−Δ − ω^2) u(x) = f(x),

[0, 1] × {0}, and set ê = ê_3, then for any integer q > 1, Lemma 4.1.3 implies that the rank of the s^2 × q^2 matrix

    K(i1 + s i2, j1 + q j2) = G̃(u_{i1} + (j1 λ/(q − 1)) ê_3, u_{i2} + (j2 λ/(q − 1)) ê_3),

where 0 ≤ i1, i2 < s and 0 ≤ j1, j2 < q, is at most 3q − 2.
Figure 4.2: A set of six unevenly spaced points (in blue) along the line segment
{0} × [0, 1] × {0} and their four evenly spaced translations (in red) in the
direction ê3 .
and so the sign of γ is irrelevant when evaluating G̃(ui1 +γê3 , ui2 ). We therefore
satisfy the mirror symmetry conditions for Lemma 4.1.3 and find that the rank
of K is at most 2q − 1.
Remark 4.1.3. Although our previous example used identical source and
target sets, notice that this is not a requirement of Lemma 4.1.3.
G̃(i, j) = G̃(xi , xj ),
segment {0} × [0, 1] × [0, λ]. This representation is satisfying because G̃ can
be interpreted as mapping a charge distribution over the plane segment to its
resulting potential field over the same region.
Theorem 4.1.4. Let G : R^d × R^d → C ∪ {∞} be invariant under translation in the direction ê ∈ R^d and define G̃(x, y) = G(x, y) − G(x, R(y)), where R(y) = y − 2ê(ê, y). Furthermore, let {a_i}_{i=0}^{s_a−1}, {b_i}_{i=0}^{s_b−1} ⊂ R^d, where each b_i is orthogonal to ê, and define an array of q s_a target points, x = vec(X), and an array of q s_b source points, y = vec(Y), where X(i, j) = a_i + jhê and Y(i, j) = b_i + jhê. Then the q s_a × q s_b matrix

    G̃(i, j) = G̃(x_i, y_j)

then G̃ can be represented as the sum of only 2q − 1 Kronecker product matrices.
Proof. Partition G̃ as

    G̃ = [ G̃_{0,0}    ···  G̃_{0,q−1}
           ⋮           ⋱    ⋮
           G̃_{q−1,0}  ···  G̃_{q−1,q−1} ],

where each submatrix G̃_{i,j} is s_a × s_b. Due to the ordering imposed upon G̃, the matrix K formed such that its (i + qj)'th column is equal to vec(G̃_{i,j}) satisfies the conditions for Lemma 4.1.3, which shows that the rank of K is at most 3q − 2. That is, we may decompose K as

    K = Σ_{t=0}^{3q−3} z_t y_t^T.

Then K satisfies the second condition of Lemma 4.1.3 and has a rank of at most 2q − 1.
(Y ⊗ Z)x = vec(ZXY^T),
Corollary 4.1.6. Given a q × q matrix Y, an s_a × s_b matrix Z, and a vector x of length q s_b, the product (Y ⊗ Z)x can be formed with 2q^2 min{s_a, s_b} + 2q s_a s_b total additions and multiplications.
Theorem 4.1.7. Suppose a q s_a × q s_b matrix G̃ satisfies the first set of conditions for Theorem 4.1.4. Then the Kronecker product representation of G̃ only requires the storage of (3q − 2)(q^2 + s_a s_b) scalars, and this representation can be used to linearly transform a vector with (3q − 2)(2q^2 min{s_a, s_b} + 2q s_a s_b) floating-point operations.
If G̃ satisfies the second set of conditions for Theorem 4.1.4, then only (2q − 1)(q^2 + s_a s_b) scalars of storage are required, and the operator may be applied in this form with (2q − 1)(2q^2 min{s_a, s_b} + 2q s_a s_b) total additions and multiplications.
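The savings come from never forming Y ⊗ Z explicitly: by the identity (Y ⊗ Z)x = vec(ZXY^T), each term costs only two small dense products. A minimal numpy check of this identity, with arbitrary test dimensions that are not taken from the dissertation's experiments, is:

import numpy as np

rng = np.random.default_rng(0)
q, sa, sb = 4, 3, 5                         # arbitrary test dimensions
Y = rng.standard_normal((q, q))
Z = rng.standard_normal((sa, sb))
X = rng.standard_normal((sb, q))            # vec(X) = x has length q*sb

vec = lambda M: M.reshape(-1, order='F')    # vec stacks columns
lhs = np.kron(Y, Z) @ vec(X)                # forms the (q*sa) x (q*sb) operator
rhs = vec(Z @ X @ Y.T)                      # two small products instead
assert np.allclose(lhs, rhs)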
that the second set of conditions for Theorem 4.1.4 are satisfied, is essentially 4q^3 entries of storage versus the q^4 entries required for the standard dense storage scheme, which is clearly a factor of q/4 compression. On the other hand, if q ≪ s_a, s_b, then we could expect a factor of q/2 compression.
with zero Dirichlet boundary conditions on the finite boundaries and radi-
ation conditions posed in the remaining two directions, where the subscript
x implies that the operator −∆ − ω 2 acts on the x variable of the Green’s
function GD (x, y). Due to Dirichlet boundary conditions being posed on both
sides of the domain in the first two dimensions, mirror imaging techniques for
constructing the Green’s function for this problem from the free-space Green’s
function would require an infinite summation, as the reflections from the left
wall would reflect off the right wall, which would then reflect off of the left
wall, ad infinitum.
We will avoid discussion of the convergence of such summations [21] and
simply recognize that the resulting Green’s function will remain translation
invariant in the third coordinate. Then we may use the method of mirror
imaging in the third dimension to construct another Green’s function, say
G̃D (x, y) = GD (x, y) − GD (x, R(y)), for the problem
where R(y) = y − 2ê3 (ê3 , y) is responsible for reflecting y over the newly
imposed zero Dirichlet boundary condition on [0, 1] × [0, 1] × {0}. If x and
y are both perpendicular to ê3 , then GD retains the free-space Helmholtz
property that GD (x + γê3 , y) = GD (x − γê3 , y) for any γ ∈ R, and we see that
Theorem 4.1.7 can be invoked to show that samples of G̃D over a grid which
is uniform in the ê3 direction are compressible.
Theorem 4.1.8. The Green's function, say G̃_D, for the Helmholtz problem

    (−Δ_x − ω^2) G̃_D(x, y) = δ(x − y),  in [0, 1] × [0, 1] × (−∞, 0],

with zero Dirichlet boundary conditions on the finite boundaries and the Sommerfeld radiation condition in the remaining direction, is compressible in the following sense. For any points {a_i}_{i=0}^{s−1} ⊂ ê_3^⊥, if we define the array of points
where Y_t is q × q and Z_t is s × s. Furthermore, this representation requires only (2q − 1)(q^2 + s^2) units of storage and can be used to linearly transform a vector with (2q − 1)(2q^2 s + 2q s^2) units of work.
is closely related to the Green’s function resulting from the subproblem posed
over the subdomain covered by supernode s and its descendants, but with zero
Dirichlet boundary conditions posed over the artificial boundaries introduced
by the ancestor separators (consider Figure 3.3); Theorem 4.1.8 was specifically
designed to handle these artificial Dirichlet boundary conditions.
• The theory from the previous section demonstrates precise cases where
the discrete Green’s function can be losslessly compressed as a sum of
Kronecker products, but we will make use of a thresholded singular value
decomposition in order to find an approximate compression in more gen-
eral scenarios (e.g., heterogeneous media and PML boundary conditions).
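A rough numpy sketch of this thresholded Kronecker-product compression (a rearrangement of the blocks of G followed by a truncated SVD) follows; the shapes, the threshold convention, and the function name are assumptions made purely for illustration and do not reproduce the Clique/PSP implementation.

import numpy as np

def kron_compress(G, r, sa, sb, eps):
    # K(:, i + j*r) = vec(G_{i,j}), where G_{i,j} is the (i, j) block of G
    K = np.empty((sa * sb, r * r), dtype=G.dtype)
    for j in range(r):
        for i in range(r):
            block = G[i * sa:(i + 1) * sa, j * sb:(j + 1) * sb]
            K[:, i + j * r] = block.reshape(-1, order='F')
    U, S, Vh = np.linalg.svd(K, full_matrices=False)
    keep = np.flatnonzero(S >= eps * np.linalg.norm(G, 2))   # assumed threshold
    Zs = [U[:, t].reshape(sa, sb, order='F') for t in keep]
    Ys = [(S[t] * Vh[t, :]).reshape(r, r, order='F') for t in keep]
    return Ys, Zs

rng = np.random.default_rng(0)
r, sa, sb = 4, 3, 2
G = rng.standard_normal((r * sa, r * sb))
Ys, Zs = kron_compress(G, r, sa, sb, eps=1e-12)
approx = sum(np.kron(Y, Z) for Y, Z in zip(Ys, Zs))
assert np.allclose(G, approx)   # lossless here, since nothing was truncated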
nately, much less can be said about the compression of the bottom-left
quadrant of the fronts.
For example, Step 4 of Algorithm 4.3 and Step 2 of Algorithm 4.4 must each be replaced with an operation of the form

    X(D_s, :) := Σ_{t=0}^{r−1} (Y_t ⊗ Z_t) X(D_s, :),
Algorithm 4.2: Structured Kronecker product compression
Input: r s_a × r s_b matrix G and tolerance, ε
Output: r × r matrices {Y_t}_{t=0}^{r−1} and s_a × s_b matrices {Z_t}_{t=0}^{r−1} such that ‖G − Σ_t Y_t ⊗ Z_t‖_2 ≤ ε‖G‖_2
1   Form the s_a s_b × r^2 matrix K such that K(:, i + jr) = vec(G(i s_a : (i+1) s_a − 1, j s_b : (j+1) s_b − 1))
2   Compute the SVD of K, K = UΣV^H
3   Remove every triplet (σ, u, v) such that σ < ε‖G‖_2 to form the
4.2.1 Mapping a single vector
For the moment, let us suppose that the submatrix X(D_s, :) has only a single column, which we will simply refer to as the vector b. Then we must perform the update

    b := Σ_{t=0}^{r−1} (Y_t ⊗ Z_t) b = Σ_{t=0}^{r−1} vec(Z_t W Y_t^T),
Definition 4.2.1. Given an m × qk matrix W, which we may partition as W = [W_0, W_1, ···, W_{k−1}], where

    E = [ W_0 Y_0^T      ···  W_{k−1} Y_0^T
          W_0 Y_1^T      ···  W_{k−1} Y_1^T
          ⋮               ⋱    ⋮
          W_0 Y_{r−1}^T  ···  W_{k−1} Y_{r−1}^T ],        (4.3)

which is a permuted version of the matrix

    Ẽ = [ W_0
          W_1
          ⋮
          W_{k−1} ] [ Y_0^T  Y_1^T  ···  Y_{r−1}^T ] ≡ W̃ Ỹ^T.        (4.4)

We may thus form the update B := Σ_{t=0}^{r−1} (Y_t ⊗ Z_t) B in an analogous manner as in the previous subsection (see Algorithm 4.6). As before, the entire process involves two dense matrix-matrix multiplications (Steps 2 and 4) and several permutations.
Algorithm 4.6: Mapping a matrix with a sum of Kronecker products, B := Σ_{t=0}^{r−1} (Y_t ⊗ Z_t) B
1   W̃ := vec^{−1}(B, k)
2   Ẽ := W̃ [Y_0^T, . . . , Y_{r−1}^T]
3   Shuffle Ẽ into E
4   M := Z E
5   B := vec(M, k)
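The following numpy sketch applies the same update, B := Σ_t (Y_t ⊗ Z_t)B, by reshaping each column rather than by the explicit shuffle of Algorithm 4.6; it is meant only to illustrate that the Kronecker products never need to be formed, and all sizes are arbitrary test values (square Z_t are assumed so that the update can be done in place).

import numpy as np

rng = np.random.default_rng(1)
q, s, r, k = 4, 3, 2, 5                     # arbitrary test sizes
Ys = [rng.standard_normal((q, q)) for _ in range(r)]
Zs = [rng.standard_normal((s, s)) for _ in range(r)]
B = rng.standard_normal((q * s, k))

def apply_kron_sum(B):
    out = np.zeros_like(B)
    for j in range(B.shape[1]):
        W = B[:, j].reshape(s, q, order='F')            # vec^{-1} of column j
        acc = sum(Z @ W @ Y.T for Y, Z in zip(Ys, Zs))  # only small products
        out[:, j] = acc.reshape(-1, order='F')
    return out

dense = sum(np.kron(Y, Z) for Y, Z in zip(Ys, Zs))
assert np.allclose(apply_kron_sum(B), dense @ B)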
4.3 Results
We now reconsider the waveguide model discussed in the previous chapter, discretized with second-order finite-difference stencils over 250 × 250 × 250 and 500 × 500 × 500 grids at frequencies of 18.75 Hz and 37.5 Hz, respectively. Table 4.1 demonstrates that our compression scheme is quite successful for the discrete Green's functions (the diagonal blocks) produced within our modified multifrontal factorization. In particular, for the 250^3 grid, a factor of 16.41 memory compression was measured for the top-most panel of the waveguide (which consists of 14 planes), whereas the compression ratio was 12.51 for the 500^3 grid. It is perhaps unsurprising that the compression algorithm is less successful when applied in an ad-hoc manner to the connectivity between the supernodes, as these matrices do not correspond to discrete Green's functions and therefore do not enjoy any obvious benefits from the (approximate) translation invariance of the free-space Green's function. It was also observed (see Figure 4.3) that the compressed preconditioner behaved essentially the same as the original preconditioner for the 250^3 waveguide example.
Table 4.1: Memory usage of the original and compressed preconditioners for the 250^3 waveguide problem (500^3 results in parentheses).

                   original              compressed             ratio
diagonal blocks    2544 MB (11295 MB)    155.0 MB (903.2 MB)    16.41 (12.51)
connectivity       6462 MB (31044 MB)    1336 MB (7611 MB)      4.836 (4.079)
total              9007 MB (42339 MB)    1491 MB (8513 MB)      6.039 (4.973)
(Figure 4.3: maximum relative residual, max ‖b − Ax‖_2/‖b‖_2, versus iteration number.)
plication will be left as future work. The author would also like to emphasize
that, while several other researchers have investigated hierarchical low-rank
compression schemes for the dense frontal matrices [110, 134], the proposed
scheme is based upon Kronecker-product approximations designed to exploit
the translation invariance of the free-space Green’s function. Another differ-
ence is that the primary goal is to lower the memory requirements rather than
to reduce the computational cost.
Chapter 5
5.1 Contributions
The major contributions detailed throughout the previous chapters es-
sentially boil down to:
We will now briefly summarize each of these items and then provide pointers
to the source code and documentation for the supporting software.
fundamental idea behind each of these approaches is to eschew traditional one-
dimensional distributions for each supernode’s portion of the right-hand side
vectors in favor of two-dimensional data distributions which can be used within
scalable dense matrix-matrix multiplication and triangular solve algorithms.
Both of these approaches will be of significant interest for applications which
require the solution of large numbers of linear systems with the same sparse
matrix (especially frequency-domain seismic inversion).
• ensuring that the subdomain multifrontal solves were scalable, for in-
stance, through the usage of selective inversion,
5.1.3 Compressed moving-PML sweeping preconditioner
It was demonstrated, via the identity

    (Y ⊗ Z)x = vec(ZXY^T),

where the vec operator concatenates the columns of its input matrix and X is defined such that vec(X) = x, that the frontal matrices compressed as a sum of a small number of Kronecker products may be applied to (sets of) vectors using scalable algorithms for dense matrix-matrix multiplication. The result is that, after the Kronecker product compression, standard multifrontal triangular solves need only be slightly modified in order to perform the necessary linear transformations directly with the compressed data.
5.1.5 Reproducibility
Each of the major software efforts involved in the research behind this
dissertation is available under an open-source license along with a significant
amount of documentation:
• Extending the current implementation of the parallel sweeping precon-
ditioner to more sophisticated discretizations for both time-harmonic
Maxwell’s and linear elastic equations.
Finally, as was mentioned in the introductory chapter, one of the main ap-
plications of this work will be seismic inversion. As these techniques mature,
the author will begin investigating their incorporation into effective inversion
procedures.
Appendices
Appendix A
A proof of the fundamental theorem of algebra is beyond the scope of this dis-
sertation, and so we will simply state it without proof and point the interested
reader to [117] for a concise proof based upon Liouville’s theorem, or to [30]
for a more elementary argument based upon the minimum-modulus principle
which dates back to the amateur mathematician, Jean-Robert Argand [7].
The motivating ideas of this appendix are essentially a blend of [14], [74], [58], and [67]. Like [14], we prove the existence of eigenpairs without resorting to determinants,1 whereas our proof of the existence of a shared eigenvector among two commuting matrices is due to [74], which leads to a beautifully simple (and seemingly nonstandard) proof of the spectral theorem for normal matrices. Our proof of the singular value decomposition is essentially identical to that of [58], and our debt to [67] is primarily stylistic.
3. a proof of the existence of the Schur decomposition (as well as for pairs
of commuting matrices) and of the singular value decomposition,
1
via an argument based upon Krylov subspaces, despite the fact that the term Krylov
never appears in [14]
vector spaces. Ideally the reader is already intimately familiar with the con-
cept of linear independence and the notion of a basis of a vector space, but their
definitions will be provided for the sake of completeness. However, we will not
list all of the properties required for a proper definition of a vector space or
provide a proof of the finite-dimensional Riesz representation theorem. Please
see [67] for a detailed discussion of this background material.
    Σ_{j=0}^{k−1} α_j x_j ≠ 0
Notice that this is equivalent to the statement that no member of the set of
vectors can be written as a linear combination of the others.
for that subspace. More specifically, {x_j}_{j=0}^{k−1} is a basis for W if, for every w ∈ W, there exist scalars {α_j}_{j=0}^{k−1} ⊂ C such that

    w = Σ_{j=0}^{k−1} α_j x_j.
Definition A.1.5 (span). We say that a basis for a subspace spans that subspace, and, for any arbitrary set of vectors {x_j}_{j=0}^{k−1} ⊂ C^n, we may define the set

    span {x_j}_{j=0}^{k−1} = {x ∈ C^n : x = Σ_{j=0}^{k−1} α_j x_j for some {α_j}_{j=0}^{k−1} ⊂ C}.        (A.1.1)
It is easy to see that this set is in fact a linear subspace of Cn . We will also
occasionally use the shorthand span A for denoting the span of the columns of
a matrix A.
Definition A.1.6 (standard basis vectors). The j'th standard basis vector, e_j ∈ C^n, is defined component-wise as being zero everywhere except for entry j, which is equal to one. Clearly {e_j}_{j=0}^{n−1} spans C^n.
Proof. Suppose not. Then there exists a member of W which is independent
of the original d vectors, which implies a set of d + 1 linearly independent
vectors, which contradicts the previous theorem.
When m = 1 (i.e., the result lies in C), we will call such a transformation a
linear functional.
where ξj and ηj are respectively the j’th entries of x and y. The inner product
(y, x) will also often be denoted by y H x. It is easy to see that the Euclidean
inner product satisfies the following properties for any α, β ∈ C and x, y ∈ Cn :
We say that the Riesz representation theorem identifies each linear functional
on Cn with a vector in Cn .
Proof. Please see [67] for details.
where αi,j refers to the (i, j) entry of the matrix associated with A, ξj refers to
the j’th entry of ξ, and ηi refers to the i’th entry of y. From now on, we will
identify all finite-dimensional linear transformations with their corresponding
matrix.
Definition A.1.11. We will follow the convention of [74] and denote the set
of m × n matrices with complex coefficients as Mm,n , and the set of n × n
(square) matrices as Mn .
Proposition A.4. For any square matrix A ∈ Mn ,
Proof. We may apply the previous proposition with standard basis vectors in
order to show that both equations hold entrywise. For example,
which shows the first result. A similar argument holds for (AB)T .
Definition A.2.2 (kernel). Given a matrix A ∈ Mn , the set of all vectors
x ∈ Cn such that Ax = 0 is called the kernel or null space of A. That is,
ker A = {x ∈ Cn : Ax = 0}.
Ax = λx,
which shows that, in some sense, A behaves like the scalar λ when acting on
the vector x. For this reason, each λ ∈ σ(A) is called an eigenvalue of A, and
its corresponding vector, x, is called an eigenvector. It is also common to refer
to the pair (λ, x) as an eigenpair of a matrix.
2
This definition must become significantly more technical when considering infinite-
dimensional linear operators. Roughly speaking, the spectrum of a linear operator A is
the set of all ξ ∈ C such that the resolvent operator, (ξI − A)−1 , is not well-defined and
bounded. See [84] for details.
Remark A.2.1. We will now follow the approach of [14] in showing the crucial fact that every complex finite-dimensional square matrix, A ∈ Mn, where n ≥ 1, has at least one eigenpair. Along the way, we will introduce the extremely important notion of a Krylov subspace, which plays a central role in the body of this document.
is irrelevant.
which is precisely the space of all polynomials of degree less than or equal to
k − 1 of A acting on r, which we denote by Pk−1 (A)r.
Lemma A.3 (Krylov lemma). For any matrix A ∈ Mn and vector x ∈ Cn ,
there exists a polynomial p ∈ Pn such that p(A)x = 0.
p(z) = γn (z − ξ0 )(z − ξ1 ) · · · (z − ξn ),
where γ_n ≠ 0, and, when combined with Lemma A.2, we see that

    p(A)r = γ_n ( ∏_{j=0}^{n} (A − ξ_j I) ) r = 0,

and consider evaluating the middle expression from right to left, e.g., by considering the sequence {r_j}_{j=0}^{n+1}, where r_0 ≡ r and
Remark A.2.2. With just a few more lemmas at our disposal, we can actually
prove that, if two matrices commute, they share a common eigenvector. In
fact, this is true for any family of commuting matrices [74], but we will simply
stick to pairs of commuting matrices.
Lemma A.6. If a subspace W ⊂ Cn is invariant under A ∈ Mn , then there
exists an eigenvector of A within W.
Proof. Let W be a matrix whose columns form a minimal basis for the in-
variant subspace W. Then, for each w ∈ W, we may express w as a linear
combination of the columns of W , i.e., w = W y. Since W is invariant, each
column of AW therefore lies within W, and so we may find a matrix Z such
that
AW = W Z.
W Zy = W (λy) = AW y,
we may move a single power of B from the right to the left side of A as many
times as we like, which shows that A commutes with every monomial of B,
say B k . Since polynomials are merely linear combinations of monomials, the
result follows.
Lemma A.8. If, given some matrix A ∈ Mn and vector x ∈ Cn , the vectors
{x, Ax, . . . , Ak−1 x} are linearly dependent, then Kk (A, x) is invariant under
A.
Proof. Since the result is trivial for k = 1, 2, suppose that we have shown the result for k < j. If {x, Ax, . . . , A^{j−1} x} is linearly dependent, then there are two possibilities:
• {x, Ax, . . . , Aj−2 x} is linearly dependent, which implies that Kj−1 (A, x)
is invariant, which shows that Kk (A, x) is invariant for all k ≥ j, or
such matrices. We will then show that this decomposition provides a significant
amount of insight into the spectrum of A.
Proof. Since T − λI has a zero diagonal entry if and only if T has a diagonal
value equal to λ, we need only show that a triangular matrix is singular if and
only if it has a zero diagonal entry.
Now suppose that the j'th diagonal entry of T is zero. Then the first j columns of T lie within the (j − 1)-dimensional subspace span {e_0, e_1, . . . , e_{j−2}}, and thus, by Lemma A.1, they cannot be linearly independent. We may
therefore define a vector x, which is only nonzero in its first j components
such that T x = 0.
where the square-root can be interpreted in the usual sense because (x, x) ≥ 0,
with equality if and only if x = 0. The two-norm is perhaps the most natural
way to define the length of a vector.
    ‖A‖_2 = max_{x ∈ C^n, ‖x‖_2 = 1} ‖Ax‖_2,        (A.3.8)
where the maximum can be shown to exist due to the Heine-Borel theo-
rem [106], but we will avoid diving into real analysis in order to explain this
technical detail. This operator norm of A can be geometrically interpreted
as the maximum length of any transformed vector which originated on the
surface of the unit-ball in Cn .
and, by the definition of unitary matrices, kQxk2 = kxk2 . Since the maximum
is only taken over the unit ball, where kxk2 = 1, the result follows.
Proposition A.4. A matrix is unitary if and only if its adjoint is its inverse.
Proposition A.5. Every subspace W ⊂ Cn has an orthonormal basis of size
dim W.
which effectively removes all of the components of q̃k in the directions of the
previous orthonormal set. Since the linear independence of the {wj }d−1
j=0 basis
W ⊥ = {x ∈ Cn : ∀ w ∈ W, (w, x) = 0}.
Proposition A.7. For any subspace W ⊂ Cn ,
dim W + dim W ⊥ = n,
W ⊕ W ⊥ = Cn .
Since each zi ∈ W ⊥ , (zi , wj ) = (wj , zi ) = 0 for every pair (i, j). Because
each of these sets is orthonormal, the combined set is orthonormal as well (and,
of course, also linearly independent). This implies that dim W ⊥ ≤ n − d.
and

    z = Σ_{j=0}^{n−d−1} z_j (z_j, x),
In order to show uniqueness, let x = ŷ + ẑ be another such decomposi-
tion, so that y − ŷ = ẑ − z. Since the left side of the equation lies in W, and
the right side lies in W ⊥ , it can only be satisfied when both y − ŷ and z − ẑ
are zero.
Proof. If we define W = span x, then Lemma A.7 shows that there exists
an orthonormal basis of dimension n − 1 for x⊥ , which implies that, when
combined with x, is an orthonormal basis for all of Cn .
where Q ∈ Mm is unitary and R ∈ Mm,n is zero in every entry (i, j) such that
i > j (and is therefore referred to as quasi upper-triangular).
Remark A.3.1. The following theorem is being presented in somewhat of a
different context than normal. In particular, please keep in mind that we have
only proven that the spectrum of A ∈ Mn is non-empty. However, once we es-
tablish the following theorem (via an admittedly tedious inductive argument),
we will immediately be able to say much more about the spectrum of A.
where the zero in the bottom-left of the last term is due to the orthonormality of the columns of Q_0. We can of course rewrite this equation as

    A = Q_0 [ λ   r_0
              0   B_0 ] Q_0^H,
and that U unitary implies that [ I  0 ; 0  U ] is also unitary in order to complete the proof.
A = QSQH , B = QT QH ,
Proof. We may essentially repeat the proof of Theorem A.10, but, at each step,
we may apply Theorem A.9 instead of Theorem A.5 in order to simultaneously
reduce both matrices to triangular form.
where U ∈ Mm and V ∈ Mn are unitary, and Σ ∈ Mm,n is at most nonzero in
the first min{m, n} entries of its main diagonal, and these non-negative entries
are called the singular values. The columns of U and V are respectively referred
to as the left and right singular vectors. We will denote the j’th diagonal value
of Σ by σj , where 0 ≤ j < min{m, n}.
and an inductive technique essentially identical to that of Theorem A.10 com-
pletes the proof.
The dimension of the range is referred to as the rank of the matrix, while the
dimension of the kernel is called its nullity.
Corollary A.14 (rank-nullity). The dimensions of the range and null space
of a matrix A ∈ Mn add up to n, and, in particular, given a singular value
decomposition A = U ΣV H ,
and
ker A = ker ΣV H . (A.3.11)
That is to say, the rank is equal to the number of nonzero singular values, and
the nullity is equal to n minus this number.
Proof. The result is essentially immediate now that we have the existence of
the SVD: the number of linearly independent columns in U Σ is the number of
nonzero singular values, and a similar argument holds for ΣV H .
Proposition A.15 (consistency of two-norm). The operator two-norm is con-
sistent.
AAH = AH A.
we see that
These observations justify the name spectral theorem.
Proof. Suppose that A is normal. Then A commutes with its adjoint, and
Theorem A.12 shows that we may find a simultaneous Schur decomposition of
A and AH , say A = QSQH and AH = QT QH , where Q is unitary and S and
T are both upper-triangular. Then, we may equate the adjoint of the Schur
decomposition of A with the Schur decomposition of AH in order to find that
S H = T , where S H is lower-triangular and T is upper-triangular. But, this is
only possible when both S and T are diagonal, so that S = T and A = QSQH
is in fact a spectral decomposition of A.
A = AH .
Since every matrix commutes with itself, every Hermitian matrix is normal.
Proof. Since A is normal, we know that A = QΛQH for some unitary Q and
diagonal Λ (consisting of the eigenvalues of A), but we do not yet know that
Λ is real. But this is easily shown since A = AH implies that Λ = ΛH , and
since Λ is diagonal, its values must also be real.
Remark A.4.1. Given any matrix A ∈ Mm,n , its singular value decompo-
sition, A = U ΣV H , can be interpreted in terms of the Hermitian eigenvalue
decompositions of AAH and AH A. In particular,
AH A = (U ΣV H )H (U ΣV H ) = V (ΣH Σ)V H ,
which shows that the right singular vectors, V , are the eigenvectors of AH A,
and the associated eigenvalues lie along the diagonal of ΣH Σ. Due to the
structure of Σ, the first min{m, n} eigenvalues are equal to σj2 , while the
remaining n − min{m, n} eigenvalues are equal to zero. Likewise,
AAH = U (ΣΣH )U H ,
which also has σj2 as its j’th eigenvalue, for 0 ≤ j < min{m, n}, and the
remaining m − min{m, n} eigenvalues are zero.
Example A.5.1. Unfortunately not all square matrices have eigenspaces which span their vector space. Consider the matrix

    T = [ 0  1
          0  0 ],

which, because it is upper triangular, has the spectrum σ(T) = {0}. It is easy to see that its only eigenspace is

    W_0 = span { [ 1
                   0 ] },

which clearly does not span C^2. Thankfully we may loosen our definition of an eigenspace in order to span the entire vector space.
(A − λI)k x = 0.
Proposition A.2 ([14]). For any A ∈ Mn and λ ∈ σ(A),

    G_λ = ker (A − λI)^n.

Suppose that

    Σ_{j=0}^{k−1} α_j (A − λI)^j x = 0.
and, due to the invariance of W under A, QQ^H AQ = AQ, and Q^H AQz = 0 if and only if AQz = 0, so we may write

    ( Σ_{j=0}^{k} binom(k, j) (−λ)^j A^{k−j} ) Qy = 0.

    (A − λI)^k (Qy) = 0,

    R_A(Q) = Q^H AQ
Proof. We need only show that ran (A − λI)^n ∩ ker (A − λI)^n = {0} in order to have the result follow from the rank-nullity theorem. So, suppose x ∈ C^n is in both the range and kernel of (A − λI)^n. Because x is in the range of (A − λI)^n, there is some y such that (A − λI)^n y = x. But then (A − λI)^{2n} y = (A − λI)^n x = 0, where the last equality follows since x is in the kernel of (A − λI)^n. But we have then shown that y ∈ G_λ(A), which implies that (A − λI)^n y = 0 by Proposition A.2. But then x = 0, which shows the result.
Theorem A.5 ([14]). The generalized eigenvectors of a matrix A ∈ Mn span
Cn .
Proof. Let x be an arbitrary member of C^n, which, by Theorem A.6, we may decompose as

    x = Σ_{λ ∈ σ(A)} g_λ,

where g_λ ∈ G_λ. Then

    p_A(A)x = ∏_{λ ∈ σ(A)} (A − λI)^{dim G_λ} x,

and each of the matrices in the product commutes. Thus, for each component g_λ of x,

    p_A(A)g_λ = ∏_{ξ ∈ σ(A)} (A − ξI)^{dim G_ξ} g_λ,
span Cn , Corollary A.7 shows that A is diagonalizable if and only if the geo-
metric multiplicity of every eigenvalue is equal to its algebraic multiplicity.
Remark A.5.1. We have already shown that all normal matrices are diago-
nalizable (in fact, with a unitary eigenvector matrix).
p(A) = Xp(Λ)X −1 .
A.6 Summary
A compact introduction to the classical linear algebra theorems needed
for discussing Krylov subspace methods has been given, with an emphasis
placed upon using basic dimensional arguments (in the spirit of [14]) as a replacement for cumbersome arguments based upon determinants. The following appendix makes use of these theorems in order to provide a rigorous introduction to Krylov subspace methods and the Generalized Minimum Residual method (GMRES), which are both fundamental prerequisites for Chapters 3 and 4.
Appendix B
columns span a Krylov subspace, it is often the case that the columns of AV
are approximately contained in the same Krylov subspace. It will be helpful
to introduce a few more concepts before attempting to clarify this statement
and its important implications.
It is worth noting that this appendix draws heavily from the content
and style of [119], [107], and [122]. In particular, the focus on Rayleigh quo-
tients and Krylov decompositions is due to [119], and much of the Chebyshev
polynomial approximation discussion is drawn from [107].
    R_A(v) = (v^H A v)/(v^H v),

    R_A(V) = V^H A V.
Definition B.1.1 (projection matrix). The projection of a vector x ∈ Cn onto
the space spanned by the (orthonormal) columns of a matrix V is given by
    P_V x = V V^H x,

    P_{V^⊥} x = (I − V V^H)x.

    AV = P_V AV + P_{V^⊥} AV = V Z + P_{V^⊥} AV.        (B.1.2)
We can thus characterize how close V is to being an invariant subspace through the size of the residual matrix,

    E ≡ P_{V^⊥} AV = (I − V V^H)AV = AV − V R_A(V).        (B.1.3)

Given any eigenpair of the Rayleigh quotient, say Zy = λy, we see that

    AV y = λV y + Ey,        (B.1.4)

    ÂV y = λV y,        (B.1.5)

where we have simply made use of the fact that V^H V = I in order to rearrange Equation (B.1.4). Since ‖A − Â‖_2 = ‖EV^H‖_2 = ‖E‖_2, we see that, when V is close to an invariant subspace of A, Ritz pairs are eigenpairs of a matrix which is close to A.
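A small numerical illustration of this statement (not drawn from the dissertation's software): one nearby matrix with the stated property is Â = A − EV^H, since then ÂV = VZ exactly, so every Ritz pair of V is an exact eigenpair of Â.

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5
A = rng.standard_normal((n, n))
V, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns
Z = V.T @ A @ V                                    # Rayleigh quotient R_A(V)
E = A @ V - V @ Z                                  # residual matrix
A_hat = A - E @ V.T                                # assumed choice of A-hat
lam, Y = np.linalg.eig(Z)
for j in range(k):
    ritz_vec = V @ Y[:, j]
    assert np.allclose(A_hat @ ritz_vec, lam[j] * ritz_vec)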
Proof. By definition,

    K_{k+1} = span {K_k, A^k r},

and thus

    P_{K_k^⊥}(K_{k+1}) = span P_{K_k^⊥}(A^k r).

    P_{V^⊥} AV = vw^H,

    AV = V Z + vw^H,
Proof. Equation (B.1.2) shows that, for any matrix V with orthonormal columns,
we may decompose AV as V RA (V ) + PV ⊥ AV . Lemma B.2 exploits the fact
that span V is a Krylov subspace in order to yield the necessary rank-one
decomposition of the second term.
The “orthonormal” qualifier hints at the fact that, with a proper gen-
eralization of the Rayleigh quotient, we may lift the requirement that V must
have orthonormal columns [119]. However, for the purposes of this disserta-
tion, orthonormal Krylov decompositions are more than sufficient.
Theorem B.3 (Arnoldi decomposition [8]). If a matrix V consists of k orthonormal columns satisfying Equation (B.3.6), then its image AV can be decomposed as

    AV = V H + v(βe_k)^H,
Proof. We can again apply Equation (B.1.2) to show that, for any matrix V
with orthonormal columns, we may decompose AV as V RA (V ) + PV ⊥ AV .
Lemma B.1 then shows that RA (V ) is upper Hessenberg, and Lemma B.2
shows that only the last column of PV ⊥ AV can be nonzero. If we label this
last column as u, then, when u = 0, we may set v = 0 and β = 0, otherwise,
we may set v = u/kuk2 and β = kuk2 .
    AV = V T + v(βe_k)^H,
Proof. We may simply combine Theorem B.3 with the fact that, if A is Hermi-
tian, RA (V ) = V H AV must also be Hermitian. Since Hermitian Hessenberg
matrices are tridiagonal, we have finished the proof.
B.4 Introduction to FOM and GMRES
Suppose that we are interested in the solution of the nonsingular system
of equations
Ax = b,
shows that, for any b ∈ span V , A−1 b = V Z −1 V H b ∈ span V . A : span V →
span V is therefore surjective.
Proof. We have only to show that Kk (A, b) = Kk+1 (A, b) implies that Kk (A, b)
is invariant under A. This fact can easily be seen since AKk (A, b) ⊂ Kk+1 (A, b) =
Kk (A, b).
Remark B.4.1. Given any vector x0 , we can instead apply FOM to the linear
system
A(x − x0 ) = b − Ax0 ,
variant. We will unfortunately now see that the FOM can fail catastrophically
when span V is not invariant.
Example B.4.1. Let A = [ 0  1 ; 1  0 ] and b = [ 1 ; 0 ], and set V = b so that V has a single orthonormal column which spans the Krylov subspace K_1(A, b). Even though A is both nonsingular and Hermitian, the Rayleigh quotient R_A(V) = b^H Ab is identically zero.
Despite the fact that FOM can completely fail on trivial problems, it
turns out that a relatively minor modification results in an extremely robust
algorithm. We will now show that, even though the inverse of the Rayleigh
quotient can fail to exist, a unique solution can always be found in the least-
squares sense.
Lemma B.3. If dim Kk+1 (A, b) = k + 1 and span V = Kk (A, b), then the
matrix AV has full column-rank.
Proof. Because Kk+1 (A, b) = span{b, AV }, the dimension of the column space
of AV is at least dim Kk+1 (A, b) − 1 = k, and thus rank(AV ) = k.
k + 1, and so Lemma B.3 shows that the matrix AV has full column rank,
which implies that a unique least-squares solution can be found, e.g., through
a QR decomposition of AV .
    AV y − b = (V H + v(βe_k)^H)y − V(βe_1).
Remark B.4.2. Just like FOM, GMRES can easily incorporate an initial
guess x0 by first finding
Definition B.4.3 (Minimum residual method (MINRES) [94]). Given a Her-
mitian nonsingular linear system, Ax = b, the label MINRES applies to any
method which exploits a Lanczos decomposition, say AV = V T + v(βek )H , to
find the minimizer
Mk = {p ∈ Pk : p(0) = 1}
Proof. The left term immediately follows from the definition of GMRES and
the fact that the search space is Kk (A, b) = Pk−1 (A)b. The right term follows
from the fact that p ∈ Pk−1 if and only if 1 + xp(x) ∈ Mk .
Proof. The first equality only requires that, for any polynomial p, p(A) =
p(W ΛW −1 ) = W p(Λ)W −1 , which was shown in Proposition A.10. The in-
equality relies on the consistency of the matrix two-norm (Proposition A.15),
i.e.,
Proof. We have made use of two elementary facts about unitary matrices in
order to improve upon Corollary B.6: κ2 (Q) = 1, and kQxk2 = kxk2 for any
vector x.
Remark B.4.3. The previous bounds show that the residual norms produced
by GMRES are at least weakly controlled by the conditioning of the eigenvec-
tors and the distribution of the eigenvalues. Before we begin to discuss the
important subject of preconditioning, which, in the context of GMRES, loosely
corresponds to mitigating the effects of these two terms, it is important that
we touch on how to replace the kmkσ(A) term with something more concrete.
The main idea is that, if the spectrum is known to be contained within some
well-defined region, then we may propose a monic polynomial, evaluate its
maximum magnitude over the region of interest, and then use this value as an
upper-bound for the minimum value over all polynomials.
Example B.4.2. Suppose that A = I − γU , where |γ| < 1 and U is an ar-
bitrary unitary matrix with spectral decomposition QΛQH . Then each eigen-
value of U lies on the unit circle, and, since A = Q(I − γΛ)QH , the eigenvalues
of A lie on the circle of radius |γ| centered at 1. It is then natural to investigate
the candidate polynomial

    m(z) = (1 − z)^k ∈ M_k,
Remark B.4.4. It turns out that, as the spectrum becomes dense on the
boundary of the circle, the candidate polynomial from the last example be-
comes optimal. This can easily be shown using the following theorem due to
S. Bernstein, which we state without proof.
Theorem B.8 (S. Bernstein [6, 20]). If p ∈ P_k and ‖p‖_{C_1(0)} = 1, then ‖p‖_{C_R(0)} ≤ R^k for R ≥ 1, with equality only if p(z) = γz^k, with |γ| = 1.
    min_{m ∈ M_k} ‖m‖_{C_r(z_0)} = r^k/|z_0|^k,
Proof. Showing that this value is attainable simply requires a rescaling of the previous example. In particular, we can define the monic polynomial

    m(z) = (1/z_0^k)(z_0 − z)^k ∈ M_k,
which, for any z ∈ C_r(z_0), evaluates to r^k/|z_0|^k.

    ‖m̃‖_{C_r(z_0)} = η^k/|z_0|^k < r^k/|z_0|^k.

    p(z) = (|z_0|^k/η^k) m̃(zr + z_0),

    p(−z_0/r) = (|z_0|^k/η^k) m̃(0) = |z_0|^k/η^k > R^k,

    min_{m ∈ M_k} ‖m‖_{D̄_r(z_0)} = r^k/|z_0|^k,
Proof. Since C_r(z_0) ⊂ D̄_r(z_0) and (z_0 − z)^k/z_0^k reaches its extrema on the boundary of D_r(z_0), (z_0 − z)^k/z_0^k is also optimal over the entire closed disc D̄_r(z_0).
Remark B.4.5. The optimal polynomial only depends upon the center, and
not the radius, of the disc.
While the optimal monic polynomial is easily found for the case where
the spectrum is only known to be contained within a particular circle in the
complex plane which does not contain the origin, there are several other cases
of significant interest. In particular, the following well-known theorem [51],
due to Flanders and Shortley, but apparently inspired by Tukey and Grosch,
can be immediately used to show that monic Chebyshev polynomials are the
optimally small monic polynomials over intervals of the real line which are
separated from the origin.
Corollary B.12. Let [α, β] be an interval of the real line which does not contain the origin. Then

    min_{m ∈ M_k} ‖m‖_{[α,β]} = 1 / T_k((α + β)/(β − α)),

and the unique minimizing monic polynomial is T_k((α + β − 2z)/(β − α)) / T_k((α + β)/(β − α)).

Corollary B.13. Let i[α, β] be an interval of the imaginary axis which does not contain the origin. Then

    min_{m ∈ M_k} ‖m‖_{i[α,β]} = 1 / T_k((α + β)/(β − α)),

and the unique minimizing polynomial is T_k((α + β + 2iz)/(β − α)) / T_k((α + β)/(β − α)).
Proof. The proof is essentially identical to that of Corollary B.12, but we
instead set µ = i(α + β)/2 and w = i(β − α)/2.
and it satisfies

    ‖m̃‖_ε = T_k(a/d) / T_k(z_0/d),
The reason is that the right-hand side, b, may only have components in the
directions of a handful of eigenvectors of A, and a monic polynomial can be
constructed with its roots at the appropriate eigenvalues such that m(A)b is
exactly zero.
Proof. Let {λ_j}_{j=1}^{r} be the eigenvalues corresponding to the eigenvectors which comprise b and set

    m(z) = ∏_{j=1}^{r} (λ_j − z)/λ_j ∈ M_r.

Then

    m(A)b = ( ∏_{j=1}^{r} λ_j^{−1} ) (A − λ_1 I) · · · (A − λ_r I) b.

If we express b as a linear combination of the relevant eigenvectors, say

    b = Σ_{j=1}^{r} γ_j y_j,

shows that

    m(A)b = ( ∏_{j=1}^{r} λ_j^{−1} ) Σ_{j=1}^{r} ∏_{i=1}^{r} (λ_j − λ_i) γ_j y_j = 0,

since each term ∏_{i=1}^{r} (λ_j − λ_i) γ_j y_j is identically zero.
B.5 Implementing GMRES
A completely naïve implementation of GMRES would essentially execute the following steps:
7. stop if the residual is small enough, or, otherwise, repeat the entire
process with a larger value for k.
• if we choose to increase k in order to use a larger Krylov subspace, then the previous work need not be thrown away,
B.6 Summary
A brief introduction to the theory and implementation of the Gener-
alized Minimum Residual (GMRES) method was provided in order to serve
as a reference to those without extensive experience implementing iterative
Algorithm B.1: Pseudocode for a simple implementation of preconditioned GMRES(k)
Input: Nonsingular matrix, A, preconditioner, M, right-hand side, b, and initial guess, x0
Output: Estimate, x, for the solution, A^{-1}b
1   x := x0
2   r := b − Ax, ρ := ‖r‖_2
3   while ρ/‖b‖_2 ≥ tolerance do
      // Run GMRES for k steps with x as initial guess
4     x0 := x, H := zeros(k + 1, k)
5     w := M^{-1} r
6     β := ‖w‖_2, v_0 := w/β
7     for j = 0 : k − 1 do
8       d := A v_j, δ := ‖d‖_2
9       d := M^{-1} d
        // Run j'th step of Arnoldi
10      for i = 0 : j do
11        H(i, j) := v_i^H d
12        d := d − H(i, j) v_i
13      end
14      H(j + 1, j) = δ := ‖d‖_2
15      if j + 1 ≠ k then v_{j+1} := d/δ
        // Solve for residual minimizer
16      y := arg min_z ‖H(0 : j + 1, 0 : j) z − βe_0‖_2
        // Form next iterate and residual
17      x := x0 + V_j y
18      r := b − Ax, ρ := ‖r‖_2
19      if ρ/‖b‖_2 < tolerance then break
20    end
21  end
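For readers who prefer executable code, the following Python transcription follows the structure of Algorithm B.1 with dense numpy operands; the preconditioner is passed as a function applying M^{-1}, and the test problem at the bottom is an arbitrary, well-conditioned example rather than anything from this dissertation's experiments.

import numpy as np

def gmres_restarted(A, Minv, b, x0, k=30, tol=1e-5, max_restarts=50):
    x = x0.copy()
    for _ in range(max_restarts):
        r = b - A @ x
        if np.linalg.norm(r) / np.linalg.norm(b) < tol:
            break
        x0 = x.copy()
        w = Minv(r)
        beta = np.linalg.norm(w)
        V = np.zeros((len(b), k + 1), dtype=complex)
        H = np.zeros((k + 1, k), dtype=complex)
        V[:, 0] = w / beta
        for j in range(k):
            d = Minv(A @ V[:, j])
            for i in range(j + 1):                       # Arnoldi step
                H[i, j] = np.vdot(V[:, i], d)
                d = d - H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(d)
            if j + 1 != k and H[j + 1, j] > 0:
                V[:, j + 1] = d / H[j + 1, j]
            # y := arg min || H(0:j+1, 0:j) y - beta e_0 ||_2
            e0 = np.zeros(j + 2, dtype=complex); e0[0] = beta
            y, *_ = np.linalg.lstsq(H[:j + 2, :j + 1], e0, rcond=None)
            x = x0 + V[:, :j + 1] @ y
            r = b - A @ x
            if np.linalg.norm(r) / np.linalg.norm(b) < tol:
                break
    return x

# Usage on a small, diagonally dominant test problem with a Jacobi preconditioner
n = 100
A = np.diag(np.linspace(1.0, 10.0, n)) + 0.001 * np.random.rand(n, n)
b = np.random.rand(n)
x = gmres_restarted(A, lambda v: v / np.diag(A), b, np.zeros(n))
assert np.linalg.norm(b - A @ x) / np.linalg.norm(b) < 1e-5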
methods. The author hopes that this appendix will serve to dispel some of the mystique surrounding GMRES(k), as it is surprisingly straightforward to implement, both sequentially and in parallel.
Appendix C
Several algorithms are listed in this appendix for the sole purpose of
providing an anchor for the (compressed) multifrontal techniques discussed in
Chapters 2 and 4. The following fundamental dense linear algebra operations
are covered:
• matrix-matrix multiplication,
• triangular solves,
• triangular inversion.
The notation used within the following algorithms is in the style of
FLAME [60] and Elemental [99]. In addition, the operations described here
are essentially all implemented as part of the Elemental library, and thorough
benchmarks are provided for several of these operations within [99] and [109],
as well as thorough discussions of the underlying notation (which is closely
followed by the actual implementations). In most cases, the progression follows
that of LAPACK [5], ScaLAPACK [31], and PLAPACK [2], which is to build
an “unblocked” implementation for small sequential problems, use this at the
core of a “blocked” algorithm for larger sequential problems, and then to embed
this blocked algorithm within a distributed-memory algorithm.
Algorithm C.3: Distributed dense matrix-matrix multiplication with many right-hand sides
Input: Scalar, α, m × k matrix, A, k × n matrix, B, scalar, β, m × n matrix, C, and a blocksize, n_b. A, B, and C are each in 2D distributions.
Output: C := αAB + βC
1   C := βC
2   b := min{k, n_b}
3   A → [A_1  A_2] and B → [B_1 ; B_2], where A_1 is m × b and B_1 is b × n
    // C := C + αA_1 B_1
4   A_1[MC, ?] ← A_1[MC, MR]
5   B_1[?, MR] ← B_1[MC, MR]
6   C[MC, MR] := C + αA_1[MC, ?] B_1[?, MR]
7   if k > n_b then Recurse(α, A_2, B_2, 1, C, n_b)
Algorithm C.5: Unblocked dense triangular solve
Input: n × n lower-triangular matrix L and an n × k matrix X
Output: X := L^{-1} X
1   L → [λ_{1,1}  0 ; ℓ_{2,1}  L_{2,2}] and X → [x_1 ; X_2], where λ_{1,1} is 1 × 1 and x_1 is 1 × k
2   x_1 := x_1 / λ_{1,1}
3   if n > 1 then
4     X_2 := X_2 − ℓ_{2,1} x_1
5     Recurse(L_{2,2}, X_2)
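An illustrative numpy transcription of this unblocked solve (it performs the same recursion as a simple loop over rows; not the Elemental kernel):

import numpy as np

def lower_solve_inplace(L, X):
    n = L.shape[0]
    for j in range(n):
        X[j, :] /= L[j, j]                               # x_1 := x_1 / lambda_{1,1}
        X[j + 1:, :] -= np.outer(L[j + 1:, j], X[j, :])  # X_2 := X_2 - l_{2,1} x_1
    return X

L = np.tril(np.random.rand(5, 5)) + 5 * np.eye(5)
B = np.random.rand(5, 3)
X = lower_solve_inplace(L, B.copy())
assert np.allclose(L @ X, B)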
Algorithm C.7: Dense triangular solve with few right-hand sides (1D distribution)
Input: n × n lower-triangular matrix, L, an n × k matrix, X, and a blocksize, n_b. L and X are both in 1D distributions.
Output: X := L^{-1} X
1   b := min{n, n_b}
2   L → [L_{1,1}  0 ; L_{2,1}  L_{2,2}] and X → [X_1 ; X_2], where L_{1,1} is b × b and X_1 is b × k
    // X_1 := L_{1,1}^{-1} X_1
3   X_1[∗, ∗] ← X_1[VC, ∗]
4   L_{1,1}[∗, ∗] ← L_{1,1}[VC, ∗]
5   X_1[∗, ∗] := L_{1,1}^{-1}[∗, ∗] X_1[∗, ∗] via Algorithm C.6
6   X_1[VC, ∗] ← X_1[∗, ∗]
7   if n > n_b then
      // X_2 := X_2 − L_{2,1} X_1
8     X_2[VC, ∗] := X_2[VC, ∗] − L_{2,1}[VC, ∗] X_1[∗, ∗]
9     Recurse(L_{2,2}, X_2)
Algorithm C.8: Dense triangular solve with few right-hand sides (2D distribution)
Input: n × n lower-triangular matrix, L, an n × k matrix, X, and a blocksize, n_b. L is in a 2D distribution, but X is in a 1D distribution. An n × k matrix, Ẑ, initially filled with zeros and in an [MC, ∗] distribution, should also be passed in.
Output: X := L^{-1} X
1   b := min{n, n_b}
2   L → [L_{1,1}  0 ; L_{2,1}  L_{2,2}], X → [X_1 ; X_2], and Ẑ → [Ẑ_1 ; Ẑ_2], where L_{1,1} is b × b and X_1 and Ẑ_1 are b × k
    // Combine and apply all previous partial updates to X_1
3   X_1[VC, ∗] := X_1[VC, ∗] + SumScatter(Ẑ_1[MC, ∗])
    // X_1 := L_{1,1}^{-1} X_1
4   X_1[∗, ∗] ← X_1[VC, ∗]
5   L_{1,1}[∗, ∗] ← L_{1,1}[MC, MR]
6   X_1[∗, ∗] := L_{1,1}^{-1}[∗, ∗] X_1[∗, ∗] via Algorithm C.6
7   X_1[VC, ∗] ← X_1[∗, ∗]
8   if n > n_b then
      // Add portion of −L_{2,1} X_1 onto past partial updates
9     X_1[MR, ∗] ← X_1[∗, ∗]
10    Ẑ_2[MC, ∗] := Ẑ_2[MC, ∗] − L_{2,1}[MC, MR] X_1[MR, ∗]
11    Recurse(L_{2,2}, X_2, Ẑ_2)
Algorithm C.9: Distributed dense triangular solve with many right-hand sides
Input: n × n lower-triangular matrix, L, an n × k matrix, X, and a blocksize, n_b. L and X are both in 2D distributions.
Output: X := L^{-1} X
1   b := min{n, n_b}
2   L → [L_{1,1}  0 ; L_{2,1}  L_{2,2}] and X → [X_1 ; X_2], where L_{1,1} is b × b and X_1 is b × k
    // X_1 := L_{1,1}^{-1} X_1
3   X_1[∗, VR] ← X_1[MC, MR]
4   L_{1,1}[∗, ∗] ← L_{1,1}[MC, MR]
5   X_1[∗, VR] := L_{1,1}^{-1}[∗, ∗] X_1[∗, VR] via Algorithm C.6
6   if n > n_b then
      // X_2 := X_2 − L_{2,1} X_1 and delayed write of X_1
7     L_{2,1}[MC, ∗] ← L_{2,1}[MC, MR]
8     X_1[∗, MR] ← X_1[∗, VR]
9     X_2[MC, MR] := X_2[MC, MR] − L_{2,1}[MC, ∗] X_1[∗, MR]
10    X_1[MC, MR] ← X_1[∗, MR]
11    Recurse(L_{2,2}, X_2)
Algorithm C.11: Dense adjoint triangular solve
Input: n × n lower-triangular matrix L, an n × k matrix X, and a blocksize, n_b
Output: X := L^{-H} X
1   b := min{n, n_b}
2   L → [L_{0,0}  0 ; L_{1,0}  L_{1,1}] and X → [X_0 ; X_1], where L_{1,1} is b × b and X_1 is b × k
3   X_1 := L_{1,1}^{-H} X_1 via Algorithm C.10
4   if n > n_b then
5     X_0 := X_0 − L_{1,0}^H X_1
6     Recurse(L_{0,0}, X_0)
Algorithm C.13: Distributed dense adjoint triangular solve with many right-hand sides
Input: n × n lower-triangular matrix L, an n × k matrix X, and a blocksize, n_b. L and X are both in a 2D distribution.
Output: X := L^{-H} X
1   b := min{n, n_b}
2   L → [L_{0,0}  0 ; L_{1,0}  L_{1,1}] and X → [X_0 ; X_1], where L_{1,1} is b × b and X_1 is b × k
    // X_1 := L_{1,1}^{-H} X_1
3   X_1[∗, VR] ← X_1[MC, MR]
4   L_{1,1}[∗, ∗] ← L_{1,1}[MC, MR]
5   X_1[∗, VR] := L_{1,1}^{-H}[∗, ∗] X_1[∗, VR] via Algorithm C.10
6   if n > n_b then
      // X_0 := X_0 − L_{1,0}^H X_1 and delayed write of X_1
7     L_{1,0}[∗, MC] ← L_{1,0}[MC, MR]
8     X_1[∗, MR] ← X_1[∗, VR]
9     X_0[MC, MR] := X_0[MC, MR] − L_{1,0}^H[MC, ∗] X_1[∗, MR]
10    X_1[MC, MR] ← X_1[∗, MR]
11    Recurse(L_{0,0}, X_0)
Algorithm C.14: Dense adjoint triangular solve (lazy variant)
Input: n × n lower-triangular matrix L, an n × k matrix X, and a blocksize, n_b
Output: X := L^{-H} X
1   L → [L_{TL}  0 ; L_{BL}  L_{BR}] and X → [X_T ; X_B], where L_{BR} is 0 × 0 and X_B is 0 × k
2   while size(L_{BR}) ≠ size(L) do
3     b := min{height(L_{TL}), n_b}
4     [L_{TL}  0 ; L_{BL}  L_{BR}] → [L_{0,0}  0  0 ; L_{1,0}  L_{1,1}  0 ; L_{2,0}  L_{2,1}  L_{2,2}] and [X_T ; X_B] → [X_0 ; X_1 ; X_2], where L_{1,1} is b × b and X_1 is b × k
5     X_1 := X_1 − L_{2,1}^H X_2
6     X_1 := L_{1,1}^{-H} X_1 via Algorithm C.10
7     [L_{TL}  0 ; L_{BL}  L_{BR}] ← [L_{0,0}  0  0 ; L_{1,0}  L_{1,1}  0 ; L_{2,0}  L_{2,1}  L_{2,2}], and [X_T ; X_B] ← [X_0 ; X_1 ; X_2]
8   end
Algorithm C.15: Dense adjoint triangular solve with few right-hand sides (lazy variant, 1D distribution)
Input: n × n lower-triangular matrix, L, an n × k matrix, X, and a blocksize, n_b
Output: X := L^{-H} X
1   L → [ L_{TL} 0 ; L_{BL} L_{BR} ] and X → [ X_T ; X_B ], where L_{BR} is 0 × 0 and X_B is 0 × k
2   while size(L_{BR}) ≠ size(L) do
3     b := min{height(L_{TL}), n_b}
4     [ L_{TL} 0 ; L_{BL} L_{BR} ] → [ L_{0,0} 0 0 ; L_{1,0} L_{1,1} 0 ; L_{2,0} L_{2,1} L_{2,2} ] and [ X_T ; X_B ] → [ X_0 ; X_1 ; X_2 ], where L_{1,1} is b × b and X_1 is b × k
      // X_1 := X_1 − L_{2,1}^H X_2
5     Ẑ_1[∗, ∗] := −L_{2,1}^H[∗, V_C] X_2[V_C, ∗]
6     X_1[V_C, ∗] := X_1[V_C, ∗] + SumScatter(Ẑ_1[∗, ∗])
      // X_1 := L_{1,1}^{-H} X_1
7     X_1[∗, ∗] ← X_1[V_C, ∗]
8     L_{1,1}[∗, ∗] ← L_{1,1}[V_C, ∗]
9     X_1[∗, ∗] := L_{1,1}^{-H}[∗, ∗] X_1[∗, ∗] via Algorithm C.10
10    X_1[V_C, ∗] ← X_1[∗, ∗]
11    [ L_{TL} 0 ; L_{BL} L_{BR} ] ← [ L_{0,0} 0 0 ; L_{1,0} L_{1,1} 0 ; L_{2,0} L_{2,1} L_{2,2} ], and [ X_T ; X_B ] ← [ X_0 ; X_1 ; X_2 ]
12  end
Algorithm C.16: Unblocked dense right-looking Cholesky
Input: n × n HPD matrix, A
Output: A is overwritten with its lower Cholesky factor, L
1   A → [ α_{1,1} ⋆ ; a_{2,1} A_{2,2} ], where α_{1,1} is a scalar
2   λ_{1,1} := √α_{1,1}
3   if n > 1 then
4     ℓ_{2,1} := a_{2,1} / λ_{1,1}
5     A_{2,2} := A_{2,2} − ℓ_{2,1} ℓ_{2,1}^H
6     Recurse(A_{2,2})
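A direct serial transcription in NumPy, written iteratively rather than recursively (the function name is illustrative, and only the lower triangle of A is referenced):

import numpy as np

def cholesky_unblocked(A):
    # Overwrite a copy of the HPD matrix A with its lower Cholesky factor,
    # one column at a time (right-looking): take the square root of the pivot,
    # scale the column below it, then apply a rank-1 Hermitian update.
    L = np.array(A, dtype=np.complex128)
    n = L.shape[0]
    for j in range(n):
        L[j, j] = np.sqrt(L[j, j].real)                             # lambda11 := sqrt(alpha11)
        L[j+1:, j] /= L[j, j]                                       # l21 := a21 / lambda11
        L[j+1:, j+1:] -= np.outer(L[j+1:, j], L[j+1:, j].conj())    # A22 := A22 - l21 l21^H
        L[j, j+1:] = 0                                              # upper part is never referenced
    return L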
Algorithm C.18: Distributed dense right-looking Cholesky
Input: n × n HPD matrix, A, in a 2D distribution, and a blocksize, n_b
Output: A is overwritten with its lower Cholesky factor, L
1   b := min{n, n_b}
2   A → [ A_{1,1} ⋆ ; A_{2,1} A_{2,2} ], where A_{1,1} is b × b
    // L_{1,1} := Cholesky(A_{1,1})
3   A_{1,1}[∗, ∗] ← A_{1,1}[M_C, M_R]
4   L_{1,1}[∗, ∗] := Cholesky(A_{1,1}[∗, ∗]) via Algorithm C.17
5   L_{1,1}[M_C, M_R] ← L_{1,1}[∗, ∗]
6   if n > n_b then
      // L_{2,1} := A_{2,1} L_{1,1}^{-H}
7     A_{2,1}[V_C, ∗] ← A_{2,1}[M_C, M_R]
8     L_{2,1}[V_C, ∗] := A_{2,1}[V_C, ∗] L_{1,1}^{-H}[∗, ∗]
      // A_{2,2} := A_{2,2} − L_{2,1} L_{2,1}^H and delayed write of L_{2,1}
9     L_{2,1}[M_C, ∗] ← L_{2,1}[V_C, ∗]
10    L_{2,1}[M_R, ∗] ← L_{2,1}[V_C, ∗]
11    A_{2,2}[M_C, M_R] := A_{2,2}[M_C, M_R] − L_{2,1}[M_C, ∗] L_{2,1}^H[∗, M_R]
12    L_{2,1}[M_C, M_R] ← L_{2,1}[M_C, ∗]
13    Recurse(A_{2,2}, n_b)
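Stripped of the redistributions and the duplicated copies of L_{2,1}, this is the familiar blocked right-looking Cholesky factorization. A serial NumPy sketch under those simplifications (not Clique's actual routine; names and blocksize are illustrative):

import numpy as np

def cholesky_blocked(A, nb=64):
    # Blocked right-looking Cholesky: factor the diagonal block, solve for the
    # panel below it, and apply a Hermitian rank-b update to the trailing matrix.
    A = np.array(A, dtype=np.complex128)
    n = A.shape[0]
    for j in range(0, n, nb):
        b = min(nb, n - j)
        A[j:j+b, j:j+b] = np.linalg.cholesky(A[j:j+b, j:j+b])            # L11 := chol(A11)
        if j + b < n:
            L11 = A[j:j+b, j:j+b]
            # L21 := A21 inv(L11)^H, computed as a triangular solve against A21^H
            A[j+b:, j:j+b] = np.linalg.solve(L11, A[j+b:, j:j+b].conj().T).conj().T
            L21 = A[j+b:, j:j+b]
            A[j+b:, j+b:] -= L21 @ L21.conj().T                          # A22 := A22 - L21 L21^H
    return np.tril(A)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 200
    B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    A = B @ B.conj().T + n * np.eye(n)
    Lc = cholesky_blocked(A, nb=32)
    assert np.allclose(Lc @ Lc.conj().T, A)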
Algorithm C.19: Unblocked dense triangular inversion
Input: n × n lower-triangular matrix, L
Output: L is overwritten with its inverse
1   L → [ L_{TL} 0 ; L_{BL} L_{BR} ], where L_{TL} is 0 × 0
2   while size(L_{TL}) ≠ size(L) do
3     [ L_{TL} 0 ; L_{BL} L_{BR} ] → [ L_{0,0} 0 0 ; ℓ_{1,0} λ_{1,1} 0 ; L_{2,0} ℓ_{2,1} L_{2,2} ], where λ_{1,1} is a scalar
4     ℓ_{1,0} := −ℓ_{1,0} / λ_{1,1}
5     L_{2,0} := L_{2,0} + ℓ_{2,1} ℓ_{1,0}
6     ℓ_{2,1} := ℓ_{2,1} / λ_{1,1}
7     λ_{1,1} := 1 / λ_{1,1}
8     [ L_{TL} 0 ; L_{BL} L_{BR} ] ← [ L_{0,0} 0 0 ; ℓ_{1,0} λ_{1,1} 0 ; L_{2,0} ℓ_{2,1} L_{2,2} ]
9   end
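The same sweep in serial NumPy form (the helper name is illustrative). The loop maintains the invariant that, after column j is processed, the leading (j+1) × (j+1) corner holds its own inverse and the block below and to the left of it holds the original L_{BL} times that inverse:

import numpy as np

def tri_inv_unblocked(L):
    # Overwrite a copy of the lower-triangular matrix L with inv(L),
    # expanding the inverted leading corner by one row and column per step.
    L = np.array(L, dtype=np.complex128)
    n = L.shape[0]
    for j in range(n):
        lam = L[j, j]
        L[j, :j] = -L[j, :j] / lam                        # l10 := -l10 / lambda11
        L[j+1:, :j] += np.outer(L[j+1:, j], L[j, :j])     # L20 := L20 + l21 l10
        L[j+1:, j] /= lam                                 # l21 := l21 / lambda11
        L[j, j] = 1.0 / lam                               # lambda11 := 1 / lambda11
    return L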
Algorithm C.20: Dense triangular inversion
Input: n × n lower-triangular matrix, L, and a blocksize, n_b
Output: L is overwritten with its inverse
1   L → [ L_{TL} 0 ; L_{BL} L_{BR} ], where L_{TL} is 0 × 0
2   while size(L_{TL}) ≠ size(L) do
3     b := min{height(L_{BR}), n_b}
4     [ L_{TL} 0 ; L_{BL} L_{BR} ] → [ L_{0,0} 0 0 ; L_{1,0} L_{1,1} 0 ; L_{2,0} L_{2,1} L_{2,2} ], where L_{1,1} is b × b
5     L_{1,0} := −L_{1,1}^{-1} L_{1,0}
6     L_{2,0} := L_{2,0} + L_{2,1} L_{1,0}
7     L_{2,1} := L_{2,1} L_{1,1}^{-1}
8     L_{1,1} := L_{1,1}^{-1} via Algorithm C.19
9     [ L_{TL} 0 ; L_{BL} L_{BR} ] ← [ L_{0,0} 0 0 ; L_{1,0} L_{1,1} 0 ; L_{2,0} L_{2,1} L_{2,2} ]
10  end
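A serial NumPy sketch of this blocked sweep follows, using triangular solves in place of explicit multiplications by L_{1,1}^{-1} where convenient; the function name and blocksize default are illustrative. The distributed variant (Algorithm C.21) interleaves the same four updates with redistributions.

import numpy as np

def tri_inv_blocked(L, nb=64):
    # Overwrite a copy of the lower-triangular matrix L with inv(L),
    # b columns at a time, in the same update order as the blocked sweep above.
    L = np.array(L, dtype=np.complex128)
    n = L.shape[0]
    for j in range(0, n, nb):
        b = min(nb, n - j)
        L11 = L[j:j+b, j:j+b].copy()
        if j > 0:
            L[j:j+b, :j] = -np.linalg.solve(L11, L[j:j+b, :j])             # L10 := -inv(L11) L10
            L[j+b:, :j] += L[j+b:, j:j+b] @ L[j:j+b, :j]                   # L20 := L20 + L21 L10
        if j + b < n:
            L[j+b:, j:j+b] = np.linalg.solve(L11.T, L[j+b:, j:j+b].T).T    # L21 := L21 inv(L11)
        L[j:j+b, j:j+b] = np.linalg.inv(L11)                               # L11 := inv(L11)
    return L

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n = 150
    L0 = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
    assert np.allclose(tri_inv_blocked(L0, nb=40) @ L0, np.eye(n))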
Algorithm C.21: Distributed dense triangular inversion
Input: n × n lower-triangular matrix, L, in a 2D distribution, and a blocksize, n_b
Output: L is overwritten with its inverse
1   L → [ L_{TL} 0 ; L_{BL} L_{BR} ], where L_{TL} is 0 × 0
2   while size(L_{TL}) ≠ size(L) do
3     b := min{height(L_{BR}), n_b}
4     [ L_{TL} 0 ; L_{BL} L_{BR} ] → [ L_{0,0} 0 0 ; L_{1,0} L_{1,1} 0 ; L_{2,0} L_{2,1} L_{2,2} ], where L_{1,1} is b × b
      // L_{1,0} := −L_{1,1}^{-1} L_{1,0}
5     L_{1,0}[∗, V_R] ← L_{1,0}[M_C, M_R]
6     L_{1,1}[∗, ∗] ← L_{1,1}[M_C, M_R]
7     L_{1,0}[∗, V_R] := −L_{1,1}^{-1}[∗, ∗] L_{1,0}[∗, V_R]
      // L_{2,0} := L_{2,0} + L_{2,1} L_{1,0} and delayed write of L_{1,0}
8     L_{2,1}[M_C, ∗] ← L_{2,1}[M_C, M_R]
9     L_{1,0}[∗, M_R] ← L_{1,0}[∗, V_R]
10    L_{2,0}[M_C, M_R] := L_{2,0}[M_C, M_R] + L_{2,1}[M_C, ∗] L_{1,0}[∗, M_R]
11    L_{1,0}[M_C, M_R] ← L_{1,0}[∗, M_R]
      // L_{2,1} := L_{2,1} L_{1,1}^{-1}
12    L_{2,1}[V_C, ∗] ← L_{2,1}[M_C, ∗]
13    L_{2,1}[V_C, ∗] := L_{2,1}[V_C, ∗] L_{1,1}^{-1}[∗, ∗]
14    L_{2,1}[M_C, M_R] ← L_{2,1}[V_C, ∗]
      // L_{1,1} := L_{1,1}^{-1}
15    L_{1,1}[∗, ∗] := L_{1,1}^{-1}[∗, ∗] via Algorithm C.20
16    L_{1,1}[M_C, M_R] ← L_{1,1}[∗, ∗]
17    [ L_{TL} 0 ; L_{BL} L_{BR} ] ← [ L_{0,0} 0 0 ; L_{1,0} L_{1,1} 0 ; L_{2,0} L_{2,1} L_{2,2} ]
18  end
Bibliography
[1] MUltifrontal Massively Parallel Solver Users’ guide: version 4.10.0. http://graal.ens-lyon.fr/MUMPS/doc/userguide_4.10.0.pdf, May 10, 2011.
[2] Philip Alpatov, Greg Baker, Carter Edwards, John A. Gunnels, Greg
Morrow, James Overfelt, Robert A. van de Geijn, and Yuan-Jye J. Wu.
PLAPACK: Parallel Linear Algebra Package – design overview. In
Proceedings of Supercomputing, 1997.
[3] Patrick R. Amestoy, Iain S. Duff, Jacko Koster, and Jean-Yves L’Excellent.
A fully asynchronous multifrontal solver using distributed dynamic schedul-
ing. SIAM Journal on Matrix Analysis and Applications, 23(1):15–41,
2001.
[4] Fred Aminzadeh, Jean Brac, and Tim Kunz. 3-D Salt and Overthrust
models. In SEG/EAGE 3-D Modeling Series 1, Tulsa, OK, 1997. Soci-
ety for Exploration Geophysicists.
[7] Jean-Robert Argand. Essai sur une manière de représenter des quan-
tités imaginaires dans les constructions géométriques. les Annales de
mathématiques pures et appliquées, 4:133–147, 1813.
[10] Cleve Ashcraft, Stanley C. Eisenstat, and Joseph W.-H. Liu. A fan-in
algorithm for distributed sparse numerical factorization. SIAM Journal
on Scientific and Statistical Computing, 11(3):593–599, 1990.
[11] Cleve Ashcraft and Roger G. Grimes. The influence of relaxed su-
pernode partitions on the multifrontal method. ACM Transactions on
Mathematical Software, 15(4):291–309, 1989.
[13] Cleve Ashcraft, Roger G. Grimes, John G. Lewis, Barry W. Peyton, and
Horst D. Simon. Progress in sparse matrix methods for large sparse
linear systems on vector supercomputers. International Journal of Su-
percomputer Applications, 1:10–30, 1987.
[15] Ivo M. Babuška and S. A. Sauter. Is the pollution effect of the FEM
avoidable for the Helmholtz equation considering high wave numbers?
SIAM Review, 42(3):451–484, 2000.
[16] Satish Balay, Jed Brown, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith, and Hong Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 3.3, Argonne National Laboratory, 2012.
[17] Satish Balay, Jed Brown, Kris Buschelman, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith, and Hong Zhang. PETSc Web page. http://www.mcs.anl.gov/petsc, 2012.
[21] Gregory Beylkin, Christopher Kurcz, and Lucas Monzón. Fast algo-
rithms for Helmholtz Green’s functions. Proceedings of the Royal Society
A: Mathematical, Physical, and Engineering Sciences, 464(2100):3301–
3326, 2008.
[22] Matthias Bollhöfer and Yousef Saad. Multilevel preconditioners con-
structed from inverse-based ILUs. SIAM Journal on Scientific Comput-
ing, 27:1627–1650, 2006.
[23] Matthias Bollhöfer, Marcus Grote, and Olaf Schenk. Algebraic multi-
level preconditioner for the Helmholtz equation in heterogeneous media.
SIAM Journal on Scientific Computing, 31:3781–3805, 2009.
[24] James H. Bramble and Joseph E. Pasciak. A note on the existence and uniqueness of solutions of frequency domain elastic wave problems: a priori estimates in H^1. Journal of Mathematical Analysis and Applications, 345:396–404, 2008.
[27] Henri Calandra, Serge Gratton, Xavier Pinel, and Xavier Vasseur. An
improved two-grid preconditioner for the solution of three-dimensional
Helmholtz problems in heterogeneous media. Technical Report TR/PA/12/2,
CERFACS, Toulouse, France, 2012.
[28] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de
Geijn. Collective communication: theory, practice, and experience.
Concurrency and Computation: Practice and Experience, 19(13):1749–
1783, 2007.
[29] Weng Cho Chew and William H. Weedon. A 3D perfectly matched
medium from modified Maxwell’s equations with stretched coordinates.
Microwave and Optical Technology Letters, 7(13):599–604, 1994.
[32] Robert Dautray and Jacques-Louis Lions. Spectral theory and appli-
cations, volume 3 of Mathematical analysis and numerical methods for
science and technology. Springer-Verlag, New York, NY, 1985.
[33] Tim Davis. Summary of available software for sparse direct methods.
http://www.cise.ufl.edu/research/sparse/codes/, April 2012.
[36] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff.
A set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions
on Mathematical Software, 16(1):1–17, 1990.
[37] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J.
Hanson. An extended set of Fortran Basic Linear Algebra Subprograms.
ACM Transactions on Mathematical Software, 14(1):1–17, 1988.
[40] Alex Druinsky and Sivan Toledo. How accurate is inv(A) ∗ b? ArXiv e-prints, January 2012.
[41] Iain S. Duff and John K. Reid. The multifrontal solution of indefinite
sparse symmetric linear equations. ACM Transactions on Mathematical
Software, 9:302–325, 1983.
[43] Björn Engquist and Lexing Ying. Sweeping preconditioner for the
Helmholtz equation: hierarchical matrix representation. Communica-
tions on Pure and Applied Mathematics, 64:697–735, 2011.
[44] Björn Engquist and Lexing Ying. Sweeping preconditioner for the
Helmholtz equation: moving perfectly matched layers. SIAM Journal
on Multiscale Modeling and Simulation, 9:686–710, 2011.
[45] Yogi A. Erlangga. Advances in iterative methods and preconditioners
for the Helmholtz equation. Archives of Computational Methods in
Engineering, 15:37–66, 2008.
[49] Bernd Fischer and Roland Freund. On the constrained Chebyshev ap-
proximation problem on ellipses. Journal of Approximation Theory,
62(3):297–315, 1990.
[50] Bernd Fischer and Roland Freund. Chebyshev polynomials are not always optimal. Journal of Approximation Theory, 65:261–272, 1991.
[52] Martin J. Gander and Frédéric Nataf. AILU for Helmholtz problems: a
new preconditioner based on the analytic parabolic factorization. Jour-
nal of Computational Acoustics, 9:1499–1506, 2001.
[53] M. R. Garey, David S. Johnson, and Larry J. Stockmeyer. Some sim-
plified NP-complete problems. In Proceedings of the sixth annual ACM
symposium on Theory of computing, pages 47–63, New York, NY, 1974.
ACM.
[54] Alan George. Nested dissection of a regular finite element mesh. SIAM
Journal on Numerical Analysis, 10:345–363, 1973.
[55] Alan George and Joseph W. H. Liu. An optimal algorithm for sym-
bolic factorization of symmetric matrices. SIAM Journal on Computing,
9:583–593, 1980.
[58] Gene H. Golub and Charles F. van Loan. Matrix computations. Johns
Hopkins University Press, Baltimore, MD, 1996.
[61] Anshul Gupta. Analysis and design of scalable parallel algorithms for
scientific computing. Ph.D. Dissertation, University of Minnesota, Min-
neapolis, Minnesota, 1995.
[63] Anshul Gupta, George Karypis, and Vipin Kumar. A highly scalable
parallel algorithm for sparse matrix factorization. IEEE Transactions
on Parallel and Distributed Systems, 8(5):502–520, 1997.
[64] Anshul Gupta, Seid Koric, and Thomas George. Sparse matrix factor-
ization on massively parallel computers. In Proceedings of Conference
on High Performance Computing Networking, Storage, and Analysis,
number 1, New York, NY, 2009. ACM.
[65] Anshul Gupta and Vipin Kumar. Parallel algorithms for forward elimi-
nation and backward substitution in direct solution of sparse linear sys-
tems. In Proceedings of Supercomputing, San Diego, CA, 1995. ACM.
[68] Michael T. Heath and Padma Raghavan. LAPACK Working Note 62:
distributed solution of sparse linear systems. Technical Report UT-CS-
93-201, University of Tennessee, Knoxville, TN, 1993.
[69] Bruce A. Hendrickson and Robert W. Leland. The Chaco Users Guide:
Version 2.0. Sandia Technical Report SAND94–2692, 1994.
[72] Pascal Henon, Pierre Ramet, and Jean Roman. PaStiX: A parallel
sparse direct solver based on static scheduling for mixed 1D/2D block
distributions. In Proceedings of Irregular’2000 workshop of IPDPS, vol-
ume 1800 of Lecture Notes in Computer Science, pages 519–525, Cancun,
Mexico, 2000. Springer Verlag.
[75] Bruce M. Irons. A frontal solution program for finite element analysis.
International Journal for Numerical Methods in Engineering, 2:5–32,
1970.
[76] Dror Irony and Sivan Toledo. Trading replication for communication in
parallel distributed-memory dense solvers. Parallel Processing Letters,
pages 79–94, 2002.
[77] James Jeans. The mathematical theory of electricity and magnetism.
Cambridge University Press, 1908.
[79] Mahesh V. Joshi, Anshul Gupta, George Karypis, and Vipin Kumar. A
high performance two dimensional scalable parallel algorithm for solv-
ing sparse triangular systems. In Proceedings of the Fourth Interna-
tional Conference on High-Performance Computing, HIPC ’97, pages
137–, Washington, DC, 1997. IEEE Computer Society.
[80] Mahesh V. Joshi, George Karypis, Vipin Kumar, Anshul Gupta, and
Fred Gustavson. PSPASES: Building a high performance scalable par-
allel direct solver for sparse linear systems. In Tianruo Yang, editor,
Parallel Numerical Computations with Applications, pages 3–18, Nor-
well, MA, 1999. IEEE.
[81] George Karypis and Vipin Kumar. A fast and high quality multilevel
scheme for partitioning irregular graphs. SIAM Journal on Scientific
Computing, 20(1):359–392, 1998.
[82] George Karypis and Vipin Kumar. A parallel algorithm for multilevel
graph partitioning and sparse matrix ordering. Parallel and Distributed
Computing, 48:71–85, 1998.
[83] Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for
partitioning graphs. Bell Systems Technical Journal, 49:291–307, 1970.
[84] Erwin Kreyszig. Introductory functional analysis with applications. Wi-
ley, 1978.
[85] Cornelius Lanczos. An iteration method for the solution of the eigen-
value problem of linear differential and integral operators. Journal of
Research of the National Bureau of Standards, 45:255–282, 1950.
[89] Xiaoye S. Li, James W. Demmel, John R. Gilbert, Laura Grigori, Meiyue
Shao, and Ichitaro Yamazaki. SuperLU Users’ Guide. Technical Report
LBNL-44289, October 2011.
[90] Robert J. Lipton and Robert Endre Tarjan. A separator theorem for
planar graphs. SIAM Journal on Applied Mathematics, 36(2), 1979.
[91] Joseph W. H. Liu. The multifrontal method for sparse matrix solution:
theory and practice. SIAM Review, 34(1):82–109, 1992.
[92] Per-Gunnar Martinsson and Vladimir Rokhlin. A fast direct solver for
scattering problems involving elongated structures. Journal of Compu-
tational Physics, 221(1):288–302, 2007.
[95] Antoine Petitet, R. Clint Whaley, Jack J. Dongarra, and Andy Cleary. HPL - a portable implementation of the High-Performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl, 2012.
[97] Alex Pothen and Sivan Toledo. Elimination structures in scientific com-
puting. In Dinesh Mehta and Sartaj Sahni, editors, Handbook on Data
Structures and Applications, pages 1–59.29. CRC Press, 2004.
[98] Jack Poulson, Björn Engquist, Siwei Li, and Lexing Ying. A paral-
lel sweeping preconditioner for heterogeneous 3D Helmholtz equations.
ArXiv e-prints, March 2012.
[99] Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. Elemental: A new framework for distributed memory dense matrix computations. ACM Transactions on Mathematical Software, 39(2).
[101] Jack Poulson and Lexing Ying. Parallel Sweeping Preconditioner. http://bitbucket.org/poulson/psp/, November 2012.
[107] Yousef Saad. Iterative methods for sparse linear systems. SIAM,
Philadelphia, PA, 2003.
[109] Martin D. Schatz, Jack Poulson, and Robert A. van de Geijn. Scal-
able Universal Matrix Multiplication: 2D and 3D variations on a theme.
Technical Report, University of Texas at Austin, Austin, TX, 2012.
[110] Phillip G. Schmitz and Lexing Ying. A fast direct solver for elliptic
problems on general meshes in 2D. Journal of Computational Physics,
231:1314–1338, 2012.
[114] David S. Scott. Out of core dense solvers on Intel parallel supercom-
puters. In Proceedings of Frontiers of Massively Parallel Computation,
pages 484–487, 1992.
[115] Laurent Sirgue and R. Gerhard Pratt. Efficient waveform inversion and
imaging: A strategy for selecting temporal frequencies. Geophysics,
69(1):231–248, 2004.
[117] Elias M. Stein and Rami Shakarchi. Complex analysis. Princeton
Lectures in Analysis. Princeton University Press, Princeton, NJ, 2003.
[122] Lloyd N. Trefethen and David Bau. Numerical linear algebra. SIAM,
Philadelphia, PA, 1997.
[124] Paul Tsuji, Björn Engquist, and Lexing Ying. A sweeping precondi-
tioner for time-harmonic Maxwell’s equations with finite elements. Jour-
nal of Computational Physics, 231(9):3770–3783, 2012.
[125] Paul Tsuji, Björn Engquist, and Lexing Ying. A sweeping precondi-
tioner for Yee’s finite difference approximation of time-harmonic Maxwell’s
equations. Frontiers of Mathematics in China, 7(2):347–363, 2012.
[126] Robert A. van de Geijn. Massively parallel LINPACK benchmark on
the Intel Touchstone DELTA and iPSC/860 systems. In Proceedings of
Intel Supercomputer Users Group, Dallas, TX, 1991.
[127] Robert A. van de Geijn and Jerrell Watts. SUMMA: Scalable Universal
Matrix Multiplication Algorithm. Concurrency: Practice and Experi-
ence, 9:255–274, 1997.
[129] Charles F. van Loan and Nikos Pitsianis. Approximation with Kro-
necker products. In Linear Algebra for Large Scale and Real Time
Applications, pages 293–314. Kluwer Publications, 1993.
[131] Shen Wang, Xiaoye S. Li, Jianlin Xia, Yingchong Situ, and Maarten V.
de Hoop. Efficient scalable algorithms for hierarchically semiseparable
matrices. Technical report, Purdue University, West Lafayette, IN,
2011.
[134] Jianlin Xia, Shiv Chandrasekaran, Ming Gu, and Xiaoye S. Li. Superfast
multifrontal method for large structured linear systems of equations.
SIAM Journal on Matrix Analysis and Applications, 31(3):1382–1411,
2009.