CS 240A: Solving Ax = b in Parallel: Dense A: Gaussian Elimination with Partial Pivoting (LU)

This document discusses methods for solving the linear system Ax = b in parallel. It covers direct methods such as Gaussian elimination and Cholesky factorization as well as iterative methods such as conjugate gradients, shows how these methods apply to sparse matrices using graph algorithms and compressed storage formats, and introduces preconditioning techniques such as incomplete Cholesky and ILU to improve the convergence of iterative methods.

CS 240A: Solving Ax = b in parallel

Dense A: Gaussian elimination with partial pivoting (LU)
    Same flavor as matrix * matrix, but more complicated
Sparse A: Gaussian elimination (Cholesky, LU, etc.)
    Graph algorithms
Sparse A: Iterative methods (conjugate gradient, etc.)
    Sparse matrix times dense vector
Sparse A: Preconditioned iterative methods and multigrid
    Mixture of lots of things

Matrix and Graph

(Figure: a sparse matrix A and its graph G(A), with vertices numbered 1 to n.)

Edge from row i to column j for each nonzero A(i,j)
No edges for diagonal nonzeros
If A is symmetric, G(A) is an undirected graph
Symmetric permutation PAP^T renumbers the vertices

Compressed Sparse Matrix Storage

(Figure: an example matrix with nonzeros 31, 41, 59, 26, 53 and its compressed arrays; value: 31 41 59 26 53, with the row and colstart arrays truncated in this copy.)

Full storage:
    2-dimensional array.
    (nrows*ncols) memory.

Sparse storage:
    Compressed storage by columns (CSC).
    Three 1-dimensional arrays: values, row indices, column starts.
    (2*nzs + ncols + 1) memory.
    Similarly, CSR (compressed by rows).
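The same three-array layout can be inspected directly in SciPy. A minimal sketch, not from the slides: the matrix below is one placement of the nonzeros 31, 41, 59, 26, 53 that is consistent with the value array above (the exact matrix in the original figure is not recoverable here), and SciPy's indices are 0-based where the slide's are 1-based.

import numpy as np
from scipy.sparse import csc_matrix

# One possible placement of the example nonzeros (an assumption, for illustration).
A = np.array([[31,  0, 53],
              [ 0, 59,  0],
              [41, 26,  0]])

S = csc_matrix(A)
print(S.data)     # values stored column by column: [31 41 59 26 53]
print(S.indices)  # row index of each stored value (0-based)
print(S.indptr)   # position in data/indices where each column starts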

The Landscape of Ax = b Solvers

                                Direct (A = LU)    Iterative (y = Ay)
  Nonsymmetric                  Pivoting LU        GMRES, BiCGSTAB, ...
  Symmetric positive definite   Cholesky           Conjugate gradient

(Axis labels from the original figure: "More General"; "More Robust"; "More Robust, Less Storage (if sparse)".)

CS 240A: Solving Ax = b in parallel

Dense A: Gaussian elimination with partial pivoting (LU)
    See April 15 slides
    Same flavor as matrix * matrix, but more complicated
Sparse A: Gaussian elimination (Cholesky, LU, etc.)
    Graph algorithms
Sparse A: Iterative methods (conjugate gradient, etc.)
    Sparse matrix times dense vector
Sparse A: Preconditioned iterative methods and multigrid
    Mixture of lots of things

Gaussian elimination to solve Ax = b

For a symmetric, positive definite matrix:
    1. Matrix factorization: A = LL^T (Cholesky factorization)
    2. Forward triangular solve: Ly = b
    3. Backward triangular solve: L^T x = y

For a nonsymmetric matrix:
    1. Matrix factorization: PA = LU (partial pivoting)
    2. . . .
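As a concrete illustration of the three steps in the SPD case, here is a minimal dense NumPy/SciPy sketch (a made-up 3x3 test problem, not the parallel algorithm discussed in the course):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

# Small symmetric positive definite test problem (illustrative only).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

L = cholesky(A, lower=True)                 # 1. factor A = L L^T
y = solve_triangular(L, b, lower=True)      # 2. forward solve  L y = b
x = solve_triangular(L.T, y, lower=False)   # 3. backward solve L^T x = y

assert np.allclose(A @ x, b)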

Sparse Column Cholesky Factorization

for j = 1 : n
    L(j:n, j) = A(j:n, j);
    for k < j with L(j, k) nonzero
        % sparse cmod(j,k)
        L(j:n, j) = L(j:n, j) - L(j, k) * L(j:n, k);
    end;
    % sparse cdiv(j)
    L(j, j) = sqrt(L(j, j));
    L(j+1:n, j) = L(j+1:n, j) / L(j, j);
end;

Column j of A becomes column j of L
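The loop above is MATLAB-style pseudocode. A dense NumPy transcription of the same left-looking algorithm, as a sketch that ignores sparsity just to make the cmod/cdiv steps concrete (a sparse code visits only the columns k with L[j, k] != 0):

import numpy as np

def column_cholesky(A):
    """Left-looking column Cholesky: returns lower-triangular L with A = L @ L.T."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j:, j] = A[j:, j]
        for k in range(j):
            if L[j, k] != 0.0:                  # cmod(j, k)
                L[j:, j] -= L[j, k] * L[j:, k]
        L[j, j] = np.sqrt(L[j, j])              # cdiv(j)
        L[j+1:, j] /= L[j, j]
    return L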

Irregular mesh: NASA Airfoil in 2D

Graphs and Sparse Matrices: Cholesky factorization

Fill: new nonzeros in the factor

(Figure: the graph G(A) before elimination and the filled graph G+(A), which is chordal.)

Symmetric Gaussian elimination:
    for j = 1 to n
        add edges between j's higher-numbered neighbors

Permutations of the 2-D model problem

Theorem: With the natural permutation, the n-vertex model problem has Θ(n^(3/2)) fill. (order exactly)
Theorem: With any permutation, the n-vertex model problem has Ω(n log n) fill. (order at least)
Theorem: With a nested dissection permutation, the n-vertex model problem has O(n log n) fill. (order at most)

Nested dissection ordering

A separator in a graph G is a set S of vertices whose removal leaves at least two connected components.
A nested dissection ordering for an n-vertex graph G numbers its vertices from 1 to n as follows:
    Find a separator S, whose removal leaves connected components T1, T2, ..., Tk
    Number the vertices of S from n-|S|+1 to n.
    Recursively, number the vertices of each component: T1 from 1 to |T1|, T2 from |T1|+1 to |T1|+|T2|, etc.
    If a component is small enough, number it arbitrarily.
It all boils down to finding good separators! (A recursive sketch of the numbering follows below.)
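A recursive sketch of the numbering scheme in Python. This is an illustration under stated assumptions: the adjacency-dict graph format is arbitrary, and find_separator is a caller-supplied placeholder for any partitioning heuristic (which is the hard part).

from collections import deque

def nested_dissection_order(G, find_separator, vertices=None):
    """Return the vertices of G (adjacency dict {v: set of neighbors}) in nested
    dissection order: component vertices first, separator vertices last."""
    if vertices is None:
        vertices = set(G)
    if len(vertices) <= 4:                       # small enough: number arbitrarily
        return list(vertices)
    S = set(find_separator(G, vertices))         # any partitioning heuristic
    order, remaining = [], vertices - S
    while remaining:
        comp, frontier = set(), deque([next(iter(remaining))])
        while frontier:                          # BFS collects one connected component
            v = frontier.popleft()
            if v in comp:
                continue
            comp.add(v)
            frontier.extend((G[v] & remaining) - comp)
        remaining -= comp
        order.extend(nested_dissection_order(G, find_separator, comp))
    order.extend(S)                              # separator gets the highest numbers
    return order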

Separators in theory

If G is a planar graph with n vertices, there exists a set of at most sqrt(6n) vertices whose removal leaves no connected component with more than 2n/3 vertices. (Planar graphs have sqrt(n)-separators.)
Well-shaped finite element meshes in 3 dimensions have n^(2/3)-separators.
Also some other classes of graphs: trees, graphs of bounded genus, chordal graphs, bounded-excluded-minor graphs, ...
Mostly these theorems come with efficient algorithms, but they aren't used much.

Separators in practice

Graph partitioning heuristics have been an active research area for many years, often motivated by partitioning for parallel computation.
Some techniques:
    Spectral partitioning (uses eigenvectors of the Laplacian matrix of the graph)
    Geometric partitioning (for meshes with specified vertex coordinates)
    Iterative swapping (Kernighan-Lin, Fiduccia-Mattheyses)
    Breadth-first search (fast but dated)
Many popular modern codes (e.g. Metis, Chaco) use multilevel iterative swapping.
Matlab graph partitioning toolbox: see course web page

Complexity of direct methods

Time and space to solve any problem on any well-shaped finite element mesh
(2D: an n^(1/2) x n^(1/2) mesh; 3D: an n^(1/3) x n^(1/3) x n^(1/3) mesh)

                    2D            3D
  Space (fill):     O(n log n)    O(n^(4/3))
  Time (flops):     O(n^(3/2))    O(n^2)

CS 240A: Solving Ax = b in parallel

Dense A: Gaussian elimination with partial pivoting (LU)
    See April 15 slides
    Same flavor as matrix * matrix, but more complicated
Sparse A: Gaussian elimination (Cholesky, LU, etc.)
    Graph algorithms
Sparse A: Iterative methods (conjugate gradient, etc.)
    Sparse matrix times dense vector
Sparse A: Preconditioned iterative methods and multigrid
    Mixture of lots of things

The Landscape of Ax = b Solvers

                                Direct (A = LU)    Iterative (y = Ay)
  Nonsymmetric                  Pivoting LU        GMRES, BiCGSTAB, ...
  Symmetric positive definite   Cholesky           Conjugate gradient

(Axis labels from the original figure: "More General"; "More Robust"; "More Robust, Less Storage (if sparse)".)

Conjugate gradient iteration

x_0 = 0,  r_0 = b,  d_0 = r_0
for k = 1, 2, 3, ...
    α_k = (r_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})     step length
    x_k = x_{k-1} + α_k d_{k-1}                           approx solution
    r_k = r_{k-1} - α_k A d_{k-1}                         residual
    β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})               improvement
    d_k = r_k + β_k d_{k-1}                               search direction

One matrix-vector multiplication per iteration
Two vector dot products per iteration
Four n-vectors of working storage
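A direct NumPy transcription of this iteration, as a minimal sketch (the convergence test and iteration cap are arbitrary additions, not part of the slide):

import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Unpreconditioned CG for symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b.copy()                 # r_0 = b  (since x_0 = 0)
    d = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        Ad = A @ d               # the one matrix-vector product per iteration
        alpha = rr / (d @ Ad)    # step length
        x += alpha * d           # approximate solution
        r -= alpha * Ad          # residual
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        beta = rr_new / rr       # improvement
        d = r + beta * d         # search direction
        rr = rr_new
    return x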

Sparse matrix data structure (stored by rows)

(Figure: the example matrix with nonzeros 31, 41, 59, 26, 53, stored full and in compressed row form.)

Full:
    2-dimensional array of real or complex numbers
    (nrows*ncols) memory

Sparse:
    compressed row storage (CSR)
    about (2*nzs + nrows) memory

Distributed row sparse matrix data structure

(Figure: the rows of the example matrix divided among processors P0, P1, P2, ..., Pp-1.)

Each processor stores:
    # of local nonzeros
    range of local rows
    nonzeros in CSR form
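One way to spell out what each processor holds, as a minimal sketch (an illustrative layout matching the bullet list above, not the format of any particular library):

from dataclasses import dataclass
import numpy as np

@dataclass
class LocalCSR:
    """One processor's piece of a row-distributed sparse matrix."""
    row_begin: int        # global index of the first local row
    row_end: int          # one past the last local row
    values: np.ndarray    # local nonzero values, in CSR order
    col_idx: np.ndarray   # global column index of each local nonzero
    row_ptr: np.ndarray   # start of each local row in values/col_idx

    @property
    def local_nnz(self) -> int:
        return len(self.values)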

Matrix-vector product: Parallel implementation

Lay out the matrix and the vectors by rows: y(i) = sum(A(i,j)*x(j)); skip terms with A(i,j) = 0.

(Figure: rows of A and the entries of x and y divided among processors P0, P1, P2, P3.)

Algorithm: each processor i:
    Broadcast x(i)
    Compute y(i) = A(i,:)*x

Optimizations: reduce communication by
    Only sending as much of x as necessary to each processor
    Reordering the matrix for better locality by graph partitioning
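A minimal mpi4py sketch of the row-wise algorithm. This is an assumption-laden illustration: the local block is kept dense for brevity and every rank simply allgathers all of x, ignoring the communication-reducing optimizations listed above.

from mpi4py import MPI
import numpy as np

def parallel_matvec(local_A, local_x, comm=MPI.COMM_WORLD):
    """Compute this rank's rows of y = A*x, with A, x, y laid out by rows."""
    x_pieces = comm.allgather(local_x)   # every rank broadcasts its piece of x
    x = np.concatenate(x_pieces)
    return local_A @ x                   # each rank computes its own rows of y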

Sparse Matrix-Vector Multiplication

CS 240A: Solving Ax = b in parallel

Dense A: Gaussian elimination with partial pivoting (LU)
    See April 15 slides
    Same flavor as matrix * matrix, but more complicated
Sparse A: Gaussian elimination (Cholesky, LU, etc.)
    Graph algorithms
Sparse A: Iterative methods (conjugate gradient, etc.)
    Sparse matrix times dense vector
Sparse A: Preconditioned iterative methods and multigrid
    Mixture of lots of things

Conjugate gradient: Convergence

In exact arithmetic, CG converges in n steps (completely unrealistic!!)
Accuracy after k steps of CG is related to:
    consider polynomials of degree k that are equal to 1 at 0;
    how small can such a polynomial be at all the eigenvalues of A?
Thus, eigenvalues close together are good.
Condition number: κ(A) = ||A||_2 ||A^-1||_2 = λ_max(A) / λ_min(A)
Residual is reduced by a constant factor by O( sqrt(κ(A)) ) iterations of CG.

Preconditioners

Suppose you had a matrix B such that:
    1. condition number κ(B^-1 A) is small
    2. By = z is easy to solve
Then you could solve (B^-1 A)x = B^-1 b instead of Ax = b.

Each iteration of CG multiplies a vector by B^-1 A:
    First multiply by A
    Then solve a system with B

Preconditioned conjugate gradient iteration

x_0 = 0,  r_0 = b,  d_0 = B^-1 r_0,  y_0 = B^-1 r_0
for k = 1, 2, 3, ...
    α_k = (y_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})     step length
    x_k = x_{k-1} + α_k d_{k-1}                           approx solution
    r_k = r_{k-1} - α_k A d_{k-1}                         residual
    y_k = B^-1 r_k                                        preconditioning solve
    β_k = (y_k^T r_k) / (y_{k-1}^T r_{k-1})               improvement
    d_k = y_k + β_k d_{k-1}                               search direction

One matrix-vector multiplication per iteration
One solve with preconditioner per iteration
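The same iteration in NumPy, as a minimal sketch with B = diag(A) (the Jacobi choice mentioned on the next slide), so the preconditioning solve By = r is just a pointwise division:

import numpy as np

def preconditioned_cg(A, b, tol=1e-8, max_iter=1000):
    """PCG with a diagonal (Jacobi) preconditioner, following the slide above."""
    B_diag = np.diag(A).copy()
    x = np.zeros_like(b)
    r = b.copy()
    y = r / B_diag               # y_0 = B^-1 r_0
    d = y.copy()
    ry = r @ y
    for _ in range(max_iter):
        Ad = A @ d
        alpha = ry / (d @ Ad)    # step length
        x += alpha * d           # approximate solution
        r -= alpha * Ad          # residual
        if np.linalg.norm(r) < tol:
            break
        y = r / B_diag           # preconditioning solve
        ry_new = r @ y
        beta = ry_new / ry       # improvement
        d = y + beta * d         # search direction
        ry = ry_new
    return x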

Choosing a good preconditioner

Suppose you had a matrix B such that:
    1. condition number κ(B^-1 A) is small
    2. By = z is easy to solve
Then you could solve (B^-1 A)x = B^-1 b instead of Ax = b.

B = A is great for (1), not for (2)
B = I is great for (2), not for (1)
Domain-specific approximations sometimes work
B = diagonal of A sometimes works

Better: blend in some direct-methods ideas. . .

Incomplete Cholesky factorization (IC, ILU)

(Figure: A ≈ R^T R, with R an incomplete triangular factor.)

Compute factors of A by Gaussian elimination, but ignore fill
Preconditioner B = R^T R ≈ A, not formed explicitly
Compute B^-1 z by triangular solves (in time nnz(A))
Total storage is O(nnz(A)), static data structure
Either symmetric (IC) or nonsymmetric (ILU)
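SciPy has no incomplete Cholesky, but its spilu routine gives a drop-tolerance ILU that can be wrapped as a preconditioner. A minimal sketch (the 1-D Poisson-like test matrix and the drop tolerance are arbitrary choices, not from the slides):

import numpy as np
from scipy.sparse import csc_matrix, diags
from scipy.sparse.linalg import spilu, LinearOperator, bicgstab

n = 100
A = csc_matrix(diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)))  # test matrix
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-3)                    # incomplete factors, small fill dropped
M = LinearOperator(A.shape, matvec=ilu.solve)    # apply B^-1 z by triangular solves
x, info = bicgstab(A, b, M=M)                    # preconditioned Krylov solve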

Incomplete Cholesky and ILU: Variants

Allow one or more levels of fill
    unpredictable storage requirements
Allow fill whose magnitude exceeds a drop tolerance
    may get better approximate factors than levels of fill
    unpredictable storage requirements
    choice of tolerance is ad hoc
Partial pivoting (for nonsymmetric A)
Modified ILU (MIC): add dropped fill to the diagonal of U or R
    A and R^T R have the same row sums
    good in some PDE contexts

Incomplete Cholesky and ILU: Issues

Choice of parameters
    good: smooth transition from iterative to direct methods
    bad: very ad hoc, problem-dependent
    tradeoff: time per iteration (more fill => more time) vs. # of iterations (more fill => fewer iters)
Effectiveness
    condition number usually improves (only) by a constant factor (except MIC for some problems from PDEs)
    still, often good when tuned for a particular class of problems
Parallelism
    Triangular solves are not very parallel
    Reordering for parallel triangular solve by graph coloring

Coloring for parallel nonsymmetric preconditioning [Aggarwal, Gibou, G]

263 million DOF
Level set method for multiphase interface problems in 3D
Nonsymmetric-structure, second-order-accurate octree discretization
BiCGSTAB preconditioned by parallel triangular solves

Sparse approximate inverses

Compute B^-1 ≈ A^-1 explicitly
    Minimize || B^-1 A - I ||_F (in parallel, by columns)
    Variants: factored form of B^-1, more fill, ...
Good: very parallel
Bad: effectiveness varies widely
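A dense sketch of the column-by-column idea, using the right-sided variant that minimizes ||A*M - I||_F with the pattern of M fixed to the pattern of A (an illustration only; a real SPAI code works on sparse data and solves the small least-squares problems in parallel):

import numpy as np

def sparse_approximate_inverse(A):
    """Each column m_j minimizes ||A m_j - e_j||_2, restricted to the sparsity
    pattern of column j of A; the columns are independent, hence very parallel."""
    n = A.shape[0]
    M = np.zeros_like(A, dtype=float)
    for j in range(n):
        pattern = np.nonzero(A[:, j])[0]          # allowed nonzero positions in m_j
        e_j = np.zeros(n)
        e_j[j] = 1.0
        m_j, *_ = np.linalg.lstsq(A[:, pattern], e_j, rcond=None)
        M[pattern, j] = m_j
    return M                                      # sparse-patterned approximation to inv(A)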

Other Krylov subspace methods

Nonsymmetric linear systems:
GMRES:
    for i = 1, 2, 3, . . .
        find x_i ∈ K_i(A, b) such that r_i = (A x_i - b) ⊥ K_i(A, b)
    But, no short recurrence => save old vectors => lots more space
    (Usually restarted every k iterations to use less space.)
BiCGStab, QMR, etc.:
    Two spaces K_i(A, b) and K_i(A^T, b) with mutually orthogonal bases
    Short recurrences => O(n) space, but less robust
Convergence and preconditioning more delicate than CG
Active area of current research
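Restarted GMRES is available directly in SciPy. A minimal usage sketch (the small nonsymmetric test matrix and the restart length of 20 are arbitrary choices):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import gmres

A = csr_matrix(np.array([[4.0, 1.0, 0.0],
                         [0.0, 3.0, 2.0],
                         [1.0, 0.0, 5.0]]))
b = np.ones(3)

x, info = gmres(A, b, restart=20)   # restart every 20 iterations to bound storage
assert info == 0                    # info == 0 means the solver converged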

Multigrid

For a PDE on a fine mesh, precondition using a solution on a coarser mesh
Use the idea recursively on a hierarchy of meshes
Solves the model problem (Poisson's equation) in linear time!
Often useful when a hierarchy of meshes can be built
Hard to parallelize coarse meshes well
This is just the intuition; there is lots of theory and technology
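A toy V-cycle for the 1-D model problem, just to make the "use a coarser mesh to correct the fine mesh" idea concrete. This is a sketch under strong assumptions (n = 2^k - 1 interior points, zero boundary values, weighted-Jacobi smoothing, full-weighting restriction, linear interpolation); none of the real issues such as general meshes or parallel coarse grids appear here.

import numpy as np

def v_cycle(u, f, levels=4, nu=3, omega=2.0/3.0):
    """One V-cycle for -u'' = f on the unit interval with n = 2^k - 1 interior points."""
    n = len(u)
    h2 = (1.0 / (n + 1)) ** 2

    def apply_A(v):                           # (A v)_i = (2 v_i - v_{i-1} - v_{i+1}) / h^2
        Av = 2.0 * v.copy()
        Av[:-1] -= v[1:]
        Av[1:] -= v[:-1]
        return Av / h2

    for _ in range(nu):                       # pre-smoothing: weighted Jacobi
        u = u + omega * (h2 / 2.0) * (f - apply_A(u))

    if levels > 1 and n >= 3:
        r = f - apply_A(u)                    # fine-grid residual
        r_c = 0.25 * (r[0:-2:2] + 2.0 * r[1:-1:2] + r[2::2])     # restriction
        e_c = v_cycle(np.zeros_like(r_c), r_c, levels - 1, nu, omega)
        e = np.zeros(n)                       # interpolate the coarse correction
        e[1::2] = e_c
        e[0::2] = 0.5 * (np.concatenate(([0.0], e_c)) + np.concatenate((e_c, [0.0])))
        u = u + e

    for _ in range(nu):                       # post-smoothing
        u = u + omega * (h2 / 2.0) * (f - apply_A(u))
    return u

# Example: a few V-cycles on a 127-point grid drive the error down quickly.
n = 2**7 - 1
x_grid = np.linspace(0.0, 1.0, n + 2)[1:-1]
f = np.pi**2 * np.sin(np.pi * x_grid)         # exact solution is sin(pi * x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f)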

Complexity of linear solvers

Time to solve the model problem (Poisson's equation) on a regular mesh
(2D: an n^(1/2) x n^(1/2) mesh; 3D: an n^(1/3) x n^(1/3) x n^(1/3) mesh)

                           2D                           3D
  Sparse Cholesky:         O(n^1.5)                     O(n^2)
  CG, exact arithmetic:    O(n^2)                       O(n^2)
  CG, no precond:          O(n^1.5)                     O(n^1.33)
  CG, modified IC:         O(n^1.25)                    O(n^1.17)
  CG, support trees:       O(n^1.20) -> O(n^(1+ε))      O(n^1.75) -> O(n^(1+ε))
  Multigrid:               O(n)                         O(n)

Complexity of direct methods

Time and space to solve any problem on any well-shaped finite element mesh
(2D: an n^(1/2) x n^(1/2) mesh; 3D: an n^(1/3) x n^(1/3) x n^(1/3) mesh)

                    2D            3D
  Space (fill):     O(n log n)    O(n^(4/3))
  Time (flops):     O(n^(3/2))    O(n^2)
