CS 240A: Solving Ax = b in Parallel: Dense A: Gaussian Elimination With Partial Pivoting (LU)
[Figure: an example sparse matrix and its graph G(A)]
Full storage:
2-dimensional array.
(nrows*ncols) memory.

Sparse storage:
Compressed storage by columns (CSC): three 1-dimensional arrays (values, row indices, column start pointers).
(2*nzs + ncols + 1) memory.
Similarly, CSR.
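The CSC layout above can be sketched concretely. The example reuses the nonzero values from the slides' running matrix (31, 53, 59, 41, 26); their positions here are assumed for illustration:

```python
# Sketch of CSC storage for a small example matrix (positions assumed):
# A = [[31,  0, 53],
#      [ 0, 59,  0],
#      [41, 26,  0]]
# val[] and row[] hold one entry per nonzero; colstart[] has ncols+1
# entries, so total memory is 2*nzs + ncols + 1, as stated above.
val      = [31, 41, 59, 26, 53]   # nonzero values, column by column
row      = [0, 2, 1, 2, 0]        # row index of each nonzero (0-based)
colstart = [0, 2, 4, 5]           # column j is val[colstart[j]:colstart[j+1]]

def column(j):
    """Expand column j of the CSC matrix into a dense list."""
    col = [0, 0, 0]
    for p in range(colstart[j], colstart[j + 1]):
        col[row[p]] = val[p]
    return col
```

Accessing a whole column is cheap in CSC (contiguous slice); accessing a row requires scanning every column, which is why CSR is preferred for row-wise kernels.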
The landscape of Ax = b solvers:

                                Direct (A = LU)    Iterative (y' = Ay)
Nonsymmetric:                   Pivoting LU        GMRES, BiCGSTAB, ...
Symmetric positive definite:    Cholesky           Conjugate gradient

More robust toward the left; less storage (if sparse) toward the right; more general toward the bottom.
[Figures: fill under (Cholesky factorization) A = L*L^T and under (partial pivoting) LU; an example graph G(A) and its filled graph G+(A), which is chordal]
Separators in theory

If G is a planar graph with n vertices, there exists a set of at most sqrt(6n) vertices whose removal leaves no connected component with more than 2n/3 vertices. (Planar graphs have sqrt(n)-separators.)
Well-shaped finite element meshes in 3 dimensions have n^(2/3)-separators.
Also some other classes of graphs: trees, graphs of bounded genus, chordal graphs, bounded-excluded-minor graphs, ...
Mostly these theorems come with efficient algorithms, but they aren't used much.
Separators in practice
Graph partitioning heuristics have been an active
research area for many years, often motivated by
partitioning for parallel computation.
Some techniques:
               2D (n^(1/2) separators)   3D (n^(1/3) separators)
Space (fill):  O(n log n)                O(n^(4/3))
Time (flops):  O(n^(3/2))                O(n^2)
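These bounds follow from the nested-dissection recurrences; a standard back-of-the-envelope derivation (not spelled out on the slide): eliminating an O(n^(1/2)) separator last makes its trailing block dense, costing O((n^(1/2))^2) = O(n) fill and O((n^(1/2))^3) = O(n^(3/2)) flops, so

```latex
\mathrm{fill}(n)  = 2\,\mathrm{fill}(n/2)  + O(n)       = O(n \log n)
\mathrm{flops}(n) = 2\,\mathrm{flops}(n/2) + O(n^{3/2}) = O(n^{3/2})
```

The 3D case is analogous with n^(2/3)-separators: the separator's dense block gives O(n^(4/3)) fill and O(n^2) flops per level, and both recurrences are dominated by the root.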
Conjugate gradient iteration, last two updates of step k:

    beta_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})    improvement this step
    d_k = r_k + beta_k * d_{k-1}                  search direction

One matrix-vector multiplication per iteration
Two vector dot products per iteration
Four n-vectors of working storage
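The updates above are the tail of the full CG loop. A minimal generic sketch (standard textbook CG, not course-supplied code) shows the one matvec, two dot products, and four n-vectors of working storage per iteration:

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient for symmetric positive definite A.
    Working storage is the four n-vectors x, r, d, Ad."""
    n = len(b)
    x = np.zeros(n)
    r = b.copy()                 # residual r_0 = b - A*x_0 = b
    d = r.copy()                 # first search direction
    rr = r @ r                   # r^T r, reused across iterations
    for _ in range(max_iter):
        Ad = A @ d                   # the one matvec per iteration
        alpha = rr / (d @ Ad)        # step length (dot product 1)
        x += alpha * d               # approximate solution
        r -= alpha * Ad              # residual
        rr_new = r @ r               # (dot product 2)
        if rr_new < tol**2:
            break
        beta = rr_new / rr           # improvement this step
        d = r + beta * d             # search direction
        rr = rr_new
    return x
```

For example, with A = [[4,1],[1,3]] and b = [1,2], CG reaches the exact solution (1/11, 7/11) in two steps, as expected for a 2-by-2 SPD system.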
Matrix-vector product y = A*x: storing A

Full: 2-dimensional array of real or complex numbers; (nrows*ncols) memory.
Sparse: compressed row storage; about (2*nzs + nrows) memory.

[Figure: the example matrix with nonzeros 31, 53, 59, 41, 26, stored full and sparse]
Parallel sparse matrix-vector product

Partition the rows of A, and the corresponding entries of x and y, among processors P0, P1, ..., Pp-1.

y(i) = sum(A(i,j)*x(j))
Skip terms with A(i,j) = 0

Algorithm, each processor i:
    Broadcast x(i)
    Compute y(i) = A(i,:)*x
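The per-row kernel each processor runs, visiting only stored nonzeros, can be sketched with CSR arrays (array names and the example's nonzero positions are assumptions, reusing the slides' values):

```python
# Sketch: y = A*x with A in compressed sparse row (CSR) storage.
# val[]/col[] hold one entry per nonzero; rowstart[] has nrows+1
# entries, so memory is about 2*nzs + nrows, as stated above.
# Example matrix (nonzero positions assumed for illustration):
# A = [[31,  0, 53],
#      [ 0, 59,  0],
#      [41, 26,  0]]
val      = [31, 53, 59, 41, 26]   # nonzero values, row by row
col      = [0, 2, 1, 0, 1]        # column index of each nonzero
rowstart = [0, 2, 3, 5]           # row i is val[rowstart[i]:rowstart[i+1]]

def matvec(x):
    """y(i) = sum_j A(i,j)*x(j), skipping terms with A(i,j) = 0."""
    y = []
    for i in range(len(rowstart) - 1):
        s = 0
        for p in range(rowstart[i], rowstart[i + 1]):
            s += val[p] * x[col[p]]
        y.append(s)
    return y
```

In the parallel version, each processor runs this loop over its own block of rows after the broadcast of x.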
Preconditioners

Preconditioned conjugate gradient adds one solve with the preconditioner B per iteration:

    y_k = B^(-1) * r_k                              preconditioning solve
    beta_k = (y_k^T r_k) / (y_{k-1}^T r_{k-1})      improvement this step
    d_k = y_k + beta_k * d_{k-1}                    search direction

One matrix-vector multiplication per iteration
One solve with preconditioner per iteration
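A minimal concrete choice of B, used here only as an illustration (the slides do not fix one), is Jacobi/diagonal scaling, where the preconditioning solve is a single elementwise division:

```python
import numpy as np

def jacobi_solve(A, r):
    """Preconditioning solve y = B^(-1) r with B = diag(A).
    Jacobi (diagonal) scaling: the cheapest preconditioner;
    stronger choices (IC, MIC, multigrid) replace this routine."""
    return r / np.diag(A)
```

Swapping in a better B changes only this solve; the rest of the CG loop is unchanged.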
Effectiveness:
The condition number usually improves (only) by a constant factor (except MIC for some problems from PDEs).
Still, often good when tuned for a particular class of problems.

Parallelism:
Triangular solves are not very parallel.
Reordering for parallel triangular solve by graph coloring.
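The slide's coloring idea aims to expose independent rows; a closely related scheme, shown here as a substitute illustration, is level scheduling: assign each row of the lower triangular factor the earliest parallel step at which it can be solved, so all rows in one level are independent:

```python
def levels(rowstart, col):
    """Level-schedule a lower triangular sparse matrix in CSR form
    (array names assumed). level[i] = 1 + max level of any earlier
    row that row i depends on; rows sharing a level can be solved
    in parallel during the triangular solve."""
    n = len(rowstart) - 1
    level = [0] * n
    for i in range(n):
        for p in range(rowstart[i], rowstart[i + 1]):
            j = col[p]
            if j < i:                        # off-diagonal dependency
                level[i] = max(level[i], level[j] + 1)
    return level
```

For a 4-by-4 lower triangular factor where row 2 depends on row 0 and row 3 depends on row 2, the levels come out [0, 0, 1, 2]: rows 0 and 1 solve concurrently, then row 2, then row 3.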
[Example: nonsymmetric-structure, second-order-accurate octree discretization]
Multigrid
                        2D (n^(1/2) sep.)    3D (n^(1/3) sep.)
Sparse Cholesky:        O(n^1.5)             O(n^2)
CG, exact arithmetic:   O(n^2)               O(n^2)
CG, no precond:         O(n^1.5)             O(n^1.33)
CG, modified IC0:       O(n^1.25)            O(n^1.17)
Multigrid:              O(n)                 O(n)