An Efficient Parallel Version of The Householder-QL Matrix Diagonalisation Algorithm
Abstract
In this paper we report an effective parallelisation of the Householder routine for the reduction of a real symmetric matrix to tri-diagonal form, and of the QL algorithm for the diagonalisation of the resulting matrix. The Householder algorithm scales like $N^3/P + N^2\log_2(P)$ and the QL algorithm like $N^2 + N^3/P$ on $P$ processors.
1 Introduction
The parallelisation of methods for finding the eigenvalues and eigenvectors of real symmetric matrices has received little attention over the years, for two reasons. Firstly, for very large systems quite often only a few eigenvalues and eigenvectors are needed. Secondly, the parallelisation of the best sequential techniques for matrix diagonalisation has proved difficult and ineffective.
In the simulation of the dynamics of large structures, only the lowest few eigenvalues, corresponding to the most likely low-frequency modes of vibration, are needed. Similarly, in solid state physics, states at or near the ground state are those most likely to be occupied, at least at low temperatures, and so make the largest contribution to the material properties. For these reasons, in subjects like structural mechanics, techniques have concentrated on reducing the degrees of freedom in the system, from perhaps millions to a few hundred, enabling classical sequential techniques to be used. In physics and chemistry, where the original matrix is regularly structured and sparse, iterative techniques like the Lanczos [9] or Davidson [2] methods, which recover the lowest few eigenvalues and their corresponding eigenvectors, can be used effectively.
There are however many problems for which it is desirable to have all of the eigenvalues and eigenvectors of the system. Physical simulations of quantum mechanical systems that have a few particles present at finite temperature cannot be properly analysed by methods designed for large numbers of particles, such as statistical mechanics, and are sufficiently far from their ground states as to be influenced by all available states. Similarly, in the design of mechanical structures, acoustic resonances might occur over a broad bandwidth. In such cases all eigenvalues and eigenvectors of large symmetric matrices are needed, for which no algorithm of complexity less than $O(N^3)$, where $N$ is the dimension of the matrix, is known.
The Householder reduction itself requires $O(N^3)$ operations, and the operation count for the QL method is roughly $30N^2$ when the eigenvectors are not required and $3N^3$ when they are. These counts presume that only a small number of QL iterations is needed per eigenvalue. By comparison, the Jacobi method requires about $40N^3$ operations when the eigenvectors are required, assuming less than 10 Jacobi cycles for convergence [10]. The Jacobi method might readily parallelise [11], but requires more than 20 processors before becoming competitive with the sequential Householder-QL algorithm. Previous attempts at parallelising the Householder and QL algorithms have resulted in inefficient solutions [4] or have altered the algorithms in some way [12, 3, 1, 8].
After the first step of the reduction the matrix has the form

$$\begin{pmatrix}
d_1 & e_1 & 0 & 0 & 0 & 0 \\
e_1 & d_2 & x & x & x & x \\
0   & x   & x & x & x & x \\
0   & x   & x & x & x & x \\
0   & x   & x & x & x & x \\
0   & x   & x & x & x & x
\end{pmatrix}$$

where $d_i$ and $e_i$ are the diagonal and off-diagonal elements produced so far and $x$ marks the still unreduced block. Each step applies a Householder reflection $P_k = I - 2W_kW_k^T$, with $V_k = A_kW_k$ and $c = W_k^T V_k$.
The $N^3$ part of the QL operation count arises only when the eigenvectors are required. The code fragment of the QL algorithm that constructs the eigenvectors has the form below for the $i$th iteration:

    for(j=0;j<N;j++) {
        f = Z[i+1][j];
        /* apply the (c,s) plane rotation to rows i and i+1 of Z */
        Z[i+1][j] = s*Z[i][j] + c*f;
        Z[i][j]   = c*Z[i][j] - s*f;
    }
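Since the rotation treats each column $j$ independently, this loop parallelises naturally by splitting the columns of Z across processors, which is consistent with the $N^3/P$ eigenvector cost reported in the summary. The sketch below is our illustration of that idea rather than the paper's exact code; the column-block ownership bounds j0 and j1 are assumptions:

    /* Illustrative per-processor form of the QL eigenvector update: each
       processor applies the (c, s) plane rotation of iteration i only to
       its own block of columns [j0, j1) of Z.  The column-block
       distribution is an assumption made for this sketch. */
    void apply_rotation(double **Z, int i, int j0, int j1, double c, double s)
    {
        for (int j = j0; j < j1; j++) {
            double f = Z[i + 1][j];
            Z[i + 1][j] = s * Z[i][j] + c * f;
            Z[i][j]     = c * Z[i][j] - s * f;
        }
    }

Since every processor holds a copy of the tri-diagonal matrix, each can form c and s independently, and no communication is needed during the update itself.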
Within the tri-diagonalisation loop itself, the processor holding the current row forms W and the scalar s and broadcasts them, while every other processor posts the matching broadcast call, in the same owner-broadcasts pattern used by formZ below:

        form W and s
        MPI_Bcast(s,W)
      }
      else
        MPI_Bcast(s,W)
    }
    return(s)
As both the vectors $V_k$ and $W_k$ are held by each processor, the $Q_k$ vector (calculated in formQ) can be computed locally on each processor. The $A_k$ matrix is updated by the equation $A_{k-1} = A_k - 2W_kQ_k^T - 2Q_kW_k^T$, which involves only outer products of the locally stored $W_k$ and $Q_k$ vectors.
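As a short check of this update (assuming, consistent with the definitions above, that $V_k = A_kW_k$, $c = W_k^TV_k$ and $Q_k = V_k - cW_k$), expand the similarity transformation with $P_k = I - 2W_kW_k^T$:

$$\begin{aligned}
P_kA_kP_k &= A_k - 2W_kW_k^TA_k - 2A_kW_kW_k^T + 4W_k(W_k^TA_kW_k)W_k^T \\
          &= A_k - 2W_kV_k^T - 2V_kW_k^T + 4c\,W_kW_k^T \\
          &= A_k - 2W_k(V_k - cW_k)^T - 2(V_k - cW_k)W_k^T \\
          &= A_k - 2W_kQ_k^T - 2Q_kW_k^T.
\end{aligned}$$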
The diagonal and off-diagonal elements of the reduced part of the A matrix are stored in a local vector on the processor that holds row k. At the end of the tri-diagonalisation process these need to be gathered onto the other processors, so that all processors have a copy of the tri-diagonalised matrix as required by the QL procedure.
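A minimal sketch of this gathering step, assuming block-distributed elements; the counts and displs bookkeeping arrays are our own illustration, not the paper's data structures:

    #include <mpi.h>

    /* Gather the locally held pieces of the diagonal (or off-diagonal) of
       the tri-diagonal matrix onto every processor.  counts[r] and
       displs[r] give the number of elements owned by rank r and their
       offset in the assembled vector. */
    void gather_tridiag(const double *d_local, double *d_full,
                        const int *counts, const int *displs, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Allgatherv(d_local, counts[rank], MPI_DOUBLE,
                       d_full, counts, displs, MPI_DOUBLE, comm);
    }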
If the eigenvectors are not required then the transformation matrix $Z = P_{N-2} \cdots P_1$ is not needed and formZ can be omitted. The formation of Z begins at the bottom right-hand corner of the original A matrix, which now stores the $W_k$ row-wise. Proceeding from this corner, for which Z is the $2 \times 2$ unit matrix, we can form the $Z_k$ matrix in rows k up to N in the space formerly used to keep the $W_k$ up to $W_{N-2}$. In separating out the formation of the Z matrix from the
main iteration sequence we have opted for storage efficiency rather than execution speed. This means that we have to re-broadcast the $W_k$ vectors in order to do the $U = ZW$ matrix-vector product, and then perform a gathering operation to put a copy of the resultant vector U on every processor. The outer product $UW^T$ can then be done locally. In the following code description remember that A stores both the $W_k$ vectors and the Z matrix under construction.
formZ(A) {  Inputs A and outputs Z in the A matrix
    Make the bottom right-hand corner of A a 2 x 2 unit matrix
    for(k=N-2; k>0; k--) {
        if(ProcNoForRow[k] == MyProcNo) {
            recover W
            MPI_Bcast(W)
        }
        else {
            MPI_Bcast(W)
        }
        Do the local matrix-vector multiplication Ul = A W
        MPI_Allgather(Ul, U)
        Z = Z - 2 U W^T
        Zero row k and column k of Z
        Set Z[k][k] = 1
    }
}
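For concreteness, here is a hedged C rendering of this routine under simplifying assumptions that are ours rather than the paper's: a block row distribution with N divisible by the number of processors, the $W_k$ vectors already copied out of A into a local array w_store (indexed by local row), and the caller having initialised the local rows of Z with the bottom right-hand $2 \times 2$ unit matrix in place:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch of formZ: the rows of Z are block-distributed, the owner of
       row k re-broadcasts W_k, every processor forms its local piece of
       U = Z W, an Allgather completes U, and the outer-product update is
       then entirely local.  Names (Zl, w_store, rows_per_proc) are ours. */
    void formZ(double *Zl, const double *w_store, int N, int rows_per_proc,
               MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        double *W  = malloc(N * sizeof(double));
        double *Ul = malloc(rows_per_proc * sizeof(double));
        double *U  = malloc(N * sizeof(double));

        for (int k = N - 2; k > 0; k--) {
            int owner = k / rows_per_proc;          /* processor holding row k */
            int lk    = k % rows_per_proc;          /* its local row index     */
            if (rank == owner)                      /* recover W_k from store  */
                memcpy(W, w_store + (size_t)lk * N, N * sizeof(double));
            MPI_Bcast(W, N, MPI_DOUBLE, owner, comm);

            for (int i = 0; i < rows_per_proc; i++) { /* local part of U = Z W */
                Ul[i] = 0.0;
                for (int j = 0; j < N; j++)
                    Ul[i] += Zl[i * N + j] * W[j];
            }
            MPI_Allgather(Ul, rows_per_proc, MPI_DOUBLE, /* full U everywhere */
                          U, rows_per_proc, MPI_DOUBLE, comm);

            for (int i = 0; i < rows_per_proc; i++) /* Z = Z - 2 U W^T locally */
                for (int j = 0; j < N; j++)
                    Zl[i * N + j] -= 2.0 * U[rank * rows_per_proc + i] * W[j];

            for (int i = 0; i < rows_per_proc; i++) /* zero column k of Z      */
                Zl[i * N + k] = 0.0;
            if (rank == owner) {                    /* zero row k, Z[k][k] = 1 */
                for (int j = 0; j < N; j++) Zl[lk * N + j] = 0.0;
                Zl[lk * N + k] = 1.0;
            }
        }
        free(W); free(Ul); free(U);
    }

With one broadcast and one all-gather of length-N vectors per step, the communication over the N-2 steps costs $O(N^2\log_2 P)$ on a spanning-tree implementation, matching the fitted term discussed in the summary.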
3 Test Cases
The algorithms were tested using matrices that occur in a many-body quantum mechanical model of crystals for which there are two states per atom. The total number of available states for the system is thus $2^N$ for an N-atom system. The nature of the problem is such that the matrices for $l = 0, 1, \ldots, N$ particles on N crystal sites can be generated separately, and the resulting matrices have dimension $C_l^N = \frac{N!}{(N-l)!\,l!}$. The general sparsity pattern of these matrices is shown in figure 1.
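As a quick check of these dimensions, $C_6^{12} = 924$ and $C_7^{14} = 3432$, the two largest test matrices used below. A small C illustration of our own (the multiplicative form keeps every intermediate value an exact integer):

    #include <stdio.h>

    /* Dimension of the l-particle sector: C(N,l) = N! / ((N-l)! l!),
       computed multiplicatively to avoid forming the huge factorials. */
    unsigned long binom(unsigned n, unsigned l)
    {
        if (l > n - l) l = n - l;       /* symmetry: C(n,l) = C(n,n-l)    */
        unsigned long c = 1;
        for (unsigned i = 1; i <= l; i++)
            c = c * (n - l + i) / i;    /* division is exact at each step */
        return c;
    }

    int main(void)
    {
        /* the test-matrix dimensions used in tables 1 and 2 */
        printf("%lu %lu\n", binom(12, 6), binom(14, 7));   /* 924 3432 */
        return 0;
    }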
4 Results
The results presented below were all carried out on an IBM SP2, a heterogeneous 23-node machine with a subset of 16 'thin1' nodes on which our code was tested. Each 'thin1' node consists of a 66 MHz Power2 processor and 128 Mbytes of RAM with a 64-bit memory bus, a 32 Kbyte instruction cache and a 64 Kbyte data cache. The latency and bandwidth for the machine, as measured by simple processor-to-processor 'ping-pong' benchmarks, are 100 microseconds and 20 Mbytes/second [7].
For the scaling test presented in figure 2 the problem size was increased from N = 70 to N = 3432 with the number of processors fixed at 16.
Figure 1: The Sparsity Pattern of the 924 × 924 Test Matrix used in the Speed-up Experiments.
P        1     2     3     4     5     6     7     8
HH     114  66.4  46.3  36.8  30.5  26.5  23.8  22.8
QL    37.1  28.2  21.7  17.8  15.4  13.5  12.1  11.3
Total  151  94.6  68.0  54.6  45.9  40.0  35.9  34.1

P        9    10    11    12    13    14    15    16
HH    21.0  19.9  18.8  18.6  17.5  17.0  16.9  16.9
QL    10.3  9.86  9.27  8.85  8.58  8.13  8.08  7.67
Total 31.3  29.8  28.1  27.5  26.1  25.1  25.0  24.6

Table 1: Execution Times (in seconds) on P Processors for the QL and Householder Segments of the Diagonalisation Algorithm for a 924 × 924 Matrix.
N               70     126     252     462     924    1716    3432
HH            0.22    0.36    1.05    3.13    16.9    79.1     550
QL            0.05    0.09    0.33    1.42    7.67    38.9     264
(Av. Iters.) (1.71)  (1.67)  (1.71)  (1.75)  (1.89)  (1.75)  (1.89)
Total Time    0.27    0.45    1.38    4.55    24.6     118     814

Table 2: Execution Times (in seconds) on 16 Processors for the QL and Householder Segments of the Diagonalisation Algorithm for varying Problem Size, N. The average number of iterations per eigenvalue for the QL algorithm is also indicated.
Ideally the execution time should behave like $t \propto N^3$ for large N, which is indeed the case, as the fitted cubic curves in figure 2 show.
Figure 2: Execution Time (seconds) against Problem Size (N) on 16 Processors for the Householder and QL Segments, with the fitted curves 550(N/3432)^3 and 264(N/3432)^3.
Figure 3: Execution Time (seconds) against Number of Processors (P) for the Householder and QL Segments on the 924 × 924 Matrix, with the fitted curves 129/P + 2 log2(P) and 4.7 + 47/P.
5 Summary
We have shown that a relatively straightforward parallelisation of the Householder tri-diagonalisation algorithm and of the QL method for finding the eigenvalues and eigenvectors of the resulting matrix shows a time complexity like $N^3/P + N^2\log_2(P)$ for the Householder stage and like $N^2 + N^3/P$ for the QL algorithm, for fixed problem size on increasing numbers of processors, P. The $\log_2(P)$ term is empirically fitted to the data, but is consistent with MPI_Bcast and MPI_Allgather being implemented on a breadth-first spanning tree. If we use the timing results from figure 2 for large N, then
we find values of $0.6 \times 10^{-6}$ and $8800/(3432)^3 \approx 0.2 \times 10^{-6}$ for the two Householder coefficients, and $5.5 \times 10^{-6}$ and $4224/(3432)^3 \approx 0.1 \times 10^{-6}$ for the two QL coefficients. The $\log_2(P)$ communication term
is less relevant for larger problems, since the time for the main body of the algorithm is $\propto N^3$. If the eigenvectors are not required the QL algorithm remains sequential and the Householder method goes twice as fast. For large problems it might still be worth using a parallel version of Householder's algorithm, since this spreads the memory requirements over all of the processors and there is still a 30% time saving.
References
[1] M.M. Chawla and D.J. Evans. Parallel Householder Method for Linear Systems. International Journal of Computer Mathematics 58 (1995) 159-167.
[2] E.R. Davidson. The Iterative Calculation of a Few of the Lowest Eigenvalues and Corresponding Eigenvectors of Large Real-Symmetric Matrices. J. Computational Physics 17 (1975) 87.
[3] F.J. Ferreira, P.B. Vasconcelos and F.D. Dalmeida. Performance of a QR Algorithm Implementation on a Multicluster of Transputers. Computer Systems in Engineering 6 (1995) 363-367.
[4] G. Fox et al. Solving Problems on Concurrent Processors (Prentice-Hall International, Englewood Cliffs, 1988).
[5] V. Kumar, A. Grama, A. Gupta and G. Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms (Benjamin/Cummings, California).
[6] J.H. Mathews. Numerical Methods for Mathematics, Science and Engineering (Prentice-Hall International, Englewood Cliffs, 1992).
[7] P. Melas and E.J. Zaluska. High Performance Protocols for Clusters of Commodity Workstations. Lecture Notes in Computer Science 1470 (1998) 570-577.
[8] G.G. Meyer and M. Pascale. A Family of Parallel QR Factorisation Algorithms. Concurrency: Practice and Experience 8 (1996) 461-473.
[9] B.N. Parlett, H. Simon and L.M. Stringer. On Estimating the Largest Eigenvalue with the Lanczos Algorithm. Mathematics of Computation 38 (1982) 153-165.
[10] W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling. Numerical Recipes: The Art of Scientific Computing (Cambridge University Press, Cambridge, 1986).
[11] W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling. Numerical Recipes in Fortran 90: The Art of Scientific Computing (Cambridge University Press, Cambridge, 1996).