
An Efficient Parallel Version of the Householder-QL Matrix Diagonalisation Algorithm

J.S. Reeve and M. Heath
Department of Electronics and Computer Science
University of Southampton
Southampton SO17 1BJ, UK
email [email protected]
October 9, 1998

Abstract
In this paper we report an effective parallelisation of the Householder routine for the reduction of a real symmetric matrix to tri-diagonal form, and of the QL algorithm for the diagonalisation of the resulting matrix. The Householder algorithm scales like αN^3/P + βN^2 log_2(P) and the QL algorithm like γN^2 + δN^3/P as the number of processors P is increased for fixed problem size. The constant parameters α, β, γ and δ are obtained empirically. When only the eigenvalues are required the Householder method scales as above while the QL algorithm remains sequential. The code is implemented in C in conjunction with the Message Passing Interface (MPI) libraries and verified on a sixteen-node IBM SP2 for real matrices that occur in the simulation of properties of crystalline materials.

Keywords Householder; QL; Parallel Algorithms.


1 Introduction
The parallelisation of methods for finding the eigenvalues and eigenvectors of real symmetric matrices has received little attention over the years, for two reasons. Firstly, for very large systems quite often only a few eigenvalues and eigenvectors are needed. Secondly, the parallelisation of the best sequential techniques for matrix diagonalisation has proved difficult and ineffective.
In the simulation of the dynamics of large structures, only the lowest few eigenvalues, corresponding to the most likely occurring low-frequency modes of vibration, are needed. Similarly, in solid state physics, states at or near the ground state are those most likely to be occupied, at least at low temperatures, and so make the largest contribution to the material properties. For these reasons, in subjects like structural mechanics, techniques have concentrated on reducing the degrees of freedom in the system, from perhaps millions to a few hundred, enabling classical sequential techniques to be used. In physics and chemistry, where the original matrix is regularly structured and sparse, iterative techniques like the Lanczos[9] or Davidson[2] methods, which recover the lowest few eigenvalues and their corresponding eigenvectors, can be used effectively.
There are, however, many problems for which it is desirable to have all of the eigenvalues and eigenvectors of a system. Physical simulations of quantum mechanical systems that have a few particles present at finite temperature cannot be properly analysed by methods for large numbers of particles, like statistical mechanics, and are sufficiently far from their ground states as to be influenced by all available states. Similarly, in the design of mechanical structures, acoustic resonances might occur over a broad bandwidth. In such cases all eigenvalues and eigenvectors of large symmetric matrices are needed, for which no algorithm of complexity less than O(N^3), where N is the dimension of the matrix to be diagonalised, exists, regardless of the sparsity of the matrix.
The most efficient algorithm for the diagonalisation of large symmetric matrices is the Householder tri-diagonalisation procedure followed by the QL diagonalisation algorithm. The operation count for the Householder method is approximately N^3 whether or not eigenvectors are required, and the operation count for the QL method is roughly 30N^2 when the eigenvectors are not required and 3N^3 when they are. These counts presume that the average number of iterations is about 1.7[10], an assumption that is confirmed for the test matrices that we use. The only viable alternative is the Jacobi method, which takes about 20N^3 operations when eigenvectors are not required and about 40N^3 when they are, assuming fewer than 10 Jacobi cycles for convergence[10]. The Jacobi method parallelises readily[11] but requires more than 20 processors before becoming competitive with the sequential Householder-QL algorithm. Previous attempts at parallelising the Householder and QL algorithms have resulted in inefficient solutions[4] or have altered the algorithms in some way[12, 3, 1, 8].

2 Description of the Algorithm

Clear descriptions of the Householder and QL algorithms can be found in [10] and [6]. If A is an N × N symmetric matrix, then a sequence of N-2 transformations of the form P_{N-2} ... P_1 A P_1 ... P_{N-2} reduces the matrix to symmetric tri-diagonal form. The matrix P_k is formed from the requirement that the matrix A_k = P_k ... P_1 A P_1 ... P_k is tri-diagonal up to the kth column. For example, for a 6 × 6 matrix, A_1 has the form

    ( d_1  e_1   0    0    0    0 )
    ( e_1  d_2   x    x    x    x )
    (  0    x    x    x    x    x )
    (  0    x    x    x    x    x )
    (  0    x    x    x    x    x )
    (  0    x    x    x    x    x )

The P_k matrices can be written as P_k = I - 2 W_k W_k^T, where the vector W_k^T = R^{-1} (0, ..., 0, x_{k+1} + s, x_{k+2}, ..., x_N)^T is formed from the kth row (or column) of the A_{k-1} matrix, R normalises W_k, and s = sign(x_{k+1}) (x_{k+1}^2 + x_{k+2}^2 + ... + x_N^2)^{1/2}. The A_k matrix is then calculated as A_k = P_k A_{k-1} P_k, which can readily be shown to reduce to A_k = A_{k-1} - 2 W_k Q_k^T - 2 Q_k W_k^T, where Q_k = V_k - c W_k, with V_k = A_{k-1} W_k and c = W_k^T V_k.
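
To make the update equations concrete, the following is a minimal sequential sketch (ours, not the paper's code) of a single Householder step in C, acting on a dense row-major matrix a of size n; the function name and the scratch vectors w, v and q are illustrative only, and indices are 0-based (the paper's x_{k+1} is a[k][k+1]).

#include <math.h>

/* One sequential Householder step k (illustrative sketch): forms W_k and s
   from row k of the current matrix, then applies
   A <- A - 2 W Q^T - 2 Q W^T  with  V = A W, c = W^T V, Q = V - c W.
   A production code would guard the degenerate case r == 0.               */
void householder_step(int n, int k, double a[n][n],
                      double w[n], double v[n], double q[n])
{
    int i, j;
    double s = 0.0, r = 0.0, c = 0.0;

    /* s = sign(x_{k+1}) * (x_{k+1}^2 + ... + x_N^2)^{1/2}, x = row k of A */
    for (i = k + 1; i < n; i++) s += a[k][i] * a[k][i];
    s = (a[k][k+1] >= 0.0 ? 1.0 : -1.0) * sqrt(s);

    /* W = R^{-1} (0, ..., 0, x_{k+1} + s, x_{k+2}, ..., x_N), with |W| = 1 */
    for (i = 0; i <= k; i++) w[i] = 0.0;
    w[k+1] = a[k][k+1] + s;
    for (i = k + 2; i < n; i++) w[i] = a[k][i];
    for (i = k + 1; i < n; i++) r += w[i] * w[i];
    r = sqrt(r);
    for (i = k + 1; i < n; i++) w[i] /= r;

    /* V = A W,  c = W^T V,  Q = V - c W */
    for (i = 0; i < n; i++) {
        v[i] = 0.0;
        for (j = k + 1; j < n; j++) v[i] += a[i][j] * w[j];
        c += w[i] * v[i];
    }
    for (i = 0; i < n; i++) q[i] = v[i] - c * w[i];

    /* A <- A - 2 W Q^T - 2 Q W^T */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] -= 2.0 * (w[i] * q[j] + q[i] * w[j]);
}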

The complete transformation matrix Z = P_{N-2} ... P_1 is needed for the calculation of the eigenvectors. The normal method for reconstructing Z is to store the W_k vectors in the rows of the A matrix that have already been tri-diagonalised; the Z matrix can then be built using the same space, so that all the eigenvectors of A can be determined using only the memory occupied by the original data space.
The QL algorithm, which is used to find all the eigenvalues (and eigenvectors) of a symmetric tri-diagonal matrix, naturally follows the Householder transformation. The part of the QL algorithm that generates the eigenvalues has complexity O(N^2), and the part that builds the eigenvectors has complexity O(N^3), so it is only worthwhile parallelising the QL algorithm if the eigenvectors are required. The code fragment of the QL algorithm that constructs the eigenvectors has the following form for the ith iteration:
for (j = 0; j < N; j++) {
    f = Z[i+1][j];
    Z[i+1][j] = s*Z[i][j] + c*f;
    Z[i][j]   = c*Z[i][j] - s*f;
}

When the transformation matrix Z is stored by row-wise striping[5], the update of the Z matrix can be performed independently on each processor, except when the ith and (i+1)th rows are on different processors. As the iteration sequence runs from N to 1, at most a single exchange of rows is needed for each iteration. This dictates that the transformation matrix from the Householder algorithm also be decomposed by row-wise striping, which in turn means that we have to use the same storage method for the original matrix. In fact the only significant amount of memory needed is that occupied by the original matrix, assuming of course that this is stored as a dense matrix. This important property needs to be retained in the parallel algorithm.
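
One way the cross-processor case might be handled is sketched below (our illustration in C with MPI, not the paper's code; the function and parameter names are assumptions). When the rotation for iteration i touches rows i and i+1 that live on different processors, the two owners first swap copies of their rows, then each applies the rotation to the row it owns; when both rows are local, the fragment above is applied unchanged.

#include <mpi.h>

/* Illustrative sketch: apply the QL rotation (c, s) to rows i and i+1 of Z
   when they live on different processors.  zrow is the caller's copy of the
   row it owns, scratch is a buffer of the same length n, and owner_i and
   owner_ip1 are the ranks holding rows i and i+1 respectively.             */
void rotate_boundary_rows(double *zrow, double *scratch, int n,
                          int owner_i, int owner_ip1,
                          double c, double s, MPI_Comm comm)
{
    int rank, j;
    MPI_Comm_rank(comm, &rank);
    if (rank != owner_i && rank != owner_ip1) return;   /* not involved */

    int partner = (rank == owner_i) ? owner_ip1 : owner_i;
    /* exchange the two boundary rows so both owners can rotate locally */
    MPI_Sendrecv(zrow,    n, MPI_DOUBLE, partner, 0,
                 scratch, n, MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);

    for (j = 0; j < n; j++) {
        double zi = (rank == owner_i)   ? zrow[j] : scratch[j];  /* old Z[i][j]   */
        double f  = (rank == owner_ip1) ? zrow[j] : scratch[j];  /* old Z[i+1][j] */
        /* keep only the row this processor owns */
        zrow[j] = (rank == owner_ip1) ? s*zi + c*f : c*zi - s*f;
    }
}
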
In the mathematical description of the Householder algorithm the iteration number is indicated by subscripting the vectors; for example, W_k is the W vector at the kth iteration. In the pseudo-code there is only one storage location for each vector and, in the case of W_k, it is denoted W. The iteration number in the pseudo-code is k.

The pseudo-code outline for the Householder routine is:

for (k = 0; k < N-2; k++) {
    s = formW(W, A, k)
    formV(V, A, W)
    formQ(Q, V, W, s, k)
    formA(A, Q, W, s, k)
}
formZ(A)
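
The routines below refer to an ownership map ProcNoForRow; the paper does not spell out the striping, so the block-striped assignment sketched here is one possible choice, shown purely for illustration (the function names are ours).

/* One possible row-wise striping (illustrative only): rows are assigned to
   processors in contiguous blocks, the first n % nproc processors holding
   one extra row.                                                           */
int proc_for_row(int row, int n, int nproc)
{
    int base = n / nproc, extra = n % nproc;
    if (row < (base + 1) * extra)
        return row / (base + 1);
    return extra + (row - (base + 1) * extra) / base;
}

int first_local_row(int rank, int n, int nproc)
{
    int base = n / nproc, extra = n % nproc;
    return rank * base + (rank < extra ? rank : extra);
}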

As the matrix A_{k-1} is stored row-wise, the formation of W_k is done by one processor and broadcast to the rest. The quantity s, which is needed for subsequent parts of the calculation, is also distributed in this same broadcast. The pseudo-code outline for formW is:
double formW(W, A, k)    /* inputs A_{k-1} and outputs W_k */
    if (ProcNoForRow[k] == MyProcNo) {
        form W and s
        MPI_Bcast(s, W)
    } else {
        MPI_Bcast(s, W)
    }
    return(s)
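
A minimal sketch of this broadcast is given below (ours; packing s into the same buffer as W is an assumed detail, not stated in the paper). The owner of row k appends s to the end of the W buffer so that a single MPI_Bcast of n+1 doubles distributes both quantities.

#include <mpi.h>

/* Illustrative sketch: wbuf[0..n-1] holds W (already formed on the root),
   wbuf[n] carries s; one broadcast distributes both to every processor.   */
double broadcast_w_and_s(double *wbuf, int n, double s_local,
                         int root, int myrank, MPI_Comm comm)
{
    if (myrank == root)
        wbuf[n] = s_local;
    MPI_Bcast(wbuf, n + 1, MPI_DOUBLE, root, comm);
    return wbuf[n];          /* every processor now holds W and s */
}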

Calculation of the vector V_k = A_{k-1} W_k requires a matrix-vector product. Since each processor has a copy of the W_k vector, this requires a local matrix-vector product followed by a gathering operation on the local slices of V_k, so that each processor has a complete copy of V_k.
formV(V, A, W, k)    /* inputs A_{k-1} and W_k and outputs V_k */
    do the local matrix-vector multiply V_l = A W
    MPI_Allgather(V_l, V)
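
A possible realisation of formV is sketched below (ours, not the paper's code). Each processor multiplies its local block of rows of A by the replicated vector W and the slices are then gathered onto every processor; MPI_Allgatherv is used here rather than the MPI_Allgather named in the outline simply because the local slices need not all be the same length.

#include <mpi.h>

/* Illustrative sketch of formV: a_local holds nlocal contiguous rows of A
   (row-major, n columns), counts/displs describe every processor's slice. */
void form_v(const double *a_local, int nlocal, int n, const double *w,
            double *v_local, double *v, int *counts, int *displs,
            MPI_Comm comm)
{
    int i, j;
    for (i = 0; i < nlocal; i++) {
        double sum = 0.0;
        for (j = 0; j < n; j++)
            sum += a_local[i * n + j] * w[j];
        v_local[i] = sum;
    }
    MPI_Allgatherv(v_local, nlocal, MPI_DOUBLE,
                   v, counts, displs, MPI_DOUBLE, comm);
}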

As both the vectors V_k and W_k are held by each processor, the Q_k vector (calculated in formQ) can be computed locally on each processor. The A_k matrix is updated by the equation A_k = A_{k-1} - 2 W_k Q_k^T - 2 Q_k W_k^T, which involves only outer products of the locally stored W_k and Q_k vectors. The diagonal and off-diagonal elements of the reduced part of the A matrix are stored in a local vector on the processor that holds row k. At the end of the tri-diagonalisation process these need to be gathered onto the other processors so that all processors have a copy of the tri-diagonalised matrix, as required by the QL procedure. If the eigenvectors are not required then the transformation matrix Z = P_{N-2} ... P_1 is not needed and formZ can be omitted. When the eigenvectors are required, Z is updated recursively from the formula Z_k = Z_{k+1} - 2 Z_{k+1} W_k W_k^T. The formation of Z starts from the bottom right-hand corner of the original A matrix, which now stores the W_k row-wise. Proceeding from this corner, for which Z is the 2 × 2 unit matrix, we can form the Z_k matrix in rows k up to N in the space formerly used to keep the W_k up to W_{N-2}. In separating out the formation of the Z matrix from the main iteration sequence we have chosen to opt for storage efficiency rather than execution speed. This means that we have to re-broadcast the W_k vectors in order to do the U = Z W matrix-vector product, and then perform a gathering operation to put a copy of the resultant vector U on every processor. The outer product U W^T can then be applied locally. In the following code description remember that A stores both the W_k vectors and the Z matrix under construction.
formZ(A)    /* inputs A_{N-2} and outputs Z in the A matrix */
    make the bottom right-hand corner of A a 2 × 2 unit matrix
    for (k = N-2; k > 0; k--) {
        if (ProcNoForRow[k] == MyProcNo) {
            recover W
            MPI_Bcast(W)
        } else {
            MPI_Bcast(W)
        }
        do the local matrix-vector multiplication U_l = A W
        MPI_Allgather(U_l, U)
        Z = Z - 2 U W^T
        zero row k and column k of Z
        set Z[k][k] = 1
    }
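
The update step in the body of this loop reduces, on each processor, to an update of its own rows of Z only; a minimal local sketch (ours) of that step is given below, where row0 corresponds to first_local_row in the striping illustration earlier.

/* Illustrative sketch of the local part of formZ: after W has been broadcast
   and U = Z W gathered, each processor applies Z <- Z - 2 U W^T to the
   nlocal rows it owns.  z_local is row-major with n columns and row0 is the
   global index of this processor's first row.                              */
void update_z_rows(double *z_local, int nlocal, int row0, int n,
                   const double *u, const double *w)
{
    int i, j;
    for (i = 0; i < nlocal; i++)
        for (j = 0; j < n; j++)
            z_local[i * n + j] -= 2.0 * u[row0 + i] * w[j];
}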

3 Test Cases
The algorithms were tested using matrices that occur in a many-body quantum mechanical model of crystals for which there are two states per atom. The total number of available states for the system is thus 2^N for an N-atom system. The nature of the problem is such that the matrices for l = 0, 1, ..., N particles on N crystal sites can be generated separately, and the resulting matrices have dimension C_l^N = N! / ((N-l)! l!). The general sparsity pattern of the test matrices is shown in figure 1, in which the non-zero elements (about 3%) are all one; we store the original matrix as a dense matrix since the storage is needed for the eigenvectors anyway. It must be stressed, too, that sparsity does not affect the complexity of the Householder algorithm, and the complexity of the QL algorithm is dominated by the number of iterations required, which is about 1.7N for the matrices we use, although the maximum number of iterations required (to find the first eigenvalue) can vary markedly, from 10 or so to something like N.
The fact that the algorithm requires only minimal storage, for the original matrix alone, is an important property when dealing with matrices that grow exponentially as the physical size of the problem is increased linearly. For instance, the 3432 × 3432 matrix diagonalised in the scaling experiments requires about 100 Mbytes of storage for the original matrix.
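
As an illustration (ours) of these sizes: 3432 = C_7^14 (for example, 7 particles on 14 sites), and a dense double-precision matrix of that dimension occupies roughly the 100 Mbytes quoted above.

#include <stdio.h>

/* Illustrative check of the test-matrix dimension C_l^N = N!/((N-l)! l!)
   and of the dense storage it implies.                                    */
static long binom(int n, int l)
{
    long c = 1;
    int i;
    for (i = 1; i <= l; i++)
        c = c * (n - l + i) / i;    /* exact at every step */
    return c;
}

int main(void)
{
    long dim = binom(14, 7);                                   /* 3432 */
    double mbytes = (double)dim * dim * sizeof(double) / 1.0e6;
    printf("dim = %ld, dense storage = %.0f Mbytes\n", dim, mbytes);
    return 0;
}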

4 Results
The results presented below were all obtained on an IBM SP2, a heterogeneous 23-node machine with a subset of 16 `thin1' nodes on which our code was tested. Each `thin1' node consists of a 66 MHz Power2 processor, 128 Mbytes of RAM with a 64-bit memory bus, a 32 Kbyte instruction cache and a 64 Kbyte data cache. The latency and bandwidth of the machine, as measured by simple processor-to-processor `ping-pong' benchmarks, are 100 microseconds and 20 Mbytes/second[7].
For the scaling test presented in figure 2 the problem size was increased from N = 70 to N = 3432 with the number of processors fixed at 16.

Figure 1: The Sparsity Pattern of the 924 × 924 Test Matrix used in the Speed-up Experiments.

P        1      2      3      4      5      6      7      8
HH     114     66.4   46.3   36.8   30.5   26.5   23.8   22.8
QL      37.1   28.2   21.7   17.8   15.4   13.5   12.1   11.3
Total  151     94.6   68.0   54.6   45.9   40.0   35.9   34.1

P        9     10     11     12     13     14     15     16
HH      21.0   19.9   18.8   18.6   17.5   17.0   16.9   16.9
QL      10.3    9.86   9.27   8.85   8.58   8.13   8.08   7.67
Total   31.3   29.8   28.1   27.5   26.1   25.1   25.0   24.6

Table 1: Execution Times (in seconds) on P Processors for the QL and Householder Segments of the Diagonalisation Algorithm for a 924 × 924 Matrix.

N              70     126    252    462    924    1716   3432
HH              0.22    0.36   1.05   3.13  16.9   79.1   550
QL              0.05    0.09   0.33   1.42   7.67  38.9   264
(Av. Iters.)   (1.71)  (1.67) (1.71) (1.75) (1.89) (1.75) (1.89)
Total Time      0.27    0.45   1.38   4.55  24.6  118    814

Table 2: Execution Times (in seconds) on 16 Processors for the QL and Householder Segments of the Diagonalisation Algorithm for varying Problem Size N. The average number of iterations per eigenvalue for the QL algorithm is also indicated.

Ideally the execution time should behave like t ∝ N^3 for large N, which is indeed the case. The constant of proportionality is approximately 1.88 times larger than ideal, as is seen from the overall efficiency of the algorithm on 16 processors.
The speed-up of the algorithm was tested for a problem size of N = 924 on one to sixteen processors. The results presented in figure 3 show a difference between the efficiencies (on 16 processors) of the Householder and QL routines. To estimate the efficiency we calculate the ratio of the time on one processor to P times the time on P processors. We find that the efficiency is E_HH ≈ 0.50 for the Householder routine and E_QL ≈ 0.46 for the QL algorithm. The super-efficiency effect for the QL algorithm is due to a marked reduction in data cache misses when the data is distributed over a large number of processors. The overall efficiency is E ≈ 0.49 for 16 processors.
The measured data are shown in tables 1 and 2.

Figure 2: Execution Times for Varying Problem Sizes on 16 Processors. (Execution time in seconds against problem size N; the fitted curves are 550(N/3432)^3 for the Householder segment and 264(N/3432)^3 for the QL segment.)



Figure 3: Execution Times for a 924 × 924 Matrix on 2 to 16 Processors. (Execution time in seconds against number of processors P; the fitted curves are 129/P + 2 log_2(P) for the Householder segment and 4.7 + 47/P for the QL segment.)



5 Summary
We have shown that a relatively straightforward parallelisation of the Householder tri-diagonalisation algorithm and of the QL method for finding the eigenvalues and eigenvectors of the resulting matrix shows a time complexity like αN^3/P + βN^2 log_2(P) for the Householder problem and like γN^2 + δN^3/P for the QL algorithm, for fixed problem size on increasing numbers of processors P. The log_2(P) term is empirically fitted to the data, but is consistent with MPI_Bcast and MPI_Allgather being implemented on a breadth-first spanning tree. If we use the timing results from figure 2 for large N, then we find values of α ≈ 8800/(3432)^3, β ≈ 0.6 × 10^-6, γ ≈ 5.5 × 10^-6 and δ ≈ 4224/(3432)^3.
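
For reference, the fitted model can be written out directly; the following small sketch (ours) expresses the two empirical timing formulas as C functions with the constants above, and is illustrative only.

#include <math.h>

/* Empirical timing model (seconds), with the constants quoted above;
   alpha and delta are expressed relative to the largest test case N = 3432. */
static double t_householder(double N, double P)
{
    const double alpha = 8800.0 / pow(3432.0, 3);
    const double beta  = 0.6e-6;
    return alpha * N * N * N / P + beta * N * N * log2(P);
}

static double t_ql(double N, double P)
{
    const double gamma_ = 5.5e-6;
    const double delta  = 4224.0 / pow(3432.0, 3);
    return gamma_ * N * N + delta * N * N * N / P;
}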

For the Householder algorithm, the communications penalty βN^2 log_2(P) is less relevant for larger problems, since the time for the main body of the algorithm is proportional to N^3. If the eigenvectors are not required the QL algorithm remains sequential and the Householder method goes twice as fast. For large problems it might still be worth using a parallel version of the Householder algorithm, since this spreads the memory requirements over all of the processors and there is still a 30% time saving.

References
[1] M.M. Chawla and D.J. Evans. Parallel Householder Method for Linear Systems. International Journal of Computer Mathematics 58 (1995) 159-167.
[2] E.R. Davidson. The Iterative Calculation of a Few of the Lowest Eigenvalues and Corresponding Eigenvectors of Large Real-Symmetric Matrices. J. Computational Physics 17 (1975) 87.
[3] F.J. Ferreira, P.B. Vasconcelos and F.D. Dalmeida. Performance of a QR Algorithm Implementation on a Multicluster of Transputers. Computer Systems in Engineering 6 (1995) 363-367.
[4] G. Fox et al., Solving Problems on Concurrent Processors (Prentice-Hall International, Englewood Cliffs, 1988).
[5] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction to Parallel Computing - Design and Analysis of Algorithms (Benjamin Cummings, California).
[6] J.H. Mathews, Numerical Methods for Mathematics, Science and Engineering (Prentice-Hall International, Englewood Cliffs, 1992).
[7] P. Melas and E.J. Zaluska. High Performance Protocols for Clusters of Commodity Workstations. Lecture Notes in Computer Science 1470 (1998) 570-577.
[8] G.G. Meyer and M. Pascale. A Family of Parallel QR Factorisation Algorithms. Concurrency: Practice and Experience 8 (1996) 461-473.
[9] B.N. Parlett, H. Simon and L.M. Stringer. On Estimating the Largest Eigenvalue With the Lanczos Algorithm. Mathematics of Computation 38 (1982) 153-165.
[10] W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes - The Art of Scientific Computing (Cambridge University Press, Cambridge, 1986).
[11] W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes in Fortran 90 - The Art of Scientific Computing (Cambridge University Press, Cambridge, 1996).
[12] M.A. Shalaby. A Parallel Processed Scheme for the Eigenproblem of Positive-Definite Matrices. J. of Computational and Applied Maths. 54 (1994) 99-106.
