PARALLEL SCIENTIFIC COMPUTATION
A structured approach using BSP and MPI

ROB H. BISSELING
Utrecht University
Oxford University Press, Great Clarendon Street, Oxford OX2 6DP
© Oxford University Press 2004
First published 2004
ISBN 0 19 852939 2
Plate 1. Sparse matrix prime60 distributed over four processors of a parallel
computer. Cf. Chapter 4.
PREFACE
Why this book on parallel scientific computation? In the past two decades,
several shelves full of books have been written on many aspects of parallel
computation, including scientific computation aspects. My intention to add
another book calls for some motivation. To say it in a few words, the time is
ripe now. The pioneering decade of parallel computation, from 1985 to 1995,
is well behind us. In 1985, one could buy the first parallel computer from
a commercial vendor. If you were one of the few early users, you probably
found yourself working excessively hard to make the computer perform well
on your application; most likely, your work was tedious and the results were
frustrating. If you endured all this, survived, and are still interested in parallel
computation, you deserve strong sympathy and admiration!
Fortunately, the situation has changed. Today, one can theoretically
develop a parallel algorithm, analyse its performance on various architectures,
implement the algorithm, test the resulting program on a PC or cluster of PCs,
and then run the same program with predictable efficiency on a massively
parallel computer. In most cases, the parallel program is only slightly more
complicated than the corresponding sequential program and the human time
needed to develop the parallel program is not much more than the time needed
for its sequential counterpart.
This change has been brought about by improvements in parallel hardware
and software, together with major advances in the theory of parallel pro-
gramming. An important theoretical development has been the advent of the
Bulk Synchronous Parallel (BSP) programming model proposed by Valiant in
1989 [177,178], which provides a useful and elegant theoretical framework for
bridging the gap between parallel hardware and software. For this reason, I
have adopted the BSP model as the target model for my parallel algorithms.
In my experience, the simplicity of the BSP model makes it suitable for teach-
ing parallel algorithms: the model itself is easy to explain and it provides a
powerful tool for expressing and analysing algorithms.
An important goal in designing parallel algorithms is to obtain a good
algorithmic structure. One way of achieving this is by designing an algorithm
as a sequence of large steps, called supersteps in BSP language, each con-
taining many basic computation or communication operations and a global
synchronization at the end, where all processors wait for each other to finish
their work before they proceed to the next superstep. Within a superstep, the
work is done in parallel, but the global structure of the algorithm is sequen-
tial. This simple structure has proven its worth in practice in many parallel
applications, within the BSP world and beyond.
problems because they are important for applications and because they give
rise to a variety of important parallelization techniques. This book treats
well-known subjects such as dense LU decomposition, fast Fourier transform
(FFT), and sparse matrix–vector multiplication. One can view these sub-
jects as belonging to the area of numerical linear algebra, but they are also
fundamental to many applications in scientific computation in general. This
choice of problems may not be highly original, but I made an honest attempt
to approach these problems in an original manner and to present efficient
parallel algorithms and programs for their solution in a clear, concise, and
structured way.
Since this book should serve as a textbook, it covers a limited but care-
fully chosen amount of material; I did not strive for completeness in covering
the area of numerical scientific computation. A vast number of sequen-
tial algorithms can be found in Matrix Computations by Golub and Van
Loan [79] and Numerical Recipes in C: The Art of Scientific Computing by
Press, Teukolsky, Vetterling, and Flannery [157]. In my courses on parallel
algorithms, I have the habit of assigning sequential algorithms from these
books to my students and asking them to develop parallel versions. Often,
the students go out and perform an excellent job. Some of these assign-
ments became exercises in the present book. Many exercises have the form
of programming projects, which are suitable for use in an accompanying
computer-laboratory class. I have graded the exercises according to diffi-
culty/amount of work involved, marking an exercise by an asterisk if it requires
more work and by two asterisks if it requires a lot of work, meaning that it
would be suitable as a final assignment. Inevitably, such a grading is subject-
ive, but it may be helpful for a teacher in assigning problems to students.
The main text of the book treats a few central topics from parallel scientific
computation in depth; the exercises are meant to give the book breadth.
The structure of the book is as follows. Chapter 1 introduces the BSP
model and BSPlib, and as an example it presents a simple complete parallel
program. This two-page program alone already teaches half the primitives of
BSPlib. (The other half is taught by the program of Chapter 4.) The first
chapter is a concise and self-contained tutorial, which tells you how to get
started with writing BSP programs, and how to benchmark your computer as
a BSP computer. Chapters 2–4 present parallel algorithms for problems with
increasing irregularity. Chapter 2 on dense LU decomposition presents a reg-
ular computation with communication patterns that are common in matrix
computations. Chapter 3 on the FFT also treats a regular computation but
one with a more complex flow of data. The execution time requirements of the
LU decomposition and FFT algorithms can be analysed exactly and the per-
formance of an implementation can be predicted quite accurately. Chapter 4
presents the multiplication of a sparse matrix and a dense vector. The com-
putation involves only those matrix elements that are nonzero, so that in
general it is irregular. The communication involves the components of dense
input and output vectors. Although these vectors can be stored in a regular
data structure, the communication pattern becomes irregular because efficient
communication must exploit the sparsity. The order in which the chapters can
be read is: 1, 2, then either 3 or 4, depending on your taste. Chapter 3 has
the brains, Chapter 4 has the looks, and after you have finished both you
know what I mean. Appendix C presents MPI programs in the order of the
corresponding BSPlib programs, so that it can be read in parallel with the
main text; it can also be read afterwards. I recommend reading the appendix,
even if you do not intend to program in MPI, because it illustrates the vari-
ous possible choices that can be made for implementing communications and
because it makes the differences and similarities between BSPlib and MPI
clear.
Each chapter contains: an abstract; a brief discussion of a sequential
algorithm, included to make the material self-contained; the design and
analysis of a parallel algorithm; an annotated program text; illustrative
experimental results of an implementation on a particular parallel computer;
bibliographic notes, giving historical background and pointers for further
reading; theoretical and practical exercises.
My approach in presenting algorithms and program texts has been to give
priority to clarity, simplicity, and brevity, even if this comes at the expense of
a slight decrease in efficiency. In this book, algorithms and programs are only
optimized if this teaches an important technique, or improves efficiency by
an order of magnitude, or if this can be done without much harm to clarity.
Hints for further optimization are given in exercises. The reader should view
the programs as a starting point for achieving fast implementations.
One goal of this book is to ease the transition from theory to practice. For
this purpose, each chapter includes an example program, which presents a
possible implementation of the central algorithm in that chapter. The program
texts form a small but integral part of this book. They are meant to be read
by humans, besides being compiled and executed by computers. Studying the
program texts is the best way of understanding what parallel programming is
really about. Using and modifying the programs gives you valuable hands-on
experience.
The aim of the section on experimental results is to illustrate the the-
oretical analysis. Often, one aspect is highlighted; I made no attempt to
perform an exhaustive set of experiments. A real danger in trying to explain
experimental results for an algorithm is that a full explanation may lead to
a discussion of nitty-gritty implementation details or hardware quirks. This
is hardly illuminating for the algorithm, and therefore I have chosen to keep
such explanations to a minimum. For my experiments, I have used six dif-
ferent parallel machines, older ones as well as newer ones: parallel computers
come and go quickly.
The bibliographic notes of this book are lengthier than usual, since I have
tried to summarize the contents of the cited work and relate them to the topic
discussed in the current chapter. Often, I could not resist the temptation to
write a few sentences about a subject not fully discussed in the main text,
but still worth mentioning.
The source files of the printed program texts, together with a set
of test programs that demonstrate their use, form a package called
BSPedupack, which is available at http://www.math.uu.nl/people/bisseling/software.html.
The MPI version, called MPIedupack, is also
available from that site. The packages are copyrighted, but freely available
under the GNU General Public License, meaning that their programs can be
used and modified freely, provided the source and all modifications are men-
tioned, and every modification is again made freely available under the same
license. As the name says, the programs in BSPedupack and MPIedupack
are primarily intended for teaching. They are definitely not meant to be used
as a black box. If your program, or worse, your airplane crashes because
BSP/MPIedupack is not sufficiently robust, it is your own responsibility. Only
rudimentary error handling has been built into the programs. Other software
available from my software site is the Mondriaan package [188], which is used
extensively in Chapter 4. This is actual production software, also available
under the GNU General Public License.
To use BSPedupack, a BSPlib implementation such as the Oxford BSP
toolset [103] must have been installed on your computer. As an alternative,
you can use BSPedupack on top of the Paderborn University BSP (PUB)
library [28,30], which contains BSPlib as a subset. If you have a cluster of PCs,
connected by a Myrinet network, you may want to use the Panda BSP library,
a BSPlib implementation by Takken [173], which is soon to be released. The
programs of this book have been tested extensively for the Oxford BSP toolset.
To use the first four programs of MPIedupack, you need an implementation
of MPI-1. This often comes packaged with the parallel computer. The fifth
program needs MPI-2. Sometimes, part of the MPI-2 extensions has been
supplied by the computer vendor. A full public-domain implementation of
MPI-2 is expected to become available in the near future for many different
architectures as a product of the MPICH-2 project.
If you prefer to use a different communication library than BSPlib, you
can port BSPlib programs to other systems such as MPI, as demonstrated by
Appendix C. Porting out of BSPlib is easy, because of the limited number of
BSPlib primitives and because of the well-structured programs that are the
result of following the BSP model. It is my firm belief that if you use MPI or
another communication library, for historical or other reasons, you can benefit
tremendously from the structured approach to parallel programming taught in
this book. If you use MPI-2, and in particular its one-sided communications,
you are already close to the bulk synchronous parallel world, and this book
may provide you with a theoretical framework.
The programming language used in this book is ANSI C [121]. The reason
for this is that many students learn C as their first or second programming lan-
guage and that efficient C compilers are available for many different sequential
and parallel computers. Portability is the name of the game for BSP software.
The choice of using C together with BSPlib will make your software run on
almost every computer. Since C is a subset of C++, you can also use C++
together with BSPlib. If you prefer to use another programming language,
BSPlib is also available in Fortran 90.
Finally, let me express my hope and expectation that this book will trans-
form your barriers: your own activation barrier for parallel programming will
disappear; instead, synchronization barriers will appear in your parallel pro-
grams and you will know how to use them as an effective way of designing
well-structured parallel programs.
R.H. Bisseling
Utrecht University
July 2003
ACKNOWLEDGEMENTS
First of all, I would like to thank Bill McColl of Oxford University for
introducing the BSP model to me in 1992 and convincing me to abandon
my previous habit of developing special-purpose algorithms for mesh-based
parallel computers. Thanks to him, I turned to designing general-purpose
algorithms that can run on every parallel computer. Without Bill’s encour-
agement, I would not have written this book.
Special mention should be made of Jon Hill of Oxford University, now at
Sychron Ltd, who is co-designer and main implementor of BSPlib. The BSPlib
standard gives the programs in the main text of this book a solid foundation.
Many discussions with Jon, in particular during the course on BSP we gave
together in Jerusalem in 1997, were extremely helpful in shaping this book.
Several visits abroad gave me feedback and exposure to constructive
criticism. These visits also provided me the opportunity to test the par-
allel programs of this book on a variety of parallel architectures. I would
like to thank my hosts Michael Berman of BioMediCom Ltd in Jerusalem,
Richard Brent and Bill McColl of Oxford University, Iain Duff of CERFACS
in Toulouse, Jacko Koster and Fredrik Manne of the University of Bergen,
Satish Rao of NEC Research at Princeton, Pilar de la Torre of the Univer-
sity of New Hampshire, and Leslie Valiant of Harvard University, inventor of
the BSP model. I appreciate their hospitality. I also thank the Engineering
and Physical Sciences Research Council in the United Kingdom for funding
my stay in Oxford in 2000, which enabled me to make much progress with
this book.
I would like to thank the Oxford Supercomputer Centre for granting access
to their SGI Origin 2000 supercomputer and I am grateful to Jeremy Martin
and Bob McLatchie for their help in using this machine. I thank Sychron
Ltd in Oxford for giving me access to their PC cluster. In the Netherlands,
I would like to acknowledge grants of computer time and funding of two
postdocs by the National Computer Facilities foundation (NCF). Patrick
Aerts, director of the NCF, has tremendously stimulated the development
of the high-performance computing infrastructure in the Netherlands, and it
is thanks to him that I could use so many hours of computing time on so
many different parallel computers. I thank the supercomputer centres HPαC
in Delft and SARA in Amsterdam for access to their computers, with per-
sonal thanks to Jana Vasiljev and Willem Vermin for supporting BSPlib at
these centres. I also thank Aad van der Steen for help in accessing DAS-2, the
Dutch distributed supercomputer.
Helpful comments on parts of the manuscript were given by Jacko Koster, Frank van Lingen, Ronald Meester, Adina Milston, John
Reid, Dan Stefanescu, Pilar de la Torre, Leslie Valiant, and Yael Weinbach.
Aesthetic advice has been given by Ron Bertels, Lidy Bisseling, Gerda
Dekker, and Gila and Joel Kantor. Thanks to all of them. Disclaimer: if
you find typing errors, small flaws, serious flaws, unintended Dutch, or
worse, do not blame them, just flame me! All comments are welcome at:
[email protected]. I thank my editors at Oxford University Press,
Elizabeth Johnston, Alison Jones, and Mahua Nandi, for accepting my vision
of this book and for their ideas, good judgement, help, and patience.
Finally, in the writing of this book, I owe much to my family. My wife
Rona showed love and sympathy, and gave support whenever needed. Our
daughter Sarai, born in 1994, has already acquired quite some mathematical
and computer skills. I tested a few exercises on her (admittedly, unmarked
ones), and am amazed how much a nine-year-old can understand about parallel
computing. If she can, you can. Sarai provided me with the right amount of
distraction and the proper perspective. Furthermore, one figure in the book
is hers.
CONTENTS

1 Introduction
  1.1 Wanted: a gentle parallel programming model
  1.2 The BSP model
  1.3 BSP algorithm for inner product computation
  1.4 Starting with BSPlib: example program bspinprod
  1.5 BSP benchmarking
  1.6 Example program bspbench
  1.7 Benchmark results
  1.8 Bibliographic notes
      1.8.1 BSP-related models of parallel computation
      1.8.2 BSP libraries
      1.8.3 The non-BSP world: message passing
      1.8.4 Benchmarking
  1.9 Exercises

2 LU decomposition
  2.1 The problem
  2.2 Sequential LU decomposition
  2.3 Basic parallel algorithm
  2.4 Two-phase broadcasting and other improvements
  2.5 Example function bsplu
  2.6 Experimental results on a Cray T3E
  2.7 Bibliographic notes
      2.7.1 Matrix distributions
      2.7.2 Collective communication
      2.7.3 Parallel matrix computations
  2.8 Exercises

3 The fast Fourier transform
  3.1 The problem
  3.2 Sequential recursive fast Fourier transform
  3.3 Sequential nonrecursive algorithm
  3.4 Parallel algorithm
  3.5 Weight reduction
  3.6 Example function bspfft
  3.7 Experimental results on an SGI Origin 3800
  3.8 Bibliographic notes
      3.8.1 Sequential FFT algorithms

Index
1  INTRODUCTION
programming. Unfortunately, until recently this has not been the case, and
parallel computing used to be a very specialized area where exotic parallel
algorithms were developed for even more exotic parallel architectures, where
software could not be reused and many man-years of effort were wasted in
developing software of limited applicability. Automatic parallelization by com-
pilers could be a solution for this problem, but this has not been achieved yet,
nor is it likely to be achieved in the near future. Our only hope of harnessing
the power of parallel computing lies in actively engaging ourselves in parallel
programming. Therefore, we might as well try to make parallel programming
easy and effective, turning it into a natural activity for everyone who writes
computer programs.
An important step forward in making parallel programming easier has been
the development of portability layers, that is, communication software such
as PVM [171] and MPI [137] that enable us to run the same parallel program
on many different parallel computers without changing a single line of program
text. Still, the resulting execution time behaviour of the program on a new
machine is unpredictable (and can indeed be rather erratic), due to the lack
of an underlying parallel programming model.
To achieve the noble goal of easy parallel programming we need a
model that is simple, efficiently implementable, and acceptable to all parties
involved: hardware designers, software developers, and end users. This
model should not interfere with the process of designing and implementing
algorithms. It should exist mainly in the background, being tacitly under-
stood by everybody. Such a model would encourage the use of parallel
computers in the same way as the Von Neumann model did for the sequential
computer.
The bulk synchronous parallel (BSP) model proposed by Valiant in
1989 [177,178] satisfies all requirements of a useful parallel programming
model: the BSP model is simple enough to allow easy development and ana-
lysis of algorithms, but on the other hand it is realistic enough to allow
reasonably accurate modelling of real-life parallel computing; a portability
layer has been defined for the model in the form of BSPlib [105] and this
standard has been implemented efficiently in at least two libraries, namely the
Oxford BSP toolset [103] and the Paderborn University BSP library [28,30],
each running on many different parallel computers; another portability layer
suitable for BSP programming is the one-sided communications part of
MPI-2 [138], implementations of which are now appearing; in principle, the
BSP model could be used in taking design decisions when building new hard-
ware (in practice though, designers face other considerations); the BSP model
is actually being used as the framework for algorithm design and implementa-
tion on a range of parallel computers with completely different architectures
(clusters of PCs, networks of workstations, shared-memory multiprocessors,
and massively parallel computers with distributed memory). The BSP model
is explained in the next section.
[Diagrams: (i) a BSP computer: processors P, each with a local memory M, connected by a communication network; (ii) a sequence of supersteps Comp, Comm, Comm, Comp, Comm, each terminated by a global synchronization.]
Fig. 1.2. BSP algorithm with five supersteps executed on five processors. A ver-
tical line denotes local computation; an arrow denotes communication of one
or more data words to another processor. The first superstep is a computation
superstep. The second one is a communication superstep, where processor P (0)
sends data to all other processors. Each superstep is terminated by a global
synchronization.
sends and receives a number of messages. At the end of a superstep, all pro-
cessors synchronize, as follows. Each processor checks whether it has finished
all its obligations of that superstep. In the case of a computation superstep, it
checks whether the computations are finished. In the case of a communication
superstep, it checks whether it has sent all messages that it had to send, and
whether it has received all messages that it had to receive. Processors wait
until all others have finished. When this happens, they all proceed to the next
superstep. This form of synchronization is called bulk synchronization,
because usually many computation or communication operations take place
between successive synchronizations. (This is in contrast to pairwise synchron-
ization, used in most message-passing systems, where each message causes a
pair of sending and receiving processors to wait until the message has been
transferred.) Figure 1.2 gives an example of a BSP algorithm.
Fig. 1.3. Two different h-relations with the same h. Each arrow represents the
communication of one data word. (a) A 2-relation with hs = 2 and hr = 1; (b)
a 2-relation with hs = hr = 2.
Fig. 1.4. Cost of the BSP algorithm from Fig. 1.2 on a BSP computer with p = 5,
g = 2.5, and l = 20. Computation costs are shown only for the processor that
determines the cost of a superstep (or one of them, if there are several). Com-
munication costs are shown for only one source/destination pair of processors,
because we assume in this example that the amount of data happens to be the
same for every pair. The cost of the first superstep is determined by processors
P (1) and P (3), which perform 60 flops each. Therefore, the cost is 60 + l = 80
flop time units. In the second superstep, P (0) sends five data words to each of
the four other processors. This superstep has hs = 20 and hr = 5, so that it
is a 20-relation and hence its cost is 20g + l = 70 flops. The cost of the other
supersteps is obtained in a similar fashion. The total cost of the algorithm is
320 flops.
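
As a small check on this accounting, the two superstep costs given in the caption can be evaluated directly. The fragment below is an illustration only, not code from this book; it simply encodes the cost w + l of a computation superstep with w flops and the cost hg + l of an h-relation.

#include <stdio.h>

double comp_cost(double w, double l){ return w + l; }
double comm_cost(double h, double g, double l){ return h*g + l; }

int main(void){
    double g= 2.5, l= 20.0;                  /* the machine of Fig. 1.4 */
    printf("%.0f\n", comp_cost(60.0, l));    /* first superstep:  60 + l  = 80 */
    printf("%.0f\n", comm_cost(20.0, g, l)); /* second superstep: 20g + l = 70 */
    return 0;
}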
p number of processors
r computing rate (in flop/s)
g communication cost per data word (in time units of 1 flop)
l global synchronization cost (in time units of 1 flop)
In our terminology, vectors are column vectors; to save space we write them
as x = (x_0, . . . , x_{n−1})^T, where the superscript 'T' denotes transposition. The
vector x can also be viewed as an n × 1 matrix. The inner product of x and y
can concisely be expressed as x^T y.
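
Written out in full, the quantity to be computed is the familiar sum
\[
\alpha = x^{\mathrm{T}} y = \sum_{i=0}^{n-1} x_i y_i .
\]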
The inner product is computed by the processors P (0), . . . , P (p − 1) of a
BSP computer with p processors. We assume that the result is needed by all
processors, which is usually the case if the inner product computation is part
of a larger computation, such as in iterative linear system solvers.
(a) Cyclic distribution:
    component index  0 1 2 3 4 5 6 7 8 9
    processor        0 1 2 3 0 1 2 3 0 1
(b) Block distribution:
    component index  0 1 2 3 4 5 6 7 8 9
    processor        0 0 0 1 1 1 2 2 2 3
Fig. 1.5. Distribution of a vector of size ten over four processors. Each cell repres-
ents a vector component; the number in the cell and the greyshade denote the
processor that owns the cell. The processors are numbered 0, 1, 2, 3. (a) Cyclic
distribution; (b) block distribution.
The data distribution of the vectors x and y should be the same, because
in that case the components xi and yi reside on the same processor and they
can be multiplied immediately without any communication. The data distri-
bution then determines the work distribution in a natural manner. To balance
the work load of the algorithm, we must assign the same number of vector
components to each processor. Card players know how to do this blindly,
even without counting and in the harshest of circumstances. They always
deal out their cards in a cyclic fashion. For the same reason, an optimal work
distribution is obtained by the cyclic distribution defined by the mapping

    xi −→ P (i mod p), for 0 ≤ i < n. (1.5)

Here, the mod operator stands for taking the remainder after division by p,
that is, computing modulo p. Similarly, the div operator stands for integer
division rounding down. Figure 1.5(a) illustrates the cyclic distribution for
n = 10 and p = 4. The maximum number of components per processor is
⌈n/p⌉, that is, n/p rounded up to the nearest integer value, and the minimum
is ⌊n/p⌋ = n div p, that is, n/p rounded down. The maximum and the
minimum differ at most by one. If p divides n, every processor receives exactly
n/p components. Of course, many other data distributions also lead to the
best possible load balance. An example is the block distribution, defined
by the mapping
xi −→ P (i div b), for 0 ≤ i < n, (1.6)
with block size b = ⌈n/p⌉. Figure 1.5(b) illustrates the block distribution for
n = 10 and p = 4. This distribution has the same maximum number of com-
ponents per processor, but the minimum can take every integer value between
zero and the maximum. In Fig. 1.5(b) the minimum is one. The minimum can
even be zero: if n = 9 and p = 4, then the block size is b = 3, and the pro-
cessors receive 3, 3, 3, 0 components, respectively. Since the computation
cost is determined by the maximum amount of work, this is just as good as
Algorithm 1.1. Inner product algorithm for processor P (s), with 0 ≤ s < p.

input:  x, y : vector of length n,
        distr(x) = distr(y) = φ, with φ(i) = i mod p, for 0 ≤ i < n.
output: α = x^T y.

(0)  α_s := 0;
     for i := s to n − 1 step p do
         α_s := α_s + x_i y_i;

(1)  for t := 0 to p − 1 do
         put α_s in P (t);

(2)  α := 0;
     for t := 0 to p − 1 do
         α := α + α_t;
of the data element are specified. The ‘put’ primitive assumes that the source
processor knows the memory location on the destination processor where the
data must be put. The source processor is the initiator of the action, whereas
the destination processor is passive. Thus, we assume implicitly that each pro-
cessor allows all others to put data into its memory. Superstep (1) could also
have been written as 'put α_s in P (∗)', where we use the abbreviation P (∗) to
denote all processors. Note that the program includes a put by processor P (s)
into itself. This operation is simply skipped or becomes a local memory-copy,
but it does not involve communication. It is convenient to include such puts
in program texts, to avoid having to specify exceptions.
Sometimes, it may be necessary to let the destination processor initiate
the communication. This may happen in irregular computations, where the
destination processor knows that it needs data, but the source processor is
unaware of this need. In that case, the destination processor must fetch the
data from the source processor. This is done by a statement of the form ‘get
x from P (t)’ in the program text of P (s). In most cases, however, we use the
‘put’ primitive. Note that using a ‘put’ is much simpler than using a matching
‘send’/‘receive’ pair, as is done in message-passing parallel algorithms. The
program text of such an algorithm must contain additional if-statements to
distinguish between sends and receives. Careful checking is needed to make
sure that pairs match in all possible executions of the program. Even if
every send has a matching receive, this does not guarantee correct commu-
nication as intended by the algorithm designer. If the send/receive is done
by the handshake (or kissing) protocol, where both participants can only
continue their way after the handshake has finished, then it can easily hap-
pen that the sends and receives occur in the wrong order. A classic case is
when two processors both want to send first and receive afterwards; this situ-
ation is called deadlock. Problems such as deadlock cannot happen when
using puts.
In superstep (2), all processors compute the final result. This is done
redundantly, that is, the computation is replicated so that all processors per-
form exactly the same operations on the same data. The complete algorithm
is illustrated in Fig. 1.6.
The cost analysis of the algorithm is as follows. Superstep (0) requires a
floating-point multiplication and an addition for each component. Therefore,
the cost of (0) is 2⌈n/p⌉ + l. Superstep (1) is a (p − 1)-relation, because each
processor sends and receives p − 1 data. (Communication between a processor
and itself is not really communication and hence is not counted in determining
h.) The cost of (1) is (p − 1)g + l. The cost of (2) is p + l. The total cost of
the inner product algorithm is
    Tinprod = 2⌈n/p⌉ + p + (p − 1)g + 3l.    (1.8)
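
For example, with the illustrative values n = 1000, p = 4, g = 2.5, and l = 20 (numbers chosen here for the sake of the example), formula (1.8) gives
\[
T_{\mathrm{inprod}} = 2 \cdot 250 + 4 + 3 \cdot 2.5 + 3 \cdot 20 = 571.5 \mbox{ flops},
\]
so the local computation term 2⌈n/p⌉ dominates, but the synchronization term 3l is already a noticeable fraction of the total for such a small n.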
[Figure data: the two input vectors are (12, 0, 4, 7, −1, 2, 15, 11, 3, −2) and
(1, 9, 2, 0, −1, 12, 1, 2, 3, 8); the local inner products are 22, 8, 23, 22 on
P (0), P (1), P (2), P (3), and their sum is 75.]
Fig. 1.6. Parallel inner product computation. Two vectors of size ten are distributed
by the cyclic distribution over four processors. The processors are shown by
greyshades. First, each processor computes its local inner product. For example,
processor P (0) computes its local inner product 12 · 1 + (−1) · (−1) + 3 · 3 = 22.
Then the local result is sent to all other processors. Finally, the local inner
products are summed redundantly to give the result 75 in every processor.
bsp_begin(reqprocs);
bsp begin starts several executions of the same subprogram, where each
execution takes place on a different processor and handles a different stream
of data, in true SPMD style. The parallel part is terminated by
bsp_end();
Two possible modes of operation can be used. In the first mode, the whole
computation is SPMD; here, the call to bsp begin must be the first executable
statement in the program and the call to bsp end the last. Sometimes, how-
ever, one desires to perform some sequential part of the program before and
after the parallel part, for example, to handle input and output. For instance,
if the optimal number of processors to be used depends on the input, we want
to compute it before the actual parallel computation starts. The second mode
enables this: processor P (0) executes the sequential parts and all processors
together perform the parallel part. Processor P (0) preserves the values of its
variables on moving from one part to the next. The other processors do not
inherit values; they can only obtain desired data values by communication.
To allow the second mode of operation and to circumvent the restriction of
bsp begin and bsp end being the first and last statement, the actual parallel
part is made into a separate function spmd and an initializer
bsp_init(spmd, argc, argv);
is called as the first executable statement of the main function. Here, int
argc and char **argv are the standard arguments of main in a C program,
and these can be used to transfer parameters from a command line interface.
Funny things may happen if this is not the first executable statement. Do not
even think of trying it! The initializing statement is followed by: a sequential
part, which may handle some input or ask for the desired number of pro-
cessors (depending on the input size it may be better to use only part of the
available processors); the parallel part, which is executed by spmd; and finally
another sequential part, which may handle output. The sequential parts are
optional.
The rules for I/O are simple: processor P (0) is the only processor that can
read from standard input or can access the file system, but all processors can
write to standard output. Be aware that this may mix the output streams;
use an fflush(stdout) statement to empty the output buffer immediately
and increase the chance of obtaining ordered output (sorry, no guarantees).
At every point in the parallel part of the program, one can enquire about
the total number of processors. This integer is returned by
bsp_nprocs();
The function bsp nprocs also serves a second purpose: when it is used in
the sequential part at the start, or in the bsp begin statement, it returns the
available number of processors, that is, the size of the BSP machine used. Any
desired number of processors not exceeding the machine size can be assigned to
the program by bsp begin. The local processor identity, or processor number,
is returned by
bsp_pid();
It is an integer between 0 and bsp nprocs()−1. One can also enquire about
the time in seconds elapsed on the local processor since bsp begin; this time
is given as a double-precision value by
bsp_time();
Note that in the parallel context the elapsed time, or wall-clock time, is often
the desired metric and not the CPU time. In parallel programs, processors
are often idling because they have to wait for others to finish their part of
a computation; a measurement of elapsed time includes idle time, whereas
a CPU time measurement does not. Note, however, that the elapsed time
metric does have one major disadvantage, in particular to your fellow users:
you need to claim the whole BSP computer for yourself when measuring run
times.
Each superstep of the SPMD part, or program superstep, is terminated
by a global synchronization statement
bsp_sync();
except the last program superstep, which is terminated by bsp end. The
structure of a BSPlib program is illustrated in Fig. 1.7. Program supersteps
may be contained in loops and if-statements, but the condition evaluations of
these loops and if-statements must be such that all processors pace through the
same sequence of program supersteps. The rules imposed by BSPlib may seem
restrictive, but following them makes parallel programming easier, because
they guarantee that all processors are in the same superstep. This allows us
to assume full data integrity at the start of each superstep.
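
To make this structure concrete, here is a minimal SPMD skeleton that uses only the primitives introduced so far. It is a sketch for illustration, not one of the program texts of this book; the header name bsp.h is an assumption and may differ between BSPlib installations.

#include <stdio.h>
#include "bsp.h"  /* BSPlib header; exact name is an assumption */

int P; /* number of processors requested, set in the sequential part */

void spmd(){
    int p, s;

    bsp_begin(P);
    p= bsp_nprocs();   /* number of processors obtained */
    s= bsp_pid();      /* local processor number, 0 <= s < p */
    printf("Hello from processor %d of %d\n", s, p);
    fflush(stdout);
    bsp_sync();        /* end of the first program superstep */
    bsp_end();
} /* end spmd */

int main(int argc, char **argv){
    bsp_init(spmd, argc, argv); /* must be the first executable statement */
    P= bsp_nprocs();            /* sequential part: request all available processors */
    spmd();                     /* parallel part */
    return 0;
} /* end main */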
The version of the BSP model presented in Section 1.2 does not allow
computation and communication in the same superstep. The BSPlib sys-
tem automatically separates computation and communication, since it delays
communication until all computation is finished. Therefore the user does not
have to separate these parts herself and she also does not have to include
a bsp sync for this purpose. In practice, this means that BSPlib programs
can freely mix computation and communication. The automatic separation
feature of BSPlib is convenient, for instance because communication often
involves address calculations and it would be awkward for a user to separ-
ate these computations from the corresponding communication operations. A
program superstep can thus be viewed as a sequence of computation, implicit
synchronization, communication, and explicit synchronization. The compu-
tation part or the communication part may be empty. Therefore, a program
superstep may contain one or two supersteps as defined in the BSP model,
namely a computation superstep and/or a communication superstep.

[Diagram: Init, Sequential part, Begin, Sync, Parallel (SPMD) part, Sync, End, Sequential part, Exit.]
Fig. 1.7. Structure of a BSPlib program. The program first initializes the BSP
machine to be used and then it performs a sequential computation on P (0),
followed by a parallel computation on five processors. It finishes with a sequential
computation on P (0).

From now on, we use the shorter term 'superstep' to denote program supersteps as
well, except when this would lead to confusion.
Wouldn’t it be nice if we could compute and communicate at the same
time? This tempting thought may have occurred to you by now. Indeed,
processors could in principle compute while messages travel through the
communication network. Exploiting this form of parallelism would reduce
the total computation/communication cost a + bg of the algorithm, but at
most by a factor of two. The largest reduction would occur if the cost of
each computation superstep were equal to the cost of the corresponding
communication superstep, and if computation and communication could be
overlapped completely. In most cases, however, either computation or com-
munication dominates, and the cost reduction obtained by overlapping is
insignificant. Surprisingly, BSPlib guarantees not to exploit potential over-
lap. Instead, delaying all communication gives more scope for optimization,
since this allows the system to combine different messages from the same
source to the same destination and to reorder the messages with the aim of
balancing the communication traffic. As a result, the cost may be reduced by
much more than a factor of two.
Fig. 1.8. Put operation from BSPlib. The bsp put operation copies nbytes of data
from the local processor bsp pid into the specified destination processor pid.
The pointer source points to the start of the data to be copied, whereas the
pointer dest specifies the start of the memory area where the data is written.
The data is written at offset bytes from the start.

Processors can communicate with each other by using the bsp put and
bsp get functions (or their high-performance equivalents bsp hpput and
bsp hpget, see Exercise 10, or the bsp send function, see Section 4.9). A pro-
cessor that calls bsp put reads data from its own memory and writes them
into the memory of another processor. The function bsp put corresponds to
the put operation in our algorithms. The syntax is

bsp_put(pid, source, dest, offset, nbytes);

Here, int pid is the identity of the remote processor; void *source is a
pointer to the source memory in the local processor from which the data
are read; void *dest is a pointer to the destination memory in the remote
processor into which the data are written; int offset is the number of bytes
to be added to the address dest to obtain the address where writing starts;
and int nbytes is the number of bytes to be written. The dest variable must
have been registered previously; the registration mechanism will be explained
soon. If pid equals bsp pid, the put is done locally by a memory copy, and
no data is communicated. The offset is determined by the local processor, but
the destination address is part of the address space of the remote processor.
The use of an offset separates the concerns of the local processor, which knows
where in an array a data element should be placed, from the concerns of the
remote processor, which knows the address of the array in its own address
space. The bsp put operation is illustrated in Fig. 1.8.
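
For example (a sketch, assuming that an array Inprod of p doubles has been registered on every processor, as is done in the program of this chapter; registration is explained below), processor s can write its local value inprod into element s of the array Inprod on processor t by the call

    bsp_put(t, &inprod, Inprod, s*sizeof(double), sizeof(double));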
The bsp put operation is safe in every sense, since the value to be put is
first written into a local out-buffer, and only at the end of the superstep (when
all computations in all processors are finished) it is transferred into a remote
in-buffer, from which it is finally copied into the destination memory. The
user can manipulate both the source and destination value without worrying
about possible interference between data manipulation and transfer. Once the
bsp put is initiated, the user has got rid of the source data and can reuse the
variable that holds them. The destination variable can be used until the end of
the superstep, when it will be overwritten. It is possible to put several values
into the same memory cell, but of course only one value survives and reaches
the next superstep. The user cannot know which value, and he bears the
responsibility for ensuring correct program behaviour. Put and get operations
do not block progress within their superstep: after a put or get is initiated,
the program proceeds immediately.
Although a remote variable may have the same name as a local variable,
it may still have a different physical memory address because each processor
could have its own memory allocation procedure. To enable a processor to
write into a remote variable, there must be a way to link the local name to
the correct remote address. Linking is done by the registration primitive
bsp_push_reg(variable, nbytes);
where void *variable is a pointer to the variable being registered. All pro-
cessors must simultaneously register a variable, or the NULL pointer; they must
also deregister simultaneously. This ensures that they go through the same
sequence of registrations and deregistrations. Registration takes effect at the
start of the next superstep. From that moment, all simultaneously registered
variables are linked to each other. Usually, the name of each variable linked
in a registration is the same, in the right SPMD spirit. Still, it is allowed to
link variables with different names.
If a processor wants to put a value into a remote address, it can do this
by using the local name that is linked to the remote name and hence to the
desired remote address. The second registration parameter, int nbytes, is
an upper bound on the number of bytes that can be written starting from
variable. Its sole purpose is sanity checking: our hope is to detect insane
programs in their youth.
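
A typical usage pattern looks as follows. This is a sketch along the lines of the program texts in this book; vecallocd, vecfreed, and SZDBL are the allocation utilities and size constant used in those texts.

double *Inprod;
int p;

p= bsp_nprocs();
Inprod= vecallocd(p);           /* allocate p doubles */
bsp_push_reg(Inprod, p*SZDBL);  /* register p*SZDBL bytes */
bsp_sync();                     /* registration takes effect from here on */
/* ... communication supersteps that put values into Inprod ... */
bsp_pop_reg(Inprod);            /* cancel the registration */
vecfreed(Inprod);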
A variable is deregistered by a call to
bsp_pop_reg(variable);
Within a superstep, the variables can be registered and deregistered in arbit-
rary order. The same variable may be registered several times, but with
different sizes. (This may happen for instance as a result of registration of
the same variable inside different functions.) A deregistration cancels the last
registration of the variable concerned. The last surviving registration of a
variable is the one valid in the next superstep. For each variable, a stack
of registrations is maintained: a variable is pushed onto the stack when it
is registered; and it is popped off the stack when it is deregistered. A stack is
the computer science equivalent of the hiring and firing principle for teachers
in the Dutch educational system: Last In, First Out (LIFO). This keeps the
average stack population old, but that property is irrelevant for our book.
In a sensible program, the number of registrations is kept limited. Prefer-
ably, a registered variable is reused many times, to amortize the associated overhead.
[Figure data: global view (12, 0, 4, 7, −1, 2, 15, 11, 3, −2) with global indices
0, 1, . . . , 9; local view P (0): (12, −1, 3), P (1): (0, 2, −2), P (2): (4, 15),
P (3): (7, 11), each numbered by local indices starting from 0.]
Fig. 1.9. Two different views of the same vector. The vector of size ten is distributed
by the cyclic distribution over four processors. The numbers in the square cells
are the numerical values of the vector components. The processors are shown
by greyshades. The global view is used in algorithms, where vector components
are numbered using global indices j. The local view is used in implementations,
where each processor has its own part of the vector and uses its own local
indices j.
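
The index arithmetic behind these two views can be summarized in a few lines of C. This is a sketch for illustration, not part of the program below.

/* Cyclic distribution of a vector of length n over p processors:
   global index i is owned by processor i mod p, at local index i div p;
   local index j on processor s corresponds to global index j*p+s. */
int owner(int i, int p){ return i%p; }
int local_index(int i, int p){ return i/p; }
int global_index(int j, int s, int p){ return j*p+s; }

The function nloc and the assignment iglob= i*p+s in the program below are instances of exactly this arithmetic.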
/* This program computes the sum of the first n squares, for n>=0,
sum = 1*1 + 2*2 + ... + n*n
by computing the inner product of x=(1,2,...,n)^T and itself.
The output should equal n*(n+1)*(2n+1)/6.
The distribution of x is cyclic.
*/
#include "bspedupack.h"

int P; /* number of processors requested */

int nloc(int p, int s, int n){
    /* Compute number of local components of processor s for a vector
       of length n distributed cyclically over p processors */

    return (n+p-s-1)/p ;

} /* end nloc */
double bspip(int p, int s, int n, double *x, double *y){
    /* Compute the inner product of the vectors x and y of length n>=0 */
    double inprod, *Inprod, alpha;
    int i, t;

    Inprod= vecallocd(p); bsp_push_reg(Inprod,p*SZDBL);
    bsp_sync();
    inprod= 0.0;
    for (i=0; i<nloc(p,s,n); i++){
        inprod += x[i]*y[i];
    }
    for (t=0; t<p; t++){
        bsp_put(t,&inprod,Inprod,s*SZDBL,SZDBL); /* put local result into P(t) */
    }
    bsp_sync();
    alpha= 0.0;
    for (t=0; t<p; t++){
        alpha += Inprod[t];
    }
    bsp_pop_reg(Inprod); vecfreed(Inprod);
    return alpha;
} /* end bspip */

void bspinprod(){
double *x, alpha, time0, time1;
int p, s, n, nl, i, iglob;

bsp_begin(P);
p= bsp_nprocs(); /* p = number of processors obtained */
s= bsp_pid(); /* s = processor number */
if (s==0){
printf("Please enter n:\n"); fflush(stdout);
scanf("%d",&n);
if(n<0)
bsp_abort("Error in input: n is negative");
}
bsp_push_reg(&n,SZINT);
bsp_sync();
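/* Broadcast n: every processor reads SZINT bytes from the variable n
   on P(0), at offset 0, into its own copy of n (one-sided bsp_get) */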
bsp_get(0,&n,0,&n,SZINT);
bsp_sync();
bsp_pop_reg(&n);
nl= nloc(p,s,n);
x= vecallocd(nl);
for (i=0; i<nl; i++){
iglob= i*p+s;
x[i]= iglob+1;
}
bsp_sync();
time0=bsp_time();
alpha= bspip(p,s,n,x,x);
bsp_sync();
time1=bsp_time();
vecfreed(x);
bsp_end();
} /* end bspinprod */
int main(int argc, char **argv){

bsp_init(bspinprod, argc, argv);

/* sequential part */
printf("How many processors do you want to use?\n");
fflush(stdout);
scanf("%d",&P);
if (P > bsp_nprocs()){
printf("Sorry, not enough processors available.\n");
fflush(stdout);
exit(1);
}
/* SPMD part */
bspinprod();
/* sequential part */
exit(0);
} /* end main */
programs, preferring instead to let the compiler and the BSP system do
the job. (The benchmark method for optimization enthusiasts would be very
different.)
The sequential computing rate r is determined by measuring the time of
a so-called DAXPY operation (‘Double precision A times X Plus Y ’), which
has the form y := αx + y, where x and y are vectors and α is a scalar. A
DAXPY with vectors of length n contains n additions and n multiplications
and some overhead in the form of O(n) address calculations. We also measure
the time of a DAXPY operation with the addition replaced by subtraction.
We use 64-bit arithmetic throughout; on most machines this is called double-
precision arithmetic. This mixture of operations is representative of the
majority of scientific computations. We measure the time for a vector length,
which on the one hand is large enough so that we can ignore the startup
costs of vector operations, but on the other hand is small enough for the
vectors to fit in the cache; a choice of n = 1024 is often adequate. A cache is
a small but fast intermediate memory that allows immediate reuse of recently
accessed data. Proper use of the cache considerably increases the computing
rate on most modern computers. The existence of a cache makes the life of
a benchmarker harder, because it leads to two different computing rates: a
flop rate for in-cache computations and a rate for out-of-cache computations.
Intelligent choices should be made if the performance results are to be reduced
to a single meaningful figure.
The DAXPY measurement is repeated a number of times, both to obtain
a more accurate clock reading and to amortize the cost of bringing the vector
into the cache. We measure the sequential computing rate of each processor of
the parallel computer, and report the minimum, average, and maximum rate.
The difference between the minimum and the maximum indicates the accur-
acy of the measurement, except when the processors genuinely differ in speed.
(One processor that is slower than the others can have a remarkable effect
on the overall time of a parallel computation!) We take the average comput-
ing rate of the processors as the final value of r. Note that our measurement
is representative of user programs that contain mostly hand-coded vector
operations. To realize top performance, system-provided matrix–matrix oper-
ations should be used wherever possible, because these are often efficiently
coded in assembler language. Our benchmark method does not reflect that
situation.
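
In outline, the rate measurement amounts to the sketch below. This is an illustration, not the actual bspbench code (which is only partly reproduced later in this section); the real program also times a DAXPY variant with subtraction and prints a value of y afterwards to stop compilers from optimizing the loop away.

double measure_rate(int n, int niters){
    /* Time niters repetitions of a DAXPY y := alpha*x + y of length n
       (2 flops per component) and return the rate in flop/s.
       Assumes 0 < n <= 1024, so that the vectors fit in the arrays
       and, one hopes, in the cache. */
    double x[1024], y[1024], alpha= 1.0/3.0, time0, time1;
    int i, iter;

    for (i=0; i<n; i++){ x[i]= (double)i; y[i]= 1.0; } /* arbitrary data */
    time0= bsp_time();
    for (iter=0; iter<niters; iter++)
        for (i=0; i<n; i++)
            y[i] += alpha*x[i];
    time1= bsp_time();

    return (2.0*niters*n)/(time1-time0);
} /* end measure_rate */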
The communication parameter g and the synchronization parameter l are
obtained by measuring the time of full h-relations, where each processor
sends and receives exactly h data words. To be consistent with the meas-
urement of r, we use double-precision reals as data words. We choose a
particularly demanding test pattern from the many possible patterns with
the same h, which reflects the typical way most users would handle commun-
ication in their programs. The destination processors of the values to be sent
Fig. 1.10. Communication pattern of the 6-relation in the BSP benchmark. Pro-
cessors send data to the other processors in a cyclic manner. Only the data sent
by processor P (0) are shown; other processors send data in a similar way. Each
arrow represents the communication of one data word; the number shown is the
index of the data word.
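
A sketch of how the timing of a single h-relation could look is given below, using the variable names of the bspbench listing later in this section. It is an illustration under assumptions about the exact destination pattern, not the program's actual code; dest is assumed to be a registered array of at least MAXH doubles on every processor, p > 1, and r is the computing rate measured before.

/* Processor s sends its h doubles to the other processors in a cyclic
   fashion, as in Fig. 1.10, and times the communication superstep. */
for (i=0; i<h; i++)
    destproc[i]= (s+1 + i%(p-1)) % p;   /* all processors except s */
time0= bsp_time();
for (i=0; i<h; i++)
    bsp_put(destproc[i], &src[i], dest, i*SZDBL, SZDBL);
bsp_sync();
time1= bsp_time();
t[h]= (time1-time0)*r;                  /* express the time in flop units */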
The parameters g and l are then determined by a least-squares fit of the linear
function T (h) = hg + l to the measured times, that is, by minimizing the error

    E_LSQ = Σ_{h=h0}^{h1} (Tcomm(h) − (hg + l))².    (1.9)
(These values are obtained by setting the partial derivatives with respect to g
and l to zero, and solving the resulting 2 × 2 linear system of equations.) We
choose h0 = p, because packet optimization becomes worthwhile for h ≥ p;
we would like to capture the behaviour of the machine and the BSP system
for such values of h. A value of h1 = 256 will often be adequate, except if
p ≥ 256 or if the asymptotic communication speed is attained only for very
large h.
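
Written out (a standard least-squares derivation, included here for reference), the resulting normal equations are
\[
\Bigl(\sum_{h=h_0}^{h_1} h^2\Bigr) g + \Bigl(\sum_{h=h_0}^{h_1} h\Bigr) l = \sum_{h=h_0}^{h_1} h\,T_{\mathrm{comm}}(h),
\qquad
\Bigl(\sum_{h=h_0}^{h_1} h\Bigr) g + n_h l = \sum_{h=h_0}^{h_1} T_{\mathrm{comm}}(h),
\]
where n_h = h_1 − h_0 + 1 is the number of data points. These are exactly the sums sumhh, sumh, sumth, sumt, and nh computed by the function leastsquares below.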
Timing parallel programs requires caution since ultimately it often relies
on a system timer, which may be hidden from the user and may have a low
resolution. Always take a critical look at your timing results and your personal
watch and, in case of suspicion, plot the output data in a graph! This may
save you from potential embarrassment: on one occasion, I was surprised to
find that according to an erroneous timer the computer had exceeded its true
performance by a factor of four. On another occasion, I found that g was
negative. The reason was that the particular computer used had a small g but
a huge l, so that for h ≤ h1 the measurement error in gh + l was much larger
than gh, thereby rendering the value of g meaningless. In this case, h1 had to
be increased to obtain an accurate measurement of g.
void leastsquares(int h0, int h1, double *t, double *g, double *l){
/* This function computes the parameters g and l of the
linear function T(h)= g*h+l that best fits
the data points (h,t[h]) with h0 <= h <= h1. */
int h;
double nh, a, sumt, sumth, sumh, sumhh;

nh= h1-h0+1;
/* Compute sums:
sumt = sum of t[h] over h0 <= h <= h1
sumth = t[h]*h
sumh = h
sumhh = h*h */
sumt= sumth= 0.0;
for (h=h0; h<=h1; h++){
sumt += t[h];
sumth += t[h]*h;
}
sumh= (h1*h1-h0*h0+h1+h0)/2;
sumhh= ( h1*(h1+1)*(2*h1+1) - (h0-1)*h0*(2*h0-1))/6;
if(fabs(nh)>fabs(sumh)){
a= sumh/nh;
/* subtract a times first eqn from second eqn */
*g= (sumth-a*sumt)/(sumhh-a*sumh);
*l= (sumt-sumh* *g)/nh;
} else {
a= nh/sumh;
/* subtract a times second eqn from first eqn */
*g= (sumt-a*sumth)/(sumh-a*sumhh);
*l= (sumth-sumhh* *g)/sumh;
}
} /* end leastsquares */
void bspbench(){
void leastsquares(int h0, int h1, double *t, double *g, double *l);
int p, s, s1, iter, i, n, h, destproc[MAXH], destindex[MAXH];
double alpha, beta, x[MAXN], y[MAXN], z[MAXN], src[MAXH], *dest,
time0, time1, time, *Time, mintime, maxtime,
nflops, r, g0, l0, g, l, t[MAXH+1];
/* Determine the minimum, maximum, and average computing rate */
if (s==0){
mintime= maxtime= Time[0];
for(s1=1; s1<p; s1++){
mintime= MIN(mintime,Time[s1]);
maxtime= MAX(maxtime,Time[s1]);
}
if (mintime>0.0){
/* Compute r = average computing rate in flop/s */
nflops= 4*NITERS*n;
r= 0.0;
for(s1=0; s1<p; s1++)
r += nflops/Time[s1];
r /= p;
printf("n= %5d min= %7.3lf max= %7.3lf av= %7.3lf Mflop/s ",
n, nflops/(maxtime*MEGA),nflops/
(mintime*MEGA), r/MEGA);
fflush(stdout);
/* Output for fooling benchmark-detecting compilers */
printf(" fool=%7.1lf\n",y[n-1]+z[n-1]);
} else
printf("minimum time is 0\n"); fflush(stdout);
}
}
if (s==0){
printf("size of double = %d bytes\n",(int)SZDBL);
leastsquares(0,p,t,&g0,&l0);
printf("Range h=0 to p : g= %.1lf, l= %.1lf\n",g0,l0);
leastsquares(p,MAXH,t,&g,&l);
printf("Range h=p to HMAX: g= %.1lf, l= %.1lf\n",g,l);
printf("The bottom line for this BSP computer is:\n");
printf("p= %d, r= %.3lf Mflop/s, g= %.1lf, l= %.1lf\n",
p,r/MEGA,g,l);
fflush(stdout);
}
bsp_pop_reg(dest); vecfreed(dest);
bsp_pop_reg(Time); vecfreed(Time);
bsp_end();
} /* end bspbench */
} /* end main */
Fig. 1.11. Beowulf cluster of eight PCs connected by four switches. Each PC is
connected to all switches.
Fig. 1.12. Time of an h-relation on two connected PCs. The values shown are for
even h with h ≤ 500.
1024 to 512, to make the vectors fit in primary cache (i.e. the fastest cache);
for length 1024 and above the computing rate decreases sharply. The value
of MAXH was increased from the default 256 to 800, because in this case
g ≪ l and hence a larger range of h-values is needed to obtain an accurate
value for g from measurements of hg + l. Finding the right input paramet-
ers MAXN, MAXH, and NITERS for the benchmark program may require trial
and error; plotting the data is helpful in this process. It is unlikely that one
set of default parameters will yield sensible measurements for every parallel
computer.
What is the most expensive parallel computer you can buy? A super-
computer, by definition. Commonly, a supercomputer is defined as
one of today’s top performers in terms of computing rate, communica-
tion/synchronization rate, and memory size. Most likely, the cost of a
supercomputer will exceed a million US dollars. An example of a supercom-
puter is the Cray T3E, which is a massively parallel computer with distributed
memory and a communication network in the form of a three-dimensional
torus (i.e. a mesh with wraparound links at the boundaries). We have bench-
marked up to 64 processors of the 128-processor machine called Vermeer,
after the famous Dutch painter, which is located at the HPαC supercom-
puter centre of Delft University of Technology. Each node of this machine
consists of a DEC Alpha 21164 processor with a clock speed of 300 MHz, an
advertised peak performance of 600 Mflop/s, and 128 Mbyte memory. In our
experiments, we used version 1.4 of BSPlib with optimization level 2 and the
Cray C compiler with optimization level 3.
The measured single-processor computing rate is 35 Mflop/s, which is
much lower than the theoretical peak speed of 600 Mflop/s. The main reason
for this discrepancy is that we measure the speed for a DAXPY opera-
tion written in C, whereas the highest performance on this machine can
only be obtained by performing matrix–matrix operations such as DGEMM
(Double precision GEneral Matrix–Matrix multiplication) and then only when
using well-tuned subroutines written in assembler language. The BLAS (Basic
Linear Algebra Subprograms) library [59,60,126] provides a portable interface
to a set of subroutines for the most common vector and matrix operations,
such as DAXPY and DGEMM. (The terms ‘DAXPY’ and ‘DGEMM’ origin-
ate in the BLAS definition.) Efficient BLAS implementations exist for most
machines. A complete BLAS list is given in [61,Appendix C]. A Cray T3E ver-
sion of the BLAS is available; its DGEMM approaches peak performance. A
note of caution: our initial, erroneous result on the Cray T3E was a computing
rate of 140 Mflop/s. This turned out to be due to the Cray timer IRTC, which
is called by BSPlib on the Cray and ran four times slower than it should. This
error occurs only in version 1.4 of BSPlib, in programs compiled at BSPlib
level 2 for the Cray T3E.
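For reference, the rate quoted above is that of plain compiled C loops; a sketch of the kind of kernel that bspbench times (consistent with the declarations and the flop count nflops= 4*NITERS*n in the program fragment above, though not copied from it) is:

/* Two DAXPY-style vector updates of length n, 2n flops each,
   so one call costs 4n flops. */
void daxpy_pair(int n, double alpha, double beta,
                double *x, double *y, double *z){
    int i;
    for (i=0; i<n; i++)
        y[i] += alpha*x[i];
    for (i=0; i<n; i++)
        z[i] -= beta*x[i];
}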
Figure 1.13 shows the time of an h-relation on 64 processors of the Cray
T3E. The time grows more or less linearly, but there are some odd jumps,
for instance the sudden significant decrease around h = 130. (Sending more
data takes less time!) It is beyond our scope to explain every peculiarity of
every benchmarked machine. Therefore, we feel free to leave some surprises,
like this one, unexplained.
Table 1.2 shows the BSP parameters obtained by benchmarking the Cray
T3E for up to 64 processors. The results for p = 1 give an indication of the
overhead of running a bulk synchronous parallel program on one processor.
For the special case p = 1, the value of MAXH was decreased to 16, because
l ≈ g and hence a smaller range of h-values is needed to obtain an accurate
value for l from measurements of hg + l. (Otherwise, even negative values of
l could appear.) The table shows that g stays almost constant for p ≤ 16 and
that it grows slowly afterwards. Furthermore, l grows roughly linearly with
p, but occasionally it behaves strangely: l suddenly decreases on moving from
16 to 32 processors. The explanation is hidden inside the black box of the
communication network. A possible explanation is the increased use of wrap-
around links when increasing the number of processors. (For a small number of
processors, all boundary links of a subpartition connect to other subpartitions,
instead of wrapping around to the subpartition itself; thus, the subpartition
Fig. 1.13. Time (in flop units) of an h-relation on 64 processors of the Cray T3E,
showing the measured data and a least-squares fit.
Table 1.2. Benchmarked BSP parameters of the Cray T3E (all times in flop units)

 p    g     l    Tcomm(0)
 1   36    47       38
 2   28   486      325
 4   31   679      437
 8   31  1193      580
16   31  2018      757
32   72  1145      871
64   78  1825     1440
write (EREW) variant allows only one processor to access the memory at a
time. The PRAM model ignores communication costs and is therefore mostly
of theoretical interest; it is useful in establishing lower bounds for the cost
of parallel algorithms. The PRAM model has stimulated the development of
many other models, including the BSP model. The BSP variant with auto-
matic memory management by randomization in fact reduces to the PRAM
model in the asymptotic case g = l = O(1). For an introduction to the PRAM
model, see the survey by Vishkin [189]. For PRAM algorithms, see the survey
by Spirakis and Gibbons [166,167] and the book by JáJá [113].
The BSP model was proposed by Valiant in 1989 [177]. The full
description of this ‘bridging model for parallel computation’ is given in [178].
This article describes the two basic variants of the model (automatic memory
management or direct user control) and it gives a complexity analysis of
algorithms for fast Fourier transform, matrix–matrix multiplication, and sort-
ing. In another article [179], Valiant proves that a hypercube or butterfly
architecture can simulate a BSP computer with optimal efficiency. (Here, the
model is called XPRAM.) The BSP model as it is commonly used today has
been shaped by various authors since the original work by Valiant. The survey
by McColl [132] argues that the BSP model is a promising approach to general-
purpose parallel computing and that it can deliver both scalable performance
and architecture independence. Bisseling and McColl [21,22] propose the vari-
ant of the model (with pure computation supersteps of cost w + l and pure
communication supersteps of cost hg + l) that is used in this book. They show
how a variety of scientific computations can be analysed in a simple manner
by using their BSP variant. McColl [133] analyses and classifies several
important BSP algorithms, including dense and sparse matrix–vector mul-
tiplication, matrix–matrix multiplication, LU decomposition, and triangular
system solution.
The LogP model by Culler et al. [49] is an offspring of the BSP model,
which uses four parameters to describe relative machine performance: the
latency L, the overhead o, the gap g, and the number of processors P , instead
of the three parameters l, g, and p of the BSP model. The LogP model
treats messages individually, not in bulk, and hence it does not provide the
notion of a superstep. The LogP model attempts to reflect the actual machine
architecture more closely than the BSP model, but the price to be paid is an
increase in the complexity of algorithm design and analysis. Bilardi et al. [17]
show that the LogP and BSP models can simulate each other efficiently so that
in principle they are equally powerful.
The YPRAM model by de la Torre and Kruskal [54] characterizes a parallel
computer by its latency, bandwidth inefficiency, and recursive decomposabil-
ity. The decomposable BSP (D-BSP) model [55] is the same model expressed
in BSP terms. In this model, a parallel computer can be decomposed into sub-
machines, each with their own parameters g and l. The parameters g and l of
submachines will in general be lower than those of the complete machine. The
1.8.4 Benchmarking
The BSPlib definition [105] presents results obtained by the optimized bench-
marking program bspprobe, which is included in the Oxford BSP toolset [103].
The values of r and l that were measured by bspprobe agree well with those
of bspbench. The values of g, however, are much lower: for instance, the value
g = 1.6 for 32-bit words at r = 47 Mflop/s given in [105,Table 1] corresponds
to g = 0.07 µs for 64-bit words, which is 12.5 times less than the 0.88 µs
measured by bspbench, see Table 1.3. This is due to the high optimization
level of bspprobe: data are sent in blocks instead of single words and high-
performance puts are used instead of buffered puts. The goal of bspprobe is
to measure communication performance for optimized programs and hence its
bottom line takes as g-value the asymptotic value for large blocks. The effect
of such optimizations will be studied in Chapter 2. The program bspprobe
measures g for two different h-relations with the same h: (i) a local cyclic
shift, where every processor sends h data to the next higher-numbered pro-
cessor; (ii) a global all-to-all procedure where every processor sends h/(p − 1)
data to every one of the others. In most cases, the difference between the two
g-values is small. This validates the basic assumption of the BSP model,
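The two access patterns can be sketched in BSPlib as follows (an illustration of the patterns just described, not the actual bspprobe source; dest is assumed to be a previously registered array large enough to hold the incoming data, and h is assumed to be a multiple of p − 1 in the all-to-all case):

#include "bspedupack.h"

/* (i) Local cyclic shift: every processor sends its h words to the
   next higher-numbered processor, wrapping around at the end. */
void cyclic_shift(double *src, double *dest, int h, int p, int s){
    bsp_put((s+1)%p, src, dest, 0, h*SZDBL);
}

/* (ii) All-to-all: every processor sends h/(p-1) words to each of the
   other processors; block s of dest receives the data of sender s. */
void all_to_all(double *src, double *dest, int h, int p, int s){
    int t, b= h/(p-1);
    for (t=0; t<p; t++)
        if (t!=s)
            bsp_put(t, src, dest, s*b*SZDBL, b*SZDBL);
}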
1.9 Exercises
1. Algorithm 1.1 can be modified to combine the partial sums into one global
sum by a different method. Let p = 2^q, with q ≥ 0. Modify the algorithm to
combine the partial sums by repeated pairing of processors. Take care that
every processor obtains the final result. Formulate the modified algorithm
exactly, using the same notation as in the original algorithm. Compare the
BSP cost of the two algorithms. For which ratio l/g is the pairwise algorithm
faster?
2. Analyse the following operations and derive the BSP cost for a parallel
algorithm. Let x be the input vector (of size n) of the operation and y the out-
put vector. Assume that these vectors are block distributed over p processors,
(b) Find a suitable cryptotext as input and compute its κ. Guess its
language by comparing the result with the κ-values found by Kullback
(reproduced in [14]): Russian 5.29%, English 6.61%, German 7.62%,
French 7.78%.
(c) Find out whether Dutch is closer to English or German.
(d) Extend your program to compute all letter frequencies in the input
text. In English, the ‘e’ is the most frequent letter; its frequency is
about 12.5%.
(e) Run your program on some large plain texts in the language just
determined to obtain a frequency profile of that language. Run your
program on the cryptotext and establish its letter frequencies. Now
break the code.
(f) Is parallelization worthwhile in this case? When would it be?
5. (∗) Data compression is widely used to reduce the size of data files,
for instance texts or pictures to be transferred over the Internet. The
LZ77 algorithm by Ziv and Lempel [193] passes through a text and uses
the most recently accessed portion as a reference dictionary to shorten
the text, replacing repeated character strings by pointers to their first
occurrence. The popular compression programs PKZIP and gzip are based
on LZ77.
Consider the text
‘yabbadabbadoo’
each step moves north, east, south, or west with equal probability 1/4.
What is the expected distance to the origin after 100 steps? Create a
large number of walks to obtain a good estimate. Use the parallel RNG
to accelerate your simulation.
(e) Improve the quality of your parallel RNG by adding a local shuffle to
break up short distance correlations. The numbers generated are writ-
ten to a buffer array of length 64 instead of to the output. The buffer
is filled at startup; after that, each time a random number is needed,
one of the array values is selected at random, written to the output,
and replaced by a new number xk . The random selection of the buffer
element is done based on the last output number. The shuffle is due to
Bays and Durham [16]. Check whether this improves the quality of the
RNG. Warning: the resulting RNG has limited applicability, because
m is relatively small. Better parallel RNGs exist, see for instance
the SPRNG package [131], and in serious work such RNGs must
be used.
7. (∗) The sieve of Eratosthenes (276–194 BC) is a method for generating all
prime numbers up to a certain bound n. It works as follows. Start with the
integers from 2 to n. The number 2 is a prime; cross out all larger multiples of
2. The smallest remaining number, 3, is a prime; cross out all larger multiples
of 3. The smallest remaining number, 5, is a prime, etc.
(h) Modify your program to generate twin primes, that is, pairs of primes
that differ by two, such as (5, 7). (It is unknown whether there are
infinitely many twin primes.)
(i) Extend your program to check the Goldbach conjecture: every even
k > 2 is the sum of two primes. Choose a suitable range of integers to
check. Try to keep the number of operations low. (The conjecture has
been an open question since 1742.)
2
LU DECOMPOSITION
Ax = b, (2.1)
a_{ij} = \sum_{r=0}^{n−1} l_{ir} u_{rj} = \sum_{r=0}^{\min(i,j)} l_{ir} u_{rj},   for 0 ≤ i, j < n.   (2.3)
In the case i ≤ j, we split off the ith term and substitute l_{ii} = 1, to obtain
u_{ij} = a_{ij} − \sum_{r=0}^{i−1} l_{ir} u_{rj},   for 0 ≤ i ≤ j < n.   (2.4)
Similarly,
l_{ij} = \frac{1}{u_{jj}} \left( a_{ij} − \sum_{r=0}^{j−1} l_{ir} u_{rj} \right),   for 0 ≤ j < i < n.   (2.5)
Equations (2.4) and (2.5) lead to a method for computing the elements
of L and U . For convenience, we first define the intermediate n × n matrices
A^{(k)}, 0 ≤ k ≤ n, by
a^{(k)}_{ij} = a_{ij} − \sum_{r=0}^{k−1} l_{ir} u_{rj},   for 0 ≤ i, j < n.   (2.6)
for k := 0 to n − 1 do
    for j := k to n − 1 do
        u_{kj} := a^{(k)}_{kj};
    for i := k + 1 to n − 1 do
        l_{ik} := a^{(k)}_{ik}/u_{kk};
    for i := k + 1 to n − 1 do
        for j := k + 1 to n − 1 do
            a^{(k+1)}_{ij} := a^{(k)}_{ij} − l_{ik} u_{kj};
Note that A^{(0)} = A and A^{(n)} = 0. In this notation, (2.4) and (2.5) become
u_{ij} = a^{(i)}_{ij},   for 0 ≤ i ≤ j < n,   (2.7)
and
l_{ij} = a^{(j)}_{ij}/u_{jj},   for 0 ≤ j < i < n.   (2.8)
Algorithm 2.1 produces the elements of L and U in stages. Stage k first
computes the elements ukj , j ≥ k, of row k of U and the elements lik , i > k,
of column k of L. Then, it computes A(k+1) in preparation for the next stage.
Since only values a^{(k)}_{ij} with i, j ≥ k are needed in stage k, only the values
a^{(k+1)}_{ij} with i, j ≥ k + 1 are prepared. It can easily be verified that this order
of computation is indeed feasible: in each assignment of the algorithm, the
values of the right-hand side have already been computed.
Figure 2.1 illustrates how computer memory can be saved by storing all
currently available elements of L, U , and A(k) in one working matrix, which
we call A. Thus, we obtain Algorithm 2.2. On input, A contains the original
matrix A(0) , whereas on output it contains the values of L below the diagonal
and the values of U above and on the diagonal. In other words, the output
matrix equals L − In + U , where In denotes the n × n identity matrix, which
has ones on the diagonal and zeros everywhere else. Note that stage n − 1 of
the algorithm does nothing, so we can skip it.
This is a good moment for introducing our matrix/vector notation, which
is similar to the MATLAB [100] notation commonly used in the field of numer-
ical linear algebra. This notation makes it easy to describe submatrices and
Fig. 2.1. The working matrix A during the computation, shown here for a 7 × 7
matrix: U is stored on and above the diagonal, L below the diagonal, and the
remaining elements of A^{(k)} in the trailing submatrix.
for k := 0 to n − 1 do
    for i := k + 1 to n − 1 do
        aik := aik /akk ;
    for i := k + 1 to n − 1 do
        for j := k + 1 to n − 1 do
            aij := aij − aik akj ;
The last example shows that Algorithm 2.2 may break down, even in the
case of a nonsingular matrix. This happens if akk = 0 for a certain k, so that
division by zero is attempted. A remedy for this problem is to permute the
rows of the matrix A in a suitable way, giving a matrix P A, before computing
an LU decomposition. This yields
P A = LU, (2.9)
(P_\sigma)_{ij} = \begin{cases} 1 & \text{if } i = \sigma(j), \\ 0 & \text{otherwise,} \end{cases}   for 0 ≤ i, j < n.   (2.10)
This means that column j of Pσ has an element one in row σ(j), and zeros
everywhere else.
Example 2.4 Let n = 3 and σ(0) = 1, σ(1) = 2, and σ(2) = 0. Then
P_\sigma = \begin{pmatrix} \cdot & \cdot & 1 \\ 1 & \cdot & \cdot \\ \cdot & 1 & \cdot \end{pmatrix},
for i := 0 to n − 1 do
    πi := i;
for k := 0 to n − 1 do
    r := argmax(|aik | : k ≤ i < n);
    swap(πk , πr );
    for j := 0 to n − 1 do
        swap(akj , arj );
    for i := k + 1 to n − 1 do
        aik := aik /akk ;
    for i := k + 1 to n − 1 do
        for j := k + 1 to n − 1 do
            aij := aij − aik akj ;
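For readers who prefer plain C, the sequential method with partial pivoting can be written as follows (a sketch following the pseudocode above, not taken from the book's accompanying software; a is stored as an array of n row pointers and pi holds the permutation π):

#include <math.h>

/* In-place LU decomposition with partial pivoting: on output, a contains
   L below the diagonal and U on and above it, and pi holds the row
   permutation. */
void lu_seq(int n, double **a, int *pi){
    int i, j, k, r;
    double absmax, tmp;

    for (i=0; i<n; i++)
        pi[i]= i;
    for (k=0; k<n; k++){
        /* r := argmax(|a[i][k]| : k <= i < n) */
        r= k; absmax= fabs(a[k][k]);
        for (i=k+1; i<n; i++){
            if (fabs(a[i][k])>absmax){
                absmax= fabs(a[i][k]); r= i;
            }
        }
        /* swap pi[k] with pi[r] and row k with row r */
        i= pi[k]; pi[k]= pi[r]; pi[r]= i;
        for (j=0; j<n; j++){
            tmp= a[k][j]; a[k][j]= a[r][j]; a[r][j]= tmp;
        }
        /* divide column k by the pivot and update the trailing submatrix */
        for (i=k+1; i<n; i++)
            a[i][k] /= a[k][k];
        for (i=k+1; i<n; i++)
            for (j=k+1; j<n; j++)
                a[i][j] -= a[i][k]*a[k][j];
    }
}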
Proof By induction on n.
Fig. 2.2. Matrix update by operations aij := aij − aik akj at the end of stage k = 3.
Arrows denote communication.
from the set of matrix index pairs to the set of processor identifiers. The
mapping function φ has two coordinates,
Fig. 2.3. A Cartesian distribution of a 7 × 7 matrix over a 2 × 3 grid of processors
P (s, t); each matrix element is labelled with the identity st of its owner.
In both cases, the destination is only a subset of all the processors. Therefore,
we decide to use a Cartesian matrix distribution. For the moment, we do not
specify the distribution further, to leave us the freedom of tailoring it to our
future needs.
An initial parallel algorithm can be developed by parallelizing the sequen-
tial algorithm step by step, using data parallelism to derive computa-
tion supersteps and the need-to-know principle to obtain the necessary
communication supersteps. According to this principle, exactly those non-
local data that are needed in a computation superstep should be fetched in
preceding communication supersteps.
One parallelization method based on this approach is to allocate a com-
putation to the processor that possesses the variable on the left-hand side
of an assignment and to communicate beforehand the nonlocal data appear-
ing in the right-hand side. An example is the superstep pair (10)–(11) of
Algorithm 2.4, which is a parallel version of the matrix update from stage k
of the LU decomposition. (The superstep numbering corresponds to that of
the complete basic parallel algorithm.) In superstep (11), the local elements
aij with i, j ≥ k + 1 are modified. In superstep (10), the elements aik and
akj with i, j ≥ k + 1 are communicated to the processors that need them.
It is guaranteed that all values needed have been sent, but depending on
the distribution and the stage k, certain processors actually may not need
all of the communicated elements. (This mild violation of the strict need-to-
know principle is common in dense matrix computations, where all matrix
elements are treated as nonzero; for sparse matrices, however, where many
matrix elements are zero, the communication operations should be precisely
targeted, see Chapter 4.) Another example of this parallelization method is
the superstep pair (8)–(9). In superstep (9), the local elements of column k
are divided by akk . This division is performed only by processors in processor
column P (∗, φ1 (k)), since these processors together possess matrix column k.
In superstep (8), the element akk is obtained.
An alternative parallelization method based on the same need-to-know
approach is to allocate a computation to the processor that contains part
or all of the data of the right-hand side, and then to communicate partial
results to the processors in charge of producing the final result. This may
be more efficient if the number of result values is less than the number of
input data values involved. An example is the sequence of supersteps (0)–(3)
of Algorithm 2.5, which is a parallel version of the pivot search from stage
k of the LU decomposition. First a local element with maximum absolute
value is determined, whose index and value are then sent to all processors in
P (∗, φ1 (k)). (In our cost model, this takes the same time as sending them to
only one master processor P (0, φ1 (k)); a similar situation occurs for the inner
product algorithm in Section 1.3.) All processors in the processor column
redundantly determine the processor P (smax , φ1 (k)) and the global row index
r of the maximum value. The index r is then broadcast to all processors.
Algorithm 2.6. Index and row swaps in stage k for P (s, t).
The part of stage k that remains to be parallelized consists of index and
row swaps. To parallelize the index swaps, we must first choose the distribu-
tion of π. It is natural to store πk together with row k, that is, somewhere
in processor row P (φ0 (k), ∗); we choose P (φ0 (k), 0) as the location. Altern-
atively, we could have replicated πk and stored a copy in every processor
of P (φ0 (k), ∗). (Strictly speaking, this is not a distribution any more.) The
index swaps are performed by superstep pair (4)–(5) of Algorithm 2.6. The
components πk and πr of the permutation vector are swapped by first put-
ting each component into its destination processor and then assigning it to
the appropriate component of the array π. Temporary variables (denoted by
hats) are used to help distinguish between the old and the new contents of
a variable. The same is done for the row swaps in supersteps (6)–(7).
To make the algorithm efficient, we must choose a distribution φ that
incurs low BSP cost. To do this, we first analyse stage k of the algorithm and
identify the main contributions to its cost. Stage k consists of 12 supersteps,
so that its synchronization cost equals 12l. Sometimes, a superstep may be
empty so that it can be deleted. For example, if N = 1, superstep (3) is empty.
In the extreme case p = 1, all communication supersteps can be deleted and
the remaining computation supersteps can be combined into one superstep.
For p > 1, however, the number of supersteps in one stage remains a small
Here, Rk is the maximum number of local matrix rows with index ≥ k, taken over
all processor rows, and Ck is the maximum number of local matrix columns with
index ≥ k, taken over all processor columns.
Example 2.8 In Fig. 2.3, R0 = 4, C0 = 3 and R4 = 2, C4 = 2.
Lower bounds for Rk and Ck are given by
Rk ≥ ⌈(n − k)/M⌉,   Ck ≥ ⌈(n − k)/N⌉.   (2.16)
Proof Assume Rk < ⌈(n − k)/M ⌉. Because Rk is integer, we even have that
Rk < (n − k)/M so that each processor row has less than (n − k)/M matrix
rows. Therefore, the M processor rows together possess less than n − k matrix
rows, which contradicts the fact that they hold the whole range k ≤ i < n.
A similar proof holds for Ck .
The computation supersteps of the algorithm are (0), (2), (5), (7), (9), and
(11). Supersteps (0), (2), (5), and (7) are for free in our benign cost model,
since they do not involve floating-point operations. (A more detailed analysis
taking all types of operations into account would yield a few additional lower-
order terms.) Computation superstep (9) costs Rk+1 time units, since each
processor performs at most Rk+1 divisions. Computation superstep (11) costs
2Rk+1 Ck+1 time units, since each processor performs at most Rk+1 Ck+1 mul-
tiplications and Rk+1 Ck+1 subtractions. The cost of (11) clearly dominates
the total computation cost.
Table 2.1 presents the cost of the communication supersteps of the basic
parallel LU decomposition. It is easy to verify the cost values given by the
table. For the special case N = 1, the hr value given for (3) in the table
should in fact be 0 instead of 1, but this does not affect the resulting value
of h. A similar remark should be made for supersteps (4), (8), and (10).
During most of the algorithm, the largest communication superstep is (10),
while the next-largest one is (6). Near the end of the computation, (6) becomes
dominant.
To minimize the total BSP cost of the algorithm, we must take care to
minimize the cost of both computation and communication. First we consider
the computation cost, and in particular the cost of the dominant computation
superstep,
T(11) = 2Rk+1 Ck+1 ≥ 2 ⌈(n − k − 1)/M⌉ ⌈(n − k − 1)/N⌉.   (2.17)
This cost can be minimized by distributing the matrix rows cyclically over
the M processor rows and the matrix columns cyclically over the N processor
columns. In that case, matrix rows k + 1 to n − 1 are evenly or nearly evenly
divided over the processor rows, with at most a difference of one matrix row
between the processor rows, and similarly for the matrix columns. Thus,
T(11),cyclic = 2 ⌈(n − k − 1)/M⌉ ⌈(n − k − 1)/N⌉.   (2.18)
Furthermore,
2(n − k − 1)²/p ≤ T(11),cyclic < 2 ((n − k − 1)/M + 1) ((n − k − 1)/N + 1)
                              = 2(n − k − 1)²/p + (2(n − k − 1)/p)(M + N) + 2,
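As a quick numerical illustration of how tight these bounds are (values chosen purely for illustration): for n − k − 1 = 999 and M = N = 8, so that p = 64,

T(11),cyclic = 2 ⌈999/8⌉² = 2 · 125² = 31\,250,

which indeed lies between the lower bound 2 · 999²/64 ≈ 31\,188 and the upper bound 31\,188 + 2 · 999 · 16/64 + 2 ≈ 31\,689.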
Fig. 2.4. The 2 × 3 cyclic distribution of a 7 × 7 matrix; each matrix element is
labelled with the identity st of its owner P (s, t).
where hs (s) is the number of data words sent by processor P (s) and hr (s) is
the number received. In this notation, maxs hs (s) = hs and maxs hr (s) = hr .
Note that V ≤ \sum_{s=0}^{p−1} h = ph. We call an h-relation balanced if V = ph,
that is, h = V /p. Equality can only hold if hs (s) = h for all s. Therefore,
a balanced h-relation has hs (s) = h for all s, and, similarly, hr (s) = h for
all s. These necessary conditions for balance are also sufficient and hence an
h-relation is balanced if and only if every processor sends and receives exactly
h words. But this is precisely the definition of a full h-relation, see Section 1.2.
It is just a matter of viewpoint whether we call an h-relation balanced or full.
The communication volume provides us with a measure for load imbalance:
we call h − V /p the communicational load imbalance. This is analogous
to the computational load imbalance, which commonly (but often tacitly)
is defined as w − wseq /p, where w denotes work. If an h-relation is balanced,
then h = hs = hr . The reverse is not true: it is possible that h = hs = hr
but that the h-relation is still unbalanced: some processors may be overloaded
sending and some receiving. In that case, h > V /p. To reduce communication
cost, one can either reduce the volume, or improve the balance for a fixed
volume.
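A small example of our own may make this concrete: let p = 4 and suppose that only P (0) communicates, sending two words to P (1), while P (2) and P (3) are idle. Then

h_s = h_r = h = 2,   V = 2,   h − V/p = 2 − 1/2 = 3/2,

so the 2-relation is unbalanced even though hs = hr; a balanced 2-relation would require every processor to send and to receive two words.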
Consider the basic parallel LU decomposition algorithm with the cyclic
distribution. Assume for diagnostic purposes that the distribution is square.
(Later, in developing our improved algorithm, we shall assume the more
general M × N cyclic distribution.) Supersteps (3), (8), and (10) perform
h-relations with hs ≫ hr , see Table 2.1. Such a discrepancy between hs and
hr is a clear symptom of imbalance. The three unbalanced supersteps are
candidates for improvement. We concentrate our efforts on the dominant com-
munication superstep, (10), which has hs = (√p − 1)hr and h ≈ 2(n − k − 1),
see (2.20). The contribution of superstep (10) to the total communication
cost of the basic algorithm is about \sum_{k=0}^{n−1} 2(n − k − 1)g = 2g \sum_{k=0}^{n−1} k =
2g(n − 1)n/2 ≈ n²g, irrespective of the number of processors. With an
increasing number of processors, the fixed contribution of n²g to the total
communication cost will soon dominate the total computation cost of roughly
Tseq/p ≈ 2n³/(3p), see (2.11). This back-of-the-envelope analysis suffices to
reveal the undesirable scaling behaviour of the row and column broadcasts.
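To get a feeling for the numbers (chosen here purely for illustration): for n = 1000 and g = 30, the two contributions are equal when

n²g = 2n³/(3p)   ⟺   p = 2n/(3g) = 2000/90 ≈ 22,

so already beyond a couple of dozen processors the broadcasts of the basic algorithm would dominate the computation.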
The unbalance in the broadcasts of superstep (10) is caused by the fact
that only 2√p − 1 out of p processors send data: the sending processors
are P (∗, φ1 (k)) = P (∗, k mod √p) and P (φ0 (k), ∗) = P (k mod √p, ∗). The
receives are spread better: the majority of the processors receive 2Rk+1 data
elements, or one or two elements less. The communication volume equals
V = 2(n − k − 1)(√p − 1), because n − k − 1 elements of row k and column
k must be broadcast to √p − 1 processors. It is impossible to reduce the
communication volume significantly: all communication operations are really
necessary, except in the last few stages of the algorithm. The communication
balance, however, has potential for improvement.
To find ways to improve the balance, let us first examine the problem of
broadcasting a vector x of length n from a processor P (0) to all p processors of
a parallel computer, where n ≥ p. For this problem, we use a one-dimensional
processor numbering. The simplest approach is that processor P (0) creates
p−1 copies of each vector component and sends these copies out. This method
concentrates all sending work at the source processor. A better balance can
be obtained by sending each component to a randomly chosen intermediate
processor and making this processor responsible for copying and sending the
copies to the final destination. (This method is similar to two-phase random-
ized routing [176], where packets are sent from source to destination through a
randomly chosen intermediate location, to avoid congestion in the routing net-
work.) The new method splits the original h-relation into two phases: phase 0,
an unbalanced h-relation with small volume that randomizes the location of
the data elements; and phase 1, a well-balanced h-relation that performs
the broadcast itself. We call the resulting pair of h-relations a two-phase
broadcast.
An optimal balance during phase 1 can be guaranteed by choosing the
intermediate processors deterministically instead of randomly. For instance,
this can be achieved by spreading the vector in phase 0 according to the
block distribution, defined by (1.6). (An equally suitable choice is the cyclic
distribution.) The resulting two-phase broadcast is given as Algorithm 2.7; it is
illustrated by Fig. 2.5. The notation repl(x) = P (∗) means that x is replicated
such that each processor has a copy. (This is in contrast to distr(x) = φ,
which means that x is distributed according to the mapping φ.) Phase 0 is
an h-relation with h = n − b, where b = ⌈n/p⌉ is the block size, and phase 1
has h = (p − 1)b. Note that both phases cost about ng. The total cost of the
two-phase broadcast of a vector of length n to p processors is
Tbroadcast = (n + (p − 2)⌈n/p⌉) g + 2l ≈ 2ng + 2l.   (2.23)
This is much less than the cost (p − 1)ng + l of the straightforward one-phase
broadcast (except when l is large).
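A worked example (with numbers chosen only for illustration): broadcasting a vector of length n = 1000 to p = 8 processors uses block size b = 125, so that

T_two-phase = (1000 + 6 · 125)g + 2l = 1750g + 2l,   T_one-phase = 7 · 1000g + l = 7000g + l,

and the two-phase broadcast is cheaper whenever l < 5250g, which holds by a wide margin for the machines benchmarked in Chapter 1.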
The two-phase broadcast can be used to broadcast column k and row k
in stage k of the parallel LU decomposition. The broadcasts are performed in
b := ⌈n/p⌉;
{ Spread the vector. }
(0)   if s = 0 then
          for t := 0 to p − 1 do
              for i := tb to min{(t + 1)b, n} − 1 do
                  put xi in P (t);
Fig. 2.5. Two-phase broadcast of a vector of size twelve to four processors. Each
cell represents a vector component; the number in the cell and the greyshade
denote the processor that owns the cell. The processors are numbered 0, 1, 2, 3.
The block size is b = 3. The arrows denote communication. In phase 0, the
vector is spread over the four processors. In phase 1, each processor broadcasts
its subvector to all processors. To avoid clutter, only a few of the destination
cells of phase 1 are shown.
supersteps (6) and (7) of the final algorithm, Algorithm 2.8. The column part
to be broadcast from processor P (s, k mod N ) is the subvector (aik : k < i <
n ∧ i mod M = s), which has length Rk+1 or Rk+1 − 1, and this subvector is
broadcast to the whole processor row P (s, ∗). Every processor row performs
its own broadcast of a column part. The row broadcast is done similarly.
Note the identical superstep numbering ‘(6)/(7)’ of the two broadcasts, which
is a concise way of saying that phase 0 of the row broadcast is carried out
together with phase 0 of the column broadcast and phase 1 with phase 1 of the
column broadcast. This saves two synchronizations. (In an implementation,
such optimizations are worthwhile, but they harm modularity: the complete
broadcast cannot be invoked by one function call; instead, we need to make
the phases available as separately callable functions.)
The final algorithm has eight supersteps in the main loop, whereas the
basic algorithm has twelve. The number of supersteps has been reduced as
follows. First, we observe that the row swap of the basic algorithm turns ele-
ment ark into the pivot element akk . The element ark , however, is already
known by all processors in P (∗, k mod N ), because it is one of the elements
broadcast in superstep (1). Therefore, we divide column k immediately by ark ,
instead of dividing by akk after the row swap. This saves the pivot broadcast
(8) of the basic algorithm and the synchronization of superstep (9). For read-
ability, we introduce the convention of writing the condition ‘if k mod N = t
then’ only once for supersteps (0)–(3), even though we want the test to be car-
ried out in every superstep. This saves space and makes the algorithm more
readable; in an implementation, the test must be repeated in every superstep.
(Furthermore, we must take care to let all processors participate in the global
synchronization, and not only those that test positive.) Second, the index and
row swaps are now combined and performed in two supersteps, numbered (4)
and (5). This saves two synchronizations. Third, the last superstep of stage
k of the algorithm is combined with the first superstep of stage k + 1. We
express this by numbering the last superstep as (0′ ), that is, superstep (0) of
the next stage.
The BSP cost of the final algorithm is computed in the same way as
before. The cost of the separate supersteps is given by Table 2.2. Now, Rk+1 =
⌈(n−k −1)/M ⌉ and Ck+1 = ⌈(n−k −1)/N ⌉, because we use the M ×N cyclic
distribution. The cost expressions for supersteps (6) and (7) are obtained as
in the derivation of (2.23).
The dominant computation superstep in the final algorithm remains the
matrix update; the choice M = N = √p remains optimal for computa-
tion. The costs of the row and column broadcasts do not dominate the
other communication costs any more, since they have decreased to about
2(Rk+1 + Ck+1 )g in total, which is of the same order as the cost C0 g of the
row swap. To find optimal values of M and N for communication, we consider
(6)/(7) broadcast((aik : k < i < n ∧ i mod M = s), P (s, k mod N ), P (s, ∗));
(6)/(7) broadcast((akj : k < j < n ∧ j mod N = t), P (k mod M, t), P (∗, t));
Superstep Cost
(0) l
(1) 2(M − 1)g + l
(2) Rk + l
(3) (N − 1)g + l
(4) (C0 + 1)g + l
(5) l
(6) (Rk+1 − ⌈Rk+1 /N ⌉ + Ck+1 − ⌈Ck+1 /M ⌉)g + l
(7) ((N − 1)⌈Rk+1 /N ⌉ + (M − 1)⌈Ck+1 /M ⌉)g + l
(0′ ) 2Rk+1 Ck+1
Proof
\sum_{k=0}^{n} ⌈k/q⌉ = ⌈0/q⌉ + ⌈1/q⌉ + · · · + ⌈q/q⌉ + · · · + ⌈(n − q + 1)/q⌉ + · · · + ⌈n/q⌉
                    = q · 1 + q · 2 + · · · + q · (n/q) = q \sum_{k=1}^{n/q} k = q · (n/(2q)) · (n/q + 1),   (2.26)
where we have used Lemma 2.7. The proof of the second equation is similar.
Provided n mod √p = 0, the resulting sums are:
\sum_{k=0}^{n−1} R_k = n(n + √p)/(2√p),   (2.27)
\sum_{k=0}^{n−1} R_{k+1} = n(n + √p)/(2√p) − n/√p,   (2.28)
\sum_{k=0}^{n−1} R_{k+1}^2 = n(n + √p)(2n + √p)/(6p) − n²/p.   (2.29)
To compute the sum of ⌈Rk+1/√p⌉ = ⌈⌈(n − k − 1)/√p⌉/√p⌉, we need
the following lemma, which may be useful in other contexts as well.
Lemma 2.10 Let k, q, r be integers with q, r ≥ 1. Then
⌈⌈k/q⌉/r⌉ = ⌈k/(qr)⌉.
= n(n + p)/(2p) − n/p.   (2.30)
The last equality follows from Lemma 2.9 with q = p, which can be applied
if we assume that n mod p = 0 (this also guarantees that n mod √p = 0).
This assumption is made solely for the purpose of simplifying our analysis; in
T_LU = 2n³/(3p) + (3/(2√p) − 2/p) n² + 5n/6
       + ((3/√p − 2/p) n² + (4√p − 4/√p + 4/p − 3) n) g + 8nl.   (2.31)
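Purely as an illustration of how (2.31) is evaluated (an ab initio estimate, not a measured result): take n = 10 000, p = 64, and the Cray T3E values g = 78 and l = 1825 from Table 1.2. The dominant terms are then

2n³/(3p) ≈ 1.04 · 10¹⁰,   (3/√p − 2/p) n² g ≈ 2.7 · 10⁹,   8nl ≈ 1.5 · 10⁸

flops, so computation dominates, communication adds roughly a quarter on top of it, and synchronization contributes only about one per cent; dividing the total by r converts the estimate into seconds.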
within a subset should all be able to determine their source, destination set,
and vector length uniquely from the function parameters. A processor can
then decide to participate as source and/or destination in its subset or to
remain idle. In the LU decomposition program, processors are partitioned
into processor rows for the purpose of column broadcasts, and into processor
columns for row broadcasts.
The broadcast function is designed such that it can perform the phases of
the broadcast separately; a complete broadcast is done by calling the function
twice, first with a value phase = 0, then with phase = 1. The synchronization
terminating the phase is not done by the broadcast function itself, but is
left to the calling program. The advantage of this approach is that unnec-
essary synchronizations are avoided. For instance, phase 0 of the row and
column broadcasts can be combined into one superstep, thus needing only
one synchronization.
The program text of the broadcast function is a direct implementation of
Algorithm 2.7 in the general context described above. Note that the size of
the data vector to be put is the minimum of the block size b and the number
n − tb of components that would remain if all preceding processors had put
b components. The size thus computed may be negative or zero, so that we
must make sure that the put is carried out only for positive size. In phase 1, all
processors avoid sending data back to the source processor. This optimization
has no effect on the BSP cost, since the (cost-determining) source processor
itself does not benefit, as can be seen by studying the role of P (0) in Fig. 2.5.
Still, the optimization reduces the overall communication volume, making it
equal to that of the one-phase broadcast. This may make believers in other
models than BSP happy, and BSP believers with cold feet as well!
The basic structure of the LU decomposition algorithm and the function
bsplu are the same, except that supersteps (0)–(1) of the algorithm are com-
bined into Superstep 0 of the function, supersteps (2)–(3) are combined into
Superstep 1, and supersteps (4)–(5) into Superstep 2. For the pairs (0)–(1)
and (2)–(3) this could be done because BSPlib allows computation and com-
munication to be mixed; for (4)–(5), this could be done because bsp puts are
buffered automatically, so that we do not have to take care of that ourselves.
Note that the superstep (0′ )–(0) of the algorithm is delimited quite naturally
in the program text by the common terminating bsp sync of Superstep 0.
As a result, each stage of the function bsplu has five supersteps.
The relation between the variables of the algorithm and those of the func-
tion bsplu is as follows. The variables M, N, s, t, n, k, smax , r of the algorithm
correspond to the variables M, N, s, t, n, k, smax, r of the function. The global
row index used in the algorithm equals i ∗ M + s, where i is the local row
index used in the function. The global column index equals j ∗ N + t, where
j is the local column index. The matrix element aij corresponds to a[i][j] on
the processor that owns aij , and the permutation component πi corresponds
to pi[i]. The global row index of the local element in column k with largest
#include "bspedupack.h"
void bsp_broadcast(double *x, int n, int src, int s0, int stride, int p0,
                   int s, int phase){
    /* Broadcast the vector x of length n from processor src to
       processors s0+t*stride, 0 <= t < p0. Here n >= 0, p0 >= 1.
       The vector x must have been registered previously.
       Processors are numbered in one-dimensional fashion.
       s = local processor identity.
       phase= phase of two-phase broadcast (0 or 1)
       Only one phase is performed, without synchronization.
    */
    int b, t, t1, dest, nbytes;

    b= ( n%p0==0 ? n/p0 : n/p0+1 ); /* block size */

    if (phase==0 && s==src){ /* phase 0: spread the vector in blocks */
        for (t=0; t<p0; t++){
            dest= s0+t*stride;
            nbytes= MIN(b,n-t*b)*SZDBL;
            if (nbytes>0 && dest!=src)
                bsp_put(dest,&x[t*b],x,t*b*SZDBL,nbytes);
        }
    }
    if (phase==1 && s%stride==s0%stride && s>=s0 && s<s0+p0*stride){
        t= (s-s0)/stride; /* local block index, since s= s0+t*stride */
        nbytes= MIN(b,n-t*b)*SZDBL;
        if (nbytes>0){ /* phase 1: broadcast the local block */
            for (t1=0; t1<p0; t1++){
                dest= s0+t1*stride;
                if (dest!=src)
                    bsp_put(dest,&x[t*b],x,t*b*SZDBL,nbytes);
            }
        }
    }
} /* end bsp_broadcast */
int nloc(int p, int s, int n){
    /* Compute number of local components of processor s for a vector
       of length n distributed cyclically over p processors. */
    return (n+p-s-1)/p ;
} /* end nloc */
void bsplu(int M, int N, int s, int t, int n, int *pi, double **a){
/* Compute LU decomposition of n by n matrix A with partial pivoting.
Processors are numbered in two-dimensional fashion.
Program text for P(s,t) = processor s+t*M,
with 0 <= s < M and 0 <= t < N.
A is distributed according to the M by N cyclic distribution.
*/
bsp_push_reg(&r,SZINT);
if (nlr>0)
pa= a[0];
else
pa= NULL;
bsp_push_reg(pa,nlr*nlc*SZDBL);
bsp_push_reg(pi,nlr*SZINT);
uk= vecallocd(nlc); bsp_push_reg(uk,nlc*SZDBL);
lk= vecallocd(nlr); bsp_push_reg(lk,nlr*SZDBL);
Max= vecallocd(M); bsp_push_reg(Max,M*SZDBL);
Imax= vecalloci(M); bsp_push_reg(Imax,M*SZINT);
if (k%N==t){ /* k=kc*N+t */
/* Search for local absolute maximum in column k of A */
absmax= 0.0; imax= -1;
for (i=kr; i<nlr; i++){
if (fabs(a[i][kc])>absmax){
absmax= fabs(a[i][kc]);
imax= i;
}
}
if (absmax>0.0){
max= a[imax][kc];
} else {
max= 0.0;
}
bsp_pop_reg(Imax); vecfreei(Imax);
bsp_pop_reg(Max); vecfreed(Max);
bsp_pop_reg(lk); vecfreed(lk);
bsp_pop_reg(uk); vecfreed(uk);
bsp_pop_reg(pi);
bsp_pop_reg(pa);
bsp_pop_reg(&r);
} /* end bsplu */
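To illustrate the calling convention, a hypothetical sketch of the column broadcast of supersteps (6)/(7) is given below (the names nlk, for the length of the local column part stored in lk, and pid, for the one-dimensional processor identity s + t·M, are ours; the actual calls in bsplu may differ in detail):

/* Phase 0: the owner P(s, k mod N) spreads its part of column k
   over the N processors of processor row P(s,*). */
bsp_broadcast(lk, nlk, s+(k%N)*M, s, M, N, pid, 0);
bsp_sync();
/* Phase 1: every processor of the row broadcasts its block. */
bsp_broadcast(lk, nlk, s+(k%N)*M, s, M, N, pid, 1);
bsp_sync();

In the actual program, phase 0 of the row broadcast is issued in the same superstep, before the first bsp_sync, so that the row and column broadcasts share their two synchronizations.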
Fig. 2.6. Total broadcast time (in s) of LU decomposition as a function of the
matrix size n, for the one-phase and the two-phase broadcast.
synchronizations, and adding the times for the same program superstep, we
obtain the total time spent in each of the supersteps of the program. By
adding the total time of program supersteps 3 and 4, we compute the total
broadcast time, shown in Fig. 2.6. In this figure, it is easy to see that for
large matrices the two-phase broadcast is significantly faster than the one-
phase broadcast, thus confirming our theoretical analysis. For small matrices,
with n < 4000, the vectors to be broadcast are too small to justify the extra
synchronization. Note that for n = 4000, each processor has a local submat-
rix of size 500 × 500, so that it broadcasts two vectors of size 499 in stage
0, and this size decreases until it reaches 1 in stage 498; in all stages, the
vectors involved are relatively small. The theoretical asymptotic gain factor
in broadcast time for large matrices is about √p/2 = 4; the observed gain
factor of about 1.4 at n = 10 000 is still far from that asymptotic value. Our
results imply that for n = 4000 the broadcast time represents only about
4.8% of the total time; for larger n this fraction is even less. Thus, the signi-
ficant improvement in broadcast time in the range n =4000–10 000 becomes
insignificant compared to the total execution time, explaining the results of
Table 2.3. (On a different computer, with faster computation compared to
communication and hence a higher g, the improvement would be felt also in
the total execution time.)
Program supersteps 2, 3, and 4 account for almost all of the communica-
tion carried out by bsplu. These supersteps perform the row swaps, phase 0
of the row and column broadcasts, and phase 1, respectively. The BSP
model predicts that these widely differing operations have the same total cost,
n²g/√p + nl. Figure 2.7 shows the measured time for these operations.
Fig. 2.7. Total measured time (shown as data points) of row swaps, broadcast
phases 0 and broadcast phases 1 of LU decomposition on a 64-processor Cray
T3E. Also given is the total predicted time (shown by lines).
In gen-
eral, the three timing results for the same problem size are reasonably close
to each other, which at least qualitatively confirms the prediction of the BSP
model. This is particularly encouraging in view of the fact that the commun-
ication volumes involved are quite different: for instance, the communication
volume of phase 1 is about √p − 1 = 7 times that of phase 0. (We may conclude
that communication volume is definitely not a good predictor of communica-
tion time, and that the BSP cost is a much better predictor.) We can also use
the theoretical cost n²g/√p + nl together with benchmark results for r, g, and
l, to predict the time of the three supersteps in a more precise, quantitative
way. To do this, the BSP cost in flops is converted into a time in seconds by
multiplying with tflop = 1/r. The result for the values obtained by bspbench
is plotted in Fig. 2.7 as ‘pessimistic prediction’. The reason for this title is
obvious from the plot.
To explain the overestimate of the communication time, we note that the
theoretical BSP model does not take header overhead into account, that
is, the cost of sending address information together with the data themselves.
The BSP cost model is solely based on the amount of data sent, not on
that of the associated headers. In most practical cases, this matches reality,
because the header overhead is often insignificant. If we benchmark g also
in such a situation, and use this (lower) value of g, the BSP model will
predict communication time well. We may call this value the optimistic
g -value. This value is measured by bspprobe, see Section 1.8.3. If, however,
the data are communicated by put or get operations of very small size, say
less than five reals, such overhead becomes significant. In the extreme case of
single words as data, for example, one real, we have a high header overhead,
which is proportional to the amount of data sent. We can then just include
this overhead in the cost of sending the data themselves, which leads to a
higher, pessimistic g -value. This is the value measured by bspbench. In
fact, the header overhead includes more than just the cost of sending header
information; for instance, it also includes the overhead of a call to the bsp put
function. Such costs are conveniently lumped together. In the transition range,
for data size in the range 1–5 reals, the BSP model does not accurately predict
communication time, but we have an upper and a lower bound. It would be
easy to extend the model to include an extra parameter (called the block size
B in the BSP∗ model [15]), but this would be at the expense of simplicity. We
shall stick to the simple BSP model, and rely on our common sense to choose
between optimism and pessimism.
For LU decomposition, the optimistic g-value is appropriate since we send
data in large blocks. Here, we measure the optimistic g-value by modifying
bspbench to use puts of 16 reals (each of 64 bits), instead of single reals.
This gives g = 10.1. Results with this g are plotted as ‘optimistic prediction’.
The figure shows that the optimistic prediction matches the measurements
reasonably well.
We have looked at the total time of the LU decomposition and the total
time of the different supersteps. Even more insight can be gained by examining
the individual supersteps. The easiest way of doing this is to use the Oxford
BSP toolset profiler, which is invoked by compiling with the option -prof.
Running the resulting program creates PROF.bsp, a file that contains the
statistics of every individual superstep carried out. This file is converted into
a plot in Postscript format by the command
bspprof PROF.bsp
An example is shown in Fig. 2.8. We have zoomed in on the first three stages
of the algorithm by using the zooming option, which specifies the starting and
finishing time in seconds of the profile plot:
bspprof -zoom 0.06275,0.06525 PROF.bsp
Fig. 2.8. Profile produced by bspprof for the first three stages of bsplu with n = 100
on eight processors of a Cray T3E: the number of bytes sent (top) and received
(bottom) by each process as a function of time in milliseconds, for supersteps 9–14
(corresponding to lines 86, 121, 150, 168, 187, and 197 of bsplu.c).
The absolute times given in the profile must be taken with a grain of salt,
since the profiling itself adds some extra time.
The BSP profile of a program tells us where the communication time is
spent, which processor is busiest communicating, and whether more time is
spent communicating or computing. Because our profiling example concerns
an LU decomposition with a row distribution (M = 8, N = 1), it is particu-
larly easy to recognize what happens in the supersteps. Before reading on, try
to guess which superstep is which.
Superstep 10 in the profile corresponds to program superstep 0, super-
step 11 corresponds to program superstep 1, and so on. Superstep 10
communicates the local maxima found in the pivot search. For the small
problem size of n = 100 this takes a significant amount of time, but for larger
problems the time needed becomes negligible compared to the other parts of
the program. Superstep 11 contains no communication because N = 1, so
that the broadcast of the pivot index within a processor row becomes a copy
operation within a processor. Superstep 12 represents the row swap; it has
two active processors, namely the owners of rows k and k + 1. In stage 0,
these are processors P (0) and P (1); in stage 1, P (1) and P (2); and in stage 2,
P (2) and P (3). Superstep 13 represents the first phase of a row broadcast. In
stage 0, P (0) sends data and all other processors receive; in stage 1, P (1) is
the sender. Superstep 14 represents the second phase. In stage 0, P (0) sends
⌈99/8⌉ = 13 row elements to seven other processors; P (1)–P (6) each send 13
elements to six other processors (not to P (0)); and P (7) sends the remain-
ing 8 elements to six other processors. The number of bytes sent by P (0) is
13 · 7 · 8 = 728; by each of P (1)–P (6), 624; and by P (7), 384. The total is
4856 bytes. These numbers agree with the partitioning of the bars in the top
part of the plot.
Returning to the fashion world, did the BSP model survive the catwalk?
Guided by the BSP model, we obtained a theoretically superior algorithm
with a better spread of the communication tasks over the processors. Our
experiments show that this algorithm is also superior in practice, but that the
benefits occur only in a certain range of problem sizes and that their impact
is limited on our particular computer. The BSP model helped explain our
experimental results, and it can tell us when to expect significant benefits.
The superstep concept of the BSP model helped us zoom in on certain
parts of the computation and enabled us to understand what happens in
those parts. Qualitatively speaking, we can say that the BSP model passed
an important test. The BSP model also gave us a rough indication of the
expected time for different parts of the algorithm. Unfortunately, to obtain
this indication, we had to distinguish between two types of values for the
communication parameter g, reflecting whether or not the put operations are
extremely small. In most cases, we can (and should) avoid extremely small
put operations, at least in the majority of our communication operations, so
that we can use the optimistic g-value for predictions. Even then, the resulting
prediction can easily be off by 50%, see Fig. 2.7.
A simple explanation for the remaining discrepancy between prediction
and experiment is that there are lies, damned lies, and benchmarks. Sub-
stituting a benchmark result in a theoretical time formula gives an ab initio
prediction, that is, a prediction from basic principles, and though this may
be useful as an indication of expected performance, it will hardly ever be
an accurate estimate. There are just too many possibilities for quirks in the
hardware and the system software, ranging from obscure cache behaviour to
inadequate implementation of certain communication primitives. Therefore,
we should not have unrealistic quantitative expectations of a computation
model.
The square block distribution leads to a bad load balance, because more
and more processors become idle when the computation proceeds. As a res-
ult, the computation takes three times longer than with the square cyclic
distribution. Fox et al. [71,Chapter 20] present an algorithm for LU decom-
position of a banded matrix. They perform a theoretical analysis and give
experimental results on a hypercube computer. (A matrix A is banded with
upper bandwidth bU and lower bandwidth bL if aij = 0 for i < j − bU
and i > j + bL . A dense matrix can be viewed as a degenerate special case
of a banded matrix, with bL = bU = n − 1.) Bisseling and van de Vorst [23]
prove optimality with respect to load balance of the square cyclic distribution,
within the class of Cartesian distributions. They also show that the communi-
cation time on a square mesh of processors is of the same order, O(n²/√p),
as the load imbalance and that on a complete network the communication
volume is O(n²√p). (For a BSP computer this would imply a communication
cost of O(n²g/√p), provided the communication can be balanced.) Extending
these results, Hendrickson and Womble [98] show that the square cyclic distri-
bution is advantageous for a large class of matrix computations, including LU
decomposition, QR decomposition, and Householder tridiagonalization. They
present experimental results for various ratios N/M of the M × N cyclic
distribution.
A straightforward generalization of the cyclic distribution is the block-
cyclic distribution, where the cyclic distribution is used to assign rectangu-
lar submatrices to processors instead of assigning single matrix elements.
O’Leary and Stewart [150] proposed this distribution already in 1986, giving
it the name block-torus assignment. It is now widely used, for example, in
ScaLAPACK (Scalable Linear Algebra Package) [24,25,41] and in the object-
oriented package PLAPACK (Parallel Linear Algebra Package) [4,180]. The
M × N block-cyclic distribution with block size b0 × b1 is defined by
parallel eigensystem solver based on the square cyclic distribution, and they
argue in favour of blocking of algorithms but not of distributions.
2.8 Exercises
1. Find a matrix distribution for parallel LU decomposition that is optimal
with respect to computational load balance in all stages of the computation.
The distribution need not be Cartesian. When would this distribution be
applicable?
2. The ratio N/M = 1 is close to optimal for the M × N cyclic distribution
used in Algorithm 2.8 and hence this ratio was assumed in our cost analysis.
The optimal ratio, however, may be slightly different. This is mainly due to
an asymmetry in the communication requirements of the algorithm. Explain
this by using Table 2.2. Find the ratio N/M with the lowest communication
cost, for a fixed number of processors p = M N . What is the reduction in
communication cost for the optimal ratio, compared with the cost for the
ratio N/M = 1?
3. Algorithm 2.8 contains a certain amount of unnecessary communication,
because the matrix elements arj with j > k are first swapped out and then
spread and broadcast. Instead, they could have been spread already from their
original location.
(a) How would you modify the algorithm to eliminate superfluous commu-
nication? How much communication cost is saved for the square cyclic
distribution?
(b) Modify the function bsplu by incorporating this algorithmic improve-
ment. Test the modified program for n = 1000. What is the resulting
reduction in execution time? What is the price to be paid for this
optimization?
7. (∗∗) Once upon a time, there was a Mainframe computer that had great dif-
ficulty in multiplying floating-point numbers and preferred to add or subtract
them instead. So the Queen decreed that computations should be carried out
with a minimum of multiplications. A young Prince, Volker Strassen [170],
set out to save multiplications in the Queen’s favourite pastime, computing
the product of 2 × 2 matrices on the Royal Mainframe,
\begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix} = \begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{pmatrix} \begin{pmatrix} b_{00} & b_{01} \\ b_{10} & b_{11} \end{pmatrix}.   (2.35)
At the time, this took eight multiplications and four additions. The young
Prince slew one multiplication, but at great cost: fourteen new additions
sprang up. Nobody knew how he had obtained his method, but there were
rumours [46], and indeed the Prince had drunk from the magic potion. Later,
three additions were beheaded by the Princes Paterson and Winograd and the
resulting Algorithm 2.9 was announced in the whole Kingdom. The Queen’s
subjects happily noted that the new method, with seven multiplications and
fifteen additions, performed the same task as before. The Queen herself lived
happily ever after and multiplied many more 2 × 2 matrices.
(a) Join the inhabitants of the Mainframe Kingdom and check that the
task is carried out correctly.
(b) Now replace the matrix elements by submatrices of size n/2 × n/2,
\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix} = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix} \begin{pmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{pmatrix}.   (2.36)
l0 := a00 ; r0 := b00 ;
l1 := a01 ; r1 := b10 ;
l2 := a10 + a11 ; r2 := b01 − b00 ;
l3 := a00 − a10 ; r3 := b11 − b01 ;
l4 := l2 − a00 ; r4 := r3 + b00 ;
l5 := a01 − l4 ; r5 := b11 ;
l6 := a11 ; r6 := b10 − r4 ;
for i := 0 to 6 do
mi := li ri ;
t0 := m0 + m4 ;
t1 := t0 + m3 ;
c00 := m0 + m1 ;
c01 := t0 + m2 + m5 ;
c10 := t1 + m6 ;
c11 := t1 + m2 ;
P_v x = x − (2/‖v‖²) v v^T x = x − (2 v^T x/(v^T v)) v.   (2.38)
(b) Let e = (1, 0, 0, . . . , 0)T . Show that the choice v = x − ‖x‖e implies
Pv x = ‖x‖e. This means that we have an orthogonal transformation
that sets all components of x to zero, except the first.
(c) Algorithm 2.10 is a sequential algorithm that determines a vector v
such that Pv x = ‖x‖e. For convenience, the algorithm also outputs
the corresponding scalar β = 2/‖v‖² and the norm of the input vector
µ = ‖x‖. The vector has been normalized such that v0 = 1. For the
memory-conscious, this can save one memory cell when storing v. The
algorithm contains a clever trick proposed by Parlett [154] to avoid
subtracting nearly equal quantities (which would result in so-called
subtractive cancellation and severe loss of significant digits). Now
design and implement a parallel version of this algorithm. Assume
that the input vector x is distributed by the cyclic distribution over p
processors. The output vector v should become available in the same
distribution. Try to keep communication to a minimum. What is the
BSP cost?
{ Compute v = x − ‖x‖e }
if x0 ≤ 0 then v0 := x0 − µ else v0 := −α/(x0 + µ);
for i := 1 to n − 1 do
    vi := xi ;
for i := k + 1 to n − 1 do
    pi := 0;
    for j := k + 1 to n − 1 do
        pi := pi + βaij vj ;
γ := 0;
for i := k + 1 to n − 1 do
    γ := γ + pi vi ;
for i := k + 1 to n − 1 do
    wi := pi − (βγ/2)vi ;
ak+1,k := µ; ak,k+1 := µ;
for i := k + 2 to n − 1 do
    aik := vi ; aki := vi ;
for i := k + 1 to n − 1 do
    for j := k + 1 to n − 1 do
        aij := aij − vi wj − wi vj ;
(a) Design a basic parallel algorithm for the solution of a lower triangular
system Lx = b, where L is an n × n lower triangular matrix, b a given
vector of length n, and x the unknown solution vector of length n.
Assume that the number of processors is p = M² and that the matrix
is distributed by the square cyclic distribution. Hint: the computation
and communication can be organized in a wavefront pattern, where
in stage k of the algorithm, computations are carried out for matrix
elements lij with i+j = k. After these computations, communication is
performed: the owner P(s, t) of an element l_{ij} on the wavefront puts x_j
into P((s+1) mod M, t), which owns l_{i+1,j}, and it also puts the partial sum
\sum_{r=0}^{j} l_{ir} x_r into P(s, (t + 1) mod M), which owns l_{i,j+1}.
(A sequential forward-substitution sketch is given after part (f) below.)
(b) Reduce the amount of communication. Communicate only when this
is really necessary.
(c) Which processors are working in stage k? Improve the load balance.
Hint: procrastinate!
(d) Determine the BSP cost of the improved algorithm.
(e) Now assume that the matrix is distributed by the square block-cyclic
distribution, defined by eqn (2.33) with M = N = √p and b0 = b1 = β.
How would you generalize your algorithm for solving lower triangu-
lar systems to this case? Determine the BSP cost of the generalized
algorithm and find the optimal block size parameter β for a computer
with given BSP parameters p, g, and l.
(f) Implement your algorithm for the square cyclic distribution in a
function bspltriang. Write a similar function bsputriang that
solves upper triangular systems. Combine bsplu, bspltriang, and
bsputriang into one program bsplinsol that solves a linear system
of equations Ax = b. The program has to permute b into Pπ−1 b, where
π is the partial pivoting permutation produced by the LU decomposi-
tion. Measure the execution time of the LU decomposition and the
triangular system solutions for various p and n.
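For reference, the sequential computation that part (a) parallelizes is
ordinary forward substitution. A minimal C sketch (not the parallel function
bspltriang asked for in part (f)) is:

#include <stdio.h>

/* Forward substitution for Lx = b:
   x[i] = (b[i] - sum_{j<i} L[i][j]*x[j]) / L[i][i]. */
#define N 4
int main(void){
    double L[N][N]= {{2,0,0,0},{1,3,0,0},{4,1,1,0},{2,2,2,2}};
    double b[N]= {2, 4, 7, 10};   /* chosen so that x = (1, 1, 2, 1) */
    double x[N];
    int i, j;

    for (i=0; i<N; i++){
        double sum= b[i];
        for (j=0; j<i; j++)
            sum -= L[i][j]*x[j];  /* in the wavefront view, this running sum
                                     travels to the right along row i */
        x[i]= sum/L[i][i];        /* the finished x[i] then travels down column i */
    }
    for (i=0; i<N; i++)
        printf("x[%d]= %g\n", i, x[i]);
    return 0;
}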
10. (∗∗) The LU decomposition function bsplu is, well, educational. It teaches
important distribution and communication techniques, but it is far from
optimal. Our goal now is to turn bsplu into a fast program that is suit-
able for a production environment where every flop/s counts. We optimize
Fig. 2.9. Submatrices created by combining the operations from stages k0 ≤ k < k1
of the LU decomposition. (The rows and columns are split at k0 and k1, giving the
blocks A00, A01, A02, A10, A11, A12, A20, A21, A22.)
the program gradually, taking care that we can observe the effect of each
modification. Measure the gains (or losses) achieved by each modification and
explain your results.
(a) In parallel computing, postponing work until it can be done in bulk
quantity creates opportunities for optimization. This holds for compu-
tation work as well as for communication work. For instance, we can
combine several consecutive stages k, k0 ≤ k < k1 , of the LU decom-
position algorithm. As a first optimization, postpone all operations on
the submatrix A(∗, k1 : n − 1) until the end of stage k1 − 1, see Fig. 2.9.
This concerns two types of operations: swapping elements as part of
row swaps and modifying elements as part of matrix updates. Opera-
tions on the submatrix A(∗, 0 : k1 −1) are done as before. Carry out the
postponed work by first permuting all rows involved in the row swaps
and then performing a sequence of row broadcasts and matrix updates.
This affects only the submatrices A12 and A22 . (We use the names of
the submatrices as given in the figure when this is more convenient.)
To update the matrix correctly, the values of the columns A(k +
1 : n − 1, k), k0 ≤ k < k1 , that are broadcast must be stored in
an array L immediately after the broadcast. Caution: the row swaps
inside the submatrix A(∗, k0 : k1 − 1) will be carried out on the sub-
matrix itself, but not on the copies created by the column broadcasts.
At the end of stage k1 − 1, the copy L(i, k0 : k1 − 1) of row i, with
elements frozen at various stages, may not be the same as the cur-
rent row A(i, k0 : k1 − 1). How many rows are affected in the worst
case? Rebroadcast those rows. Do you need to rebroadcast in two
phases? What is the extra cost incurred? Why is the new version of
the algorithm still an improvement?
(b) Not all flops are created equal. Flops from matrix operations can often
be performed at much higher rates than flops from vector or scalar
operations. We can exploit this by postponing and then combining all
(g) The high-performance put function bsp_hpput of BSPlib has exactly
the same syntax as the bsp_put function:
        bsp_hpput(pid, source, dest, offset, nbytes);
It does not provide the safety of buffering at the source and destination
that bsp_put gives. The read and write operations can in principle
occur at any time during the superstep. Therefore the user must ensure
safety by taking care that different communication operations do not
interfere. The primary aim of using this primitive is to save the memory
of the buffers. Sometimes, this makes the difference between being able
to solve a problem or not. A beneficial side effect is that this saves time
as well. There also exists a bsp_hpget operation, with syntax
        bsp_hpget(pid, source, offset, dest, nbytes);
which should be used with the same care as bsp_hpput. In the LU
decomposition program, matrix data are often put into temporary
arrays and not directly into the matrix itself, so that there is no
need for additional buffering by the system. Change the bsp_put calls into
bsp_hpput calls, wherever this is useful and allowed, perhaps after a few
minor modifications. What is the effect? (A small stand-alone usage sketch
of bsp_hpput follows after part (j).)
(h) For short vectors, a one-phase broadcast is faster than a two-phase
broadcast. Replace the two-phase broadcast in stage k of row elements
akj with k < j < k1 by a one-phase broadcast. For which values of b is
this an improvement?
(i) As already observed in Exercise 3, a disadvantage of the present
approach to row swaps and row broadcasts in the submatrix A(k0 :
n − 1, k1 : n − 1) is that elements of pivot rows move three times: each
such element is first moved into the submatrix A12 as part of a per-
mutation; then it is moved as part of the data spreading operation
in the first phase of the row broadcast; and finally it is copied and
broadcast in the second phase. This time, there are b rows instead
of one that suffer from excessive mobility, and they can be dealt with
together. Instead of moving the local part of a row into A12 , you should
spread it over the M processors of its processor column (in the same
way for every row). As a result, A12 becomes distributed over all p
processors in a column distribution. Updating A12 becomes a local
operation, provided each processor has a copy of the lower triangular
part of A11 . How much does this approach save?
(j) Any ideas for further improvement?
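A minimal stand-alone sketch of bsp_hpput, as mentioned in part (g), is given
below. It is a toy program, not part of bsplu; it assumes the bspedupack
header used throughout this book. Each processor writes one double, unbuffered,
into a registered temporary on the next processor; the temporary is not touched
otherwise during the superstep, which is what makes the high-performance put
safe.

#include "bspedupack.h"

void bsphpput_demo(){
    int p, s;
    double tmp= 0.0, val;

    bsp_begin(bsp_nprocs());
    p= bsp_nprocs();
    s= bsp_pid();
    val= (double)s;

    bsp_push_reg(&tmp, SZDBL);
    bsp_sync();                   /* registration takes effect */

    /* same syntax as bsp_put, but no buffering at source or destination */
    bsp_hpput((s+1)%p, &val, &tmp, 0, SZDBL);
    bsp_sync();
    printf("Processor %d received %g\n", s, tmp);

    bsp_pop_reg(&tmp);
    bsp_sync();
    bsp_end();
} /* end bsphpput_demo */

int main(int argc, char **argv){
    bsp_init(bsphpput_demo, argc, argv);
    bsphpput_demo();
    return 0;
}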
3
THE FAST FOURIER TRANSFORM
    \tilde{f}(t) = \sum_{k=-\infty}^{\infty} c_k\, e^{2\pi i k t/T},    (3.1)
and i denotes the complex number with i2 = −1. (To avoid confusion, we
ban the index i from this chapter.) Under relatively mild assumptions, such
as piecewise smoothness, it can be proven that the Fourier series converges
for every t. (A function is called smooth if it is continuous and its derivative
is also continuous. A property is said to hold piecewise if each finite interval
of its domain can be cut up into a finite number of pieces where the property
holds; it need not hold in the end points of the pieces.) A piecewise smooth
function satisfies f˜(t) = f (t) in points of continuity; in the other points, f˜ is
the average of the left and right limit of f . (For more details, see [33].) If f
is real-valued, we can use Euler’s formula eiθ = cos θ + i sin θ, and eqns (3.1)
and (3.2) to obtain a real-valued Fourier series expressed in sine and cosine
functions.
On digital computers, signal or image functions are represented by their
values at a finite number of sample points. A compact disc contains 44 100
sample points for each second of recorded music. A high-resolution digital
image may contain 1024 by 1024 picture elements (pixels). On an unhappy
day in the future, you might find your chest being cut by a CT scanner into
40 slices, each containing 512 by 512 pixels. In all these cases, we obtain a
discrete approximation to the continuous world.
Suppose we are interested in computing the Fourier coefficients of a
T -periodic function f which is sampled at n points tj = jT /n, with j =
0, 1, . . . , n − 1. Using the trapezoidal rule for numerical integration on the
interval [0, T ] and using f (0) = f (T ), we obtain an approximation
    c_k = \frac{1}{T}\int_0^T f(t)\, e^{-2\pi i k t/T}\, dt
        \approx \frac{1}{T}\cdot\frac{T}{n}\left[\frac{f(0)}{2}
          + \sum_{j=1}^{n-1} f(t_j)\, e^{-2\pi i k t_j/T} + \frac{f(T)}{2}\right]
        = \frac{1}{n}\sum_{j=0}^{n-1} f(t_j)\, e^{-2\pi i j k/n}.    (3.3)
    x_j = \frac{1}{n}\sum_{k=0}^{n-1} y_k\, e^{2\pi i j k/n},  for 0 ≤ j < n.    (3.5)
where
ωn = e−2πi/n . (3.7)
Figure 3.1 illustrates the powers of ωn occurring in the Fourier matrix; these
are sometimes called the roots of unity.
Fig. 3.1. Roots of unity ω^k, with ω = ω_8 = e^{−2πi/8}, shown in the complex plane.
In the first sum, we recognize a Fourier transform of length n/2 of the even
components of x. To cast the sum exactly into this form, we must restrict the
output indices to the range 0 ≤ k < n/2. In the second sum, we recognize
a transform of the odd components. This leads to a method for computing
the set of coefficients yk , 0 ≤ k < n/2, which uses two Fourier transforms
of length n/2. To obtain a method for computing the remaining coefficients
yk , n/2 ≤ k < n, we have to rewrite eqn (3.10). Let k ′ = k − n/2, so that
0 ≤ k ′ < n/2. Substituting k = k ′ + n/2 in eqn (3.10) gives
    y_{k'+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{j(k'+n/2)}
               + \omega_n^{k'+n/2} \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{j(k'+n/2)}.
By using the equalities \omega_{n/2}^{n/2} = 1 and \omega_n^{n/2} = −1, and by dropping the primes
we obtain
    y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk}
              − \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk},  for 0 ≤ k < n/2.    (3.12)
Comparing eqns (3.10) and (3.12), we see that the sums appearing in the
right-hand sides are the same; if we add the sums we obtain yk and if we
subtract them we obtain yk+n/2 . Here, the savings become apparent: we need
to compute the sums only once.
Following the basic idea, we can compute a Fourier transform of length n
by first computing two Fourier transforms of length n/2 and then combining
the results. Combining requires n/2 complex multiplications, n/2 complex
additions, and n/2 complex subtractions, that is, a total of (6 + 2 + 2) · n/2 =
5n flops. If we use the DFT for the half-length Fourier transforms, the total
flop count is already reduced from 8n^2 − 2n to 2·[8(n/2)^2 − 2(n/2)] + 5n =
4n^2 + 3n, thereby saving almost a factor of two in computing time. Of course,
we can apply the idea recursively, computing the half-length transforms by
the same splitting method. The recursion ends when the input length becomes
odd; in that case, we switch to a straightforward DFT algorithm. If the ori-
ginal input length is a power of two, the recursion ends with a DFT of length
one, which is just a trivial copy operation y0 := x0 . Figure 3.2 shows how the
problem is split up recursively for n = 8. Algorithm 3.1 presents the recursive
FFT algorithm for an arbitrary input length.
For simplicity, we assume from now on that the original input length is a
power of two. The flop count of the recursive FFT algorithm can be computed
0 1 2 3 4 5 6 7
0 2 4 6 1 3 5 7
0 4 2 6 1 5 3 7
0 4 2 6 1 5 3 7
Fig. 3.2. Recursive computation of the DFT for n = 8. The numbers shown are
the indices in the original vector, that is, the number j denotes the index of the
vector component xj (and not the numerical value). The arrows represent the
splitting operation. The combining operation is executed in the reverse direction
of the arrows.
if n mod 2 = 0 then
    x^e := x(0 : 2 : n − 1);
    x^o := x(1 : 2 : n − 1);
    y^e := FFT(x^e, n/2);
    y^o := FFT(x^o, n/2);
    for k := 0 to n/2 − 1 do
        τ := ω_n^k y_k^o;
        y_k := y_k^e + τ;
        y_{k+n/2} := y_k^e − τ;
else y := DFT(x, n);
because an FFT of length n requires two FFTs of length n/2 and the com-
bination of the results requires 5n flops. Since the half-length FFTs are split
again, we substitute eqn (3.13) in itself, but with n replaced by n/2. This
gives
    T(n) = 2\left[2T\!\left(\frac{n}{4}\right) + 5\,\frac{n}{2}\right] + 5n
         = 4T\!\left(\frac{n}{4}\right) + 2 · 5n.    (3.14)
Repeating this process until the input length becomes one, and using T (1) = 0,
we obtain
    T(n) = nT(1) + (log_2 n) · 5n = 5n log_2 n.    (3.15)
The gain of the FFT compared with the straightforward DFT is huge: only
5n log_2 n flops are needed instead of 8n^2 − 2n. For example, you may be able
to process a sound track of n = 32 768 samples (about 0.74 s on a compact
disc) on your personal computer in real time by using FFTs, but you would
have to wait 43 min if you decided to use DFTs instead.
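For readers who prefer to see the recursion in code, here is a compact C sketch
of the radix-2 splitting. It is a simplification, restricted to input lengths
that are a power of two, and it is not the book's implementation (which is
nonrecursive); the weights are recomputed on the fly rather than tabulated.

#include <stdio.h>
#include <math.h>
#include <complex.h>

/* Recursive radix-2 FFT sketch: y = F_n x with (F_n)_{jk} = exp(-2*pi*i*j*k/n),
   for n a power of two. */
void fft_rec(int n, const double complex *x, double complex *y){
    if (n == 1){
        y[0]= x[0];
        return;
    }
    double complex xe[n/2], xo[n/2], ye[n/2], yo[n/2];
    int j, k;

    for (j=0; j<n/2; j++){
        xe[j]= x[2*j];                    /* even components */
        xo[j]= x[2*j+1];                  /* odd components */
    }
    fft_rec(n/2, xe, ye);
    fft_rec(n/2, xo, yo);
    for (k=0; k<n/2; k++){
        double complex tau= cexp(-2.0*M_PI*I*k/n)*yo[k];  /* omega_n^k * y_k^o */
        y[k]= ye[k] + tau;
        y[k+n/2]= ye[k] - tau;
    }
}

int main(void){
    double complex x[8], y[8];
    int j;

    for (j=0; j<8; j++)
        x[j]= j+1;
    fft_rec(8, x, y);
    for (j=0; j<8; j++)
        printf("y[%d]= %.3f %+.3f i\n", j, creal(y[j]), cimag(y[j]));
    return 0;
}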
that do not invoke the recursive function themselves. Figure 3.2 shows the tree
for an FFT of length eight; the tree is binary, since each node has at most two
children. The tree-like nature of recursive computations may lead you into
thinking that such algorithms are straightforward to parallelize. Indeed, it is
clear that the computation can be split up easily. A difficulty arises, however,
because a recursive algorithm traverses its computation tree sequentially, vis-
iting different subtrees one after the other. For a parallel algorithm, we ideally
would like to access many subtrees simultaneously. A first step towards paral-
lelization of a recursive algorithm is therefore to reformulate it in nonrecursive
form. The next step is then to split and perhaps reorganize the computation.
In this section, we derive a nonrecursive FFT algorithm, which is known as
the Cooley–Tukey algorithm [45].
Van Loan [187] presents a unifying framework in which the Fourier mat-
rix Fn is factorized as the product of permutation matrices and structured
sparse matrices. This helps in concisely formulating FFT algorithms, classify-
ing the huge amount of existing FFT variants, and identifying the fundamental
variants. We adopt this framework in deriving our parallel algorithm.
The computation of Fn x by the recursive algorithm can be expressed in
matrix language as
    F_n x = \begin{pmatrix} I_{n/2} & Ω_{n/2} \\ I_{n/2} & −Ω_{n/2} \end{pmatrix}
            \begin{pmatrix} F_{n/2} & 0 \\ 0 & F_{n/2} \end{pmatrix}
            \begin{pmatrix} x(0 : 2 : n − 1) \\ x(1 : 2 : n − 1) \end{pmatrix}.    (3.16)
Here, Ωn denotes the n × n diagonal matrix with the first n powers of ω2n on
the diagonal,
    Ω_n = diag(1, ω_{2n}, ω_{2n}^2, . . . , ω_{2n}^{n−1}).    (3.17)
even rows are rows 0, 2, 4, . . . , n − 2.) Using this notation, we can write
    S_n x = \begin{pmatrix} x(0 : 2 : n − 1) \\ x(1 : 2 : n − 1) \end{pmatrix}.    (3.19)
(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).
Im ⊗ In = Imn .
Proof Boring.
Lemma 3.3 saves some ink because we can drop brackets and write A⊗B ⊗
C instead of having to give an explicit evaluation order such as (A ⊗ B) ⊗ C.
Fig. 3.3. Butterfly operation transforming an input pair (x_j, x_{j+n/2}) into an output
pair (x′_j, x′_{j+n/2}). Right butterfly: © 2002 Sarai Bisseling, reproduced with sweet
permission.
Using the Kronecker product notation, we can write the middle part of the
right-hand side of eqn (3.16) as
    I_2 ⊗ F_{n/2} = \begin{pmatrix} F_{n/2} & 0 \\ 0 & F_{n/2} \end{pmatrix}.    (3.21)
The leftmost part of the right-hand side of eqn (3.16) is the n×n butterfly
matrix
    B_n = \begin{pmatrix} I_{n/2} & Ω_{n/2} \\ I_{n/2} & −Ω_{n/2} \end{pmatrix}.    (3.22)
The butterfly matrix obtains its name from the butterfly-like pattern in which
it transforms input pairs (xj , xj+n/2 ), 0 ≤ j < n/2, into output pairs, see
Fig. 3.3. The butterfly matrix is sparse because only 2n of its n2 elements are
nonzero. It is also structured, because its nonzeros form three diagonals.
Example 3.6
    B_4 = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & −i \\ 1 & 0 & −1 & 0 \\ 0 & 1 & 0 & i \end{pmatrix}.
Using our new notations, we can rewrite eqn (3.16) as F_n x = B_n (I_2 ⊗ F_{n/2}) S_n x.
Since this holds for all vectors x, we obtain the matrix factorization
F_n = B_n (I_2 ⊗ F_{n/2}) S_n. The same splitting can be applied repeatedly,
factorizing the middle factor of the right-hand side. For a factor of the form
Ik ⊗ Fn/k , this is done by applying Lemmas 3.4 (twice), 3.3, and 3.5, giving
    I_k ⊗ F_{n/k} = I_k ⊗ \left( B_{n/k} (I_2 ⊗ F_{n/(2k)}) S_{n/k} \right)
                  = (I_k ⊗ B_{n/k}) \left( I_k ⊗ (I_2 ⊗ F_{n/(2k)}) S_{n/k} \right)
                  = (I_k ⊗ B_{n/k}) (I_k ⊗ I_2 ⊗ F_{n/(2k)}) (I_k ⊗ S_{n/k})
                  = (I_k ⊗ B_{n/k}) (I_{2k} ⊗ F_{n/(2k)}) (I_k ⊗ S_{n/k}).    (3.25)
After repeatedly eating away at the middle factor, from both sides, we finally
reach In ⊗Fn/n = In ⊗I1 = In . Collecting the factors produced in this process,
we obtain the following theorem, which is the so-called decimation in time
(DIT) variant of the Cooley–Tukey factorization. (The name ‘DIT’ comes
from splitting—decimating—the samples taken over time, cf. eqn (3.9).)
Theorem 3.7 (Cooley and Tukey [45]—DIT) Let n be a power of two with
n ≥ 2. Then
    F_n = (I_1 ⊗ B_n)(I_2 ⊗ B_{n/2})(I_4 ⊗ B_{n/4}) ⋯ (I_{n/2} ⊗ B_2) R_n,
where R_n = (I_{n/2} ⊗ S_2)(I_{n/4} ⊗ S_4) ⋯ (I_1 ⊗ S_n).
To see what R_n does, write each index j, 0 ≤ j < n, in binary as j = (b_{m−1} ⋯ b_1 b_0)_2,
where b_k ∈ {0, 1} is the kth bit and n = 2^m. We call b_0 the least significant
bit and bm−1 the most significant bit. We express the binary expansion by
the notation
    (b_{m−1} ⋯ b_1 b_0)_2 = \sum_{k=0}^{m−1} b_k 2^k.    (3.27)
Example 3.9
    (10100101)_2 = 2^7 + 2^5 + 2^2 + 2^0 = 165.
Multiplying a vector by Rn starts by splitting the vector into a subvector
of components x(bm−1 ···b0 )2 with b0 = 0 and a subvector of components with
b0 = 1. This means that the most significant bit of the new position of a
component becomes b0 . Each subvector is then split according to bit b1 , and
so on. Thus, the final position of the component with index (bm−1 · · · b0 )2
becomes (b0 · · · bm−1 )2 , that is, the bit reverse of the original position; hence
the name. The splittings of the bit reversal are exactly the same as those of
the recursive procedure, but now they are lumped together. For this reason,
Fig. 3.2 can also be viewed as an illustration of the bit-reversal permutation,
where the bottom row gives the bit reverses of the vector components shown
at the top.
The following theorem states formally that Rn corresponds to a bit-reversal
permutation ρn , where the correspondence between a permutation σ and a
permutation matrix Pσ is given by eqn (2.10). The permutation for n = 8 is
displayed in Table 3.1.
Theorem 3.10 Let n = 2m , with m ≥ 1. Let ρn : {0, . . . , n − 1} →
{0, . . . , n − 1} be the bit-reversal permutation defined by
    ρ_n((b_{m−1} ⋯ b_1 b_0)_2) = (b_0 b_1 ⋯ b_{m−1})_2.
Then
    R_n = P_{ρ_n}.
Proof First, we note that
    (S_n x)_{(b_{m−1} ⋯ b_1 b_0)_2} = x_{(b_{m−2} ⋯ b_0 b_{m−1})_2},    (3.28)
for all binary numbers (b_{m−1} ⋯ b_0)_2 of m bits; in other words, S_n performs a
circular shift on the bits of the index of each component.

Table 3.1. Bit reversal for n = 8

    j    binary    reverse    ρ_8(j)
    0    000       000        0
    1    001       100        4
    2    010       010        2
    3    011       110        6
    4    100       001        1
    5    101       101        5
    6    110       011        3
    7    111       111        7

Second,
    ((I_{n/2^t} ⊗ S_{2^t}) x)_{(b_{m−1} ⋯ b_0)_2} = x_{(b_{m−1} ⋯ b_t\, b_{t−2} ⋯ b_0\, b_{t−1})_2},    (3.29)
for all binary numbers (bm−1 · · · b0 )2 of m bits. This is because we can apply
eqn (3.28) with t bits instead of m to the subvector of length 2^t of x starting at
index j = (b_{m−1} ⋯ b_t)_2 · 2^t. Here, only the t least significant bits participate
in the circular shift. Third, using the definition Rn = (In/2 ⊗ S2 ) · · · (I1 ⊗ Sn )
and applying eqn (3.29) for t = 1, 2, 3, . . . , m we obtain
    (R_n x)_{(b_{m−1} ⋯ b_0)_2} = x_{(b_0 ⋯ b_{m−1})_2}.
Therefore (R_n x)_j = x_{ρ_n(j)}, for all j. Using ρ_n = ρ_n^{−1} and applying Lemma 2.5
we arrive at (R_n x)_j = x_{ρ_n^{−1}(j)} = (P_{ρ_n} x)_j. Since this holds for all j, we have
R_n x = P_{ρ_n} x. Since this in turn holds for all x, it follows that R_n = P_{ρ_n}.
Algorithm 3.2 is an FFT algorithm based on the Cooley–Tukey theorem.
The algorithm overwrites the input vector x with the output vector Fn x. For
later use, we make parts of the algorithm separately callable: the function
bitrev(x, n) performs a bit reversal of length n and the function UFFT(x, n)
performs an unordered FFT of length n, that is, an FFT without bit
reversal. Note that multiplication by the butterfly matrix B_k combines
components at distance k/2, where k is a power of two. In the inner loop,
the subtraction x_{rk+j+k/2} := x_{rk+j} − τ is performed before the addition
x_{rk+j} := x_{rk+j} + τ, because the old value of x_{rk+j} must be used in the
computation of x_{rk+j+k/2}. (Performing these statements in the reverse order
would require the use of an extra temporary variable.) A simple count of
the floating-point operations shows that the cost of the nonrecursive FFT
algorithm is the same as that of the recursive algorithm.
Proof
    F_n = F_n^T = \left( (I_1 ⊗ B_n)(I_2 ⊗ B_{n/2})(I_4 ⊗ B_{n/4}) ⋯ (I_{n/2} ⊗ B_2) R_n \right)^T
The butterfly blocks of x that are multiplied by blocks of I4 ⊗ B2 are x(0 : 1),
x(2 : 3), x(4 : 5), and x(6 : 7). The first two blocks are contained in processor
block x(0 : 3), which belongs to P (0). The last two blocks are contained in
processor block x(4: 7), which belongs to P (1).
In contrast to the block distribution, the cyclic distribution makes butter-
flies with k ≥ 2p local, because k/2 ≥ p implies that k/2 is a multiple of p, so
that the vector components xj and xj ′ with j ′ = j + k/2 reside on the same
processor.
Example 3.13 Let n = 8 and p = 2 and assume that x is cyclically dis-
tributed. Stage k = 8 of the FFT algorithm is a multiplication of x = x(0 : 7)
with
    B_8 = \begin{pmatrix}
    1 & \cdot & \cdot & \cdot & 1 & \cdot & \cdot & \cdot \\
    \cdot & 1 & \cdot & \cdot & \cdot & ω & \cdot & \cdot \\
    \cdot & \cdot & 1 & \cdot & \cdot & \cdot & ω^2 & \cdot \\
    \cdot & \cdot & \cdot & 1 & \cdot & \cdot & \cdot & ω^3 \\
    1 & \cdot & \cdot & \cdot & −1 & \cdot & \cdot & \cdot \\
    \cdot & 1 & \cdot & \cdot & \cdot & −ω & \cdot & \cdot \\
    \cdot & \cdot & 1 & \cdot & \cdot & \cdot & −ω^2 & \cdot \\
    \cdot & \cdot & \cdot & 1 & \cdot & \cdot & \cdot & −ω^3
    \end{pmatrix},
where ω = ω_8 = e^{−πi/4} = (1 − i)/√2. The component pairs (x_0, x_4) and
(x2 , x6 ) are combined on P (0), whereas the pairs (x1 , x5 ) and (x3 , x7 ) are
combined on P (1).
Now, a parallelization strategy emerges: start with the block
distribution
and finish with the cyclic distribution. If p ≤ n/p (i.e. p ≤ √n), then these
two distributions suffice for the butterflies: we need to redistribute only once
and we can do this at any desired time after stage p but before stage 2n/p.
If p > n/p, however, we are lucky to have so many processors to solve such a
small problem, but we are unlucky in that we have to use more distributions.
For the butterflies of size n/p < k ≤ p, we need one or more intermediates
between the block and cyclic distribution.
Fig. 3.4. Group-cyclic distribution with cycle c of a vector of size eight over four
processors. Each cell represents a vector component; the number in the cell
and the greyshade denote the processor that owns the cell. The processors are
numbered 0, 1, 2, 3. (a) c = 1; (b) c = 2; and (c) c = 4. (Owners, in order of
increasing global index: (a) 0 0 1 1 2 2 3 3; (b) 0 1 0 1 2 3 2 3; (c) 0 1 2 3 0 1 2 3.)
where 0 ≤ j0 < c and 0 ≤ j1 < n/p. The processor that owns the component
xj in the group-cyclic distribution with cycle c is P (j2 c + j0 ); the processor
allocation is not influenced by j1 . As always, the components are stored locally
in order of increasing global index. Thus xj ends up in local position j = j1 =
(j mod cn/p) div c on processor P (j2 c + j0 ). (The relation between global
and local indices is explained in Fig. 1.9.)
To make a butterfly of size k local in the group-cyclic distribution with
cycle c, two constraints must be satisfied. First, the butterfly block x(rk : rk+
k − 1) should fit completely into one block of size cn/p, which is assigned to a
group of c processors. This is guaranteed if k ≤ cn/p. Second, k/2 must be a
multiple of the cycle c, so that the components xj and xj ′ with j ′ = j + k/2
reside on the same processor from the group. This is guaranteed if k/2 ≥ c.
As a result, we find that a butterfly of size k is local in the group-cyclic
distribution with cycle c if
    2c ≤ k ≤ (n/p)\,c.    (3.35)
This result includes as a special case our earlier results for the block and cyclic
distributions. In Fig. 3.4(a), it can be seen that for c = 1 butterflies of size
k = 2 are local, since these combine pairs (xj , xj+1 ). In Fig. 3.4(b) this can
be seen for c = 2 and pairs (xj , xj+2 ), and in Fig. 3.4(c) for c = 4 and pairs
(xj , xj+4 ). In this particular example, the range of (3.35) consists of only one
value of k, namely k = 2c.
A straightforward strategy for the parallel FFT is to start the butter-
flies with the group-cyclic distribution with cycle c = 1, and continue as
long as possible with this distribution, that is, in stages k = 2, 4, . . . , n/p.
At the end of stage n/p, the vector x is redistributed into the group-cyclic
distribution with cycle c = n/p, and then stages k = 2n/p, 4n/p, . . . , n^2/p^2
are performed. Then c is again multiplied by n/p, x is redistributed, stages
k = 2n^2/p^2, 4n^2/p^2, . . . , n^3/p^3 are performed, and so on. Since n/p ≥ 2 the
value of c increases monotonically. When multiplying c by n/p would lead to
a value c = (n/p)^t ≥ p, the value of c is set to c = p instead and the remaining
stages 2(n/p)^t, . . . , n are performed in the cyclic distribution.
Until now, we have ignored the bit reversal preceding the butterflies. The
bit reversal is a permutation, which in general requires communication. We
have been liberal in allowing different distributions in different parts of our
algorithm, so why not use different distributions before and after the bit
reversal? This way, we might be able to avoid communication.
Let us try our luck and assume that we have the cyclic distribution before
the bit reversal. This is the preferred starting distribution of the overall com-
putation, because it is the distribution in which the FFT computation ends.
It is advantageous to start and finish with the same distribution, because
then it is easy to apply the FFT repeatedly. This would make it possible, for
instance, to execute a parallel inverse FFT by using the parallel forward FFT
with conjugated weights; this approach is based on the property F_n^{−1} = \overline{F_n}/n.
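One way to see this property: with (F_n)_{jk} = ω_n^{jk} we have
    (F_n \overline{F_n})_{jl} = \sum_{k=0}^{n-1} ω_n^{jk}\, \overline{ω_n^{kl}}
                              = \sum_{k=0}^{n-1} e^{2\pi i k(l-j)/n} = n\, δ_{jl},
so that F_n \overline{F_n} = n I_n and hence F_n^{−1} = \overline{F_n}/n.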
k := 2;
c := 1;
rev := true;
while k ≤ n do
    (0)  j0 := s mod c;
         j2 := s div c;
         while k ≤ (n/p)c do
             nblocks := nc/(kp);
             for r := j2 · nblocks to (j2 + 1) · nblocks − 1 do
                 { Compute local part of x(rk : (r + 1)k − 1) }
                 for j := j0 to k/2 − 1 step c do
                     τ := ω_k^j x_{rk+j+k/2};
                     x_{rk+j+k/2} := x_{rk+j} − τ;
                     x_{rk+j} := x_{rk+j} + τ;
             k := 2k;
    if c < p then
        c0 := c;
        c := min((n/p)c, p);
        (1)  redistr(x, n, p, c0, c, rev);
             rev := false;
             { distr(x) = group-cyclic with cycle c }
index, (j mod cn/p) div c, as we saw above, and this will be used later in
the implementation.) We want the redistribution to work also in the trivial
case p = 1, and therefore we need to define ρ1 ; the definition in Theorem 3.10
omitted this case. By convention, we define ρ1 to be the identity permutation
of length one, which is the permutation that reverses zero bits.
addition, and subtraction, that is, a total of 10 real flops. We do not count
indexing arithmetic, nor the computation of the weights. As a consequence,
the total number of flops per processor in stage k equals (nc/(kp)) · (k/(2c)) ·
10 = 5n/p. Since there are m stages, the computation cost is
Tcomp = 5mn/p. (3.38)
The total BSP cost of the algorithm as a function of n and p is obtained by
summing the three costs and substituting m = log2 n and q = log2 p, giving
    T_{FFT} = \frac{5n \log_2 n}{p}
            + 2\left\lceil\frac{\log_2 p}{\log_2(n/p)}\right\rceil \frac{n}{p}\, g
            + \left(2\left\lceil\frac{\log_2 p}{\log_2(n/p)}\right\rceil + 1\right) l.    (3.39)
As you may know, budgets for the acquisition of parallel computers are
often tight, but you, the user of a parallel computer, may be insatiable in your
computing demands. In that case, p remains small, n becomes large, and you
may find yourself performing FFTs with 1 < p ≤ √n. The good news is
that then you only need one communication superstep and two computation
supersteps. The BSP cost of the FFT reduces to
    T_{FFT,\,1<p\le\sqrt{n}} = \frac{5n \log_2 n}{p} + 2\,\frac{n}{p}\, g + 3l.    (3.40)
This happens because p ≤ √n implies p ≤ n/p and hence log_2 p ≤ log_2(n/p),
so that the ceiling expression in (3.39) becomes one.
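To get a feeling for eqn (3.39), it can help to tabulate it for a machine of
interest. The following small C program is a sketch; the values of g and l in
it are illustrative placeholders, not measured parameters.

#include <stdio.h>
#include <math.h>

/* BSP cost of eqn (3.39), in flop units. */
double t_fft(double n, double p, double g, double l){
    double nredist= ceil(log2(p)/log2(n/p));   /* number of redistributions */
    return 5.0*n*log2(n)/p + 2.0*nredist*(n/p)*g + (2.0*nredist + 1.0)*l;
}

int main(void){
    double n= 262144.0, g= 100.0, l= 50000.0, p;

    for (p=1.0; p<=64.0; p *= 2.0)
        printf("p= %2.0f: T_FFT= %.3e flops\n", p, t_fft(n, p, g, l));
    return 0;
}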
implies that only the weights ω_n^j with 0 ≤ j ≤ n/4 have to be computed. The
remaining weights can then be obtained by negation and complex conjugation,
which are cheap operations. Symmetry can be exploited further by using the
property
    ω_n^{n/4−j} = −i\, \overline{ω_n^j},    (3.43)
which is also cheap to compute. The set of weights can thus be computed by
eqn (3.41) with 0 ≤ j ≤ n/8, eqn (3.43) with 0 ≤ j < n/8, and eqn (3.42)
with 0 < j < n/4. This way, the initialization of the n/2 weights in double
precision costs about 2 · 10 · n/8 = 2.5n flops.
An alternative method for precomputation of the weights is to compute
the powers of ω_n by successive multiplication, computing ω_n^2 = ω_n · ω_n, ω_n^3 =
ω_n · ω_n^2, and so on. Unfortunately, this propagates roundoff errors and hence
produces less accurate weights and a less accurate FFT. This method is not
recommended [187].
In the parallel case, the situation is more complicated than in the sequen-
tial case. For example, in the first iteration of the main loop of Algorithm 3.3,
c = 1 and hence j0 = 0 and j2 = s, so that all processors perform the same
set of butterfly computations, but on different data. Each processor performs
an unordered sequential FFT of length n/p on its local part of x. This implies
that the processors need the same weights, so that the weight table for these
butterflies must be replicated, instead of being distributed. The local table
should at least contain the weights ω_{n/p}^j = ω_n^{jp}, 0 ≤ j < n/(2p), so that the
total memory used by all processors for this iteration alone is already n/2
complex numbers. Clearly, in the parallel case care must be taken to avoid
excessive memory use and initialization time.
A brute-force approach would be to store on every processor the complete
table of all n/2 weights that could possibly be used during the computation.
This has the disadvantage that every processor has to store almost the same
amount of data as needed for the whole sequential problem. Therefore, this
approach is not scalable in terms of memory usage. Besides, it is also unneces-
sary to store all weights on every processor, since not all of them are needed.
Another disadvantage is that the 2.5n flops of the weight initializations can
easily dominate the (5n log2 n)/p flops of the FFT itself.
At the other extreme is the simple approach of recomputing the weights
whenever they are needed, thus discarding the table. This attaches a weight
computation of about 20 flops to the 10 flops of each pairwise butterfly opera-
tion, thereby approximately tripling the total computing time. This approach
wastes a constant factor in computing time, but it is scalable in terms of
memory usage.
Our main aim in this section is to find a scalable approach in terms of
memory usage that adds few flops to the overall count. To achieve this, we try
to find structure in the local computations of a processor and to express them
by using sequential FFTs. This has the additional benefit that we can make
and then adds it to the subvector x(rk+j0 : c : rk+k/2−1), and also subtracts
it. This means that a generalized butterfly B_{k/c}^{j_0/c} is performed on the local
subvector x(rk + j0 : c : (r + 1)k − 1). The r-loop takes care that the same
generalized butterfly is performed for all nc/(kp) local subvectors. Thus, stage
k in the group-cyclic distribution with cycle c computes
    (I_{nc/(kp)} ⊗ B_{k/c}^{j_0/c}) · x\left(j_2 \frac{nc}{p} + j_0 : c : (j_2 + 1)\frac{nc}{p} − 1\right).
A complete sequence of butterfly stages is a sequence of maximal length,
k = 2c, 4c, . . . , (n/p)c. Such a sequence multiplies the local vector by
    (I_1 ⊗ B_{n/p}^{j_0/c})(I_2 ⊗ B_{n/(2p)}^{j_0/c}) ⋯ (I_{n/(2p)} ⊗ B_2^{j_0/c}) = F_{n/p}^{j_0/c} R_{n/p},    (3.52)
where the equality follows from Theorem 3.14. This implies that superstep
(0) is equivalent to an unordered GFFT applied to the local vector, with shift
parameter α = j0 /c = (s mod c)/c.
One problem that remains is the computation superstep of the last itera-
tion. This superstep may not perform a complete sequence of butterfly stages,
in which case we cannot find a simple expression for the superstep. If, how-
ever, we would start with an incomplete sequence such that all the following
computation supersteps perform complete sequences, we would have an easier
task, because at the start c = 1 so that α = 0 and we perform standard
butterflies. We can then express a sequence of stages k = 2, 4, . . . , k1 by the
matrix product
t := ⌈log_2 p / log_2(n/p)⌉;
k1 := n/(n/p)^t;
rev := true;
for r := s · n/(k1 p) to (s + 1) · n/(k1 p) − 1 do
    UFFT(x(r k1 : (r + 1) k1 − 1), k1);
c0 := 1;
c := k1;
while c ≤ p do
    (1)  redistr(x, n, p, c0, c, rev);
         { distr(x) = group-cyclic with cycle c }
    (2)  rev := false;
         j0 := s mod c;
         j2 := s div c;
         UGFFT(x(j2 (nc/p) + j0 : c : (j2 + 1)(nc/p) − 1), n/p, j0/c);
         c0 := c;
         c := (n/p)c;
    Iteration   c    α for processors P(0), . . . , P(7)
    0           1    0    0    0    0    0    0    0    0
    1           2    0    1/2  0    1/2  0    1/2  0    1/2
    2           8    0    1/8  1/4  3/8  1/2  5/8  3/4  7/8
sequential implementations for these modules. For the FFT many implement-
ations are available, see Section 3.8, but for the GFFT this is not the case.
We are willing to accept a small increase in flop count if this enables us to
use the FFT as the computational workhorse instead of the GFFT. We can
achieve this by writing the GDFT as
    y_k = \sum_{j=0}^{n-1} (x_j\, ω_n^{jα})\, ω_n^{jk},  for 0 ≤ k < n,    (3.54)
where Mseq (n) is the memory space required by the sequential algorithm for
an input size n, and p is the number of processors. This definition allows for
O(p) overhead, reflecting the philosophy that BSP algorithms are based on all-
to-all communication supersteps, where each processor deals with p−1 others,
and also reflecting the practice of current BSP implementations where each
processor stores several arrays of length p. (For example, each registration
of a variable by the BSPlib primitive bsp_push_reg gives rise to an array of
length p on every processor that contains the p addresses of the variable on
all processors. Another example is the common implementation of a commun-
ication superstep where the number of data to be sent is announced to each
destination processor before the data themselves are sent. This information
needs to be stored in an array of length p on the destination processor.)
For p ≤ n/p, only one twiddle array has to be stored, so that the total
memory requirement is M (n, p) = 5n/p, which is scalable by the definition
above. For p > n/p, we need t − 1 additional iterations each requiring a
twiddle array. Fortunately, we can find a simple upper bound on the additional
memory requirement, namely
    \frac{2(t−1)n}{p} = 2\left(\frac{n}{p} + \frac{n}{p} + \cdots + \frac{n}{p}\right)
    \le 2\,\frac{n}{p}\cdot\frac{n}{p}\cdots\frac{n}{p}
    = 2\left(\frac{n}{p}\right)^{t−1} = \frac{2p}{k_1} \le p.    (3.61)
Thus, the total memory use in this case is M (n, p) ≤ 5n/p + p, which is also
scalable. We have achieved our initial aim.
Note that some processors may be able to use the same twiddle array in
several subsequent supersteps, thus saving memory. An extreme case is P (0),
which always has α = 0 and in fact would need no twiddle memory. Processor
can be computed when needed, saving the memory of rho, but this can be
costly in computer time. In the local bit reversal, for instance, the reverse of
n/p local indices is computed. Each reversal of an index costs of the order
log2 (n/p) integer operations. The total number of such operations is there-
fore of the order (n/p) log2 (n/p), which for small p is of the same order as
the (5n/p) log2 n floating-point operations of the butterfly computations. The
fraction of the total time spent in the bit reversal could easily reach 20%.
This justifies using a table so that the bit reversal needs to be computed only
once and its cost can be amortized over several FFTs. To reduce the cost
also in case the FFT is called only once, we optimize the inner loop of the
function by using bit operations on unsigned integers (instead of integer oper-
ations as used everywhere else in the program). To obtain the last bit bi of
the remainder (bm−1 · · · bi )2 , an ‘and’-operation is carried out with 1, which
avoids the expensive modulo operation that would occur in the alternative
formulation, lastbit= rem%2. After that, the remainder is shifted one posi-
tion to the right, which is equivalent to rem /=2. It depends on the compiler
and the chosen optimization level whether the use of explicit bit operations
gives higher speed. (A good compiler will make such optimizations super-
fluous!) In my not so humble opinion, bit operations should only be used
sparingly in scientific computation, but here is an instance where there is a
justification.
The function k1_init computes k1 from n, p by finding the first c =
(n/p)^t ≥ p. Note that the body of the c-loop consists of an empty state-
ment, since we are only interested in the final value of the counter c. The
counter c takes on t + 1 values, which is the number of iterations of the main
loop of the FFT. As a consequence, k1 · (n/p)^t = n so that k1 = n/c.
The function bspredistr redistributes the vector x from group-cyclic dis-
tribution with cycle c0 to the group-cyclic distribution with cycle c1 , for a
ratio c1 /c0 ≥ 1, as illustrated in Fig. 3.5. (We can derive a similar redis-
tribution function for c1 /c0 < 1, but we do not need it.) The function is
an implementation of Algorithm 3.4, but with one important optimization
(I could not resist the temptation!): vector components to be redistributed
are sent in blocks, rather than individually. This results in blocks of nc0 /(pc1 )
complex numbers. If nc0 < pc1 , components are sent individually. The aim
is, of course, to reach a communication rate that corresponds to optimistic
values of g, see Section 2.6.
The parallel FFT, like the parallel LU decomposition, is a regular paral-
lel algorithm, for which the communication pattern can be predicted exactly,
and each processor can determine exactly where every communicated data ele-
ment goes. In such a case, it is always possible for the user to combine data
for the same destination in a block, or packet, and communicate them using
one put operation. In general, this requires packing at the source processor
and unpacking at the destination processor. No identifying information needs
to be sent together with the data since the receiver knows their meaning. (In
Fig. 3.5. Redistribution example for a vector of size 16 over four processors:
(a) group-cyclic distribution with cycle c = 2, owners 0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3;
(b) cyclic distribution (c = 4), owners 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3.
To perform the unpacking, we have to move data from the location they
were put into, to their final location on the same processor. Let xj and xj ′
be two adjacent components in a packet, so that their local indices differ by
ratio; their global indices then differ by j′ − j = (c1/c0)·c0 = c1. Since these
components are in the same new block, and their global indices differ by c1,
their new local indices differ by one. We are lucky: if we put xj into its final location, and the
next component of the packet into the next location, and so on, then all
components of the packet immediately reach their final destination. In fact,
this means that we do not have to unpack!
The function bspfft performs the FFT computation itself. It follows
Algorithm 3.5 and contains no surprises. The function bspfft_init initial-
izes all tables used. It assumes that a suitable amount of storage has been
allocated for the tables in the calling program. For the twiddle weights, this
amount is 2n/p + p reals, cf. eqn (3.61).
The program text is:
#include "bspedupack.h"
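/* (The lines below are the innermost butterfly of ufft: x stores complex
   numbers as pairs of reals, so (x[j0],x[j1]) and (x[j2],x[j3]) are the two
   complex components combined at distance k/2, and (wr,wi) holds the weight.
   tau is computed first; then the subtraction and addition overwrite the
   pair, as in Algorithm 3.2.) */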
j1= j0+1;
j2= j0+k;
j3= j2+1;
taur= wr*x[j2] - wi*x[j3];
taui= wi*x[j2] + wr*x[j3];
x[j2]= x[j0]-taur;
x[j3]= x[j1]-taui;
x[j0] += taur;
x[j1] += taui;
}
}
}
} /* end ufft */
if (n==1)
return;
theta= -2.0 * M_PI / (double)n;
w[0]= 1.0;
w[1]= 0.0;
if (n==4){
w[2]= 0.0;
w[3]= -1.0;
} else if (n>=8) {
/* weights 1 .. n/8 */
for(j=1; j<=n/8; j++){
w[2*j]= cos(j*theta);
w[2*j+1]= sin(j*theta);
}
/* weights n/8+1 .. n/4 */
for(j=0; j<n/8; j++){
n4j= n/4-j;
w[2*n4j]= -w[2*j+1];
w[2*n4j+1]= -w[2*j];
}
/* weights n/4+1 .. n/2-1 */
for(j=1; j<n/4; j++){
n2j= n/2-j;
w[2*n2j]= -w[2*j];
w[2*n2j+1]= w[2*j+1];
}
}
} /* end ufft_init */
int j, j1;
double wr, wi, xr, xi;
} /* end twiddle */
int j;
double theta;
} /* end twiddle_init */
} /* end permute */
void bitrev_init(int n, int *rho){
    int j;
unsigned int n1, rem, val, k, lastbit, one=1;
if (n==1){
rho[0]= 0;
return;
}
n1= n;
for(j=0; j<n; j++){
rem= j; /* j= (b(m-1), ... ,b1,b0) in binary */
val= 0;
for (k=1; k<n1; k <<= 1){
lastbit= rem & one; /* lastbit = b(i) with i= log2(k) */
rem >>= 1; /* rem = (b(m-1), ... , b(i+1)) */
val <<= 1;
val |= lastbit; /* val = (b0, ... , b(i)) */
}
rho[j]= (int)val;
}
} /* end bitrev_init */
int k1_init(int n, int p){
    int np, c, k1;

    np= n/p;
for(c=1; c<p; c *=np)
;
k1= n/c;
return k1;
} /* end k1_init */
void bspredistr(double *x, int n, int p, int s, int c0, int c1,
char rev, int *rho_p){
double *tmp;
int np, j0, j2, j, jglob, ratio, size, npackets, destproc, destindex, r;
np= n/p;
ratio= c1/c0;
size= MAX(np/ratio,1);
npackets= np/size;
tmp= vecallocd(2*size);
if (rev) {
j0= rho_p[s]%c0;
j2= rho_p[s]/c0;
} else {
j0= s%c0;
j2= s/c0;
}
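/* Pack and send np/size packets: packet j holds the local components with
   local indices j, j+ratio, j+2*ratio, ...  jglob is the global index of its
   first component; destproc and destindex follow the group-cyclic
   distribution with cycle c1: processor (jglob div (c1*np))*c1 + jglob mod c1,
   local index (jglob mod (c1*np)) div c1. */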
for(j=0; j<npackets; j++){
jglob= j2*c0*np + j*c0 + j0;
destproc= (jglob/(c1*np))*c1 + jglob%c1;
destindex= (jglob%(c1*np))/c1;
for(r=0; r<size; r++){
tmp[2*r]= x[2*(j+r*ratio)];
tmp[2*r+1]= x[2*(j+r*ratio)+1];
}
bsp_put(destproc,tmp,x,destindex*2*SZDBL,size*2*SZDBL);
}
bsp_sync();
vecfreed(tmp);
} /* end bspredistr */
void bspfft(double *x, int n, int p, int s, int sign, double *w0, double *w,
double *tw, int *rho_np, int *rho_p){
char rev;
int np, k1, r, c0, c, ntw, j;
double ninv;
np= n/p;
k1= k1_init(n,p);
permute(x,np,rho_np);
rev= TRUE;
for(r=0; r<np/k1; r++)
    ufft(&x[2*r*k1],k1,sign,w0); /* unordered FFTs of length k1 on the local blocks */

c0= 1;
ntw= 0;
for (c=k1; c<=p; c *=np){
bspredistr(x,n,p,s,c0,c,rev,rho_p);
rev= FALSE;
twiddle(x,np,sign,&tw[2*ntw*np]);
ufft(x,np,sign,w);
c0= c;
ntw++;
}
if (sign==-1){
ninv= 1 / (double)n;
for(j=0; j<2*np; j++)
x[j] *= ninv;
}
} /* end bspfft */
void bspfft_init(int n, int p, int s, double *w0, double *w, double *tw,
int *rho_np, int *rho_p){
/* This parallel function initializes all the tables used in the FFT. */
int np, k1, c, ntw;
double alpha;

np= n/p;
bitrev_init(np,rho_np);
bitrev_init(p,rho_p);
k1= k1_init(n,p);
ufft_init(k1,w0);
ufft_init(np,w);
ntw= 0;
for (c=k1; c<=p; c *=np){
alpha= (s%c) / (double)(c);
twiddle_init(np,alpha,rho_np,&tw[2*ntw*np]);
ntw++;
}
} /* end bspfft_init */
    p    g     l        Tcomm(0)
    1    99    55       378
    2    75    5118     1414
    4    99    12 743   2098
    8    126   32 742   4947
    16   122   93 488   15 766
Fig. 3.6. Time (in ms) of a perfectly parallelized FFT of length n = 262 144 as a
function of the number of processors p.
the Origin 3800, which is partly due to the fact that the predecessor machine
was used in dedicated mode, but it must also be due to artefacts from aggress-
ive optimization. This should serve as a renewed warning about the difficulties
of benchmarking.
Figure 3.6 shows a set of timing results Tp (n) for an FFT of length n =
262 144. What is your first impression? Does the performance scale well? It
seems that for larger numbers of processors the improvement levels off. Well,
let me reveal that this figure represents the time of a theoretical, perfectly
parallelized FFT, based on a time of 155.2 ms for p = 1. Thus, we conclude
that presenting results in this way may deceive the human eye.
Table 3.4 presents the raw data of our time measurements for the program
bspfft of Section 3.6. These data have to be taken with a grain of salt, since
timings may suffer from interference by programs from other users (caused for
instance by sharing of communication links). To get meaningful results, we
ran each experiment three times, and took the best result, assuming that the
corresponding run would suffer less from interference. Often we found that
the best two timings were within 5% of each other, and that the third result
was worse.
Figure 3.7 compares the actual measured execution time for n = 262 144
on an Origin 3800 with the ideal time. Note that for this large n, the measured
time is reasonably close to the ideal time, except perhaps for p = 16.
The speedup Sp (n) of a parallel program is defined as the increase in
speed of the program running on p processors compared with the speed of a
sequential program (with the same level of optimization),
    S_p(n) = \frac{T_{seq}(n)}{T_p(n)}.    (3.62)
Note that we do not take the time of the parallel program with p = 1 as
reference time, since this may be too flattering; obtaining a good speedup
with such a reference for comparison may be reason for great pride, but it is
Fig. 3.7. Measured and ideal time (in ms) of the parallel FFT of length n = 262 144
as a function of p.
Fig. 3.8. Measured speedup of the parallel FFT for n = 65 536 and n = 262 144,
together with the ideal speedup, as a function of p.
part of the computation can be carried out using data that are already in
cache, thus yielding fewer cache misses and a higher computing rate. Occur-
rence of superlinear speedup in a set of experimental results should be a
warning to be cautious when interpreting results, even for results that are not
superlinear themselves. In our case, it is also likely that we benefit from cache
effects. Still, one may argue that the ability to use many caches simultaneously
is a true benefit of parallel computing.
The efficiency Ep (n) gives the fraction of the total computing power that
is usefully employed. It is defined by
    E_p(n) = \frac{S_p(n)}{p} = \frac{T_{seq}(n)}{p\, T_p(n)}.    (3.63)
In general, 0 ≤ E_p(n) ≤ 1, with the same caveats as before. Figure 3.9 gives
the efficiency for n = 65 536, 262 144.
Another measure is the normalized cost Cp (n), which is just the time of
the parallel program divided by the time that would be taken by a perfectly
parallelized version of the sequential program. This cost is defined by
    C_p(n) = \frac{T_p(n)}{T_{seq}(n)/p}.    (3.64)
Note that Cp (n) = 1/Ep (n), which explains why this cost is sometimes called
the inefficiency. Figure 3.10 gives the cost of the FFT program for n =
65 536, 262 144. The difference between the normalized cost and the ideal
value of 1 is the parallel overhead, which usually consists of load imbalance,
communication time, and synchronization time. A breakdown of the overhead
into its main parts can be obtained by performing additional measurements,
or theoretically by predictions based on the BSP model.
Fig. 3.9. Measured efficiency Ep(n) of parallel FFT for n = 65 536 and n = 262 144.
The ideal value is 1.
Fig. 3.10. Normalized cost Cp(n) of the parallel FFT for n = 65 536 and n = 262 144,
together with the ideal value 1, as a function of p.
even using a rate that depends on the local vector length. The communication
prediction can be improved by measuring optimistic g-values.
Table 3.6 shows the computing rate Rp (n) of all processors together for
this application, defined by
    R_p(n) = \frac{5n \log_2 n}{T_p(n)},    (3.65)
where we take the standard flop count 5n log2 n as basis (as is customary for all
FFT counts, even for highly optimized FFTs that perform fewer flops). The
flop rate is useful in comparing results for different problem sizes and also
for different applications. Furthermore, it tells us how far we are from the
advertised peak performance. It is a sobering thought that we need at least
four processors to exceed the top computing rate of 285 Mflop/s measured
for an in-cache DAXPY operation on a single processor. Thus, instead of
parallelizing, it may be preferable to make our sequential program cache-
friendly. If we still need more speed, we turn to parallelism and make our
parallel program cache-friendly. (Exercise 5 tells you how to do this.) Making a
parallel program cache-friendly will decrease running time and hence increase
the computing rate, but paradoxically it will also decrease the speedup and
the efficiency, because the communication part remains the same while the
computing part is made faster in both the parallel program and the sequential
reference program. Use of a parallel computer for the one-dimensional FFT
can therefore only be justified for very large problems. But were not parallel
computers made exactly for that purpose?
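As a small illustration of eqns (3.62)–(3.65), the following C fragment
computes the four metrics from a pair of measured times; the numbers in it are
hypothetical placeholders, not measurements from this chapter.

#include <stdio.h>
#include <math.h>

int main(void){
    double n= 262144.0, p= 8.0;
    double t_seq= 0.150, t_p= 0.025;          /* in seconds, hypothetical */
    double speedup= t_seq/t_p;                /* S_p(n), eqn (3.62) */
    double eff= speedup/p;                    /* E_p(n), eqn (3.63) */
    double cost= t_p/(t_seq/p);               /* C_p(n) = 1/E_p(n), eqn (3.64) */
    double rate= 5.0*n*log2(n)/t_p;           /* R_p(n) in flop/s, eqn (3.65) */

    printf("S_p= %.2f  E_p= %.2f  C_p= %.2f  R_p= %.3g flop/s\n",
           speedup, eff, cost, rate);
    return 0;
}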
for 0 ≤ j < m and 0 ≤ k < N/m. Furthermore, Sm,N is the Mod-m sort
matrix, the N × N permutation matrix defined by
    S_{m,N}\, x = \begin{pmatrix} x(0 : m : N − 1) \\ x(1 : m : N − 1) \\ \vdots \\ x(m − 1 : m : N − 1) \end{pmatrix},    (3.67)
which has as inverse S_{m,N}^{−1} = S_{N/m,N}. Note that T_{2,N} = diag(I_{N/2}, Ω_{N/2}) and
S2,N = SN . The SPL compiler translates each factorization formula into a
Fortran program. In the same spirit as FFTW, an extensive search is car-
ried out over the space of factorization formulae and compiler techniques. A
Kronecker-product property used by SPL which is important for a parallel
context is: for every m × m matrix A and n × n matrix B,
were mostly designed targeting the hypercube architecture, where each pro-
cessor s = (b_{q−1} ⋯ b_0)_2 is connected to the q = log_2 p processors that differ
from s in exactly one bit of the processor number.
Examples of algorithms from this category are discussed by Van
Loan [187, Algorithm 3.5.3], by Dubey, Zubair, and Grosch [62], who present
a variant that can use every input and output distribution from the block-
cyclic family, and by Gupta and Kumar [87] (see also [82]), who describe
the so-called binary exchange algorithm. Gupta and Kumar analyse the
scalability of this algorithm by using the isoefficiency function fE (p), which
expresses how fast the amount of work of a problem must grow with p to
maintain a constant efficiency E.
Another example is an algorithm for the hypercube architecture given
by Swarztrauber [172] which is based on index-digit permutations [72]: each
permutation τ on the set of m bits {0, . . . , m − 1} induces an index-
digit permutation which moves the index j = (bm−1 · · · b1 b0 )2 into j ′ =
(bτ (m−1) · · · bτ (1) bτ (0) )2 . Using the block distribution for n = 2m and p = 2q ,
an index is split into the processor number s = (bm−1 · · · bm−q )2 and the local
index j = (bm−q−1 · · · b1 b0 )2 . An i-cycle is an index-digit permutation where
τ is a swap of the pivot bit m − q − 1 with another bit r. Swarztrauber notes
that no communication is needed if r ≤ m − q − 1; otherwise, the i-cycle
permutation requires communication, but it still has the advantage of moving
data in large chunks, namely blocks of size n/(2p). Every index-digit permuta-
tion can be carried out as a sequence of i-cycles. Every butterfly operation of
an FFT combines pairs (j, j ′ ) that differ in exactly one bit. An i-cycle can be
used to make this bit local.
Gupta et al. [88] use data redistributions to implement parallel FFTs. For
the unordered Cooley–Tukey algorithm, they start with the block distribu-
tion, finish with the cyclic distribution, and use block-cyclic distributions
in between. If a bit reversal must be performed and the output must be
distributed in the same way as the input, this requires three communica-
tion supersteps for p ≤ √n. The authors also modify the ordered Stockham
algorithm so they can start and finish with the cyclic distribution, and per-
form only one communication superstep for p ≤ √n. Thus they achieve the
same minimal communication cost as Algorithms 3.3 and 3.5. Experimental
results on an Intel iPSC/860 hypercube show that the modified Stockham
algorithm outperforms all other implemented algorithms.
McColl [135] presents a detailed BSP algorithm for an ordered FFT, which
uses the block distribution on input and output. The algorithm starts with
an explicit bit-reversal permutation and it finishes with a redistribution from
cyclic to block distribution. Thus, for p > 1 the algorithm needs at least
three communication supersteps. Except for the extra communication at the
start and finish, the algorithm of McColl is quite similar to our algorithm. His
algorithm stores and communicates the original index of each vector compon-
ent together with its numerical value. This facilitates the description of the
algorithm, but the resulting communication should be removed in an imple-
mentation because in principle the original indices can be computed by every
processor. Furthermore, the exposition is simplified by the assumption that
m − q is a divisor of m. This implies that p = 1 or p ≥ √n; it is easy to gener-
alize the algorithm so that it can handle the most common case 1 < p < √n
as well.
The algorithm presented in this chapter is largely based on work by Inda
and Bisseling [111]. This work introduces the group-cyclic distribution and
formulates redistributions as permutations of the data vector. For example,
changing the distribution from block to cyclic has the same effect as keeping
the distribution constant but performing an explicit permutation Sp,n . (To
be more precise, the processor and local index of the original data element x_j
are the same in both cases.) For p ≤ √n, the factorization is written as
    F_n = S_{p,n}^{−1} A\, S_{p,n} (R_p ⊗ I_{n/p})(I_p ⊗ F_{n/p}) S_{p,n},    (3.70)
This factorization needs three permutations if the input and output are dis-
tributed by blocks, but it needs only one permutation if these distributions
are cyclic. Thus the cyclic I/O distribution is best.
Another related algorithm is the transpose algorithm, which calculates a
one-dimensional FFT of size mn by storing the vector x as a two-dimensional
matrix of size m × n. Component xj is stored as matrix element X(j0 , j1 ),
where j = j0 n + j1 with 0 ≤ j0 < m and 0 ≤ j1 < n. This algorithm is
based on the observation that in the first part of an unordered FFT com-
ponents within the same matrix row are combined, whereas in the second
part components within the same column are combined. The matrix can be
transposed between the two parts of the algorithm, so that all butterflies can
be done within rows. In the parallel case, each processor can then handle one
or more rows. The only communication needed is in the matrix transposi-
tion and the bit reversal. For a description and experimental results, see, for
example, Gupta and Kumar [87]. This approach works for p ≤ min(m, n).
Otherwise, the transpose algorithm must be generalized to a higher dimen-
sion. A detailed description of such a generalization can be found in the book
by Grama and coworkers [82]. Note that this algorithm is similar to the BSP
algorithm presented here, except that its bit-reversal permutation requires
communication.
The two-dimensional view of a one-dimensional FFT can be carried one
step further by formulating the algorithm such that it uses explicit shorter-
length FFTs on the rows or columns of the matrix storing the data vector.
Van Loan [187, Section 3.3.1] calls the corresponding approach the four-
step framework and the six-step framework. This approach is based
on a mixed-radix method due to Agarwal and Cooley [2] (who developed it
for the purpose of vectorization). The six-step framework is equivalent to a
factorization into six factors,
This factorization follows immediately from eqns (3.66) and (3.69). In a paral-
lel algorithm based on eqn (3.72), with the block distribution used throughout,
the only communication occurs in the permutations Sm,mn and Sn,mn . The
four-step framework is equivalent to the factorization
which also follows from eqns (3.66) and (3.69). Note that now the shorter-
length Fourier transforms need strided access to the data vector, with stride
n for Fm ⊗ In . (In a parallel implementation, such access is local if a cyclic
distribution is used, provided p ≤ min(m, n).)
One advantage of the four-step and six-step frameworks is that the shorter-
length FFTs may fit into the cache of computers that cannot accommodate an
FFT of full length. This may result in much higher speeds on cache-sensitive
computers. The use of genuine FFTs in this approach makes it possible to call
fast system-specific FFTs in an implementation. A disadvantage is that the
multiplication by the twiddle factors takes an additional 6n flops. In a parallel
implementation, the communication is nicely isolated in the permutations.
Hegland [91] applies the four-step framework twice to generate a factor-
ization of Fmnm with m maximal, treating FFTs of length m recursively by the
same method. He presents efficient algorithms for multiple FFTs and multidi-
mensional FFTs on vector and parallel computers. In his implementation, the
key to efficiency is a large vector length of the inner loops of the computation.
Edelman, McCorquodale, and Toledo [68] present an approximate par-
allel FFT algorithm aimed at reducing communication from three data
permutations (as in the six-step framework) to one permutation, at the
expense of an increase in computation. As computation rates are expected
to grow faster than communication rates, future FFTs are likely to be
communication-bound, making such an approach worthwhile. The authors
argue that speedups in this scenario will be modest, but that using parallel
FFTs will be justified on other grounds, for instance because the data do not
fit in the memory of a single processor, or because the FFT is part of a larger
application with good overall speedup.
3.8.4 Applications
Applications of the FFT are ubiquitous. Here, we mention only a few with
emphasis on parallel applications. Barros and Kauranne [13] and Foster and
Worley [70] parallelize the spectral transform method for solving partial differ-
ential equations on a sphere, aimed at global weather and climate modelling.
The spectral transform for a two-dimensional latitude/longitude grid consists
of a DFT along each latitude (i.e. in the east–west direction) and a discrete
Legendre transform (DLT) along each longitude (i.e. in the north–south dir-
ection). Barros and Kauranne [13] redistribute the data between the DFT and
the DLT, so that these computations can be done sequentially. For instance,
each processor performs several sequential FFTs. The main advantages of
this approach are simplicity, isolation of the communication parts (thus cre-
ating bulk), and reproducibility: the order of the computations is exactly the
same as in a sequential computation and it does not depend on the number
of processors used. (Weather forecasts being that fragile, they must at least
be reproducible in different runs of the same program. No numerical but-
terfly effects please!) Foster and Worley [70] also investigate the alternative
approach of using parallel one-dimensional FFTs and Legendre transforms as
a basis for the spectral transform.
Nagy and O’Leary [143] present a method for restoring blurred images
taken by the Hubble Space Telescope. The method uses a preconditioned
conjugate gradient solver with fast matrix-vector multiplications carried out
by using FFTs. This is possible because the matrices involved are Toeplitz
matrices, that is, they have constant diagonals: aij = αi−j , for all i, j.
The FFT is the computational workhorse in grid methods for quantum
molecular dynamics, where the time-dependent Schrödinger equation is solved
numerically on a multidimensional grid, see a review by Kosloff [123] and a
comparison of different time propagation schemes by Leforestier et al. [127].
In each time step, a potential energy operator is multiplied by a wavefunction,
which is a local (i.e. point-wise) operation in the spatial domain. This means
that every possible distribution including the cyclic one can be used in a par-
allel implementation. The kinetic energy operator is local in the transformed
domain, that is, in momentum space. The transformation between the two
domains is done efficiently using the FFT. Here, the cyclic distribution as
used in our parallel FFT is applicable in both domains. In one dimension, the
FFT thus has only one communication superstep.
Haynes and Côté [90] parallelize a three-dimensional FFT for use in
an electronic structure calculation based on solving the time-independent
Schrödinger equation. The grids involved are relatively small compared with
those used in other FFT applications, with a typical size of only 128×128×128
grid points. As a consequence, reducing the communication is of prime import-
ance. In momentum space, the grid is cyclically distributed in each dimension
to be split. In the spatial domain, the distribution is by blocks. The number
of communication supersteps is log2 p. An advantage of the cyclic distribu-
tion in momentum space is a better load balance: momentum is limited by
an energy cut-off, which means that all array components outside a sphere in
momentum space are zero; in the cyclic distribution, all processors have an
approximately equal part of this sphere.
Zoldi et al. [195] solve the nonlinear Schrödinger equation in a study on
increasing the transmission capacity of optical fibre lines. They apply a par-
allel one-dimensional FFT based on the four-step framework. They exploit
the fact that the Fourier transform is carried out in both directions, avoiding
unnecessary permutations and combining operations from the forward FFT
and the inverse FFT on the same block of data. This results in better cache
use and even causes a speedup of the parallel program with p = 1 compared
with the original sequential program.
3.9 Exercises
1. The recursive FFT, Algorithm 3.1, splits vectors repeatedly into two vec-
tors of half the original length. For this reason, we call it a radix-2 FFT.
We can generalize the splitting method by allowing the vectors to be split into
r parts of equal length. This leads to a radix-r algorithm.
(a) How many flops are actually needed for the computation of F4 x, where
x is a vector of length four? Where does the gain in this specific case
come from compared to the 5n log2 n flops of an FFT of arbitrary
length n?
(b) Let n be a power of four. Derive a sequential recursive radix-4 FFT
algorithm. Analyse the computation time and compare the number of
flops with that of a radix-2 algorithm. Is the new algorithm faster?
(c) Let n be a power of four. Formulate a sequential nonrecursive radix-4
algorithm. Invent an appropriate name for the new starting permuta-
tion. Can you modify the algorithm to handle all powers of two, for
example by ending with a radix-2 stage if needed?
The three-dimensional DFT of an n0 × n1 × n2 array X is the array Y defined by
Y_{k0,k1,k2} = Σ_{j0=0}^{n0−1} Σ_{j1=0}^{n1−1} Σ_{j2=0}^{n2−1} X_{j0,j1,j2} ω_{n0}^{j0k0} ω_{n1}^{j1k1} ω_{n2}^{j2k2},
for 0 ≤ kd < nd , d = 0, 1, 2.
(a) Write a function bspfft3d, similar to bspfft2d from Exercise 3, that
performs a 3D FFT, assuming that X is distributed by the M0 × M1 ×
M2 cyclic distribution, where Md is a power of two with 1 ≤ Md < nd ,
for d = 0, 1, 2. The result Y must be in the same distribution as X.
(b) Explain why each communication superstep of the parallel 3D FFT
algorithm has the same cost.
(c) In the case n0 = n1 = n2 = n, how do you choose the Md ? Hint: for p ≤ √n
you need only one communication superstep; for p ≤ n only two.
(a) Show that y_{n−k} = ȳk , for k = 1, . . . , n−1. This implies that the output
of the FFT is completely determined by y0 , . . . , yn/2 . The remaining
n/2 − 1 components of y can be obtained cheaply by complex conjug-
ation, so that they need not be stored. Also show that y0 and yn/2
are real. It is customary to pack these two reals on output into one
complex number, stored at position 0 in the y-array.
(b) We can preprocess the input data by packing the real vector x of length
n as a complex vector x′ of length n/2 defined by x′j = x2j + ix2j+1 ,
for 0 ≤ j < n/2. The packing operation is for free in our FFT data
structure, which stores a complex number as two adjacent reals. It
turns out that if we perform a complex FFT of length n/2 on the conjugate
of x′ , yielding y′ = F_{n/2} x̄′ , we can retrieve the desired vector
y by
(a) Cyclic: components 0, 1, . . . , 15 are owned by processors
    0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3.
(b) Zig-zag cyclic: components 0, 1, . . . , 15 are owned by processors
    0 1 2 3 0 3 2 1 0 1 2 3 0 3 2 1.
Fig. 3.11. Distribution of a vector of size 16 over four processors. Each cell repres-
ents a vector component; the number in the cell and the greyshade denote the
processor that owns the cell. The processors are numbered 0, 1, 2, 3. (a) Cyclic
distribution; (b) zig-zag cyclic distribution.
(a) In our pursuit of a fast cosine transform (FCT), we try to pack the
vector x in a suitable form into another vector x′ ; then compute the
Fourier transform y′ = Fn x′ ; and hope to be able to massage y′ into
y. By way of miracle, we can indeed succeed if we define
x′j = x2j ,
x′n−1−j = x2j+1 , for 0 ≤ j < n/2. (3.80)
We can retrieve y by
yk = Re(ω_{4n}^k y′_k),
y_{n−k} = −Im(ω_{4n}^k y′_k), for 0 ≤ k ≤ n/2. (3.81)
(c) Describe how you would perform a parallel FCT. Which distributions
do you use for input and output? Here, they need not be the same.
Motivate your choices. Give your distributions a meaningful name.
Hint: try to avoid redistribution as much as possible. You may assume
that p is so small that only one communication superstep is needed in
the complex FFT used by the real FFT.
(d) Analyse the cost of your parallel algorithm, implement it, and test the
resulting program. How does the computing time scale with n and p?
(e) Refresh your trigonometry and formula manipulation skills by proving
that the inverse DCT is given by
x_l = y0/n + (2/n) Σ_{k=1}^{n−1} yk cos(πk(l + 1/2)/n), for 0 ≤ l < n. (3.82)
can be obtained by: first multiplying x with Wn and then moving the even
components of the result to the front, that is, multiplying the result with Sn ;
repeating this procedure on the first half of the current vector, using Wn/2
and Sn/2 ; and so on. The algorithm terminates after multiplication by W4
and S4 . We denote the result, obtained after m − 1 stages, by W x.
(a) Factorize W in terms of matrices Wk , Sk , and Ik of suitable size,
similar to the sequential factorization of the Fourier matrix Fn . You can
use the notation diag(A0 , . . . , Ar ), which stands for a block-diagonal
matrix with blocks A0 , . . . , Ar on the diagonal.
(b) How many flops are needed to compute W x? How does this scale
compared with the FFT? What does this mean for communication in
a parallel algorithm?
(c) Formulate a sequential algorithm that computes W x in place, without
performing permutations. The output data will become available in
scrambled order. To unscramble the data in one final permutation,
where would we have to move the output value at location 28 =
(0011100)2 for length n = 128? And at j = (bm−1 · · · b0 )2 for arbitrary
length n? Usually, there is no need to unscramble.
(d) Choose a data distribution that enables the development of an efficient
parallel in-place DWT algorithm. This distribution must be used on
input and as long as possible during the algorithm.
(e) Formulate a parallel DWT algorithm. Hint: avoid communicating data
at every stage of your algorithm. Instead, be greedy and compute what
you can from several stages without communicating in between. Then
communicate and finish the stages you started. Furthermore, find a
sensible way of finishing the whole algorithm.
(f) Analyse the BSP cost of the parallel DWT algorithm.
(g) Compare the characteristics of your DWT algorithm to those of the
parallel FFT, Algorithm 3.5. What are the essential similarities and
differences?
(h) Implement and test your algorithm.
(i) Prove that the matrix Wn is orthogonal (i.e. WnT Wn = In ) by showing
that the rows are mutually orthogonal and have unit norm. Thus W is
orthogonal and W −1 = W T . Extend your program so that it can also
compute the inverse DWT.
(j) Take a picture of a beloved person or animal, translate it into a matrix
A, where each element represents a pixel, and perform a 2D DWT by
carrying out 1D DWTs over the rows, followed by 1D DWTs over the
columns. Choose a threshold value τ > 0 and set all matrix elements
ajk with |ajk | ≤ τ to zero. What would the compression factor be if we
stored A as a sparse matrix by keeping only the nonzero values
ajk together with their index pairs (j, k)? What does your beloved one
look like after an inverse 2D DWT? You may vary τ .
102639592829741105772054196573991675900
716567808038066803341933521790711307779
*
106603488380168454820927220360012878679
207958575989291522270608237193062808643
=
109417386415705274218097073220403576120
037329454492059909138421314763499842889
347847179972578912673324976257528997818
33797076537244027146743531593354333897,
which has 155 decimal digits and is called RSA-155. The reverse computation,
finding the two (prime) factors of RSA-155, was posed as a cryptanalytic
challenge by RSA Security; its solution by Cavallar et al. [39] in 1999 took
a total of 35 CPU years on 300 workstations and PCs. (The next challenge,
factoring a 174 decimal-digit number, carries an award of $10 000.)
(f) Develop other parallel functions for fast operations on large integers,
such as addition and subtraction, using the block distribution.
(g) The Newton–Raphson method for finding a zero of a function f , that
is, an x with f (x) = 0, computes successively better approximations
x^(k+1) = x^(k) − f(x^(k))/f′(x^(k)). (3.86)
Apply this method with f(x) = 1/x − a and f(x) = 1/x² − a to compute
1/a and 1/√a, respectively, with high precision for a given fixed real a
using your parallel functions. Choose a suitable representation of a as
a finite sequence of bytes. Pay special attention to termination of the
Newton–Raphson iterations.
(h) At the time of writing, the world record in π computation is held by
Kanada, Ushiro, Kuroda, Kudoh, and nine co-workers who obtained
about 1 241 100 000 000 decimal digits of π in December 2002 run-
ning a 64-processor Hitachi SR8000 supercomputer with peak speed
of 2 Tflop/s for 600 h. They improved the previous record by Kanada
and Takahashi who got 206 158 430 000 digits right in 1999 using all
128 processors of a Hitachi SR8000 parallel computer with peak speed
of 1 Tflop/s. That run used the Gauss–Legendre method proposed by
Brent [32] and Salamin [161], which works as follows. Define sequences
√
a0 , a1 , a2 , . . . and b0 , b1 , b2 , . . . by a0 = 2, b0 = 1, ak+1 = (ak +√bk )/2
is the arithmetic mean of the pair (ak , bk ), and bk+1 = ak bk
is its geometric mean, for k ≥ 0. Let ck = 2k (a2k − b2k ). Define
the sequence d0 , d1 , d2 , . . . by d0 = 1 and dk+1 = dk − ck+1 . Then
limk→∞ 2a2k /dk = π, with fast convergence: the number of digits of π
produced doubles at every iteration. Use your own parallel functions to
compute as many decimal digits of π as you can. You need an efficient
conversion from binary to decimal digits to produce human-readable
output.
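To make the recurrences concrete, here is a minimal sketch in ordinary double-precision C (not the high-precision parallel setting of the exercise, where the divisions and square roots would themselves be computed with the Newton–Raphson iterations of part (g), for example x_{k+1} = x_k(2 − a x_k) for 1/a); it only illustrates the sequences ak, bk, ck, dk and the approximation 2ak²/dk:

    #include <math.h>
    #include <stdio.h>

    /* Double-precision illustration of the Gauss-Legendre iteration;
       double precision is exhausted after about three iterations. */
    int main(void){
        double a = sqrt(2.0), b = 1.0, d = 1.0;
        for (int k = 0; k < 4; k++){
            double anew = 0.5*(a + b);              /* arithmetic mean */
            double bnew = sqrt(a*b);                /* geometric mean  */
            a = anew; b = bnew;
            double c = pow(2.0, k + 1)*(a*a - b*b); /* c_{k+1} = 2^{k+1}(a^2 - b^2) */
            d -= c;                                 /* d_{k+1} = d_k - c_{k+1}      */
            printf("k = %d: 2*a^2/d = %.15f\n", k + 1, 2.0*a*a/d);
        }
        return 0;
    }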
4
SPARSE MATRIX–VECTOR MULTIPLICATION
This chapter gently leads you into the world of irregular algorithms,
through the example of multiplying a sparse matrix with a vector. The
sparsity pattern of the matrix may be irregular, but fortunately it does
not change during the multiplication, and the multiplication may be
repeated many times with the same matrix. This justifies putting a lot
of effort in finding a good data distribution for a parallel multiplication.
We are able to analyse certain special types of matrices fully, such as
random sparse matrices and Laplacian matrices, and of course we can
also do this for dense matrices. For the first time, we encounter a useful
non-Cartesian matrix distribution, which we call the Mondriaan distri-
bution, and we study an algorithm for finding such a distribution for
a general sparse matrix. The program of this chapter demonstrates the
use of the bulk synchronous message passing primitives from BSPlib,
which were designed to facilitate irregular computations; the discus-
sion of these primitives completes the presentation of the whole BSPlib
standard. After having read this chapter, you are able to design and
implement parallel iterative solvers for linear systems and eigensystems,
and to build higher-level solvers on top of them, such as solvers for non-
linear systems, partial differential equations, and linear programming
problems.
u := Av. (4.4)
Of course, we can exploit the sparsity of A by summing only those terms for
which aij ≠ 0.
Sparse matrix–vector multiplication is almost trivial as a sequential prob-
lem, but it is surprisingly rich as a parallel problem. Different sparsity patterns
of A lead to a wide variety of communication patterns during a parallel com-
putation. The main task then is to keep this communication within bounds.
Sparse matrix–vector multiplication is important in a range of computa-
tions, most notably in the iterative solution of linear systems and eigen-
systems. Iterative solution methods start with an initial guess x0 of the
solution and then successively improve it by finding better approximations xk ,
k = 1, 2, . . ., until convergence within a prescribed error tolerance. Examples
of such methods are the conjugate gradient method for solving symmetric
positive definite sparse linear systems Ax = b and the Lanczos method for
solving symmetric sparse eigensystems Ax = λx; for an introduction, see [79].
The attractive property of sparse matrix–vector multiplication as the core of
these solvers is that the matrix does not change, and in particular that it
remains sparse throughout the computation. This is in contrast to methods
such as sparse LU decomposition that create fill-in, that is, new nonzeros.
Fig. 4.1. Sparse matrix cage6 with n = 93, nz = 785, c = 8.4, and d = 9.1%,
generated in a DNA electrophoresis study by van Heukelum, Barkema, and
Bisseling [186]. Black squares denote nonzero elements; white squares denote
zeros. This transition matrix represents the movement of a DNA polymer in a
gel under the influence of an electric field. Matrix element aij represents the
probability that a polymer in state j moves to a state i. The matrix has the
name cage6 because the model used is the cage model and the polymer mod-
elled contains six monomers. The sparsity pattern of this matrix is symmetric
(i.e. aij ≠ 0 if and only if aji ≠ 0), but the matrix is unsymmetric (since in
general aij ≠ aji ). In this application, the eigensystem Ax = x is solved by the
power method, which computes Ax, A2 x, A3 x, . . ., until convergence. Solution
component xi represents the frequency of state i in the steady-state situation.
Fig. 4.2. (a) Two-dimensional molecular dynamics domain of size 1.0 × 1.0 with ten
particles. Each circle denotes the interaction region of a particle, and is defined
by a cut-off radius rc = 0.1. (b) The corresponding 10 × 10 sparse matrix F . If
the circles of particles i and j overlap in (a), these particles interact and nonzeros
fij and fji appear in (b).
In principle, iterative methods can solve larger systems, but convergence is not
guaranteed.
The study of sparse matrix–vector multiplication also yields more insight
into other areas of scientific computation. In a molecular dynamics simula-
tion, the interaction between particles i and j can be described by a force
fij . For short-range interactions, this force is zero if the particles are far
apart. This implies that the force matrix F is sparse. The computation of
the new positions of particles moving under two-particle forces is similar to the
multiplication of a vector by F . A two-dimensional particle domain and the
corresponding matrix are shown in Fig. 4.2.
Algorithm 4.1 is a sequential sparse matrix–vector multiplication
algorithm. The ‘for all’ statement of the algorithm must be interpreted such
that all index pairs involved are handled in some arbitrary sequential order.
Tests such as ‘aij ≠ 0’ need never be performed in an actual implementation,
since only the nonzeros of A are stored in the data structure used. The formu-
lation ‘aij ≠ 0’ is a simple notational device for expressing sparsity without
having to specify the details of a data structure. This allows us to formulate
sparse matrix algorithms that are data-structure independent. The algorithm
costs 2cn flops.
for i := 0 to n − 1 do
    ui := 0;
for all (i, j) : 0 ≤ i, j < n ∧ aij ≠ 0 do
    ui := ui + aij vj ;
Now, I suggest you pause for a moment to think about how you would add
two vectors x and y that are stored in the data structure described above.
When you are done contemplating this, you may realize that the main problem
is to find the matching pairs (i, xi ) and (i, yi ) without incurring excessive costs;
this precludes for instance sorting. The magic trick is to use an auxiliary array
of length n that has been initialized already. We can use this array for instance
to register the location of the components of the vector y in its sparse data
structure, so that for a given i we can directly find the location j = loc[i]
where yi is stored. A value loc[i] = −1 denotes that yi is not stored in the
data structure, implying that yi = 0. After the computation, loc must be left
behind in the same state as before the computation, that is, with every array
element set to −1. For each nonzero yi , the addition method modifies the
component xi if it is already nonzero and otherwise creates a new nonzero in
the data structure of x. Algorithm 4.2 gives the details of the method.
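A minimal C sketch of this technique follows; it is not the book's Algorithm 4.2, and it assumes one particular sparse vector format (parallel arrays of indices and values), with enough room in the arrays of x for the appended nonzeros. The array loc of length n must contain only −1 before the first call and is restored on exit.

    /* x := x + y for sparse vectors; x has *cx nonzeros with indices ix and
       values vx, y has cy nonzeros with indices iy and values vy. */
    void sparse_add(long *cx, long *ix, double *vx,
                    long cy, const long *iy, const double *vy, long *loc)
    {
        /* Register where each nonzero yi is stored */
        for (long k = 0; k < cy; k++) loc[iy[k]] = k;

        /* Add the matching components of y to existing nonzeros of x */
        for (long k = 0; k < *cx; k++){
            long i = ix[k];
            if (loc[i] != -1){
                vx[k] += vy[loc[i]];
                loc[i] = -1;        /* yi has been used; reset loc */
            }
        }
        /* Append the remaining nonzeros of y as new nonzeros of x */
        for (long k = 0; k < cy; k++){
            long i = iy[k];
            if (loc[i] != -1){
                ix[*cx] = i; vx[*cx] = vy[k]; (*cx)++;
                loc[i] = -1;        /* reset loc for reuse */
            }
        }
    }

The sketch performs cx + 2cy loop iterations and one flop per index in the intersection of the two sparsity patterns, in line with the operation count discussed next.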
This sparse vector addition algorithm is more complicated than the
straightforward dense algorithm, but it has the advantage that the compu-
tation time is only proportional to the sum of the input lengths. The total
number of operations is O(cx + cy ), since there are cx + 2cy loop iterations,
each with a small constant number of operations. The number of flops equals
the number of nonzeros in the intersection of the sparsity patterns of x and y.
The initialization of array loc costs n operations, and this cost will dominate
that of the algorithm itself if only one vector addition has to be performed.
Fortunately, loc can be reused in subsequent vector additions, because each
modified array element is reset to −1. For example, if we add two n × n
matrices row by row, we can amortize the initialization cost over n vector
additions. The relative cost of initialization then becomes insignificant.
The addition algorithm does not check its output for accidental zeros,
that is, elements that are numerically zero but still present as a nonzero pair
(i, 0) in the data structure. Such accidental zeros are created for instance
when a nonzero yi = −xi is added to a nonzero xi and the resulting zero is
retained in the data structure. Furthermore, accidental zeros can propagate:
if yi is an accidental zero included in the data structure of y, and xi = 0 is
not in the data structure of x, then Algorithm 4.2 will insert the accidental
zero into the data structure of x. Still, testing all operations in a sparse
matrix algorithm for zero results is more expensive than computing with a few
additional nonzeros, so accidental zeros are usually kept. Another reason for
keeping accidental zeros is that removing them would make the output data
structure dependent on the numerical values of the input and not on their
sparsity pattern alone. This may cause problems for certain computations,
for example, if the same program is executed repeatedly for a matrix with
different numerical values but the same sparsity pattern and if knowledge
obtained from the first program run is used to speed up subsequent runs.
(Often, the first run of a sparse matrix program uses a dynamic data structure
but subsequent runs use a simplified static data structure based on the sparsity
patterns encountered in the first run.) In our terminology, we ignore accidental
zeros and we just assume that they do not exist.
Sparse matrices can be stored using many different data structures; the
best choice depends on the particular computation at hand. Some of the most
commonly used data structures are discussed below. Compressed row storage
(CRS) stores the nonzeros row by row in an array a, with the corresponding
column indices in an array j and the position of the first nonzero of each row
in an array start; this is illustrated in Example 4.2.
Example 4.2 Let
        0 3 0 0 1
        4 1 0 0 0
    A = 0 5 9 2 0 ,    n = 5, nz(A) = 13.
        6 0 0 5 3
        0 0 5 8 9
The CRS data structure of A then consists of the arrays
    a[k]     = 3 1 4 1 5 9 2 6 5 3 5  8  9
    j[k]     = 1 4 0 1 1 2 3 0 3 4 2  3  4
    k        = 0 1 2 3 4 5 6 7 8 9 10 11 12
    start[i] = 0 2 4 7 10 13
    i        = 0 1 2 3 4  5
The CRS data structure has the advantage that the elements of a row are
stored consecutively, so that row-wise operations are easy. If we compute
u := Av by components of u, then the nonzero elements aij needed to com-
pute ui are conveniently grouped together, so that the value of ui can be
kept in cache on a cache-based computer, thus speeding up the computation.
Algorithm 4.3 shows a sequential sparse matrix–vector multiplication that
uses CRS.
for i := 0 to n − 1 do
    u[i] := 0;
    for k := start[i] to start[i + 1] − 1 do
        u[i] := u[i] + a[k] · v[j[k]];
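In C, a straightforward rendering of Algorithm 4.3 (a sketch, not the book's program) is:

    /* u := A*v for an n x n sparse matrix in CRS format: a[k] and j[k] hold
       the value and column index of nonzero k, and start[i] is the position
       of the first nonzero of row i, with start[n] = nz(A). */
    void spmv_crs(long n, const double *a, const long *j, const long *start,
                  const double *v, double *u)
    {
        for (long i = 0; i < n; i++){
            double sum = 0.0;
            for (long k = start[i]; k < start[i+1]; k++)
                sum += a[k]*v[j[k]];   /* indirect access v[j[k]] */
            u[i] = sum;
        }
    }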
The ICRS data structure has been used in Parallel Templates [124], a par-
allel version of the Templates package for iterative solution of sparse linear
j := inc[0];
k := 0;
for i := 0 to n − 1 do
    u[i] := 0;
    while j < n do
        u[i] := u[i] + a[k] · v[j];
        k := k + 1;
        j := j + inc[k];
    j := j − n;
systems [11]. ICRS does not need the start array and its implementation in
C was found to be somewhat faster than that of CRS because the increments
translate well into the pointer arithmetic of the C language. Algorithm 4.4
shows a sequential sparse matrix–vector multiplication that uses ICRS. Note
that the algorithm avoids the indirect addressing of the vector v in the CRS
data structure, replacing access to v[j[k]] by access to v[j].
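A C sketch of Algorithm 4.4 (again, not the actual Parallel Templates code) is given below. As can be read off from Algorithm 4.4, inc[0] is the column index of the first nonzero; for k ≥ 1, inc[k] is the column index of nonzero k minus that of nonzero k − 1, increased by n whenever nonzero k starts a new row; and a final sentinel increment inc[nz] pushes the index past n − 1 so that the last row terminates.

    /* u := A*v for A stored in incremental CRS (ICRS). */
    void spmv_icrs(long n, const double *a, const long *inc,
                   const double *v, double *u)
    {
        long k = 0;
        long j = inc[0];                 /* column index of the first nonzero */
        for (long i = 0; i < n; i++){
            double sum = 0.0;
            while (j < n){               /* still in row i */
                sum += a[k]*v[j];        /* direct access v[j], no j[k] */
                k++;
                j += inc[k];
            }
            u[i] = sum;
            j -= n;                      /* undo the row-change offset */
        }
    }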
Jagged diagonal storage (JDS). The matrix A is permuted into a matrix
P A by ordering the rows by decreasing number of nonzeros. The first jagged
diagonal is formed by taking the first nonzero element of every row in P A. If
the matrix does not have empty rows, the length of the first jagged diagonal
is n. The second jagged diagonal is formed by taking the second nonzero of
every row. The length may now be less than n. This process is continued, until
all c0 jagged diagonals have been formed, where c0 is the number of nonzeros
of row 0 in P A. As in CRS, for each element the numerical value and column
index are stored. The main advantage of JDS is the large average length of the
jagged diagonals (of order n) that occurs if the number of nonzeros per row
does not vary too much. In that case, the sparse matrix–vector multiplication
can be done using efficient operations on long vectors.
Gustavson’s data structure [89]. This data structure combines CRS
and CCS, except that it stores the numerical values only for the rows. It
provides row-wise and column-wise access to the matrix, which is useful for
sparse LU decomposition.
The two-dimensional doubly linked list. Each nonzero is represen-
ted by a tuple, which includes i, j, aij , and links to a next and a previous
nonzero in the same row and column. The elements within a row or column
need not be ordered. This data structure gives maximum flexibility: row-wise
and column-wise access are easy and elements can be inserted and deleted in
O(1) operations. Therefore it is applicable in dynamic computations where the
matrix changes. The two-dimensional doubly linked list was proposed as the
best data structure for parallel sparse LU decomposition with pivoting [183],
where frequently rows or columns have to move from one set of processors to
another. (A two-dimensional singly linked list for sparse linear system solving
was already presented by Knuth in 1968 in the first edition of [122].) A dis-
advantage of linked list data structures is the amount of storage needed, for
instance seven memory cells per nonzero for the doubly linked case, which is
much more than the two cells per nonzero for CRS. A severe disadvantage is
that following the links causes arbitrary jumps in the computer memory, thus
often incurring cache misses.
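As a sketch (with made-up field names), one nonzero of such a two-dimensional doubly linked list could be represented in C as follows, which also shows where the count of seven memory cells per nonzero comes from: one cell for the value, two for the indices, and four for the links.

    struct nonzero {
        double a;                               /* numerical value aij  */
        long   i, j;                            /* row and column index */
        struct nonzero *prev_in_row, *next_in_row;
        struct nonzero *prev_in_col, *next_in_col;
    };

Inserting or deleting a nonzero then only requires updating the links of its four neighbours, which gives the O(1) behaviour mentioned above.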
Matrix-free storage. In certain applications it may be too costly or
unnecessary to store the matrix explicitly. Instead, each matrix element is
recomputed every time it is needed. In certain situations this may enable the
solution of huge problems that otherwise could not have been solved.
Fig. 4.3. (a) Distribution of a 5×5 sparse matrix A and vectors u and v of length five
over two processors. The matrix nonzeros and vector components of processor
P (0) are shown as grey cells; those of P (1) as black cells. The numbers in the
cells denote the numerical values aij . The matrix is the same as in Example 4.2.
Vector component ui is shown to the left of the matrix row that produces it;
vector component vj is shown above the matrix column that needs it. (b) The
local matrix part of the processors. Processor P (0) has six nonzeros; its row
index set is I0 = {0, 1, 2, 3} and its column index set J0 = {0, 1, 2}. Processor
P (1) has seven nonzeros; I1 = {0, 2, 3, 4} and J1 = {2, 3, 4}.
for all i with 0 ≤ i < n. As in the sequential case, only terms for which aij ≠ 0
are summed. Furthermore, only the local partial sums uis for which the set
{j : 0 ≤ j < n ∧ φ(i, j) = s} is nonempty are computed.
(0) { Fanout }
    for all j ∈ Js do
        get vj from P (φv (j));
(2) { Fanin }
    for all i ∈ Is do
        put uis in P (φu (i));
(The other partial sums are zero.) If there are fewer than p nonzeros in row i, then certainly one
or more processors will have an empty row part. For c ≪ p, this will happen
in many rows. To exploit this, we introduce the index set Is of the rows that
are locally nonempty in processor P (s). We compute uis if and only if i ∈ Is .
An example of row index sets is depicted in Fig. 4.3(b). Superstep (1) of
Algorithm 4.5 is the resulting local matrix–vector multiplication.
A suitable sparse data structure must be chosen to implement super-
step (1). Since we formulate our algorithm by rows, row-based sparse
data structures such as CRS and ICRS are a good choice, see Section 4.2.
The data structure should, however, only include nonempty local rows, to
avoid unacceptable overhead for very sparse matrices. To achieve this, we can
number the nonempty local rows from 0 to |Is | − 1. The corresponding indices
Fig. 4.4. Communication during sparse matrix–vector multiplication. The matrix
is the same as in Fig. 4.3. Vertical arrows denote communication of components
vj : v0 must be sent from its owner P (1) to P (0), which owns the nonzeros
a10 = 4 and a30 = 6; v2 must be sent from P (0) to P (1); v1 , v3 , v4 need not be
sent. Horizontal arrows denote communication of partial sums uis : P (1) sends
its contribution u01 = 3 to P (0); P (0) sends u20 = 14 to P (1); and P (1) sends
u31 = 29 to P (0); u1 and u4 are computed locally, without contribution from
the other processor. The total communication volume is V = 5 data words.
i are the local indices. The original global indices from the set Is are stored
in increasing order in an array rowindex of length |Is |. For 0 ≤ i < |Is |, the
global row index is i = rowindex [i]. If for instance CRS is used, the address
of the first local nonzero of row i is start[i] and the number of local nonzeros
of row i is start[i + 1] − start[i].
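As a sketch under these conventions (local CRS over the nonempty local rows only, with the column indices already translated to local indices into an array v_loc that holds the values of the needed components vj, obtained in the fanout discussed next), superstep (1) could be written as:

    /* Superstep (1): local sparse matrix-vector multiplication. The local
       matrix is stored in CRS over its nrows = |Is| nonempty rows; jloc[k]
       is a local column index into v_loc. u_part[i] becomes the partial
       sum u_is for global row rowindex[i]. */
    void local_spmv(long nrows, const double *a, const long *jloc,
                    const long *start, const double *v_loc, double *u_part)
    {
        for (long i = 0; i < nrows; i++){
            double sum = 0.0;
            for (long k = start[i]; k < start[i+1]; k++)
                sum += a[k]*v_loc[jloc[k]];
            u_part[i] = sum;
        }
    }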
The vector component vj needed for the computation of aij vj must be
obtained before the start of superstep (1). This is done in communication
superstep (0), see the vertical arrows in Fig. 4.4. The processor that has to
receive vj knows from its local sparsity pattern that it needs this component.
On the other hand, the processor that has to send the value is not aware
of the needs of the receiver. This implies that the receiver should be the
initiator of the communication, that is, we should use a ‘get’ primitive. Here,
we encounter an important difference between dense and sparse algorithms.
In dense algorithms, the communication patterns are predictable and thus
known to every processor, so that we can formulate communication supersteps
exclusively in terms of ‘put’ primitives. In sparse algorithms, this is often not
the case, and we have to use ‘get’ primitives as well.
Component vj has to be obtained only once by every processor that needs
it, even if it is used repeatedly for different local nonzeros aij in the same
matrix column j. If column j contains at least one local nonzero, then vj
must be obtained; otherwise, vj is not needed. Therefore, it is convenient to
define the index set Js of the locally nonempty columns, similar to the row
index set Is . We get vj if and only if j ∈ Js . This gives superstep (0) of
Algorithm 4.5. We call superstep (0) the fanout, because vector components
fan out from their initial location. The set Js can be represented by an array
colindex of length |Js |, similar to rowindex. An example of column index sets
is also depicted in Fig. 4.3(b). We consider the arrays rowindex and colindex
to be part of the data structure for the local sparse matrix.
The partial sum uis must be contributed to ui if the local row i is
nonempty, that is, if i ∈ Is . In that case, we call uis nonzero, even if acci-
dental cancellation of terms aij vj has occurred. Each nonzero partial sum uis
should be sent to the processor that possesses ui . Note that in this case the
sender has the knowledge about the existence of a nonzero partial sum, so
that we have to use a ‘put’ primitive. The resulting communication superstep
is superstep (2), which we call the fanin, see the horizontal arrows in Fig. 4.4.
Finally, the processor responsible for ui computes its value by adding the
previously received nonzero contributions uit , 0 ≤ t < p with t ≠ s, and the
local contribution uis . This is superstep (3).
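The following BSPlib sketch of supersteps (0) and (2) is not the program of this chapter (which uses the bulk synchronous message passing primitives of BSPlib); it merely illustrates the use of bsp_get for the fanout and bsp_put for the fanin. For simplicity it assumes the cyclic vector distributions φu(i) = φv(i) = i mod p, so that the owner and the remote offset of a component can be computed directly, with nloc = n/p owned components per processor; the arrays v_own (the owned components of v) and u_sum (a p × nloc array receiving partial sums, cleared to zero beforehand) are assumed to have been registered earlier with bsp_push_reg.

    #include <bsp.h>

    /* Superstep (0): fanout. Get every needed component vj from its owner. */
    void fanout(int p, long ncols, const long *colindex,
                double *v_loc, double *v_own)
    {
        for (long k = 0; k < ncols; k++){
            long j = colindex[k];
            bsp_get((int)(j % p), v_own, (j / p)*sizeof(double),
                    &v_loc[k], sizeof(double));
        }
        bsp_sync();
    }

    /* Superstep (2): fanin. Put every nonzero partial sum u_is into row s of
       the owner's u_sum array; different senders write different rows, so
       the puts cannot conflict. In superstep (3) the owner then adds, for
       each owned component, the partial sums it has received. */
    void fanin(int p, int s, long nloc, long nrows, const long *rowindex,
               const double *u_part, double *u_sum)
    {
        for (long k = 0; k < nrows; k++){
            long i = rowindex[k];
            bsp_put((int)(i % p), &u_part[k], u_sum,
                    (s*nloc + i/p)*sizeof(double), sizeof(double));
        }
        bsp_sync();
    }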
‘Wat kost het? ’ is an often-used Dutch phrase meaning ‘How much does
it cost?’. Unfortunately, the answer here is, ‘It depends’, because the cost of
Algorithm 4.5 depends on the matrix A and the chosen distributions φ, φv , φu .
Assume that the matrix nonzeros are evenly spread over the processors, each
processor having cn/p nonzeros. Assume that the vector components are also
evenly spread over the processors, each processor having n/p components.
Under these two load balancing assumptions, we can obtain an upper bound
on the cost. This bound may be far too pessimistic, since distributions may
exist that reduce the communication cost by a large factor.
The cost of superstep (0) is as follows. In the worst case, P (s) must
receive all n components vj except for the n/p locally available components.
Therefore, in the worst case hr = n − n/p; also, hs = n − n/p, because the
n/p local vector components must be sent to the other p − 1 processors. The
cost is T(0) = (1 − 1/p)ng + l. The cost of superstep (1) is T(1) = 2cn/p + l,
since two flops are needed for each local nonzero. The cost of superstep (2) is
T(2) = (1 − 1/p)ng + l, similar to T(0) . The cost of superstep (3) is T(3) = n + l,
because each of the n/p local vector components is computed by adding at
most p partial sums. The total cost of the algorithm is thus bounded by
TMV ≤ 2cn/p + n + 2(1 − 1/p)ng + 4l. (4.9)
Examining the upper bound (4.9), we see that the computation cost dominates
if 2cn/p > 2ng, that is, if c > pg. In that (rare) case, a distribution is already
efficient if it only satisfies the two load balancing assumptions. Note that
it is the number of nonzeros per row c and not the density d that directly
determines the efficiency. Here, and in many other cases, we see that the
parameter c is the most useful one to characterize the sparsity of a matrix.
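As a small aid (a sketch, not from the book), the bound can be evaluated for concrete parameters to check whether the condition c > pg holds and hence whether computation or communication dominates:

    /* Upper bound (4.9) on the BSP cost of Algorithm 4.5, in flops, under
       the two load balancing assumptions stated above. */
    double tmv_bound(double n, double c, double p, double g, double l)
    {
        return 2.0*c*n/p + n + 2.0*(1.0 - 1.0/p)*n*g + 4.0*l;
    }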
because every processor that has a nonzero in a matrix row i must send a
value uis in superstep (2), except perhaps one processor (the owner of ui ),
and similarly every processor that has a nonzero in a matrix column j must
receive vj in superstep (0), except perhaps one processor (the owner of vj ).
An upper bound is Vφ + 2n, because in the worst case all n components ui
are owned by processors that do not have a nonzero in row i, and similar for
the components vj . Therefore,
Vφ ≤ V ≤ Vφ + 2n. (4.11)
and hence certainly V ≤ Vφ +n. In the example of Fig. 4.4, the communication
volume can be reduced from V = 5 to Vφ = 4 if we would assign v0 to P (0)
instead of P (1).
to a processor P (φ0 (i), φ1 (j)) with 0 ≤ φ0 (i) < M , 0 ≤ φ1 (j) < N , and
p = M N . We can fit this assignment in our general scheme by identifying
one-dimensional and two-dimensional processor numbers. An example is the
natural column-wise identification
Fig. 4.5. Sparse matrix cage6 with n = 93 and nz = 785 distributed in a Cartesian
manner over four processors with M = N = 2; the matrix is the same as the
one shown in Fig. 4.1. Black squares denote nonzero elements; white squares
denote zeros. Lines denote processor boundaries. The processor row of a matrix
element aij is denoted by s = φ0 (i), and the processor column by t = φ1 (j).
The distribution has been determined visually, by trying to create blocks of rows
and columns that fit the sparsity pattern of the matrix. Note that the matrix
diagonal is assigned in blocks to the four processors, in the order P (0) ≡ P (0, 0),
P (1) ≡ P (1, 0), P (2) ≡ P (0, 1), and P (3) ≡ P (1, 1). The number of nonzeros
of the processors is 216, 236, 76, 257, respectively. The number of diagonal
elements is 32, 28, 12, 21, respectively; these are all nonzero. Assume that the
vectors u and v are distributed in the same way as the matrix diagonal. In that
case, 64 (out of 93) components of v must be communicated in superstep (0) of
the sparse matrix–vector multiplication, and 72 contributions to u in superstep
(2). For example, components v0 , . . . , v15 are only needed locally. The total
communication volume is V = 136 and the BSP cost is 24g + 2 · 257 + 28g + 28 ·
2 + 4l = 570 + 52g + 4l. Try to verify these numbers by clever counting!
ask whether this is also a good distribution for dense matrix–vector mul-
tiplication by Algorithm 4.5. Unfortunately, the answer is negative. The
reason is that element aii from the matrix diagonal is assigned to processor
P (i mod √p, i mod √p), so that the matrix diagonal is assigned to the
diagonal processors, that is, the processors P (s, s), 0 ≤ s < √p. This
implies that only √p out of p processors have part of the matrix diagonal and
hence of the vectors, so that the load balancing assumption for vector com-
ponents is not satisfied. Diagonal processors have to send out √p − 1 copies
of n/√p vector components, so that hs = n − n/√p in superstep (0) and h is
√p times larger than the h of a well-balanced distribution. The total cost for
a dense matrix with the square cyclic distribution becomes
TMV, dense, √p×√p cyclic = 2n²/p + n + 2(1 − 1/√p)ng + 4l. (4.20)
The communication cost for this unbalanced distribution is a factor √p higher
than the upper bound (4.19) with c = n for balanced distributions. The total
communication volume is Vφ = 2(√p − 1)n. For dense matrices with the
square cyclic distribution, the communication imbalance can be reduced by
changing the algorithm and using two-phase broadcasting for the fanout and
a similar technique, two-phase combining, for the fanin. This, however, does
not solve the problem of vector imbalance in other parts of the application,
and of course it would be better to use a good distribution in the first place,
instead of redistributing the data during the fanout or fanin.
The communication balance can be improved by choosing a distribution
that spreads the vectors and hence the matrix diagonal evenly, for example
choosing the one-dimensional distribution φu (i) = φv (i) = i mod p and using
the 1D–2D identification (4.14). We still have the freedom to choose M and N ,
where M N = p. For the choice M = p and N = 1, this gives φ0 (i) = i mod p
and φ1 (j) = 0, which is the same as the cyclic row distribution. It is easy to
see that now the cost is
TMV, dense, p×1 cyclic = 2n²/p + (1 − 1/p)ng + 2l. (4.21)
This distribution disposes of the fanin and the summation of partial sums,
since each matrix row is completely contained in one processor. Therefore,
the last two supersteps are empty and can be deleted. Still, this is a bad
distribution, since the gain by fewer synchronizations is lost by the much
more expensive fanout: each processor has to send n/p vector components to
all other processors. The communication volume is large: Vφ = (p − 1)n.
For the choice M = N = √p, we obtain φ0 (i) = (i mod p) mod √p =
i mod √p and φ1 (j) = (j mod p) div √p.
Fig. 4.6. Dense 8×8 matrix distributed over four processors using a square Cartesian
distribution based on a cyclic distribution of the matrix diagonal. The vectors
are distributed in the same way as the matrix diagonal. The processors are shown
by greyshades; the one-dimensional processor numbering is shown as numbers in
the cells of the matrix diagonal and the vectors. The two-dimensional numbering
is represented by the numbering of the processor rows and columns shown along
the borders of the matrix. The 1D–2D correspondence is P (0) ≡ P (0, 0), P (1) ≡
P (1, 0), P (2) ≡ P (0, 1), and P (3) ≡ P (1, 1).
The costs of the fanout and fanin for this distribution are given by
hs = hr = (√p − 1)n/p, so that
TMV, dense = 2n²/p + n/√p + 2(1/√p − 1/p)ng + 4l, (4.22)
which equals the upper bound (4.19) for c = n. We see that this distribution is
much better than the square cyclic distribution and the cyclic row distribution.
The communication volume is Vφ = 2(√p − 1)n. Figure 4.6 illustrates this data
distribution.
We may conclude that the case of dense matrices is a good example where
Cartesian matrix partitioning is useful in deriving an optimal distribution,
a square Cartesian distribution based on a cyclic distribution of the matrix
diagonal. (Strictly speaking, we did not give an optimality proof; it seems,
however, that this distribution is hard to beat.)
Let p′i = max(pi − 1, 0) and qj′ = max(qj − 1, 0). We are done if we prove
for 0 ≤ i < n, and a similar equality for qj′ , because the result then follows
from summing over i = 0, . . . , n − 1 for p′i and j = 0, . . . , n − 1 for qj′ . We only
prove the equality for the p′i , and do this by distinguishing two cases. If row i
has a nonzero in Ak−1 or Ak , then p′i = pi − 1 in all three terms of eqn (4.24).
Thus,
The proof of Theorem 4.6 shows that the theorem also holds for the com-
munication volume of the fanout and fanin separately. The theorem implies
that we only have to look at the subset we want to split when trying to optim-
ize the split, and not at the effect such a split has on communication for other
subsets.
The theorem helps us to achieve our goal of minimizing the communication
volume. Of course, we must at the same time also consider the load balance of
the computation; otherwise, the problem is easily solved: assign all nonzeros
to the same processor et voilà, no communication whatsoever! We specify
the allowed load imbalance by a parameter ǫ > 0, requiring that the p-way
partitioning of the nonzeros satisfies the computational balance constraint
max_{0≤s<p} nz(As ) ≤ (1 + ǫ) nz(A)/p. (4.27)
The load imbalance achieved for the matrix prime60 shown on the cover
of this book is ǫ′ ≈ 2.2%: the matrix has 462 nonzeros, and the number of
nonzeros per processor is 115, 118, 115, 114, for P (0) (red), P (1) (black), P (2)
(yellow), and P (3) (blue), respectively. (We denote the achieved imbalance by
ǫ′ , to distinguish it from the allowed imbalance ǫ.) The partitioning has been
obtained by a vertical split into blocks of consecutive columns, followed by
two independent horizontal splits into blocks of consecutive rows. The splits
were optimized for load balance only. (Requiring splits to yield consecutive
blocks is an unnecessary and harmful restriction, but it leads to nice and easily
comprehensible pictures.) The communication volume for this partitioning is
120, which is bad because it is the highest value possible for the given split
directions.
The best choice of the imbalance parameter ǫ is machine-dependent and
can be found by using the BSP model. Suppose we have obtained a matrix
distribution with volume V that satisfies the constraint (4.27). Assuming that
the subsequent vector partitioning does a good job, balancing the communica-
tion well and thus achieving a communication cost of V g/p, we have a BSP
cost of 2(1 + ǫ′ )nz(A)/p + V g/p + 4l. To get a good trade-off between com-
putation imbalance and communication, the corresponding overhead terms
should be about equal, that is, ǫ′ ≈ V g/(2nz(A)). If this is not the case, we
can increase or decrease ǫ and obtain a lower BSP cost. We cannot determine
ǫ beforehand, because we cannot predict exactly how its choice affects V .
How to split a given subset? Without loss of generality, we may assume
that the subset is A itself. In principle, we can assign every individual nonzero
to one of the two available processors. The number of possible 2-way par-
titionings, however, is huge, namely 2^(nz(A)−1). (We saved a factor of two
by using symmetry: we can swap the two processors without changing the
volume of the partitioning.) Trying all partitionings and choosing the best is
usually impossible, even for modest problem sizes. In the small example of
Fig. 4.3, we already have 2^12 = 4096 possibilities (one of which is shown).
Thus, our only hope is to develop a heuristic method, that is, a method
that gives an approximate solution, hopefully close to the optimum and com-
puted within reasonable time. A good start is to try to restrict the search
space, for example, by assigning complete columns to processors; the number
of possibilities then decreases to 2^(n−1). In the example, we now have 2^4 = 16
possibilities. In general, the number of possibilities is still large, and heuristics
are still needed, but the problem is now more manageable. One reason is that
bookkeeping is simpler for n columns than for nz(A) nonzeros. Furthermore,
a major advantage of assigning complete columns is that the split does not
generate communication in the fanout. Thus we decide to perform each split
by complete columns, or, alternatively, by complete rows. We can express the
splitting by the assignment
(B0 , B1 ) := split(A, dir, ǫ),
where dir ∈ {row, col} is the splitting direction and ǫ the allowed load
imbalance. Since we do not know ahead of time which of the two splitting directions
is best, we try both and choose the direction with the lowest communication
volume.
Example 4.7 Let
        0 3 0 0 1
        4 1 0 0 0
    A = 0 5 9 2 0 .
        6 0 0 5 3
        0 0 5 8 9
For ǫ = 0.1, the maximum number of nonzeros per processor must be seven
and the minimum six. For a column split, this implies that one processor must
have two columns with three nonzeros and the other processor the remaining
columns. A solution that minimizes the communication volume V is to assign
columns 0, 1, 2 to P (0) and columns 3, 4 to P (1). This gives V = 4. For a row
split, assigning rows 0, 1, 3 to P (0) and rows 2, 4 to P (1) is optimal, giving
V = 3. Can you find a better solution if you are allowed to assign nonzeros
individually to processors?
The function split can be applied repeatedly, giving a method for parti-
tioning a matrix into several parts. The method can be formulated concisely
as a recursive computation. For convenience, we assume that p = 2^q, but
the method can be adapted to handle other values of p as well. (This would
also require generalizing the splitting function, so that for instance in the first
split for p = 3, it can produce two subsets with a nonzero ratio of about 2 : 1.)
The recursive method should work for a rectangular input matrix, since the
submatrices involved may be rectangular (even though the initial matrix is
square). Because of the splitting into sets of complete columns or rows, we can
view the resulting p-way partitioning as a splitting into p mutually disjoint
submatrices (not necessarily with consecutive rows and columns): we start
with a complete matrix, split it into two submatrices, split each submatrix,
giving four submatrices, and so on. The number of times the original submat-
rix must be split to reach a given submatrix is called the recursion level
of the submatrix. The level of the original matrix is 0. The final result for
processor P (s) is a submatrix defined by an index set I¯s × J¯s . This index set
is different from the index set Is × Js of pairs (i, j) with i ∈ Is and j ∈ Js
defined in Algorithm 4.5, because the submatrices I¯s × J¯s may contain empty
rows and columns; removing these gives Is × Js . Thus we have
Furthermore, all the resulting submatrices are mutually disjoint. When a part
Bs produced by a split is partitioned further over p/2 processors, requiring
that the maximum number of nonzeros allowed per processor stays equal to
maxnz = (1 + ǫ) nz(A)/p, that is,
(1 + ǫs) nz(Bs)/(p/2) = (1 + ǫ) nz(A)/p, (4.32)
gives the value ǫs to be used in the remainder. In this way, the allowed load
imbalance is dynamically adjusted during the partitioning. A matrix part that
has fewer nonzeros than the average will have a larger ǫ in the remainder,
giving more freedom to minimize communication for that part. The resulting
algorithm is given as Algorithm 4.6.
Figure 4.7 presents a global view of the sparse matrix prime60 and the
corresponding input and output vectors distributed over four processors by
the Mondriaan package [188], version 1.0. The matrix distribution program of
this package is an implementation of Algorithm 4.6. The allowed load imbal-
ance specified by the user is ǫ = 3%; the imbalance achieved by the program
is ǫ′ ≈ 1.3%, since the maximum number of nonzeros per processor is 117 and
the average is 462/4=115.5. The communication volume of the fanout is 51
and that of the fanin is 47, so that V = 98. Note that rows i = 11, 17, 19, 23,
25, 31, 37, 41, 43, 47, 53, 55, 59 (in the exceptional numbering starting from
one) are completely owned by one processor and hence do not cause commu-
nication in the fanin; vector component ui is owned by the same processor.
Not surprisingly, all except two of these row numbers are prime. The distri-
bution is better than the distribution on the book cover, both with respect
to computation and with respect to communication. The matrix prime60 is
symmetric, and although the Mondriaan package has an option to produce
symmetric partitionings, we did not use it for our example.
Figure 4.8 presents a local view, or processor view, of prime60 and
the corresponding input and output vectors. For processor P (s), s = 0, 1, 2, 3,
the local submatrix I¯s × J¯s is shown.
Algorithm 4.6. Recursive matrix partitioning MatrixPartition(A, p, ǫ).
    if p > 1 then
        maxnz := (1 + ǫ) · nz(A)/p;
        (B0^row, B1^row) := split(A, row, ǫ/q);    { q = log2 p }
        (B0^col, B1^col) := split(A, col, ǫ/q);
        if V (B0^row, B1^row) ≤ V (B0^col, B1^col) then
            (B0 , B1 ) := (B0^row, B1^row);
        else (B0 , B1 ) := (B0^col, B1^col);
        ǫ0 := maxnz/nz(B0 ) · p/2 − 1;
        ǫ1 := maxnz/nz(B1 ) · p/2 − 1;
        (A0 , . . . , Ap/2−1 ) := MatrixPartition(B0 , p/2, ǫ0 );
        (Ap/2 , . . . , Ap−1 ) := MatrixPartition(B1 , p/2, ǫ1 );
    else A0 := A;
The size of this submatrix is 29 × 26 for P (0) (red), 29 × 34 for P (1) (black), 31 × 31 for P (2) (yellow), and 31 × 29 for
P (3) (blue). Together, the submatrices fit in the space of the original matrix.
The global indices of a submatrix are not consecutive in the original matrix,
but scattered. For instance, I¯0 = {2, 3, 4, 5, 11, 12, 14, . . . , 52, 53, 55, 56, 57}, cf.
Fig. 4.7. Note that I¯0 × J¯0 = I0 ×J0 and I¯3 × J¯3 = I3 ×J3 , but that I¯1 × J¯1 has
six empty rows and nine empty columns, giving a size of 23×25 for I1 ×J1 , and
that I¯2 × J¯2 has seven empty rows, giving a size of 24 × 31 for I2 × J2 . Empty
rows and columns in a submatrix are the aim of a good partitioner, because
they do not incur communication. An empty row is created by a column split
in which all nonzeros of a row are assigned to the same processor, leaving the
other processor empty-handed. The partitioning directions chosen for prime60
were first splitting in the row direction, and then twice, independently, in the
column direction.
The high-level recursive matrix partitioning algorithm does not specify the
inner workings of the split function. To find a good split, we need a biparti-
tioning method based on the exact communication volume. It is convenient
to express our problem in terms of hypergraphs, as was first done for matrix
partitioning problems by Çatalyürek and Aykanat [36,37]. A hypergraph
H = (V, N ) consists of a set of vertices V and a set of hyperedges, or nets, N,
where each net is a subset of the vertex set V.
Fig. 4.7. Matrix and vector distribution for the sparse matrix prime60. Global view
(see also Plate 2).
Fig. 4.8. Same matrix and distribution as in Fig. 4.7. Local view (see also
Plate 3).
Fig. 4.9. Hypergraph with nine vertices and six nets. Each circle represents a vertex.
Each oval curve enclosing a set of vertices represents a net. The vertex set
is V = {0, . . . , 8} and the nets are n0 = {0, 1}, n1 = {0, 5}, n2 = {0, 6},
n3 = {2, 3, 4}, n4 = {5, 6, 7}, and n5 = {7, 8}. The vertices have been coloured
to show a possible assignment to processors, where P (0) has the white vertices
and P (1) the black vertices.
A communication arises if the net is cut, that is, not all its vertices are
assigned to the same processor. The total communication volume incurred
by the split thus equals the number of cut nets of the hypergraph. In the
assignment of Fig. 4.9, two nets are cut: n1 and n2 .
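As a sketch (with a made-up storage scheme in which net t is given by the vertex list netvert[netstart[t] .. netstart[t+1]−1]), the communication volume of a 2-way split can thus be computed by counting the cut nets:

    /* Count the cut nets of a hypergraph for a 2-way vertex assignment
       part[v] in {0,1}. Nets are assumed to be nonempty. */
    long count_cut_nets(long nnets, const long *netstart, const long *netvert,
                        const int *part)
    {
        long cut = 0;
        for (long t = 0; t < nnets; t++){
            int first = part[netvert[netstart[t]]];
            for (long k = netstart[t] + 1; k < netstart[t+1]; k++){
                if (part[netvert[k]] != first){ cut++; break; }
            }
        }
        return cut;
    }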
Example 4.8 Let V = {0, 1, 2, 3, 4}. Let the nets be n0 = {1, 4},
n1 = {0, 1}, n2 = {1, 2, 3}, n3 = {0, 3, 4}, and n4 = {2, 3, 4}. Let
N = {n0 , n1 , n2 , n3 , n4 }. Then H = (V, N ) is the hypergraph that corresponds
reduction in cut nets obtained by moving it to the other processor. The best
move has the largest gain, for example, moving vertex 0 to P (1) in Fig. 4.9
has a gain of 1. The gain may be zero, such as in the case of moving vertex 5
or 6 to P (0). The gain may also be negative, for example, moving vertex 1,
2, 3, 4, or 8 to the other processor has a gain of −1. The worst move is that
of vertex 7, since its gain is −2.
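The gain of a proposed move can be computed cheaply if, for each net, the number of its vertices in each part is kept up to date. A sketch (with made-up names; vertnet lists, for each vertex, the nets that contain it) is:

    /* Gain (reduction in the number of cut nets) of moving vertex v to the
       other part; cnt[t][q] is the current number of vertices of net t in
       part q. The nets containing v are vertnet[vertstart[v]..vertstart[v+1]-1]. */
    long gain(long v, const long *vertstart, const long *vertnet,
              long cnt[][2], const int *part)
    {
        int from = part[v], to = 1 - from;
        long g = 0;
        for (long k = vertstart[v]; k < vertstart[v+1]; k++){
            long t = vertnet[k];
            if (cnt[t][to] == 0)   g--;  /* net becomes cut by the move   */
            if (cnt[t][from] == 1) g++;  /* net becomes uncut by the move */
        }
        return g;
    }

For vertex 0 of Fig. 4.9 this yields −1 + 1 + 1 = 1, and for vertex 7 it yields −1 − 1 = −2, as stated above.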
Example 4.9 The following is a column bipartitioning of an 8 × 8 matrix
by the multilevel method. During the coarsening, columns are matched in
even–odd pairs, column 0 with 1, 2 with 3, and so on. The initial partitioning
assigns columns to processors. A column owned by P (1) has its nonzeros
marked in boldface. All other columns are owned by P (0).
The bipartitioning proceeds as follows: the 8 × 8 matrix is coarsened twice,
each time by merging the matched column pairs, giving first an 8 × 4 and then
an 8 × 2 matrix; the coarsest matrix is partitioned by assigning its columns to
the two processors; this initial partitioning is then uncoarsened to the 8 × 4
matrix and refined, and finally uncoarsened to the original 8 × 8 matrix and
refined once more, which yields the final column bipartitioning.
The numbers of vector components owned by the processors, and the
corresponding send and receive volumes hs and hr for u (fanin) and
v (fanout), are:
              u                     v
  s    owned  hs  hr      owned  hs  hr
  0      18   11  12        13   13  13
  1      11   12  11        13   13  12
  2      12   12  12        16   14  15
  3      19   12  12        18   11  11
The cost of the fanout is 15g and the cost of the fanin 12g. The total communication
cost of 27g is thus only slightly above the average V g/p = 98g/4 = 24.5g,
which means that the communication is well-balanced. Note that the number
of vector components is less well-balanced, but this does not influence the
cost of the matrix–vector multiplication. (It influences the cost of other oper-
ations though, such as the vector operations accompanying the matrix–vector
multiplication in an iterative solver.)
The two vector distribution problems are similar; it is easy to see that
we can solve the problem of finding a good distribution φu given φ = φA by
finding a good distribution φv given φ = φAT . This is because the nonzero
pattern of row i of A is the same as the nonzero pattern of column i of AT ,
so that a partial sum uis is sent from P (s) to P (t) in the multiplication
by A if and only if a vector component vi is sent from P (t) to P (s) in the
multiplication by AT . Therefore, we only treat the problem for φv and hence
only consider the communication in the fanout.
Let us assume without loss of generality that we have a vector distribution
problem with qj ≥ 2, for all j. Columns with qj = 0 or qj = 1 do not cause
communication and hence may be omitted from the problem formulation.
(A good matrix distribution method will give rise to many columns with
qj = 1.) Without loss of generality we may also assume that the columns
are ordered by increasing qj ; this can be achieved by renumbering. Then the
h-values for the fanout are
hs (s) = Σ_{0≤j<n, φv(j)=s} (qj − 1), for 0 ≤ s < p, (4.36)
and
hr (s) = |{j : j ∈ Js ∧ φv (j) ≠ s}|, for 0 ≤ s < p. (4.37)
Me first! Consider what would happen if a processor P (s) becomes utterly
egoistic and tries to minimize its own h(s) = max(hs (s), hr (s)) without con-
sideration for others. To minimize hr (s), it just has to maximize the number
of components vj with j ∈ Js that it owns. To minimize hs (s), it has to
minimize the total weight of these components, where we define the weight
of vj as qj − 1. An optimal strategy would thus be to start with hs (s) = 0
and hr (s) = |Js | and grab the components in increasing order (and hence
increasing weight), adjusting hs (s) and hr (s) to account for each newly owned
component. The processor grabs components as long as hs (s) ≤ hr (s), the
new component included. We denote the resulting value of hs (s) by ĥs (s), the
resulting value of hr (s) by ĥr (s), and that of h(s) by ĥ(s). Thus,
ĥs (s) ≤ ĥr (s) = ĥ(s). (4.38)
The value ĥ(s) is indeed optimal for an egoistic P (s), because stopping
earlier would result in a higher hr (s) and hence a higher h(s) and because
stopping later would not improve matters either: if for instance P (s) would
grab one component more, then hs (s) > hr (s) so that h(s) = hs (s) ≥ hr (s) +
1 = ĥr (s) = ĥ(s). The value ĥ(s) is a local lower bound on the actual value
that can be achieved in the fanout,
ĥ(s) ≤ h(s), for 0 ≤ s < p. (4.39)
Example 4.10 The following table gives the input of a vector distribution
problem. If a processor P (s) owns a nonzero in matrix column j, this is
denoted by a 1 in the corresponding location; if it does not own such a nonzero,
this is denoted by a dot. This problem could for instance be the result of a
matrix partitioning for p = 4 with all splits in the row direction. (We can
view the input itself as a sparse p × n matrix.)
s = 0:  1 · 1 · 1 1 1 1
s = 1:  1 1 · 1 1 1 1 ·
s = 2:  · 1 · · · 1 1 1
s = 3:  · · 1 1 1 · · 1
qj =    2 2 2 2 3 3 3 3
j =     0 1 2 3 4 5 6 7
Processor P (0) wants v0 and v2 , so that ĥs (0) = 2, ĥr (0) = 4, and ĥ(0) = 4;
P (1) wants v0 , v1 , and v3 , so that ĥ(1) = 3; P (2) wants v1 , giving ĥ(2) = 3;
and P (3) wants v2 and v3 , giving ĥ(3) = 2. The fanout will cost at least 4g.
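A small C sketch of this 'egoistic' computation of ĥ(s) for a single processor (not the algorithm's actual code) is given below; q[0..nJ−1] holds the values qj of the columns j ∈ Js, in nondecreasing order as assumed above.

    /* Greedy local bound: grab components in order of increasing weight
       qj - 1 as long as hs <= hr with the new component included;
       return h_hat = max(hs, hr) and the final hs, hr. */
    long local_bound(long nJ, const long *q, long *hs_out, long *hr_out)
    {
        long hs = 0, hr = nJ;
        for (long k = 0; k < nJ; k++){
            long w = q[k] - 1;
            if (hs + w <= hr - 1){ hs += w; hr--; }  /* grab this component */
            else break;                              /* stop grabbing       */
        }
        *hs_out = hs;
        *hr_out = hr;
        return (hr > hs) ? hr : hs;
    }

For P(0) in Example 4.10, with q = (2, 2, 3, 3, 3, 3), this gives ĥs(0) = 2, ĥr(0) = 4, and ĥ(0) = 4, as above.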
More in general, we can compute a lower bound ĥ(J, ns0 , nr0 ) for a given
index set J ⊂ Js and a given initial number of sends ns0 and receives nr0 .
We denote the corresponding send and receive values by ĥs (J, ns0 , nr0 ) and
ĥr (J, ns0 , nr0 ). The initial communications may be due to columns outside J.
This bound is computed by the same method, but starting with the val-
ues hs (s) = ns0 and hr (s) = nr0 + |J|. Note that ĥ(s) = ĥ(Js , 0, 0). The
generalization of eqn (4.38) is
ĥs (J, ns0 , nr0 ) ≤ ĥr (J, ns0 , nr0 ) = ĥ(J, ns0 , nr0 ). (4.40)
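This local bound is easy to compute. A minimal C sketch (ours, not part of any package; the array weight and the function name are illustrative) takes the weights qj − 1 of the columns j ∈ J in increasing order and returns ĥ(J, ns0, nr0), storing ĥs and ĥr through the output pointers:

int local_bound(int *weight, int nJ, int ns0, int nr0,
                int *hs_out, int *hr_out){
    /* Start as if the processor owned none of the nJ components:
       ns0 sends and nr0 receives are already caused by columns outside J */
    int hs= ns0, hr= nr0+nJ, j;

    for (j=0; j<nJ; j++){
        /* grab component j (weight q_j - 1) only if hs <= hr,
           the new component included */
        if (hs+weight[j] <= hr-1){
            hs += weight[j];
            hr--;
        } else
            break;
    }
    *hs_out= hs;
    *hr_out= hr;
    return (hs>hr ? hs : hr);  /* hhat = max(hhat_s, hhat_r) */
}

For P(0) in Example 4.10, the weights are 1, 1, 2, 2, 2, 2 and the sketch indeed returns ĥ(J0, 0, 0) = 4.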
Think about the others! Every processor would be happy to own the lighter
components and would rather leave the heavier components to the others.
Since every component vj will have to be owned by exactly one processor, we
must devise a mechanism to resolve conflicting desires. A reasonable heuristic
seems to be to give preference to the processor that faces the toughest future,
that is, the processor with the highest value ĥ(s). Our aim in the vector distri-
bution algorithm is to minimize the highest h(s), because (max0≤s<p h(s)) · g
is the communication cost of the fanout.
Algorithm 4.7 is the vector distribution algorithm based on the local-bound
heuristic; it has been proposed by Meesen and Bisseling [136]. The algorithm
for s := 0 to p − 1 do
    Ls := Js;
    hs(s) := 0;
    hr(s) := 0;
    if hs(s) < ĥs(Ls, hs(s), hr(s)) then
        active(s) := true;
    else active(s) := false;
repeated until no more odd-degree vertices are present. It is easy to see that
our procedure cannot change the degree of a vertex from even to odd. Finally,
the same procedure is carried out starting at even-degree vertices.
Once all undirected edges have been transformed into directed edges, we
have obtained a directed graph, which determines the owner of every remain-
ing vector component: component vj corresponding to a directed edge (s, t) is
assigned to P (s), causing a communication from P (s) to P (t). The resulting
vector distribution has minimal communication cost; for a proof of optimality,
see [136]. The vector distribution shown in Fig. 4.7 has been determined this
way. The matrix prime60 indeed has the property pi ≤ 2 for all i and qj ≤ 2
for all j, as a consequence of the different splitting directions of the matrix,
that is, first horizontal, then vertical.
Fig. 4.11. Sparse matrix random100 with n = 100, nz = 1000, c = 10, and
d = 0.1, interactively generated at the Matrix Market Deli [26], see
http://math.nist.gov/MatrixMarket/deli/Random/.
assume that the sparse matrix is random by construction. (We have faith in
the random number generator we use, ran2 from [157]. One of its character-
istics is a period of more than 2 × 10^18, meaning that it will not repeat itself
for a very long time.)
Now, let us study parallel matrix–vector multiplication for random sparse
matrices. Suppose we have constructed a random sparse matrix A by drawing
for each index pair (i, j) a random number rij ∈ [0, 1], doing this independ-
ently and uniformly (i.e. with each outcome equally likely), and then creating
a nonzero aij if rij < d. Furthermore, suppose that we have distributed A over
the p processors of a parallel computer in a manner that is independent of
the sparsity pattern, by assigning an equal number of elements (whether zero
or nonzero) to each processor. For simplicity, assume that n mod p = 0.
Therefore, each processor has n²/p elements. Examples of such a distribution
are the square block distribution and the cyclic row distribution.
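A pattern of this kind is easily generated. The following sketch (ours; it uses the C library rand instead of ran2 and stores the pattern in coordinate format, with illustrative array names nzi, nzj) creates a nonzero aij with probability d, independently for every index pair:

#include <stdio.h>
#include <stdlib.h>

int main(void){
    int n= 100, nz= 0, i, j;
    double d= 0.1;
    int *nzi= malloc(n*n*sizeof(int));
    int *nzj= malloc(n*n*sizeof(int));

    for (i=0; i<n; i++){
        for (j=0; j<n; j++){
            double r= (double)rand()/((double)RAND_MAX+1.0);
            if (r < d){    /* nonzero a_ij created with probability d */
                nzi[nz]= i;
                nzj[nz]= j;
                nz++;
            }
        }
    }
    printf("n= %d, nz= %d (expected about %.0f)\n", n, nz, d*n*n);
    free(nzj); free(nzi);
    return 0;
}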
First, we investigate the effect of such a fixed, pattern-independent distri-
bution scheme on the spread of the nonzeros, and hence on the load balance
The bound for ǫ = 1 tells us that the extra time caused by load imbalance
exceeds the ideal time of the computation itself with probability less than
p(0.68)^{dn²/p}. Figure 4.12 plots the function F (ǫ) defined as the right-hand
side of eqn (4.42) against the normalized computation cost 1+ǫ, for n = 1000,
p = 100, and three different choices of d. The normalized computation cost of
superstep (1) is the computation cost in flops divided by the cost of a perfectly
parallelized computation. The figure shows for instance that for d = 0.01 the
expected normalized cost is at most 1.5; this is because the probability of
exceeding 1.5 is almost zero.
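The bound for ǫ = 1 is easily evaluated. A small program (ours, for illustration only) prints p(0.68)^{dn²/p} for the parameters of the figure; for d = 0.1 and d = 0.01 the bound is vanishingly small, whereas for d = 0.001 it exceeds 1 and hence carries no information:

#include <stdio.h>
#include <math.h>

int main(void){
    int n= 1000, p= 100, i;
    double d[3]= {0.1, 0.01, 0.001};

    for (i=0; i<3; i++){
        /* the bound for epsilon = 1 quoted above */
        double bound= p*pow(0.68, d[i]*n*(double)n/p);
        printf("d= %.3f  bound= %e\n", d[i], bound);
    }
    return 0;
}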
Fig. 4.12. Chernoff bound on the probability that a given normalized computation
cost is exceeded, for a random sparse matrix of size n = 1000 and density d
distributed over p = 100 processors.
[Figure: measured probability of exceeding a given normalized computation cost (range 1.1–1.6), together with a curve labelled 'Scaled derivative'.]
in P(∗, φ1(j)). A processor P(s, φ1(j)) does not need the component vj if all
n/√p elements in the local part of matrix column j are zero; this event has
probability (1 − d)^{n/√p}. The probability that P(s, φ1(j)) needs vj is
1 − (1 − d)^{n/√p}. Since √p − 1 processors each have to receive vj with this
probability, the expected number of receives for component vj is
(√p − 1)(1 − (1 − d)^{n/√p}). The owner of vj does not have to receive it.
The expected communication volume of the fanout is therefore
n(√p − 1)(1 − (1 − d)^{n/√p}). Since no processor is preferred, the h-relation
is expected to be balanced, so that the expected communication cost of
superstep (0) is

T(0) = (1/√p − 1/p)(1 − (1 − d)^{n/√p}) n g.    (4.43)
Superstep (3) adds the nonzero partial sums, both those just received and
those present locally. This costs
T(3) = (n/√p)(1 − (1 − d)^{n/√p}).    (4.45)
If g ≫ 1, which is often the case, then T(3) ≪ T(2) , so that the cost of
superstep (3) can be neglected. Finally, the synchronization cost of the whole
algorithm is 4l.
For our example of n = 1000 and p = 100, the matrix with highest density,
d = 0.1, is expected to cause a communication cost of 179.995g, which is close
to the cost of 180g for a completely dense matrix. The corresponding expected
normalized communication cost is (T(0) + T(2))/(2dn²/p) ≈ 0.09g. This means
that we need a parallel computer with g ≤ 11 to run our algorithm with more
than 50% efficiency.
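These numbers follow from eqn (4.43). The sketch below (ours) evaluates the expected cost, assuming, as the total of 179.995g suggests, that the fanin cost T(2) equals the fanout cost T(0) by the symmetry between fanout and fanin:

#include <stdio.h>
#include <math.h>

int main(void){
    int n= 1000, p= 100, i;
    double d[3]= {0.1, 0.01, 0.001};

    for (i=0; i<3; i++){
        /* probability that a processor needs a given component v_j */
        double q= 1.0 - pow(1.0-d[i], n/sqrt((double)p));
        double T0= (1.0/sqrt((double)p) - 1.0/p)*q*n;        /* eqn (4.43), in units of g */
        double normalized= 2.0*T0/(2.0*d[i]*n*(double)n/p);  /* (T(0)+T(2))/(2dn^2/p) */
        printf("d= %.3f  T(0)= %.3fg  normalized= %.2fg\n", d[i], T0, normalized);
    }
    return 0;
}

For d = 0.001 this also reproduces the normalized cost of 0.86g for the square distribution mentioned below.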
For matrices with very low density, the local part of a matrix column
is unlikely to have more than one nonzero. Every nonzero will thus incur a
communication. In that case a row distribution is better than a square matrix
distribution, because this saves the communications of the fanin. For our
example of n = 1000 and p = 100, the matrix with lowest density, d = 0.001,
is expected to cause a normalized communication cost of 0.86g for a square
matrix distribution and 0.49g for the cyclic row distribution.
One way of improving the performance is by tailoring the distribution
used to the sparsity pattern of the random sparse matrix. Figure 4.14 shows
a tailored distribution produced by the Mondriaan package for the mat-
rix random100. The figure gives a local view of the matrix, showing the
Fig. 4.14. Local view of sparse matrix random100 from Fig. 4.11 with n = 100,
nz = 1000, and d = 0.1, distributed by the Mondriaan package over p = 16
processors. The allowed imbalance is ǫ = 20%; the achieved imbalance is ǫ′ =
18.4%. The maximum number of nonzeros per processor is 74; the average is
62.5; and the minimum is 25. The communication volume is V = 367. The first
split is in the row direction; the next two splits are in the column direction.
The empty row and column parts created by the splits are collected in empty
rectangles.
the distribution produced by the Mondriaan package (version 1.0, run with
default parameters), averaged over a set of 100 random sparse matrices. The
volume for the Cartesian distribution is based on the cost formula eqn (4.43),
generalized to handle nearly square distributions as well, such as the 8 × 4
distribution for p = 32. For ease of comparison, the value of ǫ specified as
input to Mondriaan equals the expected load imbalance for the Cartesian
distribution. The achieved imbalance ǫ′ is somewhat below that value. It is
clear from the table that the Mondriaan distribution causes less communica-
tion, demonstrating that on average the package succeeds in tailoring a better
distribution to the sparsity pattern. For p = 32, we gain about 45%.
The vector distribution corresponding to the Cartesian distribution
satisfies distr(u) = distr(v), which is an advantage if this requirement must
be met. The volume for the Mondriaan distribution given in Table 4.2 may
increase in case of such a requirement, but at most by n, see eqn (4.12). For
p ≥ 8, the Mondriaan distribution is guaranteed to be superior in this case.
We may conclude from these results that the parallel multiplication of a
random sparse matrix and a vector is a difficult problem, most likely leading to
much communication. Only low values of g or high nonzero densities can make
this operation efficient. Using the fixed, pattern-independent Cartesian distri-
bution scheme based on a cyclic distribution of the matrix diagonal already
brings us close to the best distribution we can achieve. The load balance of
this distribution is expected to be good in most cases, as the numerical sim-
ulation and the Chernoff bound show. The distribution can be improved by
tailoring the distribution to the sparsity pattern, for example, by using the
Mondriaan package, but the improvement is modest.
where xi,j denotes the temperature at grid point (i, j). The difference
xi+1,j − xi,j approximates the derivative of the temperature in the i-direction,
and the difference (xi+1,j −xi,j )−(xi,j −xi−1,j ) = xi−1,j +xi+1,j −2xi,j approx-
imates the second derivative. By convention, we assume that xi,j = 0 outside
the k × k grid; in practice, we just ignore zero terms in the right-hand side of
eqn (4.46).
We can view the k × k array of values xi,j as a one-dimensional vector v
of length n = k² by the identification v_{i+jk} = x_{i,j}, for 0 ≤ i, j < k.
Example 4.12 Consider the 3 × 3 grid shown in Fig. 4.15. Equation (4.46)
now becomes u = Av, where
        −4  1  ·  1  ·  ·  ·  ·  ·
         1 −4  1  ·  1  ·  ·  ·  ·
         ·  1 −4  ·  ·  1  ·  ·  ·
         1  ·  · −4  1  ·  1  ·  ·           B  I3  0
A =      ·  1  ·  1 −4  1  ·  1  ·     =    I3  B  I3  .
         ·  ·  1  ·  1 −4  ·  ·  1           0  I3  B
         ·  ·  ·  1  ·  · −4  1  ·
         ·  ·  ·  ·  1  ·  1 −4  1
         ·  ·  ·  ·  ·  1  ·  1 −4
(0,2)    6  7  8
(0,1)    3  4  5
         0  1  2
      (0,0) (1,0) (2,0)
Fig. 4.15. A 3×3 grid. For each grid point (i, j), the index i+3j of the corresponding
vector component is shown.
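For this matrix, the product u = Av need not use a sparse data structure at all: it can be applied matrix-free on the grid. A minimal sketch (ours) uses the identification of grid point (i, j) with vector index i + jk and treats points outside the grid as zero:

void laplacian_apply(int k, const double *v, double *u){
    /* u := Av for the k x k Laplacian matrix of Example 4.12 */
    int i, j;

    for (j=0; j<k; j++){
        for (i=0; i<k; i++){
            double sum= -4.0*v[i+j*k];
            if (i > 0)   sum += v[(i-1)+j*k];
            if (i < k-1) sum += v[(i+1)+j*k];
            if (j > 0)   sum += v[i+(j-1)*k];
            if (j < k-1) sum += v[i+(j+1)*k];
            u[i+j*k]= sum;
        }
    }
}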
for integer radius r ≥ 0 and centre c = (c0 , c1 ) ∈ Z2 . This is the set of points
with Manhattan distance at most r to the central point c, see Fig. 4.17. The
number of points of Br(c) is 1 + 3 + 5 + · · · + (2r−1) + (2r+1) + (2r−1) + · · · + 1 =
2r² + 2r + 1. The number of neighbouring points is 4r + 4. If Br(c) represents a
Fig. 4.17. Digital diamond of radius r = 3 centred at c. Points inside the diamond
are shown in black; neighbouring points are shown in white.
set of grid points allocated to one processor, then the fanout involves receiving
4r+4 values. Just on the basis of receives, we may conclude that this processor
has a communication-to-computation ratio
Tcomm, diamonds / Tcomp, diamonds = (4r + 4)/(5(2r² + 2r + 1)) g ≈ (2/(5r)) g ≈ (2√(2p)/(5k)) g,    (4.52)

for large enough r, where we use the approximation r ≈ k/√(2p), obtained by
assuming that the processor has its fair share 2r² + 2r + 1 = k²/p of the grid
points. The resulting asymptotic ratio is a factor √2 lower than for square
blocks, cf. eqn (4.50). This reduction is caused by using each received value
twice. Diamonds are a parallel computing scientist’s best friend.
The gain of using diamonds can only be realized if the outgoing traffic
is balanced with the incoming traffic, that is, if the number of sends hs (s)
of processor P (s) is the same as the number of receives hr (s) = 4r + 4.
The number of sends of a processor depends on which processors own the
neighbouring points. Each of the 4r border points of a diamond has to be sent
to at least one processor and at most two processors, except corner points,
which may have to be sent to three processors. Therefore, 4r ≤ hs (s) ≤ 8r +4.
To find a distribution that balances the sends and the receives, we try to
fit the diamonds in a regular pattern. Consider first the infinite lattice Z2 ; to
make mathematicians cringe we view it as a k × k grid with k = ∞; to make
matters worse we let the grid start at (−∞, −∞). We try to partition this
∞×∞ grid over an infinite number of processors using diamonds. It turns out
that we can do this by placing the diamonds in a periodic fashion at centres
c = λa + µb, λ, µ ∈ Z, where a = (r, r + 1) and b = (−r − 1, r). Part of
this infinite partitioning is shown in Fig. 4.18. The centres of the diamonds
form a lattice defined by the orthogonal basis vectors a, b. We leave it to the
mathematically inclined reader to verify that the diamonds Br (λa + µb) are
indeed mutually disjoint and that they fill the whole domain. It is easy to
see that each processor sends hs (s) = 4r + 4 values. This distribution of an
infinite grid over an infinite number of processors achieves the favourable ratio
of eqn (4.52).
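The periodic placement also tells us directly which diamond owns an arbitrary point of the infinite grid. A sketch (ours; the function name is illustrative) expresses the point in the orthogonal basis a, b, rounds the two coordinates, and then searches the neighbouring lattice points for the unique centre at Manhattan distance at most r:

#include <math.h>
#include <stdlib.h>

void diamond_owner(int x, int y, int r, int *lambda, int *mu){
    double norm2= 2.0*r*r + 2.0*r + 1.0;  /* |a|^2 = |b|^2 = 2r^2+2r+1 */
    double alpha= (x*(double)r + y*(double)(r+1))/norm2;   /* coordinate along a */
    double beta = (-x*(double)(r+1) + y*(double)r)/norm2;  /* coordinate along b */
    int l0= (int)floor(alpha+0.5), m0= (int)floor(beta+0.5), dl, dm;

    for (dl=-1; dl<=1; dl++){
        for (dm=-1; dm<=1; dm++){
            int l= l0+dl, m= m0+dm;
            int cx= l*r - m*(r+1);   /* centre c = l*a + m*b */
            int cy= l*(r+1) + m*r;
            if (abs(x-cx) + abs(y-cy) <= r){
                *lambda= l;
                *mu= m;
                return;
            }
        }
    }
}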
Practical computational grids and processor sets are, alas, finite and there-
fore the region of interest must be covered using a finite number of diamonds.
Sometimes, the shape of the region allows us to use diamonds directly without
too much waste. In other situations, many points from the covering diamonds
fall outside the region of interest. These points are then discarded and the
remaining pieces of diamonds are combined, assigning several pieces to one
processor, such that each processor obtains about the same number of grid
points.
Figure 4.18 shows a 10 × 10 grid partitioned into one complete and seven
incomplete diamonds by using the infinite diamond partitioning. This can
Fig. 4.19. Basic cell of radius r = 3 assigned to a processor. The cell has 18 grid
points, shown in black. Grid points outside the cell are shown in white. Grid
points on the thick lines are included; those on the thin lines are excluded. The
cell contains 13 grid points that are closer to the centre than to a corner of the
enclosing square and it contains five points at equal distance. The cell has 14
neighbouring grid points.
Fig. 4.20. Distribution of a 12 × 12 grid over eight processors obtained by using the
basic cell from Fig. 4.19 (see also Plate 4).
Fig. 4.22. Basic three-dimensional cell assigned to a processor. The cell is defined
as the set of grid points that fall within a truncated octahedron. The boundaries
of this truncated octahedron are included/excluded as follows. Included are the
four hexagons and three squares visible at the front (which are enclosed by solid
lines), the twelve edges shown as thick solid lines, and the six vertices marked
in black. The other faces, edges, and vertices are excluded. The enclosing cube
is shown for reference only. Neighbouring cells are centred at the eight corners
of the cube and the six centres of neighbouring cubes.
since grid points of the cell are closer to its centre than to the centres of other
cells (with fair tie breaking).
The number of grid points of the basic cell is 4r³. A careful count shows
that the number of points on the surface is 9r² + 2 and that the number of
sends and receives is 9r² + 6r + 2. The resulting communication-to-computation
ratio is

(9r² + 6r + 2)/(7 · 4r³) g ≈ (9/(28r)) g ≈ (9(4p)^{1/3}/(28k)) g,

which is better by a factor of 1.68 than the ratio 6p^{1/3}/(7k) for blocks. In
an actual implementation, it may be most convenient to use a cubic array as
local data structure, with an extra layer of points along each border, and to
fill this array only partially. This enables the use of a regular array structure
while still reducing the communication.
Table 4.3 shows the communication cost for various distributions of a
two-dimensional grid with k = 1024 and a three-dimensional grid with
k = 128. For the two-dimensional Laplacian, the ideal case for the rectangular
block distribution occurs for p = q², that is, for p = 4, 16, 64, since the
local subdomains then become square blocks. For p = 2q², that is, for
p = 2, 8, 32, 128, the blocks become rectangles with an aspect ratio 2 : 1.
In contrast, the ideal case for the diamond distribution is p = 2q². To handle
the nonideal case p = q² as well, the diamond distribution is generalized
by stretching the basic cell in one direction, giving a communication cost of
4kg/√p. The table shows that the diamond distribution is better than the
rectangular block distribution for p = 2q² and performs the same for p = q²,
except in the special case of small p, where the boundaries of the grid play a
prominent role. For the three-dimensional Laplacian, only the ideal case for
the diamond distribution, p = 2q³, is shown. (The other cases p = q³ and
p = 4q³ are more difficult to treat, requiring a generalization of the diamond
distribution based on stretching the basic cell. An application developer hav-
ing 32 or 64 processors at his disposal might be motivated to implement this
kind of generalization.) We observe a reduction by a factor of 1.71 for p = 16
compared with the block distribution. Asymptotically, for large radius r, the
reduction factor is 16/9 ≈ 1.78 in the case p = 2q³.
For comparison, Table 4.3 also presents results obtained by using the
Mondriaan package (version 1.0) to produce a row distribution of the
Laplacian matrix and a corresponding distribution of the grid. For the Mon-
driaan distribution, the allowed load imbalance for the corresponding matrix
is ǫ = 10%. The communication cost given is the average over 100 runs of
the Mondriaan program, each time with a different seed of the random num-
ber generator used. In three dimensions, the Mondriaan distribution is better
than blocks and for large local subdomains (such as for p = 16) it comes close
to the performance of diamonds.
needed for the vectors and a large part of this memory may never be used for
writing nonzero partial sums. Furthermore, this method also requires O(n)
computing time, since all memory cells must be inspected. Thus, this method
is nonscalable both in time and memory. An alternative is a rather clumsy
method that may be termed the ‘three-superstep’ approach. In the first super-
step, each processor tells each of the other processors how many partial sums
it is going to send. In the second superstep, each receiving processor reserves
exactly the required amount of space for each of the senders, and tells them
the address from which they can start writing. Finally, in the third superstep,
the partial sums are put as pairs (i, uit ). Fortunately, we can organize the com-
munication in a more efficient and more elegant way by using the bsp send
primitive instead of bsp put. This is done in the function bspmv. Anyone
writing programs with irregular communication patterns will be grateful for
the existence of bsp send!
The bsp send primitive sends a message which consists of a tag and a
payload. The tag is used to identify the message; the payload contains the
actual data. The use of the bsp send primitive is illustrated by the top part of
Fig. 4.23. In our case, the tag is an index corresponding to i and the payload
is the partial sum uit . The syntax is
bsp send(pid, tag, source, nbytes);
Here, int pid is the identity of the destination processor; void *tag is a
pointer to the tag; void *source is a pointer to the source memory from
Fig. 4.23. Send operation from BSPlib. The bsp send operation copies nbytes of
data from the local processor bsp pid into a message, adds a tag, and sends this
message to the specified destination processor pid. Here, the pointer source
points to the start of the data to be copied. In the next superstep, the bsp move
operation writes at most maxnbytes from the message into the memory area
specified by the pointer dest.
which the data to be sent are read; int nbytes is the number of bytes to
be sent. In our case, the number of the destination processor is available as
φu (i) = destprocu[i], which has been initialized beforehand by the function
bspmv init. It is important to choose a tag that enables the receiver to handle
the payload easily. Here, the receiver needs to know to which vector component
the partial sum belongs. We could have used the global index i as a tag,
but then this index would have to be translated on receipt into the local
index i used to access u. Instead, we use the local index directly. Note that
in this case the tag need not identify the source processor, since its number is
irrelevant.
The message to be sent using the bsp send primitive is first stored by the
system in a local send buffer. (This implies that the tag and source variable
can be reused immediately.) The message is then sent and stored in a buffer
on the receiving processor. The send and receive buffers are invisible to the
user (but there is a way of emptying the receive buffer, as you may guess).
Some time after the message has been sent, it becomes available on the
receiving processor. In line with the philosophy of the BSP model, this hap-
pens at the end of the current superstep. In the next superstep, the messages
can be read; reading messages means moving them from the receive buffer into
the desired destination memory. At the end of the next superstep all remain-
ing unmoved messages will be lost. This is to save buffer memory and to force
the user into the right habit of cleaning his desk at the end of the day. (As said
before, the BSP model and its implementation BSPlib are quite paternalistic.
They often force you to do the right thing, for lack of alternatives.) The syntax
of the move primitive is
bsp move(dest, maxnbytes);
Here, void *dest is a pointer to the destination memory where the data are
written; int maxnbytes is an upper bound on the number of bytes of the
payload that is to be written. This is useful if only part of the payload needs
to be retrieved. The use of the bsp move primitive is illustrated by the bottom
part of Fig. 4.23.
In our case, the payload of a message is one double, which is written in its
entirety into sum, so that maxnbytes = SZDBL.
The header information of a message consists of the tag and the length of
the payload. This information can be retrieved by the statement
bsp get tag(status, tag);
Here, int *status is a pointer to the status, which equals −1 if the buffer is
empty; otherwise, it equals the length of the payload in bytes. Furthermore,
void *tag is a pointer to the memory where the tag is written. The status
information can be used to decide whether there is an unread message, and if
so, how much space to allocate for it. In our case, we know that each payload
has the same fixed length SZDBL.
We could have used the status in the termination criterion of the loop in
Superstep 3, to determine whether we have handled all partial sums. Instead,
we choose to use the enquiry primitive
bsp qsize(nmessages, nbytes);
Here, int *nmessages is a pointer to the total number of messages received
in the preceding superstep, and int *nbytes is a pointer to the total number
of bytes received. In our program, we only use bsp qsize to determine the
number of iterations of the loop, that is, the number of partial sums received.
In general, the bsp qsize primitive is useful for allocating the right amount
of memory for storing the received messages. Here, we do not need to allocate
memory, because we process and discard the messages immediately after we
read them. The name bsp qsize derives from the fact that we can view
the receive buffer as a queue: messages wait patiently in line until they are
processed.
In our program, the tag is an integer, but in general it can be of any type.
The size of the tag in bytes is set by
bsp set tagsize(tagsz);
On input, int *tagsz points to the desired tag size. As a result, the system
uses the desired tag size for all messages to be sent by bsp send. The function
bsp set tagsize takes effect at the start of the next superstep. All processors
must call the function with the same tag size. As a side effect, the contents of
tagsz will be modified, so that on output it contains the previous tag size of
the system. This is a way of preserving the old value, which can be useful if
an initial global state of the system must be restored later.
In one superstep, an arbitrary number of communication operations can be
performed, using either bsp put, bsp get, or bsp send primitives, and they
can be mixed freely. The only practical limitation is imposed by the amount
of buffer memory available. The BSP model and BSPlib do not favour any
particular type of communication, so that it is up to the user to choose the
most convenient primitive in a given situation.
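The primitives above can be combined into a simple pattern, sketched below for illustration (ours, not part of bspmv): every processor sends one tagged double to its right-hand neighbour and unpacks it in the next superstep. The sketch assumes the BSPedupack conventions, in particular the header bspedupack.h and the macros SZINT and SZDBL.

#include "bspedupack.h"

void ring_send_example(int p, int s){
    int tagsz= SZINT, status, tag, nmessages, nbytes;
    double value= (double)s, received;

    bsp_set_tagsize(&tagsz);   /* from now on, tags are single integers */
    bsp_sync();                /* the new tag size takes effect in the next superstep */

    /* tag = identity of the sender, payload = one double */
    bsp_send((s+1)%p, &s, &value, SZDBL);
    bsp_sync();                /* the message becomes available here */

    bsp_qsize(&nmessages, &nbytes);  /* here: one message of SZDBL bytes */
    bsp_get_tag(&status, &tag);      /* status = payload size, or -1 if no message */
    if (status != -1)
        bsp_move(&received, SZDBL);  /* move the payload out of the receive buffer */
}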
The local matrix–vector multiplication in Superstep 2 is an implement-
ation of Algorithm 4.4 for the local data structure, modified to handle a
rectangular nrows × ncols matrix. The inner loop of the multiplication has
been optimized by using pointer arithmetic. For once deviating from our
declared principles, we sacrifice readability here because this loop is expec-
ted to account for a large proportion of the computing time spent, and
because pointer arithmetic is the raison d’être of the ICRS data structure.
The statement
*psum += (*pa) * (*pvloc);
is a translation of
sum += a[k] * vloc[j];
void bspmv(int p, int s, int n, int nz, int nrows, int ncols,
double *a, int *inc,
int *srcprocv, int *srcindv, int *destprocu,
int *destindu, int nv, int nu, double *v, double *u){
    pvloc += *pinc;    /* move to the column of the first local nonzero */
    for(i=0; i<nrows; i++){
        *psum= 0.0;
        while (pvloc<pvloc_end){
            *psum += (*pa) * (*pvloc);  /* sum += a[k]*vloc[j] for the current nonzero */
            pa++;
            pinc++;
            pvloc += *pinc;             /* jump to the column of the next nonzero */
        }
        /* send the partial sum of row i to the owner of u_i, tagged with its local index */
        bsp_send(destprocu[i],&destindu[i],psum,SZDBL);
        pvloc -= ncols;    /* the end-of-row increment overshot by ncols; undo it */
    }
bsp_sync();
bsp_pop_reg(v);
vecfreed(vloc);
} /* end bspmv */
return (n+p-s-1)/p ;
} /* end nloc */
} /* end bspmv_init */
3800 supercomputer given in Table 4.3, and the computing rate is about the
same. The version of the Panda BSP library used has not been fully optim-
ized yet. For instance, if a processor puts data into itself, this can be done
quickly by a memory copy via a buffer, instead of letting the underlying Panda
communication system discover, at much higher cost, that the destination is
local. The lack of this feature is revealed by the relatively high value of g for
p = 1. In the experiments described below, the program bspmv has been mod-
ified to avoid sending data from a processor to itself. (In principle, as users
we should refuse to perform such low-level optimizations. The BSP system
should do this for us, since the main advantage of BSP is that it enables such
communication optimizations.)
Table 4.5 shows the set of sparse matrices used in our experiments. The
set consists of: random1k and random20k, which represent the random sparse
matrices discussed in Section 4.7; amorph20k, which was created by convert-
ing a model of 20 000 silicon atoms, each having four connections with other
atoms, to a sparse matrix, see Fig. 4.2 and Exercise 7; prime20k, which
extends the matrix prime60 from the cover of this book to size n = 20 000;
Table 4.7. Measured execution time (in ms) for sparse matrix–vector
multiplication
1 (seq) 0.2 9 7 18 30 52 71 92
1 (par) 0.3 10 8 19 31 53 72 96
2 5.1 73 13 56 26 41 59 205
4 5.2 57 16 77 22 26 39 228
8 6.9 48 15 50 14 21 25 226
16 9.1 32 11 46 13 18 24 128
32 14.8 28 17 37 18 21 23 87
64 27.4 36 29 45 29 32 34 73
Qualitatively, the BSP cost can be used to explain the timing results, or
predict them. Quantitatively, the agreement is less than perfect. The reader
can easily check this by substituting the measured values of g and l into
the cost expressions of Table 4.6. The advantage of presenting BSP costs for
sparse matrices as shown in Table 4.6 over presenting raw timings as is done
in Table 4.7 is the longevity of the results: in 20 years from now, when all
present supercomputers will rest in peace, when I shall be older and hopefully
wiser, the results expressed as BSP costs can still be used to predict execution
time on a state-of-the-art parallel computer.
processors, and finally broadcasts the sums within their processor row. Thus,
the output vector becomes available in the same format as the input vector
but with the role of processor rows and columns reversed. If the input and
output vector are needed in exactly the same distribution, the broadcast must
be preceded by a vector transposition.
The parallel sparse matrix–vector multiplication algorithm described in
this chapter, Algorithm 4.5, is based on previous work by Bisseling and
McColl [19,21,22]. The Cartesian version of the algorithm was first presen-
ted in [19] as part of a parallel implementation of GMRES, an iterative solver
for square unsymmetric linear systems. Bisseling [19] outlines the advantages
of using a two-dimensional Cartesian distribution and distributing the vectors
in the same way as the matrix diagonal and suggests to use as a fixed matrix-
independent distribution the square block/cyclic distribution, defined by
assigning matrix element aij to processor P(i div (n/√p), j mod √p). The cost
analysis of the algorithm in [19] and the implementation, however, are closely
tied to a square mesh communication network with store-and-forward rout-
ing. This means for instance that a partial sum sent from processor P (s, t0 )
to processor P (s, t1 ) has to be transferred through all intermediate processors
P (s, t), t0 < t < t1 . Experiments performed on a network of 400 transputers
for a subset of unsymmetric sparse matrices from the Harwell–Boeing collec-
tion [64] give disappointing speedups for the matrix–vector multiplication part
of the GMRES solver, due to the limitations of the communication network.
Bisseling and McColl [21,22] transfer the sparse matrix–vector multiplic-
ation algorithm from [19] to the BSP context. The matrix distribution is
Cartesian and the vectors are distributed in the same way as the matrix
diagonal. Now, the algorithm benefits from the complete communication
network provided by the BSP architecture. Architecture-independent time
analysis becomes possible because of the BSP cost function. This leads the
authors to a theoretical and experimental study of scientific computing applic-
ations such as molecular dynamics, partial differential equation solving on
multidimensional grids, and linear programming, all interpreted as an instance
of sparse matrix–vector multiplication. This work shows that the block/cyclic
distribution is an optimal fixed Cartesian distribution for unstructured sparse
matrices; also optimal is a Cartesian matrix distribution based on a balanced
random distribution of the matrix diagonal. Bisseling and McColl propose
to use digital diamonds for the Laplacian operator on a square grid. They
perform numerical experiments using MLIB, a library of matrix generators
and BSP cost analysers specifically developed for the investigation of sparse
matrix–vector multiplication.
Ogielski and Aiello [148] present a four-superstep parallel algorithm for
multiplication of a sparse rectangular matrix and a vector. The algorithm
exploits sparsity in the computation, but not in the communication. The
input and output vector are distributed differently. The matrix is distributed
in a Cartesian manner by first randomly permuting the rows and columns
many tools to help the user initialize an iterative solver, such as a tool for
detecting which vector components must be obtained during the fanout. In
Aztec, a processor has three kinds of vector components: internal components
that can be updated without communication; border components that belong
to the processor but need components from other processors for an update;
and external components, which are the components needed from other pro-
cessors. The components of u and v are renumbered in the order internal,
border, external, where the external components that must be obtained
from the same processor are numbered consecutively. The local submatrix
is reordered correspondingly. Two local data structures are supported: a vari-
ant of CRS with special treatment of the matrix diagonal and a block variant,
which can handle dense subblocks, thus increasing the computing rate for
certain problems by a factor of five.
Parallel Templates by Koster [124] is a parallel, object-oriented implement-
ation in C++ of the complete linear system templates [11]. It can handle every
possible matrix and vector distribution, including the Mondriaan distribution,
and it can run on top of BSPlib and MPI-1. The high-level approach of the
package and the easy reuse of its building blocks makes adding new paral-
lel iterative solvers a quick exercise. This work introduces the ICRS data
structure discussed in Section 4.2.
gain, starting from an arbitrary vertex, until half the total vertex weight
is included in the partition. For the uncoarsening, they propose a boundary
Kernighan–Lin algorithm. The authors implemented the graph partitioner in
a package called METIS. Karypis and Kumar [119] also developed a parallel
multilevel algorithm that performs the partitioning itself on a parallel com-
puter, with p processors computing a p-way partitioning. The algorithm uses
a graph colouring, that is, a colouring of the vertices such that neighbour-
ing vertices have different colours. To avoid conflicts between processors when
matching vertices in the coarsening phase, each coarsening step is organized
by colour, trying to find a match for vertices of one colour first, then for those
of another colour, and so on. This parallel algorithm has been implemented
in the ParMETIS package.
Walshaw and Cross [190] present a parallel multilevel graph partitioner
called PJostle, which is aimed at two-dimensional and three-dimensional
irregular grids. PJostle and ParMETIS both start with the vertices already
distributed over the processors in some manner and finish with a better
distribution. If the input partitioning is good, the computation of a new
partitioning is faster; the quality of the output partitioning is not affected
by the input partitioning. This means that these packages can be used for
dynamic repartitioning of grids, for instance in an adaptive grid computation.
The main difference between the two packages is that PJostle actually moves
vertices between subdomains when trying to improve the partitioning, whereas
ParMETIS keeps vertices on the original processor, but registers the new
owner. Experiments in [190] show that PJostle produces a better partitioning,
with about 10% fewer cut edges, but that ParMETIS is three times faster.
Bilderback [18] studies the communication load balance of the data
distributions produced by five graph partitioning programs: Chaco, METIS,
ParMETIS, PARTY, and Jostle. He observes that the difference between the
communication load of the busiest processor and that of the least busy one,
expressed in edge cuts, is considerable for all of these programs, indicating
that there is substantial room for improvement.
Hendrickson [93] argues that the standard approach to sparse matrix par-
titioning by using graph partitioners such as Chaco and METIS is flawed
because it optimizes the wrong cost function and because it is unnecessarily
limited to square symmetric matrices. In his view, the emperor wears little
more than his underwear. The standard approach minimizes the number of
nonzeros that induce communication, but not necessarily the number of com-
munication operations themselves. Thus the cost function does not take into
account that if there are two nonzeros aij and ai′ j on the same processor, the
value vj need not be sent twice to that processor. (Note that our Algorithm 4.5
obeys the old rule ne bis in idem, because it sends vj only once to the same pro-
cessor, as a consequence of using the index set Js .) Furthermore, Hendrickson
states that the cost function of the standard approach only considers com-
munication volume and not the imbalance of the communication load nor the
startup costs of sending a message. (Note that the BSP cost function is based
on the maximum communication load of a processor, which naturally encour-
ages communication balancing. The BSP model does not ignore startup costs,
but lumps them together into one parameter l; BSP implementations such as
BSPlib reduce startup costs by combining messages to the same destination
in the same superstep. The user minimizes startup costs by minimizing the
number of synchronizations.)
Çatalyürek and Aykanat [36,37] model the total communication volume
of sparse matrix–vector multiplication correctly by using hypergraphs. They
present a multilevel hypergraph partitioning algorithm that minimizes the
true communication volume. The algorithm has been implemented in a
package called PaToH (Partitioning Tool for Hypergraphs). Experimental
results show that PaToH reduces the communication volume by 30–40%
compared with graph-based partitioners. PaToH is about four times faster
than the hypergraph version of METIS, hMETIS, while it produces parti-
tionings of about the same quality. The partitioning algorithm in [36,37] is
one-dimensional since all splits are carried out in the same direction, yielding
a row or column distribution for the matrix with a corresponding vector distri-
bution. PaToH can produce p-way partitionings where p need not be a power
of two. Çatalyürek and Aykanat [38] also present a fine-grained approach to
sparse matrix–vector multiplication, where nonzeros are assigned individually
to processors. Each nonzero becomes a vertex in the problem hypergraph
and each row and column becomes a net. The result is a matrix partitioning
into disjoint sets As , 0 ≤ s < p, not necessarily corresponding to disjoint
submatrices Is × Js . This method is slower and needs more memory than
the one-dimensional approach, but the resulting partitioning is excellent; the
communication volume is almost halved. Hypergraph partitioning is com-
monly used in the design of electronic circuits and much improvement is due
to work in that field. A hypergraph partitioner developed for circuit design is
MLpart [35].
The two-dimensional Mondriaan matrix distribution method described in
Section 4.5 is due to Vastenhouw and Bisseling [188] and has been implemen-
ted in version 1.0 of the Mondriaan package. The method used to split a matrix
into two submatrices is based on the one-dimensional multilevel method for
hypergraph bipartitioning by Çatalyürek and Aykanat [36,37]. The Mondriaan
package can handle rectangular matrices as well as square matrices, and allows
the user to impose the condition distr(u) = distr(v). The package also has an
option to exploit symmetry by assigning aij and aji to the same processor.
The vector distribution methods described in Section 4.6 are due to Meesen
and Bisseling [136]; these methods improve on the method described in [188]
and will be included in the next major release of the Mondriaan package.
4.12 Exercises
1. Let A be a dense m×n matrix distributed by an M ×N block distribution.
Find a suitable distribution for the input and output vector of the dense
matrix–vector multiplication u := Av; the input and output distributions
can be chosen independently. Determine the BSP cost of the corresponding
matrix–vector multiplication. What is the optimal ratio N/M and the BSP
cost for this ratio?
2. Find a distribution of a 12 × 12 grid for a BSP computer with p = 8,
g = 10, and l = 50, such that the BSP cost of executing a two-dimensional
Laplacian operator is as low as possible. For the computation, we count five
flops for an interior point, four flops for a boundary point that is not a corner
point, and three flops for a corner point. Your distribution should be better
than that of Fig. 4.20, which has a BSP cost of 90 + 14g + 2l = 330 flops on
this computer.
3. Modify the benchmarking program bspbench from Chapter 1 by changing
the central bsp put statement into a bsp send and adding a corresponding
bsp move to the next superstep. Choose suitable sizes for tag and payload.
Run the modified program for various values of p and measure the values of g
and l. Compare the results with those of the original program. If your commu-
nication pattern allows you to choose between using bsp put and bsp send,
which primitive would you choose? Why?
4. (∗) An n × n matrix A is banded with upper bandwidth bU and lower
bandwidth bL if aij = 0 for i < j − bU and i > j + bL . Let bL = bU = b. The
matrix A has a band of 2b + 1 nonzero diagonals and hence it is sparse if b
is small. Consider the multiplication of a banded matrix A and a vector v by
Algorithm 4.5 using the one-dimensional distribution φ(i) = i div (n/p) for the
matrix diagonal and the vectors, and a corresponding M ×N Cartesian matrix
distribution (φ0 , φ1 ). For simplicity, assume that n is a multiple of p. Choosing
M completely determines the matrix distribution. (See also Example 4.5,
where n = 12, b = 1, and p = 4.)
(a) Let b = 1, which means that A is tridiagonal. Show that the communication
cost for the choice M = p (i.e. a row distribution of the matrix)
is lower than for the choice M = √p (i.e. a square distribution).
(b) Let b = n − 1, which means that A is dense. Section 4.4 shows that
now the communication cost for the choice M = √p is lower than for
M = p. We may conclude that for small bandwidth the choice M = p
is better, whereas for large bandwidth the choice M = √p is better.
Which value of b is the break-even point between the two methods?
(c) Implement Algorithm 4.5 for the specific case of band matrices. Drop
the constraint on n and p. Choose a suitable data structure for the
matrix: use an array instead of a sparse data structure.
(d) Run your program and obtain experimental values for the break-even
point of b. Compare your results with the theoretical predictions.
5. (∗) Let A be a sparse m × m matrix and B a dense m × n matrix with
m ≥ n. Consider the matrix–matrix multiplication C = AB.
(a) What is the time complexity of a straightforward sequential
algorithm?
(b) Choose distributions for A, B, and C and formulate a corresponding
parallel algorithm. Motivate your choice and discuss alternatives.
(c) Analyse the time complexity of the parallel algorithm.
(d) Implement the algorithm. Measure the execution time for various values
of m, n, and p. Explain the results.
6. (∗) The CG algorithm by Hestenes and Stiefel [99] is an iterative method
for solving a symmetric positive definite linear system of equations Ax = b.
(A matrix A is positive definite if xT Ax > 0, for all x ≠ 0.) The algorithm
computes a sequence of approximations xk, k = 0, 1, 2, . . ., that converges
towards the solution x. The algorithm is usually considered converged when
‖rk‖ ≤ ǫconv ‖b‖, where rk = b − Axk is the residual. One can take, for
example, ǫconv = 10^{−12}. A sequential (nonpreconditioned) CG algorithm is
given as Algorithm 4.8. For more details and a proof of convergence, see Golub
and Van Loan [79].
(a) Design a parallel CG algorithm based on the sparse matrix–vector mul-
tiplication of this chapter. How do you distribute the vectors x, r, p, w?
Motivate your design choices. Analyse the time complexity.
(b) Implement your algorithm in a function bspcg, which uses bspmv for
the matrix–vector multiplication and bspip from Chapter 1 for inner
product computations.
(c) Write a test program that first generates an n × n sparse matrix B with
a random sparsity pattern and random nonzero values in the interval
[−1, 1] and then turns B into a symmetric matrix A = B + B T + µIn .
Choose the scalar µ sufficiently large to make A strictly diagonally
dominant, that is, |aii| > Σ_{j=0, j≠i}^{n−1} |aij| for all i, and to make the
diagonal elements aii positive. It can be shown that such a matrix is
positive definite, see [79]. Use the Mondriaan package with suitable
options to distribute the matrix and the vectors.
(d) Experiment with your program and explain the results. Try different n
and p and different nonzero densities. How does the run time of bspcg
scale with p? What is the bottleneck? Does the number of iterations
needed depend on the number of processors and the distribution?
7. (∗) In a typical molecular dynamics simulation, the movement of a large
number of particles is followed for a long period of time to gain insight into
a physical process. For an efficient parallel simulation, it is crucial to use a
x := x0;    { initial guess }
k := 0;     { iteration number }
r := b − Ax;
ρ := ‖r‖²;
while √ρ > ǫconv ‖b‖ ∧ k < kmax do
    if k = 0 then
        p := r;
    else
        β := ρ/ρold;
        p := r + βp;
    w := Ap;
    γ := pT w;
    α := ρ/γ;
    x := x + αp;
    r := r − αw;
    ρold := ρ;
    ρ := ‖r‖²;
    k := k + 1;
good data distribution, especially in three dimensions. We can base the data
distribution on a suitable geometric partitioning of space, following [169].
Consider a simulation with a three-dimensional simulation box of size 1.0 ×
1.0 × 1.0 containing n particles, spread homogeneously, which interact if their
distance is less than a cut-off radius rc , with rc ≪ 1, see Fig. 4.2. Assume
that the box has periodic boundaries, meaning that a particle near a boundary
interacts with particles near the opposite boundary.
(b) Write an efficient function that computes the cost increment for a given
move. Note that simply computing the cost from scratch before and
after the move and taking the difference is inefficient; this approach
would be too slow for use inside a simulated annealing program, where
many moves must be evaluated. Take care that updating the cost for
a sequence of moves yields the same result as computing the cost from
scratch. Hint: keep track of the contribution of each processor to the
cost of the four supersteps of the matrix–vector multiplication.
(c) Put everything together and write a complete simulated annealing pro-
gram. The main loops of your program should implement a cooling
schedule, that is, a method for changing the temperature T during
the course of the computation. Start with a temperature T0 that is
much larger than every possible increment ∆C to be encountered. Try
a large number of moves at the initial temperature, for instance p·nz(A)
moves, and then reduce the temperature, for example, to T1 = 0.99T0 ,
thus making cost increases less likely to be accepted. Perform another
round of moves, reduce the temperature further, and so on. Finding a
good cooling schedule requires some trial and error.
(d) Compare the output quality and computing time of the simulated
annealing program to that of the Mondriaan package. Discuss the
difference between the output distributions produced by the two
programs.
program to unpack the sums directly from the receive buffer by using
bsp hpmove instead of bsp get tag and bsp move.
(h) Test the effect of these optimizations. Do you attain communication
rates corresponding to optimistic g-values?
(i) Try to improve the speed of the local matrix–vector multiplication by
treating the local nonzeros that do not cause communication separ-
ately. This optimization should enhance cache use on computers with
a cache.
APPENDIX A
AUXILIARY BSPEDUPACK FUNCTIONS
/*
###########################################################################
## BSPedupack Version 1.0 ##
## Copyright (C) 2004 Rob H. Bisseling ##
## ##
## BSPedupack is released under the GNU GENERAL PUBLIC LICENSE ##
## Version 2, June 1991 (given in the file LICENSE) ##
## ##
###########################################################################
*/
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "bsp.h"
double *vecallocd(int n){
    /* This function allocates a vector of n doubles */
    double *pd;

    if (n==0){
pd= NULL;
} else {
pd= (double *)malloc(n*SZDBL);
if (pd==NULL)
bsp_abort("vecallocd: not enough memory");
}
return pd;
} /* end vecallocd */
int *vecalloci(int n){
    /* This function allocates a vector of n integers */
    int *pi;

    if (n==0){
pi= NULL;
} else {
pi= (int *)malloc(n*SZINT);
if (pi==NULL)
bsp_abort("vecalloci: not enough memory");
}
return pi;
} /* end vecalloci */
double **matallocd(int m, int n){
    /* This function allocates an m x n matrix of doubles */
    int i;
    double *pd, **ppd;

    if (m==0){
ppd= NULL;
} else {
ppd= (double **)malloc(m*sizeof(double *));
if (ppd==NULL)
bsp_abort("matallocd: not enough memory");
if (n==0){
for (i=0; i<m; i++)
ppd[i]= NULL;
} else {
pd= (double *)malloc(m*n*SZDBL);
if (pd==NULL)
bsp_abort("matallocd: not enough memory");
ppd[0]=pd;
for (i=1; i<m; i++)
ppd[i]= ppd[i-1]+n;
}
}
return ppd;
} /* end matallocd */
void vecfreed(double *pd){
    /* This function frees a vector of doubles */

    if (pd!=NULL)
free(pd);
} /* end vecfreed */
void vecfreei(int *pi){
    /* This function frees a vector of integers */

    if (pi!=NULL)
free(pi);
} /* end vecfreei */
void matfreed(double **ppd){
    /* This function frees a matrix of doubles */

    if (ppd!=NULL){
if (ppd[0]!=NULL)
free(ppd[0]);
free(ppd);
}
} /* end matfreed */
APPENDIX B
A QUICK REFERENCE GUIDE TO BSPLIB
Table B.1 groups the primitives of BSPlib into three classes: Single Pro-
gram Multiple Data (SPMD) for creating the overall parallel structure; Direct
Remote Memory Access (DRMA) for communication with puts or gets; and
Bulk Synchronous Message Passing (BSMP) for communication with sends.
Functions bsp nprocs, bsp pid, and bsp hpmove return an int; bsp time
returns a double; and all others return void. A parameter with an asterisk
is a pointer; a parameter with two asterisks is a pointer to a pointer. The
parameter spmd is a parameterless function returning void. The parameter
error message is a string. The remaining parameters are ints.
Table B.1. The 20 primitives of BSPlib [105]
Assuming you have read the chapters of this book and hence have
learned how to design parallel algorithms and write parallel programs
using BSPlib, this appendix quickly teaches you how to write well-
structured parallel programs using the communication library MPI. For
this purpose, the package MPIedupack is presented, which consists of
the five programs from BSPedupack, but uses MPI instead of BSPlib,
where the aim is to provide a suitable starter subset of MPI. Experi-
mental results are given that compare the performance of programs from
BSPedupack with their counterparts from MPIedupack. This appendix
concludes by discussing the various ways bulk synchronous parallel style
can be applied in practice in an MPI environment. After having read this
appendix, you will be able to use both BSPlib and MPI to write well-
structured parallel programs, and, if you decide to use MPI, to make
the right choices when choosing MPI primitives from the multitude of
possibilities.
with a fixed startup cost tstartup and additional cost tword per data word.
Cost analysis in this model requires a detailed study of the order in which the
messages are sent, their lengths, and the computations that are interleaved
between the communications. In its most general form this can be expressed by
a directed acyclic graph with chunks of computation as vertices and messages
as directed edges.
In MPI, the archetypical primitives for the message-passing style, based
on the message-passing model, are MPI Send and MPI Recv. An example of
their use is
if (s==2)
MPI_Send(x, 5, MPI_DOUBLE, 3, 0, MPI_COMM_WORLD);
if (s==3)
MPI_Recv(y, 5, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, &status);
which sends five doubles from P (2) to P (3), reading them from an array x on
P (2) and writing them into an array y on P (3). Here, the integer ‘0’ is a tag
that can be used to distinguish between different messages transferred from
the same source processor to the same destination processor. Furthermore,
MPI COMM WORLD is the communicator consisting of all the processors. A com-
municator is a subset of processors forming a communication environment
with its own processor numbering. Despite the fundamental importance of
the MPI Send/MPI Recv pair in MPI, it is best to avoid its use if possible, as
extensive use of such pairs may lead to unstructured programs that are hard
to read, prove correct, or debug. Similar to the goto-statement, which was
considered harmful in sequential programming by Dijkstra [56], the explicit
send/receive pair can be considered harmful in parallel programming. In the
parallel case, the danger of deadlock always exists; deadlock may occur for
instance if P (0) wants to send a message to P (1), and P (1) to P (0), and
both processors want to send before they receive. In our approach to using
MPI, we advocate using the collective and one-sided communications of MPI
where possible, and to limit the use of the send/receive pair to exceptional
situations. (Note that the goto statement still exists in C, for good reasons,
but it is hardly used any more.)
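If an explicit exchange between two processors really is needed, the deadlock described above can be avoided by the combined primitive MPI_Sendrecv, which lets the MPI system schedule the send and the receive together. A sketch for two processors exchanging five doubles between arrays x and y (our variable names, assuming p = 2 and local rank s):

MPI_Status status;
int other= 1-s;   /* the partner processor */

MPI_Sendrecv(x, 5, MPI_DOUBLE, other, 0,
             y, 5, MPI_DOUBLE, other, 0,
             MPI_COMM_WORLD, &status);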
A discussion of all the MPI primitives is beyond the scope of this book,
as there are almost 300 primitives (I counted 116 nondeprecated MPI-1
primitives, and 167 MPI-2 primitives). With BSPlib, we could strive for com-
pleteness, whereas with MPI, this would require a complete book by itself. We
focus on the most important primitives to provide a quick entry into the MPI
world. For the definitive reference, see the original MPI-1 standard [137],
the most recent version of MPI-1 (currently, version 1.2) available from
http://www.mpi-forum.org, and the MPI-2 standard [138]. A more access-
ible reference is the annotated standard [83,164]. For tutorial introductions,
see [84,85,152].
MPI COMM WORLD; it consists of all the processors. The corresponding number
of processors p can be obtained by calling MPI Comm size and the local pro-
cessor identity s by calling MPI Comm rank. Globally synchronizing all the
processors, the equivalent of a bsp sync, can be done by calling MPI Barrier
for the communicator MPI COMM WORLD. Here, this is done before using the
wall-clock timer MPI Wtime. As in BSPlib, the program can be aborted by one
processor if it encounters an error. In that case an error number is returned.
Unlike bspip, the program mpiip does not ask for the number of processors to
be used; it simply assumes that all available processors are used. (In BSPlib,
it is easy to use less than the maximum number of processors; in MPI, this
is slightly more complicated and involves creating a new communicator of
smaller size.)
Collective communication requires the participation of all the processors
of a communicator. An example of a collective communication is the broad-
cast by MPI Bcast in the main function of one integer, n, from the root
P (0) to all other processors. Another example is the reduction operation by
MPI Allreduce in the function mpiip, which sums the double-precision local
inner products inprod, leaving the result alpha on all processors. (It is also
possible to perform such an operation on an array, by changing the para-
meter 1 to the array size, or to perform other operations, such as taking the
maximum, by changing MPI SUM to MPI MAX.)
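As a small illustration of this remark (ours, not part of MPIedupack), the following function reduces an array of three local maxima to the global maxima, leaving the result on all processors:

#include "mpiedupack.h"

void global_max3(double *locmax, double *globmax){
    /* element-wise maximum over all processors of MPI_COMM_WORLD */
    MPI_Allreduce(locmax, globmax, 3, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
}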
Note that the resulting program mpiip is shorter than the BSPlib equi-
valent. Using collective-communication functions built on top of the BSPlib
primitives would reduce the program size for the BSP case in the same way.
(Such functions are available, but they can also easily be written by the
programmer herself, and tailored to the specific situation.)
Now, try to compile the program by the UNIX command
cc -o ip mpiinprod.c mpiedupack.c -lmpi -lm
and run the resulting executable program ip on four processors by the
command
mpirun -np 4 ip
and see what happens. An alternative run command, with prescribed and
hence portable definition of its options is
mpiexec -n 4 ip
The program text is:
#include "mpiedupack.h"
/* This program computes the sum of the first n squares, for n>=0,
sum = 1*1 + 2*2 + ... + n*n
by computing the inner product of x=(1,2,...,n)^T and itself.
The output should equal n*(n+1)*(2n+1)/6.
The distribution of x is cyclic.
*/
double mpiip(int p, int s, int n, double *x, double *y){
    /* Compute the inner product of vectors x and y of length n>=0 */
    int i;
    double inprod, alpha;

    inprod= 0.0;
for (i=0; i<nloc(p,s,n); i++){
inprod += x[i]*y[i];
}
MPI_Allreduce(&inprod,&alpha,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
return alpha;
} /* end mpiip */
/* sequential part */
/* SPMD part */
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&p);
    MPI_Comm_rank(MPI_COMM_WORLD,&s);
if (s==0){
printf("Please enter n:\n"); fflush(stdout);
scanf("%d",&n);
if(n<0)
MPI_Abort(MPI_COMM_WORLD,-1);
}
MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);
nl= nloc(p,s,n);
x= vecallocd(nl);
for (i=0; i<nl; i++){
iglob= i*p+s;
x[i]= iglob+1;
}
MPI_Barrier(MPI_COMM_WORLD);
time0=MPI_Wtime();
alpha= mpiip(p,s,n,x,x);
MPI_Barrier(MPI_COMM_WORLD);
time1=MPI_Wtime();
vecfreed(x);
MPI_Finalize();
/* sequential part */
exit(0);
} /* end main */
Time= vecallocd(p);
alpha= 1.0/3.0;
beta= 4.0/9.0;
for (i=0; i<n; i++)
z[i]= y[i]= x[i]= (double)i;
/* Measure time of 2*NITERS DAXPY operations of length n */
time0= MPI_Wtime();
if (p==1){
Nsend[0]= Nrecv[0]= h;
} else {
for (s1=0; s1<p; s1++)
Nsend[s1]= h/(p-1);
for (i=0; i < h%(p-1); i++)
Nsend[(s+1+i)%p]++;
Nsend[s]= 0; /* no communication with yourself */
for (s1=0; s1<p; s1++)
Nrecv[s1]= h/(p-1);
for (i=0; i < h%(p-1); i++)
Nrecv[(s-1-i+p)%p]++;
Nrecv[s]= 0;
}
Offset_send[0]= Offset_recv[0]= 0;
for(s1=1; s1<p; s1++){
Offset_send[s1]= Offset_send[s1-1] + Nsend[s1-1];
Offset_recv[s1]= Offset_recv[s1-1] + Nrecv[s1-1];
}
if (s==0){
printf("size of double = %d bytes\n",(int)SZDBL);
leastsquares(0,p,t,&g0,&l0);
printf("Range h=0 to p : g= %.1lf, l= %.1lf\n",g0,l0);
leastsquares(p,MAXH,t,&g,&l);
printf("Range h=p to HMAX: g= %.1lf, l= %.1lf\n",g,l);
vecfreei(Offset_recv);
vecfreei(Offset_send);
vecfreei(Nrecv);
vecfreei(Nsend);
vecfreed(Time);
MPI_Finalize();
exit(0);
} /* end main */
void mpilu(int M, int N, int s, int t, int n, int *pi, double **a){
/* Compute LU decomposition of n by n matrix A with partial pivoting.
Processors are numbered in two-dimensional fashion.
Program text for P(s,t) = processor s+t*M,
with 0 <= s < M and 0 <= t < N.
A is distributed according to the M by N cyclic distribution.
*/
uk= vecallocd(nlc);
lk= vecallocd(nlr);
if (k%N==t){ /* k=kc*N+t */
/* Search for local absolute maximum in column k of A */
absmax= 0.0; imax= -1;
for (i=kr; i<nlr; i++){
if (fabs(a[i][kc])>absmax){
absmax= fabs(a[i][kc]);
imax= i;
}
}
MPI_Abort(MPI_COMM_WORLD,-6);
}
}
if (k%M==s){
/* Store new row k in uk */
for(j=kc1; j<nlc; j++)
uk[j-kc1]= a[kr][j];
}
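/* Broadcast the multipliers of column k, stored in lk, along the
   processor rows; the root k%N is the processor column that owns
   column k of A. */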
MPI_Bcast(lk,nlr-kr1,MPI_DOUBLE,k%N,row_comm_s);
} /* end mpilu */
Here, the local processor writes from x into tmp, and then all processors write
back from tmp into x. It can happen that a particularly quick remote processor
already starts writing into the space of x while the local processor is still
reading from it. To prevent this, we can either use an extra temporary array,
or insert a global synchronization to make sure that all local writes into tmp
have finished before MPI_Alltoallv starts. We choose the latter option, and it
feels good: when in doubt, insert a barrier. The MPI standard also states that
a collective communication may, but need not, synchronize the participating
processors. Thus a processor cannot send a value before the collective
communication, hoping that another processor receives it after the collective
communication. For correctness, we have to think in terms of barriers, even if
they are not there in the actual implementation.
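Stripped of the details of mpiredistr, the resulting pattern is sketched below,
using the same array names as in the function that follows; the packing step is
only indicated by a comment.

/* Sketch of the safe pattern: pack locally, synchronize globally,
   then perform the collective all-to-all. */

/* ... copy the outgoing data from x into tmp, and fill Nsend,
       Offset_send, Nrecv, Offset_recv ... */

MPI_Barrier(MPI_COMM_WORLD);  /* all local writes into tmp have finished */
MPI_Alltoallv(tmp, Nsend, Offset_send, MPI_DOUBLE,
              x,   Nrecv, Offset_recv, MPI_DOUBLE, MPI_COMM_WORLD);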
#include "mpiedupack.h"
void mpiredistr(double *x, int n, int p, int s, int c0, int c1,
char rev, int *rho_p){
double *tmp;
int np, j0, j2, j, jglob, ratio, size, npackets, t, offset, r,
destproc, srcproc,
*Nsend, *Nrecv, *Offset_send, *Offset_recv;
np= n/p;
ratio= c1/c0;
size= MAX(np/ratio,1);
npackets= np/size;
tmp= vecallocd(2*np);
Nsend= vecalloci(p);
Nrecv= vecalloci(p);
Offset_send= vecalloci(p);
Offset_recv= vecalloci(p);
offset= 0;
j0= s%c1; /* indices for after the redistribution */
j2= s/c1;
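/* For each of the npackets packets of 2*size doubles to be received,
   determine its source processor and the offset in the receive buffer
   at which MPI_Alltoallv will deposit it. */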
for(r=0; r<npackets; r++){
j= r*size;
jglob= j2*c1*np + j*c1 + j0;
srcproc= (jglob/(c0*np))*c0 + jglob%c0;
if (rev)
srcproc= rho_p[srcproc];
Nrecv[srcproc]= 2*size;
Offset_recv[srcproc]= offset;
offset += 2*size;
}
MPI_Alltoallv(tmp,Nsend,Offset_send,MPI_DOUBLE,
x, Nrecv,Offset_recv,MPI_DOUBLE,MPI_COMM_WORLD);
vecfreei(Offset_recv);
vecfreei(Offset_send);
vecfreei(Nrecv);
vecfreei(Nsend);
vecfreed(tmp);
} /* end mpiredistr */
MPI_Put(src, src_n, src_type, pid, dst_offset, dst_n, dst_type, dst_win);
In BSPlib, data sizes and offsets are measured in bytes, whereas in MPI they
are measured in units of the basic data type, src_type for the source array
and dst_type for the destination array. In most cases these two types will be
identical (e.g. both could be MPI_DOUBLE), and the source and destination sizes
will thus be equal. The destination memory area in the MPI-2 case is not simply
given by a pointer to memory space such as an array, but by a pointer to a
window object, which will be explained below.
The syntax of the unbuffered get primitives in BSPlib and MPI is
bsp_hpget(pid, src, src_offsetbytes, dst, nbytes);
MPI_Get(dst, dst_n, dst_type, pid, src_offset, src_n, src_type, src_win);
Note the different order of the arguments, but also the great similarity between
the puts and gets of BSPlib and those of MPI-2. In the fanout of mpimv, shown
below, one double is obtained by an MPI_Get operation from the remote processor
srcprocv[j], at an offset of srcindv[j] doubles from the start of the window
v_win; the value is stored locally as vloc[j]. It is instructive to compare the
statement with the corresponding one in bspmv.
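Reconstructed from this description, the fanout statement has essentially the
following form:

MPI_Get(&vloc[j], 1, MPI_DOUBLE, srcprocv[j], srcindv[j],
        1, MPI_DOUBLE, v_win);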
A window is a preregistered and distributed memory area, consisting of
local memory on every processor of a communicator. A window is created by
MPI_Win_create, which is the equivalent of BSPlib's bsp_push_reg. We can
consider this as the registration of the memory needed before puts or gets can
be executed.
In the first call of MPI_Win_create in the function mpimv, a window of
size nv doubles is created, and the size of a double is chosen as the
basic unit for expressing offsets of subsequent puts and gets into the window.
All processors of the communicator MPI_COMM_WORLD participate in creating the
window. The MPI_INFO_NULL parameter always works, but it can be replaced by
other parameters to give hints to the implementation for optimization. For
further details, see the MPI-2 standard. The syntax of the registration and
deregistration primitives in BSPlib and MPI is
bsp_push_reg(variable, nbytes);
MPI_Win_create(variable, nbytes, unit, info, comm, win);
bsp_pop_reg(variable);
MPI_Win_free(win);
Here, win is the window of type MPI_Win corresponding to the array variable;
the integer unit is the unit for expressing offsets; and comm of type MPI_Comm
is the communicator of the window.
A window can be used after a call to MPI_Win_fence, which can be thought
of as a synchronization of the processors that own the window. The first
parameter of MPI_Win_fence is again for transferring optimization hints, and
can best be set to zero at the early learning stage; this is guaranteed to work.
The communications initiated before a fence are guaranteed to have been
completed after the fence. Thus the fence acts as a synchronization at the end
of a superstep. A window is destroyed by a call to MPI_Win_free, which is the
equivalent of BSPlib's bsp_pop_reg.
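Putting registration, fences, and puts together, a single put-superstep on a
window has the shape sketched below. This is a minimal sketch with illustrative
names (x, n, dest, and offset are not taken from MPIedupack); it assumes an
initialized MPI program in which s is the processor identity.

MPI_Win x_win;

/* Register the local array x of n doubles as part of a window;
   offsets into the window are expressed in units of one double. */
MPI_Win_create(x, n*sizeof(double), sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &x_win);  /* cf. bsp_push_reg */

MPI_Win_fence(0, x_win);              /* start of the superstep */
if (s==0)
    MPI_Put(&x[0], 1, MPI_DOUBLE, dest, offset,
            1, MPI_DOUBLE, x_win);    /* put one double into x on processor dest */
MPI_Win_fence(0, x_win);              /* end of the superstep: put completed */

MPI_Win_free(&x_win);                 /* cf. bsp_pop_reg */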
The MPI_Put primitive is illustrated by the function mpimv_init, which
is a straightforward translation of bspmv_init. Four windows are created,
one for each array, for example, tmpprocv_win representing the integer array
tmpprocv. (It would have been possible to use one window instead, by using
a four times larger array accessed with proper offsets, and thus saving some
fences at each superstep. This may be more efficient, but it is perhaps also a
bit clumsy and unnatural.)
void mpimv(int p, int s, int n, int nz, int nrows, int ncols,
double *a, int *inc,
int *srcprocv, int *srcindv, int *destprocu, int *destindu,
int nv, int nu, double *v, double *u){
int i, j, *pinc;
double sum, *psum, *pa, *vloc, *pvloc, *pvloc_end;
pvloc += *pinc;
for(i=0; i<nrows; i++){
*psum= 0.0;
while (pvloc<pvloc_end){
*psum += (*pa) * (*pvloc);
pa++;
pinc++;
pvloc += *pinc;
}
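/* Add the local partial sum *psum into element destindu[i] of the
   window u_win on processor destprocu[i]; MPI_Accumulate with MPI_SUM
   plays the role of the fanin in bspmv. */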
MPI_Accumulate(psum,1,MPI_DOUBLE,destprocu[i],destindu[i],
1,MPI_DOUBLE,MPI_SUM,u_win);
pvloc -= ncols;
}
MPI_Win_fence(0, u_win);
MPI_Win_free(&u_win);
MPI_Win_free(&v_win);
vecfreed(vloc);
} /* end mpimv */
*/
vecfreei(tmpindu); vecfreei(tmpprocu);
vecfreei(tmpindv); vecfreei(tmpprocv);
} /* end mpimv_init */
Table C.1. Time Tp(n) (in ms) of parallel programs from BSPedupack
and MPIedupack on p processors of a Silicon Graphics Origin 3800
One optimization is preventing a processor from sending data to itself. This
yields large savings for both versions and eliminates the parallel overhead
for p = 1. To enable a
fair comparison, the buffered get operation in the fanout of the BSPlib ver-
sion has been replaced by an unbuffered get; the fanin by bulk synchronous
message passing remains buffered. The MPI version is completely unbuffered,
as it is based on the one-sided MPI-2 primitives, which may partly explain its
superior performance. The matrix–vector multiplication has not been optim-
ized to obtain optimistic g-values, in contrast to the LU decomposition and
the FFT. The test problem is too small to expect any speedup, as discussed
in Section 4.10. The results of both versions can be improved considerably by
further optimization.
Overall, the results show that the performance of BSPlib and MPI is
comparable, but with a clear advantage for MPI. This may be explained by the
fact that the MPI version used is a recent, vendor-supplied implementation,
which has clearly been optimized very well. On the other hand, the BSPlib
implementation (version 1.4, from 1998) is older and was actually optimized
for the SGI Origin 2000, a predecessor of the Origin 3800. No adjustment was
needed when installing the software, but no fine-tuning was done either.
Other experiments comparing BSPlib and MPI have been performed on
different machines. For instance, Horvitz and Bisseling [110] studied a BSPlib
version of the LU decomposition from ScaLAPACK on the Cray T3E.
The strength of MPI is its wide availability and broad functionality. You
can do almost anything in MPI, except cooking dinner. The weakness of MPI is
its sheer size: the full standard [137,138] needs 550 pages, which is much more
than the 34 pages of the BSPlib standard [105]. This often leads developers of
system software to implement only a subset of the MPI primitives, which
harms portability. It also forces users to learn only a subset of the primit-
ives, which makes it more difficult to read programs written by others, since
different programmers will most likely choose a different subset. Every imple-
mented MPI primitive is likely to be optimized independently, with a varying
rate of success. This makes it impossible to develop a uniform cost model that
realistically reflects the performance of every primitive. In contrast, the small
size of BSPlib and the underlying cost model provide a better focus to the
implementer and make theoretical cost analysis and cost predictions feasible.
A fundamental difference between MPI and BSPlib is that MPI provides
more opportunities for optimization by the user, by allowing many different
ways to tackle a given programming task, whereas BSPlib provides more
opportunities for optimization by the system. For an experienced user, MPI
may achieve better results than BSPlib, but for an inexperienced user the
reverse may hold.
We have seen that MPI software can be used for programming in BSP
style, even though it was not specifically designed for this purpose. Using
collective communication wherever possible leads to supersteps and global
synchronizations. Puts and gets are available in MPI-2 and can be used in the
same way as BSPlib high-performance puts and gets. Still, in using MPI one
would miss the imposed discipline provided by BSPlib. A small, paternalistic
library such as BSPlib steers programming efforts in the right direction, unlike
a large library such as MPI, which allows many different styles of programming
and is more tolerant of deviations from the right path.
In this appendix, we have viewed MPI from a BSP perspective, which may
be a fresh view for those readers who are already familiar with MPI. We can
consider the BSP model as the theoretical cost model behind the one-sided
communications of MPI-2. Even though the full MPI-2 standard is not yet
available on all parallel machines, its extensions are useful and well suited to the
BSP style, giving us another way of writing well-structured parallel programs.
REFERENCES
[1] Agarwal, R. C., Balle, S. M., Gustavson, F. G., Joshi, M., and Palkar, P.
(1995). A three-dimensional approach to parallel matrix multiplication.
IBM Journal of Research and Development, 39, 575–82.
[2] Agarwal, R. C. and Cooley, J. W. (1987). Vectorized mixed radix
discrete Fourier transform algorithms. Proceedings of the IEEE, 75,
1283–92.
[3] Aggarwal, A., Chandra, A. K., and Snir, M. (1990). Communication
complexity of PRAMs. Theoretical Computer Science, 71, 3–28.
[4] Alpatov, P., Baker, G., Edwards, C., Gunnels, J., Morrow, G.,
Overfelt, J., van de Geijn, R., and Wu, Y.-J. J. (1997). PLAPACK:
Parallel linear algebra package. In Proceedings Eighth SIAM Conference
on Parallel Processing for Scientific Computing. SIAM, Philadelphia.
[5] Alpert, R. D. and Philbin, J. F. (1997, February). cBSP: Zero-cost
synchronization in a modified BSP model. Technical Report 97-054,
NEC Research Institute, Princeton, NJ.
[6] Anderson, E., Bai, Z., Bischof, C., Blackford, L. S., Demmel, J.,
Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S.,
McKenney, A., and Sorensen, D. (1999). LAPACK Users’ Guide (3rd
edn). SIAM, Philadelphia.
[7] Ashcraft, C. C. (1990, October). The distributed solution of linear
systems using the torus wrap data mapping. Technical Report ECA-
TR-147, Boeing Computer Services, Seattle, WA.
[8] Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., and van der Vorst, H. (ed.)
(2000). Templates for the Solution of Algebraic Eigenvalue Problems:
A Practical Guide. SIAM, Philadelphia.
[9] Barnett, M., Gupta, S., Payne, D. G., Shuler, L., van de Geijn, R., and
Watts, J. (1994). Building a high-performance collective communication
library. In Proceedings Supercomputing 1994, pp. 107–116. IEEE Press,
Los Alamitos, CA.
[10] Barnett, M., Payne, D. G., van de Geijn, R. A., and Watts, J. (1996).
Broadcasting on meshes with wormhole routing. Journal of Parallel
and Distributed Computing, 35, 111–22.
[11] Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J.,
Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and van der Vorst, H.
(1994). Templates for the Solution of Linear Systems: Building Blocks
for Iterative Methods. SIAM, Philadelphia.
[12] Barriuso, R. and Knies, A. (1994, May). SHMEM user’s guide
revision 2.0. Technical report, Cray Research Inc., Mendota
Heights, MN.
[25] Blackford, L. S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J.,
Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A.,
Stanley, K., Walker, D., and Whaley, R. C. (1997b). ScaLAPACK
Users’ Guide. SIAM, Philadelphia.
[26] Boisvert, R. F., Pozo, R., Remington, K., Barrett, R. F., and
Dongarra, J. J. (1997). Matrix Market: A web resource for test matrix
collections. In The Quality of Numerical Software: Assessment and
Enhancement (ed. R. F. Boisvert), pp. 125–37. Chapman and Hall,
London.
[27] Bongiovanni, G., Corsini, P., and Frosini, G. (1976). One-dimensional
and two-dimensional generalized discrete Fourier transforms. IEEE
Transactions on Acoustics, Speech, and Signal Processing, ASSP-24,
97–9.
[28] Bonorden, O., Dynia, M., Gehweiler, J., and Wanka, R. (2003,
July). PUB-library, release 8.1-pre, user guide and function reference.
Technical report, Heinz Nixdorf Institute, Department of Computer
Science, Paderborn University, Paderborn, Germany.
[29] Bonorden, O., Hüppelshäuser, N., Juurlink, B., and Rieping, I. (2000,
June). The Paderborn University BSP (PUB) library on the Cray
T3E. Project report, Heinz Nixdorf Institute, Department of Computer
Science, Paderborn University, Paderborn, Germany.
[30] Bonorden, O., Juurlink, B., von Otte, I., and Rieping, I. (2003). The
Paderborn University BSP (PUB) library. Parallel Computing, 29,
187–207.
[31] Bracewell, R. N. (1999). The Fourier Transform and its Applications
(3rd edn). McGraw-Hill Series in Electrical Engineering. McGraw-Hill,
New York.
[32] Brent, R. P. (1975). Multiple-precision zero-finding methods and the
complexity of elementary function evaluation. In Analytic Compu-
tational Complexity (ed. J. F. Traub), pp. 151–76. Academic Press,
New York.
[33] Briggs, W. L. and Henson, V. E. (1995). The DFT: An Owner’s
Manual for the Discrete Fourier Transform. SIAM, Philadelphia.
[34] Bui, T. N. and Jones, C. (1993). A heuristic for reducing fill-in in sparse
matrix factorization. In Proceedings Sixth SIAM Conference on Parallel
Processing for Scientific Computing, pp. 445–52. SIAM, Philadelphia.
[35] Caldwell, A. E., Kahng, A. B., and Markov, I. L. (2000). Improved
algorithms for hypergraph bipartitioning. In Proceedings Asia and
South Pacific Design Automation Conference, pp. 661–6. ACM Press,
New York.
[36] Çatalyürek, Ü. V. and Aykanat, C. (1996). Decomposing irregularly
sparse matrices for parallel matrix–vector multiplication. In Proceed-
ings Third International Workshop on Solving Irregularly Structured
Problems in Parallel (Irregular 1996) (ed. A. Ferreira, J. Rolim,
Y. Saad, and T. Yang), Volume 1117 of Lecture Notes in Computer
Science, pp. 75–86. Springer, Berlin.
[49] Culler, D. E., Karp, R. M., Patterson, D., Sahay, A., Santos, E. E.,
Schauser, K. E., Subramonian, R., and von Eicken, T. (1996).
LogP: A practical model of parallel computation. Communications of
the ACM, 39(11), 78–85.
[50] da Cunha, R. D. and Hopkins, T. (1995). The Parallel Iterative
Methods (PIM) package for the solution of systems of linear equations
on parallel computers. Applied Numerical Mathematics, 19, 33–50.
[51] Danielson, G. C. and Lanczos, C. (1942). Some improvements in
practical Fourier analysis and their application to X-ray scattering from
liquids. Journal of the Franklin Institute, 233, 365–80, 435–52.
[52] Daubechies, I. (1988). Orthonormal bases of compactly supported
wavelets. Communications on Pure and Applied Mathematics, 41,
909–96.
[53] Davis, T. A. (1994–2003). University of Florida sparse matrix collection.
Online collection, http://www.cise.ufl.edu/research/sparse/
matrices, Department of Computer and Information Science and
Engineering, University of Florida, Gainesville, FL.
[54] de la Torre, P. and Kruskal, C. P. (1992). Towards a single model of
efficient computation in real machines. Future Generation Computer
Systems, 8, 395–408.
[55] de la Torre, P. and Kruskal, C. P. (1996). Submachine locality in the
bulk synchronous setting. In Euro-Par’96 Parallel Processing. Vol. 2
(ed. L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert), Volume 1124
of Lecture Notes in Computer Science, pp. 352–8. Springer, Berlin.
[56] Dijkstra, E. W. (1968). Go to statement considered harmful.
Communications of the ACM, 11, 147–8.
[57] Donaldson, S. R., Hill, J. M. D., and Skillicorn, D. B. (1999). Pre-
dictable communication on unpredictable networks: implementing BSP
over TCP/IP and UDP/IP. Concurrency: Practice and Experience, 11,
687–700.
[58] Dongarra, J. J. (2003, April). Performance of various computers
using standard linear equations software. Technical Report CS-89-85,
Computer Science Department, University of Tennessee, Knoxville,
TN. Continuously being updated at
http://www.netlib.org/benchmark/performance.ps.
[59] Dongarra, J. J., Du Croz, J., Hammarling, S., and Duff, I. (1990).
A set of level 3 Basic Linear Algebra Subprograms. ACM Transactions
on Mathematical Software, 16, 1–17.
[60] Dongarra, J. J., Du Croz, J., Hammarling, S., and Hanson, R. J. (1988).
An extended set of FORTRAN Basic Linear Algebra Subprograms.
ACM Transactions on Mathematical Software, 14, 1–17.
[61] Dongarra, J. J., Duff, I. S., Sorensen, D. C., and van der Vorst, H. A.
(1998). Numerical Linear Algebra for High-Performance Computers.
Software, Environments, Tools. SIAM, Philadelphia.
[62] Dubey, A., Zubair, M., and Grosch, C. E. (1994). A general purpose
subroutine for fast Fourier transform on a distributed memory parallel
machine. Parallel Computing, 20, 1697–1710.
[63] Duff, I. S., Erisman, A. M., and Reid, J. K. (1986). Direct Methods
for Sparse Matrices. Monographs on Numerical Analysis. Oxford
University Press, Oxford.
[64] Duff, I. S., Grimes, R. G., and Lewis, J. G. (1989). Sparse matrix test
problems. ACM Transactions on Mathematical Software, 15, 1–14.
[65] Duff, I. S., Grimes, R. G., and Lewis, J. G. (1997, September).
The Rutherford–Boeing sparse matrix collection. Technical Report
TR/PA/97/36, CERFACS, Toulouse, France.
[66] Duff, I. S., Heroux, M. A., and Pozo, R. (2002). An overview of
the Sparse Basic Linear Algebra Subprograms: the new standard
from the BLAS technical forum. ACM Transactions on Mathematical
Software, 28, 239–67.
[67] Duff, I. S. and van der Vorst, H. A. (1999). Developments and trends
in the parallel solution of linear systems. Parallel Computing, 25,
1931–70.
[68] Edelman, A., McCorquodale, P., and Toledo, S. (1999). The future
fast Fourier transform. SIAM Journal on Scientific Computing, 20,
1094–1114.
[69] Fiduccia, C. M. and Mattheyses, R. M. (1982). A linear-time heuristic
for improving network partitions. In Proceedings of the 19th IEEE
Design Automation Conference, pp. 175–81. IEEE Press, Los Alamitos,
CA.
[70] Foster, I. T. and Worley, P. H. (1997). Parallel algorithms for the spec-
tral transform method. SIAM Journal on Scientific Computing, 18,
806–37.
[71] Fox, G. C., Johnson, M. A., Lyzenga, G. A., Otto, S. W., Salmon, J. K.,
and Walker, D. W. (1988). Solving Problems on Concurrent Processors:
Vol. 1, General Techniques and Regular Problems. Prentice-Hall,
Englewood Cliffs, NJ.
[72] Fraser, D. (1976). Array permutation by index-digit permutation.
Journal of the ACM, 23, 298–308.
[73] Frigo, M. and Johnson, S. G. (1998). FFTW: An adaptive software
architecture for the FFT. In Proceedings IEEE International Confer-
ence on Acoustics, Speech, and Signal Processing, Vol. 3, pp. 1381–4.
IEEE Press, Los Alamitos, CA.
[74] Gauss, C. F. (1866). Theoria interpolationis methodo nova tractata.
In Carl Friedrich Gauss Werke, Vol. 3, pp. 265–327. Königlichen
Gesellschaft der Wissenschaften, Göttingen, Germany.
[75] Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and
Sunderam, V. (1994). PVM: Parallel Virtual Machine. A Users’
Guide and Tutorial for Networked Parallel Computing. Scientific and
Engineering Computation Series. MIT Press, Cambridge, MA.
[76] Geist, G. A., Kohl, J. A., and Papadopoulos, P. M. (1996). PVM and
MPI: A comparison of features. Calculateurs Parallèles, 8(2), 137–50.
[77] Geist, G. A. and Romine, C. H. (1988). LU factorization algorithms
on distributed-memory multiprocessor architectures. SIAM Journal on
Scientific and Statistical Computing, 9, 639–49.
[78] Gerbessiotis, A. V. and Valiant, L. G. (1994). Direct bulk-synchronous
parallel algorithms. Journal of Parallel and Distributed Computing, 22,
251–67.
[79] Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd
edn). Johns Hopkins Studies in the Mathematical Sciences. Johns
Hopkins University Press, Baltimore, MD.
[80] Goudreau, M. W., Lang, K., Rao, S. B., Suel, T., and Tsantilas, T.
(1999). Portable and efficient parallel computing using the BSP model.
IEEE Transactions on Computers, 48, 670–89.
[81] Goudreau, M. W., Lang, K., Rao, S. B., and Tsantilas, T. (1995, June).
The Green BSP library. Technical Report CS-TR-95-11, Department
of Computer Science, University of Central Florida, Orlando, FL.
[82] Grama, A., Gupta, A., Karypis, G., and Kumar, V. (2003). Introduc-
tion to Parallel Computing (2nd edn). Addison-Wesley, Harlow, UK.
[83] Gropp, W., Huss-Lederman, S., Lumsdaine, A., Lusk, E., Nitzberg, B.,
Saphir, W., and Snir, M. (1998). MPI: The Complete Reference.
Vol. 2, The MPI Extensions. Scientific and Engineering Computation
Series. MIT Press, Cambridge, MA.
[84] Gropp, W., Lusk, E., and Skjellum, A. (1999a). Using MPI: Portable
Parallel Programming with the Message-Passing Interface (2nd edn).
MIT Press, Cambridge, MA.
[85] Gropp, W., Lusk, E., and Thakur, R. (1999b). Using MPI-2:
Advanced Features of the Message-Passing Interface. MIT Press,
Cambridge, MA.
[86] Gropp, W., Lusk, E., Doss, N., and Skjellum, A. (1996). A high-
performance, portable implementation of the MPI message passing
interface standard. Parallel Computing, 22, 789–828.
[87] Gupta, A. and Kumar, V. (1993). The scalability of FFT on parallel
computers. IEEE Transactions on Parallel and Distributed Systems, 4,
922–32.
[88] Gupta, S. K. S., Huang, C.-H., Sadayappan, P., and Johnson, R. W.
(1994). Implementing fast Fourier transforms on distributed-memory
multiprocessors using data redistributions. Parallel Processing Letters,
4, 477–88.
[89] Gustavson, F. G. (1972). Some basic techniques for solving sparse sys-
tems of linear equations. In Sparse Matrices and Their Applications
(ed. D. J. Rose and R. A. Willoughby), pp. 41–52. Plenum Press,
New York.
[192] Zitney, S. E., Mallya, J., Davis, T. A., and Stadtherr, M. A. (1994).
Multifrontal techniques for chemical process simulation on supercom-
puters. In Proceedings Fifth International Symposium on Process
Systems Engineering, Kyongju, Korea (ed. E. S. Yoon), pp. 25–30.
Korean Institute of Chemical Engineers, Seoul, Korea.
[193] Ziv, J. and Lempel, A. (1977). A universal algorithm for sequential
data compression. IEEE Transactions on Information Theory, IT-23,
337–43.
[194] Zlatev, Z. (1991). Computational Methods for General Sparse Matrices,
Volume 65 of Mathematics and Its Applications. Kluwer, Dordrecht,
the Netherlands.
[195] Zoldi, S., Ruban, V., Zenchuk, A., and Burtsev, S. (1999, January/
February). Parallel implementation of the split-step Fourier method
for solving nonlinear Schrödinger systems. SIAM News, 32(1), 8–9.
INDEX
accidental zero, 168–169 cost, 5–8, 44, 70, 72, 120, 148, 179,
accumulate operation, 275 181, 189, 217, 219, 233, 235,
all-to-all, 43, 126, 261, 270 238, 242, 247
argmax, 55, 56, 60, 69, 201, 266 decomposable, 39, 265
arithmetic mean, 162 model, vii–ix, xi, xiii, 2–9, 39–40, 79,
ASCII, 46 81, 82, 84
parameters, 1, 6, 8, 24, 27, 79, 96,
138, 232
back substitution see triangular system variants of, 8, 39
bandwidth of matrix, 86, 243 BSP Worldwide, viii, xvi
barrier see synchronization bsp abort, 20, 255, 258
benchmarking, ix, 6, 24 bsp begin, 14, 255
program, 1, 27–32, 43, 44, 79, 243, bsp broadcast, 72
261–265 bsp end, 15, 255
results of, 31–38, 82, 85, 138, 232 bsp get, 20, 223, 226, 255
Beowulf, 32, 43, 231–236 bsp get tag, 225, 250, 255
Bernoulli trial, 205 bsp hpget, 99, 255, 273
Bi-CGSTAB, 236 bsp hpmove, 250, 255
bipartitioning, 192, 196, 197 bsp hpput, 99, 255, 273, 275
of hypergraph, 195, 197, 242 bsp init, 15, 255, 258
bit operation, 128 bsp move, 225, 243, 250, 255, 275
bit reversal, 110–111, 116–117, 127–128, bsp nprocs, 15, 255
149, 150 bsp pid, 16, 255
bit-reversal matrix, 110 bsp pop reg, 19, 255, 274
BLAS, 34, 98 bsp push reg, 19, 126, 255, 274
sparse, 237 bsp put, 18, 99, 223, 224, 226, 243, 255,
block distribution see distribution, block 271
block size, algorithmic, 98 bsp qsize, 226, 255
blocking bsp send, 223–226, 243, 255, 275
of algorithm, 86–87 bsp set tagsize, 226, 255
of distribution, 86 bsp sync, 16, 79, 255, 259
body-centred cubic (BCC) lattice, 219 bsp time, 16, 79, 255
border correction, 213, 216 BSPedupack, xi, 92, 223, 251–253, 256,
branch-and-bound method, 247 258, 261, 265, 278, 280, 281
broadcast, 11, 57, 65, 66, 72–73, 259, BSPlib, viii, ix, xi–xiii, xvi, 1, 2, 13–14,
265–267 137, 163, 231, 241, 254, 256, 258,
one-phase, 66, 79, 80, 89, 99 259, 261, 262, 265, 266, 270,
tree-based, 88 273–275, 278–282
two-phase, 66–68, 79, 80, 83, 87–89, for Windows, xiv
99, 185 implementation of, xi, 42
BSMP, 254, 255 primitives, xi, 14–20, 223–226, 259,
BSP 273–274
algorithm, 3–4, 148, 149 programs, xi
computer, ix, 1, 3, 6, 243 compiling of, 20, 33
regular algorithm, 128, 129 superstep, vii, xiv, 3–4, 16, 261
reproducibility, 151 communication, 3
residual, 236 computation, 3
root of tree, 105, 246 program, 16, 265
roundoff error, 121 superstep piggybacking, 98, 266
RSA, 161 surface-to-volume ratio, 214, 219
Rutherford–Boeing collection, 171, 203, switch, 32
237 synchronization, 4
bulk, 4
global, vii, 271
SAXPY see DAXPY pairwise, 4
scalability see speedup subset, 9, 40, 42
scalable memory use, 121, 126, 222, 224 zero-cost, 42
ScaLAPACK, 86–87, 256, 279
scattered square decomposition see
distribution, square cyclic tag, 224–226, 243, 250
Schrödinger equation template, 236
nonlinear, 152 tensor product, 107
time-dependent, 151 torus, 33
time-independent, 152 torus-wrap mapping see distribution,
SGEMM see DGEMM square cyclic
SGI Origin 2000, xiii, 37–38, 137 total exchange, 261
SGI Origin 3800, 136–144, 278–280 transpose algorithm, 149, 150
shared memory, 37, 38, 137 transputer, xiv, 238
shuffle, perfect, 147 trapezoidal rule, 101, 153
sieve of Eratosthenes, 48, 49 tree, 105–106, 246
simulated annealing, 247–249 triangular system, 51, 55, 95–96
sine transform, 100 triple scheme, 170
fast, 146 truncated octahedron, 219, 220, 245
singular value decomposition, 236 twiddle matrix, 125, 126, 146
six-step framework, 150, 151
slowdown, 233 UFFT, 112, 124, 127
smooth function, 101, 153 UGFFT, 124
sorting, 222 uncoarsening, 195, 197
sparse matrix algorithm, 167, 169 unpacking, 128–130, 249
sparsity pattern, 163, 164, 169, 177, list, 249
181, 186, 195, 203, 204, 206–208,
210, 244
vector addition, sparse, 167–168
symmetric, 165
vector allocation, 251
spectral transform, 151
vertex, 88, 192, 194–197, 202–203, 241
speedup, 139–141, 144, 151, 152, 233,
adjacent, 197
278, 279
degree, 202–203
superlinear, 140
video compression, 158
SPMD, 11, 14–15, 17, 41, 127, 254, 255,
virtual processor, 155
258
volume see communication volume, 65
stack, 19
Voronoi cell, 217, 219
startup cost, 26, 41, 242, 257
stencil, five-point see Laplacian operator
Strassen matrix–matrix multiplication, wall-clock time, 16
90–92 wavefront, 96
stride, 11, 53, 72, 150 wavelet, 100, 158–159
strip, 213–214 Daubechies, 158
subtractive cancellation, 93 weight, 113, 120–127
supercomputer, viii, xiii, 33, 44, 235 window, 273–274