
Parallel Scientific Computation
A structured approach using BSP and MPI

ROB H. BISSELING
Utrecht University

Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide in
Oxford New York
Auckland Bangkok Buenos Aires Cape Town Chennai
Dar es Salaam Delhi Hong Kong Istanbul
Karachi Kolkata Kuala Lumpur Madrid Melbourne Mexico City Mumbai
Nairobi São Paulo Shanghai Taipei Tokyo Toronto
Oxford is a registered trade mark of Oxford University Press
in the UK and in certain other countries
Published in the United States
by Oxford University Press Inc., New York
© Oxford University Press 2004

The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First published 2004
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
without the prior permission in writing of Oxford University Press,
or as expressly permitted by law, or under terms agreed with the appropriate
reprographics rights organization. Enquiries concerning reproduction
outside the scope of the above should be sent to the Rights Department,
Oxford University Press, at the address above
You must not circulate this book in any other binding or cover
and you must impose this same condition on any acquirer
A catalogue record for this title is available from the British Library
Library of Congress Cataloging in Publication Data
(Data available)
ISBN 0 19 852939 2
10 9 8 7 6 5 4 3 2 1
Typeset by Newgen Imaging Systems (P) Ltd., Chennai, India
Printed in Great Britain
on acid-free paper by
Biddles Ltd, www.biddles.co.uk
Plate 1. Sparse matrix prime60 distributed over four processors of a parallel
computer. Cf. Chapter 4.
PREFACE

Why this book on parallel scientific computation? In the past two decades,
several shelves full of books have been written on many aspects of parallel
computation, including scientific computation aspects. My intention to add
another book asks for a motivation. To say it in a few words, the time is
ripe now. The pioneering decade of parallel computation, from 1985 to 1995,
is well behind us. In 1985, one could buy the first parallel computer from
a commercial vendor. If you were one of the few early users, you probably
found yourself working excessively hard to make the computer perform well
on your application; most likely, your work was tedious and the results were
frustrating. If you endured all this, survived, and are still interested in parallel
computation, you deserve strong sympathy and admiration!
Fortunately, the situation has changed. Today, one can theoretically
develop a parallel algorithm, analyse its performance on various architectures,
implement the algorithm, test the resulting program on a PC or cluster of PCs,
and then run the same program with predictable efficiency on a massively
parallel computer. In most cases, the parallel program is only slightly more
complicated than the corresponding sequential program and the human time
needed to develop the parallel program is not much more than the time needed
for its sequential counterpart.
This change has been brought about by improvements in parallel hardware
and software, together with major advances in the theory of parallel pro-
gramming. An important theoretical development has been the advent of the
Bulk Synchronous Parallel (BSP) programming model proposed by Valiant in
1989 [177,178], which provides a useful and elegant theoretical framework for
bridging the gap between parallel hardware and software. For this reason, I
have adopted the BSP model as the target model for my parallel algorithms.
In my experience, the simplicity of the BSP model makes it suitable for teach-
ing parallel algorithms: the model itself is easy to explain and it provides a
powerful tool for expressing and analysing algorithms.
An important goal in designing parallel algorithms is to obtain a good
algorithmic structure. One way of achieving this is by designing an algorithm
as a sequence of large steps, called supersteps in BSP language, each con-
taining many basic computation or communication operations and a global
synchronization at the end, where all processors wait for each other to finish
their work before they proceed to the next superstep. Within a superstep, the
work is done in parallel, but the global structure of the algorithm is sequen-
tial. This simple structure has proven its worth in practice in many parallel
applications, within the BSP world and beyond.
Recently, efficient implementations of the BSP model have become
available. The first implementation that could be run on many differ-
ent parallel computers, the Oxford BSP library, was developed by Miller
and Reed [140,141] in 1993. The BSP Worldwide organization was founded in 1995,
see http://www.bsp-worldwide.org, to promote collaboration
between developers and users of BSP. One of the goals of BSP Worldwide is
to provide a standard library for BSP programming. After extensive discus-
sions in the BSP community, a standard called BSPlib [105] was proposed
in May 1997 and an implementation by Hill and coworkers [103], the Oxford
BSP toolset, was made available in the public domain, see
http://www.bsp-worldwide.org/implmnts/oxtool. The programs in the main text of this
book make use of the primitives of BSPlib.
Another communication library, which can be used for writing parallel
programs and which has had a major impact on the field of parallel computing
is the Message-Passing Interface (MPI), formulated in 1994 as MPI-1, and
enhanced in 1997 by the MPI-2 extensions. The MPI standard has promoted
the portability of parallel programs and one might say that its advent has
effectively ended the era of architecture-dependent communication interfaces.
The programs in the main text of this book have also been converted to MPI
and the result is presented in Appendix C.
I wrote this book for students and researchers who are interested in sci-
entific computation. The book has a dual purpose: first, it is a textbook for a
course on parallel scientific computation at the final year undergraduate/first
year graduate level. The material of the book is suitable for a one-semester
course at a mathematics or computer science department. I tested all material
in class at Utrecht University during the period 1993–2002, in an introduct-
ory course given every year on parallel scientific computation, called ‘Parallel
Algorithms for Supercomputers 1’, see the course page
http://www.math.uu.nl/people/bisseling/pas1.html and an advanced course given biannually,
see pas2.html. These courses are taken by students from mathematics, phys-
ics, and computer science. Second, the book is a source of example parallel
algorithms and programs for computational physicists and chemists, and for
other computational scientists who are eager to get a quick start in paral-
lel computing and want to learn a structured approach to writing parallel
programs. Prerequisites are knowledge about linear algebra and sequential
programming. The program texts assume basic knowledge of the programming
language ANSI C.
The scope of this book is the area of numerical scientific computation; the
scope also includes combinatorial scientific computation needed for numerical
computations, such as in the area of sparse matrices, but it excludes symbolic
scientific computation. The book treats the area of numerical scientific compu-
tation by presenting a detailed study of several important numerical problems.
Through these problems, techniques are taught for designing and implement-
ing efficient, well-structured parallel algorithms. I selected these particular
problems because they are important for applications and because they give
rise to a variety of important parallelization techniques. This book treats
well-known subjects such as dense LU decomposition, fast Fourier transform
(FFT), and sparse matrix–vector multiplication. One can view these sub-
jects as belonging to the area of numerical linear algebra, but they are also
fundamental to many applications in scientific computation in general. This
choice of problems may not be highly original, but I made an honest attempt
to approach these problems in an original manner and to present efficient
parallel algorithms and programs for their solution in a clear, concise, and
structured way.
Since this book should serve as a textbook, it covers a limited but care-
fully chosen amount of material; I did not strive for completeness in covering
the area of numerical scientific computation. A vast amount of sequen-
tial algorithms can be found in Matrix Computations by Golub and Van
Loan [79] and Numerical Recipes in C: The Art of Scientific Computing by
Press, Teukolsky, Vetterling, and Flannery [157]. In my courses on parallel
algorithms, I have the habit of assigning sequential algorithms from these
books to my students and asking them to develop parallel versions. Often,
the students go out and perform an excellent job. Some of these assign-
ments became exercises in the present book. Many exercises have the form
of programming projects, which are suitable for use in an accompanying
computer-laboratory class. I have graded the exercises according to diffi-
culty/amount of work involved, marking an exercise by an asterisk if it requires
more work and by two asterisks if it requires a lot of work, meaning that it
would be suitable as a final assignment. Inevitably, such a grading is subject-
ive, but it may be helpful for a teacher in assigning problems to students.
The main text of the book treats a few central topics from parallel scientific
computation in depth; the exercises are meant to give the book breadth.
The structure of the book is as follows. Chapter 1 introduces the BSP
model and BSPlib, and as an example it presents a simple complete parallel
program. This two-page program alone already teaches half the primitives of
BSPlib. (The other half is taught by the program of Chapter 4.) The first
chapter is a concise and self-contained tutorial, which tells you how to get
started with writing BSP programs, and how to benchmark your computer as
a BSP computer. Chapters 2–4 present parallel algorithms for problems with
increasing irregularity. Chapter 2 on dense LU decomposition presents a reg-
ular computation with communication patterns that are common in matrix
computations. Chapter 3 on the FFT also treats a regular computation but
one with a more complex flow of data. The execution time requirements of the
LU decomposition and FFT algorithms can be analysed exactly and the per-
formance of an implementation can be predicted quite accurately. Chapter 4
presents the multiplication of a sparse matrix and a dense vector. The com-
putation involves only those matrix elements that are nonzero, so that in
general it is irregular. The communication involves the components of dense
input and output vectors. Although these vectors can be stored in a regular
data structure, the communication pattern becomes irregular because efficient
communication must exploit the sparsity. The order in which the chapters can
be read is: 1, 2, then either 3 or 4, depending on your taste. Chapter 3 has
the brains, Chapter 4 has the looks, and after you have finished both you
know what I mean. Appendix C presents MPI programs in the order of the
corresponding BSPlib programs, so that it can be read in parallel with the
main text; it can also be read afterwards. I recommend reading the appendix,
even if you do not intend to program in MPI, because it illustrates the vari-
ous possible choices that can be made for implementing communications and
because it makes the differences and similarities between BSPlib and MPI
clear.
Each chapter contains: an abstract; a brief discussion of a sequential
algorithm, included to make the material self-contained; the design and
analysis of a parallel algorithm; an annotated program text; illustrative
experimental results of an implementation on a particular parallel computer;
bibliographic notes, giving historical background and pointers for further
reading; theoretical and practical exercises.
My approach in presenting algorithms and program texts has been to give
priority to clarity, simplicity, and brevity, even if this comes at the expense of
a slight decrease in efficiency. In this book, algorithms and programs are only
optimized if this teaches an important technique, or improves efficiency by
an order of magnitude, or if this can be done without much harm to clarity.
Hints for further optimization are given in exercises. The reader should view
the programs as a starting point for achieving fast implementations.
One goal of this book is to ease the transition from theory to practice. For
this purpose, each chapter includes an example program, which presents a
possible implementation of the central algorithm in that chapter. The program
texts form a small but integral part of this book. They are meant to be read
by humans, besides being compiled and executed by computers. Studying the
program texts is the best way of understanding what parallel programming is
really about. Using and modifying the programs gives you valuable hands-on
experience.
The aim of the section on experimental results is to illustrate the the-
oretical analysis. Often, one aspect is highlighted; I made no attempt to
perform an exhaustive set of experiments. A real danger in trying to explain
experimental results for an algorithm is that a full explanation may lead to
a discussion of nitty-gritty implementation details or hardware quirks. This
is hardly illuminating for the algorithm, and therefore I have chosen to keep
such explanations to a minimum. For my experiments, I have used six dif-
ferent parallel machines, older ones as well as newer ones: parallel computers
come and go quickly.
The bibliographic notes of this book are lengthier than usual, since I have
tried to summarize the contents of the cited work and relate them to the topic
discussed in the current chapter. Often, I could not resist the temptation to
write a few sentences about a subject not fully discussed in the main text,
but still worth mentioning.
The source files of the printed program texts, together with a set
of test programs that demonstrate their use, form a package called
BSPedupack, which is available at http://www.math.uu.nl/people/bisseling/software.html.
The MPI version, called MPIedupack, is also
available from that site. The packages are copyrighted, but freely available
under the GNU General Public License, meaning that their programs can be
used and modified freely, provided the source and all modifications are men-
tioned, and every modification is again made freely available under the same
license. As the name says, the programs in BSPedupack and MPIedupack
are primarily intended for teaching. They are definitely not meant to be used
as a black box. If your program, or worse, your airplane crashes because
BSP/MPIedupack is not sufficiently robust, it is your own responsibility. Only
rudimentary error handling has been built into the programs. Other software
available from my software site is the Mondriaan package [188], which is used
extensively in Chapter 4. This is actual production software, also available
under the GNU General Public License.
To use BSPedupack, a BSPlib implementation such as the Oxford BSP
toolset [103] must have been installed on your computer. As an alternative,
you can use BSPedupack on top of the Paderborn University BSP (PUB)
library [28,30], which contains BSPlib as a subset. If you have a cluster of PCs,
connected by a Myrinet network, you may want to use the Panda BSP library,
a BSPlib implementation by Takken [173], which is soon to be released. The
programs of this book have been tested extensively for the Oxford BSP toolset.
To use the first four programs of MPIedupack, you need an implementation
of MPI-1. This often comes packaged with the parallel computer. The fifth
program needs MPI-2. Sometimes, part of the MPI-2 extensions has been
supplied by the computer vendor. A full public-domain implementation of
MPI-2 is expected to become available in the near future for many different
architectures as a product of the MPICH-2 project.
If you prefer to use a different communication library than BSPlib, you
can port BSPlib programs to other systems such as MPI, as demonstrated by
Appendix C. Porting out of BSPlib is easy, because of the limited number of
BSPlib primitives and because of the well-structured programs that are the
result of following the BSP model. It is my firm belief that if you use MPI or
another communication library, for historical or other reasons, you can benefit
tremendously from the structured approach to parallel programming taught in
this book. If you use MPI-2, and in particular its one-sided communications,
you are already close to the bulk synchronous parallel world, and this book
may provide you with a theoretical framework.
The programming language used in this book is ANSI C [121]. The reason
for this is that many students learn C as their first or second programming lan-
guage and that efficient C compilers are available for many different sequential
and parallel computers. Portability is the name of the game for BSP software.
The choice of using C together with BSPlib will make your software run on
almost every computer. Since C is a subset of C++, you can also use C++
together with BSPlib. If you prefer to use another programming language,
BSPlib is also available in Fortran 90.
Finally, let me express my hope and expectation that this book will trans-
form your barriers: your own activation barrier for parallel programming will
disappear; instead, synchronization barriers will appear in your parallel pro-
grams and you will know how to use them as an effective way of designing
well-structured parallel programs.

R.H. Bisseling
Utrecht University
July 2003
ACKNOWLEDGEMENTS

First of all, I would like to thank Bill McColl of Oxford University for
introducing the BSP model to me in 1992 and convincing me to abandon
my previous habit of developing special-purpose algorithms for mesh-based
parallel computers. Thanks to him, I turned to designing general-purpose
algorithms that can run on every parallel computer. Without Bill’s encour-
agement, I would not have written this book.
Special mention should be made of Jon Hill of Oxford University, now at
Sychron Ltd, who is co-designer and main implementor of BSPlib. The BSPlib
standard gives the programs in the main text of this book a solid foundation.
Many discussions with Jon, in particular during the course on BSP we gave
together in Jerusalem in 1997, were extremely helpful in shaping this book.
Several visits abroad gave me feedback and exposure to constructive
criticism. These visits also provided me the opportunity to test the par-
allel programs of this book on a variety of parallel architectures. I would
like to thank my hosts Michael Berman of BioMediCom Ltd in Jerusalem,
Richard Brent and Bill McColl of Oxford University, Iain Duff of CERFACS
in Toulouse, Jacko Koster and Fredrik Manne of the University of Bergen,
Satish Rao of NEC Research at Princeton, Pilar de la Torre of the Univer-
sity of New Hampshire, and Leslie Valiant of Harvard University, inventor of
the BSP model. I appreciate their hospitality. I also thank the Engineering
and Physical Sciences Research Council in the United Kingdom for funding
my stay in Oxford in 2000, which enabled me to make much progress with
this book.
I would like to thank the Oxford Supercomputer Centre for granting access
to their SGI Origin 2000 supercomputer and I am grateful to Jeremy Martin
and Bob McLatchie for their help in using this machine. I thank Sychron
Ltd in Oxford for giving me access to their PC cluster. In the Netherlands,
I would like to acknowledge grants of computer time and funding of two
postdocs by the National Computer Facilities foundation (NCF). Patrick
Aerts, director of the NCF, has tremendously stimulated the development
of the high-performance computing infrastructure in the Netherlands, and it
is thanks to him that I could use so many hours of computing time on so
many different parallel computers. I thank the supercomputer centres HPαC
in Delft and SARA in Amsterdam for access to their computers, with per-
sonal thanks to Jana Vasiljev and Willem Vermin for supporting BSPlib at
these centres. I also thank Aad van der Steen for help in accessing DAS-2, the
Dutch distributed supercomputer.
The Mathematics Department of Utrecht University provided me with a
stimulating environment for writing this book. Henk van der Vorst started
the Computational Science education programme in Utrecht in 1993, which
gave me the opportunity to develop and test the material of the book. Over
the years, more than a hundred students have graduated from my parallel
scientific computation courses. Many have contributed, occasionally by being
guinea pigs (my apologies!), but more often as partners in a genuine dialogue.
These students have helped me to improve the exposition of the material and
they have forced me to be as clear and brief as I can. I thank all of them,
with special mention of Tammo-Jan Dijkema and Maarten Löffler who were
champion proofreaders and Dik Takken who implemented a version of BSPlib
for Windows just to speed up his homework assignment and later implemented
a version for a Myrinet-based PC cluster, which is used in Chapter 4.
During the period of writing this book, I was joined in my parallel com-
puting research by MSc students, PhD students, and postdocs. Discussions
with them often yielded new insights and I enjoyed many working and off-
working hours spent together. In particular, I would like to mention here:
Márcia Alves de Inda, Ildikó Flesch, Jeroen van Grondelle, Alexander van
Heukelum, Neal Hegeman, Guy Horvitz, Frank van Lingen, Jacko Koster,
Joris Koster, Wouter Meesen, Bruce Stephens, Frank van der Stappen, Mark
Stijnman, Dik Takken, Patrick Timmers, and Brendan Vastenhouw.
Much of my pre-BSP work has contributed to this book as well. In particu-
lar, research I carried out from 1987–93 at the Koninklijke/Shell-laboratory in
Amsterdam has taught me much about parallel computing and sparse matrix
computations. I became an expert in the programming language occam, which
unfortunately could only be used on transputer-based parallel computers,
thereby representing perhaps the antithesis of portability. In the pioneering
years, however, occam was a great educational tool. Ideas from the prototype
library PARPACK, developed in those years at Shell, profoundly influenced
the present work. The importance of a structured approach was already appar-
ent then; good structure was obtained in PARPACK by writing programs with
communication-closed layers, the predecessor of the BSP superstep. I would
like to express my debt to the enlightened management by Arie Langeveld
and Theo Verheggen at Shell and to my close colleagues Daniël Loyens and
Hans van de Vorst from the parallel computing group.
Going back even further, to my years 1981–6 at the Hebrew University of
Jerusalem, my PhD supervisor Ronnie Kosloff aroused my interest in fast
Fourier transforms, which has become the subject of Chapter 3. Ronnie
seriously influenced my way of working, by injecting me with a large dose
of (quantum molecular) dynamics. In Jerusalem, Larry Rudolph introduced
me to the field of parallel computing. His enthusiasm and juggling acts left
an imprint forever.
Comments on draft chapters of this book have been given by Márcia Alves
de Inda, Richard Brent, Olivier Dulieu, Jonathan Hill, Slava Kokoouline,
Jacko Koster, Frank van Lingen, Ronald Meester, Adina Milston, John
Reid, Dan Stefanescu, Pilar de la Torre, Leslie Valiant, and Yael Weinbach.
Aesthetic advice has been given by Ron Bertels, Lidy Bisseling, Gerda
Dekker, and Gila and Joel Kantor. Thanks to all of them. Disclaimer: if
you find typing errors, small flaws, serious flaws, unintended Dutch, or
worse, do not blame them, just flame me! All comments are welcome at:
[email protected]. I thank my editors at Oxford University Press,
Elizabeth Johnston, Alison Jones, and Mahua Nandi, for accepting my vision
of this book and for their ideas, good judgement, help, and patience.
Finally, in the writing of this book, I owe much to my family. My wife
Rona showed love and sympathy, and gave support whenever needed. Our
daughter Sarai, born in 1994, has already acquired quite some mathematical
and computer skills. I tested a few exercises on her (admittedly, unmarked
ones), and am amazed how much a nine-year old can understand about parallel
computing. If she can, you can. Sarai provided me with the right amount of
distraction and the proper perspective. Furthermore, one figure in the book
is hers.
ABOUT THE AUTHOR

Rob Bisseling is an associate professor at the Mathematics Department of
Utrecht University, the Netherlands, where he has worked in the area of
scientific computation since 1993. Previously, he held positions as a senior
research mathematician (1990–3) and research mathematician (1987–90) at
the Koninklijke/Shell-Laboratorium, Amsterdam, the Netherlands, where he
investigated the application of parallel computing in oil refinery optimization
and polymer modelling. He received a BSc degree cum laude in mathematics,
physics, and astronomy in 1977 and an MSc degree cum laude in mathematics
in 1981, both from the Catholic University of Nijmegen, the Netherlands, and
a PhD degree in theoretical chemistry in 1987 from the Hebrew University of
Jerusalem, Israel.
The author has spent sabbaticals as a visiting scientist at Silicon Graph-
ics Biomedical, Jerusalem, in 1997 and at the Programming Research Group
of Oxford University, UK, in 2000. He is co-author of the BSPlib standard
(1997) and the Mondriaan package for partitioning sparse matrices (2002).
Since 2000, he maintains the website of the BSP Worldwide organization. His
research interests are numerical and combinatorial scientific computing in gen-
eral, and parallel algorithms, sparse matrix computations, and bioinformatics
in particular. His research goal is to design algorithms and develop software
tools that are useful in a wide range of scientific computing applications.
CONTENTS

1 Introduction 1
1.1 Wanted: a gentle parallel programming model 1
1.2 The BSP model 3
1.3 BSP algorithm for inner product computation 9
1.4 Starting with BSPlib: example program bspinprod 13
1.5 BSP benchmarking 24
1.6 Example program bspbench 27
1.7 Benchmark results 31
1.8 Bibliographic notes 38
1.8.1 BSP-related models of parallel computation 38
1.8.2 BSP libraries 40
1.8.3 The non-BSP world: message passing 42
1.8.4 Benchmarking 43
1.9 Exercises 44
2 LU decomposition 50
2.1 The problem 50
2.2 Sequential LU decomposition 51
2.3 Basic parallel algorithm 57
2.4 Two-phase broadcasting and other improvements 64
2.5 Example function bsplu 72
2.6 Experimental results on a Cray T3E 79
2.7 Bibliographic notes 85
2.7.1 Matrix distributions 85
2.7.2 Collective communication 87
2.7.3 Parallel matrix computations 87
2.8 Exercises 88
3 The fast Fourier transform 100
3.1 The problem 100
3.2 Sequential recursive fast Fourier transform 103
3.3 Sequential nonrecursive algorithm 105
3.4 Parallel algorithm 113
3.5 Weight reduction 120
3.6 Example function bspfft 127
3.7 Experimental results on an SGI Origin 3800 136
3.8 Bibliographic notes 145
3.8.1 Sequential FFT algorithms 145
3.8.2 Parallel FFT algorithms with log2 p or more supersteps 147
3.8.3 Parallel FFT algorithms with O(1) supersteps 148
3.8.4 Applications 151
3.9 Exercises 152
4 Sparse matrix–vector multiplication 163
4.1 The problem 163
4.2 Sparse matrices and their data structures 167
4.3 Parallel algorithm 173
4.4 Cartesian distribution 179
4.5 Mondriaan distribution for general sparse matrices 186
4.6 Vector distribution 197
4.7 Random sparse matrices 203
4.8 Laplacian matrices 210
4.9 Remainder of BSPlib: example function bspmv 222
4.10 Experimental results on a Beowulf cluster 231
4.11 Bibliographic notes 235
4.11.1 Sparse matrix computations 235
4.11.2 Parallel sparse matrix–vector multiplication
algorithms 237
4.11.3 Parallel iterative solvers for linear systems 239
4.11.4 Partitioning methods 240
4.12 Exercises 243

A Auxiliary BSPedupack functions 251


A.1 Header file bspedupack.h 251
A.2 Utility file bspedupack.c 251
B A quick reference guide to BSPlib 254

C Programming in BSP style using MPI 256


C.1 The message-passing interface 256
C.2 Converting BSPedupack to MPIedupack 258
C.2.1 Program mpiinprod 258
C.2.2 Program mpibench 261
C.2.3 Function mpilu 265
C.2.4 Function mpifft 270
C.2.5 Function mpimv 273
C.3 Performance comparison on an SGI Origin 3800 278
C.4 Where BSP meets MPI 280
References 283

Index 299
1
INTRODUCTION

This chapter is a self-contained tutorial which tells you how to get
started with parallel programming and how to design and implement
algorithms in a structured way. The chapter introduces a simple tar-
get architecture for designing parallel algorithms, the bulk synchronous
parallel computer. Using the computation of the inner product of two
vectors as an example, the chapter shows how an algorithm is designed
hand in hand with its cost analysis. The algorithm is implemented
in a short program that demonstrates the most important primitives
of BSPlib, the main communication library used in this book. If you
understand this program well, you can start writing your own parallel
programs. Another program included in this chapter is a benchmark-
ing program that allows you to measure the BSP parameters of your
parallel computer. Substituting these parameters into a theoretical cost
formula for an algorithm gives a prediction of the actual running time
of an implementation.

1.1 Wanted: a gentle parallel programming model


Parallel programming is easy if you use the right programming model. All you
have to do is find it! Today’s developers of scientific application software must
pay attention to the use of computer time and memory, and also to the accur-
acy of their results. Since this is already quite a burden, many developers
view parallelism, the use of more than one processor to solve a problem,
as just an extra complication that is better avoided. Nevertheless, parallel
algorithms are being developed and used by many computational scientists in
fields such as astronomy, biology, chemistry, and physics. In industry, engin-
eers are trying to accelerate their simulations by using parallel computers.
The main motivation of all these brave people is the tremendous comput-
ing power promised by parallel computers. Since the advent of the World
Wide Web, this power lies only a tempting few mouse-clicks away from every
computer desk. The Grid is envisioned to deliver this power to that desk,
providing computational services in a way that resembles the workings of an
electricity grid.
The potential of parallel computers could be realized if the practice of
parallel programming were just as easy and natural as that of sequential
programming. Unfortunately, until recently this has not been the case, and
parallel computing used to be a very specialized area where exotic parallel
algorithms were developed for even more exotic parallel architectures, where
software could not be reused and many man years of effort were wasted in
developing software of limited applicability. Automatic parallelization by com-
pilers could be a solution for this problem, but this has not been achieved yet,
nor is it likely to be achieved in the near future. Our only hope of harnessing
the power of parallel computing lies in actively engaging ourselves in parallel
programming. Therefore, we might as well try to make parallel programming
easy and effective, turning it into a natural activity for everyone who writes
computer programs.
An important step forward in making parallel programming easier has been
the development of portability layers, that is, communication software such
as PVM [171] and MPI [137] that enable us to run the same parallel program
on many different parallel computers without changing a single line of program
text. Still, the resulting execution time behaviour of the program on a new
machine is unpredictable (and can indeed be rather erratic), due to the lack
of an underlying parallel programming model.
To achieve the noble goal of easy parallel programming we need a
model that is simple, efficiently implementable, and acceptable to all parties
involved: hardware designers, software developers, and end users. This
model should not interfere with the process of designing and implementing
algorithms. It should exist mainly in the background, being tacitly under-
stood by everybody. Such a model would encourage the use of parallel
computers in the same way as the Von Neumann model did for the sequential
computer.
The bulk synchronous parallel (BSP) model proposed by Valiant in
1989 [177,178] satisfies all requirements of a useful parallel programming
model: the BSP model is simple enough to allow easy development and ana-
lysis of algorithms, but on the other hand it is realistic enough to allow
reasonably accurate modelling of real-life parallel computing; a portability
layer has been defined for the model in the form of BSPlib [105] and this
standard has been implemented efficiently in at least two libraries, namely the
Oxford BSP toolset [103] and the Paderborn University BSP library [28,30],
each running on many different parallel computers; another portability layer
suitable for BSP programming is the one-sided communications part of
MPI-2 [138], implementations of which are now appearing; in principle, the
BSP model could be used in taking design decisions when building new hard-
ware (in practice though, designers face other considerations); the BSP model
is actually being used as the framework for algorithm design and implementa-
tion on a range of parallel computers with completely different architectures
(clusters of PCs, networks of workstations, shared-memory multiprocessors,
and massively parallel computers with distributed memory). The BSP model
is explained in the next section.
1.2 The BSP model


The BSP model proposed by Valiant in 1989 [177,178] comprises a computer
architecture, a class of algorithms, and a function for charging costs to
algorithms. In this book, we use a variant of the BSP model; the differences
with the original model are discussed at the end of this section.
A BSP computer consists of a collection of processors, each with private
memory, and a communication network that allows processors to access
memories of other processors. The architecture of a BSP computer is shown
in Fig. 1.1. Each processor can read from or write to every memory cell in
the entire machine. If the cell is local, the read or write operation is relatively
fast. If the cell belongs to another processor, a message must be sent through
the communication network, and this operation is slower. The access time
for all nonlocal memories is the same. This implies that the communication
network can be viewed as a black box, where the connectivity of the network
is hidden in the interior. As users of a BSP computer, we need not be con-
cerned with the details of the communication network. We only care about
the remote access time delivered by the network, which should be uniform.
By concentrating on this single property, we are able to develop portable
algorithms, that is, algorithms that can be used for a wide range of parallel
computers.
A BSP algorithm consists of a sequence of supersteps. A superstep
contains either a number of computation steps or a number of communica-
tion steps, followed by a global barrier synchronization. In a computation
superstep, each processor performs a sequence of operations on local data. In
scientific computations, these are typically floating-point operations (flops).
(A flop is a multiplication, addition, subtraction, or division of two floating-
point numbers. For simplicity, we assume that all these operations take the
same amount of time.) In a communication superstep, each processor
sends and receives a number of messages.

Fig. 1.1. Architecture of a BSP computer. Here, ‘P’ denotes a processor and ‘M’ a
memory.

Fig. 1.2. BSP algorithm with five supersteps executed on five processors. A
vertical line denotes local computation; an arrow denotes communication of one
or more data words to another processor. The first superstep is a computation
superstep. The second one is a communication superstep, where processor P(0)
sends data to all other processors. Each superstep is terminated by a global
synchronization. (The five supersteps are, in order: computation, communication,
communication, computation, communication.)

At the end of a superstep, all pro-
cessors synchronize, as follows. Each processor checks whether it has finished
all its obligations of that superstep. In the case of a computation superstep, it
checks whether the computations are finished. In the case of a communication
superstep, it checks whether it has sent all messages that it had to send, and
whether it has received all messages that it had to receive. Processors wait
until all others have finished. When this happens, they all proceed to the next
superstep. This form of synchronization is called bulk synchronization,
because usually many computation or communication operations take place
between successive synchronizations. (This is in contrast to pairwise synchron-
ization, used in most message-passing systems, where each message causes a
pair of sending and receiving processors to wait until the message has been
transferred.) Figure 1.2 gives an example of a BSP algorithm.

Fig. 1.3. Two different h-relations with the same h. Each arrow represents the
communication of one data word. (a) A 2-relation with hs = 2 and hr = 1; (b)
a 2-relation with hs = hr = 2.

The BSP cost function is defined as follows. An h-relation is a communication
superstep where each processor sends at most h data words to
other processors and receives at most h data words, and where at least one
processor sends or receives h words. A data word is a real or an integer. We
denote the maximum number of words sent by any processor by hs and the
maximum number received by hr . Therefore,
h = max{hs , hr }. (1.1)
This equation reflects the assumption that a processor can send and receive
data simultaneously. The cost of the superstep depends solely on h. Note that
two different communication patterns may have the same h, so that the cost
function of the BSP model does not distinguish between them. An example
is given in Fig. 1.3. Charging costs on the basis of h is motivated by the
assumption that the bottleneck of communication lies at the entry or exit of
the communication network, so that simply counting the maximum number
of sends and receives gives a good indication of communication time. Note
that for the cost function it does not matter whether data are sent together
or as separate words.
The time, or cost, of an h-relation is
Tcomm (h) = hg + l, (1.2)
where g and l are machine-dependent parameters and the time unit is the
time of one flop. This cost is charged because of the expected linear increase
of communication time with h. Since g = limh→∞ Tcomm (h)/h, the parameter
g can be viewed as the time needed to send one data word into the communica-
tion network, or to receive one word, in the asymptotic situation of continuous
message traffic. The linear cost function includes a nonzero constant l because
executing an h-relation incurs a fixed overhead, which includes the cost of
global synchronization, but also fixed cost components of ensuring that all
data have arrived at their destination and of starting up the communication.
We lump all such fixed costs together into one parameter l.
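
As a small worked example (with the illustrative parameter values g = 2.5 and
l = 20 that are also used in Fig. 1.4, not measured values), both 2-relations of
Fig. 1.3 are charged the same cost,

    Tcomm(2) = 2g + l = 2 · 2.5 + 20 = 25 flop time units,

even though pattern (b) moves more data words in total than pattern (a).
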
Approximate values for g and l of a particular parallel computer can be
obtained by measuring the execution time for a range of full h-relations, that
is, h-relations where each processor sends and receives exactly h data words.
Figure 1.3(b) gives an example of a full 2-relation. (Can you find another full
2-relation?) In practice, the measured cost of a full h-relation will be an upper
bound on the measured cost of an arbitrary h-relation. For our measurements,
we usually take 64-bit reals or integers as data words.
The cost of a computation superstep is

Tcomp (w) = w + l, (1.3)

where the amount of work w is defined as the maximum number of flops
performed in the superstep by any processor. For reasons of simplicity, the
value of l is taken to be the same as that of a communication superstep, even
though it may be less in practice. As a result, the total synchronization cost
of an algorithm can be determined simply by counting its supersteps. Because
of (1.2) and (1.3), the total cost of a BSP algorithm becomes an expression of
the form a + bg + cl. Figure 1.4 displays the cost of the BSP algorithm from
Fig. 1.2.
A BSP computer can be characterized by four parameters: p, r, g, and
l. Here, p is the number of processors. The parameter r is the single-
processor computing rate measured in flop/s (floating-point operations per
second). This parameter is irrelevant for cost analysis and algorithm design,
because it just normalizes the time. (But if you wait for the result of your
program, you may find it highly relevant!) From a global, architectural point
of view, the parameter g, which is measured in flop time units, can be seen
as the ratio between the computation throughput and the communication
throughput of the computer. This is because in the time period of an h-
relation, phg flops can be performed by all processors together and ph data
words can be communicated through the network. Finally, l is called the
synchronization cost, and it is also measured in flop time units. Here, we
slightly abuse the language, because l includes fixed costs other than the
cost of synchronization as well. Measuring the characteristic parameters of a
computer is called computer benchmarking. In our case, we do this by
measuring r, g, and l. (For p, we can rely on the value advertised by the
computer vendor.) Table 1.1 summarizes the BSP parameters.
We can predict the execution time of an implementation of a BSP
algorithm on a parallel computer by theoretically analysing the cost of the
algorithm and, independently, benchmarking the computer for its BSP per-
formance. The predicted time (in seconds) of an algorithm with cost a+bg +cl
on a computer with measured parameters r, g, and l equals (a + bg + cl)/r,
because the time (in seconds) of one flop equals tflop = 1/r.
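
As a small worked example (the computing rate here is a hypothetical value, not
a measurement from the book): for an algorithm with total cost 320 flop time
units, such as the one of Fig. 1.4, running on a machine with r = 100 Mflop/s,
the predicted time is

    (a + bg + cl)/r = 320/10^8 s = 3.2 μs.
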
Fig. 1.4. Cost of the BSP algorithm from Fig. 1.2 on a BSP computer with p = 5,
g = 2.5, and l = 20. Computation costs are shown only for the processor that
determines the cost of a superstep (or one of them, if there are several).
Communication costs are shown for only one source/destination pair of processors,
because we assume in this example that the amount of data happens to be the
same for every pair. The cost of the first superstep is determined by processors
P(1) and P(3), which perform 60 flops each. Therefore, the cost is 60 + l = 80
flop time units. In the second superstep, P(0) sends five data words to each of
the four other processors. This superstep has hs = 20 and hr = 5, so that it
is a 20-relation and hence its cost is 20g + l = 70 flops. The cost of the other
supersteps is obtained in a similar fashion. The total cost of the algorithm is
320 flops.

Table 1.1. The BSP parameters

p  number of processors
r  computing rate (in flop/s)
g  communication cost per data word (in time units of 1 flop)
l  global synchronization cost (in time units of 1 flop)

One aim of the BSP model is to guarantee a certain performance of an
implementation. Because of this, the model states costs in terms of upper
bounds. For example, the cost function of an h-relation assumes a worst-case
communication pattern. This implies that the predicted time is an upper
bound on the measured time. Of course, the accuracy of the prediction
depends on how the BSP cost function reflects reality, and this may differ
from machine to machine.
Separation of concerns is a basic principle of good engineering. The BSP
model enforces this principle by separating the hardware concerns from the
software concerns. (Because the software for routing messages in the commu-
nication network influences g and l, we consider such routing software to be
part of the hardware system.) On the one hand, the hardware designer can
concentrate on increasing p, r and decreasing g, l. For example, he could aim at
designing a BSP(p, r, g, l) computer with p ≥ 100, r ≥ 1 Gflop/s (the prefix G
denotes Giga = 10^9), g ≤ 10, l ≤ 1000. To stay within a budget, he may have
to trade off these objectives. Obviously, the larger the number of processors
p and the computing rate r, the more powerful the communication network
must be to keep g and l low. It may be preferable to spend more money on a
better communication network than on faster processors. The BSP paramet-
ers help in quantifying these design issues. On the other hand, the software
designer can concentrate on decreasing the algorithmic parameters a, b, and
c in the cost expression a + bg + cl; in general, these parameters depend on
p and the problem size n. The aim of the software designer is to obtain good
scaling behaviour of the cost expression. For example, she could realistically
aim at a decrease in a as 1/p, a decrease in b as 1/√p, and a constant c.
The BSP model is a distributed-memory model. This implies that both
the computational work and the data are distributed. The work should be
distributed evenly over the processors, to achieve a good load balance. The
data should be distributed in such a way that the total communication cost
is limited. Often, the data distribution determines the work distribution in
a natural manner, so that the choice of data distribution must be based on
two objectives: good load balance and low communication cost. In all our
algorithms, choosing a data distribution is an important decision. By design-
ing algorithms for a distributed-memory model, we do not limit ourselves to
this model. Distributed-memory algorithms can be used efficiently on shared-
memory parallel computers, simply by partitioning the shared memory among
the processors. (The reverse is not true: shared-memory algorithms do not
take data locality into account, so that straightforward distribution leads to
inefficient algorithms.) We just develop a distributed-memory program; the
BSP system does the rest. Therefore, BSP algorithms can be implemented
efficiently on every type of parallel computer.
The main differences between our variant of the BSP model and the
original BSP model [178] are:

1. The original BSP model allows supersteps to contain both computa-
tion and communication. In our variant, these operations must be split
into separate supersteps. We impose this restriction to achieve conceptual
simplicity. We are not concerned with possible gains obtained by overlapping
computation and communication. (Such gains are minor at best.)
2. The original cost function for a superstep with an h-relation and a max-
imum amount of work w is max(w, hg, L), where L is the latency, that is, the
minimum number of time units between successive supersteps. In our variant,
we charge w + hg + 2l for the corresponding two supersteps. We use the syn-
chronization cost l instead of the (related) latency L because it facilitates the
analysis of algorithms. For example, the total cost in the original model of
a 2-relation followed by a 3-relation equals max(2g, L) + max(3g, L), which
may have any of the outcomes 5g, 3g + L, and 2L, whereas in our variant we
simply charge 5g + 2l. If g and l (or L) are known, we can substitute their val-
ues into these expressions and obtain one scalar value, so that both variants
are equally useful. However, we also would like to analyse algorithms without
knowing g and l (or L), and in that case our cost function leads to simpler
expressions. (Valiant [178] mentions w + hg + l as a possible alternative cost
function for a superstep with both computation and communication, but he
uses max(w, hg, l) in his analysis.)
3. The data distribution in the original model is controlled either directly
by the user, or, through a randomizing hash function, by the system. The
latter approach effectively randomizes the allocation of memory cells, and
thereby it provides a shared-memory view to the user. Since this approach
is only efficient in the rare case that g is very low, g ≈ 1, we use the direct
approach instead. Moreover, we use the data distribution as our main means
of making computations more efficient.
4. The original model allowed for synchronization of a subset of all pro-
cessors instead of all processors. This option may be useful in certain cases,
but for reasons of simplicity we disallow it.

1.3 BSP algorithm for inner product computation


A simple example of a BSP algorithm is the following computation of the
inner product α of two vectors x = (x0 , . . . , xn−1 )T and y = (y0 , . . . , yn−1 )T ,
    α = Σ_{i=0}^{n−1} xi yi.    (1.4)

In our terminology, vectors are column vectors; to save space we write them
as x = (x0 , . . . , xn−1 )T , where the superscript ‘T’ denotes transposition. The
vector x can also be viewed as an n × 1 matrix. The inner product of x and y
can concisely be expressed as xT y.
The inner product is computed by the processors P (0), . . . , P (p − 1) of a
BSP computer with p processors. We assume that the result is needed by all
processors, which is usually the case if the inner product computation is part
of a larger computation, such as in iterative linear system solvers.
Fig. 1.5. Distribution of a vector of size ten over four processors. Each cell
represents a vector component; the number in the cell and the greyshade denote
the processor that owns the cell. The processors are numbered 0, 1, 2, 3.
(a) Cyclic distribution: components 0, . . . , 9 are owned by processors
0, 1, 2, 3, 0, 1, 2, 3, 0, 1, respectively; (b) block distribution: they are owned
by processors 0, 0, 0, 1, 1, 1, 2, 2, 2, 3.

The data distribution of the vectors x and y should be the same, because
in that case the components xi and yi reside on the same processor and they
can be multiplied immediately without any communication. The data distri-
bution then determines the work distribution in a natural manner. To balance
the work load of the algorithm, we must assign the same number of vector
components to each processor. Card players know how to do this blindly,
even without counting and in the harshest of circumstances. They always
deal out their cards in a cyclic fashion. For the same reason, an optimal work
distribution is obtained by the cyclic distribution defined by the mapping

xi −→ P (i mod p), for 0 ≤ i < n. (1.5)

Here, the mod operator stands for taking the remainder after division by p,
that is, computing modulo p. Similarly, the div operator stands for integer
division rounding down. Figure 1.5(a) illustrates the cyclic distribution for
n = 10 and p = 4. The maximum number of components per processor is
⌈n/p⌉, that is, n/p rounded up to the nearest integer value, and the minimum
is ⌊n/p⌋ = n div p, that is, n/p rounded down. The maximum and the
minimum differ at most by one. If p divides n, every processor receives exactly
n/p components. Of course, many other data distributions also lead to the
best possible load balance. An example is the block distribution, defined
by the mapping
xi −→ P (i div b), for 0 ≤ i < n, (1.6)
with block size b = ⌈n/p⌉. Figure 1.5(b) illustrates the block distribution for
n = 10 and p = 4. This distribution has the same maximum number of com-
ponents per processor, but the minimum can take every integer value between
zero and the maximum. In Fig. 1.5(b) the minimum is one. The minimum can
even be zero: if n = 9 and p = 4, then the block size is b = 3, and the pro-
cessors receive 3, 3, 3, 0 components, respectively. Since the computation
cost is determined by the maximum amount of work, this is just as good as
the cyclic distribution, which assigns 3, 2, 2, 2 components. Intuitively, you
may object to the idling processor in the block distribution, but the work
distribution is still optimal!
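
In C, the two mappings (1.5) and (1.6) can be written as follows. This is a
small sketch for illustration only, not program text from the book, and the
function names are made up.

    /* Owner of global index i under the cyclic distribution (1.5). */
    int cyclic_owner(long i, int p){
        return (int)(i%p);
    }

    /* Owner of global index i under the block distribution (1.6),
       with block size b = ceil(n/p). */
    int block_owner(long i, long n, int p){
        long b= (n+p-1)/p; /* b = ceil(n/p) in integer arithmetic */
        return (int)(i/b);
    }

    /* Number of components owned by processor s under the cyclic
       distribution: ceil((n-s)/p), which equals either floor(n/p)
       or ceil(n/p), so the load balance is optimal. */
    long nloc_cyclic(long n, int p, int s){
        return (n+p-s-1)/p;
    }

For the situation of Fig. 1.5, with n = 10 and p = 4, these functions give
3, 3, 2, 2 local components for the cyclic distribution and 3, 3, 3, 1 for the
block distribution.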

Algorithm 1.1. Inner product algorithm for processor P(s), with 0 ≤ s < p.
input:  x, y : vector of length n,
        distr(x) = distr(y) = φ, with φ(i) = i mod p, for 0 ≤ i < n.
output: α = xT y.

(0) αs := 0;
    for i := s to n − 1 step p do
        αs := αs + xi yi;

(1) for t := 0 to p − 1 do
        put αs in P(t);

(2) α := 0;
    for t := 0 to p − 1 do
        α := α + αt;

Algorithm 1.1 computes an inner product in parallel. It consists of three
supersteps, numbered (0), (1), and (2). The synchronizations at the end of
the supersteps are not written explicitly. All the processors follow the same
program text, but their actual execution paths differ. The path of processor
P (s) depends on the processor identity s, with 0 ≤ s < p. This style of pro-
gramming is called single program multiple data (SPMD), and we shall
use it throughout the book.
In superstep (0), processor P (s) computes the local partial inner product

    αs = Σ_{0≤i<n, i mod p=s} xi yi,    (1.7)

multiplying xs by ys, xs+p by ys+p, xs+2p by ys+2p, and so on, and adding
the results. The data for this superstep are locally available. Note that we
use global indices so that we can refer uniquely to variables without regard
to the processors that own them. We access the local components of a vector
by stepping through the arrays with a stride, or step size, p.
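
As a concrete illustration, superstep (0) might look as follows in C. This is
only a sketch written with global indices, in the style of Algorithm 1.1; it is
not the program text of Section 1.4, and the names alpha_s, x, y, n, p, and s
are illustrative. In an actual distributed program, each processor stores only
its own components, usually contiguously in a local array.

    /* Superstep (0): P(s) handles the components with global index
       i = s, s+p, s+2p, ..., stepping through x and y with stride p. */
    double alpha_s= 0.0;
    long i;

    for (i=s; i<n; i += p)
        alpha_s += x[i]*y[i];
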
In superstep (1), each processor broadcasts its result αs , that is, it sends
αs to all processors. We use the communication primitive ‘put x in P (t)’ in the
program text of P (s) to denote the one-sided action by processor P (s) of stor-
ing a data element x at another processor P (t). This completely determines
the communication: both the source processor and the destination processor
of the data element are specified. The ‘put’ primitive assumes that the source
processor knows the memory location on the destination processor where the
data must be put. The source processor is the initiator of the action, whereas
the destination processor is passive. Thus, we assume implicitly that each pro-
cessor allows all others to put data into its memory. Superstep (1) could also
have been written as ‘put αs in P (∗)’, where we use the abbreviation P (∗) to
denote all processors. Note that the program includes a put by processor P (s)
into itself. This operation is simply skipped or becomes a local memory-copy,
but it does not involve communication. It is convenient to include such puts
in program texts, to avoid having to specify exceptions.
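
Looking ahead to BSPlib, which is introduced in Section 1.4, the puts of
superstep (1) might be expressed roughly as in the following sketch. This is
not the book's bspinprod program: the array name Alpha is made up; s, p,
alpha_s, and alpha are assumed to be available, with s = bsp_pid() and
p = bsp_nprocs(); and the headers stdlib.h and bsp.h are assumed to be
included. A registered array can only be written into by other processors
after the registration has taken effect at a synchronization.

    int t;
    double *Alpha= malloc(p*sizeof(double));

    bsp_push_reg(Alpha, p*sizeof(double)); /* allow remote puts into Alpha */
    bsp_sync();                            /* registration takes effect */

    for (t=0; t<p; t++)                    /* superstep (1): broadcast alpha_s */
        bsp_put(t, &alpha_s, Alpha, s*sizeof(double), sizeof(double));
    bsp_sync();                            /* all puts have been delivered */

    alpha= 0.0;                            /* superstep (2): redundant sum */
    for (t=0; t<p; t++)
        alpha += Alpha[t];
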
Sometimes, it may be necessary to let the destination processor initiate
the communication. This may happen in irregular computations, where the
destination processor knows that it needs data, but the source processor is
unaware of this need. In that case, the destination processor must fetch the
data from the source processor. This is done by a statement of the form ‘get
x from P (t)’ in the program text of P (s). In most cases, however, we use the
‘put’ primitive. Note that using a ‘put’ is much simpler than using a matching
‘send’/‘receive’ pair, as is done in message-passing parallel algorithms. The
program text of such an algorithm must contain additional if-statements to
distinguish between sends and receives. Careful checking is needed to make
sure that pairs match in all possible executions of the program. Even if
every send has a matching receive, this does not guarantee correct commu-
nication as intended by the algorithm designer. If the send/receive is done
by the handshake (or kissing) protocol, where both participants can only
continue their way after the handshake has finished, then it can easily hap-
pen that the sends and receives occur in the wrong order. A classic case is
when two processors both want to send first and receive afterwards; this situ-
ation is called deadlock. Problems such as deadlock cannot happen when
using puts.
In superstep (2), all processors compute the final result. This is done
redundantly, that is, the computation is replicated so that all processors per-
form exactly the same operations on the same data. The complete algorithm
is illustrated in Fig. 1.6.
The cost analysis of the algorithm is as follows. Superstep (0) requires a
floating-point multiplication and an addition for each component. Therefore,
the cost of (0) is 2⌈n/p⌉ + l. Superstep (1) is a (p − 1)-relation, because each
processor sends and receives p − 1 data. (Communication between a processor
and itself is not really communication and hence is not counted in determining
h.) The cost of (1) is (p − 1)g + l. The cost of (2) is p + l. The total cost of
the inner product algorithm is

    Tinprod = 2⌈n/p⌉ + p + (p − 1)g + 3l.                           (1.8)

Fig. 1.6. Parallel inner product computation. Two vectors of size ten are distributed
by the cyclic distribution over four processors. The processors are shown by
greyshades. First, each processor computes its local inner product. For example,
processor P (0) computes its local inner product 12 · 1 + (−1) · (−1) + 3 · 3 = 22.
Then the local result is sent to all other processors. Finally, the local inner
products are summed redundantly to give the result 75 in every processor.

An alternative approach would be to send all partial inner products to one
processor, P (0), and let this processor compute the result and broadcast it.
This requires four supersteps. Sending the partial inner products to P (0) is a
(p − 1)-relation and therefore is just as expensive as broadcasting them. While
P (0) adds partial inner products, the other processors are idle and hence the
cost of the addition is the same as for the redundant computation. The total
cost of the alternative algorithm would be 2⌈n/p⌉ + p + 2(p − 1)g + 4l, which
is higher than that of Algorithm 1.1. The lesson to be learned: if you have
to perform an h-relation with a particular h, you might as well perform as
much useful communication as possible in that h-relation; other supersteps
may benefit from this.
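As a small numerical illustration (with hypothetical machine parameters g = 100
and l = 1000, in flop units), take n = 10 and p = 4: Algorithm 1.1 costs
2 · 3 + 4 + 3 · 100 + 3 · 1000 = 3310 flops, whereas the alternative costs
2 · 3 + 4 + 6 · 100 + 4 · 1000 = 4610 flops. The difference is exactly (p − 1)g + l,
the price of the extra h-relation and the extra synchronization.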

1.4 Starting with BSPlib: example program bspinprod


BSPlib is a standard library interface, which defines a set of primitives for
writing bulk synchronous parallel programs. Currently, BSPlib has been
implemented efficiently in two libraries, namely the Oxford BSP toolset [103]
and the Paderborn University BSP library [28,30]. The BSPlib standard uni-
fies and extends its two predecessors, the Oxford BSP library designed by
Miller and Reed [140,141] and the Green BSP library by Goudreau et al. [81].
Implementations of BSPlib exist for many different architectures, including: a
cluster of PCs running Linux; a network of UNIX workstations connected by
an Ethernet and communicating through the TCP/IP or UDP/IP protocol;
shared-memory multiprocessors running a variant of UNIX; and massively
parallel computers with distributed memory such as the Cray T3E, the Sil-
icon Graphics Origin, and the IBM SP. In addition, the library can also be
used on an ordinary sequential computer to run a parallel program with p = 1
as a sequential program; in this case, a special version of BSPlib can be used
that strips off the parallel overhead. This is advantageous because it allows us
to develop and maintain only one version of the source program, namely the
parallel version. It is also possible to simulate a parallel computer by running
p processes in parallel on one processor, sharing the time of the common CPU.
This is a useful environment for developing parallel programs, for example,
on a PC.
The BSPlib library contains a set of primitive functions, which can be
called from a program written in a conventional programming language such
as C, C++, or Fortran 90. BSPlib was designed following the motto ‘small
is beautiful’. The primitives of BSPlib were carefully crafted and particular
attention was paid to the question of what to exclude from the library. As a
result, BSPlib contains only 20 primitive functions, which are also called the
core operations. In this section, we present a small C program, which uses
12 different primitives. The aim of this tutorial program is to get you started
using the library and to expose the main principles of writing a BSPlib pro-
gram. The remainder of this book gives further examples of how the library
can be used. Six primitives for so-called bulk synchronous message passing will
be explained in Chapter 4, where they are first needed. Two primitives for
high-performance communication are explained in an exercise in Chapter 2.
They are primarily meant for programmers who want the ultimate in per-
formance, in terms of memory and computing speed, and who are prepared
to live on the edge and be responsible for the safety of their programs, instead
of relying on the BSP system to provide this. A quick reference guide to the
BSPlib primitives is given as Appendix B. An alternative to using BSPlib
is using MPI. Appendix C discusses how to program in BSP style using
MPI and presents MPI versions of all the programs from Chapters 1 to 4.
For a full explanation of the BSPlib standard, see the definitive source by
Hill et al. [105]. In the following, we assume that a BSPlib implementation
has already been installed. To start with, you could install the Oxford BSP
toolset [103] on your PC running Linux, or ask your systems administrator to
install the toolset at your local network of workstations.
The parallel part of a BSPlib program starts with the statement

bsp begin(reqprocs);

where int reqprocs is the number of processors requested. The function
bsp begin starts several executions of the same subprogram, where each
execution takes place on a different processor and handles a different stream
of data, in true SPMD style. The parallel part is terminated by
bsp end();
Two possible modes of operation can be used. In the first mode, the whole
computation is SPMD; here, the call to bsp begin must be the first executable
statement in the program and the call to bsp end the last. Sometimes, how-
ever, one desires to perform some sequential part of the program before and
after the parallel part, for example, to handle input and output. For instance,
if the optimal number of processors to be used depends on the input, we want
to compute it before the actual parallel computation starts. The second mode
enables this: processor P (0) executes the sequential parts and all processors
together perform the parallel part. Processor P (0) preserves the values of its
variables on moving from one part to the next. The other processors do not
inherit values; they can only obtain desired data values by communication.
To allow the second mode of operation and to circumvent the restriction of
bsp begin and bsp end being the first and last statement, the actual parallel
part is made into a separate function spmd and an initializer
bsp init(spmd, argc, argv);
is called as the first executable statement of the main function. Here, int
argc and char **argv are the standard arguments of main in a C program,
and these can be used to transfer parameters from a command line interface.
Funny things may happen if this is not the first executable statement. Do not
even think of trying it! The initializing statement is followed by: a sequential
part, which may handle some input or ask for the desired number of pro-
cessors (depending on the input size it may be better to use only part of the
available processors); the parallel part, which is executed by spmd; and finally
another sequential part, which may handle output. The sequential parts are
optional.
The rules for I/O are simple: processor P (0) is the only processor that can
read from standard input or can access the file system, but all processors can
write to standard output. Be aware that this may mix the output streams;
use an fflush(stdout) statement to empty the output buffer immediately
and increase the chance of obtaining ordered output (sorry, no guarantees).
At every point in the parallel part of the program, one can enquire about
the total number of processors. This integer is returned by
bsp nprocs();
The function bsp nprocs also serves a second purpose: when it is used in
the sequential part at the start, or in the bsp begin statement, it returns the
available number of processors, that is, the size of the BSP machine used. Any
desired number of processors not exceeding the machine size can be assigned to
the program by bsp begin. The local processor identity, or processor number,
is returned by
bsp pid();
It is an integer between 0 and bsp nprocs()−1. One can also enquire about
the time in seconds elapsed on the local processor since bsp begin; this time
is given as a double-precision value by
bsp time();
Note that in the parallel context the elapsed time, or wall-clock time, is often
the desired metric and not the CPU time. In parallel programs, processors
are often idling because they have to wait for others to finish their part of
a computation; a measurement of elapsed time includes idle time, whereas
a CPU time measurement does not. Note, however, that the elapsed time
metric does have one major disadvantage, in particular to your fellow users:
you need to claim the whole BSP computer for yourself when measuring run
times.
Each superstep of the SPMD part, or program superstep, is terminated
by a global synchronization statement
bsp sync();
except the last program superstep, which is terminated by bsp end. The
structure of a BSPlib program is illustrated in Fig. 1.7. Program supersteps
may be contained in loops and if-statements, but the condition evaluations of
these loops and if-statements must be such that all processors pace through the
same sequence of program supersteps. The rules imposed by BSPlib may seem
restrictive, but following them makes parallel programming easier, because
they guarantee that all processors are in the same superstep. This allows us
to assume full data integrity at the start of each superstep.

Fig. 1.7. Structure of a BSPlib program. The program first initializes the BSP
machine to be used and then it performs a sequential computation on P (0),
followed by a parallel computation on five processors. It finishes with a sequential
computation on P (0).

The version of the BSP model presented in Section 1.2 does not allow
computation and communication in the same superstep. The BSPlib sys-
tem automatically separates computation and communication, since it delays
communication until all computation is finished. Therefore the user does not
have to separate these parts herself and she also does not have to include
a bsp sync for this purpose. In practice, this means that BSPlib programs
can freely mix computation and communication. The automatic separation
feature of BSPlib is convenient, for instance because communication often
involves address calculations and it would be awkward for a user to separ-
ate these computations from the corresponding communication operations. A
program superstep can thus be viewed as a sequence of computation, implicit
synchronization, communication, and explicit synchronization. The compu-
tation part or the communication part may be empty. Therefore, a program
superstep may contain one or two supersteps as defined in the BSP model,
namely a computation superstep and/or a communication superstep. From
now on, we use the shorter term ‘superstep’ to denote program supersteps as
well, except when this would lead to confusion.
Wouldn’t it be nice if we could compute and communicate at the same
time? This tempting thought may have occurred to you by now. Indeed,
processors could in principle compute while messages travel through the
communication network. Exploiting this form of parallelism would reduce
the total computation/communication cost a + bg of the algorithm, but at
most by a factor of two. The largest reduction would occur if the cost of
each computation superstep were equal to the cost of the corresponding
communication superstep, and if computation and communication could be
overlapped completely. In most cases, however, either computation or com-
munication dominates, and the cost reduction obtained by overlapping is
insignificant. Surprisingly, BSPlib guarantees not to exploit potential over-
lap. Instead, delaying all communication gives more scope for optimization,
since this allows the system to combine different messages from the same
source to the same destination and to reorder the messages with the aim of
balancing the communication traffic. As a result, the cost may be reduced by
much more than a factor of two.
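In terms of the cost formula: perfect overlap would reduce a cost a + bg to
max(a, bg), and since max(a, bg) ≥ (a + bg)/2, the saving can indeed never exceed
a factor of two.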
Processors can communicate with each other by using the bsp put and
bsp get functions (or their high-performance equivalents bsp hpput and
bsp hpget, see Exercise 10, or the bsp send function, see Section 4.9). A pro-
cessor that calls bsp put reads data from its own memory and writes them
into the memory of another processor. The function bsp put corresponds to
the put operation in our algorithms. The syntax is

bsp put(pid, source, dest, offset, nbytes);

Here, int pid is the identity of the remote processor; void *source is a
pointer to the source memory in the local processor from which the data
are read; void *dest is a pointer to the destination memory in the remote
processor into which the data are written; int offset is the number of bytes
to be added to the address dest to obtain the address where writing starts;
and int nbytes is the number of bytes to be written. The dest variable must
have been registered previously; the registration mechanism will be explained
soon. If pid equals bsp pid, the put is done locally by a memory copy, and
no data is communicated. The offset is determined by the local processor, but
the destination address is part of the address space of the remote processor.
The use of an offset separates the concerns of the local processor, which knows
where in an array a data element should be placed, from the concerns of the
remote processor, which knows the address of the array in its own address
space. The bsp put operation is illustrated in Fig. 1.8.

Fig. 1.8. Put operation from BSPlib. The bsp put operation copies nbytes of data
from the local processor bsp pid into the specified destination processor pid.
The pointer source points to the start of the data to be copied, whereas the
pointer dest specifies the start of the memory area where the data is written.
The data is written at offset bytes from the start.

The bsp put operation is safe in every sense, since the value to be put is
first written into a local out-buffer, and only at the end of the superstep (when
all computations in all processors are finished) it is transferred into a remote
in-buffer, from which it is finally copied into the destination memory. The
user can manipulate both the source and destination value without worrying
about possible interference between data manipulation and transfer. Once the
bsp put is initiated, the user has got rid of the source data and can reuse the
variable that holds them. The destination variable can be used until the end of
the superstep, when it will be overwritten. It is possible to put several values
into the same memory cell, but of course only one value survives and reaches
the next superstep. The user cannot know which value, and he bears the
responsibility for ensuring correct program behaviour. Put and get operations
do not block progress within their superstep: after a put or get is initiated,
the program proceeds immediately.
Although a remote variable may have the same name as a local variable,
it may still have a different physical memory address because each processor
could have its own memory allocation procedure. To enable a processor to
write into a remote variable, there must be a way to link the local name to
the correct remote address. Linking is done by the registration primitive
bsp push reg(variable, nbytes);
where void *variable is a pointer to the variable being registered. All pro-
cessors must simultaneously register a variable, or the NULL pointer; they must
also deregister simultaneously. This ensures that they go through the same
sequence of registrations and deregistrations. Registration takes effect at the
start of the next superstep. From that moment, all simultaneously registered
variables are linked to each other. Usually, the name of each variable linked
in a registration is the same, in the right SPMD spirit. Still, it is allowed to
link variables with different names.
If a processor wants to put a value into a remote address, it can do this
by using the local name that is linked to the remote name and hence to the
desired remote address. The second registration parameter, int nbytes, is
an upper bound on the number of bytes that can be written starting from
variable. Its sole purpose is sanity checking: our hope is to detect insane
programs in their youth.
A variable is deregistered by a call to
bsp pop reg(variable);
Within a superstep, the variables can be registered and deregistered in arbit-
rary order. The same variable may be registered several times, but with
different sizes. (This may happen for instance as a result of registration of
the same variable inside different functions.) A deregistration cancels the last
registration of the variable concerned. The last surviving registration of a
variable is the one valid in the next superstep. For each variable, a stack
of registrations is maintained: a variable is pushed onto the stack when it
is registered; and it is popped off the stack when it is deregistered. A stack is
the computer science equivalent of the hiring and firing principle for teachers
in the Dutch educational system: Last In, First Out (LIFO). This keeps the
average stack population old, but that property is irrelevant for our book.
In a sensible program, the number of registrations is kept limited. Prefer-
ably, a registered variable is reused many times, to amortize the associated
overhead costs. Registration is costly because it requires a broadcast of the
registered local variable to the other processors and possibly an additional
synchronization.
A processor that calls the bsp get function reads data from the memory
of another processor and writes them into its own memory. The syntax is
bsp get(pid, source, offset, dest, nbytes);
The parameters of bsp get have the same meaning as those of bsp put, except
that the source memory is in the remote processor and the destination memory
in the local processor and that the offset is in the source memory. The offset is
again computed by the local processor. The source variable must have been
registered previously. The value obtained by the bsp get operation is the
source value immediately after the computations of the present superstep have
terminated, but before it can be modified by other communication operations.
If a processor detects an error, it can take action and bring down all other
processors in a graceful manner by a call to
bsp abort(error message);
Here, error message is an output string such as used by the printf function
in C. Proper use of the abort facility makes it unnecessary to check periodically
whether all processors are still alive and computing.
BSPlib contains only core operations. By keeping the core small, the
BSPlib designers hoped to enable quick and efficient implementation of BSPlib
on every parallel computer that appears on the market. Higher level func-
tions such as broadcasting or global summing, generally called collective
communication, are useful but not really necessary. Of course, users can
write their own higher level functions on top of the primitive functions,
giving them exactly the desired functionality, or use the predefined ones
available in the Oxford BSP toolset [103]. Section 2.5 gives an example of
a collective-communication function, the broadcast of a vector.
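For instance, a broadcast of a single double from a root processor to all processors
could be built on top of the primitives roughly as follows (a minimal sketch only,
not the toolset's predefined routine; the function name broadcast_double and its
interface are invented for this example):

#include "bspedupack.h"

/* Minimal sketch of a collective broadcast built on the BSPlib core
   operations: after the call, *x on every processor holds the value that
   *x had on processor root. All processors must call this function in the
   same superstep, with the same root. For simplicity the variable is
   registered and deregistered here; in a real program one would register
   it once and reuse it, to amortize the registration cost. */
void broadcast_double(int root, double *x){
    int p, t;

    p= bsp_nprocs();
    bsp_push_reg(x,SZDBL);
    bsp_sync();                      /* registration takes effect */

    if (bsp_pid()==root){
        for (t=0; t<p; t++)
            bsp_put(t,x,x,0,SZDBL);  /* put into self is a memory copy */
    }
    bsp_sync();                      /* the puts are delivered here */
    bsp_pop_reg(x);

} /* end broadcast_double */

The cost of such a broadcast is that of a (p − 1)-relation plus two synchronizations,
(p − 1)g + 2l, on top of the registration overhead.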
The best way of learning to use the library is to study an example and
then try to write your own program. Below, we present the function bspip,
which is an implementation of Algorithm 1.1 in C using BSPlib, and the test
program bspinprod, which handles input and output for a test problem. Now,
try to compile the program by the UNIX command
bspcc -o ip bspinprod.c bspedupack.c -lm
and run the resulting executable program ip on four processors by the
command
bsprun -npes 4 ip
and see what happens for this particular test problem, defined by xi = yi =
i + 1, for 0 ≤ i < n. (If you are not running a UNIX variant, you may have to
follow a different procedure.) A listing of the file bspedupack.c can be found
in Appendix A. It contains functions for allocation and deallocation of vectors
and matrices.
The relation between Algorithm 1.1 and the function bspip is as follows.
The variables p, s, t, n, α of the algorithm correspond to the variables p, s,
t, n, alpha of the function. The local inner product αs of P (s) is denoted
by inprod in the program text of P (s), and it is also put into Inprod[s] in
all processors. The global index i in the algorithm equals i * p + s, where i is
a local index in the program. The vector component xi corresponds to the
variable x[i] on the processor that owns xi , that is, on processor P (i mod p).
The number of local indices on P (s) is nloc(p,s,n). The first n mod p
processors have ⌈n/p⌉ such indices, while the others have ⌊n/p⌋. Note the
efficient way in which nloc is computed and check that this method is correct,
by writing n = ap + b with 0 ≤ b < p and expanding the expression returned
by the function.
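The check goes as follows: write n = ap + b with 0 ≤ b < p. Then, in integer
arithmetic, (n + p − s − 1)/p = a + (b + p − s − 1)/p. For s < b we have
p ≤ b + p − s − 1 ≤ 2p − 2, so that the second term equals one and P (s) obtains
a + 1 = ⌈n/p⌉ local indices; for s ≥ b we have 0 ≤ b + p − s − 1 ≤ p − 1, so that
the second term equals zero and P (s) obtains a = ⌊n/p⌋ local indices.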
It is convenient to describe algorithms such as Algorithm 1.1 in global
variables, but to implement them using local variables. This avoids addressing
by a stride and its worse alternative, the superfluous test ‘if i mod p = s’ in a
loop over the global index i. Using local, consecutive indices is most natural
in an implementation because only a subarray of the original global array is
stored locally. The difference between the global and local view is illustrated
in Fig. 1.9. For all our distributions, we store the local vector components in
order of increasing global index. This gives rise to a natural mapping between
local and global indices.

Fig. 1.9. Two different views of the same vector. The vector of size ten is distributed
by the cyclic distribution over four processors. The numbers in the square cells
are the numerical values of the vector components. The processors are shown
by greyshades. The global view is used in algorithms, where vector components
are numbered using global indices j. The local view is used in implementations,
where each processor has its own part of the vector and uses its own local
indices j.

The program printed below largely explains itself; a few additional
explanations are given in the following. The included file bspedupack.h can
be found in Appendix A. It contains inclusion statements for standard header
files and also constants such as the size of a double SZDBL. The number of
processors is first stored as a global variable P (global in the C sense, that
is, accessible to all functions in its file), so that we are able to transfer the
value of P from the main program to the SPMD part. Values cannot be trans-
ferred other than by using global variables, because the SPMD function is not
allowed to have parameters. The function vecallocd from bspedupack.c is
used to allocate an array of doubles of length p dynamically and vecfreed is
used to free the array afterwards.
The offset in the first bsp put is s*SZDBL, since the local inner product
of processor s is put into Inprod[s] on every processor t. The processors
synchronize before the time measurements by bsp time, so that the measure-
ments start and finish simultaneously.
#include "bspedupack.h"

/* This program computes the sum of the first n squares, for n>=0,
sum = 1*1 + 2*2 + ... + n*n
by computing the inner product of x=(1,2,...,n)^T and itself.
The output should equal n*(n+1)*(2n+1)/6.
The distribution of x is cyclic.
*/

int P; /* number of processors requested */

int nloc(int p, int s, int n){


/* Compute number of local components of processor s for vector
of length n distributed cyclically over p processors. */

return (n+p-s-1)/p ;

} /* end nloc */

double bspip(int p, int s, int n, double *x, double *y){


/* Compute inner product of vectors x and y of length n>=0 */

int nloc(int p, int s, int n);


double inprod, *Inprod, alpha;
int i, t;

Inprod= vecallocd(p); bsp_push_reg(Inprod,p*SZDBL);


bsp_sync();

inprod= 0.0;
for (i=0; i<nloc(p,s,n); i++){
inprod += x[i]*y[i];
}

for (t=0; t<p; t++){


bsp_put(t,&inprod,Inprod,s*SZDBL,SZDBL);
}
bsp_sync();

alpha= 0.0;
for (t=0; t<p; t++){
alpha += Inprod[t];
}
bsp_pop_reg(Inprod); vecfreed(Inprod);

return alpha;

} /* end bspip */

void bspinprod(){

double bspip(int p, int s, int n, double *x, double *y);


int nloc(int p, int s, int n);
double *x, alpha, time0, time1;
int p, s, n, nl, i, iglob;

bsp_begin(P);
p= bsp_nprocs(); /* p = number of processors obtained */
s= bsp_pid(); /* s = processor number */
if (s==0){
printf("Please enter n:\n"); fflush(stdout);
scanf("%d",&n);
if(n<0)
bsp_abort("Error in input: n is negative");
}
bsp_push_reg(&n,SZINT);
bsp_sync();

bsp_get(0,&n,0,&n,SZINT);
bsp_sync();
bsp_pop_reg(&n);

nl= nloc(p,s,n);
x= vecallocd(nl);
for (i=0; i<nl; i++){
iglob= i*p+s;
x[i]= iglob+1;
}
bsp_sync();
time0=bsp_time();

alpha= bspip(p,s,n,x,x);
bsp_sync();
time1=bsp_time();

printf("Processor %d: sum of squares up to %d*%d is %.lf\n",
s,n,n,alpha); fflush(stdout);
if (s==0){
printf("This took only %.6lf seconds.\n", time1-time0);
fflush(stdout);
}

vecfreed(x);
bsp_end();

} /* end bspinprod */

int main(int argc, char **argv){

bsp_init(bspinprod, argc, argv);

/* sequential part */
printf("How many processors do you want to use?\n");
fflush(stdout);
scanf("%d",&P);
if (P > bsp_nprocs()){
printf("Sorry, not enough processors available.\n");
fflush(stdout);
exit(1);
}

/* SPMD part */
bspinprod();

/* sequential part */
exit(0);

} /* end main */

1.5 BSP benchmarking


Computer benchmarking is the activity of measuring computer perform-
ance by running a representative set of test programs. The performance
results for a particular sequential computer are often reduced in some ruth-
less way to one number, the computing rate in flop/s. This allows us to
rank different computers according to their computing rate and to make
informed decisions on what machines to buy or use. The performance of par-
allel computers must be expressed in more than a single number because
communication and synchronization are just as important for these com-
puters as computation. The BSP model represents machine performance by
a parameter set of minimal size: for a given number of processors p, the
parameters r, g, and l represent the performance for computation, commu-
nication, and synchronization. Every parallel computer can be viewed as a
BSP computer, with good or bad BSP parameters, and hence can also be
benchmarked as a BSP computer. In this section, we present a method for
BSP benchmarking. The aim of the method is to find out what the BSP
computer looks like to an average user, perhaps you or me, who writes
parallel programs but does not really want to spend much time optimizing
programs, preferring instead to let the compiler and the BSP system do
The sequential computing rate r is determined by measuring the time of
a so-called DAXPY operation (‘Double precision A times X Plus Y ’), which
has the form y := αx + y, where x and y are vectors and α is a scalar. A
DAXPY with vectors of length n contains n additions and n multiplications
and some overhead in the form of O(n) address calculations. We also measure
the time of a DAXPY operation with the addition replaced by subtraction.
We use 64-bit arithmetic throughout; on most machines this is called double-
precision arithmetic. This mixture of operations is representative for the
majority of scientific computations. We measure the time for a vector length,
which on the one hand is large enough so that we can ignore the startup
costs of vector operations, but on the other hand is small enough for the
vectors to fit in the cache; a choice of n = 1024 is often adequate. A cache is
a small but fast intermediate memory that allows immediate reuse of recently
accessed data. Proper use of the cache considerably increases the computing
rate on most modern computers. The existence of a cache makes the life of
a benchmarker harder, because it leads to two different computing rates: a
flop rate for in-cache computations and a rate for out-of-cache computations.
Intelligent choices should be made if the performance results are to be reduced
to a single meaningful figure.
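In flop counts: one DAXPY of length n performs 2n flops, so one repetition of the
DAXPY pair (the addition and the subtraction variant) of length n = 1024 amounts
to 4n = 4096 flops; this is the basis of the count 4 · NITERS · n used by the
benchmark program of Section 1.6, where NITERS is the number of repetitions.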
The DAXPY measurement is repeated a number of times, both to obtain
a more accurate clock reading and to amortize the cost of bringing the vector
into the cache. We measure the sequential computing rate of each processor of
the parallel computer, and report the minimum, average, and maximum rate.
The difference between the minimum and the maximum indicates the accur-
acy of the measurement, except when the processors genuinely differ in speed.
(One processor that is slower than the others can have a remarkable effect
on the overall time of a parallel computation!) We take the average comput-
ing rate of the processors as the final value of r. Note that our measurement
is representative for user programs that contain mostly hand-coded vector
operations. To realize top performance, system-provided matrix–matrix oper-
ations should be used wherever possible, because these are often efficiently
coded in assembler language. Our benchmark method does not reflect that
situation.
The communication parameter g and the synchronization parameter l are
obtained by measuring the time of full h-relations, where each processor
sends and receives exactly h data words. To be consistent with the meas-
urement of r, we use double-precision reals as data words. We choose a
particularly demanding test pattern from the many possible patterns with
the same h, which reflects the typical way most users would handle commun-
ication in their programs. The destination processors of the values to be sent
are determined in a cyclic fashion: P (s) puts h values in remote processors
P (s + 1), P (s + 2), . . . , P (p − 1), P (0), . . . , P (s − 1), P (s + 1), . . ., wrapping
around at processor number p and skipping the source processor to exclude
local puts that do not require communication, see Fig. 1.10. In this commun-
ication pattern, all processors receive the same number, h, of data words (this
can easily be proven by using a symmetry argument). The destination pro-
cessor of each communicated value is computed before the actual h-relation is
performed, to prevent possibly expensive modulo operations from influencing
the timing results of the h-relation. The data are sent out as separate words,
to simulate the typical situation in an application program where the user
does not worry about the size of the data packets. This is the task of the BSP
system, after all! Note that this way of benchmarking h-relations is a test of
both the machine and the BSP library for that machine. Library software can
combine several data words into one packet, if they are sent in one superstep
from the same source processor to the same destination processor. An efficient
library will package such data automatically and choose an optimal packet size
for the particular machine used. This results in a lower value of g, because
the communication startup cost of a packet is amortized over several words
of data.

Fig. 1.10. Communication pattern of the 6-relation in the BSP benchmark. Processors
send data to the other processors in a cyclic manner. Only the data sent
by processor P (0) are shown; other processors send data in a similar way. Each
arrow represents the communication of one data word; the number shown is the
index of the data word.

Our variant of the BSP model assumes that the time Tcomm (h) of an
h-relation is linear in h, see (1.2). In principle, it would be possible to measure
Tcomm (h) for two values of h and then determine g and l. Of course, the results
would then be highly sensitive to measurement error. A better way of doing
this is by measuring Tcomm (h) for a range of values h0 –h1 , and then finding the
best least-squares approximation, given by the values of g and l that minimize
the error
    ELSQ = Σ_{h=h0}^{h1} (Tcomm(h) − (hg + l))².                    (1.9)

(These values are obtained by setting the partial derivatives with respect to g
and l to zero, and solving the resulting 2 × 2 linear system of equations.) We
choose h0 = p, because packet optimization becomes worthwhile for h ≥ p;
we would like to capture the behaviour of the machine and the BSP system
for such values of h. A value of h1 = 256 will often be adequate, except if
p ≥ 256 or if the asymptotic communication speed is attained only for very
large h.
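Written out, setting the partial derivatives of ELSQ with respect to l and g to zero
gives the normal equations

    nh · l + (Σ h) · g = Σ Tcomm(h),
    (Σ h) · l + (Σ h²) · g = Σ h · Tcomm(h),

where nh = h1 − h0 + 1 and all sums run over h0 ≤ h ≤ h1; this is the 2 × 2 system
solved by the function leastsquares of the program bspbench in Section 1.6.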
Timing parallel programs requires caution since ultimately it often relies
on a system timer, which may be hidden from the user and may have a low
resolution. Always take a critical look at your timing results and your personal
watch and, in case of suspicion, plot the output data in a graph! This may
save you from potential embarrassment: on one occasion, I was surprised to
find that according to an erroneous timer the computer had exceeded its true
performance by a factor of four. On another occasion, I found that g was
negative. The reason was that the particular computer used had a small g but
a huge l, so that for h ≤ h1 the measurement error in gh + l was much larger
than gh, thereby rendering the value of g meaningless. In this case, h1 had to
be increased to obtain an accurate measurement of g.

1.6 Example program bspbench


The program bspbench is a simple benchmarking program that measures
the BSP parameters of a particular computer. It is an implementation of
the benchmarking method described in Section 1.5. In the following, we
present and explain the program. The least-squares function of bspbench
solves a 2 × 2 linear system by subtracting a multiple of one equation from
the other. Dividing by zero or by a small number is avoided by subtract-
ing the equation with the largest leading coefficient. (Solving a 2 × 2 linear
system is a prelude to Chapter 2, where large linear systems are solved by
LU decomposition.)
The computing rate r of each processor is measured by using the bsp time
function, which gives the elapsed time in seconds for the processor that
calls it. The measurements of r are independent, and hence they do not
require timer synchronization. The number of iterations NITERS is set such
that each DAXPY pair (and each h-relation) is executed 100 times. You
may decrease the number if you run out of patience while waiting for the
results.
The h-relation of our benchmarking method is implemented as follows.
The data to be sent are put into the array dest of the destination processors.
The destination processor destproc[i] is determined by starting with the next
processor s + 1 and allocating the indices i to the p − 1 remote processors in
a cyclic fashion, that is, by adding i mod (p − 1) to the processor number
s+1. Taking the resulting value modulo p then gives a valid processor number
unequal to s. The destination index destindex[i] is chosen such that each
source processor fills its own part in the dest arrays on the other processors:
P (s) fills locations s, s + p, s + 2p, and so on. The locations are defined by
destindex[i] = s + (i div (p − 1))p, because we return to the same processor
after each round of p−1 puts into different destination processors. The largest
destination index used in a processor is at most p − 1 + ((h − 1) div (p −
1))p < p + 2 · MAXH, which guarantees that the array dest is sufficiently
large.
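For example, with p = 4, s = 0, and h = 6 (cf. Fig. 1.10), these formulas give
destproc = 1, 2, 3, 1, 2, 3 and destindex = 0, 0, 0, 4, 4, 4 for i = 0, . . . , 5: data
words 0 and 3 go to P (1), words 1 and 4 to P (2), and words 2 and 5 to P (3), each
pair landing in locations 0 and 4 of the dest array of its destination processor.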
#include "bspedupack.h"

/* This program measures p, r, g, and l of a BSP computer
   using bsp_put for communication.
*/

#define NITERS 100 /* number of iterations */


#define MAXN 1024 /* maximum length of DAXPY computation */
#define MAXH 256 /* maximum h in h-relation */
#define MEGA 1000000.0

int P; /* number of processors requested */

void leastsquares(int h0, int h1, double *t, double *g, double *l){
/* This function computes the parameters g and l of the
linear function T(h)= g*h+l that best fits
the data points (h,t[h]) with h0 <= h <= h1. */

double nh, sumt, sumth, sumh, sumhh, a;


int h;

nh= h1-h0+1;
/* Compute sums:
sumt = sum of t[h] over h0 <= h <= h1
sumth = t[h]*h
sumh = h
sumhh = h*h */
sumt= sumth= 0.0;
for (h=h0; h<=h1; h++){
sumt += t[h];
sumth += t[h]*h;
}
sumh= (h1*h1-h0*h0+h1+h0)/2;
sumhh= ( h1*(h1+1)*(2*h1+1) - (h0-1)*h0*(2*h0-1))/6;

/* Solve nh*l   + sumh*g  = sumt
   sumh*l + sumhh*g = sumth */

if(fabs(nh)>fabs(sumh)){
a= sumh/nh;
/* subtract a times first eqn from second eqn */
*g= (sumth-a*sumt)/(sumhh-a*sumh);
*l= (sumt-sumh* *g)/nh;
} else {
a= nh/sumh;
/* subtract a times second eqn from first eqn */
*g= (sumt-a*sumth)/(sumh-a*sumhh);
*l= (sumth-sumhh* *g)/sumh;
}

} /* end leastsquares */

void bspbench(){
void leastsquares(int h0, int h1, double *t, double *g, double *l);
int p, s, s1, iter, i, n, h, destproc[MAXH], destindex[MAXH];
double alpha, beta, x[MAXN], y[MAXN], z[MAXN], src[MAXH], *dest,
time0, time1, time, *Time, mintime, maxtime,
nflops, r, g0, l0, g, l, t[MAXH+1];

/**** Determine p ****/


bsp_begin(P);
p= bsp_nprocs(); /* p = number of processors obtained */
s= bsp_pid(); /* s = processor number */

Time= vecallocd(p); bsp_push_reg(Time,p*SZDBL);


dest= vecallocd(2*MAXH+p); bsp_push_reg(dest,(2*MAXH+p)*SZDBL);
bsp_sync();

/**** Determine r ****/


for (n=1; n <= MAXN; n *= 2){
/* Initialize scalars and vectors */
alpha= 1.0/3.0;
beta= 4.0/9.0;
for (i=0; i<n; i++){
z[i]= y[i]= x[i]= (double)i;
}
/* Measure time of 2*NITERS DAXPY operations of length n */
time0=bsp_time();
for (iter=0; iter<NITERS; iter++){
for (i=0; i<n; i++)
y[i] += alpha*x[i];
for (i=0; i<n; i++)
z[i] -= beta*x[i];
}
time1= bsp_time();
time= time1-time0;
bsp_put(0,&time,Time,s*SZDBL,SZDBL);
bsp_sync();

/* Processor 0 determines minimum, maximum, average computing rate */
if (s==0){
mintime= maxtime= Time[0];
for(s1=1; s1<p; s1++){
mintime= MIN(mintime,Time[s1]);
maxtime= MAX(maxtime,Time[s1]);
}
if (mintime>0.0){
/* Compute r = average computing rate in flop/s */
nflops= 4*NITERS*n;
r= 0.0;
for(s1=0; s1<p; s1++)
r += nflops/Time[s1];
r /= p;
printf("n= %5d min= %7.3lf max= %7.3lf av= %7.3lf Mflop/s ",
n, nflops/(maxtime*MEGA),nflops/
(mintime*MEGA), r/MEGA);
fflush(stdout);
/* Output for fooling benchmark-detecting compilers */
printf(" fool=%7.1lf\n",y[n-1]+z[n-1]);
} else
printf("minimum time is 0\n"); fflush(stdout);
}
}

/**** Determine g and l ****/


for (h=0; h<=MAXH; h++){
/* Initialize communication pattern */
for (i=0; i<h; i++){
src[i]= (double)i;
if (p==1){
destproc[i]=0;
destindex[i]=i;
} else {
/* destination processor is one of the p-1 others */
destproc[i]= (s+1 + i%(p-1)) %p;
/* destination index is in my own part of dest */
destindex[i]= s + (i/(p-1))*p;
}
}

/* Measure time of NITERS h-relations */


bsp_sync();
time0= bsp_time();
for (iter=0; iter<NITERS; iter++){
for (i=0; i<h; i++)
bsp_put(destproc[i],&src[i],dest,destindex[i]*SZDBL,
SZDBL);
bsp_sync();
}
time1= bsp_time();
time= time1-time0;

/* Compute time of one h-relation */


if (s==0){
t[h]= (time*r)/NITERS;
printf("Time of %5d-relation= %lf sec= %8.0lf flops\n",
h, time/NITERS, t[h]); fflush(stdout);
}
}

if (s==0){
printf("size of double = %d bytes\n",(int)SZDBL);
leastsquares(0,p,t,&g0,&l0);
printf("Range h=0 to p : g= %.1lf, l= %.1lf\n",g0,l0);
leastsquares(p,MAXH,t,&g,&l);
printf("Range h=p to HMAX: g= %.1lf, l= %.1lf\n",g,l);
printf("The bottom line for this BSP computer is:\n");
printf("p= %d, r= %.3lf Mflop/s, g= %.1lf, l= %.1lf\n",
p,r/MEGA,g,l);
fflush(stdout);
}
bsp_pop_reg(dest); vecfreed(dest);
bsp_pop_reg(Time); vecfreed(Time);

bsp_end();
} /* end bspbench */

int main(int argc, char **argv){

bsp_init(bspbench, argc, argv);


printf("How many processors do you want to use?\n");
fflush(stdout);
scanf("%d",&P);
if (P > bsp_nprocs()){
printf("Sorry, not enough processors available.\n");
exit(1);
}
bspbench();
exit(0);

} /* end main */

1.7 Benchmark results


What is the cheapest parallel computer you can buy? Two personal computers
connected by a cable. What performance does this configuration give you?
Certainly p = 2, and perhaps r = 122 Mflop/s, g = 1180, and l = 138 324. These
BSP parameters were obtained by running the program bspbench on two PCs
from a cluster that fills up a cabinet at the Oxford office of Sychron. This
cluster consists of 11 identical Pentium-II PCs with a clock rate of 400 MHz
running the Linux operating system, and a connection network of four Fast
Ethernets. Each Ethernet of this cluster consists of a Cisco Catalyst switch
with 11 cables, each ending in a Network Interface Card (NIC) at the back
of a different PC. (A switch connects pairs of PCs; pairs can communic-
ate independently from other pairs.) Thus, each Ethernet connects all PCs.
Having four Ethernets increases the communication capacity fourfold and it
gives each PC four possible routes to every other PC, which is useful if some
Ethernets are tied down by other communicating PCs. A Fast Ethernet is
capable of transferring data at the rate of 100 Mbit/s (i.e. 12.5 Mbyte/s or
1 562 500 double-precision reals per second—ignoring overheads). The system
software running on this cluster is the Sychron Virtual Private Server, which
guarantees a certain specified computation/communication performance on
part or all of the cluster, irrespective of use by others. On top of this system,
the portability layers MPI and BSPlib are available. In our experiments on
this machine, we used version 1.4 of BSPlib with communication optimization
level 2 and the GNU C compiler with computation optimization level 3 (our
compiler flags were -flibrary-level 2 -O3).
Figure 1.11 shows how you can build a parallel computer from cheap com-
modity components such as PCs running Linux, cables, and simple switches.
This figure gives us a look inside the black box of the basic BSP architecture
shown in Fig. 1.1. A parallel computer built in DIY (do-it-yourself) fashion
is often called a Beowulf, after the hero warrior of the Old English epic, pre-
sumably in admiration of Beowulf’s success in slaying the supers of his era.
The Sychron cluster is an example of a small Beowulf; larger ones of hundreds
of PCs are now being built by cost-conscious user groups in industry and in
academia, see [168] for a how-to guide.

Fig. 1.11. Beowulf cluster of eight PCs connected by four switches. Each PC is
connected to all switches.

Figure 1.12 shows the time of an h-relation on two PCs of the Sychron
Beowulf. The benchmark was run with MAXN decreased from the default
1024 to 512, to make the vectors fit in primary cache (i.e. the fastest cache);
for length 1024 and above the computing rate decreases sharply. The value
of MAXH was increased from the default 256 to 800, because in this case
g ≪ l and hence a larger range of h-values is needed to obtain an accurate
value for g from measurements of hg + l. Finding the right input paramet-
ers MAXN, MAXH, and NITERS for the benchmark program may require trial
and error; plotting the data is helpful in this process. It is unlikely that one
set of default parameters will yield sensible measurements for every parallel
computer.

Fig. 1.12. Time of an h-relation on two connected PCs. The values shown are for
even h with h ≤ 500.

What is the most expensive parallel computer you can buy? A super-
computer, by definition. Commonly, a supercomputer is defined as
one of today’s top performers in terms of computing rate, communica-
tion/synchronization rate, and memory size. Most likely, the cost of a
supercomputer will exceed a million US dollars. An example of a supercom-
puter is the Cray T3E, which is a massively parallel computer with distributed
memory and a communication network in the form of a three-dimensional
torus (i.e. a mesh with wraparound links at the boundaries). We have bench-
marked up to 64 processors of the 128-processor machine called Vermeer,
after the famous Dutch painter, which is located at the HPαC supercom-
puter centre of Delft University of Technology. Each node of this machine
consists of a DEC Alpha 21164 processor with a clock speed of 300 MHz, an
advertised peak performance of 600 Mflop/s, and 128 Mbyte memory. In our
experiments, we used version 1.4 of BSPlib with optimization level 2 and the
Cray C compiler with optimization level 3.
The measured single-processor computing rate is 35 Mflop/s, which is
much lower than the theoretical peak speed of 600 Mflop/s. The main reason
for this discrepancy is that we measure the speed for a DAXPY opera-
tion written in C, whereas the highest performance on this machine can
only be obtained by performing matrix–matrix operations such as DGEMM
(Double precision GEneral Matrix–Matrix multiplication) and then only when
using well-tuned subroutines written in assembler language. The BLAS (Basic
Linear Algebra Subprograms) library [59,60,126] provides a portable interface
to a set of subroutines for the most common vector and matrix operations,
such as DAXPY and DGEMM. (The terms ‘DAXPY’ and ‘DGEMM’ origin-
ate in the BLAS definition.) Efficient BLAS implementations exist for most
machines. A complete BLAS list is given in [61,Appendix C]. A Cray T3E ver-
sion of the BLAS is available; its DGEMM approaches peak performance. A
note of caution: our initial, erroneous result on the Cray T3E was a computing
rate of 140 Mflop/s. This turned out to be due to the Cray timer IRTC, which
is called by BSPlib on the Cray and ran four times slower than it should. This
error occurs only in version 1.4 of BSPlib, in programs compiled at BSPlib
level 2 for the Cray T3E.
Figure 1.13 shows the time of an h-relation on 64 processors of the Cray
T3E. The time grows more or less linearly, but there are some odd jumps,
for instance the sudden significant decrease around h = 130. (Sending more
data takes less time!) It is beyond our scope to explain every peculiarity of
every benchmarked machine. Therefore, we feel free to leave some surprises,
like this one, unexplained.

Fig. 1.13. Time of an h-relation on a 64-processor Cray T3E.

Table 1.2 shows the BSP parameters obtained by benchmarking the Cray
T3E for up to 64 processors. The results for p = 1 give an indication of the
overhead of running a bulk synchronous parallel program on one processor.
For the special case p = 1, the value of MAXH was decreased to 16, because
l ≈ g and hence a smaller range of h-values is needed to obtain an accurate
value for l from measurements of hg + l. (Otherwise, even negative values of
l could appear.) The table shows that g stays almost constant for p ≤ 16 and
that it grows slowly afterwards. Furthermore, l grows roughly linearly with
p, but occasionally it behaves strangely: l suddenly decreases on moving from
16 to 32 processors. The explanation is hidden inside the black box of the
communication network. A possible explanation is the increased use of wrap-
around links when increasing the number of processors. (For a small number of
processors, all boundary links of a subpartition connect to other subpartitions,
instead of wrapping around to the subpartition itself; thus, the subpartition
is a mesh, rather than a torus. Increasing the number of processors makes
the subpartition look more like a torus, with richer connectivity.) The time
of a 0-relation (i.e. the time of a superstep without communication) displays
a smoother behaviour than that of l, and it is presented here for comparison.
This time is a lower bound on l, since it represents only part of the fixed cost
of a superstep.

Table 1.2. Benchmarked BSP parameters p, g, l and the time of a 0-relation for
a Cray T3E. All times are in flop units (r = 35 Mflop/s)

     p     g       l   Tcomm(0)
     1    36      47         38
     2    28     486        325
     4    31     679        437
     8    31    1193        580
    16    31    2018        757
    32    72    1145        871
    64    78    1825       1440

Another prominent supercomputer, at the time of writing these lines of
course since machines come and go quickly, is the IBM RS/6000 SP. This
machine evolved from the RS/6000 workstation and it can be viewed as a
tightly coupled cluster of workstations (without the peripherals, of course). We
benchmarked eight processors of the 76-processor SP at the SARA supercom-
puter centre in Amsterdam. The subpartition we used contains eight so-called
thin nodes connected by a switch. Each node contains a 160 MHz PowerPC
processor, 512 Mbyte of memory, and a local disk of 4.4 Gbyte. The theoretical
peak rate of a processor is about 620 Mflop/s, similar to the Cray T3E above.
Figure 1.14 shows the time of an h-relation on p = 8 processors of the SP.
Note the effect of optimization by BSPlib: for h < p, the time of an h-relation
increases rapidly with h. For h = p, however, BSPlib detects that the pth
message of each processor is sent to the same destination as the first message,
so that it can combine the messages. Every additional message can also be
combined with a previous one. As a result, the number of messages does not
increase further, and g grows only slowly. Also note the statistical outliers,
that is, those values that differ considerably from the others. All outliers are
high, indicating interference by message traffic from other users. Although the
system guarantees exclusive access to the processors used, it does not guar-
antee the same for the communication network. This makes communication
benchmarking difficult. More reliable results can be obtained by repeating
the benchmark experiment, or by organizing a party for one’s colleagues (and
sneaking out in the middle), in the hope of encountering a traffic-free period.
A plot will reveal whether this has happened. The resulting BSP parameters
of our experiment are p = 8, r = 212 Mflop/s, g = 187, and l = 148 212.
Note that the interference from other traffic hardly influences g, because the
slope of the fitted line is the same as that of the lower string of data (which
can be presumed to be interference-free). The value of l, however, is slightly
overestimated.

Fig. 1.14. Time of an h-relation on an 8-processor IBM SP.

The last machine we benchmark is the Silicon Graphics Origin 2000. This
architecture has a physically distributed memory, like the other three bench-
marked computers. In principle, this promises scalability because memory,
communication links, and other resources can grow proportionally with the
number of processors. For ease of use, however, the memory can also be made
to look like shared memory. Therefore, the Origin 2000 is often promoted as a
‘Scalable Shared Memory Multiprocessor’. Following BSP doctrine, we ignore
the shared-memory facility and use the Origin 2000 as a distributed-memory
machine, thereby retaining the advantage of portability.
The Origin 2000 that we could lay our hands on is Oscar, a fine
86-processor parallel computer at the Oxford Supercomputer Centre of Oxford
University. Each processing element of the Origin 2000 is a MIPS R10000 pro-
cessor with a clock speed of 195 MHz and a theoretical peak performance of
390 Mflop/s. We compiled our programs using the SGI MIPSpro C-compiler
with optimization flags switched on as before. We used eight processors in our
benchmark. The systems managers were kind enough to empty the system
of other jobs, and hence we could benchmark a dedicated system. The res-
ulting BSP parameters are p = 8, r = 326 Mflop/s, g = 297, and l = 95 686.
Figure 1.15 shows the results.
Table 1.3 presents benchmark results for three different computers with
the number of processors fixed at eight. The parameters g and l are given
not only in flops but also in raw microseconds to make it easy to compare
machines with widely differing single-processor performance. The Cray T3E
is the best balanced machine: the low values of g and l in flops mean that
communication/synchronization performance of the Cray T3E is excellent,
relative to the computing performance. The low values of l in microseconds tell
us that in absolute terms synchronization on the Cray is still cheap. The main
drawback of the Cray T3E is that it forces the user to put effort into optimizing
programs, since straightforward implementations such as our benchmark do
not attain top performance. For an unoptimized program, eight processors of
the T3E are slower than a single processor of the Origin 2000. The SP and the
Origin 2000 are similar as BSP machines, with the Origin somewhat faster in
computation and synchronization.
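The conversion between the two units is simply a division by the computing rate r:
for the Cray T3E, for example, g = 31 flop units at r = 35 Mflop/s corresponds to
about 0.9 µs, and l = 1193 flop units to about 34 µs.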

Fig. 1.15. Time of an h-relation on an 8-processor SGI Origin.

Table 1.3. Comparing the BSP parameters for three different parallel computers
with p = 8

Computer            r (Mflop/s)    g (flop)    l (flop)    g (µs)    l (µs)

Cray T3E                     35          31       1 193      0.88        34
IBM RS/6000 SP              212         187     148 212      0.88       698
SGI Origin 2000             326         297      95 686      0.91       294

1.8 Bibliographic notes


1.8.1 BSP-related models of parallel computation
Historically, the Parallel Random Access Machine (PRAM) has been the most
widely studied general-purpose model of parallel computation. In this model,
processors can read from and write to a shared memory. Several variants
of the PRAM model can be distinguished, based on the way concurrent
memory access is treated: the concurrent read, concurrent write (CRCW)
variant allows full concurrent access, whereas the exclusive read, exclusive
write (EREW) variant allows only one processor to access the memory at a
time. The PRAM model ignores communication costs and is therefore mostly
of theoretical interest; it is useful in establishing lower bounds for the cost
of parallel algorithms. The PRAM model has stimulated the development of
many other models, including the BSP model. The BSP variant with auto-
matic memory management by randomization in fact reduces to the PRAM
model in the asymptotic case g = l = O(1). For an introduction to the PRAM
model, see the survey by Vishkin [189]. For PRAM algorithms, see the survey
by Spirakis and Gibbons [166,167] and the book by JáJá [113].
The BSP model has been proposed by Valiant in 1989 [177]. The full
description of this ‘bridging model for parallel computation’ is given in [178].
This article describes the two basic variants of the model (automatic memory
management or direct user control) and it gives a complexity analysis of
algorithms for fast Fourier transform, matrix–matrix multiplication, and sort-
ing. In another article [179], Valiant proves that a hypercube or butterfly
architecture can simulate a BSP computer with optimal efficiency. (Here, the
model is called XPRAM.) The BSP model as it is commonly used today has
been shaped by various authors since the original work by Valiant. The survey
by McColl [132] argues that the BSP model is a promising approach to general-
purpose parallel computing and that it can deliver both scalable performance
and architecture independence. Bisseling and McColl [21,22] propose the vari-
ant of the model (with pure computation supersteps of cost w + l and pure
communication supersteps of cost hg + l) that is used in this book. They show
how a variety of scientific computations can be analysed in a simple manner
by using their BSP variant. McColl [133] analyses and classifies several
important BSP algorithms, including dense and sparse matrix–vector mul-
tiplication, matrix–matrix multiplication, LU decomposition, and triangular
system solution.
The LogP model by Culler et al. [49] is an offspring of the BSP model,
which uses four parameters to describe relative machine performance: the
latency L, the overhead o, the gap g, and the number of processors P , instead
of the three parameters l, g, and p of the BSP model. The LogP model
treats messages individually, not in bulk, and hence it does not provide the
notion of a superstep. The LogP model attempts to reflect the actual machine
architecture more closely than the BSP model, but the price to be paid is an
increase in the complexity of algorithm design and analysis. Bilardi et al. [17]
show that the LogP and BSP model can simulate each other efficiently so that
in principle they are equally powerful.
The YPRAM model by de la Torre and Kruskal [54] characterizes a parallel
computer by its latency, bandwidth inefficiency, and recursive decomposabil-
ity. The decomposable BSP (D-BSP) model [55] is the same model expressed
in BSP terms. In this model, a parallel computer can be decomposed into sub-
machines, each with their own parameters g and l. The parameters g and l of
submachines will in general be lower than those of the complete machine. The
scaling behaviour of the submachines with p is described by functions g(p) and


l(p). This work could provide a theoretical basis for subset synchronization
within the BSP framework.
The BSPRAM model by Tiskin [174] replaces the communication network
by a shared memory. At the start of a superstep, processors read data from
the shared memory into their own local memory; then, they compute inde-
pendently using locally held data; and finally they write local data into the
shared memory. Access to the shared memory is in bulk fashion, and the cost
function of such access is expressed in g and l. The main aim of the BSPRAM
model is to allow programming in shared-memory style while keeping the
benefits of data locality.

1.8.2 BSP libraries


The first portable BSP library was the Oxford BSP library by Miller and
Reed [140,141]. This library contains six primitives: put, get, start of super-
step, end of superstep, start of program, and end of program. The Cray
SHMEM library [12] can be considered as a nonportable BSP library. It con-
tains among others: put, strided put, get, strided get, and synchronization.
The Oxford BSP library is similar to the Cray SHMEM library, but it is
available for many different architectures. Neither of these libraries allows
communication into dynamically allocated memory. The Green BSP library
by Goudreau et al. [80,81] is a small experimental BSP library of seven prim-
itives. The main difference with the Oxford BSP library is that the Green
BSP library communicates by bulk synchronous message passing, which will
be explained in detail in Chapter 4. The data sent in a superstep is writ-
ten into a remote receive-buffer. This is one-sided communication, because
the receiver remains passive when the data is being communicated. In the
next superstep, however, the receiver becomes active: it must retrieve the
desired messages from its receive-buffer, or else they will be lost forever.
Goudreau et al. [80] present results of numerical experiments using the Green
BSP library in ocean eddy simulation, computation of minimum spanning
trees and shortest paths in graphs, n-body simulation, and matrix–matrix
multiplication.
Several communication libraries and language extensions exist that enable
programming in BSP style but do not fly the BSP flag. The Split-C lan-
guage by Culler et al. [48] is a parallel extension of C. It provides put and get
primitives and additional features such as global pointers and spread arrays.
The Global Array toolkit [146] is a software package that allows the creation
and destruction of distributed matrices. It was developed in the first instance for
use in computational chemistry. The Global Array toolkit and the underly-
ing programming model include features such as one-sided communication,
global synchronization, relatively cheap access to local memory, and uni-
formly more expensive access to remote memory. Co-Array Fortran [147],
formerly called F−− , is a parallel extension of Fortran 95. It represents a


strict version of an SPMD approach: all program variables exist on all pro-
cessors; remote variables can be accessed by appending the processor number
in square brackets, for example, x(3)[2] is the variable x(3) on P (2). A
put is concisely formulated by using such brackets in the left-hand side of an
assignment, for example, x(3)[2]=y(3), and a get by using them on the right-
hand side. Processors can be synchronized in subsets or even in pairs. The
programmer needs to be aware of the cost implications of the various types of
assignments.
BSPlib, used in this book, combines the capabilities of the Oxford BSP
and the Green BSP libraries. It has grown into a de facto standard, which
is fully defined by Hill et al. [105]. These authors also present results for fast
Fourier transformation, randomized sample sorting, and n-body simulation
using BSPlib. Frequently asked questions about BSP and BSPlib, such as
‘Aren’t barrier synchronizations expensive?’ are answered by Skillicorn, Hill,
and McColl [163].
Hill and Skillicorn [106] discuss how to implement BSPlib efficiently. They
demonstrate that postponing communication is worthwhile, since this allows
messages to be reordered (to avoid congestion) and combined (to reduce
startup costs). If the natural ordering of the communication in a superstep
requires every processor to put data first into P (0), then into P (1), and so
on, this creates congestion at the destination processors, even if the total
h-relation is well-balanced. To solve this problem, Hill and Skillicorn use a
p × p Latin square, that is, a matrix with permutations of {0, . . . , p − 1} in
the rows and columns, as a schedule for the communication. An example is
the 4 × 4 Latin square

R = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 1 & 2 & 3 & 0 \\ 2 & 3 & 0 & 1 \\ 3 & 0 & 1 & 2 \end{pmatrix}.    (1.10)
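To make the schedule concrete, the following small C sketch generates a Latin square by the rule rij = (i + j) mod p, which reproduces the example above, and prints for each processor the order in which it would address the destination processors. The sketch is purely illustrative; it is not part of BSPlib or the Oxford BSP toolset.

#include <stdio.h>

/* Destination of processor i in communication round j, using the
   Latin square r_ij = (i+j) mod p; this choice reproduces (1.10). */
int latin_dest(int i, int j, int p) {
    return (i + j) % p;
}

int main(void) {
    int p = 4;   /* number of processors in the example */
    for (int i = 0; i < p; i++) {
        printf("P(%d) sends in rounds 0..%d to:", i, p - 1);
        for (int j = 0; j < p; j++)
            printf(" P(%d)", latin_dest(i, j, p));
        printf("\n");
    }
    return 0;
}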
The communication of the superstep is done in p rounds. In round j, pro-
cessor P (i) sends all data destined for P (rij ). In another article [107], Hill
and Skillicorn discuss the practical issues of implementing a global synchron-
ization on different machines. In particular, they show how to abuse the
cache-coherence mechanism of a shared-memory parallel computer to obtain
an extremely cheap global synchronization. (In one case, 13.4 times faster than
the vendor-provided synchronization.) Donaldson, Hill, and Skillicorn [57]
show how to implement BSPlib efficiently on top of the TCP/IP and UDP/IP
protocols for a Network Of Workstations (NOW) connected by an Ethernet.
They found that it is important to release data packets at an optimal rate into
the Ethernet. Above this rate, data packets will collide too often, in which
case they must be resent; this increases g. Below the optimal rate, the network
capacity is underused. Hill, Donaldson, and Lanfear [102] present an imple-


mentation of BSPlib that continually migrates p processes around a NOW,
taking care to run them on the least used workstations. After the work loads
change, for example, because the owner returns from lunch, the next global
synchronization provides a natural point to stop the computation, dump all
data onto disk, and restart the computation on a less busy set of worksta-
tions. As an additional benefit, this provides fault tolerance, for instance the
capability to recover from hardware failures.
The Oxford BSP toolset [103] contains an implementation of the C and
Fortran 90 versions of the BSPlib standard. This library is used in most numer-
ical experiments of this book. The toolset also contains profiling tools [101,104]
that can be used to measure and visualize the amount of computation and the
amount of incoming and outgoing data of each processor during a sequence
of supersteps.
The Paderborn University BSP (PUB) library [28,30] is an implementa-
tion of the C version of the BSPlib standard with extensions. One extension
of PUB is zero-cost synchronization [5], also called oblivious synchron-
ization, which exploits the fact that for certain regular computations each
receiving processor P (s) knows the number of data words hr (s) it will receive.
Processor P (s) performs a bsp_oblsync(hr (s)) operation at the end of the
superstep, instead of a bsp_sync. This type of synchronization is cheap
because no communication is needed to determine that processors can move
on to the next superstep. Another extension is partitioning (by using the
primitive bsp_partition), which is decomposing a BSP machine into sub-
machines, each of them again a BSP machine. The submachines must be
rejoined later (by using the primitive bsp_done). Each submachine can again
be partitioned, in a recursive fashion. The processors of a submachine must
be numbered consecutively. The partitioning mechanism provides a discip-
lined form of subset synchronization. The PUB library also includes a large
set of collective-communication functions. Bonorden et al. [29] compare the
performance on the Cray T3E of PUB with that of the Oxford BSP toolset
and MPI.
The one-sided communications added by MPI-2 [138] to the original MPI
standard can also be viewed as comprising a BSP library. The MPI-2 primit-
ives for one-sided communications are put, get, and accumulate; their use is
demonstrated in the fifth program, mpimv, of Appendix C. For the complete
MPI-2 standard with annotations, see [83]. For a tutorial introduction, see
the book by Gropp, Lusk, and Thakur [85].

1.8.3 The non-BSP world: message passing


In contrast to the one-sided communication of BSP programming, traditional
message passing uses two-sided communication, which involves both an active
sender and an active receiver. The underlying model is that of communicat-
ing sequential processes (CSP) by Hoare [108]. Several libraries support this
type of communication. The first portable communication library, parallel


virtual machine (PVM), was developed by Sunderam [171]. PVM is a message-
passing library that enables computing on heterogeneous networks, that
is, networks of computers with different architectures. PVM has evolved into a
standard [75] and PVM implementations are available for many different par-
allel computers. The two strong points of PVM compared with other systems
such as BSPlib and MPI-1 are its support of heterogeneous architectures and
its capability of dynamic process creation. This means that processors can be
added (or removed) during a computation. These features make PVM attract-
ive for certain Beowulf clusters, in particular heterogeneous ones. Geist, Kohl,
and Papadopoulos [76] compare PVM with MPI-1.
The message-passing interface (MPI) [137], based on traditional message
passing, was defined in 1994 by the MPI Forum, a committee of users and man-
ufacturers of parallel computers. The initial definition is now known as MPI-1.
For the complete, most recent MPI-1 standard with annotations, see [164].
For a tutorial introduction, see the book by Gropp, Lusk, and Skjellum [84].
An introduction to parallel programming that uses MPI-1 is the textbook
by Pacheco [152]. An introduction that uses PVM and MPI-1 for distributed-
memory parallel programming and Pthreads for shared-memory programming
is the textbook by Wilkinson and Allen [191]. An introduction to parallel com-
puting that uses MPI-1 for distributed-memory parallel programming and
Pthreads and OpenMP for shared-memory programming is the textbook by
Grama et al. [82]. Today, MPI-1 is the most widely used portability layer for
parallel computers.

1.8.4 Benchmarking
The BSPlib definition [105] presents results obtained by the optimized bench-
marking program bspprobe, which is included in the Oxford BSP toolset [103].
The values of r and l that were measured by bspprobe agree well with those
of bspbench. The values of g, however, are much lower: for instance, the value
g = 1.6 for 32-bit words at r = 47 Mflop/s given in [105,Table 1] corresponds
to g = 0.07 µs for 64-bit words, which is 12.5 times less than the 0.88 µs
measured by bspbench, see Table 1.3. This is due to the high optimization
level of bspprobe: data are sent in blocks instead of single words and high-
performance puts are used instead of buffered puts. The goal of bspprobe is
to measure communication performance for optimized programs and hence its
bottom line takes as g-value the asymptotic value for large blocks. The effect
of such optimizations will be studied in Chapter 2. The program bspprobe
measures g for two different h-relations with the same h: (i) a local cyclic
shift, where every processor sends h data to the next higher-numbered pro-
cessor; (ii) a global all-to-all procedure where every processor sends h/(p − 1)
data to every one of the others. In most cases, the difference between the two
g-values is small. This validates the basic assumption of the BSP model,
namely that costs can be expressed in terms of h. The program bspprobe


takes as l-value the cost of a synchronization in the absence of communica-
tion, that is, Tcomm (0), see Table 1.2. BSP parameters obtained for the Green
BSP library are given by Goudreau et al. [80].
Benchmark results for machines ranging from personal computers to
massively parallel supercomputers are collected and regularly updated by
Dongarra [58]. These results represent the total execution rates for solving
a dense n × n linear system of equations with n = 100 and n = 1000 by
the LINPACK software and with unlimited n by any suitable software. The
slowest machine included is an HP48 GX pocket calculator which achieves
810 flop/s on the n = 100 benchmark. A PC based on the 3.06 GHz Intel
Pentium-IV chip achieves a respectable 1.41 Gflop/s for n = 100 and 2.88
Gflop/s for n = 1000 (1 Gflop = 1 Gigaflop = 10^9 flop). Most interesting is
the unrestricted benchmark, see [58,Table 3], which allows supercomputers
to show off and demonstrate their capabilities. The table gives: rmax , the
maximum rate achieved; rpeak , the theoretical peak rate; nmax , the size of
the system at which rmax is achieved; and n1/2 , the size at which half of
rmax is obtained. The n1/2 parameter is widely used as a measure of startup
overhead. Low n1/2 values promise top rates already for moderate problem
sizes, see The Science of Computer Benchmarking by Hockney [109]. The
value of rmax is the basis for the Top 500 list of supercomputer sites, see
http://www.top500.org.
To be called a supercomputer, at present a computer must achieve at least
1 Tflop/s (1 Tflop = 1 Teraflop = 10^12 flop). The fastest existing number
cruncher is the Earth Simulator, which was custom-built by NEC for the
Japan Marine Science and Technology Center. This computer occupies its
own building, has 5120 processors, and it solves a linear system of size n =
1 075 200 at rmax = 36 Tflop/s. Of course, computing rates increase quickly,
and when you read these lines, the fastest computer may well have passed the
Pflop/s mark (1 Pflop = 1 Petaflop = 10^15 flop).

1.9 Exercises
1. Algorithm 1.1 can be modified to combine the partial sums into one global
sum by a different method. Let p = 2^q , with q ≥ 0. Modify the algorithm to
combine the partial sums by repeated pairing of processors. Take care that
every processor obtains the final result. Formulate the modified algorithm
exactly, using the same notation as in the original algorithm. Compare the
BSP cost of the two algorithms. For which ratio l/g is the pairwise algorithm
faster?
2. Analyse the following operations and derive the BSP cost for a parallel
algorithm. Let x be the input vector (of size n) of the operation and y the out-
put vector. Assume that these vectors are block distributed over p processors,
with p ≤ n. Furthermore, k is an integer with 1 ≤ k ≤ n. The operations


are:
(a) Minimum finding: determine the index j of the component with the
minimum value and subtract this value from every component: yi =
xi − xj , for all i.
(b) Shifting (to the right): assign y(i+k) mod n = xi .
(c) Smoothing: replace each component by a moving average
    yi = (1/(k + 1)) \sum_{j=i−k/2}^{i+k/2} xj , where k is even.
(d) Partial summing: compute yi = \sum_{j=0}^{i} xj , for all i. (This problem is an
instance of the parallel prefix problem.)
(e) Sorting by counting: sort x by increasing value and place the result in
y. Each component xi is an integer in the range 0–k, where k ≪ n.
3. Get acquainted with your parallel computer before you use it.
(a) Run the program bspbench on your parallel computer. Measure
the values of g and l for various numbers of processors. How does
the performance of your machine scale with p?
(b) Modify bspbench to measure bsp_gets instead of bsp_puts. Run the
modified program for various p. Compare the results with those of
the original program.
4. Since their invention, computers have been used as tools in cryptanalytic
attacks on secret messages; parallel computers are no exception. Assume a
plain text has been encrypted by the classic method of monoalphabetic sub-
stitution, where each letter from the alphabet is replaced by another one
and where blanks and punctuation characters are deleted. For such a simple
encryption scheme, we can apply statistical methods to uncover the mes-
sage. See Bauer [14] for more details and also for a fascinating history of
cryptology.
(a) Let t = (t0 , . . . , tn−1 )T be a cryptotext of n letters and t′ another
cryptotext, of the same length, language, and encryption alpha-
bet. With a bit of luck, we can determine the language of the
texts by computing Friedman’s Kappa value, also called the index of
coincidence,
κ(t, t′ ) = (1/n) \sum_{i=0}^{n−1} δ(ti , t′i ).

Here, δ(x, y) = 1 if x = y, and δ(x, y) = 0 otherwise. The value of


κ tells us how likely it is that two letters in the same position of the
texts are identical. Write a parallel program that reads an encrypted
text, splits it into two parts t and t′ of equal size (dropping the last
letter if necessary), and computes κ(t, t′ ). Motivate your choice of data
distribution.
(b) Find a suitable cryptotext as input and compute its κ. Guess its
language by comparing the result with the κ-values found by Kullback
(reproduced in [14]): Russian 5.29%, English 6.61%, German 7.62%,
French 7.78%.
(c) Find out whether Dutch is closer to English or German.
(d) Extend your program to compute all letter frequencies in the input
text. In English, the ‘e’ is the most frequent letter; its frequency is
about 12.5%.
(e) Run your program on some large plain texts in the language just
determined to obtain a frequency profile of that language. Run your
program on the cryptotext and establish its letter frequencies. Now
break the code.
(f) Is parallelization worthwhile in this case? When would it be?
5. (∗) Data compression is widely used to reduce the size of data files,
for instance texts or pictures to be transferred over the Internet. The
LZ77 algorithm by Ziv and Lempel [193] passes through a text and uses
the most recently accessed portion as a reference dictionary to shorten
the text, replacing repeated character strings by pointers to their first
occurrence. The popular compression programs PKZIP and gzip are based
on LZ77.
Consider the text
‘yabbadabbadoo’

(Fred Flintstone, Stone Age). Assume we arrive at the second occurrence of


the string ‘abbad’. By going back 5 characters, we find a matching string of
length 5. We can code this as the triple of decimal numbers (5,5,111), where
the first number in the triple is the number of characters we have to go back
and the second number the length of the matching string. The number 111
is the ASCII code for the lower-case ‘o’, which is the next character after the
second ‘abbad’. (The lower-case characters ‘a’–‘z’ are numbered 97–122 in the
ASCII set.) Giving the next character ensures progress, even if no match was
found. The output for this example is: (0,0,121), (0,0,97), (0,0,98), (1,1,97),
(0,0,100), (5,5,111), (1,1,−1). The ‘−1’ means end of input. If more matches
are possible, the longest one is taken. For longer texts, the search for a match
is limited to the last m characters before the current character (the search
window); the string to be matched is limited to the first n characters starting
at the current character (the look-ahead window).
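To illustrate the triple format (this is only an illustration of the encoding just described, not the compressor asked for below; the struct layout and function name are our own choices), a minimal C decoder applied to the triples listed above reconstructs ‘yabbadabbadoo’.

#include <stdio.h>

typedef struct { int o, l, c; } triple;  /* offset, match length, next character code */

/* Decompress a sequence of triples into buf; returns the number of characters
   written. A triple with c == -1 marks the end of the input, as in the example. */
int lz77_decode(const triple *in, char *buf) {
    int n = 0;                               /* characters written so far */
    for (int k = 0; ; k++) {
        for (int i = 0; i < in[k].l; i++) {  /* copy the match, character by character */
            buf[n] = buf[n - in[k].o];
            n++;
        }
        if (in[k].c == -1) break;            /* end of input: no next character */
        buf[n++] = (char) in[k].c;           /* append the explicit next character */
    }
    buf[n] = '\0';
    return n;
}

int main(void) {
    triple t[] = { {0,0,121}, {0,0,97}, {0,0,98}, {1,1,97},
                   {0,0,100}, {5,5,111}, {1,1,-1} };
    char text[64];
    lz77_decode(t, text);
    printf("%s\n", text);                    /* prints: yabbadabbadoo */
    return 0;
}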
(a) Write a sequential function that takes as input a character sequence
and writes as output an LZ77 sequence of triples (o, l, c), where o is the
offset, that is, number of characters to be moved back, l the length of
the matching substring, and c is the code for the next character. Use
suitable data types for o, l, and c to save space. Take m = n = 512.
Also write a sequential function that decompresses the LZ77 sequence.


Which is fastest, compression or decompression?
(b) Design and implement a parallel LZ77 compression algorithm. You
may adapt the original algorithm if needed for parallelization as long
as the output can be read by the sequential LZ77 program. Hint: make
sure that each processor has all input data it needs, before it starts
compressing.
(c) Now design a parallel algorithm that produces exactly the same output
sequence as the sequential algorithm. You may need several passes
through the data.
(d) Compare the compression factor of your compression programs with
that of gzip. How could you improve the performance?
(e) Is it worthwhile to parallelize the decompression?
6. (∗) A random number generator (RNG) produces a sequence of real
numbers, in most cases uniformly distributed over the interval [0,1], that
are uncorrelated and at least appear to be random. (In fact, the sequence is
usually generated by a computer in a completely deterministic manner.) A
simple and widely used type of RNG is the linear congruential generator
based on the integer sequence defined by

xk+1 = (axk + b) mod m,

where a, b, and m are suitable constants. The starting value x0 is called


the seed. The choice of constants is critical for the quality of the generated
sequence. The integers xk , which are between 0 and m − 1, are converted to
real values rk ∈ [0, 1) by rk = xk /m.
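A minimal sequential sketch of such a generator in C, written with 64-bit integers so that the product a·xk does not overflow; the constants are the Lewis, Goodman, and Miller values quoted in part (b) below, and the function names are our own.

#include <stdio.h>
#include <stdint.h>

static int64_t x;                        /* current state x_k */
static const int64_t A = 16807;          /* a = 7^5 */
static const int64_t M = 2147483647;     /* m = 2^31 - 1, and b = 0 */

void rng_seed(int64_t seed) { x = seed; /* must be nonzero when b = 0 */ }

/* Advance x_{k+1} = (a*x_k) mod m and return the real value x_{k+1}/m in [0,1). */
double rng_next(void) {
    x = (A * x) % M;
    return (double) x / (double) M;
}

int main(void) {
    rng_seed(1);
    for (int k = 0; k < 5; k++)
        printf("%f\n", rng_next());
    return 0;
}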
(a) Express xk+p in terms of xk and some constants. Use this expression
to design a parallel RNG, that is, an algorithm, which generates a
different, hopefully uncorrelated, sequence for each of the p processors
of a parallel computer. The local sequence of a processor should be a
subsequence of the xk .
(b) Implement your algorithm. Use the constants proposed by Lewis,
Goodman, and Miller [129]: a = 7^5 , b = 0, m = 2^31 − 1. This is
a simple multiplicative generator. Do not be tempted to use zero as
a seed!
(c) For the statistically sophisticated. Design a statistical test to check the
randomness of the local sequence of a processor, for example, based on
the χ2 test. Also design a statistical test to check the independence of
the different local sequences, for example, for the case p = 2. Does the
parallel RNG pass these tests?
(d) Use the sequential RNG to simulate a random walk on the two-
dimensional integer lattice Z^2 , where the walker starts at (0,0), and at
each step moves north, east, south, or west with equal probability 1/4.
What is the expected distance to the origin after 100 steps? Create a
large number of walks to obtain a good estimate. Use the parallel RNG
to accelerate your simulation.
(e) Improve the quality of your parallel RNG by adding a local shuffle to
break up short distance correlations. The numbers generated are writ-
ten to a buffer array of length 64 instead of to the output. The buffer
is filled at startup; after that, each time a random number is needed,
one of the array values is selected at random, written to the output,
and replaced by a new number xk . The random selection of the buffer
element is done based on the last output number. The shuffle is due to
Bays and Durham [16]. Check whether this improves the quality of the
RNG. Warning: the resulting RNG has limited applicability, because
m is relatively small. Better parallel RNGs exist, see for instance
the SPRNG package [131], and in serious work such RNGs must
be used.
7. (∗) The sieve of Eratosthenes (276–194 BC) is a method for generating all
prime numbers up to a certain bound n. It works as follows. Start with the
integers from 2 to n. The number 2 is a prime; cross out all larger multiples of
2. The smallest remaining number, 3, is a prime; cross out all larger multiples
of 3. The smallest remaining number, 5, is a prime, etc.
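For reference, a plain sequential transcription of this procedure in C (deliberately naive: it crosses out multiples starting at 2q and lets q run all the way up to n, leaving the improvements of parts (a) and (g) to the reader).

#include <stdio.h>
#include <stdlib.h>

/* Sequential sieve of Eratosthenes: prints all primes up to n.
   prime[x] == 1 means x has not been crossed out (yet). */
void sieve(long n) {
    char *prime = malloc((n + 1) * sizeof(char));
    for (long x = 2; x <= n; x++) prime[x] = 1;

    for (long q = 2; q <= n; q++) {
        if (!prime[q]) continue;           /* q was crossed out: not a prime */
        printf("%ld\n", q);                /* smallest remaining number is a prime */
        for (long x = 2 * q; x <= n; x += q)
            prime[x] = 0;                  /* cross out the larger multiples of q */
    }
    free(prime);
}

int main(void) {
    sieve(1000);
    return 0;
}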

(a) When can we stop?


(b) Write a sequential sieve program. Represent the integers by a suitable
array.
(c) Analyse the cost of the sequential algorithm. Hint: the probability
of an arbitrary integer x ≥ 2 to be prime is about 1/ log x, where
log = loge denotes the natural logarithm. Estimate the total number of
cross-out operations and use some calculus to obtain a simple formula.
Add operation counters to your program to check the accuracy of your
formula.
(d) Design a parallel sieve algorithm. Would you distribute the array over
the processors by blocks, cyclically, or in some other fashion?
(e) Write a parallel sieve program bspsieve and measure its execution
time for n = 1000, 10 000, 100 000, 1 000 000 and p = 1, 2, 4, 8, or use
as many processors as you can lay your hands on.
(f) Estimate the BSP cost of the parallel algorithm and use this estimate
to explain your time measurements.
(g) Can you reduce the cost further? Hints: for the prime q, do you need
to start crossing out at 2q? Does every processor cross out the same
number of integers? Is all communication really necessary?
(h) Modify your program to generate twin primes, that is, pairs of primes
that differ by two, such as (5, 7). (It is unknown whether there are
infinitely many twin primes.)
(i) Extend your program to check the Goldbach conjecture: every even
k > 2 is the sum of two primes. Choose a suitable range of integers to
check. Try to keep the number of operations low. (The conjecture has
been an open question since 1742.)
2
LU DECOMPOSITION

This chapter presents a general Cartesian scheme for the distribution


of matrices. Based on BSP cost analysis, the square cyclic distribu-
tion is proposed as particularly suitable for matrix computations such
as LU decomposition. Furthermore, the chapter introduces two-phase
broadcasting of vectors, which is a useful method for sending copies of
matrix rows or columns to a group of processors. These techniques are
demonstrated in the specific case of LU decomposition, but they are
applicable to almost all parallel matrix computations. After having read
this chapter, you are able to design and implement parallel algorithms
for a wide range of matrix computations, including symmetric linear
system solution by Cholesky factorization and eigensystem solution by
QR decomposition or Householder tridiagonalization.

2.1 The problem


Take a close look at your favourite scientific computing application. Whether
it originates in ocean modelling, oil refinery optimization, electronic circuit
simulation, or in another application area, most likely you will find on close
inspection that its core computation consists of the solution of large systems
of linear equations. Indeed, solving linear systems is the most time-consuming
part of many scientific computing applications. Therefore, we start with this
problem.
Consider a system of linear equations

Ax = b, (2.1)

where A is a given n × n nonsingular matrix, b a given vector of length n,


and x the unknown solution vector of length n. One method for solving this
system is by using LU decomposition, that is, decomposition of the matrix
A into an n × n unit lower triangular matrix L and an n × n upper triangular
matrix U such that
A = LU. (2.2)
An n×n matrix L is called unit lower triangular if lii = 1 for all i, 0 ≤ i < n,
and lij = 0 for all i, j with 0 ≤ i < j < n. An n × n matrix U is called upper
triangular if uij = 0 for all i, j with 0 ≤ j < i < n. Note that we always start
counting at zero—my daughter Sarai was raised that way—and this will turn
out to be an advantage later in life, when encountering parallel computations.


(For instance, it becomes easier to define the cyclic distribution.)
Example 2.1 For
A = \begin{pmatrix} 1 & 4 & 6 \\ 2 & 10 & 17 \\ 3 & 16 & 31 \end{pmatrix}, the decomposition is
L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix},
U = \begin{pmatrix} 1 & 4 & 6 \\ 0 & 2 & 5 \\ 0 & 0 & 3 \end{pmatrix}.

The linear system Ax = b can be solved by first decomposing A into


A = LU and then solving the triangular systems Ly = b and U x = y.
The advantage of LU decomposition over similar methods such as Gaussian
elimination is that the factors L and U can be reused, to solve different systems
Ax = b′ with the same matrix but different right-hand sides. The main text
of the present chapter only deals with LU decomposition. Exercise 9 treats
the parallel solution of triangular systems.
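As a sequential illustration of these two triangular solves (forward substitution for Ly = b followed by backward substitution for Ux = y), here is a C sketch assuming L and U are stored as full n × n arrays; the parallel version is the subject of Exercise 9 and is not attempted here.

/* Solve L y = b by forward substitution (L unit lower triangular),
   then U x = y by backward substitution (U upper triangular). */
void lu_solve(int n, double l[n][n], double u[n][n],
              const double *b, double *x) {
    double y[n];

    for (int i = 0; i < n; i++) {        /* forward: y_i = b_i - sum_{j<i} l_ij y_j */
        y[i] = b[i];
        for (int j = 0; j < i; j++)
            y[i] -= l[i][j] * y[j];
    }
    for (int i = n - 1; i >= 0; i--) {   /* backward: x_i = (y_i - sum_{j>i} u_ij x_j)/u_ii */
        x[i] = y[i];
        for (int j = i + 1; j < n; j++)
            x[i] -= u[i][j] * x[j];
        x[i] /= u[i][i];
    }
}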

2.2 Sequential LU decomposition


In this section, we derive the sequential algorithm that is the basis for develop-
ing our parallel algorithm. By expanding (2.2) and using the fact that lir = 0
for i < r and urj = 0 for r > j, we get

aij = \sum_{r=0}^{n−1} lir urj = \sum_{r=0}^{min(i,j)} lir urj , for 0 ≤ i, j < n.    (2.3)

In the case i ≤ j, we split off the ith term and substitute lii = 1, to obtain
uij = aij − \sum_{r=0}^{i−1} lir urj , for 0 ≤ i ≤ j < n.    (2.4)

Similarly,
lij = (1/ujj ) (aij − \sum_{r=0}^{j−1} lir urj ), for 0 ≤ j < i < n.    (2.5)

Equations (2.4) and (2.5) lead to a method for computing the elements
of L and U . For convenience, we first define the intermediate n × n matrices
A(k) , 0 ≤ k ≤ n, by
a^{(k)}_{ij} = aij − \sum_{r=0}^{k−1} lir urj , for 0 ≤ i, j < n.    (2.6)

Algorithm 2.1. Sequential LU decomposition.

input:  A(0) : n × n matrix.
output: L : n × n unit lower triangular matrix,
        U : n × n upper triangular matrix,
        such that LU = A(0) .

for k := 0 to n − 1 do
    for j := k to n − 1 do
        ukj := a^{(k)}_{kj} ;
    for i := k + 1 to n − 1 do
        lik := a^{(k)}_{ik} /ukk ;
    for i := k + 1 to n − 1 do
        for j := k + 1 to n − 1 do
            a^{(k+1)}_{ij} := a^{(k)}_{ij} − lik ukj ;

Note that A(0) = A and A(n) = 0. In this notation, (2.4) and (2.5) become
uij = a^{(i)}_{ij} , for 0 ≤ i ≤ j < n,    (2.7)

and

lij = a^{(j)}_{ij} /ujj , for 0 ≤ j < i < n.    (2.8)
Algorithm 2.1 produces the elements of L and U in stages. Stage k first
computes the elements ukj , j ≥ k, of row k of U and the elements lik , i > k,
of column k of L. Then, it computes A(k+1) in preparation for the next stage.
Since only values a^{(k)}_{ij} with i, j ≥ k are needed in stage k, only the values
a^{(k+1)}_{ij} with i, j ≥ k + 1 are prepared. It can easily be verified that this order
of computation is indeed feasible: in each assignment of the algorithm, the
values of the right-hand side have already been computed.
Figure 2.1 illustrates how computer memory can be saved by storing all
currently available elements of L, U , and A(k) in one working matrix, which
we call A. Thus, we obtain Algorithm 2.2. On input, A contains the original
matrix A(0) , whereas on output it contains the values of L below the diagonal
and the values of U above and on the diagonal. In other words, the output
matrix equals L − In + U , where In denotes the n × n identity matrix, which
has ones on the diagonal and zeros everywhere else. Note that stage n − 1 of
the algorithm does nothing, so we can skip it.
This is a good moment for introducing our matrix/vector notation, which
is similar to the MATLAB [100] notation commonly used in the field of numer-
ical linear algebra. This notation makes it easy to describe submatrices and
Fig. 2.1. LU decomposition of a 7 × 7 matrix at the start of stage k = 3. The values
of L and U computed so far and the computed part of A(k) fit exactly in one matrix.

Algorithm 2.2. Memory-efficient sequential LU decomposition.

input:  A : n × n matrix, A = A(0) .
output: A : n × n matrix, A = L − In + U , with
        L : n × n unit lower triangular matrix,
        U : n × n upper triangular matrix,
        such that LU = A(0) .

for k := 0 to n − 1 do
    for i := k + 1 to n − 1 do
        aik := aik /akk ;
    for i := k + 1 to n − 1 do
        for j := k + 1 to n − 1 do
            aij := aij − aik akj ;

subvectors. The subvector x(i0 : i1 ) is a vector of length i1 − i0 + 1, which


contains all components of x from i0 up to and including i1 . The (noncon-
tiguous) subvector x(i0 : s : i1 ) contains the components i0 , i0 + s, i0 + 2s, . . .
not exceeding i1 . Here, s is the stride of the subvector. The subvector x(∗)
contains all components and hence it equals x. The subvector x(i) contains
one component, the ith. The submatrix A(i0 : i1 , j0 : j1 ) contains all elements
aij with i0 ≤ i ≤ i1 and j0 ≤ j ≤ j1 . The ranges for the matrix indices can be
written in the same way as for the vector indices. For example, the submatrix
A(i, ∗) denotes row i of the matrix A. Using our matrix/vector notation, we
can write the submatrix used to store elements of A(k) as A(k : n−1, k : n−1).
We can also write the part of U computed in stage k as U (k, k : n − 1) and


the part of L computed in stage k as L(k + 1 : n − 1, k).
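A direct sequential transcription of Algorithm 2.2 in C might look as follows (a sketch: the two-dimensional array interface is our own choice, and no pivoting is performed yet).

/* Memory-efficient sequential LU decomposition (Algorithm 2.2):
   on output, a holds L below the diagonal and U on and above it. */
void lu_inplace(int n, double a[n][n]) {
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++)
            a[i][k] /= a[k][k];                 /* column k of L */
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                a[i][j] -= a[i][k] * a[k][j];   /* update towards A(k+1) */
    }
}

Applied to the 3 × 3 matrix of Example 2.1, this sketch produces the combined matrix L − In + U shown at the end of Example 2.2 below.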
Example 2.2 The matrix A of Example 2.1 is transformed into a matrix
holding the L and U factors, as follows:

A = \begin{pmatrix} 1 & 4 & 6 \\ 2 & 10 & 17 \\ 3 & 16 & 31 \end{pmatrix}
\xrightarrow{(0)} \begin{pmatrix} 1 & 4 & 6 \\ 2 & 2 & 5 \\ 3 & 4 & 13 \end{pmatrix}
\xrightarrow{(1)} \begin{pmatrix} 1 & 4 & 6 \\ 2 & 2 & 5 \\ 3 & 2 & 3 \end{pmatrix} = L − In + U.

Example 2.3 No LU decomposition exists for

A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.

The last example shows that Algorithm 2.2 may break down, even in the
case of a nonsingular matrix. This happens if akk = 0 for a certain k, so that
division by zero is attempted. A remedy for this problem is to permute the
rows of the matrix A in a suitable way, giving a matrix P A, before computing
an LU decomposition. This yields

P A = LU, (2.9)

where P is an n × n permutation matrix, that is, a matrix obtained by


permuting the rows of In . A useful property of a permutation matrix is that
its inverse equals its transpose, P −1 = P T . The effect of multiplying A from
the left by P is to permute the rows of A.
Every permutation matrix corresponds to a unique permutation, and vice
versa. Let σ : {0, . . . , n − 1} → {0, . . . , n − 1} be a permutation. We define the
permutation matrix Pσ corresponding to σ as the n × n matrix with elements

(Pσ )ij = \begin{cases} 1 & if i = σ(j), \\ 0 & otherwise, \end{cases}  for 0 ≤ i, j < n.    (2.10)

This means that column j of Pσ has an element one in row σ(j), and zeros
everywhere else.
Example 2.4 Let n = 3 and σ(0) = 1, σ(1) = 2, and σ(2) = 0. Then

Pσ = \begin{pmatrix} · & · & 1 \\ 1 & · & · \\ · & 1 & · \end{pmatrix},

where the dots in the matrix denote zeros.



The matrix Pσ has the following useful properties:


Lemma 2.5 Let σ : {0, . . . , n − 1} → {0, . . . , n − 1} be a permutation. Let
x be a vector of length n and A an n × n matrix. Then
(Pσ x)i = x_{σ^{−1}(i)} , for 0 ≤ i < n,
(Pσ A)ij = a_{σ^{−1}(i),j} , for 0 ≤ i, j < n,
(Pσ A Pσ^T )ij = a_{σ^{−1}(i),σ^{−1}(j)} , for 0 ≤ i, j < n.
Lemma 2.6 Let σ, τ : {0, . . . , n − 1} → {0, . . . , n − 1} be permutations.
Then
Pτ Pσ = P_{τσ} and (Pσ )^{−1} = P_{σ^{−1}} .
Here, τ σ denotes σ followed by τ .
Proof The proofs follow immediately from the definition of Pσ . 
Usually, it is impossible to determine a suitable complete row permutation
before the LU decomposition has been carried out, because the choice may
depend on the evolving computation of L and U . A common procedure, which
works well in practice, is partial row pivoting. The computation starts
with the original matrix A. At the start of stage k, a pivot element ark
is chosen with the largest absolute value in column k, among the elements
aik with i ≥ k. We express this concisely in the program text by stating
that r = argmax(|aik | : k ≤ i < n), that is, r is the argument (or index)
of the maximum. If A is nonsingular, it is guaranteed that ark = 0. (Taking
the largest element, instead of an arbitrary nonzero, keeps us farthest from
dividing by zero and hence improves the numerical stability.) Swapping row
k and the pivot row r now makes it possible to perform stage k.
LU decomposition with partial row pivoting produces the L and U factors
of a permuted matrix P A, for a given input matrix A. These factors can
then be used to solve the linear system P Ax = P b by permuting the vector
b and solving two triangular systems. To perform the permutation, we need
to know P . We can represent P by a permutation vector π of length n. We
denote the components of π by πi or π(i), whichever is more convenient in
the context. We determine P by registering the swaps executed in the stages
of the computation. For instance, we can start with the identity permutation
stored as a vector e = (0, 1, . . . , n − 1)T in π. We swap the components k
and r of π whenever we swap a row k and a row r of the working matrix. On
output, the working matrix holds the L and U factors of P A, and π holds the
vector P e. Assume P = Pσ for a certain permutation σ. Applying Lemma 2.5
gives π(i) = (Pσ e)i = eσ−1 (i) = σ −1 (i), for all i. Therefore, σ = π −1 and
Pπ−1 A = LU . Again applying the lemma, we see that this is equivalent with
aπ(i),j = (LU )ij , for all i, j.
The resulting LU decomposition with partial pivoting is given as
Algorithm 2.3.

Algorithm 2.3. Sequential LU decomposition with partial row pivoting.

input:  A : n × n matrix, A = A(0) .
output: A : n × n matrix, A = L − In + U , with
        L : n × n unit lower triangular matrix,
        U : n × n upper triangular matrix,
        π : permutation vector of length n,
        such that a^{(0)}_{π(i),j} = (LU )ij , for 0 ≤ i, j < n.

for i := 0 to n − 1 do
    πi := i;
for k := 0 to n − 1 do
    r := argmax(|aik | : k ≤ i < n);
    swap(πk , πr );
    for j := 0 to n − 1 do
        swap(akj , arj );
    for i := k + 1 to n − 1 do
        aik := aik /akk ;
    for i := k + 1 to n − 1 do
        for j := k + 1 to n − 1 do
            aij := aij − aik akj ;
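A corresponding sequential C sketch of Algorithm 2.3, with the pivot search, the swaps, and the update written out (the array interface is again our own choice).

#include <math.h>

/* Sequential LU decomposition with partial row pivoting (Algorithm 2.3):
   on output, a holds L - I_n + U of the row-permuted matrix and
   pi holds the permutation vector. */
void lu_pivot(int n, double a[n][n], int *pi) {
    for (int i = 0; i < n; i++)
        pi[i] = i;
    for (int k = 0; k < n; k++) {
        int r = k;                              /* pivot search in column k */
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i][k]) > fabs(a[r][k]))
                r = i;
        int tmp = pi[k]; pi[k] = pi[r]; pi[r] = tmp;
        for (int j = 0; j < n; j++) {           /* swap rows k and r */
            double t = a[k][j]; a[k][j] = a[r][j]; a[r][j] = t;
        }
        for (int i = k + 1; i < n; i++)
            a[i][k] /= a[k][k];
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                a[i][j] -= a[i][k] * a[k][j];
    }
}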

Its cost is determined as follows. The floating-point operations in stage k


are: n − k − 1 divisions, (n − k − 1)^2 multiplications, and (n − k − 1)^2 sub-
tractions. We ignore all other operations, such as comparisons, assignments,
and integer operations, because taking these into account would make our
analysis laborious and unnecessarily complicated. The cost of Algorithm 2.3,
measured in flops, is therefore:
Tseq = \sum_{k=0}^{n−1} (2(n − k − 1)^2 + n − k − 1) = \sum_{k=0}^{n−1} (2k^2 + k) = 2n^3 /3 − n^2 /2 − n/6.    (2.11)

In the summation, we used two formulae so important for analysing the


complexity of matrix computations, that you should know them by heart:
Lemma 2.7 Let n ≥ 0 be an integer. Then
\sum_{k=0}^{n} k = n(n + 1)/2,    \sum_{k=0}^{n} k^2 = n(n + 1)(2n + 1)/6.

Proof By induction on n. 

2.3 Basic parallel algorithm


Design your parallel algorithms backwards! We follow this motto by first
transforming a sequential step into a computation superstep and then insert-
ing preceding communication supersteps to obtain nonlocal data where
needed.
The design process of our parallel LU decomposition algorithm is as fol-
lows. First, we introduce a general data distribution scheme, which reflects
the problem and restricts the possible communication patterns, but which also
leaves sufficient freedom for optimization. Second, we derive a basic parallel
algorithm, directly from the sequential algorithm and the data distribution
scheme; we do this mostly in the backward direction. Third, we analyse the
cost of the basic parallel algorithm and use the results of this analysis to
choose a data distribution with optimal load balance and low communication
overhead. Fourth, we restructure the algorithm to reduce its cost further. This
section presents the first three phases of the design process; the fourth phase
is presented in the next section.
The data to be distributed for parallel LU decomposition are the matrix A
and the vector π. Clearly, the most important decision is how to distribute A.
The bulk of the computational work in stage k of the sequential algorithm
is the modification of the matrix elements aij with i, j ≥ k + 1. Therefore,
our choice of distribution will be based on an analysis of this part of the
algorithm. It is easy to distribute the computational work of this part evenly
over the processors; this can simply be done by evenly distributing the corres-
ponding data. Distribution of the matrix elements over different processors,
however, will give rise to communication, because in general the matrix ele-
ments aij , aik , and akj involved in an update aij := aij −aik akj will not reside
on the same processor. There are (n − k − 1)2 elements aij to be updated,
using only n − k − 1 elements aik from column k of A and n − k − 1 elements
akj from row k. Therefore, to prevent communication of large amounts of
data, the update aij := aij − aik akj must be performed by the processor that
contains aij . This implies that only elements of column k and row k of A need
to be communicated in stage k. This approach is illustrated in Fig. 2.2.
An important observation is that the modification of the elements in row
A(i, k + 1 : n − 1) uses only one value from column k of A, namely aik . If
we distribute each matrix row over a limited set of N processors, then the
communication of an element from column k can be restricted to a broad-
cast to N processors. Similarly, the modification of the elements in column
A(k + 1 : n − 1, j) uses only one value from row k of A, namely akj . If we
distribute each matrix column over a limited set of M processors, then the
communication of an element from row k can be restricted to a broadcast to
M processors.
Fig. 2.2. Matrix update by operations aij := aij − aik akj at the end of stage k = 3.
Arrows denote communication.

For matrix computations, it is natural to number the processors by
two-dimensional identifiers P (s, t), 0 ≤ s < M and 0 ≤ t < N , where
p = M N is the number of processors. We define processor row P (s, ∗)
as the group of N processors P (s, t) with 0 ≤ t < N , and processor column


P (∗, t) as the group of M processors P (s, t) with 0 ≤ s < M . This is just
a two-dimensional numbering of the processors and has no physical meaning
in the BSP model. Any resemblance to actual parallel computers, such as
a rectangular processor network, is purely coincidental and, for the sake of
portability, such resemblance should not be exploited. To make it easier to
resist the temptation, BSP veterans always tell newcomers to the BSP world
that BSPlib software randomly renumbers the processors before it starts.
A matrix distribution is a mapping

φ : {(i, j) : 0 ≤ i, j < n} → {(s, t) : 0 ≤ s < M ∧ 0 ≤ t < N }

from the set of matrix index pairs to the set of processor identifiers. The
mapping function φ has two coordinates,

φ(i, j) = (φ0 (i, j), φ1 (i, j)), for 0 ≤ i, j < n. (2.12)

A matrix distribution is called Cartesian if φ0 (i, j) is independent of j and


φ1 (i, j) is independent of i, so that we can write

φ(i, j) = (φ0 (i), φ1 (j)), for 0 ≤ i, j < n. (2.13)

Figure 2.3 shows a Cartesian distribution of a 7 × 7 matrix over 2 × 3 pro-


cessors. Cartesian distributions allocate matrix rows to processor rows. This
is good for LU decomposition, because in stage k an element aik of column k
needs to be communicated only to the owners of matrix row i, that is, to pro-
cessor row P (φ0 (i), ∗), which is a group of N processors. Similarly, Cartesian
distributions allocate matrix columns to processor columns, which reduces
the communication of an element from row k to a broadcast to M processors.
Fig. 2.3. A Cartesian distribution of a 7 × 7 matrix over 2 × 3 processors. The label
‘st’ in a cell denotes its owner, processor P (s, t).

In both cases, the destination is only a subset of all the processors. Therefore,
we decide to use a Cartesian matrix distribution. For this moment, we do not
specify the distribution further, to leave us the freedom of tailoring it to our
future needs.
An initial parallel algorithm can be developed by parallelizing the sequen-
tial algorithm step by step, using data parallelism to derive computa-
tion supersteps and the need-to-know principle to obtain the necessary
communication supersteps. According to this principle, exactly those non-
local data that are needed in a computation superstep should be fetched in
preceding communication supersteps.
One parallelization method based on this approach is to allocate a com-
putation to the processor that possesses the variable on the left-hand side
of an assignment and to communicate beforehand the nonlocal data appear-
ing in the right-hand side. An example is the superstep pair (10)–(11) of
Algorithm 2.4, which is a parallel version of the matrix update from stage k
of the LU decomposition. (The superstep numbering corresponds to that of
the complete basic parallel algorithm.) In superstep (11), the local elements
aij with i, j ≥ k + 1 are modified. In superstep (10), the elements aik and
akj with i, j ≥ k + 1 are communicated to the processors that need them.
It is guaranteed that all values needed have been sent, but depending on
the distribution and the stage k, certain processors actually may not need
all of the communicated elements. (This mild violation of the strict need-to-
know principle is common in dense matrix computations, where all matrix
elements are treated as nonzero; for sparse matrices, however, where many
matrix elements are zero, the communication operations should be precisely
targeted, see Chapter 4.) Another example of this parallelization method is
the superstep pair (8)–(9). In superstep (9), the local elements of column k
are divided by akk . This division is performed only by processors in processor
Algorithm 2.4. Parallel matrix update in stage k for P (s, t).

(8)  if φ0 (k) = s ∧ φ1 (k) = t then put akk in P (∗, t);

(9)  if φ1 (k) = t then for all i : k < i < n ∧ φ0 (i) = s do
         aik := aik /akk ;

(10) if φ1 (k) = t then for all i : k < i < n ∧ φ0 (i) = s do
         put aik in P (s, ∗);
     if φ0 (k) = s then for all j : k < j < n ∧ φ1 (j) = t do
         put akj in P (∗, t);

(11) for all i : k < i < n ∧ φ0 (i) = s do
         for all j : k < j < n ∧ φ1 (j) = t do
             aij := aij − aik akj ;

Algorithm 2.5. Parallel pivot search in stage k for P (s, t).

(0)  if φ1 (k) = t then rs := argmax(|aik | : k ≤ i < n ∧ φ0 (i) = s);

(1)  if φ1 (k) = t then put rs and ars,k in P (∗, t);

(2)  if φ1 (k) = t then
         smax := argmax(|arq,k | : 0 ≤ q < M );
         r := rsmax ;

(3)  if φ1 (k) = t then put r in P (s, ∗);

column P (∗, φ1 (k)), since these processors together possess matrix column k.
In superstep (8), the element akk is obtained.
An alternative parallelization method based on the same need-to-know
approach is to allocate a computation to the processor that contains part
or all of the data of the right-hand side, and then to communicate partial
results to the processors in charge of producing the final result. This may
be more efficient if the number of result values is less than the number of
input data values involved. An example is the sequence of supersteps (0)–(3)
of Algorithm 2.5, which is a parallel version of the pivot search from stage
k of the LU decomposition. First a local element with maximum absolute
value is determined, whose index and value are then sent to all processors in
P (∗, φ1 (k)). (In our cost model, this takes the same time as sending them to
Algorithm 2.6. Index and row swaps in stage k for P (s, t).

(4)  if φ0 (k) = s ∧ t = 0 then put πk as π̂k in P (φ0 (r), 0);
     if φ0 (r) = s ∧ t = 0 then put πr as π̂r in P (φ0 (k), 0);

(5)  if φ0 (k) = s ∧ t = 0 then πk := π̂r ;
     if φ0 (r) = s ∧ t = 0 then πr := π̂k ;

(6)  if φ0 (k) = s then for all j : 0 ≤ j < n ∧ φ1 (j) = t do
         put akj as âkj in P (φ0 (r), t);
     if φ0 (r) = s then for all j : 0 ≤ j < n ∧ φ1 (j) = t do
         put arj as ârj in P (φ0 (k), t);

(7)  if φ0 (k) = s then for all j : 0 ≤ j < n ∧ φ1 (j) = t do
         akj := ârj ;
     if φ0 (r) = s then for all j : 0 ≤ j < n ∧ φ1 (j) = t do
         arj := âkj ;

only one master processor P (0, φ1 (k)); a similar situation occurs for the inner
product algorithm in Section 1.3.) All processors in the processor column
redundantly determine the processor P (smax , φ1 (k)) and the global row index
r of the maximum value. The index r is then broadcast to all processors.
The part of stage k that remains to be parallelized consists of index and
row swaps. To parallelize the index swaps, we must first choose the distribu-
tion of π. It is natural to store πk together with row k, that is, somewhere
in processor row P (φ0 (k), ∗); we choose P (φ0 (k), 0) as the location. Altern-
atively, we could have replicated πk and stored a copy in every processor
of P (φ0 (k), ∗). (Strictly speaking, this is not a distribution any more.) The
index swaps are performed by superstep pair (4)–(5) of Algorithm 2.6. The
components πk and πr of the permutation vector are swapped by first put-
ting each component into its destination processor and then assigning it to
the appropriate component of the array π. Temporary variables (denoted by
hats) are used to help distinguishing between the old and the new contents of
a variable. The same is done for the row swaps in supersteps (6)–(7).
To make the algorithm efficient, we must choose a distribution φ that
incurs low BSP cost. To do this, we first analyse stage k of the algorithm and
identify the main contributions to its cost. Stage k consists of 12 supersteps,
so that its synchronization cost equals 12l. Sometimes, a superstep may be
empty so that it can be deleted. For example, if N = 1, superstep (3) is empty.
In the extreme case p = 1, all communication supersteps can be deleted and
the remaining computation supersteps can be combined into one superstep.
For p > 1, however, the number of supersteps in one stage remains a small
constant, which should not influence the choice of distribution. Therefore,


we consider 12nl to be an upper bound on the total synchronization cost of
the algorithm, and we exclude terms in l from the following analysis of the
separate supersteps.
The computation and communication cost can concisely be expressed using

Rk = max_{0≤s<M} |{i : k ≤ i < n ∧ φ0 (i) = s}|,    (2.14)

that is, Rk is the maximum number of local matrix rows with index ≥ k, and

Ck = max_{0≤t<N} |{j : k ≤ j < n ∧ φ1 (j) = t}|,    (2.15)

that is, Ck is the maximum number of local matrix columns with index ≥ k.
Example 2.8 In Fig. 2.3, R0 = 4, C0 = 3 and R4 = 2, C4 = 2.
Lower bounds for Rk and Ck are given by

Rk ≥ ⌈(n − k)/M ⌉,    Ck ≥ ⌈(n − k)/N ⌉.    (2.16)

Proof Assume Rk < ⌈(n − k)/M ⌉. Because Rk is integer, we even have that
Rk < (n − k)/M so that each processor row has less than (n − k)/M matrix
rows. Therefore, the M processor rows together possess less than n − k matrix
rows, which contradicts the fact that they hold the whole range k ≤ i < n.
A similar proof holds for Ck . 
The computation supersteps of the algorithm are (0), (2), (5), (7), (9), and
(11). Supersteps (0), (2), (5), and (7) are for free in our benign cost model,
since they do not involve floating-point operations. (A more detailed analysis
taking all types of operations into account would yield a few additional lower-
order terms.) Computation superstep (9) costs Rk+1 time units, since each
processor performs at most Rk+1 divisions. Computation superstep (11) costs
2Rk+1 Ck+1 time units, since each processor performs at most Rk+1 Ck+1 mul-
tiplications and Rk+1 Ck+1 subtractions. The cost of (11) clearly dominates
the total computation cost.
Table 2.1 presents the cost of the communication supersteps of the basic
parallel LU decomposition. It is easy to verify the cost values given by the
table. For the special case N = 1, the hr value given for (3) in the table
should in fact be 0 instead of 1, but this does not affect the resulting value
of h. A similar remark should be made for supersteps (4), (8), and (10).
During most of the algorithm, the largest communication superstep is (10),
while the next-largest one is (6). Near the end of the computation, (6) becomes
dominant.
To minimize the total BSP cost of the algorithm, we must take care to
minimize the cost of both computation and communication. First we consider
Table 2.1. Cost (in g) of communication supersteps in stage k
of basic parallel LU decomposition

Superstep   hs                            hr             h = max{hs , hr }
(1)         2(M − 1)                      2(M − 1)       2(M − 1)
(3)         N − 1                         1              N − 1
(4)         1                             1              1
(6)         C0                            C0             C0
(8)         M − 1                         1              M − 1
(10)        Rk+1 (N − 1) + Ck+1 (M − 1)   Rk+1 + Ck+1    Rk+1 (N − 1) + Ck+1 (M − 1)

the computation cost, and in particular the cost of the dominant computation
superstep,

T(11) = 2Rk+1 Ck+1 ≥ 2 ⌈(n − k − 1)/M ⌉ ⌈(n − k − 1)/N ⌉.    (2.17)

This cost can be minimized by distributing the matrix rows cyclically over
the M processor rows and the matrix columns cyclically over the N processor
columns. In that case, matrix rows k + 1 to n − 1 are evenly or nearly evenly
divided over the processor rows, with at most a difference of one matrix row
between the processor rows, and similarly for the matrix columns. Thus,

T(11),cyclic = 2 ⌈(n − k − 1)/M ⌉ ⌈(n − k − 1)/N ⌉.    (2.18)

The resulting matrix distribution is the M × N cyclic distribution,


defined by

φ0 (i) = i mod M, φ1 (j) = j mod N, for 0 ≤ i, j < n. (2.19)

Figure 2.4 shows the 2 × 3 cyclic distribution of a 7 × 7 matrix.
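In code, the M × N cyclic distribution (2.19) is just modular arithmetic. The local index formulas in the second function below show one conventional way to store the local part of the matrix; that storage convention is an implementation choice, not part of the definition.

/* Owner P(s,t) of global element a_ij under the M x N cyclic distribution. */
void owner(int i, int j, int M, int N, int *s, int *t) {
    *s = i % M;      /* phi_0(i) = i mod M */
    *t = j % N;      /* phi_1(j) = j mod N */
}

/* One common local storage convention: element a_ij is kept by its owner
   in local position (i / M, j / N) of a roughly (n/M) x (n/N) local array. */
void local_index(int i, int j, int M, int N, int *li, int *lj) {
    *li = i / M;
    *lj = j / N;
}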


The cost of (11) for the M × N cyclic distribution is bounded between

2(n − k − 1)^2 /p ≤ T(11),cyclic < 2 ((n − k − 1)/M + 1) ((n − k − 1)/N + 1)
                                 = 2(n − k − 1)^2 /p + (2(n − k − 1)/p)(M + N ) + 2,

where we have used that M N = p. The upper bound is minimal if M = N = √p,
that is, if the distribution is square. The resulting second-order
term 4(n − k − 1)/√p in the upper bound can be viewed as the additional
computation cost caused by imbalance of the work load.
Fig. 2.4. The 2 × 3 cyclic distribution of a 7 × 7 matrix.

Next, we examine the cost of the dominant communication superstep,

T(10) = (Rk+1 (N − 1) + Ck+1 (M − 1))g
      ≥ (⌈(n − k − 1)/M ⌉ (N − 1) + ⌈(n − k − 1)/N ⌉ (M − 1)) g
      = T(10),cyclic .    (2.20)

Again, we can minimize the cost by using the M × N cyclic distribution. To


find optimal values for M and N , we consider the upper bound

T(10),cyclic < (((n − k − 1)/M + 1) N + ((n − k − 1)/N + 1) M ) g
             = ((n − k − 1)(N/M + M/N ) + M + N ) g.    (2.21)
We now minimize this simple upper bound on the cost, instead of the more
complicated true cost itself. (This approximation is valid because the bound
is not too far from the true cost.) From (M − N )^2 ≥ 0, it immediately follows
that N/M + M/N = (M^2 + N^2 )/(M N ) ≥ 2. For M = N = √p, the inequality
becomes an equality. For this choice, the term N/M + M/N is minimal. The
same choice also minimizes the term M + N under the constraint M N = p.
This implies that the square cyclic distribution is a good choice for the basic
LU decomposition algorithm, on the grounds of both computation cost and
communication cost.

2.4 Two-phase broadcasting and other improvements


Can the basic parallel algorithm be improved? The computation cost in
flops cannot be reduced by much, because the computation part is already
well-balanced and little is computed redundantly. Therefore, the question

is whether the communication and synchronization cost can be reduced. To


answer this, we take a closer look at the communication supersteps.
The communication volume V of an h-relation is defined as the total
number of data words communicated. Using a one-dimensional processor
numbering, we can express this as
    V = Σ_{s=0}^{p−1} hs(s) = Σ_{s=0}^{p−1} hr(s),    (2.22)

where hs (s) is the number of data words sent by processor P (s) and hr (s) is
the number received. In this notation, maxs hs (s) = hs and maxs hr (s) = hr .
Note that V ≤ Σ_{s=0}^{p−1} h = ph. We call an h-relation balanced if V = ph,
that is, h = V /p. Equality can only hold if hs (s) = h for all s. Therefore,
a balanced h-relation has hs (s) = h for all s, and, similarly, hr (s) = h for
all s. These necessary conditions for balance are also sufficient and hence an
h-relation is balanced if and only if every processor sends and receives exactly
h words. But this is precisely the definition of a full h-relation, see Section 1.2.
It is just a matter of viewpoint whether we call an h-relation balanced or full.
The communication volume provides us with a measure for load imbalance:
we call h − V /p the communicational load imbalance. This is analogous
to the computational load imbalance, which commonly (but often tacitly)
is defined as w − wseq /p, where w denotes work. If an h-relation is balanced,
then h = hs = hr . The reverse is not true: it is possible that h = hs = hr
but that the h-relation is still unbalanced: some processors may be overloaded
sending and some receiving. In that case, h > V /p. To reduce communication
cost, one can either reduce the volume, or improve the balance for a fixed
volume.
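A small numerical example of ours may help to fix these notions. Let p = 4 and suppose P(0) sends twelve words, four to each of the other processors, which send nothing; then hs = 12, hr = 4, h = 12, and V = 12, so that the communicational load imbalance is h − V/p = 12 − 3 = 9. If instead every processor sends three words and receives three words, the same volume V = 12 is communicated with h = 3 = V/p, that is, with perfect balance.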
Consider the basic parallel LU decomposition algorithm with the cyclic
distribution. Assume for diagnostic purposes that the distribution is square.
(Later, in developing our improved algorithm, we shall assume the more
general M × N cyclic distribution.) Supersteps (3), (8), and (10) perform
h-relations with hs ≫ hr , see Table 2.1. Such a discrepancy between hs and
hr is a clear symptom of imbalance. The three unbalanced supersteps are
candidates for improvement. We concentrate our efforts on the dominant communication superstep, (10), which has hs = (√p − 1)hr and h ≈ 2(n − k − 1), see (2.20). The contribution of superstep (10) to the total communication cost of the basic algorithm is about Σ_{k=0}^{n−1} 2(n − k − 1)g = 2g Σ_{k=0}^{n−1} k = 2g(n − 1)n/2 ≈ n²g, irrespective of the number of processors. With an increasing number of processors, the fixed contribution of n²g to the total communication cost will soon dominate the total computation cost of roughly Tseq/p ≈ 2n³/(3p), see (2.11). This back-of-the-envelope analysis suffices to reveal the undesirable scaling behaviour of the row and column broadcasts.
The unbalance in the broadcasts of superstep (10) is caused by the fact that only 2√p − 1 out of p processors send data: the sending processors
are P(∗, φ1(k)) = P(∗, k mod √p) and P(φ0(k), ∗) = P(k mod √p, ∗). The receives are spread better: the majority of the processors receive 2Rk+1 data elements, or one or two elements less. The communication volume equals V = 2(n − k − 1)(√p − 1), because n − k − 1 elements of row k and column k must be broadcast to √p − 1 processors. It is impossible to reduce the
communication volume significantly: all communication operations are really
necessary, except in the last few stages of the algorithm. The communication
balance, however, has potential for improvement.
To find ways to improve the balance, let us first examine the problem of
broadcasting a vector x of length n from a processor P (0) to all p processors of
a parallel computer, where n ≥ p. For this problem, we use a one-dimensional
processor numbering. The simplest approach is that processor P (0) creates
p−1 copies of each vector component and sends these copies out. This method
concentrates all sending work at the source processor. A better balance can
be obtained by sending each component to a randomly chosen intermediate
processor and making this processor responsible for copying and sending the
copies to the final destination. (This method is similar to two-phase random-
ized routing [176], where packets are sent from source to destination through a
randomly chosen intermediate location, to avoid congestion in the routing net-
work.) The new method splits the original h-relation into two phases: phase 0,
an unbalanced h-relation with small volume that randomizes the location of
the data elements; and phase 1, a well-balanced h-relation that performs
the broadcast itself. We call the resulting pair of h-relations a two-phase
broadcast.
An optimal balance during phase 1 can be guaranteed by choosing the
intermediate processors deterministically instead of randomly. For instance,
this can be achieved by spreading the vector in phase 0 according to the
block distribution, defined by (1.6). (An equally suitable choice is the cyclic
distribution.) The resulting two-phase broadcast is given as Algorithm 2.7; it is
illustrated by Fig. 2.5. The notation repl(x) = P (∗) means that x is replicated
such that each processor has a copy. (This is in contrast to distr(x) = φ,
which means that x is distributed according to the mapping φ.) Phase 0 is
an h-relation with h = n − b, where b = ⌈n/p⌉ is the block size, and phase 1
has h = (p − 1)b. Note that both phases cost about ng. The total cost of the
two-phase broadcast of a vector of length n to p processors is

    Tbroadcast = (n + (p − 2)⌈n/p⌉)g + 2l ≈ 2ng + 2l.    (2.23)

This is much less than the cost (p − 1)ng + l of the straightforward one-phase
broadcast (except when l is large).
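An illustrative calculation of ours, using the BSP parameters of the Cray T3E from Section 2.6 (p = 64, g = 87, l = 2718): broadcasting a vector of length n = 1000 costs about (p − 1)ng + l = 63 · 1000 · 87 + 2718 ≈ 5.5 · 10⁶ flops in one phase, but only (n + (p − 2)⌈n/p⌉)g + 2l = 1992 · 87 + 5436 ≈ 1.8 · 10⁵ flops in two phases, a gain of roughly a factor of thirty, close to the asymptotic factor (p − 1)/2.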
The two-phase broadcast can be used to broadcast column k and row k
in stage k of the parallel LU decomposition. The broadcasts are performed in

Algorithm 2.7. Two-phase broadcast for P (s).

input: x : vector of length n, repl(x) = P (0).


output: x : vector of length n, repl(x) = P (∗).
function call: broadcast(x, P (0), P (∗)).

b := ⌈n/p⌉;
{ Spread the vector. }
(0) if s = 0 then for t := 0 to p − 1 do
for i := tb to min{(t + 1)b, n} − 1 do
put xi in P (t);

{ Broadcast the subvectors. }


(1) for i := sb to min{(s + 1)b, n} − 1 do
put xi in P (∗);

[Figure 2.5: diagram of the two-phase broadcast of a vector of twelve components over processors P(0)–P(3), showing the spreading of the vector in phase 0 and the broadcast of the subvectors in phase 1; see the caption below.]

Fig. 2.5. Two-phase broadcast of a vector of size twelve to four processors. Each
cell represents a vector component; the number in the cell and the greyshade
denote the processor that owns the cell. The processors are numbered 0, 1, 2, 3.
The block size is b = 3. The arrows denote communication. In phase 0, the
vector is spread over the four processors. In phase 1, each processor broadcasts
its subvector to all processors. To avoid clutter, only a few of the destination
cells of phase 1 are shown.

supersteps (6) and (7) of the final algorithm, Algorithm 2.8. The column part
to be broadcast from processor P (s, k mod N ) is the subvector (aik : k < i <
n ∧ i mod M = s), which has length Rk+1 or Rk+1 − 1, and this subvector is
broadcast to the whole processor row P (s, ∗). Every processor row performs
its own broadcast of a column part. The row broadcast is done similarly.
Note the identical superstep numbering ‘(6)/(7)’ of the two broadcasts, which
is a concise way of saying that phase 0 of the row broadcast is carried out
together with phase 0 of the column broadcast and phase 1 with phase 1 of the
column broadcast. This saves two synchronizations. (In an implementation,
such optimizations are worthwhile, but they harm modularity: the complete
broadcast cannot be invoked by one function call; instead, we need to make
the phases available as separately callable functions.)
The final algorithm has eight supersteps in the main loop, whereas the
basic algorithm has twelve. The number of supersteps has been reduced as
follows. First, we observe that the row swap of the basic algorithm turns ele-
ment ark into the pivot element akk . The element ark , however, is already
known by all processors in P (∗, k mod N ), because it is one of the elements
broadcast in superstep (1). Therefore, we divide column k immediately by ark ,
instead of dividing by akk after the row swap. This saves the pivot broadcast
(8) of the basic algorithm and the synchronization of superstep (9). For read-
ability, we introduce the convention of writing the condition ‘if k mod N = t
then’ only once for supersteps (0)–(3), even though we want the test to be car-
ried out in every superstep. This saves space and makes the algorithm better
readable; in an implementation, the test must be repeated in every superstep.
(Furthermore, we must take care to let all processors participate in the global
synchronization, and not only those that test positive.) Second, the index and
row swaps are now combined and performed in two supersteps, numbered (4)
and (5). This saves two synchronizations. Third, the last superstep of stage
k of the algorithm is combined with the first superstep of stage k + 1. We
express this by numbering the last superstep as (0′ ), that is, superstep (0) of
the next stage.
The BSP cost of the final algorithm is computed in the same way as
before. The cost of the separate supersteps is given by Table 2.2. Now, Rk+1 =
⌈(n−k −1)/M ⌉ and Ck+1 = ⌈(n−k −1)/N ⌉, because we use the M ×N cyclic
distribution. The cost expressions for supersteps (6) and (7) are obtained as
in the derivation of (2.23).
The dominant computation superstep in the final algorithm remains the

matrix update; the choice M = N = p remains optimal for computa-
tion. The costs of the row and column broadcasts do not dominate the
other communication costs any more, since they have decreased to about
2(Rk+1 + Ck+1 )g in total, which is of the same order as the cost C0 g of the
row swap. To find optimal values of M and N for communication, we consider

Algorithm 2.8. Final parallel LU decomposition algorithm for P (s, t).

input: A : n × n matrix, A = A(0) , distr(A) = M × N cyclic.


output: A : n × n matrix, distr(A) = M × N cyclic, A = L − In + U , with
L : n × n unit lower triangular matrix,
U : n × n upper triangular matrix,
π : permutation vector of length n, distr(π) = cyclic in P (∗, 0),
such that a(0)π(i),j = (LU)ij, for 0 ≤ i, j < n.

if t = 0 then for all i : 0 ≤ i < n ∧ i mod M = s do


πi := i;
for k := 0 to n − 1 do
if k mod N = t then
(0) rs := argmax(|aik | : k ≤ i < n ∧ i mod M = s);
(1) put rs and ars ,k in P (∗, t);
(2) smax := argmax(|arq ,k | : 0 ≤ q < M );
r := rsmax ;
for all i : k ≤ i < n ∧ i mod M = s ∧ i ≠ r do
aik := aik /ark ;
(3) put r in P (s, ∗);

(4) if k mod M = s then


if t = 0 then put πk as π̂k in P (r mod M, 0);
for all j : 0 ≤ j < n ∧ j mod N = t do
put akj as âkj in P (r mod M, t);
if r mod M = s then
if t = 0 then put πr as π̂r in P (k mod M, 0);
for all j : 0 ≤ j < n ∧ j mod N = t do
put arj as ârj in P (k mod M, t);

(5) if k mod M = s then


if t = 0 then πk := π̂r ;
for all j : 0 ≤ j < n ∧ j mod N = t do
akj := ârj ;
if r mod M = s then
if t = 0 then πr := π̂k ;
for all j : 0 ≤ j < n ∧ j mod N = t do
arj := âkj ;

(6)/(7) broadcast((aik : k < i < n ∧ i mod M = s), P (s, k mod N ), P (s, ∗));
(6)/(7) broadcast((akj : k < j < n ∧ j mod N = t), P (k mod M, t), P (∗, t));

(0′ ) for all i : k < i < n ∧ i mod M = s do


for all j : k < j < n ∧ j mod N = t do
aij := aij − aik akj ;

Table 2.2. Cost of supersteps in stage k of the final parallel LU decomposition algorithm

Superstep Cost

(0) l
(1) 2(M − 1)g + l
(2) Rk + l
(3) (N − 1)g + l
(4) (C0 + 1)g + l
(5) l
(6) (Rk+1 − ⌈Rk+1 /N ⌉ + Ck+1 − ⌈Ck+1 /M ⌉)g + l
(7) ((N − 1)⌈Rk+1 /N ⌉ + (M − 1)⌈Ck+1 /M ⌉)g + l
(0′ ) 2Rk+1 Ck+1

the upper bound

    Rk+1 + Ck+1 < ((n − k − 1)/M + 1) + ((n − k − 1)/N + 1)
                = (n − k − 1)(M + N)/p + 2,    (2.24)

which is minimal for M = N = √p. The row swap in superstep (4) prefers large values of N, because C0 = ⌈n/N⌉. The degenerate choice N = p even gives a free swap, but at the price of an expensive column broadcast. Overall, the choice M = N = √p is close to optimal and we shall adopt it in the
following analysis.
The total BSP cost of the final algorithm with the square cyclic distribu-
tion is obtained by summing the contributions of all supersteps. This gives
    TLU = Σ_{k=0}^{n−1} (2(Rk+1)² + Rk) + 2(Σ_{k=0}^{n−1} (Rk+1 + (√p − 2)⌈Rk+1/√p⌉))g
          + (C0 + 3√p − 2)ng + 8nl.    (2.25)
To compute Σ_{k=0}^{n−1} Rk = Σ_{k=0}^{n−1} ⌈(n − k)/√p⌉ = Σ_{k=1}^{n} ⌈k/√p⌉ and the sums of Rk+1 and (Rk+1)², we need the following lemma.
Lemma 2.9 Let n, q ≥ 1 be integers with n mod q = 0. Then
    Σ_{k=0}^{n} ⌈k/q⌉ = n(n + q)/(2q),    Σ_{k=0}^{n} ⌈k/q⌉² = n(n + q)(2n + q)/(6q²).

Proof
    Σ_{k=0}^{n} ⌈k/q⌉ = ⌈0/q⌉ + ⌈1/q⌉ + ··· + ⌈q/q⌉ + ··· + ⌈(n − q + 1)/q⌉ + ··· + ⌈n/q⌉
                      = q·1 + q·2 + ··· + q·(n/q) = q Σ_{k=1}^{n/q} k = n(n + q)/(2q),    (2.26)
where we have used Lemma 2.7. The proof of the second equation is similar.


Provided n mod √p = 0, the resulting sums are:

    Σ_{k=0}^{n−1} Rk = n(n + √p)/(2√p),    (2.27)

    Σ_{k=0}^{n−1} Rk+1 = n(n + √p)/(2√p) − n/√p,    (2.28)

    Σ_{k=0}^{n−1} (Rk+1)² = n(n + √p)(2n + √p)/(6p) − n²/p.    (2.29)
To compute the sum of ⌈Rk+1/√p⌉ = ⌈⌈(n − k − 1)/√p⌉/√p⌉, we need the following lemma, which may be useful in other contexts as well.
Lemma 2.10 Let k, q, r be integers with q, r ≥ 1. Then

    ⌈⌈k/q⌉/r⌉ = ⌈k/(qr)⌉.

Proof Write k = aqr + bq + c, with 0 ≤ b < r and 0 ≤ c < q. If b = c = 0,


both sides of the equation equal a. Otherwise, they equal a + 1, as can easily
be verified. 
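The lemma can also be checked by brute force; the following small C program (ours, for illustration only) verifies the identity for a range of parameter values.

#include <stdio.h>

/* ceildiv(a,b) computes the ceiling of a/b, for integers a >= 0, b >= 1 */
int ceildiv(int a, int b){
    return (a+b-1)/b;
}

int main(void){
    int k, q, r, errors= 0;

    for (k=0; k<=1000; k++)
        for (q=1; q<=20; q++)
            for (r=1; r<=20; r++)
                if (ceildiv(ceildiv(k,q),r) != ceildiv(k,q*r))
                    errors++;
    printf("Lemma 2.10: %d counterexamples found\n", errors);

    return 0;
} /* prints 0 counterexamples */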
The remaining sum equals

    Σ_{k=0}^{n−1} ⌈Rk+1/√p⌉ = Σ_{k=0}^{n−1} ⌈⌈(n − k − 1)/√p⌉/√p⌉ = Σ_{k=0}^{n−1} ⌈⌈k/√p⌉/√p⌉
                            = Σ_{k=0}^{n−1} ⌈k/p⌉ = n(n + p)/(2p) − n/p.    (2.30)

The last equality follows from Lemma 2.9 with q = p, which can be applied if we assume that n mod p = 0 (this also guarantees that n mod √p = 0).
This assumption is made solely for the purpose of simplifying our analysis; in

an implementation, such a restriction would hinder practical application and


hence it should be avoided there.
The total BSP cost of the final algorithm with the square cyclic distribu-
tion is obtained by substituting the results of (2.27)–(2.30) and the value

C0 = n/√p into (2.25). This gives

    TLU = 2n³/(3p) + (3/(2√p) − 2/p)n² + 5n/6
          + ((3/√p − 2/p)n² + (4√p − 4/√p + 4/p − 3)n)g + 8nl.    (2.31)

For many purposes, it suffices to approximate the total cost of an algorithm


by taking into account only the highest-order computation term and the
highest-order overhead terms, that is, the terms representing load imbal-
ance, communication, and synchronization. In this case, an approximate cost
estimate is
    TLU ≈ 2n³/(3p) + 3n²/(2√p) + 3n²g/√p + 8nl.    (2.32)
Note that in the final algorithm the row swaps, row broadcasts, and column broadcasts each contribute n²g/√p to the communication cost.

2.5 Example function bsplu


This section presents the program texts of the function bsplu, which is a
BSPlib implementation of Algorithm 2.8, and the collective-communication
function bsp broadcast used in bsplu.
The function bsp broadcast implements a slight generalization of
Algorithm 2.7: it broadcasts the vector x from processor P (src) instead of
P (0), and the set of destination processors is {P (s0 + t ∗ stride) : 0 ≤ t < p0}
instead of the set of all p processors. (Sometimes, the term ‘multicast’ is
used to describe such an operation with a limited number of destination pro-
cessors; the term ‘broadcast’ is then reserved for the case with p destination
processors. We do not make this distinction.) Within the broadcast function,
processors are numbered in one-dimensional fashion. The broadcast as we for-
mulate it is flexible and it can be applied in many situations: for instance, it
can be used to broadcast a vector within one processor row of a parallel com-
puter numbered in two-dimensional fashion; for the standard identification
P (s, t) ≡ P (s + tM ), the parameter s0 equals the number of the processor
row involved, stride = M , and p0 = N . The function can also be used to
perform several broadcasts simultaneously, for example, one broadcast within
each processor row. In that case, P (s, t) executes the function with the para-
meter s0 = s. This feature must be used with caution: the simultaneous
broadcast works well as long as the set of processors can be partitioned into
disjoint subsets, each including a source and a destination set. The processors

within a subset should all be able to determine their source, destination set,
and vector length uniquely from the function parameters. A processor can
then decide to participate as source and/or destination in its subset or to
remain idle. In the LU decomposition program, processors are partitioned
into processor rows for the purpose of column broadcasts, and into processor
columns for row broadcasts.
The broadcast function is designed such that it can perform the phases of
the broadcast separately; a complete broadcast is done by calling the function
twice, first with a value phase = 0, then with phase = 1. The synchronization
terminating the phase is not done by the broadcast function itself, but is
left to the calling program. The advantage of this approach is that unnec-
essary synchronizations are avoided. For instance, phase 0 of the row and
column broadcasts can be combined into one superstep, thus needing only
one synchronization.
The program text of the broadcast function is a direct implementation of
Algorithm 2.7 in the general context described above. Note that the size of
the data vector to be put is the minimum of the block size b and the number
n − tb of components that would remain if all preceding processors had put
b components. The size thus computed may be negative or zero, so that we
must make sure that the put is carried out only for positive size. In phase 1, all
processors avoid sending data back to the source processor. This optimization
has no effect on the BSP cost, since the (cost-determining) source processor
itself does not benefit, as can be seen by studying the role of P (0) in Fig. 2.5.
Still, the optimization reduces the overall communication volume, making it
equal to that of the one-phase broadcast. This may make believers in other
models than BSP happy, and BSP believers with cold feet as well!
The basic structure of the LU decomposition algorithm and the function
bsplu are the same, except that supersteps (0)–(1) of the algorithm are com-
bined into Superstep 0 of the function, supersteps (2)–(3) are combined into
Superstep 1, and supersteps (4)–(5) into Superstep 2. For the pairs (0)–(1)
and (2)–(3) this could be done because BSPlib allows computation and com-
munication to be mixed; for (4)–(5), this could be done because bsp puts are
buffered automatically, so that we do not have to take care of that ourselves.
Note that the superstep (0′ )–(0) of the algorithm is delimited quite naturally
in the program text by the common terminating bsp sync of Superstep 0.
As a result, each stage of the function bsplu has five supersteps.
The relation between the variables of the algorithm and those of the func-
tion bsplu is as follows. The variables M, N, s, t, n, k, smax , r of the algorithm
correspond to the variables M, N, s, t, n, k, smax, r of the function. The global
row index used in the algorithm equals i ∗ M + s, where i is the local row index used in the function; the global column index equals j ∗ N + t, where j is the local column index. The matrix element aij corresponds to a[i][j] on
the processor that owns aij , and the permutation component πi corresponds
to pi[i]. The global row index of the local element in column k with largest

absolute value is rs = imax ∗ M + s. The numerical value ars ,k of this element


corresponds to the variable max in the function. The arrays Max and Imax store
the local maxima and their index for the M processors that contain part of
column k. The maximum for processor P (s, k mod N ) is stored in Max[s]. The
global row index of the overall winner is r = Imax[smax] ∗ M + smax and its
value is ark = pivot. The arrays uk and lk store, starting from index 0, the
local parts to be broadcast from row k of U and column k of L.
The function nloc introduced in the program bspinprod of Section 1.4
is used here as well, for instance to compute the number of local rows
nloc(M, s, n) of processor row P (s, ∗) for an n × n matrix in the M × N cyclic
distribution. The local rows have local indices i = 0, 1, . . . , nloc(M, s, n) − 1.
In general, a local row index i satisfies i < nloc(M, s, k) if and only if the corresponding global row index i ∗ M + s is less than k. Thus, i = nloc(M, s, k) is the first local row index for which the corresponding global row index is at least k.
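This property is easy to check with a small test program such as the following sketch of ours (the values of M and n are arbitrary examples); it uses the same function nloc as bsplu.

#include <assert.h>
#include <stdio.h>

int nloc(int p, int s, int n){
    /* Compute number of local components of processor s for vector
       of length n distributed cyclically over p processors. */
    return (n+p-s-1)/p;
}

int main(void){
    int M= 3, n= 10, s, k, i; /* example values (ours) */

    for (s=0; s<M; s++)
        for (k=0; k<=n; k++)
            for (i=0; i<nloc(M,s,n); i++)
                assert( (i*M+s < k) == (i < nloc(M,s,k)) );
    printf("nloc property verified for M=%d, n=%d\n", M, n);

    return 0;
}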
Variables must have been registered before they can be used as the des-
tination of put operations. For example, the arrays lk, uk, Max, Imax are
registered immediately upon allocation. Thus, registration takes place outside
the main loop, which is much cheaper than registering an array each time it
is used in communication. The exact length of the array is also registered so
that we can be warned if we attempt to put data beyond the end of the array.
(Sloppy, lazy, or overly confident programmers sometimes register INT MAX,
defined in the standard C file limits.h, instead of the true length; a practice
not to be recommended.) The permutation array pi must be allocated outside
bsplu because it is an output array. Nevertheless, its registration takes place
inside bsplu, because this is the only place where it is used in communica-
tion. The matrix A itself is the target of put operations, namely in the row
swaps. The easiest and cheapest way of registering a two-dimensional array is
by exploiting the fact that the utility function matallocd from bspedupack.c
(see Appendix A) allocates a contiguous array of length mn to store an m × n
matrix. We can address this matrix in a one-dimensional fashion, if we wish
to do so, and communication is one of the few occasions where this is worth-
while. We introduce a variable pa, which stands for ‘pointer to A’; it is a
pointer that stores the address of the first matrix row, a[0]. We can put data
into every desired matrix row i by putting them into the space pointed to
by pa and using a suitable offset i ∗ nlc, where nlc is the row length of the
destination processor. (Watch out: the putting processor must know the row
length of the remote processor. Fortunately, in the present case the local and
remote row lengths are the same, because the rows are swapped.) This way
of registering a matrix requires only one registration, instead of one registra-
tion per row. This saves much communication time, because registration is
expensive: each registration costs at least (p − 1)g because every processor
broadcasts the address of its own variable to all other processors.
The supersteps of the function bsplu are a straightforward implement-
ation of the supersteps in the LU decomposition algorithm. A few details

need additional explanation. The division by the pivot in Superstep 1 is also


carried out for the pivot element itself, yielding ark /ark = 1, despite the fact
that this element should keep its original value ark . Later, this value must be
swapped into akk . This problem is solved by simply reassigning the original
value stored in the temporary variable pivot.
In the matrix update, elements lk[i − kr1] are used instead of lk[i],
because array lk was filled starting from position 0. If desired, the extra index
calculations can be saved by shifting the contents of lk by kr1 positions to
the right just before the update loop. A similar remark holds for uk.
The program contains only rudimentary error handling. For the sake of
brevity, we have only included a crude test for numerical singularity. If no
pivot can be found with an absolute value larger than EPS, then the program
is aborted and an error message is printed. The complete program text is:

#include "bspedupack.h"

#define EPS 1.0e-15

void bsp_broadcast(double *x, int n, int src, int s0, int stride, int p0,
int s, int phase){
/* Broadcast the vector x of length n from processor src to
processors s0+t*stride, 0 <= t < p0. Here n >= 0, p0 >= 1.
The vector x must have been registered previously.
Processors are numbered in one-dimensional fashion.
s = local processor identity.
phase= phase of two-phase broadcast (0 or 1)
Only one phase is performed, without synchronization.
*/

int b, t, t1, dest, nbytes;

b= ( n%p0==0 ? n/p0 : n/p0+1 ); /* block size */

if (phase==0 && s==src){


for (t=0; t<p0; t++){
dest= s0+t*stride;
nbytes= MIN(b,n-t*b)*SZDBL;
if (nbytes>0)
bsp_put(dest,&x[t*b],x,t*b*SZDBL,nbytes);
}
}

if (phase==1 && s%stride==s0%stride){


t=(s-s0)/stride; /* s = s0+t*stride */
if (0<=t && t<p0){
nbytes= MIN(b,n-t*b)*SZDBL;
if (nbytes>0){
for (t1=0; t1<p0; t1++){
dest= s0+t1*stride;

if (dest!=src)
bsp_put(dest,&x[t*b],x,t*b*SZDBL,nbytes);
}
}
}
}

} /* end bsp_broadcast */

int nloc(int p, int s, int n){


/* Compute number of local components of processor s for vector
of length n distributed cyclically over p processors. */

return (n+p-s-1)/p ;

} /* end nloc */

void bsplu(int M, int N, int s, int t, int n, int *pi, double **a){
/* Compute LU decomposition of n by n matrix A with partial pivoting.
Processors are numbered in two-dimensional fashion.
Program text for P(s,t) = processor s+t*M,
with 0 <= s < M and 0 <= t < N.
A is distributed according to the M by N cyclic distribution.
*/

int nloc(int p, int s, int n);


double *pa, *uk, *lk, *Max;
int nlr, nlc, k, i, j, r, *Imax;

nlr= nloc(M,s,n); /* number of local rows */


nlc= nloc(N,t,n); /* number of local columns */

bsp_push_reg(&r,SZINT);
if (nlr>0)
pa= a[0];
else
pa= NULL;
bsp_push_reg(pa,nlr*nlc*SZDBL);
bsp_push_reg(pi,nlr*SZINT);
uk= vecallocd(nlc); bsp_push_reg(uk,nlc*SZDBL);
lk= vecallocd(nlr); bsp_push_reg(lk,nlr*SZDBL);
Max= vecallocd(M); bsp_push_reg(Max,M*SZDBL);
Imax= vecalloci(M); bsp_push_reg(Imax,M*SZINT);

/* Initialize permutation vector pi */


if (t==0){
for(i=0; i<nlr; i++)
pi[i]= i*M+s; /* global row index */
}
bsp_sync();

for (k=0; k<n; k++){


int kr, kr1, kc, kc1, imax, smax, s1, t1;
double absmax, max, pivot;

/****** Superstep 0 ******/


kr= nloc(M,s,k); /* first local row with global index >= k */
kr1= nloc(M,s,k+1);
kc= nloc(N,t,k);
kc1= nloc(N,t,k+1);

if (k%N==t){ /* k=kc*N+t */
/* Search for local absolute maximum in column k of A */
absmax= 0.0; imax= -1;
for (i=kr; i<nlr; i++){
if (fabs(a[i][kc])>absmax){
absmax= fabs(a[i][kc]);
imax= i;
}
}
if (absmax>0.0){
max= a[imax][kc];
} else {
max= 0.0;
}

/* Broadcast value and local index of maximum to P(*,t) */


for(s1=0; s1<M; s1++){
bsp_put(s1+t*M,&max,Max,s*SZDBL,SZDBL);
bsp_put(s1+t*M,&imax,Imax,s*SZINT,SZINT);
}
}
bsp_sync();

/****** Superstep 1 ******/


if (k%N==t){
/* Determine global absolute maximum (redundantly) */
absmax= 0.0;
for(s1=0; s1<M; s1++){
if (fabs(Max[s1])>absmax){
absmax= fabs(Max[s1]);
smax= s1;
}
}
if (absmax > EPS){
r= Imax[smax]*M+smax; /* global index of pivot row */
pivot= Max[smax];
for(i=kr; i<nlr; i++)
a[i][kc] /= pivot;
if (s==smax)
a[imax][kc]= pivot; /* restore value of pivot */

/* Broadcast index of pivot row to P(*,*) */


for(t1=0; t1<N; t1++)
bsp_put(s+t1*M,&r,&r,0,SZINT);
} else {
bsp_abort("bsplu at stage %d: matrix is singular\n",k);
}
}
bsp_sync();

/****** Superstep 2 ******/


if (k%M==s){
/* Store pi(k) in pi(r) on P(r%M,0) */
if (t==0)
bsp_put(r%M,&pi[k/M],pi,(r/M)*SZINT,SZINT);
/* Store row k of A in row r on P(r%M,t) */
bsp_put(r%M+t*M,a[k/M],pa,(r/M)*nlc*SZDBL,nlc*SZDBL);
}
if (r%M==s){
if (t==0)
bsp_put(k%M,&pi[r/M],pi,(k/M)*SZINT,SZINT);
bsp_put(k%M+t*M,a[r/M],pa,(k/M)*nlc*SZDBL,nlc*SZDBL);
}
bsp_sync();

/****** Superstep 3 ******/


/* Phase 0 of two-phase broadcasts */
if (k%N==t){
/* Store new column k in lk */
for(i=kr1; i<nlr; i++)
lk[i-kr1]= a[i][kc];
}
if (k%M==s){
/* Store new row k in uk */
for(j=kc1; j<nlc; j++)
uk[j-kc1]= a[kr][j];
}
bsp_broadcast(lk,nlr-kr1,s+(k%N)*M, s,M,N,s+t*M,0);
bsp_broadcast(uk,nlc-kc1,(k%M)+t*M,t*M,1,M,s+t*M,0);
bsp_sync();

/****** Superstep 4 ******/


/* Phase 1 of two-phase broadcasts */
bsp_broadcast(lk,nlr-kr1,s+(k%N)*M, s,M,N,s+t*M,1);
bsp_broadcast(uk,nlc-kc1,(k%M)+t*M,t*M,1,M,s+t*M,1);
bsp_sync();

/****** Superstep 0 ******/


/* Update of A */
for(i=kr1; i<nlr; i++){
for(j=kc1; j<nlc; j++)
a[i][j] -= lk[i-kr1]*uk[j-kc1];
}
}

bsp_pop_reg(Imax); vecfreei(Imax);
bsp_pop_reg(Max); vecfreed(Max);
bsp_pop_reg(lk); vecfreed(lk);
bsp_pop_reg(uk); vecfreed(uk);
bsp_pop_reg(pi);
bsp_pop_reg(pa);
bsp_pop_reg(&r);

} /* end bsplu */

2.6 Experimental results on a Cray T3E


Experiment does to computation models what the catwalk does to fashion
models: it subjects the models to critical scrutiny, exposes their good and
bad sides, and makes the better models stand out. In this section, we put the
predictions of the BSP model for LU decomposition to the test. We perform
numerical experiments to check whether the theoretical benefits of two-phase
broadcasting can be observed in practice. To do this, we measure the per-
formance of the function bsplu with the two-phase broadcasting function
bsp broadcast, and also with a one-phase broadcasting function.
We performed our experiments on 64 processors of the Cray T3E intro-
duced in Section 1.7. We compiled our LU program using the standard
Cray ANSI C compiler and BSPlib version 1.4 with optimization flags
-flibrary-level 2 -O3 -bspfifo 10000 -fcombine-puts. We multiplied
the times produced by the (faulty) timer by 4.0, to obtain correct time meas-
urements, as was done in Section 1.7. A good habit is to run the benchmark
program bspbench just before running an application program such as bsplu.
This helps detecting system changes (improvements or degradations) and tells
you what BSP machine you have today. The BSP parameters of our computer
are p = 64, r = 38.0 Mflop/s, g = 87, l = 2718.
In the experiments, the test matrix A is distributed by the 8 × 8 cyclic dis-
tribution. The matrix is chosen such that the pivot row in stage k is row k + 1;
this forces a row swap with communication in every stage of the algorithm,
because rows k and k + 1 reside on different processor rows.
Table 2.3 presents the total execution time of LU decomposition with one-
phase and two-phase broadcasts. The difference between the two cases is small
but visible. For n < 4000, LU decomposition with the one-phase broadcast is
faster because it requires less synchronization; this is important for small
problems. For n > 4000, LU decomposition with the two-phase broadcast is
faster, which is due to better spreading of the communication. The savings
in broadcast time, however, are insignificant compared with the total execu-
tion time. The break-even point for the two types of broadcast lies at about
n = 4000.
Why is the difference in total execution time so small? Timing the super-
steps is an excellent way of answering such questions. By inserting a bsp time
statement after every bsp sync, taking the time difference between subsequent

Table 2.3. Time (in s) of LU decomposition on a 64-processor Cray T3E using one-phase and two-phase broadcasts

n One-phase Two-phase

1000 1.21 1.33


2000 7.04 7.25
3000 21.18 21.46
4000 47.49 47.51
5000 89.90 89.71
6000 153.23 152.79
7000 239.21 238.25
8000 355.84 354.29
9000 501.92 499.74
10 000 689.91 689.56

One-phase broadcast
14
Two-phase broadcast

12

10
Time (in s)

0
0 2000 4000 6000 8000 10 000
n

Fig. 2.6. Total broadcast time of LU decomposition on a 64-processor Cray T3E using one-phase and two-phase broadcasts.

synchronizations, and adding the times for the same program superstep, we
obtain the total time spent in each of the supersteps of the program. By
adding the total time of program supersteps 3 and 4, we compute the total
broadcast time, shown in Fig. 2.6. In this figure, it is easy to see that for
large matrices the two-phase broadcast is significantly faster than the one-
phase broadcast, thus confirming our theoretical analysis. For small matrices,
with n < 4000, the vectors to be broadcast are too small to justify the extra

synchronization. Note that for n = 4000, each processor has a local submat-
rix of size 500 × 500, so that it broadcasts two vectors of size 499 in stage
0, and this size decreases until it reaches 1 in stage 498; in all stages, the
vectors involved are relatively small. The theoretical asymptotic gain factor in broadcast time for large matrices is about √p/2 = 4; the observed gain
factor of about 1.4 at n = 10 000 is still far from that asymptotic value. Our
results imply that for n = 4000 the broadcast time represents only about
4.8% of the total time; for larger n this fraction is even less. Thus, the signi-
ficant improvement in broadcast time in the range n =4000–10 000 becomes
insignificant compared to the total execution time, explaining the results of
Table 2.3. (On a different computer, with faster computation compared to
communication and hence a higher g, the improvement would be felt also in
the total execution time.)
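A schematic fragment (ours, not the actual instrumentation used for the measurements) illustrates this timing technique with the BSPlib timer bsp_time: the array tsuper accumulates, per program superstep, the elapsed time between consecutive synchronizations.

/* Sketch: timing the supersteps of the main loop of bsplu.
   The superstep bodies are omitted; k and n are as in bsplu. */
double t0, t1, tsuper[5]= {0.0, 0.0, 0.0, 0.0, 0.0};

t0= bsp_time();
for (k=0; k<n; k++){
    /* ... body of Superstep 0 ... */
    bsp_sync();
    t1= bsp_time(); tsuper[0] += t1-t0; t0= t1;

    /* ... body of Superstep 1 ... */
    bsp_sync();
    t1= bsp_time(); tsuper[1] += t1-t0; t0= t1;

    /* ... and similarly for Supersteps 2, 3, and 4 ... */
}
/* tsuper[3]+tsuper[4] is the total broadcast time on this processor */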
Program supersteps 2, 3, and 4 account for almost all of the communica-
tion carried out by bsplu. These supersteps perform the row swaps, phase 0
of the row and column broadcasts, and phase 1, respectively. The BSP
model predicts that these widely differing operations have the same total cost,

[Figure 2.7: plot of time (in s) against the matrix size n, with data points for the row swaps, broadcast phase 0, and broadcast phase 1, and lines for the pessimistic and optimistic predictions; see the caption below.]

Fig. 2.7. Total measured time (shown as data points) of row swaps, broadcast
phases 0 and broadcast phases 1 of LU decomposition on a 64-processor Cray
T3E. Also given is the total predicted time (shown by lines).

n²g/√p + nl. Figure 2.7 shows the measured time for these operations. In gen-
eral, the three timing results for the same problem size are reasonably close
to each other, which at least qualitatively confirms the prediction of the BSP
model. This is particularly encouraging in view of the fact that the commun-
ication volumes involved are quite different: for instance, the communication

volume of phase 1 is about √p − 1 = 7 times that of phase 0. (We may conclude
that communication volume is definitely not a good predictor of communica-
tion time, and that the BSP cost is a much better predictor.) We can also use

the theoretical cost n²g/√p + nl together with benchmark results for r, g, and
l, to predict the time of the three supersteps in a more precise, quantitative
way. To do this, the BSP cost in flops is converted into a time in seconds by
multiplying with tflop = 1/r. The result for the values obtained by bspbench
is plotted in Fig. 2.7 as ‘pessimistic prediction’. The reason for this title is
obvious from the plot.
To explain the overestimate of the communication time, we note that the
theoretical BSP model does not take header overhead into account, that
is, the cost of sending address information together with the data themselves.
The BSP cost model is solely based on the amount of data sent, not on
that of the associated headers. In most practical cases, this matches reality,
because the header overhead is often insignificant. If we benchmark g also
in such a situation, and use this (lower) value of g, the BSP model will
predict communication time well. We may call this value the optimistic
g-value. This value is measured by bspprobe, see Section 1.8.3. If, however,
the data are communicated by put or get operations of very small size, say
less than five reals, such overhead becomes significant. In the extreme case of
single words as data, for example, one real, we have a high header overhead,
which is proportional to the amount of data sent. We can then just include
this overhead in the cost of sending the data themselves, which leads to a
higher, pessimistic g-value. This is the value measured by bspbench. In
fact, the header overhead includes more than just the cost of sending header
information; for instance, it also includes the overhead of a call to the bsp put
function. Such costs are conveniently lumped together. In the transition range,
for data size in the range 1–5 reals, the BSP model does not accurately predict
communication time, but we have an upper and a lower bound. It would be
easy to extend the model to include an extra parameter (called the block size
B in the BSP∗ model [15]), but this would be at the expense of simplicity. We
shall stick to the simple BSP model, and rely on our common sense to choose
between optimism and pessimism.
For LU decomposition, the optimistic g-value is appropriate since we send
data in large blocks. Here, we measure the optimistic g-value by modifying
bspbench to use puts of 16 reals (each of 64 bits), instead of single reals.
This gives g = 10.1. Results with this g are plotted as ‘optimistic prediction’.
The figure shows that the optimistic prediction matches the measurements
reasonably well.
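The difference between the two benchmarking styles lies in the granularity of the puts in the innermost loop; the fragment below is a sketch of ours (not the actual bspbench code), where data and x are registered arrays of doubles, dest is the destination processor, and h is the number of reals put.

/* Pessimistic style: h puts of a single real each */
for (i=0; i<h; i++)
    bsp_put(dest, &data[i], x, i*SZDBL, SZDBL);

/* Optimistic style: puts of blocks of 16 reals (assuming h%16==0) */
for (i=0; i<h; i+=16)
    bsp_put(dest, &data[i], x, i*SZDBL, 16*SZDBL);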

We have looked at the total time of the LU decomposition and the total
time of the different supersteps. Even more insight can be gained by examining
the individual supersteps. The easiest way of doing this is to use the Oxford
BSP toolset profiler, which is invoked by compiling with the option -prof.
Running the resulting program creates PROF.bsp, a file that contains the
statistics of every individual superstep carried out. This file is converted into
a plot in Postscript format by the command
bspprof PROF.bsp
An example is shown in Fig. 2.8. We have zoomed in on the first three stages
of the algorithm by using the zooming option, which specifies the starting and
finishing time in seconds of the profile plot:
bspprof -zoom 0.06275,0.06525 PROF.bsp

[Figure 2.8: profile plot produced by bspprof on the Cray T3E. The top panel shows the number of bytes sent and the bottom panel the number of bytes received per superstep, against elapsed time in milliseconds; a legend relates the superstep numbers 9–14 to source lines of bsplu.c, and the shading identifies processes 0–7. See the caption below.]

Fig. 2.8. Profile of stages k = 0, 1, 2 of an LU decomposition with two-phase broad-


cast on an 8-processor Cray T3E, for n = 100, M = 8, N = 1. The horizontal
axis shows the time. The vertical axis shows the communication of a superstep
in bytes sent (top) and received (bottom). The height of the bar represents the
total communication volume. The shaded parts of the bar represent the values
hs (s) (top) and hr (s) (bottom) in bytes for the different processors P (s). The
width of the bar represents the communication time of the corresponding super-
step. The distance between two bars represents the computation time of the
corresponding superstep. The supersteps are numbered based on the program
supersteps.

The absolute times given in the profile must be taken with a grain of salt,
since the profiling itself adds some extra time.
The BSP profile of a program tells us where the communication time is
spent, which processor is busiest communicating, and whether more time is
spent communicating or computing. Because our profiling example concerns
an LU decomposition with a row distribution (M = 8, N = 1), it is particu-
larly easy to recognize what happens in the supersteps. Before reading on, try
to guess which superstep is which.
Superstep 10 in the profile corresponds to program superstep 0, super-
step 11 corresponds to program superstep 1, and so on. Superstep 10
communicates the local maxima found in the pivot search. For the small
problem size of n = 100 this takes a significant amount of time, but for larger
problems the time needed becomes negligible compared to the other parts of
the program. Superstep 11 contains no communication because N = 1, so
that the broadcast of the pivot index within a processor row becomes a copy
operation within a processor. Superstep 12 represents the row swap; it has
two active processors, namely the owners of rows k and k + 1. In stage 0,
these are processors P (0) and P (1); in stage 1, P (1) and P (2); and in stage 2,
P (2) and P (3). Superstep 13 represents the first phase of a row broadcast. In
stage 0, P (0) sends data and all other processors receive; in stage 1, P (1) is
the sender. Superstep 14 represents the second phase. In stage 0, P (0) sends
⌈99/8⌉ = 13 row elements to seven other processors; P (1)–P (6) each send 13
elements to six other processors (not to P (0)); and P (7) sends the remain-
ing 8 elements to six other processors. The number of bytes sent by P (0) is
13 · 7 · 8 = 728; by each of P (1)–P (6), 624; and by P (7), 384. The total is
4856 bytes. These numbers agree with the partitioning of the bars in the top
part of the plot.
Returning to the fashion world, did the BSP model survive the catwalk?
Guided by the BSP model, we obtained a theoretically superior algorithm
with a better spread of the communication tasks over the processors. Our
experiments show that this algorithm is also superior in practice, but that the
benefits occur only in a certain range of problem sizes and that their impact
is limited on our particular computer. The BSP model helped explaining our
experimental results, and it can tell us when to expect significant benefits.
The superstep concept of the BSP model helped us zooming in on certain
parts of the computation and enabled us to understand what happens in
those parts. Qualitatively speaking, we can say that the BSP model passed
an important test. The BSP model also gave us a rough indication of the
expected time for different parts of the algorithm. Unfortunately, to obtain
this indication, we had to distinguish between two types of values for the
communication parameter g, reflecting whether or not the put operations are
extremely small. In most cases, we can (and should) avoid extremely small
put operations, at least in the majority of our communication operations, so

that we can use the optimistic g-value for predictions. Even then, the resulting
prediction can easily be off by 50%, see Fig. 2.7.
A simple explanation for the remaining discrepancy between prediction
and experiment is that there are lies, damned lies, and benchmarks. Sub-
stituting a benchmark result in a theoretical time formula gives an ab initio
prediction, that is, a prediction from basic principles, and though this may
be useful as an indication of expected performance, it will hardly ever be
an accurate estimate. There are just too many possibilities for quirks in the
hardware and the system software, ranging from obscure cache behaviour to
inadequate implementation of certain communication primitives. Therefore,
we should not have unrealistic quantitative expectations of a computation
model.

2.7 Bibliographic notes


2.7.1 Matrix distributions
Almost all matrix distributions that have been proposed for use in parallel
matrix computations are Cartesian. The term ‘Cartesian’ was first used in
this context by Bisseling and van de Vorst [23] and Van de Velde [181] in
articles on LU decomposition. Van de Velde even defines a matrix distribution
as being Cartesian. (Still, non-Cartesian matrix distributions exist and they
may sometimes be useful, see Exercise 1 and Chapter 4.) An early example
of a Cartesian distribution is the cyclic row distribution, which is just the
p × 1 cyclic distribution. Chu and George [42] use this distribution to show
that explicit row swaps lead to a good load balance during the whole LU
decomposition and that the resulting savings in computation time outweigh
the additional communication time. Geist and Romine [77] present a method
for reducing the number of explicit row swaps (by a factor of two) while
preserving good load balance. They relax the constraint of maintaining a
strict cyclic distribution of the rows, by demanding only that successive sets
of p pivot rows are evenly distributed over the p processors.
In 1985, O’Leary and Stewart [149] introduced the square cyclic distri-
bution in the field of parallel matrix computations; they called it torus
assignment. In their scheme, the matrix elements and the corresponding
computations are assigned to a torus of processors, which executes a data-
flow algorithm. Over the years, this distribution has acquired names such as
cyclic storage [115], grid distribution [182], scattered square decom-
position [71], and torus-wrap mapping [7], see also [98], and it has been
used in a wide range of matrix computations. It seems that the name ‘cyclic
distribution’ has been generally accepted now, and therefore we use this term.
Several algorithms based on the square cyclic distribution have been pro-
posed for parallel LU decomposition. Van de Vorst [182] compares the square
cyclic distribution with other distributions, such as the square block distribu-
tion, which allocates square submatrices of size n/√p × n/√p to processors.

The square block distribution leads to a bad load balance, because more
and more processors become idle when the computation proceeds. As a res-
ult, the computation takes three times longer than with the square cyclic
distribution. Fox et al. [71,Chapter 20] present an algorithm for LU decom-
position of a banded matrix. They perform a theoretical analysis and give
experimental results on a hypercube computer. (A matrix A is banded with
upper bandwidth bU and lower bandwidth bL if aij = 0 for i < j − bU
and i > j + bL . A dense matrix can be viewed as a degenerate special case
of a banded matrix, with bL = bU = n − 1.) Bisseling and van de Vorst [23]
prove optimality with respect to load balance of the square cyclic distribution,
within the class of Cartesian distributions. They also show that the communication time on a square mesh of processors is of the same order, O(n²/√p), as the load imbalance and that on a complete network the communication volume is O(n²√p). (For a BSP computer this would imply a communication cost of O(n²g/√p), provided the communication can be balanced.) Extending
these results, Hendrickson and Womble [98] show that the square cyclic distri-
bution is advantageous for a large class of matrix computations, including LU
decomposition, QR decomposition, and Householder tridiagonalization. They
present experimental results for various ratios N/M of the M × N cyclic
distribution.
A straightforward generalization of the cyclic distribution is the block-
cyclic distribution, where the cyclic distribution is used to assign rectangu-
lar submatrices to processors instead of assigning single matrix elements.
O’Leary and Stewart [150] proposed this distribution already in 1986, giving
it the name block-torus assignment. It is now widely used, for example, in
ScaLAPACK (Scalable Linear Algebra Package) [24,25,41] and in the object-
oriented package PLAPACK (Parallel Linear Algebra Package) [4,180]. The
M × N block-cyclic distribution with block size b0 × b1 is defined by

    φ0(i) = (i div b0) mod M,    φ1(j) = (j div b1) mod N,    for 0 ≤ i, j < n.    (2.33)
A good choice of the block size can improve efficiency. In general, larger block
sizes lead to lower synchronization cost, but also to higher load imbalance.
Usually, the block size does not affect the communication performance. The
characteristics of the particular machine used and the problem size determine
the best block size. In many cases, synchronization time is insignificant so
that the best choice is b0 = b1 = 1, that is, the square cyclic distribution.
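In C, the mapping (2.33) reads as follows (a sketch of ours; the function name is hypothetical); choosing b0 = b1 = 1 recovers the M × N cyclic distribution of (2.19).

/* Sketch: owner P(s,t) of matrix element a(i,j) under the M x N
   block-cyclic distribution with block size b0 x b1, cf. (2.33). */
void blockcyclic_owner(int i, int j, int M, int N, int b0, int b1,
                       int *s, int *t){
    *s= (i/b0)%M; /* processor row phi_0(i) */
    *t= (j/b1)%N; /* processor column phi_1(j) */
}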
Note that blocking of a distribution has nothing to do with blocking of
an algorithm, which is merely a way of organizing the (sequential or parallel)
algorithm, usually with the aim of formulating it in terms of matrix operations,
rather than operations on single elements or vectors. Blocking of algorithms
makes it possible to attain peak computing speeds and hence algorithms are
usually blocked in packages such as LAPACK (Linear Algebra PACKage) [6];
see also Exercise 10. Hendrickson, Jessup, and Smith [94] present a blocked

parallel eigensystem solver based on the square cyclic distribution, and they
argue in favour of blocking of algorithms but not of distributions.

2.7.2 Collective communication


The idea of spreading data before broadcasting them was already used by
Barnett et al. [9,10], who call the two phases of their broadcast ‘scatter’
and ‘collect’. The interprocessor Collective Communication (iCC) library is
an implementation of broadcasting and related algorithms; it also imple-
ments hybrid methods for broadcasting vectors of intermediate length. In
the BSP context, the cost analysis of a two-phase broadcast becomes easy.
Juurlink [116] presents and analyses a set of communication primitives includ-
ing broadcasts for different vector lengths. Bisseling [20] implements a two-
phase broadcast and shows that it reduces communication time significantly
in LU decomposition. This implementation, however, has not been optimized
by sending data in blocks, and hence the g-values involved are pessimistic.
The timing results in the present chapter are much better than in [20] (but
as a consequence the gains of a two-phase broadcast are less clearly visible).

2.7.3 Parallel matrix computations


The handbook on matrix computations by Golub and Van Loan [79] is a stand-
ard reference for the field of numerical linear algebra. It treats sequential LU
decomposition in full detail, discussing issues such as roundoff errors, partial
and complete pivoting, matrix scaling, and iterative improvement. The book
provides a wealth of references for further study. Chapter 6 of this handbook
is devoted to parallel matrix computations, with communication by message
passing using sends and receives. As is often done in message passing, it is
assumed that communication takes α + βn time units for a message of length
n, with α the startup time of the message and β the time per data word.
The book by Dongarra, Duff, Sorensen, and van der Vorst [61] is a very read-
able introduction to numerical linear algebra on high-performance computers.
It treats architectures, parallelization techniques, and the direct and iterat-
ive solution of linear systems and eigensystems, with particular attention to
sparse systems.
Modern software for sequential LU decomposition and other matrix
computations is provided by LAPACK [6], a parallel version of which is
ScaLAPACK [24,25,41]. These widely used packages contain solvers for linear
systems, eigenvalue systems, and linear least-squares problems, for dense sym-
metric and unsymmetric matrices. The packages also contain several solvers
for banded, triangular, and tridiagonal matrices. ScaLAPACK is available
on top of the portability layers PVM and MPI; Horvitz and Bisseling [110]
discuss how ScaLAPACK can be ported to BSPlib and they show for LU
decomposition that savings in communication time can be obtained by using
BSPlib.

Several BSP algorithms have been designed for LU decomposition.


Gerbessiotis and Valiant [78] present a BSP algorithm for Gauss–Jordan elim-
ination, a matrix computation that is quite similar to LU decomposition;
their algorithm uses the square cyclic distribution and a tree-based broadcast
(which is less efficient than a two-phase broadcast). McColl [133] presents a
BSP algorithm for LU decomposition without pivoting that is very different
from other LU decomposition algorithms. He views the computation as a dir-
ected acyclic graph forming a regular three-dimensional mesh of n × n × n
vertices, where vertex (i, j, k) represents matrix element aij in stage k. The
absence of pivoting allows for more flexibility in scheduling the computation.
Instead of processing the vertices layer by layer, that is, stage after stage, as
is done in other algorithms, McColl’s algorithm processes the vertices in cubic
blocks, with p blocks handled in parallel. This algorithm has time complexity
√ √
O(n3 /p + n2 g/ p + pl), which may be attractive because of the low syn-
chronization cost. The computation and communication cost are a constant
factor higher than for Algorithm 2.8.

2.8 Exercises
1. Find a matrix distribution for parallel LU decomposition that is optimal
with respect to computational load balance in all stages of the computation.
The distribution need not be Cartesian. When would this distribution be
applicable?
2. The ratio N/M = 1 is close to optimal for the M × N cyclic distribution
used in Algorithm 2.8 and hence this ratio was assumed in our cost analysis.
The optimal ratio, however, may be slightly different. This is mainly due to
an asymmetry in the communication requirements of the algorithm. Explain
this by using Table 2.2. Find the ratio N/M with the lowest communication
cost, for a fixed number of processors p = M N . What is the reduction in
communication cost for the optimal ratio, compared with the cost for the
ratio N/M = 1?
3. Algorithm 2.8 contains a certain amount of unnecessary communication,
because the matrix elements arj with j > k are first swapped out and then
spread and broadcast. Instead, they could have been spread already from their
original location.

(a) How would you modify the algorithm to eliminate superfluous commu-
nication? How much communication cost is saved for the square cyclic
distribution?
(b) Modify the function bsplu by incorporating this algorithmic improve-
ment. Test the modified program for n = 1000. What is the resulting
reduction in execution time? What is the price to be paid for this
optimization?

4. Take a critical look at the benefits of two-phase broadcasting by first run-


ning bsplu on your own parallel computer for a range of problem sizes and
then replacing the two-phase broadcast by a simple one-phase broadcast.
Measure the run time for both programs and explain the difference. Gain
more insight by timing the broadcast parts separately.
5. (∗) The Cholesky factor of a symmetric positive definite matrix A is
defined as the lower triangular matrix L with positive diagonal entries that
satisfies A = LLT . (A matrix A is symmetric if it equals its transpose,
A = AT, and it is positive definite if xT Ax > 0 for all x ≠ 0.)
(a) Derive a sequential Cholesky factorization algorithm that is sim-
ilar to the sequential LU decomposition algorithm without pivoting,
Algorithm 2.2. The Cholesky algorithm should save half the flops,
because it does not have to compute the upper triangular matrix LT .
Furthermore, pivoting is not needed in the symmetric positive definite
case, see [79].
(b) Design a parallel Cholesky factorization algorithm that uses the M × N
cyclic distribution. Take care to communicate only where needed.
Analyse the BSP cost of your algorithm.
(c) Implement and test your algorithm.
6. (∗) Matrix–matrix multiplication is a basic building block in linear algebra
computations. The highest computing rates can be achieved by constructing
algorithms based on this operation. Consider the matrix product C = AB,
where A, B, and C are n × n matrices. We can divide the matrices into
submatrices of size n/q × n/q, where we assume that n mod q = 0. Thus we
can write A = (Ast )0≤s,t<q , and similarly for B and C. We can express the
matrix product in terms of submatrix products,
    Cst = Σ_{u=0}^{q−1} Asu But,    for 0 ≤ s, t < q.    (2.34)
On a sequential computer with a cache, this is even the best way of computing
C, since we can choose the submatrix size such that two submatrices and
their product fit into the cache. On a parallel computer, we can compute the
product in two-dimensional or three-dimensional fashion; the latter approach
was proposed by Aggarwal, Chandra, and Snir [3], as an example of the use
of their LPRAM model. (See also [1] for experimental results and [133] for
a BSP analysis.)
(a) Let p = q² be the number of processors. Assume that the matrices
are distributed by the square block distribution, so that processor
P (s, t) holds the submatrices Ast and Bst on input, and Cst on output.
Design a BSP algorithm for the computation of C = AB where P (s, t)
computes Cst . Analyse the cost and memory use of your algorithm.
(b) Let p = q 3 be the number of processors. Design a BSP algorithm where
processor P (s, t, u) with 0 ≤ s, t, u < q computes the product Asu But .
Choose a suitable distribution for the input and output matrices
that spreads the data evenly. Analyse the cost and memory use of
your algorithm. When is the three-dimensional algorithm preferred?
Hint: assume that on input the submatrix Asu is distributed over
P (s, ∗, u).

7. (∗∗) Once upon a time, there was a Mainframe computer that had great dif-
ficulty in multiplying floating-point numbers and preferred to add or subtract
them instead. So the Queen decreed that computations should be carried out
with a minimum of multiplications. A young Prince, Volker Strassen [170],
set out to save multiplications in the Queen’s favourite pastime, computing
the product of 2 × 2 matrices on the Royal Mainframe,
\begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix} =
\begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{pmatrix}
\begin{pmatrix} b_{00} & b_{01} \\ b_{10} & b_{11} \end{pmatrix}.   (2.35)
At the time, this took eight multiplications and four additions. The young
Prince slew one multiplication, but at great cost: fourteen new additions
sprang up. Nobody knew how he had obtained his method, but there were
rumours [46], and indeed the Prince had drunk from the magic potion. Later,
three additions were beheaded by the Princes Paterson and Winograd and the
resulting Algorithm 2.9 was announced in the whole Kingdom. The Queen’s
subjects happily noted that the new method, with seven multiplications and
fifteen additions, performed the same task as before. The Queen herself lived
happily ever after and multiplied many more 2 × 2 matrices.
(a) Join the inhabitants of the Mainframe Kingdom and check that the
task is carried out correctly. (A C transcription of Algorithm 2.9, given at the end of this exercise, may help with the checking.)
(b) Now replace the matrix elements by submatrices of size n/2 × n/2,
\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix} =
\begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix}
\begin{pmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{pmatrix}.   (2.36)
The Strassen method can be applied here as well, because it does
not rely on the commutativity of the real numbers. (Commutativ-
ity, ab = ba, holds for real numbers a and b, but in general not for
matrices A and B.) This should be beneficial, because multiplication
of submatrices is much more expensive than addition of submatrices
(and not only on mainframes!). The method requires the multiplica-
tion of smaller submatrices. This can again be done using Strassen’s
method, and so on, until the remaining problem is small and traditional
matrix–matrix multiplication is used. The resulting matrix–matrix
multiplication algorithm is the Strassen algorithm. We say that the
method is applied recursively, that is, calling itself on smaller prob-
lem sizes. The number of times the original matrix size must be halved
to reach the current matrix size is called the current level of the recursion.

Algorithm 2.9. Strassen 2 × 2 matrix–matrix multiplication.

input: A : 2 × 2 matrix.
       B : 2 × 2 matrix.
output: C : 2 × 2 matrix, C = AB.

l0 := a00;          r0 := b00;
l1 := a01;          r1 := b10;
l2 := a10 + a11;    r2 := b01 − b00;
l3 := a00 − a10;    r3 := b11 − b01;
l4 := l2 − a00;     r4 := r3 + b00;
l5 := a01 − l4;     r5 := b11;
l6 := a11;          r6 := b10 − r4;

for i := 0 to 6 do
    mi := li ri;

t0 := m0 + m4;
t1 := t0 + m3;

c00 := m0 + m1;
c01 := t0 + m2 + m5;
c10 := t1 + m6;
c11 := t1 + m2;

Assume that n is a power of two. Derive a formula that
expresses T (n), the time needed to multiply two n × n matrices, in
terms of T (n/2). Use this formula to count the number of flops of
the Strassen algorithm with a switch to the traditional method at size
r × r, 1 ≤ r ≤ n. (The traditional method computes an r × r matrix
product by performing r − 1 floating-point additions and r floating-
point multiplications for each of the r2 elements of the output matrix,
thus requiring a total of 2r3 − r2 flops.)
(c) Prove that the Strassen algorithm with r = 1 requires O(n^{log_2 7}) ≈
O(n^{2.81}) flops, which scales better than the O(n^3) flops of tradi-
tional matrix–matrix multiplication. What is the optimal value of the
switching parameter r and the corresponding total number of flops?
(d) Write a traditional matrix–matrix multiplication function and a recurs-
ive sequential function that implements the Strassen algorithm. Com-
pare their accuracy for some test matrices.
(e) Recursion is a gift to the programmer, because it takes the burden of
data management from his shoulders. For parallelism, recursion is not
such a blessing, because it zooms in on one task to be carried out while
a parallel computation tries to perform many tasks at the same time.
(This is explained in more detail in Section 3.3, where a recursive fast
Fourier transform is cast into nonrecursive form to prepare the ground
for parallelization.) Write a nonrecursive sequential Strassen function
that computes all matrices of one recursion level in the splitting part
together, and does the same in the combining part. Hint: in the split-
ting part, the input of a level is an array L0 , . . . , Lq−1 of k × k matrices
and a corresponding array R0 , . . . , Rq−1 , where q = 7^{log_2(n/k)}. The out-
put is an array L0 , . . . , L7q−1 of k/2×k/2 matrices and a corresponding
array R0 , . . . , R7q−1 . How do the memory requirements grow with an
increase in level? Generalize the function matallocd from BSPedupack
to the three-dimensional case and use the resulting function to allocate
space for arrays of matrices. Matrices of size r × r are not split any
more; instead the products Mi = Li Ri are computed in the traditional
manner. After these multiplications, the results are combined, in the
reverse direction of the splittings.
(f) The addition of two n/2 × n/2 submatrices of an n × n matrix A, such
as the computation of A10 + A11 , is a fundamental operation in the
Strassen algorithm. Assume this is done in parallel by p processors,

with p a power of four and p ≤ n/2. Which distribution for A is
better: the square block distribution or the square cyclic distribution?
Why? (Loyens and Moonen [130] answered this question first.)
(g) Implement a parallel nonrecursive Strassen algorithm. Use the dis-
tribution found above for all matrices that occur in the splitting and
combining parts of the Strassen algorithm. Redistribute the data before
and after the multiplications Mi := Li Ri at matrix size r′ × r′ . (This
approach was proposed by McColl [134].) Perform each multiplication
on one processor, using the sequential recursive Strassen algorithm and
switching to the traditional algorithm at size r × r, where r ≤ r′ .
(h) Analyse the BSP cost of the parallel algorithm. Find an upper bound
for the load imbalance that occurs because p (a power of four) does not
divide the number of multiplication tasks (a power of seven). Discuss
the trade-off between computational load imbalance and communica-
tion cost and their scaling behaviour as a function of the number of
splitting levels. In practice, how would you choose r′ ?
(i) Measure the run time of your program for different parameters
n, p, r, r′ . Explain your results.
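Referring to part (a): the following hedged C transcription of Algorithm 2.9 (the function name is illustrative) can be compared entry by entry against the straightforward eight-multiplication product; it uses seven multiplications and fifteen additions/subtractions.

    /* Algorithm 2.9 on scalar 2 x 2 matrices: c := a * b. */
    void strassen2x2(const double a[2][2], const double b[2][2], double c[2][2]) {
        double l[7], r[7], m[7], t0, t1;
        l[0] = a[0][0];           r[0] = b[0][0];
        l[1] = a[0][1];           r[1] = b[1][0];
        l[2] = a[1][0] + a[1][1]; r[2] = b[0][1] - b[0][0];
        l[3] = a[0][0] - a[1][0]; r[3] = b[1][1] - b[0][1];
        l[4] = l[2] - a[0][0];    r[4] = r[3] + b[0][0];
        l[5] = a[0][1] - l[4];    r[5] = b[1][1];
        l[6] = a[1][1];           r[6] = b[1][0] - r[4];
        for (int i = 0; i < 7; i++)
            m[i] = l[i] * r[i];   /* the seven multiplications */
        t0 = m[0] + m[4];
        t1 = t0 + m[3];
        c[0][0] = m[0] + m[1];
        c[0][1] = t0 + m[2] + m[5];
        c[1][0] = t1 + m[6];
        c[1][1] = t1 + m[2];
    }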
8. (∗∗) Householder tridiagonalization decomposes an n × n symmetric matrix


A into A = Q1 T QT1 , where Q1 is an n × n orthogonal matrix and T an n × n
symmetric tridiagonal matrix. (To recall some linear algebra: a square matrix
Q is orthogonal if QT Q = I, which is equivalent to Q−1 = QT ; a matrix T
is tridiagonal if tij = 0 for |i − j| > 1. Do not confuse the T denoting the
tridiagonal matrix with the superscript denoting transposition!)
Tridiagonalization is often a stepping stone in achieving a larger goal,
namely solving a real symmetric eigensystem. This problem amounts to
decomposing a real symmetric matrix A into A = QDQT , where Q is ortho-
gonal and D is diagonal, that is, dij = 0 for i ≠ j. The eigenvalues of A
are the diagonal elements of D and the corresponding eigenvectors are the
columns of Q. It is much easier to solve a symmetric tridiagonal eigensystem
and decompose T into T = Q2 D Q2^T than to solve the original system. As a
result, we obtain A = (Q1 Q2 )D(Q1 Q2 )T , which has the required form. Here,
we concentrate on the tridiagonalization part of the eigensystem solution,
which accounts for the majority of the flops. See [79] for more details.
(a) The central operation in Householder tridiagonalization is the applica-
tion of a Householder reflection
P_v = I_n − (2/‖v‖^2) v v^T.   (2.37)

Here, v ≠ 0 is a vector of length n with Euclidean norm ‖v‖ =
‖v‖_2 = (\sum_{i=0}^{n−1} v_i^2)^{1/2}. For brevity, we drop the subscript ‘2’ from the
norm. Like all our vectors, v is a column vector, and hence it can also
be viewed as an n × 1 matrix. Note that vv^T represents an n × n matrix,
in contrast to v^T v, which is the scalar ‖v‖^2. Show that Pv is symmetric
and orthogonal. We can apply Pv to a vector x and obtain
P_v x = x − (2/‖v‖^2) v v^T x = x − (2 v^T x/(v^T v)) v.   (2.38)

(b) Let e = (1, 0, 0, . . . , 0)^T. Show that the choice v = x − ‖x‖ e implies
Pv x = ‖x‖ e. This means that we have an orthogonal transformation
that sets all components of x to zero, except the first.
(c) Algorithm 2.10 is a sequential algorithm that determines a vector v
such that Pv x = ‖x‖ e. For convenience, the algorithm also outputs
the corresponding scalar β = 2/‖v‖^2 and the norm of the input vector
µ = ‖x‖. The vector has been normalized such that v0 = 1. For the
memory-conscious, this can save one memory cell when storing v. The
algorithm contains a clever trick proposed by Parlett [154] to avoid
subtracting nearly equal quantities (which would result in so-called
subtractive cancellation and severe loss of significant digits). Now
design and implement a parallel version of this algorithm. Assume
that the input vector x is distributed by the cyclic distribution over p
processors. The output vector v should become available in the same
distribution. Try to keep communication to a minimum. What is the
BSP cost? (A sequential C transcription of Algorithm 2.10 is sketched at the end of this exercise as a starting point.)
Algorithm 2.10. Sequential Householder reflection.

input: x : vector of length n, n ≥ 2, x(1 : n − 1) ≠ 0.
output: v : vector of length n, such that v0 = 1 and Pv x = ‖x‖ e,
        β = 2/‖v‖^2,
        µ = ‖x‖.
function call: (v, β, µ) := Householder(x, n).

{ Compute α = ‖x(1 : n − 1)‖^2 and µ }
α := 0;
for i := 1 to n − 1 do
    α := α + xi^2;
µ := √(x0^2 + α);

{ Compute v = x − ‖x‖ e }
if x0 ≤ 0 then v0 := x0 − µ else v0 := −α/(x0 + µ);
for i := 1 to n − 1 do
    vi := xi;

{ Compute β and normalize v }
β := 2v0^2/(v0^2 + α);
for i := 1 to n − 1 do
    vi := vi/v0;
v0 := 1;


(d) In stage k of the tridiagonalization, a Householder vector v is deter-


mined for column k of the current matrix A below the diagonal. The
matrix is then transformed into Pk APk , where Pk = diag(Ik+1 , Pv ) is a
symmetric matrix. (The notation diag(A0 , . . . , Ar ) stands for a block-
diagonal matrix with blocks A0 , . . . , Ar on the diagonal.) This sets the
elements aik with i > k + 1 to zero, and also the elements akj with j >
k + 1; furthermore, it sets ak+1,k and ak,k+1 to µ = ‖A(k + 1 : n − 1, k)‖.
The vector v without its first component can be stored precisely in the
space of the zeros in column k. Our memory-frugality paid off! As a
result, we obtain the matrix Q1 = Pn−3 · · · P0 , but only in factored
form: we have a record of all the Pv matrices used in the process. In
most cases, this suffices and Q1 never needs to be computed explicitly.
To see how the current matrix is transformed efficiently into Pk APk ,
we only have to look at the submatrix B = A(k + 1 : n − 1, k + 1 : n −
1), which is transformed into Pv BPv . Prove that this matrix equals
B − vwT − wvT , where w = p − ((βpT v)/2)v with p = βBv.
(e) Algorithm 2.11 is a sequential tridiagonalization algorithm based on
Householder reflections. Verify that this algorithm executes the method
just described. The algorithm does not make use of symmetry yet, but
Algorithm 2.11. Sequential Householder tridiagonalization.

input: A : n × n symmetric matrix, A = A^(0).
output: A : n × n symmetric matrix, A = V + T + V^T, with
        T : n × n symmetric tridiagonal matrix,
        V : n × n matrix with vij = 0 for i ≤ j + 1,
        such that Q1 T Q1^T = A^(0), where Q1 = Pn−3 · · · P0, with
        Pk = diag(Ik+1, Pv^(k)) and v^(k) = (1, V(k + 2 : n − 1, k)^T)^T.

for k := 0 to n − 3 do
    (v′, β, µ) := Householder(A(k + 1 : n − 1, k), n − k − 1);
    for i := k + 1 to n − 1 do
        vi := v′_{i−k−1};    { shift for easier indexing }

    for i := k + 1 to n − 1 do
        pi := 0;
        for j := k + 1 to n − 1 do
            pi := pi + β aij vj;

    γ := 0;
    for i := k + 1 to n − 1 do
        γ := γ + pi vi;

    for i := k + 1 to n − 1 do
        wi := pi − (βγ/2) vi;

    ak+1,k := µ; ak,k+1 := µ;
    for i := k + 2 to n − 1 do
        aik := vi; aki := vi;
    for i := k + 1 to n − 1 do
        for j := k + 1 to n − 1 do
            aij := aij − vi wj − wi vj;

this is easy to achieve, by only performing operations on the lower
triangular and diagonal part of A.
(f) Design, implement, and test a parallel version of this algorithm that
exploits the symmetry. Assume that A is distributed by the square cyc-
lic distribution, for similar reasons as in the case of LU decomposition.
Why do these reasons apply here as well? Choose a suitable vector dis-
tribution, assuming that the vectors v, p, and w are distributed over
all p processors, and that this is done in the same way for the three vec-
tors. (We could have chosen to distribute the vectors in the same way
as the input vector, that is, the column part A(k + 1 : n − 1, k). Why
is this a bad idea?) Design the communication supersteps by following
the need-to-know principle. Pay particular attention to the matrix–
vector multiplication in the computation of p. How many supersteps
does this multiplication require? (Matrix–vector multiplication will be
discussed extensively in Chapter 4.)
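As a sequential starting point for part (c), here is a hedged C transcription of Algorithm 2.10. The function name and the parameter layout are illustrative assumptions, not the interface of BSPedupack.

    #include <math.h>

    /* Sequential Householder reflection: on return v[0] = 1, P_v x = ||x|| e_0,
       *beta = 2/||v||^2 and *mu = ||x||. Requires n >= 2 and x(1:n-1) nonzero. */
    void householder(int n, const double *x, double *v, double *beta, double *mu) {
        double alpha = 0.0;                  /* alpha = ||x(1 : n-1)||^2 */
        for (int i = 1; i < n; i++)
            alpha += x[i] * x[i];
        *mu = sqrt(x[0] * x[0] + alpha);     /* mu = ||x|| */

        /* v = x - ||x|| e, with Parlett's formula to avoid cancellation */
        double v0 = (x[0] <= 0.0) ? x[0] - *mu : -alpha / (x[0] + *mu);
        for (int i = 1; i < n; i++)
            v[i] = x[i];

        *beta = 2.0 * v0 * v0 / (v0 * v0 + alpha);  /* beta = 2/||v||^2 after normalization */
        for (int i = 1; i < n; i++)
            v[i] /= v0;                      /* normalize so that v[0] = 1 */
        v[0] = 1.0;
    }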
9. (∗∗) Usually, the decomposition P A = LU is followed by the solution of
two triangular systems, Ly = P b and U x = y; this solves the linear system
Ax = b. In the parallel case, the distribution in which the triangular matrices
L and U are produced by the LU decomposition must be used when solving
the triangular systems, because it would be too expensive to redistribute the
n2 matrix elements involved, compared with the 2n2 flops required by the
triangular system solutions.

(a) Design a basic parallel algorithm for the solution of a lower triangular
system Lx = b, where L is an n × n lower triangular matrix, b a given
vector of length n, and x the unknown solution vector of length n.
Assume that the number of processors is p = M 2 and that the matrix
is distributed by the square cyclic distribution. Hint: the computation
and communication can be organized in a wavefront pattern, where
in stage k of the algorithm, computations are carried out for matrix
elements lij with i+j = k. After these computations, communication is
performed: the owner P(s, t) of an element lij on the wavefront puts xj
into P((s + 1) mod M, t), which owns l_{i+1,j}, and it also puts \sum_{r=0}^{j} l_{ir} x_r
into P(s, (t + 1) mod M), which owns l_{i,j+1}.
(b) Reduce the amount of communication. Communicate only when this
is really necessary.
(c) Which processors are working in stage k? Improve the load balance.
Hint: procrastinate!
(d) Determine the BSP cost of the improved algorithm.
(e) Now assume that the matrix is distributed by the square block-cyclic
distribution, defined by eqn (2.33) with M = N = √p and b0 = b1 = β.
How would you generalize your algorithm for solving lower triangu-
lar systems to this case? Determine the BSP cost of the generalized
algorithm and find the optimal block size parameter β for a computer
with given BSP parameters p, g, and l.
(f) Implement your algorithm for the square cyclic distribution in a
function bspltriang. Write a similar function bsputriang that
solves upper triangular systems. Combine bsplu, bspltriang, and
bsputriang into one program bsplinsol that solves a linear system
of equations Ax = b. The program has to permute b into Pπ−1 b, where
π is the partial pivoting permutation produced by the LU decomposi-
tion. Measure the execution time of the LU decomposition and the
triangular system solutions for various p and n.

10. (∗∗) The LU decomposition function bsplu is, well, educational. It teaches
important distribution and communication techniques, but it is far from
optimal. Our goal now is to turn bsplu into a fast program that is suit-
able for a production environment where every flop/s counts. We optimize
[Figure: the matrix is split at row and column indices k0 and k1 into a 3 × 3 arrangement of submatrices A00, A01, A02; A10, A11, A12; A20, A21, A22.]
Fig. 2.9. Submatrices created by combining the operations from stages k0 ≤ k < k1
of the LU decomposition.

the program gradually, taking care that we can observe the effect of each
modification. Measure the gains (or losses) achieved by each modification and
explain your results.
(a) In parallel computing, postponing work until it can be done in bulk
quantity creates opportunities for optimization. This holds for compu-
tation work as well as for communication work. For instance, we can
combine several consecutive stages k, k0 ≤ k < k1 , of the LU decom-
position algorithm. As a first optimization, postpone all operations on
the submatrix A(∗, k1 : n − 1) until the end of stage k1 − 1, see Fig. 2.9.
This concerns two types of operations: swapping elements as part of
row swaps and modifying elements as part of matrix updates. Opera-
tions on the submatrix A(∗, 0 : k1 −1) are done as before. Carry out the
postponed work by first permuting all rows involved in the row swaps
and then performing a sequence of row broadcasts and matrix updates.
This affects only the submatrices A12 and A22 . (We use the names of
the submatrices as given in the figure when this is more convenient.)
To update the matrix correctly, the values of the columns A(k +
1 : n − 1, k), k0 ≤ k < k1 , that are broadcast must be stored in
an array L immediately after the broadcast. Caution: the row swaps
inside the submatrix A(∗, k0 : k1 − 1) will be carried out on the sub-
matrix itself, but not on the copies created by the column broadcasts.
At the end of stage k1 − 1, the copy L(i, k0 : k1 − 1) of row i, with
elements frozen at various stages, may not be the same as the cur-
rent row A(i, k0 : k1 − 1). How many rows are affected in the worst
case? Rebroadcast those rows. Do you need to rebroadcast in two
phases? What is the extra cost incurred? Why is the new version of
the algorithm still an improvement?
(b) Not all flops are created equal. Flops from matrix operations can often
be performed at much higher rates than flops from vector or scalar
operations. We can exploit this by postponing and then combining all
updates of the submatrix A22 from the stages k0 ≤ k < k1 . Updates of
the submatrix A12 are still carried out as in (a). Let b = k1 − k0 be the
algorithmic block size. Formulate the combined matrix update in
terms of the multiplication of an (n − k1 ) × b matrix by a b × (n − k1 )
matrix. Modify bsplu accordingly, and write your own version of the
DGEMM function from the BLAS library to perform the multiplication.
The syntax of DGEMM is

    DGEMM(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);

This corresponds to the operation C := αÂB̂ + βC, where  is an
m × k matrix, B̂ a k × n matrix, C an m × n matrix, and α and β
are scalars. Here,  = A if transa=’n’ and  = AT if transa=’t’,
where the matrix A is the matrix that is actually stored in the memory,
and similarly for B̂. The integer lda is the leading dimension of the
matrix A, that is, lda determines how the two-dimensional matrix A is
stored in the one-dimensional array a: element aij is stored in position
i + j·lda. Similarly, ldb and ldc determine the storage format of B
and C.
A complication arises because the data format assumed by DGEMM is
that of the Fortran language, with matrices stored by columns, whereas
in the C language matrices are stored by rows. To hand over a matrix
A stored by rows from the calling C program to the Fortran-speaking
DGEMM subroutine, we just tell the subroutine that it receives the matrix
A^T stored by columns. Therefore, the subroutine should perform the
update C^T := αB̂^T Â^T + βC^T. (A sketch of such a call from C is given after this exercise.)
(c) Try to find a version of DGEMM that is tailored to your machine.
(Most machine vendors provide extremely efficient BLAS in assem-
bler language; DGEMM is their showcase function which should approach
theoretical peak performance.) Find the optimal block size b using the
vendor’s DGEMM, for a suitable choice of n and p. Explain the relation
between block size and performance.
(d) How can you achieve a perfect load balance in the update of the
matrix A22 ?
(e) Where possible, replace other computations by calls to BLAS func-
tions. In the current version of the program, how much time is spent
in computation? How much in communication?
(f) Postpone all row swaps in the submatrix A(∗, 0 : k0 − 1) until the
end of stage k1 − 1, and then perform them in one large row per-
mutation. Avoid synchronization by combining the resulting superstep
with another superstep. This approach may be called superstep
piggybacking.
(g) The high-performance put function bsp_hpput of BSPlib has exactly
the same syntax as the bsp_put function:
    bsp_hpput(pid, source, dest, offset, nbytes);
It does not provide the safety of buffering at the source and destination
that bsp_put gives. The read and write operations can in principle
occur at any time during the superstep. Therefore the user must ensure
safety by taking care that different communication operations do not
interfere. The primary aim of using this primitive is to save the memory
of the buffers. Sometimes, this makes the difference between being able
to solve a problem or not. A beneficial side effect is that this saves time
as well. There also exists a bsp_hpget operation, with syntax
    bsp_hpget(pid, source, offset, dest, nbytes);
which should be used with the same care as bsp_hpput. In the LU
decomposition program, matrix data are often put into temporary
arrays and not directly into the matrix itself, so that there is no
need for additional buffering by the system. Change the bsp_puts into
bsp_hpputs, wherever this is useful and allowed, perhaps after a few
minor modifications. What is the effect?
(h) For short vectors, a one-phase broadcast is faster than a two-phase
broadcast. Replace the two-phase broadcast in stage k of row elements
akj with k < j < k1 by a one-phase broadcast. For which values of b is
this an improvement?
(i) As already observed in Exercise 3, a disadvantage of the present
approach to row swaps and row broadcasts in the submatrix A(k0 :
n − 1, k1 : n − 1) is that elements of pivot rows move three times: each
such element is first moved into the submatrix A12 as part of a per-
mutation; then it is moved as part of the data spreading operation
in the first phase of the row broadcast; and finally it is copied and
broadcast in the second phase. This time, there are b rows instead
of one that suffer from excessive mobility, and they can be dealt with
together. Instead of moving the local part of a row into A12 , you should
spread it over the M processors of its processor column (in the same
way for every row). As a result, A12 becomes distributed over all p
processors in a column distribution. Updating A12 becomes a local
operation, provided each processor has a copy of the lower triangular
part of A11 . How much does this approach save?
(j) Any ideas for further improvement?
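The following hedged sketch shows how the combined update of part (b), C := C − L·U, might be handed to a Fortran DGEMM from a C program whose matrices are stored by rows, using the transposition trick described there. The linker name dgemm_ and the omission of hidden string-length arguments are system-dependent assumptions; the wrapper name and parameters are illustrative.

    /* Fortran BLAS prototype (exact symbol name and calling convention
       differ per system; check your vendor's documentation). */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    /* C is mrows x ncols, A is mrows x b, B is b x ncols, all stored by rows.
       Row-major arrays are handed over as their column-major transposes, so
       DGEMM computes C^T := alpha*B^T*A^T + beta*C^T, i.e. C := C - A*B. */
    void update_block(int mrows, int ncols, int b,
                      double *a, int lda, double *bmat, int ldb,
                      double *c, int ldc) {
        const double alpha = -1.0, beta = 1.0;
        dgemm_("n", "n", &ncols, &mrows, &b,
               &alpha, bmat, &ldb, a, &lda, &beta, c, &ldc);
    }

Swapping the roles of the two input matrices and of m and n is exactly the operation C^T := αB̂^T Â^T + βC^T mentioned in part (b).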
3
THE FAST FOURIER TRANSFORM
This chapter demonstrates the use of different data distributions in
different phases of a computation: we use both the block and cyclic
distributions of a vector and also intermediates between them. Each
data redistribution is a permutation that requires communication. By
making careful choices, the number of such redistributions can be kept
to a minimum. This approach is demonstrated for the fast Fourier trans-
form (FFT), a regular computation with a predictable but challenging
data access pattern. Furthermore, the chapter shows how permutations
with a regular pattern can be implemented more efficiently. These tech-
niques are demonstrated in the specific case of the FFT, but they are
applicable to other regular computations on data vectors as well. After
having read this chapter, you are able to design and implement parallel
algorithms for a range of related regular computations, including wavelet
transforms, sine and cosine transforms, and convolutions; you will also
be able to incorporate these algorithms in larger programs, for example,
for weather forecasting or signal processing. Furthermore, you will be
able to present the results of numerical experiments in a meaningful
manner using the metrics of speedup and efficiency.
3.1 The problem

Fourier analysis studies the decomposition of functions into their frequency
components. The functions may represent a piano sonata by Mozart recorded
50 years ago, a blurred picture of a star taken by the Hubble Space Telescope
before its mirrors were repaired, or a Computerized Tomography (CT) scan
of your chest. It is often easier to improve a function if we can work directly
with its frequency components. Enhancing desired frequencies or removing
undesired ones makes the music more pleasing to your ears. Fourier methods
help deblurring the satellite picture and they are crucial in reconstructing
a medical image from the tomographic measurements.
Let f : R → C be a T -periodic function, that is, a function with f (t +
T ) = f (t) for all t ∈ R. The Fourier series associated with f is defined by
f̃(t) = \sum_{k=−∞}^{∞} c_k e^{2πikt/T},   (3.1)
where the Fourier coefficients c_k are given by

c_k = (1/T) \int_0^T f(t) e^{−2πikt/T} dt   (3.2)
and i denotes the complex number with i2 = −1. (To avoid confusion, we
ban the index i from this chapter.) Under relatively mild assumptions, such
as piecewise smoothness, it can be proven that the Fourier series converges
for every t. (A function is called smooth if it is continuous and its derivative
is also continuous. A property is said to hold piecewise if each finite interval
of its domain can be cut up into a finite number of pieces where the property
holds; it need not hold in the end points of the pieces.) A piecewise smooth
function satisfies f˜(t) = f (t) in points of continuity; in the other points, f˜ is
the average of the left and right limit of f . (For more details, see [33].) If f
is real-valued, we can use Euler’s formula eiθ = cos θ + i sin θ, and eqns (3.1)
and (3.2) to obtain a real-valued Fourier series expressed in sine and cosine
functions.
On digital computers, signal or image functions are represented by their
values at a finite number of sample points. A compact disc contains 44 100
sample points for each second of recorded music. A high-resolution digital
image may contain 1024 by 1024 picture elements (pixels). On an unhappy
day in the future, you might find your chest being cut by a CT scanner into
40 slices, each containing 512 by 512 pixels. In all these cases, we obtain a
discrete approximation to the continuous world.
Suppose we are interested in computing the Fourier coefficients of a
T -periodic function f which is sampled at n points tj = jT /n, with j =
0, 1, . . . , n − 1. Using the trapezoidal rule for numerical integration on the
interval [0, T ] and using f (0) = f (T ), we obtain an approximation
c_k = (1/T) \int_0^T f(t) e^{−2πikt/T} dt
    ≈ (1/T) · (T/n) [ f(0)/2 + \sum_{j=1}^{n−1} f(t_j) e^{−2πikt_j/T} + f(T)/2 ]
    = (1/n) \sum_{j=0}^{n−1} f(t_j) e^{−2πijk/n}.   (3.3)
The discrete Fourier transform (DFT) of a vector x = (x_0, . . . , x_{n−1})^T ∈ C^n
can be defined as the vector y = (y_0, . . . , y_{n−1})^T ∈ C^n with

y_k = \sum_{j=0}^{n−1} x_j e^{−2πijk/n},  for 0 ≤ k < n.   (3.4)
(Different conventions exist regarding the sign of the exponent.) Thus,
eqn (3.3) has the form of a DFT, with xj = f (tj )/n for 0 ≤ j < n. It is
easy to see that the inverse DFT is given by
x_j = (1/n) \sum_{k=0}^{n−1} y_k e^{2πijk/n},  for 0 ≤ j < n.   (3.5)
A straightforward implementation of eqn (3.4) would require n − 1 com-
plex additions and n complex multiplications for each vector component yk ,
assuming that factors of the form e−2πim/n have been precomputed and are
available in a table. A complex addition has the form (a + bi) + (c + di) =
(a + c) + (b + d)i, which requires two real additions. A complex multiplication
has the form (a + bi)(c + di) = (ac − bd) + (ad + bc)i, which requires one real
addition, one real subtraction, and four real multiplications, that is, a total
of six flops. Therefore, the straightforward computation of the DFT costs
n(2(n − 1) + 6n) = 8n2 − 2n flops.
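To make this count concrete, here is a hedged C99 sketch of the straightforward DFT of eqn (3.4); the function name is illustrative, and a precomputed weight table w[m] = e^{−2πim/n} is assumed to be supplied.

    #include <complex.h>

    /* Straightforward DFT: y := F_n x, costing about 8n^2 flops. */
    void dft(int n, const double complex *x, double complex *y,
             const double complex *w) {
        for (int k = 0; k < n; k++) {
            double complex sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += x[j] * w[(int)(((long)j * k) % n)];  /* w[jk mod n] */
            y[k] = sum;
        }
    }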
It is often convenient to use matrix notation to express DFT algorithms.
Define the n × n Fourier matrix Fn by
(F_n)_{jk} = ω_n^{jk},  for 0 ≤ j, k < n,   (3.6)

where

ω_n = e^{−2πi/n}.   (3.7)

Figure 3.1 illustrates the powers of ωn occurring in the Fourier matrix; these
are sometimes called the roots of unity.
Fig. 3.1. Roots of unity ω^k, with ω = ω_8 = e^{−2πi/8}, shown in the complex plane.
Example 3.1 Let n = 4. Because ω_4 = e^{−2πi/4} = e^{−πi/2} = −i, it follows
that

F_4 = \begin{pmatrix} ω_4^0 & ω_4^0 & ω_4^0 & ω_4^0 \\ ω_4^0 & ω_4^1 & ω_4^2 & ω_4^3 \\ ω_4^0 & ω_4^2 & ω_4^4 & ω_4^6 \\ ω_4^0 & ω_4^3 & ω_4^6 & ω_4^9 \end{pmatrix}
    = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & −i & −1 & i \\ 1 & −1 & 1 & −1 \\ 1 & i & −1 & −i \end{pmatrix}.   (3.8)
Useful properties of the Fourier matrix are F_n^T = F_n and F_n^{−1} = F̄_n/n,
where the bar on F̄_n denotes the complex conjugate. The transform y :=
DFT(x) can be written as y := Fn x, so that the DFT becomes a matrix–
vector multiplication. The computation of the DFT is the problem studied in
this chapter.
3.2 Sequential recursive fast Fourier transform

The FFT is a fast algorithm for the computation of the DFT. The basic idea
of the algorithm is surprisingly simple, which does not mean that it is easy
to discover if you have not seen it before. In this section, we apply the idea
recursively, as was done by Danielson and Lanczos [51] in 1942. (A method is
recursive if it invokes itself, usually on smaller problem sizes.)
Assume that n is even. We split the sum of eqn (3.4) into sums of even-
indexed and odd-indexed terms, which gives
y_k = \sum_{j=0}^{n−1} x_j ω_n^{jk} = \sum_{j=0}^{n/2−1} x_{2j} ω_n^{2jk} + \sum_{j=0}^{n/2−1} x_{2j+1} ω_n^{(2j+1)k},  for 0 ≤ k < n.   (3.9)
By using the equality ω_n^2 = ω_{n/2}, we can rewrite (3.9) as

y_k = \sum_{j=0}^{n/2−1} x_{2j} ω_{n/2}^{jk} + ω_n^k \sum_{j=0}^{n/2−1} x_{2j+1} ω_{n/2}^{jk},  for 0 ≤ k < n.   (3.10)

In the first sum, we recognize a Fourier transform of length n/2 of the even
components of x. To cast the sum exactly into this form, we must restrict the
output indices to the range 0 ≤ k < n/2. In the second sum, we recognize
a transform of the odd components. This leads to a method for computing
the set of coefficients yk , 0 ≤ k < n/2, which uses two Fourier transforms
of length n/2. To obtain a method for computing the remaining coefficients
yk , n/2 ≤ k < n, we have to rewrite eqn (3.10). Let k ′ = k − n/2, so that
0 ≤ k ′ < n/2. Substituting k = k ′ + n/2 in eqn (3.10) gives
y_{k′+n/2} = \sum_{j=0}^{n/2−1} x_{2j} ω_{n/2}^{j(k′+n/2)} + ω_n^{k′+n/2} \sum_{j=0}^{n/2−1} x_{2j+1} ω_{n/2}^{j(k′+n/2)},  for 0 ≤ k′ < n/2.   (3.11)
By using the equalities ω_{n/2}^{n/2} = 1 and ω_n^{n/2} = −1, and by dropping the primes
we obtain

y_{k+n/2} = \sum_{j=0}^{n/2−1} x_{2j} ω_{n/2}^{jk} − ω_n^k \sum_{j=0}^{n/2−1} x_{2j+1} ω_{n/2}^{jk},  for 0 ≤ k < n/2.   (3.12)
Comparing eqns (3.10) and (3.12), we see that the sums appearing in the
right-hand sides are the same; if we add the sums we obtain yk and if we
subtract them we obtain yk+n/2 . Here, the savings become apparent: we need
to compute the sums only once.
Following the basic idea, we can compute a Fourier transform of length n
by first computing two Fourier transforms of length n/2 and then combining
the results. Combining requires n/2 complex multiplications, n/2 complex
additions, and n/2 complex subtractions, that is, a total of (6 + 2 + 2) · n/2 =
5n flops. If we use the DFT for the half-length Fourier transforms, the total
flop count is already reduced from 8n2 − 2n to 2 · [8(n/2)2 − 2(n/2)] + 5n =
4n2 + 3n, thereby saving almost a factor of two in computing time. Of course,
we can apply the idea recursively, computing the half-length transforms by
the same splitting method. The recursion ends when the input length becomes
odd; in that case, we switch to a straightforward DFT algorithm. If the ori-
ginal input length is a power of two, the recursion ends with a DFT of length
one, which is just a trivial copy operation y0 := x0 . Figure 3.2 shows how the
problem is split up recursively for n = 8. Algorithm 3.1 presents the recursive
FFT algorithm for an arbitrary input length.
For simplicity, we assume from now on that the original input length is a
power of two. The flop count of the recursive FFT algorithm can be computed
[Figure: the indices 0 1 2 3 4 5 6 7 are split into evens 0 2 4 6 and odds 1 3 5 7, then into the pairs 0 4 | 2 6 | 1 5 | 3 7, and finally into the single elements 0 4 2 6 1 5 3 7.]
Fig. 3.2. Recursive computation of the DFT for n = 8. The numbers shown are
the indices in the original vector, that is, the number j denotes the index of the
vector component xj (and not the numerical value). The arrows represent the
splitting operation. The combining operation is executed in the reverse direction
of the arrows.
Algorithm 3.1. Sequential recursive FFT.

input: x : vector of length n.
output: y : vector of length n, y = Fn x.
function call: y := FFT(x, n).

if n mod 2 = 0 then
xe := x(0 : 2 : n − 1);
xo := x(1 : 2 : n − 1);
ye := FFT(xe , n/2);
yo := FFT(xo , n/2);
for k := 0 to n/2 − 1 do
        τ := ω_n^k y_k^o;
        y_k := y_k^e + τ;
        y_{k+n/2} := y_k^e − τ;
else y := DFT(x, n);

as follows. Let T (n) be the number of flops of an FFT of length n. Then
T(n) = 2T(n/2) + 5n,   (3.13)

because an FFT of length n requires two FFTs of length n/2 and the com-
bination of the results requires 5n flops. Since the half-length FFTs are split
again, we substitute eqn (3.13) in itself, but with n replaced by n/2. This
gives
T(n) = 2(2T(n/4) + 5(n/2)) + 5n = 4T(n/4) + 2 · 5n.   (3.14)
Repeating this process until the input length becomes one, and using T (1) = 0,
we obtain
T (n) = nT (1) + (log2 n) · 5n = 5n log2 n. (3.15)

The gain of the FFT compared with the straightforward DFT is huge: only
5n log2 n flops are needed instead of 8n2 − 2n. For example, you may be able
to process a sound track of n = 32 768 samples (about 0.74 s on a compact
disc) on your personal computer in real time by using FFTs, but you would
have to wait 43 min if you decided to use DFTs instead.
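A hedged C99 transcription of Algorithm 3.1 for the power-of-two case (so that the base case n = 1 reduces to a copy) could look as follows. It is written for clarity, not speed: it allocates temporaries and recomputes the weights at every level, and all names are illustrative.

    #include <complex.h>
    #include <math.h>
    #include <stdlib.h>

    /* Recursive FFT of length n = 2^m: y := F_n x, following Algorithm 3.1. */
    void fft_recursive(int n, const double complex *x, double complex *y) {
        if (n == 1) {
            y[0] = x[0];              /* F_1 is the identity */
            return;
        }
        int h = n / 2;
        double complex *xe = malloc(h * sizeof(double complex));
        double complex *xo = malloc(h * sizeof(double complex));
        double complex *ye = malloc(h * sizeof(double complex));
        double complex *yo = malloc(h * sizeof(double complex));
        for (int j = 0; j < h; j++) {
            xe[j] = x[2 * j];         /* even-indexed components */
            xo[j] = x[2 * j + 1];     /* odd-indexed components */
        }
        fft_recursive(h, xe, ye);
        fft_recursive(h, xo, yo);
        double pi = 4.0 * atan(1.0);
        for (int k = 0; k < h; k++) {
            double complex tau = cexp(-2.0 * pi * I * k / n) * yo[k]; /* omega_n^k y_k^o */
            y[k] = ye[k] + tau;
            y[k + h] = ye[k] - tau;
        }
        free(xe); free(xo); free(ye); free(yo);
    }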

3.3 Sequential nonrecursive algorithm
With each recursive computation, a computational tree is associated. The
nodes of the tree are the calls to the recursive function that performs the
computation. The root of the tree is the first call and the leaves are the calls
that do not invoke the recursive function themselves. Figure 3.2 shows the tree
for an FFT of length eight; the tree is binary, since each node has at most two
children. The tree-like nature of recursive computations may lead you into
thinking that such algorithms are straightforward to parallelize. Indeed, it is
clear that the computation can be split up easily. A difficulty arises, however,
because a recursive algorithm traverses its computation tree sequentially, vis-
iting different subtrees one after the other. For a parallel algorithm, we ideally
would like to access many subtrees simultaneously. A first step towards paral-
lelization of a recursive algorithm is therefore to reformulate it in nonrecursive
form. The next step is then to split and perhaps reorganize the computation.
In this section, we derive a nonrecursive FFT algorithm, which is known as
the Cooley–Tukey algorithm [45].
Van Loan [187] presents a unifying framework in which the Fourier mat-
rix Fn is factorized as the product of permutation matrices and structured
sparse matrices. This helps in concisely formulating FFT algorithms, classify-
ing the huge amount of existing FFT variants, and identifying the fundamental
variants. We adopt this framework in deriving our parallel algorithm.
The computation of Fn x by the recursive algorithm can be expressed in
matrix language as
F_n x = \begin{pmatrix} I_{n/2} & Ω_{n/2} \\ I_{n/2} & −Ω_{n/2} \end{pmatrix}
        \begin{pmatrix} F_{n/2} & 0 \\ 0 & F_{n/2} \end{pmatrix}
        \begin{pmatrix} x(0 : 2 : n − 1) \\ x(1 : 2 : n − 1) \end{pmatrix}.   (3.16)

Here, Ωn denotes the n × n diagonal matrix with the first n powers of ω2n on
the diagonal,
Ω_n = diag(1, ω_{2n}, ω_{2n}^2, . . . , ω_{2n}^{n−1}).   (3.17)

Please verify that eqn (3.16) indeed corresponds to Algorithm 3.1.
We examine the three parts of the right-hand side of eqn (3.16) starting
from the right. The rightmost part is just the vector x with its components
sorted into even and odd components. We define the even–odd sort matrix
Sn by
S_n = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\
  &   &   &   & \vdots &   &   &   \\
0 & 0 & 0 & 0 & \cdots & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 & 0 & 0 \\
  &   &   &   & \vdots &   &   &   \\
0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1
\end{pmatrix},   (3.18)
that is, Sn is the n × n permutation matrix that contains the even rows of
In followed by the odd rows. (Note that the indices start at zero, so that the
even rows are rows 0, 2, 4, . . . , n − 2.) Using this notation, we can write

S_n x = \begin{pmatrix} x(0 : 2 : n − 1) \\ x(1 : 2 : n − 1) \end{pmatrix}.   (3.19)
The middle part of the right-hand side of eqn (3.16) is a block-diagonal
matrix with two identical blocks Fn/2 on the diagonal. The off-diagonal blocks,
which are zero, can be interpreted as 0 times the submatrix Fn/2 . The matrix
therefore consists of four submatrices that are scaled copies of the submatrix
Fn/2 . In such a situation, it is convenient to use the Kronecker matrix product
notation. If A is a q × r matrix and B an m × n matrix, then the Kronecker
product (also called tensor product, or direct product) A ⊗ B is the
qm × rn matrix defined by
A ⊗ B = \begin{pmatrix} a_{00} B & \cdots & a_{0,r−1} B \\ \vdots & & \vdots \\ a_{q−1,0} B & \cdots & a_{q−1,r−1} B \end{pmatrix}.   (3.20)
Example 3.2 Let A = \begin{pmatrix} 0 & 1 \\ 2 & 4 \end{pmatrix} and B = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 1 & 0 \end{pmatrix}. Then

A ⊗ B = \begin{pmatrix} 0 & B \\ 2B & 4B \end{pmatrix}
      = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 2 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 2 & 0 & 4 & 4 & 0 & 8 \\ 0 & 2 & 0 & 0 & 4 & 0 \end{pmatrix}.

The Kronecker product has many useful properties (but, unfortunately, it
does not possess commutativity). For an extensive list, see Van Loan [187].
Here, we only mention the three properties that we shall use.
Lemma 3.3 Let A, B, C be matrices. Then

(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).

Lemma 3.4 Let A, B, C, D be matrices such that AC and BD are defined.


Then

(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).

Lemma 3.5 Let m, n ∈ N. Then

Im ⊗ In = Imn .

Proof Boring. 
Lemma 3.3 saves some ink because we can drop brackets and write A⊗B ⊗
C instead of having to give an explicit evaluation order such as (A ⊗ B) ⊗ C.
Fig. 3.3. Butterfly operation transforming an input pair (x_j, x_{j+n/2}) into an output
pair (x′_j, x′_{j+n/2}). Right butterfly: © 2002 Sarai Bisseling, reproduced with sweet
permission.
Using the Kronecker product notation, we can write the middle part of the
right-hand side of eqn (3.16) as
I_2 ⊗ F_{n/2} = \begin{pmatrix} F_{n/2} & 0 \\ 0 & F_{n/2} \end{pmatrix}.   (3.21)
The leftmost part of the right-hand side of eqn (3.16) is the n × n butterfly matrix

B_n = \begin{pmatrix} I_{n/2} & Ω_{n/2} \\ I_{n/2} & −Ω_{n/2} \end{pmatrix}.   (3.22)
The butterfly matrix obtains its name from the butterfly-like pattern in which
it transforms input pairs (xj , xj+n/2 ), 0 ≤ j < n/2, into output pairs, see
Fig. 3.3. The butterfly matrix is sparse because only 2n of its n2 elements are
nonzero. It is also structured, because its nonzeros form three diagonals.
Example 3.6

B_4 = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & −i \\ 1 & 0 & −1 & 0 \\ 0 & 1 & 0 & i \end{pmatrix}.
Using our new notations, we can rewrite eqn (3.16) as

Fn x = Bn (I2 ⊗ Fn/2 )Sn x. (3.23)

Since this holds for all vectors x, we obtain the matrix factorization

Fn = Bn (I2 ⊗ Fn/2 )Sn , (3.24)

which expresses the Fourier matrix in terms of a smaller Fourier matrix.


We can reduce the size of the smaller Fourier matrix further by repeatedly
factorizing the middle factor of the right-hand side. For a factor of the form
Ik ⊗ Fn/k , this is done by applying Lemmas 3.4 (twice), 3.3, and 3.5, giving
I_k ⊗ F_{n/k} = (I_k I_k I_k) ⊗ [B_{n/k}(I_2 ⊗ F_{n/(2k)})S_{n/k}]
             = (I_k ⊗ B_{n/k})((I_k I_k) ⊗ [(I_2 ⊗ F_{n/(2k)})S_{n/k}])
             = (I_k ⊗ B_{n/k})(I_k ⊗ I_2 ⊗ F_{n/(2k)})(I_k ⊗ S_{n/k})
             = (I_k ⊗ B_{n/k})(I_{2k} ⊗ F_{n/(2k)})(I_k ⊗ S_{n/k}).   (3.25)

After repeatedly eating away at the middle factor, from both sides, we finally
reach In ⊗Fn/n = In ⊗I1 = In . Collecting the factors produced in this process,
we obtain the following theorem, which is the so-called decimation in time
(DIT) variant of the Cooley–Tukey factorization. (The name ‘DIT’ comes
from splitting—decimating—the samples taken over time, cf. eqn (3.9).)
Theorem 3.7 (Cooley and Tukey [45]—DIT) Let n be a power of two with
n ≥ 2. Then

Fn = (I1 ⊗ Bn )(I2 ⊗ Bn/2 )(I4 ⊗ Bn/4 ) · · · (In/2 ⊗ B2 )Rn ,

where

Rn = (In/2 ⊗ S2 ) · · · (I4 ⊗ Sn/4 )(I2 ⊗ Sn/2 )(I1 ⊗ Sn ).

Note that the factors Ik ⊗ Sn/k are permutation matrices, so that Rn is a


permutation matrix.
Example 3.8

R_8 = (I_4 ⊗ S_2)(I_2 ⊗ S_4)(I_1 ⊗ S_8) = (I_4 ⊗ I_2)(I_2 ⊗ S_4)S_8 = (I_2 ⊗ S_4)S_8

    = \begin{pmatrix}
      1 & · & · & · & · & · & · & · \\
      · & · & 1 & · & · & · & · & · \\
      · & 1 & · & · & · & · & · & · \\
      · & · & · & 1 & · & · & · & · \\
      · & · & · & · & 1 & · & · & · \\
      · & · & · & · & · & · & 1 & · \\
      · & · & · & · & · & 1 & · & · \\
      · & · & · & · & · & · & · & 1
      \end{pmatrix}
      \begin{pmatrix}
      1 & · & · & · & · & · & · & · \\
      · & · & 1 & · & · & · & · & · \\
      · & · & · & · & 1 & · & · & · \\
      · & · & · & · & · & · & 1 & · \\
      · & 1 & · & · & · & · & · & · \\
      · & · & · & 1 & · & · & · & · \\
      · & · & · & · & · & 1 & · & · \\
      · & · & · & · & · & · & · & 1
      \end{pmatrix}

    = \begin{pmatrix}
      1 & · & · & · & · & · & · & · \\
      · & · & · & · & 1 & · & · & · \\
      · & · & 1 & · & · & · & · & · \\
      · & · & · & · & · & · & 1 & · \\
      · & 1 & · & · & · & · & · & · \\
      · & · & · & · & · & 1 & · & · \\
      · & · & · & 1 & · & · & · & · \\
      · & · & · & · & · & · & · & 1
      \end{pmatrix}.
The permutation matrix Rn is known as the bit-reversal matrix. Mul-
tiplying an input vector by this matrix first permutes the vector by splitting
it into even and odd components, moving the even components to the front;
then treats the two parts separately in the same way, splitting each half into
its own even and odd components; and so on.
The name ‘bit reversal’ stems from viewing this permutation in terms of
binary digits. We can write an index j, 0 ≤ j < n, as
j = \sum_{k=0}^{m−1} b_k 2^k,   (3.26)

where bk ∈ {0, 1} is the kth bit and n = 2m . We call b0 the least significant
bit and bm−1 the most significant bit. We express the binary expansion by
the notation
(b_{m−1} · · · b_1 b_0)_2 = \sum_{k=0}^{m−1} b_k 2^k.   (3.27)

Example 3.9
(10100101)_2 = 2^7 + 2^5 + 2^2 + 2^0 = 165.
Multiplying a vector by Rn starts by splitting the vector into a subvector
of components x(bm−1 ···b0 )2 with b0 = 0 and a subvector of components with
b0 = 1. This means that the most significant bit of the new position of a
component becomes b0 . Each subvector is then split according to bit b1 , and
so on. Thus, the final position of the component with index (bm−1 · · · b0 )2
becomes (b0 · · · bm−1 )2 , that is, the bit reverse of the original position; hence
the name. The splittings of the bit reversal are exactly the same as those of
the recursive procedure, but now they are lumped together. For this reason,
Fig. 3.2 can also be viewed as an illustration of the bit-reversal permutation,
where the bottom row gives the bit reverses of the vector components shown
at the top.
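As a small illustration, the bit reverse of a single index can be computed as in the following hedged C sketch (the function name is illustrative); it mirrors the inner loop of Algorithm 3.2 given below.

    /* Bit reversal of an index j with m = log2(n) bits: returns rho_n(j). */
    unsigned int bitrev_index(unsigned int n, unsigned int j) {
        unsigned int r = 0;
        for (unsigned int k = n; k > 1; k >>= 1) {  /* m iterations */
            r = (r << 1) | (j & 1);   /* append the least significant bit of j */
            j >>= 1;
        }
        return r;
    }

For example, bitrev_index(8, 1) returns 4, in agreement with Table 3.1.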
The following theorem states formally that Rn corresponds to a bit-reversal
permutation ρn , where the correspondence between a permutation σ and a
permutation matrix Pσ is given by eqn (2.10). The permutation for n = 8 is
displayed in Table 3.1.
Theorem 3.10 Let n = 2m , with m ≥ 1. Let ρn : {0, . . . , n − 1} →
{0, . . . , n − 1} be the bit-reversal permutation defined by

ρn ((bm−1 · · · b0 )2 ) = (b0 · · · bm−1 )2 .

Then
Rn = Pρn .
Proof First, we note that

(Sn x)(bm−1 ···b0 )2 = x(bm−2 ···b0 bm−1 )2 , (3.28)
Table 3.1. Bit-reversal permutation for n = 8

j (b2 b1 b0 )2 (b0 b1 b2 )2 ρ8 (j)

0 000 000 0
1 001 100 4
2 010 010 2
3 011 110 6
4 100 001 1
5 101 101 5
6 110 011 3
7 111 111 7

for all binary numbers (bm−1 · · · b0 )2 of m bits. This is easy to see: if j =
(bm−1 · · · b0 )2 < n/2, then bm−1 = 0 and j = (bm−2 · · · b0 )2 so that (Sn x)j =
x2j = x(bm−2 ···b0 bm−1 )2 . The proof for j ≥ n/2 is similar. Equation (3.28) says
that the original position of a component of the permuted vector is obtained
by shifting the bits of its index one position to the left in a circular fashion.
Second, we generalize our result and obtain

((I_{n/2^t} ⊗ S_{2^t})x)_{(b_{m−1}···b_0)_2} = x_{(b_{m−1}···b_t b_{t−2}···b_0 b_{t−1})_2},   (3.29)

for all binary numbers (bm−1 · · · b0 )2 of m bits. This is because we can apply
eqn (3.28) with t bits instead of m to the subvector of length 2t of x starting at
index j = (bm−1 · · · bt )2 · 2t . Here, only the t least significant bits participate
in the circular shift. Third, using the definition Rn = (In/2 ⊗ S2 ) · · · (I1 ⊗ Sn )
and applying eqn (3.29) for t = 1, 2, 3, . . . , m we obtain

(Rn x)(bm−1 ···b0 )2 = ((In/2 ⊗ S2 )(In/4 ⊗ S4 ) · · · (I1 ⊗ Sn )x)(bm−1 ···b0 )2
= ((In/4 ⊗ S4 ) · · · (I1 ⊗ Sn )x)(bm−1 ···b1 b0 )2
= ((In/8 ⊗ S8 ) · · · (I1 ⊗ Sn )x)(bm−1 ···b2 b0 b1 )2
= ((In/16 ⊗ S16 ) · · · (I1 ⊗ Sn )x)(bm−1 ···b3 b0 b1 b2 )2
···
= ((I1 ⊗ Sn )x)(bm−1 b0 b1 b2 ···bm−2 )2 = x(b0 ···bm−1 )2 . (3.30)

Therefore (R_n x)_j = x_{ρ_n(j)}, for all j. Using ρ_n = ρ_n^{−1} and applying Lemma 2.5
we arrive at (R_n x)_j = x_{ρ_n^{−1}(j)} = (P_{ρ_n} x)_j. Since this holds for all j, we have
R_n x = P_{ρ_n} x. Since this in turn holds for all x, it follows that R_n = P_{ρ_n}.
Algorithm 3.2 is an FFT algorithm based on the Cooley–Tukey theorem.
The algorithm overwrites the input vector x with the output vector Fn x. For
Algorithm 3.2. Sequential nonrecursive FFT.

input: x : vector of length n = 2m , m ≥ 1, x = x0 .
output: x : vector of length n, such that x = Fn x0 .
function call: FFT(x, n).

{ Perform bit reversal x := Rn x. Function call bitrev(x, n) }


for j := 0 to n − 1 do
{ Compute r := ρn (j) }
q := j;
r := 0;
for k := 0 to log2 n − 1 do
bk := q mod 2;
q := q div 2;
r := 2r + bk ;
if j < r then swap(xj , xr );

{ Perform butterflies. Function call UFFT(x, n) }


k := 2;
while k ≤ n do
{ Compute x := (In/k ⊗ Bk )x }
for r := 0 to n/k − 1 do
    { Compute x(rk : rk + k − 1) := B_k x(rk : rk + k − 1) }
    for j := 0 to k/2 − 1 do
        { Compute x_{rk+j} ± ω_k^j x_{rk+j+k/2} }
        τ := ω_k^j x_{rk+j+k/2};
        x_{rk+j+k/2} := x_{rk+j} − τ;
        x_{rk+j} := x_{rk+j} + τ;
k := 2k;

later use, we make parts of the algorithm separately callable: the function
bitrev(x, n) performs a bit reversal of length n and the function UFFT(x, n)
performs an unordered FFT of length n, that is, an FFT without bit
reversal. Note that multiplication by the butterfly matrix Bk combines com-
ponents at distance k/2, where k is a power of two. In the inner loop,
the subtraction xrk+j+k/2 := xrk+j − τ is performed before the addition
xrk+j := xrk+j + τ , because the old value of xrk+j must be used in the
computation of xrk+j+k/2 . (Performing these statements in the reverse order
would require the use of an extra temporary variable.) A simple count of
the floating-point operations shows that the cost of the nonrecursive FFT
algorithm is the same as that of the recursive algorithm.
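A hedged C99 sketch of the butterfly part of Algorithm 3.2 (the function UFFT, to be preceded by the bit reversal) is given below. For brevity it recomputes the weights with cexp instead of using a table, and the names are illustrative.

    #include <complex.h>
    #include <math.h>

    /* In-place unordered FFT of length n = 2^m; the input must already be
       in bit-reversed order. */
    void ufft(int n, double complex *x) {
        const double pi = 4.0 * atan(1.0);
        for (int k = 2; k <= n; k *= 2) {            /* butterfly size k */
            for (int r = 0; r < n / k; r++) {        /* block x(rk : rk+k-1) */
                for (int j = 0; j < k / 2; j++) {
                    double complex w = cexp(-2.0 * pi * I * j / k); /* omega_k^j */
                    double complex tau = w * x[r * k + j + k / 2];
                    x[r * k + j + k / 2] = x[r * k + j] - tau;
                    x[r * k + j]         = x[r * k + j] + tau;
                }
            }
        }
    }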
The symmetry of the Fourier matrix Fn and the bit-reversal matrix
Rn gives us an alternative form of the Cooley–Tukey FFT factorization, the
so-called decimation in frequency (DIF) variant.
Corollary 3.11 (Cooley and Tukey [45]—DIF) Let n be a power of two
with n ≥ 2. Then

Fn = Rn (In/2 ⊗ B2T )(In/4 ⊗ B4T )(In/8 ⊗ B8T ) · · · (I1 ⊗ BnT ).

Proof

F_n = F_n^T = [(I_1 ⊗ B_n)(I_2 ⊗ B_{n/2})(I_4 ⊗ B_{n/4}) · · · (I_{n/2} ⊗ B_2)R_n]^T
    = R_n^T (I_{n/2} ⊗ B_2)^T · · · (I_4 ⊗ B_{n/4})^T (I_2 ⊗ B_{n/2})^T (I_1 ⊗ B_n)^T
    = R_n (I_{n/2} ⊗ B_2^T) · · · (I_4 ⊗ B_{n/4}^T)(I_2 ⊗ B_{n/2}^T)(I_1 ⊗ B_n^T).
3.4 Parallel algorithm
The first question to answer when parallelizing an existing sequential
algorithm is: which data should be distributed? The vector x must certainly
be distributed, but it may also be necessary to distribute the table containing
the powers of ωn that are used in the computation, the so-called weights. (A
table must be used because it would be too expensive to compute the weights
each time they are needed.) We defer the weights issue to Section 3.5, and
for the moment we assume that the table of weights is replicated on every
processor.
A suitable distribution should make the operations of the algorithm local.
For the FFT, the basic operation is the butterfly, which modifies a pair
(xj , xj ′ ) with j ′ = j + k/2, where k is the butterfly size. This is done in
stage k of the algorithm, that is, the iteration corresponding to parameter
k of the main loop of Algorithm 3.2. Let the length of the Fourier transform
be n = 2m , with m ≥ 1, and let the number of processors be p = 2q , with
0 ≤ q < m. We restrict the number of processors to a power of two because
this matches the structure of our sequential FFT algorithm. Furthermore, we
require that q < m, or equivalently p < n, because in the pathological case
p = n there is only one vector component per processor, and we need at least
a pair of vector components to perform a sensible butterfly operation.
A block distribution of the vector x, with n/p components per processor,
makes butterflies with k ≤ n/p local. This is because k and n/p are powers of
two, so that k ≤ n/p implies that k divides n/p, thus making each butterfly
block x(rk : rk + k − 1) fit completely into a processor block x(sn/p : sn/p +
n/p − 1); no butterfly blocks cross the processor boundaries.
114 THE FAST FOURIER TRANSFORM

Example 3.12 Let n = 8 and p = 2 and assume that x is block distributed.


Stage k = 2 of the FFT algorithm is a multiplication of x = x(0 : 7) with
I_4 ⊗ B_2 = \begin{pmatrix}
1 & 1 & · & · & · & · & · & · \\
1 & −1 & · & · & · & · & · & · \\
· & · & 1 & 1 & · & · & · & · \\
· & · & 1 & −1 & · & · & · & · \\
· & · & · & · & 1 & 1 & · & · \\
· & · & · & · & 1 & −1 & · & · \\
· & · & · & · & · & · & 1 & 1 \\
· & · & · & · & · & · & 1 & −1
\end{pmatrix}.

The butterfly blocks of x that are multiplied by blocks of I4 ⊗ B2 are x(0 : 1),
x(2 : 3), x(4 : 5), and x(6 : 7). The first two blocks are contained in processor
block x(0 : 3), which belongs to P (0). The last two blocks are contained in
processor block x(4: 7), which belongs to P (1).
In contrast to the block distribution, the cyclic distribution makes butter-
flies with k ≥ 2p local, because k/2 ≥ p implies that k/2 is a multiple of p, so
that the vector components xj and xj ′ with j ′ = j + k/2 reside on the same
processor.
Example 3.13 Let n = 8 and p = 2 and assume that x is cyclically dis-
tributed. Stage k = 8 of the FFT algorithm is a multiplication of x = x(0 : 7)
with
B_8 = \begin{pmatrix}
1 & · & · & · & 1 & · & · & · \\
· & 1 & · & · & · & ω & · & · \\
· & · & 1 & · & · & · & ω^2 & · \\
· & · & · & 1 & · & · & · & ω^3 \\
1 & · & · & · & −1 & · & · & · \\
· & 1 & · & · & · & −ω & · & · \\
· & · & 1 & · & · & · & −ω^2 & · \\
· & · & · & 1 & · & · & · & −ω^3
\end{pmatrix},

where ω = ω8 = e−πi/4 = (1 − i)/ 2. The component pairs (x0 , x4 ) and
(x2 , x6 ) are combined on P (0), whereas the pairs (x1 , x5 ) and (x3 , x7 ) are
combined on P (1).
Now, a parallelization strategy emerges: start with the block
√ distribution
and finish with the cyclic distribution. If p ≤ n/p (i.e. p ≤ n), then these
two distributions suffice for the butterflies: we need to redistribute only once
and we can do this at any desired time after stage p but before stage 2n/p.
If p > n/p, however, we are lucky to have so many processors to solve such a
small problem, but we are unlucky in that we have to use more distributions.
For the butterflies of size n/p < k ≤ p, we need one or more intermediates
between the block and cyclic distribution.
The group-cyclic distribution with cycle c is defined by the mapping
x_j −→ P((j div ⌈cn/p⌉) · c + (j mod ⌈cn/p⌉) mod c),  for 0 ≤ j < n.   (3.31)
This distribution is defined for every c with 1 ≤ c ≤ p and p mod c = 0.
It first splits the vector x into blocks of size ⌈cn/p⌉, and then assigns each
block to a group of c processors using the cyclic distribution. Note that for
c = 1 this reduces to the block distribution, see (1.6), and for c = p to the
cyclic distribution, see (1.5). In the special case n mod p = 0, the group-cyclic
distribution reduces to
x_j −→ P((j div (cn/p)) · c + j mod c),  for 0 ≤ j < n.   (3.32)
This case is relevant for the FFT, because n and p are both powers of two.
Figure 3.4 illustrates the group-cyclic distribution for n = 8 and p = 4.
The group-cyclic distribution is a new generalization of the block and cyclic
distributions; note that it differs from the block-cyclic distribution introduced
earlier,
xj −→ P ((j div b) mod p), for 0 ≤ j < n, (3.33)
where b is the block size.
In the FFT case, n and p and hence c are powers of two and we can write
each global index j as the sum of three terms,
j = j_2 (cn/p) + j_1 c + j_0,   (3.34)

[Figure: processor assignment of the eight vector components, one owner per cell: (a) c = 1 (block): 0 0 1 1 2 2 3 3; (b) c = 2: 0 1 0 1 2 3 2 3; (c) c = 4 (cyclic): 0 1 2 3 0 1 2 3.]
Fig. 3.4. Group-cyclic distribution with cycle c of a vector of size eight over four
processors. Each cell represents a vector component; the number in the cell
and the greyshade denote the processor that owns the cell. The processors are
numbered 0, 1, 2, 3. (a) c = 1; (b) c = 2; and (c) c = 4.
where 0 ≤ j0 < c and 0 ≤ j1 < n/p. The processor that owns the component
xj in the group-cyclic distribution with cycle c is P (j2 c + j0 ); the processor
allocation is not influenced by j1 . As always, the components are stored locally
in order of increasing global index. Thus xj ends up in local position j1 =
(j mod cn/p) div c on processor P(j2 c + j0). (The relation between global
and local indices is explained in Fig. 1.9.)
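As a check on this index arithmetic, the following hedged C sketch (names illustrative) computes the owner and the local index of x_j in the group-cyclic distribution with cycle c, for the case n mod p = 0 and c dividing p, which is the case relevant to the FFT.

    /* Owner and local index of x[j] in the group-cyclic distribution with
       cycle c, following eqns (3.32) and (3.34). */
    void groupcyclic(long n, int p, int c, long j, int *owner, long *local) {
        long blocksize = c * (n / p);   /* cn/p, size of one group block */
        long j2 = j / blocksize;        /* which group of c processors   */
        long rem = j % blocksize;
        long j0 = rem % c;              /* position within the cycle     */
        long j1 = rem / c;              /* local index on the owner      */
        *owner = (int)(j2 * c + j0);    /* P(j2*c + j0)                  */
        *local = j1;                    /* = (j mod (cn/p)) div c        */
    }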
To make a butterfly of size k local in the group-cyclic distribution with
cycle c, two constraints must be satisfied. First, the butterfly block x(rk : rk+
k − 1) should fit completely into one block of size cn/p, which is assigned to a
group of c processors. This is guaranteed if k ≤ cn/p. Second, k/2 must be a
multiple of the cycle c, so that the components xj and xj ′ with j ′ = j + k/2
reside on the same processor from the group. This is guaranteed if k/2 ≥ c.
As a result, we find that a butterfly of size k is local in the group-cyclic
distribution with cycle c if
2c ≤ k ≤ (n/p) c.   (3.35)
This result includes as a special case our earlier results for the block and cyclic
distributions. In Fig. 3.4(a), it can be seen that for c = 1 butterflies of size
k = 2 are local, since these combine pairs (xj , xj+1 ). In Fig. 3.4(b) this can
be seen for c = 2 and pairs (xj , xj+2 ), and in Fig. 3.4(c) for c = 4 and pairs
(xj , xj+4 ). In this particular example, the range of (3.35) consists of only one
value of k, namely k = 2c.
A straightforward strategy for the parallel FFT is to start the butter-
flies with the group-cyclic distribution with cycle c = 1, and continue as
long as possible with this distribution, that is, in stages k = 2, 4, . . . , n/p.
At the end of stage n/p, the vector x is redistributed into the group-cyclic
distribution with cycle c = n/p, and then stages k = 2n/p, 4n/p, . . . , n2 /p2
are performed. Then c is again multiplied by n/p, x is redistributed, stages
k = 2n2 /p2 , 4n2 /p2 , . . . , n3 /p3 are performed, and so on. Since n/p ≥ 2 the
value of c increases monotonically. When multiplying c by n/p would lead to
a value c = (n/p)t ≥ p, the value of c is set to c = p instead and the remaining
stages 2(n/p)t , . . . , n are performed in the cyclic distribution.
Until now, we have ignored the bit reversal preceding the butterflies. The
bit reversal is a permutation, which in general requires communication. We
have been liberal in allowing different distributions in different parts of our
algorithm, so why not use different distributions before and after the bit
reversal? This way, we might be able to avoid communication.
Let us try our luck and assume that we have the cyclic distribution before
the bit reversal. This is the preferred starting distribution of the overall com-
putation, because it is the distribution in which the FFT computation ends.
It is advantageous to start and finish with the same distribution, because
then it is easy to apply the FFT repeatedly. This would make it possible, for
instance, to execute a parallel inverse FFT by using the parallel forward FFT
with conjugated weights; this approach is based on the property Fn−1 = Fn /n.
Consider a component xj with index j = (bm−1 · · · b0 )2 of a cyclically
distributed vector x. This component is stored on processor P ((bq−1 · · · b0 )2 )
in location j = (bm−1 · · · bq )2 . In other words, the least significant q bits of j
determine the processor number and the most significant m − q bits the local
index. Since our aim is to achieve a global bit reversal, it seems natural to
start with a local bit reversal; this reverses already part of the bits and it does
not incur communication. The local bit reversal moves xj into local position
(bq · · · bm−1 )2 , which consists of the least significant m − q bits of the global
destination index ρn (j) = (b0 · · · bm−1 )2 . The least significant m − q bits also
happen to be those that determine the local position in the block distribution.
Therefore, the local position would be correct if we declared the distribution
to be by blocks. Unfortunately, the processor would still be wrong: in the
block distribution after the global bit reversal, the original xj should find
itself in processor P (b0 · · · bq−1 ), determined by the most significant q bits
of the destination index (b0 · · · bm−1 )2 , but of course xj did not leave the
original processor P ((bq−1 · · · b0 )2 ). Note that the correct processor number is
the bit-reverse of the actual processor number. Therefore, we call the current
distribution after the local bit reversal the block distribution with bit-
reversed processor numbering.
Can we use the current distribution instead of the block distribution for
the first set of butterflies, with size k = 2, 4, . . . , n/p? We have to check
whether these butterflies remain local when using the block distribution with
bit-reversed processor numbering. Fortunately, for these k, every processor
carries out exactly the same operations, but on its own data. This is because
k ≤ n/p, so that every butterfly block fits completely into a processor block.
As a consequence, the processor renumbering does not affect the local com-
putations. At the first redistribution, the vector is moved into the standard
group-cyclic distribution without bit-reversed processor numbering, and then
the algorithm returns to the original strategy outlined above.
By a stroke of luck, we now have a complete parallel FFT algorithm that
starts and ends with the same distribution, the cyclic distribution, and that
performs only a limited number of redistributions. The result is given as
Algorithm 3.3. In stage k of the algorithm, there are n/k butterfly blocks, so
that each of the p/c processor groups handles nblocks = (n/k)/(p/c) blocks.
Each processor of a group participates in the computations of every block
of its group. A boolean variable rev is used which causes a reversal of the
processor number during the first redistribution.
The redistribution function used in Algorithm 3.3 is given as
Algorithm 3.4. This function puts every component xj in the destination
processor determined by the new distribution with cycle c. We do not specify
where exactly in the destination processor the component xj is placed. Strictly
speaking this has nothing to do with the distribution; rather, it is related to
the data structure used to store the local values. (Still, our convention that
components are ordered locally by increasing global index determines the local

Algorithm 3.3. Parallel FFT algorithm for processor P(s).

input:  x : vector of length n = 2^m, m ≥ 1, x = x0,
        distr(x) = cyclic over p = 2^q processors with 0 ≤ q < m.
output: x : vector of length n, distr(x) = cyclic, such that x = F_n x0.

     bitrev(x(s : p : n − 1), n/p);
     { distr(x) = block with bit-reversed processor numbering }
     k := 2;
     c := 1;
     rev := true;
     while k ≤ n do
(0)      j0 := s mod c;
         j2 := s div c;
         while k ≤ (n/p) c do
             nblocks := nc/(kp);
             for r := j2 · nblocks to (j2 + 1) · nblocks − 1 do
                 { Compute local part of x(rk : (r + 1)k − 1) }
                 for j := j0 to k/2 − 1 step c do
                     τ := ω_k^j x_{rk+j+k/2};
                     x_{rk+j+k/2} := x_{rk+j} − τ;
                     x_{rk+j} := x_{rk+j} + τ;
             k := 2k;
         if c < p then
             c0 := c;
             c := min((n/p) c, p);
(1)          redistr(x, n, p, c0, c, rev);
             rev := false;
             { distr(x) = group-cyclic with cycle c }

index, (j mod cn/p) div c, as we saw above, and this will be used later in
the implementation.) We want the redistribution to work also in the trivial
case p = 1, and therefore we need to define ρ1 ; the definition in Theorem 3.10
omitted this case. By convention, we define ρ1 to be the identity permutation
of length one, which is the permutation that reverses zero bits.

Finally, we determine the cost of Algorithm 3.3. First, we consider the


synchronization cost. Each iteration of the main loop has two supersteps: a
computation superstep that computes the butterflies and a communication
superstep that redistributes the vector x. The computation superstep of the
first iteration also includes the local bit reversal. The last iteration, with c = p,
does not perform a redistribution any more. The total number of iterations
equals t + 1, where t is the smallest integer such that (n/p)^t ≥ p. By taking the
log2 of both sides and using n = 2^m and p = 2^q, we see that this inequality

Algorithm 3.4. Redistribution from group-cyclic distribution with cycle c0 to
cycle c1 for processor P(s).

input:  x : vector of length n = 2^m, m ≥ 0,
        distr(x) = group-cyclic with cycle c0 over p = 2^q processors with 0 ≤ q ≤ m.
        If rev is true, the processor numbering is bit-reversed, otherwise it is standard.
output: x : vector of length n, distr(x) = group-cyclic with cycle c1
        with standard processor numbering.
function call: redistr(x, n, p, c0, c1, rev);

(1)  if rev then
         j0 := ρ_p(s) mod c0;
         j2 := ρ_p(s) div c0;
     else
         j0 := s mod c0;
         j2 := s div c0;
     for j := j2 (c0 n/p) + j0 to (j2 + 1)(c0 n/p) − 1 step c0 do
         dest := (j div (c1 n/p)) c1 + j mod c1;
         put x_j in P(dest);

is equivalent to t ≥ q/(m − q), so that t = ⌈q/(m − q)⌉. Therefore, the total


synchronization cost is

Tsync = (2 ⌈q/(m − q)⌉ + 1) l.    (3.36)

Second, we examine the communication cost. Communication occurs only


within the redistribution, where in the worst case all n/p old local vector
components are sent away, and n/p new components are received from another
processor. Each vector component is a complex number, which consists of two
real numbers. The redistribution is therefore a 2n/p-relation. (The cost is
actually somewhat lower than 2ng/p, because certain data remain local. For
example, component x0 remains on P (0) in all group-cyclic distributions, even
when the processor numbering is bit-reversed. This is a small effect, which we
can neglect.) Thus, the communication cost is

Tcomm = ⌈q/(m − q)⌉ · (2n/p) · g.    (3.37)

Third, we focus on the computation cost. Stage k handles nblocks =


nc/(kp) blocks, each of size k. Each processor handles a fraction 1/c of the
k/2 component pairs in a block. Each pair requires a complex multiplication,
addition, and subtraction, that is, a total of 10 real flops. We do not count
indexing arithmetic, nor the computation of the weights. As a consequence,
the total number of flops per processor in stage k equals (nc/(kp)) · (k/(2c)) ·
10 = 5n/p. Since there are m stages, the computation cost is
Tcomp = 5mn/p. (3.38)
The total BSP cost of the algorithm as a function of n and p is obtained by
summing the three costs and substituting m = log2 n and q = log2 p, giving

TFFT = (5n log2 n)/p + 2 ⌈log2 p / log2(n/p)⌉ (n/p) g + (2 ⌈log2 p / log2(n/p)⌉ + 1) l.    (3.39)
As you may know, budgets for the acquisition of parallel computers are
often tight, but you, the user of a parallel computer, may be insatiable in your
computing demands. In that case, p remains small, n becomes large, and you
may find yourself performing FFTs with 1 < p ≤ √n. The good news is
that then you only need one communication superstep and two computation
supersteps. The BSP cost of the FFT reduces to

TFFT, 1<p≤√n = (5n log2 n)/p + 2(n/p) g + 3l.    (3.40)

This happens because p ≤ √n implies p ≤ n/p and hence log2 p ≤ log2(n/p),
so that the ceiling expression in (3.39) becomes one.
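
To get a feeling for these formulae, here is a small worked example (our own arithmetic, using the machine parameters that will be measured in Table 3.3): for n = 262 144 = 2^18 and p = 16 we have p ≤ √n, so that eqn (3.40) applies, giving

TFFT = (5 · 262 144 · 18)/16 + 2 · (262 144/16) g + 3l = 1 474 560 + 32 768 g + 3l

flop units. With g = 122, l = 93 488, and r = 285 Mflop/s this amounts to roughly 5.2 + 14.0 + 1.0 ≈ 20.2 ms; the same prediction reappears in Table 3.5.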

3.5 Weight reduction


The weights of the FFT are the powers of ω_n that are needed in the FFT
computation. These powers 1, ω_n, ω_n^2, . . . , ω_n^{n/2−1} are usually precomputed and
stored in a table, thus saving trigonometric evaluations when repeatedly using
the same power of ωn . This table can be reused in subsequent FFTs. In a
sequential computation, the table requires a storage space of n/2 complex
numbers, which is half the space needed for the vector x; such an overhead
is usually acceptable. For small n, the O(n) time of the weight initializations
may not be negligible compared with the 5n log2 n flops of the FFT itself.
The reason is that each weight computation requires the evaluation of two
trigonometric functions,
ω_n^j = cos(2πj/n) − i sin(2πj/n),    (3.41)
which typically costs about five flops per evaluation [151] in single precision
and perhaps ten flops in double precision.
The precomputation of the weights can be accelerated by using symmet-
ries. For example, the property

ω_n^{n/2−j} = −conj(ω_n^j)    (3.42)

implies that only the weights ωnj with 0 ≤ j ≤ n/4 have to be computed. The
remaining weights can then be obtained by negation and complex conjugation,
which are cheap operations. Symmetry can be exploited further by using the
property
ω_n^{n/4−j} = −i conj(ω_n^j),    (3.43)
which is also cheap to compute. The set of weights can thus be computed by
eqn (3.41) with 0 ≤ j ≤ n/8, eqn (3.43) with 0 ≤ j < n/8, and eqn (3.42)
with 0 < j < n/4. This way, the initialization of the n/2 weights in double
precision costs about 2 · 10 · n/8 = 2.5n flops.
An alternative method for precomputation of the weights is to compute
the powers of ωn by successive multiplication, computing ωn2 = ωn · ωn , ωn3 =
ωn · ωn2 , and so on. Unfortunately, this propagates roundoff errors and hence
produces less accurate weights and a less accurate FFT. This method is not
recommended [187].
In the parallel case, the situation is more complicated than in the sequen-
tial case. For example, in the first iteration of the main loop of Algorithm 3.3,
c = 1 and hence j0 = 0 and j2 = s, so that all processors perform the same
set of butterfly computations, but on different data. Each processor performs
an unordered sequential FFT of length n/p on its local part of x. This implies
that the processors need the same weights, so that the weight table for these
butterflies must be replicated, instead of being distributed. The local table
should at least contain the weights ω_{n/p}^j = ω_n^{jp}, 0 ≤ j < n/(2p), so that the
total memory used by all processors for this iteration alone is already n/2
complex numbers. Clearly, in the parallel case care must be taken to avoid
excessive memory use and initialization time.
A brute-force approach would be to store on every processor the complete
table of all n/2 weights that could possibly be used during the computation.
This has the disadvantage that every processor has to store almost the same
amount of data as needed for the whole sequential problem. Therefore, this
approach is not scalable in terms of memory usage. Besides, it is also unneces-
sary to store all weights on every processor, since not all of them are needed.
Another disadvantage is that the 2.5n flops of the weight initializations can
easily dominate the (5n log2 n)/p flops of the FFT itself.
At the other extreme is the simple approach of recomputing the weights
whenever they are needed, thus discarding the table. This attaches a weight
computation of about 20 flops to the 10 flops of each pairwise butterfly opera-
tion, thereby approximately tripling the total computing time. This approach
wastes a constant factor in computing time, but it is scalable in terms of
memory usage.
Our main aim in this section is to find a scalable approach in terms of
memory usage that adds few flops to the overall count. To achieve this, we try
to find structure in the local computations of a processor and to express them
by using sequential FFTs. This has the additional benefit that we can make
use of the tremendous amount of available sequential FFT software, including
highly tuned programs provided by computer vendors for their hardware.
The generalized discrete Fourier transform (GDFT) [27] is defined
by
y_k = Σ_{j=0}^{n−1} x_j ω_n^{j(k+α)},  for 0 ≤ k < n,    (3.44)

where α is a fixed real parameter, not necessarily an integer, which represents
a frequency shift. (For instance, taking α = −1 shifts the components
(y0 , . . . , yn−1 ) of a standard DFT into (yn−1 , y0 , y1 , . . . , yn−2 ).) For α = 0, the
GDFT becomes the DFT. The GDFT can be computed by a fast algorithm,
called GFFT, which can be derived in the same way as we did for the FFT.
We shall give the main results of the derivation in the generalized case, but
omit the proofs.
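
For testing purposes, the GDFT can also be evaluated directly from its definition (3.44) in O(n^2) operations. The following sketch (ours, not part of the book's program texts) uses the same storage convention as the programs in Section 3.6, with real and imaginary parts interleaved, and may serve as a reference against which a GFFT implementation can be checked.

```c
#include <math.h>

/* Sketch: direct O(n^2) evaluation of the GDFT of eqn (3.44), intended only
   as a reference for testing a GFFT. x and y are arrays of 2n doubles storing
   complex numbers as pairs (Re, Im); y must not overlap x. */
void gdft_naive(double *x, double *y, int n, double alpha){

    int j, k;
    double theta, wr, wi, sumr, sumi;

    for(k=0; k<n; k++){
        sumr= 0.0;
        sumi= 0.0;
        for(j=0; j<n; j++){
            theta= -2.0 * M_PI * j * (k+alpha) / (double)n;
            wr= cos(theta);
            wi= sin(theta);
            /* accumulate x[j] * omega_n^{j(k+alpha)} */
            sumr += wr*x[2*j] - wi*x[2*j+1];
            sumi += wi*x[2*j] + wr*x[2*j+1];
        }
        y[2*k]= sumr;
        y[2*k+1]= sumi;
    }

} /* end gdft_naive */
```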
The sum in the GDFT definition (3.44) can be split into even and odd
components, giving
y_k = Σ_{j=0}^{n/2−1} x_{2j} ω_{n/2}^{j(k+α)} + ω_n^{k+α} Σ_{j=0}^{n/2−1} x_{2j+1} ω_{n/2}^{j(k+α)},  for 0 ≤ k < n/2,    (3.45)

and

y_{k+n/2} = Σ_{j=0}^{n/2−1} x_{2j} ω_{n/2}^{j(k+α)} − ω_n^{k+α} Σ_{j=0}^{n/2−1} x_{2j+1} ω_{n/2}^{j(k+α)},  for 0 ≤ k < n/2.    (3.46)
These equations can be converted to matrix notation by defining the n × n
generalized Fourier matrix Fnα with parameter α,

(F_n^α)_{jk} = ω_n^{j(k+α)},  for 0 ≤ j, k < n,    (3.47)

the n × n diagonal matrix

Ω_n^α = diag(ω_{2n}^α, ω_{2n}^{1+α}, ω_{2n}^{2+α}, . . . , ω_{2n}^{n−1+α}),    (3.48)

and the n × n generalized butterfly matrix

B_n^α = ( I_{n/2}    Ω_{n/2}^α
          I_{n/2}   −Ω_{n/2}^α ),    (3.49)

giving the decomposition

F_n^α = B_n^α (I_2 ⊗ F_{n/2}^α) S_n.    (3.50)

Repeatedly applying (3.50) until we reach the middle factor I_n ⊗ F_1^α = I_n
yields the following generalization of Cooley and Tukey's theorem.

Theorem 3.14 Let n be a power of two. Then

F_n^α = (I_1 ⊗ B_n^α)(I_2 ⊗ B_{n/2}^α)(I_4 ⊗ B_{n/4}^α) · · · (I_{n/2} ⊗ B_2^α) R_n.

Our ultimate goal is to express the local computations of the parallel


FFT algorithm in terms of sequential FFTs, but for the moment we settle
for less and try to express the computations of a superstep in terms of a
GFFT with suitable α. For this moderate goal, we have to inspect the inner
loops of Algorithm 3.3. The j-loop takes a local subvector x(rk + k/2 +
j0 : c : (r + 1)k − 1) of length k/(2c), multiplies it by the diagonal matrix
diag(ω_k^{j0}, ω_k^{c+j0}, ω_k^{2c+j0}, . . . , ω_k^{k/2−c+j0})
    = diag(ω_{k/c}^{j0/c}, ω_{k/c}^{1+j0/c}, ω_{k/c}^{2+j0/c}, . . . , ω_{k/c}^{k/(2c)−1+j0/c})
    = Ω_{k/(2c)}^{j0/c},    (3.51)

and then adds it to the subvector x(rk+j0 : c : rk+k/2−1), and also subtracts
it. This means that a generalized butterfly B_{k/c}^{j0/c} is performed on the local
subvector x(rk + j0 : c : (r + 1)k − 1). The r-loop takes care that the same
generalized butterfly is performed for all nc/(kp) local subvectors. Thus, stage
k in the group-cyclic distribution with cycle c computes

(I_{nc/(kp)} ⊗ B_{k/c}^{j0/c}) · x(j2 (nc/p) + j0 : c : (j2 + 1)(nc/p) − 1).
A complete sequence of butterfly stages is a sequence of maximal length,
k = 2c, 4c, . . . , (n/p)c. Such a sequence multiplies the local vector by
(I_1 ⊗ B_{n/p}^{j0/c})(I_2 ⊗ B_{n/(2p)}^{j0/c}) · · · (I_{n/(2p)} ⊗ B_2^{j0/c}) = F_{n/p}^{j0/c} R_{n/p},    (3.52)

where the equality follows from Theorem 3.14. This implies that superstep
(0) is equivalent to an unordered GFFT applied to the local vector, with shift
parameter α = j0 /c = (s mod c)/c.
One problem that remains is the computation superstep of the last itera-
tion. This superstep may not perform a complete sequence of butterfly stages,
in which case we cannot find a simple expression for the superstep. If, how-
ever, we would start with an incomplete sequence such that all the following
computation supersteps perform complete sequences, we would have an easier
task, because at the start c = 1 so that α = 0 and we perform standard
butterflies. We can then express a sequence of stages k = 2, 4, . . . , k1 by the
matrix product

(I_{n/(k1 p)} ⊗ B_{k1}) · · · (I_{n/(4p)} ⊗ B_4)(I_{n/(2p)} ⊗ B_2)
    = I_{n/(k1 p)} ⊗ ((I_1 ⊗ B_{k1}) · · · (I_{k1/4} ⊗ B_4)(I_{k1/2} ⊗ B_2))
    = I_{n/(k1 p)} ⊗ (F_{k1} R_{k1}).    (3.53)

Algorithm 3.5. Restructured parallel FFT algorithm for processor P(s).

input:  x : vector of length n = 2^m, m ≥ 1, x = x0,
        distr(x) = cyclic over p = 2^q processors with 0 ≤ q < m.
output: x : vector of length n, distr(x) = cyclic, such that x = F_n x0.

(0)  bitrev(x(s : p : n − 1), n/p);
     { distr(x) = block with bit-reversed processor numbering }
     t := ⌈log2 p / log2(n/p)⌉;
     k1 := n/(n/p)^t;
     rev := true;
     for r := s · n/(k1 p) to (s + 1) · n/(k1 p) − 1 do
         UFFT(x(r k1 : (r + 1) k1 − 1), k1);
     c0 := 1;
     c := k1;
     while c ≤ p do
(1)      redistr(x, n, p, c0, c, rev);
         { distr(x) = group-cyclic with cycle c }
(2)      rev := false;
         j0 := s mod c;
         j2 := s div c;
         UGFFT(x(j2 (nc/p) + j0 : c : (j2 + 1)(nc/p) − 1), n/p, j0/c);
         c0 := c;
         c := (n/p) c;

Now, superstep (0) of the first iteration is equivalent to n/(k1 p) times an


unordered FFT applied to a local subvector of size k1 . Thus, we decide to
restructure the main loop of our algorithm and use distributions with c =
1, k1, k1(n/p), k1(n/p)^2, . . . , k1(n/p)^{t−1} = p, where k1 = n/(n/p)^t and t =
⌈log2 p/log2(n/p)⌉.
The resulting algorithm is given as Algorithm 3.5. The BSP cost of the
algorithm is the same as that of Algorithm 3.3; the cost is given by eqn (3.39).
The function UGFFT(x, n, α) performs an unordered GFFT with parameter
α on a vector x of length n. This function is identical to UFFT(x, n), see
Algorithm 3.2, except that the term ω_k^j in the inner loop is replaced by ω_k^{j+α}.
Table 3.2 gives an example of the shift parameters α that occur in a particular
parallel computation.
The advantage of the restructured algorithm is increased modularity: the
algorithm is based on smaller sequential modules, namely unordered FFTs
and GFFTs of size at most n/p. Thus, we can benefit from existing efficient

Table 3.2. Shift parameters α on eight processors for n = 32

  Iteration   c    P(0)   P(1)   P(2)   P(3)   P(4)   P(5)   P(6)   P(7)
      0       1     0      0      0      0      0      0      0      0
      1       2     0     1/2     0     1/2     0     1/2     0     1/2
      2       8     0     1/8    1/4    3/8    1/2    5/8    3/4    7/8

sequential implementations for these modules. For the FFT many implement-
ations are available, see Section 3.8, but for the GFFT this is not the case.
We are willing to accept a small increase in flop count if this enables us to
use the FFT as the computational workhorse instead of the GFFT. We can
achieve this by writing the GDFT as
y_k = Σ_{j=0}^{n−1} (x_j ω_n^{jα}) ω_n^{jk},  for 0 ≤ k < n,    (3.54)

which can be viewed as a multiplication of x by an n × n diagonal twiddle


matrix
T_n^α = diag(1, ω_n^α, ω_n^{2α}, . . . , ω_n^{(n−1)α}),    (3.55)
followed by a standard DFT. In matrix notation,

F_n^α = F_n T_n^α.    (3.56)

Twiddling with the data vector is a worthwhile preparation for the


execution of the butterflies. The extra cost of twiddling is n/p complex multi-
plications, or 6n/p real flops, in every computation superstep, except the first.
This is usually much less than the (5n/p) log2 (n/p) flops of a computation
superstep.
A slight complication arises because the twiddling relation (3.56) between
the DFT and the GDFT is for ordered transforms, whereas we need unordered
ones. Using the fact that R_n = R_n^{−1}, we can write the unordered GDFT as

F_n^α R_n = F_n T_n^α R_n = F_n R_n (R_n T_n^α R_n) = (F_n R_n) T̃_n^α.    (3.57)

Here, T̃_n^α = R_n T_n^α R_n is the diagonal matrix obtained by permuting the
diagonal entries of T_n^α according to ρ_n,

(T̃_n^α)_{jj} = (T_n^α)_{ρ_n(j),ρ_n(j)},  for 0 ≤ j < n,    (3.58)

cf. Lemma 2.5.
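
Before turning to memory, it may help to see (3.57) in code form. The following sketch (ours, not part of the book's program texts) shows how the UGFFT used in Algorithm 3.5 can be realized with the sequential functions twiddle and ufft of Section 3.6, whose weight tables must have been initialized by ufft_init and twiddle_init.

```c
/* Sketch: unordered GFFT of length n with shift alpha, following eqn (3.57):
   first multiply by the permuted twiddle matrix, then perform an unordered FFT.
   w holds the n/2 FFT weights (from ufft_init); tw holds the n twiddle weights
   for this alpha (from twiddle_init, which already applies the bit reversal rho_n).
   For sign=1 this computes the forward unordered GDFT. */
void ugfft(double *x, int n, int sign, double *w, double *tw){

    twiddle(x,n,sign,tw);
    ufft(x,n,sign,w);

} /* end ugfft */
```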



The maximum amount of memory space per processor used by


Algorithm 3.5, expressed in reals, equals

MFFT = (2 ⌈log2 p / log2(n/p)⌉ + 3) (n/p).    (3.59)

This memory is needed for: n/p complex numbers representing x; n/(2p)


complex numbers representing the weights of an FFT of length n/p; and
n/p complex numbers representing a diagonal twiddle matrix, for each of the
⌈log2 p/log2 (n/p)⌉ GFFTs performed by a processor.
The amount of memory required by the FFT has been reduced from O(n)
to O((n log2 p)/(p log2 (n/p))). To see whether the new memory requirement is
‘scalable’, we need to define precisely what we mean by this term. We call the
memory requirements of a BSP algorithm scalable if the maximum memory
space M(n, p) required per processor satisfies

M(n, p) = O(Mseq(n)/p + p),    (3.60)

where Mseq (n) is the memory space required by the sequential algorithm for
an input size n, and p is the number of processors. This definition allows for
O(p) overhead, reflecting the philosophy that BSP algorithms are based on all-
to-all communication supersteps, where each processor deals with p−1 others,
and also reflecting the practice of current BSP implementations where each
processor stores several arrays of length p. (For example, each registration
of a variable by the BSPlib primitive bsp_push_reg gives rise to an array of
length p on every processor that contains the p addresses of the variable on
all processors. Another example is the common implementation of a commun-
ication superstep where the number of data to be sent is announced to each
destination processor before the data themselves are sent. This information
needs to be stored in an array of length p on the destination processor.)
For p ≤ n/p, only one twiddle array has to be stored, so that the total
memory requirement is M (n, p) = 5n/p, which is scalable by the definition
above. For p > n/p, we need t − 1 additional iterations each requiring a
twiddle array. Fortunately, we can find a simple upper bound on the additional
memory requirement, namely

2(t − 1)n/p = 2(n/p + n/p + · · · + n/p) ≤ 2 (n/p)(n/p) · · · (n/p) = 2(n/p)^{t−1} = 2p/k1 ≤ p.    (3.61)
Thus, the total memory use in this case is M (n, p) ≤ 5n/p + p, which is also
scalable. We have achieved our initial aim.
Note that some processors may be able to use the same twiddle array in
several subsequent supersteps, thus saving memory. An extreme case is P (0),
which always has α = 0 and in fact would need no twiddle memory. Processor
P (p − 1), however, has α = ((p − 1) mod c)/c = (c − 1)/c = 1 − 1/c, so that


each time it needs a different twiddle array. We assume that all processors are
equal in memory size (as well as in computing rate), so that it is not really
worthwhile to save memory for some processors when this is not possible for
the others, as in the present case. After all, this will not increase the size
of the largest problem that can be solved. Also note that P (0) can actually
save the extra flops needed for twiddling, but this does not make the overall
computation faster, because P (0) still has to wait for the other processors to
complete their work. The moral: stick to the SPMD style of programming and
do not try to be clever saving memory space or computing time for specific
processors.

3.6 Example function bspfft


This section presents the program text of the function bspfft, which is a
straightforward implementation of Algorithm 3.5. This function was written
to explain the implementation of the algorithm, and hence its formulation
emphasizes clarity and brevity rather than efficiency, leaving room for further
optimization (mainly in the computation part). Throughout, the data struc-
ture used to store a complex vector of length n is a real array of size 2n with
alternating real and imaginary parts, that is, with Re(xj ) stored as x[2 ∗ j]
and Im(xj ) as x[2 ∗ j + 1]. The function bspfft can also compute an inverse
FFT, and it does this by performing all operations of the forward FFT with
conjugated weights and scaling the output vector by 1/n. Before we can use
bspfft, the function bspfft_init must have been called to initialize three
weight tables and two bit-reversal tables.
The function ufft is a faithful implementation of the UFFT function in
Algorithm 3.2. The loop index runs through the range j = 0, 2, . . . , k − 2,
which corresponds to the range j = 0, 1, . . . , k/2 − 1 in the algorithm; this is a
consequence of the way complex vectors are stored. The function ufft_init
initializes a weight table of n/2 complex weights ω_n^k, 0 ≤ k < n/2, using
eqns (3.41)–(3.43).
The function permute permutes a vector by a given permutation σ that
swaps component pairs independently; an example of such a permutation is
the bit-reversal permutation ρn . This type of permutation has the advantage
that it can be done in-place, requiring only two reals as extra storage but
no additional temporary array. The condition j < sigma[j] ensures that the
swap is executed only if the indices involved are different and that this is done
only once per pair. Without the condition the overall effect of the function
would be nil!
The function twiddle multiplies x and w componentwise, xj := wj xj , for
0 ≤ j < n. The function twiddle_init reveals the purpose of twiddle: w is
the diagonal of the diagonal matrix T̃_n^α = R_n T_n^α R_n for a given α.
The bit-reversal initialization function bitrev_init fills an array rho with
bit reverses of indices used in the FFT. Of course, the bit reverse of an index
can be computed when needed, saving the memory of rho, but this can be
costly in computer time. In the local bit reversal, for instance, the reverse of
n/p local indices is computed. Each reversal of an index costs of the order
log2 (n/p) integer operations. The total number of such operations is there-
fore of the order (n/p) log2 (n/p), which for small p is of the same order as
the (5n/p) log2 n floating-point operations of the butterfly computations. The
fraction of the total time spent in the bit reversal could easily reach 20%.
This justifies using a table so that the bit reversal needs to be computed only
once and its cost can be amortized over several FFTs. To reduce the cost
also in case the FFT is called only once, we optimize the inner loop of the
function by using bit operations on unsigned integers (instead of integer oper-
ations as used everywhere else in the program). To obtain the last bit bi of
the remainder (bm−1 · · · bi )2 , an ‘and’-operation is carried out with 1, which
avoids the expensive modulo operation that would occur in the alternative
formulation, lastbit= rem%2. After that, the remainder is shifted one posi-
tion to the right, which is equivalent to rem /=2. It depends on the compiler
and the chosen optimization level whether the use of explicit bit operations
gives higher speed. (A good compiler will make such optimizations super-
fluous!) In my not so humble opinion, bit operations should only be used
sparingly in scientific computation, but here is an instance where there is a
justification.
The function k1_init computes k1 from n, p by finding the first c =
(n/p)^t ≥ p. Note that the body of the c-loop consists of an empty state-
ment, since we are only interested in the final value of the counter c. The
counter c takes on t + 1 values, which is the number of iterations of the main
loop of the FFT. As a consequence, k1 (n/p)^t = n, so that k1 = n/c.
The function bspredistr redistributes the vector x from group-cyclic dis-
tribution with cycle c0 to the group-cyclic distribution with cycle c1 , for a
ratio c1 /c0 ≥ 1, as illustrated in Fig. 3.5. (We can derive a similar redis-
tribution function for c1 /c0 < 1, but we do not need it.) The function is
an implementation of Algorithm 3.4, but with one important optimization
(I could not resist the temptation!): vector components to be redistributed
are sent in blocks, rather than individually. This results in blocks of nc0 /(pc1 )
complex numbers. If nc0 < pc1 , components are sent individually. The aim
is, of course, to reach a communication rate that corresponds to optimistic
values of g, see Section 2.6.
The parallel FFT, like the parallel LU decomposition, is a regular paral-
lel algorithm, for which the communication pattern can be predicted exactly,
and each processor can determine exactly where every communicated data ele-
ment goes. In such a case, it is always possible for the user to combine data
for the same destination in a block, or packet, and communicate them using
one put operation. In general, this requires packing at the source processor
and unpacking at the destination processor. No identifying information needs
to be sent together with the data since the receiver knows their meaning. (In
(a) c0 = 2: owners of x0, . . . , x15:          0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3
(b) c1 = 4 (cyclic): owners of x0, . . . , x15:  0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

Fig. 3.5. Redistribution from group-cyclic distribution with cycle c0 = 2 to cycle
c1 = 4 for a vector of size n = 16. The number of processors is p = 4. Each
cell represents a vector component; the number in the cell denotes the processor
that owns the cell. The processors are numbered 0, 1, 2, 3. The array indices are
global indices. The initial distribution has two blocks of size eight; the final
distribution has one block of size 16. (In the original figure, arrows show a
packet sent by P(0) to P(2).)

contrast, for certain irregular algorithms, sending such information cannot


be avoided, in which case there is no advantage in packaging by the user
and this is better left up to the BSP system. If, furthermore, the size of the
data items is small compared with the identifying information, we must sadly
communicate at a rate corresponding to pessimistic values of g.)
To perform the packing, we have to answer the question: which vector
components move to the same processor? Consider two components, xj and
xj ′ , that reside on the same processor in the old distribution with cycle c0 .
Obviously, these components are in the same block of size nc0 /p handled by
a group of c0 processors. Because c0 ≤ c1 , the block size increases on moving
to the new distribution, or stays the same, and because n, p, c0 , c1 are powers
of two, each block of the old distribution fits entirely in a block of the new
distribution. Thus, xj and xj ′ will automatically be in the same new block of
size nc1 /p handled by a group of c1 processors. Furthermore, they will be on
the same processor in the new distribution if and only if j and j ′ differ by a
multiple of c1 . Using eqn (3.34) with c = c0 for j and j ′ , and noting that j0 =
j0′ and j2 = j2′ because these indices depend only on the processor number, we
see that this is equivalent to j1 c0 and j1′ c0 differing by a multiple of c1, that is, j1
and j1′ differing by a multiple of ratio = c1/c0. Thus we can pack components
with local indices j, j + ratio, j + 2 ∗ ratio, . . ., into a temporary array and
then put all of these together, as a packet, into the destination processor,
which is determined by component j. The destination index must be chosen
such that two different processors do not write into the same memory location.
In general, this can be done by assigning on each receiving processor a piece
of memory for each of the other processors. Because of the regularity, the
starting point and size of such a location can be computed by the source
processor. Since this computation is closely related to the unpacking, we shall
treat it together with that operation.

To perform the unpacking, we have to move data from the location they
were put into, to their final location on the same processor. Let xj and xj′
be two adjacent components in a packet, so that their old local indices differ
by ratio and hence their global indices satisfy j′ = j + (c1/c0) c0 = j + c1. Since
these components are in the same new block, and their global indices differ by c1,
their new local indices differ by one. We are lucky: if we put xj into its final location, and the
next component of the packet into the next location, and so on, then all
components of the packet immediately reach their final destination. In fact,
this means that we do not have to unpack!
The function bspfft performs the FFT computation itself. It follows
Algorithm 3.5 and contains no surprises. The function bspfft init initial-
izes all tables used. It assumes that a suitable amount of storage has been
allocated for the tables in the calling program. For the twiddle weights, this
amount is 2n/p + p reals, cf. eqn (3.61).
The program text is:

#include "bspedupack.h"

/****************** Sequential functions ********************************/


void ufft(double *x, int n, int sign, double *w){

/* This sequential function computes the unordered discrete Fourier


transform of a complex vector x of length n, stored in a real array
of length 2n as pairs (Re x[j], Im x[j]), 0 <= j < n.
n=2ˆm, m >= 0.
If sign = 1, then the forward unordered dft FRx is computed;
if sign =-1, the backward unordered dft conjg(F)Rx is computed,
where F is the n by n Fourier matrix and R the n by n bit-reversal
matrix. The output overwrites x.
w is a table of n/2 complex weights, stored as pairs of reals,
exp(-2*pi*i*j/n), 0 <= j < n/2,
which must have been initialized before calling this function.
*/

int k, nk, r, rk, j, j0, j1, j2, j3;


double wr, wi, taur, taui;

for(k=2; k<=n; k *=2){


nk= n/k;
for(r=0; r<nk; r++){
rk= 2*r*k;
for(j=0; j<k; j +=2){
wr= w[j*nk];
if (sign==1) {
wi= w[j*nk+1];
} else {
wi= -w[j*nk+1];
}
j0= rk+j;
j1= j0+1;
j2= j0+k;
j3= j2+1;
taur= wr*x[j2] - wi*x[j3];
taui= wi*x[j2] + wr*x[j3];
x[j2]= x[j0]-taur;
x[j3]= x[j1]-taui;
x[j0] += taur;
x[j1] += taui;
}
}
}

} /* end ufft */

void ufft_init(int n, double *w){

/* This function initializes the n/2 weights to be used


in a sequential radix-2 FFT of length n.
n=2ˆm, m >= 0.
w is a table of n/2 complex weights, stored as pairs of reals,
exp(-2*pi*i*j/n), 0 <= j < n/2.
*/

int j, n4j, n2j;


double theta;

if (n==1)
return;
theta= -2.0 * M_PI / (double)n;
w[0]= 1.0;
w[1]= 0.0;
if (n==4){
w[2]= 0.0;
w[3]= -1.0;
} else if (n>=8) {
/* weights 1 .. n/8 */
for(j=1; j<=n/8; j++){
w[2*j]= cos(j*theta);
w[2*j+1]= sin(j*theta);
}
/* weights n/8+1 .. n/4 */
for(j=0; j<n/8; j++){
n4j= n/4-j;
w[2*n4j]= -w[2*j+1];
w[2*n4j+1]= -w[2*j];
}
/* weights n/4+1 .. n/2-1 */
for(j=1; j<n/4; j++){
n2j= n/2-j;
w[2*n2j]= -w[2*j];
w[2*n2j+1]= w[2*j+1];
}
}
} /* end ufft_init */

void twiddle(double *x, int n, int sign, double *w){

/* This sequential function multiplies a complex vector x


of length n, stored as pairs of reals, componentwise
by a complex vector w of length n, if sign=1, and
by conjg(w), if sign=-1. The result overwrites x.
*/

int j, j1;
double wr, wi, xr, xi;

for(j=0; j<2*n; j +=2){


j1= j+1;
wr= w[j];
if (sign==1) {
wi= w[j1];
} else {
wi= -w[j1];
}
xr= x[j];
xi= x[j1];
x[j]= wr*xr - wi*xi;
x[j1]= wi*xr + wr*xi;
}

} /* end twiddle */

void twiddle_init(int n, double alpha, int *rho, double *w){

/* This sequential function initializes the weight table w


to be used in twiddling with a complex vector of length n,
stored as pairs of reals.
n=2ˆm, m >= 0.
alpha is a real shift parameter.
rho is the bit-reversal permutation of length n,
which must have been initialized before calling this function.
The output w is a table of n complex values, stored as pairs of
reals,
exp(-2*pi*i*rho(j)*alpha/n), 0 <= j < n.
*/

int j;
double theta;

theta= -2.0 * M_PI * alpha / (double)n;


for(j=0; j<n; j++){
w[2*j]= cos(rho[j]*theta);
w[2*j+1]= sin(rho[j]*theta);
}

} /* end twiddle_init */

void permute(double *x, int n, int *sigma){

/* This in-place sequential function permutes a complex vector x


of length n >= 1, stored as pairs of reals, by the permutation sigma,
y[j] = x[sigma[j]], 0 <= j < n.
The output overwrites the vector x.
sigma is a permutation of length n that must be decomposable
into disjoint swaps.
*/

int j, j0, j1, j2, j3;


double tmpr, tmpi;

for(j=0; j<n; j++){


if (j<sigma[j]){
/* swap components j and sigma[j] */
j0= 2*j;
j1= j0+1;
j2= 2*sigma[j];
j3= j2+1;
tmpr= x[j0];
tmpi= x[j1];
x[j0]= x[j2];
x[j1]= x[j3];
x[j2]= tmpr;
x[j3]= tmpi;
}
}

} /* end permute */

void bitrev_init(int n, int *rho){

/* This function initializes the bit-reversal permutation rho


of length n, with n=2ˆm, m >= 0.
*/

int j;
unsigned int n1, rem, val, k, lastbit, one=1;

if (n==1){
rho[0]= 0;
return;
}
n1= n;
for(j=0; j<n; j++){
rem= j; /* j= (b(m-1), ... ,b1,b0) in binary */
val= 0;
for (k=1; k<n1; k <<= 1){
lastbit= rem & one; /* lastbit = b(i) with i= log2(k) */
rem >>= 1; /* rem = (b(m-1), ... , b(i+1)) */
val <<= 1;
val |= lastbit; /* val = (b0, ... , b(i)) */
}
rho[j]= (int)val;
}

} /* end bitrev_init */

/****************** Parallel functions ********************************/


int k1_init(int n, int p){

/* This function computes the largest butterfly size k1 of the first


superstep in a parallel FFT of length n on p processors with p < n.
*/

int np, c, k1;

np= n/p;
for(c=1; c<p; c *=np)
;
k1= n/c;

return k1;

} /* end k1_init */

void bspredistr(double *x, int n, int p, int s, int c0, int c1,
char rev, int *rho_p){

/* This function redistributes the complex vector x of length n,


stored as pairs of reals, from group-cyclic distribution
over p processors with cycle c0 to cycle c1, where
c0, c1, p, n are powers of two with 1 <= c0 <= c1 <= p <= n.
s is the processor number, 0 <= s < p.
If rev=true, the function assumes the processor numbering
is bit reversed on input.
rho_p is the bit-reversal permutation of length p.
*/

double *tmp;
int np, j0, j2, j, jglob, ratio, size, npackets, destproc, destindex, r;

np= n/p;
ratio= c1/c0;
size= MAX(np/ratio,1);
npackets= np/size;
tmp= vecallocd(2*size);

if (rev) {
j0= rho_p[s]%c0;
j2= rho_p[s]/c0;
} else {
j0= s%c0;
j2= s/c0;
}
for(j=0; j<npackets; j++){
jglob= j2*c0*np + j*c0 + j0;
destproc= (jglob/(c1*np))*c1 + jglob%c1;
destindex= (jglob%(c1*np))/c1;
for(r=0; r<size; r++){
tmp[2*r]= x[2*(j+r*ratio)];
tmp[2*r+1]= x[2*(j+r*ratio)+1];
}
bsp_put(destproc,tmp,x,destindex*2*SZDBL,size*2*SZDBL);
}
bsp_sync();
vecfreed(tmp);

} /* end bspredistr */

void bspfft(double *x, int n, int p, int s, int sign, double *w0, double *w,
double *tw, int *rho_np, int *rho_p){

/* This parallel function computes the discrete Fourier transform


of a complex array x of length n=2ˆm, m >= 1, stored in a real array
of length 2n as pairs (Re x[j], Im x[j]), 0 <= j < n.
x must have been registered before calling this function.
p is the number of processors, p=2ˆq, 0 <= q < m.
s is the processor number, 0 <= s < p.
The function uses three weight tables:
w0 for the unordered fft of length k1,
w for the unordered fft of length n/p,
tw for a number of twiddles, each of length n/p.
The function uses two bit-reversal permutations:
rho_np of length n/p,
rho_p of length p.
The weight tables and bit-reversal permutations must have been
initialized before calling this function.
If sign = 1, then the dft is computed,
y[k] = sum j=0 to n-1 exp(-2*pi*i*k*j/n)*x[j], for 0 <= k < n.
If sign =-1, then the inverse dft is computed,
y[k] = (1/n) sum j=0 to n-1 exp(+2*pi*i*k*j/n)*x[j], for 0 <= k < n.
Here, i=sqrt(-1). The output vector y overwrites x.
*/

char rev;
int np, k1, r, c0, c, ntw, j;
double ninv;

np= n/p;
k1= k1_init(n,p);
permute(x,np,rho_np);
rev= TRUE;
for(r=0; r<np/k1; r++)


ufft(&x[2*r*k1],k1,sign,w0);

c0= 1;
ntw= 0;
for (c=k1; c<=p; c *=np){
bspredistr(x,n,p,s,c0,c,rev,rho_p);
rev= FALSE;
twiddle(x,np,sign,&tw[2*ntw*np]);
ufft(x,np,sign,w);
c0= c;
ntw++;
}

if (sign==-1){
ninv= 1 / (double)n;
for(j=0; j<2*np; j++)
x[j] *= ninv;
}

} /* end bspfft */

void bspfft_init(int n, int p, int s, double *w0, double *w, double *tw,
int *rho_np, int *rho_p){

/* This parallel function initializes all the tables used in the FFT. */

int np, k1, ntw, c;


double alpha;

np= n/p;
bitrev_init(np,rho_np);
bitrev_init(p,rho_p);

k1= k1_init(n,p);
ufft_init(k1,w0);
ufft_init(np,w);

ntw= 0;
for (c=k1; c<=p; c *=np){
alpha= (s%c) / (double)(c);
twiddle_init(np,alpha,rho_np,&tw[2*ntw*np]);
ntw++;
}

} /* end bspfft_init */
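
To show how these pieces fit together, here is a minimal driver sketch (ours, not part of the book's program texts). It assumes the usual BSPedupack allocation helpers vecallocd/vecalloci and vecfreed/vecfreei and omits the main/bsp_init wrapper; the table sizes follow the function headers above and eqn (3.61).

```c
#include "bspedupack.h"

void bspfft_driver(){

    double *x, *w0, *w, *tw;
    int *rho_np, *rho_p;
    int p, s, n, np, k1, j;

    bsp_begin(bsp_nprocs());
    p= bsp_nprocs();
    s= bsp_pid();
    n= 262144;                    /* example length: a power of two with p < n */
    np= n/p;
    k1= k1_init(n,p);

    x=  vecallocd(2*np);          /* local part of x: np complex numbers */
    w0= vecallocd(k1);            /* k1/2 complex weights for FFTs of length k1 */
    w=  vecallocd(np);            /* (n/p)/2 complex weights for FFTs of length n/p */
    tw= vecallocd(2*np+p);        /* twiddle weights, cf. eqn (3.61) */
    rho_np= vecalloci(np);
    rho_p=  vecalloci(p);

    bsp_push_reg(x,2*np*SZDBL);   /* x receives remote puts in bspredistr */
    bsp_sync();

    for(j=0; j<np; j++){          /* some input vector, cyclically distributed */
        x[2*j]= (double)(j*p+s);  /* Re x_{jp+s} */
        x[2*j+1]= 0.0;            /* Im x_{jp+s} */
    }

    bspfft_init(n,p,s,w0,w,tw,rho_np,rho_p);
    bspfft(x,n,p,s, 1,w0,w,tw,rho_np,rho_p);   /* forward FFT */
    bspfft(x,n,p,s,-1,w0,w,tw,rho_np,rho_p);   /* inverse FFT: recovers the input */

    bsp_pop_reg(x);
    vecfreei(rho_p); vecfreei(rho_np);
    vecfreed(tw); vecfreed(w); vecfreed(w0); vecfreed(x);
    bsp_end();

} /* end bspfft_driver */
```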

3.7 Experimental results on an SGI Origin 3800


In this section, we take a critical look at parallel computing. We discuss
various ways of presenting experimental results, and we shall experience that
things are not always what they seem to be.

The experiments of this section were performed on up to 16 processors of


Teras, the national supercomputer in the Netherlands, located in Amsterdam.
This Silicon Graphics Origin 3800 machine has 1024 processors and is the
successor of the SGI Origin 2000 benchmarked in Section 1.7. The machine
consists of six smaller subsystems, the largest of which has 512 processors.
Each processor has a MIPS RS14000 CPU with a clock rate of 500 MHz and
a theoretical peak performance of 1 Gflop/s, so that the whole machine has a
peak computing rate of 1 Tflop/s. (The name Teras comes from this Teraflop/s
speed, but also from the Greek word for ‘monster’, τέρας.) Each processor
has a primary data cache of 32 Kbyte, a secondary cache of 8 Mbyte, and a
memory of 1 Gbyte. The machine is sometimes classified as a Cache Coherent
Non-Uniform Memory Access (CC-NUMA) machine, which means that the
user views a shared memory (where the cache is kept coherent—hence the
CC) but that physically the memory is distributed (where memory access time
is not uniform—hence the NUMA). The BSP architecture assumes uniform
access time for remote memory, see Section 1.2, but it assumes faster access
to local memory. BSP architectures can therefore be considered NUMA, but
with a strict two-level memory hierarchy. (We always have to be careful when
using the word ‘uniform’ and state precisely what it refers to.) As in the case
of the Origin 2000, we ignore the shared-memory facility and use the Origin
3800 as a distributed-memory machine.
The Origin 3800 was transformed into a BSP computer by using
version 1.4 of BSPlib. We compiled our programs using the standard
SGI ANSI C-compiler with optimization flags switched on. These
flags are -O3 for computation and -flibrary-level 2 bspfifo 10000
-fcombine-puts -fcombine-puts-buffer 256K,128M,4K for communica-
tion. The -fcombine-puts flag tells the compiler that puts must be combined,
meaning that different messages from the same source to the same destination
in the same superstep must be sent together in one packet. This saves time,
but requires buffer memory. The -fcombine-puts-buffer 256K,128M,4K
flag was added to change the default buffer sizes, telling the compiler that
the buffer space available on each processor for combining put operations is
2(p − 1) × 256 Kbyte, that is, a local send and receive buffer of 256 Kbyte for
each of the remote processors. The flag also tells the compiler that no more
than 128 Mbyte should be used for all buffers of all processors together and
that combining puts should be switched off when this upper bound leads to
less than 4 Kbyte buffer memory on each processor; this may happen for a
large number of processors. We changed the default, which was tailored to
the previous Origin 2000 machine, to use the larger memory available on the
Origin 3800 and to prevent early switchoff.
The BSP parameters of the Origin 3800 obtained by bspbench for various
values of p are given by Table 3.3. Note that g is fairly constant, but that l
grows quickly with p; this makes larger numbers of processors less useful for
Table 3.3. Benchmarked BSP parameters p, g, l and the
time of a 0-relation for a Silicon Graphics Origin 3800. All
times are in flop units (r = 285 Mflop/s)

  p     g        l     Tcomm(0)
  1    99       55        378
  2    75     5118       1414
  4    99   12 743       2098
  8   126   32 742       4947
 16   122   93 488     15 766

Fig. 3.6. Time of a parallel FFT of length 262 144. (The plot shows the time in
ms as a function of the number of processors p, for p = 1 to 16.)

certain applications. For p = 16, the synchronization time is already about


100 000 flop units, or 0.3 ms.
For the benchmarking, we had to reduce the compiler optimization level
from -O3 to -O2, because the aggressive compilation at level -O3 removed part
of the computation from the benchmark, leading to inexplicably high comput-
ing rates. At that level, we cannot fool the compiler any more about our true
intention of just measuring computing speed. For the actual FFT timings,
however, gains by aggressive optimization are welcome and hence the FFT
measurements were carried out at level -O3. Note that the predecessor of the
Origin 3800, the Origin 2000, was benchmarked in Section 1.7 as having a
computing rate r = 326 Mflop/s. This rate is higher than the 285 Mflop/s of
Table 3.4. Time Tp(n) (in ms) of sequential
and parallel FFT on p processors of a Silicon
Graphics Origin 3800

    p         n = 4096   n = 16 384   n = 65 536   n = 262 144
  1 (seq)       1.16        5.99         26.6         155.2
  1 (par)       1.32        6.58         29.8         167.4
  2             1.06        4.92         22.9          99.4
  4             0.64        3.15         13.6          52.2
  8             1.18        2.00          8.9          29.3
 16             8.44       11.07          9.9          26.8

the Origin 3800, which is partly due to the fact that the predecessor machine
was used in dedicated mode, but it must also be due to artefacts from aggress-
ive optimization. This should serve as a renewed warning about the difficulties
of benchmarking.
Figure 3.6 shows a set of timing results Tp (n) for an FFT of length n =
262 144. What is your first impression? Does the performance scale well? It
seems that for larger numbers of processors the improvement levels off. Well,
let me reveal that this figure represents the time of a theoretical, perfectly
parallelized FFT, based on a time of 155.2 ms for p = 1. Thus, we conclude
that presenting results in this way may deceive the human eye.
Table 3.4 presents the raw data of our time measurements for the program
bspfft of Section 3.6. These data have to be taken with a grain of salt, since
timings may suffer from interference by programs from other users (caused for
instance by sharing of communication links). To get meaningful results, we
ran each experiment three times, and took the best result, assuming that the
corresponding run would suffer less from interference. Often we found that
the best two timings were within 5% of each other, and that the third result
was worse.
Figure 3.7 compares the actual measured execution time for n = 262 144
on an Origin 3800 with the ideal time. Note that for this large n, the measured
time is reasonably close to the ideal time, except perhaps for p = 16.
The speedup Sp (n) of a parallel program is defined as the increase in
speed of the program running on p processors compared with the speed of a
sequential program (with the same level of optimization),
Sp(n) = Tseq(n) / Tp(n).    (3.62)
Note that we do not take the time of the parallel program with p = 1 as
reference time, since this may be too flattering; obtaining a good speedup
with such a reference for comparison may be reason for great pride, but it is
Fig. 3.7. Time Tp of actual parallel FFT of length 262 144. (The plot shows the
measured and the ideal time in ms as a function of p.)

an achievement of the same order as becoming the Dutch national champion


in alpine skiing. The latter achievement is put into the right perspective if
you know that the Netherlands is a very flat country, which does not have
skiers of international fame. Similarly, the speedup achieved on a parallel
computer can only be put into the right perspective if you know the time
of a good sequential program. A parallel program run on one processor will
always execute superfluous operations, and for n = 262 144 this overhead
amounts to about 8%. (Sometimes, it is unavoidable to measure speedup vs.
the parallel program with p = 1, for instance if it is too much work to develop
a sequential version, or if one adheres to the purist principle of maintaining
a single program source: ‘a sequential program is a parallel program run on
one processor’. In such a situation, speedups should be reported with a clear
warning about the reference version.) Figure 3.8 gives the speedup for n =
65 536, 262 144. The largest speedup obtained is about 5.95. This way of
presenting timings of parallel programs gives much insight: it exposes good
as well as bad scaling behaviour and it also allows easy comparison between
measurements for different lengths.
In principle, 0 ≤ Sp (n) ≤ p should hold, because p processors cannot be
more than p times faster than one processor. This is indeed true for our meas-
ured speedups. Practice, however, may have its surprises: in certain situations,
a superlinear speedup of more than p can be observed. This phenomenon
is often the result of cache effects, due to the fact that in the parallel case each
processor has less data to handle than in the sequential case, so that a larger
Fig. 3.8. Speedup Sp(n) of parallel FFT. (The plot shows the measured speedup
for n = 262 144 and n = 65 536, together with the ideal speedup, as a function of p.)

part of the computation can be carried out using data that are already in
cache, thus yielding fewer cache misses and a higher computing rate. Occur-
rence of superlinear speedup in a set of experimental results should be a
warning to be cautious when interpreting results, even for results that are not
superlinear themselves. In our case, it is also likely that we benefit from cache
effects. Still, one may argue that the ability to use many caches simultaneously
is a true benefit of parallel computing.
The efficiency Ep (n) gives the fraction of the total computing power that
is usefully employed. It is defined by

Ep(n) = Sp(n)/p = Tseq(n)/(p Tp(n)).    (3.63)

In general, 0 ≤ Ep (n) ≤ 1, with the same caveats as before. Figure 3.9 gives
the efficiency for n = 65 536, 262 144.
Another measure is the normalized cost Cp (n), which is just the time of
the parallel program divided by the time that would be taken by a perfectly
parallelized version of the sequential program. This cost is defined by

Cp(n) = Tp(n) / (Tseq(n)/p).    (3.64)

Note that Cp (n) = 1/Ep (n), which explains why this cost is sometimes called
the inefficiency. Figure 3.10 gives the cost of the FFT program for n =
65 536, 262 144. The difference between the normalized cost and the ideal
value of 1 is the parallel overhead, which usually consists of load imbalance,
communication time, and synchronization time. A breakdown of the overhead
into its main parts can be obtained by performing additional measurements,
or theoretically by predictions based on the BSP model.
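
As a quick worked illustration of these measures (our own arithmetic, using Table 3.4 for n = 262 144): with Tseq = 155.2 ms and T8 = 29.3 ms,

S8 = 155.2/29.3 ≈ 5.3,   E8 = 5.3/8 ≈ 0.66,   C8 = 1/0.66 ≈ 1.5,

so on eight processors about two thirds of the available computing power is usefully employed, and the program takes about one and a half times as long as a perfectly parallelized sequential program would.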

Fig. 3.9. Measured efficiency Ep(n) of parallel FFT, for n = 262 144 and
n = 65 536, as a function of p. The ideal value is 1.

Fig. 3.10. Normalized cost Cp(n) of parallel FFT. (The plot shows the measured
cost for n = 65 536 and n = 262 144, together with the ideal value 1, as a
function of p.)



Table 3.5. Breakdown of predicted execution time
(in ms) of the parallel FFT program for n = 262 144.
For comparison, the measured time is also given.

  p    Tcomp   Tcomm   Tsync   TFFT (pred.)   Tp (meas.)
  1    82.78    0.00    0.00       82.78         167.4
  2    41.39   68.99    0.05      110.43          99.4
  4    20.70   45.53    0.13       66.36          52.2
  8    10.35   28.97    0.35       39.67          29.3
 16     5.17   14.03    0.98       20.18          26.8

Table 3.5 predicts the computation, communication, and synchronization


time of the FFT for n = 262 144, based on the costs 5n log2 n + 0g + l for
p = 1 and (5n log2 n)/p + 2ng/p + 3l for p > 1, cf. eqn (3.40).
The first observation we can make from the table is that it is difficult
to predict the computation time correctly. The program takes about twice
the time predicted for p = 1. This is because our benchmark measured a
computing rate of 285 Mflop/s for a small DAXPY, which fits into the primary
cache, but for an FFT of length 262 144, a single processor has to handle an
array of over 4 Mbyte for the complex vector x alone, thus exceeding the
primary cache and filling half the secondary cache. The measured rate for
p = 1 is about 144 Mflop/s. Thus, the prediction considerably underestimates
the actual computing time. For larger p, the misprediction becomes less severe.
From the table, we also learn that the FFT is such a data-intensive com-
putation that one permutation of the data vector already has a serious impact
on the total execution time of a parallel FFT program. The prediction over-
estimates the communication time, because it is based on a pessimistic g-value,
whereas the actual parallel FFT was optimized to send data in large pack-
ets. The prediction also overestimates the synchronization time, because for
p = 1 we counted l (for a computation superstep) but in reality there is no
synchronization, and for p > 1 we counted 3l (for two computation supersteps
and one communication superstep) where there is only one synchronization.
This does not matter much, because the synchronization time is insignificant.
The overall prediction matches the measurement reasonably well, except for
p = 1, and the reason is that the errors made in estimating computation time
and communication time cancel each other to a large extent. Of course, it
is possible to predict better, for example, by measuring the computing rate
for this particular application and using that rate instead of r, and perhaps
Table 3.6. Computing rate Rp(n) (in Mflop/s) of sequential
and parallel FFT on p processors of a Silicon Graphics
Origin 3800

    p         n = 4096   n = 16 384   n = 65 536   n = 262 144
  1 (seq)       220         197          202           155
  1 (par)       193         179          180           144
  2             239         240          234           243
  4             397         375          395           462
  8             216         591          607           824
 16              30         107          545           900

even using a rate that depends on the local vector length. The communication
prediction can be improved by measuring optimistic g-values.
Table 3.6 shows the computing rate Rp (n) of all processors together for
this application, defined by

Rp(n) = 5n log2 n / Tp(n),    (3.65)

where we take the standard flop count 5n log2 n as basis (as is customary for all
FFT counts, even for highly optimized FFTs that perform fewer flops). The
flop rate is useful in comparing results for different problem sizes and also
for different applications. Furthermore, it tells us how far we are from the
advertised peak performance. It is a sobering thought that we need at least
four processors to exceed the top computing rate of 285 Mflop/s measured
for an in-cache DAXPY operation on a single processor. Thus, instead of
parallelizing, it may be preferable to make our sequential program cache-
friendly. If we still need more speed, we turn to parallelism and make our
parallel program cache-friendly. (Exercise 5 tells you how to do this.) Making a
parallel program cache-friendly will decrease running time and hence increase
the computing rate, but paradoxically it will also decrease the speedup and
the efficiency, because the communication part remains the same while the
computing part is made faster in both the parallel program and the sequential
reference program. Use of a parallel computer for the one-dimensional FFT
can therefore only be justified for very large problems. But were not parallel
computers made exactly for that purpose?

3.8 Bibliographic notes


3.8.1 Sequential FFT algorithms
The basic idea of the FFT was discovered already in 1805 by (who else?)
Gauss [74]. It has been rediscovered several times: by Danielson and
Lanczos [51] in 1942 and by Cooley and Tukey [45] in 1965. Because Cooley
and Tukey’s rediscovery took place in the age of digital computers, their FFT
algorithm found immediate widespread use in computer programs. As a result,
the FFT became connected to their name. Cooley [44] tells the whole story
of the discovery in a paper How the FFT gained acceptance. He states that
one reason for the widespread use of the FFT is the decision made at IBM,
the employer of Cooley at that time, to put the FFT algorithm in the public
domain and not to try to obtain a patent on this algorithm. In concluding
his paper, Cooley recommends not to publish papers in neoclassic Latin (as
Gauss did). Heideman, Johnson, and Burrus [92] dug up the prehistory of the
FFT and wrote a historical account on the FFT from Gauss to modern times.
An FFT bibliography from 1995 by Sorensen, Burrus, and Heideman [165]
contains over 3400 entries. Some entrance points into the vast FFT literature
can be found below.
A large body of work such as the work done on FFTs inevitably contains
much duplication. Identical FFT algorithms have appeared in completely dif-
ferent formulations. Van Loan in his book Computational Frameworks for the
Fast Fourier Transform [187] has rendered us the great service of providing a
unified treatment of many different FFT algorithms. This treatment is based
on factorizing the Fourier matrix and using powerful matrix notations such
as the Kronecker product. The book contains a wealth of material and it is
the first place to look for suitable variants of the FFT. We have adopted this
framework where possible, and in particular we have used it in describing
sequential FFTs and GFFTs, and the local computations of parallel FFTs.
A different look at FFT algorithms, ordered and unordered, DIF and DIT,
sequential and parallel, is given by Chu and George in their book Inside
the FFT Black Box [43]. Their book is aimed at a computing science and
engineering audience; it presents FFT methods in the form of highly detailed
algorithms and not in the more abstract form of matrix factorizations.
For those who want to learn more about using the DFT, but care less
about how it is actually computed, the book by Briggs and Henson [33] is
a good source. It discusses the various forms of the DFT and their rela-
tion with continuous Fourier transforms, two-dimensional DFTs, applications
such as the reconstruction of images from projections, and related transforms.
Bracewell [31] gives a detailed discussion of the continuous Fourier transform;
a good understanding of the continuous case is necessary for explaining the res-
ults of a discrete transform. The book by Bracewell includes a series of pictures
of functions and their continuous Fourier transforms and also a biography of
Fourier.
Sequential implementations of various FFTs can be found in Numerical Recipes in C: The Art of Scientific Computing by Press, Teukolsky,
Vetterling, and Flannery [157]. This book devotes two chapters to FFTs and
their application. Programs included are, among others, complex-valued FFT,
real-valued FFT, fast sine transform, fast cosine transform, multidimensional
FFT, out-of-core FFT, and convolution.
The fastest Fourier transform in the West (FFTW) package by Frigo and
Johnson [73] is an extremely fast sequential program. The speed of FFTW
comes from the use of codelets, straight inline code without loops, in core
parts of the software and from the program’s ability to adapt itself to the
hardware used. Instead of using flop counts or user-provided knowledge about
the hardware (such as cache sizes) to optimize performance, the program car-
ries out a set of FFT timings on the hardware, resulting in a computation
plan for each FFT length. Using actual timings is better than counting flops,
since most of the execution time of an FFT is spent in moving data around
between registers, caches, and main memory, and not in floating-point opera-
tions. A plan is used in all runs of the FFT of the corresponding length.
An optimal plan is chosen by considering several possible plans: an FFT of
length n = 128 can be split into two FFTs of length 64 by a so-called radix-2
approach (used in this chapter) but it can also be split into four FFTs of
length 32 by a radix-4 approach, and so on. Optimal plans for smaller FFTs
are used in determining plans for larger FFTs. Thus, the plans are computed
bottom-up, starting with the smallest lengths. The large number of possib-
ilities to be considered is reduced by dynamic programming, that is, the
use of optimal solutions to subproblems when searching for an optimal solu-
tion to a larger problem. (If a solution to a problem is optimal, its solutions
of subproblems must also be optimal; otherwise, the overall solution could
have been improved.) FFTW is recursive, with the advantage that sufficiently
small subproblems fit completely in the cache. The program can also handle
FFT lengths that are not a power of two. A parallel version of FFTW exists;
it uses the block distribution on input and has two or three communica-
tion supersteps, depending on whether the output must be redistributed into
blocks.
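To give an impression of this plan-then-execute style, the following C fragment is a minimal sketch of how FFTW is typically called; it assumes the interface of FFTW version 3 (the function names fftw_plan_dft_1d, fftw_execute, and the flag FFTW_MEASURE belong to that version), so the details may differ for other releases of the package.

#include <complex.h>
#include <fftw3.h>

/* Transform a complex vector of length n, planning once and then
   executing the same plan for many inputs of that length. */
void fft_with_plan(int n, int nruns)
{
    fftw_complex *x = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *y = fftw_malloc(sizeof(fftw_complex) * n);

    /* FFTW_MEASURE times several candidate plans on the actual hardware;
       the plan found is then reused for every FFT of this length. */
    fftw_plan plan = fftw_plan_dft_1d(n, x, y, FFTW_FORWARD, FFTW_MEASURE);

    for (int r = 0; r < nruns; r++) {
        /* ... fill x with input data here (after planning, since planning
           with FFTW_MEASURE may overwrite the arrays) ... */
        fftw_execute(plan);   /* y := DFT of x */
        /* ... use y ... */
    }
    fftw_destroy_plan(plan);
    fftw_free(x);
    fftw_free(y);
}

The point of the example is that the possibly expensive search for a good plan is paid only once, whereas every subsequent execution inherits its benefits.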
Johnson et al. [114] describe the signal processing language (SPL), a Lisp-
like programming language which can be used to implement factorization for-
mulae directly for transform matrices (such as Fourier and Walsh–Hadamard
matrices) arising in digital signal processing applications. Many different fac-
torization formulae for the same matrix Fn can be generated automatically
using mathematical transformation rules such as the properties of Kronecker
products and the radix-m splitting formula,
Fmn = (Fm ⊗ In )Tm,mn (Im ⊗ Fn )Sm,mn , (3.66)
which is the basis for the Cooley–Tukey approach [45]. Here, the twiddle matrix $T_{m,N}$ is an $N \times N$ diagonal matrix with $(T_{m,N})_{j(N/m)+k,\,j(N/m)+k} = \omega_N^{jk}$ for $0 \leq j < m$ and $0 \leq k < N/m$. Furthermore, $S_{m,N}$ is the Mod-$m$ sort
matrix, the $N \times N$ permutation matrix defined by
\[
S_{m,N}\, x = \begin{pmatrix} x(0:m:N-1) \\ x(1:m:N-1) \\ \vdots \\ x(m-1:m:N-1) \end{pmatrix}, \qquad (3.67)
\]
which has as inverse $S_{m,N}^{-1} = S_{N/m,N}$. Note that $T_{2,N} = \mathrm{diag}(I_{N/2}, \Omega_{N/2})$ and
S2,N = SN . The SPL compiler translates each factorization formula into a
Fortran program. In the same spirit as FFTW, an extensive search is car-
ried out over the space of factorization formulae and compiler techniques. A
Kronecker-product property used by SPL which is important for a parallel
context is: for every m × m matrix A and n × n matrix B,

A ⊗ B = Sm,mn (In ⊗ A)Sn,mn (Im ⊗ B). (3.68)

As a special case, we have

Fm ⊗ In = Sm,mn (In ⊗ Fm )Sn,mn . (3.69)
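To make the sort matrix of eqn (3.67) concrete, the following small C function (an illustration only, not part of the programs accompanying this book) applies $S_{m,N}$ to a real vector by gathering the strided subvectors $x(0:m:N-1), x(1:m:N-1), \ldots$; with m = 2 this is exactly the even–odd split used by a radix-2 FFT.

/* y := S_{m,N} x for a vector x of length N, where m divides N.
   Block t of y (of length N/m) receives the stride-m subvector x(t:m:N-1). */
void mod_m_sort(int m, int N, const double *x, double *y)
{
    int Nm = N / m;
    for (int t = 0; t < m; t++)        /* residue class t modulo m */
        for (int k = 0; k < Nm; k++)   /* position within the class */
            y[t * Nm + k] = x[t + k * m];
}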

3.8.2 Parallel FFT algorithms with log2 p or more supersteps


Cooley and Tukey [45] already observed that all butterflies of one stage can
be computed in parallel. The first parallel FFT algorithm was published by
Pease [155] in 1968, a few years after the Cooley–Tukey paper. Pease presents
a matrix decomposition of the Fourier matrix in which he uses the Kronecker-
product notation. Each of the log2 n stages of his algorithm requires a so-called
perfect shuffle permutation $S_{n/2,n} = S_n^{-1}$; this would make an actual imple-
mentation on a general-purpose computer expensive. The algorithm, however,
was aimed at implementation in special-purpose hardware, with specialized
circuitry for butterflies, shuffles, and twiddles.
Fox et al. [71] present a parallel recursive algorithm based on the block
distribution, which decomposes the FFT into a sequence of n/p calls to FFTs
of length p and combines the results. The decomposition and combination
parts of the algorithm are carried out without communication. Each FFT
of length p, however, requires much communication since there is only one
vector component per processor. Another disadvantage, which is true of every
straightforward implementation of a recursive algorithm, is that the smaller
tasks of the recursion are executed one after the other, where in principle they
could have been done in parallel.
The nonrecursive parallel FFT algorithms proposed in the earlier liter-
ature typically perform log2 (n/p) stages locally without communication and
log2 p stages nonlocally involving both communication and computation, thus
requiring a total of O(log2 p) communication supersteps. These algorithms
were mostly designed targeting the hypercube architecture, where each pro-
cessor s = (bq−1 · · · b0 )2 is connected to the q = log2 p processors that differ
from s in exactly one bit of their processor number.
Examples of algorithms from this category are discussed by Van
Loan [187, Algorithm 3.5.3], by Dubey, Zubair, and Grosch [62], who present
a variant that can use every input and output distribution from the block-
cyclic family, and by Gupta and Kumar [87] (see also [82]), who describe
the so-called binary exchange algorithm. Gupta and Kumar analyse the
scalability of this algorithm by using the isoefficiency function fE (p), which
expresses how fast the amount of work of a problem must grow with p to
maintain a constant efficiency E.
Another example is an algorithm for the hypercube architecture given
by Swarztrauber [172] which is based on index-digit permutations [72]: each
permutation τ on the set of m bits {0, . . . , m − 1} induces an index-
digit permutation which moves the index j = (bm−1 · · · b1 b0 )2 into j ′ =
(bτ (m−1) · · · bτ (1) bτ (0) )2 . Using the block distribution for n = 2m and p = 2q ,
an index is split into the processor number s = (bm−1 · · · bm−q )2 and the local
index j = (bm−q−1 · · · b1 b0 )2 . An i-cycle is an index-digit permutation where
τ is a swap of the pivot bit m − q − 1 with another bit r. Swarztrauber notes
that no communication is needed if r ≤ m − q − 1; otherwise, the i-cycle
permutation requires communication, but it still has the advantage of moving
data in large chunks, namely blocks of size n/(2p). Every index-digit permuta-
tion can be carried out as a sequence of i-cycles. Every butterfly operation of
an FFT combines pairs (j, j ′ ) that differ in exactly one bit. An i-cycle can be
used to make this bit local.
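As a small illustration (not taken from [172]), the following C function sketches the index map of a single i-cycle: it swaps bit r with the pivot bit m − q − 1 of an m-bit index. If r ≤ m − q − 1 the map only rearranges local indices, whereas for r ≥ m − q it also changes the processor part of the index and therefore implies communication.

/* Return the index obtained from j by swapping bit r with the pivot bit
   m-q-1, where n = 2^m and p = 2^q (the index map of an i-cycle). */
unsigned long icycle(unsigned long j, int m, int q, int r)
{
    int pivot = m - q - 1;
    unsigned long bit_r = (j >> r) & 1UL;       /* bit r of j */
    unsigned long bit_p = (j >> pivot) & 1UL;   /* pivot bit of j */
    if (bit_r != bit_p) {
        j ^= 1UL << r;        /* the bits differ: flip both to swap them */
        j ^= 1UL << pivot;
    }
    return j;
}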

3.8.3 Parallel FFT algorithms with O(1) supersteps


We can make parallel FFT algorithms more efficient if we manage to combine
many butterfly stages into one superstep (or pair of supersteps); this way, they
require less communication and synchronization. Under mild assumptions, such as $p \leq \sqrt{n}$, this leads to algorithms with only a small constant number
of supersteps. The BSP algorithm presented in this chapter falls into this
category.
The original BSP paper by Valiant [178] discusses briefly the BSP cost
of an algorithm by Papadimitriou and Yannakakis [153] which achieves a
total cost within a constant factor of optimal if g = O(log2 (n/p)) and
l = O((n/p) log2 (n/p)); for such a ratio n/p, our algorithm also achieves optimality, cf. eqn (3.39). For the common special case $p \leq \sqrt{n}$, this method
has been implemented in an unordered DIF FFT by Culler et al. [47], demon-
strating the application of the LogP model. The computation starts with a
cyclic distribution and then switches to a block distribution. The authors
avoid contention during the redistribution by scheduling the communications
carefully and by inserting additional global synchronizations to enforce strict
adherence to the communication schedule. (A good BSPlib system would do
this automatically.)

Gupta et al. [88] use data redistributions to implement parallel FFTs. For
the unordered Cooley–Tukey algorithm, they start with the block distribu-
tion, finish with the cyclic distribution, and use block-cyclic distributions
in between. If a bit reversal must be performed and the output must be
distributed in the same√way as the input, this requires three communica-
tion supersteps for p ≤ n. The authors also modify the ordered Stockham
algorithm so they can start and finish with the cyclic
√ distribution, and per-
form only one communication superstep for p ≤ n. Thus they achieve the
same minimal communication cost as Algorithms 3.3 and 3.5. Experimental
results on an Intel iPSC/860 hypercube show that the modified Stockham
algorithm outperforms all other implemented algorithms.
McColl [135] presents a detailed BSP algorithm for an ordered FFT, which
uses the block distribution on input and output. The algorithm starts with
an explicit bit-reversal permutation and it finishes with a redistribution from
cyclic to block distribution. Thus, for p > 1 the algorithm needs at least
three communication supersteps. Except for the extra communication at the
start and finish, the algorithm of McColl is quite similar to our algorithm. His
algorithm stores and communicates the original index of each vector compon-
ent together with its numerical value. This facilitates the description of the
algorithm, but the resulting communication should be removed in an imple-
mentation because in principle the original indices can be computed by every
processor. Furthermore, the exposition is simplified by the assumption that m − q is a divisor of m. This implies that p = 1 or $p \geq \sqrt{n}$; it is easy to generalize the algorithm so that it can handle the most common case $1 < p < \sqrt{n}$ as well.
The algorithm presented in this chapter is largely based on work by Inda
and Bisseling [111]. This work introduces the group-cyclic distribution and
formulates redistributions as permutations of the data vector. For example,
changing the distribution from block to cyclic has the same effect as keeping
the distribution constant but performing an explicit permutation Sp,n . (To
be more precise, the processor and local
index of the original data element $x_j$ are the same in both cases.) For $p \leq \sqrt{n}$, the factorization is written as
\[
F_n = S_{p,n}^{-1}\, A\, S_{p,n}\, (R_p \otimes I_{n/p})(I_p \otimes F_{n/p})\, S_{p,n}, \qquad (3.70)
\]

where A is a block-diagonal matrix with blocks of size n/p,
\[
A = S_{p,n}^{-1}\, (I_1 \otimes B_n) \cdots (I_{p/2} \otimes B_{2n/p})\, S_{p,n}. \qquad (3.71)
\]

This factorization needs three permutations if the input and output are dis-
tributed by blocks, but it needs only one permutation if these distributions
are cyclic. Thus the cyclic I/O distribution is best.
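The equivalence between redistributing and permuting is easy to check directly; the following C fragment is an illustrative test (not taken from [111]) which verifies that the owner and the local index of $x_j$ under the cyclic distribution equal those of position $(j \bmod p)(n/p) + j\ \mathrm{div}\ p$, the place to which $S_{p,n}$ moves $x_j$, under the block distribution.

#include <assert.h>

/* Verify that distributing x cyclically is the same as first permuting x
   by S_{p,n} and then distributing it by blocks. Assumes that p divides n. */
void check_block_versus_cyclic(int n, int p)
{
    int b = n / p;                        /* block size */
    for (int j = 0; j < n; j++) {
        int proc_cyclic = j % p;          /* owner of x_j in the cyclic distribution */
        int loc_cyclic  = j / p;          /* local index of x_j */
        int sigma = (j % p) * b + j / p;  /* position of x_j after applying S_{p,n} */
        int proc_block = sigma / b;       /* owner of that position in the block distribution */
        int loc_block  = sigma % b;
        assert(proc_cyclic == proc_block && loc_cyclic == loc_block);
    }
}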
Another related algorithm is the transpose algorithm, which calculates a
one-dimensional FFT of size mn by storing the vector x as a two-dimensional
matrix of size m × n. Component xj is stored as matrix element X(j0 , j1 ),
where j = j0 n + j1 with 0 ≤ j0 < m and 0 ≤ j1 < n. This algorithm is
based on the observation that in the first part of an unordered FFT com-
ponents within the same matrix row are combined, whereas in the second
part components within the same column are combined. The matrix can be
transposed between the two parts of the algorithm, so that all butterflies can
be done within rows. In the parallel case, each processor can then handle one
or more rows. The only communication needed is in the matrix transposi-
tion and the bit reversal. For a description and experimental results, see, for
example, Gupta and Kumar [87]. This approach works for p ≤ min(m, n).
Otherwise, the transpose algorithm must be generalized to a higher dimen-
sion. A detailed description of such a generalization can be found in the book
by Grama and coworkers [82]. Note that this algorithm is similar to the BSP
algorithm presented here, except that its bit-reversal permutation requires
communication.
The two-dimensional view of a one-dimensional FFT can be carried one
step further by formulating the algorithm such that it uses explicit shorter-
length FFTs on the rows or columns of the matrix storing the data vector.
Van Loan [187, Section 3.3.1] calls the corresponding approach the four-
step framework and the six-step framework. This approach is based
on a mixed-radix method due to Agarwal and Cooley [2] (who developed it
for the purpose of vectorization). The six-step framework is equivalent to a
factorization into six factors,

Fmn = Sm,mn (In ⊗ Fm )Sn,mn Tm,mn (Im ⊗ Fn )Sm,mn . (3.72)

This factorization follows immediately from eqns (3.66) and (3.69). In a paral-
lel algorithm based on eqn (3.72), with the block distribution used throughout,
the only communication occurs in the permutations Sm,mn and Sn,mn . The
four-step framework is equivalent to the factorization

Fmn = (Fm ⊗ In )Sm,mn Tn,mn (Fn ⊗ Im ), (3.73)

which also follows from eqns (3.66) and (3.69). Note that now the shorter-
length Fourier transforms need strided access to the data vector, with stride
n for Fm ⊗ In . (In a parallel implementation, such access is local if a cyclic
distribution is used, provided p ≤ min(m, n).)
One advantage of the four-step and six-step frameworks is that the shorter-
length FFTs may fit into the cache of computers that cannot accommodate an
FFT of full length. This may result in much higher speeds on cache-sensitive
computers. The use of genuine FFTs in this approach makes it possible to call
fast system-specific FFTs in an implementation. A disadvantage is that the
multiplication by the twiddle factors takes an additional 6n flops. In a parallel
implementation, the communication is nicely isolated in the permutations.
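As an illustration of what the twiddle phase amounts to, the following C sketch applies the diagonal factor $T_{m,N}$ of eqn (3.66), with N = mn, to a vector stored as C99 double complex values. The sign convention $\omega_N = e^{-2\pi i/N}$ is an assumption here and must match that of the FFTs used; in a real program the factors would of course be taken from a precomputed table, so that the cost is one complex multiplication (6 flops) per vector component.

#include <complex.h>
#include <math.h>

/* x := T_{m,N} x, where the diagonal entry at position j*(N/m)+k
   equals omega_N^(jk), assuming omega_N = exp(-2*pi*i/N). */
void twiddle(int m, int N, double complex *x)
{
    int Nm = N / m;
    for (int j = 0; j < m; j++)
        for (int k = 0; k < Nm; k++)
            x[j * Nm + k] *= cexp(-2.0 * M_PI * I * (double)j * (double)k / (double)N);
}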
Hegland [91] applies the four-step framework twice to generate a factor-
ization of Fmnm with m maximal, treating FFTs of length m recursively by the
same method. He presents efficient algorithms for multiple FFTs and multidi-
mensional FFTs on vector and parallel computers. In his implementation, the
key to efficiency is a large vector length of the inner loops of the computation.
Edelman, McCorquodale, and Toledo [68] present an approximate par-
allel FFT algorithm aimed at reducing communication from three data
permutations (as in the six-step framework) to one permutation, at the
expense of an increase in computation. As computation rates are expected
to grow faster than communication rates, future FFTs are likely to be
communication-bound, making such an approach worthwhile. The authors
argue that speedups in this scenario will be modest, but that using parallel
FFTs will be justified on other grounds, for instance because the data do not
fit in the memory of a single processor, or because the FFT is part of a larger
application with good overall speedup.

3.8.4 Applications
Applications of the FFT are ubiquitous. Here, we mention only a few with
emphasis on parallel applications. Barros and Kauranne [13] and Foster and
Worley [70] parallelize the spectral transform method for solving partial differ-
ential equations on a sphere, aimed at global weather and climate modelling.
The spectral transform for a two-dimensional latitude/longitude grid consists
of a DFT along each latitude (i.e. in the east–west direction) and a discrete
Legendre transform (DLT) along each longitude (i.e. in the north–south dir-
ection). Barros and Kauranne [13] redistribute the data between the DFT and
the DLT, so that these computations can be done sequentially. For instance,
each processor performs several sequential FFTs. The main advantages of
this approach are simplicity, isolation of the communication parts (thus cre-
ating bulk), and reproducibility: the order of the computations is exactly the
same as in a sequential computation and it does not depend on the number
of processors used. (Weather forecasts being that fragile, they must at least
be reproducible in different runs of the same program. No numerical but-
terfly effects please!) Foster and Worley [70] also investigate the alternative
approach of using parallel one-dimensional FFTs and Legendre transforms as
a basis for the spectral transform.
Nagy and O’Leary [143] present a method for restoring blurred images
taken by the Hubble Space Telescope. The method uses a preconditioned
conjugate gradient solver with fast matrix-vector multiplications carried out
by using FFTs. This is possible because the matrices involved are Toeplitz
matrices, that is, they have constant diagonals: aij = αi−j , for all i, j.
The FFT is the computational workhorse in grid methods for quantum
molecular dynamics, where the time-dependent Schrödinger equation is solved
numerically on a multidimensional grid, see a review by Kosloff [123] and a
comparison of different time propagation schemes by Leforestier et al. [127].
In each time step, a potential energy operator is multiplied by a wavefunction,
which is a local (i.e. point-wise) operation in the spatial domain. This means
that every possible distribution including the cyclic one can be used in a par-
allel implementation. The kinetic energy operator is local in the transformed
domain, that is, in momentum space. The transformation between the two
domains is done efficiently using the FFT. Here, the cyclic distribution as
used in our parallel FFT is applicable in both domains. In one dimension, the
FFT thus has only one communication superstep.
Haynes and Côté [90] parallelize a three-dimensional FFT for use in
an electronic structure calculation based on solving the time-independent
Schrödinger equation. The grids involved are relatively small compared with
those used in other FFT applications, with a typical size of only 128×128×128
grid points. As a consequence, reducing the communication is of prime import-
ance. In momentum space, the grid is cyclically distributed in each dimension
to be split. In the spatial domain, the distribution is by blocks. The number
of communication supersteps is log2 p. An advantage of the cyclic distribu-
tion in momentum space is a better load balance: momentum is limited by
an energy cut-off, which means that all array components outside a sphere in
momentum space are zero; in the cyclic distribution, all processors have an
approximately equal part of this sphere.
Zoldi et al. [195] solve the nonlinear Schrödinger equation in a study on
increasing the transmission capacity of optical fibre lines. They apply a par-
allel one-dimensional FFT based on the four-step framework. They exploit
the fact that the Fourier transform is carried out in both directions, avoiding
unnecessary permutations and combining operations from the forward FFT
and the inverse FFT on the same block of data. This results in better cache
use and even causes a speedup of the parallel program with p = 1 compared
with the original sequential program.

3.9 Exercises
1. The recursive FFT, Algorithm 3.1, splits vectors repeatedly into two vec-
tors of half the original length. For this reason, we call it a radix-2 FFT.
We can generalize the splitting method by allowing the vectors to be split into r parts of equal length. This leads to a radix-r algorithm.
(a) How many flops are actually needed for the computation of F4 x, where
x is a vector of length four? Where does the gain in this specific case
come from compared to the 5n log2 n flops of an FFT of arbitrary
length n?
(b) Let n be a power of four. Derive a sequential recursive radix-4 FFT
algorithm. Analyse the computation time and compare the number of
flops with that of a radix-2 algorithm. Is the new algorithm faster?
(c) Let n be a power of four. Formulate a sequential nonrecursive radix-4
algorithm. Invent an appropriate name for the new starting permuta-
tion. Can you modify the algorithm to handle all powers of two, for
example by ending with a radix-2 stage if needed?

(d) Let n be a power of two. Implement your nonrecursive radix-4 algorithm


in a sequential function fft4. Compare its performance with a sequen-
tial radix-2 function based on functions used in bspfft. Explain your
results. Is the difference in performance only due to a difference in flop
count?
(e) Modify fft4 for use in a parallel program, replacing the unordered
ufft in bspfft. The new function, ufft4, should have exactly the
same input and output specification as ufft. Note: take special care
that the ordering is correct before the butterflies start flying.
2. (∗) Let f be a T -periodic smooth function. The Fourier coefficients ck of f
are given by eqn (3.2).
(a) Let dk be the kth Fourier coefficient of the derivative f ′ . Prove that
dk = 2πikck /T . How can you use this relation to differentiate f ?
(b) The Fourier coefficients ck , k ∈ Z, can be obtained in approximation
by eqn (3.3). The quality of the approximation depends on how well
the trapezoidal rule estimates the integral of the function f (t)e−2πikt/T
on the subintervals [tj , tj+1 ]. The average of the two function values in
the endpoints tj and tj+1 is a good approximation of the integral on
[tj , tj+1 ] if the function f (t)e−2πikt/T changes slowly on the subinterval.
On the other hand, if the function changes quickly, the approximation is
poor. Why is the approximation meaningless for |k| > n/2? For which
value of k is the approximation best? In answering these questions, you
may assume that f itself behaves well and is more or less constant on
each subinterval [tj , tj+1 ].
(c) Let n be even. How can you compute (in approximation) the set of
coefficients c−n/2+1 , . . . , cn/2 by using a DFT? Hint: relate the yk ’s of
eqn (3.4) to the approximated ck ’s.
(d) Write a parallel program for differentiation of a function f sampled
in n = 2m points. The program should use a forward FFT to compute
the Fourier coefficients ck , −n/2 + 1 ≤ k ≤ n/2, then compute the
corresponding dk ’s, and finally perform an inverse FFT to obtain the
derivative f ′ in the original sample points. Use the cyclic distribution.
Test the accuracy of your program by comparing your results with
analytical results, using a suitable test function. Take for instance the Gaussian function $f(t) = e^{-\alpha(t-\beta)^2}$, and place it in the middle of your
sampling interval [0, T ] by choosing β = T /2.
3. (∗) The two-dimensional discrete Fourier transform (2D DFT) of an
n0 × n1 matrix X is defined as the n0 × n1 matrix Y given by

\[
Y(k_0,k_1) = \sum_{j_0=0}^{n_0-1} \sum_{j_1=0}^{n_1-1} X(j_0,j_1)\, \omega_{n_0}^{j_0 k_0}\, \omega_{n_1}^{j_1 k_1}, \qquad (3.74)
\]
for $0 \leq k_0 < n_0$ and $0 \leq k_1 < n_1$. We can rewrite this as
\[
Y(k_0,k_1) = \sum_{j_0=0}^{n_0-1} \left( \sum_{j_1=0}^{n_1-1} X(j_0,j_1)\, \omega_{n_1}^{j_1 k_1} \right) \omega_{n_0}^{j_0 k_0}, \qquad (3.75)
\]

showing that the 2D DFT is equivalent to a set of n0 1D DFTs of length n1 ,


each in the row direction of the matrix, followed by a set of n1 1D DFTs of
length n0 , each in the column direction.
(a) Write a function bspfft2d that performs a 2D FFT, assuming that
X is distributed by the M × N cyclic distribution with 1 ≤ M < n0
and 1 ≤ N < n1 , and with M, N powers of two. The result Y must
be in the same distribution as X. Use the function bspfft to perform
parallel 1D FFTs, but modify it to perform several FFTs together in an
efficient manner. In particular, avoid unnecessary synchronizations.
(b) As an alternative, write a function bspfft2d transpose that first
performs sequential 1D FFTs on the rows of X, assuming that X is
distributed by the cyclic row distribution, then transposes the matrix,
performs sequential 1D FFTs on its rows, and finally transposes back
to return to the original distribution.
(c) Let n0 ≥ n1 . For each function, how many processors can you use?
Compare the theoretical cost of the two implemented algorithms. In

particular, which algorithm is better in the important case p ≤ n0 ?
Hint: you can choose M, N freely within the constraint p = M N ; use
this freedom well.
(d) Compare the performance of the two functions experimentally.
(e) Optimize the best of the two functions.
4. (∗) The three-dimensional discrete Fourier transform (3D DFT) of
an n0 × n1 × n2 array X is defined as the n0 × n1 × n2 array Y given by

\[
Y(k_0,k_1,k_2) = \sum_{j_0=0}^{n_0-1} \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} X(j_0,j_1,j_2)\, \omega_{n_0}^{j_0 k_0}\, \omega_{n_1}^{j_1 k_1}\, \omega_{n_2}^{j_2 k_2}, \qquad (3.76)
\]

for 0 ≤ kd < nd , d = 0, 1, 2.
(a) Write a function bspfft3d, similar to bspfft2d from Exercise 3, that
performs a 3D FFT, assuming that X is distributed by the M0 × M1 ×
M2 cyclic distribution, where Md is a power of two with 1 ≤ Md < nd ,
for d = 0, 1, 2. The result Y must be in the same distribution as X.
(b) Explain why each communication superstep of the parallel 3D FFT
algorithm has the same cost.
(c) In the case n0 = n1 = n2 = n, how do you choose the Md ? Hint: for
$p \leq \sqrt{n}$ you need only one communication superstep; for p ≤ n only two.
(d) For arbitrary n0 , n1 , n2 , what is the maximum number of processors


you can use? Write a function that determines, for a given p, the
triple (M0 , M1 , M2 ) with M0 M1 M2 = p that causes the least number
of communication supersteps.
(e) Test your parallel 3D FFT and check its performance. Is the theoret-
ically optimal triple indeed optimal?
5. (∗) The parallel FFT algorithm of this chapter uses two levels of memory:
local memory with short access time and remote memory with longer access
time. Another situation where such a two-level memory hierarchy exists is
in sequential computers with a cache. The cache is a fast memory of limited
size whereas the random access memory (RAM) is a much larger but slower
memory.
A bulk synchronous parallel computation by a number of virtual processors
can be simulated on one real, physical processor by executing the work of one
computation superstep for all virtual processors in turn. If possible, the work
of one virtual processor in a computation superstep is carried out within the
cache. The switch between different virtual processors working on the same
computation superstep causes page faults, that is, movements of a memory
page between the cache and the RAM. A communication superstep can be
viewed as a set of assignments that causes page faults.

(a) Implement a fast sequential FFT algorithm for cache-based computers


by adapting the function bspfft to that particular situation. Replace
the bsp_put statements by buffered assignments. What happens with
the bsp_syncs?
(b) Compare your program with a sequential program that does not exploit
the cache. Use a cache-based computer in your tests. How much do
you gain?
(c) Find the optimal number of virtual processors, pvirtual , for several
values of n. Relate your results to the cache size of the computer.
(d) Develop a parallel program based on a three-level memory hierarchy,
namely cache, local RAM, and remote RAM.
(e) Test the parallel program. Use pvirtual virtual processors for each
physical processor.
6. (∗∗) Let x be a real vector of length n. We could compute the Fourier
transform y = Fn x, which is complex, by using a straightforward complex
FFT. Still, we may hope to do better and accelerate the computation by
exploiting the fact that x is real.

(a) Show that $y_{n-k} = \overline{y_k}$, for k = 1, . . . , n − 1. This implies that the output
of the FFT is completely determined by y0 , . . . , yn/2 . The remaining
n/2 − 1 components of y can be obtained cheaply by complex conjug-
ation, so that they need not be stored. Also show that y0 and yn/2
are real. It is customary to pack these two reals on output into one
complex number, stored at position 0 in the y-array.
(b) We can preprocess the input data by packing the real vector x of length
n as a complex vector x′ of length n/2 defined by x′j = x2j + ix2j+1 ,
for 0 ≤ j < n/2. The packing operation is for free in our FFT data
structure, which stores a complex number as two adjacent reals. It
turns out that if we perform a complex FFT of length n/2 on the
conjugate of x′, yielding $y' = F_{n/2}\,\overline{x'}$, we can retrieve the desired vector y by
\[
y_k = y'_k - \tfrac{1}{2}\,(1 - i\omega_n^k)\,(y'_k - \overline{y'_{n/2-k}}), \qquad (3.77)
\]
for 0 ≤ k ≤ n/2. Prove by substitution that the postprocessing by (3.77) is correct. The variable $y'_{n/2}$ appearing in the right-hand side for k = 0 and k = n/2 is defined by $y'_{n/2} = y'_0$, so that the DFT definition (3.4) also holds for k = n/2.
(c) Formulate a sequential algorithm for a real FFT based on the procedure
above and count its number of flops. Take care to perform the post-
processing efficiently by computing pairs (yk , yn/2−k ) together, using
only one complex multiplication. Store constants such as $\tfrac{1}{2}(1 - i\omega_n^k)$ in
a table.
(d) Design and implement a parallel algorithm. You may assume that p
is so small that only one communication superstep is needed in the
complex FFT. Start the complex FFT with the cyclic distribution of
x′ (which is equivalent to the cyclic distribution of component pairs in
x). Finish the complex FFT with the zig-zag cyclic distribution shown
in Fig. 3.11(b); this distribution was proposed for use in FFT-based
transforms in [112].

(a) Cyclic 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

(b) Zig-zag 0 1 2 3 0 3 2 1 0 1 2 3 0 3 2 1
cyclic 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Fig. 3.11. Distribution of a vector of size 16 over four processors. Each cell repres-
ents a vector component; the number in the cell and the greyshade denote the
processor that owns the cell. The processors are numbered 0, 1, 2, 3. (a) Cyclic
distribution; (b) zig-zag cyclic distribution.
The zig-zag cyclic distribution of a vector x of length n over p processors is defined by the mapping
\[
x_j \longrightarrow \begin{cases} P(j \bmod p) & \text{if } j \bmod 2p < p, \\ P((-j) \bmod p) & \text{otherwise,} \end{cases} \qquad \text{for } 0 \leq j < n. \qquad (3.78)
\]

The local index of xj on the processor that owns it is j div p, just


as in the cyclic case. What is the main advantage of the zig-zag cyclic
distribution? Which operations are local in this distribution? Take care
in designing your algorithm: a subtle point is to decide where exactly
to communicate and move to the zig-zag cyclic distribution.
(e) Modify the algorithm so that it starts with the zig-zag cyclic distribu-
tion of x′ , instead of the cyclic distribution.
(f) For which p is your algorithm valid? What is the BSP cost of your
algorithm?
(g) Design and implement an inverse real FFT algorithm by inverting each
of the main phases of your real FFT and performing the phases in
reverse order.
7. (∗∗) The discrete cosine transform (DCT) can be defined in several
ways. One version, which is often used in image processing and data compres-
sion, is as follows. Let x be a real vector of length n. The DCT of x is the
real vector y of length n given by
\[
y_k = \sum_{j=0}^{n-1} x_j \cos\frac{\pi k (j + 1/2)}{n}, \qquad \text{for } 0 \leq k < n. \qquad (3.79)
\]

(a) In our pursuit of a fast cosine transform (FCT), we try to pack the
vector x in a suitable form into another vector x′ ; then compute the
Fourier transform y′ = Fn x′ ; and hope to be able to massage y′ into
y. By way of miracle, we can indeed succeed if we define

\[
x'_j = x_{2j}, \qquad x'_{n-1-j} = x_{2j+1}, \qquad \text{for } 0 \leq j < n/2. \qquad (3.80)
\]

We can retrieve y by
\[
y_k = \mathrm{Re}(\omega_{4n}^{k}\, y'_k), \qquad y_{n-k} = -\mathrm{Im}(\omega_{4n}^{k}\, y'_k), \qquad \text{for } 0 \leq k \leq n/2. \qquad (3.81)
\]

(For k = 0, we do not compute the undefined value yn−k .) Prove by


substitution that this method, due to Narasimha and Peterson [144],
is correct.
(b) Formulate a sequential FCT algorithm based on the procedure above
and count its number of flops. Since x′ is real, it is best to perform the
Fourier transform by using the real FFT of Exercise 6.

(c) Describe how you would perform a parallel FCT. Which distributions
do you use for input and output? Here, they need not be the same.
Motivate your choices. Give your distributions a meaningful name.
Hint: try to avoid redistribution as much as possible. You may assume
that p is so small that only one communication superstep is needed in
the complex FFT used by the real FFT.
(d) Analyse the cost of your parallel algorithm, implement it, and test the
resulting program. How does the computing time scale with n and p?
(e) Refresh your trigonometry and formula manipulation skills by proving
that the inverse DCT is given by
\[
x_l = \frac{y_0}{n} + \frac{2}{n} \sum_{k=1}^{n-1} y_k \cos\frac{\pi k (l + 1/2)}{n}, \qquad \text{for } 0 \leq l < n. \qquad (3.82)
\]

Hint: Use the formula cos α cos β = [cos(α + β) + cos(α − β)]/2 to remove products of cosines. Also, derive and use a simple expression for $\sum_{k=1}^{n-1} \cos(\pi k r/n)$, for integer r. How would you compute the inverse DCT in parallel?
8. (∗∗) Today, wavelet transforms are a popular alternative to Fourier trans-
forms: wavelets are commonly used in areas such as image processing
and compression; for instance, they form the basis for the JPEG 2000
image-compression and MPEG-4 video-compression standards. An interesting
application by the FBI is in the storage and retrieval of fingerprints. Wavelets
are suitable for dealing with short-lived or transient signals, and for analysing
images with highly localized features. Probably the best-known wavelet is the
wavelet of order four (DAUB4) proposed by Ingrid Daubechies [52]. A nice
introduction to the DAUB4 wavelet transform is given in Numerical Recipes
in C [157, Section 13.10]. It is defined as follows. For k ≥ 4, k even, let Wk
denote the k × k matrix given by
\[
W_k = \begin{pmatrix}
c_0 & c_1  & c_2 & c_3  &        &      &      &      \\
c_3 & -c_2 & c_1 & -c_0 &        &      &      &      \\
    &      & c_0 & c_1  & c_2    & c_3  &      &      \\
    &      & c_3 & -c_2 & c_1    & -c_0 &      &      \\
    &      &     &      & \ddots &      &      &      \\
    &      &     &      & c_0    & c_1  & c_2  & c_3  \\
    &      &     &      & c_3    & -c_2 & c_1  & -c_0 \\
c_2 & c_3  &     &      &        &      & c_0  & c_1  \\
c_1 & -c_0 &     &      &        &      & c_3  & -c_2
\end{pmatrix},
\]
where the coefficients are given by $c_0 = (1+\sqrt{3})/(4\sqrt{2})$, $c_1 = (3+\sqrt{3})/(4\sqrt{2})$, $c_2 = (3-\sqrt{3})/(4\sqrt{2})$, and $c_3 = (1-\sqrt{3})/(4\sqrt{2})$. For this choice of wavelet,
the discrete wavelet transform (DWT) of an input vector x of length n = 2m
can be obtained by: first multiplying x with Wn and then moving the even
components of the result to the front, that is, multiplying the result with Sn ;
repeating this procedure on the first half of the current vector, using Wn/2
and Sn/2 ; and so on. The algorithm terminates after multiplication by W4
and S4 . We denote the result, obtained after m − 1 stages, by W x.
(a) Factorize W in terms of matrices Wk , Sk , and Ik of suitable size,
similar to the sequential factorization of the Fourier matrix Fn . You can
use the notation diag(A0 , . . . , Ar ), which stands for a block-diagonal
matrix with blocks A0 , . . . , Ar on the diagonal.
(b) How many flops are needed to compute W x? How does this scale
compared with the FFT? What does this mean for communication in
a parallel algorithm?
(c) Formulate a sequential algorithm that computes W x in place, without
performing permutations. The output data will become available in
scrambled order. To unscramble the data in one final permutation,
where would we have to move the output value at location 28 =
(0011100)2 for length n = 128? And at j = (bm−1 · · · b0 )2 for arbitrary
length n? Usually, there is no need to unscramble.
(d) Choose a data distribution that enables the development of an efficient
parallel in-place DWT algorithm. This distribution must be used on
input and as long as possible during the algorithm.
(e) Formulate a parallel DWT algorithm. Hint: avoid communicating data
at every stage of your algorithm. Instead, be greedy and compute what
you can from several stages without communicating in between. Then
communicate and finish the stages you started. Furthermore, find a
sensible way of finishing the whole algorithm.
(f) Analyse the BSP cost of the parallel DWT algorithm.
(g) Compare the characteristics of your DWT algorithm to those of the
parallel FFT, Algorithm 3.5. What are the essential similarities and
differences?
(h) Implement and test your algorithm.
(i) Prove that the matrix Wn is orthogonal (i.e. WnT Wn = In ) by showing
that the rows are mutually orthogonal and have unit norm. Thus W is
orthogonal and W −1 = W T . Extend your program so that it can also
compute the inverse DWT.
(j) Take a picture of a beloved person or animal, translate it into a matrix
A, where each element represents a pixel, and perform a 2D DWT by
carrying out 1D DWTs over the rows, followed by 1D DWTs over the
columns. Choose a threshold value τ > 0 and set all matrix elements
ajk with |ajk | ≤ τ to zero. What would the compression factor be if we
would store A as a sparse matrix by keeping only the nonzero values
ajk together with their index pairs (j, k)? What does your beloved one
look like after an inverse 2D DWT? You may vary τ .

9. (∗∗) The discrete convolution of two vectors u and v of length n is


defined as the vector u ∗ v of length n with
\[
(u * v)_k = \sum_{j=0}^{n-1} u_j v_{k-j}, \qquad \text{for } 0 \leq k < n, \qquad (3.83)
\]

where vj for j < 0 is defined by periodic extension with period n, vj = vj+n .


(a) Prove the convolution theorem:

(Fn (u ∗ v))k = (Fn u)k (Fn v)k , for 0 ≤ k < n. (3.84)

How can we use this to perform a fast convolution? An import-


ant application of convolutions is high-precision arithmetic, used for
instance in the record computation of π in zillions of decimals. This
will become clear in the following.
(b) A large nonnegative integer can be stored as a sequence of coefficients,
which represents its expansion in a radix-r digit system,
\[
x = (x_{m-1} \cdots x_1 x_0)_r = \sum_{k=0}^{m-1} x_k r^k, \qquad (3.85)
\]

where 0 ≤ xk < r, for all k. We can view the integer x as a vector


x of length m with integer components. Show that we can use a con-
volution to compute the product xy of two integers x and y. Hint:
pad the vectors of length m with at least m zeros, giving the vector
u = (x0 , . . . , xm−1 , 0, . . . , 0)T of length n ≥ 2m, and a similar vector
v. Note that u ∗ v represents xy in the form (3.85), but its components
may be larger than or equal to r. The components can be reduced to
values between 0 and r − 1 by performing carry-add operations, similar
to those used in ordinary decimal arithmetic.
(c) Use a parallel FFT to multiply two large integers in parallel. Choose
r = 256, so that an expansion coefficient can be stored in a byte, that
is, a char. Choose a suitable distribution for the FFTs and the carry-
adds, not necessarily the same distribution. On input, each coefficient
xk is stored as a complex number, and on output the result must be
rounded to the nearest integer. Check the maximum rounding error
incurred during the convolution, to see whether the result compon-
ents are sufficiently close to the nearest integer, with a solid safety
margin.
(d) Check the correctness of the result by comparing with an ordinary
(sequential) large-integer multiplication program, based on the O(m2 )
algorithm you learnt at primary school. Sequentially, which m is the
break-even point between the two multiplication methods?

(e) Check that

102639592829741105772054196573991675900
716567808038066803341933521790711307779
*
106603488380168454820927220360012878679
207958575989291522270608237193062808643
=
109417386415705274218097073220403576120
037329454492059909138421314763499842889
347847179972578912673324976257528997818
33797076537244027146743531593354333897,

which has 155 decimal digits and is called RSA-155. The reverse com-
putation, finding the two (prime) factors of RSA-155 was posed as a
cryptanalytic challenge by RSA Security; its solution took a total of 35
CPU years using 300 workstations and PCs in 1999 by Cavallar et al.
[39]. (The next challenge, factoring a 174 decimal-digit number, carries
an award of $10 000.)
(f) Develop other parallel functions for fast operations on large integers,
such as addition and subtraction, using the block distribution.
(g) The Newton–Raphson method for finding a zero of a function f , that
is, an x with f (x) = 0, computes successively better approximations

\[
x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}. \qquad (3.86)
\]

Apply this method with $f(x) = 1/x - a$ and $f(x) = 1/x^2 - a$ to compute $1/a$ and $1/\sqrt{a}$, respectively, with high precision for a given fixed real a
using your parallel functions. Choose a suitable representation of a as
a finite sequence of bytes. Pay special attention to termination of the
Newton–Raphson iterations.
(h) At the time of writing, the world record in π computation is held by
Kanada, Ushiro, Kuroda, Kudoh, and nine co-workers who obtained
about 1241 100 000 000 decimal digits of π in December 2002 run-
ning a 64-processor Hitachi SR8000 supercomputer with peak speed
of 2 Tflop/s for 600 h. They improved the previous record by Kanada
and Takahashi who got 206 158 430 000 digits right in 1999 using all
128 processors of a Hitachi SR8000 parallel computer with peak speed
of 1 Tflop/s. That run used the Gauss–Legendre method proposed by
Brent [32] and Salamin [161], which works as follows. Define sequences
$a_0, a_1, a_2, \ldots$ and $b_0, b_1, b_2, \ldots$ by $a_0 = \sqrt{2}$, $b_0 = 1$, $a_{k+1} = (a_k + b_k)/2$ is the arithmetic mean of the pair $(a_k, b_k)$, and $b_{k+1} = \sqrt{a_k b_k}$ is its geometric mean, for $k \geq 0$. Let $c_k = 2^k(a_k^2 - b_k^2)$. Define the sequence $d_0, d_1, d_2, \ldots$ by $d_0 = 1$ and $d_{k+1} = d_k - c_{k+1}$. Then $\lim_{k\to\infty} 2a_k^2/d_k = \pi$, with fast convergence: the number of digits of π
produced doubles at every iteration. Use your own parallel functions to
compute as many decimal digits of π as you can. You need an efficient
conversion from binary to decimal digits to produce human-readable
output.
4
SPARSE MATRIX–VECTOR MULTIPLICATION

This chapter gently leads you into the world of irregular algorithms,
through the example of multiplying a sparse matrix with a vector. The
sparsity pattern of the matrix may be irregular, but fortunately it does
not change during the multiplication, and the multiplication may be
repeated many times with the same matrix. This justifies putting a lot
of effort in finding a good data distribution for a parallel multiplication.
We are able to analyse certain special types of matrices fully, such as
random sparse matrices and Laplacian matrices, and of course we can
also do this for dense matrices. For the first time, we encounter a useful
non-Cartesian matrix distribution, which we call the Mondriaan distri-
bution, and we study an algorithm for finding such a distribution for
a general sparse matrix. The program of this chapter demonstrates the
use of the bulk synchronous message passing primitives from BSPlib,
which were designed to facilitate irregular computations; the discus-
sion of these primitives completes the presentation of the whole BSPlib
standard. After having read this chapter, you are able to design and
implement parallel iterative solvers for linear systems and eigensystems,
and to build higher-level solvers on top of them, such as solvers for non-
linear systems, partial differential equations, and linear programming
problems.

4.1 The problem


Sparse matrices are matrices that are, well, sparsely populated by nonzero
elements. The vast majority of their elements are zero. This is in contrast to
dense matrices, which have mostly nonzero elements. The borderline between
sparse and dense may be hard to draw, but it is usually clear whether a given
matrix is sparse or dense.
For an n × n sparse matrix A, we denote the number of nonzeros by
\[
nz(A) = |\{a_{ij} : 0 \leq i, j < n \wedge a_{ij} \neq 0\}|, \qquad (4.1)
\]
the average number of nonzeros per row or column by
\[
c(A) = \frac{nz(A)}{n}, \qquad (4.2)
\]
and the density by
\[
d(A) = \frac{nz(A)}{n^2}. \qquad (4.3)
\]
In this terminology, a matrix is sparse if nz(A) ≪ n2 , or equivalently c(A) ≪


n, or d(A) ≪ 1. We drop the A and write nz, c, and d in cases where this does
not cause confusion. An example of a small sparse matrix is given by Fig. 4.1.
Sparse matrix algorithms are much more difficult to analyse than their
dense matrix counterparts. Still, we can get some grip on their time and
memory requirements by analysing them in terms of n and c, while making
simplifying assumptions. For instance, we may assume that c remains constant
during a computation, or even that each matrix row has a fixed number of
nonzeros c.
In practical applications, sparse matrices are the rule rather than the
exception. A sparse matrix arises in every situation where each variable from
a large set of variables is connected to only a few others. For example, in a
computation scheme for the heat equation discretized on a two-dimensional
grid, the temperature at a grid point may be related to the temperatures at
the neighbouring grid points to the north, east, south, and west. This can be
expressed by a sparse linear system involving a sparse matrix with c = 5.
The problem studied in this chapter is the multiplication of a sparse square
matrix A and a dense vector v, yielding a dense vector u,

u := Av. (4.4)

The size of A is n × n and the length of the vectors is n. The components of


u are defined by
\[
u_i = \sum_{j=0}^{n-1} a_{ij} v_j, \qquad \text{for } 0 \leq i < n. \qquad (4.5)
\]

Of course, we can exploit the sparsity of A by summing only those terms for
which aij ≠ 0.
Sparse matrix–vector multiplication is almost trivial as a sequential prob-
lem, but it is surprisingly rich as a parallel problem. Different sparsity patterns
of A lead to a wide variety of communication patterns during a parallel com-
putation. The main task then is to keep this communication within bounds.
Sparse matrix–vector multiplication is important in a range of computa-
tions, most notably in the iterative solution of linear systems and eigen-
systems. Iterative solution methods start with an initial guess x0 of the
solution and then successively improve it by finding better approximations xk ,
k = 1, 2, . . ., until convergence within a prescribed error tolerance. Examples
of such methods are the conjugate gradient method for solving symmetric
positive definite sparse linear systems Ax = b and the Lanczos method for
solving symmetric sparse eigensystems Ax = λx; for an introduction, see [79].
The attractive property of sparse matrix–vector multiplication as the core of
these solvers is that the matrix does not change, and in particular that it
remains sparse throughout the computation. This is in contrast to methods
such as sparse LU decomposition that create fill-in, that is, new nonzeros.
Fig. 4.1. Sparse matrix cage6 with n = 93, nz = 785, c = 8.4, and d = 9.1%,
generated in a DNA electrophoresis study by van Heukelum, Barkema, and
Bisseling [186]. Black squares denote nonzero elements; white squares denote
zeros. This transition matrix represents the movement of a DNA polymer in a
gel under the influence of an electric field. Matrix element aij represents the
probability that a polymer in state j moves to a state i. The matrix has the
name cage6 because the model used is the cage model and the polymer mod-
elled contains six monomers. The sparsity pattern of this matrix is symmetric
(i.e. aij ≠ 0 if and only if aji ≠ 0), but the matrix is unsymmetric (since in
general aij ≠ aji ). In this application, the eigensystem Ax = x is solved by the
power method, which computes Ax, A2 x, A3 x, . . ., until convergence. Solution
component xi represents the frequency of state i in the steady-state situation.

Fig. 4.2. (a) Two-dimensional molecular dynamics domain of size 1.0 × 1.0 with ten
particles. Each circle denotes the interaction region of a particle, and is defined
by a cut-off radius rc = 0.1. (b) The corresponding 10 × 10 sparse matrix F . If
the circles of particles i and j overlap in (a), these particles interact and nonzeros
fij and fji appear in (b).

In principle, iterative methods can solve larger systems, but convergence is not
guaranteed.
The study of sparse matrix–vector multiplication also yields more insight
into other areas of scientific computation. In a molecular dynamics simula-
tion, the interaction between particles i and j can be described by a force
fij . For short-range interactions, this force is zero if the particles are far
apart. This implies that the force matrix F is sparse. The computation of
the new positions of particles moving under two-particle forces is similar to the
multiplication of a vector by F . A two-dimensional particle domain and the
corresponding matrix are shown in Fig. 4.2.
Algorithm 4.1 is a sequential sparse matrix–vector multiplication
algorithm. The ‘for all’ statement of the algorithm must be interpreted such
that all index pairs involved are handled in some arbitrary sequential order.
Tests such as ‘aij ≠ 0’ need never be performed in an actual implementation,
since only the nonzeros of A are stored in the data structure used. The formu-
lation ‘aij ≠ 0’ is a simple notational device for expressing sparsity without
having to specify the details of a data structure. This allows us to formulate
sparse matrix algorithms that are data-structure independent. The algorithm
costs 2cn flops.

Algorithm 4.1. Sequential sparse matrix–vector multiplication.

input: A: sparse n × n matrix,


v : dense vector of length n.
output: u : dense vector of length n, u = Av.

for i := 0 to n − 1 do
ui := 0;
for all (i, j) : 0 ≤ i, j < n ∧ aij ≠ 0 do
ui := ui + aij vj ;

4.2 Sparse matrices and their data structures


The main advantage of exploiting sparsity is a reduction in memory usage
(zeros are not stored) and computation time (operations with zeros are
skipped or simplified). There is, however, a price to be paid, because sparse
matrix algorithms are more complicated than their dense equivalents. Devel-
oping and implementing sparse algorithms costs more human time and effort.
Furthermore, sparse matrix computations have a larger integer overhead asso-
ciated with each floating-point operation. This implies that sparse matrix
algorithms are less efficient than dense algorithms on sparse matrices with
density close to one.
In this section, we discuss a few basic concepts from sparse matrix compu-
tations. For an extensive coverage of this field, see the books by Duff, Erisman,
and Reid [63] and Zlatev [194]. To illustrate a fundamental sparse technique,
we first study the addition of two sparse vectors. Assume that we have to add
an input vector y of length n to an input vector x of length n, overwriting
x, that is, we have to perform x := x + y. The vectors are sparse, which
means that xi = 0 for most i and yi = 0 for most i. We denote the number
of nonzeros of x by cx and that of y by cy .
To store x as a sparse vector, we only need 2cx memory cells: for each
nonzero, we store its index i and numerical value xi as a pair (i, xi ). Since
storing x as a dense vector requires n cells, we save memory if cx < n/2. There
is no need to order the pairs. In fact, ordering is sometimes disadvantageous
in sparse matrix computations, for instance when the benefit from ordering
the input is small but the obligation to order the output is costly. In the data
structure, the vector x is stored as an array x of cx nonzeros. For a nonzero
xi stored in position j, 0 ≤ j < cx , it holds that x[j].i = i and x[j].a = xi .
The computation of the new component xi requires a floating-point addi-
tion only if both xi ≠ 0 and yi ≠ 0. The case xi = 0 and yi ≠ 0 does not
require a flop because the addition reduces to an assignment xi := yi . For
yi = 0, nothing needs to be done.

Example 4.1 Vectors x, y have length n = 8; their number of nonzeros is


cx = 3 and cy = 4, respectively. Let z = x + y. The sparse data structure for
x, y, and z is:
x[j].a = 2 5 1
x[j].i = 5 3 7
y[j].a = 1 4 1 4
y[j].i = 6 3 5 2
z[j].a = 3 9 1 1 4
z[j].i = 5 3 7 6 2

Now, I suggest you pause for a moment to think about how you would add
two vectors x and y that are stored in the data structure described above.
When you are done contemplating this, you may realize that the main problem
is to find the matching pairs (i, xi ) and (i, yi ) without incurring excessive costs;
this precludes for instance sorting. The magic trick is to use an auxiliary array
of length n that has been initialized already. We can use this array for instance
to register the location of the components of the vector y in its sparse data
structure, so that for a given i we can directly find the location j = loc[i]
where yi is stored. A value loc[i] = −1 denotes that yi is not stored in the
data structure, implying that yi = 0. After the computation, loc must be left
behind in the same state as before the computation, that is, with every array
element set to −1. For each nonzero yi , the addition method modifies the
component xi if it is already nonzero and otherwise creates a new nonzero in
the data structure of x. Algorithm 4.2 gives the details of the method.
This sparse vector addition algorithm is more complicated than the
straightforward dense algorithm, but it has the advantage that the compu-
tation time is only proportional to the sum of the input lengths. The total
number of operations is O(cx + cy ), since there are cx + 2cy loop iterations,
each with a small constant number of operations. The number of flops equals
the number of nonzeros in the intersection of the sparsity patterns of x and y.
The initialization of array loc costs n operations, and this cost will dominate
that of the algorithm itself if only one vector addition has to be performed.
Fortunately, loc can be reused in subsequent vector additions, because each
modified array element is reset to −1. For example, if we add two n × n
matrices row by row, we can amortize the initialization cost over n vector
additions. The relative cost of initialization then becomes insignificant.
The addition algorithm does not check its output for accidental zeros,
that is, elements that are numerically zero but still present as a nonzero pair
(i, 0) in the data structure. Such accidental zeros are created for instance
when a nonzero yi = −xi is added to a nonzero xi and the resulting zero is
retained in the data structure. Furthermore, accidental zeros can propagate:
if yi is an accidental zero included in the data structure of y, and xi = 0 is
not in the data structure of x, then Algorithm 4.2 will insert the accidental zero into the data structure of x.

Algorithm 4.2. Addition of two sparse vectors.


input: x : sparse vector with cx ≥ 0 nonzeros, x = x0 ,
y : sparse vector with cy ≥ 0 nonzeros,
loc : dense vector of length n, loc[i] = −1, for 0 ≤ i < n.
output: x = x0 + y,
loc[i] = −1, for 0 ≤ i < n.

{ Register location of nonzeros of y}


for j := 0 to cy − 1 do
loc[y[j].i] := j;

{ Add matching nonzeros of x and y }


for j := 0 to cx − 1 do
i := x[j].i;
if loc[i] ≠ −1 then
x[j].a := x[j].a + y[loc[i]].a;
loc[i] := −1;

{ Append remaining nonzeros of y to x }


for j := 0 to cy − 1 do
i := y[j].i;
if loc[i] ≠ −1 then
x[cx ].i := i;
x[cx ].a := y[j].a;
cx := cx + 1;
loc[i] := −1;
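A direct C rendering of Algorithm 4.2 might look as follows; the type and variable names are illustrative (they are not those of the programs accompanying this book), and the array x is assumed to provide room for at least cx + cy nonzeros.

typedef struct { long i; double a; } pair;   /* index-value pair of one nonzero */

/* x := x + y for sparse vectors of length n; nx and ny are the nonzero counts.
   loc is a work array of length n with all entries -1 on entry and on exit.
   The new nonzero count of x is returned. */
long sparse_add(pair *x, long nx, const pair *y, long ny, long *loc)
{
    long j, i;
    for (j = 0; j < ny; j++)              /* register the locations of the nonzeros of y */
        loc[y[j].i] = j;
    for (j = 0; j < nx; j++) {            /* add matching nonzeros of y into x */
        i = x[j].i;
        if (loc[i] != -1) {
            x[j].a += y[loc[i]].a;
            loc[i] = -1;
        }
    }
    for (j = 0; j < ny; j++) {            /* append the remaining nonzeros of y to x */
        i = y[j].i;
        if (loc[i] != -1) {
            x[nx].i = i;
            x[nx].a = y[j].a;
            nx++;
            loc[i] = -1;
        }
    }
    return nx;
}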

Still, testing all operations in a sparse
matrix algorithm for zero results is more expensive than computing with a few
additional nonzeros, so accidental zeros are usually kept. Another reason for
keeping accidental zeros is that removing them would make the output data
structure dependent on the numerical values of the input and not on their
sparsity pattern alone. This may cause problems for certain computations,
for example, if the same program is executed repeatedly for a matrix with
different numerical values but the same sparsity pattern and if knowledge
obtained from the first program run is used to speed up subsequent runs.
(Often, the first run of a sparse matrix program uses a dynamic data structure
but subsequent runs use a simplified static data structure based on the sparsity
patterns encountered in the first run.) In our terminology, we ignore accidental
zeros and we just assume that they do not exist.
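Algorithm 4.2 translates almost line by line into C. The following sketch is ours (the pair type, array names, and function name are illustrative choices) and assumes that x has room for cx + cy pairs:

    typedef struct {
        long i;      /* index of the nonzero */
        double a;    /* numerical value */
    } nzpair;

    /* Adds the sparse vector y (cy nonzeros) into the sparse vector x
       (cx nonzeros). On entry loc[i] = -1 for all i; on exit loc is
       restored to that state. Returns the new number of nonzeros of x. */
    long sparse_add(nzpair *x, long cx, const nzpair *y, long cy, long *loc)
    {
        long j, k;
        for (j = 0; j < cy; j++)           /* register locations of nonzeros of y */
            loc[y[j].i] = j;
        for (k = 0; k < cx; k++){          /* add matching nonzeros of x and y */
            long i = x[k].i;
            if (loc[i] != -1){
                x[k].a += y[loc[i]].a;
                loc[i] = -1;
            }
        }
        for (j = 0; j < cy; j++){          /* append remaining nonzeros of y to x */
            long i = y[j].i;
            if (loc[i] != -1){
                x[cx].i = i;
                x[cx].a = y[j].a;
                cx++;
                loc[i] = -1;
            }
        }
        return cx;
    }

Reusing the function for a row-by-row addition of two matrices only requires allocating and initializing loc once, which is exactly the amortization argument made above.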
Sparse matrices can be stored using many different data structures; the
best choice depends on the particular computation at hand. Some of the most
common data structures are:


The coordinate scheme, or triple scheme. Every nonzero element aij is
represented by a triple (i, j, aij ), where i is the row index, j the column index,
and aij the numerical value. The triples are stored in arbitrary order in an
array. This data structure is easiest to understand and therefore is often used
for input/output, for instance in Matrix Market [26], a WWW-based reposit-
ory of sparse test matrix collections. It is also suitable for input to a parallel
computer, since all information about a nonzero is contained in its triple. The
triples can be sent directly and independently to the responsible processors.
It is difficult, however, to perform row-wise or column-wise operations on this
data structure.
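In C, such a data structure could be declared roughly as follows; the type and field names are ours, chosen only for illustration:

    typedef struct {
        long i, j;    /* row and column index of the nonzero */
        double a;     /* numerical value a_ij */
    } triple;

    /* The matrix is then simply an array of nz(A) triples, in arbitrary
       order, which can be read, written, or sent one triple at a time. */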
Compressed row storage (CRS). Each row i of the matrix is stored as
a sparse vector consisting of pairs (j, aij ) representing nonzeros. In the data
structure, a[k] denotes the numerical value of the nonzero numbered k, and
j[k] its column index. Rows are stored consecutively, in order of increasing
row index. The address of the first nonzero of row i is given by start[i]; the
number of nonzeros of row i equals start[i + 1] − start[i], where by convention
start[n] = nz(A).

Example 4.2

              | 0 3 0 0 1 |
              | 4 1 0 0 0 |
        A =   | 0 5 9 2 0 | ,     n = 5, nz(A) = 13.
              | 6 0 0 5 3 |
              | 0 0 5 8 9 |

The CRS data structure for A is:

a[k] = 3 1 4 1 5 9 2 6 5 3 5 8 9
j[k] = 1 4 0 1 1 2 3 0 3 4 2 3 4
k= 0 1 2 3 4 5 6 7 8 9 10 11 12
start[i] = 0 2 4 7 10 13
i= 0 1 2 3 4 5

The CRS data structure has the advantage that the elements of a row are
stored consecutively, so that row-wise operations are easy. If we compute
u := Av by components of u, then the nonzero elements aij needed to com-
pute ui are conveniently grouped together, so that the value of ui can be
kept in cache on a cache-based computer, thus speeding up the computation.
Algorithm 4.3 shows a sequential sparse matrix–vector multiplication that
uses CRS.
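For comparison with the pseudocode below, the same loop can be written directly in C; this sketch uses the CRS arrays introduced above, except that the column index array is called col to avoid a clash with the loop variable j:

    /* u := A*v for a sparse n x n matrix A stored in CRS:
       a[k] = value of nonzero k, col[k] = its column index,
       start[i] = position of the first nonzero of row i, start[n] = nz(A). */
    void spmv_crs(long n, const double *a, const long *col, const long *start,
                  const double *v, double *u)
    {
        for (long i = 0; i < n; i++){
            double sum = 0.0;
            for (long k = start[i]; k < start[i+1]; k++)
                sum += a[k] * v[col[k]];
            u[i] = sum;
        }
    }

Keeping the running sum in a local variable, as done here, makes it easy for the compiler to keep ui in a register or in cache.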
Algorithm 4.3. Sequential sparse matrix–vector multiplication for the CRS data structure.

input:   A : sparse n × n matrix,
         v : dense vector of length n.
output:  u : dense vector of length n, u = Av.

for i := 0 to n − 1 do
    u[i] := 0;
    for k := start[i] to start[i + 1] − 1 do
        u[i] := u[i] + a[k] · v[j[k]];

Compressed column storage (CCS). Similar to CRS, but with columns


instead of rows. This is the data structure employed by the Harwell–Boeing
collection [64], now called the Rutherford–Boeing collection [65], the first
sparse test matrix collection that found widespread use. The motivation for
building such a collection is that researchers testing different algorithms on
different machines should at least be able to use a common set of test problems,
in particular in the sparse matrix field where rigorous analysis is rare and where
heuristics reign.
Incremental compressed row storage (ICRS) [124]. A variant of CRS,
where the location (i, j) of a nonzero aij is encoded as a one-dimensional index
i · n + j, and the difference with the one-dimensional index of the previous
nonzero is stored (except for the first nonzero, where the one-dimensional
index itself is stored). The nonzeros within a row are ordered by increas-
ing column index, so that the one-dimensional indices form a monotonically
increasing sequence. Thus, their differences are positive integers, which are
called the increments and are stored in an array inc. A dummy nonzero
is added at the end, representing a dummy location (n, 0); this is useful in
addressing the inc array. Note that the increments are less than 2n if the
matrix does not have empty rows. If there are many empty rows, increments
can become large (but at most n²), so we must make sure that each increment
fits in a data word.
Example 4.3 The matrix A is the same as in Example 4.2. The ICRS data
structure for A is given by the arrays a and inc from:
a[k] = 3 1 4 1 5 9 2 6 5 3 5 8 9 0
j[k] = 1 4 0 1 1 2 3 0 3 4 2 3 4 0
i[k] · n + j[k] = 1 4 5 6 11 12 13 15 18 19 22 23 24 25
inc[k] = 1 3 1 1 5 1 1 2 3 1 3 1 1 1
k= 0 1 2 3 4 5 6 7 8 9 10 11 12 13

The ICRS data structure has been used in Parallel Templates [124], a par-
allel version of the Templates package for iterative solution of sparse linear
Algorithm 4.4. Sequential sparse matrix–vector multiplication for the ICRS data structure.

input:   A : sparse n × n matrix,
         v : dense vector of length n.
output:  u : dense vector of length n, u = Av.

j := inc[0];
k := 0;
for i := 0 to n − 1 do
    u[i] := 0;
    while j < n do
        u[i] := u[i] + a[k] · v[j];
        k := k + 1;
        j := j + inc[k];
    j := j − n;

systems [11]. ICRS does not need the start array and its implementation in
C was found to be somewhat faster than that of CRS because the increments
translate well into the pointer arithmetic of the C language. Algorithm 4.4
shows a sequential sparse matrix–vector multiplication that uses ICRS. Note
that the algorithm avoids the indirect addressing of the vector v in the CRS
data structure, replacing access to v[j[k]] by access to v[j].
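A C version of Algorithm 4.4 might read as follows (array names ours); note that the column index j is updated by adding increments instead of being read from memory for every nonzero:

    /* u := A*v for a sparse n x n matrix A stored in ICRS:
       a[k] = value of nonzero k (one dummy value is appended at the end),
       inc[k] = increment of the one-dimensional index i*n+j of nonzero k. */
    void spmv_icrs(long n, const double *a, const long *inc,
                   const double *v, double *u)
    {
        long k = 0;
        long j = inc[0];              /* one-dimensional index of the first nonzero */
        for (long i = 0; i < n; i++){
            double sum = 0.0;
            while (j < n){            /* all nonzeros of row i */
                sum += a[k] * v[j];
                k++;
                j += inc[k];
            }
            u[i] = sum;
            j -= n;                   /* column index within the next row */
        }
    }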
Jagged diagonal storage (JDS). The matrix A is permuted into a matrix
P A by ordering the rows by decreasing number of nonzeros. The first jagged
diagonal is formed by taking the first nonzero element of every row in P A. If
the matrix does not have empty rows, the length of the first jagged diagonal
is n. The second jagged diagonal is formed by taking the second nonzero of
every row. The length may now be less than n. This process is continued, until
all c0 jagged diagonals have been formed, where c0 is the number of nonzeros
of row 0 in P A. As in CRS, for each element the numerical value and column
index are stored. The main advantage of JDS is the large average length of the
jagged diagonals (of order n) that occurs if the number of nonzeros per row
does not vary too much. In that case, the sparse matrix–vector multiplication
can be done using efficient operations on long vectors.
Gustavson’s data structure [89]. This data structure combines CRS
and CCS, except that it stores the numerical values only for the rows. It
provides row-wise and column-wise access to the matrix, which is useful for
sparse LU decomposition.
The two-dimensional doubly linked list. Each nonzero is represen-
ted by a tuple, which includes i, j, aij , and links to a next and a previous
nonzero in the same row and column. The elements within a row or column
need not be ordered. This data structure gives maximum flexibility: row-wise
and column-wise access are easy and elements can be inserted and deleted in
O(1) operations. Therefore it is applicable in dynamic computations where the
matrix changes. The two-dimensional doubly linked list was proposed as the
best data structure for parallel sparse LU decomposition with pivoting [183],
where frequently rows or columns have to move from one set of processors to
another. (A two-dimensional singly linked list for sparse linear system solving
was already presented by Knuth in 1968 in the first edition of [122].) A dis-
advantage of linked list data structures is the amount of storage needed, for
instance seven memory cells per nonzero for the doubly linked case, which is
much more than the two cells per nonzero for CRS. A severe disadvantage is
that following the links causes arbitrary jumps in the computer memory, thus
often incurring cache misses.
Matrix-free storage. In certain applications it may be too costly or
unnecessary to store the matrix explicitly. Instead, each matrix element is
recomputed every time it is needed. In certain situations this may enable the
solution of huge problems that otherwise could not have been solved.

4.3 Parallel algorithm


How to distribute a sparse matrix over the processors of a parallel computer?
Should we first build a data structure and then distribute it, or should we start
from the distribution? The first approach constructs a global data structure
and then distributes its components; a parallelizing compiler would take this
road. Unfortunately, this requires global collaboration between the processors,
even for basic operations such as insertion of a new nonzero. For example, if
the nonzeros of a matrix row are linked in a list with each nonzero pointing to
the next, then the predecessor and the successor in the list and the nonzero
to be inserted may reside on three different processors. This means that three
processors have to communicate to adjust their link information.
The alternative approach distributes the sparse matrix first, that is, dis-
tributes its nonzeros over the processors, assigning a subset of the nonzeros
to each processor. The subsets are disjoint, and together they contain all
nonzeros. (This means that the subsets form a partitioning of the nonzero
set.) Each subset can be viewed as a smaller sparse submatrix containing
exactly those rows and columns that have nonzeros in the subset and exactly
those nonzeros that are part of the subset. Submatrices can then be stored
using a familiar sequential sparse matrix data structure. This approach has
the virtue of simplicity: it keeps basic operations such as insertion and deletion
local. When a new nonzero is inserted, the distribution scheme first determines
which processor is responsible, and then this processor inserts the nonzero in
its local data structure without communicating. Because of this, our motto
is: distribute first, represent later.
We have already encountered Cartesian matrix distributions (see
Section 2.3), which are suitable for dense LU decomposition and many other
matrix computations, sparse as well as dense. Here, however, we would like
to stay as general as possible and remove the restriction of Cartesianity.
This will give us, in principle, more possibilities to find a good distribution.
We do not fear the added complexity of non-Cartesian distributions because
the sparse matrix–vector multiplication is a relatively simple problem, and
because the matrix does not change in this computation (unlike for instance
in LU decomposition). Our general scheme maps nonzeros to processors by
aij −→ P (φ(i, j)), for 0 ≤ i, j < n and aij ≠ 0,    (4.6)
where 0 ≤ φ(i, j) < p. Zero elements of the matrix are not assigned to pro-
cessors. For notational convenience, we define φ also for zeros: φ(i, j) = −1
if aij = 0. Note that we use a one-dimensional processor numbering for our
general matrix distribution. For the moment, we do not specify φ further.
Obviously, it is desirable that the nonzeros of the matrix are evenly spread over
the processors. Figure 4.3 shows a non-Cartesian matrix distribution. (Note
that the distribution in the figure is indeed not Cartesian, since a Cartesian
distribution over two processors must be either a row distribution or a column
distribution, and clearly neither is the case.)
Each nonzero aij is used only once in the matrix–vector multiplication.
Furthermore, usually nz(A) ≫ n holds, so that there are many more nonzeros
than vector components. For these reasons, we perform the computation of
aij · vj on the processor that possesses the nonzero element. Thus, we bring
the vector component to the matrix element, and not the other way round.
We add products aij vj belonging to the same row i; the resulting sum is the
local contribution to ui , which is sent to the owner of ui . This means that
we do not have to communicate elements of A, but only components of v and
contributions to components of u.
How to distribute the input and output vectors of a sparse matrix–vector
multiplication? In most iterative linear system solvers and eigensystem solv-
ers, the same vector is repeatedly multiplied by a matrix A, with a few vector
operations interspersed. These vector operations are mainly DAXPYs (addi-
tions of a scalar times a vector to a vector) and inner product computations.
In such a situation, it is most natural to distribute all vectors in the same
way, and in particular the input and output vectors of the matrix–vector
multiplication, thus requiring distr(u) = distr(v).
Another common situation is that the multiplication by A is followed by
a multiplication by AT . This happens for instance when a vector has to be
multiplied by a matrix B = AT A, where B itself is not explicitly stored
but only its factor A. The output vector of the multiplication by A is thus
the input vector of the multiplication by AT , so that we do not need to
revert immediately to the same distribution. In this situation, we can use two
different distributions, so that distr(u) ≠ distr(v) is allowed.

Fig. 4.3. (a) Distribution of a 5×5 sparse matrix A and vectors u and v of length five
over two processors. The matrix nonzeros and vector components of processor
P (0) are shown as grey cells; those of P (1) as black cells. The numbers in the
cells denote the numerical values aij . The matrix is the same as in Example 4.2.
Vector component ui is shown to the left of the matrix row that produces it;
vector component vj is shown above the matrix column that needs it. (b) The
local matrix part of the processors. Processor P (0) has six nonzeros; its row
index set is I0 = {0, 1, 2, 3} and its column index set J0 = {0, 1, 2}. Processor
P (1) has seven nonzeros; I1 = {0, 2, 3, 4} and J1 = {2, 3, 4}.

For a vector u of length n, we map components to processors by

ui −→ P (φu (i)), for 0 ≤ i < n, (4.7)

where 0 ≤ φu (i) < p. To stay as general as possible, we assume that we have
two mappings, φu and φv , describing the vector distributions; these can be
different, as happens in Fig. 4.3. Often, it is desirable that vector components
are evenly spread over the processors, to balance the work load of vector
operations in other parts of an application.
It becomes straightforward to derive a parallel algorithm, once we have
chosen the distribution of the matrix and the vectors and once we have decided
to compute the products aij vj on the processor that contains aij . Let us
focus first on the main computation, which is a local sparse matrix–vector
multiplication. Processor P (s) multiplies each local nonzero element aij by vj
and adds the result into a local partial sum,

uis = Σ_{0≤j<n, φ(i,j)=s} aij vj ,    (4.8)

for all i with 0 ≤ i < n. As in the sequential case, only terms for which aij ≠ 0
are summed. Furthermore, only the local partial sums uis for which the set
{j : 0 ≤ j < n ∧ φ(i, j) = s} is nonempty are computed. (The other partial
Algorithm 4.5. Parallel sparse matrix–vector multiplication for P (s).

input:   A : sparse n × n matrix, distr(A) = φ,
         v : dense vector of length n, distr(v) = φv .
output:  u : dense vector of length n, u = Av, distr(u) = φu .

Is = {i : 0 ≤ i < n ∧ (∃j : 0 ≤ j < n ∧ φ(i, j) = s)}
Js = {j : 0 ≤ j < n ∧ (∃i : 0 ≤ i < n ∧ φ(i, j) = s)}

(0)  { Fanout }
     for all j ∈ Js do
         get vj from P (φv (j));

(1)  { Local sparse matrix–vector multiplication }
     for all i ∈ Is do
         uis := 0;
         for all j : 0 ≤ j < n ∧ φ(i, j) = s do
             uis := uis + aij vj ;

(2)  { Fanin }
     for all i ∈ Is do
         put uis in P (φu (i));

(3)  { Summation of nonzero partial sums }
     for all i : 0 ≤ i < n ∧ φu (i) = s do
         ui := 0;
         for all t : 0 ≤ t < p ∧ uit ≠ 0 do
             ui := ui + uit ;

sums are zero.) If there are less than p nonzeros in row i, then certainly one
or more processors will have an empty row part. For c ≪ p, this will happen
in many rows. To exploit this, we introduce the index set Is of the rows that
are locally nonempty in processor P (s). We compute uis if and only if i ∈ Is .
An example of row index sets is depicted in Fig. 4.3(b). Superstep (1) of
Algorithm 4.5 is the resulting local matrix–vector multiplication.
A suitable sparse data structure must be chosen to implement super-
step (1). Since we formulate our algorithm by rows, row-based sparse
data structures such as CRS and ICRS are a good choice, see Section 4.2.
The data structure should, however, only include nonempty local rows, to
avoid unacceptable overhead for very sparse matrices. To achieve this, we can
number the nonempty local rows from 0 to |Is | − 1. The corresponding indices
Fig. 4.4. Communication during sparse matrix–vector multiplication. The matrix
is the same as in Fig. 4.3. Vertical arrows denote communication of components
vj : v0 must be sent from its owner P (1) to P (0), which owns the nonzeros
a10 = 4 and a30 = 6; v2 must be sent from P (0) to P (1); v1 , v3 , v4 need not be
sent. Horizontal arrows denote communication of partial sums uis : P (1) sends
its contribution u01 = 3 to P (0); P (0) sends u20 = 14 to P (1); and P (1) sends
u31 = 29 to P (0); u1 and u4 are computed locally, without contribution from
the other processor. The total communication volume is V = 5 data words.

i are the local indices. The original global indices from the set Is are stored
in increasing order in an array rowindex of length |Is |. For 0 ≤ i < |Is |, the
global row index is i = rowindex [i]. If for instance CRS is used, the address
of the first local nonzero of row i is start[i] and the number of local nonzeros
of row i is start[i + 1] − start[i].
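With such a local data structure, superstep (1) becomes an ordinary CRS multiplication over the nonempty local rows. The C sketch below uses our own names and assumes that the local column indices have likewise been renumbered as 0, . . . , |Js| − 1 (using the array colindex introduced shortly), so that the needed vector components can be kept in a contiguous array v_loc:

    /* Superstep (1): local sparse matrix-vector multiplication on P(s).
       nrows = |I_s|; a, col, start form the CRS data structure of the
       local submatrix in local row and column numbering; v_loc[j] holds
       the vector component for local column j, obtained in the fanout;
       u_sum[i] becomes the partial sum u_{is} of global row rowindex[i]. */
    void local_spmv(long nrows, const double *a, const long *col,
                    const long *start, const double *v_loc, double *u_sum)
    {
        for (long i = 0; i < nrows; i++){
            double sum = 0.0;
            for (long k = start[i]; k < start[i+1]; k++)
                sum += a[k] * v_loc[col[k]];
            u_sum[i] = sum;
        }
    }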
The vector component vj needed for the computation of aij vj must be
obtained before the start of superstep (1). This is done in communication
superstep (0), see the vertical arrows in Fig. 4.4. The processor that has to
receive vj knows from its local sparsity pattern that it needs this component.
On the other hand, the processor that has to send the value is not aware
of the needs of the receiver. This implies that the receiver should be the
initiator of the communication, that is, we should use a ‘get’ primitive. Here,
we encounter an important difference between dense and sparse algorithms.
In dense algorithms, the communication patterns are predictable and thus
known to every processor, so that we can formulate communication supersteps
exclusively in terms of ‘put’ primitives. In sparse algorithms, this is often not
the case, and we have to use ‘get’ primitives as well.
Component vj has to be obtained only once by every processor that needs
it, even if it is used repeatedly for different local nonzeros aij in the same
matrix column j. If column j contains at least one local nonzero, then vj
must be obtained; otherwise, vj is not needed. Therefore, it is convenient to
define the index set Js of the locally nonempty columns, similar to the row
index set Is . We get vj if and only if j ∈ Js . This gives superstep (0) of
Algorithm 4.5. We call superstep (0) the fanout, because vector components
fan out from their initial location. The set Js can be represented by an array
colindex of length |Js |, similar to rowindex. An example of column index sets
is also depicted in Fig. 4.3(b). We consider the arrays rowindex and colindex
to be part of the data structure for the local sparse matrix.
The partial sum uis must be contributed to ui if the local row i is
nonempty, that is, if i ∈ Is . In that case, we call uis nonzero, even if acci-
dental cancellation of terms aij vj has occurred. Each nonzero partial sum uis
should be sent to the processor that possesses ui . Note that in this case the
sender has the knowledge about the existence of a nonzero partial sum, so
that we have to use a ‘put’ primitive. The resulting communication superstep
is superstep (2), which we call the fanin, see the horizontal arrows in Fig. 4.4.
Finally, the processor responsible for ui computes its value by adding the
previously received nonzero contributions uit , 0 ≤ t < p with t ≠ s, and the
local contribution uis . This is superstep (3).
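To make the two communication supersteps concrete, the fragment below sketches them in BSPlib-style C. All further names (owner_v, locidx_v, owner_u, locidx_u, v_need, u_sum, u_recv) and the slot-per-sender receive buffer are our own choices for this illustration, not the data structures of an existing program; v_loc and u_recv are assumed to have been registered with bsp_push_reg on every processor.

    #include <bsp.h>

    /* Supersteps (0) and (2) for P(s). For local column j, owner_v[j] and
       locidx_v[j] give the owner of the needed component of v and its local
       index there; owner_u[i] and locidx_u[i] do the same for the component
       u_i of local row i. u_recv has p slots per locally owned component. */
    void fanout_fanin(long ncols_loc, long nrows_loc,
                      const int *owner_v, const long *locidx_v,
                      const int *owner_u, const long *locidx_u,
                      double *v_loc, double *v_need,
                      double *u_sum, double *u_recv)
    {
        int s = bsp_pid();
        int p = bsp_nprocs();

        /* Superstep (0): fanout, initiated by the receivers with gets */
        for (long j = 0; j < ncols_loc; j++)
            bsp_get(owner_v[j], v_loc, locidx_v[j] * sizeof(double),
                    &v_need[j], sizeof(double));
        bsp_sync();

        /* Superstep (1), the local multiplication filling u_sum, goes here,
           for instance by calling the local_spmv sketch given earlier. */

        /* Superstep (2): fanin, initiated by the senders with puts; the
           contribution of P(s) to u_i lands in slot s of the owner's buffer. */
        for (long i = 0; i < nrows_loc; i++)
            bsp_put(owner_u[i], &u_sum[i], u_recv,
                    (locidx_u[i] * p + s) * sizeof(double), sizeof(double));
        bsp_sync();
    }

A production code would pack the contributions more compactly, but the slot-per-sender buffer keeps the sketch simple and matches superstep (3), where at most p partial sums are added per component.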
‘Wat kost het? ’ is an often-used Dutch phrase meaning ‘How much does
it cost?’. Unfortunately, the answer here is, ‘It depends’, because the cost of
Algorithm 4.5 depends on the matrix A and the chosen distributions φ, φv , φu .
Assume that the matrix nonzeros are evenly spread over the processors, each
processor having cn/p nonzeros. Assume that the vector components are also
evenly spread over the processors, each processor having n/p components.
Under these two load balancing assumptions, we can obtain an upper bound
on the cost. This bound may be far too pessimistic, since distributions may
exist that reduce the communication cost by a large factor.
The cost of superstep (0) is as follows. In the worst case, P (s) must
receive all n components vj except for the n/p locally available components.
Therefore, in the worst case hr = n − n/p; also, hs = n − n/p, because the
n/p local vector components must be sent to the other p − 1 processors. The
cost is T(0) = (1 − 1/p)ng + l. The cost of superstep (1) is T(1) = 2cn/p + l,
since two flops are needed for each local nonzero. The cost of superstep (2) is
T(2) = (1 − 1/p)ng + l, similar to T(0) . The cost of superstep (3) is T(3) = n + l,
because each of the n/p local vector components is computed by adding at
most p partial sums. The total cost of the algorithm is thus bounded by

TMV ≤ 2cn/p + n + 2(1 − 1/p) ng + 4l.    (4.9)

Examining the upper bound (4.9), we see that the computation cost dominates
if 2cn/p > 2ng, that is, if c > pg. In that (rare) case, a distribution is already
efficient if it only satisfies the two load balancing assumptions. Note that
it is the number of nonzeros per row c and not the density d that directly
determines the efficiency. Here, and in many other cases, we see that the
parameter c is the most useful one to characterize the sparsity of a matrix.
The synchronization cost of 4l is usually insignificant and it does not grow
with the problem size.
To achieve efficiency for smaller c, we can use a Cartesian distribution and
exploit its two-dimensional nature, see Section 4.4, or we can refine the general
distribution scheme using an automatic procedure to detect the underlying
structure of the matrix, see Section 4.5. We can also exploit known properties
of specific classes of sparse matrices, such as random sparse matrices, see
Section 4.7, and Laplacian matrices, see Section 4.8.
In our model, the cost of a computation is the BSP cost. A closely related
metric is the communication volume V , which is the total number of data
words sent. The volume depends on φ, φu , and φv . For a given matrix distri-
bution φ, a lower bound Vφ on V can be obtained by counting for each vector
component ui the number of processors pi that has a nonzero aij in matrix
row i and similarly for each vector component vj the number of processors qj
that has a nonzero aij in matrix column j. The lower bound equals

Vφ = Σ_{0≤i<n, pi≥1} (pi − 1) + Σ_{0≤j<n, qj≥1} (qj − 1),    (4.10)

because every processor that has a nonzero in a matrix row i must send a
value uis in superstep (2), except perhaps one processor (the owner of ui ),
and similarly every processor that has a nonzero in a matrix column j must
receive vj in superstep (0), except perhaps one processor (the owner of vj ).
An upper bound is Vφ + 2n, because in the worst case all n components ui
are owned by processors that do not have a nonzero in row i, and similar for
the components vj . Therefore,

Vφ ≤ V ≤ Vφ + 2n. (4.11)

We can achieve V = Vφ by choosing the vector distribution after the matrix
distribution, taking care that ui is assigned to one of the processors that owns
a nonzero aij in row i, and similar for vi . We can always do this if ui and vi
can be assigned independently; if, however, they must be assigned to the same
processor, and aii = 0, then achieving the lower bound may be impossible. If
distr(u) = distr(v) must be satisfied, we can achieve

V = Vφ + |{i : 0 ≤ i < n ∧ aii = 0}|, (4.12)

and hence certainly V ≤ Vφ +n. In the example of Fig. 4.4, the communication
volume can be reduced from V = 5 to Vφ = 4 if we would assign v0 to P (0)
instead of P (1).
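For a given matrix distribution, the lower bound (4.10) is easily computed by counting, for every matrix row and column, the number of different processors owning one of its nonzeros. A C sketch using the triple representation (function and array names ours):

    #include <stdlib.h>

    /* Returns V_phi of eqn (4.10) for an n x n matrix with nz nonzeros,
       given as triples with row index i[k], column index j[k], and owning
       processor phi[k], 0 <= phi[k] < p. Uses marker arrays of size n*p,
       which is acceptable for a sketch but wasteful for large n and p. */
    long volume_lower_bound(long n, long nz, long p,
                            const long *i, const long *j, const long *phi)
    {
        char *rowseen = calloc((size_t)(n * p), 1);
        char *colseen = calloc((size_t)(n * p), 1);
        long *pi = calloc((size_t)n, sizeof(long)); /* pi[r]: processors in row r */
        long *qj = calloc((size_t)n, sizeof(long)); /* qj[c]: processors in col c */
        long V = 0;
        for (long k = 0; k < nz; k++){
            if (!rowseen[i[k] * p + phi[k]]){ rowseen[i[k] * p + phi[k]] = 1; pi[i[k]]++; }
            if (!colseen[j[k] * p + phi[k]]){ colseen[j[k] * p + phi[k]] = 1; qj[j[k]]++; }
        }
        for (long r = 0; r < n; r++){
            if (pi[r] >= 1) V += pi[r] - 1;
            if (qj[r] >= 1) V += qj[r] - 1;
        }
        free(rowseen); free(colseen); free(pi); free(qj);
        return V;
    }

For the distribution of Fig. 4.4 this count gives Vφ = 4, one less than the volume V = 5 of the vector distribution shown there.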

4.4 Cartesian distribution


For sparse matrices, a Cartesian distribution is defined in the same way as
for dense matrices, by mapping an element aij (whether it is zero or not)
to a processor P (φ0 (i), φ1 (j)) with 0 ≤ φ0 (i) < M , 0 ≤ φ1 (j) < N , and
p = M N . We can fit this assignment in our general scheme by identifying
one-dimensional and two-dimensional processor numbers. An example is the
natural column-wise identification

P (s, t) ≡ P (s + tM ), for 0 ≤ s < M and 0 ≤ t < N, (4.13)

which can also be written as

P (s) ≡ P (s mod M, s div M ), for 0 ≤ s < p. (4.14)

Thus we map nonzeros aij to processors P (φ(i, j)) with

φ(i, j) = φ0 (i) + φ1 (j)M, for 0 ≤ i, j < n and aij ≠ 0.    (4.15)

Again we define φ(i, j) = −1 if aij = 0. In this section, we examine only
Cartesian distributions, and use both one-dimensional and two-dimensional
processor numberings, choosing the numbering that is most convenient in the
given situation.
Cartesian distributions have the same advantage for sparse matrices as
for dense matrices: row-wise operations require communication only within
processor rows and column-wise operations only within processor columns,
and this restricts the amount of communication. For sparse matrix–vector
multiplication this means that a vector component vj has to be sent to at
most M processors, and a vector component ui is computed using contribu-
tions received from at most N processors. Another advantage is simplicity:
Cartesian distributions partition the matrix orthogonally into rectangular sub-
matrices (with rows and columns that are not necessarily consecutive in the
original matrix). In general, non-Cartesian distributions create arbitrarily
shaped matrix parts, see Fig. 4.3. We can view the local part of a pro-
cessor P (s) as a submatrix {aij : i ∈ Is ∧ j ∈ Js }, but in the non-Cartesian
case the local submatrices may overlap. For instance, in Fig. 4.3(b), we note
that column 2 has three overlapping elements. In the Cartesian case, how-
ever, the local submatrix of a processor P (s) equals its Cartesian submatrix
{aij : φ0 (i) = s mod M ∧ φ1 (j) = s div M } without the empty rows and
columns. Since the Cartesian submatrices are disjoint, the local submatrices
are also disjoint. Figure 4.5 shows a Cartesian distribution of the matrix
cage6.
To reduce communication, the matrix distribution and the vector distribu-
tion should match. The vector component vj is needed only by processors that
possess an aij ≠ 0 with 0 ≤ i < n, and these processors are contained in pro-
cessor column P (∗, φ1 (j)). Assigning component vj to one of the processors
in that processor column reduces the upper bound on the communication,
Fig. 4.5. Sparse matrix cage6 with n = 93 and nz = 785 distributed in a Cartesian
manner over four processors with M = N = 2; the matrix is the same as the
one shown in Fig. 4.1. Black squares denote nonzero elements; white squares
denote zeros. Lines denote processor boundaries. The processor row of a matrix
element aij is denoted by s = φ0 (i), and the processor column by t = φ1 (j).
The distribution has been determined visually, by trying to create blocks of rows
and columns that fit the sparsity pattern of the matrix. Note that the matrix
diagonal is assigned in blocks to the four processors, in the order P (0) ≡ P (0, 0),
P (1) ≡ P (1, 0), P (2) ≡ P (0, 1), and P (3) ≡ P (1, 1). The number of nonzeros
of the processors is 216, 236, 76, 257, respectively. The number of diagonal
elements is 32, 28, 12, 21, respectively; these are all nonzero. Assume that the
vectors u and v are distributed in the same way as the matrix diagonal. In that
case, 64 (out of 93) components of v must be communicated in superstep (0) of
the sparse matrix–vector multiplication, and 72 contributions to u in superstep
(2). For example, components v0 , . . . , v15 are only needed locally. The total
communication volume is V = 136 and the BSP cost is 24g + 2 · 257 + 28g + 28 ·
2 + 4l = 570 + 52g + 4l. Try to verify these numbers by clever counting!
because then vj has to be sent to at most M − 1 processors, instead of M .


This decrease in upper bound may hardly seem worthwhile, especially for large
M , but the lower bound also decreases, from one to zero, and this is crucial.
If vj were assigned to a different processor column, it would always have to be
communicated (assuming that matrix column j is nonempty). If component
vj is assigned to processor column P (∗, φ1 (j)), then there is a chance that it
is needed only by its own processor, so that no communication is required.
A judicious choice of distribution may enhance this effect. This is the main
reason why for Cartesian matrix distributions we impose the constraint that vj
resides in P (∗, φ1 (j)). If we are free to choose the distribution of v, we assign
vj to one of the owners of nonzeros in matrix column j, thereby satisfying the
constraint.
We impose a similar constraint on the output vector u. To compute ui ,
we need contributions from the processors that compute the products aij vj ,
for 0 ≤ j < n and aij ≠ 0. These processors are all contained in processor
row P (φ0 (i), ∗). Therefore, we assign ui to a processor in that processor row.
The number of contributions sent for ui is thus at most N − 1. If we are free
to choose the distribution of u, we assign ui to one of the owners of nonzeros
in matrix row i.
If the requirement distr(u) = distr(v) must be satisfied, our constraints
on u and v imply that ui and vi must be assigned to P (φ0 (i), φ1 (i)), which is
the processor that owns the diagonal element aii in the Cartesian distribution
of A. In that case,

φu (i) = φv (i) = φ0 (i) + φ1 (i)M, for 0 ≤ i < n. (4.16)

As a result, for a fixed M and N , the choice of a Cartesian matrix distribution determines the vector distribution. The reverse is also true: an arbitrary
matrix element aij is assigned to the processor row that possesses the vector
component ui and to the processor column that possesses uj .
The following trivial but powerful theorem states that the amount of
communication can be restricted based on a suitable distribution of the
vectors.
Theorem 4.4 Let A be a sparse n × n matrix and u, v vectors of length n.
Assume that: (i) the distribution of A is Cartesian, distr(A) = (φ0 , φ1 ); (ii)
the distribution of u is such that ui resides in P (φ0 (i), ∗), for all i; (iii) the
distribution of v is such that vj resides in P (∗, φ1 (j)), for all j. Then: if
ui and vj are assigned to the same processor, the matrix element aij is also
assigned to that processor and does not cause communication.

Proof Component ui is assigned to a processor P (φ0 (i), t) and component
vj to a processor P (s, φ1 (j)). Since this is the same processor, it follows that
(s, t) = (φ0 (i), φ1 (j)), so that this processor also owns matrix element aij . 
Example 4.5 Let A be the n × n tridiagonal matrix defined by

              | −2   1                        |
              |  1  −2   1                    |
              |      1  −2   1                |
        A =   |          .   .   .            | .
              |              1  −2   1        |
              |                  1  −2   1    |
              |                      1  −2    |

This matrix represents a Laplacian operator on a one-dimensional grid of
n points. (Section 4.8 treats Laplacian operators on multidimensional grids.)
The nonzeros are the elements on the main diagonal and on the diagonals
immediately above and below it, so that aij ≠ 0 if and only if i − j = 0, ±1.
Assume that we have to find a suitable Cartesian matrix distribution (φ0 , φ1 )
and a single distribution for the input and output vectors. Theorem 4.4
says that here it is best to assign ui and uj to the same processor if
i = j ± 1. Therefore, a suitable vector distribution over p processors is the
block distribution,


n
ui −→ P i div , for 0 ≤ i < n. (4.17)
p

Communication takes place only on the boundary of a block. Each processor
has to send and receive at most two components vj , and to send and receive at
most two contributions to a component ui . The vector distribution completely
determines the matrix distribution, except for the choice of M and N .
For n = 12 and M = N = 2, the matrix distribution corresponding to the
block distribution of the vectors is

                 | 0 0                     |
                 | 0 0 0                   |
                 |   0 0 0                 |
                 |     1 1 1               |
                 |       1 1 1             |
    distr(A) =   |         1 1 3           | .
                 |           0 2 2         |
                 |             2 2 2       |
                 |               2 2 2     |
                 |                 3 3 3   |
                 |                   3 3 3 |
                 |                     3 3 |

Position (i, j) of distr(A) gives the one-dimensional identity of the processor
that possesses matrix element aij . The matrix distribution is obtained by
first distributing the matrix diagonal in the same way as the vectors, and
then translating the corresponding one-dimensional processor numbers into
two-dimensional numbers by P (0) ≡ P (0, 0), P (1) ≡ P (1, 0), P (2) ≡ P (0, 1),
and P (3) ≡ P (1, 1). After that, the off-diagonal nonzeros are assigned to pro-
cessors, for instance a56 is in the same processor row as a55 , which is owned by
P (1) = P (1, 0), and it is in the same processor column as a66 , which is owned
by P (2) = P (0, 1). Thus a56 is owned by P (1, 1) = P (3). Note that this
distribution differs only slightly from the row distribution defined by M = 4,
N = 1; the only elements distributed differently are a56 and a65 , which in
the row distribution are assigned to P (1) and P (2),
respectively.
The cost of Algorithm 4.5 for a Cartesian distribution depends on the
matrix A and the chosen distribution. Because of our additional assumptions,
we can improve the upper bound (4.9). As before, we assume a good spread
of the matrix elements and vector components over the processors, but now
we also assume a good spread of the matrix rows over the processor rows
and the matrix columns over the processor columns. In superstep (0), P (s, t)
must receive at most all components vj with φ1 (j) = t, except for the n/p
locally available components. Therefore, in the worst case hr = n/N − n/p =
(M − 1)n/p = hs . The cost is at most T(0) = (M − 1)ng/p + l. As before,
T(1) = 2cn/p + l. The cost of superstep (2) is at most T(2) = (N − 1)ng/p + l,
similar to T(0) . The cost of superstep (3) is at most T(3) = N n/p+l = n/M +l,
because each of the n/p local vector components is computed by adding at
most N partial sums. The total cost of the algorithm is thus bounded by

TMV, M×N ≤ 2cn/p + n/M + ((M + N − 2)/p) ng + 4l.    (4.18)
The communication term of the upper bound (4.18) is minimal for M = N = √p. For that choice, the bound reduces to

TMV, √p×√p ≤ 2cn/p + n/√p + 2(1/√p − 1/p) ng + 4l.    (4.19)

Examining the upper bound (4.19), we see that the computation cost dominates if 2cn/p > 2ng/√p, that is, if c > √p g. This is an improvement of the critical c-value by a factor √p compared with the value for the general upper bound (4.9).
Dense matrices can be considered as the extreme limit of sparse matrices.
Analysing the dense case is easier and it can give us insight into the sparse case
as well. Let us therefore examine an n × n dense matrix A. Assume we have
to use the same distribution for the input and output vectors, which therefore
must be the distribution of the matrix diagonal. Assume for simplicity that n
is a multiple of p and p is a square.
In our study of dense LU decomposition, see Chapter 2, we have extolled
the virtues of the square cyclic distribution for parallel linear algebra. One may
ask whether this is also a good distribution for dense matrix–vector multiplication by Algorithm 4.5. Unfortunately, the answer is negative. The reason is that element aii from the matrix diagonal is assigned to processor P (i mod √p, i mod √p), so that the matrix diagonal is assigned to the diagonal processors, that is, the processors P (s, s), 0 ≤ s < √p. This implies that only √p out of p processors have part of the matrix diagonal and hence of the vectors, so that the load balancing assumption for vector components is not satisfied. Diagonal processors have to send out √p − 1 copies of n/√p vector components, so that hs = n − n/√p in superstep (0) and h is √p times larger than the h of a well-balanced distribution. The total cost for a dense matrix with the square cyclic distribution becomes

TMV, dense, √p×√p cyclic = 2n²/p + n + 2(1 − 1/√p) ng + 4l.    (4.20)

The communication cost for this unbalanced distribution is a factor √p higher than the upper bound (4.19) with c = n for balanced distributions. The total communication volume is Vφ = 2(√p − 1)n. For dense matrices with the
communication volume is Vφ = 2( p − 1)n. For dense matrices with the
square cyclic distribution, the communication imbalance can be reduced by
changing the algorithm and using two-phase broadcasting for the fanout and
a similar technique, two-phase combining, for the fanin. This, however, does
not solve the problem of vector imbalance in other parts of the application,
and of course it would be better to use a good distribution in the first place,
instead of redistributing the data during the fanout or fanin.
The communication balance can be improved by choosing a distribution
that spreads the vectors and hence the matrix diagonal evenly, for example
choosing the one-dimensional distribution φu (i) = φv (i) = i mod p and using
the 1D–2D identification (4.14). We still have the freedom to choose M and N ,
where M N = p. For the choice M = p and N = 1, this gives φ0 (i) = i mod p
and φ1 (j) = 0, which is the same as the cyclic row distribution. It is easy to
see that now the cost is

TMV, dense, p×1 cyclic = 2n²/p + (1 − 1/p) ng + 2l.    (4.21)

This distribution disposes of the fanin and the summation of partial sums,
since each matrix row is completely contained in one processor. Therefore,
the last two supersteps are empty and can be deleted. Still, this is a bad
distribution, since the gain by fewer synchronizations is lost by the much
more expensive fanout: each processor has to send n/p vector components to
all other processors. The communication volume is large: Vφ = (p − 1)n.
For the choice M = N = √p, we obtain φ0 (i) = (i mod p) mod √p = i mod √p and φ1 (j) = (j mod p) div √p. The cost of the fanout and fanin
Fig. 4.6. Dense 8×8 matrix distributed over four processors using a square Cartesian
distribution based on a cyclic distribution of the matrix diagonal. The vectors
are distributed in the same way as the matrix diagonal. The processors are shown
by greyshades; the one-dimensional processor numbering is shown as numbers in
the cells of the matrix diagonal and the vectors. The two-dimensional numbering
is represented by the numbering of the processor rows and columns shown along
the borders of the matrix. The 1D–2D correspondence is P (0) ≡ P (0, 0), P (1) ≡
P (1, 0), P (2) ≡ P (0, 1), and P (3) ≡ P (1, 1).


for this distribution are given by hs = hr = (√p − 1)n/p, so that

TMV, dense = 2n²/p + n/√p + 2(1/√p − 1/p) ng + 4l,    (4.22)

which equals the upper bound (4.19) for c = n. We see that this distribution is
much better than the square cyclic distribution and the cyclic row distribution.
The communication volume is Vφ = 2(√p − 1)n. Figure 4.6 illustrates this data
distribution.
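The mapping functions of this distribution can be stated directly in code. The C sketch below assumes that p = M·M is a perfect square and uses the column-wise identification (4.13); the function names are ours:

    /* Distribution of Fig. 4.6: vectors (and matrix diagonal) cyclic over
       the p processors, matrix element a_ij to P(phi0(i), phi1(j)) on an
       M x M processor grid, where p = M*M. */
    long phi_u(long i, long p)         { return i % p; }        /* = phi_v */
    long phi0(long i, long M)          { return i % M; }        /* processor row */
    long phi1(long j, long p, long M)  { return (j % p) / M; }  /* processor column */

    long owner_1d(long i, long j, long p, long M)  /* via eqn (4.15) */
    {
        return phi0(i, M) + phi1(j, p, M) * M;
    }

For p = 4 and M = 2, owner_1d(i, i, 4, 2) = i mod 4, so the matrix diagonal is indeed spread cyclically over all four processors, as in Fig. 4.6.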
We may conclude that the case of dense matrices is a good example where
Cartesian matrix partitioning is useful in deriving an optimal distribution,
a square Cartesian distribution based on a cyclic distribution of the matrix
diagonal. (Strictly speaking, we did not give an optimality proof; it seems,
however, that this distribution is hard to beat.)

4.5 Mondriaan distribution for general sparse matrices


Sparse matrices usually have a special structure in their sparsity pattern.
Sometimes this structure is known, but more often it has to be detected by
an automatic procedure. The aim of this section is to present an algorithm
that finds the underlying structure of an arbitrary sparse matrix and exploits
this structure to generate a good distribution for sparse matrix–vector mul-
tiplication. We name the resulting distribution Mondriaan distribution,
honouring the Dutch painter Piet Mondriaan (1872–1944) who is most famous
for his compositions of brightly coloured rectangles.
Suppose we have to find a matrix and vector distribution for a sparse
matrix A where we have the freedom to distribute the input and output vec-
tors independently. Thus, we can first concentrate on the matrix distribution
problem and try to find a matrix distribution φ that minimizes the commu-
nication volume Vφ , defined by eqn (4.10), while balancing the computational
work. We define As = {(i, j) : 0 ≤ i, j < n ∧ φ(i, j) = s} as the set of index
pairs corresponding to the nonzeros of P (s), 0 ≤ s < p. Thus, A0 , . . . , Ap−1
forms a p-way partitioning of A = {(i, j) : 0 ≤ i, j < n ∧ aij ≠ 0}. (For the
purpose of partitioning, we identify a nonzero with its index pair and a sparse
matrix with its set of index pairs.) We use the notation V (A0 , . . . , Ap−1 ) = Vφ
to express the explicit dependence of the communication volume on the
partitioning.
The figure on the cover of this book (see also Plate 1) shows a 4-way
partitioning of the 60 × 60 matrix prime60, defined by aij = 1 if i mod j = 0
or j mod i = 0, and aij = 0 otherwise, for i, j = 1, . . . , n. It is easy to
recognize rows and columns with an index that is prime. (To establish this
direct connection, we must start counting the indices of prime60 at one, as
an exception to the rule of always starting at zero.)
For mutually disjoint subsets of nonzeros A0 , . . . , Ak−1 , where k ≥ 1, not
necessarily with A0 ∪ · · · ∪ Ak−1 = A, we define the communication volume
V (A0 , . . . , Ak−1 ) as the volume of the sparse matrix–vector multiplication for
the matrix A0 ∪ · · · ∪ Ak−1 where subset As is assigned to P (s), 0 ≤ s < k.
For k = p, this reduces to the original definition. When we split a subset of a
k-way partitioning of A, we obtain a (k + 1)-way partitioning. The following
theorem says that the new communication volume equals the old volume plus
the volume incurred by splitting the subset as a separate problem, ignoring
all other subsets.
Theorem 4.6 (Vastenhouw and Bisseling [188]) Let A be a sparse n × n
matrix and let A0 , . . . , Ak ⊂ A be mutually disjoint subsets of nonzeros, where
k ≥ 1. Then

V (A0 , . . . , Ak ) = V (A0 , . . . , Ak−2 , Ak−1 ∪ Ak ) + V (Ak−1 , Ak ). (4.23)

Proof The number of processors that contributes to a vector component ui depends on the partitioning assumed, pi = pi (A0 , . . . , Ak−1 ) for
the k-way partitioning A0 , . . . , Ak−1 , and similarly qj = qj (A0 , . . . , Ak−1 ).
Let p′i = max(pi − 1, 0) and qj′ = max(qj − 1, 0). We are done if we prove

p′i (A0 , . . . , Ak ) = p′i (A0 , . . . , Ak−2 , Ak−1 ∪ Ak ) + p′i (Ak−1 , Ak ), (4.24)

for 0 ≤ i < n, and a similar equality for qj′ , because the result then follows
from summing over i = 0, . . . , n − 1 for p′i and j = 0, . . . , n − 1 for qj′ . We only
prove the equality for the p′i , and do this by distinguishing two cases. If row i
has a nonzero in Ak−1 or Ak , then p′i = pi − 1 in all three terms of eqn (4.24).
Thus,

p′i (A0 , . . . , Ak−2 , Ak−1 ∪ Ak ) + p′i (Ak−1 , Ak )


= pi (A0 , . . . , Ak−2 , Ak−1 ∪ Ak ) − 1 + pi (Ak−1 , Ak ) − 1
= pi (A0 , . . . , Ak−2 ) + 1 − 1 + pi (Ak−1 , Ak ) − 1
= pi (A0 , . . . , Ak−2 ) + pi (Ak−1 , Ak ) − 1
= pi (A0 , . . . , Ak ) − 1
= p′i (A0 , . . . , Ak ). (4.25)

If row i has no nonzero in Ak−1 or Ak , then

p′i (A0 , . . . , Ak−2 , Ak−1 ∪ Ak ) + p′i (Ak−1 , Ak )


= p′i (A0 , . . . , Ak−2 ) + 0
= p′i (A0 , . . . , Ak ). (4.26)


The proof of Theorem 4.6 shows that the theorem also holds for the com-
munication volume of the fanout and fanin separately. The theorem implies
that we only have to look at the subset we want to split when trying to optim-
ize the split, and not at the effect such a split has on communication for other
subsets.
The theorem helps us to achieve our goal of minimizing the communication
volume. Of course, we must at the same time also consider the load balance of
the computation; otherwise, the problem is easily solved: assign all nonzeros
to the same processor et voilà, no communication whatsoever! We specify
the allowed load imbalance by a parameter ǫ > 0, requiring that the p-way
partitioning of the nonzeros satisfies the computational balance constraint

max_{0≤s<p} nz(As ) ≤ (1 + ǫ) nz(A)/p.    (4.27)

We do not allow ǫ = 0, because obtaining a perfect load balance is often
impossible, and when it is possible, the demand for perfect balance does not
leave much room for minimizing communication. Besides, nobody is perfect
and epsilons are never zero.
The load imbalance achieved for the matrix prime60 shown on the cover
of this book is ǫ′ ≈ 2.2%: the matrix has 462 nonzeros, and the number of
nonzeros per processor is 115, 118, 115, 114, for P (0) (red), P (1) (black), P (2)
(yellow), and P (3) (blue), respectively. (We denote the achieved imbalance by
ǫ′ , to distinguish it from the allowed imbalance ǫ.) The partitioning has been
obtained by a vertical split into blocks of consecutive columns, followed by
two independent horizontal splits into blocks of consecutive rows. The splits
were optimized for load balance only. (Requiring splits to yield consecutive
blocks is an unnecessary and harmful restriction, but it leads to nice and easily
comprehensible pictures.) The communication volume for this partitioning is
120, which is bad because it is the highest value possible for the given split
directions.
The best choice of the imbalance parameter ǫ is machine-dependent and
can be found by using the BSP model. Suppose we have obtained a matrix
distribution with volume V that satisfies the constraint (4.27). Assuming that
the subsequent vector partitioning does a good job, balancing the communica-
tion well and thus achieving a communication cost of V g/p, we have a BSP
cost of 2(1 + ǫ′ )nz(A)/p + V g/p + 4l. To get a good trade-off between com-
putation imbalance and communication, the corresponding overhead terms
should be about equal, that is, ǫ′ ≈ V g/(2nz(A)). If this is not the case, we
can increase or decrease ǫ and obtain a lower BSP cost. We cannot determine
ǫ beforehand, because we cannot predict exactly how its choice affects V .
How to split a given subset? Without loss of generality, we may assume
that the subset is A itself. In principle, we can assign every individual nonzero
to one of the two available processors. The number of possible 2-way par-
titionings, however, is huge, namely 2^(nz(A)−1). (We saved a factor of two
by using symmetry: we can swap the two processors without changing the
volume of the partitioning.) Trying all partitionings and choosing the best is
usually impossible, even for modest problem sizes. In the small example of
Fig. 4.3, we already have 2^12 = 4096 possibilities (one of which is shown).
Thus, our only hope is to develop a heuristic method, that is, a method
that gives an approximate solution, hopefully close to the optimum and com-
puted within reasonable time. A good start is to try to restrict the search
space, for example, by assigning complete columns to processors; the number
of possibilities then decreases to 2^(n−1). In the example, we now have 2^4 = 16
possibilities. In general, the number of possibilities is still large, and heuristics
are still needed, but the problem is now more manageable. One reason is that
bookkeeping is simpler for n columns than for nz(A) nonzeros. Furthermore,
a major advantage of assigning complete columns is that the split does not
generate communication in the fanout. Thus we decide to perform each split
by complete columns, or, alternatively, by complete rows. We can express the
splitting by the assignment

(A0 , A1 ) := split(A, dir , ǫ), (4.28)


where dir ∈ {row, col} is the splitting direction and ǫ the allowed load
imbalance. Since we do not know ahead, which of the two splitting directions
is best, we try both and choose the direction with the lowest communication
volume.
Example 4.7 Let

              | 0 3 0 0 1 |
              | 4 1 0 0 0 |
        A =   | 0 5 9 2 0 | .
              | 6 0 0 5 3 |
              | 0 0 5 8 9 |
For ǫ = 0.1, the maximum number of nonzeros per processor must be seven
and the minimum six. For a column split, this implies that one processor must
have two columns with three nonzeros and the other processor the remaining
columns. A solution that minimizes the communication volume V is to assign
columns 0, 1, 2 to P (0) and columns 3, 4 to P (1). This gives V = 4. For a row
split, assigning rows 0, 1, 3 to P (0) and rows 2, 4 to P (1) is optimal, giving
V = 3. Can you find a better solution if you are allowed to assign nonzeros
individually to processors?
The function split can be applied repeatedly, giving a method for parti-
tioning a matrix into several parts. The method can be formulated concisely
as a recursive computation. For convenience, we assume that p = 2q , but
the method can be adapted to handle other values of p as well. (This would
also require generalizing the splitting function, so that for instance in the first
split for p = 3, it can produce two subsets with a nonzero ratio of about 2 : 1.)
The recursive method should work for a rectangular input matrix, since the
submatrices involved may be rectangular (even though the initial matrix is
square). Because of the splitting into sets of complete columns or rows, we can
view the resulting p-way partitioning as a splitting into p mutually disjoint
submatrices (not necessarily with consecutive rows and columns): we start
with a complete matrix, split it into two submatrices, split each submatrix,
giving four submatrices, and so on. The number of times the original submat-
rix must be split to reach a given submatrix is called the recursion level
of the submatrix. The level of the original matrix is 0. The final result for
processor P (s) is a submatrix defined by an index set I¯s × J¯s . This index set
is different from the index set Is × Js of pairs (i, j) with i ∈ Is and j ∈ Js
defined in Algorithm 4.5, because the submatrices I¯s × J¯s may contain empty
rows and columns; removing these gives Is × Js . Thus we have

As ⊂ Is × Js ⊂ I¯s × J¯s , for 0 ≤ s < p. (4.29)

Furthermore, all the resulting submatrices are mutually disjoint, that is,

(I¯s × J¯s ) ∩ (I¯t × J¯t ) = ∅, for 0 ≤ s, t < p with s ≠ t,    (4.30)


and together they comprise the original matrix,

⋃_{s=0}^{p−1} (I¯s × J¯s ) = {0, . . . , n − 1} × {0, . . . , n − 1}.    (4.31)

To achieve a final load imbalance of at most ǫ, we must take care that


the maximum number of nonzeros per matrix part grows slowly enough with
the recursion level. If the growth factor at each level is 1 + δ, then the overall
growth factor is (1 + δ)^q ≈ 1 + qδ in a first-order approximation. This motivates
our choice of starting with δ = ǫ/q. After the first split, a new situation
arises. One part has at least half the nonzeros, and the other part at most
half. Assume that the matrix parts, or subsets, are B0 and B1 . Subset Bs ,
s = 0, 1, has nz(Bs ) nonzeros and will be partitioned over p/2 processors with
a load imbalance parameter ǫs . Equating the maximum number of nonzeros
per processor specified for the remainder of the partitioning process to the
maximum specified at the beginning,

(1 + ǫs ) nz(Bs )/(p/2) = (1 + ǫ) nz(A)/p,    (4.32)

gives the value ǫs to be used in the remainder. In this way, the allowed load
imbalance is dynamically adjusted during the partitioning. A matrix part that
has fewer nonzeros than the average will have a larger ǫ in the remainder,
giving more freedom to minimize communication for that part. The resulting
algorithm is given as Algorithm 4.6.
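The dynamic adjustment of eqn (4.32) amounts to one line of code per part. A small C sketch (names ours), followed by a numerical illustration:

    /* Load imbalance allowed for part Bs in the remaining recursion,
       from eqn (4.32); maxnz = (1+eps)*nz(A)/p is fixed at the start. */
    double adjusted_eps(double maxnz, long nz_Bs, long p)
    {
        return maxnz / (double)nz_Bs * ((double)p / 2.0) - 1.0;
    }

For instance, for nz(A) = 1000, p = 4, and ǫ = 0.03 we have maxnz = 257.5; a first split into parts with 505 and 495 nonzeros then gives ǫ0 ≈ 0.020 for the larger part and ǫ1 ≈ 0.040 for the smaller one, so that the smaller part indeed obtains more freedom in its further splits.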
Figure 4.7 presents a global view of the sparse matrix prime60 and the
corresponding input and output vectors distributed over four processors by
the Mondriaan package [188], version 1.0. The matrix distribution program of
this package is an implementation of Algorithm 4.6. The allowed load imbal-
ance specified by the user is ǫ = 3%; the imbalance achieved by the program
is ǫ′ ≈ 1.3%, since the maximum number of nonzeros per processor is 117 and
the average is 462/4=115.5. The communication volume of the fanout is 51
and that of the fanin is 47, so that V = 98. Note that rows i = 11, 17, 19, 23,
25, 31, 37, 41, 43, 47, 53, 55, 59 (in the exceptional numbering starting from
one) are completely owned by one processor and hence do not cause commu-
nication in the fanin; vector component ui is owned by the same processor.
Not surprisingly, all except two of these row numbers are prime. The distri-
bution is better than the distribution on the book cover, both with respect
to computation and with respect to communication. The matrix prime60 is
symmetric, and although the Mondriaan package has an option to produce
symmetric partitionings, we did not use it for our example.
Figure 4.8 presents a local view, or processor view, of prime60 and
the corresponding input and output vectors. For processor P (s), s = 0, 1, 2, 3,
the local submatrix I¯s × J¯s is shown. The size of this submatrix is 29 × 26 for
Algorithm 4.6. Recursive matrix partitioning.

input:   A : m × n sparse matrix,
         p : number of processors, p = 2^q with q ≥ 0,
         ǫ : allowed load imbalance, ǫ > 0.
output:  (A0 , . . . , Ap−1 ): p-way partitioning of A,
         satisfying max_{0≤s<p} nz(As ) ≤ (1 + ǫ) nz(A)/p.
function call: (A0 , . . . , Ap−1 ) := MatrixPartition(A, p, ǫ).

if p > 1 then
    maxnz := (1 + ǫ) nz(A)/p;
    (B0row , B1row ) := split(A, row, ǫ/q);
    (B0col , B1col ) := split(A, col, ǫ/q);
    if V (B0row , B1row ) ≤ V (B0col , B1col ) then
        (B0 , B1 ) := (B0row , B1row );
    else (B0 , B1 ) := (B0col , B1col );
    ǫ0 := maxnz/nz(B0 ) · p/2 − 1;
    ǫ1 := maxnz/nz(B1 ) · p/2 − 1;
    (A0 , . . . , Ap/2−1 ) := MatrixPartition(B0 , p/2, ǫ0 );
    (Ap/2 , . . . , Ap−1 ) := MatrixPartition(B1 , p/2, ǫ1 );
else A0 := A;

P (0) (red), 29 × 34 for P (1) (black), 31 × 31 for P (2) (yellow), and 31 × 29 for
P (3) (blue). Together, the submatrices fit in the space of the original matrix.
The global indices of a submatrix are not consecutive in the original matrix,
but scattered. For instance, I¯0 = {2, 3, 4, 5, 11, 12, 14, . . . , 52, 53, 55, 56, 57}, cf.
Fig. 4.7. Note that I¯0 × J¯0 = I0 ×J0 and I¯3 × J¯3 = I3 ×J3 , but that I¯1 × J¯1 has
six empty rows and nine empty columns, giving a size of 23×25 for I1 ×J1 , and
that I¯2 × J¯2 has seven empty rows, giving a size of 24 × 31 for I2 × J2 . Empty
rows and columns in a submatrix are the aim of a good partitioner, because
they do not incur communication. An empty row is created by a column split
in which all nonzeros of a row are assigned to the same processor, leaving the
other processor empty-handed. The partitioning directions chosen for prime60
were first splitting in the row direction, and then twice, independently, in the
column direction.
The high-level recursive matrix partitioning algorithm does not specify the
inner workings of the split function. To find a good split, we need a biparti-
tioning method based on the exact communication volume. It is convenient
to express our problem in terms of hypergraphs, as was first done for matrix
partitioning problems by Çatalyürek and Aykanat [36,37]. A hypergraph
H = (V, N ) consists of a set of vertices V and a set of hyperedges, or
Fig. 4.7. Matrix and vector distribution for the sparse matrix prime60. Global view
(see also Plate 2).

Fig. 4.8. Same matrix and distribution as in Fig. 4.7. Local view (see also
Plate 3).
Fig. 4.9. Hypergraph with nine vertices and six nets. Each circle represents a vertex.
Each oval curve enclosing a set of vertices represents a net. The vertex set
is V = {0, . . . , 8} and the nets are n0 = {0, 1}, n1 = {0, 5}, n2 = {0, 6},
n3 = {2, 3, 4}, n4 = {5, 6, 7}, and n5 = {7, 8}. The vertices have been coloured
to show a possible assignment to processors, where P (0) has the white vertices
and P (1) the black vertices.

nets, N , which are subsets of V. (A hypergraph is a generalization of an
undirected graph G = (V, E), where E is a set of undirected edges, which
are unordered pairs (i, j) with i, j ∈ V. Thus (i, j) = (j, i) and can be identi-
fied with a subset {i, j} of two elements.) Figure 4.9 illustrates the definition
of a hypergraph.
Our splitting problem is to assign each of the n columns of a sparse m × n
matrix to either processor P (0) or P (1). We can identify a matrix column j
with a vertex j of a hypergraph with vertex set V = {0, . . . , n − 1}, so that
our problem translates into assigning each vertex to a processor. We have to
minimize the communication volume of the fanin: assigning the nonzeros of a
row to different processors gives rise to one communication, whereas assigning
them to the same processor avoids communication. We can identify a matrix
row i with a net ni , defining

ni = {j : 0 ≤ j < n ∧ aij ≠ 0}, for 0 ≤ i < m. (4.33)

A communication arises if the net is cut, that is, not all its vertices are
assigned to the same processor. The total communication volume incurred
by the split thus equals the number of cut nets of the hypergraph. In the
assignment of Fig. 4.9, two nets are cut: n1 and n2 .
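As a small illustration (in Python, with hypothetical helper names), the nets of eqn (4.33) and the number of cut nets of a column assignment can be computed as follows; applied to the hypergraph of Fig. 4.9, with vertices 0–4 taken on P (0) and vertices 5–8 on P (1), which is one assignment consistent with the cut nets named above, it reports the two cut nets.

    def nets_from_matrix(nonzeros, m):
        # Nets of eqn (4.33): one net per matrix row, n_i = { j : a_ij != 0 }.
        nets = [set() for _ in range(m)]
        for (i, j) in nonzeros:
            nets[i].add(j)
        return nets

    def cut_nets(nets, part):
        # Nets that are cut by the vertex assignment part[j] in {0, 1}.
        return [n for n in nets if len({part[j] for j in n}) > 1]

    # The hypergraph of Fig. 4.9, with vertices 0-4 on P(0) and 5-8 on P(1).
    nets_fig49 = [{0, 1}, {0, 5}, {0, 6}, {2, 3, 4}, {5, 6, 7}, {7, 8}]
    part = {v: 0 if v <= 4 else 1 for v in range(9)}
    print(len(cut_nets(nets_fig49, part)))   # prints 2 (nets n1 and n2 are cut)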
Example 4.8 Let V = {0, 1, 2, 3, 4}. Let the nets be n0 = {1, 4},
n1 = {0, 1}, n2 = {1, 2, 3}, n3 = {0, 3, 4}, and n4 = {2, 3, 4}. Let
N = {n0 , n1 , n2 , n3 , n4 }. Then H = (V, N ) is the hypergraph that corresponds

to the column partitioning problem of Example 4.7. The optimal column


solution has four cut nets: n0 , n2 , n3 , n4 .
We have to assign the vertices of the hypergraph in a balanced way, so
that both processors receive about the same number of nonzeros. This is best
modelled by making the vertices weighted, defining the weight cj of vertex j
to be the number of nonzeros in column j,

cj = |{i : 0 ≤ i < m ∧ aij ≠ 0}|, for 0 ≤ j < n. (4.34)

The splitting has to satisfy

    Σ_{j∈P (s)} cj ≤ (1 + ǫ) nz(A)/2,   for s = 0, 1.    (4.35)
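The vertex weights of eqn (4.34) and the balance test of eqn (4.35) translate directly into a few lines of Python; this is only a sketch, with the matrix again given as a set of (i, j) index pairs and with illustrative function names.

    def column_weights(nonzeros, n):
        # Vertex weights of eqn (4.34): c_j = number of nonzeros in column j.
        c = [0] * n
        for (i, j) in nonzeros:
            c[j] += 1
        return c

    def balanced(part, c, eps):
        # Balance criterion of eqn (4.35) for a 2-way column assignment part[j].
        nz = sum(c)
        load = [0, 0]
        for j, w in enumerate(c):
            load[part[j]] += w
        return max(load) <= (1 + eps) * nz / 2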

Having converted our problem to a hypergraph bipartitioning problem, we


can apply algorithms developed for such problems. An excellent approach is
to use the multilevel method [34], which consists of three phases: coarsen-
ing, initial partitioning, and uncoarsening. During the coarsening phase,
the problem is reduced in size by merging similar vertices, that is, vertices
representing columns with similar sparsity patterns. A natural heuristic is to
do this pairwise, halving the number of vertices at each coarsening level. The
best match for column j is an unmatched column j ′ with maximal overlap in
the sparsity pattern, that is, maximal |{i : 0 ≤ i < m ∧ aij ≠ 0 ∧ aij ′ ≠ 0}|.
This value can be computed as the inner product of columns j and j ′ , taking
all nonzeros to be ones. The result of the merger is a column which has a
nonzero in row i if aij ≠ 0 or aij ′ ≠ 0. In Example 4.7, the best match for
column 2 is column 3, since their sparsity patterns have two nonzeros in com-
mon. For the purpose of load balancing, the new column gets a weight equal
to the sum of the weights of the merged columns.
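A simple, deliberately unoptimized sketch of this pairwise matching heuristic is given below; the column patterns are represented as sets of row indices, so that the overlap of two columns is just the size of the intersection of their patterns, and a merged column carries the union of the patterns and the sum of the weights. The function name and interface are illustrative only.

    def coarsen(cols, weights):
        """One level of coarsening: greedily match columns by pattern overlap.
        cols[j] is the set { i : a_ij != 0 }, weights[j] the column weight."""
        n = len(cols)
        matched = [False] * n
        new_cols, new_weights, groups = [], [], []
        for j in range(n):
            if matched[j]:
                continue
            matched[j] = True
            # best unmatched partner: maximal overlap |pattern(j) & pattern(j')|
            best, best_overlap = None, -1
            for jp in range(n):
                if not matched[jp]:
                    overlap = len(cols[j] & cols[jp])
                    if overlap > best_overlap:
                        best, best_overlap = jp, overlap
            if best is None:
                new_cols.append(set(cols[j]))
                new_weights.append(weights[j])
                groups.append([j])
            else:
                matched[best] = True
                new_cols.append(cols[j] | cols[best])           # union of patterns
                new_weights.append(weights[j] + weights[best])  # summed weights
                groups.append([j, best])
        return new_cols, new_weights, groups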
The initial partitioning phase starts when the problem is sufficiently
reduced in size, typically when a few hundred columns are left. Each column
is then assigned to a processor. The simplest initial partitioning method is by
random assignment, but more sophisticated methods give better results. Care
must be taken to obey the load balance criterion for the weights.
The initial partitioning for the smallest problem is transferred to the larger
problems during the uncoarsening phase, which is similar to the coarsening
phase but is carried out in the reverse direction. At each level, both columns
of a matched pair are assigned to the same processor as their merged column.
The resulting partitioning of the larger problem is refined, for instance by
trying to move columns, or vertices, to the other processor. A simple approach
would be to try a move of all vertices that are part of a cut net. Note that a
vertex can be part of several cut nets. Moving a vertex to the other processor
may increase or decrease the number of cut nets. The gain of a vertex is the

reduction in cut nets obtained by moving it to the other processor. The best
move has the largest gain, for example, moving vertex 0 to P (1) in Fig. 4.9
has a gain of 1. The gain may be zero, such as in the case of moving vertex 5
or 6 to P (0). The gain may also be negative, for example, moving vertex 1,
2, 3, 4, or 8 to the other processor has a gain of −1. The worst move is that
of vertex 7, since its gain is −2.
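The gain of a vertex can be computed directly from this definition by comparing the cut status of its nets before and after the move; the short sketch below reproduces the gains quoted above for the hypergraph of Fig. 4.9 (again with vertices 0–4 taken on P (0) and 5–8 on P (1)).

    def is_cut(net, part):
        return len({part[v] for v in net}) > 1

    def gain(v, nets, part):
        # Reduction in the number of cut nets obtained by moving v to the other part.
        mine = [n for n in nets if v in n]
        before = sum(is_cut(n, part) for n in mine)
        moved = dict(part)
        moved[v] = 1 - part[v]
        after = sum(is_cut(n, moved) for n in mine)
        return before - after

    nets = [{0, 1}, {0, 5}, {0, 6}, {2, 3, 4}, {5, 6, 7}, {7, 8}]
    part = {v: 0 if v <= 4 else 1 for v in range(9)}
    print(gain(0, nets, part), gain(5, nets, part), gain(7, nets, part))  # 1 0 -2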
Example 4.9 The following is a column bipartitioning of an 8 × 8 matrix
by the multilevel method. During the coarsening, columns are matched in
even–odd pairs, column 0 with 1, 2 with 3, and so on. The initial partitioning
assigns columns to processors; from the partitioning step onwards, the owner
(0 or 1) of each column is indicated in the line labelled 'owner' above the
matrix. The original 8 × 8 matrix and its two coarsened versions are

    · 1 · · · · · ·              1 · · ·              1 ·
    1 · · · 1 · 1 ·              1 · 1 1              1 1
    1 1 1 1 · · · 1              1 1 · 1              1 1
    · 1 1 1 · · · ·   coarsen    1 1 · ·   coarsen    1 ·
    · · · · 1 1 · ·     −→       · · 1 ·     −→       · 1
    · · · · 1 1 · ·              · · 1 ·              · 1
    · · · · · · 1 1              · · · 1              · 1
    · · 1 1 · · 1 1              · 1 · 1              1 1

The initial partitioning assigns the first merged column (original columns 0–3)
to P (0) and the second merged column (original columns 4–7) to P (1).
Uncoarsening to four columns and applying the first refinement gives

    owner: 0 1               owner: 0 0 1 1             owner: 0 0 1 1

    1 ·                      1 · · ·                    1 · · ·
    1 1                      1 · 1 1                    1 · 1 1
    1 1                      1 1 · 1                    1 1 · 1
    1 ·       uncoarsen      1 1 · ·       refine       1 1 · ·
    · 1          −→          · · 1 ·         −→         · · 1 ·
    · 1                      · · 1 ·                    · · 1 ·
    · 1                      · · · 1                    · · · 1
    1 1                      · 1 · 1                    · 1 · 1

Uncoarsening to the original eight columns and performing the second
refinement gives

    owner: 0 0 0 0 1 1 1 1             owner: 1 0 0 0 1 1 1 1

    · 1 · · · · · ·                    · 1 · · · · · ·
    1 · · · 1 · 1 ·                    1 · · · 1 · 1 ·
    1 1 1 1 · · · 1                    1 1 1 1 · · · 1
    · 1 1 1 · · · ·       refine       · 1 1 1 · · · ·
    · · · · 1 1 · ·         −→         · · · · 1 1 · ·
    · · · · 1 1 · ·                    · · · · 1 1 · ·
    · · · · · · 1 1                    · · · · · · 1 1
    · · 1 1 · · 1 1                    · · 1 1 · · 1 1

The communication volume after the initial partitioning is V = 3, and is


caused by rows 1, 2, and 7. The first refinement does not change the par-
titioning, because no move of a single column reduces the communication
and because all such moves cause a large imbalance in the computation. The

second refinement moves column 0 to P (1), giving a partitioning with V = 2


as the final result.
The Kernighan–Lin algorithm [120], originally developed for graph bipar-
titioning, can be applied to hypergraph bipartitioning. It can be viewed as a
method for improving a given bipartitioning. The algorithm consists of several
passes. Fiduccia and Mattheyses [69] give an efficient implementation of the
Kernighan–Lin algorithm based on a priority-queue data structure for which
one pass costs O(nz(A) + n). In a pass, all vertices are first marked as mov-
able and their gain is computed. The vertex with the largest gain among the
movable vertices is moved, provided this does not violate the load balance
constraint. The vertex is then marked as nonmovable for the remainder of the
current pass; this is to guarantee termination of the pass. The gains of the
other vertices are then updated and a new move is determined. This process
is continued until no more moves can be carried out. The best partitioning
encountered during the pass (not necessarily the final partitioning) is saved
and used as starting point for the next pass. Note that moves with negative
gain are allowed and that they occur when no moves with positive or zero
gain are available. A move with negative gain may still be advantageous, for
instance if it is followed by a move of adjacent vertices (i.e. vertices that
share a net with the moved vertex).
The Kernighan–Lin algorithm can be used in the initial partitioning and
uncoarsening phases of the multilevel method. In the initial partitioning, the
algorithm can be applied several times, each time to improve a different ran-
dom assignment of the vertices to the processors. The best result is chosen.
In the uncoarsening phase, the algorithm is commonly applied only once, and
often with the movable vertices restricted to those that are part of a cut net.
This cheaper variant is called the boundary Kernighan–Lin algorithm. Its use
is motivated by the larger problem sizes involved and by the limited purpose of
the uncoarsening phase, which is to refine a partitioning, and not to compute
a completely new one.
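To make the bookkeeping of a pass concrete, here is a much simplified rendering of one such pass in Python. It omits the priority-queue data structure of Fiduccia and Mattheyses, so it does not achieve the O(nz(A) + n) cost per pass, and the weight and balance interface is an assumption of this sketch; it only illustrates the rules stated above: all vertices start movable, the movable vertex of largest gain whose move respects the balance constraint is moved and then locked, negative-gain moves are allowed, and the best partitioning seen during the pass is returned.

    def fm_pass(nets, part, weights, eps):
        """One simplified Kernighan-Lin/FM pass for hypergraph bipartitioning."""
        part = dict(part)
        nz = sum(weights.values())
        maxload = (1 + eps) * nz / 2
        load = {0: 0, 1: 0}
        for v, w in weights.items():
            load[part[v]] += w

        def cut(p):
            return sum(len({p[v] for v in n}) > 1 for n in nets)

        def gain(v, p):
            mine = [n for n in nets if v in n]
            before = sum(len({p[u] for u in n}) > 1 for n in mine)
            moved = dict(p); moved[v] = 1 - p[v]
            after = sum(len({moved[u] for u in n}) > 1 for n in mine)
            return before - after

        movable = set(part)
        best_part, best_cut = dict(part), cut(part)
        while True:
            # movable vertices whose move keeps the receiving part within balance
            candidates = [v for v in movable
                          if load[1 - part[v]] + weights[v] <= maxload]
            if not candidates:
                break
            v = max(candidates, key=lambda v: gain(v, part))
            load[part[v]] -= weights[v]
            part[v] = 1 - part[v]
            load[part[v]] += weights[v]
            movable.remove(v)                  # locked for the rest of the pass
            c = cut(part)
            if c < best_cut:                   # remember the best point of the pass
                best_part, best_cut = dict(part), c
        return best_part, best_cut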

4.6 Vector distribution


The matrix distribution algorithm of the previous section should lead to a
matrix–vector multiplication with low communication volume and a good
computational load balance. What remains to be done is to partition the
input and output vectors such that the communication is balanced as well. In
other words, given a matrix distribution φ, we have to determine a vector
distribution φv that minimizes the value h of the fanout and that satis-
fies j ∈ Jφv (j) , for 0 ≤ j < n. This constraint says that the processor
P (s) = P (φv (j)) that obtains vj must own a nonzero in matrix column j,
that is, j ∈ Js . We also have to find a vector distribution φu that minim-
izes the value h of the fanin and that satisfies the constraint i ∈ Iφu (i) , for
0 ≤ i < n. These are two independent vector distribution problems, except if

the requirement distr(u) = distr(v) must be satisfied; we assume that this is


not the case.
Figure 4.7 gives a possible vector distribution in the global view. The
vectors in this figure are depicted in the familiar way, cf. Fig. 4.4. Note that
the two constraints mentioned above are satisfied: one of the nonzeros in each
matrix column has the colour of the corresponding component vj , and one of
the nonzeros in each row has the colour of the corresponding component ui .
Figure 4.8 gives the same vector distribution, but now in the local view. The
local components of the vector u are placed to the left of the local submatrix
(for P (0) and P (2)) or to the right (for P (1) and P (3)), just outside the matrix
boundary, whereas the local components of the vector v are placed above the
local submatrix (for P (0) and P (1)) or below it (for P (2) and P (3)). For
processor P (0) (red), this gives a familiar picture, with the difference that
now only the local part is shown.
A vector component vj is depicted above or below the corresponding
column of the submatrix if it is owned locally. Otherwise, the correspond-
ing space in the picture of the vector distribution remains empty. For each
resulting hole, a component vj must be received, unless the corresponding
local column is empty. Thus, it is easy to count the receives of the fanout by
using this picture: 13, 12, 15, 11 components are received by P (0), P (1), P (2),
P (3), respectively. The number of sends in the fanin can be counted similarly.
Because of the way the matrix has been split, communication occurs in the
fanin only between P (0) and P (1), and between P (2) and P (3). This makes
it easy to count the number of receives in the fanin: a send for P (0) implies a
receive for P (1), and vice versa; similarly, for P (2) and P (3). For the fanout,
it is more difficult to count the sends, because P (0) can now send a compon-
ent vj either to P (2) or P (3) (but not to P (1)). To determine the number of
sends we need to count them in Fig. 4.7. Table 4.1 summarizes the statistics
of the given vector distribution. The table shows that the communication cost

Table 4.1. Components of vectors u and v owned and com-


municated by the different processors for the matrix prime60
in the distribution of Figures 4.7 and 4.8

                   u                               v
  s      N (s)   hs (s)   hr (s)        N (s)   hs (s)   hr (s)

  0       18       11       12           13       13       13
  1       11       12       11           13       13       12
  2       12       12       12           16       14       15
  3       19       12       12           18       11       11

The number of components owned by processor P (s) is N (s);


the number sent is hs (s); the number received is hr (s)

of the fanout is 15g and the cost of the fanin 12g. The total communication
cost of 27g is thus only slightly above the average V g/p = 98g/4 = 24.5g,
which means that the communication is well-balanced. Note that the number
of vector components is less well-balanced, but this does not influence the
cost of the matrix–vector multiplication. (It influences the cost of other oper-
ations though, such as the vector operations accompanying the matrix–vector
multiplication in an iterative solver.)
The two vector distribution problems are similar; it is easy to see that
we can solve the problem of finding a good distribution φu given φ = φA by
finding a good distribution φv given φ = φAT . This is because the nonzero
pattern of row i of A is the same as the nonzero pattern of column i of AT ,
so that a partial sum uis is sent from P (s) to P (t) in the multiplication
by A if and only if a vector component vi is sent from P (t) to P (s) in the
multiplication by AT . Therefore, we only treat the problem for φv and hence
only consider the communication in the fanout.
Let us assume without loss of generality that we have a vector distribution
problem with qj ≥ 2, for all j. Columns with qj = 0 or qj = 1 do not cause
communication and hence may be omitted from the problem formulation.
(A good matrix distribution method will give rise to many columns with
qj = 1.) Without loss of generality we may also assume that the columns
are ordered by increasing qj ; this can be achieved by renumbering. Then the
h-values for the fanout are

    hs (s) = Σ_{0≤j<n, φv (j)=s} (qj − 1),   for 0 ≤ s < p,    (4.36)

and

    hr (s) = |{j : j ∈ Js ∧ φv (j) ≠ s}|,   for 0 ≤ s < p.    (4.37)
Me first! Consider what would happen if a processor P (s) becomes utterly
egoistic and tries to minimize its own h(s) = max(hs (s), hr (s)) without con-
sideration for others. To minimize hr (s), it just has to maximize the number
of components vj with j ∈ Js that it owns. To minimize hs (s), it has to
minimize the total weight of these components, where we define the weight
of vj as qj − 1. An optimal strategy would thus be to start with hs (s) = 0
and hr (s) = |Js | and grab the components in increasing order (and hence
increasing weight), adjusting hs (s) and hr (s) to account for each newly owned
component. The processor grabs components as long as hs (s) ≤ hr (s), the
new component included. We denote the resulting value of hs (s) by ĥs (s), the
resulting value of hr (s) by ĥr (s), and that of h(s) by ĥ(s). Thus,

ĥs (s) ≤ ĥr (s) = ĥ(s), for 0 ≤ s < p. (4.38)

The value ĥ(s) is indeed optimal for an egoistic P (s), because stopping
earlier would result in a higher hr (s) and hence a higher h(s) and because

stopping later would not improve matters either: if for instance P (s) would
grab one component more, then hs (s) > hr (s) so that h(s) = hs (s) ≥ hr (s) +
1 = ĥr (s) = ĥ(s). The value ĥ(s) is a local lower bound on the actual value
that can be achieved in the fanout,
ĥ(s) ≤ h(s), for 0 ≤ s < p. (4.39)
Example 4.10 The following table gives the input of a vector distribution
problem. If a processor P (s) owns a nonzero in matrix column j, this is
denoted by a 1 in the corresponding location; if it does not own such a nonzero,
this is denoted by a dot. This problem could for instance be the result of a
matrix partitioning for p = 4 with all splits in the row direction. (We can
view the input itself as a sparse p × n matrix.)

s=0 1 · 1 · 1 1 1 1
1 1 1 · 1 1 1 1 ·
2 · 1 · · · 1 1 1
3 · · 1 1 1 · · 1
qj = 2 2 2 2 3 3 3 3
j= 0 1 2 3 4 5 6 7

Processor P (0) wants v0 and v2 , so that ĥs (0) = 2, ĥr (0) = 4, and ĥ(0) = 4;
P (1) wants v0 , v1 , and v3 , so that ĥ(1) = 3; P (2) wants v1 , giving ĥ(2) = 3;
and P (3) wants v2 and v3 , giving ĥ(3) = 2. The fanout will cost at least 4g.
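The bound ĥ(s) = ĥ(Js , 0, 0) is easily computed by the grabbing procedure described above; the following sketch (illustrative Python, not the book's code) reproduces the values of Example 4.10.

    def local_bound(J, q, ns0=0, nr0=0):
        """Egoistic bound of eqn (4.38): grab the available components in order
        of increasing weight q[j] - 1 as long as hs <= hr, the new one included.
        J are the available columns (all with q[j] >= 2); ns0 and nr0 are sends
        and receives already committed to outside J."""
        hs, hr = ns0, nr0 + len(J)
        for j in sorted(J, key=lambda j: q[j]):
            if hs + q[j] - 1 <= hr - 1:
                hs, hr = hs + q[j] - 1, hr - 1
            else:
                break
        return hs, hr                 # (hs_hat, hr_hat); the bound itself is hr_hat

    # The vector distribution problem of Example 4.10:
    q = {0: 2, 1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3, 7: 3}
    J = [{0, 2, 4, 5, 6, 7}, {0, 1, 3, 4, 5, 6}, {1, 5, 6, 7}, {2, 3, 4, 7}]
    for s in range(4):
        print(s, local_bound(J[s], q))   # (2,4), (3,3), (1,3), (2,2): h_hat = 4,3,3,2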

More generally, we can compute a lower bound ĥ(J, ns0 , nr0 ) for a given
index set J ⊂ Js and a given initial number of sends ns0 and receives nr0 .
We denote the corresponding send and receive values by ĥs (J, ns0 , nr0 ) and
ĥr (J, ns0 , nr0 ). The initial communications may be due to columns outside J.
This bound is computed by the same method, but starting with the val-
ues hs (s) = ns0 and hr (s) = nr0 + |J|. Note that ĥ(s) = ĥ(Js , 0, 0). The
generalization of eqn (4.38) is

ĥs (J, ns0 , nr0 ) ≤ ĥr (J, ns0 , nr0 ) = ĥ(J, ns0 , nr0 ). (4.40)
Think about the others! Every processor would be happy to own the lighter
components and would rather leave the heavier components to the others.
Since every component vj will have to be owned by exactly one processor, we
must devise a mechanism to resolve conflicting desires. A reasonable heuristic
seems to be to give preference to the processor that faces the toughest future,
that is, the processor with the highest value ĥ(s). Our aim in the vector distri-
bution algorithm is to minimize the highest h(s), because (max0≤s<p h(s)) · g
is the communication cost of the fanout.
Algorithm 4.7 is the vector distribution algorithm based on the local-bound
heuristic; it has been proposed by Meesen and Bisseling [136]. The algorithm

Algorithm 4.7. Local-bound based vector partitioning.


input: φ = distr(A), matrix distribution over p processors, p ≥ 1,
where A is an n × n sparse matrix.
output: φv = distr(v): vector distribution over p processors,
satisfying j ∈ Jφv (j) , for 0 ≤ j < n,
where Js = {j : 0 ≤ j < n ∧ (∃i : 0 ≤ i < n ∧ φ(i, j) = s)},
for 0 ≤ s < p.

for s := 0 to p − 1 do
Ls := Js ;
hs (s) := 0;
hr (s) := 0;
if hs (s) < ĥs (Ls , hs (s), hr (s)) then
active(s) := true;
else active(s) := false;

while (∃s : 0 ≤ s < p ∧ active(s)) do


smax := argmax(ĥr (Ls , hs (s), hr (s)) : 0 ≤ s < p ∧ active(s));
j := min(Lsmax );
φv (j) := smax ;
hs (smax ) := hs (smax ) + qj − 1;
for all s : 0 ≤ s < p ∧ s ≠ smax ∧ j ∈ Js do
hr (s) := hr (s) + 1;
for all s : 0 ≤ s < p ∧ j ∈ Js do
Ls := Ls \{j};
if hs (s) = ĥs (Ls , hs (s), hr (s)) then
active(s) := false;

successively assigns components vj ; the set Ls is the index set of components


that may still be assigned to P (s). The number of sends caused by the assign-
ments is registered as hs (s) and the number of receives as hr (s). The processor
with the highest local lower bound ĥr (Ls , hs (s), hr (s)) becomes the happy
owner of the lightest component available. The values ĥr (Ls , hs (s), hr (s))
and ĥs (Ls , hs (s), hr (s)) may change after every assignment. A processor will
not accept components any more from the moment it knows it has achieved
its optimum, which happens when hs (s) = ĥs (Ls , hs (s), hr (s)). (Note that
ns0 ≤ ĥs (J, ns0 , nr0 ), so that trivially hs (s) ≤ ĥs (Ls , hs (s), hr (s)).) Accepting
additional components would raise its final h(s). This egoistic approach holds
for every processor, and not only for the one with the highest current bound.
(Accepting more components for altruistic reasons may be well-intended, but
is still a bad idea because the components thus accepted may be more useful
[Diagram: three panels, (a)–(c), of the communication graph on the processors P (0), P (1), P (2), P (3); the multiple-edge weights shown in the undirected graphs include 7, 5, 4, 3, and 2.]

Fig. 4.10. Transforming the undirected communication graph of a matrix


distribution into a directed graph. (a) The original undirected communication
graph with multiple edges shown as edge weights; (b) the undirected graph after
removal of all pairs of edges; (c) the final directed graph. As a result, P (0) has
to send two values to P (2) and receive three values from P (2); it has to send
four values to P (3) and receive three.

to other processors.) The algorithm terminates when no processor is willing


to accept components any more.
After termination, a small fraction (usually much less than 10%) of the
vector components remains unowned. These are the heavier components. We
can assign the remaining components in a greedy fashion, each time assigning
a component vj to the processor P (s) for which this would result in the lowest
new value h(s).
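For completeness, here is an illustrative Python rendering of Algorithm 4.7 together with this greedy assignment of the leftover components. It assumes, as in the problem formulation above, that every column in the sets Js has qj ≥ 2; the data representation (a list of sets Js and a dictionary of the values qj ) is an assumption of the sketch, and local_bound is the grabbing procedure sketched earlier, repeated here to keep the code self-contained.

    def local_bound(J, q, ns0, nr0):
        hs, hr = ns0, nr0 + len(J)
        for j in sorted(J, key=lambda j: q[j]):
            if hs + q[j] - 1 <= hr - 1:
                hs, hr = hs + q[j] - 1, hr - 1
            else:
                break
        return hs, hr

    def vector_partition(Jsets, q):
        """Sketch of Algorithm 4.7, followed by greedy assignment of leftovers.
        Jsets is a list of sets Js; q[j] is the number of owners of column j."""
        p = len(Jsets)
        L = [set(Js) for Js in Jsets]
        hs, hr = [0] * p, [0] * p
        phi = {}
        bound = lambda s: local_bound(L[s], q, hs[s], hr[s])
        active = [hs[s] < bound(s)[0] for s in range(p)]
        while any(active):
            # the processor with the highest bound gets the lightest component
            smax = max((s for s in range(p) if active[s]),
                       key=lambda s: bound(s)[1])
            j = min(L[smax], key=lambda j: q[j])
            phi[j] = smax
            hs[smax] += q[j] - 1
            for s in range(p):
                if s != smax and j in Jsets[s]:
                    hr[s] += 1
            for s in range(p):
                if j in L[s]:
                    L[s].discard(j)
                    if active[s] and hs[s] == bound(s)[0]:
                        active[s] = False          # P(s) has reached its optimum
        # Greedy post-pass: give each still unowned component to the processor
        # for which this results in the lowest new value h(s) = max(hs(s), hr(s)).
        for j in sorted(set().union(*Jsets) - set(phi)):
            owners = [s for s in range(p) if j in Jsets[s]]
            s = min(owners, key=lambda s: max(hs[s] + q[j] - 1, hr[s]))
            phi[j] = s
            hs[s] += q[j] - 1
            for t in owners:
                if t != s:
                    hr[t] += 1
        return phi, hs, hr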
A special case occurs if the matrix partitioning has the property qj ≤ 2,
for all j. This case can be solved to optimality by an algorithm that has as
input the undirected communication graph G = (V, E) defined by a vertex set
V = {0, . . . , p − 1} representing the processors and an edge set E representing
matrix columns shared by a pair of processors, where an edge (s, t) implies
communication from P (s) to P (t) or vice versa. Multiple edges between the
same pair of vertices are possible, see Fig. 4.10(a). The algorithm first removes
all multiple edges in pairs: if matrix column j and j ′ are both shared by P (s)
and P (t), then vj is assigned to P (s) and vj ′ to P (t). This gives rise to
one send and one receive for both processors, balancing their communication
obligations. The undirected graph that remains has at most one edge between
each pair of vertices, see Fig. 4.10(b).
The algorithm now picks an arbitrary vertex with odd degree as the
starting point for a path. The degree of a vertex is the number of edges
connected to it, for example, the degree of vertex 0 in Fig. 4.10(b) is two.
A path is a sequence of vertices that are connected by edges. Edges along
the path are transformed into directed edges in a new, directed graph, see
Fig. 4.10(c). (In a directed graph, each edge has a direction. Thus an edge
is an ordered pair (s, t), which differs from (t, s).) The direction of the cre-
ated edge is the same as that of the path. The path ends when a vertex is
reached that has no more undirected edges connected to it. This procedure is

repeated until no more odd-degree vertices are present. It is easy to see that
our procedure cannot change the degree of a vertex from even to odd. Finally,
the same procedure is carried out starting at even-degree vertices.
Once all undirected edges have been transformed into directed edges, we
have obtained a directed graph, which determines the owner of every remain-
ing vector component: component vj corresponding to a directed edge (s, t) is
assigned to P (s), causing a communication from P (s) to P (t). The resulting
vector distribution has minimal communication cost; for a proof of optimality,
see [136]. The vector distribution shown in Fig. 4.7 has been determined this
way. The matrix prime60 indeed has the property pi ≤ 2 for all i and qj ≤ 2
for all j, as a consequence of the different splitting directions of the matrix,
that is, first horizontal, then vertical.
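The pairing and path-orientation procedure for the special case qj ≤ 2 can be sketched as follows; this is an illustrative rendering, not the implementation of [136], and the input format (for each column, the pair of processors sharing it) is an assumption of the sketch.

    from collections import defaultdict

    def distribute_q2(shared):
        """shared: dict j -> (s, t), the two processors owning nonzeros in
        column j. Returns phi_v, assigning each column to one of the two."""
        phi = {}
        # Step 1: remove multiple edges in pairs, assigning the columns oppositely.
        by_pair = defaultdict(list)
        for j, (s, t) in shared.items():
            by_pair[tuple(sorted((s, t)))].append(j)
        adj = defaultdict(set)                  # remaining simple undirected graph
        for (s, t), cols in by_pair.items():
            while len(cols) >= 2:
                j0, j1 = cols.pop(), cols.pop()
                phi[j0], phi[j1] = s, t         # one send and one receive for both
            if cols:
                j = cols.pop()
                adj[s].add((t, j))
                adj[t].add((s, j))
        # Step 2: orient the remaining edges along paths, starting at odd-degree
        # vertices first, then at the remaining (even-degree) vertices.
        def walk(s):
            while adj[s]:
                t, j = adj[s].pop()             # follow an unused edge s -> t
                adj[t].discard((s, j))
                phi[j] = s                      # directed edge (s, t): v_j to P(s)
                s = t
        for parity in (1, 0):
            for s in list(adj):
                if adj[s] and len(adj[s]) % 2 == parity:
                    walk(s)
        return phi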

4.7 Random sparse matrices


A random sparse matrix A can be obtained by determining randomly and
independently for each matrix element aij whether it is zero or nonzero. If
the probability of creating a nonzero is d and hence that of creating a zero is
1 − d, the matrix has an expected density d(A) = d and an expected number
of nonzeros nz(A) = dn2 . This definition of randomness only concerns the
sparsity pattern, and not the numerical values of the nonzeros.
Historically, the first sparse matrix algorithms were tested using random
sparse matrices. Later, it was realized that these matrices constitute a very par-
ticular class and that many sparse matrices from practical applications fall
outside this class. This led to the development of the Harwell–Boeing collec-
tion [64,65] of sparse matrices, now called the Rutherford–Boeing collection.
If nothing is known about a given sparse matrix A, except for its size
n × n and its sparsity pattern, and if no structure is discernible, then a
first approximation is to consider A as a random sparse matrix with density
d = nz(A)/n2 . Still, it is best to call such a sparse matrix unstructured,
and not random sparse, because random sparse matrices have a very special
property: every subset of the matrix elements, chosen independently from
the sparsity pattern, has an expected fraction d of nonzeros. This property
provides us with a powerful tool for analysing algorithms involving random
sparse matrices and finding distributions for them.
The question whether a given sparse matrix such as the one shown in
Fig. 4.11 is random is tricky and just as hard to answer as the question
whether a given random number generator generates a true sequence of ran-
dom numbers. If the random number generator passes a battery of tests,
then for all practical purposes the answer is positive. The same pragmatic
approach can be taken for random sparse matrices. One test could be to split
the matrix into four submatrices of equal size, and check whether each has
about dn2 /4 nonzeros, within a certain tolerance given by probability the-
ory. In this section, we do not have to answer the tricky question, since we

Fig. 4.11. Sparse matrix random100 with n = 100, nz = 1000, c = 10, and
d = 0.1, interactively generated at the Matrix Market Deli [26], see
http://math.nist.gov/MatrixMarket/deli/Random/.

assume that the sparse matrix is random by construction. (We have faith in
the random number generator we use, ran2 from [157]. One of its character-
istics is a period of more than 2 × 10^18, meaning that it will not repeat itself
for a very long time.)
Now, let us study parallel matrix–vector multiplication for random sparse
matrices. Suppose we have constructed a random sparse matrix A by drawing
for each index pair (i, j) a random number rij ∈ [0, 1], doing this independ-
ently and uniformly (i.e. with each outcome equally likely), and then creating
a nonzero aij if rij < d. Furthermore, suppose that we have distributed A over
the p processors of a parallel computer in a manner that is independent of
the sparsity pattern, by assigning an equal number of elements (whether zero
or nonzero) to each processor. For simplicity, assume that n mod p = 0.
Therefore, each processor has n2 /p elements. Examples of such a distribution
are the square block distribution and the cyclic row distribution.
First, we investigate the effect of such a fixed, pattern-independent distri-
bution scheme on the spread of the nonzeros, and hence on the load balance

in the main computation part of Algorithm 4.5, the local matrix–vector


multiplication (1). The load balance can be estimated by using probability
theory. The problem here is to determine the expected maximum, taken over
all processors, of the local number of nonzeros. We cannot solve this problem
exactly, but we can still obtain a useful bound on the probability of the max-
imum exceeding a certain value, by applying a theorem of Chernoff, which
is often used in the analysis of randomized algorithms. A proof and further
details and applications can be found in [142].
Theorem 4.11 (Chernoff [40]). Let 0 < d < 1. Let X0 , X1 , . . . , Xm−1 be
independent Bernoulli trials with outcome 0 or 1, such that Pr[Xk = 1] = d,
for 0 ≤ k < m. Let X = Σ_{k=0}^{m−1} Xk and µ = md. Then for every ǫ > 0,

    Pr[X > (1 + ǫ)µ] < (e^ǫ / (1 + ǫ)^{1+ǫ})^µ.    (4.41)

If we flip a biased coin which produces heads with probability d, then


the Chernoff bound tells us how small the probability is of getting ǫµ more
heads than the expected average µ. The bound for ǫ = 1 tells us that the
probability of getting more than twice the expected number of heads is less
than (e/4)^µ ≈ (0.68)^{md}. Often, we apply the bound for smaller values of ǫ.
In the case of a random sparse matrix distributed over p processors, every
processor has m = n2 /p elements, each being nonzero with a probability d.
The expected number of nonzeros per processor is µ = dn2 /p. Let Es be
the event that processor P (s) has more than (1 + ǫ)µ nonzeros and E the
event that at least one processor has more than (1 + ǫ)µ nonzeros, that is,
E = ∪_{s=0}^{p−1} Es . The probability that at least one event from a set of events
happens is less than or equal to the sum of the separate probabilities of the
events, so that Pr[E] ≤ Σ_{s=0}^{p−1} Pr[Es ]. Because all events have the same
probability Pr[E0 ], this yields Pr[E] ≤ pPr[E0 ]. Since each nonzero causes
two flops in superstep (1), we get as a result
    Pr[ T(1) > 2(1 + ǫ)dn²/p ] < p (e^ǫ / (1 + ǫ)^{1+ǫ})^{dn²/p}.    (4.42)

The bound for ǫ = 1 tells us that the extra time caused by load imbalance
exceeds the ideal time of the computation itself with probability less than
p(0.68)^{dn²/p}. Figure 4.12 plots the function F (ǫ) defined as the right-hand
side of eqn (4.42) against the normalized computation cost 1+ǫ, for n = 1000,
p = 100, and three different choices of d. The normalized computation cost of
superstep (1) is the computation cost in flops divided by the cost of a perfectly
parallelized computation. The figure shows for instance that for d = 0.01 the
expected normalized cost is at most 1.5; this is because the probability of
exceeding 1.5 is almost zero.
[Plot: probability of exceeding a given cost (vertical axis, 0 to 1) versus normalized computation cost (horizontal axis, 1 to 3), with curves for d = 0.1, d = 0.01, and d = 0.001.]

Fig. 4.12. Chernoff bound on the probability that a given normalized computation
cost is exceeded, for a random sparse matrix of size n = 1000 and density d
distributed over p = 100 processors.

The expected normalized computation cost for given n, p, and d can be


estimated more accurately by performing a simulation experiment. In this
experiment, a set of random sparse matrices is created, each matrix is dis-
tributed by a fixed scheme that is independent of the sparsity pattern (e.g.
by a square block distribution), and its maximum local number of nonzeros is
determined. The average over the whole set of matrices is an estimate of the
expected maximum number of nonzeros; dividing the average by dn2 /p gives
an estimate of the expected normalized cost. For the matrices of Fig. 4.12,
the average normalized computation costs are: 1.076 for d = 0.1; 1.258 for
d = 0.01; and 1.876 for d = 0.001. These values were obtained by creating
10 000 matrices.
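The simulation experiment is easy to reproduce in outline. Since the distribution scheme is independent of the sparsity pattern and gives every processor n²/p matrix elements, the local nonzero counts are independent binomial variables, so they can be drawn directly instead of generating whole matrices; the sketch below (using numpy) does exactly that and gives values close to the averages reported above, although it is not the experiment used for those figures.

    import numpy as np

    def expected_normalized_cost(n, p, d, trials=10000, seed=1):
        """Monte Carlo estimate of E[max_s nz_s] / (d n^2 / p) for a fixed,
        pattern-independent distribution with n^2/p elements per processor."""
        rng = np.random.default_rng(seed)
        m = n * n // p                                  # elements per processor
        counts = rng.binomial(m, d, size=(trials, p))   # local nonzero counts
        return counts.max(axis=1).mean() / (d * m)

    for d in (0.1, 0.01, 0.001):
        print(d, expected_normalized_cost(1000, 100, d))
    # roughly 1.08, 1.25, and 1.9, in line with the averages reported above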
Figure 4.13 shows a histogram of the probability distribution for d = 0.01,
obtained by creating 100 000 matrices. The most frequent result in our sim-
ulation is a maximum local nonzero count of 124; this occurred 9288 times.
Translated into normalized costs and probabilities, this means that the nor-
malized cost of 1.24 has the highest probability, namely 9.3%. For comparison,
the figure also shows the derivative of the function 1−F (ǫ). This derivative can
be interpreted as a probability density function corresponding to the Chernoff
bound. The derivative has been scaled, multiplying it by a factor ∆ǫ = 0.01,
to make the function values comparable with the histogram values. The bars
of the histogram are far below the line representing the scaled derivative,
meaning that our Chernoff bound is rather pessimistic.
[Plot: probability (vertical axis, 0 to 0.2) versus normalized computation cost (horizontal axis, 1.1 to 1.6), with bars for the measured histogram and a line for the scaled derivative of the Chernoff bound.]

Fig. 4.13. Histogram of the probability distribution of the normalized computation


cost, for a random sparse matrix of size n = 1000 and density d = 0.01 distrib-
uted over p = 100 processors. Also shown is the function −F ′ (ǫ)/100, the scaled
derivative of the Chernoff bound.

Based on the above, we may expect that distributing a random sparse


matrix independently of its sparsity pattern spreads the computation well; we
can quantify this expectation using the Chernoff bound. The same quality of
load balance is expected for every distribution scheme with an equal number
of matrix elements assigned to the processors. For the communication, how-
ever, the choice of distribution scheme makes a difference. The communication
volume for a dense matrix is an upper bound on the volume for a sparse matrix
distributed by the same fixed, pattern-independent distribution scheme. For
a random sparse matrix with a high density, the communication obligations
will be the same as for a dense matrix. Therefore, the best we can do to find a
good fixed distribution scheme for random sparse matrices, is to apply meth-
ods for reducing communication in the dense case. A good choice is a square
Cartesian distribution based on a cyclic distribution of the matrix diagonal,
cf. Fig. 4.6, where each processor has an n/√p × n/√p submatrix. A suitable
corresponding choice of vector distribution is to distribute the vector u and v
in the same cyclic way as the matrix diagonal. (This does not guarantee though
that the owner of a vector component vj also owns a nonzero in column j.)
To obtain the communication cost of Algorithm 4.5 for this distribution, we
first examine superstep (0). Vector component vj is needed only by processors

in P (∗, φ1 (j)). A processor P (s, φ1 (j)) does not need the component vj if all
n/√p elements in the local part of matrix column j are zero; this event has
probability (1 − d)^{n/√p}. The probability that P (s, φ1 (j)) needs vj is
1 − (1 − d)^{n/√p}. Since √p − 1 processors each have to receive vj with this
probability, the expected number of receives for component vj is
(√p − 1)(1 − (1 − d)^{n/√p}). The owner of vj does not have to receive it. The
expected communication volume of the fanout is therefore
n(√p − 1)(1 − (1 − d)^{n/√p}). Since no processor is preferred, the h-relation
is expected to be balanced, so that the expected communication cost of
superstep (0) is

    T(0) = (1/√p − 1/p)(1 − (1 − d)^{n/√p}) n g.    (4.43)

Communication superstep (2) is similar to (0), with the operation of send-


ing vector components replaced by the operation of receiving partial sums.
The communication cost of superstep (2) is

T(2) = T(0) . (4.44)

Superstep (3) adds the nonzero partial sums, both those just received and
those present locally. This costs
    T(3) = (n/√p)(1 − (1 − d)^{n/√p}).    (4.45)

If g ≫ 1, which is often the case, then T(3) ≪ T(2) , so that the cost of
superstep (3) can be neglected. Finally, the synchronization cost of the whole
algorithm is 4l.
For our example of n = 1000 and p = 100, the matrix with highest density,
d = 0.1, is expected to cause a communication cost of 179.995g, which is close
to the cost of 180g for a completely dense matrix. The corresponding expected
normalized communication cost is (T(0) + T(2) )/(2dn2 /p) ≈ 0.09g. This means
that we need a parallel computer with g ≤ 11 to run our algorithm with more
than 50% efficiency.
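These numbers follow directly from eqn (4.43); a few lines of Python reproduce them. The formula used below for the cyclic row distribution, in which only the fanout communicates and each of the p − 1 non-owning processors needs vj with probability 1 − (1 − d)^{n/p}, is written out by analogy and is an assumption of this sketch rather than a formula stated in the text.

    from math import sqrt

    def fanout_cost_square(n, p, d):
        # Expected cost (in units of g) of superstep (0) for the square Cartesian
        # distribution, eqn (4.43); superstep (2) costs the same, eqn (4.44).
        return (1 / sqrt(p) - 1 / p) * (1 - (1 - d) ** (n / sqrt(p))) * n

    def fanout_cost_rows(n, p, d):
        # Analogous estimate for the cyclic row distribution (fanout only):
        # expected volume n (p - 1)(1 - (1 - d)^(n/p)), spread over p processors.
        return (1 - 1 / p) * (1 - (1 - d) ** (n / p)) * n

    n, p = 1000, 100
    for d in (0.1, 0.001):
        ideal_flops = 2 * d * n * n / p              # perfectly parallel computation
        square = 2 * fanout_cost_square(n, p, d)     # fanout plus fanin
        rows = fanout_cost_rows(n, p, d)             # fanout only
        print(d, round(square, 3), round(square / ideal_flops, 2),
              round(rows / ideal_flops, 2))
    # d = 0.1  : cost 179.995 (g), normalized 0.09 (g) for the square distribution
    # d = 0.001: normalized 0.86 (g) for the square distribution, 0.49 (g) for rows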
For matrices with very low density, the local part of a matrix column
is unlikely to have more than one nonzero. Every nonzero will thus incur a
communication. In that case a row distribution is better than a square matrix
distribution, because this saves the communications of the fanin. For our
example of n = 1000 and p = 100, the matrix with lowest density, d = 0.001,
is expected to cause a normalized communication cost of 0.86g for a square
matrix distribution and 0.49g for the cyclic row distribution.
One way of improving the performance is by tailoring the distribution
used to the sparsity pattern of the random sparse matrix. Figure 4.14 shows
a tailored distribution produced by the Mondriaan package for the mat-
rix random100. The figure gives a local view of the matrix, showing the

Fig. 4.14. Local view of sparse matrix random100 from Fig. 4.11 with n = 100,
nz = 1000, and d = 0.1, distributed by the Mondriaan package over p = 16
processors. The allowed imbalance is ǫ = 20%; the achieved imbalance is ǫ′ =
18.4%. The maximum number of nonzeros per processor is 74; the average is
62.5; and the minimum is 25. The communication volume is V = 367. The first
split is in the row direction; the next two splits are in the column direction.
The empty row and column parts created by the splits are collected in empty
rectangles.

submatrix Is × Js for each processor P (s), 0 ≤ s < 16. (This is slightly


different from the local view given in Fig. 4.8, which shows the submatrix
I¯s × J¯s , where empty row and column parts are included.) Note that the
relatively large allowed load imbalance helps reduce the communication, res-
ulting in large empty rectangles. Note that each submatrix Is ×Js is the result
of two row splits and two column splits, but in different order for different
processors.
Table 4.2 compares the theoretical communication volume for the best
pattern-independent distribution scheme, the square Cartesian distribution
based on a cyclic distribution of the matrix diagonal, with the volume for

Table 4.2. Communication volume for a random sparse


matrix of size n = 1000 and density d = 0.01 distributed
over p processors, using a pattern-independent Cartesian
distribution and a pattern-dependent distribution produced
by the Mondriaan package

  p     ǫ (in %)     V (Cartesian)     V (Mondriaan)

  2        0.8             993                814
  4        2.1            1987               1565
  8        4.0            3750               2585
 16        7.1            5514               3482
 32       11.8            7764               4388

the distribution produced by the Mondriaan package (version 1.0, run with
default parameters), averaged over a set of 100 random sparse matrices. The
volume for the Cartesian distribution is based on the cost formula eqn (4.43),
generalized to handle nearly square distributions as well, such as the 8 × 4
distribution for p = 32. For ease of comparison, the value of ǫ specified as
input to Mondriaan equals the expected load imbalance for the Cartesian
distribution. The achieved imbalance ǫ′ is somewhat below that value. It is
clear from the table that the Mondriaan distribution causes less communica-
tion, demonstrating that on average the package succeeds in tailoring a better
distribution to the sparsity pattern. For p = 32, we gain about 45%.
The vector distribution corresponding to the Cartesian distribution
satisfies distr(u) = distr(v), which is an advantage if this requirement must
be met. The volume for the Mondriaan distribution given in Table 4.2 may
increase in case of such a requirement, but at most by n, see eqn (4.12). For
p ≥ 8, the Mondriaan distribution is guaranteed to be superior in this case.
We may conclude from these results that the parallel multiplication of a
random sparse matrix and a vector is a difficult problem, most likely leading to
much communication. Only low values of g or high nonzero densities can make
this operation efficient. Using the fixed, pattern-independent Cartesian distri-
bution scheme based on a cyclic distribution of the matrix diagonal already
brings us close to the best distribution we can achieve. The load balance of
this distribution is expected to be good in most cases, as the numerical sim-
ulation and the Chernoff bound show. The distribution can be improved by
tailoring the distribution to the sparsity pattern, for example, by using the
Mondriaan package, but the improvement is modest.

4.8 Laplacian matrices


In many applications, a physical domain exists that can be distributed
naturally by assigning a contiguous subdomain to every processor. Commu-
nication is then only needed for exchanging information across the subdomain

boundaries. Often, the domain is structured as a multidimensional rectangular


grid, where grid points interact only with a set of immediate neighbours. In
the two-dimensional case, this set could for instance contain the neighbours
to the north, east, south, and west. One example of a grid application is the
Ising model used to study ferromagnetism, in particular phase transitions and
critical temperatures. Each grid point in this model represents a particle with
positive or negative spin; neighbours tend to prefer identical spins. (For more
on the Ising model, see [145].) Another example is the heat equation, where the
value at a grid point represents the temperature at the corresponding location.
The heat equation can be solved iteratively by a relaxation procedure that
computes the new temperature at a point using the old temperature of that
point and its neighbours.
An important operation in the solution of the two-dimensional heat
equation is the application of the two-dimensional Laplacian operator to
the grid, computing

∆i,j = xi−1,j + xi+1,j + xi,j+1 + xi,j−1 − 4xi,j , for 0 ≤ i, j < k, (4.46)

where xi,j denotes the temperature at grid point (i, j). The difference
xi+1,j − xi,j approximates the derivative of the temperature in the i-direction,
and the difference (xi+1,j −xi,j )−(xi,j −xi−1,j ) = xi−1,j +xi+1,j −2xi,j approx-
imates the second derivative. By convention, we assume that xi,j = 0 outside
the k × k grid; in practice, we just ignore zero terms in the right-hand side of
eqn (4.46).
We can view the k × k array of values xi,j as a one-dimensional vector v
of length n = k 2 by the identification

vi+jk ≡ xi,j , for 0 ≤ i, j < k, (4.47)

and similarly we can identify the ∆i,j with a vector u.
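In code, the domain view of eqn (4.46) and the identification (4.47) look as follows; this is a plain sequential sketch, and the parallel version would apply the same update to the local grid points after the fanout.

    def laplacian(x):
        """Apply the 2D Laplacian of eqn (4.46) to a k x k grid x (list of
        lists), with x == 0 assumed outside the grid; returns delta."""
        k = len(x)
        delta = [[0.0] * k for _ in range(k)]
        for i in range(k):
            for j in range(k):
                for di, dj in ((-1, 0), (1, 0), (0, 1), (0, -1)):
                    if 0 <= i + di < k and 0 <= j + dj < k:
                        delta[i][j] += x[i + di][j + dj]
                delta[i][j] -= 4 * x[i][j]
        return delta

    def grid_to_vector_index(i, j, k):
        # The identification (4.47): grid point (i, j) corresponds to v_{i + jk}.
        return i + j * k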

Example 4.12 Consider the 3 × 3 grid shown in Fig. 4.15. Equation (4.46)
now becomes u = Av, where
 
         −4   1   ·   1   ·   ·   ·   ·   ·
          1  −4   1   ·   1   ·   ·   ·   ·
          ·   1  −4   ·   ·   1   ·   ·   ·
          1   ·   ·  −4   1   ·   1   ·   ·            B   I3   0
    A =   ·   1   ·   1  −4   1   ·   1   ·    =      I3   B   I3   .
          ·   ·   1   ·   1  −4   ·   ·   1            0   I3   B
          ·   ·   ·   1   ·   ·  −4   1   ·
          ·   ·   ·   ·   1   ·   1  −4   1
          ·   ·   ·   ·   ·   1   ·   1  −4

The matrix A is pentadiagonal because vector components (representing grid


points) are only connected to components at distance ±1 or ±3. The holes in

(0,2) 6 7 8

(0,1) 3 4 5

0 1 2
(0,0) (1,0) (2,0)

Fig. 4.15. A 3×3 grid. For each grid point (i, j), the index i+3j of the corresponding
vector component is shown.

the subdiagonal and superdiagonal occur because points (i, j) with i = 0, 2


do not have a neighbour at distance −1 and +1, respectively. As a result,
the matrix has a block-tridiagonal structure with 3 × 3 blocks B on the main
diagonal and 3 × 3 identity matrices just below and above it.
In general, it is best to view application of the Laplacian operator as an
operation on the physical domain. This domain view has the advantage that
it naturally leads to the use of a regular data structure for storing the data.
Occasionally, however, it may also be beneficial to view the Laplacian opera-
tion as a matrix operation, so that we can apply our knowledge about sparse
matrix–vector multiplication and gain from insights obtained in that context.
Let us try to find a good distribution for the k × k grid. We adopt the
domain view, and not the matrix view, and therefore we must assign each grid
point to a processor. The resulting distribution of the grid should uniquely
determine the distribution of the matrix and the vectors in the corresponding
matrix–vector multiplication. We assign the values xi,j and ∆i,j to the owner
of grid point (i, j), and this translates into distr(u) = distr(v). It is easiest
to use a row distribution for the matrix and assign row i + jk to the same
processor as vector component ui+jk and hence grid point (i, j). (For low-
dimensional Laplacian matrices, using a square matrix distribution does not
give much advantage over a row distribution.) The resulting sparse matrix–
vector multiplication algorithm has two supersteps, the fanout and the local
matrix–vector multiplication. If a neighbouring point is on a different pro-
cessor, its value must be obtained during the fanout. The computation time
assuming an equal spread of grid points is 5k 2 /p, since eqn (4.46) gives rise
to five flops per grid point. Note that in this specific application we use the
flop count corresponding to the domain view, and not the more general count
of two flops per matrix nonzero used in other sections of this chapter; the
latter count yields ten flops per grid point. Furthermore, we do not have to
store the matrix elements explicitly, that is, we may use matrix-free storage,
cf. Section 4.2.

Fig. 4.16. Distribution of an 8 × 8 grid. (a) by strips for p = 4 processors; (b) by


strips with border corrections for p = 3; (c) by square blocks for p = 4.

The simplest distribution of the grid can be obtained by dividing it into


p strips, each of size k × k/p, assuming that k mod p = 0, see Fig. 4.16(a).
The advantage is, indeed, simplicity and the fact that communication is only
needed in the east–west direction, between a grid point (i, j) on the eastern
border of a strip and its neighbour (i+1, j), or between a point on the western
border and its neighbour (i − 1, j). The northern and southern neighbours of
a grid point are on the same processor as the point itself and hence do not
cause communication. The main disadvantage will immediately be recognized
by every Norwegian or Chilean: relatively long borders. Each processor, except
the first and the last, has to send and receive 2k boundary values. Therefore,
Tcomm, strips = 2kg. (4.48)

A related disadvantage is that p ≤ k should hold, because otherwise processors


would be idle. For p ≤ k, load balance may still be a serious problem: if
k mod p ≠ 0, some processors will contain one extra column of k points. This
problem can be solved by border corrections, jumping one point to the east
or west somewhere along the border, as shown in Fig. 4.16(b).
A better distribution can be obtained by dividing the grid into p square
blocks, each of size k/√p × k/√p, where we assume that p is a square number
and that k mod √p = 0, see Fig. 4.16(c). The borders are smaller now. In
the special case p = 4, each processor has to receive k neighbouring values,
send k − 2 values to one destination processor, and send a corner value to two
destination processors, thus sending a total of k values. The communication is
reduced by a factor of two compared with the strip distribution. In the general
case p > 4, processors not on the boundary are busiest communicating, having

to send and receive 4k/√p values, so that

    Tcomm, squares = (4k/√p) g.    (4.49)

This is a factor √p/2 less than for division into strips. The resulting
communication-to-computation ratio is

    Tcomm, squares / Tcomp, squares = (4k/√p) / (5k²/p) g = (4√p)/(5k) g.    (4.50)

The communication-to-computation ratio is often called the surface-to-


volume ratio. This term originates in the three-dimensional case, where the
volume of a domain represents the amount of computation of a processor and
the surface represents the communication with other processors.
Not only are square blocks better with respect to communication, they are
also better with respect to computation: in case of load imbalance, the surplus

of the busiest processor is at most 2⌈k/√p⌉ − 1 grid points, instead of k.
It may seem that the best we can do is to distribute the grid by square
blocks. This intuitive belief may even be stronger if you happen to use a
square processor network and are used, in the old ways, to exploiting network
proximity to optimize communication. In the BSP model, however, there is no
particular advantage in using such regular schemes. Therefore, we can freely
try other shapes for the area allocated to one processor. Consider what the
computer scientist would call the digital diamond, and the mathematician
the closed l1 -sphere, defined by

Br (c0 , c1 ) = {(i, j) ∈ Z2 : |i − c0 | + |j − c1 | ≤ r}, (4.51)

for integer radius r ≥ 0 and centre c = (c0 , c1 ) ∈ Z2 . This is the set of points
with Manhattan distance at most r to the central point c, see Fig. 4.17. The
number of points of Br (c) is 1+3+5+· · ·+(2r−1)+(2r+1)+(2r−1)+· · ·+1 =
2r2 +2r +1. The number of neighbouring points is 4r +4. If Br (c) represents a

Fig. 4.17. Digital diamond of radius r = 3 centred at c. Points inside the diamond
are shown in black; neighbouring points are shown in white.

set of grid points allocated to one processor, then the fanout involves receiving
4r+4 values. Just on the basis of receives, we may conclude that this processor
has a communication-to-computation ratio

Tcomm, diamonds 4r + 4 2 2 2p
= g≈ g≈ g, (4.52)
Tcomp, diamonds 5(2r2 + 2r + 1) 5r 5k

for large enough r, where we use the approximation r ≈ k/ 2p, obtained by
assuming that the processor has its fair share 2r2 +√2r + 1 = k 2 /p of the grid
points. The resulting asymptotic ratio is a factor 2 lower than for square
blocks, cf. eqn (4.50). This reduction is caused by using each received value
twice. Diamonds are a parallel computing scientist’s best friend.
The gain of using diamonds can only be realized if the outgoing traffic
is balanced with the incoming traffic, that is, if the number of sends hs (s)
of processor P (s) is the same as the number of receives hr (s) = 4r + 4.
The number of sends of a processor depends on which processors own the
neighbouring points. Each of the 4r border points of a diamond has to be sent
to at least one processor and at most two processors, except corner points,
which may have to be sent to three processors. Therefore, 4r ≤ hs (s) ≤ 8r +4.
To find a distribution that balances the sends and the receives, we try to
fit the diamonds in a regular pattern. Consider first the infinite lattice Z2 ; to
make mathematicians cringe we view it as a k × k grid with k = ∞; to make
matters worse we let the grid start at (−∞, −∞). We try to partition this
∞×∞ grid over an infinite number of processors using diamonds. It turns out
that we can do this by placing the diamonds in a periodic fashion at centres
c = λa + µb, λ, µ ∈ Z, where a = (r, r + 1) and b = (−r − 1, r). Part of
this infinite partitioning is shown in Fig. 4.18. The centres of the diamonds
form a lattice defined by the orthogonal basis vectors a, b. We leave it to the
mathematically inclined reader to verify that the diamonds Br (λa + µb) are
indeed mutually disjoint and that they fill the whole domain. It is easy to
see that each processor sends hs (s) = 4r + 4 values. This distribution of an
infinite grid over an infinite number of processors achieves the favourable ratio
of eqn (4.52).
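For small r, the claim that the diamonds Br (λa + µb) are mutually disjoint and fill the whole plane can also be checked mechanically: the sketch below counts, for every point in a window, the number of lattice centres whose diamond contains it, and finds exactly one in every case. The window size and search range are arbitrary choices of the sketch.

    def owners(point, r, span=12):
        """Lattice centres c = lam*a + mu*b, with a = (r, r+1) and
        b = (-r-1, r), whose diamond B_r(c) contains the given point."""
        i, j = point
        result = []
        for lam in range(-span, span + 1):
            for mu in range(-span, span + 1):
                c0, c1 = lam * r - mu * (r + 1), lam * (r + 1) + mu * r
                if abs(i - c0) + abs(j - c1) <= r:
                    result.append((lam, mu))
        return result

    r = 3
    assert all(len(owners((i, j), r)) == 1 for i in range(10) for j in range(10))
    print("each point of the 10 x 10 window lies in exactly one diamond")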
Practical computational grids and processor sets are, alas, finite and there-
fore the region of interest must be covered using a finite number of diamonds.
Sometimes, the shape of the region allows us to use diamonds directly without
too much waste. In other situations, many points from the covering diamonds
fall outside the region of interest. These points are then discarded and the
remaining pieces of diamonds are combined, assigning several pieces to one
processor, such that each processor obtains about the same number of grid
points.
Figure 4.18 shows a 10 × 10 grid partitioned into one complete and seven
incomplete diamonds by using the infinite diamond partitioning. This can

Fig. 4.18. Distribution of a 10 × 10 grid by digital diamonds of radius r = 3. Each


complete diamond has 25 grid points. Pieces of incomplete diamonds can be
combined, assigning them to the same processor.

be transformed into a partitioning over four processors by assigning the two


incomplete diamonds near the lower-left corner of the picture to P (0); the
two incomplete diamonds near the lower-right corner to P (1); the complete
diamond to P (2); and the remaining pieces to P (3). This assigns 24, 27, 25,
24 grid points to the respective processors. The processor with the complete
diamond has to communicate most: hs (2) = 15 and hr (2) = 14. Thus the
communication cost is close to the maximum 4r + 4 = 16 for diamonds of
radius r = 3. It is possible to improve this assignment, for instance by border
corrections. (What is the best assignment you can find?) In general, to find
an optimal assignment of pieces to processors is a hard optimization problem,
and perhaps the best approach is to use a heuristic, for instance assigning
pieces to processors in order of decreasing size. In a few cases, we are lucky
and we can fit all the pieces together creating complete diamonds. An example
is the 25 × 25 grid, which can be covered exactly with 25 diamonds of radius
r = 3. (For the curious, a picture can be found in [22].)
The main disadvantage of using diamonds is that it is much more complic-
ated than using blocks. To make the partitioning method based on diamonds
practical, it must be simplified. Fortunately, this can be done by a small
modification, discarding one layer of points from the north-eastern and

Fig. 4.19. Basic cell of radius r = 3 assigned to a processor. The cell has 18 grid
points, shown in black. Grid points outside the cell are shown in white. Grid
points on the thick lines are included; those on the thin lines are excluded. The
cell contains 13 grid points that are closer to the centre than to a corner of the
enclosing square and it contains five points at equal distance. The cell has 14
neighbouring grid points.

south-eastern border of the diamond, as shown in Fig. 4.19. For r = 3, the


number of points decreases from 25 (see Fig. 4.17) to 18. The resulting set of
points can be seen as the set of points closer to the centre than to the corner
of the enclosing large square, and such a set is called a Voronoi cell. Ties
are broken by assigning border points to the interior for the western borders,
and to the exterior for the eastern borders. Only one corner is assigned to the
interior. This way, we obtain a basic cell that can be repeated and used to fill
the whole space. We can place the diamonds in a periodic fashion at centres
c = λa + µb, λ, µ ∈ Z, where a = (r, r) and b = (−r, r).
Figure 4.20 shows how the basic cell is used to distribute a 12 × 12 grid
over eight processors. Five cells are complete (black, green, pink, gold, and
brown); the dark blue processor has points near the left and right border of the
grid; the light blue processor has points near the top and bottom; and the red
processor has points near the four corners. The pieces have been combined by
treating the grid boundaries as periodic. Each processor has 18 grid points.
The communication volume is 104 and hs = hr = 14; the BSP cost of the
computation is 90 + 14g + 2l.
A different way of partitioning a k × k grid is to translate it into the
corresponding k 2 ×k 2 matrix and vectors of length k 2 , then let the Mondriaan
package find a good data distribution for the corresponding sparse matrix–
vector multiplication, and translate this back into a grid distribution. As
before, we impose distr(u) = distr(v) for the vectors and use a row distribu-
tion for the matrix. Figure 4.21 shows the result for an allowed load imbalance
ǫ = 10%. Note that Mondriaan (the package!) prefers stepwise borders, similar
to the borders of a digital diamond. The communication volume is V = 85,
which is lower than for the regular distribution of Fig. 4.20. The reason is

Fig. 4.20. Distribution of a 12 × 12 grid over eight processors obtained by using the
basic cell from Fig. 4.19 (see also Plate 4).

Fig. 4.21. Distribution of a 12 × 12 grid over eight processors produced by the


Mondriaan package (see also Plate 5).

that Mondriaan manages to make each processor subdomain connected, thus


reducing the communication compared with the regular case where discon-
nected subdomains occur. The achieved load imbalance in the corresponding
matrix–vector multiplication is ǫ′ = 8.3%. The maximum number of vector
components per processor is 20; the average is 18. We had to allow such a
load imbalance, because otherwise Mondriaan could not succeed in keeping
the communication low. (For ǫ = 3%, the result is perfect load balance, ǫ′ = 0,
but disconnected subdomains and a high volume, V = 349.) The BSP cost
of the computation is 91 + 16g + 2l, where we have taken into account that
points on the grid boundary have fewer flops associated with them. Running
Mondriaan takes much more time than performing one Laplacian operation
on the grid, but the effort of finding a good distribution has to be spent only
once, and the cost can be amortized over many parallel computations on the
same grid.
The crucial property of our matrix–vector multiplication algorithm applied
to grid computation is that a value needed by another processor is sent only
once, even if it is used several times. Our cost model reflects this, thus encour-
aging reuse of communicated data. If a partitioning method based on this cost
model is applied to a regular grid, then the result will be diamond-like shapes
of the subdomain, as shown in Fig. 4.21.
The prime application of diamond-shaped partitioning will most likely be
in three dimensions, where the number of grid points at the boundary of a
subdomain is relatively large compared to the number of interior points. If
a processor has a cubic block of N = k³/p points, about 6k²/p^{2/3} = 6N^{2/3}
are boundary points; in the two-dimensional case this is only 4N^{1/2}. For
example, if a processor has a 10 × 10 × 10 block, 488 points are on the pro-
cessor boundary. In three dimensions, communication is important. Based on
the surface-to-volume ratio of a three-dimensional digital diamond, we can
expect a reduction by a factor √3 ≈ 1.73 in communication cost, which is
certainly worthwhile. The reduction achieved in reality depends on whether
we manage to fill three-dimensional space with shapes that closely resemble
digital diamonds.
The basic cell that suits our purpose is a truncated octahedron, shown in
Fig. 4.22. The surface parts have been carefully assigned to the interior or
exterior, so that the whole space can be filled with nonoverlapping copies of
the cell, that is, with no point in space belonging to more than one cell. This
has been achieved by a fair assignment of the faces, edges, and vertices of the
cell. For instance, the front square is included in the cell but the back square
is not. A cell of radius r is enclosed by a cube with edge length 2r. We can fill
space with such cubes, and place copies of the basic cell at the corners and
centres of these cubes. As a result, cells are centred at points (λr, µr, νr), with
λ, µ, ν three even integers, or three odd integers. This set of centre points is
called the body-centred cubic (BCC) lattice. Each centre point represents
a processor. As in the two-dimensional case, the basic cell is a Voronoi cell,
Fig. 4.22. Basic three-dimensional cell assigned to a processor. The cell is defined
as the set of grid points that fall within a truncated octahedron. The boundaries
of this truncated octahedron are included/excluded as follows. Included are the
four hexagons and three squares visible at the front (which are enclosed by solid
lines), the twelve edges shown as thick solid lines, and the six vertices marked
in black. The other faces, edges, and vertices are excluded. The enclosing cube
is shown for reference only. Neighbouring cells are centred at the eight corners
of the cube and the six centres of neighbouring cubes.

since grid points of the cell are closer to its centre than to the centres of other
cells (with fair tie breaking).
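A possible way to implement this assignment is to give each grid point to the nearest cell centre, as in the following sketch (an illustration; the function name is an assumption, and ties on the cell boundary are broken here simply in favour of the even sublattice instead of by the careful face/edge/vertex assignment of Fig. 4.22).

#include <math.h>

/* Find the centre of the basic cell owning grid point (x,y,z), as the
   nearest point of the BCC lattice of centres (lambda*r, mu*r, nu*r) with
   lambda, mu, nu all even or all odd. The truncated octahedron is the
   Voronoi cell of this lattice, so nearest-centre assignment reproduces
   the cells of Fig. 4.22 up to boundary ties. */
void owning_centre(int x, int y, int z, int r, int centre[3]){
    int xyz[3]= {x, y, z};
    double ce[3], co[3], dist_even= 0.0, dist_odd= 0.0;
    int i;

    for(i=0; i<3; i++){
        ce[i]= 2.0*r*floor(xyz[i]/(2.0*r) + 0.5);         /* even sublattice */
        co[i]= 2.0*r*floor((xyz[i]-r)/(2.0*r) + 0.5) + r; /* odd sublattice  */
        dist_even += (xyz[i]-ce[i])*(xyz[i]-ce[i]);
        dist_odd  += (xyz[i]-co[i])*(xyz[i]-co[i]);
    }
    for(i=0; i<3; i++)
        centre[i]= (int)((dist_even <= dist_odd) ? ce[i] : co[i]);
}

The centre found in this way identifies the owning processor; in a program one would map centre coordinates to a processor number, for instance by a lookup table.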
The number of grid points of the basic cell is 4r^3. A careful count shows
that the number of points on the surface is 9r^2 + 2 and that the number of
sends and receives is 9r^2 + 6r + 2. The resulting communication-to-computation
ratio is

\[
\frac{T_{\mathrm{comm,\ truncated\ octahedron}}}{T_{\mathrm{comp,\ truncated\ octahedron}}}
   = \frac{9r^2 + 6r + 2}{7 \cdot 4r^3}\, g \approx \frac{9}{28r}\, g
   \approx \frac{9(4p)^{1/3}}{28k}\, g, \qquad (4.53)
\]

which is better by a factor of 1.68 than the ratio 6p^{1/3}/(7k) for blocks. In
an actual implementation, it may be most convenient to use a cubic array as
local data structure, with an extra layer of points along each border, and to
fill this array only partially. This enables the use of a regular array structure
while still reducing the communication.
Table 4.3 shows the communication cost for various distributions of a
two-dimensional grid with k = 1024 and a three-dimensional grid with
k = 128. For the two-dimensional Laplacian, the ideal case for the rectangular
block distribution occurs for p = q^2, that is, for p = 4, 16, 64, since the
local subdomains then become square blocks. For p = 2q^2, that is, for
p = 2, 8, 32, 128, the blocks become rectangles with an aspect ratio 2 : 1.
In contrast, the ideal case for the diamond distribution is p = 2q^2. To handle
the nonideal case p = q^2 as well, the diamond distribution is generalized
by stretching the basic cell in one direction, giving a communication cost of
4kg/√p. The table shows that the diamond distribution is better than the
rectangular block distribution for p = 2q^2 and performs the same for p = q^2,
except in the special case of small p, where the boundaries of the grid play a
prominent role. For the three-dimensional Laplacian, only the ideal case for
the diamond distribution, p = 2q^3, is shown. (The other cases p = q^3 and
p = 4q^3 are more difficult to treat, requiring a generalization of the diamond
distribution based on stretching the basic cell. An application developer hav-
ing 32 or 64 processors at his disposal might be motivated to implement this
kind of generalization.) We observe a reduction by a factor of 1.71 for p = 16
compared with the block distribution. Asymptotically, for large radius r, the
reduction factor is 16/9 ≈ 1.78 in the case p = 2q^3.
For comparison, Table 4.3 also presents results obtained by using the
Mondriaan package (version 1.0) to produce a row distribution of the
Laplacian matrix and a corresponding distribution of the grid. For the Mon-
driaan distribution, the allowed load imbalance for the corresponding matrix
is ε = 10%. The communication cost given is the average over 100 runs of
the Mondriaan program, each time with a different seed of the random num-
ber generator used. In three dimensions, the Mondriaan distribution is better
than blocks and for large local subdomains (such as for p = 16) it comes close
to the performance of diamonds.

Table 4.3. Communication cost (in g) for a Laplacian operation on a grid,
using distributions based on rectangular blocks and diamond cells, and a
distribution produced by the Mondriaan package

Grid            p     Rectangular   Diamond   Mondriaan
1024 × 1024     2     1024          2046      1024
                4     1024          2048      1240
                8     1280          1026      1378
                16    1024          1024      1044
                32    768           514       766
                64    512           512       548
                128   384           258       395
64 × 64 × 64    16    4096          2402      2836
                128   1024          626       829
4.9 Remainder of BSPlib: example function bspmv


The function bspmv is an implementation of Algorithm 4.5 for sparse
matrix–vector multiplication. It can handle every possible data distribution
for the matrix and vectors. Before executing the algorithm, each processor
builds its own local data structure for representing the local part of the
sparse matrix. The local nonempty rows are numbered i = 0, . . . , nrows − 1,
where nrows = |Is |. The global index of the row with local index i is given by
i = rowindex[i]. Similarly, the global index of the column with local index j
is given by j = colindex[j], for 0 ≤ j < ncols. (The local indices i and j of
the matrix data structure are distinct from those of the vectors.) The nonzeros
are stored in order of increasing local row index i. The nonzeros of each local
row are stored consecutively in increasing order of local column index j, using
the ICRS data structure. The kth nonzero is stored as a pair (a[k], inc[k]),
where a[k] is the numerical value of the nonzero and inc[k] the increment in
the local column index.
This data structure is convenient for use in repeated sparse matrix–vector
multiplication. Building the structure, however, requires quite a bit of pre-
processing on input. An outline of the input preprocessing is as follows. Each
triple (i, j, aij ) is read from an input file and sent to the responsible processor,
as determined by the matrix distribution of the file. The local triples are then
sorted by increasing global column index, which enables conversion to local
column indices. During the conversion, the global indices are registered in
colindex. The triples are sorted again, this time by global row index, taking
care that the original mutual precedences are maintained between triples from
the same matrix row. The global row indices are then converted to local ones
and the array rowindex is initialized.
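To show how the final data structure can be built from the sorted triples, here is a sketch (an illustration; the function and array names ilocal, jlocal, val are assumptions, and BSPedupack's own input phase may differ in detail). It follows the increment convention documented in bspmv below; from the pointer loop in bspmv it also appears that one increment past the last nonzero is read, so a sentinel is appended, which means inc must have room for nz + 1 integers.

/* Build the ICRS arrays a and inc from nz local nonzeros already sorted
   by local row and, within a row, by local column. ilocal[k], jlocal[k],
   val[k] give the local row index, local column index, and value of
   nonzero k. inc[k] is the difference between the local column indices of
   nonzeros k and k-1, plus ncols if nonzero k starts a new row; the column
   index of the -1'th nonzero is 0, and the very first nonzero gets no
   extra ncols. A sentinel inc[nz] lets the pointer loop leave the last row. */
void triples_to_icrs(int nz, int ncols,
                     int *ilocal, int *jlocal, double *val,
                     double *a, int *inc){
    int k, prevrow, prevcol= 0;

    prevrow= (nz > 0 ? ilocal[0] : -1); /* no row jump for the first nonzero */
    for(k=0; k<nz; k++){
        a[k]= val[k];
        inc[k]= jlocal[k] - prevcol;
        if (ilocal[k] != prevrow)
            inc[k] += ncols; /* jump to the next nonempty row */
        prevrow= ilocal[k];
        prevcol= jlocal[k];
    }
    inc[nz]= ncols - prevcol; /* sentinel: as if a next row starts at column 0 */
}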
The nonzeros must be sorted with care. Sequentially, the nz(A) nonzeros of
a sparse matrix A can be sorted by row in time O(nz(A)+n), simply by count-
ing the number of nonzeros in each row during one pass through the nonzeros,
allocating exactly the right amount of memory space for each row, and filling
the space in a second pass. In parallel, it is more difficult to sort efficiently,
because the range of possible global indices remains 0, . . . , n − 1, while the
number of local nonzeros decreases to nz(A)/p. Clearly, such O(nz(A)/p + n)
behaviour for a straightforward sort by row index is nonscalable and hence
unacceptable. Fortunately, a radix sort (see [46]) with radix r = √n will do the
job. This method first sorts the triples by using i mod r as a key, and then in a
second pass sorts the indices by using i div r as a key, maintaining the original
mutual precedences between triples that have the same key for the second pass.
The total time and memory needed is about O(nz(A)/p + n/r + r), which is
minimal for the choice r = √n. We choose the radix to be a power of two close
to √n, because of the cheaper modular arithmetic for powers of two. This
sorting procedure is scalable in time and memory by our definition in Section 3.5,
because √n = O(n/p + p). The driver program bspmv_test (not printed here
because of its length, but included in BSPedupack) implements the complete
input phase, for matrix as well as vectors.
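A sequential sketch of the two-pass radix sort described above is given here for illustration (the function names sort_triples_by_row and counting_sort, and the triple arrays ia, ja, va, are assumptions; the sort runs locally on each processor's triples).

#include <stdlib.h>

/* One stable counting-sort pass on nnz triples (ia, ja, va) with key[k]
   in 0..nbuckets-1. Stability preserves the order within equal keys. */
static void counting_sort(int nnz, int nbuckets, int *key,
                          int *ia, int *ja, double *va){
    int *count= calloc(nbuckets, sizeof(int));
    int *ia2= malloc(nnz*sizeof(int));
    int *ja2= malloc(nnz*sizeof(int));
    double *va2= malloc(nnz*sizeof(double));
    int k, b, start= 0;

    for(k=0; k<nnz; k++)
        count[key[k]]++;
    for(b=0; b<nbuckets; b++){      /* exclusive prefix sums: first index per key */
        int c= count[b];
        count[b]= start;
        start += c;
    }
    for(k=0; k<nnz; k++){           /* stable scatter into fresh arrays */
        int pos= count[key[k]]++;
        ia2[pos]= ia[k]; ja2[pos]= ja[k]; va2[pos]= va[k];
    }
    for(k=0; k<nnz; k++){ ia[k]= ia2[k]; ja[k]= ja2[k]; va[k]= va2[k]; }
    free(va2); free(ja2); free(ia2); free(count);
}

/* Sort the local triples by global row index i in two passes: first by
   i mod r, then by i div r, where r = 2^logr is a power of two close to
   sqrt(n). The second, stable pass keeps the order of the first pass. */
void sort_triples_by_row(int n, int nnz, int logr,
                         int *ia, int *ja, double *va){
    int r= 1<<logr, k;
    int *key= malloc(nnz*sizeof(int));

    for(k=0; k<nnz; k++) key[k]= ia[k] & (r-1);  /* i mod r */
    counting_sort(nnz, r, key, ia, ja, va);
    for(k=0; k<nnz; k++) key[k]= ia[k] >> logr;  /* i div r */
    counting_sort(nnz, (n+r-1)/r, key, ia, ja, va);
    free(key);
}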
The relation between the vector variables of Algorithm 4.5 and the func-
tions bspmv and bspmv init is as follows. Vector component vj corresponds to
a local component v[k] in P (φv (j)), where j = vindex[k]. All the needed vec-
tor components vj , whether obtained from other processors or already present
locally, are written into a local array vloc, which has the same local indices
as the matrix columns; vloc[j] stores a copy of vj , where j = colindex[j].
This copy is obtained using a bsp get, because the receiver knows it needs
the value. The processor from which to get the value vj has processor number
φv (j) = srcprocv[j] and this number is stored beforehand by the initial-
ization function bspmv init. This way, the source processor needs to be
determined only once and its processor number can be used without addi-
tional cost in repeated application of the matrix–vector multiplication. We
also need to determine the location in the source processor where vj resides.
This location is stored as the local index srcindv[j].
The partial sum uit = sum is computed by pointer magic, to be explained
later on, and this sum is immediately sent to the processor P (φu (i)) that
computes ui . A convenient way of sending this value is by using the bsp send
primitive, which is the core primitive of bulk synchronous message
passing, a new style of communication introduced in this section. The
function bspmv is the first occasion where we see the five important bulk
synchronous message passing primitives of BSPlib in action. (An additional
high-performance primitive is presented in Exercise 10.) The bsp send primit-
ive allows us to send data to a given processor without specifying the location
where the data is to be stored. One can view bsp send as a bsp put with a
wildcard for the destination address. In all other aspects, bsp send acts like
bsp put; in particular, it performs a one-sided communication since it does not
require any activity by the receiver in the same superstep. (In the next super-
step, however, the receiver must do something if it wants to use the received
data, see below.) The bsp send primitive is quite unlike traditional message
passing primitives, which require coordinated action of a sender and a receiver.
The reason for the existence of bsp send is nicely illustrated by
Superstep 2 of bspmv, which employs bsp send to send a nonzero partial sum
uit to processor P (φu (i)). The information whether a nonzero partial sum for
a certain row exists is only available at the sender. As a consequence, a send-
ing processor does not know what the other processors send. Furthermore,
processors do not know what they will receive. If we were to use bsp put
statements, we would have to specify a destination address. One method of
doing this is by having each receiving processor reserve memory space to store
p partial sums uit for each of its vector components ui . If this is done, the pro-
cessor that computes a partial sum uit can write it directly into the memory
cell reserved for it on P (φu (i)). Unfortunately, the amount of reserved local
memory, of the order p · n/p = n cells, is p times larger than the memory
needed for the vectors and a large part of this memory may never be used for
writing nonzero partial sums. Furthermore, this method also requires O(n)
computing time, since all memory cells must be inspected. Thus, this method
is nonscalable both in time and memory. An alternative is a rather clumsy
method that may be termed the ‘three-superstep’ approach. In the first super-
step, each processor tells each of the other processors how many partial sums
it is going to send. In the second superstep, each receiving processor reserves
exactly the required amount of space for each of the senders, and tells them
the address from which they can start writing. Finally, in the third superstep,
the partial sums are put as pairs (i, uit ). Fortunately, we can organize the com-
munication in a more efficient and more elegant way by using the bsp send
primitive instead of bsp put. This is done in the function bspmv. Anyone
writing programs with irregular communication patterns will be grateful for
the existence of bsp send!
The bsp send primitive sends a message which consists of a tag and a
payload. The tag is used to identify the message; the payload contains the
actual data. The use of the bsp send primitive is illustrated by the top part of
Fig. 4.23. In our case, the tag is an index corresponding to i and the payload
is the partial sum uit . The syntax is
bsp_send(pid, tag, source, nbytes);
Here, int pid is the identity of the destination processor; void *tag is a
pointer to the tag; void *source is a pointer to the source memory from

Fig. 4.23. Send operation from BSPlib. The bsp send operation copies nbytes of
data from the local processor bsp pid into a message, adds a tag, and sends this
message to the specified destination processor pid. Here, the pointer source
points to the start of the data to be copied. In the next superstep, the bsp move
operation writes at most maxnbytes from the message into the memory area
specified by the pointer dest.
which the data to be sent are read; int nbytes is the number of bytes to
be sent. In our case, the number of the destination processor is available as
φu (i) = destprocu[i], which has been initialized beforehand by the function
bspmv init. It is important to choose a tag that enables the receiver to handle
the payload easily. Here, the receiver needs to know to which vector component
the partial sum belongs. We could have used the global index i as a tag,
but then this index would have to be translated on receipt into the local
index i used to access u. Instead, we use the local index directly. Note that
in this case the tag need not identify the source processor, since its number is
irrelevant.
The message to be sent using the bsp send primitive is first stored by the
system in a local send buffer. (This implies that the tag and source variable
can be reused immediately.) The message is then sent and stored in a buffer
on the receiving processor. The send and receive buffers are invisible to the
user (but there is a way of emptying the receive buffer, as you may guess).
Some time after the message has been sent, it becomes available on the
receiving processor. In line with the philosophy of the BSP model, this hap-
pens at the end of the current superstep. In the next superstep, the messages
can be read; reading messages means moving them from the receive buffer into
the desired destination memory. At the end of the next superstep all remain-
ing unmoved messages will be lost. This is to save buffer memory and to force
the user into the right habit of cleaning his desk at the end of the day. (As said
before, the BSP model and its implementation BSPlib are quite paternalistic.
They often force you to do the right thing, for lack of alternatives.) The syntax
of the move primitive is
bsp_move(dest, maxnbytes);
Here, void *dest is a pointer to the destination memory where the data are
written; int maxnbytes is an upper bound on the number of bytes of the
payload that is to be written. This is useful if only part of the payload needs
to be retrieved. The use of the bsp move primitive is illustrated by the bottom
part of Fig. 4.23.
In our case, the payload of a message is one double, which is written in its
entirety into sum, so that maxnbytes = SZDBL.
The header information of a message consists of the tag and the length of
the payload. This information can be retrieved by the statement
bsp_get_tag(status, tag);
Here, int *status is a pointer to the status, which equals −1 if the buffer is
empty; otherwise, it equals the length of the payload in bytes. Furthermore,
void *tag is a pointer to the memory where the tag is written. The status
information can be used to decide whether there is an unread message, and if
so, how much space to allocate for it. In our case, we know that each payload
has the same fixed length SZDBL.
We could have used the status in the termination criterion of the loop in
Superstep 3, to determine whether we have handled all partial sums. Instead,
we choose to use the enquiry primitive
bsp_qsize(nmessages, nbytes);
Here, int *nmessages is a pointer to the total number of messages received
in the preceding superstep, and int *nbytes is a pointer to the total number
of bytes received. In our program, we only use bsp qsize to determine the
number of iterations of the loop, that is, the number of partial sums received.
In general, the bsp qsize primitive is useful for allocating the right amount
of memory for storing the received messages. Here, we do not need to allocate
memory, because we process and discard the messages immediately after we
read them. The name bsp qsize derives from the fact that we can view
the receive buffer as a queue: messages wait patiently in line until they are
processed.
In our program, the tag is an integer, but in general it can be of any type.
The size of the tag in bytes is set by
bsp_set_tagsize(tagsz);
On input, int *tagsz points to the desired tag size. As a result, the system
uses the desired tag size for all messages to be sent by bsp send. The function
bsp set tagsize takes effect at the start of the next superstep. All processors
must call the function with the same tag size. As a side effect, the contents of
tagsz will be modified, so that on output it contains the previous tag size of
the system. This is a way of preserving the old value, which can be useful if
an initial global state of the system must be restored later.
In one superstep, an arbitrary number of communication operations can be
performed, using either bsp put, bsp get, or bsp send primitives, and they
can be mixed freely. The only practical limitation is imposed by the amount
of buffer memory available. The BSP model and BSPlib do not favour any
particular type of communication, so that it is up to the user to choose the
most convenient primitive in a given situation.
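As a compact illustration of the bulk synchronous message passing primitives outside the context of bspmv, here is a toy BSPlib program (not part of BSPedupack; a sketch under the assumption of a standard BSPlib installation) in which every processor sends one tagged double to its right-hand neighbour and retrieves whatever arrived in the next superstep.

#include <stdio.h>
#include "bsp.h"

void bspmsg(void){
    int p, s, tagsz, tag, status, nmsg, nbytes, k;
    double x, y;

    bsp_begin(bsp_nprocs());
    p= bsp_nprocs();
    s= bsp_pid();

    tagsz= sizeof(int);
    bsp_set_tagsize(&tagsz);       /* takes effect in the next superstep */
    bsp_sync();

    x= 1.0*s;
    tag= s;                        /* identify the sender in the tag */
    bsp_send((s+1)%p, &tag, &x, sizeof(double));
    bsp_sync();

    bsp_qsize(&nmsg, &nbytes);     /* how many messages arrived? */
    for(k=0; k<nmsg; k++){
        bsp_get_tag(&status, &tag);
        bsp_move(&y, sizeof(double));
        printf("Proc %d: received %f from proc %d\n", s, y, tag);
    }
    bsp_end();
}

int main(int argc, char **argv){
    bsp_init(bspmsg, argc, argv);
    bspmsg();
    return 0;
}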
The local matrix–vector multiplication in Superstep 2 is an implement-
ation of Algorithm 4.4 for the local data structure, modified to handle a
rectangular nrows × ncols matrix. The inner loop of the multiplication has
been optimized by using pointer arithmetic. For once deviating from our
declared principles, we sacrifice readability here because this loop is expec-
ted to account for a large proportion of the computing time spent, and
because pointer arithmetic is the raison d’être of the ICRS data structure.
The statement
*psum += (*pa) * (*pvloc);
is a translation of
sum += a[k] * vloc[j];
We move through the array a by incrementing pa (i.e. the pointer to a) and
do the same for the inc array. Instead of using an index j to access vloc, we
use a pointer pvloc; after a nonzero has been processed, this pointer is moved
*pinc = inc[k] places forward.
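For comparison, the same traversal written with explicit indices might look as follows (a sketch; the wrapper function name is an assumption, and the pointer version in bspmv is the one actually used). Like the pointer version, it reads one increment past the last nonzero, hence the sentinel mentioned earlier.

#include "bspedupack.h"

/* Indexed equivalent of the pointer loop in Superstep 2 of bspmv:
   k runs over the local nonzeros, j is the running local column index. */
void icrs_mv_send(int nrows, int ncols, double *a, int *inc, double *vloc,
                  int *destprocu, int *destindu){
    int i, j, k= 0;
    double sum;

    j= inc[0];
    for(i=0; i<nrows; i++){
        sum= 0.0;
        while(j<ncols){
            sum += a[k]*vloc[j];
            k++;
            j += inc[k];
        }
        bsp_send(destprocu[i], &destindu[i], &sum, SZDBL);
        j -= ncols;
    }
}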
The initialization function bspmv init first reveals the owner of each
global index i, storing its number in a temporary array tmpproc that can
be queried by all processors. As a result, every processor can find answers
to questions such as: who is the owner of ui and where is this compon-
ent parked? For scalability, the temporary array is itself distributed, and
this is done by the cyclic distribution. In addition, the local index on the
owning processor is stored in an array tmpind. The temporary arrays are
then queried to initialize the local lists used for steering the communications.
For example, vector component vj with j = jglob = colindex[j], which is
needed for local matrix column j, can be obtained from the processor whose
number is stored in array tmpprocv on processor P (j mod p), in location
j div p. A suitable mental picture is that of a collection of notice boards:
every processor first announces the availability of its vector components on
the appropriate notice boards and then reads the announcements that concern
the components it needs. We finish bspmv init by deregistering and freeing
memory. Note that we deregister memory in Superstep 3 but deallocate it
only in Superstep 4. This is because deregistration takes effect only at the
end of Superstep 3; allocated memory must still exist at the time of actual
deregistration.
The main purpose of the program bspmv is to explain the bulk synchronous
message passing primitives. It is possible to optimize the program further,
by performing a more extensive initialization, so that all data for the same
destination can be sent together in one block. This can even be done using
puts! (Such optimization is the subject of Exercise 10.)
The program text is:
#include "bspedupack.h"

void bspmv(int p, int s, int n, int nz, int nrows, int ncols,
double *a, int *inc,
int *srcprocv, int *srcindv, int *destprocu,
int *destindu, int nv, int nu, double *v, double *u){

/* This function multiplies a sparse matrix A with a
dense vector v, giving a dense vector u=Av.
A is n by n, and u,v are vectors of length n.
A, u, and v are distributed arbitrarily on input.
They are all accessed using local indices, but the local
matrix indices may differ from the local vector indices.
The local matrix nonzeros are stored in an incremental
compressed row storage (ICRS) data structure defined by
nz, nrows, ncols, a, inc.

All rows and columns in the local data structure are
nonempty.

p is the number of processors.


s is the processor number, 0 <= s < p.
n is the global size of the matrix A.
nz is the number of local nonzeros.
nrows is the number of local rows.
ncols is the number of local columns.

a[k] is the numerical value of the k’th local nonzero of
the sparse matrix A, 0 <= k < nz.
inc[k] is the increment in the local column index of the
k’th local nonzero, compared to the column index
of the (k-1)th nonzero, if this nonzero is in the
same row; otherwise, ncols is added to the
difference. By convention, the column index of the
-1’th nonzero is 0.

srcprocv[j] is the source processor of the component in v
corresponding to the local column j, 0 <= j < ncols.
srcindv[j] is the local index on the source processor
of the component in v corresponding to the local
column j.
destprocu[i] is the destination processor of the partial sum
corresponding to the local row i, 0 <= i < nrows.
destindu[i] is the local index in the vector u on the
destination processor corresponding to the
local row i.

nv is the number of local components of the input vector v.


nu is the number of local components of the output vector u.
v[k] is the k’th local component of v, 0 <= k < nv.
u[k] is the k’th local component of u, 0 <= k < nu.
*/

int i, j, k, tagsz, status, nsums, nbytes, *pinc;


double sum, *psum, *pa, *vloc, *pvloc, *pvloc_end;

/****** Superstep 0. Initialize and register ******/


for(i=0; i<nu; i++)
u[i]= 0.0;
vloc= vecallocd(ncols);
bsp_push_reg(v,nv*SZDBL);
tagsz= SZINT;
bsp_set_tagsize(&tagsz);
bsp_sync();

/****** Superstep 1. Fanout ******/


for(j=0; j<ncols; j++)
bsp_get(srcprocv[j],v,srcindv[j]*SZDBL,&vloc[j],SZDBL);
bsp_sync();

/****** Superstep 2. Local matrix--vector multiplication
and fanin */
psum= &sum;
pa= a;
pinc= inc;
pvloc= vloc;
pvloc_end= pvloc + ncols;

pvloc += *pinc;
for(i=0; i<nrows; i++){
*psum= 0.0;
while (pvloc<pvloc_end){
*psum += (*pa) * (*pvloc);
pa++;
pinc++;
pvloc += *pinc;
}
bsp_send(destprocu[i],&destindu[i],psum,SZDBL);
pvloc -= ncols;
}
bsp_sync();

/****** Superstep 3. Summation of nonzero partial sums ******/


bsp_qsize(&nsums,&nbytes);
bsp_get_tag(&status,&i);
for(k=0; k<nsums; k++){
/* status != -1, but its value is not used */
bsp_move(&sum,SZDBL);
u[i] += sum;
bsp_get_tag(&status,&i);
}

bsp_pop_reg(v);
vecfreed(vloc);

} /* end bspmv */

int nloc(int p, int s, int n){


/* Compute number of local components of processor s for vector
of length n distributed cyclically over p processors. */

return (n+p-s-1)/p ;

} /* end nloc */

void bspmv_init(int p, int s, int n, int nrows, int ncols,
int nv, int nu, int *rowindex, int *colindex,
int *vindex, int *uindex, int *srcprocv,
int *srcindv, int *destprocu, int *destindu){

/* This function initializes the communication data structure
needed for multiplying a sparse matrix A with a dense
vector v, giving a dense vector u=Av.

Input: the arrays rowindex, colindex, vindex, uindex,
containing the global indices corresponding to the local
indices of the matrix and the vectors.
Output: initialized arrays srcprocv, srcindv, destprocu,
destindu containing the processor number and the local
index on the remote processor of vector components
corresponding to local matrix columns and rows.

p, s, n, nrows, ncols, nv, nu are the same as
in bspmv.

rowindex[i] is the global index of the local row
i, 0 <= i < nrows.
colindex[j] is the global index of the local column
j, 0 <= j < ncols.
vindex[j] is the global index of the local v-component
j, 0 <= j < nv.
uindex[i] is the global index of the local u-component
i, 0 <= i < nu.

srcprocv, srcindv, destprocu, destindu are the same as in bspmv.
*/

int nloc(int p, int s, int n);


int np, i, j, iglob, jglob, *tmpprocv, *tmpindv, *tmpprocu,
*tmpindu;

/****** Superstep 0. Allocate and register temporary arrays */


np= nloc(p,s,n);
tmpprocv=vecalloci(np); bsp_push_reg(tmpprocv,np*SZINT);
tmpindv=vecalloci(np); bsp_push_reg(tmpindv,np*SZINT);
tmpprocu=vecalloci(np); bsp_push_reg(tmpprocu,np*SZINT);
tmpindu=vecalloci(np); bsp_push_reg(tmpindu,np*SZINT);
bsp_sync();

/****** Superstep 1. Write into temporary arrays ******/


for(j=0; j<nv; j++){
jglob= vindex[j];
/* Use the cyclic distribution */
bsp_put(jglob%p,&s,tmpprocv,(jglob/p)*SZINT,SZINT);
bsp_put(jglob%p,&j,tmpindv, (jglob/p)*SZINT,SZINT);
}
for(i=0; i<nu; i++){
iglob= uindex[i];
bsp_put(iglob%p,&s,tmpprocu,(iglob/p)*SZINT,SZINT);
bsp_put(iglob%p,&i,tmpindu, (iglob/p)*SZINT,SZINT);
}
bsp_sync();

/****** Superstep 2. Read from temporary arrays ******/


for(j=0; j<ncols; j++){
jglob= colindex[j];
bsp_get(jglob%p,tmpprocv,(jglob/p)*SZINT,&srcprocv[j],SZINT);
bsp_get(jglob%p,tmpindv, (jglob/p)*SZINT,&srcindv[j], SZINT);
}
for(i=0; i<nrows; i++){
iglob= rowindex[i];
bsp_get(iglob%p,tmpprocu,(iglob/p)*SZINT,&destprocu[i],SZINT);
bsp_get(iglob%p,tmpindu, (iglob/p)*SZINT,&destindu[i], SZINT);
}
bsp_sync();

/****** Superstep 3. Deregister temporary arrays ******/


bsp_pop_reg(tmpindu); bsp_pop_reg(tmpprocu);
bsp_pop_reg(tmpindv); bsp_pop_reg(tmpprocv);
bsp_sync();

/****** Superstep 4. Free temporary arrays ******/


vecfreei(tmpindu); vecfreei(tmpprocu);
vecfreei(tmpindv); vecfreei(tmpprocv);

} /* end bspmv_init */
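As a usage sketch (hypothetical; the real driver bspmv_test in BSPedupack is more elaborate because it also reads and distributes the matrix and the vectors), the calling sequence for repeated multiplication could look as follows. The initialization is done once and its cost is amortized over the iterations.

#include "bspedupack.h"

/* Hypothetical calling sequence for computing u := A*v niter times.
   All other arrays are assumed to have been allocated and filled by an
   input phase such as the one in bspmv_test (not shown). */
void multiply_repeatedly(int p, int s, int n, int nz, int nrows, int ncols,
                         int nv, int nu, int niter,
                         double *a, int *inc, int *rowindex, int *colindex,
                         int *vindex, int *uindex, double *v, double *u){
    int iter;
    int *srcprocv= vecalloci(ncols), *srcindv= vecalloci(ncols);
    int *destprocu= vecalloci(nrows), *destindu= vecalloci(nrows);

    bspmv_init(p, s, n, nrows, ncols, nv, nu, rowindex, colindex,
               vindex, uindex, srcprocv, srcindv, destprocu, destindu);

    for(iter=0; iter<niter; iter++)
        bspmv(p, s, n, nz, nrows, ncols, a, inc, srcprocv, srcindv,
              destprocu, destindu, nv, nu, v, u);

    vecfreei(destindu); vecfreei(destprocu);
    vecfreei(srcindv); vecfreei(srcprocv);
}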

4.10 Experimental results on a Beowulf cluster


Nowadays, a Beowulf cluster is a powerful and relatively cheap alternative
to the traditional supercomputer. A Beowulf cluster consists of several PCs
connected by communication switches, see Fig. 1.11. In this section, we use a
cluster of 32 IBM x330 nodes, located at the Physics Department of Utrecht
University and part of DAS-2, the 200-node Distributed ASCI Supercom-
puter built by five collaborating Dutch universities for research into parallel
and Grid computing. (A cluster of the DAS-2 was the machine Romein and
Bal [158] used to solve the African game of Awari.) Each node of the cluster
contains two Pentium-III processors with 1 GHz clock speed, 1 Gbyte of
memory, a local disk, and interfaces to Fast Ethernet and Myrinet. The nodes
within the cluster are connected by a Myrinet-2000 communication network.
The operating system is Linux and the compiler used in our experiments is
version 3.2.2 of the GNU C compiler.
The 32-node, 64-processor Beowulf cluster has been turned into a BSP
computer with p = 64 by using a preliminary version of the BSPlib imple-
mentation by Takken [173], the Panda BSP library, which runs on top of
the Panda portability layer [159], version 4.0. For p ≤ 32, one processor per
node is used; for p = 64, two processors per node. The BSP parameters of
the cluster measured by using bspbench are given in Table 4.4. Note that g
remains more or less constant as a function of p, and that l grows linearly.
The values of g and l are about ten times higher than the values for the Origin
Table 4.4. Benchmarked BSP parameters p, g, l and the time of a 0-relation
for a Myrinet-based Beowulf cluster running the Panda BSP library.
All times are in flop units (r = 323 Mflop/s)

p     g      l           Tcomm(0)
1     1337   7188        6767
2     1400   100 743     102 932
4     1401   226 131     255 307
8     1190   440 742     462 828
16    1106   835 196     833 095
32    1711   1 350 775   1 463 009
64    2485   2 410 096   2 730 173

Table 4.5. Test set of sparse matrices

Matrix      n         nz         Origin
random1k    1000      9779       Random sparse matrix
random20k   20 000    99 601     Random sparse matrix
amorph20k   20 000    100 000    Amorphous silicon [169]
prime20k    20 000    382 354    Prime number matrix
lhr34       35 152    764 014    Light hydrocarbon recovery [192]
nasasrb     54 870    1 366 097  Shuttle rocket booster
bcsstk32    44 609    2 014 701  Automobile chassis
cage12      130 228   2 032 536  DNA electrophoresis [186]

3800 supercomputer given in Table 4.3, and the computing rate is about the
same. The version of the Panda BSP library used has not been fully optim-
ized yet. For instance, if a processor puts data into itself, this can be done
quickly by a memory copy via a buffer, instead of letting the underlying Panda
communication system discover, at much higher cost, that the destination is
local. The lack of this feature is revealed by the relatively high value of g for
p = 1. In the experiments described below, the program bspmv has been mod-
ified to avoid sending data from a processor to itself. (In principle, as users
we should refuse to perform such low-level optimizations. The BSP system
should do this for us, since the main advantage of BSP is that it enables such
communication optimizations.)
Table 4.5 shows the set of sparse matrices used in our experiments. The
set consists of: random1k and random20k, which represent the random sparse
matrices discussed in Section 4.7; amorph20k, which was created by convert-
ing a model of 20 000 silicon atoms, each having four connections with other
atoms, to a sparse matrix, see Fig. 4.2 and Exercise 7; prime20k, which
extends the matrix prime60 from the cover of this book to size n = 20 000;
lhr34, which represents the simulation of a chemical process; nasasrb and
bcsstk32, which represent three-dimensional structural engineering problems
from NASA and Boeing, respectively; and cage12, one of the larger matrices
from the DNA electrophoresis series, see Fig. 4.1. The original matrix nasasrb
is symmetric, but we take only the lower triangular part as our test matrix.
The matrices have been ordered by increasing number of nonzeros. The largest
four matrices can be obtained from the University of Florida collection [53].
The size of the matrices in the test set is considered medium by current stand-
ards (but bcsstk32 was the largest matrix in the original Harwell–Boeing
collection [64]).
The matrices of the test set and the corresponding input and output
vectors were partitioned by the Mondriaan package (version 1.0) for the pur-
pose of parallel sparse matrix–vector multiplication. The resulting BSP costs
are given in Table 4.6. Note that every nonzero is counted as two flops, so
that the cost for p = 1 equals 2nz(A). Mondriaan was run with 3% load
imbalance allowed, with input and output vectors distributed independently,
and with all parameters set to their default values. The synchronization cost
is not shown, since it is always 4l, except in a few cases: for p = 1, the cost
is l; for p = 2, it is 2l; and for p = 4, it is 2l in the case of the matrices
amorph20k, nasasrb, and bcsstk32, that is, the matrices with underlying
three-dimensional structure. Still, synchronization cost can be important:
since l ≈ 100 000 for p = 2, matrices must have at least 100 000 nonzeros
to make parallel multiplication worthwhile on this machine. Note that the
matrices with three-dimensional structure have much lower communication
cost than the random sparse matrices.
The time measured on our DAS-2 cluster for parallel sparse matrix–vector
multiplication using the function bspmv from Section 4.9 is given in Table 4.7.
The measured time always includes the synchronization overhead of two com-
munication supersteps, since bspmv does not take advantage of a possibly
empty fanin or fanout. We note that, as predicted, the matrix random1k is too
small to observe a speedup. Instead, the total time grows with p in the same
way as the synchronization cost grows. It is interesting to compare the two
matrices random20k and amorph20k, which have the same size and number of
nonzeros (and are too small to expect any speedup). The matrix amorph20k
has a much lower communication cost, see Table 4.6, and this results in con-
siderably faster execution, see Table 4.7. The largest four matrices display
modest speedups, as expected for g ≈ 1000. The matrix nasasrb shows a
speedup on moving from p = 1 to p = 2, whereas cage12 shows a slowdown,
which agrees with the theoretical prediction.
It is quite common to finish research papers about parallel computing
with a remark ‘it has been shown that for large problem sizes the algorithm
scales well.’ If only all problems were large! More important than showing
good speedups by enlarging problem sizes until the experimenter is happy, is
gaining an understanding of what happens for various problem sizes, small as
well as large.
Table 4.6. Computation and communication cost for sparse matrix–vector multiplication

p    random1k        random20k          amorph20k          prime20k
1    19 558          199 202            200 000            764 708
2    10 048 + 408g   102 586 + 5073g    100 940 + 847g     393 520 + 4275g
4    5028 + 392g     51 292 + 4663g     51 490 + 862g      196 908 + 5534g
8    2512 + 456g     25 642 + 3452g     25 742 + 1059g     98 454 + 4030g
16   1256 + 227g     12 820 + 2152g     12 872 + 530g      49 226 + 3148g
32   626 + 224g      6408 + 1478g       6434 + 371g        24 612 + 2620g
64   312 + 132g      3202 + 1007g       3216 + 267g        12 304 + 2235g

p    lhr34            nasasrb            bcsstk32           cage12
1    1 528 028        2 732 194          4 029 402          4 065 072
2    782 408 + 157g   1 378 616 + 147g   2 070 816 + 630g   2 093 480 + 10 389g
4    388 100 + 945g   703 308 + 294g     1 036 678 + 786g   1 046 748 + 15 923g
8    196 724 + 457g   351 746 + 759g     518 676 + 842g     523 376 + 16 543g
16   98 364 + 501g    175 876 + 733g     259 390 + 1163g    261 684 + 9 984g
32   49 160 + 516g    87 938 + 585g      129 692 + 917g     130 842 + 6658g
64   24 588 + 470g    43 966 + 531g      64 836 + 724g      65 420 + 5385g
Table 4.7. Measured execution time (in ms) for sparse matrix–vector multiplication

p         random1k   random20k   amorph20k   prime20k   lhr34   nasasrb   bcsstk32   cage12
1 (seq)   0.2        9           7           18         30      52        71         92
1 (par)   0.3        10          8           19         31      53        72         96
2         5.1        73          13          56         26      41        59         205
4         5.2        57          16          77         22      26        39         228
8         6.9        48          15          50         14      21        25         226
16        9.1        32          11          46         13      18        24         128
32        14.8       28          17          37         18      21        23         87
64        27.4       36          29          45         29      32        34         73

Qualitatively, the BSP cost can be used to explain the timing results, or
predict them. Quantitatively, the agreement is less than perfect. The reader
can easily check this by substituting the measured values of g and l into
the cost expressions of Table 4.6. The advantage of presenting BSP costs for
sparse matrices as shown in Table 4.6 over presenting raw timings as is done
in Table 4.7 is the longevity of the results: in 20 years from now, when all
present supercomputers will rest in peace, when I shall be older and hopefully
wiser, the results expressed as BSP costs can still be used to predict execution
time on a state-of-the-art parallel computer.
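For readers who want to carry out such a check, a small helper such as the following (hypothetical, not part of BSPedupack) converts a BSP cost expression a + b·g + c·l in flop units into milliseconds, given the computing rate r in Mflop/s and the benchmarked g and l of the machine.

/* Convert a BSP cost of a + b*g + c*l flop units into milliseconds.
   For example, for bcsstk32 on p = 64 one would take a = 64 836, b = 724,
   c = 4 (four synchronizations), and g, l, r from the p = 64 row of
   Table 4.4. */
double predicted_time_ms(double a, double b, double c,
                         double g, double l, double r_mflops){
    double flops= a + b*g + c*l;
    return flops/(r_mflops*1000.0); /* one Mflop/s equals 1000 flop/ms */
}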

4.11 Bibliographic notes


4.11.1 Sparse matrix computations
The book Direct methods for sparse matrices by Duff, Erisman, and Reid [63],
is a good starting point for a study of sequential sparse matrix computations of
the direct type. Direct methods such as sparse LU decomposition for unsym-
metric matrices and sparse Cholesky factorization for symmetric matrices are
based on the Gaussian elimination method for solving linear systems. The
unsymmetric case is characterized by the interplay between numerical stabil-
ity and sparsity; in the symmetric case, it is often possible to separate these
concerns by performing symbolic, sparsity-related computations before the
actual numerical factorization. The book has a practical flavour, discussing
in detail important issues such as sparse data structures, reuse of information
obtained during previous runs of a solver, and heuristics for finding good pivot
elements that preserve both the numerical stability and the sparsity. The book
also introduces graph-theoretic concepts commonly used in the sparse matrix
field. Another book on sparse matrix computations, written by Zlatev [194],
tries to combine direct and iterative methods, for instance by first performing
a sparse LU decomposition, dropping small matrix elements aij with |aij | ≤ τ ,
where τ is the drop tolerance, and then applying an iterative method to
improve the solution. Zlatev also presents a parallel implementation for a
shared-memory parallel computer.
A recent book which pays much attention to sparse matrix computations
is Numerical Linear Algebra for High-Performance Computers by Dongarra,
Duff, Sorensen, and van der Vorst [61]. The book treats direct and iterative
solvers for sparse linear systems and eigensystems in much detail, and it also
discusses parallel aspects. One chapter is devoted to preconditioning, solving
a system K^{-1}Ax = K^{-1}b instead of Ax = b, where the n × n matrix K is
the preconditioner, which must be a good approximation to A for which
Kx = b is easy to solve. Finding good parallel preconditioners is an active
area of research. Duff and van der Vorst [67] review recent developments in
the parallel solution of linear systems by direct and iterative methods. The
authors note that the two types of method cannot be clearly distinguished,
since many linear solvers have elements of both.
The Templates project aims at providing precise descriptions in template
form of the most important iterative solvers for linear systems and eigensys-
tems, by giving a general algorithm with sufficient detail to enable customized
implementation for specific problems. Templates for linear systems [11] con-
tains a host of iterative linear system solvers, including the conjugate gradient
(CG) [99], generalized minimal residual (GMRES) [160], and bi-conjugate
gradient stabilized (Bi-CGSTAB) [184] methods. It is important to have a
choice of solvers, since no single iterative method can solve all problems
efficiently. Implementations of the complete templates for linear systems are
available in C++, Fortran 77, and MATLAB. Most modern iterative methods
build a Krylov subspace

\[
K_m = \mathrm{span}\{r, Ar, A^2 r, \ldots, A^{m-1} r\}, \qquad (4.54)
\]

where r = b − Ax_0 is the residual of the initial solution x_0. Sparse matrix–
vector multiplication is the main building block of Krylov subspace methods.
A recent book on these methods has been written by van der Vorst [185].
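As a minimal illustration of this statement (a sketch of my own, not taken from [11] or [185]; the callback spmv is an assumption), the vectors spanning K_m can be generated by repeated sparse matrix–vector multiplication. Practical Krylov methods orthogonalize these vectors on the fly instead of storing them in this raw form.

/* Generate the vectors r, Ar, A^2 r, ..., A^(m-1) r spanning the Krylov
   subspace K_m. The callback spmv(y, x) is assumed to compute y := A*x. */
void krylov_basis(int n, int m, double *r, double **K,
                  void (*spmv)(double *y, double *x)){
    int i, j;

    for(i=0; i<n; i++)
        K[0][i]= r[i];          /* first basis vector: the residual r */
    for(j=1; j<m; j++)
        spmv(K[j], K[j-1]);     /* K[j] = A * K[j-1] = A^j r */
}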
Templates for algebraic eigensystems [8] treats the more difficult problem
of solving eigensystems Ax = λx and generalized eigensystems Ax = λBx,
where A, B are square matrices, and it also treats the singular value decom-
position A = UΣV^T, where U and V are orthogonal matrices and Σ is a
diagonal matrix. An important eigenvalue solver discussed is the Lanczos
method [125]. The expert knowledge of the authors has been captured in a
decision tree which helps choosing the most suitable eigensystem solver for the
application at hand. A complete implementation of all eigensystem templates
does not exist, but the book [8] gives many references to software sources.
The Sparse Basic Linear Algebra Subprograms (Sparse BLAS) [66] have
been formulated recently as a standard interface for operations involving
sparse matrices or vectors. The most important primitive is BLAS dusmv
(‘Double precision Unstructured Sparse Matrix–Vector multiplication’), which
computes y := αAx + y. The sparse BLAS have been designed in an object-
oriented fashion to make them independent of the data structure used in an
implementation. Reference to a sparse matrix is made through a handle,
which is in fact an integer that represents the matrix. As a result, user pro-
grams are not cluttered by details of the data structure. In an iterative solver
using the sparse BLAS, a matrix will typically be created, filled with nonzero
entries, used repeatedly, and then destroyed. One way of filling a sparse mat-
rix is by dense subblocks, which occur in many applications. For each block,
the location, row and column indices, and numerical values are stored. This
can be used to great advantage by suitable data structures, saving memory
space and yielding high computing rates that approach peak performance. The
sparse BLAS are available for the languages C, Fortran 77, and Fortran 90.
The performance of sparse matrix algorithms achieved in practice depends
to a large extent on the problem solved and hence it is important to use
realistic test problems. The Harwell–Boeing collection was the first widely
available set of sparse test matrices originating in real applications. Release 1,
the version from 1989 described in [64], contained almost 300 matrices
with the largest matrix bcsstk32 of size n = 44 609 possessing 1 029 655
stored nonzeros. (For this symmetric matrix, only the nonzeros below or on
the main diagonal were stored.) The Harwell–Boeing input format is based
on the CCS data structure, see Section 4.2. The original distribution medium
of the 110-Mbyte collection was a set of three 9-track tapes of 2400 feet length
and 1600 bits-per-inch (bpi) density. Using the CCS format and not the triple
scheme saved at least one tape! The original Harwell–Boeing collection has
evolved into the Rutherford–Boeing collection [65], available online through
the Matrix Market repository [26]. Matrices from this repository are avail-
able in two formats: Rutherford–Boeing (based on CCS) and Matrix Market
(based on the triple scheme). Matrix Market can be searched by size, number
of nonzeros, shape (square or rectangular), and symmetry of the matrices.
Useful statistics and pictures are provided for each matrix. Another large
online collection is maintained and continually expanded by Tim Davis at the
University of Florida [53]. (The cage matrices [186] used in this chapter can be
obtained from there.) The currently largest matrix of the Florida collection,
cage15, has n = 5 154 859 and nz = 99 199 551.

4.11.2 Parallel sparse matrix–vector multiplication algorithms


Already in 1988, Fox and collaborators [71,section 21–3.4] presented a parallel
algorithm for dense matrix–vector multiplication that distributes the matrix in
both dimensions. They assume that a block-distributed copy of the complete
input vector is available in every processor row. Their algorithm starts with a
local matrix–vector multiplication, then performs a so-called fold operation
which gathers and adds partial sums uit , 0 ≤ t < N , into a sum ui , spreading
the responsibility for computing the sums ui of a processor row over all its
processors, and finally broadcasts the sums within their processor row. Thus,
the output vector becomes available in the same format as the input vector
but with the role of processor rows and columns reversed. If the input and
output vector are needed in exactly the same distribution, the broadcast must
be preceded by a vector transposition.
The parallel sparse matrix–vector multiplication algorithm described in
this chapter, Algorithm 4.5, is based on previous work by Bisseling and
McColl [19,21,22]. The Cartesian version of the algorithm was first presen-
ted in [19] as part of a parallel implementation of GMRES, an iterative solver
for square unsymmetric linear systems. Bisseling [19] outlines the advantages
of using a two-dimensional Cartesian distribution and distributing the vectors
in the same way as the matrix diagonal and suggests to use as a fixed matrix-
independent distribution the square block/cyclic distribution, defined by
assigning matrix element aij to processor P(i div (n/√p), j mod √p). The cost
analysis of the algorithm in [19] and the implementation, however, are closely
tied to a square mesh communication network with store-and-forward rout-
ing. This means for instance that a partial sum sent from processor P (s, t0 )
to processor P (s, t1 ) has to be transferred through all intermediate processors
P (s, t), t0 < t < t1 . Experiments performed on a network of 400 transputers
for a subset of unsymmetric sparse matrices from the Harwell–Boeing collec-
tion [64] give disappointing speedups for the matrix–vector multiplication part
of the GMRES solver, due to the limitations of the communication network.
Bisseling and McColl [21,22] transfer the sparse matrix–vector multiplic-
ation algorithm from [19] to the BSP context. The matrix distribution is
Cartesian and the vectors are distributed in the same way as the matrix
diagonal. Now, the algorithm benefits from the complete communication
network provided by the BSP architecture. Architecture-independent time
analysis becomes possible because of the BSP cost function. This leads the
authors to a theoretical and experimental study of scientific computing applic-
ations such as molecular dynamics, partial differential equation solving on
multidimensional grids, and linear programming, all interpreted as an instance
of sparse matrix–vector multiplication. This work shows that the block/cyclic
distribution is an optimal fixed Cartesian distribution for unstructured sparse
matrices; also optimal is a Cartesian matrix distribution based on a balanced
random distribution of the matrix diagonal. Bisseling and McColl propose
to use digital diamonds for the Laplacian operator on a square grid. They
perform numerical experiments using MLIB, a library of matrix generators
and BSP cost analysers specifically developed for the investigation of sparse
matrix–vector multiplication.
Ogielski and Aiello [148] present a four-superstep parallel algorithm for
multiplication of a sparse rectangular matrix and a vector. The algorithm
exploits sparsity in the computation, but not in the communication. The
input and output vector are distributed differently. The matrix is distributed
in a Cartesian manner by first randomly permuting the rows and columns
(independently from each other) and then using an M × N cyclic distribu-
tion. A probabilistic analysis shows that the randomization leads to good
expected load balance in the computation, for every matrix with a limited
number of nonzeros per row and column. (This is a more general result than
the load balance result given in Section 4.7, which only applies to random
sparse matrices.) The vectors used in the multiplication u := Av are first per-
muted in correspondence with the matrix and then component vi is assigned
to P (s div N, s mod N ) and ui to P (s mod M, s div M ), where s = i mod p.
Experimental results on a 16384-processor MasPar machine show better load
balance than theoretically expected.
Lewis and van de Geijn [128] present several algorithms for sparse
matrix–vector multiplication on a parallel computer with a mesh or hyper-
cube communication network. Their final algorithm for u := Av distributes
the matrix and the vectors by assigning aij to processor P ((i div (n/p))
mod M, j div (n/N )) and ui and vi to P ((i div (n/p)) mod M, i div (n/N )).
This data distribution fits in the scheme of Section 4.4: the matrix distribution
is Cartesian and the input and output vectors are distributed in the same
way as the matrix diagonal. The vector distribution uses blocks of size n/p.
The resulting data distribution is similar to that of Fig. 4.5. The sparsity of
the matrix is exploited in the computation, but not in the communication.
Experimental results are given for the random sparse matrix with n = 14 000
and nz = 18 531 044 from the conjugate gradient solver of the NAS bench-
mark. Such experiments were also carried out by Hendrickson, Leland, and
Plimpton [96], using a sparse algorithm similar to the dense algorithm of Fox
et al. [71]. Hendrickson and co-workers overlap computation and communic-
ation and use a special processor numbering to reduce the communication
cost of the vector transposition on a hypercube computer. Hendrickson and
Plimpton [97] apply ideas from this matrix–vector multiplication algorithm
to compute the operation of a dense n × n force matrix on n particles in a
molecular dynamics simulation.

4.11.3 Parallel iterative solvers for linear systems


The Parallel Iterative Methods (PIM) package by da Cunha and Hopkins [50]
contains a set of Fortran 77 subroutines implementing iterative solvers. The
user of the package has to supply the matrix–vector multiplication, inner
product computations, and preconditioners. This makes the package inde-
pendent of data distribution and data structure, at the expense of extra effort
by the user. All vectors must be distributed in the same way. PIM can run on
top of MPI and PVM.
The Aztec package by Tuminaro, Shadid, and Hutchinson [175] is a
complete package of iterative solvers that provides efficient subroutines for
sparse matrix–vector multiplication and preconditioning; see also [162] for the
initial design. The data distribution of Aztec is a row distribution for the
matrix and a corresponding distribution for the vectors. The package contains
many tools to help the user initialize an iterative solver, such as a tool for
detecting which vector components must be obtained during the fanout. In
Aztec, a processor has three kinds of vector components: internal components
that can be updated without communication; border components that belong
to the processor but need components from other processors for an update;
and external components, which are the components needed from other pro-
cessors. The components of u and v are renumbered in the order internal,
border, external, where the external components that must be obtained
from the same processor are numbered consecutively. The local submatrix
is reordered correspondingly. Two local data structures are supported: a vari-
ant of CRS with special treatment of the matrix diagonal and a block variant,
which can handle dense subblocks, thus increasing the computing rate for
certain problems by a factor of five.
Parallel Templates by Koster [124] is a parallel, object-oriented implement-
ation in C++ of the complete linear system templates [11]. It can handle every
possible matrix and vector distribution, including the Mondriaan distribution,
and it can run on top of BSPlib and MPI-1. The high-level approach of the
package and the easy reuse of its building blocks makes adding new paral-
lel iterative solvers a quick exercise. This work introduces the ICRS data
structure discussed in Section 4.2.

4.11.4 Partitioning methods


The multilevel partitioning method has been proposed by Bui and Jones [34]
and improved by Hendrickson and Leland [95]. Hendrickson and Leland
present a multilevel scheme for partitioning a sparse undirected graph G =
(V, E) among the processors of a parallel computer, where the aim is to obtain
subsets of vertices of roughly equal size and with a minimum number of cut
edges, that is, edges that connect pairs of vertices in different subsets (and
hence on different processors). In the case of a square symmetric matrix, the
data distribution problem for sparse matrix–vector multiplication can be con-
verted to a graph partitioning problem by identifying matrix row i with vertex
i and matrix nonzero aij , i < j, with an edge (i, j) ∈ E. (Self-edges (i, i) are
not created.) The graph is partitioned and the processor that obtains a vertex i
in the graph partitioning becomes the owner of matrix row i and vector com-
ponents ui , vi . Hendrickson and Leland [95] use a spectral method [156] for
the initial partitioning, based on solving an eigensystem for the Laplacian
matrix connected to the graph. Their method allows initial partitioning into
two subsets, but also into four or eight, which may sometimes be better. The
authors implemented their partitioning algorithms in a package called Chaco.
Karypis and Kumar [118] investigate the three phases of multilevel graph
partitioning, proposing new heuristics for each phase, based on extensive
experiments. For the coarsening phase, they propose to match each ver-
tex with the neighbouring vertex connected by the heaviest edge. For the
initial partitioning, they propose to grow a partition greedily by highest
gain, starting from an arbitrary vertex, until half the total vertex weight
is included in the partition. For the uncoarsening, they propose a boundary
Kernighan–Lin algorithm. The authors implemented the graph partitioner in
a package called METIS. Karypis and Kumar [119] also developed a parallel
multilevel algorithm that performs the partitioning itself on a parallel com-
puter, with p processors computing a p-way partitioning. The algorithm uses
a graph colouring, that is, a colouring of the vertices such that neighbour-
ing vertices have different colours. To avoid conflicts between processors when
matching vertices in the coarsening phase, each coarsening step is organized
by colour, trying to find a match for vertices of one colour first, then for those
of another colour, and so on. This parallel algorithm has been implemented
in the ParMETIS package.
Walshaw and Cross [190] present a parallel multilevel graph partitioner
called PJostle, which is aimed at two-dimensional and three-dimensional
irregular grids. PJostle and ParMETIS both start with the vertices already
distributed over the processors in some manner and finish with a better
distribution. If the input partitioning is good, the computation of a new
partitioning is faster; the quality of the output partitioning is not affected
by the input partitioning. This means that these packages can be used for
dynamic repartitioning of grids, for instance in an adaptive grid computation.
The main difference between the two packages is that PJostle actually moves
vertices between subdomains when trying to improve the partitioning, whereas
ParMETIS keeps vertices on the original processor, but registers the new
owner. Experiments in [190] show that PJostle produces a better partitioning,
with about 10% less cut edges, but that ParMETIS is three times faster.
Bilderback [18] studies the communication load balance of the data
distributions produced by five graph partitioning programs: Chaco, METIS,
ParMETIS, PARTY, and Jostle. He observes that the difference between the
communication load of the busiest processor and that of the least busy one,
expressed in edge cuts, is considerable for all of these programs, indicating
that there is substantial room for improvement.
Hendrickson [93] argues that the standard approach to sparse matrix par-
titioning by using graph partitioners such as Chaco and METIS is flawed
because it optimizes the wrong cost function and because it is unnecessarily
limited to square symmetric matrices. In his view, the emperor wears little
more than his underwear. The standard approach minimizes the number of
nonzeros that induce communication, but not necessarily the number of com-
munication operations themselves. Thus the cost function does not take into
account that if there are two nonzeros aij and ai′ j on the same processor, the
value vj need not be sent twice to that processor. (Note that our Algorithm 4.5
obeys the old rule ne bis in idem, because it sends vj only once to the same pro-
cessor, as a consequence of using the index set Js .) Furthermore, Hendrickson
states that the cost function of the standard approach only considers com-
munication volume and not the imbalance of the communication load nor the
startup costs of sending a message. (Note that the BSP cost function is based
on the maximum communication load of a processor, which naturally encour-
ages communication balancing. The BSP model does not ignore startup costs,
but lumps them together into one parameter l; BSP implementations such as
BSPlib reduce startup costs by combining messages to the same destination
in the same superstep. The user minimizes startup costs by minimizing the
number of synchronizations.)
Çatalyürek and Aykanat [36,37] model the total communication volume
of sparse matrix–vector multiplication correctly by using hypergraphs. They
present a multilevel hypergraph partitioning algorithm that minimizes the
true communication volume. The algorithm has been implemented in a
package called PaToH (Partitioning Tool for Hypergraphs). Experimental
results show that PaToH reduces the communication volume by 30–40%
compared with graph-based partitioners. PaToH is about four times faster
than the hypergraph version of METIS, hMETIS, while it produces parti-
tionings of about the same quality. The partitioning algorithm in [36,37] is
one-dimensional since all splits are carried out in the same direction, yielding
a row or column distribution for the matrix with a corresponding vector distri-
bution. PaToH can produce p-way partitionings where p need not be a power
of two. Çatalyürek and Aykanat [38] also present a fine-grained approach to
sparse matrix–vector multiplication, where nonzeros are assigned individually
to processors. Each nonzero becomes a vertex in the problem hypergraph
and each row and column becomes a net. The result is a matrix partitioning
into disjoint sets As , 0 ≤ s < p, not necessarily corresponding to disjoint
submatrices Is × Js . This method is slower and needs more memory than
the one-dimensional approach, but the resulting partitioning is excellent; the
communication volume is almost halved. Hypergraph partitioning is com-
monly used in the design of electronic circuits and much improvement is due
to work in that field. A hypergraph partitioner developed for circuit design is
MLpart [35].
The two-dimensional Mondriaan matrix distribution method described in
Section 4.5 is due to Vastenhouw and Bisseling [188] and has been implemen-
ted in version 1.0 of the Mondriaan package. The method used to split a matrix
into two submatrices is based on the one-dimensional multilevel method for
hypergraph bipartitioning by Çatalyürek and Aykanat [36,37]. The Mondriaan
package can handle rectangular matrices as well as square matrices, and allows
the user to impose the condition distr(u) = distr(v). The package also has an
option to exploit symmetry by assigning aij and aji to the same processor.
The vector distribution methods described in Section 4.6 are due to Meesen
and Bisseling [136]; these methods improve on the method described in [188]
and will be included in the next major release of the Mondriaan package.
4.12 Exercises
1. Let A be a dense m×n matrix distributed by an M ×N block distribution.
Find a suitable distribution for the input and output vector of the dense
matrix–vector multiplication u := Av; the input and output distributions
can be chosen independently. Determine the BSP cost of the corresponding
matrix–vector multiplication. What is the optimal ratio N/M and the BSP
cost for this ratio?
2. Find a distribution of a 12 × 12 grid for a BSP computer with p = 8,
g = 10, and l = 50, such that the BSP cost of executing a two-dimensional
Laplacian operator is as low as possible. For the computation, we count five
flops for an interior point, four flops for a boundary point that is not a corner
point, and three flops for a corner point. Your distribution should be better
than that of Fig. 4.20, which has a BSP cost of 90 + 14g + 2l = 330 flops on
this computer.
3. Modify the benchmarking program bspbench from Chapter 1 by changing
the central bsp put statement into a bsp send and adding a corresponding
bsp move to the next superstep. Choose suitable sizes for tag and payload.
Run the modified program for various values of p and measure the values of g
and l. Compare the results with those of the original program. If your commu-
nication pattern allows you to choose between using bsp put and bsp send,
which primitive would you choose? Why?
4. (∗) An n × n matrix A is banded with upper bandwidth bU and lower
bandwidth bL if aij = 0 for i < j − bU and i > j + bL . Let bL = bU = b. The
matrix A has a band of 2b + 1 nonzero diagonals and hence it is sparse if b
is small. Consider the multiplication of a banded matrix A and a vector v by
Algorithm 4.5 using the one-dimensional distribution φ(i) = i div (n/p) for the
matrix diagonal and the vectors, and a corresponding M ×N Cartesian matrix
distribution (φ0 , φ1 ). For simplicity, assume that n is a multiple of p. Choosing
M completely determines the matrix distribution. (See also Example 4.5,
where n = 12, b = 1, and p = 4.)

(a) Let b = 1, which means that A is tridiagonal. Show that the communication
cost for the choice M = p (i.e. a row distribution of the matrix)
is lower than for the choice M = √p (i.e. a square distribution).
(b) Let b = n − 1, which means that A is dense. Section 4.4 shows that
now the communication cost for the choice M = √p is lower than for
M = p. We may conclude that for small bandwidth the choice M = p
is better, whereas for large bandwidth the choice M = √p is better.
Which value of b is the break-even point between the two methods?
(c) Implement Algorithm 4.5 for the specific case of band matrices. Drop
the constraint on n and p. Choose a suitable data structure for the
matrix: use an array instead of a sparse data structure.
(d) Run your program and obtain experimental values for the break-even
point of b. Compare your results with the theoretical predictions.
5. (∗) Let A be a sparse m × m matrix and B a dense m × n matrix with
m ≥ n. Consider the matrix–matrix multiplication C = AB.
(a) What is the time complexity of a straightforward sequential
algorithm?
(b) Choose distributions for A, B, and C and formulate a corresponding
parallel algorithm. Motivate your choice and discuss alternatives.
(c) Analyse the time complexity of the parallel algorithm.
(d) Implement the algorithm. Measure the execution time for various values
of m, n, and p. Explain the results.
6. (∗) The CG algorithm by Hestenes and Stiefel [99] is an iterative method
for solving a symmetric positive definite linear system of equations Ax = b.
(A matrix A is positive definite if xT Ax > 0 for all x ≠ 0.) The algorithm
computes a sequence of approximations xk, k = 0, 1, 2, . . ., that converges
towards the solution x. The algorithm is usually considered converged when
‖rk‖ ≤ ǫconv‖b‖, where rk = b − Axk is the residual. One can take, for
example, ǫconv = 10⁻¹². A sequential (nonpreconditioned) CG algorithm is
given as Algorithm 4.8. For more details and a proof of convergence, see Golub
and Van Loan [79].
(a) Design a parallel CG algorithm based on the sparse matrix–vector mul-
tiplication of this chapter. How do you distribute the vectors x, r, p, w?
Motivate your design choices. Analyse the time complexity.
(b) Implement your algorithm in a function bspcg, which uses bspmv for
the matrix–vector multiplication and bspip from Chapter 1 for inner
product computations.
(c) Write a test program that first generates an n × n sparse matrix B with
a random sparsity pattern and random nonzero values in the interval
[−1, 1] and then turns B into a symmetric matrix A = B + B T + µIn .
Choose the scalar µ sufficiently large to make A strictly diagonally
dominant, that is, |aii| > Σj≠i |aij| for all i, and to make the
diagonal elements aii positive. It can be shown that such a matrix is
positive definite, see [79]. Use the Mondriaan package with suitable
options to distribute the matrix and the vectors.
(d) Experiment with your program and explain the results. Try different n
and p and different nonzero densities. How does the run time of bspcg
scale with p? What is the bottleneck? Does the number of iterations
needed depend on the number of processors and the distribution?

Algorithm 4.8. Sequential conjugate gradient algorithm.

input:  A : sparse n × n matrix,
        b : dense vector of length n.
output: x : dense vector of length n, such that Ax ≈ b.

x := x0;       { initial guess }
k := 0;        { iteration number }
r := b − Ax;
ρ := ‖r‖²;
while ρ > ǫconv²‖b‖² ∧ k < kmax do
    if k = 0 then
        p := r;
    else
        β := ρ/ρold;
        p := r + βp;
    w := Ap;
    γ := pT w;
    α := ρ/γ;
    x := x + αp;
    r := r − αw;
    ρold := ρ;
    ρ := ‖r‖²;
    k := k + 1;

7. (∗) In a typical molecular dynamics simulation, the movement of a large
number of particles is followed for a long period of time to gain insight into
a physical process. For an efficient parallel simulation, it is crucial to use a
good data distribution, especially in three dimensions. We can base the data
distribution on a suitable geometric partitioning of space, following [169].
Consider a simulation with a three-dimensional simulation box of size 1.0 ×
1.0 × 1.0 containing n particles, spread homogeneously, which interact if their
distance is less than a cut-off radius rc , with rc ≪ 1, see Fig. 4.2. Assume
that the box has periodic boundaries, meaning that a particle near a boundary
interacts with particles near the opposite boundary.

(a) Design a geometric distribution of the particles for p = 2q³ processors
based on the truncated octahedron, see Fig. 4.22. What is the difference
with the case of the Laplacian operator on a three-dimensional grid? For
a given small value of rc , how many nonlocal particles are expected to
interact with the n/p local particles of a processor? These neighbouring
nonlocal particles form the halo of the local domain; their position must
be obtained by communication. Give the ratio between the number of
halo particles and the number of local particles, which is proportional
to the communication-to-computation ratio of a parallel simulation.
(b) Implement your distribution by writing a function that computes the
processor responsible for a particle at location (x, y, z).
(c) Design a scaling procedure that enables use of the distribution method
for p = 2q0 q1 q2 processors with q0 , q1 , q2 arbitrary positive integers.
Give the corresponding particle ratio.
(d) Test your distribution function for a large ensemble of particles located
at random positions in the box. Compare the corresponding ratio with
the predicted ratio.
(e) How would you distribute the particles for p = 8?
(f) Compare the output quality of the geometric distribution program to
that of the distribution produced by running the Mondriaan package
in one-dimensional mode for the matrix A defined by taking aij = 0 if
and only if particles i and j interact. Use the option distr(u) = distr(v)
in Mondriaan, and convert the output to a particle distribution by
assigning particle i to the processor that owns ui and vi .

8. (∗∗) Sometimes, it is worthwhile to use an optimal splitting for the function
split of Algorithm 4.6, even if such a splitting takes much more computation
time than a multilevel splitting. An optimal splitting could be the best choice
for the final phases of a p-way partitioning process, in particular for the smaller
submatrices.
(a) Design and implement a sequential algorithm that splits a sparse matrix
into two sets of columns satisfying the load imbalance constraint (4.27)
and with guaranteed minimum communication volume in the corres-
ponding matrix–vector multiplication. Because of the strict demand
for optimality, the algorithm should be based on some form of brute-
force enumeration of all possible splits with selection of the best split.
Formulate the algorithm recursively, with the recursive task defined as
assigning columns j, . . . , n − 1 to processor P (0) or P (1).
(b) The assignment of columns j = 0, 1, . . . , n − 1 to P (0) or P (1) can
be viewed as a binary computation tree. A node of the tree at depth
j corresponds to an assignment of all columns up to and including
column j to processors. The root of the tree, at depth 0, corresponds
to an assignment of column 0 to P (0), which can be done without loss
of generality. Each node at depth j with j < n − 1 has two branches to
children nodes, representing assignment of column j + 1 to P (0) and
P (1), respectively. Nodes at depth n − 1 are leaves, that is, they have
no children.
Accelerate the search for the best solution, that is, the best path to
a leaf of the tree, by pruning part of the tree. For instance, when the
imbalance exceeds ǫ after assigning a number of columns, further assign-
ments can make matters only worse. Thus, there is no need to search
the subtree corresponding to all possible assignments of the remaining
columns. A subtree can also be pruned when the number of communications
incurred exceeds the minimum number found so far for a
complete assignment of all columns. This pruning approach makes the
search algorithm a so-called branch-and-bound method.
(c) Try to accelerate the search further by adding heuristics to the search
method, such as choosing the branch to search first by some greedy
criterion, or reordering the columns at the start.
(d) Compare the quality and computing time of your optimal splitter to
that of the splitter used in the Mondriaan package. What size of
problems can you solve on your computer?
(e) How would you parallelize your splitting algorithm?
9. (∗∗) Finding the best matrix distribution for a parallel sparse matrix–
vector multiplication u := Av is a combinatorial optimization problem. It
can be solved by recursive matrix partitioning using multilevel splitting (see
Section 4.5) or optimal splitting by enumeration (see Exercise 8). We can
also use the general-purpose method for combinatorial optimization known
as simulated annealing, which is based on the Metropolis algorithm [139].
This method simulates slow cooling of a liquid: the temperature is gradually
lowered until the liquid freezes; at that point the molecules of the liquid lose
their freedom to move and they line up to form crystals. If the liquid cools
down slowly enough, the molecules have time to adapt to the changing tem-
perature so that the final configuration will have the lowest possible energy.
(a) Implement the simulated annealing algorithm for parallel sparse
matrix–vector multiplication. Start with a random distribution φ of the
matrix A. Write a sequential function that computes the correspond-
ing communication volume Vφ , defined in eqn (4.10). For the purpose
of optimization, take Vφ g/p as the communication cost and ignore the
synchronization cost because it is either 2l or 4l. Assume that the value
of g is known. Try to improve the distribution by a sequence of moves,
that is, assignments of a randomly chosen nonzero aij to a randomly
chosen processor. A move is accepted if the BSP cost of the new distri-
bution is lower than that of the old one. If, however, only cost decreases
were allowed, the process could easily get stuck in a local minimum,
and this will not always be a global minimum. Such a process would
not be able to peek over the upcoming mountain ridge to see that
there lies a deeper valley ahead. To escape from local minima, the
method occasionally accepts an increase in cost. This is more likely at
the beginning than at the end. Suppose we use as cost function the
normalized cost C, that is, the BSP cost divided by 2nz(A)/p. A move
with cost increase ∆C is accepted with probability e−∆C/T , where T
is the current temperature of the annealing process. Write a sequential
function that decides whether to accept a move with a given (positive
or negative) cost increment ∆C.
(b) Write an efficient function that computes the cost increment for a given
move. Note that simply computing the cost from scratch before and
after the move and taking the difference is inefficient; this approach
would be too slow for use inside a simulated annealing program, where
many moves must be evaluated. Take care that updating the cost for
a sequence of moves yields the same result as computing the cost from
scratch. Hint: keep track of the contribution of each processor to the
cost of the four supersteps of the matrix–vector multiplication.
(c) Put everything together and write a complete simulated annealing pro-
gram. The main loops of your program should implement a cooling
schedule, that is, a method for changing the temperature T during
the course of the computation. Start with a temperature T0 that is
much larger than every possible increment ∆C to be encountered. Try
a large number of moves at the initial temperature, for instance p·nz(A)
moves, and then reduce the temperature, for example, to T1 = 0.99T0 ,
thus making cost increases less likely to be accepted. Perform another
round of moves, reduce the temperature further, and so on. Finding a
good cooling schedule requires some trial and error.
(d) Compare the output quality and computing time of the simulated
annealing program to that of the Mondriaan package. Discuss the
difference between the output distributions produced by the two
programs.

10. (∗∗) The matrix–vector multiplication function bspmv is educational and
can be optimized. The communication performance can be improved by send-
ing data in large packets, instead of single values vj or uis , and by sending
the numerical values without index information. All communication of index
information can be handled by an initialization function. Such preprocessing
is advantageous when bspmv is used repeatedly for the same matrix, which is
the common situation in iterative solution methods.
(a) Write an initialization function v init that computes a permutation of
the local column indices j into the order: (i) indices corresponding to
locally available vector components vj that need not be sent; (ii) indices
corresponding to locally available components vj that must be sent;
(iii) indices corresponding to components vj that are needed but not
locally available and hence must be received from other processors. To
create a unique and useful ordering, the components in (i) and (ii) must
be ordered by increasing global index j and those in (iii) by increas-
ing processor number of the source processor and by increasing global
index as a secondary criterion. Set pointers to the start of (i), (ii), and
(iii), and to the start of each source processor within (iii). Initial-
ize an array colindex (such as used in bspmv init) that reflects
the new ordering, giving j = colindex[j]. Furthermore, extend the
array colindex by adding: (iv) indices that correspond to locally
available vector components vj that are not used by a local matrix
column, that is, with j ∉ Js; such components may need to be sent to
other processors. (For a good data distribution, there will be few such
components, or none.)
(b) Write a similar function u init for the local row indices.
(c) Write an initialization function that permutes the rows and columns of
the local matrix using v init and u init.
(d) The new ordering of the local indices can be used to avoid unneces-
sary data copying and to improve the communication. In the function
bspmv, replace the two arrays v and vloc by a single array V to be
used throughout the computation for storing vector components in the
order given by v init. Initialize V before the first call to bspmv by
using v init. Similarly, replace the array u by a new array U to be
used for storing local partial sums uis and vector components ui in
the order given by u init. The values psum = uis are now stored into
U instead of being sent immediately, to enable sending large packets.
Initialize U to zero at the start of bspmv. The new index ordering will
be used in a sequence of matrix–vector multiplications. Permute the
output components ui back to the original order after the last call to
bspmv.
(e) The components vj to be sent away must first be packed into a
temporary array such that components destined for the same processor
form a contiguous block. Construct a list that gives the order in which
the components must be packed. Use this packing list in bspmv to carry
out bsp hpputs of blocks of numerical values, one for each destination
processor. Because of the ordering of V, the components need not be
unpacked at their destination.
(f) The partial sums uis can be sent in blocks directly from U, one block
for each destination processor. Use bsp send with a suitable tag for
this operation. On arrival, the data must be unpacked. Construct an
unpacking list for the partial sums received. Use this list in bspmv to
add the sums to the appropriate components in U.
(g) The high-performance move primitive

bsp hpmove(tagptr, payloadptr);

saves copying time and memory by setting a pointer to the start of
the tag and payload instead of moving the tag and payload explicitly
out of the receive buffer. Here, void **tagptr is a pointer to a pointer
to the tag, that is, it gives the address where the pointer to the start
of the tag can be found. Similarly, void **payloadptr is a pointer to
a pointer to the payload. The primitive returns an integer −1 if there
are no messages, and otherwise the length of the payload. Modify your
program to unpack the sums directly from the receive buffer by using
bsp hpmove instead of bsp get tag and bsp move.
(h) Test the effect of these optimizations. Do you attain communication
rates corresponding to optimistic g-values?
(i) Try to improve the speed of the local matrix–vector multiplication by
treating the local nonzeros that do not cause communication separ-
ately. This optimization should enhance cache use on computers with
a cache.
APPENDIX A
AUXILIARY BSPEDUPACK FUNCTIONS

A.1 Header file bspedupack.h


This header file is included in every program file of BSPedupack. It contains
necessary file inclusions, useful definitions and macros, and prototypes for the
memory allocation and deallocation functions.

/*
###########################################################################
## BSPedupack Version 1.0 ##
## Copyright (C) 2004 Rob H. Bisseling ##
## ##
## BSPedupack is released under the GNU GENERAL PUBLIC LICENSE ##
## Version 2, June 1991 (given in the file LICENSE) ##
## ##
###########################################################################
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "bsp.h"

#define SZDBL (sizeof(double))


#define SZINT (sizeof(int))
#define TRUE (1)
#define FALSE (0)
#define MAX(a,b) ((a)>(b) ? (a) : (b))
#define MIN(a,b) ((a)<(b) ? (a) : (b))

double *vecallocd(int n);


int *vecalloci(int n);
double **matallocd(int m, int n);
void vecfreed(double *pd);
void vecfreei(int *pi);
void matfreed(double **ppd);

A.2 Utility file bspedupack.c


This file contains the functions used by BSPedupack to allocate and deallocate
memory dynamically for storing vectors and matrices. The functions are suited
for use in a parallel program, because they can abort a parallel computation
on all processors if they detect a potential memory overflow on one of the
processors. If zero memory space is requested, which could happen if the local
part of a distributed vector or matrix is empty, a NULL pointer is returned
and nothing else is done.
Allocating an m × n matrix reserves a contiguous chunk of mn memory
cells and prepares it for row-wise matrix access. As a result, matrix element
aij can conveniently be addressed as a[i][j] and row i as a[i].
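To make the row-wise access concrete, here is a small stand-alone sketch, not part of BSPedupack; the matrix size 3 × 4 and the single-processor call bsp_begin(1) are arbitrary choices for illustration. The program must be compiled together with bspedupack.c and linked with a BSPlib implementation, like the other programs in this book.

#include "bspedupack.h"

int main(int argc, char **argv){

    bsp_begin(1); /* bsp_begin as first statement of main, so no bsp_init needed */
    {
        int i, j, m= 3, n= 4;
        double **a= matallocd(m,n); /* one contiguous block of m*n doubles */

        for (i=0; i<m; i++)
            for (j=0; j<n; j++)
                a[i][j]= 10*i+j;    /* element a_ij is simply a[i][j] */
        printf("a[2][3]= %g\n", a[2][3]);
        matfreed(a);
    }
    bsp_end();

    exit(0);

} /* end main */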
#include "bspedupack.h"

/* These functions can be used to allocate and deallocate vectors and
   matrices. If not enough memory is available, one processor halts them all.
*/

double *vecallocd(int n){


/* This function allocates a vector of doubles of length n */
double *pd;

if (n==0){
pd= NULL;
} else {
pd= (double *)malloc(n*SZDBL);
if (pd==NULL)
bsp_abort("vecallocd: not enough memory");
}
return pd;

} /* end vecallocd */

int *vecalloci(int n){


/* This function allocates a vector of integers of length n */
int *pi;

if (n==0){
pi= NULL;
} else {
pi= (int *)malloc(n*SZINT);
if (pi==NULL)
bsp_abort("vecalloci: not enough memory");
}
return pi;

} /* end vecalloci */

double **matallocd(int m, int n){


/* This function allocates an m x n matrix of doubles */
int i;
double *pd, **ppd;

if (m==0){
ppd= NULL;
} else {
ppd= (double **)malloc(m*sizeof(double *));
if (ppd==NULL)
bsp_abort("matallocd: not enough memory");
if (n==0){
for (i=0; i<m; i++)
ppd[i]= NULL;
} else {
pd= (double *)malloc(m*n*SZDBL);
if (pd==NULL)
bsp_abort("matallocd: not enough memory");
ppd[0]=pd;
for (i=1; i<m; i++)
ppd[i]= ppd[i-1]+n;
}
}
return ppd;

} /* end matallocd */

void vecfreed(double *pd){


/* This function frees a vector of doubles */

if (pd!=NULL)
free(pd);

} /* end vecfreed */

void vecfreei(int *pi){


/* This function frees a vector of integers */

if (pi!=NULL)
free(pi);

} /* end vecfreei */

void matfreed(double **ppd){


/* This function frees a matrix of doubles */

if (ppd!=NULL){
if (ppd[0]!=NULL)
free(ppd[0]);
free(ppd);
}

} /* end matfreed */
APPENDIX B
A QUICK REFERENCE GUIDE TO BSPLIB

Table B.1 groups the primitives of BSPlib into three classes: Single Pro-
gram Multiple Data (SPMD) for creating the overall parallel structure; Direct
Remote Memory Access (DRMA) for communication with puts or gets; and
Bulk Synchronous Message Passing (BSMP) for communication with sends.
Functions bsp_nprocs, bsp_pid, and bsp_hpmove return an int; bsp_time
returns a double; and all others return void. A parameter with an asterisk
is a pointer; a parameter with two asterisks is a pointer to a pointer. The
parameter spmd is a parameterless function returning void. The parameter
error_message is a string. The remaining parameters are ints.
Table B.1. The 20 primitives of BSPlib [105]

Class  Primitive                                        Meaning                       Page

SPMD   bsp_begin(reqprocs);                             Start of parallel part          14
       bsp_end();                                       End of parallel part            15
       bsp_init(spmd, argc, **argv);                    Initialize parallel part        15
       bsp_nprocs();                                    Number of processors            15
       bsp_pid();                                       My processor number             16
       bsp_time();                                      My elapsed time                 16
       bsp_abort(error_message);                        One processor stops all         20
       bsp_sync();                                      Synchronize globally            16

DRMA   bsp_push_reg(*variable, nbytes);                 Register variable               19
       bsp_pop_reg(*variable);                          Deregister variable             19
       bsp_put(pid, *source, *dest, offset, nbytes);    Write into remote memory        18
       bsp_hpput(pid, *source, *dest, offset, nbytes);  Unbuffered put                  99
       bsp_get(pid, *source, offset, *dest, nbytes);    Read from remote memory         20
       bsp_hpget(pid, *source, offset, *dest, nbytes);  Unbuffered get                  99

BSMP   bsp_set_tagsize(*tagsz);                         Set new tag size               226
       bsp_send(pid, *tag, *source, nbytes);            Send a message                 224
       bsp_qsize(*nmessages, *nbytes);                  Number of received messages    226
       bsp_get_tag(*status, *tag);                      Get tag of received message    225
       bsp_move(*dest, maxnbytes);                      Store payload locally          225
       bsp_hpmove(**tagptr, **payloadptr);              Store by setting pointers      249
APPENDIX C
PROGRAMMING IN BSP STYLE USING MPI

Assuming you have read the chapters of this book and hence have
learned how to design parallel algorithms and write parallel programs
using BSPlib, this appendix quickly teaches you how to write well-
structured parallel programs using the communication library MPI. For
this purpose, the package MPIedupack is presented, which consists of
the five programs from BSPedupack, but uses MPI instead of BSPlib,
where the aim is to provide a suitable starter subset of MPI. Experi-
mental results are given that compare the performance of programs from
BSPedupack with their counterparts from MPIedupack. This appendix
concludes by discussing the various ways bulk synchronous parallel style
can be applied in practice in an MPI environment. After having read this
appendix, you will be able to use both BSPlib and MPI to write well-
structured parallel programs, and, if you decide to use MPI, to make
the right choices when choosing MPI primitives from the multitude of
possibilities.

C.1 The message-passing interface


The Message-Passing Interface (MPI) [137] is a standard interface for paral-
lel programming in C and Fortran 77 that became available in 1994 and was
extended by MPI-2 [138] in 1997, adding functionality in the areas of one-sided
communications, dynamic process management, and parallel input/output
(I/O), and adding bindings for the languages Fortran 90 and C++. MPI is
widely available and has broad functionality. Most likely it has already been
installed on the parallel computer you use, perhaps even in a well-optimized
version provided by the hardware vendor. (In contrast, often you have to
install BSPlib yourself or request this from your systems administrator.) Much
parallel software has already been written in MPI, a prime example being
the numerical linear algebra library ScaLAPACK [24,25,41]. The availabil-
ity of such a library may sometimes be a compelling reason for using MPI
in a parallel application. Important public-domain implementations of MPI
are MPICH [85] and LAM/MPI from Indiana University. An interesting new
development is MPICH-G2 [117] for parallel programming using the Globus
toolkit on the Grid, the envisioned worldwide parallel computer consisting of
all the computers on the Internet.
MPI is a software interface, and not a parallel programming model. It
enables programming in many different styles, including the bulk synchronous
parallel style. A particular algorithm can typically be implemented in many
different ways using MPI, which is the strength but also the difficulty of MPI.
MPI derives its name from the message-passing model, which is a program-
ming model based on pairwise communication between processors, involving
an active sender and an active receiver. Communication of a message
(in its blocking form) synchronizes the sender and receiver. The cost of
communicating a message of length n is typically modelled as

T (n) = tstartup + ntword , (C.1)

with a fixed startup cost tstartup and additional cost tword per data word.
Cost analysis in this model requires a detailed study of the order in which the
messages are sent, their lengths, and the computations that are interleaved
between the communications. In its most general form this can be expressed by
a directed acyclic graph with chunks of computation as vertices and messages
as directed edges.
In MPI, the archetypical primitives for the message-passing style, based
on the message-passing model, are MPI Send and MPI Recv. An example of
their use is

if (s==2)
MPI_Send(x, 5, MPI_DOUBLE, 3, 0, MPI_COMM_WORLD);
if (s==3)
MPI_Recv(y, 5, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, &status);

which sends five doubles from P (2) to P (3), reading them from an array x on
P (2) and writing them into an array y on P (3). Here, the integer ‘0’ is a tag
that can be used to distinguish between different messages transferred from
the same source processor to the same destination processor. Furthermore,
MPI COMM WORLD is the communicator consisting of all the processors. A com-
municator is a subset of processors forming a communication environment
with its own processor numbering. Despite the fundamental importance of
the MPI Send/MPI Recv pair in MPI, it is best to avoid its use if possible, as
extensive use of such pairs may lead to unstructured programs that are hard
to read, prove correct, or debug. Similar to the goto-statement, which was
considered harmful in sequential programming by Dijkstra [56], the explicit
send/receive pair can be considered harmful in parallel programming. In the
parallel case, the danger of deadlock always exists; deadlock may occur for
instance if P (0) wants to send a message to P (1), and P (1) to P (0), and
both processors want to send before they receive. In our approach to using
MPI, we advocate using the collective and one-sided communications of MPI
where possible, and to limit the use of the send/receive pair to exceptional
situations. (Note that the goto statement still exists in C, for good reasons,
but it is hardly used any more.)
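To make the deadlock scenario above concrete, the following hedged fragment exchanges five doubles between P(0) and P(1); the arrays x and y and the variables s and status are placeholders, declared as in the earlier example. If both processors called MPI_Send first, an unbuffered implementation could block them both forever; reversing the order on one of the two sides avoids this. (The row swap in mpilu of Section C.2.3 uses the same reversed ordering.)

if (s==0){
    /* send first, then receive */
    MPI_Send(x,5,MPI_DOUBLE,1,0,MPI_COMM_WORLD);
    MPI_Recv(y,5,MPI_DOUBLE,1,0,MPI_COMM_WORLD,&status);
}
if (s==1){
    /* receive first, then send */
    MPI_Recv(y,5,MPI_DOUBLE,0,0,MPI_COMM_WORLD,&status);
    MPI_Send(x,5,MPI_DOUBLE,0,0,MPI_COMM_WORLD);
}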
A discussion of all the MPI primitives is beyond the scope of this book,
as there are almost 300 primitives (I counted 116 nondeprecated MPI-1
primitives, and 167 MPI-2 primitives). With BSPlib, we could strive for com-
pleteness, whereas with MPI, this would require a complete book by itself. We
focus on the most important primitives to provide a quick entry into the MPI
world. For the definitive reference, see the original MPI-1 standard [137],
the most recent version of MPI-1 (currently, version 1.2) available from
http://www.mpi-forum.org, and the MPI-2 standard [138]. A more access-
ible reference is the annotated standard [83,164]. For tutorial introductions,
see [84,85,152].

C.2 Converting BSPedupack to MPIedupack


In this section, we create MPIedupack by converting all communication in the
five BSPedupack programs from BSPlib to MPI, thereby demonstrating the
various communication methods that exist in MPI. We shall comment on
the suitability of each method for programming in BSP style, trying to identify
a subset of MPI that can be used to program in BSP style. The first four
programs are converted using only primitives from MPI-1, whereas the fifth
program uses MPI-2 extensions as well.
For each program or function from BSPedupack printed in the main
part of this book, we print its counterpart from MPIedupack here, but
for brevity we omit the following repetitive program texts: sequential func-
tions (such as leastsquares from bspbench); parallel functions that have
not changed, except perhaps in name (the function mpifft is identical
to bspfft, except that it calls the redistribution function mpiredistr,
instead of bspredistr); long comments that are identical to the comments
in BSPedupack; mpiedupack.h, which is identical to bspedupack.h but
includes mpi.h instead of bsp.h; and mpiedupack.c, which is identical to
bspedupack.c but calls MPI Abort instead of bsp abort.
I/O is a complicated subject in parallel programming. MPI-1 ignored the
subject, leaving it up to the implementation. MPI-2 added extensive I/O
functionality. We will assume for MPI that the same I/O functionality is
available as for BSPlib: P (0) can read, and all processors can write although
the output may become multiplexed. Fortunately, this assumption often holds.
For more sophisticated I/O, one should use MPI-2.

C.2.1 Program mpiinprod


The first program of BSPedupack, bspinprod, becomes mpiinprod and is
presented below. The SPMD part of the new program is started by call-
ing MPI Init; note that it needs the addresses of argc and argv (and not
the arguments themselves as in bsp init). The SPMD part is terminated
by calling MPI Finalize. The standard communicator available in MPI is
MPI COMM WORLD; it consists of all the processors. The corresponding number
of processors p can be obtained by calling MPI Comm size and the local pro-
cessor identity s by calling MPI Comm rank. Globally synchronizing all the
processors, the equivalent of a bsp sync, can be done by calling MPI Barrier
for the communicator MPI COMM WORLD. Here, this is done before using the
wall-clock timer MPI Wtime. As in BSPlib, the program can be aborted by one
processor if it encounters an error. In that case an error number is returned.
Unlike bspip, the program mpiip does not ask for the number of processors to
be used; it simply assumes that all available processors are used. (In BSPlib,
it is easy to use less than the maximum number of processors; in MPI, this
is slightly more complicated and involves creating a new communicator of
smaller size.)
Collective communication requires the participation of all the processors
of a communicator. An example of a collective communication is the broad-
cast by MPI Bcast in the main function of one integer, n, from the root
P (0) to all other processors. Another example is the reduction operation by
MPI Allreduce in the function mpiip, which sums the double-precision local
inner products inprod, leaving the result alpha on all processors. (It is also
possible to perform such an operation on an array, by changing the para-
meter 1 to the array size, or to perform other operations, such as taking the
maximum, by changing MPI SUM to MPI MAX.)
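As a small illustration of the parenthetical remark, the following hedged fragment (the arrays locmax and globmax are hypothetical and not part of MPIedupack) takes an element-wise maximum over arrays of ten doubles, leaving the result on every processor:

double locmax[10], globmax[10];
/* ... fill locmax with values local to this processor ... */
MPI_Allreduce(locmax,globmax,10,MPI_DOUBLE,MPI_MAX,MPI_COMM_WORLD);
/* every processor now holds the element-wise maxima in globmax[0..9] */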
Note that the resulting program mpiip is shorter than the BSPlib equi-
valent. Using collective-communication functions built on top of the BSPlib
primitives would reduce the program size for the BSP case in the same way.
(Such functions are available, but they can also easily be written by the
programmer herself, and tailored to the specific situation.)
Now, try to compile the program by the UNIX command
cc -o ip mpiinprod.c mpiedupack.c -lmpi -lm
and run the resulting executable program ip on four processors by the
command
mpirun -np 4 ip
and see what happens. An alternative run command, with prescribed and
hence portable definition of its options is
mpiexec -n 4 ip
The program text is:
#include "mpiedupack.h"

/* This program computes the sum of the first n squares, for n>=0,
sum = 1*1 + 2*2 + ... + n*n
by computing the inner product of x=(1,2,...,n)ˆT and itself.
The output should equal n*(n+1)*(2n+1)/6.
The distribution of x is cyclic.
*/
double mpiip(int p, int s, int n, double *x, double *y){


/* Compute inner product of vectors x and y of length n>=0 */

int nloc(int p, int s, int n);


double inprod, alpha;
int i;

inprod= 0.0;
for (i=0; i<nloc(p,s,n); i++){
inprod += x[i]*y[i];
}
MPI_Allreduce(&inprod,&alpha,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);

return alpha;

} /* end mpiip */

int main(int argc, char **argv){

double mpiip(int p, int s, int n, double *x, double *y);


int nloc(int p, int s, int n);
double *x, alpha, time0, time1;
int p, s, n, nl, i, iglob;

/* sequential part */

/* SPMD part */
MPI_Init(&argc,&argv);

MPI_Comm_size(MPI_COMM_WORLD,&p); /* p = number of processors */


MPI_Comm_rank(MPI_COMM_WORLD,&s); /* s = processor number */

if (s==0){
printf("Please enter n:\n"); fflush(stdout);
scanf("%d",&n);
if(n<0)
MPI_Abort(MPI_COMM_WORLD,-1);
}

MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);

nl= nloc(p,s,n);
x= vecallocd(nl);
for (i=0; i<nl; i++){
iglob= i*p+s;
x[i]= iglob+1;
}
MPI_Barrier(MPI_COMM_WORLD);
time0=MPI_Wtime();

alpha= mpiip(p,s,n,x,x);
MPI_Barrier(MPI_COMM_WORLD);
time1=MPI_Wtime();

printf("Processor %d: sum of squares up to %d*%d is %.lf\n",


s,n,n,alpha); fflush(stdout);
if (s==0){
printf("This took only %.6lf seconds.\n", time1-time0);
fflush(stdout);
}

vecfreed(x);
MPI_Finalize();

/* sequential part */
exit(0);

} /* end main */

C.2.2 Program mpibench


The main question regarding the conversion of the second program from
BSPedupack, bspbench, is which communication method from MPI to bench-
mark. Of course, we can benchmark all methods, but that would be a very
cumbersome exercise. In the BSPlib case, we opted for benchmarking a typical
user program, where the user does not care about communication optim-
ization, for example, about combining messages to the same destination,
but instead relies on the BSPlib system to do this for her. When writing
a program in MPI, a typical user would look first if there is a collective-
communication function that can perform the job for him. This would lead to
shorter program texts, and is good practice from the BSP point of view as well.
Therefore, we should choose a collective communication as the operation to be
benchmarked for MPI.
The BSP superstep, where every processor can communicate in principle
with all the others, is reflected best by the all-to-all primitives from MPI, also
called total exchange primitives. Using an all-to-all primitive gives the MPI
system the best opportunities for optimization, similar to the opportunities
that the superstep gives to the BSPlib system.
The primitive MPI Alltoall requires that each processor send exactly
the same number n of data to every processor, thus performing a full
(p − 1)n-relation. (The syntax of the primitive also requires sending n data
to the processor itself. We do not count this as communication and neither
should the system handle this as true communication.) A more flexible variant,
the so-called vector variant, is the primitive MPI Alltoallv, which allows a
varying number of data to be sent (or even no data). This is often needed in
applications, and also if we want to benchmark h-relations with h not a mul-
tiple of p − 1. Another difference is that the MPI Alltoall primitive requires
the data in its send and receive arrays to be ordered by increasing processor
number, whereas MPI Alltoallv allows an arbitrary ordering. Therefore, we
choose MPI Alltoallv as the primitive to be benchmarked.
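For contrast with the more flexible MPI_Alltoallv used in the program below, a call of the plain MPI_Alltoall could look as follows; this is a hedged sketch in which sendbuf and recvbuf are placeholder arrays of 2p doubles each, not variables of mpibench.

/* Every processor contributes two doubles for each of the p processors
   (including itself): the block for P(s1) is read from sendbuf[2*s1]
   and the block received from P(s1) is stored in recvbuf[2*s1]. */
MPI_Alltoall(sendbuf,2,MPI_DOUBLE,recvbuf,2,MPI_DOUBLE,MPI_COMM_WORLD);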
Before the use of MPI Alltoallv, the number of sends to each processor
is determined. The number of sends Nsend[s1] to a remote processor P (s1 )
equals ⌊h/(p − 1)⌋ or ⌈h/(p − 1)⌉. The number of sends to the processor
itself is zero. In the same way, the number of receives is determined. The
offset Offset send[s1] is the distance, measured in units of the data type
involved (doubles), from the start of the send array where the data destined
for P (s1 ) can be found. Similarly, Offset recv[s1] gives the location where
the data received from P (s1 ) must be placed. Note that BSPlib expresses offset
parameters in raw bytes, whereas MPI expresses them in units of the data
type involved. (The MPI approach facilitates data transfer between processors
with different architectures.)
Another collective-communication primitive used by the program is
MPI Gather, which gathers data from all processors in the communicator onto
one processor. Here, one local double, time, of every processor is gathered onto
P (0). The values are gathered in order of increasing processor number and
hence the time measured for P (s1 ) is stored in Time[s1].
#include "mpiedupack.h"

/* This program measures p, r, g, and l of a BSP computer
   using MPI_Alltoallv for communication.
*/

#define NITERS 100 /* number of iterations */


#define MAXN 1024 /* maximum length of DAXPY computation */
#define MAXH 256 /* maximum h in h-relation */
#define MEGA 1000000.0

int main(int argc, char **argv){


void leastsquares(int h0, int h1, double *t, double *g, double *l);
int p, s, s1, iter, i, n, h,
*Nsend, *Nrecv, *Offset_send, *Offset_recv;
double alpha, beta, x[MAXN], y[MAXN], z[MAXN], src[MAXH], dest[MAXH],
time0, time1, time, *Time, mintime, maxtime,
nflops, r, g0, l0, g, l, t[MAXH+1];

/**** Determine p ****/


MPI_Init(&argc,&argv);

MPI_Comm_size(MPI_COMM_WORLD,&p); /* p = number of processors */


MPI_Comm_rank(MPI_COMM_WORLD,&s); /* s = processor number */

Time= vecallocd(p);

/**** Determine r ****/


for (n=1; n <= MAXN; n *= 2){
/* Initialize scalars and vectors */
alpha= 1.0/3.0;
beta= 4.0/9.0;
for (i=0; i<n; i++)
z[i]= y[i]= x[i]= (double)i;
/* Measure time of 2*NITERS DAXPY operations of length n */
time0= MPI_Wtime();

for (iter=0; iter<NITERS; iter++){


for (i=0; i<n; i++)
y[i] += alpha*x[i];
for (i=0; i<n; i++)
z[i] -= beta*x[i];
}
time1= MPI_Wtime();
time= time1-time0;
MPI_Gather(&time,1,MPI_DOUBLE,Time,1,MPI_DOUBLE,0,MPI_COMM_WORLD);

/* Processor 0 determines minimum, maximum, average computing rate */


if (s==0){
mintime= maxtime= Time[0];
for(s1=1; s1<p; s1++){
mintime= MIN(mintime,Time[s1]);
maxtime= MAX(maxtime,Time[s1]);
}
if (mintime>0.0){
/* Compute r = average computing rate in flop/s */
nflops= 4*NITERS*n;
r= 0.0;
for(s1=0; s1<p; s1++)
r += nflops/Time[s1];
r /= p;
printf("n= %5d min= %7.3lf max= %7.3lf av= %7.3lf Mflop/s ",
n, nflops/(maxtime*MEGA),nflops/(mintime*MEGA), r/MEGA);
fflush(stdout);
/* Output for fooling benchmark-detecting compilers */
printf(" fool=%7.1lf\n",y[n-1]+z[n-1]);
} else
printf("minimum time is 0\n"); fflush(stdout);
}
}

/**** Determine g and l ****/


Nsend= vecalloci(p);
Nrecv= vecalloci(p);
Offset_send= vecalloci(p);
Offset_recv= vecalloci(p);

for (h=0; h<=MAXH; h++){


/* Initialize communication pattern */

for (i=0; i<h; i++)


src[i]= (double)i;
if (p==1){
Nsend[0]= Nrecv[0]= h;
} else {
for (s1=0; s1<p; s1++)
Nsend[s1]= h/(p-1);
for (i=0; i < h%(p-1); i++)
Nsend[(s+1+i)%p]++;
Nsend[s]= 0; /* no communication with yourself */
for (s1=0; s1<p; s1++)
Nrecv[s1]= h/(p-1);
for (i=0; i < h%(p-1); i++)
Nrecv[(s-1-i+p)%p]++;
Nrecv[s]= 0;
}

Offset_send[0]= Offset_recv[0]= 0;
for(s1=1; s1<p; s1++){
Offset_send[s1]= Offset_send[s1-1] + Nsend[s1-1];
Offset_recv[s1]= Offset_recv[s1-1] + Nrecv[s1-1];
}

/* Measure time of NITERS h-relations */


MPI_Barrier(MPI_COMM_WORLD);
time0= MPI_Wtime();
for (iter=0; iter<NITERS; iter++){
MPI_Alltoallv(src, Nsend,Offset_send,MPI_DOUBLE,
dest,Nrecv,Offset_recv,MPI_DOUBLE,MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
}
time1= MPI_Wtime();
time= time1-time0;

/* Compute time of one h-relation */


if (s==0){
t[h]= (time*r)/NITERS;
printf("Time of %5d-relation= %lf sec= %8.0lf flops\n",
h, time/NITERS, t[h]); fflush(stdout);
}
}

if (s==0){
printf("size of double = %d bytes\n",(int)SZDBL);
leastsquares(0,p,t,&g0,&l0);
printf("Range h=0 to p : g= %.1lf, l= %.1lf\n",g0,l0);
leastsquares(p,MAXH,t,&g,&l);
printf("Range h=p to HMAX: g= %.1lf, l= %.1lf\n",g,l);

printf("The bottom line for this BSP computer is:\n");


printf("p= %d, r= %.3lf Mflop/s, g= %.1lf, l= %.1lf\n",
p,r/MEGA,g,l);
fflush(stdout);
}
vecfreei(Offset_recv);
vecfreei(Offset_send);
vecfreei(Nrecv);
vecfreei(Nsend);
vecfreed(Time);

MPI_Finalize();

exit(0);

} /* end main */

C.2.3 Function mpilu


For the LU decomposition, we have numbered the processors in two-
dimensional fashion, relating one-dimensional and two-dimensional processor
numbers by the standard identification P (s, t) ≡ P (s + tM ). We used two-
dimensional processor numbers in the LU decomposition algorithm, but had
to translate these into one-dimensional numbers in the actual BSPlib pro-
gram. In MPI, we can carry the renumbering one step further by defining a
communicator for every processor row and column, which allows us to use
the processor row number s and the processor column number t directly in
the communication primitives. In the function below, the new communicators
are created by splitting the old communicator MPI COMM WORLD into subsets
by the primitive MPI Comm split. Processors that call this primitive with the
same value of s end up in the same communicator, which we call row comm s.
As a result, we obtain M communicators, each corresponding to a processor
row P (s, ∗). Every processor obtains a processor number within its commu-
nicator. This number is by increasing value of the third parameter of the
primitive, that is, t. Because of the way we split, the processor number of
P (s, t) within the row communicator is simply t. The column communic-
ators are created by a similar split in the other direction. As a result, we
can address processors in three different ways: by the global identity s + tM
within the MPI COMM WORLD communicator; by the column number t within the
row comm s communicator; and by the row number s within the col comm t
communicator. This allows us for instance to broadcast the pivot value, a
double, from processor P (smax, k mod N) within its processor column, simply
by using the MPI Bcast primitive with col comm t as the communicator para-
meter, see Superstep 1. (For easy reference, the program supersteps retain
the original numbering from BSPedupack.) Another instance where use of the
new communicators is convenient is in the broadcast of nlr-kr1 doubles from
lk (representing a column part) within each processor row, see Superstep 3.
The source processor of the broadcast within P (s, ∗) is P (s, k mod N).
Communicators can also be used to split a BSP computer into differ-
ent subcomputers, such as in the decomposable BSP (D-BSP) model [55].
Care must be taken to do this in a structured way, as it is still possible for
processors from different subcomputers to communicate with each other (for
instance through the MPI COMM WORLD communicator). This use of communic-
ators can be important for certain applications, for example, with a tree-like
computational structure.
Because we have a wide variety of collective communications available in
MPI, we should try to use one of them to determine the maximum absolute
value in A(k : n − 1, k) and the processor that contains it. It turns out to
be most convenient to perform a reduction operation over the processors in
P (∗, k mod N), using MPI Allreduce within one column communicator, see
Superstep 0. The operation works on pairs consisting of a double represent-
ing the local maximum absolute value and an integer representing the index
s of the corresponding processor. A pair is stored using a struct of a double
val and an integer idx. The local input pair is max and the global output
pair is max glob; this pair becomes available on all processors in the column
communicator. The operation is called MPI MAXLOC, since it determines a max-
imum and registers the location (argmax) of the maximum. In case of equal
maxima, the maximum with the lowest processor number is taken. Note that
after the maximum absolute value has been determined, all processors in
P (∗, k mod n) know the owner and the absolute value of the maximum, but not
the corresponding pivot value; this value is broadcast in the next superstep.
The row swap of the LU decomposition can be done in many different
ways using MPI. If k mod M ≠ r mod M, the swap involves communication.
In that case, the communication pattern is a swap of local row parts by N
disjoint pairs of processors. Here, the communication is truly pairwise and
this is a good occasion to show the point-to-point message-passing primitives
in action that gave MPI its name. In the second part of Superstep 2 the
rows are swapped. Each processor P (k mod M, t) sends nlc local values of
type double to P (r mod M, t), that is, P (r mod M + tM ) in the global
one-dimensional numbering used. The messages are tagged by ‘1’. Note that
the program text of this part is lengthy, because here we have to write lines of
program text explicitly for receiving. (We became used to one-sided views of
the world, but alas.) Note also the reversed order of the sends and receives for
P (k mod M, t) compared with the order for P (r mod M, t). If we had used
the same order, dreaded deadlock could occur. Whether it actually occurs is
dependent on the implementation of MPI. The MPI standard allows both a
buffered implementation, where processors use intermediate buffers and hence
the sender is free to continue with other work after having written its message
into the buffer, and an unbuffered implementation, where the sender is blocked
until the communication has finished. The receiver is always blocked until the
end of the communication.
The processors involved in the swapping are synchronized more than
strictly necessary; the number of supersteps is actually two: one for swap-
ping components of π, and one for swapping rows of A. In the BSPlib case,
we can do the same easily within one superstep, by superstep piggybacking,
that is, combining unrelated supersteps to reduce the amount of synchronization.
(Using MPI, we could do this as well, if we are prepared to pack and
unpack the data in a certain way, using so-called derived types. This is more
complicated, and may be less efficient due to copying overhead.) An advant-
age of the send/receive approach is that the p − 2N processors not involved in
the swap can race through unhindered, but they will be stopped at the next
traffic light (the broadcast in Superstep 3).
It would be rather unnatural to coerce the row swap into the frame-
work of a collective communication. Since the pattern is relatively simple,
it cannot do much harm to use a send/receive pair here. If desired, we can
reduce the program length by using a special MPI primitive for paired mes-
sages, MPI Sendrecv, or, even better, by using MPI Sendrecv replace, which
requires only one buffer for sending and receiving, thus saving memory. (MPI
has many more variants of send and receive operations.)
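As a hedged sketch of the shorter alternative just mentioned, the communicating part of the row swap could be written with MPI_Sendrecv_replace as follows. The variables a, k, r, s, t, M, and nlc are those of the mpilu listing below, status1 is reused as the status argument, and this fragment is an illustration only, not the version used in MPIedupack.

/* Swap rows k and r when they lie on different processor rows,
   using a single buffer per processor for both directions */
if (k%M != r%M && (k%M==s || r%M==s)){
    int partner, row;

    if (k%M==s){
        partner= r%M + t*M; /* global rank of P(r mod M, t) */
        row= k/M;           /* local index of row k */
    } else {
        partner= k%M + t*M; /* global rank of P(k mod M, t) */
        row= r/M;           /* local index of row r */
    }
    MPI_Sendrecv_replace(a[row],nlc,MPI_DOUBLE,partner,1,
                         partner,1,MPI_COMM_WORLD,&status1);
}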
#include "mpiedupack.h"

#define EPS 1.0e-15

void mpilu(int M, int N, int s, int t, int n, int *pi, double **a){
/* Compute LU decomposition of n by n matrix A with partial pivoting.
Processors are numbered in two-dimensional fashion.
Program text for P(s,t) = processor s+t*M,
with 0 <= s < M and 0 <= t < N.
A is distributed according to the M by N cyclic distribution.
*/

int nloc(int p, int s, int n);


double *uk, *lk;
int nlr, nlc, k, i, j, r;

MPI_Comm row_comm_s, col_comm_t;


MPI_Status status, status1;

/* Create a new communicator for my processor row and column */


MPI_Comm_split(MPI_COMM_WORLD,s,t,&row_comm_s);
MPI_Comm_split(MPI_COMM_WORLD,t,s,&col_comm_t);

nlr= nloc(M,s,n); /* number of local rows */


nlc= nloc(N,t,n); /* number of local columns */

uk= vecallocd(nlc);
lk= vecallocd(nlr);

/* Initialize permutation vector pi */


if (t==0){
for(i=0; i<nlr; i++)
pi[i]= i*M+s; /* global row index */
}
for (k=0; k<n; k++){


int kr, kr1, kc, kc1, imax, smax, tmp;
double absmax, pivot, atmp;
struct {
double val;
int idx;
} max, max_glob;

/****** Superstep 0 ******/


kr= nloc(M,s,k); /* first local row with global index >= k */
kr1= nloc(M,s,k+1);
kc= nloc(N,t,k);
kc1= nloc(N,t,k+1);

if (k%N==t){ /* k=kc*N+t */
/* Search for local absolute maximum in column k of A */
absmax= 0.0; imax= -1;
for (i=kr; i<nlr; i++){
if (fabs(a[i][kc])>absmax){
absmax= fabs(a[i][kc]);
imax= i;
}
}

/* Determine value and global index of absolute maximum
   and broadcast them to P(*,t) */
max.val= absmax;
if (absmax>0.0){
max.idx= imax*M+s;
} else {
max.idx= n; /* represents infinity */
}
MPI_Allreduce(&max,&max_glob,1,MPI_DOUBLE_INT,MPI_MAXLOC,col_comm_t);

/****** Superstep 1 ******/

/* Determine global maximum */


r= max_glob.idx;
pivot= 0.0;
if (max_glob.val > EPS){
smax= r%M;
if (s==smax)
pivot = a[imax][kc];
/* Broadcast pivot value to P(*,t) */
MPI_Bcast(&pivot,1,MPI_DOUBLE,smax,col_comm_t);

for(i=kr; i<nlr; i++)
    a[i][kc] /= pivot;
if (s==smax)
a[imax][kc]= pivot; /* restore value of pivot */
} else {
MPI_Abort(MPI_COMM_WORLD,-6);
}
}

/* Broadcast index of pivot row to P(*,*) */


MPI_Bcast(&r,1,MPI_INT,k%N,row_comm_s);

/****** Superstep 2 ******/


if (t==0){
/* Swap pi(k) and pi(r) */
if (k%M != r%M){
if (k%M==s){
/* Swap pi(k) and pi(r) */
MPI_Send(&pi[k/M],1,MPI_INT,r%M,0,MPI_COMM_WORLD);
MPI_Recv(&pi[k/M],1,MPI_INT,r%M,0,MPI_COMM_WORLD,&status);
}
if (r%M==s){
MPI_Recv(&tmp,1,MPI_INT,k%M,0,MPI_COMM_WORLD,&status);
MPI_Send(&pi[r/M],1,MPI_INT,k%M,0,MPI_COMM_WORLD);
pi[r/M]= tmp;
}
} else if (k%M==s){
tmp= pi[k/M];
pi[k/M]= pi[r/M];
pi[r/M]= tmp;
}
}
/* Swap rows k and r */
if (k%M != r%M){
if (k%M==s){
MPI_Send(a[k/M],nlc,MPI_DOUBLE,r%M+t*M,1,MPI_COMM_WORLD);
MPI_Recv(a[k/M],nlc,MPI_DOUBLE,r%M+t*M,1,MPI_COMM_WORLD,&status1);
}
if (r%M==s){
/* abuse uk as a temporary receive buffer */
MPI_Recv(uk,nlc,MPI_DOUBLE,k%M+t*M,1,MPI_COMM_WORLD,&status1);
MPI_Send(a[r/M],nlc,MPI_DOUBLE,k%M+t*M,1,MPI_COMM_WORLD);
for(j=0; j<nlc; j++)
a[r/M][j]= uk[j];
}
} else if (k%M==s){
for(j=0; j<nlc; j++){
atmp= a[k/M][j];
a[k/M][j]= a[r/M][j];
a[r/M][j]= atmp;
}
}

/****** Superstep 3 ******/


if (k%N==t){
/* Store new column k in lk */
for(i=kr1; i<nlr; i++)
lk[i-kr1]= a[i][kc];
}
if (k%M==s){
/* Store new row k in uk */
for(j=kc1; j<nlc; j++)
uk[j-kc1]= a[kr][j];
}
MPI_Bcast(lk,nlr-kr1,MPI_DOUBLE,k%N,row_comm_s);

/****** Superstep 4 ******/


MPI_Bcast(uk,nlc-kc1,MPI_DOUBLE,k%M,col_comm_t);

/****** Superstep 0 ******/


/* Update of A */
for(i=kr1; i<nlr; i++){
for(j=kc1; j<nlc; j++)
a[i][j] -= lk[i-kr1]*uk[j-kc1];
}
}
vecfreed(lk);
vecfreed(uk);

} /* end mpilu */

C.2.4 Function mpifft


The function mpifft differs in only one place from the function bspfft: it
invokes a redistribution function written in MPI, instead of BSPlib. The
program text of this function, mpiredistr, is listed below. The function
creates packets of data in the same way as bspredistr, but it first fills a
complete array tmp with packets, instead of putting the packets separately
into a remote processor. The packets are sent to their destination by using
MPI_Alltoallv, explained above in connection with mpibench. The destination
processor of each packet is computed in the same way as in the BSPlib case.
All packets sent have the same size, which is stored in Nsend, counting two
doubles for each complex value. The offset is also stored in an array. Note
that for n/p ≥ c1/c0, the number of packets equals c1/c0. This means that
some processors may not receive a packet from P(s). In the common case of an
FFT with p ≤ √n, however, the function is used with c0 = 1 and c1 = p, so
that the condition n/p ≥ c1/c0 holds, the number of packets created by P(s)
is p, and the communication pattern is a true all-to-all; this is the
motivation for using an MPI all-to-all primitive. In the common case, all
packets have equal size n/p², and we can use MPI_Alltoall. For n/p < c1/c0,
however, we must use the more general MPI_Alltoallv. An advantage of
MPI_Alltoallv is that the packets can be stored in the receive array in every
possible order, so that we do not have to unpack the data by permuting the
packets.
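For illustration, the common case can also be handled with the fixed-size
primitive. The fragment below is a minimal sketch (not part of MPIedupack):
it assumes c0 = 1, c1 = p, and n/p ≥ p, so that every processor sends exactly
one packet of 2n/p² doubles to every processor, and it further assumes that
the packet destined for processor t has been written at offset t*packetsize
in tmp, which is not the packing order used by mpiredistr.

/* Sketch (not part of MPIedupack): fixed-size all-to-all for the common
   case c0=1, c1=p with n/p >= p. Every processor then sends one packet
   of 2*n/(p*p) doubles to every processor, so MPI_Alltoall can replace
   MPI_Alltoallv, provided the packet for processor t has been packed
   at offset t*packetsize in tmp. */
int packetsize= 2*(n/p)/p; /* doubles per packet */

MPI_Barrier(MPI_COMM_WORLD); /* same safety barrier as in mpiredistr */
MPI_Alltoall(tmp,packetsize,MPI_DOUBLE,
             x,  packetsize,MPI_DOUBLE,MPI_COMM_WORLD);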
A disadvantage of using MPI_Alltoallv instead of one-sided puts is that
receive information must be computed beforehand, namely the number of
receives and the proper offset for each source processor. This is not needed
when using bsp_put primitives as in bspredistr. Thus, MPI_Alltoallv is a true
two-sided communication operation, albeit a collective one. For each packet,
the global index of its first vector component is computed in the new
distribution (with cycle c1), and the processor srcproc is computed that owns
this component in the old distribution (with cycle c0). All the information
about sends and receives is determined before MPI_Alltoallv is called.
The MPI_Alltoallv operation is preceded by an explicit global synchronization,
which teaches us an MPI rule:
    collective communications may synchronize the processors,
    but you cannot rely on this.

Here, the local processor writes from x into tmp, and then all processors
write back from tmp into x. It can happen that a particularly quick remote
processor already starts writing into the space of x while the local processor
is still reading from it. To prevent this, we can either use an extra
temporary array, or insert a global synchronization to make sure all local
writes into tmp have finished before MPI_Alltoallv starts. We choose the
latter option, and it feels good. When in doubt, insert a barrier. The rule
also says that synchronization may occur. Thus a processor cannot send a value
before the collective communication, hoping that another processor receives it
after the collective communication. For correctness, we have to think
barriers, even if they are not there in the actual implementation.
#include "mpiedupack.h"

/****************** Parallel functions ********************************/

void mpiredistr(double *x, int n, int p, int s, int c0, int c1,
char rev, int *rho_p){

/* This function redistributes the complex vector x of length n,


stored as pairs of reals, from group-cyclic distribution
over p processors with cycle c0 to cycle c1, where
c0, c1, p, n are powers of two with 1 <= c0 <= c1 <= p <= n.
s is the processor number, 0 <= s < p.
If rev=true, the function assumes the processor numbering
is bit reversed on input.
rho_p is the bit-reversal permutation of length p.
*/

double *tmp;
int np, j0, j2, j, jglob, ratio, size, npackets, t, offset, r,
destproc, srcproc,
*Nsend, *Nrecv, *Offset_send, *Offset_recv;

np= n/p;
ratio= c1/c0;

size= MAX(np/ratio,1);
npackets= np/size;
tmp= vecallocd(2*np);
Nsend= vecalloci(p);
Nrecv= vecalloci(p);
Offset_send= vecalloci(p);
Offset_recv= vecalloci(p);

for(t=0; t<p; t++){


Nsend[t]= Nrecv[t]= 0;
Offset_send[t]= Offset_recv[t]= 0;
}

/* Initialize sender info and copy data */


offset= 0;
if (rev) {
j0= rho_p[s]%c0;
j2= rho_p[s]/c0;
} else {
j0= s%c0;
j2= s/c0;
}
for(j=0; j<npackets; j++){
jglob= j2*c0*np + j*c0 + j0;
destproc= (jglob/(c1*np))*c1 + jglob%c1;
Nsend[destproc]= 2*size;
Offset_send[destproc]= offset;
for(r=0; r<size; r++){
tmp[offset + 2*r]= x[2*(j+r*ratio)];
tmp[offset + 2*r+1]= x[2*(j+r*ratio)+1];
}
offset += 2*size;
}

/* Initialize receiver info */

offset= 0;
j0= s%c1; /* indices for after the redistribution */
j2= s/c1;
for(r=0; r<npackets; r++){
j= r*size;
jglob= j2*c1*np + j*c1 + j0;
srcproc= (jglob/(c0*np))*c0 + jglob%c0;
if (rev)
srcproc= rho_p[srcproc];
Nrecv[srcproc]= 2*size;
Offset_recv[srcproc]= offset;
offset += 2*size;
}

/* Necessary for safety */


MPI_Barrier(MPI_COMM_WORLD);

MPI_Alltoallv(tmp,Nsend,Offset_send,MPI_DOUBLE,
x, Nrecv,Offset_recv,MPI_DOUBLE,MPI_COMM_WORLD);

vecfreei(Offset_recv);
vecfreei(Offset_send);
vecfreei(Nrecv);
vecfreei(Nsend);
vecfreed(tmp);

} /* end mpiredistr */

C.2.5 Function mpimv


The final program text we discuss is different from the previous ones because
it is solely based on the MPI-2 extensions for one-sided communication. In
writing this program, we have tried to exploit the close correspondence between
the one-sided communications in BSPlib and their counterparts in MPI-2. Six
years after the MPI-2 standard was released, partial MPI-2 implementations
with reasonable functionality are starting to become available. A full
public-domain implementation for many different architectures is expected to
be delivered in the near future by the MPICH-2 project. The driving force in
developing one-sided communications is their speed and their ease of use.
MPI-2 contains unbuffered put and get operations, which are called
high-performance puts and gets in BSPlib. The motivation of the MPI-2
designers for choosing the unbuffered version is that it is easy for the user
to provide the safety of buffering if this is required. In contrast, BSPlib
provides both buffered and unbuffered versions; this book encourages use of
the buffered version, except if the user is absolutely sure that unbuffered
puts and gets are safe. The syntax of the unbuffered put primitives in BSPlib
and MPI is
bsp_hpput(pid, src, dst, dst_offsetbytes, nbytes);

MPI_Put(src, src_n, src_type, pid, dst_offset, dst_n, dst_type, dst_win);

In BSPlib, data size and offsets are measured in bytes, whereas in MPI this is
in units of the basic data type, src_type for the source array and dst_type
for the destination array. In most cases these two types will be identical (e.g.
both could be MPI_DOUBLE), and the source and destination sizes will thus be
equal. The destination memory area in the MPI-2 case is not simply given by
a pointer to memory space such as an array, but by a pointer to a window
object, which will be explained below.
The syntax of the unbuffered get primitives in BSPlib and MPI is

bsp_hpget(pid, src, src_offsetbytes, dst, nbytes);

MPI_Get(dst, dst_n, dst_type, pid, src_offset, src_n, src_type, src_win);

Note the different order of the arguments, but also the great similarity between
the puts and gets of BSPlib and those of MPI-2. In the fanout of mpimv, shown
below, one double is obtained by an MPI_Get operation from the remote
processor srcprocv[j], at an offset of srcindv[j] doubles from the start of
window v_win; the value is stored locally as vloc[j]. It is instructive to
compare the statement with the corresponding one in bspmv.
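For reference, the corresponding fanout statement in bspmv is essentially a
buffered get of the following form (a sketch reconstructed from the bsp_get
syntax, not copied verbatim from Chapter 4):

/* Sketch: buffered get of one double from the registered array v on
   processor srcprocv[j], at a byte offset of srcindv[j] doubles,
   into the local element vloc[j]. */
bsp_get(srcprocv[j],v,srcindv[j]*SZDBL,&vloc[j],SZDBL);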
A window is a preregistered and distributed memory area, consisting of
local memory on every processor of a communicator. A window is created by
MPI_Win_create, which is the equivalent of BSPlib's bsp_push_reg. We can
consider this as the registration of the memory needed before puts or gets can
be executed.
In the first call of MPI_Win_create in the function mpimv, a window of
size nv doubles is created and the size of a double is determined to be the
basic unit for expressing offsets of subsequent puts and gets into the window.
All processors of the communication world participate in creating the window.
The MPI_INFO_NULL parameter always works, but can be replaced by
other parameters to give hints to the implementation for optimization. For
further details, see the MPI-2 standard. The syntax of the registration and
deregistration primitives in BSPlib and MPI is

bsp_push_reg(variable, nbytes);
MPI_Win_create(variable, nbytes, unit, info, comm, win);
bsp_pop_reg(variable);
MPI_Win_free(win);

Here, win is the window of type MPI_Win corresponding to the array variable;
the integer unit is the unit for expressing offsets; and comm of type MPI_Comm
is the communicator of the window.
A window can be used after a call to MPI_Win_fence, which can be thought
of as a synchronization of the processors that own the window. The first
parameter of MPI_Win_fence is again for transferring optimization hints, and
can best be set to zero at the early learning stage; this is guaranteed to work.
The communications initiated before a fence are guaranteed to have been
completed after the fence. Thus the fence acts as a synchronization at the end
of a superstep. A window is destroyed by a call to MPI_Win_free, which is the
equivalent of BSPlib's bsp_pop_reg.
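To illustrate the life cycle just described, the following function is a
minimal sketch (hypothetical, not taken from MPIedupack) that creates a
window for an array v of nv doubles, performs one put inside an access epoch
delimited by two fences, and destroys the window again. Here, s is the
processor number and nv must be at least the number of processors.

#include "mpiedupack.h"

void window_lifecycle_sketch(int s, double *v, int nv){

    /* Sketch of the window life cycle: create, fence, put, fence, free.
       The array v of nv doubles must be allocated on every processor.
       Offsets are expressed in doubles, because the displacement unit
       is chosen as SZDBL, the size of a double. */

    MPI_Win v_win;
    double val= (double)s;

    MPI_Win_create(v,nv*SZDBL,SZDBL,MPI_INFO_NULL,MPI_COMM_WORLD,&v_win);

    MPI_Win_fence(0, v_win);        /* opens the access epoch */
    MPI_Put(&val,1,MPI_DOUBLE,0,    /* processor s writes its number */
            s,1,MPI_DOUBLE,v_win);  /* into v[s] on processor 0 */
    MPI_Win_fence(0, v_win);        /* all puts are completed here */

    MPI_Win_free(&v_win);

} /* end window_lifecycle_sketch */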
The MPI_Put primitive is illustrated by the function mpimv_init, which
is a straightforward translation of bspmv_init. Four windows are created,
one for each array, for example, tmpprocv_win representing the integer array
tmpprocv. (It would have been possible to use one window instead, by using
a four times larger array accessed with proper offsets, and thus saving some
fences at each superstep. This may be more efficient, but it is perhaps also a
bit clumsy and unnatural.)
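To make this alternative concrete, the following fragment is a sketch
(hypothetical, not the version used in MPIedupack) that registers one integer
array of length 4*np and addresses the four logical arrays at offsets 0, np,
2*np, and 3*np within it.

/* Sketch of the single-window alternative for mpimv_init: one array of
   length 4*np holds tmpprocv, tmpindv, tmpprocu, tmpindu at offsets
   0, np, 2*np, and 3*np, so that one window, and hence only one fence
   per superstep, suffices. */
int *tmp= vecalloci(4*np);
MPI_Win tmp_win;

MPI_Win_create(tmp,4*np*SZINT,SZINT,MPI_INFO_NULL,
               MPI_COMM_WORLD,&tmp_win);
MPI_Win_fence(0, tmp_win);

/* The put that formerly targeted tmpprocv_win now goes to offset
   0*np + jglob/p of the single window; the put into tmpindv would use
   offset np + jglob/p, and so on. */
MPI_Put(&s,1,MPI_INT,jglob%p,jglob/p,1,MPI_INT,tmp_win);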

The third and final one-sided communication operation available in MPI-2,
but not in BSPlib, is an accumulate operation, called MPI_Accumulate. It
is similar to a put, but instead of putting a value into the destination
location the accumulate operation adds a value into the location, or takes a
maximum, or performs another binary operation. The operation must be one of
the predefined MPI reduction operations. The accumulate primitive allows
targeting the same memory location by several processors. For the unbuffered
put, MPI_Put or bsp_hpput, this is unsafe. The order in which the accumulate
operations are carried out is not specified and may be implementation-dependent.
(This resembles the variation in execution order that is caused by changing p.)
In Superstep 2 of mpimv, a contribution sum, pointed to by psum, is added
by using the operator MPI_SUM into u[destindu[i]] on processor destprocu[i].
This operation is exactly what we need for the fanin.
BSPlib does not have a primitive for the accumulate operation. In bspmv
we performed the fanin by sending the partial sums to their destination using
the bsp_send primitive, retrieving them together from the system buffer in
the next superstep by using the bsp_move primitive. This is a more general
approach, less tailored to this specific situation. The bsp_send primitive does
not have an equivalent in MPI-2. It is, however, a very convenient way of
getting rid of data by sending them to another processor, not caring about
how every individual message is received. This resembles the way we send
messages by regular mail, with daily synchronization (if you are lucky) by the
postal service. The one-sided send primitive may perhaps be a nice candidate
for inclusion in a possible MPI-3 standard.
#include "mpiedupack.h"

void mpimv(int p, int s, int n, int nz, int nrows, int ncols,
double *a, int *inc,
int *srcprocv, int *srcindv, int *destprocu, int *destindu,
int nv, int nu, double *v, double *u){

/* This function multiplies a sparse matrix A with a


dense vector v, giving a dense vector u=Av.
*/

int i, j, *pinc;
double sum, *psum, *pa, *vloc, *pvloc, *pvloc_end;

MPI_Win v_win, u_win;

/****** Superstep 0. Initialize and register ******/


for(i=0; i<nu; i++)
u[i]= 0.0;
vloc= vecallocd(ncols);
MPI_Win_create(v,nv*SZDBL,SZDBL,MPI_INFO_NULL,MPI_COMM_WORLD,&v_win);
MPI_Win_create(u,nu*SZDBL,SZDBL,MPI_INFO_NULL,MPI_COMM_WORLD,&u_win);

/****** Superstep 1. Fanout ******/


MPI_Win_fence(0, v_win);
for(j=0; j<ncols; j++)
MPI_Get(&vloc[j], 1,MPI_DOUBLE,srcprocv[j],
srcindv[j],1,MPI_DOUBLE,v_win);
MPI_Win_fence(0, v_win);

/****** Superstep 2. Local matrix-vector multiplication and fanin */


MPI_Win_fence(0, u_win);
psum= &sum;
pa= a;
pinc= inc;
pvloc= vloc;
pvloc_end= pvloc + ncols;

pvloc += *pinc;
for(i=0; i<nrows; i++){
*psum= 0.0;
while (pvloc<pvloc_end){
*psum += (*pa) * (*pvloc);
pa++;
pinc++;
pvloc += *pinc;
}
MPI_Accumulate(psum,1,MPI_DOUBLE,destprocu[i],destindu[i],
1,MPI_DOUBLE,MPI_SUM,u_win);
pvloc -= ncols;
}
MPI_Win_fence(0, u_win);

MPI_Win_free(&u_win);
MPI_Win_free(&v_win);
vecfreed(vloc);

} /* end mpimv */

void mpimv_init(int p, int s, int n, int nrows, int ncols,


int nv, int nu, int *rowindex, int *colindex,
int *vindex, int *uindex, int *srcprocv, int *srcindv,
int *destprocu, int *destindu){

/* This function initializes the communication data structure


needed for multiplying a sparse matrix A with a dense vector v,
giving a dense vector u=Av.

*/

int nloc(int p, int s, int n);


int np, i, j, iglob, jglob, *tmpprocv, *tmpindv, *tmpprocu, *tmpindu;

MPI_Win tmpprocv_win, tmpindv_win, tmpprocu_win, tmpindu_win;



/****** Superstep 0. Allocate and register temporary arrays */


np= nloc(p,s,n);
tmpprocv=vecalloci(np);
tmpindv=vecalloci(np);
tmpprocu=vecalloci(np);
tmpindu=vecalloci(np);
MPI_Win_create(tmpprocv,np*SZINT,SZINT,MPI_INFO_NULL,
MPI_COMM_WORLD,&tmpprocv_win);
MPI_Win_create(tmpindv,np*SZINT,SZINT,MPI_INFO_NULL,
MPI_COMM_WORLD,&tmpindv_win);
MPI_Win_create(tmpprocu,np*SZINT,SZINT,MPI_INFO_NULL,
MPI_COMM_WORLD,&tmpprocu_win);
MPI_Win_create(tmpindu,np*SZINT,SZINT,MPI_INFO_NULL,
MPI_COMM_WORLD,&tmpindu_win);

MPI_Win_fence(0, tmpprocv_win); MPI_Win_fence(0, tmpindv_win);


MPI_Win_fence(0, tmpprocu_win); MPI_Win_fence(0, tmpindu_win);

/****** Superstep 1. Write into temporary arrays ******/


for(j=0; j<nv; j++){
jglob= vindex[j];
/* Use the cyclic distribution */
MPI_Put(&s,1,MPI_INT,jglob%p,jglob/p,1,MPI_INT,tmpprocv_win);
MPI_Put(&j,1,MPI_INT,jglob%p,jglob/p,1,MPI_INT,tmpindv_win);
}

for(i=0; i<nu; i++){


iglob= uindex[i];
MPI_Put(&s,1,MPI_INT,iglob%p,iglob/p,1,MPI_INT,tmpprocu_win);
MPI_Put(&i,1,MPI_INT,iglob%p,iglob/p,1,MPI_INT,tmpindu_win);
}
MPI_Win_fence(0, tmpprocv_win); MPI_Win_fence(0, tmpindv_win);
MPI_Win_fence(0, tmpprocu_win); MPI_Win_fence(0, tmpindu_win);

/****** Superstep 2. Read from temporary arrays ******/


for(j=0; j<ncols; j++){
jglob= colindex[j];
MPI_Get(&srcprocv[j],1,MPI_INT,jglob%p,jglob/p,1,MPI_INT,tmpprocv_win);
MPI_Get(&srcindv[j], 1,MPI_INT,jglob%p,jglob/p,1,MPI_INT,tmpindv_win);
}
for(i=0; i<nrows; i++){
iglob= rowindex[i];
MPI_Get(&destprocu[i],1,MPI_INT,iglob%p,iglob/p,1,MPI_INT,tmpprocu_win);
MPI_Get(&destindu[i], 1,MPI_INT,iglob%p,iglob/p,1,MPI_INT,tmpindu_win);
}
MPI_Win_fence(0, tmpprocv_win); MPI_Win_fence(0, tmpindv_win);
MPI_Win_fence(0, tmpprocu_win); MPI_Win_fence(0, tmpindu_win);

/****** Superstep 3. Deregister temporary arrays ******/


MPI_Win_free(&tmpindu_win); MPI_Win_free(&tmpprocu_win);
MPI_Win_free(&tmpindv_win); MPI_Win_free(&tmpprocv_win);

/****** Superstep 4. Free temporary arrays ******/



vecfreei(tmpindu); vecfreei(tmpprocu);
vecfreei(tmpindv); vecfreei(tmpprocv);

} /* end mpimv_init */

C.3 Performance comparison on an SGI Origin 3800


To compare the performance of the programs from MPIedupack with those
of BSPedupack, we performed experiments on Teras, the SGI Origin 3800
computer used in Chapter 3 to test the FFT. We ran the programs for inner
product computation (bspinprod and mpiinprod), LU decomposition (bsplu
and mpilu), FFT (bspfft and mpifft), and sparse matrix–vector multi-
plication (bspmv and mpimv), which all have the same level of optimization
for the BSPlib and MPI versions. We excluded the benchmarking programs,
because bspbench measures pessimistic g-values and mpibench optimistic
values, making them incomparable. The results for a proper benchmark would
resemble those of the FFT, because both mpibench and mpifft perform their
communication by using the MPI_Alltoallv primitive.
All programs were compiled using the same MIPSpro C compiler, which is
the native compiler of the SGI Origin 3800. The MPI-1 programs have been
linked with the MPT 1.6 implementation of MPI. The MPI-2 program mpimv
has been linked with MPT 1.8, which has just been released and which is
the first version to include all three one-sided communications. Although the
program mpimv is completely legal in MPI-2, we had to allocate the target
memory of the one-sided communications by using MPI_Alloc_mem from MPI
in mpiedupack.c instead of malloc from C. This is because of a restriction
imposed by the MPT implementation. Each experiment was performed three
times and the minimum of the three timings was taken, assuming this result
suffered the least from other activities on the parallel computer. The problem
size was deliberately chosen small, to expose the communication behaviour.
(For large problems, computation would be dominant, making differences
in communication performance less clearly visible.) The matrix used for the
sparse matrix–vector multiplication is amorph20k with 100 000 nonzeros.
The timing results of our experiments are presented in Table C.1. The
inner product program shows the limits of the BSP model: the amount of
communication is small, one superstep performing a (p − 1)-relation, which is
dominated by the global synchronization and the overhead of the superstep.
This approach is only viable for a sufficiently large amount of computation,
and n = 100 000 is clearly too small. The MPI version makes excellent use of
the specific nature of this problem and perhaps of the hardware, leading to
good scalability. The LU decomposition shows slightly better performance of
BSPlib for p = 1 and p = 2, indicating lower overhead, but MPI scales better
for larger p, achieving a speedup of about 10 on 16 processors. For the FFT,
a similar behaviour can be observed, with a largest speedup of about 8 on 16
processors for MPI.

Table C.1. Time Tp(n) (in ms) of parallel programs from BSPedupack
and MPIedupack on p processors of a Silicon Graphics Origin 3800

Program             n         p    BSPlib   MPI

Inner product       100 000   1    4.3      4.3
                              2    4.2      2.2
                              4    5.9      1.1
                              8    9.1      0.6
                              16   26.8     0.3
LU decomposition    1000      1    5408     6341
                              2    2713     2744
                              4    1590     1407
                              8    1093     863
                              16   1172     555
FFT                 262 144   1    154      189
                              2    111      107
                              4    87       50
                              8    41       26
                              16   27       19
Matrix–vector       20 000    1    3.8      3.9
                              2    11.4     2.7
                              4    14.7     6.9
                              8    20.8     8.4
                              16   18.7     11.0

The matrix–vector multiplication has been optimized by preventing a
processor from sending data to itself. This yields large savings for both
versions and eliminates the parallel overhead for p = 1. To enable a fair
comparison, the buffered get operation in the fanout of the BSPlib version
has been replaced by an unbuffered get; the fanin by bulk synchronous
message passing remains buffered. The MPI version is completely unbuffered,
as it is based on the one-sided MPI-2 primitives, which may partly explain its
superior performance. The matrix–vector multiplication has not been optimized
to obtain optimistic g-values, in contrast to the LU decomposition and
the FFT. The test problem is too small to expect any speedup, as discussed
in Section 4.10. The results of both versions can be improved considerably by
further optimization.
Overall, the results show that the performance of BSPlib and MPI is
comparable, but with a clear advantage for MPI. This may be explained by the
fact that the MPI version used is a recent, vendor-supplied implementation,
which has clearly been optimized very well. On the other hand, the BSPlib
implementation (version 1.4, from 1998) is older and was actually optimized
for the SGI Origin 2000, a predecessor of the Origin 3800. No adjustment was
needed when installing the software, but no fine-tuning was done either.
Other experiments comparing BSPlib and MPI have been performed on
different machines. For instance, the BSPlib version of the LU decomposition
from ScaLAPACK by Horvitz and Bisseling [110] on the Cray T3E was found
to be 10–15% faster than the original MPI version. Parallel Templates by
Koster [124] contains a highly optimized version of our sparse matrix–vector
multiplication, both in BSPlib and in MPI-1 (using MPI_Alltoallv). Koster
reports close results on a T3E with a slight advantage for MPI.

C.4 Where BSP meets MPI


Almost every parallel computer these days comes equipped with an imple-
mentation of MPI-1. Sometimes, MPI-2 extensions are available as well. For
many parallel computers, we can install a public-domain version of BSPlib
ourselves, or have it installed by the systems administrator. Which road should
we take when programming in bulk synchronous parallel style?
To answer the question, we propose four different approaches, each of
which can be recommended for certain situations. The first is the purist
approach used in the chapters of this book, writing our programs in BSPlib
and installing BSPlib ourselves if needed. The main advantage is ease of
use, and automatic enforcement of the BSP style. An important advantage
is the impossibility of introducing deadlock in BSPlib programs. For some
machines, an efficient implementation is available. For other machines, the
BSPlib implementation on top of MPI-1 from the Oxford BSP toolset [103]
and the Paderborn University BSP (PUB) library [28,30] can be used. This
may be slow, but could be acceptable for development purposes. It is my hope
that this book will stimulate the development of more efficient implementa-
tions of BSPlib. In particular, it should not be too difficult to implement
BSPlib efficiently on top of MPI-2, basing all communications on the one-
sided communications introduced by MPI-2. (Optimization by a clever BSPlib
system could even lead to faster communication compared with using the
underlying MPI-2 system directly.)
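To indicate why such an implementation on top of MPI-2 seems feasible, here
is a minimal sketch of the core idea (an illustration only, not an existing
library): a registered memory area becomes an MPI-2 window with a
displacement unit of one byte, the unbuffered bsp_hpput maps onto MPI_Put,
and bsp_sync maps onto a fence. Registration stacks, buffered puts and gets,
bulk synchronous message passing, and error handling are all omitted.

/* Sketch of mapping the core of BSPlib onto MPI-2 one-sided
   communication. A single registered area is represented by one window,
   created with displacement unit 1 so that byte offsets can be used. */
static MPI_Win reg_win; /* window for the one registered area */

void sketch_push_reg(void *variable, int nbytes){
    MPI_Win_create(variable,nbytes,1,MPI_INFO_NULL,
                   MPI_COMM_WORLD,&reg_win);
    MPI_Win_fence(0, reg_win);   /* open the first superstep */
}

void sketch_hpput(int pid, const void *src, int offsetbytes, int nbytes){
    MPI_Put((void *)src,nbytes,MPI_BYTE,pid,
            offsetbytes,nbytes,MPI_BYTE,reg_win);
}

void sketch_sync(void){
    MPI_Win_fence(0, reg_win);   /* completes all outstanding puts */
}

void sketch_pop_reg(void){
    MPI_Win_free(&reg_win);
}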
The programs from MPIedupack are in general shorter than those in
BSPedupack, due to the use of preexisting collective-communication func-
tions in MPI. We have learned in this book to write such functions ourselves
in BSPlib, and thus we can design them for every specific purpose. It would
be helpful, however, to have a common collection of efficient and well-tested
functions available for everybody. A start has been made by the inclusion of
so-called level-1 functions in the Oxford BSP toolset [103] and a set of col-
lective communications in the PUB library [28,30]. I would like to encourage
readers to write equivalents of MPI collective communications and to make
them available under the GNU General Public License. I promise to help in
this endeavour.
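As an indication of what such a function might look like, here is a minimal
one-superstep broadcast in BSPlib, in the spirit of MPI_Bcast (a sketch: the
name bspbroadcast is ours, and the array x is assumed to have been registered
with bsp_push_reg by all processors beforehand).

#include "bspedupack.h"

void bspbroadcast(int p, int s, int root, void *x, int nbytes){

    /* Sketch of a one-superstep broadcast: processor root puts its
       nbytes bytes of x into the registered array x of every other
       processor, a (p-1)-relation of size nbytes. */

    int t;

    if (s==root){
        for (t=0; t<p; t++){
            if (t!=root)
                bsp_put(t,x,x,0,nbytes);
        }
    }
    bsp_sync();

} /* end bspbroadcast */

For long vectors, a two-phase broadcast would spread the communication load
more evenly over the processors.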
The second approach is the hybrid approach, writing a single program
in BSP style, but expressing all communication both in MPI and BSPlib.
The resulting single-source program can then be compiled conditionally. The
conditionally compiled statements can for instance be complete functions,
such as the redistribution function of the FFT:


#ifdef MPITARGET
mpiredistr(x,n,p,s,c0,c,rev,rho_p);
#else
bspredistr(x,n,p,s,c0,c,rev,rho_p);
#endif
The compilation command would then have a flag -DMPITARGET if compilation
using MPI is required. The default would be using BSPlib. (If desired, the
preference can easily be reversed.) This can also be applied at a lower level:
#ifdef MPITARGET
MPI_Barrier(MPI_COMM_WORLD);
time= MPI_Wtime();
#else
bsp_sync();
time= bsp_time();
#endif
The hybrid approach keeps one source file, and has the advantage that
it allows choosing the fastest implementation available on the machine used,
either BSPlib or MPI. The price to be paid is an increase in the amount of
program text, but as we saw in the conversion the differences between the
BSPedupack programs and the MPIedupack programs are often limited, and
hence the number of extra lines of program text in a hybrid program is expec-
ted to be limited. Based on my own experience with the conversion described
above, the main differences are in the I/O parts of the programs, and in com-
munication parts that are well-isolated because of the structured approach
inherent in the bulk synchronous parallel style. An additional advantage of
the hybrid approach is that it encourages programming in this style also in the
MPI part of programs, where the temptation of using matching send/receive
pairs always lures.
The third approach is to develop programs using BSPlib, and when the
need arises convert them to MPI-2. We have seen how this can be done for the
programs from BSPedupack. To give an idea, it took me about a week (human
processing time) to convert the whole of BSPedupack to MPI, including all
driver programs, and to compile and test the resulting programs. After having
read this appendix, a similar task should take you less time. The extra human
time incurred by having to convert the final result to MPI is compensated for
by the quicker development of the original program. (If, however, you have to
develop many collective communications yourself in BSPlib, this represents
an additional time investment compared with MPI.)
The fourth approach is to program directly in MPI-2, using collective
communications where possible, and keeping the lessons learned from the BSP
model in mind. This approach probably works best after having obtained some
experience with BSPlib.

The strength of MPI is its wide availability and broad functionality. You
can do almost anything in MPI, except cooking dinner. The weakness of MPI is
its sheer size: the full standard [137,138] needs 550 pages, which is much more
than the 34 pages of the BSPlib standard [105]. This often leads developers of
system software to implementing only a subset of the MPI primitives, which
harms portability. It also forces users to learn only a subset of the primit-
ives, which makes it more difficult to read programs written by others, since
different programmers will most likely choose a different subset. Every imple-
mented MPI primitive is likely to be optimized independently, with a varying
rate of success. This makes it impossible to develop a uniform cost model that
realistically reflects the performance of every primitive. In contrast, the small
size of BSPlib and the underlying cost model provide a better focus to the
implementer and make theoretical cost analysis and cost predictions feasible.
A fundamental difference between MPI and BSPlib is that MPI provides
more opportunities for optimization by the user, by allowing many different
ways to tackle a given programming task, whereas BSPlib provides more
opportunities for optimization by the system. For an experienced user, MPI
may achieve better results than BSPlib, but for an inexperienced user this
may be the reverse.
We have seen that MPI software can be used for programming in BSP
style, even though it was not specifically designed for this purpose. Using
collective communication wherever possible leads to supersteps and global
synchronizations. Puts and gets are available in MPI-2 and can be used in the
same way as BSPlib high-performance puts and gets. Still, in using MPI one
would miss the imposed discipline provided by BSPlib. A small, paternalistic
library such as BSPlib steers programming efforts in the right direction, unlike
a large library such as MPI, which allows many different styles of programming
and is more tolerant of deviations from the right path.
In this appendix, we have viewed MPI from a BSP perspective, which may
be a fresh view for those readers who are already familiar with MPI. We can
consider the BSP model as the theoretical cost model behind the one-sided
communications of MPI-2. Even though the full MPI-2 standard is not yet
available on all parallel machines, its extensions are useful and suitable to the
BSP style, giving us another way of writing well-structured parallel programs.
REFERENCES

[1] Agarwal, R. C., Balle, S. M., Gustavson, F. G., Joshi, M., and Palkar, P.
(1995). A three-dimensional approach to parallel matrix multiplication.
IBM Journal of Research and Development, 39, 575–82.
[2] Agarwal, R. C. and Cooley, J. W. (1987). Vectorized mixed radix
discrete Fourier transform algorithms. Proceedings of the IEEE , 75,
1283–92.
[3] Aggarwal, A., Chandra, A. K., and Snir, M. (1990). Communication
complexity of PRAMs. Theoretical Computer Science, 71, 3–28.
[4] Alpatov, P., Baker, G., Edwards, C., Gunnels, J., Morrow, G.,
Overfelt, J., van de Geijn, R., and Wu, Y.-J. J. (1997). PLAPACK:
Parallel linear algebra package. In Proceedings Eighth SIAM Conference
on Parallel Processing for Scientific Computing. SIAM, Philadelphia.
[5] Alpert, R. D. and Philbin, J. F. (1997, February). cBSP: Zero-cost
synchronization in a modified BSP model. Technical Report 97-054,
NEC Research Institute, Princeton, NJ.
[6] Anderson, E., Bai, Z., Bischof, C., Blackford, L. S., Demmel, J.,
Dongarra, J., Du Croz, J, Greenbaum, A., Hammarling, S.,
McKenney, A., and Sorensen, D. (1999). LAPACK Users’ Guide (3rd
edn). SIAM, Philadelphia.
[7] Ashcraft, C. C. (1990, October). The distributed solution of linear
systems using the torus wrap data mapping. Technical Report ECA-
TR-147, Boeing Computer Services, Seattle, WA.
[8] Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., and van der Vorst, H. (ed.)
(2000). Templates for the Solution of Algebraic Eigenvalue Problems:
A Practical Guide. SIAM, Philadelphia.
[9] Barnett, M., Gupta, S., Payne, D. G., Shuler, L., van de Geijn, R., and
Watts, J. (1994). Building a high-performance collective communication
library. In Proceedings Supercomputing 1994, pp. 107–116. IEEE Press,
Los Alamitos, CA.
[10] Barnett, M., Payne, D. G., van de Geijn, R. A., and Watts, J. (1996).
Broadcasting on meshes with wormhole routing. Journal of Parallel
and Distributed Computing, 35, 111–22.
[11] Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J.,
Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and van der Vorst, H.
(1994). Templates for the Solution of Linear Systems: Building Blocks
for Iterative Methods. SIAM, Philadelphia.
[12] Barriuso, R. and Knies, A. (1994, May). SHMEM user’s guide
revision 2.0. Technical report, Cray Research Inc., Mendota
Heights, MN.
[13] Barros, S. R. M. and Kauranne, T. (1994). On the parallelization of
global spectral weather models. Parallel Computing, 20, 1335–56.
[14] Bauer, F. L. (2000). Decrypted Secrets: Methods and Maxims of
Cryptology (2nd edn). Springer, Berlin.
[15] Bäumker, A., Dittrich, W., and Meyer auf der Heide, F. (1998). Truly
efficient parallel algorithms: 1-optimal multisearch for an extension of
the BSP model. Theoretical Computer Science, 203, 175–203.
[16] Bays, C. and Durham, S. D. (1976). Improving a poor random number
generator. ACM Transactions on Mathematical Software, 2, 59–64.
[17] Bilardi, G., Herley, K. T., Pietracaprina, A., Pucci, G., and Spirakis, P.
(1996). BSP vs LogP. In Eighth Annual ACM Symposium on Parallel
Algorithms and Architectures, pp. 25–32. ACM Press, New York.
[18] Bilderback, M. L. (1999). Improving unstructured grid application exe-
cution times by balancing the edge-cuts among partitions. In Proceedings
Ninth SIAM Conference on Parallel Processing for Scientific Computing
(ed. B. Hendrickson et al.). SIAM, Philadelphia.
[19] Bisseling, R. H. (1993). Parallel iterative solution of sparse linear
systems on a transputer network. In Parallel Computation (ed.
A. E. Fincham and B. Ford), Volume 46 of The Institute of Mathematics
and its Applications Conference Series, pp. 253–71. Oxford University
Press, Oxford.
[20] Bisseling, R. H. (1997). Basic techniques for numerical linear algebra
on bulk synchronous parallel computers. In Workshop Numerical Ana-
lysis and its Applications 1996 (ed. L. Vulkov, J. Waśniewski, and
P. Yalamov), Volume 1196 of Lecture Notes in Computer Science,
pp. 46–57. Springer, Berlin.
[21] Bisseling, R. H. and McColl, W. F. (1993, December). Scientific
computing on bulk synchronous parallel architectures. Preprint
836, Department of Mathematics, Utrecht University, Utrecht, the
Netherlands.
[22] Bisseling, R. H. and McColl, W. F. (1994). Scientific computing on
bulk synchronous parallel architectures. In Technology and Founda-
tions: Information Processing ’94, Vol. 1 (ed. B. Pehrson and I. Simon),
Volume 51 of IFIP Transactions A, pp. 509–14. Elsevier Science,
Amsterdam.
[23] Bisseling, R. H. and van de Vorst, J. G. G. (1989). Parallel LU decom-
position on a transputer network. In Parallel Computing 1988 (ed.
G. A. van Zee and J. G. G. van de Vorst), Volume 384 of Lecture Notes
in Computer Science, pp. 61–77. Springer, Berlin.
[24] Blackford, L. S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J.,
Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A.,
Stanley, K., Walker, D., and Whaley, R. C. (1997a). ScaLAPACK:
A linear algebra library for message-passing computers. In Proceed-
ings Eighth SIAM Conference on Parallel Processing for Scientific
Computing. SIAM, Philadelphia.
[25] Blackford, L. S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J.,
Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A.,
Stanley, K., Walker, D., and Whaley, R. C. (1997b). ScaLAPACK
Users’ Guide. SIAM, Philadelphia.
[26] Boisvert, R. F., Pozo, R., Remington, K., Barrett, R. F., and
Dongarra, J. J. (1997). Matrix Market: A web resource for test matrix
collections. In The Quality of Numerical Software: Assessment and
Enhancement (ed. R. F. Boisvert), pp. 125–37. Chapman and Hall,
London.
[27] Bongiovanni, G., Corsini, P., and Frosini, G. (1976). One-dimensional
and two-dimensional generalized discrete Fourier transforms. IEEE
Transactions on Acoustics, Speech, and Signal Processing, ASSP-24,
97–9.
[28] Bonorden, O., Dynia, M., Gehweiler, J., and Wanka, R. (2003,
July). PUB-library, release 8.1-pre, user guide and function reference.
Technical report, Heinz Nixdorf Institute, Department of Computer
Science, Paderborn University, Paderborn, Germany.
[29] Bonorden, O., Hüppelshäuser, N., Juurlink, B., and Rieping, I. (2000,
June). The Paderborn University BSP (PUB) library on the Cray
T3E. Project report, Heinz Nixdorf Institute, Department of Computer
Science, Paderborn University, Paderborn, Germany.
[30] Bonorden, O., Juurlink, B., von Otte, I., and Rieping, I. (2003). The
Paderborn University BSP (PUB) library. Parallel Computing, 29,
187–207.
[31] Bracewell, R. N. (1999). The Fourier Transform and its Applications
(3rd edn). McGraw-Hill Series in Electrical Engineering. McGraw-Hill,
New York.
[32] Brent, R. P. (1975). Multiple-precision zero-finding methods and the
complexity of elementary function evaluation. In Analytic Compu-
tational Complexity (ed. J. F. Traub), pp. 151–76. Academic Press,
New York.
[33] Briggs, W. L. and Henson, V. E. (1995). The DFT: An Owner’s
Manual for the Discrete Fourier Transform. SIAM, Philadelphia.
[34] Bui, T. N. and Jones, C. (1993). A heuristic for reducing fill-in in sparse
matrix factorization. In Proceedings Sixth SIAM Conference on Parallel
Processing for Scientific Computing, pp. 445–52. SIAM, Philadelphia.
[35] Caldwell, A. E., Kahng, A. B., and Markov, I. L. (2000). Improved
algorithms for hypergraph bipartitioning. In Proceedings Asia and
South Pacific Design Automation Conference, pp. 661–6. ACM Press,
New York.
[36] Çatalyürek, Ü. V. and Aykanat, C. (1996). Decomposing irregularly
sparse matrices for parallel matrix–vector multiplication. In Proceed-
ings Third International Workshop on Solving Irregularly Structured
Problems in Parallel (Irregular 1996) (ed. A. Ferreira, J. Rolim,
Y. Saad, and T. Yang), Volume 1117 of Lecture Notes in Computer
Science, pp. 75–86. Springer, Berlin.
[37] Çatalyürek, Ü. V. and Aykanat, C. (1999). Hypergraph-partitioning-
based decomposition for parallel sparse-matrix vector multiplication.
IEEE Transactions on Parallel and Distributed Systems, 10, 673–93.
[38] Çatalyürek, Ü. V. and Aykanat, C. (2001). A fine-grain hypergraph
model for 2D decomposition of sparse matrices. In Proceedings Eighth
International Workshop on Solving Irregularly Structured Problems in
Parallel (Irregular 2001), p. 118. IEEE Press, Los Alamitos, CA.
[39] Cavallar, S., Dodson, B., Lenstra, A. K., Lioen, W., Montgomery,
P. L., Murphy, B., te Riele, H., Aardal, K., Gilchrist, J., Guillerm, G.,
Leyland, P., Marchand, J., Morain, F., Muffett, A., Putnam, Chris,
Putnam, Craig, and Zimmermann, P. (2000). Factorization of a 512-
bit RSA modulus. In Advances in Cryptology: EUROCRYPT 2000
(ed. B. Preneel), Volume 1807 of Lecture Notes in Computer Science,
pp. 1–18. Springer, Berlin.
[40] Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a
hypothesis based on the sum of observations. Annals of Mathematical
Statistics, 23, 493–507.
[41] Choi, J., Dongarra, J. J., Ostrouchov, L. S., Petitet, A. P.,
Walker, D. W., and Whaley, R. C. (1996). The design and implement-
ation of the ScaLAPACK LU, QR, and Cholesky factorization routines.
Scientific Programming, 5, 173–84.
[42] Chu, E. and George, A. (1987). Gaussian elimination with partial pivot-
ing and load balancing on a multiprocessor. Parallel Computing, 5,
65–74.
[43] Chu, E. and George, A. (2000). Inside the FFT Black Box: Serial
and Parallel Fast Fourier Transform Algorithms. Computational
Mathematics Series. CRC Press, Boca Raton, FL.
[44] Cooley, J. W. (1990). How the FFT gained acceptance. In A History
of Scientific Computing (ed. S. G. Nash), pp. 133–140. ACM Press,
New York.
[45] Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine
calculation of complex Fourier series. Mathematics of Computation, 19,
297–301.
[46] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2001).
Introduction to algorithms (2nd edn). MIT Press, Cambridge, MA and
McGraw-Hill, New York.
[47] Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E.,
Santos, E., Subramonian, R., and von Eicken, T. (1993). LogP:
Towards a realistic model of parallel computation. ACM SIGPLAN
Notices, 28(7), 1–12.
[48] Culler, D. E., Dusseau, A., Goldstein, S. C., Krishnamurthy, A.,
Lumetta, S., von Eicken, T., and Yelick, K. (1993). Parallel program-
ming in Split-C. In Proceedings Supercomputing 1993, pp. 262–73.
IEEE Press, Los Alamitos, CA.
[49] Culler, D. E., Karp, R. M., Patterson, D., Sahay, A., Santos, E. E.,
Schauser, K. E., Subramonian, R., and von Eicken, T. (1996).
LogP: A practical model of parallel computation. Communications of
the ACM , 39(11), 78–85.
[50] da Cunha, R. D. and Hopkins, T. (1995). The Parallel Iterative
Methods (PIM) package for the solution of systems of linear equations
on parallel computers. Applied Numerical Mathematics, 19, 33–50.
[51] Danielson, G. C. and Lanczos, C. (1942). Some improvements in
practical Fourier analysis and their application to X-ray scattering from
liquids. Journal of the Franklin Institute, 233, 365–80, 435–52.
[52] Daubechies, I. (1988). Orthonormal bases of compactly supported
wavelets. Communications on Pure and Applied Mathematics, 41,
909–96.
[53] Davis, T. A. (1994–2003). University of Florida sparse matrix collection.
Online collection, http://www.cise.ufl.edu/research/sparse/
matrices, Department of Computer and Information Science and
Engineering, University of Florida, Gainesville, FL.
[54] de la Torre, P. and Kruskal, C. P. (1992). Towards a single model of
efficient computation in real machines. Future Generation Computer
Systems, 8, 395–408.
[55] de la Torre, P. and Kruskal, C. P. (1996). Submachine locality in the
bulk synchronous setting. In Euro-Par’96 Parallel Processing. Vol. 2
(ed. L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert), Volume 1124
of Lecture Notes in Computer Science, pp. 352–8. Springer, Berlin.
[56] Dijkstra, E. W. (1968). Go to statement considered harmful.
Communications of the ACM , 11, 147–8.
[57] Donaldson, S. R., Hill, J. M. D., and Skillicorn, D. B. (1999). Pre-
dictable communication on unpredictable networks: implementing BSP
over TCP/IP and UDP/IP. Concurrency: Practice and Experience, 11,
687–700.
[58] Dongarra, J. J. (2003, April). Performance of various computers
using standard linear equations software. Technical Report CS-89-85,
Computer Science Department, University of Tennessee, Knoxville,
TN. Continuously being updated at
http://www.netlib.org/benchmark/performance.ps.
[59] Dongarra, J. J., Du Croz, J., Hammarling, S., and Duff, I. (1990).
A set of level 3 Basic Linear Algebra Subprograms. ACM Transactions
on Mathematical Software, 16, 1–17.
[60] Dongarra, J. J., Du Croz, J., Hammarling, S., and Hanson, R. J. (1988).
An extended set of FORTRAN Basic Linear Algebra Subprograms.
ACM Transactions on Mathematical Software, 14, 1–17.
[61] Dongarra, J. J., Duff, I. S., Sorensen, D. C., and van der Vorst, H. A.
(1998). Numerical Linear Algebra for High-Performance Computers.
Software, Environments, Tools. SIAM, Philadelphia.
[62] Dubey, A., Zubair, M., and Grosch, C. E. (1994). A general purpose
subroutine for fast Fourier transform on a distributed memory parallel
machine. Parallel Computing, 20, 1697–1710.
[63] Duff, I. S., Erisman, A. M., and Reid, J. K. (1986). Direct Methods
for Sparse Matrices. Monographs on Numerical Analysis. Oxford
University Press, Oxford.
[64] Duff, I. S., Grimes, R. G., and Lewis, J. G. (1989). Sparse matrix test
problems. ACM Transactions on Mathematical Software, 15, 1–14.
[65] Duff, I. S., Grimes, R. G., and Lewis, J. G. (1997, September).
The Rutherford–Boeing sparse matrix collection. Technical Report
TR/PA/97/36, CERFACS, Toulouse, France.
[66] Duff, I. S., Heroux, M. A., and Pozo, R. (2002). An overview of
the Sparse Basic Linear Algebra Subprograms: the new standard
from the BLAS technical forum. ACM Transactions on Mathematical
Software, 28, 239–67.
[67] Duff, I. S. and van der Vorst, H. A. (1999). Developments and trends
in the parallel solution of linear systems. Parallel Computing, 25,
1931–70.
[68] Edelman, A., McCorquodale, P., and Toledo, S. (1999). The future
fast Fourier transform. SIAM Journal on Scientific Computing, 20,
1094–1114.
[69] Fiduccia, C. M. and Mattheyses, R. M. (1982). A linear-time heuristic
for improving network partitions. In Proceedings of the 19th IEEE
Design Automation Conference, pp. 175–81. IEEE Press, Los Alamitos,
CA.
[70] Foster, I. T. and Worley, P. H. (1997). Parallel algorithms for the spec-
tral transform method. SIAM Journal on Scientific Computing, 18,
806–37.
[71] Fox, G. C., Johnson, M. A., Lyzenga, G. A., Otto, S. W., Salmon, J. K.,
and Walker, D. W. (1988). Solving Problems on Concurrent Processors:
Vol. 1, General Techniques and Regular Problems. Prentice-Hall,
Englewood Cliffs, NJ.
[72] Fraser, D. (1976). Array permutation by index-digit permutation.
Journal of the ACM , 23, 298–308.
[73] Frigo, M. and Johnson, S. G. (1998). FFTW: An adaptive software
architecture for the FFT. In Proceedings IEEE International Confer-
ence on Acoustics, Speech, and Signal Processing, Vol. 3, pp. 1381–4.
IEEE Press, Los Alamitos, CA.
[74] Gauss, C. F. (1866). Theoria interpolationis methodo nova tractata.
In Carl Friedrich Gauss Werke, Vol. 3, pp. 265–327. Königlichen
Gesellschaft der Wissenschaften, Göttingen, Germany.
[75] Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Mancheck, R., and
Sunderam, V. (1994). PVM: Parallel Virtual Machine. A Users’
Guide and Tutorial for Networked Parallel Computing. Scientific and
Engineering Computation Series. MIT Press, Cambridge, MA.
[76] Geist, G. A., Kohl, J. A., and Papadopoulos, P. M. (1996). PVM and
MPI: A comparison of features. Calculateurs Parallèles, 8(2), 137–50.
[77] Geist, G. A. and Romine, C. H. (1988). LU factorization algorithms
on distributed-memory multiprocessor architectures. SIAM Journal on
Scientific and Statistical Computing, 9, 639–49.
[78] Gerbessiotis, A. V. and Valiant, L. G. (1994). Direct bulk-synchronous
parallel algorithms. Journal of Parallel and Distributed Computing, 22,
251–67.
[79] Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd
edn). Johns Hopkins Studies in the Mathematical Sciences. Johns
Hopkins University Press, Baltimore, MD.
[80] Goudreau, M. W., Lang, K., Rao, S. B., Suel, T., and Tsantilas, T.
(1999). Portable and efficient parallel computing using the BSP model.
IEEE Transactions on Computers, 48, 670–89.
[81] Goudreau, M. W., Lang, K., Rao, S. B., and Tsantilas, T. (1995, June).
The Green BSP library. Technical Report CS-TR-95-11, Department
of Computer Science, University of Central Florida, Orlando, FL.
[82] Grama, A., Gupta, A., Karypis, G., and Kumar, V. (2003). Introduc-
tion to Parallel Computing (2nd edn). Addison-Wesley, Harlow, UK.
[83] Gropp, W., Huss-Lederman, S., Lumsdaine, A., Lusk, E., Nitzberg, B.,
Saphir, W., and Snir, M. (1998). MPI: The Complete Reference.
Vol. 2, The MPI Extensions. Scientific and Engineering Computation
Series. MIT Press, Cambridge, MA.
[84] Gropp, W., Lusk, E., and Skjellum, A. (1999a). Using MPI: Portable
Parallel Programming with the Message-Passing Interface (2nd edn).
MIT Press, Cambridge, MA.
[85] Gropp, W., Lusk, E., and Thakur, R. (1999b). Using MPI-2:
Advanced Features of the Message-Passing Interface. MIT Press,
Cambridge, MA.
[86] Gropp, W., Lusk, E., Doss, N., and Skjellum, A. (1996). A high-
performance, portable implementation of the MPI message passing
interface standard. Parallel Computing, 22, 789–828.
[87] Gupta, A. and Kumar, V. (1993). The scalability of FFT on parallel
computers. IEEE Transactions on Parallel and Distributed Systems, 4,
922–32.
[88] Gupta, S. K. S., Huang, C.-H., Sadayappan, P., and Johnson, R. W.
(1994). Implementing fast Fourier transforms on distributed-memory
multiprocessors using data redistributions. Parallel Processing Letters,
4, 477–88.
[89] Gustavson, F. G. (1972). Some basic techniques for solving sparse sys-
tems of linear equations. In Sparse Matrices and Their Applications
(ed. D. J. Rose and R. A. Willoughby), pp. 41–52. Plenum Press,
New York.
[90] Haynes, P. D. and Côté, M. (2000). Parallel fast Fourier transforms
for electronic structure calculations. Computer Physics Communica-
tions, 130, 130–6.
[91] Hegland, M. (1995). An implementation of multiple and multivariate
Fourier transforms on vector processors. SIAM Journal on Scientific
Computing, 16, 271–88.
[92] Heideman, M. T., Johnson, D. H., and Burrus, C. S. (1984). Gauss
and the history of the fast Fourier transform. IEEE Acoustics, Speech,
and Signal Processing Magazine, 1(4), 14–21.
[93] Hendrickson, B. (1998). Graph partitioning and parallel solvers: Has
the emperor no clothes? In Proceedings Fifth International Workshop
on Solving Irregularly Structured Problems in Parallel (Irregular 1998)
(ed. A. Ferreira, J. Rolim, H. Simon, and S.-H. Teng), Volume 1457 of
Lecture Notes in Computer Science, pp. 218–25. Springer, Berlin.
[94] Hendrickson, B., Jessup, E., and Smith, C. (1999). Toward an efficient
parallel eigensolver for dense symmetric matrices. SIAM Journal on
Scientific Computing, 20, 1132–54.
[95] Hendrickson, B. and Leland, R. (1995). A multilevel algorithm for
partitioning graphs. In Proceedings Supercomputing 1995. ACM Press,
New York.
[96] Hendrickson, B. A., Leland, R., and Plimpton, S. (1995). An efficient
parallel algorithm for matrix–vector multiplication. International
Journal of High Speed Computing, 7, 73–88.
[97] Hendrickson, B. and Plimpton, S. (1995). Parallel many-body sim-
ulations without all-to-all communication. Journal of Parallel and
Distributed Computing, 27, 15–25.
[98] Hendrickson, B. A. and Womble, D. E. (1994). The torus-wrap
mapping for dense matrix calculations on massively parallel computers.
SIAM Journal on Scientific Computing, 15, 1201–26.
[99] Hestenes, M. R. and Stiefel, E. (1952). Methods of conjugate gradients
for solving linear systems. Journal of Research of the National Bureau
of Standards, 49, 409–36.
[100] Higham, D. J. and Higham, N. J. (2000). MATLAB Guide. SIAM,
Philadelphia.
[101] Hill, J. M. D., Crumpton, P. I., and Burgess, D. A. (1996). Theory,
practice, and a tool for BSP performance prediction. In Euro-Par’96
Parallel Processing. Vol. 2 (ed. L. Bougé, P. Fraigniaud, A. Mignotte,
and Y. Robert), Volume 1124 of Lecture Notes in Computer Science,
pp. 697–705. Springer, Berlin.
[102] Hill, J. M. D., Donaldson, S. R., and Lanfear, T. (1998a). Process
migration and fault tolerance of BSPlib programs running on networks
of workstations. In Euro-Par’98, Volume 1470 of Lecture Notes in
Computer Science, pp. 80–91. Springer, Berlin.
[103] Hill, J. M. D., Donaldson, S. R., and McEwan, A. (1998b, September).
Installation and user guide for the Oxford BSP toolset (v1.4) imple-
mentation of BSPlib. Technical report, Oxford University Computing
Laboratory, Oxford, UK.
[104] Hill, J. M. D., Jarvis, S. A., Siniolakis, C. J., and Vasiliev, V. P.
(1998c). Portable and architecture independent parallel performance
tuning using a call-graph profiling tool. In Proceedings Sixth EuroMicro
Workshop on Parallel and Distributed Processing (PDP’98), pp.
286–92. IEEE Press, Los Alamitos, CA.
[105] Hill, J. M. D., McColl, B., Stefanescu, D. C., Goudreau, M. W.,
Lang, K., Rao, S. B., Suel, T., Tsantilas, T., and Bisseling, R. H.
(1998d). BSPlib: The BSP programming library. Parallel Computing,
24, 1947–80.
[106] Hill, J. M. D. and Skillicorn, D. B. (1997/1998a). Lessons learned
from implementing BSP. Future Generation Computer Systems, 13,
327–35.
[107] Hill, J. M. D. and Skillicorn, D. B. (1998b). Practical barrier
synchronisation. In Proceedings Sixth EuroMicro Workshop on Par-
allel and Distributed Processing (PDP’98), pp. 438–44. IEEE Press,
Los Alamitos, CA.
[108] Hoare, C. A. R. (1985). Communicating Sequential Processes.
Prentice-Hall, Englewood Cliffs, NJ.
[109] Hockney, R. W. (1996). The Science of Computer Benchmarking.
SIAM, Philadelphia.
[110] Horvitz, G. and Bisseling, R. H. (1999). Designing a BSP version
of ScaLAPACK. In Proceedings Ninth SIAM Conference on Parallel
Processing for Scientific Computing (ed. Hendrickson, B. et al.). SIAM,
Philadelphia.
[111] Inda, M. A. and Bisseling, R. H. (2001). A simple and efficient
parallel FFT algorithm using the BSP model. Parallel Computing, 27,
1847–1878.
[112] Inda, M. A., Bisseling, R. H., and Maslen, D. K. (2001). On the
efficient parallel computation of Legendre transforms. SIAM Journal
on Scientific Computing, 23, 271–303.
[113] JáJá, J. (1992). An Introduction to Parallel Algorithms. Addison-
Wesley, Reading, MA.
[114] Johnson, J., Johnson, R. W., Padua, D. A., and Xiong, J. (2001).
Searching for the best FFT formulas with the SPL compiler. In
Languages and Compilers for Parallel Computing (ed. S. P. Midkiff,
J. E. Moreira, M. Gupta, S. Chatterjee, J. Ferrante, J. Prins, W. Pugh,
and C.-W. Tseng), Volume 2017 of Lecture Notes in Computer Science,
pp. 112–26. Springer, Berlin.
[115] Johnsson, S. L. (1987). Communication efficient basic linear algebra
computations on hypercube architectures. Journal of Parallel and
Distributed Computing, 4, 133–72.
[116] Juurlink, B. H. H. and Wijshoff, H. A. G. (1996). Communication
primitives for BSP computers. Information Processing Letters, 58,
303–10.
[117] Karonis, N. T., Toonen, B., and Foster, I. (2003). MPICH-G2: A grid-
enabled implementation of the message passing interface. Journal of
Parallel and Distributed Computing, 63, 551–63.
[118] Karypis, G. and Kumar, V. (1998). A fast and high quality multilevel
scheme for partitioning irregular graphs. SIAM Journal on Scientific
Computing, 20, 359–92.
[119] Karypis, G. and Kumar, V. (1999). Parallel multilevel k-way
partitioning scheme for irregular graphs. SIAM Review , 41, 278–300.
[120] Kernighan, B. W. and Lin, S. (1970). An efficient heuristic procedure
for partitioning graphs. Bell System Technical Journal , 49, 291–307.
[121] Kernighan, B. W. and Ritchie, D. M. (1988). The C Programming
Language (2nd edn). Prentice-Hall, Englewood Cliffs, NJ.
[122] Knuth, D. E. (1997). The Art of Computer Programming, Vol. 1,
Fundamental Algorithms (3rd edn). Addison-Wesley, Reading, MA.
[123] Kosloff, R. (1996). Quantum molecular dynamics on grids. In
Dynamics of Molecules and Chemical Reactions (ed. R. E. Wyatt and
J. Z. H. Zhang), pp. 185–230. Marcel Dekker, New York.
[124] Koster, J. H. H. (2002, July). Parallel templates for numerical lin-
ear algebra, a high-performance computation library. Master’s
thesis, Department of Mathematics, Utrecht University, Utrecht, the
Netherlands.
[125] Lanczos, C. (1950). An iteration method for the solution of the
eigenvalue problem of linear differential and integral operators. Journal
of Research of the National Bureau of Standards, 45, 255–82.
[126] Lawson, C. L., Hanson, R. J., Kincaid, D. R., and Krogh, F. T.
(1979). Basic Linear Algebra Subprograms for Fortran usage. ACM
Transactions on Mathematical Software, 5, 308–23.
[127] Leforestier, C., Bisseling, R. H., Cerjan, C., Feit, M. D., Friesner, R.,
Guldberg, A., Hammerich, A., Jolicard, G., Karrlein, W., Meyer, H.-D.,
Lipkin, N., Roncero, O., and Kosloff, R. (1991). A comparison of
different propagation schemes for the time dependent Schrödinger
equation. Journal of Computational Physics, 94, 59–80.
[128] Lewis, J. G. and van de Geijn, R. A. (1993). Distributed memory
matrix–vector multiplication and conjugate gradient algorithms. In Pro-
ceedings Supercomputing 1993, pp. 484–92. ACM Press, New York.
[129] Lewis, P. A. W., Goodman, A. S., and Miller, J. M. (1969). A pseudo-
random number generator for the System/360. IBM Systems Journal ,
8, 136–46.
[130] Loyens, L. D. J. C. and Moonen, J. R. (1994). ILIAS, a sequential
language for parallel matrix computations. In PARLE’94, Parallel
Architectures and Languages Europe (ed. C. Halatsis, D. Maritsas,
G. Phylokyprou, and S. Theodoridis), Volume 817 of Lecture Notes in
Computer Science, pp. 250–261. Springer, Berlin.
[131] Mascagni, M. and Srinivasan, A. (2000). SPRNG: A scalable library for
pseudorandom number generation. ACM Transactions on Mathematical
Software, 26, 436–61.
[132] McColl, W. F. (1993). General purpose parallel computing. In Lectures
on Parallel Computation (ed. A. Gibbons and P. Spirakis), Volume 4 of
Cambridge International Series on Parallel Computation, pp. 337–91.
Cambridge University Press, Cambridge, UK.
[133] McColl, W. F. (1995). Scalable computing. In Computer Science
Today: Recent Trends and Developments (ed. J. van Leeuwen), Volume
1000 of Lecture Notes in Computer Science, pp. 46–61. Springer,
Berlin.
[134] McColl, W. F. (1996a). A BSP realisation of Strassen’s algorithm. In
Proceedings Third Workshop on Abstract Machine Models for Parallel
and Distributed Computing (ed. M. Kara, J. R. Davy, D. Goodeve, and
J. Nash), pp. 43–6. IOS Press, Amsterdam.
[135] McColl, W. F. (1996b). Scalability, portability and predictability: The
BSP approach to parallel programming. Future Generation Computer
Systems, 12, 265–72.
[136] Meesen, W. and Bisseling, R. H. (2003). Balancing communication
in parallel sparse matrix–vector multiplication. Preprint, Department
of Mathematics, Utrecht University, Utrecht, the Netherlands. In
preparation.
[137] Message Passing Interface Forum (1994). MPI: A message-passing inter-
face standard. International Journal of Supercomputer Applications
and High-Performance Computing, 8, 165–414.
[138] Message Passing Interface Forum (1998). MPI2: A message-passing
interface standard. International Journal of High Performance
Computing Applications, 12, 1–299.
[139] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H.,
and Teller, E. (1953). Equation of state calculations by fast computing
machines. Journal of Chemical Physics, 21, 1087–92.
[140] Miller, R. (1993). A library for bulk synchronous parallel programming.
In General Purpose Parallel Computing, pp. 100–108. British Computer
Society Parallel Processing Specialist Group, London.
[141] Miller, R. and Reed, J. (1993). The Oxford BSP library users’ guide,
version 1.0. Technical report, Oxford Parallel, Oxford, UK.
[142] Motwani, R. and Raghavan, P. (1995). Randomized Algorithms.
Cambridge University Press, Cambridge, UK.
[143] Nagy, J. G. and O’Leary, D. P. (1998). Restoring images degraded
by spatially variant blur. SIAM Journal on Scientific Computing, 19,
1063–82.
[144] Narasimha, M. J. and Peterson, A. M. (1978). On the computation of
the discrete cosine transform. IEEE Transactions on Communications,
COM-26, 934–6.
[145] Newman, M. E. J. and Barkema, G. T. (1999). Monte Carlo Methods
in Statistical Physics. Oxford University Press, Oxford.
[146] Nieplocha, J., Harrison, R. J., and Littlefield, R. J. (1996). Global
arrays: A nonuniform memory access programming model for high-
performance computers. Journal of Supercomputing, 10, 169–89.
[147] Numrich, R. W. and Reid, J. (1998). Co-array Fortran for parallel
programming. ACM SIGPLAN Fortran Forum, 17(2), 1–31.
[148] Ogielski, A. T. and Aiello, W. (1993). Sparse matrix computations on
parallel processor arrays. SIAM Journal on Scientific Computing, 14,
519–30.
[149] O’Leary, D. P. and Stewart, G. W. (1985). Data-flow algorithms for
parallel matrix computations. Communications of the ACM, 28,
840–53.
[150] O’Leary, D. P. and Stewart, G. W. (1986). Assignment and
scheduling in parallel matrix factorization. Linear Algebra and Its
Applications, 77, 275–99.
[151] Oualline, S. (1993). Practical C Programming (2nd edn). Nutshell
Handbook. O’Reilly, Sebastopol, CA.
[152] Pacheco, P. S. (1997). Parallel Programming with MPI. Morgan
Kaufmann, San Francisco.
[153] Papadimitriou, C. H. and Yannakakis, M. (1990). Towards an
architecture-independent analysis of parallel algorithms. SIAM Journal on
Computing, 19, 322–8.
[154] Parlett, B. N. (1971). Analysis of algorithms for reflections in bisectors.
SIAM Review, 13, 197–208.
[155] Pease, M. C. (1968). An adaptation of the fast Fourier transform for
parallel processing. Journal of the ACM, 15, 252–64.
[156] Pothen, A., Simon, H. D., and Liou, K.-P. (1990). Partitioning sparse
matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis
and Applications, 11, 430–52.
[157] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P.
(1992). Numerical Recipes in C: The Art of Scientific Computing (2nd
edn). Cambridge University Press, Cambridge, UK.
[158] Romein, J. W. and Bal, H. E. (2002). Awari is solved. Journal of the
International Computer Games Association, 25, 162–5.
[159] Rühl, T., Bal, H., Benson, G., Bhoedjang, R., and Langendoen, K.
(1996). Experience with a portability layer for implementing parallel
programming systems. In Proceedings 1996 International Conference
on Parallel and Distributed Processing Techniques and Applications,
pp. 1477–88. CSREA Press, Athens, GA.
[160] Saad, Y. and Schultz, M. H. (1986). GMRES: A generalized minimal
residual algorithm for solving nonsymmetric linear systems. SIAM
Journal on Scientific and Statistical Computing, 7, 856–69.
[161] Salamin, E. (1976). Computation of π using arithmetic-geometric
mean. Mathematics of Computation, 30, 565–70.
[162] Shadid, J. N. and Tuminaro, R. S. (1992). Sparse iterative algorithm
software for large-scale MIMD machines: An initial discussion and
implementation. Concurrency: Practice and Experience, 4, 481–97.
[163] Skillicorn, D. B., Hill, J. M. D., and McColl, W. F. (1997). Questions
and answers about BSP. Scientific Programming, 6, 249–74.
[164] Snir, M., Otto, S., Huss-Lederman, S., Walker, D., and Dongarra, J.
(1998). MPI: The Complete Reference. Vol. 1, The MPI Core
(2nd edn). Scientific and Engineering Computation Series. MIT Press,
Cambridge, MA.
[165] Sorensen, H. V., Burrus, C. S., and Heideman, M. T. (1995). Fast
Fourier Transform Database. PWS Publishing, Boston.
[166] Spirakis, P. G. (1993). PRAM models and fundamental parallel
algorithmic techniques: Part II (randomized algorithms). In Lectures
on Parallel Computation (ed. A. Gibbons and P. Spirakis), Volume 4
of Cambridge International Series on Parallel Computation, pp. 41–66.
Cambridge University Press, Cambridge, UK.
[167] Spirakis, P. G. and Gibbons, A. (1993). PRAM models and
fundamental parallel algorithmic techniques: Part I. In Lectures on Parallel
Computation (ed. A. Gibbons and P. Spirakis), Volume 4 of Cambridge
International Series on Parallel Computation, pp. 19–40. Cambridge
University Press, Cambridge, UK.
[168] Sterling, T., Salmon, J., Becker, D. J., and Savarese, D. F. (1999). How
to Build a Beowulf: A Guide to the Implementation and Application
of PC Clusters. Scientific and Engineering Computation Series. MIT
Press, Cambridge, MA.
[169] Stijnman, M. A., Bisseling, R. H., and Barkema, G. T. (2003).
Partitioning 3D space for parallel many-particle simulations. Computer
Physics Communications, 149, 121–34.
[170] Strassen, V. (1969). Gaussian elimination is not optimal. Numerische
Mathematik, 13, 354–6.
[171] Sunderam, V. S. (1990). PVM: A framework for parallel distributed
computing. Concurrency: Practice and Experience, 2, 315–39.
[172] Swarztrauber, P. N. (1987). Multiprocessor FFTs. Parallel Computing,
5, 197–210.
[173] Takken, D. H. J. (2003, April). Implementing BSP on Myrinet.
Project report, Department of Physics, Utrecht University, Utrecht,
the Netherlands.
[174] Tiskin, A. (1998). The bulk-synchronous parallel random access
machine. Theoretical Computer Science, 196, 109–30.
[175] Tuminaro, R. S., Shadid, J. N., and Hutchinson, S. A. (1998). Parallel
sparse matrix vector multiply software for matrices with data locality.
Concurrency: Practice and Experience, 10, 229–47.
[176] Valiant, L. G. (1982). A scheme for fast parallel communication. SIAM
Journal on Computing, 11, 350–61.
[177] Valiant, L. G. (1989). Bulk-synchronous parallel computers. In Parallel
Processing and Artificial Intelligence (ed. M. Reeve and S. E. Zenith),
pp. 15–22. Wiley, Chichester, UK.
[178] Valiant, L. G. (1990a). A bridging model for parallel computation.
Communications of the ACM, 33(8), 103–11.
[179] Valiant, L. G. (1990b). General purpose parallel architectures. In
Handbook of Theoretical Computer Science: Vol. A, Algorithms and
Complexity (ed. J. van Leeuwen), pp. 943–71. Elsevier Science, Amsterdam.
[180] van de Geijn, R. A. (1997). Using PLAPACK. Scientific and
Engineering Computation Series. MIT Press, Cambridge, MA.
[181] Van de Velde, E. F. (1990). Experiments with multicomputer
LU-decomposition. Concurrency: Practice and Experience, 2, 1–26.
[182] van de Vorst, J. G. G. (1988). The formal development of a parallel
program performing LU-decomposition. Acta Informatica, 26, 1–17.
[183] van der Stappen, A. F., Bisseling, R. H., and van de Vorst, J. G. G.
(1993). Parallel sparse LU decomposition on a mesh network of
transputers. SIAM Journal on Matrix Analysis and Applications, 14,
853–79.
[184] van der Vorst, H. A. (1992). Bi-CGSTAB: A fast and smoothly
converging variant of Bi-CG for the solution of nonsymmetric linear systems.
SIAM Journal on Scientific and Statistical Computing, 13, 631–44.
[185] van der Vorst, H. A. (2003). Iterative Krylov Methods for Large Linear
Systems. Cambridge Monographs on Applied and Computational
Mathematics. Cambridge University Press, Cambridge, UK.
[186] van Heukelum, A., Barkema, G. T., and Bisseling, R. H. (2002). DNA
electrophoresis studied with the cage model. Journal of Computational
Physics, 180, 313–26.
[187] Van Loan, C. (1992). Computational Frameworks for the Fast Fourier
Transform, Volume 10 of Frontiers in Applied Mathematics. SIAM,
Philadelphia.
[188] Vastenhouw, B. and Bisseling, R. H. (2002, May). A two-dimensional
data distribution method for parallel sparse matrix–vector
multiplication. Preprint 1238, Department of Mathematics, Utrecht University,
Utrecht, the Netherlands.
[189] Vishkin, U. (1993). Structural parallel algorithmics. In Lectures on
Parallel Computation (ed. A. Gibbons and P. Spirakis), Volume 4 of
Cambridge International Series on Parallel Computation, pp. 1–18.
Cambridge University Press, Cambridge, UK.
[190] Walshaw, C. and Cross, M. (2000). Parallel optimisation algorithms
for multilevel mesh partitioning. Parallel Computing, 26, 1635–60.
[191] Wilkinson, B. and Allen, M. (1999). Parallel Programming:
Techniques and Applications Using Networked Workstations and Parallel
Computers. Prentice-Hall, Upper Saddle River, NJ.
[192] Zitney, S. E., Mallya, J., Davis, T. A., and Stadtherr, M. A. (1994).
Multifrontal techniques for chemical process simulation on
supercomputers. In Proceedings Fifth International Symposium on Process
Systems Engineering, Kyongju, Korea (ed. E. S. Yoon), pp. 25–30.
Korean Institute of Chemical Engineers, Seoul, Korea.
[193] Ziv, J. and Lempel, A. (1977). A universal algorithm for sequential
data compression. IEEE Transactions on Information Theory, IT-23,
337–43.
[194] Zlatev, Z. (1991). Computational Methods for General Sparse Matrices,
Volume 65 of Mathematics and Its Applications. Kluwer, Dordrecht,
the Netherlands.
[195] Zoldi, S., Ruban, V., Zenchuk, A., and Burtsev, S. (1999, January/
February). Parallel implementation of the split-step Fourier method
for solving nonlinear Schrödinger systems. SIAM News, 32(1), 8–9.
INDEX
accidental zero, 168–169 cost, 5–8, 44, 70, 72, 120, 148, 179,
accumulate operation, 275 181, 189, 217, 219, 233, 235,
all-to-all, 43, 126, 261, 270 238, 242, 247
argmax, 55, 56, 60, 69, 201, 266 decomposable, 39, 265
arithmetic mean, 162 model, vii–ix, xi, xiii, 2–9, 39–40, 79,
ASCII, 46 81, 82, 84
parameters, 1, 6, 8, 24, 27, 79, 96,
138, 232
back substitution see triangular system variants of, 8, 39
bandwidth of matrix, 86, 243 BSP Worldwide, viii, xvi
barrier see synchronization bsp abort, 20, 255, 258
benchmarking, ix, 6, 24 bsp begin, 14, 255
program, 1, 27–32, 43, 44, 79, 243, bsp broadcast, 72
261–265 bsp end, 15, 255
results of, 31–38, 82, 85, 138, 232 bsp get, 20, 223, 226, 255
Beowulf, 32, 43, 231–236 bsp get tag, 225, 250, 255
Bernoulli trial, 205 bsp hpget, 99, 255, 273
Bi-CGSTAB, 236 bsp hpmove, 250, 255
bipartitioning, 192, 196, 197 bsp hpput, 99, 255, 273, 275
of hypergraph, 195, 197, 242 bsp init, 15, 255, 258
bit operation, 128 bsp move, 225, 243, 250, 255, 275
bit reversal, 110–111, 116–117, 127–128, bsp nprocs, 15, 255
149, 150 bsp pid, 16, 255
bit-reversal matrix, 110 bsp pop reg, 19, 255, 274
BLAS, 34, 98 bsp push reg, 19, 126, 255, 274
sparse, 237 bsp put, 18, 99, 223, 224, 226, 243, 255,
block distribution see distribution, block 271
block size, algorithmic, 98 bsp qsize, 226, 255
blocking bsp send, 223–226, 243, 255, 275
of algorithm, 86–87 bsp set tagsize, 226, 255
of distribution, 86 bsp sync, 16, 79, 255, 259
body-centred cubic (BCC) lattice, 219 bsp time, 16, 79, 255
border correction, 213, 216 BSPedupack, xi, 92, 223, 251–253, 256,
branch-and-bound method, 247 258, 261, 265, 278, 280, 281
broadcast, 11, 57, 65, 66, 72–73, 259, BSPlib, viii, ix, xi–xiii, xvi, 1, 2, 13–14,
265–267 137, 163, 231, 241, 254, 256, 258,
one-phase, 66, 79, 80, 89, 99 259, 261, 262, 265, 266, 270,
tree-based, 88 273–275, 278–282
two-phase, 66–68, 79, 80, 83, 87–89, for Windows, xiv
99, 185 implementation of, xi, 42
BSMP, 254, 255 primitives, xi, 14–20, 223–226, 259,
BSP 273–274
algorithm, 3–4, 148, 149 programs, xi
computer, ix, 1, 3, 6, 243 compiling of, 20, 33
300 INDEX

BSPlib (cont.) cryptanalysis, 45, 161


profiling of, 83–84 cryptology, 45
running of, 20 cut-off radius, 166, 245
buffer memory, 137, 225, 226 cyclic distribution see distribution,
size of, 137 cyclic
bulk synchronous parallel see BSP
butterfly, 108, 113, 116, 121
block, 113, 114, 116, 117 DAS-2, xiii, 231, 233
data compression, 46
data structure, 166, 173, 176, 178, 212,
C, viii, xii, 14, 98, 137, 172, 237, 256 220, 222, 237, 239, 240
C++, xii, 14, 236, 240, 256 dynamic, 169
cache, 25, 85, 89, 137, 140, 141, 144, sparse, 168, 170–173, 176
146, 150, 152, 155, 170, 173, 250 static, 169
effects, 140, 141 Daubechies wavelet, 158
misses, 141 DAXPY, 25, 28, 34, 143, 144, 174
primary, 137, 143 deadlock, 12, 257, 266, 280
secondary, 137, 143 decimation in frequency (DIF), 113,
carry-add operation, 160 145, 148
CC-NUMA, 137 decimation in time (DIT), 109, 145
Chernoff bound, 205–206 decomposable BSP (D-BSP), 40, 265
Cholesky factorization, 50, 89 decompression, 47
sparse, 236 degree of vertex, 202–203
coarsening, 195, 196 density of sparse matrix, 163, 167, 178,
codelet, 146 203, 206, 207, 208, 210
collective communication, 20, 42, 72, 87, deregistration, 19, 227, 274
257, 259, 261, 262, 266, 267, 271, DGEMM, 34, 98
280–282 diagonal dominance, 244
combine, two-phase, 185 diagonal processor, 185
combining puts, 137 differentiation, by Fourier transform,
communication volume, 65–66, 73, 83, 153
86, 177, 179, 181, 185–192, 194, digital diamond, 214–219, 238
196, 197, 207–210, 241–242, 246, three-dimensional, 219
247 discrete convolution, 160
communication-closed layer, xiv discrete cosine transform (DCT), 157
communicator, 257–259, 262, 265–266 inverse, 158
commutativity, 90 discrete Fourier transform (DFT), 102,
compressed column storage (CCS), 171 104, 105, 151, 153
compressed row storage (CRS), three-dimensional, 154–155
170–171, 176, 177 two-dimensional, 153–154
incremental, 171–172, 176, 222, 227, discrete Legendre transform (DLT), 151
241 discrete wavelet transform (DWT), 158,
compression, 46–47, 158 159
conjugate gradient method, 151, 164, inverse, 159
236, 239, 244–246 two-dimensional, 159
convolution, 100, 146, 160 distributed memory, 8, 37, 137
theorem, 160 distribution
Cooley–Tukey FFT, 106, 109, 111, 113 block, 10–11, 44, 113–115, 146–150,
Cooley–Tukey GFFT, 122 152, 161, 183
cooling schedule, 248 block with bit-reversed processor
coordinate scheme, 170 numbering, 117
cosine transform, 100, 157–158 block-cyclic, 86, 96, 115, 148, 149
fast (FCT), 146, 157, 158 Cartesian, 58–59, 85, 86, 174,
Cray T3E, 34–36, 38, 79–81, 83, 179–185, 207, 209, 238–240
279–280 column, 99
cyclic, 10, 65, 79, 93, 114, 115, fault tolerance, 42
148–150, 152–154, 156, 186, FFT see Fourier, transform, fast
227 FFTW, 146
cyclic row, 85, 185, 186, 204, 208 fill-in, 164
group-cyclic, 115–117, 119, 123, 128, flop, 3, 25, 43–44, 82, 89, 91, 93, 96, 97,
129, 149 102, 104, 105, 120, 121, 125, 127,
matrix, 58, 85 144, 146, 150, 152, 153, 166–168,
M × N block, 243 205, 212, 243
M × N cyclic, 63–65, 74, 89, 154 ‘for all’ statement, 166
Mondriaan, 163, 186–197, 210, 221, Fortran, 98, 147
240 Fortran 77, 237, 239, 256
non-Cartesian, 85, 88, 163, 174 Fortran 90, xii, 14, 237, 256
row, 184, 208, 212, 217, 221, 243 forward substitution see triangular
square, 63, 186, 207–209, 212, 243 system
square block, 85, 86, 92, 204, 206 four-step framework, 150, 152
square cyclic, 64, 65, 85, 86, 88, 92, Fourier
95, 96, 184, 186 coefficient, 101, 153
vector, 197–202, 243 matrix, 102, 103, 108, 113
zig-zag cyclic, 156–157 generalized, 122
DNA, 165, 233 series, 100, 101
domain view, 212 transform
dot product see inner product continuous, 145
double-precision arithmetic, 25 differentiation by, 153
DRMA, 254, 255 discrete (DFT), 101, 145, 151, 153,
dynamic programming, 146 154, 156
fast (FFT), ix, xiv, 100–162, 270,
278
Earth Simulator, 44 generalized discrete (GDFT), 122,
edge, 202–203, 241 125
cut, 240, 241 generalized fast (GFFT), 122–126
directed, 202 inverse fast, 116, 127, 153
multiple, 202 radix-2 fast, 146
undirected, 194, 202, 203 radix-4 fast, 146, 152–153
efficiency, 141, 142, 144, 178, 208 real, 146, 155–157
eigensystem, 50, 93, 163–165, 174, 236, three-dimensional, 152, 154–155
240 two-dimensional, 145, 153–154
eigenvalue, 93 unordered fast, 112, 121, 124
eigenvector, 93
elapsed time, 16
electronic structure, 152 g (communication cost), 5–6, 25
electrophoresis, 165, 233 optimistic value of, 82, 85, 128, 144,
encryption, 45 250, 278
enumeration, 247 pessimistic value of, 82, 87, 129, 143,
Eratosthenes, 48 278
error, roundoff, 121 gain of vertex move, 195–197
error handling, xi, 20, 75 Gauss–Jordan elimination, 88
Ethernet, 32, 41, 231 Gaussian elimination, 51, 235
Euclidean norm, 93 Gaussian function, 153
even–odd sort, 106 geometric mean, 162
get, 12, 177, 273–274, 282
global view, 21, 191, 193, 198
fanin, 176, 178, 185, 188, 191, 194, Globus toolkit, 256
197–199, 208, 233, 275 GMRES, 236–238
fanout, 176, 178, 185, 188, 189, 191, GNU General Public License, xi, 280
197–200, 208, 212, 215, 233, 274 Goldbach conjecture, 49
302 INDEX

goto-statement, 257 input/output see I/O


graph inverse FFT, 116, 127, 153
colouring, 241 irregular algorithm, 129, 163
directed, 202–203 Ising model, 211
directed acyclic, 88, 257 isoefficiency, 148
undirected, 194, 202, 240 iterative solution method, 163, 164, 174,
Grid, 1, 231, 256 199, 236–238, 239–240, 244, 248
grid, 151–152, 211–221, 238, 245
Gustavson’s data structure, 172
jagged diagonal storage (JDS), 172
JPEG 2000, 158
h-relation, 5, 43
balanced, 65, 208
Kernighan–Lin algorithm, 197
full, 6, 25
boundary, 197, 241
halo, 245
Kronecker product, 107, 145, 147
Harwell–Boeing collection, 171, 203,
property of, 107, 146, 147
233, 237–238
Krylov subspace, 236
header overhead, 82
heat equation, 164, 211
heuristic, 171, 189, 195, 200, 216, 235, l (synchronization cost), 5–6, 26
247 LAM/MPI, 256
high-precision arithmetic, 160 Lanczos method, 164
Hitachi SR8000, 161 Laplacian operator, 183, 211, 212, 238,
Householder reflection, 93, 94 243, 245
Householder tridiagonalization, 50, 86, large-integer multiplication, 160
92–95 lattice, 215
Hubble Space Telescope, 100, 151 body-centred cubic, 219
hypercube, 39, 86, 148, 149, 239 leaf of tree, 246
hyperedge, 192 least-squares approximation, 26
hypergraph, 192–195, 197, 242 Legendre transform, 151
linear congruential generator, 47
linear distribution see distribution,
i-cycle, 148 block
I/O, 15, 256, 258 linear programming, 163, 238
IBM RS/6000 SP, 36, 38 linear system, 50, 163, 164, 174,
identity matrix, 52 235–238, 239–240
identity permutation, 55 linked list, 173
image two-dimensional, 172–173
compression, 158 LINPACK, 44
processing, 157 Linux, 14, 31, 32, 231
reconstruction, 100, 145 local bound, 200, 201
restoration, 151 local view, 21, 191, 193, 198, 208, 209
imbalance LogP model, 39, 148
of communication load, 65, 185, 241 LU decomposition, ix, 50–99, 174, 184,
of computation load, 65, 92, 188–192, 265–270, 278, 279
196, 205, 209, 210 sparse, 164, 172, 173, 236
parameter, 188–189, 191, 209, 210 LZ77, 46–47
increment, 171, 222
index of maximum see argmax
index set, 175–178 Manhattan distance, 214
index-digit permutation, 148 mapping see distribution
indirect addressing, 172 MATLAB, 52, 236
inefficiency, 141 matrix
initial partitioning, 195–197 band, 86, 243
inner product, 1, 9–13, 174, 195, 239, bit-reversal, 110
244, 278 block-diagonal, 94, 159
block-tridiagonal, 212 modularity, 68, 124
butterfly, 108 molecular dynamics, 166, 238, 239, 244
dense, 59, 86, 163, 164, 184–186, quantum, xiv, 151
207–208 Mondriaan, 163, 187
density, 163, 167, 178, 203, 206, 207, package, xi, 191, 208–210, 217–219,
208, 210 221, 233, 242, 246, 247, 248
diagonal, 93 MPEG-4, 158
diagonally dominant, 244 MPI, viii, x, xi, 2, 32, 42–43, 239, 240,
Fourier, 102, 103, 108, 113, 146 256–282
generalized butterfly, 122 programs
generalized Fourier, 122 compiling of, 259
identity, 52 running of, 259
Laplacian, 163, 210–221, 240 MPI-2, viii, xi, 2, 42, 256, 258, 273–275,
lower triangular, 50, 89, 95 278–282
orthogonal, 92, 93, 159 MPI Abort, 258
pentadiagonal, 211 MPI Accumulate, 275
positive definite, 89, 244 MPI Alloc mem, 278
random sparse, 163, 179, 203–210, MPI Allreduce, 259, 266
233, 244 MPI Alltoall, 261, 270
sparse, viii, 59, 106, 108, 159, 163–250 MPI Alltoallv, 261–262, 270–271, 278,
symmetric, 89, 92, 93, 191 280
Toeplitz, 151 MPI Barrier, 259, 281
tridiagonal, 92, 93, 183, 243 MPI Bcast, 259, 265
twiddle, 125, 126, 146 MPI Comm split, 265
unit lower triangular, 50 MPI Comm rank, 259
unstructured sparse, 203, 238 MPI Comm size, 259
unsymmetric, 165 MPI COMM WORLD, 257, 259, 265, 266
upper triangular, 50 MPI Finalize, 258
Walsh–Hadamard, 146 MPI Gather, 262
matrix allocation, 251 MPI Get, 273
Matrix Market, 170, 204, 237 MPI Init, 258
matrix update, 58–60, 75 MPI Put, 273–275
matrix view, 212 MPI Recv, 257
matrix-free storage, 173, 212 MPI Send, 257
matrix–matrix multiplication, 89–90, MPI Sendrecv, 267
244 MPI Sendrecv replace, 267
traditional, 90, 91 MPI Win create, 274
Strassen, 90–92 MPI Win fence, 274
three-dimensional, 89, 90 MPI Win free, 274
two-dimensional, 89 MPI Wtime, 259, 281
matrix–vector multiplication MPICH, 256
by FFT, 151 MPICH-2, xi, 273
dense, 185, 243 MPICH-G2, 256
sparse, ix, 163–250, 278–280 MPIedupack, xi, 256, 258–278, 280, 281
matrix/vector notation, 52–54 multicast, 72
memory multilevel method, 195–197, 240–242,
allocation, 251 246
deallocation, 251 Myrinet, xi, xiv, 232
overflow, 252
use, 121, 126
message passing, 4, 12, 42, 87, 223, 256, need-to-know principle, 59, 95
266 net, 194
bulk synchronous, 163, 223, 227, 254 cut, 194–195
Metropolis algorithm, 247 Newton–Raphson method, 161
Mod-m sort, 147 node of tree, 105–106, 246
304 INDEX

nonlinear system, 163 polymer, xvi, 165


norm, 93 portability
normalized cost, 141, 206, 247 layer, 2, 32
NUMA, 137 of algorithms, 3
power method, 165
PRAM model, 39
occam, xiv preconditioner, 236, 239
one-sided communication, 11, 40, 42, prediction, 1, 6, 8, 79, 82, 85, 141–144,
256, 257, 266, 270, 273, 275, 278, 282
280, 282 ab initio, 85
OpenMP, 44 optimistic, 82
optical fibre, 152 pessimistic, 82
overhead, 72, 141 presenting results, 136
overlap of computation and prime, 48, 49, 187, 191
communication, 9, 17 twin, 49
Oxford BSP toolset, viii, xi, 2, 13, 14, probability density function, 206
20, 42, 280 processor column, 58, 73, 180–182, 184,
186, 265
p (number of processors), 6 processor row, 57, 73, 180–182, 184,
P (∗), 12 186, 265
P (∗, t), 58 processor view, 191
P (s, ∗), 57 profile, 83–84
packet, 129, 130, 249, 270–271 programming language
packing, 128, 129, 249 C, viii, xii, 14, 137, 172, 237, 256
list, 249 C++, xii, 14, 236, 240, 256
padding with zeros, 160 Fortran 77, 237, 239, 256
Paderborn University BSP library, xi, 2, Fortran 90, xii, 14, 237, 256
13, 42, 280 occam, xiv
page fault, 155 Pthreads, 43
Panda, 231, 232 put, 11–12, 177, 178, 273–275, 282
BSP library, xi, 231, 232 PVM, 2, 43, 239
parallelizing compiler, 173
parallel computer, vii, 1, 2 QR decomposition, 50, 86
parallel prefix, 45 quantum molecular dynamics, xiv, 151
parallel programming, vii, x–xii, 1, 2
PARPACK, xiv
partial differential equation, 163, 238 r (computing rate), 6, 24
partial row pivoting, 55 radix-2 FFT, 146
particle simulation see molecular radix-4 FFT, 146, 152–153
dynamics radix-m splitting, 146
partitioning, 173, 187, 189, 240–242 radix sort, 222
geometric, 246 RAM, 155
p-way, 187–190, 192, 246 random number generator, 47, 203–204
symmetric, 191 random sparse matrix, 163, 179,
path, 202 203–210, 233, 244
payload, 224–225, 243 random walk, 47
perfect shuffle, 147 recursive computation, 90–92, 104–106,
periodic boundaries, 245 146, 147, 190, 192, 247
periodic function, 100, 101, 153 level of, 90, 190, 191
permutation matrix, 54 redistribution, 114, 116–119, 128, 258,
π, decimals of, 160–162 270, 281
pivot, 55, 60, 68, 75, 84, 173 reduction operation, 259, 266
pivot bit, 148 redundant computation, 12, 13, 61, 64
plan for computation, 146 registration, 19, 74, 126, 274
pointer arithmetic, 226 of two-dimensional array, 74
regular algorithm, 128, 129 superstep, vii, xiv, 3–4, 16, 261
reproducibility, 151 communication, 3
residual, 236 computation, 3
root of tree, 105, 246 program, 16, 265
roundoff error, 121 superstep piggybacking, 98, 266
RSA, 161 surface-to-volume ratio, 214, 219
Rutherford–Boeing collection, 171, 203, switch, 32
237 synchronization, 4
bulk, 4
global, vii, 271
SAXPY see DAXPY pairwise, 4
scalability see speedup subset, 9, 40, 42
scalable memory use, 121, 126, 222, 224 zero-cost, 42
ScaLAPACK, 86–87, 256, 279
scattered square decomposition see
distribution, square cyclic tag, 224–226, 243, 250
Schrödinger equation template, 236
nonlinear, 152 tensor product, 107
time-dependent, 151 torus, 33
time-independent, 152 torus-wrap mapping see distribution,
SGEMM see DGEMM square cyclic
SGI Origin 2000, xiii, 37–38, 137 total exchange, 261
SGI Origin 3800, 136–144, 278–280 transpose algorithm, 149, 150
shared memory, 37, 38, 137 transputer, xiv, 238
shuffle, perfect, 147 trapezoidal rule, 101, 153
sieve of Eratosthenes, 48, 49 tree, 105–106, 246
simulated annealing, 247–249 triangular system, 51, 55, 95–96
sine transform, 100 triple scheme, 170
fast, 146 truncated octahedron, 219, 220, 245
singular value decomposition, 236 twiddle matrix, 125, 126, 146
six-step framework, 150, 151
slowdown, 233 UFFT, 112, 124, 127
smooth function, 101, 153 UGFFT, 124
sorting, 222 uncoarsening, 195, 197
sparse matrix algorithm, 167, 169 unpacking, 128–130, 249
sparsity pattern, 163, 164, 169, 177, list, 249
181, 186, 195, 203, 204, 206–208,
210, 244
vector addition, sparse, 167–168
symmetric, 165
vector allocation, 251
spectral transform, 151
vertex, 88, 192, 194–197, 202–203, 241
speedup, 139–141, 144, 151, 152, 233,
adjacent, 197
278, 279
degree, 202–203
superlinear, 140
video compression, 158
SPMD, 11, 14–15, 17, 41, 127, 254, 255,
virtual processor, 155
258
volume see communication volume, 65
stack, 19
Voronoi cell, 217, 219
startup cost, 26, 41, 242, 257
stencil, five-point see Laplacian operator
Strassen matrix–matrix multiplication, wall-clock time, 16
90–92 wavefront, 96
stride, 11, 53, 72, 150 wavelet, 100, 158–159
strip, 213–214 Daubechies, 158
subtractive cancellation, 93 weight, 113, 120–127
supercomputer, viii, xiii, 33, 44, 235