PREDICTABLE PARALLEL PERFORMANCE: THE BSP MODEL
D.B. Skillicorn
Abstract  There are three big challenges for mainstream parallel computing: building useful hardware platforms, designing programming models that are effective, and designing a software construction process that builds correctness into software. The first has largely been solved, at least for current technology. The second has been an active area of research for perhaps fifteen years, while work on the third has barely begun. In this chapter, we describe the Bulk Synchronous Parallel (BSP) model which, at present, represents the best compromise among programming models for simplicity, predictability, and performance. We describe the model from a software developer's perspective and show how its high-level structure is used to build efficient implementations. Almost alone among programming models, BSP has an associated cost model, so that the performance of programs can be predicted on any target without laborious benchmarking. Some progress towards software construction has also been made in the context of BSP.
[Figure: the three phases of a superstep: local computation, global communication, barrier synchronisation.]
almost any real parallel computer can readily simulate the BSP abstract machine with little loss of performance. This means that some features of shared-memory computers cannot be directly exploited [38].
The essence of the BSP model is the notion of a superstep, which can be considered the basic parallel instruction of the abstract machine. A superstep is a global operation of the entire parallel computer, and is decomposed into three phases:

- Local computation within each processor on local data.
- Non-blocking communication between processors.
- A barrier synchronisation of all processors, after which all communication is guaranteed to have completed, and the moved data becomes visible to the destination processors.
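For concreteness, here is a minimal sketch of one superstep written against the BSPlib primitives that appear later in this chapter; the global sum it computes is purely illustrative.

    #include <stdlib.h>
    #include "bsp.h"

    /* One superstep, sketched with the BSPlib primitives: local
       computation, one-sided communication (puts), then a barrier. */
    void superstep_sum(void)
    {
        int p  = bsp_nprocs();            /* number of processors   */
        int me = bsp_pid();               /* this processor's id    */
        double partial;
        double *sums = calloc(p, sizeof(double));

        bsp_push_reg(sums, p * (int)sizeof(double));
        bsp_sync();                       /* registration takes effect */

        /* Phase 1: local computation on local data (placeholder). */
        partial = (double)me;

        /* Phase 2: non-blocking communication - deposit the partial
           sum into slot `me` of processor 0's copy of `sums`.      */
        bsp_put(0, &partial, sums, me * (int)sizeof(double),
                (int)sizeof(double));

        /* Phase 3: barrier synchronisation - afterwards, all moved
           data is visible at its destination.                      */
        bsp_sync();

        if (me == 0) {
            double total = 0.0;
            for (int i = 0; i < p; i++)
                total += sums[i];         /* global sum now local   */
        }
        bsp_pop_reg(sums);
        bsp_sync();                       /* deregistration takes effect */
        free(sums);
    }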
Communication and synchronisation are completely decoupled. This avoids one of the big problems with, for example, message passing, where each communication action conflates data movement with synchronisation. A single process is easy to understand; but the global state that results from the parallel composition of more than one process is extremely hard to understand. When a send is written in a process, the entire global state must be understood in order to know whether the operation is safe or will cause deadlock. Thus the send-receive abstraction of message passing is easy to understand in the abstract but hard to use in practice.
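A standard textbook illustration of the problem (a sketch, not code from this chapter): two MPI processes that each send before receiving. Whether this runs or hangs depends on global state invisible at the call site.

    #include <mpi.h>

    /* If the messages are too large for MPI's internal buffering,
       each MPI_Send blocks waiting for a matching receive that is
       never reached, and the program deadlocks.                    */
    void symmetric_exchange(double *out, double *in, int n, int other)
    {
        MPI_Send(out, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD); /* may block */
        MPI_Recv(in, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }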
In message passing, both parties must also take part in every transfer, because only the data recipient knows where incoming data may be safely stored.
The message-passing style typical of MPI and PVM has the following drawbacks:

- It is possible to write programs that deadlock, and it is difficult to be sure that a given program will not deadlock;
- The ability to use a send or receive anywhere leads to spaghetti communication logic: message passing is the communication analogue of goto control flow;
- There is no globally consistent state, which makes debugging and reasoning about programs extremely difficult;
- There is no simple cost model for performance prediction, and those that do exist are too complex to use on programs of even moderate size;
- On modern architectures with powerful global communication, message-passing programs cannot use the full power of the architecture.
BSP avoids all of these drawbacks. Provided BSP's apparent inefficiencies can be avoided, it is clearly a more attractive programming model, at least for algorithms that are fairly regular.
It is possible to program in a BSP style using libraries such as MPI which provide an explicit barrier operation. However, many global optimisations cannot be applied in this setting, and the library operations themselves are not optimised appropriately. For example, the MPI barrier synchronisation on the SGI Power Challenge is 32 times slower than BSPlib's barrier. Replacing MPI calls in a superstep-structured program with BSPlib calls typically produces a 5-20% speedup.

BSPlib is available for the following machines: Silicon Graphics Power Challenge (Irix 6.x) and Origin 2000; IBM SP; Cray T3E, T3D, and C90; Parsytec Explorer; Convex SPP; Digital 8400 and Digital Alpha farms; Hitachi SR2001; plus generic versions on top of UDP/IP, TCP/IP, System V shared memory, MPI, and certain clusters.
4. BSP ALGORITHM DESIGN
BSP has a simple but accurate cost model that can be used for performance prediction at a high level, before any code is written. Hence, it provides the basis for parallel algorithm design, since the costs of different potential implementations can be compared without building them all.
There are three components to the cost of a BSP superstep: the computation that occurs locally at each processor, the communication that takes place between them, and the cost of the barrier synchronisation.
Costing the local computation uses the conventional approach of counting instructions executed. This approach, however, is becoming increasingly fragile, not only for BSP but for other parallel and, for that matter, sequential models. The reason is that modern processors are not simple linear consumers of instructions with predictable execution times. Instead, the time a given instruction takes depends strongly on its context: the instructions that execute around it and their use of functional units, and whether or not its arguments are present in the cache. We have been exploring better models for instruction execution, but this is an issue bigger than parallel programming.
Costing communication is an area in which BSP has made a major contribution to our understanding of parallel computation. Most other cost models for communication are bottom up, in the sense that they try to compute the cost of a single message, and then sum these costs to get an overall communication cost. Typically, the cost of a single message is modelled as a startup term followed by a per-byte transmission time term. The problem with this approach is that the dominant term in the per-message cost often comes from congestion, which is highly nonlinear and prevents the costs of individual messages being summed in a meaningful way.
Such communication cost models implicitly assume that the bottleneck in a network is somewhere in the centre, where messages collide.

[Figure: the superstep cost parameters, the local work w_i and the communication volume h_i at each processor.]
This may be a fair assumption for a lightly loaded network, but it does not hold when a network is busy. In typical parallel computer networks, the bottlenecks are at the edges of the network, getting traffic into and out of the communication substrate itself [13, 16]. This continues to be true even for clusters where many of the protocol and copy overheads of older networks (e.g. TCP/IP) have been avoided [11].
If it is the network boundary that is the bottleneck, then the cost of communication will not be dominated by the distance between sender and receiver (i.e. the transit time), nor by congestion in transit, nor by the communication topology of the network. Instead, the communication that will take the longest is the one whose processor has the most other traffic coming from or to it. This insight is at the heart of the BSP cost model, and has led to its adoption as a de facto cost model for other parallel programming models too.
BSP has a further advantage because of its superstep structure: it is possible to tell which communication actions will affect each other, namely those from the same superstep. The communication actions of a superstep consist of a set of transfers between processors. If the maximum volume of data entering and leaving any processor is h, then this communication pattern is called an h-relation. The cost of the communication part of the superstep is given by hg, where g captures the permeability of the network to high-load traffic. It is expressed in units of instruction times per byte, so that hg is in units of instruction times. Hence a small value for g is good.

The cost of the barrier synchronisation, conventionally written l, is similarly expressed in units of instruction times.
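Putting the three components together gives the usual statement of the superstep cost; the numeric values in the worked instance below are hypothetical, chosen only to illustrate the arithmetic:

\[
T_{\text{superstep}} \;=\; \max_i w_i \;+\; g \cdot \max_i h_i \;+\; l
\]

where w_i is the local work at processor i and h_i its communication volume. For instance, if the largest local computation in a superstep is 10^6 instruction times, the superstep realises an h-relation with h = 10^5 bytes on a machine with g = 4 instruction times per byte, and l = 2 x 10^4, then the superstep costs 10^6 + 4 x 10^5 + 2 x 10^4 = 1.42 x 10^6 instruction times; the cost of a whole program is the sum of the costs of its supersteps.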
[Figures: a communication pattern from source to destination processors across supersteps 1-3; barrier time in seconds against number of processors (2-20) on a Cray T3E, comparing conventional and BSP implementations.]
MESSAGE PACKING
On most computers, there is a significant startup cost to initiating a transfer of data. Even if the network itself introduces only minor overheads, it is not unusual to require a system call, with its overheads of saving and restoring state. The BSP cost model claims that sending 1000 1-byte messages from a processor should cost the same as sending a single 1000-byte message, and that seems unlikely.
However, BSP has one advantage over conventional message-passing systems: it is not required to begin transmitting data at the point where the program indicates that it should. Instead, a BSP implementation can wait until the end of each superstep before initiating data transfers. This makes it possible to pack all of the traffic between each pair of processors into a single message, so that the overhead is paid only once per processor pair rather than once per message. In other words, no matter how the programmer expresses the messaging requirements, BSP treats them the same: 1000 1-byte messages become a single 1000-byte message before leaving the sending processor. The communication interface sees a single transfer regardless of the actual sequence of puts in the program text.
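A sketch of the idea, with hypothetical names rather than the real BSPlib internals: puts are staged per destination, and only the flush at the end of the superstep touches the network.

    #include <string.h>

    #define MAXPROCS    64
    #define STAGE_BYTES (1 << 20)

    /* One staging buffer per destination processor. */
    typedef struct {
        char   data[STAGE_BYTES];
        size_t used;
    } stage_t;

    static stage_t stage[MAXPROCS];

    /* Called for every put: copy into the staging buffer, send
       nothing. (A real implementation would also record destination
       addresses and check for overflow.)                            */
    void queued_put(int dest, const void *src, size_t nbytes)
    {
        memcpy(stage[dest].data + stage[dest].used, src, nbytes);
        stage[dest].used += nbytes;
    }

    /* Called once at the end of the superstep: one combined message
       per destination, so the startup cost is paid once per
       processor pair.                                               */
    void flush_staged(int nprocs, void (*send)(int, const void *, size_t))
    {
        for (int d = 0; d < nprocs; d++) {
            if (stage[d].used > 0) {
                send(d, stage[d].data, stage[d].used);
                stage[d].used = 0;
            }
        }
    }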
DESTINATION SCHEDULING
Each processor knows that, when it is about to start transmitting data, other processors probably are too.

If all processors choose their destinations in the same way, say Processor 0 first, then Processor 1, and so on, then Processor 0 is going to be bombarded with messages. It will only be able to unload one of them, and the rest will block back through the network. This situation is particularly likely in SPMD programs, where puts to the same destination will tend to be executed simultaneously, because each processor is executing the same code.
Because BSPlib does not transmit messages as they are requested, it is able to rearrange the order in which they are actually despatched; in particular, each processor may use a different despatch order, even though the same order was implied by the puts executed by each processor.
Choosing a schedule that makes collisions at the receivers unlikely has a large effect on performance. It can be done by choosing destinations randomly, or by using a Latin square as a delivery schedule. Table 1.4 shows this effect on a variety of computers. Notice that in almost every case, the factor of two lost by not overlapping computation and communication is earned back. In the more complex case of the NIC cluster, the huge increase in time happens because the central switch saturates; using a careful schedule prevents a performance disaster.

Table 1.4  8-processor total exchange, in microseconds. small = 16k words per processor pair, big = 32k words.

                            Latin square    dst order    factor
  SP2              big          102547        107073       1.0
  T3D              big           14658         29017       2.0
  PowerChallenge   big           37820         47910       1.3
  Cluster(TCP)     small         61336        119055       1.9
                   big          134839        248724       1.8
  Cluster(NIC)     small         39468         76597       1.9
                   big           78683      14043686       178
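One simple Latin-square schedule is the cyclic one: in round r, processor p despatches its combined message to processor (p + r) mod P, so within every round each processor sends to a distinct destination and no receiver is hit twice. A sketch, with the actual transmission left as a hypothetical callback:

    /* Despatch combined messages in a collision-avoiding order. */
    void despatch_latin(int me, int nprocs, void (*send_to)(int dest))
    {
        for (int r = 1; r < nprocs; r++)     /* r = 0 would be self */
            send_to((me + r) % nprocs);
    }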
PACING
Each processor knows that, when it is about to start transmitting
data, other processors probably are too.
Networks all contain bottlenecks: as the applied load increases, the throughput eventually drops. It is always useful to control (throttle) input rates to maintain the applied load at (just below) the region of maximum throughput. If the applied load is maintained slightly below the point of maximum throughput, then fluctuations in load create slightly larger throughputs, which tend to restore equilibrium. On the other hand, if the applied load goes above the point of maximum throughput, then positive feedback tends to drive the network into congestion and collapse. The ability to control applied load makes it possible to keep the network running at its maximum effective capacity.
This can be done in two ways:

- High-level. The target's g value gives its permeability to data under conditions of continuous traffic, and this can easily be converted into units of, say, seconds per megabit. There is no point in trying to insert data into the network faster than this value, because g is the best estimate of the network's ideal throughput. (A small sketch of this conversion follows the list.)
- Low-level. When the communication medium is shared (e.g. Ethernet), using a slotting system with a carefully chosen slot size reduces collisions (a distributed approximation to the Latin square scheme). For TCP/IP over Ethernet, this improves performance by a factor of 1.5 over mpich.
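A sketch of the high-level calculation, with hypothetical parameter names and example values: g converts directly into a minimum gap between packets, below which the sender is only feeding the congestion.

    /* Minimum time the network needs to absorb one packet, given the
       measured g of the target (instruction times per byte) and the
       processor's instruction rate. Offering packets faster than
       this cannot increase throughput.                              */
    double pacing_gap_usecs(double g_instr_per_byte,
                            double instrs_per_usec,
                            int packet_bytes)
    {
        return g_instr_per_byte * packet_bytes / instrs_per_usec;
    }

For example, with g = 20 instruction times per byte on a processor executing 400 instructions per microsecond, a 1500-byte packet should not be despatched more often than every 20 x 1500 / 400 = 75 microseconds.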
Barrier times for different implementation strategies:

                      PowerChallenge                SP2
                         SGI MPI      Arg MPI    IBM MPL    BSPlib
  provided                  73.5        217.7      197.9        --
  message tree              74.1        165.9       95.6        --
  central message           92.0        234.0      156.2        --
  total exchange           105.7        224.9      124.3     137.8
BARRIER IMPLEMENTATION
Most standard barrier implementations are quite poor. Because barriers are so important in BSP, BSPlib implements them very carefully [17].
On shared-memory architectures, a technique that exploits cache coherency hardware is used to make barriers very fast. Table 1.5 shows some performance data for different barrier implementations on shared-memory systems.
On manufactured distributed-memory architectures, access to the communication substrate is usually possible only through a manufacturer's API that is already far removed from the hardware. This makes it difficult to build fast interfaces. Some performance data are shown in Table 1.6. These times are close to the sum of the times required to send individual messages to p - 1 destinations.
Total exchange is an attractive way to implement barriers in the context of global exchange of data, although once again this is counterintuitive. Rather than exchanging single bits at the barrier, BSPlib exchanges the sizes of the messages about to be communicated between each pair of processors. This costs almost nothing, but makes new optimisations possible, because the receiver knows how much data to expect. Distributed-memory implementations of BSPlib execute the barrier immediately after the local computation phase. Since each processor knows how much data to expect, it is simple to preserve the semantics of the program barrier.
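A sketch of the size-exchanging barrier idea (the idea only, not the actual BSPlib internals): the barrier itself is a total exchange in which each processor tells every other how many bytes to expect.

    #include "bsp.h"

    /* sizes_out[i] is the number of bytes this processor is about
       to send to processor i; sizes_in is assumed to have been
       registered earlier with bsp_push_reg. After the sync,
       sizes_in[j] holds the number of bytes to expect from
       processor j, so receive buffers can be allocated before any
       data arrives.                                                 */
    void size_exchange_barrier(const int *sizes_out, int *sizes_in)
    {
        int p  = bsp_nprocs();
        int me = bsp_pid();

        for (int i = 0; i < p; i++)          /* total exchange of sizes */
            bsp_put(i, &sizes_out[i], sizes_in,
                    me * (int)sizeof(int), (int)sizeof(int));

        bsp_sync();   /* doubles as the barrier; sizes now visible */
    }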
The opportunities for optimisation are constrained because BSPlib is a library and does not have access to, for example, the kernel, where further optimisation opportunities exist.
6. OPTIMISATION IN CLUSTERS
It is now possible to build a distributed-memory parallel computer from off-the-shelf components (a cluster). Such systems are characterised by high-performance processor boards, a simple interconnection network such as switched Ethernet or Myrinet, and Network Interface Cards (NICs) to handle the interface between processors and network. Protocols such as TCP/IP that were originally designed for long-haul networks are not appropriate for such an environment. Instead, new protocols have been designed to provide low latencies (end-to-end latencies in the tens of microseconds are typical) and as few data copies as possible [4, 41, 28, 6]. Changes to the kernel to make it possible to access the network without a system call are also common.
When the target is a cluster, with full access to the kernel, communication hardware, and protocols, further optimisations to BSPlib are possible [12, 11]. Some of them are listed in Table 1.7.
HOLE-FILLING ERROR RECOVERY
TCP/IP is a reliable protocol, but it was designed for an error model that is not relevant to clusters. Clusters do lose packets, but almost invariably because of buffer problems at the receiving end. High-performance protocols need to provide reliability in a different way.
TCP error recovery keeps track of the last packet acknowledged. Once a packet is lost, everything after it is also resent. This is a poor strategy if errors are not in fact bursty.

In the BSPlib cluster protocol, packets contain an extra field holding the next sequence number after the last one received in sequence (i.e. acknowledging both a hole, and the first packet past the end of the hole). Only the actual missing packets need to be resent.
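A sketch of the acknowledgement format, with hypothetical field names: alongside the usual cumulative acknowledgement, the extra field names the first packet received beyond the hole, so the sender retransmits only the hole itself.

    typedef struct {
        unsigned cum_ack;    /* everything up to here arrived in order */
        unsigned hole_end;   /* first sequence number received beyond
                                the hole (0 if there is no hole)       */
    } ack_t;

    /* Resend only the missing packets, not everything after cum_ack
       as TCP's go-back-N recovery would.                             */
    void retransmit_hole(const ack_t *a, void (*resend)(unsigned seq))
    {
        for (unsigned s = a->cum_ack + 1; s < a->hole_end; s++)
            resend(s);
    }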
REDUCING ACKNOWLEDGEMENT TRAFFIC
Acknowledgements are piggy-backed onto normal traffic, and therefore use no extra bandwidth, provided that traffic is symmetric. However, it often isn't, and so the general strategy is to avoid acknowledgements for as long as possible, preferably for ever. Also, because each processor knows what it is expecting, recovery is driven by the receiver, not by the sender. Because of this, a sender need not know for certain whether a packet has been received for a very long time, unless it needs to free some send buffers.
JUST-IN-TIME ACKNOWLEDGEMENTS
Acknowledgements are forced when a sender predicts that it will run out of buffers (because it must keep copies of unacknowledged data). The sender uses its knowledge of the round-trip message time to send acknowledgement requests just in time.
MANAGING BUFFERS MORE EFFECTIVELY
Because the architecture is dedicated, and communication patterns will often be skewed, it makes sense to use a single buffer pool at each processor (rather than separate buffer pools for each pair of processors).
The combination of these optimisations makes the BSPlib cluster protocol one of the fastest reliable protocols. Its performance is illustrated in Figures 1.11, 1.12, and 1.13.

These optimisations have a direct effect on application-level performance. This is shown in Tables 1.8 and 1.9, using some of the NAS benchmarks.

The combined effect of these communication optimisations is that the communication performance of BSPlib is about a factor of four better than that of comparable libraries.
[Figure 1.11: round-trip time in microseconds against message size (bytes, log2 scale), for BSPlib over raw NIC, UDP, and TCP and MPI (ch_p4) on a 100 Mbit PII cluster, IBM MPI and MPL on a 320 Mbit SP2, and SGI MPI on a 700 Mbit Origin 2000.]

[Figure 1.12: bandwidth in Mbps against message size in bytes, for the same systems.]

[Figure 1.13: latency per packet (microseconds, log2 scale) against number of packets (log2 scale), for the same systems.]
7. SOFTWARE DEVELOPMENT
The third requirement for successful parallel programming is the existence of a process for developing programs from specifications. This is, of course, also important for sequential computing; but parallel computing adds an extra spatial dimension to all of the complexities of sequential programming, and a software construction process is therefore even more critical.
Formal methods for software development in sequential settings often have large overheads to generate a single program statement. Partly for this reason, they have made little inroad into practical software construction. A big difference in a parallel setting is that the application of a single formal construction can often build the entire parallel structure of a program. Thus the same overhead of using a formal technique results in a much larger gain in program construction.
BSP is particularly attractive from the perspective of a methodology for program construction because:

- Each BSP superstep has a well-defined interface to those that precede and follow it. BSP supersteps resemble a general form of skeleton whose details can be instantiated from a few parameters. The extensive skeleton literature has documented the advantages this brings to software development [8, 34, 33].
- BSP programs can be given a simple semantics in which each processor can act arbitrarily on any program variables, with the final effect of a superstep being defined by a merge operator that defines what happens if multiple assignments to the same variable are made by different processors. In fact, it is clear that BSP is just one of a family of models defined by different merge functions. This idea has been pursued by He and Hoare [20], and in a slightly different way by Lecomber [23].
Too much should not be made of the ability to reason formally about parallel programs. Nevertheless, BSP is notable as the only known programming model that is both efficient and semantically clean enough to be reasoned about [36].
8. APPLICATIONS OF BSP
BSP has been used extensively across the whole range of standard scientific and numeric applications. Some examples are:

- computational fluid dynamics [18, 7],
- computational electromagnetics [25],
Bagging takes a number of random samples (bags) from a dataset and runs the learning algorithm on each one. Clearly, since these copies are executed independently on their own data, bagging can be trivially parallelised. Bagging illustrates an important feature of data-mining algorithms: it is often possible to achieve much better results by combining the outputs of predictors trained on several small bags than by using a predictor trained on the entire dataset. It is as if the most cost-effective training occurs from the first few examples seen. Hence a bagging algorithm often produces better results from much less input data, and does so in parallel as well.
It is easy to see how to use BSP to implement bagging: it requires only a single BSP superstep. If the bags are chosen to have the same size, then the computations at each processor usually take similar amounts of time. If the predictors are left at the processors that computed them, then evaluation can also be done in parallel, by broadcasting each new data point to the processors and then combining their votes using a gather.
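A sketch of the structure, where the training routine is a hypothetical placeholder:

    #include "bsp.h"

    typedef struct model model_t;                     /* opaque predictor */
    extern void train(model_t *m, const double *bag, int n); /* hypothetical */

    /* Training is a single superstep: each processor trains its own
       predictor on its own bag, which is pure local computation, so
       the superstep ends with a bare synchronisation. The predictors
       stay where they were built; evaluation then broadcasts each
       new data point and gathers the votes in a further superstep.  */
    void parallel_bagging(model_t *m, const double *bag, int bag_size)
    {
        train(m, bag, bag_size);   /* local computation on the local bag */
        bsp_sync();                /* end of the (single) superstep      */
    }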
EXCHANGING INFORMATION EARLIER
Bagging provides a generic parallelisation technique. However, the observation that much of the gain occurs after having seen only a few data points suggests that performance might be improved still further by exchanging information among processors before the end of the algorithm. That way, each processor can learn from its own data, and simultaneously from the data that other processors are using. This turns out to be very productive.
The basic structure of the parallel algorithm is:

- Either partition the data among the processors, or give each processor a bag;
- Execute the approximating algorithm on each processor for a while;
- Exchange the results obtained by each processor with all of the others (a total exchange);
- Update the model being built by each processor to incorporate the information received from other processors;
- Repeat as appropriate.
Of course, this requires that information learned by other processors can be combined into the model under construction, so this approach will not work for all data-mining algorithms. However, it is effective for several, including neural networks [30, 29] and inductive logic programming [42]; a sketch of the loop follows.
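Again the training routines are hypothetical placeholders; `summaries` is assumed to have been registered with bsp_push_reg, and each round of the loop is exactly one superstep.

    #include "bsp.h"

    typedef struct model model_t;                           /* opaque */
    extern void train_for_a_while(model_t *, const double *, int);
    extern void summarise(const model_t *, double *out, int len);
    extern void merge(model_t *, const double *all, int p, int len);
    /* The three routines above are hypothetical placeholders.       */

    void train_with_exchange(model_t *m, const double *data, int n,
                             int rounds, int len,
                             double *summaries /* p*len, registered */)
    {
        int p  = bsp_nprocs();
        int me = bsp_pid();
        double *mine = summaries + me * len;

        for (int r = 0; r < rounds; r++) {
            train_for_a_while(m, data, n);     /* local computation   */
            summarise(m, mine, len);
            for (int i = 0; i < p; i++)        /* total exchange      */
                bsp_put(i, mine, summaries,
                        me * len * (int)sizeof(double),
                        len * (int)sizeof(double));
            bsp_sync();                        /* barrier             */
            merge(m, summaries, p, len);       /* fold in the others  */
        }
    }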
In fact, the biggest weakness in the BSP cost model is not related to parallelism at all, but to the difficulty of modelling instruction execution within a single processor.
BSP's cost model makes parallel software design possible, and provides predictable performance without benchmarking.

The benefits of the BSP model are not theoretical, nor do they apply only at low levels. The benefits of the model play through to the application level, allowing high-performance real applications to be built and maintained.

The BSP model shows that there are significant benefits, both in simplicity and performance, in using DRMA for data transfer rather than message passing. The biggest weakness of message passing is that it takes two to play: it forces programmers to do a complex matching when they write programs, and processors to synchronise when they wish to transfer data.
Finding effective ways to program parallel computers is difficult because the requirements are mutually exclusive: simplicity and abstraction, but also performance (and preferably predictable performance). Very few models are known that score reasonably well in both dimensions. BSP is arguably the best positioned of known models. Yet it is clear that BSP is still too low-level and restricted to become the model for parallel programming. However, it is an important step on the way to such a model.
Acknowledgements. A large number of people have been involved in the design and implementation of BSP. In particular, the implementation of BSPlib and much of the performance analysis was done by Jonathan M.D. Hill and Stephen Donaldson.
References
[1] D.J. Becker, T. Sterling, D. Savarese, J.E. Dorband, U.A. Ranawake, and C.V. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings of the International Conference on Parallel Processing (ICPP), pages 11–14, 1995.

[2] Rob H. Bisseling. Basic techniques for numerical linear algebra on bulk synchronous parallel computers. In Lubin Vulkov, Jerzy Wasniewski, and Plamen Yalamov, editors, Workshop Numerical Analysis and its Applications 1996, volume 1196 of Lecture Notes in Computer Science, pages 46–57. Springer-Verlag, Berlin, 1997.
[25] P.B. Monk, A.K. Parrott, and P.J. Wesson. A parallel finite element method for electromagnetic scattering. COMPEL, 13, Supp. A:237–242, 1994.

[26] M. Nibhanupudi, C. Norton, and B. Szymanski. Plasma simulation on networks of workstations using the bulk synchronous parallel model. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Athens, GA, November 1995.

[27] S.L. Peyton-Jones and David Lester. Implementing Functional Programming Languages. Prentice-Hall International Series in Computer Science, 1992.

[28] Loic Prylli and Bernard Tourancheau. A new protocol designed for high performance networking on Myrinet. In Parallel and Distributed Processing, volume 1388 of Lecture Notes in Computer Science, pages 472–485. Springer, 1998.

[29] R.O. Rogers and D.B. Skillicorn. Using the BSP cost model for optimal parallel neural network training. Future Generation Computer Systems, 14:409–424, 1998.

[30] R.O. Rogers and D.B. Skillicorn. Using the BSP cost model to optimize parallel neural network training. Future Generation Computer Systems, 14:409–424, 1998.

[31] Constantinos Siniolakis. Bulk-synchronous parallel algorithms in computational geometry. Technical Report PRG-TR-10-96, Oxford University Computing Laboratory, May 1996.

[32] D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26–35, October–December 1999.

[33] D.B. Skillicorn. Architecture-independent parallel computation. IEEE Computer, 23(12):38–51, December 1990.

[34] D.B. Skillicorn. Structuring data parallelism using categorical data types. In Programming Models for Massively Parallel Computers, pages 110–115, Berlin, September 1993. Computer Society Press.

[35] D.B. Skillicorn. Foundations of Parallel Programming. Number 6 in Cambridge Series in Parallel Computation. Cambridge University Press, 1994.

[36] D.B. Skillicorn. Building BSP programs using the Refinement Calculus. In Third International Workshop on Formal Methods for Parallel Programming: Theory and Applications (FMPPTA'98), Springer Lecture Notes in Computer Science 1388, pages 790–795, March/April 1998.