Communication Overlap in Multi-Tier Parallel Algorithms
Abstract
Hierarchically organized multicomputers such as SMP clusters offer new opportunities and new challenges for high-performance computation, but realizing their full potential remains a formidable task. We present a hierarchical model of communication targeted to block-structured, bulk-synchronous applications running on dedicated clusters of symmetric multiprocessors. Our model supports node-level rather than processor-level communication as the fundamental operation, and is optimized for aggregate patterns of regular section moves rather than point-to-point messages. These two capabilities work synergistically. They provide flexibility in overlapping communication and overcome deficiencies in the underlying communication layer on systems where inter-node communication bandwidth is at a premium. We have implemented our communication model in the KeLP2.0 run-time library. We present empirical results for five applications running on a cluster of Digital AlphaServer 2100's. Four of the applications were able to overlap communication on a system which does not support overlap via non-blocking message passing using MPI. Overall performance improvements due to our overlap strategy ranged from 12% to 28%.
1 Introduction
Hierarchical parallel computers such as clusters of symmetric multiprocessors (SMPs) offer both new opportunities and new challenges for high-performance computation [41]. Although these computer platforms can potentially deliver unprecedented performance for computationally intensive scientific calculations, realizing the hardware's potential remains a formidable task. A principal difficulty is that increased node performance amplifies the cost of inter-node communication, which is compounded by any failure in the message passing layer to meet the requirements of the application.

This research addresses the problem of how to tolerate communication costs associated with block-structured, bulk-synchronous applications running on dedicated clusters of symmetric multiprocessors. We present a hierarchical communication model that reflects the underlying hierarchical communication structure of the hardware and ascribes communication to nodes rather than to the individual processors making them up. Our model supports complex composed communication patterns involving regular sections as the fundamental communication primitive, rather than the point-to-point message. It is thereby able to overcome deficiencies in the underlying communication layer.
* Stephen Fink was supported by the DOE Computational Science Graduate Fellowship Program, and Scott Baden by NSF contract ASC-9520372. Computer time on the Maryland Digital AlphaServer was provided by NSF CISE Institutional Infrastructure Award CDA9401151 and a grant from Digital Equipment Corp. Special thanks to Alan Sussman and Joel Saltz for arranging access to the Maryland Digital AlphaServer.
Past efforts to overlap communication on SMP clusters take advantage of some of the capabilities of hierarchical system organization but differ in their approach. Lim et al. describe the message proxy, a trusted agent running as a kernel extension on a spare SMP processor to provide protected access to the network interface [26]. When a process sends a message, it communicates with the proxy running on its node via shared memory. The proxy offloads most of the work involved in managing communication but requires a set of shared data structures for each pair of communicating processes. The Multi-protocol Active Message, described by Lumetta et al. [28], employs shared memory to intercept on-node messages. The cost of communication is significantly lower if the processors happen to be on the same node, reflecting a hierarchical cost model. As noted by the authors, this cost model may be used to improve performance through carefully chosen data decompositions.
Past work demonstrates the need for efficient non-blocking point-to-point communication on SMP clusters. Communication layers like message proxies and multi-protocol Active Messages, while taking advantage of the hierarchical structure of the hardware to improve performance, retain a historical non-hierarchical view of the hardware: any two processors may communicate by passing a message. This research explores an alternative communication model that is explicitly hierarchical. Under this model, processors may communicate directly only if they lie on the same node, and must do so through shared memory. Processors on different nodes may not communicate directly. Rather, nodes communicate on behalf of their child processors by sending node-level messages. Moreover, the fundamental node-level communication operation is not point-to-point, but collective: an arbitrary pattern of multidimensional regular section moves that may be further organized into phases. Nodes communicate via an interface called a Mover, which executes asynchronously and concurrently with computation carried out by the node's processors.
The hierarchical flavor of our communication model reflects the reality that the off-node bandwidth of the network interface is a scarce resource, severely limiting the collective rate at which processors on a given node may communicate. In this spirit, it is similar to the PMH model described by Alpern et al. [3]. The collective behavior implemented by the Mover matches the requirements of many scientific computations, in which communication, even if irregular, is highly stylized. Communication patterns may often be known in advance, and involve many communicating pairs of processors.
We have implemented the Mover model as part of the C++ run-time library called KeLP2.0 [16, 15]. In addition to supporting collective, asynchronous, hierarchical communication, KeLP2.0 provides other vital ingredients: hierarchical decompositions and control flow.
We chose to implement the Mover to run on a spare SMP processor. This policy decision was made out of expediency, and could be changed without affecting the correctness of the user's code. Others have recognized the benefits of using a spare processor when the marginal loss of a single CPU is small. The technique was first explored on the Intel Paragon [31], and later on the Wisconsin Wind Tunnel [33], the Meiko CS-2 [6], SHRIMP [7], and Proteus [38]. Sawdey et al. [34] describe a compiler for the Fortran-P programming language, which uses a spare SMP processor and is specialized for grid-based applications.
We implemented five applications and ran them on a cluster of Digital AlphaServer 2100's. Four of the applications were able to overlap communication on a system which does not support overlap via non-blocking message passing using MPI. Overall performance improvements due to the overlap ranged from 12% to 28%. We discuss performance tradeoffs illuminating the benefits and limitations of communication overlap. We also present the interface to the KeLP2.0 Mover along with implementation details.
The contributions of this research are:

- A technique for realizing communication overlap in complex communication patterns arising in block-structured, bulk-synchronous algorithms, that overcomes deficiencies in the communication layer.

- A hierarchical communication model with abstractions separating the expression of correct programs from optimizations that affect performance.

- Empirical results concerning the tradeoffs of overlapped communication.
2 Motivation
SMP clusters impose strict performance constraints that increase the programming effort required to achieve high performance. Most importantly, the bandwidth delivered by the network interface tends to be lower than on previous-generation systems equipped with single-CPU processing nodes.

The applications we consider are traditional bulk-synchronous computations which manage coarse-grain communication explicitly. Data transfers are on the order of tens to hundreds of thousands of bytes or more and may be non-contiguous. Communication among different nodes is highly correlated, and the structure of this pattern may be determined at run time using an off-line algorithm. This behavior is apparent in various broadcasting operations such as all-to-all personalized communication, but it is also present in irregularly coupled structured problems such as multiblock and adaptive mesh refinement, and even in uniform stencil-based computations. In addition, users may need to define application-specific collective operations that are not supported by their favorite message passing layer. Since communication requirements can be known in advance, we may use advance knowledge to optimize communication at run time, e.g. inspector-executor analysis [2], though other optimizations are possible [15].
For a variety of reasons, non-blocking communication may be ineffective in realizing communication overlap. Multi-phase communication algorithms, such as dimension exchange or hypercube broadcast algorithms, require a strict ordering of messages. Overlap strategies based on non-blocking communication must poll, which complicates the user code and compromises performance. In other cases, the network interface may not be able to realize communication overlap except under certain narrowly defined conditions, which may not be realistic [37]. Or, the message layer may not be implemented to take advantage of the network interface's capabilities. For example, the message co-processor may not be able to handle message linearization, which must then be executed on a compute processor, slowing down computation.
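To make the polling burden concrete, the sketch below shows the shape such code typically takes. It is an illustration only, not taken from any of the applications studied here; relax_interior_chunk() and the buffer and neighbor variables are stand-ins.

    // Overlapping one phase of a halo exchange with computation using
    // non-blocking MPI.  The computation must be broken into chunks and the
    // program must poll between chunks; a second communication phase could
    // not begin until the first completes.
    #include <mpi.h>

    extern void relax_interior_chunk(int chunk);   // hypothetical slice of the numeric kernel
    extern double *send_buf, *recv_buf;
    extern int count, left, right, nchunks;

    void overlapped_phase()
    {
      MPI_Request req[2];
      MPI_Status  stat[2];
      int done = 0;

      MPI_Irecv(recv_buf, count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
      MPI_Isend(send_buf, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

      for (int c = 0; c < nchunks; c++) {
        relax_interior_chunk(c);                   // a slice of the interior update
        if (!done)
          MPI_Testall(2, req, &done, stat);        // poll so the exchange can make progress
      }
      if (!done)
        MPI_Waitall(2, req, stat);
      // ... only now may the next phase of a multi-phase exchange be started ...
    }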
The effect of such operating conditions is particularly severe on an SMP cluster, for it amplifies any ineffectiveness or inefficiency in the message passing layer by the multiplicity of processors at the node.
Within an SMP node, processors communicate through the shared memory system. Multiple SMPs communicate over a comparatively slow interconnection network, typically via message passing.
3.2 KeLP2.0 Overview
As mentioned, parallel programming model design primarily reflects the historical view of single-tier hardware. We propose a hierarchical programming model, which we have implemented as the KeLP2.0 framework. KeLP2.0 builds on its single-tier predecessor, KeLP1.0 [17]. Like its predecessor, KeLP2.0 is a C++ class library. We next give an overview of the multi-tier KeLP2.0 programming model. We will refer to KeLP2.0 as KeLP from now on.
Control flow. Under KeLP, we characterize an SMP cluster as follows. Referring to Fig. 1b, an SMP cluster contains a set of n nodes which communicate by passing messages over an interconnection network. Each SMP node contains p processors, which perform local computational tasks. (More generally, each node may have a different number of child processors.)

KeLP supports three levels of control: collective, node, and processor. The collective level manages data layout and data motion among the n SMP nodes. The node-level control stream manages activities at a single SMP node, and coordinates its p child processors much as the collective-level instruction stream coordinates its child nodes. Node-level control performs serial computation at an SMP node, and collective operations which apply only to the multiple processors at a single SMP node. In particular, the node-level control stream often controls partitioning and scheduling decisions for the multiple processors at the SMP node. The node level mediates communication among its processors through shared memory. The processor-level control stream executes a serial instruction stream on a single physical processor, typically a tuned serial numeric kernel.
Storage. KeLP defines two storage classes. The Grid is a node-level storage class and is like a Fortran 90 allocatable array. It lives within the single address space of a node. The XArray is a collective storage class. It is a distributed 1D array of Grids (or of a class derived from Grid). XArray elements may be assigned arbitrarily to nodes. They must have the same number of dimensions but may have different index sets and sizes.

KeLP defines two meta-data types called Region and FloorPlan. These are used not only to define storage but also to specify data motion and parallel control flow, as described below. A Region is simply a bounding box and may be used to construct a Grid. A FloorPlan is a list of Region-integer pairs and has two interpretations. When defined at the collective level, the FloorPlan defines the structure and node assignments for an XArray, and in this capacity is used in XArray construction. When defined as a node-level object, a FloorPlan specifies the decomposition of a Grid over the set of processors making up a single node. Since the FloorPlan may contain arbitrary block structures, irregular decompositions, possibly overlapping, may also be defined.
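As a concrete illustration of these meta-data classes, the following sketch builds a one-dimensional block decomposition of an N x N domain and the XArray over it. The names FloorPlan2 and XArray2 appear in the code of Fig. 3; the Region2 corner-point constructor, the add() method, and the variables numNodes and N are assumptions made for illustration.

    // Hedged sketch: a collective FloorPlan assigning one block per node,
    // then an XArray of Grids built from it.
    FloorPlan2 F;
    for (int n = 0; n < numNodes; n++) {
      int lo = n * N / numNodes, hi = (n + 1) * N / numNodes - 1;
      Region2 R(lo, 0, hi, N - 1);      // bounding box: rows [lo,hi], all columns
      F.add(R, n);                      // (Region, owner) pair: block n lives on node n
    }
    XArray2 U(F);                       // one Grid per FloorPlan entry, placed per the owners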
By using the FloorPlan constructs judiciously, the KeLP programmer may specify many-to-one mappings of data to nodes or processors, or specify task parallelism permitting independent computations to execute in parallel over possibly disjoint subsets of a machine or a node. However, changing node or processor assignments within a FloorPlan will only affect the performance of the code, not its correctness.
Figure 1: Block diagram of a single-tier (left) and dual-tier (right) computer. Memory modules are labeled M, processors P, and caches C.
With the above discussion in mind, we next present a two-level KeLP program for a simple stencil-based computation that employs pre-fetching to implement overlap. As shown in Fig. 2, we set up an inner region and an annulus, which are represented as the FloorPlans Fi and Fa in the code of Fig. 3. Points in the interior Fi do not depend on the ghost cells, which are shaded. In contrast, the points on the annulus Fa depend on incoming ghost cells. Our strategy is to compute on the interior while the ghost cells are arriving; once the ghost cells arrive, we are free to compute on the annulus. Control flow begins at the collective level, which generates a FloorPlan, XArray, and Mover (lines (1)-(4)). We ignore the MotionPlan and Mover for the time being at lines (3), (4), (6), and (13). Once the XArray has been constructed, the collective level executes a node iterator loop (7). This loop takes a collective-level FloorPlan F as an argument, and causes each node to execute the body of the loop over the iterations that it owns. This ownership is specified in the FloorPlan. (The execution order of iterations is not defined in the case of a many-to-one mapping.)
The body of the nodeIterator is a procIterator loop (10). This loop executes in a similar fashion to the nodeIterator, except that it takes a node-level FloorPlan Fi as an argument, and each processor selectively executes the iterations it owns as specified in the node-level FloorPlan. These iterations execute in a processor-level instruction stream ((11) and (12)), which executes serially.
KeLP enforces an implicit barrier synchronization point at the end of a procIterator or nodeIterator loop. The procIterator synchronization point enforces a logical barrier among all processors at a node. The nodeIterator logically synchronizes all SMP nodes. A clever implementation may eliminate the barriers to improve performance. The KeLP2.0 implementation permits the programmer to disable the procIterator barrier (NO_SYNC), and eliminates the nodeIterator barrier with inspector/executor communication analysis. Both of these features are unsafe, however.
Data motion. Once we have completed computation in the inner region of each grid, we wait for the communication to complete so that we can update the annulus defined over Fa. This is the job of the Mover wait() function at line (13). Once communication has completed, we execute another node/procIterator loop to carry out the final smoothing (lines (14)-(19)). The KeLP collective Mover builds on the KeLP1.0 Mover. The Mover is actually a distinguished class and is constructed with the aid of the KeLP MotionPlan. A MotionPlan is a first-class communication schedule; it describes a set of block copies between two XArrays, which will execute as an atomic operation. The programmer constructs the MotionPlan describing the desired communication pattern with the help of a Region calculus of geometric operations [17].
Once the MotionPlan has been constructed, the Mover may be built. We pass the MotionPlan to the Mover constructor, along with the two XArrays to which the communication activity is to be bound.
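To suggest how the Region calculus is used, the sketch below computes a ghost-cell MotionPlan for the XArray U by intersecting each grid's padded bounding box with its neighbors. The operations grow(), intersect(), region(), the constant GHOST_WIDTH, and the MotionPlan::copy() signature are assumed names for illustration; the actual KeLP interface may differ.

    // Hedged sketch: building a ghost-cell exchange schedule geometrically.
    MotionPlan M;
    for (int i = 0; i < F.size(); i++)
      for (int j = 0; j < F.size(); j++) {
        if (i == j) continue;
        // the part of grid j that falls within grid i's ghost region
        Region2 ghost = intersect(grow(F.region(i), GHOST_WIDTH), F.region(j));
        if (!ghost.empty())
          M.copy(U, j, ghost, U, i, ghost);   // copy that section from element j to element i
      }
    Mover2 Mvr(U, U, M);                      // bind the schedule to the XArray, as in Fig. 3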
Figure 2: The pre-fetching algorithm for an explicit finite difference method with a 5-point stencil divides the computation into an inner domain and an annulus. Two FloorPlans describe the 6 marked Regions. The regions marked 0 and 1 correspond to the FloorPlan Fi in the code of Fig. 3; the regions marked 2 through 5, to FloorPlan Fa. This decomposition is intended for a 2-processor node, and is duplicated on each node.
Figure 3: Collective-level KeLP code for the pre-fetching stencil computation (fragment).

    ( 1) FloorPlan2 F = ...    // set up the collective FloorPlan for the XArray
    ( 2) XArray2 U(F);         // define the XArray
    ( 3) MotionPlan M = ...    // set up the MotionPlan
    ( 4) Mover2 Mvr(U,U,M);    // the XArray U is both source and destination
    ( 5) do {
    ( 6)   Mvr.start();
           ...
    (13)   Mvr.wait();
           ...
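Only a fragment of Fig. 3 is reproduced above; the iterator loops at lines (7)-(12) and (14)-(19) are described in the text but were not recovered. The sketch below shows one way they might read. The nodeIterator and procIterator names come from the text; their iteration interfaces, the XArray element accessor U(i), and the serial kernel smooth() are assumptions.

    // Hedged sketch of the loops omitted from the Fig. 3 fragment.
    // Between lines (6) and (13): compute on the interior FloorPlan Fi
    // while the Mover fetches ghost cells in the background.
    for (nodeIterator ni(F); ni; ++ni) {          // each node runs the entries it owns
      int i = ni();                               // index of a Grid in the XArray U
      for (procIterator pi(Fi); pi; ++pi)         // node-level FloorPlan: interior pieces
        smooth(U(i), Fi.region(pi()));            // serial kernel on this processor's piece
    }
    // After line (13), Mvr.wait() has completed, so the annulus may be updated:
    for (nodeIterator ni(F); ni; ++ni) {
      int i = ni();
      for (procIterator pi(Fa); pi; ++pi)         // node-level FloorPlan: annulus pieces
        smooth(U(i), Fa.region(pi()));
    }
    // The do { ... } loop opened at line (5) then closes, e.g. } while (!converged);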
KeLP also supports endogenous copies, an important case in which the source and destination XArrays are the same.
A Mover defines two operations: start() and wait(). The start() operation causes the Mover to asynchronously execute the communication pattern defined by the binding of the MotionPlan to the XArray argument(s). The KeLP program then executes the next collective-level operation. Eventually the KeLP program must synchronize with communication using wait(). However, between the start and the wait the user may logically overlap execution of the Mover with ongoing computation (lines (7)-(12)).
The KeLP Mover runs as a concurrent task in parallel with computation carried out within a nodeIterator/procIterator loop. How this concurrency is implemented cannot be known to the programmer, though it is possible to determine certain details indirectly. In particular, the programmer may observe a change in performance by modifying the node-level FloorPlans to vary the number of compute processors used in procIterator loop execution.

Note that KeLP structured loops contrast sharply with unstructured thread programming, where the programmer must explicitly manage synchronization between individual threads.
CC++ [10] provides a programming model with both types of parallel control constructs. Like the CC++ structured parallel loops, the KeLP iterators simplify the expression of parallelism, but restrict the forms of parallel control flow available to the programmer. As a compromise, KeLP defines node- and processor-level waits on a Mover, which provide additional flexibility.
4 Implementation
We implemented KeLP2.0 on a cluster of Digital AlphaServer 2100's [19] running Digital UNIX 4.0. Each SMP has four Alpha 21064A processors, and each processor has a 4MB direct-mapped L2 cache. For inter-node communication, we rely on MPICH 1.0.12 [18] over an OC-3 ATM switch.

KeLP was implemented in C++, but the serial numeric kernels employed in our applications were implemented in Fortran 77 and use double precision arithmetic. We compiled C++ using gcc v2.7.2 with compiler option -O2. To compile Fortran, we used the Digital Fortran compiler v4.0, with compiler options -O5 -tune ev4 -automatic -u -assume noaccuracy_sensitive.
KeLP was originally written with POSIX threads, but due to difficulties with Digital UNIX threads we were forced to emulate threads using heavyweight processes and mmap memory-mapped file support. We implemented our own shared heap under KeLP control and bound processes to processors using the bind_to_cpu system call for the duration of the program. These decisions are not fundamental to the design of KeLP, though they do shed light on some performance limitations of the Alpha cluster. From this point on, we will use the word "thread" to refer to a stream of instructions managed by the KeLP2.0 implementation.
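The sketch below illustrates the flavor of this workaround: a file-backed shared mapping is created before forking the worker processes. It is not the KeLP implementation itself; the heap path, the sizes, and the omitted processor-binding call are placeholders.

    // Illustration of the process-based "thread" emulation: a memory-mapped
    // file created before fork() serves as a shared heap for all workers.
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void *create_shared_heap(const char *path, size_t bytes)
    {
      int fd = open(path, O_RDWR | O_CREAT, 0600);
      ftruncate(fd, bytes);                          // size the backing file
      return mmap(0, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

    int main()
    {
      void *heap = create_shared_heap("/tmp/shared_heap", 1 << 24);  // 16 MB arena
      for (int p = 1; p < 4; p++)        // parent plus three children = four "threads"
        if (fork() == 0)
          break;                         // a child; a call such as bind_to_cpu() would
                                         // then pin it to a processor (platform specific)
      // ... carve shared objects out of heap and run this worker's code ...
      (void)heap;
      return 0;
    }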
All threads execute the collective-level code, analogously to SPMD execution. The nodeIterator and procIterator constructs mask out execution with a conditional, so only the appropriate threads execute the appropriate loop bodies. Node-level barrier synchronization is costly in our implementation: about 1.0 µsec.
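A minimal sketch of this masking, assuming accessor names F.size(), F.owner(), and myNode(), is:

    // SPMD masking: every node executes the loop, but the conditional lets a
    // node act only on the FloorPlan entries it owns.
    for (int i = 0; i < F.size(); i++) {
      if (F.owner(i) != myNode()) continue;   // mask out iterations owned elsewhere
      // ... loop body supplied to the nodeIterator ...
    }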
The Mover implementation deserves special attention, since the Mover must perform inspector/executor analysis of the data motion pattern, issue message-passing calls and memory copies to effect data motion, and overlap communication and computation as needed. A variant of the Mover performs arithmetic updates such as addition, which is useful in multilevel methods.
Our implementation uses a producer-consumer queue in shared memory. When the program calls the Mover start() function from collective control flow, one thread enqueues a pointer to the Mover and signals the communication thread by incrementing a counting semaphore. The communication thread waits on the semaphore. When the communication thread wakes up, it dereferences the Mover pointer and carries out the various steps involved in moving the data. Because a Mover may encode a complicated communication pattern, the cost of enqueueing the descriptor is usually amortized over multiple data transmissions.
When the program calls the Mover wait() function, the running computational threads wait on a semaphore. When the communication thread completes the data motion, it posts to this semaphore, waking up the sleeping threads. In effect, wait() serves as a barrier synchronization point.
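The following distills that handshake into a small thread-based sketch using POSIX counting semaphores. In KeLP the participants are the heavyweight processes described earlier and the queue lives in the shared heap; here execute_data_motion() stands in for the inspector/executor work, and initialization of the semaphores and mutex is omitted.

    // Producer-consumer handshake behind Mover::start() and Mover::wait().
    #include <pthread.h>
    #include <semaphore.h>
    #include <queue>

    class Mover;                              // provided by the library
    void execute_data_motion(Mover *m);       // inspector/executor work (assumed name)

    struct MoverQueue {
      std::queue<Mover*> pending;             // Movers awaiting execution
      pthread_mutex_t    lock;
      sem_t              work;                // counts enqueued Movers
      sem_t              done;                // posted once per compute thread on completion
      int                ncompute;            // number of compute threads to release
    };

    void mover_start(MoverQueue *q, Mover *m) // called from collective control flow
    {
      pthread_mutex_lock(&q->lock);
      q->pending.push(m);                     // enqueue a pointer to the Mover
      pthread_mutex_unlock(&q->lock);
      sem_post(&q->work);                     // wake the communication thread
    }

    void mover_wait(MoverQueue *q)            // each compute thread blocks here
    {
      sem_wait(&q->done);
    }

    void *communication_thread(void *arg)
    {
      MoverQueue *q = (MoverQueue *)arg;
      for (;;) {
        sem_wait(&q->work);                   // sleep until a Mover is enqueued
        pthread_mutex_lock(&q->lock);
        Mover *m = q->pending.front(); q->pending.pop();
        pthread_mutex_unlock(&q->lock);
        execute_data_motion(m);               // message passing plus memory copies
        for (int i = 0; i < q->ncompute; i++)
          sem_post(&q->done);                 // in effect, a barrier: release the sleepers
      }
      return 0;
    }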
The implementation of the Mover raises various issues. First, our implementation avoids buffer packing for contiguous data. That is, if a Grid copy's data happens to lie contiguously in memory, then the Mover will send or receive the data directly from its original storage locations. Second, there is obvious room for improvement in our message passing layer. For example, MPICH does not intercept on-node communication via shared memory, and we had to implement this capability inside the Mover. In addition, Fast Messages, which provide for message streams, could reduce the amount of copying, even for non-contiguous data [25]. We did not multi-thread the Mover, but the design of KeLP admits this possibility. However, it is unlikely that the Alpha has the memory and communication bandwidth to support concurrency within communication.
5 Performance
5.1 Overview
We next assess the benefits of communication overlap expressed by the KeLP Mover. We present a performance study of five block-structured applications: RedBlack3D, red-black Gauss-Seidel relaxation in 3D on a 7-point stencil; NAS-MG, a NAS benchmark that solves the 3D Poisson equation with multigrid V-cycles; MB, a 2D irregular multiblock code that uses multigrid V-cycles to solve the 2D Poisson equation; SUMMA, van de Geijn's matrix multiply algorithm; and NAS-FT, the NAS FT benchmark, which computes 3D FFTs.
We implemented all five applications in KeLP2.0, restructuring the original single-tier formulations into a two-tier hierarchical form that utilized shared memory on the node. We also implemented non-overlapped variants of the KeLP2.0 codes, which executed calls to Mover::start() and Mover::wait() in immediate succession, i.e. without any interspersed computation. We experimented with an alternative implementation of the Mover that employed non-blocking message passing calls in lieu of an extra thread. This strategy failed to improve performance, and in some cases actually resulted in lower performance [15].
We compared the performance of our KeLP applications with explicit message passing versions written in MPI. Two of these, the NPB 2.1 benchmarks NAS-MG and NAS-FT, were downloaded and used without modification. Another, RedBlack3D, was painstakingly optimized. As described below, SUMMA was downloaded and modified to invoke the native BLAS, and we did not construct an MPI version of MB.
We ran on a dedicated system. For all experimental results reported here, the timed computation is repeated so that the total timed running time is at least ten seconds. Timings reported are based on wall-clock times, obtained via the MPI_Wtime() call. We scaled the amount of work per node so that it remained constant as we increased the number of nodes.
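For reference, this timing discipline amounts to the following small harness; it is an illustration, not the code actually used, and one_iteration() is a stand-in for the timed computation.

    // Repeat the timed computation until at least ten seconds of wall-clock
    // time accumulate, then report the mean time per iteration.
    #include <mpi.h>

    double time_per_iteration(void (*one_iteration)())
    {
      double start = MPI_Wtime();
      int reps = 0;
      do {
        one_iteration();
        reps++;
      } while (MPI_Wtime() - start < 10.0);
      return (MPI_Wtime() - start) / reps;
    }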
Fig. 4 plots the performance of the applications and demonstrates the benefits of communication overlap realized by the KeLP Mover in executing our restructured algorithms. Comparing the "KeLP/Non-overlapped" and "KeLP/Overlap" floating-point rates shown in the figure, we see that four of the applications, RedBlack3D, NAS-MG, NAS-FT, and SUMMA, were able to overlap communication on a system which does not support overlap via non-blocking message passing. On 8 AlphaServer nodes, our overlap strategy improved performance by between 12% and 28%. For all but MB, we also compare performance against an equivalent hand-coded MPI version. In some cases (RedBlack3D and NAS-FT) there were significant differences between the MPI and KeLP2.0 non-overlapped versions. However, our version of MPI was not optimized for shared memory on the node, and we expect that MPI performance would improve with an enhanced message passing layer [28]. Perhaps the non-overlapped KeLP version gives a truer picture of performance with an improved message passing implementation.
In addition, we present single-node performance in Fig. 5. As the figure shows, per-processor performance degrades as we add more processors. With RedBlack3D, for example, performance drops from 14.1 MFLOPS on one processor to 6.44 MFLOPS per processor on four processors.
Figure 4: Performance (MFLOPS) using all four processors per node: a) RedBlack3D, b) NAS-MG, c) multi-block multigrid (MB), d) NAS-FT, and e) SUMMA. The problem sizes (work) scale linearly with the number of nodes. The data labeled "MPI" report results from running an MPI code with one MPI process per physical processor in the system; thus, on n nodes, these results use 4n MPI processes. Two versions of the MPI implementation of RedBlack3D are reported; one uses a partitioning optimized to the structure of the machine and leads to superior performance. There is no MPI implementation for MB.
Figure 5: Performance (MFLOPS) on a single node of the AlphaServer, varying the number of processors: a) RedBlack3D, b) MB, c) NAS-FT, and d) SUMMA. The problem size was fixed.
Similar trends were observed for the other kernels. Even SUMMA achieved a speedup of only 3.29 on 4 processors. Apparently, memory contention plays a role even for matrix multiplication, which achieves only 60% of the 275 Megaflops theoretical peak rate of the 21064A processor when running the vendor-supplied dgemm kernel. These results show that the shared memory system of this SMP cannot support the memory bandwidth demands of four processors running numeric kernels.
5.2 Detailed Performance Studies
5.2.1 RedBlack3D
RedBlack3D employs the pre-fetching strategy described in the previous section. Table 1 gives detailed timing breakdowns. We were able to fully overlap communication on 2 and 4 nodes, but not on 8 nodes, where we observe a non-zero wait time on the Mover. The table also reveals that overlapped execution comes at a price: it slows down local computation. First, dedicating a processor to the Mover increases the workload on the 3 remaining processors. Without overlap, the time per iteration increases by 15%, from 680 ms per iteration with 4 processors per node to 784 ms with 3 processors per node. Second, communication consumes memory bus bandwidth and slows down the compute threads. With the addition of overlap, computation time increases a further 4%, to 815 ms. Memory contention, while small here, also works both ways: it slows down message passing performance. We have observed as much as a 70% increase in communication time as measured on the communicating processor. Though most of this time is masked by our overlap strategy, its effect can be seen on 8 nodes, where we cannot quite overlap all of the communication.
Whenever we observe a non-zero wait time in communication, the communication thread utilizes virtually 100% of the dedicated processor. Thus, on fewer than 8 nodes we might be tempted to share the processor which runs the communication thread with one of the computational threads, in order to strike a more precise balance of communication and computation. Recall that under KeLP the Mover runs as a concurrent task, and does not explicitly reserve any specific processing resources. Thus, by modifying the FloorPlan we can offload some work onto the processor running the Mover without having to modify the remainder of the application code. However, we were unable to observe more than a few percent improvement in performance using this strategy. This is true in part because we increase memory bus utilization when we increase processor utilization, and in part because variations in communication times over the ATM switch introduce uncertainty into our load balancing estimates. Interrupt overhead is likely another factor.
Table 1: Execution time breakdown for RedBlack3D in milliseconds per iteration. The columns labeled "Comm" report the time spent waiting for communication to complete. The times reported are the maximum reported from all nodes; thus, the local computation and communication times do not add up exactly to the total time. The domains were cubical (of edge length N), and the nodes were configured with the indicated processor geometry.
5.2.2 NAS-MG
The NAS-MG multigrid benchmark [4] solves Poisson's equation in 3D using a multigrid V-cycle [8]. Multigrid carries out computation at a series of levels, and each level defines a grid at a successively coarser resolution. We parallelized each level of this stencil-based computation much as we parallelized RedBlack3D, using a pre-fetching scheme which delays computation on the annulus until the ghost cells arrive.

Performance improvements due to overlap are slightly lower for NAS-MG than for RedBlack3D; for example, on eight nodes, 12% vs. 18%. This is due to a surface-to-volume effect: the coarser levels of multigrid do not carry enough work to hide all the communication.
The NAS-MG code requires ghost cells to be transferred across periodic boundaries, and the 27-point stencil requires ghost cells on the corners and edges of each processor's 3D grid. The NPB 2.1 code uses a three-stage dimensional exchange algorithm to satisfy the boundary conditions, as illustrated in Fig. 6. This algorithm is interesting because it cannot be conveniently expressed with non-blocking communication. To handle the three phases of communication, we cannot naively build 3 separate Movers and start all 3 at once. Instead, we define an object called the MultiMover. The MultiMover, derived from Mover, serially invokes a sequence of Movers, as shown in Fig. 7. The MultiMover implementation stores a linked list of pointers to Movers. When called to perform data motion, the MultiMover executes each Mover in succession. To the calling application, the three-stage MultiMover execution appears to be an atomic operation.
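One plausible realization, consistent with the interface shown in Fig. 7, is sketched below. The run() hook, assumed here to be what the communication thread invokes after start() enqueues the object, is our naming for illustration, not necessarily KeLP2.0's.

    // Hedged sketch of a MultiMover: a Mover whose data motion is the serial
    // execution of a list of constituent Movers.
    #include <list>

    class MultiMover : public Mover {
      std::list<Mover*> phases;                  // the Movers to run, in order
    public:
      void add(Mover *m) { phases.push_back(m); }
      void run()                                 // invoked on the communication thread
      {
        for (std::list<Mover*>::iterator it = phases.begin(); it != phases.end(); ++it)
          (*it)->run();                          // phase k+1 begins only after phase k finishes
      }
      // start() and wait() are inherited, so the whole sequence looks atomic
      // to the calling program.
    };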
This communication example highlights the expressive power of KeLP's solution. By expressing the multi-phase communication as an atomic object, the programmer can asynchronously execute an arbitrary sequence of message-passing and synchronization operations. In contrast, non-blocking MPI calls asynchronously start only one message-passing operation at a time. To overlap communication using a sequence of operations, the MPI program must periodically poll for the completion of non-blocking message calls in order to start the next sequence of calls in a timely manner. The programmer may have to interrupt highly tuned numeric kernels to poll, degrading performance and tangling program structure. The more powerful KeLP design lends itself to better structured and more efficient code.
5.2.3 MB
The previous two examples evaluated KeLP support for explicit finite difference codes with regular geometries. However, KeLP targets irregular geometries as well. To examine KeLP support for finite difference codes on irregular domains, we have implemented a 2D multigrid solver for an irregular domain. This multi-block code accepts any grid geometry as run-time input and employs a 5-point block Gauss-Seidel relaxation stencil.
Figure 6: The two-stage dimensional exchange algorithm used to manage ghost cell updates for a 2-dimensional, 9-point stencil. In this case, the corner points are also needed. Only the messages affecting the final state of the left edge of node 4 are shown. In each stage, each node exchanges ghost cells along just one dimension of the domain.
Figure 7: The MultiMover interface and its use (fragment).

    class MultiMover : public Mover {
      void add( ... ) { ... }
      void start()    { ... }
      void wait()     { ... }
    };

    FloorPlan F = SetUpFloorPlan(...);
    XArray2 U(F);
    MultiMover *M = new MultiMover;
    // ... overlapped computation ...
In the following experiments, we solve Poisson's equation over a grid structure covering the geometrical shape of Lake Superior, obtained from Yingxin Pang. Fig. 8 shows the resultant grids and partitioning assignments generated by Pang's process, which employed a heuristic developed by Rantakokko [32].

Figure 8: Grid geometries used for the multi-block multigrid code. The colors denote the assignment of grids onto four nodes.
The data in Fig. 4c reveal that the performance benefit due to overlap is virtually non-existent. Further examination reveals that the computation (and hence the communication) is severely load imbalanced. On eight nodes, the maximum workload exceeded the mean by 44%, and the node with the highest communication workload exceeded the mean by 114%. We suspect that, due to load imbalance, the most lightly loaded node finishes each iteration early and spends most of its time waiting for the most heavily loaded node to catch up. Perhaps this pattern defeats the overlap strategy, since the most heavily loaded node must forever attempt to catch up with the others; if so, the most heavily loaded node does not overlap communication and computation.
5.2.4 NAS FT
The NAS FT benchmark incurs a costly transpose that limits the performance of the application on an SMP cluster [16, 28]. Nevertheless, we were able to improve performance with overlap, by about 15% on 8 nodes, using a pipelining strategy due to Agarwal et al. [1]. We scaled the problem size with the number of nodes, with 2^20 unknowns per node. Data were distributed with a 1D block decomposition.

Despite the use of overlap, performance for this application is poor. The application spends 55% of its time communicating on 8 nodes and utilizes the dedicated communication processor 100% of the time. The use of multiprocessing at the nodes is debatable, since it exacerbates the architectural imbalance between the interconnection network and the node's computational power. Compared with the non-overlapped algorithm running on just 1 CPU per node, the overlapped algorithm achieves a speedup of only 1.7 while using 4 times the number of processors. Communication costs are further increased by the heavy memory bandwidth demands made by the kernel. On a single node, the code obtains speedups of 1.73, 2.38, and 2.57 on 2, 3, and 4 Alpha processors, respectively.
5.2.5 SUMMA
SUMMA implements a fast matrix multiply algorithm due to van de Geijn and Watts [40]. It implements matrix multiplication as a series of blocked outer products over distributed matrices. We developed a straightforward dual-tier adaptation of SUMMA, as well as a new SUMMA variant that explicitly overlaps communication and computation [15].
This pipelined algorithm carries out a series of broadcasts as shown in Fig. 9. The broadcasts involve a panel of data, that is, a vertical or horizontal slice, rather than the entire block of data assigned to the processor. By transmitting by panel rather than by block we decrease the granularity of the pipeline and hence improve the success of overlap.
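Schematically, the restructured multi-tier loop has the following shape; panelMover[k], Apanel, Bpanel, numPanels, and local_dgemm() are stand-ins for the Movers that broadcast the k-th panels and for the node-level call to the DXML dgemm.

    // Pipelined panel broadcasts: while the compute processors multiply the
    // current panels, the Mover broadcasts the next ones.
    panelMover[0].start();
    for (int k = 0; k < numPanels; k++) {
      if (k + 1 < numPanels)
        panelMover[k + 1].start();          // prefetch the next pair of panels ...
      panelMover[k].wait();                 // ... while waiting only for the current pair
      local_dgemm(Apanel[k], Bpanel[k], C); // C += Apanel[k] * Bpanel[k]
    }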
The matrices were distributed with a 2D block distribution. The panel size b was 100 in all runs, and was determined experimentally. We use the dgemm kernels from the Digital Extended Math Library (DXML) [22]. On this platform, DXML obtains 160 MFLOPS per processor for matrix sizes that do not fit in the L2 cache; we always select problem sizes that do not fit in the L2 cache. The "MPI" results use the C+MPI SUMMA version made publicly available by van de Geijn and Watts and described in [40], which we modified to call the DXML kernel.

The pipelined MPI implementation is able to overlap communication with computation and edges out the non-overlapped KeLP2.0 implementation on 8 nodes. This is because the non-overlapped KeLP implementation communicates at a coarser level of granularity, at the level of a node rather than a processor.
While the pipelined single-tier algorithm performs unexpectedly well, the results show that careful orchestration of the overlap pattern with the restructured multi-tier algorithm improves performance further still. Table 2 shows the breakdown of execution times for the non-overlapped multi-tier KeLP code and the explicitly overlapped multi-tier KeLP code. The column labeled "Comm" gives the time spent by the message co-processor executing the Movers, not the time spent by compute processors waiting on the Mover. The column labeled "dgemm" shows the time spent by the compute processors executing the local matrix multiply.

Table 2: Execution-time breakdown for the basic and overlapped multi-tier KeLP SUMMA codes on the AlphaServer cluster. Times are in seconds. The times reported are the maximum reported from all nodes; thus, the local computation and communication times do not add up exactly to the total time.
The table shows that for the problem size considered, on eight nodes, the bottleneck under our overlap strategy is local computation. Moreover, communication activity offloaded onto the spare processor takes roughly the same time as computation and fully utilizes the spare processor. Note that even for dgemm, the message-passing activity slows down local computation on the other processors. Dedicating the fourth processor to the Mover and restructuring the algorithm slows down local computation by nearly 100% on 8 nodes, though the reciprocal effect on communication is not nearly as pronounced. This effect is more extreme on 8 nodes than on smaller configurations. We believe the cause is an inefficient broadcast algorithm, which has a running time that is linear in the number of nodes. We are currently investigating enhancements to this algorithm.
Our results imply that improvements to the communication layer that reduce memory bandwidth requirements could improve application performance through the indirect effect of speeding up local computation. Improvements in communication performance alone would not lead to any significant overall increase in performance, since the dgemm would remain on the critical path of the computation.
5.3 Discussion
Experience with our applications has shed light on principles and practice for overlapping communication in block-structured scientific calculations on dedicated SMP clusters. We considered three crucial application classes: finite difference codes, FFTs, and blocked dense linear algebra. Each of these application classes demanded quite different data layout strategies and communication patterns. Our results also show that the implementation of the KeLP Mover is able to realize communication overlap on hardware that could not support overlap through non-blocking message passing. Moreover, the KeLP communication primitives provide a higher level of abstraction than MPI message-passing code, resulting in shorter, cleaner, easier-to-read application software.
An important aspect of the KeLP communication model is that it carries out node-level rather than processor-level communication. This is useful in computations that employ shared memory at the nodes, where it may not be appropriate to associate a specific processor (or processors) with the communication required by computation at the node. The capability to permit specific processors to request communication introduces the need to contend with policy decisions that complicate the user code. By permitting communication to be carried out anonymously by processors on behalf of the node, we encapsulate policy decisions that affect communication performance. Note that we could achieve a similar effect with Message Proxies, so long as we restricted communication to a distinguished processor (a different processor than the one executing the proxy). However, as mentioned, Message Proxies rely on non-blocking point-to-point communication to express communication overlap, and this mechanism is not appropriate for multi-phase communication algorithms.
In all of our applications, performance was ultimately limited by the performance of the message passing layer. On the largest number of nodes we used (8), utilization of the dedicated communication processor was nearly 100%. While improvements in the message passing layer would naturally lead to improved application performance, we note an important indirect effect: reduced memory bus utilization would also speed the progress of local computation running elsewhere on the node. Under these circumstances, the KeLP Mover might not fully utilize the communication processor, and we might be tempted to take steps to interleave some computation with the Mover. However, the benefit of increased utilization is unclear, both because of the memory bandwidth limitations and because of the deleterious effects of interrupts.
6 Conclusions
The appropriate programming paradigm for SMP-based multicomputers is presently an unresolved issue. We have presented a collective, hierarchical communication model that permits the user to express a variety of communication algorithms in bulk-synchronous, block-structured computations. These restructured algorithms were able to tolerate severe bandwidth limitations on a platform which is unable to support communication overlap using non-blocking communication.
We have implemented our communication model in the KeLP2.0 library, which also provides hierarchical control flow and data decomposition. KeLP encapsulates the complicated communication patterns arising in block-structured applications. This encapsulation works synergistically with our scheme for overlapping communication and hides much of the underlying complexity. In particular, the KeLP Mover separates the expression of correct programs from optimizations affecting performance. This type of separation of concerns results in easier-to-develop, more maintainable code [23].
Our implementation of KeLP2.0 worked around many limitations of an aging platform and raises questions about performance tradeoffs on modern designs. In particular, we are interested in newer systems, which have enhanced message passing layers, good-quality thread implementations, higher degrees of multiprocessing, larger numbers of nodes, and faster interconnects. We are currently porting KeLP to the IBM ASCI-Blue machine and a cluster of Sun Enterprise Servers. Other possible target systems include the HP Exemplar, clusters of SGI-Cray Origin 2000's, and the more recent Digital AlphaServers.

Though our hierarchical model of parallelism was instantiated with just two levels, we are currently exploring the extension to arbitrary levels. A more general model may be applicable to other levels of the memory hierarchy, including cache, network, and I/O.
Acknowledgments. The authors wish to thank Larry Carter for many rewarding discussions. This work was completed while the first author was on sabbatical leave at the Department of Computer Science and Business Administration, University of Karlskrona/Ronneby, in Ronneby, Sweden. Thanks to M.
References
[1] R. C. Agarwal, F. G. Gustavson, and M. Zubair, An efficient parallel algorithm for the 3-D FFT NAS parallel benchmark, in Proc. of SHPCC '94, May 1994, pp. 129-133.

[2] G. Agrawal, A. Sussman, and J. Saltz, An integrated runtime and compile-time approach for parallelizing structured and block structured applications, IEEE Transactions on Parallel and Distributed Systems, 6 (1995).

[3] B. Alpern, L. Carter, and J. Ferrante, Modeling parallel computers as memory hierarchies, in Programming Models for Massively Parallel Computers, W. K. Giloi, S. Jahnichen, and B. D. Shriver, eds., IEEE Computer Society Press, Sept. 1993, pp. 116-123.

[4] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, The NAS parallel benchmarks, Tech. Rep. RNR-94-007, NASA Ames Research Center, March 1994.

[5] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, Efficient management of parallelism in object-oriented numerical software libraries, in Modern Software Tools in Scientific Computing, E. Arge, A. M. Bruaset, and H. P. Langtangen, eds., Birkhauser Press, 1997.

[6] J. Beecroft, M. Homewood, and M. McLaren, Meiko CS-2 interconnect Elan-Elite design, Parallel Computing, 20 (1994), pp. 1627-1638.

[7] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, and J. Sandberg, Virtual memory mapped network interface for the SHRIMP multicomputer, in Proceedings of the 21st Annual International Symposium on Computer Architecture, Chicago, IL, April 1994, pp. 142-153.

[8] W. L. Briggs, A Multigrid Tutorial, SIAM, 1987.

[9] S. Chakrabarti, E. Deprit, E.-J. Im, J. Jones, A. Krishnamurthy, C.-P. Wen, and K. Yelick, Multipol: A distributed data structure library, in Fifth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, Jul. 1995.

[10] K. Chandy and C. Kesselman, Compositional C++: Compositional parallel programming, in Fifth International Workshop of Languages and Compilers for Parallel Computing, Aug. 1992.

[11] C. Chang, A. Sussman, and J. Saltz, Support for distributed dynamic data structures in C++, Tech. Rep. CS-TR-3266, University of Maryland, 1995.

[12] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick, Parallel programming in Split-C, in Proc. Supercomputing, Nov. 1993.

[13] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, LogP: Towards a realistic model of parallel computation, in Proceedings of the Fourth Symposium on Principles and Practice of Parallel Programming, May 1993, pp. 1-12.

[14] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang, Communication optimizations for irregular scientific computations on distributed memory architectures, Journal of Parallel and Distributed Computing, 22 (1994), pp. 462-479.

[15] S. J. Fink, Hierarchical Programming for Block-Structured Scientific Calculations, PhD thesis, Department of Computer Science and Engineering, University of California, San Diego, 1998.

[16] S. J. Fink and S. B. Baden, Runtime support for multi-tier programming of block-structured applications on SMP clusters, in Proceedings of the 1997 International Scientific Computing in Object-Oriented Parallel Environments Conference, Marina del Rey, CA, December 1997.

[17] S. J. Fink, S. B. Baden, and S. R. Kohn, Efficient run-time support for irregular block-structured applications, J. Parallel Distrib. Comput., (1998).

[18] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard, tech. rep., Argonne National Laboratory, Argonne, IL, 1997. http://www.mcs.anl.gov/mpi/mpich/.

[19] F. M. Hayes, Design of the AlphaServer multiprocessor server systems, Digital Technical Journal, 6 (1994), pp. 8-19.

[20] High Performance Fortran Forum, High Performance Fortran Language Specification, Nov. 1994.

[21] L. Kale and S. Krishnan, CHARM++: A portable concurrent object oriented system based on C++, in Proceedings of OOPSLA, Sept. 1993.

[22] C. Kamath, R. Ho, and D. P. Manley, DXML: A high-performance scientific subroutine library, Digital Technical Journal, 6 (1994), pp. 44-56.

[23] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J.-M. Loingtier, and J. Irwin, Aspect-oriented programming, Tech. Rep. SPL97-008 P9710042, Xerox PARC, February 1997.

[24] S. R. Kohn and S. B. Baden, Irregular coarse-grain data parallelism under LPARX, J. Scientific Programming, 5 (1996).

[25] M. Lauria, S. Pakin, and A. A. Chien, Efficient layering for high speed communication: Fast Messages 2.x, in Proc. 7th High Performance Distributed Computing Conf. (HPDC7), July 1998.

[26] B.-H. Lim, P. Heidelberger, P. Pattnaik, and M. Snir, Message proxies for efficient, protected communication on SMP clusters, in Proceedings of the Third International Symposium on High-Performance Computer Architecture, San Antonio, TX, February 1997, IEEE Computer Society Press, pp. 116-127.

[27] C. Lin and L. Snyder, ZPL: An array sublanguage, in Proc. Languages and Compilers for Parallel Computing, 6th Int'l Workshop, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds., Springer-Verlag, 1994, pp. 96-114.

[28] S. S. Lumetta, A. M. Mainwaring, and D. E. Culler, Multi-protocol active messages on a cluster of SMPs, in Proc. SC97, Nov. 1997.

[29] Message Passing Interface Forum, MPI: A message-passing interface standard, University of Tennessee, Knoxville, TN, Jun. 1995.

[30] R. Parsons and D. Quinlan, Run-time recognition of task parallelism within the P++ parallel array class library, in Proc. Scalable Parallel Libraries Conference, October 1994, pp. 77-86.

[31] P. Pierce and G. Regnier, The Paragon implementation of the NX message passing interface, in Proceedings of the Scalable High-Performance Computing Conference, Knoxville, TN, May 1994, pp. 184-190.

[32] J. Rantakokko, A framework for partitioning domains with inhomogeneous workload, Tech. Rep. 8, Royal Institute of Technology and Uppsala University, March 1997.

[33] S. K. Reinhardt, R. W. Pfile, and D. A. Wood, Decoupled hardware support for distributed shared memory, in Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, May 1996, pp. 34-43.

[34] A. C. Sawdey, M. T. O'Keefe, and W. B. Jones, A general programming model for developing scalable ocean circulation applications, in Proceedings of the ECMWF Workshop on the Use of Parallel Processors in Meteorology, January 1997.

[35] L. Snyder, Type architectures, shared memory, and the corollary of modest potential, Annual Review of Computer Science, 1 (1986), pp. 289-317.

[36] L. Snyder, Foundations of practical parallel programming languages, in Portability and Performance of Parallel Processing, T. Hey and J. Ferrante, eds., John Wiley and Sons, 1993.

[37] A. Sohn and R. Biswas, Communication studies of DMP and SMP machines, Tech. Rep. NAS-97-004, NAS, 1997.

[38] A. K. Somani and A. M. Sansano, Minimizing overhead in parallel algorithms through overlapping communication/computation, Tech. Rep. 97-8, ICASE, February 1997.

[39] L. G. Valiant, A bridging model for parallel computation, Communications of the ACM, 33 (1990), pp. 103-111.

[40] R. van de Geijn and J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience, 9 (1997), pp. 255-274.

[41] P. R. Woodward, Perspectives on supercomputing: Three decades of change, IEEE Computer, 29 (1996), pp. 99-111.
Biographical Sketches
Scott B. Baden is an Associate Professor of Computer Science and Engineering at the University of California, San Diego, and is also a Senior Fellow at the San Diego Supercomputer Center. He received the B.S. degree (magna cum laude) in electrical engineering from Duke University in 1978, and the M.S. and Ph.D. degrees in computer science from the University of California, Berkeley, in 1982 and 1987, respectively. He was a post-doc in the Mathematics Group at the University of California's Lawrence Berkeley Laboratory between 1987 and 1990, taking time off to travel. Dr. Baden's current research interests are in the areas of parallel and scientific computation: programming methodology, irregular problems, load balancing, and performance.

Stephen J. Fink is currently a Research Staff Member at the IBM T. J. Watson Research Center in Hawthorne, NY. He received the B.S. degree in Computer Science from Duke University in 1992, with an additional major in Mathematics, and the M.S. and Ph.D. degrees in Computer Science from the University of California, San Diego, in 1994 and 1998. His research focuses on programming abstractions, algorithms, and computer architecture for high performance and scientific computation.