Programming Models for Parallel Computing
Edited by Pavan Balaji
Series Foreword
The Scientific and Engineering Computation series from MIT Press presents accessible
accounts of computing research areas normally presented in research papers and specialized conferences.
Elements of modern computing that have appeared thus far in the series include paral-
lelism, language design and implementation, system software, and numerical libraries.
The scope of the series continues to expand with the spread of ideas from computing into
new aspects of science.
Programming models and the software systems that implement them are a crucial aspect
of all computing, since they provide the concrete mechanisms by which a programmer
prepares algorithms for execution on a computer and communicates these ideas to the ma-
chine. In the case of parallel systems, the complexity of the task has spurred innovative
research. This book collects in one place definitive expositions of a wide variety of pro-
gramming systems by highly regarded authors, including both language-based systems and
library-based systems. Some are heavily used and defined by standards bodies; others are
research projects with smaller user bases. All are currently being used in scientific and
engineering applications for parallel computers.
A programming model can be thought of as the abstract machine for which a programmer
is writing instructions. Programming models typically are instantiated in languages and
libraries. Such models form a rich topic for computer science research because program-
mers prefer them to be productive (capable of expressing any abstract algorithm with ease),
portable (capable of being used on any computer architecture), performant (capable of de-
livering performance commensurate with that of the underlying hardware), and expressive
(capable of expressing a broad range of algorithms)—the four pillars of programming.
Achieving one or perhaps even two of these features simultaneously is relatively easy, but
achieving all of them is nearly impossible. This situation accounts for the great multiplicity
of programming models, each choosing a different set of compromises.
With the coming of the parallel computing era, computer science researchers have
shifted focus to designing programming models that are well suited for high-performance
parallel computing and supercomputing systems. Parallel programming models typically
include an execution model (what path the code execution takes) and a memory model
(how data moves in the system between computing nodes and in the memory hierarchy of
each computing node). Programming parallel systems is complicated by the fact that mul-
tiple processing units are simultaneously computing and moving data, thus often increasing
the nondeterminism in the execution in terms of both correctness and performance.
Also important is the distinction between programming models and programming sys-
tems. Technically speaking, the former refers to a style of programming—such as bulk
synchronous or implicit compiler-assisted parallelization—while the latter refers to actual
abstract interfaces that the user would program to. Over the years, however, the parallel
computing community has blurred this distinction; and in practice today a programming
model refers to both the style of programming and the abstract interfaces exposed by the
instantiation of the model.
Contrary to common belief, most parallel systems do not expose a single parallel pro-
gramming model to users. Different users prefer different levels of abstraction and differ-
ent sets of tradeoffs among the four pillars of programming. Broadly speaking, domain
scientists and those developing end applications often prefer a high-level programming
model that is biased toward higher productivity, even if it is specialized to a small class of
algorithms and lacks the expressivity required by other algorithms. On the other hand, de-
velopers of domain-specific languages and libraries might prefer a low-level programming
model that is biased toward performance, even if it is not as easy to use. Of course, these
are general statements, and exceptions exist on both sides.
This book provides an overview of some of the most prominent parallel programming models.
Acknowledgments
I first thank all the authors who contributed the various chapters to this book.
Special thanks to Ewing Lusk and William Gropp for their contributions to the book as
a whole and for improving the prose substantially.
I also thank Gail Pieper, technical writer in the Mathematics and Computer Science
Division at Argonne National Laboratory, for her indispensable guidance in matters of
style and usage, which vastly improved the readability of the prose.
Programming Models for Parallel Computing
1 Message Passing Interface
William D. Gropp, University of Illinois, Urbana-Champaign
Rajeev Thakur, Argonne National Laboratory
1.1 Introduction
MPI is a standard, portable interface for communication in parallel programs that use a
distributed-memory programming model. It provides a rich set of features for expressing
the communication commonly needed in parallel programs and also includes additional
features such as support for parallel file I/O. It supports the MPMD (multiple program,
multiple data) programming model. It is a library-based system, not a compiler or lan-
guage specification. MPI functions can be called from multiple languages—it has official
bindings for C and Fortran. MPI itself refers to the definition of the interface specifica-
tion (the function names, arguments, and semantics), not any particular implementation of
those functions. MPI was defined by an organization known as the MPI Forum, a broadly
based group of experts and users from industry, academia, and research laboratories. Many
high-performance implementations of the MPI specification are available (both free and
commercial) for all platforms (laptops, desktops, servers, clusters and supercomputers of
all sizes) and all architectures and operating systems. As a result, it is possible to write par-
allel applications that can be run portably on any platform, while at the same time achieving
good performance. This feature has contributed to MPI becoming the most widely used
programming system for parallel scientific applications.
MPI Background
The effort to define a single, standard interface for message passing began in 1992. It
was motivated by the presence of too many different, nonportable APIs—both vendor
supported (e.g., Intel NX [232], IBM EUI [119], Thinking Machines CMMD [272],
nCUBE [207]) and research libraries (e.g., PVM [121], p4 [51], Chameleon [130], Zip-
code [254]). Applications written to any one of these APIs either could not be run on
different machines or would not run efficiently. If any HPC vendor went out of business,
applications written to that vendor’s API could not be run elsewhere. It was recognized
that this multiplicity of APIs was hampering progress in application development, and the
need for a single, standard interface defined with broad input from everyone was evident.
The first version of the MPI specification (MPI-1) was released in 1994, and it covered
basic message-passing features, such as point-to-point communication, collective commu-
nication, datatypes, and nonblocking communication. In 1997, the MPI Forum released the
second major version of MPI (MPI-2), which extended the basic message-passing model
to include features such as one-sided communication, parallel I/O, and dynamic processes.
The third major release of MPI (MPI-3) was in 2012, and it included new features such as
nonblocking collectives, neighborhood collectives, a tools information interface, and
significantly extended support for one-sided communication.
The core of MPI is communication between processes, following the communicating se-
quential processes (CSP) model. Each process executes in its own address space. Declared
variables (e.g., int b[10];) are private to each process; the b in one process is dis-
tinct from the b in another process. In MPI, there are two major types of communication:
communication between two processes, called point-to-point communication, and commu-
nication among a group of processes, called collective communication.
Each MPI process is a member of a group of processes and is identified by its rank
in that group. Ranks start from zero, so in a group of four processes, the processes are
numbered 0, 1, 2, 3. All communication in MPI is made with respect to a communicator.
This object contains the group of processes and a (hidden) communication context. The
communication context ensures that library software written using MPI can guarantee that
messages remain within the library, which is a critical feature that enables MPI applica-
tions to use third-party libraries. The communicator object is a handle, which is just a way
to say “opaque type.” In C, the communicator handle is of type MPI Comm; in Fortran,
it is of type TYPE(MPI Comm) (for the Fortran 2008 interface) or INTEGER (for earlier
versions of Fortran). When an MPI program begins, there are two predefined communica-
tors: MPI COMM WORLD and MPI COMM SELF. The former contains all the processes in
the MPI execution; the latter just the process running that instance of the program. MPI
provides routines to discover the rank of a process in a communicator (MPI Comm rank),
to discover the number of processes in a communicator (MPI Comm size), and to create
new communicators from old ones.
A complete, but very basic, MPI program is shown in Figure 1.1. This program shows
the use of MPI Init to initialize MPI and MPI Finalize to finalize MPI. With a
few exceptions, all other MPI routines must be called after MPI Init (or MPI Init -
thread) and before MPI Finalize. The odd arguments to MPI Init were intended
to support systems where the command-line arguments in a parallel program were not
available from main but had to be provided to the processes by MPI.
Figure 1.1 illustrates another feature of MPI: anything that is not an MPI call executes
independently. In this case, the printfs will execute in an arbitrary order; it is not even
required that the output appear one line at a time.
The examples presented are in C, but it is important to note that MPI is defined by a
language-neutral specification and a set of language bindings to that specification.
#include "mpi.h"
#include <stdio.h>
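/* A minimal program of the kind the caption describes (a sketch; the exact
   output text is illustrative): initialize MPI, report this process's rank
   and the size of MPI_COMM_WORLD, and finalize. */
int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am rank %d of %d processes\n", rank, size);
    MPI_Finalize();
    return 0;
}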
Figure 1.1: A complete MPI program that illustrates communicator rank and size.
MPI currently supports language bindings in C and Fortran. (C++ programs can use the C
bindings.)
The most basic communication in MPI is between pairs of processes. One process sends
data, and the other receives it. The sending process must specify the data to send, the pro-
cess to which that data is to be sent, and a communicator. In addition, following a feature
from some of the earliest message-passing systems, each message also has a message tag,
which is a single nonnegative integer. The receiving process must specify where the data
is to be received, the process that is the source of the data, the communicator, and the
message tag. In addition, it may provide a parameter in which MPI will return information
about the received message; this is called the message status.
Early message-passing systems, and most I/O libraries, specify data buffers as a tuple
containing the address and the number of bytes. MPI generalizes this as a triple: address,
datatype, and count (the number of datatype elements). In the simplest case, the datatype
corresponds to the basic language types. For example, to specify a buffer of 10 ints in C,
MPI uses (address, 10, MPI INT). This approach allows easier handling of data
for the programmer, who does not need to know or discover the number of bytes in each
basic type. It also allows the MPI library to perform data conversions if the MPI program
is running on a mix of hardware that has different data representations (this was more
important when MPI was created than it is today). In addition, as described in Section 1.4,
it allows the specification of data buffers that are not contiguous in memory.
int msg[MAX_MSG_SIZE];
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
for (i=1; i<size; i++)
MPI_Send(msg, msgsize, MPI_INT, i, 0, MPI_COMM_WORLD);
} else {
MPI_Recv(msg, MAX_MSG_SIZE, MPI_INT, 0, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
doWork(msg);
}
Figure 1.2: Example of the use of MPI to send data (in the integer array msg of length
msgsize from the process with rank zero to all other processes).
The process to which data is sent is specified by a rank in a communicator; that com-
municator also provides the communication context. The message tag is a nonnegative
integer; the maximum allowed value depends on the MPI implementation but must be at
least 32767. These are the arguments to MPI Send.
The arguments for receiving a message are similar. One difference is that the receive
function is allowed to specify a size larger than the data actually being sent. The tag and
source rank may be used to either specify an exact value (e.g., tag of 15 and source of 3)
or any value by using what are called wild card values (MPI ANY TAG and MPI ANY -
SOURCE, respectively). The use of MPI ANY TAG allows the user to send one additional
integer item of data (as the tag value); the use of MPI ANY SOURCE allows the implemen-
tation of nondeterministic algorithms. A status argument provides access to the tag value
provided by the sender and the source rank of the sender. When this information is not
needed, the value MPI STATUS IGNORE may be used. Figure 1.2 shows a program that
sends the same data from the process with rank zero in MPI COMM WORLD to all other
processes.
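For reference, the C prototypes of these two routines, as defined by the MPI standard, are shown below (the const qualifier on the send buffer was added in MPI-3):

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);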
MPI provides a number of different send modes; what has been described here is the
basic or standard send mode. Other modes include a synchronous send, ready send, and
buffered send. Nonblocking communication is described in Section 1.5.
1.4 Datatypes
As explained earlier, one of the special features of MPI is that all communication functions
take a “datatype” argument. The datatype is used to describe the type of data being sent
or received, instead of just bytes. For example, if communicating an array of integers,
the datatype argument would be set to MPI INT in C or MPI INTEGER in Fortran. One
purpose for such a datatype argument is to enable communication in a heterogeneous en-
vironment, e.g., communication between machines with different endianness or different
lengths of basic datatypes. When the datatype is specified as part of the communication,
the MPI implementation can internally perform the necessary byte transformations so that
the data makes sense at the destination. MPI supports all the language types specified in
C99 and Fortran 2008.
Another purpose of datatypes is to enable the user to describe noncontiguous layouts of
data in memory and communicate the data with a single MPI function call. MPI provides
several datatype constructor functions to describe noncontiguous data layouts. A high-
quality implementation can optimize the noncontiguous data transfer such that it performs
better than manual packing/unpacking and contiguous communication by the user.
MPI datatypes can describe any layout in memory. The predefined types correspond-
ing to basic language types, such as MPI FLOAT (C float) or MPI COMPLEX (Fortran
COMPLEX), provide the starting points. New datatypes can be constructed from old ones
using a variety of layouts. For example, a call of the form
MPI_Type_vector(count, blocklength, stride, oldtype, &newtype)
creates a new datatype where there are count blocks, each consisting of blocklength
contiguous copies of oldtype, with each block separated by a distance of stride,
where stride is in units of the size of the oldtype. This provides a convenient (and,
with a good MPI implementation, more efficient [139]) way to describe data that is sepa-
rated by a regular stride.
Other MPI routines can be used to describe more general data layouts. MPI Type -
indexed is a generalization of MPI Type vector where each block can have a differ-
ent number of copies of the oldtype and a different displacement. MPI Type create -
struct is the most general datatype constructor, which further generalizes MPI Type -
indexed allowing each block to consist of replications of different datatypes. Conve-
nience functions for describing layouts for subarrays and distributed arrays are also pro-
vided; these functions are particularly useful when using the parallel file I/O interface in
MPI. All these functions can be called recursively to build datatypes describing any arbi-
trary layout.
Figure 1.3 shows an example of sending a column of a two-dimensional array in C. The
column is noncontiguous because C stores arrays in row-major order. By using a vec-
tor datatype to define the memory layout, one can send the entire column with a single
MPI send function. Note that it is necessary to call MPI Type commit before a derived
datatype can be used in a communication function. This function provides the implemen-
tation an opportunity to analyze the datatype and store a compact representation in order to
optimize the communication of noncontiguous data during the subsequent communication
function.
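As a concrete illustration of what Figure 1.3 describes, the following sketch builds a vector datatype for one column of a C matrix and sends that column with a single call; the array dimensions, element type, column index, destination rank, and tag are all assumptions made for this example:

#define NROWS 100
#define NCOLS 200
double a[NROWS][NCOLS];
MPI_Datatype coltype;

/* One column = NROWS blocks of 1 element, successive blocks NCOLS elements apart */
MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &coltype);
MPI_Type_commit(&coltype);
MPI_Send(&a[0][3], 1, coltype, 1, 0, MPI_COMM_WORLD);   /* send column 3 to rank 1 */
MPI_Type_free(&coltype);

Now consider what happens when two processes exchange very large messages and each calls a blocking send before posting its receive. A sketch of such an exchange follows (the buffer names and the count of one hundred million integers are taken from the discussion below; rank is the calling process's rank as before):

#define N 100000000                     /* one hundred million elements */
int *buf  = malloc(N * sizeof(int));    /* data to send */
int *rbuf = malloc(N * sizeof(int));    /* space for the incoming data */
/* Both processes send first, then receive */
MPI_Send(buf,  N, MPI_INT, 1 - rank, 0, MPI_COMM_WORLD);
MPI_Recv(rbuf, N, MPI_INT, 1 - rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);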
For the MPI Send to complete, the data in buf, all one hundred million words of it,
will need to be copied into some buffer somewhere (most likely either at the source or
destination process). If this memory is not available, then the program will wait within the
MPI Send call for that memory to become available, e.g., the receive buffer that would
be available when the destination process calls a receive routine such as MPI Recv. That
will not happen in the example above because both processes call MPI Send first, and so
the program might hang forever. Such programs are called unsafe because their correct
execution depends on how much memory is available and used for buffering messages for
which there is no matching receive at the time of the send.
Figure 1.4: Fixing deadlock by having some processes receive while others send. This
method requires knowledge of the communication pattern at compile time.
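A sketch of the ordering that Figure 1.4 refers to, reusing the names from the exchange above (illustrative only):

if (rank == 0) {
    MPI_Send(buf,  N, MPI_INT, 1, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, N, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else if (rank == 1) {
    MPI_Recv(rbuf, N, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(buf,  N, MPI_INT, 0, 0, MPI_COMM_WORLD);
}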
The classic fix for this is to order the sends and receives so that no process has to wait,
and is shown in Figure 1.4. However, this approach only works if the communication
pattern is simple and known at compile time. More complex patterns can be handled at
run time, but such code rapidly becomes too complex to maintain. The alternative is to
allow the MPI operations to return before the communication is complete. In MPI, such
routines are called nonblocking and are often (but not always) named with an “I” before
the operation. For example, the nonblocking version of MPI Send is MPI Isend. The
parameters for these routines are very similar to those for the blocking versions. The one
change is an additional output parameter of type MPI Request. This is a handle (in the
MPI sense) that can be used to query about the status of the operation and to wait for its
completion. Figure 1.5 shows the use of nonblocking send and receive routines. Note that
the user is required to call a test or wait function to complete the nonblocking operation.
MPI Wait will block until the operation completes. A nonblocking alternative is to use the
function MPI Test, which returns immediately and indicates whether the operation has
completed or not. Variants of test and wait operations are available for checking completion
of multiple requests at a time, such as MPI Waitall in Figure 1.5.
In many applications, the same communication pattern is executed repeatedly. For this
case, MPI provides a version of nonblocking operation that uses persistent requests; that is,
requests that persist and may be used repeatedly to initiate communication. These are cre-
ated with routines such as MPI Send init (the persistent counterpart to MPI Isend)
and MPI Recv init (the persistent counterpart to MPI Irecv). In the persistent case,
the request is first created but the communication is not yet started. Starting the com-
munication is accomplished with MPI Start (and MPI Startall for a collection of
requests).
int msg[MAX_MSG_SIZE];
MPI_Request *r;
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
r = (MPI_Request *) malloc((size-1) * sizeof(MPI_Request));
if (!r) . . . error
for (i=1; i<size; i++)
MPI_Isend(msg, msgsize, MPI_INT, i, 0, MPI_COMM_WORLD,
&r[i-1]);
. . . Could perform some work
MPI_Waitall(size-1, r, MPI_STATUSES_IGNORE);
free(r);
} else {
MPI_Request rr;
MPI_Irecv(msg, msgsize, MPI_INT, 0, 0, MPI_COMM_WORLD, &rr);
. . . perform other work
MPI_Wait(&rr, MPI_STATUS_IGNORE);
doWork(msg);
}
Figure 1.5: An alternative to the code in Figure 1.2 that permits the overlapping of com-
munication and computation
Once that communication has completed, for example, after MPI Wait returns
on that request, it may be started again by calling MPI Start. When the request is no
longer needed, it is freed by calling MPI Request free.
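A sketch of this persistent pattern, assuming a destination rank dest, an iteration count niters, and the msg and msgsize names used earlier:

MPI_Request req;
MPI_Send_init(msg, msgsize, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);
for (i = 0; i < niters; i++) {
    /* ... fill msg for this iteration ... */
    MPI_Start(&req);
    /* ... work that does not touch msg may overlap here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
MPI_Request_free(&req);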
In addition to the point-to-point communication functions that exchange data between pairs
of processes, MPI provides a large set of functions that perform communication among
a group of processes. This type of communication is called collective communication.
All processes in the communicator passed to a collective communication function must
call the function, and the function is said to be “collective over the communicator.” Col-
lective communication functions are widely used in parallel programming because they
represent commonly needed communication patterns and a vast amount of research in effi-
cient collective communication algorithms has resulted in high-performance implementa-
tions [73, 270, 284].
Collective communication functions are of three types:
1. Synchronization. MPI Barrier synchronizes all the processes in a communicator: no
process returns from the call until all processes in that communicator have entered it.
2. Data Movement. MPI has a large set of functions to perform commonly needed col-
lective data movement operations. For example, MPI Bcast sends data from one
process (the root) to all other processes in the communicator. MPI Scatter sends
different parts of a buffer from a root process to other processes. MPI Gather does
the reverse of scatter: it collects data from other processes to a buffer at the root.
MPI Allgather is similar to gather except that all processes get the result, not
just the root. MPI Alltoall does the most general form of collective commu-
nication in which each process sends a different data item to every other process.
Figure 1.6 illustrates these collective communication operations. Variants of these
basic operations also exist that allow users to communicate unequal amounts of data.
3. Collective Computation. MPI also provides reduction and scan operations that per-
form arithmetic operations on data, such as minimum, maximum, sum, product, and
logical OR, and also user-defined operations. MPI Reduce performs a reduction
operation and returns the result at the root process; MPI Allreduce returns the
result to all processes. MPI Scan performs a scan (or parallel prefix) operation
in which the reduction result at a process is the result of the operations performed
on data items contributed by the process itself and all processes ranked less than it.
MPI Exscan does an exclusive scan in which the result does not include the con-
tribution from the calling process. MPI Reduce scatter combines reduce and
scatter.
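For instance, a parallel prefix sum over the ranks could be written as follows (a small sketch; the value each process contributes is illustrative):

int myval = rank + 1, prefix;
MPI_Scan(&myval, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* On the process with rank i, prefix now holds 1 + 2 + ... + (i+1) */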
int result[MAX_MSG_SIZE];
...
MPI_Bcast(msg, msgsize, MPI_INT, 0, MPI_COMM_WORLD);
if (rank != 0) {
doWork(msg, result);
}
MPI_Reduce((rank == 0) ? MPI_IN_PLACE : result, result, msgsize, MPI_INT,
MPI_SUM, 0, MPI_COMM_WORLD);
Figure 1.7: An alternative (and more efficient) way than in Figure 1.2 for sending data
from one process to all others. It also includes an MPI Reduce call to accumulate the
results on one process (the process with rank zero in MPI COMM WORLD).
Both blocking and nonblocking versions of all collective communication functions are
available. The functions mentioned above are blocking functions, i.e., they return only
after the operation has locally completed at the calling process. Their nonblocking ver-
sions have an “I” in their name, e.g., MPI Ibcast or MPI Ireduce. These functions
initiate the collective operation and return an MPI Request object, similar to point-to-
point functions. The operations must be completed by calling a test or wait function.
Nonblocking collectives provide much needed support for overlapping collective commu-
nication with computation. MPI even provides a nonblocking version of barrier, called
MPI Ibarrier.
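For example, a broadcast can be overlapped with independent computation as in the following sketch (buf, count, and the overlapped work are assumptions):

MPI_Request req;
MPI_Ibcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD, &req);
/* ... computation that does not depend on buf ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);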
Figure 1.7 shows the use of MPI Bcast to send the same data from one process (the
“root” process) to all other processes in the same communicator. It also shows the use of
MPI Reduce with the sum operation (given by the MPI predefined operation, MPI SUM)
to add all the results from each process and store them on the specified root process, which
in this case is the process with rank zero.
MPI one-sided communication (also called remote memory access, or RMA) has three
components. First are routines to create a window of memory on each process that other
processes are allowed to access. Second are routines
for moving data between processes; these include routines to put, get, and update data in
a memory window on a remote process. Third are routines to ensure completion of the
one-sided operations. Each of these components has a unique MPI flavor.
MPI provides four routines for creating memory windows, each addressing a specific
application need. Three of the routines, MPI Win create, MPI Win allocate, and
MPI Win allocate shared, specify the memory that can be read or changed by an-
other process. These routines, unlike the ones in many other one-sided programming mod-
els, are collective. The fourth routine, MPI Win create dynamic, allows memory to
be attached (and detached) by individual processes independently by calling additional
routines.
The communication routines are simple (at least in concept) and are in the three cate-
gories of put, get, and update. The simplest routines (and the only ones in the original
MPI-2 RMA) are one to put data to a remote process (MPI Put), one to get data from a
remote process (MPI Get), and one to update data in a remote process by using one of
the predefined MPI reduction operations (MPI Accumulate). MPI-3 added additional
communication routines in each of these categories. Each of these routines is nonblocking
in the MPI sense, which permits the MPI implementation great flexibility in implementing
these operations.
The fact that these operations are nonblocking emphasizes that a third set of functions
are needed. These functions define when the one-sided communications complete. Any
one-sided model must address this issue. In MPI, there are three ways to complete one-
sided operations. The simplest is a collective MPI Win fence. This function completes
all operations that originated at the calling process as well as those that targeted the calling
process. When the “fence” exits, the calling process knows that all the operations started on
this process have completed (thus, the data buffers provided can now be reused or changed)
and that all operations targeting this process have also completed (thus, any access to the
local memory that was defined for the MPI Win has completed, and the local process
may freely access or update that memory). This description simplifies the actual situation
somewhat; the reader is encouraged to consult the MPI standard for the precise details.
In particular, MPI Win fence really separates groups of RMA operations; thus a fence
both precedes and follows the use of the MPI RMA communication calls.
Figure 1.8 shows the use of one-sided communication to implement an alternative to the
use of MPI Reduce in Figure 1.7. Note that this is not an exact replacement. By using
one-sided operations, we can update the result asynchronously; in fact, a single process
could update the result multiple times. Conversely, there are potential scalability issues
with this approach, and it is used as an example only.
MPI provides two additional methods for completing one-sided operations. One can
be thought of as a generalization of MPI Win fence. This is called the scalable syn-
chronization method and uses four routines: MPI Win post, MPI Win start, MPI Win complete,
and MPI Win wait.
MPI_Win_allocate((rank == 0) ? msgsize * sizeof(int) : 0, sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &result, &win);
MPI_Win_fence(0, win);
...
doWork(msg, myresult);
MPI_Accumulate(myresult, msgsize, MPI_INT, 0,
0, msgsize, MPI_INT, MPI_SUM, win);
...
MPI_Win_fence(0, win);
. . . process 0 has data in result
MPI_Win_free(&win);
Figure 1.8: Using MPI one-sided operations to accumulate (reduce) a result at a single
process.
As input, these routines take the MPI window and
a group of processes that specify either the processes that are the targets or the origins of
a one-sided communication operation. This method is considered scalable because it does
not require barrier synchronization across all processes in the communicator used to create
the window. Because both this scalable synchronization approach and the fence approach
require that synchronization routines be called by both origin and target processes, these
synchronization methods are called active target synchronization.
A third form of one-sided synchronization requires only calls at the origin process. This
form is called passive target synchronization. The use of this form is illustrated in Fig-
ure 1.9. Note that the process with rank zero does not need to call any MPI routines
within the while loop to cause RMA operations that target process zero to complete. The
MPI Win lock and MPI Win unlock routines ensure that the MPI RMA operations
(MPI Accumulate in this example) complete at the origin (calling) process and that the
data is deposited at the target process. The routines MPI Win lockall and MPI Win -
unlockall permit one process to perform RMA communication to all other processes
in the window object. In addition, MPI Win flush may be used within the passive tar-
get synchronization to complete the RMA operations issued so far to a particular target;
there are additional routines to complete only locally and to complete RMA operations to
all targets. There are also versions of put, get, and accumulate operations that return an
MPI Request object; the user can use any of the MPI Test or MPI Wait functions
to check for local completion, without having to wait until the next RMA synchronization
call.
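Figure 1.9 uses this passive target form; the origin-side portion looks roughly like the following sketch, which reuses the myresult, msgsize, and win names from Figure 1.8 (illustrative only):

if (rank != 0) {
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Accumulate(myresult, msgsize, MPI_INT, 0,
                   0, msgsize, MPI_INT, MPI_SUM, win);
    MPI_Win_unlock(0, win);
}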
MPI_Win_allocate((rank == 0) ? msgsize * sizeof(int) : 0, sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &result, &win);
Figure 1.9: Using MPI one-sided operations to accumulate (reduce) a result at a single
process, using passive target synchronization.
Also note that process zero does need to call MPI Win lock before accessing the
result buffer that it used to create the MPI window. This ensures that any pending
memory operations have completed at the target process. This is a subtle aspect of shared
and remote memory programming models that is often misunderstood by programmers
(see [45] for some examples of common errors in using shared memory). MPI defines a
memory model for the one-sided operations that ensures that users will obtain consistent
and correct results, even on systems without fully cache-coherent memory (at the time
of MPI-2’s definition, the fastest machines in the world had this feature, and systems in
the future again may not be fully cache coherent). While the standard is careful to de-
scribe the minimum requirements for correctly using one-sided operations, it also provides
slightly more restrictive yet simpler rules that are sufficient for programmers on most sys-
tems. MPI-3 introduced a new “unified memory model” in addition to the existing memory
model, which is now called “separate memory model.” The user can query (via MPI -
Win get attr) whether the implementation supports a unified memory model (e.g., on
a cache-coherent system), and if so, the memory consistency semantics that the user must
follow are greatly simplified.
MPI-3 significantly extended the one-sided communication interface defined in MPI-2 in
order to fix some of the limitations of the MPI-2 interface and to enable MPI RMA to be more
broadly usable in libraries and applications, while also supporting portability and high
performance. For example, new functions have been added to support atomic read-modify-
write operations, such as fetch-and-add (MPI Fetch and op) and compare-and-swap
(MPI Compare and swap), which are essential in many parallel algorithms. Another
new feature is the ability to create a window of shared memory (where shared memory is
available, such as within a single node) that can be used for direct load/store accesses in
addition to RMA operations. If you considered the MPI-2 RMA programming features
and found them wanting, you should look at the new features in MPI-3.
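For example, a shared counter exposed at displacement zero on process zero can be incremented atomically with a sketch like the following (the window win and the counter's location are assumptions):

int one = 1, oldval;
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
MPI_Fetch_and_op(&one, &oldval, MPI_INT, 0, 0, MPI_SUM, win);
MPI_Win_unlock(0, win);
/* oldval holds the counter's value before this increment */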
Many parallel scientific applications need to read or write large amounts of data from or
to files for a number of reasons such as reading input meshes, checkpoint/restart, data
analysis, and visualization. If file I/O is not performed efficiently, it is often the bottleneck
in such applications. MPI provides an interface for parallel file I/O that enables application
and library writers to express the “big picture” of the I/O access pattern concisely and
thereby enables MPI implementations to optimize file access.
The MPI interface for I/O retains the look and feel of MPI and also supports the common
operations in POSIX file I/O such as open, close, seek, read, and write. In addition, it
supports many advanced features such as the ability to express noncontiguous accesses in
memory and in the file using MPI derived datatypes, collective I/O functions, and passing
performance-related hints to the MPI implementation.
Let us consider a simple example where each process needs to read data from a different
location in a shared file in parallel as shown in Figure 1.10. There are many ways of
doing this using MPI. The simplest way is by using independent file I/O functions and
individual file pointers, as shown in Figure 1.11. Each process opens the file by using
MPI File open, which is collective over the communicator passed as the first argument
to the function, in this case MPI COMM WORLD. The second parameter is the name of
the file being opened, which could include a directory path. The third parameter is the
mode in which the file is being opened. The fourth parameter can be used to pass hints
to the implementation by attaching key-value pairs to an MPI Info object. Example
hints include parameters for file striping, sizes of internal buffers used by MPI for I/O
optimizations, etc. In the simple example in Figure 1.11, we pass MPI INFO NULL so
that default values are used. The last parameter is the file handle returned by MPI (of type
MPI File), which is used in future operations on the file.
Each process then calls MPI File seek to move the file pointer to the offset corre-
sponding to the first byte it needs to read. This is called the individual file pointer since it
is local to each process. (MPI also has another file pointer, called the shared file pointer,
that is shared among processes and requires a separate set of functions to access and use.)
Figure 1.10: Each process needs to read a chunk of data from a common file
MPI_File fh;
...
rc = MPI_File_open(MPI_COMM_WORLD, "myfile.dat", MPI_MODE_RDONLY,
MPI_INFO_NULL, &fh);
rc = MPI_File_seek(fh, rank*bufsize*sizeof(int), MPI_SEEK_SET);
rc = MPI_File_read(fh, msg, msgsize, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
Figure 1.11: Reading data with independent I/O functions and individual file pointers
Data is read by each process using MPI File read, which reads msgsize integers into
the memory buffer from the current location of the file pointer. MPI File close closes
the file. Note that this method of doing file I/O is very similar to the way one would do it
with POSIX I/O functions.
A second way of reading the same data is to avoid using file pointers and instead specify
the starting offset in the file directly to the read function. This can be done by using
the function MPI File read at, which takes an additional “offset” parameter. MPI -
File seek does not need to be called in this case. This function also provides a thread-
safe way to access the file, since it does not require a notion of “current” position in the
file.
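Using the same names as Figure 1.11, the seek-and-read pair can be replaced by a single call (a sketch):

rc = MPI_File_read_at(fh, rank*bufsize*sizeof(int), msg, msgsize,
                      MPI_INT, MPI_STATUS_IGNORE);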
MPI File read and MPI File read at are called independent I/O functions be-
cause they have no collective semantics. Each process calls them independently; there
is no requirement that if one process calls them, then all processes must call them. In
other words, an MPI implementation does not know how many processes may call these
functions and hence cannot perform any optimizations across processes.
MPI also provides collective versions of all read and write functions. These functions
have an all in their name, e.g., MPI File read all and MPI File read at all.
They have the same syntax as their independent counterparts, but they have collective se-
mantics; i.e., they must be called on all processes in the communicator with which the file
was opened.
MPI_File fh;
...
rc = MPI_File_open(MPI_COMM_WORLD, "myfile.dat", MPI_MODE_RDONLY,
MPI_INFO_NULL, &fh);
rc = MPI_File_set_view(fh, rank*bufsize, MPI_INT, MPI_INT,
"native", MPI_INFO_NULL);
rc = MPI_File_read_all(fh, msg, msgsize, MPI_INT,
MPI_STATUS_IGNORE);
MPI_File_close(&fh);
Figure 1.12: Reading data in parallel, with each process receiving a different part of the
input file
With this guarantee, an MPI implementation has the opportunity to optimize
the accesses based on the combined request of all processes, an optimization known as col-
lective I/O [268, 269]. In general, it is recommended to use collective I/O over independent
I/O whenever possible.
A third way of reading the data in Figure 1.10 is to use the notion of “file views” defined
in MPI, as shown in Figure 1.12. MPI File set view is used to set the file view,
whereby a process can specify its view of the file, i.e., which parts of the file it intends
to read/write and which parts it wants to skip. The file view is specified as a triplet of
displacement, etype, and filetype: displacement is the offset to be skipped from the start of
the file (such as a header), etype is the elementary type describing the basic unit of data
access, and filetype is an MPI type constructed out of etypes. The file view consists of
the layout described by a repeated tiling of filetypes starting at an offset of “displacement”
from the start of the file.
In Figure 1.12, each process specifies the displacement as its rank × msgsize,
etype as MPI INT, and filetype also as MPI INT. The next parameter specifies the data
representation in the file; “native” means the data representation is the same as in memory.
The last parameter can be used to pass hints. We could use either independent or collective
read functions to read the data; we choose to use the collective function MPI File -
read all. Each process reads msgsize integers into the memory buffer from the file
view defined for that process. Since each process has a different displacement in the file
view, offset by its rank, it reads a different portion of the file.
MPI’s I/O functionality is quite sophisticated, particularly for cases where I/O accesses
from individual processes are not contiguous in the file, such as when accessing subarrays
and distributed arrays. In such cases, MPI-I/O can provide very large performance benefits
over using POSIX I/O directly; in some cases, it is over 1,000 times as fast. We refer the
reader to [127] for a more detailed discussion of MPI’s I/O capabilities.
#include "mpi.h"
#include <stdio.h>
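/* A possible MPI_Bcast wrapper of the kind the caption describes (a sketch):
   time a barrier to estimate how long this process waits for the others to
   arrive, then perform the actual broadcast through the PMPI entry point.
   The accumulated time is reported by the MPI_Finalize wrapper below. */
static double syncTime = 0.0;

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root,
              MPI_Comm comm)
{
    double t = MPI_Wtime();
    PMPI_Barrier(comm);
    syncTime += MPI_Wtime() - t;
    return PMPI_Bcast(buf, count, datatype, root, comm);
}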
int MPI_Finalize(void)
{
printf("Synchronization time in MPI_Bcast was %.2e seconds\n",
syncTime); fflush(stdout);
return PMPI_Finalize();
}
Figure 1.13: Example use of the profiling interface to record an estimate of the amount of
time that an MPI Bcast is waiting for all processes to enter the MPI Bcast call.
MPI includes a rich set of features intended to support developing and using large-scale
software. One innovative feature (now available in some other tools) is a set of alternate
entry points for each routine that makes it easy to interpose special code between any
MPI routine. For each MPI routine, there is another entry point that uses PMPI as the
prefix. This is known as the MPI profiling interface. For example, PMPI Bcast is the
profiling entry point for MPI Bcast. The PMPI version of the routine performs exactly
the same operations as the MPI version. The one difference is that the user may define
their own version of any MPI routine but not of the PMPI routines. An example is shown
in Figure 1.13. Linking the object file created from this file with a program that includes
calls to MPI Bcast will create a program that will print out the amount of time spent
waiting for all the processes to call MPI Bcast.
To enable users to write hybrid MPI and threaded programs, MPI also precisely specifies
the interaction between MPI calls and threads. MPI supports four “levels” of thread-safety
that a user must explicitly select:
MPI THREAD SINGLE: Only one thread will execute in the process.
MPI THREAD FUNNELED: A process may be multithreaded, but only the thread that
initialized MPI can make MPI calls.
MPI THREAD SERIALIZED: A process may be multithreaded, but only one thread can
make MPI calls at a time.
MPI THREAD MULTIPLE: A process may be multithreaded and multiple threads can
call MPI functions simultaneously.
The user must call the function MPI Init thread to indicate the level of thread-safety
desired, and the MPI implementation will return the level it supports. It is the user’s respon-
sibility to meet the restrictions of the level supported. An implementation is not required
to support a level higher than MPI THREAD SINGLE, but a fully thread-safe implemen-
tation will support MPI THREAD MULTIPLE. MPI specifies thread safety in this manner
so that the implementation does not need to support more than what the user needs and
unnecessarily incur the potential performance penalties.
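For example (a sketch; the fallback action is application specific):

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    /* restrict MPI calls to a single thread, or abort */
}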
MPI also enables an application to spawn additional processes (by using MPI Comm -
spawn or MPI Comm spawn multiple) and separately started MPI applications to
connect with each other and communicate (by using MPI Comm connect and MPI -
Comm accept or MPI Join).
MPI provides neighborhood collective operations (MPI Neighbor allgather and
MPI Neighbor alltoall and their variants) that define collective operations among
a process and its neighbors as defined by a cartesian or graph virtual process topology in
MPI. These functions are useful, for example, in stencil computations that require nearest-
neighbor exchanges. They also represent sparse all-to-many communication concisely,
which is essential when running on many thousands of processes.
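A sketch of how such a topology might be set up and used for a simple exchange; the process count size, the value myval, and the neighbor buffer are illustrative:

int dims[2] = {0, 0}, periods[2] = {0, 0};
MPI_Comm cart;
double myval, nbrvals[4];            /* one slot per Cartesian neighbor (2*ndims) */
MPI_Dims_create(size, 2, dims);
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
MPI_Neighbor_allgather(&myval, 1, MPI_DOUBLE, nbrvals, 1, MPI_DOUBLE, cart);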
MPI also has functionality to expose some internal aspects of an MPI implementa-
tion that may be useful for libraries. These features include functions to decode de-
rived datatypes and functions to associate arbitrary nonblocking operations with an MPI -
Request (known as generalized requests). New in MPI-3 is a facility, known as the
MPI T interface, that provides access to internal variables that either control the operation
of the MPI implementation or expose performance information.
Like any programming approach, making effective use of MPI requires using it as it was
intended, taking into account the strengths and weaknesses of the approach. Perhaps the
most important consideration is that MPI is a library. This means that any MPI operation
requires one or more function calls and might not be the most efficient for very short data
transfers where even function-call overheads matter. Therefore, wherever possible, com-
munication should be aggregated so as to move as much data in one MPI call as possible.
MPI contains features to support the construction of software libraries. These features
should be used. For example, rather than adding MPI calls throughout your application,
it is often better to define important abstractions and implement them in MPI. Most of the
application code then makes use of these abstractions, which permits the code to be cleaner
and simplifies the process of tuning the use of MPI. This is the approach used in
several important computational libraries and frameworks, such as PETSc [229, 26] and
Trilinos [137]. In these libraries, MPI calls rarely, if ever, appear in the user’s code.
Locality at all levels is important for performance, and MPI, because it is based on
processes, helps users maintain locality. In fact, this feature is sometimes considered both
a strength and weakness of MPI: a strength because requiring users to plan for and respect
locality helps develop efficient programs; a weakness because users must take locality
into account. We note that locality at other levels of the memory hierarchy, particularly
between cache and main memory, is also necessary (possibly more so) for achieving high
performance.
Programs often do not behave as expected, and having tools to investigate the behavior,
both in terms of correctness and performance, is essential. The MPI profiling interface has
provided an excellent interface for tool development, and a number of tools are available
that can help in the visualization of MPI program behavior [72, 145, 253, 300]. The pro-
filing interface can also be used by end users [285], and good program design will take
advantage of this feature.
There are many good practices to follow when using MPI. Some of the most important
are the following.
1. Avoid assumptions about buffering (see the discussion above on safe programs)
5. For I/O, use collective I/O where possible. Pay attention to any performance lim-
itations of your file system (some have extreme penalties for accesses that are not
aligned on disk block boundaries).
6. MPI Barrier is rarely required and usually reduces performance. See [252] for an
automated way to detect “functionally irrelevant barriers.” Though there are a few
exceptions, most uses of MPI Barrier are, at best, sloppy programming and, at
worst, incorrect because they assume that MPI Barrier has some side effects. A
correct MPI program will rarely need MPI Barrier. (We mention this because the
analysis of many programs reveals that MPI Barrier is one of the most common
MPI collective routines even though it is not necessary.)
1.11 Summary
MPI has been an outstanding success. At this writing, the MPI specification is over 21 years
old and continues to be the dominant programming system for highly parallel applications
in computational science.
Why has MPI been so successful? In brief, it is because MPI provides a robust
solution for parallel programming that allows users to achieve their goals. A thorough ex-
amination of the reasons for MPI’s success may be found in [129]. The open process by
which MPI was defined also contributed to its success; MPI avoided errors committed in
some other, less open, programming system designs. Another contributor to the success of
MPI is its deliberate support for “programming in the large”—for the creation of software
modules that operate in parallel. A number of libraries have been built on top of MPI,
permitting application developers to write programs at a high level and still achieve perfor-
mance. Several of these have won the Gordon Bell prize for outstanding achievements in
high-performance computing and/or R&D 100 awards [4, 9, 12, 24, 137].
As the number of cores continues to grow in the most powerful systems, one frequent
question is “Can MPI scale to millions of processes?” The answer is yes, though it will re-
quire careful implementation of the MPI library [23]. It is also likely that for such systems,
MPI will be combined with another approach, exploiting MPI’s thread-safe design. Users
are already combining MPI with OpenMP. Using OpenMP or another node programming
language, combined with MPI, would allow the use of MPI with millions of MPI processes,
with each process using thousands of cores (e.g., via threads).
There is a rich research literature on the use and implementation of MPI, including the
annual EuroMPI meeting. A tutorial introduction to MPI is available in [127, 128]. The
official version of the MPI standard [202] is freely available at www.mpi-forum.org.
This chapter provided an introduction to MPI but could not cover the richness of MPI.
The references above, as well as your favorite search engine, can help you discover the full
power of the MPI programming model.
2 Global Address Space Networking
In 2002 a team of researchers at the University of California Berkeley and Lawrence Berke-
ley National Laboratory began work on a compiler for the Unified Parallel C (UPC) lan-
guage (see Chapter 4). A portion of that team had also worked on the compiler and run-
time library for Titanium [277], a parallel dialect of Java. This motivated the design of a
language-independent library to support the network communication needs of both UPC
and Titanium, with the intent to be applicable to an even wider range of global address
space language and library implementations. The result of those efforts is the Global
Address Space Networking library, known more commonly as simply “GASNet” (pro-
nounced just as written: “gas net”). GASNet has language bindings only for C, but is
“safe” to use from C++ as well.
At the time of this writing, the current revision of the GASNet specification is v1.8. The
most current revision can always be found on the GASNet project webpage [174].
Since its inception, GASNet has become the networking layer for numerous global ad-
dress space language implementations. In addition to the Berkeley UPC compiler [173],
the Open Source UPC compilers from Intrepid Technology [146] (GUPC) and the Univer-
sity of Houston [278] (OpenUH) use GASNet. Rice University chose GASNet for both
their original Co-Array Fortran (CAF) and CAF-2.0 compilers [239]. Cray’s UPC and
CAF compilers [96] use GASNet for the Cray XT series, and Cray Chapel (see Chapter 6)
uses GASNet on multiple platforms. The OpenSHMEM (see Chapter 3) reference im-
plementation from the University of Houston and Oak Ridge National Laboratory is also
implemented over GASNet. In addition to these languages and libraries, some of which are
described in later chapters of this book, GASNet has been used in numerous other research
projects.
GASNet is designed as a low-level networking layer for implementing global address space
languages and for use by expert programmers who are authoring parallel runtime libraries. Where
performance and ease-of-use conflict, the design favors choices that will achieve high per-
formance. One consequence of this is that GASNet's API specifies the "interfaces" or "calls"
which GASNet implements, but does not require that any of these be implemented as a
function. Therefore, in many cases a GASNet call may be implemented as a C preproces-
sor macro (especially when there is a simple mapping from a GASNet interface to a call in
the vendor-provided network API).
GASNet is also designed with wide portability in mind and one consequence is that
the capabilities expressed directly in GASNet’s interfaces are those one should be able to
implement efficiently on nearly any platform. At the time of this writing, GASNet has
“native” implementations—in terms of the network APIs—of all of the common cluster
interconnects, and those of the currently available supercomputers from IBM and Cray.
Also at the time of this writing, porting of GASNet to the largest systems in Japan and
China is known to be in progress by researchers affiliated with those systems.
The remainder of this section will introduce the terminology used in the GASNet spec-
ification and in this chapter, and provide an overview of the functionality found in GAS-
Net. Later sections expand in more detail upon this overview, provide usage examples and
describe some of the plans for GASNet’s future.
2.2.1 Terminology
Below are several terms that are used extensively in the GASNet specification and in this
chapter. Before reading further, familiarize yourself with these terms or bookmark this
page for easy reference.
2.2.2 Threading
GASNet is intended to be “thread neutral” and allow the client to use threads as it sees
fit. By default GASNet will build three variants of the library for each supported network
to support different client threading models. These are known as the “seq”, “parsync”
and “par” builds, where the names correspond both to the library file name and to the
preprocessor tokens GASNET SEQ, GASNET PARSYNC and GASNET PAR. Exactly one
of these three must be defined by the client when it includes gasnet.h and the library
to which it is linked must correspond to the correct preprocessor token (there is some
name-shifting under the covers to catch mismatches). The three models
are:
• GASNET SEQ
In this mode the client is permitted to make GASNet calls from only a single thread
in each process. There is no restriction on how many threads the client may use, but
exactly one of them must be used to make all GASNet calls.
• GASNET PARSYNC
In this mode at most one thread may make GASNet calls concurrently. Multiple
threads may make GASNet calls with appropriate mutual exclusion. GASNet does
not provide the mechanism for such mutual exclusion, which is the client’s respon-
sibility.
• GASNET PAR
This is the most general mode, allowing multiple client threads to enter GASNet
concurrently.
When using SEQ or PARSYNC modes, the restriction on the client’s calls to GASNet is
only a restriction on the client. It is legal, even in a SEQ build, for GASNet to use threads
internally, and these internal threads may be used to execute the client’s AM handlers. For
this reason the client code must make proper use of GASNet’s mechanisms for concurrency
control (described in Section 2.3.4) regardless of the threading mode.
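For example, a client intended for the fully multithreaded model might select it as follows; this is a sketch, and the matching "par" library variant must also be linked:

/* Select the threading mode before including the GASNet header. */
#define GASNET_PAR 1
#include <gasnet.h>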
2.2.3 Active Messages
GASNet's flexibility in implementing parallel language runtimes comes from the inclusion
of a remote procedure call mechanism based on Berkeley Active Messages [183]. While
GASNet’s Active Message (just “AM” from here on) interfaces are significantly reduced
relative to the Berkeley AM design, they provide the caller with significant flexibility sub-
ject to constraints that allow an implementation to guarantee deadlock freedom while using
bounded resources. Briefly, the idea is that a client may send an AM Request to a node
(including itself) which results in running code (a “handler”) that was registered by the call
to gasnet_attach. The handler receives a small number of integer arguments provided
by the AMRequest call, plus an optional payload which may either be buffered by the im-
plementation or delivered to a location given by the client. The handler is permitted to call only a subset of the GASNet Core API (and none of the Extended API); the only communication permitted is at most one AMReply to the requesting node. A significant portion of
this chapter will be devoted to showing how to use GASNet’s AMs.
2.3 Core API
The GASNet Core API contains everything one needs to write AM-based code.2 This section picks up from the brief introduction given in Section 2.2.3 to provide some detail on the Core API. This information will be put into practice in several examples in Section 2.6.
2. It is also the minimum one must port to a new platform, because there is a reference implementation of everything in the Extended API in terms of the Core.
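For context, the job-control portion of the Core API referred to below consists, in rough outline, of the following calls; the prototypes are given here as best recalled from the GASNet-1 specification, which should be consulted for the authoritative versions:

int           gasnet_init(int *argc, char ***argv);
int           gasnet_attach(gasnet_handlerentry_t *table, int numentries,
                            uintptr_t segsize, uintptr_t minheapoffset);
uintptr_t     gasnet_getMaxLocalSegmentSize(void);
uintptr_t     gasnet_getMaxGlobalSegmentSize(void);
char         *gasnet_getenv(const char *name);
gasnet_node_t gasnet_nodes(void);
gasnet_node_t gasnet_mynode(void);
void          gasnet_exit(int exitcode);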
Any GASNet client will begin with a call to gasnet_init, which takes pointers to the standard argc and argv parameters to main(). The job environment prior to calling gasnet_init is largely unspecified (see the specification for details), and the user is strongly encouraged not to do much, if anything, before this call. However, after the call to gasnet_init, the command line will have been cleansed of any arguments used internally by GASNet, and environment variables will be accessible using gasnet_getenv. Additionally, the job's stdout and stderr will be set up by this init call. GASNet does not make any guarantees about stdin.
Only after gasnet_init returns do the other calls in the list above become legal. The next step in initialization of a GASNet job is a call to gasnet_attach to allocate the GASNet segment and reserve any network resources required for the job. The arguments to gasnet_attach give the client's table of AM handlers and the client's segment requirements:
• gasnet_handlerentry_t *table
A pointer to the client's table of AM handler registrations, each entry of which has the type:
typedef struct {
gasnet_handler_t index;
void (*fnptr)();
} gasnet_handlerentry_t;
The fnptr is the function to be invoked as the AM handler at the respective integer
index. The signature of AM handlers will be covered in Section 2.3.5. Values for
index of 128–255 are available to the client, while the special value of 0 indicates
“don’t care” and will be overwritten by a unique value by gasnet_attach.
• int numentries
The number of entries in the handler entry table.
• uintptr_t segsize
The requested size of the GASNet segment.
Must be a multiple of GASNET_PAGESIZE, and no larger than the value returned by gasnet_getMaxLocalSegmentSize (see below).
Ignored for GASNET_SEGMENT_EVERYTHING.
• uintptr_t minheapoffset
The requested minimum distance between GASNet's segment and the current top of the heap.3 On systems where the layout in virtual memory forces GASNet's segment and the heap to compete for space, this ensures that at least this amount of space will be left for heap allocation after allocation of the segment. While not recommended, it is legal to pass zero. The value passed by all nodes must be equal.
Ignored for GASNET_SEGMENT_EVERYTHING.
3. The range of memory used to satisfy calls to the malloc family of functions.
There are two calls to determine what segment size one may request in the attach call. The function gasnet_getMaxLocalSegmentSize returns the maximum amount of memory that GASNet has determined is available for the segment on the calling node, while gasnet_getMaxGlobalSegmentSize returns the minimum of all the “local” values. Keep in mind that on many platforms the GASNet segment and the malloc heap must compete for the same space, meaning that the values returned by these SegmentSize queries should be treated as an upper bound on the sum of segsize and minheapoffset. A client that
finds the available segment size too small for its requirements may call gasnet_exit to terminate the job rather than calling gasnet_attach.
In addition to the two segment size query calls and access to environment variables using gasnet_getenv, clients may call gasnet_nodes to query the number of GASNet nodes in the job, and gasnet_mynode to determine the caller's rank within the job (ranks start from zero). The calls listed above are the only ones permitted between gasnet_init and gasnet_attach. The two segment size query calls are unique in that they are only legal between gasnet_init and gasnet_attach.
After gasnet_attach comes the client's "real" code using the interfaces described in the sections that follow. When all the real work is done, gasnet_exit is the mechanism for reliable job termination. The call to gasnet_exit takes an exit code as its only argument and does not return to the caller. GASNet makes a strong effort to ensure that if any node provides a nonzero exit code, the job as a whole (spawned by some platform-specific mechanism) will also return a nonzero code. It also tries to preserve the actual value when possible.
A call to gasnet_exit by a single node is sufficient to cause the entire parallel job to terminate. Any node which does not call gasnet_exit at the same time as one or more others will receive a SIGQUIT signal if possible.4 This is the only signal for which a client may portably register a signal handler, because GASNet reserves all others for internal use. To avoid unintentionally triggering this mechanism, a client performing a "normal" exit should perform a barrier (see Section 2.3.3) immediately before the call to gasnet_exit.
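Putting the startup and shutdown calls together, a minimal client might look like the following sketch. It registers no AM handlers, requests half of the reported global maximum segment size, and assumes the GASNET_SEQ threading mode; a real client would size the segment and minheapoffset to its own needs.

#define GASNET_SEQ 1
#include <gasnet.h>
#include <stdio.h>

int main(int argc, char **argv) {
  if (gasnet_init(&argc, &argv) != GASNET_OK) return 1;

  /* Request half of the global maximum, rounded down to a page multiple. */
  uintptr_t segsz = gasnet_getMaxGlobalSegmentSize() / 2;
  segsz -= segsz % GASNET_PAGESIZE;

  if (gasnet_attach(NULL, 0, segsz, GASNET_PAGESIZE) != GASNET_OK)
    gasnet_exit(1);

  printf("hello from node %d of %d\n",
         (int)gasnet_mynode(), (int)gasnet_nodes());

  /* Barrier first so that no node appears to exit unexpectedly early. */
  gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
  gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
  gasnet_exit(0);
  return 0;
}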
typedef struct {
void *addr;
uintptr_t size;
} gasnet_seginfo_t;
int gasnet_getSegmentInfo(gasnet_seginfo_t *seginfo_table,
int numentries);
This call populates the lesser of numentries or gasnet_nodes() entries of type gasnet_seginfo_t in the client-owned memory at seginfo_table, and returns an error code on failure (see Section 2.3.8). The ith entry in the array gives the address and size of the segment on node i. When conditions permit, GASNet favors assigning segments with the same base address on all nodes. If an implementation can guarantee that this property is always satisfied, then the preprocessor token GASNET_ALIGNED_SEGMENTS is defined to 1.
In the GASNET_SEGMENT_EVERYTHING configuration, the segment is all of virtual memory. In this configuration, the addr fields will always be zero and the size will always be (uintptr_t)(-1).
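For example, a client could capture the segment table once after attach and use it to name remote memory; peer is a hypothetical node rank in this sketch.

#include <stdlib.h>

/* Cache the segment table; entry i describes the segment on node i. */
static gasnet_seginfo_t *segtab;

void cache_segments(void) {
  segtab = malloc(gasnet_nodes() * sizeof(gasnet_seginfo_t));
  gasnet_getSegmentInfo(segtab, gasnet_nodes());
}

/* Base address and size of a peer's segment, usable as RMA targets. */
void  *peer_base(gasnet_node_t peer) { return segtab[peer].addr; }
size_t peer_size(gasnet_node_t peer) { return segtab[peer].size; }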
2.3.3 Barriers
The next set of Core API calls to describe are those for performing a barrier:
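In the GASNet-1 specification these take approximately the following form (consult the specification for the authoritative prototypes):

void gasnet_barrier_notify(int id, int flags);
int  gasnet_barrier_wait(int id, int flags);
int  gasnet_barrier_try(int id, int flags);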
Unlike many barrier implementations, the one in GASNet is “split-phase” and supports
optional id matching.
The “split-phase” nature of GASNet's barrier is evident in the specification's description of gasnet_barrier_wait, which states “This is a blocking operation that returns only after all remote nodes have called gasnet_barrier_notify().” In simple terms, imagine that “notify” increments an arrival counter and that “wait” blocks until that counter equals the job size.5 The call gasnet_barrier_try checks the same condition, but returns immediately with the value GASNET_ERR_NOT_READY if the condition is not yet satisfied. Regardless of whether one uses “wait” or “try” to complete the barrier, it is legal to perform most GASNet operations between the initiation and completion.
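For instance, a client can initiate the barrier, perform unrelated local work, and only then wait for completion. A sketch, using the anonymous form described below; do_independent_local_work is a hypothetical client function:

gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
do_independent_local_work();   /* overlapped with the in-flight barrier */
gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);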
The id and flags arguments to the barrier functions implement optional matching at
the barriers. This feature is best understood by a careful reading of the specification, but
two common use cases are easy to understand:
• Anonymous barrier
The simplest case is when one does not wish to use the id matching support. In this
case, the constant GASNET_BARRIERFLAG_ANONYMOUS is passed for the flags
argument to the barrier functions. Any value can be passed as the id (though 0 is
most common), since it will be ignored.
• Named barrier
The simplest case that makes use of the id matching logic is a blocking (as opposed
to split-phase) barrier with an integer argument that is expected to be equal across
all callers:
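A sketch, assuming every node passes the same iteration counter:

/* Named, blocking barrier: a mismatched id is reported as an error. */
gasnet_barrier_notify(iteration, 0);
if (gasnet_barrier_wait(iteration, 0) != GASNET_OK) {
  /* the ids did not match across nodes (or another error occurred) */
}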
GASNet’s split-phase barrier comes with some usage restrictions which might not ini-
tially be obvious. Here we will consider a successful “try” equivalent to a “wait” to keep
the descriptions brief. The first restriction is the most intuitive: one must alternate between
“notify” and “wait” to ensure that barrier operations do not overlap one another. The sec-
ond is that in a GASNET_PARSYNC or GASNET_PAR build, the “notify” and “wait” should
only be performed once per node (the client is free to choose which thread does the work,
and need not pick the same thread for the two phases). The third is a potentially nonob-
vious consequence of the first two: in a GASNET_PAR build the client has the burden of
ensuring that at most one client thread is in any barrier call at any given instant.
2.3.4 Concurrency Control
For node-local mutual exclusion, GASNet provides handler-safe locks (HSLs). Other than minor details given in the GASNet specification, the HSL constants and functions are equivalent to the analogous constants and functions for pthread_mutex_t. Like the POSIX threads analogues, these can be used to prevent concurrent access to data structures or regions of code. A general tutorial on the use of a mutex is outside the scope of this chapter. Note that these are node-local mutexes, and GASNet does not provide mechanisms for cross-node mutual exclusion. However, the example in Section 2.6.5 shows how one can use AMs to implement a well-known shared memory algorithm for mutual exclusion.
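In rough outline, an HSL is declared and used much like a POSIX mutex; the names below follow the GASNet-1 specification, which should be consulted for the full set of calls:

gasnet_hsl_t table_lock = GASNET_HSL_INITIALIZER;

void update_shared_table(void) {
  gasnet_hsl_lock(&table_lock);
  /* ... touch state that AM handlers may also touch ... */
  gasnet_hsl_unlock(&table_lock);
}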
In addition to the previously introduced idea of internal threads for executing the client’s
AM handlers, the GASNet specification allows for the possibility of interrupt-driven im-
plementations. Though at the time of this writing there are no such implementations, we
will introduce this concept briefly:
void gasnet_hold_interrupts();
void gasnet_resume_interrupts();
These two calls, used in pairs, delimit sections of code which may not be interrupted by
execution of AM handlers on the calling thread. This is different from use of an HSL to
prevent multiple threads from concurrently accessing given code or data. The intended use
of no-interrupt sections is to protect client code which is nonreentrant and can potentially
be reached from both handler and nonhandler code in the client. No-interrupt sections are
seldom necessary for two key reasons: 1) holding an HSL implicitly enters a no-interrupt
section; 2) AM handlers run in implicit no-interrupt sections. Note that these calls do not
nest and the client is therefore responsible for managing no-interrupt sections when nesting
might occur dynamically.
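A sketch of the intended pattern, where update_nonreentrant_state is a hypothetical client routine reachable from both handler and nonhandler code:

gasnet_hold_interrupts();      /* no AM handlers will interrupt this thread here */
update_nonreentrant_state();
gasnet_resume_interrupts();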
It is worth noting that in a GASNET_SEQ build the mutex calls may compile away to “nothing” if and only if the GASNet implementation is using neither threads nor interrupts internally to execute the client's AM handlers.
2.3.5 Active Message Interfaces
As introduced in Section 2.2.3, an Active Message causes a function to run on a target node. Functions to be run are known as AM “handlers” and are named by an index of type gasnet_handler_t, where the mapping between these indices and actual functions was established by the handler table passed to (and possibly modified by) the call to gasnet_attach.
Arguments to AM handlers are 32-bit integers.6 There are implementation-dependent
limits on the argument count, which can be queried at runtime:
size_t gasnet_AMMaxArgs(void);
The value must be at least 8 on 32-bit platforms and at least 16 on 64-bit platforms. This
ensures a client can always pass at least 8 pointer-sized values to a handler.
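Because the arguments are 32-bit values, a common idiom is to split a pointer across two arguments. A hypothetical pair of helpers (not part of GASNet) might look like:

/* Pass a pointer as two 32-bit handler arguments (64-bit platforms). */
#define PTR_HI(p)        ((gasnet_handlerarg_t)((uint64_t)(uintptr_t)(p) >> 32))
#define PTR_LO(p)        ((gasnet_handlerarg_t)((uintptr_t)(p) & 0xffffffffu))
#define PTR_FROM(hi, lo) ((void *)(uintptr_t)(((uint64_t)(uint32_t)(hi) << 32) | (uint32_t)(lo)))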
In addition to the arguments, there is an optional payload. There are three “categories” of AMs depending on the treatment of the payload:
• Short AMs have no payload.
• Medium AMs carry a payload that is buffered by the implementation; the handler receives the address and length of a temporary buffer holding the payload.
• Long AMs carry a payload that is placed at an address on the target node that is provided by the initiating node. This address must lie in the GASNet segment.
The handler signature for each category is shown below.
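In the GASNet-1 specification, the handler signatures take approximately the following form; the trailing “...” stands for the 32-bit handler arguments:

/* Short AM handler: the token plus the 32-bit arguments only. */
void handler(gasnet_token_t token, gasnet_handlerarg_t arg0, ...);
/* Medium AM handler: (buf, nbytes) is a temporary buffer holding the payload. */
void handler(gasnet_token_t token, void *buf, size_t nbytes, gasnet_handlerarg_t arg0, ...);
/* Long AM handler: buf is the in-segment destination chosen by the initiator. */
void handler(gasnet_token_t token, void *buf, size_t nbytes, gasnet_handlerarg_t arg0, ...);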
6. GASNet's gasnet_handlerarg_t type is always equivalent to uint32_t, but GASNet supports C89 compilers which may not have uint32_t.
In all three handler signatures above, the “...” denotes up to gasnet_AMMaxArgs()
additional arguments. Since the Medium and Long AM handler signatures are identical, it
is permissible to use the same handler for either category of AM.
Payload size is subject to an implementation-dependent limit, which can be queried at run-
time:
size_t gasnet_AMMaxMedium(void);
size_t gasnet_AMMaxLongRequest(void);
size_t gasnet_AMMaxLongReply(void);
The GASNet specification requires that all implementations support payloads of at least 512 bytes, and typical values are much higher for platforms with RMA support in hardware. It is important to note the distinction between the Request and Reply limits for Long AMs.7
7. This difference arises from the fact that a Reply can only be initiated within a request handler, and it may not be possible in this context to allocate resources for a large Reply. Therefore, when the limits for LongRequest and LongReply differ, the Request value will be the larger of the two.
To invoke an AM handler on a target node one issues an AM request using one of the
following:
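In the GASNet-1 specification the request families take approximately the following form (consult the specification for the authoritative prototypes):

int gasnet_AMRequestShort[N](gasnet_node_t dest, gasnet_handler_t handler, ...);
int gasnet_AMRequestMedium[N](gasnet_node_t dest, gasnet_handler_t handler,
                              void *source_addr, size_t nbytes, ...);
int gasnet_AMRequestLong[N](gasnet_node_t dest, gasnet_handler_t handler,
                            void *source_addr, size_t nbytes, void *dest_addr, ...);
int gasnet_AMRequestLongAsync[N](gasnet_node_t dest, gasnet_handler_t handler,
                                 void *source_addr, size_t nbytes, void *dest_addr, ...);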
Each of the above prototypes represents an entire family of calls, where the “[N]” above
is replaced with the values from 0 through gasnet_AMMaxArgs(). As before, the “...”
denotes the placement of the 32-bit arguments to pass to the handler. For the Medium and
Long requests, the calls return as soon as the payload memory is safe to reuse (also known
as “local completion”). The implementation is not required to make a copy of the payload,
and thus these calls may block temporarily until the network can send the payload. While
blocked, AMs sent to the calling node by others may be executed. The LongAsync request
differs from the Long case in that it returns without waiting for local completion (though
it may still block waiting for resources). The client must not modify the payload until
the corresponding AM Reply handler begins running; that is the only indication of local completion. This is a difficult semantic to apply correctly, but it can be powerful.
When an AM handler runs, code is executed in an environment known as “handler con-
text” in which several restrictions apply. These restrictions will be enumerated later, but
at this point we focus on the one that for many is the defining feature of Berkeley AM,
and thus of GASNet. This is the “at most one reply” rule which states that 1) the only
communication permitted in the handler for an AM Request is one optional Reply to the
node initiating the Request; and 2) no communication is permitted in the handler for an
AM Reply. An AM Reply is sent with one of the following:8
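In the GASNet-1 specification the reply families take approximately the following form:

int gasnet_AMReplyShort[N](gasnet_token_t token, gasnet_handler_t handler, ...);
int gasnet_AMReplyMedium[N](gasnet_token_t token, gasnet_handler_t handler,
                            void *source_addr, size_t nbytes, ...);
int gasnet_AMReplyLong[N](gasnet_token_t token, gasnet_handler_t handler,
                          void *source_addr, size_t nbytes, void *dest_addr, ...);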
The use of “...”, again, denotes the 32-bit handler arguments, and [N] indicates that these three prototypes are templates for instances taking from 0 to gasnet_AMMaxArgs() arguments.
Other than the names, the key difference between the calls to send an AM Reply versus those for a Request is the type of the first argument: gasnet_token_t. This type was first seen, without any explanation, when the prototypes for the three categories of AM handlers were given. It is an opaque type that contains (at least) the source node of an AM. Since there is no way to construct an object of this type, the only way to invoke an AMReply function is using the token received as an argument to the Request handler. For situations where one does need to know the source node for an AM (either Request or Reply), one can query:
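Per the GASNet-1 specification, the query has approximately the following prototype:

int gasnet_AMGetMsgSource(gasnet_token_t token, gasnet_node_t *srcindex);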
This call can be made only from the handler context and the only valid value for the
token argument is the one received as an argument to the handler function.
8. Note that the lack of an AMReplyLongAsync is a consequence of the fact that the “at most one reply” rule
prevents the AM Reply handler from issuing any communication that would serve to indicate the local completion.
int gasnet_AMPoll(void);
#define GASNET_BLOCKUNTIL(condition) ...
The call gasnet_AMPoll checks for incoming AMs (both Requests and Replies) and will execute some implementation-dependent maximum number of them before returning. Thus, there is no guarantee that at the time this call returns there are no additional AMs waiting. This call is typically used in the client's own progress loop, or before and after client operations that are known not to poll for long periods of time. The macro GASNET_BLOCKUNTIL is used to block until a condition becomes true. It takes as an argument a C expression to evaluate, and GASNet executes code functionally equivalent to:
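Roughly speaking, although an implementation is free to block more efficiently than a pure spin loop:

while (!(condition)) gasnet_AMPoll();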
9. If it helps to understand this rule, try to visualize an implementation using a condition variable which is broadcast by GASNet each time a handler execution completes.
The restrictions that apply when executing in handler context include the following:
• No handler may call the GASNet barrier functions, initiate AM Requests or call any portion of the Extended API (these involve prohibited communication).
• A handler may block temporarily in a call to obtain an HSL, but must release any
held HSL before returning.
• The GASNet implementation is not required to ensure AMs are executed in order
and client code must be constructed to be deadlock-free in the presence of reordered
messages.10
• Client code must be written in a thread-safe manner (through the proper use of HSL)
in recognition that even with single-threaded clients, GASNet may run AM handlers
asynchronously.
• The expression passed to GASNET_BLOCKUNTIL is not an exception to the previous rule: the client must account for the possibility that the expression will be evaluated concurrently with execution of AM handlers.
10. GASNet will not drop or replay AMs. So the client may assume “exactly once” delivery.
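To make these rules concrete, the following sketch shows a complete request/reply exchange. The handler indices and function names are hypothetical, and the two handlers are assumed to have been registered in the table passed to gasnet_attach.

#define HIDX_PING_REQ 201   /* hypothetical client-chosen handler indices */
#define HIDX_PING_REP 202

static volatile int reply_value = -1;

void ping_request_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
  /* The only communication permitted here: at most one Reply to the requester. */
  gasnet_AMReplyShort1(token, HIDX_PING_REP, arg0 + 1);
}

void ping_reply_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
  reply_value = (int)arg0;   /* no communication is permitted in a Reply handler */
}

void ping(gasnet_node_t peer) {
  reply_value = -1;
  gasnet_AMRequestShort1(peer, HIDX_PING_REQ, 42);
  /* Poll until the reply handler has stored its argument. */
  GASNET_BLOCKUNTIL(reply_value != -1);
}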
2.3.8 Error Codes
Most GASNet calls return an integer status code to indicate success or failure. Among the possible values:
• GASNET_OK
Guaranteed to be zero, this value indicates success.
Additionally, one can convert the numerical error code into its string name (prefixed with GASNET_ERR_), or into an English-language description of the error value, using the following:
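These have approximately the following prototypes in the GASNet-1 specification:

const char *gasnet_ErrorName(int errval);   /* e.g. "GASNET_ERR_BAD_ARG" */
const char *gasnet_ErrorDesc(int errval);   /* a short English description */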
The Extended API provides a rich set of interfaces for remote memory access (Puts and Gets) with a variety of semantics intended to ease automatic code generation, especially from source-to-source translation of partitioned global address space (PGAS) languages. At this time GASNet provides standardized RMA interfaces only for Put and Get of contiguous regions (but see Section 2.7 for information on proposed “Vector-Index-Strided” interfaces).
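As a sketch of the contiguous interfaces, a blocking Put followed by a blocking Get might look like the following; peer and peer_base are the hypothetical rank and helper from the segment-info example above:

double buf[16];
/* ... fill buf ... */
gasnet_put(peer, peer_base(peer), buf, sizeof(buf));   /* write into the peer's segment */
gasnet_get(buf, peer, peer_base(peer), sizeof(buf));   /* read it back */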
GASNet supports three segment configurations, chosen when the GASNet library is built from source. The default configuration is known as GASNET_SEGMENT_FAST, or just SEGMENT_FAST for short. In this configuration the implementation provides the fastest (lowest latency and/or highest bandwidth) access possible, even if this results in making trade-offs which significantly reduce the size of the segment. The second option, SEGMENT_LARGE, supports the largest contiguous segment possible (within reason) even when this support may require “bounce buffers” or other mechanisms that reduce the speed of remote accesses. The final option is GASNET_SEGMENT_EVERYTHING, in which the entire virtual address space is considered “in-segment.”
11. The README for each conduit includes documentation on protocol switch points that control such behaviors,
and most offer environment variables to adjust them.