Series Foreword

The Scientific and Engineering Computation series from MIT Press presents accessible accounts of com-
puting research areas normally presented in research papers and specialized conferences.
Elements of modern computing that have appeared thus far in the series include paral-
lelism, language design and implementation, system software, and numerical libraries.
The scope of the series continues to expand with the spread of ideas from computing into
new aspects of science.
Programming models and the software systems that implement them are a crucial aspect
of all computing, since they provide the concrete mechanisms by which a programmer
prepares algorithms for execution on a computer and communicates these ideas to the ma-
chine. In the case of parallel systems, the complexity of the task has spurred innovative
research. This book collects in one place definitive expositions of a wide variety of pro-
gramming systems by highly regarded authors, including both language-based systems and
library-based systems. Some are heavily used and defined by standards bodies; others are
research projects with smaller user bases. All are currently being used in scientific and
engineering applications for parallel computers.

William Gropp and Ewing Lusk, Editors


Preface

A programming model can be thought of as the abstract machine for which a programmer
is writing instructions. Programming models typically are instantiated in languages and
libraries. Such models form a rich topic for computer science research because program-
mers prefer them to be productive (capable of expressing any abstract algorithm with ease),
portable (capable of being used on any computer architecture), performant (capable of de-
livering performance commensurate with that of the underlying hardware), and expressive
(capable of expressing a broad range of algorithms)—the four pillars of programming.
Achieving one or perhaps even two of these features simultaneously is relatively easy, but
achieving all of them is nearly impossible. This situation accounts for the great multiplicity
of programming models, each choosing a different set of compromises.
With the coming of the parallel computing era, computer science researchers have
shifted focus to designing programming models that are well suited for high-performance
parallel computing and supercomputing systems. Parallel programming models typically
include an execution model (what path the code execution takes) and a memory model
(how data moves in the system between computing nodes and in the memory hierarchy of
each computing node). Programming parallel systems is complicated by the fact that mul-
tiple processing units are simultaneously computing and moving data, thus often increasing
the nondeterminism in the execution in terms of both correctness and performance.
Also important is the distinction between programming models and programming sys-
tems. Technically speaking, the former refers to a style of programming—such as bulk
synchronous or implicit compiler-assisted parallelization—while the latter refers to actual
abstract interfaces that the user would program to. Over the years, however, the parallel
computing community has blurred this distinction; and in practice today a programming
model refers to both the style of programming and the abstract interfaces exposed by the
instantiation of the model.
Contrary to common belief, most parallel systems do not expose a single parallel pro-
gramming model to users. Different users prefer different levels of abstraction and differ-
ent sets of tradeoffs among the four pillars of programming. Broadly speaking, domain
scientists and those developing end applications often prefer a high-level programming
model that is biased toward higher productivity, even if it is specialized to a small class of
algorithms and lacks the expressivity required by other algorithms. On the other hand, de-
velopers of domain-specific languages and libraries might prefer a low-level programming
model that is biased toward performance, even if it is not as easy to use. Of course, these
are general statements, and exceptions exist on both sides.

About this book

This book provides an overview of some of the most prominent parallel programming
models used on high-performance computing and supercomputing systems today. The
book aims at covering a wide range of parallel programming models at various levels of
the productivity, portability, performance, and expressivity spectrum, allowing the reader
to learn about and understand what tradeoffs each of these models has to offer.
We begin in Chapter 1 with a discussion of the Message Passing Interface (MPI), which
is the most prominent parallel programming model for distributed-memory computing to-
day. This chapter provides an overview of the most commonly used capabilities of MPI,
leading up to the third major version of the standard—MPI-3.
In Chapters 2 to 5, we cover one-sided communication models, ranging from low-level
runtime libraries to high-level programming models. Chapter 2 covers GASNet (global ad-
dress space networking), a low-level programming model designed to serve as a common
portable runtime system for a number of partitioned global address space (PGAS) mod-
els. Chapter 3 discusses OpenSHMEM, a one-sided communication library designed to
directly expose native hardware communication capabilities to end users. OpenSHMEM
mimics many of the capabilities of PGAS models, but in library form as opposed to relying
on language extensions and compiler support for processing those extensions. Chapter 4
provides an overview of the Unified Parallel C (UPC) programming model. UPC is a C
language–based PGAS model that provides both language extensions and library inter-
faces for creating and managing global address space memory. Chapter 5 covers Global
Arrays (GA), another library-based one-sided communication model like OpenSHMEM
but providing a higher-level abstraction—based on multidimensional arrays—for users to
program with.
Chapter 6 discusses Chapel, a high-productivity programming model that allows appli-
cations to be expressed with both task and data parallelism. Chapel also has first-class
language concepts for expressing and reasoning about locality that are orthogonal to its
features for parallelism.
In Chapters 7 to 11, we present task-oriented programming models that allow users to
describe their computation and data units as tasks, allowing the runtime system to manage
the computation and data movement as necessary. Of these models, we first discuss the
Charm++ programming model in Chapter 7. Charm++ provides an abstract model that
relies on overdecomposition of work in order to dynamically manage the work across the
available computational units. Next we delve into the Asynchronous Dynamic Load Bal-
ancing (ADLB) library in Chapter 8, which provides another task-oriented approach for
sharing work based on an MPI-based low-level communication model. In Chapter 9 we
discuss the Scalable Collection of Task Objects (Scioto) programming model, which relies
on PGAS-like one-sided communication frameworks to achieve load balancing through
work stealing. Chapter 10 describes Swift, a high-level scripting language that allows
users to describe their computation in high-level semantics and internally translates it to
other task-oriented models such as ADLB. Chapter 11 describes Concurrent Collections
(CnC), a high-level declarative model that allows users to describe their applications as a
graph of kernels communicating with one another.
In the final collection of chapters, Chapters 12 to 16, we present parallel programming
models intended for on-node parallelism in the context of multicore architectures, attached
accelerators, or both. In this collection, we first discuss OpenMP in Chapter 12. OpenMP
is the most prominent on-node parallel programming model for scientific computing today.
The chapter describes the evolution of OpenMP and its core set of features, leading up to
OpenMP 4.0. In Chapter 13 we discuss the Cilk Plus programming model, a parallel ex-
tension of C and C++ languages for exploiting regular and irregular parallelism on modern
shared-memory multicore machines. In Chapter 14 we discuss Intel Threading Building
Blocks (TBB), which, like Cilk Plus, aims at providing parallelism on shared-memory
multicore architectures but using a library based on C++ template classes. The Compute
Unified Device Architecture (CUDA) programming model from NVIDIA is discussed in
Chapter 15. CUDA provides parallelism based on single instruction, multiple thread blocks
suitable for NVIDIA graphics processing units. Although CUDA is a proprietary program-
ming model restricted to NVIDIA devices, the broad interest in the community and the
wide range of applications using it make it a worthy programming model for inclusion in
this book. In Chapter 16, we describe the Open Computing Language (OpenCL) model,
which provides a low-level, vendor-independent programming model to program various
heterogeneous architectures, including graphics processing units.
This book describes the various programming models at a level that is difficult to find
elsewhere. Specifically, the chapters present material in a tutorial fashion, rather than the
more formal approach found in research publications. Nor is this book a reference man-
ual aimed at comprehensively describing all of the syntax and semantics defined by each
programming model. Rather, the goal of this book is to describe the general approaches to
parallel programming taken by each of the presented models and what they aim to achieve.
Nevertheless, the chapters provide some syntactic and semantic definitions of a core set of
interfaces they offer. These definitions are meant to be examples of the abstractions offered
by the programming model. They are provided in order to improve the readability of the
chapter. They are not meant to be taken as the most important or even the most commonly
used interfaces, but just as examples of how one would use that programming model.

Acknowledgments

I first thank all the authors who contributed the various chapters to this book:

William D. Gropp, University of Illinois, Urbana-Champaign
Rajeev Thakur, Argonne National Laboratory
Paul Hargrove, Lawrence Berkeley National Laboratory
Jeffery A. Kuehn, Oak Ridge National Laboratory
Stephen W. Poole, Oak Ridge National Laboratory
Kathy Yelick, University of California, Berkeley, and Lawrence Berkeley National Labo-
ratory
Yili Zheng, Lawrence Berkeley National Laboratory
Sriram Krishnamoorthy, Pacific Northwest National Laboratory
Jeff Daily, Pacific Northwest National Laboratory
Abhinav Vishnu, Pacific Northwest National Laboratory
Bruce Palmer, Pacific Northwest National Laboratory
Bradford L. Chamberlain, Cray Inc.
Laxmikant Kale, University of Illinois, Urbana-Champaign
Nikhil Jain, University of Illinois, Urbana-Champaign
Jonathan Lifflander, University of Illinois, Urbana-Champaign
Ewing Lusk, Argonne National Laboratory
Ralph Butler, Middle Tennessee State University
Steven C. Pieper, Argonne National Laboratory
James Dinan, Intel
Timothy Armstrong, The University of Chicago
Justin M. Wozniak, Argonne National Laboratory and the University of Chicago
Michael Wilde, Argonne National Laboratory and the University of Chicago
Ian T. Foster, Argonne National Laboratory and the University of Chicago
Kath Knobe, Rice University
Michael G. Burke, Rice University
Frank Schlimbach, Intel
Barbara Chapman, University of Houston
Deepak Eachempati, University of Houston
Sunita Chandrasekaran, University of Houston
Arch D. Robison, Intel
Charles E. Leiserson, MIT
Alexey Kukanov, Intel
Wen-mei Hwu, University of Illinois, Urbana-Champaign
David Kirk, NVIDIA
Tim Mattson, Intel

Special thanks to Ewing Lusk and William Gropp for their contributions to the book as
a whole and for improving the prose substantially.

I also thank Gail Pieper, technical writer in the Mathematics and Computer Science
Division at Argonne National Laboratory, for her indispensable guidance in matters of
style and usage, which vastly improved the readability of the prose.
Programming Models for Parallel Computing
1 Message Passing Interface
William D. Gropp, University of Illinois, Urbana-Champaign
Rajeev Thakur, Argonne National Laboratory

1.1 Introduction

MPI is a standard, portable interface for communication in parallel programs that use a
distributed-memory programming model. It provides a rich set of features for expressing
the communication commonly needed in parallel programs and also includes additional
features such as support for parallel file I/O. It supports the MPMD (multiple program,
multiple data) programming model. It is a library-based system, not a compiler or lan-
guage specification. MPI functions can be called from multiple languages—it has official
bindings for C and Fortran. MPI itself refers to the definition of the interface specifica-
tion (the function names, arguments, and semantics), not any particular implementation of
those functions. MPI was defined by an organization known as the MPI Forum, a broadly
based group of experts and users from industry, academia, and research laboratories. Many
high-performance implementations of the MPI specification are available (both free and
commercial) for all platforms (laptops, desktops, servers, clusters and supercomputers of
all sizes) and all architectures and operating systems. As a result, it is possible to write par-
allel applications that can be run portably on any platform, while at the same time achieving
good performance. This feature has contributed to MPI becoming the most widely used
programming system for parallel scientific applications.

MPI Background
The effort to define a single, standard interface for message passing began in 1992. It
was motivated by the presence of too many different, nonportable APIs—both vendor
supported (e.g., Intel NX [232], IBM EUI [119], Thinking Machines CMMD [272],
nCUBE [207]) and research libraries (e.g., PVM [121], p4 [51], Chameleon [130], Zip-
code [254]). Applications written to any one of these APIs either could not be run on
different machines or would not run efficiently. If any HPC vendor went out of business,
applications written to that vendor’s API could not be run elsewhere. It was recognized
that this multiplicity of APIs was hampering progress in application development, and the
need for a single, standard interface defined with broad input from everyone was evident.
The first version of the MPI specification (MPI-1) was released in 1994, and it covered
basic message-passing features, such as point-to-point communication, collective commu-
nication, datatypes, and nonblocking communication. In 1997, the MPI Forum released the
second major version of MPI (MPI-2), which extended the basic message-passing model
to include features such as one-sided communication, parallel I/O, and dynamic processes.
The third major release of MPI (MPI-3) was in 2012, and it included new features such as
nonblocking collectives, neighborhood collectives, a tools information interface, and significant
extensions to the one-sided communication interface. We discuss many of these
features in this chapter.

1.2 MPI Basics

The core of MPI is communication between processes, following the communicating se-
quential processes (CSP) model. Each process executes in its own address space. Declared
variables (e.g., int b[10];) are private to each process; the b in one process is dis-
tinct from the b in another process. In MPI, there are two major types of communication:
communication between two processes, called point-to-point communication, and commu-
nication among a group of processes, called collective communication.
Each MPI process is a member of a group of processes and is identified by its rank
in that group. Ranks start from zero, so in a group of four processes, the processes are
numbered 0, 1, 2, 3. All communication in MPI is made with respect to a communicator.
This object contains the group of processes and a (hidden) communication context. The
communication context ensures that library software written using MPI can guarantee that
messages remain within the library, which is a critical feature that enables MPI applica-
tions to use third-party libraries. The communicator object is a handle, which is just a way
to say “opaque type.” In C, the communicator handle is of type MPI Comm; in Fortran,
it is of type TYPE(MPI Comm) (for the Fortran 2008 interface) or INTEGER (for earlier
versions of Fortran). When an MPI program begins, there are two predefined communica-
tors: MPI COMM WORLD and MPI COMM SELF. The former contains all the processes in
the MPI execution; the latter just the process running that instance of the program. MPI
provides routines to discover the rank of a process in a communicator (MPI Comm rank),
to discover the number of processes in a communicator (MPI Comm size), and to create
new communicators from old ones.
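As a small illustration of the last point (a sketch, not one of the book's listings; the split into groups of four is arbitrary), MPI Comm split creates one new communicator per distinct color value:

int wrank, color;
MPI_Comm subcomm;

MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
color = wrank / 4;   /* processes that pass the same color end up in the same new communicator */
MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &subcomm);
/* within subcomm, ranks are ordered by the key argument (here the world rank) */
MPI_Comm_free(&subcomm);
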
A complete, but very basic, MPI program is shown in Figure 1.1. This program shows
the use of MPI Init to initialize MPI and MPI Finalize to finalize MPI. With a
few exceptions, all other MPI routines must be called after MPI Init (or MPI Init thread)
and before MPI Finalize. The odd arguments to MPI Init were intended
to support systems where the command-line arguments in a parallel program were not
available from main but had to be provided to the processes by MPI.
Figure 1.1 illustrates another feature of MPI: anything that is not an MPI call executes
independently. In this case, the printfs will execute in an arbitrary order; it is not even
required that the output appear one line at a time.
The examples presented are in C, but it is important to note that MPI is defined by a
language-neutral specification and a set of language bindings to that specification.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
int wrank, srank, wsize;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
MPI_Comm_size(MPI_COMM_WORLD, &wsize);
MPI_Comm_rank(MPI_COMM_SELF, &srank);
printf("World rank %d, world size %d, self rank %d\n",
wrank, wsize, srank);
MPI_Finalize();
return 0;
}

Figure 1.1: A complete MPI program that illustrates communicator rank and size.

MPI currently supports language bindings in C and Fortran. (C++ programs can use the C
bindings.)

1.3 Point-to-Point Communication

The most basic communication in MPI is between pairs of processes. One process sends
data, and the other receives it. The sending process must specify the data to send, the pro-
cess to which that data is to be sent, and a communicator. In addition, following a feature
from some of the earliest message-passing systems, each message also has a message tag,
which is a single nonnegative integer. The receiving process must specify where the data
is to be received, the process that is the source of the data, the communicator, and the
message tag. In addition, it may provide a parameter in which MPI will return information
about the received message; this is called the message status.
Early message-passing systems, and most I/O libraries, specify data buffers as a tuple
containing the address and the number of bytes. MPI generalizes this as a triple: address,
datatype, and count (the number of datatype elements). In the simplest case, the datatype
corresponds to the basic language types. For example, to specify a buffer of 10 ints in C,
MPI uses (address, 10, MPI INT). This approach allows easier handling of data
for the programmer, who does not need to know or discover the number of bytes in each
basic type. It also allows the MPI library to perform data conversions if the MPI program
is running on a mix of hardware that has different data representations (this was more
important when MPI was created than it is today). In addition, as described in Section 1.4,
it allows the specification of data buffers that are not contiguous in memory.

int msg[MAX_MSG_SIZE];
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
for (i=1; i<size; i++)
MPI_Send(msg, msgsize, MPI_INT, i, 0, MPI_COMM_WORLD);
} else {
MPI_Recv(msg, MAX_MSG_SIZE, MPI_INT, 0, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
doWork(msg);
}

Figure 1.2: Example of the use of MPI to send data (in the integer array msg of length
msgsize from the process with rank zero to all other processes).

The process to which data is sent is specified by a rank in a communicator; that com-
municator also provides the communication context. The message tag is a nonnegative
integer; the maximum allowed value depends on the MPI implementation but must be at
least 32767. These are the arguments to MPI Send.
The arguments for receiving a message are similar. One difference is that the receive
function is allowed to specify a size larger than the data actually being sent. The tag and
source rank may be used to either specify an exact value (e.g., tag of 15 and source of 3)
or any value by using what are called wild card values (MPI ANY TAG and MPI ANY SOURCE,
respectively). The use of MPI ANY TAG allows the user to send one additional
integer item of data (as the tag value); the use of MPI ANY SOURCE allows the implemen-
tation of nondeterministic algorithms. A status argument provides access to the tag value
provided by the sender and the source rank of the sender. When this information is not
needed, the value MPI STATUS IGNORE may be used. Figure 1.2 shows a program that
sends the same data from the process with rank zero in MPI COMM WORLD to all other
processes.
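For example (an illustrative fragment, assuming MPI has been initialized and some process sends a message of at most 100 integers), a receiver can accept a message from any source with any tag and then inspect the status to learn what actually arrived:

int buf[100], count;
MPI_Status status;

MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &count);   /* number of integers actually received */
printf("Received %d ints from rank %d with tag %d\n",
       count, status.MPI_SOURCE, status.MPI_TAG);
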
MPI provides a number of different send modes; what has been described here is the
basic or standard send mode. Other modes include a synchronous send, ready send, and
buffered send. Nonblocking communication is described in Section 1.5.

1.4 Datatypes

As explained earlier, one of the special features of MPI is that all communication functions
take a “datatype” argument. The datatype is used to describe the type of data being sent
or received, instead of just bytes. For example, if communicating an array of integers,
the datatype argument would be set to MPI INT in C or MPI INTEGER in Fortran. One
purpose for such a datatype argument is to enable communication in a heterogeneous en-
vironment, e.g., communication between machines with different endianness or different
lengths of basic datatypes. When the datatype is specified as part of the communication,
the MPI implementation can internally perform the necessary byte transformations so that
the data makes sense at the destination. MPI supports all the language types specified in
C99 and Fortran 2008.
Another purpose of datatypes is to enable the user to describe noncontiguous layouts of
data in memory and communicate the data with a single MPI function call. MPI provides
several datatype constructor functions to describe noncontiguous data layouts. A high-
quality implementation can optimize the noncontiguous data transfer such that it performs
better than manual packing/unpacking and contiguous communication by the user.
MPI datatypes can describe any layout in memory. The predefined types correspond-
ing to basic language types, such as MPI FLOAT (C float) or MPI COMPLEX (Fortran
COMPLEX), provide the starting points. New datatypes can be constructed from old ones
using a variety of layouts. For example,

MPI_Type_vector(count, blocklength, stride, oldtype, newtype)

creates a new datatype where there are count blocks, each consisting of blocklength
contiguous copies of oldtype, with each block separated by a distance of stride,
where stride is in units of the size of the oldtype. This provides a convenient (and,
with a good MPI implementation, more efficient [139]) way to describe data that is sepa-
rated by a regular stride.
Other MPI routines can be used to describe more general data layouts. MPI Type indexed
is a generalization of MPI Type vector where each block can have a different
number of copies of the oldtype and a different displacement. MPI Type create struct
is the most general datatype constructor, which further generalizes MPI Type indexed,
allowing each block to consist of replications of different datatypes. Conve-
nience functions for describing layouts for subarrays and distributed arrays are also pro-
vided; these functions are particularly useful when using the parallel file I/O interface in
MPI. All these functions can be called recursively to build datatypes describing any arbi-
trary layout.
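As a sketch of one such constructor (not an example from the book; the matrix size N and the destination rank dest are illustrative), MPI Type indexed can describe the upper triangle of an N x N matrix, where each row contributes a block of a different length at a different displacement:

#define N 4
double a[N][N];
int blocklens[N], displs[N], i;
MPI_Datatype upper;

for (i = 0; i < N; i++) {
    blocklens[i] = N - i;       /* row i contributes N-i elements ...           */
    displs[i]    = i * N + i;   /* ... starting at the diagonal element a[i][i] */
}
MPI_Type_indexed(N, blocklens, displs, MPI_DOUBLE, &upper);
MPI_Type_commit(&upper);
MPI_Send(a, 1, upper, dest, 0, MPI_COMM_WORLD);   /* send the whole triangle in one call */
MPI_Type_free(&upper);
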
Figure 1.3 shows an example of sending a column of a two-dimensional array in C. The
column is noncontiguous because C stores arrays in row-major order. By using a vec-
tor datatype to define the memory layout, one can send the entire column with a single
MPI send function. Note that it is necessary to call MPI Type commit before a derived
datatype can be used in a communication function.

Figure 1.3: Using an MPI derived datatype to send a column of a 2D array in C

This function provides the implemen-
tation an opportunity to analyze the datatype and store a compact representation in order to
optimize the communication of noncontiguous data during the subsequent communication
function.
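A minimal sketch of the pattern Figure 1.3 illustrates (the array dimensions, the column index, and the destination rank below are our own choices, not the book's original listing):

#define NROWS 10
#define NCOLS 20
double a[NROWS][NCOLS];
MPI_Datatype coltype;

/* NROWS blocks of 1 element each, successive blocks NCOLS elements apart */
MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &coltype);
MPI_Type_commit(&coltype);
/* send column 2 of a to the process with rank 1 with a single call */
MPI_Send(&a[0][2], 1, coltype, 1, 0, MPI_COMM_WORLD);
MPI_Type_free(&coltype);
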

1.5 Nonblocking Communication

An important feature of MPI is the availability of nonblocking communication routines.
These routines are used to initiate a communication but not wait for it to complete. This
feature provides two benefits. First, it enables (but does not require) the MPI implemen-
tation to perform the communication asynchronously with other activities, such as com-
putations. Second, it permits the description of complex communication patterns with-
out requiring careful management of communication order and memory space. To under-
stand the second benefit, consider this communication between two processes, where
partner is the rank of the other process and both processes execute this code:

MPI_Send(buf, 100000000, MPI_INT, partner, 0, MPI_COMM_WORLD);


MPI_Recv(rbuf, 100000000, MPI_INT, partner, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);

For the MPI Send to complete, the data in buf, all one hundred million words of it,
will need to be copied into some buffer somewhere (most likely either at the source or
destination process). If this memory is not available, then the program will wait within the
MPI Send call for that memory to become available, e.g., the receive buffer that would
be available when the destination process calls a receive routine such as MPI Recv. That
will not happen in the example above because both processes call MPI Send first, and so
the program might hang forever. Such programs are called unsafe because their correct
execution depends on how much memory is available and used for buffering messages for
which there is no matching receive at the time of the send.

if (rank & 0x1) {
  /* Odd rank, do send first */
  MPI_Send( ... );
  MPI_Recv( ... );
}
else {
  /* Even rank, do receive first */
  MPI_Recv( ... );
  MPI_Send( ... );
}

Figure 1.4: Fixing deadlock by having some processes receive while others send. This
method requires knowledge of the communication pattern at compile time.

The classic fix for this is to order the sends and receives so that no process has to wait,
and is shown in Figure 1.4. However, this approach only works if the communication
pattern is simple and known at compile time. More complex patterns can be handled at
run time, but such code rapidly becomes too complex to maintain. The alternative is to
allow the MPI operations to return before the communication is complete. In MPI, such
routines are called nonblocking and are often (but not always) named with an “I” before
the operation. For example, the nonblocking version of MPI Send is MPI Isend. The
parameters for these routines are very similar to those for the blocking versions. The one
change is an additional output parameter of type MPI Request. This is a handle (in the
MPI sense) that can be used to query about the status of the operation and to wait for its
completion. Figure 1.5 shows the use of nonblocking send and receive routines. Note that
the user is required to call a test or wait function to complete the nonblocking operation.
MPI Wait will block until the operation completes. A nonblocking alternative is to use the
function MPI Test, which returns immediately and indicates whether the operation has
completed or not. Variants of test and wait operations are available for checking completion
of multiple requests at a time, such as MPI Waitall in Figure 1.5.
In many applications, the same communication pattern is executed repeatedly. For this
case, MPI provides a version of nonblocking operation that uses persistent requests; that is,
requests that persist and may be used repeatedly to initiate communication. These are cre-
ated with routines such as MPI Send init (the persistent counterpart to MPI Isend)
and MPI Recv init (the persistent counterpart to MPI Irecv). In the persistent case,
the request is first created but the communication is not yet started. Starting the com-
munication is accomplished with MPI Start (and MPI Startall for a collection of requests).

int msg[MAX_MSG_SIZE];
MPI_Request *r;
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
r = (MPI_Request *) malloc((size-1) * sizeof(MPI_Request));
if (!r) ... error
for (i=1; i<size; i++)
MPI_Isend(msg, msgsize, MPI_INT, i, 0, MPI_COMM_WORLD,
&r[i-1]);
... Could perform some work
MPI_Waitall(size-1, r, MPI_STATUSES_IGNORE);
free(r);
} else {
MPI_Request rr;
MPI_Irecv(msg, msgsize, MPI_INT, 0, 0, MPI_COMM_WORLD, &rr);
... perform other work
MPI_Wait(&rr, MPI_STATUS_IGNORE);
doWork(msg);
}

Figure 1.5: An alternative to the code in Figure 1.2 that permits the overlapping of com-
munication and computation

Once that communication has completed, for example, after MPI Wait returns
on that request, it may be started again by calling MPI Start. When the request is no
longer needed, it is freed by calling MPI Request free.
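A sketch of the persistent-request pattern (buf, count, partner, and niters are assumed to be set up as in the earlier examples):

MPI_Request req;
int iter;

/* create the request once; no communication is started yet */
MPI_Send_init(buf, count, MPI_INT, partner, 0, MPI_COMM_WORLD, &req);
for (iter = 0; iter < niters; iter++) {
    /* ... fill buf for this iteration ... */
    MPI_Start(&req);                    /* start the send */
    /* ... overlap other work here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete it; req remains reusable */
}
MPI_Request_free(&req);
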

1.6 Collective Communication

In addition to the point-to-point communication functions that exchange data between pairs
of processes, MPI provides a large set of functions that perform communication among
a group of processes. This type of communication is called collective communication.
All processes in the communicator passed to a collective communication function must
call the function, and the function is said to be “collective over the communicator.” Col-
lective communication functions are widely used in parallel programming because they
represent commonly needed communication patterns and a vast amount of research in effi-
cient collective communication algorithms has resulted in high-performance implementa-
tions [73, 270, 284].
Collective communication functions are of three types:

Figure 1.6: Some of the collective operations in MPI

1. Synchronization. MPI Barrier is a collective function that synchronizes all pro-
cesses in the communicator passed to the function. No process can return from the
barrier until all processes have reached the barrier.

2. Data Movement. MPI has a large set of functions to perform commonly needed col-
lective data movement operations. For example, MPI Bcast sends data from one
process (the root) to all other processes in the communicator. MPI Scatter sends
different parts of a buffer from a root process to other processes. MPI Gather does
the reverse of scatter: it collects data from other processes to a buffer at the root.
MPI Allgather is similar to gather except that all processes get the result, not
just the root. MPI Alltoall does the most general form of collective commu-
nication in which each process sends a different data item to every other process.
Figure 1.6 illustrates these collective communication operations. Variants of these
basic operations also exist that allow users to communicate unequal amounts of data.

3. Collective Computation. MPI also provides reduction and scan operations that per-
form arithmetic operations on data, such as minimum, maximum, sum, product, and
logical OR, and also user-defined operations. MPI Reduce performs a reduction
operation and returns the result at the root process; MPI Allreduce returns the
result to all processes. MPI Scan performs a scan (or parallel prefix) operation
in which the reduction result at a process is the result of the operations performed
on data items contributed by the process itself and all processes ranked less than it.
MPI Exscan does an exclusive scan in which the result does not include the con-
tribution from the calling process. MPI Reduce scatter combines reduce and
scatter.
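For instance (an illustrative fragment; the local values are chosen arbitrarily), a global sum that every process needs can be computed with MPI Allreduce, and a running prefix sum with MPI Scan:

int rank, localval, globalsum, prefixsum;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
localval = rank + 1;
/* every process obtains the sum of localval over all processes */
MPI_Allreduce(&localval, &globalsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* process i obtains the sum of the contributions from ranks 0..i */
MPI_Scan(&localval, &prefixsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
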

int result[MAX_MSG_SIZE];
...
MPI_Bcast(msg, msgsize, MPI_INT, 0, MPI_COMM_WORLD);
if (rank != 0) {
doWork(msg, result);
}
MPI_Reduce(MPI_IN_PLACE, result, msgsize, MPI_INT,
MPI_SUM, 0, MPI_COMM_WORLD);

Figure 1.7: An alternative (and more efficient) way than in Figure 1.2 for sending data
from one process to all others. It also includes an MPI Reduce call to accumulate the
results on one process (the process with rank zero in MPI COMM WORLD).

Both blocking and nonblocking versions of all collective communication functions are
available. The functions mentioned above are blocking functions, i.e., they return only
after the operation has locally completed at the calling process. Their nonblocking ver-
sions have an “I” in their name, e.g., MPI Ibcast or MPI Ireduce. These functions
initiate the collective operation and return an MPI Request object, similar to point-to-
point functions. The operations must be completed by calling a test or wait function.
Nonblocking collectives provide much needed support for overlapping collective commu-
nication with computation. MPI even provides a nonblocking version of barrier, called
MPI Ibarrier.
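A sketch of the nonblocking form (what is computed between initiation and completion is, of course, application specific):

int localval = 42, globalsum;
MPI_Request req;

MPI_Iallreduce(&localval, &globalsum, 1, MPI_INT, MPI_SUM,
               MPI_COMM_WORLD, &req);
/* ... do computation that does not need globalsum ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);   /* globalsum is valid only after completion */
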
Figure 1.7 shows the use of MPI Bcast to send the same data from one process (the
“root” process) to all other processes in the same communicator. It also shows the use of
MPI Reduce with the sum operation (given by the MPI predefined operation, MPI SUM)
to add all the results from each process and store them on the specified root process, which
in this case is the process with rank zero.

1.7 One-Sided Communication

In point-to-point or collective operations, communication between processes requires par-
ticipation by both the sender and the receiver. An alternative approach is to have one
process specify both the source and destination of the data. This approach is called one-
sided communication and is the major method used for communication in programming
systems such as ARMCI/GA (see Chapter 5), UPC (see Chapter 4), and OpenSHMEM
(see Chapter 3). This approach is also called remote memory access or RMA.
The one-sided model in MPI has three major components. First is the creation of an
MPI Win object, also called a window object. This describes the region of memory, or
memory window on each process, that other processes can access. Second are routines
for moving data between processes; these include routines to put, get, and update data in
a memory window on a remote process. Third are routines to ensure completion of the
one-sided operations. Each of these components has a unique MPI flavor.
MPI provides four routines for creating memory windows, each addressing a specific
application need. Three of the routines, MPI Win create, MPI Win allocate, and
MPI Win allocate shared, specify the memory that can be read or changed by an-
other process. These routines, unlike the ones in many other one-sided programming mod-
els, are collective. The fourth routine, MPI Win create dynamic, allows memory to
be attached (and detached) by individual processes independently by calling additional
routines.
The communication routines are simple (at least in concept) and are in the three cate-
gories of put, get, and update. The simplest routines (and the only ones in the original
MPI-2 RMA) are one to put data to a remote process (MPI Put), one to get data from a
remote process (MPI Get), and one to update data in a remote process by using one of
the predefined MPI reduction operations (MPI Accumulate). MPI-3 added additional
communication routines in each of these categories. Each of these routines is nonblocking
in the MPI sense, which permits the MPI implementation great flexibility in implementing
these operations.
The fact that these operations are nonblocking emphasizes that a third set of functions
is needed. These functions define when the one-sided communications complete. Any
one-sided model must address this issue. In MPI, there are three ways to complete one-
sided operations. The simplest is a collective MPI Win fence. This function completes
all operations that originated at the calling process as well as those that targeted the calling
process. When the “fence” exits, the calling process knows that all the operations started on
this process have completed (thus, the data buffers provided can now be reused or changed)
and that all operations targeting this process have also completed (thus, any access to the
local memory that was defined for the MPI Win has completed, and the local process
may freely access or update that memory). This description simplifies the actual situation
somewhat; the reader is encouraged to consult the MPI standard for the precise details.
In particular, MPI Win fence really separates groups of RMA operations; thus a fence
both precedes and follows the use of the MPI RMA communication calls.
Figure 1.8 shows the use of one-sided communication to implement an alternative to the
use of MPI Reduce in Figure 1.7. Note that this is not an exact replacement. By using
one-sided operations, we can update the result asynchronously; in fact, a single process
could update the result multiple times. Conversely, there are potential scalability issues
with this approach, and it is used as an example only.
MPI provides two additional methods for completing one-sided operations. One can
be thought of as a generalization of MPI Win fence.

int *result;
MPI_Win win;
MPI_Win_allocate((rank==0) ? msgsize*sizeof(int) : 0, sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &result, &win);
MPI_Win_fence(0, win);
...
doWork(msg, myresult);
MPI_Accumulate(myresult, msgsize, MPI_INT, 0,
               0, msgsize, MPI_INT, MPI_SUM, win);
...
MPI_Win_fence(0, win);
... process 0 has data in result
MPI_Win_free(&win);

Figure 1.8: Using MPI one-sided operations to accumulate (reduce) a result at a single
process.

This is called the scalable synchronization method and uses four routines—MPI Win post,
MPI Win start, MPI Win complete, and MPI Win wait.
a group of processes that specify either the processes that are the targets or the origins of
a one-sided communication operation. This method is considered scalable because it does
not require barrier synchronization across all processes in the communicator used to create
the window. Because both this scalable synchronization approach and the fence approach
require that synchronization routines be called by both origin and target processes, these
synchronization methods are called active target synchronization.
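A sketch of this post-start-complete-wait style, for the simple case in which every other process targets process zero (win, myresult, and msgsize are assumed to be set up as in Figure 1.8; the group construction is illustrative):

MPI_Group world_group, origin_group, target_group;
int target_rank = 0, rank, size, i, *origins;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
if (rank == 0) {
    /* the target exposes its window to the group of origin processes */
    origins = (int *) malloc((size-1) * sizeof(int));
    for (i = 1; i < size; i++) origins[i-1] = i;
    MPI_Group_incl(world_group, size-1, origins, &origin_group);
    MPI_Win_post(origin_group, 0, win);
    MPI_Win_wait(win);                 /* returns when all origins have completed */
    MPI_Group_free(&origin_group);
    free(origins);
} else {
    /* each origin opens an access epoch to the target, communicates, and completes */
    MPI_Group_incl(world_group, 1, &target_rank, &target_group);
    MPI_Win_start(target_group, 0, win);
    MPI_Accumulate(myresult, msgsize, MPI_INT, 0,
                   0, msgsize, MPI_INT, MPI_SUM, win);
    MPI_Win_complete(win);
    MPI_Group_free(&target_group);
}
MPI_Group_free(&world_group);
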
A third form of one-sided synchronization requires only calls at the origin process. This
form is called passive target synchronization. The use of this form is illustrated in Fig-
ure 1.9. Note that the process with rank zero does not need to call any MPI routines
within the while loop to cause RMA operations that target process zero to complete. The
MPI Win lock and MPI Win unlock routines ensure that the MPI RMA operations
(MPI Accumulate in this example) complete at the origin (calling) process and that the
data is deposited at the target process. The routines MPI Win lockall and MPI Win unlockall
permit one process to perform RMA communication to all other processes
in the window object. In addition, MPI Win flush may be used within the passive tar-
get synchronization to complete the RMA operations issued so far to a particular target;
there are additional routines to complete only locally and to complete RMA operations to
all targets. There are also versions of put, get, and accumulate operations that return an
MPI Request object; the user can use any of the MPI Test or MPI Wait functions
to check for local completion, without having to wait until the next RMA synchronization
call.
Also note that process zero does need to call MPI Win lock before accessing the
result buffer that it used to create the MPI window.

int *result;
MPI_Win win;
MPI_Win_allocate((rank==0) ? msgsize*sizeof(int) : 0, sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &result, &win);

while (not done) {
    ...
    doWork(msg, myresult);
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Accumulate(myresult, msgsize, MPI_INT, 0,
                   0, msgsize, MPI_INT, MPI_SUM, win);
    MPI_Win_unlock(0, win);
    ...
}
MPI_Barrier(MPI_COMM_WORLD); // Ensure all processes are done
if (rank == 0) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    printf("Result is %d\n", result[0]);
    ...
    MPI_Win_unlock(0, win);
}
MPI_Win_free(&win);

Figure 1.9: Using MPI one-sided operations to accumulate (reduce) a result at a single
process, using passive target synchronization.

This ensures that any pending
memory operations have completed at the target process. This is a subtle aspect of shared
and remote memory programming models that is often misunderstood by programmers
(see [45] for some examples of common errors in using shared memory). MPI defines a
memory model for the one-sided operations that ensures that users will obtain consistent
and correct results, even on systems without fully cache-coherent memory (at the time
of MPI-2’s definition, the fastest machines in the world had this feature, and systems in
the future again may not be fully cache coherent). While the standard is careful to de-
scribe the minimum requirements for correctly using one-sided operations, it also provides
slightly more restrictive yet simpler rules that are sufficient for programmers on most sys-
tems. MPI-3 introduced a new “unified memory model” in addition to the existing memory
model, which is now called "separate memory model." The user can query (via MPI Win get attr)
whether the implementation supports a unified memory model (e.g., on
a cache-coherent system), and if so, the memory consistency semantics that the user must
follow are greatly simplified.
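A sketch of that query (win is a window created earlier, as in Figure 1.8):

int *memory_model, flag;

MPI_Win_get_attr(win, MPI_WIN_MODEL, &memory_model, &flag);
if (flag && *memory_model == MPI_WIN_UNIFIED) {
    /* the simpler unified-memory-model rules apply to this window */
} else {
    /* fall back to the separate-memory-model rules */
}
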
MPI-3 significantly extended the one-sided communication interface defined in MPI-2 in
order to fix some of the limitations of the MPI-2 interface and to enable MPI RMA to be
more broadly usable in libraries and applications, while also supporting portability and high
performance. For example, new functions have been added to support atomic read-modify-
write operations, such as fetch-and-add (MPI Fetch and op) and compare-and-swap
(MPI Compare and swap), which are essential in many parallel algorithms. Another
new feature is the ability to create a window of shared memory (where shared memory is
available, such as within a single node) that can be used for direct load/store accesses in
addition to RMA operations. If you considered the MPI-2 RMA programming features
and found them wanting, you should look at the new features in MPI-3.
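As one illustration of the atomic operations (a sketch; win, and a counter exposed at displacement zero on rank zero, are assumptions of this example), a fetch-and-add looks like this:

long one = 1, oldval;

MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
/* atomically add 1 to the value at displacement 0 on rank 0 and return its previous value */
MPI_Fetch_and_op(&one, &oldval, MPI_LONG, 0, 0, MPI_SUM, win);
MPI_Win_unlock(0, win);
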

1.8 Parallel I/O

Many parallel scientific applications need to read or write large amounts of data from or
to files for a number of reasons such as reading input meshes, checkpoint/restart, data
analysis, and visualization. If file I/O is not performed efficiently, it is often the bottleneck
in such applications. MPI provides an interface for parallel file I/O that enables application
and library writers to express the “big picture” of the I/O access pattern concisely and
thereby enable MPI implementations to optimize file access.
The MPI interface for I/O retains the look and feel of MPI and also supports the common
operations in POSIX file I/O such as open, close, seek, read, and write. In addition, it
supports many advanced features such as the ability to express noncontiguous accesses in
memory and in the file using MPI derived datatypes, collective I/O functions, and passing
performance-related hints to the MPI implementation.
Let us consider a simple example where each process needs to read data from a different
location in a shared file in parallel as shown in Figure 1.10. There are many ways of
doing this using MPI. The simplest way is by using independent file I/O functions and
individual file pointers, as shown in Figure 1.11. Each process opens the file by using
MPI File open, which is collective over the communicator passed as the first argument
to the function, in this case MPI COMM WORLD. The second parameter is the name of
the file being opened, which could include a directory path. The third parameter is the
mode in which the file is being opened. The fourth parameter can be used to pass hints
to the implementation by attaching key-value pairs to an MPI Info object. Example
hints include parameters for file striping, sizes of internal buffers used by MPI for I/O
optimizations, etc. In the simple example in Figure 1.11, we pass MPI INFO NULL so
that default values are used. The last parameter is the file handle returned by MPI (of type
MPI File), which is used in future operations on the file.
Each process then calls MPI File seek to move the file pointer to the offset corre-
sponding to the first byte it needs to read. This is called the individual file pointer since it
is local to each process. (MPI also has another file pointer, called the shared file pointer,
that is shared among processes and requires a separate set of functions to access and use.)

FILE

P0 P1 P2 P(n-1)

Figure 1.10: Each process needs to read a chunk of data from a common file

MPI_File fh;
...
rc = MPI_File_open(MPI_COMM_WORLD, "myfile.dat", MPI_MODE_RDONLY,
MPI_INFO_NULL, &fh);
rc = MPI_File_seek(fh, rank*bufsize*sizeof(int), MPI_SEEK_SET);
rc = MPI_File_read(fh, msg, msgsize, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);

Figure 1.11: Reading data with independent I/O functions and individual file pointers

Data is read by each process using MPI File read, which reads msgsize integers into
the memory buffer from the current location of the file pointer. MPI File close closes
the file. Note that this method of doing file I/O is very similar to the way one would do it
with POSIX I/O functions.
A second way of reading the same data is to avoid using file pointers and instead specify
the starting offset in the file directly to the read function. This can be done by using
the function MPI File read at, which takes an additional "offset" parameter.
MPI File seek does not need to be called in this case. This function also provides a thread-
safe way to access the file, since it does not require a notion of “current” position in the
file.
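A sketch of that variant (same file and buffer as in Figure 1.11, with the seek folded into the read):

MPI_File fh;
MPI_Offset offset = (MPI_Offset) rank * msgsize * sizeof(int);

MPI_File_open(MPI_COMM_WORLD, "myfile.dat", MPI_MODE_RDONLY,
              MPI_INFO_NULL, &fh);
/* read msgsize ints starting at an explicit byte offset; no file pointer is involved */
MPI_File_read_at(fh, offset, msg, msgsize, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
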
MPI File read and MPI File read at are called independent I/O functions be-
cause they have no collective semantics. Each process calls them independently; there
is no requirement that if one process calls them, then all processes must call them. In
other words, an MPI implementation does not know how many processes may call these
functions and hence cannot perform any optimizations across processes.
MPI also provides collective versions of all read and write functions. These functions
have an all in their name, e.g., MPI File read all and MPI File read at all.
They have the same syntax as their independent counterparts, but they have collective se-
mantics; i.e., they must be called on all processes in the communicator with which the file
was opened.

MPI_File fh;
...
rc = MPI_File_open(MPI_COMM_WORLD, "myfile.dat", MPI_MODE_RDONLY,
MPI_INFO_NULL, &fh);
rc = MPI_File_set_view(fh, rank*bufsize, MPI_INT, MPI_INT,
"native", MPI_INFO_NULL);
rc = MPI_File_read_all(fh, msg, msgsize, MPI_INT,
MPI_STATUS_IGNORE);
MPI_File_close(&fh);

Figure 1.12: Reading data in parallel, with each process receiving a different part of the
input file

With this guarantee, an MPI implementation has the opportunity to optimize
the accesses based on the combined request of all processes, an optimization known as col-
lective I/O [268, 269]. In general, it is recommended to use collective I/O over independent
I/O whenever possible.
A third way of reading the data in Figure 1.10 is to use the notion of “file views” defined
in MPI, as shown in Figure 1.12. MPI File set view is used to set the file view,
whereby a process can specify its view of the file, i.e., which parts of the file it intends
to read/write and which parts it wants to skip. The file view is specified as a triplet of
displacement, etype, and filetype: displacement is the offset to be skipped from the start of
the file (such as a header), etype is the elementary type describing the basic unit of data
access, and filetype is an MPI type constructed out of etypes. The file view consists of
the layout described by a repeated tiling of filetypes starting at an offset of “displacement”
from the start of the file.
In Figure 1.12, each process specifies the displacement as its rank × msgsize,
etype as MPI INT, and filetype also as MPI INT. The next parameter specifies the data
representation in the file; “native” means the data representation is the same as in memory.
The last parameter can be used to pass hints. We could use either independent or collective
read functions to read the data; we choose to use the collective function MPI File read all.
Each process reads msgsize integers into the memory buffer from the file
view defined for that process. Since each process has a different displacement in the file
view, offset by its rank, it reads a different portion of the file.
MPI’s I/O functionality is quite sophisticated, particularly for cases where I/O accesses
from individual processes are not contiguous in the file, such as when accessing subarrays
and distributed arrays. In such cases, MPI-I/O can provide very large performance benefits
over using POSIX I/O directly; in some cases, it is over 1,000 times as fast. We refer the
reader to [127] for a more detailed discussion of MPI’s I/O capabilities.

#include "mpi.h"
#include <stdio.h>

static double syncTime = 0.0;

int MPI_Bcast(void *buf, int len, MPI_Datatype dtype, int root,
              MPI_Comm comm)
{
double t1;
t1 = MPI_Wtime();
PMPI_Barrier(comm);
syncTime += MPI_Wtime() - t1;
return PMPI_Bcast(buf, len, dtype, root, comm);
}

int MPI_Finalize(void)
{
printf("Synchronization time in MPI_Bcast was %.2e seconds\n",
syncTime); fflush(stdout);
return PMPI_Finalize();
}

Figure 1.13: Example use of the profiling interface to record an estimate of the amount of
time that an MPI Bcast is waiting for all processes to enter the MPI Bcast call.

1.9 Other Features

MPI includes a rich set of features intended to support developing and using large-scale
software. One innovative feature (now available in some other tools) is a set of alternate
entry points for each routine that makes it easy to interpose special code around any
MPI routine. For each MPI routine, there is another entry point that uses PMPI as the
prefix. This is known as the MPI profiling interface. For example, PMPI Bcast is the
profiling entry point for MPI Bcast. The PMPI version of the routine performs exactly
the same operations as the MPI version. The one difference is that the user may define
their own version of any MPI routine but not of the PMPI routines. An example is shown
in Figure 1.13. Linking the object file created from this file with a program that includes
calls to MPI Bcast will create a program that will print out the amount of time spent
waiting for all the processes to call MPI Bcast.
To enable users to write hybrid MPI and threaded programs, MPI also precisely specifies
the interaction between MPI calls and threads. MPI supports four “levels” of thread-safety
that a user must explicitly select:

MPI_THREAD_SINGLE: A process has only one thread of execution.

MPI_THREAD_FUNNELED: A process may be multithreaded, but only the thread that
initialized MPI can make MPI calls.

MPI_THREAD_SERIALIZED: A process may be multithreaded, but only one thread can
make MPI calls at a time.

MPI_THREAD_MULTIPLE: A process may be multithreaded and multiple threads can
call MPI functions simultaneously.

The user must call the function MPI_Init_thread to indicate the level of thread-safety
desired, and the MPI implementation will return the level it supports. It is the user's respon-
sibility to meet the restrictions of the level supported. An implementation is not required
to support a level higher than MPI_THREAD_SINGLE, but a fully thread-safe implemen-
tation will support MPI_THREAD_MULTIPLE. MPI specifies thread safety in this manner
so that an implementation need not support more than what the user needs and thereby
incur unnecessary performance penalties.
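A minimal sketch of requesting a level (the response to a lower-than-requested level shown here is one common convention, not mandated by the standard; the thread-level constants are ordered, so the comparison is valid):

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    /* The implementation granted a lower level; either adapt
       (e.g., funnel all MPI calls through one thread) or abort. */
    MPI_Abort(MPI_COMM_WORLD, 1);
}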
MPI also enables an application to spawn additional processes (by using MPI_Comm_spawn
or MPI_Comm_spawn_multiple) and separately started MPI applications to
connect with each other and communicate (by using MPI_Comm_connect and
MPI_Comm_accept or MPI_Join).
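For instance, a manager program might start four workers at run time roughly as follows (the executable name "./worker" is purely illustrative):

MPI_Comm workers;   /* intercommunicator connecting parent and children */
MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
               0 /* root */, MPI_COMM_WORLD, &workers,
               MPI_ERRCODES_IGNORE);
/* The parent can now communicate with the spawned processes through
   the "workers" intercommunicator. */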
MPI provides neighborhood collective operations (MPI_Neighbor_allgather and
MPI_Neighbor_alltoall and their variants) that define collective operations among
a process and its neighbors as defined by a Cartesian or graph virtual process topology in
MPI. These functions are useful, for example, in stencil computations that require nearest-
neighbor exchanges. They also represent sparse all-to-many communication concisely,
which is essential when running on many thousands of processes.
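A sketch of the stencil case (dimensions and data are illustrative): create a periodic two-dimensional Cartesian communicator and gather one value from each of the four nearest neighbors.

int rank, nprocs, dims[2] = {0, 0}, periods[2] = {1, 1};
MPI_Comm cart;
double mine, halo[4];   /* one value received from each neighbor */

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Dims_create(nprocs, 2, dims);
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
mine = (double) rank;
/* Gathers one double from each of the four Cartesian neighbors into halo[]. */
MPI_Neighbor_allgather(&mine, 1, MPI_DOUBLE, halo, 1, MPI_DOUBLE, cart);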
MPI also has functionality to expose some internal aspects of an MPI implementa-
tion that may be useful for libraries. These features include functions to decode de-
rived datatypes and functions to associate arbitrary nonblocking operations with an
MPI_Request (known as generalized requests). New in MPI-3 is a facility, known as the
MPI_T interface, that provides access to internal variables that either control the operation
of the MPI implementation or expose performance information.
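A small, hedged taste of the MPI_T interface; it merely counts the control variables an implementation exposes (the count and their meanings are implementation dependent):

int provided, ncvars;
MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
MPI_T_cvar_get_num(&ncvars);   /* number of exposed control variables */
printf("This MPI implementation exposes %d control variables\n", ncvars);
MPI_T_finalize();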

1.10 Best Practices

Like any programming approach, making effective use of MPI requires using it as it was
intended, taking into account the strengths and weaknesses of the approach. Perhaps the
most important consideration is that MPI is a library. This means that any MPI operation
requires one or more function calls and might not be the most efficient for very short data
transfers where even function-call overheads matter. Therefore, wherever possible, com-
munication should be aggregated so as to move as much data in one MPI call as possible.
MPI contains features to support the construction of software libraries. These features
should be used. For example, rather than adding MPI calls throughout your application,
it is often better to define important abstractions and implement them in terms of MPI. Most of the
application code then makes use of these abstractions, which keeps the code cleaner
and simplifies the process of tuning the use of MPI. This is the approach used in
several important computational libraries and frameworks, such as PETSc [229, 26] and
Trilinos [137]. In these libraries, MPI calls rarely, if ever, appear in the user’s code.
Locality at all levels is important for performance, and MPI, because it is based on
processes, helps users maintain locality. In fact, this feature is sometimes considered both
a strength and weakness of MPI: a strength because requiring users to plan for and respect
locality helps develop efficient programs; a weakness because users must take locality
into account. We note that locality at other levels of the memory hierarchy, particularly
between cache and main memory, is also necessary (possibly more so) for achieving high
performance.
Programs often do not behave as expected, and having tools to investigate the behavior,
both in terms of correctness and performance, is essential. The MPI profiling interface has
provided an excellent interface for tool development, and a number of tools are available
that can help in the visualization of MPI program behavior [72, 145, 253, 300]. The pro-
filing interface can also be used by end users [285], and good program design will take
advantage of this feature.
There are many good practices to follow when using MPI. Some of the most important
are the following.
1. Avoid assumptions about buffering (see the discussion above on safe programs).

2. Avoid unnecessary synchronization in programs. This means avoiding forcing an
order on the communication of data when it is not necessary. Implementing this
often means using nonblocking communication and multiple-completion routines
(e.g., MPI_Waitall).
3. Use persistent communication where possible and when communicating small to
medium amounts of data (the definition of medium depends on the speed of your
network, MPI implementation, and processor). This can reduce the overhead of
MPI; a sketch appears after this list.
4. Use MPI derived datatypes when possible and when your MPI implementation pro-
vides a high-quality implementation of this feature. This can reduce unnecessary
memory motion for noncontiguous data.

5. For I/O, use collective I/O where possible. Pay attention to any performance lim-
itations of your file system (some have extreme penalties for accesses that are not
aligned on disk block boundaries).

6. MPI_Barrier is rarely required and usually reduces performance. See [252] for an
automated way to detect "functionally irrelevant barriers." Though there are a few
exceptions, most uses of MPI_Barrier are, at best, sloppy programming and, at
worst, incorrect because they assume that MPI_Barrier has some side effects. A
correct MPI program will rarely need MPI_Barrier. (We mention this because the
analysis of many programs reveals that MPI_Barrier is one of the most common
MPI collective routines even though it is not necessary.)
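As promised in item 3, here is a sketch of persistent communication for a repeated exchange (buffer names, counts, and neighbor ranks are illustrative):

MPI_Request reqs[2];
/* Set up the communication once, outside the time-step loop. */
MPI_Send_init(sendbuf, count, MPI_DOUBLE, right, 0, comm, &reqs[0]);
MPI_Recv_init(recvbuf, count, MPI_DOUBLE, left, 0, comm, &reqs[1]);
for (step = 0; step < nsteps; step++) {
    MPI_Startall(2, reqs);                      /* begin both transfers */
    /* ... computation that does not touch the buffers ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* complete them */
}
MPI_Request_free(&reqs[0]);
MPI_Request_free(&reqs[1]);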

1.11 Summary

MPI has been an outstanding success. At this writing, the MPI specification is over 21 years
old and continues to be the dominant programming system for highly parallel applications
in computational science.
Why has MPI been so successful? In brief, it is because MPI provides a robust
solution for parallel programming that allows users to achieve their goals. A thorough ex-
amination of the reasons for MPI’s success may be found in [129]. The open process by
which MPI was defined also contributed to its success; MPI avoided errors committed in
some other, less open, programming system designs. Another contributor to the success of
MPI is its deliberate support for “programming in the large”—for the creation of software
modules that operate in parallel. A number of libraries have been built on top of MPI,
permitting application developers to write programs at a high level and still achieve perfor-
mance. Several of these have won the Gordon Bell prize for outstanding achievements in
high-performance computing and/or R&D 100 awards [4, 9, 12, 24, 137].
As the number of cores continues to grow in the most powerful systems, one frequent
question is “Can MPI scale to millions of processes?” The answer is yes, though it will re-
quire careful implementation of the MPI library [23]. It is also likely that for such systems,
MPI will be combined with another approach, exploiting MPI’s thread-safe design. Users
are already combining MPI with OpenMP. Using OpenMP or another node programming
language, combined with MPI, would allow the use of MPI with millions of MPI processes,
with each process using thousands of cores (e.g., via threads).
There is a rich research literature on the use and implementation of MPI, including the
annual EuroMPI meeting. A tutorial introduction to MPI is available in [127, 128]. The
official version of the MPI standard [202] is freely available at www.mpi-forum.org.

This chapter provided an introduction to MPI but could not cover the richness of MPI.
The references above, as well as your favorite search engine, can help you discover the full
power of the MPI programming model.
2 Global Address Space Networking

Paul Hargrove, Lawrence Berkeley National Laboratory

2.1 Background and Motivation

In 2002 a team of researchers at the University of California Berkeley and Lawrence Berke-
ley National Laboratory began work on a compiler for the Unified Parallel C (UPC) lan-
guage (see Chapter 4). A portion of that team had also worked on the compiler and run-
time library for Titanium [277], a parallel dialect of Java. This motivated the design of a
language-independent library to support the network communication needs of both UPC
and Titanium, with the intent to be applicable to an even wider range of global address
space language and library implementations. The result of those efforts is the Global
Address Space Networking library, known more commonly as simply “GASNet” (pro-
nounced just as written: “gas net”). GASNet has language bindings only for C, but is
“safe” to use from C++ as well.
At the time of this writing, the current revision of the GASNet specification is v1.8. The
most current revision can always be found on the GASNet project webpage [174].
Since its inception, GASNet has become the networking layer for numerous global ad-
dress space language implementations. In addition to the Berkeley UPC compiler [173],
the Open Source UPC compilers from Intrepid Technology [146] (GUPC) and the Univer-
sity of Houston [278] (OpenUH) use GASNet. Rice University chose GASNet for both
their original Co-Array Fortran (CAF) and CAF-2.0 compilers [239]. Cray’s UPC and
CAF compilers [96] use GASNet for the Cray XT series, and Cray Chapel (see Chapter 6)
uses GASNet on multiple platforms. The OpenSHMEM (see Chapter 3) reference im-
plementation from the University of Houston and Oak Ridge National Laboratory is also
implemented over GASNet. In addition to these languages and libraries, some of which are
described in later chapters of this book, GASNet has been used in numerous other research
projects.

2.2 Overview of GASNet

GASNet was originally designed to support languages compiled by source-to-source translation
techniques in which the compiler converts a program written in a parallel language
into serial code (almost always in C) with library calls to implement the parallel com-
munication and other distinguishing aspects of the language, such as global memory al-
location and locks in UPC. Since GASNet is language-independent by design, translated
code typically calls a language-specific runtime library. Calls to GASNet to implement
the communication might be made either directly by the translated code or indirectly by
the language-specific runtime library. Since source-to-source translation was the original
motivating usage case, GASNet has been designed for use in automatically generated code
and for use by expert programmers who are authoring parallel runtime libraries. Where
performance and ease-of-use conflict, the design favors choices that will achieve high per-
formance. One consequence of this is GASNet’s API specifies the “interfaces” or “calls”
which GASNet implements, but does not require that any of these be implemented as a
function. Therefore, in many cases a GASNet call may be implemented as a C preproces-
sor macro (especially when there is a simple mapping from a GASNet interface to a call in
the vendor-provided network API).
GASNet is also designed with wide portability in mind and one consequence is that
the capabilities expressed directly in GASNet’s interfaces are those one should be able to
implement efficiently on nearly any platform. At the time of this writing, GASNet has
“native” implementations—in terms of the network APIs—of all of the common cluster
interconnects, and those of the currently available supercomputers from IBM and Cray.
Also at the time of this writing, porting of GASNet to the largest systems in Japan and
China is known to be in progress by researchers affiliated with those systems.
The remainder of this section will introduce the terminology used in the GASNet spec-
ification and in this chapter, and provide an overview of the functionality found in GAS-
Net. Later sections expand in more detail upon this overview, provide usage examples and
describe some of the plans for GASNet’s future.

2.2.1 Terminology
Below are several terms that are used extensively in the GASNet specification and in this
chapter. Before reading further, familiarize yourself with these terms or bookmark this
page for easy reference.

Client: The software using GASNet, most often a parallel language runtime rather than an
"end-user" code.

Conduit: The implementation of GASNet for a specific network API. Example: "mpi-conduit"
and "udp-conduit" are maximally portable implementations to allow use on platforms
without "native" support.

Node: GASNet uses the term "node" to mean an O/S process rather than a network endpoint.

Supernode: A collection of nodes running under the same OS instance. On supported
platforms, GASNet will use shared-memory communication among such groups of nodes.

Segment: The range of virtual addresses that are permitted as the remote address (part of
the Extended API described in Section 2.4).

Local Completion: When memory associated with input(s) on the initiating node is safe
for reuse.

Remote Completion: When memory associated with output(s) has been written.

2.2.2 Threading
GASNet is intended to be “thread neutral” and allow the client to use threads as it sees
fit. By default GASNet will build three variants of the library for each supported network
to support different client threading models. These are known as the “seq”, “parsync”
and “par” builds, where the names correspond both to the library file name and to the
preprocessor tokens GASNET_SEQ, GASNET_PARSYNC and GASNET_PAR. Exactly one
of these three must be defined by the client when it includes gasnet.h and the library
to which it is linked must correspond to the correct preprocessor token.1 The three models
are:
• GASNET_SEQ
In this mode the client is permitted to make GASNet calls from only a single thread
in each process. There is no restriction on how many threads the client may use, but
exactly one of them must be used to make all GASNet calls.
• GASNET_PARSYNC
In this mode at most one thread may make GASNet calls concurrently. Multiple
threads may make GASNet calls with appropriate mutual exclusion. GASNet does
not provide the mechanism for such mutual exclusion, which is the client's respon-
sibility.

1. There is some name-shifting under the covers to catch mismatches.

• GASNET_PAR
This is the most general mode, allowing multiple client threads to enter GASNet
concurrently.

When using SEQ or PARSYNC modes, the restriction on the client’s calls to GASNet is
only a restriction on the client. It is legal, even in a SEQ build, for GASNet to use threads
internally, and these internal threads may be used to execute the client’s AM handlers. For
this reason the client code must make proper use of GASNet’s mechanisms for concurrency
control (described in Section 2.3.4) regardless of the threading mode.

2.2.3 API Organization


GASNet divides its interfaces into two groups—the “Core API” and the “Extended API.”
The latter of these contains interfaces for remote memory access, while everything else is
contained in the former. The core includes the most basic needs in any parallel runtime
with interfaces to initialize and finalize the runtime, query the number of parallel entities
(GASNet “nodes”, in this case), and the identity of the calling entity (the node number in
GASNet).
In addition to the Core and Extended API as documented in the GASNet specification,
there are several additional features in GASNet to assist in the writing and debugging of
portable GASNet clients.

2.3 Core API

GASNet’s flexibility in implementing parallel language runtimes comes from the inclusion
of a remote procedure call mechanism based on Berkeley Active Messages [183]. While
GASNet’s Active Message (just “AM” from here on) interfaces are significantly reduced
relative to the Berkeley AM design, they provide the caller with significant flexibility sub-
ject to constraints that allow an implementation to guarantee deadlock freedom while using
bounded resources. Briefly, the idea is that a client may send an AM Request to a node
(including itself) which results in running code (a “handler”) that was registered by the call
to gasnet attach. The handler receives a small number of integer arguments provided
by the AMRequest call, plus an optional payload which may either be buffered by the im-
plementation or delivered to a location given by the client. The handler is permitted calls
Global Address Space Networking 27

to only a subset of the GASNet Core API (and none of the Extended); the only communi-
cation permitted is at most one AMReply to the requesting node. A significant portion of
this chapter will be devoted to showing how to use GASNet’s AMs.
The GASNet Core API contains everything one needs to write an AM-based code.2 This
section picks up from the brief introduction given in Section 2.2.3 to provide some detail on
the Core API. This information will be put into practice in several examples in Section 2.6.

2.3.1 Beginnings and Endings


There seems to be an almost universal imperative to provide a “Hello, World!” example
for every programming language. With parallel languages there is an expectation that one
can print some information about the job size and the caller’s rank. Such an example is
provided in Section 2.6.2, but we introduce here the corresponding portions of the Core
API:

int gasnet_init(int *argcp, char ***argvp);


int gasnet_attach(gasnet_handlerentry_t *table, int numentries,
uintptr_t segsize, uintptr_t minheapoffset);
void gasnet_exit(int exitcode);
gasnet_node_t gasnet_nodes(void);
gasnet_node_t gasnet_mynode(void);
char * gasnet_getenv(const char *name);
uintptr_t gasnet_getMaxLocalSegmentSize(void);
uintptr_t gasnet_getMaxGlobalSegmentSize(void);

Any GASNet client will begin with a call to gasnet_init which takes pointers to
the standard argc and argv parameters to main(). The job environment prior to
calling gasnet_init is largely unspecified (see the specification for details), and the user is
strongly encouraged not to do much, if anything, before this call. However, after the call to
gasnet_init, the command line will have been cleansed of any arguments used inter-
nally by GASNet, and environment variables will be accessible using gasnet_getenv.
Additionally, the job's stdout and stderr will be set up by this init call. GASNet does
not make any guarantees about stdin.
Only after gasnet_init returns do the other calls in the list above become legal. The
next step in initialization of a GASNet job is a call to gasnet_attach to allocate the
GASNet segment and reserve any network resources required for the job. The arguments
to gasnet_attach give the client's table of AM handlers, and the client's segment re-
quirements:

2. It is also the minimum one must port to a new platform, because there is a reference implementation of every-
thing in the Extended API in terms of the Core.

• gasnet_handlerentry_t *table


This is a pointer to an array of C structs:

typedef struct {
gasnet_handler_t index;
void (*fnptr)();
} gasnet_handlerentry_t;

The fnptr is the function to be invoked as the AM handler at the respective integer
index. The signature of AM handlers will be covered in Section 2.3.5. Values for
index of 128–255 are available to the client, while the special value of 0 indicates
“don’t care” and will be overwritten by a unique value by gasnet attach.

• int numentries
The number of entries in the handler entry table.

• uintptr_t segsize
The requested size of the GASNet segment.
Must be a multiple of GASNET_PAGESIZE, and no larger than the value returned
by gasnet_getMaxLocalSegmentSize (see below).
Ignored for GASNET_SEGMENT_EVERYTHING.

• uintptr_t minheapoffset
The requested minimum distance between GASNet’s segment and the current top of
the heap.3 On systems where the layout in virtual memory forces GASNet’s segment
and the heap to compete for space, this ensures that at least this amount of space will
be left for heap allocation after allocation of the segment. While not recommended,
it is legal to pass zero. The value passed by all nodes must be equal.
Ignored for GASNET_SEGMENT_EVERYTHING.

There are two calls to determine what segment size one may request in the attach call.
The function gasnet_getMaxLocalSegmentSize returns the maximum amount of
memory that GASNet has determined is available for the segment on the calling node,
while gasnet_getMaxGlobalSegmentSize returns the minimum of all the "local"
values. Keep in mind that on many platforms, the GASNet segment and the malloc heap
must compete for the same space, meaning that these SegmentSize queries should be
treated as an upper bound on the sum of segsize and minheapoffset. A client that
finds the available segment size too small for its requirements may call gasnet_exit to
terminate the job rather than calling gasnet_attach.

3. The range of memory used to satisfy calls to the malloc family of functions.
In addition to the two segment size query calls and access to environment variables us-
ing gasnet_getenv, clients may call gasnet_nodes to query the number of GAS-
Net nodes in the job, and gasnet_mynode to determine the caller's rank within the
job (ranks start from zero). The calls listed above are the only ones permitted between
gasnet_init and gasnet_attach. The two segment size query calls are unique in
that they are only legal between gasnet_init and gasnet_attach.
After gasnet_attach comes the client's "real" code using the interfaces described in
the sections that follow. When all the real work is done, gasnet_exit is the mechanism
for reliable job termination. The call to gasnet_exit takes an exit code as its only
argument and does not return to the caller. GASNet makes a strong effort to ensure that if
any node provides a nonzero exit code, the job as a whole (spawned by some platform-
specific mechanism) will also return a nonzero code. It also tries to preserve the actual
value when possible.
A call to gasnet_exit by a single node is sufficient to cause the entire parallel job
to terminate. Any node which does not call gasnet_exit at the same time as one or
more others will receive a SIGQUIT signal if possible.4 This is the only signal for which
a client may portably register a signal handler, because GASNet reserves all others for
internal use. To avoid unintentionally triggering this mechanism, a client performing a
"normal" exit should perform a barrier (see Section 2.3.3) immediately before the call to
gasnet_exit.
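A bare-bones skeleton tying these calls together might look like the following; it registers no handlers, requests only an illustrative page-sized segment and heap offset, and omits error checking (the full "Hello, World!" appears in Section 2.6.2):

#include <gasnet.h>
#include <stdio.h>

int main(int argc, char **argv) {
  gasnet_init(&argc, &argv);
  /* No AM handlers yet; segsize and minheapoffset here are illustrative. */
  gasnet_attach(NULL, 0, GASNET_PAGESIZE, GASNET_PAGESIZE);
  printf("node %d of %d is up\n",
         (int) gasnet_mynode(), (int) gasnet_nodes());
  /* Barrier before exit, as recommended above. */
  gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
  gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
  gasnet_exit(0);
  return 0;
}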

2.3.2 Segment Info


The segsize argument to gasnet_attach is the client's requested size. GASNet
may allocate a smaller segment. At any time following the call to gasnet_attach a
client may determine information about the segments allocated to all nodes by calling
gasnet_getSegmentInfo:

typedef struct {
void *addr;
uintptr_t size;
} gasnet_seginfo_t;
int gasnet_getSegmentInfo(gasnet_seginfo_t *seginfo_table,
int numentries);

4. Except on platforms without POSIX signals.



This call populates the lesser of numentries or gasnet_nodes() entries of type
gasnet_seginfo_t in the client-owned memory at seginfo_table, and returns an
error code on failure (see Section 2.3.8). The ith entry in the array gives the address and
size of the segment on node i. When conditions permit, GASNet favors assigning segments
with the same base address on all nodes. If an implementation can guarantee that this
property is always satisfied, then the preprocessor token GASNET_ALIGNED_SEGMENTS
is defined to 1.
In the GASNET_SEGMENT_EVERYTHING configuration, the segment is all of virtual
memory. In this configuration, the addr fields will always be zero and the size will
always be (uintptr_t)(-1).
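For example, after gasnet_attach a client might record every node's segment as follows (a sketch; assumes <stdlib.h> has been included and ignores the returned error code):

gasnet_node_t n = gasnet_nodes();
gasnet_seginfo_t *segs = malloc(n * sizeof(gasnet_seginfo_t));
gasnet_getSegmentInfo(segs, n);
/* segs[i].addr and segs[i].size now describe node i's segment; the client
   can compute remote addresses for Extended API calls from these values. */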

2.3.3 Barriers
The next set of Core API calls to describe are those for performing a barrier:

#define GASNET_BARRIERFLAG_ANONYMOUS ...


#define GASNET_BARRIERFLAG_MISMATCH ...
void gasnet_barrier_notify(int id, int flags);
int gasnet_barrier_wait(int id, int flags);
int gasnet_barrier_try(int id, int flags);

Unlike many barrier implementations, the one in GASNet is "split-phase" and supports
optional id matching.
The "split-phase" nature of GASNet's barrier is evident in the specification's descrip-
tion of gasnet_barrier_wait which states "This is a blocking operation that returns
only after all remote nodes have called gasnet_barrier_notify()." In simple terms,
imagine that "notify" increments an arrival counter and that "wait" blocks until that counter
equals the job size.5 The call gasnet_barrier_try checks the same condition, but re-
turns immediately with the value GASNET_ERR_NOT_READY if the condition is not yet
satisfied. Regardless of whether one uses "wait" or "try" to complete the barrier, it is legal
to perform most GASNet operations between the initiation and completion.
The id and flags arguments to the barrier functions implement optional matching at
the barriers. This feature is best understood by a careful reading of the specification, but
two common use cases are easy to understand:

• Anonymous barrier
The simplest case is when one does not wish to use the id matching support. In this
case, the constant GASNET_BARRIERFLAG_ANONYMOUS is passed for the flags
argument to the barrier functions. Any value can be passed as the id (though 0 is
most common), since it will be ignored.

5. Rest assured that more scalable algorithms are used in practice.

• Named barrier
The simplest case that makes use of the id matching logic is a blocking (as opposed
to split-phase) barrier with an integer argument that is expected to be equal across
all callers:

int named_barrier(int name) {
  gasnet_barrier_notify(name, 0);
  int err = gasnet_barrier_wait(name, 0);
  if (err == GASNET_OK) {
    return 0;   // Success - all nodes specified same name
  } else if (err == GASNET_ERR_BARRIER_MISMATCH) {
    return 1;   // Failure - names did not all match
  }
  return -1;    // Something unexpected happened!
}

GASNet’s split-phase barrier comes with some usage restrictions which might not ini-
tially be obvious. Here we will consider a successful “try” equivalent to a “wait” to keep
the descriptions brief. The first restriction is the most intuitive: one must alternate between
“notify” and “wait” to ensure that barrier operations do not overlap one another. The sec-
ond is that in a GASNET_PARSYNC or GASNET_PAR build, the "notify" and "wait" should
only be performed once per node (the client is free to choose which thread does the work,
and need not pick the same thread for the two phases). The third is a potentially nonob-
vious consequence of the first two: in a GASNET_PAR build the client has the burden of
ensuring that at most one client thread is in any barrier call at any given instant.
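A short sketch of the split-phase idiom, overlapping purely local work with the barrier (the work function is hypothetical):

gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);  /* arrive           */
do_purely_local_work();                                  /* overlap          */
gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);    /* complete barrier */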

2.3.4 Locks and Interrupts


In Section 2.2.2 we learned that GASNet might run AM handlers concurrently with client
code even when the client is single threaded. We also learned above that multi-threaded
clients must prevent concurrent calls to GASNet’s barrier functions. So now we take a look
at GASNet’s mechanisms for controlling concurrency.
GASNet has interfaces specifically for dealing with thread-safety, which are of impor-
tance even in single-threaded clients due to the possibility of multi-threaded implementa-
tions of GASNet. The main mechanism is a simple mutex, known as a “handler-safe lock”,
or “HSL”, based on the mutex type, gasnet hsl t. The implementation of this mutex
type is ensured to be appropriate for the given implementation (which includes being a
no-op when both the GASNet client and implementation are single-threaded).

#define GASNET_HSL_INITIALIZER ...


void gasnet_hsl_init(gasnet_hsl_t *hsl);
void gasnet_hsl_destroy(gasnet_hsl_t *hsl);
void gasnet_hsl_lock(gasnet_hsl_t *hsl);
void gasnet_hsl_unlock(gasnet_hsl_t *hsl);
int gasnet_hsl_trylock(gasnet_hsl_t *hsl);

Other than minor details given in the GASNet specification, these are equivalent to the
analogous constants and functions on pthread_mutex_t. Like the POSIX threads ana-
logues, these can be used to prevent concurrent access to data structures or regions of code.
A general tutorial on the use of a mutex is outside the scope of this chapter. Note that
these are node-local mutexes, and GASNet does not provide mechanisms for cross-node
mutual exclusion. However, the example in Section 2.6.5 shows how one can use AMs to
implement a well-known shared memory algorithm for mutual exclusion.
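A minimal sketch (names are ours) of using an HSL to protect a counter that an AM handler increments and the client's main code reads:

static gasnet_hsl_t count_lock = GASNET_HSL_INITIALIZER;
static int arrivals = 0;

/* Called from an AM handler, which may run asynchronously. A handler may
   acquire an HSL but must release it before returning. */
void note_arrival(void) {
  gasnet_hsl_lock(&count_lock);
  arrivals++;
  gasnet_hsl_unlock(&count_lock);
}

/* Called from the client's main code. */
int read_arrivals(void) {
  int v;
  gasnet_hsl_lock(&count_lock);
  v = arrivals;
  gasnet_hsl_unlock(&count_lock);
  return v;
}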
In addition to the previously introduced idea of internal threads for executing the client’s
AM handlers, the GASNet specification allows for the possibility of interrupt-driven im-
plementations. Though at the time of this writing there are no such implementations, we
will introduce this concept briefly:

void gasnet_hold_interrupts();
void gasnet_resume_interrupts();

These two calls, used in pairs, delimit sections of code which may not be interrupted by
execution of AM handlers on the calling thread. This is different from use of an HSL to
prevent multiple threads from concurrently accessing given code or data. The intended use
of no-interrupt sections is to protect client code which is nonreentrant and can potentially
be reached from both handler and nonhandler code in the client. No-interrupt sections are
seldom necessary for two key reasons: 1) holding an HSL implicitly enters a no-interrupt
section; 2) AM handlers run in implicit no-interrupt sections. Note that these calls do not
nest and the client is therefore responsible for managing no-interrupt sections when nesting
might occur dynamically.
It is worth noting that in a GASNET_SEQ build the mutex calls may compile away to
“nothing” if and only if the GASNet implementation is using neither threads nor interrupts
internally to execute the client’s AM handlers.

2.3.5 Active Messages


The real meat of the GASNet Core API is the AM interfaces. The principle of an AM
is that a call on the initiating node transfers to the target node some small number of ar-
guments, and an optional payload, all of which are passed to a function run on the target
node. Functions to be run are known as AM "handlers" and are named by an index of
type gasnet_handler_t, where the mapping between these indices and actual func-
tions was established by the handler table passed to (and possibly modified by) the call to
gasnet_attach.
Arguments to AM handlers are 32-bit integers.6 There are implementation-dependent
limits on the argument count, which can be queried at runtime:

size_t gasnet_AMMaxArgs(void);

The value must be at least 8 on 32-bit platforms and at least 16 on 64-bit platforms. This
ensures a client can always pass at least 8 pointer-sized values to a handler.
In addition to the arguments, there is an optional payload. There are three “categories”
of AMs depending on the treatment of the payload:

• Short AMs have no payload. The signature of a Short AM handler looks like:

void ShortExample(gasnet_token_t token ...);

• Medium AMs carry a payload which is held in an implementation-provided tem-


porary buffer on the target. The AM handler is given the address and length of this
buffer, which will be destroyed/recycled when the handler completes. The handler
is permitted to modify the payload in-place if desired. The signature of a Medium
AM handler looks like:

void MediumExample(gasnet_token_t token, void *buf,


size_t nbytes ...);

• Long AMs carry a payload that is placed at an address on the target node that is
provided by the initiating node. This address must lie in the GASNet segment. The
signature of a Long AM handler looks like:

void LongExample(gasnet_token_t token, void *buf,


size_t nbytes ...);

6. GASNet's gasnet_handlerarg_t type is always equivalent to uint32_t, but GASNet supports C89
compilers which may not have uint32_t.

In all three handler signatures above, the "..." denotes up to gasnet_AMMaxArgs()
additional arguments. Since the Medium and Long AM handler signatures are identical, it
is permissible to use the same handler for either category of AM.
Payload size is subject to an implementation-dependent limit, which can be queried at run-
time:

size_t gasnet_AMMaxMedium(void);
size_t gasnet_AMMaxLongRequest(void);
size_t gasnet_AMMaxLongReply(void);

The GASNet specification requires that all implementations support payloads of at least
512 bytes, and typical values are much higher for platforms with RMA support in hard-
ware. It is important to notice the distinction between the Request and Reply limits for
Long AMs.7
To invoke an AM handler on a target node one issues an AM request using one of the
following:

int gasnet_AMRequestShort[N](gasnet_node_t dest,


gasnet_handler_t handler ...);
int gasnet_AMRequestMedium[N](gasnet_node_t dest,
gasnet_handler_t handler,
void *src_addr, size_t nbytes ...);
int gasnet_AMRequestLong[N](gasnet_node_t dest,
gasnet_handler_t handler,
void *src_addr, size_t nbytes,
void * dest_addr ...);
int gasnet_AMRequestLongAsync[N](gasnet_node_t dest,
gasnet_handler_t handler,
void *src_addr, size_t nbytes,
void * dest_addr ...);

Each of the above prototypes represents an entire family of calls, where the “[N]” above
is replaced with the values from 0 through gasnet_AMMaxArgs(). As before, the "..."
denotes the placement of the 32-bit arguments to pass to the handler. For the Medium and
Long requests, the calls return as soon as the payload memory is safe to reuse (also known
as “local completion”). The implementation is not required to make a copy of the payload,
and thus these calls may block temporarily until the network can send the payload. While
blocked, AMs sent to the calling node by others may be executed. The LongAsync request

7. This difference arises from the fact that a Reply can only be initiated within a request handler, and it may not
be possible in this context to allocate resources for a large Reply. Therefore, when the limits for LongRequest
and LongReply differ, the Request value will be the larger of the two.

differs from the Long case in that it returns without waiting for local completion (though
it may still block waiting for resources). The client must not modify the payload until
the corresponding AM reply handler begins running—that is the only indication of local
completion. This is a difficult semantic to apply correctly, but can be powerful.
When an AM handler runs, code is executed in an environment known as “handler con-
text” in which several restrictions apply. These restrictions will be enumerated later, but
at this point we focus on the one that for many is the defining feature of Berkeley AM,
and thus of GASNet. This is the “at most one reply” rule which states that 1) the only
communication permitted in the handler for an AM Request is one optional Reply to the
node initiating the Request; and 2) no communication is permitted in the handler for an
AM Reply. An AM Reply is sent with one of the following:8

int gasnet_AMReplyShort[N](gasnet_token_t token,


gasnet_handler_t handler ...);
int gasnet_AMReplyMedium[N](gasnet_token_t token,
gasnet_handler_t handler,
void *src_addr, size_t nbytes ...);
int gasnet_AMReplyLong[N](gasnet_token_t token,
gasnet_handler_t handler,
void *src_addr, size_t nbytes,
void * dest_addr ...);

The use of "...", again, denotes the 32-bit handler arguments, and [N] indicates that
these three prototypes are templates for instances from 0 to gasnet_AMMaxArgs() ar-
guments.
Other than the names, the key difference between the calls to send an AM Reply versus
those for a Request is the type of the first argument: gasnet_token_t. This type was
first seen, without any explanation, when the prototypes for the three categories of AM
handlers were given. It is an opaque type that contains (at least) the source node of an
AM. Since there is no way to construct an object of this type, the only way to invoke an
AMReply function is using the token received as an argument to the Request handler. For
situations where one does need to know the source node for an AM (either Request or
Reply), one can query:

int gasnet_AMGetMsgSource(gasnet_token_t token,


gasnet_node_t *srcindex);

This call can be made only from the handler context and the only valid value for the
token argument is the one received as an argument to the handler function.
8. Note that the lack of an AMReplyLongAsync is a consequence of the fact that the “at most one reply” rule
prevents the AM Reply handler from issuing any communication that would serve to indicate the local completion.
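To make these rules concrete, here is a hypothetical pair of Short AM handlers (the handler indices and names are ours): the request handler sends the single permitted Reply, and the reply handler simply records what it received.

#define PING_IDX 201   /* client-chosen handler indices in the 128-255 range */
#define PONG_IDX 202

static volatile int pong_arrived = 0;
static volatile gasnet_handlerarg_t pong_value = 0;

/* Request handler: runs on the target node; its only communication is
   the single optional AMReply back to the requester. */
void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
  gasnet_AMReplyShort1(token, PONG_IDX, arg0);
}

/* Reply handler: runs on the initiating node; no further communication
   is permitted, so it just records the result. */
void pong_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
  pong_value = arg0;
  pong_arrived = 1;
}

These two handlers would be registered through the gasnet_handlerentry_t table passed to gasnet_attach.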

2.3.6 Active Message Progress


It has been mentioned that GASNet may run AM handlers using internal threads or with in-
terrupts. While important for writing correct client code, those should not be considered the
common case as many implementations lack both of those mechanisms for asynchronous
progress. Instead, the progress of a typical AM-driven GASNet application is dependent
on the client making entries to GASNet. A client can be assured that AM initiating calls
will also poll for incoming AMs, but that is not always sufficient. So, there are two ways
to explicitly poll for incoming AMs:

int gasnet_AMPoll(void);
#define GASNET_BLOCKUNTIL(condition) ...

The call gasnet_AMPoll checks for incoming AMs (both Requests and Replies) and
will execute some implementation-dependent maximum number of them before return-
ing. Thus, there is no guarantee that at the time this call returns there are no additional
AMs waiting. This call is typically used in the client's own progress loop, or before and
after client operations that are known not to poll for long periods of time. The macro
GASNET_BLOCKUNTIL is used to block until a condition becomes true. It takes as an
argument a C expression to evaluate, and GASNet executes code functionally equivalent
to:

#define GASNET_BLOCKUNTIL(cond) while (!(cond)) gasnet_AMPoll()

It is possible, however, for GASNET_BLOCKUNTIL to use implementation-specific mech-
anisms instead of this naive mechanism. It is important to note that it is only valid to use
GASNET_BLOCKUNTIL to block waiting for a condition which will change due to the ac-
tion of an AM—code which uses GASNET_BLOCKUNTIL to block for an Extended API
call to change memory could block indefinitely.9
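Continuing the hypothetical ping/pong handlers from Section 2.3.5, the initiating side below shows the intended use of GASNET_BLOCKUNTIL: the condition it waits on is changed only by the reply handler.

/* Sketch: send a ping and wait for the pong (assumes the handlers and
   flags from the earlier example were registered with gasnet_attach). */
void ping_peer(gasnet_node_t peer, int value) {
  pong_arrived = 0;
  gasnet_AMRequestShort1(peer, PING_IDX, (gasnet_handlerarg_t) value);
  GASNET_BLOCKUNTIL(pong_arrived);   /* polls until the reply handler runs */
  /* pong_value now holds the echoed value. */
}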

2.3.7 Active Message Rules and Restrictions


The GASNet specification is the most complete reference for the details, but the following
are the primary rules which must be followed to write correct/portable AM codes with
GASNet. Keep in mind that not all implementations enforce all of these rules, but clients
must follow them all or risk incorrect operation on some implementation.

9. If it helps to understand this rule, try to visualize an implementation using a condition variable which is broadcast
by GASNet each time a handler execution completes.

• A handler running as the result of an AM Request is permitted communication only


via a single optional call to an AMReply function.

• A handler running as the result of an AM Reply is not permitted any communication.

• No handler may call the GASNet barrier functions, initiate AM Requests or call any
portion of the Extended API (these involve prohibited communication).

• A handler may block temporarily in a call to obtain an HSL, but must release any
held HSL before returning.

• A handler may not call GASNET_BLOCKUNTIL.

• The GASNet implementation is not required to ensure AMs are executed in order
and client code must be constructed to be deadlock-free in the presence of reordered
messages.10

• Client code must be written in a thread-safe manner (through the proper use of HSL)
in recognition that even with single-threaded clients, GASNet may run AM handlers
asynchronously.

• The expression passed to GASNET_BLOCKUNTIL does not have an exception to the
previous rule and must consider the possibility that the expression could be evaluated
concurrently with execution of AM handlers.

2.3.8 Error Codes


While only a few were mentioned in prior discussions, most of the Core API has integer
return values taken from the following:

• GASNET_OK
Guaranteed to be zero, this value indicates success.

• GASNET_ERR_RESOURCE


A fairly generic error indicating that a call failed because some finite resource was
unavailable.

• GASNET_ERR_BAD_ARG


Similar to errno == EINVAL, this indicates that the client passed an invalid ar-
gument.

10. GASNet will not drop or replay AMs. So the client may assume “exactly once” delivery.

• GASNET_ERR_NOT_INIT


The client has not yet called gasnet_init.

• GASNET_ERR_BARRIER_MISMATCH


Either the id-matching logic of GASNet’s barrier has detected a mismatch (see
the specification for the matching rules), or a caller has indicated one by passing
GASNET_BARRIERFLAG_MISMATCH.

• GASNET_ERR_NOT_READY


A (temporary) indication that a split-phase operation is incomplete. In the Core
API this occurs when calling gasnet_barrier_try before all nodes have called
gasnet_barrier_notify.

Additionally, one can convert the numerical error code into its name (a string prefixed with
GASNET_ERR_) or an English-language description of the error value using the following:

char * gasnet_ErrorName(int errval);


char * gasnet_ErrorDesc(int errval);
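A common client-side idiom (a convention of ours, not part of the specification) wraps calls in a checking macro built on these two functions; it assumes <stdio.h> is included:

#define GASNET_SAFE(call) do {                                   \
    int rc_ = (call);                                            \
    if (rc_ != GASNET_OK) {                                      \
      fprintf(stderr, "GASNet error %s (%s) at %s:%d\n",         \
              gasnet_ErrorName(rc_), gasnet_ErrorDesc(rc_),      \
              __FILE__, __LINE__);                               \
      gasnet_exit(rc_);                                          \
    }                                                            \
  } while (0)

/* Example: GASNET_SAFE(gasnet_init(&argc, &argv)); */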

2.4 Extended API

The Extended API provides a rich set of interfaces for remote memory access (Puts and
Gets) with a variety of semantics intended to ease automatic code generation, especially
from source-to-source translation of partitioned global address space (PGAS) languages.
At this time GASNet provides standardized RMA interfaces only for Put and Get of con-
tiguous regions, but see Section 2.7 for information on proposed "Vector-Index-Strided"
interfaces.

2.4.1 The GASNet Segment


As previously mentioned, the Extended API can only access remote addresses that lie in a
portion of memory known as the GASNet segment and established at gasnet_attach
time. The two original GASNet clients, implementations of UPC and Titanium, differ in
terms of what portion of memory should be remotely accessible. In UPC, only memory
allocated by the language-specific shared allocation functions can be the remote operand to
an RMA operation, while in Titanium there is no such specialization of memory allocation
and consequently all objects can potentially be accessed remotely with GASNet Extended
API calls. GASNet recognizes this distinction, plus one additional gradation in the form
of the “segment configuration” which must be selected when the GASNet library is built
from source. The default configuration is known as GASNET_SEGMENT_FAST, or just
SEGMENT_FAST for short. In this configuration the implementation provides the fastest
(lowest latency and/or highest bandwidth) implementation possible, even if this results in
making trade-offs which significantly reduce the size of the segment. The second option,
SEGMENT_LARGE, supports the largest contiguous segment possible (within reason) even
when this support may require "bounce buffers" or other mechanisms that reduce the speed
of remote accesses. The final option is GASNET_SEGMENT_EVERYTHING in which the
entire virtual address space is considered “in-segment.”

2.4.2 Ordering and Memory Model


As a general design principle, GASNet attempts to define as few semantics as possible
to allow both efficient implementation and freedom for the client to determine its own
semantics. In this vein, GASNet’s Extended API operations are unordered and the state
of destination memory is undefined between the initiation of an operation and its remote
completion. Clients which require ordering “A precedes B” must complete operation “A”
before initiating operation “B.” Local completion semantics for Puts depends on the client’s
choice of “bulk” or “nonbulk” operations as described below. GASNet makes no guaran-
tees about the results of concurrent operations where an operation’s destination overlaps
the source or destination of another (including undefined result for a loopback operation
with overlapping source and destination).

2.4.3 Blocking and Nonblocking


Blocking operations are both locally and remotely complete when they return. Therefore, a
sequence of blocking operations is trivially ordered, but only with respect to other blocking
operations. Nonblocking operations come in two flavors: “explicit handle” and “implicit
handle." Explicit-handle operations have an "_nb" suffix and return an opaque handle,
gasnet_handle_t, with which one can poll on or block for completion of individual
operations or on arrays of handles (known as "syncing" a handle). Implicit-handle non-
blocking operations have an "_nbi" suffix and treat a sequence of RMA operations as a
group. A client can sync all outstanding implicit-handle Puts, Gets, or both. One cannot
sync individual implicit-handle operations, but can manage a sequence of such operations
without needing to track a collection of explicit handles. It is also possible to create an
“nbi access region” which collects all implicit-handle operations occurring dynamically
between the begin and end calls together under a single handle which can then be used
with the explicit-handle sync operations.
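The following sketch contrasts the two flavors (peer, rdest, value, src, and n are assumptions of ours; rdest must be an in-segment address on the target, obtained for example from gasnet_getSegmentInfo):

/* Explicit handle: one Put, synced individually. */
gasnet_handle_t h = gasnet_put_nb(peer, rdest, &value, sizeof(value));
/* ... unrelated work may overlap the transfer ... */
gasnet_wait_syncnb(h);              /* remote completion of this Put */

/* Implicit handles: a batch of Puts, synced as a group. */
for (i = 0; i < n; i++)
  gasnet_put_nbi(peer, (char *) rdest + i * sizeof(double),
                 &src[i], sizeof(double));
gasnet_wait_syncnbi_puts();         /* all outstanding implicit-handle Puts */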

2.4.4 Bulk and Nonbulk


GASNet combines the concepts of data alignment and local completion together into the
concept of "bulk" transfers. A "bulk" operation has an extra "_bulk" suffix and imposes
no alignment restrictions for either source or destination addresses. A “nonbulk” operation
imposes a “natural alignment constraint” on both the source and destination addresses.
GASNet uses “natural alignment” to mean that for power-of-two transfer sizes not larger
than the machine word size (4 or 8 bytes) the source and destination addresses must both
be multiples of the transfer size. For sizes larger than the word size or not a power-of-two,
there is no restriction.
In addition to the alignment restriction, the nonblocking nonbulk Put operations delay
their return until local completion (possibly making an internal copy11 ). The nonblocking
bulk Put operations, on the other hand, return as soon as possible without delaying for
local completion. In this case there is no mechanism by which one can determine local
completion independent of syncing the operation for remote completion. Get operations
also have bulk and nonbulk flavors with the corresponding alignment restriction on the
nonbulk version. However, the local completion distinction is absent, since the initiator’s
buffer is the destination, rather than source, of the operation.

2.4.5 Register-Memory and Remote memset Operations


GASNet’s Extended API does have a few odd-ball interfaces that can be very useful in
some cases. In addition to the bulk and nonbulk operations for data, GASNet offers value-
based operations for moving data that fits in a register (up to 4 or 8 bytes, depending on
platform) to or from remote memory. For Puts, blocking and nonblocking variants (with
explicit and implicit handles) are supported. For Gets, there is a blocking variant and an
explicit-handle variant with its own distinct handle type, gasnet_valget_handle_t.
The GASNet API also provides a corresponding call to sync the operation and return the
value: gasnet_wait_syncnb_valget.
GASNet supports blocking and nonblocking (with explicit and implicit handles) remote
memset calls with the same completion semantics as Put. These calls do not have the time
or space overheads of constructing a source buffer initialized to the desired constant value.

2.4.6 Extended API Summary


The following summarizes the various operations in the GASNet Extended API:

11. The README for each conduit includes documentation on protocol switch points that control such behaviors,
and most offer environment variables to adjust them.
Another Random Document on
Scribd Without Any Related Topics
inventions now for defying the ravages of age, for keeping a
youthful bloom on the cheek and a youthful lustre on the
hair. It would be necessary for Mrs Fortescue to look as
charming as ever in order to take her young charges about.
How pleasant it would be to go with them from one gay
assembly to another, to watch their innocent triumphs!

As she lay down in bed on the first night after their arrival
she appraised with a great deal of discernment their
manifest charms. Florence was, of course, the beauty, but
Brenda had a quiet distinction of her own. Her face was full
of intellect. Her eyes full of resource. She was dignified, too,
more so than Florence, who was all sparkling and gay, as
befitted the roses in her cheeks and the flashes of light in
her big brown eyes. Altogether, they were a charming pair,
and when dressed as they ought to be (how Mrs Fortescue
would love that part of her duty!) would do anybody credit.

Mrs Fortescue and the Misses Heathcote! She could hear


their names being announced on the threshold of more than
one notable reception room, could see the eager light in
manly eyes and the deference which would be shown to her
as the chaperone of the young heiresses!

Yes, Mr Timmins’ visit was decidedly welcome. He should


have the very best of receptions.

On the day when Mr Timmins had elected to come it was


Christmas Eve. In consequence, the trains were a little out
of order, and Mrs Fortescue could not tell exactly when he
would arrive.

“He said three o’clock, dears,” she remarked to her young


charges as they sat together at breakfast, the girls wearing
pretty brown dresses which suited their clear complexions
to a nicety. “Now, as a rule, the three o’clock train is in to
the moment, but of course to-day it may be late—in all
probability it will be late. I shall order hot cakes for tea;
Bridget is quite celebrated for her hot cakes. We will have
tea ready for him when he comes. Then when he has had
his chat with me, he will want to say a word or two to you,
Brenda, and you, Florence. You had better not be out of the
way.”

“We thought of going for a good walk,” said Florence. “It is


you, after all, he wants to see, Mrs Fortescue. He never has
had much to say to us, has he?” Here she looked at her
sister.

“No,” said Brenda, thoughtfully. “But,” she added, “when he


wrote to me this time, he said he particularly wanted to see
you and me alone, Flo. He didn’t even mention your name,
Mrs Fortescue.”

“Ah well, dear,” said Mrs Fortescue, with a smile; “that is


quite natural. You have left school, you know.”

“I can’t quite believe it, can you, Brenda?” said Florence. “It
seems just as if we must be going back to the dear old
place.”

“Oh, I don’t know,” said Brenda. “We are not going back:
we said good-bye to every one, don’t you remember?”

“You are never going back, dears, and for my part, I am


glad,” said Mrs Fortescue. “You will be my charge in future;
at least, I hope so.”

The girls were silent, looking hard at her. “As I have taken
care of you since you were quite young girls, you will
naturally wish for my protection until you are both married.”
Brenda was silent. Florence said eagerly—“I mean to marry
as soon as possible.” Here she laughed, showing her pearly
teeth, and a flashing light of anticipated triumph coming
into her eyes.

“Of course you will marry soon, Florence,” said Mrs


Fortescue. “You are far too pretty not to be somebody’s
darling before long. And you, Brenda, also have an
exceedingly attractive face. What are your dreams for the
future, my love?”

“I cannot tell you,” said Brenda.

She got up as she spoke, and walked to the window. After a


time, she said something to her sister, and the girls left the
room arm-in-arm.

Mrs Fortescue felt rather annoyed by their manners. They


were very independent, as independent as though they
were of age; whereas at the present moment they had not
a shilling—no, not a shilling in the world that she did not
supply to them under Mr Timmins’ directions. Were they
going to prove troublesome? She sincerely hoped not. They
were good girls but that house in London might not be quite
so agreeable as her dreams had pictured if Brenda
developed a very strong will of her own and Florence was
determined to marry for the sake of marrying. Still, Mr
Timmins would put all right, and he would be with them at
three o’clock.

The girls absented themselves during the whole of the


morning, but appeared again in time for lunch, which they
ate with a healthy appetite. They praised Mrs Fortescue’s
food, comparing it with what they had at school to the
disadvantage of the latter. Mrs Fortescue was pleased. She
prided herself very much on Bridget’s cooking.
“And now,” she said, when the meal had come to an end,
“you will go upstairs and put on your prettiest dresses and
wait in the drawing-room for Mr Timmins. I shall not be far
off. He will naturally want to see me as soon as he has had
his talk with you both, so I shall remain writing letters in
the dining-room. There are so many letters and cards to
send off at Christmas time that I shall be fully occupied, and
when you touch the bell, Brenda, I shall know what it
means. In any case, I will send tea into the drawing-room
at a quarter to four. That will give you time to get through
your business first, and if you want me to come in and pour
out the tea, I shall know if you will just touch the bell.”

“Thank you,” said Brenda. “But it isn’t half-past one yet,


and the day is a lovely one. Florence and I want to take a
good brisk walk between now and three o’clock. We shall be
back before three. We cannot be mewed up in the house
until Mr Timmins chooses to arrive.”

“Oh, my dear children! He will think it queer.”

“I am sorry,” said Brenda, “but he had no right to choose


Christmas Eve as the day when he was to come to see us.
His train may not be in till late. Anyhow, we want to take
advantage of the sunshine. Come, Florence.”

The girls left the room and soon afterwards were seen going
out arm-in-arm. They walked down the little avenue, and
were lost to view.

There was a certain style about them both. They looked


quite different from the ordinary Langdale girls. Florence
held herself very well, and although she acknowledged
herself to be a beauty, had no self-conscious airs. Brenda’s
sweet face appeared to see beyond the ordinary line of
vision, as though she were always communing with
thoughts deeper and more rare than those given to most.
People turned and looked at the girls as they walked up the
little High Street. Most people knew them, and were
interested in them. They were the very charming young
ladies who always spent their holidays with Mrs Fortescue.
They were, of course, to be included in all the Christmas
parties given at Langdale, and Mrs Fortescue would, as her
custom was, give a party on Twelfth Night in their honour.

That was the usual state of things. The girls did not seem in
the mood, however, to greet their old friends beyond
smiling and nodding to them. As they were returning home,
Brenda said—

“We are more than half an hour late. I wonder if he has


come.”

“Well, if he has, it is all right,” said Florence. “Mrs Fortescue


is dying to have a chat with him all by herself, and she will
have managed to by this time. She will be rather glad, if the
truth may be known, that we are not in to interrupt her. I
can see that she is dying with curiosity.”

“I don’t want her to live with us in the future,” said Brenda.

“But she has set her heart on it,” said Florence.

“I know,” remarked Brenda; “but, all the same, our lives are
our own, and I don’t think we can do with Mrs Fortescue. I
suppose Mr Timmins will tell us what he has decided. We
are not of age yet, either of us. You have three years to
wait, Flo, and I have two.”

“Well, we must do what he wishes,” said Florence. “I intend
to be married ages and ages before I am twenty-one; so
that will be all right.”

While they were coming towards the house, an impatient,
white-headed old lawyer was pacing up and down Mrs
Fortescue’s narrow drawing-room. Mrs Fortescue was sitting
with him and doing her utmost to soothe his impatience.

“Dear Mr Timmins, I am so sorry the girls are out. I quite
thought they would have been back before now.”

“But they knew my train would be in by three o’clock,” said
Mr Timmins.

He was a man of between fifty and sixty years of age,
rather small, with rosy cheeks and irascible eyes. His hair
was abundant and snow-white, white as milk.

“I said three o’clock,” he repeated.

“Yes,” said Mrs Fortescue, “but on Christmas Eve we made
sure your train would be late.”

The lawyer took out his watch.

“Not the special from London; that is never late,” he
remarked. “I want to catch the half-past four back;
otherwise I shall have to go by one of those dreadful slow
trains, and there’s a good deal to talk over. I do think it is a
little careless of those girls not to be at home when they are
expecting me.”

Mrs Fortescue coughed, then she ’hemmed.

“It might—” she began. The lawyer paused in his impatient
walk and stared at her. “It might expedite matters,” she
continued, “if you were to tell me some of your plans. For
instance, I shall quite understand if you wish me to leave
here and take a house in London. It is true the lease of this
house won’t be up for two years, but I have no doubt my
landlord would be open to a consideration.”

“Eh? What is it you were going to say? I don’t want you to
leave your house,” blurted out Mr Timmins. “I have nothing
whatever to do with your future, Mrs Fortescue. You have
been kind to my young friends in the past, but I think I
have—er—er—fully repaid you. And here they come—that is
all right. Now, my dear madam, if you would leave the
young ladies with me—no tea, thank you; I haven’t time for
any—I may be able to get my business through in three-
quarters of an hour. It is only just half-past three. If I leave
here at a quarter-past four, I may catch the express back to
town. Would you be so very kind as to order your servant to
have a cab at the door for me at a quarter-past four—yes,
in three-quarters of an hour I can say all that need be said.
No tea, I beg of you.”

He was really very cross; it was the girls’ doing. Mrs
Fortescue felt thoroughly annoyed. She went into the hall to
meet Brenda and Florence.

“Mr Timmins has been here for nearly twenty minutes. His
train was in sharp at three. He is very much annoyed at
your both being out. Go to him at once, girls—at once.”

“Oh, of course we will,” said Florence. “Who would have
supposed that his train would have been punctual to-day!
Come, Brenda, come.”

They went, just as they were, into the pretty little precise
drawing-room, where a fire was burning cheerily in the
grate, and the room was looking spick and span, everything
dusted and in perfect order, and some pretty vases full of
fresh flowers adding a picturesqueness to the scene. It was
quite a dear little drawing-room, and when the two girls—
Florence with that rich colour which so specially
characterised her, and Brenda a little paler but very sweet-
looking—entered the room, the picture was complete. The
old lawyer lost his sense of irritation. He came forward with
both hands outstretched.

“My dear children,” he said; “my poor children. Sit down; sit
down.”

They were surprised at his address, and Florence began to
apologise for being late; but Brenda made no remark, only
her face turned pale.

“I may as well out with it at once,” said Mr Timmins. “It was
never my wish that it should have been kept from you all
these years, but I only obeyed your parent’s special
instructions. You have left school—”

“Oh yes,” said Florence; “and I am glad. What are we to do
in the future, Daddy Timmins?”

She often called him by that name. He took her soft young
hand and stroked it. There was a husky note in his voice.
He found it difficult to speak. After a minute or two, he said
abruptly—

“Now, children, I will just tell you the very worst at once.
You haven’t a solid, solitary hundred pounds between you in
this wide world. I kept you at school as long as I could.
There is not enough money to pay for another term’s
schooling, but there is enough to pay Mrs Fortescue for your
Christmas holidays, and there will be a few pounds over to
put into each of your pockets. The little money your father
left you will then be quite exhausted.”

“I don’t understand,” said Brenda, after a long time.

Florence was silent—she, who was generally the noisy one.
She was gazing straight before her out into Mrs Fortescue’s
little garden which had a light covering of snow over the
flower-beds, and which looked so pretty and yet so small
and confined. She looked beyond the garden at the line of
the horizon, which showed clear against the frosty air. There
would be a hard frost to-night. Christmas Day would come
in with its old-fashioned splendour. She had imagined all
sorts of things about this special time; Christmas Day in hot
countries, Christmas Day in large country houses,
Christmas Day in her own home, when she had won the
man who would love her, not only for her beauty, but her
wealth. She was penniless. It seemed very queer. It seemed
to contract her world. She could not understand it.

Brenda, who had a stronger nature, began to perceive the
position more quickly.

“Please,” she said—and her young voice had no tremble in it
—“please tell me exactly what this means and why—why we
were neither of us told until now?”

Mr Timmins shrugged his shoulders.

“How old were you, Brenda, when your father and mother
died?” he asked.

“I was fourteen,” she answered, “and Florence was
thirteen.”

“Precisely; you were two little girls: you were relationless.”

“So I have always been told,” said Brenda.

“Your father left a will behind him. He always appeared to
you to be a rich man, did he not?”

“I suppose so,” said Brenda. “I never thought about it.”

“Nor did I,” said Florence, speaking for the first time.

“Well, he was not rich. He lived up to his income. He earned
a considerable amount as a writer.”

“I was very proud of him,” said Brenda.

“When he died,” continued Mr Timmins, taking no notice of
this remark—“you know your mother died first—but when
he died he left a will, giving explicit directions that all his
debts were to be paid in full. There were not many, but
there were some. The remainder of the money was to be
spent on the education of you two girls. I assure you, my
dears, there was not much; but I have brought the
accounts with me for you to see the exact amount realisable
from his estate and precisely how I spent it. I found Mrs
Fortescue willing to give you a home in the holidays, and I
arranged with her that you were to go to her for so much a
week. I chose, by your father’s directions, the very best
possible school to send you to, a school where you would
only meet with ladies, and where you would be educated as
thoroughly as possible. You were to stay on at school and
with Mrs Fortescue until the last hundred pounds of your
money was reached. Then you were to be told the truth:
that you were to face the world. After your fees for your last
term’s schooling have been met and Mrs Fortescue has been
paid for your Christmas holidays, there will be precisely
eighty pounds in the bank to your credit. That money I
think you ought to save for a nest-egg. That is all you
possess. Your father’s idea was that you would live more
happily and work more contentedly if you were allowed to
grow up to the period of adolescence without knowing the
cares and sorrows of the world. He may have been wrong;
doubtless he was; anyhow, there was nothing whatever for
me to do but to obey the will. I came down myself to tell
you. You will have the Christmas holidays in which to
prepare yourselves for the battle of life. You can tell Mrs
Fortescue or not, as you please. She has learned nothing
from me. I think that is about all, except—”

“Yes?” said Florence, speaking for the first time—“except
what?”

“Except that I would like you both—yes, both—to see Lady
Marian Dixie, a very old client of mine, who was a friend of
your mother’s, and I believe, would give you advice, and
perhaps help you to find situations. Lady Marian is in
London, and if you wish it, I will arrange that you shall have
an interview with her. What day would suit you both?”

“Any day,” said Brenda.

Florence was silent.

“Here is a five-pound note between you. It is your own
money—five pounds out of your remaining eighty pounds.
Be very careful of it. I will endeavour to see Lady Marian on
Monday, and will write to you. Ah, there is my cab. You can
tell Mrs Fortescue or not, just as you please. Good-bye now,
my dears, good-bye. I am truly sorry, truly sorry; but those
who work for their own living are not the most unhappy
people, and you are well-educated; your poor father saw to
that. Don’t blame the dead, Brenda. Florence, think kindly
of the dead.”

Chapter Three.
Plans for the Future.

Mrs Fortescue was full of curiosity.

The girls were absolutely silent. She talked with animation
of their usually gay programme for Christmas. The Blundells
and the Arbuthnots and the Aylmers had all invited them to
Christmas parties. Of course they would go. They were to
dine with the Arbuthnots on the following evening. She
hoped the girls had pretty dresses.

“There will be quite a big party,” said Mrs Fortescue. “Major
Reid and his son are also to be there. Michael Reid is a
remarkably clever man. What sort of dresses have you,
girls? Those white ones you wore last summer must be
rather outré now. It was such a pity that I was not able to
get you some really stylish frocks from Madame Aidée in
town.”

“Our white frocks will do very well indeed,” said Florence.

“But you have grown, dear; you have grown up now,” said
Mrs Fortescue. “Oh my love!” She drew her chair a little
closer to the young girl as she spoke. “I wonder what Mr
Timmins meant. He did not seem at all interested in my
house. I expressed so plainly my willingness to give it up
and to take a house in town where we could be all happy
together; but he was very huffy and disagreeable. It was a
sad pity that you didn’t stay in for him. It put him out. I
never knew that Mr Timmins was such an irascible old
gentleman before.”

“He is not; he is a perfect dear,” said Florence.

“Well, Florence, I assure you he was not at all a dear to me.
Still, if he made himself agreeable to you, you two darling
young creatures, I must not mind. I suppose I shan’t see a
great deal of you in the future. I shall miss you, my loves.”

Tears came into the little woman’s eyes. They were genuine
tears, of sorrow for herself but also of affection for the girls.
She would, of course, like to make money by them, but she
also regarded them as belonging to her. She had known
them for so long, and, notwithstanding the fact that she had
been paid for their support, she had been really good to
them. She had given them of those things which money
cannot buy, had sat up with Florence night after night when
she was ill with the measles, and had read herself hoarse in
order to keep that difficult young lady in bed when she
wanted to be up and playing about.

Of the two girls Florence was her darling. She dreamed
much of Florence’s future, of the husband she would win, of
the position she would attain, and of the advantage which
she, Mrs Fortescue, would derive from her young friends—
advancement in the social scale. Beauty was better than
talent; and Florence, as well as being an heiress, was also a
beauty.

It cannot be said that the girls did much justice to Bridget’s
hot cakes. They were both a little stunned, and their one
desire was to get away to their own bedroom to talk over
their changed circumstances, and decide on what course of
action they would pursue with regard to Mrs Fortescue. In
her heart of hearts, Florence would have liked to rush to the
good lady and say impulsively—

“I am a cheat, an impostor. I haven’t a penny in the world.
You will be paid up to the end of the Christmas holidays,
and then you will never see me any more. I have got to
provide my own living somehow. I suppose I’ll manage best
as a nursery governess; but I don’t know anything really
well.”

Brenda, however, would not encourage any such lawless
action.

“We won’t say a word about it,” said Brenda, “until after
Christmas Day.”

She gave forth this mandate when the girls were in their
room preparing for dinner.

“Oh,” said Florence; “it will kill me to keep it a secret for so
long!”

“It won’t kill you,” replied Brenda, “for you will have me to
talk it over with.”

“But she’ll go on asking us questions,” said Florence. “She
will want to know where we are going after the holidays; if
we are going to stay on with her, or what is to happen; and
unless we tell her a lot of lies, I don’t see how we are to
escape telling her the truth. It is all dreadful from first to
last; but I think having to keep it a secret from Mrs
Fortescue is about the most terrible part of all.”

“It is the part you feel most at the present time,” said
Brenda. “It is a merciful dispensation that we cannot realise
everything that is happening just at the moment it happens.
It is only by degrees that we get to realise the full extent of
our calamities.”

“I suppose it is a calamity,” said Florence, opening her
bright eyes very wide. “Somehow, at the present moment I
don’t feel anything at all about it except rather excited; and
there are eighty pounds left. Eighty pounds ought to go far,
oughtn’t they? Oughtn’t they to go far, Brenda?”

“No,” said Brenda; “they won’t go far at all.”

“But I can’t make out why. We could go into small lodgings
and live quite by ourselves and lead the simple life. There is
so much written now about the simple life. I have read
many books lately in which very clever men say that we eat
far too much, and that, after all, what we really need is
abundance of fresh air and so many hours for sleep and
very plain food. I was reading a book not long ago which
described a man who had exactly twenty pounds on which
he intended to live for a whole year. He paid two and
sixpence a week for his room and about as much more for
his food, and he was very healthy and very happy. Now, if
we did the same sort of thing, we could live both of us quite
comfortably for two years on our eighty pounds.”

“And then,” said Brenda, “what would happen at the end of
that time?”

“Oh, I should be married by then,” said Florence, “and you
would come and live with me, of course, you old darling.”

“No; that I wouldn’t,” said Brenda. “I am not at all content
to sit down and wait. I want to do something. As far as I am
concerned, I am rather glad of this chance. I never did care
for what are so-called ‘society pleasures.’ I see now the
reason why I always felt driven to work very hard. You
know father was a great writer. I shall write too. I will make
money by my books, and we will both live together and be
happy. If you find your prince, the man you have made up
your mind to marry, why, you shall marry him. But if you
don’t, I am always there. We will be very careful of our
money, and I will write a book; I think I just know how. I
am not father’s daughter for nothing. The book will be a
success, and I shall get an order for another book, and we
can live somehow. We shall be twenty thousand times
happier than if we were in a house with Mrs Fortescue
looking out for husbands for us—for that is what it comes to
when all is said and done.”

“Oh, you darling! I never thought of that,” said Florence. “It
is perfectly splendid! I never admired you in all my life as I
admire you now, Brenda. Of course, I never thought that
you would be the one to save us from destruction. I used at
times to have a sort of idea within me that perhaps you
would have to come and live with me some day when all
our money was spent. I can’t imagine why I used to think
so often about all our money being spent; but I used to,
only I imagined it would be after I had got my trousseau
and was married to my dear lord, or duke, or marquis—
anyhow, some one with a big place and a title; and I used
to imagine you living with me and being my dear
companion. But this is much, much better than any of those
things.”

“Yes; I think it is better,” said Brenda. “I will think about the
book to-night, and perhaps the title may come to me; but
in the meantime, we are not to tell Mrs Fortescue—not at
least till Christmas Day is over; and we’ve got to take out
our white dresses and get them ironed, and see that they
look as fresh as possible. Now, we mustn’t stay too long in
our room: she is dying with curiosity, but she can’t possibly
guess the truth.”

“No; she couldn’t guess the truth, that would be beyond her
power,” said Florence. “The truth is horrible, and yet
delightful. We are our own mistresses, aren’t we, Brenda?”

“As far as the eighty pounds go,” replied Brenda.

“What I was so terrified about,” said the younger sister,
“was this. I thought we should have to go as governesses or
companions, or something of that sort, in big houses and be
—be parted.” Her lips trembled.

“Oh no; we won’t be parted,” said Brenda; “but all the
same, we’ll have to go to see Lady Marian Dixie—that is,
when she writes to ask us. Now may I brush your hair for
you? I want you to look your very prettiest self to-night.”

The white frocks were ironed by Bridget’s skilful fingers. It
is true, they were only the sort of dresses worn by
schoolgirls, but they were quite pretty, and of the very best
material. They were somewhat short for the two tall girls,
and Brenda smiled at herself when she saw her dress,
which only reached a trifle below her ankles. As to Florence,
she skipped about the room in hers. She was in wonderfully
high spirits. For girls who had been brought up as heiresses,
and who expected all the world to bow before them, this
was extraordinary. And now it was borne in upon her that
she had only forty pounds in the world, not even quite that,
for already a little of the five pounds advanced by Mr
Timmins had been spent. Mrs Fortescue insisted upon it.
She said, “You ought to wear real flowers; I will order some
for you at the florist’s round the corner.”

Now flowers at Christmas time are expensive, but Florence
was reckless and ordered roses and lilies of the valley.
Brenda looked unutterable things, but after opening her lips
as though to speak, decided to remain silent. Why should
not Florence have her pretty way for once? She looked at
her sister with great admiration. She thought again of her
beauty, which was of the sort which can scarcely be
described, and deals more with expression than feature.
Wherever this girl went, her bright eyes did their own work.
They drew people towards them as towards a magnet. Her
charming manners effected the rest of the fascination. She
was not self-conscious either, so that women liked her as
much as men did.

But now Christmas Day had really come, and Mrs Fortescue,
in the highest of high spirits, accompanied her young
charges to Colonel Arbuthnot’s house. Year by year, the girls
had eaten their Christmas dinner at the old Colonel’s house,
which was known by the commonplace name of The
Grange. It was a corner house in Langdale, abutting straight
on to the street, but evidently at one time there had been a
big garden in front, and just before the hall door was an
enormous oak tree, which spread its shadows over the low
stone steps in summer, and caused the dining-room
windows which faced the street to be cool even in the
hottest weather.

At the back of the house was a glorious old garden. No one
had touched that. It measured nearly three acres. It had its
walled-in enclosure, its small paddock, and its wealth of
flower garden. The flowers, as far as Florence and Brenda
could make out, seemed to grow without expense or
trouble, for Colonel Arbuthnot was not a rich man, and
could not even afford a gardener every day, but he worked
a good deal himself, and was helped by his daughter Susie,
a buxom, rather matronly young woman of six or seven and
thirty. The girls liked Susie very much, although they
considered her quite an old maid.

No; Colonel Arbuthnot was by no means rich—that is, as far
as money is concerned; but he possessed other riches—the
riches of a brave and noble heart. He was straight as a die
in all his dealings with his fellow-men. He had a good deal
of penetration of character, and had long ago taken a fancy
to Mrs Fortescue’s young charges. It did not matter in the
least to him whether the girls were heiresses or not. They
were young. They were both, in his opinion, pretty. He liked
young and pretty creatures, and the idea of sitting down to
his Christmas dinner without these additions to his party
would have annoyed him very much.

Colonel Arbuthnot’s one extravagance in the year was his
Christmas dinner. He invited all those people to it who
otherwise might have to do without roast beef and plum
pudding. There were a good many such in the little town of
Langdale. It was a remote place, far from the world, and no
one was wealthy there. Money went far in a little place of
the sort, and the Colonel always saved several pounds out
of his income in order to give Susie plenty of money to pay
for a great joint at the butcher’s, and to make the old-
fashioned plum pudding, also to prepare the mince pies by
the old receipt, and to wind up by a sumptuous dessert.

It was on these rare occasions that the people who came to
The Grange saw the magnificent silver which Colonel
Arbuthnot possessed. It was kept wrapped up in paper and
baize during the remainder of the year: for Susie said
frankly that she could not keep it clean; what with the
garden and helping the young servant, she had no time for
polishing silver. Accordingly, she just kept out a few silver
spoons and forks for family use and locked the rest up.

But Christmas Day was a great occasion. Christmas Day
saw the doors flung wide, and hospitality reigning supreme.
The Colonel put on his best dinner coat. He had worn it on
more than one auspicious occasion at more than one
famous London club. But it never seemed to grow the least
bit old-fashioned. He always put a sprig of holly with the
berries on it in his button-hole, and would not change this
symbol of Christmas for any flower that could be presented
to him.

As to Susie, she also had one dinner dress which appeared
on these auspicious occasions, and only then. It was made
of a sort of grey “barège,” and had belonged to her mother.
It had been altered to fit her somewhat abundant
proportions, and it was lined with silk. That was what Susie
admired so much about it. The extravagance of silk lining
gave her, as she expressed it, “a sense of aristocracy.” She
said she felt much more like a lady with a silk lining in her
dress than if she wore a silk dress itself with a cotton lining.

“There is something pompous and ostentatious about the
latter,” she said, “whereas the former shows a true lady.”

She constantly moved about the room in order that the
rustle of the silk might be heard, and occasionally, in a fit of
absence—or apparent absence—she would lift the skirt so
as to show the silk lining. The dress itself was exceedingly
simple; but that did not matter at all to Susie. She wore it
low in the neck and short in the sleeves; and it is true that
she sometimes rather shivered with cold; for on no other
day in the remaining three hundred and sixty-four did she
dream of putting on a low dress. In the front of the dress
she wore her mother’s diamond brooch—a treasure from
the past, which alone she felt gave her distinction; and
round her neck she had a string of old pearls, somewhat
yellow with age, but very genuine and very good.

Susie’s hair was turning slightly grey and was somewhat
thin, but then she never remembered her hair at all, nor her
honest, flushed, reddish face, hardened by exposure to all
sorts of weather, but very healthy withal.

From the moment she entered the drawing-room to receive
her guests, she never gave Susie Arbuthnot a thought,
except in the very rare moments when she rustled her grey
barège in order to let her visitors know that the lining was