Parallel Programming
Victor Eijkhout
CONTENTS

I MPI 15
II OpenMP 433
VI Tutorials 721

49 Debugging 723
49.1 Step 0: compiling for debug 723
49.2 Invoking gdb 724
49.3 Finding errors 725
49.4 Memory debugging with Valgrind 727
49.5 Stepping through a program 728
49.6 Inspecting values 729
49.7 Parallel debugging 729
49.8 Further reading 730
51 SimGrid 740
61 Bibliography 792
MPI
This section of the book teaches MPI (‘Message Passing Interface’), the dominant model for distributed
memory programming in science and engineering. It will instill the following competencies.
Basic level:
• The student will understand the SPMD model and how it is realized in MPI (chapter 2).
• The student will know the basic collective calls, both with and without a root process, and can
use them in applications (chapter 3).
• The student knows the basic blocking and non-blocking point-to-point calls, and how to use
them (chapter 4).
Intermediate level:
• The student knows about derived datatypes and can use them in communication routines
(chapter 6).
• The student knows about intra-communicators, and some basic calls for creating subcommuni-
cators (chapter 7); also Cartesian process topologies (section 11.1).
• The student understands the basic design of MPI I/O calls and can use them in basic applications
(chapter 10).
• The student understands about graph process topologies and neighborhood collectives (sec-
tion 11.2).
Advanced level:
• The student understands one-sided communication routines, including window creation rou-
tines, and synchronization mechanisms (chapter 9).
• The student understands MPI shared memory, its advantages, and how it is based on windows
(chapter 12).
• The student understands MPI process spawning mechanisms and inter-communicators (chap-
ter 8).
In this chapter you will learn the use of the main tool for distributed memory programming: the Message
Passing Interface (MPI) library. The MPI library has about 250 routines, many of which you may never
need. Since this is a textbook, not a reference manual, we will focus on the important concepts and give the
important routines for each concept. What you learn here should be enough for most common purposes.
You are advised to keep a reference document handy, in case there is a specialized routine, or to look up
subtleties about the routines you use.
1. Getting started with MPI
1.2 History
Before the MPI standard was developed in 1993-4, there were many libraries for distributed memory
computing, often proprietary to a vendor platform. MPI standardized the inter-process communication
mechanisms. Other features, such as process management in PVM, or parallel I/O were omitted. Later
versions of the standard have included many of these features.
Since MPI was designed by a large number of academic and commercial participants, it quickly became a
standard. A few packages from the pre-MPI era, such as Charmpp [16], are still in use since they support
mechanisms that do not exist in MPI.
1. A command variant is mpirun; your local cluster may have a different mechanism.
You see that in both scenarios the parallel program is started by the mpiexec command using a Single
Program Multiple Data (SPMD) mode of execution: all hosts execute the same program. It is possible for
different hosts to execute different programs, but we will not consider that in this book.
There can be options and environment variables that are specific to some MPI installations, or to the
network.
• mpich and its derivatives such as Intel MPI or Cray MPI have mpiexec options: https://www.
mpich.org/static/docs/v3.1/www1/mpiexec.html
Remark 1 In OpenMPI, these commands are binary executables by default, but you can make them shell
scripts by passing the --enable-script-wrapper-compilers option at configure time.
MPI programs can be run on many different architectures. Obviously it is your ambition (or at least your
dream) to run your code on a cluster with a hundred thousand processors and a fast network. But maybe
you only have a small cluster with plain ethernet. Or maybe you’re sitting in a plane, with just your laptop.
An MPI program can be run in all these circumstances – within the limits of your available memory of
course.
The way this works is that you do not start your executable directly, but you use a program, typically
called mpiexec or something similar, which makes a connection to all available processors and starts a
run of your executable there. So if you have a thousand nodes in your cluster, mpiexec can start your
program once on each, and if you only have your laptop it can start a few instances there. In the latter
case you will of course not get great performance, but at least you can test your code for correctness.
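For concreteness, here is a minimal sketch of such an executable; it is an illustration, not one of the book's example codes, and assumes only a working MPI installation. You would compile it with mpicc (or your local wrapper) and start it with, for instance, mpiexec -n 4 ./hello.

// hello.c (illustration): a minimal program to start with mpiexec
#include <stdio.h>
#include <mpi.h>

int main(int argc,char **argv) {
  MPI_Init(&argc,&argv);

  int nprocs,procno;
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&procno);
  printf("Hello from process %d out of %d\n",procno,nprocs);

  MPI_Finalize();
  return 0;
}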
Python note 1: running mpi4py programs. Load the TACC-provided python:
module load python
and run it as:
ibrun python-mpi yourprogram.py
1.5.3 Fortran
Fortran note 1: formatting of fortran notes. Fortran-specific notes will be indicated with a note like this.
Traditionally, Fortran bindings for MPI look very much like the C ones, except that each routine has a final
error return parameter. You will find that a lot of MPI code in Fortran conforms to this.
However, in the MPI 3 standard it is recommended that an MPI implementation providing a Fortran in-
terface provide a module named mpi_f08 that can be used in a Fortran program. This incorporates the
following improvements:
• This defines MPI routines to have an optional final parameter for the error.
• There are some visible implications of using the mpi_f08 module, mostly related to the fact
that some of the ‘MPI datatypes’ such as MPI_Comm, which were declared as Integer previously,
are now a Fortran Type. See the following sections for details: Communicator 7.1, Datatype 6.1,
Info 15.1.1, Op 3.10.2, Request 4.2.1, Status 4.3.2, Window 9.1.
• The mpi_f08 module solves a problem with previous Fortran90 bindings: Fortran90 is a strongly
typed language, so it is not possible to pass argument by reference to their address, as C/C++ do
with the void* type for send and receive buffers. This was solved by having separate routines
for each datatype, and providing an Interface block in the MPI module. If you manage to
request a version that does not exist, the compiler will display a message like
There is no matching specific subroutine for this generic subroutine call [MPI_Send]
For details see http://mpi-forum.org/docs/mpi-3.1/mpi31-report/node409.htm.
1.5.4 Python
Python note 2: python notes. Python-specific notes will be indicated with a note like this.
The mpi4py package [5] of python bindings is not defined by the MPI standards committee. Instead, it is
the work of an individual, Lisandro Dalcin.
In a way, the Python interface is the most elegant. It uses Object-Oriented (OO) techniques such as meth-
ods on objects, and many default arguments.
Notable about the Python bindings is that many communication routines exist in two variants:
• a version that can send arbitrary Python objects. These routines have lowercase names such as
bcast; and
• a version that sends numpy objects; these routines have names such as Bcast. Their syntax can
be slightly different.
The first version looks more ‘pythonic’, is easier to write, and can do things like sending python ob-
jects, but it is also decidedly less efficient since data is packed and unpacked with pickle. As a common
sense guideline, use the numpy interface in the performance-critical parts of your code, and the pythonic
interface only for complicated actions in a setup phase.
Codes with mpi4py can be interfaced to other languages through Swig or conversion routines.
Data in numpy can be specified as a simple object, or [data, (count,displ), datatype].
1.5.5.1 C
The typical C routine specification in MPI looks like:
int MPI_Comm_size(MPI_Comm comm,int *nprocs)
However, the error codes are hardly ever useful, and there is not much your program can do to
recover from an error. Most people call the routine as
MPI_Comm_size( /* parameter ... */ );
• Finally, there is a ‘star’ parameter. This means that the routine wants an address, rather than a
value. You would typically write:
MPI_Comm my_comm = MPI_COMM_WORLD; // using a predefined value
int nprocs;
MPI_Comm_size( comm, &nprocs );
Seeing a ‘star’ parameter usually means either: the routine has an array argument, or: the rou-
tine internally sets the value of a variable. The latter is the case here.
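As a hedged illustration of the error code discussion above (not code from the book), here are both styles of calling this routine. The explicit test only makes sense if the communicator's error handler has been set to MPI_ERRORS_RETURN, since the default handler aborts the run on error.

// illustration: the common style, ignoring the error code
int nprocs;
MPI_Comm_size(MPI_COMM_WORLD,&nprocs);

// illustration: testing the error code explicitly
int ierr = MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
if (ierr!=MPI_SUCCESS)
  MPI_Abort(MPI_COMM_WORLD,ierr);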
1.5.5.2 Fortran
The Fortran specification looks like:
The syntax of using this routine is close to this specification: you write
Type(MPI_Comm) :: comm = MPI_COMM_WORLD
! legacy: Integer :: comm = MPI_COMM_WORLD
Integer :: size,ierr
CALL MPI_Comm_size( comm, size ) ! without the optional ierr
• Most Fortran routines have the same parameters as the corresponding C routine, except that
they all have the error code as final parameter, instead of as a function result. As with C, you
can ignore the value of that parameter. Just don’t forget it.
• The types of the parameters are given in the specification.
• Where C routines have MPI_Comm and MPI_Request and such parameters, Fortran has INTEGER
parameters, or sometimes arrays of integers.
1.5.5.3 Python
The Python interface to MPI uses classes and objects. Thus, a specification like:
MPI.Comm.Send(self, buf, int dest, int tag=0)
• Next, you need a Comm object. Often you will use the predefined communicator
comm = MPI.COMM_WORLD
• The keyword self indicates that the actual routine Send is a method of the Comm object, so you
call:
comm.Send( .... )
• Parameters that are listed by themselves, such as buf, are positional. Parameters that are listed
with a type, such as int dest are keyword parameters. Keyword parameters that have a value
specified, such as int tag=0 are optional, with the default value indicated. Thus, the typical
call for this routine is:
comm.Send(sendbuf,dest=other)
specifying the send buffer as positional parameter, the destination as keyword parameter, and
using the default value for the optional tag.
Some python routines are ‘class methods’, and their specification lacks the self keyword. For instance:
MPI.Request.Waitall(type cls, requests, statuses=None)
would be used as
MPI.Request.Waitall(requests)
1.6 Review
Review 1.1. What determines the parallelism of an MPI job?
1. The size of the cluster you run on.
2. The number of cores per cluster node.
3. The parameters of the MPI starter (mpiexec, ibrun,…)
Review 1.2. T/F: the number of cores of your laptop is the limit of how many MPI processes
you can start up.
Review 1.3. Do the following languages have an object-oriented interface to MPI? In what
sense?
1. C
2. C++
3. Fortran2008
4. Python
Chapter 2
MPI topic: Functional parallelism
more than one process, using the time slicing of the Operating System (OS), but that would give you no
extra performance.
These days the situation is more complicated. You can still talk about a node in a cluster, but now a node
can contain more than one processor chip (sometimes called a socket), and each processor chip probably
has multiple cores. Figure 2.2 shows how you could explore this using a mix of MPI between the nodes,
and a shared memory programming system on the nodes.
However, since each core can act like an independent processor, you can also have multiple MPI processes
per node. To MPI, the cores look like the old completely separate processors. This is the ‘pure MPI’ model
of figure 2.3, which we will use in most of this part of the book. (Hybrid computing will be discussed in
chapter 45.)
This is somewhat confusing: the old processors needed MPI programming, because they were physically
separated. The cores on a modern processor, on the other hand, share the same memory, and even some
caches. In its basic mode MPI seems to ignore all of this: each core receives an MPI process and the
programmer writes the same send/receive call no matter where the other process is located. In fact, you
can’t immediately see whether two cores are on the same node or different nodes. Of course, on the
implementation level MPI uses a different communication mechanism depending on whether cores are
on the same socket or on different nodes, so you don’t have to worry about lack of efficiency.
Remark 2 In some rare cases you may want to run in a Multiple Program Multiple Data (MPMD) mode,
rather than SPMD. This can be achieved either on the OS level (see section 15.9.4), using options of the mpiexec
mechanism, or you can use MPI’s built-in process management; chapter 8. Like I said, this concerns only rare
cases.
2.2.1 Headers
If you use MPI commands in a program file, be sure to include the proper header file, mpi.h or mpif.h.
#include "mpi.h" // for C
#include "mpif.h" ! for Fortran
For Fortran90, many MPI installations also have an MPI module, so you can write
use mpi ! pre 3.0
use mpi_f08 ! 3.0 standard
The internals of these files can differ between MPI installations, so you can not compile one file against
one mpi.h and another file against a different MPI's header, even with the same compiler on the same
machine, and link them together.
Fortran note 2: new developments only in f08 module. New language developments, such as large counts;
section 6.4.2 will only be included in the mpi_f08 module, not in the earlier mechanisms.
Python note 3: import mpi module. It's easiest to add
from mpi4py import MPI
to your file.
Fortran (before 2008) lacks this commandline argument handling, so MPI_Init lacks those arguments.
After MPI_Finalize no MPI routines (with a few exceptions such as MPI_Finalized) can be called. In partic-
ular, it is not allowed to call MPI_Init again. If you want to do that, use the sessions model; section 8.3.
Python note 4: initialize-finalize. In many cases, no initialize and finalize calls are needed: the statement
## mpi.py
from mpi4py import MPI
performs the initialization. Likewise, the finalize happens at the end of the program.
However, for special cases, there is an mpi4py.rc object that can be set in between importing
mpi4py and importing mpi4py.MPI:
import mpi4py
mpi4py.rc.initialize = False
mpi4py.rc.finalize = False
from mpi4py import MPI
MPI.Init()
# stuff
MPI.Finalize()
Remark 3 For hybrid MPI-plus-threads programming there is also a call MPI_Init_thread. For that, see sec-
tion 13.1.
Apart from MPI_Finalize, which signals a successful conclusion of the MPI run, an abnormal end to a run
can be forced by MPI_Abort (figure 2.3). This stops execution on all processes associated with the commu-
nicator, but many implementations will terminate all processes. The value parameter is returned to the
environment.
Code:
// return.c
MPI_Abort(MPI_COMM_WORLD,17);

Output:
mpicc -o return return.o
mpirun -n 1 ./return ; \
  echo "MPI program return code $?"
application called MPI_Abort(MPI_COMM_WORLD, 17) -
MPI program return code 17
The MPI_Init routine takes a reference to argc and argv for the following reason: the MPI_Init call
filters out the arguments to mpirun or mpiexec, thereby lowering the value of argc and eliminating some
of the argv arguments.
On the other hand, the commandline arguments that are meant for mpiexec wind up in the MPI_INFO_ENV
object as a set of key/value pairs; see section 15.1.1.
In the following exercise you will print out the hostname of each MPI process with MPI_Get_processor_name
(figure 2.6) as a first way of distinguishing between processes. This routine has a character buffer argu-
ment, which needs to be allocated by you. The length of the buffer is also passed, and on return that
parameter has the actually used length. The maximum needed length is MPI_MAX_PROCESSOR_NAME.
Code:
// procname.c
int name_length = MPI_MAX_PROCESSOR_NAME;
char proc_name[name_length];
MPI_Get_processor_name(proc_name,&name_length);
printf("This process is running on node <<%s>>\n",proc_name);

Output:
make[3]: `procname' is up to date.
TACC: Starting up job 4328411
TACC: Starting parallel tasks...
This process is running on node <<c205-036.frontera
This process is running on node <<c205-036.frontera
This process is running on node <<c205-036.frontera
This process is running on node <<c205-036.frontera
This process is running on node <<c205-036.frontera
This process is running on node <<c205-035.frontera
This process is running on node <<c205-035.frontera
This process is running on node <<c205-035.frontera
This process is running on node <<c205-035.frontera
This process is running on node <<c205-035.frontera
TACC: Shutdown complete. Exiting.
2.3.2 Communicators
First we need to introduce the concept of communicator, which is an abstract description of a group of
processes. For now you only need to know about the existence of the MPI_Comm data type, and that there
is a pre-defined communicator MPI_COMM_WORLD which describes all the processes involved in your parallel
run.
In a procedural language such as C, a communicator is a variable that is passed to most routines:
#include <mpi.h>
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Send( /* stuff */ comm );
Fortran note 3: communicator type. In Fortran, pre-2008 a communicator was an opaque handle, stored in
an Integer. With Fortran 2008, communicators are derived types:
use mpi_f08
Type(MPI_Comm) :: comm = MPI_COMM_WORLD
call MPI_Send( ... comm )
Python note 5: communicator objects. In object-oriented languages, a communicator is an object, and rather
than passing it to routines, the routines are often methods of the communicator object:
from mpi4py import MPI
comm = MPI.COMM_WORLD
comm.Send( buffer, target )
MPL note 5: world communicator. The naive way of declaring a communicator would be:
// commrank.cxx
mpl::communicator comm_world =
mpl::environment::comm_world();
mpl::communicator copy =
mpl::environment::comm_world();
cout << "copy: " << boolalpha << (comm==copy) << endl;
// correct!
void comm_ref( const mpl::communicator &comm );
MPI.Comm.Get_size(self)
MPI.Comm.Get_rank(self)
If every process executes the MPI_Comm_size call, they all get the same result, namely the total number of
processes in your run. On the other hand, if every process executes MPI_Comm_rank, they all get a different
result, namely their own unique number, an integer in the range from zero to the number of processes
minus 1. See figure 2.5. In other words, each process can find out ‘I am process 5 out of a total of 20’.
Exercise 2.4. Write a program where each process prints out a message reporting its
number, and how many processes there are:
Hello from process 2 out of 5!
Write a second version of this program, where each process opens a unique file and
writes to it. On some clusters this may not be advisable if you have large numbers of
processors, since it can overload the file system.
(There is a skeleton for this exercise under the name commrank.)
Exercise 2.5. Write a program where only the process with number zero reports on how
many processes there are in total.
In object-oriented approaches to MPI, that is, mpi4py and MPL, the MPI_Comm_rank and MPI_Comm_size rou-
tines are methods of the communicator class:
Python note 6: communicator rank and size. Rank and size are methods of the communicator object. Note
that their names are slightly different from the MPI standard names.
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
MPL note 8: rank and size. The rank of a process (by mpl::communicator::rank) and the size of a commu-
nicator (by mpl::communicator::size) are both methods of the communicator class:
const mpl::communicator &comm_world =
mpl::environment::comm_world();
int procid = comm_world.rank();
int nprocs = comm_world.size();
Practice is a little more complicated than this. But we will start exploring this notion of processes deciding
on their activity based on their process number.
Being able to tell processes apart is already enough to write some applications, without knowing any
other MPI. We will look at a simple parallel search algorithm: based on its rank, a processor can find
its section of a search space. For instance, in Monte Carlo codes a large number of random samples is
generated and some computation performed on each. (This particular example requires each MPI process
to run an independent random number generator, which is not entirely trivial.)
Exercise 2.6. Is the number 𝑁 = 2, 000, 000, 111 prime? Let each process test a disjoint set
of integers, and print out any factor they find. You don’t have to test all
integers < 𝑁 : any factor is at most √𝑁 ≈ 45, 200.
(Hint: i%0 probably gives a runtime error.)
Can you find more than one solution?
(There is a skeleton for this exercise under the name prime.)
Remark 4 Normally, we expect parallel algorithms to be faster than sequential. Now consider the above
exercise. Suppose the number we are testing is divisible by some small prime number, but every process has a
large block of numbers to test. In that case the sequential algorithm would have been faster than the parallel
one. Food for thought.
As another example, in Boolean satisfiability problems a number of points in a search space needs to be
evaluated. Knowing a process’s rank is enough to let it generate its own portion of the search space. The
computation of the Mandelbrot set can also be considered as a case of functional parallelism. However,
the image that is constructed is data that needs to be kept on one processor, which breaks the symmetry
of the parallel run.
Of course, at the end of a functionally parallel run you need to summarize the results, for instance printing
out some total. The mechanisms for that you will learn next.
and fill it so that process 0 has the integers 0 ⋯ 9, process 1 has 10 ⋯ 19, et cetera.
It may be hard to print the output in a non-messy way.
If the array size is not perfectly divisible by the number of processors, we have to come up with a division
that is uneven, but not too much. You could for instance, write
int Nglobal, // is something large
Nlocal = Nglobal/ntids,
excess = Nglobal%ntids;
if (mytid==ntids-1)
Nlocal += excess;
Exercise 2.8. Argue that this strategy is not optimal. Can you come up with a better
distribution? Load balancing is further discussed in HPC book, section-2.10.
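One commonly used alternative, sketched here as an illustration rather than as the book's solution, spreads the excess over all processes by computing each process's index range directly, using the variables of the snippet above:

// illustration: process "mytid" owns indices [myfirst,mylast)
int myfirst = (int)( (long)Nglobal* mytid   /ntids ),
    mylast  = (int)( (long)Nglobal*(mytid+1)/ntids ),
    Nlocal  = mylast-myfirst;
// every process now has either Nglobal/ntids or Nglobal/ntids+1 elements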
#include <cstdlib>
#include <iostream>
#include <mpl/mpl.hpp>
int main() {
#if 1
mpl::communicator comm_world =
mpl::environment::comm_world();
#else
const mpl::communicator &comm_world =
mpl::environment::comm_world();
#endif
std::cout << "Hello world! I am running on \""
<< mpl::environment::processor_name()
<< "\". My rank is "
<< comm_world.rank()
<< " out of "
<< comm_world.size() << " processes.\n" << std::endl;
return EXIT_SUCCESS;
}
#include <mpl/mpl.hpp>
#include <iostream>
using namespace std;
int main() {
  // (assumed for this fragment:) a reference to the world communicator
  const mpl::communicator &comm =
    mpl::environment::comm_world();
  mpl::communicator copy =
    mpl::environment::comm_world();
  cout << "copy: " << boolalpha << (comm==copy) << endl;
{
mpl::communicator init;
// WRONG: init = mpl::environment::comm_world();
// error: overload resolution selected deleted operator '='
}
auto eq = comm.compare(copy);
cout << static_cast<int>(eq) << endl;
return EXIT_SUCCESS;
}
A certain class of MPI routines are called ‘collective’, or more correctly: ‘collective on a communicator’.
This means that if one process in that communicator calls that routine, all processes in it need to call it.
In this chapter we will discuss collective routines that are about combining the data on all processes in
that communicator, but there are also operations such as opening a shared file that are collective, which
will be discussed in a later chapter.
2. Each process computes a random number again. Now you want to scale these
numbers by their maximum.
3. Let each process compute a random number. You want to print on what
processor the maximum value is computed.
Think about time and space complexity of your suggestions.
\sigma = \sqrt{ \frac{1}{N-1} \sum_i^N (x_i-\mu)^2 }
  \qquad\text{where}\qquad
  \mu = \frac{\sum_i^N x_i}{N}
and assume that every process stores just one 𝑥𝑖 value.
1. The calculation of the average 𝜇 is a reduction, since all the distributed values need to be added.
2. Now every process needs to compute 𝑥𝑖 − 𝜇 for its value 𝑥𝑖 , so 𝜇 is needed everywhere. You can
compute this by doing a reduction followed by a broadcast, but it is better to use a so-called
allreduce operation, which does the reduction and leaves the result on all processors.
3. The calculation of ∑𝑖 (𝑥𝑖 − 𝜇)² is another sum of distributed data, so we need another reduction
operation. Depending on whether each process needs to know 𝜎 , we can again use an allreduce.
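A minimal sketch of these three steps in C; this is an illustration, not one of the book's codes, and it assumes an initialized communicator comm, the process count nprocs, one value x per process, and math.h for the square root.

// illustration: standard deviation of one value per process
float x = 1.f; // this process's value
float sum,mu,dev2,sumdev2,sigma;

// steps 1 and 2: compute the average, and leave it on every process
MPI_Allreduce(&x,&sum,1,MPI_FLOAT,MPI_SUM,comm);
mu = sum/nprocs;

// step 3: sum of squared deviations, again left everywhere
dev2 = (x-mu)*(x-mu);
MPI_Allreduce(&dev2,&sumdev2,1,MPI_FLOAT,MPI_SUM,comm);
sigma = sqrtf( sumdev2/(nprocs-1) );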
3.1.2 Synchronization
Collectives are operations that involve all processes in a communicator. A collective is a single call, and
it blocks on all processors, meaning that a process calling a collective cannot proceed until the other
processes have similarly called the collective.
That does not mean that all processors exit the call at the same time: because of implementational details
and network latency they need not be synchronized in their execution. However, semantically we can say
that a process can not finish a collective until every other process has at least started the collective.
In addition to these collective operations, there are operations that are said to be ‘collective on their
communicator’, but which do not involve data movement. Collective then means that all processors must
call this routine; not to do so is an error that will manifest itself in ‘hanging’ code. One such example is
MPI_File_open.
3.2 Reduction
3.2.1 Reduce to all
Above we saw a couple of scenarios where a quantity is reduced, with all processes getting the result. The
MPI call for this is MPI_Allreduce (figure 3.1).
Example: we give each process a random number, and sum these numbers together. The result should be
approximately 1/2 times the number of processes.
// allreduce.c
float myrandom,sumrandom;
myrandom = (float) rand()/(float)RAND_MAX;
// add the random variables together
MPI_Allreduce(&myrandom,&sumrandom,
1,MPI_FLOAT,MPI_SUM,comm);
// the result should be approx nprocs/2:
if (procno==nprocs-1)
printf("Result %6.9f compared to .5\n",sumrandom/nprocs);
Or:
MPI_Count buffersize = 1000;
double *indata,*outdata;
indata = (double*) malloc( buffersize*sizeof(double) );
outdata = (double*) malloc( buffersize*sizeof(double) );
MPI_Allreduce_c(indata,outdata,buffersize,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
Remark 5 Routines with both a send and receive buffer should not alias these. Instead, see the discussion of
MPI_IN_PLACE; section 3.3.2.
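As a preview of that discussion, the following sketch (an illustration, assuming an initialized comm) uses a single buffer by passing MPI_IN_PLACE as the send buffer:

// illustration: in-place allreduce, overwriting the data with the result
double data[10];
for (int i=0; i<10; i++) data[i] = 1.;
MPI_Allreduce(MPI_IN_PLACE,data,10,MPI_DOUBLE,MPI_SUM,comm);
// every process now has data[i] equal to the number of processes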
\xi = \sum_i x_i, \qquad x_i' \leftarrow x_i/\xi, \qquad \xi' = \sum_i x_i'
Exercise 3.5. The Gram-Schmidt method is a simple way to orthogonalize two vectors:
u \leftarrow u - \frac{u^tv}{u^tu}\,v
Implement this, and check that the result is indeed orthogonal.
Suggestion: fill 𝑣 with the values sin 2𝑛ℎ𝜋 where 𝑛 = 2𝜋/𝑁 , and 𝑢 with
sin 2𝑛ℎ𝜋 + sin 4𝑛ℎ𝜋. What does 𝑢 become after orthogonalization?
But for arrays we use the fact that arrays and addresses are more-or-less equivalent in:
float xx[2],yy[2];
MPI_Allreduce( xx,yy,2,MPI_FLOAT, ... );
but that is not necessary. The compiler will not complain if you leave out the cast.
C++ note 1: buffer treatment. Treatment of scalars in C++ is the same as in C. However, for arrays you
have the choice between C-style arrays, and std::vector or std::array. For the latter there are
two ways of dealing with buffers:
vector<float> xx(25);
MPI_Send( xx.data(),25,MPI_FLOAT, .... );
MPI_Send( &xx[0],25,MPI_FLOAT, .... );
Fortran note 4: mpi send-recv buffers. In Fortran parameters are always passed by reference, so the buffer
is treated the same way:
Real*4 :: x
Real*4,dimension(2) :: xx
call MPI_Allreduce( x,1,MPI_REAL4, ... )
call MPI_Allreduce( xx,2,MPI_REAL4, ... )
In discussing OO languages, we first note that the official C++ Application Programmer Interface (API)
has been removed from the standard.
Python note 7: buffers from numpy. Most MPI routines in Python have both a variant that can send arbi-
trary Python data, and one that is based on numpy arrays. The former looks the most ‘pythonic’,
and is very flexible, but is usually demonstrably inefficient.
## allreduce.py
random_number = random.randint(1,random_bound)
# native mode send
max_random = comm.allreduce(random_number,op=MPI.MAX)
In the numpy variant, all buffers are numpy objects, which carry information about their type
and size. For scalar reductions this means we have to create an array for the receive buffer, even
though only one element is used.
myrandom = np.empty(1,dtype=int)
myrandom[0] = random_number
allrandom = np.empty(nprocs,dtype=int)
# numpy mode send
comm.Allreduce(myrandom,allrandom[:1],op=MPI.MAX)
Python note 8: buffers from subarrays. In many examples you will pass a whole Numpy array as send/re-
ceive buffer. Should you want to use a buffer that corresponds to a subset of an array, you can use
the following notation:
Code:
## bcastcolumn.py
datatype = np.intc
elementsize = datatype().itemsize
typechar = datatype().dtype.char
buffer = np.zeros( [nprocs,nprocs], dtype=datatype)
buffer[:,:] = -1
for proc in range(nprocs):
    if procid==proc:
        buffer[proc,:] = proc
    comm.Bcast\
        ( [ np.frombuffer\
            ( buffer.data,
              dtype=datatype,
              offset=(proc*nprocs+proc)*elementsize ),
            nprocs-proc, typechar ],
          root=proc )

Output:
int size: 4
i
[[ 0 0 0 0 0 0]
 [-1 1 1 1 1 1]
 [-1 -1 2 2 2 2]
 [-1 -1 -1 3 3 3]
 [-1 -1 -1 -1 4 4]
 [ 5 5 5 5 5 5]]
MPL note 10: scalar buffers. Buffer type handling is done through polymorphism and templating: no
explicit indication of types.
Scalars are handled as such:
float x,y;
comm.bcast( 0,x ); // note: root first
comm.allreduce( mpl::plus<float>(), x,y ); // op first
where the reduction function needs to be compatible with the type of the buffer.
MPL note 11: vector buffers. If your buffer is a std::vector you need to take the .data() component of it:
vector<float> xx(2),yy(2);
comm.allreduce( mpl::plus<float>(),
xx.data(), yy.data(), mpl::contiguous_layout<float>(2) );
The contiguous_layout is a ‘derived type’; this will be discussed in more detail elsewhere (see
note 42 and later). For now, interpret it as a way of indicating the count/type part of a buffer
specification.
MPL note 12: iterator buffers. MPL point-to-point routines have a way of specifying the buffer(s) through
a begin and end iterator.
// sendrange.cxx
vector<double> v(15);
comm_world.send(v.begin(), v.end(), 1); // send to rank 1
comm_world.recv(v.begin(), v.end(), 0); // receive from rank 0
For the full source of this example, see section 4.5.6
Not available for collectives.
void mpl::communicator::reduce
// root, in place
( F f,int root_rank,T & sendrecvdata ) const
( F f,int root_rank,T * sendrecvdata,const contiguous_layout< T > & l ) const
// non-root
( F f,int root_rank,const T & senddata ) const
( F f,int root_rank,
const T * senddata,const contiguous_layout< T > & l ) const
// general
( F f,int root_rank,const T & senddata,T & recvdata ) const
( F f,int root_rank,
const T * senddata,T * recvdata,const contiguous_layout< T > & l ) const
• One process can generate or read in the initial data for a program run. This then needs to be
communicated to all other processes.
• At the end of a program run, often one process needs to output some summary information.
This process is called the root process, and we will now consider routines that have a root.
• There is the original data, and the data resulting from the reduction. It is a design decision of
MPI that it will not by default overwrite the original data. The send data and receive data are of
the same size and type: if every processor has one real number, the reduced result is again one
real number.
• It is possible to indicate explicitly that a single buffer is used, and thereby the original data
overwritten; see section 3.3.2 for this ‘in place’ mode.
• There is a reduction operator. Popular choices are MPI_SUM, MPI_PROD and MPI_MAX, but compli-
cated operators such as finding the location of the maximum value exist. (For the full list, see
section 3.10.1.) You can also define your own operators; section 3.10.2.
• There is a root process that receives the result of the reduction. Since the nonroot processes do
not receive the reduced data, they can actually leave the receive buffer undefined.
// reduce.c
float myrandom = (float) rand()/(float)RAND_MAX,
result;
int target_proc = nprocs-1;
// add all the random variables together
MPI_Reduce(&myrandom,&result,1,MPI_FLOAT,MPI_SUM,
target_proc,comm);
// the result should be approx nprocs/2:
if (procno==target_proc)
printf("Result %6.3f compared to nprocs/2=%5.2f\n",
result,nprocs/2.);
comm.Allreduce(MPI.IN_PLACE,myrandom,op=MPI.MAX)
template<typename T >
void mpl::communicator::bcast
( int root, T & data ) const
( int root, T * data, const layout< T > & l ) const
3.3.3 Broadcast
A broadcast models the scenario where one process, the ‘root’ process, owns some data, and it communi-
cates it to all other processes.
The broadcast routine MPI_Bcast (figure 3.3) has the following structure:
MPI_Bcast( data..., root , comm);
Here:
• There is only one buffer, the send buffer. Before the call, the root process has data in this buffer;
the other processes allocate a same sized buffer, but for them the contents are irrelevant.
• The root is the process that is sending its data. Typically, it will be the root of a broadcast tree.
Example: in general we can not assume that all processes get the commandline arguments, so we broadcast
them from process 0.
// init.c
if (procno==0) {
if ( argc==1 || // the program is called without parameter
( argc>1 && !strcmp(argv[1],"-h") ) // user asked for help
) {
printf("\nUsage: init [0-9]+\n");
MPI_Abort(comm,1);
}
input_argument = atoi(argv[1]);
}
MPI_Bcast(&input_argument,1,MPI_INT,0,comm);
MPL note 15: broadcast. The broadcast call comes in two variants, with scalar argument and general lay-
out:
template<typename T >
void mpl::communicator::bcast
( int root_rank, T &data ) const;
void mpl::communicator::bcast
( int root_rank, T *data, const layout< T > &l ) const;
where we ignore the update of the righthand side, or the formation of the inverse.
Let a matrix be distributed with each process storing one column. Implement the
Gauss-Jordan algorithm as a series of broadcasts: in iteration 𝑘 process 𝑘 computes
and broadcasts the scaling vector \{\ell^{(k)}_i\}_i. Replicate the right-hand side on all
processors.
(There is a skeleton for this exercise under the name jordan.)
Exercise 3.9. Add partial pivoting to your implementation of Gauss-Jordan elimination.
Change your implementation to let each processor store multiple columns, but still
do one broadcast per column. Is there a way to have only one broadcast per
processor?
[Figure: worked example of Gauss-Jordan elimination, showing the matrix, solution, right-hand side, and the action taken at successive elimination steps.]

\begin{array}{rccccc}
\text{process}: & 0 & 1 & 2 & \cdots & p-1 \\
\text{data}: & x_0 & x_1 & x_2 & \cdots & x_{p-1} \\
\text{inclusive}: & x_0 & x_0\oplus x_1 & x_0\oplus x_1\oplus x_2 & \cdots & \bigoplus_{i=0}^{p-1} x_i \\
\text{exclusive}: & \text{unchanged} & x_0 & x_0\oplus x_1 & \cdots & \bigoplus_{i=0}^{p-2} x_i
\end{array}
// scan.c
// add all the random variables together
MPI_Scan(&myrandom,&result,1,MPI_FLOAT,MPI_SUM,comm);
// the result should be approaching nprocs/2:
if (procno==nprocs-1)
printf("Result %6.3f compared to nprocs/2=%5.2f\n",
result,nprocs/2.);
In native Python mode the result is a function return value; with numpy the result is passed as the second
parameter.
## scan.py
mycontrib = 10+random.randint(1,nprocs)
myfirst = 0
mypartial = comm.scan(mycontrib)
sbuf = np.empty(1,dtype=int)
rbuf = np.empty(1,dtype=int)
sbuf[0] = mycontrib
comm.Scan(sbuf,rbuf)
You can use any of the given reduction operators, (for the list, see section 3.10.1), or a user-defined one.
In the latter case, the MPI_Op operations do not return an error code.
MPL note 16: scan operations. As in the C/F interfaces, MPL interfaces to the scan routines have the same
calling sequences as the ‘Allreduce’ routine.
Often, the more useful variant is the exclusive scan MPI_Exscan (figure 3.5) with the same signature.
The result of the exclusive scan is undefined on processor 0 (None in python), and on processor 1 it is a
copy of the send value of processor 0. In particular, the MPI_Op need not be called on these two processors.
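A classic use of the exclusive scan is computing each process's offset into a global array from the local sizes. The following sketch is an illustration, not one of the book's codes; it assumes comm and procno as before.

// illustration: my offset is the sum of the sizes of all lower ranks
int localsize = 10; // this process's number of elements
int myoffset;
MPI_Exscan(&localsize,&myoffset,1,MPI_INT,MPI_SUM,comm);
if (procno==0)
  myoffset = 0; // the exscan result on process zero is undefined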
Exercise 3.10. The exclusive definition, which computes 𝑥0 ⊕ ⋯ ⊕ 𝑥𝑖−1 on processor 𝑖, can be
derived from the inclusive operation for operations such as MPI_SUM or MPI_PROD. Are
there operators where that is not the case?
X_i = \begin{cases} x_i & \text{if } y_i=0 \\ X_{i-1}+x_i & \text{if } y_i=1 \end{cases}
(This is the basis for the implementation of the sparse matrix vector product as prefix operation; see HPC
book, section-27.2.) This means that 𝑋𝑖 sums the segments between locations where 𝑦𝑖 = 0 and the first
subsequent place where 𝑦𝑖 = 1. To implement this, you need a user-defined operator
\begin{pmatrix} X\\ x\\ y \end{pmatrix}
= \begin{pmatrix} X_1\\ x_1\\ y_1 \end{pmatrix} \oplus \begin{pmatrix} X_2\\ x_2\\ y_2 \end{pmatrix}
\colon \begin{cases} X = x_1+x_2 & \text{if } y_2=1 \\ X = x_2 & \text{if } y_2=0 \end{cases}
This operator is not commutative, and it needs to be declared as such with MPI_Op_create; see
section 3.10.2.
In the MPI_Scatter operation, the root spreads information to all other processes. The difference with a
broadcast is that it involves individual information from/to every process. Thus, the gather operation
typically has an array of items, one coming from each sending process, and scatter has an array with an
individual item for each receiving process.
In the gather and scatter calls, each processor has 𝑛 elements of individual data. There is also a root
processor that has an array of length 𝑛𝑝, where 𝑝 is the number of processors. The gather call collects all
this data from the processors to the root; the scatter call assumes that the information is initially on the
root and it is spread to the individual processors.
Here is a small example:
// gather.c
// we assume that each process has a value "localsize"
// the root process collects these values
if (procno==root)
  localsizes = (int*) malloc( nprocs*sizeof(int) );
// everyone contributes their value
MPI_Gather(&localsize,1,MPI_INT,
    localsizes,1,MPI_INT,root,comm);
void mpl::communicator::gather
( int root_rank, const T & senddata ) const
( int root_rank, const T & senddata, T * recvdata ) const
( int root_rank, const T * senddata, const layout< T > & sendl ) const
( int root_rank, const T * senddata, const layout< T > & sendl,
T * recvdata, const layout< T > & recvl ) const
Python:
MPI.Comm.Gather
(self, sendbuf, recvbuf, int root=0)
vector<float> v;
float x;
comm_world.scatter(0, v.data(), x);
vector<int> size_buffer(nprocs);
comm_world.gather
(
0,my_number_of_elements,size_buffer.data()
);
} else {
/*
* If you are not the root, do versions with only send buffers
*/
comm_world.gather
( 0,my_number_of_elements );
3.5.1 Examples
In some applications, each process computes a row or column of a matrix, but for some calculation (such
as the determinant) it is more efficient to have the whole matrix on one process. You should of course
only do this if this matrix is essentially smaller than the full problem, such as an interface system or the
last coarsening level in multigrid.
Figure 3.6 pictures this. Note that conceptually we are gathering a two-dimensional object, but the buffer
is of course one-dimensional. You will later see how this can be done more elegantly with the ‘subarray’
datatype; section 6.3.4.
Another thing you can do with a distributed matrix is to transpose it.
// itransposeblock.c
for (int iproc=0; iproc<nprocs; iproc++) {
MPI_Scatter( regular,1,MPI_DOUBLE,
&(transpose[iproc]),1,MPI_DOUBLE,
iproc,comm);
}
only once. Why do we need a loop? That is because each element of a process’ row originates from a
different scatter operation.
Exercise 3.14. Can you rewrite this code so that it uses a gather rather than a scatter? Does
that change anything essential about structure of the code?
Exercise 3.15. Take the code from exercise 3.11 and extend it to gather all local buffers onto
rank zero. Since the local arrays are of differing lengths, this requires MPI_Gatherv.
How do you construct the lengths and displacements arrays?
(There is a skeleton for this exercise under the name scangather.)
3.5.2 Allgather
Figure 3.7: All gather collects all data onto every process
The MPI_Allgather (figure 3.7) routine does the same gather onto every process: each process winds up
with all the gathered data.
This routine can be used in the simplest implementation of the dense matrix-vector product to give each
processor the full input; see HPC book, section-6.2.2.
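A minimal sketch of MPI_Allgather, given as an illustration and assuming comm, procno, nprocs as in the earlier examples: each process contributes a single integer, and every process winds up with the whole array.

// illustration: every process contributes its rank;
// afterwards every process has the array 0,1,...,nprocs-1
int myvalue = procno;
int *allvalues = (int*) malloc( nprocs*sizeof(int) );
MPI_Allgather(&myvalue,1,MPI_INT,
    allvalues,1,MPI_INT,comm);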
Some cases look like an all-gather but can be implemented more efficiently. Suppose you have two dis-
tributed vectors, and you want to create a new vector that contains those elements of the one that do not
appear in the other. You could implement this by gathering the second vector on each processor, but this
may be prohibitive in memory usage.
Exercise 3.16. Can you think of another algorithm for taking the set difference of two
distributed vectors? Hint: look up the bucket brigade algorithm; section 4.1.5. What is the
time and space complexity of this algorithm? Can you think of other advantages
beside a reduction in workspace?
3.6 All-to-all
The all-to-all operation MPI_Alltoall (figure 3.8) can be seen as a collection of simultaneous broadcasts
or simultaneous gathers. The parameter specification is much like an allgather, with a separate send and
receive buffer, and no root specified. As with the gather call, the receive count corresponds to an individual
receive, not the total amount.
Unlike the gather call, the send buffer now obeys the same principle: with a send count of 1, the buffer
has a length of the number of processes.
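The following sketch illustrates this buffer layout; it is an illustration, not one of the book's codes, and assumes comm, procno, nprocs. With a count of 1, element q of the send buffer goes to process q, and element q of the receive buffer comes from process q.

// illustration: each process sends one distinct integer to every process
int *sendbuf = (int*) malloc( nprocs*sizeof(int) ),
    *recvbuf = (int*) malloc( nprocs*sizeof(int) );
for (int p=0; p<nprocs; p++)
  sendbuf[p] = procno*nprocs + p; // the element destined for process p
MPI_Alltoall(sendbuf,1,MPI_INT,
    recvbuf,1,MPI_INT,comm);
// now recvbuf[q] equals q*nprocs+procno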
The typical application for such data transposition is in the FFT algorithm, where it can take tens of
percents of the running time on large clusters.
We will consider another application of data transposition, namely radix sort, but we will do that in a
couple of steps. First of all:
Exercise 3.17. In the initial stage of a radix sort, each process considers how many
elements to send to every other process. Use MPI_Alltoall to derive from this how
many elements they will receive from every other process.
3.6.2 All-to-all-v
The major part of the radix sort algorithm consist of every process sending some of its elements to each
of the other processes. The routine MPI_Alltoallv (figure 3.9) is used for this pattern:
• Every process scatters its data to all others,
• but the amount of data is different per process.
Exercise 3.18. The actual data shuffle of a radix sort can be done with MPI_Alltoallv. Finish
the code of exercise 3.17.
3.7 Reduce-scatter
There are several MPI collectives that are functionally equivalent to a combination of others. You have
already seen MPI_Allreduce which is equivalent to a reduction followed by a broadcast. Often such com-
binations can be more efficient than using the individual calls; see HPC book, section-6.1.
Here is another example: MPI_Reduce_scatter is equivalent to a reduction on an array of data (meaning a
pointwise reduction on each array location) followed by a scatter of this array to the individual processes.
We will discuss this routine, or rather its variant MPI_Reduce_scatter_block (figure 3.10), using an impor-
tant example: the sparse matrix-vector product (see HPC book, section-6.5.1 for background information).
Each process contains one or more matrix rows, so by looking at indices the process can decide what other
processes it needs to receive data from, that is, each process knows how many messages it will receive,
and from which processes. The problem is for a process to find out what other processes it needs to send
data to.
Let’s set up the data:
// reducescatter.c
int
// data that we know:
*i_recv_from_proc = (int*) malloc(nprocs*sizeof(int)),
*procs_to_recv_from, nprocs_to_recv_from=0,
// data we are going to determine:
*procs_to_send_to,nprocs_to_send_to;
int proc=procs_to_recv_from[iproc];
double send_buffer=0.;
MPI_Isend(&send_buffer,0,MPI_DOUBLE, /*to:*/ proc,0,comm,
&(send_requests[iproc]));
}
/*
* Do as many receives as you know are coming in;
* use wildcards since you don't know where they are coming from.
* The source is a process you need to send to.
*/
procs_to_send_to = (int*)malloc( nprocs_to_send_to * sizeof(int) );
for (int iproc=0; iproc<nprocs_to_send_to; iproc++) {
double recv_buffer;
MPI_Status status;
MPI_Recv(&recv_buffer,0,MPI_DOUBLE,MPI_ANY_SOURCE,MPI_ANY_TAG,comm,
&status);
procs_to_send_to[iproc] = status.MPI_SOURCE;
}
MPI_Waitall(nprocs_to_recv_from,send_requests,MPI_STATUSES_IGNORE);
Another application of the reduce-scatter mechanism is in the dense matrix-vector product, if a two-
dimensional data distribution is used.
3.7.1 Examples
An important application of this is establishing an irregular communication pattern. Assume that each
process knows which other processes it wants to communicate with; the problem is to let the other pro-
cesses know about this. The solution is to use MPI_Reduce_scatter to find out how many processes want
to communicate with you
MPI_Reduce_scatter_block
(i_recv_from_proc,&nprocs_to_send_to,1,MPI_INT,
MPI_SUM,comm);
/*
* Do as many receives as you know are coming in;
* use wildcards since you don't know where they are coming from.
* The source is a process you need to send to.
*/
procs_to_send_to = (int*)malloc( nprocs_to_send_to * sizeof(int) );
for (int iproc=0; iproc<nprocs_to_send_to; iproc++) {
double recv_buffer;
MPI_Status status;
MPI_Recv(&recv_buffer,0,MPI_DOUBLE,MPI_ANY_SOURCE,MPI_ANY_TAG,comm,
&status);
procs_to_send_to[iproc] = status.MPI_SOURCE;
}
MPI_Waitall(nprocs_to_recv_from,send_requests,MPI_STATUSES_IGNORE);
MPI_Reduce_scatter(local_y,&my_y,&ione,MPI_DOUBLE,
MPI_SUM,environ.row_comm);
3.8 Barrier
A barrier call, MPI_Barrier (figure 3.11) is a routine that blocks all processes until they have all reached
the barrier call. Thus it achieves time synchronization of the processes.
This call’s simplicity is contrasted with its usefulness, which is very limited. It is almost never necessary
to synchronize processes through a barrier: for most purposes it does not matter if processors are out of
sync. Conversely, collectives (except the new nonblocking ones; section 3.11) introduce a barrier of sorts
themselves.
template<typename T>
void gatherv
(int root_rank, const T *senddata, const layout<T> &sendl,
T *recvdata, const layouts<T> &recvls, const displacements &recvdispls) const
(int root_rank, const T *senddata, const layout<T> &sendl,
T *recvdata, const layouts<T> &recvls) const
(int root_rank, const T *senddata, const layout<T> &sendl ) const
For example, in an MPI_Gatherv (figure 3.12) call each process has an individual number of items to con-
tribute. To gather this, the root process needs to find these individual amounts with an MPI_Gather call,
and locally construct the offsets array. Note how the offsets array has size ntids+1: the final offset value
is automatically the total size of all incoming data. See the example below.
There are various calls where processors can have buffers of differing sizes.
• In MPI_Scatterv (figure 3.13) the root process has a different amount of data for each recipient.
• In MPI_Gatherv, conversely, each process contributes a different sized send buffer to the re-
ceived result; MPI_Allgatherv (figure 3.14) does the same, but leaves its result on all processes;
MPI_Alltoallv does a different variable-sized gather on each process.
We use MPI_Gatherv to do an irregular gather onto a root. We first need an MPI_Gather to determine offsets.
Code:
// gatherv.c
// we assume that each process has an array "localdata"
// of size "localsize"

// the root process decides how much data will be coming:
// allocate arrays to contain size and offset information
if (procno==root) {
  localsizes = (int*) malloc( nprocs*sizeof(int) );
  offsets = (int*) malloc( nprocs*sizeof(int) );
}
// everyone contributes their local size info
MPI_Gather(&localsize,1,MPI_INT,
    localsizes,1,MPI_INT,root,comm);
// the root constructs the offsets array
if (procno==root) {
  int total_data = 0;
  for (int i=0; i<nprocs; i++) {
    offsets[i] = total_data;
    total_data += localsizes[i];
  }
  alldata = (int*) malloc( total_data*sizeof(int) );
}
// everyone contributes their data
MPI_Gatherv(localdata,localsize,MPI_INT,
    alldata,localsizes,offsets,MPI_INT,root,comm);

Output:
make[3]: `gatherv' is up to date.
TACC: Starting up job 4328411
TACC: Starting parallel tasks...
Local sizes: 13, 12, 13, 14, 11, 12, 14, 6, 12, 8,
Collected:
0:1,1,1,1,1,1,1,1,1,1,1,1,1;
1:2,2,2,2,2,2,2,2,2,2,2,2;
2:3,3,3,3,3,3,3,3,3,3,3,3,3;
3:4,4,4,4,4,4,4,4,4,4,4,4,4,4;
4:5,5,5,5,5,5,5,5,5,5,5;
5:6,6,6,6,6,6,6,6,6,6,6,6;
6:7,7,7,7,7,7,7,7,7,7,7,7,7,7;
7:8,8,8,8,8,8;
8:9,9,9,9,9,9,9,9,9,9,9,9;
9:10,10,10,10,10,10,10,10;
TACC: Shutdown complete. Exiting.
## gatherv.py
# implicitly using root=0
globalsize = comm.reduce(localsize)
if procid==0:
print("Global size=%d" % globalsize)
collecteddata = np.empty(globalsize,dtype=int)
counts = comm.gather(localsize)
comm.Gatherv(localdata, [collecteddata, counts])
accumulate = 0
for p in range(nprocs):
recv_displs[p] = accumulate; accumulate += recv_counts[p]
global_array = np.empty(accumulate,dtype=np.float64)
comm.Allgatherv( my_array, [global_array,recv_counts,recv_displs,MPI.DOUBLE] )
Fortran note 5: min-maxloc types. The original Fortran interface to MPI was designed around Fortran77
features, so it is not using Fortran derived types (Type keyword). Instead, all integer indices
are stored in whatever the type is that is being reduced. The available result types are then
MPI_2REAL, MPI_2DOUBLE_PRECISION, MPI_2INTEGER.
Likewise, the input needs to be arrays of such type. Consider this example:
Real*8,dimension(2,N) :: input,output
call MPI_Reduce( input,output, N, MPI_2DOUBLE_PRECISION, &
MPI_MAXLOC, root, comm )
For example, here is an operator for finding the smallest nonzero number in an array of nonnegative
integers:
// reductpositive.c
void reduce_without_zero(void *in,void *inout,int *len,MPI_Datatype *type) {
// r is the already reduced value, n is the new value
int n = *(int*)in, r = *(int*)inout;
int m;
if (n==0) { // new value is zero: keep r
m = r;
} else if (r==0) {
m = n;
} else if (n<r) { // new value is less but not zero: use n
m = n;
} else { // new value is more: use r
m = r;
};
*(int*)inout = m;
}
n = in_array[0]; r = inout_array[0]
if n==0:
m = r
elif r==0:
m = n
elif n<r:
m = n
else:
m = r
inout_array[0] = m
The assert statement accounts for the fact that this mapping of MPI datatype to NumPy dtype
only works for built-in MPI datatypes.
MPL note 20: user-defined operators. A user-defined operator can be a templated class with an operator().
Example:
// reduceuser.cxx
template<typename T>
class lcm {
public:
T operator()(T a, T b) {
T zero=T();
T t((a/gcd(a, b))*b);
if (t<zero)
return -t;
return t;
}
comm_world.reduce(lcm<int>(), 0, v, result);
MPL note 21: lambda operator. You can also do the reduction by lambda:
comm_world.reduce
( [] (int i,int j) -> int
{ return i+j; },
0,data );
The function has an array length argument len, to allow for pointwise reduction on a whole array at
once. The inoutvec array contains partially reduced results, and is typically overwritten by the function.
There are some restrictions on the user function:
• It may not call MPI functions, except for MPI_Abort.
• It must be associative; it can be optionally commutative, which fact is passed to the MPI_Op_create
call.
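Putting this together, here is a sketch of how the reduce_without_zero function above could be registered and used; this is an illustration, not the book's code, and it assumes comm and procno.

// illustration: create the operator, use it in a reduction, free it
MPI_Op smallest_nonzero;
MPI_Op_create(reduce_without_zero,/*commute=*/1,&smallest_nonzero);

int mydata = procno, result; // some nonnegative integer per process
MPI_Allreduce(&mydata,&result,1,MPI_INT,smallest_nonzero,comm);

MPI_Op_free(&smallest_nonzero);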
Exercise 3.19. Write the reduction function to implement the one-norm of a vector:
\|x\|_1 \equiv \sum_i |x_i|.
This sets the operator to MPI_OP_NULL. This is not necessary in OO languages, where the destructor takes
care of it.
You can query the commutativity of an operator with MPI_Op_commutative (figure 3.16).
y \leftarrow Ax + (x^tx)y
involves a matrix-vector product, which is dominated by computation in the sparse matrix case, and an
inner product which is typically dominated by the communication cost. You would code this as
MPI_Iallreduce( .... x ..., &request);
// compute the matrix vector product
MPI_Wait(&request,MPI_STATUS_IGNORE);
// do the addition
This can also be used for 3D FFT operations [13]. Occasionally, a nonblocking collective can be used for
nonobvious purposes, such as the MPI_Ibarrier in [14].
These have roughly the same calling sequence as their blocking counterparts, except that they output an
MPI_Request. You can then use an MPI_Wait call to make sure the collective has completed.
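A hedged C sketch of the inner-product-with-overlap idea from the previous section; this is an illustration, not one of the book's codes, and it assumes comm, local arrays x,y of length nlocal, and a routine for the local part of the product.

// illustration: overlap an inner product reduction with local work
double local_inprod = 0., global_inprod;
for (int i=0; i<nlocal; i++)
  local_inprod += x[i]*y[i];

MPI_Request request;
MPI_Iallreduce(&local_inprod,&global_inprod,1,MPI_DOUBLE,MPI_SUM,
    comm,&request);

// ... do the local part of the matrix-vector product here ...

MPI_Wait(&request,MPI_STATUS_IGNORE);
// global_inprod is now valid on every process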
\alpha \leftarrow x^ty
\beta \leftarrow \|z\|_\infty
Remark 6 Blocking and nonblocking don’t match: either all processes call the nonblocking or all call the
blocking one. Thus the following code is incorrect:
if (rank==root)
MPI_Reduce( &x /* ... */ root,comm );
else
MPI_Ireduce( &x /* ... */ );
This is unlike the point-to-point behavior of nonblocking calls: you can catch a message with MPI_Irecv that
was sent with MPI_Send.
Remark 7 Unlike sends and receives, collectives have no identifying tag. With blocking collectives that does
not lead to ambiguity problems. With nonblocking collectives it means that all processes need to issue them
in identical order.
• MPI_Iexscan, MPI_Iscan,
MPL note 22: nonblocking collectives. Nonblocking collectives have the same argument list as the corre-
sponding blocking variant, except that instead of a void result, they return an irequest. (See 29)
// ireducescalar.cxx
float x{1.},sum;
auto reduce_request =
comm_world.ireduce(mpl::plus<float>(), 0, x, sum);
reduce_request.wait();
if (comm_world.rank()==0) {
std::cout << "sum = " << sum << '\n';
}
3.11.1 Examples
3.11.1.1 Array transpose
To illustrate the overlapping of multiple nonblocking collectives, consider transposing a data matrix. Ini-
tially, each process has one row of the matrix; after transposition each process has a column. Since each
row needs to be distributed to all processes, algorithmically this corresponds to a series of scatter calls,
one originating from each process.
// itransposeblock.c
for (int iproc=0; iproc<nprocs; iproc++) {
MPI_Scatter( regular,1,MPI_DOUBLE,
&(transpose[iproc]),1,MPI_DOUBLE,
iproc,comm);
}
3.11.1.2 Stencils
The ever-popular five-point stencil evaluation does not look like a collective operation, and indeed, it is
usually evaluated with (nonblocking) send/recv operations. However, if we create a subcommunicator on
each subdomain that contains precisely that domain and its neighbors, (see figure 3.10) we can formu-
late the communication pattern as a gather on each of these. With ordinary collectives this can not be
formulated in a deadlock-free manner, but nonblocking collectives make this feasible.
We will see an even more elegant formulation of this operation in section 11.2.
int global_finish=mysleep;
do {
int all_done_flag=0;
MPI_Test(&final_barrier,&all_done_flag,MPI_STATUS_IGNORE);
if (all_done_flag) {
break;
} else {
int flag; MPI_Status status;
// force progress
MPI_Iprobe
( MPI_ANY_SOURCE,MPI_ANY_TAG,
comm, &flag, MPI_STATUS_IGNORE );
printf("[%d] going to work for another second\n",procid);
sleep(1);
global_finish++;
}
} while (1);
every other process. While this describes the semantics of the operation, in practice the implementation
works quite differently.
The time that a message takes can simply be modeled as
𝛼 + 𝛽𝑛,
where 𝛼 is the latency, a one time delay from establishing the communication between two processes,
and 𝛽 is the time-per-byte, or the inverse of the bandwidth, and 𝑛 the number of bytes sent.
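As a small worked example with purely illustrative (assumed) numbers: for α = 10⁻⁶ s, β = 10⁻⁹ s per byte, and a message of n = 10⁶ bytes,
T(n) = α + βn = 10⁻⁶ + 10⁻⁹ ⋅ 10⁶ ≈ 1.001 ⋅ 10⁻³ s,
so the transfer time is bandwidth-dominated; for very short messages the latency term α dominates instead.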
Under the assumption that a processor can only send one message at a time, the broadcast in figure 3.11
would take a time proportional to the number of processors.
Exercise 3.22. What is the total time required for a broadcast involving 𝑝 processes? Give 𝛼
and 𝛽 terms separately.
One way to ameliorate that is to structure the broadcast in a tree-like fashion. This is depicted in fig-
ure 3.12.
Exercise 3.23. How does the communication time now depend on the number of
processors? Give the α and β terms separately.
What would be a lower bound on the α, β terms?
The theory of the complexity of collectives is described in more detail in HPC book, section-6.1; see
also [3].
the send operations on all processors will occur after the root executes the broadcast. Conversely, in a
reduce operation the root may have to wait for other processors. This is illustrated in figure 3.13, which
gives a TAU trace of a reduction operation on two nodes, with two six-core sockets (processors) each. We
see that:
Figure 3.13: Trace of a reduction operation between two dual-socket 12-core nodes
(Footnote 1: This uses mvapich version 1.6; in version 1.9 the implementation of an on-node reduction has changed to simulate shared memory.)
Note the MPI_ANY_SOURCE parameter in the receive calls on processor 1. One obvious execution of this would
be:
1. The send from 2 is caught by processor 1;
2. Everyone executes the broadcast;
3. The send from 0 is caught by processor 1.
However, it is equally possible to have this execution:
1. Processor 0 starts its broadcast, then executes the send;
2. Processor 1’s receive catches the data from 0, then it executes its part of the broadcast;
3. Processor 1 catches the data sent by 2, and finally processor 2 does its part of the broadcast.
This is illustrated in figure 3.14.
3.14.1 Scalability
We are motivated to write parallel software from two considerations. First of all, if we have a certain
problem to solve which normally takes time 𝑇 , then we hope that with 𝑝 processors it will take time 𝑇 /𝑝.
If this is true, we call our parallelization scheme scalable in time. In practice, we often accept small extra
terms: as you will see below, parallelization often adds a term log2 𝑝 to the running time.
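As a hedged illustration, suppose the parallel running time behaves like Tₚ = T/p + c log₂ p for some constant c (an assumed model, not a measurement). The speedup is then
Sₚ = T/Tₚ = p / (1 + c p log₂ p / T),
which stays close to p only as long as the log₂ p term is small compared to T/p.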
Exercise 3.24. Discuss scalability of the following algorithms:
• You have an array of floating point numbers. You need to compute the sine of
each element.
• You have a two-dimensional array, denoting the domain [−2, 2]². You want to make
a picture of the Mandelbrot set, so you need to compute the color of each point.
• The primality test of exercise 2.6.
There is also the notion that a parallel algorithm can be scalable in space: more processors gives you more
memory so that you can run a larger problem.
Exercise 3.25. Discuss space scalability in the context of modern processor design.
Simple ring Let the root only send to the next process, and that one send to its neighbor. This scheme
is known as a bucket brigade; see also section 4.1.5.
What is the expected performance of this in terms of 𝛼, 𝛽?
Run some tests and confirm.
Pipelined ring In a ring broadcast, each process needs to receive the whole message before it can
pass it on. We can increase the efficiency by breaking up the message and sending it in multiple parts.
(See figure 3.15.) This will be advantageous for messages that are long enough that the bandwidth cost
dominates the latency.
Assume a send buffer of length more than 1. Divide the send buffer into a number of chunks. The root
sends the chunks successively to the next process, and each process sends on whatever chunks it receives.
What is the expected performance of this in terms of 𝛼, 𝛽? Why is this better than the simple ring?
Run some tests and confirm.
Recursive doubling Collectives such as broadcast can be implemented through recursive doubling, where
the root sends to another process, then the root and the other process send to two more, those four send
to four more, et cetera. However, in an actual physical architecture this scheme can be realized in multiple
ways that have drastically different performance.
First consider the implementation where process 0 is the root, and it starts by sending to process 1; then
they send to 2 and 3; these four send to 4–7, et cetera. If the architecture is a linear array of processors,
this will lead to contention: multiple messages wanting to go through the same wire. (This is also related
to the concept of bisection bandwidth.)
In the following analyses we will assume wormhole routing: a message sets up a path through the network,
reserving the necessary wires, and performing a send in time independent of the distance through the
network. That is, the send time for any message can be modeled as
𝑇 (𝑛) = 𝛼 + 𝛽𝑛
regardless of source and destination, as long as the necessary connections are available.
Exercise 3.26. Analyze the running time of a recursive doubling broadcast as just
described, with wormhole routing.
Implement this broadcast in terms of blocking MPI send and receive calls. If you
have SimGrid available, run tests with a number of parameters.
The alternative, that avoids contention, is to let each doubling stage divide the network into separate
halves. That is, process 0 sends to 𝑃/2, after which these two repeat the algorithm in the two halves of
the network, sending to 𝑃/4 and 3𝑃/4 respectively.
Exercise 3.27. Analyze this variant of recursive doubling. Code it and measure runtimes on
SimGrid.
Exercise 3.28. Revisit exercise 3.26 and replace the blocking calls by nonblocking
MPI_Isend/ MPI_Irecv calls.
Make sure to test that the data is correctly propagated.
MPI implementations often have multiple algorithms, which they dynamically switch between. Sometimes
you can determine the choice yourself through environment variables.
TACC note. For Intel MPI, see https://software.intel.com/en-us/mpi-developer-reference-linux-i-mpi-adjust-family-environment-variables.
give the approximate MPI-based code that computes the maximum value in the
array, and leaves the result on every processor.
Review 3.36.
double data[Nglobal];
int myfirst = /* something */, mylast = /* something */;
for (int i=myfirst; i<mylast; i++) {
if (i>0 && i<N-1) {
process_point( data,i,Nglobal );
}
}
void process_point( double *data,int i,int N ) {
data[i-1] = g(i-1); data[i] = g(i); data[i+1] = g(i+1);
data[i] = f(data[i-1],data[i],data[i+1]);
}
Is this scalable in time? Is this scalable in space? What is the missing MPI call?
Review 3.37.
double data[Nlocal+2]; // include left and right neighbor
int myfirst = /* something */, mylast = myfirst+Nlocal;
for (int i=0; i<Nlocal; i++) {
if (i>0 && i<N-1) {
process_point( data,i,Nlocal );
}
}
void process_point( double *data,int i0,int n ) {
int i = i0+1;
data[i-1] = g(i-1); data[i] = g(i); data[i+1] = g(i+1);
data[i] = f(data[i-1],data[i],data[i+1]);
}
Is this scalable in time? Is this scalable in space? What is the missing MPI call?
Review 3.38. With data as in the previous question, give the code for normalizing the
array, that is, scaling each element so that ‖x‖₂ = 1.
Review 3.39. Just like MPI_Allreduce is equivalent to MPI_Reduce followed by MPI_Bcast,
MPI_Reduce_scatter is equivalent to at least one of the following combinations. Select
those that are equivalent, and discuss differences in time or space complexity:
1. MPI_Reduce followed by MPI_Scatter;
2. MPI_Gather followed by MPI_Scatter;
3. MPI_Allreduce followed by MPI_Scatter;
4. MPI_Allreduce followed by a local operation (which?);
5. MPI_Allgather followed by a local operation (which?).
Review 3.40. Think of at least two algorithms for doing a broadcast. Compare them with
regards to asymptotic behavior.
#include "globalinit.c"
float myrandom,sumrandom;
myrandom = (float) rand()/(float)RAND_MAX;
// add the random variables together
MPI_Allreduce(&myrandom,&sumrandom,
1,MPI_FLOAT,MPI_SUM,comm);
// the result should be approx nprocs/2:
if (procno==nprocs-1)
printf("Result %6.9f compared to .5\n",sumrandom/nprocs);
MPI_Finalize();
return 0;
}
int main() {
std::cout << "rank " << comm_world.rank() << " got " << x << '\n';
// in place
comm_world.reduce(mpl::plus<float>(), root, xrank);
if ( comm_world.rank()==root )
std::cout << "Allreduce got: separate=" << xreduce
<< ", inplace=" << xrank << std::endl;
}
return EXIT_SUCCESS;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
random.seed(procid)
random_bound = nprocs*nprocs
random_number = random.randint(1,random_bound)
#print("[%d] random=%d" % (procid,random_number))
if procid==0:
print("Python native:\n max=%d" % max_random)
myrandom = np.empty(1,dtype=int)
myrandom[0] = random_number
allrandom = np.empty(nprocs,dtype=int)
# numpy mode send
comm.Allreduce(myrandom,allrandom[:1],op=MPI.MAX)
if procid==0:
print("Python numpy:\n max=%d" % allrandom[0])
sumrandom = np.zeros(1,dtype=int)
sumrandom[0] = myrandom[0]
#### WRONG polymorphic use does not work
#comm.Allreduce(sumrandom[:1])
comm.Allreduce(MPI.IN_PLACE,sumrandom[:1],op=MPI.SUM)
if procid==0:
print( "Sum of randoms: %d, compare %d" % (sumrandom[0],nprocs*random_bound/2) )
#include <vector>
using std::vector;
#include <mpl/mpl.hpp>
int main() {
const mpl::communicator &comm_world=mpl::environment::comm_world();
if (comm_world.size()<2)
return EXIT_FAILURE;
vector<double> v(15);
if (comm_world.rank()==0) {
// initialize
for ( auto &x : v ) x = 1.41;
/*
* Send and report
*/
comm_world.send(v.begin(), v.end(), 1); // send to rank 1
} else if (comm_world.rank()==1) {
/*
* Receive data and report
*/
comm_world.recv(v.begin(), v.end(), 0); // receive from rank 0
#include "globalinit.c"
MPI_Finalize();
return 0;
}
#include "globalinit.c"
nrandoms,MPI_FLOAT,MPI_SUM,root,comm);
} else if (iter==2) {
for (int irand=0; irand<nrandoms; irand++)
myrandoms[irand] = (float) rand()/(float)RAND_MAX;
int root=nprocs-1;
float *sendbuf,*recvbuf;
if (procno==root) {
sendbuf = MPI_IN_PLACE; recvbuf = myrandoms;
} else {
sendbuf = myrandoms; recvbuf = MPI_IN_PLACE;
}
MPI_Reduce(sendbuf,recvbuf,
nrandoms,MPI_FLOAT,MPI_SUM,root,comm);
}
// the result should be approx nprocs/2:
if (procno==nprocs-1) {
float sum=0.;
for (int i=0; i<nrandoms; i++) sum += myrandoms[i];
sum /= nrandoms*nprocs;
printf("Result %6.9f compared to .5\n",sum);
}
}
free(myrandoms);
MPI_Finalize();
return 0;
}
use mpi
implicit none
real :: mynumber,result
integer :: target_proc
#include "globalinit.F90"
call random_number(mynumber)
target_proc = ntids-1;
! add all the random variables together
if (mytid.eq.target_proc) then
result = mytid
call MPI_Reduce(MPI_IN_PLACE,result,1,MPI_REAL,MPI_SUM,&
target_proc,comm,err)
else
mynumber = mytid
call MPI_Reduce(mynumber,result,1,MPI_REAL,MPI_SUM,&
target_proc,comm,err)
end if
! the result should be ntids*(ntids-1)/2:
if (mytid.eq.target_proc) then
write(*,'("Result ",f5.2," compared to n(n-1)/2=",f5.2)') &
result,ntids*(ntids-1)/2.
end if
call MPI_Finalize(err)
use mpi_f08
real,target :: mynumber,result,in_place_val
real,pointer :: mynumber_ptr,result_ptr
integer :: target_proc
#include "globalinit.F90"
call random_number(mynumber)
target_proc = ntids-1;
in_place_val = MPI_IN_PLACE
if (mytid.eq.target_proc) then
! set pointers
result_ptr => result
mynumber_ptr => in_place_val
! target sets value in receive buffer
result_ptr = mytid
else
! set pointers
mynumber_ptr => mynumber
result_ptr => in_place_val
! non-targets set value in send buffer
mynumber_ptr = mytid
end if
call MPI_Reduce(mynumber_ptr,result_ptr,1,MPI_REAL,MPI_SUM,&
target_proc,comm,err)
! the result should be ntids*(ntids-1)/2:
if (mytid.eq.target_proc) then
write(*,'("Result ",f7.4," compared to n(n-1)/2=",f7.4)') &
result,ntids*(ntids-1)/2.
end if
call MPI_Finalize(err)
import random
from mpi4py import MPI
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
random_number = random.randint(1,nprocs*nprocs)
print("[%d] random=%d" % (procid,random_number))
myrandom = np.empty(1,dtype=int)
myrandom[0] = random_number
comm.Allreduce(MPI.IN_PLACE,myrandom,op=MPI.MAX)
if procid==0:
print("Python numpy:\n max=%d" % myrandom[0])
#include <vector>
using std::vector;
#include <mpl/mpl.hpp>
int main() {
/*
* Reduce a 2 int buffer
*/
if (iprint) cout << "Reducing 2p, 2p+1" << endl;
float
xrank = static_cast<float>( comm_world.rank() );
vector<float> rank2p2p1{ 2*xrank,2*xrank+1 },reduce2p2p1{0,0};
mpl::contiguous_layout<float> two_floats(rank2p2p1.size());
comm_world.allreduce
(mpl::plus<float>(), rank2p2p1.data(),reduce2p2p1.data(),two_floats);
if ( iprint )
cout << "Got: " << reduce2p2p1.at(0) << ","
<< reduce2p2p1.at(1) << endl;
/*
* Scatter one number to each proc
*/
if (iprint) cout << "Scattering 0--p" << endl;
vector<float> v;
if (comm_world.rank()==0)
for (int i=0; i<comm_world.size(); ++i)
v.push_back(i);
if (iprint)
cout << "rank " << procno << " got " << x << '\n';
/*
* Scatter two numbers to each proc
*/
vector<float> vrecv(2),vsend(2*nprocs);
if (comm_world.rank()==0)
for (int i=0; i<2*nprocs; ++i)
vsend.at(i) = i;
if (iprint)
cout << "rank " << procno << " got "
<< vrecv[0] << "," << vrecv[1] << '\n';
return 0;
// multiply that number, giving twice your rank
x*=2;
return EXIT_SUCCESS;
}
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mpi.h"
#include "globalinit.c"
if (procno==0) {
if ( argc==1 || // the program is called without parameter
( argc>1 && !strcmp(argv[1],"-h") ) // user asked for help
) {
printf("\nUsage: init [0-9]+\n");
MPI_Abort(comm,1);
}
input_argument = atoi(argv[1]);
}
MPI_Bcast(&input_argument,1,MPI_INT,0,comm);
printf("Processor %d gets %d\n",procno,input_argument);
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
root = 1
dsize = 10
# first native
if procid==root:
buffer = [ 5.0 ] * dsize
else:
buffer = [ 0.0 ] * dsize
buffer = comm.bcast(obj=buffer,root=root)
if not reduce( lambda x,y:x and y,
[ buffer[i]==5.0 for i in range(len(buffer)) ] ):
print( "Something wrong on proc %d: native buffer <<%s>>" \
% (procid,str(buffer)) )
#include "globalinit.c"
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
mycontrib = 10+random.randint(1,nprocs)
myfirst = 0
mypartial = comm.scan(mycontrib)
print("[%d] local: %d, partial: %d" % (procid,mycontrib,mypartial))
sbuf = np.empty(1,dtype=int)
rbuf = np.empty(1,dtype=int)
sbuf[0] = mycontrib
comm.Scan(sbuf,rbuf)
#include "globalinit.c"
int localsize = 10+10*( (float) rand()/(float)RAND_MAX - .5),
root = nprocs-1;
int *localsizes=NULL;
// create local data
int *localdata = (int*) malloc( localsize*sizeof(int) );
for (int i=0; i<localsize; i++)
localdata[i] = procno+1;
// we assume that each process has a value "localsize"
// the root process collects these values
if (procno==root)
localsizes = (int*) malloc( nprocs*sizeof(int) );
MPI_Finalize();
return 0;
}
#include "globalinit.c"
/*
* Allocate matrix and transpose:
* - one column per rank for regular
* - one row per rank for transpose
*/
double *regular,*transpose;
regular = (double*) malloc( nprocs*sizeof(double) );
transpose = (double*) malloc( nprocs*sizeof(double) );
// each process has columns m*nprocs -- (m+1)*nprocs
for (int ip=0; ip<nprocs; ip++)
regular[ip] = procno*nprocs + ip;
/*
* Each proc does a scatter
*/
#if 0
// reference code:
for (int iproc=0; iproc<nprocs; iproc++) {
MPI_Scatter( regular,1,MPI_DOUBLE,
&(transpose[iproc]),1,MPI_DOUBLE,
iproc,comm);
}
#else
MPI_Request scatter_requests[nprocs];
for (int iproc=0; iproc<nprocs; iproc++) {
MPI_Iscatter( regular,1,MPI_DOUBLE,
&(transpose[iproc]),1,MPI_DOUBLE,
iproc,comm,scatter_requests+iproc);
}
MPI_Waitall(nprocs,scatter_requests,MPI_STATUSES_IGNORE);
#endif
/*
* Check the result
*/
printf("[%d] :",procno);
for (int ip=0; ip<nprocs; ip++)
printf(" %5.2f",transpose[ip]);
printf("\n");
MPI_Finalize();
return 0;
}
#include "globalinit.c"
/*
* Set up an array of which processes you will receive from
*/
int
// data that we know:
*i_recv_from_proc = (int*) malloc(nprocs*sizeof(int)),
*procs_to_recv_from, nprocs_to_recv_from=0,
// data we are going to determine:
*procs_to_send_to,nprocs_to_send_to;
/*
* Initialize
*/
/*
* Generate array of "yes/no I recv from proc p",
* and condensed array of procs I receive from.
*/
nprocs_to_recv_from = 0;
for (int iproc=0; iproc<nprocs; iproc++)
// pick random procs to receive from, not yourself.
if ( (float) rand()/(float)RAND_MAX < 2./nprocs && iproc!=procno ) {
i_recv_from_proc[iproc] = 1;
nprocs_to_recv_from++;
}
procs_to_recv_from = (int*) malloc(nprocs_to_recv_from*sizeof(int));
int count_procs_to_recv_from = 0;
for (int iproc=0; iproc<nprocs; iproc++)
if ( i_recv_from_proc[iproc] )
procs_to_recv_from[count_procs_to_recv_from++] = iproc;
ASSERT( count_procs_to_recv_from==nprocs_to_recv_from );
/*
*/
printf("[%d] receiving from:",procno);
for (int iproc=0; iproc<nprocs_to_recv_from; iproc++)
printf(" %3d",procs_to_recv_from[iproc]);
printf(".\n");
/*
* Now find how many procs will send to you
*/
MPI_Reduce_scatter_block
(i_recv_from_proc,&nprocs_to_send_to,1,MPI_INT,
MPI_SUM,comm);
/*
* Send a zero-size msg to everyone that you receive from,
* just to let them know that they need to send to you.
*/
MPI_Request send_requests[nprocs_to_recv_from];
for (int iproc=0; iproc<nprocs_to_recv_from; iproc++) {
int proc=procs_to_recv_from[iproc];
double send_buffer=0.;
MPI_Isend(&send_buffer,0,MPI_DOUBLE, /*to:*/ proc,0,comm,
&(send_requests[iproc]));
}
/*
* Do as many receives as you know are coming in;
* use wildcards since you don't know where they are coming from.
* The source is a process you need to send to.
*/
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
localsize = random.randint(2,10)
print("[%d] local size=%d" % (procid,localsize))
localdata = np.empty(localsize,dtype=int)
for i in range(localsize):
localdata[i] = procid
#include <string.h>
#include "mpi.h"
#include "globalinit.c"
MPI_Allgather
( &my_count,1,MPI_INT,
recv_counts,1,MPI_INT, comm );
int accumulate = 0;
for (int i=0; i<nprocs; i++) {
recv_displs[i] = accumulate; accumulate += recv_counts[i]; }
int *global_array = (int*) malloc(accumulate*sizeof(int));
MPI_Allgatherv
( my_array,procno+1,MPI_INT,
global_array,recv_counts,recv_displs,MPI_INT, comm );
if (procno==0) {
for (int p=0; p<nprocs; p++)
if (recv_counts[p]!=p+1)
printf("count[%d] should be %d, not %d\n",
p,p+1,recv_counts[p]);
int c = 0;
for (int p=0; p<nprocs; p++)
for (int q=0; q<=p; q++)
if (global_array[c++]!=p)
printf("p=%d, q=%d should be %d, not %d\n",
p,q,p,global_array[c-1]);
}
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
mycount = procid+1
my_array = np.empty(mycount,dtype=np.float64)
for i in range(mycount):
my_array[i] = procid
recv_counts = np.empty(nprocs,dtype=int)
recv_displs = np.empty(nprocs,dtype=int)
my_count = np.empty(1,dtype=int)
my_count[0] = mycount
comm.Allgather( my_count,recv_counts )
accumulate = 0
for p in range(nprocs):
recv_displs[p] = accumulate; accumulate += recv_counts[p]
global_array = np.empty(accumulate,dtype=np.float64)
comm.Allgatherv( my_array, [global_array,recv_counts,recv_displs,MPI.DOUBLE] )
# other syntax:
# comm.Allgatherv( [my_array,mycount,0,MPI.DOUBLE], [global_array,recv_counts,recv_displs,MPI.DOUBLE] )
if procid==0:
#print(procid,global_array)
for p in range(nprocs):
if recv_counts[p]!=p+1:
print( "recv count[%d] should be %d, not %d" \
% (p,p+1,recv_counts[p]) )
c = 0
for p in range(nprocs):
for q in range(p+1):
if global_array[c]!=p:
print( "p=%d, q=%d should be %d, not %d" \
% (p,q,p,global_array[c]) )
c += 1
print("finished")
m = r;
} else if (r==0) {
m = n;
} else if (n<r) { // new value is less but not zero: use n
m = n;
} else { // new value is more: use r
m = r;
};
#ifdef DEBUG
printf("combine %d %d : %d\n",r,n,m);
#endif
// return the new value
*(int*)inout = m;
}
#include "globalinit.c"
MPI_Op rwz;
MPI_Op_create(reduce_without_zero,1,&rwz);
MPI_Allreduce(data+procno,&positive_minimum,1,MPI_INT,rwz,comm);
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
n = in_array[0]; r = inout_array[0]
if n==0:
m = r
elif r==0:
m = n
elif n<r:
m = n
else:
m = r
inout_array[0] = m
ndata = 10
data = np.zeros(10,dtype=intc)
data[:] = [2,3,0,5,0,1,8,12,4,0]
if nprocs>ndata:
print("Too many procs for this example: at most %d\n" %ndata)
sys.exit(1)
#
# compute reduction by hand
#
mreduct=2000000000
for i in range(nprocs):
if data[i]<mreduct and data[i]>0:
mreduct = data[i]
rwz = MPI.Op.Create(reduceWithoutZero)
positive_minimum = np.zeros(1,dtype=intc)
comm.Allreduce(data[procid],positive_minimum,rwz);
#
# check that the distributed result is the same as sequential
#
if mreduct!=positive_minimum:
print("[%d] Result %d should be %d\n" % \
(procid,positive_minimum,mreduct))
elif procid==0:
print("User-defined reduction successful: %d\n" % positive_minimum)
#include "globalinit.c"
/*
* Pick one random process
* that will do a send
*/
int sender,receiver;
if (procno==0)
sender = rand()%nprocs;
MPI_Bcast(&sender,1,MPI_INT,0,comm);
int i_do_send = sender==procno;
float data=1.;
MPI_Request send_request;
if (i_do_send) {
/*
* Pick a random process to send to,
* not yourself.
*/
int receiver = rand()%nprocs;
while (receiver==procno) receiver = rand()%nprocs;
printf("[%d] random send performed to %d\n",procno,receiver);
//MPI_Isend(&data,1,MPI_FLOAT,receiver,0,comm,&send_request);
MPI_Ssend(&data,1,MPI_FLOAT,receiver,0,comm);
}
/*
* Everyone posts the non-blocking barrier
* and gets a request to test/wait for
*/
MPI_Request barrier_request;
MPI_Ibarrier(comm,&barrier_request);
int step=0;
/*
* Now everyone repeatedly tests the barrier
* and probes for incoming messages.
* If the barrier completes, there are no
* incoming messages.
*/
MPI_Barrier(comm);
double tstart = MPI_Wtime();
for ( ; ; step++) {
int barrier_done_flag=0;
MPI_Test(&barrier_request,&barrier_done_flag,
MPI_STATUS_IGNORE);
//stop if you're done!
if (barrier_done_flag) {
break;
} else {
// if you're not done with the barrier:
int flag; MPI_Status status;
MPI_Iprobe
( MPI_ANY_SOURCE,MPI_ANY_TAG,
comm, &flag, &status );
if (flag) {
// absorb message!
int sender = status.MPI_SOURCE;
MPI_Recv(&data,1,MPI_FLOAT,sender,0,comm,MPI_STATUS_IGNORE);
printf("[%d] random receive from %d\n",procno,sender);
}
}
}
MPI_Barrier(comm);
double duration = MPI_Wtime()-tstart;
if (procno==0) printf("Probe loop: %e\n",duration);
MPI_Finalize();
return 0;
}
MPI_Init(&argc,&argv);
comm = MPI_COMM_WORLD;
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procid);
int mysleep;
srand(procid*time(NULL));
mysleep = nprocs * (rand()/(double)RAND_MAX);
int global_finish=mysleep;
do {
int all_done_flag=0;
MPI_Test(&final_barrier,&all_done_flag,MPI_STATUS_IGNORE);
if (all_done_flag) {
break;
} else {
int flag; MPI_Status status;
// force progress
MPI_Iprobe
( MPI_ANY_SOURCE,MPI_ANY_TAG,
comm, &flag, MPI_STATUS_IGNORE );
printf("[%d] going to work for another second\n",procid);
sleep(1);
global_finish++;
}
} while (1);
MPI_Wait(&final_barrier,MPI_STATUS_IGNORE);
printf("[%d] concluded %d work, total time %d\n",
procid,mysleep,global_finish);
MPI_Finalize();
return 0;
}
As before (see figure 2.6), we give each processor a subset of the 𝑥𝑖 s and 𝑦𝑖 s. Let’s define 𝑖𝑝 as the first
index of 𝑦 that is computed by processor 𝑝. (What is the last index computed by processor 𝑝? How many
indices are computed on that processor?)
We often talk about the owner computes model of parallel computing: each processor ‘owns’ certain data
items, and it computes their value.
Now let’s investigate how processor 𝑝 goes about computing 𝑦𝑖 for the 𝑖-values it owns. Let’s assume
that process 𝑝 also stores the values 𝑥𝑖 for these same indices. Now, for many values 𝑖 it can evalute the
computation
𝑦𝑖 = (𝑥𝑖−1 + 𝑥𝑖 + 𝑥𝑖+1 )/3
locally (figure 4.1).
However, there is a problem with computing 𝑦 in the first index 𝑖𝑝 on processor 𝑝:
The point to the left, 𝑥𝑖𝑝 −1 , is not stored on process 𝑝 (it is stored on 𝑝−1), so it is not immediately available
for use by process 𝑝. (figure 4.2). There is a similar story with the last index that 𝑝 tries to compute: that
involves a value that is only present on 𝑝 + 1.
You see that there is a need for processor-to-processor, or technically point-to-point, information ex-
change. MPI realizes this through matched send and receive calls:
• One process does a send to a specific other process;
• the other process does a specific receive from that source.
We will now discuss the send and receive routines in detail.
4.1. Blocking point-to-point operations
Since we are programming in SPMD mode, this means our program looks like:
if ( /* I am process A */ ) {
MPI_Send( /* to: */ B ..... );
MPI_Recv( /* from: */ B ... );
} else if ( /* I am process B */ ) {
MPI_Recv( /* from: */ A ... );
MPI_Send( /* to: */ A ..... );
}
Remark 8 The structure of the send and receive calls shows the symmetric nature of MPI: every target process
is reached with the same send call, no matter whether it's running on the same multicore chip as the sender, or
on a computational node halfway across the machine room, taking several network hops to reach. Of course,
any self-respecting MPI implementation optimizes for the case where sender and receiver have access to the
same shared memory. This means that a send/recv pair is realized as a copy operation from the sender buffer
to the receiver buffer, rather than a network transfer.
template<typename T >
void mpl::communicator::send
( const T &scalar, int dest, tag = tag(0) ) const
( const T *buffer, const layout< T > &, int dest, tag = tag(0) ) const
( iterT begin, iterT end, int dest, tag = tag(0) ) const
T : scalar type
begin : begin iterator
end : end iterator
Python:
Python native:
MPI.Comm.send(self, obj, int dest, int tag=0)
Python numpy:
MPI.Comm.Send(self, buf, int dest, int tag=0)
Buffer The send buffer is described by a trio of buffer/count/datatype. See section 3.2.4 for discussion.
Target The message target is an explicit process rank to send to. This rank is a number from zero up
to the result of MPI_Comm_size. It is allowed for a process to send to itself, but this may lead to a runtime
deadlock; see section 4.1.4 for discussion.
Tag Next, a message can have a tag. Many applications have each sender send only one message at a
time to a given receiver. For the case where there are multiple simultaneous messages between the same
sender / receiver pair, the tag can be used to disambiguate between the messages.
Often, a tag value of zero is safe to use. Indeed, OO interfaces to MPI typically have the tag as an optional
parameter with value zero. If you do use tag values, you can use the key MPI_TAG_UB to query what the
maximum value is that can be used; see section 15.1.2.
Communicator Finally, in common with the vast majority of MPI calls, there is a communicator argu-
ment that provides a context for the send transaction.
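Putting these four ingredients together, a minimal sketch of a complete send call; the variables receiver and comm are assumed to be defined elsewhere:
double send_data = 3.14;
MPI_Send( /* buffer/count/type: */ &send_data, 1, MPI_DOUBLE,
          /* to: */ receiver, /* tag: */ 0,
          /* communicator: */ comm );
If you do want to know the largest permissible tag value, the MPI_TAG_UB attribute mentioned above can be queried; again a hedged sketch:
int *tag_upper_bound, flag;
MPI_Comm_get_attr( comm, MPI_TAG_UB, &tag_upper_bound, &flag );
if (flag)
  printf("Largest tag value: %d\n", *tag_upper_bound);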
MPL note 23: blocking send and receive. MPL uses a default value for the tag, and it can deduce the type
of the buffer. Sending a scalar becomes:
// sendscalar.cxx
if (comm_world.rank()==0) {
double pi=3.14;
comm_world.send(pi, 1); // send to rank 1
cout << "sent: " << pi << '\n';
} else if (comm_world.rank()==1) {
double pi=0;
comm_world.recv(pi, 0); // receive from rank 0
cout << "got : " << pi << '\n';
}
template<typename T >
status mpl::communicator::recv
( T &,int,tag = tag(0) ) const inline
( T *,const layout< T > &,int,tag = tag(0) ) const
( iterT begin,iterT end,int source, tag t = tag(0) ) const
Python:
An example:
double recv_data;
MPI_Recv
( /* recv buffer/count/type: */ &recv_data,1,MPI_DOUBLE,
/* from: */ sender, /* tag: */ 0,
/* communicator: */ comm,
/* recv status: */ MPI_STATUS_IGNORE);
Buffer The receive buffer has the same buffer/count/data parameters as the send call. However, the
count argument here indicates the size of the buffer, rather than the actual length of a message. This sets
an upper bound on the length of the incoming message.
• For receiving messages with unknown length, use MPI_Probe; section 4.3.1.
• A message longer than the buffer size will give an overflow error, either returning an error, or
ending your program; see section 15.2.2.
The length of the received message can be determined from the status object; see section 4.3.2 for more
detail.
Source Mirroring the target argument of the MPI_Send call, MPI_Recv has a message source argument. This
can be either a specific rank, or it can be the MPI_ANY_SOURCE wildcard. In the latter case, the actual source
can be determined after the message has been received; see section 4.3.2. A source value of MPI_PROC_NULL
is also allowed, which makes the receive succeed immediately with no data received.
MPL note 27: any source. The constant mpl::any_source equals MPI_ANY_SOURCE (by constexpr).
Tag Similar to the message source, the message tag of a receive call can be a specific value or a wildcard,
in this case MPI_ANY_TAG. Again, see below.
Status The MPI_Recv command has one parameter that the send call lacks: the MPI_Status object, de-
scribing the message status. This gives information about the message received, for instance if you used
wildcards for source or tag. See section 4.3.2 for more about the status object.
Remark 9 If you’re not interested in the status, as is the case in many examples in this book, specify the
constant MPI_STATUS_IGNORE. Note that the signature of MPI_Recv lists the status parameter as ‘output’; this
‘direction’ of the parameter of course only applies if you do not specify this constant. (See also section 15.10.1.)
Exercise 4.1. Implement the ping-pong program. Add a timer using MPI_Wtime. For the
status argument of the receive call, use MPI_STATUS_IGNORE.
• Run multiple ping-pongs (say a thousand) and put the timer around the loop.
The first run may take longer; try to discard it.
• Run your code with the two communicating processes first on the same node,
then on different nodes. Do you see a difference?
• Then modify the program to use longer messages. How does the timing
increase with message size?
For bonus points, can you do a regression to determine 𝛼, 𝛽?
(There is a skeleton for this exercise under the name pingpong.)
Exercise 4.2. Take your pingpong program and modify it to let half the processors be
source and the other half the targets. Does the pingpong time increase?
Figure 4.3: Illustration of an ideal (left) and actual (right) send-receive interaction
in the network there is buffer capacity for all messages that are in transit. This is not the case: data resides
on the sender, and the sending call blocks, until the receiver has received all of it. (There is a exception
for small messages, as explained in the next section.)
The use of MPI_Send and MPI_Recv is known as blocking communication: when your code reaches a send or
receive call, it blocks until the call is succesfully completed.
Technically, blocking operations are called non-local since their execution depends on factors that are not
local to the process. See section 5.4.
For a receive call it is clear that the receiving code will wait until the data has actually come in, but for a
send call this is more subtle.
4.1.4.1 Deadlock
Suppose two processes need to exchange data, and consider the following pseudo-code, which purports to
exchange data between processes 0 and 1:
other = 1-mytid; /* if I am 0, other is 1; and vice versa */
receive(source=other);
send(target=other);
Imagine that the two processes execute this code. They both issue the receive call… and then can't go on,
because each is waiting for the other to issue the send call corresponding to its receive call. This
is known as deadlock.
Now reverse the order, so that each process sends first and receives second. With a synchronous protocol
you should still get deadlock, since the send calls will be waiting for the receive operation to be posted.
In practice, however, this code will often work. The reason is that MPI implementations sometimes send
small messages regardless of whether the receive has been posted. This relies on the availability of some
amount of available buffer space. The size under which this behavior is used is sometimes referred to as
the eager limit.
To illustrate eager and blocking behavior in MPI_Send, consider an example where we send gradually larger
messages. From the screen output you can see what the largest message was that fell under the eager limit;
after that the code hangs because of a deadlock.
// sendblock.c
other = 1-procno;
/* loop over increasingly large messages */
for (int size=1; size<2000000000; size*=10) {
sendbuf = (int*) malloc(size*sizeof(int));
recvbuf = (int*) malloc(size*sizeof(int));
if (!sendbuf || !recvbuf) {
printf("Out of memory\n"); MPI_Abort(comm,1);
}
MPI_Send(sendbuf,size,MPI_INT,other,0,comm);
MPI_Recv(recvbuf,size,MPI_INT,other,0,comm,&status);
/* If control reaches this point, the send call
did not block. If the send call blocks,
we do not reach this point, and the program will hang.
*/
if (procno==0)
printf("Send did not block for size %d\n",size);
free(sendbuf); free(recvbuf);
}
For the full source of this example, see section 4.5.7
!! sendblock.F90
other = 1-mytid
size = 1
do
allocate(sendbuf(size)); allocate(recvbuf(size))
print *,size
call MPI_Send(sendbuf,size,MPI_INTEGER,other,0,comm,err)
call MPI_Recv(recvbuf,size,MPI_INTEGER,other,0,comm,status,err)
if (mytid==0) then
print *,"MPI_Send did not block for size",size
end if
deallocate(sendbuf); deallocate(recvbuf)
size = size*10
if (size>2000000000) goto 20
end do
20 continue
For the full source of this example, see section 4.5.8
## sendblock.py
size = 1
while size<2000000000:
sendbuf = np.empty(size, dtype=int)
recvbuf = np.empty(size, dtype=int)
comm.Send(sendbuf,dest=other)
comm.Recv(recvbuf,source=other)
if procid<other:
print("Send did not block for",size)
size *= 10
For the full source of this example, see section 4.5.9
If you want a code to exhibit the same blocking behavior for all message sizes, you can force the send call
to be synchronous by using MPI_Ssend, which has the same calling sequence as MPI_Send, but which does
not allow eager sends.
// ssendblock.c
other = 1-procno;
sendbuf = (int*) malloc(sizeof(int));
recvbuf = (int*) malloc(sizeof(int));
size = 1;
MPI_Ssend(sendbuf,size,MPI_INT,other,0,comm);
MPI_Recv(recvbuf,size,MPI_INT,other,0,comm,&status);
printf("This statement is not reached\n");
For the full source of this example, see section 4.5.10
Formally you can describe deadlock as follows. Draw up a graph where every process is a node, and draw
a directed arc from process A to B if A is waiting for B. There is deadlock if this directed graph has a loop.
The solution to the deadlock in the above example is to first do the send from 0 to 1, and then from 1 to 0
(or the other way around). So the code would look like:
if ( /* I am processor 0 */ ) {
send(target=other);
receive(source=other);
} else {
receive(source=other);
send(target=other);
}
Eager sends also influence non-blocking sends. The wait call after a non-blocking send:
Code:
// eageri.c
printf("Sending %lu elements\n",n);
MPI_Request request;
MPI_Isend(buffer,n,MPI_DOUBLE,processB,0,comm,&request);
MPI_Wait(&request,MPI_STATUS_IGNORE);
printf(".. concluded\n");
Output:
Setting eager limit to 5000 bytes
TACC: Starting up job 4049189
TACC: Starting parallel tasks...
Sending 1 elements
.. concluded
Sending 10 elements
.. concluded
Sending 100 elements
.. concluded
Sending 1000 elements
^C[...] Sendi
will return immediately, regardless of any receive call, if the message is under the eager limit.
The eager limit is implementation-specific. For instance, for Intel MPI there is a variable I_MPI_EAGER_THRESHOLD
(old versions) or I_MPI_SHM_EAGER_THRESHOLD; for mvapich2 it is MV2_IBA_EAGER_THRESHOLD, and for
OpenMPI the --mca options btl_openib_eager_limit and btl_openib_rndv_eager_limit.
4.1.4.3 Serialization
There is a second, even more subtle problem with blocking communication. Consider the scenario where
every processor needs to pass data to its successor, that is, the processor with the next higher rank. The
basic idea would be to first send to your successor, then receive from your predecessor. Since the last
processor does not have a successor it skips the send, and likewise the first processor skips the receive.
The pseudo-code looks like:
successor = mytid+1; predecessor = mytid-1;
if ( /* I am not the last processor */ )
send(target=successor);
if ( /* I am not the first processor */ )
receive(source=predecessor)
Exercise 4.3. (Classroom exercise) Each student holds a piece of paper in the right hand
– keep your left hand behind your back – and we want to execute:
1. Give the paper to your right neighbor;
2. Accept the paper from your left neighbor.
Including boundary conditions for first and last process, that becomes the following
program:
1. If you are not the rightmost student, turn to the right and give the paper to
your right neighbor.
2. If you are not the leftmost student, turn to your left and accept the paper from
your left neighbor.
This code does not deadlock. All processors but the last one block on the send call, but the last processor
executes the receive call. Thus, the processor before the last one can do its send, and subsequently continue
to its receive, which enables another send, et cetera.
In one way this code does what you intended to do: it will terminate (instead of hanging forever on a
deadlock) and exchange data the right way. However, the execution now suffers from unexpected
serialization: only one processor is active at any time, so what should have been a parallel operation
becomes effectively sequential.
x₀ = 1 on process zero
xₚ = xₚ₋₁ + (p + 1)² on process p
Use MPI_Send and MPI_Recv; make sure to get the order right.
Food for thought: all quantities involved here are integers. Is it a good idea to use
the integer datatype here?
Question: could you have done this with a collective call?
(There is a skeleton for this exercise under the name bucketblock.)
Remark 10 There is an MPI_Scan routine (section 3.4) that performs the same computation more efficiently.
Thus, this exercise only serves to illustrate the principle.
with the right choice of source and destination. For instance, to send data to your right neighbor:
MPI_Comm_rank(comm,&procno);
MPI_Sendrecv( ....
/* from: */ procno-1
... ...
/* to: */ procno+1
... );
This scheme is correct for all processes but the first and last. In order to use the sendrecv call on these
processes, we use MPI_PROC_NULL for the non-existing processes that the endpoints communicate with.
MPI_Comm_rank( .... &mytid );
if ( /* I am not the first processor */ )
predecessor = mytid-1;
else
predecessor = MPI_PROC_NULL;
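A hedged sketch of the resulting rightward shift, with mydata and leftdata as assumed variable names; because the end points pass MPI_PROC_NULL, the same call works on every process:
int successor   = (procno==nprocs-1) ? MPI_PROC_NULL : procno+1;
int predecessor = (procno==0)        ? MPI_PROC_NULL : procno-1;
double mydata = procno, leftdata = -1.;
MPI_Sendrecv( &mydata,   1, MPI_DOUBLE, successor,   0,
              &leftdata, 1, MPI_DOUBLE, predecessor, 0,
              comm, MPI_STATUS_IGNORE );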
template<typename T >
status mpl::communicator::sendrecv
( const T & senddata, int dest, tag sendtag,
T & recvdata, int source, tag recvtag
) const
( const T * senddata, const layout< T > & sendl, int dest, tag sendtag,
T * recvdata, const layout< T > & recvl, int source, tag recvtag
) const
( iterT1 begin1, iterT1 end1, int dest, tag sendtag,
iterT2 begin2, iterT2 end2, int source, tag recvtag
) const
Python:
Sendrecv(self,
sendbuf, int dest, int sendtag=0,
recvbuf=None, int source=ANY_SOURCE, int recvtag=ANY_TAG,
Status status=None)
Remark 11 The MPI_Sendrecv can inter-operate with the normal send and receive calls, both blocking and
non-blocking. Thus it would also be possible to replace the MPI_Sendrecv calls at the end points by simple sends
or receives.
MPL note 28: send-recv call. The send-recv call in MPL has the same possibilities for specifying the send
and receive buffer as the separate send and recv calls: scalar, layout, iterator. However, out of the
nine conceivably possible routine signatures, only the versions are available where the send and
receive buffer are specified the same way. Also, the send and receive tag need to be specified;
they do not have default values.
// sendrecv.cxx
mpl::tag t0(0);
comm_world.sendrecv
( mydata,sendto,t0,
leftdata,recvfrom,t0 );
• As with the simple send/recv calls, processes have to match up: if process 𝑝 specifies 𝑝 ′ as the
destination of the send part of the call, 𝑝 ′ needs to specify 𝑝 as the source of the recv part.
The following exercise lets you implement a sorting algorithm with the send-receive call.
Exercise 4.9. A very simple sorting algorithm is swap sort or odd-even transposition sort:
pairs of processors compare data, and if necessary exchange. The elementary step is
called a compare-and-swap: in a pair of processors each sends their data to the other;
one keeps the minimum values, and the other the maximum. For simplicity, in this
exercise we give each processor just a single number.
The exchange sort algorithm is split in even and odd stages, where in the even stage,
processors 2𝑖 and 2𝑖 + 1 compare and swap data, and in the odd stage, processors
2𝑖 + 1 and 2𝑖 + 2 compare and swap. You need to repeat this 𝑃/2 times, where 𝑃 is
the number of processors; see figure 4.7.
Implement this algorithm using MPI_Sendrecv. (Use MPI_PROC_NULL for the edge cases
if needed.) Use a gather call to print the global state of the distributed array at the
beginning and end of the sorting process.
Remark 12 It is not possible to use MPI_IN_PLACE for the buffers. Instead, the routine MPI_Sendrecv_replace
(figure 4.4) has only one buffer, used as both send and receive buffer. Of course, this requires the send and
receive messages to fit in that one buffer.
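A minimal sketch of that one-buffer variant, with buffer, right, and left as assumed names: the buffer contents are sent to right and then overwritten by the data received from left.
MPI_Sendrecv_replace( buffer, 1, MPI_DOUBLE,
                      /* dest: */ right, /* sendtag: */ 0,
                      /* source: */ left, /* recvtag: */ 0,
                      comm, MPI_STATUS_IGNORE );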
Exercise 4.10. Extend this exercise to the case where each process holds an equal number of
elements, more than 1. Consider figure 4.8 for inspiration. Is it coincidence that the
algorithm takes the same number of steps as in the single scalar case?
The following material is for the recently released MPI-4 standard and may not be supported yet.
There are non-blocking and persistent versions of this routine: MPI_Isendrecv, MPI_Sendrecv_init, MPI_Isendrecv_replace,
MPI_Sendrecv_replace_init.
End of MPI-4 material
𝑦𝑖 = 𝑥𝑖−1 + 𝑥𝑖 + 𝑥𝑖+1 ∶ 𝑖 = 1, … , 𝑁 − 1
are organized in a general graph pattern. Here, the numbers of sends and receive of a processor do not
need to match.
In such cases, one wants a possibility to state ‘these are the expected incoming messages’, without having
to wait for them in sequence. Likewise, one wants to declare the outgoing messages without having to do
them in any particular sequence. Imposing any sequence on the sends and receives is likely to run into
the serialization behavior observed above, or at least be inefficient since processors will be waiting for
messages.
By contrast, the nonblocking calls MPI_Isend (figure 4.5) and MPI_Irecv (figure 4.6) (where the ‘I’ stands
for ‘immediate’ or ‘incomplete’ ) do not wait for their counterpart: in effect they tell the runtime system
‘here is some data and please send it as follows’ or ‘here is some buffer space, and expect such-and-such
data to come’. This is illustrated in figure 4.10.
// isendandirecv.c
double send_data = 1.;
MPI_Request request;
MPI_Isend
( /* send buffer/count/type: */ &send_data,1,MPI_DOUBLE,
/* to: */ receiver, /* tag: */ 0,
/* communicator: */ comm,
/* request: */ &request);
MPI_Wait(&request,MPI_STATUS_IGNORE);
template<typename T >
irequest mpl::communicator::isend
( const T & data, int dest, tag t = tag(0) ) const;
( const T * data, const layout< T > & l, int dest, tag t = tag(0) ) const;
( iterT begin, iterT end, int dest, tag t = tag(0) ) const;
Python:
/* request: */ &request);
MPI_Wait(&request,MPI_STATUS_IGNORE);
template<typename T >
irequest mpl::communicator::irecv
( const T & data, int src, tag t = tag(0) ) const;
( const T * data, const layout< T > & l, int src, tag t = tag(0) ) const;
( iterT begin, iterT end, int src, tag t = tag(0) ) const;
Python:
} else {
double recv=0.;
MPI_Request request;
MPI_Irecv( &recv,1,MPI_DOUBLE,sender,0,comm,&request);
MPI_Wait(&request,MPI_STATUS_IGNORE);
}
This means that the normal sequence of first declaring, and then filling in, the request variable
is not possible.
MPL implementation note: The wait call always returns a status object; not assigning
it means that the destructor is called on it.
Here we discuss in some detail the various wait calls. These are blocking; for the nonblocking versions
see section 4.2.3.
However, this would be inefficient if the first request is fulfilled much later than the others: your waiting
process would have lots of idle time. In that case, use one of the following routines.
The output argument is an array or MPI_Status object. If you don’t need the status objects, you can pass
MPI_STATUSES_IGNORE.
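For instance, a minimal sketch of the completion-order idiom, assuming an array requests of nrequests outstanding requests:
for (int i=0; i<nrequests; i++) {
  int index;
  MPI_Waitany( nrequests, requests, &index, MPI_STATUS_IGNORE );
  // process the data belonging to requests[index];
  // the completed request is set to MPI_REQUEST_NULL
}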
Exercise 4.11. Revisit exercise 4.6 and consider replacing the blocking calls by nonblocking
ones. How far apart can you put the MPI_Isend / MPI_Irecv calls and the
corresponding MPI_Waits?
(There is a skeleton for this exercise under the name bucketpipenonblock.)
Exercise 4.12. Create two distributed arrays of positive integers. Take the set difference of
the two: the first array needs to be transformed to remove from it those numbers
that are in the second array.
How could you solve this with an MPI_Allgather call? Why is it not a good idea to do
so? Solve this exercise instead with a circular bucket brigade algorithm.
(There is a skeleton for this exercise under the name setdiff.)
Python note 13: handling a single request. Non-blocking routines such as MPI_Isend return a request ob-
ject. The MPI_Wait is a class method, not a method of the request object:
## irecvsingle.py
sendbuffer = np.empty( nprocs, dtype=int )
recvbuffer = np.empty( nprocs, dtype=int )
requests = []
for p in range(nprocs):
left_p = (p-1) % nprocs
right_p = (p+1) % nprocs
requests.append( comm.Isend\
( sendbuffer[p:p+1],dest=left_p) )
requests.append( comm.Irecv\
( recvbuffer[p:p+1],source=right_p) )
MPI.Request.Waitall(requests)
Note that this routine takes a single status argument, passed by reference, and not an array of statuses!
Fortran note 6: index of requests. The index parameter is the index in the array of requests, so it uses
1-based indexing.
MPI.Request.Waitany( requests,status=None )
class method, returns index
!! irecvsource.F90
if (mytid==ntids-1) then
do p=1,ntids-1
print *,"post"
call MPI_Irecv(recv_buffer(p),1,MPI_INTEGER,p-1,0,comm,&
requests(p),err)
end do
do p=1,ntids-1
call MPI_Waitany(ntids-1,requests,index,MPI_STATUS_IGNORE,err)
write(*,'("Message from",i3,":",i5)') index,recv_buffer(index)
end do
## irecvsource.py
if procid==nprocs-1:
receive_buffer = np.empty(nprocs-1,dtype=int)
requests = [ None ] * (nprocs-1)
for sender in range(nprocs-1):
requests[sender] = comm.Irecv(receive_buffer[sender:sender+1],source=sender)
# alternatively: requests = [ comm.Irecv(s) for s in .... ]
status = MPI.Status()
for sender in range(nprocs-1):
ind = MPI.Request.Waitany(requests,status=status)
if ind!=status.Get_source():
print("sender mismatch: %d vs %d" % (ind,status.Get_source()))
print("received from",ind)
else:
mywait = random.randint(1,2*nprocs)
print("[%d] wait for %d seconds" % (procid,mywait))
time.sleep(mywait)
mydata = np.empty(1,dtype=int)
mydata[0] = procid
comm.Send([mydata,MPI.INT],dest=nprocs-1)
Remark 13 The routines that can return multiple statuses, can return the error condition MPI_ERR_IN_STATUS,
indicating that one of the statuses was in error. See section 4.3.2.3.
Exercise 4.13.
(There is a skeleton for this exercise under the name isendirecv.) Now use
nonblocking send/receive routines to implement the three-point averaging
operation
𝑦𝑖 = (𝑥𝑖−1 + 𝑥𝑖 + 𝑥𝑖+1 )/3 ∶ 𝑖 = 1, … , 𝑁 − 1
on a distributed array. (Hint: use MPI_PROC_NULL at the ends.)
This is known as overlapping computation and communication, or latency hiding. See also asynchronous
progress; section 15.4.
Unfortunately, a lot of this communication involves activity in user space, so the solution would have
been to let it be handled by a separate thread. Until recently, processors were not efficient at doing such
multi-threading, so true overlap stayed a promise for the future. Some network cards have support for
this overlap, but it requires a nontrivial combination of hardware, firmware, and MPI implementation.
Exercise 4.14.
(There is a skeleton for this exercise under the name isendirecvarray.) Take your
code of exercise 4.13 and modify it to use latency hiding. Operations that can be
Remark 14 You have now seen various send types: blocking, nonblocking, synchronous. Can a receiver see
what kind of message was sent? Are different receive routines needed? The answer is that, on the receiving
end, there is nothing to distinguish a nonblocking or synchronous message. The MPI_Recv call can match any
of the send routines you have seen so far (but not MPI_Sendrecv), and conversely a message sent with MPI_Send
can be received by MPI_Irecv.
• On the other hand, when a nonblocking send call returns, the actual send may not have been
executed, so the send buffer may not be safe to overwrite. Similarly, when the recv call returns,
you do not know for sure that the expected data is in it. Only after the corresponding wait call
are you sure that the buffer has been sent, or has received its contents.
• To send multiple messages with nonblocking calls you therefore have to allocate multiple buffers.
double **buffers;
for ( ... p ... ) {
buffers[p] = // fill in the data
MPI_Isend( buffers[p], ... /* to: */ p ..., &(requests[p]) );
}
MPI_Waitall( /* the requests */ );
// irecvloop.c
MPI_Request *requests =
(MPI_Request*) malloc( 2*nprocs*sizeof(MPI_Request) );
recv_buffers = (int*) malloc( nprocs*sizeof(int) );
send_buffers = (int*) malloc( nprocs*sizeof(int) );
for (int p=0; p<nprocs; p++) {
int
left_p = (p-1+nprocs) % nprocs,
right_p = (p+1) % nprocs;
send_buffers[p] = nprocs-p;
MPI_Isend(send_buffers+p,1,MPI_INT, right_p,0, comm, requests+2*p);
MPI_Irecv(recv_buffers+p,1,MPI_INT, left_p,0, comm, requests+2*p+1);
}
/* your useful code here */
MPI_Waitall(2*nprocs,requests,MPI_STATUSES_IGNORE);
request.Test()
If the test is true, the request is deallocated and set to MPI_REQUEST_NULL, or, in the case of an active persistent
request, set to inactive.
Analogous to MPI_Wait, MPI_Waitany, MPI_Waitall, MPI_Waitsome, there are MPI_Test (figure 4.10), MPI_Testany,
MPI_Testall, MPI_Testsome.
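A minimal sketch of this polling idiom, assuming an outstanding request from a nonblocking call; do_local_work is a hypothetical placeholder for useful local computation:
int done = 0;
while (!done) {
  MPI_Test( &request, &done, MPI_STATUS_IGNORE );
  if (!done)
    do_local_work();   // keep computing until the communication completes
}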
Exercise 4.15. Read section HPC book, section-6.5 and give pseudo-code for the distributed
sparse matrix-vector product using the above idiom for using MPI_Test... calls.
Discuss the advantages and disadvantages of this approach. The answer is not going
to be black and white: discuss when you expect which approach to be preferable.
Correspondingly, calls to MPI_Wait or MPI_Test free this object, setting the handle to MPI_REQUEST_NULL.
(There is an exception for persistent communications where the request is only set to ‘inactive’; sec-
tion 5.1.) Thus, it is wise to issue wait calls even if you know that the operation has succeeded. For in-
stance, if all receive calls are concluded, you know that the corresponding send calls are finished and there
is no strict need to wait for their requests. However, omitting the wait calls would lead to a memory leak.
Another way around this is to call MPI_Request_free (figure 4.11), which sets the request variable to
MPI_REQUEST_NULL, and marks the object for deallocation after completion of the operation. Conceivably,
one could issue a nonblocking call, and immediately call MPI_Request_free, dispensing with any wait call.
However, this makes it hard to know when the operation is concluded and when the buffer is safe to
reuse [26].
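A hedged sketch of that idiom (buffer, n, receiver, comm assumed); after the free there is indeed no request left to tell you when the buffer becomes safe to reuse:
MPI_Request request;
MPI_Isend( buffer, n, MPI_DOUBLE, receiver, 0, comm, &request );
MPI_Request_free( &request );
// the handle is now MPI_REQUEST_NULL; the send still completes eventually,
// but there is nothing left to wait or test on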
You can inspect the status of a request without freeing the request object with MPI_Request_get_status
(figure 4.12).
if (procno==receiver) {
MPI_Status status;
MPI_Probe(sender,0,comm,&status);
int count;
MPI_Get_count(&status,MPI_FLOAT,&count);
float recv_buffer[count];
MPI_Recv(recv_buffer,count,MPI_FLOAT, sender,0,comm,MPI_STATUS_IGNORE);
} else if (procno==sender) {
float buffer[buffer_size];
ierr = MPI_Send(buffer,buffer_size,MPI_FLOAT, receiver,0,comm); CHK(ierr);
}
There is a problem with the MPI_Probe call in a multithreaded environment: the following scenario can
happen.
1. A thread determines by probing that a certain message has come in.
2. It issues a blocking receive call for that message…
3. But in between the probe and the receive call another thread has already received the message.
4. … Leaving the first thread in a blocked state with no message to receive.
This is solved by MPI_Mprobe (figure 4.15), which after a successful probe removes the message from the
matching queue: the list of messages that can be matched by a receive call. The thread that matched the
probe now issues an MPI_Mrecv (figure 4.16) call on that message through an object of type MPI_Message.
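A minimal sketch of that matched-probe sequence, with comm assumed; after the probe, the message can only be received through the msg handle:
MPI_Message msg;
MPI_Status status;
MPI_Mprobe( MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &msg, &status );
int count;
MPI_Get_count( &status, MPI_DOUBLE, &count );
double *buf = (double*) malloc( count*sizeof(double) );
MPI_Mrecv( buf, count, MPI_DOUBLE, &msg, MPI_STATUS_IGNORE );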
• MPI_ERROR gives the error status of the receive call; see section 4.3.2.3.
• MPI_SOURCE gives the source of the message; see section 4.3.2.1.
• MPI_TAG gives the tag with which the message was received; see section 4.3.2.2.
• The number of items in the message can be deduced from the status object, but through a func-
tion call to MPI_Get_count, not as a structure member; see section 4.3.2.4.
Fortran note 7: status object in f08. The mpi_f08 module turns many handles (such as communicators)
from Fortran Integers into Types. Retrieving the integer from the type is usually done through
the %val member, but for the status object this is more difficult. The routines MPI_Status_f2f08
and MPI_Status_f082f convert between these. (Remarkably, these routines are even available
in C, where they operate on MPI_Fint, MPI_F08_status arguments.)
Python note 15: status object. The status object is explicitly created before being passed to the receive
routine. It has the usual query methods:
## pingpongbig.py
status = MPI.Status()
comm.Recv( rdata,source=0,status=status)
count = status.Get_count(MPI.DOUBLE)
4.3.2.1 Source
In some applications it makes sense that a message can come from one of a number of processes. In
this case, it is possible to specify MPI_ANY_SOURCE as the source. To find out the source where the message
actually came from, you would use the MPI_SOURCE field of the status object that is delivered by MPI_Recv
or the MPI_Wait... call after an MPI_Irecv.
MPI_Recv(recv_buffer+p,1,MPI_INT, MPI_ANY_SOURCE,0,comm,
&status);
sender = status.MPI_SOURCE;
There are various scenarios where receiving from ‘any source’ makes sense. One is that of the manager-
worker model. The manager task would first send data to the worker tasks, then issues a blocking wait
for the data of whichever process finishes first.
This code snippet is a simple model for this: all workers processes wait a random amount of time. For
efficiency, the manager process accepts message from any source.
// anysource.c
if (procno==nprocs-1) {
/*
* The last process receives from every other process
*/
int *recv_buffer;
recv_buffer = (int*) malloc((nprocs-1)*sizeof(int));
/*
* Messages can come in in any order, so use MPI_ANY_SOURCE
*/
MPI_Status status;
for (int p=0; p<nprocs-1; p++) {
err = MPI_Recv(recv_buffer+p,1,MPI_INT, MPI_ANY_SOURCE,0,comm,
&status); CHK(err);
int sender = status.MPI_SOURCE;
printf("Message from sender=%d: %d\n",
sender,recv_buffer[p]);
}
free(recv_buffer);
} else {
/*
* Each rank waits an unpredictable amount of time,
* then sends to the last process in line.
*/
float randomfraction = (rand() / (double)RAND_MAX);
int randomwait = (int) ( nprocs * randomfraction );
printf("process %d waits for %e/%d=%d\n",
procno,randomfraction,nprocs,randomwait);
sleep(randomwait);
err = MPI_Send(&randomwait,1,MPI_INT, nprocs-1,0,comm); CHK(err);
}
MPL note 34: status source querying. The status object can be queried:
int source = recv_status.source();
4.3.2.2 Tag
If a processor is expecting more than one message from a single other processor, message tags are used to
distinguish between them. In that case, a value of MPI_ANY_TAG can be used, and the actual tag of a message
can be retrieved as the MPI_TAG member in the status structure. See section 4.3.2.1 about MPI_SOURCE for
how to use this.
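As a sketch, suppose the receiver expects both a data message and a control message from the same sender, distinguished by hypothetical tag values 1 and 2 (the variables buffer, buflen, and sender are placeholders):
MPI_Status status;
MPI_Recv(buffer,buflen,MPI_DOUBLE, sender,MPI_ANY_TAG,comm,&status);
if (status.MPI_TAG==1)
  printf("received a data message\n");
else if (status.MPI_TAG==2)
  printf("received a control message\n");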
MPL note 35: message tag. MPL differs from other APIs in its treatment of tags: a tag is not directly an
integer, but an object of class tag.
// sendrecv.cxx
mpl::tag t0(0);
comm_world.sendrecv
( mydata,sendto,t0,
leftdata,recvfrom,t0 );
The tag class has a couple of methods such as mpl::tag::any() (for the MPI_ANY_TAG wildcard in
receive calls) and mpl::tag::up() (maximal tag, found from the MPI_TAG_UB attribute).
4.3.2.3 Error
Any errors during the receive operation can be found as the MPI_ERROR member of the status structure. This
field is only set by functions that return multiple statuses, such as MPI_Waitall. For functions that return
a single status, any error is returned as the function result. For a function returning multiple statuses, the
presence of any error is indicated by a result of MPI_ERR_IN_STATUS; section 4.2.2.6.
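For instance, a sketch of checking the individual statuses after a waitall (the arrays requests and statuses and the length nreqs are placeholders; this assumes the error handler has been set to MPI_ERRORS_RETURN so that errors are not fatal):
MPI_Status statuses[nreqs];
int ierr = MPI_Waitall(nreqs,requests,statuses);
if (ierr==MPI_ERR_IN_STATUS)
  for (int i=0; i<nreqs; i++)
    if (statuses[i].MPI_ERROR!=MPI_SUCCESS)
      printf("request %d failed with error %d\n",i,statuses[i].MPI_ERROR);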
4.3.2.4 Count
If the amount of data received is not known a priori, the count of elements received can be found by
MPI_Get_count (figure 4.17):
// count.c
if (procid==0) {
int sendcount = ( rand()/(double)RAND_MAX>.5 ) ? N : N-1;
MPI_Send( buffer,sendcount,MPI_FLOAT,target,0,comm );
} else if (procid==target) {
MPI_Status status;
int recvcount;
MPI_Recv( buffer,N,MPI_FLOAT,0,0, comm, &status );
MPI_Get_count(&status,MPI_FLOAT,&recvcount);
printf("Received %d elements\n",recvcount);
}
template<typename T>
int mpl::status::get_count () const
template<typename T>
int mpl::status::get_count (const layout<T> &l) const
Code:
!! count.F90
if (procid==0) then
   sendcount = N
   call random_number(fraction)
   if (fraction>.5) then
      print *,"One less" ; sendcount = N-1
   end if
   call MPI_Send( buffer,sendcount,MPI_REAL,target,0,comm )
else if (procid==target) then
   call MPI_Recv( buffer,N,MPI_REAL,0,0, comm, status )
   call MPI_Get_count(status,MPI_FLOAT,recvcount)
   print *,"Received",recvcount,"elements"
end if

Output:
make[3]: `count' is up to date.
TACC: Starting up job 4051425
TACC: Setting up parallel environment for MVAPICH2
TACC: Starting parallel tasks...
One less
Received 9 elements
TACC: Shutdown complete. Exiting.
This may be necessary since the count argument to MPI_Recv is the buffer size, not an indication of the
actually received number of data items.
Remarks.
• Unlike the source and tag, the message count is not directly a member of the status structure.
• The ‘count’ returned is the number of elements of the specified datatype. If this is a derived type (section 6.3), this is not the same as the number of elementary datatype elements. For that, use MPI_Get_elements (figure 4.18) or MPI_Get_elements_x, which return the number of basic elements; see the sketch below.
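As an illustration, here is a minimal sketch (not from the book's sources) that receives into a hypothetical committed derived type vectortype built from MPI_DOUBLE elements, and queries the count both ways:
MPI_Status status;
int typecount,doublecount;
MPI_Recv(buffer,maxtypes,vectortype,sender,0,comm,&status);
MPI_Get_count(&status,vectortype,&typecount);      // number of complete derived-type copies
MPI_Get_elements(&status,vectortype,&doublecount); // number of underlying MPI_DOUBLE elements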
MPL note 36: receive count. The get_count function is a method of the status object. The argument type is
handled through templating:
// recvstatus.cxx
double pi=0;
auto s = comm_world.recv(pi, 0); // receive from rank 0
int c = s.get_count<double>();
std::cout << "got : " << c << " scalar(s): " << pi << '\n';
One could receive these messages in order of sender rank, but this may incur idle time if the messages arrive out of order. Instead, we use the MPI_ANY_SOURCE specifier to give wildcard behavior to the receive call: using this value for the ‘source’ means that we accept messages from any source within the communicator, and messages are matched only by tag value. (Note that the size and type of the receive buffer are not used for message matching!) The actual source is then retrieved from the MPI_SOURCE field of the MPI_Status object, as in the example above.
4.3.3 Errors
MPI routines return MPI_SUCCESS upon successful completion. The following error codes can be returned
(see section 15.2.1 for details) for completion with error by both send and receive operations: MPI_ERR_COMM,
MPI_ERR_COUNT, MPI_ERR_TYPE, MPI_ERR_TAG, MPI_ERR_RANK.
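By default errors are fatal; to actually receive these codes you set the error handler to MPI_ERRORS_RETURN and test the function result. A sketch (buffer is a placeholder; the destination rank nprocs is out of range, so the send should fail with MPI_ERR_RANK):
MPI_Comm_set_errhandler(comm,MPI_ERRORS_RETURN);
int ierr = MPI_Send(buffer,1,MPI_DOUBLE,
                    /* invalid rank: */ nprocs,0,comm);
if (ierr!=MPI_SUCCESS)
  printf("send failed, most likely with MPI_ERR_RANK: %d\n",ierr);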
Review 4.17. True or false: a message sent with MPI_Isend from one processor can be
received with an MPI_Recv call on another processor.
Review 4.18. True or false: a message sent with MPI_Send from one processor can be
received with an MPI_Irecv on another processor.
Review 4.19. Why does the MPI_Irecv call not have an MPI_Status argument?
Review 4.20. Suppose you are testing ping-pong timings. Why is it generally not a good
idea to use processes 0 and 1 for the source and target processor? Can you come up
with a better guess?
Review 4.21. What is the relation between the concepts of ‘origin’, ‘target’, ‘fence’, and ‘window’ in one-sided communication?
Review 4.22. What are the three routines for one-sided data transfer?
Review 4.23. In the following fragments assume that all buffers have been allocated with
sufficient size. For each fragment note whether it deadlocks or not. Discuss
performance issues.
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm);
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE);
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE);
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm);
int ireq = 0;
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Isend(sbuffers[p],buflen,MPI_INT,p,0,comm,&(requests[ireq++]));
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE);
MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE);
int ireq = 0;
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Irecv(rbuffers[p],buflen,MPI_INT,p,0,comm,&(requests[ireq++]));
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm);
MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE);
int ireq = 0;
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Irecv(rbuffers[p],buflen,MPI_INT,p,0,comm,&(requests[ireq++]));
MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE);
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm);
Fortran codes:
do p=0,nprocs-1
if (p/=procid) then
call MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm,ierr)
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE,ierr)
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE,ierr)
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm,ierr)
end if
end do
ireq = 0
do p=0,nprocs-1
if (p/=procid) then
call MPI_Isend(sbuffers(1,p+1),buflen,MPI_INT,p,0,comm,&
requests(ireq+1),ierr)
ireq = ireq+1
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE,ierr)
end if
end do
call MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE,ierr)
ireq = 0
do p=0,nprocs-1
if (p/=procid) then
call MPI_Irecv(rbuffers(1,p+1),buflen,MPI_INT,p,0,comm,&
requests(ireq+1),ierr)
ireq = ireq+1
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm,ierr)
end if
end do
call MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE,ierr)
!! block5.F90
ireq = 0
do p=0,nprocs-1
if (p/=procid) then
call MPI_Irecv(rbuffers(1,p+1),buflen,MPI_INT,p,0,comm,&
requests(ireq+1),ierr)
ireq = ireq+1
end if
end do
call MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE,ierr)
do p=0,nprocs-1
if (p/=procid) then
call MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm,ierr)
end if
end do
// ring3.c
MPI_Request req1,req2;
MPI_Irecv(&y,1,MPI_DOUBLE,prev,0,comm,&req1);
MPI_Isend(&x,1,MPI_DOUBLE,next,0,comm,&req2);
MPI_Wait(&req1,MPI_STATUS_IGNORE);
MPI_Wait(&req2,MPI_STATUS_IGNORE);

// ring4.c
MPI_Request req1,req2;
MPI_Irecv(&y,1,MPI_DOUBLE,prev,0,comm,&req1);
MPI_Isend(&x,1,MPI_DOUBLE,next,0,comm,&req2);
MPI_Wait(&req2,MPI_STATUS_IGNORE);
MPI_Wait(&req1,MPI_STATUS_IGNORE);
Can we have one nonblocking and one blocking call? Do these scenarios block?
// ring1.c
MPI_Request req;
MPI_Issend(&x,1,MPI_DOUBLE,next,0,comm,&req);
MPI_Recv(&y,1,MPI_DOUBLE,prev,0,comm,
    MPI_STATUS_IGNORE);
MPI_Wait(&req,MPI_STATUS_IGNORE);

// ring2.c
MPI_Request req;
MPI_Irecv(&y,1,MPI_DOUBLE,prev,0,comm,&req);
MPI_Ssend(&x,1,MPI_DOUBLE,next,0,comm);
MPI_Wait(&req,MPI_STATUS_IGNORE);
MPI_Init(&argc,&argv);
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procno);
/*
* We set up a single communication between
* the first and last process
*/
int sender,receiver;
sender = 0; receiver = nprocs-1;
if (procno==sender) {
double send_data = 1.;
MPI_Send
( /* send buffer/count/type: */ &send_data,1,MPI_DOUBLE,
/* to: */ receiver, /* tag: */ 0,
/* communicator: */ comm);
printf("[%d] Send successfully concluded\n",procno);
} else if (procno==receiver) {
double recv_data;
MPI_Recv
( /* recv buffer/count/type: */ &recv_data,1,MPI_DOUBLE,
/* from: */ sender, /* tag: */ 0,
/* communicator: */ comm,
/* recv status: */ MPI_STATUS_IGNORE);
printf("[%d] Receive successfully concluded\n",procno);
}
MPI_Finalize();
return 0;
}
#include <iostream>
#include <sstream>
#include <cstdlib>
using std::cout;
using std::stringstream;
#include <mpl/mpl.hpp>
int main() {
  const mpl::communicator &comm_world=mpl::environment::comm_world();
  if (comm_world.size()<2)
    return EXIT_FAILURE;
/*
* The compiler knows about arrays so we can send them `as is'
*/
  double v[2][2][2];
  if (comm_world.rank()==0) {
    /*
     * Send and report
     */
    comm_world.send(v, 1); // send to rank 1
    stringstream s;
    s << "sent: ";
    double *vt = &(v[0][0][0]);
    for (int i=0; i<8; i++)
      s << " " << *(vt+i);
    cout << s.str() << '\n';
} else if (comm_world.rank()==1) {
/*
* Receive data and report
*/
comm_world.recv(v, 0); // receive from rank 0
stringstream s;
s << "got : ";
double *vt = &(v[0][0][0]);
for (int i=0; i<8; i++)
s << " " << *(vt+i);
cout << s.str() << '\n';
}
return EXIT_SUCCESS;
}
int main() {
const mpl::communicator &comm_world=mpl::environment::comm_world();
if (comm_world.size()<2)
return EXIT_FAILURE;
/*
* To send a std::vector we declare a contiguous layout
*/
std::vector<double> v(8);
mpl::contiguous_layout<double> v_layout(v.size());
  if (comm_world.rank()==0) {
    /*
     * Send and report
     */
    comm_world.send(v.data(), v_layout, 1); // send to rank 1
std::cout << "sent: ";
for (double &x : v)
std::cout << x << ' ';
std::cout << '\n';
} else if (comm_world.rank()==1) {
/*
* Receive data and report
*/
comm_world.recv(v.data(), v_layout, 0); // receive from rank 0
std::cout << "got : ";
for (double &x : v)
std::cout << x << ' ';
std::cout << '\n';
}
return EXIT_SUCCESS;
}
#include <vector>
using std::vector;
#include <mpl/mpl.hpp>
int main() {
const mpl::communicator &comm_world=mpl::environment::comm_world();
if (comm_world.size()<2)
return EXIT_FAILURE;
vector<double> v(15);
if (comm_world.rank()==0) {
// initialize
for ( auto &x : v ) x = 1.41;
/*
* Send and report
*/
comm_world.send(v.begin(), v.end(), 1); // send to rank 1
} else if (comm_world.rank()==1) {
/*
* Receive data and report
*/
comm_world.recv(v.begin(), v.end(), 0); // receive from rank 0
#include "globalinit.c"
skip:
MPI_Finalize();
return 0;
}
implicit none
#include "mpif.h"
integer :: other,size,status(MPI_STATUS_SIZE)
integer,dimension(:),allocatable :: sendbuf,recvbuf
#include "globalinit.F90"
if (mytid>1) goto 10
other = 1-mytid
size = 1
do
allocate(sendbuf(size)); allocate(recvbuf(size))
print *,size
call MPI_Send(sendbuf,size,MPI_INTEGER,other,0,comm,err)
call MPI_Recv(recvbuf,size,MPI_INTEGER,other,0,comm,status,err)
if (mytid==0) then
print *,"MPI_Send did not block for size",size
end if
deallocate(sendbuf); deallocate(recvbuf)
size = size*10
if (size>2000000000) goto 20
end do
20 continue
10 call MPI_Finalize(err)
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
if procid in [0,nprocs-1]:
other = nprocs-1-procid
size = 1
while size<2000000000:
sendbuf = np.empty(size, dtype=int)
recvbuf = np.empty(size, dtype=int)
comm.Send(sendbuf,dest=other)
comm.Recv(recvbuf,source=other)
if procid<other:
print("Send did not block for",size)
size *= 10
#include "globalinit.c"
skip:
MPI_Finalize();
return 0;
}
#include "globalinit.c"
/*
* We set up a single communication between
* the first and last process
*/
int sender,receiver;
sender = 0; receiver = nprocs-1;
if (procno==sender) {
double send_data = 1.;
MPI_Request request;
MPI_Isend
( /* send buffer/count/type: */ &send_data,1,MPI_DOUBLE,
/* to: */ receiver, /* tag: */ 0,
/* communicator: */ comm,
/* request: */ &request);
MPI_Wait(&request,MPI_STATUS_IGNORE);
printf("[%d] Isend successfully concluded\n",procno);
} else if (procno==receiver) {
double recv_data;
MPI_Request request;
MPI_Irecv
( /* recv buffer/count/type: */ &recv_data,1,MPI_DOUBLE,
/* from: */ sender, /* tag: */ 0,
/* communicator: */ comm,
/* request: */ &request);
MPI_Wait(&request,MPI_STATUS_IGNORE);
printf("[%d] Ireceive successfully concluded\n",procno);
}
MPI_Finalize();
return 0;
}
#include "mpi.h"
#include "globalinit.c"
double mydata=procno;
int sender = nprocs-1;
if (procno==sender) {
for (int p=0; p<nprocs-1; p++) {
double send = 1.;
MPI_Send( &send,1,MPI_DOUBLE,p,0,comm);
}
} else {
double recv=0.;
MPI_Request request;
MPI_Irecv( &recv,1,MPI_DOUBLE,sender,0,comm,&request);
MPI_Wait(&request,MPI_STATUS_IGNORE);
}
MPI_Finalize();
return 0;
}
if (procno==sender) {
double send_data = 1.;
mpl::irequest send_request
( comm_world.isend( send_data, receiver ) );
send_request.wait();
printf("[%d] Isend successfully concluded\n",procno);
} else if (procno==receiver) {
double recv_data;
mpl::irequest recv_request =
comm_world.irecv( recv_data,sender );
recv_request.wait();
printf("[%d] Ireceive successfully concluded\n",procno);
}
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
requests = []
for p in range(nprocs):
left_p = (p-1) % nprocs
right_p = (p+1) % nprocs
requests.append( comm.Isend\
( sendbuffer[p:p+1],dest=left_p) )
requests.append( comm.Irecv\
( sendbuffer[p:p+1],source=right_p) )
MPI.Request.Waitall(requests)
if procid==0:
print("All messages received")
use mpi
implicit none
integer,dimension(:),allocatable :: recv_buffer,requests
integer :: index,randomint
real :: randomvalue
#include "globalinit.F90"
allocate(recv_buffer(ntids-1))
allocate(requests(ntids-1))
if (mytid==ntids-1) then
do p=1,ntids-1
print *,"post"
call MPI_Irecv(recv_buffer(p),1,MPI_INTEGER,p-1,0,comm,&
requests(p),err)
end do
do p=1,ntids-1
call MPI_Waitany(ntids-1,requests,index,MPI_STATUS_IGNORE,err)
write(*,'("Message from",i3,":",i5)') index,recv_buffer(index)
end do
else
call sleep(6)
call random_number(randomvalue)
randomint = randomvalue
randomint = 30+mytid
call MPI_Send(randomint,1,MPI_INTEGER, ntids-1,0,comm,err)
end if
use mpi_f08
implicit none
!!
!! General stuff
!!
Type(MPI_Comm) :: comm;
integer :: mytid,ntids,i,p,err;
!!
!! random number generator
!!
integer :: randsize
integer,allocatable,dimension(:) :: randseed
!!
!! data for this program
!!
Type(MPI_Request),dimension(:),allocatable :: requests
integer,dimension(:),allocatable :: recv_buffer
integer :: index,randomint,success = 1
real :: randomvalue
call MPI_Init()
comm = MPI_COMM_WORLD
call MPI_Comm_rank(comm,mytid)
call MPI_Comm_size(comm,ntids)
call MPI_Comm_set_errhandler(comm,MPI_ERRORS_RETURN)
!!
!! seed the random number generator
!!
call random_seed(size=randsize)
allocate(randseed(randsize))
do i=1,randsize
randseed(i) = 1023*mytid
end do
call random_seed(put=randseed)
allocate(recv_buffer(ntids-1))
allocate(requests(ntids-1))
if (mytid==ntids-1) then
!
! the last process posts a receive
! from every other process
!
do p=0,ntids-2
call MPI_Irecv(recv_buffer(p+1),1,MPI_INTEGER,p,0,comm,&
requests(p+1))
end do
!
! then wait to see what comes in
!
do p=0,ntids-2
call MPI_Waitany(ntids-1,requests,index,MPI_STATUS_IGNORE)
if ( .not. requests(index)==MPI_REQUEST_NULL) then
print *,"This request should be null:",index
success = 0
end if
!write(*,'("Message from",i3,":",i5)') index,recv_buffer(index)
end do
else
!
! everyone else sends one number to the last
! after some random wait
!
call sleep(6)
call random_number(randomvalue)
randomint = randomvalue
randomint = 30+mytid
call MPI_Send(randomint,1,MPI_INTEGER, ntids-1,0,comm)
end if
call MPI_Allreduce(MPI_IN_PLACE,success,1,MPI_INTEGER,MPI_SUM,comm)
if (mytid==0) then
if (success==ntids) then
print *,"All processes successfully concluded"
else
call MPI_Finalize()
#include <mpl/mpl.hpp>
if (procno==nprocs-1) {
mpl::irequest_pool recv_requests;
vector<int> recv_buffer(nprocs-1);
for (int p=0; p<nprocs-1; p++) {
recv_requests.push( comm_world.irecv( recv_buffer[p], p ) );
}
printf("Outstanding request #=%lu\n",recv_requests.size());
for (int p=0; p<nprocs-1; p++) {
auto [success,index] = recv_requests.waitany();
if (success) {
auto recv_status = recv_requests.get_status(index);
int source = recv_status.source();
if (index!=source)
printf("Mismatch index %lu vs source %d\n",index,source);
printf("Message from %lu: %d\n",index,recv_buffer[index]);
} else
break;
}
} else {
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
if procid==nprocs-1:
receive_buffer = np.empty(nprocs-1,dtype=int)
requests = [ None ] * (nprocs-1)
for sender in range(nprocs-1):
requests[sender] = comm.Irecv(receive_buffer[sender:sender+1],source=sender)
# alternatively: requests = [ comm.Irecv(s) for s in .... ]
status = MPI.Status()
for sender in range(nprocs-1):
ind = MPI.Request.Waitany(requests,status=status)
if ind!=status.Get_source():
print("sender mismatch: %d vs %d" % (ind,status.Get_source()))
print("received from",ind)
else:
mywait = random.randint(1,2*nprocs)
print("[%d] wait for %d seconds" % (procid,mywait))
time.sleep(mywait)
mydata = np.empty(1,dtype=int)
mydata[0] = procid
comm.Send([mydata,MPI.INT],dest=nprocs-1)
#include "globalinit.c"
skip:
MPI_Finalize();
return 0;
}
#include "globalinit.c"
MPI_Finalize();
return 0;
}
#include <mpl/mpl.hpp>
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, the_other = 1-procno,
count = 5,stride=2;
vector<double>
source(stride*count);
vector<double>
target(count);
if (procno==sender) {
mpl::strided_vector_layout<double>
newvectortype(count,1,stride);
comm_world.send
(source.data(),newvectortype,the_other);
}
else if (procno==receiver) {
int recv_count;
mpl::contiguous_layout<double> target_layout(count);
mpl::status_t recv_status =
comm_world.recv(target.data(),target_layout, the_other);
recv_count = recv_status.get_count<double>();
assert(recv_count==count);
if (procno==receiver) {
for (int i=0; i<count; i++)
if (target[i]!=source[stride*i])
printf("location %d %e s/b %e\n",i,target[i],source[stride*i]);
}
if (procno==0)
printf("Finished\n");
return 0;
}
#include "globalinit.c"
if (procno==nprocs-1) {
/*
* The last process receives from every other process
*/
int *recv_buffer;
recv_buffer = (int*) malloc((nprocs-1)*sizeof(int));
/*
* Messages can come in in any order, so use MPI_ANY_SOURCE
*/
MPI_Status status;
for (int p=0; p<nprocs-1; p++) {
err = MPI_Recv(recv_buffer+p,1,MPI_INT, MPI_ANY_SOURCE,0,comm,
&status); CHK(err);
int sender = status.MPI_SOURCE;
printf("Message from sender=%d: %d\n",
sender,recv_buffer[p]);
}
free(recv_buffer);
} else {
/*
* Each rank waits an unpredictable amount of time,
* then sends to the last process in line.
*/
float randomfraction = (rand() / (double)RAND_MAX);
int randomwait = (int) ( nprocs * randomfraction );
printf("process %d waits for %e/%d=%d\n",
procno,randomfraction,nprocs,randomwait);
sleep(randomwait);
err = MPI_Send(&randomwait,1,MPI_INT, nprocs-1,0,comm); CHK(err);
}
MPI_Finalize();
return 0;
}
use mpi_f08
implicit none
integer,dimension(:),allocatable :: recv_buffer
Type(MPI_Status) :: status
real :: randomvalue
integer :: randomint,sender
#include "globalinit.F90"
if (mytid.eq.ntids-1) then
allocate(recv_buffer(ntids-1))
do p=0,ntids-2
call MPI_Recv(recv_buffer(p+1),1,MPI_INTEGER,&
MPI_ANY_SOURCE,0,comm,status)
sender = status%MPI_SOURCE
print *,"Message from",sender
end do
else
call random_number(randomvalue)
randomint = randomvalue*ntids
call sleep(randomint)
print *,mytid,"waits for",randomint
call MPI_Send(randomint,1,MPI_INTEGER, ntids-1,0,comm)
end if
call MPI_Finalize(err)
implicit none
#include "mpif.h"
integer,dimension(:),allocatable :: recv_buffer
integer :: status(MPI_STATUS_SIZE)
real :: randomvalue
integer :: randomint,sender
#include "globalinit.F90"
if (mytid.eq.ntids-1) then
allocate(recv_buffer(ntids-1))
do p=0,ntids-2
call MPI_Recv(recv_buffer(p+1),1,MPI_INTEGER,&
MPI_ANY_SOURCE,0,comm,status,err)
sender = status(MPI_SOURCE)
print *,"Message from",sender
end do
else
call random_number(randomvalue)
randomint = randomvalue*ntids
call sleep(randomint)
print *,mytid,"waits for",randomint
call MPI_Send(randomint,1,MPI_INTEGER,ntids-1,0,comm,err)
end if
call MPI_Finalize(err)
MPI_Init(&argc,&argv);
comm = MPI_COMM_WORLD;
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procid);
#define N 10
float buffer[N];
int target = nprocs-1;
if (procid==0) {
int sendcount = ( rand()/(double)RAND_MAX>.5 ) ? N : N-1;
MPI_Send( buffer,sendcount,MPI_FLOAT,target,0,comm );
} else if (procid==target) {
MPI_Status status;
int recvcount;
MPI_Recv( buffer,N,MPI_FLOAT,0,0, comm, &status );
MPI_Get_count(&status,MPI_FLOAT,&recvcount);
printf("Received %d elements\n",recvcount);
}
MPI_Finalize();
return 0;
}
int main() {
const mpl::communicator &comm_world=mpl::environment::comm_world();
if (comm_world.size()<2)
return EXIT_FAILURE;
// send and receive a single floating point number
if (comm_world.rank()==0) {
double pi=3.14;
comm_world.send(pi, 1); // send to rank 1
std::cout << "sent: " << pi << '\n';
} else if (comm_world.rank()==1) {
double pi=0;
auto s = comm_world.recv(pi, 0); // receive from rank 0
int c = s.get_count<double>();
std::cout << "got : " << c << " scalar(s): " << pi << '\n';
}
return EXIT_SUCCESS;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
if procid==nprocs-1:
rbuf = np.empty(1,dtype=np.float64)
for p in range(procid):
rstatus = MPI.Status()
comm.Recv(rbuf,source=MPI.ANY_SOURCE,status=rstatus)
MPL note 37: persistent requests. MPL returns a prequest from persistent ‘init’ routines, rather than an
irequest (MPL note 29):
template<typename T >
prequest send_init (const T &data, int dest, tag t=tag(0)) const;
The main persistent point-to-point routines are MPI_Send_init (figure 5.1), which has the same calling
sequence as MPI_Isend, and MPI_Recv_init, which has the same calling sequence as MPI_Irecv.
In the following example a ping-pong is implemented with persistent communication. Since we use per-
sistent operations for both send and receive on the ‘ping’ process, we use MPI_Startall (figure 5.2) to start
both at the same time, and MPI_Waitall to test their completion. (There is MPI_Start for starting a single
persistent transfer.)
Code:
// persist.c
if (procno==src) {
  MPI_Send_init
    (send,s,MPI_DOUBLE,tgt,0,comm,requests+0);
  MPI_Recv_init
    (recv,s,MPI_DOUBLE,tgt,0,comm,requests+1);
  for (int n=0; n<NEXPERIMENTS; n++) {
    fill_buffer(send,s,n);
    MPI_Startall(2,requests);
    MPI_Waitall(2,requests,MPI_STATUSES_IGNORE);
    int r = chck_buffer(send,s,n);
    if (!r) printf("buffer problem %d\n",s);
  }
  MPI_Request_free(requests+0);
  MPI_Request_free(requests+1);
} else if (procno==tgt) {
  for (int n=0; n<NEXPERIMENTS; n++) {
    MPI_Recv(recv,s,MPI_DOUBLE,src,0,
      comm,MPI_STATUS_IGNORE);
    MPI_Send(recv,s,MPI_DOUBLE,src,0,
      comm);
  }
}

Output:
make[3]: `persist' is up to date.
TACC: Starting up job 4328411
TACC: Starting parallel tasks...
Pingpong size=1: t=1.2123e-04
Pingpong size=10: t=4.2826e-06
Pingpong size=100: t=7.1507e-06
Pingpong size=1000: t=1.2084e-05
Pingpong size=10000: t=3.7668e-05
Pingpong size=100000: t=3.4415e-04
Persistent size=1: t=3.8177e-06
Persistent size=10: t=3.2410e-06
Persistent size=100: t=4.0468e-06
Persistent size=1000: t=1.1525e-05
Persistent size=10000: t=4.1672e-05
Persistent size=100000: t=2.8648e-04
TACC: Shutdown complete. Exiting.
## persist.py
requests = [ None ] * 2
sendbuf = np.ones(size,dtype=int)
recvbuf = np.ones(size,dtype=int)
if procid==src:
print("Size:",size)
times[isize] = MPI.Wtime()
for n in range(nexperiments):
requests[0] = comm.Isend(sendbuf[0:size],dest=tgt)
requests[1] = comm.Irecv(recvbuf[0:size],source=tgt)
MPI.Request.Waitall(requests)
sendbuf[0] = sendbuf[0]+1
times[isize] = MPI.Wtime()-times[isize]
elif procid==tgt:
for n in range(nexperiments):
comm.Recv(recvbuf[0:size],source=src)
comm.Send(recvbuf[0:size],dest=src)
Some points.
• Metadata arrays, such as the arrays of counts and datatypes, must not be altered until the MPI_Request_free call.
• The initialization call is nonlocal, so it can block until all processes have performed it.
• Multiple persistent collectives can be initialized, in which case they satisfy the same restrictions as ordinary collectives, in particular on ordering. Thus, the following code is incorrect:
// WRONG
if (procid==0) {
MPI_Reduce_init( /* ... */ &req1);
MPI_Bcast_init( /* ... */ &req2);
} else {
MPI_Bcast_init( /* ... */ &req2);
MPI_Reduce_init( /* ... */ &req1);
}
However, after initialization the start calls can be in arbitrary order, and in different order among
the processes.
Available persistent collectives are: MPI_Barrier_init, MPI_Bcast_init, MPI_Reduce_init, MPI_Allreduce_init, MPI_Reduce_scatter_init, MPI_Reduce_scatter_block_init, MPI_Gather_init, MPI_Gatherv_init, MPI_Allgather_init, MPI_Allgatherv_init, MPI_Scatter_init, MPI_Scatterv_init, MPI_Alltoall_init, MPI_Alltoallv_init, MPI_Alltoallw_init, MPI_Scan_init, MPI_Exscan_init.
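As a sketch (MPI-4 only; buffer, count, root, and niterations are placeholders here, and note the extra MPI_Info argument in the init call), a persistent broadcast is initialized once, started and completed repeatedly, and finally freed:
MPI_Request bcast_req;
MPI_Bcast_init(buffer,count,MPI_DOUBLE,root,comm,
               MPI_INFO_NULL,&bcast_req);
for (int it=0; it<niterations; it++) {
  MPI_Start(&bcast_req);                     // start one broadcast
  MPI_Wait(&bcast_req,MPI_STATUS_IGNORE);    // complete it
}
MPI_Request_free(&bcast_req);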
End of MPI-4 material
The receiving side is largely the mirror image of the sending side:
double *recvbuffer = (double*)malloc(bufsize*sizeof(double));
MPI_Request recv_request;
MPI_Precv_init
(recvbuffer,nparts,SIZE,MPI_DOUBLE,src,0,
comm,MPI_INFO_NULL,&recv_request);
• A partitioned send can only be matched with a partitioned receive, so we start with an MPI_Precv_init.
• Arrival of a partition can be tested with MPI_Parrived (figure 5.6).
• A call to MPI_Wait completes the operation, indicating that all partitions have arrived.
Again, the MPI_Request object from the receive-init call can be used to test for completion of the full receive
operation.
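A minimal sketch of the receive side polling for individual partitions (using recv_request and nparts from the fragment above; the processing step is left abstract):
MPI_Start(&recv_request);                    // activate the receive
for (int p=0; p<nparts; p++) {
  int flag=0;
  while (!flag)                              // poll until partition p has arrived
    MPI_Parrived(recv_request,p,&flag);
  /* ... process partition p of the receive buffer ... */
}
MPI_Wait(&recv_request,MPI_STATUS_IGNORE);   // all partitions have arrived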
End of MPI-4 material
2. You use the MPI_Bsend (figure 5.7) (or its local variant MPI_Ibsend) call for sending, using other-
wise normal send and receive buffers;
3. You detach the buffer when you’re done with the buffered sends.
One advantage of buffered sends is that they are nonblocking: since there is a guaranteed buffer long
enough to contain the message, it is not necessary to wait for the receiving process.
We illustrate the use of buffered sends:
// bufring.c
int bsize = BUFLEN*sizeof(float);
float
*sbuf = (float*) malloc( bsize ),
*rbuf = (float*) malloc( bsize );
MPI_Pack_size( BUFLEN,MPI_FLOAT,comm,&bsize);
bsize += MPI_BSEND_OVERHEAD;
float
*buffer = (float*) malloc( bsize );
MPI_Buffer_attach( buffer,bsize );
err = MPI_Bsend(sbuf,BUFLEN,MPI_FLOAT,next,0,comm);
MPI_Recv (rbuf,BUFLEN,MPI_FLOAT,prev,0,comm,MPI_STATUS_IGNORE);
MPI_Buffer_detach( &buffer,&bsize );
This returns the address and size of the buffer; the call blocks until all buffered messages have been
delivered.
Note that both MPI_Buffer_attach and MPI_Buffer_detach have a void* argument for the buffer, but
• in the attach routine this is the address of the buffer,
• while the detach routine it is the address of the buffer pointer.
This is done so that the detach routine can zero the buffer pointer.
While the buffered send is nonblocking like an MPI_Isend, there is no corresponding wait call. You can
force delivery by
MPI_Buffer_detach( &b, &n );
MPI_Buffer_attach( b, n );
MPL note 38: buffered send. Creating and attaching a buffer is done through bsend_buffer and a support
routine bsend_size helps in calculating the buffer size:
// bufring.cxx
vector<float> sbuf(BUFLEN), rbuf(BUFLEN);
int size{ comm_world.bsend_size<float>(mpl::contiguous_layout<float>(BUFLEN)) };
mpl::bsend_buffer buff(size);
comm_world.bsend(sbuf.data(),mpl::contiguous_layout<float>(BUFLEN), next);
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
nexperiments = 10
nsizes = 6
times = np.empty(nsizes,dtype=np.float64)
src = 0; tgt = nprocs-1
#
# ordinary communication
#
size = 1
if procid==src:
print("Ordinary send/recv")
for isize in range(nsizes):
requests = [ None ] * 2
sendbuf = np.ones(size,dtype=int)
recvbuf = np.ones(size,dtype=int)
if procid==src:
print("Size:",size)
times[isize] = MPI.Wtime()
for n in range(nexperiments):
requests[0] = comm.Isend(sendbuf[0:size],dest=tgt)
requests[1] = comm.Irecv(recvbuf[0:size],source=tgt)
MPI.Request.Waitall(requests)
sendbuf[0] = sendbuf[0]+1
times[isize] = MPI.Wtime()-times[isize]
elif procid==tgt:
for n in range(nexperiments):
comm.Recv(recvbuf[0:size],source=src)
comm.Send(recvbuf[0:size],dest=src)
size *= 10
if procid==src:
print("Timings:",times)
#
# persistent communication
#
size = 1
requests = [ None ] * 2
if procid==src:
print("Persistent send/recv")
for isize in range(nsizes):
sendbuf = np.ones(size,dtype=int)
recvbuf = np.ones(size,dtype=int)
if procid==src:
print("Size:",size)
requests[0] = comm.Send_init(sendbuf[0:size],dest=tgt)
requests[1] = comm.Recv_init(recvbuf[0:size],source=tgt)
times[isize] = MPI.Wtime()
for n in range(nexperiments):
MPI.Prequest.Startall(requests)
MPI.Prequest.Waitall(requests)
sendbuf[0] = sendbuf[0]+1
times[isize] = MPI.Wtime()-times[isize]
elif procid==tgt:
for n in range(nexperiments):
comm.Recv(recvbuf[0:size],source=src)
comm.Send(recvbuf[0:size],dest=src)
size *= 10
if procid==src:
print("Timings:",times)
In the examples you have seen so far, every time data was sent, it was as a contiguous buffer with elements
of a single type. In practice you may want to send heterogeneous data, or noncontiguous data.
• Communicating the real parts of an array of complex numbers means specifying every other
number.
• Communicating a C structure or Fortran type with more than one type of element is not equiv-
alent to sending an array of elements of a single type.
The datatypes you have dealt with so far are known as elementary datatypes; irregular objects are known
as derived datatypes.
In MPL, datatypes are realized through layout classes, which are themselves objects with methods for creating derived types; see section 6.3.1.
MPL note 40: other types.
// sendlong.cxx
mpl::contiguous_layout<long long> v_layout(v.size());
comm.send(v.data(), v_layout, 1); // send to rank 1
6.2.1 C/C++
Here we illustrate the correspondence between a type used to declare a variable, and how this type appears
in MPI communication routines:
long int i;
MPI_Send(&i,1,MPI_LONG,target,tag,comm);
6.2.2 Fortran
Not all these types need be supported, for instance MPI_INTEGER16 may not exist, in which case it will be
equivalent to MPI_DATATYPE_NULL.
The default integer type MPI_INTEGER is equivalent to INTEGER(KIND=MPI_INTEGER_KIND).
_Bool                  MPI_C_BOOL
float _Complex         MPI_C_COMPLEX / MPI_C_FLOAT_COMPLEX
double _Complex        MPI_C_DOUBLE_COMPLEX
long double _Complex   MPI_C_LONG_DOUBLE_COMPLEX
The following material is for the recently released MPI-4 standard and may not be supported yet.
For every routine MPI_Something with an int count parameter, there is a corresponding routine MPI_Something_c
with an MPI_Count parameter.
The above MPI_Something_x routines will probably be deprecated in the MPI-4.1 standard.
End of MPI-4 material
int8_t MPI_INT8_T
int16_t MPI_INT16_T
int32_t MPI_INT32_T
int64_t MPI_INT64_T
uint8_t MPI_UINT8_T
uint16_t MPI_UINT16_T
uint32_t MPI_UINT32_T
uint64_t MPI_UINT64_T
MPI_CHARACTER  Character(Len=1)    MPI_INTEGER1
MPI_INTEGER                        MPI_INTEGER2
MPI_REAL                           MPI_INTEGER4
MPI_DOUBLE_PRECISION               MPI_INTEGER8
MPI_COMPLEX                        MPI_INTEGER16
MPI_LOGICAL                        MPI_REAL2
MPI_BYTE                           MPI_REAL4
MPI_PACKED                         MPI_REAL8
                                   MPI_DOUBLE_COMPLEX  Complex(Kind=Kind(0.d0))
Table 6.4: Standard Fortran types (left) and common extension (right)
The MPI_OFFSET_KIND is used to define MPI_Offset quantities, used in file I/O; section 10.2.2.
6.2.3 Python
In python, all buffer data comes from Numpy.
mpi4py type   NumPy type
MPI.INT       np.intc
              np.int32
MPI.LONG      np.int64
MPI.FLOAT     np.float32
MPI.DOUBLE    np.float64
In this table we see that Numpy has three integer types: one corresponding to C ints, and two with the number of bits explicitly indicated. (There used to be an np.int type, but it is deprecated as of Numpy 1.20.)
Examples:
## inttype.py
sizeofint = np.dtype('int32').itemsize
print("Size of numpy int32: {}".format(sizeofint))
sizeofint = np.dtype('intc').itemsize
print("Size of C int: {}".format(sizeofint))
my_array = np.empty(mycount,dtype=np.float64)
6.2.4.1 Fortran
Fortran lacks a sizeof operator to query the sizes of datatypes. Since sometimes exact byte counts are
necessary, for instance in one-sided communication, Fortran can use the (deprecated) MPI_Sizeof routine.
See section 6.2.5 for details.
6.2.4.2 Python
Here is a good way for finding the size of numpy datatypes in bytes:
## putfence.py
intsize = np.dtype('int').itemsize
window_data = np.zeros(2,dtype=int)
win = MPI.Win.Create(window_data,intsize,comm=comm)
In some circumstances you may want to find the MPI type that corresponds to a type in your programming
language.
• In C++ functions and classes can be templated, meaning that the type is not fully known:
template<typename T>
class something {
public:
  void dosend(T input) {
    MPI_Send( &input,1,/* ????? */ );
  };
};
(Note that in MPL this is hardly ever needed because MPI calls are templated there.)
• Petsc installations use a generic identifier PetscScalar (or PetscReal) with a configuration-dependent
realization.
• The size of a datatype is not always statically known, for instance if the Fortran KIND keyword
is used.
Here are some MPI mechanisms that address this problem.
Datatypes in C can be translated to MPI types with MPI_Type_match_size (figure 6.4) where the typeclass
argument is one of MPI_TYPECLASS_REAL, MPI_TYPECLASS_INTEGER, MPI_TYPECLASS_COMPLEX.
// typematch.c
float x5;
double x10;
int s5,s10;
MPI_Datatype mpi_x5,mpi_x10;
MPI_Type_match_size(MPI_TYPECLASS_REAL,sizeof(x5),&mpi_x5);
MPI_Type_match_size(MPI_TYPECLASS_REAL,sizeof(x10),&mpi_x10);
MPI_Type_size(mpi_x5,&s5);
MPI_Type_size(mpi_x10,&s10);
The space that MPI takes for a structure type can be queried in a variety of ways. First of all MPI_Type_size
(figure 6.5) counts the datatype size as the number of bytes occupied by the data in a type. That means
that in an MPI vector datatype it does not count the gaps.
// typesize.c
MPI_Type_vector(count,bs,stride,MPI_DOUBLE,&newtype);
MPI_Type_commit(&newtype);
MPI_Type_size(newtype,&size);
ASSERT( size==(count*bs)*sizeof(double) );
• There is a create call, followed by a ‘commit’ call where MPI performs internal bookkeeping
and optimizations;
• The datatype is used, possibly multiple times;
• When the datatype is no longer needed, it must be freed to prevent memory leaks.
In code:
MPI_Datatype newtype;
MPI_Type_something( < oldtype specifications >, &newtype );
MPI_Type_commit( &newtype );
/* code that uses your new type */
MPI_Type_free( &newtype );
In Fortran2008:
Type(MPI_Datatype) :: newvectortype
call MPI_Type_something( <oldtype specification>, &
newvectortype)
call MPI_Type_commit(newvectortype)
!! code that uses your type
call MPI_Type_free(newvectortype)
Python note 17: derived type handling. The various type creation routines are methods of the datatype
classes, after which commit and free are methods on the new type.
## vector.py
source = np.empty(stride*count,dtype=np.float64)
target = np.empty(count,dtype=np.float64)
if procid==sender:
newvectortype = MPI.DOUBLE.Create_vector(count,1,stride)
newvectortype.Commit()
comm.Send([source,1,newvectortype],dest=the_other)
newvectortype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)
if (procno==sender) {
MPI_Type_contiguous(count,MPI_DOUBLE,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,receiver,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,count,MPI_DOUBLE,sender,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_DOUBLE,&recv_count);
ASSERT(count==recv_count);
}
## contiguous.py
source = np.empty(count,dtype=np.float64)
target = np.empty(count,dtype=np.float64)
if procid==sender:
newcontiguoustype = MPI.DOUBLE.Create_contiguous(count)
newcontiguoustype.Commit()
comm.Send([source,1,newcontiguoustype],dest=the_other)
newcontiguoustype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)
Figure 6.2: A vector datatype is built up out of strided blocks of elements of a constituent type
The vector datatype gives the first nontrivial illustration that datatypes can be different on the sender and
receiver. If the sender sends b blocks of length l each, the receiver can receive them as bl contiguous
elements, either as a contiguous datatype, or as a contiguous buffer of an elementary type; see figure 6.3.
In this case, the receiver has no knowledge of the stride of the datatype on the sender.
In this example a vector type is created only on the sender, in order to send a strided subset of an array;
the receiver receives the data as a contiguous block.
// vector.c
source = (double*) malloc(stride*count*sizeof(double));
target = (double*) malloc(count*sizeof(double));
MPI_Datatype newvectortype;
if (procno==sender) {
MPI_Type_vector(count,1,stride,MPI_DOUBLE,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,the_other,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,count,MPI_DOUBLE,the_other,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_DOUBLE,&recv_count);
ASSERT(recv_count==count);
}
Figure 6.4: Memory layout of a row and column of a matrix in column-major storage
A column then has 𝑀 blocks of one element, spaced 𝑁 locations apart. In other words:
MPI_Datatype MPI_column;
MPI_Type_vector(
/* count= */ M, /* blocklength= */ 1, /* stride= */ N,
MPI_DOUBLE, &MPI_column );
The second column is just a little trickier: you now need to pick out elements with the same stride, but
starting at A[0][1].
MPI_Send( &(mat[0][1]), 1,MPI_column, ... );
You can make this marginally more efficient (and harder to read) by replacing the index expression by
mat+1.
Exercise 6.2. Suppose you have a matrix of size 4𝑁 × 4𝑁 , and you want to send the
elements A[4*i][4*j] with 𝑖, 𝑗 = 0, … , 𝑁 − 1. How would you send these elements
with a single transfer?
Exercise 6.3. Allocate a matrix on processor zero, using Fortran column-major storage.
Using 𝑃 sendrecv calls, distribute the rows of this matrix among the processors.
Python note 19: sending from the middle of a matrix. In C and Fortran it’s easy to apply a derived type to
data in the middle of an array, for instance to extract an arbitrary column out of a C matrix,
or row out of a Fortran matrix. While Python has no trouble describing sections from an array,
usually it copies these instead of taking the address. Therefore, it is necessary to convert the
matrix to a buffer and compute an explicit offset in bytes:
## rowcol.py
rowsize = 4; colsize = 5
coltype = MPI.INT.Create_vector(4, 1, 5)
coltype.Commit()
columntosend = 2
comm.Send\
( [np.frombuffer(matrix.data, intc,
offset=columntosend*np.dtype('intc').itemsize),
1,coltype],
receiver)
Exercise 6.4. Let processor 0 have an array 𝑥 of length 10𝑃, where 𝑃 is the number of
processors. Elements 0, 𝑃, 2𝑃, … , 9𝑃 should go to processor zero, 1, 𝑃 + 1, 2𝑃 + 1, … to
processor 1, et cetera. Code this as a sequence of send/recv calls, using a vector
datatype for the send, and a contiguous buffer for the receive.
For simplicity, skip the send to/from zero. What is the most elegant solution if you
want to include that case?
Figure 6.5: Send strided data from process zero to all others
Exercise 6.6. Assume that your number of processors is 𝑃 = 𝑄³, and that each process has
an array of identical size. Use MPI_Type_create_subarray to gather all data onto a root
process. Use a sequence of send and receive calls; MPI_Gather does not work here.
(There is a skeleton for this exercise under the name cubegather.)
Fortran note 9: subarrays. Subarrays are naturally supported in Fortran through array sections.
!! section.F90
integer,parameter :: siz=20
real,dimension(siz,siz) :: matrix = [ ((j+(i-1)*siz,i=1,siz),j=1,siz) ]
real,dimension(2,2) :: submatrix
if (procno==0) then
call MPI_Send(matrix(1:2,1:2),4,MPI_REAL,1,0,comm)
else if (procno==1) then
call MPI_Recv(submatrix,4,MPI_REAL,0,0,comm,MPI_STATUS_IGNORE)
   if (submatrix(2,2)==22) then
      print *,"Yay"
   else
      print *,"nay...."
   end if
end if

MPI.Datatype.Create_subarray
  (self, sizes, subsizes, starts, int order=ORDER_C)
MPI.Datatype.Create_indexed(self, blocklengths,displacements )
MPI_Datatype newvectortype;
if (procno==sender) {
MPI_Type_indexed(count,blocklengths,displacements,MPI_INT,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,the_other,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,targetbuffersize,MPI_INT,the_other,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_INT,&recv_count);
ASSERT(recv_count==count);
}
!! indexed.F90
integer :: newvectortype;
ALLOCATE(indices(count))
ALLOCATE(blocklengths(count))
ALLOCATE(source(totalcount))
ALLOCATE(targt(count))
if (mytid==sender) then
call MPI_Type_indexed(count,blocklengths,indices,MPI_INT,&
newvectortype,err)
call MPI_Type_commit(newvectortype,err)
call MPI_Send(source,1,newvectortype,receiver,0,comm,err)
call MPI_Type_free(newvectortype,err)
else if (mytid==receiver) then
call MPI_Recv(targt,count,MPI_INT,sender,0,comm,&
recv_status,err)
call MPI_Get_count(recv_status,MPI_INT,recv_count,err)
! ASSERT(recv_count==count);
end if
if (procno==sender) {
comm_world.send( source_buffer.data(),indexed_where, receiver );
} else if (procno==receiver) {
auto recv_status =
comm_world.recv( target_buffer.data(),fiveints, sender );
int recv_count = recv_status.get_count<int>();
assert(recv_count==count);
}
MPL note 48: layouts for gatherv. The size/displacement arrays for MPI_Gatherv / MPI_Alltoallv are handled
through a layouts object, which is basically a vector of layout objects.
mpl::layouts<int> receive_layout;
for ( int iproc=0,loc=0; iproc<nprocs; iproc++ ) {
auto siz = size_buffer.at(iproc);
receive_layout.push_back
( mpl::indexed_layout<int>( {{ siz,loc }} ) );
loc += siz;
}
MPL note 49: indexed block type. For the case where all block lengths are the same, use indexed_block_layout:
// indexedblock.cxx
mpl::indexed_block_layout<int>
indexed_where( 1, {2,3,5,7,11} );
comm_world.send( source_buffer.data(),indexed_where, receiver );
A slightly simpler version, MPI_Type_create_hindexed_block (figure 6.12) assumes constant block length.
There is an important difference between these hindexed routines and the above MPI_Type_indexed: the latter describes offsets from a base location, whereas these routines describe absolute memory addresses. You can use this to send, for instance, the elements of a linked list: you would traverse the list, recording the addresses of the elements with MPI_Get_address (figure 6.13). (The routine MPI_Address is deprecated.)
In C++ you can use this to send an std::vector, that is, a vector object from the C++ standard library, if the component type is a pointer.
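For instance, a minimal sketch (hypothetical: a list of n nodes of type struct node, each with an int field val and a next pointer) would gather the addresses, build the type, and send from MPI_BOTTOM:
MPI_Aint *displacements = (MPI_Aint*) malloc(n*sizeof(MPI_Aint));
struct node *cur = head;
for (int i=0; i<n; i++) {
  MPI_Get_address(&(cur->val),&displacements[i]); // absolute address of each payload
  cur = cur->next;
}
MPI_Datatype listtype;
MPI_Type_create_hindexed_block(n,1,displacements,MPI_INT,&listtype);
MPI_Type_commit(&listtype);
MPI_Send(MPI_BOTTOM,1,listtype,receiver,0,comm);  // absolute addresses: buffer is MPI_BOTTOM
MPI_Type_free(&listtype);
free(displacements);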
As an example, the structure illustrated in figure 6.7 has two blocks, one of a single integer, and one of two floats.
count The number of blocks in this datatype. The blocklengths, displacements, types arguments
have to be at least of this length.
blocklengths array containing the lengths of the blocks of each datatype.
displacements array describing the relative location of the blocks of each datatype.
types array containing the datatypes; each block in the new type is of a single datatype; there can be
multiple blocks consisting of the same type.
In this example, unlike the previous ones, both sender and receiver create the structure type. With struc-
tures it is no longer possible to send as a derived type and receive as an array of a simple type. (It would
be possible to send as one structure type and receive as another, as long as they have the same datatype
signature.)
// struct.c
struct object {
char c;
double x[2];
int i;
};
MPI_Datatype newstructuretype;
int structlen = 3;
int blocklengths[structlen]; MPI_Datatype types[structlen];
MPI_Aint displacements[structlen];
/*
* where are the components relative to the structure?
*/
MPI_Aint current_displacement=0;
// one character
blocklengths[0] = 1; types[0] = MPI_CHAR;
displacements[0] = (size_t)&(myobject.c) - (size_t)&myobject;
// two doubles
blocklengths[1] = 2; types[1] = MPI_DOUBLE;
displacements[1] = (size_t)&(myobject.x) - (size_t)&myobject;
// one int
blocklengths[2] = 1; types[2] = MPI_INT;
displacements[2] = (size_t)&(myobject.i) - (size_t)&myobject;
MPI_Type_create_struct(structlen,blocklengths,displacements,types,&newstructuretype);
MPI_Type_commit(&newstructuretype);
if (procno==sender) {
MPI_Send(&myobject,1,newstructuretype,the_other,0,comm);
} else if (procno==receiver) {
MPI_Recv(&myobject,1,newstructuretype,the_other,0,comm,MPI_STATUS_IGNORE);
}
MPI_Type_free(&newstructuretype);
if (procno==sender) then
call MPI_Send(myobject,1,newstructuretype,receiver,0,comm)
else if (procno==receiver) then
call MPI_Recv(myobject,1,newstructuretype,sender,0,comm,MPI_STATUS_IGNORE)
end if
call MPI_Type_free(newstructuretype)
Note that it would not be correct to compute the displacements as
displacement[0] = 0;
displacement[1] = displacement[0] + sizeof(char);
since you do not know the way the compiler lays out the structure in memory1.
If you want to send more than one structure, you have to worry more about padding in the structure. You
can solve this by adding an extra type MPI_UB for the ‘upper bound’ on the structure:
displacements[3] = sizeof(myobject); types[3] = MPI_UB;
MPI_Type_create_struct(struclen+1,.....);
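Note that MPI_UB has since been deprecated; the same effect is obtained by resizing the type (section 6.6.2). A sketch, using the structure type from the example above (the array objectarray and its length narray are hypothetical):
MPI_Datatype paddedstructtype;
MPI_Type_create_resized
  (newstructuretype,0,sizeof(struct object),&paddedstructtype);
MPI_Type_commit(&paddedstructtype);
// with the extent equal to sizeof(struct object), a count>1 send is safe
MPI_Send(objectarray,narray,paddedstructtype,the_other,0,comm);
MPI_Type_free(&paddedstructtype);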
MPL note 50: struct type scalar. One could describe the MPI struct type as a collection of displacements, to
be applied to any set of items that conforms to the specifications. An MPL heterogeneous_layout
on the other hand, incorporates the actual data. Thus you could write
// structscalar.cxx
char c; double x; int i;
if (procno==sender) {
c = 'x'; x = 2.4; i = 37; }
mpl::heterogeneous_layout object( c,x,i );
if (procno==sender)
comm_world.send( mpl::absolute,object,receiver );
else if (procno==receiver)
comm_world.recv( mpl::absolute,object,sender );
Here, the absolute indicates the lack of an implicit buffer: the layout is absolute rather than a
relative description.
MPL note 51: struct type general. More complicated data than scalars takes more work:
// struct.cxx
char c; vector<double> x(2); int i;
if (procno==sender) {
c = 'x'; x[0] = 2.7; x[1] = 1.5; i = 37; }
mpl::heterogeneous_layout object
( c,
mpl::make_absolute(x.data(),mpl::vector_layout<double>(2)),
i );
if (procno==sender) {
comm_world.send( mpl::absolute,object,receiver );
} else if (procno==receiver) {
comm_world.recv( mpl::absolute,object,sender );
}
1. Homework question: what does the language standard say about this?
6.4.1 C
For every routine, such as MPI_Send with an integer count, there is a corresponding MPI_Send_c with a
count of type MPI_Count.
MPI_Count buffersize = 1000;
double *indata,*outdata;
indata = (double*) malloc( buffersize*sizeof(double) );
outdata = (double*) malloc( buffersize*sizeof(double) );
MPI_Allreduce_c(indata,outdata,buffersize,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
Code:
// pingpongbig.c
assert( sizeof(MPI_Count)>4 );
for ( int power=3; power<=10; power++) {
  MPI_Count length=pow(10,power);
  buffer = (double*)malloc( length*sizeof(double) );
  MPI_Ssend_c
    (buffer,length,MPI_DOUBLE,
     processB,0,comm);
  MPI_Recv_c
    (buffer,length,MPI_DOUBLE,
     processB,0,comm,MPI_STATUS_IGNORE);

Output:
make[3]: `pingpongbig' is up to date.
Ping-pong between ranks 0--1, repeated 10 times
MPI Count has 8 bytes
Size: 10^3, (repeats=10000)
Time 1.399211e-05 for size 10^3: 1.1435 Gb/sec
Size: 10^4, (repeats=10000)
Time 4.077882e-05 for size 10^4: 3.9236 Gb/sec
Size: 10^5, (repeats=1000)
Time 1.532863e-04 for size 10^5: 10.4380 Gb/sec
Size: 10^6, (repeats=1000)
Time 1.418844e-03 for size 10^6: 11.2768 Gb/sec
Size: 10^7, (repeats=100)
Time 1.443470e-02 for size 10^7: 11.0844 Gb/sec
Size: 10^8, (repeats=100)
Time 1.540918e-01 for size 10^8: 10.3834 Gb/sec
Size: 10^9, (repeats=10)
Time 1.813220e+00 for size 10^9: 8.8241 Gb/sec
Size: 10^10, (repeats=10)
Time 1.846741e+01 for size 10^10: 8.6639 Gb/sec
6.4.2 Fortran
The count parameter can be declared to be
use mpi_f08
Integer(kind=MPI_COUNT_KIND) :: count
Since Fortran has polymorphism, the same routine names can be used.
The legit way of coding:

!! typecheck.F90
integer(8) :: source
integer(kind=MPI_COUNT_KIND) :: n=1
call MPI_Init()
call MPI_Send(source,n,MPI_INTEGER8, &
     1,0,MPI_COMM_WORLD)

… but you can see what’s under the hood:

!! typecheck8.F90
integer(8) :: source,n=1
call MPI_Init()
call MPI_Send(source,n,MPI_INTEGER8, &
     1,0,MPI_COMM_WORLD)
Routines using this type are not available unless using the mpi_f08 module.
End of MPI-4 material
Above, we did not actually create a datatype that was bigger than 2G, but if you do so, you can query its
extent by MPI_Type_get_extent_x (figure 6.16) and MPI_Type_get_true_extent_x (figure 6.16).
Python note 20: big data. Since python has unlimited size integers there is no explicit need for the ‘x’ vari-
ants of routines. Internally, MPI.Status.Get_elements is implemented in terms of MPI_Get_elements_x.
Similarly, using MPI_Type_get_extent counts the gaps in a struct induced by alignment issues.
size_t size_of_struct = sizeof(struct object);
MPI_Aint typesize,typelb;
MPI_Type_get_extent(newstructuretype,&typelb,&typesize);
assert( typesize==size_of_struct );
See section 6.3.6 for the code defining the structure type.
Figure 6.9: True lower bound and extent of a subarray data type
The subarray datatype need not start at the first element of the buffer, so the extent is an overstatement
of how much data is involved. In fact, the lower bound is zero, and the extent equals the size of the block
from which the subarray is taken. The routine MPI_Type_get_true_extent (figure 6.16) returns the lower
bound, indicating where the data starts, and the extent from that point. This is illustrated in figure 6.9.
Code:
// trueextent.c
int sender = 0, receiver = 1, the_other = 1-procno;
int sizes[2] = {4,6},subsizes[2] = {2,3},starts[2] = {1,2};
MPI_Datatype subarraytype;
MPI_Type_create_subarray
  (2,sizes,subsizes,starts,
   MPI_ORDER_C,MPI_DOUBLE,&subarraytype);
MPI_Type_commit(&subarraytype);

MPI_Aint true_lb,true_extent,extent;
MPI_Type_get_true_extent
  (subarraytype,&true_lb,&true_extent);
MPI_Aint
  comp_lb = sizeof(double) *
    ( starts[0]*sizes[1]+starts[1] ),
  comp_extent = sizeof(double) *
    ( sizes[1]-starts[1] // first row
      + starts[1]+subsizes[1] // last row
      + ( subsizes[0]>1 ? subsizes[0]-2 : 0 )*sizes[1] );
ASSERT(true_lb==comp_lb);
ASSERT(true_extent==comp_extent);
MPI_Send(source,1,subarraytype,the_other,0,comm);
MPI_Type_free(&subarraytype);

Output:
In basic array of 192 bytes
find sub array of 48 bytes
Found lb=64, extent=72
Computing lb=64 extent=72
Non-true lb=0, extent=192, computed=192
Finished
received: 8.500 9.500 10.500 14.500 15.500 16.500
1,2
1,3
1,4
2,2
2,3
2,4
There are also ‘big data’ routines MPI_Type_get_extent_x and MPI_Type_get_true_extent_x that have an MPI_Count as output.
The following material is for the recently released MPI-4 standard and may not be supported yet.
The routines MPI_Type_get_extent_c and MPI_Type_get_true_extent_c also output an MPI_Count.
End of MPI-4 material
The technicality on which the solution hinges is that you can ‘resize’ a type with MPI_Type_create_resized
(figure 6.17) to give it a different extent, while not affecting how much data there actually is in it.
6.6.2.1 Example 1
Figure 6.10: Contiguous type of two vectors, before and after resizing the extent.
First consider sending more than one derived type, from a buffer containing consecutive integers:
// vectorpadsend.c
for (int i=0; i<max_elements; i++) sendbuffer[i] = i;
MPI_Type_vector(count,blocklength,stride,MPI_INT,&stridetype);
MPI_Type_commit(&stridetype);
MPI_Send( sendbuffer,ntypes,stridetype, receiver,0, comm );
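The problem, illustrated in figure 6.10, is that the second copy of the vector type starts right after the last block of the first, rather than a full stride further. A minimal sketch of the remedy (variable names as in the fragment above; the book's actual continuation may differ) resizes the extent to count whole strides:
MPI_Datatype paddedstride;
MPI_Type_create_resized
  (stridetype,0,count*stride*sizeof(int),&paddedstride);
MPI_Type_commit(&paddedstride);
MPI_Send( sendbuffer,ntypes,paddedstride, receiver,0, comm );
MPI_Type_free(&paddedstride);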
6.6.2.2 Example 2
For another example, let’s revisit exercise 6.4 (and figure 6.5) where each process makes a buffer of integers
that will be interleaved in a gather call: Strided data was sent in individual transactions. Would it be
possible to address all these interleaved packets in one gather or scatter call?
int *mydata = (int*) malloc( localsize*sizeof(int) );
for (int i=0; i<localsize; i++)
mydata[i] = i*nprocs+procno;
MPI_Gather( mydata,localsize,MPI_INT,
/* rest to be determined */ );
An ordinary gather call will of course not interleave, but put the data end-to-end:
MPI_Gather( mydata,localsize,MPI_INT,
gathered,localsize,MPI_INT, // abutting
root,comm );
This is illustrated in figure 6.11. A sample printout of the result would be:
0 1879048192 1100361260 3 3 0 6 0 0 9 1 198654
The trick is to use MPI_Type_create_resized to make the extent of the type only one int long:
// interleavegather.c
MPI_Datatype interleavetype;
MPI_Type_create_resized(stridetype,0,sizeof(int),&interleavetype);
MPI_Type_commit(&interleavetype);
MPI_Gather( mydata,localsize,MPI_INT,
gathered,1,interleavetype, // shrunk extent
root,comm );
MPI_Datatype paddedblock;
MPI_Type_create_resized(oneblock,0,stride*sizeof(double),&paddedblock);
MPI_Type_commit(&paddedblock);
MPI_Type_get_extent(paddedblock,&block_lb,&block_x);
printf("Padded block has extent: %ld\n",block_x);
Transposing data is an important part of such operations as the FFT. We develop this in steps. Refer to
figure 6.13.
The source data can be described as a vector type defined as:
• there are 𝑏 blocks,
• of blocksize 𝑏,
• spaced apart by the global 𝑖-size of the array.
// transposeblock.cxx
MPI_Datatype sourceblock;
MPI_Type_vector( blocksize_j,blocksize_i,isize,MPI_INT,&sourceblock);
MPI_Type_commit( &sourceblock);
The target type is harder to describe. First we note that each contiguous block from the source type can
be described as a vector type with:
• 𝑏 blocks,
• of size 1 each,
• strided by the global 𝑗-size of the matrix.
MPI_Datatype targetcolumn;
MPI_Type_vector( blocksize_i,1,jsize, MPI_INT,&targetcolumn);
MPI_Type_commit( &targetcolumn );
For the full type at the receiving process we now need to pack 𝑏 of these lines together.
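A minimal sketch of that step (variable names as above; the actual transposeblock.cxx code may do this differently): shrink the extent of the column type to a single int, so that consecutive columns start one element apart, then take blocksize_j of them contiguously:
MPI_Datatype paddedcolumn,targetblock;
MPI_Type_create_resized
  (targetcolumn,0,sizeof(int),&paddedcolumn);   // columns now start one int apart
MPI_Type_contiguous( blocksize_j,paddedcolumn,&targetblock );
MPI_Type_commit( &targetblock );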
6.8 Packing
One of the reasons for derived datatypes is dealing with noncontiguous data. In older communication
libraries this could only be done by packing data from its original containers into a buffer, and likewise
unpacking it at the receiver into its destination data structures.
MPI offers this packing facility, partly for compatibility with such libraries, but also for reasons of flexibil-
ity. Unlike with derived datatypes, which transfers data atomically, packing routines add data sequentially
to the buffer and unpacking takes them sequentially.
This means that one could pack an integer describing how many floating point numbers are in the rest
of the packed message. Correspondingly, the unpack routine could then investigate the first integer and
based on it unpack the right number of floating point numbers.
MPI offers the following:
• The MPI_Pack command adds data to a send buffer;
• the MPI_Unpack command retrieves data from a receive buffer;
• the buffer is sent with a datatype of MPI_PACKED.
With MPI_Pack data elements can be added to a buffer one at a time. The position parameter is updated
each time by the packing routine.
int MPI_Pack(
void *inbuf, int incount, MPI_Datatype datatype,
void *outbuf, int outcount, int *position,
MPI_Comm comm);
Conversely, MPI_Unpack retrieves one element from the buffer at a time. You need to specify the MPI
datatype.
int MPI_Unpack(
void *inbuf, int insize, int *position,
void *outbuf, int outcount, MPI_Datatype datatype,
MPI_Comm comm);
A packed buffer is sent or received with a datatype of MPI_PACKED. The sending routine uses the position
parameter to specify how much data is sent, but the receiving routine does not know this value a priori,
so it has to specify an upper bound.
Code:
if (procno==sender) {
  position = 0;
  MPI_Pack(&nsends,1,MPI_INT,
           buffer,buflen,&position,comm);
  for (int i=0; i<nsends; i++) {
    double value = rand()/(double)RAND_MAX;
    printf("[%d] pack %e\n",procno,value);
    MPI_Pack(&value,1,MPI_DOUBLE,
             buffer,buflen,&position,comm);
  }
  MPI_Pack(&nsends,1,MPI_INT,
           buffer,buflen,&position,comm);
  MPI_Send(buffer,position,MPI_PACKED,other,0,comm);
} else if (procno==receiver) {
  int irecv_value;
  double xrecv_value;
  MPI_Recv(buffer,buflen,MPI_PACKED,other,0,
           comm,MPI_STATUS_IGNORE);
  position = 0;
  MPI_Unpack(buffer,buflen,&position,
             &nsends,1,MPI_INT,comm);
  for (int i=0; i<nsends; i++) {
    MPI_Unpack(buffer,buflen,
               &position,&xrecv_value,1,MPI_DOUBLE,comm);
    printf("[%d] unpack %e\n",procno,xrecv_value);
  }
  MPI_Unpack(buffer,buflen,&position,
             &irecv_value,1,MPI_INT,comm);
  ASSERT(irecv_value==nsends);
}
Output:
[0] pack 8.401877e-01
[0] pack 3.943829e-01
[0] pack 7.830992e-01
[0] pack 7.984400e-01
[0] pack 9.116474e-01
[0] pack 1.975514e-01
You can precompute the size of the required buffer with MPI_Pack_size (figure 6.18).
Code:
// pack.c
for (int i=1; i<=4; i++) {
  MPI_Pack_size(i,MPI_CHAR,comm,&s);
  printf("%d chars: %d\n",i,s);
}
for (int i=1; i<=4; i++) {
  MPI_Pack_size(i,MPI_UNSIGNED_SHORT,comm,&s);
  printf("%d unsigned shorts: %d\n",i,s);
}
for (int i=1; i<=4; i++) {
  MPI_Pack_size(i,MPI_INT,comm,&s);
  printf("%d ints: %d\n",i,s);
}
Output:
1 chars: 1
2 chars: 2
3 chars: 3
4 chars: 4
1 unsigned shorts: 2
2 unsigned shorts: 4
3 unsigned shorts: 6
4 unsigned shorts: 8
1 ints: 4
2 ints: 8
3 ints: 12
4 ints: 16
with dynamically created arrays. Write code to send and receive this structure.
comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
procno = comm.Get_rank()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
if procno==sender:
sizeofint = np.dtype('intc').itemsize
print("Size of C int: {}".format(sizeofint))
data = np.empty(2*count,dtype=np.intc)
for i in range(2*count):
data[i] = i
vectortype = MPI.INT.Create_vector(count,1,2)
vectortype.Commit()
comm.Send( [data,1,vectortype], receiver )
elif procno==receiver:
data = np.empty(count,dtype=np.intc)
comm.Recv( data, sender )
print(data)
MPI_Init(0,0);
float x5;
double x10;
int s5,s10;
MPI_Datatype mpi_x5,mpi_x10;
MPI_Type_match_size(MPI_TYPECLASS_REAL,sizeof(x5),&mpi_x5);
MPI_Type_match_size(MPI_TYPECLASS_REAL,sizeof(x10),&mpi_x10);
MPI_Type_size(mpi_x5,&s5);
MPI_Type_size(mpi_x10,&s10);
printf("%d, %d\n",s5,s10);
MPI_Finalize();
return 0;
}
#include "globalinit.c"
MPI_Datatype newtype;
count = 3; bs = 2; stride = 5;
MPI_Type_vector(count,bs,stride,MPI_DOUBLE,&newtype);
MPI_Type_commit(&newtype);
MPI_Type_size(newtype,&size);
ASSERT( size==(count*bs)*sizeof(double) );
MPI_Type_free(&newtype);
MPI_Aint lb,asize;
MPI_Type_vector(count,bs,stride,MPI_DOUBLE,&newtype);
MPI_Type_commit(&newtype);
MPI_Type_get_extent(newtype,&lb,&asize);
ASSERT( lb==0 );
ASSERT( asize==((count-1)*stride+bs)*sizeof(double) );
MPI_Type_free(&newtype);
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
source = np.empty(stride*count,dtype=np.float64)
target = np.empty(count,dtype=np.float64)
for i in range(stride*count):
source[i] = i+.5
if procid==sender:
newvectortype = MPI.DOUBLE.Create_vector(count,1,stride)
newvectortype.Commit()
comm.Send([source,1,newvectortype],dest=the_other)
newvectortype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)
if procid==sender:
print("finished")
if procid==receiver:
for i in range(count):
if target[i]!=source[stride*i]:
print("error in location %d: %e s/b %e" % (i,target[i],source[stride*i]))
#include <mpl/mpl.hpp>
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, the_other = 1-procno,
count = 5,stride=2;
vector<double>
source(stride*count);
vector<double>
target(count);
if (procno==sender) {
mpl::strided_vector_layout<double>
newvectortype(count,1,stride);
comm_world.send
(source.data(),newvectortype,the_other);
}
else if (procno==receiver) {
int recv_count;
mpl::contiguous_layout<double> target_layout(count);
mpl::status_t recv_status =
comm_world.recv(target.data(),target_layout, the_other);
recv_count = recv_status.get_count<double>();
assert(recv_count==count);
}
if (procno==receiver) {
for (int i=0; i<count; i++)
if (target[i]!=source[stride*i])
printf("location %d %e s/b %e\n",i,target[i],source[stride*i]);
}
if (procno==0)
printf("Finished\n");
return 0;
}
#include "globalinit.c"
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, count = 5;
double *source,*target;
source = (double*) malloc(count*sizeof(double));
target = (double*) malloc(count*sizeof(double));
MPI_Datatype newvectortype;
if (procno==sender) {
MPI_Type_contiguous(count,MPI_DOUBLE,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,receiver,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,count,MPI_DOUBLE,sender,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_DOUBLE,&recv_count);
ASSERT(count==recv_count);
}
if (procno==receiver) {
for (int i=0; i<count; i++)
if (target[i]!=source[i])
printf("location %d %e s/b %e\n",i,target[i],source[i]);
}
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
use mpi_f08
implicit none
Type(MPI_Datatype) :: newvectortype
Type(MPI_Status) :: recv_status
integer :: recv_count
#include "globalinit.F90"
if (ntids<2) then
print *,"This program needs at least two processes"
stop
end if
ALLOCATE(source(count))
ALLOCATE(target(count))
do i=1,count
source(i) = i+.5;
end do
if (mytid==sender) then
call MPI_Type_contiguous(count,MPI_DOUBLE_PRECISION,newvectortype)
call MPI_Type_commit(newvectortype)
call MPI_Send(source,1,newvectortype,receiver,0,comm)
call MPI_Type_free(newvectortype)
else if (mytid==receiver) then
call MPI_Recv(target,count,MPI_DOUBLE_PRECISION,sender,0,comm,&
recv_status)
call MPI_Get_count(recv_status,MPI_DOUBLE_PRECISION,recv_count)
!ASSERT(count==recv_count);
end if
if (mytid==receiver) then
! for (i=0; i<count; i++)
! if (target[i]!=source[i])
! printf("location %d %e s/b %e\n",i,target[i],source[i]);
end if
call MPI_Finalize(err)
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
source = np.empty(count,dtype=np.float64)
target = np.empty(count,dtype=np.float64)
for i in range(count):
source[i] = i+.5
if procid==sender:
newcontiguoustype = MPI.DOUBLE.Create_contiguous(count)
newcontiguoustype.Commit()
comm.Send([source,1,newcontiguoustype],dest=the_other)
newcontiguoustype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)
if procid==sender:
print("finished")
if procid==receiver:
for i in range(count):
if target[i]!=source[i]:
print("error in location %d: %e s/b %e" % (i,target[i],source[i]))
#include <stdio.h>
#include <string.h>
#include "mpi.h"
#include "globalinit.c"
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, the_other = 1-procno,
count = 5,stride=2;
double *source,*target;
source = (double*) malloc(stride*count*sizeof(double));
target = (double*) malloc(count*sizeof(double));
MPI_Datatype newvectortype;
if (procno==sender) {
MPI_Type_vector(count,1,stride,MPI_DOUBLE,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,the_other,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,count,MPI_DOUBLE,the_other,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_DOUBLE,&recv_count);
ASSERT(recv_count==count);
}
if (procno==receiver) {
for (int i=0; i<count; i++)
if (target[i]!=source[stride*i])
printf("location %d %e s/b %e\n",i,target[i],source[stride*i]);
}
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
use mpi_f08
implicit none
Type(MPI_Datatype) :: newvectortype
integer :: recv_count
Type(MPI_Status) :: recv_status
Type(MPI_Comm) :: comm;
integer :: mytid,ntids,i,p,err;
call MPI_Init()
comm = MPI_COMM_WORLD
call MPI_Comm_rank(comm,mytid)
call MPI_Comm_size(comm,ntids)
call MPI_Comm_set_errhandler(comm,MPI_ERRORS_RETURN)
if (ntids<2) then
print *,"This program needs at least two processes"
stop
end if
ALLOCATE(source(stride*count))
ALLOCATE(target(stride*count))
do i=1,stride*count
source(i) = i+.5;
end do
if (mytid==sender) then
call MPI_Type_vector(count,1,stride,MPI_DOUBLE_PRECISION,&
newvectortype)
call MPI_Type_commit(newvectortype)
call MPI_Send(source,1,newvectortype,receiver,0,comm)
call MPI_Type_free(newvectortype)
if ( .not. newvectortype==MPI_DATATYPE_NULL) then
print *,"Trouble freeing datatype"
else
print *,"Datatype successfully freed"
end if
else if (mytid==receiver) then
call MPI_Recv(target,count,MPI_DOUBLE_PRECISION,sender,0,comm,&
recv_status)
call MPI_Get_count(recv_status,MPI_DOUBLE_PRECISION,recv_count)
end if
if (mytid==receiver) then
! for (i=0; i<count; i++)
! if (target[i]!=source[stride*i])
! printf("location %d %e s/b %e\n",i,target[i],source[stride*i]);
end if
call MPI_Finalize(err)
use mpi
implicit none
integer :: newvectortype
integer :: recv_status(MPI_STATUS_SIZE),recv_count
#include "globalinit.F90"
if (ntids<2) then
print *,"This program needs at least two processes"
stop
end if
ALLOCATE(source(stride*count))
ALLOCATE(target(stride*count))
do i=1,stride*count
source(i) = i+.5;
end do
if (mytid==sender) then
call MPI_Type_vector(count,1,stride,MPI_DOUBLE_PRECISION,&
newvectortype,err)
call MPI_Type_commit(newvectortype,err)
call MPI_Send(source,1,newvectortype,receiver,0,comm,err)
call MPI_Type_free(newvectortype,err)
else if (mytid==receiver) then
call MPI_Recv(target,count,MPI_DOUBLE_PRECISION,sender,0,comm,&
recv_status,err)
call MPI_Get_count(recv_status,MPI_DOUBLE_PRECISION,recv_count,err)
end if
if (mytid==receiver) then
! for (i=0; i<count; i++)
! if (target[i]!=source[stride*i])
! printf("location %d %e s/b %e\n",i,target[i],source[stride*i]);
end if
call MPI_Finalize(err)
call MPI_Init()
comm = MPI_COMM_WORLD
call MPI_Comm_size(comm,nprocs)
call MPI_Comm_rank(comm,procno)
if (nprocs<2) then
print *,"This example really needs 2 processors"
call MPI_Abort(comm,0)
end if
if (procno==0) then
call MPI_Send(matrix(1:2,1:2),4,MPI_REAL,1,0,comm)
else if (procno==1) then
call MPI_Recv(submatrix,4,MPI_REAL,0,0,comm,MPI_STATUS_IGNORE)
if (submatrix(2,2)==22) then
print *,"Yay"
else
print *,"nay...."
end if
end if
call MPI_Finalize()
call MPI_Init()
comm = MPI_COMM_WORLD
call MPI_Comm_size(comm,nprocs)
call MPI_Comm_rank(comm,procno)
if (nprocs<2) then
print *,"This example really needs 2 processors"
call MPI_Abort(comm,0)
end if
if (procno==0) then
siz = 20
allocate( matrix(siz,siz) )
matrix = reshape( [ ((j+(i-1)*siz,i=1,siz),j=1,siz) ], (/siz,siz/) )
call MPI_Isend(matrix(1:2,1:2),4,MPI_REAL,1,0,comm,request)
call MPI_Wait(request,MPI_STATUS_IGNORE)
deallocate(matrix)
else if (procno==1) then
call MPI_IRecv(submatrix,4,MPI_REAL,0,0,comm,request)
call MPI_Wait(request,MPI_STATUS_IGNORE)
if (submatrix(2,2)==22) then
print *,"Yay"
else
print *,"nay...."
end if
end if
call MPI_Finalize()
MPI_Comm comm;
int procno=-1,nprocs,ierr;
MPI_Init(&argc,&argv);
comm = MPI_COMM_WORLD;
MPI_Comm_rank(comm,&procno);
MPI_Comm_size(comm,&nprocs);
MPI_Comm_set_errhandler(comm,MPI_ERRORS_RETURN);
if (nprocs==1) {
printf("This program needs at least 2 procs\n");
MPI_Abort(comm,0);
}
int sender = 0, receiver = nprocs-1;
#define SIZE 4
int
sizes[2], subsizes[2], starts[2];
sizes[0] = SIZE; sizes[1] = SIZE;
subsizes[0] = SIZE/2; subsizes[1] = SIZE;
starts[0] = starts[1] = 0;
MPI_Request req;
if (procno==sender) {
/*
* Write lexicographic test data
*/
double data[SIZE][SIZE];
for (int i=0; i<SIZE; i++)
for (int j=0; j<SIZE; j++)
data[i][j] = j+i*SIZE;
/*
* Make a datatype that enumerates the storage in C order
*/
MPI_Datatype rowtype;
ierr =
MPI_Type_create_subarray
(2,sizes,subsizes,starts,
MPI_ORDER_C,MPI_DOUBLE,&rowtype);
ERR(ierr,"creating rowtype");
MPI_Type_commit(&rowtype);
MPI_Send(data,1,rowtype, receiver,0,comm);
MPI_Type_free(&rowtype);
/*
* Make a datatype that enumerates the storage in F order
*/
MPI_Datatype coltype;
ierr =
MPI_Type_create_subarray
(2,sizes,subsizes,starts,
MPI_ORDER_FORTRAN,MPI_DOUBLE,&coltype);
ERR(ierr,"creating rowtype");
MPI_Type_commit(&coltype);
MPI_Send(data,1,coltype, receiver,0,comm);
MPI_Type_free(&coltype);
} else if (procno==receiver) {
int linearsize = SIZE * SIZE/2;
double lineardata[ linearsize ];
/*
* Receive msg in C order:
*/
MPI_Recv(lineardata,linearsize,MPI_DOUBLE, sender,0,comm, MPI_STATUS_IGNORE);
printf("Received C order:\n");
for (int i=0; i<SIZE/2; i++) {
for (int j=0; j<SIZE; j++)
printf(" %5.3f",lineardata[j+i*SIZE]);
printf("\n");
}
/*
* Receive msg in F order:
*/
MPI_Recv(lineardata,linearsize,MPI_DOUBLE, sender,0,comm, MPI_STATUS_IGNORE);
printf("Received F order:\n");
for (int j=0; j<SIZE; j++) {
for (int i=0; i<SIZE/2; i++)
printf(" %5.3f",lineardata[i+j*SIZE/2]);
printf("\n");
}
/* MPI_Finalize(); */
return 0;
}
#include "globalinit.c"
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, the_other = 1-procno,
count = 5,totalcount = 15, targetbuffersize = 2*totalcount;
int *source,*target;
int *displacements,*blocklengths;
MPI_Datatype newvectortype;
if (procno==sender) {
MPI_Type_indexed(count,blocklengths,displacements,MPI_INT,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,the_other,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,targetbuffersize,MPI_INT,the_other,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_INT,&recv_count);
ASSERT(recv_count==count);
}
if (procno==receiver) {
int i=3,val=7;
if (target[i]!=val)
printf("location %d %d s/b %d\n",i,target[i],val);
i=4; val=11;
if (target[i]!=val)
printf("location %d %d s/b %d\n",i,target[i],val);
}
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
use mpi
implicit none
integer :: newvectortype;
integer,dimension(:),allocatable :: indices,blocklengths,&
source,targt
integer :: sender = 0, receiver = 1, count = 5,totalcount = 15
integer :: recv_status(MPI_STATUS_SIZE),recv_count
#include "globalinit.F90"
if (ntids<2) then
print *,"This program needs at least two processes"
stop
end if
ALLOCATE(indices(count))
ALLOCATE(blocklengths(count))
ALLOCATE(source(totalcount))
ALLOCATE(targt(count))
if (mytid==sender) then
call MPI_Type_indexed(count,blocklengths,indices,MPI_INTEGER,&
newvectortype,err)
call MPI_Type_commit(newvectortype,err)
call MPI_Send(source,1,newvectortype,receiver,0,comm,err)
call MPI_Type_free(newvectortype,err)
else if (mytid==receiver) then
call MPI_Recv(targt,count,MPI_INTEGER,sender,0,comm,&
recv_status,err)
call MPI_Get_count(recv_status,MPI_INTEGER,recv_count,err)
! ASSERT(recv_count==count);
end if
! if (mytid==receiver) {
! int i=3,val=7;
! if (targt(i)!=val)
! printf("location %d %d s/b %d\n",i,targt(i),val);
! i=4; val=11;
! if (targt(i)!=val)
! printf("location %d %d s/b %d\n",i,targt(i),val);
! }
call MPI_Finalize(err)
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
displacements = np.empty(count,dtype=int)
blocklengths = np.empty(count,dtype=int)
source = np.empty(totalcount,dtype=np.float64)
target = np.empty(count,dtype=np.float64)
idcs = [2,3,5,7,11]
for i in range(len(idcs)):
displacements[i] = idcs[i]
blocklengths[i] = 1
for i in range(totalcount):
source[i] = i+.5
if procid==sender:
newindextype = MPI.DOUBLE.Create_indexed(blocklengths,displacements)
newindextype.Commit()
comm.Send([source,1,newindextype],dest=the_other)
newindextype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)
if procid==sender:
print("finished")
if procid==receiver:
target_loc = 0
for block in range(count):
for element in range(blocklengths[block]):
source_loc = displacements[block]+element
if target[target_loc]!=source[source_loc]:
print("error in src/tar location %d/%d: %e s/b %e" \
% (source_loc,target_loc,target[target_loc],source[source_loc]) )
target_loc += 1
#include <vector>
using std::vector;
#include <cassert>
#include <mpl/mpl.hpp>
vector<int>
source_buffer(totalcount),
target_buffer(targetbuffersize);
for (int i=0; i<totalcount; ++i)
source_buffer[i] = i;
if (procno==sender) {
comm_world.send( source_buffer.data(),indexed_where, receiver );
} else if (procno==receiver) {
auto recv_status =
comm_world.recv( target_buffer.data(),fiveints, sender );
int recv_count = recv_status.get_count<int>();
assert(recv_count==count);
}
if (procno==receiver) {
int i=3,val=7;
if (target_buffer[i]!=val)
printf("Error: location %d %d s/b %d\n",i,target_buffer[i],val);
i=4; val=11;
if (target_buffer[i]!=val)
printf("Error: location %d %d s/b %d\n",i,target_buffer[i],val);
printf("Finished. Correctly sent indexed primes.\n");
}
return 0;
}
#include <vector>
using std::vector;
#include <cassert>
#include <mpl/mpl.hpp>
vector<int>
source_buffer(totalcount),
target_buffer(targetbuffersize);
for (int i=0; i<totalcount; ++i)
source_buffer[i] = i;
if (procno==receiver) {
int i=3,val=7;
if (target_buffer[i]!=val)
printf("Error: location %d %d s/b %d\n",i,target_buffer[i],val);
i=4; val=11;
if (target_buffer[i]!=val)
printf("Error: location %d %d s/b %d\n",i,target_buffer[i],val);
printf("Finished. Correctly sent indexed primes.\n");
}
return 0;
#include "globalinit.c"
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, the_other = 1-procno;
struct object {
char c;
double x[2];
int i;
};
MPI_Datatype newstructuretype;
int structlen = 3;
int blocklengths[structlen]; MPI_Datatype types[structlen];
MPI_Aint displacements[structlen];
/*
* where are the components relative to the structure?
*/
MPI_Aint current_displacement=0;
// one character
blocklengths[0] = 1; types[0] = MPI_CHAR;
displacements[0] = (size_t)&(myobject.c) - (size_t)&myobject;
// two doubles
blocklengths[1] = 2; types[1] = MPI_DOUBLE;
displacements[1] = (size_t)&(myobject.x[0]) - (size_t)&myobject;
// one int
blocklengths[2] = 1; types[2] = MPI_INT;
displacements[2] = (size_t)&(myobject.i) - (size_t)&myobject;
MPI_Type_create_struct(structlen,blocklengths,displacements,types,&newstructuretype);
MPI_Type_commit(&newstructuretype);
MPI_Aint typesize,typelb;
MPI_Type_get_extent(newstructuretype,&typelb,&typesize);
assert( typesize==size_of_struct );
if (procno==sender) {
printf("Type extent: %ld bytes; displacements: %ld %ld %ld\n",
typesize,displacements[0],displacements[1],displacements[2]);
}
if (procno==sender) {
MPI_Send(&myobject,1,newstructuretype,the_other,0,comm);
} else if (procno==receiver) {
MPI_Recv(&myobject,1,newstructuretype,the_other,0,comm,MPI_STATUS_IGNORE);
}
MPI_Type_free(&newstructuretype);
/* if (procno==sender) */
/* printf("char x=%ld, l=%ld; double x=%ld, l=%ld, int x=%ld, l=%ld\n", */
/* char_extent,char_lb,double_extent,double_lb,int_extent,int_lb); */
if (procno==receiver) {
printf("Char '%c' double0=%e double1=%e int=%d\n",
myobject.c,myobject.x[0],myobject.x[1],myobject.i);
ASSERT(myobject.x[1]==1.5);
ASSERT(myobject.i==37);
}
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
#if 0
blocklengths[0] = 1; types[0] = MPI_CHAR;
displacements[0] = (size_t)&(myobject.c) - (size_t)&myobject;
blocklengths[1] = 2; types[1] = MPI_DOUBLE;
displacements[1] = (size_t)&(myobject.x[0]) - (size_t)&myobject;
blocklengths[2] = 1; types[2] = MPI_INT;
displacements[2] = (size_t)&(myobject.i) - (size_t)&myobject;
MPI_Aint char_extent,char_lb;
MPI_Type_get_extent(MPI_CHAR,&char_lb,&char_extent);
/* if (procno==0) */
/* printf("CHAR lb=%ld xt=%ld disp=%ld\n",char_lb,char_extent,current_displacement); */
MPI_Aint double_extent,double_lb;
MPI_Type_get_extent(MPI_DOUBLE,&double_lb,&double_extent);
/* if (procno==0) */
/* printf("DOUBLE lb=%ld xt=%ld disp=%ld\n",double_lb,double_extent,current_displacement); */
MPI_Aint int_extent,int_lb;
MPI_Type_get_extent(MPI_INT,&int_lb,&int_extent);
/* if (procno==0) */
/* printf("INT lb=%ld xt=%ld disp=%ld\n",int_lb,int_extent,current_displacement); */
#endif
#include <mpl/mpl.hpp>
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, the_other = 1-procno;
char c; vector<double> x(2); int i;
if (procno==sender) {
c = 'x'; x[0] = 2.7; x[1] = 1.5; i = 37; }
mpl::heterogeneous_layout object
( c,
mpl::make_absolute(x.data(),mpl::vector_layout<double>(2)),
i );
if (procno==sender) {
comm_world.send( mpl::absolute,object,receiver );
} else if (procno==receiver) {
comm_world.recv( mpl::absolute,object,sender );
}
if (procno==receiver) {
printf("Char '%c' double0=%e double1=%e int=%d\n",
c,x[0],x[1],i);
assert(x[1]==1.5);
assert(i==37);
}
if (procno==0)
printf("Finished\n");
return 0;
}
#include "globalinit.c"
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, the_other = 1-procno,
count = 5,stride=2;
float *source=NULL,*target=NULL;
int mediumsize = 1<<30;
int nblocks = 8;
size_t datasize = (size_t)mediumsize * nblocks * sizeof(float);
if (procno==sender)
printf("datasize = %lld bytes =%7.3f giga-bytes = %7.3f gfloats\n",
datasize,datasize*1.e-9,datasize*1.e-9/sizeof(float));
if (procno==sender) {
source = (float*) malloc(datasize);
if (source) {
printf("Source allocated\n");
} else {
printf("Could not allocate source data\n"); MPI_Abort(comm,1); }
long int idx = 0;
for (int iblock=0; iblock<nblocks; iblock++) {
for (int element=0; element<mediumsize; element++) {
source[idx] = idx+.5; idx++;
}
}
}
if (procno==receiver) {
target = (float*) malloc(datasize);
if (target) {
printf("Target allocated\n");
} else {
printf("Could not allocate target data\n"); MPI_Abort(comm,1); }
}
MPI_Datatype blocktype;
MPI_Type_contiguous(mediumsize,MPI_FLOAT,&blocktype);
MPI_Type_commit(&blocktype);
if (procno==sender) {
MPI_Send(source,nblocks,blocktype,receiver,0,comm);
} else if (procno==receiver) {
MPI_Status recv_status;
MPI_Recv(target,nblocks,blocktype,sender,0,comm,
&recv_status);
MPI_Count recv_count;
MPI_Get_elements_x(&recv_status,MPI_FLOAT,&recv_count);
printf("Received %7.3f medium size elements\n",recv_count * 1e-9);
}
MPI_Type_free(&blocktype);
if (0 && procno==receiver) {
for (int i=0; i<count; i++)
if (target[i]!=source[stride*i])
printf("location %d %e s/b %e\n",i,target[i],source[stride*i]);
}
if (procno==0)
printf("Finished\n");
if (procno==sender)
free(source);
if (procno==receiver)
free(target);
MPI_Finalize();
return 0;
}
#include "globalinit.c"
if (procno==0) {
printf("size of size_t = %d\n",sizeof(size_t));
MPI_Finalize();
return 0;
}
if (nprocs<2) {
printf("Needs at least 2 processes\n");
MPI_Abort(comm,0);
}
int sender=0, receiver=1;
/*
* Datatype for strided destinations
*/
MPI_Datatype stridetype;
int count = 3, stride = 2, blocklength = 1;
int ntypes = 2, max_elements = ntypes*stride*count;
if (procno==sender) {
int *sendbuffer = (int*)malloc( max_elements*sizeof(int) );
for (int i=0; i<max_elements; i++) sendbuffer[i] = i;
MPI_Type_vector(count,blocklength,stride,MPI_INT,&stridetype);
MPI_Type_commit(&stridetype);
MPI_Send( sendbuffer,ntypes,stridetype, receiver,0, comm );
free(sendbuffer);
} else if (procno==receiver) {
int *recvbuffer = (int*)malloc( max_elements*sizeof(int) );
MPI_Status status;
MPI_Recv( recvbuffer,max_elements,MPI_INT, sender,0, comm,&status );
int count; MPI_Get_count(&status,MPI_INT,&count);
printf("Receive %d elements:",count);
for (int i=0; i<count; i++) printf(" %d",recvbuffer[i]);
printf("\n");
free(recvbuffer);
}
// ntypes*stride
MPI_Datatype paddedtype;
if (procno==sender) {
MPI_Aint l,e;
int *sendbuffer = (int*)malloc( max_elements*sizeof(int) );
for (int i=0; i<max_elements; i++) sendbuffer[i] = i;
MPI_Type_get_extent(stridetype,&l,&e);
printf("Stride type l=%ld e=%ld\n",l,e);
e += ( stride-blocklength) * sizeof(int);
MPI_Type_create_resized(stridetype,l,e,&paddedtype);
MPI_Type_get_extent(paddedtype,&l,&e);
printf("Padded type l=%ld e=%ld\n",l,e);
MPI_Type_commit(&paddedtype);
MPI_Send( sendbuffer,ntypes,paddedtype, receiver,0, comm );
free(sendbuffer);
} else if (procno==receiver) {
int *recvbuffer = (int*)malloc( max_elements*sizeof(int) );
MPI_Status status;
MPI_Recv( recvbuffer,max_elements,MPI_INT, sender,0, comm,&status );
int count; MPI_Get_count(&status,MPI_INT,&count);
printf("Receive %d elements:",count);
for (int i=0; i<count; i++) printf(" %d",recvbuffer[i]);
printf("\n");
free(recvbuffer);
}
if (procno==sender) {
MPI_Type_free(&paddedtype);
MPI_Type_free(&stridetype);
}
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
#include "globalinit.c"
if (nprocs<2) {
printf("This program needs at least two processes\n");
return -1;
}
int sender = 0, receiver = 1, the_other = 1-procno,
count = 5,stride=2;
double *source,*target;
source = (double*) malloc(stride*count*sizeof(double));
target = (double*) malloc(stride*count*sizeof(double));
if (procno==sender) {
MPI_Datatype oneblock;
MPI_Type_vector(1,1,stride,MPI_DOUBLE,&oneblock);
MPI_Type_commit(&oneblock);
MPI_Aint block_lb,block_x;
MPI_Type_get_extent(oneblock,&block_lb,&block_x);
printf("One block has extent: %ld\n",block_x);
MPI_Datatype paddedblock;
MPI_Type_create_resized(oneblock,0,stride*sizeof(double),&paddedblock);
MPI_Type_commit(&paddedblock);
MPI_Type_get_extent(paddedblock,&block_lb,&block_x);
printf("Padded block has extent: %ld\n",block_x);
MPI_Type_commit(&paddedblock);
MPI_Status recv_status;
MPI_Recv(target,count,paddedblock,the_other,0,comm,&recv_status);
/* MPI_Recv(target,count,MPI_DOUBLE,the_other,0,comm, */
/* &recv_status); */
int recv_count;
MPI_Get_count(&recv_status,MPI_DOUBLE,&recv_count);
ASSERT(recv_count==count);
}
if (procno==receiver) {
for (int i=0; i<count; i++)
if (target[i*stride]!=source[i*stride])
printf("location %d %e s/b %e\n",i,target[i*stride],source[stride*i]);
}
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
A communicator is an object describing a group of processes. In many applications all processes work
together closely coupled, and the only communicator you need is MPI_COMM_WORLD, the group describing all
processes that your job starts with.
In this chapter you will see ways to make new groups of MPI processes: subgroups of the original world
communicator. Chapter 8 discusses dynamic process management, which, while not extending MPI_COMM_WORLD,
does extend the set of available processes. That chapter also discusses the 'sessions model', which is another
way of constructing communicators.
Examples:
// C:
#include <mpi.h>
MPI_Comm comm = MPI_COMM_WORLD;
MPL note 53: predefined communicators. The environment namespace has the equivalents of MPI_COMM_WORLD
and MPI_COMM_SELF:
const communicator& mpl::environment::comm_world();
const communicator& mpl::environment::comm_self();
You can name your communicators with MPI_Comm_set_name, which could improve the quality of error
messages when they arise.
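For instance (a minimal sketch; the name string is arbitrary):
  MPI_Comm libcomm;
  MPI_Comm_dup( MPI_COMM_WORLD, &libcomm );
  // attach a human-readable label; tools and error reports may show it
  MPI_Comm_set_name( libcomm, "library-communicator" );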
newcomm = comm.Dup()
MPL note 55: communicator duplication. Communicators can be duplicated but only during initialization.
Copy assignment has been deleted. Thus:
// LEGAL:
mpl::communicator init = comm;
// WRONG:
mpl::communicator init;
init = comm;
You may wonder what ‘an exact copy’ means precisely. For this, think of a communicator as a context
label that you can attach to, among others, operations such as sends and receives. And it’s that label that
counts, not what processes are in the communicator. A send and a receive ‘belong together’ if they have
the same communicator context. Conversely, a send in one communicator can not be matched to a receive
in a duplicate communicator, made by MPI_Comm_dup.
Testing whether two communicators are really the same is then more than testing if they comprise the
same processes. The call MPI_Comm_compare returns MPI_IDENT if two communicator values are the same,
and not if one is derived from the other by duplication:
Code:
// commcompare.c
int result;
MPI_Comm copy = comm;
MPI_Comm_compare(comm,copy,&result);
printf("assign: comm==copy: %d \n",
       result==MPI_IDENT);
printf("      congruent: %d \n",
       result==MPI_CONGRUENT);
printf("      not equal: %d \n",
       result==MPI_UNEQUAL);
MPI_Comm_dup(comm,&copy);
MPI_Comm_compare(comm,copy,&result);
printf("duplicate: comm==copy: %d \n",
       result==MPI_IDENT);
printf("      congruent: %d \n",
       result==MPI_CONGRUENT);
printf("      not equal: %d \n",
       result==MPI_UNEQUAL);
Output:
assign: comm==copy: 1
       congruent: 0
       not equal: 0
duplicate: comm==copy: 0
       congruent: 1
       not equal: 0
and suppose that the library has receive calls. Now it is possible that the receive in the library inadvertently
catches the message that was sent in the outer environment.
Let us consider an example. First of all, here is code where the library stores the communicator of the
calling program:
// commdupwrong.cxx
class library {
private:
MPI_Comm comm;
int procno,nprocs,other;
MPI_Request request[2];
public:
library(MPI_Comm incomm) {
comm = incomm;
MPI_Comm_rank(comm,&procno);
other = 1-procno;
};
int communication_start();
int communication_end();
};
self.comm = comm.Dup()
self.other = self.comm.Get_size()-self.comm.Get_rank()-1
self.requests = [ None ] * 2
def __del__(self):
if self.comm.Get_rank()==0: print(".. freeing communicator")
self.comm.Free()
def communication_start(self):
sendbuf = np.empty(1,dtype=int); sendbuf[0] = 37
recvbuf = np.empty(1,dtype=int)
self.requests[0] = self.comm.Isend( sendbuf, dest=self.other,tag=2 )
self.requests[1] = self.comm.Irecv( recvbuf, source=self.other )
def communication_end(self):
MPI.Request.Waitall(self.requests)
mylibrary = Library(comm)
my_requests[0] = comm.Isend( sendbuffer,dest=other,tag=1 )
mylibrary.communication_start()
my_requests[1] = comm.Irecv( recvbuffer,source=other )
MPI.Request.Waitall(my_requests,my_status)
mylibrary.communication_end()
7.3 Sub-communicators
In many scenarios you divide a large job over all the available processors. However, your job may have
two or more parts that can be considered as jobs by themselves. In that case it makes sense to divide your
processors into subgroups accordingly.
Suppose for instance that you are running a simulation where inputs are generated, a computation is per-
formed on them, and the results of this computation are analyzed or rendered graphically. You could then
consider dividing your processors in three groups corresponding to generation, computation, rendering.
As long as you only do sends and receives, this division works fine. However, if one group of processes
needs to perform a collective operation, you don’t want the other groups involved in this. Thus, you really
want the three groups to be distinct from each other.
In order to make such subsets of processes, MPI has the mechanism of taking a subset of MPI_COMM_WORLD
(or other communicator) and turning that subset into a new communicator.
Now you understand why the MPI collective calls had an argument for the communicator: a collective
involves all processes of that communicator. By making a communicator that contains a subset of all
available processes, you can do a collective on that subset.
The usage is as follows:
• You create a new communicator with routines such as MPI_Comm_dup (section 7.2), MPI_Comm_split
(section 7.4), MPI_Comm_create (section 7.5), MPI_Intercomm_create (section 7.6), MPI_Comm_spawn
(section 8.1);
• you use that communicator for a while, as illustrated in the sketch after this list;
• and you call MPI_Comm_free when you are done with it; this also sets the communicator variable
to MPI_COMM_NULL. A similar routine, MPI_Comm_disconnect, waits for all pending communication
to finish. Both are collective.
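As a minimal sketch of this create/use/free cycle (assuming procno holds the rank in the world communicator):
  MPI_Comm subcomm;
  int color = procno%2, key = procno;
  MPI_Comm_split( MPI_COMM_WORLD, color, key, &subcomm );
  // collectives on subcomm involve only the processes with the same color
  double localval = 1., sum;
  MPI_Allreduce( &localval, &sum, 1, MPI_DOUBLE, MPI_SUM, subcomm );
  MPI_Comm_free( &subcomm ); // subcomm is now MPI_COMM_NULL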
The ranking of processes in the new communicator is determined by a ‘key’ value: in a subcommunicator
the process with lowest key is given the lowest rank, et cetera. Most of the time, there is no reason to
use a relative ranking that is different from the global ranking, so the MPI_Comm_rank value of the global
communicator is a good choice. Any ties between identical key values are broken by using the rank from
the original communicator. Thus, specifying zero as the key will also retain the original process ordering.
Here is one example of communicator splitting. Suppose your processors are in a two-dimensional grid:
MPI_Comm_rank( MPI_COMM_WORLD, &mytid );
proc_i = mytid % proc_column_length;
proc_j = mytid / proc_column_length;
Because of the SPMD nature of the program, you are now doing in parallel a broadcast in every processor
column. Such operations often appear in dense linear algebra.
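A sketch of how such a column-wise broadcast could be set up (the buffer colvec, its length n, and the name column_comm are hypothetical):
  MPI_Comm column_comm;
  // all processes with the same column number proc_j end up in one communicator;
  // ordering by proc_i keeps the original ordering within the column
  MPI_Comm_split( MPI_COMM_WORLD, /* color: */ proc_j, /* key: */ proc_i, &column_comm );
  MPI_Bcast( colvec, n, MPI_DOUBLE, /* root: */ 0, column_comm );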
Exercise 7.1. Organize your processes in a grid, and make subcommunicators for the rows
and columns. For this compute the row and column number of each process.
In the row and column communicator, compute the rank. For instance, on a 2 × 3
processor grid you should find:
Global ranks: Ranks in row: Ranks in column:
0 1 2 0 1 2 0 0 0
3 4 5 0 1 2 1 1 1
Check that the rank in the row communicator is the column number, and the other
way around.
Run your code on different numbers of processes, for instance with a number of rows and
columns that is a power of 2, or that is a prime number.
(There is a skeleton for this exercise under the name procgrid.)
Python note 23: comm split key is optional. In Python, the ‘key’ argument is optional:
Code:
## commsplit.py
mydata = procid

# communicator modulo 2
color = procid%2
mod2comm = comm.Split(color)
procid2 = mod2comm.Get_rank()

# communicator modulo 4 recursively
color = procid2 % 2
mod4comm = mod2comm.Split(color)
procid4 = mod4comm.Get_rank()
Output:
Proc 0 -> 0 -> 0
Proc 2 -> 1 -> 0
Proc 6 -> 3 -> 1
Proc 4 -> 2 -> 1
Proc 3 -> 1 -> 0
Proc 7 -> 3 -> 1
Proc 1 -> 0 -> 0
Proc 5 -> 2 -> 1
MPL note 56: communicator splitting. In MPL, splitting a communicator is done with one of the overloads of
the communicator constructor:
// commsplit.cxx
// create sub communicator modulo 2
int color2 = procno % 2;
mpl::communicator comm2( mpl::communicator::split, comm_world, color2 );
auto procno2 = comm2.rank();
There is also a routine MPI_Comm_split_type which uses a type rather than a color to split the communicator.
We will see this in action in section 12.1.
As another example of communicator splitting, consider the recursive algorithm for matrix transposition.
Processors are organized in a square grid. The matrix is divided into 2 × 2 block form.
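A sketch of the first splitting step under stated assumptions (q is the side of the process grid, proc_i and proc_j are this process's grid coordinates, and quadrant_comm is a name made up here):
  // assign each process to one of the four quadrants of the grid
  int quadrant = 2*( proc_i >= q/2 ) + ( proc_j >= q/2 );
  MPI_Comm quadrant_comm;
  MPI_Comm_split( comm, /* color: */ quadrant, /* key: */ procno, &quadrant_comm );
  // the recursive algorithm would then continue on quadrant_comm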
Creating a new communicator from a group is collective on the old communicator. There is also a routine
MPI_Comm_create_group that only needs to be called on the group that constitutes the new communicator.
Certain MPI types, MPI_Win and MPI_File, are created on a communicator. While you cannot directly
extract that communicator from the object, you can get the group with MPI_Win_get_group and MPI_File_get_group.
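A sketch of the group-only variant mentioned above, assuming a group group_work such as the one formed in the example below:
  MPI_Comm comm_work;
  // only the processes in group_work need to make this call
  MPI_Comm_create_group( comm_world, group_work, /* tag: */ 0, &comm_work );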
7.5.2 Example
Suppose you want to split the world communicator into one manager process, with the remaining pro-
cesses workers.
// portapp.c
MPI_Comm comm_work;
{
MPI_Group group_world,group_work;
MPI_Comm_group( comm_world,&group_world );
int manager[] = {0};
MPI_Group_excl( group_world,1,manager,&group_work );
MPI_Comm_create( comm_world,group_work,&comm_work );
MPI_Group_free( &group_world ); MPI_Group_free( &group_work );
}
7.6 Intercommunicators
In several scenarios it may be desirable to have a way to communicate between communicators. For
instance, an application can have clearly functionally separated modules (preprocessor, simulation, post-
processor) that need to stream data pairwise. In another example, dynamically spawned processes (sec-
tion 8.1) get their own value of MPI_COMM_WORLD, but still need to communicate with the process(es) that
spawned them. In this section we will discuss the inter-communicator mechanism that serves such use
cases.
Communicating between disjoint communicators can of course be done by having a communicator that
overlaps them, but this would be complicated: since the ‘inter’ communication happens in the overlap
communicator, you have to translate its ordering into those of the two worker communicators. It would
be easier to express messages directly in terms of those communicators, and this is what happens in an
inter-communicator.
Setting up an inter-communicator involves the following:
• Two local communicators, which in this context are known as intra-communicators: one process
in each will act as the local leader, connected to the remote leader;
• The peer communicator, often MPI_COMM_WORLD, that contains the local communicators;
• An inter-communicator that allows the leaders of the subcommunicators to communicate with
the other subcommunicator.
Even though the intercommunicator connects only two processes, its creation is collective on the peer communicator.
Intercomm.Get_remote_size(self)
Spawned processes can find their parent communicator with MPI_Comm_get_parent (figure 7.9) (see exam-
ples in section 8.1). On other processes this returns MPI_COMM_NULL.
Test whether a communicator is intra or inter: MPI_Comm_test_inter (figure 7.10).
MPI_Comm_compare works for intercommunicators.
Processes connected through an intercommunicator can query the size of the ‘other’ communicator with
MPI_Comm_remote_size (figure 7.11). The actual group can be obtained with MPI_Comm_remote_group (fig-
ure 7.12).
Virtual topologies (chapter 11) cannot be created with an intercommunicator. To set up virtual topologies,
first transform the intercommunicator to an intracommunicator with the function MPI_Intercomm_merge
(figure 7.13).
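A minimal sketch of such a merge (intercomm is assumed to exist already; merged_comm is a name made up here; high determines which group is ordered last in the result):
  MPI_Comm merged_comm;
  int high = 0; // for instance: zero on one side, nonzero on the other
  MPI_Intercomm_merge( intercomm, high, &merged_comm );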
Intercomm.Get_remote_group(self)
class library {
private:
MPI_Comm comm;
int procno,nprocs,other;
MPI_Request request[2];
public:
library(MPI_Comm incomm) {
comm = incomm;
MPI_Comm_rank(comm,&procno);
other = 1-procno;
};
int communication_start();
int communication_end();
};
#include "globalinit.c"
library my_library(comm);
MPI_Isend(&sdata,1,MPI_INT,other,1,comm,&(request[0]));
my_library.communication_start();
MPI_Irecv(&rdata,1,MPI_INT,other,MPI_ANY_TAG,comm,&(request[1]));
MPI_Waitall(2,request,status);
my_library.communication_end();
if (status[1].MPI_TAG==2)
printf("wrong!\n");
MPI_Finalize();
return 0;
}
int library::communication_start() {
int sdata=6,rdata;
MPI_Isend(&sdata,1,MPI_INT,other,2,comm,&(request[0]));
MPI_Irecv(&rdata,1,MPI_INT,other,MPI_ANY_TAG,
comm,&(request[1]));
return 0;
}
int library::communication_end() {
MPI_Status status[2];
MPI_Waitall(2,request,status);
return 0;
}
class library {
private:
MPI_Comm comm;
int procno,nprocs,other;
MPI_Request request[2];
public:
library(MPI_Comm incomm) {
MPI_Comm_dup(incomm,&comm);
MPI_Comm_rank(comm,&procno);
other = 1-procno;
};
~library() {
MPI_Comm_free(&comm);
}
int communication_start();
int communication_end();
};
#include "globalinit.c"
library my_library(comm);
MPI_Isend(&sdata,1,MPI_INT,other,1,comm,&(request[0]));
my_library.communication_start();
MPI_Irecv(&rdata,1,MPI_INT,other,MPI_ANY_TAG,
comm,&(request[1]));
MPI_Waitall(2,request,status);
my_library.communication_end();
if (status[1].MPI_TAG==2)
printf("wrong!\n");
MPI_Finalize();
return 0;
}
int library::communication_start() {
int sdata=6,rdata, ierr;
MPI_Isend(&sdata,1,MPI_INT,other,2,comm,&(request[0]));
MPI_Irecv(&rdata,1,MPI_INT,other,MPI_ANY_TAG,comm,&(request[1]));
return 0;
}
int library::communication_end() {
MPI_Status status[2];
int ierr;
ierr = MPI_Waitall(2,request,status); CHK(ierr);
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
other = nprocs-procid-1
my_requests = [ None ] * 2
my_status = [ MPI.Status() ] * 2
sendbuffer = np.empty(1,dtype=int)
recvbuffer = np.empty(1,dtype=int)
class Library():
def __init__(self,comm):
# wrong: self.comm = comm
self.comm = comm.Dup()
self.other = self.comm.Get_size()-self.comm.Get_rank()-1
self.requests = [ None ] * 2
def __del__(self):
if self.comm.Get_rank()==0: print(".. freeing communicator")
self.comm.Free()
def communication_start(self):
sendbuf = np.empty(1,dtype=int); sendbuf[0] = 37
recvbuf = np.empty(1,dtype=int)
self.requests[0] = self.comm.Isend( sendbuf, dest=self.other,tag=2 )
self.requests[1] = self.comm.Irecv( recvbuf, source=self.other )
def communication_end(self):
MPI.Request.Waitall(self.requests)
mylibrary = Library(comm)
my_requests[0] = comm.Isend( sendbuffer,dest=other,tag=1 )
mylibrary.communication_start()
my_requests[1] = comm.Irecv( recvbuffer,source=other )
MPI.Request.Waitall(my_requests,my_status)
mylibrary.communication_end()
if my_status[1].Get_tag()==2:
print("Caught wrong message!")
int mod4ranks[nprocs];
comm_world.gather( 0, procno4,mod4ranks );
if (procno==0) {
cout << "Ranks mod 4:";
for (int ip=0; ip<nprocs; ip++)
cout << " " << mod4ranks[ip];
cout << endl;
}
if (procno/4!=procno4)
printf("Error %d %d\n",procno,procno4);
if (procno==0)
printf("Finished\n");
return 0;
}
#ifndef DEBUG
#define DEBUG 0
#endif
#include "globalinit.c"
if (nprocs<4) {
fprintf(stderr,"This program needs at least four processes\n");
return -1;
}
if (nprocs%2>0) {
fprintf(stderr,"This program needs an even number of processes\n");
return -1;
}
int color,colors=2;
MPI_Comm split_half_comm;
int
local_leader_in_inter_comm
= color==0 ? 2 : (sub_nprocs-2)
,
local_number_of_other_leader
= color==1 ? 2 : (sub_nprocs-2)
;
if (local_leader_in_inter_comm<0 || local_leader_in_inter_comm>=sub_nprocs) {
fprintf(stderr,
"[%d] invalid local member: %d\n",
procno,local_leader_in_inter_comm);
MPI_Abort(comm,2);
}
int
global_rank_of_other_leader =
1 + ( procno<nprocs/2 ? nprocs/2 : 0 )
;
int
i_am_local_leader = sub_procno==local_leader_in_inter_comm,
inter_tag = 314;
if (i_am_local_leader)
fprintf(stderr,"[%d] creating intercomm with %d\n",
procno,global_rank_of_other_leader);
MPI_Comm intercomm;
MPI_Intercomm_create
(/* local_comm: */ split_half_comm,
/* local_leader: */ local_leader_in_inter_comm,
/* peer_comm: */ MPI_COMM_WORLD,
/* remote_peer_rank: */ global_rank_of_other_leader,
/* tag: */ inter_tag,
/* newintercomm: */ &intercomm );
if (DEBUG) fprintf(stderr,"[%d] intercomm created.\n",procno);
if (i_am_local_leader) {
int inter_rank,inter_size;
MPI_Comm_size(intercomm,&inter_size);
MPI_Comm_rank(intercomm,&inter_rank);
if (DEBUG) fprintf(stderr,"[%d] inter rank/size: %d/%d\n",procno,inter_rank,inter_size);
}
double interdata=0.;
if (i_am_local_leader) {
if (color==0) {
interdata = 1.2;
int inter_target = local_number_of_other_leader;
printf("[%d] sending interdata %e to %d\n",
procno,interdata,inter_target);
MPI_Send(&interdata,1,MPI_DOUBLE,inter_target,0,intercomm);
} else {
MPI_Status status;
MPI_Recv(&interdata,1,MPI_DOUBLE,MPI_ANY_SOURCE,MPI_ANY_TAG,intercomm,&status);
}
}
if (procno==0)
fprintf(stderr,"Finished\n");
MPI_Finalize();
return 0;
}
In this course we have up to now only considered the SPMD model of running MPI programs. In some
rare cases you may want to run in an MPMD mode, rather than SPMD. This can be achieved either on the
OS level, using options of the mpiexec mechanism, or you can use MPI’s built-in process management.
Read on if you’re interested in the latter.
If this option is not supported, you can decide for yourself how many processes you want to spawn.
If you exceed the hardware resources, your multi-tasking operating system (which is some variant of
Unix for almost everyone) will use time-slicing to start the spawned processes, but you will not gain any
performance.
Here is an example of a work manager. First we query how much space we have for new processes:
int universe_size, *universe_size_attr,uflag;
MPI_Comm_get_attr
(comm_world,MPI_UNIVERSE_SIZE,
&universe_size_attr,&uflag);
universe_size = *universe_size_attr;
MPI.Intracomm.Spawn(self,
command, args=None, int maxprocs=1, Info info=INFO_NULL,
int root=0, errcodes=None)
returns an intracommunicator
8.1.2 MPMD
Instead of spawning a single executable, you can spawn multiple with MPI_Comm_spawn_multiple. In that
case a process can use the MPI_APPNUM attribute to find out which of the executables it is; see section 15.1.2.
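A sketch of querying that attribute (the flag test guards against implementations that do not set it):
  int *appnum_attr, flag, appnum = -1;
  MPI_Comm_get_attr( MPI_COMM_WORLD, MPI_APPNUM, &appnum_attr, &flag );
  if (flag) appnum = *appnum_attr;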
/*
* The workers collective connect over the inter communicator
*/
MPI_Comm intercomm;
MPI_Comm_connect( myport,MPI_INFO_NULL,0,comm_work,&intercomm );
if (work_p==0) {
int manage_n;
MPI_Comm_remote_size(intercomm,&manage_n);
printf("%d workers connected to %d managers\n",work_n,manage_n);
}
# Comm accept/connect
host accepted connection
4 workers connected to 1 managers
export I_MPI_HYDRA_NAMESERVER=`hostname`:8008
It is also possible to specify the name server as an argument to the job starter.
At the end of a run, the service should be unpublished with MPI_Unpublish_name (figure 8.6). Unpublishing
a nonexistent or already unpublished service gives an error code of MPI_ERR_SERVICE.
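A sketch of the teardown, assuming the same service_name and myport variables as in the publishing example later in this chapter:
  // withdraw the service, then close the port it advertised
  MPI_Unpublish_name( service_name, MPI_INFO_NULL, myport );
  MPI_Close_port( myport );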
MPI provides no guarantee of fairness in servicing connection attempts. That is, connection attempts are
not necessarily satisfied in the order in which they were initiated, and competition from other connection
attempts may prevent a particular connection attempt from being satisfied.
8.3 Sessions
The most common way of initializing MPI, with MPI_Init (or MPI_Init_thread) and MPI_Finalize, is known
as the world model. This model suffers from some disadvantages:
1. There is no error handling during MPI_Init.
2. If multiple libraries are active, they cannot initialize or finalize MPI, but have to base themselves
on subcommunicators; section 7.2.2.
3. A library can't even safely do
MPI_Initialized(&flag);
if (!flag) MPI_Init(0,0);
The following material is for the recently released MPI-4 standard and may not be supported yet.
In addition to the world, where all MPI is bracketed by MPI_Init (or MPI_Init_thread) and MPI_Finalize,
there is the session model, where entities such as libraries can start/end their MPI session independently.
The two models can be used in the same program, but there are limitations on how they can mix.
Other info keys can be implementation-dependent, but the key thread_support is pre-defined.
Info keys can be retrieved again with MPI_Session_get_info:
MPI_Info session_actual_info;
MPI_Session_get_info( the_session,&session_actual_info );
char thread_level[100]; int info_len = 100, flag;
MPI_Info_get_string( session_actual_info,
thread_key,&info_len,thread_level,&flag );
The following partial code creates a communicator equivalent to MPI_COMM_WORLD in the session model:
MPI_Group world_group = MPI_GROUP_NULL;
MPI_Comm world_comm = MPI_COMM_NULL;
MPI_Group_from_session_pset
( the_session,world_name,&world_group );
MPI_Comm_create_from_group
( world_group,"victor-code-session.c",
MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,
&world_comm );
MPI_Group_free( &world_group );
int procid = -1, nprocs = 0;
MPI_Comm_size(world_comm,&nprocs);
MPI_Comm_rank(world_comm,&procid);
However, comparing communicators (with MPI_Comm_compare) from the session and world model, or from
different sessions, is undefined behavior.
Get the info object (section 15.1.1) from a process set: MPI_Session_get_pset_info. This info object always
has the key mpi_size.
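A sketch of such a query (the process set name mpi://WORLD is the standard predefined one; the buffer length here is arbitrary):
  MPI_Info pset_info; int len=64, flag; char size_str[64];
  MPI_Session_get_pset_info( the_session, "mpi://WORLD", &pset_info );
  MPI_Info_get_string( pset_info, "mpi_size", &len, size_str, &flag );
  MPI_Info_free( &pset_info );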
8.3.4 Example
As an example of the use of sessions, we declare a library class, where each library object starts and ends
its own session:
// sessionlib.cxx
class Library {
private:
MPI_Comm world_comm; MPI_Session session;
public:
Library() {
MPI_Info info = MPI_INFO_NULL;
MPI_Session_init
( MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,&session );
char world_name[] = "mpi://WORLD";
MPI_Group world_group;
MPI_Group_from_session_pset
( session,world_name,&world_group );
MPI_Comm_create_from_group
( world_group,"world-session",
MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,
&world_comm );
MPI_Group_free( &world_group );
};
~Library() { MPI_Session_finalize(&session); };
Now we create a main program, using the world model, which activates two libraries, passing data to
them by parameter:
int main(int argc,char **argv) {
Library lib1,lib2;
MPI_Init(0,0);
MPI_Comm world = MPI_COMM_WORLD;
int procno,nprocs;
MPI_Comm_rank(world,&procno);
MPI_Comm_size(world,&nprocs);
auto sum1 = lib1.compute(procno);
auto sum2 = lib2.compute(procno+1);
Note that no MPI calls go between the main program and either of the libraries, or between the two
libraries, but that seems reasonable in this scenario.
End of MPI-4 material
#define ASSERT(p) if (!(p)) {printf("Assertion failed for proc %d at line %d\n",procno,__LINE__); return -1;}
#define ASSERTm(p,m) if (!(p)) {printf("Message<<%s>> for proc %d at line %d\n",m,procno,__LINE__); return -1;}
MPI_Comm comm;
int procno=-1,nprocs,err;
MPI_Init(&argc,&argv);
comm = MPI_COMM_WORLD;
MPI_Comm_rank(comm,&procno);
MPI_Comm_size(comm,&nprocs);
// MPI_Comm_set_errhandler(comm,MPI_ERRORS_RETURN);
/*
* To investigate process placement, get host name
*/
{
int namelen = MPI_MAX_PROCESSOR_NAME;
char procname[namelen];
MPI_Get_processor_name(procname,&namelen);
printf("[%d] manager process runs on <<%s>>\n",procno,procname);
}
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &manager_rank);
MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
(void*)&universe_sizep, &flag);
if (!flag) {
if (manager_rank==0) {
printf("This MPI does not support UNIVERSE_SIZE.\nHow many processes total?");
scanf("%d", &universe_size);
}
MPI_Bcast(&universe_size,1,MPI_INT,0,MPI_COMM_WORLD);
} else {
universe_size = *universe_sizep;
if (manager_rank==0)
printf("Universe size deduced as %d\n",universe_size);
}
ASSERTm(universe_size>world_size,"No room to start workers");
/*
* Now spawn the workers. Note that there is a run-time determination
* of what type of worker to spawn, and presumably this calculation must
* be done at run time and cannot be calculated before starting
* the program. If everything is known when the application is
* first started, it is generally better to start them all at once
* in a single MPI_COMM_WORLD.
*/
if (manager_rank==0)
printf("Now spawning %d workers\n",nworkers);
const char *worker_program = "spawnworker";
int errorcodes[nworkers];
MPI_Comm inter_to_workers; /* intercommunicator */
MPI_Comm_spawn(worker_program, MPI_ARGV_NULL, nworkers,
MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_to_workers,
errorcodes);
for (int ie=0; ie<nworkers; ie++)
if (errorcodes[ie]!=0)
printf("Error %d in spawning worker %d\n",errorcodes[ie],ie);
/*
* Parallel code here. The communicator "inter_to_workers" can be used
* to communicate with the spawned processes, which have ranks 0,..
* MPI_UNIVERSE_SIZE-1 in the remote group of the intercommunicator
* "inter_to_workers".
*/
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
try :
universe_size = comm.Get_attr(MPI.UNIVERSE_SIZE)
if universe_size is None:
print("Universe query returned None")
universe_size = nprocs + 4
else:
MPI_Init(&argc,&argv);
MPI_Comm
comm_world = MPI_COMM_WORLD,
comm_self = MPI_COMM_SELF,
comm_inter;
int world_p,world_n;
MPI_Comm_size(comm_world,&world_n);
MPI_Comm_rank(comm_world,&world_p);
MPI_Comm comm_parent;
MPI_Comm_get_parent(&comm_parent);
int is_child = (comm_parent!=MPI_COMM_NULL);
if (is_child) {
int nworkers,workerno;
MPI_Comm_size(MPI_COMM_WORLD,&nworkers);
MPI_Comm_rank(MPI_COMM_WORLD,&workerno);
printf("I detect I am worker %d/%d running on %s\n",
workerno,nworkers,procname);
int remotesize;
MPI_Comm_remote_size(comm_parent, &remotesize);
if (workerno==0) {
printf("Worker deduces %d workers and %d parents\n",nworkers,remotesize);
}
} else {
/*
* Detect how many workers we can spawn
*/
int universe_size, *universe_size_attr,uflag;
MPI_Comm_get_attr
(comm_world,MPI_UNIVERSE_SIZE,
&universe_size_attr,&uflag);
universe_size = *universe_size_attr;
if (!uflag) universe_size = world_n;
int work_n = universe_size - world_n;
if (world_p==0) {
printf("A universe of size %d leaves room for %d workers\n",
universe_size,work_n);
printf(".. spawning from %s\n",procname);
}
if (work_n<=0)
MPI_Abort(comm_world,1);
const char *workerprogram = "./spawnapp";
MPI_Comm_spawn(workerprogram,MPI_ARGV_NULL,
work_n,MPI_INFO_NULL,
0,comm_world,&comm_inter,NULL);
}
MPI_Finalize();
return 0;
}
#define ASSERT(p) if (!(p)) {printf("Assertion failed for proc %d at line %d\n",procno,__LINE__); return -1;}
#define ASSERTm(p,m) if (!(p)) {printf("Message<<%s>> for proc %d at line %d\n",m,procno,__LINE__); return -1;}
MPI_Comm comm;
int procno=-1,nprocs,err;
MPI_Init(&argc,&argv);
comm = MPI_COMM_WORLD;
MPI_Comm_rank(comm,&procno);
MPI_Comm_size(comm,&nprocs);
MPI_Comm_set_errhandler(comm,MPI_ERRORS_RETURN);
int remotesize,nworkers,workerno;
MPI_Comm parent;
MPI_Comm_size(MPI_COMM_WORLD,&nworkers);
MPI_Comm_rank(MPI_COMM_WORLD,&workerno);
MPI_Comm_get_parent(&parent);
ASSERTm(parent!=MPI_COMM_NULL,"No parent!");
/*
* To investigate process placement, get host name
*/
{
int namelen = MPI_MAX_PROCESSOR_NAME;
char procname[namelen];
MPI_Get_processor_name(procname,&namelen);
printf("[%d] worker process runs on <<%s>>\n",workerno,procname);
}
/*
* Parallel code here.
* The manager is represented as the process with rank 0 in (the remote
* group of) MPI_COMM_PARENT. If the workers need to communicate among
* themselves, they can use MPI_COMM_WORLD.
*/
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if procid==0:
print("#workers:",nprocs)
parentcomm = comm.Get_parent()
nparents = parentcomm.Get_remote_size()
print("#parents=",nparents)
MPI_Init(&argc,&argv);
MPI_Comm
comm_world = MPI_COMM_WORLD,
comm_self = MPI_COMM_SELF;
int world_p,world_n;
MPI_Comm_size(comm_world,&world_n);
MPI_Comm_rank(comm_world,&world_p);
/*
* Set up a communicator for all the worker ranks
*/
MPI_Comm comm_work;
{
MPI_Group group_world,group_work;
MPI_Comm_group( comm_world,&group_world );
int manager[] = {0};
MPI_Group_excl( group_world,1,manager,&group_work );
MPI_Comm_create( comm_world,group_work,&comm_work );
MPI_Group_free( &group_world ); MPI_Group_free( &group_work );
}
if (world_p==0) {
/*
* On world process zero open a port, and
* send its name to world process 1,
* which is zero in the worker comm.
*/
MPI_Comm intercomm;
char myport[MPI_MAX_PORT_NAME];
MPI_Open_port( MPI_INFO_NULL,myport );
int portlen = strlen(myport);
MPI_Send( myport,portlen+1,MPI_CHAR,1,0,comm_world );
printf("Host sent port <<%s>>\n",myport);
MPI_Comm_accept( myport,MPI_INFO_NULL,0,comm_self,&intercomm );
printf("host accepted connection\n");
/*
* After the workers have accepted the connection,
* we can talk over the inter communicator
*/
int work_n;
MPI_Comm_remote_size(intercomm,&work_n);
double work_data[work_n];
MPI_Send( work_data,work_n,MPI_DOUBLE,
/* to rank zero of worker comm */ 0,0,intercomm );
printf("Manager sent %d items over intercomm\n",work_n);
/*
* After we're done, close the port
*/
MPI_Close_port(myport);
} else {
int work_p,work_n;
MPI_Comm_size( comm_work,&work_n );
MPI_Comm_rank( comm_work,&work_p );
/*
* In the workers communicator, rank 0
* (which is 1 in the global)
* receives the port name and passes it on.
*/
char myport[MPI_MAX_PORT_NAME];
if (work_p==0) {
MPI_Recv( myport,MPI_MAX_PORT_NAME,MPI_CHAR,
MPI_ANY_SOURCE,0, comm_world,MPI_STATUS_IGNORE );
printf("Worker received port <<%s>>\n",myport);
}
MPI_Bcast( myport,MPI_MAX_PORT_NAME,MPI_CHAR,0,comm_work );
/*
* The workers collective connect over the inter communicator
*/
MPI_Comm intercomm;
MPI_Comm_connect( myport,MPI_INFO_NULL,0,comm_work,&intercomm );
if (work_p==0) {
int manage_n;
MPI_Comm_remote_size(intercomm,&manage_n);
printf("%d workers connected to %d managers\n",work_n,manage_n);
}
/*
* The local leader receives work from the manager
*/
if (work_p==0) {
double work_data[work_n];
MPI_Status work_status;
MPI_Recv( work_data,work_n,MPI_DOUBLE,
/* from rank zero of manager comm */ 0,0,intercomm,&work_status );
int work_count;
MPI_Get_count(&work_status,MPI_DOUBLE,&work_count);
printf("Worker zero received %d data items from manager\n",work_count);
}
/*
* After we're done, close the connection
*/
MPI_Close_port(myport);
}
MPI_Finalize();
return 0;
}
MPI_Init(&argc,&argv);
MPI_Comm
comm_world = MPI_COMM_WORLD,
comm_self = MPI_COMM_SELF;
int world_p,world_n;
MPI_Comm_size(comm_world,&world_n);
MPI_Comm_rank(comm_world,&world_p);
/*
* Set up a communicator for all the worker ranks
*/
MPI_Comm comm_work;
{
MPI_Group group_world,group_work;
MPI_Comm_group( comm_world,&group_world );
int manager[] = {0};
MPI_Group_excl( group_world,1,manager,&group_work );
MPI_Comm_create( comm_world,group_work,&comm_work );
MPI_Group_free( &group_world ); MPI_Group_free( &group_work );
}
char
service_name[] = "exampleservice";
if (world_p==0) {
/*
* On world process zero open a port, and
* send its name to world process 1,
* which is zero in the worker comm.
*/
MPI_Comm intercomm;
char myport[MPI_MAX_PORT_NAME];
MPI_Open_port( MPI_INFO_NULL,myport );
MPI_Publish_name( service_name, MPI_INFO_NULL, myport );
MPI_Comm_accept( myport,MPI_INFO_NULL,0,comm_self,&intercomm );
printf("Manager accepted connection on port <<%s>>\n",myport);
/*
* After the workers have accepted the connection,
* we can talk over the inter communicator
*/
int work_n;
MPI_Comm_remote_size(intercomm,&work_n);
double work_data[work_n];
MPI_Send( work_data,work_n,MPI_DOUBLE,
} else {
/*
* See if we can find the service
*/
char myport[MPI_MAX_PORT_NAME];
MPI_Lookup_name( service_name,MPI_INFO_NULL,myport );
/*
* The workers collectively connect over the intercommunicator
*/
MPI_Comm intercomm;
MPI_Comm_connect( myport,MPI_INFO_NULL,0,comm_work,&intercomm );
int work_p,work_n;
MPI_Comm_size( comm_work,&work_n );
MPI_Comm_rank( comm_work,&work_p );
if (work_p==0) {
int manage_n;
MPI_Comm_remote_size(intercomm,&manage_n);
printf("%d workers connected to %d managers\n",work_n,manage_n);
}
/*
* The local leader receives work from the manager
*/
if (work_p==0) {
double work_data[work_n];
MPI_Status work_status;
MPI_Recv( work_data,work_n,MPI_DOUBLE,
/* from rank zero of manager comm */ 0,0,intercomm,&work_status );
int work_count;
MPI_Get_count(&work_status,MPI_DOUBLE,&work_count);
printf("Worker zero received %d data items from manager\n",work_count);
}
/*
* After we're done, close the connection
*/
MPI_Close_port(myport);
}
MPI_Finalize();
return 0;
Above, you saw point-to-point operations of the two-sided type: they require the co-operation of a sender
and receiver. This co-operation could be loose: you can post a receive with MPI_ANY_SOURCE as sender, but
there had to be both a send and receive call. This two-sidedness can be limiting. Consider code where the
receiving process is a dynamic function of the data:
x = f();
p = hash(x);
MPI_Send( x, /* to: */ p );
The problem is now: how does p know to post a receive, and how does everyone else know not to?
In this section, you will see one-sided communication routines where a process can do a ‘put’ or ‘get’ op-
eration, writing data to or reading it from another processor, without that other processor’s involvement.
In one-sided MPI operations, known as Remote Memory Access (RMA) operations in the standard, or
as Remote Direct Memory Access (RDMA) in other literature, there are still two processes involved: the
origin, which is the process that originates the transfer, whether this is a ‘put’ or a ‘get’, and the target
whose memory is being accessed. Unlike with two-sided operations, the target does not perform an action
that is the counterpart of the action on the origin.
That does not mean that the origin can access arbitrary data on the target at arbitrary times. First of all,
one-sided communication in MPI is limited to accessing only a specifically declared memory area on the
target: the target declares an area of memory that is accessible to other processes. This is known as a
window. Windows limit how origin processes can access the target’s memory: you can only ‘get’ data
from a window or ‘put’ it into a window; all the other memory is not reachable from other processes. On
the origin there is no such limitation; any data can function as the source of a ‘put’ or the recipient of a
‘get’ operation.
The alternative to having windows is to use distributed shared memory or virtual shared memory: memory
is distributed, but acts as if it is shared. The so-called Partitioned Global Address Space (PGAS) languages
such as Unified Parallel C (UPC) use this model.
Within one-sided communication, MPI has two modes: active RMA and passive RMA. In active RMA, or
active target synchronization, the target sets boundaries on the time period (the ‘epoch’) during which its
window can be accessed. The main advantage of this mode is that the origin program can perform many
small transfers, which are aggregated behind the scenes. This would be appropriate for applications that
are structured in a Bulk Synchronous Parallel (BSP) mode with supersteps. Active RMA acts much like
asynchronous transfer with a concluding MPI_Waitall.
In passive RMA, or passive target synchronization, the target process puts no limitation on when its window
can be accessed. (PGAS languages such as UPC are based on this model: data is simply read or written at
will.) While intuitively it is attractive to be able to write to and read from a target at arbitrary time, there
are problems. For instance, it requires a remote agent on the target, which may interfere with execution
of the main thread, or conversely it may not be activated at the optimal time. Passive RMA is also very
hard to debug and can lead to race conditions.
9.1 Windows
In one-sided communication, each processor can make an area of memory, called a window, available to
one-sided transfers. This is stored in a variable of type MPI_Win. A process can put an arbitrary item from
its own memory (not limited to any window) to the window of another process, or get something from
the other process’ window in its own memory.
A window can be characterized as follows:
• The window is defined on a communicator, so the create call is collective; see figure 9.1.
• The window size can be set individually on each process. A zero size is allowed, but since win-
dow creation is collective, it is not possible to skip the create call.
• You can set a ‘displacement unit’ for the window: this is a number of bytes that will be used as
the indexing unit. For example if you use sizeof(double) as the displacement unit, an MPI_Put
to location 8 will go to the 8th double. That’s easier than having to specify the 64th byte.
• The window is the target of data in a put operation, or the source of data in a get operation; see
figure 9.2.
• There can be memory associated with a window, so it needs to be freed explicitly with MPI_Win_free.
The typical calls involved are:
MPI_Info info;
MPI_Win window;
Python note: in mpi4py the window creation call is
MPI.Win.Create
(memory, int disp_unit=1,
Info info=INFO_NULL, Intracomm comm=COMM_SELF)
Figure 9.2: Put and get between process memory and windows
There are three ways of creating a window:
1. Use MPI_Win_create to turn memory that you have already allocated into a window; this is
discussed first below.
2. Use MPI_Win_allocate (figure 9.2) to create the data and the window in one call.
If a communicator is on a shared memory (see section 12.1) you can create a window
in that shared memory with MPI_Win_allocate_shared. This will be useful for MPI shared
memory; see chapter 12.
3. Finally, you can create a window with MPI_Win_create_dynamic, which postpones the allocation;
see section 9.5.2.
First of all, MPI_Win_create creates a window from a pointer to memory. The data array must not be
PARAMETER or static const.
The size parameter is measured in bytes. In C it can be computed with the sizeof operator:
// putfencealloc.c
MPI_Win the_window;
int *window_data;
MPI_Win_allocate(2*sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,
&window_data,&the_window);
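By way of contrast with MPI_Win_allocate above, the following is a minimal sketch (not one of the book's example programs) of the same setup with MPI_Win_create, where the user allocates the memory; the communicator comm is assumed as in the surrounding examples.
// sketch: user-allocated memory turned into a window
int buffer[2];
MPI_Win user_window;
MPI_Win_create( buffer,
    /* size in bytes: */ 2*sizeof(int),
    /* displacement unit: */ sizeof(int),
    MPI_INFO_NULL,comm,&user_window);
/* ... epochs with put/get calls ... */
MPI_Win_free(&user_window);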
The routine MPI_Alloc_mem (figure 9.3) performs only the allocation part of MPI_Win_allocate, after which
you still need to call MPI_Win_create; see the sketch below.
• An error of MPI_ERR_NO_MEM indicates that no memory could be allocated.
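A minimal sketch of this two-step alternative to MPI_Win_allocate, again assuming the communicator comm of the surrounding examples:
// sketch: allocate with MPI_Alloc_mem, then create the window on that memory
double *buffer;
MPI_Win sep_window;
MPI_Alloc_mem( 10*sizeof(double),MPI_INFO_NULL,&buffer );
MPI_Win_create( buffer,10*sizeof(double),sizeof(double),
    MPI_INFO_NULL,comm,&sep_window );
/* ... use the window ... */
MPI_Win_free( &sep_window );
MPI_Free_mem( buffer );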
The following material is for the recently released MPI-4 standard and may not be supported yet.
• Allocated memory can be aligned by specifying an MPI_Info key of mpi_minimum_memory_alignment.
• If you really want the raw base pointer (as an integer), you can do any of these:
base, size, disp_unit = win.atts
base = win.Get_attr(MPI.WIN_BASE)
• You can use mpi4py’s builtin memoryview/buffer-like type, but I do not recommend it; it is much
better to use NumPy as above:
mem = win.tomemory() # type(mem) is MPI.memory, similar to memoryview, but quite limited in functionality
base = mem.address
size = mem.nbytes
In between the two fences the window is exposed, and while it is exposed you should not access it locally. If you
absolutely need to access it locally, you can use an RMA operation for that. Also, there can be only one
remote process that does a put; multiple accumulate accesses are allowed.
Fences are, together with other window calls, collective operations. That means they imply some amount
of synchronization between processes. Consider:
MPI_Win_fence( ... win ... ); // start an epoch
if (mytid==0) // do lots of work
MPI_Win_fence( ... win ... ); // end the epoch
and assume that all processes execute the first fence more or less at the same time. The zero process does
work before it can do the second fence call, but all other processes can call it immediately. However, they
can not finish that second fence call until all one-sided communication is finished, which means they wait
for the zero process.
Figure 9.3: A trace of a one-sided communication epoch where process zero only originates a one-sided
transfer
As a further restriction, you can not mix MPI_Get with MPI_Put or MPI_Accumulate calls in a single epoch.
Hence, we can characterize an epoch as an access epoch on the origin, and as an exposure epoch on the
target.
Assertions are an integer parameter: you can combine assertions by adding them or using logical-or. The
value zero is always correct. For further information, see section 9.6.
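As an illustration, here is a sketch (not the example originally given at this point) of assertions on the opening and closing fence of an epoch; the_window is assumed to exist.
MPI_Win_fence( MPI_MODE_NOPRECEDE,the_window );   // no RMA calls precede this fence
/* ... put and accumulate calls ... */
MPI_Win_fence( MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED,the_window ); // no local stores; no RMA follows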
In other words, the MPI_Win_post and MPI_Win_wait calls turn your window into the target for a remote access;
there is a non-blocking version MPI_Win_test of MPI_Win_wait.
The MPI_Win_start and MPI_Win_complete calls, on the other hand, border the access to a remote window,
with the current process being the origin of the remote access.
In the following snippet a single processor puts data on one other. Note that they both have their own
definition of the group, and that the receiving process only does the post and wait calls.
// postwaitwin.c
MPI_Comm_group(comm,&all_group);
if (procno==origin) {
MPI_Group_incl(all_group,1,&target,&two_group);
// access
MPI_Win_start(two_group,0,the_window);
MPI_Put( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ target,0, 1,MPI_INT,
the_window);
MPI_Win_complete(the_window);
}
if (procno==target) {
MPI_Group_incl(all_group,1,&origin,&two_group);
// exposure
MPI_Win_post(two_group,0,the_window);
MPI_Win_wait(the_window);
}
9.3.1 Put
The MPI_Put (figure 9.5) call can be considered as a one-sided send. As such, it needs to specify
• the target rank,
• the data to be sent from the origin, and
• the location in the target window where the data is to be placed.
Here is a single put operation. Note that the window create and window fence calls are collective, so they
have to be performed on all processors of the communicator that was used in the create call.
// putfence.c
MPI_Win the_window;
MPI_Win_create
(&window_data,2*sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,&the_window);
MPI_Win_fence(0,the_window);
if (procno==0) {
MPI_Put
( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ other,1, 1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);
MPI_Win_free(&the_window);
!! putfence.F90
integer(kind=MPI_ADDRESS_KIND) :: target_displacement
target_displacement = 1
call MPI_Put( my_number, 1,MPI_INTEGER, &
other,target_displacement, &
1,MPI_INTEGER, &
the_window)
9.3.2 Get
The MPI_Get (figure 9.6) call is very similar.
Example:
MPI_Win_fence(0,the_window);
if (procno==0) {
MPI_Get( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ other,1, 1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);
The MPI_MODE_NOPRECEDE and MPI_MODE_NOSUCCEED assertions still hold, but the Get operation implies that
instead of MPI_MODE_NOSTORE in the second fence, we use MPI_MODE_NOPUT in the first.
9.3.4 Accumulate
A third one-sided routine is MPI_Accumulate (figure 9.7) which does a reduction operation on the results
that are being put.
Accumulate is an atomic reduction with remote result. This means that multiple accumulates to a single
target in the same epoch give the correct result. As with MPI_Reduce, the order in which the operands are
accumulated is undefined.
The same predefined operators are available, but no user-defined ones. There is one extra operator:
MPI_REPLACE; it has the effect that only the last result to arrive is retained.
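For instance, reusing the window and ranks of the put example above, every process could add its rank into the first location of the target's window (a sketch, not one of the book's listings):
int my_contribution = procno;
MPI_Win_fence(0,the_window);
MPI_Accumulate
  ( /* data on origin: */ &my_contribution, 1,MPI_INT,
    /* data on target: */ other,0, 1,MPI_INT,
    MPI_SUM, the_window);
MPI_Win_fence(0,the_window);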
Exercise 9.3. Implement an ‘all-gather’ operation using one-sided communication: each
processor stores a single number, and you want each processor to build up an array
that contains the values from all processors. Note that you do not need a special case
for a processor collecting its own value: doing ‘communication’ between a processor
and itself is perfectly legal.
Exercise 9.4.
Implement a shared counter:
• One process maintains a counter;
• Iterate: all others at random moments update this counter.
• When the counter is no longer positive, everyone stops iterating.
The problem here is data synchronization: does everyone see the counter the same
way?
The problem is that reading and updating the pointer is not an atomic operation, so it is possible that
multiple processes get hold of the same value; conversely, multiple updates of the pointer may lead to
work descriptors being skipped. This dependence of the overall behavior on the precise timing of lower-level
events is called a race condition.
In MPI-3 some atomic routines have been added. Both MPI_Fetch_and_op (figure 9.9) and MPI_Get_accumulate
(figure 9.10) atomically retrieve data from the window indicated, and apply an operator, combining the
data on the target with the data on the origin. Unlike Put and Get, it is safe to have multiple atomic
operations in the same epoch.
Both routines perform the same operations: return data before the operation, then atomically update data
on the target, but MPI_Get_accumulate is more flexible in data type handling. The simpler routine,
MPI_Fetch_and_op, which operates on only a single element, allows for faster implementations, in particular
through hardware support.
Use of MPI_NO_OP as the MPI_Op turns these routines into an atomic Get. Similarly, using MPI_REPLACE turns
them into an atomic Put.
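For instance, a process could atomically read the shared counter without modifying it; this is a sketch, with the_window and counter_process assumed as in the counter examples below.
int dummy=0, counter_value;
MPI_Win_lock(MPI_LOCK_SHARED,counter_process,0,the_window);
MPI_Fetch_and_op
  ( &dummy,&counter_value,MPI_INT,
    counter_process,0, MPI_NO_OP, the_window);
MPI_Win_unlock(counter_process,the_window);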
Exercise 9.5. Redo exercise 9.4 using MPI_Fetch_and_op. The problem is again to make sure
all processes have the same view of the shared counter.
Does it work to make the fetch-and-op conditional? Is there a way to do it
unconditionally? What should the ‘break’ test be, seeing that multiple processes can
update the counter at the same time?
Example. A root process has a table of data; the other processes do atomic gets and updates of that data
using passive target synchronization through MPI_Win_lock.
// passive.cxx
if (procno==repository) {
// Repository processor creates a table of inputs
// and associates that with the window
}
if (procno!=repository) {
float contribution=(float)procno,table_element;
int loc=0;
MPI_Win_lock(MPI_LOCK_EXCLUSIVE,repository,0,the_window);
// read the table element by getting the result from adding zero
MPI_Fetch_and_op
(&contribution,&table_element,MPI_FLOAT,
repository,loc,MPI_SUM,the_window);
MPI_Win_unlock(repository,the_window);
}
MPI.Win.Lock(self,
int rank, int lock_type=LOCK_EXCLUSIVE, int assertion=0)
MPI_Win_fence(0,the_window);
int
counter_value;
if (i_am_available) {
int
decrement = -1;
total_decrement++;
MPI_Fetch_and_op
( /* operate with data from origin: */ &decrement,
/* retrieve data from target: */ &counter_value,
MPI_INT, counter_process, 0, MPI_SUM,
the_window);
}
MPI_Win_fence(0,the_window);
if (i_am_available) {
my_counter_values[n_my_counter_values++] = counter_value;
}
Remark 17 The possibility to lock a window is only guaranteed for windows whose memory was allocated,
possibly internally, by MPI_Alloc_mem; in practice that means all creation calls except MPI_Win_create on ordinary user memory.
• while the remaining process spins until the others have performed their
update.
Use an atomic operation for the latter process to read out the shared value.
Can you replace the exclusive lock with a shared one?
(There is a skeleton for this exercise under the name lockfetch.)
Exercise 9.8. As exercise 9.7, but now use a shared lock: all processes acquire the lock
simultaneously and keep it as long as is needed.
The problem here is that coherence between window buffers and local variables is
now not forced by a fence or releasing a lock. Use MPI_Win_flush_local to force
coherence of a window (on another process) and the local variable from
MPI_Fetch_and_op.
(There is a skeleton for this exercise under the name lockfetchshared.)
Completion of the RMA operations in a passive target epoch is ensured with MPI_Win_unlock or MPI_Win_unlock_all,
similar to the use of MPI_Win_fence in active target synchronization.
If the passive target epoch is of greater duration, and no unlock operation is used to ensure completion,
the following calls are available.
Remark 18 If you use flush routines with active target synchronization (or generally outside a passive target
epoch), you are likely to get an error message such as
Wrong synchronization of RMA calls
With MPI_Win_flush_local_all local operations are concluded for all targets. This will typically be used
with MPI_Win_lock_all (section 9.4.2).
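A sketch of such a long-lived passive epoch, assuming a window the_window, a target rank tgt, and an integer my_data:
MPI_Win_lock_all(0,the_window);
MPI_Put( &my_data,1,MPI_INT, tgt,0, 1,MPI_INT, the_window);
MPI_Win_flush(tgt,the_window);   // complete the put without ending the epoch
/* ... more one-sided traffic ... */
MPI_Win_unlock_all(the_window);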
At first sight, the code looks like splitting up an MPI_Win_create call into separate creation of the window
and declaration of the buffer:
// windynamic.c
MPI_Win_create_dynamic(MPI_INFO_NULL,comm,&the_window);
if (procno==data_proc) {
  window_buffer = (int*) malloc( 2*sizeof(int) );
  MPI_Win_attach(the_window,window_buffer,2*sizeof(int));
}
• MPI_WIN_SIZE and MPI_WIN_DISP_UNIT for obtaining the size and window displacement unit:
MPI_Aint *size;
MPI_Win_get_attr(win, MPI_WIN_SIZE, &size, &flag);
int *disp_unit;
MPI_Win_get_attr(win, MPI_WIN_DISP_UNIT, &disp_unit, &flag);
• MPI_WIN_MODEL for querying the window memory model; see section 9.5.1.
Get the group of processes (see section 7.5) associated with a window:
int MPI_Win_get_group(MPI_Win win, MPI_Group *group)
Window information objects (see section 15.1.1) can be set and retrieved:
int MPI_Win_set_info(MPI_Win win, MPI_Info info)
9.6 Assertions
The routines
• (Active target synchronization) MPI_Win_fence, MPI_Win_post, MPI_Win_start;
• (Passive target synchronization) MPI_Win_lock, MPI_Win_lock_all,
take an argument through which assertions can be passed about the activity before, after, and during the
epoch. The value zero is always allowed, but you can make your program more efficient by specifying one
or more of the following, combined by bitwise OR in C/C++ or IOR in Fortran.
• MPI_Win_start supports the option:
– MPI_MODE_NOCHECK the matching calls to MPI_Win_post have already completed on all target
processes when the call to MPI_Win_start is made. The nocheck option can be specified in
a start call if and only if it is specified in each matching post call. This is similar to the
optimization of “ready-send” that may save a handshake when the handshake is implicit
in the code. (However, ready-send is matched by a regular receive, whereas both start and
post must specify the nocheck option.)
• MPI_Win_post supports the following options:
– MPI_MODE_NOCHECK the matching calls to MPI_Win_start have not yet occurred on any origin
processes when the call to MPI_Win_post is made. The nocheck option can be specified by
a post call if and only if it is specified by each matching start call.
– MPI_MODE_NOSTORE the local window was not updated by local stores (or local get or receive
calls) since last synchronization. This may avoid the need for cache synchronization at the
post call.
– MPI_MODE_NOPUT the local window will not be updated by put or accumulate calls after the
post call, until the ensuing (wait) synchronization. This may avoid the need for cache
synchronization at the wait call.
• MPI_Win_fence supports the following options:
– MPI_MODE_NOSTORE the local window was not updated by local stores (or local get or receive
calls) since last synchronization.
– MPI_MODE_NOPUT the local window will not be updated by put or accumulate calls after the
fence call, until the ensuing (fence) synchronization.
– MPI_MODE_NOPRECEDE the fence does not complete any sequence of locally issued RMA calls.
If this assertion is given by any process in the window group, then it must be given by all
processes in the group.
– MPI_MODE_NOSUCCEED the fence does not start any sequence of locally issued RMA calls. If
the assertion is given by any process in the window group, then it must be given by all
processes in the group.
• MPI_Win_lock and MPI_Win_lock_all support the following option:
– MPI_MODE_NOCHECK no other process holds, or will attempt to acquire a conflicting lock, while
the caller holds the window lock. This is useful when mutual exclusion is achieved by other
means, but the coherence operations that may be attached to the lock and unlock calls are
still required.
9.7 Implementation
You may wonder how one-sided communication is realized. Can a processor somehow get at another
processor’s data? Unfortunately, no.
Active target synchronization is implemented in terms of two-sided communication. Imagine that the
first fence operation does nothing, unless it concludes prior one-sided operations. The Put and Get calls
do nothing involving communication, except for marking with what processors they exchange data. The
concluding fence is where everything happens: first a global operation determines which targets need to
issue send or receive calls, then the actual sends and receives are executed.
Exercise 9.9. Assume that only Get operations are performed during an epoch. Sketch how
these are translated to send/receive pairs. The problem here is how the senders find
out that they need to send. Show that you can solve this with an MPI_Reduce_scatter
call.
The previous paragraph noted that a collective operation was necessary to determine the two-sided traffic.
Since collective operations induce some amount of synchronization, you may want to limit this.
Exercise 9.10. Argue that the mechanism with window post/wait/start/complete operations
still needs a collective, but that this is less burdensome.
Passive target synchronization needs another mechanism entirely. Here the target process needs to have
a background task (process, thread, daemon,…) running that listens for requests to lock the window. This
can potentially be expensive.
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
intsize = np.dtype('int').itemsize
window_data = np.zeros(2,dtype=int)
win = MPI.Win.Create(window_data,intsize,comm=comm)
my_number = np.empty(1,dtype=int)
src = 0; tgt = nprocs-1
if procid==src:
my_number[0] = 37
else:
my_number[0] = 1
win.Fence()
if procid==src:
# put data in the second element of the window
win.Put(my_number,tgt,target=1)
win.Fence()
if procid==tgt:
print("Window after put:",window_data)
#include "globalinit.c"
{
MPI_Win the_window;
MPI_Group all_group,two_group;
int my_number = 37, other_number,
twotids[2],origin=0,target=nprocs-1;
MPI_Win_create(&other_number,1,sizeof(int),
MPI_INFO_NULL,comm,&the_window);
if (procno>0 && procno<nprocs-1) goto skip;
MPI_Comm_group(comm,&all_group);
if (procno==origin) {
MPI_Group_incl(all_group,1,&target,&two_group);
// access
MPI_Win_start(two_group,0,the_window);
MPI_Put( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ target,0, 1,MPI_INT,
the_window);
MPI_Win_complete(the_window);
}
if (procno==target) {
MPI_Group_incl(all_group,1,&origin,&two_group);
// exposure
MPI_Win_post(two_group,0,the_window);
MPI_Win_wait(the_window);
}
if (procno==target)
printf("Got the following: %d\n",other_number);
MPI_Group_free(&all_group);
MPI_Group_free(&two_group);
skip:
MPI_Win_free(&the_window);
}
MPI_Finalize();
return 0;
}
#include "globalinit.c"
MPI_Win the_window;
int my_number=0, window_data[2], other = nprocs-1;
if (procno==0)
my_number = 37;
MPI_Win_create
(&window_data,2*sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,&the_window);
MPI_Win_fence(0,the_window);
if (procno==0) {
MPI_Put
( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ other,1, 1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);
if (procno==other)
printf("I got the following: %d\n",window_data[1]);
MPI_Win_free(&the_window);
MPI_Finalize();
return 0;
}
use mpi_f08
implicit none
Type(MPI_Win) :: the_window
integer :: window_elt_size
integer(kind=MPI_ADDRESS_KIND) :: window_size
integer(kind=MPI_ADDRESS_KIND) :: target_displacement
integer :: my_number=0, window_data(2), other
Type(MPI_Comm) :: comm;
integer :: mytid,ntids,i,p,err;
call MPI_Init()
comm = MPI_COMM_WORLD
call MPI_Comm_rank(comm,mytid)
call MPI_Comm_size(comm,ntids)
call MPI_Comm_set_errhandler(comm,MPI_ERRORS_RETURN)
other = ntids-1
if (mytid.eq.0) my_number = 37
call MPI_Sizeof(window_data,window_elt_size)
window_size = 2*window_elt_size
call MPI_Win_create(window_data,&
window_size,window_elt_size, & ! window size, unit size
MPI_INFO_NULL,comm,the_window)
call MPI_Win_fence(0,the_window)
if (mytid.eq.0) then
#ifndef F90STYLE
target_displacement = 1
call MPI_Put( my_number, 1,MPI_INTEGER, &
other,target_displacement, &
1,MPI_INTEGER, &
the_window)
#else
call MPI_Put( my_number, 1,MPI_INTEGER, &
other,1, &
1,MPI_INTEGER, &
the_window)
#endif
endif
call MPI_Win_fence(0,the_window)
if (mytid.eq.other) then
print *,"I got:",window_data(1+target_displacement)
end if
call MPI_Win_free(the_window)
call MPI_Finalize(err);
#include <mpi.h>
#include "window.c"
#include "globalinit.c"
MPI_Win the_window;
int my_number, other = nprocs-1, *number_buffer;
MPI_Alloc_mem( 2*sizeof(int),
  MPI_INFO_NULL,&number_buffer);
MPI_Win_create
( number_buffer,2*sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,&the_window);
if (procno==other)
number_buffer[1] = 27;
test_window(the_window,comm);
MPI_Win_fence(0,the_window);
if (procno==0) {
MPI_Get( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ other,1, 1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);
if (procno==0)
printf("I got the following: %d\n",my_number);
MPI_Win_free(&the_window);
MPI_Free_mem(number_buffer);
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
other = nprocs-1-procid
mydata = random.random()
if procid==0 or procid==nprocs-1:
win_mem = np.empty( 1,dtype=np.float64 )
win = MPI.Win.Create( win_mem,comm=comm )
else:
win = MPI.Win.Create( None,comm=comm )
win.Free()
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<2:
print("C'mon, get real....")
sys.exit(1)
if procid==repository:
# repository process creates a table of inputs
# and associates it with the window
win_mem = np.empty( ninputs,dtype=np.float32 )
win = MPI.Win.Create( win_mem,comm=comm )
else:
# everyone else has an empty window
win = MPI.Win.Create( None,comm=comm )
if procid!=repository:
contribution = np.empty( 1,dtype=np.float32 )
contribution[0] = 1.*procid
table_element = np.empty( 1,dtype=np.float32 )
win.Lock( repository,lock_type=MPI.LOCK_EXCLUSIVE )
win.Fetch_and_op( contribution,table_element,repository,0,MPI.SUM)
win.Unlock( repository )
print(procid,"added its contribution to partial sum",table_element[0])
win.Free()
if procid==repository:
if abs(win_mem[0]-checksum)>1.e-12:
print("Incorrect result %e s/b %e" % (win_mem[0],checksum))
print("finished")
#include <mpi.h>
#include "gather_sort_print.h"
int nprocs,procno;
MPI_Init(0,0);
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procno);
if (nprocs<2) {
printf("Need at least 2 procs\n");
MPI_Abort(comm,0);
}
{
/*
* Create a window.
* We only need a nonzero size on the last process,
* which we label the `counter_process';
* everyone else makes a window of size zero.
*/
MPI_Win the_window;
int counter_process = nprocs-1;
int window_data,check_data;
if (procno==counter_process) {
window_data = 2*nprocs-1;
check_data = window_data;
MPI_Win_create(&window_data,sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,&the_window);
} else {
MPI_Win_create(&window_data,0,sizeof(int),
MPI_INFO_NULL,comm,&the_window);
}
/*
* Allocate an array (grossly over-dimensioned)
* for the counter values that belong to me
*/
int *my_counter_values = (int*) malloc( counter_init * sizeof(int) );
if (!my_counter_values) {
printf("[%d] could not allocate counter values\n",procno);
MPI_Abort(comm,0);
}
int n_my_counter_values = 0;
/*
* Loop:
* - at random times update the counter on the counter process
* - and read out the counter to see if we stop
*/
int total_decrement = 0;
int nsteps = PROCWRITES / COLLISION;
if (procno==0)
printf("Doing %d steps, counter starting: %d\n probably %d-way collision on each step\n",
nsteps,counter_init,COLLISION);
for (int step=0; step<nsteps ; step++) {
/*
* Basic probability of a write is 1/P,
* so each step only one proc will write.
* Increase chance of collision by upping
* the value of COLLISION.
*/
float randomfraction = (rand() / (double)RAND_MAX);
int i_am_available = randomfraction < ( COLLISION * 1./nprocs );
/*
* Exercise:
* - atomically read and decrement the counter
*/
MPI_Win_fence(0,the_window);
int
counter_value;
if (i_am_available) {
int
decrement = -1;
total_decrement++;
MPI_Fetch_and_op
( /* operate with data from origin: */ &decrement,
/* retrieve data from target: */ &counter_value,
MPI_INT, counter_process, 0, MPI_SUM,
the_window);
#ifdef DEBUG
printf("[%d] updating in step %d; retrieved %d\n",procno,step,counter_value);
#endif
}
MPI_Win_fence(0,the_window);
if (i_am_available) {
my_counter_values[n_my_counter_values++] = counter_value;
}
}
/*
* What counter values were actually obtained?
*/
gather_sort_print( my_counter_values,n_my_counter_values, comm );
/*
* We do a correctness test by computing what the
* window_data is supposed to be
*/
{
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( /* origin data to set: */ &counter_value,1,MPI_INT,
/* window data to get: */ counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
MPI_Allreduce(MPI_IN_PLACE,&total_decrement,1,MPI_INT,MPI_SUM,comm);
if (procno==counter_process) {
if (counter_init-total_decrement==counter_value)
printf("[%d] initial counter %d decreased by %d correctly giving %d\n",
procno,counter_init,total_decrement,counter_value);
else
printf("[%d] initial counter %d decreased by %d, giving %d s/b %d\n",
procno,counter_init,total_decrement,counter_value,counter_init-total_decrement);
}
}
MPI_Win_free(&the_window);
}
MPI_Finalize();
return 0;
}
#include <mpi.h>
#include "gather_sort_print.h"
int nprocs,procno;
MPI_Init(0,0);
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procno);
if (nprocs<2) {
printf("Need at least 2 procs\n");
MPI_Abort(comm,0);
}
{
/*
* Create a window.
* We only need a nonzero size on the last process,
* which we label the `counter_process';
* everyone else makes a window of size zero.
*/
MPI_Win the_window;
int counter_process = nprocs-1;
int window_data,check_data;
if (procno==counter_process) {
window_data = 2*nprocs-1;
check_data = window_data;
MPI_Win_create(&window_data,sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,&the_window);
} else {
MPI_Win_create(&window_data,0,sizeof(int),
MPI_INFO_NULL,comm,&the_window);
}
/*
* Initialize the window
* - PROCWRITES is approx the number of writes we want each process to do
* - COLLISION is approx how many processes will collide on a write
*/
#ifndef COLLISION
#define COLLISION 1
#endif
#ifndef PROCWRITES
#define PROCWRITES 10
#endif
int counter_init = nprocs * PROCWRITES;
MPI_Win_fence(0,the_window);
if (procno==counter_process)
MPI_Put(&counter_init,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
/*
* Allocate an array (grossly over-dimensioned)
* for the counter values that belong to me
*/
int *my_counter_values = (int*) malloc( counter_init * sizeof(int) );
if (!my_counter_values) {
printf("[%d] could not allocate counter values\n",procno);
MPI_Abort(comm,0);
}
int n_my_counter_values = 0;
/*
* Loop:
* - at random times update the counter on the counter process
* - and read out the counter to see if we stop
*/
int total_decrement = 0;
int nsteps = PROCWRITES / COLLISION;
if (procno==0)
printf("Doing %d steps, %d writes per proc,\n.. probably %d-way collision on each step\n",
nsteps,PROCWRITES,COLLISION);
for (int step=0; step<nsteps ; step++) {
/*
* Basic probability of a write is 1/P,
* so each step only one proc will write.
* Increase chance of collision by upping
* the value of COLLISION.
*/
float randomfraction = (rand() / (double)RAND_MAX);
int i_am_available = randomfraction < ( COLLISION * .8/nprocs );
/*
* Exercise:
* - decrement the counter by Get, compute new value, Put
*/
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
if (i_am_available) {
#ifdef DEBUG
printf("[%d] obtaining value %d in step %d\n",
procno,counter_value,step);
#endif
my_counter_values[ n_my_counter_values++ ] = counter_value;
total_decrement++;
int decrement = -1;
counter_value += decrement;
MPI_Put
( &counter_value, 1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);
}
/*
* What counter values were actually obtained?
*/
gather_sort_print( my_counter_values,n_my_counter_values, comm );
/*
* We do a correctness test by computing what the
* window_data is supposed to be
*/
{
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( /* origin data to set: */ &counter_value,1,MPI_INT,
/* window data to get: */ counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
MPI_Allreduce(MPI_IN_PLACE,&total_decrement,1,MPI_INT,MPI_SUM,comm);
if (procno==counter_process) {
if (counter_init-total_decrement==counter_value)
printf("[%d] initial counter %d decreased by %d correctly giving %d\n",
procno,counter_init,total_decrement,counter_value);
else
printf("[%d] initial counter %d decreased by %d, giving %d s/b %d\n",
procno,counter_init,total_decrement,counter_value,counter_init-total_decrement);
}
}
MPI_Win_free(&the_window);
}
MPI_Finalize();
return 0;
}
#include <mpi.h>
#include "gather_sort_print.h"
#include "globalinit.c"
if (nprocs<2) {
printf("Need at least 2 procs\n");
MPI_Abort(comm,0);
}
{
/*
* Create a window.
* We only need a nonzero size on the last process,
* which we label the `counter_process';
* everyone else makes a window of size zero.
*/
MPI_Win the_window;
int counter_process = nprocs-1;
int window_data,check_data;
if (procno==counter_process) {
window_data = 2*nprocs-1;
check_data = window_data;
MPI_Win_create(&window_data,sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,&the_window);
} else {
MPI_Win_create(&window_data,0,sizeof(int),
MPI_INFO_NULL,comm,&the_window);
}
/*
* Initialize the window
* - PROCWRITES is approx the number of writes we want each process to do
* - COLLISION is approx how many processes will collide on a write
*/
#ifndef COLLISION
#define COLLISION 2
#endif
#ifndef PROCWRITES
#define PROCWRITES 40
#endif
int counter_init = nprocs * PROCWRITES;
MPI_Win_fence(0,the_window);
if (procno==counter_process)
MPI_Put(&counter_init,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
/*
* Allocate an array (grossly over-dimensioned)
* for the counter values that belong to me
*/
int *my_counter_values = (int*) malloc( counter_init * sizeof(int) );
if (!my_counter_values) {
printf("[%d] could not allocate counter values\n",procno);
MPI_Abort(comm,0);
}
int n_my_counter_values = 0;
/*
* Loop:
* - at random times update the counter on the counter process
* - and read out the counter to see if we stop
*/
int total_decrement = 0;
int nsteps = PROCWRITES / COLLISION;
if (procno==0)
printf("Doing %d steps, counter starting: %d\n probably %d-way collision on each step\n",
nsteps,counter_init,COLLISION);
for (int step=0; step<nsteps ; step++) {
/*
* Basic probability of a write is 1/P,
* so each step only one proc will write.
* Increase chance of collision by upping
* the value of COLLISION.
*/
float randomfraction = (rand() / (double)RAND_MAX);
int i_am_available = randomfraction < ( COLLISION * 1./nprocs );
/*
* Exercise:
* - decrement the counter by Get, compute new value, Accumulate
*/
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
if (i_am_available) {
#ifdef DEBUG
printf("[%d] updating in step %d\n",procno,step);
#endif
my_counter_values[n_my_counter_values++] = counter_value;
total_decrement++;
int decrement = -1;
MPI_Accumulate
( &decrement, 1,MPI_INT,
counter_process,0,1,MPI_INT,
MPI_SUM,
the_window);
}
MPI_Win_fence(0,the_window);
}
/*
* What counter values were actually obtained?
*/
gather_sort_print( my_counter_values,n_my_counter_values, comm );
/*
* We do a correctness test by computing what the
* window_data is supposed to be
*/
{
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( /* origin data to set: */ &counter_value,1,MPI_INT,
/* window data to get: */ counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
MPI_Allreduce(MPI_IN_PLACE,&total_decrement,1,MPI_INT,MPI_SUM,comm);
if (procno==counter_process) {
if (counter_init-total_decrement==counter_value)
printf("[%d] initial counter %d decreased by %d correctly giving %d\n",
procno,counter_init,total_decrement,counter_value);
else
printf("[%d] initial counter %d decreased by %d, giving %d s/b %d\n",
procno,counter_init,total_decrement,counter_value,counter_init-total_decrement);
}
}
MPI_Win_free(&the_window);
}
MPI_Finalize();
return 0;
}
#include <mpi.h>
#include "window.c"
#include "globalinit.c"
{
MPI_Win the_window;
int origin=0, data_proc = nprocs-1;
int *retrieve=NULL,*window_buffer=NULL;
MPI_Win_create_dynamic(MPI_INFO_NULL,comm,&the_window);
if (procno==data_proc) {
  window_buffer = (int*) malloc( 2*sizeof(int) );
  MPI_Win_attach(the_window,window_buffer,2*sizeof(int));
}
test_window( the_window,comm );
if (procno==data_proc) {
window_buffer[0] = 1;
window_buffer[1] = 27;
}
if (procno==origin) {
retrieve = (int*) malloc( sizeof(int) );
}
MPI_Aint data_address;
if (procno==data_proc) {
MPI_Get_address(window_buffer,&data_address);
}
MPI_Bcast(&data_address,1,MPI_AINT,data_proc,comm);
MPI_Win_fence(0,the_window);
if (procno==origin) {
MPI_Aint disp = data_address+1*sizeof(int);
MPI_Get( /* data on origin: */ retrieve, 1,MPI_INT,
/* data on target: */ data_proc,disp, 1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);
if (procno==origin)
printf("I got the following: %d\n",retrieve[0]);
MPI_Win_free(&the_window);
}
MPI_Finalize();
return 0;
}
This chapter discusses the I/O support of MPI, which is intended to alleviate the problems inherent in par-
allel file access. Let us first explore the issues. This story partly depends on what sort of parallel computer
you are running on. Here are some of the hardware scenarios you may encounter:
• On networks of workstations each node will have a separate drive with its own file system.
• On many clusters there will be a shared file system that acts as if every process can access every
file.
• Cluster nodes may or may not have a private file system.
Based on this, the following strategies are possible, even before we start talking about MPI I/O.
• One process can collect all data with MPI_Gather and write it out. There are at least three things
wrong with this: it uses network bandwidth for the gather, it may require a large amount of
memory on the root process, and centralized writing is a bottleneck.
• Absent a shared file system, writing can be parallelized by letting every process create a unique
file and merge these after the run. This makes the I/O symmetric, but collecting all the files is a
bottleneck.
• Even with a shared file system this approach is possible, but it can put a lot of strain on
the file system, and the post-processing can be a significant task.
• Using a shared file system, there is nothing against every process opening the same existing file
for reading, and using an individual file pointer to get its unique data.
• … but having every process open the same file for output is probably not a good idea. For
instance, if two processes try to write at the end of the file, you may need to synchronize them,
and synchronize the file system flushes.
For these reasons, MPI has a number of routines that make it possible to read and write a single file from
a large number of processes, giving each process its own well-defined location where to access the data.
These locations can use MPI derived datatypes for both the source data (that is, in memory) and target
data (that is, on disk). Thus, in one call that is collective on a communicator each process can address data
that is not contiguous in memory, and place it in locations that are not contiguous on disk.
There are dedicated libraries for file I/O, such as hdf5, netcdf, or silo. However, these often add header
information to a file that may not be understandable to post-processing applications. With MPI I/O you
are in complete control of what goes to the file. (A useful tool for viewing your file is the unix utility od.)
TACC note. Each node has a private /tmp file system (typically flash storage), to which you can write
files. Considerations:
• Since these drives are separate from the shared file system, you don’t have to worry about
stress on the file servers.
• These temporary file systems are wiped after your job finishes, so you have to do the
post-processing in your job script.
• The capacity of these local drives is fairly limited; see the userguide for exact numbers.
Python note: even though the file is opened on a communicator, the open call is a class method of the
MPI.File class, rather than of the communicator object; the latter is passed in as an argument.
File access modes:
• MPI_MODE_RDONLY: read only,
• MPI_MODE_RDWR: reading and writing,
• MPI_MODE_WRONLY: write only,
• MPI_MODE_CREATE: create the file if it does not exist,
• MPI_MODE_EXCL: error if creating file that already exists,
• MPI_File_read (figure 10.4) This routine attempts to read the specified data from the locations
specified in the current file view. The number of items read is returned in the MPI_Status argu-
ment; all other fields of this argument are undefined. It can not be used if the file was opened
with MPI_MODE_SEQUENTIAL.
• If all processes execute a read at the same logical time, it is better to use the collective call
MPI_File_read_all (figure 10.5).
For thread safety it is good to combine seek and read/write operations:
• MPI_File_read_at: combine read and seek. The collective variant is MPI_File_read_at_all.
• MPI_File_write_at: combine write and seek. The collective variant is MPI_File_write_at_all; sec-
tion 10.2.2.
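By way of illustration, here is a minimal sketch (the file name and variable names are arbitrary, not from the book's examples; the usual comm and procno variables are assumed) in which every process reads its own block from an existing shared file:
MPI_File fh;
MPI_File_open(comm,"input.dat",MPI_MODE_RDONLY,MPI_INFO_NULL,&fh);
int nwords = 3; int readbuf[3];
MPI_Offset offset = (MPI_Offset)procno * nwords * sizeof(int);
MPI_File_read_at(fh,offset,readbuf,nwords,MPI_INT,MPI_STATUS_IGNORE);
MPI_File_close(&fh);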
Writing to and reading from a parallel file is rather similar to sending and receiving:
• The process uses an elementary data type or a derived datatype to describe what elements in
an array go to file, or are read from file.
• In the simplest case, you read or write that data to the file using an offset, or after first having done
a seek operation.
• But you can also set a ‘file view’ to describe explicitly what elements in the file will be involved.
Just like there are blocking and nonblocking sends, there are also nonblocking writes and reads: MPI_File_iwrite
(figure 10.6), MPI_File_iread operations, and their collective versions MPI_File_iwrite_all, MPI_File_iread_all.
These routines output an MPI_Request object, which can then be tested with MPI_Wait or MPI_Test.
Nonblocking collective I/O functions much like other nonblocking collectives (section 3.11): the request
is satisfied if all processes finish the collective.
There are also split collectives that function like nonblocking collective I/O, but without the request/wait
mechanism: MPI_File_write_all_begin / MPI_File_write_all_end (and similarly MPI_File_read_all_begin /
MPI_File_read_all_end) where the second routine blocks until the collective write/read has been con-
cluded.
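A sketch of such a split collective write, assuming a file handle mpifile and a buffer of nwords integers:
MPI_File_write_all_begin(mpifile,buffer,nwords,MPI_INT);
/* ... unrelated local work can be done here ... */
MPI_Status write_status;
MPI_File_write_all_end(mpifile,buffer,&write_status);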
Exercise 10.1. Create a buffer of length nwords=3 on each process, and write these buffers
as a sequence to one file with MPI_File_write_at.
(There is a skeleton for this exercise under the name blockwrite.)
Instead of giving the position in the file explicitly, you can also use an MPI_File_seek call to position the file
pointer, and write with MPI_File_write at the pointer location. The write call itself also advances the file
pointer so separate calls for writing contiguous elements need no seek calls with MPI_SEEK_CUR.
Exercise 10.2. Rewrite the code of exercise 10.1 to use a loop where each iteration writes
only one item to file. Note that no explicit advance of the file pointer is needed.
Exercise 10.3. Construct a file with the consecutive integers 0, … , 𝑊𝑃 where 𝑊 is some
integer, and 𝑃 the number of processes. Each process 𝑝 writes the numbers
𝑝, 𝑝 + 𝑊 , 𝑝 + 2𝑊 , …. Use a loop where each iteration
1. writes a single number with MPI_File_write, and
2. advances the file pointer with MPI_File_seek with a whence parameter of
MPI_SEEK_CUR.
MPI_File_set_view
  (fh, /* file handle */
   offset,MPI_INT,scattertype,
   "native",MPI_INFO_NULL);
Exercise 10.4.
(There is a skeleton for this exercise under the name viewwrite.) Write a file in the
same way as in exercise 10.1, but now use MPI_File_write and use MPI_File_set_view
to set a view that determines where the data is written.
You can get very creative effects by setting the view to a derived datatype.
Fortran note 11: offset literals. In Fortran you have to assure that the displacement parameter is of ‘kind’
MPI_OFFSET_KIND. In particular, you can not specify a literal zero ‘0’ as the displacement; use
0_MPI_OFFSET_KIND instead.
10.3 Consistency
It is possible for one process to read data previously written by another process. For this, it is of course
necessary to impose a temporal order, for instance by using MPI_Barrier, or using a zero-byte send from
the writing to the reading process.
However, the file also needs to be declared atomic: MPI_File_set_atomicity.
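A sketch of this write-then-read pattern (assuming a file handle mpifile and a buffer data of n integers):
MPI_File_set_atomicity(mpifile,1);   // declare the file atomic
if (procno==0)
  MPI_File_write_at(mpifile,0,data,n,MPI_INT,MPI_STATUS_IGNORE);
MPI_Barrier(comm);                   // impose the temporal order
if (procno==1)
  MPI_File_read_at(mpifile,0,data,n,MPI_INT,MPI_STATUS_IGNORE);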
10.4 Constants
MPI_SEEK_SET used to be called SEEK_SET, which gave conflicts with the C++ library. This had to be
circumvented with
make CPPFLAGS="-DMPICH_IGNORE_CXX_SEEK -DMPICH_SKIP_MPICXX"
and such.
A communicator describes a group of processes, but the structure of your computation may not be such
that every process will communicate with every other process. For instance, in a computation that is math-
ematically defined on a Cartesian 2D grid, the processes themselves act as if they are two-dimensionally
ordered and communicate with N/S/E/W neighbors. If MPI had this knowledge about your application, it
could conceivably optimize for it, for instance by renumbering the ranks so that communicating processes
are closer together physically in your cluster.
The mechanism to declare this structure of a computation to MPI is known as a virtual topology. The
following types of topology are defined:
• MPI_UNDEFINED: this value holds for communicators where no topology has explicitly been spec-
ified.
• MPI_CART: this value holds for Cartesian topologies, where processes act as if they are ordered
in a multi-dimensional ‘brick’; see section 11.1.
• MPI_GRAPH: this value describes the graph topology that was defined in MPI-1; section 11.2.4.
It is unnecessarily burdensome, since each process needs to know the total graph, and should
therefore be considered obsolete; the type MPI_DIST_GRAPH should be used instead.
• MPI_DIST_GRAPH: this value describes the distributed graph topology where each process only
describes the edges in the process graph that touch itself; see section 11.2.
These values can be discovered with the routine MPI_Topo_test.
Code:
// cartdims.c
int *dimensions = (int*) malloc(dim*sizeof(int));
for (int idim=0; idim<dim; idim++)
  dimensions[idim] = 0;
MPI_Dims_create(nprocs,dim,dimensions);
Output:
mpicc -o cartdims cartdims.o
Cartesian grid size: 3 dim: 1
  3
Cartesian grid size: 3 dim: 2
  3 x 1
Cartesian grid size: 4 dim: 1
  4
Cartesian grid size: 4 dim: 2
  2 x 2
Cartesian grid size: 4 dim: 3
  2 x 2 x 1
Cartesian grid size: 12 dim: 1
  12
Cartesian grid size: 12 dim: 2
  4 x 3
Cartesian grid size: 12 dim: 3
  3 x 2 x 2
Cartesian grid size: 12 dim: 4
  3 x 2 x 2 x 1
If the dimensions array is nonzero in a component, that one is not touched. Of course, the product of the
specified dimensions has to divide the input number of nodes.
(The Cartesian grid can have fewer processes than the input communicator: any processes not included
get MPI_COMM_NULL as output.)
For a given communicator, you can test what type it is with MPI_Topo_test (figure 11.2):
int world_type,cart_type;
MPI_Topo_test( comm,&world_type);
MPI_Topo_test( cart_comm,&cart_type );
if (procno==0) {
printf("World comm type=%d, Cart comm type=%d\n",
world_type,cart_type);
printf("no topo =%d, cart top =%d\n",
MPI_UNDEFINED,MPI_CART);
}
For a Cartesian communicator, you can retrieve its information with MPI_Cartdim_get and MPI_Cart_get:
int dim;
MPI_Cartdim_get( cart_comm,&dim );
int *dimensions = (int*) malloc(dim*sizeof(int));
int *periods = (int*) malloc(dim*sizeof(int));
int *coords = (int*) malloc(dim*sizeof(int));
MPI_Cart_get( cart_comm,dim,dimensions,periods,coords );
MPI_Cart_create(comm,ndim,dimensions,periodic,1,&comm2d);
MPI_Cart_coords(comm2d,procno,ndim,coord_2d);
MPI_Cart_rank(comm2d,coord_2d,&rank_2d);
printf("I am %d: (%d,%d); originally %d\n",
rank_2d,coord_2d[0],coord_2d[1],procno);
MPI_Cart_create
  ( comm,dim,dimensions,periods,
    0,&period_comm );
We shift process 0 in dimensions 0 and 1. In dimension 0 we get a wrapped-around source, and a target
that is the next process in row-major ordering; in dimension 1 we get MPI_PROC_NULL as source, and a
legitimate target.
Code:
int pred,succ;
MPI_Cart_shift
  (period_comm,/* dim: */ 0,/* up: */ 1,
   &pred,&succ);
printf("periodic dimension 0:\n src=%d, tgt=%d\n",
       pred,succ);
MPI_Cart_shift
  (period_comm,/* dim: */ 1,/* up: */ 1,
   &pred,&succ);
printf("non-periodic dimension 1:\n src=%d, tgt=%d\n",
       pred,succ);
Output:
Grid of size 6 in 3 dimensions:
  3 x 2 x 1
Shifting process 0.
periodic dimension 0:
 src=4, tgt=2
non-periodic dimension 1:
 src=-1, tgt=1
The routine MPI_Cart_sub (figure 11.5) is similar to MPI_Comm_split, in that it splits a communicator into
disjoint subcommunicators. In this case, it splits a Cartesian communicator into disjoint Cartesian commu-
nicators, each corresponding to a subset of the dimensions. This subset inherits both sizes and periodicity
from the original communicator.
Code:
MPI_Cart_sub( period_comm,remain,&hyperplane );
if (procno==0) {
  MPI_Topo_test( hyperplane,&topo_type );
  MPI_Cartdim_get( hyperplane,&hyperdim );
  printf("hyperplane has dimension %d, type %d\n",
         hyperdim,topo_type);
  MPI_Cart_get( hyperplane,dim,dims,period,coords );
  printf(" periodic: ");
  for (int id=0; id<2; id++)
    printf("%d,",period[id]);
  printf("\n");
}
Output:
hyperplane has dimension 2, type 2
 periodic: 1,0,
11.1.5 Reordering
The routine MPI_Cart_map gives a re-ordered rank for the calling process.
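A sketch of its use (not one of the book's examples): ask what rank the calling process would get on a 2D grid, without actually creating the Cartesian communicator.
int newrank, dims[2] = {0,0}, periods[2] = {0,0};
MPI_Dims_create(nprocs,2,dims);
MPI_Cart_map(comm,2,dims,periods,&newrank);
// newrank is MPI_UNDEFINED if the calling process does not fit in the grid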
Figure 11.1: Illustration of a distributed graph topology where each node has four neighbors
In many calculations on a grid (using the term in its mathematical, Finite Element Method (FEM), sense), a
grid point will collect information from grid points around it. Under a sensible distribution of the grid over
processes, this means that each process will collect information from a number of neighbor processes. The
number of neighbors is dependent on that process. For instance, in a 2D grid (and assuming a five-point
stencil for the computation) most processes communicate with four neighbors; processes on the edge with
three, and processes in the corners with two.
Such a topology is illustrated in figure 11.1.
MPI’s notion of graph topology, and the neighborhood collectives, offer an elegant way of expressing such
communication structures. There are various reasons for using graph topologies over the older, simpler
methods.
• MPI is allowed to reorder the processes, so that network proximity in the cluster corresponds
to proximity in the structure of the code.
• Ordinary collectives could not directly be used for graph problems, unless one would adopt
a subcommunicator for each graph neighborhood. However, scheduling would then lead to
deadlock or serialization.
• The normal way of dealing with graph problems is through nonblocking communications. How-
ever, since the user indicates an explicit order in which they are posted, congestion at certain
processes may occur.
• Collectives can pipeline data, while send/receive operations need to transfer their data in its
entirety.
• Collectives can use spanning trees, while send/receive uses a direct connection.
Thus the minimal description of a process graph contains for each process:
• Degree: the number of neighbor processes; and
• the ranks of the processes to communicate with.
However, this ignores that communication is not always symmetric: maybe the processes you receive from
are not the ones you send to. Worse, maybe only one side of this duality is easily described. Therefore,
there are two routines:
• MPI_Dist_graph_create_adjacent assumes that a process knows both whom it is sending to, and
who will send to it. This is the most work for the programmer to specify, but it is ultimately the
most efficient.
• MPI_Dist_graph_create specifies on each process only what it is the source for; that is, who this
process will be sending to. Consequently, some amount of processing – including communica-
tion – is needed to build the converse information, the ranks that will be sending to a process.
(Footnote: I disagree with this design decision. Specifying your sources is usually easier than specifying your destinations.)
dist_graph_communicator
(const communicator &old_comm,
const source_set &ss, const dest_set &ds, bool reorder=true)
where:
class dist_graph_communicator::source_set : private set< pair<int,int> >
class dist_graph_communicator::dest_set : private set< pair<int,int> >
Python:
MPI.Comm.Create_dist_graph
(self, sources, degrees, destinations, weights=None, Info info=INFO_NULL, bool reorder=False)
returns graph communicator
MPL note 57: distributed graph creation. The class mpl::dist_graph_communicator only has a constructor
corresponding to MPI_Dist_graph_create.
Figure 11.1 describes the common five-point stencil structure. If we let each process only describe itself,
we get the following:
• nsources= 1 because the calling process describes one node in the graph: itself.
• sources is an array of length 1, containing the rank of the calling process.
• degrees is an array of length 1, containing the degree (probably: 4) of this process.
• destinations is an array of length the degree of this process, probably again 4. The elements
of this array are the ranks of the neighbor nodes; strictly speaking the ones that this process
will send to.
• weights is an array declaring the relative importance of the destinations. For an unweighted
graph use MPI_UNWEIGHTED. If the graph is weighted, but the degree of a source is zero,
you can pass an empty array as MPI_WEIGHTS_EMPTY.
• reorder (int in C, LOGICAL in Fortran) indicates whether MPI is allowed to shuffle processes
to achieve greater locality.
The resulting communicator has all the processes of the original communicator, with the same ranks. In
other words MPI_Comm_size and MPI_Comm_rank give the same values on the graph communicator, as on the
intra-communicator that it is constructed from. To get information about the grouping, use MPI_Dist_graph_neighbors
and MPI_Dist_graph_neighbors_count; section 11.2.3.
By way of example we build an unsymmetric graph, that is, an edge 𝑣1 → 𝑣2 between vertices 𝑣1 , 𝑣2 does
not imply an edge 𝑣2 → 𝑣1 .
Code:
// graph.c
for ( int i=0; i<=1; i++ ) {
int neighb_i = proci+i;
if (neighb_i<0 || neighb_i>=idim)
continue;
for (int j=0; j<=1; j++ ) {
int neighb_j = procj+j;
if (neighb_j<0 || neighb_j>=jdim)
continue;
destinations[ degree++ ] =
PROC(neighb_i,neighb_j,idim,jdim);
}
}
MPI_Dist_graph_create
(comm,
/* I specify just one proc: me */ 1,
&procno,&degree,destinations,weights,
MPI_INFO_NULL,0,
&comm2d
);
Code:
int indegree,outdegree,weighted;
MPI_Dist_graph_neighbors_count
  (comm2d,
   &indegree,&outdegree,&weighted);
int my_ij[2] = {proci,procj}, other_ij[4][2];
MPI_Neighbor_allgather
  ( my_ij,2,MPI_INT, other_ij,2,MPI_INT, comm2d );
Output:
[ 0 = (0,0)] has 4 outbound neighbours: 0, 1, 2, 3,
   1 inbound neighbors: ( 0, 0)= 0
[ 1 = (0,1)] has 2 outbound neighbours: 1, 3,
   2 inbound neighbors: ( 0, 1)= 1 ( 0, 0)= 0
[ 2 = (1,0)] has 4 outbound neighbours: 2, 3, 4, 5,
   2 inbound neighbors: ( 1, 0)= 2 ( 0, 0)= 0
[ 3 = (1,1)] has 2 outbound neighbours: 3, 5,
   4 inbound neighbors: ( 1, 1)= 3 ( 0, 1)= 1 ( 1, 0)= 2 ( 0, 0)
[ 4 = (2,0)] has 2 outbound neighbours: 4, 5,
   2 inbound neighbors: ( 2, 0)= 4 ( 1, 0)= 2
[ 5 = (2,1)] has 1 outbound neighbours: 5,
   4 inbound neighbors: ( 2, 1)= 5 ( 1, 1)= 3 ( 2, 0)= 4 ( 1, 0)
Python note 28: graph communicators. Graph communicator creation is a method of the Comm class, and
the graph communicator is the function result; see the Python signature of MPI.Comm.Create_dist_graph given above.
11.2.3 Query
There are two routines for querying the neighbors of a process: MPI_Dist_graph_neighbors_count (fig-
ure 11.8) and MPI_Dist_graph_neighbors (figure 11.9).
While this information seems derivable from the graph construction, that is not entirely true for two
reasons.
1. With the non-adjacent version MPI_Dist_graph_create, only outdegrees and destinations are spec-
ified; this call then supplies the indegrees and sources;
2. As observed above, the order in which data is placed in the receive buffer of a gather call is not
determined by the create call, but can only be queried this way.
11.2.5 Re-ordering
The routine MPI_Graph_map gives a re-ordered rank for the calling process.
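By way of illustration, here is a minimal sketch (not taken from the example codes; it assumes the usual comm, nprocs, procno setup) that asks for a reordered rank in a ring-shaped graph. The index and edges arrays follow the same convention as for MPI_Graph_create: index contains cumulative degrees, edges the flattened adjacency lists.
  // sketch: reordering for a ring graph; `comm',`nprocs',`procno' assumed set up
  int *index = (int*) malloc( nprocs*sizeof(int) );
  int *edges = (int*) malloc( 2*nprocs*sizeof(int) );
  for (int i=0; i<nprocs; i++) {
    index[i]     = 2*(i+1);               // each node has two neighbors
    edges[2*i]   = (i+nprocs-1)%nprocs;   // left neighbor in the ring
    edges[2*i+1] = (i+1)%nprocs;          // right neighbor in the ring
  }
  int newrank;
  MPI_Graph_map( comm,nprocs,index,edges,&newrank );
  if (newrank!=MPI_UNDEFINED)
    printf("Process %d could be renumbered to %d\n",procno,newrank);
  free(index); free(edges);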
#include "globalinit.c"
MPI_Comm comm2d;
int periodic[ndim]; periodic[0] = periodic[1] = 0;
MPI_Cart_create(comm,ndim,dimensions,periodic,1,&comm2d);
MPI_Cart_coords(comm2d,procno,ndim,coord_2d);
MPI_Cart_rank(comm2d,coord_2d,&rank_2d);
printf("I am %d: (%d,%d); originally %d\n",
rank_2d,coord_2d[0],coord_2d[1],procno);
int rank_left,rank_right,rank_up,rank_down;
char indata[4]; int idata=0,sdata=0;
for (int i=0; i<4; i++)
indata[i] = 32;
char mychar = 65+procno;
MPI_Cart_shift(comm2d,0,+1,&rank_2d,&rank_right);
MPI_Cart_shift(comm2d,0,-1,&rank_2d,&rank_left);
MPI_Cart_shift(comm2d,1,+1,&rank_2d,&rank_up);
MPI_Cart_shift(comm2d,1,-1,&rank_2d,&rank_down);
int irequest = 0; MPI_Request *requests = malloc(8*sizeof(MPI_Request));
MPI_Isend(&mychar,1,MPI_CHAR,rank_right, 0,comm, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_left, 0,comm, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_up, 0,comm, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_down, 0,comm, requests+irequest++);
MPI_Irecv( indata+idata++, 1,MPI_CHAR, rank_right, 0,comm, requests+irequest++);
MPI_Irecv( indata+idata++, 1,MPI_CHAR, rank_left, 0,comm, requests+irequest++);
MPI_Irecv( indata+idata++, 1,MPI_CHAR, rank_up, 0,comm, requests+irequest++);
MPI_Irecv( indata+idata++, 1,MPI_CHAR, rank_down, 0,comm, requests+irequest++);
MPI_Waitall(irequest,requests,MPI_STATUSES_IGNORE);
printf("[%d] %s\n",procno,indata);
/* for (int i=0; i<4; i++) */
/* sdata += indata[i]; */
/* printf("[%d] %d,%d,%d,%d sum=%d\n",procno,indata[0],indata[1],indata[2],indata[3],sdata); */
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
#include "globalinit.c"
/*
* Create 3D brick
*/
int *dimensions = (int*) malloc(dim*sizeof(int));
for (int idim=0; idim<dim; idim++)
dimensions[idim] = 0;
MPI_Dims_create(nprocs,dim,dimensions);
int *periods = (int*) malloc(dim*sizeof(int));
for ( int id=0; id<dim; id++)
periods[id] = 0;
if (procno==0) {
print_grid( nprocs,dim,dimensions );
}
MPI_Comm cart_comm;
MPI_Cart_create
( comm,dim,dimensions,periods,
0,&cart_comm );
MPI_Comm period_comm;
for ( int id=0; id<dim; id++)
periods[id] = id==0 ? 1 : 0;
MPI_Cart_create
( comm,dim,dimensions,periods,
0,&period_comm );
/*
* Translation rank -> coord
*/
if (procno==0) {
int *coord = (int*) malloc( dim*sizeof(int) );
for ( int ip=0; ip<nprocs; ip++ ) {
/*
* Translate process rank to cartesian coordinate
*/
MPI_Cart_coords( cart_comm,ip,dim,coord );
printf("[%2d] coord: [",ip);
for ( int id=0; id<dim; id++ )
printf("%d,",coord[id]);
printf("]\n");
/*
* Shift the coordinate and translate back to rank
* This is erroneous for a non-periodic Cartesian grid
*/
int rank_check;
coord[0]++;
MPI_Cart_rank( cart_comm,coord,&rank_check );
printf(" shifted neighbor : %2d\n",rank_check);
MPI_Cart_rank( period_comm,coord,&rank_check );
printf(" periodic neighbor: %2d\n",rank_check);
}
free(coord);
}
/*
* Shifted coordinates
*/
if (procno==0) {
if (dimensions[1]==1) {
printf("Too few processes: need non-trivial dimensions[1]\n");
} else {
printf("\nCartShift\n");
print_grid(nprocs,dim,dimensions);
printf("Shifting process 0.\n");
int pred,succ;
MPI_Cart_shift
(period_comm,/* dim: */ 0,/* up: */ 1,
&pred,&succ);
printf("periodic dimension 0:\n src=%d, tgt=%d\n",
pred,succ);
MPI_Cart_shift
(period_comm,/* dim: */ 1,/* up: */ 1,
&pred,&succ);
printf("non-periodic dimension 1:\n src=%d, tgt=%d\n",
pred,succ);
printf("cartshift\n\n");
}
/*
* Subdimensions
*/
{
int remain[] = {1,0,1};
int topo_type, hyperdim, dims[3], period[3], coords[3];
MPI_Comm hyperplane;
if (procno==0) printf("Hyperplane13\n");
MPI_Cart_sub( cart_comm,remain,&hyperplane );
if (procno==0) {
MPI_Topo_test( hyperplane,&topo_type );
MPI_Cartdim_get( hyperplane,&hyperdim );
printf("hyperplane has dimension %d, type %d\n",
hyperdim,topo_type);
MPI_Cart_get( hyperplane,dim,dims,period,coords );
printf(" periodic: ");
for (int id=0; id<2; id++)
printf("%d,",period[id]);
printf("\n");
}
MPI_Comm_free( &hyperplane );
if (procno==0) printf("hyperplane13\n\n");
if (procno==0) printf("Hyperplane13p\n");
MPI_Cart_sub( period_comm,remain,&hyperplane );
if (procno==0) {
MPI_Topo_test( hyperplane,&topo_type );
MPI_Cartdim_get( hyperplane,&hyperdim );
printf("hyperplane has dimension %d, type %d\n",
hyperdim,topo_type);
MPI_Cart_get( hyperplane,dim,dims,period,coords );
printf(" periodic: ");
for (int id=0; id<2; id++)
printf("%d,",period[id]);
printf("\n");
}
MPI_Comm_free( &hyperplane );
if (procno==0) printf("hyperplane13p\n\n");
free(dimensions); free(periods);
MPI_Finalize();
return 0;
}
Some programmers are under the impression that MPI would not be efficient on shared memory, since all
operations are done through what looks like network calls. This is not correct: many MPI implementations
have optimizations that detect shared memory and can exploit it, so that data is copied, rather than going
through a communication layer. (Conversely, programming systems for shared memory such as OpenMP
can actually have inefficiencies associated with thread handling.) The main inefficiency associated with
using MPI on shared memory is then that processes can not actually share data.
The one-sided MPI calls (chapter 9) can also be used to emulate shared memory, in the sense that an
origin process can access data from a target process without the target’s active involvement. However,
these calls do not distinguish between actually shared memory and one-sided access across the network.
In this chapter we will look at the ways MPI can interact with the presence of actual shared memory.
(This functionality was added in the MPI-3 standard.) This relies on the MPI_Win windows concept, but
otherwise uses direct access of other processes’ memory.
MPI.Comm.Split_type(
self, int split_type, int key=0, Info info=INFO_NULL)
Exercise 12.1. Write a program that uses MPI_Comm_split_type to analyze for a run
1. How many nodes there are;
2. How many processes there are on each node.
If you run this program on an unequal distribution, say 10 processes on 3 nodes,
what distribution do you find?
Nodes: 3; processes: 10
TACC: Starting up job 4210429
TACC: Starting parallel tasks...
There are 3 nodes
Node sizes: 4 3 3
TACC: Shutdown complete. Exiting.
Remark 19 The OpenMPI implementation of MPI has a number of non-standard split types, such as OMPI_COMM_TYPE_SOCKET;
see https://www.open-mpi.org/doc/v4.1/man3/MPI_Comm_split_type.3.php
MPL note 60: split by shared memory. Similar to ordinary communicator splitting (note 56): communicator::split_shared.
for some processes. To prevent this, the key alloc_shared_noncontig can be set to true in the MPI_Info
object.
The following material is for the recently released MPI-4 standard and may not be supported yet.
In the contiguous case, the mpi_minimum_memory_alignment info argument (section 9.1.1) applies only to the
memory on the first process; in the noncontiguous case it applies to all.
End of MPI-4 material
// numa.c
MPI_Info window_info;
MPI_Info_create(&window_info);
MPI_Info_set(window_info,"alloc_shared_noncontig","true");
MPI_Win_allocate_shared( window_size,sizeof(double),window_info,
nodecomm,
&window_data,&node_window);
MPI_Info_free(&window_info);
Strategy: default behavior of shared window allocation:
Distance 1 to zero: 8
Distance 2 to zero: 16
Distance 3 to zero: 24
Distance 4 to zero: 32
Distance 5 to zero: 40
Distance 6 to zero: 48
Distance 7 to zero: 56
Distance 8 to zero: 64
Distance 9 to zero: 72
Strategy: allow non-contiguous shared window allocation:
Distance 1 to zero: 4096
Distance 2 to zero: 8192
Distance 3 to zero: 12288
Distance 4 to zero: 16384
Distance 5 to zero: 20480
Distance 6 to zero: 24576
Distance 7 to zero: 28672
Distance 8 to zero: 32768
Distance 9 to zero: 36864
The explanation here is that each window is placed on its own small page, which on this particular system
has a size of 4K.
Remark 20 The address given by the ampersand operator in C is not a physical address, but a virtual address. Where
the pages are placed in physical memory is determined by the page table.
Exercise 12.2. Let the ‘shared’ data originate on process zero in MPI_COMM_WORLD. Then:
• create a communicator per shared memory domain;
• create a communicator for all the processes with number zero on their node;
• broadcast the shared data to the processes zero on each node.
(There is a skeleton for this exercise under the name shareddata.)
MPI_Init(&argc,&argv);
comm = MPI_COMM_WORLD;
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procid);
/*
* Find the subcommunicator on the node,
* and get the procid on the node.
*/
MPI_Comm nodecomm; int onnode_procid;
MPI_Comm_split_type
(comm,MPI_COMM_TYPE_SHARED,procid,MPI_INFO_NULL,
&nodecomm);
MPI_Comm_rank(nodecomm,&onnode_procid);
/*
* Find the subcommunicators of
* identical `onnode_procid' processes;
* the procid on that communicator is the node ID
*/
MPI_Comm crosscomm; int nodeid;
MPI_Comm_split
(comm,onnode_procid,procid,&crosscomm);
MPI_Comm_rank(crosscomm,&nodeid);
printf("[%2d] = (%d,%d)\n",procid,nodeid,onnode_procid);
/*
* Create data on global process zero,
* and broadcast it to the zero processes on other nodes
*/
double shared_data = 0;
if (procid==0) shared_data = 3.14;
if (onnode_procid==0)
MPI_Bcast(&shared_data,1,MPI_DOUBLE,0,crosscomm);
if (procid==0)
printf("Head nodes should have shared data: %e\n",
shared_data);
/*
* Create window on the node communicator;
* it only has nonzero size on the first process
*/
MPI_Win node_window;
MPI_Aint window_size; double *window_data;
if (onnode_procid==0)
window_size = sizeof(double);
else window_size = 0;
MPI_Win_allocate_shared
( window_size,sizeof(double),MPI_INFO_NULL,
nodecomm,
&window_data,&node_window);
/*
* Put data on process zero of the node window
* We use a Put call rather than a straight copy:
* the Fence calls enforce coherence
*/
MPI_Win_fence(0,node_window);
if (onnode_procid==0) {
MPI_Aint disp = 0;
MPI_Put( &shared_data,1,MPI_DOUBLE,0,disp,1,MPI_DOUBLE,node_window);
}
MPI_Win_fence(0,node_window);
/*
* Now get on each process the address of the window of process zero.
*/
MPI_Aint window_size0; int window_unit; double *win0_addr;
MPI_Win_shared_query
( node_window,0,
&window_size0,&window_unit, &win0_addr );
/*
* Check that we can indeed get at the data in the shared memory
*/
printf("[%d,%d] data at shared window %lx: %e\n",
nodeid,onnode_procid,(unsigned long)win0_addr,*win0_addr);
/*
* cleanup
*/
MPI_Win_free(&node_window);
MPI_Finalize();
return 0;
}
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <mpi.h>
#include "window.c"
#include "globalinit.c"
/*
* Find the subcommunicator on the node,
* and get the procid on the node.
*/
MPI_Comm nodecomm;
int onnode_procno, onnode_nprocs;
MPI_Comm_split_type
(comm,MPI_COMM_TYPE_SHARED,procno,MPI_INFO_NULL,
&nodecomm);
MPI_Comm_size(nodecomm,&onnode_nprocs);
/* if (onnode_nprocs<2) { */
/* printf("This example needs at least two ranks per node\n"); */
/* // MPI_Abort(comm,0); */
/* } */
MPI_Comm_rank(nodecomm,&onnode_procno);
/*
* Now process zero checks on window placement
*/
if (onnode_procno==0) {
MPI_Aint window_size0; int window0_unit; double *win0_addr;
MPI_Win_shared_query( node_window,0,
&window_size0,&window0_unit, &win0_addr );
size_t dist1,distp;
for (int p=1; p<onnode_nprocs; p++) {
MPI_Aint window_sizep; int windowp_unit; double *winp_addr;
MPI_Win_shared_query( node_window,p,
&window_sizep,&windowp_unit, &winp_addr );
distp = (size_t)winp_addr-(size_t)win0_addr;
if (procno==0)
printf("Distance %d to zero: %ld\n",p,(long)distp);
if (p==1)
dist1 = distp;
else {
if (distp%dist1!=0)
printf("!!!! not a multiple of distance 0--1 !!!!\n");
}
}
}
MPI_Win_free(&node_window);
}
/*
* cleanup
*/
MPI_Finalize();
return 0;
}
#ifndef CORES_PER_NODE
#define CORES_PER_NODE 16
#endif
#include "globalinit.c"
if (nprocs<3) {
printf("This program needs at least three processes\n");
return -1;
}
if (procno==0)
printf("There are %d ranks total\n",nprocs);
int new_procno,new_nprocs;
MPI_Comm sharedcomm;
MPI_Comm_split_type(MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,procno,MPI_INFO_NULL,&sharedcomm);
MPI_Comm_size(sharedcomm,&new_nprocs);
MPI_Comm_rank(sharedcomm,&new_procno);
if (new_nprocs!=nprocs) {
printf("This example can only run on shared memory\n");
MPI_Abort(comm,0);
}
{
MPI_Aint check_size; int check_unit; int *check_baseptr;
MPI_Win_shared_query
(shared_window,new_procno,
&check_size,&check_unit,&check_baseptr);
printf("[%d;%d] size=%ld\n",procno,new_procno,check_size);
}
int *left_ptr,*right_ptr;
int left_proc = new_procno>0 ? new_procno-1 : MPI_PROC_NULL,
right_proc = new_procno<new_nprocs-1 ? new_procno+1 : MPI_PROC_NULL;
MPI_Aint left_size,right_size; int left_unit,right_unit;
MPI_Win_shared_query(shared_window,left_proc,&left_size,&left_unit,&left_ptr);
MPI_Win_shared_query(shared_window,right_proc,&right_size,&right_unit,&right_ptr);
if (procno==0)
printf("Finished\n");
MPI_Finalize();
return 0;
}
While the MPI standard itself makes no mention of threads – process being the primary unit of compu-
tation – the use of threads is allowed. Below we will discuss what provisions exist for doing so.
Using threads and other shared memory models in combination with MPI leads of course to the question
of how race conditions are handled. An example of code with a data race that pertains to MPI:
#pragma omp sections
#pragma omp section
MPI_Send( x, /* to process 2 */ )
#pragma omp section
MPI_Recv( x, /* from process 3 */ )
The MPI standard here puts the burden on the user: this code is not legal, and behavior is not defined.
The mvapich implementation of MPI does have the required threading support, but you need to set this
environment variable:
export MV2_ENABLE_AFFINITY=0
Another solution is to run your code like this:
ibrun tacc_affinity <my_multithreaded_mpi_executable
Intel MPI uses an environment variable to turn on thread support:
I_MPI_LIBRARY_KIND=<value>
where
release : multi-threaded with global lock
release_mt : multi-threaded with per-object lock for thread-split
The mpiexec program usually propagates environment variables, so the value of OMP_NUM_THREADS when
you call mpiexec will be seen by each MPI process.
• It is possible to use blocking sends in threads, and let the threads block. This does away with
the need for polling.
• You can not send to a thread number: messages are addressed to processes, so use the MPI message tag if you need to direct a message to a specific thread, as in the sketch below.
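For instance, the following sketch (assuming MPI_THREAD_MULTIPLE support, the same number of threads on both processes, and the usual procno and comm setup) lets each thread of process 0 send to the corresponding thread of process 1, using the thread number as the tag:
#pragma omp parallel
  {
    int t = omp_get_thread_num();
    double x = 1.*t;
    if (procno==0)
      MPI_Send( &x,1,MPI_DOUBLE, /* to: */ 1, /* tag: */ t, comm );
    else if (procno==1)
      MPI_Recv( &x,1,MPI_DOUBLE, /* from: */ 0, /* tag: */ t, comm, MPI_STATUS_IGNORE );
  }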
Exercise 13.1. Consider the 2D heat equation and explore the mix of MPI/OpenMP
parallelism:
• Give each node one MPI process that is fully multi-threaded.
• Give each core an MPI process and don’t use multi-threading.
Discuss theoretically why the former can give higher performance. Implement both
schemes as special cases of the general hybrid case, and run tests to find the optimal
mix.
// thread.c
MPI_Init_thread(&argc,&argv,MPI_THREAD_MULTIPLE,&threading);
comm = MPI_COMM_WORLD;
MPI_Comm_rank(comm,&procno);
MPI_Comm_size(comm,&nprocs);
if (procno==0) {
switch (threading) {
case MPI_THREAD_MULTIPLE : printf("Glorious multithreaded MPI\n"); break;
case MPI_THREAD_SERIALIZED : printf("No simultaneous MPI from threads\n"); break;
case MPI_THREAD_FUNNELED : printf("MPI from main thread\n"); break;
case MPI_THREAD_SINGLE : printf("no threading supported\n"); break;
}
}
MPI_Finalize();
Recent versions of MPI have a standardized way of reading out performance variables: the tools interface
which improves on the old interface described in section 15.6.2.
These matching calls can be made multiple times, after MPI has already been initialized with MPI_Init or
MPI_Init_thread.
// cvar.c
MPI_T_cvar_get_num(&ncvar);
printf("#cvars: %d\n",ncvar);
for (int ivar=0; ivar<ncvar; ivar++) {
char name[100]; int namelen = 100;
char desc[256]; int desclen = 256;
int verbosity,bind,scope;
MPI_Datatype datatype;
MPI_T_enum enumtype;
MPI_T_cvar_get_info
(ivar,
name,&namelen,
&verbosity,&datatype,&enumtype,desc,&desclen,&bind,&scope
);
printf("cvar %3d: %s\n %s\n",ivar,name,desc);
Remark 21 There is no constant indicating a maximum buffer length for these variables. However, you can
do the following:
1. Call the info routine with NULL values for the buffers, reading out the buffer lengths;
2. allocate the buffers with sufficient length, that is, including an extra position for the null terminator;
and
3. call the info routine a second time, filling in the string buffers.
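Applied to MPI_T_cvar_get_info from the code above, and assuming the behavior described in this remark, the pattern looks as follows (with ivar the variable index):
  int namelen=0,desclen=0, verbosity,bind,scope;
  MPI_Datatype datatype; MPI_T_enum enumtype;
  // first call: NULL string buffers, only the lengths are returned
  MPI_T_cvar_get_info
    (ivar, NULL,&namelen, &verbosity,&datatype,&enumtype, NULL,&desclen, &bind,&scope);
  char *name = (char*) malloc(namelen+1); // extra position for the null terminator
  char *desc = (char*) malloc(desclen+1);
  // second call: fill the string buffers
  MPI_T_cvar_get_info
    (ivar, name,&namelen, &verbosity,&datatype,&enumtype, desc,&desclen, &bind,&scope);
  printf("cvar %3d: %s\n    %s\n",ivar,name,desc);
  free(name); free(desc);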
Conversely, given a variable name, its index can be retrieved with MPI_T_cvar_get_index:
int MPI_T_cvar_get_index(const char *name, int *cvar_index)
(If a routine takes both a session and handle argument, and the two are not associated, an error of
MPI_T_ERR_INVALID_HANDLE is returned.)
Passing MPI_T_PVAR_ALL_HANDLES to the stop call attempts to stop all variables within the session. Failure
to stop a variable returns MPI_T_ERR_PVAR_NO_STARTSTOP.
Variables can be read and written with MPI_T_pvar_read and MPI_T_pvar_write:
int MPI_T_pvar_read
(MPI_T_pvar_session session, MPI_T_pvar_handle handle,
void* buf)
int MPI_T_pvar_write
(MPI_T_pvar_session session, MPI_T_pvar_handle handle,
const void* buf)
If the variable can not be written (see the readonly parameter of MPI_T_pvar_get_info), MPI_T_ERR_PVAR_NO_WRITE
is returned.
A special case of writing the variable is to reset it with
int MPI_T_pvar_reset(MPI_T_pvar_session session, MPI_T_pvar_handle handle)
For a given category name the index can be found with MPI_T_category_get_index:
int MPI_T_category_get_index(const char *name, int *cat_index)
14.5 Events
// mpitevent.c
int nsource;
MPI_T_source_get_num(&nsource);
int name_len=256,desc_len=256;
char var_name[256],description[256];
MPI_T_source_order ordering;
MPI_Count ticks_per_second,max_ticks;
MPI_Info info;
MPI_Datatype datatype; MPI_T_enum enumtype;
for (int source=0; source<nsource; source++) {
name_len = 256; desc_len=256;
MPI_T_source_get_info(source,var_name,&name_len,
description,&desc_len,
&ordering,&ticks_per_second,&max_ticks,&info);
int tlevel;
MPI_Init_thread(&argc,&argv,MPI_THREAD_SINGLE,&tlevel);
MPI_T_init_thread(MPI_THREAD_SINGLE,&tlevel);
int npvar;
MPI_T_pvar_get_num(&npvar);
MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&procid);
if (procid==0)
printf("#pvars: %d\n",npvar);
int name_len=256,desc_len=256,
verbosity,var_class,binding,isreadonly,iscontiguous,isatomic;
char var_name[256],description[256];
MPI_Datatype datatype; MPI_T_enum enumtype;
for (int pvar=0; pvar<npvar; pvar++) {
name_len = 256; desc_len=256;
MPI_T_pvar_get_info(pvar,var_name,&name_len,
&verbosity,&var_class,
&datatype,&enumtype,
description,&desc_len,
&binding,&isreadonly,&iscontiguous,&isatomic);
if (procid==0)
printf("pvar %d: %d/%s = %s\n",pvar,var_class,var_name,description);
}
MPI_T_finalize();
MPI_Finalize();
return 0;
}
Copying a communicator with MPI_Comm_dup does not cause the info to be copied; to propagate information
to the copy there is MPI_Comm_dup_with_info (section 7.2).
15.1.2 Attributes
Some runtime (or installation-dependent) values are available as attributes through MPI_Comm_set_attr
(figure 15.8) and MPI_Comm_get_attr (figure 15.9) for communicators, or MPI_Win_get_attr, MPI_Type_get_attr.
(The MPI-2 routine MPI_Attr_get is deprecated). The flag parameter has two functions:
• it returns whether the attribute was found;
• if on entry it was set to false, the value parameter is ignored and the routine only tests whether
the key is present.
The return value parameter is subtle: while it is declared void*, it is actually the address of a void* pointer.
// tags.c
int tag_upperbound;
void *v; int flag=1;
ierr = MPI_Comm_get_attr(comm,MPI_TAG_UB,&v,&flag);
tag_upperbound = *(int*)v;
• MPI_ERR_IN_STATUS: a function returning an array of statuses has at least one status where
the MPI_ERROR field is set to something other than MPI_SUCCESS. See section 4.3.2.3.
• MPI_ERR_INFO: invalid info object.
• MPI_ERR_NO_MEM is returned by MPI_Alloc_mem if memory is exhausted.
• MPI_ERR_OTHER: an error occurred; use MPI_Error_string to retrieve further information about
this error; see section 15.2.2.3.
• MPI_ERR_PORT: invalid port; this applies to MPI_Comm_connect and such.
The following material is for the recently released MPI-4 standard and may not be supported yet.
• MPI_ERR_PROC_ABORTED is returned if a process tries to communicate with a process that has
aborted.
End of MPI-4 material
• MPI_ERR_RANK: an invalid source or destination rank is specified. Valid ranks are 0 … 𝑠 − 1 where
𝑠 is the size of the communicator, or MPI_PROC_NULL, or MPI_ANY_SOURCE for receive operations.
• MPI_ERR_SERVICE: invalid service in MPI_Unpublish_name; section 8.2.3.
Remark 22 The routine MPI_Errhandler_set is deprecated, replaced by its MPI-2 variant MPI_Comm_set_errhandler.
15.2.2.1 Abort
The default behavior, where the full run is aborted, is equivalent to your code having the following call
to
MPI_Comm_set_errhandler(MPI_COMM_WORLD,MPI_ERRORS_ARE_FATAL);
The handler MPI_ERRORS_ARE_FATAL, even though it is associated with a communicator, causes the whole
application to abort.
The following material is for the recently released MPI-4 standard and may not be supported yet.
The handler MPI_ERRORS_ABORT (MPI-4) aborts on the processes in the communicator for which it is speci-
fied.
End of MPI-4 material
15.2.2.2 Return
Another simple possibility is to specify MPI_ERRORS_RETURN:
MPI_Comm_set_errhandler(MPI_COMM_WORLD,MPI_ERRORS_RETURN);
which causes the error code to be returned to the user. This gives you the opportunity to write code that
handles the error return value.
For instance,
Fatal error in MPI_Waitall:
See the MPI_ERROR field in MPI_Status for the error code
You could then retrieve the MPI_ERROR field of the status, and print out an error string with MPI_Error_string,
of maximal size MPI_MAX_ERROR_STRING:
MPI_Comm_set_errhandler(MPI_COMM_WORLD,MPI_ERRORS_RETURN);
ierr = MPI_Waitall(2*ntids-2,requests,status);
if (ierr!=0) {
char errtxt[MPI_MAX_ERROR_STRING];
for (int i=0; i<2*ntids-2; i++) {
int err = status[i].MPI_ERROR;
int len=MPI_MAX_ERROR_STRING;
MPI_Error_string(err,errtxt,&len);
printf("Waitall error: %d %s\n",err,errtxt);
}
MPI_Abort(MPI_COMM_WORLD,0);
}
One case where errors can be handled is that of MPI file I/O: if an output file has the wrong permissions,
the code can possibly progress without writing data, or by writing to a temporary file.
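A minimal sketch of this strategy (the file name is hypothetical; the default error handler for file operations is MPI_ERRORS_RETURN, so the return code can be tested):
  MPI_File output_file;
  int ierr = MPI_File_open
    (comm,"output.dat",MPI_MODE_CREATE|MPI_MODE_WRONLY,MPI_INFO_NULL,&output_file);
  if (ierr!=MPI_SUCCESS) {
    char errtxt[MPI_MAX_ERROR_STRING]; int len=MPI_MAX_ERROR_STRING;
    MPI_Error_string(ierr,errtxt,&len);
    printf("Could not open output file (%s); continuing without output\n",errtxt);
  } else {
    /* ... write the data ... */
    MPI_File_close(&output_file);
  }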
MPI operators (MPI_Op) do not return an error code. In case of an error they call MPI_Abort; if MPI_ERRORS_RETURN
is the error handler, error codes may be silently ignored.
You can create your own error handler with MPI_Comm_create_errhandler (figure 15.12), which is then
installed with MPI_Comm_set_errhandler. You can retrieve the error handler with MPI_Comm_get_errhandler.
MPL note 62: communicator errhandler. MPL does not allow for access to the wrapped communicators.
However, for MPI_COMM_WORLD, the routine MPI_Comm_set_errhandler can be called directly.
You can define your own error class with MPI_Add_error_class; the resulting error number is larger than
MPI_ERR_LASTCODE, the upper bound on built-in error codes. The attribute MPI_LASTUSEDCODE records the last issued value.
Your new error code is then defined in this class with MPI_Add_error_code, and an error string can be added
with MPI_Add_error_string:
int nonzero_code;
MPI_Add_error_code(nonzero_class,&nonzero_code);
MPI_Add_error_string(nonzero_code,"Attempting to send zero buffer");
You can then call an error handler with this code. For instance to have a wrapped send routine that will
not send zero-sized messages:
// errorclass.c
int MyPI_Send( void *buffer,int n,MPI_Datatype type, int target,int tag,MPI_Comm comm) {
if (n==0)
MPI_Comm_call_errhandler( comm,nonzero_code );
MPI_Ssend(buffer,n,type,target,tag,comm);
return MPI_SUCCESS;
};
Here we used the default error handler associated with the communicator, but one can create a different one
with MPI_Comm_create_errhandler and install it with MPI_Comm_set_errhandler.
which gives:
Trying to send buffer of length 1
.. success
Trying to send buffer of length 0
Abort(1073742081) on node 0 (rank 0 in comm 0):
Fatal error in MPI_Comm_call_errhandler: Attempting to send zero buffer
the wait call does not involve the buffer, so the compiler can translate this into
call MPI_Isend( buf, ..., request )
register = buf(1)
call MPI_Wait(request)
print *,register
Preventing this is possible with a Fortran 2018 mechanism. First of all, the buffer should be declared
asynchronous:
<type>,Asynchronous :: buf
and one introduces
IF (.NOT. MPI_ASYNC_PROTECTS_NONBLOCKING) &
CALL MPI_F_SYNC_REG( buf )
15.4 Progress
The concept asynchronous progress describes that MPI messages continue on their way through the net-
work, while the application is otherwise busy.
The problem here is that, unlike straight MPI_Send and MPI_Recv calls, communication of this sort can
typically not be off-loaded to the network card, so different mechanisms are needed.
This can happen in a number of ways:
• Compute nodes may have a dedicated communications processor. The Intel Paragon was of this
design; modern multicore processors are a more efficient realization of this idea.
• The MPI library may reserve a core or thread for communications processing. This is imple-
mentation dependent; see Intel MPI information below.
• Reserving a core, or a thread in a continuous busy-wait spin loop, takes away possible perfor-
mance from the code. For this reason, Ruhela et al. [24] propose using a pthreads signal to wake
up the progress thread.
• Absent such dedicated resources, the application can force MPI to make progress by occasional
calls to a polling routine such as MPI_Iprobe.
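A sketch of this last option, where do_some_local_work and local_work_done are hypothetical stand-ins for the application's computation:
  int flag;
  while ( !local_work_done() ) {  // hypothetical: is the computation finished?
    do_some_local_work();         // hypothetical: a chunk of work without MPI calls
    MPI_Iprobe( MPI_ANY_SOURCE,MPI_ANY_TAG,comm, &flag,MPI_STATUS_IGNORE );
    // the value of `flag' is not used: the call itself lets MPI make progress
  }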
Remark 23 The MPI_Probe call is somewhat similar, in spirit if not quite in functionality, to MPI_Test. How-
ever, they behave differently with respect to progress. Quoting the standard:
The MPI implementation of MPI_Probe and MPI_Iprobe needs to guarantee progress: if a
call to MPI_Probe has been issued by a process, and a send that matches the probe has
been initiated by some process, then the call to MPI_Probe will return.
In other words: probing causes MPI to make progress. On the other hand,
A call to MPI_Test returns flag = true if the operation identified by request is com-
plete.
In other words, if progress has been made, then testing will report completion, but by itself it does not cause
completion.
A similar problem arises with passive target synchronization: it is possible that the origin process may
hang until the target process makes an MPI call.
The following commands force progress: MPI_Win_test, MPI_Request_get_status.
Intel note. Only available with the release_mt and debug_mt versions of the Intel MPI library. Set
I_MPI_ASYNC_PROGRESS to 1 to enable asynchronous progress threads, and
I_MPI_ASYNC_PROGRESS_THREADS to set the number of progress threads.
See https://software.intel.com/en-us/
mpi-developer-guide-linux-asynchronous-progress-control,
https://software.intel.com/en-us/
mpi-developer-reference-linux-environment-variables-for-asynchronous-progress-control
Progress issues play a role with MPI_Test, MPI_Request_get_status, and MPI_Win_test.
15.6.1 Timing
MPI has a wall clock timer: MPI_Wtime (figure 15.13), which gives the number of seconds from a certain
point in the past. (Note the absence of the error parameter in the Fortran call.)
MPI.Wtime()
MPI.Wtick()
double t;
t = MPI_Wtime();
for (int n=0; n<NEXPERIMENTS; n++) {
// do something;
}
t = MPI_Wtime()-t; t /= NEXPERIMENTS;
Timing in parallel is a tricky issue. For instance, most clusters do not have a central clock, so you can
not relate start and stop times on one process to those on another. You can test for a global clock by
querying the attribute MPI_WTIME_IS_GLOBAL:
int *v,flag;
MPI_Attr_get( comm, MPI_WTIME_IS_GLOBAL, &v, &flag );
if (mytid==0) printf("Time synchronized? %d->%d\n",flag,*v);
Normally you don’t worry about the starting point for this timer: you call it before and after an event and
subtract the values.
t = MPI_Wtime();
// something happens here
t = MPI_Wtime()-t;
If you execute this on a single processor you get fairly reliable timings, except that you would need to
subtract the overhead for the timer. This is the usual way to measure timer overhead:
t = MPI_Wtime();
// absolutely nothing here
t = MPI_Wtime()-t;
Exercise 15.1. This scheme also has some overhead associated with it. How would you
measure that?
No matter what sort of timing you are doing, it is good to know the accuracy of your timer. The routine
MPI_Wtick gives the smallest possible timer increment. If you find that your timing result is too close to
this ‘tick’, you need to find a better timer (for CPU measurements there are cycle-accurate timers), or you
need to increase your running time, for instance by increasing the amount of data.
Eager limit Short blocking messages are handled by a simpler mechanism than longer ones. The limit on
what is considered ‘short’ is known as the eager limit (section 4.1.4.2), and you could tune your code by
increasing its value. However, note that a process will likely have a buffer accommodating eager sends for
every single other process. This may eat into your available memory.
Blocking versus nonblocking The issue of blocking versus nonblocking communication is something
of a red herring. While nonblocking communication allows latency hiding, we can not consider it an
alternative to blocking sends, since replacing nonblocking by blocking calls will usually give deadlock.
Still, even if you use nonblocking communication for the mere avoidance of deadlock or serialization
(section 4.1.4.3), bear in mind the possibility of overlap of communication and computation. This also
brings us to our next point.
Looking at it the other way around, in a code with blocking sends you may get better performance from
nonblocking, even if that is not structurally necessary.
Progress MPI is not magically active in the background, especially if the user code is doing scalar work
that does not involve MPI. As sketched in section 15.4, there are various ways of ensuring that latency
hiding actually happens.
Persistent sends If a communication between the same pair of processes, involving the same buffer,
happens regularly, it is possible to set up a persistent communication. See section 5.1.
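A minimal sketch of the sending side (the receiving side is analogous with MPI_Recv_init); sendbuf, count, receiver, and nsteps are assumed to be set up elsewhere:
  MPI_Request persistent_req;
  MPI_Send_init( sendbuf,count,MPI_DOUBLE, receiver,/* tag: */ 0,comm, &persistent_req );
  for (int it=0; it<nsteps; it++) {
    /* ... fill sendbuf for this iteration ... */
    MPI_Start( &persistent_req );  // initiate the pre-set-up send
    /* ... computation that does not touch sendbuf ... */
    MPI_Wait( &persistent_req,MPI_STATUS_IGNORE );
  }
  MPI_Request_free( &persistent_req );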
Buffering MPI uses internal buffers, and the copying from user data to these buffers may affect perfor-
mance. For instance, derived types (section 6.3) can typically not be streamed straight through the network
(this requires special hardware support [17]) so they are first copied. Somewhat surprisingly, we find that
buffered communication (section 5.5) does not help. Perhaps MPI implementors have not optimized this
mode since it is so rarely used.
This issue is extensively investigated in [9].
Graph topology and neighborhood collectives Load balancing and communication minimization are
important in irregular applications. There are dedicated programs for this (ParMetis, Zoltan), and libraries
such as PETSc may offer convenient access to such capabilities.
In the declaration of a graph topology (section 11.2) MPI is allowed to reorder processes, which could
be used to support such activities. It can also serve for better message sequencing when neighborhood
collectives are used.
Network issues In the discussion so far we have assumed that the network is a perfect conduit for data.
However, there are issues of port design, in particular oversubscription, that adversely affect
performance. While in an ideal world it may be possible to set up routing to avoid this, in the actual
practice of a supercomputer cluster, network contention or message collisions from different user jobs are
hard to avoid.
Offloading and onloading There are different philosophies of network card design: Mellanox, being a
network card manufacturer, believes in off-loading network activity to the Network Interface Card (NIC),
while Intel, being a processor manufacturer, believes in ‘on-loading’ activity to the process. There are
arguments either way.
Either way, investigate the capabilities of your network.
15.6.4 MPIR
MPIR is the informally specified debugging interface for process acquisition and message queue extrac-
tion.
15.7 Determinism
MPI processes are only synchronized to a certain extent, so you may wonder what guarantees there are
that running a code twice will give the same result. You need to consider two cases: first of all, if the
two runs are on different numbers of processors there are already numerical problems; see HPC book,
section-3.6.5.
Let us then limit ourselves to two runs on the same set of processors. In that case, MPI is deterministic as
long as you do not use wildcards such as MPI_ANY_SOURCE. Formally, MPI messages are ‘nonovertaking’: two
messages between the same sender-receiver pair will arrive in sequence. Actually, they may not arrive in
sequence: they are matched in sequence in the user program. If the second message is much smaller than
the first, it may actually arrive earlier in the lower transport layer.
if ( receiving ) {
  MPI_Irecv()   // post nonblocking receive
  MPI_Barrier() // synchronize
} else if ( sending ) {
  MPI_Barrier() // synchronize
  MPI_Rsend()   // send data fast
}
When the barrier is reached, the receive has been posted, so it is safe to do a ready send. However, global
barriers are not a good idea. Instead you would just synchronize the two processes involved.
Exercise 15.2. Give pseudo-code for a scheme where you synchronize the two processes
through the exchange of a blocking zero-size message.
Letting MPI processes interact with the environment is not entirely straightforward. For instance, shell
input redirection as in
mpiexec -n 2 mpiprogram < someinput
may not work.
Instead, use a script programscript that has one parameter:
#!/bin/bash
mpirunprogram < $1
and run this in parallel:
mpiexec -n 2 programscript someinput
The stdout and stderr streams of an MPI process are returned through the ssh tunnel. Thus they can be
caught as the stdout/err of mpiexec.
// outerr.c
fprintf(stdout,"This goes to std out\n");
fprintf(stderr,"This goes to std err\n");
#!/bin/bash
rank=$PMI_RANK
half=$(( ${PMI_SIZE} / 2 ))
• MPI_BOTTOM
• MPI_STATUS_IGNORE
• MPI_STATUSES_IGNORE
• MPI_ERRCODES_IGNORE
• MPI_IN_PLACE
• MPI_ARGV_NULL
• MPI_ARGVS_NULL
• MPI_UNWEIGHTED
• MPI_WEIGHTS_EMPTY
Assorted constants:
• MPI_PROC_NULL and other ..._NULL constants.
• MPI_ANY_SOURCE
• MPI_ANY_TAG
• MPI_UNDEFINED
• MPI_BSEND_OVERHEAD
• MPI_KEYVAL_INVALID
• MPI_LOCK_EXCLUSIVE
• MPI_LOCK_SHARED
• MPI_ROOT
(This section was inspired by http://blogs.cisco.com/performance/mpi-outside-of-c-and-fortran.)
MPI_Bcast(&first_tid,1,MPI_INT, nprocs-1,comm
);
if (procno!=first_tid) {
MPI_Cancel(&request);
fprintf(stderr,"[%d] canceled\n",procno);
}
}
15.11 Literature
Online resources:
• MPI 1 Complete reference:
http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
• Official MPI documents:
http://www.mpi-forum.org/docs/
• List of all MPI routines:
http://www.mcs.anl.gov/research/projects/mpi/www/www3/
Tutorial books on MPI:
• Using MPI [11] by some of the original authors.
#ifndef FREQUENCY
#define FREQUENCY -1
#endif
/*
* Standard initialization
*/
MPI_Comm comm = MPI_COMM_WORLD;
int nprocs, procid;
MPI_Init(&argc,&argv);
MPI_Comm_set_errhandler(comm,MPI_ERRORS_RETURN);
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procid);
int ierr;
if (nprocs<2) {
printf("This test needs at least 2 processes, not %d\n",nprocs);
MPI_Abort(comm,0);
}
int sender = 0, receiver = nprocs-1;
if (procid==0) {
printf("Running on comm world of %d procs; communicating between %d--%d\n",
nprocs,sender,receiver);
}
int tag_upperbound;
void *v; int flag=1;
ierr = MPI_Comm_get_attr(comm,MPI_TAG_UB,&v,&flag);
tag_upperbound = *(int*)v;
if (ierr!=MPI_SUCCESS) {
printf("Error getting attribute: return code=%d\n",ierr);
if (ierr==MPI_ERR_COMM)
printf("invalid communicator\n");
if (ierr==MPI_ERR_KEYVAL)
printf("errorneous keyval\n");
MPI_Abort(comm,0);
}
if (!flag) {
printf("Could not get keyval\n");
MPI_Abort(comm,0);
} else {
if (procid==sender)
printf("Determined tag upperbound: %d\n",tag_upperbound);
}
MPI_Finalize();
return 0;
}
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()
if nprocs<4:
prin( "Need 4 procs at least")
sys.exit(1)
tag_upperbound = comm.Get_attr(MPI.TAG_UB)
if procid==0:
print("Determined tag upperbound: {}".format(tag_upperbound))
#include "globalinit.c"
if (procno==nprocs-1)
MPI_Abort(comm,37);
MPI_Finalize();
return 0;
}
#include "globalinit.c"
MPI_Finalize();
return 0;
}
MPI Examples
Figure 16.1 illustrates the ‘intra’ (left) and ‘inter’ (right) scheme for letting all processes communicate in
pairs. With intra-communication, the messages do not rely on the network so we expect to measure high
bandwidth. With inter-communication, all messages go through the network and we expect to measure
a lower number.
However, there are more issues to explore, which we will now do.
Figure 16.2: Time (msec) as a function of core count. Left: on node. Right: between nodes.
The halfbandwidth is measured as the total number of bytes sent divided by the total time. Both numbers
are measured outside a repeat loop that does each transaction 100 times.
auto duration = myclock::now()-start_time;
auto microsec_duration = std::chrono::duration_cast<std::chrono::microseconds>(duration);
int total_ping_count;
MPI_Allreduce(&pingcount,&total_ping_count,1,MPI_INT,MPI_SUM,comm);
long bytes = buffersize * sizeof(double) * total_ping_count;
float fsec = microsec_duration.count() * 1.e-6,
halfbandwidth = bytes / fsec;
In the left graph of figure 16.2 we see that the time for 𝑃/2 simultaneous pingpongs stays fairly constant.
This reflects the fact that, on node, the ping pong operations are data copies, which proceed simultane-
ously. Thus, the time is independent of the number of cores that are moving data. The exception is the
final data point: with all cores active we take up more than the available bandwidth on the node.
In the right graph, each pingpong is inter-node, going through the network. Here we see the runtime
go up linearly with the number of pingpongs, or somewhat worse than that. This reflects the fact that
network transfers are done sequentially. (Actually, messages can be broken up into packets, as long as they
satisfy MPI message semantics. This does not alter our argument.)
Figure 16.3: Bandwidth (Gbyte/sec) as a function of buffer size. Left: on node. Right: between nodes.
Next we explore the influence of the buffer size on performance. The right graph in figure 16.3 show that
inter-node bandwidth is almost independent of the buffer size. This means that even our smallest buffer
is large enough to overcome any MPI startup cost.
On the other hand, the left graph shows a more complicated pattern. Initially, the bandwidth increases, possi-
bly reflecting the decreasing importance of MPI startup. For the final data points, however, performance
drops again. This is due to the fact that the data size overflows cache size, and we are dominated by
bandwidth from memory, rather than cache.
OPENMP
This section of the book teaches OpenMP (‘Open Multi Processing’), the dominant model for shared mem-
ory programming in science and engineering. It will instill the following competencies.
Basic level:
• Threading model: the student will understand the threading model of OpenMP, and the relation
between threads and cores (chapter 17); the concept of a parallel region and private versus
shared data (chapter 18).
• Loop parallelism: the student will be able to parallelize loops, and understand the impediments
to parallelization and iteration scheduling (chapter 19); reductions (chapter 20).
• The student will understand the concept of worksharing constructs, and its implications for
synchronization (chapter 21).
Intermediate level:
• The student will understand the abstract notion of synchronization, its implementations in
OpenMP, and implications for performance (chapter 23).
• The student will understand the task model as underlying the thread model, be able to write code
that spawns tasks, and be able to distinguish when tasks are needed versus simpler worksharing
constructs (chapter 24).
• The student will understand thread/code affinity, how to control it, and possible implications
for performance (chapter 25).
Advanced level:
• The student will understand the OpenMP memory model, and sequential consistency (chap-
ter 26).
• The student will understand SIMD processing, the extent to which compilers do this outside of
OpenMP, and how OpenMP can specify further opportunities for SIMD-ization (chapter 27).
• The student will understand offloading to Graphics Processing Units (GPUs), and the OpenMP
directives for effecting this (chapter 28).
This chapter explains the basic concepts of OpenMP, and helps you get started on running your first
OpenMP program.
Figure 17.1 pictures a typical design of a node: within one enclosure you find two sockets: single processor
chips. Your personal laptop or desktop computer will probably have one socket, most supercomputers
have nodes with two or four sockets (the picture is of a Stampede node with two sockets)1 .
To see where OpenMP operates we need to dig into the sockets. Figure 17.2 shows a picture of an Intel
Sandybridge socket. You recognize a structure with eight cores: independent processing units, that all have
access to the same memory. (In figure 17.1 you saw four memory chips, or DIMMs, attached to each of
the two sockets; all of the sixteen cores have access to all that memory.)
To summarize the structure of the architecture that OpenMP targets:
• A node has up to four sockets;
• each socket has up to 60 cores;
• each core is an independent processing unit, with access to all the memory on the node.
1. In that picture you also see a co-processor: OpenMP is increasingly targeting those too.
point these copies go away and the original thread is left (‘join’), but while the team of threads created by
the fork exists, you have parallelism available to you. The part of the execution between fork and join is
known as a parallel region.
Figure 17.3 gives a simple picture of this: a thread forks into a team of threads, and these threads themselves
can fork again.
The threads that are forked are all copies of the master thread: they have access to all that was computed
so far; this is their shared data. Of course, if the threads were completely identical the parallelism would
be pointless, so they also have private data, and they can identify themselves: they know their thread
number. This allows you to do meaningful parallel computations with threads.
This brings us to the third important concept: that of work sharing constructs. In a team of threads, initially
there will be replicated execution; a work sharing construct divides available parallelism over the threads.
So there you have it: OpenMP uses teams of threads, and inside a parallel region the
work is distributed over the threads with a work sharing construct. Threads can access
shared data, and they have some private data.
An important difference between OpenMP and MPI is that parallelism in OpenMP is dynamically acti-
vated by a thread spawning a team of threads. Furthermore, the number of threads used can differ between
parallel regions, and threads can create threads recursively. This is known as dynamic mode. By con-
trast, in an MPI program the number of running processes is (mostly) constant throughout the run, and
determined by factors external to the program.
If you use the OpenMP library routines, you need to include the header file:
#include <omp.h>
in C, and
use omp_lib
or
#include "omp_lib.h"
for Fortran.
OpenMP is handled by extensions to your regular compiler, typically by adding an option to your com-
mandline:
# gcc
gcc -o foo foo.c -fopenmp
# Intel compiler
icc -o foo foo.c -qopenmp
If you have separate compile and link stages, you need that option in both.
When you use the OpenMP compiler option, the OpenMP macro (a cpp macro) _OPENMP will be defined.
Thus, you can have conditional compilation by writing
#ifdef _OPENMP
...
#else
...
#endif
The value of this macro is a decimal value yyyymm denoting the OpenMP standard release that this com-
piler supports; see section 29.7.
Fortran note 12: openmp version. The parameter openmp_version contains the version in yyyymm format.
17.3.1 Directives
OpenMP is not magic, so you have to tell it when something can be done in parallel. This is mostly done
through directives; additional specifications can be done through library calls.
In C/C++ the pragma mechanism is used: annotations for the benefit of the compiler that are otherwise
not part of the language. This looks like:
#pragma omp somedirective clause(value,othervalue)
statement;
with
• the #pragma omp sentinel to indicate that an OpenMP directive is coming;
• a directive, such as parallel;
• and possibly clauses with values.
• After the directive comes either a single statement or a block in curly braces.
Directives in C/C++ are case-sensitive. Directives can be broken over multiple lines by escaping the line
end.
Fortran note 13: openmp sentinel. The sentinel in Fortran looks like a comment:
!$omp directive clause(value)
statements
!$omp end directive
The difference with the C directive is that Fortran can not have a block, so there is an explicit
end-of directive line.
If you break a directive over more than one line, all but the last line need to have a continuation
character, and each line needs to have the sentinel:
!$OMP parallel do &
!$OMP copyin(x),copyout(y)
The directives are case-insensitive. In Fortran fixed-form source files (which is the only possibility
in Fortran77), c$omp and *$omp are allowed too.
Exercise 17.1. Write a ‘hello world’ program, where the print statement is in a parallel
region. Compile and run.
Run your program with different values of the environment variable
OMP_NUM_THREADS. If you know how many cores your machine has, can you set the
value higher?
Compile and run again. (In fact, run your program a number of times.) Do you see
something unexpected? Can you think of an explanation?
If the above puzzles you, read about race conditions in section 9.3.7.
We will go into much more detail in chapter 18.
Fortran has simpler rules, since it does not have blocks inside blocks.
OpenMP has similar rules concerning data in parallel regions and other OpenMP constructs. First of all,
data is visible in enclosed scopes:
main() {
int x;
#pragma omp parallel
{
// you can use and set `x' here
}
printf("x=%e\n",x); // value depends on what
// happened in the parallel region
}
There is an important difference: each thread in the team gets its own instance of the enclosed variable.
1. You can declare a parallel region and split one thread into a whole team of threads. We will
discuss this next in chapter 18. The division of the work over the threads is controlled by work
sharing construct; see chapter 21.
2. Alternatively, you can use tasks, indicating one parallel activity at a time. You will see this
in section 24.
Note that OpenMP only indicates how much parallelism is present; whether independent activities are in
fact executed in parallel is a runtime decision.
Declaring a parallel region tells OpenMP that a team of threads can be created. The actual size of the team
depends on various factors (see section 29.1 for variables and functions mentioned in this section).
• The environment variable OMP_NUM_THREADS limits the number of threads that can be created.
• If you don’t set this variable, you can also set this limit dynamically with the library routine
omp_set_num_threads. This routine takes precedence over the aforementioned environment vari-
able if both are specified.
• A limit on the number of threads can also be set as a num_threads clause on a parallel region:
#pragma omp parallel num_threads(ndata)
If you specify a greater amount of parallelism than the hardware supports, the runtime system will prob-
ably ignore your specification and choose a lower value. To ask how much parallelism is actually used in
your parallel region, use omp_get_num_threads. To query these hardware limits, use omp_get_num_procs. You
can query the maximum number of threads with omp_get_max_threads. This equals the value of OMP_NUM_THREADS,
not the number of actually active threads in a parallel region.
// proccount.c
void nested_report() {
#pragma omp parallel
#pragma omp master
  printf("Nested    : %2d cores and %2d threads out of max %2d\n",
         omp_get_num_procs(),omp_get_num_threads(),omp_get_max_threads());
}

  int env_num_threads;
#pragma omp parallel
#pragma omp master
  {
    env_num_threads = omp_get_num_threads();
    printf("Parallel  : %2d cores and %2d threads out of max %2d\n",
           omp_get_num_procs(),omp_get_num_threads(),omp_get_max_threads());
  }

#pragma omp parallel num_threads(2*env_num_threads)
#pragma omp master
  {
    printf("Double    : %2d cores and %2d threads out of max %2d\n",
           omp_get_num_procs(),omp_get_num_threads(),omp_get_max_threads());
  }

#pragma omp parallel
#pragma omp master
  nested_report();
[c:48] for t in 1 2 4 8 16 ; do OMP_NUM_THREADS=$t ./proccount ; done
---------------- Parallelism report ----------------
Sequential: count 4 cores and 1 threads out of max 1
Parallel  : count 4 cores and 1 threads out of max 1
Parallel  : count 4 cores and 1 threads out of max 1
---------------- Parallelism report ----------------
Sequential: count 4 cores and 1 threads out of max 2
Parallel  : count 4 cores and 2 threads out of max 2
Parallel  : count 4 cores and 1 threads out of max 2
---------------- Parallelism report ----------------
Sequential: count 4 cores and 1 threads out of max 4
Parallel  : count 4 cores and 4 threads out of max 4
Parallel  : count 4 cores and 1 threads out of max 4
---------------- Parallelism report ----------------
Sequential: count 4 cores and 1 threads out of max 8
Parallel  : count 4 cores and 8 threads out of max 8
Parallel  : count 4 cores and 1 threads out of max 8
---------------- Parallelism report ----------------
Sequential: count 4 cores and 1 threads out of max 16
Parallel  : count 4 cores and 16 threads out of max 16
Parallel  : count 4 cores and 1 threads out of max 16
Another limit on the number of threads is imposed when you use nested parallel regions. This can arise if
you have a parallel region in a subprogram which is sometimes called sequentially, sometimes in parallel.
For details, see section 18.2.
It would be pointless to have the block be executed identically by all threads. One way to get a meaningful
parallel code is to use the function omp_get_thread_num, to find out which thread you are, and execute work
that is individual to that thread. This function gives a number relative to the current team; recall from
figure 17.3 that new teams can be created recursively.
For instance, if your program computes
result = f(x)+g(x)+h(x)
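then a minimal sketch (assuming f, g, and h are thread-safe functions of a double) could let each thread compute one term and sum the partial results afterwards:
  double x = 1.5, parts[3] = {0.,0.,0.};
#pragma omp parallel num_threads(3)
  {
    int t = omp_get_thread_num();
    if      (t==0) parts[0] = f(x); // thread 0 computes the first term
    else if (t==1) parts[1] = g(x); // thread 1 the second
    else if (t==2) parts[2] = h(x); // thread 2 the third
  }
  double result = parts[0]+parts[1]+parts[2];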
The first thing we want to do is create a team of threads. This is done with a parallel region. Here is a very
simple example:
// hello.c
#pragma omp parallel
{
int t = omp_get_thread_num();
printf("Hello world from %d!\n",t);
}
Remark 25 In future versions of OpenMP, the master thread will be called the primary thread. In 5.1 the
master construct will be deprecated, and masked (with added functionality) will take its place. In 6.0 mas-
ter will disappear from the Spec, including proc_bind master “variable” and combined master constructs
(master taskloop, etc.)
Exercise 18.1. Make a full program based on this fragment. Insert different print statements
before, inside, and after the parallel region. Run this example. How many times is
each print statement executed?
By default, the nested parallel region will have only one thread. To allow nested thread creation, use the
environment variable OMP_MAX_ACTIVE_LEVELS (default: 1) to set the number of levels of parallel nesting.
Equivalently, there are functions omp_set_max_active_levels and omp_get_max_active_levels:
OMP_MAX_ACTIVE_LEVELS=3
or
void omp_set_max_active_levels(int);
int omp_get_max_active_levels(void);
Nested parallelism can happen with nested loops, but it’s also possible to have a sections construct and a
loop nested. Example:
Code:
// sectionnest.c
#pragma omp parallel sections reduction(+:s)
{
#pragma omp section
  {
    double s1=0;
    omp_set_num_threads(team);
#pragma omp parallel for reduction(+:s1)
    for (int i=0; i<N; i++) {
Output:
Nesting: false
Threads: 2, speedup: 2.0
Threads: 4, speedup: 2.0
Threads: 8, speedup: 2.0
Threads: 12, speedup: 2.0
Nesting: true
Threads: 2, speedup: 1.8
Threads: 4, speedup: 3.7
Threads: 8, speedup: 6.9
Threads: 12, speedup: 10.4
the body of the function f falls in the dynamic scope of the parallel region, so the for loop will be paral-
lelized.
If the function may be called both from inside and outside parallel regions, you can test which is the case
with omp_in_parallel.
omp_set_max_active_levels( n )
n = omp_get_max_active_levels()
OMP_THREAD_LIMIT=123
n = omp_get_thread_limit()
omp_set_max_active_levels
omp_get_max_active_levels
omp_get_level
omp_get_active_level
omp_get_ancestor_thread_num
omp_get_team_size(level)
return 0;
}
use omp_lib
integer :: nthreads,mythread
!$omp parallel
nthreads = omp_get_num_threads()
mythread = omp_get_thread_num()
write(*,'("Hello from",i3," out of",i3)') mythread,nthreads
!$omp end parallel
#include <omp.h>
{
int t = omp_get_thread_num();
stringstream proctext;
proctext << "Hello world from " << t << endl;
cerr << proctext.str();
}
return 0;
}
This has several advantages. For one, you don’t have to calculate the loop bounds for the threads yourself,
but you can also tell OpenMP to assign the loop iterations according to different schedules (section 19.3).
Fortran note 14: omp do pragma. The for pragma only exists in C; there is a correspondingly named do
pragma in Fortran.
!$omp parallel
!$omp do
do i=1,N
  ! something with i
end do
!$omp end do
!$omp end parallel
The code before and after the loop is executed identically in each thread; the loop iterations are spread
over the four threads.
Note that the do and for pragmas do not create a team of threads: they take the team of threads that is
active, and divide the loop iterations over them. This means that the omp for or omp do directive needs
to be inside a parallel region.
As an illustration:
Code:
// parfor.c
#pragma omp parallel
{
  int
    nthreads = omp_get_num_threads(),
    thread_num = omp_get_thread_num();
  printf("Threads entering parallel region: %d\n",
         nthreads);
#pragma omp for
  for (int iter=0; iter<nthreads; iter++)
    printf("thread %d executing iter %d\n",
           thread_num,iter);
}
Output:
%%%% equal thread/core counts %%%%
Threads entering parallel region: 4
thread 3 executing iter 3
Threads entering parallel region: 4
thread 0 executing iter 0
Threads entering parallel region: 4
thread 2 executing iter 2
Threads entering parallel region: 4
thread 1 executing iter 1
Exercise 19.2. What would happen in the above example if you increase the number of
threads to be larger than the number of cores?
It is also possible to have a combined omp parallel for or omp parallel do directive.
#pragma omp parallel for
for (i=0; .....
Remark 27 The loop index needs to be an integer value for the loop to be parallelizable. Unsigned values
are allowed as of OpenMP-3.
19.1.2 Exercises
Exercise 19.3. Compute 𝜋 by numerical integration. We use the fact that 𝜋 is the area of the
unit circle, and we approximate this by computing the area of a quarter circle using
Riemann sums.
• Let $f(x)=\sqrt{1-x^2}$ be the function that describes the quarter circle for $x=0\ldots 1$;
• Then we compute
$$ \pi/4 \approx \sum_{i=0}^{N-1} \Delta x\, f(x_i) \quad\text{where } x_i=i\Delta x \text{ and } \Delta x=1/N $$
Write a program for this, and parallelize it using OpenMP parallel for directives.
1. Put a parallel directive around your loop. Does it still compute the right
result? Does the time go down with the number of threads? (The answers
should be no and no.)
2. Change the parallel to parallel for (or parallel do). Now is the result
correct? Does execution speed up? (The answers should now be no and yes.)
3. Put a critical directive in front of the update. (Yes and very much no.)
4. Remove the critical and add a clause reduction(+:quarterpi) to the for
directive. Now it should be correct and efficient.
Use different numbers of cores and compute the speedup you attain over the
sequential computation. Is there a performance difference between the OpenMP
code with 1 thread and the sequential code?
Remark 28 In this exercise you may have seen the runtime go up a couple of times where you weren’t
expecting it. The issue here is false sharing; see HPC book, section-3.6.5 for more explanation.
19.2 An example
To illustrate the speedup of perfectly parallel calculations, we consider a simple code that applies the same
calculation to each element of an array.
All tests are done on the TACC Frontera cluster, which has dual-socket Intel Cascade Lake nodes, with a
total of 56 cores. We control affinity by setting OMP_PROC_BIND=true.
Here is the essential code fragment:
// speedup.c
#pragma omp parallel for
for (int ip=0; ip<N; ip++) {
for (int jp=0; jp<M; jp++) {
double f = sin( values[ip] );
values[ip] = f;
}
}
Exercise 19.4. Verify that the outer loop is parallel, but the inner one is not.
Exercise 19.5. Compare the time for the sequential code and the single-threaded OpenMP
code. Try different optimization levels, and different compilers if you have them.
[Figures: speedup as a function of thread count, up to 112 threads, for array sizes 𝑁 = 200, 2000, 20000.]
Tests not reported here show exactly the same speedup as the C code.
The first distinction we now have to make is between static and dynamic schedules. With static schedules,
the iterations are assigned purely based on the number of iterations and the number of threads (and the
chunk parameter; see later). In dynamic schedules, on the other hand, iterations are assigned to threads
that are unoccupied. Dynamic schedules are a good idea if iterations take an unpredictable amount of
time, so that load balancing is needed.
Figure 19.4 illustrates this: assume that each core gets assigned two (blocks of) iterations and these blocks
take gradually less and less time. You see from the left picture that thread 1 gets two fairly long blocks,
whereas thread 4 gets two short blocks, thus finishing much earlier. (This phenomenon of threads having
unequal amounts of work is known as load imbalance.) On the other hand, in the right figure thread 4
gets block 5, since it finishes the first set of blocks early. The effect is a perfect load balancing.
The default static schedule is to assign one consecutive block of iterations to each thread. If you want
different sized blocks you can define a chunk size:
#pragma omp for schedule(static[,chunk])
(where the square brackets indicate an optional argument). With static scheduling, the compiler will de-
termine the assignment of loop iterations to the threads at compile time, so, provided the iterations take
roughly the same amount of time, this is the most efficient at runtime.
The choice of a chunk size is often a balance between the low overhead of having only a few chunks,
versus the load balancing effect of having smaller chunks.
Exercise 19.6. Why is a chunk size of 1 typically a bad idea? (Hint: think about cache lines,
and read HPC book, section-1.4.1.2.)
In dynamic scheduling OpenMP will put blocks of iterations (the default chunk size is 1) in a task queue,
and the threads take one of these tasks whenever they are finished with the previous.
#pragma omp for schedule(dynamic[,chunk])
While this schedule may give good load balancing if the iterations take very differing amounts of time to
execute, it does carry runtime overhead for managing the queue of iteration tasks.
Finally, there is the guided schedule, which gradually decreases the chunk size. The thinking here is
that large chunks carry the least overhead, but smaller chunks are better for load balancing. The various
schedules are illustrated in figure 19.5.
If you don’t want to decide on a schedule in your code, you can specify the runtime schedule. The actual
schedule will then at runtime be read from the OMP_SCHEDULE environment variable. You can even just leave
it to the runtime library by specifying auto.
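To make the differences concrete, here is a small test program (a sketch; the work function f and the sizes are ad hoc choices, not taken from the book's codes) in which the work per iteration grows with the iteration number, so a static schedule suffers from load imbalance while dynamic and guided schedules compensate for it:

#include <math.h>
#include <stdio.h>
#include <omp.h>

#define N 10000

/* work that varies strongly per iteration */
double f(int i) {
  double s = 0.;
  for (int k=0; k<i; k++) s += sin((double)k);
  return s;
}

int main() {
  static double y[N];
  double t;

  t = omp_get_wtime();
  #pragma omp parallel for schedule(static)
  for (int i=0; i<N; i++) y[i] = f(i);       // blocks decided up front
  printf("static : %6.3f s\n",omp_get_wtime()-t);

  t = omp_get_wtime();
  #pragma omp parallel for schedule(dynamic,4)
  for (int i=0; i<N; i++) y[i] = f(i);       // chunks of 4 handed out on demand
  printf("dynamic: %6.3f s\n",omp_get_wtime()-t);

  t = omp_get_wtime();
  #pragma omp parallel for schedule(guided)
  for (int i=0; i<N; i++) y[i] = f(i);       // chunk size decreases over time
  printf("guided : %6.3f s\n",omp_get_wtime()-t);

  return 0;
}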
Exercise 19.7. We continue with exercise 19.3. We add ‘adaptive integration’: where
needed, the program refines the step size.¹ This means that the iterations no longer
take a predictable amount of time.
1. Use the omp parallel for construct to parallelize the loop. As in the
previous lab, you may at first see an incorrect result. Use the reduction clause
to fix this.
2. Your code should now see a decent speedup, but possibly not for all cores. It is
possible to get completely linear speedup by adjusting the schedule.
Start by using schedule(static,n). Experiment with values for 𝑛. When can
you get a better speedup? Explain this.
1. It doesn’t actually do this in a mathematically sophisticated way, so this code is more for the sake of the example.
Its mirror call is omp_set_schedule, which sets the schedule that is used when the runtime schedule value is specified.
It is in effect equivalent to setting the OMP_SCHEDULE environment variable.
void omp_set_schedule (omp_sched_t kind, int modifier);
guided (value 3): the modifier parameter is the chunk size; set programmatically by using value omp_sched_guided.
runtime: use the value of the OMP_SCHEDULE environment variable; set programmatically by using value omp_sched_runtime.
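For instance (a minimal sketch), the following selects the guided schedule with chunk size 8 programmatically, which has the same effect as setting OMP_SCHEDULE to guided,8 in the environment:

#include <stdio.h>
#include <omp.h>

int main() {
  // equivalent to OMP_SCHEDULE="guided,8"
  omp_set_schedule(omp_sched_guided,8);

  #pragma omp parallel for schedule(runtime)
  for (int i=0; i<100; i++)
    printf("iteration %3d done by thread %d\n",i,omp_get_thread_num());

  return 0;
}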
19.4 Reductions
So far we have focused on loops with independent iterations. Reductions are a common type of loop with
dependencies. There is an extended discussion of reductions in section 20.
all $N^2$ iterations are independent, but a regular omp for directive will only parallelize one level. The
collapse clause will parallelize more than one level:
#pragma omp for collapse(2)
for ( i=0; i<N; i++ )
for ( j=0; j<N; j++ )
A[i][j] = B[i][j] + C[i][j];
It is only possible to collapse perfectly nested loops, that is, the loop body of the outer loop can consist
only of the inner loop; there can be no statements before or after the inner loop in the loop body of the
outer loop. That is, the two loops in
for (i=0; i<N; i++) {
y[i] = 0.;
for (j=0; j<N; j++)
y[i] += A[i][j] * x[j]
}
are not perfectly nested, because of the statement y[i] = 0. before the inner loop, so they cannot be collapsed.
Assuming that the src and dst array are disjoint, which loops are parallel, and how
many levels can you collapse?
it is not true that all function evaluations happen more or less at the same time, followed by all print
statements. The print statements can really happen in any order. The ordered clause coupled with the
ordered directive can force execution in the right order:
#pragma omp parallel for ordered
for ( ... i ... ) {
... f(i) ...
#pragma omp ordered
printf("something with %d\n",i);
}
There is a limitation: each iteration can encounter only one ordered directive.
19.7 nowait
The implicit barrier at the end of a work sharing construct can be cancelled with a nowait clause. This has
the effect that threads that are finished can continue with the next code in the parallel region:
#pragma omp parallel
{
#pragma omp for nowait
for (i=0; i<N; i++) { ... }
// more parallel code
}
In the following example, threads that are finished with the first loop can start on the second. Note that
this requires both loops to have the same schedule. We specify the static schedule here to have an identical
scheduling of iterations over threads:
#pragma omp parallel
{
x = local_computation()
#pragma omp for schedule(static) nowait
for (i=0; i<N; i++) {
x[i] = ...
}
#pragma omp for schedule(static)
for (i=0; i<N; i++) {
y[i] = ... x[i] ...
}
}
We replace the while loop by a for loop that examines all locations:
result = -1;
#pragma omp parallel for
for (i=0; i<imax; i++) {
if (a[i]!=0 && result<0) result = i;
}
result = -1;
#pragma omp parallel for lastprivate(result)
for (i=0; i<imax; i++) {
if (a[i]!=0) result = i;
}
You have now solved a slightly different problem: the result variable contains the last location where a[i]
is nonzero.
19.9.1 Lonestar 6
Lonestar 6, dual socket AMD Milan, total 112 cores: figure 19.6.
19.9.2 Frontera
Intel Cascade Lake, dual socket, 56 cores total; figure 19.7.
For all core counts up to half the total, performance for all binding strategies seems equal. After that, close
and spread perform equally, but the speedup for the false value gives erratic numbers.
Figure 19.6: Speedup as function of thread count, Lonestar 6 cluster, different binding parameters
19.9.5 Longhorn
Dual 20-core IBM Power9 with 4 hyperthreads per core; see figure 19.10.
Unlike on the Intel processors, here we use the hyperthreads. Figure 19.10 shows a dip in the speedup at 40
threads. For higher thread counts the speedup increases to well beyond the physical core count of 40.
Figure 19.7: Speedup as function of thread count, Frontera cluster, different binding parameters
Figure 19.8: Speedup as function of thread count, Stampede2 skylake cluster, different binding parameters
Figure 19.9: Speedup as function of thread count, Stampede2 Knights Landing cluster, different binding
parameters
Figure 19.10: Speedup as function of thread count, Longhorn cluster, different binding parameters
you will find that the sum value depends on the number of threads, and is likely not the same as when you
execute the code sequentially. The problem here is the race condition involving the sum variable, since this
variable is shared between all threads.
We will discuss several strategies of dealing with this.
// sectionreduct.c
float y=0;
#pragma omp parallel reduction(+:y)
#pragma omp sections
{
#pragma omp section
y += f();
#pragma omp section
y += g();
}
Another reduction, this time over a parallel region, without any work sharing:
// reductpar.c
m = INT_MIN;
#pragma omp parallel reduction(max:m) num_threads(ndata)
{
int t = omp_get_thread_num();
int d = data[t];
m = d>m ? d : m;
};
For multiple reductions with different operators, use more than one clause.
A reduction is one of those cases where the parallel execution can have a slightly different value from the
one that is computed sequentially, because floating point operations are not associative. See HPC book,
section-3.6.5 for more explanation.
This is a good solution if the amount of serialization in the critical section is small compared to computing
the functions 𝑓 , 𝑔, ℎ. On the other hand, you may not want to do that in a loop:
double result = 0;
#pragma omp parallel
{
double local_result;
#pragma omp for
for (i=0; i<N; i++) {
local_result = f(x,i);
#pragma omp critical
result += local_result;
} // end of for loop
}
Exercise 20.1. Can you think of a small modification of this code, that still uses a critical
section, that is more efficient? Time both codes.
While this code is correct, it may be inefficient because of a phenomenon called false sharing. Even
though the threads write to separate variables, those variables are likely to be on the same cacheline (see
HPC book, section-1.4.1.2 for an explanation). This means that the cores will be wasting a lot of time and
bandwidth updating each other’s copy of this cacheline.
False sharing can be prevented by giving each thread its own cacheline:
double result,local_results[3][8];
#pragma omp parallel
{
int num = omp_get_thread_num();
if (num==0) local_results[num][1] = f(x);
// et cetera
}
A more elegant solution gives each thread a true local variable, and uses a critical section to sum these,
at the very end:
double result = 0;
#pragma omp parallel
{
double local_result;
local_result = .....
#pragma omp critical
result += local_result;
}
20.2.3 Types
Reduction can be applied to any type for which the operator is defined. The types to which max/min are
applicable are limited.
Each thread does a partial reduction, but its initial value is not the user-supplied init_x value, but a value
dependent on the operator. In the end, the partial results will then be combined with the user initial value.
The initialization values are mostly self-evident, such as zero for addition and one for multiplication. For
min and max they are respectively the maximal and minimal representable value of the result type.
Figure 20.1: Reduction of four items on two threads, taking into account initial values.
Figure 20.1 illustrates this, where 1,2,3,4 are four data items, i is the OpenMP initialization, and u is
the user initialization; each p stands for a partial reduction value. The figure is based on execution using
two threads.
Exercise 20.3. Write a program to test the fact that the partial results are initialized to the
unit of the reduction operator.
3. Optionally, you can specify the value to which the reduction should be initialized.
This is the syntax of the definition of the reduction, which can then be used in multiple reduction clauses.
#pragma omp declare reduction
( identifier : typelist : combiner )
[initializer(initializer-expression)]
where:
identifier is a name; this can be overloaded for different types, and redefined in inner scopes.
typelist is a list of types.
combiner is an expression that updates the internal variable omp_out as function of itself and omp_in.
initializer sets omp_priv to the identity of the reduction; this can be an expression or a brace initializer.
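As a small illustration (a hedged sketch, not one of the book's example codes; the identifier absmax and the test data are made up), here is a user-defined reduction that computes the largest absolute value in an array:

#include <math.h>
#include <stdio.h>

/* combiner: fold one partial result into another; identity is 0 */
#pragma omp declare reduction \
  ( absmax : double : omp_out = fmax(omp_out,omp_in) ) \
  initializer( omp_priv = 0. )

int main() {
  double x[100], m = 0.;
  for (int i=0; i<100; i++) x[i] = 50.-i;

  #pragma omp parallel for reduction(absmax:m)
  for (int i=0; i<100; i++)
    m = fmax( m, fabs(x[i]) );

  printf("largest absolute value: %g\n",m);
  return 0;
}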
C++ note 4: templated reductions. You can reduce with a templated function if you put both the declaration
and the reduction in the same templated function:
template<typename T>
T generic_reduction( vector<T> tdata ) {
#pragma omp declare reduction \
(rwzt:T:omp_out=reduce_without_zero<T>(omp_out,omp_in)) \
initializer(omp_priv=-1.f)
T tmin = -1;
#pragma omp parallel for reduction(rwzt:tmin)
for (int id=0; id<tdata.size(); id++)
tmin = reduce_without_zero<T>(tmin,tdata[id]);
return tmin;
};
!! reducttype.F90
Type inttype
  integer :: value = 0
end type inttype
Interface operator(+)
  module procedure addints
end Interface operator(+)

Type(inttype),dimension(nsize) :: intarray
Type(inttype) :: intsum = inttype(0)
!$OMP parallel do reduction(+:intsum)
do i=1,nsize
  intsum = intsum + intarray(i)
end do
!$OMP end parallel do
C++ note 5: reduction on class objects. Reduction can be applied to any class for which the reduction op-
erator is defined as operator+ or whichever operator the case may be.
A default constructor is required for the internally used init value; see figure 20.1.
m = INT_MIN;
for (int idata=0; idata<ndata; idata++) {
int d = data[idata];
m = d>m ? d : m;
}
if (m!=5)
printf("Sequential: wrong reduced value: %d, s/b %d\n",m,2);
else
printf("Sequential case succeeded\n");
m = INT_MIN;
#pragma omp parallel reduction(max:m) num_threads(ndata)
{
int t = omp_get_thread_num();
int d = data[t];
m = d>m ? d : m;
};
if (m!=5)
printf("Parallel: wrong reduced value: %d, s/b %d\n",m,2);
else
printf("Finished with correct parallel result\n");
return 0;
}
m = INT_MIN;
for (int idata=0; idata<ndata; idata++)
m = mymax(m,data[idata]);
if (m!=5)
printf("Sequential: wrong reduced value: %d, s/b %d\n",m,2);
else
printf("Sequential case succeeded\n");
m = INT_MIN;
#pragma omp parallel for reduction(rwz:m)
for (int idata=0; idata<ndata; idata++)
m = mymax(m,data[idata]);
if (m!=5)
printf("Parallel: wrong reduced value: %d, s/b %d\n",m,2);
else
printf("Finished\n");
return 0;
}
The declaration of a parallel region establishes a team of threads. This offers the possibility of parallelism,
but to actually get meaningful parallel activity you need something more. OpenMP uses the concept of a
work sharing construct: a way of dividing parallelizable work over a team of threads.
21.2 Sections
A parallel loop is an example of independent work units that are numbered. If you have a pre-determined
number of independent work units, the sections construct is more appropriate. A sections construct can contain any
number of section constructs. These need to be independent, and they can be executed by any available
thread in the current team, including having multiple sections done by the same thread.
#pragma omp sections
{
#pragma omp section
// one calculation
#pragma omp section
// another calculation
}
This construct can be used to divide large blocks of independent work. Suppose that in the following line,
both f(x) and g(x) are big calculations:
y = f(x) + g(x)
Instead of using two temporaries, you could also use a critical section; see section 23.2.2. However, the
best solution is to have a reduction clause on the parallel sections directive. For the sum
y = f(x) + g(x)
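a sketch along the lines of the sectionreduct.c fragment shown earlier looks as follows (the functions f,g here are trivial stand-ins for the two big calculations):

#include <stdio.h>

double f(double x) { return x+1; }   // stand-ins for two big calculations
double g(double x) { return 2*x; }

int main() {
  double x = 3.14, y = 0;
  #pragma omp parallel sections reduction(+:y)
  {
    #pragma omp section
    y += f(x);
    #pragma omp section
    y += g(x);
  }
  printf("y = f(x)+g(x) = %g\n",y);
  return 0;
}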
21.3 Single/master
The single and master pragmas limit the execution of a block to a single thread. This can for instance be
used to print tracing information or to do I/O operations.
#pragma omp parallel
{
#pragma omp single
printf("We are starting this section!\n");
// parallel stuff
}
The point of the single directive in this last example is that the computation needs to be done only once,
because of the shared memory. Since it’s a work sharing construct there is an implicit barrier after it,
which guarantees that all threads have the correct value in their local memory (see section 26.3).
Exercise 21.1. What is the difference between this approach and how the same
computation would be parallelized in MPI?
The master directive also enforces execution on a single thread, specifically the master thread of the team,
but it does not have the synchronization through the implicit barrier.
Exercise 21.2. Modify the above code to read:
int a;
#pragma omp parallel
{
#pragma omp master
a = f(); // some computation
#pragma omp sections
// various different computations using a
}
In a parallel region there are two types of data: private and shared. In this section we will see the various
ways you can control what category your data falls under; for private data items we also discuss how their
values relate to shared data.
All threads increment the same variable, so after the loop it will have a value of five plus the number of
threads; or maybe less because of the data races involved. This issue is discussed in HPC book, section-
2.6.1.5; see 23.2.2 for a solution in OpenMP.
Sometimes this global update is what you want; in other cases the variable is intended only for interme-
diate results in a computation. In that case there are various ways of creating data that is local to a thread,
and therefore invisible to other threads.
486
22.2. Private data
Code:
// private.c
int x=5;
#pragma omp parallel num_threads(4)
{
  int t = omp_get_thread_num(),
    x = t+1;
  printf("Thread %d sets x to %d\n",t,x);
}
printf("Outer x is still %d\n",x);

Output:
Thread 3 sets x to 4
Thread 2 sets x to 3
Thread 0 sets x to 1
Thread 1 sets x to 2
Outer x is still 5
After the parallel region the outer variable x will still have the value 5: there is no storage association
between the private variable and global one.
Fortran note 16: private variables in parallel region. The Fortran language does not have this concept of
scope, so you have to use a private clause:
Code:
!! private.F90
x=5
!$omp parallel private(x,t) num_threads(4)
t = omp_get_thread_num()
x = t+1
print '("Thread ",i2," sets x to ",i2)',t,x
!$omp end parallel
print '("Outer x is still ",i2)',x

Output:
Thread 0 sets x to 1
Thread 2 sets x to 3
Thread 3 sets x to 4
Thread 1 sets x to 2
Outer x is still 5
The private directive declares data to have a separate copy in the memory of each thread. Such private
variables are initialized as they would be in a main program. Any computed value goes away at the end of
the parallel region. (However, see below.) Thus, you should not rely on any initial value, or on the value
of the outer variable after the region.
int x = 5;
#pragma omp parallel private(x)
{
x = x+1; // dangerous
printf("private: x is %d\n",x);
}
printf("after: x is %d\n",x); // also dangerous
Data that is declared private with the private directive is put on a separate stack per thread. The OpenMP
standard does not dictate the size of these stacks, but beware of stack overflow. A typical default is a few
megabyte; you can control it with the environment variable OMP_STACKSIZE. Its values can be literal or with
suffixes:
123 456k 567K 678m 789M 246g 357G
A normal Unix process also has a stack, but this is independent of the OpenMP stacks for private data. You
can query or set the Unix stack with ulimit:
[] ulimit -s
64000
[] ulimit -s 8192
[] ulimit -s
8192
The Unix stack can grow dynamically as space is needed. This does not hold for the OpenMP stacks: they
are immediately allocated at their requested size. Thus it is important not to make them too large.
By the above rules, the variables x,s,c are all shared variables. However, the values they receive in one
iteration are not used in a next iteration, so they behave in fact like private variables to each iteration.
• In both C and Fortran you can declare these variables private in the parallel for directive.
• In C, you can also redefine the variables inside the loop.
Sometimes, even if you forget to declare these temporaries as private, the code may still give the correct
output. That is because the compiler can sometimes eliminate them from the loop body, since it detects
that their values are not otherwise used.
22.5 Default
• Loop variables in an omp for are private;
• Local variables in the parallel region are private.
You can alter this default behavior with the default clause:
#pragma omp parallel default(shared) private(x)
{ ... }
#pragma omp parallel default(private) shared(matrix)
{ ... }
• The shared clause means that all variables from the outer scope are shared in the parallel
region; any private variables need to be declared explicitly. This is the default behavior.
• The private clause means that all outer variables become private in the parallel region. They
are not initialized; see the next option. Any shared variables in the parallel region need to be
declared explicitly. This value is not available in C.
• The firstprivate clause means all outer variables are private in the parallel region, and ini-
tialized with their outer value. Any shared variables need to be declared explicitly. This value
is not available in C.
• The none option is good for debugging, because it forces you to specify for each variable in the
parallel region whether it’s private or shared. Also, if your code behaves differently in parallel
from sequential there is probably a data race. Specifying the status of every variable is a good
way to debug this.
// alloc.c
int *array = (int*) malloc(nthreads*sizeof(int));
#pragma omp parallel firstprivate(array)
{
int t = omp_get_thread_num();
array += t;
array[0] = t;
}
A firstprivate variable behaves like a private variable, except that it is initialized to the outside value.
Secondly, you may want a private value to be preserved to the environment outside the parallel region.
This really only makes sense in one case, where you preserve a private variable from the last iteration of
a parallel loop, or the last section in a sections construct. This is done with lastprivate:
#pragma omp parallel for \
lastprivate(tmp)
for (i=0; i<N; i++) {
tmp = ......
x[i] = .... tmp ....
}
..... tmp ....
data, which is not limited in lifetime to one parallel region. The threadprivate pragma is used to declare
that each thread is to have a private copy of a variable:
#pragma omp threadprivate(var)
Example:
Code:
!! threadprivate.F90
common /threaddata/tp
integer :: tp
!$omp threadprivate(/threaddata/)

!$omp parallel num_threads(7)
tp = omp_get_thread_num()
!$omp end parallel

!$omp parallel num_threads(9)
print *,"Thread",omp_get_thread_num(),"has",tp
!$omp end parallel

Output:
Thread 3 has 3
Thread 7 has 0
Thread 5 has 5
Thread 2 has 2
Thread 1 has 1
Thread 4 has 4
Thread 0 has 0
Thread 6 has 6
Thread 8 has 0
On the other hand, if the thread private data starts out identical in all threads, the copyin clause can be
used:
#pragma omp threadprivate(private_var)
private_var = 1;
#pragma omp parallel copyin(private_var)
private_var += omp_get_thread_num();
If one thread needs to set all thread private data to its value, the copyprivate clause can be used:
#pragma omp parallel
{
...
#pragma omp single copyprivate(private_var)
private_var = read_data();
...
}
can not be made threadsafe because of the initialization. However, the following works:
// privaterandom.cxx
static random_device rd;
static mt19937 rng;
#pragma omp threadprivate(rd)
#pragma omp threadprivate(rng)
int main() {
22.9 Allocators
OpenMP was initially designed for shared memory. With accelerators (see chapter 28), non-coherent
memory was added to this. In the OpenMP-5 standard, the story is further complicated, to account for
new memory types such as high-bandwidth memory and non-volatile memory.
There are several ways of using the OpenMP memory allocators.
• First, in a directive on a static array:
float A[N], B[N];
#pragma omp allocate(A) \
allocator(omp_large_cap_mem_alloc)
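Another way (a sketch that assumes an OpenMP-5 capable compiler and runtime) is the functional interface omp_alloc / omp_free with one of the predefined allocator handles:

#include <stdio.h>
#include <omp.h>

int main() {
  int N = 1000;
  // allocate through an OpenMP allocator; omp_default_mem_alloc is predefined
  double *x = (double*) omp_alloc( N*sizeof(double), omp_default_mem_alloc );
  for (int i=0; i<N; i++) x[i] = i;
  printf("last element: %g\n",x[N-1]);
  omp_free( x, omp_default_mem_alloc );
  return 0;
}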
{
int nthreads;
#pragma omp parallel
#pragma omp master
nthreads = omp_get_num_threads();
printf("Array result:\n");
for (int i=0; i<nthreads; i++)
printf("%d:%d, ",i,array[i]);
printf("\n");
}
{
int nthreads;
#pragma omp parallel
#pragma omp master
nthreads = omp_get_num_threads();
int array[nthreads];
for (int i=0; i<nthreads; i++)
array[i] = 0;
printf("Array result:\n");
{
int nthreads=4;
int array[nthreads];
for (int i=0; i<nthreads; i++)
array[i] = 0;
{
int t = 2;
array += t;
array[0] = t;
}
printf("Array result:\n");
for (int i=0; i<nthreads; i++)
printf("%d:%d, ",i,array[i]);
printf("\n");
}
return 0;
}
return 0;
}
In the constructs for declaring parallel regions above, you had little control over in what order threads
executed the work they were assigned. This section will discuss synchronization constructs: ways of telling
threads to bring a certain order to the sequence in which they do things.
• critical: a section of code can only be executed by one thread at a time; see 23.2.2.
• atomic Atomic update of a single memory location. Only certain specified syntax patterns are
supported. This was added in order to be able to use hardware support for atomic updates.
• barrier: section 23.1.
• ordered: section 19.6.
• locks: section 23.3.
• flush: section 26.3.
• nowait: section 19.7.
23.1 Barrier
A barrier defines a point in the code where all active threads will stop until all threads have arrived at
that point. With this, you can guarantee that certain calculations are finished. For instance, in this code
snippet, computation of y can not proceed until another thread has computed its value of x.
#pragma omp parallel
{
  int mytid = omp_get_thread_num();
  x[mytid] = some_calculation();
#pragma omp barrier
  y[mytid] = x[mytid]+x[mytid+1];
}
Apart from the barrier directive, which inserts an explicit barrier, OpenMP has implicit barriers after a
work sharing construct. Thus the following code is well defined:
#pragma omp parallel
{
#pragma omp for
for (int mytid=0; mytid<number_of_threads; mytid++)
x[mytid] = some_calculation();
#pragma omp for
for (int mytid=0; mytid<number_of_threads-1; mytid++)
y[mytid] = x[mytid]+x[mytid+1];
}
You can also put each parallel loop in a parallel region of its own, but there is some overhead associated
with creating and deleting the team of threads in between the regions.
At the end of a parallel region the team of threads is dissolved and only the master thread continues.
Therefore, there is an implicit barrier at the end of a parallel region. This barrier behavior can be cancelled
with the nowait clause.
You will often see the idiom
#pragma omp parallel
{
#pragma omp for nowait
for (i=0; i<N; i++)
a[i] = // some expression
#pragma omp for
for (i=0; i<N; i++)
b[i] = ...... a[i] ......
}
Here the nowait clause implies that threads can start on the second loop while other threads are still
working on the first. Since the two loops use the same schedule here, an iteration that uses a[i] can
indeed rely on that value having been computed.
This is a legitimate activity if the variable is an accumulator for values computed by independent pro-
cesses. The result of these two updates depends on the sequence in which the processors read and write
the variable.
Figure 23.1 illustrates three scenarios. Such a scenario, where the final result depends on ‘micro-timing’
of the actions of a thread, is known as a race condition or data race. A formal definition would be:
We talk of a data race if there are two statements 𝑆1 , 𝑆2 ,
• that are not causally related;
• that both access a location 𝐿; and
• at least one access is a write.
Enclosing the update statement in a critical section, or making it atomic by some other mechanism, en-
forces scenario 3 of the above figure.
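For instance (a sketch with assumed names sum and f; this is not the book's own listing), the update in a parallel loop can be protected like this:

#include <stdio.h>
#define N 10000
double f(int i) { return 1./(i+1); }   // hypothetical work function

int main() {
  double sum = 0;
  #pragma omp parallel for
  for (int i=0; i<N; i++) {
    double t = f(i);
    #pragma omp critical
    sum += t;                          // serialized update of the shared sum
  }
  printf("sum=%g\n",sum);
  return 0;
}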
But this should really be done with a reduction clause, which will be far more efficient.
A good use of critical sections is doing file writes or database updates.
Exercise 23.1. Consider a loop where each iteration updates a variable.
Critical sections are an easy way to turn an existing code into a correct parallel code. However, there are
performance disadvantages to critical sections, and sometimes a more drastic rewrite is called for.
23.3 Locks
OpenMP also has the traditional mechanism of a lock. A lock is somewhat similar to a critical section:
it guarantees that some instructions can only be performed by one process at a time. However, a critical
section is indeed about code; a lock is about data. With a lock you make sure that some data elements can
only be touched by one process at a time.
One simple example of the use of locks is generation of a histogram. A histogram consists of a number of
bins, that get updated depending on some data. Here is the basic structure of such a code:
int count[100];
float x = some_function();
int ix = (int)x;
if (ix>=100)
error();
else
count[ix]++;
The update count[ix]++ could be protected with a single critical section or one global lock, but that is
unnecessarily restrictive. If there are enough bins in the histogram, and if the some_function
takes enough time, there are unlikely to be conflicting writes. The solution then is to create an array of
locks, with one lock for each count location.
Create/destroy:
void omp_init_lock(omp_lock_t *lock);
void omp_destroy_lock(omp_lock_t *lock);
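A sketch of the histogram with one lock per bin (the index expression below is an arbitrary stand-in for the some_function above, and the bin and sample counts are made up):

#include <stdio.h>
#include <omp.h>

#define NBINS 100
#define NSAMPLES 100000

int main() {
  int count[NBINS] = {0};
  omp_lock_t bin_lock[NBINS];
  for (int i=0; i<NBINS; i++)
    omp_init_lock( &bin_lock[i] );

  #pragma omp parallel for
  for (int is=0; is<NSAMPLES; is++) {
    int ix = (17*is+3) % NBINS;        // stand-in for `some_function'
    omp_set_lock( &bin_lock[ix] );     // only this one bin is locked
    count[ix]++;
    omp_unset_lock( &bin_lock[ix] );
  }

  for (int i=0; i<NBINS; i++)
    omp_destroy_lock( &bin_lock[i] );
  printf("bin 0 has %d samples\n",count[0]);
  return 0;
}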
omp_set_lock(&lockb);
for (i=0; i<N; i++)
b[i] = .. a[i] ..
omp_unset_lock(&lockb);
omp_unset_lock(&locka);
}
omp_set_lock(&locka);
for (i=0; i<N; i++)
a[i] = .. b[i] ..
omp_unset_lock(&locka);
omp_unset_lock(&lockb);
}
} /* end of sections */
} /* end of parallel region */
We start by sketching the basic single-threaded solution. The naive code looks like:
int main() {
value = new int[nmax+1];
value[0] = 1;
value[1] = 1;
fib(10);
}
int fib(int n) {
int i, j, result;
if (n>=2) {
i=fib(n-1); j=fib(n-2);
value[n] = i+j;
}
return value[n];
}
However, this is inefficient, since most intermediate values will be computed more than once. We solve
this by keeping track of which results are known:
...
done = new int[nmax+1];
for (i=0; i<=nmax; i++)
done[i] = 0;
done[0] = 1;
done[1] = 1;
...
int fib(int n) {
int i, j;
if (!done[n]) {
i = fib(n-1); j = fib(n-2);
The OpenMP parallel solution calls for two different ideas. First of all, we parallelize the recursion by
using tasks (section 24):
int fib(int n) {
int i, j;
if (n>=2) {
#pragma omp task shared(i) firstprivate(n)
i=fib(n-1);
#pragma omp task shared(j) firstprivate(n)
j=fib(n-2);
#pragma omp taskwait
value[n] = i+j;
}
return value[n];
}
This computes the right solution, but, as in the naive single-threaded solution, it recomputes many of the
intermediate values.
A naive addition of the done array leads to data races, and probably an incorrect solution:
int fib(int n) {
int i, j, result;
if (!done[n]) {
#pragma omp task shared(i) firstprivate(n)
i=fib(n-1);
#pragma omp task shared(j) firstprivate(n)
j=fib(n-2);
#pragma omp taskwait
value[n] = i+j;
done[n] = 1;
}
return value[n];
}
For instance, there is no guarantee that the done array is updated later than the value array, so a thread
can think that done[n-1] is true, but value[n-1] does not have the right value yet.
One solution to this problem is to use a lock, and make sure that, for a given index n, the values done[n]
and value[n] are never touched by more than one thread at a time:
int fib(int n)
{
int i, j;
omp_set_lock( &(dolock[n]) );
if (!done[n]) {
#pragma omp task shared(i) firstprivate(n)
i = fib(n-1);
This solution is correct, optimally efficient in the sense that it does not recompute anything, and it uses
tasks to obtain a parallel execution.
However, the efficiency of this solution is only up to a constant. A lock is still being set, even if a value
is already computed and therefore will only be read. This can be solved with a complicated use of critical
sections, but we will forego this.
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;
#include <omp.h>
#ifdef NOLOCK
#define omp_set_lock(x)
#define omp_unset_lock(x)
#endif
class object {
private:
omp_lock_t the_lock;
int _value{0};
public:
object() {
omp_init_lock(&the_lock);
};
~object() {
omp_destroy_lock(&the_lock);
};
int operator +=( int i ) {
// let's waste a little time,
// otherwise the threads finish before they start
float s = i;
for (int i=0; i<1000; i++)
s += sin(i)*sin(i);
// atomic increment
omp_set_lock(&the_lock);
_value += (s>0); int rv = _value;
omp_unset_lock(&the_lock);
return rv;
};
auto value() { return _value; };
};
#define NTHREADS 50
#define NOPS 100
int main() {
/*
* Create a bunch of threads, that
* each do a bunch of updates
*/
object my_object;
vector<thread> threads;
for (int ithread=0; ithread<NTHREADS; ithread++) {
threads.push_back
( thread(
[&my_object] () {
for (int iop=0; iop<NOPS; iop++)
my_object += iop; } ) );
}
for ( auto &t : threads )
t.join();
/*
* Check that no updates have gone lost
*/
cout << "Did " << NTHREADS * NOPS << " updates, over " << threads.size()
<< " threads, resulting in " << my_object.value() << endl;
return 0;
}
Tasks are a mechanism that OpenMP uses behind the scenes: if you specify something as being a task,
OpenMP will create a ‘block of work’: a section of code plus the data environment in which it occurred.
This block is set aside for execution at some later point. Thus, task-based code usually looks something
like this:
#pragma omp parallel
{
// generate a bunch of tasks
# pragma omp taskwait
// the result from the tasks is now available
}
For instance, a parallel loop was always implicitly translated to something like:
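The listing this refers to is not reproduced here; as a sketch (the names N, nblocks, f are illustrative and match the fragment below), the loop is chopped into blocks, and each block becomes one unit of work that is executed by some thread of the team:

int blocksize = (N+nblocks-1)/nblocks;
for (int ib=0; ib<nblocks; ib++) {
  int first = ib*blocksize,
      last  = first+blocksize < N ? first+blocksize : N;
  // each block is one unit of work, executed in its entirety by one thread
  for (int i=first; i<last; i++)
    f(i);
}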
If we stick with this example of implementing a parallel loop through tasks, the next question is: precisely
who generates the tasks? The following code has a serious problem:
// WRONG. DO NOT WRITE THIS
#pragma omp parallel
for (int ib=0; ib<nblocks; ib++) {
int first=... last=... ;
# pragma omp task
for (int i=first; i<last; i++)
f(i)
}
because the parallel region creates a team, and each thread in the team executes the task-generating code.
Instead, we use the following idiom:
#pragma omp parallel
#pragma omp single
for (int ib=0; ib<nblocks; ib++) {
// setup stuff
# pragma omp task
// task stuff
}
The first rule is that shared data is shared in the task, but private data becomes firstprivate. To see the
distinction, consider two code fragments.
In the first example, the variable count is declared outside the parallel region and is therefore shared.
When the print statement is executed, all tasks will have been generated, and so count will be zero. Thus,
the output will likely be 0,50.
In the second example, the count variable is private to the thread creating the tasks, and so it will be
firstprivate in the task, preserving the value that was current when the task was created.
Explanation: when the statement computing z is executed, the task computing y has only been scheduled;
it has not necessarily been executed yet.
In order to have a guarantee that a task is finished, you need the taskwait directive. The following creates
two tasks, which can be executed in parallel, and then waits for the results:
x = f();                  // the variable x gets a value
#pragma omp task
{ y1 = g1(x); }           // two tasks are created
#pragma omp task          //   with the current value of x
{ y2 = g2(x); }
#pragma omp taskwait      // the thread waits until the tasks are finished
z = h(y1)+h(y2);          // the variable z is computed using the task results
The task pragma is followed by a structured block. Each time the structured block is encountered, a new
task is generated. On the other hand taskwait is a standalone directive; the code that follows is just code,
it is not a structured block belonging to the directive.
Another aspect of the distinction between generating tasks and executing them: usually the tasks are
generated by one thread, but executed by many threads. Thus, the typical idiom is:
#pragma omp parallel
#pragma omp single
{
// code that generates tasks
}
This makes it possible to execute loops in parallel that do not have the right kind of iteration structure
for a omp parallel for. As an example, you could traverse and process a linked list:
#pragma omp parallel
#pragma omp single
{
while (!tail(p)) {
p = p->next();
#pragma omp task
process(p)
}
#pragma omp taskwait
}
One task traverses the linked list creating an independent task for each element in the list. These tasks
are then executed in parallel; their assignment to threads is done by the task scheduler.
You can indicate task dependencies in several ways:
1. Using the taskwait directive you can explicitly indicate the join of the forked tasks. The in-
struction after the wait directive will therefore be dependent on the spawned tasks.
2. The taskgroup directive, followed by a structured block, ensures completion of all tasks created
in the block, even if recursively created.
3. Each OpenMP task can have a depend clause, indicating what data dependency of the task. By
indicating what data is produced or absorbed by the tasks, the scheduler can construct the
dependency graph for you.
Another mechanism for dealing with tasks is the taskgroup: a task group is a code block that can contain
task directives; all these tasks need to be finished before any statement after the block is executed.
A task group is somewhat similar to having a taskwait directive after the block. The big difference is that
that taskwait directive does not wait for tasks that are recursively generated, while a taskgroup does.
it is conceivable that the second task is executed before the first, possibly leading to an incorrect result.
This is remedied by specifying:
#pragma omp task depend(out:x)
x = f()
#pragma omp task depend(in:x)
y = g(x)
for i in [1:N]:
for j in [1:N]:
x[i,j] = x[i-1,j]+x[i,j-1]
• Observe that the second loop nest is not amenable to OpenMP loop parallelism.
• Can you think of a way to realize the computation with OpenMP loop
parallelism? Hint: you need to rewrite the code so that the same operations are
done in a different order.
• Use tasks with dependencies to make this code parallel without any rewriting:
the only change is to add OpenMP directives.
Tasks dependencies are used to indicated how two uses of one data item relate to each other. Since either
use can be a read or a write, there are four types of dependencies.
RaW (Read after Write) The second task reads an item that the first task writes. The second task has
to be executed after the first:
... omp task depend(OUT:x)
foo(x)
... omp task depend( IN:x)
foo(x)
WaR (Write after Read) The first task reads an item, and the second task overwrites it. The second
task has to be executed second to prevent overwriting the initial value:
... omp task depend( IN:x)
foo(x)
... omp task depend(OUT:x)
foo(x)
WaW (Write after Write) Both tasks set the same variable. Since the variable can be used by an inter-
mediate task, the two writes have to be executed in this order.
... omp task depend(OUT:x)
foo(x)
... omp task depend(OUT:x)
foo(x)
RaR (Read after Read) Both tasks read a variable. Since neither tasks has an ‘out’ declaration, they can
run in either order.
... omp task depend(IN:x)
foo(x)
... omp task depend(IN:x)
foo(x)
The task group can contain both task that contribute to the reduction, and ones that don’t. The former
type needs a clause in_reduction:
#pragma omp task in_reduction(+:sum)
As an example, here the sum $\sum_{i=1}^{100} i$ is computed with tasks:
// taskreduct.c
#pragma omp parallel
#pragma omp single
{
#pragma omp taskgroup task_reduction(+:sum)
for (int itask=1; itask<=bound; itask++) {
#pragma omp task in_reduction(+:sum)
sum += itask;
}
}
24.5 More
24.5.1 Scheduling points
Normally, a task stays tied to the thread that first executes it. However, at a task scheduling point the
thread may switch to the execution of another task created by the same team.
• There is a scheduling point after explicit task creation. This means that, in the above examples,
the thread creating the tasks can also participate in executing them.
• There is a scheduling point at taskwait and taskyield.
On the other hand a task created with the untied clause on the task pragma is never tied to one thread.
This means that after suspension at a scheduling point any thread can resume execution of the task. If
you do this, beware that the value of a thread-id does not stay fixed. Also locks become a problem.
Example: if a thread is waiting for a lock, with a scheduling point it can suspend the task and work on
another task.
while (!omp_test_lock(lock))
#pragma omp taskyield
;
• The if clause may still lead to recursively generated tasks. On the other hand, final will execute
the code, and will also skip any recursively created tasks:
#pragma omp task final(level<3)
If you want to indicate that certain tasks are more important than others, use the priority clause:
#pragma omp task priority(5)
24.6 Examples
24.6.1 Fibonacci
As an example of the use of tasks, consider computing an array of Fibonacci values:
// taskgroup0.c
for (int i=2; i<N; i++)
{
fibo_values[i] = fibo_values[i-1]+fibo_values[i-2];
}
fibo_values[0] = 1; fibo_values[1] = 1;
{
for (int i=2; i<N; i++)
{
fibo_values[i] = fibo_values[i-1]+fibo_values[i-2];
}
}
printf("F(%d) = %ld\n",N,fibo_values[N-1]);
return 0;
}
return 0;
}
N = atoi(argv[1]);
if (N>99) {
printf("Sorry, this overflows: setting N=99\n");
N = 99;
}
}
fibo_values[0] = 1; fibo_values[1] = 1;
#pragma omp parallel
#pragma omp single
#pragma omp taskgroup
{
for (int i=2; i<N; i++)
#pragma omp task \
depend(out:fibo_values[i]) \
depend(in:fibo_values[i-1],fibo_values[i-2])
{
fibo_values[i] = fibo_values[i-1]+fibo_values[i-2];
}
}
printf("F(%d) = %ld\n",N,fibo_values[N-1]);
return 0;
}
int n;
if (argc>1)
n = atoi(argv[1]);
else n = 100;
if (n>NMAX) {
return 0;
}
We see pretty much perfect speedup for the OMP_PLACES=cores strategy; with OMP_PLACES=sockets
we probably get occasional collisions where two threads wind up on the same core.
Next we take a program for computing the time evolution of the heat equation:
$$ t=0,1,2,\ldots\colon\quad \forall_i\colon\ x_i^{(t+1)} = 2x_i^{(t)} - x_{i-1}^{(t)} - x_{i+1}^{(t)} $$
This is a bandwidth-bound operation because the amount of computation per data item is low.
25.2 First-touch
The affinity issue shows up in the first-touch phenomenon.
A little background knowledge: memory is organized in memory pages, and what we think of as ‘addresses’
are really virtual addresses, mapped to physical addresses through a page table.
This means that data in your program can be anywhere in physical memory. In particular, on a dual socket
node, the memory can be mapped to either of the sockets.
The next thing to know is that memory allocated with malloc and like routines is not immediately
mapped; that only happens when data is written to it. In light of this, consider the following OpenMP
code:
double *x = (double*) malloc(N*sizeof(double));
for (int i=0; i<N; i++)
  x[i] = 0.;
Since the initialization loop is not parallel it is executed by the master thread, so all the memory gets
associated with the socket of that thread. Subsequent access by the other socket will then access data
from memory not attached to that socket.
Let’s consider an example. We make the initialization parallel subject to an option:
// heat.c
#pragma omp parallel if (init>0)
{
#pragma omp for
for (int i=0; i<N; i++)
y[i] = x[i] = 0.;
x[0] = 0; x[N-1] = 1.;
}
If the initialization is not parallel, the array will be mapped to the socket of the master thread; if it is
parallel, it may be mapped to different sockets, depending on where the threads run.
As a simple application we run a heat equation, which is parallel, though not embarrassingly so:
for (int it=0; it<1000; it++) {
#pragma omp parallel for
for (int i=1; i<N-1; i++)
y[i] = ( x[i-1]+x[i]+x[i+1] )/3.;
#pragma omp parallel for
for (int i=1; i<N-1; i++)
x[i] = y[i];
}
On the TACC Frontera machine, with dual 28-core Intel Cascade Lake processors, we use the following
settings:
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# no parallel initialization
make heat && OMP_NUM_THREADS=56 ./heat
# yes parallel initialization
make heat && OMP_NUM_THREADS=56 ./heat 1
This gives us a remarkable difference in runtime:
• Sequential init: avg=2.089, stddev=0.1083
• Parallel init: avg=1.006, stddev=0.0216
This large difference will be mitigated for algorithms with higher arithmetic intensity.
Exercise 25.1. How do the OpenMP dynamic schedules relate to this issue?
25.2.1 C++
The problem with realizing first-touch in C++ is that std::vector fills its allocation with default values.
This is known as ‘value-initialization’, and it means that
vector<double> x(N);
touches all elements on the constructing thread, defeating a parallel first-touch initialization.
Running the code with the regular definition of a vector, and with a modification that avoids this
value-initialization (not reproduced here), reproduces the runtimes of the C variant above.
Another option is to wrap memory allocated with new in a unique_ptr:
// heatptr.cxx
unique_ptr<double[]> x( new double[N] );
unique_ptr<double[]> y( new double[N] );
Note that this gives fairly elegant code, since square bracket indexing is overloaded for unique_ptr. The
only disadvantage is that we can not query the size of these arrays. Or do bound checking with at, but in
high performance contexts that is usually not appropriate anyway.
25.2.2 Remarks
You could move pages with move_pages.
By regarding affinity, in effect you are adopting an SPMD style of programming. You could make this
explicit by having each thread allocate its part of the arrays separately, and storing a private pointer
as threadprivate [18]. However, this makes it impossible for threads to access each other’s parts of the
distributed array, so this is only suitable for total data parallel or embarrassingly parallel applications.
#include <stdio.h>
#include <omp.h>

int main() {
  int a = 0;
  int reps = 1000;
  int N = 8*10000;
  double start,stop,delta;
  start = omp_get_wtime();
#pragma omp parallel
{ // not a parallel for: just a bunch of reps
for (int j = 0; j < reps; j++) {
#pragma omp for schedule(static,1)
for (int i = 0; i < N; i++){
#pragma omp atomic
a++;
}
}
}
stop = omp_get_wtime();
delta = ((double)(stop - start))/reps;
printf("run time = %fusec\n", 1.0e6*delta);
return 0;
}
One reason this doesn’t work is that the compiler will see that the flag is never used in the producing
section, and that it is never changed in the consuming section, so it may optimize these statements, to the
point of optimizing them away.
The producer then needs to do:
... do some producing work ...
#pragma omp flush
#pragma omp atomic write
flag = 1;
#pragma omp flush(flag)
This code strictly speaking has a race condition on the flag variable.
The solution is to make this an atomic operation and use an atomic pragma here: the producer has
#pragma omp atomic write
flag = 1;
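The consuming side then mirrors this (a sketch, using the same flag variable as the producer fragment above; not a listing from the book): it spins on an atomic read of the flag, and flushes before reading the data that the producer wrote.

int flag_is_set = 0;
while (!flag_is_set) {
  #pragma omp atomic read
  flag_is_set = flag;        // atomically read the shared flag
}
#pragma omp flush            // make the produced data visible to this thread
/* the data written before the producer set the flag can now be read */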
The basic rule about multiple-thread access of a single data item is:
Any memory location that is written by one thread, can not be read by another thread
in the same parallel region, if no synchronization is done.
To start with that last clause: any workshare construct ends with an implicit barrier, so data written before
that barrier can safely be read after it.
Under any reasonable interpretation of parallel execution, the possible values for r1,r2 are 1,1; 0,1; or 1,0.
This is known as sequential consistency: the parallel outcome is consistent with a sequential execution that
interleaves the parallel computations, respecting their local statement orderings. (See also HPC book,
section-2.6.1.6.)
However, running this, we get a small number of cases where 𝑟1 = 𝑟2 = 0. There are two possible expla-
nations:
1. The compiler is allowed to interchange the first and second statements, since there is no depen-
dence between them; or
2. The thread is allowed to have a local copy of the variable that is not coherent with the value in
memory.
We fix this by flushing both a,b:
// weak2.c
int a=0,b=0,r1,r2;
#pragma omp parallel sections shared(a, b, r1, r2)
{
#pragma omp section
{
a = 1;
#pragma omp flush (a,b)
r1 = b;
tasks++;
}
#pragma omp section
{
b = 1;
#pragma omp flush (a,b)
r2 = a;
tasks++;
}
}
return 0;
}
You can declare a loop to be executable with vector instructions with simd.
Remark 29 Depending on your compiler, it may be necessary to give an extra option enabling SIMD:
• -fopenmp-simd for GCC / Clang, and
• -qopenmp-simd for ICC.
inprod = x1*x2+y1*y2,
xnorm = sqrt(x1*x1 + x2*x2),
ynorm = sqrt(y1*y1 + y2*y2);
return inprod / (xnorm*ynorm);
}
#pragma omp declare simd uniform(x1,x2,y1,y2) linear(i)
double csa(double *x1,double *x2,double *y1,double *y2, int i) {
double
inprod = x1[i]*x2[i]+y1[i]*y2[i],
xnorm = sqrt(x1[i]*x1[i] + x2[i]*x2[i]),
ynorm = sqrt(y1[i]*y1[i] + y2[i]*y2[i]);
return inprod / (xnorm*ynorm);
}
This chapter explains the mechanisms for offloading work to a Graphics Processing Unit (GPU).
The memory of a processor and that of an attached GPU are not coherent: there are separate memory
spaces and writing data in one is not automatically reflected in the other.
OpenMP transfers data (or maps it) when you enter an target construct.
#pragma omp target
{
// do stuff on the GPU
}
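As a slightly fuller sketch (the array names x,y and size N are assumed, not taken from the book's sources), a loop can be offloaded and parallelized on the device, with explicit map clauses controlling the data transfer:

// a sketch: offload a loop, mapping x to the device and y back from it
#pragma omp target teams distribute parallel for \
  map(to:x[0:N]) map(from:y[0:N])
for (int i=0; i<N; i++)
  y[i] = 2.*x[i];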
You can test whether the target region is indeed executed on a device with omp_is_initial_device:
#pragma omp target
if (omp_is_initial_device()) printf("Offloading failed\n");
There is also update to (synchronize data from host to device) and update from (synchronize data from
device to host).
• OMP_PROC_BIND with values TRUE and FALSE can bind threads to a processor. On the one hand,
doing so can minimize data movement; on the other hand, it may increase load imbalance.
29.2 Timing
OpenMP has a wall clock timer routine omp_get_wtime
double omp_get_wtime(void);
The starting point is arbitrary and is different for each program run; however, in one run it is identical
for all threads. This timer has a resolution given by omp_get_wtick.
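A minimal usage sketch (the reduction loop is an arbitrary stand-in for real work):

#include <stdio.h>
#include <omp.h>

int main() {
  double tstart = omp_get_wtime();
  double s = 0;
  #pragma omp parallel for reduction(+:s)
  for (int i=0; i<10000000; i++)
    s += 1./(i+1);
  double elapsed = omp_get_wtime() - tstart;
  printf("sum=%g computed in %g sec; timer resolution %g sec\n",
         s,elapsed,omp_get_wtick());
  return 0;
}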
Exercise 29.1. Use the timing routines to demonstrate speedup from using multiple threads.
• Write a code segment that takes a measurable amount of time, that is, it should
take a multiple of the tick time.
• Write a parallel loop and measure the speedup. You can for instance do this
for (int use_threads=1; use_threads<=nthreads; use_threads++) {
#pragma omp parallel for num_threads(use_threads)
for (int i=0; i<nthreads; i++) {
.....
}
if (use_threads==1)
time1 = tend-tstart;
else // compute speedup
• In order to prevent the compiler from optimizing your loop away, let the body
compute a result and use a reduction to preserve these results.
...
for ( .... ) {
int ivalue = next_one();
}
has a clear race condition: the loop iterations may each get a different next_one value, as they are
supposed to, or they may not. This can be solved by using a critical pragma for the next_one call; another
solution is to use a threadprivate declaration for isave. This is for instance the right solution if the
next_one routine implements a random number generator.
Critical sections imply a loss of parallelism, but they are also slow as they are realized through
operating system functions. These are often quite costly, taking many thousands of cycles. Crit-
ical sections should be used only if the parallel work far outweighs their cost.
29.5 Accelerators
In OpenMP-4.0 there is support for offloading work to an accelerator or co-processor:
#pragma omp target [clauses]
OpenMP Review
30.2. Review questions
30.2.2 Parallelism
Can the following loops be parallelized? If so, how? (Assume that all arrays are already filled in, and that
there are no out-of-bounds errors.)
// variant #1
for (i=0; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i] + c[i+1];
}

// variant #2
for (i=0; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i+1] + c[i+1];
}

// variant #3
for (i=1; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i-1] + c[i+1];
}

// variant #4
for (i=1; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i+1] = 2*x[i-1] + c[i+1];
}
! variant #1
do i=1,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i) + c(i+1)
end do

! variant #2
do i=1,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i+1) + c(i+1)
end do

! variant #3
do i=2,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i-1) + c(i+1)
end do

! variant #4
do i=2,N
  x(i) = a(i)+b(i+1)
  a(i+1) = 2*x(i-1) + c(i+1)
end do
// variant #1
int nt;
#pragma omp parallel
{
  nt = omp_get_thread_num();
  printf("thread number: %d\n",nt);
}

// variant #2
int nt;
#pragma omp parallel private(nt)
{
  nt = omp_get_thread_num();
  printf("thread number: %d\n",nt);
}

// variant #3
int nt;
#pragma omp parallel
{
#pragma omp single
  {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
  }
}

// variant #4
int nt;
#pragma omp parallel
{
#pragma omp master
  {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
  }
}

// variant #5
int nt;
#pragma omp parallel
{
#pragma omp critical
  {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
  }
}
! variant #1
integer nt
!$OMP parallel
nt = omp_get_thread_num()
print *,"thread number:",nt
!$OMP end parallel

! variant #2
integer nt
!$OMP parallel private(nt)
nt = omp_get_thread_num()
print *,"thread number:",nt
!$OMP end parallel

! variant #3
integer nt
!$OMP parallel
!$OMP single
nt = omp_get_thread_num()
print *,"thread number:",nt
!$OMP end single
!$OMP end parallel

! variant #4
integer nt
!$OMP parallel
!$OMP master
nt = omp_get_thread_num()
print *,"thread number:",nt
!$OMP end master
!$OMP end parallel

! variant #5
integer nt
!$OMP parallel
!$OMP critical
nt = omp_get_thread_num()
print *,"thread number:",nt
!$OMP end critical
!$OMP end parallel
30.2.3.2
The following is an attempt to parallelize a serial code. Assume that all variables and arrays are defined.
What errors and potential problems do you see in this code? How would you fix them?
30.2.3.3
Assume two threads. What does the following program output?
int a;
#pragma omp parallel private(a) {
...
a = 0;
#pragma omp for
for (int i = 0; i < 10; i++)
{
#pragma omp atomic
a++; }
#pragma omp single
printf("a=%e\n",a);
}
30.2.4 Reductions
30.2.4.1
Is the following code correct? Is it efficient? If not, can you improve it?
#pragma omp parallel shared(r)
{
int x;
x = f(omp_get_thread_num());
#pragma omp critical
r += f(x);
}
30.2.4.2
Compare two fragments:
// variant 1
#pragma omp parallel reduction(+:s)
#pragma omp for
for (i=0; i<N; i++)
  s += f(i);

// variant 2
#pragma omp parallel
#pragma omp for reduction(+:s)
for (i=0; i<N; i++)
  s += f(i);

! variant 1
!$OMP parallel reduction(+:s)
!$OMP do
do i=1,N
  s = s + f(i)
end do
!$OMP end do
!$OMP end parallel

! variant 2
!$OMP parallel
!$OMP do reduction(+:s)
do i=1,N
  s = s + f(i)
end do
!$OMP end do
!$OMP end parallel
30.2.5 Barriers
Are the following two code fragments well defined?
#pragma omp parallel private(i) shared(icount)
  {
#pragma omp critical
    { icount++;
      for (i=0; i<100; i++)
        iarray[icount][i] = 1;
    }
  }
  return 0;
}

!$OMP parallel private(i) shared(icount)
!$OMP critical
icount = icount+1
do i=1,100
  iarray(icount,i) = 1
end do
!$OMP end critical
!$OMP end parallel
30.2.7 Tasks
Fix two things in the following example:
print *,"sum=",x+y+z
!$OMP end single
!$OMP end parallel
30.2.8 Scheduling
Compare these two fragments. Do they compute the same result? What can you say about their efficiency?
How would you make the second loop more efficient? Can you do something similar for the first loop?
OpenMP Examples
31.1 N-body problems
In an N-body problem, the force on particle $i$ is the sum of the pairwise forces exerted by all other particles:
$$ \vec F_i = \sum_{j\neq i} \vec F_{ij}. $$
/* Force on p1 from p2 */
struct force force_calc( struct point p1,struct point p2 ) {
double dx = p2.x - p1.x, dy = p2.y - p1.y;
double f = p1.c * p2.c / sqrt( dx*dx + dy*dy );
struct force exert = {dx,dy,f};
return exert;
}
Force accumulation:
void add_force( struct force *f,struct force g ) {
f->x += g.x; f->y += g.y; f->f += g.f;
}
void sub_force( struct force *f,struct force g ) {
  f->x -= g.x; f->y -= g.y; f->f += g.f;
}
Here $\vec F_{ij}$ is only computed for $j > i$, and then added to both $\vec F_i$ and $\vec F_j$.
In C++ we use the overloaded operators:
for (int ip=0; ip<N; ip++) {
for (int jp=ip+1; jp<N; jp++) {
force f = points[ip].force_calc(points[jp]);
forces[ip] += f;
forces[jp] -= f;
}
}
Exercise 31.1. Argue that both the outer loop and the inner are not directly parallelizable.
We will now explore a number of different strategies for parallelization. All tests are done on the TACC
Frontera cluster, which has dual-socket Intel Cascade Lake nodes, with a total of 56 cores. Our code uses 10
thousand particles, and each interaction evaluation is repeated 10 times to eliminate cache loading effects.
In C++ we use the fact that we can reduce on any class that has an addition operator:
for (int ip=0; ip<N; ip++) {
force sumforce;
#pragma omp parallel for reduction(+:sumforce)
for (int jp=0; jp<N; jp++) {
if (ip==jp) continue;
force f = points[ip].force_calc(points[jp]);
sumforce += f;
} // end parallel jp loop
forces[ip] += sumforce;
} // end ip loop
This increases the scalar work by a factor of two, but surprisingly, on a single thread the run time improves:
we measure a speedup of 6.51 over the supposedly ‘optimal’ code.
Exercise 31.2. What would be an explanation?
However, increasing the number of threads has limited benefits for this strategy. Figure 31.1 shows that
the speedup is not only sublinear: it actually decreases with increasing core count.
Exercise 31.3. What would be an explanation?
[Figure 31.1: speedup of the inner-loop reduction strategy as a function of core count (1–56), with the reference timing indicated]
The 'atomic' strategies keep the triangular loop and protect the force updates with atomic operations; for instance, the subtraction routine becomes:
void sub_force( struct force *f,struct force g ) {
#pragma omp atomic
  f->x -= g.x;
#pragma omp atomic
  f->y -= g.y;
#pragma omp atomic
  f->f += g.f;
}
[Figure: speedup of the atomic-update strategy as a function of core count (1–56), with the reference timing indicated]
$$ s \leftarrow \sum_c s_c $$
Another example is matrix factorization:
[Figure: speedup as a function of core count (1–56), with the reference timing indicated]
[Figure: speedup (log scale, with an ideal line) versus number of threads (1–56) for the four strategies: Sequential, Inner loop reduction, Triangle atomic, Full atomic]
We create a task for each column; since the tasks are created in a loop, we use a taskgroup rather than taskwait.
#pragma omp taskgroup
for (int col=0; col<N; col++) {
placement next = current;
next.at(iqueen) = col;
#pragma omp task firstprivate(next)
if (feasible(next)) {
// stuff
} // end if(feasible)
}
However, the sequential program had return and break statements in the loop, and branching out of an OpenMP structured block such as a taskgroup is not allowed. Therefore we introduce a return variable, declared as shared:
// queens0.cxx
optional<placement> result = {};
#pragma omp taskgroup
for (int col=0; col<N; col++) {
placement next = current;
next.at(iqueen) = col;
#pragma omp task firstprivate(next) shared(result)
if (feasible(next)) {
if (iqueen==N-1) {
result = next;
} else { // do next level
auto attempt = place_queen(iqueen+1,next);
if (attempt.has_value()) {
result = attempt;
}
} // end if(iqueen==N-1)
} // end if(feasible)
}
return result;
So that was easy: this computes the right solution, and it uses OpenMP tasks. Are we done?
Actually, this runs very slowly because, now that we have dispensed with all early breaks from the loop, we in effect traverse the whole search tree. (It's not quite breadth-first, though.) Figure 31.5 shows this for $N = 12$, with the Intel compiler (version 2019) in the left panel and the GNU compiler (version 9.1) in the right panel.
[Figure 31.5: Using taskgroups for $N = 12$; running time in seconds versus number of cores used (1–14); left panel Intel compiler, right panel GCC]
In both cases, the blue bars give the result for the code with only the taskgroup directive, with time plotted as a function of core count.
We see that, for the Intel compiler, running time indeed goes down with core count. So, while we compute
too much (the whole search space), at least parallelization helps. With a number of threads greater than
the problem size, the benefit of parallelization disappears, which makes some sort of sense.
We also see that the GCC compiler is really bad at OpenMP tasks: the running time actually increases
with the number of threads.
Fortunately, with OpenMP-4 we can break out of the loop with a cancel of the task group:
// queenfinal.cxx
if (feasible(next)) {
if (iqueen==N-1) {
result = next;
#pragma omp cancel taskgroup
} else { // do next level
auto attempt = place_queen(iqueen+1,next);
if (attempt.has_value()) {
result = attempt;
#pragma omp cancel taskgroup
}
} // end if (iqueen==N-1)
} // end if (feasible)
Surprisingly, this does not immediately give a performance improvement. The reason is that OpenMP cancellation is disabled by default: the environment variable OMP_CANCELLATION has to be set to true before the cancel construct takes effect.
[Figure: running time in seconds versus number of cores used (seq, 1–14)]
PETSC
Chapter 32
PETSc basics
Remark 30 The PETSc library has hundreds of routines. In this chapter and the next few we will only touch
on a basic subset of these. The full list of man pages can be found at https://petsc.org/release/docs/
manualpages/singleindex.html. Each man page comes with links to related routines, as well as (usually)
example codes for that routine.
32.1.4.2 Fortran
A Fortran90 interface exists. The Fortran77 interface is only of interest for historical reasons.
To use Fortran, include both a module and a cpp header file:
#include "petsc/finclude/petscXXX.h"
use petscXXX
(Here XXX stands for one of the PETSc types; including petsc.h and using "use petsc" gives inclusion of the whole library.)
Variables can be declared with their type (Vec, Mat, KSP et cetera), but internally they are Fortran Type
objects so they can be declared as such.
Example:
#include "petsc/finclude/petscvec.h"
use petscvec
Vec b
type(tVec) x
The output arguments of many query routines are optional in PETSc. While in C a generic NULL can be
passed, Fortran has type-specific nulls, such as PETSC_NULL_INTEGER, PETSC_NULL_OBJECT.
32.1.4.3 Python
A python interface was written by Lisandro Dalcin. It can be added to PETSc at installation time; see section 32.3.
This book discusses the Python interface in short remarks in the appropriate sections.
32.1.5 Documentation
PETSc comes with a manual in pdf form and web pages with the documentation for every routine. The
starting point is the web page https://petsc.org/release/documentation/.
There is also a mailing list with excellent support for questions and bug reports.
TACC note. For questions specific to using PETSc on TACC resources, submit tickets to the TACC or
XSEDE portal.
% : %.F90
$(LINK.F) -o $@ $^ $(LDLIBS)
%.o: %.F90
$(COMPILE.F) $(OUTPUT_OPTION) $<
% : %.cxx
$(LINK.cc) -o $@ $^ $(LDLIBS)
%.o: %.cxx
$(COMPILE.cc) $(OUTPUT_OPTION) $<
Input Parameters:
argc - count of number of command line arguments
args - the command line arguments
file - [optional] PETSc database file.
help - [optional] Help message to print, use NULL for no message
Fortran:
call PetscInitialize(file,ierr)
Input parameters:
ierr - error return code
file - [optional] PETSc database file,
use PETSC_NULL_CHARACTER to not check for code specific file.
32.2.2 Running
PETSc programs use MPI for parallelism, so they are started like any other MPI program:
mpiexec -n 5 -machinefile mf \
your_petsc_program option1 option2 option3
TACC note. On TACC clusters, use ibrun.
32.3.1 Debug
For any set of options, you will typically make two installations: one with -with-debugging=yes and
once no. See section 39.1.1 for more detail on the differences between debug and non-debug mode.
32.3.3 Variants
• Scalars: the option -with-scalar-type has values real, complex; -with-precision has values
single, double, __float128, __fp16.
This is easiest if your python already includes mpi4py; see section 1.5.4.
Remark 31 There are two packages that PETSc is capable of downloading and installing, but that you may want to avoid:
• fblaslapack: this gives you BLAS/LAPACK through the Fortran 'reference implementation'. If you have an optimized version available, such as Intel's MKL, that will give much higher performance.
• mpich: this installs an MPI implementation, which may be needed on your laptop. However, supercomputer clusters will already have an MPI implementation that uses the high-speed network; PETSc's downloaded version does not do that. Again, finding and using the already installed software may greatly improve your performance.
32.4.1 Slepc
Most external packages add functionality to the lower layers of PETSc. For instance, the Hypre package adds some preconditioners to PETSc's repertoire (section 36.1.7.3), while Mumps (section 36.2) makes it possible to use the LU preconditioner in parallel.
On the other hand, there are packages that use PETSc as a lower level tool. In particular, the eigenvalue solver package Slepc [28] can be installed through the options
--download-slepc=<no,yes,filename,url>
Download and install slepc current: no
--download-slepc-commit=commitid
The commit id from a git repository to use for the build of slepc current: 0
--download-slepc-configure-arguments=string
Additional configure arguments for the build of SLEPc
The slepc header files wind up in the same directory as the petsc headers, so no change to your compilation rules is needed. However, you need to add -lslepc to the link line.
#include <petscsys.h>
return PetscFinalize();
}
#include <petsc/finclude/petscsys.h>
use petsc
implicit none
logical :: flag
PetscErrorCode ierr;
character*80 :: help = "\nInit example.\n\n";
call PetscInitialize(PETSC_NULL_CHARACTER,ierr)
CHKERRA(ierr)
call MPI_Initialized(flag,ierr)
CHKERRA(ierr)
if (flag) then
print *,"MPI was initialized by PETSc"
else
print *,"MPI not yet initialized"
end if
call PetscFinalize(ierr); CHKERRQ(ierr);
PETSc objects
The way the distribution is done is by contiguous blocks: with 10 processes and 1000 components in a vector, process 0 gets the range 0–99, process 1 gets 100–199, et cetera. This simple scheme suffices for many cases, but PETSc has facilities for more sophisticated load balancing.
Synopsis
#include "petscsys.h"
PetscErrorCode PetscSplitOwnership
(MPI_Comm comm,PetscInt *n,PetscInt *N)
Input Parameters
comm - MPI communicator that shares the object being divided
n - local length (or PETSC_DECIDE to have it set)
N - global length (or PETSC_DECIDE)
// split.c
N = 100; n = PETSC_DECIDE;
PetscSplitOwnership(comm,&n,&N);
PetscPrintf(comm,"Global %d, local %d\n",N,n);
N = PETSC_DECIDE; n = 10;
PetscSplitOwnership(comm,&n,&N);
PetscPrintf(comm,"Global %d, local %d\n",N,n);
These conversions between local and global size can also be done explicitly, using the PetscSplitOwnership (figure 33.1) routine. This routine takes two parameters, for the local and global size, and whichever one is initialized to PETSC_DECIDE gets computed from the other.
33.2 Scalars
Unlike programming languages that explicitly distinguish between single and double precision numbers,
PETSc has only a single scalar type: PetscScalar. The precision of this is determined at installation time.
In fact, a PetscScalar can even be a complex number if the installation specified that the scalar type is
complex.
Even in applications that use complex numbers there can be quantities that are real: for instance, the norm
of a complex vector is a real number. For that reason, PETSc also has the type PetscReal. There is also an
explicit PetscComplex.
Furthermore, there are predefined sizes (in bytes) used for binary storage of the various datatypes:
#define PETSC_BINARY_INT_SIZE (32/8)
#define PETSC_BINARY_FLOAT_SIZE (32/8)
#define PETSC_BINARY_CHAR_SIZE (8/8)
#define PETSC_BINARY_SHORT_SIZE (16/8)
#define PETSC_BINARY_DOUBLE_SIZE (64/8)
#define PETSC_BINARY_SCALAR_SIZE sizeof(PetscScalar)
33.2.1 Integers
Integers in PETSc are likewise of a size determined at installation time: PetscInt can be 32 or 64 bits. The
latter possibility is useful for indexing into large vectors and matrices. Furthermore, there is a PetscErrorCode
type for catching the return code of PETSc routines; see section 39.1.2.
For compatibility with other packages there are two more integer types:
• PetscBLASInt is the integer type used by the Basic Linear Algebra Subprograms (BLAS) / Linear
Algebra Package (LAPACK) library. This is 32-bits if the -download-blas-lapack option is used,
but it can be 64-bit if MKL is used. The routine PetscBLASIntCast casts a PetscInt to PetscBLASInt,
or returns PETSC_ERR_ARG_OUTOFRANGE if it is too large.
• PetscMPIInt is the integer type of the MPI library, which is always 32-bits. The routine PetscMPIIntCast
casts a PetscInt to PetscMPIInt, or returns PETSC_ERR_ARG_OUTOFRANGE if it is too large.
Many external packages do not support 64-bit integers.
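As a small sketch of why these casts matter (buffer, nlocal, and comm are assumed names, not from the book's examples), a PetscInt count can not in general be passed directly as an MPI count argument:
PetscMPIInt count;
PetscMPIIntCast(nlocal,&count);   // fails if nlocal does not fit in 32 bits
MPI_Send(buffer,count,MPIU_SCALAR,1,0,comm);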
33.2.2 Complex
Numbers of type PetscComplex have a precision matching PetscReal.
Form a complex number using PETSC_i:
PetscComplex x = 1.0 + 2.0 * PETSC_i;
The real and imaginary part can be extracted with the functions PetscRealPart and PetscImaginaryPart, which return a PetscReal.
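A small illustration, continuing the declaration above:
PetscReal re = PetscRealPart(x);       // 1.0
PetscReal im = PetscImaginaryPart(x);  // 2.0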
There are also routines VecRealPart and VecImaginaryPart that replace a vector with its real or imaginary
part respectively. Likewise MatRealPart and MatImaginaryPart.
F:
VecCreate( comm,v,ierr )
MPI_Comm :: comm
Vec :: v
PetscErrorCode :: ierr
Python:
vec = PETSc.Vec()
vec.create()
# or:
vec = PETSc.Vec().create()
33.2.4 Booleans
There is a PetscBool datatype with values PETSC_TRUE and PETSC_FALSE.
The corresponding routine VecDestroy (figure 33.3) deallocates data and zeros the pointer. (This and all
other Destroy routines are collective because of underlying MPI technicalities.)
The vector type needs to be set with VecSetType (figure 33.4).
The most common vector types are:
• VECSEQ for sequential vectors, that is, living on a single process. Such a vector is typically created on the MPI_COMM_SELF or PETSC_COMM_SELF communicator.
Collective on Vec
Input Parameters:
v -the vector
Collective on Vec
Input Parameters:
vec- The vector object
method- The name of the vector type
• VECMPI for a vector distributed over the communicator. This is typically created on the MPI_COMM_WORLD
or PETSC_COMM_WORLD communicator, or one derived from it.
• VECSTANDARD is VECSEQ when used on a single process, or VECMPI on multiple.
You may wonder why these types exist: you could have just one type, which would be as parallel as
possible. The reason is that in a parallel run you may occasionally have a separate linear system on each
process, which would require a sequential vector (and matrix) on each process, not part of a larger linear
system.
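By way of illustration, a minimal sketch of the typical creation sequence (error checking omitted; this is not one of the book's listings):
Vec x;
VecCreate(PETSC_COMM_WORLD,&x);
VecSetType(x,VECMPI);             // or VECSTANDARD
VecSetSizes(x,PETSC_DECIDE,100);  // let PETSc decide the local size; global size 100
VecSetFromOptions(x);
/* ... use the vector ... */
VecDestroy(&x);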
Once you have created one vector, you can make more like it by VecDuplicate,
VecDuplicate(Vec old,Vec *new);
or VecDuplicateVecs
VecDuplicateVecs(Vec old,PetscInt n,Vec **new);
for multiple vectors. For the latter, there is a joint destroy call VecDestroyVecs:
VecDestroyVecs(PetscInt n,Vec **vecs);
Input Parameters
v :the vector
n : the local size (or PETSC_DECIDE to have it set)
N : the global size (or PETSC_DECIDE)
Python:
PETSc.Vec.setSizes(self, size, bsize=None)
size is a tuple of local/global
C:
#include "petscvec.h"
PetscErrorCode VecGetSize(Vec x,PetscInt *gsize)
PetscErrorCode VecGetLocalSize(Vec x,PetscInt *lsize)
Input Parameter
x -the vector
Output Parameters
gsize - the global length of the vector
lsize - the local length of the vector
Python:
PETSc.Vec.getLocalSize(self)
PETSc.Vec.getSize(self)
PETSc.Vec.getSizes(self)
Of the local and global size, it is possible to specify just one and let the other be computed by the library. This is indicated by setting it to PETSC_DECIDE.
Python note 36: vector size. Use PETSc.DECIDE for the parameter not specified:
x.setSizes([2,PETSc.DECIDE])
The size is queried with VecGetSize (figure 33.6) for the global size and VecGetLocalSize (figure 33.6) for
the local size.
Each processor gets a contiguous part of the vector. Use VecGetOwnershipRange (figure 33.7) to query the
first index on this process, and the first one of the next process.
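A minimal sketch of how these queries are typically combined, using VecSetValue which is discussed below (x is an already created vector; error checking omitted):
PetscInt myfirst,mylast;
VecGetOwnershipRange(x,&myfirst,&mylast);
for (PetscInt i=myfirst; i<mylast; i++)
  VecSetValue(x,i,(PetscScalar)i,INSERT_VALUES); // set only locally owned entries
VecAssemblyBegin(x); VecAssemblyEnd(x);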
In general it is best to let PETSc take care of memory management of matrix and vector objects, including allocating and freeing the memory. However, in cases where PETSc interfaces to other applications it may be desirable to create a Vec object from an already allocated array: VecCreateSeqWithArray and
Input parameter:
x - the vector
Output parameters:
low - the first local element, pass in NULL if not interested
high - one more than the last local element, pass in NULL if not interested
Fortran note:
use PETSC_NULL_INTEGER for NULL.
Input Parameters:
alpha - the scalar
x, y - the vectors
Output Parameter:
y - output vector
VecCreateMPIWithArray.
VecCreateSeqWithArray
(MPI_Comm comm,PetscInt bs,
PetscInt n,PetscScalar *array,Vec *V);
VecCreateMPIWithArray
(MPI_Comm comm,PetscInt bs,
PetscInt n,PetscInt N,PetscScalar *array,Vec *vv);
As you will see in section 33.4.1, you can also create vectors based on the layout of a matrix, using
MatCreateVecs.
Python:
PETSc.Vec.view(self, Viewer viewer=None)
Collective on Vec
Input Parameters:
x, y - the vectors
Output Parameter:
val - the dot product
// fftsine.c
ierr = VecView(signal,PETSC_VIEWER_STDOUT_WORLD);
ierr = MatMult(transform,signal,frequencies);
ierr = VecScale(frequencies,1./Nglobal);
ierr = VecView(frequencies,PETSC_VIEWER_STDOUT_WORLD);
but the routine can also use more general PetscViewer objects, for instance to dump a vector to file.
Exercise 33.1. Create a vector where the values are a single sine wave, using VecGetSize,
VecGetLocalSize, VecGetOwnershipRange. Quick visual inspection:
ibrun vec -n 12 -vec_view
(There is a skeleton for this exercise under the name vec.)
Exercise 33.2. Use the routines VecDot (figure 33.10), VecScale (figure 33.11) and VecNorm
(figure 33.12) to compute the inner product of vectors x,y, scale the vector x, and compute its norm.
Input Parameters:
x - the vector
alpha - the scalar
Output Parameter:
x - the scaled vector
Python:
PETSc.Vec.norm(self, norm_type=None)
is defined.
x.sum() # max,min,....
x.dot(y)
x.norm(PETSc.NormType.NORM_INFINITY)
Not Collective
Input Parameters
v- the vector
row- the row location of the entry
value- the value to insert
mode- either INSERT_VALUES or ADD_VALUES
Not Collective
Input Parameters:
x - vector to insert in
ni - number of elements to add
ix - indices where to add
y - array of values
iora - either INSERT_VALUES or ADD_VALUES, where
ADD_VALUES adds values to any existing entries, and
INSERT_VALUES replaces existing entries with new values
In the general case, setting elements in a PETSc vector is done through the function VecSetValue (figure 33.13), which uses global numbering; any process can set any element in the vector. There is also a routine VecSetValues (figure 33.14) for setting multiple elements at once; this is mostly useful for setting dense subblocks of a block matrix.
Collective on Vec
Input Parameter
vec -the vector
We illustrate both routines by setting a single element with VecSetValue, and two elements with VecSetValues.
In the latter case we need an array of length two for both the indices and values. The indices need not be
successive.
i = 1; v = 3.14;
VecSetValue(x,i,v,INSERT_VALUES);
ii[0] = 1; ii[1] = 2; vv[0] = 2.7; vv[1] = 3.1;
VecSetValues(x,2,ii,vv,INSERT_VALUES);
call VecSetValue(x,i,v,INSERT_VALUES,ierr)
ii(1) = 1; ii(2) = 2; vv(1) = 2.7; vv(2) = 3.1
call VecSetValues(x,2,ii,vv,INSERT_VALUES,ierr)
Multiple elements:
x.setValues( [2*procno,2*procno+1], [2.,3.] )
Using VecSetValue for specifying a local vector element corresponds to simple insertion in the local array. However, an element that belongs to another process needs to be transferred. This is done in two calls: VecAssemblyBegin (figure 33.15) and VecAssemblyEnd.
if (myrank==0) then
do vecidx=0,globalsize-1
vecelt = vecidx
call VecSetValue(vector,vecidx,vecelt,INSERT_VALUES,ierr)
end do
end if
call VecAssemblyBegin(vector,ierr)
call VecAssemblyEnd(vector,ierr)
Input Parameter
x : the vector
Output Parameter
a : location to put pointer to the array
Input Parameters
x : the vector
a : location of pointer to array obtained from VecGetArray()
Fortran90:
#include <petsc/finclude/petscvec.h>
use petscvec
VecGetArrayF90(Vec x,{Scalar, pointer :: xx_v(:)},integer ierr)
(there is a Fortran77 version)
VecRestoreArrayF90(Vec x,{Scalar, pointer :: xx_v(:)},integer ierr)
Python:
PETSc.Vec.getArray(self, readonly=False)
?? PETSc.Vec.resetArray(self, force=False)
Elements can either be inserted with INSERT_VALUES, or added with ADD_VALUES in the VecSetValue / VecSetValues
call. You can not immediately mix these modes; to do so you need to call VecAssemblyBegin / VecAssemblyEnd
in between add/insert phases.
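A small sketch of this rule (x is an assumed, already created vector; error checking omitted):
VecSetValue(x,0,1.,INSERT_VALUES);
VecAssemblyBegin(x); VecAssemblyEnd(x);   // assembly needed before switching modes
VecSetValue(x,0,5.,ADD_VALUES);           // entry 0 now becomes 6
VecAssemblyBegin(x); VecAssemblyEnd(x);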
Input Parameters:
x- the vector
a- location of pointer to array obtained from VecGetArray()
Fortran90:
#include <petsc/finclude/petscvec.h>
use petscvec
VecRestoreArrayF90(Vec x,{Scalar, pointer :: xx_v(:)},integer ierr)
Input Parameters:
x- vector
xx_v- the Fortran90 pointer to the array
This example also uses VecGetLocalSize to determine the size of the data accessed. Even running in a
distributed context you can only get the array of local elements. Accessing the elements from another
process requires explicit communication; see section 33.5.2.
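A minimal sketch of the VecGetArray / VecRestoreArray pattern in C (the vector x is assumed to be created and assembled; error checking omitted):
PetscInt nlocal;
PetscScalar *localdata;
VecGetLocalSize(x,&nlocal);
VecGetArray(x,&localdata);
for (PetscInt i=0; i<nlocal; i++)
  localdata[i] *= 2.;       // operate only on the locally owned entries
VecRestoreArray(x,&localdata);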
There are some variants to the VecGetArray operation:
• VecReplaceArray (figure 33.18) frees the memory of the Vec object, and replaces it with a different
array. That latter array needs to be allocated with PetscMalloc.
#include "petscvec.h"
PetscErrorCode VecPlaceArray(Vec vec,const PetscScalar array[])
PetscErrorCode VecReplaceArray(Vec vec,const PetscScalar array[])
Input Parameters
vec - the vector
array - the array
• VecPlaceArray (figure 33.18) also installs a new array in the vector, but it keeps the original array;
this can be restored with VecResetArray.
Putting the array of one vector into another has a common application, where you have a distributed
vector, but want to apply PETSc operations to its local section as if it were a sequential vector. In that case
you would create a sequential vector, and VecPlaceArray the contents of the distributed vector into it.
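A sketch of that pattern (names such as globalvec are generic assumptions; error checking omitted):
PetscScalar *localdata;
PetscInt nlocal;
Vec localvec;
VecGetLocalSize(globalvec,&nlocal);
VecCreateSeqWithArray(PETSC_COMM_SELF,1,nlocal,NULL,&localvec); // array supplied below
VecGetArray(globalvec,&localdata);
VecPlaceArray(localvec,localdata);
/* ... apply sequential operations to localvec ... */
VecResetArray(localvec);
VecRestoreArray(globalvec,&localdata);
VecDestroy(&localvec);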
Fortran note 20: f90 array access through pointer. There are routines such as VecGetArrayF90 (with corre-
sponding VecRestoreArrayF90) that return a (Fortran) pointer to a one-dimensional array.
!! vecset.F90
Vec :: vector
PetscScalar,dimension(:),pointer :: elements
call VecGetArrayF90(vector,elements,ierr)
write (msg,10) myrank,elements(1)
10 format("First element on process",i3,":",f7.4,"\n")
call PetscSynchronizedPrintf(comm,msg,ierr)
call PetscSynchronizedFlush(comm,PETSC_STDOUT,ierr)
call VecRestoreArrayF90(vector,elements,ierr)
Python:
mat = PETSc.Mat()
mat.create()
# or:
mat = PETSc.Mat().create()
That is, the file starts with a magic number, then the number of vector elements, and subsequently all
scalar values.
Collective on Mat
Input Parameters:
mat- the matrix object
matype- matrix type
Just as with vectors, there is a local and a global size, except that now they apply to rows and columns. Set sizes with MatSetSizes (figure 33.21) and subsequently query them with MatGetSize and MatGetLocalSize (figure 33.22). The concept of local column size is tricky: since a process stores a full block row you may expect the local column size to be the full matrix size, but that is not true. The exact definition will be discussed later, but for square matrices it is a safe strategy to let the local row and column sizes be equal.
Instead of querying a matrix size and creating vectors accordingly, the routine MatCreateVecs (figure 33.23)
can be used. (Sometimes this is even required; see section 33.4.9.)
Input Parameters
A : the matrix
m : number of local rows (or PETSC_DECIDE)
n : number of local columns (or PETSC_DECIDE)
M : number of global rows (or PETSC_DETERMINE)
N : number of global columns (or PETSC_DETERMINE)
Python:
PETSc.Mat.setSizes(self, size, bsize=None)
where 'size' is a tuple of 2 global sizes
or a tuple of 2 local/global pairs
Python:
PETSc.Mat.getSize(self) # tuple of global sizes
PETSc.Mat.getLocalSize(self) # tuple of local sizes
PETSc.Mat.getSizes(self) # tuple of local/global size tuples
#include "petscmat.h"
PetscErrorCode MatCreateVecs(Mat mat,Vec *right,Vec *left)
Collective on Mat
Input Parameter
mat - the matrix
Output Parameter;
right - (optional) vector that the matrix can be multiplied against
left - (optional) vector that the matrix vector product can be stored in
Input Parameters
B - the matrix
nz/d_nz/o_nz - number of nonzeros per row in matrix or
diagonal/off-diagonal portion of local submatrix
nnz/d_nnz/o_nnz - array containing the number of nonzeros in the various rows of
the sequential matrix / diagonal / offdiagonal part of the local submatrix
or NULL (PETSC_NULL_INTEGER in Fortran) if nz/d_nz/o_nz is used.
Python:
PETSc.Mat.setPreallocationNNZ(self, [nnz_d,nnz_o] )
PETSc.Mat.setPreallocationCSR(self, csr)
PETSc.Mat.setPreallocationDense(self, array)
Repeated allocations and re-allocations are inefficient. For this reason, PETSc puts a small burden on the programmer: you need to specify a bound on how many elements the matrix will contain.
We explain this by looking at some cases. First we consider a matrix that only lives on a single process.
You would then use MatSeqAIJSetPreallocation (figure 33.24). In the case of a tridiagonal matrix you would
specify that each row has three elements:
MatSeqAIJSetPreallocation(A,3, NULL);
If the matrix is less regular you can use the third argument to give an array of explicit row lengths:
int *rowlengths;
// allocate, and then:
for (int row=0; row<nrows; row++)
  rowlengths[row] = /* calculation of row length */ ;
MatSeqAIJSetPreallocation(A,0,rowlengths);
In case of a distributed matrix you need to specify this bound with respect to the block structure of the
matrix. As illustrated in figure 33.2, a matrix has a diagonal part and an off-diagonal part. The diagonal
part describes the matrix elements that couple elements of the input and output vector that live on this
process. The off-diagonal part contains the matrix elements that are multiplied with elements not on this
process, in order to compute elements that do live on this process.
The preallocation specification now has separate parameters for these diagonal and off-diagonal parts:
with MatMPIAIJSetPreallocation (figure 33.24). you specify for both either a global upper bound on the
number of nonzeros, or a detailed listing of row lengths. For the matrix of the Laplace equation, this
specification would seem to be:
MatMPIAIJSetPreallocation(A, 3, NULL, 2, NULL);
[Figure 33.2: the diagonal block A and the off-diagonal block B of a distributed matrix; the off-diagonal block has the off-processor connections]
Input Parameters
m : the matrix
row : the row location of the entry
col : the column location of the entry
value : the value to insert
mode : either INSERT_VALUES or ADD_VALUES
Python:
PETSc.Mat.setValue(self, row, col, value, addv=None)
also supported:
A[row,col] = value
However, this is only correct if the block structure from the parallel division equals that from the lines in
the domain. In general it may be necessary to use values that are an overestimate. It is then possible to
contract the storage by copying the matrix.
Specifying bounds on the number of nonzeros is often enough, and not too wasteful. However, if many
rows have fewer nonzeros than these bounds, a lot of space is wasted. In that case you can replace the
NULL arguments by an array that lists for each row the number of nonzeros in that row.
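A sketch of per-row preallocation for a distributed matrix (the counts shown are hypothetical; nlocal is the number of local rows):
PetscInt *d_nnz,*o_nnz;
PetscMalloc1(nlocal,&d_nnz); PetscMalloc1(nlocal,&o_nnz);
for (PetscInt row=0; row<nlocal; row++) {
  d_nnz[row] = 3;  // nonzeros coupling to on-process variables
  o_nnz[row] = 2;  // nonzeros coupling to off-process variables
}
MatMPIAIJSetPreallocation(A,0,d_nnz,0,o_nnz);
PetscFree(d_nnz); PetscFree(o_nnz);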
Input Parameters
mat- the matrix
type- type of assembly, either MAT_FLUSH_ASSEMBLY
or MAT_FINAL_ASSEMBLY
Python:
assemble(self, assembly=None)
assemblyBegin(self, assembly=None)
assemblyEnd(self, assembly=None)
After setting elements, the matrix has to be assembled with MatAssemblyBegin (figure 33.26) and MatAssemblyEnd (figure 33.26), which can be used to achieve latency hiding.
Elements can either be inserted (INSERT_VALUES) or added (ADD_VALUES). You can not immediately mix these
modes; to do so you need to call MatAssemblyBegin / MatAssemblyEnd with a value of MAT_FLUSH_ASSEMBLY.
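A small sketch of switching modes with a flush assembly (A is an assumed matrix; error checking omitted):
MatSetValue(A,0,0,1.,INSERT_VALUES);
MatAssemblyBegin(A,MAT_FLUSH_ASSEMBLY); MatAssemblyEnd(A,MAT_FLUSH_ASSEMBLY);
MatSetValue(A,0,0,5.,ADD_VALUES);
MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY); MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);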
PETSc sparse matrices are very flexible: you can create them empty and then start adding elements. How-
ever, this is very inefficient in execution since the OS needs to reallocate the matrix every time it grows
a little. Therefore, PETSc has calls for the user to indicate how many elements the matrix will ultimately
contain.
MatSetOption(A, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_FALSE)
#include "petscmat.h"
PetscErrorCode MatGetRow
(Mat mat,PetscInt row,
PetscInt *ncols,const PetscInt *cols[],const PetscScalar *vals[])
PetscErrorCode MatRestoreRow
(Mat mat,PetscInt row,
PetscInt *ncols,const PetscInt *cols[],const PetscScalar *vals[])
Input Parameters:
mat - the matrix
row - the row to get
Output Parameters
ncols - if not NULL, the number of nonzeros in the row
cols - if not NULL, the column numbers
vals - if not NULL, the values
Input Parameters
mat - the matrix
x - the vector to be multiplied
Output Parameters
y - the result
Input Parameters
mat - the matrix
x, y - the vectors
Output Parameters
z -the result
Notes
The vectors x and z cannot be the same.
33.4.6 Submatrices
Given a parallel matrix, there are two routines for extracting submatrices:
• MatCreateSubMatrix creates a single parallel submatrix.
• MatCreateSubMatrices creates a sequential submatrix on each process.
Collective
Input Parameters:
comm- MPI communicator
m- number of local rows (must be given)
n- number of local columns (must be given)
M- number of global rows (may be PETSC_DETERMINE)
N- number of global columns (may be PETSC_DETERMINE)
ctx- pointer to data needed by the shell matrix routines
Output Parameter:
A -the matrix
Input Parameters:
mat- the shell matrix
op- the name of the operation
g- the function that provides the operation.
Which operation is being specified is determined by a keyword MATOP_<OP>, where OP is the name of the matrix routine, minus the Mat part, in all caps.
MatCreate(comm,&A);
MatSetSizes(A,localsize,localsize,matrix_size,matrix_size);
MatSetType(A,MATSHELL);
Input Parameters
mat - the shell matrix
ctx - the context
Not Collective
Input Parameter:
mat -the matrix, should have been created with MatCreateShell()
Output Parameter:
ctx -the user provided context
MatSetFromOptions(A);
MatShellSetOperation(A,MATOP_MULT,(void*)&mymatmult);
MatShellSetContext(A,(void*)Diag);
MatSetUp(A);
The routine signature has this argument as a void*, but it is not necessary to cast your pointer to that type. To retrieve the context, pass the address of a pointer to your structure:
struct matrix_data *mystruct;
MatShellGetContext( A, &mystruct );
Somewhat confusingly, the Get routine also has a void* argument, even though what you pass is really the address of a pointer variable.
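For illustration, a sketch of what the multiply routine registered above could look like; the actual product computation and the matrix_data structure are application specific.
PetscErrorCode mymatmult(Mat A,Vec in,Vec out) {
  struct matrix_data *mystruct;
  MatShellGetContext(A,&mystruct);
  /* compute out = A*in, using the data in mystruct */
  return 0;
}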
In both cases this corresponds to a block matrix, but for a problem of $N$ nodes and 3 equations, the respective structures are:
1. $3\times 3$ blocks of size $N$, versus
2. $N\times N$ blocks of size $3$.
The first case can be pictured as
$$ \begin{pmatrix} A_{00}&A_{01}&A_{02}\\ A_{10}&A_{11}&A_{12}\\ A_{20}&A_{21}&A_{22} \end{pmatrix} $$
and while it looks natural, there is a computational problem with it. Preconditioners for such problems often look like
$$ \begin{pmatrix} A_{00}&&\\ &A_{11}&\\ &&A_{22} \end{pmatrix}
\qquad\hbox{or}\qquad
\begin{pmatrix} A_{00}&&\\ A_{10}&A_{11}&\\ A_{20}&A_{21}&A_{22} \end{pmatrix} $$
With the block-row partitioning of PETSc's matrices, this means at most a 50% efficiency for the preconditioner solve.
It is better to use the second scheme, which requires the MATMPIBAIJ format, and use so-called field-split preconditioners; see section 36.1.7.3.5.
#include "petscvec.h"
PetscErrorCode VecScatterCreate(Vec xin,IS ix,Vec yin,IS iy,VecScatter *newctx)
Input Parameters:
xin : a vector that defines the layout of vectors from which we scatter
yin : a vector that defines the layout of vectors to which we scatter
ix : the indices of xin to scatter (if NULL scatters all values)
iy : the indices of yin to hold results (if NULL fills entire vector yin)
Output Parameter
newctx : location to store the new scatter context
Note that the index set is applied to the input vector, since it describes the components to be moved. The
output vector uses NULL since these components are placed in sequence.
Exercise 33.3. Modify this example so that the components are still separated odd/even, but
now placed in descending order on each process.
Exercise 33.4. Can you extend this example so that process 𝑝 receives all indices that are
multiples of 𝑝? Is your solution correct if Nglobal is not a multiple of nprocs?
33.7 Partitionings
By default, PETSc uses partitioning of matrices and vectors based on consecutive blocks of variables. In
regular cases that is not a bad strategy. However, for some matrices a permutation and re-division can be
advantageous. For instance, one could look at the adjacency graph, and minimize the number of edge cuts
or the sum of the edge weights.
This functionality is not built into PETSc, but can be provided by graph partitioning packages such as
ParMetis or Zoltan. The basic object is the MatPartitioning, with routines for
• Create and destroy: MatPartitioningCreate, MatPartitioningDestroy;
• Setting the type MatPartitioningSetType to an explicit partitioner, or something generated as
the dual or a refinement of the current matrix;
• Apply with MatPartitioningApply, giving a distributed IS object, which can then be used in
MatCreateSubMatrix to repartition.
Illustrative example:
MatPartitioning part;
MatPartitioningCreate(comm,&part);
MatPartitioningSetType(part,MATPARTITIONINGPARMETIS);
MatPartitioningApply(part,&is);
/* get new global number of each old global number */
ISPartitioningToNumbering(is,&isn);
ISBuildTwoSided(is,NULL,&isrows);
MatCreateSubMatrix(A,isrows,isrows,MAT_INITIAL_MATRIX,&perA);
Other scenario:
MatPartitioningSetAdjacency(part,A);
MatPartitioningSetType(part,MATPARTITIONINGHIERARCH);
MatPartitioningHierarchicalSetNcoarseparts(part,2);
MatPartitioningHierarchicalSetNfineparts(part,2);
#include <petscvec.h>
PetscInitialize(&argc,&argv,(char*)0,help);
MPI_Comm comm = PETSC_COMM_WORLD;
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procid);
PetscInt N,n;
N = 100; n = PETSC_DECIDE;
PetscSplitOwnership(comm,&n,&N);
PetscPrintf(comm,"Global %d, local %d\n",N,n);
N = PETSC_DECIDE; n = 10;
PetscSplitOwnership(comm,&n,&N);
PetscPrintf(comm,"Global %d, local %d\n",N,n);
return PetscFinalize();
}
#include <petscmat.h>
PetscErrorCode ierr;
PetscInitialize(&argc,&argv,(char*)0,help);
MPI_Comm comm = PETSC_COMM_WORLD;
MPI_Comm_size(comm,&nprocs);
MPI_Comm_rank(comm,&procid);
Mat transform;
int dimensionality=1;
PetscInt dimensions[dimensionality]; dimensions[0] = Nglobal;
PetscPrintf(comm,"Creating fft D=%d, dim=%d\n",dimensionality,dimensions[0]);
ierr = MatCreateFFT(comm,dimensionality,dimensions,MATFFTW,&transform); CHKERRQ(ierr);
{
PetscInt fft_i,fft_j;
ierr = MatGetSize(transform,&fft_i,&fft_j); CHKERRQ(ierr);
PetscPrintf(comm,"FFT global size %d x %d\n",fft_i,fft_j);
}
Vec signal,frequencies;
ierr = MatCreateVecsFFTW(transform,&frequencies,&signal,PETSC_NULL); CHKERRQ(ierr);
ierr = PetscObjectSetName((PetscObject)signal,"signal"); CHKERRQ(ierr);
ierr = PetscObjectSetName((PetscObject)frequencies,"frequencies"); CHKERRQ(ierr);
ierr = VecAssemblyBegin(signal); CHKERRQ(ierr);
ierr = VecAssemblyEnd(signal); CHKERRQ(ierr);
{
PetscInt nlocal,nglobal;
ierr = VecGetLocalSize(signal,&nlocal); CHKERRQ(ierr);
ierr = VecGetSize(signal,&nglobal); CHKERRQ(ierr);
ierr = PetscPrintf(comm,"Signal local=%d global=%d\n",nlocal,nglobal); CHKERRQ(ierr);
}
PetscInt myfirst,mylast;
ierr = VecGetOwnershipRange(signal,&myfirst,&mylast); CHKERRQ(ierr);
printf("Setting %d -- %d\n",myfirst,mylast);
for (PetscInt vecindex=0; vecindex<Nglobal; vecindex++) {
PetscScalar
pi = 4. * atan(1.0),
h = 1./Nglobal,
phi = 2* pi * vecindex * h,
puresine = cos( phi )
#if defined(PETSC_USE_COMPLEX)
+ PETSC_i * sin(phi)
#endif
;
ierr = VecSetValue(signal,vecindex,puresine,INSERT_VALUES); CHKERRQ(ierr);
}
ierr = VecAssemblyBegin(signal); CHKERRQ(ierr);
ierr = VecAssemblyEnd(signal); CHKERRQ(ierr);
{
Vec confirm;
ierr = VecDuplicate(signal,&confirm); CHKERRQ(ierr);
ierr = MatMultTranspose(transform,frequencies,confirm); CHKERRQ(ierr);
ierr = VecAXPY(confirm,-1,signal); CHKERRQ(ierr);
PetscReal nrm;
ierr = VecNorm(confirm,NORM_2,&nrm); CHKERRQ(ierr);
PetscPrintf(MPI_COMM_WORLD,"FFT accuracy %e\n",nrm);
ierr = VecDestroy(&confirm); CHKERRQ(ierr);
}
return PetscFinalize();
}
#include <petsc/finclude/petsc.h>
use petsc
implicit none
Vec :: vector
PetscScalar,dimension(:),pointer :: elements
PetscErrorCode :: ierr
PetscInt :: globalsize
integer :: myrank,vecidx,comm
PetscScalar :: vecelt
character*80 :: msg
call PetscInitialize(PETSC_NULL_CHARACTER,ierr)
CHKERRA(ierr)
comm = MPI_COMM_WORLD
call MPI_Comm_rank(comm,myrank,ierr)
call VecCreate(comm,vector,ierr)
call VecSetType(vector,VECMPI,ierr)
call VecSetSizes(vector,2,PETSC_DECIDE,ierr)
call VecGetSize(vector,globalsize,ierr)
if (myrank==0) then
do vecidx=0,globalsize-1
vecelt = vecidx
call VecSetValue(vector,vecidx,vecelt,INSERT_VALUES,ierr)
end do
end if
call VecAssemblyBegin(vector,ierr)
call VecAssemblyEnd(vector,ierr)
call VecView(vector,PETSC_VIEWER_STDOUT_WORLD,ierr)
call VecGetArrayF90(vector,elements,ierr)
write (msg,10) myrank,elements(1)
10 format("First element on process",i3,":",f7.4,"\n")
call PetscSynchronizedPrintf(comm,msg,ierr)
call PetscSynchronizedFlush(comm,PETSC_STDOUT,ierr)
call VecRestoreArrayF90(vector,elements,ierr)
call VecDestroy(vector,ierr)
call PetscFinalize(ierr); CHKERRQ(ierr);
MPI_Comm :: comm
Vec :: x,y
PetscInt :: n=1, procno
PetscScalar :: one=1.0, two=2.0, value, inprod,scaling,xnorm,ynorm
PetscScalar,dimension(:),Pointer :: &
in_array,out_array
PetscInt :: globalsize,localsize,myfirst,mylast,index
Character*80 :: message
PetscBool :: flag
PetscErrorCode :: ierr
call PetscInitialize(PETSC_NULL_CHARACTER,ierr)
CHKERRA(ierr)
comm = PETSC_COMM_WORLD
call MPI_Comm_rank(comm,procno,ierr)
!!
!! Get a commandline argument for the size of the problem
!!
call PetscOptionsGetInt( &
PETSC_NULL_OPTIONS,PETSC_NULL_CHARACTER, &
"-n",n,flag,ierr)
CHKERRA(ierr)
!!
!! Create vector `x' with a default layout
!!
call VecCreate(comm,x,ierr); CHKERRA(ierr)
call VecSetSizes(x,n,PETSC_DECIDE,ierr); CHKERRA(ierr)
call VecSetFromOptions(x,ierr); CHKERRA(ierr)
!!
!! Set x,y to constant values
!!
call VecSet(x,one,ierr); CHKERRA(ierr)
!!
!! Make another vector, just like x
!!
call VecDuplicate(x,y,ierr); CHKERRA(ierr)
!!
!! Get arrays and operate on them
!!
call VecGetArrayReadF90( x,in_array,ierr )
call VecGetArrayF90( y,out_array,ierr )
call VecGetLocalSize( x,localsize,ierr )
do index=1,localsize
out_array(index) = 2*in_array(index)
end do
call VecRestoreArrayReadF90( x,in_array,ierr )
call VecRestoreArrayF90( y,out_array,ierr )
!!
!! Sanity check printout
!!
call VecNorm(x,NORM_2,xnorm,ierr)
call VecNorm(y,NORM_2,ynorm,ierr)
write(message,10) xnorm,ynorm
10 format("Norm x: ",f6.3,", y: ",f6.3,"\n")
call PetscPrintf(comm,message,ierr)
!!
!! Free work space. All PETSc objects should be destroyed when they
!! are no longer needed
!!
call VecDestroy(x,ierr)
call VecDestroy(y,ierr)
call PetscFinalize(ierr);
CHKERRA(ierr)
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
PetscErrorCode ierr;
MPI_Comm comm;
PetscFunctionBegin;
PetscInitialize(&argc,&args,0,0);
comm = MPI_COMM_WORLD;
int nprocs,procid;
MPI_Comm_rank(comm,&procid);
MPI_Comm_size(comm,&nprocs);
if (nprocs!=2) {
PetscPrintf(comm,"This example only works on 2 processes, not %d\n",nprocs);
PetscFunctionReturn(-1); }
Vec in,out;
ierr = VecCreate(comm,&in); CHKERRQ(ierr);
ierr = VecSetType(in,VECMPI); CHKERRQ(ierr);
ierr = VecSetSizes(in,PETSC_DECIDE,Nglobal); CHKERRQ(ierr);
ierr = VecDuplicate(in,&out); CHKERRQ(ierr);
{
PetscInt myfirst,mylast;
ierr = VecGetOwnershipRange(in,&myfirst,&mylast); CHKERRQ(ierr);
for (PetscInt index=myfirst; index<mylast; index++) {
PetscScalar v = index;
ierr = VecSetValue(in,index,v,INSERT_VALUES); CHKERRQ(ierr);
}
ierr = VecAssemblyBegin(in); CHKERRQ(ierr);
ierr = VecAssemblyEnd(in); CHKERRQ(ierr);
}
IS oddeven;
if (procid==0) {
ierr = ISCreateStride(comm,Nglobal/2,0,2,&oddeven); CHKERRQ(ierr);
} else {
ierr = ISCreateStride(comm,Nglobal/2,1,2,&oddeven); CHKERRQ(ierr);
}
ISView(oddeven,0);
VecScatter separate;
ierr = VecScatterCreate
(in,oddeven,out,NULL,&separate); CHKERRQ(ierr);
ierr = VecScatterBegin
(separate,in,out,INSERT_VALUES,SCATTER_FORWARD); CHKERRQ(ierr);
ierr = VecScatterEnd
(separate,in,out,INSERT_VALUES,SCATTER_FORWARD); CHKERRQ(ierr);
return PetscFinalize();
}
Grid support
PETSc’s DM objects raise the abstraction level from the linear algebra problem to the physics problem:
they allow for a more direct expression of operators in terms of their domain of definition. In this section
we look at the DMDA ‘distributed array’ objects, which correspond to problems defined on Cartesian grids.
Distributed arrays make it easier to construct the coefficient matrix of an operator that is defined as a
stencil on a 1/2/3-dimensional Cartesian grid.
The main creation routine exists in three variants that mostly differ in their number of parameters. For instance, DMDACreate2d has parameters along the x,y axes. However, DMDACreate1d has no parameter for the stencil type, since in 1D those are all the same, nor for the process distribution.
• dof indicates the number of 'degrees of freedom', where 1 corresponds to a scalar problem.
• width indicates the extent of the stencil: 1 for a 5-point stencil or, more generally, a second-order stencil for second-order PDEs; 2 for second-order discretizations of a fourth-order PDE, et cetera.
• partitionx,partitiony are arrays giving explicit partitionings of the grid over the processors, or PETSC_NULL for default distributions.
Code:
// dmrhs.c
DM grid;
ierr = DMDACreate2d
  ( comm,
    DM_BOUNDARY_NONE,DM_BOUNDARY_NONE,
    DMDA_STENCIL_STAR,
    100,100,
    PETSC_DECIDE,PETSC_DECIDE,
    1,
    1,
    NULL,NULL,
    &grid
    );
ierr = DMSetFromOptions(grid);
ierr = DMSetUp(grid);
ierr = DMViewFromOptions(grid,NULL,"-dm_view");

Output:
[0] Local = 0-50 x 0-50, halo = 0-51 x 0-51
[1] Local = 50-100 x 0-50, halo = 49-100 x 0-51
[2] Local = 0-50 x 50-100, halo = 0-51 x 49-100
[3] Local = 50-100 x 50-100, halo = 49-100 x 49-100
After you define a DM object, each process has a contiguous subdomain out of the total grid. You can query
its size and location with DMDAGetCorners, or query that and all other information with DMDAGetLocalInfo
(figure 34.2), which returns a DMDALocalInfo (figure 34.3) structure.
DMDALocalInfo :: info(DMDA_LOCAL_INFO_SIZE)
info(DMDA_LOCAL_INFO_DIM)
info(DMDA_LOCAL_INFO_DOF) etc.
The entries bx,by,bz, st, and da are not accessible from Fortran.
(A DMDALocalInfo struct is the same for 1/2/3 dimensions, so certain fields may not be applicable to your
specific PDE.)
On each point of the domain, we describe the stencil at that point. First of all, we now have the information
to compute the 𝑥, 𝑦 coordinates of the domain points:
PetscReal **xyarray;
DMDAVecGetArray(grid,xy,&xyarray);
for (int j=info.ys; j<info.ys+info.ym; j++) {
  for (int i=info.xs; i<info.xs+info.xm; i++) {
    PetscReal x = i*hx, y = j*hy;
    xyarray[j][i] = x*y;
  }
}
DMDAVecRestoreArray(grid,xy,&xyarray);
Each matrix element row,col is a combination of two MatStencil objects. Technically, this is a struct with
members i,j,k,s for the domain coordinates and the number of the field.
MatStencil row;
row.i = i; row.j = j;
We could construct the columns in this row one by one, but MatSetValuesStencil can set multiple rows or
columns at a time, so we construct all columns at the same time:
MatStencil col[5];
PetscScalar v[5];
PetscInt ncols = 0;
/**** diagonal element ****/
col[ncols].i = i; col[ncols].j = j;
v[ncols++] = 4.;
/**** off diagonal elements ****/
....
The other ‘legs’ of the stencil need to be set conditionally: the connection to (𝑖 − 1, 𝑗) is missing on the top
row of the domain, and the connection to (𝑖, 𝑗 − 1) is missing on the left column. In all:
// grid2d.c
for (int j=info.ys; j<info.ys+info.ym; j++) {
for (int i=info.xs; i<info.xs+info.xm; i++) {
MatStencil row,col[5];
PetscScalar v[5];
PetscInt ncols = 0;
row.j = j; row.i = i;
/**** local connection: diagonal element ****/
col[ncols].j = j; col[ncols].i = i; v[ncols++] = 4.;
/* boundaries: top and bottom row */
if (i>0) {col[ncols].j = j; col[ncols].i = i-1; v[ncols++] = -1.;}
if (i<info.mx-1) {col[ncols].j = j; col[ncols].i = i+1; v[ncols++] = -1.;}
/* boundary left and right */
if (j>0) {col[ncols].j = j-1; col[ncols].i = i; v[ncols++] = -1.;}
if (j<info.my-1) {col[ncols].j = j+1; col[ncols].i = i; v[ncols++] = -1.;}
ierr = MatSetValuesStencil(A,1,&row,ncols,col,v,INSERT_VALUES);
}
}
plexprogram -dm_plex_dim k
for other dimensions. In two dimensions there are three levels of cells:
[Figure: a small two-dimensional DMPlex mesh, showing the numbering of the three levels of cells]
After this, you don’t use VecSetValues, but set elements directly in the raw array, obtained by DMDAVecGetArray:
PetscReal **xyarray;
DMDAVecGetArray(grid,xy,&xyarray);
for (int j=info.ys; j<info.ys+info.ym; j++) {
for (int i=info.xs; i<info.xs+info.xm; i++) {
PetscReal x = i*hx, y = j*hy;
xyarray[j][i] = x*y;
}
}
DMDAVecRestoreArray(grid,xy,&xyarray);
1. You can create a ‘global’ vector, defined on the same communicator as the array, and which is
disjointly partitioned in the same manner. This is done with DMCreateGlobalVector:
PetscErrorCode DMCreateGlobalVector(DM dm,Vec *vec)
2. You can create a ‘local’ vector, which is sequential and defined on PETSC_COMM_SELF, that has not
only the points local to the process, but also the ‘halo’ region with the extent specified in the
definition of the DMDACreate call. For this, use DMCreateLocalVector:
PetscErrorCode DMCreateLocalVector(DM dm,Vec *vec)
With this subdomain information you can then start to create the coefficient matrix:
DM grid;
PetscInt i_first,j_first,i_local,j_local;
DMDAGetCorners(grid,&i_first,&j_first,NULL,&i_local,&j_local,NULL);
for ( PetscInt i_index=i_first; i_index<i_first+i_local; i_index++) {
for ( PetscInt j_index=j_first; j_index<j_first+j_local; j_index++) {
// construct coefficients for domain point (i_index,j_index)
}
}
Note that indexing here is in terms of the grid, not in terms of the matrix.
For a simple example, consider 1-dimensional smoothing. From DMDAGetCorners we need only the param-
eters in 𝑖-direction:
// grid1d.c
PetscInt i_first,i_local;
ierr = DMDAGetCorners(grid,&i_first,NULL,NULL,&i_local,NULL,NULL);
for (PetscInt i_index=i_first; i_index<i_first+i_local; i_index++) {
PetscErrorCode ierr;
ierr = PetscInitialize(&argc,&argv,0,0); CHKERRQ(ierr);
/*
* Create a 2d grid and a matrix on that grid.
*/
DM grid;
ierr = DMDACreate2d
( comm,
DM_BOUNDARY_NONE,DM_BOUNDARY_NONE,
DMDA_STENCIL_STAR,
100,100,
PETSC_DECIDE,PETSC_DECIDE,
1,
1,
NULL,NULL,
&grid
); CHKERRQ(ierr);
ierr = DMSetFromOptions(grid); CHKERRQ(ierr);
ierr = DMSetUp(grid); CHKERRQ(ierr);
ierr = DMViewFromOptions(grid,NULL,"-dm_view"); CHKERRQ(ierr);
/*
* Print out how the grid is distributed over processors
*/
DMDALocalInfo info;
ierr = DMDAGetLocalInfo(grid,&info); CHKERRQ(ierr);
ierr = PetscPrintf(comm,"Create\n"); CHKERRQ(ierr);
ierr = PetscSynchronizedPrintf
(comm,
"[%d] Local = %d-%d x %d-%d, halo = %d-%d x %d-%d\n",
procno,
info.xs,info.xs+info.xm,info.ys,info.ys+info.ym,
info.gxs,info.gxs+info.gxm,info.gys,info.gys+info.gym
); CHKERRQ(ierr);
ierr = PetscSynchronizedFlush(comm,stdout); CHKERRQ(ierr);
ierr = PetscPrintf(comm,"create\n"); CHKERRQ(ierr);
Vec xy;
ierr = VecCreate(comm,&xy); CHKERRQ(ierr);
ierr = VecSetType(xy,VECMPI); CHKERRQ(ierr);
PetscInt nlocal = info.xm*info.ym, nglobal = info.mx*info.my;
ierr = VecSetSizes(xy,nlocal,nglobal); CHKERRQ(ierr);
{
PetscReal
hx = 1. / ( info.mx-1 ),
hy = 1. / ( info.my-1 );
PetscReal **xyarray;
DMDAVecGetArray(grid,xy,&xyarray); CHKERRQ(ierr);
for (int j=info.ys; j<info.ys+info.ym; j++) {
for (int i=info.xs; i<info.xs+info.xm; i++) {
PetscReal x = i*hx, y = j*hy;
xyarray[j][i] = x*y;
}
}
DMDAVecRestoreArray(grid,xy,&xyarray); CHKERRQ(ierr);
}
ierr = VecAssemblyBegin(xy); CHKERRQ(ierr);
ierr = VecAssemblyEnd(xy); CHKERRQ(ierr);
// ierr = VecView(xy,0); CHKERRQ(ierr);
{
Vec ghostvector;
ierr = DMGetLocalVector(grid,&ghostvector); CHKERRQ(ierr);
ierr = DMGlobalToLocal(grid,xy,INSERT_VALUES,ghostvector); CHKERRQ(ierr);
PetscReal **xyarray,**gh;
ierr = DMDAVecGetArray(grid,xy,&xyarray); CHKERRQ(ierr);
ierr = DMDAVecGetArray(grid,ghostvector,&gh); CHKERRQ(ierr);
// computation on the arrays
for (int j=info.ys; j<info.ys+info.ym; j++) {
for (int i=info.xs; i<info.xs+info.xm; i++) {
if (info.gxs<info.xs && info.gys<info.ys)
if (i-1>=info.gxs && i+1<=info.gxs+info.gxm &&
j-1>=info.gys && j+1<=info.gys+info.gym )
xyarray[j][i] =
( gh[j-1][i] + gh[j][i-1] + gh[j][i+1] + gh[j+1][i] )
/4.;
goto exit;
}
} exit:
ierr = DMDAVecRestoreArray(grid,xy,&xyarray); CHKERRQ(ierr);
ierr = DMDAVecRestoreArray(grid,ghostvector,&gh); CHKERRQ(ierr);
ierr = DMLocalToGlobal(grid,ghostvector,INSERT_VALUES,xy); CHKERRQ(ierr);
ierr = DMRestoreLocalVector(grid,&ghostvector); CHKERRQ(ierr);
/* info.ys,info.ym,info.gys,info.gym */
/* ); */
// if ( ! (info.ys==info.gys || info.ys==info.gys ) ) goto exit;
}
return PetscFinalize();
}
PetscErrorCode ierr;
ierr = PetscInitialize(&argc,&argv,0,0); CHKERRQ(ierr);
/*
* Create a 2d grid and a matrix on that grid.
*/
DM grid;
ierr = DMDACreate1d // IN:
( comm, // collective on this communicator
DM_BOUNDARY_NONE, // no periodicity and such
i_global, // global size 100x100; can be changed with options
1, // degree of freedom per node
1, // stencil width
NULL, // arrays of local sizes in each direction
&grid // OUT: resulting object
); CHKERRQ(ierr);
ierr = DMSetUp(grid); CHKERRQ(ierr);
Mat A;
ierr = DMCreateMatrix(grid,&A); CHKERRQ(ierr);
/*
* Print out how the grid is distributed over processors
*/
PetscInt i_first,i_local;
ierr = DMDAGetCorners(grid,&i_first,NULL,NULL,&i_local,NULL,NULL);CHKERRQ(ierr);
/* ierr = PetscSynchronizedPrintf */
/* (comm, */
/* "[%d] Local part = %d-%d x %d-%d\n", */
/* procno,info.xs,info.xs+info.xm,info.ys,info.ys+info.ym); CHKERRQ(ierr); */
/* ierr = PetscSynchronizedFlush(comm,stdout); CHKERRQ(ierr); */
/*
* Fill in the elements of the matrix
*/
for (PetscInt i_index=i_first; i_index<i_first+i_local; i_index++) {
MatStencil row = {0},col[3] = {{0}};
PetscScalar v[3];
PetscInt ncols = 0;
row.i = i_index;
col[ncols].i = i_index; v[ncols] = 2.;
ncols++;
if (i_index>0) { col[ncols].i = i_index-1; v[ncols] = 1.; ncols++; }
if (i_index<i_global-1) { col[ncols].i = i_index+1; v[ncols] = 1.; ncols++; }
ierr = MatSetValuesStencil(A,1,&row,ncols,col,v,INSERT_VALUES);CHKERRQ(ierr);
}
/*
* Create vectors on the grid
*/
Vec x,y;
ierr = DMCreateGlobalVector(grid,&x); CHKERRQ(ierr);
ierr = VecDuplicate(x,&y); CHKERRQ(ierr);
/*
* Set vector values: first locally, then global
*/
PetscReal one = 1.;
{
Vec xlocal;
ierr = DMCreateLocalVector(grid,&xlocal); CHKERRQ(ierr);
ierr = VecSet(xlocal,one); CHKERRQ(ierr);
ierr = DMLocalToGlobalBegin(grid,xlocal,INSERT_VALUES,x); CHKERRQ(ierr);
ierr = DMLocalToGlobalEnd(grid,xlocal,INSERT_VALUES,x); CHKERRQ(ierr);
ierr = VecDestroy(&xlocal); CHKERRQ(ierr);
}
/*
* Solve a linear system on the grid
*/
KSP solver;
ierr = KSPCreate(comm,&solver); CHKERRQ(ierr);
ierr = KSPSetType(solver,KSPBCGS); CHKERRQ(ierr);
ierr = KSPSetOperators(solver,A,A); CHKERRQ(ierr);
ierr = KSPSetFromOptions(solver); CHKERRQ(ierr);
ierr = KSPSolve(solver,x,y); CHKERRQ(ierr);
/*
* Report on success of the solver, or lack thereof
*/
{
PetscInt its; KSPConvergedReason reason;
ierr = KSPGetConvergedReason(solver,&reason);
ierr = KSPGetIterationNumber(solver,&its); CHKERRQ(ierr);
if (reason<0) {
PetscPrintf(comm,"Failure to converge after %d iterations; reason %s\n",
its,KSPConvergedReasons[reason]);
} else {
PetscPrintf(comm,"Number of iterations to convergence: %d\n",its);
}
}
/*
* Clean up
*/
ierr = KSPDestroy(&solver); CHKERRQ(ierr);
ierr = VecDestroy(&x); CHKERRQ(ierr);
ierr = VecDestroy(&y); CHKERRQ(ierr);
ierr = MatDestroy(&A); CHKERRQ(ierr);
ierr = DMDestroy(&grid); CHKERRQ(ierr);
return PetscFinalize();
}
PETSc solvers
Probably the most important activity in PETSc is solving a linear system. This is done through a solver
object: an object of the class KSP. (This stands for Krylov SPace solver.) The solution routine KSPSolve takes
a matrix and a right-hand-side and gives a solution; however, before you can call this some amount of
setup is needed.
There are two very different ways of solving a linear system: through a direct method, essentially a variant of Gaussian elimination, or through an iterative method that makes successive approximations to the solution. In PETSc there are only iterative methods; we will show later how to achieve direct methods.
The default linear system solver in PETSc is fully parallel, and will work on many linear systems, but
there are many settings and customizations to tailor the solver to your specific problem.
$$ ?x\colon\quad Ax = b $$
The elementary textbook way of solving this is through an LU factorization, also known as Gaussian elimination:
$$ LU \leftarrow A,\qquad Lz = b,\qquad Ux = z. $$
While PETSc has support for this, its basic design is geared towards so-called iterative solution methods. Instead of directly computing the solution to the system, they compute a sequence of approximations that, with luck, converges to the true solution:
$$ \hbox{while not converged:}\quad x_{i+1} \leftarrow f(x_i) $$
The interesting thing about iterative methods is that the iterative step only involves the matrix-vector
product:
36.1 KSP: linear system solvers
Python:
ksp = PETSc.KSP()
ksp.create()
# or:
ksp = PETSc.KSP().create()
Convergence can often be improved by a preconditioner $M \approx A$: formally, we solve
$$ M^{-1}Ax = M^{-1}b, $$
so conceivably we could iterate on the transformed matrix and right-hand side. However, in practice we apply the preconditioner in each iteration:
$$ \hbox{while not converged:}\quad r_i = Ax_i - b,\quad z_i = M^{-1}r_i,\quad x_{i+1} \leftarrow f(z_i) $$
In this schematic presentation we have left the nature of the update function $f()$ unspecified. Here, many possibilities exist; the primary choice is the type of iterative method, such as 'conjugate gradients', 'generalized minimum residual', or 'bi-conjugate gradients stabilized'. (We will go into direct solvers in section 36.2.)
Quantifying issues of convergence speed is difficult; see the HPC book, section 5.5.14.
Input Parameters:
ksp- the Krylov subspace context
rtol- the relative convergence tolerance, relative decrease in the
(possibly preconditioned) residual norm
abstol- the absolute convergence tolerance absolute size of the
(possibly preconditioned) residual norm
dtol- the divergence tolerance, amount (possibly preconditioned)
residual norm can increase before KSPConvergedDefault() concludes that
the method is diverging
maxits- maximum number of iterations to use
This uses various default settings; the vectors and the matrix have to be partitioned compatibly. The KSPSetOperators call takes two operators: one is the actual coefficient matrix, and the second is the one that the preconditioner is derived from. In some cases it makes sense to specify a different matrix here. (You can retrieve the operators with KSPGetOperators.) The call KSPSetFromOptions can cover almost all of the settings discussed next.
KSP objects have many options to control them, so it is convenient to call KSPView (or use the commandline
option -ksp_view) to get a listing of all the settings.
36.1.3 Tolerances
Since neither solution nor solution speed is guaranteed, an iterative solver is subject to some tolerances:
• a relative tolerance for when the residual has been reduced enough;
• an absolute tolerance for when the residual is objectively small;
• a divergence tolerance that stops the iteration if the residual grows by too much; and
• a bound on the number of iterations, regardless of any progress the process may still be making.
These tolerances are set with KSPSetTolerances (figure 36.2), or with the options -ksp_atol, -ksp_rtol, -ksp_divtol,
-ksp_max_it. Specify PETSC_DEFAULT to leave a value unaltered.
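For example, a sketch of setting only the relative tolerance and the iteration bound, leaving the other values unaltered:
/* rtol = 1e-6, abstol and dtol unchanged, at most 200 iterations */
ierr = KSPSetTolerances(solver,1.e-6,PETSC_DEFAULT,PETSC_DEFAULT,200); CHKERRQ(ierr);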
In the next section we will see how you can determine which of these tolerances caused the solver to stop.
Input Parameter
ksp -the KSP context
Output Parameter
reason -negative value indicates diverged, positive value converged,
see KSPConvergedReason
Python:
r = KSP.getConvergedReason(self)
where r in PETSc.KSP.ConvergedReason
In case of successful convergence, you can use KSPGetIterationNumber to report how many iterations were
taken.
The following snippet analyzes the status of a KSP object that has stopped iterating:
// shellvector.c
PetscInt its; KSPConvergedReason reason;
Vec Res; PetscReal norm;
ierr = KSPGetConvergedReason(Solve,&reason);
ierr = KSPConvergedReasonView(Solve,PETSC_VIEWER_STDOUT_WORLD);
if (reason<0) {
PetscPrintf(comm,"Failure to converge: reason=%d\n",reason);
} else {
ierr = KSPGetIterationNumber(Solve,&its);
PetscPrintf(comm,"Number of iterations: %d\n",its);
}
Input Parameters:
ksp : the Krylov space context
type : a known method
Input Parameters
ksp - iterative context
B - block of right-hand sides
Output Parameter
X - block of solutions
36.1.7 Preconditioners
Another part of an iterative solver is the preconditioner. The mathematical background of this is given
in section 36.1.1. The preconditioner acts to make the coefficient matrix better conditioned, which will
improve the convergence speed; it can even be that without a suitable preconditioner a solver will not
converge at all.
36.1.7.1 Background
The mathematical requirement that the preconditioner 𝑀 satisfy 𝑀 ≈ 𝐴 can take two forms:
1. What is the cost of constructing the preconditioner? This should not be more than the gain in
solution time of the iterative method.
2. What is the cost per iteration of applying the preconditioner? There is clearly no point in using
a preconditioner that decreases the number of iterations by a certain amount, but increases the
cost per iteration much more.
3. Many preconditioners have parameter settings that make these considerations even more com-
plicated: low parameter values may give a preconditioner that is cheap to apply but does not
improve convergence much, while large parameter values make the application more costly but
decrease the number of iterations.
36.1.7.2 Usage
Unlike most of the other PETSc object types, a PC object is typically not explicitly created. Instead, it is
created as part of the KSP object, and can be retrieved from it.
PC prec;
KSPGetPC(solver,&prec);
PCSetType(prec,PCILU);
Beyond setting the type of the preconditioner, there are various type-specific routines for setting various
parameters. Some of these can get quite tedious, and it is more convenient to set them through comman-
dline options.
36.1.7.3 Types
36.1.7.3.1 Sparse approximate inverses The inverse of a sparse matrix (at least, those from PDEs) is typi-
cally dense. Therefore, we aim to construct a sparse approximate inverse.
PETSc offers two such preconditioners, both of which require an external package.
• PCSPAI. This is a preconditioner that can only be used in single-processor runs, or as local solver
in a block preconditioner; section 36.1.7.3.3.
• As part of the PCHYPRE package, the parallel variant parasails is available.
-pc_type hypre -pc_hypre_type parasails
36.1.7.3.2 Incomplete factorizations The 𝐿𝑈 factorization of a matrix stemming from PDE problems has several practical problems:
• It takes (considerably) more storage space than the coefficient matrix, and
• it correspondingly takes more time to apply.
For instance, for a three-dimensional PDE in 𝑁 variables, the coefficient matrix can take storage space 7𝑁, while the 𝐿𝑈 factorization takes 𝑂(𝑁^{5/3}).
For this reason, incomplete 𝐿𝑈 factorizations are often popular.
• PETSc itself has a PCILU type, but it can only be used sequentially. This may sound like a limitation, but in parallel runs it can still be used as the subdomain solver in block methods; section 36.1.7.3.3.
• As part of Hypre, pilut is a parallel ILU.
There are many options for the ILU type, such as PCFactorSetLevels (option -pc_factor_levels), which
sets the number of levels of fill-in allowed.
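For example, a sketch of allowing one level of fill, assuming a PC object prec that was retrieved from the solver as above:
ierr = PCSetType(prec,PCILU); CHKERRQ(ierr);
ierr = PCFactorSetLevels(prec,1); CHKERRQ(ierr);
or, equivalently, on the commandline:
-pc_type ilu -pc_factor_levels 1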
36.1.7.3.3 Block methods Certain preconditioners seem almost intrinsically sequential. For instance, an ILU solution is sequential between the variables. There is a modest amount of parallelism, but that is hard to exploit.
Taking a step back, one of the problems with parallel preconditioners lies in the cross-process connections in the matrix. If only those were not present, we could solve the linear system on each process independently. Well, since a preconditioner is an approximate solution to begin with, ignoring those connections only introduces an extra degree of approximation.
There are two preconditioners that operate on this notion:
• PCBJACOBI: block Jacobi. Here each process solves locally the system consisting of the matrix
coefficients that couple the local variables. In effect, each process solves an independent system
on a subdomain.
The next question is then what solver is used on the subdomains. Here any preconditioner can
be used, in particular the ones that only existed in a sequential version. Specifying all this in code
gets tedious, and it is usually easier to specify such a complicated solver through commandline
options:
-pc_type bjacobi -sub_ksp_type preonly \
-sub_pc_type ilu -sub_pc_factor_levels 1
(Note that this also talks about a sub_ksp: the subdomain solver is in fact a KSP object. By setting
its type to preonly we state that the solver should consist of solely applying its preconditioner.)
The block Jacobi preconditioner can asymptotically only speed up the system solution by a factor relating to the number of subdomains, but in practice it can be quite valuable.
• PCASM: additive Schwarz method. Here each process solves locally a slightly larger system, based on the local variables, and one (or a few) levels of connections to neighboring processes. In effect, the processes solve systems on overlapping subdomains. This preconditioner can asymptotically reduce the number of iterations to 𝑂(1), but that requires exact solutions on the subdomains, and in practice this bound may not be attained anyway.
Figure 36.1 illustrates these preconditioners both in matrix and subdomain terms.
Figure 36.1: Illustration of block Jacobi and Additive Schwarz preconditioners: left domains and subdo-
mains, right the corresponding submatrices
See section 36.1.1 for the background on this, as well as the various specific
subsections.
36.1.8.0.1 Shell preconditioners You already saw that, in an iterative method, the coefficient matrix can
be given operationally as a shell matrix; section 33.4.7. Similarly, the preconditioner matrix can be specified
operationally by specifying type PCSHELL.
This needs specification of the application routine through PCShellSetApply:
PCShellSetApply(PC pc,PetscErrorCode (*apply)(PC,Vec,Vec));
If the shell preconditioner requires setup, a routine for this can be specified with PCShellSetSetUp:
PCShellSetSetUp(PC pc,PetscErrorCode (*setup)(PC));
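As an illustration, here is a minimal sketch of a shell preconditioner that applies diagonal (Jacobi) scaling; the vector diag_inverse, assumed to hold the reciprocal diagonal of the matrix, is kept global only for brevity:
Vec diag_inverse;   /* assumed to contain 1/diag(A) */

PetscErrorCode MyDiagApply(PC pc,Vec x,Vec y) {
  PetscErrorCode ierr;
  PetscFunctionBegin;
  /* y <- D^{-1} x, elementwise */
  ierr = VecPointwiseMult(y,diag_inverse,x); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

/* in the main program: */
PC prec;
ierr = KSPGetPC(solver,&prec); CHKERRQ(ierr);
ierr = PCSetType(prec,PCSHELL); CHKERRQ(ierr);
ierr = PCShellSetApply(prec,MyDiagApply); CHKERRQ(ierr);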
By default no monitor is set, meaning that the iteration process runs without output. The option -ksp_monitor activates printing a norm of the residual. This corresponds to setting KSPMonitorDefault as the monitor. This actually outputs the ‘preconditioned norm’ of the residual, which is not the L2 norm, but the square root of 𝑟^𝑡𝑀^{−1}𝑟, a quantity that is computed in the course of the iteration process. Specifying KSPMonitorTrueResidualNorm (with corresponding option -ksp_monitor_true_residual) as the monitor prints the actual norm √(𝑟^𝑡𝑟); however, this involves extra computation, since that quantity is not normally computed.
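You can also install a monitor routine of your own with KSPMonitorSet; a minimal sketch:
PetscErrorCode MyMonitor(KSP ksp,PetscInt it,PetscReal rnorm,void *ctx) {
  PetscFunctionBegin;
  /* print the iteration number and the (preconditioned) residual norm */
  PetscPrintf(PETSC_COMM_WORLD,"iteration %3d: residual %e\n",(int)it,(double)rnorm);
  PetscFunctionReturn(0);
}
/* install it, with no context and no destroy routine: */
ierr = KSPMonitorSet(solver,MyMonitor,NULL,NULL); CHKERRQ(ierr);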
where the solver variable is of type MatSolverType, and can be MATSOLVERMUMPS and such when specified in source:
#include "petscksp.h"
PetscErrorCode KSPSetFromOptions(KSP ksp)
Collective on ksp
Input Parameters
ksp - the Krylov space context
// direct.c
ierr = KSPCreate(comm,&Solver);
ierr = KSPSetOperators(Solver,A,A);
ierr = KSPSetType(Solver,KSPPREONLY);
{
PC Prec;
ierr = KSPGetPC(Solver,&Prec);
ierr = PCSetType(Prec,PCLU);
ierr = PCFactorSetMatSolverType(Prec,MATSOLVERMUMPS);
}
?𝑥 ∶ 𝑓 (𝑥) = 0
𝑓 (𝑥) = 𝐴𝑥 − 𝑏,
Collective on SNES
Input Parameters
snes - the SNES context
b - the constant part of the equation F(x) = b, or NULL to use zero.
x - the solution vector.
PetscErrorCode formfunction(SNES,Vec,Vec,void*)
Comparing the above to the introductory description you see that the Hessian is not specified here. An
analytic Hessian can be dispensed with if you instruct PETSc to approximate it by finite differences:
𝐻 (𝑥)𝑦 ≈ (𝑓 (𝑥 + ℎ𝑦) − 𝑓 (𝑥)) / ℎ
with ℎ some finite difference. The commandline option -snes_fd forces the use of this finite difference
approximation. However, it may lead to a large number of function evaluations. The option -snes_fd_color
applies a coloring to the variables, leading to a drastic reduction in the number of function evaluations.
If you can form the analytic Jacobian / Hessian, you can specify it with SNESSetJacobian (figure 37.2),
where the Jacobian is a function of type SNESJacobianFunction (figure 37.3).
Specifying the Jacobian:
Mat J;
ierr = MatCreate(comm,&J); CHKERRQ(ierr);
ierr = MatSetType(J,MATSEQDENSE); CHKERRQ(ierr);
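The matrix is then attached, together with the evaluation routine; a sketch, where snes is the solver object and formjacobian is an assumed user routine of type SNESJacobianFunction:
ierr = SNESSetJacobian(snes,J,J,formjacobian,NULL); CHKERRQ(ierr);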
Input Parameters
snes - the SNES context
Amat - the matrix that defines the (approximate) Jacobian
Pmat - the matrix to be used in constructing the preconditioner, usually the same as Amat.
J - Jacobian evaluation routine (if NULL then SNES retains any previously set value)
ctx - [optional] user-defined context for private data for the Jacobian evaluation routine
Collective on snes
Input Parameters
x - input vector, the Jacobian is to be computed at this value
ctx - [optional] user-defined Jacobian context
Output Parameters
Amat - the matrix that defines the (approximate) Jacobian
Pmat - the matrix to be used in constructing the preconditioner, usually the same as Amat.
37.2 Time-stepping
For cases
𝑢𝑡 = 𝐺(𝑡, 𝑢)
call TSSetRHSFunction.
#include "petscts.h"
PetscErrorCode TSSetRHSFunction
(TS ts,Vec r,
PetscErrorCode (*f)(TS,PetscReal,Vec,Vec,void*),
void *ctx);
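As an illustration, a minimal sketch of a right-hand-side function for the equation 𝑢𝑡 = −𝑢, attached to an assumed TS object ts:
PetscErrorCode MyRHSFunction(TS ts,PetscReal t,Vec u,Vec f,void *ctx) {
  PetscErrorCode ierr;
  PetscFunctionBegin;
  /* f <- -u */
  ierr = VecCopy(u,f); CHKERRQ(ierr);
  ierr = VecScale(f,-1.0); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
/* the residual vector argument is left as NULL here */
ierr = TSSetRHSFunction(ts,NULL,MyRHSFunction,NULL); CHKERRQ(ierr);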
#include "petscts.h"
PetscErrorCode TSSetIFunction
(TS ts,Vec r,TSIFunction f,void *ctx)
Some GPUs can accommodate MPI by being directly connected to the network through GPUDirect Remote
Memory Access (RMA). If not, use this runtime option:
-use_gpu_aware_mpi 0
More conveniently, add this to your .petscrc file; section 39.3.3.
38.3.1 Vectors
Analogous to vector creation as before, there are specific create calls VecCreateSeqCUDA, VecCreateMPICUDAWithArray,
or the type can be set in VecSetType:
// kspcu.c
#ifdef PETSC_HAVE_CUDA
ierr = VecCreateMPICUDA(comm,localsize,PETSC_DECIDE,&Rhs);
#else
ierr = VecCreateMPI(comm,localsize,PETSC_DECIDE,&Rhs);
#endif
The type VECCUDA is sequential or parallel dependent on the run; specific types are VECSEQCUDA, VECMPICUDA.
38.3.2 Matrices
ierr = MatCreate(comm,&A);
#ifdef PETSC_HAVE_CUDA
ierr = MatSetType(A,MATMPIAIJCUSPARSE);
#else
ierr = MatSetType(A,MATMPIAIJ);
#endif
Dense matrices can be created with specific calls MatCreateDenseCUDA, MatCreateSeqDenseCUDA, or by setting
types MATDENSECUDA, MATSEQDENSECUDA, MATMPIDENSECUDA.
Sparse matrices: MATAIJCUSPARSE which is sequential or distributed depending on how the program is
started. Specific types are: MATMPIAIJCUSPARSE, MATSEQAIJCUSPARSE.
38.3.3 Array access
There are GPU-specific variants of the usual ‘array’ access operations, such as MatDenseCUDAGetArray and VecCUDAGetArray.
Set PetscMalloc to use the GPU: PetscMallocSetCUDAHost, and switch back with PetscMallocResetCUDAHost.
38.4 Other
The memories of a CPU and GPU are not coherent. This means that routines such as PetscMalloc1 can not
immediately be used for GPU allocation. Use the routines PetscMallocSetCUDAHost and PetscMallocResetCUDAHost
to switch the allocator to GPU memory and back.
// cudamatself.c
Mat cuda_matrix;
PetscScalar *matdata;
ierr = PetscMallocSetCUDAHost();
ierr = PetscMalloc1(global_size*global_size,&matdata);
ierr = PetscMallocResetCUDAHost();
ierr = MatCreateDenseCUDA
(comm,
global_size,global_size,global_size,global_size,
matdata,
&cuda_matrix);
PETSc tools
4. Other error checking macros are CHKERRABORT which aborts immediately, and CHKERRMPI.
5. You can effect your own error return by using SETERRQ (figure 39.1), SETERRQ1 (figure 39.1), or SETERRQ2 (figure 39.1).
Fortran note 21: error code handling. In the main program, use CHKERRA and SETERRA. Also beware that these
error ‘commands’ are macros, and after expansion may interfere with Fortran line length, so they
should only be used in .F90 files.
Example. We write a routine that sets an error:
Input Parameters:
comm - A communicator, so that the error can be collective
ierr - nonzero error code, see the list of standard error codes in include/petscerror.h
message - error message in the printf format
arg1,arg2,arg3 - argument (for example an integer, string or double)
// backtrace.c
PetscErrorCode this_function_bombs() {
PetscFunctionBegin;
SETERRQ(PETSC_COMM_SELF,1,"We cannot go on like this");
PetscFunctionReturn(0);
}
Remark 32 In this example, the use of PETSC_COMM_SELF indicates that this error is individually generated
on a process; use PETSC_COMM_WORLD only if the same error would be detected everywhere.
Exercise 39.1. Look up the definition of SETERRQ1. Write a routine to compute square roots
that is used as follows:
x = 1.5; ierr = square_root(x,&rootx); CHKERRQ(ierr);
PetscPrintf(PETSC_COMM_WORLD,"Root of %f is %f\n",x,rootx);
x = -2.6; ierr = square_root(x,&rootx); CHKERRQ(ierr);
PetscPrintf(PETSC_COMM_WORLD,"Root of %f is %f\n",x,rootx);
39.1.3.1 Valgrind
Valgrind is rather verbose in its output. To limit the number of processes that run under valgrind:
mpiexec -n 3 valgrind --track-origins=yes ./app -args : -n 5 ./app -args
Fortran:
PetscPrintf(MPI_Comm, character(*), PetscErrorCode ierr)
Python:
PETSc.Sys.Print(type cls, *args, **kwargs)
kwargs:
comm : communicator object
Fortran:
PetscSynchronizedPrintf(MPI_Comm, character(*), PetscErrorCode ierr)
python:
PETSc.Sys.syncPrint(type cls, *args, **kargs)
kwargs:
comm : communicator object
flush : if True, do synchronizedFlush
other keyword args as for python3 print function
Fortran note 24: printing and newlines. The Fortran calls are only wrappers around C routines, so you can
use \n newline characters in the Fortran string argument to PetscPrintf.
The file to flush is typically PETSC_STDOUT.
Fortran:
PetscSynchronizedFlush(comm,fd,err)
Integer :: comm
fd is usually PETSC_STDOUT
PetscErrorCode :: err
python:
PETSc.Sys.syncFlush(type cls, comm=None)
#include "petscviewer.h"
PetscErrorCode PetscViewerRead(PetscViewer viewer, void *data, PetscInt num, PetscInt *count, PetscDataType dtype)
Collective
Input Parameters
viewer - The viewer
data - Location to write the data
num - Number of items of data to read
datatype - Type of data to read
Output Parameters
count -number of items of data actually read, or NULL
Python note 40: petsc print and python print. Since the print routines use the python print call, they au-
tomatically include the trailing newline. You don’t have to specify it as in the C calls.
39.2.2 Viewers
In order to export PETSc matrix or vector data structures there is a PetscViewer object type. This is a quite
general concept of viewing: it encompasses ascii output to screen, binary dump to file, or communication
to a running Matlab process. Calls such as MatView or KSPView accept a PetscViewer argument.
In cases where this makes sense, there is also an inverse ‘load’ operation. See section 33.3.5 for vectors.
Some viewers are predefined, such as PETSC_VIEWER_STDOUT_WORLD for ascii rendering to standard out. (In C,
specifying zero or NULL also uses this default viewer; for Fortran use PETSC_NULL_VIEWER.)
For instance, viewing a vector with this default viewer gives output such as:
Vec Object: space local 4 MPI processes
type: mpi
Process [0]
[ ... et cetera ... ]
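As an example of a file viewer, here is a sketch of dumping a vector to a binary file (the vector x, the communicator comm, and the file name are illustrative):
PetscViewer binfile;
ierr = PetscViewerBinaryOpen(comm,"x.dat",FILE_MODE_WRITE,&binfile); CHKERRQ(ierr);
ierr = VecView(x,binfile); CHKERRQ(ierr);
ierr = PetscViewerDestroy(&binfile); CHKERRQ(ierr);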
#include "petscsys.h"
#include "petsctime.h"
PetscErrorCode PetscGetCPUTime(PetscLogDouble *t)
PetscErrorCode PetscTime(PetscLogDouble *v)
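A sketch of elapsed-time measurement with these calls:
PetscLogDouble t_start,t_end;
ierr = PetscTime(&t_start); CHKERRQ(ierr);
/* ... code to be timed ... */
ierr = PetscTime(&t_end); CHKERRQ(ierr);
ierr = PetscPrintf(comm,"Elapsed time: %f seconds\n",(double)(t_end-t_start)); CHKERRQ(ierr);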
KSPCreate(comm,&time_solver);
KSPCreate(comm,&space_solver);
KSPSetOptionsPrefix(time_solver,"time_");
KSPSetOptionsPrefix(space_solver,"space_");
You can then use options -time_ksp_monitor and such. Note that the prefix does not have a leading
dash, but it does have the trailing underscore.
Similar routines: MatSetOptionsPrefix, PCSetOptionsPrefix, PetscObjectSetOptionsPrefix,
PetscViewerSetOptionsPrefix, SNESSetOptionsPrefix, TSSetOptionsPrefix, VecSetOptionsPrefix, and some
more obscure ones.
Options can be specified in a file .petscrc in the user’s home directory or the current directory.
Finally, an environment variable PETSC_OPTIONS can be set.
The rc file is processed first, then the environment variable, then any commandline arguments. This
parsing is done in PetscInitialize, so any values from PetscOptionsSetValue override this.
C:
#include <petscsys.h>
PetscErrorCode PetscMalloc1(size_t m1,type **r1)
Input Parameter:
m1 - number of elements to allocate (may be zero)
Output Parameter:
r1 - memory allocated
C:
#include <petscsys.h>
PetscErrorCode PetscFree(void *memory)
Input Parameter:
memory - memory to free (the pointer is ALWAYS set to NULL upon success)
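A sketch of a typical allocate/free pair:
PetscReal *work;
PetscInt n = 100;
ierr = PetscMalloc1(n,&work); CHKERRQ(ierr);
/* ... use the workspace ... */
ierr = PetscFree(work); CHKERRQ(ierr);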
39.4.1 Logging
Petsc does a lot of logging on its own operations. Additionally, you can introduce your own routines into
this log.
The simplest way to display statistics is to run with an option -log_view. This takes an optional file name
argument:
mpiexec -n 10 yourprogram -log_view :statistics.txt
The corresponding routine is PetscLogView.
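User events can be registered and timed as well; a minimal sketch, with an illustrative event name:
PetscLogEvent USER_EVENT;
ierr = PetscLogEventRegister("MyComputation",0,&USER_EVENT); CHKERRQ(ierr);
ierr = PetscLogEventBegin(USER_EVENT,0,0,0,0); CHKERRQ(ierr);
/* ... the work to be attributed to this event ... */
ierr = PetscLogEventEnd(USER_EVENT,0,0,0,0); CHKERRQ(ierr);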
#include <petsc.h>
PetscErrorCode this_function_bombs() {
PetscFunctionBegin;
SETERRQ(PETSC_COMM_SELF,1,"We cannot go on like this");
PetscFunctionReturn(0);
}
#include <petsc/finclude/petsc.h>
use petsc
implicit none
PetscErrorCode :: ierr
call PetscInitialize(PETSC_NULL_CHARACTER,ierr)
CHKERRA(ierr)
call this_function_bombs(ierr)
CHKERRA(ierr)
call PetscFinalize(ierr)
CHKERRA(ierr)
contains
Subroutine this_function_bombs(ierr)
implicit none
integer,intent(out) :: ierr
ierr = -1
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
PetscErrorCode ierr;
MPI_Comm comm;
KSP solver;
Mat A;
Vec Rhs,Sol;
PetscScalar one = 1.0;
PetscFunctionBegin;
PetscInitialize(&argc,&args,0,0);
comm = PETSC_COMM_SELF;
/*
* Read the domain size, and square it to get the matrix size
*/
PetscBool flag;
int matrix_size = 100;
ierr = PetscOptionsGetInt
(NULL,PETSC_NULL,"-n",&matrix_size,&flag); CHKERRQ(ierr);
PetscPrintf(comm,"Using matrix size %d\n",matrix_size);
/*
* Create the five-point laplacian matrix
*/
ierr = MatCreate(comm,&A); CHKERRQ(ierr);
ierr = MatSetType(A,MATSEQAIJ); CHKERRQ(ierr);
ierr = MatSetSizes(A,matrix_size,matrix_size,matrix_size,matrix_size); CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,3,PETSC_NULL); CHKERRQ(ierr);
ierr = MatCreateVecs(A,&Rhs,PETSC_NULL); CHKERRQ(ierr);
for (int i=0; i<matrix_size; i++) {
PetscScalar
h = 1./(matrix_size+1), pi = 3.1415926,
sin1 = i * pi * h, sin2 = 2 * i * pi * h, sin3 = 3 * i * pi * h,
coefs[3] = {-1,2,-1};
PetscInt cols[3] = {i-1,i,i+1};
ierr = VecSetValue(Rhs,i,sin1 + .5 * sin2 + .3 * sin3, INSERT_VALUES); CHKERRQ(ierr);
if (i==0) {
ierr = MatSetValues(A,1,&i,2,cols+1,coefs+1,INSERT_VALUES); CHKERRQ(ierr);
} else if (i==matrix_size-1) {
/*
* Create right hand side and solution vectors
*/
ierr = VecDuplicate(Rhs,&Sol); CHKERRQ(ierr);
ierr = VecSet(Rhs,one); CHKERRQ(ierr);
/*
* Create iterative method and preconditioner
*/
ierr = KSPCreate(comm,&solver);
ierr = KSPSetOperators(solver,A,A); CHKERRQ(ierr);
ierr = KSPSetType(solver,KSPCG); CHKERRQ(ierr);
{
PC prec;
ierr = KSPGetPC(solver,&prec); CHKERRQ(ierr);
ierr = PCSetType(prec,PCNONE); CHKERRQ(ierr);
}
/*
* Incorporate any commandline options for the KSP
*/
ierr = KSPSetFromOptions(solver); CHKERRQ(ierr);
/*
* Solve the system and analyze the outcome
*/
ierr = KSPSolve(solver,Rhs,Sol); CHKERRQ(ierr);
{
PetscInt its; KSPConvergedReason reason;
ierr = KSPGetConvergedReason(solver,&reason);
ierr = KSPGetIterationNumber(solver,&its); CHKERRQ(ierr);
if (reason<0) {
PetscPrintf
(comm,"Failure to converge after %d iterations; reason %s\n",
its,KSPConvergedReasons[reason]);
} else {
PetscPrintf
(comm,"Number of iterations to convergence: %d\n",
its);
}
}
return PetscFinalize();
}
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
PetscErrorCode ierr;
MPI_Comm comm;
PetscFunctionBegin;
PetscInitialize(&argc,&args,0,0);
comm = MPI_COMM_WORLD;
PetscPrintf(comm,"i=%d, j=%d\n",i_value,j_value);
return PetscFinalize();
}
PETSc topics
40.1 Communicators
PETSc has a ‘world’ communicator, which by default equals MPI_COMM_WORLD. If you want to run PETSc on
a subset of processes, you can assign a subcommunicator to the variable PETSC_COMM_WORLD in between the
calls to MPI_Init and PetscInitialize. Petsc communicators are of type PetscComm.
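A sketch of this mechanism, where, purely as an illustration, only the even ranks take part in the PETSc computation (argc,args are the usual arguments of main):
MPI_Comm subcomm; int world_rank;
MPI_Init(&argc,&args);
MPI_Comm_rank(MPI_COMM_WORLD,&world_rank);
/* color 0: even ranks, color 1: odd ranks */
MPI_Comm_split(MPI_COMM_WORLD,world_rank%2,world_rank,&subcomm);
if (world_rank%2==0) {
  PETSC_COMM_WORLD = subcomm;
  PetscInitialize(&argc,&args,0,0);
  /* ... PETSc work on half the processes ... */
  PetscFinalize();
}
MPI_Comm_free(&subcomm);
MPI_Finalize();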
Co-array Fortran
This chapter explains the basic concepts of Co-array Fortran (CAF), and helps you get started on running
your first program.
41.3 Basics
Co-arrays are defined by giving them, in addition to the Dimension, a Codimension
Complex,codimension(*) :: number
Integer,dimension(:,:,:),codimension[-1:1,*] :: grid
This means we are respectively declaring an array with a single number on each image, or a three-
dimensional grid spread over a two-dimensional processor grid.
Traditional-like syntax can also be used:
Complex :: number[*]
Integer :: grid(10,20,30)[-1:1,*]
Unlike Message Passing Interface (MPI), which normally only supports a linear process numbering, CAF
allows for multi-dimensional process grids. The last dimension is always specified as *, meaning it is
determined at runtime.
41.3.3 Synchronization
The Fortran standard forbids race conditions:
If a variable is defined on an image in a segment, it shall not be referenced, defined or
become undefined in a segment on another image unless the segments are ordered.
That is, you should not cause them to happen. The language and runtime are certainly not going to help
you with that.
Well, a little. After remote updates you can synchronize images with the sync call. The easiest variant is
a global synchronization:
sync all
While remote operations in CAF are nicely one-sided, synchronization is not: if image p issues a call
sync images(q)
then image q needs to issue the matching call towards p. For example:
if (procid==1) &
number[procid+1] = number[procid]
if (procid<=2) sync images( (/1,2/) )
if (procid==2) &
number[procid-1] = 2*number[procid]
if (procid<=2) sync images( (/2,1/) )
Note that the local sync call is done on both images involved.
Example of how you would synchronize a collective:
if ( this_image() .eq. 1 ) sync images( * )
if ( this_image() .ne. 1 ) sync images( 1 )
Here image 1 synchronizes with all others, but the others don’t synchronize with each other.
if (procid==1) then
sync images( (/procid+1/) )
else if (procid==nprocs) then
sync images( (/procid-1/) )
else
sync images( (/procid-1,procid+1/) )
end if
41.3.4 Collectives
Collectives are not part of CAF as of the 2008 Fortran standard.
integer,dimension(2),codimension[*] :: numbers
integer :: procid,nprocs
procid = this_image()
nprocs = num_images()
numbers(:)[procid] = procid
if (procid<nprocs) then
numbers(1)[procid+1] = procid
end if
if (procid==1) then
sync images( (/procid+1/) )
else if (procid==nprocs) then
sync images( (/procid-1/) )
else
sync images( (/procid-1,procid+1/) )
end if
This chapter explains the basic concepts of Sycl/Dpc++, and helps you get started on running your first
program.
• SYCL is a C++-based language for portable parallel programming.
• Data Parallel C++ (DPCPP) is Intel’s extension of Sycl.
• OneAPI is Intel’s compiler suite, which contains the DPCPP compiler.
Intel DPC++ extension. The various Intel extensions are listed here: https://spec.oneapi.com/
versions/latest/elements/dpcpp/source/index.html#extensions-table
42.1 Logistics
Headers:
#include <CL/sycl.hpp>
You can now open up the namespace, but with care! If you use
using namespace cl;
you have to prefix all SYCL classes with sycl::, which is a bit of a bother. However, if you use
using namespace cl::sycl;
you run into the fact that SYCL has its own versions of many Standard Template Library (STL) commands,
and so you will get name collisions. The most obvious example is that the cl::sycl name space has its
own versions of cout and endl. Therefore you have to explicitly use std::cout and std::endl. Using the
wrong I/O will cause tons of inscrutable error messages. Additionally, SYCL has its own version of free,
and of several math routines.
Intel DPC++ extension.
using namespace sycl;
42.3 Queues
The execution mechanism of SYCL is the queue: a sequence of actions that will be executed on a selected
device. The only user action is submitting actions to a queue; the queue is executed at the end of the scope
where it is declared.
Queue execution is asynchronous with host code.
The following example explicitly assigns the queue to the CPU device using the sycl::cpu_selector.
// cpuname.cxx
sycl::queue myqueue( sycl::cpu_selector{} );
42.4 Kernels
One kernel per submit.
myqueue.submit( [&] ( handler &commandgroup ) {
commandgroup.parallel_for<uniquename>
( range<1>{N},
[=] ( id<1> idx ) { ... idx }
)
} );
Note that the lambda in the kernel captures by value. Capturing by reference makes no sense, since the
kernel is executed on a device.
cgh.single_task(
[=]() {
// kernel function is executed EXACTLY once on a SINGLE work-item
});
cgh.parallel_for(
nd_range<3>( {1024,1024,1024},{16,16,16} ),
// using 3D in this example
[=](nd_item<3> myID) {
// kernel function is executed on an n-dimensional range (NDrange)
});
cgh.parallel_for_work_group(
range<2>(1024,1024),
// using 2D in this example
[=](group<2> myGroup) {
// kernel function is executed once per work-group
});
grp.parallel_for_work_item(
range<1>(1024),
// using 1D in this example
[=](h_item<1> myItem) {
// kernel function is executed once per work-item
});
cgh.parallel_for<class foo>(
range<1>{D*D*D},
[=](id<1> item) {
xx[ item[0] ] = 2 * item[0] + 1;
}
)
While the C++ vectors remain one-dimensional, DPCPP allows you to make multi-dimensional buffers:
std::vector<int> y(D*D*D);
buffer<int,1> y_buf(y.data(), range<1>(D*D*D));
cgh.parallel_for<class foo2D>
(range<2>{D,D*D},
[=](id<2> item) {
yy[ item[0] + D*item[1] ] = 2;
}
);
Intel DPC++ extension. There is an implicit conversion from the one-dimensional sycl::id<1> to size_t,
so
[=](sycl::id<1> i) {
data[i] = i;
}
constexpr size_t B = 4;
sycl::range<2> local_range(B, B);
sycl::range<2> tile_range = local_range + sycl::range<2>(2, 2); // Includes boundary cells
auto tile = local_accessor<float, 2>(tile_range, h); // see templated def'n above
We first copy global data into an array local to the work group:
sycl::id<2> offset(1, 1);
h.parallel_for
( sycl::nd_range<2>(stencil_range, local_range, offset),
[=] ( sycl::nd_item<2> it ) {
// Load this tile into work-group local memory
sycl::id<2> lid = it.get_local_id();
sycl::range<2> lrange = it.get_local_range();
for (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
int gi = ti + B * it.get_group(0);
int gj = tj + B * it.get_group(1);
tile[ti][tj] = input[gi][gj];
}
}
Global coordinates in the input are computed from the nd_item’s coordinate and group:
[=] ( sycl::nd_item<2> it ) {
for (int ti ... ) {
for (int tj ... ) {
int gi = ti + B * it.get_group(0);
int gj = tj + B * it.get_group(1);
... = input[gi][gj];
Local coordinates in the tile, including boundary, I DON’T QUITE GET THIS YET.
[=] ( sycl::nd_item<2> it ) {
sycl::id<2> lid = it.get_local_id();
sycl::range<2> lrange = it.get_local_range();
for (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
tile[ti][tj] = ..
Getting this to work correctly would need either a reduction primitive or atomics on the accumulator. The
2020 proposed standard has improved atomics.
// reduct1d.cxx
auto input_values = array_buffer.get_access<sycl::access::mode::read>(h);
auto sum_reduction = sycl::reduction( scalar_buffer,h,std::plus<>() );
h.parallel_for
( array_range,sum_reduction,
[=]( sycl::id<1> index,auto& sum )
{
sum += input_values[index];
}
); // end of parallel for
42.5.4 Reductions
Reduction operations were added in the SYCL 2020 Provisional Standard, meaning that they are not yet finalized.
Here is one approach, which works in hipsycl:
// reductscalar.cxx
auto reduce_to_sum =
sycl::reduction( sum_array, static_cast<float>(0.), std::plus<float>() );
myqueue.parallel_for// parallel_for<reduction_kernel<T,BinaryOp,__LINE__>>
( array_range, // sycl::range<1>(input_size),
reduce_to_sum, // sycl::reduction(output, identity, op),
[=] (sycl::id<1> idx, auto& reducer) { // type of reducer is impl-dependent, so use auto
reducer.combine(shared_array[idx[0]]); //(input[idx[0]]);
//reducer += shared_array[idx[0]]; // see line 216: add_reducer += input0[idx[0]];
} ).wait();
Here a sycl::reduction object is created from the target data and the reduction operator. This is then
passed to the parallel_for and its combine method is called.
Note the corresponding free call that also has the queue as parameter.
Note that you need to be in a parallel task. The following gives a segmentation error:
[&](sycl::handler &cgh) {
shar_float[0] = host_float[0];
}
// forloop.cxx
std::vector<int> myArray(SIZE);
range<1> mySize{myArray.size()};
buffer<int, 1> bufferA(myArray.data(), myArray.size());
42.6.3 Querying
The function get_range can query the size of either a buffer or an accessor:
// range2.cxx
sycl::buffer<int, 2>
a_buf(a.data(), sycl::range<2>(N, M)),
b_buf(b.data(), sycl::range<2>(N, M)),
c_buf(c.data(), sycl::range<2>(N, M));
sycl::range<2>
a_range = a_buf.get_range(),
b_range = b_buf.get_range();
if (a_range==b_range) {
sycl::accessor c = c_buf.get_access<sycl::access::mode::write>(h);
42.10 Examples
42.10.1 Kernels in a loop
The following idiom works:
sycl::event last_event = queue.submit( [&] (sycl::handler &h) {
for (int iteration=0; iteration<N; iteration++) {
last_event = queue.submit( [&] (sycl::handler &h) {
h.depends_on(last_event);
int main() {
int main() {
return 0;
}
int main() {
#if 0
sycl::cpu_selector selector;
sycl::queue myqueue(selector);
#else
sycl::queue myqueue;
#endif
return 0;
}
sycl::queue myqueue;
std::cout << "Hello example running on "
<< myqueue.get_device().get_info<sycl::info::device::name>()
<< std::endl;
myqueue.submit
( [&](sycl::handler &cgh) {
// WRONG shar_float[0] = host_float[0];
return 0;
}
sycl::queue myqueue;
std::cout << "Hello example running on "
<< myqueue.get_device().get_info<sycl::info::device::name>()
<< std::endl;
myqueue.submit
(
[&](sycl::handler &cgh) {
cgh.memcpy(devc_float,host_float,sizeof(floattype));
}
);
myqueue.wait();
myqueue.submit
(
[&](sycl::handler &cgh) {
sycl::stream sout(1024, 256, cgh);
cgh.single_task
(
[=] () {
sout << "Number " << devc_float[0] << sycl::endl;
}
);
} // end of submitted lambda
);
myqueue.wait();
free(host_float);
sycl::free(devc_float,myqueue);
return 0;
}
int main() {
std::vector<int> myArray(SIZE);
for (int i = 0; i<SIZE; ++i)
myArray[i] = i;
range<1> mySize{myArray.size()};
buffer<int, 1> bufferA(myArray.data(), myArray.size());
return 0;
/*
( [&](handler &myHandle) {
auto deviceAccessorA =
bufferA.get_access<access::mode::read_write>(myHandle);
myHandle
*/
int main() {
// Set up queue on any available device
sycl::queue Q;
{
// Create buffers associated with inputs and output
sycl::buffer<int, 2>
a_buf(a.data(), sycl::range<2>(N, M)),
b_buf(b.data(), sycl::range<2>(N, M)),
c_buf(c.data(), sycl::range<2>(N, M));
sycl::range<2>
a_range = a_buf.get_range(),
b_range = b_buf.get_range();
if (a_range==b_range) {
} );
}
} );
} else
throw(std::out_of_range("array size mismatch"));
// Q.submit( [&]( sycl::handler &h ) {
// sycl::id<2> one{1,1};
// sycl::accessor c = c_buf.get_access<sycl::access::mode::write>(h);
// sycl::range<2> c_range = c.get_range();
// c_range -= one;
// h.single_task( [&] () {} );
// });
}
sycl::queue myQueue;
std::cout << "Hello example running on "
<< myQueue.get_device().get_info<sycl::info::device::name>()
<< std::endl;
// Create a command_group to issue command to the group
myQueue.submit
(
[&](sycl::handler &cgh) {
sycl::stream sout(1024, 256, cgh);
cgh.parallel_for<class hello_world>
(
sycl::range<1>(global_range), [=](sycl::id<1> idx) {
sout << "Hello, World: World rank " << idx << sycl::endl;
}); // End of the kernel function
}
); // End of the queue commands.
myQueue.wait();
return 0;
#include <CL/sycl.hpp>
namespace sycl = cl::sycl;
int main() {
sycl::queue queue;
{
sycl::buffer<float,1> new_buf(new_values.data(), with_boundary);
sycl::buffer<float,1> old_buf(old_values.data(), with_boundary);
}
return 0;
}
Python multiprocessing
Python has a multiprocessing toolbox. This is a parallel processing library that relies on subprocesses,
rather than threads.
43.2 Process
A process is an object that will execute a python function:
## quicksort.py
import multiprocessing as mp
import random
import os
if __name__ == '__main__':
numbers = [ random.randint(1,50) for i in range(32) ]
process = mp.Process(target=quicksort,args=[numbers])
process.start()
process.join()
43.2.1 Arguments
Arguments can be passed to the function of the process with the args keyword. This accepts a list (or
tuple) of arguments, leading to a somewhat strange syntax for a single argument:
proc = Process(target=print_func, args=(name,))
The target function of a process can get hold of that process with the current_process function.
Of course you can also query os.getpid() but that does not offer any further possibilities.
def say_name(iproc):
print(f"Process {os.getpid()} has name: {mp.current_process().name}")
if __name__ == '__main__':
processes = [ mp.Process(target=say_name,name=f"proc{iproc}",args=[iproc])
for iproc in range(6) ]
Exercise 43.1. Do you see a way to improve the speed of this calculation?
43.4.1 Pipes
A pipe, object type Pipe, corresponds to what used to be called a channel in older parallel programming
systems: a First-In / First-Out (FIFO) object into which one process can place items, and from which
another process can take them. However, a pipe is not associated with any particular pair: creating the
pipe gives the entrance and exit from the pipe
q_entrance,q_exit = mp.Pipe()
which can then be used to put and get items, using the send and recv commands.
## pipemulti.py
def add_to_pipe(v,q):
for i in range(10):
print(f"put {v}")
q.send(v)
time.sleep(1)
q.send("END")
def print_from_pipe(q):
ends = 0
while True:
v = q.recv()
print(f"Got: {v}")
if v=="END":
ends += 1
if ends==2:
break
print("pipe is empty")
43.4.2 Queues
def say_hello(iproc):
print(f"Process has input value: {iproc}")
if __name__ == '__main__':
processes = [ mp.Process(target=say_hello,args=[iproc])
for iproc in range(6) ]
if __name__ == '__main__':
for p in processes:
p.start()
for p in processes:
p.join()
def say_name(iproc):
print(f"Process {os.getpid()} has name: {mp.current_process().name}")
if __name__ == '__main__':
processes = [ mp.Process(target=say_name,name=f"proc{iproc}",args=[iproc])
for iproc in range(6) ]
for p in processes:
p.start()
for p in processes:
p.join()
def print_value(ivalue):
mp.current_process()
#print( f"Value: {ivalue}" )
return 2*ivalue
if __name__ == '__main__':
print("Test 1: what does cpu_count() return?")
nprocs = mp.cpu_count()
print(f"I detect {nprocs} cores")
print("Test 2: create a pool and process an array on it")
pool = mp.Pool( nprocs )
results = pool.map( print_value,range(1,2*nprocs) )
print(results)
THE REST
Chapter 44
There is much that can be said about computer architecture. However, in the context of parallel program-
ming we are mostly concerned with the following:
• How many networked nodes are there, and does the network have a structure that we need to
pay attention to?
• On a compute node, how many sockets (or other Non-Uniform Memory Access (NUMA) do-
mains) are there?
• For each socket, how many cores and hyperthreads are there? Are caches shared?
44.1.2 hwloc
The open source package hwloc does similar reporting to cpuinfo, but it has been ported to many plat-
forms. Additionally, it can generate ascii and pdf graphic renderings of the architecture.
Hybrid computing
So far, you have learned to use MPI for distributed memory and OpenMP for shared memory parallel
programming. However, distributed memory architectures actually have a shared memory component,
since each cluster node is typically of a multicore design. Accordingly, you could program your cluster
using MPI for inter-node and OpenMP for intra-node parallelism.
You now have to find the right balance between processes and threads, since each can keep a core fully
busy. Complicating this story, a node can have more than one socket, and corresponding NUMA domain.
Figure 45.1 illustrates three modes: pure MPI with no threads used; one MPI process per node and full
multi-threading; two MPI processes per node, one per socket, and multiple threads on each socket.
45.1 Affinity
In the preceding chapters we mostly considered all MPI processes or OpenMP threads as being in one flat
pool. However, for high performance you need to worry about affinity: the question of which process or
thread is placed where, and how efficiently they can interact.
Figure 45.3: hwloc depiction of a Stampede compute node: two sockets with eight cores each, 32GB of memory (host c401-402.stampede.tacc.utexas.edu).
Figure 45.3 depicts a Stampede compute node, which is a two-socket Intel Sandybridge design; figure 45.4
shows a Stampede largemem node, which is a four-socket design. Finally, figure 45.5 shows a Lonestar5
compute node, a two-socket design with 12-core Intel Haswell processors with two hardware threads each.
45.4 Discussion
The performance implications of the pure MPI strategy versus hybrid are subtle.
• First of all, we note that there is no obvious speedup: in a well balanced MPI application all
cores are busy all the time, so using threading can give no immediate improvement.
Figure 45.4: hwloc depiction of a Stampede largemem node: four sockets with eight cores each, 1TB of memory (host c400-106.stampede.tacc.utexas.edu).
Figure 45.5: hwloc depiction of a Lonestar5 compute node: two sockets with twelve cores each, two hardware threads per core (host nid00015).
• Both MPI and OpenMP are subject to Amdahl’s law that quantifies the influence of sequential
code; in hybrid computing there is a new version of this law regarding the amount of code that
is MPI-parallel, but not OpenMP-parallel.
• MPI processes run unsynchronized, so small variations in load or in processor behavior can
be tolerated. The frequent barriers in OpenMP constructs make a hybrid code more tightly
synchronized, so load balancing becomes more critical.
• On the other hand, in OpenMP codes it is easier to divide the work into more tasks than there
are threads, so statistically a certain amount of load balancing happens automatically.
• Each MPI process has its own buffers, so hybrid takes less buffer overhead.
Exercise 45.1. Review the scalability argument for 1D versus 2D matrix decomposition in
HPC book, section-6.2. Would you get scalable performance from doing a 1D
decomposition (for instance, of the rows) over MPI processes, and decomposing the
other directions (the columns) over OpenMP threads?
Another performance argument we need to consider concerns message traffic. If we let all threads make MPI calls (see section 13.1), there is going to be little difference. However, in one popular hybrid computing
strategy we would keep MPI calls out of the OpenMP regions and have them in effect done by the master
thread. In that case there are only MPI messages between nodes, instead of between cores. This leads
to a decrease in message traffic, though this is hard to quantify. The number of messages goes down
approximately by the number of cores per node, so this is an advantage if the average message size is
small. On the other hand, the amount of data sent is only reduced if there is overlap in content between
the messages.
Limiting MPI traffic to the master thread also means that no buffer space is needed for the on-node com-
munication.
int totalcores;
MPI_Reduce(&ncores,&totalcores,1,MPI_INT,MPI_SUM,0,comm);
if (procid==0) {
printf("Omp procs on this process: %d\n",ncores);
printf("Omp procs total: %d\n",totalcores);
}
• However, cores are grouped in ‘tiles’ of two, so processes 1 and 3 start halfway through a tile.
• Therefore, thread zero of that process is bound to the second core.
export OMP_NUM_THREADS=16
There is a third choice, in between these extremes, that makes sense. A cluster node often has more than
one socket, so you could put one MPI process on each socket, and use a number of threads equal to the
number of cores per socket.
The script for this would be:
#SBATCH -N 100
#SBATCH -n 200
export OMP_NUM_THREADS=8
ibrun tacc_affinity yourprogram
The tacc_affinity script unsets the following variables:
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
export VIADEV_USE_AFFINITY=0
export VIADEV_ENABLE_AFFINITY=0
If you don’t use tacc_affinity you may want to do this by hand, otherwise mvapich2 will use its own
affinity rules.
int procid,nprocs;
int requested=MPI_THREAD_MULTIPLE,provided;
MPI_Init_thread(&argc,&argv,requested,&provided);
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Comm_rank(comm,&procid);
MPI_Comm_size(comm,&nprocs);
if (procid==0)
printf("Threading level requested=%d, provided=%d\n",
requested,provided);
int ncores;
#pragma omp parallel
#pragma omp master
ncores = omp_get_num_procs();
int totalcores;
MPI_Reduce(&ncores,&totalcores,1,MPI_INT,MPI_SUM,0,comm);
if (procid==0) {
printf("Omp procs on this process: %d\n",ncores);
printf("Omp procs total: %d\n",totalcores);
}
MPI_Finalize();
return 0;
}
Here is how you initialize the random number generator uniquely on each process:
C:
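(a minimal sketch; procid is assumed to hold the MPI rank, the seed scaling is illustrative, and <stdlib.h> is required)
long int seed = 1023 * procid;          /* a different seed on every process */
srand( (unsigned int)seed );
double random_value = rand() / (double)RAND_MAX;

Fortran: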
integer :: randsize
integer,allocatable,dimension(:) :: randseed
real :: random_value
call random_seed(size=randsize)
allocate(randseed(randsize))
randseed(:) = 1023*mytid
call random_seed(put=randseed)
Parallel I/O
Parallel I/O is a tricky subject. You can try to let all processors jointly write one file, or to write a file per
process and combine them later. With the standard mechanisms of your programming language there are
the following considerations:
• On clusters where the processes have individual file systems, the only way to write a single file
is to let it be generated by a single processor.
• Writing one file per process is easy to do, but
– You need a post-processing script;
– if the files are not on a shared file system (such as Lustre), it takes additional effort to bring
them together;
– if the files are on a shared file system, writing many files may be a burden on the metadata
server.
• On a shared file system it is possible for all processes to open the same file and set the file pointer individually. This can be difficult if the amount of data per process is not uniform.
Illustrating the last point:
// pseek.c
FILE *pfile;
pfile = fopen("pseek.dat","w");
fseek(pfile,procid*sizeof(int),SEEK_CUR);
// fseek(pfile,procid*sizeof(char),SEEK_CUR);
fprintf(pfile,"%d\n",procid);
fclose(pfile);
int nprocs,procid;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&procid);
FILE *pfile;
pfile = fopen("pseek.dat","w");
fseek(pfile,procid*sizeof(int),SEEK_CUR);
// fseek(pfile,procid*sizeof(char),SEEK_CUR);
fprintf(pfile,"%d\n",procid);
fclose(pfile);
MPI_Finalize();
return 0;
}
Support libraries
There are many libraries related to parallel programming to make life easier, or at least more interesting,
for you.
48.1 SimGrid
SimGrid [15] is a simulator for distributed systems. It can for instance be used to explore the effects of
architectural parameters. It has been used to simulate large scale operations such as High Performance
Linpack (HPL) [4].
48.2 Other
ParaMesh
Global Arrays
Hdf5 and Silo
TUTORIALS
Here are some tutorials specific to parallel programming.
1. Debugging, first sequential, then parallel. Chapter 49.
2. Tracing and profiling. Chapter 50.
3. SimGrid: a simulation tool for cluster execution. Chapter 51.
4. Batch systems, in particular Simple Linux Utility for Resource Management (SLURM). Chap-
ter 52.
5. Parallel I/O. Chapter 53.
Debugging
When a program misbehaves, debugging is the process of finding out why. There are various strategies
of finding errors in a program. The crudest one is debugging by print statements. If you have a notion of
where in your code the error arises, you can edit your code to insert print statements, recompile, rerun,
and see if the output gives you any suggestions. There are several problems with this:
• The edit/compile/run cycle is time consuming, especially since
• often the error will be caused by an earlier section of code, requiring you to edit, compile, and
rerun repeatedly. Furthermore,
• the amount of data produced by your program can be too large to display and inspect effectively,
and
• if your program is parallel, you probably need to print out data from all processors, making
the inspection process very tedious.
For these reasons, the best way to debug is by the use of an interactive debugger, a program that allows you
to monitor and control the behaviour of a running program. In this section you will familiarize yourself
with gdb, which is the open source debugger of the GNU project. Other debuggers are proprietary, and
typically come with a compiler suite. Another distinction is that gdb is a commandline debugger; there
are graphical debuggers such as ddd (a frontend to gdb) or DDT and TotalView (debuggers for parallel
codes). We limit ourselves to gdb, since it incorporates the basic concepts common to all debuggers.
In this tutorial you will debug a number of simple programs with gdb and valgrind. The files can be found
in the repository in the directory tutorials/debug_tutorial_files.
to replace this by -O0 (‘oh-zero’). The reason is that higher levels will reorganize your code, making it
hard to relate the execution to the source1 .
tutorials/gdb/c/hello.c
%% cc -g -o hello hello.c
# regular invocation:
%% ./hello
hello world
# invocation from gdb:
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
Copyright 2004 Free Software Foundation, Inc. .... copyright info ....
(gdb) run
Starting program: /home/eijkhout/tutorials/gdb/hello
Reading symbols for shared libraries +. done
hello world
1. Typically, actual code motion is done by -O3, but at level -O2 the compiler will inline functions and make other simplifi-
cations.
2. Compiler optimizations are not supposed to change the semantics of a program, but sometimes do. This can lead to the
nightmare scenario where a program crashes or gives incorrect results, but magically works correctly with compiled with debug
and run in a debugger.
%% cc -o hello hello.c
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
(gdb) list
For a program with commandline input we give the arguments to the run command (Fortran users use
say.F):
tutorials/gdb/c/say.c
%% cc -o say -g say.c
%% ./say 2
hello world
hello world
%% gdb say
.... the usual messages ...
(gdb) run 2
Starting program: /home/eijkhout/tutorials/gdb/c/say 2
Reading symbols for shared libraries +. done
hello world
hello world
49.3.1 C programs
tutorials/gdb/c/square.c
%% cc -g -o square square.c
%% ./square
5000
Segmentation fault
The segmentation fault (other messages are possible too) indicates that we are accessing memory that we
are not allowed to, making the program stop. A debugger will quickly tell us where this happens:
%% gdb square
(gdb) run
50000
Apparently the error occurred in a function __svfscanf_l, which is not one of ours, but a system func-
tion. Using the backtrace (or bt, also where or w) command we quickly find out how this came to be
called:
(gdb) backtrace
#0 0x00007fff824295ca in __svfscanf_l ()
#1 0x00007fff8244011b in fscanf ()
#2 0x0000000100000e89 in main (argc=1, argv=0x7fff5fbfc7c0) at square.c:7
We take a close look at line 7, and see that we need to change nmax to &nmax.
There is still an error in our program:
(gdb) run
50000
tutorials/gdb/f/square.F It should end prematurely with a message such as ‘Illegal instruction’. Running
the program in gdb quickly tells you where the problem lies:
(gdb) run
Starting program: tutorials/gdb//fsquare
Reading symbols for shared libraries ++++. done
tutorials/gdb/c/square1.c Compile this program with cc -o square1 square1.c and run it with valgrind
square1 (you need to type the input value). You will see lots of output, starting with:
%% valgrind square1
==53695== Memcheck, a memory error detector
==53695== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==53695== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==53695== Command: a.out
==53695==
10
==53695== Invalid write of size 4
==53695== at 0x100000EB0: main (square1.c:10)
==53695== Address 0x10027e148 is 0 bytes after a block of size 40 alloc'd
==53695== at 0x1000101EF: malloc (vg_replace_malloc.c:236)
==53695== by 0x100000E77: main (square1.c:8)
==53695==
==53695== Invalid read of size 4
==53695== at 0x100000EC1: main (square1.c:11)
==53695== Address 0x10027e148 is 0 bytes after a block of size 40 alloc'd
==53695== at 0x1000101EF: malloc (vg_replace_malloc.c:236)
==53695== by 0x100000E77: main (square1.c:8)
Valgrind is informative but cryptic, since it works on the bare memory, not on variables. Thus, these error
messages take some exegesis. They state that line 10 writes a 4-byte object immediately after a block of
40 bytes that was allocated. In other words: the code is writing outside the bounds of an allocated array.
Do you see what the problem in the code is?
Note that valgrind also reports at the end of the program run how much memory is still in use, meaning
not properly freed.
If you fix the array bounds and recompile and rerun the program, valgrind still complains:
==53785== Conditional jump or move depends on uninitialised value(s)
==53785== at 0x10006FC68: __dtoa (in /usr/lib/libSystem.B.dylib)
16 x += root(i);
(gdb)
(if you just hit return, the previously issued command is repeated). Do a number of steps in a row by
hitting return. What do you notice about the function and the loop?
Switch from doing step to doing next. Now what do you notice about the loop and the function?
Set another breakpoint: break 17 and do cont. What happens?
Rerun the program after you set a breakpoint on the line with the sqrt call. When the execution stops
there do where and list.
• If you set many breakpoints, you can find out what they are with info breakpoints.
• You can remove breakpoints with delete n where n is the number of the breakpoint.
• If you restart your program with run without leaving gdb, the breakpoints stay in effect.
• If you leave gdb, the breakpoints are cleared but you can save them: save breakpoints <file>.
Use source <file> to read them in on the next gdb run.
• Processes can deadlock because they are waiting for a message that never comes. This typically
happens with blocking send/receive calls due to an error in program logic.
• If an incoming message is unexpectedly larger than anticipated, a memory error can occur.
• A collective call will hang if somehow one of the processes does not call the routine.
There are few low-budget solutions to parallel debugging. The main one is to create an xterm for each
process. We will describe this next. There are also commercial packages such as DDT and TotalView, that
offer a GUI. They are very convenient but also expensive. The Eclipse project has a parallel package, Eclipse
PTP, that includes a graphic debugger.
50.1 Timing
Many systems have their own timers:
• MPI see section 15.6.1;
• OpenMP see section 29.2;
• PETSc see section 39.4.
Timing parallel operations is fraught with peril, as processes or threads can interact with each other. This
means that you may be measuring the wait time induced by synchronization. Sometimes that is actually
what you want, as in the case of a ping-pong operation; section 4.1.1.
Other times, this is not what you want. Consider the code
if (procno==0)
do_big_setup();
t = timer();
mpi_some_collective();
duration = timer() - t;
The solution is to put a barrier around the section that you want to time; see again figure 50.1.
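A sketch of the barrier-based variant, where comm stands for the communicator involved:
MPI_Barrier(comm);
t = timer();
mpi_some_collective();
MPI_Barrier(comm);
duration = timer() - t;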
50.2 Tau
TAU http://www.cs.uoregon.edu/Research/tau/home.php is a utility for profiling and tracing your
parallel programs. Profiling is the gathering and displaying of bulk statistics, for instance showing you
which routines take the most time, or whether communication takes a large portion of your runtime.
When you get concerned about performance, a good profiling tool is indispensable.
Tracing is the construction and displaying of time-dependent information on your program run, for in-
stance showing you if one process lags behind others. For understanding a program’s behaviour, and the
reasons behind profiling statistics, a tracing tool can be very insightful.
50.2.1 Instrumentation
Unlike such tools as VTune which profile your binary as-is, TAU works by adding instrumentation to your
code: in effect it is a source-to-source translator that takes your code and turns it into one that generates
run-time statistics.
This instrumentation is largely done for you; you mostly need to recompile your code with a script that
does the source-to-source translation, and subsequently compiles that instrumented code. You could for
instance have the following in your makefile:
ifdef TACC_TAU_DIR
CC = tau_cc.sh
else
CC = mpicc
endif
% : %.c
<TAB>${CC} -o $@ $^
If TAU is to be used (which we detect here by checking for the environment variable TACC_TAU_DIR), we
define the CC variable as one of the TAU compilation scripts; otherwise we set it to a regular MPI compiler.
50.2.2 Running
You can now run your instrumented code; trace/profile output will be written to file if environment vari-
ables TAU_PROFILE and/or TAU_TRACE are set:
export TAU_PROFILE=1
export TAU_TRACE=1
A TAU run can generate many files: typically at least one per process. It is therefore advisable to create a
directory for your tracing and profiling information. You declare them to TAU by setting the environment
variables PROFILEDIR and TRACEDIR.
mkdir tau_trace
mkdir tau_profile
export PROFILEDIR=tau_profile
export TRACEDIR=tau_trace
The actual program invocation is then unchanged:
mpirun -np 26 myprogram
TACC note. At TACC, use ibrun without a processor count; the count is derived from the queue
submission parameters.
While this example uses two separate directories, there is no harm in using the same for both.
50.2.3 Output
The tracing/profiling information is spread over many files, and hard to read as such. Therefore, you need
some further programs to consolidate and display the information.
You view profiling information with paraprof
paraprof tau_profile
Viewing the traces takes a few steps:
cd tau_trace
rm -f tau.trc tau.edf align.trc align.edf
tau_treemerge.pl
tau_timecorrect tau.trc tau.edf align.trc align.edf
tau2slog2 align.trc align.edf -o yourprogram.slog2
If you skip the tau_timecorrect step, you can generate the slog2 file directly from the merged tau.trc and tau.edf files.
50.2.5 Examples
50.2.5.1 Bucket brigade
Let’s consider a bucket brigade implementation of a broadcast: each process sends its data to the next
higher rank.
int sendto =
    ( procno<nprocs-1 ? procno+1 : MPI_PROC_NULL );
int recvfrom =
    ( procno>0 ? procno-1 : MPI_PROC_NULL );
MPI_Recv( leftdata,1,MPI_DOUBLE,recvfrom,0,comm,MPI_STATUS_IGNORE);
myvalue = leftdata;
MPI_Send( myvalue,1,MPI_DOUBLE,sendto,0,comm);
We implement the bucket brigade with blocking sends and receives: each process waits to receive from
its predecessor before sending to its successor. The TAU trace of this is in figure 50.2, using 4 nodes of
4 ranks each. We see that the processes within each node are fairly well synchronized, but there is less
synchronization between the nodes. However, the bucket brigade then imposes its own synchronization
on the processes, because each has to wait for its predecessor, no matter how early it posted the receive
operation.
Next, we introduce pipelining into this operation: each send is broken up into parts, and these parts are
sent and received with non-blocking calls.
// bucketpipenonblock.c
MPI_Request rrequests[PARTS];
for (int ipart=0; ipart<PARTS; ipart++) {
  MPI_Irecv
    ( leftdata+partition_starts[ipart],partition_sizes[ipart],
      MPI_DOUBLE,recvfrom,ipart,comm,rrequests+ipart);
}
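The matching sends and completion calls are not shown above; what follows is only a sketch of how each part could be forwarded as soon as it has arrived, not the book's actual snippet (sendto and the partitioning arrays are assumed to be as before):
MPI_Request srequests[PARTS];
for (int ipart=0; ipart<PARTS; ipart++) {
  // a part can only be sent on after it has been received
  MPI_Wait(rrequests+ipart,MPI_STATUS_IGNORE);
  MPI_Isend
    ( leftdata+partition_starts[ipart],partition_sizes[ipart],
      MPI_DOUBLE,sendto,ipart,comm,srequests+ipart);
}
MPI_Waitall(PARTS,srequests,MPI_STATUSES_IGNORE);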
sum = sum + d
enddo
We recognize this structure in the TAU trace: figure 50.4. Upon closer examination, we see how this
particular algorithm induces a lot of wait time. Figures 50.6 and 50.7 show a whole cascade of processes
waiting for each other.
Figure 50.6: Four stages of processes waiting caused by a single lagging process
Figure 50.7: Four stages of processes waiting caused by a single lagging process
SimGrid
Many readers of this book will have access to some sort of parallel machine so that they can run simu-
lations, maybe even some realistic scaling studies. However, not many people will have access to more
than one cluster type so that they can evaluate the influence of the interconnect. Even then, for didactic
purposes one would often wish for interconnect types (fully connected, linear processor array) that are
unlikely to be available.
In order to explore architectural issues pertaining to the network, we then resort to a simulation tool,
SimGrid.
Installation
Compilation You write plain MPI files, but compile them with the SimGrid compiler smpicc.
Running SimGrid has its own version of mpirun: smpirun. You need to supply this with options:
• -np 123456 for the number of (virtual) processors;
• -hostfile simgridhostfile, which lists the names of these processors. You can basically make these
names up, but they have to match the hosts defined in:
• -platform arch.xml, which defines the connectivity between the processors.
For instance, with a hostfile of 8 hosts, a linearly connected network would be defined as:
<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4">
<host id="host4" speed="1Mf"/>
<host id="host5" speed="1Mf"/>
<host id="host6" speed="1Mf"/>
<host id="host7" speed="1Mf"/>
<host id="host8" speed="1Mf"/>
<link id="link1" bandwidth="125MBps" latency="100us"/>
<!-- the routing: specify how the hosts are interconnected -->
<route src="host1" dst="host2"><link_ctn id="link1"/></route>
<route src="host2" dst="host3"><link_ctn id="link1"/></route>
<route src="host3" dst="host4"><link_ctn id="link1"/></route>
<route src="host4" dst="host5"><link_ctn id="link1"/></route>
<route src="host5" dst="host6"><link_ctn id="link1"/></route>
<route src="host6" dst="host7"><link_ctn id="link1"/></route>
<route src="host7" dst="host8"><link_ctn id="link1"/></route>
</zone>
</platform>
(such files can be generated with a short shell script).
The Floyd routing designation means that any route in the transitive closure of the given paths can be
used. It is also possible to use routing="Full", which requires full specification of all pairs that can
communicate.
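Putting these options together, an invocation could look like the following; the program name is a placeholder:
smpirun -np 8 -hostfile simgridhostfile -platform arch.xml ./myprogram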
Batch systems
Supercomputer clusters can have a large number of nodes, but not enough to let all their users run si-
multaneously, and at the scale that they want. Therefore, users are asked to submit jobs, which may start
executing immediately, or may have to wait until resources are available.
The decision when to run a job, and what resources to give it, is not done by a human operator, but by
software called a batch system. (The Stampede cluster at TACC ran close to 10 million jobs over its lifetime,
which corresponds to starting a job every 20 seconds.)
This tutorial will cover the basics of such systems, and in particular Simple Linux Utility for Resource
Management (SLURM).
• Often, clusters have a number of ‘large memory’ nodes, on the order of a Terabyte of memory
or more. Because of the cost of such hardware, there is usually only a small number of these
nodes.
52.2 Queues
Jobs often cannot start immediately, because not enough resources are available, or because other jobs
may have higher priority (see section 52.7). It is thus typical for a job to be put on a queue, scheduled, and
started by a batch system such as SLURM.
Batch systems do not put all jobs in one big pool: jobs are submitted to one of a number of queues, which
are each scheduled separately.
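With SLURM, for instance, you choose the queue (there called a partition) at submission time; the queue name here is only a placeholder:
sbatch -p development myjobscript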
52.4.2 Environment
Your job script acts like any other shell script when it is executed. In particular, it inherits the calling
environment with all its environment variables. Additionally, slurm defines a number of environment
variables, such as the job ID, the hostlist, and the node and process count.
An interesting side effect of this is that, right before a really large job starts, a fairly large job can still be
run, provided it has a short running time. This is known as backfill, and it may cause jobs to run earlier
than their priority alone would warrant.
Most of the file system lives on discs that are part of RAID arrays. These discs have a large amount of
redundancy to make them fault-tolerant, and in aggregate they form a shared file system: one unified file
system that is accessible from any node, and where files can take on almost any size, or at least sizes much
larger than any individual disc in the system.
TACC note. The HOME file system is limited in size, but is both permanent and backed up. Here you put
scripts and sources.
The WORK file system is permanent but not backed up. Here you can store output of your
simulations. However, currently the work file system can not immediately sustain the output
of a large parallel job.
The SCRATCH file system is purged, but it has the most bandwidth for accepting program
output. This is where you would write your data. After post-processing, you can then store on
the work file system, or write to tape.
Exercise 52.6. If you install software with cmake, you typically have
1. a script with all your cmake options;
2. the sources;
3. the installed header and binary files;
4. temporary object files and such.
How would you organize these entities over your available file systems?
52.9 Examples
Very sketchy section.
#!/bin/sh
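A minimal continuation of this script skeleton could look as follows; all names and resource values are placeholders, not recommendations, and ibrun is the TACC launcher mentioned earlier (elsewhere you would use mpirun or srun):
#SBATCH -J myjob           # job name
#SBATCH -o myjob.o%j       # output file; %j expands to the job id
#SBATCH -p development     # queue (partition) to submit to
#SBATCH -N 2               # number of nodes
#SBATCH -n 8               # total number of MPI ranks
#SBATCH -t 00:10:00        # wall clock time limit
ibrun ./myprogram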
• CPU;
• Disc space?
Exercise 52.11. What is the command for querying the status of your job?
• sinfo
• squeue
• sacct
Exercise 52.12. On 4 nodes with 40 cores each, what is the largest program you can run, measured in
• MPI ranks;
• OpenMP threads?
Parallel I/O
For a great discussion see [20], from which figures here are taken.
CLASS PROJECTS
Chapter 54
Here are some guidelines for how to submit assignments and projects. As a general rule, consider pro-
gramming as an experimental science, and your writeup as a report on some tests you have done: explain
the problem you’re addressing, your strategy, your results.
Structure of your writeup Most of the projects in this book use a scientific question to allow you to
prove your coding skills. That does not mean that turning in the code is sufficient, nor code plus sample
output. Turn in a writeup in pdf form, generated with a text processing program such as (preferably)
LaTeX (for a tutorial, see HPC book, section-40).
Your writeup should have:
• Foremost, a short description of the purpose of your project and your results;
• An explanation of your algorithms or solution strategy;
• Relevant fragments of your code;
• A scientific discussion of what you observed;
• Any code-related observations;
• If applicable: graphs, both of application quantities and performance issues. (For parallel runs
possibly TAU plots; see 50.2.)
Observe, measure, hypothesize, deduce Your project may be a scientific investigation of some phenomenon.
Formulate hypotheses as to what you expect to observe, report on your observations, and draw conclu-
sions.
Quite often your program will display unexpected behaviour. It is important to observe this, and hypoth-
esize what the reason might be for your observed behaviour.
In most applications of computing machinery we care about the efficiency with which we find the solution.
Thus, make sure that you do measurements. In general, make observations that allow you to judge whether
your program behaves the way you would expect it to.
Including code If you include code samples in your writeup, make sure they look good. For starters, use a
mono-spaced font. In LaTeX, you can use the verbatim environment or the verbatiminput command. With
the latter, the source is included automatically rather than cut and pasted; this is to be preferred, since
your writeup will stay current after you edit the source file.
Including whole source files makes for a long and boring writeup. The code samples in this book were
generated as follows. In the source files, the relevant snippet was marked as
... boring stuff
//snippet samplex
.. interesting! ..
//snippet end
... more boring stuff
The files were then processed with the following command line (actually, included in a makefile, which
requires doubling the dollar signs):
for f in *.{c,cxx,h} ; do
  cat $f | awk 'BEGIN {f=0}
    /snippet end/ {f=0}
    f==1 {print $0 > file}
    /snippet/ && !/end/ {f=1; file=$2}
    '
done
which gives (in this example) a file samplex. Other solutions are of course possible.
Code formatting Included code snippets should be readable. At a minimum you could indent the code
correctly in an editor before you include it in a verbatim environment. (Screenshots of your terminal
window are a decidedly suboptimal solution.) But it's better to use the listings package, which formats
your code, including syntax coloring. For instance,
\lstset{language=C++} % or Fortran or so
\begin{lstlisting}
for (int i=0; i<N; i++)
s += 1;
\end{lstlisting}
is rendered, with syntax coloring, as:
for (int i=0; i<N; i++)
s += 1;
Running your code A single run doesn’t prove anything. For a good report, you need to run your
code for more than one input dataset (if available) and in more than one processor configuration. When
you choose problem sizes, be aware that an average processor can do a billion operations per second: you
need to make your problem large enough for the timings to rise above the level of random variations and
startup phenomena.
When you run a code in parallel, beware that on clusters the behaviour of a parallel code will always
be different between one node and multiple nodes. On a single node the MPI implementation is likely
optimized to use the shared memory. This means that results obtained from a single node run will be
unrepresentative. In fact, in timing and scaling tests you will often see a drop in (relative) performance
going from one node to two. Therefore you need to run your code in a variety of scenarios, using more
than one node.
Reporting scaling If you do a scaling analysis, a graph reporting runtimes should not have a linear time
axis: a logarithmic graph is much easier to read. A speedup graph can also be informative.
Some algorithms are mathematically equivalent in their sequential and parallel versions. Others, such
as iterative processes, can take more operations in parallel than sequentially, for instance because the
number of iterations goes up. In this case, report both the speedup of a single iteration, and the total
improvement of running the full algorithm in parallel.
Repository organization If you submit your work through a repository, make sure you organize your
submissions in subdirectories, and that you give a clear name to all files. Object files and binaries should
not be in a repository since they are dependent on hardware and things like compilers.
Warmup Exercises
55.2 Collectives
It is a good idea to be able to collect statistics, so before we do anything interesting, we will look at MPI
collectives; section 3.1.
Take a look at time_max.cxx. This program sleeps for a random number of seconds and measures how
long the sleep actually was. In the code, this quantity is called 'jitter', which is a term for random
deviations in a system.
Exercise 55.1. Change this program to compute the average jitter by changing the
reduction operator.
Exercise 55.2. Compute the standard deviation of the jitter,
\[ \sigma = \sqrt{ \frac{\sum_i (x_i - m)^2}{n} } \]
where 𝑚 is the average value you computed in the previous exercise.
• Solve this exercise twice: once by following a reduce with a broadcast
operation, and once by using an Allreduce.
• Run your code both on a single cluster node and on multiple nodes, and
inspect the TAU trace. Some MPI implementations are optimized for shared
memory, so the trace on a single node may not look as expected.
• Can you see from the trace how the allreduce is implemented?
Exercise 55.3. Finally, use a gather call to collect all the values on processor zero, and print
them out. Is there any process that behaves very differently from the others?
Mandelbrot set
If you've never heard the name Mandelbrot set, you probably recognize the picture; see figure 56.1. Its
formal definition is as follows:
A point 𝑐 in the complex plane is part of the Mandelbrot set if the series 𝑥𝑛 defined by
\[ \begin{cases} x_0 = 0 \\ x_{n+1} = x_n^2 + c \end{cases} \]
satisfies
\[ \forall n\colon |x_n| \leq 2. \]
It is easy to see that only points 𝑐 in the bounding circle |𝑐| < 2 qualify, but apart from that it’s hard to say
much without a lot more thinking. Or computing; and that’s what we’re going to do.
In this set of exercises you are going to take an example program mandel_main.cxx and extend it to
use a variety of MPI programming constructs. This program has been set up as a manager-worker model:
there is one manager processor (for a change this is the last processor, rather than zero) which gives out
work to, and accepts results from, the worker processors. It then takes the results and constructs an image
file from them.
56.1 Invocation
The mandel_main program is called as
mpirun -np 123 mandel_main steps 456 iters 789
where the steps parameter indicates how many steps in the 𝑥, 𝑦 direction there are in the image, and iters
gives the maximum number of iterations in the belongs test.
If you forget the parameters, you can call the program with
mandel_serial -h
and it will print out the usage information.
56.2 Tools
The driver part of the Mandelbrot program is simple. There is a circle object that can generate coordinates,
and a global routine that tests whether a coordinate is in the set, at least up to an iteration bound. It
returns zero if the series from the given starting point has not diverged, or the iteration number in which
it diverged if it did so.
int belongs(struct coordinate xy,int itbound) {
  double x=xy.x, y=xy.y; int it;
  for (it=0; it<itbound; it++) {
    double xx,yy;
    xx = x*x - y*y + xy.x;
    yy = 2*x*y + xy.y;
    x = xx; y = yy;
    if (x*x+y*y>4.) {
      return it;
    }
  }
  return 0;
}
In the former case, the point could be in the Mandelbrot set, and we colour it black; in the latter case we
give it a colour depending on the iteration number.
if (iteration==0)
  memset(colour,0,3*sizeof(float));
else {
  float rfloat = ((float) iteration) / workcircle->infty;
  colour[0] = rfloat;
  colour[1] = MAX((float)0,(float)(1-2*rfloat));
  colour[2] = MAX((float)0,(float)(2*(rfloat-.5)));
}
We use a fairly simple code for the worker processes: they execute a loop in which they wait for input,
process it, and return the result.
void queue::wait_for_work(MPI_Comm comm,circle *workcircle) {
  MPI_Status status; int ntids;
  MPI_Comm_size(comm,&ntids);
  int stop = 0;
  while (!stop) {
    struct coordinate xy;
    int res;
    MPI_Recv(&xy,1,coordinate_type,ntids-1,0, comm,&status);
    stop = !workcircle->is_valid_coordinate(xy);
    if (stop) break; //res = 0;
    else {
      res = belongs(xy,workcircle->infty);
    }
    MPI_Send(&res,1,MPI_INT,ntids-1,0, comm);
  }
  return;
}
  err = MPI_Send(&xy,1,coordinate_type,
                 free_processor,0,comm); CHK(err);
  err = MPI_Recv(&contribution,1,MPI_INT,
                 free_processor,0,comm, &status); CHK(err);
  coordinate_to_image(xy,contribution);
  total_tasks++;
  free_processor = (free_processor+1)%(ntids-1);
  return 0;
};
Exercise 56.1. Explain why this solution is very inefficient. Make a trace of its execution
that bears this out.
to finish before you give a new round of data to all workers. Make a trace of the
execution of this and report on the total time.
You can do this by writing a new class that inherits from queue, and that provides
its own addtask method:
// mandel_bulk.cxx
class bulkqueue : public queue {
public :
bulkqueue(MPI_Comm queue_comm,circle *workcircle)
: queue(queue_comm,workcircle) {
You will also have to override the complete method: when the circle object
indicates that all coordinates have been generated, not all workers will be busy, so
you need to supply the proper MPI_Waitall call.
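For reference, such a call has the general form below, where the request array and count are whatever your solution maintains:
MPI_Waitall(nrequests,requests,MPI_STATUSES_IGNORE);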
In this section we will gradually build a semi-realistic example program. To get you started some pieces
have already been written: as a starting point look at code/mpi/c/grid.cxx.
This is easy enough to implement sequentially, but in parallel this requires some care.
Let's take the grid 𝐺 and divide it over a two-dimensional grid of 𝑝𝑖 × 𝑝𝑗 processors. (Other strategies exist,
but this one scales best; see section HPC book, section-6.5.) Formally, we define two sequences of points
\[ 0 = i_0 < \cdots < i_{p_i} < i_{p_i+1} = n_i, \qquad 0 = j_0 < \cdots < j_{p_j} < j_{p_j+1} = n_j \]
From formula (57.1) you see that the processor then needs one row of points on each side surrounding its
part of the grid. A picture makes this clear; see figure 57.1. These elements surrounding the processor’s
own part are called the halo or ghost region of that processor.
The problem is now that the elements in the halo are stored on a different processor, so communication
is needed to gather them. In the upcoming exercises you will have to use different strategies for doing so.
Figure 57.1: A grid divided over processors, with the ‘ghost’ region indicated
class number_grid {
public:
  processor_grid *pgrid;
  double *values,*shadow;
where values contains the elements owned by the processor, and shadow is intended to contain the
values plus the ghost region. So how does shadow receive those values? Well, the call looks like
grid->build_shadow();
and you will need to supply the implementation of that. Once you’ve done so, there is a routine that prints
out the shadow array of each processor
grid->print_shadow();
In the file code/mpi/c/grid_impl.cxx you can see several uses of the macro INDEX. This translates
from a two-dimensional coordinate system to one-dimensional. Its main use is letting you use (𝑖, 𝑗) coor-
dinates for indexing the processor grid and the number grid: for processors you need the translation to
the linear rank, and for the grid you need the translation to the linear array that holds the values.
A good example of the use of INDEX is in the number_grid::relax routine: this takes points from
the shadow array and averages them into a point of the values array. (To understand the reason for
this particular averaging, see HPC book, section-4.2.3 and HPC book, section-5.5.3.) Note how the INDEX
macro is used to index in a ilength × jlength target array values, while reading from a (ilength +
2) × (jlength + 2) source array shadow.
for (i=0; i<ilength; i++) {
  for (j=0; j<jlength; j++) {
    int c=0;
    double new_value=0.;
    for (c=0; c<5; c++) {
      int ioff=i+1+ioffsets[c],joff=j+1+joffsets[c];
      new_value += coefficients[c] *
        shadow[ INDEX(ioff,joff,ilength+2,jlength+2) ];
    }
    values[ INDEX(i,j,ilength,jlength) ] = new_value/8.;
  }
}
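For reference, a minimal row-major version of what such a macro could look like; the actual definition lives in code/mpi/c/grid_impl.cxx, so treat this as an assumption:
// translate coordinate (i,j) in an m-by-n array to a linear index
#define INDEX(i,j,m,n) ((i)*(n)+(j))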
N-body problems
N-body problems describe the motion of particles under the influence of forces such as gravity. There are
many approaches to this problem, some exact, some approximate. Here we will explore a number of them.
For background reading see HPC book, section-10.
PART VIII
DIDACTICS
Chapter 59
Teaching guide
Based on two lectures per week, here is an outline of how MPI can be taught in a college course, with links
to the relevant exercises.
Topic                               Exercises                                        Week
Block 1: SPMD and collectives
  Intro: cluster structure          hello: 2.1, 2.2                                  week 1
  Functional parallelism            commrank: 2.4, 2.5, prime: 2.6
  Allreduce, broadcast              3.1, randommax: 3.2, jordan: 3.8                 week 2
  Scan, Gather                      3.13, scangather: 3.11, 3.15
Block 2: Two-sided point-to-point                                                    week 3
  Send and receive                  pingpong: 4.1, rightsend: 4.4
  Sendrecv                          bucketblock: 4.6, sendrecv: 4.8, 4.9             week 4
  Nonblocking                       isendirecv: 4.13, isendirecvarray: 4.14,
                                    bucketpipenonblock: 4.11
Block 3: Derived datatypes                                                           week 5
  Contiguous, Vector, Indexed       stridesend: 6.4, cubegather: 6.6
  Extent and resizing
Block 4: Communicators
  Duplication, split                procgrid: 7.1, 7.2                               week 6
  Groups
Block 5: I/O
  File open, write, views           blockwrite: 10.1, viewwrite: 10.4
Block 6: Neighborhood collectives                                                    week 7
  Neighbor allgather                rightgraph: 11.1
Distributed memory programming, typically through the MPI library, is the de facto standard for pro-
gramming large scale parallelism, with up to millions of individual processes. Its dominant paradigm of
Single Program Multiple Data (SPMD) programming is different from threaded and multicore parallelism,
to an extent that students have a hard time switching models. In contrast to threaded programming, which
allows for a view of the execution with central control and a central repository of data, SPMD program-
ming has a symmetric model where all processes are active all the time, with none privileged, and where
data is distributed.
This model is counterintuitive to the novice parallel programmer, so care needs to be taken how to instill
the proper ‘mental model’. Adoption of an incorrect mental model leads to broken or inefficient code.
We identify problems with the currently common way of teaching MPI, and propose a structuring of MPI
courses that is geared to explicitly reinforcing the symmetric model. Additionally, we advocate starting from
realistic scenarios, rather than writing artificial code just to exercise newly-learned routines.
60.1 Introduction
The MPI library [25, 21] is the de facto tool for large scale parallelism as it is used in engineering sciences.
In this paper we want to discuss the manner in which it is usually taught, and propose a rethinking.
We argue that the topics are typically taught in a sequence that is essentially dictated by level of com-
plexity in the implementation, rather than by conceptual considerations. Our argument will be for a se-
quencing of topics, and use of examples, that is motivated by typical applications of the MPI library, and
that explicitly targets the required mental model of the parallelism model underlying MPI.
We have written an open-source textbook [8] with exercise sets that follows the proposed sequencing of
topics and the motivating applications.
MPI is typically used to code large-scale Finite Element Method (FEM) and other physical simulation applications,
which share characteristics of a relatively static distribution of large amounts of data – hence the use of
clusters to increase size of the target problem – and the need for very efficient exchange of small amounts
of data.
The main motivation for MPI is the fact that it can be scaled to more or less arbitrary scales, currently up
to millions of cores [1]. Contrast this with threaded programming, which is limited more or less by the
core count on a single node, currently about 70.
Considering this background, the target audience for MPI teaching consists of upper level undergraduate
students, graduate students, and even post-doctoral researchers who are engaging for the first time in
large scale simulations. The typical participant in an MPI course is likely to understand more than the
basics of linear algebra and some amount of numerics of Partial Differential Equation (PDE).
In this section we consider in more detail the mental models that students may implicitly be working
under, and the problems with them; targeting the right mental model will then be the subject of later
sections. The two (interrelated) aspects of a correct mental model for distributed memory programming
are control and synchronization. We here discuss how these can be misunderstood by students.
2. We carefully avoid the word ‘thread’ which carries many connotations in the context of parallel programming.
3. To first order; second order effects such as affinity complicate this story.
a single thread of control: the above-mentioned ‘index finger’ going down the statements of the source.
A second factor contributing to this view is that a parallel code incorporates statements with values
(int x = 1.5;) that are replicated over all processes. It is easy to view these as centrally executed.
Interestingly, work by Ben-David Kolikant [2] shows that students with no prior knowledge of concur-
rency, when invited to consider parallel activities, will still think in terms of centralized solutions. This
shows that distributed control, such as it appears in MPI, is counterintuitive and needs explicit enforce-
ment in its mental model. In particular, we explicitly target process symmetry and process differentiation.
The centralized model can still be maintained in MPI to an extent, since the scalar operations that would
be executed by a single thread become replicated operations in the MPI processes. The distinction between
sequential execution and replicated execution escapes many students at first, and in fact, since nothing is
gained by explaining this, we do not do so.
This sequence is defensible from a point of the underlying implementation: the two-sided communication
calls are a close map to hardware behavior, and collectives are both conceptually equivalent to, and can
be implemented as, a sequence of point-to-point communication calls. However, this is not a sufficient
justification for teaching this sequence of topics.
60.3.1 Criticism
We offer three points of criticism against this traditional approach to teaching MPI.
First of all, there is no real reason for teaching collectives after two-sided routines. They are not harder,
nor require the latter as prerequisite. In fact, their interface is simpler for a beginner, requiring one line
for a collective, as opposed to at least two for a send/receive pair, probably surrounded by conditionals
testing the process rank. More importantly, they reinforce the symmetric process view, certainly in the
case of the MPI_All... routines.
Our second point of criticism is regarding the blocking and nonblocking two-sided communication rou-
tines. The blocking routines are typically taught first, with a discussion of how blocking behavior can lead
to load unbalance and therefore inefficiency. The nonblocking routines are then motivated from a point
of latency hiding and solving the problems inherent in blocking. In our view such performance consid-
erations should be secondary. Nonblocking routines should instead be taught as the natural solution to a
conceptual problem, as explained below.
Thirdly, starting with point-to-point routines stems from a Communicating Sequential Processes (CSP)[12]
view of a program: each process stands on its own, and any global behavior is an emergent property of
the run. This may make sense for the teacher who knows how concepts are realized 'under the hood', but
it does not lead to additional insight with the students. We believe that a more fruitful approach to MPI
programming starts from the global behavior, and then derives the MPI process in a top-down manner.
‘projection’ onto that process of the global calculation. The opposing view, where the overall computation
is emergent from the individual processes, is the CSP model mentioned above.
mpiexec, as if it were an MPI program. Every process executes the print statement identically, bearing
out the total symmetry between the processes.
Next, students are asked to insert the initialize and finalize statements, with three different ‘hello world’
statements before, between, and after them. This will prevent any notion of the code between initialization
and finalization being considered as an OpenMP style ‘parallel region’.
A simple test to show that while processes are symmetric they are not identical is offered by the exercise
of using the MPI_Get_processor_name function, which will have different output for some or all of the
processes, depending on how the hostfile was arranged.
4. The MPI_Reduce call performs a reduction on data found on all processes, leaving the result on a ‘root’ process. With
MPI_Allreduce the result is left on all processes.
Certainly, in most applications the ‘allreduce’ is the more common mechanism, for instance where the
algorithm requires computations such as
\[ \bar y \leftarrow \bar x / \|\bar x\| \]
where $\bar x, \bar y$ are distributed vectors. The quantity $\|\bar x\|$ is then needed on all processes, making the Allreduce
the natural choice. The rooted reduction is typically only used for final results. Therefore we advocate
introducing both rooted and nonrooted collectives, but letting the students initially do exercises with the
nonrooted variants.
This has the added advantage of not bothering the students initially with the asymmetric treatment of
the receive buffer between the root and all other processes.
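As an illustration, the norm computation above can be coded with a nonrooted collective roughly as follows; this is only a sketch, and the names x, y, nlocal, comm are assumptions:
double local_sum = 0., global_sum;
for (int i=0; i<nlocal; i++)
  local_sum += x[i]*x[i];
MPI_Allreduce(&local_sum,&global_sum,1,MPI_DOUBLE,MPI_SUM,comm);
double norm = sqrt(global_sum);   // now known on every process
for (int i=0; i<nlocal; i++)
  y[i] = x[i]/norm;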
The lecturer stresses that the global structure of the distributed array is only ‘in the programmer’s mind’:
each MPI process sees an array with indexing starting at zero. The following snippet of code is given for
the students to use in subsequent exercises:
int myfirst = .....;
for (int ilocal=0; ilocal<nlocal; ilocal++) {
  int iglobal = myfirst+ilocal;
  array[ilocal] = f(iglobal);
}
At this point, the students can code a second variant of the primality testing exercise above, but with an
array allocated to store the integer range. Since collectives are now known, it becomes possible to have a
single summary statement from one process, rather than a partial result statement from each.
The inner product of two distributed vectors is a second illustration of working with distributed data. In
this case, the reduction for collecting the global result is slightly more useful than the collective in the
previous examples. For this example no translation from local to global numbering is needed.
We then state that this data transfer is realized in MPI by two-sided send/receive pairs.
5. In this operation, process A sends to B, and B subsequently sends to A. Thus the time for a message is half the time of a
ping-pong. It is not possible to measure a single message directly, since processes cannot be synchronized that finely.
The concepts of latency and bandwidth can be introduced, as the students test the ping-pong code on
messages of increasing size. The concept of halfbandwidth can be introduced by letting half of all processes
execute a ping-pong with a partner process in the other half.
6. The MPI_Sendrecv call combines a send and a receive operation, specifying for each process both a sending and a receiving
communication. The execution guarantees that no deadlock or serialization will occur.
60.5.1 Sequentialization
Our prime example illustrates the blocking behavior of MPI_Send and MPI_Recv.7 Deadlock is easy
enough to understand as a consequence of blocking – in the simplest case of deadlock, two processes are
both blocked, each expecting a receive from the other – but there are more subtle effects that will come as a
surprise to students. (This was alluded to in section 60.4.7.)
Consider the following basic program:
• Pass a data item to the next higher numbered process.
Note that this is conceptually a fully parallel program, so it should execute in time 𝑂(1) in terms of the
number of processes.
In terms of send and receive calls, the program becomes
• Send data to the next higher process;
• Receive data from the next lower process.
The final detail concerns the boundary conditions: the first process has nothing to receive and the last
one has nothing to send. This makes the final version of the program:
7. Blocking is defined as the process executing a send or receive call halting until the corresponding operation has executed.
• If you are not the last process, send data to the next higher process; then
• If you are not the first process, receive data from the next lower process.
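In MPI terms, a minimal sketch of this program could read as follows, with procno, nprocs, comm as in earlier examples and a single double as the data item by way of assumption:
double outdata = (double)procno, indata = 0.;
if (procno<nprocs-1)   // not the last process: send to the right
  MPI_Send(&outdata,1,MPI_DOUBLE,procno+1,0,comm);
if (procno>0)          // not the first process: receive from the left
  MPI_Recv(&indata,1,MPI_DOUBLE,procno-1,0,comm,MPI_STATUS_IGNORE);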
To have students act this out, we tell them to hold a pen in their right hand, and put the left hand in a pocket
or behind their back. Thus, they have only one ‘communication channel’. The ‘send data’ instruction
becomes ‘turn to your right and give your pen’, and ‘receive data’ becomes ‘turn to your left and receive
a pen’.
Executing this program, the students first all turn to the right, and they see that giving data to a neighbor
is not possible because no one is executing the receive instruction. The last process has nothing to send, so
it moves on to the receive instruction, after which the penultimate process can complete its send and in
turn receive, et cetera.
This exercise makes the students see, better than any explanation or diagram, how a parallel program
can compute the right result, but with unexpectedly low performance because of the interaction of the
processes. (In fact, we have had explicit feedback that this game was the biggest lightbulb moment of the
class.)
60.5.2 Ping-pong
While in general we emphasize the symmetry of MPI processes, during the discussion of send and receive
calls we act out the ping-pong operation (one process sending data to another, followed by the other
sending data back), precisely to demonstrate how asymmetric actions are handled. For this, two students
throw a pen back and forth between them, calling out ‘send’ and ‘receive’ when they do so.
The teacher then asks each student what program they executed, which is ‘send-receive’ for the one,
and ‘receive-send’ for the other student. Incorporating this in the SPMD model then leads to a code with
conditionals to determine the right action for the right process.
This code computes the correct result, and with the correct performance behavior, but it still shows a
conceptual misunderstanding. As one of the 'parallel computer games' (section 60.5) we have had a student
stand in front of the class with a sign 'I am process 5', and go through the above loop out loud ('Am I
process zero? No. Am I process one? No.'), which quickly drives home the point about the futility of this
construct.
60.6.1 Exercises
On day 1 the students do approximately 10 programming exercises, mostly finishing a skeleton code given
by the instructor. For the day 2 material students do two exercises per topic, again starting with a given
skeleton. (Skeleton codes are available as part of the repository [8].)
The design of these skeleton codes is an interesting problem in view of our concern with mental models.
The skeletons are intended to take the grunt work away from the students, to both indicate a basic code
structure and relieve them from making elementary coding errors that have no bearing on learning MPI.
On the other hand, the skeletons should leave enough unspecified that multiple solutions are possible,
including wrong ones: we want students to be confronted with conceptual errors in their thinking, and a
too-far-finished skeleton would prevent them from doing that.
Example: the prime finding exercise mentioned above (which teaches the notion of functional parallelism)
has the following skeleton:
int myfactor;
// Specify the loop header:
// for ( ... myfactor ... )
for (
     /**** your code here ****/
     ) {
  if (bignum%myfactor==0)
    printf("Process %d found factor %d\n",
           procno,myfactor);
}
This leaves open the possibility of both a blockwise and a cyclic distribution of the search space, as well
as incorrect solutions where each process runs through the whole search space.
60.6.2 Projects
Students in our academic course do a programming project in place of a final exam. Students can choose
between one of a set of standard projects, or doing a project of their own choosing. In the latter case, some
students will do a project in context of their graduate research, which means that they have an existing
codebase; others will write code from scratch. It is this last category, that will most clearly demonstrate
their correct understanding of the mental model underlying SPMD programs. However, we note that this
is only a fraction of the students in our course, a fraction made even smaller by the fact that we also
give a choice of doing a project in OpenMP rather than MPI. Since OpenMP is, at least to the beginning
programmer, simpler to use, there is in fact a clear preference for it among the students who pick their
own project.
to the writers of compilers and source translators. This means that by writing fairly modest parsers (say,
less than 200 lines of python) we can perform a sophisticated analysis of the students’ codes. We hope to
report on this in more detail in a follow-up paper.
60.9 Summary
In this paper we have introduced a nonstandard sequence for presenting the basic mechanisms in MPI.
Rather than starting with sends and receives and building up from there, we start with mechanisms that
emphasize the inherent symmetry between processes in the SPMD programming model. This symmetry
requires a substantial shift in mindset of the programmer, and therefore we target it explicitly.
In general, it is the opinion of this author that it pays off to teach from the basis of instilling a mental
model, rather than of presenting topics in some order of (perceived) complexity or sophistication.
Comparing our presentation as outlined above to the standard presentation, we recognize the downplay-
ing of the blocking send and receive calls. While students learn these, and in fact learn them before other
send and receive mechanisms, they will recognize the dangers and difficulties in using them, and will
have the combined sendrecv call as well as nonblocking routines as standard tools in their arsenal.
Bibliography
[1] Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Sameer Kumar, Ewing Lusk, Rajeev
Thakur, and Jesper Larsson Träff. MPI on millions of cores. Parallel Processing Letters, 21(01):45–60,
2011.
[2] Y. Ben-David Kolikant. Gardeners and cinema tickets: High schools’ preconceptions of concurrency.
Computer Science Education, 11:221–245, 2001.
[3] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication:
theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19:1749–
1783, 2007.
[4] Tom Cornebize, Franz C Heinrich, Arnaud Legrand, and Jérôme Vienne. Emulating High Perfor-
mance Linpack on a Commodity Server at the Scale of a Supercomputer. working paper or preprint,
December 2017.
[5] Lisandro Dalcin. MPI for Python, homepage. https://github.com/mpi4py/mpi4py.
[6] Saeed Dehnadi, Richard Bornat, and Ray Adams. Meta-analysis of the effect of consistency on success
in early learning of programming. In Psychology of Programming Interest Group PPIG 2009, pages 1–
13. University of Limerick, Ireland, 2009.
[7] Peter J. Denning. Computational thinking in science. American Scientist, pages 13–17, 2017.
[8] Victor Eijkhout. Parallel Programming in MPI and OpenMP. 2016. available for download: https:
//bitbucket.org/VictorEijkhout/parallel-computing-book/src.
[9] Victor Eijkhout. Performance of MPI sends of non-contiguous data. arXiv e-prints, page
arXiv:1809.10778, Sep 2018.
[10] Brice Goglin. Managing the Topology of Heterogeneous Cluster Nodes with Hardware Locality
(hwloc). In International Conference on High Performance Computing & Simulation (HPCS 2014),
Bologna, Italy, July 2014. IEEE.
[11] W. Gropp, E. Lusk, and A. Skjellum. Using MPI. The MIT Press, 1994.
[12] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985. ISBN-10: 0131532715, ISBN-
13: 978-0131532717.
[13] Torsten Hoefler, Prabhanjan Kambadur, Richard L. Graham, Galen Shipman, and Andrew Lumsdaine.
A case for standard non-blocking collective operations. In Proceedings, Euro PVM/MPI, Paris, France,
October 2007.
[14] Torsten Hoefler, Christian Siebert, and Andrew Lumsdaine. Scalable communication protocols for
dynamic sparse data exchange. SIGPLAN Not., 45(5):159–168, January 2010.
792
[15] INRIA. SimGrid homepage. http://simgrid.gforge.inria.fr/.
[16] L. V. Kale and S. Krishnan. Charm++: Parallel programming with message-driven objects. In Parallel
Programming using C++, G. V. Wilson and P. Lu, editors, pages 175–213. MIT Press, 1996.
[17] M. Li, H. Subramoni, K. Hamidouche, X. Lu, and D. K. Panda. High performance mpi datatype support
with user-mode memory registration: Challenges, designs, and benefits. In 2015 IEEE International
Conference on Cluster Computing, pages 226–235, Sept 2015.
[18] Zhenying Liu, Barbara Chapman, Tien-Hsiung Weng, and Oscar Hernandez. Improving the per-
formance of openmp by array privatization. In Proceedings of the OpenMP Applications and Tools
2003 International Conference on OpenMP Shared Memory Parallel Programming, WOMPAT’03, pages
244–259, Berlin, Heidelberg, 2003. Springer-Verlag.
[19] Robert McLay. T3pio: TACC's terrific tool for parallel I/O. https://github.com/TACC/t3pio.
[20] Sandra Mendez, Sebastian Lührs, Volker Weinberg, Dominic Sloan-Murphy, and Andrew Turner. Best
practice guide – parallel I/O. https://prace-ri.eu/training-support/best-practice-guides/
best-practice-guide-parallel-io/, 02 2019.
[21] MPI forum: MPI documents. http://www.mpi-forum.org/docs/docs.html.
[22] NASA Advanced Supercomputing Division. NAS parallel benchmarks. https://www.nas.nasa.
gov/publications/npb.html.
[23] The OpenMP API specification for parallel programming. http://openmp.org/wp/
openmp-specifications/.
[24] Amit Ruhela, Hari Subramoni, Sourav Chakraborty, Mohammadreza Bayatpour, Pouya Kousha, and
Dhabaleswar K. Panda. Efficient asynchronous communication progress for mpi without dedicated
resources. EuroMPI’18, New York, NY, USA, 2018. Association for Computing Machinery.
[25] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete
Reference, Volume 1, The MPI-1 Core. MIT Press, second edition edition, 1998.
[26] Jeff Squyres. MPI_Request_free is evil. Cisco Blogs, January 2013. https://blogs.cisco.com/
performance/mpi_request_free-is-evil.
[27] R. Thakur, W. Gropp, and B. Toonen. Optimizing the synchronization operations in MPI one-sided
communication. Int’l Journal of High Performance Computing Applications, 19:119–128, 2005.
[28] Universitat Politecnica de Valencia. SLEPC – Scalable software for Eigenvalue Problem Computa-
tions. http://www.grycap.upv.es/slepc/.
[29] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33:103–111, August
1990.
[30] P.C. Wason and P.N. Johnson-Laird. Thinking and Reasoning. Harmondsworth: Penguin, 1968.
List of acronyms
Chapter 63
General Index
Index
Lists of notes
64.2 Fortran notes
MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1 formatting-of-fortran-notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 new-developments-only-in-f08-module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 communicator-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 mpi-send-recv-buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 min-maxloc-types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 index-of-requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7 status-object-in-f08 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8 derived-types-for-handles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
9 subarrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
10 displacement-unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
11 offset-literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
12 openmp-version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
13 openmp-sentinel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
14 omp-do-pragma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
15 reductions-on-derived-types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
16 private-variables-in-parallel-region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
17 private-common-blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
18 array-sizes-in-map-clause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
PETSc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
19 petsc-initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
20 f90-array-access-through-pointer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
21 error-code-handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
22 backtrace-on-error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
23 print-string-construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
24 printing-and-newlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 722
25 cpp-includes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733
64.3 C++ notes
MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1 buffer-treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
2 range-syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
3 reduction-over-iterators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
4 templated-reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
5 reduction-on-class-objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
6 threadprivate-random-number-generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
7 lock-inside-overloaded-operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
8 uninitialized-containers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
PETSc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
List of MPL notes
1 notes-format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 header-file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 init,-finalize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 processor-name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 world-communicator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6 communicator-copying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7 communicator-passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
8 rank-and-size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
9 reduction-operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
10 scalar-buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
11 vector-buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
12 iterator-buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
13 reduce-in-place . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
14 reduce-on-non-root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
15 broadcast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
16 scan-operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
17 gather-scatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
18 gather-on-nonroot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
19 operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
20 user-defined-operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
21 lambda-operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
22 nonblocking-collectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
23 blocking-send-and-receive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
24 sending-arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
25 iterator-buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
26 iterator-layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
27 any-source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
28 send-recv-call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
29 requests-from-nonblocking-calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
30 request-pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
31 wait-any . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
32 request-handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
33 status-object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
34 status-source-querying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
35 message-tag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
36 receive-count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
37 persistent-requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
38 buffered-send . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
39 buffer-attach-and-detach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
40 other-types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
41 data-types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
42 derived-type-handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
43 contiguous-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
44 contiguous-composing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
45 vector-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
46 subarray-layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
47 indexed-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
48 layouts-for-gatherv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
49 indexed-block-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
50 struct-type-scalar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
51 struct-type-general . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
52 extent-resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
53 predefined-communicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
54 raw-communicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
55 communicator-duplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
56 communicator-splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1 running-mpi4py-programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 python-notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 import-mpi-module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 initialize-finalize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 communicator-objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6 communicator-rank-and-size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7 buffers-from-numpy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8 buffers-from-subarrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9 in-place-collectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
10 sending-objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
11 define-reduction-operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
12 reduction-function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
13 handling-a-single-request . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
14 request-arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
15 status-object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
16 data-types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
17 derived-type-handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
18 vector-type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
19 sending-from-the-middle-of-a-matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
20 big-data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
21 communicator-types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
22 communicator-duplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
23 comm-split-key-is-optional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
24 displacement-byte-computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
25 window-buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
26 mpi-one-sided-transfer-routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
27 file-open-is-class-method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
28 graph-communicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
29 thread-level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
30 utility-functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
31 error-policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
PETSc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
32 init,-and-with-commandline-options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
33 communicator-object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
34 petsc4py-interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
35 vector-creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
36 vector-size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
37 vector-operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
38 setting-vector-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
39 vector-access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
40 petsc-print-and-python-print . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
41 hdf5-file-generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
42 petsc-options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 722
43 python-mpi-programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
MPI_2DOUBLE_PRECISION, 78
MPI_2INT, 77
MPI_2INTEGER, 78
MPI_2REAL, 78
MPI_Abort, 30, 80, 412, 422
MPI_Accumulate, 315, 321, 327, 334
MPI_Add_error_class, 412
MPI_Add_error_code, 412
MPI_Add_error_string, 412
MPI_Address (deprecated), 216
MPI_ADDRESS_KIND, 196, 200, 319, 423
MPI_AINT, 199
MPI_Aint, 195, 196, 199, 199, 222, 314, 318
    in Fortran, 200
MPI_Aint_add, 199, 314
MPI_Aint_diff, 199, 314
MPI_Allgather, 65, 94
MPI_Allgather_init, 185
MPI_Allgatherv, 74
MPI_Allgatherv_init, 185
MPI_Alloc_mem, 311, 313, 329, 410
MPI_Allreduce, 44, 45, 52, 68, 94
MPI_Allreduce_init, 184, 185
MPI_Alltoall, 66, 67, 603
MPI_Alltoall_init, 185
MPI_Alltoallv, 67, 67, 74, 76, 216
MPI_Alltoallv_init, 185
MPI_Alltoallw_init, 185
MPI_ANY_SOURCE, 71, 89, 123, 123, 146, 149, 150, 154, 309, 410, 420, 424
MPI_ANY_TAG, 123, 131, 149, 152, 424
MPI_APPNUM, 290, 407, 422
MPI_ARGV_NULL, 424
MPI_ARGVS_NULL, 424
MPI_ASYNC_PROTECTS_NONBLOCKING, 414, 423
MPI_Attr_get, 406
MPI_BAND, 77
MPI_Barrier, 72, 362, 417
MPI_Barrier_init, 185
MPI_Bcast, 54, 94
MPI_Bcast_init, 185
MPI_BOR, 77
MPI_BOTTOM, 199, 334, 423, 424
MPI_Bsend, 188, 189, 190
MPI_Bsend_init, 183, 188, 191, 191
MPI_BSEND_OVERHEAD, 190, 191, 424
MPI_Buffer_attach, 190, 190
MPI_Buffer_detach, 190, 190
MPI_BXOR, 77
MPI_BYTE, 196, 197, 202, 224
MPI_C_BOOL, 196
MPI_C_COMPLEX, 196
MPI_C_DOUBLE_COMPLEX, 196
MPI_C_FLOAT_COMPLEX, 196
MPI_C_LONG_DOUBLE_COMPLEX, 196
MPI_Cancel, 424, 425
MPI_CART, 365
MPI_Cart_coords, 367
MPI_Cart_create, 366, 368
MPI_Cart_get, 367
MPI_Cart_map, 370
MPI_Cart_rank, 367
MPI_Cart_sub, 369
MPI_Cartdim_get, 367
MPI_CHAR, 196
MPI_CHARACTER, 197
MPI_Close_port, 292
MPI_COMBINER_VECTOR, 232
MPI_Comm, 21–23, 33, 264, 265, 423
MPI_Comm_accept, 291, 291
MPI_Comm_attr_function, 407
MPI_Comm_compare, 266, 278, 296
MPI_Comm_connect, 291, 292, 292, 410
MPI_Comm_create, 269, 273
MPI_Comm_create_errhandler, 412, 412
MPI_Comm_create_group, 274
MPI_Comm_create_keyval, 407
MPI_Comm_delete_attr, 407
MPI_Comm_disconnect, 270, 292
MPI_Comm_dup, 34, 265, 266, 268, 269, 405, 573
MPI_Comm_dup_with_info, 265, 405
MPI_Comm_free, 268, 270
MPI_Comm_free_keyval, 407
MPI_Comm_get_attr, 287, 406
MPI_Comm_get_errhandler, 410, 412
MPI_Comm_get_info, 405
MPI_Comm_get_parent, 278, 289, 290, 295
MPI_Comm_group, 273, 277
MPI_Comm_idup, 265
MPI_Comm_idup_with_info, 265
MPI_Comm_join, 294, 295
MPI_COMM_NULL, 264, 265, 267, 270, 278, 289, 367, 381
MPI_Comm_rank, 34, 35, 271, 277, 373, 751
MPI_Comm_remote_group, 278
MPI_Comm_remote_size, 278, 290
MPI_COMM_SELF, 264, 265, 580
MPI_Comm_set_attr, 406, 407
MPI_Comm_set_errhandler, 410, 410, 412
MPI_Comm_set_info, 405
MPI_Comm_set_name, 265
MPI_Comm_size, 34, 35, 121, 277, 373
MPI_Comm_spawn, 269, 287, 415
MPI_Comm_spawn_multiple, 290, 407
MPI_Comm_split, 71, 269, 270, 279, 369
MPI_Comm_split_type, 272, 381, 382, 385
MPI_Comm_test_inter, 278
MPI_COMM_TYPE_HW_GUIDED, 381
MPI_COMM_TYPE_HW_UNGUIDED, 381
MPI_COMM_TYPE_SHARED, 381
MPI_COMM_WORLD, 33, 264, 265, 269, 270, 275, 276, 279, 287, 290, 296, 386, 407, 412, 415, 423, 569, 573, 581
MPI_Compare_and_swap, 326
MPI_COMPLEX, 197
MPI_CONGRUENT, 267
MPI_Count, 45, 196, 221, 222, 222, 223, 226
MPI_COUNT_KIND, 196, 423
0_MPI_OFFSET_KIND, 361
access_style, 405
accumulate_ops, 334
accumulate_ordering, 334
alloc_shared_noncontig, 384
cb_block_size, 406
cb_buffer_size, 406
cb_nodes, 406
chunked, 406
chunked_item, 406
chunked_size, 406
collective_buffering, 406
control variable, 396–398
cvar, see control variable
file_perm, 406
io_node_list, 406
irequest_pool, 142
KSPSolve, 634
mpi://SELF, 296
mpi://WORLD, 296
nb_proc, 406
no_locks, 334
num_io_nodes, 406
OMPI_COMM_TYPE_SOCKET, 382
same_op, 334
same_op_no_op, 334
striping_factor, 406
striping_unit, 406
testany, 142
thread_support, 295
vector_layout, 206
wtime, 416
• ALLOCATABLE ??
• ASYNCHRONOUS ??
• BLOCK ??
• CHARACTER ??
• COMMON ??
• COMM_COPY_ATTR_FUNCTION ??
• COMM_DELETE_ATTR_FUNCTION ??
• COMPLEX ??
• CONTAINS ??
• CONTIGUOUS ??
• COPY_FUNCTION ??
• C_F_POINTER ??
• C_PTR ??
• DATAREP_CONVERSION_FUNCTION ??
• DELETE_FUNCTION ??
• EXTERNAL ??
• FUNCTION ??
• IN ??
• INCLUDE ??
• INOUT ??
• INTEGER ??
• INTENT ??
• INTERFACE ??
• ISO_C_BINDING ??
• ISO_FORTRAN_ENV ??
• KIND ??
• LOGICAL ??
• MODULE ??
• MPI_Send 120
• MPI_Status 149
• MPI_User_function ??
• MPI_Waitall 139
• OPTIONAL ??
• OUT ??
• POINTER ??
• PROCEDURE ??
• REAL ??
• SEQUENCE ??
• TARGET ??
• TYPE ??
• TYPE_COPY_ATTR_FUNCTION ??
• TYPE_DELETE_ATTR_FUNCTION ??
• USER_FUNCTION ??
• VOLATILE ??
• WIN_COPY_ATTR_FUNCTION ??
• WIN_DELETE_ATTR_FUNCTION ??
• base ??
• foo ??
• int ??
• separated_sections ??
_OPENMP, 438, 544
aligned, 531
atomic, 498, 527
barrier
    cancelled by nowait, 465
barrier, 496, 496
cancel, 450, 514, 564, 565
chunk, 460
collapse, 463, 463
copyin, 492
copyprivate, 484, 492
critical, 457, 498, 542
declare, 476
declare simd, 531
default
    firstprivate, 489
    none, 489
    private, 489
    shared, 489
default, 488
depend, 511, 514
dist_schedule, 537
do, 450, 454
dynamic, 542
final, 513
firstprivate, 490, 509, 536, 553
flush, 500, 529
for, 450, 454, 457
if, 513
implicit barrier
    after single directive, 484
in_reduction, 512
lastprivate, 465, 490
league, 537
linear, 531
master, 392, 483, 484, 519
nowait, 465, 497, 542, 542
num_threads, 443
omp
    barrier
        implicit, 497
omp for, 488
omp_alloc, 493
OMP_CANCELLATION, 450, 539
omp_cgroup_mem_alloc, 493
omp_const_mem_alloc, 493
omp_const_mem_space, 493
OMP_DEFAULT_DEVICE, 539
omp_default_mem_alloc, 493
omp_default_mem_space, 493
omp_destroy_nest_lock, 501
OMP_DISPLAY_ENV, 518, 539
OMP_DYNAMIC, 492, 539, 540
omp_get_active_level, 539
omp_get_ancestor_thread_num, 539
omp_get_cancellation, 450
omp_get_dynamic, 539, 540
omp_get_level, 539
omp_get_max_active_levels, 448, 539
omp_get_max_threads, 443, 539, 540
omp_get_nested, 539, 540
omp_get_num_procs, 440, 443, 539, 540, 711
omp_get_num_threads, 440, 443, 447, 448, 539, 540
omp_get_schedule, 462, 539, 540
omp_get_team_size, 539
omp_get_thread_limit, 539
omp_get_thread_num, 440, 446, 448, 539, 540
omp_get_wtick, 539, 541
omp_get_wtime, 539, 541
omp_high_bw_mem_alloc, 493
omp_high_bw_mem_space, 493
omp_in, 476, 477
omp_in_parallel, 450, 539, 540
omp_init_nest_lock, 501
omp_is_initial_device, 536
omp_large_cap_mem_alloc, 493
omp_large_cap_mem_space, 493
omp_low_lat_mem_alloc, 493
omp_low_lat_mem_space, 493
OMP_MAX_ACTIVE_LEVELS, 448, 539
OMP_MAX_TASK_PRIORITY, 513, 540
OMP_NESTED (deprecated), 449
OMP_NESTED, 540, 540
OMP_NUM_THREADS, 438, 440, 443, 540, 540
omp_out, 476, 477
OMP_PLACES, 518, 518, 520, 540
omp_priv, 477
OMP_PROC_BIND, 518, 520, 540, 541
omp_pteam_mem_alloc, 493
omp_sched_affinity, 462
omp_sched_auto, 462
omp_sched_dynamic, 462
omp_sched_guided, 462, 463
omp_sched_runtime, 463
omp_sched_static, 462
omp_sched_t, 462
OMP_SCHEDULE, 461–463, 540, 540
omp_set_dynamic, 539, 540
omp_set_max_active_levels, 448, 539
omp_set_nest_lock, 501
omp_set_nested, 539, 540
omp_set_num_threads, 443, 539, 540
omp_set_schedule, 462, 539, 540
OMP_STACKSIZE, 487, 540, 540
omp_test_nest_lock, 501
OMP_THREAD_LIMIT, 540
omp_thread_mem_alloc, 493
omp_unset_nest_lock, 501
OMP_WAIT_POLICY, 448, 540, 540
openmp_version, 438
ordered, 464, 464
parallel, 439, 440, 446, 448, 450, 454, 457, 521, 553
parallel region
    barrier at the end of, 497
pragma, see under pragma name
priority, 513
private, 487, 553
proc_bind, 519, 521
reduction, 457, 461, 472, 475, 476, 477, 498, 512
safelen(𝑛), 531
schedule
    auto, 461
    chunk, 460
    guided, 461
    runtime, 461
schedule, 460–462, 542
section, 482
sections, 449, 450, 472, 482, 490
simd, 531, 531
target
enter data, 537
exit data, 537
map, 537
update from, 537
update to, 537
target, 536, 537
task, 510, 511
task_reduction, 512
taskgroup, 450, 511, 512, 563, 564
taskwait, 510, 511, 513, 536, 563
taskyield, 513
team, 537
teams, 537
threadprivate, 491, 524, 542
tofrom, 536
untied, 513
wait-policy-var, 540
workshare, 484
--sub_ksp_monitor, 660
-da_grid_x, 613
-da_refine, 620
-da_refine_x, 620
-download-blas-lapack, 579
-download_mpich, 574
-ksp_atol, 634
-ksp_converged_reason, 635
-ksp_divtol, 634
-ksp_gmres_restart, 636
-ksp_mat_view, 597
-ksp_max_it, 634
-ksp_monitor, 642, 660
-ksp_monitor_true_residual, 642
-ksp_rtol, 634
-ksp_type, 636
-ksp_view, 634, 658, 660
-log_summary, 659
-log_view, 662
-malloc_dump, 663
-mat_view, 597, 658
-pc_factor_levels, 639
-snes_fd, 646
-snes_fd_color, 646
-vec_view, 658
-with-precision, 574
-with-scalar-type, 574
CHKERRA, 653
CHKERRABORT, 653
CHKERRMPI, 653
CHKERRQ, 653
CHKMEMA, 655
CHKMEMQ, 653, 655
DM, 613, 614, 620, 658
DM_BOUNDARY_GHOSTED, 613
DM_BOUNDARY_NONE, 613
DM_BOUNDARY_PERIODIC, 613
DMBoundaryType, 613
DMCreateGlobalVector, 616, 621
DMCreateLocalVector, 616, 621
DMDA, 613, 616, 620
DMDA_STENCIL_BOX, 613
DMDA_STENCIL_STAR, 613
DMDACreate1d, 613
DMDACreate2d, 613, 613
DMDAGetCorners, 614, 621
DMDAGetLocalInfo, 614
DMDALocalInfo, 614, 615, 620
DMDASetRefinementFactor, 620
DMDAVecGetArray, 620
DMGetGlobalVector, 616
DMGetLocalVector, 616
DMGlobalToLocal, 616, 621
DMGlobalToLocalBegin, 621
DMGlobalToLocalEnd, 621
DMLocalToGlobal, 616, 621
DMLocalToGlobalBegin, 621
DMLocalToGlobalEnd, 621
DMPLEX, 618
DMRestoreGlobalVector, 616
DMRestoreLocalVector, 616
DMStencilType, 613
DMViewFromOptions, 659
INSERT_VALUES, 588, 596
IS, 604
ISCreate, 602
ISCreateBlock, 602
ISCreateGeneral, 602
ISCreateStride, 602
ISGetIndices, 603
ISLocalToGlobalMappingViewFromOptions, 659
ISRestoreIndices, 603
ISViewFromOptions, 659
KSP, 597, 632
KSPBuildResidual, 642
KSPBuildSolution, 642
KSPConvergedDefault, 642
KSPConvergedReason, 634
KSPConvergedReasonView, 635
KSPConvergedReasonViewFromOptions, 659
KSPCreate, 633
KSPGetConvergedReason, 634
KSPGetIterationNumber, 635
KSPGetOperators, 634
KSPGetRhs, 642
KSPGetSolution, 642
KSPGMRESSetRestart, 636
KSPMatSolve, 636
KSPMonitorDefault, 642
KSPMonitorSet, 642
KSPMonitorTrueResidualNorm, 642
KSPReasonView (deprecated), 635
KSPSetConvergenceTest, 641
KSPSetFromOptions, 634, 636, 643
KSPSetOperators, 634
KSPSetOptionsPrefix, 660
KSPSetTolerances, 634
KSPSetType, 636
KSPView, 634, 657
KSPViewFromOptions, 659
MAT_FLUSH_ASSEMBLY, 596
MATAIJCUSPARSE, 651
MatAssemblyBegin, 596, 596
MatAssemblyEnd, 596, 596
MatCoarsenViewFromOptions, 659
MatCreate, 591
MatCreateDenseCUDA, 651
MatCreateFFT, 601
MatCreateSeqDenseCUDA, 651
MatCreateShell, 598
MatCreateSubMatrices, 598
MatCreateSubMatrix, 598, 604
MatCreateVecs, 583, 592
MatCreateVecsFFTW, 601
MATDENSECUDA, 651
MatDenseCUDAGetArray, 651
MatDenseGetArray, 596
MatDenseRestoreArray, 596
MatGetArray (deprecated), 596
MatGetRow, 596
MatImaginaryPart, 579
MatMatMult, 597
MATMPIAIJ, 591
MATMPIAIJCUSPARSE, 651
MatMPIAIJSetPreallocation, 594
MATMPIBIJ, 601
MATMPIDENSE, 591
MATMPIDENSECUDA, 651
MatMult, 597, 599
MatMultAdd, 597
MatMultHermitianTranspose, 597
MatMultTranspose, 597
MatPartitioning, 604
MatPartitioningApply, 604
MatPartitioningCreate, 604
accessor, 684, 686
buffer, 684
combine, 684
cout, 687
cpu_selector, 678
get_access, 686
get_range, 686
host_selector, 679
id<1>, 681
id<nd>, 681
is_cpu, 678
is_gpu, 678
is_host, 678
malloc, 685
malloc_device, 684, 685
malloc_host, 684, 685
malloc_shared, 684
nd_item, 682
offset, 688
queue::memcpy, 684
range, 681, 681
read, 686
reduction, 684
runtime_error, 679
submit, 680
ISBN 978-1-387-40028-7