Advanced Message Passing Interface (MPI)
Bruno C. Mundim
SciNet HPC Consortium
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 1 / 69
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 2 / 69
What do you need for this workshop?
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 3 / 69
Workshop structure
MONDAY: A first online lecture over Zoom (you’re here!).
An assignment will be given during the course of the lecture.
You can ask questions:
▶ in the Zoom chat during and at the end of the lecture.
▶ in the student forum on the course site.
▶ and also during:
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 4 / 69
Today’s Lecture Outline
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 5 / 69
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 6 / 69
Distributed Memory: Clusters
Machine Architecture: Clusters, or distributed memory machines.
Parallel code: runs on separate computers that communicate with each other.
Usual communication model: “message passing”.
Message Passing Interface (MPI): open standard library interface for message passing, ratified by the MPI Forum.
MPI Implementations:
▶ OpenMPI www.open-mpi.org
  ⋆ SciNet clusters (Niagara or Teach): module load gcc openmpi
▶ MPICH2 www.mpich.org
  ⋆ Niagara: module load intel intelmpi
[Figure: four nodes, CPU1–CPU4, connected as a cluster.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 7 / 69
MPI is a Library for Message Passing
Not built into the compiler.
Function calls that can be made from any compiler, in many languages.
Just link to it.
Wrappers: mpicc, mpif90, mpicxx
C:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, err;
    err = MPI_Init(&argc, &argv);
    err = MPI_Comm_size(MPI_COMM_WORLD, &size);
    err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello world from task %d of %d!\n", rank, size);
    err = MPI_Finalize();
}

Fortran:
program helloworld
  use mpi
  implicit none
  integer :: rank, commsize, err
  call MPI_Init(err)
  call MPI_Comm_size(MPI_COMM_WORLD, commsize, err)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, err)
  print *,'Hello world from task',rank,'of',commsize
  call MPI_Finalize(err)
end program helloworld
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 8 / 69
MPI is a Library for Message Passing
Communication/coordination between tasks done by sending and receiving messages.
Each message involves a function call from each of the programs.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 9 / 69
MPI is a Library for Message Passing
Three basic sets of functionality:
Pairwise communications via messages
Collective operations via messages
Efficient routines for getting data from memory into messages and vice versa
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 10 / 69
Messages
[Figure: a message between CPU1 and CPU2.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 11 / 69
Messages
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 12 / 69
Communicators
MPI groups processes into communicators.
MPI_COMM_WORLD: size = 4, ranks = 0..3
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 13 / 69
Communicators
MPI_COMM_WORLD: size = 4, ranks = 0..3
new_comm: size = 3, ranks = 0..2
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 14 / 69
MPI Communicator Basics
Communicator Components
MPI_COMM_WORLD: the global communicator.
MPI_Comm_rank(MPI_COMM_WORLD, &rank): get the current task’s rank.
MPI_Comm_size(MPI_COMM_WORLD, &size): get the communicator size.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 15 / 69
Different versions of SEND
C
MPI_Status status;
err = MPI_Ssend(sendptr, count, MPI_TYPE, destination, tag, Communicator);
err = MPI_Recv(rcvptr, count, MPI_TYPE, source, tag, Communicator, status);
Fortran
integer status(MPI_STATUS_SIZE)
call MPI_SSEND(sendarr, count, MPI_TYPE, destination, tag, Communicator, err)
call MPI_RECV(rcvarr, count, MPI_TYPE, source, tag, Communicator, status, err)
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 17 / 69
MPI: Sendrecv
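The combined send/receive call avoids the deadlock risk of mismatched blocking sends and receives between pairs of processes. As a sketch of the interface (the standard C prototype, not the slide's example):

int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status);

The send and receive halves may use different buffers, counts, datatypes, partners and tags; either partner may be MPI_PROC_NULL, which turns that half into a no-op.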
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 18 / 69
MPI Non-Blocking Functions: MPI_Isend, MPI_Irecv
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 19 / 69
Nonblocking Sends
FORTRAN:
MPI_ISEND(BUF, INTEGER COUNT, INTEGER DATATYPE, INTEGER DEST, INTEGER TAG, INTEGER COMM, INTEGER REQUEST, INTEGER ERROR)
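For reference, a hedged sketch of the corresponding C interface (not from the slides): both calls return immediately and hand back an MPI_Request, which must later be completed with MPI_Wait (or MPI_Test / MPI_Waitall); the send buffer must not be modified, and the receive buffer not read, until then.

int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
              MPI_Comm comm, MPI_Request *request);
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
              MPI_Comm comm, MPI_Request *request);
int MPI_Wait(MPI_Request *request, MPI_Status *status);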
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 20 / 69
MPI: Non-Blocking Isend & Irecv
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 21 / 69
MPI Collectives
Reduction:
▶ Works for a variety of operations (+,*,min,max)
▶ For example, to calculate the min/mean/max of numbers across the cluster.
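A minimal sketch of such a reduction (illustrative, not the course code): every rank contributes one value, and MPI_Allreduce combines them with MPI_SUM so that all ranks get the result (a mean would follow by dividing by the communicator size; MPI_MIN and MPI_MAX work the same way).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double local = rank + 1.0;   /* each task's contribution */
    double total;
    /* combine 'local' from all ranks with MPI_SUM; every rank receives the sum */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: total = %f\n", rank, total);
    MPI_Finalize();
    return 0;
}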
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 22 / 69
Collective Operations
Collective
Reductions are an example of a collective operation.
As opposed to the pairwise messages we’ve seen before, all processes in the communicator must participate.
Cannot proceed until all have participated.
Don’t necessarily know what’s ‘under the hood’.
Other examples: Broadcast, Scatter, Barriers (don’t!), All-to-all . . .
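A minimal broadcast sketch (illustrative, not the course code): rank 0 holds a value and MPI_Bcast distributes it to every process in the communicator; all ranks must make the call.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, n = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) n = 100;                        /* only the root has the value initially */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* afterwards every rank has n == 100 */
    printf("rank %d sees n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}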
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 23 / 69
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 24 / 69
Scientific MPI Example
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 25 / 69
Discretizing Derivatives
Partial Differential Equations like the diffusion equation

    ∂T/∂t = D ∂²T/∂x²

are usually numerically solved by finite differencing the discretized values.
Implicitly or explicitly this involves interpolating data and taking the derivative of the interpolant.

    ∂²T/∂x² ≈ (T_{i+1} − 2 T_i + T_{i−1}) / ∆x²

Larger ‘stencils’ → More accuracy.
[Figure: a 1D stencil over points i−2, i−1, i, i+1, i+2 with weights +1, −2, +1.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 26 / 69
Diffusion equation in higher dimensions
Spatial grid separation: ∆x. Time step: ∆t.
Grid indices: i, j. Time step index: (n).

1D:

    ∂T/∂t |_i ≈ (T_i^(n) − T_i^(n−1)) / ∆t

    ∂²T/∂x² |_i ≈ (T_{i−1}^(n) − 2 T_i^(n) + T_{i+1}^(n)) / ∆x²

    Stencil: +1 −2 +1

2D:

    ∂T/∂t |_{i,j} ≈ (T_{i,j}^(n) − T_{i,j}^(n−1)) / ∆t

    (∂²T/∂x² + ∂²T/∂y²) |_{i,j} ≈ (T_{i−1,j}^(n) + T_{i,j−1}^(n) − 4 T_{i,j}^(n) + T_{i+1,j}^(n) + T_{i,j+1}^(n)) / ∆x²

    Stencil: +1 on each of the four neighbours, −4 in the centre.
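A minimal sketch of the explicit update these stencils imply, in C (assumed array layout and names, not the course's diffusion2d code):

/* One forward-in-time, centred-in-space step of the 2D diffusion equation on
   an n x n interior surrounded by one layer of guard cells.  T and Tnew are
   (n+2) x (n+2) arrays; D, dt and dx are the diffusion coefficient, time step
   and grid spacing. */
void diffusion_step(int n, double T[n+2][n+2], double Tnew[n+2][n+2],
                    double D, double dt, double dx)
{
    double c = D * dt / (dx * dx);
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            Tnew[i][j] = T[i][j] + c * (T[i-1][j] + T[i][j-1]
                                        - 4.0 * T[i][j]
                                        + T[i+1][j] + T[i][j+1]);
}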
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 27 / 69
Stencils and Boundaries
How do you deal with boundaries?
The stencil juts out: you need info on cells beyond those you’re updating.
Common solution: guard cells.
▶ Pad the domain with guard cells so that the stencil works even for the first point in the domain.
▶ Fill the guard cells with values such that the required boundary conditions are met.
Number of guard cells ng = 1. Loop from i = ng . . . N − 2ng.
[Figure: a 1D domain with cells 0–6 plus one guard cell at each end, and a 2D domain with a one-cell-wide border of guard cells.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 28 / 69
Domain decomposition
A very common approach to
parallelizing on distributed
memory computers.
Subdivide the domain into
contiguous subdomains.
Give each subdomain to a
different MPI process.
No process contains the full
data!
Maintains locality.
Need mostly local data, i.e.,
only data at the boundary of
each subdomain will need to
be sent between processes.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 29 / 69
Guard cell exchange
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 30 / 69
Diffusion: Had to wait for communications to compute
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 31 / 69
Diffusion: Had to wait?
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 32 / 69
Blocking Communication/Computation Pattern
We have the following sequence of communication and computation:
The code exchanges guard cells using Sendrecv.
The code then computes the next step.
The code exchanges guard cells using Sendrecv again.
etc.
We can do better.
[Diagram: on each process, Sendrecv and Computation phases strictly alternate.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 33 / 69
Non-Blocking Communication/Computation Pattern
The code starts a send of its guard cells using ISend.
Without waiting for that send’s completion, the code computes the next step for the inner cells (while the guard cell message is in flight).
The code then receives the guard cells using IRecv.
[Diagram: on each process, ISend, then Computation overlapped with the message in flight, then IRecv, repeated every step.]
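A hedged sketch of this pattern for a process with left and right neighbours (buffer, rank and helper names such as compute_inner_cells are illustrative):

/* Post the sends and receives for both guard cells, then do the work that
   does not depend on them while the messages are in flight. */
MPI_Request req[4];
MPI_Status  stat[4];

MPI_Isend(send_left,  ng, MPI_DOUBLE, left,  0, comm, &req[0]);
MPI_Isend(send_right, ng, MPI_DOUBLE, right, 1, comm, &req[1]);
MPI_Irecv(recv_left,  ng, MPI_DOUBLE, left,  1, comm, &req[2]);
MPI_Irecv(recv_right, ng, MPI_DOUBLE, right, 0, comm, &req[3]);

compute_inner_cells();       /* needs no guard-cell data          */

MPI_Waitall(4, req, stat);   /* guard cells have now arrived      */
compute_boundary_cells();    /* the cells next to the guard cells */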
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 34 / 69
2D diffusion with MPI
How to divide the work in a 2D grid?
[Figure: two decompositions of the 2D grid: into roughly square blocks, and into horizontal strips.]
Block decomposition:
▶ Less communication (18 edges).
▶ Harder to program: non-contiguous data to send left, right, up and down.
Strip decomposition:
▶ Easier to code, similar to 1d, but with contiguous guard cells to send up and down.
▶ More communication (30 edges).
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 35 / 69
Let’s look at the easiest domain decomposition.
[Figure: the serial 2D grid, and the parallel decomposition (P = 3) into horizontal strips, each strip padded with guard cells.]
Communication pattern:
Copy the upper stripe to the upper neighbour’s bottom guard cell.
Copy the lower stripe to the lower neighbour’s top guard cell.
Contiguous cells: can use count in MPI_Sendrecv.
Similar to 1d diffusion.
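A hedged sketch of that exchange for one process owning a strip of nloc interior rows by ncols columns, stored row-major with guard rows 0 and nloc+1 (names are illustrative, not the diffusion2d code; 'up' and 'down' are the neighbour ranks, MPI_PROC_NULL at the domain edges so the calls become no-ops there):

MPI_Status status;

/* send my top interior row up, receive my bottom guard row from below */
MPI_Sendrecv(&T[1][0],      ncols, MPI_DOUBLE, up,   0,
             &T[nloc+1][0], ncols, MPI_DOUBLE, down, 0,
             MPI_COMM_WORLD, &status);

/* send my bottom interior row down, receive my top guard row from above */
MPI_Sendrecv(&T[nloc][0],   ncols, MPI_DOUBLE, down, 1,
             &T[0][0],      ncols, MPI_DOUBLE, up,   1,
             MPI_COMM_WORLD, &status);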
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 36 / 69
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 37 / 69
Access to SciNet’s Teach supercomputer
SciNet’s Teach supercomputer is part of the old GPC system (42 nodes) that has been repurposed for education and training in general, and in particular for many of the summer school sessions.
Log into the Teach login node, teach01, with your Compute Canada account credentials or your lcl_uothpc383sNNNN temporary account.

$ ssh -Y [email protected]
$ cd $SCRATCH
$ cp -r /scinet/course/mpi/advanced-mpi .
$ cd advanced-mpi
$ source setup

Running computations
On most supercomputers, a scheduler governs the allocation of resources.
This means submitting a job with a jobscript.
srun: a command that is a resource request + job running command all in one, and will run the command on one (or more) of the available resources.
We have set aside 34 nodes with 16 cores for this class, so occasionally, only in very busy sessions, you may have to wait for someone else’s srun command to finish.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 38 / 69
Assignment: 2D Diffusion
2D diffusion equation serial code:
$ cd $SCRATCH/advanced-mpi/diffusion2d
$ # source ../setup
$ make diffusion2dc
$ ./diffusion2dc
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 39 / 69
Derived Datatypes
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 40 / 69
Motivation
Every message is associated with a datatype.
All MPI data movement functions move data in some count units of some datatype.
Portability: specifying the length of a message as a given count of occurrences of a given datatype is more portable than using a length in bytes, since the lengths of given types may vary from one machine to another.
So far our messages correspond to contiguous regions of memory: a count of the basic MPI datatypes such as MPI_INT or MPI_DOUBLE was sufficient to describe our messages.
[Figure: a message between CPU1 and CPU2, described by a count of MPI_SOMETYPE and a tag.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 41 / 69
Motivation
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 42 / 69
Basic Datatypes for Fortran
MPI provides a rich set of predefined datatypes.
All basic datatypes in C and Fortran.
Two datatypes specific to MPI:
▶ MPI_BYTE: refers to a byte defined as eight binary digits.
▶ MPI_PACKED: rather than create a new datatype, just assemble a contiguous buffer to be sent.
Why not use char as bytes?
▶ Usually so represented by implementations, but not required. For example, C for Japanese might choose 16-bit chars.
▶ Machines might have different character sets in a heterogeneous environment.

MPI Datatype             Fortran Datatype
MPI_BYTE
MPI_CHARACTER            CHARACTER
MPI_COMPLEX              COMPLEX
MPI_DOUBLE_PRECISION     DOUBLE PRECISION
MPI_INTEGER              INTEGER
MPI_LOGICAL              LOGICAL
MPI_PACKED
MPI_REAL                 REAL
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 43 / 69
Basic Datatypes for C
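For reference, the standard correspondence between the MPI basic datatypes and C types, as specified by the MPI standard:

MPI Datatype        C Datatype
MPI_CHAR            char
MPI_SHORT           short
MPI_INT             int
MPI_LONG            long
MPI_UNSIGNED        unsigned int
MPI_FLOAT           float
MPI_DOUBLE          double
MPI_LONG_DOUBLE     long double
MPI_BYTE
MPI_PACKED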
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 44 / 69
Datatype Concepts
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 45 / 69
Datatype Concepts (Cont.)
A typemap is a list of component datatypes and their displacements, (type_j, disp_j).
Lower bound (lb) is the location of the first byte described by the datatype:

    lb(Typemap) = min_j (disp_j)

Upper bound (ub) is the location of the last byte described by the datatype:

    ub(Typemap) = max_j (disp_j + sizeof(type_j)) + pad

▶ Where the sizeof operator returns the size of the basic datatype in bytes.
Extent is the difference between these two bounds:

    extent(Typemap) = ub(Typemap) − lb(Typemap)
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 46 / 69
Data Alignment
Both C and Fortran require that the basic datatypes be properly aligned:
▶ The locations of an integer or a double-precision value occur only where allowed.
▶ Each implementation of these languages defines what is allowed.
▶ Most common: the address of an item in bytes is a multiple of the length of that item in bytes.
▶ For example, if an int takes four bytes, then the address of an int must be evenly divisible by four.
The data alignment requirement is reflected in the definition of the extent of an MPI datatype.

Example of a typemap on a computer that requires int’s to be aligned on 4-byte boundaries:

    {(int, 0), (char, 4)}
    lb = min(0, 4) = 0
    ub = max(0 + 4, 4 + 1) = 5

▶ The next int can only be placed with displacement eight from the int in the typemap. The pad in this case is three.
▶ Therefore, this typemap’s extent on this computer is eight.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 47 / 69
Datatype Information
MPI routines to retrieve information about MPI datatypes:
MPI_Type_get_extent
int MPI_Type_get_extent(MPI_Datatype datatype, MPI_Aint *lb, MPI_Aint *extent)
MPI_Type_size
int MPI_Type_size(MPI_Datatype datatype, int *size)
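A minimal usage sketch (not from the slides): query the size and extent of MPI_INT.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(MPI_INT, &size);               /* number of bytes of data */
    MPI_Type_get_extent(MPI_INT, &lb, &extent);  /* lower bound and extent  */
    printf("MPI_INT: size=%d, lb=%ld, extent=%ld\n",
           size, (long)lb, (long)extent);
    MPI_Finalize();
    return 0;
}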
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 48 / 69
Datatype Constructors
Problem: a typemap is a general way of describing an arbitrary datatype, but not convenient for a large number of entries.
Solution: MPI provides different ways to create datatypes without explicitly constructing the typemap:
▶ Contiguous: produces a new datatype by making count copies of an old one. Displacements incremented by the extent of the oldtype.
▶ Vector: like contiguous, but allows for regular gaps in displacements. Elements separated by multiples of the extent of the input datatype.
▶ Hvector: like vector, but elements are separated by a number of bytes.
More sophisticated constructors:
▶ Indexed: array of displacements provided. Displacements measured in terms of the extent of the input datatype.
▶ Hindexed: like indexed, but displacements measured in bytes.
▶ Struct: fully general. Input is the typemap, if the inputs are basic MPI datatypes.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 49 / 69
Datatype Constructors (Cont.)
MPI_Type_contiguous
int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
Simplest datatype constructor, which allows replication of an oldtype datatype into contiguous locations:
▶ count: replication count (nonnegative integer).
▶ oldtype: old datatype handle.
▶ newtype: new datatype handle.
Example: if the original datatype (oldtype) has a given typemap, then
MPI_Type_contiguous(2, oldtype, &newtype);
creates a newtype whose typemap is two consecutive copies of oldtype’s typemap, the second displaced by the extent of oldtype.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 50 / 69
Datatype Constructors (Cont.)
MPI_Type_vector
int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
Allows replication of an oldtype datatype into equally spaced blocks. Each block is
obtained by concatenating blocklength copies of the old datatype:
▶ count: number of blocks (nonnegative integer).
▶ blocklength: number of elements in each block (nonnegative integer).
▶ stride: number of elements between start of each block.
Very useful for Cartesian arrays.
Example: if original datatype (oldtype) has typemap: (double, 0) with extent 8, then:
MPI_Type_vector(3, 2, 4, oldtype, &newtype);
{(double, 0), (double, 8), (double, 32), (double, 40), (double, 64), (double, 72)}
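A common use (a sketch with assumed names, not the course code): describing a column of a row-major 2D array, so that a single send moves the whole non-contiguous column. New datatypes must be committed with MPI_Type_commit before use and can be released with MPI_Type_free.

/* Hypothetical helper: send column 'col' of an nrows x ncols row-major array
   of doubles to rank 'dest'. */
void send_column(const double *a, int nrows, int ncols, int col,
                 int dest, int tag, MPI_Comm comm)
{
    MPI_Datatype coltype;
    /* nrows blocks of 1 double each, successive blocks ncols doubles apart */
    MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);
    MPI_Send(&a[col], 1, coltype, dest, tag, comm);
    MPI_Type_free(&coltype);
}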
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 51 / 69
Datatype Constructors (Cont.)
MPI_Type_create_hvector
int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride,
MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_Type_create_indexed_block
int MPI_Type_create_indexed_block(int count, int blocklength, const int array_of_displacements[],
MPI_Datatype oldtype, MPI_Datatype *newtype)
Creates an indexed data type with the same block length for all blocks.
Useful for retrieving irregular subsets of data from a single array.
▶ blocklength = 2
▶ array_of_displacements = {0, 5, 8, 13, 18}
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 52 / 69
Datatype Constructors (Cont.)
MPI_Type_indexed
int MPI_Type_indexed(int count, const int array_of_blocklengths[],
const int array_of_displacements[], MPI_Datatype oldtype,
MPI_Datatype *newtype)
Creates an indexed datatype, where each block can contain a different number of oldtype copies.
▶ array_of_blocklengths = {1, 1, 2, 1, 2, 1}
▶ array_of_displacements = {0, 3, 5, 9, 13, 17}
MPI_Type_create_struct
int MPI_Type_create_struct(int count, int array_of_blocklengths[],
const MPI_Aint array_of_displacements[], const MPI_Datatype array_of_types[],
MPI_Datatype *newtype)
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 53 / 69
Datatype Constructors (Cont.)
MPI_Type_create_subarray
int MPI_Type_create_subarray(int ndims, const int array_of_sizes[],
const int array_of_subsizes[], const int array_of_starts[],
int order, MPI_Datatype oldtype, MPI_Datatype *newtype)
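A sketch of a typical use (assumed sizes and names): a datatype describing the interior of a 2D array padded with guard cells, handy for I/O or for sending the interior in one call.

/* Hypothetical helper: build a datatype for the n x n interior of an
   (n+2*ng) x (n+2*ng) C-ordered array of doubles, offset by ng in each
   dimension. */
MPI_Datatype make_interior_type(int n, int ng)
{
    int sizes[2]    = { n + 2*ng, n + 2*ng };  /* full array, incl. guard cells */
    int subsizes[2] = { n, n };                /* interior only                 */
    int starts[2]   = { ng, ng };              /* where the interior begins     */
    MPI_Datatype interior;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &interior);
    MPI_Type_commit(&interior);
    return interior;
}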
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 54 / 69
Application Topology
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 55 / 69
Introduction to Application Topology
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 56 / 69
Process Mapping
Good choice of mapping depends on the details of the underlying hardware.
Only the vendors know the best way to fit the application topologies into the machine topology.
They optimize through the implementation of MPI topology functions.
MPI does not provide the programmer any control over these mappings.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 58 / 69
Graph and Cartesian Topologies
MPI has the task of deciding how to assign processes to each part of the decomposed domain.
MPI provides the service of handling assignment of processes to regions. It provides two types of
topology routines to address the needs of different data topological layouts:
Cartesian Topology
It is a decomposition of the application processes in the natural coordinate directions, for example,
along x and y directions.
Graph Topology
It is the type of virtual topology that allows general relationships between processes, where processes
are represented by nodes of a graph.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 59 / 69
MPI Cartesian Topology Functions
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 60 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_create
int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[],
const int periods[], int reorder, MPI_Comm *comm_cart)
It returns a handle to a new communicator to which the Cartesian topology information is attached.
If reorder = false then the rank of each process in the new group is identical to its rank in the old
group.
Otherwise it may reorder to choose a good embedding of the virtual topology onto the physical
machine.
▶ comm_old: handle to input communicator.
▶ ndims: number of dimensions of Cartesian grid.
▶ dims: integer array of size ndims specifying the number of
processes in each dimension.
▶ periods: logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each
dimension.
If the total size of the Cartesian grid is smaller than the size of the group of comm, then some
processes are returned MPI_COMM_NULL.
The call is erroneous if it specifies a grid that is larger than the group size.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 61 / 69
MPI Cartesian Topology Functions (Cont.)
This code snippet uses MPI_Cart_create to remap the process ranks from a linear ordering (0, 1, . . . , 5) to a 2-dimensional array of 3 rows by 2 columns ((0,0), (0,1), . . . , (2,1)).
We are able to assign work to the processes by their grid topology instead of their linear process rank.
We imposed periodicity on the first dimension. This means any reference beyond the first or last entry of the columns cycles back to the last and first entry, respectively.
Any reference to a column index outside the range returns MPI_PROC_NULL.

MPI_Cart_create (code snippet)
#include "mpi.h"
MPI_Comm old_comm, new_comm;
int ndims, reorder, periods[2], dim_size[2];
old_comm = MPI_COMM_WORLD;
ndims = 2;            /* 2-D matrix/grid */
dim_size[0] = 3;      /* rows */
dim_size[1] = 2;      /* columns */
periods[0] = 1;       /* row periodic (each column forms a ring) */
periods[1] = 0;       /* columns nonperiodic */
reorder = 1;          /* allows processes reordered for efficiency */
MPI_Cart_create(old_comm, ndims, dim_size, periods, reorder, &new_comm);
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 62 / 69
MPI Cartesian Topology Functions (Cont.)
Messages are still sent to and received from process’s ranks.
MPI provides routines to map or convert ranks to cartesian coordinates and vice-versa:
MPI_Cart_coords
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int coords[])
MPI_Cart_rank
int MPI_Cart_rank(MPI_Comm comm, int coords[], int *rank)
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 63 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_coords (code snippet)
MPI_Cart_create(old_comm, ndims, dim_size, periods, reorder, &new_comm);
if (rank == 0) {  /* only want to do this on one process */
    for (rank = 0; rank < p; rank++) {
        MPI_Cart_coords(new_comm, rank, ndims, coords);
        printf("%d, %d, %d\n", rank, coords[0], coords[1]);
    }
}
MPI_Cart_get
int MPI_Cart_get(MPI_Comm comm, int maxdims, int dims[], int periods[],
int coords[])
It retrieves information from a communicator with Cartesian topology:
▶ maxdims: Length of vectors dims, periods, and coords in the calling program.
▶ dims: Number of processes for each Cartesian dimension.
▶ periods: Periodicity for each Cartesian dimension.
▶ coords: Coordinates of the calling process in Cartesian structure.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 66 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_shift
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
int *rank_source, int *rank_dest)
It returns the shifted source and destination ranks, given a shift direction and amount.
▶ direction: Coordinate dimension of shift, i.e., the coordinate whose value is modified by the shift.
▶ disp: Displacement ( > 0: upward shift, < 0: downward shift).
▶ rank_source: Rank of source process.
▶ rank_dest: Rank of destination process.
An MPI_Sendrecv operation is likely to be used along a coordinate direction to perform a shift of data.
▶ As input, it takes the rank of a source process for the receive, and the rank of a destination process for
the send.
▶ MPI_Cart_shift provides MPI_Sendrecv with the above identifiers.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 67 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_shift (code snippet)
/* Create Cartesian topology for processes */
ndim = 2; /* number of dimensions */
dims[0] = nrow; /* number of rows */
dims[1] = mcol; /* number of columns */
period[0] = 1; /* cyclic in this direction */
period[1] = 0; /* not cyclic in this direction */
MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, period, reorder,
&comm2D);
MPI_Comm_rank(comm2D, &me);
MPI_Cart_coords(comm2D, me, ndim, &coords);
displ = 1; /* shift by 1 */
index = 0; /* shift along the 1st index (out of 2) */
MPI_Cart_shift(comm2D, index, displ, &source0, &dest0);
index = 1; /* shift along the 2nd index (out of 2) */
MPI_Cart_shift(comm2D, index, displ, &source1, &dest1);
MPI_Cart_shift is used to obtain the source and destination rank numbers of the calling process.
There are two calls to MPI_Cart_shift, the first shifting along columns, and the second along rows.
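A hedged continuation of the snippet above (sendbuf, recvbuf and count are assumed names): the (source, destination) pair returned by MPI_Cart_shift plugs directly into MPI_Sendrecv to shift data along that dimension.

/* Shift 'count' doubles along the first Cartesian dimension. */
MPI_Status status;
MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, dest0,   0,
             recvbuf, count, MPI_DOUBLE, source0, 0,
             comm2D, &status);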
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 68 / 69
Conclusion
Recap
MPI Basics Review
Scientific MPI Example: 2D Diffusion Equation
Derived Data Types
Application Topology
Good References
W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the
Message-Passing Interface. Third Edition. (MIT Press, 2014).
W. Gropp, T. Hoefler, R. Thakur, E. Lusk, Using Advanced MPI: Modern Features of the
Message-Passing Interface. (MIT Press, 2014).
A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition.
(Addison-Wesley, 2003) (A bit old but still reasonable)
The man pages for various MPI commands.
http://www.mpi-forum.org/docs/ for MPI standard specification.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 69 / 69