Advanced Message Passing Interface (MPI)
Bruno C. Mundim
SciNet HPC Consortium
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 1 / 69
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 2 / 69
What do you need for this workshop?
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 3 / 69
Workshop structure
MONDAY: A first online lecture over Zoom (you’re here!).
An assignment will be given during the course of the lecture.
You can ask questions:
▶ in the Zoom chat during and at the end of the lecture.
▶ in the student forum on the course site.
▶ and also during:
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 4 / 69
Today’s Lecture Outline
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 5 / 69
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 6 / 69
Distributed Memory: Clusters
Machine Architecture: Clusters, or distributed memory machines.
Parallel code: runs on separate computers that communicate with each other.
Usual communication model: “message passing”.
Message Passing Interface (MPI): open standard library interface for message passing, ratified by the MPI Forum.
MPI Implementations:
▶ OpenMPI www.open-mpi.org
  ⋆ SciNet clusters (Niagara or Teach): module load gcc openmpi
▶ MPICH2 www.mpich.org
  ⋆ Niagara: module load intel intelmpi
[Figure: four nodes, CPU1–CPU4, connected as a cluster.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 7 / 69
MPI is a Library for Message Passing
Not built into the compiler.
Function calls that can be made from any compiler, in many languages.
Just link to it.
Wrappers: mpicc, mpif90, mpicxx
C:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, err;
    err = MPI_Init(&argc, &argv);
    err = MPI_Comm_size(MPI_COMM_WORLD, &size);
    err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello world from task %d of %d!\n", rank, size);
    err = MPI_Finalize();
}

Fortran:
program helloworld
  use mpi
  implicit none
  integer :: rank, commsize, err
  call MPI_Init(err)
  call MPI_Comm_size(MPI_COMM_WORLD, commsize, err)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, err)
  print *,'Hello world from task',rank,'of',commsize
  call MPI_Finalize(err)
end program helloworld
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 8 / 69
MPI is a Library for Message Passing
Communication/coordination between tasks done by sending and receiving messages.
Each message involves a function call from each of the programs.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 9 / 69
MPI is a Library for Message Passing
Three basic sets of functionality:
Pairwise communications via messages
Collective operations via messages
Efficient routines for getting data from memory into messages and vice versa
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 10 / 69
Messages
[Figure: a message between CPU1 and CPU2.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 11 / 69
Messages
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 12 / 69
Communicators
MPI groups processes into communicators.
MPI_COMM_WORLD: size = 4, ranks = 0..3
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 13 / 69
Communicators
MPI_COMM_WORLD: size = 4, ranks = 0..3
new_comm: size = 3, ranks = 0..2
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 14 / 69
MPI Communicator Basics
Communicator Components
MPI_COMM_WORLD: the global communicator.
MPI_Comm_rank(MPI_COMM_WORLD, &rank): get the current task’s rank.
MPI_Comm_size(MPI_COMM_WORLD, &size): get the communicator size.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 15 / 69
Different versions of SEND
C
MPI_Status status;
err = MPI_Ssend(sendptr, count, MPI_TYPE, destination, tag, Communicator);
err = MPI_Recv(rcvptr, count, MPI_TYPE, source, tag, Communicator, status);
Fortran
integer status(MPI_STATUS_SIZE)
call MPI_SSEND(sendarr, count, MPI_TYPE, destination, tag, Communicator, err)
call MPI_RECV(rcvarr, count, MPI_TYPE, source, tag, Communicator, status, err)
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 17 / 69
MPI: Sendrecv
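The combined send/receive call avoids the deadlock risk of mismatched blocking sends and receives between pairs of processes. As a sketch of the interface (the standard C prototype, not the slide's example):

int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status);

The send and receive halves may use different buffers, counts, datatypes, partners and tags; either partner may be MPI_PROC_NULL, which turns that half into a no-op.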
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 18 / 69
MPI Non-Blocking Functions: MPI_Isend, MPI_Irecv
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 19 / 69
Nonblocking Sends
FORTRAN:
MPI_ISEND(BUF, INTEGER COUNT, INTEGER DATATYPE, INTEGER DEST, INTEGER TAG, INTEGER COMM, INTEGER REQUEST, INTEGER ERROR)
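For reference, a hedged sketch of the corresponding C interface (not from the slides): both calls return immediately and hand back an MPI_Request, which must later be completed with MPI_Wait (or MPI_Test / MPI_Waitall); the send buffer must not be modified, and the receive buffer not read, until then.

int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
              MPI_Comm comm, MPI_Request *request);
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
              MPI_Comm comm, MPI_Request *request);
int MPI_Wait(MPI_Request *request, MPI_Status *status);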
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 20 / 69
MPI: Non-Blocking Isend & Irecv
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 21 / 69
MPI Collectives
Reduction:
▶ Works for a variety of operations (+,*,min,max)
▶ For example, to calculate the min/mean/max of numbers across the cluster.
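A minimal sketch of such a reduction (illustrative, not the course code): every rank contributes one value, and MPI_Allreduce combines them with MPI_SUM so that all ranks get the result (a mean would follow by dividing by the communicator size; MPI_MIN and MPI_MAX work the same way).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double local = rank + 1.0;   /* each task's contribution */
    double total;
    /* combine 'local' from all ranks with MPI_SUM; every rank receives the sum */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: total = %f\n", rank, total);
    MPI_Finalize();
    return 0;
}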
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 22 / 69
Collective Operations
Collective
Reductions are an example of a collective operation.
As opposed to the pairwise messages we’ve seen before, all processes in the communicator must participate.
Cannot proceed until all have participated.
Don’t necessarily know what’s ‘under the hood’.
Other examples: Broadcast, Scatter, Barriers (don’t!), All-to-all . . .
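A minimal broadcast sketch (illustrative, not the course code): rank 0 holds a value and MPI_Bcast distributes it to every process in the communicator; all ranks must make the call.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, n = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) n = 100;                        /* only the root has the value initially */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* afterwards every rank has n == 100 */
    printf("rank %d sees n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}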
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 23 / 69
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 24 / 69
Scientific MPI Example
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 25 / 69
Discretizing Derivatives
Partial Differential Equations like the diffusion equation

    ∂T/∂t = D ∂²T/∂x²

are usually numerically solved by finite differencing the discretized values.
Implicitly or explicitly this involves interpolating data and taking the derivative of the interpolant.

    ∂²T/∂x² ≈ (T_{i+1} − 2 T_i + T_{i−1}) / ∆x²

Larger ‘stencils’ → More accuracy.
[Figure: a 1D stencil over points i−2, i−1, i, i+1, i+2 with weights +1, −2, +1.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 26 / 69
Diffusion equation in higher dimensions
Spatial grid separation: ∆x. Time step: ∆t.
Grid indices: i, j. Time step index: (n).

1D:

    ∂T/∂t |_i ≈ (T_i^(n) − T_i^(n−1)) / ∆t

    ∂²T/∂x² |_i ≈ (T_{i−1}^(n) − 2 T_i^(n) + T_{i+1}^(n)) / ∆x²

    Stencil: +1 −2 +1

2D:

    ∂T/∂t |_{i,j} ≈ (T_{i,j}^(n) − T_{i,j}^(n−1)) / ∆t

    (∂²T/∂x² + ∂²T/∂y²) |_{i,j} ≈ (T_{i−1,j}^(n) + T_{i,j−1}^(n) − 4 T_{i,j}^(n) + T_{i+1,j}^(n) + T_{i,j+1}^(n)) / ∆x²

    Stencil: +1 on each of the four neighbours, −4 in the centre.
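A minimal sketch of the explicit update these stencils imply, in C (assumed array layout and names, not the course's diffusion2d code):

/* One forward-in-time, centred-in-space step of the 2D diffusion equation on
   an n x n interior surrounded by one layer of guard cells.  T and Tnew are
   (n+2) x (n+2) arrays; D, dt and dx are the diffusion coefficient, time step
   and grid spacing. */
void diffusion_step(int n, double T[n+2][n+2], double Tnew[n+2][n+2],
                    double D, double dt, double dx)
{
    double c = D * dt / (dx * dx);
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            Tnew[i][j] = T[i][j] + c * (T[i-1][j] + T[i][j-1]
                                        - 4.0 * T[i][j]
                                        + T[i+1][j] + T[i][j+1]);
}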
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 27 / 69
Stencils and Boundaries
How do you deal with boundaries?
The stencil juts out: you need info on cells beyond those you’re updating.
Common solution: guard cells.
▶ Pad the domain with guard cells so that the stencil works even for the first point in the domain.
▶ Fill the guard cells with values such that the required boundary conditions are met.
Number of guard cells ng = 1. Loop from i = ng . . . N − 2ng.
[Figure: a 1D domain with cells 0–6 plus one guard cell at each end, and a 2D domain with a one-cell-wide border of guard cells.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 28 / 69
Domain decomposition
A very common approach to
parallelizing on distributed
memory computers.
Subdivide the domain into
contiguous subdomains.
Give each subdomain to a
different MPI process.
No process contains the full
data!
Maintains locality.
Need mostly local data, i.e.,
only data at the boundary of
each subdomain will need to
be sent between processes.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 29 / 69
Guard cell exchange
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 30 / 69
Diffusion: Had to wait for communications to compute
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 31 / 69
Diffusion: Had to wait?
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 32 / 69
Blocking Communication/Computation Pattern
We have the following sequence of communication and computation:
The code exchanges guard cells using Sendrecv.
The code then computes the next step.
The code exchanges guard cells using Sendrecv again.
etc.
We can do better.
[Diagram: on each process, Sendrecv and Computation phases strictly alternate.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 33 / 69
Non-Blocking Communication/Computation Pattern
The code starts a send of its guard cells using ISend.
Without waiting for that send’s completion, the code computes the next step for the inner cells (while the guard cell message is in flight).
The code then receives the guard cells using IRecv.
[Diagram: on each process, ISend, then Computation overlapped with the message in flight, then IRecv, repeated every step.]
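A hedged sketch of this pattern for a process with left and right neighbours (buffer, rank and helper names such as compute_inner_cells are illustrative):

/* Post the sends and receives for both guard cells, then do the work that
   does not depend on them while the messages are in flight. */
MPI_Request req[4];
MPI_Status  stat[4];

MPI_Isend(send_left,  ng, MPI_DOUBLE, left,  0, comm, &req[0]);
MPI_Isend(send_right, ng, MPI_DOUBLE, right, 1, comm, &req[1]);
MPI_Irecv(recv_left,  ng, MPI_DOUBLE, left,  1, comm, &req[2]);
MPI_Irecv(recv_right, ng, MPI_DOUBLE, right, 0, comm, &req[3]);

compute_inner_cells();       /* needs no guard-cell data          */

MPI_Waitall(4, req, stat);   /* guard cells have now arrived      */
compute_boundary_cells();    /* the cells next to the guard cells */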
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 34 / 69
2D diffusion with MPI
How to divide the work in a 2D grid?
[Figure: two decompositions of the 2D grid: into roughly square blocks, and into horizontal strips.]
Block decomposition:
▶ Less communication (18 edges).
▶ Harder to program: non-contiguous data to send left, right, up and down.
Strip decomposition:
▶ Easier to code, similar to 1d, but with contiguous guard cells to send up and down.
▶ More communication (30 edges).
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 35 / 69
Let’s look at the easiest domain decomposition.
[Figure: the serial 2D grid, and the parallel decomposition (P = 3) into horizontal strips, each strip padded with guard cells.]
Communication pattern:
Copy the upper stripe to the upper neighbour’s bottom guard cell.
Copy the lower stripe to the lower neighbour’s top guard cell.
Contiguous cells: can use count in MPI_Sendrecv.
Similar to 1d diffusion.
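A hedged sketch of that exchange for one process owning a strip of nloc interior rows by ncols columns, stored row-major with guard rows 0 and nloc+1 (names are illustrative, not the diffusion2d code; 'up' and 'down' are the neighbour ranks, MPI_PROC_NULL at the domain edges so the calls become no-ops there):

MPI_Status status;

/* send my top interior row up, receive my bottom guard row from below */
MPI_Sendrecv(&T[1][0],      ncols, MPI_DOUBLE, up,   0,
             &T[nloc+1][0], ncols, MPI_DOUBLE, down, 0,
             MPI_COMM_WORLD, &status);

/* send my bottom interior row down, receive my top guard row from above */
MPI_Sendrecv(&T[nloc][0],   ncols, MPI_DOUBLE, down, 1,
             &T[0][0],      ncols, MPI_DOUBLE, up,   1,
             MPI_COMM_WORLD, &status);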
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 36 / 69
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 37 / 69
Access to SciNet’s Teach supercomputer
SciNet’s Teach supercomputer is part of the old GPC system (42 nodes) that has been repurposed for education and training in general, and in particular for many of the summer school sessions.
Log into the Teach login node, teach01, with your Compute Canada account credentials or your lcl_uothpc383sNNNN temporary account.

$ ssh -Y [email protected]
$ cd $SCRATCH
$ cp -r /scinet/course/mpi/advanced-mpi .
$ cd advanced-mpi
$ source setup

Running computations
On most supercomputers, a scheduler governs the allocation of resources.
This means submitting a job with a jobscript.
srun: a command that is a resource request + job running command all in one, and will run the command on one (or more) of the available resources.
We have set aside 34 nodes with 16 cores for this class, so occasionally, only in very busy sessions, you may have to wait for someone else’s srun command to finish.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 38 / 69
Assignment: 2D Diffusion
2D diffusion equation serial code:
$ cd $SCRATCH/advanced-mpi/diffusion2d
$ # source ../setup
$ make diffusion2dc
$ ./diffusion2dc
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 39 / 69
Derived Datatypes
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 40 / 69
Motivation
Every message is associated with a datatype.
All MPI data movement functions move data in some count units of some datatype.
Portability: specifying the length of a message as a given count of occurrences of a given datatype is more portable than using a length in bytes, since the lengths of given types may vary from one machine to another.
So far our messages correspond to contiguous regions of memory: a count of the basic MPI datatypes such as MPI_INT or MPI_DOUBLE was sufficient to describe our messages.
[Figure: a message between CPU1 and CPU2, described by a count of MPI_SOMETYPE and a tag.]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 41 / 69
Motivation
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 42 / 69
Basic Datatypes for Fortran
MPI provides a rich set of predefined datatypes.
All basic datatypes in C and Fortran.
Two datatypes specific to MPI:
▶ MPI_BYTE: refers to a byte defined as eight binary digits.
▶ MPI_PACKED: rather than create a new datatype, just assemble a contiguous buffer to be sent.
Why not use char as bytes?
▶ Usually so represented by implementations, but not required. For example, C for Japanese might choose 16-bit chars.
▶ Machines might have different character sets in a heterogeneous environment.

MPI Datatype             Fortran Datatype
MPI_BYTE
MPI_CHARACTER            CHARACTER
MPI_COMPLEX              COMPLEX
MPI_DOUBLE_PRECISION     DOUBLE PRECISION
MPI_INTEGER              INTEGER
MPI_LOGICAL              LOGICAL
MPI_PACKED
MPI_REAL                 REAL
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 43 / 69
Basic Datatypes for C
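For reference, the standard correspondence between the MPI basic datatypes and C types, as specified by the MPI standard:

MPI Datatype        C Datatype
MPI_CHAR            char
MPI_SHORT           short
MPI_INT             int
MPI_LONG            long
MPI_UNSIGNED        unsigned int
MPI_FLOAT           float
MPI_DOUBLE          double
MPI_LONG_DOUBLE     long double
MPI_BYTE
MPI_PACKED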
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 44 / 69
Datatype Concepts
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 45 / 69
Datatype Concepts (Cont.)
A typemap is a list of component datatypes and their displacements, (type_j, disp_j).
Lower bound (lb) is the location of the first byte described by the datatype:

    lb(Typemap) = min_j (disp_j)

Upper bound (ub) is the location of the last byte described by the datatype:

    ub(Typemap) = max_j (disp_j + sizeof(type_j)) + pad

▶ Where the sizeof operator returns the size of the basic datatype in bytes.
Extent is the difference between these two bounds:

    extent(Typemap) = ub(Typemap) − lb(Typemap)
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 46 / 69
Data Alignment
Both C and Fortran require that the basic datatypes be properly aligned:
▶ The locations of an integer or a double-precision value occur only where allowed.
▶ Each implementation of these languages defines what is allowed.
▶ Most common: the address of an item in bytes is a multiple of the length of that item in bytes.
▶ For example, if an int takes four bytes, then the address of an int must be evenly divisible by four.
The data alignment requirement is reflected in the definition of the extent of an MPI datatype.

Example of a typemap on a computer that requires int’s to be aligned on 4-byte boundaries:

    {(int, 0), (char, 4)}
    lb = min(0, 4) = 0
    ub = max(0 + 4, 4 + 1) = 5

▶ The next int can only be placed with displacement eight from the int in the typemap. The pad in this case is three.
▶ Therefore, this typemap’s extent on this computer is eight.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 47 / 69
Datatype Information
MPI routines to retrieve information about MPI datatypes:
MPI_Type_get_extent
int MPI_Type_get_extent(MPI_Datatype datatype, MPI_Aint *lb, MPI_Aint *extent)
MPI_Type_size
int MPI_Type_size(MPI_Datatype datatype, int *size)
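A minimal usage sketch (not from the slides): query the size and extent of MPI_INT.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(MPI_INT, &size);               /* number of bytes of data */
    MPI_Type_get_extent(MPI_INT, &lb, &extent);  /* lower bound and extent  */
    printf("MPI_INT: size=%d, lb=%ld, extent=%ld\n",
           size, (long)lb, (long)extent);
    MPI_Finalize();
    return 0;
}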
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 48 / 69
Datatype Constructors
Problem: a typemap is a general way of describing an arbitrary datatype, but not convenient for a large number of entries.
Solution: MPI provides different ways to create datatypes without explicitly constructing the typemap:
▶ Contiguous: produces a new datatype by making count copies of an old one. Displacements incremented by the extent of the oldtype.
▶ Vector: like contiguous, but allows for regular gaps in displacements. Elements separated by multiples of the extent of the input datatype.
▶ Hvector: like vector, but elements are separated by a number of bytes.
More sophisticated constructors:
▶ Indexed: array of displacements provided. Displacements measured in terms of the extent of the input datatype.
▶ Hindexed: like indexed, but displacements measured in bytes.
▶ Struct: fully general. Input is the typemap, if the inputs are basic MPI datatypes.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 49 / 69
Datatype Constructors (Cont.)
MPI_Type_contiguous
int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
Simplest datatype constructor, which allows replication of an oldtype datatype into contiguous locations:
▶ count: replication count (nonnegative integer).
▶ oldtype: old datatype handle.
▶ newtype: new datatype handle.
Example: if the original datatype (oldtype) has a given typemap, then
MPI_Type_contiguous(2, oldtype, &newtype);
creates a newtype whose typemap is two consecutive copies of oldtype’s typemap, the second displaced by the extent of oldtype.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 50 / 69
Datatype Constructors (Cont.)
MPI_Type_vector
int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
Allows replication of an oldtype datatype into equally spaced blocks. Each block is
obtained by concatenating blocklength copies of the old datatype:
▶ count: number of blocks (nonnegative integer).
▶ blocklength: number of elements in each block (nonnegative integer).
▶ stride: number of elements between start of each block.
Very useful for Cartesian arrays.
Example: if original datatype (oldtype) has typemap: (double, 0) with extent 8, then:
MPI_Type_vector(3, 2, 4, oldtype, &newtype);
{(double, 0), (double, 8), (double, 32), (double, 40), (double, 64), (double, 72)}
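A common use (a sketch with assumed names, not the course code): describing a column of a row-major 2D array, so that a single send moves the whole non-contiguous column. New datatypes must be committed with MPI_Type_commit before use and can be released with MPI_Type_free.

/* Hypothetical helper: send column 'col' of an nrows x ncols row-major array
   of doubles to rank 'dest'. */
void send_column(const double *a, int nrows, int ncols, int col,
                 int dest, int tag, MPI_Comm comm)
{
    MPI_Datatype coltype;
    /* nrows blocks of 1 double each, successive blocks ncols doubles apart */
    MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);
    MPI_Send(&a[col], 1, coltype, dest, tag, comm);
    MPI_Type_free(&coltype);
}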
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 51 / 69
Datatype Constructors (Cont.)
MPI_Type_create_hvector
int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride,
MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_Type_create_indexed_block
int MPI_Type_create_indexed_block(int count, int blocklength, const int array_of_displacements[],
MPI_Datatype oldtype, MPI_Datatype *newtype)
Creates an indexed data type with the same block length for all blocks.
Useful for retrieving irregular subsets of data from a single array.
▶ blocklength = 2
▶ array_of_displacements = {0, 5, 8, 13, 18}
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 52 / 69
Datatype Constructors (Cont.)
MPI_Type_indexed
int MPI_Type_indexed(int count, const int array_of_blocklengths[],
const int array_of_displacements[], MPI_Datatype oldtype,
MPI_Datatype *newtype)
Creates an indexed datatype, where each block can contain a different number of oldtype copies.
▶ array_of_blocklengths = {1, 1, 2, 1, 2, 1}
▶ array_of_displacements = {0, 3, 5, 9, 13, 17}
MPI_Type_create_struct
int MPI_Type_create_struct(int count, int array_of_blocklengths[],
const MPI_Aint array_of_displacements[], const MPI_Datatype array_of_types[],
MPI_Datatype *newtype)
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 53 / 69
Datatype Constructors (Cont.)
MPI_Type_create_subarray
int MPI_Type_create_subarray(int ndims, const int array_of_sizes[],
const int array_of_subsizes[], const int array_of_starts[],
int order, MPI_Datatype oldtype, MPI_Datatype *newtype)
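A sketch of a typical use (assumed sizes and names): a datatype describing the interior of a 2D array padded with guard cells, handy for I/O or for sending the interior in one call.

/* Hypothetical helper: build a datatype for the n x n interior of an
   (n+2*ng) x (n+2*ng) C-ordered array of doubles, offset by ng in each
   dimension. */
MPI_Datatype make_interior_type(int n, int ng)
{
    int sizes[2]    = { n + 2*ng, n + 2*ng };  /* full array, incl. guard cells */
    int subsizes[2] = { n, n };                /* interior only                 */
    int starts[2]   = { ng, ng };              /* where the interior begins     */
    MPI_Datatype interior;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &interior);
    MPI_Type_commit(&interior);
    return interior;
}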
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 54 / 69
Application Topology
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 55 / 69
Introduction to Application Topology
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 56 / 69
Process Mapping
Good choice of mapping depends on the details of the underlying hardware.
Only the vendors know the best way to fit the application topologies into the machine topology.
They optimize through the implementation of MPI topology functions.
MPI does not provide the programmer any control over these mappings.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 58 / 69
Graph and Cartesian Topologies
MPI has the task of deciding how to assign processes to each part of the decomposed domain.
MPI provides the service of handling assignment of processes to regions. It provides two types of
topology routines to address the needs of different data topological layouts:
Cartesian Topology
It is a decomposition of the application processes in the natural coordinate directions, for example,
along x and y directions.
Graph Topology
It is the type of virtual topology that allows general relationships between processes, where processes
are represented by nodes of a graph.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 59 / 69
MPI Cartesian Topology Functions
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 60 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_create
int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[],
const int periods[], int reorder, MPI_Comm *comm_cart)
It returns a handle to a new communicator to which the Cartesian topology information is attached.
If reorder = false then the rank of each process in the new group is identical to its rank in the old
group.
Otherwise it may reorder to choose a good embedding of the virtual topology onto the physical
machine.
▶ comm_old: handle to input communicator.
▶ ndims: number of dimensions of Cartesian grid.
▶ dims: integer array of size ndims specifying the number of
processes in each dimension.
▶ periods: logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each
dimension.
If the total size of the Cartesian grid is smaller than the size of the group of comm, then some
processes are returned MPI_COMM_NULL.
The call is erroneous if it specifies a grid that is larger than the group size.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 61 / 69
MPI Cartesian Topology Functions (Cont.)
This code snippet uses MPI_Cart_create to remap the process ranks from a linear ordering (0, 1, . . . , 5) to a 2-dimensional array of 3 rows by 2 columns ((0,0), (0,1), . . . , (2,1)).
We are able to assign work to the processes by their grid topology instead of their linear process rank.
We imposed periodicity on the first dimension. This means any reference beyond the first or last entry of the columns cycles back to the last and first entry, respectively.
Any reference to a column index outside the range returns MPI_PROC_NULL.

MPI_Cart_create (code snippet)
#include "mpi.h"
MPI_Comm old_comm, new_comm;
int ndims, reorder, periods[2], dim_size[2];
old_comm = MPI_COMM_WORLD;
ndims = 2;            /* 2-D matrix/grid */
dim_size[0] = 3;      /* rows */
dim_size[1] = 2;      /* columns */
periods[0] = 1;       /* row periodic (each column forms a ring) */
periods[1] = 0;       /* columns nonperiodic */
reorder = 1;          /* allows processes reordered for efficiency */
MPI_Cart_create(old_comm, ndims, dim_size, periods, reorder, &new_comm);
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 62 / 69
MPI Cartesian Topology Functions (Cont.)
Messages are still sent to and received from process’s ranks.
MPI provides routines to map or convert ranks to cartesian coordinates and vice-versa:
MPI_Cart_coords
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int coords[])
MPI_Cart_rank
int MPI_Cart_rank(MPI_Comm comm, int coords[], int *rank)
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 63 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_coords (code snippet)
MPI_Cart_create(old_comm, ndims, dim_size, periods, reorder, &new_comm);
if (rank == 0) {  /* only want to do this on one process */
    for (rank = 0; rank < p; rank++) {
        MPI_Cart_coords(new_comm, rank, ndims, coords);
        printf("%d, %d, %d\n", rank, coords[0], coords[1]);
    }
}
MPI_Cart_get
int MPI_Cart_get(MPI_Comm comm, int maxdims, int dims[], int periods[],
int coords[])
It retrieves information from a communicator with Cartesian topology:
▶ maxdims: Length of vectors dims, periods, and coords in the calling program.
▶ dims: Number of processes for each Cartesian dimension.
▶ periods: Periodicity for each Cartesian dimension.
▶ coords: Coordinates of the calling process in Cartesian structure.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 66 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_shift
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
int *rank_source, int *rank_dest)
It returns the shifted source and destination ranks, given a shift direction and amount.
▶ direction: Coordinate dimension of shift, i.e., the coordinate whose value is modified by the shift.
▶ disp: Displacement ( > 0: upward shift, < 0: downward shift).
▶ rank_source: Rank of source process.
▶ rank_dest: Rank of destination process.
An MPI_Sendrecv operation is likely to be used along a coordinate direction to perform a shift of data.
▶ As input, it takes the rank of a source process for the receive, and the rank of a destination process for
the send.
▶ MPI_Cart_shift provides MPI_Sendrecv with the above identifiers.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 67 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_shift (code snippet)
/* Create Cartesian topology for processes */
ndim = 2; /* number of dimensions */
dims[0] = nrow; /* number of rows */
dims[1] = mcol; /* number of columns */
period[0] = 1; /* cyclic in this direction */
period[1] = 0; /* not cyclic in this direction */
MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, period, reorder,
&comm2D);
MPI_Comm_rank(comm2D, &me);
MPI_Cart_coords(comm2D, me, ndim, &coords);
displ = 1; /* shift by 1 */
index = 0; /* shift along the 1st index (out of 2) */
MPI_Cart_shift(comm2D, index, displ, &source0, &dest0);
index = 1; /* shift along the 2nd index (out of 2) */
MPI_Cart_shift(comm2D, index, displ, &source1, &dest1);
MPI_Cart_shift is used to obtain the source and destination rank numbers of the calling process.
There are two calls to MPI_Cart_shift, the first shifting along columns, and the second along rows.
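A hedged continuation of the snippet above (sendbuf, recvbuf and count are assumed names): the (source, destination) pair returned by MPI_Cart_shift plugs directly into MPI_Sendrecv to shift data along that dimension.

/* Shift 'count' doubles along the first Cartesian dimension. */
MPI_Status status;
MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, dest0,   0,
             recvbuf, count, MPI_DOUBLE, source0, 0,
             comm2D, &status);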
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 68 / 69
Conclusion
Recap
MPI Basics Review
Scientific MPI Example: 2D Diffusion Equation
Derived Data Types
Application Topology
Good References
W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the
Message-Passing Interface. Third Edition. (MIT Press, 2014).
W. Gropp, T. Hoefler, R. Thakur, E. Lusk, Using Advanced MPI: Modern Features of the
Message-Passing Interface. (MIT Press, 2014).
A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition.
(Addison-Wesley, 2003) (A bit old but still reasonable)
The man pages for various MPI commands.
http://www.mpi-forum.org/docs/ for MPI standard specification.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 69 / 69