
Advanced Message Passing Interface (MPI)

Bruno C. Mundim
SciNet HPC Consortium

May 15, 2023

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 1 / 69
1. About This Workshop

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 2 / 69
What do you need for this workshop?

A computer with browser and internet connection to attend the lectures.
A Zoom client to connect to the lecture and office hours.
An ssh client to connect to the SciNet Teach cluster.
▶ Linux and macOS: Use the ssh command in the terminal.
▶ Windows: Use MobaXTerm, https://mobaxterm.mobatek.net.

Make sure you can log in to the website https://scinet.courses/1269 !

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 3 / 69
Workshop structure
MONDAY: A first online lecture over Zoom (you’re here!).
An assignment will be given during the course of the lecture.
You can ask questions:
▶ in the Zoom chat during and at the end of the lecture.
▶ in the student forum on the course site.
▶ and also during:
WEDNESDAY: Zoom office hours.
Submit a solution for the assignment on the course website (deadline is midnight Thursday).
Assignment submission and lecture participation are required to obtain credits towards the SciNet HPC Certificate.
FRIDAY: A last online lecture on Zoom that will address the solution, common mistakes, and wrap-up.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 4 / 69
Today’s Lecture Outline

MPI Basics Review


Scientific MPI Example: 2D Diffusion Equation
Teach Cluster Access and Assignment
Derived Data Types
Application Topology

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 5 / 69
2. MPI Basics Review

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 6 / 69
Distributed Memory: Clusters
Machine Architecture: Clusters, or distributed memory machines.
Parallel code: run on separate computers and communicate with each other.
Usual communication model: “message passing”.
Message Passing Interface (MPI): Open standard library interface for message passing, ratified by the MPI Forum.
MPI Implementations:
▶ OpenMPI www.open-mpi.org
  ⋆ SciNet clusters (Niagara or Teach): module load gcc openmpi
▶ MPICH2 www.mpich.org
  ⋆ Niagara: module load intel intelmpi
[Figure: four CPUs (CPU1–CPU4) connected by a network]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 7 / 69
MPI is a Library for Message Passing
Not built into the compiler.
Function calls that can be made from any compiler, many languages.
Just link to it.
Wrappers: mpicc, mpif90, mpicxx

C:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int rank, size, err;
    err = MPI_Init(&argc, &argv);
    err = MPI_Comm_size(MPI_COMM_WORLD, &size);
    err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello world from task %d of %d!\n", rank, size);
    err = MPI_Finalize();
}

Fortran:
program helloworld
    use mpi
    implicit none
    integer :: rank, commsize, err
    call MPI_Init(err)
    call MPI_Comm_size(MPI_COMM_WORLD, commsize, err)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, err)
    print *,'Hello world from task',rank,'of',commsize
    call MPI_Finalize(err)
end program helloworld

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 8 / 69
MPI is a Library for Message Passing
Communication/coordination between tasks is done by sending and receiving messages.
Each message involves a function call from each of the programs.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 9 / 69
MPI is a Library for Message Passing
Three basic sets of functionality:
▶ Pairwise communications via messages
▶ Collective operations via messages
▶ Efficient routines for getting data from memory into messages and vice versa
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 10 / 69
Messages

Messages have a sender and a receiver.
When you are sending a message, you don’t need to specify the sender (it’s the current processor).
A sent message has to be actively received by the receiving process.
[Figure: a message of count elements of MPI_SOMETYPE, with a tag, sent from CPU1 to CPU2]

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 11 / 69
Messages

MPI messages are a string of length count, all of some fixed MPI type.
MPI types exist for characters, integers, floating point numbers, etc.
An arbitrary non-negative integer tag is also included – it helps keep things straight if lots of messages are sent.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 12 / 69
Communicators
MPI groups processes into communicators.
Each communicator has some size – the number of tasks.
Every task has a rank 0..size-1.
Every task in your program belongs to MPI_COMM_WORLD.
[Figure: MPI_COMM_WORLD with size = 4, ranks = 0..3]

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 13 / 69
Communicators

One can create one’s own communicators over the same tasks.
May break the tasks up into subgroups.
May just re-order them for some reason.
[Figure: MPI_COMM_WORLD (size=4, ranks=0..3) and a new communicator new_comm (size=3, ranks=0..2) defined over three of the tasks]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 14 / 69
MPI Communicator Basics

Communicator Components
MPI_COMM_WORLD:
Global Communicator
MPI_Comm_rank(MPI_COMM_WORLD,&rank)
Get current task’s rank
MPI_Comm_size(MPI_COMM_WORLD,&size)
Get communicator size

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 15 / 69
Different versions of SEND

MPI_Ssend: Standard synchronous send
▶ guaranteed to be synchronous.
▶ routine will not return until the receiver has “picked up”.
MPI_Bsend: Buffered send
▶ guaranteed to be asynchronous.
▶ routine returns before the message is delivered.
▶ system copies data into a buffer and sends it in due course.
▶ can fail if the buffer is full.
MPI_Send (standard send)
▶ may be implemented as a synchronous or asynchronous send.
▶ causes a lot of confusion.
In this class, stick with MPI_Ssend for clarity and robustness.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 16 / 69
Send and Receive

C
MPI_Status status;
err = MPI_Ssend(sendptr, count, MPI_TYPE, destination, tag, Communicator);
err = MPI_Recv(rcvptr, count, MPI_TYPE, source, tag, Communicator, &status);

Fortran
integer status(MPI_STATUS_SIZE)
call MPI_SSEND(sendarr, count, MPI_TYPE, destination, tag, Communicator, err)
call MPI_RECV(rcvarr, count, MPI_TYPE, source, tag, Communicator, status, err)
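
As a concrete illustration, here is a minimal C sketch (not taken from the course code): rank 0 sends ten doubles to rank 1 with MPI_Ssend, and rank 1 receives them with MPI_Recv; the tag (7 here) just has to match on both sides.

/* Sketch: blocking exchange between ranks 0 and 1 (assumes at least 2 ranks). */
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int rank, i;
    double buf[10];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        for (i = 0; i < 10; i++) buf[i] = i;
        MPI_Ssend(buf, 10, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD);         /* dest=1, tag=7 */
    } else if (rank == 1) {
        MPI_Recv(buf, 10, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD, &status); /* source=0, tag=7 */
        printf("Rank 1 received %g ... %g\n", buf[0], buf[9]);
    }
    MPI_Finalize();
    return 0;
}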

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 17 / 69
MPI: Sendrecv

err = MPI_Sendrecv(sendptr, count, MPI_TYPE, destination, tag,
                   recvptr, count, MPI_TYPE, source, tag, Communicator, MPI_Status)

A blocking send and receive built together.
Let them happen simultaneously.
Can automatically pair send/recvs.
Why 2 sets of tags/types/counts?
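
Because the send half and the receive half take separate arguments, each half can use a different partner, buffer, count, or type. A minimal sketch (not from the slides) of a ring shift, where every rank sends its rank number to the right and receives its left neighbour’s:

/* Sketch: ring shift with a single MPI_Sendrecv per rank (no deadlock). */
int rank, size, left, right, recvd;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
right = (rank + 1) % size;
left  = (rank - 1 + size) % size;
MPI_Sendrecv(&rank,  1, MPI_INT, right, 0,   /* send my rank to the right */
             &recvd, 1, MPI_INT, left,  0,   /* receive from the left     */
             MPI_COMM_WORLD, &status);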

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 18 / 69
MPI Non-Blocking Functions: MPI_Isend, MPI_Irecv

Returns immediately, posting a request to the system to initiate communication.
However, the communication is not completed yet.
Cannot tamper with the memory provided in these calls until the communication is completed.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 19 / 69
Nonblocking Sends

Allows you to get work done while the message is in flight.
Must not alter the send buffer until the send has completed.
C:
MPI_Isend(void *buf,int count,MPI_Datatype datatype,int dest,int tag,MPI_Comm comm,MPI_Request *request)

FORTRAN:
MPI_ISEND(BUF,INTEGER COUNT,INTEGER DATATYPE,INTEGER DEST,INTEGER TAG, INTEGER COMM, INTEGER REQUEST,
INTEGER ERROR)

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 20 / 69
MPI: Non-Blocking Isend & Irecv

err = MPI_Isend(sendptr, count, MPI_TYPE, destination, tag, Communicator, MPI_Request)


err = MPI_Irecv(rcvptr, count, MPI_TYPE, source, tag, Communicator, MPI_Request)

sendptr/rcvptr: pointer to message


count: number of elements in ptr
MPI_TYPE: one of MPI_DOUBLE, MPI_FLOAT, MPI_INT, MPI_CHAR, etc.
destination/source: rank of sender/receiver
tag: unique id for message pair
Communicator: MPI_COMM_WORLD or user created
MPI_Request: Identify comm operations
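
A minimal usage sketch (assuming left and right already hold valid neighbour ranks): post the send and the receive, do unrelated work, then wait before touching either buffer.

/* Sketch: non-blocking exchange; buffers must not be touched until MPI_Waitall. */
MPI_Request reqs[2];
MPI_Status  stats[2];
double sendval = 3.14, recvval;
MPI_Isend(&sendval, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(&recvval, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);
/* ... do work that does not read or write sendval/recvval ... */
MPI_Waitall(2, reqs, stats);   /* both operations complete; buffers safe to reuse */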

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 21 / 69
MPI Collectives
Reduction:
▶ Works for a variety of operations (+,*,min,max)
▶ For example, to calculate the min/mean/max of numbers across the cluster.

err = MPI_Allreduce(sendptr, rcvptr, count, MPI_TYPE, MPI_Op, Communicator);


err = MPI_Reduce(sendbuf, recvbuf, count, MPI_TYPE, MPI_Op, root, Communicator);

sendptr/rcvptr: pointers to buffers


count: number of elements in ptrs
MPI_TYPE: one of MPI_DOUBLE, MPI_FLOAT, MPI_INT, MPI_CHAR, etc.
MPI_Op: one of MPI_SUM, MPI_PROD, MPI_MIN, MPI_MAX.
Communicator: MPI_COMM_WORLD or user created.
All variants send result back to all processes; non-All sends to process root.
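
For example, a minimal sketch (not from the course code) that computes the global min, mean, and max of one number per rank; every rank gets the same result because the All variant is used.

/* Sketch: every rank ends up with the same global min/mean/max. */
int rank, size;
double x, xmin, xmax, xsum, xmean;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
x = (double)rank;   /* stand-in for a locally computed value */
MPI_Allreduce(&x, &xmin, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
MPI_Allreduce(&x, &xmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
MPI_Allreduce(&x, &xsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
xmean = xsum / size;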

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 22 / 69
Collective Operations
Collective
Reductions are an example of a collective operation.
As opposed to the pairwise messages we’ve seen before.
All processes in the communicator must participate.
Cannot proceed until all have participated.
Don’t necessarily know what’s ‘under the hood’.

Other MPI Collectives

Broadcast, Scatter, Gather, File I/O
Barriers (don’t!)
All-to-all . . .
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 23 / 69
3. Scientific MPI Example

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 24 / 69
Scientific MPI Example

Consider a diffusion equation with an explicit finite-difference, time-marching method.


Imagine the problem is too large to fit in the memory of one node, so we need to do domain
decomposition, and use MPI.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 25 / 69
Discretizing Derivatives

Partial Differential Equations like the diffusion equation

    ∂T/∂t = D ∂²T/∂x²

are usually numerically solved by finite differencing the discretized values:

    ∂²T/∂x² ≈ (T_{i+1} − 2 T_i + T_{i−1}) / ∆x²

Implicitly or explicitly this involves interpolating the data and taking the derivative of the interpolant.
Larger ‘stencils’ → more accuracy.
[Stencil figure: points i−2, i−1, i, i+1, i+2, with weights +1, −2, +1 on i−1, i, i+1]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 26 / 69
Diffusion equation in higher dimensions
Spatial grid separation: ∆x. Time step ∆t.
Grid indices: i, j. Time step index: (n)

1D

    ∂T/∂t |_i ≈ (T_i^(n) − T_i^(n−1)) / ∆t
    ∂²T/∂x² |_i ≈ (T_{i−1}^(n) − 2 T_i^(n) + T_{i+1}^(n)) / ∆x²

    [stencil: +1 −2 +1]

2D

    ∂T/∂t |_{i,j} ≈ (T_{i,j}^(n) − T_{i,j}^(n−1)) / ∆t
    (∂²T/∂x² + ∂²T/∂y²) |_{i,j} ≈ (T_{i−1,j}^(n) + T_{i,j−1}^(n) − 4 T_{i,j}^(n) + T_{i+1,j}^(n) + T_{i,j+1}^(n)) / ∆x²

    [stencil: +1 above, left, right, below; −4 at the centre]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 27 / 69
Stencils and Boundaries
How do you deal with boundaries?
The stencil juts out; you need info on cells beyond those you’re updating.
Common solution: with ng guard cells (here ng = 1), loop from i = ng . . . N − 2ng.
Guard cells:
▶ Pad the domain with these guard cells so that the stencil works even for the first point in the domain.
▶ Fill the guard cells with values such that the required boundary conditions are met.
[Figure: a 1D array with cells 0..6 and ng = 1 guard cells at each end; in 2D, the domain is padded by a ring of guard cells]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 28 / 69
Domain decomposition
A very common approach to
parallelizing on distributed
memory computers.
Subdivide the domain into
contiguous subdomains.
Give each subdomain to a
different MPI process.
No process contains the full
data!
Maintains locality.
Need mostly local data, i.e., only data at the boundary of each subdomain will need to be sent between processes.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 29 / 69
Guard cell exchange

In the domain decomposition, the stencils will jut out into a neighbouring subdomain.
Much like the boundary condition.
One uses guard cells for domain decomposition too.
If we manage to fill the guard cells with values from the neighbouring domains, we can treat each coupled subdomain as an isolated domain with changing boundary conditions.
Could use the even/odd trick, or MPI_Sendrecv.
[Figure: two neighbouring subdomains, with local cells 0..6 and 5..11, overlapping in the guard cell region]
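
A sketch of such a guard cell exchange with MPI_Sendrecv, using hypothetical names: T is the local 1D array with one guard cell at each end (T[0] and T[n+1]), and left/right are the neighbour ranks (MPI_PROC_NULL at a physical boundary, which turns that side of the exchange into a no-op).

/* Sketch: fill guard cells T[0] and T[n+1] from the neighbouring subdomains. */
MPI_Sendrecv(&T[n],   1, MPI_DOUBLE, right, 0,   /* send my rightmost real cell right */
             &T[0],   1, MPI_DOUBLE, left,  0,   /* receive my left guard cell        */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(&T[1],   1, MPI_DOUBLE, left,  1,   /* send my leftmost real cell left   */
             &T[n+1], 1, MPI_DOUBLE, right, 1,   /* receive my right guard cell       */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);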

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 30 / 69
Diffusion: Had to wait for communications to compute

Could not compute the end points without guard cell data.
All work halted while all communications occurred.
Significant parallel overhead.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 31 / 69
Diffusion: Had to wait?

But the inner zones could have been computed just fine.
Ideally, we would do the inner-zone work while communication is being done, then go back and do the end points.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 32 / 69
Blocking Communication/Computation Pattern
We have the following sequence of communication and computation:
The code exchanges guard cells using Sendrecv.
The code then computes the next step.
The code exchanges guard cells using Sendrecv again.
etc.
We can do better.
[Diagram: alternating Sendrecv and Computation phases on every process]
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 33 / 69
Non-Blocking Communication/Computation Pattern
The code starts a send of its guard cells using ISend.
Without waiting for that send’s completion, the code computes the next step for the inner cells (while the guard cell message is in flight).
The code then receives the guard cells using IRecv.
Afterwards, it computes the outer cells’ new values.
Repeat.
[Diagram: ISend, inner Computation, IRecv, outer Computation phases repeating on every process]
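
In code, one time step of this pattern might look like the following 1D sketch, with hypothetical names: T and Tnew are local arrays with one guard cell at each end, n is the number of real cells, left/right the neighbour ranks, comm the communicator, and D, dt, dx the diffusion coefficient and step sizes.

/* Sketch: overlap the guard-cell exchange with the inner-cell update. */
MPI_Request reqs[4];
MPI_Isend(&T[1],   1, MPI_DOUBLE, left,  0, comm, &reqs[0]);  /* my left edge out    */
MPI_Isend(&T[n],   1, MPI_DOUBLE, right, 1, comm, &reqs[1]);  /* my right edge out   */
MPI_Irecv(&T[0],   1, MPI_DOUBLE, left,  1, comm, &reqs[2]);  /* left guard cell in  */
MPI_Irecv(&T[n+1], 1, MPI_DOUBLE, right, 0, comm, &reqs[3]);  /* right guard cell in */
for (int i = 2; i <= n-1; i++)                 /* inner cells need no guard data */
    Tnew[i] = T[i] + D*dt/(dx*dx) * (T[i+1] - 2*T[i] + T[i-1]);
MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);     /* guard cells have now arrived   */
Tnew[1] = T[1] + D*dt/(dx*dx) * (T[2]   - 2*T[1] + T[0]);     /* outer cells last */
Tnew[n] = T[n] + D*dt/(dx*dx) * (T[n+1] - 2*T[n] + T[n-1]);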

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 34 / 69
2D diffusion with MPI
How to divide the work in a 2D grid?
[Figure: two possible domain decompositions of the 2D grid, a 2D block decomposition (left) and a 1D strip decomposition (right), each padded with guard cells]
2D block decomposition:
▶ Less communication (18 edges).
▶ Harder to program: non-contiguous data to send left, right, up and down.
1D strip decomposition:
▶ Easier to code, similar to 1D, but with contiguous guard cells to send up and down.
▶ More communication (30 edges).
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 35 / 69
Let’s look at the easiest domain decomposition.
[Figure: the serial 2D domain, and the parallel version (P = 3) split into three horizontal strips, each padded with a row of guard cells]
Communication pattern:
Copy the upper stripe to the upper neighbour’s bottom guard cell.
Copy the lower stripe to the lower neighbour’s top guard cell.
Contiguous cells: can use count in MPI_Sendrecv.
Similar to 1D diffusion.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 36 / 69
4. Teach Cluster Access and Assignment

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 37 / 69
Access to SciNet’s Teach supercomputer
SciNet’s Teach supercomputer is part of the old GPC system (42 nodes) that has been repurposed for education and training in general, and in particular for many of the summer school sessions.
Log into the Teach login node, teach01, with your Compute Canada account credentials or your lcl_uothpc383sNNNN temporary account.

$ ssh -Y [email protected]
$ cd $SCRATCH
$ cp -r /scinet/course/mpi/advanced-mpi .
$ cd advanced-mpi
$ source setup

Running computations
On most supercomputers, a scheduler governs the allocation of resources.
This means submitting a job with a jobscript.
srun: a command that is a resource request + job running command all in one, and will run the command on one (or more) of the available resources.
We have set aside 34 nodes with 16 cores for this class, so occasionally, only in very busy sessions, you may have to wait for someone else’s srun command to finish.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 38 / 69
Assignment: 2D Diffusion
2D diffusion equation serial code:

$ cd $SCRATCH/advanced-mpi/diffusion2d
$ # source ../setup
$ make diffusion2dc
$ ./diffusion2dc

2D diffusion equation parallel code:

$ make diffusion2dc-mpi-nonblocking
$ # or srun
$ mpirun -np 4 ./diffusion2dc-mpi-nonblocking

Part I: Use MPI derived datatypes instead of packing and unpacking the data manually.
cp diffusion2dc-mpi-nonblocking.c diffusion2dc-mpi-nonblocking-datatype.c
Build with make diffusion2dc-mpi-nonblocking-datatype
Test on 4..9 processors

Part II: Use MPI Cartesian topology routines to map the 2D Cartesian grid of the diffusion equation domain onto a 2D layout of processes.
Get rid of the manually done mapping.
cp diffusion2dc-mpi-nonblocking-datatype.c diffusion2dc-mpi-nonblocking-carttopo.c
Build with make diffusion2dc-mpi-nonblocking-carttopo

Tips
Switch off graphics (in Makefile, change USEPGPLOT=-DPGPLOT to USEPGPLOT=).
Get familiar with the serial code in 2D and review the 1D one, if needed.
If you get stuck debugging, try to decrease the problem size and the number of steps.
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 39 / 69
5. Derived Datatypes

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 40 / 69
Motivation
Every message is associated with a datatype.
All MPI data movement functions move data in some count units of some datatype.
Portability: specifying the length of a message as a given count of occurrences of a given datatype is more portable than using a length in bytes, since the lengths of given types may vary from one machine to another.
So far our messages correspond to contiguous regions of memory: a count of the basic MPI datatypes such as MPI_INT or MPI_DOUBLE was sufficient to describe our messages.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 41 / 69
Motivation

Derived datatypes allow us to specify noncontiguous areas of memory, such as a column of an array stored rowwise.
A new datatype might describe, for example, a group of elements that are separated by a constant amount in memory, a stride.
Derived datatypes allow arbitrary data layouts to be serialized into message streams.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 42 / 69
Basic Datatypes for Fortran
MPI provides a rich set of predefined datatypes.
All basic datatypes in C and Fortran.
Two datatypes specific to MPI:
▶ MPI_BYTE: Refers to a byte defined as eight binary digits.
▶ MPI_PACKED: Rather than create a new datatype, just assemble a contiguous buffer to be sent.
Why not use char as bytes?
▶ A char is usually represented as a byte by implementations, but this is not required. For example, a C implementation for Japanese might choose 16-bit chars.
▶ Machines might have different character sets in a heterogeneous environment.

MPI Datatype              Fortran Datatype
MPI_BYTE
MPI_CHARACTER             CHARACTER
MPI_COMPLEX               COMPLEX
MPI_DOUBLE_PRECISION      DOUBLE PRECISION
MPI_INTEGER               INTEGER
MPI_LOGICAL               LOGICAL
MPI_PACKED
MPI_REAL                  REAL

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 43 / 69
Basic Datatypes for C

MPI Datatype C Datatype MPI Datatype C Datatype


MPI_CHAR signed char MPI_UNSIGNED_CHAR unsigned char
MPI_FLOAT float MPI_UNSIGNED_SHORT unsigned short
MPI_DOUBLE double MPI_UNSIGNED unsigned int
MPI_LONG_DOUBLE long double MPI_UNSIGNED_LONG unsigned long
MPI_WCHAR wchar_t MPI_UNSIGNED_LONG_LONG unsigned long long
MPI_SHORT short MPI_C_COMPLEX float _Complex
MPI_INT int MPI_C_DOUBLE_COMPLEX double _Complex
MPI_LONG long MPI_C_LONG_DOUBLE_COMPLEX long double _Complex
MPI_LONG_LONG_INT long long MPI_PACKED
MPI_SIGNED_CHAR signed char MPI_BYTE

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 44 / 69
Datatype Concepts

Basic definitions:
▶ A datatype is an object consisting of a sequence of the basic datatypes and displacements, in bytes, of each of these datatypes.
▶ Displacements in bytes are relative to the buffer the datatype describes.
How does MPI describe a general datatype?
▶ MPI represents a datatype as a sequence of pairs of basic types and displacements, a typemap:

    Typemap = {(type_0, disp_0), ..., (type_{n−1}, disp_{n−1})}

▶ For example, type MPI_INT is represented by (int, 0).
▶ Displacements tell MPI where to find the bits.
Type signature: the list of the basic datatypes in a datatype:

    Typesignature = {type_0, ..., type_{n−1}}

▶ It controls how data items are interpreted when sent or received.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 45 / 69
Datatype Concepts (Cont.)

Lower bound (lb) is the location of the first byte described by the datatype:

    lb(Typemap) = min_j(disp_j)

Upper bound (ub) is the location of the last byte described by the datatype:

    ub(Typemap) = max_j(disp_j + sizeof(type_j)) + pad

▶ Where the sizeof operator returns the size of the basic datatype in bytes.
Extent is the difference between these two bounds:

    extent(Typemap) = ub(Typemap) − lb(Typemap)

▶ ub is possibly increased by pad to satisfy alignment requirements.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 46 / 69
Data Alignment
Both C and Fortran require that the basic datatypes be properly aligned:
▶ The locations of an integer or a double-precision value occur only where allowed.
▶ Each implementation of these languages defines what is allowed.
▶ Most common: the address of an item in bytes is a multiple of the length of that item in bytes.
▶ For example, if an int takes four bytes, then the address of an int must be evenly divisible by four.
The data alignment requirement is reflected in the definition of the extent of an MPI datatype.

Example of a typemap on a computer that requires ints to be aligned on 4-byte boundaries:

    {(int, 0), (char, 4)}
    lb = min(0, 4) = 0
    ub = max(0 + 4, 4 + 1) + pad = 5 + 3 = 8

▶ The next int can only be placed with displacement eight from the int in the typemap, so the pad in this case is three.
▶ Therefore, this typemap’s extent on this computer is eight.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 47 / 69
Datatype Information
MPI routines to retrieve information about MPI datatypes:
MPI_Type_get_extent
int MPI_Type_get_extent(MPI_Datatype datatype, MPI_Aint *lb, MPI_Aint *extent)

Get the lower bound and extent for a datatype:


▶ datatype: handle on datatype to get information on.
▶ lb: lower bound returned and stored as MPI_Aint, an integer type that can hold an arbitrary address.
▶ extent: the returned extent of the datatype. Previous example extent was 8 bytes.

MPI_Type_size
int MPI_Type_size(MPI_Datatype datatype, int *size)

Get the number of bytes or the size of a datatype:


▶ datatype: handle on datatype to get information on.
▶ size: datatype size in bytes. Previous example size was 5 bytes.
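
A small usage sketch (newtype stands for any datatype built earlier, e.g. one with the {(int,0),(char,4)} typemap above):

/* Sketch: query the size and extent of a datatype. */
MPI_Aint lb, extent;
int size;
MPI_Type_get_extent(newtype, &lb, &extent);
MPI_Type_size(newtype, &size);
printf("size = %d bytes, lb = %ld, extent = %ld\n",
       size, (long)lb, (long)extent);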

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 48 / 69
Datatype Constructors
Problem: A typemap is a general way of describing an arbitrary datatype, but it is not convenient for a large number of entries.
Solution: MPI provides different ways to create datatypes without explicitly constructing the typemap:
▶ Contiguous: Produces a new datatype by making count copies of an old one, with displacements incremented by the extent of the oldtype.
▶ Vector: Like contiguous, but allows for regular gaps in the displacements. Elements are separated by multiples of the extent of the input datatype.
▶ Hvector: Like vector, but elements are separated by a number of bytes.
More sophisticated constructors:
▶ Indexed: An array of displacements is provided. Displacements are measured in terms of the extent of the input datatype.
▶ Hindexed: Like indexed, but displacements are measured in bytes.
▶ Struct: Fully general. The input is the typemap, if the inputs are basic MPI datatypes.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 49 / 69
Datatype Constructors (Cont.)
MPI_Type_contiguous
int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)

Simplest datatype constructor, which allows replication of an oldtype datatype into contiguous locations:
▶ count: replication count (nonnegative integer).
▶ oldtype: old datatype handle.
▶ newtype: new datatype handle.
Example: if original datatype (oldtype) has typemap:

{(int, 0), (double, 8)}

then:
MPI_Type_contiguous(2, oldtype, &newtype);

produces a datatype newtype with typemap:

{(int, 0), (double, 8), (int, 16), (double, 24)}

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 50 / 69
Datatype Constructors (Cont.)

MPI_Type_vector
int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

Allows replication of an oldtype datatype into equally spaced blocks. Each block is obtained by concatenating blocklength copies of the old datatype:
▶ count: number of blocks (nonnegative integer).
▶ blocklength: number of elements in each block (nonnegative integer).
▶ stride: number of elements between start of each block.
Very useful for Cartesian arrays.
Example: if original datatype (oldtype) has typemap: (double, 0) with extent 8, then:
MPI_Type_vector(3, 2, 4, oldtype, &newtype);

produces a datatype newtype with extent ((3 − 1) × 4 + 2) × 8 = 80 bytes and typemap:

{(double, 0), (double, 8), (double, 32), (double, 40), (double, 64), (double, 72)}
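
This is exactly what is needed to send a column of a row-major C array. A sketch with hypothetical names (double A[N][M], column index j, destination rank dest):

/* Sketch: a datatype describing one column of an N x M row-major array. */
MPI_Datatype column;
MPI_Type_vector(N, 1, M, MPI_DOUBLE, &column);   /* N blocks of 1 element, stride M */
MPI_Type_commit(&column);
MPI_Ssend(&A[0][j], 1, column, dest, 0, MPI_COMM_WORLD);   /* send column j */
MPI_Type_free(&column);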

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 51 / 69
Datatype Constructors (Cont.)
MPI_Type_create_hvector
int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride,
MPI_Datatype oldtype, MPI_Datatype *newtype)

Creates a vector (strided) data type with offset in bytes.


Useful for composition, for example, vector of structs.

MPI_Type_create_indexed_block
int MPI_Type_create_indexed_block(int count, int blocklength, const int array_of_displacements[],
MPI_Datatype oldtype, MPI_Datatype *newtype)

Creates an indexed data type with the same block length for all blocks.
Useful for retrieving irregular subsets of data from a single array.
▶ blocklength = 2
▶ array_of_displacements = {0, 5, 8, 13, 18}

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 52 / 69
Datatype Constructors (Cont.)
MPI_Type_indexed
int MPI_Type_indexed(int count, const int array_of_blocklengths[],
const int array_of_displacements[], MPI_Datatype oldtype,
MPI_Datatype *newtype)

Creates an indexed datatype, where each block can contain a different number of oldtype copies.
▶ array_of_blocklengths = {1, 1, 2, 1, 2, 1}
▶ array_of_displacements = {0, 3, 5, 9, 13, 17}

MPI_Type_create_struct
int MPI_Type_create_struct(int count, int array_of_blocklengths[],
const MPI_Aint array_of_displacements[], const MPI_Datatype array_of_types[],
MPI_Datatype *newtype)

Creates a structured data type.


Useful for retrieving virtually any data layout in memory.
▶ array_of_blocklengths = {1, 1, 2, 1, 2, 1}
▶ array_of_displacements = {0, 3, 5, 9, 13, 17}
▶ array_of_types = { MPI_INT, MPI_DOUBLE, MPI_INT, MPI_INT, MPI_INT }
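
A sketch of building a struct datatype for a hypothetical C struct, using MPI_Get_address to obtain portable displacements:

/* Sketch: an MPI datatype matching struct particle { int id; double x[3]; }. */
struct particle { int id; double x[3]; } p;
MPI_Datatype particle_type;
int          blocklens[2] = {1, 3};
MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
MPI_Aint     displs[2], base;
MPI_Get_address(&p,      &base);
MPI_Get_address(&p.id,   &displs[0]);
MPI_Get_address(&p.x[0], &displs[1]);
displs[0] -= base;  displs[1] -= base;   /* displacements relative to the struct start */
MPI_Type_create_struct(2, blocklens, displs, types, &particle_type);
MPI_Type_commit(&particle_type);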

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 53 / 69
Datatype Constructors (Cont.)
MPI_Type_create_subarray
int MPI_Type_create_subarray(int ndims, const int array_of_sizes[],
const int array_of_subsizes[], const int array_of_starts[],
int order, MPI_Datatype oldtype, MPI_Datatype *newtype)

Creates a data type describing an n-dimensional subarray of an n-dimensional array.

Create, Commit, Duplicate, and Free


Once a type is created with one of the constructors above, it must be committed to the system
before use.
▶ Use MPI_Type_commit.
▶ This allows the system to hopefully perform heavy optimizations.
MPI_Type_dup
▶ Duplicates a type.
MPI_Type_free
▶ Free MPI resources for datatypes.
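
Putting it together, a sketch (with hypothetical local sizes nx, ny and guard cell width ng) that describes the interior of a local 2D array padded with guard cells, commits it, and frees it when done:

/* Sketch: subarray describing the interior of a (nx+2ng) x (ny+2ng) local array. */
MPI_Datatype interior;
int sizes[2]    = {nx + 2*ng, ny + 2*ng};  /* full local array, including guard cells */
int subsizes[2] = {nx, ny};                /* the interior region                     */
int starts[2]   = {ng, ng};                /* where the interior begins               */
MPI_Type_create_subarray(2, sizes, subsizes, starts,
                         MPI_ORDER_C, MPI_DOUBLE, &interior);
MPI_Type_commit(&interior);
/* ... use 'interior' in sends, receives, or file I/O ... */
MPI_Type_free(&interior);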

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 54 / 69
6. Application Topology

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 55 / 69
Introduction to Application Topology

MPI offers a facility, called process topology, to attach information about the communication relationships between processes to a communicator.
The programmer specifies the topology once during setup and then reuses it in different parts of the code.
A user-specified topology matches the application’s communication patterns.

Definitions
The topology of the computer, or interconnection network, is the description of how the processes in a parallel computer are connected to one another.
A virtual or application topology is the pattern of communication amongst the processes.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 56 / 69
Process Mapping
Good choice of mapping depends on the details of the underlying hardware.
Only the vendor knows the best way to fit application topologies into the machine topology.
They optimize through the implementation of MPI topology functions.
MPI does not provide the programmer any control over these mappings.

Different ways to map a set of processes to a 2-dimensional grid.


Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 57 / 69
Example: Matrix Partitioning

A 2-dimensional rank numbering system provides a clearer representation of submatrix relationships.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 58 / 69
Graph and Cartesian Topologies

MPI has the task of deciding how to assign processes to each part of the decomposed domain.
MPI provides the service of handling assignment of processes to regions. It provides two types of
topology routines to address the needs of different data topological layouts:
Cartesian Topology
It is a decomposition of the application processes in the natural coordinate directions, for example,
along x and y directions.

Graph Topology
It is the type of virtual topology that allows general relationships between processes, where processes
are represented by nodes of a graph.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 59 / 69
MPI Cartesian Topology Functions

MPI provides routines for dealing with Cartesian topologies:
MPI_Cart_create: Creates a Cartesian topology.
MPI_Cart_coords: Determines process coordinates in a Cartesian topology given the rank in the group.
MPI_Cart_rank: Determines the process rank in the communicator given the Cartesian location.
MPI_Cart_get: Retrieves Cartesian topology information associated with a communicator.
MPI_Cartdim_get: Retrieves the number of dimensions of the Cartesian topology associated with a communicator.
MPI_Cart_shift: Returns the shifted source and destination ranks, given a shift direction and amount.
MPI_Cart_sub: Partitions a communicator into subgroups, which form lower-dimensional Cartesian subgrids.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 60 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_create
int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[],
const int periods[], int reorder, MPI_Comm *comm_cart)

It returns a handle to a new communicator to which the Cartesian topology information is attached.
If reorder = false then the rank of each process in the new group is identical to its rank in the old
group.
Otherwise it may reorder to choose a good embedding of the virtual topology onto the physical
machine.
▶ comm_old: handle to input communicator.
▶ ndims: number of dimensions of Cartesian grid.
▶ dims: integer array of size ndims specifying the number of
processes in each dimension.
▶ periods: logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each
dimension.
If the total size of the Cartesian grid is smaller than the size of the group of comm, then some
processes are returned MPI_COMM_NULL.
The call is erroneous if it specifies a grid that is larger than the group size.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 61 / 69
MPI Cartesian Topology Functions (Cont.)
This code snippet uses MPI_Cart_create to remap the process ranks from a linear ordering (0, 1, . . . , 5) to a 2-dimensional array of 3 rows by 2 columns ((0,0), (0,1), . . . , (2,1)).
We are able to assign work to the processes by their grid topology instead of their linear process rank.
We imposed periodicity on the first dimension: any reference beyond the first or last entry of a column cycles back to the last or first entry, respectively.
Any reference to a column index outside the range returns MPI_PROC_NULL.

MPI_Cart_create (code snippet)
#include "mpi.h"
MPI_Comm old_comm, new_comm;
int ndims, reorder, periods[2], dim_size[2];
old_comm = MPI_COMM_WORLD;
ndims = 2;       /* 2-D matrix/grid */
dim_size[0] = 3; /* rows */
dim_size[1] = 2; /* columns */
periods[0] = 1;  /* row periodic (each column forms a ring) */
periods[1] = 0;  /* columns nonperiodic */
reorder = 1;     /* allows processes reordered for efficiency */
MPI_Cart_create(old_comm, ndims, dim_size, periods, reorder, &new_comm);

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 62 / 69
MPI Cartesian Topology Functions (Cont.)
Messages are still sent to and received from process’s ranks.
MPI provides routines to map or convert ranks to cartesian coordinates and vice-versa:
MPI_Cart_coords
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int coords[])

It provides a mapping of ranks to Cartesian coordinates.


▶ rank: rank of a process within group of comm.
▶ maxdims: length of vector coords in the calling program.
▶ coords: Integer array of size ndims (defined by MPI_Cart_create call) containing the Cartesian
coordinates of specified process.

MPI_Cart_rank
int MPI_Cart_rank(MPI_Comm comm, int coords[], int *rank)

It translates the logical process coordinates to process ranks.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 63 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_coords (code snippet)
MPI_Cart_create(old_comm, ndims, dim_size, periods, reorder, &new_comm);
if(rank == 0) { /* only want to do this on one process */
for (rank=0; rank<p; rank++) {
MPI_Cart_coords(new_comm, rank, ndims, &coords);
printf("%d, %d, %d\n ",rank, coords[0], coords[1]);
}
}

MPI_Cart_rank (code snippet)


MPI_Cart_create(old_comm, ndims, dim_size, periods, reorder, &new_comm);
if(rank == 0) { /* only want to do this on one process */
for (i=0; i<nv; i++) {
for (j=0; j<mv; j++) {
coords[0] = i;
coords[1] = j;
MPI_Cart_rank(new_comm, coords, &rank);
printf("%d, %d, %d\n",coords[0],coords[1],rank);
}
}
}
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 64 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_sub
int MPI_Cart_sub(MPI_Comm comm, const int remain_dims[], MPI_Comm *comm_new)

It partitions a communicator into subgroups, which form lower-dimensional Cartesian subgrids.


It builds for each subgroup a communicator with the associated subgrid Cartesian topology.
▶ remain_dims: logical vector indicating if the ith dimension corresponding to the ith entry of
remain_dims, is kept in the subgrid (true) or is dropped (false).

MPI_Cart_sub (code snippet)


MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, period, reorder, &comm2D);
MPI_Comm_rank(comm2D, &id2D);
MPI_Cart_coords(comm2D, id2D, ndim, coords2D);
/* Create 1D row subgrids */
belongs[0] = 0;
belongs[1] = 1; /* this dimension belongs to subgrid */
MPI_Cart_sub(comm2D, belongs, &commrow);
/* Create 1D column subgrids */
belongs[0] = 1; /* this dimension belongs to subgrid */
belongs[1] = 0;
MPI_Cart_sub(comm2D, belongs, &commcol);
Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 65 / 69
MPI Cartesian Topology Functions (Cont.)
It is common in large programs to create the Cartesian topology and its associated communicator in one routine and use it in another.
The following two functions help retrieve information about these communicators:
MPI_Cartdim_get
int MPI_Cartdim_get(MPI_Comm comm, int *ndims)

It retrieves the number of dimensions from a communicator with Cartesian structure.

MPI_Cart_get
int MPI_Cart_get(MPI_Comm comm, int maxdims, int dims[], int periods[],
int coords[])
It retrieves information from a communicator with Cartesian topology:
▶ maxdims: Length of vectors dims, periods, and coords in the calling program.
▶ dims: Number of processes for each Cartesian dimension.
▶ periods: Periodicity for each Cartesian dimension.
▶ coords: Coordinates of the calling process in Cartesian structure.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 66 / 69
MPI Cartesian Topology Functions (Cont.)

MPI_Cart_shift
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
int *rank_source, int *rank_dest)

It returns the shifted source and destination ranks, given a shift direction and amount.
▶ direction: Coordinate dimension of shift, i.e., the coordinate whose value is modified by the shift.
▶ disp: Displacement ( > 0: upward shift, < 0: downward shift).
▶ rank_source: Rank of source process.
▶ rank_dest: Rank of destination process.
An MPI_Sendrecv operation is likely to be used along a coordinate direction to perform a shift of data.
▶ As input, it takes the rank of a source process for the receive, and the rank of a destination process for
the send.
▶ MPI_Cart_shift provides MPI_Sendrecv with the above identifiers.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 67 / 69
MPI Cartesian Topology Functions (Cont.)
MPI_Cart_shift (code snippet)
/* Create Cartesian topology for processes */
ndim = 2; /* number of dimensions */
dims[0] = nrow; /* number of rows */
dims[1] = mcol; /* number of columns */
period[0] = 1; /* cyclic in this direction */
period[1] = 0; /* not cyclic in this direction */
MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, period, reorder,
&comm2D);
MPI_Comm_rank(comm2D, &me);
MPI_Cart_coords(comm2D, me, ndim, &coords);
displ = 1; /* shift by 1 */
index = 0; /* shift along the 1st index (out of 2) */
MPI_Cart_shift(comm2D, index, displ, &source0, &dest0);
index = 1; /* shift along the 2nd index (out of 2) */
MPI_Cart_shift(comm2D, index, displ, &source1, &dest1);

MPI_Cart_shift is used to obtain the source and destination rank numbers of the calling process.
There are two calls to MPI_Cart_shift, the first shifting along columns, and the second along rows.
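
A sketch of how the returned ranks might then be used: exchange one guard row with the neighbours found by the first shift (toprow and bottomguard are hypothetical buffers of mcol doubles). If a shift runs off a non-periodic edge, MPI_Cart_shift returns MPI_PROC_NULL and that half of the Sendrecv is simply a no-op.

/* Sketch: guard-cell exchange using the ranks returned by MPI_Cart_shift. */
MPI_Sendrecv(toprow,      mcol, MPI_DOUBLE, dest0,   0,  /* send one row to dest0            */
             bottomguard, mcol, MPI_DOUBLE, source0, 0,  /* receive the matching guard row   */
             comm2D, MPI_STATUS_IGNORE);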

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 68 / 69
Conclusion
Recap
MPI Basics Review
Scientific MPI Example: 2D Diffusion Equation
Derived Data Types
Application Topology

Good References
W. Gropp, E. Lusk, and A. Skjellun, Using MPI: Portable Parallel Programming with the
Message-Passing Interface. Third Edition. (MIT Press, 2014).
W. Gropp, T. Hoefler, R. Thakur, E. Lusk, Using Advanced MPI: Modern Features of the
Message-Passing Interface. (MIT Press, 2014).
A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition.
(Addison-Wesley, 2003) (A bit old but still reasonable)
The man pages for various MPI commands.
http://www.mpi-forum.org/docs/ for the MPI standard specification.

Bruno C. Mundim (SciNet HPC Consortium) Advanced Message Passing Interface (MPI) May 15, 2023 69 / 69
