P2P Communication
Lecture 4
Jan 15, 2025
MPI Program Execution
[Figure: four compute nodes (host1, host2, host3, host4), each a node/compute node/host/system/machine with its own memory; communication is intranode within a host and internode across hosts]
mpiexec -n 8 -hosts host1,host2,host3,host4 ./exe
Execution on Beowulf/Unmanaged Cluster
[Figure: host1, host2, host3, host4 - how much load is there on each node?]
Network and Load-Aware Resource Manager for MPI Programs
https://dl.acm.org/doi/10.1145/3409390.3409406
MPI Program Execution - hostfile
[Figure: the same four-node diagram (host1-host4, memory, intranode/internode communication), with the hosts named by a hostfile]
hostfile:
cn023
cn024
cn025
cn026
mpiexec -n 8 -hosts host1,host2,host3,host4 ./exe
Homework
Analyze the communication endpoints for the optimized parallel sum algorithm under round-robin and sequential placement of 8 and 16 processes.
Process Placement - Parallel Sum (Optimized)
Sequential placement:   host1: 0, 1   host2: 2, 3   host3: 4, 5   host4: 6, 7
Round-robin placement:  host1: 0, 4   host2: 1, 5   host3: 2, 6   host4: 3, 7
Parallel Sum (Optimized) on 4 Processes
Communication step 1: 1 -> 0, 3 -> 2
Communication step 2: 2 -> 0
Sequential placement:   host1: 0, 1   host2: 2, 3
Round-robin placement:  host1: 0, 2   host2: 1, 3
Number of Hops on 4 Processes
Communication step 1: 1 -> 0, 3 -> 2
Communication step 2: 2 -> 0

Sequential placement (host1: 0, 1; host2: 2, 3)
  Communication step 1 #hops: 0, 0   Max: 0
  Communication step 2 #hops: 1      Max: 1
  Sum of per-step maxima: 1

Round-robin placement (host1: 0, 2; host2: 1, 3)
  Communication step 1 #hops: 1, 1   Max: 1
  Communication step 2 #hops: 0      Max: 0
  Sum of per-step maxima: 1
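These per-step counts can also be generated mechanically. The sketch below is not from the slides; it assumes the optimized parallel sum is the binomial-tree reduction shown above (step 1: 1 -> 0, 3 -> 2; step 2: 2 -> 0) and charges one hop to any message that crosses hosts, with placement modelled by the two illustrative functions host_sequential and host_roundrobin:

#include <stdio.h>

/* Host of a rank under sequential (block) placement: ppn consecutive ranks per host. */
static int host_sequential(int rank, int ppn) { return rank / ppn; }

/* Host of a rank under round-robin (cyclic) placement over nhosts hosts. */
static int host_roundrobin(int rank, int nhosts) { return rank % nhosts; }

/* Binomial-tree reduction: in each step (half = 2^(step-1)), every rank that is an
   odd multiple of half sends to the rank half below it. One hop is charged whenever
   sender and receiver sit on different hosts. */
static void analyze(const char *name, int np, int nhosts, int ppn, int roundrobin)
{
    int step = 0, sum_of_max = 0;
    printf("%s\n", name);
    for (int half = 1; half < np; half *= 2) {
        int max_hops = 0;
        printf("  step %d:", ++step);
        for (int src = half; src < np; src += 2 * half) {
            int dst = src - half;
            int hsrc = roundrobin ? host_roundrobin(src, nhosts) : host_sequential(src, ppn);
            int hdst = roundrobin ? host_roundrobin(dst, nhosts) : host_sequential(dst, ppn);
            int hops = (hsrc != hdst) ? 1 : 0;
            if (hops > max_hops) max_hops = hops;
            printf("  %d -> %d (%d hops)", src, dst, hops);
        }
        printf("   max: %d\n", max_hops);
        sum_of_max += max_hops;
    }
    printf("  sum of per-step maxima: %d\n", sum_of_max);
}

int main(void)
{
    /* 4 processes on 2 hosts with 2 processes per node, as on this slide;
       np, nhosts, and ppn can be changed for other configurations. */
    analyze("Sequential placement", 4, 2, 2, 0);
    analyze("Round-robin placement", 4, 2, 2, 1);
    return 0;
}

For 4 processes on 2 hosts this reproduces the numbers above: per-step maxima 0, 1 for sequential placement and 1, 0 for round-robin, with a sum of 1 in both cases.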
Homework: Analyze #Hops
Analyze the number of hops on host1-host4 for np = 8 and 16 with ppn = 4, for sequential placement vs. round-robin placement.
CSE Lab Beowulf Cluster
• ~ 30 nodes connected via Ethernet
• Each node has 12/8/4 cores
• Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
• NFS filesystem
• Your home directories are NFS-mounted on all nodes
• Log in with your CSE credentials to any machine (IP address range 172.27.19.1 - 172.27.19.30)
• Some machines may be unreachable or unusable; if so, try another IP
MPI Installation on CSE Cluster
Install MPICH 4.2.3 (https://www.mpich.org/static/downloads/4.2.3/) in your home directory (from any node)
• Download mpich-4.2.3.tar.gz
• Follow the installation instructions from https://www.mpich.org/static/downloads/4.2.3/mpich-4.2.3-installguide.pdf
• DO NOT use /tmp
• If mpirun is already installed locally on the system, do not use that node to install (check using `which mpirun`)
• Verify after installation that `which mpirun` from any node points to your installation
MPI Installation – BYO Cluster
Install MPICH 4.2.3 (https://www.mpich.org/static/downloads/4.2.3/) in a directory that has the same path on all your systems of interest
• Download mpich-4.2.3.tar.gz
• Follow the installation instructions from https://www.mpich.org/static/downloads/4.2.3/mpich-4.2.3-installguide.pdf
• Use the same installation path on all systems (e.g. /home/test)
• Verify after installation that `which mpirun` from any node points to your installation
• Create a user name and enable passwordless ssh (ssh-keygen)
CSE Lab Cluster
• Enable passwordless ssh (ssh-keygen)
• You should then be able to ssh csewsX (from any csews*) without a password
• for i in `seq 1 20`; do ssh csews$i uptime ; done
• Answer "yes" to the "Are you sure you want to continue connecting?" prompt
MPI Reference Material
• Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra, MPI - The Complete Reference, Second Edition, Volume 1: The MPI Core.
• William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, 3rd Ed., MIT Press, Cambridge, MA, 2014.
• https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report.pdf
P2P/Direct Communication
Blocking send and receive
SENDER: MPI_Send
int MPI_Send (const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

RECEIVER: MPI_Recv
int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Tags should match
MPI_Send Parameters
buf: initial address of send buffer (choice)
count: number of elements in send buffer (non-negative integer)
datatype: datatype of each send buffer element (handle)
dest: rank of destination (integer)
tag: message tag (integer)
comm: communicator (handle)
https://www.mpich.org/static/docs/latest/www3/MPI_Send.html
MPI Data Types
• MPI_BYTE
• MPI_CHAR
• MPI_INT
• MPI_FLOAT
• MPI_DOUBLE
Example
int MPI_Send (const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Send 1 INT from rank 0 to rank 1:

// Initialization
if (myrank == 0)
    MPI_Send (buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
else if (myrank == 1)
    MPI_Recv (buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
Code
[Full code listing shown on slide]
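The full listing is not reproduced in these notes. A minimal complete program consistent with the example on the previous slide might look like the following sketch (the payload value and the final print are illustrative, not taken from the slide):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, buf = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);                      // Initialization
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        buf = 42;                                /* illustrative payload */
        MPI_Send(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d\n", buf);
    }

    MPI_Finalize();
    return 0;
}

It is compiled with mpicc and launched with mpiexec/mpirun as shown on the next slide.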
Executing MPI programs
• Check `which mpicc` and `which mpiexec`
• Update the PATH environment variable if needed
• Compile
  • mpicc -o filename filename.c
• Run
  • mpiexec -np 4 -f hostfile filename [often fails with a "No such file" error]
  • mpiexec -np 4 -f hostfile ./filename
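The hostfile passed with -f is a plain text file listing one node name per line, for example (using the node names from the hostfile slide earlier):

cn023
cn024
cn025
cn026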
Simple Send/Recv Code (sendmessage.c)
[Code and output shown on slide: no runtime or compile-time error]
Runtime error
[Code and output shown on slide]
Message Size
Sender: message (13 bytes)    Receiver: message buffer (10 bytes)

Fatal error in MPI_Recv: Message truncated, error stack:
MPI_Recv(200)...........................: MPI_Recv(buf=0x7ffccc37c610, count=10, MPI_CHAR, src=0, tag=99, MPI_COMM_WORLD, status=0x7ffccc37c5d0) failed
MPIDI_CH3_PktHandler_EagerShortSend(363): Message from rank 0 and tag 99 truncated; 13 bytes received but buffer size is 10
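A sketch of code that would produce this error, reconstructed from the error message rather than taken from the slide (the message text is assumed to be the "Hello, there" string, 12 characters plus the terminating null byte = 13 bytes, that appears in the sendmessage.c output on a later slide):

#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        char msg[] = "Hello, there";             /* 12 chars + '\0' = 13 bytes */
        MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        char buf[10];                            /* receive buffer smaller than the message */
        /* count = 10 < 13 incoming bytes, so MPI_Recv fails with a truncation error */
        MPI_Recv(buf, 10, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}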
[Code and output shown on slide: no runtime or compile-time error]
Simple Send/Recv Code (sendmessage.c)
Output: received : Hello, there
Output
0 7 0
Received: Welcome
1 0 7
Output for 4 Processes
0 7 0
Received: Welcome
1 0 7
2 0 0
3 0 0
mpirun -np 2 ./send
0 12 0
Multiple Sends and Receives
if (myrank == 0)
    MPI_Send (buf, count, MPI_INT, 1, 1, MPI_COMM_WORLD),
    MPI_Send (buf, count, MPI_INT, 1, 2, MPI_COMM_WORLD);
else if (myrank == 1)
    MPI_Recv (buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, &status),
    MPI_Recv (buf, count, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
printf ("%d %d\n", myrank, count);

$ mpirun -np 2 ./send 10
0 10
1 10
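The slides show only this fragment. A complete program it could plausibly come from is sketched below; the MPI setup, buffer allocation, and reading count from the command line (as in ./send 10 above) are assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, count;
    int *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    count = (argc > 1) ? atoi(argv[1]) : 10;     /* e.g. ./send 10 */
    buf = malloc(count * sizeof(int));

    if (myrank == 0)
        MPI_Send (buf, count, MPI_INT, 1, 1, MPI_COMM_WORLD),
        MPI_Send (buf, count, MPI_INT, 1, 2, MPI_COMM_WORLD);
    else if (myrank == 1)
        MPI_Recv (buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, &status),
        MPI_Recv (buf, count, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    printf ("%d %d\n", myrank, count);

    free(buf);
    MPI_Finalize();
    return 0;
}

Ranks other than 0 and 1 fall through both branches and execute only the printf, which matches the 4-process output on the next slide.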
Multiple Sends and Receives
if (myrank == 0)
    MPI_Send (buf, count, MPI_INT, 1, 1, MPI_COMM_WORLD),
    MPI_Send (buf, count, MPI_INT, 1, 2, MPI_COMM_WORLD);
else if (myrank == 1)
    MPI_Recv (buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, &status),
    MPI_Recv (buf, count, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
printf ("%d %d\n", myrank, count);

$ mpirun -np 4 ./send 10
0 10
1 10
2 10
3 10
Multiple Sends and Receives
if (myrank == 0)
    MPI_Send (buf, count, MPI_INT, 1, 1, MPI_COMM_WORLD),
    MPI_Send (buf, count, MPI_INT, 1, 2, MPI_COMM_WORLD);
else if (myrank == 1)
    MPI_Recv (buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, &status),
    MPI_Recv (buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
printf ("%d %d\n", myrank, count);

$ mpirun -np 2 ./send 10
0 10
(Rank 1 never prints: its second MPI_Recv expects a tag-1 message, but only a tag-2 message remains, so the receive blocks and the run hangs.)
Send and Receive
if (myrank == 0)
    MPI_Send (buf, count, MPI_INT, 1, 1, MPI_COMM_WORLD);
else if (myrank == 1)
    MPI_Recv (buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
printf ("%d %d\n", myrank, count);

$ mpirun -np 2 ./send 10
0 10
1 10
Multiple Sends and Receives
if (myrank == 0)
    MPI_Send (buf, count, MPI_INT, 1, 1, MPI_COMM_WORLD),
    MPI_Send (buf, count, MPI_INT, 1, 2, MPI_COMM_WORLD);
else if (myrank == 1)
    MPI_Recv (buf, count, MPI_INT, 0, 2, MPI_COMM_WORLD, &status),
    MPI_Recv (buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
printf ("%d %d\n", myrank, count);

$ mpirun -np 2 ./send 10
0 10
1 10
(Note: the receives are posted in the reverse order of the send tags, yet the run still completes for this small message size; the next slides explain why.)
MPI_Send (Blocking, Standard Mode)
• Does not return until the send buffer can be reused
• Message buffering can affect this
• Implementation-dependent
[Figure: sender and receiver]
Buffering
[Figure: buffering illustration; source: Cray presentation]
Multiple Sends and Receives
if (myrank == 0)
    MPI_Send (buf, count, MPI_INT, 1, 1, MPI_COMM_WORLD),
    MPI_Send (buf, count, MPI_INT, 1, 2, MPI_COMM_WORLD);
else if (myrank == 1)
    MPI_Recv (buf, count, MPI_INT, 0, 2, MPI_COMM_WORLD, &status),
    MPI_Recv (buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
printf ("%d %d\n", myrank, count);

$ mpirun -np 2 ./send 1000000
(No output: with a 1000000-element message the sends can no longer complete without a matching receive, so rank 0 blocks in the tag-1 send while rank 1 waits for a tag-2 message that is never sent, and the run hangs. See eager vs. rendezvous on the next slide.)
Eager vs. Rendezvous Protocol
• Eager
  • Send completes without acknowledgement from the destination
  • MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE (check the output of mpivars)
  • Small messages, typically 128 KB (at least in MPICH)
• Rendezvous
  • Requires an acknowledgement from a matching receive
  • Large messages
MPI_Status
int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

• Source rank
• Message tag
• Message length
• MPI_Get_count

typedef struct _MPI_Status {
    int count;
    int cancelled;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
} MPI_Status, *PMPI_Status;
MPI_Get_count (status.c)
[Code shown on slide uses status.MPI_SOURCE and status.MPI_TAG]
Output: Rank 1 of 2 received 100 elements from 0
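The status.c listing is not reproduced here. A sketch consistent with the annotations and the output above (the element count of 100 and the tag value are assumptions based on that output) could be:

#include <stdio.h>
#include <mpi.h>

#define N 100   /* matches the "received 100 elements" output; assumed */

int main(int argc, char *argv[])
{
    int myrank, size, buf[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (myrank == 0) {
        for (int i = 0; i < N; i++) buf[i] = i;
        MPI_Send(buf, N, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        int nrecvd;
        MPI_Recv(buf, N, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
        /* MPI_Get_count reports how many elements of the given datatype arrived;
           status.MPI_SOURCE and status.MPI_TAG identify the matched message. */
        MPI_Get_count(&status, MPI_INT, &nrecvd);
        printf("Rank %d of %d received %d elements from %d\n",
               myrank, size, nrecvd, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}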
Communication - Message Passing
[Figure: message passing between Process 0 and Process 1]
Timing Send/Recv (timingSend.c)
[Code listing shown on slide]
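The timingSend.c listing is not reproduced here. A sketch of how such a blocking send/receive is typically timed with MPI_Wtime (message size and output format are assumptions):

#include <stdio.h>
#include <mpi.h>

#define COUNT 1000   /* message size in ints; assumed */

int main(int argc, char *argv[])
{
    int myrank, buf[COUNT] = {0};
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_Barrier(MPI_COMM_WORLD);   /* start both ranks from roughly the same point */
    t0 = MPI_Wtime();

    if (myrank == 0)
        MPI_Send(buf, COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (myrank == 1)
        MPI_Recv(buf, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

    t1 = MPI_Wtime();
    /* Each rank reports only its own elapsed time; the next slide asks how the
       total time of the exchange should be determined from these per-rank times. */
    printf("Rank %d: %f seconds\n", myrank, t1 - t0);

    MPI_Finalize();
    return 0;
}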
Timing Output
What is the
total time?
42