CS424. Parallel Computing Lab#4
1 MPI Communications
In MPI (Message Passing Interface), communication plays a crucial role in coordinating work among independent
processes. MPI processes are independent, and explicit communication is necessary for coordination. There are two
types of communication within MPI:
1. Point-to-Point Communication: involves sending and receiving messages between two specific processes
within the same communicator. It allows processes to exchange data directly with each other.
Types of Point-to-Point Operations:
- Blocking Send/Receive: the call does not return until it is safe to reuse the
message buffer (for a receive, until the message has arrived); depending on the message
size, a blocking send may or may not wait for the matching receive to start.
- Non-blocking Send/Receive: the call only initiates the transfer and returns
immediately; the process continues executing and completes the operation later
(e.g., with MPI_Wait). A short sketch of both styles is given after this list.
2. Collective Communication: involves multiple processes working together as a group. It enables coordinated
operations across all processes in a communicator. It uses collective MPI functions, such as:
- MPI_Scatter: Distributes data from one process to many processes.
- MPI_Reduce: Performs data reduction (e.g., sum, max) across all processes.
- MPI_Bcast: Broadcasts data from one process to all others.
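As a quick illustration of the two point-to-point styles in item 1 above, the following
minimal sketch (not one of the lab's provided codes; run it with at least two processes)
sends an integer from process 0 to process 1, once with blocking calls and once with
non-blocking calls.

#include <mpi.h>
#include <stdio.h>

int main(void) {
    int my_rank, value = 0;
    MPI_Request req;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Blocking exchange: MPI_Send returns once 'value' can safely be reused. */
    if (my_rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Blocking: process 1 received %d\n", value);
    }

    /* Non-blocking exchange: the call only starts the transfer;
       MPI_Wait completes it before the buffer is touched again. */
    if (my_rank == 0) {
        value = 99;
        MPI_Isend(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (my_rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("Non-blocking: process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}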
2 Examples
1. The following program implements a simple send/receive communication. Compile and run the program and
study the output of several runs.
Code 1
2. The following program demonstrates sending and receiving messages between two processes in a ping-pong
fashion. The communication only works when the communicator contains exactly two processes.
Note that:
Process 0: Sends a message to process 1 and then receives a message back.
Process 1: Receives a message from process 0, sends a response, and then receives another
message.
Compile and run the program and study the output of several runs.
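Before looking at Code 2 itself, the listing below is a minimal ping-pong sketch
consistent with the description above (one round trip between processes 0 and 1);
the provided Code 2 may differ in its details.

#include <mpi.h>
#include <stdio.h>

int main(void) {
    int my_rank, comm_sz, msg;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    if (comm_sz != 2) {
        if (my_rank == 0)
            printf("This program needs exactly 2 processes.\n");
    } else if (my_rank == 0) {
        msg = 1;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);                      /* ping */
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);   /* pong */
        printf("Process 0 received reply %d from process 1\n", msg);
    } else {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", msg);
        msg = 2;
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}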
Code 2
3 Practice
1. Revisit Code 1 and change the behavior of the program so process 0 can receive the results in the order in
which the processes finish.
3. Compile and run the code in mpiSumArray.c (Code folder), study the behavior of the program across several
runs, and write the proper code lines to address the comments marked as "// ** display…". Your output
should be similar to the example below.
4. Write an MPI program in C to calculate the sum of two predefined arrays of at least length 12. The program
should ensure that it only runs when the number of processes is four or more. Use parallel execution to
compute partial sums and then combine them to get the final result.
CS424. Parallel Computing Lab#5
1 Collective Communication
MPI provides the following collective communication functions:
1. MPI_Bcast: One process (usually the root) has a piece of data. This data is sent to all other processes in
the communicator. All processes will have the same data after the broadcast.
2. MPI_Reduce: Each process has a local value (e.g., individual score). A mathematical operation (like sum,
max, min) is applied to all the values sent by each process. The result of the operation is stored on the
designated process (usually the root).
3. MPI_Allreduce: Similar to MPI_Reduce, but it distributes the reduced result to all processes.
4. MPI_Scatter: One process (usually the root) has a large dataset. The data is divided (scattered) into
smaller chunks and sent to all processes. Each process receives a portion of the original data.
5. MPI_Gather: Each process has a piece of data. All processes send their data to a designated process
(usually the root). The root process gathers all the data into a single collection.
6. MPI_Allgather: Gathers data from all processes and distributes the combined result to all processes.
7. MPI_Barrier: All processes in the communicator must reach this point before any can proceed. Useful
for synchronization, ensuring all processes have finished a specific task before moving on.
Advantages of collective communication:
• Simplifies code compared to sending and receiving messages individually between processes.
• Ensures efficient data exchange and synchronization within the communicator group.
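As a small illustration (not one of the lab's provided codes), the sketch below sums
one value per process with MPI_Reduce and then shares the total with every process
using MPI_Bcast; with point-to-point calls the same pattern would need a loop of
sends and receives.

#include <mpi.h>
#include <stdio.h>

int main(void) {
    int my_rank, comm_sz, local_val, total = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    local_val = my_rank + 1;   /* each process contributes one value */

    /* rank 0 receives the sum of all contributions */
    MPI_Reduce(&local_val, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* rank 0 then broadcasts the result to everyone */
    MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d of %d: total = %d\n", my_rank, comm_sz, total);
    MPI_Finalize();
    return 0;
}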
Additional Notes:
To optimize code and potentially enhance performance, it is beneficial to explore which MPI functions can be
effectively substituted with other functions while maintaining the core functionality of the program.
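For instance, the MPI_Reduce followed by MPI_Bcast in the sketch above can normally be
replaced by a single MPI_Allreduce call, with the same result ending up on every process:

/* one collective instead of two: every process ends up with the sum */
MPI_Allreduce(&local_val, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);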
2 Examples
Code 1 distributes array values among processes using MPI_Scatter. Compile and run the program and study the
output of several runs.
Explanation:
1. Include headers: Necessary header files for MPI communication (mpi.h), standard input/output (stdio.h),
and memory allocation (stdlib.h).
2. MPI Initialization: MPI_Init initializes the MPI environment and prepares it for communication.
3. Get Process Information:
o MPI_Comm_rank: Retrieves the rank (unique identifier) of the current process within the
communicator group.
o MPI_Comm_size: Retrieves the total number of processes participating in the communicator group.
4. Global Array (Root Process Only):
o Allocates memory for the entire array data on the root process (rank == 0) using malloc.
o You can modify the loop to initialize the data array with your desired values.
5. Scattering Data:
o send_count: Calculates the number of elements to send to each process, considering potential
uneven division.
o remainder: Tracks any remaining elements after dividing the total by the number of processes.
o MPI_Scatter: Distributes elements of the data array on the root process to all processes, including
the root itself. With a plain MPI_Scatter every process receives exactly send_count elements, so any
remainder elements must be handled separately (for example with MPI_Scatterv or an extra send).
local_data is allocated to store the received portion.
6. Printing Received Data: Each process prints the elements it received using a loop.
7. Memory Deallocation: Frees the allocated memory for local_data on all processes and data on the root
process.
8. Finalize MPI: MPI_Finalize cleans up the MPI environment and releases resources.
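A condensed sketch of steps 1-8 above (assuming, for simplicity, that the total number
of elements divides evenly among the processes; the element count n = 16 is only an
example, and Code 1 itself additionally computes send_count and remainder for the
uneven case):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int my_rank, comm_sz, n = 16;          /* assumed total array length */
    int *data = NULL, *local_data;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    int send_count = n / comm_sz;          /* elements per process (no remainder here) */

    if (my_rank == 0) {                    /* root builds the full array */
        data = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) data[i] = i;
    }

    local_data = malloc(send_count * sizeof(int));
    MPI_Scatter(data, send_count, MPI_INT,
                local_data, send_count, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < send_count; i++)
        printf("Process %d received %d\n", my_rank, local_data[i]);

    free(local_data);
    if (my_rank == 0) free(data);
    MPI_Finalize();
    return 0;
}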
Running MPI programs on a single-core system
• The --oversubscribe parameter in mpiexec (or mpirun) allows you to launch more MPI processes on a
single node than the number of available cores or hardware threads.
• The behavior of --oversubscribe might vary depending on the specific MPI implementation and system
configuration.
• Some MPI libraries might offer alternative options for process placement that provide more control over
oversubscription.
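For example, with Open MPI the provided programs can be started with more processes
than cores like this (the executable name code1 is just a placeholder, and the exact
flag may differ in other MPI implementations):

mpiexec --oversubscribe -n 8 ./code1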
Code 1
3 Practice
1. In MPI, achieving a desired outcome often involves multiple communication pathways. List the potential
substitutions between MPI functions that can achieve similar communication patterns.
2. Recall your version of the code in mpiSumArray.c (Code folder; a sample run is shown below) and
answer the following questions.
3. Make the necessary changes to the MPI program in Code 1 to perform the following.
a) Instead of printing the values of local_data, multiply the values of local_data by rank+1.
b) Gather the values of the arrays after multiplication into process 1 and show the result.
c) Use MPI_Reduce to calculate the minimum value of the array (after gathering) and let process 2 display the
result.
MPI identifiers
1. MPI_Bcast (5 arguments): distributes one data item to all processes.
   MPI_Bcast(&data, count, datatype, source, MPI_COMM_WORLD);
2. MPI_Scatter (8 arguments): distributes pieces of an array to all processes.
   MPI_Scatter(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, src, MPI_COMM_WORLD);
3. MPI_Gather (8 arguments): collects the pieces of an array from all processes into one complete array on the destination process.
   MPI_Gather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, dest, MPI_COMM_WORLD);
4. MPI_Allgather (7 arguments): collects the pieces of an array from all processes into one complete array and distributes it to all processes.
   MPI_Allgather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, MPI_COMM_WORLD);
5. MPI_Type_create_struct (5 arguments): creates a custom datatype made up of several variables.
   MPI_Type_create_struct(count, array_of_blocklengths, array_of_displacements, array_of_types, &newtype);
6. MPI_Get_count (3 arguments): finds how many elements were actually received.
   MPI_Get_count(&status, datatype, &count);
MPI constants
Name         Purpose
MPI_DOUBLE   Data type: double-precision floating-point numbers
MPI_INT      Data type: integers
MPI_CHAR     Data type: characters
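As a small illustration of MPI_Get_count from the table above (a sketch, not part of
the lab codes; the function name recv_and_count and the sender rank src are made up
for the example), the fragment below checks how many integers actually arrived when
the message length is not known in advance:

#include <mpi.h>
#include <stdio.h>

void recv_and_count(int src) {
    int buf[100];
    int count;
    MPI_Status status;

    /* receive up to 100 ints; the sender may have sent fewer */
    MPI_Recv(buf, 100, MPI_INT, src, 0, MPI_COMM_WORLD, &status);

    /* ask how many elements were actually delivered */
    MPI_Get_count(&status, MPI_INT, &count);
    printf("Received %d integers from process %d\n", count, src);
}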
Objective
To learn the basics of MPI programs.
To learn MPI point-to-point communication.
Lab Activities
1. Using the “hello, world” program in the previous lab, please do the following:
a. Include the MPI header file mpi.h at the top of your program.
b. Include the MPI function MPI_Init(), which initializes the program with the
MPI environment, and MPI_Finalize(), which cleans up the environment before
the program ends.
c. Edit the "hello, world" message so that it prints "Hello, world from process
with rank %d out of %d processes.", using my_rank and comm_sz for the two
format values.
d. Declare my_rank and comm_sz as type integer (int).
e. Include MPI function MPI_Comm_size() which returns the size of the
communicator (i.e. total number of processes).
f. Include MPI function MPI_Comm_rank() that tells the rank or id of the
process.
2. Compile and run the "MPI-hello, world" program with 1, 2, 4 and 8
processes. Write down the output, note the differences and explain.
Here is a sample with 4 processes:
4. Modify the above program by having process 0 send the value of year iteratively,
starting from 2018 up to 2021, to 4 different processes (i.e., 2018 to process 1, 2019 to
process 2, etc.).
Here is a sample:
Exercises
1. Please answer the following questions:
(i) Name the header file that you need to run MPI programs. Briefly explain the
content of the header file.
(ii) What are the two MPI functions that must be included in every MPI program?
What are they for?
2. Compile and run Program 3.1 on page 85 with 1, 2 and 4 processes. Explain what
the program does.
3. Modify the program so that it does the reverse, that is process 0 sends the greeting
to the rest of the processes and each prints the greeting.
Note:
All provided samples were written using VS2017, which cannot run the MPI programs directly
from the IDE, so "cmd.exe" was used to execute and run the programs; with VS2010 there is
no need for the command prompt, and the programs can be run directly from the IDE.
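On a typical Linux installation of MPICH or Open MPI, the same programs can instead be
built and launched from a terminal, for example (the file and executable names here are
only placeholders):

mpicc hello_mpi.c -o hello_mpi
mpiexec -n 4 ./hello_mpi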
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define MAX_STRING 100

int main(void) {
    char greeting[MAX_STRING];
    int comm_sz;
    int my_rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank != 0) {
        /* every non-zero rank builds its greeting and sends it to process 0 */
        sprintf(greeting, "Hello, world from process with rank %d out of %d processes.\n",
                my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
    else {
        /* process 0 prints its own greeting, then the greetings it receives */
        printf("Hello, world from process with rank %d out of %d processes.\n",
               my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s", greeting);
        }
    }

    MPI_Finalize();
    //system("pause");
    return 0;
}
3.
#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <iostream>
int main(void) {
int comm_sz;
int my_rank;
int year = 0;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if (my_rank == 0) {
year = 2017;
for (int p = 1; p < comm_sz; p++) {
MPI_Send(&year, 1, MPI_INT, p, 0, MPI_COMM_WORLD);
}
}
else {
MPI_Recv(&year, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Process %d received %d from process 0\n", my_rank, year);
}
MPI_Finalize();
//system("pause");
return 0;
}
4.
#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <iostream>
int main(void) {
int comm_sz;
int my_rank;
int year = 2018;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if (my_rank == 0) {
//0 sends that year to each
for (int p = 1; p < comm_sz; p++) {
MPI_Send(&year, 1, MPI_INT, p, 0, MPI_COMM_WORLD);
year++;
}
}
else {
MPI_Recv(&year, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Process %d received %d from process 0\n", my_rank, year);
}
MPI_Finalize();
//system("pause");
return 0;
}
Objective
To learn MPI collective communication.
Lab Activities
1. Use MPI_Reduce() to sum up the process ranks of all the processes. The final
result should be in process 0.
Here is a sample with 4 processes:
2. In the same program, use MPI_Bcast() to send the result of the summation from
process 0 to all other processes.
Here is a sample with 4 processes:
Exercises
1. Modify the program in (5) so that process 0 reads the values into the vector, does
the summation, and prints the final results.
2. Examine the effect of multiple calls to MPI_Reduce() as presented in Table 3.3,
section 3.4.3 of Pacheco.
1:
#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <iostream>
int main(void) {
int comm_sz;
int my_rank;
int sum;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// we use MPI_Reduce instead of individual send/receive calls
MPI_Reduce(&my_rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (my_rank == 0) {
printf("Sum rank of %d processes is %d!\n", comm_sz, sum);
}
MPI_Finalize();
//system("pause");
return 0;
}
2.
#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <iostream>
int main(void) {
int comm_sz;
int my_rank;
int sum;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// sum the ranks on process 0 (as in listing 1) ...
MPI_Reduce(&my_rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
// ... then broadcast the result from process 0 to all other processes
MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("Process %d: sum of all ranks is %d\n", my_rank, sum);
MPI_Finalize();
//system("pause");
return 0;
}
3.
#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <iostream>
int main(void) {
int comm_sz;
int my_rank;
int sum;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Finalize();
//system("pause");
return 0;
}
4.
#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <iostream>
/*
MPI_Scatter, MPI_Gather (vector addition)
*/
int main(void) {
int my_rank, comm_sz, n = 8;
int local_n;
int x[8] = { 0, 0, 1, 1, 2, 2, 3, 3 };
int local_x[2];
int y[8] = { 0, 0, 1, 1, 2, 2, 3, 3 };
int local_y[2];
int local_z[2];
int z[8];
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
local_n = n / comm_sz;   // elements per process; local arrays are sized for 4 processes (local_n == 2)

// distribute x and y, add the local pieces, then collect the result in z on process 0
MPI_Scatter(x, local_n, MPI_INT, local_x, local_n, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Scatter(y, local_n, MPI_INT, local_y, local_n, MPI_INT, 0, MPI_COMM_WORLD);
for (int i = 0; i < local_n; i++)
    local_z[i] = local_x[i] + local_y[i];
MPI_Gather(local_z, local_n, MPI_INT, z, local_n, MPI_INT, 0, MPI_COMM_WORLD);

if (my_rank == 0) {
    for (int i = 0; i < n; i++)
        printf("%d: %d\n", i, z[i]);
}
MPI_Finalize();
//system("pause");
return 0;
}
//----------------------- serial (non-MPI) version for comparison
#include <stdio.h>
#include <string.h>
#include <iostream>
/*
matrix-vector multiplication
*/
void Mat_vect_mult(int A[] /*in*/, int x[] /*in*/, int y[] /*out*/,
                   int m /*in*/, int n /*in*/);
int main(void) {
int my_rank, comm_sz, m = 16, n = 4;
int A[16] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 5 };
int x[4] = { 0, 1, 1, 2 };
int y[4];
printf("A:\n");
for (int i = 0; i < m; i++) {
printf("%d:%d\n", i, A[i]);
}
printf("X:\n");
for (int i = 0; i < n; i++) {
printf("%d:%d\n", i, x[i]);
}
Mat_vect_mult(A, x, y, m, n);
printf("A*X:\n");
for (int i = 0; i < m / n; i++)
printf("%d: %d\n", i, y[i]);
//system("pause");
return 0;
}
// Serial matrix-vector multiplication
void Mat_vect_mult(int A[] /*in*/, int x[] /*in*/, int y[] /*out*/,
                   int m /*in*/, int n /*in*/) {
int i, j;
for (i = 0; i < m / n; i++) {
y[i] = 0;
for (j = 0; j < n; j++) {
y[i] += A[i * n + j] * x[j];
}
}
}
Lab Activities
1. Modify your sequential matrix multiplication from Lab 1 into a parallel matrix
multiplication.
2. Insert the necessary code to time the parallel matrix multiplication (please refer to
section 3.6.1; a timing sketch is given after this list).
3. Get the sequential run-time and parallel run-time for 2, 4, 8 and 16 processes for
varying matrix sizes as in Table 3.5 (Pacheco).
4. Calculate the speed up and the efficiency of the parallel program in (3) and plot a
graph for both.
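A minimal timing sketch in the style of section 3.6.1 (a fragment, assuming my_rank is
already set; the variable names are illustrative): every process times its own work,
and MPI_Reduce with MPI_MAX reports the slowest process, which is taken as the
parallel run-time.

double local_start, local_finish, local_elapsed, elapsed;

MPI_Barrier(MPI_COMM_WORLD);          /* start everyone at (roughly) the same time */
local_start = MPI_Wtime();

/* ... the parallel matrix multiplication being timed ... */

local_finish = MPI_Wtime();
local_elapsed = local_finish - local_start;

/* the parallel run-time is the time of the slowest process */
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (my_rank == 0)
    printf("Elapsed parallel time = %e seconds\n", elapsed);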
Exercises
1. Obtain the speedup and efficiency for the following applications using varying numbers of processes
and data sizes:
a. parallel trapezoid program (section 3.2.2).
b. parallel pi computation (section 4.4).
1:
#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <iostream>
/*
matrix-vector multiplication
*/
void Mat_vect_mult(int local_A[] /*in*/, int x[] /*in*/, int local_y[] /*out*/,
                   int local_m /*in*/, int n /*in*/, MPI_Comm comm /*in*/);
int main(void) {
int my_rank, comm_sz, m = 8/*number of rows*/, n = 4 /*number of columns*/,
local_m;
int A[32] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }; // m*n
int x[4] = { 0, 1, 3, 4 }; // n
int y[8]; // m
int local_A[8]; // local_m * n
int local_y[2]; // local_m
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
local_m = m / comm_sz;   // rows per process; the local arrays are sized for 4 processes (local_m == 2)

// distribute the rows of A, broadcast x, multiply locally, then gather y on process 0
MPI_Scatter(A, local_m * n, MPI_INT, local_A, local_m * n, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(x, n, MPI_INT, 0, MPI_COMM_WORLD);
Mat_vect_mult(local_A, x, local_y, local_m, n, MPI_COMM_WORLD);
MPI_Gather(local_y, local_m, MPI_INT, y, local_m, MPI_INT, 0, MPI_COMM_WORLD);

//printf("local_y:\n");
//for (int i = 0; i<local_m; i++)
// printf("%d: %d\n", i, local_y[i]);
if (my_rank == 0) {
printf("A:\n");
for (int i = 0; i<m*n; i++) {
printf("%d:%d\n", i, A[i]);
}
printf("X:\n");
for (int i = 0; i<n; i++) {
printf("%d:%d\n", i, x[i]);
}
printf("A*X:\n");
for (int i = 0; i<m; i++)
printf("%d: %d\n", i, y[i]);
}
MPI_Finalize();
//system("pause");
return 0;
}

// Each process multiplies its local_m rows of A by the (broadcast) vector x.
void Mat_vect_mult(int local_A[] /*in*/, int x[] /*in*/, int local_y[] /*out*/,
                   int local_m /*in*/, int n /*in*/, MPI_Comm comm /*in*/) {
    int i, j;
    for (i = 0; i < local_m; i++) {
        local_y[i] = 0;
        for (j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }
}
2:
#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <iostream>
/*
matrix-vector multiplication with timing
*/
void Mat_vect_mult(int local_A[] /*in*/, int x[] /*in*/, int local_y[] /*out*/, int
local_m /*in*/,
int n /*in*/, MPI_Comm comm /*in*/);
int main(void) {
int my_rank, comm_sz, m = 8 /*number of rows*/, n = 4 /*number of columns*/, local_m;
double local_start, local_finish, local_elapsed, elapsed;   // timing variables
int A[32] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }; // m*n
int x[4] = { 0, 1, 3, 4 }; // n
int y[8]; // m
int local_A[8]; // local_m * n
int local_y[2]; // local_m
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
local_m = m / comm_sz;   // rows per process; the local arrays are sized for 4 processes (local_m == 2)

local_start = MPI_Wtime();
// the work being timed: distribute A, broadcast x, multiply locally, gather y
MPI_Scatter(A, local_m * n, MPI_INT, local_A, local_m * n, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(x, n, MPI_INT, 0, MPI_COMM_WORLD);
Mat_vect_mult(local_A, x, local_y, local_m, n, MPI_COMM_WORLD);
MPI_Gather(local_y, local_m, MPI_INT, y, local_m, MPI_INT, 0, MPI_COMM_WORLD);
local_finish = MPI_Wtime();
local_elapsed = local_finish - local_start;

// the parallel run-time is the time of the slowest process
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

//printf("local_y:\n");
//for (int i = 0; i<local_m; i++)
// printf("%d: %d\n", i, local_y[i]);
if (my_rank == 0) {
printf("A:\n");
for (int i = 0; i<m*n; i++) {
printf("%d:%d\n", i, A[i]);
}
printf("X:\n");
for (int i = 0; i<n; i++) {
printf("%d:%d\n", i, x[i]);
}
printf("A*X:\n");
for (int i = 0; i<m; i++)
printf("%d: %d\n", i, y[i]);
MPI_Finalize();
//system("pause");
return 0;
}