Parallel Programming and MPI
A course for IIT-M. September 2008
R Badrinath, STSD Bangalore
([email protected])
Contents
Instead we will:
• Understand issues
• Understand concepts
• Learn enough to pick up the rest from the manual
• Go by motivating examples
• Try out some of the examples

MPI calls covered:
1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Comm_create
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather
…
Simple Parallel Program – sorting numbers in a large array A
• Notionally divide A into 5 pieces
[0..99;100..199;200..299;300..399;400..499].
• Each part is sorted by an independent sequential
algorithm and left within its region.
int main()
{
  int total_size, my_rank;
  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &total_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  /* ... each task sorts its own block of A; rank 0 then collects the pieces ... */
  MPI_Finalize();
}

[Figure residue from the Floyd's-algorithm slides – observation: for a fixed k, computing the i-th row needs only the i-th row and the k-th row; each task handles a block of roughly n/p rows.]
The MPI model
• Recall MPI tasks are typically created when the job is launched – not inside the MPI program (no forking).
− mpirun usually creates the task set
− mpirun -np 2 a.out <args to a.out>
− a.out is run on all nodes and a communication channel is set up between them
• Functions allow tasks to find out
− Size of the task group
− One's own position within the group
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
...
/* This is where all the real work happens */
...
MPI_Finalize(); /* Epilogue */
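As a hedged, minimal sketch of how these pieces fit together in a compilable program (the includes, the variable declarations and the printf are added here for illustration and are not from the slides):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int id, p;
    MPI_Init(&argc, &argv);               /* Prologue: set up the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &id);   /* One's own position within the group */
    MPI_Comm_size(MPI_COMM_WORLD, &p);    /* Size of the task group */
    printf("Task %d of %d\n", id, p);     /* This is where the real work would go */
    MPI_Finalize();                       /* Epilogue */
    return 0;
}

Launched with, e.g., mpirun -np 2 a.out, each of the two tasks prints its own rank.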
Visualizing the execution – multiple tasks/CPUs may be on the same node
• Task 0 receives all blocks of the final array and prints them out
• MPI_Finalize
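A hedged sketch of this whole run for the sorting example, assuming rank 0 owns the full array and that MPI_Scatter / MPI_Gather are used to distribute and collect the blocks (the buffer names, the problem size and the use of qsort are illustrative, not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 500                              /* total number of elements */

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(int argc, char **argv)
{
    int p, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int block = N / p;                     /* assume p divides N evenly */
    int *A = NULL;
    int *mine = malloc(block * sizeof(int));

    if (rank == 0) {                       /* rank 0 owns the full array */
        A = malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) A[i] = rand() % 1000;
    }

    /* each task gets one block and sorts it independently */
    MPI_Scatter(A, block, MPI_INT, mine, block, MPI_INT, 0, MPI_COMM_WORLD);
    qsort(mine, block, sizeof(int), cmp);

    /* task 0 receives all blocks of the final array and prints them out */
    MPI_Gather(mine, block, MPI_INT, A, block, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (int i = 0; i < N; i++) printf("%d\n", A[i]);

    MPI_Finalize();
    return 0;
}

Note that, as on the slide, each region of A comes back sorted but the array as a whole is not; a merge step would be needed for a fully sorted result.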
Communication vs Computation
• Often communication is needed between iterations to complete the work.
• Often, the more tasks there are, the more communication is required.
− In Floyd, a bigger “p” means that “rowk” must be sent to a larger number of tasks.
− If each iteration depends on more data, it can get very busy.
• This may mean network contention, i.e., delays.
• Try counting the number of “a”s in a string and measure time vs. p (see the sketch after this list).
• This is why, for a fixed problem size, increasing the number of CPUs does not keep increasing performance.
• This needs experimentation – it is problem specific.
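A hedged sketch of the counting experiment, assuming the string is available on every task and each task counts its own slice, with a single MPI_Reduce as the communication step (the input string and the slicing scheme are illustrative):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const char *s = "a long string with many a's in it"; /* illustrative input */
    int p, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = (int)strlen(s);
    int lo = rank * n / p;                 /* my slice of the string */
    int hi = (rank + 1) * n / p;

    int local = 0, total = 0;
    for (int i = lo; i < hi; i++)
        if (s[i] == 'a') local++;

    /* one communication step: sum the per-task counts at rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("count = %d\n", total);
    MPI_Finalize();
    return 0;
}

For a short string, the computation per task shrinks as p grows while the cost of the reduction does not, so the measured time stops improving – exactly the effect the slide asks you to observe.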
A bit more on Broadcast
Ranks:     0        1        2
x before:  0        1        2
Each rank calls:   MPI_Bcast(&x, 1, .., 0, ..);
x after:   0        0        0
Every task in the communicator makes the same MPI_Bcast call; afterwards x on every rank holds the value from root 0.
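A hedged, self-contained version of this picture (the printf is added only to show the result):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    x = rank;                                       /* before: x differs on every rank */
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* every rank makes the same call, root = 0 */
    printf("rank %d: x = %d\n", rank, x);           /* after: x is 0 everywhere */

    MPI_Finalize();
    return 0;
}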
[Figure: rank 0 (master) and 2 slaves.]

Slave-side fragment:
    ...
    reverse(work);
    MPI_Send(&work, n+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  } while (1);
  MPI_Finalize();
}
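A hedged sketch of the surrounding slave loop, assuming the master (rank 0) signals completion with a distinguished tag; the tag values, buffer size and the helper reverse() are assumptions for illustration, not taken from the slides:

#include <string.h>
#include <mpi.h>

#define MAXLEN   1024
#define TAG_WORK 0
#define TAG_STOP 1                              /* assumed termination tag */

static void reverse(char *s)                    /* assumed helper: reverse in place */
{
    for (int i = 0, j = (int)strlen(s) - 1; i < j; i++, j--) {
        char t = s[i]; s[i] = s[j]; s[j] = t;
    }
}

void slave(void)
{
    char work[MAXLEN];
    MPI_Status status;
    int n;

    do {
        /* wait for a work item (a string) from the master */
        MPI_Recv(work, MAXLEN, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == TAG_STOP)
            break;                              /* no more work: leave the loop */

        MPI_Get_count(&status, MPI_CHAR, &n);   /* n includes the trailing '\0' */
        reverse(work);
        MPI_Send(work, n, MPI_CHAR, 0, TAG_WORK, MPI_COMM_WORLD);
    } while (1);
}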
Block Distribution of Matrices
• Matrix Multiply: Cij = Σk (Aik * Bkj)
• Each task owns a block – its own part of A, B and C
• The old formula holds for blocks!
• BMR Algorithm (a sketch of the per-task kernel follows)
• Example: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
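The per-task computation is just the same formula applied to blocks: each task repeatedly multiplies an A-block by a B-block and accumulates into its own C-block. A hedged sketch of that local kernel, assuming b×b blocks stored row-major in flat arrays (the function name and layout are illustrative):

/* C += A * B for b x b blocks stored row-major */
void block_multiply_accumulate(int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < b; i++)
        for (int k = 0; k < b; k++) {
            double aik = A[i * b + k];
            for (int j = 0; j < b; j++)
                C[i * b + j] += aik * B[k * b + j];
        }
}

The BMR steps – broadcast an A-block along each row of tasks, multiply, then roll the B-blocks – decide which pair of operand blocks this kernel sees at each stage.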
Communicators and Topologies
• The BMR example shows the limitations of plain broadcast – even though the communication has a clear pattern
• Communicators can be created on subgroups of processes (see the sketch after this list)
• Communicators can be created that have a topology
− Makes programming more natural
− Might improve performance by matching the hardware
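A hedged sketch of both ideas – splitting MPI_COMM_WORLD into per-row sub-communicators and creating a communicator with a 2-D Cartesian topology (the q×q grid shape is an assumption for illustration):

#include <mpi.h>

void make_row_comm_and_grid(int q)              /* assumes the task count is q*q */
{
    int rank;
    MPI_Comm row_comm, grid_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Sub-communicator: tasks with the same color (here: the same grid row)
       end up in the same new communicator, so a broadcast on row_comm reaches
       only that row – the pattern the BMR example needs. */
    MPI_Comm_split(MPI_COMM_WORLD, rank / q, rank % q, &row_comm);

    /* Communicator with a topology: a q x q grid, periodic in both dimensions
       so that "rolling" blocks wraps around. */
    int dims[2]    = {q, q};
    int periods[2] = {1, 1};
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);

    /* ... use row_comm / grid_comm, then release them ... */
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&grid_comm);
}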
A brief look at other MPI topics – the last leg
• MPI + multithreading / OpenMP
• One-sided communication
• MPI and I/O