Module 08

1) The document describes the architecture of distributed memory multiprocessor (MPP) systems. MPPs connect multiple CPUs with separate memories using a network.
2) The primary programming model for MPPs uses message passing between parallel processes running on separate processors. Processes must communicate to share data or synchronize work.
3) MPPs provide more parallelism than SMPs and can scale to thousands of processors. Modern high-speed networks allow near-linear speedup as processors are added.


Parallel Computing 1

Distributed Memory Multiprocessor


[Figure: the MPP architecture — several CPUs, each with its own memory, connected by a network]


Parallel Computing 2
Programming Model
 Primary programming model
– Parallel processes
» each running on a separate processor
» using message passing to communicate with the others
 Why send messages?
– Some processes may use data computed by other processes
» the data must be delivered from processes producing the data to
processes using the data
– The processes may need to synchronize their work
» sending messages is used to inform the processes that some event
has happened or some condition has been satisfied
Parallel Computing 3
MPP Architecture
 MPP architecture
– Provides more parallelism than SMPs
» SMPs rarely have more than 32 processors
» MPPs may have hundreds or even thousands of processors
 How significant is the performance potential of the
MPP architecture?
– Due to modern communication technologies, MPPs are a
scalable parallel architecture
» a (p+1)-processor configuration executes “normal” message-passing
programs faster than a p-processor one for practically arbitrary p

Parallel Computing 4
MPP Architecture (ctd)
 MPP communication network
– Must provide a communication layer that would be
» fast
» well balanced with the number and performance of processors
 should be no degradation in communication speed even when all processors of the MPP simultaneously perform intensive data transfer operations
» homogeneous
 ensure the same speed of data transfer between any two processors of the MPP

Parallel Computing 5
MPP Architecture (ctd)
 Implementation of the MPP architecture
– Parallel computer
– Dedicated cluster of workstations
– A real MPP implementation only approximates the ideal MPP
architecture
» a compromise between the cost and quality of its
communication network

Parallel Computing 6
Optimizing Compilers
 MPPs are much farther away from the serial scalar architecture
than VPs, SPs, or SMPs
– Optimizing C or Fortran 77 compilers for MPPs would have to possess
real intelligence
» To automatically generate an efficient message-passing code using the
serial source code as specification of its functional semantics
– No industrial optimising C or Fortran 77 compilers for MPPs
– A small number of experimental research compilers
» PARADIGM
» Far away from practical use
 Basic programming tools for MPPs
– Message-passing libraries
– High-level parallel languages

Parallel Computing 7
Parallel Computing 8
Message-Passing Libraries
 Message-passing libraries directly implement the
message-passing parallel programming model
– The basic paradigm of message passing is the same in
different libraries (PARMACS, Chameleon, CHIMP,
PICL, Zipcode, p4, PVM, MPI, etc)
– The libraries just differ in details of implementation
 The most popular libraries are
– MPI (Message-Passing Interface)
– PVM (Parallel Virtual Machine)
– The absolute majority of existing message-passing code is
written using one of these libraries
Parallel Computing 9
Message-Passing Libraries
 We outline MPI
– Standardised in 1995 as MPI 1.1
– Widely implemented in compliance with the
standard
» all hardware vendors offer MPI
» Free high-quality MPI implementations (LAM MPI,
MPICH)
– Supports parallel programming in C and Fortran
on all MPP architectures
» including Unix and Windows NT platforms

Parallel Computing 10
Message-Passing Libraries
 MPI 2.0 is a set of extensions to MPI 1.1
released in 1997
– Fortran 90 and C++ bindings, parallel I/O, one-sided
communications, etc
– A typical MPI library fully implements MPI 1.1
and optionally supports some features of MPI 2.0
 We use the C interface to MPI 1.1 to present
MPI

Parallel Computing 11
MPI
 An MPI program
– A fixed number of processes
» Executing their own code in their own address space
 The codes need not be identical
» Communicating via calls to MPI communication primitives
– Does not specify
» The number of processes
» The allocation of the processes to physical processors
» Such mechanisms are external to the MPI program
 must be provided by particular MPI implementations
– Uses calls to MPI inquiring operations to determine their
total number and identify themselves in the program
Parallel Computing 12
MPI (ctd)
 Two types of communication operation
– Point-to-point
» Involves two processes, one of which sends a message
and the other receives the message
– Collective
» Involves a group of processes
 barrier synchronisation, broadcast, etc

Parallel Computing 13
MPI (ctd)
 Process group
– An ordered collection of processes, each with a rank
– Defines the scope of collective communication operations
» Avoids unnecessarily synchronizing uninvolved processes
– Defines a scope for process names in point-to-point
communication operations
» Participating processes are specified by their rank in the same
process group
– Cannot be built from scratch
» Only from other, previously defined groups
 By subsetting and supersetting existing groups
» The base group of all processes is available after MPI is initialised

Parallel Computing 14
MPI (ctd)
 Communicators
– Mechanism to safely separate messages
» Which must not be logically mixed, even when the messages
are transferred between processes of the same group
– A separate communication layer associated with a group of
processes
» There may be several communicators associated with the same
group, providing non-intersecting communication layers
– Communication operations explicitly specify the
communicator
» Messages transmitted over different communicators cannot be
mixed (a message sent through a communicator is never received by
a process not communicating over the communicator)
Parallel Computing 15
MPI (ctd)
 Communicators (ctd)
– Technically, a communicator is implemented as follows
» A unique tag is generated at runtime
 shared by all processes of the group with which the communicator is associated
 attached to all messages sent through the communicator
 used by the processes to filter incoming messages
– Communicators make MPI suitable for writing parallel libraries
Parallel Computing 16
MPI (ctd)
 MPI vs PVM
– PVM can’t be used for implementation of parallel
libraries
» No means to have separate safe communication layers
» All communication attributes, which could be used to
separate messages, such as groups and tags, are user-defined
» The attributes do not have to be unique at runtime
 Especially if different modules of the program are written by different programmers

Parallel Computing 17
MPI (ctd)
 Example 1. The pseudo-code of a PVM application
extern Proc();
if(my process ID is A)
Send message M with tag T to process B
Proc();
if(my process ID is B)
Receive a message with tag T from process A
– Does not guarantee that message M will not be intercepted
inside the library procedure Proc
» In this procedure, process A may send a message to process B
» The programmer, who coded this procedure, could attach tag T to
the message

Parallel Computing 18
MPI (ctd)
 Example 2. MPI solves the problem as follows
extern Proc();
Create communicator C for a group including
processes A and B
if(my process ID is A)
Send message M to process B through
communicator C
Proc();
if(my process ID is B)
Receive a message from process A through
communicator C
– A unique tag attached to any message sent over the
communicator C prevents the interception of the message
inside the procedure Proc
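As a concrete illustration, the following is a minimal C sketch of this pattern; the ranks (A = 0, B = 1), the tag, the data value, and the stub Proc are assumptions made for the example, not part of the original slides, and at least two processes are required.

#include <mpi.h>
#include <stdio.h>

/* Stub for the library procedure; a real library would communicate
   over its own communicator(s) and therefore cannot intercept message M. */
void Proc(void) { }

int main(int argc, char **argv)
{
    int rank, data = 42;
    int ranks[2] = {0, 1};           /* assume process A has rank 0, B has rank 1 */
    MPI_Group world_group, ab_group;
    MPI_Comm  ab_comm;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Create communicator C for the group including processes A and B */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 2, ranks, &ab_group);
    MPI_Comm_create(MPI_COMM_WORLD, ab_group, &ab_comm);

    if (rank == 0)                   /* process A sends M through communicator C */
        MPI_Send(&data, 1, MPI_INT, 1, 0, ab_comm);

    Proc();

    if (rank == 1) {                 /* process B receives M through communicator C */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, ab_comm, &status);
        printf("B received %d\n", data);
    }

    if (ab_comm != MPI_COMM_NULL) MPI_Comm_free(&ab_comm);
    MPI_Group_free(&ab_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}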

Parallel Computing 19
Groups and Communicators
 A group is an ordered set of processes
– Each process in a group is associated with an integer rank
– Ranks are contiguous and start from zero
– Groups are represented by opaque group objects of the type
MPI_Group
» hence cannot be directly transferred from one process to another
 A context is a unique, system-generated tag
– That differentiates messages
– The MPI system manages this differentiation process

Parallel Computing 20
Groups and Communicators (ctd)
 Communicator
– Brings together the concepts of group and context
– Used in communication operations to determine the scope
and the “communication universe” in which an operation is
to operate
– Contains
» an instance of a group
» a context for point-to-point communication
» a context for collective communication
– Represented by opaque communicator objects of the type
MPI_Comm

Parallel Computing 21
Groups and Communicators (ctd)
 We described intra-communicators
– This type of communicator is used
» for point-to-point communication between processes of
the same group
» for collective communication
 MPI also introduces inter-communicators
– Used specifically for point-to-point
communication between processes of different
groups
– We do not consider inter-communicators
Parallel Computing 22
Groups and Communicators (ctd)
 An initial pre-defined communicator
MPI_COMM_WORLD
– A communicator of all processes making up the MPI
program
– Has the same value in all processes
 The group associated with MPI_COMM_WORLD
– The base group, upon which all other groups are defined
– Does not appear as a pre-defined constant
» Can be accessed using the function
int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)

Parallel Computing 23
Groups and Communicators (ctd)
 Other group constructors
– Explicitly list the processes of an existing group,
which make up a new group, or
– Do set-like binary operations on existing groups to
construct a new group
» Union
» Intersection
» Difference

Parallel Computing 24
Groups and Communicators (ctd)
 Function
int MPI_Group_incl(MPI_Group group, int n,
int *ranks, MPI_Group *newgroup)
creates newgroup that consists of the n processes in group
with ranks ranks[0], …, ranks[n-1]. The process with rank k in
newgroup is the process with rank ranks[k] in group.
 Function
int MPI_Group_excl(MPI_Group group, int n,
int *ranks, MPI_Group *newgroup)
creates newgroup by deleting from group those processes
with ranks ranks[0], …, ranks[n-1]. The ordering of
processes in newgroup is identical to the ordering in group.
Parallel Computing 25
Groups and Communicators (ctd)
 Function
int MPI_Group_range_incl(MPI_Group group, int n,
int ranges[][3],
MPI_Group *newgroup)
assumes that ranges consist of the triplets
(f_0, l_0, s_0), …, (f_{n-1}, l_{n-1}, s_{n-1})
and constructs a group newgroup consisting of processes in group
with ranks

$f_0,\ f_0+s_0,\ \ldots,\ f_0+\left\lfloor\tfrac{l_0-f_0}{s_0}\right\rfloor s_0,\ \ \ldots,\ \ f_{n-1},\ f_{n-1}+s_{n-1},\ \ldots,\ f_{n-1}+\left\lfloor\tfrac{l_{n-1}-f_{n-1}}{s_{n-1}}\right\rfloor s_{n-1}$
Parallel Computing 26
Groups and Communicators (ctd)
 Function
int MPI_Group_range_excl(MPI_Group group, int n,
int ranges[][3], MPI_Group *newgroup)

constructs newgroup by deleting from group those processes with ranks

$f_0,\ f_0+s_0,\ \ldots,\ f_0+\left\lfloor\tfrac{l_0-f_0}{s_0}\right\rfloor s_0,\ \ \ldots,\ \ f_{n-1},\ f_{n-1}+s_{n-1},\ \ldots,\ f_{n-1}+\left\lfloor\tfrac{l_{n-1}-f_{n-1}}{s_{n-1}}\right\rfloor s_{n-1}$
The ordering of processes in newgroup is identical to the
ordering in group
Parallel Computing 27
Groups and Communicators (ctd)
 Function
int MPI_Group_union(MPI_Group group1,
MPI_Group group2,
MPI_Group *newgroup)
creates newgroup that consists of all processes of group1,
followed by all processes of group2 .
 Function
int MPI_Group_intersection(MPI_Group group1,
MPI_Group group2,
MPI_Group *newgroup)
creates newgroup that consists of all processes of group1
that are also in group2, ordered as in group1 .
Parallel Computing 28
Groups and Communicators (ctd)
 Function
int MPI_Group_difference(MPI_Group group1,
MPI_Group group2,
MPI_Group *newgroup)
creates newgroup that consists of all processes of group1
that are not in group2, ordered as in group1.
 The order in the output group of a set-like operation
– Determined primarily by the order in the first group
» and only then, if necessary, by the order in the second group
– Therefore, the operations are not commutative
» but are associative
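The following small C sketch (not from the slides; the rank lists are illustrative and at least four processes are assumed) builds two overlapping groups with MPI_Group_incl and combines them with the set-like constructors; MPI_Group_size and MPI_Group_rank are the standard local inquiry functions used to inspect the result.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Group world, g1, g2, g_union, g_inter, g_diff;
    int ranks1[] = {0, 1, 2};          /* illustrative rank lists; run with >= 4 processes */
    int ranks2[] = {2, 3};
    int size, myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_group(MPI_COMM_WORLD, &world);

    /* Two overlapping subgroups of the base group (group constructors are local) */
    MPI_Group_incl(world, 3, ranks1, &g1);     /* {0,1,2} */
    MPI_Group_incl(world, 2, ranks2, &g2);     /* {2,3}   */

    MPI_Group_union(g1, g2, &g_union);         /* 0,1,2 followed by 3     */
    MPI_Group_intersection(g1, g2, &g_inter);  /* {2}, ordered as in g1   */
    MPI_Group_difference(g1, g2, &g_diff);     /* {0,1}, ordered as in g1 */

    MPI_Group_size(g_union, &size);
    MPI_Group_rank(g_union, &myrank);          /* MPI_UNDEFINED if the caller is not a member */
    printf("union size = %d, my rank in the union = %d\n", size, myrank);

    MPI_Group_free(&g_diff);  MPI_Group_free(&g_inter);
    MPI_Group_free(&g_union); MPI_Group_free(&g1);
    MPI_Group_free(&g2);      MPI_Group_free(&world);
    MPI_Finalize();
    return 0;
}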

Parallel Computing 29
Groups and Communicators (ctd)
 Group constructors are local operations
 Communicator constructors are collective operations
– Must be performed by all processes in the group associated
with the existing communicator, which is used for creation
of a new communicator
 The function
int MPI_Comm_dup(MPI_Comm comm,
MPI_Comm *newcomm)
creates a communicator newcomm with the same group,
but a new context

Parallel Computing 30
Groups and Communicators (ctd)
 The function
int MPI_Comm_create(MPI_Comm comm,
MPI_Group group,
MPI_Comm *newcomm)

creates a communicator newcomm with associated group
defined by group and a new context
» Returns MPI_COMM_NULL to processes that are not in group
» The call is to be executed by all processes in comm
» group must be a subset of the group associated with comm

Parallel Computing 31
Groups and Communicators (ctd)
 The function
int MPI_Comm_split(MPI_Comm comm, int color,
int key, MPI_Comm *newcomm)

partitions the group of comm into disjoint subgroups, one for
each nonnegative value of color
» Each subgroup contains all processes of the same color
 MPI_UNDEFINED for non-participating processes
» Within each subgroup, the processes are ordered by key
 processes with the same key are ordered according to their rank in the parent group
» A new communicator is created for each subgroup and returned in newcomm
 MPI_COMM_NULL for non-participating processes

Parallel Computing 32
Groups and Communicators (ctd)
 Two local operations to determine the process’s rank
and the total number of processes in the group
– The function
int MPI_Comm_size(MPI_Comm comm, int *size)
returns in size the number of processes in the group of
comm
– The function
int MPI_Comm_rank(MPI_Comm comm, int *rank)
returns in rank the rank of the calling process in the
group of comm

Parallel Computing 33
Groups and Communicators (ctd)
 Example. See handout for an MPI program:
– Each process first determines the total number of
processes executing the program and its rank in the
global group associated with MPI_COMM_WORLD
– Then two new communicators are created
» Containing processes with even global ranks
» Containing processes with odd global ranks
– Then each process determines its local rank in the
group associated with one of the newly created
communicators
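The handout itself is not reproduced here; below is a minimal sketch of a program of that kind, assuming MPI_Comm_split is the mechanism used to create the even and odd communicators (the handout may instead build explicit groups and call MPI_Comm_create).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int global_rank, global_size, local_rank, color;
    MPI_Comm parity_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &global_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &global_rank);

    /* color 0 = even global ranks, color 1 = odd global ranks */
    color = global_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, global_rank, &parity_comm);

    /* local rank in the newly created (even or odd) communicator */
    MPI_Comm_rank(parity_comm, &local_rank);
    printf("process %d of %d: local rank %d in the %s communicator\n",
           global_rank, global_size, local_rank, color ? "odd" : "even");

    MPI_Comm_free(&parity_comm);
    MPI_Finalize();
    return 0;
}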
Parallel Computing 34
Groups and Communicators (ctd)
 The function
int MPI_Init(int *argc, char ***argv)
initializes the MPI environment
» Must be called by all processes of the program before any other
MPI function is called
» Must be called at most once
 The function
int MPI_Finalize(void)
cleans up all MPI state
» Once the function is called, no MPI function may be called
 even MPI_Init

Parallel Computing 35
Groups and Communicators (ctd)
 Group and communicator destructors
– Collective operations
– Mark the group or communicator for deallocation
» actually deallocated only if there are no other active references to it
– Function int MPI_Comm_free(MPI_Comm *comm)
marks the communication object for deallocation
» The handle is set to MPI_COMM_NULL
– Function
int MPI_Group_free(MPI_Group *group)
marks the group object for deallocation
» The handle is set to MPI_GROUP_NULL

Parallel Computing 36
Point-to-Point Communication
 Point-to-point communication operations
– The basic MPI communication mechanism
– A wide range of send and receive operations for different
modes of point-to-point communication
» Blocking and nonblocking
» Synchronous and asynchronous, etc
 Two basic operations
– A blocking send and a blocking receive
» Predominantly used in MPI applications
» Implement a clear and reliable model of point-to-point
communication
» Allow the programmers to write portable MPI code
Parallel Computing 37
Point-to-Point Communication (ctd)
 The function
int MPI_Send(void *buf, int n, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
implements a standard blocking send operation
» Forms a message
» Sends it to the addressee
 The message consists of
– The data to be transferred
» The data part may be empty (n=0)
– An envelope
» A fixed part of message
» Used to distinguish messages and selectively receive them
Parallel Computing 38
Point-to-Point Communication (ctd)
 The envelope carries the following information
– The communicator
» specified by comm
– The message source
» implicitly determined by the identity of the message sender
– The message destination
» specified by dest
 a rank within the group of comm
– The message tag
» specified by tag
 Non-negative integer value

Parallel Computing 39
Point-to-Point Communication (ctd)
 The data part of the message
– A sequence of n values of the type specified by
datatype
– The values are taken from a send buffer
» The buffer consists of n entries of the type specified by
datatype, starting at address buf
– datatype can specify
» A basic datatype
 corresponds to one of the basic datatypes of the C language

» A derived datatype
 constructed from basic ones using datatype constructors provided by MPI

Parallel Computing 40
Point-to-Point Communication (ctd)
 datatype
– An opaque object of the type MPI_Datatype
– Pre-defined constants of that type for the basic datatypes
» MPI_CHAR (corresponds to signed char)
» MPI_SHORT (signed short int)
» MPI_INT (signed int)
» MPI_FLOAT (float)
» etc
– Basic datatypes
» Contiguous buffers of elements of the same basic type
» More general buffers are specified by using derived datatypes

Parallel Computing 41
Point-to-Point Communication (ctd)
 The function
int MPI_Recv(void *buf, int n, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm,
MPI_Status *status)

implements a standard blocking receive operation
» The receive buffer
 the storage containing n consecutive elements of type datatype, starting at address buf
 the data part of the received message must fit in the receive buffer

Parallel Computing 42
Point-to-Point Communication (ctd)
 The selection of a message is governed by the
message envelope
– A message can be received only if its envelope
matches the source, tag and comm specified by
the receive operation
» MPI_ANY_SOURCE for source
 any source is acceptable
» MPI_ANY_TAG for tag
 any tag is acceptable
» No wildcard for comm
Parallel Computing 43
Point-to-Point Communication (ctd)
 The asymmetry between send and receive operations
– A sender always directs a message to a unique receiver
– A receiver may accept messages from an arbitrary sender
– A push communication
» Driven by the sender
» No pull communication driven by the receiver
 The status argument
– Points to an object of the type MPI_Status
– A structure containing at least three fields
» MPI_SOURCE
» MPI_TAG
» MPI_ERROR
– Used to return the source, tag, and error code of the received message
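A minimal two-process sketch of the blocking operations just described (the tag and the value sent are illustrative); the receiver uses the wildcards and then reads the actual source and tag back from the status object.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                     /* sender */
        value = 123;
        MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {              /* receiver: accepts any source and any tag */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        printf("received %d from process %d with tag %d\n",
               value, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}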

Parallel Computing 44
Point-to-Point Communication (ctd)
 MPI_Recv and MPI_Send are blocking operations
 Return from a blocking operation means that the
resources used by the operation are allowed to be re-used
– MPI_Recv returns only after the data part of the incoming
message has been stored in the receive buffer
– MPI_Send does not return until the message data and
envelope have been safely stored away
» The sender is free to access and overwrite the send buffer
» No matching receive may have been executed by the receiver
» Message buffering decouples the send and receive operations

Parallel Computing 45
Point-to-Point Communication (ctd)
 MPI_Send uses the standard communication mode
– MPI decides whether outgoing messages will be buffered
» If buffered, the send may complete before a matching receive is invoked
» If not, the send call will not complete until a matching receive has been
posted, and the data has been moved to the receiver
 buffer space may be unavailable
 for performance reasons
 The standard mode send is non-local
– An operation is non-local if its completion may require the
execution of some MPI procedure on another process
» Completion of MPI_Send may depend on the occurrence of a
matching receive

Parallel Computing 46
Point-to-Point Communication (ctd)
 Three other (non-standard) send modes
– Buffered mode
» MPI must buffer the outgoing message
» local
» not safe (an error occurs if there is insufficient buffer space)
– Synchronous mode
» complete only if the matching receive operation has started to
receive the message sent by the synchronous send
» non-local
– Ready mode
» started only if the matching receive is already posted

Parallel Computing 47
Point-to-Point Communication (ctd)
 Properties of MPI point-to-point communication
– A certain order in receiving messages sent from the same
source is guaranteed
» messages do not overtake each other
» LogP is more liberal
– A certain progress in the execution of point-to-point
communication is guaranteed
» If a pair of matching send and receive have been initiated on two
processes, then at least one of these two operations will complete
– No guarantee of fairness
» A message may never be received, because it is each time
overtaken by another message sent from another source

Parallel Computing 48
Point-to-Point Communication (ctd)
 Nonblocking communication
– Allows communication to overlap computation
» MPI’s alternative to multithreading
– Split a one-piece operation into 2 sub-operations
» The first just initiates the operation but does not complete it
 Nonblocking send start and receive start
» The second sub-operation completes this communication operation
 Send complete and receive complete
– Nonblocking sends can be matched with blocking receives, and vice-versa

Parallel Computing 49
Point-to-Point Communication (ctd)
 Nonblocking send start calls
– Can use the same communication modes as
blocking sends
» standard, buffered, synchronous, and ready
– A nonblocking ready send can be started only if a
matching receive is posted
– In all cases, the send start call is local
– The send complete call acts according to the send
communication mode set by the send start call
Parallel Computing 50
Point-to-Point Communication (ctd)
 A start call creates an opaque request object of the
type MPI_Request
– Identifies various properties of a communication operation
» (send) mode, buffer, context, tag, destination or source, status of
the pending communication operation

int MPI_Ixsend(void *buf, int n, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm,
MPI_Request *request)
(here x stands for the send mode: empty for the standard mode, b, s, or r for the buffered, synchronous, and ready modes)

int MPI_Irecv(void *buf, int n, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm,
MPI_Request *request)

Parallel Computing 51
Point-to-Point Communication (ctd)
 The request is used later
– To wait for its completion
int MPI_Wait(MPI_Request *request, MPI_Status *status)
– To query the status of the communication
int MPI_Test(MPI_Request *request, int *flag,
MPI_Status *status)
– Additional complete operations
» can be used to wait for the completion of any, some, or all the
operations in a list, rather than having to wait for a specific message
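A hedged sketch of how the nonblocking calls let communication overlap computation; the buffer size, tags, and the placeholder work loop are assumptions made for the example, and at least two processes are required.

#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv)
{
    int rank, i;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request sreq, rreq;
    MPI_Status  status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++) sendbuf[i] = rank + i;

    if (rank == 0) {
        /* start the exchange, then compute while the transfers progress */
        MPI_Isend(sendbuf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &sreq);
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &rreq);
    } else if (rank == 1) {
        MPI_Isend(sendbuf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &sreq);
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &rreq);
    }

    if (rank < 2) {
        for (i = 0; i < N; i++)          /* computation not touching the buffers */
            local += (double)i * i;

        MPI_Wait(&sreq, &status);        /* complete the send: sendbuf may be reused  */
        MPI_Wait(&rreq, &status);        /* complete the receive: recvbuf is ready    */
        printf("process %d: recvbuf[0] = %f, local = %f\n", rank, recvbuf[0], local);
    }

    MPI_Finalize();
    return 0;
}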

Parallel Computing 52
Point-to-Point Communication (ctd)
 The remaining point-to-point communication
operations
– Aimed at optimisation of memory and processor cycles
– MPI_Probe and MPI_Iprobe allow incoming messages
to be checked for, without actually receiving them
» the user may allocate memory for the receive buffer, according to
the length of the probed message
– If a communication with the same argument list is
repeatedly executed within the loop
» the list of arguments can be bound to a persistent communication
request once, out of the loop, and, then, repeatedly used to initiate
and complete messages in the loop
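As an illustration of the first optimisation, a receiver can size its buffer from a probed message; the helper below (its name and the use of MPI_INT are illustrative) combines MPI_Probe, MPI_Get_count, and MPI_Recv. A persistent request would analogously be created once with MPI_Send_init or MPI_Recv_init outside the loop and restarted inside it with MPI_Start.

#include <mpi.h>
#include <stdlib.h>

/* Receive a message of unknown length from `source`:
   probe it first, allocate a buffer of exactly the right size, then receive. */
int *recv_unknown_length(int source, int tag, MPI_Comm comm, int *count)
{
    MPI_Status status;
    int *buf;

    MPI_Probe(source, tag, comm, &status);      /* blocks until a matching message arrives */
    MPI_Get_count(&status, MPI_INT, count);     /* number of MPI_INT items in that message */

    buf = (int *)malloc((*count) * sizeof(int));
    MPI_Recv(buf, *count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
             comm, &status);
    return buf;
}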

Parallel Computing 53
Collective Communication
 Collective communication operation
– Involves a group of processes
» All processes in the group must call the corresponding MPI
communication function, with the matching arguments
– The basic collective communication operations are:
» Barrier synchronization across all group members
» Broadcast from one member to all members of a group
» Gather data from all group members to one member
» Scatter data from one member to all members of a group
» Global reduction operations
 such as sum, max, min, or user-defined functions

Parallel Computing 54
Collective Communication (ctd)
 Some collective operations have a single originating
or receiving process called the root
 A barrier call returns only after all group members
have entered the call
 Other collective communication calls
– Can return as soon as their participation in the collective
communication is complete
– Should not be used for synchronization of calling processes
 The same communicators can be used for collective
and point-to-point communications

Parallel Computing 55
Collective Communication (ctd)
 The function
int MPI_Barrier(MPI_Comm comm)
– Blocks the caller until all members of the group of comm have called it
– The call returns at any process only after all group members have
entered the call
 The function
int MPI_Bcast(void *buf, int count,
MPI_Datatype datatype,
int root, MPI_Comm comm)
– Broadcasts a message from root to all processes of the group
– The type signature obtained by count-fold replication of the type
signature of datatype on any process must be equal to that at root

Parallel Computing 56
Collective Communication (ctd)
 int MPI_Gather(void *sendbuf, int sendcount,
MPI_Datatype sendtype,
void *recvbuf, int recvcount,
MPI_Datatype recvtype,
int root, MPI_Comm comm)
– Each process (root included) sends the contents of its
send buffer to root
– root receives the messages and stores them in rank order
– The specification of counts and types should not cause any
location on root to be written more than once
– recvcount is the number of items root receives from
each process, not the total number of items it receives

Parallel Computing 57
Collective Communication (ctd)
 The function
int MPI_Scatter(void *sendbuf, int sendcount,
MPI_Datatype sendtype,
void *recvbuf, int recvcount,
MPI_Datatype recvtype,
int root, MPI_Comm comm)

– Is the inverse operation to MPI_Gather
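A sketch of the scatter/compute/gather pattern (the block size and data are illustrative): the root scatters equal blocks of an array, every process doubles its block, and the root gathers the results back in rank order.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

#define BLOCK 4                     /* items handled by each process */

int main(int argc, char **argv)
{
    int rank, size, i;
    int *full = NULL, part[BLOCK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                /* root owns the whole array */
        full = (int *)malloc(size * BLOCK * sizeof(int));
        for (i = 0; i < size * BLOCK; i++) full[i] = i;
    }

    /* root sends BLOCK ints to every process (itself included) */
    MPI_Scatter(full, BLOCK, MPI_INT, part, BLOCK, MPI_INT, 0, MPI_COMM_WORLD);

    for (i = 0; i < BLOCK; i++) part[i] *= 2;     /* local work */

    /* root collects the blocks back in rank order */
    MPI_Gather(part, BLOCK, MPI_INT, full, BLOCK, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("full[0..3] = %d %d %d %d\n", full[0], full[1], full[2], full[3]);
        free(full);
    }
    MPI_Finalize();
    return 0;
}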

Parallel Computing 58
Collective Communication (ctd)
 The function
int MPI_Reduce(void *inbuf, void *outbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root,
MPI_Comm comm)
– Combines the elements in the input buffer (inbuf,
count, datatype) using the operation op
– Returns the combined value in the output buffer (outbuf,
count, datatype) of root
– Each process can provide a sequence of elements
» the combine operation is executed element-wise on each entry of
the sequence
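For example, a global dot product can be computed by reducing per-process partial sums; the vector length and data in this sketch are illustrative.

#include <mpi.h>
#include <stdio.h>

#define N 100                       /* local vector length on each process */

int main(int argc, char **argv)
{
    int rank, i;
    double x[N], y[N], local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = (double)rank; }

    for (i = 0; i < N; i++)         /* local partial dot product */
        local += x[i] * y[i];

    /* combine one double per process with MPI_SUM, result delivered to root 0 */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product = %f\n", global);

    MPI_Finalize();
    return 0;
}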

Parallel Computing 59
Collective Communication (ctd)
 Global reduction operations can combine
– Pre-defined operations
» Specified by pre-defined constants of type MPI_Op
– A user-defined operation
 Pre-defined constants of type MPI_Op
– MPI_MAX (maximum), MPI_MIN (minimum), MPI_SUM (sum),
MPI_PROD (product)
– MPI_LAND (logical and), MPI_LOR (logical or), MPI_LXOR
(logical exclusive or)
– MPI_BAND (bit-wise and), MPI_BOR (bit-wise or), MPI_BXOR (bit-
wise exclusive or)
– MPI_MAXLOC (maximum value and its location), MPI_MINLOC
(minimum value and its location)

Parallel Computing 60
Collective Communication (ctd)
 A user-defined operation
– Associative
– Bound to an op handle with the function
int MPI_Op_create(MPI_User_function *fun,
int commute, MPI_Op *op)
– Its type can be specified as follows:
typedef void MPI_User_function(
void *invec, void *inoutvec,
int *len, MPI_Datatype *datatype)
 Examples. See handouts.
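The handouts are not reproduced here; as a stand-in, the sketch below defines an element-wise "maximum absolute value" reduction (chosen purely for illustration) and uses it with MPI_Reduce.

#include <mpi.h>
#include <math.h>
#include <stdio.h>

/* Element-wise "maximum absolute value" combiner, matching MPI_User_function */
void maxabs(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    double *in = (double *)invec, *inout = (double *)inoutvec;
    int i;
    for (i = 0; i < *len; i++)
        if (fabs(in[i]) > fabs(inout[i]))
            inout[i] = in[i];
}

int main(int argc, char **argv)
{
    int rank;
    double local[2], result[2];
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local[0] = (rank % 2) ? -(double)rank : (double)rank;   /* illustrative data */
    local[1] = 1.0 / (rank + 1.0);

    MPI_Op_create(maxabs, 1 /* commutative */, &op);
    MPI_Reduce(local, result, 2, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max |x| per slot: %f %f\n", result[0], result[1]);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}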
Parallel Computing 61
Environment Management
 A few functions for various parameters of the MPI
implementation and the execution environment
– Function
int MPI_Get_processor_name(char *name, int *resultlen)
returns the name of the processor on which it was called
– Function double MPI_Wtime(void) returns a floating-
point number of seconds, representing elapsed wall-clock
time since some time in the past
– Function double MPI_Wtick(void) returns the resolution
of MPI_Wtime in seconds
» the number of seconds between successive clock ticks
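A typical use of these calls (the timed loop below is just placeholder work) is to time a region of code and report it per processor.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int namelen, i;
    double t0, t1, s = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(name, &namelen);

    t0 = MPI_Wtime();
    for (i = 0; i < 1000000; i++)   /* some work to time */
        s += (double)i;
    t1 = MPI_Wtime();

    printf("%s: loop took %f s (clock resolution %g s), s = %f\n",
           name, t1 - t0, MPI_Wtick(), s);

    MPI_Finalize();
    return 0;
}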

Parallel Computing 62
Example of MPI Application
 Example. The application implements the
simplest parallel algorithm of matrix-matrix
multiplication.
– See handouts for the source code and comments
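The handout source is not included here; the following compact sketch (the matrix order N, the row-block distribution, and the use of MPI_Scatter/MPI_Bcast/MPI_Gather are assumptions about the simplest algorithm, not the handout itself) shows one way such a program can look in C with MPI.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

#define N 512                        /* illustrative matrix order; assume N % size == 0 */

int main(int argc, char **argv)
{
    int rank, size, rows, i, j, k;
    double *A = NULL, *C = NULL, *B, *Ablock, *Cblock;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    rows = N / size;                 /* rows of A and C owned by each process */

    B      = (double *)malloc(N * N * sizeof(double));
    Ablock = (double *)malloc(rows * N * sizeof(double));
    Cblock = (double *)malloc(rows * N * sizeof(double));

    if (rank == 0) {                 /* root initialises the full A and B */
        A = (double *)malloc(N * N * sizeof(double));
        C = (double *)malloc(N * N * sizeof(double));
        for (i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    /* distribute row blocks of A, replicate B on every process */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Ablock, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < rows; i++)       /* local block of C = local block of A x B */
        for (j = 0; j < N; j++) {
            double s = 0.0;
            for (k = 0; k < N; k++)
                s += Ablock[i * N + k] * B[k * N + j];
            Cblock[i * N + j] = s;
        }

    MPI_Gather(Cblock, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("C[0][0] = %f (expected %f)\n", C[0], 2.0 * N);

    free(B); free(Ablock); free(Cblock);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}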

Parallel Computing 63
Parallel Computing 64
Parallel Languages
 Many scientific programmers find the explicit
message passing tedious and error-prone
– Are used to writing their applications in Fortran
– Consider MPI’s parallel primitives too low level
» unnecessary detailed description of parallel algorithms
– Their algorithms are often straightforward and
based on the data parallel paradigm

Parallel Computing 65
Parallel Languages (ctd)
 Data parallel programming model
– Processors perform the same work on different parts of data
– The distribution of the data across the processors determines
distribution of work and interprocessor communication
 Main features of the data parallel programming style
– Single threaded control
– Global name space
– Loosely synchronous processes
– Parallelism implied by operations on data

Parallel Computing 66
Parallel Languages (ctd)
 Data parallel programming
– Mainly supported by high-level parallel languages
– A compiler generates the explicit message passing code,
which will be executed in parallel by all participating
processors (the SPMD model)
 Main advantage
– Easy to use
» data parallel applications are simple to write
» easy to debug (due to the single thread of control)
» easy to port legacy serial code to MPPs

Parallel Computing 67
Parallel Languages (ctd)
 HPF (High Performance Fortran)
– The most popular data parallel language
– A set of extensions to Fortran
» aimed at writing data parallel programs for MPPs
– Defined by the High Performance Fortran Forum
(HPFF)
» over 40 organizations
– Two main versions of HPF
» HPF 1.1 (November 10, 1994)
» HPF 2.0 (January 31, 1997)
Parallel Computing 68
High Performance Fortran
 HPF 1.1
– Based on Fortran 90
– Specifies language constructs already included into
Fortran 95
» the FORALL construct and statement, PURE procedures
 HPF 2.0
– An extension of Fortran 95
» hence, simply inherits all these data parallel constructs

Parallel Computing 69
High Performance Fortran (ctd)
 Data parallel features provided by HPF
– Inherited from Fortran 95
» Whole-array operations, assignments, and functions
» The FORALL construct and statement
» PURE and ELEMENTAL procedures
– HPF specific
» The INDEPENDENT directive
 can precede an indexed DO loop or FORALL statement
 asserts to the compiler that the iterations in the following statement may be executed independently
 the compiler relies on the assertion in its translation process

Parallel Computing 70
High Performance Fortran (ctd)
 HPF introduces a number of directives
– To suggest the distribution of data among available
processors to the compiler
– The data distribution directives
» structured comments of the form
 !HPF$ directive-body
» do not change the value computed by the program
 an HPF program may be compiled by Fortran compilers and executed serially

Parallel Computing 71
High Performance Fortran (ctd)
 Two basic data distribution directives are
– The PROCESSORS directive
– The DISTRIBUTE directive
 HPF’s view of the parallel machine
– A rectilinear arrangement of abstract processors in one or
more dimensions
– Declared with the PROCESSORS directive specifying
» its name
» its rank (number of dimensions), and
» the extent in each dimension
– Example. !HPF$ PROCESSORS p(4,8)
Parallel Computing 72
High Performance Fortran (ctd)
 Two important intrinsic functions
– NUMBER_OF_PROCESSORS, PROCESSORS_SHAPE
– Example.
» !HPF$ PROCESSORS q(4, NUMBER_OF_PROCESSORS()/4)

 Several processor arrangements may be declared in the same program
– If they are of the same shape, then corresponding elements
of the arrangements refer to the same abstract processor
– Example. If function NUMBER_OF_PROCESSORS returns
32, then p(2,3) and q(2,3) refer to the same processor

Parallel Computing 73
High Performance Fortran (ctd)
 The DISTRIBUTE directive
– Specifies a mapping of data objects (mainly, arrays) to abstract
processors in a processor arrangement
 Two basic types of distribution
– Block
– Cyclic
 Example 1. Block distribution
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
– Array A should be distributed across some set of abstract processors
by partitioning it uniformly into blocks of contiguous elements

Parallel Computing 74
High Performance Fortran (ctd)
 Example 2. The block size may be specified
explicitly
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK(256))

– Groups of exactly 256 elements should be mapped to
successive abstract processors
– There must be at least 40 abstract processors if the
directive is to be satisfied
– The 40th processor will contain a partial block of only 16
elements
Parallel Computing 75
High Performance Fortran (ctd)
 Example 3. Cyclic distribution
INTEGER D(52)
!HPF$ DISTRIBUTE D(CYCLIC(2))
– Successive 2-element blocks of D are mapped to successive
abstract processors in a round-robin fashion

 Example 4. CYCLIC is equivalent to CYCLIC(1)

INTEGER DECK_OF_CARDS(52)
!HPF$ PROCESSORS PLAYERS(4)
!HPF$ DISTRIBUTE DECK_OF_CARDS(CYCLIC) ONTO PLAYERS

Parallel Computing 76
High Performance Fortran (ctd)
 Example 5. Distributions are specified independently for each
dimension of a multidimensional array
INTEGER CHESS_BOARD(8,8), GO_BOARD(19,19)
!HPF$ DISTRIBUTE CHESS_BOARD(BLOCK, BLOCK)
!HPF$ DISTRIBUTE GO_BOARD(CYCLIC,*)
– CHESS_BOARD
» Partitioned into contiguous rectangular patches
» The patches will be distributed onto a 2D processor arrangement
– GO_BOARD
» Rows distributed cyclically over a 1D arrangement of abstract
processors
» ‘*’ specifies that it is not to be distributed along the second axis

Parallel Computing 77
High Performance Fortran (ctd)
 Example 6. The HPF program implementing
matrix operation C=AxB on a 16-processor
MPP, where A, B are dense square 1000x1000
matrices.
– See handouts for its source code
– The PROCESSORS directive specifies a logical
4x4 grid of abstract processors, p

Parallel Computing 78
High Performance Fortran (ctd)
      PROGRAM SIMPLE
      REAL, DIMENSION(1000,1000):: A, B, C
!HPF$ PROCESSORS p(4,4)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO p:: A, B, C
!HPF$ INDEPENDENT
      DO J=1,1000
!HPF$ INDEPENDENT
         DO I=1,1000
            A(I,J)=1.0
            B(I,J)=2.0
         END DO
      END DO
!HPF$ INDEPENDENT
      DO J=1,1000
!HPF$ INDEPENDENT
         DO I=1,1000
            C(I,J)=0.0
            DO K=1,1000
               C(I,J)=C(I,J)+A(I,K)*B(K,J)
            END DO
         END DO
      END DO
      END

Parallel Computing 79
High Performance Fortran (ctd)
 Example 6 (ctd).
– The DISTRIBUTE directive recommends that the compiler
partition each of the arrays A, B, and C into equal-sized
blocks along each of its dimensions
» A 4x4 configuration of blocks each containing 250x250 elements,
one block per processor
» The corresponding blocks of arrays A, B, and C will be mapped to
the same abstract and hence physical processor
– Each INDEPENDENT directive is applied to a DO loop
» Advises the compiler that the loop does not carry any dependences
and therefore its different iterations may be executed in parallel

Parallel Computing 80
High Performance Fortran (ctd)
 Example 6 (ctd).
– Altogether the directives give enough information
to generate a target message-passing program
– Additional information is given by the general
HPF rule
» Evaluation of an expression should be performed on
the processor, in the memory of which its result will be
stored

Parallel Computing 81
High Performance Fortran (ctd)
 A clever HPF compiler will be able to generate for
the program in Example 6 the following SPMD
message-passing code
– See handouts
 The minimization of inter-processor communication
– The main optimisation performed by an HPF compiler
– Not a trivial problem
– No HPF constructs/directives helping the compiler to solve
the problem
» Therefore, HPF is considered a difficult language to compile

Parallel Computing 82
High Performance Fortran (ctd)
 Many real HPF compilers
– Will generate a message-passing program, where
each process sends its blocks of A and B to all
other processes
» This guarantees that each process receives all the
elements of A and B it needs to compute its elements of C
» This universal scheme involves a good deal of
redundant communications
 sending and receiving data that are never used in computation
Parallel Computing 83
High Performance Fortran (ctd)
 Two more HPF directives
– The TEMPLATE and ALIGN directives
» Facilitate coordinated distribution of a group of
interrelated arrays and other data objects
» Provide a 2-level mapping of data objects to abstract
processors
 Data objects are first aligned relative to some template

 The template is then distributed with DISTRIBUTE

 Template is an array of nothings

Parallel Computing 84
High Performance Fortran (ctd)
 Example.

REAL, DIMENSION(10000,10000) :: NW,NE,SW,SE


!HPF$ TEMPLATE EARTH(10001,10001)
!HPF$ ALIGN NW(I,J) WITH EARTH(I,J)
!HPF$ ALIGN NE(I,J) WITH EARTH(I,J+1)
!HPF$ ALIGN SW(I,J) WITH EARTH(I+1,J)
!HPF$ ALIGN SE(I,J) WITH EARTH(I+1,J+1)
!HPF$ DISTRIBUTE EARTH(BLOCK, BLOCK)

Parallel Computing 85
High Performance Fortran (ctd)
 HPF 2.0 extends the presented HPF 1.1 model
in 3 directions
– Greater control over the mapping of the data
» DYNAMIC, REDISTRIBUTE, and REALIGN
– More information for generating efficient code
» RANGE, SHADOW
– Basic support for task parallelism
» The ON directive, the RESIDENT directive, and the
TASK_REGION construct
Parallel Computing 86
MPP Architecture: Summary
 MPPs provide much more parallelism than SMPs
– The MPP architecture is scalable
» No bottlenecks to limit the number of efficiently interacting
processors
 Message passing is the dominant programming model
 No industrial optimising C or Fortran 77 compiler for
the MPP architecture
 Basic programming tools for MPPs
– Message-passing libraries
– High-level parallel languages

Parallel Computing 87
MPP Architecture: Summary (ctd)
 Message-passing libraries directly implement the
message passing paradigm
– Explicit message-passing programming
– MPI is a standard message-passing interface
» Supports efficient and portable parallel programming of MPPs
» Unlike PVM, MPI supports modular parallel programming
 can be used for development of parallel libraries

 Scientific programmers find the explicit message
passing provided by MPI tedious and error-prone
– They use data parallel languages, mainly, HPF
Parallel Computing 88
MPP Architecture: Summary (ctd)
 When programming in HPF
– The programmer specifies the strategy for parallelization
and data partitioning at a higher level of abstraction
– The tedious low-level details are left to the compiler
 HPF programs
– Easy to write and debug
» HPF 2.0 is more complicated and not so easy
– Can express only a quite limited class of parallel
algorithms
– Difficult to compile
Parallel Computing 89
