PP CS
CS-451
Parallel Processing
(BCIT)
Name : _____________________________
Year : _____________________________
Batch : _____________________________
Roll No : _____________________________
Department: _____________________________
Parallel processing has been in common use for decades. To deal with all types of grand challenges we need High Performance Computing / Parallel Processing. Very broadly, parallel processing can be achieved by two approaches, a hardware approach and a software approach, and this lab manual of Parallel Processing has been designed accordingly.
In the hardware approach, all the processors or execution units are placed on the same motherboard, sharing a common memory and other resources on the board. This approach is referred to as Shared Memory Architecture. It is expensive but much faster and easier to program. SMPs and multi-core processors are examples of such systems.
Programming a parallel system is not as easy as programming single-processor systems. There are many considerations, such as the details of the underlying parallel system, the processor interconnection, the use of the correct parallel programming model and the selection of a parallel language, which make parallel programming more difficult. This lab manual is focused on writing parallel algorithms and programming them on distributed and shared memory environments.
Part one of this lab manual is based on Cluster Programming. A four-node Linux-based cluster using MPICH is used for programming. The first lab starts with the basics of MPI and MPICH. The next two labs proceed with communication among the parallel MPI processes. The fourth, fifth and sixth labs deal with the MPI collective operations. In the final lab some non-blocking parallel operations are explored.
Part two of this lab manual deals with SMP and multi-core systems programming. Intel dual-processor systems and Intel quad-core systems are the targeted platforms. This section starts with an introduction to Shared Memory Architectures and the OpenMP API for Windows. The rest of the laboratory sessions are based on the theory and implementation of OpenMP directives and their clauses. Environment variables related to the OpenMP API are also discussed at the end.
CONTENTS
Introduction 1
1 Basics of MPI (Message Passing Interface) 5
2 To learn Communication between MPI processes 10
3 To get familiarized with advance communication between MPI processes 15
4 Study of MPI collective operations using ‘Synchronization’ 21
5 Study of MPI collective operations using ‘Data Movement’ 23
6 Study of MPI collective operations using ‘Collective Computation’ 30
7 To understand MPI Non-Blocking operation 37
Introduction 44
8 Basics of OpenMP API (Open Multi-Processor API) 49
9 To get familiarized with OpenMP Directives 55
10 Sharing of work among threads using Loop Construct in OpenMP 61
11 Clauses in Loop Construct 65
12 Sharing of work among threads in an OpenMP program using ‘Sections Construct’ 74
13 Sharing of work among threads in an OpenMP program using ‘Single Construct’ 78
14 Use of Environment Variables in OpenMP API 82
Part One
Cluster Programming
Introduction
As parallel computers started getting larger, scalability considerations resulted in a pure
distributed memory model. In this model, each CPU has local memory associated with it, and
there is no shared memory in the system. This architecture is scalable since, with every additional
CPU in the system, there is additional memory local to that CPU, which in turn does not present a
bandwidth bottleneck for communication between CPUs and memory. On such systems, the only way for tasks running on distinct CPUs to communicate is for them to explicitly send and receive messages to and from other tasks; this is called message passing. Message passing languages grew in popularity very quickly, and a few of them have emerged as standards in recent years. This section discusses some of the more popular distributed memory environments.
1. Ada
Ada is a programming language originally designed to support the construction of long-lived,
highly reliable software systems. It was developed for the U.S. Department of Defense for real-
time embedded systems. Inter-task communication in Ada is based on the rendezvous mechanism.
The tasks can be created explicitly or declared statically. A task must have a specification part
which declares the entries for the rendezvous mechanisms. It must also have a body part, defined
separately, which contains the accept statements for the entries, data, and the code local to the
task. Ada uses the select statement for expressing nondeterminism. The select statement allows the selection of one among several alternatives, where the alternatives are prefixed by guards. Guards are boolean expressions that establish the conditions that must be true for the corresponding alternative to be a candidate for execution. Another distinguishing feature of Ada is its exception handling mechanism to deal with software errors. A disadvantage of Ada is that it does not provide a way to map tasks onto CPUs.
2. PVM (Parallel Virtual Machine)
Different vendors' PVM implementations can also talk to each other, because of a well-defined inter-PVM daemon protocol. Thus, a PVM application can have tasks running on a cluster of machines, of different types, and running different PVM implementations. Another notable point about PVM is that it provides the programmer with great flexibility for dynamically changing the virtual machine, spawning tasks, and forming groups. It also provides support for fault tolerance and load balancing.
The main disadvantage of PVM is that its performance is not as good as that of other message passing systems such as MPI. This is mainly because PVM sacrifices performance for flexibility. PVM was quickly embraced by many programmers as their preferred parallel programming environment when it was released for public use, particularly by those who were interested in using a network of computers and those who programmed on many different platforms, since this paradigm helped them write one program that would run on almost any platform. The public domain implementation works for almost any UNIX platform, and Windows/NT implementations have also been added.
3. Java
The popularity of the Java language stems largely from its capability and suitability for writing programs that use and interact with resources on the Internet in particular, and clusters of heterogeneous computers in general. The basic Java package, the Java Development Kit or JDK, supports many varieties of distributed memory paradigms corresponding to various levels of abstraction. Additionally, several accessory paradigms have been developed for different kinds of distributed computing using Java, although these do not belong to the JDK.
3.2 Sockets
At the lowest level of abstraction, Java provides socket APIs through its set of socket-related classes. The Socket and ServerSocket classes provide APIs for stream or TCP sockets, and the DatagramSocket, DatagramPacket and MulticastSocket classes provide APIs for datagram or UDP sockets. Each of these classes has several methods that provide the corresponding APIs.
3.4 URLs
At a very high level of abstraction, the Java runtime provides classes via which a program can access resources on another machine in the network. Through the URL and URLConnection classes, a Java program can access a resource on the network by specifying its address in the form of a uniform resource locator. A program can also use the URLConnection class to connect to a resource on the network. Once the connection is established, actions such as reading from or writing to the connection can be performed.
4. MPI (Message Passing Interface)
The two main objectives of MPI are portability and high performance. The MPI environment consists of an MPI library that provides a rich set of functions numbering in the hundreds. MPI defines the concept of communicators, which combine message context and task group to provide message security. Intra-communicators allow safe message passing within a group of tasks. MPI provides many different flavors of both blocking and non-blocking point-to-point communication primitives, and has support for structured buffers and derived data types. It also provides many different types of collective communication routines for communication between tasks belonging to a group. Other functions include those for application-oriented task topologies, profiling, and environmental query and control functions. MPI-2 also adds dynamic spawning of MPI tasks to this impressive list of functions.
5. JMPI
The MPI-2 specification includes bindings for the FORTRAN, C, and C++ languages. However, no binding for Java is planned by the MPI Forum. JMPI is an effort underway at MPI Software Technology Inc. to integrate MPI with Java. JMPI is different from other such efforts in that, where possible, the use of native methods has been avoided in the MPI implementation. Native methods are those that are written in a language other than Java, such as C, C++, or assembly. The use of native methods in Java programs may be necessitated in situations where some platform-dependent feature is needed, or where there is a need to use existing programs written in another language from a Java application. Minimizing the use of native methods in a Java program makes the program more portable. JMPI also includes an optional communication layer that is tightly integrated with the Java Native Interface, which is the native programming interface for Java that is part of the Java Development Kit (JDK). This layer enables vendors to seamlessly implement their own native message passing schemes in a way that is compatible with the Java programming model. Another characteristic of JMPI is that it only implements MPI functionality deemed essential for commercial customers.
6. JPVM
JPVM is an API written using the Java native methods capability so that Java applications can
use the PVM software. JPVM extends the capabilities of PVM to the Java platform, allowing
Java applications and existing C, C++, and FORTRAN applications to communicate with each
other via the PVM API.
Lab Session 1
OBJECT
Basics of MPI (Message Passing Interface) and MPICH.
THEORY
The Message Passing Interface or MPI is a standard for message passing that has been developed by a consortium consisting of representatives from research laboratories, universities, and industry. The first version, MPI-1, was standardized in 1994, and the second version, MPI-2, was developed in 1997. MPI is an explicit message passing paradigm where tasks communicate with each other by sending messages.
The two main objectives of MPI are portability and high performance. The MPI environment
consists of an MPI library that provides a rich set of functions numbering in the hundreds.
MPI defines the concept of communicators, which combine message context and task group to provide message security. Intra-communicators allow safe message passing within a group of tasks, and inter-communicators allow safe message passing between two groups of tasks. MPI provides many different flavors of both blocking and non-blocking point-to-point communication primitives, and has support for structured buffers and derived data types. It also provides many different types of collective communication routines for communication between tasks belonging to a group. Other functions include those for application-oriented
task topologies, profiling, and environmental query and control functions. MPI-2 also adds
dynamic spawning of MPI tasks to this impressive list of functions.
Key Points:
MPICH is a complete implementation of the MPI specification, designed to be both portable and efficient. The "CH" in MPICH stands for "Chameleon," symbol of adaptability to one's environment and thus of portability. Chameleons are fast, and from the beginning a secondary goal was to give up as little efficiency as possible for the portability.
MPICH is a unified source distribution, supporting most flavors of Unix and recent versions of Windows. In addition, binary distributions are available for Windows platforms.
#include <mpi.h>
int main(int argc, char ** argv)
{
    // Serial code
    MPI_Init(&argc,&argv);
    // Parallel code
    MPI_Finalize();
    // Serial code
}
A simple MPI program contains a main function in which the parallel code is placed between MPI_Init and MPI_Finalize.

MPI_Init
It initializes the MPI environment and declares the start of the parallel code segment. It is always called with the addresses of the command-line arguments:
MPI_Init(&argc,&argv)

MPI_Finalize
It declares the end of the parallel code segment. It is important to note that it takes no arguments:
MPI_Finalize()
Key Points:
Every MPI program must include the header file mpi.h (#include <mpi.h>), which provides the declarations for all MPI functions.
A program must have a beginning and an ending. The beginning is in the form of an
MPI_Init() call, which indicates to the operating system that this is an MPI program
and allows the OS to do any necessary initialization. The ending is in the form of an
MPI_Finalize() call, which indicates to the OS that “clean-up” with respect to MPI can
commence.
If the program is embarrassingly parallel, then the operations done between the MPI
initialization and finalization involve no communication.
#include <iostream>
#include <mpi.h>
using namespace std;
int main(int argc, char ** argv)
{
MPI_Init(&argc,&argv);
cout << "Hello World!" << endl;
MPI_Finalize();
}
On compiling and running the above program, a collection of "Hello World!" messages will be printed to your screen, one for each process on which you ran the program, even though there is only one print statement.
For compilation on a Linux terminal: mpicc -o {object name} {file name with .c extension}
For execution on a Linux terminal: mpirun -np {number of processes} {program name}
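For example, assuming the hello-world source above is saved as hello.c (the file name is only illustrative), it could be compiled and launched on four processes as follows; for the C++ examples in this manual the MPICH C++ wrapper compiler (mpicxx or mpiCC, depending on the installation) may be required instead of mpicc:
mpicc -o hello hello.c
mpirun -np 4 ./hello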
7
Parallel Processing Lab Session 1
NED University of Engineering & Technology – Department of Computer & Information Systems Engineering
MPI_Comm_rank: It provides you with your process identification or rank (an integer ranging from 0 to P − 1, where P is the number of processes on which the program is running). It is typically called as
MPI_Comm_rank(MPI_COMM_WORLD,&myrank)

MPI_Comm_size: It provides you with the total number of processes that have been allocated. It is typically called as
MPI_Comm_size(MPI_COMM_WORLD,&mysize)

The argument comm is called the communicator, and it essentially is a designation for a collection of processes which can communicate with each other. MPI has functionality to allow you to specify various communicators (differing collections of processes); however, MPI_COMM_WORLD, which is predefined within MPI and consists of all the processes initiated when a parallel program is launched, is generally used.
An Example Program:
#include <iostream>
#include <mpi.h>
using namespace std;
int main(int argc, char ** argv)
{
int mynode, totalnodes;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
cout << "Hello world from process " << mynode;
cout << " of " << totalnodes << endl;
MPI_Finalize();
}
When run with four processes, the screen output may look like:
Hello world from process 0 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 1 of 4
Key Point:
The output to the screen may not be ordered correctly since all processes are trying to write to
the screen at the same time, and the operating system has to decide on an ordering. However,
the thing to notice is that each process called out with its process identification number and
the total number of MPI processes of which it was a part.
Exercises:
Code:
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
2. Write a program that prints "I am even" for nodes whose rank is divisible by two and prints "I am odd" for the other, odd-ranked nodes.
Program:
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Lab Session 2
OBJECT
To learn communication between MPI processes.
THEORY
It is important to observe that when a program is run with MPI, all processes use the same compiled binary, and hence all processes run exactly the same code. What, then, distinguishes a parallel MPI program running on P processors from the serial version of the code running on P processors? Two things distinguish the parallel program:
Each process uses its process rank to determine what part of the algorithm instructions
are meant for it.
Processes communicate with each other in order to accomplish the final task.
Even though each process receives an identical copy of the instructions to be executed, this does not imply that all processes will execute the same instructions. Because each process is able to obtain its process rank (using MPI_Comm_rank), it can determine which part of the code it is supposed to run. This is accomplished through the use of IF statements. Code that is meant to be run by one particular process should be enclosed within an IF statement which verifies the process identification number of the process. If the code is not placed within IF statements specific to a particular id, then the code will be executed by all processes.
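As a minimal sketch (not one of the manual's examples), the following fragment assumes that mynode already holds the rank returned by MPI_Comm_rank; only the process of rank 0 executes the first branch, while every other process executes the second:

if(mynode == 0)
{
    // Executed only by the process of rank 0,
    // e.g. reading input or printing the final result
}
else
{
    // Executed by every process whose rank is not 0
}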
The second point is communication between processes. MPI communication can be summed up in the concept of sending and receiving messages. Sending and receiving are done with the following two functions: MPI_Send and MPI_Recv.
MPI_Send
int MPI_Send( void* message /* in */, int count /* in */,
    MPI_Datatype datatype /* in */, int dest /* in */,
    int tag /* in */, MPI_Comm comm /* in */ )

MPI_Recv
int MPI_Recv( void* message /* out */, int count /* in */,
    MPI_Datatype datatype /* in */, int source /* in */,
    int tag /* in */, MPI_Comm comm /* in */,
    MPI_Status* status /* out */ )
An Example Program:
The following program demonstrate the use of send/receive function in which sender is
initialized as node two (2) where as receiver is assigned as node four (4). The following
program requires that it should be accommodated on five (5) nodes otherwise the sender and
receiver should be initialized to suitable ranks.
#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char ** argv)
{
    int mynode, totalnodes;
    int sender = 2, receiver = 4, tag = 1;   // ranks and tag used in this example
    int datasize = 10;                       // illustrative message size
    MPI_Status status;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    double * databuffer = new double[datasize];
    // Fill in databuffer on the sender process ...

    if(mynode==sender)
        MPI_Send(databuffer,datasize,MPI_DOUBLE,receiver,
                 tag,MPI_COMM_WORLD);
    if(mynode==receiver)
        MPI_Recv(databuffer,datasize,MPI_DOUBLE,sender,tag,
                 MPI_COMM_WORLD,&status);

    // Send/Recv complete

    delete[] databuffer;
    MPI_Finalize();
}
Key Points:
In general, the message array for both the sender and receiver should be of the same type and both of size at least datasize.
In most cases the sendtype and recvtype are identical.
The tag can be any integer between 0 and 32767.
MPI_Recv may use for the tag the wildcard MPI_ANY_TAG. This allows an MPI_Recv to receive from a send using any tag.
MPI_Send cannot use the wildcard MPI_ANY_TAG. A specific tag must be specified.
MPI_Recv may use for the source the wildcard MPI_ANY_SOURCE. This allows an MPI_Recv to receive from a send from any source.
MPI_Send must specify the process rank of the destination. No wildcard exists.
The following program calculates the sum of numbers from 1 to 1000 in a parallel fashion
while executing on all the cluster nodes and providing the result at the end on only one node.
It should be noted that the print statement for the sum is only executed on the node that is ranked zero (0); otherwise the statement would be printed as many times as there are nodes in the cluster.
#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char ** argv)
{
    int mynode, totalnodes;
    int sum, startval, endval, accum;
    MPI_Status status;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    sum = 0;
    startval = 1000*mynode/totalnodes+1;
    endval = 1000*(mynode+1)/totalnodes;
    for(int i=startval;i<=endval;i=i+1)
        sum = sum + i;

    if(mynode!=0)
        MPI_Send(&sum,1,MPI_INT,0,1,MPI_COMM_WORLD);
    else
        for(int j=1;j<totalnodes;j=j+1)
        {
            MPI_Recv(&accum,1,MPI_INT,j,1,MPI_COMM_WORLD,
                     &status);
            sum = sum + accum;
        }

    if(mynode == 0)
        cout << "The sum from 1 to 1000 is: " << sum << endl;

    MPI_Finalize();
}
Exercise:
1. Code the above example program in C that calculates the sum of numbers in parallel on different numbers of nodes. Also calculate the execution time.
[Note: You have to use a time stamp function to also print the time at the beginning and end of the parallel code segment, as sketched below.]
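As a hint for the timing part, MPI provides the MPI_Wtime function, which returns the wall-clock time in seconds as a double. A minimal sketch of how it could be wrapped around the parallel code segment (the variable names are only illustrative):

double starttime, endtime;

starttime = MPI_Wtime();      // time stamp at the beginning
// ... parallel code segment ...
endtime = MPI_Wtime();        // time stamp at the end

if(mynode == 0)
    cout << "Execution time: " << endtime - starttime
         << " seconds" << endl;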
Execution Time:
Execution Time:
Speedup:
Execution Time:
Speedup:
Execution Time:
Speedup:
2. Suppose you are in a scenario where you have to transmit an array buffer from all other nodes to one node by using the send/receive functions that are used for synchronous inter-process communication. The figure below demonstrates the required functionality of the program.
Figure 2.1: Nodes 1, 2, …, n each send their array buffer to Node 0.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Lab Session 3
OBJECT
To get familiarized with advanced communication between MPI processes.
THEORY
This lab session will focus on more information about sending and receiving in MPI, such as the sending of arrays and simultaneous send and receive.
Key Points
Whenever you send and receive data, MPI assumes that you have provided non-overlapping positions in memory. As discussed in the previous lab session, MPI_COMM_WORLD is referred to as a communicator. In general, a communicator is a collection of processes that can send messages to each other.
MPI_COMM_WORLD is pre-defined in all implementations of MPI, and it consists of
all MPI processes running after the initial execution of the program.
In the send/receive, we are required to use a tag. The tag variable is used to distinguish
upon receipt between two messages sent by the same process.
The order of sending does not necessarily guarantee the order of receiving. Tags are
used to distinguish between messages. MPI allows the tag MPI_ANY_TAG which can
be used by MPI_Recv to accept any valid tag from a sender but you cannot use
MPI_ANY_TAG in the MPI_Send command.
Similar to the MPI_ANY_TAG wildcard for tags, there is also an MPI_ANY_SOURCE wildcard that can be used by MPI_Recv. By using it in an MPI_Recv, a process is ready to receive from any sending process. Again, you cannot use MPI_ANY_SOURCE in the MPI_Send command. There is no wildcard for sender destinations.
When you pass an array to MPI_Send/MPI_Recv, it need not have exactly the number of items to be sent; it must have greater than or equal to the number of items to be sent. Suppose, for example, that you had an array of 100 items, but you only wanted to send the first ten items; you can do so by passing the array to MPI_Send and only stating that ten items are to be sent, as sketched below.
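For instance, a minimal sketch of the last point, assuming an array of 100 doubles of which only the first ten are sent from process 0 to process 1 with an arbitrary tag of 1:

double data[100];
// ... fill in data on process 0 ...

if(mynode == 0)
    MPI_Send(data,10,MPI_DOUBLE,1,1,MPI_COMM_WORLD);   // only 10 items sent
if(mynode == 1)
    MPI_Recv(data,10,MPI_DOUBLE,0,1,MPI_COMM_WORLD,&status);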
An Example Program:
In the following MPI code, an array is created on each process (using an illustrative array size) and initialized on process 0. Once the array has been initialized on process 0, it is sent out to every other process.
#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char ** argv)
{
    int i, mynode, totalnodes;
    const int nitems = 10;   // illustrative array size
    double * array;
    MPI_Status status;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    array = new double[nitems];

    if(mynode == 0)
    {
        for(i=0;i<nitems;i++)
            array[i] = (double) i;
    }

    if(mynode==0)
        for(i=1;i<totalnodes;i++)
            MPI_Send(array,nitems,MPI_DOUBLE,i,1,MPI_COMM_WORLD);
    else
        MPI_Recv(array,nitems,MPI_DOUBLE,0,1,MPI_COMM_WORLD,
                 &status);

    for(i=0;i<nitems;i++)
    {
        cout << "Processor " << mynode;
        cout << ": array[" << i << "] = " << array[i] << endl;
    }

    delete[] array;
    MPI_Finalize();
}
Key Points:
The argument list of the combined send/receive operation contains all of the arguments that were declared separately for the send and receive functions in the previous lab session.
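The combined operation referred to above is presumably MPI_Sendrecv, which performs a send and a receive in a single call; its standard prototype is reproduced here for reference:

int MPI_Sendrecv( void* sendbuf /* in */, int sendcount /* in */,
    MPI_Datatype sendtype /* in */, int dest /* in */, int sendtag /* in */,
    void* recvbuf /* out */, int recvcount /* in */,
    MPI_Datatype recvtype /* in */, int source /* in */, int recvtag /* in */,
    MPI_Comm comm /* in */, MPI_Status* status /* out */ )

Because the MPI library handles the send and the receive together, MPI_Sendrecv avoids the deadlock that can occur when every process first calls a blocking MPI_Send and then MPI_Recv.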
Exercise:
1. Write a program in which every node receives from its left node and sends a message to its right node simultaneously, as depicted in the following figure.
Program:
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
3. Write a parallel program that calculates the sum of an array and execute it for different numbers of nodes in the cluster. Also calculate the respective execution times.
Program Code:
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Execution Time:
___________________________________________________________________________
Execution Time:
Speedup:
___________________________________________________________________________
Execution Time:
Speedup:
___________________________________________________________________________
Execution Time:
Speedup:
___________________________________________________________________________
Lab Session 4
OBJECT
Study of MPI collective operations using 'Synchronization'.
THEORY
Collective operations
Collective operations are performed by MPI routines that are called by each member of a group of processes that want some operation to be performed for them as a group. MPI supports three classes of collective operations:
Synchronization,
Data Movement, and
Collective Computation
These classes are not mutually exclusive, of course, since blocking data movement functions
also serve to synchronize process activity, and some MPI routines perform both data
movement and computation.
Synchronization
MPI_Barrier
MPI_Barrier blocks the calling process until all processes in the communicator have made the call; it is used to synchronize the processes in a group.
int MPI_Barrier( MPI_Comm comm /* in */ )
comm - communicator
Example of Usage
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
MPI_Barrier(MPI_COMM_WORLD);
// At this stage, all processes are synchronized
Key Point
This command is a useful tool to help ensure synchronization between processes. For example, you may want all processes to wait until one particular process has read in data from disk. Each process would call MPI_Barrier at the place in the program where the synchronization is required.
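A minimal sketch of the scenario described above, assuming mynode has been set as in the earlier examples and that ReadDataFromDisk() is a hypothetical helper function:

if(mynode == 0)
{
    // Only process 0 reads the input data
    ReadDataFromDisk();
}

// Every process stops here until all processes
// (including process 0) have reached the barrier
MPI_Barrier(MPI_COMM_WORLD);

// Beyond this point the data has been read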
Exercise:
1. Write a parallel program, after discussing with your instructor, which uses MPI_Barrier.
Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Lab Session 5
OBJECT
Study of MPI collective operations using 'Data Movement'.
THEORY
Collective operations
MPI_Send and MPI_Recv are "point-to-point" communications functions. That is,
they involve one sender and one receiver. MPI includes a large number of subroutines for
performing "collective" operations. Collective operations are performed by MPI routines that
are called by each member of a group of processes that want some operation to be performed
for them as a group. A collective function may specify one-to-many, many-to-one, or many-
to-many message transmission. MPI supports three classes of collective operations:
Synchronization,
Data Movement, and
Collective Computation
These classes are not mutually exclusive, of course, since blocking data movement functions
also serve to synchronize process activity, and some MPI routines perform both data
movement and computation.
There are several routines for performing collective data distribution tasks:
MPI_Bcast: sends a message from one process to all processes in a communicator.
MPI_Alltoall, MPI_Alltoallv: gather data and then scatter it to all participants (all-to-all scatter/gather).
MPI_Bcast
The subroutine MPI_Bcast sends a message from one process to all processes in a
communicator.
Figure 5.1 MPI Bcast schematic demonstrating a broadcast of two data objects from process
zero to all other processes.
int MPI_Bcast( void* buffer /* in/out */, int count /* in */,
    MPI_Datatype datatype /* in */,
    int root /* in */, MPI_Comm comm /* in */ )
MPI_Bcast broadcasts a message from the process with rank "root" to all other processes of
the group.
Example of Usage
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
MPI_Bcast(databuffer,datasize,MPI_DOUBLE,root,MPI_COMM_WORLD);
Key Point
Each process will make an identical call of the MPI_Bcast function. On the broadcasting (root) process, the buffer array contains the data to be broadcast. At the conclusion of the call, all processes have obtained a copy of the contents of the buffer array from process root.
MPI_Scatter:
MPI_Scatter is one of the most frequently used functions in MPI programming. It breaks a structure into portions and distributes those portions to other processes. Suppose you are going to distribute the elements of an array equally to all other nodes in the cluster by decomposing the main array into sub-segments, which are then distributed to the nodes for parallel computation of the array segments on different cluster nodes.
int MPI_Scatter
(
void *send_data,
int send_count,
MPI_Datatype send_type,
void *receive_data,
int receive_count,
MPI_Datatype receive_type,
int sending_process_ID,
MPI_Comm comm.
)
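A minimal usage sketch, assuming globalsize and totalnodes have already been set and that globalsize is evenly divisible by totalnodes; sendarray needs meaningful contents only on the root process (here rank 0):

int root = 0;
int segment = globalsize/totalnodes;              // items per process
double * sendarray = new double[globalsize];      // filled on the root
double * recvarray = new double[segment];

// Fill in sendarray on the root process ...

MPI_Scatter(sendarray,segment,MPI_DOUBLE,
            recvarray,segment,MPI_DOUBLE,
            root,MPI_COMM_WORLD);
// Each process now holds its own segment of the array in recvarray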
MPI_Gather
MPI_Gather is the inverse of MPI_Scatter: it collects the data portions held by the individual processes and gathers them at the root process.
int MPI_Gather
(
    void *sendbuf,
    int sendcount,
    MPI_Datatype sendtype,
    void *recvbuf,
    int recvcount,
    MPI_Datatype recvtype,
    int root,
    MPI_Comm comm
)
Input Parameters:
sendbuf: starting address of send buffer
sendcount: number of elements in send buffer
sendtype: data type of send buffer elements
recvcount: number of elements for any single receive (significant only at root)
recvtype: data type of receive buffer elements (significant only at root)
root: rank of receiving process
comm: communicator
Output Parameter:
recvbuf: address of receive buffer (significant only at root)
EXERCISE:
1. Write a program that broadcasts a number from one process to all others by using
MPI_Bcast.
(Figure: Node 0 broadcasts the number to Node 1, Node 2, Node 3, Node 4, …, Node n-1.)
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
2. Break up a long vector into sub-vectors of equal length. Distribute the sub-vectors to the processes. Let the processes compute the partial sums. Collect the partial sums from the processes and add them at the root node using collective computation operations.
(Figure: MPI_Scatter distributes the array elements from Node 0 to all other nodes in the group (Node 1, Node 2, …, Node n-1); MPI_Gather collects the partial sums from all nodes in the group and gathers them at the root node.)
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
3. Write a parallel program that calculates the value of PI using integral method.
Algorithm: The algorithm suggested here is chosen for its simplicity. The method evaluates
the integral of 4/(1+x*x) between -1/2 and 1/2. The method is simple: the integral is
approximated by a sum of n intervals; the approximation to the integral in each interval is
(1/n)*4/(1+x*x). The master process (rank 0) asks the user for the number of intervals; the
master should then broadcast this number to all of the other processes. Each process then adds
up every n'th interval (x = -1/2+rank/n, -1/2+rank/n+size/n,). Finally, the sums computed by
each process are added together using a reduction.
Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Lab Session 6
OBJECT
Study of MPI collective operations using 'Collective Computation'.
THEORY
Collective operations
MPI_Send and MPI_Recv are "point-to-point" communications functions. That is, they
involve one sender and one receiver. MPI includes a large number of subroutines for
performing "collective" operations. Collective operations are performed by MPI routines that
are called by each member of a group of processes that want some operation to be performed
for them as a group. A collective function may specify one-to-many, many-to-one, or many-
to-many message transmission. MPI supports three classes of collective operations:
Synchronization,
Data Movement, and
Collective Computation
These classes are not mutually exclusive, of course, since blocking data movement
functions also serve to synchronize process activity, and some MPI routines perform both data
movement and computation.
Collective computation is similar to collective data movement with the additional feature that
data may be modified as it is moved. The following routines can be used for collective
computation.
MPI_Reduce:
MPI_Reduce applies an operation to an operand in every participating process. For example, it can add together an integer residing in every process and put the result in a process specified in the MPI_Reduce argument list. The subroutine MPI_Reduce combines data from all processes in a communicator using one of several reduction operations to produce a single result that appears in a specified target process.
When processes are ready to share information with other processes as part of a data reduction, all of the participating processes execute a call to MPI_Reduce, which uses local data to calculate each process's portion of the reduction operation and communicates the local result to other processes as necessary. Only the target (root) process receives the final result.
int MPI_Reduce(
void* operand /* in */,
void* result /* out */,
int count /* in */,
MPI_Datatype datatype /* in */,
MPI_Op operator /* in */,
int root /* in */,
MPI_Comm comm /* in */
)
Example of Usage
The given code receives data only on the root node (rank = 0) and passes NULL in the receive data argument on all other nodes.
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
if(mynode == root)
recvdata = new double[datasize];
// Fill in senddata on all processes
MPI_Reduce(senddata,recvdata,datasize,MPI_DOUBLE,MPI_SUM,
root,MPI_COMM_WORLD);
Key Points
The recvdata array only needs to be allocated on the process of rank root (since root is
the only processor receiving data). All other processes may pass NULL in the place of
the recvdata argument.
Both the senddata array and the recvdata array must be of the same data type. Both
arrays should contain at least datasize elements.
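As a self-contained sketch (not part of the original manual), the following program sums the process ranks onto the process of rank 0 using MPI_SUM:

#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char ** argv)
{
    int mynode, totalnodes, ranksum = 0;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    // Every process contributes its own rank; the sum appears on rank 0
    MPI_Reduce(&mynode,&ranksum,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);

    if(mynode == 0)
        cout << "Sum of all ranks = " << ranksum << endl;

    MPI_Finalize();
}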
MPI_Allreduce:
MPI_Allreduce performs the same reduction as MPI_Reduce, but the result is returned to every process in the communicator rather than only to the root process.
Example of Usage
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
MPI_Allreduce(senddata,recvdata,datasize,MPI_DOUBLE,
MPI_SUM,MPI_COMM_WORLD);
// At this stage, all processes contain the result of
// the reduction (in this case MPI_SUM) in the recvdata array
Remarks
In this case, the recvdata array needs to be allocated on all processes since all
processes will be receiving the result of the reduction.
Both the senddata array and the recvdata array must be of the same data type. Both
arrays should contain at least datasize elements.
MPI_Scan:
MPI_Scan performs a prefix reduction on data distributed across the group: the process with rank i receives in its receive buffer the reduction of the values contributed by processes 0 through i.
#include "mpi.h"
int MPI_Scan (void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm)
Input Parameters
sendbuf: starting address of send buffer
count: number of elements in input buffer
datatype: data type of elements of input buffer
op: operation
comm: communicator
Output Parameter:
recvbuf: starting address of receive buffer
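A small usage sketch (not from the manual) in which each process obtains the prefix sum of all ranks up to and including its own:

int mynode, prefixsum;
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

// After the call, the process of rank i holds 0 + 1 + ... + i in prefixsum
MPI_Scan(&mynode,&prefixsum,1,MPI_INT,MPI_SUM,MPI_COMM_WORLD);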
EXERCISE:
Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
The above problem can best be stated with the help of the following figure (segments S0, S1, S2, …, Sn).
Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
3. Write and explain the argument lists of the following functions and state how they differ from the two functions you have seen:
– MPI_Allreduce
– MPI_Reduce_scatter
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Lab Session 7
OBJECT
THEORY
Figure 7.1: MPI_Isend/MPI_Irecv schematic demonstrating the communication between two processes.
Example of Usage
int totalnodes, mynode;
int datasize, sender, receiver, tag;
MPI_Request request;
MPI_Status status;

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

// Determine datasize; fill in sender, receiver and tag on the
// sender/receiver processes; and fill in databuffer on the
// sender process.
double * databuffer = new double[datasize];

if(mynode == sender)
    MPI_Isend(databuffer, datasize, MPI_DOUBLE, receiver, tag,
              MPI_COMM_WORLD, &request);
if(mynode == receiver)
    MPI_Irecv(databuffer, datasize, MPI_DOUBLE, sender, tag,
              MPI_COMM_WORLD, &request);

// The sender/receiver can be accomplishing various things here
// which do not involve the databuffer array.

MPI_Wait(&request, &status);  // complete the non-blocking call before
                              // reusing or reading databuffer
Key Points
In general, the message array for both the sender and receiver should be of the same
type and both of size at least datasize.
In most cases the sendtype and recvtype are identical.
After the MPI_Isend call and before the MPI_Wait call, the contents of message
should not be changed.
After the MPI_Irecv call and before the MPI_Wait call, the contents of message
should not be used.
An MPI_Send can be received by an MPI_Irecv/MPI_Wait.
An MPI_Recv can obtain information from an MPI_Isend/MPI_Wait.
The tag can be any integer between 0 and 32767.
MPI_Irecv may use the wildcard MPI_ANY_TAG for the tag. This allows an MPI_Irecv to
receive a message sent with any tag.
MPI_Isend cannot use the wildcard MPI_ANY_TAG; a specific tag must be specified.
MPI_Irecv may use the wildcard MPI_ANY_SOURCE for the source. This allows an
MPI_Irecv to receive a message sent from any source (a short sketch follows these key points).
MPI_Isend must specify the process rank of the destination. No wildcard exists.
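To illustrate the wildcard key points above, the following fragment (assuming MPI has been
initialized and <stdio.h> is included) posts a non-blocking receive with MPI_ANY_SOURCE and
MPI_ANY_TAG, and then inspects the MPI_Status filled in by MPI_Wait to discover which process
actually sent the message and with which tag.

double buf[100];
MPI_Request request;
MPI_Status  status;

MPI_Irecv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, &request);

// ... unrelated work that does not use buf ...

MPI_Wait(&request, &status);                 // complete the receive
printf("Received from rank %d with tag %d\n",
       status.MPI_SOURCE, status.MPI_TAG);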
EXERCISE:
1. Write a parallel program that uses non-blocking communication between processes to
calculate the sum of a set of numbers in parallel on different numbers of nodes. Also
calculate the execution time.
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Execution Time:
_________________________________________________________________________
Execution Time:
Speedup:
_________________________________________________________________________
Execution Time:
Speedup:
_________________________________________________________________________
Execution Time:
Speedup:
_________________________________________________________________________
2. Write two programs that utilize the functions MPI_Waitall and MPI_Waitany
respectively.
Program 1:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Program 2:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Part Two
Introduction
As parallel computers matured, demand for higher-level abstractions for programming such
machines grew. Most of the early parallel computers consisted of tens of CPUs connected to each
other, and to a globally addressable shared memory, via a bus. That is, a task running on any CPU
could access any memory location in the computer with equal speed. Such systems are referred to
as Uniform Memory Access or UMA architectures. For these machines, their respective vendors
started providing shared memory languages. Such languages typically consist of a means for
spawning multiple tasks for a problem, synchronization constructs that allow tasks to exchange
data via shared memory, and mechanisms that allow tasks to synchronize with each other via
barriers and related functions. Such environments are characterized as pure shared memory
environments.
As the number of CPUs started increasing on parallel computers, bus based architectures
exhibited limited performance improvement, since the bus bandwidth requirement reached its
saturation point. Hence larger parallel computers adopted switch interconnects for connecting
CPUs to memory modules. For such systems the distance of a memory module from a CPU is not
constant, resulting in a non uniform memory access speed by a task. Such systems are referred to
as Non Uniform Memory Access or NUMA architectures.
On distributed memory architectures, by contrast, each CPU has local memory that is only
addressable by tasks on that CPU, and the collection of all these local memory modules forms the
entire memory of the system. Programming to this model involves explicitly passing information,
or messages, from a task on one CPU to another, since no shared memory exists that can be
utilized for this purpose. This message passing programming model is viewed by many as difficult
when compared to the shared memory model. To overcome this difficulty in using distributed
memory machines, vendors of such systems started providing simulated globally addressable
shared memory environments to the user, taking care of the translation to the physically
distributed local memories in the operating system. Such systems are referred to as Distributed
Shared Memory or DSM. Since the memory is not really shared in hardware, but is presented as
such to the programmer, this shared memory paradigm is characterized as a virtual shared
memory paradigm.
Pure shared memory languages are designed for systems where the entire memory is uniformly
globally addressable; such environments make no provision for, nor do they have any tuning
hooks for, applications to take care of non-uniform memory accesses. While these languages will
continue to work on virtual shared memory systems, they have no inherent features to support the
NUMA characteristics that virtual shared memory machines exhibit. Most of the early parallel
machines had their own shared memory parallel environments, and all of those had a similar
flavor. From these languages evolved the concept of threads. Threads are lightweight entities,
similar to processes except that they require minimal resources to run. A process may consist of
several threads, each of which represents a separate execution
context; hence, a separate program counter, stacks, and registers. All threads of a process share
the remaining resources with the other threads in the process. Two types of thread environment
are gaining widespread popularity across various platforms: Pthreads and Java threads.
1.1 Pthreads
Pthreads is an abbreviation for POSIX threads and refers to a standard specification for threads
developed by the POSIX committee. The Pthreads environment provides two types of
synchronization primitives: mutexes and condition variables. Mutexes are simple lock primitives
that can be used to control access to a shared resource; the operations supported on a mutex to
achieve this are the lock and unlock primitives. Only one thread may own a mutex at a time and is
thus guaranteed exclusive access to the associated resource.
Synchronization using mutexes may not be sufficient for many programs since they have limited
functionality. Condition variables supplement the functionality of mutexes by allowing threads to
block and wait for an event and be woken up when the event occurs. Pthreads are limited to use
within a single address space and cannot be spread across distinct address spaces.
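To make the two Pthreads primitives concrete, the following minimal sketch (hypothetical names;
error checking omitted) uses a mutex to protect a shared counter and a condition variable to let
one thread wait until the counter becomes non-zero.

#include <pthread.h>

int             count = 0;
pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

void *producer(void *arg)
{
    pthread_mutex_lock(&lock);        /* exclusive access to count  */
    count++;
    pthread_cond_signal(&ready);      /* wake one waiting thread    */
    pthread_mutex_unlock(&lock);
    return NULL;
}

void *consumer(void *arg)
{
    pthread_mutex_lock(&lock);
    while (count == 0)                /* block until the event occurs */
        pthread_cond_wait(&ready, &lock);
    pthread_mutex_unlock(&lock);
    return NULL;
}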
The Java language also provides a threads programming model via its Thread and ThreadGroup
classes. The Thread class implements a generic thread that, by default, does nothing. Users
specify the body of the thread by providing a run method for their Thread objects. The
ThreadGroup class provides a mechanism for manipulating a collection of threads at the same
time, such as starting or stopping a set of threads via a single invocation of a method.
Synchronization between the various threads in a program is provided via two constructs:
synchronized blocks and wait-notify constructs. In the Java language, a block or a method of the
program that is identified with the synchronized keyword represents a critical section in the
program. The Java platform associates a lock with every object that has synchronized code. The
wait construct allows a thread to relinquish its lock and wait for notification of a given event, and
the notify construct allows a thread to signal the occurrence of an event to another thread that is
waiting for this event.
Virtual shared memory environments typically provide tuning hooks that allow data to be placed
close to particular CPUs and to be migrated from one CPU to another. These hooks can be used by
a parallel program in an attempt to maximize the performance of the system by controlling the
placement of data, both statically and dynamically, so that it is near the tasks that access it. The
biggest advantage of such an environment is that it still preserves a shared memory programming
interface, which is considered by many to be a simpler programming model than a message
passing interface. Popular examples of such environments are described below.
2.1 Linda
Linda allows tasks to communicate with each other by inserting and retrieving data items, called
tuples, into a distributed shared memory called the tuple space. A tuple consists of a string, which
is the tuple's identifier, and zero or more data items. A tuple space is a segment of memory on one
or more computers whose purpose is to serve as a temporary storage area for data being
transferred between tasks. It is an associative memory abstraction where tasks communicate by
inserting and removing tuples from this tuple space. When a task is ready to send information to
another task, it places the corresponding tuple in the tuple space. When the receiver task is ready
to receive this information, it retrieves this tuple from the tuple space. This decouples the send
and receive parts of the communication so that the sender task does not have to block until the
receiver is ready to receive the data being communicated.
Linda works very well for smaller parallel programs with a few component tasks. However, as
the number of tasks in a program increases, controlling access to a single tuple space and
managing the tasks becomes difficult. This is mainly because there is no scoping in the tuple
space, making the probability of access conflicts higher, which places a responsibility on
programmers to be extra careful when the number of tasks becomes large.
2.2 SHMEM
The SHMEM environment provides the view of a logically shared, distributed memory to the
programmer and is available on massively parallel distributed memory machines as well as on
distributed shared memory machines. It enables tasks to communicate among themselves via
low-latency, high-bandwidth primitives. In addition to being used on shared memory architectures,
SHMEM can be used by tasks in distinct address spaces to explicitly pass data among each other.
It provides an efficient alternative to using message passing for inter-task communication.
SHMEM supports remote data transfer through put operations, which can transfer data to another
task, and through get operations, which can retrieve data from another task. This one-sided style
of communication offered by SHMEM makes programming simpler, since a matching receive
request need not be issued for every send request. SHMEM also supports broadcast and reduction
operations, barrier synchronization, and atomic memory operations.
High Performance Fortran or HPF is a standard defined for FORTRAN parallel programs with
the goals of achieving program portability across a number of parallel machines, and achieving
high performance on parallel computers with non-uniform memory access costs. HPF supports the
data parallel programming model, where data are divided across machines, and the same program
is executed on different machines on different subsets of the overall data. The HPF environment
provides software tools, such as HPF compilers, that produce programs for parallel computers
with non-uniform access costs. There have been two versions of this standard to date. The first
version, HPF-1, defined in 1993, was an extension to FORTRAN 90. The second version, HPF-2, defined
in 1997, is an extension to the current FORTRAN standard (FORTRAN 95).
An HPF programmer expresses parallelism explicitly in his program, and the data distribution is
tuned to control the load balance and to minimize inter-task communication. On the other hand,
given a data distribution, an HPF compiler may be able to identify operations that can be
executed concurrently, and thus generate even more efficient code. HPF's constructs allow
programmers to indicate potential parallelism at a relatively high level, without entering into the
low-level details of message-passing and synchronization. When an HPF program is compiled,
the compiler assumes responsibility for scheduling the parallel operations on the physical
machines, thereby reducing the time and effort required for parallel program development.
Pthreads cannot be extended across distinct address space boundaries, such as a cluster of
workstations. Remote threads, or Rthreads, extend Pthreads-like constructs between address
spaces. They provide a software distributed shared memory system that supports sharing of global
variables on clusters of computers with physically distributed memory. Rthreads use explicit
function calls to access distributed shared data. Its synchronization primitives are syntactically
and semantically similar to those of Pthreads. The Rthreads environment consists of a pre-
compiler that automatically transforms Pthreads programs into Rthreads programs. The
programmer can still change the Rthreads code for further optimization in the transformed program. Also,
Pthreads and Rthreads can be mixed within a single program. Heterogeneous clusters are
supported in Rthreads by implementing it on top of portable message passing environments such
as PVM, MPI, and DCE.
3.5 OpenMP
OpenMP is a new standard that has been defined by several vendors as the standard Application
Programming Interface or API for the shared memory multiprocessing model. It attempts to
standardize existing practices from several different vendor-specific shared memory
environments. OpenMP provides a portable shared memory API across different platforms
including DEC, HP, IBM, Intel, Silicon Graphics/Cray, and Sun. The languages supported by
OpenMP are FORTRAN, C and C++. Its main emphasis is on performance and scalability.
OpenMP consists of a collection of directives, library routines, and environment variables used to
specify shared memory parallelism in a program's source code. It standardizes fine grained (loop
level) parallelism and also supports coarse-grain parallelism. Fine grain parallelism is achieved
via the fork/join model. A typical OpenMP program starts executing as a single task and, on
encountering a parallel construct, a group of tasks is spawned to execute the parallel region, each
with its own data environment. The compiler is responsible for assigning the appropriate iterations
to the tasks in the group. The parallel region ends at the end of the construct (the END DO
construct in FORTRAN, or the end of the parallel loop in C/C++), which represents an implied
barrier. At this point, the results of the parallel region are used to update the data
environment of the original task, which then resumes execution. This sequence of fork/join
actions is repeated for every parallel construct in the program, enabling loop-level parallelism in a
program. For enabling coarse-grain parallelism effectively, OpenMP introduces the concept of
orphan directives, which are directives encountered outside the lexical extent of the parallel
region. This allows a parallel program to specify control from anywhere inside the parallel
region, as opposed to only from the lexically contained portion, which is often necessary in
coarse-grained parallel programs.
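To make the idea of an orphaned directive concrete, here is a minimal sketch (the function name
do_work and the array a are hypothetical): the for directive inside do_work() lies outside the
lexical extent of the parallel region in main(), yet its iterations are still divided among the team
of threads created there.

#include <omp.h>

void do_work(int n, double *a)
{
    int i;
    #pragma omp for            /* orphaned: no enclosing parallel construct here */
    for (i = 0; i < n; i++)
        a[i] = a[i] * 2.0;
}

int main()
{
    int    i;
    double a[1000];
    for (i = 0; i < 1000; i++) a[i] = i;

    #pragma omp parallel       /* team created here                   */
    do_work(1000, a);          /* work-sharing happens inside the call */
    return 0;
}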
Lab Session 8
OBJECT
THEORY
OpenMP
OpenMP is a portable and standard Application Program Interface (API) that may be used to
explicitly direct multi-threaded, shared memory parallelism.
Goals of OpenMP
Explicit Parallelism:
OpenMP is an explicit (not automatic) programming model, offering the programmer
full control over parallelization.
All OpenMP programs begin as a single process: the master thread. The master
thread executes sequentially until the first parallel region construct is encountered.
FORK: the master thread then creates a team of parallel threads
The statements in the program that are enclosed by the parallel region construct are
then executed in parallel among the various team threads
JOIN: When the team threads complete the statements in the parallel region construct,
they synchronize and terminate, leaving only the master thread
Dynamic Threads:
The API provides for dynamically altering the number of threads which may be used to
execute different parallel regions.
Implementations may or may not support this feature.
I/O:
OpenMP specifies nothing about parallel I/O. This is particularly important if multiple
threads attempt to write/read from the same file.
If every thread conducts I/O to a different file, the issues are not as significant.
It is entirely up to the programmer to ensure that I/O is conducted correctly within the
context of a multi-threaded program.
#include <omp.h>

int main()
{
    /* Serial code */

    /* Beginning of parallel section: fork a team of threads */
    #pragma omp parallel
    {
        /* Parallel section executed by all threads */
        .
        .
        .
        /* All threads join the master thread and disband */
    }

    /* Resume serial code */
    return 0;
}
directive: A C or C++ #pragma followed by the omp identifier, other text, and a new line.
The directive specifies program behavior.
dynamic extent: All statements in the lexical extent, plus any statement inside a function that
is executed as a result of the execution of statements within the lexical extent. A dynamic
extent is also referred to as a region.
structured block: A structured block is a statement (single or compound) that has a single
entry and a single exit. No statement is a structured block if there is a jump into or out of that
statement. A compound statement is a structured block if its execution always begins at the
opening { and always ends at the closing }. An expression statement, selection statement, or
iteration statement is a structured block if the corresponding compound statement obtained by
enclosing it in { and } would be a structured block. A jump statement, labeled statement, or
declaration statement is not a structured block.
Thread: An execution entity having a serial flow of control, a set of private variables, and
access to shared variables.
master thread: The thread that creates a team when a parallel region is entered.
serial region: Statements executed only by the master thread outside of the dynamic extent of
any parallel region.
parallel region: Statements that bind to an OpenMP parallel construct and may be executed
by multiple threads.
Private: A private variable names a block of storage that is unique to the thread making the
reference. Note that there are several ways to specify that a variable is private: a definition
within a parallel region, a threadprivate directive, a private, firstprivate, lastprivate, or
reduction clause, or use of the variable as a for-loop control variable in a for loop immediately
following a for or parallel for directive.
Shared: A shared variable names a single block of storage. All threads in a team that access
this variable will access this single block of storage.
Serialize: To execute a parallel construct with a team of threads consisting of only a single
thread (which is the master thread for that parallel construct), with serial order of execution
for the statements within the structured block (the same order as if the block were not part of a
parallel construct), and with no effect on the value returned by omp_in_parallel() (apart from
the effects of any nested parallel constructs).
Barrier: A synchronization point that must be reached by all threads in a team. Each thread
waits until all threads in the team arrive at this point. There are explicit barriers identified by
directives and implicit barriers created by the implementation.
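As a small illustration of an explicit barrier, the following fragment (with hypothetical functions
phase_one() and phase_two()) guarantees that every thread finishes phase_one() before any
thread starts phase_two():

#pragma omp parallel
{
    phase_one();            /* executed by every thread in the team       */
    #pragma omp barrier     /* explicit barrier: all threads wait here    */
    phase_two();            /* starts only after all have finished above  */
}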
EXERCISE:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Lab Session 9
OBJECT
THEORY
OpenMP directives for C/C++ are specified with the pragma preprocessing directive:
#pragma omp directive-name [clause, ...] newline
Where:
directive-name: A valid OpenMP directive. Must appear after the pragma and before
any clauses.
[clause, ...]: Optional, Clauses can be in any order, and repeated as necessary unless
otherwise restricted.
Newline: Required, Precedes the structured block which is enclosed by this directive.
General Rules:
Case sensitive
Directives follow conventions of the C/C++ standards for compiler directives
Only one directive-name may be specified per directive
Each directive applies to at most one succeeding statement, which must be a structured
block.
Long directive lines can be "continued" on succeeding lines by escaping the newline
character with a backslash ("\") at the end of a directive line.
• Parallel Construct
• Work-Sharing Constructs
Loop Construct
Sections Construct
Single Construct
• Barrier Construct
• Critical Construct
• Atomic Construct
• Locks
• Master Construct
Directive Scoping
Static (Lexical) Extent:
The code textually enclosed between the beginning and the end of a structured block
following a directive.
The static extent of a directive does not span multiple routines or code files.
Orphaned Directive:
An OpenMP directive that appears independently of any enclosing directive, i.e. outside
the static (lexical) extent of a parallel region; it may span routines and even code files.
Dynamic Extent:
The dynamic extent of a directive includes both its static (lexical) extent and the
extents of its orphaned directives.
Parallel Construct
This construct is used to specify the computations that should be executed in parallel. Parts of
the program that are not enclosed by a parallel construct will be executed serially. When a
thread encounters this construct, a team of threads is created to execute the associated parallel
region, which is the code dynamically contained within the parallel construct. But although
this construct ensures that computations are performed in parallel, it does not distribute the
work of the region among the threads in a team. In fact, if the programmer does not use the
appropriate syntax to specify this action, the work will be replicated. At the end of a parallel
region, there is an implied barrier that forces all threads to wait until the work inside the
region has been completed. Only the initial thread continues execution after the end of the
parallel region.
The thread that encounters the parallel construct becomes the master of the new team. Each
thread in the team is assigned a unique thread number (also referred to as the “thread id”) to
identify it. They range from zero (for the master thread) up to one less than the number of
threads within the team, and they can be accessed by the programmer. Although the parallel
region is executed by all threads in the team, each thread is allowed to follow a different path
of execution.
Format:
#pragma omp parallel [clause ...] newline
    structured_block

Example:

#include <omp.h>
#include <stdio.h>

int main()
{
    #pragma omp parallel
    {
        printf("The parallel region is executed by thread %d\n",
               omp_get_thread_num());
    }   /*-- End of parallel region --*/
    return 0;
}
The parallel directive accepts the following clauses:
    if (scalar-expression)
    num_threads (integer-expression)
    private (list)
    firstprivate (list)
    shared (list)
    default (shared | none)
    copyin (list)
    reduction (operator : list)
Details and usage of these clauses are discussed in Lab Session 11.
Key Points:
When execution encounters a parallel directive, the value of the if clause or num_threads
clause (if any) on the directive, the current parallel context, and the values of the nthreads-var,
dyn-var, thread-limit-var, max-active-level-var, and nest-var ICVs are used to determine the
number of threads to use in the region.
The following example demonstrates the num_threads clause. The parallel region is executed
with a maximum of 10 threads.
#include <omp.h>
int main()
{
    ...
    /* at most 10 threads execute this parallel region */
    #pragma omp parallel num_threads(10)
    { ... }
}
Some programs rely on a fixed, pre-specified number of threads to execute correctly. Because
the default setting for the dynamic adjustment of the number of threads is implementation-
defined, such programs can choose to turn off the dynamic threads capability and set the
number of threads explicitly to ensure portability. The following example shows how to do
this using omp_set_dynamic and omp_set_num_threads.
Example:
#include <omp.h>
int main()
{
    omp_set_dynamic(0);        /* disable dynamic adjustment of the team size */
    omp_set_num_threads(16);   /* request exactly 16 threads                  */
    #pragma omp parallel
    {
        /* parallel region executed by 16 threads */
    }
}
EXERCISE:
1. Code the above example programs and write down their outputs.
Lab Session 10
OBJECT
THEORY
Introduction
OpenMP’s work-sharing constructs are the most important feature of OpenMP. They are used
to distribute computation among the threads in a team. C/C++ has three work-sharing
constructs. A work-sharing construct, along with its terminating construct where appropriate,
specifies a region of code whose work is to be distributed among the executing threads; it also
specifies the manner in which the work in the region is to be parceled out. A work-sharing
region must bind to an active parallel region in order to have an effect. If a work-sharing
directive is encountered in an inactive parallel region or in the sequential part of the program,
it is simply ignored. Since work-sharing directives may occur in procedures that are invoked
both from within a parallel region as well as outside of any parallel regions, they may be
exploited during some calls and ignored during others.
#pragma omp for: the iterations of the loop that follows are distributed among the threads
#pragma omp sections: each enclosed section is executed by one of the threads
#pragma omp single: only one thread executes the code block
A work-sharing construct does not launch new threads and does not have a barrier on entry.
By default, threads wait at a barrier at the end of a work-sharing region until the last thread
has completed its share of the work. However, the programmer can suppress this by using the
nowait clause.
The loop construct causes the iterations of the loop immediately following it to be executed in
parallel. At run time, the loop iterations are distributed across the threads. This is probably the
most widely used of the work-sharing features.
Format:
#pragma omp for [clause ...] newline
    for_loop

Example:
#include <omp.h>
#include <stdio.h>

int main()
{
    int i, n = 9;
    #pragma omp parallel shared(n) private(i)
    {
        #pragma omp for
        for (i=0; i<n; i++)
            printf("Thread %d executes loop iteration %d\n",
                   omp_get_thread_num(), i);
    }
    return 0;
}
Here we use a parallel directive to define a parallel region and then share its work among
threads via the for work-sharing directive: the #pragma omp for directive states that the
iterations of the loop following it will be distributed. Within the loop, we use the OpenMP
function omp_get_thread_num(), this time to obtain and print the number of the executing
thread in each iteration. Clauses on the parallel construct state which data in the region is
shared and which is private. Although not strictly needed, since this is enforced by the
compiler, loop variable i is explicitly declared to be a private variable, which means that each
thread will have its own copy of i. Its value is also undefined after the loop has finished.
Variable n is made shared.
Output from the example, which is executed for n = 9 and uses four threads:
Combined parallel work-sharing constructs are shortcuts that can be used when a parallel
region comprises precisely one work-sharing construct, that is, the work-sharing region
includes all the code in the parallel region. The semantics of the shortcut directives are
identical to explicitly specifying the parallel construct immediately followed by the work-
sharing construct.
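For instance, the loop example of this session could be written more compactly with the
combined construct; this is a sketch under the same assumptions as before (n and i declared as
in that example):

#pragma omp parallel for shared(n) private(i)
for (i = 0; i < n; i++)
    printf("Thread %d executes loop iteration %d\n",
           omp_get_thread_num(), i);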
EXERCISE:
1. Code the above example programs and write down their outputs.
Output of Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
2. Write a parallel program, after discussing it with your instructor, which uses the loop construct.
Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Lab Session 11
OBJECT
THEORY
Introduction
The OpenMP Data Scope Attribute Clauses are used to explicitly define how variables should
be scoped.
Data Scope Attribute Clauses are used in conjunction with several directives (PARALLEL,
DO/for, and SECTIONS) to control the scoping of enclosed variables.
These constructs provide the ability to control the data environment during execution of
parallel constructs.
They define how and which data variables in the serial section of the program are
transferred to the parallel sections of the program (and back)
They define which variables will be visible to all threads in the parallel sections and
which variables will be privately allocated to all threads.
List of Clauses
PRIVATE
FIRSTPRIVATE
LASTPRIVATE
SHARED
DEFAULT
REDUCTION
COPYIN
PRIVATE Clause
The PRIVATE clause declares variables in its list to be private to each thread.
Notes:
o A new object of the same type is declared once for each thread in the team
o All references to the original object are replaced with references to the new
object
o Variables declared PRIVATE are uninitialized for each thread
Example of the private clause – Each thread has a local copy of variables i and a.
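The original listing is not reproduced here, so the following minimal sketch (assuming i, a and n
are previously declared integers) illustrates the idea: because i and a appear in the private
clause, every thread works on its own copies.

#pragma omp parallel for private(i, a)
for (i = 0; i < n; i++)
{
    a = i + 1;                       /* each thread has its own copy of a */
    printf("Thread %d: a = %d\n", omp_get_thread_num(), a);
}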
SHARED Clause
The SHARED clause declares variables in its list to be shared among all threads in the team.
Notes:
A shared variable exists in only one memory location and all threads can read or write
to that address
It is the programmer's responsibility to ensure that multiple threads properly access
SHARED variables (such as via CRITICAL sections)
Example of the shared clause – All threads can read from and write to vector a.
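A minimal sketch along the lines of the missing listing (assuming a is a shared array of at least
n doubles): all threads read from and write to the single copy of a, each thread touching
different elements.

#pragma omp parallel for shared(a) private(i)
for (i = 0; i < n; i++)
    a[i] = a[i] + 1.0;               /* one shared copy of vector a */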
DEFAULT Clause
The DEFAULT clause allows the user to specify a default PRIVATE, SHARED, or NONE
scope for all variables in the lexical extent of any parallel region.
The default clause is used to give variables a default data-sharing attribute. Its usage is
straightforward. For example, default (shared) assigns the shared attribute to all variables
referenced in the construct. This clause is most often used to define the data-sharing attribute
of the majority of the variables in a parallel region. Only the exceptions need to be explicitly
listed.
If default(none) is specified instead, the programmer is forced to specify a data-sharing
attribute for each variable in the construct. Although variables with a predetermined data-
sharing attribute need not be listed in one of the clauses, it is strongly recommended that the
attribute be explicitly specified for all variables in the construct.
Notes:
Specific variables can be exempted from the default using the PRIVATE, SHARED,
FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses
The C/C++ OpenMP specification does not include "private" as a possible default.
However, actual implementations may provide this option.
Only one DEFAULT clause can be specified on a PARALLEL directive
Example of the DEFAULT clause – all variables are shared, with the exception of a, b, and c.
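A minimal sketch of what such a directive could look like (the variables a, b and c are
hypothetical):

#pragma omp parallel default(shared) private(a, b, c)
{
    /* all other variables referenced here are shared; */
    /* a, b and c are private to each thread           */
}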
FIRSTPRIVATE Clause
The FIRSTPRIVATE clause combines the behavior of the PRIVATE clause with automatic
initialization of the variables in its list.
Notes:
Listed variables are initialized according to the value of their original objects prior to
entry into the parallel or work-sharing construct
Example using the firstprivate clause – Each thread has a pre-initialized copy of variable
indx. This variable is still private, so threads can update it individually.
#pragma omp parallel firstprivate(indx) private(TID)
{
    TID = omp_get_thread_num();
    indx += n*TID;      /* indx starts from its value before the parallel region */
}
LASTPRIVATE Clause
The LASTPRIVATE clause combines the behavior of the PRIVATE clause with a copy from
the last loop iteration or section to the original variable object.
Notes:
The value copied back into the original variable object is obtained from the last
(sequentially) iteration or section of the enclosing construct.
It ensures that the last value of a data object listed is accessible after the
corresponding construct has completed execution
Example of the lastprivate clause – This clause makes the sequentially last value of variable
a accessible outside the parallel loop.
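A minimal sketch along these lines (assuming i, n and a are previously declared integers): after
the loop, a holds the value assigned in the sequentially last iteration, i.e. when i equals n-1.

#pragma omp parallel for private(i) lastprivate(a)
for (i = 0; i < n; i++)
    a = i + 1;                       /* value from the last iteration is kept */
printf("Value of a after the parallel loop: %d\n", a);   /* a == n */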
COPYIN Clause
The COPYIN clause provides a means for assigning the same value to THREADPRIVATE
variables for all threads in the team.
Notes:
List contains the names of variables to copy. The master thread variable is used as the
copy source. The team threads are initialized with its value upon entry into the parallel
construct.
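A minimal sketch of copyin together with a threadprivate variable (the variable counter is
hypothetical): on entry to the parallel region, the master thread's value of counter is copied into
every thread's private copy.

#include <omp.h>

int counter = 100;
#pragma omp threadprivate(counter)

void example(void)
{
    #pragma omp parallel copyin(counter)
    {
        counter += omp_get_thread_num();   /* every thread starts from 100 */
    }
}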
REDUCTION Clause
The REDUCTION clause performs a reduction on the variables that appear in its list. A
private copy of each list variable is created for each thread. At the end of the reduction, the
reduction operation is applied to all private copies of the shared variable, and the final result is
written to the global shared variable.
Notes:
Variables in the list must be named scalar variables. They can not be array or structure
type variables. They must also be declared SHARED in the enclosing context.
Reduction operations may not be associative for real numbers.
The REDUCTION clause is intended to be used on a region or work-sharing construct
in which the reduction variable is used only in statements which have one of following
forms:
C / C++
x = x op expr
x = expr op x (except subtraction)
x binop = expr
x++
++x
x--
--x
x is a scalar variable in the list
expr is a scalar expression that does not reference x
op is not overloaded, and is one of +, *, -, /, &, ^, |, &&, ||
binop is not overloaded, and is one of +, *, -, /, &, ^, |
Example of REDUCTION - Vector Dot Product. Iterations of the parallel loop will be
distributed in equal sized blocks to each thread in the team (SCHEDULE STATIC). At the
end of the parallel loop construct, all threads will add their values of "result" to update the
master thread's global copy.
#include <omp.h>
#include <stdio.h>

int main()
{
    int   i, n = 100, chunk = 10;
    float a[100], b[100], result = 0.0;
    for (i = 0; i < n; i++) { a[i] = i * 1.0; b[i] = i * 2.0; }

    #pragma omp parallel for schedule(static,chunk) reduction(+:result)
    for (i = 0; i < n; i++)
        result = result + (a[i] * b[i]);

    printf("Final result = %f\n", result);
    return 0;
}
SCHEDULE:
The schedule clause describes how iterations of the loop are divided among the threads in the
team. The default schedule is implementation dependent. The supported kinds are listed below;
a short example follows the list.
STATIC
Loop iterations are divided into pieces of size chunk and then statically assigned to
threads. If chunk is not specified, the iterations are evenly (if possible) divided
contiguously among the threads.
DYNAMIC
Loop iterations are divided into pieces of size chunk, and dynamically scheduled
among the threads; when a thread finishes one chunk, it is dynamically assigned
another. The default chunk size is 1.
GUIDED
For a chunk size of 1, the size of each chunk is proportional to the number of
unassigned iterations divided by the number of threads, decreasing to 1. For a chunk
size with value k (greater than 1), the size of each chunk is determined in the same
way with the restriction that the chunks do not contain fewer than k iterations (except
for the last chunk to be assigned, which may have fewer than k iterations). The default
chunk size is 1.
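Here is the short example referred to above (the function work() is hypothetical): with dynamic
scheduling and a chunk size of 4, a thread grabs the next four iterations as soon as it finishes its
current chunk, which helps when iteration costs vary.

#pragma omp parallel for schedule(dynamic, 4) private(i)
for (i = 0; i < n; i++)
    work(i);                         /* iterations handed out 4 at a time */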
Nowait Clause
The nowait clause allows the programmer to fine-tune a program’s performance. When we
introduced the work-sharing constructs, we mentioned that there is an implicit barrier at the
end of them. This clause overrides that feature of OpenMP; in other words, if it is added to a
construct, the barrier at the end of the associated construct will be suppressed. When threads
reach the end of the construct, they will immediately proceed to perform other work. Note,
however, that the barrier at the end of a parallel region cannot be suppressed.
Example of the nowait clause in C/C++ – The clause ensures that there is no barrier at the
end of the loop.
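The missing listing can be sketched as follows (assuming a, b and n are declared and initialized
earlier): the nowait clause removes the implicit barrier at the end of the first loop, so threads
that finish early move straight on to the second loop.

#pragma omp parallel shared(a, b, n) private(i)
{
    #pragma omp for nowait           /* no barrier at the end of this loop */
    for (i = 0; i < n; i++)
        a[i] = a[i] * 2.0;

    #pragma omp for                  /* implicit barrier kept here */
    for (i = 0; i < n; i++)
        b[i] = b[i] + 1.0;
}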
Clauses / Directives Summary
The following clauses can appear on the directives PARALLEL, DO/for, SECTIONS, SINGLE,
PARALLEL DO/for and PARALLEL SECTIONS (not every clause is valid on every directive):
    IF, PRIVATE, SHARED, DEFAULT, FIRSTPRIVATE, LASTPRIVATE, REDUCTION,
    COPYIN, SCHEDULE, ORDERED, NOWAIT
The following directives accept no clauses at all:
    MASTER, CRITICAL, BARRIER, ATOMIC, FLUSH, ORDERED, THREADPRIVATE
Implementations may (and do) differ from the standard in which clauses are supported
by each directive.
EXERCISE:
1. Code the above example programs and write down their outputs.
Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Lab Session 12
OBJECT
THEORY
Introduction
OpenMP’s work-sharing constructs are the most important feature of OpenMP. They are used
to distribute computation among the threads in a team. C/C++ has three work-sharing
constructs. A work-sharing construct, along with its terminating construct where appropriate,
specifies a region of code whose work is to be distributed among the executing threads; it also
specifies the manner in which the work in the region is to be parceled out. A work-sharing
region must bind to an active parallel region in order to have an effect. If a work-sharing
directive is encountered in an inactive parallel region or in the sequential part of the program,
it is simply ignored. Since work-sharing directives may occur in procedures that are invoked
both from within a parallel region as well as outside of any parallel regions, they may be
exploited during some calls and ignored during others.
A work-sharing construct does not launch new threads and does not have a barrier on entry.
By default, threads wait at a barrier at the end of a work-sharing region until the last thread
has completed its share of the work. However, the programmer can suppress this by using the
nowait clause .
At run time, the specified code blocks are executed by the threads in the team. Each thread
executes one code block at a time, and each code block will be executed exactly once. If there
are fewer threads than code blocks, some or all of the threads execute multiple code blocks. If
there are fewer code blocks than threads, the remaining threads will be idle. Note that the
assignment of code blocks to threads is implementation-dependent.
Format:
#pragma omp sections [clause ...] newline
{
    #pragma omp section newline
        structured_block
    #pragma omp section newline
        structured_block
}
Combined parallel work-sharing constructs are shortcuts that can be used when a parallel
region comprises precisely one work-sharing construct, that is, the work-sharing region
includes all the code in the parallel region. The semantics of the shortcut directives are
identical to explicitly specifying the parallel construct immediately followed by the work-
sharing construct.
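For instance, when the parallel region contains only a sections construct, the combined form can
be used; this sketch uses the funcA() and funcB() functions of the example that follows:

#pragma omp parallel sections
{
    #pragma omp section
    (void) funcA();
    #pragma omp section
    (void) funcB();
}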
If two or more threads are available, one thread invokes funcA() and another thread calls
funcB(). Any other threads are idle.
#include <omp.h>
#include <stdio.h>

void funcA(void);
void funcB(void);

int main()
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            (void) funcA();

            #pragma omp section
            (void) funcB();
        }  /*-- End of sections block --*/
    }      /*-- End of parallel region --*/
    return 0;
}

void funcA()
{
    printf("In funcA: this section is executed by thread %d\n",
           omp_get_thread_num());
}

void funcB()
{
    printf("In funcB: this section is executed by thread %d\n",
           omp_get_thread_num());
}
Output from the example; the code is executed by using two threads.
EXERCISE:
1. Code the above example programs and write down their outputs.
Output of Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
2. Write a parallel program, after discussing it with your instructor, which uses the sections
construct.
Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Parallel Processing Lab Session 13
NED University of Engineering & Technology – Department of Computer & Information Systems Engineering
Lab Session 13
OBJECT
Sharing of work among threads in an OpenMP program using ‘Single Construct’.
THEORY
Introduction
OpenMP’s work-sharing constructs are the most important feature of OpenMP. They are used
to distribute computation among the threads in a team. C/C++ has three work-sharing
constructs. A work-sharing construct, along with its terminating construct where appropriate,
specifies a region of code whose work is to be distributed among the executing threads; it also
specifies the manner in which the work in the region is to be parceled out. A work-sharing
region must bind to an active parallel region in order to have an effect. If a work-sharing
directive is encountered in an inactive parallel region or in the sequential part of the program,
it is simply ignored. Since work-sharing directives may occur in procedures that are invoked
both from within a parallel region as well as outside of any parallel regions, they may be
exploited during some calls and ignored during others.
A work-sharing construct does not launch new threads and does not have a barrier on entry.
By default, threads wait at a barrier at the end of a work-sharing region until the last thread
has completed its share of the work. However, the programmer can suppress this by using the
nowait clause.
The Single Construct
The single construct specifies that the associated block of code is to be executed by only one
thread in the team. It does not matter which thread executes the block, as long as the work
gets done by exactly one thread. The other threads wait at a barrier until the thread executing
the single code block has completed.
Format:
#pragma omp single [clause ...]
    structured_block
where clause is one of the following:
private (list)
firstprivate (list)
nowait
#include <stdio.h>
#include <omp.h>

int main()
{
    int i, a;
    int n = 9;
    int b[9];

    #pragma omp parallel shared(a,b) private(i)
    {
        #pragma omp single
        {
            a = 10;
            printf("Single construct executed by thread %d\n",
                   omp_get_thread_num());
        }

        /* The implied barrier of the single construct guarantees that a
           has been assigned before any thread uses it below */
        #pragma omp for
        for (i = 0; i < n; i++)
            b[i] = a;
    } /*-- End of parallel region --*/

    printf("After the parallel region:\n");
    for (i = 0; i < n; i++)
        printf("b[%d] = %d\n", i, b[i]);

    return 0;
}
Output from the example; the value of variable n is set to 9, and four threads are used.
b[0] = 10
b[1] = 10
b[2] = 10
b[3] = 10
b[4] = 10
b[5] = 10
b[6] = 10
b[7] = 10
b[8] = 10
EXERCISE:
1. Code the above example programs and write down their outputs.
Output of Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
2. Write a parallel program, after discussing with your instructor, that uses the Single
Construct.
Program:
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
________________________________________________________
Parallel Processing Lab Session 14
NED University of Engineering & Technology – Department of Computer & Information Systems Engineering
Lab Session 14
OBJECT
Use of Environment Variables in OpenMP API.
THEORY
Environment Variables
This Lab session describes the OpenMP environment variables that specify the settings of the
ICVs that affect the execution of OpenMP programs. The names of the environment variables
must be upper case. The values assigned to the environment variables are case insensitive and
may have leading and trailing white space. Modifications to the environment variables after
the program has started, even if modified by the program itself, are ignored by the OpenMP
implementation. However, the settings of some of the ICVs can be modified during the
execution of the OpenMP program by the use of the appropriate directive clauses or OpenMP
API routines.
OMP_SCHEDULE
OMP_NUM_THREADS
OMP_DYNAMIC
OMP_NESTED
OMP_STACKSIZE
OMP_WAIT_POLICY
OMP_MAX_ACTIVE_LEVELS
OMP_THREAD_LIMIT
OMP_SCHEDULE
The OMP_SCHEDULE environment variable controls the schedule type and chunk size of all
loop directives that have the schedule type runtime, by setting the value of the run-sched-var
ICV.
The value of this environment variable takes the form: type[,chunk], where type is one of
static, dynamic, guided or auto, and chunk is an optional positive integer that specifies the
chunk size.
Example:
set OMP_SCHEDULE=guided,4
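OMP_SCHEDULE only affects loops whose schedule type is runtime. A minimal sketch of such a loop (the loop body is a placeholder):
#include <stdio.h>
#include <omp.h>

int main()
{
    int i;

    /* The schedule for this loop is taken from the run-sched-var ICV,
       i.e. from OMP_SCHEDULE, when the program runs */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < 16; i++)
        printf("Iteration %d executed by thread %d\n", i, omp_get_thread_num());

    return 0;
}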
OMP_NUM_THREADS
The OMP_NUM_THREADS environment variable sets the number of threads to use for
parallel regions by setting the initial value of the nthreads-var ICV. The value of this
environment variable must be a positive integer. The behavior of the program is
implementation defined if the requested value of OMP_NUM_THREADS is greater than the
number of threads an implementation can support, or if the value is not a positive integer.
Example:
set OMP_NUM_THREADS=16
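The value requested through OMP_NUM_THREADS can be checked from inside a program, for example with omp_get_max_threads() and omp_get_num_threads() (a minimal sketch):
#include <stdio.h>
#include <omp.h>

int main()
{
    /* Reports the nthreads-var ICV, e.g. as set by OMP_NUM_THREADS */
    printf("Maximum threads for parallel regions: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("Threads in this team: %d\n", omp_get_num_threads());
    }
    return 0;
}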
OMP_DYNAMIC
The OMP_DYNAMIC environment variable controls dynamic adjustment of the number of
threads to use for executing parallel regions by setting the initial value of the dyn-var ICV.
The value of this environment variable must be true or false. If the environment variable is
set to true, the implementation may adjust the number of threads used for parallel regions in
order to optimize the use of system resources; if set to false, dynamic adjustment is disabled.
The behavior of the program is implementation defined if the value of OMP_DYNAMIC is
neither true nor false.
Example:
set OMP_DYNAMIC=true
OMP_NESTED
The OMP_NESTED environment variable controls nested parallelism by setting the initial
value of the nest-var ICV. The value of this environment variable must be true or false. If the
environment variable is set to true, nested parallelism is enabled; if set to false, nested
parallelism is disabled. The behavior of the program is implementation defined if the value of
OMP_NESTED is neither true nor false.
Example:
set OMP_NESTED=false
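The same dyn-var and nest-var settings can be queried, and changed at run time, through the corresponding API routines (a minimal sketch):
#include <stdio.h>
#include <omp.h>

int main()
{
    /* Reflect the initial values, e.g. as set by OMP_DYNAMIC and OMP_NESTED */
    printf("Dynamic adjustment: %s\n", omp_get_dynamic() ? "enabled" : "disabled");
    printf("Nested parallelism: %s\n", omp_get_nested() ? "enabled" : "disabled");

    /* The corresponding ICVs can also be modified during execution */
    omp_set_dynamic(0);   /* disable dynamic adjustment of the number of threads */
    omp_set_nested(1);    /* enable nested parallelism */

    return 0;
}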
OMP_STACKSIZE
The OMP_STACKSIZE environment variable controls the size of the stack for threads
created by the OpenMP implementation, by setting the value of the stacksize-var ICV. The
environment variable does not control the size of the stack for the initial thread. The value of
this environment variable takes the form:
size | sizeB | sizeK | sizeM | sizeG
where:
size is a positive integer that specifies the size of the stack for threads that are created
by the OpenMP implementation.
B, K, M, and G are letters that specify whether the given size is in Bytes, Kilobytes,
Megabytes, or Gigabytes, respectively. If one of these letters is present, there may be
white space between size and the letter.
If only size is specified and none of B, K, M, or G is specified, then size is assumed to
be in Kilobytes.
The behavior of the program is implementation defined if OMP_STACKSIZE does
not conform to the above format, or if the implementation cannot provide a stack with
the requested size.
Examples:
set OMP_STACKSIZE=2000500B
set OMP_STACKSIZE=10M
set OMP_STACKSIZE=20000
OMP_WAIT_POLICY
The OMP_WAIT_POLICY environment variable provides a hint to an OpenMP
implementation about the desired behavior of waiting threads by setting the wait-policy-var
ICV. A compliant implementation may or may not abide by the setting of this environment
variable. Its value must be ACTIVE or PASSIVE.
The ACTIVE value specifies that waiting threads should mostly be active, i.e., consume
processor cycles, while waiting. An OpenMP implementation may, for example, make waiting
threads spin. The PASSIVE value specifies that waiting threads should mostly be passive, i.e.,
not consume processor cycles, while waiting. An OpenMP implementation may, for example,
make waiting threads yield the processor to other threads or go to sleep.
Examples:
set OMP_WAIT_POLICY=ACTIVE
set OMP_WAIT_POLICY=PASSIVE
OMP_MAX_ACTIVE_LEVELS
The OMP_MAX_ACTIVE_LEVELS environment variable controls the maximum number of
nested active parallel regions by setting the initial value of the max-active-levels-var ICV.
The value of this environment variable must be a non-negative integer. The behavior of the
program is implementation defined if the requested value of OMP_MAX_ACTIVE_LEVELS
is greater than the maximum number of nested active parallel levels an implementation can
support, or if the value is not a non-negative integer.
OMP_THREAD_LIMIT
The OMP_THREAD_LIMIT environment variable sets the number of OpenMP threads to use
for the whole OpenMP program by setting the thread-limit-var ICV. The value of this
environment variable must be a positive integer. The behavior of the program is
implementation defined if the requested value of OMP_THREAD_LIMIT is greater than the
number of threads an implementation can support, or if the value is not a positive integer.
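The current thread limit can be read back with omp_get_thread_limit(), available from OpenMP 3.0 onward (a minimal sketch):
#include <stdio.h>
#include <omp.h>

int main()
{
    /* Reports the thread-limit-var ICV, e.g. as set by OMP_THREAD_LIMIT */
    printf("Thread limit for this program: %d\n", omp_get_thread_limit());
    return 0;
}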