PP Unit 2 Tesseract
TOPIC-1
Parallel Programming on CPU-I
Vectorization
Vectorization is a technique used in computer science to perform operations on entire arrays or sequences
of data elements simultaneously, instead of processing each element individually. It's commonly used in
numerical and scientific computing, as well as in various data analysis and machine learning tasks. In
parallel computing, processors have special vector units that can load and operate on more than one data
element at a time.
SIMD overview
Vectorization is an example of single instruction, multiple data (SIMD) processing because it executes a
single operation (e.g., addition, division) over a large dataset. A scalar operation, in the context of
mathematics and computer science, refers to an operation that is performed on a single scalar value, as
opposed to a vector, matrix, or any other data structure. Scalars are single numerical values and can be
integers, floating-point numbers, or other numerical types.
Vectorization Terminology:
Vector (SIMD) lane: A pathway through a vector operation on vector registers for a single data element,
much like a lane on a multi-lane freeway.
Vector width: The width of the vector unit, usually expressed in bits.
Vector length: The number of data elements that can be processed by the vector unit in one operation.
Vector (SIMD) instruction sets: The set of instructions that extend the regular scalar processor
instructions to utilize the vector processor.
Vectorization requires both a software component (a compiler that generates vector instructions) and a
hardware component (the processor's vector units). When the compiler vectorizes a loop, it may generate
extra loops around the vectorized kernel loop:
A "remainder loop" iterates over the elements that are left over after the vectorized kernel loop has
processed as many full vectors as possible. For example, if the loop trip count is 20 and the vector
length is 16, the kernel loop executes once (covering 16 elements) and the remaining 4 iterations are
executed in the remainder loop.
The peel loop is added to deal with unaligned data at the start of the loop, and the remainder
loop takes care of any extra data at the end of the loop.
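A conceptual sketch of this split, assuming a vector length of 4 elements (illustrative only, not actual
compiler output):
void add_arrays(double *a, const double *b, const double *c, int n)
{
    int i;
    int vec_len = 4;                      /* assumed vector length */
    int kernel_end = n - (n % vec_len);   /* last index covered by full vectors */
    /* kernel loop: each group of 4 iterations maps onto vector instructions */
    for (i = 0; i < kernel_end; i += vec_len) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    /* remainder loop: processes the leftover (n % 4) elements one at a time */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}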
Vector intrinsics:
Vector intrinsics are low-level programming constructs used to write explicit vectorized code by
directly utilizing the SIMD (Single Instruction, Multiple Data) instructions available on modern
processors. Vector intrinsics look like ordinary function calls in a higher-level language such as C or
C++, but each maps almost directly onto one or more SIMD instructions, so they act as a thin layer over
assembly and are often specific to a particular CPU architecture.
Intrinsics provide data types for vectors (e.g., __m128 a; declares the variable a to be a
vector of 4 floats). They also provide functions that operate directly on vectors
(e.g., _mm_add_ps(a, b) adds together the two vectors a and b).
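A minimal sketch in C using SSE intrinsics (assumes an x86 processor with SSE support; the array values
are illustrative):
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128 and the _mm_* functions */

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    float z[4];

    __m128 a = _mm_loadu_ps(x);    /* load 4 floats into a vector register (unaligned load) */
    __m128 b = _mm_loadu_ps(y);
    __m128 c = _mm_add_ps(a, b);   /* 4 additions performed by one vector instruction */
    _mm_storeu_ps(z, c);           /* store the 4 results back to memory */

    printf("%f %f %f %f\n", z[0], z[1], z[2], z[3]);
    return 0;
}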
Assembler instructions:
Using assembly language for vectorization involves writing low-level code that directly employs
SIMD (Single Instruction, Multiple Data) instructions to perform operations on multiple data
elements in parallel.
The example below demonstrates vectorization using x86 assembly with SSE (Streaming
SIMD Extensions) instructions:
section .data
align 16
array1 dd 1.0, 2.0, 3.0, 4.0   ; First array of single-precision floats
array2 dd 5.0, 6.0, 7.0, 8.0   ; Second array of single-precision floats
result dd 0.0, 0.0, 0.0, 0.0   ; Array to store the result
section .text
_start:
movaps xmm0, [array1]   ; Load 128 bits (4 x 32-bit floats) from array1 into xmm0
movaps xmm1, [array2]   ; Load 128 bits (4 x 32-bit floats) from array2 into xmm1
addps xmm0, xmm1        ; Perform vectorized (packed single-precision) addition
movaps [result], xmm0   ; Store the result back to the result array
; (program exit sequence omitted)
Programming style for better Vectorization:
Adopting the following programming styles leads to better performance out of the box and less
work needed for optimization efforts.
General suggestions:
Use the restrict attribute on pointers in function arguments and declarations (C and C++); see the sketch after this list.
Use pragmas or directives where needed to inform the compiler.
Be careful with optimizing for the compiler with #pragma unroll and other techniques; you might
limit the possible options for the compiler transformations.
Put exceptions and error checks with print statements in a separate loop.
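A small sketch of the restrict suggestion (the function and parameter names are illustrative): declaring
the pointers restrict tells the compiler they do not alias, so it can vectorize the loop without runtime
overlap checks.
void scale_add(double * restrict a, const double * restrict b,
               const double * restrict c, int n)
{
    /* with restrict, the compiler may assume a, b, and c do not overlap */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 2.0 * c[i];
}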
Concerning data structures:
Try to use a data structure with a long length for the innermost loop
Use the smallest data type needed (short rather than int).
Use contiguous memory accesses.
Use Structure of Arrays (SOA) rather than Array of Structures (AOS).
Array of Structures (AoS) (the structure variable is an array):
struct person {
    char gender;
    int age;
} s[5];
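For comparison, a Structure of Arrays (SoA) sketch of the same data (the struct name mirrors the AoS
example above): each field is stored contiguously, which gives the unit-stride memory accesses that
vector units handle best.
struct people {
    char gender[5];   /* all genders stored contiguously */
    int age[5];       /* all ages stored contiguously */
} p;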
MPI (Message Passing Interface)
MPI allows programs to be written in a distributed-memory programming model, where each process
has its own local memory space and communicates with other processes using message passing.
Processes can send and receive messages, making it possible for them to exchange data and
synchronize their execution.
In the context of MPI (Message Passing Interface) programming, the terms "communication world"
and "rank" are fundamental concepts used to manage communication and coordination among
parallel processes in a parallel computing environment.
1. Communication World:
A communication world, also known as a communicator, is a group of MPI processes that can
communicate with each other.
MPI_COMM_WORLD is the default communicator that includes all processes created when
the MPI application starts.
Communicators allow processes to be organized into groups, enabling more controlled and
specific communication patterns.
2. Rank:
Rank refers to the unique identifier assigned to each process within a communicator.
In MPI_COMM_WORLD, ranks range from 0 to (number of processes - 1).
Ranks are used to distinguish one process from another within the same communicator.
Processes can communicate with each other using their ranks as identifiers.
The diagram shows a program that runs with five processes. In this example, the size of
MPI_COMM_WORLD is 5. The rank of each process is the number inside each circle; the ranks
therefore range from 0 to 4.
MPI Functions
1. MPI_Comm_rank(MPI_COMM_WORLD, &rank): Obtains the rank of the calling process within a
communicator.
Parameters
MPI_COMM_WORLD: This is a predefined communicator in MPI that includes all processes
spawned by the MPI program. It is a communicator for the world of all processes.
&rank: This is the address of the variable where the rank of the calling process will be stored.
The function retrieves the rank and stores it in the memory location pointed to by the &rank
variable.
2. MPI_Comm_size(MPI_COMM_WORLD, &size) is an MPI function call which retrieves
the total number of processes in the communicator MPI_COMM_WORLD.
Parameters
&size: This is the address of the variable where the total number of processes in the
communicator will be stored. The function retrieves the size and stores it in the memory location
pointed to by the &size variable.
3. MPI_Init(&argc, &argv) is an MPI (Message Passing Interface) function call used to
initialize the MPI environment. It is typically the first MPI function called in an MPI program.
Parameters
&argc, &argv: These pass pointers to the argc and argv variables to the MPI library. argc holds the
number of command-line arguments and argv holds the array of strings containing them. When MPI
initializes, it sets up communication channels between the processes and prepares the MPI
environment for parallel computation.
4. MPI_Finalize() is an MPI (Message Passing Interface) function used to finalize the MPI
environment. It is typically the last MPI function called in an MPI program, and it performs several
important tasks to ensure the proper termination of the MPI application. MPI_Finalize() ensures
that all communication operations initiated by the program are completed before the program
terminates.
MPI Program Structure
MPI Program
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get the rank of the current process
    MPI_Comm_size(MPI_COMM_WORLD, &size); // Get the total number of processes in the communicator
    printf("Hello from process %d of %d in MPI_COMM_WORLD\n", rank, size);
    MPI_Finalize();
    return 0;
}
MPI Datatypes: MPI provides its own reference data types corresponding to the various elementary
data types in C, for example MPI_INT for int, MPI_FLOAT for float, MPI_DOUBLE for double, and
MPI_CHAR for char.
[Diagram: MPI communication: point-to-point communication (blocking and non-blocking), barrier
synchronization, and collective operations (broadcast, scatter, gather, allgather, reduce, allreduce).]
The core of the message-passing approach is to send a message from point to point or, perhaps
more precisely, from process to process. The whole point of parallel processing is to coordinate work.
The three components of process-to-process communication are the mailbox, the message, and the envelope.
Mail box: There must be a mailbox at either end of the system. The size of the mailbox is
important. The sending side knows the size of the message, but the receiving side does not.
To make sure there is a place for the message to be stored, it is usually better to post the
receive first. This avoids delaying the message by forcing the receiving process to allocate a
temporary space to store the message until a receive is posted and it can copy it to the right
location. For an analogy, if the receive (the mailbox) is not posted, the postman has to wait
until someone puts one up. Posting the receive first also avoids the possibility of there being
insufficient memory on the receiving end to allocate a temporary buffer to store the message.
Message: The message itself is always described by a triplet at both ends: a pointer to a
memory buffer, a count, and a type. The type and count at the sending end can differ from those
at the receiving end. The rationale for using types and counts is that it allows the conversion of
types between the processes at the source and at the destination. This permits a message to be
converted to a different form at the receiving end. In a heterogeneous environment, this might
mean converting little-endian to big-endian data, a low-level difference in the byte order of data
stored by different hardware vendors. Also, the receive size can be greater than the amount
sent; this permits the receiver to query how much data was actually sent so it can properly handle the
message. But the receive size cannot be smaller than the send size because that would cause
a write past the end of the buffer.
Envelope: The envelope is also composed of a triplet: the rank, the tag, and the communicator
(communication group). It defines who the message is from, who it is sent to, and a message identifier
to keep multiple messages from getting confused. The rank is relative to the specified communicator.
The tag helps the programmer and MPI distinguish which message goes to which receive. In
MPI, the tag is a convenience; on the receive side it can be set to MPI_ANY_TAG if an explicit tag
number is not desired.
We have two types of process-to-process communication: blocking and non-blocking.
Blocking communication in MPI refers to the type of communication where a process halts its
execution until a specific communication operation is completed. This means that the sending
and receiving processes are synchronized, which is also referred to as synchronous communication.
The sender blocks until its buffer is safe to reuse (in the synchronous case, until the receiver is
ready to receive the message), and the receiver blocks until the message has arrived. The two most
common blocking communication operations in MPI are MPI_Send and MPI_Recv.
MPI_Send is a blocking communication function in MPI (Message Passing Interface) used for
sending messages from one process to another. It sends a message from the sender process to the
specified destination process. Here is the syntax for MPI_Send
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm);
buf: A pointer to the send buffer (the data you want to send).
count: The number of elements in the send buffer.
datatype: The data type of the elements in the send buffer.
dest: The rank of the destination process.
tag: A message tag, which can be used by the receiver to distinguish different kinds of
messages.
comm: The communicator (usually MPI_COMM_WORLD for communication among all
processes).
MPI_Recv is a blocking communication function in MPI (Message Passing Interface) used for
receiving messages from other processes. It receives a message from a specified source process.
Here is the syntax for MPI_Recv
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm
comm, MPI_Status *status);
buf: A pointer to the receive buffer (where the received data will be stored).
count: The number of elements in the receive buffer.
datatype: The data type of the elements in the receive buffer.
source: The rank of the source process from which you want to receive the message. Use
MPI_ANY_SOURCE if you want to receive a message from any source.
tag: A message tag. If you used tags in MPI_Send, you can use the same tag here to filter
messages. MPI_ANY_TAG matches any tag.
comm: The communicator (usually MPI_COMM_WORLD for communication among all
processes).
status: A pointer to an MPI_Status structure that will hold information about the received
message, such as the source, tag, and error codes.
Program
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    int data_send = 42;
    int data_recv;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        printf("This program requires at least 2 processes.\n");
        MPI_Finalize();
        return 1;
    }
    // Blocking send from process 0 to process 1
    if (rank == 0) {
        MPI_Send(&data_send, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("Process %d sent data: %d\n", rank, data_send);
    }
    // Blocking receive at process 1
    else if (rank == 1) {
        MPI_Recv(&data_recv, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received data: %d\n", rank, data_recv);
    }
    MPI_Finalize();
    return 0;
}
Problems with blocking communication:
A deadlock occurs when a set of processes is blocked because each is waiting for another to
release a resource. For example, if two processes each wait to receive a message from the other
before sending their own, they will be deadlocked.
Processes that are blocked waiting for communication can waste computational resources,
such as CPU time and memory, because they perform no useful work during that time. These
problems motivate non-blocking communication.
Non-blocking communication
Non-blocking communication in MPI allows processes to initiate communication operations and
continue their execution without waiting for the communication to complete. This is often referred
to as asynchronous or non-blocking communication: the call initiates the operation but does not
wait for the work to complete.
MPI provides non-blocking communication functions such as MPI_Isend (the I stands for immediate),
MPI_Irecv, MPI_Test, MPI_Wait, and others to facilitate non-blocking communication.
Completion of a non-blocking send operation means that the sender is now free to update the
send buffer (the message).
Completion of a non-blocking receive operation means that the receive buffer (the message)
contains the received data.
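A minimal sketch of non-blocking communication between two processes (the value 42 is illustrative;
run with at least two processes). MPI_Wait marks the point where each operation must be complete
before the buffer is reused or read:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        /* ... useful computation can overlap the send here ... */
        MPI_Wait(&request, MPI_STATUS_IGNORE);   /* send buffer now safe to reuse */
    } else if (rank == 1) {
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... useful computation can overlap the receive here ... */
        MPI_Wait(&request, MPI_STATUS_IGNORE);   /* receive buffer now contains the data */
        printf("Process 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}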
Collective Communication
1. Synchronization:
MPI_Barrier is a collective operation that blocks each calling process until all processes in the
communicator have reached the barrier.
Syntax:
int MPI_Barrier(MPI_Comm communicator);
communicator: The communicator that defines the group of processes that synchronize at the barrier.
The MPI_Barrier function is often used to coordinate the execution of processes in a parallel
program, for example, when different processes perform different parts of a computation and need
to ensure that they all reach a certain point before proceeding.
Program
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Some computation before the barrier
    printf("Process %d reached the barrier.\n", rank);
    MPI_Barrier(MPI_COMM_WORLD); // All processes wait here until everyone reaches this point
    // Code after the barrier
    MPI_Finalize();
    return 0;
}
2. Data Movement (or Global Communication):
Broadcast: In a broadcast operation, one process sends the same data to all other
processes in a group. It is often used to distribute information from one process to all
others.
Syntax
MPI_Bcast( void* data, int count, MPI_Datatype datatype, int root, MPI_Comm communicator)
data: A pointer to the data buffer. On the root process this holds the data to broadcast; on all
other processes it receives the broadcast data.
count: The number of data elements in the buffer.
datatype: The data type of the elements in the buffer.
root: The rank of the process within the communicator that is broadcasting the data.
communicator: The communicator that defines the group of processes over which the broadcast is
performed.
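A minimal MPI_Bcast sketch (the value 100 is illustrative): rank 0 fills the buffer and broadcasts it,
after which every rank holds the same value.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 100;                                  /* only the root fills the buffer */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD); /* root = 0 */
    printf("Process %d has value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}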
Gather: The gather operation collects data from all processes in a group and sends it to a
designated process. This is useful when you want to aggregate data from multiple sources.
Syntax
MPI_Gather( void* sendbuf, int send_count, MPI_Datatype sendtype, void * recvbuf, int
recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator)
sendbuf: A pointer to the send buffer (the data each process contributes).
send_count: The number of elements each process sends from its send buffer.
recvbuf: A pointer to the receive buffer on the root process. This is where the gathered data will
be stored.
recvcount: The number of elements to receive from each process.
root: The rank of the root process, which will receive the gathered data.
Program
#include<stdio.h>
#include<mpi.h>
int main(int argc, char* argv[])
{
    int d = 0, r, s, a[5];   // a[5] assumes the program runs with at most 5 processes
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &s);
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    d = r * 2;
    MPI_Gather(&d, 1, MPI_INT, a, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (r == 0)
    {
        printf("data received by process 0 = ");
        for (int i = 0; i < s; i++)
            printf("%d\t", a[i]);
    }
    MPI_Finalize();
    return 0;
}
Allgather : MPI_Allgather distributes the gathered data to all processes in the
communicator, not just to the root process. Each process receives the entire gathered dataset.
Syntax
MPI_Allgather( void* sendbuf, int sendcount, MPI_Datatype senddatatype, void* recvbuf, int
recvcount, MPI_Datatype recvtype, MPI_Comm communicator)
sendbuf: A pointer to the send buffer (data to be sent) on each process.
sendcount: The number of elements to send from the send buffer on each process.
recvbuf: A pointer to the receive buffer on each process. This is where the gathered data will
be stored.
recvcount: The number of elements to receive from each process.
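A minimal MPI_Allgather sketch (the local values are illustrative): each process contributes one int,
and every process ends up with the complete gathered array.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * 2;                   /* each process contributes one value */
    int *all = malloc(size * sizeof(int));
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("Process %d sees:", rank);
    for (int i = 0; i < size; i++)
        printf(" %d", all[i]);
    printf("\n");

    free(all);
    MPI_Finalize();
    return 0;
}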
Reduce: In a reduce operation, data from all processes is combined using an associative and
commutative operation (e.g., addition, multiplication) to produce a single result. This is often used
for aggregating data or finding global statistics. There are many operations that can be done during
the reduction. The most common are
MPI_MAX (maximum value in an array)
MPI_MIN (minimum value in an array)
MPI_SUM (sum of an array)
MPI_MINLOC (minimum value and its index)
MPI_MAXLOC (maximum value and its index)
Syntax
• MPI_Reduce( void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op,
int root, MPI_Comm communicator)
• sendbuf: A pointer to the send buffer (data to be reduced) on each process.
• recvbuf: A pointer to the receive buffer on the root process. This is where the reduced result
will be stored.
• count: The number of elements in the send buffer.
• datatype: The data type of the elements in the send buffer.
• op: The reduction operation to apply (e.g., MPI_SUM, MPI_MAX, MPI_PROD, etc.).
• root: The rank of the root process, where the reduced result will be stored.
Program
#include <stdio.h>
#include <mpi.h>
int main(int argc, char* argv[])
{
    int size, rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int localsum;
    int globalsum;
    localsum = 10 + rank;
    printf("process %d value=%d\n", rank, localsum);
    MPI_Reduce(&localsum, &globalsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
    {
        printf("globalsum = %d\n", globalsum);
    }
    MPI_Finalize();
    return 0;
}
Allreduce: MPI_Allreduce performs the same reduction as MPI_Reduce, but the result is delivered to
every process in the communicator rather than only to the root.
Syntax
• MPI_Allreduce( void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm communicator);
sendbuf: A pointer to the send buffer (data to be reduced) on each process.
recvbuf: A pointer to the receive buffer on each process. This is where the reduced result will
be stored.
count: The number of elements in the send buffer.
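A minimal MPI_Allreduce sketch (the local values are illustrative): every process contributes
rank + 1, and every process receives the global sum.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("Process %d: global sum = %d\n", rank, global);  /* same result on every rank */
    MPI_Finalize();
    return 0;
}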
Scatter: The scatter operation is the inverse of gather; the root process distributes distinct chunks
of its send buffer to all processes in the communicator, with each process receiving one portion.
Syntax
MPI_Scatter( void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int
recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator)
sendbuf: A pointer to the send buffer on the root process (the data to be scattered).
sendcount: The number of elements to send to each process from the send buffer on the root process.
recvbuf: A pointer to the receive buffer on each process. This is where the scattered data will be
stored.
recvcount: The number of elements to receive on each process.
root: The rank of the root process, which is the source of the scattered data.
Program
#include <stdio.h>
#include <mpi.h>
int main(int argc, char* argv[])
{
    int d = 0, r, s;
    int a[5] = { 1, 2, 3, 4, 5 };   // declared at function scope so it stays valid for the scatter call
    int *buf = NULL;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &s);
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    if (r == 0)
        buf = a;   // only the root supplies the send buffer
    printf("data in process %d before scatter=%d\n", r, d);
    MPI_Scatter(buf, 1, MPI_INT, &d, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("data in process %d after scatter=%d\n", r, d);
    MPI_Finalize();
    return 0;
}
UNIT-2
TOPIC 6
Data Parallel Examples
The data parallel strategy is the most common approach in parallel applications.
First, consider a simple case, the stream triad, where no communication is necessary.
The Stream Triad benchmark measures the memory bandwidth of a computing system. It is a simple
yet effective benchmark to assess the memory performance of a node. The Stream Triad benchmark
measures memory bandwidth by performing a simple operation, a[i] = b[i] + scalar * c[i], over large
arrays in memory.
Example
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define ARRAY_SIZE (1 << 20) // 1 million elements (adjust based on your system)
#define SCALAR 2.0
int main() {
    double *a = malloc(ARRAY_SIZE * sizeof(double));
    double *b = malloc(ARRAY_SIZE * sizeof(double));
    double *c = malloc(ARRAY_SIZE * sizeof(double));
    for (int i = 0; i < ARRAY_SIZE; i++) { b[i] = 1.0; c[i] = 2.0; }
    double start_time = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < ARRAY_SIZE; i++) a[i] = b[i] + SCALAR * c[i]; // the stream triad kernel
    double end_time = omp_get_wtime();
    // three doubles move per element (two reads, one write)
    double bandwidth = 3.0 * ARRAY_SIZE * sizeof(double) / (end_time - start_time) / 1.0e9;
    printf("Triad time: %g s, bandwidth: %g GB/s\n", end_time - start_time, bandwidth);
    free(a); free(b); free(c);
    return 0;
}
The figure shows how cells are exchanged from one mesh to another with the help of halo (ghost)
cells; the colors of the exchanged cells are swapped.
Advanced MPI functionality to simplify code and enable optimizations
Two advanced features are useful in common data parallel applications:
MPI custom data types
Topology support
MPI custom data types: a custom type must be committed before use, and it must be freed to avoid a
memory leak. The relevant routines include
MPI_Type_commit: Initializes the new custom type with needed memory allocation or other setup
MPI_Type_free: Frees any memory or data structure entries from the creation of the data type
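A minimal sketch of the commit/free lifecycle (MPI_Type_vector is used here only as one example of a
type-creation routine): build a strided vector type, commit it before use, and free it afterwards.
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Datatype column_type;
    /* 4 blocks of 1 double separated by a stride of 10 elements,
       e.g. one column of a 4 x 10 row-major array of doubles */
    MPI_Type_vector(4, 1, 10, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);   /* must be committed before use in communication */

    /* ... use column_type in sends and receives ... */

    MPI_Type_free(&column_type);     /* free to avoid a memory leak */
    MPI_Finalize();
    return 0;
}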
Topology support
Cartesian topology support in MPI (Message Passing Interface) allows you to define a logical, multi-
dimensional grid or mesh of processes to facilitate communication and coordination in parallel
applications. This is particularly useful for simulations, numerical computations, and other scientific
computing tasks where data is organized in multi-dimensional arrays or grids.
To create a Cartesian topology in MPI, you typically follow these steps:
1. Initialize MPI and determine the size and rank of your MPI communicator.
#include <mpi.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Finalize();
    return 0;
}
2. Define the dimensions and periodicity of the grid using an integer array, and create the
Cartesian communicator using MPI_Cart_create
int dims[ndims]; // Array specifying the number of processes in each dimension
int periods[ndims]; // Array indicating whether the grid is periodic in each dimension
MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, periods, 0, &cart_comm);
Parameters:
MPI_COMM_WORLD: This is the communicator representing all the processes that are involved in the
original communication world.
ndims: An integer representing the number of dimensions of the Cartesian grid. For example,
ndims = 2 for a 2D grid (like a matrix) or ndims = 3 for a 3D grid.
dims[]: An array of integers specifying the number of processes in each dimension. For example,
for a 2D grid with 4 processes in each dimension, dims = {4, 4} would mean a grid with 16
processes in total.
periods[]: An array of integers of size ndims that specifies whether the grid should be periodic in
each dimension.
1 indicates periodic (like wrapping around, torus-like),
0 indicates non-periodic (no wrapping).
reorder (0 in the example above): This flag indicates whether process ranks may be reordered to
optimize the topology.
If set to 1, processes may be reordered for performance reasons.
If set to 0, the rank assignment remains unchanged.
&cart_comm: A pointer to the new communicator that will be created. This new communicator
will include the processes arranged in the specified Cartesian topology. cart_comm will be used
in future communication within this topology.
3. Retrieve the coordinates of each process in the Cartesian grid using MPI_Cart_coords.
int coords[ndims];
MPI_Cart_coords(cart_comm, rank, ndims, coords);
Parameters:
comm: The communicator with Cartesian structure (the communicator created by
MPI_Cart_create).
rank: The rank of the process whose coordinates you want to determine (within the
Cartesian communicator).
maxdims: The number of dimensions in the Cartesian grid (same as ndims passed to
MPI_Cart_create).
coords[]: An integer array that will hold the coordinates of the process. The array should
be of size maxdims to hold the Cartesian coordinates in each dimension.
Return Value:
The coordinates of the process in the Cartesian topology are returned through the coords array.
The function's return value itself is MPI_SUCCESS if it completes successfully.
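Putting the three steps together, a minimal sketch of a 2D non-periodic Cartesian grid
(MPI_Dims_create is used here to pick a balanced factorization of the process count):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndims = 2;
    int dims[2] = {0, 0};        /* zeros let MPI_Dims_create choose the grid shape */
    MPI_Dims_create(size, ndims, dims);
    int periods[2] = {0, 0};     /* non-periodic in both dimensions */

    MPI_Comm cart_comm;
    MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, periods, 0, &cart_comm);

    int cart_rank, coords[2];
    MPI_Comm_rank(cart_comm, &cart_rank);
    MPI_Cart_coords(cart_comm, cart_rank, ndims, coords);
    printf("Rank %d has coordinates (%d, %d) in a %d x %d grid\n",
           cart_rank, coords[0], coords[1], dims[0], dims[1]);

    MPI_Comm_free(&cart_comm);
    MPI_Finalize();
    return 0;
}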