
NAME:

ROLL NO:

YEAR/SEM: 1st YEAR/ 2nd SEM

BRANCH: M.TECH IN COMPUTER SCIENCE & ENGINEERING

SUBJECT: HIGH PERFORMANCE COMPUTING LAB


What is Parallel Computing?
Serial Computing
Traditionally, software has been written for serial computation:
• A problem is broken into a discrete series of instructions

• Instructions are executed sequentially one after another


• Executed on a single processor
• Only one instruction may execute at any moment in time

Parallel Computing
In the simplest sense, parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem:
• A problem is broken into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different processors
• An overall control/coordination mechanism is employed

• The computational problem should be able to:
  o Be broken apart into discrete pieces of work that can be solved simultaneously;
  o Execute multiple program instructions at any moment in time;
  o Be solved in less time with multiple compute resources than with a single compute resource.
• The compute resources are typically:
  o A single computer with multiple processors/cores
  o An arbitrary number of such computers connected by a network
Parallel Computers
• Virtually all stand-alone computers today are parallel from a hardware perspective:
  o Multiple functional units (L1 cache, L2 cache, branch, prefetch, decode, floating-point, graphics processing (GPU), integer, etc.)
  o Multiple execution units/cores
  o Multiple hardware threads
IBM BG/Q Compute Chip with 18 cores (PU) and 16 L2 cache units (L2)
Networks connect multiple stand-alone computers (nodes) to make larger parallel
computer clusters.

For example, the schematic below shows a typical LLNL parallel computer cluster:
• Each compute node is a multi-processor parallel computer in itself.
• Multiple compute nodes are networked together with an InfiniBand network.
• Special-purpose nodes, also multi-processor, are used for other purposes.

The majority of the world's large parallel computers (supercomputers) are clusters of
hardware produced by a handful of (mostly) well known vendors.
Why Use Parallel Computing?
The Real World is Massively Complex
• In the natural world, many complex, interrelated events are happening at the same
time, yet within a temporal sequence.
• Compared to serial computing, parallel computing is much better suited for
modeling, simulating and understanding complex, real world phenomena.
Main Reasons for Using Parallel Programming
SAVE TIME AND/OR MONEY
• In theory, throwing more resources at a task will shorten its time to completion, with
potential cost savings.
• Parallel computers can be built from cheap, commodity components.
SOLVE LARGER / MORE COMPLEX PROBLEMS
• Many problems are so large and/or complex that it is impractical or impossible to
solve them using a serial program, especially given limited computer memory.
• Example: "Grand Challenge Problems"
(en.wikipedia.org/wiki/Grand_Challenge) requiring petaflops and petabytes of
computing resources.
• Example: Web search engines/databases processing millions of transactions every
second
PROVIDE CONCURRENCY
• A single compute resource can only do one thing at a time. Multiple compute
resources can do many things simultaneously.
• Example: Collaborative Networks provide a global venue where people from
around the world can meet and conduct work "virtually".
Who is Using Parallel Computing?
Science and Engineering
• Historically, parallel computing has been considered to be "the high end of
computing", and has been used to model difficult problems in many areas of
science and engineering:
• Atmosphere, Earth, Environment
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion,
photonics
• Bioscience, Biotechnology, Genetics
• Chemistry, Molecular Sciences
• Geology, Seismology
• Mechanical Engineering - from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics
• Defense, Weapons
Industrial and Commercial
• Today, commercial applications provide an equal or greater driving force in the
development of faster computers. These applications require the processing of large
amounts of data in sophisticated ways. For example:
• "Big Data", databases, data mining
• Artificial Intelligence (AI)
• Oil exploration
• Web search engines, web based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Financial and economic modeling
• Management of national and multi-national corporations
• Advanced graphics and virtual reality, particularly in the entertainment industry
• Networked video and multi-media technologies
• Collaborative work environments

The model of a parallel algorithm is developed by considering a strategy for dividing the data and the processing method, and by applying a suitable strategy to reduce interactions. In this chapter, we will discuss the following parallel algorithm models −

• Data parallel model


• Task graph model
• Work pool model
• Master slave model
• Producer consumer or pipeline model
• Hybrid model
Data Parallel
In the data-parallel model, tasks are assigned to processes and each task performs similar types of operations on different data. Data parallelism is a consequence of a single operation being applied to multiple data items.
The data-parallel model can be applied to shared-address-space and message-passing paradigms. In the data-parallel model, interaction overheads can be reduced by selecting a locality-preserving decomposition, by using optimized collective interaction routines, or by overlapping computation and interaction.
The primary characteristic of data-parallel problems is that the intensity of data parallelism increases with the size of the problem, which in turn makes it possible to use more processes to solve larger problems.
Example − Dense matrix multiplication.
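The following is a minimal OpenMP sketch of the data-parallel idea (a hypothetical vector addition rather than the dense matrix multiplication named above; OpenMP itself is covered later in this manual): every thread applies the same operation to a different part of the data.

// Data-parallel sketch: the same operation c[i] = a[i] + b[i] is applied
// to different data items by different threads.
#include <omp.h>
#include <stdio.h>
#define N 1000000

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}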

Task Graph Model


In the task graph model, parallelism is expressed by a task graph, which can be either trivial or nontrivial. In this model, the correlation among the tasks is utilized to promote locality or to minimize interaction costs. This model is used to solve problems in which the quantity of data associated with the tasks is large compared to the amount of computation associated with them. The tasks are assigned in a way that helps reduce the cost of data movement among them.
Examples − Parallel quicksort, sparse matrix factorization, and parallel algorithms derived via the divide-and-conquer approach.
Here, problems are divided into atomic tasks and implemented as a graph. Each task is an independent unit of work that has dependencies on one or more antecedent tasks. After a task completes, its output is passed to the dependent tasks. A task that has antecedent tasks starts execution only when all of its antecedent tasks are completed. The final output of the graph is received when the last dependent task is completed.
Work Pool Model
In the work pool model, tasks are dynamically assigned to processes to balance the load. Therefore, any process may potentially execute any task. This model is used when the quantity of data associated with tasks is comparatively smaller than the computation associated with the tasks.
There is no pre-assignment of tasks to processes. The assignment of tasks may be centralized or decentralized. Pointers to the tasks are saved in a physically shared list, in a priority queue, or in a hash table or tree, or they may be saved in a physically distributed data structure.
The tasks may all be available at the beginning, or they may be generated dynamically. If tasks are generated dynamically and a decentralized assignment is used, then a termination detection algorithm is required so that all the processes can detect the completion of the entire program and stop looking for more tasks.
Example − Parallel tree search
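As a rough shared-memory analogue of the work pool idea (not the tree-search example itself), OpenMP's dynamic loop schedule hands out chunks of iterations to whichever thread becomes free next; the task function and workload below are made up purely for illustration.

// Work-pool-style dynamic assignment with OpenMP: iterations are
// handed out at run time to whichever thread is idle.
#include <omp.h>
#include <stdio.h>

// Hypothetical task with an unpredictable amount of work per item.
static long do_task(int i) {
    long s = 0;
    for (int k = 0; k < (i % 1000) * 1000; k++) s += k;
    return s;
}

int main(void) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 10000; i++)
        total += do_task(i);
    printf("total = %ld\n", total);
    return 0;
}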

Master-Slave Model
In the master-slave model, one or more master processes generate tasks and allocate them to slave processes. The tasks may be allocated beforehand if −

• the master can estimate the volume of the tasks, or
• a random assignment can do a satisfactory job of balancing the load, or
• slaves are assigned smaller pieces of work at different times.

This model is generally equally suitable for shared-address-space or message-passing paradigms, since the interaction is naturally two-way.
In some cases, a task may need to be completed in phases, and the tasks in each phase must be completed before the tasks in the next phase can be generated. The master-slave model can be generalized to a hierarchical or multi-level master-slave model in which the top-level master feeds a large portion of the tasks to second-level masters, who further subdivide the tasks among their own slaves and may perform a part of the work themselves.

Precautions in using the master-slave model

Care should be taken to ensure that the master does not become a bottleneck. This may happen if the tasks are too small or the workers are comparatively fast.
The tasks should be selected in a way that the cost of performing a task dominates the
cost of communication and the cost of synchronization.
Asynchronous interaction may help overlap interaction and the computation associated
with work generation by the master.
Pipeline Model
It is also known as the producer-consumer model. Here a set of data is passed on
through a series of processes, each of which performs some task on it. Here, the arrival
of new data generates the execution of a new task by a process in the queue. The
processes could form a queue in the shape of linear or multidimensional arrays, trees,
or general graphs with or without cycles.
This model is a chain of producers and consumers. Each process in the queue can be
considered as a consumer of a sequence of data items for the process preceding it in
the queue and as a producer of data for the process following it in the queue. The
queue does not need to be a linear chain; it can be a directed graph. The most common
interaction minimization technique applicable to this model is overlapping interaction
with computation.
Example − Parallel LU factorization algorithm.

Hybrid Models
A hybrid algorithm model is required when more than one model may be needed to
solve a problem.
A hybrid model may be composed of either multiple models applied hierarchically or
multiple models applied sequentially to different phases of a parallel algorithm.
Example − Parallel quick sort
Parallel Random Access Machine (PRAM) is a model that is considered for most of the parallel algorithms. Here, multiple processors are attached to a single block of memory. A PRAM model contains −

• A set of similar types of processors.
• All the processors share a common memory unit. Processors can communicate among themselves through the shared memory only.
• A memory access unit (MAU) connects the processors with the single shared memory.
Here, n processors can perform independent operations on n data items in a particular unit of time. This may result in simultaneous access to the same memory location by different processors.
To solve this problem, the following constraints have been enforced on the PRAM model −

• Exclusive Read Exclusive Write (EREW) − Here no two processors are
allowed to read from or write to the same memory location at the same time.
• Exclusive Read Concurrent Write (ERCW) − Here no two processors are
allowed to read from the same memory location at the same time, but are
allowed to write to the same memory location at the same time.
• Concurrent Read Exclusive Write (CREW) − Here all the processors are allowed to read from the same memory location at the same time, but are not allowed to write to the same memory location at the same time.
• Concurrent Read Concurrent Write (CRCW) − All the processors are allowed to read from or write to the same memory location at the same time.
There are many methods to implement the PRAM model, but the most prominent ones
are −

• Shared memory model


• Message passing model
• Data parallel model
Shared Memory Model
The shared memory model emphasizes control parallelism more than data parallelism. In the shared memory model, multiple processes execute on different processors independently, but they share a common memory space. Due to any processor activity, if there is any change in any memory location, it is visible to the rest of the processors.
As multiple processors access the same memory location, it may happen that, at any particular point of time, more than one processor is accessing the same memory location. Suppose one is reading that location and the other is writing to it. This may create inconsistency. To avoid this, some control mechanism, like a lock or semaphore, is implemented to ensure mutual exclusion.

Shared memory programming has been implemented in the following −


• Thread libraries − A thread library allows multiple threads of control to run concurrently in the same memory space. The thread library provides an interface that supports multithreading through a library of subroutines. It contains subroutines for
  o Creating and destroying threads
  o Scheduling execution of threads
  o Passing data and messages between threads
  o Saving and restoring thread contexts
Examples of thread libraries include − Solaris threads for Solaris, POSIX threads as implemented in Linux, Win32 threads available in Windows NT and Windows 2000, and Java threads as part of the standard Java Development Kit (JDK). (A minimal thread-library sketch is shown after this list.)
• Distributed Shared Memory (DSM) Systems − DSM systems create an
abstraction of shared memory on loosely coupled architecture in order to
implement shared memory programming without hardware support. They
implement standard libraries and use the advanced user-level memory
management features present in modern operating systems. Examples include
TreadMarks, Munin, IVY, Shasta, Brazos, and Cashmere.
• Program Annotation Packages − This is implemented on the architectures
having uniform memory access characteristics. The most notable example of
program annotation packages is OpenMP. OpenMP implements functional
parallelism. It mainly focuses on parallelization of loops.
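As a minimal POSIX threads sketch (assuming a Linux system where the program is compiled with gcc -pthread; the worker function and shared counter are illustrative, not part of any particular library example above), the following shows thread creation, passing data to a thread, mutual exclusion with a mutex, and joining:

// POSIX threads sketch: create threads, pass each one an id, protect a
// shared counter with a mutex, and join the threads again.
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;              // data passed to the thread
    pthread_mutex_lock(&lock);        // mutual exclusion on the shared data
    shared_counter += id;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    printf("shared_counter = %ld\n", shared_counter);
    return 0;
}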
The concept of shared memory provides low-level control of a shared memory system, but it tends to be tedious and error-prone. It is more applicable to system programming than application programming.

Merits of Shared Memory Programming

• Global address space gives a user-friendly programming approach to memory.


• Due to the closeness of memory to CPU, data sharing among processes is fast
and uniform.
• There is no need to specify distinctly the communication of data among
processes.
• Process-communication overhead is negligible.
• It is very easy to learn.

Demerits of Shared Memory Programming

• It is not portable.
• Managing data locality is very difficult.
Message Passing Model
Message passing is the most commonly used parallel programming approach in
distributed memory systems. Here, the programmer has to determine the parallelism.
In this model, all the processors have their own local memory unit and they exchange
data through a communication network.
Processors use message-passing libraries for communication among themselves.
Along with the data being sent, the message contains the following components −
• The address of the processor from which the message is being sent;
• Starting address of the memory location of the data in the sending processor;
• Data type of the sending data;
• Data size of the sending data;
• The address of the processor to which the message is being sent;
• Starting address of the memory location for the data in the receiving processor.
Processors can communicate with each other by any of the following methods −

• Point-to-Point Communication
• Collective Communication
• Message Passing Interface

Point-to-Point Communication

Point-to-point communication is the simplest form of message passing. Here, a


message can be sent from the sending processor to a receiving processor by any of the
following transfer modes −
• Synchronous mode − The next message is sent only after receiving a confirmation that the previous message has been delivered, to maintain the sequence of messages.
• Asynchronous mode − To send the next message, receipt of the confirmation of the delivery of the previous message is not required.
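As a minimal sketch using MPI (one widely used message-passing library, introduced later in this chapter; MPI_Send below is the standard blocking mode and is shown purely for illustration), rank 0 sends an integer to rank 1:

// Point-to-point message passing: rank 0 sends one integer to rank 1.
// Compile with mpicc and run with, e.g., mpirun -np 2 ./a.out
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);     // blocking send
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             // blocking receive
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}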
Collective Communication
Collective communication involves more than two processors for message passing.
Following modes allow collective communications −
• Barrier − Barrier mode is possible if all the processors included in the communication run a particular block (known as the barrier block) for message passing.
• Broadcast − Broadcasting is of two types −
o One-to-all − Here, one processor with a single operation sends same message
to all other processors.
o All-to-all − Here, all processors send message to all other processors.
Messages broadcasted may be of three types −
• Personalized − Unique messages are sent to all other destination processors.
• Non-personalized − All the destination processors receive the same message.
• Reduction − In reduction broadcasting, one processor of the group collects all the messages from all other processors in the group and combines them into a single message which all other processors in the group can access.
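A minimal MPI sketch of these collective operations (a one-to-all broadcast, a reduction, and a barrier); the values are arbitrary and only illustrate the calls:

// Collective communication: broadcast a seed from rank 0, then reduce
// (sum) one contribution from every rank back to rank 0.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, seed = 0, local, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) seed = 10;
    MPI_Bcast(&seed, 1, MPI_INT, 0, MPI_COMM_WORLD);    // one-to-all broadcast

    local = seed + rank;                                 // each rank's contribution
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); // reduction

    MPI_Barrier(MPI_COMM_WORLD);                         // barrier synchronization
    if (rank == 0) printf("sum over %d ranks = %d\n", size, sum);

    MPI_Finalize();
    return 0;
}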

Merits of Message Passing

• Provides low-level control of parallelism;


• It is portable;
• Less error prone;
• Less overhead in parallel synchronization and data distribution.

Demerits of Message Passing

• As compared to parallel shared-memory code, message-passing code generally


needs more software overhead.

Message Passing Libraries

There are many message-passing libraries. Here, we will discuss two of the most widely used message-passing libraries −
• Message Passing Interface (MPI)
• Parallel Virtual Machine (PVM)
Message Passing Interface (MPI)

It is a universal standard to provide communication among all the concurrent processes in a distributed memory system. Most of the commonly used parallel computing platforms provide at least one implementation of the Message Passing Interface. It has been implemented as a collection of predefined functions, called a library, that can be called from languages such as C, C++, Fortran, etc. MPI implementations are both fast and portable as compared to other message-passing libraries.
Merits of Message Passing Interface
• Runs on both shared memory and distributed memory architectures;
• Each processor has its own local variables;
• As compared to large shared memory computers, distributed memory computers are less expensive.
Demerits of Message Passing Interface

• More programming changes are required to go from a serial to a parallel algorithm;
• Sometimes difficult to debug; and
• Performance is limited by the communication network between the nodes.

Parallel Virtual Machine (PVM)

PVM is a portable message passing system, designed to connect separate


heterogeneous host machines to form a single virtual machine. It is a single
manageable parallel computing resource. Large computational problems like
superconductivity studies, molecular dynamics simulations, and matrix algorithms can
be solved more cost effectively by using the memory and the aggregate power of
many computers. It manages all message routing, data conversion, task scheduling in
the network of incompatible computer architectures.
Features of PVM

• Very easy to install and configure;


• Multiple users can use PVM at the same time;
• One user can execute multiple applications;
• It’s a small package;
• Supports C, C++, Fortran;
• For a given run of a PVM program, users can select the group of machines;
• It is a message-passing model,
• Process-based computation;
• Supports heterogeneous architecture.
Data Parallel Programming
The major focus of data parallel programming model is on performing operations on a
data set simultaneously. The data set is organized into some structure like an array,
hypercube, etc. Processors perform operations collectively on the same data structure.
Each task is performed on a different partition of the same data structure.
It is restrictive, as not all the algorithms can be specified in terms of data parallelism.
This is the reason why data parallelism is not universal.
Data parallel languages help to specify the data decomposition and its mapping to the processors. They also include data distribution statements that allow the programmer to have control over the data – for example, which data will go to which processor – in order to reduce the amount of communication among the processors.

To apply any algorithm properly, it is very important that you select a proper data
structure. It is because a particular operation performed on a data structure may take
more time as compared to the same operation performed on another data structure.
Example − Accessing the ith element of a set stored in an array may take constant time, but with a linked list the time required to perform the same operation may become linear in the number of elements.
Therefore, the selection of a data structure must be done considering the architecture
and the type of operations to be performed.
The following data structures are commonly used in parallel programming −

• Linked List
• Arrays
• Hypercube Network
Linked List
A linked list is a data structure having zero or more nodes connected by pointers. Nodes may or may not occupy consecutive memory locations. Each node has two or three parts − one data part that stores the data, and the other one or two are link fields that store the address of the previous or next node. The address of the first node is stored in an external pointer called head. The last node, known as the tail, generally does not contain a next address. There are three types of linked lists −

• Singly Linked List


• Doubly Linked List
• Circular Linked List

Singly Linked List

A node of a singly linked list contains data and the address of the next node. An
external pointer called head stores the address of the first node.

Doubly Linked List

A node of a doubly linked list contains data and the address of both the previous and
the next node. An external pointer called head stores the address of the first node and
the external pointer called tail stores the address of the last node.

Circular Linked List


A circular linked list is very similar to the singly linked list except for the fact that the last node stores the address of the first node.
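For illustration, the node types described above could be declared in C as follows (the struct and field names are arbitrary):

// Node declarations for the three list types described above.
#include <stdio.h>
#include <stdlib.h>

struct snode {               /* singly linked list node */
    int data;
    struct snode *next;      /* address of the next node */
};

struct dnode {               /* doubly linked list node */
    int data;
    struct dnode *prev;      /* address of the previous node */
    struct dnode *next;      /* address of the next node */
};

int main(void) {
    /* head is the external pointer that stores the first node's address */
    struct snode *head = malloc(sizeof(struct snode));
    head->data = 1;
    head->next = NULL;       /* tail of a plain singly linked list */

    /* circular variant: the last node stores the head's address */
    head->next = head;

    printf("%d\n", head->next->data);
    free(head);
    return 0;
}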
Arrays
An array is a data structure where we can store similar types of data. It can be one-
dimensional or multi-dimensional. Arrays can be created statically or dynamically.
• In statically declared arrays, dimension and size of the arrays are known at the
time of compilation.
• In dynamically declared arrays, dimension and size of the array are known at
runtime.
For shared memory programming, arrays can be used as a common memory, and for data parallel programming they can be used by partitioning them into sub-arrays.
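A small C sketch contrasting the two declaration styles described above (the sizes are arbitrary):

// Static versus dynamic array declaration in C.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int a[16];                                // size fixed at compile time

    int n = 16;                               // could come from user input
    int *b = (int *)malloc(n * sizeof(int));  // size chosen at run time
    if (b == NULL) return 1;

    for (int i = 0; i < n; i++) { a[i] = i; b[i] = i * i; }
    printf("%d %d\n", a[15], b[15]);

    free(b);
    return 0;
}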
Hypercube Network
Hypercube architecture is helpful for those parallel algorithms where each task has to
communicate with other tasks. Hypercube topology can easily embed other topologies
such as ring and mesh. It is also known as n-cubes, where n is the number of
dimensions. A hypercube can be constructed recursively.

Selecting a proper design technique for a parallel algorithm is the most difficult and important task. Most parallel programming problems may have more than one solution. In this chapter, we will discuss the following design techniques for parallel algorithms −
• Divide and conquer
• Greedy Method
• Dynamic Programming
• Backtracking
• Branch & Bound
• Linear Programming

Divide and Conquer Method


In the divide and conquer approach, the problem is divided into several small
subproblems. Then the sub-problems are solved recursively and combined to get the
solution of the original problem.
The divide and conquer approach involves the following steps at each level −
• Divide − The original problem is divided into sub-problems.
• Conquer − The sub-problems are solved recursively.
• Combine − The solutions of the sub-problems are combined to get the solution of the original problem.
The divide and conquer approach is applied in the following algorithms −

• Binary search
• Quick sort
• Merge sort
• Integer multiplication
• Matrix inversion
• Matrix multiplication
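As a small serial illustration of the divide-conquer-combine steps listed above, here is a recursive binary search in C (the array contents are arbitrary): the sorted range is divided in half, one half is conquered recursively, and there is nothing left to combine.

// Recursive binary search as a divide and conquer example.
#include <stdio.h>

int binary_search(const int *a, int lo, int hi, int key) {
    if (lo > hi) return -1;                  // empty range: not found
    int mid = lo + (hi - lo) / 2;            // divide
    if (a[mid] == key) return mid;
    if (key < a[mid])
        return binary_search(a, lo, mid - 1, key);   // conquer left half
    return binary_search(a, mid + 1, hi, key);       // conquer right half
}

int main(void) {
    int a[] = {2, 3, 5, 7, 11, 13, 17};
    printf("index of 11 = %d\n", binary_search(a, 0, 6, 11));
    return 0;
}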
Greedy Method
In a greedy algorithm, the solution that looks best at the moment is chosen at each step. A greedy algorithm is very easy to apply to complex problems. It decides which choice will provide the most promising solution for the next step.
This algorithm is called greedy because when the optimal solution to the smaller instance is provided, the algorithm does not consider the program as a whole. Once a choice is made, the greedy algorithm never reconsiders it.
A greedy algorithm works by recursively creating a group of objects from the smallest possible component parts. Recursion is a procedure to solve a problem in which the solution to a specific problem is dependent on the solution of a smaller instance of that problem.
Dynamic Programming
Dynamic programming is an optimization technique which divides the problem into smaller sub-problems and, after solving each sub-problem, combines their solutions to obtain the ultimate solution. Unlike the divide and conquer method, dynamic programming reuses the solutions to the sub-problems many times.
A memoized recursive algorithm for the Fibonacci series is an example of dynamic programming, as the sketch below shows.
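A minimal C sketch of this idea (the table size and the value printed are arbitrary): each sub-problem fib(i) is solved once and its stored solution is reused.

// Fibonacci with memoization: sub-problem solutions are stored and reused,
// which is the key idea of dynamic programming.
#include <stdio.h>

#define MAXN 90
static long long memo[MAXN];

long long fib(int n) {
    if (n <= 1) return n;
    if (memo[n] != 0) return memo[n];        // reuse a stored sub-solution
    memo[n] = fib(n - 1) + fib(n - 2);       // combine smaller sub-problems
    return memo[n];
}

int main(void) {
    printf("fib(50) = %lld\n", fib(50));
    return 0;
}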
Backtracking Algorithm
Backtracking is an optimization technique to solve combinatorial problems. It is applied to both programmatic and real-life problems. The eight queens problem, Sudoku puzzles and going through a maze are popular examples where the backtracking algorithm is used.
In backtracking, we start with a possible solution that satisfies all the required conditions. Then we move to the next level, and if that level does not produce a satisfactory solution, we return one level back and start with a new option.
Branch and Bound
A branch and bound algorithm is an optimization technique to get an optimal solution to a problem. It looks for the best solution in the entire space of solutions for the given problem. Bounds on the function to be optimized are compared with the value of the latest best solution, which allows the algorithm to discard parts of the solution space entirely.
The purpose of a branch and bound search is to maintain the lowest-cost path to a target. Once a solution is found, it can keep improving that solution. Branch and bound search is commonly implemented as depth-bounded search or depth-first search.
Linear Programming
Linear programming describes a wide class of optimization problems in which both the optimization criterion and the constraints are linear functions. It is a technique to get the best outcome, such as maximum profit, shortest path, or lowest cost.
In this technique, we have a set of variables and we have to assign numerical values to them so as to satisfy a set of linear constraints and to maximize or minimize a given linear objective function.

OpenMP was introduced by the OpenMP Architecture Review Board (ARB) in 1997.

In subsequent releases, the OpenMP team added many features to it, including task parallelism, support for accelerators, user-defined reductions and a lot more. The latest release, OpenMP 5.0, was made in November 2018.
Open Multi-Processing (OpenMP) is a technique for parallelizing a section (or sections) of C/C++/Fortran code. OpenMP can also be seen as an extension to the C/C++/Fortran languages that adds parallelizing features to them. In general, OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the normal desktop computer to high-end supercomputers.
THREAD Vs PROCESS
A process is created by the OS to execute a program with given resources (memory, registers); generally, different processes do not share their memory with one another. A thread is a subset of a process, and it shares the resources of its parent process but has its own stack to keep track of function calls. Multiple threads of a process have access to the same memory.
Parallel Memory Architectures
Before getting deep into OpenMP, let's review the basic parallel memory architectures. These are divided into three categories:
Shared memory:
OpenMP comes under the shared memory concept. In this, different CPUs (processors) have access to the same memory location. Since all CPUs connect to the same memory, memory access should be handled carefully.

Distributed memory
• Here, each CPU (processor) has its own memory location to access and use. In order to make them communicate, all the independent systems are connected together using a network. MPI is based on the distributed architecture.
• Hybrid: Hybrid is a combination of both shared and distributed architectures. A simple scenario to showcase the power of OpenMP would be comparing the execution time of a normal C/C++ program and the OpenMP program.
Steps for Installation of OpenMP
• STEP 1: Check the GCC version of the compiler:
gcc --version

GCC provides support for OpenMP starting from its version 4.2.0. So if the system
has GCC compiler with the version higher than 4.2.0, then it must have OpenMP
features configured with it.
• If the system doesn’t have the GCC compiler, we can use the following
command
sudo apt install gcc
For more detailed installation support, refer to the GCC documentation.
• STEP 2: Configuring OpenMP
We can check whether the OpenMP features are configured into our compiler or
not, using the command echo |cpp -fopenmp -dM |grep -i open

If OpenMP is not featured in the compiler, we can configure it using the command
sudo apt install libomp-dev
• STEP 3: Setting the number of threads
In OpenMP, before running the code, we can set the number of threads to be used with the following command. Here, we set the number of threads to 8.
export OMP_NUM_THREADS=8

Running First Code in OpenMP

// OpenMP header
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    int nthreads, tid;

    // Begin of parallel region
    #pragma omp parallel private(nthreads, tid)
    {
        // Getting thread number
        tid = omp_get_thread_num();
        printf("Welcome to GFG from thread = %d\n", tid);

        if (tid == 0) {
            // Only master thread does this
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } // End of parallel region

    return 0;
}

This program prints a message from each thread.
Compile:
gcc -o gfg -fopenmp geeksforgeeks.c

Execute:
./gfg

OpenMP | Hello World program


STEPS TO CREATE A PARALLEL PROGRAM

1. Include the header file: We have to include the OpenMP header for our program along with the standard header files.

// OpenMP header
#include <omp.h>
2. Specify the parallel region: In OpenMP, we need to mark the region that we are going to make parallel using the keyword pragma omp parallel. The pragma omp parallel directive is used to fork additional threads to carry out the work enclosed in the parallel region. The original thread is denoted as the master thread with thread ID 0. Code for creating a parallel region would be:

#pragma omp parallel
{
    // Parallel region code
}

So, here we include:

#pragma omp parallel
{
    printf("Hello World... from thread = %d\n",
           omp_get_thread_num());
}
3. Set the number of threads: We can set the number of threads to execute the program using the environment variable:

export OMP_NUM_THREADS=5

Diagram of parallel region
As per the above figure, once the compiler encounters the parallel region code, the master thread (the thread with thread ID 0) forks into the specified number of threads. Here it gets forked into 5 threads because we initialised the number of threads to be executed as 5, using the command export OMP_NUM_THREADS=5. The entire code within the parallel region is executed by all threads concurrently. Once the parallel region ends, all threads merge back into the master thread.

4. Compile and Run:

Compile:
gcc -o hello -fopenmp hello.c

Execute:
./hello
Below is the complete program with the output of the above approach.
Program: Since we specified the number of threads to be executed as 5, 5 threads will execute the same print statement at the same point of time. Here we can't be sure about the order of execution of the threads, i.e., the order of statement execution in the parallel region won't be the same for all executions. In the picture below, when executing the program for the first time thread 1 completes first, whereas in the second run thread 0 completes first. omp_get_thread_num() returns the thread number associated with the thread.

 OpenMP Hello World program


// OpenMP program to print Hello World
// using C language

// OpenMP header
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    // Beginning of parallel region
    #pragma omp parallel
    {
        printf("Hello World... from thread = %d\n",
               omp_get_thread_num());
    }
    // Ending of parallel region

    return 0;
}
Output:

• When run for the 1st time:

• When run multiple times: the order of execution of the threads changes every time.

GPGPU: General-purpose computation using graphics processing units (GPUs) and a graphics API.
• A GPU consists of multiprocessor elements that run under the shared-memory threads model. GPUs can run hundreds or thousands of threads in parallel and have their own DRAM.
  – The GPU is a dedicated, multithreaded, data-parallel processor.
  – The GPU is good at:
    • Data-parallel processing: the same computation executed on many data elements in parallel
    • Computations with high arithmetic intensity

CUDA: Compute Unified Device Architecture
– A hardware and software architecture for issuing and managing computations on the GPU.
• CUDA C is a programming language developed by NVIDIA for programming on their GPUs. It is an extension of C.
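A minimal CUDA C sketch of this idea (a hypothetical vector addition; the kernel name, sizes and the use of unified memory are illustrative choices, not prescribed by the text above): thousands of threads each process one data element.

// Minimal CUDA C example: launch many threads, each adding one pair of elements.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];                   // data-parallel work
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);     // unified memory, for brevity
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}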

GPU Performance Analysis

1 Introduction
This text is roughly based on a number of webinars from Nvidia. It discusses how you
can analyze, understand and improve the performance of your GPU kernel.
Note that although the webinars talk about CUDA it should be straightforward to
translate the concepts to OpenCL.

The first section explains how you can find out what is limiting the performance of your kernel. The subsequent sections discuss instruction-limited and bandwidth-limited kernels. Finally, there is a section on register spilling.
2 Identifying Performance Limiters
Typically, the performance of a GPU kernel is limited by one of the following factors:

• memory throughput;

• instruction throughput;

• latency;

• a combination of the above.


There exist three ways to determine which of those limits the performance of your kernel. We will discuss each of them in a separate subsection. We will first define what we mean by a memory-bound or instruction-bound kernel.

For a GPU with memory bandwidth M (bytes per second) and instruction throughput I (instructions per second), all kernels with an instruction:byte ratio lower than I/M are memory-bandwidth bound, while all kernels with a higher ratio are instruction bound.

2.1 Algorithmic
The first way to determine what is holding back the performance of your kernel is to count the number of instructions and memory access operations. Let I be the number of instructions and M the number of bytes read from or written to memory; then I/M is the instruction:byte ratio.
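As a hypothetical worked example (all numbers are made up, not measured): suppose a kernel executes 40 instructions per element while loading and storing 16 bytes for that element, and suppose the device can sustain roughly 500 Ginstructions/s against 150 GB/s of memory bandwidth. Then

$$\frac{I}{M} = \frac{40}{16} = 2.5 \ \text{instructions/byte}
\qquad\text{versus}\qquad
\frac{I_{\text{device}}}{M_{\text{device}}} \approx \frac{500}{150} \approx 3.3 \ \text{instructions/byte},$$

so the kernel's ratio is below the device ratio and it would be classified as memory-bandwidth bound.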

The disadvantage of this method is that you will typically count fewer instructions or memory accesses than actually exist. To alleviate this problem you can check the machine-level assembly instructions. For example, you can compile CUDA code to PTX - an intermediate assembly language for NVIDIA GPUs - by invoking nvcc with the -ptx option. AMD's kernel analyzer allows you to do the same for different AMD cards via a user-friendly GUI.

2.2 Profiler
You can use a profiler to determine the instruction:byte ratio. In this case you will base
yourself on profiler-collected memory and instruction counters. Note, however, that
this method does not account for overlapped memory movements and computations.
Furthermore, the counters that you should use to determine this ratio might not be
available for your GPU.

The counters we concern ourselves with are instructions_issued for the instructions and dram_reads and dram_writes for memory accesses. Note that the first counter is incremented by 1 per warp (32 threads), while the other counters are incremented by 1 per 32-byte access to the DRAM. Thus you should multiply each counter value by 32; since both factors cancel in the ratio, you may also leave them out.

You can also look at instruction and memory throughputs reported by the profiler. In
particular, IPC (instructions per clock) and GB/s achieved for memory. Compare the
reported values with the theoretical maximum. If you’re close, you know you hit a
limit.

2.3 Code Modification


Finally, the most interesting method is to change the source code of your kernel such that it becomes either a memory-only or an arithmetic-only kernel. This method will not only help you determine whether your kernel is memory-bound or instruction-bound, it will also show how well memory operations are overlapped with computation. Unfortunately, this method cannot be applied to every kernel.

A number of things need to be taken into account when you modify your code. The following parts discuss what you should do to perform each type of modification correctly.

Memory-only

You should remove as much arithmetic as possible without changing the access
pattern. You can also use the profiler to verify that the load/store count is the same.

Store-only

In this case, you should also remove the loads.

Compute-only
You should remove the global memory accesses. In this case, however, you need to trick the compiler, because it throws away all code that it detects as not contributing to stores. A trick is to put the stores inside a conditional that always evaluates to false. A typical way to do this is as follows:

__global__ void add(float *output, float *A, float *B, int flag) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // thread's element index
    float value = A[idx] + B[idx];
    // the conditional should depend on the value to avoid
    // the computation being moved into the then-branch
    if (1 == value * flag) output[idx] = value;
}

Another possible problem is that your occupancy changes when you modify your
kernel, for example because you use fewer registers. To avoid this you may add shared
memory to the kernel invocation. The amount of shared memory must be such that you
achieve the same occupancy as before you modified the kernel.
kernel<<<grid, block, smem, ...>>>(...)
3 Instruction Limited Kernels
The following factors could be limiting for instruction-limited kernels:

• The instruction mix. Different instructions have different throughputs.
• Instruction serialization: when threads in a warp execute/issue the same instruction one after another rather than in parallel. This can be caused by shared memory bank conflicts, constant memory bank conflicts and branching.

It is important to differentiate between instructions that are executed and instructions that are issued. A difference between the two indicates that instructions have been serialized. You should find out why and try to minimize this effect.

Some optimizations are proposed for each type of problem. They are discussed in the
following subsections.

3.1 Instruction Mix


• Use intrinsics such as __sinf() rather than sinf() when reduced precision is acceptable.
• Use double-precision floating point only when you need it.
• Use compiler flags (for example, nvcc's -use_fast_math).

3.2 Serialization Issues
Warp divergence can be measured with the divergent_branch and branch counters. Because these counters only measure branch instructions, it is better to look at the thread_instructions_executed counter. If there is no divergence, then thread_instructions_executed = 32 * instructions_executed. If there is divergence, the left expression is smaller than the right.

Shared memory bank conflicts can be measured using the l1_shared_bank_conflict, shared_load and shared_store counters. Bank conflicts are significant for instruction-limited kernels when l1_shared_bank_conflict is significant compared to instructions_issued.

The proposed optimizations roughly correspond with those presented in our MSc
thesis.

3.3 Latency Hiding Issues


There might be insufficient latency hiding because there are too few concurrent threads
per SM. This can be due to poor occupancy or to a small overall number of threads in
your execution configuration. Typical advice to alleviate the problem is:

• Run more threads.

• Process more than one element per thread.

• Use smaller thread blocks to alleviate the effect of barrier synchronization.

4 Bandwidth Limited Kernels


A number of factors will determine the bandwidth achieved by your kernel. The
following subsections discuss the most important ones.

4.1 Launch Configuration


The exact size of your workgroups and the number of elements that are mapped onto one work item can have an enormous impact on the resulting memory throughput. For example, processing more than one element per work item can lead to a higher memory throughput.
4.2 Memory Access Patterns
The exact pattern of your memory accesses strongly influences the memory throughput. For example, on NVIDIA GPUs memory access is optimal when data is read or written in a coalesced fashion.
Global Memory Operations

Note that this discussion only concerns NVIDIA GPUs with newer Fermi architecture.
The GPUs in the parallel systems lab are unfortunately older.

Loads are either caching - the default - or non-caching - selected through a compiler flag. For the first, the load granularity is a 128-byte line; for the second, the load granularity is 32 bytes.

As a consequence, the impact of misaligned memory access is greater for caching loads than for non-caching loads. In the first case the bus utilization drops to 50%, while in the second case it only drops to 80%.

Also, if all the threads access bytes within the same 4-byte word, the bus utilization drops to 3.125% for caching loads and only to 12.5% for non-caching loads.

When a warp accesses 32 scattered 4-byte words that fall within N segments, the bus utilization drops to 128/(N*128) for caching loads and 128/(N*32) for non-caching loads.

On-Chip Memory

Should I use explicit shared memory or implicit caching? It is possible to split the on-chip memory between shared memory and L1 cache as either 16KB / 48KB or 48KB / 16KB.

Additional Memory

It can still be useful to use texture and constant memory.

5 Register Spilling
Local memory refers to memory private to a thread but stored in global memory. The
only differences with global memory are that addressing is resolved by the compiler
and stores are cached in L1.

Local memory is used for register spilling and arrays declared inside the kernel.
Register spilling is not necessarily a bad thing.

There are a number of ways to check local memory usage:


• Certain flags force nvcc to print the number of bytes used for local memory.
• Look at the profiler counters l1_local_load_hit, l1_local_load_miss, l1_local_store_hit and l1_local_store_miss.
• Look at the L2 counters for read and write sector queries.


There are a number of optimizations you can apply when register spilling is
problematic:

• Increase the limit of registers per thread. This will however result in a
lower occupancy.

• Use non-caching loads for global memory. As a result there may be fewer
contentions with spilled registers in L1 cache.

• Increase L1 cache size to 48KB. This can be done on a kernel basis and
on a device basis.

Matrix-Matrix Multiplication

Before starting, it is helpful to briefly recap how a matrix-matrix multiplication is computed. Let's say we have two matrices, $A$ and $B$. Assume that $A$ is an $n \times m$ matrix, which means that it has $n$ rows and $m$ columns. Also assume that $B$ is an $m \times w$ matrix. The result of the multiplication $A*B$ (which is different from $B*A$!) is an $n \times w$ matrix, which we call $M$. That is, the number of rows of the resulting matrix equals the number of rows of the first matrix $A$, and its number of columns equals the number of columns of the second matrix $B$.

Why does this happen and how does it work? The answer is the same for both
questions here. Let's take the cell $1$,$1$ (first row, first column) of $M$. The number
inside it after the operation $M=A*B$ is the sum of all the element-wise
multiplications of the numbers in $A$, row 1, with the numbers in $B$, column 1. That
is, in the cell $i$,$j$ of $M$ we have the sum of the element-wise multiplication of all
the numbers in the $i$-th row in $A$ and the $j$-th column in $B$.
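In formula form, with $A$ of size $n \times m$ and $B$ of size $m \times w$ as above, the same rule reads

$$M_{ij} = \sum_{k=1}^{m} A_{ik} \, B_{kj}, \qquad 1 \le i \le n, \; 1 \le j \le w.$$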
The following figure intuitively explains this idea:

It should be pretty clear now why matrix-matrix multiplication is a good example for parallel computation. We have to compute every element of $M$, and each of them is independent from the others, so we can efficiently parallelise.

We will see different ways of achieving this. The goal is to add new concepts
throughout this article, ending up with a 2D kernel, which uses shared memory to
efficiently optimise operations.
Grids And Blocks

After the previous articles, we now have a basic knowledge of CUDA thread
organisation, so that we can better examine the structure of grids and blocks.

When we call a kernel using the instruction <<<>>> we automatically define a dim3
type variable defining the number of blocks per grid and threads per block.

In fact, grids and blocks are 3D arrays of blocks and threads, respectively. This is evident when we define them before calling a kernel, with something like this:

dim3 blocksPerGrid(512, 1, 1);
dim3 threadsPerBlock(512, 1, 1);
kernel<<<blocksPerGrid, threadsPerBlock>>>();

In the previous articles you didn't see anything like that, as we only discussed 1D
examples, in which we didn't have to specify the other dimensions. This is because, if
you only give a number to the kernel call as we did, it is assumed that you created a
dim3 mono-dimensional variable, implying $y=1$ and $z=1$.

As we are dealing with matrices now, we want to specify a second dimension (and,
again, we can omit the third one). This is very useful, and sometimes essential, to make
the threads work properly.

Indeed, in this way we can refer to both the $x$ and $y$ axis in the very same way we
followed in previous examples. Let's have a look at the code:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

As you can see, the code is similar for both of them. In CUDA, blockIdx, blockDim and threadIdx are built-in variables with members x, y and z. They are indexed like normal vectors in C++, so between 0 and the maximum number minus 1. For instance, if we have a grid dimension of blocksPerGrid = (512, 1, 1), blockIdx.x will range between 0 and 511.

As mentioned earlier, the total number of threads in a single block cannot exceed 1024.

Using a multi-dimensional block means that you have to be careful about distributing
this number of threads among all the dimensions. In a 1D block, you can set 1024
threads at most in the x axis, but in a 2D block, if you set 2 as the size of y, you cannot
exceed 512 for the x! For example, dim3 threadsPerBlock(1024, 1, 1) is allowed, as
well as dim3 threadsPerBlock(512, 2, 1), but not dim3 threadsPerBlock(256, 3, 2).

Linearise Multidimensional Arrays

In this article we will make use of 1D arrays for our matrices. This might sound a bit confusing, but the problem is in the programming language itself. The standard upon which CUDA is developed needs to know the number of columns before compiling the program. Hence it is impossible to change it or set it in the middle of the code.

However, a little thought shows that this is not a big issue. We cannot use the comfortable notation A[i][j], but we won't struggle that much as we already know how to properly index rows and columns.
In fact, the easiest way to linearise a 2D array is to stack each row lengthways, from the first to the last. The following picture will make this concept clearer:
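In code, the mapping can be written as a small helper (the macro name is arbitrary):

// Row-major linearisation: element (i, j) of a matrix with N columns,
// stored in a 1D array, lives at index i * N + j.
#define IDX(i, j, N) ((i) * (N) + (j))

// Usage: A[IDX(row, col, N)] plays the role of A[row][col].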
The Kernel

Now that we have all the necessary information for carrying out the task, let's have a look at the kernel code. For the sake of simplicity we will use square $N \times N$ matrices in our example.

The first thing to do, as we already saw, is to determine the $x$ and $y$ axis indices (i.e. row and column numbers):

__global__ void multiplication(float *A, float* B, float *C, int N){
    int ROW = blockIdx.y*blockDim.y+threadIdx.y;
    int COL = blockIdx.x*blockDim.x+threadIdx.x;
Then, we check that the row and column indices do not exceed the number of actual rows and columns in the matrices. We have to do this to prevent unnecessary threads from performing operations on our matrices. That is, we are making blocks and grids of a certain size, so that if we don't set our $N$ as a multiple of that size, we would have more threads than we need:

    if (ROW < N && COL < N) {

We won't have any problem here as we are using square matrices, but it's always a good idea to keep this in mind. We then have to initialise the temporary variable tmpSum for summing all the cells in the selected row and column. It is always good practice to specify the decimal point and the f, even if they are zeroes. So let's write:

    float tmpSum = 0.0f;

Now the straightforward part: as for the CPU code, we can use a for loop to compute the sum and then store it in the corresponding cell of C.

    for (int i = 0; i < N; i++) {
        tmpSum += A[ROW * N + i] * B[i * N + COL];
    }
    C[ROW * N + COL] = tmpSum;

The host (CPU) code starts by declaring variables in this way:

int main() {
    // declare arrays and variables
    int N = 16;
    int SIZE = N*N;
    vector<float> h_A(SIZE);
    vector<float> h_B(SIZE);
    vector<float> h_C(SIZE);

Note that, given the if condition in the kernel, we could set $N$ values that are not necessarily a multiple of the block size. Also, I will make use of the library dev_array, which was discussed in a previous article.

Now let's fill the matrices. There are several ways to do this, such as making functions for manual input or using random numbers. In this case, we simply use a for loop to fill the cells with trigonometric values of the indices:

for (int i = 0; i < N; i++){
    for (int j = 0; j < N; j++){
        h_A[i*N+j] = sin(i);
        h_B[i*N+j] = cos(j);
    }
}

Recalling the "Grids and Blocks" paragraph, we now have to set the dim3 variables for
both blocks and grids dimensions. The grid is just a BLOCK_SIZE $\times$
BLOCK_SIZE grid, so we can write:

dim3 threadsPerBlock (BLOCK_SIZE, BLOCK_SIZE)


C++

Copy

As we are not working only with matrices whose size is a multiple of BLOCK_SIZE, we have to use the ceil instruction to get the next integer as our grid size, as you can see (the division is done in floating point so that ceil has an effect):

int n_blocks = ceil(double(N) / BLOCK_SIZE);
dim3 blocksPerGrid(n_blocks, n_blocks);

Note that in this way we will use more threads than necessary, but we can prevent them from working on our matrices with the if condition we wrote in the kernel.

This is the general solution showing the reasoning behind it all, but in the complete code you will find a more efficient version of it in kernel.cu.

Now we only have to create the device arrays, allocate memory on the device and call our kernel and, as a result, we will have a parallel matrix multiplication program. Using dev_array, we simply write:

dev_array<float> d_A(SIZE);
dev_array<float> d_B(SIZE);
dev_array<float> d_C(SIZE);

for declaring the arrays, and:

d_A.set(&h_A[0], SIZE);
d_B.set(&h_B[0], SIZE);

for copying the host data to the device. For getting the device results and copying them to the host, we use the get method instead. Once again, this is simply:

d_C.get(&h_C[0], SIZE);

At the bottom of this page you can find the complete code, including the performance comparison and the error computation between the parallel and the serial code.

In the next article I will discuss the different types of memory and, in particular, I will use shared memory for speeding up the matrix multiplication. But don't worry, just after this we will come back to an actual finance application by applying what we have learned so far to financial problems.

Full Code

dev_array.h
#ifndef _DEV_ARRAY_H_
#define _DEV_ARRAY_H_

#include <stdexcept>
#include <algorithm>
#include <cuda_runtime.h>

template <class T>
class dev_array
{
// public functions
public:
    explicit dev_array()
        : start_(0),
          end_(0)
    {}

    // constructor
    explicit dev_array(size_t size)
    {
        allocate(size);
    }

    // destructor
    ~dev_array()
    {
        free();
    }

    // resize the vector
    void resize(size_t size)
    {
        free();
        allocate(size);
    }

    // get the size of the array
    size_t getSize() const
    {
        return end_ - start_;
    }

    // get data
    const T* getData() const
    {
        return start_;
    }

    T* getData()
    {
        return start_;
    }

    // set
    void set(const T* src, size_t size)
    {
        size_t min = std::min(size, getSize());
        cudaError_t result = cudaMemcpy(start_, src, min * sizeof(T),
                                        cudaMemcpyHostToDevice);
        if (result != cudaSuccess)
        {
            throw std::runtime_error("failed to copy to device memory");
        }
    }

    // get
    void get(T* dest, size_t size)
    {
        size_t min = std::min(size, getSize());
        cudaError_t result = cudaMemcpy(dest, start_, min * sizeof(T),
                                        cudaMemcpyDeviceToHost);
        if (result != cudaSuccess)
        {
            throw std::runtime_error("failed to copy to host memory");
        }
    }

// private functions
private:
    // allocate memory on the device
    void allocate(size_t size)
    {
        cudaError_t result = cudaMalloc((void**)&start_, size * sizeof(T));
        if (result != cudaSuccess)
        {
            start_ = end_ = 0;
            throw std::runtime_error("failed to allocate device memory");
        }
        end_ = start_ + size;
    }

    // free memory on the device
    void free()
    {
        if (start_ != 0)
        {
            cudaFree(start_);
            start_ = end_ = 0;
        }
    }

    T* start_;
    T* end_;
};

#endif

matrixmul.cu

#include <iostream>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>
#include "kernel.h"
#include "kernel.cu"
#include "dev_array.h"
#include <math.h>

using namespace std;

int main()
{
    // Perform matrix multiplication C = A*B
    // where A, B and C are NxN matrices
    int N = 16;
    int SIZE = N*N;

    // Allocate memory on the host
    vector<float> h_A(SIZE);
    vector<float> h_B(SIZE);
    vector<float> h_C(SIZE);

    // Initialize matrices on the host
    for (int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            h_A[i*N+j] = sin(i);
            h_B[i*N+j] = cos(j);
        }
    }

    // Allocate memory on the device
    dev_array<float> d_A(SIZE);
    dev_array<float> d_B(SIZE);
    dev_array<float> d_C(SIZE);

    d_A.set(&h_A[0], SIZE);
    d_B.set(&h_B[0], SIZE);

    matrixMultiplication(d_A.getData(), d_B.getData(), d_C.getData(), N);
    cudaDeviceSynchronize();

    d_C.get(&h_C[0], SIZE);
    cudaDeviceSynchronize();

    float *cpu_C;
    cpu_C = new float[SIZE];

    // Now do the matrix multiplication on the CPU
    float sum;
    for (int row=0; row<N; row++){
        for (int col=0; col<N; col++){
            sum = 0.f;
            for (int n=0; n<N; n++){
                sum += h_A[row*N+n]*h_B[n*N+col];
            }
            cpu_C[row*N+col] = sum;
        }
    }

    double err = 0;
    // Check the result and make sure it is correct
    for (int ROW=0; ROW < N; ROW++){
        for (int COL=0; COL < N; COL++){
            err += cpu_C[ROW * N + COL] - h_C[ROW * N + COL];
        }
    }

    cout << "Error: " << err << endl;

    return 0;
}

kernel.h

#ifndef KERNEL_CUH_
#define KERNEL_CUH_

void matrixMultiplication(float *A, float *B, float *C, int N);

#endif

kernel.cu

#include <math.h>
#include <iostream>
#include "cuda_runtime.h"
#include "kernel.h"
#include <stdlib.h>

using namespace std;

__global__ void matrixMultiplicationKernel(float* A, float* B, float* C, int N) {

    int ROW = blockIdx.y*blockDim.y+threadIdx.y;
    int COL = blockIdx.x*blockDim.x+threadIdx.x;

    float tmpSum = 0;

    if (ROW < N && COL < N) {
        // each thread computes one element of the block sub-matrix
        for (int i = 0; i < N; i++) {
            tmpSum += A[ROW * N + i] * B[i * N + COL];
        }
        // the store stays inside the guard so out-of-range threads
        // do not write past the end of C
        C[ROW * N + COL] = tmpSum;
    }
}

void matrixMultiplication(float *A, float *B, float *C, int N){

    // declare the number of blocks per grid and the number of threads per block
    // a block may contain at most 1024 threads, so switch to a 32 x 32 block
    // (and a grid of blocks) as soon as the matrix is larger than that
    dim3 threadsPerBlock(N, N);
    dim3 blocksPerGrid(1, 1);
    if (N*N > 1024) {
        threadsPerBlock.x = 32;
        threadsPerBlock.y = 32;
        blocksPerGrid.x = ceil(double(N)/double(threadsPerBlock.x));
        blocksPerGrid.y = ceil(double(N)/double(threadsPerBlock.y));
    }

    matrixMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
}

Note: Follow this link:
https://www.nvidia.com/content/nvision2008/tech_presentations/Game_Developer_Track/NVISION08-Image_Processing_and_Video_with_CUDA.pdf
