HPC Manual Najmu Group
Student Name:
Seat No / Roll No: Class:
Branch:
Department of Computer Engineering
Certificate
This is to certify that…………………………………………………………………………... has satisfactorily completed the High Performance Computing Lab experiments and assignments as per the term work of
Maulana Mukhtar Ahmad Nadvi Technical Campus
Department of Computer Engineering
Empowering society through quality education and research for the socio-economic
development of the region.
Program Outcomes (PO)
PO1 Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2 Problem Analysis: Identify, formulate, review research literature and analyze complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences and engineering sciences.
PO3 Design/Development of Solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal and environmental considerations.
PO4 Conduct Investigations of Complex Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of the information, to provide valid conclusions.
PO5 Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modeling, to complex engineering activities with an understanding of the limitations.
PO6 The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to professional engineering practice.
PO7 Environment and Sustainability: Understand the impact of professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of engineering practice.
PO9 Individual and Team Work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO10 Communication Skills: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO11 Project Management and Finance: Demonstrate knowledge and understanding of engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long Learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
Program Specific Outcomes (PSO)
A graduate of the Computer Engineering Program will demonstrate
PSO1
Professional Skills - The ability to understand, analyze and develop computer programs in the areas related to algorithms, system software, multimedia, web design, big data analytics, and networking for efficient design of computer-based systems of varying complexities.
PSO2
Problem-Solving Skills - The ability to apply standard practices and strategies in software project development using open-ended programming environments to deliver a quality product for business success.
PSO3
Successful Career and Entrepreneurship - The ability to employ modern computer languages, environments and platforms in creating innovative career paths, to be an entrepreneur, and to have a zest for higher studies.
Savitribai Phule Pune University
Final Year of Computer Engineering (2019 course)
410250: Laboratory Practice V (High Performance Computing)
Course Objectives:
To understand and implement searching and sorting algorithms.
To learn the fundamentals of GPU Computing in the CUDA environment.
Course Outcomes:
Sr. No | Name of the Experiment | Date of Start | Date of Completion | Marks | Sign
Experiment - 01
Aim:
Design and implement Parallel Breadth First Search and Depth First Search based on existing algorithms using OpenMP, for:
1. a tree;
2. an undirected graph (for BFS and DFS).
Objective:
Students will learn:
1. The basic concepts of DFS and BFS.
2. The compiler directives, library routines, and environment variables available for OpenMP.
Theory:
Breadth First Search (BFS)
There are many ways to traverse graphs; BFS is the most commonly used approach. BFS is a traversal algorithm in which you start from a selected node (the source or starting node) and traverse the graph layer by layer, first exploring the neighbour nodes (nodes directly connected to the source node). You then move towards the next-level neighbour nodes.
As the name BFS suggests, you are required to traverse the graph breadthwise as follows:
1. First, move horizontally and visit all the nodes of the current layer.
2. Then move to the next layer.
Consider the following diagram.
1. To design and implement a parallel breadth first search, you will need to divide the graph into smaller sub-graphs and assign each sub-graph to a different processor or thread.
2. Each processor or thread will then perform a breadth first search on its assigned sub-graph concurrently with the other processors or threads.
3. Two methods can be used to split the work: vertex by vertex, or level by level. The second strategy generally yields a more even split of the search space. A minimal Python sketch of the level-by-level strategy is given below, before the main program.
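To make the level-by-level idea concrete, here is a minimal Python sketch (our own illustration, not the prescribed OpenMP solution) in which each BFS frontier is expanded by a pool of worker processes; the helper name expand_frontier and the pool size are assumptions.

# Hypothetical level-synchronous BFS: every node of the current frontier
# is expanded by a separate worker process, then the next frontier is built.
from multiprocessing import Pool

graph = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5], 3: [1], 4: [1], 5: [2]}

def expand_frontier(node):
    # Each worker simply returns the neighbours of one frontier node.
    return graph[node]

def level_synchronous_bfs(start, workers=2):
    visited = {start}
    frontier = [start]
    while frontier:
        print("Visiting level:", frontier)
        with Pool(workers) as pool:
            neighbour_lists = pool.map(expand_frontier, frontier)
        # Build the next frontier from all neighbours not yet visited.
        next_frontier = []
        for neighbours in neighbour_lists:
            for n in neighbours:
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        frontier = next_frontier

if __name__ == "__main__":
    level_synchronous_bfs(0)

Creating a fresh pool per level keeps the sketch short; a real implementation would reuse one pool (or OpenMP threads) across levels.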
Program:
import numpy as np
from collections import deque
from multiprocessing import Process

NUM_NODES = 6

# Adjacency list of a small undirected graph
graph = {
    0: [1, 2],
    1: [0, 3, 4],
    2: [0, 5],
    3: [1],
    4: [1],
    5: [2]
}

def parallel_bfs(start):
    # Each process gets its own copy of the visited array, so the two
    # traversals below do not interfere with each other.
    visited = np.zeros(NUM_NODES, dtype=bool)
    q = deque()
    q.append(start)
    visited[start] = True
    while q:
        current = q.popleft()
        print(f"Visiting node: {current} in parallel BFS.")
        for neighbor in graph[current]:
            if not visited[neighbor]:
                visited[neighbor] = True
                q.append(neighbor)

def parallel_dfs(start):
    visited = np.zeros(NUM_NODES, dtype=bool)
    stack = [start]
    visited[start] = True
    while stack:
        current = stack.pop()
        print(f"Visiting node: {current} in parallel DFS.")
        # Reverse so that neighbours are expanded in ascending order
        for neighbor in reversed(graph[current]):
            if not visited[neighbor]:
                visited[neighbor] = True
                stack.append(neighbor)

if __name__ == "__main__":
    # Run BFS and DFS concurrently in two separate processes
    bfs_process = Process(target=parallel_bfs, args=(0,))
    dfs_process = Process(target=parallel_dfs, args=(0,))

    # Start both processes
    bfs_process.start()
    dfs_process.start()

    # Wait for both processes to finish
    bfs_process.join()
    dfs_process.join()
Output-
Conclusion:
Thus, we have successfully implemented parallel Breadth First Search and Depth First Search.
Experiment - 02
Aim:
Write a program to implement Parallel Bubble Sort and Parallel Merge sort using OpenMP.
Use existing algorithms and measure the performance of the sequential and parallel algorithms.
Objective:
Students will learn:
1. The Basic Concepts of Bubble Sort and Merge Sort.
Theory:
Parallel Sorting:
A sequential sorting algorithm may not be efficient enough when we have to sort a huge volume of data. Therefore, parallel algorithms are used in sorting.
Design methodology:
Based on an existing sequential sort algorithm
Try to utilize all resources available
Possible to turn a poor sequential algorithm into a reasonable parallel algorithm
Bubble Sort
The idea of bubble sort is to compare two adjacent elements. If they are not in the right order, switch them. Do this comparing and switching (if necessary) until the end of the array is reached. Repeat this process from the beginning of the array n times. Average performance is O(n²).
Final pass on the array 1, 5, 8:
1, 5, 8 - no switch for (1, 5)
1, 5, 8 - no switch for (5, 8)
1, 5, 8 - reached the end; do not start again, since this is the nth iteration of the same process.
Let local_size = n / no_proc. We divide the array into no_proc parts, and each process executes bubble sort on its part, including comparing its last element with the first element belonging to the next thread.
Implement the inner loop as for (j = 0; j < n-1; j++) instead of j < i.
For every iteration of i, each thread needs to wait until the previous thread has finished that iteration before starting. We coordinate this using a barrier.
The two-phase (odd-even transposition) scheme can be written as:
1. For k = 0 to n-2
2.   If k is even then
3.     for i = 0 to (n/2)-1 do in parallel
4.       If A[2i] > A[2i+1] then exchange A[2i] and A[2i+1]
5.     End for
6.   Else
7.     for i = 0 to (n/2)-2 do in parallel
8.       If A[2i+1] > A[2i+2] then exchange A[2i+1] and A[2i+2]
9.     End for
10. Next k
A shared flag, sorted, is initialized to true at the beginning of each iteration (two phases); if any processor performs a swap, sorted is set to false. A Python sketch of one such phase, split across threads and using a shared flag, is given below.
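The following is a minimal Python sketch (our own, not the prescribed OpenMP code) of one compare-exchange phase split across a few threads, with a shared dictionary standing in for the shared sorted flag; the names phase_chunk, run_phase and odd_even_sort are assumptions.

# Hypothetical sketch: one odd-even phase executed by several threads,
# with a shared flag recording whether any swap happened.
import threading

def phase_chunk(lst, pair_indices, flag):
    # Compare-exchange the pairs (i, i+1) assigned to this thread.
    for i in pair_indices:
        if lst[i] > lst[i + 1]:
            lst[i], lst[i + 1] = lst[i + 1], lst[i]
            flag["swapped"] = True      # any swap means "not sorted yet"

def run_phase(lst, first_index, flag, num_threads=2):
    # first_index is 0 for the even phase and 1 for the odd phase.
    pairs = list(range(first_index, len(lst) - 1, 2))
    threads = []
    for t in range(num_threads):
        chunk = pairs[t::num_threads]   # round-robin split of the pairs
        th = threading.Thread(target=phase_chunk, args=(lst, chunk, flag))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()                       # barrier: the whole phase must finish

def odd_even_sort(lst):
    while True:
        flag = {"swapped": False}
        run_phase(lst, 0, flag)         # even phase
        run_phase(lst, 1, flag)         # odd phase
        if not flag["swapped"]:         # no processor swapped: sorted
            return lst

print(odd_even_sort([5, 1, 4, 2, 8, 0, 2]))

Joining the threads at the end of run_phase plays the role of the barrier mentioned above.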
Program
import time
import random
import threading

start = time.perf_counter()

# Compare-exchange the pairs (0,1), (2,3), ... used in one phase
def odd_phase(lst, n):
    for i in range(0, n - 1, 2):
        if lst[i] > lst[i + 1]:
            lst[i], lst[i + 1] = lst[i + 1], lst[i]

# Compare-exchange the pairs (1,2), (3,4), ... used in the other phase
def even_phase(lst, n):
    for i in range(1, n - 1, 2):
        if lst[i] > lst[i + 1]:
            lst[i], lst[i + 1] = lst[i + 1], lst[i]

# Parallel (odd-even transposition) bubble sort
def Parallel_Bubble_Sort(lst):
    n = len(lst)
    for i in range(1, n + 1):
        # Alternate the two phases; each phase runs in its own thread
        target = odd_phase if i % 2 == 1 else even_phase
        t = threading.Thread(target=target, args=(lst, n))
        t.start()
        # Joining here is the barrier between phases: the next phase
        # must not start until the current one has finished
        t.join()
    # Print final sorted list
    print(lst)

# Example list to test the program
lst = [random.randint(0, 100) for _ in range(100)]
Parallel_Bubble_Sort(lst)

finish = time.perf_counter()
print(f'Finished in {round(finish - start, 2)} second(s)')
Output
Merge Sort
Merge sort follows the divide-and-conquer strategy; in the parallel version the sorted sublists are finally collected onto one processor.
1. Divide Step
Divide the array A[p .. r] into two subarrays A[p .. q] and A[q + 1 .. r], where q is the midpoint of p and r.
2. Conquer Step
Conquer by recursively sorting the two subarrays A[p .. q] and A[q + 1 .. r].
3. Combine Step
Combine the elements back in A[p .. r] by merging the two sorted subarrays A[p .. q] and A[q + 1 .. r] into a sorted sequence. To accomplish this step, we define a procedure MERGE(A, p, q, r).
Example:
Maximum parallelization is achieved with one processor per node (at each layer/height of the recursion tree).
Algorithm for Parallel Merge Sort
1. procedure parallelMergeSort
2. begin
3. create processors Pi, where i = 1 to n
4. if i > 0 then receive size and parent from the root
5. receive the list, size and parent from the root
6. endif
7. midvalue = listsize / 2
8. if both children are present in the tree then
9. send midvalue to the first child
10. send listsize - midvalue to the second child
11. send the list (first midvalue elements) to the first child
12. send the list from midvalue (listsize - midvalue elements) to the second child
13. call mergelist(list, 0, midvalue, list, midvalue + 1, listsize, temp, 0, listsize)
14. store temp in another array list2
15. else
16. call parallelMergeSort(list, 0, listsize)
17. endif
18. if i > 0 then
19. send list, listsize to the parent
20. endif
21. end
ALGORITHM ANALYSIS
1. Time complexity of parallel merge sort and parallel bubble sort in the best case (when all data is already in sorted form): O(n).
2. Time complexity of parallel merge sort and parallel bubble sort in the worst case: O(n log n).
3. Time complexity of parallel merge sort and parallel bubble sort in the average case: O(n log n).
Program
import time

# Python program for implementation of MergeSort
def mergeSort(arr):
    if len(arr) > 1:
        # Finding the mid of the array
        mid = len(arr) // 2

        # Dividing the array elements into 2 halves
        L = arr[:mid]
        R = arr[mid:]

        # Sorting the first half
        mergeSort(L)

        # Sorting the second half
        mergeSort(R)

        i = j = k = 0

        # Merge the temp arrays L[] and R[] back into arr[]
        while i < len(L) and j < len(R):
            if L[i] <= R[j]:
                arr[k] = L[i]
                i += 1
            else:
                arr[k] = R[j]
                j += 1
            k += 1

        # Checking if any element was left
        while i < len(L):
            arr[k] = L[i]
            i += 1
            k += 1

        while j < len(R):
            arr[k] = R[j]
            j += 1
            k += 1

# Code to print the list
def printList(arr):
    for i in range(len(arr)):
        print(arr[i], end=" ")
    print()

# Driver Code
if __name__ == '__main__':
    arr = [12, 11, 13, 5, 6, 7]
    print("Given array is")
    printList(arr)
    # Time the sequential sort so it can be compared with a parallel run
    start = time.perf_counter()
    mergeSort(arr)
    finish = time.perf_counter()
    print("Sorted array is:")
    printList(arr)
    print(f'Finished in {round(finish - start, 6)} second(s)')
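The listing above is the sequential reference. A minimal parallel sketch follows (our own construction using multiprocessing rather than OpenMP): the two halves are sorted in separate worker processes by reusing mergeSort, and the sorted halves are merged in the parent. The helper names parallel_merge_sort and sorted_copy are assumptions.

# Hypothetical parallel variant: sort the two halves of the array in
# separate worker processes, then merge the two sorted halves.
from multiprocessing import Pool
import heapq

def sorted_copy(part):
    out = list(part)
    mergeSort(out)              # reuse the sequential mergeSort above
    return out

def parallel_merge_sort(arr, workers=2):
    mid = len(arr) // 2
    with Pool(workers) as pool:
        left, right = pool.map(sorted_copy, [arr[:mid], arr[mid:]])
    # heapq.merge combines two already sorted sequences.
    return list(heapq.merge(left, right))

if __name__ == '__main__':
    data = [12, 11, 13, 5, 6, 7]
    print("Parallel sorted array:", parallel_merge_sort(data))

Timing this version with time.perf_counter, exactly as in the sequential driver, gives the sequential-versus-parallel comparison asked for in the aim.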
Output
Conclusion:
Thus, we have successfully implemented parallel algorithms for Bubble Sort and Merge Sort.
Experiment -03
Aim:
Implement parallel reduction using Min, Max, Sum and Average Operations.
Objective:
To study and implement the directive-based parallel programming model.
Theory:
OpenMP:
OpenMP is a set of C/C++ pragmas that provide the programmer with a high-level front-end interface that gets translated into calls to threads. The key phrase here is "high-level": the goal is to better enable the programmer to "think parallel", alleviating him/her of the burden and distraction of dealing with setting up and coordinating threads. For example, the OpenMP directive #pragma omp parallel for distributes the iterations of the following loop across a team of threads.
1. The min_reduction function finds the minimum value in the input array using the #pragma omp parallel for reduction(min: min_value) directive, which creates a parallel region and divides the loop iterations among the available threads. Each thread performs the comparison operation in parallel and updates the min_value variable if a smaller value is found.
2. Similarly, the max_reduction function finds the maximum value in the array, the sum_reduction function finds the sum of the elements of the array, and the average_reduction function finds the average of the elements of the array by dividing the sum by the size of the array.
3. The reduction clause is used to combine the results of multiple threads into a single value, which is then returned by the function. The min and max operators are used for the min_reduction and max_reduction functions, respectively, and the + operator is used for the sum_reduction and average_reduction functions. In the main function, a vector is created and the functions min_reduction, max_reduction, sum_reduction, and average_reduction are called to compute the minimum, maximum, sum and average respectively. A Python sketch of the same chunk-wise reduction pattern is given below, before the program.
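The prescribed solution uses OpenMP's reduction clause in C/C++; as a Python illustration of the same pattern, the sketch below lets each worker reduce one chunk of the data and then combines the partial results. The names chunk_min_max_sum and parallel_reduce are our own assumptions.

# Illustrative sketch of parallel reduction: each worker reduces one
# chunk, and the partial results are combined in a final step.
from multiprocessing import Pool

def chunk_min_max_sum(chunk):
    # Local (per-worker) reduction over one slice of the data
    return min(chunk), max(chunk), sum(chunk)

def parallel_reduce(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        partial = pool.map(chunk_min_max_sum, chunks)
    # Final combine step, analogous to OpenMP's reduction clause
    mins, maxs, sums = zip(*partial)
    total = sum(sums)
    return min(mins), max(maxs), total, total / len(data)

if __name__ == "__main__":
    print(parallel_reduce([5, 2, 9, 1, 7, 6, 8, 3, 4]))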
Program:
import multiprocessing as mp

def min_reduction(arr):
    min_value = float('inf')
    for num in arr:
        if num < min_value:
            min_value = num
    return min_value

def max_reduction(arr):
    max_value = float('-inf')
    for num in arr:
        if num > max_value:
            max_value = num
    return max_value

def sum_reduction(arr):
    return sum(arr)

def average_reduction(arr):
    return sum(arr) / len(arr)

def main():
    arr = [5, 2, 9, 1, 7, 6, 8, 3, 4]
    # Dispatch the four reductions to a pool of worker processes and
    # collect the results (completion of the original, unfinished main)
    pool = mp.Pool(processes=mp.cpu_count())
    results = [pool.apply_async(f, (arr,))
               for f in (min_reduction, max_reduction,
                         sum_reduction, average_reduction)]
    pool.close()
    pool.join()
    min_val, max_val, total, avg = [r.get() for r in results]
    print("Min:", min_val)
    print("Max:", max_val)
    print("Sum:", total)
    print("Average:", avg)

if __name__ == "__main__":
    main()
OUTPUT:
Conclusion:
Thus, we have successfully implemented parallel reduction using Min, Max, Sum and Average
Operations.
Experiment – 04
Aim:
Write a CUDA program for:
1. Addition of two large vectors.
2. Matrix multiplication using CUDA C.
Objective:
To study and implement the CUDA program using vectors.
Theory:
CUDA:
CUDA programming is especially well suited to problems that can be expressed as data-parallel computations. Any application that processes large data sets can use a data-parallel model to speed up the computations. Data-parallel processing maps data elements to parallel threads. The first step in designing a data-parallel program is to partition the data across threads, with each thread working on a portion of the data.
CUDA Architecture:
A heterogeneous application consists of two parts:
Host code
Device code
Host code runs on the CPU and device code runs on the GPU. An application executing on a heterogeneous platform is typically initialized by the CPU. The CPU code is responsible for managing the environment, code, and data for the device before loading compute-intensive tasks onto the device. In computationally intensive applications, program sections often exhibit a rich amount of data parallelism, and GPUs are used to accelerate the execution of these portions. When a hardware component that is physically separate from the CPU is used to accelerate computationally intensive sections of an application, it is referred to as a hardware accelerator. GPUs are arguably the most common example of a hardware accelerator. GPUs must operate in conjunction with a CPU-based host through the PCI-Express bus, as shown in the figure.
Matrix-Matrix Multiplication
Consider two n × n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. A small NumPy sketch of this block decomposition is given below.
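As a quick CPU-side check of this decomposition (our own illustration; the matrix size and block count are assumptions), the NumPy sketch below accumulates each block Ci,j from the products of Ai,k and Bk,j.

# Plain NumPy sketch of the block decomposition C[i,j] = sum_k A[i,k] * B[k,j]
import numpy as np

n, p_side = 6, 3               # n x n matrices, split into a p_side x p_side grid
bs = n // p_side               # block size (assumes p_side divides n)

A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

for i in range(p_side):
    for j in range(p_side):
        # In the CUDA version, one thread block would own block C[i,j].
        for k in range(p_side):
            C[i*bs:(i+1)*bs, j*bs:(j+1)*bs] += (
                A[i*bs:(i+1)*bs, k*bs:(k+1)*bs] @
                B[k*bs:(k+1)*bs, j*bs:(j+1)*bs])

assert np.allclose(C, A @ B)    # the block result matches the direct product
print("Block-wise product verified.")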
Program:
#include <iostream>
#define N 10

// Sequential (CPU) reference for vector addition: each element of c
// is the sum of the corresponding elements of a and b.
void add(int *a, int *b, int *c)
{
    for (int tid = 0; tid < N; tid++)
    {
        c[tid] = a[tid] + b[tid];
    }
}

int main(void)
{
    int a[N], b[N], c[N];
    // Fill the input vectors with sample values
    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = 2 * i;
    }
    add(a, b, c);
    for (int i = 0; i < N; i++)
    {
        std::cout << a[i] << " + " << b[i] << " = " << c[i] << std::endl;
    }
    return 0;
}
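The listing above is the sequential CPU version. A GPU variant can be sketched in Python through Numba's CUDA bindings (an assumption on our part; the manual's prescribed solution is CUDA C, and this sketch needs a CUDA-capable GPU with Numba installed).

# Hypothetical vector addition on the GPU using Numba's @cuda.jit
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, c):
    tid = cuda.grid(1)              # global thread index
    if tid < c.size:
        c[tid] = a[tid] + b[tid]

if __name__ == "__main__":
    n = 1_000_000
    a = np.arange(n, dtype=np.float32)
    b = 2 * np.arange(n, dtype=np.float32)
    c = np.zeros(n, dtype=np.float32)

    # Explicit host-to-device transfers, mirroring cudaMemcpy in CUDA C
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.device_array_like(c)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    add_kernel[blocks, threads_per_block](d_a, d_b, d_c)   # launch the kernel

    c = d_c.copy_to_host()
    print("First five sums:", c[:5])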
OUTPUT:
Conclusion:
Thus, we have successfully implemented the Addition of two large vectors and Matrix
Multiplication using CUDA C Programming.
A MINI PROJECT REPORT ON
BACHELOR OF ENGINEERING
Submitted by
Submitted to
Prof. Waquar Ahmed
ABSTRACT
This report explores the design and implementation of a parallel Quicksort algorithm
using MPI. It evaluates the performance improvement achieved by parallelization and
analyzes the factors affecting its scalability. We will compare the execution time of
the parallel Quicksort with the sequential version for various data sizes and numbers
of processing cores.
This project contributes to understanding the potential of parallel algorithms for data
sorting tasks. The findings will be valuable for researchers and developers seeking to
optimize sorting algorithms for large-scale data processing on high-performance
computing systems.
TABLE OF CONTENTS
i. Abstract
1. INTRODUCTION
2. LITERATURE REVIEW
3. METHODOLOGY
4. RESULTS AND DISCUSSION
5. CONCLUSION
6. REFERENCES
1. INTRODUCTION
1.1 Parallel Quick Sort
In the realm of data science, sorting algorithms are the silent heroes, meticulously
organizing information for efficient retrieval and analysis. Quicksort, a divide-and-
conquer champion, has long reigned supreme for its efficiency. However, as datasets
balloon in size, traditional quicksort struggles to keep pace with the processing
demands. This is where parallel quicksort emerges, leveraging the power of multiple
processors to tackle massive sorting tasks with remarkable speed.
Parallel quicksort builds upon the core principles of its sequential counterpart. It starts
by selecting a pivot element and partitions the data into two sub-arrays: elements less
than the pivot and elements greater than it. The magic, however, lies in the
parallelization. Unlike the single-threaded approach of sequential quicksort, parallel
quicksort utilizes the Message Passing Interface (MPI) to distribute these sub-arrays
across multiple processors. Each processor then independently sorts its assigned chunk
of data using the familiar quicksort method.
The benefits of parallel quicksort are undeniable. It harnesses the combined processing
power of multiple processors, drastically reducing the sorting time for massive
datasets. This makes it a perfect candidate for large-scale data processing tasks
commonly encountered in high-performance computing. However, parallel quicksort
is not without its challenges. Unlike its sequential counterpart, it introduces
communication overhead between processes using MPI functions to coordinate the
sorting effort. This overhead can outweigh the benefits for small datasets, making
sequential quicksort the preferred choice in such cases. Additionally, achieving
optimal performance with parallel quicksort requires careful design to ensure efficient
load balancing and avoid communication bottlenecks that could hinder scalability.
In conclusion, both quicksort and parallel quicksort play crucial roles in the sorting
landscape. Sequential quicksort remains a valuable tool for its simplicity and
efficiency on single processors. However, as data sizes continue to explode, parallel
quicksort emerges as the champion for large-scale sorting tasks. By effectively
utilizing the power of multiple processors, it paves the way for faster data processing
and analysis in our ever-growing digital world.
Quicksort, a champion of sorting algorithms with its average time complexity of O(n
log n), reigns supreme for many tasks. However, its sequential nature limits its ability
to fully harness the processing power of modern multi-core processors and distributed
computing systems. This is where MPI (Message Passing Interface) steps in, offering
a powerful tool to parallelize Quicksort and unlock significant performance gains.
MPI, the language of parallel programming, enables communication and data
exchange between processes running on different processors within a cluster. Imagine
a team of collaborators, each assigned a portion of a massive sorting task. MPI
facilitates the exchange of data and coordination between these collaborators,
ultimately leading to a faster, more efficient sorting process.
At the heart of MPI lie processes, independent units of execution that communicate
with each other. These processes are grouped into communicators, ensuring messages
reach the intended recipients. Each process within a communicator has a unique
identifier, its rank, allowing for targeted communication. MPI offers both synchronous
(blocking) and asynchronous (non-blocking) communication methods, catering to
different programming needs. Additionally, it provides built-in collective operations
like MPI_Bcast, enabling all processes to receive the same data simultaneously.
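As a concrete illustration of these concepts (our own example, not part of the project source), a few lines of mpi4py show the communicator, the rank, and a broadcast; it can be run with something like mpiexec -n 4 python mpi_hello.py.

# Tiny mpi4py illustration of ranks and a collective broadcast
from mpi4py import MPI

comm = MPI.COMM_WORLD          # the default communicator
rank = comm.Get_rank()         # unique id of this process
size = comm.Get_size()         # total number of processes

# The root process creates the data; everyone else starts with None.
data = {'pivot': 42} if rank == 0 else None

# MPI_Bcast: after this call every process holds the same data.
data = comm.bcast(data, root=0)

print(f"Process {rank} of {size} received {data}")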
2. LITERATURE REVIEW
MPI acts as the communication hub for a team of collaborators – the MPI processes.
Each process resides on a separate processor within a cluster. Imagine a large, unsorted
pile of items. MPI facilitates the strategic distribution (Dividing the Spoils) of this data
across all the processes, just like dividing the items among team members. Now, each
process tackles its assigned portion (Local Conquests) using the familiar, efficient
Quicksort algorithm. This is where individual processors shine, utilizing their
strengths to conquer their sub-problems.
But how do these independent efforts culminate in a globally sorted whole? A key step
involves choosing a champion (Choosing a Champion). One process, often the leader,
selects a pivot element – a crucial value that will guide the sorting process. Different
strategies can be employed for this selection, akin to choosing a champion to divide
the remaining items based on a specific criterion.
The next stage, The Great Exchange, involves a coordinated exchange of data using
MPI functions like MPI_Sendrecv or MPI_Gatherv. Processes communicate and
shuffle elements based on the chosen pivot. Elements smaller than the pivot are sent
to one group, while those larger are sent to another. This exchange resembles sorting
items based on the champion's value, effectively partitioning the data.
The power of parallelization lies in the conquer and repeat approach (Conquer and
Repeat). Each process recursively repeats these steps – data distribution, local sorting,
pivot selection, and exchange – on its respective sub-arrays. It's like recursively
conquering smaller sub-problems until each process owns a perfectly sorted sub-array. Finally, the MPI function MPI_Gather orchestrates the Victory Lap. The sorted
sub-arrays are combined, bringing the team's efforts together to form the final, globally
sorted data. Just like a team celebrating victory, the result is a completely sorted dataset,
achieved through collaboration and efficient communication.
4. RESULTS AND DISCUSSION
4.1 Source Code
import random
from mpi4py import MPI
import numpy as np
# Excerpt from the merge step of the source code: merged_arr, scatter_size
# and local_arr are created in portions of the program omitted here.
# Merge the received data with the local sorted subarray (if applicable)
if merged_arr is not None:
left_size = min(scatter_size, len(merged_arr))
right_size = len(merged_arr) - left_size
i, j, k = 0, 0, 0
while i < left_size and j < right_size:
if merged_arr[i] <= merged_arr[left_size + j]:
local_arr[k] = merged_arr[i]
i += 1
else:
local_arr[k] = merged_arr[left_size + j]
j += 1
k += 1
while i < left_size:
local_arr[k] = merged_arr[i]
i += 1
k += 1
while j < right_size:
local_arr[k] = merged_arr[left_size + j]
j += 1
k += 1
# Start timer
start_time = MPI.Wtime()
# Stop timer
end_time = MPI.Wtime()
# Finalize MPI
MPI.Finalize()
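Since the listing above is only an excerpt, the following self-contained sketch (our own, simplified version: NumPy's quicksort stands in for the local sort, and a k-way merge replaces the pairwise exchange steps) shows the overall scatter, local sort, gather and merge structure with mpi4py. It can be run with, for example, mpiexec -n 4 python parallel_quicksort_sketch.py.

# Simplified parallel quicksort sketch: scatter, sort locally, gather, merge.
import heapq
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = np.random.randint(0, 1000, size=1_000)
    chunks = np.array_split(data, size)     # one chunk per process
else:
    chunks = None

start_time = MPI.Wtime()

# Each process receives one chunk and sorts it locally
# (NumPy's quicksort stands in for the local quicksort step).
local = comm.scatter(chunks, root=0)
local_sorted = np.sort(local, kind='quicksort')

# Gather the sorted chunks back onto the root process.
gathered = comm.gather(local_sorted, root=0)

end_time = MPI.Wtime()

if rank == 0:
    # Final k-way merge of the already sorted chunks.
    result = list(heapq.merge(*gathered))
    print(f"Sorted {len(result)} elements with {size} processes "
          f"in {end_time - start_time:.6f} s")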
4.2 Output
5. CONCLUSION
Parallel quicksort remains a powerful tool for large-scale data sorting. Its ability to
leverage multiple processors for faster sorting makes it a valuable asset in high-
performance computing environments. As data sizes continue to grow exponentially,
parallel quicksort, with its divide-and-conquer spirit and parallel execution, stands
poised to remain a champion in the ever-evolving world of data organization.
6. REFERENCES
ii. https://github.com/Eduard-747/my_projects
iii. https://www.codeproject.com/KB/threads/Parallel_Quicksort/Parallel_Quick_sort_without_merge.pdf
iv. Puneet C Kataria, Parallel quicksort implementation using MPI and Pthreads.
viii. https://www.javatpoint.com/quick-sort
ix. https://www.geeksforgeeks.org/implementation-of-quick-sort-using-mpi-omp-and-posix-thread/
x. https://github.com/triasamo1/QuicksortParallelMPI/blob/master/quicksort_merge_mpi.c