
Practical Lab Manual

High Performance Computing (410250)


For Final Year Computer Engineering (2019 Course)

Student Name:
Seat No / Roll No: Class:
Branch:

Department of Computer Engineering

Al Jamia Mohammediyah Education Society’s


Maulana Mukhtar Ahmad Nadvi Technical Campus
Mansoora Campus, Malegaon (Nashik)

Department of Computer Engineering

Certificate
This is to certify that…………………………………………………………………………...

Roll No: ………Exam Seat No:…………………. Class: …………………………………….

Branch ...........................................................................................has successfully completed

High Performance Computing Lab Experiments and Assignments as per the term work of

Savitribai Phule Pune University, Pune.

Subject Teacher Head of Department Principal

Maulana Mukhtar Ahmad Nadvi Technical Campus
Department of Computer Engineering

Vision of the Institute

Empowering society through quality education and research for the socio-economic
development of the region.

Mission of the Institute

• Inspire students to achieve excellence in science and engineering.
• Commit to making quality education accessible and affordable to serve society.
• Provide transformative, holistic, and value-based immersive learning experiences for students.
• Transform into an institution of global standards that contributes to nation-building.
• Develop sustainable, cost-effective solutions through innovation and research.
• Promote quality education in rural areas.

Vision of the Department


To build a strong research and learning environment that produces globally competent professionals and innovators who will contribute to the betterment of society.

Mission of the Department

• To create and sustain an academic environment conducive to the highest level of research and teaching.
• To provide state-of-the-art laboratories that are kept up to date with new developments in the area of computer engineering.
• To organize competitive events, industry interactions and global collaborations that provide a nurturing environment in which students can prepare for successful careers and tackle lifelong challenges arising from global industrial needs.
• To educate students to be socially and ethically responsible citizens in view of national and global development.
Program Outcomes (POs)
Learners are expected to know and be able to

PO1 Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2 Problem Analysis: Identify, formulate, review research literature and analyze complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences and engineering sciences.
PO3 Design/Development of Solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.
PO4 Conduct Investigations of Complex Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of the information, to provide valid conclusions.
PO5 Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modeling, to complex engineering activities with an understanding of the limitations.
PO6 The Engineer and Society: Apply reasoning informed by contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to professional engineering practice.
PO7 Environment and Sustainability: Understand the impact of professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of engineering practice.
PO9 Individual and Team Work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO10 Communication Skills: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO11 Project Management and Finance: Demonstrate knowledge and understanding of engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long Learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
Program Specific Outcomes (PSO)
A graduate of the Computer Engineering Program will demonstrate

PSO1 Professional Skills: The ability to understand, analyze and develop computer programs in the areas related to algorithms, system software, multimedia, web design, big data analytics, and networking for efficient design of computer-based systems of varying complexity.
PSO2 Problem-Solving Skills: The ability to apply standard practices and strategies in software project development using open-ended programming environments to deliver a quality product for business success.
PSO3 Successful Career and Entrepreneurship: The ability to employ modern computer languages, environments and platforms in creating innovative career paths, to be an entrepreneur, and to have a zest for higher studies.
Savitribai Phule Pune University
Final Year of Computer Engineering (2019 course)
410250: Laboratory Practice V (High Performance Computing)

Course Objectives:
• To understand and implement searching and sorting algorithms.
• To learn the fundamentals of GPU Computing in the CUDA environment.

Course Outcomes:

On completion of this course, the students will be able to:

CO1: Analyze and measure the performance of sequential and parallel algorithms.
CO2: Design and implement solutions for multicore/distributed/parallel environments.
INDEX

Sr. No. | Name of the Experiment | Date of Start | Date of Completion | Marks | Sign

1. Design and implement Parallel Breadth First Search and Depth First Search based on existing algorithms using OpenMP. Use a Tree or an undirected graph for BFS and DFS.
2. Write a program to implement Parallel Bubble Sort and Merge Sort using OpenMP. Use existing algorithms and measure the performance of sequential and parallel algorithms.
3. Implement Min, Max, Sum and Average operations using Parallel Reduction.
4. Write a CUDA program for: (1) Addition of two large vectors; (2) Matrix Multiplication using CUDA C.

Experiment - 01
Aim:
Design and implement Parallel Breadth First Search and Depth First Search based on existing algorithms using OpenMP, for:

1. a Tree
2. an undirected graph for BFS and DFS.

Objective:
Students will learn:
1. The basic concepts of DFS and BFS.
2. The compiler directives, library routines and environment variables available for OpenMP.

Theory:
Breadth First Search (BFS)
There are many ways to traverse a graph, and BFS is the most commonly used approach. BFS is a traversal algorithm that starts from a selected node (the source or starting node) and traverses the graph layer by layer, first exploring the neighbour nodes (nodes directly connected to the source node) and then moving on to the next-level neighbours.

As the name BFS suggests, the graph is traversed breadth-wise as follows:
1. First move horizontally and visit all the nodes of the current layer.
2. Then move to the next layer.

Parallel Breadth First Search

1. To design and implement parallel breadth first search, you will need to divide the graph into smaller sub-graphs and assign each sub-graph to a different processor or thread.
2. Each processor or thread then performs a breadth first search on its assigned sub-graph concurrently with the other processors or threads.
3. Two methods are common: vertex by vertex, or level by level (a minimal OpenMP sketch of the level-by-level approach is given below).
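
The following C/C++ fragment is a minimal sketch of the level-by-level approach, added only for illustration; it is one possible way the idea could be coded, not the lab's Python program given later. It assumes the same six-node undirected graph used in the program below and compilation with OpenMP support (e.g. g++ -fopenmp):

#include <cstdio>
#include <vector>

int main() {
    // Undirected graph stored as an adjacency list
    std::vector<std::vector<int>> graph = {
        {1, 2}, {0, 3, 4}, {0, 5}, {1}, {1}, {2}
    };
    int n = (int)graph.size();
    std::vector<int> visited(n, 0);

    std::vector<int> frontier = {0};       // start BFS from node 0
    visited[0] = 1;

    while (!frontier.empty()) {
        std::vector<int> next;             // frontier of the next level
        // All nodes of the current level can be expanded in parallel
        #pragma omp parallel for
        for (int i = 0; i < (int)frontier.size(); i++) {
            int u = frontier[i];
            std::printf("Visiting node: %d\n", u);
            for (int v : graph[u]) {
                // The shared visited[] and next[] need mutual exclusion
                #pragma omp critical
                {
                    if (!visited[v]) { visited[v] = 1; next.push_back(v); }
                }
            }
        }
        frontier.swap(next);               // move to the next level
    }
    return 0;
}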

Parallel Depth First Search

• Different subtrees can be searched concurrently.
• Subtrees can be very different in size.
• Estimate the size of a subtree rooted at a node.
• Dynamic load balancing is required.

Parameters in Parallel DFS: Work Splitting

• Work is split by splitting the stack into two.
• Ideally, we do not want either of the split pieces to be small.
• Select nodes near the bottom of the stack (node splitting), or
• Select some nodes from each level (stack splitting).
• The second strategy generally yields a more even split of the space.

OpenMP Sections Compiler Directive:

A parallel loop is an example of independent work units that are numbered. If you have a pre-determined number of independent work units, the sections construct is more appropriate. A sections construct can contain any number of section constructs; these must be independent, and they can be executed by any available thread in the current team, including having multiple sections done by the same thread.
The sections construct is a non-iterative work-sharing construct that contains a set of structured blocks that are to be distributed among and executed by the threads in a team. Each structured block is executed once by one of the threads in the team, in the context of its implicit task.
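
As a small illustration of this construct (added for reference; it is not the lab's Python program listed below), the following C/C++ fragment runs two independent routines as two sections, assuming compilation with g++ -fopenmp. The placeholder functions bfs_work and dfs_work stand in for the BFS and DFS traversals:

#include <cstdio>
#include <omp.h>

// Placeholder routines; in a real program these would be the BFS/DFS traversals
void bfs_work() { std::printf("BFS section run by thread %d\n", omp_get_thread_num()); }
void dfs_work() { std::printf("DFS section run by thread %d\n", omp_get_thread_num()); }

int main() {
    #pragma omp parallel sections
    {
        #pragma omp section
        bfs_work();            // executed once, by some thread of the team

        #pragma omp section
        dfs_work();            // executed once, possibly by a different thread
    }
    return 0;
}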

Program:

import numpy as np
from multiprocessing import Process, Queue

NUM_NODES = 6
# Adjacency list of an undirected graph
graph = {
    0: [1, 2],
    1: [0, 3, 4],
    2: [0, 5],
    3: [1],
    4: [1],
    5: [2]
}
visited = np.zeros(NUM_NODES)

def parallel_bfs(start, q):
    # Level-wise (breadth-first) traversal driven by a FIFO queue
    q.put(start)
    visited[start] = 1
    while not q.empty():
        current = q.get()
        print(f"Visiting node: {current} (BFS) in parallel.")
        for neighbor in graph[current]:
            if not visited[neighbor]:
                visited[neighbor] = 1
                q.put(neighbor)

def parallel_dfs(start):
    # Depth-first traversal driven by an explicit stack
    stack = [start]
    visited[start] = 1
    while stack:
        current = stack.pop()
        print(f"Visiting node: {current} (DFS) in parallel.")
        for neighbor in reversed(graph[current]):
            if not visited[neighbor]:
                visited[neighbor] = 1
                stack.append(neighbor)

if __name__ == "__main__":
    # Queue used by the BFS process; each child process gets its own copy of
    # the visited array, so the two traversals do not interfere with each other
    q = Queue()
    # Run BFS and DFS concurrently in two separate processes
    bfs_process = Process(target=parallel_bfs, args=(0, q))
    dfs_process = Process(target=parallel_dfs, args=(0,))
    bfs_process.start()
    dfs_process.start()
    # Wait for both processes to finish
    bfs_process.join()
    dfs_process.join()

Output-

Conclusion:
Thus, we have successfully implemented parallel algorithms for Breadth First Search and Depth First Search.

Experiment - 02

Aim:
Write a program to implement Parallel Bubble Sort and Parallel Merge Sort using OpenMP. Use existing algorithms and measure the performance of sequential and parallel algorithms.

Objective:
Students will learn:
1. The basic concepts of Bubble Sort and Merge Sort.
2. The compiler directives, library routines and environment variables available for OpenMP.

Theory:
Parallel Sorting:
A sequential sorting algorithm may not be efficient enough when we have to sort a huge volume of data. Therefore, parallel algorithms are used for sorting.
Design methodology:
• Base the parallel sort on an existing sequential sorting algorithm.
• Try to utilize all available resources.
• It is possible to turn a poor sequential algorithm into a reasonable parallel algorithm.

Bubble Sort
The idea of bubble sort is to compare two adjacent elements; if they are not in the right order, swap them. Continue comparing and swapping (where necessary) until the end of the array is reached, and repeat this process from the beginning of the array n times. Average performance is O(n²).

Bubble Sort Example


Here we want to sort an array containing [8, 5, 1]:
8, 5, 1   switch 8 and 5
5, 8, 1   switch 8 and 1
5, 1, 8   reached end, start again
5, 1, 8   switch 5 and 1
1, 5, 8   no switch for 5 and 8
1, 5, 8   reached end, start again
1, 5, 8   no switch for 1, 5
1, 5, 8   no switch for 5, 8
1, 5, 8   reached end; do not start again, since this is the nth iteration of the same process

Parallel Bubble Sort

• Implemented as a pipeline.
• Let local_size = n / no_proc. We divide the array into no_proc parts, and each process executes the bubble sort on its part, including comparing its last element with the first one belonging to the next thread.
• Implement with the loop for (j = 0; j < n-1; j++) instead of j < i.
• For every iteration of i, each thread needs to wait until the previous thread has finished that iteration before starting.
• The threads coordinate using a barrier.

Algorithm for Parallel Bubble Sort


1. For k = 0 to n-2
2. If k is even then
3. for i = 0 to (n/2)-1 do in parallel
4. If A[2i] > A[2i+1] then
5. Exchange A[2i] ↔ A[2i+1]
6. Else
7. for i = 0 to (n/2)-2 do in parallel
8. If A[2i+1] > A[2i+2] then
9. Exchange A[2i+1] ↔ A[2i+2]
10. Next k


Parallel Bubble Sort Example

• Compare all pairs in the list in parallel.
• Alternate between odd and even phases.
• A shared flag, sorted, is initialized to true at the beginning of each iteration (two phases); if any processor performs a swap, sorted is set to false. (A minimal OpenMP sketch of these odd-even phases is given below.)
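
The odd and even phases described above map naturally onto OpenMP parallel loops. The following C/C++ fragment is a minimal sketch added for illustration (an assumption, not the Python program of this experiment), assuming compilation with g++ -fopenmp; the implicit barrier at the end of each parallel loop keeps the phases in order:

#include <algorithm>
#include <cstdio>

int main() {
    int a[] = {8, 5, 1, 9, 3, 7, 2, 6};
    const int n = sizeof(a) / sizeof(a[0]);

    // n odd-even phases are enough to sort n elements
    for (int phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            // Even phase: compare pairs (0,1), (2,3), ...
            #pragma omp parallel for
            for (int i = 0; i + 1 < n; i += 2)
                if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
        } else {
            // Odd phase: compare pairs (1,2), (3,4), ...
            #pragma omp parallel for
            for (int i = 1; i + 1 < n; i += 2)
                if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
        }
    }

    for (int i = 0; i < n; i++) std::printf("%d ", a[i]);
    std::printf("\n");
    return 0;
}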

Program

import time
import random
import threading

start = time.perf_counter()

# Compare/swap the pairs handled in an odd phase
def odd_phase(lst, n):
    for i in range(0, n - 1, 2):
        if lst[i] > lst[i + 1]:
            lst[i], lst[i + 1] = lst[i + 1], lst[i]

# Compare/swap the pairs handled in an even phase
def even_phase(lst, n):
    for i in range(1, n - 1, 2):
        if lst[i] > lst[i + 1]:
            lst[i], lst[i + 1] = lst[i + 1], lst[i]

# Parallel (odd-even transposition) bubble sort driver
def Parallel_Bubble_Sort(lst):
    n = len(lst)
    for i in range(1, n + 1):
        # Odd-numbered iterations run the odd phase, even-numbered iterations the even phase
        if i % 2 == 1:
            t = threading.Thread(target=odd_phase, args=[lst, n])
        else:
            t = threading.Thread(target=even_phase, args=[lst, n])
        t.start()
        # Wait for the current phase to finish before starting the next one (barrier)
        t.join()
    # Print final sorted list
    print(lst)

# Example list to test the program
lst = [random.randint(0, 100) for i in range(100)]
Parallel_Bubble_Sort(lst)

finish = time.perf_counter()
print(f'Finished in {round(finish - start, 2)} second(s)')

Output

Merge Sort

• Collects the sorted list onto one processor.
• Merges elements as they come together.
• Simple tree structure.
• Parallelism is limited when near the root.

Steps of Merge Sort:


To sort A[p .. r]:
1. Divide Step: If the given array A has zero or one element, simply return; it is already sorted. Otherwise, split A[p .. r] into two subarrays A[p .. q] and A[q + 1 .. r], each containing about half of the elements of A[p .. r]. That is, q is the halfway point of A[p .. r].

2. Conquer Step: Conquer by recursively sorting the two subarrays A[p .. q] and A[q + 1 .. r].

3. Combine Step: Combine the elements back in A[p .. r] by merging the two sorted subarrays A[p .. q] and A[q + 1 .. r] into a sorted sequence. To accomplish this step, we define a procedure MERGE(A, p, q, r).

Parallel Merge Sort

• Parallelize the processing of the sub-problems.
• Maximum parallelization is achieved with one processor per node (at each layer/height).

Parallel Merge Sort Example

• Perform merge sort on the following list of elements, given 2 processors, P0 and P1:
  4, 3, 2, 1

Algorithm for Parallel Merge Sort
1. Procedure parallelMergeSort
2. Begin
3. Create processors Pi, where i = 1 to n
4. if i > 0 then receive size and parent from the root
5. receive the list, size and parent from the root
6. endif
7. midvalue = listsize / 2
8. if both children are present in the tree then
9. send midvalue to the first child
10. send listsize − midvalue to the second child
11. send list, midvalue to the first child
12. send list from midvalue, listsize − midvalue to the second child
13. call mergelist(list, 0, midvalue, list, midvalue + 1, listsize, temp, 0, listsize)
14. store temp in another array list2
15. else
16. call parallelMergeSort(list, 0, listsize)
17. endif
18. if i > 0 then
19. send list, listsize to the parent
20. endif
21. end
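
The recursive structure in the algorithm above can be expressed with OpenMP tasks. The following C/C++ sketch is added only as an illustration (an assumption about one possible realization, not the Python program that follows), assuming g++ -fopenmp; the two recursive calls run as independent tasks and are merged after a taskwait:

#include <algorithm>
#include <cstdio>
#include <vector>

// Merge the two sorted halves a[lo..mid] and a[mid+1..hi]
void merge(std::vector<int>& a, int lo, int mid, int hi) {
    std::vector<int> tmp;
    tmp.reserve(hi - lo + 1);
    int i = lo, j = mid + 1;
    while (i <= mid && j <= hi) tmp.push_back(a[i] <= a[j] ? a[i++] : a[j++]);
    while (i <= mid) tmp.push_back(a[i++]);
    while (j <= hi)  tmp.push_back(a[j++]);
    std::copy(tmp.begin(), tmp.end(), a.begin() + lo);
}

void parallel_merge_sort(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int mid = lo + (hi - lo) / 2;
    // Sort the two halves as independent tasks (the cutoff avoids tiny tasks)
    #pragma omp task shared(a) if (hi - lo > 1000)
    parallel_merge_sort(a, lo, mid);
    #pragma omp task shared(a) if (hi - lo > 1000)
    parallel_merge_sort(a, mid + 1, hi);
    #pragma omp taskwait              // wait for both halves before merging
    merge(a, lo, mid, hi);
}

int main() {
    std::vector<int> a = {12, 11, 13, 5, 6, 7};
    #pragma omp parallel              // create the thread team
    #pragma omp single                // one thread starts the recursion
    parallel_merge_sort(a, 0, (int)a.size() - 1);
    for (int x : a) std::printf("%d ", x);
    std::printf("\n");
    return 0;
}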

ALGORITHM ANALYSIS
1. Time complexity of parallel merge sort and parallel bubble sort in the best case (when all data is already sorted): O(n).
2. Time complexity of parallel merge sort and parallel bubble sort in the worst case: O(n log n).
3. Time complexity of parallel merge sort and parallel bubble sort in the average case: O(n log n).
Program

import time
# Python program for the implementation of Merge Sort

def mergeSort(arr):
    if len(arr) > 1:
        # Finding the mid of the array
        mid = len(arr) // 2
        # Dividing the array elements into 2 halves
        L = arr[:mid]
        R = arr[mid:]
        # Sorting the first half
        mergeSort(L)
        # Sorting the second half
        mergeSort(R)
        i = j = k = 0
        # Copy data back from the temp arrays L[] and R[]
        while i < len(L) and j < len(R):
            if L[i] <= R[j]:
                arr[k] = L[i]
                i += 1
            else:
                arr[k] = R[j]
                j += 1
            k += 1
        # Checking if any element was left
        while i < len(L):
            arr[k] = L[i]
            i += 1
            k += 1
        while j < len(R):
            arr[k] = R[j]
            j += 1
            k += 1

# Code to print the list
def printList(arr):
    for i in range(len(arr)):
        print(arr[i], end=" ")
    print()

# Driver Code
if __name__ == '__main__':
    arr = [12, 11, 13, 5, 6, 7]
    print("Given array is", end="\n")
    printList(arr)
    start = time.perf_counter()
    mergeSort(arr)
    finish = time.perf_counter()
    print("Sorted array is: ", end="\n")
    printList(arr)
    # Time taken by the sequential merge sort, for performance comparison
    print(f"Finished in {round(finish - start, 6)} second(s)")

Output

Conclusion:
Thus, we have successfully implemented parallel algorithms for Bubble Sort and Merge Sort.

Experiment - 03
Aim:
Implement parallel reduction using Min, Max, Sum and Average Operations.

Objective:
To study and implement the directive-based parallel programming model.

Theory:

OpenMP:
OpenMP is a set of C/C++ pragmas that provide the programmer with a high-level front-end interface that gets translated into calls to threads. The key phrase here is "high-level"; the goal is to better enable the programmer to "think parallel", relieving him/her of the burden and distraction of setting up and coordinating threads. For example, a single OpenMP directive such as #pragma omp parallel for is enough to distribute the iterations of a loop across a team of threads.

OpenMP Core Syntax:


Most of the constructs in OpenMP are compiler directives:
#pragma omp construct [clause [clause]...]

1. The min_reduction function finds the minimum value in the input array using the #pragma omp parallel for reduction(min: min_value) directive, which creates a parallel region and divides the loop iterations among the available threads. Each thread performs the comparison operation in parallel and updates min_value if a smaller value is found.
2. Similarly, the max_reduction function finds the maximum value in the array, the sum_reduction function finds the sum of the elements of the array, and the average_reduction function finds the average of the elements by dividing the sum by the size of the array.
3. The reduction clause is used to combine the results of multiple threads into a single value, which is then returned by the function. The min and max operators are used for the min_reduction and max_reduction functions, respectively, and the + operator is used for the sum_reduction and average_reduction functions. In the main function, an array is created and the functions min_reduction, max_reduction, sum_reduction, and average_reduction are called to compute the min, max, sum and average respectively. (A minimal C/C++ sketch of this reduction clause is shown below.)
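
For reference, here is a minimal C/C++ sketch of the reduction clause described above (an illustration, not the Python program given below), assuming a compiler with OpenMP 3.1 or newer for the min and max operators, e.g. g++ -fopenmp:

#include <cstdio>

int main() {
    int arr[] = {5, 2, 9, 1, 7, 6, 8, 3, 4};
    const int n = sizeof(arr) / sizeof(arr[0]);

    int min_value = arr[0], max_value = arr[0], sum = 0;

    // Each thread keeps private partial results; OpenMP combines them at the end
    #pragma omp parallel for reduction(min: min_value) reduction(max: max_value) reduction(+: sum)
    for (int i = 0; i < n; i++) {
        if (arr[i] < min_value) min_value = arr[i];
        if (arr[i] > max_value) max_value = arr[i];
        sum += arr[i];
    }

    std::printf("Minimum value: %d\n", min_value);
    std::printf("Maximum value: %d\n", max_value);
    std::printf("Sum: %d\n", sum);
    std::printf("Average: %.2f\n", (double)sum / n);
    return 0;
}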

Program:
import multiprocessing as mp

def min_reduction(arr):
    min_value = float('inf')
    for num in arr:
        if num < min_value:
            min_value = num
    return min_value

def max_reduction(arr):
    max_value = float('-inf')
    for num in arr:
        if num > max_value:
            max_value = num
    return max_value

def sum_reduction(arr):
    return sum(arr)

def average_reduction(arr):
    return sum(arr) / len(arr)

def main():
    arr = [5, 2, 9, 1, 7, 6, 8, 3, 4]

    # Pool of worker processes, one per available CPU core
    pool = mp.Pool(processes=mp.cpu_count())

    # Each reduction runs as a task submitted to the pool
    min_value = pool.apply(min_reduction, args=(arr,))
    max_value = pool.apply(max_reduction, args=(arr,))
    sum_value = pool.apply(sum_reduction, args=(arr,))
    average_value = pool.apply(average_reduction, args=(arr,))

    pool.close()
    pool.join()

    print(f"Minimum value: {min_value}")
    print(f"Maximum value: {max_value}")
    print(f"Sum: {sum_value}")
    print(f"Average: {average_value}")

if __name__ == "__main__":
    main()

OUTPUT:

Conclusion:

Thus, we have successfully implemented parallel reduction using Min, Max, Sum and Average
Operations.

Experiment – 04
Aim:
To write a CUDA program for:
1. Addition of two large vectors.
2. Matrix Multiplication using CUDA C.

Objective:
To study and implement the CUDA program using vectors.

Theory:

CUDA:
CUDA programming is especially well suited to problems that can be expressed as data-parallel computations. Any application that processes large data sets can use a data-parallel model to speed up the computations. Data-parallel processing maps data elements to parallel threads.
The first step in designing a data-parallel program is to partition the data across threads, with each thread working on a portion of the data.

CUDA Architecture:
A heterogeneous application consists of two parts:
 Host code
 Device code

Host code runs on the CPU and device code runs on the GPU. An application executing on a heterogeneous platform is typically initialized by the CPU. The CPU code is responsible for managing the environment, code, and data for the device before loading compute-intensive tasks onto the device. In computationally intensive applications, program sections often exhibit a rich amount of data parallelism, and GPUs are used to accelerate the execution of this portion of the work. When a hardware component that is physically separate from the CPU is used to accelerate computationally intensive sections of an application, it is referred to as a hardware accelerator; GPUs are arguably the most common example of a hardware accelerator. The GPU operates in conjunction with the CPU-based host through the PCI-Express bus.

Matrix-Matrix Multiplication

• Consider two n × n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and blocks of B along columns.
• Perform the local submatrix multiplications (a simple parallel sketch of this step is given below).
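
For reference, the local submatrix multiplication step is just a triple loop; the C/C++ fragment below shows how it can be parallelized on the host with OpenMP. It is an illustration added here under the assumption of g++ -fopenmp; it is not a CUDA implementation (the CUDA vector-addition program of this experiment follows below):

#include <cstdio>
#include <vector>

int main() {
    const int n = 4;                   // size of the (sub)matrices
    std::vector<std::vector<int>> A(n, std::vector<int>(n, 1));
    std::vector<std::vector<int>> B(n, std::vector<int>(n, 2));
    std::vector<std::vector<int>> C(n, std::vector<int>(n, 0));

    // Each row of C can be computed independently, so the outer loop is parallelized
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) std::printf("%d ", C[i][j]);
        std::printf("\n");
    }
    return 0;
}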

Program:
#include <iostream>
#include <cuda_runtime.h>
#define N 10    // increase N for genuinely large vectors

// Kernel: each CUDA thread adds one pair of elements
__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
    }
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // Allocate memory on the GPU (device)
    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    // Fill the arrays 'a' and 'b' on the CPU (host)
    for (int i = 0; i < N; i++)
    {
        a[i] = -i;
        b[i] = i * i;
    }

    // Copy the input vectors from host to device
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Launch the kernel with one thread per element
    // (for large N use multiple blocks, e.g. add<<<(N + 255) / 256, 256>>>(...))
    add<<<1, N>>>(dev_a, dev_b, dev_c);

    // Copy the result vector back from device to host
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    // Display the results
    for (int i = 0; i < N; i++)
    {
        std::cout << a[i] << " + " << b[i] << " = " << c[i] << std::endl;
    }

    // Free the device memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
OUTPUT:

Conclusion:

Thus, we have successfully implemented the Addition of two large vectors and Matrix
Multiplication using CUDA C Programming.

A MINI PROJECT REPORT ON

EVALUATION OF PERFORMANCE ENHANCEMENT OF


PARALLEL QUICKSORT ALGORITHM USING MPI

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

AWARD OF THE DEGREE OF

BACHELOR OF ENGINEERING
Submitted by

KHAN NAJMUDDIN FAKHRUDDIN 72221549F

SURTI MOHAMMED BILAL 72221582H

ADNAN AKHLAQUE SHAIKH 72266098H

MAKANDAR AMAN FIROJKHAN 72266109G

Submitted to
Prof. Waquar Ahmed

DEPARTMENT OF COMPUTER ENGINEERING


Academic Year 2024-25
ABSTRACT

Sorting is a fundamental operation in computer science, with numerous applications


in data analysis, machine learning, and scientific computing. As datasets continue to
grow exponentially, traditional sequential sorting algorithms struggle to keep pace.
Parallel computing offers a promising solution by harnessing the power of multiple
processors to tackle large sorting tasks efficiently.

This mini-project investigates the performance enhancement of the Quicksort


algorithm using the Message Passing Interface (MPI). Quicksort is a popular divide-and-conquer sorting algorithm known for its average time complexity of O(n log n).
By parallelizing the Quicksort algorithm with MPI, we aim to leverage the combined
processing power of multiple machines to achieve significant speedups compared to
the sequential implementation.

This report explores the design and implementation of a parallel Quicksort algorithm
using MPI. It evaluates the performance improvement achieved by parallelization and
analyzes the factors affecting its scalability. We will compare the execution time of
the parallel Quicksort with the sequential version for various data sizes and numbers
of processing cores.

This project contributes to understanding the potential of parallel algorithms for data
sorting tasks. The findings will be valuable for researchers and developers seeking to
optimize sorting algorithms for large-scale data processing on high-performance
computing systems.
TABLE OF CONTENTS

SR NO.  CONTENT

i.      Abstract
1.      INTRODUCTION
2.      LITERATURE REVIEW
3.      METHODOLOGY
4.      RESULTS AND DISCUSSION
5.      CONCLUSION
6.      REFERENCES
1. INTRODUCTION
1.1 Parallel Quick Sort

In the realm of data science, sorting algorithms are the silent heroes, meticulously
organizing information for efficient retrieval and analysis. Quicksort, a divide-and-conquer champion, has long reigned supreme for its efficiency. However, as datasets
balloon in size, traditional quicksort struggles to keep pace with the processing
demands. This is where parallel quicksort emerges, leveraging the power of multiple
processors to tackle massive sorting tasks with remarkable speed.

Parallel quicksort builds upon the core principles of its sequential counterpart. It starts
by selecting a pivot element and partitions the data into two sub-arrays: elements less
than the pivot and elements greater than it. The magic, however, lies in the
parallelization. Unlike the single-threaded approach of sequential quicksort, parallel
quicksort utilizes the Message Passing Interface (MPI) to distribute these sub-arrays
across multiple processors. Each processor then independently sorts its assigned chunk
of data using the familiar quicksort method.

This parallel execution unlocks significant speedups. Imagine multiple processors


working concurrently, sorting their sub-arrays simultaneously. By dividing the
workload and harnessing the collective processing power, parallel quicksort
significantly reduces the overall sorting time. This becomes particularly advantageous
for massive datasets, where sequential quicksort would take considerably longer.

1.2 Sequential Quicksort vs. Parallel Quicksort

Quicksort, a divide-and-conquer sorting algorithm, has served as a workhorse for data


organization for decades. It efficiently sorts data by selecting a pivot element,
partitioning elements based on the pivot's value, and recursively sorting the sub-arrays.
While excelling in average-case scenarios with a time complexity of O(n log n),
quicksort falters when tackling massive datasets on modern multi-core or multi-processor systems. This is where parallel quicksort steps in, offering a compelling
solution in the era of parallel computing.
Traditional quicksort operates on a single processor, meticulously comparing and
arranging elements one by one. Parallel quicksort, on the other hand, leverages the
Message Passing Interface (MPI) to distribute the quicksort workload across multiple
processors. By employing the same divide-and-conquer strategy, it partitions the data
and sorts sub-arrays concurrently on these processors, significantly accelerating the
sorting process. This parallelization unlocks the true potential of multi-processor
systems, leading to substantial speedups when dealing with large datasets.

The benefits of parallel quicksort are undeniable. It harnesses the combined processing
power of multiple processors, drastically reducing the sorting time for massive
datasets. This makes it a perfect candidate for large-scale data processing tasks
commonly encountered in high-performance computing. However, parallel quicksort
is not without its challenges. Unlike its sequential counterpart, it introduces
communication overhead between processes using MPI functions to coordinate the
sorting effort. This overhead can outweigh the benefits for small datasets, making
sequential quicksort the preferred choice in such cases. Additionally, achieving
optimal performance with parallel quicksort requires careful design to ensure efficient
load balancing and avoid communication bottlenecks that could hinder scalability.

In conclusion, both quicksort and parallel quicksort play crucial roles in the sorting
landscape. Sequential quicksort remains a valuable tool for its simplicity and
efficiency on single processors. However, as data sizes continue to explode, parallel
quicksort emerges as the champion for large-scale sorting tasks. By effectively
utilizing the power of multiple processors, it paves the way for faster data processing
and analysis in our ever-growing digital world.

1.3 Message Passing Interface

Quicksort, a champion of sorting algorithms with its average time complexity of O(n
log n), reigns supreme for many tasks. However, its sequential nature limits its ability
to fully harness the processing power of modern multi-core processors and distributed
computing systems. This is where MPI (Message Passing Interface) steps in, offering
a powerful tool to parallelize Quicksort and unlock significant performance gains.
MPI, the language of parallel programming, enables communication and data
exchange between processes running on different processors within a cluster. Imagine
a team of collaborators, each assigned a portion of a massive sorting task. MPI
facilitates the exchange of data and coordination between these collaborators,
ultimately leading to a faster, more efficient sorting process.

At the heart of MPI lie processes, independent units of execution that communicate
with each other. These processes are grouped into communicators, ensuring messages
reach the intended recipients. Each process within a communicator has a unique
identifier, its rank, allowing for targeted communication. MPI offers both synchronous
(blocking) and asynchronous (non-blocking) communication methods, catering to
different programming needs. Additionally, it provides built-in collective operations
like MPI_Bcast, enabling all processes to receive the same data simultaneously.
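
To make these notions concrete, here is a minimal C/C++ MPI example added for reference (an illustration, not part of the project code): every process reports its rank within MPI_COMM_WORLD, and the root broadcasts a value (for instance a pivot) to all processes with MPI_Bcast. It is assumed to be compiled with mpic++ and launched with mpirun:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // unique id of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processes in the communicator

    int pivot = 0;
    if (rank == 0)
        pivot = 42;                         // the root chooses a value (e.g., a pivot)

    // Every process in the communicator receives the root's value
    MPI_Bcast(&pivot, 1, MPI_INT, 0, MPI_COMM_WORLD);

    std::printf("Process %d of %d received pivot %d\n", rank, size, pivot);

    MPI_Finalize();
    return 0;
}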
2. LITERATURE REVIEW

• "Parallel Implementation and Evaluation of QuickSort using Open MPI" (CodeProject, 2019): This article demonstrates a parallel quicksort implementation using Open MPI and compares its performance with a sequential version. It shows significant speedups for larger datasets.

• "Performance of MPI Sorting Algorithms on Dual-Core Processor Windows-Based Systems" (ResearchGate): This study analyzes the performance of various sorting algorithms, including quicksort, using MPI on a dual-core system. It highlights the importance of communication overhead and load balancing for achieving optimal performance.

• "A Survey on Parallel Sorting Algorithms" (ACM Computing Surveys, 2012): This comprehensive survey explores various parallel sorting algorithms, including quicksort, discussing their advantages, disadvantages, and implementation techniques.
3. METHODOLOGY
Quicksort, a champion of sorting algorithms, reigns supreme for its efficiency.
However, its sequential nature limits its ability to fully leverage the processing power
of modern computing systems. Enter MPI (Message Passing Interface), a powerful
tool that enables parallelization of Quicksort, unlocking significant performance gains.

MPI acts as the communication hub for a team of collaborators – the MPI processes.
Each process resides on a separate processor within a cluster. Imagine a large, unsorted
pile of items. MPI facilitates the strategic distribution (Dividing the Spoils) of this data
across all the processes, just like dividing the items among team members. Now, each
process tackles its assigned portion (Local Conquests) using the familiar, efficient
Quicksort algorithm. This is where individual processors shine, utilizing their
strengths to conquer their sub-problems.

But how do these independent efforts culminate in a globally sorted whole? A key step
involves choosing a champion (Choosing a Champion). One process, often the leader,
selects a pivot element – a crucial value that will guide the sorting process. Different
strategies can be employed for this selection, akin to choosing a champion to divide
the remaining items based on a specific criterion.

The next stage, The Great Exchange, involves a coordinated exchange of data using
MPI functions like MPI_Sendrecv or MPI_Gatherv. Processes communicate and
shuffle elements based on the chosen pivot. Elements smaller than the pivot are sent
to one group, while those larger are sent to another. This exchange resembles sorting
items based on the champion's value, effectively partitioning the data.

The power of parallelization lies in the conquer and repeat approach (Conquer and Repeat). Each process recursively repeats these steps – data distribution, local sorting, pivot selection, and exchange – on its respective sub-arrays. It is like recursively conquering smaller sub-problems until each process owns a perfectly sorted sub-array. Finally, the MPI function MPI_Gather orchestrates the Victory Lap: the sorted sub-arrays are combined, bringing the team's efforts together to form the final, globally sorted data. Just like a team celebrating victory, the result is a completely sorted dataset, achieved through collaboration and efficient communication. (A bare-bones C/C++ skeleton of this scatter, local sort and gather flow is sketched below.)
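
For illustration only (the project's actual Python/mpi4py source code appears in Section 4.1), a bare-bones C/C++ MPI skeleton of the scatter, local sort and gather flow described above might look as follows. It deliberately omits the pivot broadcast and the pairwise exchange/merge steps, and it assumes the element count is divisible by the number of processes:

#include <mpi.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                 // total element count (assumed divisible by size)
    const int local_n = n / size;
    std::vector<int> data;                 // full array, meaningful only on the root
    std::vector<int> local(local_n);       // this process's chunk

    if (rank == 0) {                       // the root generates the unsorted data
        data.resize(n);
        for (int &x : data) x = std::rand();
    }

    // Dividing the spoils: distribute equal chunks to every process
    MPI_Scatter(data.data(), local_n, MPI_INT,
                local.data(), local_n, MPI_INT, 0, MPI_COMM_WORLD);

    // Local conquest: each process sorts its own chunk
    std::sort(local.begin(), local.end());

    // Victory lap: collect the locally sorted chunks back on the root
    // (a full parallel quicksort would exchange/merge around a pivot before this step)
    MPI_Gather(local.data(), local_n, MPI_INT,
               data.data(), local_n, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("Gathered %d locally sorted chunks on the root process\n", size);

    MPI_Finalize();
    return 0;
}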
4. RESULTS AND DISCUSSION
4.1 Source Code
import random
from mpi4py import MPI
import numpy as np

def sequential_quicksort(arr, low, high):
    if low < high:
        # Choose a random pivot
        pivot_index = random.randint(low, high)
        pivot = arr[pivot_index]

        # Partition the array
        partition_index = partition(arr, low, high, pivot)

        # Sort the left and right subarrays sequentially
        sequential_quicksort(arr, low, partition_index - 1)
        sequential_quicksort(arr, partition_index + 1, high)

def partition(arr, low, high, pivot):
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1

def parallel_quicksort(arr, comm):
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Base case for small subarrays or single process
    if len(arr) <= 1 or size == 1:
        sequential_quicksort(arr, 0, len(arr) - 1)
        return

    # Choose a pivot element (consider median selection for better performance)
    pivot = arr[0]  # Replace with a better pivot selection strategy if needed

    # Scatter the array to all processes
    local_arr = None
    scatter_size = int(len(arr) / size)
    scatter_counts = [scatter_size] * size
    extras = len(arr) % size
    for i in range(size):
        scatter_counts[i] += (1 if i < extras else 0)
    local_arr = comm.Scatterv(arr, scatter_counts, root=0)

    # Sort the local subarray
    parallel_quicksort(local_arr, comm)

    if size > 1:
        # Determine partner process for merge (odd process sends to even)
        partner_rank = (rank + 1) % size

        # Send/Receive data for merging (if ranks are appropriate)
        if rank % 2 == 0 and partner_rank < size:
            send_data = local_arr[scatter_size:]  # Subarray after potential pivot
            recv_data = None
            recv_data = comm.Recv(recvbuf=recv_data, source=partner_rank)
            merged_arr = np.concatenate((local_arr[:scatter_size], recv_data))
        elif rank % 2 == 1 and rank - 1 >= 0:
            send_data = local_arr  # Entire subarray for odd process
            send_data = comm.Send(sendbuf=send_data, dest=rank - 1)
            merged_arr = local_arr

        # Merge the received data with the local sorted subarray (if applicable)
        if merged_arr is not None:
            left_size = min(scatter_size, len(merged_arr))
            right_size = len(merged_arr) - left_size
            i, j, k = 0, 0, 0
            while i < left_size and j < right_size:
                if merged_arr[i] <= merged_arr[left_size + j]:
                    local_arr[k] = merged_arr[i]
                    i += 1
                else:
                    local_arr[k] = merged_arr[left_size + j]
                    j += 1
                k += 1
            while i < left_size:
                local_arr[k] = merged_arr[i]
                i += 1
                k += 1
            while j < right_size:
                local_arr[k] = merged_arr[left_size + j]
                j += 1
                k += 1

    # Gather sorted subarrays from all processes (root only)
    gathered_arr = None
    if rank == 0:
        gathered_arr = np.empty(len(arr), dtype=arr.dtype)  # Ensure correct data type
    gather_recvcounts = [scatter_size] * size  # Initial counts for gathering
    displacements = [0] * size  # Initial displacements for gathering

    # Update recvcounts and displacements based on local subarray sizes
    for i in range(1, size):
        displacements[i] = displacements[i - 1] + gather_recvcounts[i - 1]
        if local_arr.size > scatter_size:
            gather_recvcounts[i] = local_arr.size - scatter_size  # Count for extras in odd ranks

    # Gather sorted subarrays from all processes using Gatherv
    comm.Gatherv(local_arr[scatter_size:], gather_recvcounts, displacements, arr.dtype,
                 gathered_arr, root=0)

    # Print the gathered sorted array (on root process only)
    if rank == 0:
        print("Gathered sorted array:", gathered_arr)
comm = MPI.COMM_WORLD

# Generate random data (adjust as needed)
data_size = 100000
data = [random.randint(1, 1000000) for _ in range(data_size)]

# Start timer
start_time = MPI.Wtime()

# Perform parallel quicksort
parallel_quicksort(data, comm)

# Stop timer
end_time = MPI.Wtime()

# Print execution time in microseconds (rank 0 only)
if comm.Get_rank() == 0:
    execution_time = (end_time - start_time) * 1e6  # Convert to microseconds
    print("Execution time (parallel):", execution_time, "microseconds")

# Sequential quicksort for comparison
start_time = MPI.Wtime()
sequential_quicksort(data, 0, len(data) - 1)
end_time = MPI.Wtime()
if comm.Get_rank() == 0:
    print("Execution time (sequential):", (end_time - start_time) * 1e6, "microseconds")

# Finalize MPI
MPI.Finalize()

4.2 Output
5. CONCLUSION
Parallel quicksort remains a powerful tool for large-scale data sorting. Its ability to
leverage multiple processors for faster sorting makes it a valuable asset in high-
performance computing environments. As data sizes continue to grow exponentially,
parallel quicksort, with its divide-and-conquer spirit and parallel execution, stands
poised to remain a champion in the ever-evolving world of data organization.

The benefits of employing MPI in Quicksort are compelling. Scalability Triumphant


– as the number of processors increases, the MPI-based Quicksort scales gracefully,
potentially leading to dramatic speedups for massive datasets. Resourceful Utilization
ensures optimal utilization of the entire computing cluster's processing power. No
processor remains idle while others struggle. Additionally, Flexibility at Play allows
for customization of communication patterns to accommodate specific hardware and
software environments.

However, parallel quicksort is not without its complexities. Communication overhead


becomes a factor. Processors need to exchange information about the pivot element
and potentially redistribute data after partitioning. This communication can introduce
a slight bottleneck, especially for smaller datasets where the overhead might outweigh
the benefits of parallelization. Additionally, ensuring efficient load balancing across
processors is crucial for optimal performance. Skewed data distribution or imbalanced
sub-array sizes can lead to processors idling while others are overloaded.
6. REFERENCES

i.    https://hackr.io/blog/quick-sort-in-c
ii.   https://github.com/Eduard-747/my_projects
iii.  https://www.codeproject.com/KB/threads/Parallel_Quicksort/Parallel_Quick_sort_without_merge.pdf
iv.   Puneet C. Kataria, "Parallel Quicksort Implementation Using MPI and Pthreads".
v.    Hanmao Shi and Jonathan Schaeffer, "Parallel Sorting by Regular Sampling".
vi.   Philippas Tsigas and Yi Zhang, "A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000".
vii.  https://www.javatpoint.com/quick-sort
viii. https://www.geeksforgeeks.org/implementation-of-quick-sort-using-mpi-omp-and-posix-thread/
ix.   https://github.com/triasamo1/QuicksortParallelMPI/blob/master/quicksort_merge_mpi.c
