Assignment 4(A)
Title of the Assignment: Write a CUDA Program for Addition of two large vectors
Objective of the Assignment: Students should be able to write a CUDA program for the addition of
two large vectors
Prerequisite:
1. CUDA Concept
2. Vector Addition
3. How to execute Program in CUDA Environment
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is CUDA
2. Addition of two large Vector
3. Execution of CUDA Environment
--------------------------------------------------------------------------------------------------------------
Department of Computer Engineering Course : Laboratory Practice V
What is CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA. It allows developers to use the power of NVIDIA graphics processing units (GPUs)
to accelerate computation tasks in various applications, including scientific computing, machine learning, and
computer vision. CUDA provides a set of programming APIs, libraries, and tools that enable developers to
write and execute parallel code on NVIDIA GPUs. It supports popular programming languages like C, C++,
and Python, and provides a simple programming model that abstracts away much of the low-level details of
GPU architecture.
Using CUDA, developers can exploit the massive parallelism and high computational power of GPUs to
accelerate computationally intensive tasks, such as matrix operations, image processing, and deep learning.
CUDA has become an important tool for scientific research and is widely used in fields like physics, chemistry,
biology, and engineering.
Steps for Addition of two large vectors using CUDA
1. Define the size of the vectors: In this step, you need to define the size of the vectors that you want to
add. This will determine the number of threads and blocks you will need to use to parallelize the
addition operation.
2. Allocate memory on the host: In this step, you need to allocate memory on the host for the two vectors
that you want to add and for the result vector. You can use the C malloc function to allocate memory.
3. Initialize the vectors: In this step, you need to initialize the two vectors that you want to add on the
host. You can use a loop to fill the vectors with data.
4. Allocate memory on the device: In this step, you need to allocate memory on the device for the two
vectors that you want to add and for the result vector. You can use the CUDA function cudaMalloc to
allocate memory.
5. Copy the input vectors from host to device: In this step, you need to copy the two input vectors from
the host to the device memory. You can use the CUDA function cudaMemcpy to copy the vectors.
6. Launch the kernel: In this step, you need to launch the CUDA kernel that will perform the addition
operation. The kernel will be executed by multiple threads in parallel. You can use the <<<...>>>
syntax to specify the number of blocks and threads to use.
7. Copy the result vector from device to host: In this step, you need to copy the result vector from the
device memory to the host memory. You can use the CUDA function cudaMemcpy to copy the result
vector.
8. Free memory on the device: In this step, you need to free the memory that was allocated on the
device. You can use the CUDA function cudaFree to free the memory.
9. Free memory on the host: In this step, you need to free the memory that was allocated on the host.
You can use the C free function to free the memory.
This will execute the program and perform the addition of two large vectors.
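The nine steps above can be sketched as one complete CUDA C program. This is a minimal illustration, assuming an NVIDIA GPU and the nvcc compiler; the kernel name vecAdd and the vector size are arbitrary choices:

```cuda
// vector_add.cu — sketch following steps 1–9 above.
// Compile (assuming nvcc is installed): nvcc vector_add.cu -o vector_add
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against extra threads
}

int main(void) {
    const int n = 1 << 20;                 // step 1: vector size (2^20 elements)
    size_t bytes = n * sizeof(float);

    float *h_a = (float*)malloc(bytes);    // step 2: allocate on the host
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) {          // step 3: initialize the inputs
        h_a[i] = (float)i; h_b[i] = 2.0f * i;
    }

    float *d_a, *d_b, *d_c;                // step 4: allocate on the device
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // step 5: copy inputs
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                     // step 6: launch the kernel
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // step 7: copy result

    printf("c[10] = %f\n", h_c[10]);       // spot-check one element (10 + 20)

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);          // step 8: free device
    free(h_a); free(h_b); free(h_c);                      // step 9: free host
    return 0;
}
```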
Questions:
1. What is the purpose of using CUDA to perform addition of two large vectors?
2. How do you allocate memory for the vectors on the device using CUDA?
3. How do you launch the CUDA kernel to perform the addition of two large vectors?
4. How can you optimize the performance of the CUDA program for adding two large vectors?
Group A
Assignment 4(B)
Title of the Assignment: Write a Program for Matrix Multiplication using CUDA C
Objective of the Assignment: Students should be able to write a program for Matrix Multiplication
using CUDA C
Prerequisite:
1. CUDA Concept
2. Matrix Multiplication
3. How to execute Program in CUDA Environment
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is CUDA
2. Matrix Multiplication
3. Execution of CUDA Environment
--------------------------------------------------------------------------------------------------------------
What is CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA. It allows developers to use the power of NVIDIA graphics processing units (GPUs)
to accelerate computation tasks in various applications, including scientific computing, machine learning, and
computer vision. CUDA provides a set of programming APIs, libraries, and tools that enable developers to
write and execute parallel code on NVIDIA GPUs. It supports popular programming languages like C, C++,
and Python, and provides a simple programming model that abstracts away much of the low-level details of
GPU architecture.
Using CUDA, developers can exploit the massive parallelism and high computational power of GPUs to
accelerate computationally intensive tasks, such as matrix operations, image processing, and deep learning.
CUDA has become an important tool for scientific research and is widely used in fields like physics, chemistry,
biology, and engineering.
Steps for Matrix Multiplication using CUDA
Here are the steps for implementing matrix multiplication using CUDA C:
1. Matrix Initialization: The first step is to initialize the matrices that you want to multiply. You can
use standard C or CUDA functions to allocate memory for the matrices and initialize their values.
The matrices are usually represented as 2D arrays.
2. Memory Allocation: The next step is to allocate memory on the host and the device for the matrices.
You can use the standard C malloc function to allocate memory on the host and the CUDA function
cudaMalloc() to allocate memory on the device.
3. Data Transfer: The third step is to transfer data between the host and the device. You can use
the CUDA function cudaMemcpy() to transfer data from the host to the device or vice versa.
4. Kernel Launch: The fourth step is to launch the CUDA kernel that will perform the matrix
multiplication on the device. You can use the <<<...>>> syntax to specify the number of blocks
and threads to use. Each thread in the kernel will compute one element of the output matrix.
5. Device Synchronization: The fifth step is to synchronize the device to ensure that all kernel
executions have completed before proceeding. You can use the CUDA function
cudaDeviceSynchronize() to synchronize the device.
6. Data Retrieval: The sixth step is to retrieve the result of the computation from the device to the host.
You can use the CUDA function cudaMemcpy() to transfer data from the device to the host.
7. Memory Deallocation: The final step is to deallocate the memory that was allocated on the host and the
device. You can use the C free function to deallocate memory on the host and the CUDA function
cudaFree() to deallocate memory on the device.
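As a sketch of the kernel described in step 4, where each thread computes one element of the output matrix, a naive CUDA C implementation might look like the following (the name matMul and the square N×N layout are illustrative assumptions, not a prescribed implementation):

```cuda
// matmul.cu — naive one-thread-per-output-element kernel sketch.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)        // dot product of a row of A and a column of B
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;            // each thread writes exactly one element
    }
}

// Host-side launch, assuming d_A, d_B, d_C were set up via cudaMalloc/cudaMemcpy:
//   dim3 threads(16, 16);
//   dim3 blocks((N + 15) / 16, (N + 15) / 16);
//   matMul<<<blocks, threads>>>(d_A, d_B, d_C, N);
//   cudaDeviceSynchronize();             // step 5: wait for the kernel to finish
```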
Questions:
1. What are the advantages of using CUDA to perform matrix multiplication compared to using a CPU?
2. How do you handle matrices that are too large to fit in GPU memory in CUDA matrix
multiplication?
3. How do you optimize the performance of the CUDA program for matrix multiplication?
4. How do you ensure correctness of the CUDA program for matrix multiplication and verify the
results?
Group A
Title of the Assignment: Write a program to implement Parallel Merge Sort. Use existing
algorithms and measure the performance of sequential and parallel algorithms.
Objective of the Assignment: Students should be able to Write a program to implement Parallel
Merge Sort and can measure the performance of sequential and parallel algorithms.
Prerequisite:
1. Basic of programming language
2. Concept of Merge Sort
3. Concept of Parallelism
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is Merge? Use of Merge Sort
2. Example of Merge sort?
3. Concept of OpenMP
4. How Parallel Merge Sort Work
5. How to measure the performance of sequential and parallel algorithms?
--------------------------------------------------------------------------------------------------------------
Merge sort is a sorting algorithm that uses a divide-and-conquer approach to sort an array or a list of
elements. The algorithm works by recursively dividing the input array into two halves, sorting each half,
and then merging the sorted halves to produce a sorted output.
The merge sort algorithm can be broken down into the following steps:
● According to the merge sort, first divide the given array into two equal halves. Merge sort
keeps dividing the list into equal parts until it cannot be further divided.
● As there are eight elements in the given array, so it is divided into two arrays of size 4.
● Now, again divide these two arrays into halves. As they are of size 4, divide them into new
arrays of size 2.
● Now, again divide these arrays to get the atomic value that cannot be further divided.
● In the next iteration of combining, compare the arrays of two data values and merge them
into arrays of four values in sorted order.
● Now, there is a final merging of the arrays. After the final merging of above arrays, the array
will look like -
How to measure the performance of sequential and parallel algorithms
There are several metrics that can be used to measure the performance of sequential and parallel
merge sort algorithms:
1. Execution time: Execution time is the amount of time it takes for the algorithm to complete its
sorting operation. This metric can be used to compare the speed of sequential and parallel
merge sort algorithms.
2. Speedup: Speedup is the ratio of the execution time of the sequential merge sort algorithm to the
execution time of the parallel merge sort algorithm. A speedup of greater than 1 indicates that the
parallel algorithm is faster than the sequential algorithm.
3. Efficiency: Efficiency is the ratio of the speedup to the number of processors or cores used in
the parallel algorithm. This metric can be used to determine how well the parallel algorithm is
utilizing the available resources.
4. Scalability: Scalability is the ability of the algorithm to maintain its performance as the input size
and number of processors or cores increase. A scalable algorithm will maintain a consistent
speedup and efficiency as more resources are added.
To measure the performance of sequential and parallel merge sort algorithms, you can perform experiments
on different input sizes and numbers of processors or cores. By measuring the execution time, speedup,
efficiency, and scalability of the algorithms under different conditions, you can determine
which algorithm is more efficient for different input sizes and hardware configurations. Additionally, you can
use profiling tools to analyze the performance of the algorithms and identify areas for optimization.
Conclusion: In this way we can implement Merge Sort in a parallel way using OpenMP and measure the
performance of the sequential and parallel algorithms.
Reference link
● https://www.geeksforgeeks.org/merge-sort/
● https://www.javatpoint.com/merge-sort
Group A
Assignment No: 3
Title of the Assignment: Implement Min, Max, Sum and Average operations using Parallel
Reduction.
Objective of the Assignment: To understand the concept of parallel reduction and how it can be
used to perform basic mathematical operations on given data sets.
Prerequisite:
1. Parallel computing architectures
2. Parallel programming models
3. Proficiency in programming languages
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is parallel reduction and its usefulness for mathematical operations on large data?
2. Concept of OpenMP
3. How do parallel reduction algorithms for Min, Max, Sum, and Average work, and what are
their advantages and limitations?
--------------------------------------------------------------------------------------------------------------
Parallel Reduction.
Here's a function-wise manual on how to understand and run the sample C++ program that demonstrates
how to implement Min, Max, Sum, and Average operations using parallel reduction.
1. Min_Reduction function
• The function takes in a vector of integers as input and finds the minimum value in the
vector using parallel reduction.
• The OpenMP reduction clause is used with the "min" operator to find the minimum value
across all threads.
• The minimum value found by each thread is reduced to the overall minimum value of the
entire array.
• The final minimum value is printed to the console.
2. Max_Reduction function
• The function takes in a vector of integers as input and finds the maximum value in the
vector using parallel reduction.
• The OpenMP reduction clause is used with the "max" operator to find the maximum value
across all threads.
• The maximum value found by each thread is reduced to the overall maximum value of the
entire array.
• The final maximum value is printed to the console.
3. Sum_Reduction function
• The function takes in a vector of integers as input and finds the sum of all the values in the
vector using parallel reduction.
• The OpenMP reduction clause is used with the "+" operator to find the sum across all
threads.
• The sum found by each thread is reduced to the overall sum of the entire array.
• The final sum is printed to the console.
4. Average_Reduction function
• The function takes in a vector of integers as input and finds the average of all the values in
the vector using parallel reduction.
• The OpenMP reduction clause is used with the "+" operator to find the sum across all
threads.
• The sum found by each thread is reduced to the overall sum of the entire array.
• The final sum is divided by the size of the array to find the average.
• The final average value is printed to the console.
5. Main Function
• The function initializes a vector of integers with some values.
• The function calls the min_reduction, max_reduction, sum_reduction, and
average_reduction functions on the input vector to find the corresponding values.
• The final minimum, maximum, sum, and average values are printed to the console.
6. Compiling and running the program
Compile the program: You need to use a C++ compiler that supports OpenMP, such as g++ or
clang. Open a terminal and navigate to the directory where your program is saved. Then,
compile the program using the following command:
$ g++ -fopenmp program.cpp -o program
This command compiles your program and creates an executable file named "program". The "-fopenmp"
flag tells the compiler to enable OpenMP.
Run the program: To run the program, simply type the name of the executable file in the terminal
and press Enter:
$ ./program
Conclusion: We have implemented the Min, Max, Sum, and Average operations using parallel
reduction in C++ with OpenMP. Parallel reduction is a powerful technique that allows us to
perform these operations on large arrays more efficiently by dividing the work among multiple
threads running in parallel. We presented a code example that demonstrates the
implementation of these operations using parallel reduction in C++ with OpenMP. We also
provided a manual for running OpenMP programs on the Ubuntu platform.
Assignment Question
1. What are the benefits of using parallel reduction for basic operations on large arrays?
2. How does OpenMP's "reduction" clause work in parallel reduction?
3. How do you set up a C++ program for parallel computation with OpenMP?
4. What are the performance characteristics of parallel reduction, and how do they vary
based on input size?
5. How can you modify the provided code example for more complex operations using
parallel reduction?
Group A
Title of the Assignment: Write a program to implement Parallel Bubble Sort. Use existing
algorithms and measure the performance of sequential and parallel algorithms.
Objective of the Assignment: Students should be able to write a program to implement
Parallel Bubble Sort and measure the performance of sequential and parallel algorithms.
Prerequisite:
1. Basic of programming language
2. Concept of Bubble Sort
3. Concept of Parallelism
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is Bubble Sort? Use of Bubble Sort
2. Example of Bubble sort?
3. Concept of OpenMP
4. How Parallel Bubble Sort Work
5. How to measure the performance of sequential and parallel algorithms?
--------------------------------------------------------------------------------------------------------------
Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements if they are
in the wrong order. It is called "bubble" sort because the algorithm moves the larger elements towards the
end of the array in a manner that resembles the rising of bubbles in a liquid.
The time complexity of Bubble Sort is O(n^2), which makes it inefficient for large lists. However, it has
the advantage of being easy to understand and implement, and it is useful for educational purposes and
for sorting small datasets.
Bubble Sort has limited practical use in modern software development due to its inefficient time
complexity of O(n^2) which makes it unsuitable for sorting large datasets. However, Bubble Sort has
some advantages and use cases that make it a valuable algorithm to understand, such as:
1. Simplicity: Bubble Sort is one of the simplest sorting algorithms, and it is easy to understand and
implement. It can be used to introduce the concept of sorting to beginners and as a basis for more
complex sorting algorithms.
2. Educational purposes: Bubble Sort is often used in academic settings to teach the principles of
sorting algorithms and to help students understand how algorithms work.
3. Small datasets: For very small datasets, Bubble Sort can be an efficient sorting algorithm, as its
overhead is relatively low.
4. Partially sorted datasets: If a dataset is already partially sorted, Bubble Sort can be very efficient.
Since Bubble Sort only swaps adjacent elements that are in the wrong order, it has a low number
of operations for a partially sorted dataset.
5. Performance optimization: Although Bubble Sort itself is not suitable for sorting large datasets,
some of its techniques can be used in combination with other sorting algorithms to optimize their
performance. For example, Bubble Sort can be used to optimize the performance of Insertion Sort
by reducing the number of comparisons needed.
The first iteration begins by comparing the first two values. If the first value is greater than the
second, the two values are swapped.
Step 1: In the case of 5, 3, 4, 1, and 2, 5 is greater than 3. So 5 takes the position of 3 and the numbers
become 3, 5, 4, 1, and 2.
Step 2: The algorithm now has 3, 5, 4, 1, and 2 to compare, this time around, it compares the next two
values, which are 5 and 4. 5 is greater than 4, so 5 takes the index of 4 and the values now become 3, 4,
5, 1, and 2.
Step 3: The algorithm now has 3, 4, 5, 1, and 2 to compare. It compares the next two values, which are 5
and 1. 5 is greater than 1, so 5 takes the index of 1 and the numbers become 3, 4, 1, 5, and 2.
Step 4: The algorithm now has 3, 4, 1, 5, and 2 to compare. It compares the next two values, which are
5 and 2. 5 is greater than 2, so 5 takes the index of 2 and the numbers become 3, 4, 1, 2, and 5.
That’s the first iteration. And the numbers are now arranged as 3, 4, 1, 2, and 5 – from the initial 5, 3, 4,
1, and 2. As you might realize, 5 should be the last number if the numbers are sorted in ascending order.
This means the first iteration is complete.
The algorithm starts the second iteration with the last result of 3, 4, 1, 2, and 5. This time around, 3
is smaller than 4, so no swapping happens. This means the numbers will remain the same.
The algorithm proceeds to compare 4 and 1. 4 is greater than 1, so 4 is swapped for 1 and the numbers
become 3, 1, 4, 2, and 5.
The algorithm now proceeds to compare 4 and 2. 4 is greater than 2, so 4 is swapped for 2 and
the numbers become 3, 1, 2, 4, and 5.
4 is now in the right place, so no swapping occurs between 4 and 5 because 4 is smaller than 5.
That’s how the algorithm continues to compare the numbers until they are arranged in ascending order
of 1, 2, 3, 4, and 5.
Concept of OpenMP
OpenMP (Open Multi-Processing) is an API for shared-memory parallel programming in C, C++, and
Fortran. It targets shared-memory multiprocessors, where multiple processor cores can access the same
memory. OpenMP uses a fork-join model of parallel execution, where a master thread forks multiple
worker threads to execute a parallel region of the code, and then waits for all threads to complete
before continuing with the sequential part of the code.
● Parallel Bubble Sort is a modification of the classic Bubble Sort algorithm that takes advantage of
parallel processing to speed up the sorting process.
● In parallel Bubble Sort, the list of elements is divided into multiple sublists that are sorted
concurrently by multiple threads. Each thread sorts its sublist using the regular Bubble Sort
algorithm. When all sublists have been sorted, they are merged together to form the final sorted list.
● The parallelization of the algorithm is achieved using OpenMP, a programming API that supports
parallel processing in C++, Fortran, and other programming languages. OpenMP provides a set of
compiler directives that allow developers to specify which parts of the code can be executed in
parallel.
● In the parallel Bubble Sort algorithm, the main loop that iterates over the list of elements is divided
into multiple iterations that are executed concurrently by multiple threads. Each thread sorts a subset
of the list, and the threads synchronize their work at the end of each iteration to ensure that the
elements are properly ordered.
● Parallel Bubble Sort can provide a significant speedup over the regular Bubble Sort algorithm,
especially when sorting large datasets on multi-core processors. However, the speedup is limited by
the overhead of thread creation and synchronization, and it may not be worth the effort for small
datasets or when using a single-core processor.
To measure and compare the performance of the sequential and parallel Bubble Sort algorithms:
1. Implement both the sequential and the parallel versions of the algorithm.
2. Prepare test cases: arrays of different sizes and initial orderings.
3. Use a reliable timer to measure the execution time of each algorithm on each test case.
4. Record the execution times and analyze the results.
When measuring the performance of the parallel Bubble Sort algorithm, you will need to specify the number
of threads to use. You can experiment with different numbers of threads to find the optimal value for your
system.
● Run each algorithm multiple times on each test case and take the average execution time to reduce
the impact of variations in system load and other factors.
● Monitor system resource usage during execution, such as CPU utilization and memory
consumption, to detect any performance bottlenecks.
● Visualize the results using charts or graphs to make it easier to compare the performance of
the two algorithms.
On Linux, several command-line tools can be used to monitor system resource usage during these experiments:
1. top: The top command provides a real-time view of system resource usage, including CPU
utilization and memory consumption. To use it, open a terminal window and type top. The output
will display a list of processes sorted by resource usage, with the most resource-intensive processes
at the top.
2. htop: htop is a more advanced version of top that provides additional features, such as interactive
process filtering and a color-coded display. To use it, open a terminal window and type htop.
3. ps: The ps command provides a snapshot of system resource usage at a particular moment in time.
To use it, open a terminal window and type ps aux. This will display a list of all running processes
and their resource usage.
4. free: The free command provides information about system memory usage, including total, used,
and free memory. To use it, open a terminal window and type free -h.
5. vmstat: The vmstat command provides a variety of system statistics, including CPU utilization,
memory usage, and disk activity. To use it, open a terminal window and type vmstat.
Conclusion: In this way we can implement Bubble Sort in a parallel way using OpenMP and measure the
performance of the sequential and parallel algorithms.
Reference link
● https://www.freecodecamp.org/news/bubble-sort-algorithm-in-java-cpp-python-with-example-code/
Group A
Title of the Assignment: Design and implement Parallel Breadth First Search based on
existing algorithms using OpenMP. Use a Tree or an undirected graph for BFS
Objective of the Assignment: Students should be able to perform Parallel Breadth First
Search based on existing algorithms using OpenMP
Prerequisite:
1. Basic of programming language
2. Concept of BFS
3. Concept of Parallelism
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is BFS?
2. Example of BFS
3. Concept of OpenMP
4. How Parallel BFS Work
5. Code Explanation with Output
---------------------------------------------------------------------------------------------------------------
What is BFS?
BFS stands for Breadth-First Search. It is a graph traversal algorithm used to explore all the nodes of a
graph or tree systematically, starting from the root node or a specified starting point, and visiting all the
neighboring nodes at the current depth level before moving on to the next depth level.
The algorithm uses a queue data structure to keep track of the nodes that need to be visited, and marks
each visited node to avoid processing it again. The basic idea of the BFS algorithm is to visit all the
nodes at a given level before moving on to the next level, which ensures that all the nodes are visited in
breadth-first order.
BFS is commonly used in many applications, such as finding the shortest path between two nodes,
solving puzzles, and searching through a tree or graph.
Example of BFS
Now let’s take a look at the steps involved in traversing a graph by using Breadth-First Search:
Step 1: Take an empty Queue.
Step 2: Select a starting node (visiting a node) and insert it into the Queue.
Step 3: Provided that the Queue is not empty, extract the node from the Queue and insert its child nodes
(exploring a node) into the Queue.
Step 4: Repeat Step 3 until the Queue is empty, marking each visited node so it is not processed again.
How Parallel BFS Works
● Parallel BFS (Breadth-First Search) is an algorithm used to explore all the nodes of a graph or tree
systematically in parallel. It is a popular parallel algorithm used for graph traversal in distributed
computing, shared-memory systems, and parallel clusters.
● The parallel BFS algorithm starts by selecting a root node or a specified starting point, and then
assigning it to a thread or processor in the system. Each thread maintains a local queue of nodes to be
visited and marks each visited node to avoid processing it again.
● The algorithm then proceeds in levels, where each level represents a set of nodes that are at a certain
distance from the root node. Each thread processes the nodes in its local queue at the current level, and
then exchanges the nodes that are adjacent to the current level with other threads or processors. This
is done to ensure that the nodes at the next level are visited by the next iteration of the algorithm.
● The parallel BFS algorithm uses two phases: the computation phase and the communication phase. In
the computation phase, each thread processes the nodes in its local queue, while in the communication
phase, the threads exchange the nodes that are adjacent to the current level with other threads or
processors.
● The parallel BFS algorithm terminates when all nodes have been visited or when a specified node has
been found. The result of the algorithm is the set of visited nodes or the shortest path from the root
node to the target node.
● Parallel BFS can be implemented using different parallel programming models, such as OpenMP, MPI,
CUDA, and others. The performance of the algorithm depends on the number of threads or processors
used, the size of the graph, and the communication overhead between the threads or processors.
Assignment Question
1. What is BFS?
2. What is OpenMP? What is its significance in parallel programming?
3. Write down applications of Parallel BFS
4. How can BFS be parallelized using OpenMP? Describe the parallel BFS algorithm
using OpenMP.
5. Write down the commonly used OpenMP directives and clauses.
Reference link
● https://www.edureka.co/blog/breadth-first-search-algorithm/