Assignment 4(A)
Title of the Assignment: Write a CUDA Program for Addition of two large vectors
Objective of the Assignment: Students should be able to write a CUDA program for the addition of
two large vectors
Prerequisite:
1. CUDA Concept
2. Vector Addition
3. How to execute Program in CUDA Environment
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is CUDA
2. Addition of two large Vector
3. Execution of CUDA Environment
--------------------------------------------------------------------------------------------------------------
Department of Computer Engineering Course : Laboratory Practice V
What is CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA. It allows developers to use the power of NVIDIA graphics processing units (GPUs)
to accelerate computation tasks in various applications, including scientific computing, machine learning, and
computer vision. CUDA provides a set of programming APIs, libraries, and tools that enable developers to
write and execute parallel code on NVIDIA GPUs. It supports popular programming languages like C, C++,
and Python, and provides a simple programming model that abstracts away much of the low-level details of
GPU architecture.
Using CUDA, developers can exploit the massive parallelism and high computational power of GPUs to
accelerate computationally intensive tasks, such as matrix operations, image processing, and deep learning.
CUDA has become an important tool for scientific research and is widely used in fields like physics, chemistry,
biology, and engineering.
Steps for Addition of two large vectors using CUDA
1. Define the size of the vectors: In this step, you need to define the size of the vectors that you want to
add. This will determine the number of threads and blocks you will need to use to parallelize the
addition operation.
2. Allocate memory on the host: In this step, you need to allocate memory on the host for the two vectors
that you want to add and for the result vector. You can use the C malloc function to allocate memory.
3. Initialize the vectors: In this step, you need to initialize the two vectors that you want to add on the
host. You can use a loop to fill the vectors with data.
4. Allocate memory on the device: In this step, you need to allocate memory on the device for the two
vectors that you want to add and for the result vector. You can use the CUDA function cudaMalloc to
allocate memory.
5. Copy the input vectors from host to device: In this step, you need to copy the two input vectors from
the host to the device memory. You can use the CUDA function cudaMemcpy to copy the vectors.
6. Launch the kernel: In this step, you need to launch the CUDA kernel that will perform the addition
operation. The kernel will be executed by multiple threads in parallel. You can use the <<<...>>>
syntax to specify the number of blocks and threads to use.
7. Copy the result vector from device to host: In this step, you need to copy the result vector from the
device memory to the host memory. You can use the CUDA function cudaMemcpy to copy the result
vector.
8. Free memory on the device: In this step, you need to free the memory that was allocated on the
device. You can use the CUDA function cudaFree to free the memory.
9. Free memory on the host: In this step, you need to free the memory that was allocated on the host.
You can use the C free function to free the memory.
This will execute the program and perform the addition of two large vectors.
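The nine steps above can be sketched as one complete CUDA C program. This is a minimal illustration, assuming an NVIDIA GPU and the nvcc compiler; the kernel name vecAdd and the vector size are arbitrary choices:

```cuda
// vector_add.cu — sketch following steps 1–9 above.
// Compile (assuming nvcc is installed): nvcc vector_add.cu -o vector_add
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against extra threads
}

int main(void) {
    const int n = 1 << 20;                 // step 1: vector size (2^20 elements)
    size_t bytes = n * sizeof(float);

    float *h_a = (float*)malloc(bytes);    // step 2: allocate on the host
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) {          // step 3: initialize the inputs
        h_a[i] = (float)i; h_b[i] = 2.0f * i;
    }

    float *d_a, *d_b, *d_c;                // step 4: allocate on the device
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // step 5: copy inputs
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                     // step 6: launch the kernel
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // step 7: copy result

    printf("c[10] = %f\n", h_c[10]);       // spot-check one element (10 + 20)

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);          // step 8: free device
    free(h_a); free(h_b); free(h_c);                      // step 9: free host
    return 0;
}
```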
Questions:
1. What is the purpose of using CUDA to perform addition of two large vectors?
2. How do you allocate memory for the vectors on the device using CUDA?
3. How do you launch the CUDA kernel to perform the addition of two large vectors?
4. How can you optimize the performance of the CUDA program for adding two large vectors?
Group A
Assignment 4(B)
Title of the Assignment: Write a Program for Matrix Multiplication using CUDA C
Objective of the Assignment: Students should be able to write a program for Matrix Multiplication
using CUDA C
Prerequisite:
1. CUDA Concept
2. Matrix Multiplication
3. How to execute Program in CUDA Environment
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is CUDA
2. Matrix Multiplication
3. Execution of CUDA Environment
--------------------------------------------------------------------------------------------------------------
What is CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA. It allows developers to use the power of NVIDIA graphics processing units (GPUs)
to accelerate computation tasks in various applications, including scientific computing, machine learning, and
computer vision. CUDA provides a set of programming APIs, libraries, and tools that enable developers to
write and execute parallel code on NVIDIA GPUs. It supports popular programming languages like C, C++,
and Python, and provides a simple programming model that abstracts away much of the low-level details of
GPU architecture.
Using CUDA, developers can exploit the massive parallelism and high computational power of GPUs to
accelerate computationally intensive tasks, such as matrix operations, image processing, and deep learning.
CUDA has become an important tool for scientific research and is widely used in fields like physics, chemistry,
biology, and engineering.
Steps for Matrix Multiplication using CUDA
Here are the steps for implementing matrix multiplication using CUDA C:
1. Matrix Initialization: The first step is to initialize the matrices that you want to multiply. You can
use standard C or CUDA functions to allocate memory for the matrices and initialize their values.
The matrices are usually represented as 2D arrays.
2. Memory Allocation: The next step is to allocate memory on the host and the device for the matrices.
You can use the standard C malloc function to allocate memory on the host and the CUDA function
cudaMalloc() to allocate memory on the device.
3. Data Transfer: The third step is to transfer data between the host and the device. You can use
the CUDA function cudaMemcpy() to transfer data from the host to the device or vice versa.
4. Kernel Launch: The fourth step is to launch the CUDA kernel that will perform the matrix
multiplication on the device. You can use the <<<...>>> syntax to specify the number of blocks
and threads to use. Each thread in the kernel will compute one element of the output matrix.
5. Device Synchronization: The fifth step is to synchronize the device to ensure that all kernel
executions have completed before proceeding. You can use the CUDA function
cudaDeviceSynchronize() to synchronize the device.
6. Data Retrieval: The sixth step is to retrieve the result of the computation from the device to the host.
You can use the CUDA function cudaMemcpy() to transfer data from the device to the host.
7. Memory Deallocation: The final step is to deallocate the memory that was allocated on the host and the
device. You can use the C free function to deallocate memory on the host and the CUDA function
cudaFree() to deallocate memory on the device.
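As a sketch of the kernel described in step 4, where each thread computes one element of the output matrix, a naive CUDA C implementation might look like the following (the name matMul and the square N×N layout are illustrative assumptions, not a prescribed implementation):

```cuda
// matmul.cu — naive one-thread-per-output-element kernel sketch.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)        // dot product of a row of A and a column of B
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;            // each thread writes exactly one element
    }
}

// Host-side launch, assuming d_A, d_B, d_C were set up via cudaMalloc/cudaMemcpy:
//   dim3 threads(16, 16);
//   dim3 blocks((N + 15) / 16, (N + 15) / 16);
//   matMul<<<blocks, threads>>>(d_A, d_B, d_C, N);
//   cudaDeviceSynchronize();             // step 5: wait for the kernel to finish
```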
Questions:
1. What are the advantages of using CUDA to perform matrix multiplication compared to using a CPU?
2. How do you handle matrices that are too large to fit in GPU memory in CUDA matrix
multiplication?
3. How do you optimize the performance of the CUDA program for matrix multiplication?
4. How do you ensure correctness of the CUDA program for matrix multiplication and verify the
results?
Group A
Title of the Assignment: Write a program to implement Parallel Merge Sort. Use existing
algorithms and measure the performance of sequential and parallel algorithms.
Objective of the Assignment: Students should be able to Write a program to implement Parallel
Merge Sort and can measure the performance of sequential and parallel algorithms.
Prerequisite:
1. Basic of programming language
2. Concept of Merge Sort
3. Concept of Parallelism
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is Merge? Use of Merge Sort
2. Example of Merge sort?
3. Concept of OpenMP
4. How Parallel Merge Sort Work
5. How to measure the performance of sequential and parallel algorithms?
--------------------------------------------------------------------------------------------------------------
Merge sort is a sorting algorithm that uses a divide-and-conquer approach to sort an array or a list of
elements. The algorithm works by recursively dividing the input array into two halves, sorting each half,
and then merging the sorted halves to produce a sorted output.
The merge sort algorithm can be broken down into the following steps:
● According to the merge sort, first divide the given array into two equal halves. Merge sort
keeps dividing the list into equal parts until it cannot be further divided.
● As there are eight elements in the given array, so it is divided into two arrays of size 4.
● Now, again divide these two arrays into halves. As they are of size 4, divide them into new
arrays of size 2.
● Now, again divide these arrays to get the atomic value that cannot be further divided.
● In the next iteration of combining, compare the arrays of two data values and merge them
into arrays of four values in sorted order.
● Now, there is a final merging of the arrays. After the final merging of above arrays, the array
will look like -
How to measure the performance of sequential and parallel algorithms
There are several metrics that can be used to measure the performance of sequential and parallel
merge sort algorithms:
1. Execution time: Execution time is the amount of time it takes for the algorithm to complete its
sorting operation. This metric can be used to compare the speed of sequential and parallel
merge sort algorithms.
2. Speedup: Speedup is the ratio of the execution time of the sequential merge sort algorithm to the
execution time of the parallel merge sort algorithm. A speedup of greater than 1 indicates that the
parallel algorithm is faster than the sequential algorithm.
3. Efficiency: Efficiency is the ratio of the speedup to the number of processors or cores used in
the parallel algorithm. This metric can be used to determine how well the parallel algorithm is
utilizing the available resources.
4. Scalability: Scalability is the ability of the algorithm to maintain its performance as the input size
and number of processors or cores increase. A scalable algorithm will maintain a consistent
speedup and efficiency as more resources are added.
To measure the performance of sequential and parallel merge sort algorithms, you can perform experiments
on different input sizes and numbers of processors or cores. By measuring the execution time, speedup,
efficiency, and scalability of the algorithms under different conditions, you can determine
which algorithm is more efficient for different input sizes and hardware configurations. Additionally, you can
use profiling tools to analyze the performance of the algorithms and identify areas for optimization.
Conclusion: In this way we can implement Merge Sort in a parallel way using OpenMP and measure the
performance of the sequential and parallel algorithms.
Reference link
● https://www.geeksforgeeks.org/merge-sort/
● https://www.javatpoint.com/merge-sort
Group A
Assignment No: 3
Title of the Assignment: Implement Min, Max, Sum and Average operations using Parallel
Reduction.
Objective of the Assignment: To understand the concept of parallel reduction and how it can be
used to perform basic mathematical operations on given data sets.
Prerequisite:
1. Parallel computing architectures
2. Parallel programming models
3. Proficiency in programming languages
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is parallel reduction and its usefulness for mathematical operations on large data?
2. Concept of OpenMP
3. How do parallel reduction algorithms for Min, Max, Sum, and Average work, and what are
their advantages and limitations?
--------------------------------------------------------------------------------------------------------------
Parallel Reduction.
Here's a function-wise manual on how to understand and run the sample C++ program that demonstrates
how to implement Min, Max, Sum, and Average operations using parallel reduction.
1. Min_Reduction function
• The function takes in a vector of integers as input and finds the minimum value in the
vector using parallel reduction.
• The OpenMP reduction clause is used with the "min" operator to find the minimum value
across all threads.
• The minimum value found by each thread is reduced to the overall minimum value of the
entire array.
• The final minimum value is printed to the console.
2. Max_Reduction function
• The function takes in a vector of integers as input and finds the maximum value in the
vector using parallel reduction.
• The OpenMP reduction clause is used with the "max" operator to find the maximum value
across all threads.
• The maximum value found by each thread is reduced to the overall maximum value of the
entire array.
• The final maximum value is printed to the console.
3. Sum_Reduction function
• The function takes in a vector of integers as input and finds the sum of all the values in the
vector using parallel reduction.
• The OpenMP reduction clause is used with the "+" operator to find the sum across all
threads.
• The sum found by each thread is reduced to the overall sum of the entire array.
• The final sum is printed to the console.
4. Average_Reduction function
• The function takes in a vector of integers as input and finds the average of all the values in
the vector using parallel reduction.
• The OpenMP reduction clause is used with the "+" operator to find the sum across all
threads.
• The sum found by each thread is reduced to the overall sum of the entire array.
• The final sum is divided by the size of the array to find the average.
• The final average value is printed to the console.
5. Main Function
• The function initializes a vector of integers with some values.
• The function calls the min_reduction, max_reduction, sum_reduction, and
average_reduction functions on the input vector to find the corresponding values.
• The final minimum, maximum, sum, and average values are printed to the console.
6. Compiling and running the program
Compile the program: You need to use a C++ compiler that supports OpenMP, such as g++ or
clang. Open a terminal and navigate to the directory where your program is saved. Then,
compile the program using the following command:
$ g++ -fopenmp program.cpp -o program
This command compiles your program and creates an executable file named "program". The "-fopenmp"
flag tells the compiler to enable OpenMP.
Run the program: To run the program, simply type the name of the executable file in the terminal
and press Enter:
$ ./program
Conclusion: We have implemented the Min, Max, Sum, and Average operations using parallel
reduction in C++ with OpenMP. Parallel reduction is a powerful technique that allows us to
perform these operations on large arrays more efficiently by dividing the work among multiple
threads running in parallel. We presented a code example that demonstrates the
implementation of these operations using parallel reduction in C++ with OpenMP. We also
provided a manual for running OpenMP programs on the Ubuntu platform.
Assignment Question
1. What are the benefits of using parallel reduction for basic operations on large arrays?
2. How does OpenMP's "reduction" clause work in parallel reduction?
3. How do you set up a C++ program for parallel computation with OpenMP?
4. What are the performance characteristics of parallel reduction, and how do they vary
based on input size?
5. How can you modify the provided code example for more complex operations using
parallel reduction?
Group A
Title of the Assignment: Write a program to implement Parallel Bubble Sort. Use existing
algorithms and measure the performance of sequential and parallel algorithms.
Objective of the Assignment: Students should be able to write a program to implement
Parallel Bubble Sort and measure the performance of sequential and parallel algorithms.
Prerequisite:
1. Basic of programming language
2. Concept of Bubble Sort
3. Concept of Parallelism
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is Bubble Sort? Use of Bubble Sort
2. Example of Bubble sort?
3. Concept of OpenMP
4. How Parallel Bubble Sort Work
5. How to measure the performance of sequential and parallel algorithms?
--------------------------------------------------------------------------------------------------------------
Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements if they are
in the wrong order. It is called "bubble" sort because the algorithm moves the larger elements towards the
end of the array in a manner that resembles the rising of bubbles in a liquid.
The time complexity of Bubble Sort is O(n^2), which makes it inefficient for large lists. However, it has
the advantage of being easy to understand and implement, and it is useful for educational purposes and
for sorting small datasets.
Bubble Sort has limited practical use in modern software development due to its inefficient time
complexity of O(n^2) which makes it unsuitable for sorting large datasets. However, Bubble Sort has
some advantages and use cases that make it a valuable algorithm to understand, such as:
1. Simplicity: Bubble Sort is one of the simplest sorting algorithms, and it is easy to understand and
implement. It can be used to introduce the concept of sorting to beginners and as a basis for more
complex sorting algorithms.
2. Educational purposes: Bubble Sort is often used in academic settings to teach the principles of
sorting algorithms and to help students understand how algorithms work.
3. Small datasets: For very small datasets, Bubble Sort can be an efficient sorting algorithm, as its
overhead is relatively low.
4. Partially sorted datasets: If a dataset is already partially sorted, Bubble Sort can be very efficient.
Since Bubble Sort only swaps adjacent elements that are in the wrong order, it has a low number
of operations for a partially sorted dataset.
5. Performance optimization: Although Bubble Sort itself is not suitable for sorting large datasets,
some of its techniques can be used in combination with other sorting algorithms to optimize their
performance. For example, Bubble Sort can be used to optimize the performance of Insertion Sort
by reducing the number of comparisons needed.
The first iteration begins by comparing the first two values. If the first value is greater than the
second, the two values are swapped.
Step 1: In the case of 5, 3, 4, 1, and 2, 5 is greater than 3. So 5 takes the position of 3 and the numbers
become 3, 5, 4, 1, and 2.
Step 2: The algorithm now has 3, 5, 4, 1, and 2 to compare, this time around, it compares the next two
values, which are 5 and 4. 5 is greater than 4, so 5 takes the index of 4 and the values now become 3, 4,
5, 1, and 2.
Step 3: The algorithm now has 3, 4, 5, 1, and 2 to compare. It compares the next two values, which are 5
and 1. 5 is greater than 1, so 5 takes the index of 1 and the numbers become 3, 4, 1, 5, and 2.
Step 4: The algorithm now has 3, 4, 1, 5, and 2 to compare. It compares the next two values, which are
5 and 2. 5 is greater than 2, so 5 takes the index of 2 and the numbers become 3, 4, 1, 2, and 5.
That’s the first iteration. And the numbers are now arranged as 3, 4, 1, 2, and 5 – from the initial 5, 3, 4,
1, and 2. As you might realize, 5 should be the last number if the numbers are sorted in ascending order.
This means the first iteration is complete.
The algorithm starts the second iteration with the last result of 3, 4, 1, 2, and 5. This time around, 3
is smaller than 4, so no swapping happens. This means the numbers will remain the same.
The algorithm proceeds to compare 4 and 1. 4 is greater than 1, so 4 is swapped for 1 and the numbers
become 3, 1, 4, 2, and 5.
The algorithm now proceeds to compare 4 and 2. 4 is greater than 2, so 4 is swapped for 2 and
the numbers become 3, 1, 2, 4, and 5.
4 is now in the right place, so no swapping occurs between 4 and 5 because 4 is smaller than 5.
That’s how the algorithm continues to compare the numbers until they are arranged in ascending order
of 1, 2, 3, 4, and 5.
Concept of OpenMP
OpenMP (Open Multi-Processing) is an API for shared-memory parallel programming in C, C++, and
Fortran. It targets shared-memory multiprocessors, where multiple processor cores can access the same
memory. OpenMP uses a fork-join model of parallel execution, where a master thread forks multiple
worker threads to execute a parallel region of the code, and then waits for all threads to complete
before continuing with the sequential part of the code.
● Parallel Bubble Sort is a modification of the classic Bubble Sort algorithm that takes advantage of
parallel processing to speed up the sorting process.
● In parallel Bubble Sort, the list of elements is divided into multiple sublists that are sorted
concurrently by multiple threads. Each thread sorts its sublist using the regular Bubble Sort
algorithm. When all sublists have been sorted, they are merged together to form the final sorted list.
● The parallelization of the algorithm is achieved using OpenMP, a programming API that supports
parallel processing in C++, Fortran, and other programming languages. OpenMP provides a set of
compiler directives that allow developers to specify which parts of the code can be executed in
parallel.
● In the parallel Bubble Sort algorithm, the main loop that iterates over the list of elements is divided
into multiple iterations that are executed concurrently by multiple threads. Each thread sorts a subset
of the list, and the threads synchronize their work at the end of each iteration to ensure that the
elements are properly ordered.
● Parallel Bubble Sort can provide a significant speedup over the regular Bubble Sort algorithm,
especially when sorting large datasets on multi-core processors. However, the speedup is limited by
the overhead of thread creation and synchronization, and it may not be worth the effort for small
datasets or when using a single-core processor.
To measure and compare the performance of the sequential and parallel Bubble Sort algorithms:
1. Implement both the sequential and the parallel versions of the algorithm.
2. Prepare test cases: arrays of different sizes and initial orderings.
3. Use a reliable timer to measure the execution time of each algorithm on each test case.
4. Record the execution times and analyze the results.
When measuring the performance of the parallel Bubble Sort algorithm, you will need to specify the number
of threads to use. You can experiment with different numbers of threads to find the optimal value for your
system.
● Run each algorithm multiple times on each test case and take the average execution time to reduce
the impact of variations in system load and other factors.
● Monitor system resource usage during execution, such as CPU utilization and memory
consumption, to detect any performance bottlenecks.
● Visualize the results using charts or graphs to make it easier to compare the performance of
the two algorithms.
On Linux, several command-line tools can be used to monitor system resource usage during these experiments:
1. top: The top command provides a real-time view of system resource usage, including CPU
utilization and memory consumption. To use it, open a terminal window and type top. The output
will display a list of processes sorted by resource usage, with the most resource-intensive processes
at the top.
2. htop: htop is a more advanced version of top that provides additional features, such as interactive
process filtering and a color-coded display. To use it, open a terminal window and type htop.
3. ps: The ps command provides a snapshot of system resource usage at a particular moment in time.
To use it, open a terminal window and type ps aux. This will display a list of all running processes
and their resource usage.
4. free: The free command provides information about system memory usage, including total, used,
and free memory. To use it, open a terminal window and type free -h.
5. vmstat: The vmstat command provides a variety of system statistics, including CPU utilization,
memory usage, and disk activity. To use it, open a terminal window and type vmstat.
Conclusion: In this way we can implement Bubble Sort in a parallel way using OpenMP and measure the
performance of the sequential and parallel algorithms.
Reference link
● https://www.freecodecamp.org/news/bubble-sort-algorithm-in-java-cpp-python-with-example-code/
Group A
Title of the Assignment: Design and implement Parallel Breadth First Search based on
existing algorithms using OpenMP. Use a Tree or an undirected graph for BFS
Objective of the Assignment: Students should be able to perform Parallel Breadth First
Search based on existing algorithms using OpenMP
Prerequisite:
1. Basic of programming language
2. Concept of BFS
3. Concept of Parallelism
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. What is BFS?
2. Example of BFS
3. Concept of OpenMP
4. How Parallel BFS Work
5. Code Explanation with Output
---------------------------------------------------------------------------------------------------------------
What is BFS?
BFS stands for Breadth-First Search. It is a graph traversal algorithm used to explore all the nodes of a
graph or tree systematically, starting from the root node or a specified starting point, and visiting all the
neighboring nodes at the current depth level before moving on to the next depth level.
The algorithm uses a queue data structure to keep track of the nodes that need to be visited, and marks
each visited node to avoid processing it again. The basic idea of the BFS algorithm is to visit all the
nodes at a given level before moving on to the next level, which ensures that all the nodes are visited in
breadth-first order.
BFS is commonly used in many applications, such as finding the shortest path between two nodes,
solving puzzles, and searching through a tree or graph.
Example of BFS
Now let’s take a look at the steps involved in traversing a graph by using Breadth-First Search:
Step 1: Take an empty Queue.
Step 2: Select a starting node (visiting a node) and insert it into the Queue.
Step 3: Provided that the Queue is not empty, extract the node from the Queue and insert its child nodes
(exploring a node) into the Queue.
Step 4: Repeat Step 3 until the Queue is empty, marking each visited node so it is not processed again.
How Parallel BFS Works
● Parallel BFS (Breadth-First Search) is an algorithm used to explore all the nodes of a graph or tree
systematically in parallel. It is a popular parallel algorithm used for graph traversal in distributed
computing, shared-memory systems, and parallel clusters.
● The parallel BFS algorithm starts by selecting a root node or a specified starting point, and then
assigning it to a thread or processor in the system. Each thread maintains a local queue of nodes to be
visited and marks each visited node to avoid processing it again.
● The algorithm then proceeds in levels, where each level represents a set of nodes that are at a certain
distance from the root node. Each thread processes the nodes in its local queue at the current level, and
then exchanges the nodes that are adjacent to the current level with other threads or processors. This
is done to ensure that the nodes at the next level are visited by the next iteration of the algorithm.
● The parallel BFS algorithm uses two phases: the computation phase and the communication phase. In
the computation phase, each thread processes the nodes in its local queue, while in the communication
phase, the threads exchange the nodes that are adjacent to the current level with other threads or
processors.
● The parallel BFS algorithm terminates when all nodes have been visited or when a specified node has
been found. The result of the algorithm is the set of visited nodes or the shortest path from the root
node to the target node.
● Parallel BFS can be implemented using different parallel programming models, such as OpenMP, MPI,
CUDA, and others. The performance of the algorithm depends on the number of threads or processors
used, the size of the graph, and the communication overhead between the threads or processors.
Assignment Question
1. What is BFS?
2. What is OpenMP? What is its significance in parallel programming?
3. Write down applications of Parallel BFS
4. How can BFS be parallelized using OpenMP? Describe the parallel BFS algorithm
using OpenMP.
5. Write down the commonly used OpenMP directives and clauses.
Reference link
● https://www.edureka.co/blog/breadth-first-search-algorithm/