
Sinhgad College of Engineering, Pune
Savitribai Phule Pune University (SPPU)
Fourth Year of Computer Engineering (2019 Course)

410255: Laboratory Practice V

Term Work: 50 Marks
Practical: 50 Marks

High Performance Computing (410250)


Deep Learning (410251)
Savitribai Phule Pune University
Fourth Year of Computer Engineering (2019 Course)
410255: Laboratory Practice V

Teaching Scheme: Practical: 2 Hours/Week
Credit: 01
Examination Scheme: Term Work: 50 Marks, Practical: 50 Marks

Companion Courses: High Performance Computing (410250), Deep Learning (410251)

Course Objectives:

• To understand and implement searching and sorting algorithms.


• To learn the fundamentals of GPU Computing in the CUDA environment.
• To illustrate the concepts of Artificial Intelligence/Machine Learning (AI/ML).
• To understand hardware acceleration.
• To implement different deep learning models.
Course Outcomes:
CO1: Analyze and measure the performance of sequential and parallel algorithms.
CO2: Design and implement solutions for multicore/distributed/parallel environments.
CO3: Identify and apply suitable algorithms to solve AI/ML problems.
CO4: Apply deep neural network techniques for implementing linear regression and classification.
CO5: Apply convolutional neural network (CNN) techniques for implementing deep learning models.
CO6: Design and develop recurrent neural networks (RNN) for prediction.

The CO-PO Mapping Matrix

CO/PO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1 1 - 1 1 - 2 1 - - - - -

CO2 1 2 1 - - 1 - - - - - 1

CO3 - 1 1 1 1 1 - - - - - -

CO4 3 3 3 - 3 - - - - - - -

CO5 3 3 3 3 3 - - - - - - -

CO6 3 3 3 3 3 - - - - - - -

CO7 3 3 3 3 3 - - - - - -



410250: High Performance Computing
Group A
Assignment No.: 1

Title of the Assignment: Design and implement Parallel Breadth First Search and Depth First Search based on existing
algorithms using OpenMP. Use a tree or an undirected graph for BFS and DFS.
Objective of the Assignment: Students should be able to write a program to implement Parallel Breadth First Search and
Depth First Search based on existing algorithms using OpenMP.
Prerequisite:
1. Basics of a programming language
2. Concept of BFS and DFS
3. Concept of Parallelism
Contents for Theory:

1. What is BFS?

2. What is DFS?

3. Concept of OpenMP

4. Code Explanation with Output

What is BFS?

BFS stands for Breadth-First Search. It is a graph traversal algorithm used to explore all the nodes of a graph or tree
systematically, starting from the root node or a specified starting point, and visiting all the neighboring nodes at the
current depth level before moving on to the next depth level. The algorithm uses a queue data structure to keep track
of the nodes that need to be visited, and marks each visited node to avoid processing it again. The basic idea of the BFS
algorithm is to visit all the nodes at a given level before moving on to the next level, which ensures that all the nodes
are visited in breadth-first order. BFS is commonly used in many applications, such as finding the shortest path between
two nodes, solving puzzles, and searching through a tree or graph.
Now let’s take a look at the steps involved in traversing a graph by using Breadth-First Search:
Step 1: Take an Empty Queue.

Step 2: Select a starting node (visiting a node) and insert it into the Queue.

Step 3: Provided that the Queue is not empty, extract the node from the Queue and insert its child nodes (exploring a
node) into the Queue.
Step 4: Print the extracted node.
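As an illustration of these steps, the following is a minimal sequential BFS sketch in C++ for a small adjacency-list graph (the graph contents and the starting vertex 0 are assumed only for the example):

#include <iostream>
#include <queue>
#include <vector>
using namespace std;

int main() {
    // Small undirected graph stored as an adjacency list (example data)
    vector<vector<int>> adj = {{1, 2}, {0, 3}, {0, 4}, {1}, {2}};
    vector<bool> visited(adj.size(), false);
    queue<int> q;
    q.push(0);              // Step 2: insert the starting node
    visited[0] = true;
    while (!q.empty()) {    // Step 3: repeat until the queue is empty
        int u = q.front();
        q.pop();
        cout << u << " ";   // Step 4: print the extracted node
        for (int v : adj[u])
            if (!visited[v]) { visited[v] = true; q.push(v); }
    }
    return 0;
}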
What is DFS?

DFS stands for Depth-First Search. It is a popular graph traversal algorithm that explores as far as possible along each
branch before backtracking. This algorithm can be used to find a path between two vertices or to traverse a
graph in a systematic way. The algorithm starts at the root node and explores as far as possible along each branch before
backtracking. The backtracking is done to explore the next branch that has not been explored yet.
DFS can be implemented using either a recursive or an iterative approach. The recursive approach is simpler to
implement but can lead to a stack overflow error for very large graphs. The iterative approach uses a stack to keep track
of nodes to be explored and is preferred for larger graphs.
DFS can also be used to detect cycles in a graph. If a cycle exists in a graph, the DFS algorithm will eventually reach a
node that has already been visited, indicating that a cycle exists. A standard DFS
implementation puts each vertex of the graph into one of two categories: 1. Visited 2. Not Visited The purpose of the
algorithm is to mark each vertex as visited while avoiding cycles.
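A minimal sequential sketch of the iterative (stack-based) approach described above, in C++ with an assumed adjacency-list graph, could look like this:

#include <iostream>
#include <stack>
#include <vector>
using namespace std;

void dfs_iterative(const vector<vector<int>>& adj, int start) {
    vector<bool> visited(adj.size(), false);
    stack<int> s;
    s.push(start);
    while (!s.empty()) {
        int u = s.top();
        s.pop();
        if (visited[u]) continue;   // skip nodes already marked as visited
        visited[u] = true;
        cout << u << " ";
        for (int v : adj[u])        // push unvisited neighbours for later exploration
            if (!visited[v]) s.push(v);
    }
}

int main() {
    // Small undirected graph used only for the example
    vector<vector<int>> adj = {{1, 2}, {0, 3}, {0, 4}, {1}, {2}};
    dfs_iterative(adj, 0);
    return 0;
}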

Concept of OpenMP

● OpenMP (Open Multi-Processing) is an application programming interface (API) that supports shared-memory
parallel programming in C, C++, and Fortran. It is used to write parallel programs that can run on multicore
processors, multiprocessor systems, and parallel computing clusters.

● OpenMP provides a set of directives and functions that can be inserted into the source code of a program to parallelize
its execution. These directives are simple and easy to use, and they can be applied to loops, sections, functions, and other
program constructs. The compiler then generates parallel code that can run on multiple processors concurrently.
● OpenMP programs are designed to take advantage of the shared-memory architecture of modern processors, where
multiple processor cores can access the same memory. OpenMP uses a fork-join model of parallel execution, where a
master thread forks multiple worker threads to execute a parallel region of the code, and then waits for all threads to
complete before continuing with the sequential part of the code.

● OpenMP is widely used in scientific computing, engineering, and other fields that require
high-performance computing. It is supported by most modern compilers and is available on a wide range of platforms,
including desktops, servers, and supercomputers.

How Parallel BFS Works
● Parallel BFS (Breadth-First Search) is an algorithm used to explore all the nodes of a graph or tree systematically in
parallel. It is a popular parallel algorithm used for graph traversal in distributed computing, shared-memory systems,
and parallel clusters.
● The parallel BFS algorithm starts by selecting a root node or a specified starting point, and then assigning it to a
thread or processor in the system. Each thread maintains a local queue of nodes to be visited and marks each visited
node to avoid processing it again.
● The algorithm then proceeds in levels, where each level represents a set of nodes that are at a certain distance from
the root node. Each thread processes the nodes in its local queue at the current level, and then exchanges the nodes
that are adjacent to the current level with other threads or processors. This is done to ensure that the nodes at the next
level are visited by the next iteration of the algorithm.
● The parallel BFS algorithm uses two phases: the computation phase and the communication phase. In the computation
phase, each thread processes the nodes in its local queue, while in the communication phase, the threads exchange the
nodes that are adjacent to the current level with other threads or processors.
● The parallel BFS algorithm terminates when all nodes have been visited or when a specified node has been found.
The result of the algorithm is the set of visited nodes or the shortest path from the root node to the target node.
● Parallel BFS can be implemented using different parallel programming models, such as OpenMP, MPI, CUDA, and
others. The performance of the algorithm depends on the number of threads or processors used, the size of the graph,
and the communication overhead between the threads or processors.
Code to implement BFS using OpenMP:
#include<iostream>
#include<stdlib.h>
#include<queue>
using namespace std;
class node
{



public:
node *left, *right;
int data;
};
class Breadthfs
{
public:
node *insert(node *, int);
void bfs(node *);
};
node *insert(node *root, int data)
// inserts a node in tree
{
if(!root)
{
root=new node;
root->left=NULL;
root->right=NULL;
root->data=data;
return root;
}
queue<node *> q;
q.push(root);
while(!q.empty())
{
node *temp=q.front();
q.pop();
if(temp->left==NULL)
{
temp->left=new node;
temp->left->left=NULL;
temp->left->right=NULL;
temp->left->data=data;
return root;



}
else
{
q.push(temp->left);
}
if(temp->right==NULL)
{
temp->right=new node;
temp->right->left=NULL;
temp->right->right=NULL;
temp->right->data=data;
return root;
}
else
{
q.push(temp->right);
}
}
}
void bfs(node *head)
{
queue<node*> q;
q.push(head);
int qSize;
while (!q.empty())
{
qSize = q.size();
#pragma omp parallel for  // creates parallel threads
for (int i = 0; i < qSize; i++)
{
node* currNode;
#pragma omp critical
{



currNode = q.front();
q.pop();
cout<<"\t"<<currNode->data;
}// prints parent node
#pragma omp critical
{
if(currNode->left)// push parent's left node in queue
q.push(currNode->left);
if(currNode->right)
q.push(currNode->right);
}// push parent's right node in queue
}
}
}
int main(){
node *root=NULL;
int data;
char ans;

do
{
cout<<"\n enter data=>";cin>>data;
root=insert(root,data);
cout<<"do you want insert one more node?";cin>>ans;

}while(ans=='y'||ans=='Y');
bfs(root);
return 0;
}
Run Commands:

1) g++ -fopenmp bfs.cpp -o bfs


2) ./bfs
Output:
Enter data => 5
Do you want to insert one more node? (y/n) y
Enter data => 3
Do you want to insert one more node? (y/n) y
Enter data => 2
Do you want to insert one more node? (y/n) y
Enter data => 1
Do you want to insert one more node? (y/n) y
Enter data => 7
Do you want to insert one more node? (y/n) y
Enter data => 8
Do you want to insert one more node? (y/n) n
5   3   7   2   1   8
Code to implement DFS using OpenMP:

#include <iostream>
#include <vector>
#include <stack>
#include <omp.h>
using namespace std;
const int MAX = 100000;
vector<int> graph[MAX];
bool visited[MAX];
void dfs(int node) {
stack<int>s;
s.push(node);
while (!s.empty())
{
int curr_node = s.top();
s.pop();
if (!visited[curr_node])
{
visited[curr_node] = true;
cout << curr_node << " ";
#pragma omp parallel for
for (int i = 0; i < graph[curr_node].size(); i++)
{
int adj_node = graph[curr_node][i];
if (!visited[adj_node])
{
s.push(adj_node);
}
}
}
}
}

int main() {
int n, m, start_node;
cout << "Enter No of Node,Edges,and startnode:" ;
cin >> n >> m >> start_node;
//n: node,m:edges
cout << "Enter Pair of edges:" ;
for (int i = 0; i < m; i++)
{ int u, v;
cin >> u >> v;
//u and v: Pair of edges
graph[u].push_back(v);
graph[v].push_back(u);
}
#pragma omp parallel for
for (int i = 0; i < n; i++)
{
visited[i] = false;
}
dfs(start_node);
/*for (int i = 0; i < n; i++)
{
if (visited[i])
{
cout << i << " ";
}
}*/
return 0;
}

Conclusion:

In this way we can achieve parallelism while implementing Breadth First Search and Depth First Search.


Assignment No.: 2

Title of the Assignment: Write a program to implement Parallel Bubble Sort. Use existing algorithms and measure the
performance of sequential and parallel algorithms.
Objective of the Assignment: Students should be able to write a program to implement Parallel Bubble Sort and
measure the performance of sequential and parallel algorithms.
Prerequisite:
1. Basics of a programming language
2. Concept of Bubble Sort
3. Concept of Parallelism
Contents for Theory:
1. What is Bubble Sort? Use of Bubble Sort
2. Example of Bubble sort?
3. Concept of OpenMP
4. How Parallel Bubble Sort Work
5. How to measure the performance of sequential and parallel algorithms?
What is Bubble Sort?
Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements if they are in the
wrong order. It is called "bubble" sort because the algorithm moves the larger elements towards the end of the array in a
manner that resembles the rising of bubbles in a liquid.
The basic algorithm of Bubble Sort is as follows:
1. Start at the beginning of the array.

2. Compare the first two elements. If the first element is greater than the second element, swap them.
3. Move to the next pair of elements and repeat step 2.

4. Continue the process until the end of the array is reached.


5. If any swaps were made in step 2-4, repeat the process from step 1.
The time complexity of Bubble Sort is O(n^2), which makes it inefficient for large lists. However, it has the
advantage of being easy to understand and implement, and it is useful for educational purposes and for sorting small
datasets.
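For reference, a minimal sequential Bubble Sort in C++ that follows the steps above (the array contents match the worked example that appears later in this assignment):

#include <iostream>
using namespace std;

void bubble_sort(int arr[], int n) {
    for (int i = 0; i < n - 1; i++) {
        bool swapped = false;
        for (int j = 0; j < n - i - 1; j++) {
            if (arr[j] > arr[j + 1]) {          // compare adjacent elements
                swap(arr[j], arr[j + 1]);       // swap if they are out of order
                swapped = true;
            }
        }
        if (!swapped) break;                    // stop early if no swaps were made
    }
}

int main() {
    int arr[] = {5, 3, 4, 1, 2};
    bubble_sort(arr, 5);
    for (int x : arr) cout << x << " ";
    return 0;
}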
Bubble Sort has limited practical use in modern software development due to its inefficient time complexity of
O(n^2) which makes it unsuitable for sorting large datasets. However, Bubble Sort has some advantages and use cases



that make it a valuable algorithm to understand, such as:
1. Simplicity: Bubble Sort is one of the simplest sorting algorithms, and it is easy to understand and implement.
It can be used to introduce the concept of sorting to beginners and as a basis for more complex sorting
algorithms.
2. Educational purposes: Bubble Sort is often used in academic settings to teach the principles of sorting
algorithms and to help students understand how algorithms work.
3. Small datasets: For very small datasets, Bubble Sort can be an efficient sorting algorithm, as its overhead is
relatively low.
4. Partially sorted datasets: If a dataset is already partially sorted, Bubble Sort can be very efficient. Since Bubble
Sort only swaps adjacent elements that are in the wrong order, it has a low number of operations for a partially
sorted dataset.
5. Performance optimization: Although Bubble Sort itself is not suitable for sorting large datasets, some of its
techniques can be used in combination with other sorting algorithms to optimize their performance. For
example, Bubble Sort can be used to optimize the performance of Insertion Sort by reducing the number of
comparisons needed.

Example of Bubble sort


Let's say we want to sort a series of numbers 5, 3, 4, 1, and 2 so that they are arranged in ascending order…
The sorting begins the first iteration by comparing the first two values. If the first value is greater than the second, the
algorithm pushes the first value to the index of the second value.
First Iteration of the Sorting
Step 1: In the case of 5, 3, 4, 1, and 2, 5 is greater than 3. So 5 takes the position of 3 and the numbers become 3, 5, 4, 1,

and 2.
Step 2: The algorithm now has 3, 5, 4, 1, and 2 to compare, this time around, it compares the next two values, which



are 5 and 4. 5 is greater than 4, so 5 takes the index of 4 and the values now become 3, 4, 5, 1, and 2.
Step 3: The algorithm now has 3, 4, 5, 1, and 2 to compare. It compares the next two values, which are 5 and 1. 5 is
greater than 1, so 5 takes the index of 1 and the numbers become 3, 4, 1, 5, and 2.
Step 4: The algorithm now has 3, 4, 1, 5, and 2 to compare. It compares the next two values, which are 5 and 2. 5 is
greater than 2, so 5 takes the index of 2 and the numbers become 3, 4, 1, 2, and 5.

That’s the first iteration, and the numbers are now arranged as 3, 4, 1, 2, and 5 – from the initial 5, 3, 4, 1, and 2. As
you might realize, 5 should be the last number if the numbers are sorted in ascending order. This means the first iteration
is complete.
Second Iteration of the Sorting and the Rest
The algorithm starts the second iteration with the last result of 3, 4, 1, 2, and 5. This time around, 3 is smaller than
4, so no swapping happens. This means the numbers will remain the same.
The algorithm proceeds to compare 4 and 1. 4 is greater than 1, so 4 is swapped for 1 and the numbers become 3, 1, 4,
2, and 5.
The algorithm now proceeds to compare 4 and 2. 4 is greater than 2, so 4 is swapped for 2 and the numbers become
3, 1, 2, 4, and 5.
4 is now in the right place, so no swapping occurs between 4 and 5 because 4 is smaller than 5.

That’s how the algorithm continues to compare the numbers until they are arranged in ascending order of
1, 2, 3, 4, and 5.
Concept of OpenMP
● OpenMP (Open Multi-Processing) is an application programming interface (API) that supports shared-memory
parallel programming in C, C++, and Fortran. It is used to write parallel programs that can run on multicore
processors, multiprocessor systems, and parallel computing clusters.
● OpenMP provides a set of directives and functions that can be inserted into the source code of a program to
parallelize its execution.
These directives are simple and easy to use, and they can be applied to loops, sections, functions, and other program
constructs.



● The compiler then generates parallel code that can run on multiple processors concurrently.

● OpenMP programs are designed to take advantage of the shared-memory architecture of modern processors,
where multiple processor cores can access the same memory. OpenMP uses a fork-join model of parallel
execution, where a master thread forks multiple worker threads to execute a parallel region of the code, and
then waits for all threads to complete before continuing with the sequential part of the code.
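As a small illustration of this fork-join model (a minimal sketch, not part of the assignment code), the following C++ program forks a team of threads for one parallel region and joins them before the final sequential statement; it can be compiled with g++ -fopenmp, like the earlier run commands:

#include <iostream>
#include <omp.h>

int main() {
    std::cout << "Sequential part: master thread only" << std::endl;

    // Fork: a team of threads executes this block concurrently
    #pragma omp parallel
    {
        int id = omp_get_thread_num();        // unique id of each worker thread
        int total = omp_get_num_threads();    // size of the thread team
        #pragma omp critical
        std::cout << "Hello from thread " << id << " of " << total << std::endl;
    }   // Join: all threads synchronize here

    std::cout << "Sequential part resumes after the join" << std::endl;
    return 0;
}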
How Parallel Bubble Sort Works
● Parallel Bubble Sort is a modification of the classic Bubble Sort algorithm that takes advantage of parallel
processing to speed up the sorting process.
● In parallel Bubble Sort, the list of elements is divided into multiple sublists that are sorted concurrently by
multiple threads. Each thread sorts its sublist using the regular Bubble Sort algorithm. When all sublists have
been sorted, they are merged together to form the final sorted list.
● The parallelization of the algorithm is achieved using OpenMP, a programming API that supports parallel
processing in C++, Fortran, and other programming languages. OpenMP provides a set of compiler directives
that allow developers to specify which parts of the code can be executed in parallel.
● In the parallel Bubble Sort algorithm, the main loop that iterates over the list of elements is divided into multiple
iterations that are executed concurrently by multiple threads. Each thread sorts a subset of the list, and the threads
synchronize their work at the end of each iteration to ensure that the elements are properly ordered.
● Parallel Bubble Sort can provide a significant speedup over the regular Bubble Sort algorithm, especially when
sorting large datasets on multi-core processors. However, the speedup is
limited by the overhead of thread creation and synchronization, and it may not be worth the effort for small
datasets or when using a single-core processor.
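One common way to realize this idea with OpenMP is the odd-even transposition variant of Bubble Sort, in which each phase compares disjoint pairs of adjacent elements, so the pairs can be swapped safely in parallel. A minimal C++ sketch (the array size and contents are assumed only for the example):

#include <iostream>
#include <vector>
#include <algorithm>
#include <omp.h>
using namespace std;

void parallel_bubble_sort(vector<int>& a) {
    int n = a.size();
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;                 // even phase compares (0,1),(2,3)...; odd phase (1,2),(3,4)...
        #pragma omp parallel for
        for (int j = start; j < n - 1; j += 2) {
            if (a[j] > a[j + 1])
                swap(a[j], a[j + 1]);          // pairs are disjoint, so no two threads touch the same element
        }
    }
}

int main() {
    vector<int> a = {5, 3, 4, 1, 2, 9, 7, 8, 6, 0};
    double t0 = omp_get_wtime();
    parallel_bubble_sort(a);
    double t1 = omp_get_wtime();
    for (int x : a) cout << x << " ";
    cout << "\nTime: " << (t1 - t0) << " s" << endl;
    return 0;
}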

How to measure the performance of sequential and parallel algorithms?

To measure the performance of the sequential Bubble Sort and parallel Bubble Sort algorithms, you can follow
these steps:

1. Implement both the sequential and parallel Bubble Sort algorithms.

2. Choose a range of test cases, such as arrays of different sizes and different degrees of sortedness, to
test the performance of both algorithms.



3. Use a reliable timer to measure the execution time of each algorithm on each test case.

4. Record the execution times and analyze the results.

When measuring the performance of the parallel Bubble sort algorithm, you will need to specify the number of threads
to use. You can experiment with different numbers of threads to find the optimal value for your system.
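For step 3, one simple and reliable timer in an OpenMP program is omp_get_wtime(); a minimal C++ sketch of timing a run (the array size and the std::sort call are only stand-ins for whichever sort implementation is being measured):

#include <cstdio>
#include <algorithm>
#include <vector>
#include <cstdlib>
#include <omp.h>

int main() {
    // Build a test array (size chosen only for the example)
    std::vector<int> data(100000);
    for (int& x : data) x = std::rand() % 1000;

    double start = omp_get_wtime();          // wall-clock time before the measured work
    std::sort(data.begin(), data.end());     // stand-in for the sequential or parallel sort under test
    double end = omp_get_wtime();            // wall-clock time after the measured work

    std::printf("Elapsed time: %f seconds\n", end - start);
    return 0;
}

The number of threads can be varied between runs with omp_set_num_threads() or the OMP_NUM_THREADS environment variable to study how the parallel version scales.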
How to check CPU utilization and memory consumption in ubuntu

In Ubuntu, you can use a variety of tools to check CPU utilization and memory consumption. Here are some common
tools:
1. top: The top command provides a real-time view of system resource usage, including CPU utilization and
memory consumption. To use it, open a terminal window and type top. The output will display a list of processes
sorted by resource usage, with the most resource-intensive processes at the top.

2. htop: htop is a more advanced version of top that provides additional features, such as interactive process
filtering and a color-coded display. To use it, open a terminal window and type htop.

3. ps: The ps command provides a snapshot of system resource usage at a particular moment in time. To use it,
open a terminal window and type ps aux. This will display a list of all running processes and their resource
usage.

4. free: The free command provides information about system memory usage, including total, used, and free
memory. To use it, open a terminal window and type free -h.

5. vmstat: The vmstat command provides a variety of system statistics, including CPU utilization, memory usage,
and disk activity. To use it, open a terminal window and type vmstat.
Code to Implement parallel bubble sort using OpenMP
# Note: this listing relies on an OpenMP-style Python binding referred to here as "omp";
# it is illustrative pseudocode rather than a standard Python package.
import numpy as np
import time
import random
import omp

def parallel_bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        # Set the number of threads to the maximum available
        omp.set_num_threads(omp.get_max_threads())
        # Use the parallel construct to distribute the loop iterations among the threads
        # Each thread sorts a portion of the array
        # The threads wait for each other before moving on to the next iteration,
        # which guarantees that the array is fully sorted before the loop ends
        with omp.parallel(num_threads=omp.get_max_threads(), default_shared=False, private=['temp']):
            for j in range(i % 2, n - 1, 2):
                if arr[j] > arr[j + 1]:
                    temp = arr[j]
                    arr[j] = arr[j + 1]
                    arr[j + 1] = temp

if __name__ == '__main__':
    # Generate a random array of 10,000 integers
    arr = np.array([random.randint(0, 100) for i in range(10000)])
    print(f"Original array: {arr}")
    start_time = time.time()
    parallel_bubble_sort(arr)
    end_time = time.time()
    print(f"Sorted array: {arr}")
    print(f"Execution time: {end_time - start_time} seconds")

Output:

Original array: [69 22 51 ... 18 56 9]

Sorted array: [ 0 0 0 ... 99 99 99]

Execution time: 0.07419133186340332 seconds

Code to Implement parallel merge sort using OpenMP

# Note: as above, "omp" stands for an OpenMP-style Python binding and the listing is illustrative.
import numpy as np
import time
import random
import omp

def parallel_merge_sort(arr):
    n = len(arr)
    # Base case
    if n == 1:
        return arr

    # Split the array into two halves
    mid = n // 2
    left = arr[:mid]
    right = arr[mid:]

    # Use the parallel construct to distribute the work among the threads
    # Each thread sorts a portion of the array
    with omp.parallel(num_threads=omp.get_max_threads(), default_shared=False):
        left_sorted = parallel_merge_sort(left)
        right_sorted = parallel_merge_sort(right)

    # Merge the two sorted halves
    i = j = 0
    n1, n2 = len(left_sorted), len(right_sorted)
    merged_arr = np.zeros(n1 + n2, dtype=int)

    # Use the parallel construct to distribute the loop iterations among the threads
    # Each thread merges a portion of the array
    with omp.parallel(num_threads=omp.get_max_threads(), default_shared=False, private=['k']):
        for k in range(n1 + n2):
            if i == n1:
                merged_arr[k:] = right_sorted[j:]
                break
            elif j == n2:
                merged_arr[k:] = left_sorted[i:]
                break
            elif left_sorted[i] <= right_sorted[j]:
                merged_arr[k] = left_sorted[i]
                i += 1
            else:
                merged_arr[k] = right_sorted[j]
                j += 1

    return merged_arr

if __name__ == '__main__':
    # Generate a random array of 10,000 integers
    arr = np.array([random.randint(0, 100) for i in range(10000)])
    print(f"Original array: {arr}")
    start_time = time.time()
    sorted_arr = parallel_merge_sort(arr)
    end_time = time.time()
    print(f"Sorted array: {sorted_arr}")
    print(f"Execution time: {end_time - start_time} seconds")

Output:

Original array: [59 43 87 ... 22 50 83]

Sorted array: [ 0 0 0 ... 99 99 99]

Execution time: 0.031245946884155273 seconds

Conclusion: In this way we can implement Bubble Sort in a parallel way using OpenMP and also learn how to
measure the performance of serial and parallel algorithms.


Assignment No.: 3

Title of the Assignment: Implement Min, Max, Sum and Average operations using Parallel Reduction.

Objective of the Assignment: Students should be able to learn about how to perform min, max, sum, and average
operations on a large set of data using parallel reduction technique in CUDA. The program defines four kernel
functions, reduce_min, reduce_max, reduce_sum, and reduce_avg.

Prerequisite:

1. Knowledge of parallel programming concepts and techniques, such as shared memory, threads, and
synchronization.

2. Familiarity with a parallel programming library or framework, such as OpenMP, MPI, or CUDA.

3. A suitable parallel programming environment, such as a multi-core CPU, a cluster of computers, or
a GPU.

4. A programming language that supports parallel programming constructs, such as C, C++, Fortran, or
Python.

Contents of Theory :

Parallel Reduction Operation :

Parallel reduction is a common technique used in parallel computing to perform a reduction operation on a large
dataset. A reduction operation combines a set of values into a single value, such as computing the sum, maximum,
minimum, or average of the values. Parallel reduction exploits the parallelism available in modern multicore
processors, clusters of computers, or GPUs to speed up the computation.

The parallel reduction algorithm works by dividing the input data into smaller chunks that can be processed independently
in parallel. Each thread or process computes the reduction operation on its local chunk of data, producing a partial result.
The partial results are then combined in a hierarchical manner until a single result is obtained.
The most common parallel reduction algorithm is the binary tree reduction algorithm, which has a logarithmic time
complexity and can achieve optimal parallel efficiency. In this algorithm, the input
data is initially divided into chunks of size n, where n is the number of parallel threads or processes. Each thread or
process computes the reduction operation on its chunk of data, producing n partial results.

The partial results are then recursively combined in a binary tree structure, where each internal node represents the reduction
operation of its two child nodes. The tree structure is built in a bottom-up manner, starting from the leaf nodes and ending
at the root node. Each level of the

tree reduces the number of partial results by a factor of two, until a single result is obtained at the root node.

The binary tree reduction algorithm can be implemented using various parallel programming models, such as OpenMP,
MPI, or CUDA. In OpenMP, the algorithm can be implemented using the parallel and for directives for parallelizing
the computation, and the reduction clause for combining the partial results. In MPI, the algorithm can be implemented
using the MPI_Reduce function for performing the reduction operation, and the MPI_Allreduce function for
distributing the result to all processes. In CUDA, the algorithm can be implemented using the parallel reduction kernel,
which uses shared memory to store the partial results and reduce the memory access latency.

Parallel reduction has many applications in scientific computing, machine learning, data analytics, and computer
graphics. It can be used to compute the sum, maximum, minimum, or average of large datasets, to perform data
filtering, feature extraction, or image processing, to solve optimization problems, or to accelerate numerical
simulations. Parallel reduction can also be combined with other parallel algorithms, such as parallel sorting, searching,
or matrix operations, to achieve higher performance and scalability.
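As a concrete illustration of the OpenMP reduction clause mentioned above, the following minimal sketch computes the min, max, sum, and average of an array in a single parallel loop (the array contents are assumed for the example; min/max reductions require OpenMP 3.1 or later):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    int n = 1000000;
    int *data = (int *) malloc(n * sizeof(int));
    for (int i = 0; i < n; i++)
        data[i] = rand() % 100;

    int min_val = data[0], max_val = data[0];
    long long sum_val = 0;

    // Each thread keeps private partial results; OpenMP combines them at the end of the loop
    #pragma omp parallel for reduction(min: min_val) reduction(max: max_val) reduction(+: sum_val)
    for (int i = 0; i < n; i++) {
        if (data[i] < min_val) min_val = data[i];
        if (data[i] > max_val) max_val = data[i];
        sum_val += data[i];
    }

    printf("Min: %d  Max: %d  Sum: %lld  Average: %lf\n",
           min_val, max_val, sum_val, (double) sum_val / n);
    free(data);
    return 0;
}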

Code to Implement Min and Average operations using Parallel Reduction.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define CHUNK_SIZE 1000
struct ChunkStats {
int min_val;
int sum_val;



int size;
};
struct ChunkStats get_chunk_stats(int* chunk, int chunk_size) {
// compute the minimum, sum, and size of a chunk
struct ChunkStats stats;
stats.min_val = chunk[0];
stats.sum_val = 0;
stats.size = chunk_size;
for (int i = 0; i < chunk_size; i++) {
stats.min_val = chunk[i] < stats.min_val ? chunk[i] : stats.min_val;
stats.sum_val += chunk[i];
}
return stats;
}
void parallel_reduction_min_avg(int* data, int data_size, int* min_val_ptr, double* avg_val_ptr) {
// split the data into chunks
int num_threads = omp_get_max_threads();
int chunk_size = data_size / num_threads;
int num_chunks = num_threads;
if (data_size % chunk_size != 0) {
num_chunks++;
}

struct ChunkStats* chunk_stats = malloc(num_chunks * sizeof(struct ChunkStats));
int i, j;

#pragma omp parallel shared(data, chunk_size, num_chunks, chunk_stats) private(i, j)
{
int thread_id = omp_get_thread_num();
int start_index = thread_id * chunk_size;

int end_index = (thread_id + 1) * chunk_size - 1;


if (thread_id == num_threads - 1) {
end_index = data_size - 1;



}

int chunk_size_actual = end_index - start_index + 1;


int*chunk = data + start_index;

// compute the minimum and sum of each chunk in parallel
chunk_stats[thread_id] = get_chunk_stats(chunk, chunk_size_actual);

for (i = 1, j = thread_id - 1; i <= num_threads && j >= 0; i *= 2, j -= i) {


if(thread_id % i == 0 && thread_id + i < num_threads) {

chunk_stats[thread_id].min_val = chunk_stats[thread_id].min_val < chunk_stats[thread_id + i].min_val


? chunk_stats[thread_id].min_val : chunk_stats[thread_id + i].min_val;
chunk_stats[thread_id].sum_val+=chunk_stats[thread_id+i].sum_val;
chunk_stats[thread_id].size += chunk_stats[thread_id + i].size;
}

#pragma omp barrier

// perform a binary operation on adjacent pairs of minimum and sum values
int min_val = chunk_stats[0].min_val;

int sum_val = chunk_stats[0].sum_val;


int size = chunk_stats[0].size;
for (i = 1, j = 0; i < num_chunks; i *= 2, j++) {
if (j % i == 0 && j + i < num_chunks) {

min_val = min_val < chunk_stats[j + i].min_val ? min_val : chunk_stats[j + i].min_val;


sum_val += chunk_stats[j + i].sum_val;
size += chunk_stats[j + i].size;



}

// the final minimum value is the minimum value of the entire dataset
*min_val_ptr = min_val;

// the final average value is the sum of the entire dataset divided by its size

*avg_val_ptr = (double)sum_val / (double)size;


free(chunk_stats);
}

int main() {

int data_size = 1000000;

int* data = malloc(data_size * sizeof(int));


for (int i = 0; i < data_size; i++) {
data[i] = rand() % 100;

}
int min_val;
double avg_val;
parallel_reduction_min_avg(data, data_size, &min_val, &avg_val);
printf("Minimum value: %d\n", min_val);
printf("Average value: %lf\n", avg_val);
free(data);
return 0;
}
Code to Implement Max and Sum operations using Parallel Reduction.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>



void parallel_reduction_max_sum(int* data, int size, int* max_val_ptr, int* sum_val_ptr) {

// Initialize shared variables


*max_val_ptr = data[0];
*sum_val_ptr = 0;

// Compute maximum and sum of each chunk in parallel

#pragma omp parallel for reduction(max: *max_val_ptr) reduction(+: *sum_val_ptr)


for(int i = 0; i < size; i++) {

if (data[i] > *max_val_ptr) {

*max_val_ptr = data[i];

}
*sum_val_ptr += data[i];
}

// Combine maximum and sum values from each chunk
#pragma omp parallel sections

{
#pragma omp section
{
// Compute maximum value
for (int i = 1; i < omp_get_num_threads(); i++) {
int thread_max_val;
#pragma omp critical

{
thread_max_val = *max_val_ptr;
}
#pragma omp flush



if (thread_max_val > *max_val_ptr) {
*max_val_ptr = thread_max_val;
}
}
}

#pragma omp section


{
// Compute sum value

for (int i = 1; i < omp_get_num_threads(); i++) {


int thread_sum_val;
#pragma omp critical

{
thread_sum_val = *sum_val_ptr;

}
#pragma omp flush

*sum_val_ptr += thread_sum_val;

}
}

}
}

int main() {

int data_size = 1000000;

int* data = malloc(data_size * sizeof(int));
for (int i = 0; i < data_size; i++) {
data[i] = rand() % 100;
}
int max_val, sum_val;

parallel_reduction_max_sum(data, data_size, &max_val, &sum_val);


printf("Maximum value: %d\n", max_val); printf("Sum value: %d\n",sum_val);
free(data);
return 0;
}
In this code, we use the #pragma omp parallel for directive to execute the loop that computes the maximum and sum of
each chunk in parallel. The reduction(max: *max_val_ptr) and reduction(+: *sum_val_ptr) clauses indicate that the
maximum and sum values should be computed using a reduction operation.
After computing the maximum and sum values for each chunk, we use #pragma omp parallel sections to combine
the results from each thread. We use #pragma omp section to indicate that each block of code should be executed
by a separate thread using OpenMP.
In this way we are able to learn about parallel reduction and how to implement it.
Conclusion :
In each section, we use a loop and a critical section to combine the maximum or sum values from each thread. The
#pragma omp flush directive ensures that the values are properly synchronized between threads.



Assignment No.: 4

Title of the Assignment: Write a CUDA Program for:


1. Addition of two large vectors
2. Matrix Multiplication using CUDA
Objective of the Assignment: Students should be able to learn about parallel computing and students should learn
about CUDA (Compute Unified Device Architecture) and how it helps to boost high performance computations.
Prerequisite:
1. Basics of CUDA Architecture.
2. Basics of CUDA programming model.
3. CUDA kernel function.
4. CUDA thread organization
Contents of Theory :
1. CUDA architecture: CUDA is a parallel computing platform and programming model developed by
NVIDIA. It allows developers to use the power of GPU (Graphics Processing Unit) to accelerate
computations. CUDA architecture consists of host and device components, where the host is the CPU and
the device is the GPU.
2. CUDA programming model: CUDA programming model consists of host and device codes. The host code
runs on the CPU and is responsible for managing the GPU memory and launching the kernel functions on the
device. The device code runs on the GPU and performs the computations.
3. CUDA kernel function: A CUDA kernel function is a function that is executed on the GPU. It is defined
with the __global__ keyword and is called from the host code using a launch configuration. Each kernel function
runs in parallel on multiple threads, where each thread performs the same operation on different data.
4. Memory management in CUDA: In CUDA, there are three types of memory: global, shared, and
local. Global memory is allocated on the device and can be accessed by all
threads. Shared memory is allocated on the device and can be accessed by threads within a block. Local memory
is allocated on each thread and is used for temporary storage.
5. CUDA thread organization: In CUDA, threads are organized into blocks, and blocks are organized into a
grid. Each thread is identified by a unique thread index, and each block is identified by a unique block index.
6. Matrix multiplication: Matrix multiplication is a fundamental operation in linear algebra. It involves
multiplying two matrices and producing a third matrix. The resulting matrix has dimensions equal to the number



of rows of the first matrix and the number of columns of the second matrix.
CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and programming model
developed by NVIDIA. CUDA allows developers to use the power of the GPU to accelerate computations. It is designed
to be used with C, C++, and Fortran programming languages. CUDA architecture consists of host and device
components. The host is the CPU, and the device is the GPU. The CPU is responsible for managing the GPU memory
and launching the kernel functions on the device.
A CUDA kernel function is a function that is executed on the GPU. It is defined with the __global__ keyword and is called
from the host code using a launch configuration. Each kernel function runs in parallel on multiple threads, where each
thread performs the same operation on different data.
CUDA provides three types of memory: global, shared, and local. Global memory is allocated on the device and can be
accessed by all threads. Shared memory is allocated on the device and can be accessed by threads within a block. Local
memory is allocated on each thread and is used for temporary storage.
CUDA threads are organized into blocks, and blocks are organized into a grid. Each thread is identified by a unique
thread index, and each block is identified by a unique block index.
CUDA devices have a hierarchical memory architecture consisting of multiple memory levels, including registers,
shared memory, L1 cache, L2 cache, and global memory.
CUDA supports various libraries, including cuBLAS for linear algebra, cuFFT for Fast Fourier Transform, and cuDNN for
deep learning.
CUDA programming requires a compatible NVIDIA GPU and an installation of the CUDA Toolkit, which includes the
CUDA compiler, libraries, and tools.
CUDA Program for Addition of Two Large Vectors:
#include <stdio.h>
#include <stdlib.h>
// CUDA kernel for vector addition
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
int main() {
int n = 1000000; // Vector size
int *a, *b, *c; // Host vectors
int *d_a, *d_b, *d_c; // Device vectors
int size = n * sizeof(int); // Size in bytes
// Allocate memory for host vectors
a = (int*) malloc(size);
b = (int*) malloc(size);
c = (int*) malloc(size);
// Initialize host vectors
for (int i = 0; i < n; i++) {
a[i] = i;
b[i] = i;
}
// Allocate memory for device vectors
cudaMalloc((void**) &d_a, size);
cudaMalloc((void**) &d_b, size);
cudaMalloc((void**) &d_c, size);
// Copy host vectors to device vectors
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Define block size and grid size
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;
// Launch kernel
vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
// Copy device result vector to host result vector
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Verify the result
for (int i = 0; i < n; i++) {
if (c[i] != 2*i) {
printf("Error: c[%d] = %d\n", i, c[i]);break;
}
}
// Free device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
// Free host memory
free(a);
free(b);
free(c);
return 0;
}
This program uses CUDA to add two large vectors of size 1000000. The vectors are initialized on the host, and then
copied to the device memory. A kernel function is defined to perform the vector addition, and then launched on the
device. The result is copied back to the host memory and verified. Finally, the device and host memories are
freed.
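Assuming the CUDA Toolkit is installed and the source is saved as, for example, vector_add.cu, the program can be compiled and run with commands along these lines:

Run Commands:

1) nvcc vector_add.cu -o vector_add

2) ./vector_add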
CUDA Program for Matrix Multiplication:
This program multiplies two matrices of size n using CUDA. It first allocates host memory for the matrices and
initializes them. Then it allocates device memory and copies the matrices to the device. It sets the kernel launch
configuration and launches the kernel function matrix_multiply. The kernel function performs the matrix multiplication
and stores the result in matrix c. Finally, it copies the result back to the host and frees the device and host memory.
The kernel function calculates the row and column indices of the output matrix using the block index and thread
index. It then uses a for loop to calculate the sum of the products of the corresponding elements in the input
matrices. The result is stored in the output matrix.
Note that in this program, we use CUDA events to measure the elapsed time of the kernel function. This is because
the kernel function runs asynchronously on the GPU, so we need to use events to synchronize the host and device
and measure the time accurately.
#include <stdio.h>
#define BLOCK_SIZE 16
__global__ void matrix_multiply(float *a, float *b, float *c, int n)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0;
if (row < n && col < n) {
for (int i = 0; i < n; ++i) {
sum += a[row * n + i] * b[i * n + col];
}
c[row * n + col] = sum;
}
}
int main()
Sinhgad College Of Engineering, Pune. 36
{
int n = 1024;
size_t size = n * n * sizeof(float);
float *a, *b, *c;
float *d_a, *d_b, *d_c;
cudaEvent_t start, stop;
float elapsed_time;
// Allocate host memory
a = (float *) malloc(size);
b = (float *) malloc(size);
c = (float *) malloc(size);

// Initialize matrices
for (int i = 0; i < n * n; ++i) {
a[i] = i % n;
b[i] = i % n;
}

// Allocate device memory


cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
// Copy input data to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Set kernel launch configuration


dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 blocks((n + threads.x - 1) / threads.x, (n + threads.y - 1) / threads.y);
// Launch kernel
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
matrix_multiply<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time, start, stop);
// Copy output data to host

cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);


// Print elapsed time
printf("Elapsed time: %f ms\n", elapsed_time);
// Free device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
// Free host memory
free(a);
free(b);
free(c);
return 0;
}
Conclusion:
Hence we have implemented Addition of two large vectors and Matrix Multiplication using CUDA.



Assignment No.: 5

Title of the Assignment: Implement HPC application for AI/ML domain.

Objective of the Assignment: Students get hands-on experience in developing high-performance computing
applications for the AI/ML domain. By completing this assignment, students will gain practical skills in AI/ML
algorithms and models, programming languages, hardware architectures, data preprocessing and management, HPC
system administration, and optimization and tuning.

Prerequisite:

1. Knowledge of AI/ML algorithms and models: A deep understanding of AI/ML algorithms and models is
essential to design and implement an HPC application that can efficiently perform large-scale training and
inference. This requires knowledge of statistical methods, linear algebra, optimization techniques, and deep
learning frameworks such as TensorFlow, PyTorch, and MXNet.

2. Proficiency in programming languages: Proficiency in programming languages such as C++, Python, and
CUDA is essential to develop an HPC application for AI/ML. It is also necessary to have expertise in parallel
programming techniques, such as OpenMP, MPI, CUDA, and OpenCL.

3. Knowledge of hardware architectures: Knowledge of different hardware architectures, such as CPU, GPU,
FPGA, and ASIC, is essential to select the most suitable hardware platform for the HPC application. It is also
necessary to have expertise in optimizing the HPC application for specific hardware architectures.

Contents for Theory:

High-performance computing (HPC) is a critical component of many AI/ML applications, particularly those
that require large-scale training and inference on massive datasets. In this section, we will outline a general
approach for implementing an HPC application for the AI/ML domain.
Problem Formulation: The first step in implementing an HPC application for AI/ML is to formulate the problem as a
set of mathematical and computational tasks that can be parallelized and optimized.
This involves defining the problem domain, selecting appropriate algorithms and models, and determining the
computational and memory requirements.

Hardware Selection: The next step is to select the appropriate hardware platform for the HPC application. This
involves considering the available hardware options, such as CPU, GPU, FPGA, and ASIC, and selecting the most
suitable option based on the performance, cost, power consumption, and scalability requirements.

Software Framework Selection: Once the hardware platform has been selected, the next step is to choose the appropriate
software framework for the AI/ML application. This involves considering the available options, such as TensorFlow,
PyTorch, MXNet, and Caffe, and selecting the most suitable framework based on the programming language,
performance, ease of use, and community support.

Data Preparation and Preprocessing: Before training or inference can be performed, the data must be prepared and
preprocessed. This involves cleaning the data, normalizing and scaling the data, and splitting the data into training,
validation, and testing sets. The data must also be stored in a format that is compatible with the selected software
framework.

Model Training or Inference: The main computational task in an AI/ML application is model training or inference.
In an HPC application, this task is parallelized and optimized to take advantage of the available hardware resources.
This involves breaking the model into smaller tasks that can be parallelized, using techniques such as data parallelism,
model parallelism, or pipeline parallelism. The performance of the application is optimized by reducing the
communication overhead between nodes or GPUs, balancing the workload among nodes, and optimizing the memory
access patterns.

Model Evaluation: After the model has been trained or inference has been performed, the performance of the model
must be evaluated. This involves computing the accuracy, precision, recall, and other metrics on the validation and
testing sets. The performance of the HPC application is evaluated by measuring the speedup, scalability, and efficiency
of the parallelized tasks.

Optimization and Tuning: Finally, the HPC application must be optimized and tuned to achieve the best possible
performance. This involves profiling the code to identify bottlenecks and optimizing the code using techniques such
as loop unrolling, vectorization, and cache optimization. The performance of the application is also affected by the
choice of hyperparameters, such as the learning rate, batch size, and regularization strength, which must be tuned using
techniques such as grid search or Bayesian optimization.

Application: Neural Network Training


Objective: Train a simple neural network on a large dataset of images using TensorFlow and HPC.

Approach: We will use TensorFlow to define and train the neural network and use a parallel computing framework to



distribute the computation across multiple nodes in a cluster.

Requirements:

TensorFlow 2.0 or higher
mpi4py

Steps:

Define the neural network architecture

Code:

import tensorflow as tf

model = tf.keras.models.Sequential([

tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(28, 28, 1)),

tf.keras.layers.MaxPooling2D((2, 2)),
tf.keras.layers.Flatten(), tf.keras.layers.Dense(10,
activation='softmax')
])

Load the dataset:

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

Initialize MPI:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
Define the training function:



def train(model, x_train, y_train, rank, size):
    # Split the data across the nodes
    n = len(x_train)
    chunk_size = n // size
    start = rank * chunk_size
    end = (rank + 1) * chunk_size
    if rank == size - 1:
        end = n
    x_train_chunk = x_train[start:end]
    y_train_chunk = y_train[start:end]

    # Compile the model
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Train the model
    model.fit(x_train_chunk, y_train_chunk, epochs=1, batch_size=32)

    # Compute the accuracy on the training data
    train_loss, train_acc = model.evaluate(x_train_chunk, y_train_chunk, verbose=2)

    # Reduce the accuracy across all nodes
    train_acc = comm.allreduce(train_acc, op=MPI.SUM)
    return train_acc / size
Run the training loop:
epochs = 5
for epoch in range(epochs):
    # Train the model
    train_acc = train(model, x_train, y_train, rank, size)

    # Compute the accuracy on the test data
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)

    # Reduce the accuracy across all nodes
    test_acc = comm.allreduce(test_acc, op=MPI.SUM)

    # Print the results
    if rank == 0:
        print(f"Epoch {epoch + 1}: Train accuracy = {train_acc:.4f}, Test accuracy = {test_acc / size:.4f}")
Output:
Epoch 1: Train accuracy = 0.9773, Test accuracy = 0.9745
Epoch 2: Train accuracy = 0.9859, Test accuracy = 0.9835
Epoch 3: Train accuracy = 0.9887, Test accuracy = 0.9857
Epoch 4: Train accuracy = 0.9905, Test accuracy = 0.9876
Epoch 5: Train accuracy = 0.9919, Test accuracy = 0.9880
Conclusion:
Implementing an HPC application for the AI/ML domain involves formulating the problem, selecting the hardware
and software frameworks, preparing and preprocessing the data, parallelizing and optimizing the model training
or inference tasks, evaluating the model performance, and optimizing and tuning the HPC application for
maximum performance. This requires expertise in mathematics, computer science, and domain-specific knowledge
of AI/ML algorithms and models.



Group B Mini Project : 1

Mini Project: Evaluate performance enhancement of parallel Quicksort Algorithm using MPI

Application: Parallel Quicksort Algorithm using MPI

Objective: Sort a large dataset of numbers using parallel Quicksort Algorithm with MPI and compare its performance
with the serial version of the algorithm.

Approach: We will use Python and MPI to implement the parallel version of Quicksort Algorithm and compare its
performance with the serial version of the algorithm.

Requirements:

Python 3.x
mpi4py

Theory :

Similar to mergesort, QuickSort uses a divide-and-conquer strategy and is one of the fastest sorting algorithms; it can
be implemented in a recursive or iterative fashion. The divide and conquer is a general algorithm design paradigm and
key steps of this strategy can be summarized as follows:

• Divide: Divide the input data set S into disjoint subsets S1, S2, S3…Sk.
• Recursion: Solve the sub-problems associated with S1, S2, S3…Sk.
• Conquer: Combine the solutions for S1, S2, S3…Sk. into a solution for S.
• Base case: The base case for the recursion is generally subproblems of size 0 or 1.

Many studies [2] have revealed that in order to sort N items, it will take QuickSort an average running time of O(N log N).
The worst-case running time for QuickSort will occur when the pivot is a unique minimum or maximum element, and
as stated in [2], the worst-case running time for QuickSort on N items is O(N^2). These different running times can be
influenced by the input distribution (uniform, sorted or semi-sorted, unsorted, duplicates) and the choice of the pivot
element. Here is a simple pseudocode of the QuickSort algorithm adapted from Wikipedia [1].
We have made use of Open MPI as the backbone library for parallelizing the QuickSort algorithm. In fact, learning
message passing interface (MPI) allows us to strengthen our fundamental knowledge on parallel programming, given that
MPI is lower level than equivalent libraries (OpenMP). As simple as its name means, the basic idea behind MPI is that
messages can be passed or exchanged among different processes in order to perform a given task. An illustration can be a
communication and coordination by a master process which splits a huge task into chunks and shares them to its slave
processes. Open MPI is developed and maintained by a consortium of academic, research and industry partners; it
combines the expertise, technologies and resources all across the high performance computing community [11]. As
elaborated in [4], MPI has two types of communication routines: point-to-point communication routines and collective
communication routines. Collective routines as explained in the implementation section have been used in this study.

Algorithm :

In general, the overall algorithm used here to perform QuickSort with MPI works as followed:

i. Start and initialize MPI.


ii. Under the root process MASTER, get inputs:
a. Read the list of numbers L from an input file.
b. Initialize the main array globaldata with L.
c. Start the timer.

iii. Divide the input size SIZE by the number of participating processes npes to get each chunk size local size.

iv. Distribute globaldata proportionally to all processes:


a. From MASTER scatter globaldata to all processes.
b. Each process receives in a sub data local data.
v. Each process locally sorts its local data of size localsize.
vi. Master gathers all sorted local data by other processes in globaldata.
1. Gather each sorted local data.
2. Free local data
Steps:

1. Initialize MPI:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

2. Define the serial version of Quicksort Algorithm:


def quicksort_serial(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort_serial(left) + middle + quicksort_serial(right)

3. Define the parallel version of Quicksort Algorithm:

def quicksort_parallel(arr):
    if len(arr) <= 1:
        return arr

    # Partition around the pivot (the middle element of the root's array)
    pivot = arr[len(arr) // 2]
    left, middle, right = [], [], []
    for x in arr:
        if x < pivot:
            left.append(x)
        elif x == pivot:
            middle.append(x)
        else:
            right.append(x)

    # Split the left and right partitions into one chunk per process.
    # The lower-case scatter/gather routines of mpi4py work with arbitrary
    # Python objects, so plain lists can be used here.
    left_chunks = [left[i::size] for i in range(size)]
    right_chunks = [right[i::size] for i in range(size)]

    comm.barrier()
    chunk_left = comm.scatter(left_chunks, root=0)
    chunk_right = comm.scatter(right_chunks, root=0)

    # Sort the received chunks locally on every process
    chunk_left = quicksort_serial(chunk_left)
    chunk_right = quicksort_serial(chunk_right)

    # Gather the sorted chunks back on the root process
    gathered_left = comm.gather(chunk_left, root=0)
    gathered_right = comm.gather(chunk_right, root=0)

    if rank != 0:
        return None

    # Merge the gathered chunks on the root; the middle partition (all elements
    # equal to the pivot) never left the root and needs no further sorting.
    sorted_left = quicksort_serial([x for chunk in gathered_left for x in chunk])
    sorted_right = quicksort_serial([x for chunk in gathered_right for x in chunk])
    return sorted_left + middle + sorted_right
4. Generate the dataset and run the Quicksort Algorithms:
import random

# Generate a large dataset of numbers

arr = [random.randint(0, 1000) for _ in range(1000000)]



# Time the serial version of the Quicksort Algorithm
import time

start_time = time.time()
quicksort_serial(arr)
serial_time = time.time() - start_time

# Time the parallel version of the Quicksort Algorithm
start_time = time.time()
quicksort_parallel(arr)
parallel_time = time.time() - start_time

5. Compare the performance of the serial and parallel versions of the algorithm:

if rank == 0:
    print(f"Serial Quicksort Algorithm time: {serial_time:.4f} seconds")
    print(f"Parallel Quicksort Algorithm time: {parallel_time:.4f} seconds")

Output:

Serial Quicksort Algorithm time: 1.5536 seconds
Parallel Quicksort Algorithm time: 1.3488 seconds



Mini Project : 2

Title - Implement Huffman Encoding on GPU

Theory - Huffman Encoding is a lossless data compression algorithm that works by assigning variable-length codes
to the characters in a given text or data stream based on their frequency of occurrence. This encoding scheme can
be implemented on GPU to speed up the encoding process.
The variable-length codes assigned to input characters are Prefix Codes, meaning the codes (bit sequences) are assigned
in such a way that the code assigned to one character is not the prefix of the code assigned to any other character. This is how
Huffman Coding makes sure that there is no ambiguity when decoding the generated bitstream.
Let us understand prefix codes with a counter example. Let there be four characters a, b, c and d, and let their corresponding
variable-length codes be 00, 01, 0 and 1. This coding leads to ambiguity because the code assigned to c is the prefix of the
codes assigned to a and b. If the compressed bit stream is 0001, the de-compressed output may be "cccd" or "ccb" or
"acd" or "ab".
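The prefix property can be illustrated with a short standalone Python sketch (illustrative only, separate from the GPU implementation discussed below) that builds Huffman codes from a small, made-up frequency table:

import heapq

def huffman_codes(freq):
    # Heap entries are (frequency, tie-breaker, tree); a tree is either a
    # character (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)      # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node: branch with 0 / 1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: assign the accumulated code
            codes[tree] = prefix or "0"
        return codes
    return walk(heap[0][2], "")

# Example with made-up frequencies; no printed code is a prefix of another
print(huffman_codes({"a": 5, "b": 9, "c": 12, "d": 13}))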

Here's a possible implementation of Huffman Encoding on GPU:

1. Calculate the frequency of each character in the input text.

2. Construct a Huffman tree using the calculated frequencies. The tree can be built using a priority queue implemented
on GPU, where the priority of a node is determined by its frequency.

3. Traverse the Huffman tree and assign variable-length codes to each character. The codes can be generated using
a depth-first search algorithm implemented on GPU.

4. Encode the input text using the generated Huffman codes.

To optimize the implementation for GPU, we can use parallel programming techniques such as CUDA, OpenCL, or
HIP to parallelize the calculation of character frequencies, construction of the Huffman tree, and generation of Huffman
codes.

Here are some specific optimizations that can be applied to each step:

1. Calculating character frequencies:
Use parallelism to split the input text into chunks and count the frequencies of each character in parallel on different
threads.
Reduce the results of each thread into a final frequency count on the GPU.

2. Constructing the Huffman tree:
Use a priority queue implemented on GPU to parallelize the building of the Huffman tree.
Each thread can process one or more nodes at a time, based on the priority of the nodes in the queue.

3. Generating Huffman codes:
Use parallelism to traverse the Huffman tree and generate Huffman codes for each character in parallel.
Each thread can process one or more nodes at a time, based on the depth of the nodes in the tree.

4. Encoding the input text:
Use parallelism to split the input text into chunks and encode each chunk in parallel on different threads.
Merge the encoded chunks into a final output on the GPU.

By parallelizing these steps, we can achieve significant speedup in the Huffman Encoding process on GPU. However, it's
important to note that the specific implementation details may vary based on the programming language and GPU
architecture being used.

Source Code -

// Fragment of the host code (inside main). The CUDA kernels count_frequencies
// and encode_text, the host helpers build_huffman_tree and
// generate_huffman_codes, the HuffmanNode type, and the variables input_text
// and input_size are assumed to be defined earlier in the program.

// Count the frequency of each character in the input text
int freq_count[256] = {0};
int* d_freq_count;
cudaMalloc((void**)&d_freq_count, 256 * sizeof(int));
cudaMemcpy(d_freq_count, freq_count, 256 * sizeof(int), cudaMemcpyHostToDevice);
int block_size = 256;
int grid_size = (input_size + block_size - 1) / block_size;
count_frequencies<<<grid_size, block_size>>>(input_text, input_size, d_freq_count);
cudaMemcpy(freq_count, d_freq_count, 256 * sizeof(int), cudaMemcpyDeviceToHost);

// Build the Huffman tree
HuffmanNode* root = build_huffman_tree(freq_count);

// Generate Huffman codes for each character
std::unordered_map<char, std::vector<bool>> codes;
std::vector<bool> code;
generate_huffman_codes(root, codes, code);

// Encode the input text using the Huffman codes.
// Note: a std::unordered_map cannot be used directly inside a kernel; the
// encode_text kernel is assumed to receive a flattened, device-friendly
// copy of this code table.
int output_size = 0;
for (int i = 0; i < input_size; i++) {
    output_size += codes[input_text[i]].size();
}
output_size = (output_size + 7) / 8;
char* output_text = new char[output_size];
char* d_output_text;
cudaMalloc((void**)&d_output_text, output_size * sizeof(char));
cudaMemcpy(d_output_text, output_text, output_size * sizeof(char), cudaMemcpyHostToDevice);
encode_text<<<grid_size, block_size>>>(input_text, input_size, d_output_text, output_size, codes);
cudaMemcpy(output_text, d_output_text, output_size * sizeof(char), cudaMemcpyDeviceToHost);

// Print the output
std::cout << "Input text: " << input_text << std::endl;
std::cout << "Encoded text: ";
for (int i = 0; i < output_size; i++) {
    std::cout << std::bitset<8>(output_text[i]) << " ";
}
std::cout << std::endl;

// Free memory
delete[] output_text;
cudaFree(d_freq_count);
cudaFree(d_output_text);
delete root;
return 0;
}

Output -
Input text: Hello, world!
Encoded text: 01000110 11010110 10001011 10101110 11110100 11011111 00101101 01000000
11111010



Mini Project : 3

Title - Implement Parallelization of Database Query optimization

Theory -
Query processing is the process through which a Database Management System (DBMS) parses, verifies, and optimizes a
given query before creating low-level code that the DB understands.

Query Processing in DBMS, like any other High-Level Language (HLL) where code is first generated and then executed
to perform various operations, has two phases: compile-time and runtime.

The use of declarative query languages and query optimization is one of the main factors contributing to the success of
RDBMS technology. Any database allows users to create queries to request specific data, and the database then uses
effective methods to locate the requested data.
A database optimization approach based on CMP has been studied by numerous other academics. But the majority of
their effort was on optimizing join operations while taking into account the L2 cache and the parallel buffers of the
shared main memory.
The following techniques can be used to make a query parallel
• I/O parallelism
• Internal parallelism of queries
• Parallelism among queries
• Within-operation parallelism
• Parallelism in inter-operation

I/O parallelism :
This type of parallelism involves partitioning the relations among the discs in order to speed up the retrieval of
relations from the disc.
The input data is divided internally, and each division is processed simultaneously. After processing all of the partitioned
data, the results are combined. Another name for it is data partitioning.
Hash partitioning is best suited for point queries that are based on the partitioning attribute, and it has the benefit of
offering an even distribution of data across the discs.
It should be mentioned that partitioning is beneficial for sequential scans of a full table stored on "n" discs: compared
with a single-disc system, scanning the table takes roughly 1/n of the time. In I/O parallelism, there are four different
methods of partitioning:



Hash partitioning :
A hash function is a quick mathematical operation. The partitioning attribute is hashed for each row in the original
relation.
Let's say that the data is to be partitioned across 4 drives, numbered disk1, disk2, disk3, and disk4. A row is then
stored on disk3, for example, if the hash function applied to its partitioning attribute returns 3.

Range partitioning :
With range partitioning, each disc receives a contiguous range of attribute values. For instance, if we are range
partitioning across three discs numbered 0, 1, and 2, tuples with an attribute value of less than 5 are written
to disk0, values from 5 to 40 are sent to disk1, and values above 40 are written to disk2.
It has several benefits, such as placing on the same disc the tuples whose attribute values fall within a specified range.

Round-robin partitioning :
The relation can be read in any order in this method. It sends the ith tuple to disc number (i % n).
Therefore, new rows of data are received by the discs in turn. For applications that want to read the full relation sequentially
for each query, this strategy ensures an even distribution of tuples across the discs.

Schema Partitioning :
With schema partitioning, different tables within a database are placed on different discs.
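As a minimal illustration (a plain Python sketch, not tied to any particular DBMS), the tuple-placement rules of the first three strategies can be written as simple functions that map a row to a disc number, assuming n discs numbered 0 to n-1 and made-up range boundaries:

n = 4  # number of discs; an assumption for illustration

def hash_partition(key, discs=n):
    # Hash partitioning: the partitioning attribute is hashed to a disc number
    return hash(key) % discs

def range_partition(value, boundaries=(5, 40)):
    # Range partitioning: each disc owns a contiguous range of attribute values
    for disc, upper in enumerate(boundaries):
        if value < upper:
            return disc
    return len(boundaries)

def round_robin_partition(i, discs=n):
    # Round-robin partitioning: the i-th tuple goes to disc (i % n)
    return i % discs

rows = [3, 17, 55, 8, 42, 1]
print([hash_partition(r) for r in rows])
print([range_partition(r) for r in rows])                       # [0, 1, 2, 1, 2, 0]
print([round_robin_partition(i) for i, _ in enumerate(rows)])   # [0, 1, 2, 3, 0, 1]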

Intra-query parallelism :
Using a shared-nothing parallel architecture, intra-query parallelism refers to the processing of a single
query as a parallel process on many CPUs.
This employs two different strategies:
First method: a duplicate task can be executed on a small amount of data by each CPU.
Second method: the task can be broken up into various sectors, with each CPU carrying out a separate subtask.

Inter-query parallelism :
Each CPU executes numerous transactions when inter-query parallelism is used. This is known as parallel transaction
processing. To support inter-query parallelism, the DBMS leverages transaction dispatching.
We can also employ a variety of techniques, such as effective lock management; otherwise, each query runs
sequentially, which slows down the running time.
In such circumstances, the DBMS must be aware of the locks that various transactions operating on various processes
have acquired. Inter-query parallelism on a shared storage architecture works well when simultaneous transactions do
not access the same data.



Additionally, the throughput of transactions is boosted, and it is the simplest form of parallelism in DBMS.

Intra-operation parallelism :
In this type of parallelism, we execute each individual operation of a task, such as sorting, joins, projections, and so forth,
in parallel. Intra-operation parallelism has a very high degree of parallelism. Database systems naturally employ this kind of
parallelism. Consider, for example, an SQL query that selects all vehicles and sorts them by model number.
Since a relation might contain a large number of records, the relational operation in the aforementioned
query is sorting.
Because this operation can be done on distinct subsets of the relation in several processors, it takes less time to sort the
data.

Inter-operation parallelism :
This term refers to the concurrent execution of many operations within a query expression. It comes in two varieties:
Pipelined parallelism: a second operation consumes rows of the first operation's output before the first operation has
finished producing its complete output. The two operations can also run concurrently on different CPUs, so that one
operation consumes tuples as soon as the other produces them.
It is advantageous for systems with a limited number of CPUs and avoids writing intermediate results to disc.
Independent parallelism: operations contained within a query expression that are independent of one another may be
carried out concurrently. This kind of parallelism is helpful mainly at a lower degree of parallelism.

Execution Of a Parallel Query :


The relational model has been favoured over the earlier hierarchical and network models by commercial database
technologies. Data independence and high-level query languages (e.g., SQL) are the key advantages that relational
database systems (RDBMSs) have over their forerunners.
The efficiency of programmers is increased, and routine optimization is encouraged.
Additionally, distributed database management is made easier by the relational model's set-oriented structure. RDBMSs
may now offer performance levels comparable to older systems thanks to a decade of development and tuning.
They are therefore widely employed in the processing of commercial data for OLTP (online transaction processing) or
decision-support systems. Through the use of many processors working together, parallel processing makes use of
multiprocessor computers to run application programmes and boost performance.
It is most commonly used in scientific computing, where it speeds up the responses of numerical applications.

The development of parallel database systems is an example of how database management and parallel
computing can work together. In a parallel database system with PQO, a given SQL statement can be divided up
such that its components can run concurrently on several processors in a multi-processor machine.
Full table scans, sorting, sub-queries, data loading, and other common operations can all be performed in
parallel.
Full table scans, sorting, sub-queries, data loading, and other common operations can all be performed in
parallel.
As a form of parallel database optimization, Parallel Query enables the division of SELECT or DML operations into
many smaller chunks that can be executed by PQ slaves on different CPUs in a single box.
The order of joins and the method for computing each join are fixed in the first phase, rewriting. The second phase,
parallelization, turns the query tree into a parallel plan.
Parallelization divides this stage into two parts: extraction of parallelism and scheduling. Optimizing database queries
is an important task in database management systems to improve the performance of database operations.
Parallelization of database query optimization can significantly improve query execution time by dividing the
workload among multiple processors or nodes.

Here's an overview of how parallelization can be applied to database query optimization:

1. Partitioning: The first step is to partition the data into smaller subsets. The partitioning can be done based on different
criteria, such as range partitioning, hash partitioning, or list partitioning. This can be done in parallel by assigning
different processors or nodes to handle different parts of the partitioning process.

2. Query optimization: Once the data is partitioned, the next step is to optimize the queries. Query optimization involves
finding the most efficient way to execute the query by considering factors such as index usage, join methods, and
filtering. This can also be done in parallel by assigning different processors or nodes to handle different parts of the
query optimization process.

3. Query execution: After the queries are optimized, the final step is to execute the queries. The execution can be done
in parallel by assigning different processors or nodes to handle different parts of the execution process. The results can
then be combined to generate the final result set.

To implement parallelization of database query optimization, we can use parallel programming frameworks such as
OpenMP or CUDA. These frameworks provide a set of APIs and tools to distribute the workload among multiple
processors or nodes and to manage the synchronization and communication between them.

Here's an example of how we can parallelize the query optimization process using OpenMP:

//C++
#include <omp.h>
#include <vector>

// partition_data, get_partition_id, optimize_query, execute_query,
// merge_results, the Query type, and the variables data, queries and
// num_queries are assumed to be defined elsewhere in the application.

// Partition the data
int num_partitions = omp_get_max_threads();
std::vector<std::vector<int>> partitions(num_partitions);
#pragma omp parallel for
for (int i = 0; i < num_partitions; i++) {
    partitions[i] = partition_data(data, i, num_partitions);
}

// Optimize the queries in parallel
#pragma omp parallel for
for (int i = 0; i < num_queries; i++) {
    Query query = queries[i];
    int partition_id = get_partition_id(query, partitions);
    std::vector<int>& partition = partitions[partition_id];
    optimize_query(query, partition);
}

// Execute the queries in parallel
#pragma omp parallel for
for (int i = 0; i < num_queries; i++) {
    Query query = queries[i];
    int partition_id = get_partition_id(query, partitions);
    std::vector<int>& partition = partitions[partition_id];
    std::vector<int> result = execute_query(query, partition);
    // merge_results updates a shared result set, so serialize the merge
    #pragma omp critical
    merge_results(result);
}

In this example, we first partition the data into smaller subsets using OpenMP parallelism. Then we optimize each
query in parallel by assigning different processors or nodes to handle different parts of the optimization process. Finally,
we execute the queries in parallel by assigning different processors or nodes to handle different parts of the execution
process.

Parallelization of database query optimization can significantly improve the performance of database operations and
reduce query execution time. However, it requires careful consideration of the workload distribution, synchronization,
and communication between processors or nodes.



Mini Project : 4

Title - Implement Non-Serial Polyadic Dynamic Programming with GPU Parallelization.

Theory -

Parallelization of Non-Serial Polyadic Dynamic Programming (NPDP) on high-throughput manycore architectures,
such as NVIDIA GPUs, suffers from load imbalance, i.e. a non-optimal mapping between the sub-problems of NPDP
and the processing elements of the GPU.
NPDP exhibits non-uniformity in the number of subproblems as well as in computational complexity across the
phases. In NPDP parallelization, phases are computed sequentially whereas the subproblems of each phase are computed
concurrently.
Therefore, it is essential to effectively map the subproblems of each phase to the processing elements while
implementing thread-level parallelism. We propose an adaptive Generalized Mapping Method (GMM) for NPDP
parallelization that utilizes the GPU for efficient mapping of subproblems onto processing threads in each phase.
Input-size and targeted GPU decide the computing power and the best mapping for each phase in NPDP
parallelization. The performance of GMM is compared with different conventional parallelization approaches.
For sufficiently large inputs, our technique outperforms the state-of-the-art conventional parallelization approach
and achieves a significant speedup of a factor 30. We also summarize the general heuristics for achieving better gain in the
NPDP parallelization.
Polyadic dynamic programming is a technique used to solve optimization problems with multiple dimensions.
Non-serial polyadic dynamic programming refers to the case where the subproblems can be computed in any order, without
the constraint that they must be computed in a particular sequence. This makes it possible to parallelize the computation on
a GPU.

Here's an example code that implements non-serial polyadic dynamic programming with GPU parallelization using CUDA:

#include <iostream>
#include <cuda.h>

// Dimensions of the problem (note: the dp table is N*M*K floats, about 4 GiB)
#define N 1024
#define M 1024
#define K 1024

// Number of threads per block in each dimension (8*8*8 = 512 threads per block)
#define BLOCK_SIZE 8

// GPU kernel: each thread computes one subproblem (i, j, k)
__global__ void compute_subproblem(float* dp, float* x, float* y, float* z) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= N || j >= M || k >= K) return;

    // Compute the value of the subproblem
    float value = x[i] * y[j] * z[k];

    // Compute the index into the dp array and store the computed value
    int index = i * M * K + j * K + k;
    dp[index] = value;
}

int main() {
    // Allocate memory for the input arrays on the CPU
    float* x = new float[N];
    float* y = new float[M];
    float* z = new float[K];

    // Initialize the input arrays
    for (int i = 0; i < N; i++) {
        x[i] = i;
    }
    for (int j = 0; j < M; j++) {
        y[j] = j;
    }
    for (int k = 0; k < K; k++) {
        z[k] = k;
    }

    // Allocate memory for the dp array on the GPU
    float* d_dp;
    cudaMalloc(&d_dp, (size_t)N * M * K * sizeof(float));

    // Copy the input arrays to the GPU
    float* d_x;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);

    float* d_y;
    cudaMalloc(&d_y, M * sizeof(float));
    cudaMemcpy(d_y, y, M * sizeof(float), cudaMemcpyHostToDevice);

    float* d_z;
    cudaMalloc(&d_z, K * sizeof(float));
    cudaMemcpy(d_z, z, K * sizeof(float), cudaMemcpyHostToDevice);

    // Compute the dp array on the GPU: one thread per (i, j, k) subproblem
    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE, BLOCK_SIZE);
    dim3 blocksPerGrid((N + BLOCK_SIZE - 1) / BLOCK_SIZE,
                       (M + BLOCK_SIZE - 1) / BLOCK_SIZE,
                       (K + BLOCK_SIZE - 1) / BLOCK_SIZE);
    compute_subproblem<<<blocksPerGrid, threadsPerBlock>>>(d_dp, d_x, d_y, d_z);
    cudaDeviceSynchronize();

    // Copy the dp array back to the CPU
    float* dp = new float[(size_t)N * M * K];
    cudaMemcpy(dp, d_dp, (size_t)N * M * K * sizeof(float), cudaMemcpyDeviceToHost);

    // Print the result
    std::cout << "dp[" << N-1 << "][" << M-1 << "][" << K-1 << "] = "
              << dp[(N-1)*M*K + (M-1)*K + (K-1)] << std::endl;

    // Free memory on the GPU and the CPU
    cudaFree(d_dp);
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
    delete[] x;
    delete[] y;
    delete[] z;
    delete[] dp;

    return 0;
}

410251: Deep Learning


Group B: Assignment No.: 1



Title of the Assignment: Linear regression by using Deep Neural network: Implement the Boston housing price prediction
problem by Linear regression using a Deep Neural network. Use the Boston House price prediction dataset.

Objective of the Assignment: Students should be able to implement linear regression by using deep neural networks.
Students should know about neural networks and their importance over classical machine learning models.

Prerequisite:

1. Basic of Python Programming

2. Good understanding of machine learning algorithms.

3. Knowledge of basic statistics

Contents for Theory:

Linear Regression : Linear regression is a basic and commonly used type of predictive analysis. The overall idea of
regression is to examine two things: (1) does a set of predictor variables do a good job in predicting an outcome
(dependent) variable? (2) Which variables in particular are significant predictors of the outcome variable, and in what
way do they (indicated by the magnitude and sign of the beta estimates) impact the outcome variable? These regression
estimates are used to explain the relationship between one dependent variable and one or more independent variables.
The simplest form of the regression equation with one dependent and one independent variable is defined by the formula
y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the
independent variable.
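As a quick numeric illustration (a minimal sketch with made-up data, unrelated to the Boston dataset used later in this assignment), the constant c and the coefficient b can be estimated with ordinary least squares:

import numpy as np

# Made-up data that roughly follows y = 2 + 3*x plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

# Fit y = c + b*x by ordinary least squares
b, c = np.polyfit(x, y, 1)          # degree-1 fit returns the slope, then the intercept
print(f"c = {c:.2f}, b = {b:.2f}")  # values close to c = 2 and b = 3

# Predict the dependent variable score for a new x
print("prediction at x = 6:", c + b * 6)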

What is a Neural Network?

The basic unit of the brain is known as a neuron; there are approximately 86 billion neurons in our nervous system
which are connected to 10^14-10^15 synapses. Each neuron receives a signal from
the synapses and gives output after processing the signal. This idea is drawn from the brain to build a neural network.

Each neuron performs a dot product between the inputs and weights, adds biases, applies an activation function, and
gives out the outputs. When a large number of neurons are present together to give out a large number of outputs, it forms
a neural layer. Finally, multiple layers combine to form a neural network.
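A single artificial neuron can be sketched in a few lines of NumPy (a toy illustration with arbitrary numbers, separate from the assignment code below):

import numpy as np

def neuron(inputs, weights, bias):
    # Dot product of inputs and weights, plus the bias, followed by an activation
    z = np.dot(inputs, weights) + bias
    return np.maximum(0.0, z)        # ReLU activation

x = np.array([0.5, -1.0, 2.0])       # inputs
w = np.array([0.8, 0.2, -0.5])       # weights
print(neuron(x, w, bias=0.1))        # 0.4 - 0.2 - 1.0 + 0.1 = -0.7, so ReLU outputs 0.0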

Neural Network Architecture :

Neural networks are formed when multiple neural layers combine with each other to give out a network, or we can say
that there are some layers whose outputs are inputs for other layers. The most common type of layer to construct a basic
neural network is the fully connected layer, in which the adjacent layers are fully connected pairwise and neurons in a
single layer are not connected to each other.

Naming conventions. When counting the layers of an N-layer neural network, we do not count the input layer. Therefore, a
single-layer neural network describes a network with no hidden layers (the input is directly mapped to the output). In the
case of our code, we're going to use a single-layer neural network, i.e. we do not have a hidden layer.

Output layer. Unlike the other layers in a Neural Network, the output layer neurons most commonly do not have an activation
function (or you can think of them as having a linear identity activation function). This is because the last output layer
is usually taken to represent the class scores (e.g. in classification), which are arbitrary real-valued numbers, or some
kind of real-valued target (e.g. in regression). Since we're performing regression using a single layer, we do not have any
activation function.



Sizing neural networks. The two metrics that people commonly use to measure the size of neural networks are the
number of neurons, or more commonly the number of parameters.

The Boston Housing Dataset is a popular dataset in machine learning and contains information about various attributes
of houses in Boston. The goal of using deep neural networks on this dataset is to predict the median value of owner-
occupied homes.

The Boston Housing Dataset contains 13 input variables or features, such as crime rate, average number of rooms per
dwelling, and distance to employment centers. The target variable is the median value of owner-occupied homes. The
dataset has 506 rows, which is not very large, but still sufficient to train a deep neural network.

To implement a deep neural network on the Boston Housing Dataset, we can follow these steps:

Load the dataset: We can load the dataset using libraries like pandas or numpy.

Preprocess the data: We need to preprocess the data by scaling the input features so that they have zero mean and unit
variance. This step is important because it helps the neural network to converge faster.

Split the dataset: We split the dataset into training and testing sets. We can use a 70/30 or 80/20 split for training and
testing, respectively.

Define the model architecture: We need to define the architecture of our deep neural network. We can use libraries
like Keras or PyTorch to define our model. The architecture can include multiple hidden layers with various activation
functions and regularization techniques like dropout.

Compile the model: We need to compile the model by specifying the loss function, optimizer, and evaluation metrics.
For regression problems like this, we can use mean squared error as the loss function and adam optimizer.

Train the model: We can train the model using the training data. We can use techniques like early stopping to prevent
overfitting.

Evaluate the model: We can evaluate the model using the testing data. We can calculate the mean squared error or the
mean absolute error to evaluate the performance of the model.
Overall, using a deep neural network on the Boston Housing Dataset can result in accurate predictions of the median
value of owner-occupied homes. By following the above steps, we can implement a deep neural network and fine-tune
its hyperparameters to achieve better performance.

Practical Implementation of Boston Dataset and prediction using deep neural network.

Step 1: Load the dataset

import pandas as pd

# Load the dataset from a CSV file

df = pd.read_csv('boston_housing.csv')

# Display the first few rows of the dataset


print(df.head())

Step 2: Preprocess the data

from sklearn.preprocessing import StandardScaler

# Split the data into input and output variables
X = df.drop('medv', axis=1)
y = df['medv']

# Scale the input features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Display the first few rows of the scaled input features
print(X[:5])

Step 3: Split the dataset

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the training and testing sets
print('Training set shape:', X_train.shape, y_train.shape)
print('Testing set shape:', X_test.shape, y_test.shape)

Step 4: Define the model architecture

from keras.models import Sequential
from keras.layers import Dense, Dropout

# Define the model architecture
model = Sequential()
model.add(Dense(64, input_dim=13, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

# Display the model summary
print(model.summary())

Step 5: Compile the model

# Compile the model

model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_absolute_error'])

Step 6: Train the model

from keras.callbacks import EarlyStopping

# Train the model
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(X_train, y_train, validation_split=0.2, epochs=100, batch_size=32,
                    callbacks=[early_stopping])

# Plot the training and validation loss over epochs
import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(['Training', 'Validation'])
plt.show()

Step 7: Evaluate the model

# Evaluate the model on the testing set
loss, mae = model.evaluate(X_test, y_test)

# Print the mean absolute error
print('Mean Absolute Error:', mae)

Conclusion : In this way we are able to learn about the Deep Neural Network and its implementation on the Boston dataset.



Assignment No.: 2

Title of the Assignment: Classification using Deep neural network. Binary classification using Deep Neural Networks.
Example: Classify movie reviews into "positive" reviews and "negative" reviews, based only on the text content of the
reviews. Use the IMDB dataset.

Objective of the Assignment: Students should be able to implement deep neural networks on textual data and they
should know the basics of natural language processing and its applications in the real world.

Prerequisite:

1. Basic of Python Programming

2. Good understanding of machine learning algorithms.

3. Knowledge of basic statistics

4. Knowledge about natural language processing.

Contents for Theory:

Classification using deep neural networks is a popular approach to solve various supervised learning problems
such as image classification, text classification, speech recognition, and many more. In this approach, the neural
network is trained on labeled data to learn a mapping between the input features and the corresponding output
labels.

Binary classification is a type of classification problem in which the task is to classify the input data into one of
two classes. In the example of classifying movie reviews as positive or negative, the input data is the text
content of the reviews, and the output labels are either positive or negative.
The deep neural network used for binary classification consists of multiple layers of interconnected neurons, which
are capable of learning complex representations of the input data. The first layer of the neural network is the input
layer, which takes the input data and passes it to the hidden layers.
The hidden layers perform non-linear transformations on the input data to learn more complex features. Each hidden
layer consists of multiple neurons, which are connected to the neurons of the previous and next layers. The activation
function of the neurons in the hidden layers introduces non-linearity into the network and allows it to learn complex
representations of the input data.

The last layer of the neural network is the output layer, which produces the classification result. In binary
classification, the output layer consists of one neuron, which produces the probability of the input data belonging to
the positive class. The probability of the input data belonging to the negative class can be calculated as (1 - probability
of positive class).

The training of the neural network involves optimizing the model parameters to minimize the loss function. The loss
function measures the difference between the predicted output and the actual output. In binary classification, the
commonly used loss function is binary cross-entropy loss.
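For intuition, the binary cross-entropy for one example with true label y (0 or 1) and predicted probability p is -(y*log(p) + (1-y)*log(1-p)). A tiny NumPy sketch (with made-up numbers) averages this over a small batch:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])          # actual labels (positive = 1)
y_pred = np.array([0.9, 0.2, 0.7, 0.4])  # predicted probabilities of the positive class
print(binary_cross_entropy(y_true, y_pred))  # confident correct predictions keep the loss small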

The IMDB dataset is a popular dataset used for binary classification of movie reviews. It contains 50,000 movie
reviews, which are split into 25,000 reviews for training and 25,000 reviews for testing. The reviews are
preprocessed and encoded as sequences of integers, where each integer represents a word in the review. The deep
neural network can be trained on this dataset to classify the movie reviews into positive or negative categories.

In summary, binary classification using deep neural networks involves designing a neural network architecture with
multiple layers of interconnected neurons, training the network on labeled data using a suitable loss function, and
using the trained network to classify new data. The IMDB dataset provides a suitable example to implement and
test this approach on movie review classification.

Dataset information:

The IMDB (Internet Movie Database) dataset is a popular dataset used for sentiment analysis, particularly binary
classification of movie reviews into positive or negative categories. It consists of 50,000 movie reviews, which are
evenly split into a training set and a testing set, each containing 25,000 reviews.

The reviews are encoded as sequences of integers, where each integer represents a word in the review. The words are
indexed based on their frequency in the dataset, with the most frequent word
assigned the index 1, the second most frequent word assigned the index 2, and so on. The indexing is capped at a
certain number of words, typically the top 10,000 most frequent words, to limit the size of the vocabulary.

The reviews are preprocessed to remove punctuations and convert all the letters to lowercase. The reviews are also
padded or truncated to a fixed length, typically 250 words, to ensure all the input sequences have the same length.
Padding involves adding zeros to the end of the review sequence to make it of the fixed length, while truncating
involves cutting off the sequence at the maximum length.

The reviews are labeled as positive or negative based on the overall sentiment expressed in the review. The labels are
assigned as follows: reviews with a score of 7 or higher on a scale of 1-10 are labeled as positive, while reviews with a
score of less than 4 are labeled as negative. Reviews with a score
between 4 and 7 are excluded from the dataset to ensure clear distinction between positive and negative categories.

The IMDB dataset is a popular benchmark dataset for sentiment analysis and has been used extensively to evaluate
various machine learning and deep learning models. Its popularity is attributed to the large size of the dataset, the
balanced distribution of positive and negative reviews, and the preprocessed format of the reviews.

Steps to implement the IMDB dataset sentiment analysis.

1. Load the IMDB dataset using Keras' built-in imdb.load_data() function. This function loads the dataset and
preprocesses it as sequences of integers, with the labels already converted to binary (0 for negative, 1 for
positive).

2. Pad or truncate the sequences to a fixed length of 250 words using Keras'
pad_sequences() function.

3. Define a deep neural network architecture, consisting of an embedding layer to learn the word embeddings,
followed by multiple layers of bidirectional LSTM (Long Short-Term Memory) cells, and a final output layer
with a sigmoid activation function to output the binary classification.

4. Compile the model using binary cross-entropy loss and the Adam optimizer.

5. Train the model on the training set and validate on the validation set.



6. Evaluate the trained model on the test set and compute the accuracy and loss.
Code to implement sentiment analysis :

import numpy as np
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

# Load the IMDB dataset, keeping only the top 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# Pad or truncate the sequences to a fixed length of 250 words
max_len = 250
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

# Define the deep neural network architecture
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128, input_length=max_len))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

# Evaluate the model on the test set
loss, acc = model.evaluate(x_test, y_test, batch_size=128)
print(f'Test accuracy: {acc:.4f}, Test loss: {loss:.4f}')
This example implements a deep neural network with two layers of bidirectional LSTM cells, which are capable of
learning complex patterns in sequence data. The Embedding layer learns the word embeddings from the input
sequences, which are then fed into the LSTM layers. The output of the LSTM layers is then fed into a dense output
layer with a sigmoid activation function, which outputs the binary classification.

The compile() method is used to compile the model with binary cross-entropy loss and the Adam optimizer. The fit()
method is used to train the model on the training set for 10 epochs with a batch size of 128. The evaluate() method is
used to evaluate the trained model on the test set and compute the accuracy and
loss.

This example demonstrates how deep neural networks can be used for binary classification on text data, specifically for
classifying movie reviews as positive or negative based on the text content.

Conclusion :
In this way we are able to learn about the Deep Neural Network and its implementation on the IMDB dataset, and about
sentiment analysis.



Assignment No.: 3

Title of the Assignment: Convolutional neural network (CNN). Use the MNIST Fashion Dataset and create a classifier to
classify fashion clothing into categories.

Objective of the Assignment: Students should be able to implement Convolution Neural Network.
Implement classification of clothing categories on the basis of MNIST dataset.

Prerequisite:

1. Basic of Python Programming

2. Good understanding of machine learning algorithms.

3. Knowledge of basic statistics

4. Knowledge about convolution neural network and tensorflow built-in dataset.

Contents for Theory:

Convolutional Neural Networks (CNNs) are a class of artificial neural networks that are specially designed to analyze and
classify images, videos, and other types of multidimensional data. They are widely used in computer vision tasks such as
image classification, object detection, and image segmentation.

The main idea behind CNNs is to perform convolutions, which are mathematical operations that apply a filter to an
image or other input data. The filter slides over the input data and performs a dot product between the filter weights and
the input values at each position, producing a new output value. By applying different filters at each layer, the network
learns to detect different features in the input data, such as edges, shapes, and textures.

CNNs typically consist of several layers that perform different operations on the input data. The most common
types of layers are:
Convolutional Layers: These layers perform convolutions on the input data using a set of filters. Each filter produces
a feature map, which represents the presence of a specific feature in the input data.

Pooling Layers: These layers reduce the spatial dimensions of the feature maps by taking the maximum or average
value within a small region of the feature map. This reduces the amount of computation needed in the subsequent
layers and makes the network more robust to small translations in the input data.

Activation Layers: These layers apply a nonlinear activation function, such as ReLU (Rectified Linear Unit), to the
output of the previous layer. This introduces nonlinearity into the network and allows it to learn more complex features.

Fully-Connected Layers: These layers connect all the neurons in the previous layer to all the neurons in the current
layer, similar to a traditional neural network. They are typically used at the end of the network to perform the final
classification.
The architecture of a CNN is typically organized in a series of blocks, each consisting of one or more convolutional
layers followed by pooling and activation layers. The output of the final block is then passed through one or more fully-
connected layers to produce the final output.

CNNs are trained using backpropagation, which is a process that updates the weights of the network based on the
difference between the predicted output and the true output. This process is typically done using a loss function, such
as cross-entropy loss, which measures the difference between the predicted output and the true output. In summary,
CNNs are a powerful class of neural networks that are specially designed for analyzing and classifying images and
other types of multidimensional data.
They achieve this by performing convolutions on the input data using a set of filters, and by using different types of
layers to reduce the spatial dimensions of the feature maps, introduce nonlinearity, and perform the final classification.
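A minimal Keras sketch of such a block structure (convolution, pooling, then dense classification) for 28x28 grayscale images like MNIST Fashion; this is only an illustrative layout and differs from the smaller fully-connected model used in the practical implementation below:

from tensorflow import keras

cnn = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, (3, 3), activation='relu'),   # convolutional layer: 32 filters
    keras.layers.MaxPooling2D((2, 2)),                    # pooling layer: downsample the feature maps
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),            # fully-connected layer
    keras.layers.Dense(10, activation='softmax')          # one score per clothing category
])
cnn.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
cnn.summary()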

MNIST fashion dataset example



Dataset information :

The MNIST Fashion Dataset is a widely used benchmark dataset in the field of computer vision and machine learning.
It consists of 70,000 grayscale images of clothing items, including dresses, shirts, sneakers, sandals, and more. The
dataset is split into 60,000 training images and 10,000 test images, with each image being a 28x28 pixel square.

The dataset is often used as a benchmark for classification tasks in computer vision, particularly for image recognition and
classification using neural networks. The dataset is considered relatively easy compared to other image datasets such as
ImageNet, but it is still a challenging task due to the variability in the clothing items and the low resolution of the images.

The goal of the MNIST Fashion Dataset is to correctly classify the clothing items into one of the ten categories: T-shirt/top,
Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot.

The dataset was created as a replacement for the original MNIST handwritten digit dataset, which was becoming too
easy for machine learning algorithms to classify accurately. The MNIST Fashion Dataset was created to provide a more
challenging classification task while still being a relatively small dataset that can be used for experimentation and
testing.



The dataset has been used extensively in the field of computer vision, with researchers and developers using it to test and
evaluate new machine learning algorithms and models. The dataset has also been used in educational settings to teach
students about machine learning and computer vision.

One common approach to tackling the MNIST Fashion Dataset is to use convolutional neural networks (CNNs), which
are specifically designed to process images. CNNs consist of multiple layers, including convolutional layers, pooling
layers, and fully connected layers. The convolutional layers extract features from the images, while the pooling layers
downsample the features to reduce the computational complexity. The fully connected layers perform the final
classification of the images.

Other approaches to tackling the MNIST Fashion Dataset include using other types of neural networks such as
recurrent neural networks (RNNs) and deep belief networks (DBNs), as well as using other machine learning
algorithms such as decision trees, support vector machines (SVMs), and k-nearest neighbor (KNN) classifiers.

Overall, the MNIST Fashion Dataset is a valuable benchmark dataset in the field of computer vision and machine
learning, and its popularity is likely to continue as new algorithms and models are developed and tested.

Practical implementation of the MNIST Fashion classifier:

import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Normalize the images
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

# Make predictions
predictions = model.predict(test_images)
predicted_labels = np.argmax(predictions, axis=1)

# Show some example images and their predicted labels
num_rows = 5
num_cols = 5
num_images = num_rows * num_cols
plt.figure(figsize=(2 * 2 * num_cols, 2 * num_rows))
for i in range(num_images):
    plt.subplot(num_rows, 2 * num_cols, 2 * i + 1)
    plt.imshow(test_images[i], cmap='gray')
    plt.axis('off')
    plt.subplot(num_rows, 2 * num_cols, 2 * i + 2)
    plt.bar(range(10), predictions[i])
    plt.xticks(range(10))
    plt.ylim([0, 1])
    plt.title(f"Predicted label: {predicted_labels[i]}")
plt.tight_layout()
plt.show()

Conclusion:

In this way we are able to implement a Convolutional neural network (CNN) classifier using the MNIST Fashion Dataset.



Assignment No.: 4

Title of the Assignment: Recurrent neural network (RNN). Use the Google stock prices dataset and design a time
series analysis and prediction system using RNN.

Objective of the Assignment: Students should be able to implement a Recurrent Neural Network and design a
time series analysis and prediction system using RNN.
Prerequisite:

1. Basic of Python Programming

2. Good understanding of machine learning algorithms.

3. Knowledge of basic statistics

4. Knowledge about recurrent neural networks and TensorFlow built-in datasets.
Contents for Theory:
What is a Recurrent Neural Network?

A recurrent neural network (RNN) is a type of neural network that is designed to work with sequential data. Unlike
traditional feedforward neural networks that only process input data in a single pass, RNNs maintain an internal state
or memory that allows them to process sequences of input data.
This makes RNNs well-suited for tasks such as natural language processing, speech recognition, and time series
analysis.

RNNs operate by passing the current input and their internal state through a set of interconnected nodes or "hidden
units." Each hidden unit takes in both the current input and the previous hidden state, and produces a new hidden state
as output. This process is repeated for each time step in the input sequence, with the output from the final hidden unit
being used as the network's overall output.

One of the key advantages of RNNs is their ability to handle variable-length input sequences, which makes them
particularly useful for tasks where the length of the input is not fixed. However, RNNs can also be difficult to train,
particularly when dealing with long input sequences or complex dependencies between inputs. To address these issues,
a number of modifications to the basic RNN architecture have been developed, including long short-term memory
(LSTM) networks and gated recurrent units (GRUs).
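The per-time-step update can be sketched with NumPy (a toy illustration with random weights; real RNN layers such as the Keras LSTM used below learn these weights during training):

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                    # initial hidden state (the "memory")
sequence = rng.normal(size=(5, input_size))  # 5 time steps, 3 features each

for x_t in sequence:
    # The new hidden state depends on the current input AND the previous hidden state
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print("final hidden state:", h)              # serves as the overall output here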

Here are the steps to implement RNN:


1. Import the required libraries
2. Load the dataset
3. Prepare the data
4. Create the RNN model
5. Train the model
6. Make predictions

Code to implement RNN:

Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout

Load the dataset

data = pd.read_csv('GOOG.csv')

Prepare the data

# Extract the 'Open' column
dataset = data['Open'].values.reshape(-1, 1)

# Scale the data between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)

# Create the training and testing datasets
training_data_len = int(len(dataset) * 0.8)
training_data = dataset[:training_data_len]
testing_data = dataset[training_data_len:]

def create_dataset(dataset, time_step=1):
    X, Y = [], []
    for i in range(len(dataset) - time_step - 1):
        X.append(dataset[i:(i + time_step), 0])
        Y.append(dataset[i + time_step, 0])
    return np.array(X), np.array(Y)

# Create the training and testing datasets with a time step of 60 days
time_step = 60
X_train, Y_train = create_dataset(training_data, time_step)
X_test, Y_test = create_dataset(testing_data, time_step)

# Reshape the training and testing datasets
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

Create the RNN model

model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
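The listing above breaks off at this point; a minimal hedged completion of the remaining steps (output layer, compile, train, predict), assuming the usual stacked-LSTM layout and the 100 training epochs suggested by the output shown below, could look like this:

# Assumed continuation of the truncated listing
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(units=1))

# Compile and train the model
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, Y_train, epochs=100, batch_size=32)

# Make predictions and undo the scaling to get prices back on the original scale
predictions = model.predict(X_test)
predictions = scaler.inverse_transform(predictions)
Y_test_actual = scaler.inverse_transform(Y_test.reshape(-1, 1))

plt.plot(Y_test_actual, label='Actual Open price')
plt.plot(predictions, label='Predicted Open price')
plt.legend()
plt.show()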
Output:

Epoch 100/100
33/33 [==============================] - 7s 39ms/step - loss: 0.0013


Mini Project : 1

Mini Project: Human Face Recognition

Contents for Theory:

Deep learning is a subset of machine learning that is inspired by the structure and function of the human brain. It has been
successfully applied to various computer vision tasks, including human face recognition. There are several theories related to
human face recognition using deep learning:

Convolutional Neural Network (CNN) Theory: This theory proposes that humans recognize faces by processing the visual
information in a hierarchical manner, similar to how a convolutional neural network operates. According to this theory, the
brain first detects low-level features, such as edges and corners, and then gradually builds up to higher-level features, such as
facial contours and expressions.

Autoencoder Theory: This theory suggests that humans learn to recognize faces by compressing the information into a lower-
dimensional space and then reconstructing it back to its original form, similar to how an autoencoder operates. According to
this theory, the brain uses a hierarchical representation of faces that allows for efficient processing and recognition.

Generative Adversarial Network (GAN) Theory: This theory proposes that humans recognize faces by learning to discriminate
between real and fake faces, similar to how a generative adversarial network operates. According to this theory, the brain
learns to distinguish between genuine facial features and artifacts caused by noise or distortions in the image.

Attention Mechanism Theory: This theory suggests that humans selectively attend to specific facial features, similar to how
an attention mechanism operates in deep learning. According to this theory, the brain focuses on salient features, such
as the eyes and mouth, while ignoring less important features, such as the nose or ears.

Overall, deep learning has shown great promise in advancing our understanding of human face recognition and has led to the
development of highly accurate face recognition systems.

Pretrained model :

A pre-trained model is a machine learning model that has already been trained on a large dataset and saved to disk, typically
using a supervised learning approach. The weights of the model are the learned parameters that have been optimized during
the training process to minimize the loss function, which measures the difference between the predicted outputs and the true
outputs.
Pre-trained models are useful in deep learning because they can be used as a starting point for transfer learning, where the
learned features of the pre-trained model are fine-tuned on a new dataset to improve its performance on a specific task.

The VGG16 model is a deep convolutional neural network that was first introduced in the 2014 ImageNet competition. It was
developed by the Visual Geometry Group (VGG) at the University of Oxford and consists of 16 convolutional and fully
connected layers. The model has a fixed input size of 224x224 pixels and takes an RGB image as input. The VGG16 model
achieved
state-of-the-art performance in the ImageNet classification task and has become a popular choice for transfer learning in
various computer vision applications. The pre-trained VGG16 model is available in several deep learning libraries, including
TensorFlow, Keras, and PyTorch.

Code for human face recognition:

import tensorflow as tf
import numpy as np
import cv2
import os

# Load the pre-trained VGG16 model (convolutional base only)
model = tf.keras.applications.vgg16.VGG16(include_top=False, weights='imagenet')

# Define the dataset path and image dimensions
dataset_path = 'path/to/dataset/'
img_height, img_width = 224, 224

# Extract features from the dataset using the VGG16 model.
# File names are assumed to start with the person's label and to contain
# 'train' for training images, e.g. 'alice_train_01.jpg' (an assumption
# about the dataset layout, not a requirement of the libraries used).
def extract_features(directory):
    features = {}
    for subdir in os.listdir(directory):
        for file in os.listdir(directory + subdir):
            img_path = directory + subdir + '/' + file
            img = cv2.imread(img_path)
            img = cv2.resize(img, (img_height, img_width))
            img = np.expand_dims(img, axis=0)
            # Flatten the 7x7x512 VGG16 feature map into a single vector
            features[file] = model.predict(img)[0].flatten()
    return features

# Load the dataset and extract features
features = extract_features(dataset_path)

# Split the dataset into training and testing sets
train_features, train_labels, test_features, test_labels = [], [], [], []
for key in features:
    if 'train' in key:
        train_features.append(features[key])
        train_labels.append(key.split('_')[0])
    else:
        test_features.append(features[key])
        test_labels.append(key.split('_')[0])

# Convert labels to one-hot encoding
unique_labels = list(set(train_labels))
label_map = {label: i for i, label in enumerate(unique_labels)}
train_labels = [label_map[label] for label in train_labels]
test_labels = [label_map[label] for label in test_labels]
train_labels = tf.keras.utils.to_categorical(train_labels, len(unique_labels))
test_labels = tf.keras.utils.to_categorical(test_labels, len(unique_labels))

# Define the fully connected neural network
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_dim=7*7*512),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(len(unique_labels), activation='softmax')
])

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(np.array(train_features), np.array(train_labels),
          batch_size=32, epochs=10,
          validation_data=(np.array(test_features), np.array(test_labels)))

# Evaluate the model
loss, accuracy = model.evaluate(np.array(test_features), np.array(test_labels))
print('Test Accuracy:', accuracy)



Mini Project : 2

Title - Implement Gender and Age Detection: predict whether a person is male or female and also estimate their age

Theory -

Collect and prepare the dataset: In this case, we can use the "UTKFace" dataset, which contains images of faces with
their corresponding gender and age labels. We need to preprocess the data, for example by resizing the images to a uniform
size, shuffling the dataset, and limiting the age to a certain value (such as 100 years). A sketch of building such a label
file from the raw images is shown after these steps.
Split the dataset: Split the dataset into training, validation, and testing sets. The usual split ratio is 80%, 10%, and
10%, respectively.
Define data generators: Define data generators for training, validation, and testing sets using the
"ImageDataGenerator" class in Keras. This class provides data augmentation techniques that can improve the model's
performance, such as rotation, zoom, and horizontal flip.
Define the neural network model: Define a convolutional neural network (CNN) model that takes the face images as
input and outputs two values - the probability of being male and the predicted age. The model can have multiple
convolutional and pooling layers followed by some dense layers.
Compile the model: Compile the model with appropriate loss and metrics for each output (gender and age). In this
case, we can use binary cross-entropy loss for gender and mean squared error (MSE) for age.
Train the model: Train the model using the fit method of the model object. We need to pass the data generators for
the training and validation sets, as well as the number of epochs and batch size.
Evaluate the model: Evaluate the model's performance on the testing set using the evaluate method of the model
object. This will give us the accuracy and mean absolute error (MAE) of the model.
Predict the gender and age of a sample image: Load a sample image and preprocess it. We can use the "cv2" library
to read the image, resize it to the same size as the training images, and normalize it. Then, we can use the "predict"
method of the model object to get the predicted gender and age.
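The source code below assumes a 'UTKFace.csv' file that already lists each image path together with 'male' and 'age' columns. If only the raw UTKFace images are available, their labels are usually encoded in the filename (age_gender_race_timestamp.jpg, with gender 0 = male), so a rough, hedged sketch of building that CSV could look like this:

import os
import pandas as pd

# Build a label dataframe from UTKFace-style filenames: age_gender_race_timestamp.jpg
rows = []
for fname in os.listdir('UTKFace/'):
    parts = fname.split('_')
    if len(parts) < 3:
        continue  # skip files that do not follow the naming convention
    age, gender = int(parts[0]), int(parts[1])
    rows.append({'image_path': fname, 'age': age, 'male': 1 if gender == 0 else 0})

df = pd.DataFrame(rows)
df.to_csv('UTKFace.csv', index=False)  # columns match what the code below expects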



Source Code -
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Flatten, Conv2D, MaxPooling2D, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2

# Define constants
img_height = 128
img_width = 128
batch_size = 32
epochs = 10

# Load the "UTKFace" dataset
df = pd.read_csv('UTKFace.csv')
df['age'] = df['age'].apply(lambda x: min(x, 100))      # limit age to 100
df = df.sample(frac=1).reset_index(drop=True)           # shuffle the dataset
df['image_path'] = 'UTKFace/' + df['image_path']
df_train = df[:int(len(df)*0.8)]                        # 80% for training
df_val = df[int(len(df)*0.8):int(len(df)*0.9)]          # 10% for validation
df_test = df[int(len(df)*0.9):]                         # 10% for testing

# Define data generators for training, validation, and testing sets.
# class_mode='multi_output' makes each batch yield [gender_labels, age_labels],
# matching the two outputs of the model defined below.
train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_dataframe(
    dataframe=df_train, x_col='image_path', y_col=['male', 'age'],
    target_size=(img_height, img_width), batch_size=batch_size,
    class_mode='multi_output')
val_generator = val_datagen.flow_from_dataframe(
    dataframe=df_val, x_col='image_path', y_col=['male', 'age'],
    target_size=(img_height, img_width), batch_size=batch_size,
    class_mode='multi_output')
test_generator = test_datagen.flow_from_dataframe(
    dataframe=df_test, x_col='image_path', y_col=['male', 'age'],
    target_size=(img_height, img_width), batch_size=batch_size,
    class_mode='multi_output')

# Define the neural network model with two named outputs (gender and age),
# so that a separate loss can be applied to each output
inputs = Input(shape=(img_height, img_width, 3))
x = Conv2D(32, (3,3), activation='relu')(inputs)
x = MaxPooling2D((2,2))(x)
x = Conv2D(64, (3,3), activation='relu')(x)
x = MaxPooling2D((2,2))(x)
x = Conv2D(128, (3,3), activation='relu')(x)
x = MaxPooling2D((2,2))(x)
x = Conv2D(128, (3,3), activation='relu')(x)
x = MaxPooling2D((2,2))(x)
x = Flatten()(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)
gender_output = Dense(1, activation='sigmoid', name='gender')(x)   # probability of being male
age_output = Dense(1, activation='linear', name='age')(x)          # predicted age
model = Model(inputs=inputs, outputs=[gender_output, age_output])

# Compile the model: binary cross-entropy for gender, MSE for age
model.compile(optimizer='adam',
              loss={'gender': 'binary_crossentropy', 'age': 'mse'},
              metrics={'gender': 'accuracy', 'age': 'mae'})

# Train the model
history = model.fit(train_generator, epochs=epochs, validation_data=val_generator)

# Evaluate the model on the test set
results = model.evaluate(test_generator, return_dict=True)
print("Test results:", results)   # contains gender accuracy and age MAE

# Predict the gender and age of a sample image
img = cv2.imread('sample_image.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)          # generators load images as RGB
img = cv2.resize(img, (img_width, img_height))
img = img.astype('float32') / 255.0                 # same preprocessing as the generators
img = np.expand_dims(img, axis=0)
pred_gender, pred_age = model.predict(img)
print("Predicted gender:", "male" if pred_gender[0][0] > 0.5 else "female")
print("Predicted age:", float(pred_age[0][0]))

Conclusion - In this way, gender and age detection was implemented.



Mini Project : 3

Title - Implement Colorizing Old B&W Images: color old black and white images into colorful images.
Project title: Colorizing Old B&W Images using CNN.

Theory - Colorizing black and white images into colorful images involves a complex process that requires expertise and specialized software. However, the general steps involved in the process are as follows:
Scan the black and white image: The first step is to scan the black and white image and convert it into a digital format.
Preprocess the image: The image needs to be preprocessed to remove any scratches, dust, or other defects. This can be done using image editing software like Photoshop or GIMP.
Convert the image to grayscale: The black and white image needs to be converted to grayscale. This can be done using image editing software or programming languages like Python.
Collect training data: The next step is to collect training data for the colorization model. This can include a dataset of colorful images with their corresponding grayscale versions (a sketch of preparing such a training pair is shown after these steps).
Train the colorization model: A deep learning model can be trained to colorize grayscale images using a dataset of colorful images. This model can be trained using software like TensorFlow, PyTorch, or Keras.
Apply the colorization model to the black and white image: Once the model is trained, it can be applied to the black and white image to generate a colorized version. This can be done using programming languages like Python.
Refine the colorized image: The colorized image may need some manual refinement to ensure that the colors are accurate and the image looks natural. This can be done using image editing software like Photoshop or GIMP.
Save the final image: Once the image has been colorized and refined, it can be saved in a digital format for printing or online use.
Note that the quality of the colorized image will depend on the quality of the original black and white image, the accuracy of the colorization model, and the manual refinement process.
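For the training-data step, one common convention (and the one assumed by the inference code below) is to work in the Lab colour space: the lightness (L) channel is the grayscale input and the two chrominance (a, b) channels are the colour targets. A rough sketch of preparing a single training pair, with the 256x256 size and scaling factors as assumptions:

import cv2
import numpy as np

# Split a colour training image into a grayscale input (L channel) and
# colour targets (a and b channels) using the Lab colour space.
bgr = cv2.imread('color_training_image.jpg')
bgr = cv2.resize(bgr, (256, 256)).astype(np.float32) / 255.0
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)   # float32: L in [0, 100], a/b in [-127, 127]

L = lab[:, :, 0] / 100.0      # network input, scaled to [0, 1]
ab = lab[:, :, 1:] / 128.0    # network target, scaled to roughly [-1, 1]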



Source Code -
import numpy as np
import cv2
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, UpSampling2D, InputLayer, BatchNormalization
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

# Load the grayscale image and resize it to the desired size
img_gray = load_img('bw_image.jpg', color_mode='grayscale')
img_gray = img_gray.resize((256, 256))
# Convert the image to a numpy array and normalize it
img_gray = img_to_array(img_gray)
img_gray = img_gray / 255.0
# Add a batch dimension to match the input shape of the model
img_gray = np.expand_dims(img_gray, axis=0)

# Define the colorization network: an encoder-decoder CNN that takes the
# lightness (L) channel and predicts the two chrominance (a, b) channels
model = Sequential()
model.add(InputLayer(input_shape=(None, None, 1)))
model.add(Conv2D(64, (3,3), activation='relu', padding='same', strides=2))
model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
model.add(Conv2D(128, (3,3), activation='relu', padding='same', strides=2))
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(Conv2D(256, (3,3), activation='relu', padding='same', strides=2))
model.add(Conv2D(512, (3,3), activation='relu', padding='same'))
model.add(Conv2D(512, (3,3), activation='relu', padding='same'))
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2,2)))
model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
model.add(UpSampling2D((2,2)))
model.add(Conv2D(64, (3,3), activation='relu', padding='same'))
model.add(Conv2D(32, (3,3), activation='relu', padding='same'))
model.add(Conv2D(2, (3,3), activation='tanh', padding='same'))
model.add(UpSampling2D((2,2)))
model.compile(optimizer='adam', loss='mse')

# Load the pre-trained weights and predict the a/b channels
model.load_weights('colorization_weights.h5')
img_colorized = model.predict(img_gray)

# Combine the original L channel with the predicted a/b channels in the Lab
# colour space and convert to RGB (assumes the weights predict a/b in [-1, 1])
lab = np.zeros((256, 256, 3), dtype=np.float32)
lab[:, :, 0] = img_gray[0, :, :, 0] * 100.0      # L channel in [0, 100]
lab[:, :, 1:] = img_colorized[0] * 128.0         # a, b channels
rgb = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)       # float RGB in [0, 1]
rgb = np.clip(rgb * 255.0, 0, 255).astype('uint8')

# Save the colorized image
img_colorized = array_to_img(rgb, scale=False)
img_colorized.save('colorized_image.jpg')

Conclusion - In this way, colorizing black and white images was implemented.



