
SOI Report

Report

Uploaded by

pragyapandey8984

Implementation of Convolutional Neural Networks

using Python and Verilog


Neural Image Compression and Explanation
A seminar report on internship and minor project submitted towards
partial fulfillment of the requirements for the degree of

BACHELOR OF TECHNOLOGY
IN THE
DEPARTMENT OF ELECTRONICS AND TELECOMMUNICATION ENGINEERING

Submitted by:
SAMPAD MOHANTY
(2002070059)

VEER SURENDRA SAI UNIVERSITY OF TECHNOLOGY


BURLA, ODISHA- 768018, INDIA
VEER SURENDRA SAI UNIVERSITY OF TECHNOLOGY,
BURLA

CERTIFICATE

This is to certify that this Seminar report on internship entitled,

"PYTHON-VERILOG BASED MACHINE LEARNING HARDWARE"
is the work of SAMPAD MOHANTY (Regd. No.: 2002070059, 7th
semester) submitted for partial fulfillment of the course work requirement for
Bachelor of Technology Program in the Department of Electronics and
Telecommunication Engineering, Veer Surendra Sai University of
Technology, Burla. He has completed this seminar on internship. The work
has been found quite satisfactory.

Under the guidance
Dr. Bandan Kumar Bhoi
Assistant Professor
Dept. of Electronics and Telecommunication Engineering
VSSUT, Burla

Under the guidance
Dr. Bikramaditya Das
Assistant Professor
Dept. of Electronics and Telecommunication Engineering
VSSUT, Burla

Head of the department


Prof. Harish Kumar Sahoo
Dept. of Electronics and Telecommunication Engineering
VSSUT, Burla
DECLARATION

I solemnly declare that the seminar report on internship entitled


"PYTHON-VERILOG BASED MACHINE LEARNING HARDWARE"
is based on my own work carried out during my study. I hereby declare that all
information in this document has been obtained and presented in accordance
with the academic rules and ethical conduct. I also declare that, as required by
these rules and conduct, I have fully cited and referenced all material and
results that are not original to this work.

_
SAMPAD MOHANTY
Regd no.: - 2002070059
Electronics and Telecommunication
Engineering
Section: A2
ACKNOWLEDGEMENT

I am immensely grateful to Dr. Bikramaditya Das, Asst. Professor


and Dr. Bandan Kumar Bhoi, Asst. Professor, Dept. of Electronics and
Telecommunication Engineering, for their kind support and guidance, without
which this seminar on internship could not have materialized. This
seminar and report are a product of the hard work and collective effort of all
the members of the group who have been constantly encouraging each other
throughout the seminar. We would also like to extend our deepest gratitude
to all those who have directly and indirectly guided us in accomplishing this
seminar work.

Lastly, I would like to thank the honorable Vice Chancellor and Head
of Department of Electronics and Telecommunication Engineering, Veer
Surendra Sai University of Technology, Burla for giving us this opportunity
to work together as a team on this project as a part of our curriculum.

SAMPAD MOHANTY
Regd no.: - 2002070059
Electronics and Telecommunication
Engineering
Section: A2
ABSTRACT
In this internship, in the Department of Electronics and Electrical
Communication Engineering at IIT Kharagpur, I learned about various topics
and techniques in machine learning and deep learning, which are essential for
the field of electronics. I started with the basics of model representation, cost
function, and gradient descent for linear regression. Then I moved on to logistic
regression and neural networks, where I learned how to classify data using
different activation functions, loss functions, and architectures. I also learned
how to implement these techniques using Python and TensorFlow frameworks.
In addition, I explored some advanced topics and techniques in deep
learning, such as convolutional neural networks, residual networks, transfer
learning, face recognition, and neural style transfer. I learned how these
techniques can improve the performance and efficiency of deep learning
models on complex tasks and domains. I also learned how to use pre-trained
models and frameworks, such as ResNet, MobileNetV2, FaceNet, and U-Net.
Moreover, I worked on a hardware project that involved performing a 3x3
convolution dot product of two arrays in both decimal and fixed-point binary
representation, optimizing the code to reduce the error, and implementing the
operation in Verilog.
This internship was a great opportunity for me to learn and grow in the
field of electronics and electrical communication engineering. I gained
valuable exposure to various topics and techniques in machine learning and
deep learning, which are essential for this field. I also acquired practical
experience and skills in hardware design and programming, which are
important for implementing these techniques in real-world applications. This
internship enhanced my knowledge and confidence in this field and prepared
me for future challenges and opportunities. I am grateful for this internship and
the guidance I received from my mentors.
Contents
Description of Internship
Assignment 1: Model Representation
Assignment 2: Cost Function
Assignment 3: Gradient Descent for Linear Regression
Assignment 4: Linear Regression
Assignment 5: Logistic Regression
Neural Networks
Assignment 6: Logistic Regression with a Neural Network mindset
Assignment 7: Planar Data classification with one hidden layer
Assignment 8: Build your Deep Network: Step by Step
Assignment 9: Deep Neural Network for Image Classification
Assignment 10: Initialization
Assignment 11: Regularization
Assignment 12: Gradient Checking
Assignment 13: Optimization Methods
Assignment 14: Convolutional Neural Network: Step by Step
Some Learning Stuffs 1:
Assignment 15: Residual Networks (ResNets)
Some Learning Stuffs 2:
Assignment 16: Transfer Learning with MobileNetV2
Assignment 17: Autonomous Driving – Car Detection
Some Learning Stuffs 3: U-Net
Assignment 18: Face recognition using FaceNet
Some Learning Stuffs 4: Neural Style Transfer
Description of Internship
I joined the Department of E&ECE as an intern on 10th May 2023. My internship duration was two
months. During this period, I learned various topics related to machine learning, deep learning,
convolutional neural networks, fixed-point and floating-point binary number representation and
their arithmetic. These topics are described ahead in this section.

Machine Learning is an AI technique that teaches computers to learn from experience. Machine
learning algorithms use computational methods to “learn” information directly from data without
relying on a predetermined equation as a model.
In this part I learned about supervised learning. The concepts covered were the linear regression
model, the cost function, gradient descent, classification, logistic regression, the overfitting
problem and regularization.
There were several learn-by-doing assignments in this part, as described later in this
section.

Assignment 1: Model Representation


Problem Statement: Build a model for housing price prediction. Use a simple data set with only
two data points - a house with 1000 square feet(sqft) sold for $300,000 and a house with 2000
square feet sold for $500,000. These two points will constitute our *data or training set*. In this
lab, the units of size are 1000 sqft and the units of price are 1000s of dollars. Fit a linear regression
model through these two points, so you can then predict price for other houses - say, a house
with 1200 sqft.
How I solved:

• I created the `x_train` and `y_train` variables. The data is stored in one-dimensional
NumPy arrays.
• I plotted the data using matplotlib’s scatter function.
• Then the model function $f_{w,b}$ was plotted using matplotlib according to the following
equation:
$$f_{w,b}(x^{(i)}) = w x^{(i)} + b$$
• I adjusted the values of $w$ and $b$ to fit the model by repeatedly trying different
values.
What I learned:

• Linear regression builds a model which establishes a relationship between features and
targets.
• In the assignment above, the feature was house size, and the target was house price.
• For simple linear regression, the model has two parameters $w$ and $b$ whose values are 'fit'
using training data.
• Once a model's parameters have been determined, the model can be used to make
predictions on novel data.
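
As a rough illustration of the model above, the sketch below solves for the line through the two training points directly, instead of adjusting the values by hand, and then predicts the price of a 1200 sqft house. The function name `predict` and the variable `price_1200_sqft` are illustrative names chosen here, not names from the assignment.

```python
import numpy as np

# Training data from the problem statement:
# size in units of 1000 sqft, price in units of $1000
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])


def predict(x, w, b):
    """Linear model f_{w,b}(x) = w * x + b (works on scalars or arrays)."""
    return w * x + b


# With exactly two points, the line through them can be solved directly
# (an assumption of this sketch; the report tuned w and b by trial).
w = (y_train[1] - y_train[0]) / (x_train[1] - x_train[0])
b = y_train[0] - w * x_train[0]

# 1200 sqft expressed in units of 1000 sqft
price_1200_sqft = predict(1.2, w, b)
print(f"w = {w}, b = {b}, predicted price = {price_1200_sqft:.0f} thousand dollars")
```

For this data set the line passes exactly through both points, so the same `w` and `b` are what any later fitting procedure should approach.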

Assignment 2: Cost Function


Problem Statement: (Same as Assignment 1): a model which can predict housing prices given
the size of the house. Two data points - a house with 1000 square feet sold for $300,000 and a
house with 2000 square feet sold for $500,000.
How I solved:

• I plotted the data points using matplotlib’s scatter.


• Cost is a measure of how well the model predicts the target price of the house. The term
'price' is used for housing data.
• The equation for cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$$
where
$$f_{w,b}(x^{(i)}) = w x^{(i)} + b$$
• $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w, b$.
• $\left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$ is the squared difference between the target value and the
prediction.
• These differences are summed over all the $m$ examples and divided by `2m` to produce
the cost, $J(w,b)$.
• I also analyzed the variation of cost with w and b using a contour plot.
• Further analysis was done using a 3-D plot of the cost, which gave a convex, bowl-shaped
surface.
What I learned:

• How to implement and explore the `cost` function for linear regression with one variable.
• The cost equation provides a measure of how well your predictions match your training
data.
• Minimizing the cost can provide optimal values of 𝑤, 𝑏.
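
The cost equation can be sketched in a few lines of NumPy. This is a minimal version written for this report's two-point data set; only the name `compute_cost` comes from the assignment, and the body is an assumption about how such a function could look.

```python
import numpy as np


def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((w * x + b - y)^2)."""
    m = x.shape[0]
    errors = (w * x + b) - y
    return np.sum(errors ** 2) / (2 * m)


# Size in 1000 sqft, price in $1000s
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

# The line through both points has zero cost
cost_exact = compute_cost(x_train, y_train, w=200.0, b=100.0)
# A slope that is too small gives errors of -50 and -100
cost_off = compute_cost(x_train, y_train, w=150.0, b=100.0)
print(cost_exact, cost_off)
```

The second call gives (50² + 100²) / (2 · 2) = 3125, which shows how quickly the cost grows as the line moves away from the data.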

Assignment 3: Gradient Descent for Linear Regression


Problem Statement: (Same as Assignment 2): a model which can predict housing prices given
the size of the house. Two data points - a house with 1000 square feet sold for $300,000 and a
house with 2000 square feet sold for $500,000.
How I Solved:

• The data points were plotted.


• The cost was computed by defining a function compute_cost.
• Gradient descent was described as:
$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}$$
$$b = b - \alpha \frac{\partial J(w,b)}{\partial b}$$
• where the parameters $w$, $b$ are updated simultaneously.
• The gradient is defined as:
$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$$
• I implemented gradient descent using the compute_gradient and gradient_descent
functions.
• I analyzed the cost vs. w plot, with the gradient shown using a quiver plot.
• After gradient descent finished running, I obtained new values of w and b.
• Using the values of w and b at minimum cost, I predicted the house prices.
• To analyze the learning rate α further, I plotted a contour plot of the cost vs. b and w, with a
path indicating the progress of gradient descent.
• I also analyzed additional plots obtained by increasing and decreasing the learning rate.
What I Learned:

• Delved into the details of gradient descent for a single variable.


• Developed a routine to compute the gradient.
• Visualized what the gradient is.
• Completed a gradient descent routine.
• Utilized gradient descent to find parameters.
• Examined the impact of the size of the learning rate.
• Automated the process of optimizing 𝑤 and 𝑏 using gradient descent.
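
The update rule and gradient formulas above can be sketched as follows. The function names `compute_gradient` and `gradient_descent` are the ones mentioned in this assignment, but their bodies, the learning rate of 0.01 and the 10000 iterations are assumptions made for this sketch.

```python
import numpy as np


def compute_gradient(x, y, w, b):
    """Return (dJ/dw, dJ/db) for the squared-error cost of f(x) = w * x + b."""
    m = x.shape[0]
    errors = (w * x + b) - y
    dj_dw = np.sum(errors * x) / m
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db


def gradient_descent(x, y, w, b, alpha, num_iters):
    """Run batch gradient descent, updating w and b simultaneously."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        # Both gradients are computed before either parameter changes
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Size in 1000 sqft, price in $1000s
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

# alpha and num_iters are assumed values for this sketch
w_final, b_final = gradient_descent(
    x_train, y_train, w=0.0, b=0.0, alpha=0.01, num_iters=10000
)
print(f"w = {w_final:.2f}, b = {b_final:.2f}")
print(f"1200 sqft prediction: {w_final * 1.2 + b_final:.1f} thousand dollars")
```

With these settings the parameters approach the exact fit (w = 200, b = 100); a larger learning rate converges in fewer steps until it becomes large enough to make the updates diverge.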

Assignment 4: Linear Regression


Problem Statement: Suppose you are the CEO of a restaurant franchise and are considering
different cities for opening a new outlet. You would like to expand your business to cities that
may give your restaurant higher profits. The chain already has restaurants in various cities, and
you have data for profits and populations from the cities. You also have data on cities that are
candidates for a new restaurant. For these cities, you have the city population. Use the data to
help you identify which cities may potentially give your business higher profits.
How I solved:

• I plotted the data to visualize the complete training dataset.


• Defined a function to compute the cost.
• w and b were initially set to zero.
• The gradient was computed by defining a function compute_gradient.
• w and b were then set to 0.2 each and the gradient was computed again.
• Then gradient descent was carried out with w and b initialized to 0.
• The values of w and b at the minimum cost were computed using gradient descent.
• Using those values, the linear fit was plotted.
• Predictions were made with the fitted model and matched the expected values.
What I learned:

• Redeveloped the routines for linear regression, now with multiple variables.
• Utilized numpy `np.dot` to vectorize the implementations.
• Explored the impact of the learning rate 𝛼 on convergence.
• Discovered the value of feature scaling using z-score normalization in speeding
convergence.
• Learned how linear regression can model complex, even highly non-linear functions using
feature engineering.
• Recognized that it is important to apply feature scaling when doing feature engineering.
• Utilized an open-source machine learning toolkit, scikit-learn.
• Implemented linear regression using gradient descent and feature normalization from
that toolkit.
• Implemented linear regression using a closed-form solution from that toolkit.
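
Two of the points above, z-score normalization and vectorized prediction with `np.dot`, can be sketched together. The feature matrix, the weights and the helper name `zscore_normalize` are hypothetical values chosen for illustration, not data from the assignment.

```python
import numpy as np


def zscore_normalize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma


# Hypothetical feature matrix: 4 examples, 2 features on very different scales
X = np.array([
    [2104.0, 5.0],
    [1416.0, 3.0],
    [852.0, 2.0],
    [1534.0, 3.0],
])
X_norm, mu, sigma = zscore_normalize(X)

# Hypothetical parameters; np.dot computes all predictions in one call
w = np.array([0.5, -1.0])
b = 2.0
predictions = np.dot(X_norm, w) + b
print(X_norm.mean(axis=0), X_norm.std(axis=0))
print(predictions)
```

After normalization both features have comparable ranges, so a single learning rate suits both parameters, which is why convergence speeds up.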

Assignment 5: Logistic Regression


Problem Statement: Suppose that you are the administrator of a university department, and you
want to determine each applicant’s chance of admission based on their results on two exams.

• You have historical data from previous applicants that you can use as a training set for
logistic regression.
• For each training example, you have the applicant’s scores on two exams and the
admissions decision.
• Your task is to build a classification model that estimates an applicant’s probability of
admission based on the scores from those two exams.
How I solved:

• Loaded the data from the dataset provided.


• Viewed the training set variables.
• Visualized the data using scatter plot.
• Defined the sigmoid function:
$$g(z) = \frac{1}{1 + e^{-z}}$$
$$f_{w,b}(x) = g(w \cdot x + b)$$

• Defined the cost function for logistic regression:

$$J(w,b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ loss(f_{w,b}(x^{(i)}), y^{(i)}) \right]$$

$$loss(f_{w,b}(x^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{w,b}(x^{(i)})\right) - (1 - y^{(i)}) \log\left(1 - f_{w,b}(x^{(i)})\right)$$

• The gradient descent algorithm was implemented for logistic regression.
• The gradient was computed using a function compute_gradient.
• The parameters were initialized to zero and gradient descent was carried out.
• The optimal values of w and b at the minimum were found, and the decision boundary was plotted.
• The training accuracy was found to be 92%.
• The above procedure was then repeated with regularization.
• The regularized cost function is:
$$J(w,b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{w,b}(x^{(i)})\right) - (1 - y^{(i)}) \log\left(1 - f_{w,b}(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$
• Similarly, the regularized gradient was computed.
• The outputs for some test data were then predicted, and the accuracy of the regularized
model was found to be 80%.
What I learned:

• Implemented logistic regression and applied it to two different datasets.


• Explored categorical data sets and plotting.
• Determined that linear regression was insufficient for a classification problem.
• Explored the use of the sigmoid function in logistic regression.
• Explored the decision boundary in the context of logistic regression.
• Determined that a squared error loss function is not suitable for classification tasks.
• Developed and examined the logistic loss function, which is suitable for classification
tasks.
• Examined and utilized the cost function for logistic regression.
• Examined the formulas and implementation of calculating the gradient for logistic
regression.
• Utilized those routines in
o Exploring a single variable data set
o Exploring a two-variable data set
• Developed some intuition about the causes and solutions to overfitting.
• Analyzed examples of cost and gradient routines with regularization added for both linear and
logistic regression.
• Developed some intuition on how regularization can reduce over-fitting.
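
The sigmoid function and the regularized logistic cost can be sketched as below. The function name `compute_cost_logistic_reg`, the parameter `lambda_` and the four exam-score rows are illustrative assumptions, not the assignment's code or data.

```python
import numpy as np


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def compute_cost_logistic_reg(X, y, w, b, lambda_=0.0):
    """Mean logistic loss plus the L2 penalty (lambda / 2m) * sum(w_j^2)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)
    reg = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return np.sum(loss) / m + reg


# Made-up exam scores (two exams per applicant) and admission decisions
X = np.array([[34.6, 78.0], [60.2, 86.3], [30.3, 43.9], [79.0, 75.3]])
y = np.array([0.0, 1.0, 0.0, 1.0])

# With all parameters at zero every prediction is 0.5, so the cost is ln(2)
cost_zero = compute_cost_logistic_reg(X, y, np.zeros(2), 0.0)

w_small = np.array([0.01, 0.01])
cost_plain = compute_cost_logistic_reg(X, y, w_small, -1.0, lambda_=0.0)
cost_reg = compute_cost_logistic_reg(X, y, w_small, -1.0, lambda_=1.0)
print(cost_zero, cost_plain, cost_reg)
```

The bias `b` is left out of the penalty, matching the regularized cost equation above, where only the weights $w_j$ are summed.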

Deep learning is a branch of machine learning that uses neural networks to learn from data.
Neural networks are composed of layers of artificial neurons that perform nonlinear
transformations on the input and pass it to the next layer. The output layer produces the final
prediction or classification. Neural networks can learn complex features and patterns from large
amounts of data, and can be applied to various domains such as computer vision, natural
language processing, speech recognition, etc.

Neural Networks
Neural networks are models that consist of multiple layers of artificial neurons that are
connected by weights. Each neuron receives an input vector and computes a weighted sum of
its elements, adds a bias term, and applies an activation function to produce an output scalar.
The output of one layer becomes the input of the next layer, until the final output layer
produces the prediction or classification. The structure and parameters of a neural network are
determined by its architecture and hyperparameters.

Assignment 6: Logistic Regression with a Neural Network mindset
Problem Statement: You are given a dataset ("data.h5") containing:

• a training set of m_train images labeled as cat (y=1) or non-cat (y=0).


• a test set of m_test images labeled as cat or non-cat.
• each image is of shape (num_px, num_px, 3) where 3 is for the 3 channels (RGB). Thus,
each image is square (height = num_px) and (width = num_px).
Build a simple image-recognition algorithm that can correctly classify pictures as cat or non-cat.
How I solved:

• Loaded the data from the given dataset `data.h5` and analyzed the number of
datapoints.
• Split the data into train and test sets.
• Standardized the training data.
• Studied the general architecture of the learning algorithm in neural
networks.
• Defined the helper function for the model, which is the sigmoid function in this
case.
• Initialized the parameters to zero.
• Using the propagate function, I computed the cost and its gradient.
• Defined a function named optimize to find the optimal parameters at minimum cost.
• The loss was optimized iteratively to learn the parameters:
o Computed the cost and its gradient.
o Updated the parameters using gradient descent.
• Used the learned parameters to predict the labels for a given set of examples.
• Merged all the functions into a single function, which is the final model.
What I Learned:

• Built the general architecture of a learning algorithm, including:


o Initializing parameters
o Calculating the cost function and its gradient
o Using an optimization algorithm (gradient descent)
• Gathered all three functions above into a main model function, in the right order.
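
The three building blocks can be sketched as follows. The function names `propagate` and `optimize` are the ones mentioned above, but the bodies, the `predict` helper and the tiny two-example data set are assumptions made for illustration. Each column of `X` is one example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def propagate(w, b, X, Y):
    """Forward pass (cost) and backward pass (gradients) for logistic regression.

    Shapes: w is (n_features, 1), X is (n_features, m), Y is (1, m).
    """
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = (X @ (A - Y).T) / m
    db = np.sum(A - Y) / m
    return {"dw": dw, "db": db}, float(cost)


def optimize(w, b, X, Y, num_iterations, learning_rate):
    """Plain gradient descent; returns the learned parameters and cost history."""
    costs = []
    for _ in range(num_iterations):
        grads, cost = propagate(w, b, X, Y)
        w = w - learning_rate * grads["dw"]
        b = b - learning_rate * grads["db"]
        costs.append(cost)
    return w, b, costs


def predict(w, b, X):
    """Label an example 1 when the predicted probability exceeds 0.5."""
    return (sigmoid(w.T @ X + b) > 0.5).astype(int)


# Tiny made-up data set: 2 features, 2 examples
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[1, 0]])
w0 = np.zeros((2, 1))  # zero initialization, as in the assignment

grads, cost = propagate(w0, 0.0, X, Y)
w, b, costs = optimize(w0, 0.0, X, Y, num_iterations=100, learning_rate=0.01)
print(cost, costs[-1], predict(w, b, X))
```

Zero initialization is acceptable here because logistic regression has no hidden units whose symmetry needs to be broken.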

Assignment 7: Planar Data classification with one hidden layer
This was purely a learning assignment, so it had no problem statement. I was
instructed to follow a set of instructions, complete small exercises and analyse the
complete working of a simple neural network with a single hidden layer.
What I did:

• Loaded the data from the dataset.


• Visualized the data by plotting the datapoints using matplotlib.
• Determined the shape of the training set.
• Used scikit-learn's linear model LogisticRegression as a baseline, but it was
only 47% accurate, which is not acceptable.
• Did a mathematical analysis of a single hidden layer neural network.
• Mathematically, the network can be expressed as follows.

For one example $x^{(i)}$:

$$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$$

$$a^{[1](i)} = \tanh(z^{[1](i)})$$

$$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$$

$$\hat{y}^{(i)} = a^{[2](i)} = \sigma(z^{[2](i)})$$

Given the predictions on all the examples, the cost $J$ can be computed as follows:
$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log(a^{[2](i)}) + (1 - y^{(i)}) \log(1 - a^{[2](i)}) \right)$$

• Defined the three layer sizes: input, hidden and output.
• Initialized the parameters; random initialization was used in this case.
• Implemented forward propagation using the following equations:

$$Z^{[1]} = W^{[1]} X + b^{[1]}$$

$$A^{[1]} = \tanh(Z^{[1]})$$

$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$$

$$\hat{Y} = A^{[2]} = \sigma(Z^{[2]})$$

• Used the `np.tanh` function.
• Stored the values of Z1, Z2, A1, A2 from forward propagation in a dictionary for
later use in backward propagation.
• Computed the cost:
$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log(a^{[2](i)}) + (1 - y^{(i)}) \log(1 - a^{[2](i)}) \right)$$

• Backward propagation was implemented to compute the gradients dW1, dW2, db1, db2.
• Parameters were updated using the update_parameters function, which returns a
dictionary with W1, b1, W2, b2.
• The whole model was integrated into a single function nn_model(), which returns the
updated parameters.
• The model was tested, and the prediction function was defined.
• Accuracy was computed and found to be 90%.
What I learned:

• Built a complete 2-class classification neural network with a hidden layer.


• Made good use of a non-linear unit.
• Computed cross-entropy loss.
• Implemented forward and backward propagation.
• Observed the impact of varying the hidden layer size, including overfitting.
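
The forward pass, cost and backward pass for a network with one tanh hidden layer can be sketched as below. This is a minimal version under stated assumptions: the function names, the 0.01 weight scale, the hidden size of 4 and the random toy data are all chosen for illustration and are not taken from the assignment.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def initialize_parameters(n_x, n_h, n_y, seed=0):
    """Small random weights (to break symmetry) and zero biases."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * 0.01,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * 0.01,
        "b2": np.zeros((n_y, 1)),
    }


def forward_propagation(X, params):
    """Z1 = W1 X + b1, A1 = tanh(Z1), Z2 = W2 A1 + b2, A2 = sigmoid(Z2)."""
    Z1 = params["W1"] @ X + params["b1"]
    A1 = np.tanh(Z1)
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = sigmoid(Z2)
    return A2, {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}


def compute_cost(A2, Y):
    """Cross-entropy cost averaged over the m examples."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)


def backward_propagation(params, cache, X, Y):
    """Gradients of the cost with respect to W1, b1, W2, b2."""
    m = X.shape[1]
    dZ2 = cache["A2"] - Y
    dW2 = dZ2 @ cache["A1"].T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    # Derivative of tanh is 1 - tanh^2
    dZ1 = (params["W2"].T @ dZ2) * (1 - cache["A1"] ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}


# Random toy data: 2 features, 5 examples, labels from the sign of x1 * x2
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 5))
Y = (X[0:1, :] * X[1:2, :] > 0).astype(float)

params = initialize_parameters(n_x=2, n_h=4, n_y=1)
A2, cache = forward_propagation(X, params)
cost = compute_cost(A2, Y)
grads = backward_propagation(params, cache, X, Y)
print(A2.shape, cost)
```

Because the initial weights are very small, every prediction starts near 0.5 and the initial cost is close to ln 2; training then moves it down from there.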

Assignment 8: Build your Deep Network: Step by Step


This was purely a learning assignment, so it had no problem statement. I was
instructed to follow a set of instructions, complete small exercises and analyse the
complete working of a deep neural network with L layers.
What I did:

• To build the neural network, I implemented several “helper functions”.


• Here is an outline of the steps I followed in this assignment:
• Initialized the parameters for a two-layer network and for an $L$-layer neural network.
• Implemented the forward propagation module (shown in purple in the figure below):
o Completed the LINEAR part of a layer's forward propagation step (resulting in
$Z^{[l]}$).
o Used the provided ACTIVATION functions (relu/sigmoid).
o Combined the previous two steps into a new [LINEAR->ACTIVATION] forward
function.
o Stacked the [LINEAR->RELU] forward function L-1 times (for layers 1 through L-1) and
added a [LINEAR->SIGMOID] at the end (for the final layer $L$). This gives a new
L_model_forward function.
• Computed the loss.
• Implemented the backward propagation module (denoted in red in the figure below):
o Completed the LINEAR part of a layer's backward propagation step.
o Used the provided gradients of the ACTIVATION functions
(relu_backward/sigmoid_backward).
o Combined the previous two steps into a new [LINEAR->ACTIVATION] backward
function.
o Stacked [LINEAR->RELU] backward L-1 times and added [LINEAR->SIGMOID] backward
in a new L_model_backward function.
• Finally, updated the parameters.

What I learned:

• Implemented all the functions required for building a deep neural network.
• Used non-linear units to improve the model.
• Built a deeper neural network (with more than 1 hidden layer).
• Implemented an easy-to-use neural network class.
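
The forward half of the outline above can be sketched as follows. `L_model_forward` is the function name used in this assignment; the helper names `initialize_parameters_deep` and `linear_activation_forward`, the layer sizes and the 0.01 weight scale are assumptions made for this sketch.

```python
import numpy as np


def relu(Z):
    return np.maximum(0, Z)


def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))


def initialize_parameters_deep(layer_dims, seed=0):
    """Create W1..WL (small random) and b1..bL (zeros) for the given sizes."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params


def linear_activation_forward(A_prev, W, b, activation):
    """One [LINEAR->ACTIVATION] step; the cache is kept for backpropagation."""
    Z = W @ A_prev + b
    A = relu(Z) if activation == "relu" else sigmoid(Z)
    return A, (A_prev, W, b, Z)


def L_model_forward(X, params):
    """[LINEAR->RELU] for layers 1..L-1, then [LINEAR->SIGMOID] for layer L."""
    caches = []
    A = X
    L = len(params) // 2  # each layer contributes one W and one b
    for l in range(1, L):
        A, cache = linear_activation_forward(A, params[f"W{l}"], params[f"b{l}"], "relu")
        caches.append(cache)
    AL, cache = linear_activation_forward(A, params[f"W{L}"], params[f"b{L}"], "sigmoid")
    caches.append(cache)
    return AL, caches


# Hypothetical network: 5 inputs, hidden layers of 4 and 3 units, 1 output
layer_dims = [5, 4, 3, 1]
params = initialize_parameters_deep(layer_dims)
X = np.random.default_rng(1).standard_normal((5, 6))  # 6 random examples
AL, caches = L_model_forward(X, params)
print(AL.shape, len(caches))
```

Storing one cache per layer is what lets the backward module walk the same layers in reverse order without recomputing the forward pass.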

Assignment 9: Deep Neural Network for Image Classification


This was purely a learning assignment, so it had no problem statement. I was
instructed to follow a set of instructions, complete small exercises and analyse the
complete working of a deep neural network applied to image classification.
What I did:

• Loaded the dataset and split the data into train and test data.
• Reshaped and standardized the images before feeding them to the network.
• Then I studied the model architecture for the 2-layer neural network and the L-layer neural
network.
• The analysis was done for the 2-layer neural network by the following steps:
o The input is a (64,64,3) image, which is flattened to a vector of size (12288,1).
o The corresponding vector $[x_0, x_1, \ldots, x_{12287}]^T$ is then multiplied by the weight
matrix $W^{[1]}$ of size $(n^{[1]}, 12288)$.
o Then, a bias term is added and the relu is taken to get the following vector:
$[a_0^{[1]}, a_1^{[1]}, \ldots, a_{n^{[1]}-1}^{[1]}]^T$.
o The same process is then repeated.
o The resulting vector is multiplied by $W^{[2]}$ and the intercept (bias) is added.
o Finally, the sigmoid of the result is taken. If it is greater than 0.5, the image is classified as a cat.
• Detailed steps followed for the L-layer neural network:
o The input is a (64,64,3) image, which is flattened to a vector of size
(12288,1).
o The corresponding vector $[x_0, x_1, \ldots, x_{12287}]^T$ is then multiplied by the weight
matrix $W^{[1]}$ and the intercept $b^{[1]}$ is added. The result is called the linear unit.
o Next, the relu of the linear unit is taken. This process is repeated several times for
each $(W^{[l]}, b^{[l]})$ depending on the model architecture.
o Finally, the sigmoid of the final linear unit is taken. If it is greater than 0.5, the image is
classified as a cat.
• The accuracy for the 2-layer neural network was:
o Accuracy of prediction on the training data: 0.99.
o Accuracy of prediction on the test data: 0.72.
• The accuracy for the L-layer (4-layer in this case) neural network was:
o Accuracy of prediction on the training data: 0.99.
o Accuracy of prediction on the test data: 0.8.
• The 4-layer neural network performed better (80%) than the 2-layer
neural network (72%) on the same test set.
What I learned:

• Learned to build and train a deep L-layer neural network and applied it to supervised
learning.
• Explored how the increase in layers increased the accuracy of the model.
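
The reshape and standardize step described in this assignment can be sketched as below. The random array stands in for the real image data set, and dividing by 255 is an assumption about how the pixel values were standardized.

```python
import numpy as np

# Stand-in for the real data: 10 random RGB images of size 64 x 64
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 64, 64, 3), dtype=np.uint8)

# Flatten each image into one column: (m, 64, 64, 3) -> (12288, m)
X_flat = images.reshape(images.shape[0], -1).T

# Scale pixel values to the range [0, 1] (assumed standardization)
X = X_flat / 255.0
print(X.shape)
```

Each column of `X` is one flattened image, which matches the (12288,1) input vector described in the steps above.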

Assignment 10: Initialization


Training a neural network requires specifying initial values for the weights. A well-chosen
initialization method can:

• Speed up the convergence of gradient descent.
• Increase the odds of gradient descent converging to a lower training (and generalization)
error.
In this assignment I studied about:

• Zeros initialization -- setting `initialization = "zeros"` in the input argument.


• Random initialization -- setting `initialization = "random"` in the input argument. This
initializes the weights to large random values.
• He initialization -- setting `initialization = "he"` in the input argument. This initializes the
weights to random values scaled according to a paper by He et al., 2015.
What I did:

• Imported all the required libraries for the assignment as per the instruction.
• Loaded the dataset and split it into train and test dataset.
• I used a 3-layer neural network.
• The hidden layers have the ReLU activation function.
• Zero Initialization of Parameters:
o W1, W2, b1, b2 = 0
o Accuracy on train set: 0.5
o Accuracy on test set: 0.5
• Random Initialization:
o b1 and b2 = 0
o W1 and W2 arrays are initialized randomly with large random values.
o Accuracy on train set: 0.83
o Accuracy on test set: 0.86
• `He` Initialization:
o Instead of multiplying `np.random.randn(..,..)` by 10 for the weights $W$, it is
multiplied by $\sqrt{2 / \text{dimension of the previous layer}}$, which is what He initialization
recommends for layers with a ReLU activation.
o Accuracy on train set: 0.99
o Accuracy on test set: 0.96
What I learned:

• Zero initialization gives undesirably low accuracy.
• The weights $W^{[l]}$ should be initialized randomly to break symmetry and make sure
different hidden units can learn different things.
• However, it is fine to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as
$W^{[l]}$ is initialized randomly.
• Initializing weights to very large random values does not work well.
• Initializing with small random values does better; the important question is how
small these random values should be.
• Different initializations lead to very different results.
• `He` initialization works well for networks with ReLU activations.
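
He initialization as described in this assignment can be sketched as below. The function name `initialize_parameters_he` and the layer sizes are assumptions made for illustration.

```python
import numpy as np


def initialize_parameters_he(layer_dims, seed=0):
    """Weights ~ N(0, 1) * sqrt(2 / n_prev); biases are set to zero."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev = layer_dims[l - 1]
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], n_prev)) * np.sqrt(2.0 / n_prev)
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params


# Hypothetical layer sizes, chosen large so the sample statistics are stable
params = initialize_parameters_he([1000, 500, 1])
print(params["W1"].std(), np.sqrt(2.0 / 1000))
```

The scale shrinks as the previous layer gets wider, which keeps the variance of each unit's pre-activation roughly constant from layer to layer.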

Assignment 11: Regularization


Deep learning models have so much flexibility and capacity that overfitting can be a serious
problem if the training dataset is not big enough. The network does well on the training set, but
it does not generalize to new examples that it has never seen.
Problem Statement:
You have just been hired as an AI expert by the French Football Corporation. They would like you
to recommend positions where France's goalkeeper should kick the ball so that the French team's
players can then hit it with their head. A 2D dataset from France's past 10 games is given.

What I did:
• Loaded the dataset and split it into train and test dataset
• First, I analysed by building a non-regularized model:
o Accuracy on train set: 0.94
o Accuracy on test set: 0.915
o The scatter plot and decision boundary show that the neural network model
suffers from overfitting problem.
• To reduce overfitting, I analysed two techniques – L2 Regularization and Dropout
• L2-regularization:
o A hyperparameter λ is used.
o Computed cost with regularization.
o Backward propagation with regularization is also computed.
o I computed the parameters.
o Accuracy on train set: 0.93
o Accuracy on test set: 0.93
• Dropout:
o It randomly shuts down some neurons in each iteration.
o Computed forward propagation.
o Performed backward propagation.
o I got the parameters and computed the predictions.
o Accuracy on train dataset: 0.929
o Accuracy on test dataset: 0.95
• So, dropout (test accuracy 0.95) worked better than L2-regularization (test accuracy 0.93) on this dataset.
What I learned:

• Regularization reduces overfitting in a neural network.


• The implications of L2-regularization on:
o The cost computation:
▪ A regularization term is added to the cost.
o The backpropagation function:
▪ There are extra terms in the gradients with respect to weight matrices.
o Weights end up smaller ("weight decay"):
▪ Weights are pushed to smaller values.
• Dropout is a regularization technique.
• Dropout should be used only during training. Don't use dropout (randomly eliminate
nodes) during test time.
• Applied dropout both during forward and backward propagation.
• During training time, divide the activations of each dropout layer by keep_prob to keep
the same expected value for the activations. For example, if keep_prob is 0.5, then we
will on average shut down half the nodes, so the output will be scaled by 0.5 since only
the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to
multiplying by 2. Hence, the output now has the same expected value. The same
reasoning holds when keep_prob takes values other than 0.5.
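The scaling argument above can be checked numerically. The following is a minimal sketch of inverted dropout in NumPy; the function names and the all-ones test activations are assumptions for illustration, not the assignment's code.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout: zero units with probability 1 - keep_prob, then rescale."""
    D = rng.random(A.shape) < keep_prob   # boolean mask of the units that are kept
    A_dropped = A * D / keep_prob         # divide so the expected value is unchanged
    return A_dropped, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and the same scaling to the gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(0)
A = np.ones((1000, 100))
A_dropped, D = dropout_forward(A, keep_prob=0.8, rng=rng)
print(A.mean(), A_dropped.mean())  # the two means should be close
```

At test time neither function is called, so the activations pass through unchanged.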

Assignment 12: Gradient Checking


Implement gradient checking to verify the accuracy of a backpropagation implementation.
Problem Statement:
You are part of a team working to make mobile payments available globally, and are asked to
build a deep learning model to detect fraud--whenever someone makes a payment, you want to
see if the payment might be fraudulent, such as if the user's account has been taken over by a
hacker.
You already know that backpropagation is quite challenging to implement, and sometimes has
bugs. Because this is a mission-critical application, your company's CEO wants to be really
certain that your implementation of backpropagation is correct. Your CEO says, "Give me proof
that your backpropagation is actually working!" To give this reassurance, you are going to use
"gradient checking."
How I solved:

• A derivative is mathematically defined as:

$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}$$
• Computed 𝐽 in forward propagation.
• In backpropagation compute the derivative of 𝐽(θ) = θ𝑥 with respect to θ.
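• Gradient check carried out:

$$\theta^{+} = \theta + \varepsilon, \qquad \theta^{-} = \theta - \varepsilon$$
$$J^{+} = J(\theta^{+}), \qquad J^{-} = J(\theta^{-})$$
$$\mathit{gradapprox} = \frac{J^{+} - J^{-}}{2\varepsilon}$$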

• Performed N-dimensional gradient checking.


• Forward propagation carried out followed by backward propagation.
• Then the gradient checking was performed.
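The procedure above can be sketched for an N-dimensional parameter vector as follows. This is an illustrative sketch on a toy cost, not the assignment's code; the function name `gradient_check`, the relative-difference measure, and the value of `epsilon` are assumptions.

```python
import numpy as np

def gradient_check(J, grad, theta, epsilon=1e-7):
    """Compare an analytic gradient with a centered finite-difference estimate."""
    theta = np.asarray(theta, dtype=float)
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = epsilon
        grad_approx.flat[i] = (J(theta + step) - J(theta - step)) / (2 * epsilon)
    analytic = grad(theta)
    # Relative difference: a small value suggests the analytic gradient is right.
    numerator = np.linalg.norm(analytic - grad_approx)
    denominator = np.linalg.norm(analytic) + np.linalg.norm(grad_approx)
    return numerator / denominator

# Toy cost J(theta) = sum(theta^2), whose true gradient is 2 * theta.
J = lambda th: float(np.sum(th ** 2))
theta0 = np.array([1.0, -2.0, 3.0])
good = gradient_check(J, lambda th: 2 * th, theta0)   # correct "backprop"
bad = gradient_check(J, lambda th: 3 * th, theta0)    # deliberately wrong gradient
print(good, bad)
```

The correct gradient gives a tiny relative difference, while the deliberately wrong one gives a large value, which is the signal that backpropagation contains a bug.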
What I learned:

• Gradient checking verifies closeness between the gradients from backpropagation and
the numerical approximation of the gradient (computed using forward propagation).
• Gradient checking is slow, so you don't want to run it in every iteration of training. You
would usually run it only to make sure your code is correct, then turn it off and use
backprop for the actual learning process.

Assignment 13: Optimization Methods


There are some advanced optimization methods that can speed up learning and perhaps even
get you to a better final value for the cost function. Having a good optimization algorithm can be
the difference between waiting days vs. just a few hours to get a good result.
What I did:

• Loaded a dataset and split it into train and test set.


• I first compared two ways of updating parameters with gradient descent: batch gradient
descent and stochastic gradient descent (SGD).
• SGD oscillates more before reaching the minimum, whereas batch gradient descent
oscillates less.
• Mini-batch gradient descent: in practice, each step often uses an intermediate number
of training examples, between one (SGD) and the whole training set (batch).
• I observed that mini-batch gradient descent often leads to faster optimization.
• Mini-batches are usually drawn at random from the shuffled training set.
• Another method of optimization is the momentum method.
• The momentum method was applied using the following update equations:
$$v_{dW} = \beta v_{dW} + (1 - \beta)\, dW, \qquad v_{db} = \beta v_{db} + (1 - \beta)\, db$$
$$W := W - \alpha v_{dW}, \qquad b := b - \alpha v_{db}$$
o Initialized the velocity.
o Parameters are updated with momentum.
• Adam optimization:
o Adam stands for adaptive moment estimation.
o It combines the ideas of RMSProp and momentum.
o The update equations for this optimizer are:
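$$v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW, \qquad v_{db} = \beta_1 v_{db} + (1 - \beta_1)\, db$$
$$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^2, \qquad s_{db} = \beta_2 s_{db} + (1 - \beta_2)\, db^2$$
$$v_{dW}^{corrected} = \frac{v_{dW}}{1 - \beta_1^t}, \qquad v_{db}^{corrected} = \frac{v_{db}}{1 - \beta_1^t}$$
$$s_{dW}^{corrected} = \frac{s_{dW}}{1 - \beta_2^t}, \qquad s_{db}^{corrected} = \frac{s_{db}}{1 - \beta_2^t}$$
$$W := W - \alpha \frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \epsilon}, \qquad b := b - \alpha \frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \epsilon}$$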

• I combined mini batch with gradient descent, momentum and Adam optimization
techniques for better outcomes.
• Accuracy of mini batch gradient descent: 0.716
• Accuracy of mini batch gradient descent with momentum: 0.716
• Accuracy of mini batch gradient descent with Adam: 0.943
• Learning Rate Decay:
$$\alpha = \frac{1}{1 + \text{decayRate} \times \text{epochNumber}}\, \alpha_0$$
o Scheduled learning rate decay: the learning rate only changes when the epoch
number is a multiple of the time interval.
$$\alpha = \frac{1}{1 + \text{decayRate} \times \left\lfloor \frac{\text{epochNum}}{\text{timeInterval}} \right\rfloor}\, \alpha_0$$
• I implemented the learning rate decay with different optimization techniques.
• Gradient Descent with Learning Rate Decay:
o Accuracy: 0.94
• Gradient descent with momentum and learning rate decay:
o Accuracy: 0.95
• Adam with learning rate decay:
o Accuracy: 0.94
• I achieved nearly similar performance with different methods.
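The Adam update and the scheduled learning rate decay described above can be sketched for a single parameter array. This is a hedged illustration on a toy quadratic cost, not the assignment's code; the function names and the hyperparameter values are assumptions.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dw            # momentum-style first moment
    s = beta2 * s + (1 - beta2) * dw ** 2       # RMSProp-style second moment
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

def scheduled_lr(alpha0, epoch, decay_rate, time_interval):
    """Learning rate that only changes every `time_interval` epochs."""
    return alpha0 / (1 + decay_rate * (epoch // time_interval))

# Minimise the toy cost J(w) = w^2 (gradient 2w) starting from w = 5.
w, v, s = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, v, s = adam_step(w, 2 * w, v, s, t, alpha=0.01)
print(w, scheduled_lr(0.1, epoch=2500, decay_rate=1.0, time_interval=1000))
```

In a full training loop the value returned by `scheduled_lr` would be passed as `alpha` at the start of each epoch.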
What I learned:

• Shuffling and Partitioning are the two steps required to build mini batches.
• Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
• Momentum takes past gradients into account to smooth out the steps of gradient
descent. It can be applied with batch gradient descent, mini-batch gradient descent or
stochastic gradient descent.
• We have to tune a momentum hyperparameter 𝛽 and a learning rate 𝛼.
• How to apply three different optimization methods to a model.
• How to build mini batches for a training set.
• How to use learning rate decay scheduling to speed up training.
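The two steps for building mini batches, shuffling and partitioning, can be sketched as follows. The function name, the column-per-example layout, and the toy data are assumptions made for this illustration.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the examples (columns), then partition them into mini-batches."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]      # same permutation for X and Y
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]    # the last batch may be smaller

X = np.arange(2 * 150, dtype=float).reshape(2, 150)
Y = np.arange(150).reshape(1, 150)
batches = random_mini_batches(X, Y, batch_size=64)
print([bx.shape[1] for bx, _ in batches])  # [64, 64, 22]
```

Using one permutation for both arrays keeps every label attached to its own example after shuffling.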

Assignment 14: Convolutional Neural Network: Step by Step


In this assignment I implemented convolutional (CONV) and pooling (POOL) layers in numpy,
including forward propagation and (optionally) backward propagation.
This is an introductory assignment.
What I learned:

• There are two types of layers – convolutional layers and pooling layers
• Convolutional layers:
o Convolutional layers perform convolution over the input using one or more filters
and produce feature maps that represent different aspects or characteristics of
the input.
o I performed zero padding.
o Then I studied single-step convolution.
o Learned how to perform convolution in Python.
o Implemented the forward pass of a convolutional layer.
• Pooling Layers:
o Pooling layers perform pooling over the feature maps using a pooling function
and produce pooled feature maps that reduce the size and complexity of the
feature maps and make them more invariant to translation or distortion.
o Learned about forward pooling.
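$$n_H = \left\lfloor \frac{n_{H_{prev}} - f}{\text{stride}} \right\rfloor + 1$$
$$n_W = \left\lfloor \frac{n_{W_{prev}} - f}{\text{stride}} \right\rfloor + 1$$
$$n_C = n_{C_{prev}}$$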

• A convolution extracts features from an input image by taking the dot product between
the input data and a 3D array of weights (the filter).
• The 2D output of the convolution is called the feature map.
• A convolution layer is where the filter slides over the image and computes the dot
product.
o This transforms the input volume into an output volume of different size.
• Zero padding helps keep more information at the image borders, and is helpful for
building deeper networks, because you can build a CONV layer without shrinking the
height and width of the volumes.
• Pooling layers gradually reduce the height and width of the input by sliding a 2D window
over each specified region, then summarizing the features in that region.
• Explored the backward pass of pooling layer by two different concepts – max pooling
layer backward pass and average pooling backward pass.
• Also used TensorFlow.
• Using TensorFlow Keras I learned to make Sequential models.
• Trained and evaluated the model using Keras models.
• Learned to make a convolutional model using TensorFlow Keras' Conv2D.
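The building blocks listed above (zero padding, a single convolution step, and forward pooling) can be sketched in NumPy. This is a minimal illustration under the assumption that batches have shape (m, n_H, n_W, n_C); the function names are chosen here and are not necessarily those used in the assignment.

```python
import numpy as np

def zero_pad(X, pad):
    """Pad the height and width of a batch X of shape (m, n_H, n_W, n_C) with zeros."""
    return np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)))

def conv_single_step(a_slice, W, b):
    """Element-wise product of one input slice with one filter, summed, plus bias."""
    return float(np.sum(a_slice * W) + b)

def pool_forward(A, f, stride, mode="max"):
    """Max or average pooling over a batch A of shape (m, n_H, n_W, n_C)."""
    m, n_H_prev, n_W_prev, n_C = A.shape
    n_H = (n_H_prev - f) // stride + 1
    n_W = (n_W_prev - f) // stride + 1
    out = np.zeros((m, n_H, n_W, n_C))
    reducer = np.max if mode == "max" else np.mean
    for h in range(n_H):
        for w in range(n_W):
            window = A[:, h * stride:h * stride + f, w * stride:w * stride + f, :]
            out[:, h, w, :] = reducer(window, axis=(1, 2))
    return out

A = np.arange(16, dtype=float).reshape(1, 4, 4, 1)
print(zero_pad(A, 1).shape)                        # (1, 6, 6, 1)
print(pool_forward(A, f=2, stride=2)[0, :, :, 0])  # 2x2 max pooling of a 4x4 input
```

The output height and width computed inside `pool_forward` follow the floor formula given above.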

Some learning stuffs 1:


These are some concepts that were learned before moving on to computer vision.

Residual Networks (ResNet):


ResNets are a type of convolutional neural network that uses residual blocks to overcome the
problem of vanishing or exploding gradients in deep networks. Residual blocks are modules that
add the input of the block to the output of the block after applying some convolutional layers.

Identity Block
The identity block is the standard block used in ResNets, and corresponds to the case where the
input activation (say 𝑎[𝑙] ) has the same dimension as the output activation (say 𝑎[𝑙+2] ).
Convolutional Block
The ResNet "convolutional block" is the second block type. You can use this type of block when
the input and output dimensions don't match up. The difference with the identity block is that
there is a CONV2D layer in the shortcut path.

Assignment 15: Residual Networks (ResNets)


Problem Statement: To make a ResNet of 50 layers.
What I did:

• I made an identity block function according to the diagram shown above.


• Identity block construction took the following steps:
• First component of main path:
- The first CONV2D has 𝐹1 filters of shape (1,1) and a stride of (1,1). Its padding is
"valid". Use 0 as the seed for the random uniform initialization: `kernel_initializer =
initializer(seed=0)`.
- The first BatchNorm is normalizing the 'channels' axis.
- Then apply the ReLU activation function. This has no hyperparameters.
• Second component of main path:
- The second CONV2D has 𝐹2 filters of shape (𝑓, 𝑓) and a stride of (1,1). Its padding
is "same". Use 0 as the seed for the random uniform initialization:
`kernel_initializer = initializer(seed=0)`.
- The second BatchNorm is normalizing the 'channels' axis.
- Then apply the ReLU activation function. This has no hyperparameters.
• Third component of main path:
- The third CONV2D has 𝐹3 filters of shape (1,1) and a stride of (1,1). Its padding is
"valid". Use 0 as the seed for the random uniform initialization: `kernel_initializer =
initializer(seed=0)`.
- The third BatchNorm is normalizing the 'channels' axis.
- Note that there is no ReLU activation function in this component.
• Final step:
- The `X_shortcut` and the output from the 3rd layer `X` are added together.
- The syntax will look something like `Add()([var1,var2])`
- Then apply the ReLU activation function. This has no hyperparameters.
• Convolutional Block is constructed using following steps:
- The CONV2D layer in the shortcut path is used to resize the input 𝑥 to a different
dimension, so that the dimensions match up in the final addition needed to add
the shortcut value back to the main path. (This plays a similar role as the matrix 𝑊𝑠
discussed in lecture.)
- For example, to reduce the height and width of the activation by a factor of
2, you can use a 1x1 convolution with a stride of 2.
- The CONV2D layer on the shortcut path does not use any non-linear activation
function. Its main role is to just apply a (learned) linear function that reduces the
dimension of the input, so that the dimensions match up for the later addition
step.
• The deep ResNet is built by stacking both identity and convolutional blocks.
What I learned:

• Very deep "plain" networks don't work in practice because vanishing gradients make
them hard to train.
• Skip connections help address the Vanishing Gradient problem. They also make it easy
for a ResNet block to learn an identity function.
• There are two main types of blocks: The identity block and the convolutional block.
• Very deep Residual Networks are built by stacking these blocks together.
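Why a skip connection makes the identity function easy to learn can be seen in a toy example. The sketch below uses small fully connected layers in NumPy instead of the CONV2D and BatchNorm layers of the real block, so it is a simplification assumed here for illustration and not the Keras code from the assignment.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a, W1, b1, W2, b2):
    """a[l+2] = relu(W2 @ relu(W1 @ a + b1) + b2 + a): main path plus shortcut."""
    main = W2 @ relu(W1 @ a + b1) + b2
    return relu(main + a)

a = np.array([[0.5], [2.0], [0.0]])          # a post-ReLU (non-negative) activation
zeros_W, zeros_b = np.zeros((3, 3)), np.zeros((3, 1))
out = residual_block(a, zeros_W, zeros_b, zeros_W, zeros_b)
print(np.array_equal(out, a))  # True
```

When the main-path weights are driven to zero, the block simply passes its input through, so adding such a block is unlikely to hurt a network that already works.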

Some Learning Stuffs 2:


MobileNet
MobileNet is a type of convolutional neural network that uses depthwise separable convolutions
to reduce the complexity and size of the network. Depthwise separable convolutions are
convolutions that separate the standard convolution into two steps: a depthwise convolution that
applies a single filter to each channel of the input, and a pointwise convolution that applies a 1x1
filter to combine the outputs of the depthwise convolution. Depthwise separable convolutions
can help reduce the number of parameters and computations compared to standard convolutions
without compromising the performance. MobileNet was introduced by Howard et al. in 2017 and
achieved high accuracy on various computer vision tasks with a small and lightweight network.
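The reduction in parameters can be made concrete with a small count. The sketch below compares the number of weights (biases ignored) in a standard convolution and in a depthwise separable one; the layer sizes are arbitrary values assumed for illustration.

```python
def standard_conv_params(f, c_in, c_out):
    """Weights in a standard f x f convolution (biases ignored)."""
    return f * f * c_in * c_out

def separable_conv_params(f, c_in, c_out):
    """Depthwise (one f x f filter per input channel) plus pointwise (1 x 1) weights."""
    return f * f * c_in + c_in * c_out

std = standard_conv_params(3, 32, 64)      # 18432
sep = separable_conv_params(3, 32, 64)     # 2336
print(std, sep)
```

For this 3x3 layer with 32 input and 64 output channels, the separable version needs roughly one eighth of the weights.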

Assignment 16: Transfer Learning with MobileNetV2


In this assignment I used a pretrained model MobileNetV2 to build a classifier and learn
different concepts.
What I learned:

• Create a dataset from a directory.


• Preprocess and augment data using the Sequential API.
• Adapt a pretrained model to new data and train a classifier using the Functional API and
MobileNet.
• Fine-tune a classifier's final layers to improve accuracy.
• When calling image_dataset_from_directory(), specify the train/val subsets and match
the seeds to prevent overlap.
• Use prefetch() to prevent memory bottlenecks when reading from disk.
• Give your model more to learn from with simple data augmentations like rotation and
flipping.
• When using a pretrained model, it's best to reuse the weights it was trained on.
• MobileNetV2's unique features are:
o Depthwise separable convolutions that provide lightweight feature filtering and
creation.
o Input and output bottlenecks that preserve important information on either end
of the block.
• Depthwise separable convolutions deal with both spatial and depth (number of
channels) dimensions.
• To adapt the classifier to new data: Delete the top layer, add a new classification layer,
and train only on that layer.
• When freezing layers, avoid keeping track of statistics (like in the batch normalization
layer).
• Fine-tune the final layers of your model to capture high-level details near the end of the
network and potentially improve accuracy.

Assignment 17: Autonomous Driving – Car Detection


Problem Statement: To implement object detection using the very powerful YOLO model.
While doing this assignment I learned a ton of fascinating concepts.
What I learned by performing this assignment:
"You Only Look Once" (YOLO) is a popular algorithm because it achieves high accuracy while
also being able to run in real time. This algorithm "only looks once" at the image in the sense
that it requires only one forward propagation pass through the network to make predictions.
After non-max suppression, it then outputs recognized objects together with the bounding
boxes.
Inputs and outputs
- The input is a batch of m images with shape (m, 608, 608, 3)
- The output is a list of bounding boxes along with the recognized classes. Each
bounding box is represented by 6 numbers (𝑝𝑐 , 𝑏𝑥 , 𝑏𝑦 , 𝑏ℎ , 𝑏𝑤 , 𝑐) as explained
above. If you expand 𝑐 into an 80-dimensional vector, each bounding box is then
represented by 85 numbers.
Anchor Boxes

• Anchor boxes are chosen by exploring the training data to choose reasonable
height/width ratios that represent the different classes. For this assignment, 5 anchor
boxes were chosen for you (to cover the 80 classes), and stored in the file
'./model_data/yolo_anchors.txt'
• The anchor-box dimension is the second-to-last dimension of the encoding tensor,
whose shape is (𝑚, 𝑛𝐻 , 𝑛𝑊 , 𝑎𝑛𝑐ℎ𝑜𝑟𝑠, 𝑐𝑙𝑎𝑠𝑠𝑒𝑠).
• The YOLO architecture is IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19,
5, 85).
Encoding
If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting
that object.
Class score
Now, for each box (of each cell) you'll compute the element-wise product of the box confidence
𝑝𝑐 and the class probabilities 𝑐, and extract the probability that the box contains a certain class.
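The class-score computation can be sketched with NumPy broadcasting. The arrays below are random placeholders standing in for a real network output, and the variable names are assumptions chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
box_confidence = rng.random((19, 19, 5, 1))     # p_c for each anchor box
box_class_probs = rng.random((19, 19, 5, 80))   # c_1 ... c_80 for each anchor box

box_scores = box_confidence * box_class_probs   # broadcasts to (19, 19, 5, 80)
box_classes = np.argmax(box_scores, axis=-1)    # most likely class per box
box_class_scores = np.max(box_scores, axis=-1)  # score of that class
print(box_scores.shape, box_classes.shape)      # (19, 19, 5, 80) (19, 19, 5)
```

Each of the 19x19x5 boxes ends up with one class index and one score, which is what the later filtering steps operate on.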
Visualizing classes
- For each of the 19x19 grid cells, find the maximum of the probability scores (taking a max
across the 80 classes, one maximum for each of the 5 anchor boxes).

- Color that grid cell according to what object that grid cell considers the most likely.
Note that this visualization isn't a core part of the YOLO algorithm itself for making predictions;
it's just a nice way of visualizing an intermediate result of the algorithm.
Visualizing bounding boxes
Another way to visualize YOLO's output is to plot the bounding boxes that it outputs. Doing that
results in a visualization like this:

Each cell gives you 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking
once at the image (one forward pass through the network)! Different colors denote different
classes.

Non-Max suppression
In the figure above, the only boxes plotted are ones for which the model had assigned a high
probability, but this is still too many boxes. You'd like to reduce the algorithm's output to a
much smaller number of detected objects.
To do so, you'll use non-max suppression. Specifically, you'll carry out these steps:
- Get rid of boxes with a low score. Meaning, the box is not very confident about detecting a
class, either due to the low probability of any object, or low probability of this particular class.
- Select only one box when several boxes overlap with each other and detect the same object.
Several more concepts are discussed in the coding part of the assignment.
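The two filtering steps can be sketched as follows. This is a simplified, class-agnostic illustration, not the assignment's TensorFlow implementation; the corner-format boxes, the function names, and the threshold values are assumptions made here.

```python
import numpy as np

def iou(box1, box2):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return indices of boxes kept after score filtering and greedy NMS."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_threshold]
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(int(i))
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 5, 5]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.3])
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Box 3 is removed by the score threshold, and box 1 is removed because it overlaps heavily with the higher-scoring box 0.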

Some Learning Stuffs 3: U-Net


Image Segmentation using U-Net
U-Net is a type of CNN designed for quick, precise image segmentation. I used it to predict a
label for every single pixel in an image, in this case an image from a self-driving car dataset.
This type of image classification is called semantic image segmentation.
Contracting path (Encoder containing downsampling steps):
Images are first fed through several convolutional layers which reduce height and width, while
growing the number of channels.
The contracting path follows a regular CNN architecture, with convolutional layers, their
activations, and pooling layers to downsample the image and extract its features. In detail, it
consists of the repeated application of two 3 x 3 unpadded convolutions, each followed by a
rectified linear unit (ReLU) and a 2 x 2 max pooling operation with stride 2 for downsampling. At
each downsampling step, the number of feature channels is doubled.
Crop function: This step crops the image from the contracting path and concatenates it to the
current image on the expanding path to create a skip connection.
Expanding path: (Decoder containing upsampling steps):
The expanding path performs the opposite operation of the contracting path, growing the
image back to its original size, while shrinking the channels gradually.
In detail, each step in the expanding path upsamples the feature map, followed by a 2 x 2
convolution (the transposed convolution). This transposed convolution halves the number of
feature channels, while growing the height and width of the image.
Next is a concatenation with the correspondingly cropped feature map from the contracting
path, and two 3 x 3 convolutions, each followed by a ReLU. You need to perform cropping to
handle the loss of border pixels in every convolution.
Final Feature Mapping Block: In the final layer, a 1x1 convolution is used to map each 64-
component feature vector to the desired number of classes. The channel dimensions from the
previous layer correspond to the number of filters used, so when you use 1x1 convolutions, you
can transform that dimension by choosing an appropriate number of 1x1 filters. When this idea
is applied to the last layer, you can reduce the channel dimensions to have one layer per class.
The U-Net network has 23 convolutional layers in total.

Assignment 18: Face recognition using FaceNet


In this assignment I used a prebuilt model and implemented it for face verification and
recognition. With successful completion of the assignment, I learned the following main key
points:

• Posed face recognition as a binary classification problem.


• Implemented one-shot learning for a face recognition problem.
• Applied the triplet loss function to learn a network's parameters in the context of face
recognition.
• Mapped face images into 128-dimensional encodings using a pretrained model.
• Performed face verification and face recognition with these encodings.
• Face verification solves an easier 1:1 matching problem; face recognition addresses a
harder 1:K matching problem.
• Triplet loss is an effective loss function for training a neural network to learn an encoding
of a face image.
• The same encoding can be used for verification and recognition. Measuring distances
between two images' encodings allows you to determine whether they are pictures of
the same person.
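The triplet loss and the distance-based verification step can be sketched in NumPy. The 2-dimensional encodings, the margin `alpha`, and the distance threshold are assumed values for illustration; the real model produces 128-dimensional encodings.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Sum over the batch of max(||a - p||^2 - ||a - n||^2 + alpha, 0)."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))

def verify(encoding, reference, threshold=0.7):
    """1:1 check: same person if the L2 distance between encodings is small."""
    return bool(np.linalg.norm(encoding - reference) < threshold)

a = np.array([[1.0, 0.0]])   # anchor
p = np.array([[0.9, 0.1]])   # same person as the anchor
n = np.array([[0.0, 1.0]])   # different person
print(triplet_loss(a, p, n))                   # 0.0: the negative is already far enough away
print(verify(a[0], p[0]), verify(a[0], n[0]))  # True False
```

Face recognition (1:K) can reuse `verify` by comparing a new encoding against every entry in the database and picking the closest one.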
For further improvement of the model:

• Put more images of each person (under different lighting conditions, taken on different
days, etc.) into the database. Then, given a new image, compare the new face to
multiple pictures of the person. This would increase accuracy.
• Crop the images to contain just the face, and less of the "border" region around the face.
This preprocessing removes some of the irrelevant pixels around the face, and also
makes the algorithm more robust.

Some Learning Stuffs 4: Neural Style Transfer


Neural style transfer is a technique that transfers the style of one image to another image while
preserving the content of the latter image. Neural style transfer can be used to create artistic
images or modify images according to different styles. Neural style transfer can use techniques
such as convolutional neural networks, content loss, style loss, or total variation loss, which differ
in how they extract, represent, or optimize the content and style of the images.
CONTENT LOSS
Content loss is a type of loss function that measures how well the generated image preserves the
content of the content image. Content loss can be computed by using a pre-trained convolutional
neural network and comparing the activations of a hidden layer between the generated image
and the content image. Content loss can capture the high-level features and semantics of the
content image and ignore the low-level details and colors. Content loss is defined as:
$$L_{content} = \frac{1}{2} \sum_{i,j} \left( a_{ij}^{[l](G)} - a_{ij}^{[l](C)} \right)^2$$

where $a_{ij}^{[l](G)}$ and $a_{ij}^{[l](C)}$ are the activations of layer 𝑙 for the generated image and the content
image, respectively.
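The content loss formula can be evaluated directly on two activation arrays. The small arrays below are made-up values used only to illustrate the arithmetic.

```python
import numpy as np

def content_cost(a_C, a_G):
    """Half the summed squared difference between the two activation volumes."""
    return 0.5 * float(np.sum((a_G - a_C) ** 2))

a_C = np.array([[1.0, 2.0], [3.0, 4.0]])   # activations for the content image
a_G = np.array([[1.0, 2.0], [3.0, 6.0]])   # activations for the generated image
print(content_cost(a_C, a_G))  # 2.0
```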

STYLE LOSS
Style loss is a type of loss function that measures how well the generated image matches the style
of the style image. Style loss can be computed by using a pre-trained convolutional neural
network and comparing the Gram matrices of multiple hidden layers between the generated
image and the style image. Gram matrices are matrices that contain the inner products of the
feature maps of a layer, which capture the correlations and patterns of the features. Style loss can
capture the textures, colors, and styles of the style image and ignore the spatial structure and
semantics. Style loss is defined as:
$$L_{style} = \sum_{l} \lambda_l \frac{1}{4 n_l^2 m_l^2} \sum_{i,j} \left( G_{ij}^{[l](G)} - G_{ij}^{[l](S)} \right)^2$$

where $G_{ij}^{[l](G)}$ and $G_{ij}^{[l](S)}$ are the Gram matrices of layer 𝑙 for the generated image and the style
image, respectively, $n_l$ is the number of filters in layer 𝑙, $m_l$ is the height times width of the feature
map of layer 𝑙, and $\lambda_l$ is a weight parameter that controls the contribution of layer 𝑙 to the style
loss.
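The Gram matrix and the style cost of a single layer can be sketched in NumPy. The random activations and the function names are assumptions for illustration; the full style loss would sum this quantity over several layers, each weighted by its own λ.

```python
import numpy as np

def gram_matrix(A):
    """Gram matrix of an unrolled activation A of shape (n_C, n_H * n_W)."""
    return A @ A.T

def layer_style_cost(a_S, a_G):
    """Style cost for one layer; a_S and a_G have shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a_G.shape
    # Unroll so that each row holds one filter's responses over all positions.
    S = a_S.reshape(n_H * n_W, n_C).T
    G = a_G.reshape(n_H * n_W, n_C).T
    diff = gram_matrix(S) - gram_matrix(G)
    return float(np.sum(diff ** 2) / (4.0 * n_C ** 2 * (n_H * n_W) ** 2))

rng = np.random.default_rng(1)
a_S = rng.standard_normal((4, 4, 3))
a_G = rng.standard_normal((4, 4, 3))
print(layer_style_cost(a_S, a_S), layer_style_cost(a_S, a_G))
```

Identical activations give a cost of zero, and the cost grows as the filter correlations of the two images diverge.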

TOTAL VARIATION LOSS


Total variation loss is a type of loss function that measures how smooth or coherent the generated
image is. Total variation loss can be computed by summing up the absolute differences between
neighboring pixels in the generated image. Total variation loss can reduce noise and artifacts in
the generated image and make it more visually pleasing. Total variation loss is defined as:

$$L_{tv} = \sum_{i,j} \left( |x_{i,j+1} - x_{i,j}| + |x_{i+1,j} - x_{i,j}| \right)$$

where $x_{i,j}$ is the pixel value at position (𝑖, 𝑗) in the generated image.
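The total variation loss can be computed with array slicing. The sketch below assumes a single-channel image stored as a 2D array; the two test images are made up to show a smooth and a noisy case.

```python
import numpy as np

def total_variation_loss(x):
    """Sum of absolute differences between horizontally and vertically adjacent pixels."""
    horizontal = np.abs(x[:, 1:] - x[:, :-1]).sum()
    vertical = np.abs(x[1:, :] - x[:-1, :]).sum()
    return float(horizontal + vertical)

flat = np.full((4, 4), 0.5)                                   # perfectly smooth image
checker = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)  # alternating 0/1 pattern
print(total_variation_loss(flat), total_variation_loss(checker))  # 0.0 24.0
```

The smooth image has no variation at all, while the checkerboard, where every neighbouring pair differs, is penalised heavily.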
Problem Statement: Perform a 3x3 convolution dot product of two arrays, namely an input
array and a kernel or weights array. I had to do this in both decimal and fixed-point binary
representation. Then I had to analyze the error between the two representations. Then I had to
optimize the code to reduce the error. After achieving the minimum error, I had to implement the
whole operation in hardware using Verilog.

I successfully completed the task by the following steps:


Implementation using Python:

• Using Python, a class named `FixedPoint` is declared that converts decimal to fixed point
numbers. The class takes four arguments: `bits`, `num`, `frac`, and `signed`. `bits` is the total
number of bits used to represent the fixed-point number, `frac` is the number of fractional
bits, and `signed` is a boolean value that indicates whether the number is signed or unsigned.
The class has methods for converting decimal to binary, binary to decimal, adding,
subtracting, multiplying, dividing, shifting, rounding, and truncating fixed point numbers.
• Another class named `FixedPointArray` is declared that converts decimal arrays to fixed point
arrays. The class inherits from `FixedPoint` and takes an additional argument: `array`. `array`
is a numpy array of decimal numbers that needs to be converted to fixed point numbers. The
class has methods for converting decimal arrays to binary arrays, binary arrays to decimal
arrays, performing element-wise operations on fixed point arrays, and displaying fixed point
arrays.
• An image array of size 7x7 and a weight array of size 3x3 are defined as numpy arrays of
decimal numbers. The image array represents an image that needs to be convolved with the
weight array, which represents a kernel or a filter that modifies the image.
• The image array and the weight array are converted to fixed point arrays using the
`FixedPointArray` class. The image array in fixed point domain has 8 bits and 8 fractional bits,
whereas the weight array in fixed point domain has 4 bits and 4 fractional bits. The signed
argument is set to True for both arrays.
• To perform the convolution dot product in fixed point domain, the image array is padded with
zeros on all sides to make it 9x9. Then, a 3x3 submatrix of the padded image array is multiplied
element-wise with the weight array, producing a 3x3 product matrix. The elements of the
product matrix are summed up to get a single number that represents one element of the
output matrix at a defined position according to loop iteration. This process is repeated for all
possible positions of the submatrix on the padded image array.
• The output matrix in fixed point domain is requantized to 8 bits using the `FixedPointArray`
class's `requantize` method. The number of fractional bits is decided after observing the mean
squared error (MSE) plot for different fractional bits from 0 to 10. The MSE plot is generated
using matplotlib's `plot` function. The MSE plot shows that 7 fractional bits for the final output
has the least error.
• Similarly, requantization is carried out for the multiplication step and the addition step in
convolution dot product. The multiplication step produces a 3x3 matrix in both domains,
which is requantized to 8 bits and 8 fractional bits in fixed point domain. The addition step
produces a single number in both domains, which is requantized to 8 bits and 8 fractional bits
in fixed point domain. These requantizations are also based on MSE plots for different
fractional bits.
• Project Folder:
https://drive.google.com/drive/folders/1yLFY5iypBE6YkOo4p4tipRmIqk3wdPvi?usp=sharing
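The workflow described above (quantize, convolve, requantize, and compare the mean squared error for different numbers of fractional bits) can be sketched as follows. This is a simplified stand-in and not the `FixedPoint` or `FixedPointArray` classes of the project: the function names, the random input ranges, and the use of rounding with saturation are all assumptions made for this illustration.

```python
import numpy as np

def quantize(x, bits, frac, signed=True):
    """Round x to a fixed-point grid with `frac` fractional bits, saturating to `bits` bits."""
    scale = 2 ** frac
    lo = -(2 ** (bits - 1)) if signed else 0
    hi = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    q = np.clip(np.round(np.asarray(x) * scale), lo, hi)
    return q / scale                      # the value actually represented

def conv2d_same(image, kernel):
    """Sliding dot product with zero padding (output has the same size as the input)."""
    k = kernel.shape[0]
    padded = np.pad(image, k // 2)        # 7x7 becomes 9x9 for a 3x3 kernel
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.uniform(-0.4, 0.4, size=(7, 7))      # assumed value ranges
kernel = rng.uniform(-0.4, 0.4, size=(3, 3))
reference = conv2d_same(image, kernel)           # floating-point result

fixed = conv2d_same(quantize(image, 8, 8), quantize(kernel, 4, 4))
mse_by_frac = {f: float(np.mean((quantize(fixed, 8, f) - reference) ** 2))
               for f in range(0, 11)}
best_frac = min(mse_by_frac, key=mse_by_frac.get)
print(best_frac, mse_by_frac[best_frac])
```

Too few fractional bits lose precision and too many cause saturation, so the error curve has a minimum in between. The best value depends on the data, so this sketch is not expected to reproduce the project's exact result.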

Implementation using Verilog:

• The optimized convolution dot product is also implemented in hardware using
Verilog.
• To implement the convolution dot product for hardware using Verilog, I defined modules as
`multiplier`, `adder`, and `convolution`.
• The `convolution` module takes 18 elements as input and gives 1 element as output. The
first 9 elements are 8-bit fixed-point inputs and the remaining 9 input elements are 4-bit
signed fixed-point weights. The output is an 8-bit element.
• `multiplier` module takes an 8-bit input, a 4-bit signed input and gives an 8-bit signed output.
• `adder` module takes two 8-bit signed inputs and gives 9-bit signed output.
• I also designed separate projects for adder and multiplier and tested both separately.
• The overall design performs the convolution dot product in fixed-point format.
• Using python I generated a large amount of test cases for multiplier, adder and convolution,
10000 test cases each.
• Those test cases were saved in .mem files and then imported inside respective projects in
Verilog.
• These designs were tested by writing the testbench code for each project separately by taking
large number of inputs from .mem files and comparing the Verilog output with the Python
output.
• Project Folder:
https://drive.google.com/drive/folders/1P15NGwc71F6t48g0NXVjU6_mHefCk7a6?usp=sharing
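Generating test vectors in Python for a Verilog testbench can be sketched as follows. The file name, the line layout, and the choice to store the exact 12-bit product (rather than the requantized 8-bit output of the project's `multiplier`) are assumptions made for this illustration; a testbench could load such a file with `$readmemh`.

```python
import os
import random
import tempfile

def to_hex(value, bits):
    """Hex digits of `value` as a `bits`-wide two's complement word."""
    digits = (bits + 3) // 4
    return format(value & ((1 << bits) - 1), "0{}x".format(digits))

def make_cases(n, seed=0):
    """Random 8-bit signed inputs, 4-bit signed weights and their exact products."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        a, w = rng.randint(-128, 127), rng.randint(-8, 7)
        cases.append((a, w, a * w))
    return cases

cases = make_cases(10000)
# One 24-bit vector per line: input (8 bits), weight (4 bits), product (12 bits).
lines = [to_hex(a, 8) + to_hex(w, 4) + to_hex(p, 12) for a, w, p in cases]

path = os.path.join(tempfile.mkdtemp(), "multiplier_cases.mem")  # hypothetical name
with open(path, "w") as fh:
    fh.write("\n".join(lines) + "\n")
print(lines[0], len(lines))
```

Masking with `(1 << bits) - 1` turns a negative Python integer into its two's complement bit pattern, so the hardware and the Python reference agree on the encoding.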

What I Learned:
• Representation of decimal numbers in fixed-point form.
In fixed-point representation, the fraction is often expressed in the same number base as
the integer part, but using negative powers of the base b. The most common variants are
decimal (base 10) and binary (base 2). The latter is commonly known also as binary
scaling.
A fixed-point representation of a fractional number is essentially an integer that is to be
implicitly multiplied by a fixed scaling factor. For example, to represent the number 4.85
in fixed-point binary with 3 bits for the fractional part, you would first multiply 4.85 by 2^3
= 8 to get 38.8. Then you would round this number to the nearest integer to get 39. Finally,
you would represent this integer in binary as 100111. The binary point is implicitly located
three bits from the right, so this number represents 100.111 in binary, which is equivalent
to 4 + 0 + 0 + 0.5 + 0.25 + 0.125 = 4.875 in decimal.
• Importance of 2's complement and why 2's complement works with normal arithmetic:
The key to understanding two's complement is to note that we have a set of finitely many
(in particular, 2^8) values in which there is a sensible notion of addition by 1 that allows us
to cycle through all of the numbers. In particular, we have a system of modular arithmetic,
in this case modulo 2^8 = 256.
In the context of arithmetic with signed integers, we don't think of 11111101 as
being 253 in our 8-bit system; we instead consider it to represent the number −3. Rather
than having our numbers go from 0 to 255 around a clock, we have them go
from −128 to 127, where −x occupies the same spot that 256 − x would occupy for values of x
from 1 to 128.
Succinctly, this amounts to saying that a number with 8 binary digits is deemed negative
if and only if its leading digit (its "most significant" digit) is a 1. For this reason, the leading
digit is referred to as the "sign bit" in this context.
• After completing the task related to the above concepts, I explored how to analyse each
step of a convolution dot product. Here I mean to optimize the steps by requantising the
number of bits of product and sum.
• I learned to use matplotlib to visualize the MSE (mean squared error) vs number of
fractional bits in each step and to set the number of bits and fractional bits by defining a
method requantise.
• I explored the convolution dot product in Verilog. Though Verilog is an HDL, I gained an
understanding of how to perform the operations by treating the numbers in the binary domain.
• Also, I got to know how to generate 10000 test cases in Python, store them in a memory
(.mem) file and import them into the Verilog testbench code to test the design.
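The two worked examples from this list, 4.85 with 3 fractional bits and the 8-bit pattern 11111101, can be reproduced with a few lines of Python. The helper names are assumptions chosen for this sketch.

```python
def to_fixed(value, frac_bits):
    """Integer stored for `value` with `frac_bits` fractional bits (round to nearest)."""
    return round(value * (1 << frac_bits))

def from_fixed(stored, frac_bits):
    """Real value represented by the stored integer."""
    return stored / (1 << frac_bits)

def twos_complement(value, bits=8):
    """Unsigned bit pattern that represents `value` modulo 2**bits."""
    return value % (1 << bits)

def as_signed(pattern, bits=8):
    """Interpret a bit pattern as signed: patterns with the top bit set are negative."""
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern

stored = to_fixed(4.85, 3)
print(stored, format(stored, "b"), from_fixed(stored, 3))         # 39 100111 4.875
print(format(twos_complement(-3), "08b"), as_signed(0b11111101))  # 11111101 -3
# Ordinary modular addition gives the right signed answer: (-3) + 5 = 2.
print(as_signed((twos_complement(-3) + 5) % 256))                 # 2
```

The last line shows why no special hardware is needed for signed addition: the adder works modulo 256 and the sign only appears in how the result is interpreted.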
This task has taught me many concepts, some of which were new and others that needed
revisiting. I learned how to analyze my work to determine if it was correct or if there were any
errors. If there were errors, I learned how to identify, address, and fix them. The most important
lesson I learned from repeatedly analyzing the model, fixing bugs, reducing errors, and trying
different approaches to further minimize errors is to keep persevering and striving for human-level
accuracy.
Conclusion
In this internship I gained valuable knowledge and skills in Machine Learning, Deep
Learning and Convolutional Neural Networks. I applied these concepts to a practical task
that involved Convolution Dot Product and learned how to perform it in both decimal and
fixed-point binary representation. I also learned how to analyze and minimize the error
between the two representations and how to optimize the code for better performance.
Finally, I learned how to implement the whole operation in hardware using Verilog and
verified its functionality. This internship was a great learning experience for me, and I
thank my mentors and guides for their support and guidance.

Besides the technical aspects, I also learned how to conduct analysis and
approach a problem from various angles. I was fascinated by the discussions on the
research work done by the students in the lab and how they shared their insights and
findings. I admired the culture of the lab, where the students had the freedom to explore
their interests and interact with each other in a professional and friendly way. This
experience motivated me to pursue my higher studies at IIT Kharagpur, as I aspire to be
part of such a stimulating and supportive environment.
Minor Project
Neural Image Compression and Explanation

ABSTRACT
Explaining the prediction of deep neural networks (DNNs) and semantic
image compression are two active research areas of deep learning with
numerous applications in decision-critical systems, such as surveillance
cameras, drones and self-driving cars, where interpretable decisions are critical
and storage/network bandwidth is limited. In this article, we propose a novel
end-to-end Neural Image Compression and Explanation (NICE) framework that
learns to (1) explain the predictions of convolutional neural networks (CNNs),
and (2) subsequently compress the input images for efficient storage or
transmission.
Specifically, NICE generates a sparse mask over an input image by
attaching a stochastic binary gate to each pixel of the image, whose parameters
are learned through the interaction with the CNN classifier to be explained. The
generated mask is able to capture the saliency of each pixel measured by its
influence on the final prediction of the CNN; it can also be used to produce a
mixed-resolution image, where important pixels maintain their original high resolution
and insignificant background pixels are subsampled to a low resolution.
The produced images achieve a high compression rate (e.g., about 0.6×
of original image file size), while retaining a similar classification accuracy.
Extensive experiments across multiple image classification benchmarks
demonstrate the superior performance of NICE compared to the state-of-the-
art methods in terms of explanation quality and semantic image compression
rate.
RELATED WORKS
Neural Explanation
Neural explanation methods are techniques that help us understand
how deep neural networks (DNNs) make predictions. They can be divided into
two types: global and local. Global methods try to find out which input
variables are most important for the overall performance of a trained model.
This can help us discover general rules or knowledge from the model. Local
methods try to give understandable explanations for each individual
prediction. This can help us see what features or regions of the input data
influence the prediction the most.
There are different ways to implement local methods. Some methods
change or remove parts of the input data and see how the prediction changes.
Some methods calculate the gradient of the output with respect to the input
sample using backpropagation. This can show which features have high
sensitivity to the prediction. Some methods use a simpler model, such as a
linear model, to approximate the decision boundary of a DNN near a specific
prediction. This can give a local linear explanation for the prediction.
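As an illustration of the first family of local methods (changing or removing parts of the input and observing the prediction), here is a minimal occlusion sketch. The function toy_score is a hypothetical stand-in for a trained classifier, and the patch size and fill value are arbitrary choices.

```python
def occlusion_saliency(image, score_fn, patch=1, fill=0.0):
    """Saliency of each pixel = drop in score when its patch is replaced by `fill`."""
    h, w = len(image), len(image[0])
    base = score_fn(image)
    saliency = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            rows = range(top, min(top + patch, h))
            cols = range(left, min(left + patch, w))
            occluded = [row[:] for row in image]  # copy so the input is untouched
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = base - score_fn(occluded)
            for r in rows:
                for c in cols:
                    saliency[r][c] = drop
    return saliency


def toy_score(img):
    """Hypothetical classifier score: mean brightness of the central 2x2 region."""
    return sum(img[r][c] for r in (1, 2) for c in (1, 2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
saliency = occlusion_saliency(image, toy_score, patch=1)
for row in saliency:
    print(row)
```

Only the four central pixels receive a non-zero saliency here, because they are the only ones the toy score depends on.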
NICE is a local method that aims to produce simple and clear local
explanations, similar to some other methods such as Saliency Map, RTIS and
VIBI. However, NICE explicitly enforces sparsity and smoothness on the
explanations by using an L0-norm regularization and a smoothness constraint,
which are optimized by stochastic binary optimization.
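The idea of a stochastic binary gate per pixel with a sparsity penalty can be sketched as below. This is a simplified illustration under stated assumptions: each gate is modelled as a Bernoulli variable with probability sigmoid(logit), the L0 term is taken as the expected number of open gates, and the smoothness term is a plain neighbour-difference penalty chosen for illustration. The text above does not give the exact form of either term, and the optimization step itself is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_mask(logits, rng):
    """Draw one binary mask: each gate opens with probability sigmoid(logit)."""
    return [[1 if rng.random() < sigmoid(p) else 0 for p in row] for row in logits]


def expected_l0(logits):
    """Expected number of open gates (assumed surrogate for the L0 norm)."""
    return sum(sigmoid(p) for row in logits for p in row)


def neighbour_penalty(logits):
    """Illustrative smoothness term: absolute differences between adjacent gate probabilities."""
    probs = [[sigmoid(p) for p in row] for row in logits]
    h, w = len(probs), len(probs[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                total += abs(probs[r][c] - probs[r][c + 1])
            if r + 1 < h:
                total += abs(probs[r][c] - probs[r + 1][c])
    return total


# Hypothetical 4x4 gate logits: confident "keep" in the centre, "drop" elsewhere.
logits = [[6.0 if r in (1, 2) and c in (1, 2) else -6.0 for c in range(4)] for r in range(4)]
mask = sample_mask(logits, random.Random(0))
print("expected L0:", expected_l0(logits))
print("smoothness penalty:", neighbour_penalty(logits))
```

A smaller expected L0 value corresponds to a sparser mask, and a smaller neighbour penalty corresponds to a mask with fewer isolated pixels.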
The sparse mask generator of NICE is also related to semantic
segmentation, which is a task of dividing an image into meaningful regions.
However, unlike most semantic segmentation methods that use different
kinds of supervision, such as pixel-level labels, image-level labels, bounding
boxes, etc., NICE trains the sparse mask generator to maximize the
classification accuracy of the mixed-resolution images without using any pixel-
level annotations. This makes NICE a weakly supervised binary segmentation
algorithm that detects salient regions of an image. Since NICE’s main goal is to
provide better or comparable neural explanations, we mainly compare it with
other deep explanation methods in our experiments, rather than segmentation
methods.

Semantic Image Compression


Traditional image compression algorithms, such as JPEG and PNG, have
fixed steps / components to reduce the size of images. For example, the JPEG
compression first applies a discrete cosine transform (DCT) over each 8 × 8
image block, then uses quantization to encode the frequency coefficients as a
sequence of bits. The DCT can be considered as a general feature extractor
with a fixed set of basis functions that
do not depend on the distribution of the input images.
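The two JPEG steps mentioned above (an 8 × 8 DCT followed by quantization) can be sketched as follows. This is a simplified illustration: it uses a single uniform quantization step, whereas JPEG uses a per-frequency quantization table, and it omits colour conversion, level shifting and entropy coding.

```python
import math

N = 8  # JPEG block size


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimised form)."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def quantise(coeffs, step):
    """Uniform quantization (assumption: one step size for every frequency)."""
    return [[round(c / step) for c in row] for row in coeffs]


flat_block = [[100.0] * N for _ in range(N)]  # a block with no detail
coeffs = dct_2d(flat_block)
quantised = quantise(coeffs, step=16)
nonzero = sum(1 for row in quantised for q in row if q != 0)
print("DC coefficient:", round(coeffs[0][0], 6))
print("non-zero quantised coefficients:", nonzero)
```

For a flat block all the energy ends up in the single DC coefficient, which is why smooth image regions compress well. The cosine basis functions are the same for every image, as the paragraph above notes.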
Unlike standard image compression algorithms, the ML-based methods
can automatically learn semantic patterns and basis functions from training
images to achieve an even higher compression rate. These ML-based methods
share an autoencoder-like structure, where an encoder is used to extract
a feature representation from images and a decoder is used to reconstruct
images from the quantized representations.
The main differences among these ML-based methods are the
architectures of encoder and decoder. While most of these algorithms use
CNNs as the encoder and decoder, some others use recurrent networks such
as LSTM and GRU. As far as we know, none of these methods are very content-aware,
except the work from Prakash et al., which is probably the closest work
to ours. While Prakash et al. use CAM as the semantic region detector, we
develop a principled L0-regularized sparse mask generator to detect the
semantic regions and further compress images with mixed resolutions. We will
compare NICE with this method when we show results of semantic image compression.
NICE Framework
Given a training set D = {(x_i, y_i), i = 1, 2, ..., N}, where x_i denotes the i-th
input image and y_i denotes the corresponding target, a neural network is a
function h(x; θ) parameterized by θ that fits to the training data D with the goal
of achieving good generalization to unseen test data. To optimize θ, typically the
following empirical risk minimization (ERM) is adopted:

$$\theta^{*} = \arg\min_{\theta} R(\theta), \qquad R(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(h(x_i; \theta), y_i\big)$$

where L(·) denotes the loss over training data D, such as the cross-entropy loss
for classification or the mean squared error (MSE) for regression.
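The empirical risk can be illustrated numerically with a short sketch: the loss is averaged over the training pairs, and the parameter with the lowest average is preferred. The one-parameter logistic model, the four data points and the candidate values of θ are hypothetical, and a real network would be optimized by gradient descent rather than by comparing a few candidates.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss for one sample, given predicted class probabilities."""
    return -math.log(probs[label])


def empirical_risk(model, theta, data, loss=cross_entropy):
    """Average loss of model(x; theta) over all (x, y) pairs in the training set."""
    return sum(loss(model(x, theta), y) for x, y in data) / len(data)


def logistic_model(x, theta):
    """Hypothetical two-class model h(x; theta) with a single scalar parameter."""
    p1 = 1.0 / (1.0 + math.exp(-theta * x))
    return [1.0 - p1, p1]


data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # toy training set D
risks = {theta: empirical_risk(logistic_model, theta, data) for theta in (0.0, 1.0, 3.0)}
best_theta = min(risks, key=risks.get)
for theta, risk in risks.items():
    print(f"theta = {theta}: empirical risk = {risk:.4f}")
print("lowest risk at theta =", best_theta)
```

With θ = 0 the model predicts 0.5 for every sample, so the risk equals ln 2. Larger θ separates the two classes better and lowers the risk on this toy data.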
The goal of this article is to develop an approach that can explain the
prediction of a neural network h(x; θ) in response to an input image x;
meanwhile, to reduce storage or network transmission cost of the image, we’d
like to compress the image x based on the above derived explanation such that
the compressed image x̃ has the minimal file size while retaining a similar
classification accuracy as the original image x.
To meet these interdependent goals, we develop a Neural Image
Compression and Explanation (NICE) framework that integrates explanation
and compression into an end-to-end trainable pipeline as illustrated in Fig. 1. In
this framework, given an input image, a mask generator under the L0-norm and
smoothness constraints generates a sparse mask that indicates salient regions
of the image. The generated mask is then used to transform the original input
image to a mixed-resolution image that has a high resolution in the salient
regions and a low resolution in the background.
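The transformation from a mask to a mixed-resolution image can be sketched as follows. This is an assumption-laden illustration: the low-resolution version is produced by simple block averaging, and the function names are hypothetical. The text above does not specify the subsampling scheme.

```python
def block_average(image, block):
    """Low-resolution copy: every block x block tile is replaced by its mean."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, block):
        for left in range(0, w, block):
            rows = range(top, min(top + block, h))
            cols = range(left, min(left + block, w))
            mean = sum(image[r][c] for r in rows for c in cols) / (len(rows) * len(cols))
            for r in rows:
                for c in cols:
                    out[r][c] = mean
    return out


def mixed_resolution(image, mask, block=2):
    """Keep original pixels where mask is 1, use the low-resolution copy elsewhere."""
    low = block_average(image, block)
    return [
        [image[r][c] if mask[r][c] else low[r][c] for c in range(len(image[0]))]
        for r in range(len(image))
    ]


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
# Hypothetical sparse mask marking the top-left 2x2 region as salient.
mask = [[1 if r < 2 and c < 2 else 0 for c in range(4)] for r in range(4)]
mixed = mixed_resolution(image, mask, block=2)
for row in mixed:
    print(row)
```

Salient pixels keep their exact values, while each background tile collapses to a single value, which is what makes the result cheaper to store.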
To evaluate the quality of the sparse mask generator and the compressed
image, at the end of the pipeline a discriminator network (e.g., a CNN) classifies
the generated image for prediction. Finally, the prediction, sparse mask and
compressed image can be stored or transmitted efficiently for decision making,
interpretation and system diagnosis. The whole pipeline is fully differentiable
and can be trained end-to-end by backpropagation.
Fig. 1. Overall architecture of NICE
CONCLUSION
In this ongoing project, we present a novel framework, NICE, that can
simultaneously explain and compress images for deep neural network
classifiers. NICE leverages a stochastic binary gate mechanism to generate
sparse masks that highlight the salient regions of the input images. The masks
can also be used to produce mixed-resolution images that preserve the
semantic information while reducing the file size. We aim for NICE to
achieve high-quality explanations and high compression rates on various
image classification benchmarks, outperforming the existing methods. Our
work opens up new possibilities for interpretable and efficient deep
learning applications in resource-constrained scenarios. As future work, we
plan to extend our framework to other modalities, such as natural language and
speech, and explore more ways to improve the explanation and compression
performance.
