Artificial Neural Network Concepts and Examples


University of Missouri, St. Louis
IRL @ UMSL

Theses UMSL Graduate Works

7-22-2022

Artificial Neural Network Concepts and Examples


Harcharan Kabbay
University of Missouri-St. Louis, [email protected]

Follow this and additional works at: https://irl.umsl.edu/thesis

Part of the Other Mathematics Commons

Recommended Citation
Kabbay, Harcharan, "Artificial Neural Network Concepts and Examples" (2022). Theses. 402.
https://irl.umsl.edu/thesis/402

This Thesis is brought to you for free and open access by the UMSL Graduate Works at IRL @ UMSL. It has been
accepted for inclusion in Theses by an authorized administrator of IRL @ UMSL. For more information, please
contact [email protected].
Artificial Neural Network
Concepts and Examples

Harcharan Singh Kabbay


M.A. Mathematics, University of Missouri-St. Louis, 2022

A Thesis Submitted to The Graduate School at the University of Missouri-St. Louis


in partial fulfillment of the requirements for the degree
Masters of Arts in Mathematics
with an emphasis in Data Science

August, 2022

Advisory Committee

Dr. Adrian Clingher, Ph.D


Chairperson
Dr. Qingtang Jiang, Ph.D
Dr. Haiyan Cai, Ph.D
Contents
I Math Foundations of Artificial Neural Networks
I.I Differences between Artificial Intelligence, Machine Learning and Deep-Learning
I.II What are Artificial Neural Networks?
I.III Supervised Learning

II Optimizing the Loss function
II.I Gradient Descent
II.I.1 Gradient Descent in context of a SL Parametric Model
II.II Newton-Raphson method

III Basic Architecture of Neural Networks
III.I Type of Neural Networks
III.II Forward Propagation
III.III Activation Functions
III.IV Derivatives of Activation Functions
III.V Cross-Entropy Loss
III.VI Back Propagation

IV Convolutional Neural Networks
IV.I Building Blocks of CNN
IV.II Visualizing the ConvNet Learning

V ConvNet by Example

Abstract
Artificial Neural Networks have gained much media attention in the last few years. Every day, numerous articles on Artificial Intelligence, Machine Learning, and Deep Learning are published. Both academia and business are becoming increasingly interested in deep learning. Deep learning has many uses, including autonomous driving, computer vision, robotics, security and surveillance, and natural language processing. The recent development and focus have primarily been made possible by the convergence of related research efforts and the introduction of APIs like Keras. The availability of high-speed compute resources such as GPUs and TPUs has also been instrumental in developing deep learning models.

While APIs like Keras offer a layer of abstraction and make model development convenient, the mathematical logic behind the working of Neural Networks is often misunderstood. This thesis describes the building blocks of a Neural Network in mathematical terms and formulas. It also details the core parts of Deep Learning algorithms, such as Forward Propagation, Gradient Descent, and Backpropagation.

The thesis briefly covers the basic operations in Convolutional Neural Networks and gives a working example of a multi-class classification problem using the Keras library in R. CNNs are a vast area of research in themselves, and covering all aspects of ConvNets is out of the scope of this thesis. However, it provides a foundation for understanding how Neural Networks work and how a CNN uses the building blocks of a basic Neural Network in an image classification problem.


I Math Foundations of Artificial Neural Networks


I.I Differences between Artificial Intelligence, Machine Learning and Deep-Learning
Before starting the research on Artificial Neural Networks, it is essential to understand the history and evolution of the various technologies related to Data Science. The field of Artificial Intelligence goes back to the 1950s, when John McCarthy, Assistant Professor of Mathematics at Dartmouth College, organized a summer workshop with fellow research scientists around the proposal that machines could be made to think and solve human intellectual tasks. Artificial Intelligence, or AI, is the attempt to automate intellectual work that people would otherwise perform.
Machine Learning (ML) is a subdomain of artificial intelligence (AI) that allows a machine to automatically find (learn) the statistical structure of data and transform such representations (patterns) to come closer to the intended output. The learning process is improved via a feedback channel that compares the expected and computed results.
A statistical model is a mathematical description of the relationship between multiple variables, whereas a machine learning model is an engine that can learn from data without being explicitly programmed. Statistics is a sub-domain of mathematics, whereas machine learning is a sub-domain of AI.
We talked about the learning process in Machine Learning, which finds the biases and weights that describe the transformation from the input data to the expected output. However, this technique may not be sufficient for advanced use cases like computer vision and speech recognition, which require extensive feature engineering. Deep Learning attempts to resolve this problem by providing multiple learning layers, with each layer computing its own representation from its own biases and weights through a process called forward propagation.

Figure 1: AI, ML, and DL

I.II What are Artificial Neural Networks?


Artificial neural networks (ANNs), also known as neural networks (NNs), are computer systems modeled after the biological neural networks that make up animal brains. Artificial neurons are linked units or nodes in an ANN that loosely replicate the neurons in a biological brain. Like synapses in the brain, each link may send a signal to other neurons.
A basic neural network has an input layer, a hidden layer, and an output layer. The input layer feeds the input features, and the hidden layer computes the activation function of the weighted input, which is then used to predict the output. The number of hidden layers can be expanded based on the requirements of the model. This number of hidden layers is also referred to as the depth of the model.

Figure 2: Deep Learning model with 2 hidden layers

I.III Supervised Learning


Supervised Learning generally refers to the machine learning process in which the training data set provides the input variables together with their associated target labels (1). To set some context for discussion, we can think of the training data set as:

$$(V_i, y_i), \quad i = 1, 2, 3, \ldots, m, \qquad \text{where } m \text{ is the size of the data set}$$

Each input $V_i$ is a numerical vector with $n$ dimensions, i.e., $n$ features:

$$V_i \in \mathbb{R}^n, \qquad V_i = (X_{1i}, X_{2i}, X_{3i}, \ldots, X_{ni}) \quad \text{(input features or variables)}$$

Each output label $y_i$ is a single numerical variable:

$$y = (y_1, y_2, y_3, \ldots, y_m), \qquad y_i \in \mathbb{R}$$

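| Input Var 1 | Input Var 2 | ... | Input Var n | Output label |
|-------------|-------------|-----|-------------|--------------|
| $X_{11}$    | $X_{21}$    | ... | $X_{n1}$    | $y_1$        |
| $X_{12}$    | $X_{22}$    | ... | $X_{n2}$    | $y_2$        |
| ...         | ...         | ... | ...         | ...          |
| $X_{1m}$    | $X_{2m}$    | ... | $X_{nm}$    | $y_m$        |

Table 1: Example - Training set for Supervised Learning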

Supervised Learning aims to find an estimator or predictor function $f$ for the output $y_i$, based on the given data set:

$$f : \mathbb{R}^n \to \mathbb{R}, \quad \text{such that } f(V_i) \approx y_i \text{ for } i = 1, 2, \ldots, m$$

The basic idea is to find a formula or function that generates the output label $y_i$ using the input features $V_i$. The process of finding the predictor function $f$ is often referred to as fitting or training the solution/model.

Supervised Learning is classified into three categories:-


1. Regression - The output $y_i$ may take a continuous range of values. An example is estimating real-estate prices based on a given training set.

2. Classification - The output $y_i$ belongs to a finite set of values. An example is a model that classifies facial images from an image set into categories of mood like happy, sad, angry, etc.

3. Binary Classification - This is a special case of Classification where the output $y_i$ is classified into two categories, e.g., Yes/No, 0/1, ±1. An example of Binary Classification is estimating whether a person has diabetes or not, based on historical training data for admitted patients.
The objectives of the Supervised Learning are as follows:-
• Prediction - Predict the output labels for unseen data with reasonable accuracy. The unseen data
is not part of the training set.

• Inference - Understanding the effect of each input feature and which one affects the most.

• Accuracy - Measure the quality of predictions and inferences.

Mathematical Explanation of Supervised Learning


So far, we have discussed the essential requirement of finding an estimator or predictor function whose value $f(v)$ is close to the output $y$. Written as an equation, the relationship is:

$$y = f(v) + \epsilon$$

Here $f(v)$ is the function to be estimated, $v$ is the input vector, and $y$ is the output. $\epsilon$ is the noise, a random numerical variable from the distribution $N(0, \sigma^2)$, with

$$\sigma^2 = \mathrm{Var}(\epsilon) \quad \text{(irreducible error)}$$

The goal is to find an estimator function $\hat{f}(v)$ close to $f(v)$, such that

$$\hat{f}(v) \approx f(v)$$

Figure 3 plots some data points $(V_i, y_i)$ with $i = 1, 2, \ldots, 30$. The true predictor in this example is

$$f(x) = \sin(x^3 + 1)$$

and the noise is $\epsilon \sim N(0, 0.01)$.

Figure 3: Data points and True predictor

In real-world scenarios, the true predictor is unknown, and we have to find the best estimator function $\hat{f}(v)$. There could be different candidates for $\hat{f}(v)$. Let us denote such a candidate estimator by

$$g : \mathbb{R}^n \to \mathbb{R}$$

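| Input Var 1 | Input Var 2 | ... | Input Var n | Output label | $g(V_i)$ | $[g(V_i) - y_i]^2$ |
|-------------|-------------|-----|-------------|--------------|----------|--------------------|
| $X_{11}$    | $X_{21}$    | ... | $X_{n1}$    | $y_1$        | $g(V_1)$ | $[g(V_1) - y_1]^2$ |
| $X_{12}$    | $X_{22}$    | ... | $X_{n2}$    | $y_2$        | $g(V_2)$ | $[g(V_2) - y_2]^2$ |
| ...         | ...         | ... | ...         | ...          | ...      | ...                |
| $X_{1m}$    | $X_{2m}$    | ... | $X_{nm}$    | $y_m$        | $g(V_m)$ | $[g(V_m) - y_m]^2$ |

Table 2: Square Errors in Predictors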

The average of all the square errors, or Mean Square Error (MSE), can be found as

$$MSE = \frac{1}{m} \sum_{i=1}^{m} [g(V_i) - y_i]^2$$

We can keep the MSE as a baseline to compare the quality of the estimator function. We try to find $\hat{f}(v)$, which minimizes the value of the MSE, also called a Cost function. Some other versions of the MSE are:

Root Mean Square Error (RMSE): $$\sqrt{\frac{1}{m} \sum_{i=1}^{m} [g(V_i) - y_i]^2}$$

Mean Absolute Error (MAE): $$\frac{1}{m} \sum_{i=1}^{m} |g(V_i) - y_i|$$

Mean Absolute Percentage Error (MAPE): $$\frac{1}{m} \sum_{i=1}^{m} \left| \frac{g(V_i) - y_i}{y_i} \right|$$
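As a minimal sketch, the four metrics above can be computed directly from their formulas. The prediction and target values below are hypothetical numbers chosen only for illustration, and MAPE is returned as a fraction and assumes that no target $y_i$ is zero.

```python
import math

def mse(pred, y):
    """Mean Square Error: average of the squared errors."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def rmse(pred, y):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(pred, y))

def mae(pred, y):
    """Mean Absolute Error: average of the absolute errors."""
    return sum(abs(p - t) for p, t in zip(pred, y)) / len(y)

def mape(pred, y):
    """Mean Absolute Percentage Error as a fraction (assumes no y_i is zero)."""
    return sum(abs((p - t) / t) for p, t in zip(pred, y)) / len(y)

# Hypothetical predictions g(V_i) and targets y_i, for illustration only.
pred = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 4.0, 8.0, 8.0]
print(mse(pred, y), rmse(pred, y), mae(pred, y), mape(pred, y))
```

Because the RMSE is a monotone function of the MSE, minimizing one also minimizes the other; the MAE and MAPE weight large errors less heavily.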
The terms Cost function and Loss function are sometimes used interchangeably, but the Loss function is calculated for each instance of the training data, whereas the Cost function is the average of the Loss function over all the samples of the training data. Some other Loss functions for regression problems are as follows:-

• Root Mean Square Logarithmic Error - Also referred to as RMSLE, this is an extension of RMSE. The formula of the RMSLE is as follows:-

$$RMSLE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} [\log(1 + g(V_i)) - \log(1 + y_i)]^2}$$

• Cosine Similarity - This is a metric used in data analysis to compare two numerical sequences. Suppose we have two non-zero vectors $\vec{a}$ and $\vec{b}$. Then the Cosine similarity between these two vectors is given by:

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$

The smaller the angle between the vectors, the higher the similarity.

• Log-Cosh error - The logarithm of the hyperbolic cosine of the prediction error is another measure of the quality of the predictor. The Log-Cosh loss is measured as:

$$L(y, y^p) = \sum_{i=1}^{m} \log(\cosh(y_i^p - y_i))$$
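The three measures in the list above can be sketched in a few lines of standard-library Python. The function names are illustrative choices, not part of the thesis; `rmsle` assumes all values are greater than -1 so that the logarithm is defined, and `cosine_similarity` assumes non-zero vectors.

```python
import math

def rmsle(pred, y):
    """Root Mean Square Logarithmic Error (requires values > -1)."""
    total = sum((math.log(1 + p) - math.log(1 + t)) ** 2 for p, t in zip(pred, y))
    return math.sqrt(total / len(y))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors a and b."""
    dot = sum(x * z for x, z in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(z * z for z in b))
    return dot / (norm_a * norm_b)

def log_cosh(pred, y):
    """Sum of log(cosh(prediction error)) over all records."""
    return sum(math.log(math.cosh(p - t)) for p, t in zip(pred, y))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(log_cosh([2.0], [1.0]))                     # log(cosh(1))
```

For small errors log(cosh(e)) behaves like e²/2 and for large errors like |e| - log 2, which is why the Log-Cosh loss is less sensitive to outliers than the squared error.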

The concept of MSE does not carry over well to Binary Classification, where we predict the output as 0 or 1, so the square-errors method does not make much sense here. Instead, we use the concept of the Maximum Likelihood Estimator.
Let us take the example of a Bernoulli distribution with a data set $(V_i, y_i)$, where $V_i \in \mathbb{R}^n$ and $y_i$ can take the value 0 or 1. We can represent $y_i$ as

$$y_i = \begin{cases} 1 & \text{with probability } p(V_i) \\ 0 & \text{with probability } 1 - p(V_i) \end{cases}$$

We have to find an optimal probability estimator $\hat{p}(v)$ to solve this problem. We can follow these guidelines to estimate $\hat{p}(v)$. Let us assume that there is some function $q : \mathbb{R}^n \to (0, 1)$ that could be a candidate for a good $\hat{p}(v)$. For a good probability estimator, the estimated value should be close to the true value:

If $y_i = 1$, $q(V_i)$ should be close to 1
If $y_i = 0$, $q(V_i)$ should be close to 0

Equivalently, it is desired to have $q(V_i)^{y_i} [1 - q(V_i)]^{1 - y_i}$ close to 1. Since we have $m$ records in the data set, we need to

$$\text{Maximize: } \prod_{i=1}^{m} q(V_i)^{y_i} [1 - q(V_i)]^{1 - y_i} \quad \text{(the likelihood, denoted MLE below)}$$

This product lies in $(0, 1)$, so its logarithm is negative. We therefore take the negative log, which turns the product into a sum and the maximization into a minimization:

$$-\log(MLE) = -\log\left[\prod_{i=1}^{m} q(V_i)^{y_i} [1 - q(V_i)]^{1 - y_i}\right]$$

$$-\log(MLE) = \sum_{i=1}^{m} \left[-y_i \log(q(V_i)) - (1 - y_i) \log(1 - q(V_i))\right] \quad \text{(Cross-Entropy function)}$$

The cost function, in this case, is $-\log(MLE)$, so we need to minimize it as part of the function/model training for binary classification. Categorical Cross-Entropy is a generalization of Binary Cross-Entropy, which is used for multi-class classification problems.
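The relationship between the likelihood and the Cross-Entropy function can be checked numerically. The sketch below uses hypothetical labels and two hypothetical candidate estimators, `q_good` and `q_poor`; it assumes every estimated probability lies strictly between 0 and 1.

```python
import math

def likelihood(q, y):
    """Product of q_i^y_i * (1 - q_i)^(1 - y_i) over all records."""
    prod = 1.0
    for p, t in zip(q, y):
        prod *= p ** t * (1 - p) ** (1 - t)
    return prod

def binary_cross_entropy(q, y):
    """Negative log-likelihood, written as a sum over records."""
    return sum(-t * math.log(p) - (1 - t) * math.log(1 - p) for p, t in zip(q, y))

# Hypothetical labels and estimated probabilities q(V_i), for illustration only.
y = [1, 0, 1]
q_good = [0.9, 0.2, 0.7]   # probabilities close to the labels
q_poor = [0.6, 0.5, 0.5]   # less confident probabilities

print(likelihood(q_good, y), binary_cross_entropy(q_good, y))
print(likelihood(q_poor, y), binary_cross_entropy(q_poor, y))
```

The estimator whose probabilities are closer to the labels has the larger likelihood and therefore the smaller cross-entropy, which is the quantity the training minimizes.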
Other Loss functions for classification problems are as follow:-
• Hinge Loss - This loss function is mostly used in the classification problem in Support Vector
Machines (SVMs). The formula for Hinge Loss is as follow:-
L(y) = max(0, 1 − t.y)
Here, t = ±1 is the intended output for the classifier and y = w.x + b.
• Huber Loss - The Huber Loss function combines the MSE and the MAE. It is less sensitive to outliers than the MSE and is also differentiable at zero.

$$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2} (y - f(x))^2, & \text{for } |y - f(x)| \le \delta, \\ \delta |y - f(x)| - \frac{1}{2} \delta^2, & \text{otherwise} \end{cases}$$
• Kullback-Leibler Divergence - This measures how much one distribution differs from another, i.e., the existing distribution $p$ and the predicted data distribution $q$. The discrimination information of $p$ relative to $q$ is given as

$$D_{KL}(p \| q) = \sum_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right)$$
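The three classification losses above can be sketched as follows. The default `delta=1.0` and the sample inputs are illustrative assumptions; `kl_divergence` expects two discrete distributions over the same outcomes, with $q(x) > 0$ wherever $p(x) > 0$.

```python
import math

def hinge(t, y):
    """Hinge loss for target t in {-1, +1} and raw classifier score y = w.x + b."""
    return max(0.0, 1.0 - t * y)

def huber(y, fx, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = abs(y - fx)
    if r <= delta:
        return 0.5 * r * r
    return delta * r - 0.5 * delta * delta

def kl_divergence(p, q):
    """Kullback-Leibler divergence between discrete distributions p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(hinge(1, 2.0), hinge(-1, 0.5))     # correct with margin vs. wrong side
print(huber(0.0, 0.5), huber(0.0, 3.0))  # quadratic branch vs. linear branch
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

The two Huber branches agree at $|y - f(x)| = \delta$, so the loss is continuous there. The KL divergence is zero when the two distributions coincide and is not symmetric in $p$ and $q$.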

Parametric Models
We discussed the guidelines for finding a predictor function for regression and for binary classification problems. The predictor function is chosen from a specific set of functions that are indexed by parameters $\theta = (\theta_0, \theta_1, \ldots)$. The parameter vector $\theta$ holds the biases and weights used to compute the output labels from the input vectors. Here is an example: suppose we have a Hypothesis set $H$, which contains the candidate predictor functions $f(V)$, and suppose we have two variables in the input vector, i.e., $x_1, x_2$. Let $H$ be the set of polynomial functions $f(x_1, x_2)$ of degree $d$.

$$d = 1: \quad f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2, \qquad \theta = (a_0, a_1, a_2) \in \mathbb{R}^3$$

$$d = 2: \quad f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 + a_4 x_1^2 + a_5 x_2^2, \qquad \theta = (a_0, a_1, a_2, a_3, a_4, a_5) \in \mathbb{R}^6$$

As part of the training, the estimator functions from the Hypothesis set are evaluated against the provided training set.

Figure 4: Overview of Parametric Model

The learning algorithm uses the Cost function to measure the closeness of the estimated value to the actual value. Some of the standard loss functions are Mean Square Error and Cross-Entropy. Once the loss is calculated, the learning algorithm modifies the weights and biases (parameters) to find the optimal $\hat{\theta}$ through optimization techniques like Gradient Descent. A high-level learning workflow is shown in Figure 5.

Figure 5: Overview of Training cycle

Besides the regular parameters, another set of parameters called Hyper-parameters also impacts the learning of a training algorithm. Hyper-parameters include, but are not limited to, the number of neurons per layer, the learning rate, etc.
As part of the model learning process, we capture the validation metrics, i.e., the model's performance on the validation data set, which is separate from the training set. In regression problems, a loss function is used to calculate the validation error. The loss function is evaluated on the validation data set using the optimal (trained) parameters $\hat{\theta}$.

In classification problems, Accuracy is commonly used as a validation metric. Accuracy is calculated as follows:-

$$accuracy = \frac{1}{l} \sum_{i=1}^{l} I(\hat{f}(v_i) = y_i), \qquad \text{where } l \text{ is the size of the validation set}$$

$I(\hat{f}(v) = y)$ is the Indicator function, where $I(true) = 1$ and $I(false) = 0$. Accuracy is the fraction of validation records that the algorithm classifies correctly. The validation error is computed as:

$$validation\ error = 1 - accuracy$$
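A minimal sketch of the accuracy calculation follows, using hypothetical predicted and true labels for a validation set of size $l = 5$.

```python
def accuracy(pred, y):
    """Fraction of records where the predicted label equals the true label."""
    return sum(1 for p, t in zip(pred, y) if p == t) / len(y)

# Hypothetical predicted and true labels on a validation set, for illustration only.
pred = [1, 0, 1, 1, 0]
y = [1, 0, 0, 1, 1]

acc = accuracy(pred, y)
validation_error = 1 - acc
print(acc, validation_error)
```

Three of the five predictions match the true labels here, so the accuracy is 3/5 and the validation error is 2/5.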

There are two common problems in the life-cycle of model training:


• Over-fitting - Over-fitting happens when the model is complex to the extent that it learns both the details of the data and the noise. As a result, the model performs poorly when predicting the output of the validation data and of any new data. To fix this problem, we need to generalize the model so that it ignores the noise and focuses only on the data patterns.
• Under-fitting - Under-fitting is the problem where the model can neither learn the training data nor predict the validation data. This means the model is unsuitable for learning and needs to be optimized.

II Optimizing the Loss function
We discussed the Loss function and how it measures the quality of a predictor or estimator. The important piece is how to find the optimal values of the weights and biases that reduce this loss. This mathematical optimization problem, minimizing a function $J(x_1, x_2, \ldots, x_n)$, has two components:

1. Determine if an actual minimum exists for this function.

2. If a minimum exists, find the inputs where this minimum is attained.

One common issue in the optimization problem is distinguishing between local minima and the global minimum. To illustrate, we plot (Figure 6) the values of $x^2 + 10 \sin(x)$ for $x$ ranging from -10 to 10. In the given range of $x$ values, this function has one global minimum and at least one other local minimum.

Figure 6: Local and Global minima
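The minima of $x^2 + 10 \sin(x)$ on $[-10, 10]$ can be located with a simple grid scan. This is a rough numerical sketch, not the method used to produce the figure; the step size of 0.001 is an arbitrary choice, and only interior grid points that are lower than both neighbours are reported.

```python
import math

def J(x):
    return x * x + 10.0 * math.sin(x)

def local_minima(f, lo, hi, step=0.001):
    """Return interior grid points whose value is below both neighbours."""
    n = int(round((hi - lo) / step))
    xs = [lo + i * step for i in range(n + 1)]
    ys = [f(x) for x in xs]
    return [xs[i] for i in range(1, n) if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]]

minima = local_minima(J, -10.0, 10.0)
global_min = min(minima, key=J)   # the local minimum with the lowest value of J
for x in minima:
    print(round(x, 3), round(J(x), 3))
print("global minimum near x =", round(global_min, 3))
```

The scan finds one minimum near $x \approx -1.31$, where $J \approx -7.95$, and a second, higher one near $x \approx 3.84$, where $J \approx 8.32$. An algorithm that only looks at the neighbourhood of its current point cannot tell these two apart.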

The solution of the minimization problem is the point $x_0$ at which the global minimum is attained. The minimum value of the function $J$ is $J(x_0)$.

The complexity of the minimization problem increases as the number of variables in the Loss function $J$ increases. Here is an example plot for a two-variable function $J(x_1, x_2)$.

Figure 7: Local and Global minimum for $J(x_1, x_2)$ (Reference 8)

A general theoretical procedure using Calculus to minimize a multivariate loss function $J(x_1, x_2, \ldots, x_n)$ is as follows:-

1. Compute all the partial derivatives

$$\frac{\partial J}{\partial x_i} \quad \text{for } i = 1, 2, \ldots, n$$

2. Construct the gradient

$$\nabla J = \left( \frac{\partial J}{\partial x_1}, \frac{\partial J}{\partial x_2}, \ldots, \frac{\partial J}{\partial x_n} \right)$$

3. Solve the equation

$$\nabla J = 0$$

Here 0 on the right-hand side of the above equation represents the n-dimensional zero vector. The solutions are the critical points of $J$, i.e., local minima, local maxima, and saddle points.

4. Classify the critical points using the second derivative test. This involves $n^2$ second-order partial derivatives.

5. Evaluate $J$ at the local minima identified by the second derivative test and compare the values to find the global minimum.

In general Machine Learning problems, the equation ∇J = 0 is quite challenging to solve. Some
numerical methods of minimization have been developed to approximate the global minimum of the
function J . We are going to discuss two commonly used algorithms, namely:

• Gradient Descent Method

• Newton-Raphson Method

II.I Gradient Descent


Gradient Descent is an Optimization Algorithm that helps a learning model achieve its objective of minimizing the loss function. The word descent refers to moving downhill on the Loss function curve, toward its lowest point. Suppose we have the loss function represented by $J(x_1, x_2, \ldots, x_n)$. Assume we start with a random point $v_0$. If we get $\nabla J(v_0) = 0$, then by coincidence we have landed on a critical point.
In the normal case, where $\nabla J(v_0) \neq 0$, the question is in which direction we should move from $v_0$ in order to decrease $J$ and approach a minimum.

Figure 8: Gradient Descent direction (Reference 1)

Let us assume we choose a direction given by a unit vector $h \in \mathbb{R}^n$ and a step $t \geq 0$, $t \in \mathbb{R}$. The variation in the vector $v$ is given by $v = v_0 + th$. Now, we would like to know the impact of this variation on $J(v_0 + th)$, which is measured by the derivative of this one-variable function w.r.t. $t$, i.e., $\frac{dJ(v_0 + th)}{dt}$. Using the multi-variable chain rule,

$$\frac{dJ(v_0 + th)}{dt} = \sum_{i=1}^{n} \frac{\partial J}{\partial x_i}(v_0 + th) \cdot h_i = \langle \nabla J(v_0 + th), h \rangle$$

i.e., the dot product of the gradient of $J$ and the direction vector. More specifically, we are interested in the instantaneous rate of change at $v_0$, i.e., at $t = 0$:

$$\left. \frac{dJ(v_0 + th)}{dt} \right|_{t=0} = \langle \nabla J(v_0), h \rangle$$

If this derivative is negative, $J$ decreases as we move in the direction $h$. The direction that makes $\langle \nabla J(v_0), h \rangle$ as negative as possible is the direction of the fastest descent.
Treating this as a Linear Algebra problem, we have a fixed vector $\nabla J(v_0)$ and a varying unit direction vector $h$. We need to find the direction $h$ that minimizes the dot product $\langle \nabla J(v_0), h \rangle$. Since $\|h\| = 1$, the cosine of the angle $\theta$ between these two vectors is given by

$$\cos(\theta) = \frac{\langle \nabla J(v_0), h \rangle}{\|\nabla J(v_0)\|}$$

The denominator represents the length of the vector $\nabla J(v_0)$, which is fixed, so minimizing $\cos(\theta)$ minimizes the dot product $\langle \nabla J(v_0), h \rangle$. The value of $\cos(\theta)$ lies between $+1$ and $-1$. Going with the minimum value, we pick $\cos(\theta) = -1$, which means

$$\theta = \pi = 180^\circ$$

With this information, we decide to move in the direction opposite to the direction of the gradient.

Figure 9: Direction of the movement

$$h = -\frac{1}{\|\nabla J(v_0)\|} \nabla J(v_0)$$

Gradient Descent Algorithm


The Gradient Descent Algorithm follows the steps:

1. Initialize starting vector - Select a random vector $v_0 = (x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)})$ in the domain of $J$

2. Perform iterations to minimize the value of $J$ - Let us assume that from a prior iteration we have
$v_k = (x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)})$
Find $\nabla J(v_k)$
Define $v_{k+1} = v_k - \alpha \nabla J(v_k)$

Possibilities for choosing the learning rate $\alpha$:

• Fixed Learning rate $\alpha$ - The learning rate $\alpha$ is chosen at the beginning and remains the same throughout all the iterations.
• Exact line search - The learning rate $\alpha$ is chosen at each iteration by minimizing the one-variable function
$$t \mapsto J(v_k - t \nabla J(v_k))$$
• Backtracking line search - Starts from an initial guess for $\alpha$ and shrinks it by a constant factor until $J$ decreases sufficiently. This is a standard technique from convex optimization.

During this phase, the value of the loss function $J(v_k)$ reduces in each iteration, such that
$$J(v_0) > J(v_1) > J(v_2) > \cdots > J(v_k) > J(v_{k+1}) > \ldots$$

This descent can be shown for a two-variable function as follows:-

Figure 10: Loss decreases on each iteration to reach a global minimum (Reference 6)

3. Termination - Per a mathematical theorem (Convex Optimization, Boyd and Vandenberghe 2021), under certain convergence conditions that depend upon the learning rate $\alpha$:

(a) The norm of the gradient, i.e., $\|\nabla J(v_k)\|$, is decreasing and

$$\lim_{k \to \infty} \|\nabla J(v_k)\| = 0$$

(b) $|J(v_k) - J(v^*)| < C \|\nabla J(v_k)\|^2$, where $v^*$ is a local minimum for $J$, and $C$ is a constant.

We set a small value $\epsilon > 0$ as the acceptable error. The Gradient Descent Algorithm stops when $\|\nabla J(v_k)\| < \epsilon$. Then, we have

$$|J(v_k) - J(v^*)| < C \epsilon^2$$
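The three steps above (initialization, iteration with a fixed learning rate, and termination on the gradient norm) can be sketched as follows. The loss $J(x_1, x_2) = (x_1 - 1)^2 + 2(x_2 + 2)^2$, the starting point, and the values `alpha=0.1` and `eps=1e-6` are illustrative assumptions, not values from the thesis.

```python
import math

def grad_J(v):
    """Gradient of the hypothetical loss J(x1, x2) = (x1 - 1)^2 + 2 (x2 + 2)^2."""
    x1, x2 = v
    return [2.0 * (x1 - 1.0), 4.0 * (x2 + 2.0)]

def gradient_descent(grad, v0, alpha=0.1, eps=1e-6, max_iter=10_000):
    """Fixed-learning-rate gradient descent; stops when the gradient norm < eps."""
    v = list(v0)
    for k in range(max_iter):
        g = grad(v)
        if math.sqrt(sum(gi * gi for gi in g)) < eps:
            return v, k                                  # termination condition met
        v = [vi - alpha * gi for vi, gi in zip(v, g)]    # v_{k+1} = v_k - alpha * grad
    return v, max_iter

v_min, iterations = gradient_descent(grad_J, [5.0, 5.0])
print(v_min, iterations)
```

For this convex loss the iterates approach the unique minimum at $(1, -2)$. A much larger `alpha` (for example 0.6) would make the $x_2$ coordinate overshoot and diverge, which illustrates why the convergence conditions depend on the learning rate.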

Observations
Some important observations around the characteristics and use of Gradient Descent are as follows:-

• Iteration methods like exact line search and backtracking line search are computationally expensive compared to the use of a fixed learning rate $\alpha$.

• The fixed learning rate $\alpha$, if chosen too small, can make the algorithm very slow to reach a solution. Choosing $\alpha$ too large can cause the iterates to overshoot the minimum, so that the algorithm converges poorly or not at all.

Figure 11: Impact of learning rate on convergence (Reference 6)

• Gradient Descent Algorithm can get stuck in a local minimum. To address this, we should run
the Gradient Descent multiple times, with a different initialization vector v0 . The solution is
chosen as the value of v that produces the lowest value for J (v).

II.I.1 Gradient Descent in context of a SL Parametric Model


Let us consider a Supervised-Learning parametric model with $s$ parameters, such that $\theta = (\theta_1, \theta_2, \ldots, \theta_s) \in \mathbb{R}^s$, and a Loss function denoted as $J(\theta)$.
To calculate the Loss function, we average the per-record loss $Q_\theta$ over all the pairs $(v_t, y_t)$ in the training set, where $t = 1, \ldots, m$ and $m$ is the total number of records.

$$J(\theta) = \frac{1}{m} \sum_{t=1}^{m} Q_\theta(v_t, y_t)$$

In the case of Gradient Descent, at each iteration we compute the gradient of the function $J$, which is

$$\nabla J(\theta) = \frac{1}{m} \sum_{t=1}^{m} \nabla Q_\theta(v_t, y_t)$$

Since this involves all the records in the training set, and hence a large sum, doing this at each iteration would be computationally very expensive. To overcome this problem, there are some variants of Gradient Descent, which are listed as follows:-

Stochastic Gradient Descent


Stochastic Gradient Descent is similar to Gradient Descent except for a slight change to the iteration step. The iteration in Batch Gradient Descent is as follows:-

$$\theta_{k+1} = \theta_k - \alpha \nabla J(\theta_k) = \theta_k - \frac{\alpha}{m} \sum_{t=1}^{m} \nabla Q_{\theta_k}(v_t, y_t)$$

An iteration in Stochastic Gradient Descent looks as below:-

1. Set $w := \theta_k$

2. Reshuffle the $m$ pairs in the training data set

3. for $t = 1$ to $m$ do:

$$w := w - \frac{\alpha}{m} \nabla Q_w(v_t, y_t)$$

4. $\theta_{k+1} := w$

In Batch Gradient Descent, we compute the entire gradient sum and then update the parameter vector once. In the case of Stochastic Gradient Descent (SGD), we update the parameter vector after each individual term of the gradient sum.
From a computational point of view, SGD often makes faster progress, but its updates are noisy, which can make the convergence harder to control.

Mini-Batch Gradient Descent


Mini-Batch Gradient Descent is a compromise between the Batch Gradient Descent and the Stochastic Gradient Descent methods. In this, we randomly split the training data set into mini-batches of size $b$ (usually $50 \le b \le 256$). Suppose we split the $m$ training records into $l$ batches as

$$\{1, 2, \ldots, m\} = B_1 \cup B_2 \cup \cdots \cup B_l, \qquad |B_q| \le b \text{ where } q = 1, 2, \ldots, l$$

The iteration for Mini-Batch Gradient Descent is as follows:-
1. Set $w := \theta_k$

2. Reshuffle the $m$ pairs in the training data set

3. for $q = 1$ to $l$ do:

$$w := w - \frac{\alpha}{m} \sum_{t \in B_q} \nabla Q_w(v_t, y_t)$$

4. $\theta_{k+1} := w$
In this case, we work in mini-batches, i.e., for every mini-batch from 1 to $l$, we compute the gradient sum over that batch and update the parameter vector. This is more stable than SGD and is considered well suited to training Artificial Neural Networks on large training data sets.
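Batch, Stochastic, and Mini-Batch Gradient Descent differ only in how many records feed each parameter update, so one sketch with a `batch_size` argument can cover all three. Everything specific here is an assumption made for illustration: the model is a straight line $w_0 + w_1 v$ with the per-record loss $Q_w(v, y) = (w_0 + w_1 v - y)^2$, the data are noise-free points on $y = 1 + 2v$, and `alpha`, `epochs`, and `seed` are arbitrary. Each update is scaled by $\alpha / m$, so one full pass over the data is comparable to one batch step.

```python
import random

def grad_Q(w, v, y):
    """Gradient of the per-record loss Q_w(v, y) = (w0 + w1*v - y)^2."""
    r = w[0] + w[1] * v - y
    return [2.0 * r, 2.0 * r * v]

def run_epoch(w, data, alpha, batch_size, rng):
    """One pass over the reshuffled data, with one update of w per batch."""
    m = len(data)
    shuffled = data[:]
    rng.shuffle(shuffled)
    w = list(w)
    for start in range(0, m, batch_size):
        g = [0.0, 0.0]
        for v, y in shuffled[start:start + batch_size]:
            gq = grad_Q(w, v, y)
            g[0] += gq[0]
            g[1] += gq[1]
        w = [w[0] - alpha / m * g[0], w[1] - alpha / m * g[1]]
    return w

def train(data, batch_size, alpha=0.5, epochs=500, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(epochs):
        w = run_epoch(w, data, alpha, batch_size, rng)
    return w

# Hypothetical noise-free training set on the line y = 1 + 2v.
data = [(i / 19.0, 1.0 + 2.0 * (i / 19.0)) for i in range(20)]
for b in (len(data), 5, 1):   # batch GD, mini-batch GD, SGD
    print(b, train(data, b))
```

With `batch_size` equal to the data set size this is Batch Gradient Descent, with `batch_size=1` it is SGD, and values in between give Mini-Batch Gradient Descent. On this small noise-free example all three settle near $w = (1, 2)$; the differences in stability described above show up on larger, noisy data sets.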

II.II Newton-Raphson method


The Newton-Raphson method is named after Isaac Newton (1642-1726) and Joseph Raphson (1648-1715). This is another method for mathematical optimization with the goal of minimizing a given differentiable function $J : \mathbb{R}^n \to \mathbb{R}$. We have a function of $n$ variables $J(x_1, x_2, \ldots, x_n)$, and we want to approximate a global minimum $v^* = (x_1^*, x_2^*, \ldots, x_n^*) \in \mathbb{R}^n$ such that

$$J(v^*) \le J(v) \text{ for all } v \in \mathbb{R}^n$$

The basic strategy is to approximate the critical points, i.e., the solutions to the equation

$$\nabla J = 0 \quad \text{(zero vector in } \mathbb{R}^n \text{)}$$

For mathematical convenience, we refer to this gradient as a function $g$

$$g = \nabla J : \mathbb{R}^n \to \mathbb{R}^n$$

Then, using the standard calculus method, we evaluate these critical points for the function $J$ to find out which one is a global minimum. The Newton-Raphson method enables us to numerically solve a system of equations:

$$g(x_1, x_2, \ldots, x_n) = 0, \qquad g : \mathbb{R}^n \to \mathbb{R}^n$$

These are $n$ non-linear equations, i.e.,

$$g_1(x_1, x_2, \ldots, x_n) = 0$$

$$g_2(x_1, x_2, \ldots, x_n) = 0$$

$$\vdots$$

$$g_n(x_1, x_2, \ldots, x_n) = 0$$

Newton Raphson method - Case(n=1)


To understand this method, we start with the case of $n = 1$, i.e., $g(x) = 0$. There is only one input to the function $g$ and one output. This method is commonly referred to as Newton's method. The basic algorithm works as follows:-

1. Choose an initial value $x_0$ as an approximation of the solution $x^*$

2. Create a recursive sequence

$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}$$

3. Per a mathematical theorem (Convex Optimization, Boyd and Vandenberghe 2021), if the initial value $x_0$ is chosen properly, the sequence $(x_k)_{k \ge 0}$ will converge, with

$$\lim_{k \to \infty} x_k = x^*$$

Why does Newton’s method work?


It might be difficult to solve $g(x) = 0$ directly, but we can use the Taylor expansion of $g(x)$ around $x_0$, i.e.,

$$g(x) = g(x_0) + g'(x_0)(x - x_0) + \frac{g''(x_0)}{2}(x - x_0)^2 + \ldots$$

Setting the first-order approximation of $g(x)$ to zero, we have

$$g(x_0) + g'(x_0)(x - x_0) = 0$$

Solving for $x$ gives

$$x = x_0 - \frac{g(x_0)}{g'(x_0)}$$

On each iteration, we improve the approximation of $x^*$ and move closer to the exact root.

Figure 12: Convergence using Newton-Raphson method (Reference 7)
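The recursion $x_{k+1} = x_k - g(x_k)/g'(x_k)$ can be sketched for a hypothetical example, $g(x) = x^2 - 2$, whose positive root is $\sqrt{2}$. The stopping rule (change between successive iterates below `eps`) and the iteration cap are illustrative choices.

```python
def newton(g, dg, x0, eps=1e-12, max_iter=50):
    """Newton's method for g(x) = 0 in one variable."""
    x = x0
    for _ in range(max_iter):
        d = dg(x)
        if d == 0:
            # The update divides by g'(x_k), so a zero derivative is fatal.
            raise ZeroDivisionError("g'(x_k) = 0; choose a different x0")
        x_new = x - g(x) / d
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    return x

def g(x):
    return x * x - 2.0    # hypothetical example with root sqrt(2)

def dg(x):
    return 2.0 * x

root = newton(g, dg, 1.0)
print(root)
```

Starting from $x_0 = 1$, the iterates are 1.5, 1.41667, 1.41422, and so on, with the number of correct digits roughly doubling at each step. Starting from $x_0 = 0$ would hit a zero derivative immediately.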

Some possible issues with the Newton-Raphson method are as follows:-


1. We might get a derivative $g'(x_k) = 0$, and this term appears in the denominator when computing $x_{k+1}$, i.e.,
$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}$$
This division by zero would cause the algorithm to fail.
2. Slow convergence: the sequence $(x_k)_{k \ge 1}$ converges slowly or does not converge.
3. The equation $g(x) = 0$ could have multiple solutions. You may end up capturing a solution other than the intended one, typically one that is close to the initialization value.
All these issues can often be addressed by re-setting the initialization value $x_0$ and re-running the algorithm.

Newton Raphson method - Case(n > 1)


The Newton-Raphson method for the case $n = 1$ can be generalized to find a solution in the multi-variable case. We use the concept of the Jacobian matrix of $g$, whose rows are the gradient vectors $\nabla g_1, \nabla g_2, \ldots, \nabla g_n$ of the component functions $g_i : \mathbb{R}^n \to \mathbb{R}$:

$$M_g = \begin{pmatrix} \frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \cdots & \frac{\partial g_2}{\partial x_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial g_n}{\partial x_1} & \frac{\partial g_n}{\partial x_2} & \cdots & \frac{\partial g_n}{\partial x_n} \end{pmatrix} \tag{1}$$

It is worth noting that this is a matrix-valued function, because if we change the inputs $x_1, x_2, \ldots, x_n$, the partial derivatives change and hence the whole Jacobian matrix changes.
Our goal is to find a solution $v^* \in \mathbb{R}^n$ for the equation $g(v) = 0$. The algorithm follows these steps:-

1. Choose an initial vector $v_0$ close to where we believe $v^*$ should be.
2. Perform iterations to calculate
$$v_{k+1} = v_k - [M_g(v_k)]^{-1} g(v_k)$$
where $[M_g(v_k)]^{-1}$ is the inverse of the $n \times n$ matrix $M_g(v_k)$. We use this formula to find the values $v_1, v_2, v_3, \ldots$
The rationale for the Newton-Raphson method is that there is a neighborhood $U$ of $v^*$ such that, if $v_0 \in U$, we have:-
• The Jacobian matrix $M_g(v_k)$ is non-singular for any $k \ge 0$.
• The sequence $(v_k)_{k \ge 0}$ is convergent and $\lim_{k \to \infty} v_k = v^*$.
3. Terminate the algorithm once the estimated error $\|v_k - v^*\|$ falls below the permitted error.
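A small sketch of the multi-variable iteration follows, for a hypothetical system with $n = 2$: $g_1 = x^2 + y^2 - 4$ and $g_2 = x - y$, whose solution in the positive quadrant is $(\sqrt{2}, \sqrt{2})$. Instead of forming the inverse Jacobian explicitly, the code solves the $2 \times 2$ linear system $M_g(v_k) \, s = g(v_k)$ by Cramer's rule, which gives the same step. Since $v^*$ is unknown in practice, the sketch stops when the step size falls below `eps`, which is an assumed stopping rule.

```python
def g(v):
    """Hypothetical system: g1 = x^2 + y^2 - 4, g2 = x - y."""
    x, y = v
    return [x * x + y * y - 4.0, x - y]

def jacobian(v):
    """Rows are the gradients of g1 and g2."""
    x, y = v
    return [[2.0 * x, 2.0 * y], [1.0, -1.0]]

def solve2(M, b):
    """Solve the 2x2 linear system M s = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0:
        raise ZeroDivisionError("singular Jacobian; choose a different v0")
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

def newton_system(g, jac, v0, eps=1e-12, max_iter=50):
    v = list(v0)
    for _ in range(max_iter):
        step = solve2(jac(v), g(v))           # step = [M_g(v_k)]^{-1} g(v_k)
        v = [v[0] - step[0], v[1] - step[1]]
        if max(abs(s) for s in step) < eps:   # stop when the update is tiny
            break
    return v

solution = newton_system(g, jacobian, [1.0, 2.0])
print(solution)
```

From the start $(1, 2)$ the first step lands on $(1.5, 1.5)$ and the second on approximately $(1.41667, 1.41667)$, after which the iterates converge rapidly to $(\sqrt{2}, \sqrt{2})$.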

How does Newton-Raphson method apply to optimization?


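In order to solve the optimization problem, we need to find the critical points, i.e., the solutions of $\nabla J = 0$. As mentioned earlier, $J : \mathbb{R}^n \to \mathbb{R}$. This means that $\nabla J : \mathbb{R}^n \to \mathbb{R}^n$.
We have

$$g = \nabla J = \left( \frac{\partial J}{\partial x_1}, \frac{\partial J}{\partial x_2}, \ldots, \frac{\partial J}{\partial x_n} \right) \tag{2}$$

So, the Jacobian matrix of $g$ consists of the second-order partial derivatives of $J$

$$M_g = \begin{pmatrix} \frac{\partial^2 J}{\partial x_1^2} & \frac{\partial^2 J}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 J}{\partial x_1 \partial x_n} \\ \frac{\partial^2 J}{\partial x_2 \partial x_1} & \frac{\partial^2 J}{\partial x_2^2} & \cdots & \frac{\partial^2 J}{\partial x_2 \partial x_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial^2 J}{\partial x_n \partial x_1} & \frac{\partial^2 J}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 J}{\partial x_n^2} \end{pmatrix} \tag{3}$$

The above matrix is also called the Hessian matrix of $J$ and is denoted by $H_J$.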

Observations
Some important observations on the Hessian matrix $H_J$ are as follows:-
• The Hessian $H_J$ is a symmetric matrix (when the second-order partial derivatives of $J$ are continuous), i.e., it is a square matrix that is equal to its transpose.
• If $H_J$ is positive semi-definite everywhere, then $J$ is a convex function, hence a good candidate for minimization. For a convex function, every local minimum is a global minimum.
The algorithm works as follows to approximate the critical points:-
1. Choose an initial vector $v_0 = (x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)})$.
2. Perform iterations to calculate
$$v_{k+1} = v_k - [H_J(v_k)]^{-1} \nabla J(v_k)$$
where $[H_J(v_k)]^{-1}$ is the inverse of the $n \times n$ matrix $H_J(v_k)$. On each iteration, we should get closer to $v^*$.

19
3. Terminate the algorithm estimating the permitted error ∥vk − v ∗ ∥

The iteration process in the Newton-Raphson method is expensive, it requires calculating second-
order partial derivatives as compared to first-order derivatives used in the Gradient Descent method.
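To make the optimization variant concrete, the sketch below applies the update vk+1 = vk − [HJ (vk )]−1 ∇J (vk ) to a small convex function. The objective J (x1 , x2 ) = exp(x1 ) − x1 + x2² is an assumed example chosen because its minimum at (0, 0) is known; the function names and tolerance are likewise hypothetical.

```r
# Illustrative convex objective (assumed): J(x1, x2) = exp(x1) - x1 + x2^2,
# whose unique minimum is at (0, 0).
grad_J <- function(v) c(exp(v[1]) - 1, 2 * v[2])
hess_J <- function(v) rbind(c(exp(v[1]), 0),
                            c(0,         2))

newton_minimize <- function(grad, hess, v0, tol = 1e-10, max_iter = 50) {
  v <- v0
  for (k in seq_len(max_iter)) {
    step <- solve(hess(v), grad(v))      # [H_J(v_k)]^{-1} %*% grad J(v_k)
    v <- v - step
    if (sqrt(sum(step^2)) < tol) break   # stop when the update is negligible
  }
  v
}

v_min <- newton_minimize(grad_J, hess_J, v0 = c(1, 1))
print(v_min)   # approaches (0, 0)
```

Each iteration needs the full Hessian and a linear solve, which illustrates why the method becomes costly as the number of parameters grows.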

III Basic Architecture of Neural Networks
Neural Networks are capable of dealing with both structured and unstructured data. Structured data is highly organized, e.g., tabular data stored in relational databases, CSV files, data frames, etc. Unstructured data, also categorized as qualitative data, comes in different forms like videos, pictures, emails, social media posts, IoT sensor data, etc.
The basic architecture of a Neural Network consists of an input layer, one or more hidden layers, and an output layer.

Figure 13: Basic Neural Network Architecture

In Figure 13, the Neural network has two input variables X = [x1 , x2 ] and a hidden layer with
four neurons (nodes), which uses the inputs and weights (w1, w2, . . . , w8) to compute the activation
function. The output from the hidden layer passes to the output layer to predict the output ŷ. The
difference between the predicted and actual output is used to calculate the cost function. The cost
function is minimized by adjusting the weights after each iteration. Gradients are used to update the
weights, and the next iteration starts. This process repeats to get closer to the optimal parameter
weights.

III.I Type of Neural Networks


Neural Networks can be classified into three main categories based on their architecture. Each category is briefly described below:-

Feed Forward Neural Networks (FFNNs)


Feed Forward Neural Networks are the most common type of neural network, where information flows in one direction, starting from the input layer and passing through the hidden layers to the output layer. The connections between the nodes do not form a cycle. Each hidden layer transforms its inputs, resulting in a new representation at each successive layer. Some common non-linear transformations applied to the weighted inputs include sigmoid, tanh, ReLU, SELU, softmax, etc. Feed Forward Neural Networks can be used for classification use-cases and in unsupervised learning as auto-encoders.

Figure 14: A basic FFNN architecture

Convolutional Neural Networks (ConvNets or CNNs)


Convolutional Neural Networks use convolutional layers to perform hierarchical feature extraction. One key difference between FFNNs and CNNs is that only the final layer(s) in a CNN are fully connected. In an FFNN, each layer is a fully connected layer, i.e., every node of a layer is connected to each node in the adjacent layer. CNNs or ConvNets are commonly used for image classification, image segmentation, object detection, and text classification.

Figure 15: A basic CNN architecture

Recurrent Neural Networks (RNNs)


Recurrent Neural Networks have one hidden layer per time slice, which enables them to retain information from earlier time steps. In comparison to other Neural Networks, RNNs are difficult to train. Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs) are two variants of Recurrent Neural Networks.

Figure 16: A basic RNN architecture - References 1

III.II Forward Propagation


Forward propagation is one of the core steps of the learning phase. The input data moves forward through the successive hidden layers, and a data transformation happens at each hidden layer. The output of each layer feeds as input into the next layer.

Figure 17: Forward propagation

The above diagram shows two adjacent layers in a NN model: layer l of size $s_l$ and layer l + 1 of size $s_{l+1}$. All the nodes in layer l are connected to the nodes of the successive layer l + 1. The notation is as follows:-

• Each neuron/node is represented as $a_i^{(l)}$, where the subscript i is the index of the node and the superscript (l) is the index of the layer.

• Weights are represented as $\theta_{ij}^{(l)}$, where j is the starting node in layer l and i is the target node in layer l + 1:

$$a_j^{(l)} \xrightarrow{\;\theta_{ij}^{(l)}\;} a_i^{(l+1)}$$
At each neuron in a hidden or an output layer, the following two operations happen:-

• Weighted sum of the inputs: $\theta_{i0}^{(l)} + \theta_{i1}^{(l)} a_1^{(l)} + \dots + \theta_{is_l}^{(l)} a_{s_l}^{(l)}$. Here $\theta_{i0}^{(l)}$ is the bias term.

• Activation - Evaluate the weighted sum with an activation function φ like sigmoid, tanh, ReLU, softmax, etc.

The overall operation looks like

$$a_i^{(l+1)} = \varphi\left(\theta_{i0}^{(l)} + \theta_{i1}^{(l)} a_1^{(l)} + \dots + \theta_{is_l}^{(l)} a_{s_l}^{(l)}\right)$$

We can further elaborate this operation using Linear algebra. Let us convert the layer inputs into vectors.

$$
\text{Vector of units in layer } l = V_l = \begin{pmatrix} a_1^{(l)} \\ a_2^{(l)} \\ \vdots \\ a_{s_l}^{(l)} \end{pmatrix} \in R^{s_l} \tag{4}
$$

The extended vector with the bias $a_0^{(l)} = 1$ from layer l is given by

$$
\text{Extended vector for layer } l = \bar{V}_l = \begin{pmatrix} 1 \\ a_1^{(l)} \\ a_2^{(l)} \\ \vdots \\ a_{s_l}^{(l)} \end{pmatrix} \in R^{s_l + 1} \tag{5}
$$

$$
\text{Vector of units in layer } l + 1 = V_{l+1} = \begin{pmatrix} a_1^{(l+1)} \\ a_2^{(l+1)} \\ \vdots \\ a_{s_{l+1}}^{(l+1)} \end{pmatrix} \in R^{s_{l+1}} \tag{6}
$$

Figure 18: Forward propagation with bias

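Each arrow connecting a neuron in layer l to a neuron in layer l + 1 carries a weight $\theta_{ij}^{(l)}$, with $1 \le i \le s_{l+1}$ and $0 \le j \le s_l$. All the weights can be written in matrix form as follows:

$$
\theta^{(l)} = \begin{pmatrix}
\theta_{10}^{(l)} & \theta_{11}^{(l)} & \theta_{12}^{(l)} & \cdots & \theta_{1 s_l}^{(l)} \\
\theta_{20}^{(l)} & \theta_{21}^{(l)} & \theta_{22}^{(l)} & \cdots & \theta_{2 s_l}^{(l)} \\
\vdots & \vdots & \vdots & & \vdots \\
\theta_{s_{l+1} 0}^{(l)} & \theta_{s_{l+1} 1}^{(l)} & \theta_{s_{l+1} 2}^{(l)} & \cdots & \theta_{s_{l+1} s_l}^{(l)}
\end{pmatrix}, \quad \text{size of matrix} = s_{l+1} \times (s_l + 1) \tag{7}
$$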

The forward propagation can then be written compactly using Linear Algebra, with the activation φ applied element-wise:

$$V_{l+1} = \varphi\left(\theta^{(l)} \bar{V}_l\right)$$
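As a small illustration of this matrix form, the following R sketch performs one forward step from a layer with two units to a layer with three units. The weight values, the input vector, and the helper names are made up for the example; the sigmoid is used as the activation φ.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# One forward step: V_{l+1} = phi(theta^(l) %*% extended V_l)
forward_layer <- function(theta, v, phi = sigmoid) {
  v_ext <- c(1, v)                      # extended vector with bias unit a_0 = 1
  as.vector(phi(theta %*% v_ext))       # phi is applied element-wise
}

# Hypothetical weights: s_{l+1} = 3 rows, s_l + 1 = 3 columns (bias column first).
theta <- matrix(c( 0.1,  0.2, -0.3,
                   0.0, -0.5,  0.4,
                   0.3,  0.1,  0.1), nrow = 3, byrow = TRUE)
v_l <- c(1.0, 2.0)

v_next <- forward_layer(theta, v_l)
print(v_next)
```

Stacking several such calls, one per weight matrix, gives the forward pass of a whole network.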

We can consider the basic format of a Neural Network with n input features and s output units.

Hyperparameters

k - no. of layers (k − 2 hidden layers)

$s_i$ - no. of units in the ith layer

φ - activation function (may differ for each layer)

Parameters

$\theta^{(l)}$ - weight matrix, here l = 1, 2, . . . , k − 1

Total parameters
$$\sum_{l=1}^{k-1} (s_l + 1)\, s_{l+1}$$

All these parameters need to be optimized as part of the training process. As an outcome of
successful training, we select an optimal set of parameters θ̂.
As an example, we can count the parameters for a NN architecture with two inputs, one hidden layer (4 nodes), and one output. As we can see from Figure 19, the total number of parameters is

(s1 + 1)s2 + (s2 + 1)s3 = 3 × 4 + 5 × 1 = 17
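The parameter count can be checked with a short R helper. The function name count_params is hypothetical; it takes the vector of layer sizes (s1 , s2 , . . . , sk ) and evaluates the sum of (sl + 1) sl+1 over consecutive layers.

```r
# Total number of weights (including biases) in a fully connected network.
# sizes = c(s_1, s_2, ..., s_k), input layer first.
count_params <- function(sizes) {
  k <- length(sizes)
  sum((sizes[-k] + 1) * sizes[-1])   # (s_l + 1) * s_{l+1} for l = 1..k-1
}

n_params <- count_params(c(2, 4, 1))   # two inputs, four hidden nodes, one output
print(n_params)                        # 17
```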

Figure 19: NN Architecture with one hidden layer (size=4)

The Power of Hidden Layer - Example of NN for XOR


XOR is a logical function that outputs true when fed with an odd number of true inputs. Let us consider two binary inputs x1 and x2 ; then x1 XOR x2 is

x1 ⊕ x2 ≡ (x1 ∨ x2 ) ∧ ¬(x1 ∧ x2 )

x1    x2    y
0     0     0
1     0     1
0     1     1
1     1     0

Table 3: XOR: Truth table

The geometrical representation of the data looks like below

Figure 20: XOR values representation

As we notice, no straight line can separate the red points from the green points, i.e., the two outcomes of a logical XOR operation. We can solve this problem using a NN with a hidden layer.

Figure 21: NN for XOR Operation

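We can use the sigmoid activation function at layer 2 and layer 3 along with the following weight matrices to solve this problem:

$$
\theta^{(1)} = \begin{pmatrix} -10 & 20 & 20 \\ 30 & -20 & -20 \end{pmatrix}, \qquad
\theta^{(2)} = \begin{pmatrix} -30 & 20 & 20 \end{pmatrix}
$$

With these weights, the two hidden units approximate x1 OR x2 and NOT (x1 AND x2 ), and the output unit approximates the AND of the two hidden units, which is exactly the XOR expression above.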
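These weights can be verified numerically. The R sketch below runs forward propagation for the four input combinations using the two weight matrices given above; the function name xor_net and the rounding of the output probability to 0 or 1 are choices made for this illustration.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Weight matrices from the text (first column multiplies the bias unit).
theta1 <- matrix(c(-10,  20,  20,
                    30, -20, -20), nrow = 2, byrow = TRUE)
theta2 <- matrix(c(-30, 20, 20), nrow = 1)

xor_net <- function(x1, x2) {
  a2 <- sigmoid(theta1 %*% c(1, x1, x2))   # hidden layer (2 units)
  a3 <- sigmoid(theta2 %*% c(1, a2))       # output layer (1 unit)
  as.numeric(a3)
}

inputs <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1))
preds  <- apply(inputs, 1, function(x) xor_net(x[1], x[2]))
print(cbind(inputs, y_hat = round(preds)))   # y_hat column: 0 1 1 0
```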

III.III Activation Functions


Activation functions play a crucial role in transforming the inputs from the previous layer into a more meaningful representation. On each successive layer, this transformation of data gets closer to the expected output from the Neural Network model.
The output vector of layer l + 1 in a NN is given as $V_{l+1} = \varphi(\theta^{(l)} \bar{V}_l)$, where φ represents the activation function applied element-wise to the product of the weight matrix $\theta^{(l)}$ and the extended vector $\bar{V}_l$. The following section covers some of the commonly used activation functions along with their properties:-
1. Identity Activation Function: This activation function does not perform any modification to the linear function. It appears in traditional Linear regression, where we compute the linear function but do not perform any further operation on its outcome.
φ(Z) = Z, φ ∶ R → R

Figure 22: Identity Activation Function

2. Rectified Linear Unit or ReLU: This is the most commonly used activation function on the hidden layers in a NN. ReLU returns 0 if the input is less than or equal to 0, and returns the input if the input is greater than 0. So, it is used when we want the output of a neuron to be always non-negative.
φ(Z) = max(Z, 0), φ ∶ R → [0, ∞)

Figure 23: ReLU Activation Function

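3. Leaky Rectified Linear Unit or LReLU: This function allows small non-zero values on the −ve side and behaves like an Identity function for +ve values.

$$
\varphi: R \to R, \qquad
\varphi(z) = \begin{cases} z, & \text{if } z > 0 \\ a z, & \text{if } z \le 0 \end{cases} \qquad \text{(usually } a = 0.01\text{)}
$$
When the slope a is sampled at random during training instead of being fixed, the function is called randomized ReLU.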

Figure 24: Leaky ReLU Activation Function

Both ReLU and LReLU are widely used for Regression problems due to their piecewise-linear nature.

4. Sigmoid: The Sigmoid activation function is most commonly used in Binary classification problems, as in Logistic Regression. This function takes a real number as input and returns a value that can be read as a probability.

$$\varphi: R \to (0, 1), \qquad \varphi(Z) = \frac{1}{1 + e^{-Z}}$$

Figure 25: Sigmoid Activation Function

5. Hyperbolic Tangent: This activation looks like the Sigmoid activation function in shape, but the output value ranges between −1 and +1. It is also used in Binary classification problems, but does not return a probability.

$$\varphi: R \to (-1, 1), \qquad \varphi(Z) = \frac{e^{Z} - e^{-Z}}{e^{Z} + e^{-Z}}$$

Figure 26: Hyperbolic Tangent Activation Function

6. Softmax: This activation function is used for multi-class classification problems. It takes k numerical entries and transforms them into k probabilities.

$$
p: R^k \to (0, 1)^k, \qquad
p \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{pmatrix} = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_k \end{pmatrix} \tag{8}
$$

Each component is a function $p_i: R^k \to (0, 1)$, and the ith probability is given by
$$p_i = \frac{e^{z_i}}{e^{z_1} + e^{z_2} + \dots + e^{z_k}}$$
All the probabilities for the k classes sum up to 1.

$$p_1 + p_2 + \dots + p_k = 1$$
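The activation functions in this section can be written in a few lines of R. The sketch below is illustrative only: the function names are assumptions, tanh is taken from base R, and the softmax subtracts max(z) before exponentiating, a common numerical-stability step that leaves the result unchanged.

```r
relu       <- function(z) pmax(z, 0)
leaky_relu <- function(z, a = 0.01) ifelse(z > 0, z, a * z)
sigmoid    <- function(z) 1 / (1 + exp(-z))
softmax    <- function(z) {
  e <- exp(z - max(z))   # shifting by max(z) avoids overflow, same result
  e / sum(e)
}

z <- c(-2, 0, 3)
print(relu(z))         # 0 0 3
print(leaky_relu(z))   # -0.02 0.00 3.00
print(sigmoid(z))      # values strictly between 0 and 1
print(tanh(z))         # values strictly between -1 and 1
print(softmax(z))      # k probabilities that sum to 1
```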

III.IV Derivatives of Activation Functions


The derivative of a function is the rate of change of the function with respect to a variable. For a function f (x), the derivative is written as $f'(x)$ or $\frac{df(x)}{dx}$. We train Neural Networks with gradient descent, so partial derivatives come into play. The goal is to minimize the error, and if we know how the error changes with a change in weights, we can change the weights in a direction that reduces the error. The derivatives for commonly used Activation functions are as follows: -

• Derivative of Sigmoid: The Sigmoid function is normally used in the final layer of a NN model in a binary classification scenario. It provides the probability score for the output. The Sigmoid function is given as
$$\varphi(Z) = \frac{1}{1 + e^{-Z}}$$
The derivative of the Sigmoid function is as follows:-
$$\frac{d}{dZ} \varphi(Z) = \varphi(Z)\,(1 - \varphi(Z))$$

• Derivative of tanh: The Hyperbolic tangent function is given as follows:-
$$\varphi(Z) = \tanh Z = \frac{\sinh Z}{\cosh Z} = \frac{e^{Z} - e^{-Z}}{e^{Z} + e^{-Z}}$$
The derivative of tanh is represented as:-
$$\frac{d}{dZ} \varphi(Z) = \frac{d}{dZ} \tanh(Z) = 1 - \tanh^2(Z)$$

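• Derivative of Rectified Linear Unit: The ReLU function is represented as

$$\varphi(Z) = \mathrm{relu}(Z) = \max(Z, 0)$$

The derivative of ReLU is given as

$$\frac{d}{dZ} \varphi(Z) = \frac{d}{dZ} \mathrm{relu}(Z) = \begin{cases} 1, & \text{if } Z > 0 \\ 0, & \text{otherwise} \end{cases}$$

• Derivative of Leaky Rectified Linear Unit: The Leaky ReLU function is represented as

$$\varphi(Z) = \mathrm{lrelu}(Z) = \begin{cases} Z, & \text{if } Z > 0 \\ aZ, & \text{if } Z \le 0 \end{cases} \qquad \text{(usually } a = 0.01\text{)}$$

The derivative of Leaky ReLU is given as

$$\frac{d}{dZ} \varphi(Z) = \frac{d}{dZ} \mathrm{lrelu}(Z) = \begin{cases} 1, & \text{if } Z > 0 \\ a, & \text{otherwise} \end{cases}$$

• Derivative of Softmax is given as follows

$$\frac{\partial p_i}{\partial z_j} = \begin{cases} p_i (1 - p_j), & \text{if } i = j \\ -p_i\, p_j, & \text{if } i \ne j \end{cases}$$

Here $p_i$ is the ith entry of the output vector and $z_j$ is the jth entry of the input vector.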
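A quick way to gain confidence in these formulas is to compare them with finite-difference approximations. The R sketch below does this for the sigmoid and tanh derivatives and builds the softmax Jacobian from the two-case formula; the helper names, test points, and step size h are assumptions made for the illustration.

```r
sigmoid   <- function(z) 1 / (1 + exp(-z))
d_sigmoid <- function(z) sigmoid(z) * (1 - sigmoid(z))
d_tanh    <- function(z) 1 - tanh(z)^2

# Central finite difference as an independent numerical check.
num_deriv <- function(f, z, h = 1e-6) (f(z + h) - f(z - h)) / (2 * h)

z <- c(-1.5, 0.3, 2)
err_sigmoid <- max(abs(d_sigmoid(z) - num_deriv(sigmoid, z)))
err_tanh    <- max(abs(d_tanh(z)    - num_deriv(tanh, z)))
print(c(err_sigmoid, err_tanh))   # both should be very small

# Softmax Jacobian: entry (i, j) is p_i (1 - p_j) if i == j, and -p_i p_j otherwise,
# which is diag(p) - p p^T in matrix form.
softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }
softmax_jacobian <- function(z) { p <- softmax(z); diag(p) - outer(p, p) }
jac <- softmax_jacobian(z)
print(rowSums(jac))               # each row sums to (numerically) zero
```

The rows of the softmax Jacobian sum to zero because the outputs always add up to 1, so a change in any input cannot change their total.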

III.V Cross-Entropy Loss
The training data set consists of pairs (Vt , yt ), t = 1, 2, . . . , m, with output yt ∈ {0, 1}. The NN returns a probability for the output class for each input record Vt , i.e., the probability p(Vt ) ∈ (0, 1) that yt = 1. The Cross-Entropy Loss for a binary classification problem is given by
$$J(\theta) = \sum_{t=1}^{m} \left[ -y_t \log(p(V_t)) - (1 - y_t) \log(1 - p(V_t)) \right]$$

The value of the Cross-Entropy Loss changes as we change the weights of the Neural Network.
In the case of a multi-class classification problem, suppose we have s classes {k1 , k2 , . . . , ks }. The output is $Y_t = (y_1^t, y_2^t, \dots, y_s^t) \in \{0, 1\}^s$ with
$$y_q^t = \begin{cases} 1, & \text{if } V_t \in k_q \\ 0, & \text{if } V_t \notin k_q \end{cases}$$
Multi-class cross-entropy loss is given by:
$$J(\theta) = \sum_{q=1}^{s} \sum_{t=1}^{m} \left[ -y_q^t \log(p_q(V_t)) - (1 - y_q^t) \log(1 - p_q(V_t)) \right]$$
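As a sketch of how this loss can be evaluated, the R function below sums the cross-entropy terms over all entries it is given. The function name, the sample labels and probabilities, and the clipping constant eps (added only to avoid log(0)) are assumptions. Because the multi-class formula above is the same expression summed over classes as well as records, the same function can be applied to an s × m matrix of labels and probabilities.

```r
# Cross-entropy summed over all supplied entries (vector or matrix).
cross_entropy <- function(y, p, eps = 1e-12) {
  p <- pmin(pmax(p, eps), 1 - eps)          # guard against log(0)
  sum(-y * log(p) - (1 - y) * log(1 - p))
}

# Binary case: three records with labels y and predicted probabilities p.
y <- c(1, 0, 1)
p <- c(0.9, 0.2, 0.6)
loss_binary <- cross_entropy(y, p)
print(loss_binary)

# Multi-class case: s = 3 classes (rows), m = 2 records (columns), one-hot labels.
Y <- cbind(c(1, 0, 0), c(0, 0, 1))
P <- cbind(c(0.7, 0.2, 0.1), c(0.1, 0.3, 0.6))
loss_multi <- cross_entropy(Y, P)
print(loss_multi)
```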

III.VI Back Propagation


The history of backpropagation goes back to a paper published by Rumelhart, Hinton, and Williams in 1986. The algorithm applies to FFNNs, and an enhanced version can be used on other NNs like ConvNets. Back Propagation, or Backprop, is the critical part of training a Neural Network: it drives the optimization of the network parameters so as to minimize the loss function J . At a high level, minimizing a function involves taking a derivative of the function and equating it to zero, i.e., $\frac{\partial J}{\partial w} = 0$. However, finding the minima of the cost function in a NN is complicated due to the following:-

• Multiple equations need to be optimized as we deal with weights from different layers.

• Number of weights could be different for each layer depending upon the no. of nodes in a layer.

• We need to find a global minimum while multiple local minima can exist.

The Backpropagation algorithm computes all the partial derivatives of the total Loss function with respect to the weights, $\frac{\partial J}{\partial \theta_{ij}^{(l)}}$. Backpropagation starts from an established cost function, e.g., mean square error (MSE) for a regression problem or Cross-Entropy for a binary or multi-class classification problem. We measure the performance of the NN on each forward pass based on the cost function. The network computes the gradient of the cost with respect to the weights and works backward to adjust the weights. This process is repeated to reduce the cost function to its minimum possible value.
To illustrate the working of the Backpropagation algorithm, let us consider a binary classification problem with n input features (x1 , x2 , . . . , xn ) and output y ∈ {0, 1}. The NN in this case predicts the probability p of y = 1. We choose Sigmoid activation functions for all layers, i.e., l = 2, 3, . . . , k (l = 1 is the input layer), and we choose the cross-entropy loss function to measure the performance of the NN.

Figure 27: Network Illustration for Binary Classification problem

Suppose we have m records in the training set, i.e., (Vt , yt ), t = 1, 2, . . . , m. The cross-entropy loss function is given as
$$J(\theta) = \sum_{t=1}^{m} \left[ -y_t \log(p(V_t)) - (1 - y_t) \log(1 - p(V_t)) \right]$$

Backpropagation Algorithm - Case K=2


We start with the case of Logistic regression with just an input and an output layer (k = 2). As we discussed earlier, there are two operations happening at each neuron, namely the linear operation with weights and inputs, followed by the Sigmoid activation function applied to the linear sum. In mathematical form, it looks as follows:-

Weights: θ0 , θ1 , θ2 , . . . , θn

Linear Sum: $z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$, with $x_0 = 1$

Activation: $p = g(z) = \dfrac{1}{1 + e^{-z}}$

Loss function: $J(\theta) = \sum_{t=1}^{m} \left[ -y_t \log(g(z_t)) - (1 - y_t) \log(1 - g(z_t)) \right]$

Figure 28: NN Overview for Logistic Regression

The derivative of the Cost function involves the derivative of the Activation function and the partial derivative of the Linear sum function. We noticed in the earlier section that the derivative of the Sigmoid function is given by:
$$\frac{d}{dz} \varphi(z) = \frac{d}{dz} g(z) = g(z)\,(1 - g(z)) \tag{9}$$
And the derivative of the Linear sum operation is given by:
$$\frac{\partial z}{\partial \theta_i} = \frac{\partial}{\partial \theta_i} \left( \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_i x_i + \dots + \theta_n x_n \right) = x_i \tag{10}$$
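The derivative of the cost function w.r.t. weight θ0 :
$$
\begin{aligned}
\frac{\partial}{\partial \theta_0} J(\theta)
&= \sum_{t=1}^{m} \frac{\partial}{\partial \theta_0} \left[ -y_t \log(g(z_t)) - (1 - y_t) \log(1 - g(z_t)) \right] \\
&= -\sum_{t=1}^{m} \frac{\partial}{\partial \theta_0} \left[ y_t \log(g(z_t)) + (1 - y_t) \log(1 - g(z_t)) \right] \\
&= -\sum_{t=1}^{m} \left[ \frac{y_t}{g(z_t)}\, g'(z_t)\, \frac{\partial z_t}{\partial \theta_0} + \frac{1 - y_t}{1 - g(z_t)}\, \left(-g'(z_t)\right) \frac{\partial z_t}{\partial \theta_0} \right] \\
&= -\sum_{t=1}^{m} \left[ \frac{y_t}{g(z_t)}\, g(z_t)(1 - g(z_t))\, x_0 + \frac{1 - y_t}{1 - g(z_t)}\, \left(-g(z_t)(1 - g(z_t))\right) x_0 \right], \quad \text{using (9), (10)} \\
&= -\sum_{t=1}^{m} \left[ y_t (1 - g(z_t)) \cdot 1 + (1 - y_t)(-1)\, g(z_t) \cdot 1 \right], \quad \text{using } x_0 = 1 \\
&= -\sum_{t=1}^{m} \left( y_t - y_t\, g(z_t) - g(z_t) + y_t\, g(z_t) \right) \\
&= \sum_{t=1}^{m} \left( g(z_t) - y_t \right)
\end{aligned} \tag{11}
$$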

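Similarly, we can find the partial derivative of the cost function J w.r.t. the weights θi , i ≥ 1:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_i} J(\theta)
&= \sum_{t=1}^{m} \frac{\partial}{\partial \theta_i} \left[ -y_t \log(g(z_t)) - (1 - y_t) \log(1 - g(z_t)) \right] \\
&= -\sum_{t=1}^{m} \frac{\partial}{\partial \theta_i} \left[ y_t \log(g(z_t)) + (1 - y_t) \log(1 - g(z_t)) \right] \\
&= -\sum_{t=1}^{m} \left[ \frac{y_t}{g(z_t)}\, g'(z_t)\, \frac{\partial z_t}{\partial \theta_i} + \frac{1 - y_t}{1 - g(z_t)}\, \left(-g'(z_t)\right) \frac{\partial z_t}{\partial \theta_i} \right] \\
&= -\sum_{t=1}^{m} \left[ \frac{y_t}{g(z_t)}\, g(z_t)(1 - g(z_t))\, x_i^t + \frac{1 - y_t}{1 - g(z_t)}\, \left(-g(z_t)(1 - g(z_t))\right) x_i^t \right], \quad \text{using (9), (10)} \\
&= -\sum_{t=1}^{m} x_i^t \left( y_t - y_t\, g(z_t) - g(z_t) + y_t\, g(z_t) \right) \\
&= \sum_{t=1}^{m} x_i^t \left( g(z_t) - y_t \right)
\end{aligned} \tag{12}
$$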

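We have these two final equations for the partial derivatives: -

$$\frac{\partial}{\partial \theta_0} J(\theta) = \sum_{t=1}^{m} \left( g(z_t) - y_t \right)$$

$$\frac{\partial}{\partial \theta_i} J(\theta) = \sum_{t=1}^{m} x_i^t \left( g(z_t) - y_t \right), \quad i \ge 1$$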

The predicted output is represented as p = g(zt ), and yt is the actual output label from the training set. So, the error for the training pair (Vt , yt ) is given as:
$$\delta_t = g(z_t) - y_t$$
With this, we can re-write the above equations as:-

$$
\frac{\partial}{\partial \theta_i} J(\theta) = \begin{cases}
\sum_{t=1}^{m} \delta_t, & \text{if } i = 0 \\
\sum_{t=1}^{m} x_i^t\, \delta_t, & \text{if } i \ge 1
\end{cases}
$$
We have a common pattern for the partial derivative, i.e., $\frac{\partial}{\partial \theta_i} J(\theta) = \sum_{t=1}^{m} x_i^t\, \delta_t$, with $x_0^t = 1$ for the bias node.
At a high level, the Backpropagation algorithm computes $\frac{\partial}{\partial \theta_i} J$ through the following steps:-
1. Identify the training set (Vt , yt ), t = 1, 2, . . . , m.
2. Feed the input variables Vt to the NN model.
3. Compute the probability p = g(zt ) using forward propagation.
4. Compute the error δt = g(zt ) − yt .
5. Backpropagate the error corresponding to θi : multiply δt by the origin unit value $x_i^t$.
6. Sum over all the pairs in the training data.
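The steps above can be sketched in R and checked against a numerical gradient. Everything concrete here is an assumption for illustration: the small data set, the weight vector, and the function names. The design matrix X carries a leading column of ones, so that x0 = 1 plays the role of the bias node.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Cross-entropy loss J(theta) for logistic regression.
loss <- function(theta, X, y) {
  p <- as.vector(sigmoid(X %*% theta))
  sum(-y * log(p) - (1 - y) * log(1 - p))
}

# Analytic gradient: dJ/dtheta_i = sum_t x_i^t * delta_t, with delta_t = g(z_t) - y_t.
grad <- function(theta, X, y) {
  delta <- as.vector(sigmoid(X %*% theta)) - y
  as.vector(t(X) %*% delta)
}

# Hypothetical data: m = 4 records, n = 2 features, first column is x_0 = 1.
X <- cbind(1, c(0.5, -1.2, 2.0, 0.3), c(1.0, 0.7, -0.4, -1.5))
y <- c(1, 0, 1, 0)
theta <- c(0.1, -0.2, 0.3)

# Central finite differences as an independent check of the formula.
num_grad <- sapply(seq_along(theta), function(i) {
  e <- replace(numeric(length(theta)), i, 1e-6)
  (loss(theta + e, X, y) - loss(theta - e, X, y)) / (2e-6)
})

g_analytic <- grad(theta, X, y)
print(rbind(analytic = g_analytic, numeric = num_grad))
```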

General Backpropagation Algorithm


This section discusses the backpropagation algorithm for a general binary classification problem. We consider a simple case with an n-dimensional input vector and a binary output. This network has k layers, several of them hidden, and we predict the probability of binary class 1 in the output layer. We can think of a training set with m records of labeled pairs (Vt , yt ).

Figure 29: Generic NN for Binary classification

If we feed the input features for a record from the training set to the Neural Network, we would get a value at the output layer as a probability of a binary class, denoted as $a_1^{(k)}$ in the above figure. The predicted value is compared with the target label or the given output (yt ) to find the error in the output layer. The error is given by:
$$\delta^{(k)} = a_1^{(k)} - y_t$$

This error term depends upon the weights of the network and the specific input pair (Vt , yt ) used. There are other hidden errors at each hidden layer that propagate backward through the network. Suppose we have a hidden layer l of size $s_l$; then the error vector will have $s_l + 1$ entries.
$$
\bar{\delta}^{(l)} = \begin{bmatrix} \delta_0^{(l)} \\ \delta_1^{(l)} \\ \delta_2^{(l)} \\ \vdots \\ \delta_{s_l}^{(l)} \end{bmatrix} \in R^{s_l + 1}
$$
Here, $\delta_0^{(l)}$ is the error corresponding to the bias unit. So, $\bar{\delta}^{(l)}$ is called the extended vector of error terms at layer l. The normal vector without the bias error is referred to as $\delta^{(l)}$.
The general formula to compute these intermediate error vectors in layer l is:

$$\bar{\delta}^{(l)} = [\theta^{(l)}]^{T}\, \delta^{(l+1)} \odot \bar{a}^{(l)} \odot (1 - \bar{a}^{(l)}), \quad l \ge 2$$

In this formula, we have:

$[\theta^{(l)}]$ - Weight matrix for all the connections from layer l to layer l + 1, with a size of $s_{l+1}$ (rows) × $(s_l + 1)$ (columns). Its transpose $[\theta^{(l)}]^{T}$ is therefore a matrix of size $(s_l + 1) \times s_{l+1}$.

⊙ - Represents the element-wise multiplication, also called the Hadamard product, of two vectors. Example:
$$
\begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} \odot \begin{bmatrix} 1 \\ 4 \\ 2 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \\ 2 \end{bmatrix}
$$

$\bar{a}^{(l)}$ - Extended vector of all the entries in layer l (including the bias $a_0^{(l)}$), with size $s_l + 1$. We get the values $a_i^{(l)}$ from forward-propagation.

$(1 - \bar{a}^{(l)})$ - Subtracts each entry of $\bar{a}^{(l)}$ from 1, resulting in an $(s_l + 1)$-dimensional vector.

As we can see, this is a backward propagating method to find the errors in layer l by using the errors
in l + 1. To illustrate the working of Back-propagation as compared to forward-propagation, consider a
NN with four layers.

Figure 30: Forward-propagation v/s Backward-propagation

The above figure shows the network's forward and backward propagation values. Forward propagation moves from left to right, whereas backward propagation moves from right to left and stops before the input layer. So, backpropagation ensures that every neuron in each layer other than the input layer has an error value.

The error vector $\delta^{(l+1)}$ is the key to calculating the partial derivatives of the total loss function J , which are given by the following Backpropagation formula.
$$\frac{\partial J}{\partial \theta_{ij}^{(l)}} = \sum_{t=1}^{m} \delta_i^{(l+1)}\, a_j^{(l)}$$

The following figure shows a simplified picture of this formula.

Figure 31: Backpropagation formula - breakdown

The RHS of this equation, $\sum_{t=1}^{m} \delta_i^{(l+1)}\, a_j^{(l)}$, is the sum of the values obtained from all the pairs (Vt , yt ) in the training set.

The following steps are performed in order to find the partial derivatives in a NN:-

1. Feed the network model with the input from one pair (Vt , yt ) in the training set.

2. Perform forward propagation to compute all the activation values. For example, to calculate $a_i^{(l+1)}$:
(a) Consider the bias unit $a_0^{(l)}$ where applicable.
(b) Compute the linear sum $z_i^{(l+1)} = \theta_{i0}^{(l)} a_0^{(l)} + \theta_{i1}^{(l)} a_1^{(l)} + \theta_{i2}^{(l)} a_2^{(l)} + \dots + \theta_{i s_l}^{(l)} a_{s_l}^{(l)}$.
(c) Apply the activation function $a_i^{(l+1)} = g(z_i^{(l+1)})$.

3. Compute the error at the final layer as the difference between the predicted output and the actual output, $\delta_1^{(k)} = a_1^{(k)} - y$.

4. Perform backpropagation to find $\delta_i^{(l)}$, i.e., the intermediate errors for all the layers except the input layer.
$$\bar{\delta}^{(l)} = [\theta^{(l)}]^{T}\, \delta^{(l+1)} \odot \bar{a}^{(l)} \odot (1 - \bar{a}^{(l)})$$

5. Repeat the process (Steps 1-4) for all pairs in the training set.

6. Compute the partial derivatives using $\frac{\partial J}{\partial \theta_{ij}^{(l)}} = \sum_{t=1}^{m} \delta_i^{(l+1)}\, a_j^{(l)}$.
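The six steps can be sketched in R for a small network with two inputs, three sigmoid hidden units, and one sigmoid output, trained with cross-entropy loss. The layer sizes, weights, data, and function names are assumptions chosen for this illustration. A finite-difference check on single weights is included as a sanity test of the Backpropagation formula. Note that R indexes columns from 1, so the weight θij sits in column j + 1.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward propagation, keeping the extended activation vectors of each layer.
forward <- function(thetas, x) {
  a1 <- c(1, x)                                     # extended input layer
  a2 <- c(1, sigmoid(thetas[[1]] %*% a1))           # extended hidden layer
  a3 <- as.vector(sigmoid(thetas[[2]] %*% a2))      # output probability
  list(a1 = a1, a2 = a2, a3 = a3)
}

loss <- function(thetas, X, y) {
  p <- apply(X, 1, function(x) forward(thetas, x)$a3)
  sum(-y * log(p) - (1 - y) * log(1 - p))
}

backprop <- function(thetas, X, y) {
  g1 <- 0 * thetas[[1]]
  g2 <- 0 * thetas[[2]]
  for (t in seq_len(nrow(X))) {
    f  <- forward(thetas, X[t, ])
    d3 <- f$a3 - y[t]                                        # output error
    d2_ext <- as.vector(t(thetas[[2]]) %*% d3) * f$a2 * (1 - f$a2)
    d2 <- d2_ext[-1]                                         # drop the bias error
    g2 <- g2 + outer(d3, f$a2)                               # delta_i^(3) * a_j^(2)
    g1 <- g1 + outer(d2, f$a1)                               # delta_i^(2) * a_j^(1)
  }
  list(g1, g2)
}

# Hypothetical weights and data (XOR-style inputs).
thetas <- list(matrix(seq(-0.4, 0.4, length.out = 9), nrow = 3),
               matrix(c(0.2, -0.1, 0.3, 0.5), nrow = 1))
X <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1))
y <- c(0, 1, 1, 0)

grads <- backprop(thetas, X, y)

# Finite-difference estimate of dJ/dtheta for one weight (layer l, row i, column j).
num_grad <- function(l, i, j, h = 1e-6) {
  up <- thetas; up[[l]][i, j] <- up[[l]][i, j] + h
  dn <- thetas; dn[[l]][i, j] <- dn[[l]][i, j] - h
  (loss(up, X, y) - loss(dn, X, y)) / (2 * h)
}
print(c(backprop = grads[[1]][2, 3], numeric = num_grad(1, 2, 3)))
print(c(backprop = grads[[2]][1, 2], numeric = num_grad(2, 1, 2)))
```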

Training
Training is the process of minimizing the Loss function J by finding the appropriate weights for the network. Gradient Descent is used as the method to minimize the loss function J . Suppose we have a training set (Vt , yt ), t = 1, 2, 3, . . . , m, with n-dimensional input variables and an s-class output.

$$V_t = (x_1^t, x_2^t, \dots, x_n^t), \qquad y_t = (y_1^t, y_2^t, \dots, y_s^t)$$

The weights of the network are represented as $\theta_{ij}^{(l)}$ for the connections from layer l to layer l + 1.

$$
V_t = \begin{pmatrix} x_1^t \\ x_2^t \\ x_3^t \\ \vdots \\ x_n^t \end{pmatrix}
\;\rightrightarrows\; \left( \text{NN with weights } \theta \right) \;\rightrightarrows\;
\begin{pmatrix} f^1(V_t) \\ f^2(V_t) \\ f^3(V_t) \\ \vdots \\ f^s(V_t) \end{pmatrix} = f_\theta(V_t) \tag{13}
$$

The NN predicts the output for an s-class target using the weights θ, i.e., fθ (Vt ) ∈ Rs . We measure how close the predicted output is to the actual output yt by using L(fθ (Vt ), yt ). We define the loss function as
$$J(\theta) = \sum_{t=1}^{m} L(f_\theta(V_t), y_t)$$

As part of training, we start with some initial weights, say

$$\theta^{0} = \left( \theta_{ij}^{(l),0} \right)_{i,j,l} \longrightarrow \text{large collection of parameters}$$

with layers l = 1, 2, . . . , k − 1, target nodes $i = 1, 2, \dots, s_{l+1}$, and origin nodes $j = 0, 1, \dots, s_l$ (j = 0 being the bias unit).
The weights are adjusted with learning rate α (a hyperparameter) using the following formula:

$$\theta^{d+1} = \theta^{d} - \alpha\, \nabla J(\theta^{d})$$

i.e., for each individual weight,

$$\theta_{ij}^{(l),d+1} = \theta_{ij}^{(l),d} - \alpha\, \frac{\partial J}{\partial \theta_{ij}^{(l)}}(\theta^{d})$$
We keep optimizing the weights in each iteration to minimize the loss function J (θ).

$$\theta^{0} \leadsto \theta^{1} \leadsto \theta^{2} \leadsto \dots \leadsto \theta^{d} \leadsto \theta^{d+1} \dots$$

These optimization passes over the training set are called epochs, and their number is controlled via a hyperparameter.
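A minimal R sketch of this update rule is shown below for the simplest network, logistic regression (k = 2), where the gradient is the sum of $x_i^t \delta_t$ over the training set. The data, the learning rate α = 0.1, and the number of epochs are assumptions picked for the illustration, not values from the text.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical training set: one feature plus the bias column x_0 = 1.
X <- cbind(1, c(-2, -1, -0.5, 0.5, 1, 2))
y <- c(0, 0, 1, 0, 1, 1)

alpha  <- 0.1                 # learning rate (hyperparameter)
epochs <- 500                 # number of passes over the training set
theta  <- rep(0, ncol(X))     # initial weights theta^0
losses <- numeric(epochs)

for (d in seq_len(epochs)) {
  p <- as.vector(sigmoid(X %*% theta))                 # forward propagation
  losses[d] <- sum(-y * log(p) - (1 - y) * log(1 - p)) # J(theta^d)
  gradient <- as.vector(t(X) %*% (p - y))              # grad J(theta^d)
  theta <- theta - alpha * gradient                    # theta^{d+1}
}

print(theta)
print(c(first = losses[1], last = losses[epochs]))     # the loss decreases
```

Here every iteration uses the full training set, so one iteration corresponds to one epoch; with mini-batches an epoch would consist of several weight updates.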

IV Convolutional Neural Networks
Convolutional Neural Networks or ConvNets, a.k.a. CNNs, are most commonly used for problems in the domain of computer vision. ConvNets exploit spatially local correlation by enforcing local connectivity between adjacent layers.

IV.I Building Blocks of CNN


We will start with the basics of the Convolutional Neural Network before exploring a CNN model. Some
of the building blocks of a CNN are as follow:-

Convolution Operation
Images can be represented as matrix values in different channels depending upon the color space. Some commonly used color spaces are Grayscale, CMYK, HSV, RGB, etc. The convolution operation is performed by applying a filter, a.k.a. kernel, over the image, which computes the dot product of the filter with the area of the image under the filter. The filter slides over the complete image to perform this activity and reduces the image to a smaller feature matrix while retaining the features the filter responds to.

Figure 32: Convolution Operation - Overview

Suppose we have an image in the form of matrix values. We apply a 3 × 3 filter of weights (kernel) to the top-left 3 × 3 patch of the matrix, starting at the first row and first column. As an example, the first value of the convolved feature is calculated as follows: -
4 × 1 + 5 × (−1) + 1 × 0 + 2 × 0 + 6 × 1 + 5 × 0 + 6 × 1 + 8 × 0 + 2 × 1 = 13
We scan the whole matrix with a window matching the 3 × 3 filter to perform the convolution operation and generate a convolved feature matrix of size 3 × 3 from a 5 × 5 matrix.
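The sliding-window computation can be sketched in R as follows. The top-left patch and the kernel are read row by row from the arithmetic above, which is an assumption about the layout in the figure; the remaining entries of the 5 × 5 image are made up, and the function name conv2d_valid is hypothetical. As in most CNN libraries, the kernel is applied without flipping.

```r
# "Valid" 2D convolution: slide the kernel over the image, no padding, stride 1.
conv2d_valid <- function(img, kernel) {
  f_r <- nrow(kernel); f_c <- ncol(kernel)
  out <- matrix(0, nrow(img) - f_r + 1, ncol(img) - f_c + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- img[i:(i + f_r - 1), j:(j + f_c - 1)]
      out[i, j] <- sum(patch * kernel)   # element-wise product, then sum
    }
  }
  out
}

# Top-left 3 x 3 patch taken from the worked example; other values are assumed.
img <- matrix(c(4, 5, 1, 3, 2,
                2, 6, 5, 1, 0,
                6, 8, 2, 4, 7,
                1, 0, 3, 5, 2,
                9, 4, 6, 1, 3), nrow = 5, byrow = TRUE)
kernel <- matrix(c(1, -1, 0,
                   0,  1, 0,
                   1,  0, 1), nrow = 3, byrow = TRUE)

features <- conv2d_valid(img, kernel)
print(dim(features))     # 3 3
print(features[1, 1])    # 13
```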

Edge Detection
The edges of the objects in an image are the areas where there is a drastic change in the brightness of the pixels. There are different filters to detect vertical, horizontal, or angled edges. An example of a vertical edge detector along with the convolution operation is as follows:-

Figure 33: Convolution Example - Matrix operation (References 1)

Figure 34: Convolution Example - Edge detection

Some commonly used edge detection filters are Canny, Sobel, Scharr, etc. The size of the output image matrix after applying a filter of size f × f on an image of matrix size n × n is (n − f + 1) × (n − f + 1). The value of f is usually odd to ensure that the filter always has a central position and to help with padding.

Padding
As we discussed in the filter operations above, the size of the output image is reduced from n × n to (n − f + 1) × (n − f + 1). In addition to the shrinkage in size, we also lose important information at the edges of the image. To overcome this problem, we pad extra pixels to the image border via a hyper-parameter called padding. If we apply a padding p = 1 on an n × n image, the resultant image size is (n + 2) × (n + 2). Padding is configured using two modes in CNN models, namely:

1. Valid: No padding

2. Same: Pad to match the size of the output image with the input image.
Example of Padding:

Figure 35: Padding Example - References 2

When a filter of size f × f is applied along with a padding of size p on an image of size n × n, the
resulting image size is given by (n + 2p − f + 1) × (n + 2p − f + 1).

Strided Convolution
Strides are another factor that can influence the output size of the convolution. Stride is the distance
between two successive windows in a convolution operation. By default, the convolution windows are
contiguous, i.e., a stride of size 1. Here is an example of a stride with size 2.

Figure 36: 3 × 3 convolution patches with 2 × 2 strides - References 2

The size of the convolution output with stride size s, padding p, filter (f × f ), and original image size (n × n) is given by
$$\left( \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1 \right)$$
In the above formula, if the fraction is not an integer, we use the floor function in R to round it down.
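The output-size formula is easy to evaluate with a small R helper. The function name conv_output_size and the example sizes are assumptions for illustration; floor() is the base R function that rounds down.

```r
# Output size along one dimension for input size n, filter f, padding p, stride s.
conv_output_size <- function(n, f, p = 0, s = 1) {
  floor((n + 2 * p - f) / s) + 1
}

sizes <- c(
  no_padding = conv_output_size(5, 3),           # 5 - 3 + 1 = 3
  padded     = conv_output_size(5, 3, p = 1),    # padding 1 keeps the size at 5
  strided    = conv_output_size(7, 3, s = 2),    # (7 - 3) / 2 + 1 = 3
  rounded    = conv_output_size(8, 3, s = 2)     # floor(5 / 2) + 1 = 3
)
print(sizes)
```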

Convolutions over Volume


Color images are represented in multiple channels, i.e., more than one data matrix. For example, a common format for a color picture is RGB, which means a stack of red, green, and blue channels. To perform a convolution on a 3D image (with three channels), we need a 3D filter, i.e., with the depth of the filter matching the number of channels in the input image. The flow of the convolution operation on a 3D image volume with a 3D filter goes as follows:-
1. Take the dot product of each filter slice with the corresponding values from the matching channel in the image, e.g., values from the red filter slice with the values from the red channel of the image, and so on.
2. Add up these dot products from each filter slice with the corresponding channel to produce a single value.

Figure 37: Volume convolution Example

If we want to detect the edges in a specific color channel, say red, we set the filter values for the other color channels, i.e., blue and green in this case, to zero, and vice-versa. The edge filters are set to the same values for all channels when checking the edges across all the channels. We can use different filters to extract various image features, like additional edge filters - horizontal, vertical, angular, etc.

Figure 38: Using multiple filters to get a 2D output

Pooling
Pooling can help reduce the input size and, in turn, can lower the computation cost. It also makes the feature detection less dependent on its position in the image, known as invariant feature detection. Pooling is commonly performed in a window size of 2 × 2. Types of pooling layer:-

Max-pooling layer: Moves the Pool layer filter over the input image and stores the max value from the input window as output.

Average-pooling layer: Moves the Pool layer filter over the input image and stores the average value of the input window as output.
Max-pooling is used more often than Average-pooling. The pooling layers have no weights to be trained using Backpropagation. Instead, they are controlled via hyper-parameters like the window size of the pooling operation f , stride s, and the type of pooling (max or average). If we apply the pooling layer with window size f × f on an input image of size $n_H \times n_W \times n_C$, with striding of size s, the output dimensions are $\left( \left\lfloor \frac{n_H - f}{s} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n_W - f}{s} \right\rfloor + 1 \right) \times n_C$ (padding is usually not used with the max-pooling operation).
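A single-channel pooling operation can be sketched in R as below. The function name pool2d and the 4 × 4 input values are assumptions; passing max or mean as the fun argument switches between max-pooling and average-pooling.

```r
# Pooling over one channel with window f x f and stride s.
pool2d <- function(x, f = 2, s = 2, fun = max) {
  n_r <- floor((nrow(x) - f) / s) + 1
  n_c <- floor((ncol(x) - f) / s) + 1
  out <- matrix(0, n_r, n_c)
  for (i in seq_len(n_r)) {
    for (j in seq_len(n_c)) {
      r0 <- (i - 1) * s + 1                      # top row of the window
      c0 <- (j - 1) * s + 1                      # left column of the window
      out[i, j] <- fun(x[r0:(r0 + f - 1), c0:(c0 + f - 1)])
    }
  }
  out
}

# Hypothetical 4 x 4 feature map.
x <- matrix(c(1, 3, 2, 4,
              5, 6, 1, 2,
              7, 2, 9, 1,
              3, 4, 5, 8), nrow = 4, byrow = TRUE)

max_pooled <- pool2d(x, fun = max)    # rows: (6, 4) and (7, 9)
avg_pooled <- pool2d(x, fun = mean)
print(max_pooled)
print(avg_pooled)
```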

Figure 39: Max-pooling operation with filter size 2x2

Flattening
Flattening converts the last convolution layer output to a one-dimensional array which can then be
used to make the actual predictions.

IV.II Visualizing the ConvNet Learning


So far, we have discussed the convolution operations in a ConvNet model that help it understand the features in the input image. We can visualize the outputs from the filters in a hidden layer to find the patterns that maximize a unit's activation. We start with a CNN with 15 layers that takes an image of size 150 × 150 as input and gives a probability using a single-unit sigmoid Dense layer. The summary of the model is as follows:-

Model: "sequential"
________________________________________________________________________
Layer (type)                      Output Shape              Param #
========================================================================
conv2d_3 (Conv2D)                 (None, 148, 148, 32)      896
activation_3 (Activation)         (None, 148, 148, 32)      0
max_pooling2d_3 (MaxPooling2D)    (None, 74, 74, 32)        0
conv2d_2 (Conv2D)                 (None, 72, 72, 64)        18496
activation_2 (Activation)         (None, 72, 72, 64)        0
max_pooling2d_2 (MaxPooling2D)    (None, 36, 36, 64)        0
conv2d_1 (Conv2D)                 (None, 34, 34, 128)       73856
activation_1 (Activation)         (None, 34, 34, 128)       0
max_pooling2d_1 (MaxPooling2D)    (None, 17, 17, 128)       0
conv2d (Conv2D)                   (None, 15, 15, 128)       147584
activation (Activation)           (None, 15, 15, 128)       0
max_pooling2d (MaxPooling2D)      (None, 7, 7, 128)         0
flatten (Flatten)                 (None, 6272)              0
dense_1 (Dense)                   (None, 512)               3211776
dense (Dense)                     (None, 1)                 513
========================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
________________________________________________________________________

We can notice from the model summary how the output shape changes from layer to layer, e.g., the output of the first layer for a single image is 1 × 148 × 148 × 32. We take an image from Vincent Van Gogh's artworks to exhibit the working of this model. The original image looks like below:

Figure 40: Original Image - References 1

As required for the model input, we pre-process the image into a 4D tensor.

img_path <- "~/Downloads/vangogh02.jpg"
img <- image_load(img_path, target_size = c(150, 150))

img_tensor <- image_to_array(img)


img_tensor <- array_reshape(img_tensor, c(1, 150, 150, 3))
img_tensor <- img_tensor / 255
dim(img_tensor)
[1] 1 150 150 3
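
The reshaping step only adds a leading batch dimension of size 1 and rescales the pixel values to [0, 1]. A minimal base-R sketch of the same idea, with a random array standing in for the image (the name `fake_img` is made up for illustration). The keras function `array_reshape` uses row-major ordering, but for a leading dimension of size 1 the result coincides with the base-R `array` call used here.

```r
set.seed(1)

# Stand-in for a 150 x 150 RGB image with integer pixel values in 0..255
fake_img <- array(sample(0:255, 150 * 150 * 3, replace = TRUE),
                  dim = c(150, 150, 3))

# Add a batch dimension of size 1 and rescale to [0, 1]
img_4d <- array(fake_img, dim = c(1, dim(fake_img))) / 255

print(dim(img_4d))     # batch, height, width, channels
print(range(img_4d))   # all values lie between 0 and 1
```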

To visualize the outputs from the layers and filters, we create a model act_model that returns the
outputs of the first 15 layers for a given image tensor.

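# Collect the output tensors of the first 15 layers
layer_outputs <- lapply(model$layers[1:15], function(layer) layer$output)

# Model that maps the input image to the activations of these layers
act_model <- keras_model(inputs = model$input, outputs = layer_outputs)
activations <- act_model %>% predict(img_tensor)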

We can check the dimensions of the first-layer and the tenth-layer activations.

first_layer_act <- activations[[1]]


tenth_layer_act <- activations[[10]]
dim(first_layer_act)
[1] 1 148 148 32

dim(tenth_layer_act)
[1] 1 15 15 128

We create a plot_channel function to plot a single channel of a layer’s activation.

plot_channel <- function(channel) {
  # Rotate the matrix so that image() draws it in the usual orientation
  rotate <- function(x) t(apply(x, 2, rev))
  image(rotate(channel), axes = FALSE, asp = 1, col = terrain.colors(12))
}
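
The helper `rotate` turns a matrix by 90 degrees clockwise. `image()` treats the rows of a matrix as x positions and the columns as y positions increasing upwards, so an unrotated matrix appears turned by 90 degrees counter-clockwise relative to its printed layout; the rotation compensates for this. A small check on a made-up 2 × 3 matrix:

```r
rotate <- function(x) t(apply(x, 2, rev))

m <- matrix(1:6, nrow = 2)   # rows: (1 3 5) and (2 4 6)
r <- rotate(m)               # 3 x 2 matrix with rows (2 1), (4 3), (6 5)

print(m)
print(r)
```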

We use the plot_channel function to plot the output for channels 1, 16, and 32 from the first layer.

par(mfrow = c(1, 3))


plot_channel(first_layer_act[1, , , 1])
plot_channel(first_layer_act[1, , , 16])
plot_channel(first_layer_act[1, , , 32])

(a) Channel-1 (b) Channel-16 (c) Channel-32

Figure 41: First Layer Activation - Output

We can observe that the first-layer activation retains most of the image information and mainly
responds to the edges in the input image.
The information becomes more abstract as we move higher in the layer stack. For example, we can plot the
outputs from the tenth layer for channels 2, 64, and 128 and observe that this layer represents more
global and abstract features.

par(mfrow = c(1, 3))


plot_channel(tenth_layer_act[1, , , 2])
plot_channel(tenth_layer_act[1, , , 64])
plot_channel(tenth_layer_act[1, , , 128])

(a) Channel-2 (b) Channel-64 (c) Channel-128

Figure 42: Tenth Layer Activation - Output

V ConvNet by Example
Keras provides a set of easy-to-use APIs for model development and runs on top of TensorFlow.
TensorFlow can run on different types of hardware, e.g., CPU, GPU, and TPU.

Figure 43: Keras and TensorFlow - Reference 2

The Keras deep learning API can be used to develop and train models for basic to advanced scenarios.
In this section, we will use the keras library in RStudio (2022.02.3+492).

• Data set - We use a data set containing chest X-ray images of opacity (pneumonia) and
normal cases. This curated data set is available on Kaggle and is already split into
training, validation, and test subsets for machine learning exercises. The number of images
in each subset is given in Table 4.

Data set      Normal   Opacity
Training        1082      3110
Validation       267       773
Test             234       390

Table 4: Dataset details

Some sample images from the data set, along with their labels, are as follows:

Figure 44: Images - Normal category

Figure 45: Images - Opacity(Pneumonia) category

These are grayscale (single-channel) images of varying sizes but good resolution. As an
example, we can inspect the details of one randomly chosen image.

library(EBImage)
img <- readImage(file.path(train_image_files_path,"normal","IM-0688-0001.jpeg"))
print(img)

> print(img)
Image
colorMode : Grayscale
storage.mode : double
dim : 1514 1310
frames.total : 1
frames.render: 1

imageData(object)[1:5,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 0
[2,] 0 0 0 0 0 0
[3,] 0 0 0 0 0 0
[4,] 0 0 0 0 0 0
[5,] 0 0 0 0 0 0

• Data Augmentation - Image data generators can alter images with zoom, shear, rotation,
etc. to improve the learning process. Here we use basic image generators, which only rescale
the pixel values, to read the images from the data set directories.

train_data_gen <- image_data_generator(
  rescale = 1/255
)

valid_data_gen <- image_data_generator(
  rescale = 1/255
)

We use the flow_images_from_directory function to feed images to the model. We also standardize
the image size to 120 × 120.

target_labels <- c("Normal", "Opacity")

img_width <- 120
img_height <- 120
target_size <- c(img_width, img_height)
channels <- 1

# Training images
train_image_array_gen <- flow_images_from_directory(
  train_image_files_path,
  train_data_gen,
  target_size = target_size,
  class_mode = 'binary',
  classes = target_labels,
  color_mode = "grayscale",
  seed = 1994)

# Validation images
valid_image_array_gen <- flow_images_from_directory(
  valid_image_files_path,
  valid_data_gen,
  target_size = target_size,
  class_mode = 'binary',
  classes = target_labels,
  color_mode = "grayscale",
  seed = 1994)

• Model Training - This is a binary classification problem with the outcome being a normal or
opacity health condition. Random guessing gives an accuracy of 50%, but since about 74% of the
training and validation images belong to the Opacity class (Table 4), always predicting Opacity
already reaches roughly 74% on those sets; a good model should clearly exceed this baseline.
We start with a simple CNN model to learn the data representation in the images, and then
modify it to observe the impact. We first set some parameters required for the model training.

batch_size <- 132


num_classes <- 2
epochs <- 12

– model_0 - The base ConvNet model_0 has two Conv2D layers, one Flatten layer, and two
Dense layers. We use relu activation in the model and sigmoid activation in the final layer
to address this binary classification problem. The model summary is as follows:

Layer (type)                  Output Shape            Param #
=====================================
conv2d_79 (Conv2D)            (None, 118, 118, 4)     40
activation_59 (Activation)    (None, 118, 118, 4)     0
conv2d_78 (Conv2D)            (None, 116, 116, 2)     74
flatten_33 (Flatten)          (None, 26912)           0
dense_67 (Dense)              (None, 10)              269130
activation_58 (Activation)    (None, 10)              0
dense_66 (Dense)              (None, 1)               11
activation_57 (Activation)    (None, 1)               0
=====================================
Total params: 269,255
Trainable params: 269,255
Non-trainable params: 0
_____________________________________

We use binary_crossentropy as the loss function and optimizer_adadelta as the optimizer when
compiling the model. We also record accuracy in addition to the loss values when training
the model.
# Compile model
model_0 %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_adadelta(),
  metrics = c('accuracy')
)

# Fit model
history_0 <- model_0 %>% fit(
train_image_array_gen,
batch_size = batch_size,
epochs = epochs,
validation_data = valid_image_array_gen
)
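
The two quantities recorded during training can be illustrated without keras. The sketch below is a plain base-R version of the sigmoid output, the mean binary cross-entropy, and the accuracy at a 0.5 threshold. The labels and raw scores are made up for illustration, and the clipping constant `eps` is an assumption rather than a value taken from the text.

```r
# Sigmoid squashes a raw score into a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Mean binary cross-entropy between labels y (0 or 1) and probabilities p
binary_crossentropy <- function(y, p, eps = 1e-7) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 0)                     # made-up labels (1 = Opacity, 0 = Normal)
p <- sigmoid(c(2.0, -1.5, 0.3, 0.8))   # made-up raw scores of the last Dense layer

# Accuracy: share of predictions on the correct side of the 0.5 threshold
accuracy <- mean((p > 0.5) == y)

cat("loss:", binary_crossentropy(y, p), " accuracy:", accuracy, "\n")
```

The last made-up score is positive although its label is 0, so three of the four predictions are correct, while the loss additionally penalizes how confident the wrong prediction is.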

The model achieves an accuracy of 94.52% on the validation data in 12 epochs. The training
accuracy rises throughout the model fit, while the validation accuracy stays almost the
same. A similar pattern can be seen in the loss graph, with the validation loss rising towards
the end of the model fit. This indicates that the model is over-fitting in the later epochs.
The metrics plots for accuracy and loss are as follows:

Figure 46: CNN model 0 Fit - Metrics

– model_1 - We increase the model complexity by introducing another Conv2D layer and
increasing the number of filters. This change almost doubles the number of parameters in
the model from 269,255 to 521,473. The model summary is as follows:

Layer (type)                  Output Shape             Param #
=====================================
conv2d_82 (Conv2D)            (None, 118, 118, 16)     160
activation_63 (Activation)    (None, 118, 118, 16)     0
conv2d_81 (Conv2D)            (None, 116, 116, 8)      1160
conv2d_80 (Conv2D)            (None, 114, 114, 4)      292
flatten_33 (Flatten)          (None, 51984)            0
dense_67 (Dense)              (None, 10)               519850
activation_58 (Activation)    (None, 10)               0
dense_66 (Dense)              (None, 1)                11
activation_57 (Activation)    (None, 1)                0
=====================================
Total params: 521,473
Trainable params: 521,473
Non-trainable params: 0
_____________________________________

Learning stalls: the validation accuracy reaches 74.33% in the second epoch and stays
there. This seems to be a problem of under-fitting, as the model is not able to learn
the representations of the data in the images.

Figure 47: CNN model 1 Fit - Metrics
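
The 74.33% figure can be compared with the class proportions in Table 4. The sketch below computes the accuracy of a hypothetical classifier that always predicts the larger class; the data frame simply re-types the counts from Table 4.

```r
# Image counts per split, re-typed from Table 4
counts <- data.frame(
  split   = c("Training", "Validation", "Test"),
  normal  = c(1082, 267, 234),
  opacity = c(3110, 773, 390)
)

# Accuracy (in %) of always predicting the larger class in each split
counts$majority_pct <- with(counts,
                            round(100 * pmax(normal, opacity) / (normal + opacity), 2))
print(counts)
```

For the validation set this gives 773 / 1040 ≈ 74.33%, the same value at which the validation accuracy settles, which is consistent with a model that predicts Opacity for every image.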

– model_2 - We further increase the number of filters in the Conv2D layers and the number of
units in the first Dense layer, which raises the number of parameters to 10,403,105. The model
summary is as follows:

Layer (type)                  Output Shape             Param #
=====================================
conv2d_82 (Conv2D)            (None, 118, 118, 32)     320
activation_63 (Activation)    (None, 118, 118, 32)     0
conv2d_81 (Conv2D)            (None, 116, 116, 16)     4624
conv2d_80 (Conv2D)            (None, 114, 114, 8)      1160
flatten_33 (Flatten)          (None, 103968)           0
dense_67 (Dense)              (None, 100)              10396900
activation_58 (Activation)    (None, 100)              0
dense_66 (Dense)              (None, 1)                101
activation_57 (Activation)    (None, 1)                0
=====================================
Total params: 10,403,105
Trainable params: 10,403,105
Non-trainable params: 0
_____________________________________

The validation accuracy improves to 96.15% in the ninth epoch and the validation loss
increases after that.

Figure 48: CNN model 2 Fit - Metrics

– model_3 - We introduce a MaxPooling layer after each Conv2D layer. This reduces the number
of model parameters to 141,505. The model summary is as follows:

Layer (type)                      Output Shape             Param #
=====================================
conv2d_82 (Conv2D)                (None, 118, 118, 32)     320
activation_63 (Activation)        (None, 118, 118, 32)     0
max_pooling2d_1 (MaxPooling2D)    (None, 59, 59, 32)       0
conv2d_81 (Conv2D)                (None, 57, 57, 16)       4624
max_pooling2d_2 (MaxPooling2D)    (None, 28, 28, 16)       0
conv2d_80 (Conv2D)                (None, 26, 26, 8)        1160
max_pooling2d_2 (MaxPooling2D)    (None, 13, 13, 8)        0
flatten_33 (Flatten)              (None, 1352)             0
dense_67 (Dense)                  (None, 100)              135300
activation_58 (Activation)        (None, 100)              0
dense_66 (Dense)                  (None, 1)                101
activation_57 (Activation)        (None, 1)                0
=====================================
Total params: 141,505
Trainable params: 141,505
Non-trainable params: 0
_____________________________________

The validation accuracy improves to 96.83% in the ninth epoch with a validation loss at
0.0912.

Figure 49: CNN model 3 Fit - Metrics
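
Most of the parameter reduction from pooling comes from the first Dense layer, whose size depends on the length of the flattened vector. A small arithmetic sketch using the Flatten sizes and parameter counts from the two model summaries (103968 without pooling, 1352 with pooling); the helper name `dense_params` is illustrative.

```r
# Parameters of a Dense layer: one weight per input and unit, plus one bias per unit
dense_params <- function(n_in, units) (n_in + 1) * units

flat_no_pool <- 103968   # Flatten size without pooling
flat_pool    <- 1352     # Flatten size with pooling (13 * 13 * 8)

no_pool   <- dense_params(flat_no_pool, 100)
with_pool <- dense_params(flat_pool, 100)

cat("Dense params without pooling:", format(no_pool, big.mark = ","), "\n")
cat("Dense params with pooling:   ", format(with_pool, big.mark = ","), "\n")
```

The convolutional layers keep the same 320 + 4624 + 1160 parameters in both models, so the drop from 10,403,105 to 141,505 is entirely due to the smaller flattened vector.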

Observations - We tried four models of varying complexity and observe that added model
complexity does not always improve learning. Introducing a new layer or changing the number of
filters changes the number of parameters to be trained. An increase in the number of parameters
may also slightly increase the run-time of each epoch.

• Regularization - We choose model_2 from the initial testing as the base model and apply
regularization techniques to study their effect on the accuracy of the model. Regularization
provides a simple way to control over-fitting by constraining the weights to take smaller
values. Some of the standard regularization methods are BatchNormalization, Dropout, and
L2 regularization. As in the previous phases, we start with a base model and test the impact
of regularization by enabling one technique at a time. Metrics such as accuracy, loss, and
convergence rate are noted for each regularization change to the base model.

1. BatchNormalization - We modify the model by adding a BatchNormalization layer after
each Conv2D layer. This layer normalizes the output of the Conv2D layer. The model
achieves a validation accuracy of 96.15% in the ninth epoch, and the minimum validation
loss is 0.1584.

Figure 50: Using BatchNormalization

2. Adding L2 Regularization - We further modify the model by adding L2 regularization
to each Conv2D layer. L2 regularization pushes the model towards smaller weights. The
validation accuracy remains approximately the same (96.06%); however, the validation loss
increases to 0.2788.

Figure 51: Using L2 Regularization

3. Adding Dropout - Finally, we add a Dropout layer with rate = 0.5 after the Flatten
layer. The model achieves a validation accuracy of 96.25%, and the validation loss changes
to 0.1696.

Figure 52: Using Dropout
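
The three techniques can be sketched in a few lines of base R to show what each one computes. This is a simplified illustration, not the keras implementation: batch normalization is shown without its learned scale and shift, and the values of `eps` and `lambda`, as well as the sample activations and weights, are assumptions made for the example.

```r
set.seed(42)

# Batch normalization at training time (learned scale and shift omitted):
# standardize a batch of activations to zero mean and unit variance
batch_norm <- function(x, eps = 1e-3) {
  (x - mean(x)) / sqrt(mean((x - mean(x))^2) + eps)
}

# L2 penalty added to the loss: lambda times the sum of squared weights
l2_penalty <- function(w, lambda = 0.01) lambda * sum(w^2)

# Inverted dropout at training time: zero each unit with probability `rate`
# and rescale the survivors so the expected activation is unchanged
dropout <- function(x, rate = 0.5) {
  keep <- rbinom(length(x), size = 1, prob = 1 - rate)
  x * keep / (1 - rate)
}

acts <- c(0.2, 1.5, 3.0, 0.7, 2.2, 0.1)   # made-up activations
w    <- c(0.5, -1.0, 2.0)                 # made-up weights

print(round(batch_norm(acts), 3))
print(l2_penalty(w))      # 0.01 * (0.25 + 1 + 4)
print(dropout(acts))
```

Larger weights increase the L2 penalty quadratically, which is why adding it to the loss pushes the model towards smaller weights.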

Observations - We tried out three regularization techniques, namely BatchNormalization,
L2 Regularization, and Dropout. We make the following observations from these experiments:

– Slow convergence - The final model with all the regularization layers was slow to converge
as compared to the initial/base model, which is expected behavior.
– Impact on Accuracy - The following table shows the impact on Training and Validation
Accuracy as we applied regularization.

Regularization                                    Training Accuracy   Validation Accuracy
None                                              97.73%              96.15%
BatchNormalization                                99.88%              96.15%
BatchNormalization, L2 Regularization             99.95%              96.06%
BatchNormalization, L2 Regularization, Dropout    99.58%              96.25%

Table 5: Impact of Regularization

As we can see from the above table, applying regularization does not improve the validation
accuracy much.
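
One way to read Table 5 is through the gap between training and validation accuracy, which is a rough indicator of over-fitting. The sketch below re-types the values from Table 5; the short setup labels are abbreviations introduced here.

```r
# Accuracies in percent, re-typed from Table 5
results <- data.frame(
  setup = c("None", "BN", "BN + L2", "BN + L2 + Dropout"),
  train = c(97.73, 99.88, 99.95, 99.58),
  valid = c(96.15, 96.15, 96.06, 96.25)
)

# Gap between training and validation accuracy, in percentage points
results$gap <- round(results$train - results$valid, 2)
print(results)
```

In these runs the gap lies between 1.58 and 3.89 percentage points and is smallest for the model without regularization.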

Evaluate model on Test set - We use the model with all three regularization methods to
predict the labels for the test set and verify the model’s performance.

scores <- model %>% evaluate(
  x_test, y_test, verbose = 0
)

# Output metrics
cat('Test loss:', scores[[1]], '\n')
>Test loss: 0.1695605
cat('Test accuracy:', scores[[2]], '\n')
>Test accuracy: 0.9625

As we can see from the model evaluation, the ConvNet model delivers an impressive accuracy of
96.25% on the test set.

References
1. Ghatak, A. (2019). Deep learning with R. Springer Singapore.

2. Chollet, F. (2021). Deep learning with Python. Manning Publications.

3. Boyd, S. P., & Vandenberghe, L. (2021). Unconstrained minimization. In Convex optimization. Cambridge University Press.

4. MNIST CNN. Keras. (n.d.). From https://keras.rstudio.com/articles/examples/mnist_cnn.html

5. Chest X-ray dataset from https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia

6. Gradient Descent Algorithm image from https://medium.com/@rndayala/gradient-descent-algorithm-2553ccc79750

7. Newton Raphson method image from https://www.quora.com/What-is-the-difference-between-Newtons-method-and-secant-method-Explain-your-answer-by-example

8. Local and Global minimum image from https://vitalflux.com/local-global-maxima-minima-explained-examples/
