Artificial Neural Network Concepts and Examples
Louis
IRL @ UMSL
7-22-2022
Recommended Citation
Kabbay, Harcharan, "Artificial Neural Network Concepts and Examples" (2022). Theses. 402.
https://irl.umsl.edu/thesis/402
This Thesis is brought to you for free and open access by the UMSL Graduate Works at IRL @ UMSL. It has been
accepted for inclusion in Theses by an authorized administrator of IRL @ UMSL. For more information, please
contact [email protected].
August, 2022
Advisory Committee
Abstract
Artificial Neural Networks have gained much media attention in the last few years. Numerous articles on Artificial Intelligence, Machine Learning, and Deep Learning appear every day. Both academia and business are becoming increasingly interested in deep learning, which has many uses, including autonomous driving, computer vision, robotics, security and surveillance, and natural language processing. The recent development and focus have primarily been made possible by the convergence of related research efforts and the introduction of APIs like Keras. The availability of high-speed compute resources such as GPUs and TPUs has also been instrumental in developing deep learning models.
While APIs like Keras offer a layer of abstraction and make model development convenient, the mathematical logic behind the working of Neural Networks is often misunderstood. This thesis focuses on the building blocks of a Neural Network in terms of mathematical concepts and formulas. It also covers the core parts of Deep Learning algorithms, namely forward propagation, Gradient Descent, and Backpropagation.
The thesis briefly covers the basic operations in Convolutional Neural Networks (CNNs) and a working example of a multi-class classification problem using the Keras library in R. CNNs are a vast area of research in themselves, and covering all aspects of ConvNets is out of the scope of this thesis. However, it provides a foundation for understanding how Neural Networks work and how a CNN uses the building blocks of a basic Neural Network in an image classification problem.
nodes in an ANN that loosely replicate the neurons in a biological brain. Like synapses in the brain,
each link may send a signal to other neurons.
A primary neural network has an input layer, a hidden layer, and an output layer. The input layer
feeds the input features, and the hidden layer computes the activation function of the weighted input
to predict the output. The number of hidden layers can be expanded based on the requirements of the
model. This number of hidden layers is also referred to as the depth of the model.
$$V_i \in \mathbb{R}^n$$
$$y = (y_1, y_2, y_3, \ldots, y_m), \quad y_i \in \mathbb{R}$$
Supervised Learning aims to find an estimator or predictor function f for the output yi , based on
the given data set.
$$f : \mathbb{R}^n \longrightarrow \mathbb{R}, \text{ such that } f(V_i) \approx y_i \text{ for } i = 1, 2, \ldots, m$$
The base idea is to find a formula or function that generates the output label yi using the input
features Vi . The process of finding the predictor function f is often referred to as fitting or training
the solution/model.
2. Classification - Output Yi belongs to a finite set of values. A model to classify facial images
from an image set into categories of mood like happy, sad, angry, etc.
3. Binary Classification - This is a special case of Classification where the output $Y_i$ falls into one of two categories, e.g., Yes/No, 0/1, ±1. An example of Binary classification is estimating whether a person has diabetes based on historical training data for admitted patients.
The objectives of the Supervised Learning are as follows:-
• Prediction - Predict the output labels for unseen data with reasonable accuracy. The unseen data
is not part of the training set.
• Inference - Understanding the effect of each input feature and which one affects the most.
$$y = f(v) + \epsilon$$
Here $f(v)$ is the function to be estimated, $v$ is the input vector, and $y$ is the output. $\epsilon$ is the noise, a random variable drawn from the distribution $N(0, \sigma^2)$. The estimator $\hat{f}$ should satisfy
$$\hat{f}(v) \approx f(v)$$
Figure 3 plots some data points $(V_i, y_i)$ with $i = 1, 2, \ldots, 30$. The true predictor in this example is
$$f(x) = \sin(x^3 + 1)$$
and the noise is $\epsilon \sim N(0, 0.01)$.
Figure 3: Data points and True predictor
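In real-world scenarios, the true predictor is unknown, and we have to find the best estimator function $\hat{f}(v)$. There could be different candidates for $\hat{f}(v)$. Let us represent those candidate estimators as
$$g : \mathbb{R}^n \longrightarrow \mathbb{R}$$
| Input Var 1 | Input Var 2 | ... | Input Var n | Output label $y_i$ | $g(V_i)$ | $[g(V_i) - y_i]^2$ |
| $X_{11}$ | $X_{21}$ | ... | $X_{n1}$ | $y_1$ | $g(V_1)$ | $[g(V_1) - y_1]^2$ |
| $X_{12}$ | $X_{22}$ | ... | $X_{n2}$ | $y_2$ | $g(V_2)$ | $[g(V_2) - y_2]^2$ |
| ... | ... | ... | ... | ... | ... | ... |
| $X_{1m}$ | $X_{2m}$ | ... | $X_{nm}$ | $y_m$ | $g(V_m)$ | $[g(V_m) - y_m]^2$ |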
The average of all the squared errors, or Mean Square Error (MSE), is
$$MSE = \frac{1}{m} \sum_{i=1}^{m} [g(V_i) - y_i]^2$$
We can keep the MSE as a baseline to compare the quality of candidate estimator functions. We try to find the $\hat{f}(v)$ that minimizes the MSE, which is also called a Cost function. Some other versions of the MSE are the Root Mean Square Error (RMSE), $\sqrt{\frac{1}{m} \sum_{i=1}^{m} [g(V_i) - y_i]^2}$, and the Mean Absolute Error (MAE). Further measures include:
• Root Mean Square Logarithmic Error - Also referred to as RMSLE, this is an extension of RMSE. The formula of the RMSLE is as follows:
$$RMSLE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} [\log(1 + g(V_i)) - \log(1 + y_i)]^2}$$
• Cosine Similarity - This is a metric used in data analysis to compare two numerical sequences. Suppose we have two non-zero vectors $\vec{a}$ and $\vec{b}$. Then the Cosine similarity between these two vectors is given by:
$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$
The smaller the angle between the vectors, the higher the similarity.
• Log-Cosh error - The logarithm of the hyperbolic cosine of the prediction error is another measure of the quality of the predictor. The Log-Cosh loss is measured as:
$$L(y, y^p) = \sum_{i=1}^{m} \log(\cosh(y_i^p - y_i))$$
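The regression measures listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis's own implementation (the thesis works with Keras in R); the function names and the sample values in `predictions` and `labels` are assumptions made for the example.

```python
import math

def mse(pred, actual):
    # Mean of the squared errors [g(V_i) - y_i]^2
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

def rmse(pred, actual):
    return math.sqrt(mse(pred, actual))

def mae(pred, actual):
    # Mean of the absolute errors |g(V_i) - y_i|
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

def rmsle(pred, actual):
    # log1p(x) computes log(1 + x); inputs must be greater than -1
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(pred, actual)]
    return math.sqrt(sum(squared) / len(actual))

def log_cosh(pred, actual):
    return sum(math.log(math.cosh(p - a)) for p, a in zip(pred, actual))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical predictions g(V_i) and labels y_i
predictions = [2.5, 0.0, 2.0, 8.0]
labels = [3.0, 0.5, 2.0, 7.0]
print(mse(predictions, labels), rmse(predictions, labels), mae(predictions, labels))
```

All functions take plain lists of numbers, so any candidate estimator $g$ can be compared by evaluating it on the inputs first and passing the resulting list as `pred`.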
The concept of MSE does not apply to the Binary Classification, where we are predicting the
output as 0 or 1. So the square errors method does not make much sense here. Instead, we use the
concept of Maximum Likelihood Estimator to help us.
Let us take an example of Bernoulli distribution with a data set (Vi , yi ), where Vi ∈ Rn and yi can
take a value 0 or 1. We can represent y as
It is desired to have $q(V_i)^{y_i} \cdot [1 - q(V_i)]^{1 - y_i}$ close to 1. Since we have $m$ records in the data set, we need to
$$\text{Maximize: } \prod_{i=1}^{m} q(V_i)^{y_i} \cdot [1 - q(V_i)]^{1 - y_i} \quad \longleftarrow \text{MLE}$$
We know that the MLE in this case belongs to $(0, 1)$, so we apply the negative log to the above expression.
$$-\log(MLE) = -\log\left[\prod_{i=1}^{m} q(V_i)^{y_i} \cdot [1 - q(V_i)]^{1 - y_i}\right]$$
$$-\log(MLE) = \sum_{i=1}^{m} [-y_i \log(q(V_i)) - (1 - y_i) \log(1 - q(V_i))] \quad \longleftarrow \text{Cross-Entropy function}$$
The cost function, in this case, is −log(M LE), so we need to minimize this as part of the func-
tion/model training for binary classification. Categorical Cross-Entropy is a version of Binary Cross-
Entropy, which is used for multi-class classification problems.
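The step from the likelihood product to the Cross-Entropy sum can be checked numerically. The sketch below is illustrative only; the function names and the probabilities in `probs` are assumptions, standing in for the values $q(V_i)$ that a model would return.

```python
import math

def likelihood(probs, labels):
    # Product of q(V_i)^y_i * (1 - q(V_i))^(1 - y_i) over all records
    result = 1.0
    for q, y in zip(probs, labels):
        result *= q ** y * (1 - q) ** (1 - y)
    return result

def binary_cross_entropy(probs, labels):
    # Sum of -y_i*log(q(V_i)) - (1 - y_i)*log(1 - q(V_i))
    return sum(-y * math.log(q) - (1 - y) * math.log(1 - q)
               for q, y in zip(probs, labels))

labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]   # hypothetical model outputs q(V_i)

print(-math.log(likelihood(probs, labels)))
print(binary_cross_entropy(probs, labels))
```

Both printed values agree up to floating-point rounding, which is why minimizing the Cross-Entropy is equivalent to maximizing the likelihood.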
Other Loss functions for classification problems are as follows:
• Hinge Loss - This loss function is mostly used for classification with Support Vector Machines (SVMs). The formula for Hinge Loss is as follows:
$$L(y) = \max(0, 1 - t \cdot y)$$
Here, $t = \pm 1$ is the intended output for the classifier and $y = w \cdot x + b$.
• Huber Loss - The Huber Loss function combines the MSE and the MAE. It is less sensitive to outliers than the MSE and is also differentiable at zero.
$$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2, & \text{for } |y - f(x)| \le \delta, \\ \delta |y - f(x)| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$
• Kullback-Leibler Divergence - This is used to measure how far apart two distributions are, i.e., the existing distribution $p$ and the predicted data distribution $q$. The discrimination information of $p$ with respect to $q$ is given as
$$D_{KL}(p \| q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)$$
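The three losses above translate directly into code. The following is a minimal sketch with assumed function names; `delta=1.0` is an arbitrary default chosen for the example, and the KL divergence skips terms with $p(x) = 0$, following the usual convention $0 \log 0 = 0$.

```python
import math

def hinge_loss(t, y):
    # t is the intended output (+1 or -1), y = w.x + b is the raw classifier score
    return max(0.0, 1.0 - t * y)

def huber_loss(y, prediction, delta=1.0):
    residual = abs(y - prediction)
    if residual <= delta:
        return 0.5 * residual ** 2               # quadratic near zero, like MSE
    return delta * residual - 0.5 * delta ** 2   # linear in the tails, like MAE

def kl_divergence(p, q):
    # p and q are discrete distributions given as lists of probabilities
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

print(hinge_loss(1, 2.0), hinge_loss(-1, 0.5))
print(huber_loss(0.0, 0.5), huber_loss(0.0, 3.0))
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
```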
Parametric Models
We discussed the guidelines for finding a predictor function for regression and for binary classification problems. The predictor function is chosen from a specific set of functions that are indexed by parameters $\theta = (\theta_0, \theta_1, \ldots)$. The parameter vector $\theta$ holds the biases and weights used to compute the output labels from the input vectors. Here is an example: suppose we have a Hypothesis set $H$ that contains the predictor functions $f(V)$, and suppose we have two variables in the input vector, i.e., $x_1, x_2$. Let $H$ be the set of polynomial functions $f(x_1, x_2)$ of degree $d$.
$$d = 1, \quad f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2 \quad \longleftarrow \theta = (a_0, a_1, a_2) \in \mathbb{R}^3$$
$$d = 2, \quad f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 + a_4 x_1^2 + a_5 x_2^2$$
Here, $\theta = (a_0, a_1, a_2, a_3, a_4, a_5) \in \mathbb{R}^6$.
As part of the training, the estimator functions from the Hypothesis set are evaluated against the provided training set.
The learning algorithm uses the Cost function to measure how close the estimated value is to the actual value. Some of the standard loss functions are Mean Square Error and Cross Entropy. Once the loss is calculated, the learning algorithm modifies the weights and biases (parameters) to find the optimal $\hat{\theta}$ through optimization techniques like Gradient Descent. A high-level learning workflow is described in Figure 5.
Besides the regular parameters, another set of parameters called Hyper-parameters also impacts the learning of a training algorithm. Hyper-parameters include, but are not limited to, the number of neurons per layer and the learning rate.
As part of the model learning process, we capture the validation metrics, i.e., the model’s perfor-
mance on the validation data set, which is separate from a training set. A loss function is used to
calculate the validation error in regression problems. The loss function is evaluated using the validation
data set on the optimal (trained) parameters θ̂.
II Optimizing the Loss function
We discussed the Loss function and how it measures the quality of a predictor or estimator. The important piece is how to find the optimal values of the weights and biases that reduce this loss. This is a mathematical optimization problem: minimize the function $J(x_1, x_2, \ldots, x_n)$.
One common issue in the optimization problem is distinguishing between local minima and the global minimum. To illustrate, we plot (Figure 6) the values of $x^2 + 10 \sin(x)$ for $x$ ranging from -10 to 10. In the given range of $x$ values, this function has a global minimum and at least one other local minimum.
The minimization problem’s solution is finding the global minimum, i.e., x0 . The minimum realiza-
tion for function J is J (x0 ).
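The difference between local and global minima of $x^2 + 10 \sin(x)$ can be made concrete with a brute-force scan. This is only a sketch: the grid resolution `n` and the neighbour-comparison test are assumptions chosen for illustration, and a grid scan like this does not scale to functions of many variables.

```python
import math

def J(x):
    return x ** 2 + 10 * math.sin(x)

# Evaluate J on a fine grid over [-10, 10]
n = 20000
xs = [-10 + 20 * i / n for i in range(n + 1)]
values = [J(x) for x in xs]

# A grid point is a local minimum if it lies below both of its neighbours
local_minima = [
    (xs[i], values[i])
    for i in range(1, n)
    if values[i] < values[i - 1] and values[i] < values[i + 1]
]

# The global minimum x0 is the local minimum with the lowest value J(x0)
global_minimum = min(local_minima, key=lambda point: point[1])

for x, value in local_minima:
    print(f"local minimum near x = {x:.3f}, J = {value:.3f}")
print(f"global minimum near x0 = {global_minimum[0]:.3f}")
```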
The complexity of the minimization problem increases as the number of variables in the Loss function
J increases. Here is an example plot for a two variable function J (x1 , x2 ).
Figure 7: Local and Global minimum for $J(x_1, x_2)$ - References 8
A general theoretical procedure using Calculus to minimize a multivariate loss function $J(x_1, x_2, \ldots, x_n)$ is as follows:
4. We can classify the critical points using the second derivative test. This involves $n^2$ second-order partial derivatives.
5. Among the critical points that the second-order partial derivatives classify as local minima, we evaluate $J$ to find the global minimum.
In general Machine Learning problems, the equation $\nabla J = 0$ is quite challenging to solve. Numerical minimization methods have been developed to approximate the global minimum of the function $J$. We are going to discuss two commonly used algorithms, namely:
• Gradient Descent Method
• Newton-Raphson Method
Let us assume we choose a direction given by a unit vector $h \in \mathbb{R}^n$ and a step $t \ge 0$, $t \in \mathbb{R}$. The variation in the vector $v$ is given by $v = v_0 + th$. Now, we would like to know the impact of this variation on $J(v_0 + th)$, which is given by the derivative of this one-variable function w.r.t. $t$, i.e., $\frac{dJ(v_0 + th)}{dt}$. Using the multi-variable chain rule,
$$\frac{dJ(v_0 + th)}{dt} = \sum_{i=1}^{n} \frac{\partial J(v_0 + th)}{\partial x_i} \cdot h_i = \langle \nabla J(v_0 + th), h \rangle$$
that is, the dot product of the gradient of $J$ and the direction vector. More specifically, we are interested in the instantaneous rate of change at $v_0$:
$$\frac{dJ(v_0 + th)}{dt}(0) = \langle \nabla J(v_0), h \rangle$$
If this rate of change is negative, we are moving in a direction in which $J$ decreases. The direction that makes $\langle \nabla J(v_0), h \rangle$ negative and minimal is the direction of the fastest descent.
Treating this as a Linear Algebra problem, we have a fixed vector $\nabla J(v_0)$ and a varying unit direction vector $h$. We need to find a direction $h$ that minimizes the dot product $\langle \nabla J(v_0), h \rangle$. The cosine of the angle $\theta$ between these two vectors is given by
$$\cos(\theta) = \frac{\langle \nabla J(v_0), h \rangle}{\|\nabla J(v_0)\|}, \quad \text{since } \|h\| = 1$$
The denominator represents the length of the vector $\nabla J(v_0)$, so if we minimize the value of $\cos(\theta)$, we minimize the dot product $\langle \nabla J(v_0), h \rangle$. $\cos(\theta)$ takes values between $+1$ and $-1$. Going with the minimum value, we pick $\cos(\theta) = -1$, which means $\theta = \pi = 180^\circ$. With this information, we decide to move in the direction opposite to the gradient.
$$h = -\frac{1}{\|\nabla J(v_0)\|} \cdot \nabla J(v_0)$$
1. Initialize starting vector - Select a random vector $v_0 = (x^0_1, x^0_2, \ldots, x^0_n)$ in the domain of $J$.
2. Perform iterations to minimize the value of $J$ - Let us assume that from a prior iteration we have
$v_k = (x^k_1, x^k_2, \ldots, x^k_n)$
Find $\nabla J(v_k)$
Define $v_{k+1} = v_k - \alpha \cdot \nabla J(v_k)$, where $\alpha$ is the learning rate.
Possibilities:
• Fixed Learning rate $\alpha$ - The learning rate $\alpha$ is chosen at the beginning and remains the same throughout all the iterations.
• Exact line search - The learning rate $\alpha$ is chosen at each iteration by minimizing the one-variable function
$$t \mapsto J(v_k - t \cdot \nabla J(v_k))$$
• Backtracking line search - Uses advanced techniques of convex optimization.
During this phase, the value of the loss function $J(v_k)$ reduces in each iteration such that
$$J(v_0) > J(v_1) > J(v_2) > \cdots > J(v_k) > J(v_{k+1}) > \ldots$$
This descent can be shown for a two-variable function as follows:
Figure 10: Loss decreases on each iteration to reach a global minimum - References 6
3. Termination - Per a Mathematical Theorem (Convex Optimization, Boyd and Vandenberghe 2021), under certain convergence conditions that depend upon the learning rate $\alpha$:
(b) $|J(v_k) - J(v^*)| < C \cdot \|\nabla J(v_k)\|^2$, where $v^*$ is a local minimum for $J$, and $C$ is a constant.
We set a small value $\epsilon > 0$ as the acceptable error. The Gradient Descent Algorithm stops when $\|\nabla J(v_k)\| < \epsilon$. Then, we have
$$|J(v_k) - J(v^*)| < C \cdot \epsilon^2$$
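The three steps (initialization, iteration with a fixed learning rate, termination on $\|\nabla J(v_k)\| < \epsilon$) can be sketched as follows. The quadratic loss `J`, its gradient, the starting vector, and the values of `alpha` and `epsilon` are all assumptions picked so that the example is easy to verify by hand; its minimum is at $(1, -3)$.

```python
import math

def J(v):
    # Hypothetical convex loss with its minimum at (1, -3)
    x1, x2 = v
    return (x1 - 1) ** 2 + 2 * (x2 + 3) ** 2

def grad_J(v):
    x1, x2 = v
    return [2 * (x1 - 1), 4 * (x2 + 3)]

def gradient_descent(v0, alpha=0.1, epsilon=1e-6, max_iterations=10_000):
    v = list(v0)
    history = [J(v)]                      # J(v_0), J(v_1), ...
    for _ in range(max_iterations):
        g = grad_J(v)
        # Termination: stop once the gradient norm falls below epsilon
        if math.sqrt(sum(gi * gi for gi in g)) < epsilon:
            break
        # Update: v_{k+1} = v_k - alpha * grad J(v_k)
        v = [vi - alpha * gi for vi, gi in zip(v, g)]
        history.append(J(v))
    return v, history

v_star, history = gradient_descent([5.0, 5.0])
print(v_star, len(history) - 1)
```

The `history` list records the loss at every iterate, so the decreasing chain $J(v_0) > J(v_1) > \ldots$ can be inspected directly.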
Observations
Some important observations around the characteristics and use of Gradient Descent are as follows:
• The iteration methods like Exact line search and backtracking line search are computationally
expensive compared to the use of a fixed learning rate α.
• The fixed learning rate α, if chosen too small, can make the algorithm very slow to reach a
solution. And choosing α too high can make it hard to converge.
Figure 11: Impact of learning rate on convergence - References 6
• Gradient Descent Algorithm can get stuck in a local minimum. To address this, we should run
the Gradient Descent multiple times, with a different initialization vector v0 . The solution is
chosen as the value of v that produces the lowest value for J (v).
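The restart strategy from the last observation can be sketched on the one-variable function $x^2 + 10 \sin(x)$ used earlier. The learning rate, the number of restarts, and the fixed random seed are assumptions made so that the run is reproducible.

```python
import math
import random

def J(x):
    return x ** 2 + 10 * math.sin(x)

def dJ(x):
    return 2 * x + 10 * math.cos(x)

def gradient_descent(x0, alpha=0.01, epsilon=1e-8, max_iterations=100_000):
    x = x0
    for _ in range(max_iterations):
        g = dJ(x)
        if abs(g) < epsilon:
            break
        x -= alpha * g
    return x

random.seed(0)                                   # fixed seed for reproducibility
starts = [random.uniform(-10, 10) for _ in range(20)]
candidates = [gradient_descent(x0) for x0 in starts]

# Keep the candidate with the lowest loss value
best = min(candidates, key=J)
print(f"best x = {best:.4f}, J = {J(best):.4f}")
```

Some starting points end in the local minimum near $x \approx 3.84$, while others reach the global minimum near $x \approx -1.31$; taking the candidate with the lowest $J$ selects the latter.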
1. Set $w := \theta_k$
4. $\theta_{k+1} := w$
In Batch Gradient Descent, we compute the entire gradient sum over the training set and only then update the parameter vector. In Stochastic Gradient Descent (SGD), we update the parameter vector immediately after computing each individual term of the gradient sum.
From a computational point of view, SGD often converges faster, but it can be harder to control because each update follows only a single term of the gradient sum.
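The difference between the two update schedules can be sketched on a small least-squares problem. Everything here is an assumption for illustration: the model $\theta_0 + \theta_1 x$, the loss $\frac{1}{2}\sum (\theta_0 + \theta_1 x - y)^2$, the noise-free data generated from $y = 1 + 2x$, and the learning rate.

```python
def batch_step(theta, data, alpha):
    # Batch: accumulate the full gradient sum, then update once
    g0 = sum(theta[0] + theta[1] * x - y for x, y in data)
    g1 = sum((theta[0] + theta[1] * x - y) * x for x, y in data)
    return [theta[0] - alpha * g0, theta[1] - alpha * g1]

def sgd_epoch(theta, data, alpha):
    # Stochastic: update the parameters after every single record
    t0, t1 = theta
    for x, y in data:
        error = t0 + t1 * x - y
        t0, t1 = t0 - alpha * error, t1 - alpha * error * x
    return [t0, t1]

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # y = 1 + 2x

theta_batch = [0.0, 0.0]
theta_sgd = [0.0, 0.0]
for _ in range(2000):
    theta_batch = batch_step(theta_batch, data, alpha=0.05)
    theta_sgd = sgd_epoch(theta_sgd, data, alpha=0.05)

print(theta_batch, theta_sgd)
```

On this noise-free data both schedules recover the generating parameters $(1, 2)$; with noisy data, the SGD iterates would typically keep fluctuating around the minimum unless the learning rate is reduced over time.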
$$\{1, 2, \ldots, m\} = B_1 \cup B_2 \cup \cdots \cup B_l$$
$$|B_q| \le b \quad \text{where } q = 1, 2, \ldots, l$$
The iteration for a Mini-Batch Gradient Descent is as follows:
1. Set $w := \theta_k$
$$J(v^*) \le J(v) \quad \text{for } v \in \mathbb{R}^n$$
The basic strategy is to approximate the critical points, i.e., the solutions to the equation $g(v) = 0$, where
$$g = \nabla J : \mathbb{R}^n \to \mathbb{R}^n$$
Then, using the standard calculus method, we evaluate these critical points for the function $J$ to find out which one is a global minimum. This method enables us to numerically solve a system of equations:
$$g(x_1, x_2, \ldots, x_n) = 0, \quad g : \mathbb{R}^n \to \mathbb{R}^n$$
These are $n$ non-linear equations, i.e.,
$$\begin{aligned} g_1(x_1, x_2, \ldots, x_n) &= 0 \\ g_2(x_1, x_2, \ldots, x_n) &= 0 \\ &\ \vdots \\ g_n(x_1, x_2, \ldots, x_n) &= 0 \end{aligned}$$
3. Per a Mathematical Theorem (Convex Optimization, Boyd and Vandenberghe 2021), if the initial value $x_0$ is chosen properly, the sequence $(x_k)_{k \ge 0}$ will converge, with
$$\lim_{k \to \infty} x_k = x^*$$
$$g(x) = g(x_0) + g'(x_0)(x - x_0) + \frac{g''(x_0)}{2}(x - x_0)^2 + \ldots$$
With the first-order approximation of $g(x)$, setting $g(x) = 0$ gives
$$g(x_0) + g'(x_0)(x - x_0) = 0$$
We solve for $x$ to get
$$x = x_0 - \frac{g(x_0)}{g'(x_0)}$$
On each iteration, we improve the approximation of the root $x^*$.
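The one-variable update $x_{k+1} = x_k - g(x_k) / g'(x_k)$ can be sketched as below. The example equation $g(x) = x^2 - 2$, the starting value, and the step-size stopping rule are assumptions for illustration; the root being approximated is $\sqrt{2}$.

```python
def newton_raphson(g, g_prime, x0, tolerance=1e-12, max_iterations=50):
    x = x0
    for _ in range(max_iterations):
        step = g(x) / g_prime(x)   # assumes g'(x) is non-zero along the way
        x = x - step               # x_{k+1} = x_k - g(x_k) / g'(x_k)
        if abs(step) < tolerance:
            break
    return x

# Solve g(x) = x^2 - 2 = 0 starting from x0 = 1
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)
```

Starting from $x_0 = 1$, only a handful of iterations are needed, which reflects the fast local convergence of the method when the initial value is chosen close to the root.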
Figure 12: Convergence using Newton-Raphson method - References 7
It is worth noting that this is a Matrix function, because if we change the inputs x1 , x2 , . . . , xn , the
partial derivatives would change and the whole Jacobian Matrix would change.
Our goal is to find a solution $v^* \in \mathbb{R}^n$ for the equation $g(v) = 0$. The algorithm proceeds as follows:
1. Choose an initial vector $v_0$ close to where we believe $v^*$ should be.
2. Perform iterations to calculate
$$v_{k+1} = v_k - [M_g(v_k)]^{-1} \cdot g(v_k)$$
where $[M_g(v_k)]^{-1}$ is the inverse of the $n \times n$ Jacobian matrix $M_g$ evaluated at $v_k$. We use this method to find the values $v_1, v_2, v_3, \ldots$
The rationale for the Newton-Raphson method is that there is a neighborhood $U$ of $v^*$ such that, if $v_0 \in U$, we have:
• The Jacobian matrix $M_g(v_k)$ is non-singular for any $k \ge 0$.
• The sequence $(v_k)_{k \ge 0}$ is convergent and $\lim_{k \to \infty} v_k = v^*$.
3. Terminate the algorithm once the estimated error $\|v_k - v^*\|$ falls within the permitted tolerance.
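The Hessian matrix of $J$ collects its second-order partial derivatives:
$$H_J = \begin{pmatrix} \frac{\partial^2 J}{\partial x_1^2} & \frac{\partial^2 J}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 J}{\partial x_1 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 J}{\partial x_n \partial x_1} & \frac{\partial^2 J}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 J}{\partial x_n^2} \end{pmatrix}$$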
Observations
Some important observations on the Hessian matrix $H_J$ are as follows:
• The Hessian $H_J$ is a symmetric matrix, i.e., it is a square matrix that is equal to its transpose.
• If $H_J$ is positive semi-definite everywhere, then $J$ is a convex function, hence a good candidate for minimization. For a convex function, every local minimum is also a global minimum.
The algorithm works as follows to approximate the critical points:
1. Choose an initial vector $v_0 = (x^0_1, x^0_2, \ldots, x^0_n)$.
2. Perform iterations to calculate
$$v_{k+1} = v_k - [H_J(v_k)]^{-1} \cdot \nabla J(v_k)$$
where $[H_J(v_k)]^{-1}$ is the inverse of the $n \times n$ Hessian matrix, i.e., of the Jacobian $M_g$ of $g = \nabla J$, evaluated at $v_k$. On each iteration, we should get closer to $v^*$.
3. Terminate the algorithm once the estimated error $\|v_k - v^*\|$ falls within the permitted tolerance.
The iteration process in the Newton-Raphson method is expensive because it requires calculating second-order partial derivatives, whereas the Gradient Descent method uses only first-order derivatives.
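A two-variable sketch of the Hessian-based iteration is shown below. The loss $J(x, y) = x^2 + xy + y^2 + e^x$ is a hypothetical convex example chosen because its gradient and Hessian are easy to write by hand; the helper `solve_2x2` applies the inverse of a $2 \times 2$ matrix directly and is not a general linear solver.

```python
import math

def grad_J(v):
    # Gradient of J(x, y) = x^2 + x*y + y^2 + exp(x)
    x, y = v
    return [2 * x + y + math.exp(x), x + 2 * y]

def hessian_J(v):
    x, _ = v
    return [[2 + math.exp(x), 1.0], [1.0, 2.0]]

def solve_2x2(matrix, rhs):
    # Returns matrix^{-1} * rhs for a non-singular 2x2 matrix
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]

def newton_minimize(v0, tolerance=1e-12, max_iterations=50):
    v = list(v0)
    for _ in range(max_iterations):
        # step = [H_J(v_k)]^{-1} * grad J(v_k)
        step = solve_2x2(hessian_J(v), grad_J(v))
        v = [v[0] - step[0], v[1] - step[1]]
        if math.hypot(step[0], step[1]) < tolerance:
            break
    return v

v_star = newton_minimize([1.0, 1.0])
print(v_star, grad_J(v_star))
```

The gradient at the returned point is numerically zero, confirming that the iteration has found a critical point; since this particular $J$ is convex, that point is its global minimum.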
III Basic Architecture of Neural Networks
Neural Networks are capable of dealing with both structured and unstructured data. Structured data
is a highly organized format like tabular data stored in relational databases, CSV files, data frames,
etc. Unstructured data, also categorized as qualitative data, is in different forms like videos, pictures,
emails, social media posts, IoT sensor data, etc.
The basic architecture of a Neural Network consists of an input layer, a hidden layer, and an output
layer.
In Figure 13, the Neural network has two input variables X = [x1 , x2 ] and a hidden layer with
four neurons (nodes), which uses the inputs and weights (w1, w2, . . . , w8) to compute the activation
function. The output from the hidden layer passes to the output layer to predict the output ŷ. The
difference between the predicted and actual output is used to calculate the cost function. The cost
function is minimized by adjusting the weights after each iteration. Gradients are used to update the
weights, and the next iteration starts. This process repeats to get closer to the optimal parameter
weights.
Some common non-linear transformations applied to the weighted inputs include sigmoid, tanh, ReLU, SELU, softmax, etc. Feed-forward Neural Networks can be used for classification use cases and in unsupervised learning as auto-encoders.
Figure 16: A basic RNN architecture - References 1
The above diagram shows two adjacent hidden layers, $l$ of size $s_l$ and $l + 1$ of size $s_{l+1}$, in a NN model. All the nodes in layer $l$ are connected to the nodes of the successive layer $l + 1$. The notation is as follows:
• Weights are represented as $\theta_{ij}^{(l)}$, where $j$ is the starting node on layer $l$ and $i$ is the target node on layer $l + 1$:
$$a_j^{(l)} \xrightarrow{\ \theta_{ij}^{(l)}\ } a_i^{(l+1)}$$
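At each neuron in a hidden or an output layer, the following two operations happen:
• Linear sum - Compute the weighted sum of the incoming values, $z_i^{(l+1)} = \theta_{i0}^{(l)} a_0^{(l)} + \theta_{i1}^{(l)} a_1^{(l)} + \cdots + \theta_{i s_l}^{(l)} a_{s_l}^{(l)}$, with $a_0^{(l)} = 1$. Here $\theta_{i0}^{(l)}$ is the weight of the bias unit.
• Activation - Evaluate the weighted sum with an activation function $\varphi$ like sigmoid, tanh, ReLU, softmax, etc.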
We can further elaborate this operation using Linear algebra. Let us convert the layer inputs into vectors.
$$\text{Vector of units in layer } l = V_l = \begin{pmatrix} a_1^{(l)} \\ a_2^{(l)} \\ \vdots \\ a_{s_l}^{(l)} \end{pmatrix} \in \mathbb{R}^{s_l} \tag{4}$$
$$\text{Extended vector for layer } l = \bar{V}_l = \begin{pmatrix} 1 \\ a_1^{(l)} \\ a_2^{(l)} \\ \vdots \\ a_{s_l}^{(l)} \end{pmatrix} \in \mathbb{R}^{s_l + 1} \tag{5}$$
$$\text{Vector of units in layer } l + 1 = V_{l+1} = \begin{pmatrix} a_1^{(l+1)} \\ a_2^{(l+1)} \\ \vdots \\ a_{s_{l+1}}^{(l+1)} \end{pmatrix} \in \mathbb{R}^{s_{l+1}} \tag{6}$$
Figure 18: Forward propagation with bias
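The arrows connecting the neurons from layer $l$ to $l + 1$ carry the weights $\theta_{ij}^{(l)}$, with
$$1 \le i \le s_{l+1}, \quad 0 \le j \le s_l$$
All the weights can be written in a matrix format as follows:
$$\theta^{(l)} = \begin{pmatrix} \theta_{10}^{(l)} & \theta_{11}^{(l)} & \theta_{12}^{(l)} & \cdots & \theta_{1 s_l}^{(l)} \\ \theta_{20}^{(l)} & \theta_{21}^{(l)} & \theta_{22}^{(l)} & \cdots & \theta_{2 s_l}^{(l)} \\ \vdots & \vdots & \vdots & & \vdots \\ \theta_{s_{l+1} 0}^{(l)} & \theta_{s_{l+1} 1}^{(l)} & \theta_{s_{l+1} 2}^{(l)} & \cdots & \theta_{s_{l+1} s_l}^{(l)} \end{pmatrix} \longrightarrow \text{Size of matrix} = s_{l+1} \times (s_l + 1) \tag{7}$$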
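The forward propagation can be represented compactly using Linear Algebra:
$$V_{l+1} = \varphi(\theta^{(l)} \bar{V}_l)$$
where the activation function $\varphi$ is applied element-wise.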
We can consider the basic format of a Neural Network with the n input features and s output units.
Hyperparameters
$\varphi$ - activation function (may differ for each layer)
Parameters
Total parameters
$$\sum_{l=1}^{k-1} (s_l + 1) \cdot s_{l+1}$$
All these parameters need to be optimized as part of the training process. As an outcome of
successful training, we select an optimal set of parameters θ̂.
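The parameter count can be sketched as a one-line sum over adjacent layer sizes. The function name and the list-of-sizes representation are assumptions; the `+ 1` accounts for the bias unit of each layer, matching the term $(s_l + 1) \cdot s_{l+1}$.

```python
def total_parameters(layer_sizes):
    # layer_sizes = [s_1, s_2, ..., s_k], from the input layer to the output layer
    return sum((s_l + 1) * s_next
               for s_l, s_next in zip(layer_sizes, layer_sizes[1:]))

# Two inputs, one hidden layer with four nodes, one output
print(total_parameters([2, 4, 1]))
```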
As an example, we can count the parameters for a NN architecture with two inputs, one hidden layer (4 nodes), and one output. As we can see from Figure 19, the total parameters are $(2 + 1) \cdot 4 + (4 + 1) \cdot 1 = 17$.
$$x_1 \oplus x_2 \equiv (x_1 \lor x_2) \land \lnot(x_1 \land x_2)$$
| $x_1$ | $x_2$ | $y$ |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 0 |
As we notice, no linear solution can segregate the red and green points or the outcome of a logical
XOR operation. We can solve this problem using a NN with a hidden layer.
We can use the sigmoid activation function at layer 2 and layer 3 along with the following weight matrices to solve this problem:
$$\theta^1 = \begin{pmatrix} -10 & 20 & 20 \\ 30 & -20 & -20 \end{pmatrix}$$
$$\theta^2 = \begin{pmatrix} -30 & 20 & 20 \end{pmatrix}$$
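The XOR network with these weights can be evaluated with a short forward-propagation sketch. The helper names are assumptions; each weight row is read as (bias weight, weight for the first input, weight for the second input), and the bias unit is prepended as a constant 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(theta, activations):
    # Prepend the bias unit, then apply the sigmoid to each weighted sum
    extended = [1.0] + list(activations)
    return [sigmoid(sum(w * a for w, a in zip(row, extended))) for row in theta]

THETA_1 = [[-10, 20, 20], [30, -20, -20]]   # input layer -> hidden layer
THETA_2 = [[-30, 20, 20]]                   # hidden layer -> output layer

def xor_network(x1, x2):
    hidden = forward_layer(THETA_1, [x1, x2])
    return forward_layer(THETA_2, hidden)[0]

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, round(xor_network(x1, x2)))
```

With these weights the first hidden unit behaves like a logical OR, the second like a NAND, and the output unit like an AND of the two, which together reproduce XOR.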
2. Rectified Linear Unit or ReLU: This is the most commonly used activation function on the hidden layers in a NN. ReLU returns 0 if the input is less than or equal to 0, and returns the input if the input is greater than 0. So, it is used when we want the output of a neuron to be non-negative.
$$\varphi(Z) = \max(Z, 0), \quad \varphi : \mathbb{R} \to [0, \infty)$$
3. Leaky Rectified Linear Unit or LReLU: This function allows small non-zero values on the negative side and behaves like the Identity function for positive values.
$$\varphi : \mathbb{R} \to \mathbb{R}$$
$$\varphi(Z) = \begin{cases} z, & \text{if } z > 0 \\ az, & \text{if } z \le 0; \text{ usually } a = 0.01 \end{cases}$$
When $a$ is sampled at random instead of being fixed, the function is called randomized ReLU.
Both ReLU and LReLU are widely used for Regression problems due to their piecewise-linear nature.
4. Sigmoid: The Sigmoid activation function is most commonly used in Binary classification problems and in Logistic Regression. This function takes a real number as input and returns a value that can be read as a probability.
$$\varphi : \mathbb{R} \to (0, 1)$$
$$\varphi(Z) = \frac{1}{1 + e^{-Z}}$$
5. Hyperbolic Tangent: This activation looks like the Sigmoid activation function in shape, but the output value ranges between $-1$ and $+1$. It is also used in Binary classification problems, but does not return a probability.
$$\varphi : \mathbb{R} \to (-1, 1)$$
$$\varphi(Z) = \frac{e^Z - e^{-Z}}{e^Z + e^{-Z}}$$
6. Softmax: This activation function is used for multi-class classification problems. It takes $k$ numerical entries and transforms them into $k$ probabilities.
$$p : \mathbb{R}^k \to (0, 1)^k$$
$$p \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{pmatrix} = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_k \end{pmatrix} \tag{8}$$
$$p_i : \mathbb{R}^k \to (0, 1)$$
The $i$th probability is given by
$$p_i = \frac{e^{z_i}}{e^{z_1} + e^{z_2} + \cdots + e^{z_k}}$$
All the probabilities for the $k$ classes sum up to 1.
$$p_1 + p_2 + \cdots + p_k = 1$$
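The activation functions in this list can be sketched as plain Python functions. The names are assumptions made for the example; the softmax subtracts the largest entry before exponentiating, a common numerical safeguard that leaves the resulting probabilities unchanged.

```python
import math

def relu(z):
    return max(z, 0.0)

def leaky_relu(z, a=0.01):
    return z if z > 0 else a * z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def softmax(zs):
    shift = max(zs)                        # guards against overflow in exp
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-3.0), leaky_relu(-2.0), sigmoid(0.0), tanh(0.0))
print(softmax([1.0, 2.0, 3.0]))
```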
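• Derivative of Sigmoid: The Sigmoid function is normally used in the final layer of a NN model in a binary classification scenario. It provides the probability score for the output. The Sigmoid function is given as
$$\varphi(Z) = \frac{1}{1 + e^{-Z}}$$
The derivative of the Sigmoid function is as follows:
$$\frac{d}{dZ}\varphi(Z) = \varphi(Z)(1 - \varphi(Z))$$
• Derivative of Hyperbolic Tangent: The tanh function is represented as
$$\varphi(Z) = \tanh Z = \frac{\sinh Z}{\cosh Z} = \frac{e^Z - e^{-Z}}{e^Z + e^{-Z}}$$
The derivative of tanh is represented as:
$$\frac{d}{dZ}\varphi(Z) = \frac{d}{dZ}\tanh(Z) = 1 - \tanh^2(Z)$$
• Derivative of Rectified Linear Unit:
$$\frac{d}{dZ}\varphi(Z) = \frac{d}{dZ} relu(Z) = \begin{cases} 1, & \text{if } z > 0 \\ 0, & \text{otherwise} \end{cases}$$
• Derivative of Leaky Rectified Linear Unit: The Leaky ReLU function is represented as
$$\varphi(Z) = lrelu(Z) = \begin{cases} z, & \text{if } z > 0 \\ az, & \text{if } z \le 0; \text{ usually } a = 0.01 \end{cases}$$
$$\frac{d}{dZ}\varphi(Z) = \frac{d}{dZ} lrelu(Z) = \begin{cases} 1, & \text{if } z > 0 \\ a, & \text{otherwise} \end{cases}$$
• Derivative of Softmax:
$$\frac{\partial p_i}{\partial Z_j} = \begin{cases} p_i(1 - p_j), & \text{if } i = j \\ -p_i p_j, & \text{if } i \ne j \end{cases}$$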
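These closed-form derivatives can be sanity-checked against a central finite difference. The sketch below is illustrative: the function names, the step `h`, and the test points are assumptions, and the finite difference is only an approximation of the true derivative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_prime(z):
    return 1 - math.tanh(z) ** 2

def softmax(zs):
    shift = max(zs)
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(zs):
    # Entry [i][j] is dp_i/dz_j: p_i(1 - p_j) if i == j, else -p_i * p_j
    p = softmax(zs)
    k = len(p)
    return [[p[i] * (1 - p[j]) if i == j else -p[i] * p[j] for j in range(k)]
            for i in range(k)]

def central_difference(f, z, h=1e-6):
    return (f(z + h) - f(z - h)) / (2 * h)

print(sigmoid_prime(0.3), central_difference(sigmoid, 0.3))
print(tanh_prime(0.3), central_difference(math.tanh, 0.3))
print(softmax_jacobian([0.2, -1.0, 0.5]))
```

Each row of the softmax Jacobian sums to zero, which follows from the probabilities always adding up to 1.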
III.V Cross-Entropy Loss
The training data set has the pairs $(V_t, y_t)$ with output $y_t \in \{0, 1\}$. The NN returns a probability of the output class for each input record $V_t$, i.e., the probability $p(V_t) \in (0, 1)$. The Cross-Entropy Loss for a binary classification problem is given by
$$J(\theta) = \sum_{t=1}^{m} [-y_t \log(p(V_t)) - (1 - y_t) \log(1 - p(V_t))]$$
The value of the Cross-Entropy Loss changes as we change the weights of the Neural Network.
In the case of a multi-class classification problem, suppose we have $s$ classes $\{k_1, k_2, \ldots, k_s\}$. The output is $Y_t = (y_{1t}, y_{2t}, \ldots, y_{st}) \in \{0, 1\}^s$ with
$$y_{qt} = \begin{cases} 1, & \text{if } V_t \in k_q \\ 0, & \text{if } V_t \notin k_q \end{cases}$$
The multi-class cross-entropy loss is given by:
$$J(\theta) = \sum_{q=1}^{s} \sum_{t=1}^{m} [-y_{qt} \log(p_q(V_t)) - (1 - y_{qt}) \log(1 - p_q(V_t))]$$
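The double sum over classes and records can be sketched as a nested loop. The function name and the sample probabilities are assumptions; `labels` holds the one-hot vectors $Y_t$ and each inner list of `probabilities` holds the values $p_q(V_t)$ for one record.

```python
import math

def multi_class_cross_entropy(probabilities, labels):
    total = 0.0
    for p_t, y_t in zip(probabilities, labels):      # loop over records t
        for p, y in zip(p_t, y_t):                    # loop over classes q
            total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total

labels = [[1, 0, 0], [0, 1, 0]]                       # one-hot outputs Y_t
confident = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]        # hypothetical predictions
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

print(multi_class_cross_entropy(confident, labels))
print(multi_class_cross_entropy(uncertain, labels))
```

The loss is lower for the predictions that put more probability on the correct class, which is the behaviour the training process exploits.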
• Multiple equations need to be optimized as we deal with weights from different layers.
• Number of weights could be different for each layer depending upon the no. of nodes in a layer.
• We need to find a global minimum while multiple local minima can exist.
The Backpropagation Algorithm computes all the partial derivatives of the total Loss function w.r.t. the weights, $\frac{\partial J}{\partial \theta_{ij}^{(l)}}$. Backpropagation starts by establishing a cost function, e.g., choosing the mean square error (MSE) for a regression problem or Cross-Entropy for a binary or multi-class classification problem. We measure the performance of the NN on each forward pass based on the cost function. The network computes the gradient of the cost with respect to the weights and works backward to adjust the weights. This process is repeated to reduce the cost function to its minimum possible value.
To illustrate the working of the Backpropagation algorithm, let us consider a binary classification problem with $n$ input features $(x_1, x_2, \ldots, x_n)$ and output $y \in \{0, 1\}$. The NN in this case predicts the probability $p$ of $y = 1$. We choose Sigmoid activation functions for all layers, i.e., $l = 2, 3, \ldots, k$ ($l = 1$ is the input layer), and we choose the cross-entropy loss function to measure the performance of the NN.
Figure 27: Network Illustration for Binary Classification problem
Suppose we have $m$ records in the training set, i.e., $(V_t, y_t)$, $t = 1, 2, \ldots, m$. The cross-entropy loss function is given as
$$J(\theta) = \sum_{t=1}^{m} [-y_t \log(p(V_t)) - (1 - y_t) \log(1 - p(V_t))]$$
Weights: $\theta_0, \theta_1, \theta_2, \ldots, \theta_n$
The derivative of the Cost function involves the partial derivatives of the Activation function and of the Linear sum function. We noticed in the earlier section that the derivative of the Sigmoid function is given by:
$$\frac{d}{dz}\varphi(z) = \frac{d}{dz} g(z) = g(z)(1 - g(z)) \tag{9}$$
And the derivative of the Linear sum operation is given by:
$$\frac{d}{d\theta_i}(Z) = \frac{d}{d\theta_i}(\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i + \cdots + \theta_n x_n) = x_i \tag{10}$$
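The derivative of the cost function w.r.t. weight $\theta_0$:
$$\begin{aligned}
\frac{d}{d\theta_0} J(\theta) &= \sum_{t=1}^{m} \frac{d}{d\theta_0} [-y_t \log(g(z_t)) - (1 - y_t) \log(1 - g(z_t))] \\
&= -\sum_{t=1}^{m} \frac{d}{d\theta_0} [y_t \log(g(z_t)) + (1 - y_t) \log(1 - g(z_t))] \\
&= -\sum_{t=1}^{m} \left[ \frac{y_t}{g(z_t)} \cdot g'(z_t) \cdot \frac{dz_t}{d\theta_0} + \frac{1 - y_t}{1 - g(z_t)} \cdot (-g'(z_t)) \cdot \frac{dz_t}{d\theta_0} \right] \\
&= -\sum_{t=1}^{m} \left[ \frac{y_t}{g(z_t)} \cdot g(z_t)(1 - g(z_t)) \cdot x_0 + \frac{1 - y_t}{1 - g(z_t)} \cdot (-g(z_t)(1 - g(z_t))) \cdot x_0 \right], \text{ using (9), (10)} \\
&= -\sum_{t=1}^{m} \left[ y_t (1 - g(z_t)) \cdot 1 + (1 - y_t) \cdot (-1) \cdot g(z_t) \cdot 1 \right], \text{ using } x_0 = 1 \\
&= -\sum_{t=1}^{m} (y_t - y_t g(z_t) - g(z_t) + y_t g(z_t)) \\
&= \sum_{t=1}^{m} (g(z_t) - y_t)
\end{aligned} \tag{11}$$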
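Similarly, we can find the partial derivative of the cost function $J$ w.r.t. the weights $\theta_i$, $i \ge 1$:
$$\begin{aligned}
\frac{d}{d\theta_i} J(\theta) &= \sum_{t=1}^{m} \frac{d}{d\theta_i} [-y_t \log(g(z_t)) - (1 - y_t) \log(1 - g(z_t))] \\
&= -\sum_{t=1}^{m} \frac{d}{d\theta_i} [y_t \log(g(z_t)) + (1 - y_t) \log(1 - g(z_t))] \\
&= -\sum_{t=1}^{m} \left[ \frac{y_t}{g(z_t)} \cdot g'(z_t) \cdot \frac{dz_t}{d\theta_i} + \frac{1 - y_t}{1 - g(z_t)} \cdot (-g'(z_t)) \cdot \frac{dz_t}{d\theta_i} \right] \\
&= -\sum_{t=1}^{m} \left[ \frac{y_t}{g(z_t)} \cdot g(z_t)(1 - g(z_t)) \cdot x_i^t + \frac{1 - y_t}{1 - g(z_t)} \cdot (-g(z_t)(1 - g(z_t))) \cdot x_i^t \right], \text{ using (9), (10)} \\
&= -\sum_{t=1}^{m} x_i^t (y_t - y_t g(z_t) - g(z_t) + y_t g(z_t)) \\
&= \sum_{t=1}^{m} x_i^t (g(z_t) - y_t)
\end{aligned} \tag{12}$$
$$\frac{d}{d\theta_i} J(\theta) = \sum_{t=1}^{m} x_i^t (g(z_t) - y_t), \quad i \ge 1$$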
The predicted output is represented as $p = g(z_t)$, and $y_t$ is the actual output label from the training set. So, the error for the training pair $(V_t, y_t)$ is given as:
$$\delta_t = g(z_t) - y_t$$
With this, we can re-write the above equations as:
$$\frac{d}{d\theta_i} J(\theta) = \begin{cases} \sum_{t=1}^{m} \delta_t, & \text{if } i = 0 \\ \sum_{t=1}^{m} x_i^t \cdot \delta_t, & \text{if } i \ge 1 \end{cases}$$
We have a common pattern for the partial derivative, i.e., $\frac{d}{d\theta_i} J(\theta) = \sum_{t=1}^{m} x_i^t \cdot \delta_t$, with $x_0 = 1$ for the bias node.
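The pattern $\frac{d}{d\theta_i} J(\theta) = \sum_t x_i^t \cdot \delta_t$ can be checked against a numerical derivative of the cross-entropy loss. This sketch assumes a single sigmoid unit; the records, the weight vector `theta`, and the step `h` are hypothetical values chosen for the check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # x_0 = 1 is prepended for the bias weight theta_0
    return sigmoid(sum(w * xi for w, xi in zip(theta, [1.0] + x)))

def loss(theta, records):
    total = 0.0
    for x, y in records:
        p = predict(theta, x)
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total

def gradient(theta, records):
    grad = [0.0] * len(theta)
    for x, y in records:
        delta = predict(theta, x) - y            # delta_t = g(z_t) - y_t
        for i, xi in enumerate([1.0] + x):
            grad[i] += xi * delta                # accumulate x_i^t * delta_t
    return grad

def numerical_gradient(theta, records, h=1e-6):
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + h] + theta[i + 1:]
        down = theta[:i] + [theta[i] - h] + theta[i + 1:]
        grad.append((loss(up, records) - loss(down, records)) / (2 * h))
    return grad

records = [([0.5, 1.2], 1), ([1.5, -0.3], 0), ([-1.0, 0.8], 1)]
theta = [0.1, -0.2, 0.3]
print(gradient(theta, records))
print(numerical_gradient(theta, records))
```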
At a high level, we have the following steps in the Backpropagation algorithm:
1. Identify the training set $(V_t, y_t)$.
2. Feed the input variables $V_t$ to the NN model.
3. Compute the probability $p = g(z_t)$ using forward propagation.
4. Compute the error $\delta_t = g(z_t) - y_t$.
5. Backpropagate the error corresponding to $\theta_i$. Multiply $\delta_t$ with the origin unit value $x_i^t$.
(a) Sum over all the pairs in the training data.
Algorithm to compute $\frac{d}{d\theta_i} J$
If we feed the input features for a record from the training set to the Neural Network, we get a value at the output layer as a probability of a binary class, denoted as $a_1^{(k)}$ in the above figure. The predicted value is compared with the target label or the given output ($y_t$) to find the error in the output layer. The error is given by:
$$\delta^{(k)} = a_1^{(k)} - y_t$$
This error term depends upon the weights of the network and the specific input pair $(V_t, y_t)$ used. There are other hidden errors at each hidden layer that propagate backward through the network. Suppose we have a hidden layer $l$ of size $s_l$; then the error vector will have $s_l + 1$ entries.
$$\bar{\delta}^{(l)} = \begin{bmatrix} \delta_0^{(l)} \\ \delta_1^{(l)} \\ \delta_2^{(l)} \\ \vdots \\ \delta_{s_l}^{(l)} \end{bmatrix} \in \mathbb{R}^{s_l + 1}$$
Here, $\delta_0^{(l)}$ is the error corresponding to the bias unit. So, $\bar{\delta}^{(l)}$ is called the extended vector of error terms at layer $l$. The vector without the bias error is referred to as $\delta^{(l)}$.
The general formula to compute these intermediate error vectors in layer $l$ is:
$$\bar{\delta}^{(l)} = [\theta^{(l)}]^t \delta^{(l+1)} \odot \bar{a}^{(l)} \odot (1 - \bar{a}^{(l)})$$
$[\theta^{(l)}]$ - Weight matrix for all the connections from layer $l$ to layer $l + 1$, with a size of $s_{l+1}$ (rows) $\times$ $(s_l + 1)$ (columns). $[\theta^{(l)}]^t$ is its transpose, a matrix of size $(s_l + 1) \times s_{l+1}$.
$\odot$ - Represents the element-wise multiplication, also called the Hadamard product, of two vectors. Example:
$$\begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} \odot \begin{bmatrix} 1 \\ 4 \\ 2 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \\ 2 \end{bmatrix}$$
$\bar{a}^{(l)}$ - Extended vector of all the entries in layer $l$ (including the bias $a_0^{(l)}$), with size $s_l + 1$. We get the values $a_i^{(l)}$ from forward propagation.
$(1 - \bar{a}^{(l)})$ - Subtracting every entry of $\bar{a}^{(l)}$ from 1, resulting in an $s_l + 1$ dimensional vector.
As we can see, this is a backward propagating method to find the errors in layer l by using the errors
in l + 1. To illustrate the working of Back-propagation as compared to forward-propagation, consider a
NN with four layers.
Figure 30: Forward-propagation vs. Backward-propagation
The above figure shows the network's forward and backward propagation values. Forward propagation moves from left to right, whereas backward propagation moves from right to left and stops before the input layer. So, backpropagation ensures that each layer other than the input layer has an error value for every neuron.
The error vector $\delta_i^{(l+1)}$ is the key to calculating the partial derivatives of the total loss function $J$, which are given by the following Backpropagation formula.
$$\frac{\partial J}{\partial \theta_{ij}^{(l)}} = \sum_{t=1}^{m} \delta_i^{(l+1)} \cdot a_j^{(l)}$$
The RHS of this equation, $\sum_{t=1}^{m} \delta_i^{(l+1)} \cdot a_j^{(l)}$, is the sum of the values obtained from all the pairs $(V_t, y_t)$ in the training set.
The following steps are carried out in order to find the partial derivatives in a NN:
1. Feed the network model with the input from one pair $(V_t, y_t)$ in the training set.
2. Perform forward propagation to compute all the activation values. For example, to calculate $a_i^{(l+1)}$:
(a) Consider the bias unit $a_0^{(l)}$ where applicable
(b) Compute the linear sum $z_i^{(l+1)} = \theta_{i0}^{(l)} a_0^{(l)} + \theta_{i1}^{(l)} a_1^{(l)} + \theta_{i2}^{(l)} a_2^{(l)} + \cdots + \theta_{is_l}^{(l)} a_{s_l}^{(l)}$
(c) Apply the activation function $a_i^{(l+1)} = g(z_i^{(l+1)})$
3. Compute the error at the final layer as the difference between the predicted output and the actual output: $\delta_1^{(k)} = a_1^{(k)} - y$.
4. Perform backpropagation to find $\delta_i^{(l)}$, i.e. the intermediate errors for all the hidden layers (the input layer has no error term).
$$\bar{\delta}^{(l)} = [\theta^{(l)}]^t \, \delta^{(l+1)} \odot \bar{a}^{(l)} \odot (1 - \bar{a}^{(l)})$$
5. Repeat the process (Steps 1-4) for all pairs in the training set.
6. Compute the partial derivatives using $\frac{dJ}{d\theta_{ij}^{(l)}} = \sum_{t=1}^{m} \delta_i^{(l+1)} \cdot a_j^{(l)}$
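The steps above can be sketched in base R for a single training pair. This is only an illustration under stated assumptions: the activation g is taken to be the sigmoid (which is what the factor a ⊙ (1 − a) corresponds to), and the network size, the weights in theta, and the helper name backprop_one_pair are hypothetical, not taken from the thesis.

```r
# Sigmoid activation g(z); its derivative is a * (1 - a), which is why the
# backpropagation formula contains the factor a * (1 - a) (element-wise).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical 3-layer network: 2 inputs, 2 hidden units, 1 output.
# theta[[l]] has s_{l+1} rows and s_l + 1 columns (first column = bias weights).
theta <- list(
  matrix(c(0.1, -0.2, 0.4, 0.3, 0.2, -0.5), nrow = 2),  # 2 x 3
  matrix(c(0.3, -0.1, 0.2), nrow = 1)                   # 1 x 3
)

backprop_one_pair <- function(theta, x, y) {
  k <- length(theta) + 1             # number of layers
  a_bar <- vector("list", k - 1)     # extended activations (bias included)
  # Steps 1-2: forward propagation
  a <- x
  for (l in seq_len(k - 1)) {
    a_bar[[l]] <- c(1, a)            # prepend the bias unit a_0 = 1
    z <- theta[[l]] %*% a_bar[[l]]   # linear sums z^(l+1)
    a <- as.vector(sigmoid(z))       # activations a^(l+1)
  }
  # Step 3: error at the final layer
  delta <- a - y
  # Step 4: walk backward, collecting delta_i^(l+1) * a_j^(l) for each layer
  grads <- vector("list", k - 1)
  for (l in rev(seq_len(k - 1))) {
    grads[[l]] <- delta %o% a_bar[[l]]
    if (l > 1) {
      delta_bar <- as.vector(t(theta[[l]]) %*% delta) *
        a_bar[[l]] * (1 - a_bar[[l]])
      delta <- delta_bar[-1]         # drop the bias error delta_0
    }
  }
  list(output = a, grads = grads)
}

res <- backprop_one_pair(theta, x = c(1, 0), y = 1)
dim(res$grads[[1]])  # same shape as theta[[1]]
dim(res$grads[[2]])  # same shape as theta[[2]]
```

Summing the grads returned for every pair in the training set gives the partial derivatives of step 6.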
Training
Training is the process of minimizing the loss function $J$ by finding appropriate weights for the network. Gradient Descent is used to minimize $J$. Suppose we have a training set $(V_t, y_t)$, $t = 1, 2, 3, \ldots, m$, with $n$-dimensional input variables and an $s$-class output.
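$$V_t = \begin{pmatrix} x_t^1 \\ x_t^2 \\ x_t^3 \\ \vdots \\ x_t^n \end{pmatrix} \rightrightarrows \left(\text{NN with weights } \theta\right) \rightrightarrows \begin{pmatrix} f^1(V_t) \\ f^2(V_t) \\ f^3(V_t) \\ \vdots \\ f^s(V_t) \end{pmatrix} = f_\theta(V_t) \tag{13}$$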
The NN predicts the output using the weights $\theta$ for an $s$-class target, i.e. $f_\theta(V_t) \in \mathbb{R}^s$. We measure how close the predicted output is to the actual output using $L(f_\theta(V_t), y_t)$. We define the loss function as
$$J(\theta) = \sum_{t=1}^{m} L(f_\theta(V_t), y_t)$$
Layers $l = 1, 2, \ldots, k - 1$
$i = 1, 2, \ldots, s_{l+1}$
$j = 1, 2, \ldots, s_l$
The weights are adjusted with learning rate $\alpha$ (a hyperparameter) using the following formula:
$$\left(\theta_{ij}^{(l),d+1}\right)_{i,j,l} = \left(\theta_{ij}^{(l),d}\right)_{i,j,l} - \alpha \cdot \frac{dJ}{d\theta_{ij}^{(l)}}(\theta^d)$$
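We keep optimizing the weights in each iteration to minimize the loss function $J(\theta)$.
$$\theta^0 \rightsquigarrow \theta^1 \rightsquigarrow \theta^2 \rightsquigarrow \cdots \rightsquigarrow \theta^d \rightsquigarrow \theta^{d+1} \ldots$$
These optimization passes are called epochs, and their number is controlled via a hyperparameter.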
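As a minimal sketch of this update rule, the base R snippet below runs gradient descent on a one-weight model. The toy data, the squared-error loss, and the values of alpha and epochs are assumptions chosen so the result is easy to check by hand; they are not the thesis's settings.

```r
# Hypothetical toy data: y = 2 * x, so the optimal weight is 2.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

# Loss J(theta): sum over the training set of 0.5 * (theta * x_t - y_t)^2
loss <- function(theta) sum(0.5 * (theta * x - y)^2)
# Its derivative dJ/dtheta
grad <- function(theta) sum((theta * x - y) * x)

alpha  <- 0.01   # learning rate (hyperparameter)
epochs <- 200    # number of update steps (hyperparameter)
theta  <- 0      # initial weight theta^0

for (d in seq_len(epochs)) {
  # theta^(d+1) = theta^d - alpha * dJ/dtheta evaluated at theta^d
  theta <- theta - alpha * grad(theta)
}

theta        # close to 2
loss(theta)  # close to 0
```

With this data each step shrinks the distance to the optimum by a constant factor, so a larger alpha converges faster until it becomes large enough to overshoot.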
IV Convolutional Neural Networks
Convolutional Neural Networks (ConvNets, a.k.a. CNNs) are most commonly used for problems in the domain of computer vision. ConvNets exploit spatially local correlation by enforcing local connectivity between adjacent layers.
Convolution Operation
Images can be represented as matrices of values in different channels, depending upon the color space. Some commonly used color spaces are Grayscale, CMYK, HSV, RGB, etc. The convolution operation is performed by applying a filter, a.k.a. kernel, over the image, computing the dot product of the filter with the area of the image under the filter. The filter slides over the complete image to perform this operation and reduces the image to a form that is easier to process while retaining its important features.
Suppose we have an image in the form of a matrix of values. We apply a 3 × 3 filter of weights (kernel) starting at the first row and first column of the matrix. As an example, the first value of the convolved feature is calculated as follows:
4 × 1 + 5 × (−1) + 1 × 0 + 2 × 0 + 6 × 1 + 5 × 0 + 6 × 1 + 8 × 0 + 2 × 1 = 13
We scan the whole matrix with a window matching the 3 × 3 filter to perform the convolution operation and generate a convolved feature matrix of size 3 × 3 from a 5 × 5 matrix.
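The sliding-window computation can be sketched in base R as follows. The top-left 3 × 3 patch and the kernel are read off the worked sum above, assuming its terms are listed row by row; the remaining image entries and the function name conv2d_valid are made up for illustration.

```r
# "Valid" convolution: slide the kernel over the image and sum the
# element-wise products at each position (stride 1, no padding).
conv2d_valid <- function(image, kernel) {
  f <- nrow(kernel)
  out_rows <- nrow(image) - f + 1
  out_cols <- ncol(image) - f + 1
  out <- matrix(0, out_rows, out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      patch <- image[i:(i + f - 1), j:(j + f - 1)]
      out[i, j] <- sum(patch * kernel)
    }
  }
  out
}

# Only the top-left 3 x 3 block comes from the worked example;
# the other entries are hypothetical.
image <- matrix(c(4, 5, 1, 3, 0,
                  2, 6, 5, 1, 2,
                  6, 8, 2, 0, 4,
                  1, 3, 7, 2, 5,
                  0, 2, 4, 6, 1), nrow = 5, byrow = TRUE)
kernel <- matrix(c(1, -1, 0,
                   0,  1, 0,
                   1,  0, 1), nrow = 3, byrow = TRUE)

features <- conv2d_valid(image, kernel)
features[1, 1]  # 13, matching the hand calculation
dim(features)   # 3 3
```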
Edge Detection
Edges of the objects in an image are the areas where there is a drastic change in the brightness of the pixels. There are different filters to detect vertical, horizontal or angled edges. An example of a vertical edge detector along with the convolution operation is as follows:
Figure 33: Convolution Example - Matrix operation (References 1)
Some commonly used edge detection filters are Canny, Sobel, Scharr, etc. The size of the output image matrix after applying a filter of size f × f on an image of matrix size n × n is (n − f + 1) × (n − f + 1). The value of f is usually odd, which ensures that the filter always has a central position and helps with padding.
Padding
As we discussed in the filter operations above, the size of the output image is reduced from n × n to (n − f + 1) × (n − f + 1). In addition to the shrinkage in size, we also lose important information at the edges of the image. To overcome this problem, we pad extra pixels around the image border via a hyper-parameter called padding. If we apply a padding p = 1 on an n × n image, the resultant image size is (n + 2) × (n + 2). Padding is configured using two modes in CNN models, namely:
1. Valid: No padding
2. Same: Pad so that the output image has the same size as the input image.
Example of Padding:
Figure 35: Padding Example - References 2
When a filter of size f × f is applied along with a padding of size p on an image of size n × n, the
resulting image size is given by (n + 2p − f + 1) × (n + 2p − f + 1).
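A small base R sketch of padding is given below. It assumes the extra border pixels are zeros (the text only says extra pixels are padded), and zero_pad is a hypothetical helper name.

```r
# Surround the image with p rows/columns of zeros on every side.
zero_pad <- function(image, p = 1) {
  out <- matrix(0, nrow(image) + 2 * p, ncol(image) + 2 * p)
  out[(p + 1):(p + nrow(image)), (p + 1):(p + ncol(image))] <- image
  out
}

image  <- matrix(1:25, nrow = 5)   # hypothetical 5 x 5 image
padded <- zero_pad(image, p = 1)
dim(padded)                        # 7 7, i.e. (n + 2p) x (n + 2p)

# With f = 3 and p = 1 the output keeps the input size:
# n + 2p - f + 1 = 5 + 2 - 3 + 1 = 5
nrow(padded) - 3 + 1
```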
Strided Convolution
Strides are another factor that can influence the output size of the convolution. Stride is the distance
between two successive windows in a convolution operation. By default, the convolution windows are
contiguous, i.e., a stride of size 1. Here is an example of a stride with size 2.
The size of the convolution output with stride size $s$, padding $p$, filter $(f \times f)$, and original image size $(n \times n)$ is given by
$$\left(\frac{n + 2p - f}{s} + 1\right) \times \left(\frac{n + 2p - f}{s} + 1\right)$$
In the above formula, if the fraction is not an integer, we round it down using the floor function in R.
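The formula can be wrapped in a short helper to check output sizes quickly. The function name conv_output_size and the example values are illustrative only.

```r
# Output side length for an n x n input, f x f filter, padding p, stride s.
conv_output_size <- function(n, f, p = 0, s = 1) {
  floor((n + 2 * p - f) / s) + 1
}

conv_output_size(5, 3)           # 3: valid convolution on a 5 x 5 image
conv_output_size(5, 3, p = 1)    # 5: padding of 1 keeps the size for f = 3
conv_output_size(7, 3, s = 2)    # 3: stride of 2
conv_output_size(6, 3, s = 2)    # 2: the fraction 3/2 is floored to 1
```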
channels. To perform a convolution on a 3D image (with three channels), we need a 3D filter, i.e., one whose depth matches the number of channels in the input image. The flow of the convolution operation on a 3D image volume with a 3D filter is as follows:
1. Take dot product of the filters with the corresponding values from the channels in the image,
e.g., values from the red filter with the values from the red channel from the image, and so on.
2. Add up these dot products from each filter with the corresponding channel to produce a single
value.
If we want to detect the edges in a specific color channel, say red, we set the filter values for the other color channels, i.e., blue and green in this case, to zero, and vice versa. The edge filters are set to the same values for all channels when checking for edges across all the channels. We can use different filters to extract various image features, such as additional edge filters (horizontal, vertical, angular, etc.).
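The two steps above can be sketched in base R with a 3D array for the image and the filter. The random image, the particular vertical edge filter values, and the name conv3d_valid are assumptions for illustration; channel 1 is treated as red.

```r
# Convolution over a volume: image is n x n x channels and the filter is
# f x f x channels. Each output value sums the products over all channels.
conv3d_valid <- function(image, filter) {
  f <- dim(filter)[1]
  out_rows <- dim(image)[1] - f + 1
  out_cols <- dim(image)[2] - f + 1
  out <- matrix(0, out_rows, out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      patch <- image[i:(i + f - 1), j:(j + f - 1), , drop = FALSE]
      out[i, j] <- sum(patch * filter)
    }
  }
  out
}

set.seed(1)
# Hypothetical 4 x 4 image with three channels
image <- array(sample(0:9, 4 * 4 * 3, replace = TRUE), dim = c(4, 4, 3))

# A common vertical edge filter (assumed values)
vertical <- matrix(c(1, 0, -1,
                     1, 0, -1,
                     1, 0, -1), nrow = 3, byrow = TRUE)

# Detect edges in the red channel only: other channels of the filter are zero
red_only <- array(0, dim = c(3, 3, 3))
red_only[, , 1] <- vertical

out <- conv3d_valid(image, red_only)
dim(out)  # 2 2: a single 2D feature map, regardless of the channel count
```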
Pooling
Pooling helps reduce the input size and, in turn, lowers the computation cost. It also makes feature detection less dependent on the feature's position in the image, known as invariant feature detection. Pooling is commonly performed with a window size of 2 × 2. Types of pooling layer:
Max-pooling layer: Moves the Pool layer filter over the input image and stores the max value
from input as output.
Average-pooling layer: Moves the Pool layer filter over the input image and stores the average
value for input as output.
Max-pooling is used more often than average-pooling. Pooling layers have no weights to train with backpropagation. Instead, they are controlled via hyperparameters such as the window size of the pooling operation $f$, the stride $s$, and the type of pooling (max or average). If we apply a pooling layer with window size $f \times f$ and stride $s$ on an input image of size $n_H \times n_W \times n_C$, the output dimensions are $\left(\left\lfloor \frac{n_H - f}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n_W - f}{s} \right\rfloor + 1\right) \times n_C$ (padding is usually not used with the max-pooling operation).
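Both pooling types can be sketched for a single channel in base R. The function name pool2d and the 4 × 4 input are hypothetical; a multi-channel input would apply the same operation to each channel separately.

```r
# Pool one channel with an f x f window and stride s.
pool2d <- function(x, f = 2, s = 2, type = c("max", "average")) {
  type <- match.arg(type)
  agg <- if (type == "max") max else mean
  out_rows <- floor((nrow(x) - f) / s) + 1
  out_cols <- floor((ncol(x) - f) / s) + 1
  out <- matrix(0, out_rows, out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      r  <- (i - 1) * s + 1   # top row of the current window
      c0 <- (j - 1) * s + 1   # left column of the current window
      out[i, j] <- agg(x[r:(r + f - 1), c0:(c0 + f - 1)])
    }
  }
  out
}

x <- matrix(1:16, nrow = 4, byrow = TRUE)
pool2d(x, type = "max")      # rows: (6, 8) and (14, 16)
pool2d(x, type = "average")  # rows: (3.5, 5.5) and (11.5, 13.5)
```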
Flattening
Flattening converts the last convolution layer output to a one-dimensional array which can then be
used to make the actual predictions.
Model: "sequential"
________________________________________________________________________
Layer (type) Output Shape Param #
========================================================================
conv2d_3 (Conv2D) (None, 148, 148, 32) 896
activation_3 (Activation) (None, 148, 148, 32) 0
max_pooling2d_3 (MaxPooling2D) (None, 74, 74, 32) 0
conv2d_2 (Conv2D) (None, 72, 72, 64) 18496
activation_2 (Activation) (None, 72, 72, 64) 0
max_pooling2d_2 (MaxPooling2D) (None, 36, 36, 64) 0
conv2d_1 (Conv2D) (None, 34, 34, 128) 73856
activation_1 (Activation) (None, 34, 34, 128) 0
max_pooling2d_1 (MaxPooling2D) (None, 17, 17, 128) 0
conv2d (Conv2D) (None, 15, 15, 128) 147584
activation (Activation) (None, 15, 15, 128) 0
max_pooling2d (MaxPooling2D) (None, 7, 7, 128) 0
flatten (Flatten) (None, 6272) 0
dense_1 (Dense) (None, 512) 3211776
dense (Dense) (None, 1) 513
========================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
________________________________________________________________________
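The parameter counts in this summary can be reproduced with a little arithmetic. The sketch below assumes 3 × 3 kernels and a 3-channel 150 × 150 input, which is consistent with the output shapes shown (150 → 148) but is not stated explicitly in the summary; the helper names are hypothetical.

```r
# Conv2D: one f x f x c_in kernel plus one bias per output filter
conv_params  <- function(f, c_in, c_out) (f * f * c_in + 1) * c_out
# Dense: one weight per input plus one bias per output unit
dense_params <- function(n_in, n_out) (n_in + 1) * n_out

params <- c(
  conv2d_3 = conv_params(3, 3, 32),            # 896
  conv2d_2 = conv_params(3, 32, 64),           # 18496
  conv2d_1 = conv_params(3, 64, 128),          # 73856
  conv2d   = conv_params(3, 128, 128),         # 147584
  dense_1  = dense_params(7 * 7 * 128, 512),   # 3211776 (flatten gives 6272)
  dense    = dense_params(512, 1)              # 513
)

params
sum(params)  # 3453121, the total reported in the summary
```

Activation, pooling, and flatten layers contribute no parameters, which is why they show 0 in the summary.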
We can see from the model summary how the output of each layer changes, e.g. the output of the first layer is 1 × 148 × 148 × 32. We take an image from Vincent van Gogh's artworks to show how this model works. The original image looks like below:
As required for the input feed to the model, we pre-process the image into a 4D tensor.
img_path <- "~/Downloads/vangogh02.jpg"
img <- image_load(img_path, target_size = c(150, 150))
To visualize the outputs from the layers and filters, we create a function act_model that returns the model output given the image tensor as input.
We can check the dimensions of the first-layer and the tenth-layer activations.
dim(tenth_layer_act)
[1] 1 15 15 128
We create a plot_channel function to plot a channel of the respective layer.
We use the plot_channel function to plot the output for channels 1, 16, and 32 from the first layer.
(a) Channel-1 (b) Channel-16 (c) Channel-32
We can observe that the first-layer activation retains most of the image information and mainly picks up the edges of the input image.
The information becomes more abstract as we move higher in the layer stack. For example, we can plot the outputs from the tenth layer for channels 2, 64, and 128 and observe that the layer represents more global and abstract features.
V ConvNet by Example
Keras provides a set of easy-to-use APIs for model development and runs on top of TensorFlow. TensorFlow can run on different types of hardware, e.g. CPU, GPU, and TPU.
The Keras Deep Learning API can be used to develop and train models for basic to advanced scenarios. In this section, we will use the keras library in RStudio (2022.02.3+492).
• Data set - We use a dataset containing chest X-ray images of opacity (pneumonia) and normal cases. This curated dataset is available on Kaggle and is already split into training, validation and test subsets for machine learning exercises. The information on the data sets and the number of images is in Table 4.
Some sample images from the data set along with the labels are as follows:
Figure 45: Images - Opacity(Pneumonia) category
These are grayscale images (single channel) with different sizes but in good resolution. As an
example, we can see the details for one random image.
library(EBImage)
img <- readImage(file.path(train_image_files_path,"normal","IM-0688-0001.jpeg"))
print(img)
> print(img)
Image
colorMode : Grayscale
storage.mode : double
dim : 1514 1310
frames.total : 1
frames.render: 1
imageData(object)[1:5,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 0
[2,] 0 0 0 0 0 0
[3,] 0 0 0 0 0 0
[4,] 0 0 0 0 0 0
[5,] 0 0 0 0 0 0
• Data Augmentation - Image data generators allow altering images with zoom, shear, rotation,
etc. to improve the learning process. We use basic Image generators to read the images from
the data set directories.
train_data_gen = image_data_generator(
rescale = 1/255
)
We use the flow_images_from_directory function to feed images into the model. We'll also standardize the image size to 120 × 120.
# Training images
train_image_array_gen <- flow_images_from_directory(
train_image_files_path,
train_data_gen,
target_size = target_size,
class_mode = 'binary',
classes = target_labels,
color_mode = "grayscale",
seed = 1994)
# Validation images
valid_image_array_gen <- flow_images_from_directory(
valid_image_files_path,
valid_data_gen,
target_size = target_size,
class_mode = 'binary',
classes = target_labels,
color_mode = "grayscale",
seed = 1994)
• Model Training - This is a binary classification problem with the outcome as normal or opacity health condition. The baseline accuracy is 50%, so a good model should be able to predict with an accuracy higher than the baseline. We start with a simple CNN model to learn the data representation in the images, and then modify it to observe the impact. We set some parameters required for the model training.
– model 0 - The base ConvNet model 0 has two Conv2D layers, one Flatten layer and two Dense layers. We use relu activation in the model and sigmoid activation at the final layer to address this binary classification problem. The model summary looks as follows:
activation_59 (Activation) (None, 118, 118, 4) 0
conv2d_78 (Conv2D) (None, 116, 116, 2) 74
flatten_33 (Flatten) (None, 26912) 0
dense_67 (Dense) (None, 10) 269130
activation_58 (Activation) (None, 10) 0
dense_66 (Dense) (None, 1) 11
activation_57 (Activation) (None, 1) 0
=====================================
Total params: 269,255
Trainable params: 269,255
Non-trainable params: 0
_____________________________________
We use binary_crossentropy as the loss function and optimizer_adadelta as the optimizer when compiling the model. We also record accuracy in addition to the loss values when training the model.
# Compile model
model_0 %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_adadelta(),
metrics = c('accuracy')
)
# Fit model
history_0 <- model_0 %>% fit(
train_image_array_gen,
batch_size = batch_size,
epochs = epochs,
validation_data = valid_image_array_gen
)
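To make the loss and metric named in the compile step concrete, the following base R sketch computes binary cross-entropy and accuracy by hand. The labels, the raw scores, and the clipping constant eps are hypothetical; this illustrates the quantities being tracked and is not the Keras implementation.

```r
# Sigmoid turns a raw score into a probability between 0 and 1
sigmoid <- function(z) 1 / (1 + exp(-z))

# Mean binary cross-entropy; eps (assumed value) guards against log(0)
binary_crossentropy <- function(y_true, y_prob, eps = 1e-7) {
  y_prob <- pmin(pmax(y_prob, eps), 1 - eps)
  -mean(y_true * log(y_prob) + (1 - y_true) * log(1 - y_prob))
}

y_true <- c(1, 0, 1, 0)                       # hypothetical labels
y_prob <- sigmoid(c(2.0, -1.0, 0.5, 0.3))     # hypothetical model scores

loss     <- binary_crossentropy(y_true, y_prob)
accuracy <- mean((y_prob > 0.5) == y_true)    # threshold at 0.5

loss
accuracy  # 0.75: the last sample is misclassified
```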
The model achieves an accuracy of 94.52% on the validation data in 12 epochs. The training accuracy rises throughout the model fit, while the validation accuracy stays almost the same. A similar pattern can be seen in the loss graph, with the validation loss rising towards the end of the model fit. This indicates that the model is over-fitting in the later epochs.
The metrics plots for accuracy and loss look as follows:
Figure 46: CNN model 0 Fit - Metrics
– model 1 - We increase the model complexity by introducing another Conv2D layer and increasing the number of filters. This change almost doubles the number of parameters in the model from 269,255 to 521,473. The model summary looks as follows:
Learning stalls in this model: the validation accuracy reaches 74.33% in the second epoch and stays there. This seems to be a problem of under-fitting, as the model is not able to learn the representations of the data in the images.
– model 2 - We further increase the number of filters in the Conv2D layers, which raises the number of parameters to 10,403,105. The model summary looks as follows:
Trainable params: 10,403,105
Non-trainable params: 0
_____________________________________
The validation accuracy improves to 96.15% in the ninth epoch and the validation loss
increases after that.
– model 3 - We introduce MaxPooling layers after each Conv2D layer. This reduces the model parameters to 141,505. The model summary looks as follows:
dense_67 (Dense) (None, 100) 135300
activation_58 (Activation) (None, 100) 0
dense_66 (Dense) (None, 1) 101
activation_57 (Activation) (None, 1) 0
=====================================
Total params: 141,505
Trainable params: 141,505
Non-trainable params: 0
_____________________________________
The validation accuracy improves to 96.83% in the ninth epoch with a validation loss at
0.0912.
Observations - We tried four models of varying complexity, and we can observe that increasing model complexity does not always improve learning. Introducing a new layer or changing the filters changes the number of parameters to be trained. An increase in the number of parameters may also slightly increase the epoch run-time.
• Regularization - We choose the best model, model 2, from the initial testing and try regularization techniques to study their effect on the accuracy of the model. Regularization provides a simple approach to control over-fitting by constraining the weights to take smaller values. Some of the standard regularization methods are BatchNormalization, Dropout, and L2 regularization. As in the previous phases, we start with a base model and test the impact of regularization by enabling one parameter at a time. Metrics such as accuracy, loss, and convergence rate are noted for each regularization change to the base model.
Figure 51: Using L2 Regularization
3. Adding Dropout - Finally, we add a Dropout layer with rate = 0.5 after the Flatten layer. The model achieves a validation accuracy of 96.25% and the validation loss changes to 0.1696.
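As a rough illustration of what a dropout rate of 0.5 does during training, the base R sketch below zeroes each activation with probability rate and rescales the survivors. The rescaling by 1 / (1 − rate) (often called inverted dropout) and the helper name dropout are assumptions of this sketch, and the activations are made up.

```r
# Randomly drop activations with probability `rate`; scale the kept ones
# by 1 / (1 - rate) so the expected sum stays the same.
dropout <- function(x, rate = 0.5) {
  keep <- rbinom(length(x), size = 1, prob = 1 - rate)  # 1 = keep, 0 = drop
  x * keep / (1 - rate)
}

set.seed(42)
activations <- rep(1, 10)                 # hypothetical layer output
dropped <- dropout(activations, rate = 0.5)
dropped                                   # each entry is either 0 or 2
```

At prediction time no units are dropped, so the layer passes its input through unchanged.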
L2 Regularization, and Dropout. We make the following observations from these trials:
– Slow convergence - The final model with all the regularization layers was slower to converge than the initial/base model, which is expected behavior.
– Impact on Accuracy - The following table shows the impact on Training and Validation Accuracy as we applied regularization.
As we can see from the above table, applying regularization does not improve the validation accuracy much.
Evaluate model on Test set - We use the model with all three regularization methods to predict the labels for the test set and verify the model's performance.
# Output metrics
cat('Test loss:', scores[[1]], '\n')
>Test loss: 0.1695605
cat('Test accuracy:', scores[[2]], '\n')
>Test accuracy: 0.9625
As we can see from the model evaluation, the ConvNet model delivers an impressive accuracy of
96.25% on the test set.
References
1. Ghatak, A. (2019). Deep learning with R. Springer Singapore.