ML Lecture 9 - DeepLearning
https://cse.uiu.ac.bd/profiles/dewanfarid/
ABOUT THE AUTHOR
PROF. DR. DEWAN MD. FARID is a Professor of Computer Science and Engineering
at United International University. He is an IEEE Senior Member and an ACM
Member. Prof. Farid worked as a Postdoctoral Fellow/Staff at the following research
labs/groups: (1) Computational Intelligence Group (CIG), Department of Com-
puter Science and Digital Technology, University of Northumbria at Newcastle,
UK in 2013, (2) Computational Modelling Lab (CoMo) and Artificial Intelligence
Research Group, Department of Computer Science, Vrije Universiteit Brussel,
Belgium in 2015-2016, and (3) Decision and Information Systems for Production
systems (DISP) Laboratory, IUT Lumière – Université Lyon 2, France in 2020.
Prof. Farid was a Visiting Faculty at the Faculty of Engineering, University of
Porto, Portugal in June 2016. He holds a PhD in Computer Science and Engi-
neering from Jahangirnagar University, Bangladesh in 2012. Part of his PhD re-
search has been done at ERIC Laboratory, University Lumière Lyon 2, France by
Erasmus-Mundus ECW eLink PhD Exchange Program. His PhD was fully funded
by Ministry of Science & Information and Communication Technology, Govern-
ment of the People’s Republic of Bangladesh and European Union (EU) eLink
project. Prof. Farid has published 111 peer-reviewed scientific articles, includ-
ing 30 highly esteemed journals like Expert Systems with Applications, Journal
of Theoretical Biology, Journal of Neuroscience Methods, Bioinformatics, Scien-
tific Reports (Nature), Proteins and so on in the field of Machine Learning, Data
Mining and Big Data. Prof. Farid received the following awards: (1) Dr. Fatema
Rashid Best Paper Award (2nd Position) for the paper titled “KNNTree: A new
method to ameliorate k-nearest neighbour classification using decision tree” in
3rd International Conference on Electrical Computer and Communication Engi-
neering (ECCE 2023), CUET, Chittagong, Bangladesh, (2) JuliaCon 2019 Travel
Award for attending Julia Conference at the University of Maryland, Baltimore,
USA, and (3) United Group Research Award 2016 in the field of Science and
Engineering. He received the following research funds as Principal Investigator:
(1) a2i Innovation Fund of Innov-A-Thon 2018 (Ideabank ID No.: 12502) from
a2i-Access to Information Program – II, Information and Communication Tech-
nology (ICT) Division, Government of the People’s Republic of Bangladesh, and
(2) Project Code: UIU/IAR/01/2021/SE/23 received from Institute for Advanced
Research (IAR), United International University. Prof. Farid received the follow-
ing Erasmus Mundus scholarships: (1) LEADERS (Leading mobility between Eu-
rope and Asia in Developing Engineering Education and Research) to undertake a
staff level mobility at the Faculty of Engineering, University of Porto, Portugal in
2015, (2) cLink (Centre of excellence for Learning, Innovation, Networking and
Knowledge) for pursuing Postdoc at University of Northumbria at Newcastle, UK
in 2013, and (3) eLink (east west Link for Innovation, Networking and Knowl-
edge exchange) for pursuing Ph.D. at University Lumière Lyon 2, France in 2009.
Prof. Farid also received Senior Fellowship I and II awards by National Science &
Information and Communication Technology (NSICT), Ministry of Science & In-
formation and Communication Technology, Government of the People’s Republic
of Bangladesh respectively in 2008 and 2011 for pursuing Ph.D. at Jahangirna-
gar University. He visited 18 countries for attending international conferences,
research and higher education. Prof. Farid delivered several invited/keynote talks
including an invited research talk at Data to AI Group (DAI), Laboratory for In-
formation and Decision Systems (LIDS), Massachusetts Institute of Technology
(MIT), Cambridge, Massachusetts, USA.
CONTENTS
List of Symbols
1 Deep Learning
1.1 What’s Deep Learning?
1.1.1 History of Deep Learning
1.1.2 Why Is Deep Learning Getting Popular?
1.2 Building Neural Networks
1.2.1 A Single Neuron
1.2.2 Activation Functions
1.3 The Perceptron: Example
1.3.1 AND Gate
1.3.2 OR Gate
1.3.3 NOT Gate
1.4 Building Neural Networks with Perceptrons
1.4.1 Deep Neural Network
1.5 Understanding Loss Functions
DEEP LEARNING
Artificial Intelligence, deep learning, machine learning - whatever you are doing if you do
not understand it - learn it. Because otherwise you are going to be a dinosaur within 3
years.
-- Mark Cuban, entrepreneur and investor
Machine Learning, Data Mining, Big Data, First Edition.
By Dewan Md. Farid. Copyright © 2023 United International University.
Deep learning (DL) is a subset of machine learning (ML) that extracts patterns from
big data using artificial neural networks. An artificial neural network (ANN) is a
computing system designed to simulate the way the human brain analyses and
processes information. It is a foundation of artificial intelligence (AI) and solves
problems that would prove impossible or difficult by human or statistical standards.
Traditional ML algorithms rely on a predefined set of features in the data. Generally,
these features are hand-engineered. Hand-engineered features are time-consuming to
build, brittle, and not scalable in practice. The key idea of DL is to learn these
features directly from data. So, DL is a form of machine learning where a model has
multiple layers of neurons.
Deep Learning Extracts informative patterns from Big Data using Artificial Neural
Networks (ANNs).
The history of Deep Learning can be traced back to 1943, when Walter Pitts and
Warren McCulloch created a computer model based on the neural networks of the
human brain. They used a combination of algorithms and mathematics they called
“threshold logic” to mimic the thought process.
Year        Invention
1952        Stochastic Gradient Descent
1958        Perceptron (Learnable Weights)
1960        Multi-layer Perceptron
1986        Back-propagation
1989        Convolutional Neural Network (CNN)
1990-1994   Neural Network in the Wild
1995        Long Short-Term Memory
2005-2010   Banishment
2010        Rename: Deep Learning
2013        CNN + GPU + ImageNet
2015        AlphaGo (CNN + Reinforcement Learning)
Big Data Big Data is data that contains greater variety, arriving in increasing
volumes and with more velocity (the 3 V’s).
Massively parallelizable.
Software Today we have improved techniques, new models, and toolboxes, e.g.
Scikit-learn, a free software machine learning library for the Python programming
language, and TensorFlow.
Input Layer This layer accepts input features. It provides information from the
outside world to the network. No computation is performed at this layer; its nodes
just pass the information (features) on to the hidden layer.
Hidden Layer Nodes of this layer are not exposed to the outer world; they are
part of the abstraction provided by any neural network. The hidden layer performs
all sorts of computation on the features entered through the input layer and
transfers the result to the output layer.
Output Layer This layer brings the information learned by the network to the
outer world.
$$\hat{y} = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right) = g\left(w_0 + X^T W\right) \tag{1.1}$$
Figure 1.4: Perceptron Forward Propagation with Bias and Activation Function.
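The forward pass of Eq. 1.1 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the example inputs and weights, and the choice of the sigmoid as the activation g are all assumptions made for the example.

```python
import math

def sigmoid(z):
    # Assumed activation g; any other activation function could be passed instead.
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, w0, g=sigmoid):
    # Eq. 1.1: y_hat = g(w0 + sum_i x_i * w_i)
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    return g(z)

# Hypothetical inputs and weights: z = 0.0 + 1.0*0.5 + 2.0*(-0.25) = 0.0
y_hat = neuron_forward(x=[1.0, 2.0], w=[0.5, -0.25], w0=0.0)
```

With these illustrative values the weighted sum is zero, so the sigmoid output sits exactly in the middle of its range.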
1. Linear Activation: No transformation is applied between input and output, so the
output equals the input: g(z) = z.
Rectified Linear Unit (ReLU) is a very simple function: for any negative value of
‘z’ the result of the function is zero. So, it passes on zero (0) if the total
weighted sum of the inputs is less than zero, and returns ‘z’ if that sum is
greater than zero. It is just the maximum of the input ‘z’ and zero. The rectifier
or ReLU is defined as the positive part of its argument: g(z) = max(0, z), where
‘z’ is the total weighted sum of a neuron. (Leaky ReLU, a variant, allows a small,
positive gradient when the unit is not active.)
$$g(z) = \begin{cases} 1 & \text{if } z > 0, \\ 0 & \text{otherwise.} \end{cases}$$
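The activation functions discussed here are one-liners in code. The sketch below is illustrative only; the function names are assumptions, and `leaky_relu` with its slope `alpha=0.01` is a commonly used variant added for comparison rather than something defined in the text.

```python
def linear(z):
    # Linear activation: the output equals the input.
    return z

def relu(z):
    # ReLU: g(z) = max(0, z)
    return max(0.0, z)

def step(z):
    # Threshold function: 1 if z > 0, otherwise 0.
    return 1 if z > 0 else 0

def leaky_relu(z, alpha=0.01):
    # Variant of ReLU (assumed slope alpha) that keeps a small gradient for z <= 0.
    return z if z > 0 else alpha * z
```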
1.3.1 AND Gate
X1 X2 y
0 0 0
0 1 0
1 0 0
1 1 1
1.3.2 OR Gate
X1 X2 y
0 0 0
0 1 1
1 0 1
1 1 1
1.3.3 NOT Gate
X1 y
0 1
1 0
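Each of the three truth tables can be reproduced by a single perceptron with a threshold activation. The weights and biases below are one hypothetical choice among many that work; they are assumptions for illustration and are not taken from the text.

```python
def perceptron(inputs, weights, bias):
    # Weighted sum followed by a threshold activation (1 if z > 0, else 0).
    z = bias + sum(x * w for x, w in zip(inputs, weights))
    return 1 if z > 0 else 0

def and_gate(x1, x2):
    # Fires only when both inputs are 1: 1 + 1 - 1.5 > 0.
    return perceptron([x1, x2], [1.0, 1.0], -1.5)

def or_gate(x1, x2):
    # Fires when at least one input is 1: 1 - 0.5 > 0.
    return perceptron([x1, x2], [1.0, 1.0], -0.5)

def not_gate(x1):
    # Fires only when the input is 0: 0.5 > 0, while -1 + 0.5 < 0.
    return perceptron([x1], [-1.0], 0.5)
```

A single perceptron suffices here because AND, OR, and NOT are linearly separable.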
$$\hat{y}_i = g(z_i) = g\left(w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i}\right) \tag{1.2}$$
$$\hat{y}_i = g\left(w^{(2)}_{0,i} + \sum_{j=1}^{n} g(z_j)\, w^{(2)}_{j,i}\right) = g\left(w^{(2)}_{0,i} + \sum_{j=1}^{n} g\left(w^{(1)}_{0,j} + \sum_{k=1}^{m} x_k\, w^{(1)}_{k,j}\right) w^{(2)}_{j,i}\right) \tag{1.3}$$
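As a rough sketch of Eq. 1.2 and 1.3, a network with one hidden layer applies the same "weighted sum plus activation" step twice. The helper names, the list-of-lists weight layout (`weights[j][i]` is the weight from input j to unit i), and the sigmoid default are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, g):
    # One layer (Eq. 1.2): unit i computes g(w_0i + sum_j x_j * w_ji).
    # weights[j][i] is the weight from input j to unit i; biases[i] is w_0i.
    return [
        g(biases[i] + sum(inputs[j] * weights[j][i] for j in range(len(inputs))))
        for i in range(len(biases))
    ]

def forward(x, w1, b1, w2, b2, g=sigmoid):
    # Eq. 1.3: the hidden activations g(z_j) become the inputs of the output layer.
    hidden = dense(x, w1, b1, g)
    return dense(hidden, w2, b2, g)
```

Stacking further calls to `dense` in the same way gives a deeper network.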
Loss functions play an important role in any machine learning model. They measure
how far an estimated value is from its true value. Loss is the penalty for a wrong
prediction by a learning model: a number indicating how bad the model’s prediction
was on a single instance. If the model’s prediction is perfect, the loss is zero;
otherwise, the loss is greater. The goal of a learning model is to find a set of
weights and biases that have low/minimum loss, on average, across all instances.
Fig. 1.12 shows a high loss model on the left and a low loss model on the right.
Loss functions are not fixed; they change depending on the task in hand and the
goal to be met.
Figure 1.12: High loss in the left model; low loss in the right model.
Mean Absolute Error (MAE) also known as L1 loss, is one of the simplest yet most
robust loss functions used for regression models. MAE takes the average of the
absolute differences between the actual and the predicted values. For a data
point $x_i$ with actual value $y_i$ and predicted value $\hat{y}_i$, $n$ being the
total number of instances in the dataset, the mean absolute error is defined as:
$$MAE = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} \tag{1.4}$$
Mean Squared Error (MSE) also known as L2 loss, is a loss function for regression
tasks. MSE is the average of the squared differences between the actual and the
predicted values. For an instance $x_i$ with actual value $y_i$ and predicted value
$\hat{y}_i$, where $n$ is the total number of instances in the dataset, the mean
squared error is defined as:
$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{1.5}$$
Mean Bias Error (MBE) is used to calculate the average bias in the model. Bias,
in a nutshell, is systematically overestimating or underestimating a parameter.
Corrective measures can be taken to reduce the bias after evaluating the model
using MBE. It takes the signed difference between the predicted and the target
value. The formula of Mean Bias Error is shown in Eq. 1.6, where $y_i$ is the
true value, $\hat{y}_i$ is the predicted value and $n$ is the total number of
instances in the dataset.
$$MBE = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)}{n} \tag{1.6}$$
Mean Squared Logarithmic Error (MSLE) is the same as Mean Squared Error,
except that the natural logarithms of the true and predicted values are used
rather than the raw values. The formula of Mean Squared Logarithmic Error is shown
in Eq. 1.7, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value and
$n$ is the total number of instances in the dataset.
$$MSLE = \frac{1}{n}\sum_{i=1}^{n} \left(\log(y_i) - \log(\hat{y}_i)\right)^2 \tag{1.7}$$
Huber Loss combines the robustness of L1 with the stability of L2, essentially the
best of L1 and L2 losses. For large errors it is linear, and for small errors it is
quadratic in nature. Huber Loss is characterised by the parameter delta ($\delta$).
For a prediction $\hat{y}_i$ of the data point $x_i$, with the characterising
parameter $\delta$, Huber Loss is formulated as:
$$L_\delta = \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| \le \delta \\ \delta\,|y_i - \hat{y}_i| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases} \tag{1.8}$$
Binary Cross Entropy Loss is used for binary classification tasks where the model
outputs a probability value between 0 and 1. Entropy is the measure of randomness
in the information being processed, and cross entropy is a measure of the
difference between two probability distributions. Cross-Entropy calculates the
average difference between the predicted and actual probabilities. If we deal with
a Yes/No situation, e.g. “a person has diabetes or not”, then the Binary
Cross Entropy Loss function is used.
$$J = -\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right] \tag{1.9}$$
Categorical Cross Entropy Loss is essentially Binary Cross Entropy Loss expanded
to multiple classes. The target is a one-hot encoded vector, so only one element
of the loss will be non-zero, as the other elements in the vector are multiplied
by zero. It is typically paired with an activation function called softmax, which
turns the network outputs into class probabilities.
Hinge Loss was primarily developed for support vector machines, for calculating the
maximum margin from the hyperplane to the classes. Hinge loss penalises wrong
predictions and does not penalise right predictions that clear the margin. So, the
score of the target label should be greater than the score of each incorrect label
by a margin of at least one. The mathematical formulation of hinge loss is as
follows, where $s_j$ is the score of class $j$ and $y_i$ is the true class of
instance $i$:
$$SVM\ Loss = \sum_{j \neq y_i} \max(0,\ s_j - s_{y_i} + 1) \tag{1.10}$$
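The loss functions of Eq. 1.4 to 1.10 translate almost directly into code. The following is a minimal sketch using plain Python lists; the function names and argument order are assumptions for this example. Note that `msle` follows Eq. 1.7 literally, so it requires strictly positive values (some libraries use log(1 + y) instead), and `huber` is written per instance, as in Eq. 1.8.

```python
import math

def mae(y, y_hat):
    # Eq. 1.4: mean of absolute differences.
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    # Eq. 1.5: mean of squared differences.
    return sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)

def mbe(y, y_hat):
    # Eq. 1.6: mean signed difference (predicted minus actual).
    return sum(p - a for a, p in zip(y, y_hat)) / len(y)

def msle(y, y_hat):
    # Eq. 1.7: MSE on natural logarithms; all values must be positive.
    return sum((math.log(a) - math.log(p)) ** 2 for a, p in zip(y, y_hat)) / len(y)

def huber(y_i, y_hat_i, delta=1.0):
    # Eq. 1.8: quadratic for small errors, linear for large ones.
    error = abs(y_i - y_hat_i)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * error - 0.5 * delta ** 2

def binary_cross_entropy(y, y_hat):
    # Eq. 1.9: summed over instances; y_hat must lie strictly between 0 and 1.
    return -sum(a * math.log(p) + (1 - a) * math.log(1 - p) for a, p in zip(y, y_hat))

def hinge(scores, true_class):
    # Eq. 1.10: sum over the incorrect classes of max(0, s_j - s_true + 1).
    s_true = scores[true_class]
    return sum(max(0.0, s - s_true + 1.0) for j, s in enumerate(scores) if j != true_class)
```

For example, with actual values [1, 2, 3] and predictions [2, 2, 5], the absolute errors are 1, 0, and 2, so `mae` gives 1.0 and `mse` gives 5/3.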
The chain rule finds the derivative of a composite function. Taking the derivative
means finding the “slope” of a given function or classifier at a point (for
example, the slope of a decision line). More generally, derivatives are a measure
of the rate of change, and they apply to almost any function. The chain rule
expresses the derivative of the composition of two differentiable functions f and g
in terms of the derivatives of f and g. More precisely, if h = f ◦ g is the
function such that h(x) = f(g(x)) for every x, then the chain rule is
h′(x) = f′(g(x)) g′(x), which is shown in Eq. 1.13.
$$\frac{d}{dx}\left[f(g(x))\right] = f'(g(x))\, g'(x) \tag{1.13}$$
If a variable Size depends on the variable Height, which itself depends on the
variable Weight (that is, Height and Size are dependent variables), then Size
depends on Weight as well, via the intermediate variable Height. This can be
expressed by the chain rule, as shown in Eq. 1.14.
$$\frac{dSize}{dWeight} = \frac{dSize}{dHeight} \cdot \frac{dHeight}{dWeight} \tag{1.14}$$
Therefore, if we change the value of Weight, we see a change in Size. Since the
slope is 2, the change in Height per unit of Weight is dHeight/dWeight = 2, and
the equation for Height is Height = (dHeight/dWeight) × Weight = 2 × Weight.
Size goes up by 1/4 unit for every 1 unit of Height, so the slope is 1/4, i.e.
dSize/dHeight = 1/4. For Weight = 1:
$$\begin{aligned}
Size &= \frac{dSize}{dHeight} \times Height \\
&= \frac{dSize}{dHeight} \times \frac{dHeight}{dWeight} \times Weight \\
&= \frac{1}{4} \times 2 \times 1 \\
&= \frac{1}{2} = \frac{dSize}{dWeight}
\end{aligned}$$
That means every 1 unit increase in Weight increases Size by 1/2 unit.
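The Size/Height/Weight example can be checked numerically. The sketch below assumes the two linear relationships from the example (Height = 2 × Weight and Size = Height / 4) and uses a central finite difference as a stand-in for the derivative; the helper names are hypothetical.

```python
def height(weight):
    # Height = 2 * Weight, so dHeight/dWeight = 2.
    return 2.0 * weight

def size(h):
    # Size = Height / 4, so dSize/dHeight = 1/4.
    return h / 4.0

def derivative(f, x, eps=1e-6):
    # Central finite-difference approximation of f'(x).
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

w = 1.0
d_height_d_weight = derivative(height, w)
d_size_d_height = derivative(size, height(w))

# Chain rule (Eq. 1.14): product of the two slopes.
chain = d_size_d_height * d_height_d_weight
# Direct differentiation of the composite function Size(Height(Weight)).
direct = derivative(lambda t: size(height(t)), w)
```

Both `chain` and `direct` come out at approximately 1/2, which agrees with the hand calculation.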
$$Residual^2 = (\text{Actual } y - \text{Predicted } \hat{y})^2 = (y - Intercept - (Slope \times Weight))^2 \tag{1.17}$$
In order to find the value for the Intercept that minimises the Squared Residual,
we find the derivative of the Squared Residual with respect to the Intercept and
then find where that derivative is equal to zero, because for the function
y = Residual² the derivative is zero at the lowest point. By the chain rule:
$$\begin{aligned}
\frac{d\,Residual^2}{d\,Intercept} &= \frac{d\,Residual^2}{d\,Residual} \times \frac{d\,Residual}{d\,Intercept} \\
&= 2 \times Residual \times \frac{d\,Residual}{d\,Intercept}
\end{aligned}$$
By the Power Rule, the derivative of Residual² with respect to the Residual is
$\frac{d}{d\,Residual} Residual^2 = 2 \times Residual$.
With Slope = 1, Residual = y − Intercept − (Slope × Weight) =
y − Intercept − (1 × Weight). So, $\frac{d\,Residual}{d\,Intercept} = -1$, and
$$\begin{aligned}
\frac{d\,Residual^2}{d\,Intercept} &= 2 \times Residual \times (-1) \\
&= -2 \times Residual \\
&= -2 \times (y - Intercept - (1 \times Weight))
\end{aligned}$$
Therefore, for y = 2 and Weight = 1:
$$\begin{aligned}
-2 \times (2 - Intercept - (1 \times 1)) &= 0 \\
-4 + 2 \times Intercept + 2 &= 0 \\
2 \times Intercept &= 4 - 2 \\
Intercept &= \frac{2}{2} \\
Intercept &= 1
\end{aligned}$$
Finally, we can say that when the Intercept = 1, we minimise the Squared Residual.
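The same result can be checked in code. This sketch assumes the single observation used in the derivation (Weight = 1, y = 2) and Slope = 1; the function names are illustrative.

```python
def squared_residual(intercept, weight=1.0, y=2.0, slope=1.0):
    # Residual^2 = (y - Intercept - Slope * Weight)^2
    return (y - intercept - slope * weight) ** 2

def d_squared_residual(intercept, weight=1.0, y=2.0, slope=1.0):
    # Chain rule: 2 * Residual * (-1) = -2 * (y - Intercept - Slope * Weight)
    return -2.0 * (y - intercept - slope * weight)

# The derivative is linear in the Intercept, so it is zero at y - Slope * Weight.
best_intercept = 2.0 - 1.0 * 1.0
```

Evaluating `squared_residual` slightly to either side of `best_intercept` gives a larger value, confirming that the point is a minimum.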
     X1    y
x1   0.5   1.4
x2   2.3   1.9
x3   2.9   3.2
First, we will use Gradient Descent to find the optimal value for the intercept.
Let us assume the Least Squares estimate for the slope is 0.64. Step 1: select a
random value for the intercept, e.g. intercept = 0, so ŷ = 0 + 0.64 × X. Now we
evaluate how well this line fits the data with the Sum of the Squared Residuals.
In machine learning, the Sum of the Squared Residuals is a type of loss function.
Therefore, Sum of Squared Residuals = Σ (Actual y − Predicted ŷ)² = (1.08)² +
(0.428)² + (1.344)² ≈ 3.16. The value 3.16 is the Sum of Squared Residuals when
the Intercept = 0.
     X1    y     ŷ       Residual
x1   0.5   1.4   0.32    1.08
x2   2.3   1.9   1.472   0.428
x3   2.9   3.2   1.856   1.344
Gradient Descent does only a few calculations far from the optimal solution and
increases the number of calculations closer to the optimal value. In other words,
Gradient Descent identifies the optimal value by taking big steps when it is far
away and small steps when it is close. Now we can take the derivative of the
function and determine the slope at any value for the intercept.
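$$\begin{aligned}
\text{Sum of Squared Residuals} &= \sum_{i=1}^{n} (\text{Actual } y_i - \text{Predicted } \hat{y}_i)^2 \\
&= (1.4 - (Intercept + 0.64 \times 0.5))^2 \\
&\quad + (1.9 - (Intercept + 0.64 \times 2.3))^2 \\
&\quad + (3.2 - (Intercept + 0.64 \times 2.9))^2
\end{aligned}$$
$$\begin{aligned}
\frac{d}{d\,Intercept}\,\text{Sum of Squared Residuals} &= \frac{d}{d\,Intercept}(1.4 - (Intercept + 0.64 \times 0.5))^2 \\
&\quad + \frac{d}{d\,Intercept}(1.9 - (Intercept + 0.64 \times 2.3))^2 \\
&\quad + \frac{d}{d\,Intercept}(3.2 - (Intercept + 0.64 \times 2.9))^2 \\
&= 2(1.4 - (Intercept + 0.64 \times 0.5)) \times (-1) \\
&\quad + 2(1.9 - (Intercept + 0.64 \times 2.3)) \times (-1) \\
&\quad + 2(3.2 - (Intercept + 0.64 \times 2.9)) \times (-1) \\
&= -2(1.4 - (Intercept + 0.64 \times 0.5)) \\
&\quad - 2(1.9 - (Intercept + 0.64 \times 2.3)) \\
&\quad - 2(3.2 - (Intercept + 0.64 \times 2.9)) \\
&= -2(1.4 - (0 + 0.64 \times 0.5)) \\
&\quad - 2(1.9 - (0 + 0.64 \times 2.3)) \\
&\quad - 2(3.2 - (0 + 0.64 \times 2.9)) \\
&\approx -5.7 \quad \text{(the slope of the curve)}
\end{aligned}$$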
Using the chain rule, we move the exponent to the front and multiply by the
derivative of the expression inside the parentheses, which is −1. So, when the
Intercept = 0, the slope of the curve = -5.7. Gradient Descent is used to find
where the sum of squared residuals is lowest.
If we use Least Squares to solve for the optimal value for the Intercept, we would
simply find where the slope of the curve = 0. In contrast, Gradient Descent finds the
minimum value by taking steps from an initial guess until it reaches the best value.
This makes Gradient Descent very useful when it is not possible to solve for where
the derivative = 0, and this is why Gradient Descent can be used in so many different
situations. The closer we get to the optimal value for the Intercept, the closer the
slope of the curve gets to 0. This means that when the slope of the curve is close to 0
we should take very small steps, because we are close to the optimal value, and
when the slope is far from 0 we should take big steps, because we are far from
the optimal value. However, if we take too large a step, we would increase the
Sum of the Squared Residuals. So, the size of the step should be related to the slope,
since it tells us if we should take a small step or a large step, but we need to make
sure the large step is not too large.
Gradient Descent determines the Step Size by multiplying the slope by a small
number called the Learning Rate. With a Learning Rate of 0.1, when the
Intercept = 0, the Step Size = −5.7 × 0.1 = -0.57. With the Step Size, we can
calculate a new Intercept:
New Intercept = Old Intercept - Step Size = 0 - (-0.57) = 0.57.
$$\begin{aligned}
\frac{d}{d\,Intercept}\,\text{Sum of Squared Residuals} &= -2(1.4 - (Intercept + 0.64 \times 0.5)) \\
&\quad - 2(1.9 - (Intercept + 0.64 \times 2.3)) \\
&\quad - 2(3.2 - (Intercept + 0.64 \times 2.9)) \\
&= -2(1.4 - (0.57 + 0.64 \times 0.5)) \\
&\quad - 2(1.9 - (0.57 + 0.64 \times 2.3)) \\
&\quad - 2(3.2 - (0.57 + 0.64 \times 2.9)) \\
&\approx -2.3 \quad \text{(the slope of the curve)}
\end{aligned}$$
Therefore, Step Size = −2.3 × 0.1 = -0.23, and New Intercept = 0.57 - (-0.23) =
0.8.
$$\begin{aligned}
\frac{d}{d\,Intercept}\,\text{Sum of Squared Residuals} &= -2(1.4 - (Intercept + 0.64 \times 0.5)) \\
&\quad - 2(1.9 - (Intercept + 0.64 \times 2.3)) \\
&\quad - 2(3.2 - (Intercept + 0.64 \times 2.9)) \\
&= -2(1.4 - (0.8 + 0.64 \times 0.5)) \\
&\quad - 2(1.9 - (0.8 + 0.64 \times 2.3)) \\
&\quad - 2(3.2 - (0.8 + 0.64 \times 2.9)) \\
&\approx -0.9
\end{aligned}$$
Therefore, Step Size = −0.9 × 0.1 = -0.09, and New Intercept = 0.8 - (-0.09) =
0.89. Taking three more steps gives New Intercept = 0.92, 0.94, and 0.95
respectively. After six steps, the Gradient Descent estimate for the Intercept is
0.95. Gradient Descent stops when the Step Size is very close to zero. In practice,
the minimum Step Size = 0.001 or smaller, so Gradient Descent will stop if the Step
Size is less than 0.001. Gradient Descent also includes a limit on the number of
steps it will take before giving up. In practice, the maximum number of steps =
1,000 or greater. So, even if the Step Size is large, if there have been more than
the maximum number of steps, Gradient Descent will stop.
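The procedure above can be sketched as a short loop. This is an illustrative implementation, assuming the three data points from the table, a fixed slope of 0.64, a learning rate of 0.1, a minimum step size of 0.001, and a maximum of 1,000 steps; the function and variable names are hypothetical. Intermediate values may differ slightly from the hand calculation because the code does not round at each step.

```python
DATA = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]  # (X1, y) pairs from the table

def d_ssr_d_intercept(intercept, slope=0.64):
    # Derivative of the Sum of Squared Residuals with respect to the Intercept.
    return sum(-2.0 * (y - (intercept + slope * x)) for x, y in DATA)

def gradient_descent_intercept(intercept=0.0, learning_rate=0.1,
                               min_step=0.001, max_steps=1000):
    history = [intercept]
    for _ in range(max_steps):
        step_size = d_ssr_d_intercept(intercept) * learning_rate
        if abs(step_size) < min_step:
            break  # Step Size is very close to zero: stop.
        intercept = intercept - step_size  # New = Old - Step Size
        history.append(intercept)
    return intercept, history

final_intercept, history = gradient_descent_intercept()
```

Here `history` records the intercept after each step, so the shrinking step sizes near the optimum can be inspected directly.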
Similarly, we can find the value for Slope. We want to find the values for the
Intercept and Slope that give us the minimum Sum of Squared Residuals.
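$$\begin{aligned}
\frac{d}{d\,Slope}\,\text{Sum of Squared Residuals} &= \frac{d}{d\,Slope}(1.4 - (Intercept + Slope \times 0.5))^2 \\
&\quad + \frac{d}{d\,Slope}(1.9 - (Intercept + Slope \times 2.3))^2 \\
&\quad + \frac{d}{d\,Slope}(3.2 - (Intercept + Slope \times 2.9))^2 \\
&= 2(1.4 - (Intercept + Slope \times 0.5)) \times (-0.5) \\
&\quad + 2(1.9 - (Intercept + Slope \times 2.3)) \times (-2.3) \\
&\quad + 2(3.2 - (Intercept + Slope \times 2.9)) \times (-2.9) \\
&= -2 \times 0.5\,(1.4 - (Intercept + Slope \times 0.5)) \\
&\quad - 2 \times 2.3\,(1.9 - (Intercept + Slope \times 2.3)) \\
&\quad - 2 \times 2.9\,(3.2 - (Intercept + Slope \times 2.9))
\end{aligned}$$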
We used the chain rule to move the exponent to the front and multiply by the
derivative of the expression inside the parentheses. Since we are taking the
derivative with respect to the Slope, we treat the Intercept like a constant, and
the derivative of a constant is zero. When we have two or more derivatives of the
same function, they are called a Gradient. We use the Gradient to descend to the
lowest point of the Loss Function, which, in this case, is the Sum of the Squared
Residuals. This is why the algorithm is called Gradient Descent. We will start by
picking a random number for the Intercept and the Slope. So, Intercept = 0, and
Slope = 1.
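$$\begin{aligned}
\frac{d}{d\,Intercept}\,\text{Sum of Squared Residuals} &= -2(1.4 - (0 + 1 \times 0.5)) \\
&\quad - 2(1.9 - (0 + 1 \times 2.3)) \\
&\quad - 2(3.2 - (0 + 1 \times 2.9)) \\
&= -1.6
\end{aligned}$$
$$\begin{aligned}
\frac{d}{d\,Slope}\,\text{Sum of Squared Residuals} &= -2 \times 0.5\,(1.4 - (0 + 1 \times 0.5)) \\
&\quad - 2 \times 2.3\,(1.9 - (0 + 1 \times 2.3)) \\
&\quad - 2 \times 2.9\,(3.2 - (0 + 1 \times 2.9)) \\
&= -0.8
\end{aligned}$$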
Therefore, Step Size for Intercept = -1.6 × Learning Rate = -1.6 × 0.01 = -0.016,
and Step Size for Slope = -0.8 × Learning Rate = -0.8 × 0.01 = -0.008. Gradient
Descent can be very sensitive to the Learning Rate. Now, we can calculate New
Intercept = 0 - (-0.016) = 0.016, and New Slope = 1 - (-0.008) = 1.008. We then
repeat these steps until all the Step Sizes are very small or we reach the maximum
number of steps. The result is the best fitting line, with Intercept = 0.95 and
Slope = 0.64, the same values we get from Least Squares.
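Updating the Intercept and the Slope together only requires computing both derivatives at each step. The sketch below is a minimal illustration under stated assumptions: the three data points from the table, a learning rate of 0.01, starting values Intercept = 0 and Slope = 1, and hypothetical function names. The default `min_step` is set much smaller than 0.001 here, an assumption made so that the loop runs long enough to approach the Least Squares solution closely.

```python
DATA = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]  # (X1, y) pairs from the table

def gradients(intercept, slope):
    # The Gradient: derivatives of the Sum of Squared Residuals
    # with respect to the Intercept and the Slope.
    d_intercept = sum(-2.0 * (y - (intercept + slope * x)) for x, y in DATA)
    d_slope = sum(-2.0 * x * (y - (intercept + slope * x)) for x, y in DATA)
    return d_intercept, d_slope

def gradient_descent(intercept=0.0, slope=1.0, learning_rate=0.01,
                     min_step=1e-6, max_steps=100000):
    for _ in range(max_steps):
        d_intercept, d_slope = gradients(intercept, slope)
        step_intercept = d_intercept * learning_rate
        step_slope = d_slope * learning_rate
        if max(abs(step_intercept), abs(step_slope)) < min_step:
            break  # all Step Sizes are very small
        intercept -= step_intercept
        slope -= step_slope
    return intercept, slope

fitted_intercept, fitted_slope = gradient_descent()
```

With a larger learning rate the updates can overshoot and diverge, which is one way to see the sensitivity to the Learning Rate mentioned above.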
Forward pass: Compute the outputs for $y_3$, $y_4$, and $y_5$. Find
$a_j = \sum_i w_{i,j} \times x_i$ and $y_j = f(a_j) = \frac{1}{1 + e^{-a_j}}$;
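$$\begin{aligned}
\delta_3 &= y_3(1 - y_3)\, w_{35}\, \delta_5 \\
&= 0.68(1 - 0.68) \times 0.3 \times (-0.0406) \\
&= -0.00265
\end{aligned}$$
$$\begin{aligned}
\delta_4 &= y_4(1 - y_4)\, w_{45}\, \delta_5 \\
&= 0.6637(1 - 0.6637) \times 0.9 \times (-0.0406) \\
&= -0.0082
\end{aligned}$$
$$\begin{aligned}
\Delta w_{45} &= \eta\, \delta_5\, y_4 \\
&= 1 \times (-0.0406) \times 0.6637 \\
&= -0.0269
\end{aligned}$$
$$\begin{aligned}
w_{45}(new) &= \Delta w_{45} + w_{45}(old) \\
&= -0.0269 + 0.9 \\
&= 0.8731
\end{aligned}$$
$$\begin{aligned}
\Delta w_{14} &= \eta\, \delta_4\, X_1 \\
&= 1 \times (-0.0082) \times 0.35 \\
&= -0.00287
\end{aligned}$$
$$\begin{aligned}
w_{14}(new) &= \Delta w_{14} + w_{14}(old) \\
&= -0.00287 + 0.4 \\
&= 0.3971
\end{aligned}$$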
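The backward-pass arithmetic above can be reproduced with two small helper functions. The sketch uses only the numbers that appear in the worked example (y3, y4, w35, w45, w14, X1, δ5, and η = 1); the helper names are assumptions. Results agree with the hand calculation up to rounding, since the code does not round δ4 before reusing it.

```python
ETA = 1.0            # learning rate (eta)
y3, y4 = 0.68, 0.6637
w35, w45, w14 = 0.3, 0.9, 0.4
x1 = 0.35
delta5 = -0.0406     # error term of the output unit, taken from the example

def hidden_delta(y_j, w_jk, delta_k):
    # delta_j = y_j * (1 - y_j) * w_jk * delta_k for a sigmoid hidden unit.
    return y_j * (1.0 - y_j) * w_jk * delta_k

def updated_weight(w_old, eta, delta_j, input_i):
    # w(new) = w(old) + eta * delta_j * input_i
    return w_old + eta * delta_j * input_i

delta3 = hidden_delta(y3, w35, delta5)
delta4 = hidden_delta(y4, w45, delta5)
w45_new = updated_weight(w45, ETA, delta5, y4)
w14_new = updated_weight(w14, ETA, delta4, x1)
```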
1. Dropout
Figure 1.22: Flow of a CNN that processes an input image and classifies the objects.
Convolution Layer is the first layer to extract features from an input image.
Convolution preserves the relationship between pixels by learning image features
using small squares of input data. It is a mathematical operation that takes two
inputs: an image matrix and a filter or kernel. Consider a 5 x 5 image whose pixel
values are 0 or 1 and a 3 x 3 filter matrix, as shown in Fig. 1.24. The filter
slides over the 5 x 5 image matrix; at each position the overlapping values are
multiplied element-wise and summed. The resulting output is called the Feature
Map, as shown in Fig. 1.24.
Convolution of an image with different filters can perform operations such as
edge detection, blurring and sharpening. The example below shows various
convolved images after applying different types of filters (kernels).
Strides A stride is the number of pixels by which the filter shifts over the input
matrix. When the stride is 1, we move the filter 1 pixel at a time. When the
stride is 2, we move the filter 2 pixels at a time, and so on. Fig. 1.26 shows
convolution with a stride of 2.
Padding When the filter does not fit the input image perfectly, we have to do
padding. There are two types of padding options: (1) Zero-Padding: pad the
picture with zeros so that it fits, and (2) Valid Padding: drop the part of the
image where the filter did not fit.
Non Linearity (ReLU) ReLU stands for Rectified Linear Unit, a non-linear
operation. The output is f(x) = max(0, x). ReLU’s purpose is to introduce
non-linearity into the CNN, since real-world data requires the CNN to learn
non-linear relationships; ReLU keeps the non-negative values and sets negative
values to zero.
Pooling Layer reduces the number of parameters when the images are too large.
Spatial pooling, also called subsampling or downsampling, reduces the
dimensionality of each map but retains important information. Spatial pooling can
be of different types: (1) Max Pooling, (2) Average Pooling, and (3) Sum Pooling.
Max pooling takes the largest element from each window of the rectified feature
map. Instead of the largest element, average pooling takes the average of the
elements in each window. Taking the sum of all elements in the window is called
sum pooling.
Fully Connected Layer In the FC layer, the feature map matrix is flattened into a
vector. With the fully connected layers, we combine the features together to
create a model. Finally, we have an activation function such as softmax or sigmoid
to classify the outputs as cat, dog, car, truck etc.
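The convolution, ReLU, and max pooling steps can be sketched in plain Python. This is an illustration, not an optimised implementation: the function names are assumptions, the kernel is assumed to be square, only valid padding is handled, and the 5 x 5 image and 3 x 3 filter are example values chosen for this sketch (the figures referenced in the text are not reproduced here). Like most CNN libraries, `conv2d` slides the kernel without flipping it.

```python
def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image (valid padding); at each position,
    # multiply overlapping values element-wise and sum them.
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    return [
        [
            sum(image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

def relu_map(feature_map):
    # f(x) = max(0, x) applied to every element.
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2, stride=2):
    # Keep the largest element in each size x size window.
    out_rows = (len(feature_map) - size) // stride + 1
    out_cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

# Illustrative 5 x 5 binary image and 3 x 3 filter (assumed values).
IMAGE = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
KERNEL = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d(IMAGE, KERNEL, stride=1)     # 3 x 3 feature map
strided_map = conv2d(IMAGE, KERNEL, stride=2)     # 2 x 2 feature map
```

With a stride of 1 the 5 x 5 image yields a 3 x 3 feature map; a stride of 2 skips every other position and yields a 2 x 2 map.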
Recurrent Neural Networks (RNNs), also known as deep sequence models, are a class
of artificial neural networks (ANNs) that process sequential or time-series data.
Machine learning models that input or output data sequences are known as sequence
models. Text streams, audio clips, video clips, and time-series data are examples
of sequential data. When the instances in the dataset depend on other instances in
the dataset, the data is termed sequential. A time series is a common example of
this, with each instance reflecting an observation at a certain instant in time,
such as a stock price or a sensor reading. DNA sequences and meteorological data
are further examples of sequential data. RNNs are frequently used in Natural
Language Processing (NLP). Because RNNs have internal memory, they are especially
useful for machine learning applications that need sequential input. Time-series
data can also be forecast using RNNs. We can achieve the following sequence
modelling tasks with RNNs.
One-to-one Model takes one input and returns one output e.g. classic feed-forward
neural network architecture.
One-to-many Used for tasks such as image captioning: the model takes one fixed-size
image as input, and the output can be words or phrases of varying lengths.
RNNs take an input vector, $x_t$, update the hidden state, $h_t$, and return an
output vector, $\hat{y}_t$. In RNNs, we reuse the same weight matrices at every
time step.
$$h_t = \tanh\left(W_{hh}^T h_{t-1} + W_{xh}^T x_t\right) \tag{1.20}$$
$$\hat{y}_t = W_{hy}^T h_t \tag{1.21}$$
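A single RNN time step following Eq. 1.20 and 1.21 can be sketched without any libraries. The helper names and the list-of-lists matrix layout are assumptions for this example: `W_hh` is hidden × hidden, `W_xh` is input × hidden, and `W_hy` is hidden × output, so that each transposed product has the right shape.

```python
import math

def mat_t_vec(W, v):
    # Computes W^T v, where W has len(v) rows.
    n_cols = len(W[0])
    return [sum(W[r][c] * v[r] for r in range(len(v))) for c in range(n_cols)]

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # Eq. 1.20: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t)
    from_hidden = mat_t_vec(W_hh, h_prev)
    from_input = mat_t_vec(W_xh, x_t)
    h_t = [math.tanh(a + b) for a, b in zip(from_hidden, from_input)]
    # Eq. 1.21: y_hat_t = W_hy^T h_t
    y_t = mat_t_vec(W_hy, h_t)
    return h_t, y_t

def rnn_forward(xs, h0, W_xh, W_hh, W_hy):
    # The same weight matrices are reused at every time step.
    h, outputs = h0, []
    for x_t in xs:
        h, y_t = rnn_step(x_t, h, W_xh, W_hh, W_hy)
        outputs.append(y_t)
    return h, outputs
```

Because the hidden state is passed from one step to the next, the output at time t depends on all earlier inputs, which is the "internal memory" of the network.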
As the backpropagation algorithm progresses towards the lower layers, the gradients
can get smaller and smaller and approach zero, which eventually leaves the weights
of the initial or lower layers nearly unchanged. As a result, gradient descent
never converges to the optimum. This is known as the vanishing gradients problem.
On the contrary, in some cases, the gradients keep getting larger and larger as the
backpropagation algorithm progresses. This, in turn, causes very large weight
updates and causes gradient descent to diverge. This is known as the exploding
gradients problem. In a network of n hidden layers, n derivatives are multiplied
together. If the derivatives are large, the gradient increases exponentially as we
propagate down the model until it eventually explodes. We can apply the following
methods to address the vanishing/exploding gradients in deep neural networks.
Encoding bottleneck
Slow, no parallelisation
No long-term memory, e.g. cannot remember the beginning of the sentence for a very
long sentence
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network
capable of learning order dependence in sequence prediction problems. LSTM
networks rely on a gated cell to track information throughout many time steps. A
gate in a neural network acts as a threshold that helps the network to distinguish
when to use normal stacked layers versus an identity connection. An identity
connection uses the output of lower layers as an addition to the output of
consecutive layers. In short, it allows the layers of the network to learn in
increments, rather than creating transformations from scratch. The gate in the
neural network is used to decide whether the network can use the shortened
identity connections, or if it will need to use the stacked layers.
Gated LSTM cells control the information flow, which is the solution to the
vanishing gradient problem. The main disadvantage of LSTMs is that they take a
very long time to train. LSTM cells are able to track information throughout many
time steps. The key concepts of LSTMs are listed below:
An LSTM cell contains: (1) A simple RNN cell, (2) Cell state (Long Term Memory),
(3) Forget gate, (4) Input gate, and (5) Output gate.
Figure 1.35: LSTM with forget gate, input gate, and output gate.
$$h_t = \tanh(C_t) * o_t \tag{1.28}$$
An illustration: $C_t^f = C_{t-1} * f_t$, e.g. [1,4,2] * [1,0,1] = [1,0,2]
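The element-wise gating in the illustration and in Eq. 1.28 can be sketched as follows. The function names are assumptions; only the forget step and the output step are shown, not a full LSTM cell.

```python
import math

def hadamard(a, b):
    # Element-wise product of two equal-length vectors.
    return [p * q for p, q in zip(a, b)]

def apply_forget_gate(c_prev, f_t):
    # C_t^f = C_{t-1} * f_t: entries with gate value 0 are forgotten.
    return hadamard(c_prev, f_t)

def hidden_state(c_t, o_t):
    # Eq. 1.28: h_t = tanh(C_t) * o_t
    return hadamard([math.tanh(c) for c in c_t], o_t)

kept = apply_forget_gate([1, 4, 2], [1, 0, 1])
```

In a trained LSTM the gate values come from sigmoid units and lie between 0 and 1, so information is usually attenuated rather than fully kept or erased.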
Deep Generative Models (DGMs) are deep neural networks that take training
instances drawn from a high-dimensional probability distribution of Big Data and
train a model to represent that data. They offer a powerful way of doing
unsupervised learning, training models on any kind of data distribution. A
generative model learns the true data distribution of the training data so that it
can generate new data instances.
The generative models have two functions: (1) Density Estimation, and (2) Sample
Generation. Density estimation finds the probability density function (PDF) from
Big Data. In contrast, sample generation generates new input instances from
existing training data.
training instances. GANs can learn $P_{model}(x)$ similar to $P_{data}(x)$. Figure
1.43 shows the adversarial nets framework, which has two models: (1) Generator
Network, and (2) Discriminator Network.
Generator Network tries to create synthetic input instances to trick the
discriminator.
Discriminator Network tries to distinguish real data from fakes created by the
generator.
GANs use the minimax game theory which is a decision rule used to minimise
the worst-case potential loss. In minimax game, a player considers all of the best
opponent responses to his strategies, and selects the strategy such that the opponent’s
best strategy gives a payoff as large as possible.
$$J^{(D)} = -\frac{1}{2}\, E_{x \sim p_{data}} \log D(x) - \frac{1}{2}\, E_z \log(1 - D(G(z))) \tag{1.29}$$
$$J^{(G)} = -\frac{1}{2}\, E_z \log D(G(z)) \tag{1.31}$$
Equilibrium is a saddle point/minimax point of the discriminator loss. In math-
ematics, a saddle point or minimax point is a point on the surface of the graph
of a function where the slopes on orthogonal directions are all zero, but which
is not a local extremum of the function.
Estimating the ratio D(x) using supervised learning, e.g. with a neural network,
is the key approximation technique used by GANs. The optimal D(x) for any
$p_{data}(x)$ and $p_{model}(x)$ is shown in Eq. 1.32.
$$D(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} \tag{1.32}$$
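The losses in Eq. 1.29 and 1.31 and the optimal discriminator of Eq. 1.32 can be sketched numerically. In this illustration the expectations are approximated by averages over small batches of discriminator outputs; the function names and the batch-average approximation are assumptions of this sketch, and the outputs must lie strictly between 0 and 1.

```python
import math

def discriminator_loss(d_real, d_fake):
    # Eq. 1.29, with expectations replaced by batch averages.
    # d_real: D(x) on real instances; d_fake: D(G(z)) on generated instances.
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return -0.5 * real_term - 0.5 * fake_term

def generator_loss(d_fake):
    # Eq. 1.31: the generator is rewarded when D(G(z)) is close to 1.
    return -0.5 * sum(math.log(d) for d in d_fake) / len(d_fake)

def optimal_discriminator(p_data, p_model):
    # Eq. 1.32: the optimal D(x) for given density values at x.
    return p_data / (p_data + p_model)
```

When the model distribution matches the data distribution, `optimal_discriminator` returns 0.5 everywhere, meaning the discriminator can no longer tell real from fake.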
The loss functions for training GANs are shown in Eq. 1.33 to 1.35, where
log D(G(z)) is evaluated on fake instances and log(1 − D(x)) on real instances.
Argmax is an operation that finds the argument that gives the maximum value of a
target function. In ML, it is used to find the class with the largest predicted
probability.
GANs are commonly used for enhancing the resolution of an image, colourising
black-and-white images, and converting day images to night images.
1.14.2 Attention
It is the ability of a model to pay attention to the important parts of the text or image.
1.14.3 Transformers
Transformers are powerful deep learning models used in semi-supervised learning,
e.g. GPT-4 (Generative Pre-trained Transformer 4) and BERT (Bidirectional
Encoder Representations from Transformers). ChatGPT is an artificial intelligence
chatbot developed by OpenAI and released in November 2022. BERT is a family of
masked-language models introduced in 2018 by researchers at Google. Transformers
have several advantages: (1) they use attention, (2) they have no recurrence,
(3) they are faster to train, and (4) they can be parallelised.
Figure 1.47: (left) Scaled Dot-Product Attention, and (right) Multi-Head Attention
consists of several attention layers running in parallel.
Figure 1.48: Transformers work in deep learning and NLP: an intuitive introduction.