ML Lecture 9 - DeepLearning

MACHINE LEARNING, DATA SCIENCE, BIG DATA MINING

Machine Learning for Data Mining Applications in Big Data

Prof. Dr. Dewan Md. Farid


Department of Computer Science & Engineering
United International University

https://cse.uiu.ac.bd/profiles/dewanfarid/
ABOUT THE AUTHOR

PROF. DR. DEWAN MD. FARID is a Professor of Computer Science and Engineering at United International University. He is an IEEE Senior Member and a Member of the ACM. Prof. Farid worked as a Postdoctoral Fellow/Staff at the following research labs/groups: (1) Computational Intelligence Group (CIG), Department of Computer Science and Digital Technology, University of Northumbria at Newcastle, UK, in 2013; (2) Computational Modelling Lab (CoMo) and Artificial Intelligence Research Group, Department of Computer Science, Vrije Universiteit Brussel, Belgium, in 2015-2016; and (3) Decision and Information Systems for Production systems (DISP) Laboratory, IUT Lumière – Université Lyon 2, France, in 2020. Prof. Farid was a Visiting Faculty at the Faculty of Engineering, University of Porto, Portugal, in June 2016. He holds a PhD in Computer Science and Engineering from Jahangirnagar University, Bangladesh (2012). Part of his PhD research was carried out at the ERIC Laboratory, University Lumière Lyon 2, France, under the Erasmus Mundus ECW eLink PhD Exchange Program. His PhD was fully funded by the Ministry of Science & Information and Communication Technology, Government of the People's Republic of Bangladesh, and the European Union (EU) eLink project. Prof. Farid has published 111 peer-reviewed scientific articles, including 30 in highly esteemed journals such as Expert Systems with Applications, Journal of Theoretical Biology, Journal of Neuroscience Methods, Bioinformatics, Scientific Reports (Nature), and Proteins, in the fields of Machine Learning, Data Mining and Big Data.

Prof. Farid received the following awards: (1) Dr. Fatema Rashid Best Paper Award (2nd Position) for the paper titled "KNNTree: A new method to ameliorate k-nearest neighbour classification using decision tree" at the 3rd International Conference on Electrical, Computer and Communication Engineering (ECCE 2023), CUET, Chittagong, Bangladesh; (2) JuliaCon 2019 Travel Award for attending the Julia Conference at the University of Maryland, Baltimore, USA; and (3) United Group Research Award 2016 in the field of Science and Engineering. He received the following research funds as Principal Investigator: (1) a2i Innovation Fund of Innov-A-Thon 2018 (Ideabank ID No.: 12502) from the a2i-Access to Information Program – II, Information and Communication Technology (ICT) Division, Government of the People's Republic of Bangladesh; and (2) Project Code: UIU/IAR/01/2021/SE/23 from the Institute for Advanced Research (IAR), United International University. Prof. Farid received the following Erasmus Mundus scholarships: (1) LEADERS (Leading mobility between Europe and Asia in Developing Engineering Education and Research) to undertake a staff-level mobility at the Faculty of Engineering, University of Porto, Portugal, in 2015; (2) cLink (Centre of excellence for Learning, Innovation, Networking and Knowledge) for pursuing a Postdoc at the University of Northumbria at Newcastle, UK, in 2013; and (3) eLink (east west Link for Innovation, Networking and Knowledge exchange) for pursuing a Ph.D. at University Lumière Lyon 2, France, in 2009. Prof. Farid also received Senior Fellowship I and II awards from the National Science & Information and Communication Technology (NSICT), Ministry of Science & Information and Communication Technology, Government of the People's Republic of Bangladesh, in 2008 and 2011 respectively, for pursuing his Ph.D. at Jahangirnagar University. He has visited 18 countries for international conferences, research and higher education. Prof. Farid has delivered several invited/keynote talks, including an invited research talk at the Data to AI Group (DAI), Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, USA.
CONTENTS

List of Symbols ix

PART I LEARNING FROM DATA

1 Deep Learning 3
1.1 What’s Deep Learning? 3
1.1.1 History of Deep Learning 4
1.1.2 Why Is Deep Learning Getting Popular? 4
1.2 Building Neural Networks 6
1.2.1 A Single Neuron 6
1.2.2 Activation Functions 8
1.3 The Perceptron: Example 9
1.3.1 AND Gate 9
1.3.2 OR Gate 10
1.3.3 NOT Gate 10
1.4 Building Neural Networks with Perceptrons 11
1.4.1 Deep Neural Network 12
1.5 Understanding Loss Functions 12

1.6 Empirical Risk Minimisation 15


1.6.1 Quantifying Loss 15
1.6.2 Empirical Loss 15
1.7 The Chain Rule 15
1.7.1 Applying The Chain Rule in Loss Function 17
1.8 Gradient Descent 18
1.8.1 Gradient Descent Example 19
1.9 Back-Propagation Technique 23
1.9.1 Back-Propagation with Example 23
1.10 Neural Network Overfitting 27
1.11 Convolutional Neural Network 28
1.12 Recurrent Neural Networks (RNNs) 31
1.12.1 Backpropagation Through Time (BPTT) for RNNs 35
1.12.2 Vanishing/Exploding Gradients in RNNs 35
1.12.3 Limitations of Recurrent Models (RNNs) 36
1.13 Long Short-Term Memory 36
1.13.1 RNN with LSTM: An Illustration 38
1.14 Deep Generative Modelling 39
1.14.1 Generative Adversarial Networks (GANs) 40
1.14.2 Attention 42
1.14.3 Transformers 43
SYMBOLS

X                The set of training instances / training data
N                Size of X
(x^(i), y^(i))   The i-th instance pair in X (supervised learning)
x^(i)            The input (features) of the i-th training instance in X (unsupervised learning)
x_j^(i)          The value of feature j in the i-th training instance
D                Dimension of an instance x^(i)
K                Dimension of a label y^(i)
X ∈ R^(N×D)      Design matrix, where X_{i,:} denotes x^(i)
X_i              A feature in training data
F                Hypothesis space of functions to be learnt, i.e., a model
C[f]             A cost function of f ∈ F
C[θ]             A cost function of θ parameterising f ∈ F
(x′, y′)         A testing pair
ŷ                Label predicted by a function f, i.e., ŷ = f(x′) (supervised learning)
P(x, y)          A data generating distribution
PART I

LEARNING FROM DATA


CHAPTER 1

DEEP LEARNING

Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world.
—Albert Einstein, Physicist of the 20th century

Artificial Intelligence, deep learning, machine learning - whatever you are doing, if you do not understand it - learn it. Because otherwise you are going to be a dinosaur within 3 years.
—Mark Cuban, Entrepreneur and investor

1.1 What’s Deep Learning?

Deep learning (DL) is a subset of machine learning (ML) that extracts patterns from big data using artificial neural networks. An artificial neural network (ANN) is a computing system designed to simulate the way the human brain analyses and processes information. It is the foundation of artificial intelligence (AI) and solves problems that would prove impossible or difficult by human or statistical standards. Traditional ML algorithms define a set of features in the data; generally, these features are hand-engineered. The key idea of DL is to learn these features directly from data. Hand-engineered features are time-consuming, brittle and not scalable in practice. So, DL is a form of machine learning where a model has multiple layers of neurones.

Artificial Intelligence Any technique that enables computers/machines to mimic human behaviour.

Machine Learning The ability of computers/machines to learn to mimic human behaviour without being explicitly programmed.

Deep Learning Extracting informative patterns from Big Data using Artificial Neural Networks (ANNs).

1.1.1 History of Deep Learning

The history of Deep Learning can be traced back to 1943, when Walter Pitts and
Warren McCulloch created a computer model based on the neural networks of the
human brain. They used a combination of algorithms and mathematics they called
“threshold logic” to mimic the thought process.

Table 1.1: History of Neural Networks/Deep Learning.

Year        Invention
1952        Stochastic Gradient Descent
1958        Perceptron (Learnable Weights)
1960        Multi-layer Perceptron
1986        Back-propagation
1989        Convolutional Neural Network (CNN)
1990-1994   Neural Networks in the Wild
1995        Long Short-Term Memory
2005-2010   Banishment
2010        Renamed: Deep Learning
2013        CNN + GPU + ImageNet
2015        AlphaGo (CNN + Reinforcement Learning)

1.1.2 Why Is Deep Learning Getting Popular?

Can we learn the underlying features directly from data? Hand-engineered features are time-consuming, brittle and not scalable in practice.

Figure 1.1: Classification of deep learning methods.

Big Data Big Data is data that contains greater variety, arriving in increasing volumes and with more velocity (the 3 V's).

Larger datasets are available nowadays.

Data collection and storage techniques have become much easier.

Hardware A graphics processing unit (GPU) is an electronic device designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device.

Graphics Processing Units (GPUs) are massively parallelisable.

Software Today we have improved techniques, new models, and toolboxes, e.g. Scikit-learn (a free machine learning library for the Python programming language) and TensorFlow.

1.2 Building Neural Networks

Input Layer This layer accepts input features. It provides information from the outside world to the network; no computation is performed at this layer. Nodes here just pass on the information (features) to the hidden layer.

Hidden Layer Nodes of this layer are not exposed to the outer world; they are part of the abstraction provided by any neural network. The hidden layer performs all sorts of computations on the features entered through the input layer and transfers the result to the output layer.

Output Layer This layer brings the information learned by the network to the outer world. (A minimal Keras sketch of these three layer types follows Figure 1.2.)

Figure 1.2: Single Layer Neural Networks.
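To make the three layer types concrete, here is a minimal sketch using the Keras API (assuming TensorFlow is installed; the feature count and layer sizes are illustrative, not taken from the text):

import tensorflow as tf

# Input layer passes 4 features through; the hidden layer computes on
# them; the output layer exposes the learned result to the outside world.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                      # input layer (no computation)
    tf.keras.layers.Dense(8, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.summary()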

1.2.1 A Single Neuron


A single Neuron/Perceptron (forward propagation) is the basic structural building block of Deep Learning. Figures 1.3 and 1.4 show a single Neuron with bias and activation function. The output of a single Neuron is $\hat{y} = g(w_0 + X^T W)$, as shown in Eq. 1.1, where $X = (x_1, x_2, \ldots, x_m)^T$ is the input vector and $W = (w_1, w_2, \ldots, w_m)^T$ is the weight vector:

$$\hat{y} = g\Big(w_0 + \sum_{i=1}^{m} x_i w_i\Big) = g(w_0 + X^T W) \quad (1.1)$$

Figure 1.3: The Perceptron: Forward Propagation.

Figure 1.4: Perceptron Forward Propagation with Bias and Activation Function.
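A minimal NumPy sketch of Eq. 1.1, assuming a sigmoid activation g and illustrative values for the inputs, weights and bias:

import numpy as np

def g(z):                        # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([0.5, 0.9])         # inputs x1..xm (illustrative)
W = np.array([0.3, -0.2])        # weights w1..wm (illustrative)
w0 = 0.1                         # bias
y_hat = g(w0 + X @ W)            # forward propagation: y_hat = g(w0 + X^T W)
print(y_hat)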

1.2.2 Activation Functions


The activation function defines the output of a neuron/perceptron given an input or set of inputs (the outputs of multiple neurons). The activation function decides whether a neuron should be activated or not by calculating the weighted sum and further adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.

1. Linear Activation: No transformation between input and output, so the input will be the output.

2. Non-Linear Activation: Non-linear activations increase the functional capacity of the neural network, because they allow it to represent non-linear relationships between features in the input.

(a) Sigmoid Function, $g(z) = \sigma(z) = \frac{1}{1+e^{-z}}$
(b) Hyperbolic Tangent, $g(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
(c) Rectified Linear Unit (ReLU), $g(z) = \max(0, z)$

Sigmoid Function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. A common example of a sigmoid function is the logistic function, defined by the formula $g(z) = \frac{1}{1+e^{-z}}$. Sigmoid functions most often return a value (y axis) in the range 0 to 1; the sigmoid squashes the result for a given input value z between zero and one. If the input z is positive then the output tends to 1; if the input z is negative then the output tends to 0.

Hyperbolic Tangent function is very similar to the sigmoid function, except it collapses the value between -1 and 1. It is analogous to the tangent, defined by the equation tanh z = sinh z / cosh z (abbreviated tanh) and by the formula $g(z) = \tanh z = \frac{e^z - e^{-z}}{e^z + e^{-z}}$. If the input z is positive then the output tends to +1; if the input z is negative then the output tends to -1.

Rectified Linear Unit (ReLU) is a very simple function where, for any negative value of z, the result of the function is zero. So, it passes on zero (0) if the total sum of the inputs is less than zero, or returns z if the value of the input is greater than zero. It is just the maximum between the value of the input z and zero. The rectifier or ReLU is defined as the positive part of its argument, $g(z) = \max(0, z)$, where z is the total weighted sum of a neuron. Its gradient is 1 when the unit is active and 0 otherwise:
$$g'(z) = \begin{cases} 1 & \text{if } z > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Figure 1.5: Common Activation Functions.
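The three activation functions above can be sketched directly in NumPy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes z into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes z into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative z, z otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))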

1.3 The Perceptron: Example

1.3.1 AND Gate

Table 1.2: AND Gate.

X1 X2 y
0 0 0
0 1 0
1 0 0
1 1 1

Figure 1.6: Perceptron Example: AND Gate.



1.3.2 OR Gate

Table 1.3: OR Gate.

X1 X2 y
0 0 0
0 1 1
1 0 1
1 1 1

Figure 1.7: Perceptron Example: OR Gate.

1.3.3 NOT Gate

Table 1.4: NOT Gate.

X1 y
0 1
1 0

Figure 1.8: Perceptron Example: NOT Gate.
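A sketch of the three gates as perceptrons with a step activation. The weights and biases below are illustrative choices that satisfy the truth tables; the figures may show different values:

import numpy as np

def perceptron(x, w, b):
    return int(np.dot(x, w) + b > 0)     # step activation

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    AND = perceptron([x1, x2], [1, 1], -1.5)
    OR = perceptron([x1, x2], [1, 1], -0.5)
    print(x1, x2, AND, OR)               # reproduces Tables 1.2 and 1.3

for x1 in [0, 1]:
    print(x1, perceptron([x1], [-1], 0.5))   # NOT gate (Table 1.4)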

1.4 Building Neural Networks with Perceptrons

Figure 1.9: Multi-output perceptron.

$$\hat{y}_i = g(z_i) = g\Big(w_{0,i} + \sum_{j=1}^{m} x_j\, w_{j,i}\Big) \quad (1.2)$$

Figure 1.10: Single layer neural network.

$$\hat{y}_i = g\Big(w^{(2)}_{0,i} + \sum_{j=1}^{n} g(z_j)\, w^{(2)}_{j,i}\Big) = g\Big(w^{(2)}_{0,i} + \sum_{j=1}^{n} g\Big(w^{(1)}_{0,j} + \sum_{k=1}^{m} x_k\, w^{(1)}_{k,j}\Big)\, w^{(2)}_{j,i}\Big) \quad (1.3)$$

1.4.1 Deep Neural Network

1.5 Understanding Loss Functions

Loss functions play an important role in any machine learning model. They measure how far an estimated value is from its true value. Loss is the penalty for misclassification by a learning model: a number indicating how bad the model's prediction was on a single instance. If the model's prediction is correct, the loss is zero; otherwise, the loss is greater. The goal of a learning model is to find a set of weights and biases that have low/minimum loss, on average, across all instances. Fig. 1.12 shows a high loss model on the left and a low loss model on the right. Loss functions are not fixed; they change depending on the task at hand and the goal to be met.

Figure 1.11: Deep Neural Network with two hidden layers, where $z^{(1)}_i = w^{(1)}_{0,i} + \sum_{j=1}^{m} x_j w^{(1)}_{j,i}$, $z^{(2)}_i = w^{(2)}_{0,i} + \sum_{j=1}^{n} g(z^{(1)}_j)\, w^{(2)}_{j,i}$, and $\hat{y}_i = g\big(w^{(3)}_{0,i} + \sum_{j=1}^{p} g(z^{(2)}_j)\, w^{(3)}_{j,i}\big)$.
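A NumPy sketch of the forward pass through such a deep network; the layer sizes and random weights are illustrative:

import numpy as np

rng = np.random.default_rng(0)

def g(z):                          # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

m, n, p, k = 4, 5, 3, 2            # input, hidden-1, hidden-2, output sizes
W1, b1 = rng.normal(size=(m, n)), np.zeros(n)
W2, b2 = rng.normal(size=(n, p)), np.zeros(p)
W3, b3 = rng.normal(size=(p, k)), np.zeros(k)

x = rng.normal(size=m)
z1 = b1 + x @ W1                   # z^(1): weighted sum of the inputs
z2 = b2 + g(z1) @ W2               # z^(2): weighted sum of g(z^(1))
y_hat = g(b3 + g(z2) @ W3)         # output layer
print(y_hat)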

Figure 1.12: High loss in the left model; low loss in the right model.

Mean Absolute Error (MAE), also known as L1 loss, is one of the most simple yet robust loss functions used for regression models. MAE takes the average sum of the absolute differences between the actual and the predicted values. For a data point $x_i$ with actual value $y_i$ and predicted value $\hat{y}_i$, $n$ being the total number of instances in the dataset, the mean absolute error is defined as:

$$MAE = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} \quad (1.4)$$
Mean Squared Error (MSE), also known as L2 loss, is a loss function for regression tasks. MSE is the average of the squared differences between the actual and the predicted values. For an instance $x_i$ and its predicted value $\hat{y}_i$, where $n$ is the total number of instances in the dataset, the mean squared error is defined as:

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (1.5)$$

Mean Bias Error (MBE) is used to calculate the average bias in the model. Bias, in a nutshell, is overestimating or underestimating a parameter. Corrective measures can be taken to reduce the bias after evaluating the model using MBE. It takes the actual difference between the target and the predicted value. The formula of Mean Bias Error is shown in Eq. 1.6, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value and $n$ is the total number of instances in the dataset.

$$MBE = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)}{n} \quad (1.6)$$
Mean Squared Logarithmic Error (MSLE) is the same as Mean Squared Error, except it is computed on the natural logarithms of the actual and predicted values. The formula of Mean Squared Logarithmic Error is shown in Eq. 1.7, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value and $n$ is the total number of instances in the dataset.

$$MSLE = \frac{1}{n}\sum_{i=1}^{n} \big(\log(y_i) - \log(\hat{y}_i)\big)^2 \quad (1.7)$$

Huber Loss combines the robustness of L1 with the stability of L2; essentially the best of the L1 and L2 losses. For huge errors it is linear, and for small errors it is quadratic in nature. Huber Loss is characterised by the parameter delta ($\delta$). For a prediction $\hat{y}_i$ of the data point $x_i$, with the characterising parameter $\delta$, Huber Loss is formulated as:

$$L_\delta = \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| \le \delta \\ \delta|y_i - \hat{y}_i| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases} \quad (1.8)$$
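A NumPy sketch of the regression losses defined so far (Eqs. 1.4-1.6 and 1.8), evaluated on the actual/predicted values that appear later in Table 1.6:

import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))          # Eq. 1.4

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)           # Eq. 1.5

def mbe(y, y_hat):
    return np.mean(y_hat - y)                  # Eq. 1.6

def huber(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)                    # Eq. 1.8: quadratic for small
    return np.mean(np.where(err <= delta,      # errors, linear for large ones
                            0.5 * err ** 2,
                            delta * err - 0.5 * delta ** 2))

y = np.array([1.4, 1.9, 3.2])
y_hat = np.array([0.32, 1.472, 1.856])
print(mae(y, y_hat), mse(y, y_hat), mbe(y, y_hat), huber(y, y_hat))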

Binary Cross Entropy Loss gives a probability value between 0 and 1 for binary classification tasks. Entropy is the measure of randomness in the information being processed, and cross entropy is a measure of the difference of the randomness between two random variables. Cross-entropy calculates the average difference between the predicted and actual probabilities. If we deal with a Yes/No situation, e.g. "a person has diabetes or not", then the Binary Classification Loss Function is used.

$$J = -\sum_{i=1}^{N} \big[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \big] \quad (1.9)$$

Categorical Cross Entropy Loss is essentially Binary Cross Entropy Loss expanded to multiple classes. With one-hot encoded labels, only the element for the true class is non-zero, as the other elements in the vector are multiplied by zero. This loss is typically paired with an activation function called softmax.

Hinge Loss was primarily developed for support vector machines for calculating the maximum margin from the hyperplane to the classes. Hinge loss penalises wrong predictions and does not penalise right predictions. So, the score of the target label should be greater than the score of each incorrect label by a margin of (at the least) one. For an instance with true class $y_i$ and class scores $s$, the mathematical formulation of hinge loss is as follows:

$$SVMLoss_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1) \quad (1.10)$$
Soft-max Cross Entropy Loss
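A NumPy sketch of the two classification losses defined above, binary cross-entropy (Eq. 1.9) and multiclass hinge loss (Eq. 1.10); the sample labels and scores are illustrative:

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)       # avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def hinge(scores, target):
    # scores: raw class scores; target: index of the true class
    margins = np.maximum(0, scores - scores[target] + 1)
    margins[target] = 0                        # the true class adds no loss
    return np.sum(margins)

print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
print(hinge(np.array([2.0, 1.0, 3.5]), target=0))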

1.6 Empirical Risk Minimisation

Empirical Risk Minimisation (ERM) is a fundamental concept in machine learning: we cannot know exactly how well a learning algorithm will work in real life (the true risk), as testing a learning algorithm on new real-world data is a very expensive and costly process (known as the gold standard). Because we do not know the true distribution of data that the algorithm will work on, we instead measure its performance on a known set of training data (the "empirical" risk).

1.6.1 Quantifying Loss


The loss of our network measures the cost incurred from incorrect predictions.

$$L\big(\text{Predicted } \hat{y}^{(i)},\ \text{Actual } y^{(i)}\big) \quad (1.11)$$

1.6.2 Empirical Loss


The empirical loss measures the total loss over the entire dataset.
$$\frac{1}{n}\sum_{i=1}^{n} L\big(\text{Predicted } \hat{y}^{(i)},\ \text{Actual } y^{(i)}\big) \quad (1.12)$$
We want to find the network weights that achieve the lowest loss.

1.7 The Chain Rule

The chain rule finds the derivative of a composite function. Deriving, or taking the derivative, means to find the "slope" (slope of a decision line) of a given function or classifier. Derivatives are a measure of the rate of change, and they apply to almost any function. The chain rule expresses the derivative of the composition of two differentiable functions f and g in terms of the derivatives of f and g. More precisely, if h = f ∘ g is the function such that h(x) = f(g(x)) for every x, then the chain rule is h′(x) = f′(g(x)) g′(x), which is shown in Eq. 1.13.

Figure 1.13: Quantifying Loss.

Figure 1.14: Empirical Loss.

$$\frac{d}{dx}\big[f(g(x))\big] = f'(g(x))\, g'(x) \quad (1.13)$$
If a variable Size depends on the variable Height, which itself depends on the variable Weight (that is, Height and Size are dependent variables), then Size depends on Weight as well, via the intermediate variable Height. This can be expressed by the chain rule, shown in Eq. 1.14.

$$\frac{dSize}{dWeight} = \frac{dSize}{dHeight} \cdot \frac{dHeight}{dWeight} \quad (1.14)$$

Therefore, if we change the value of Weight then we see a change in Size. Since the slope is 2, the change in Height is $\frac{dHeight}{dWeight} = 2$; the equation is $Height = \frac{dHeight}{dWeight} \times Weight = 2 \times Weight$. As Size goes up 1 unit for every 4 units of Height, the slope is $\frac{1}{4}$, so $\frac{dSize}{dHeight} = \frac{1}{4}$. Then:

$$\frac{dSize}{dWeight} = \frac{dSize}{dHeight} \times \frac{dHeight}{dWeight} = \frac{1}{4} \times 2 = \frac{1}{2}$$

That means every 1 unit increase of Weight increases Size by 1/2 unit.
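A quick SymPy check of this example (assuming Height = 2 × Weight and Size = Height / 4, as above):

import sympy as sp

Weight = sp.symbols("Weight")
Height = 2 * Weight                 # slope dHeight/dWeight = 2
Size = sp.Rational(1, 4) * Height   # slope dSize/dHeight = 1/4
print(sp.diff(Size, Weight))        # prints 1/2, matching the chain rule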

1.7.1 Applying The Chain Rule in Loss Function


Now let’s look at how The Chain Rule applies to the Residual Sum of Squares, a
commonly used Loss Function in machine learning.

Height = Intercept + (Slope × W eight) (1.15)

Residual = Actual, y − P redicted, ŷ (1.16)


= y − (Intercept + (Slope × W eight))
= y − Intercept − (Slope × W eight)

2
Residual2 = (Actual, y − P redicted, ŷ) (1.17)
2
= (y − Intercept − (Slope × W eight))

In order to find the value for the Intercept that minimises the Squared Resid-
ual, we are going to find the Derivative of the Squared Residual with respect to the
Intercept and we are going to find where the derivative is equal to zero. Because,
given the function y = Residual2 , the derivative is zero at the lowest point.

$$\frac{dResidual^2}{dIntercept} = \frac{d\,Residual^2}{dResidual} \times \frac{dResidual}{dIntercept} = 2 \times Residual \times \frac{dResidual}{dIntercept}$$

By the Power Rule, the derivative of $Residual^2$ with respect to $Residual$ is $2 \times Residual$. As $Residual = y - Intercept - (Slope \times Weight) = y - Intercept - (1 \times Weight)$, we have $\frac{dResidual}{dIntercept} = -1$. So:

$$\frac{dResidual^2}{dIntercept} = 2 \times Residual \times (-1) = -2 \times \big(y - Intercept - (1 \times Weight)\big)$$

Therefore, for a single observation with Weight = 1, y = 2 and Slope = 1:

−2 × (2 − Intercept − (1 × 1)) = 0
−4 + 2 × Intercept + 2 = 0
2 × Intercept = 4 − 2
Intercept = 2/2 = 1

Finally, we can say that when the Intercept = 1, we minimise the Squared Residual.
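A SymPy check that Intercept = 1 minimises the squared residual for the single observation used above (Weight = 1, y = 2, Slope = 1):

import sympy as sp

Intercept = sp.symbols("Intercept")
residual_sq = (2 - Intercept - 1 * 1) ** 2                   # squared residual
print(sp.solve(sp.diff(residual_sq, Intercept), Intercept))  # [1]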

1.8 Gradient Descent

Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning (ML) and deep learning (DL) to minimise a cost/loss function. The Gradient Descent algorithm is given below:

Algorithm Gradient Descent


1. Initialise weights randomly
2. Loop until convergence
3. Compute gradient
4. Update weights
5. Return weights

Algorithm Stochastic Gradient Descent


1. Initialise weights randomly
2. Loop until convergence
3. Pick single data point, i
4. Compute gradient
5. Update weights
6. Return weights

Figure 1.15: Gradient Descent Algorithms.
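A minimal Python sketch of both listings; grad_fn is a placeholder for a function that returns the gradient of the loss on the given data:

import numpy as np

def gradient_descent(w, X, y, grad_fn, lr=0.01, steps=1000):
    for _ in range(steps):               # loop until convergence
        w = w - lr * grad_fn(w, X, y)    # compute gradient, update weights
    return w

def stochastic_gradient_descent(w, X, y, grad_fn, lr=0.01, steps=1000):
    rng = np.random.default_rng(0)
    for _ in range(steps):
        i = rng.integers(len(X))         # pick a single data point i
        w = w - lr * grad_fn(w, X[i:i+1], y[i:i+1])
    return w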

1.8.1 Gradient Descent Example


In machine learning and data science area, we optimise a lot of stuff. We can apply
Gradient Descent algorithm to find the optimal values for the intercept and the slope.

Table 1.5: An Example GD algorithm.

X1 y
x1 0.5 1.4
x2 2.3 1.9
x3 2.9 3.2

First, we will use Gradient Descent to find the optimal value for the intercept. Let's assume the Least Squares estimate for the slope is 0.64. Step 1: select a random value for the intercept, e.g. intercept = 0, so $\hat{y} = 0 + 0.64 \times X$. Now we evaluate how well this line fits the data with the Sum of the Squared Residuals; in machine learning, the Sum of the Squared Residuals is a type of loss function. Therefore, Sum of Squared Residuals = $(1.08)^2 + (0.428)^2 + (1.344)^2 = 3.1$. The value 3.1 represents the Sum of Squared Residuals when the Intercept = 0. Gradient Descent does only a few calculations far from the optimal solution and increases the number of calculations closer to the optimal value; in other words, Gradient Descent identifies the optimal value by taking big steps when it is far away and small steps when it is close.

Table 1.6: Finding residual, where ŷ = 0 + 0.64 × X and residual = y − ŷ.

X1 y ŷ Residual
x1 0.5 1.4 0.32 1.08
x2 2.3 1.9 1.472 0.428
x3 2.9 3.2 1.856 1.344

Now we can take the derivative of the function and determine the slope at any value for the intercept.

$$SSR = \big(1.4 - (Intercept + 0.64 \times 0.5)\big)^2 + \big(1.9 - (Intercept + 0.64 \times 2.3)\big)^2 + \big(3.2 - (Intercept + 0.64 \times 2.9)\big)^2$$

$$\frac{d\,SSR}{dIntercept} = -2\big(1.4 - (Intercept + 0.64 \times 0.5)\big) - 2\big(1.9 - (Intercept + 0.64 \times 2.3)\big) - 2\big(3.2 - (Intercept + 0.64 \times 2.9)\big)$$

Substituting Intercept = 0:

$$\frac{d\,SSR}{dIntercept} = -2(1.4 - 0.32) - 2(1.9 - 1.472) - 2(3.2 - 1.856) = -5.7 \quad \text{(the slope of the curve)}$$

We start by moving the square to the front and multiplying by −1, the derivative of the expression inside the parentheses. So, when the Intercept = 0, the slope of the curve = −5.7. Gradient Descent is used to find where the sum of squared residuals is lowest.

If we used Least Squares to solve for the optimal value of the Intercept, we would simply find where the slope of the curve = 0. In contrast, Gradient Descent finds the minimum value by taking steps from an initial guess until it reaches the best value. This makes Gradient Descent very useful when it is not possible to solve for where the derivative = 0, and this is why Gradient Descent can be used in so many different situations. The closer we get to the optimal value for the Intercept, the closer the slope of the curve gets to 0. This means that when the slope of the curve is close to 0, we should take very small steps, because we are close to the optimal value; when the slope is far from 0, we should take big steps, because we are far from the optimal value. However, if we take a very large step, we could increase the Sum of the Squared Residuals. So the size of the step should be related to the slope, since it tells us whether we should take a small step or a large step, but we need to make sure the large step is not too large.

Gradient Descent determines the Step Size by multiplying the slope by a small number called the Learning Rate: Step Size = −5.7 × 0.1 = −0.57. So, when the Intercept = 0, the Step Size = −0.57. With the Step Size, we can calculate a new Intercept: New Intercept = Old Intercept − Step Size = 0 − (−0.57) = 0.57.

Repeating the derivative calculation with Intercept = 0.57:

$$\frac{d\,SSR}{dIntercept} = -2\big(1.4 - (0.57 + 0.64 \times 0.5)\big) - 2\big(1.9 - (0.57 + 0.64 \times 2.3)\big) - 2\big(3.2 - (0.57 + 0.64 \times 2.9)\big) = -2.3 \quad \text{(the slope of the curve)}$$

Therefore, Step Size = −2.3 × 0.1 = -0.23, and New Intercept = 0.57 - (-0.23) =
0.8.

$$\frac{d\,SSR}{dIntercept} = -2\big(1.4 - (0.8 + 0.64 \times 0.5)\big) - 2\big(1.9 - (0.8 + 0.64 \times 2.3)\big) - 2\big(3.2 - (0.8 + 0.64 \times 2.9)\big) = -0.9$$

Therefore, Step Size = −0.9 × 0.1 = −0.09, and New Intercept = 0.8 − (−0.09) = 0.89. Taking further steps, the New Intercept = 0.92, 0.94, and 0.95 respectively. After six steps, the Gradient Descent estimate for the Intercept is 0.95. Gradient Descent stops when the Step Size is very close to zero. In practice, the minimum Step Size = 0.001 or smaller; Gradient Descent will stop if the Step Size is less than 0.001. Gradient Descent also includes a limit on the number of steps it will take before giving up; in practice, the maximum number of steps = 1,000 or greater. So, even if the Step Size is large, if there have been more than the maximum number of steps, Gradient Descent will stop.

Similarly, we can find the value for the Slope. We want to find the values for the Intercept and Slope that give us the minimum Sum of Squared Residuals.

$$\frac{d\,SSR}{dSlope} = -2 \times 0.5\,\big(1.4 - (Intercept + Slope \times 0.5)\big) - 2 \times 2.3\,\big(1.9 - (Intercept + Slope \times 2.3)\big) - 2 \times 2.9\,\big(3.2 - (Intercept + Slope \times 2.9)\big)$$

We used the chain rule to move the square to the front and multiply by the derivative of the expression inside the parentheses. Since we are taking the derivative with respect to the Slope, we treat the Intercept like a constant, and the derivative of a constant is ZERO. When we have two or more derivatives of the same function, they are called a Gradient. We use the Gradient to descend to the lowest point in the Loss Function, which, in this case, is the Sum of the Squared Residuals. This is why the algorithm is called Gradient Descent. We start by picking random numbers for the Intercept and Slope: Intercept = 0, and Slope = 1.

$$\frac{d\,SSR}{dIntercept} = -2\big(1.4 - (0 + 1 \times 0.5)\big) - 2\big(1.9 - (0 + 1 \times 2.3)\big) - 2\big(3.2 - (0 + 1 \times 2.9)\big) = -1.6$$

$$\frac{d\,SSR}{dSlope} = -2 \times 0.5\,\big(1.4 - (0 + 1 \times 0.5)\big) - 2 \times 2.3\,\big(1.9 - (0 + 1 \times 2.3)\big) - 2 \times 2.9\,\big(3.2 - (0 + 1 \times 2.9)\big) = -0.8$$

Therefore, Step Size for Intercept = −1.6 × Learning Rate = −1.6 × 0.01 = −0.016, and Step Size for Slope = −0.8 × Learning Rate = −0.8 × 0.01 = −0.008. Gradient Descent can be very sensitive to the Learning Rate. Now we can calculate the New Intercept = 0 − (−0.016) = 0.016 and the New Slope = 1 − (−0.008) = 1.008. We then just repeat what we did until all the Step Sizes are very small or we reach the maximum number of steps. This gives the best fitting line, with Intercept = 0.95 and Slope = 0.64, the same values we get from Least Squares.
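A NumPy sketch reproducing this worked example on the three points of Table 1.5 (learning rate 0.01 and a fixed number of steps; in practice one would also stop once the step sizes fall below the minimum of 0.001 mentioned above):

import numpy as np

X = np.array([0.5, 2.3, 2.9])
y = np.array([1.4, 1.9, 3.2])

intercept, slope, lr = 0.0, 1.0, 0.01
for _ in range(20000):
    residual = y - (intercept + slope * X)
    d_intercept = np.sum(-2 * residual)       # derivative w.r.t. the intercept
    d_slope = np.sum(-2 * X * residual)       # derivative w.r.t. the slope
    intercept -= lr * d_intercept             # step = slope x learning rate
    slope -= lr * d_slope

print(round(intercept, 2), round(slope, 2))   # about 0.95 and 0.64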

1.9 Back-Propagation Technique

The Backpropagation (BP) algorithm performs the following two steps:

1. Take the derivative (gradient) of the loss with respect to each parameter.
2. Shift the parameters in order to minimise the loss.

1.9.1 Back-Propagation with Example


Let’s assume that the neurones have a sigmoid activation function, perform a forward
pass and a backward pass on the network. Assume that the actual output is y = 0.5
and learning rate is 1. Perform another forward pass.

Figure 1.16: Back-Propagation with Example.

Forward pass: compute the outputs for $y_3$, $y_4$, and $y_5$. Find $sum_j = \sum_i w_{ij} \times x_i$ and $y_j = f(sum_j) = \frac{1}{1+e^{-sum_j}}$:

$sum_3 = (w_{13} \times X_1) + (w_{23} \times X_2) = (0.1 \times 0.35) + (0.8 \times 0.9) = 0.755$
$y_3 = f(sum_3) = \frac{1}{1+e^{-0.755}} = 0.68$

$sum_4 = (w_{14} \times X_1) + (w_{24} \times X_2) = (0.4 \times 0.35) + (0.6 \times 0.9) = 0.68$
$y_4 = f(sum_4) = \frac{1}{1+e^{-0.68}} = 0.6637$

$sum_5 = (w_{35} \times y_3) + (w_{45} \times y_4) = (0.3 \times 0.68) + (0.9 \times 0.6637) = 0.801$
$y_5 = f(sum_5) = \frac{1}{1+e^{-0.801}} = 0.69$

Therefore, Error = $y - \hat{y}$ = 0.5 − 0.69 = −0.19.

Each weight is changed by the following formulas:
1. $\Delta w_{ji} = \eta\, \delta_j\, o_i$
2. $\delta_j = o_j(1 - o_j)(t_j - o_j)$, if $j$ is an output unit
3. $\delta_j = o_j(1 - o_j) \sum_k \delta_k w_{kj}$, if $j$ is a hidden unit

where $\eta$ is a constant called the Learning Rate, $t_j$ is the target output for unit $j$, $o_i$ is the output of the source unit $i$, and $\delta_j$ is the error measure for unit $j$.
For the output unit:

$\delta_5 = \hat{y}(1-\hat{y})(y-\hat{y}) = 0.69(1-0.69)(0.5-0.69) = -0.0406$

For the hidden units:

$\delta_3 = y_3(1-y_3)\, w_{35}\, \delta_5 = 0.68(1-0.68) \times 0.3 \times -0.0406 = -0.00265$
$\delta_4 = y_4(1-y_4)\, w_{45}\, \delta_5 = 0.6637(1-0.6637) \times 0.9 \times -0.0406 = -0.0082$

$\Delta w_{45} = \eta\, \delta_5\, y_4 = 1 \times -0.0406 \times 0.6637 = -0.0269$
$w_{45}(new) = \Delta w_{45} + w_{45}(old) = -0.0269 + 0.9 = 0.8731$
$\Delta w_{14} = \eta\, \delta_4\, X_1 = 1 \times -0.0082 \times 0.35 = -0.00287$
$w_{14}(new) = \Delta w_{14} + w_{14}(old) = -0.00287 + 0.4 = 0.3971$

Similarly, update all other weights:

Table 1.7: Updated weights.

i   j   w_ij   δ_j   o_i   η   Updated w_ij


1 3 0.1 -0.00265 0.35 1 0.0991
2 3 0.8 -0.00265 0.9 1 0.7976
1 4 0.4 -0.0082 0.35 1 0.3971
2 4 0.6 -0.0082 0.9 1 0.5926
3 5 0.3 -0.0406 0.68 1 0.2724
4 5 0.9 -0.0406 0.6637 1 0.8731

Figure 1.17: Back-Propagation with Example (cont.).


Forward pass: compute the outputs for $y_3$, $y_4$, and $y_5$ with the updated weights. Find $sum_j = \sum_i w_{ij} \times x_i$ and $y_j = f(sum_j) = \frac{1}{1+e^{-sum_j}}$:

$sum_3 = (w_{13} \times X_1) + (w_{23} \times X_2) = (0.0991 \times 0.35) + (0.7976 \times 0.9) = 0.7525$
$y_3 = f(sum_3) = \frac{1}{1+e^{-0.7525}} = 0.6797$

$sum_4 = (w_{14} \times X_1) + (w_{24} \times X_2) = (0.3971 \times 0.35) + (0.5926 \times 0.9) = 0.6723$
$y_4 = f(sum_4) = \frac{1}{1+e^{-0.6723}} = 0.6620$

$sum_5 = (w_{35} \times y_3) + (w_{45} \times y_4) = (0.2724 \times 0.6797) + (0.8731 \times 0.6620) = 0.7631$
$y_5 = f(sum_5) = \frac{1}{1+e^{-0.7631}} = 0.6820$

Therefore, Error = $y - \hat{y}$ = 0.5 − 0.682 = −0.182, a smaller error than the −0.19 before the weight update.
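A NumPy sketch reproducing this worked example end to end, using the given inputs, initial weights, target y = 0.5 and learning rate 1:

import numpy as np

def f(z):                                  # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y, eta = 0.35, 0.9, 0.5, 1.0
w13, w23, w14, w24, w35, w45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

for _ in range(2):                         # two forward/backward passes
    y3 = f(w13 * x1 + w23 * x2)            # forward pass
    y4 = f(w14 * x1 + w24 * x2)
    y5 = f(w35 * y3 + w45 * y4)
    print("output:", round(y5, 4), "error:", round(y - y5, 4))

    d5 = y5 * (1 - y5) * (y - y5)          # error measure, output unit
    d3 = y3 * (1 - y3) * w35 * d5          # error measures, hidden units
    d4 = y4 * (1 - y4) * w45 * d5
    w35 += eta * d5 * y3                   # updates: dw_ji = eta * d_j * o_i
    w45 += eta * d5 * y4
    w13 += eta * d3 * x1; w23 += eta * d3 * x2
    w14 += eta * d4 * x1; w24 += eta * d4 * x2

The first iteration prints output 0.69 and error −0.19; after the weight updates of Table 1.7, the second prints output 0.682 and error −0.182.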

1.10 Neural Network Overfitting

Figure 1.18: NN- Overfitting.

Regularisation to address overfitting in neural networks (a minimal sketch follows Figure 1.20):

1. Dropout: randomly drop a fraction of a layer's activations during training.

2. Early stopping: stop training before we have a chance to overfit.



Figure 1.19: Dropout.

Figure 1.20: Early Stopping.
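A hedged Keras sketch combining both techniques: a Dropout layer inside the model and an EarlyStopping callback that halts training when the validation loss stops improving (assumes TensorFlow; rates, sizes and patience values are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),              # randomly drop 50% of units
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5,            # stop when val loss stalls
    restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop])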

1.11 Convolutional Neural Network

Convolutional Neural Network (CNN, or ConvNet) is a class of deep neural networks commonly used to analyse visual imagery, e.g. object detection, image recognition, image classification, face recognition, etc. CNNs classify an input image under certain categories (e.g., Dog, Cat, Tiger, Lion). An input image is an array of pixels represented by h × w × d, where h is the height, w is the width, and d is the depth. An image as a 6 × 6 × 3 array of RGB values is shown in Fig. 1.21. Each input image passes through a series of convolution layers with filters (known as kernels), a pooling layer, and a fully connected (FC) neural network with a Softmax function to classify an object, as shown in Fig. 1.22. The core components of a CNN are: (1) Convolution Layer, (2) Activation Function (ReLU), (3) Pooling Layer, and (4) Fully Connected Neural Network.

Figure 1.21: Array of RGB (Red Green Blue) Matrix.

Figure 1.22: Flow of CNN to process an input image and classifies the objects.

Convolution Layer is the first layer to extract features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data. It is a mathematical operation that takes two inputs: an image matrix and a filter or kernel. A 5 × 5 image with pixel values 0 and 1 and a 3 × 3 filter matrix are shown in Fig. 1.23. The convolution of the 5 × 5 image matrix with the 3 × 3 filter matrix produces a Feature Map as output, shown in Fig. 1.24. Convolution of an image with different filters can perform operations such as edge detection, blur and sharpen. Fig. 1.25 shows various convolved images after applying different types of filters (kernels).

Strides is the number of pixels the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time; when the stride is 2, we move the filter 2 pixels at a time, and so on. Fig. 1.26 shows convolution with a stride of 2.

Padding when the filter does not perfectly fit the input image, we have to pad. There are two padding options: (1) Zero-Padding: pad the picture with zeros so that it fits, and (2) Valid Padding: drop the part of the image where the filter did not fit.

Figure 1.23: Image matrix multiplied by a kernel or filter matrix.

Figure 1.24: Image matrix multiplied by a kernel or filter matrix.

Non Linearity (ReLU) ReLU stands for Rectified Linear Unit, a non-linear operation. The output is $f(x) = \max(0, x)$. ReLU's purpose is to introduce non-linearity into the CNN, since real-world data is mostly non-linear.

Pooling Layer reduces the number of parameters of the images. Spatial pooling, also called subsampling or downsampling, reduces the dimensionality of each map but retains important information. Spatial pooling can be of different types: (1) Max Pooling, (2) Average Pooling, and (3) Sum Pooling. Max pooling takes the largest element from the rectified feature map; average pooling takes the average of the elements; taking the sum of all elements in the feature map is called sum pooling.

Fully Connected Layer The FC layer flattens the feature map matrix into a vector. With the fully connected layers, we combine features together to create a model. Finally, we have an activation function such as softmax or sigmoid to classify the outputs as cat, dog, car, truck, etc.

Figure 1.25: Some common filters.
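A NumPy sketch of the core pipeline described above: one convolution (stride 1, valid padding), a ReLU, and 2 × 2 max pooling; the image and kernel values are illustrative:

import numpy as np

def conv2d(image, kernel):                 # valid padding, stride 1
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)
    return out                             # the feature map

def max_pool(fmap, size=2):                # keep the largest element per window
    out = np.zeros((fmap.shape[0] // size, fmap.shape[1] // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.random.randint(0, 2, (6, 6)).astype(float)   # 6x6 binary image
kernel = np.array([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]])
fmap = np.maximum(0.0, conv2d(image, kernel))           # convolution + ReLU
print(max_pool(fmap))                                   # 2x2 pooled output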

1.12 Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs), also known as deep sequence models, are a class of artificial neural networks (ANNs) for sequential data or time-series data. Machine learning models that input or output data sequences are known as sequence models. Text streams, audio clips, video clips, and time-series data are examples of sequential data. When the instances in the dataset depend on other instances in the dataset, the data is termed sequential. A time series is a common example, with each instance reflecting an observation at a certain point in time, such as a stock price or sensor data. Sentences, DNA sequences, and meteorological data are further examples of sequential data. RNNs are frequently used in Natural Language Processing (NLP). Because RNNs have internal memory, they are especially useful for machine learning applications that need sequential input. Time-series data can also be forecasted using RNNs. We can achieve the following different sequence modelling tasks employing RNNs.

One-to-one The model takes one input and returns one output, e.g. the classic feed-forward neural network architecture.

One-to-many This is referred to as image captioning: the model takes one fixed-size image as input, and the output can be words or phrases of varying lengths.

Many-to-one This is used to categorise emotions. A succession of words or even paragraphs of words is anticipated as input. The result can be a continuous-valued regression output that represents the likelihood of having a favourable attitude.

Figure 1.28: Max Pooling.

Figure 1.29: Deep Neural Networks with Two Hidden Layers (the same architecture and layer equations as Figure 1.11).
Many-to-many This paradigm is suitable for machine translation, such as that seen on Google Translate. The input could be a variable-length English sentence, and the output could be a variable-length sentence in a different language. On a frame-by-frame basis, many-to-many models can also be utilised for video classification.
To model sequences, we need to consider the following design criteria:
1. Handle variable-length sequences.
2. Track long-term dependencies.
3. Maintain information about order.
4. Share parameters across the sequence.

Figure 1.30: Sequence Modeling Applications.
Fig. 1.31 shows the basics of Recurrent Neural Networks (RNNs), where h is the internal state or memory state that passes through time steps. In Eq. 1.18, $\hat{y}_t$ is the output of the RNN, $x_t$ is the input, and $h_{t-1}$ is the past memory. RNNs have a cell state, $h_t$, that is updated at each time step as a sequence is processed. We apply a recurrence relation at every time step to process a sequence, as shown in Eq. 1.19, where $h_t$ is the cell state, $g_W$ is a function with weights $W$, $x_t$ is the input, and $h_{t-1}$ is the old state. In RNNs, the same function and set of parameters are used at every time step.

$$\hat{y}_t = g(x_t, h_{t-1}) \quad (1.18)$$

$$h_t = g_W(x_t, h_{t-1}) \quad (1.19)$$

Figure 1.31: The basics of Recurrent Neural Networks (RNNs).

RNNs take an input vector, $x_t$, update the hidden state, $h_t$, and return an output vector, $\hat{y}_t$. In RNNs, we reuse the same weight matrices at every time step.

$$h_t = \tanh(W_{hh}^T h_{t-1} + W_{xh}^T x_t) \quad (1.20)$$

$$\hat{y}_t = W_{hy}^T h_t \quad (1.21)$$

Figure 1.32: RNNs: Computational Graph Across Time.
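A NumPy sketch of the recurrence in Eqs. 1.20-1.21: the same weight matrices are reused at every time step (sizes and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 4, 2
W_xh = rng.normal(size=(d_in, d_h))      # input-to-hidden weights
W_hh = rng.normal(size=(d_h, d_h))       # hidden-to-hidden weights
W_hy = rng.normal(size=(d_h, d_out))     # hidden-to-output weights

h = np.zeros(d_h)                        # initial cell state
for x_t in rng.normal(size=(5, d_in)):   # a sequence of 5 input vectors
    h = np.tanh(h @ W_hh + x_t @ W_xh)   # Eq. 1.20: update hidden state
    y_t = h @ W_hy                       # Eq. 1.21: output at time t
    print(y_t)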

1.12.1 Backpropagation Through Time (BPTT) for RNNs


A RNN essentially processes sequences one step at a time, so during backpropaga-
tion the gradients flow backward across time steps that’s why this process is called
backpropagation through time. In order to do backpropagation through time to train
a RNN, we need to compute the loss function first that’s shown in Eq. 1.22.
T
X
L(ŷ, y) = Lt (ŷt , yt ) (1.22)
t=1

Figure 1.33: Backpropagation Through Time for RNNs.

1.12.2 Vanishing/Exploding Gradients in RNNs


As the backpropagation algorithm advances downwards(or backward) from the out-
put layer towards the input layer, the gradients often get smaller and smaller and ap-
36 DEEP LEARNING

proach zero which eventually leaves the weights of the initial or lower layers nearly
unchanged. As a result, the gradient descent never converges to the optimum. This is
known as the vanishing gradients problem. On the contrary, in some cases, the gradi-
ents keep on getting larger and larger as the backpropagation algorithm progresses.
This, in turn, causes very large weight updates and causes the gradient descent to
diverge. This is known as the exploding gradients problem. In a network of n hidden
layers, n derivatives will be multiplied together. If the derivatives are large then the
gradient will increase exponentially as we propagate down the model until they even-
tually explode, and this is what we call the problem of exploding gradient. We can
apply the following methods to address the vanishing/exploding gradients in deep
neural networks.

Proper Weight Initialisation is a procedure to set the weights of a neural network to small random values that define the starting point for the optimisation (learning or training) of the neural network model.

Using Non-saturating Activation Functions e.g. ReLU (Rectified Linear Unit), LReLU (Leaky ReLU) or other non-saturating functions can be used instead of saturating activation functions like sigmoid and tanh.

Batch Normalisation is a technique used to improve the performance of a deep learning network by first removing the batch mean and then dividing by the batch standard deviation.

Gradient Clipping is a method where the error derivative is changed or clipped to a threshold during backward propagation through the network, and the clipped gradients are used to update the weights (a minimal sketch follows).
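A minimal NumPy sketch of gradient clipping by norm, the last method in the list above (the threshold value is illustrative):

import numpy as np

def clip_by_norm(grad, threshold=1.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # rescale an exploding gradient
    return grad

g = np.array([30.0, -40.0])                # norm 50: would explode
print(clip_by_norm(g))                     # rescaled to norm 1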

1.12.3 Limitations of Recurrent Models (RNNs)


The major problems of RNNs are listed below:

Encoding bottleneck
Slow; no parallelisation
No long memory, e.g. they cannot remember the beginning of a very long sentence

1.13 Long Short-Term Memory

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. LSTM networks rely on a gated cell to track information throughout many time steps. A gate in a neural network acts as a threshold that helps the network distinguish when to use normal stacked layers versus an identity connection. An identity connection uses the output of lower layers as an addition to the output of consecutive layers. In short, it allows the layers of the network to learn in increments, rather than creating transformations from scratch. The gate in the neural network is used to decide whether the network can use the shortened identity connections, or if it will need to use the stacked layers.

Gated LSTM cells control the information flow, which is the solution to the vanishing gradient problem. The main disadvantage of LSTMs is that they take a very long time to train. LSTM cells are able to track information throughout many time steps. The key concepts of LSTMs are listed below:

1. Maintain a cell state.

2. Use gates to control the flow of information.

Forget gate gets rid of irrelevant information.


Store relevant information from current input.
Selectively update cell state.
Output gate returns a filtered version of the cell state.

3. Backpropagation through time with partially uninterrupted gradient flow.

A LSTM cell contains: (1) A simple RNN cell, (2) Cell state (Long Term Mem-
ory), (3) Forget gate, (4) Input gate, and (5) Output gate.

Figure 1.34: Long Short Term Memory (LSTM) Neural Network.


       
The gate parameters comprise the weights, biases, recurrent weights and gate activations:

$$w = \begin{pmatrix} w_f \\ w_i \\ w_c \\ w_o \end{pmatrix},\quad b = \begin{pmatrix} b_f \\ b_i \\ b_c \\ b_o \end{pmatrix},\quad h = \begin{pmatrix} w_{fh} \\ w_{ih} \\ w_{ch} \\ w_{oh} \end{pmatrix},\quad g = \begin{pmatrix} f_t \\ i_t \\ \tilde{c}_t \\ o_t \end{pmatrix}$$

$$f_t = \sigma[(w_{fh} * h_{t-1}) + (w_{fx} * x_t) + b_f] \quad (1.23)$$
38 DEEP LEARNING

Figure 1.35: LSTM with forget gate, input gate, and output gate.

$$i_t = \sigma[(w_{ih} * h_{t-1}) + (w_{ix} * x_t) + b_i] \quad (1.24)$$

$$\tilde{c}_t = \tanh[(w_{ch} * h_{t-1}) + (w_{cx} * x_t) + b_c] \quad (1.25)$$

$$o_t = \sigma[(w_{oh} * h_{t-1}) + (w_{ox} * x_t) + b_o] \quad (1.26)$$

$$C_t = (C_{t-1} * f_t) + (i_t * \tilde{c}_t) \quad (1.27)$$

$$h_t = \tanh(C_t) * o_t \quad (1.28)$$
An illustration of the forget gate: $C_t^f = C_{t-1} * f_t$, e.g. [1, 4, 2] * [1, 0, 1] = [1, 0, 2].
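A NumPy sketch of one LSTM step implementing Eqs. 1.23-1.28; the weight shapes are illustrative, and * is element-wise, as in the equations:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    f_t = sigmoid(p["wfh"] @ h_prev + p["wfx"] @ x_t + p["bf"])      # Eq. 1.23
    i_t = sigmoid(p["wih"] @ h_prev + p["wix"] @ x_t + p["bi"])      # Eq. 1.24
    c_tilde = np.tanh(p["wch"] @ h_prev + p["wcx"] @ x_t + p["bc"])  # Eq. 1.25
    o_t = sigmoid(p["woh"] @ h_prev + p["wox"] @ x_t + p["bo"])      # Eq. 1.26
    C_t = C_prev * f_t + i_t * c_tilde                               # Eq. 1.27
    h_t = np.tanh(C_t) * o_t                                         # Eq. 1.28
    return h_t, C_t

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
p = {k: rng.normal(size=(d_h, d_h)) for k in ("wfh", "wih", "wch", "woh")}
p.update({k: rng.normal(size=(d_h, d_in)) for k in ("wfx", "wix", "wcx", "wox")})
p.update({k: np.zeros(d_h) for k in ("bf", "bi", "bc", "bo")})

h, C = np.zeros(d_h), np.zeros(d_h)
h, C = lstm_step(rng.normal(size=d_in), h, C, p)
print(h)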

1.13.1 RNN with LSTM: An Illustration

Figure 1.36: Simple Word-Embedding Based Models (SWEMs).



Figure 1.37: Word to vector representation.

Figure 1.38: Neural Model for text.

Figure 1.39: RNN for text.

1.14 Deep Generative Modelling

Deep Generative Models (DGMs) are deep neural networks that take input training instances from a high-dimensional probability distribution of Big Data and train a model to represent the data. It is a powerful form of unsupervised learning to train models on any kind of data distribution. Generative models learn the true data distribution of the training data so that they can generate new data instances. Generative models have two functions: (1) Density Estimation, and (2) Sample Generation. Density estimation finds the probability density function (PDF) from Big Data. Sample generation, on the other hand, generates new input instances resembling the existing training data.

1.14.1 Generative Adversarial Networks (GANs)


Generative Adversarial Networks (GANs) are a deep-learning-based generative model. They are machine learning frameworks used to generate high-dimensional input training instances for images, music, and text. GANs can produce accurate representations of human faces. They were introduced by Ian Goodfellow and his colleagues in June 2014. The concept of adversarial training involves training a classifier to generate adversarial instances labelled as threatening. There are two separate neural networks in GANs, and worst-case input instances for one neural network are produced by the other neural network. GANs work with large and high-dimensional training instances. GANs can learn $P_{model}(x)$ similar to $P_{data}(x)$. Figure 1.43 shows the adversarial nets framework, which has two models: (1) Generator Network, and (2) Discriminator Network.

Generator Network tries to create synthetic input instances to trick the discriminator.

Discriminator Network tries to identify real data from the fakes created by the generator.

Figure 1.43: Adversarial Nets Framework.



GANs use minimax game theory, a decision rule used to minimise the worst-case potential loss. In a minimax game, a player considers all of the best opponent responses to his strategies, and selects the strategy such that the opponent's best strategy gives a payoff as large as possible.
$$J^{(D)} = -\frac{1}{2}\mathbb{E}_{x\sim p_{data}} \log D(x) - \frac{1}{2}\mathbb{E}_z \log\big(1 - D(G(z))\big) \quad (1.29)$$

$$J^{(G)} = -J^{(D)} \quad (1.30)$$

$$J^{(G)} = -\frac{1}{2}\mathbb{E}_z \log D(G(z)) \quad (1.31)$$
Equilibrium is a saddle point/minimax point of the discriminator loss. In mathematics, a saddle point or minimax point is a point on the surface of the graph of a function where the slopes in orthogonal directions are all zero, but which is not a local extremum of the function.

The loss resembles the Jensen-Shannon divergence.

The generator minimises the log-probability of the discriminator being correct.

Estimating the D(x) ratio using supervised learning (e.g. a neural network) is the key approximation technique used by GANs. The optimal D(x) for any $p_{data}(x)$ and $p_{model}(x)$ is shown in Eq. 1.32.

$$D(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} \quad (1.32)$$
The loss functions for training GANs are shown in Eqs. 1.33 to 1.35, where $\log D(G(z))$ involves fake instances and $\log(1 - D(x))$ involves real instances. Argmax is an operation that finds the argument that gives the maximum value of a target function; in ML, it finds the class with the largest predicted probability.

$$\arg\max_D \; \mathbb{E}_{z,x}\big[\log D(G(z)) + \log(1 - D(x))\big] \quad (1.33)$$

$$\arg\min_G \; \mathbb{E}_{z,x}\big[\log D(G(z)) + \log(1 - D(x))\big] \quad (1.34)$$

$$\arg\min_G \max_D \; \mathbb{E}_{z,x}\big[\log D(G(z)) + \log(1 - D(x))\big] \quad (1.35)$$

GANs are commonly used for enhancing the resolution of an image, colourising a black & white image, and converting day images to night images.
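A NumPy sketch of the optimal discriminator of Eq. 1.32 and the discriminator loss of Eq. 1.29, evaluated on illustrative toy numbers:

import numpy as np

def optimal_D(p_data, p_model):
    return p_data / (p_data + p_model)           # Eq. 1.32

p_data = np.array([0.4, 0.3, 0.2])               # toy density values
p_model = np.array([0.1, 0.3, 0.5])
print(optimal_D(p_data, p_model))                # 0.5 where densities match

D_real = np.array([0.9, 0.8])                    # D(x) on real samples
D_fake = np.array([0.2, 0.3])                    # D(G(z)) on generated samples
J_D = -0.5 * np.mean(np.log(D_real)) - 0.5 * np.mean(np.log(1 - D_fake))
print(J_D)                                       # Eq. 1.29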

1.14.2 Attention
It is the ability of a model to pay attention to the important parts of the text or image.
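A NumPy sketch of scaled dot-product attention (see Fig. 1.47): each query is compared to all keys, and the softmax weights mix the values; the sizes are illustrative:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled similarity of queries and keys
    return softmax(scores) @ V        # attention-weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)       # (4, 8)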
DEEP GENERATIVE MODELLING 43

1.14.3 Transformers
Transformers are a powerful deep learning model used in semi-supervised learning, e.g. GPT-4 (Generative Pre-trained Transformer 4) and BERT (Bidirectional Encoder Representations from Transformers). ChatGPT is an artificial intelligence chatbot developed by OpenAI and released in November 2022. BERT is a family of masked-language models introduced in 2018 by researchers at Google. Transformers have several advantages: (1) they use attention, (2) no recurrence, (3) faster to train, and (4) can be parallelised.

Figure 1.44: The Transformers.

Figure 1.45: Overview of a full Transformers architecture.



Figure 1.46: The Transformers Model.



Figure 1.47: (left) Scaled Dot-Product Attention, and (right) Multi-Head Attention
consists of several attention layers running in parallel.

Figure 1.48: Transformers work in deep learning and NLP: an intuitive introduction.
