Unit 1.2 Perceptron 2024

NEURAL NETWORKS & DEEP LEARNING

(21MCA24DB3)

Prepared & Presented By:


Dr. Balkishan
Assistant Professor
Department of Computer Science & Applications
Maharshi Dayanand University
Rohtak
Historical Development
• 1943: McCulloch and Pitts proposed a model of a neuron
• 1957: Frank Rosenblatt invented the Perceptron
• 1960s: Widrow and Hoff explored Perceptron-like networks (which they called "Adalines") and the delta rule
• 1962: Rosenblatt proved convergence of the Perceptron training rule
• 1969: Minsky and Papert showed that the Perceptron cannot deal with nonlinearly-separable data sets, even those that represent simple functions (e.g., XOR)
• 1970-1985: Very little research on Neural Nets
• 1986: Invention of Back Propagation [Rumelhart & McClelland; Parker; Werbos], which can learn nonlinearly-separable data sets
• Since 1985: A lot of research in Neural Nets!
Empirical Risk Minimization
The empirical risk minimization principle states
that the learning algorithm should choose a
function/model/hypothesis which minimizes
the empirical risk
Learning
• The goals of learning are understanding and prediction.
• Learning falls into many categories, including supervised
learning, unsupervised learning, and reinforcement
learning.
• From the perspective of statistical learning theory,
supervised learning is best understood.
• Supervised learning involves learning from a training set
of data.
• Every point in the training set is an input-output pair, where the input maps to an output.
• The learning problem consists of inferring the function
that maps between the input and the output, such that the
learned function can be used to predict the output from
future input.
• Depending on the type of output, supervised
learning problems are either problems of
regression or problems of classification.
• If the output takes a continuous range of
values, it is a regression problem.
• If the output takes elements from a discrete set of labels, then it is a classification problem.
• Classification is very common for machine
learning applications.
Loss Function
• A loss function, in the context of machine learning and optimization
algorithms, is a measure of how well a model's predictions match
the actual targets or ground truth values in a dataset.
• The purpose of a loss function is to quantify the difference
between predicted values and actual values, providing a
numerical value that represents the "loss" or "error" of the model's
predictions.
• Given a set of inputs and outputs, loss function measures the
difference between the predicted output and the true output.
• We want to know what the loss is over all the possibilities.
• This is where “true risk” comes into picture.
• True risk computes the average loss over all the possibilities.
What exactly is Empirical Risk Minimization
• If we compute the loss using the data points in
our dataset, it’s called empirical risk.
• It’s “empirical (experimental)” and not “true”
because we are using a dataset that’s a subset of
the whole population.
• When we build a learning model, we need to pick the function that minimizes the empirical risk, i.e. the difference between the predicted output and the actual output for the data points in our dataset.
• This process of finding the function that
minimizes the empirical risk is called empirical
risk minimization.
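A minimal sketch of this idea in Python (assuming a squared-error loss and a toy dataset; the function names and values are illustrative, not from the slides):

import numpy as np

# Hypothetical squared-error loss: difference between predicted and true output
def squared_error(y_pred, y_true):
    return (y_pred - y_true) ** 2

# Empirical risk: average loss over the data points we actually have
# (a sample of the whole population), not the "true risk" over all possibilities
def empirical_risk(model, X, y):
    return np.mean([squared_error(model(x), t) for x, t in zip(X, y)])

# ERM: among candidate models, pick the one with the lowest empirical risk
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
candidates = [lambda x: 1.5 * x, lambda x: 2.0 * x]
best = min(candidates, key=lambda f: empirical_risk(f, X, y))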
Example of Empirical Risk Minimization
• Example: We want to build a model that can
differentiate between a male and a female based
on specific features.
• If we select 150 random people where women are
really short, and men are really tall, then the model
might incorrectly assume that height is the
differentiating feature.
• For building a truly accurate model, we have to
gather all the women and men in the world to extract
differentiating features.
• Unfortunately, that is not possible! So we select a
small number of people and hope that this sample is
representative of the whole population.
• If we compute the loss using the data points in
our dataset, it’s called empirical risk.
• It is “empirical” and not “true” because we are
using a dataset that’s a subset of the whole
population.
• When our learning model is built, we have to pick
a function that minimizes the empirical risk.
• This process of finding this function is called
empirical risk minimization (ERM).
• We want to minimize the true risk.
Training and Testing of Model
Model (Function) Fitting
• How well a model performs on the training and evaluation datasets defines its characteristics:

                      Underfit     Overfit      Good Fit
Training Dataset      Poor         Very Good    Good
Evaluation Dataset    Very Poor    Poor         Good
Model Fitting – Visualization
[Figure: variations of model fitting – underfitting, good fit, and overfitting curves]
Bias and Variance
Bias: The difference between the predicted values and the actual/target values is called bias.
High Bias: The predicted values are far away from the target values.
Low Bias: The predicted values are very near to the target values.
Variance: Variance means how scattered the predicted values are from each other.
Low Variance: The predicted values are close to each other.
High Variance: The predicted values are widely scattered from each other.
• If the machine learning model is not accurate, it can make prediction errors, and these prediction errors are usually known as bias and variance.
• In machine learning, these errors will always be present, as there is always a slight difference between the model predictions and the actual values.
• The main aim of ML/data science analysts is to
reduce these errors in order to get more
accurate results.
Overfitting and Underfitting
• A model is said to be a good machine learning
model if it generalizes any new input data from
the problem domain in a proper way.
• This helps us make predictions on future data that the model has never seen.
• Now, suppose we want to check how well our
machine learning model learns and generalizes
to the new data.
• For that, we have overfitting and underfitting,
which are majorly responsible for the poor
performances of the machine learning
algorithms.
Underfitting
• A statistical model or a machine learning algorithm is said
to have underfitting when it cannot capture the
underlying trend of the data.
• Underfitting destroys the accuracy of our machine
learning model.
• Its occurrence simply means that our model or the
algorithm does not fit the data well enough.
• It usually happens when we have too little data to build an accurate model.
• In such cases, the rules of the machine learning model are too easy and flexible to be applied to such minimal data, and therefore the model will probably make a lot of wrong predictions.
• Underfitting can be avoided by using more data and also
increasing the features.
• Underfitting – high bias (the predicted values are far away from the target values) and low variance (the predicted values are close to each other)

Techniques to reduce underfitting:


• Increase model complexity
• Increase the number of features, performing feature
engineering
• Remove noise from the data.
• Increase the number of epochs or increase the duration
of training to get better results.
Overfitting
• A statistical model is said to be overfitted when we train it
with a lot of features and data
(just like fitting ourselves in oversized pants)
• When a model gets trained with so much data, it starts
learning from the noise and inaccurate data entries in our
data set.
• Then the model does not categorize the data correctly, because
of too many details and noise.
• A solution to avoid overfitting:
-Remove features
-Early stopping
-Regularization
-Ensembling technique etc
• Overfitting – High variance and low bias
• Techniques to reduce overfitting:
• Increase training data.
• Reduce model complexity.
• Early stopping during the training phase (keep an eye on the loss during training; as soon as the loss begins to increase, stop training).
Regularization

• Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better.
• This in turn improves the model’s
performance on the unseen data.
• Reduce the complexity of the model
Regularization
• Regularization is a technique used to reduce the errors by fitting the
function appropriately on the given training set and avoid overfitting.

The commonly used regularization techniques are :


-L2 regularization
-L1 regularization
-Dropout regularization
- Early Stopping Regularization

• A regression model that uses the L2 regularization technique is called Ridge regression.
• A regression model which uses L1 Regularization technique is called
LASSO(Least Absolute Shrinkage and Selection Operator) regression.

Lasso Regression adds “absolute value of magnitude” of coefficient as
penalty term to the loss function(L).
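A minimal sketch in Python of how an L2 (Ridge) or L1 (LASSO) penalty is added to a loss; the function name and the regularization strength lam are illustrative assumptions, not from the slides:

import numpy as np

def regularized_loss(base_loss, weights, lam=0.01, kind="l2"):
    # base_loss: data-fit term (e.g., mean squared error on the training set)
    # lam: regularization strength, an illustrative hyper-parameter value
    if kind == "l2":                      # Ridge: squared magnitude of coefficients
        penalty = lam * np.sum(weights ** 2)
    else:                                 # LASSO: absolute value of magnitude of coefficients
        penalty = lam * np.sum(np.abs(weights))
    return base_loss + penalty

w = np.array([0.5, -1.2, 3.0])
print(regularized_loss(0.8, w, kind="l2"))
print(regularized_loss(0.8, w, kind="l1"))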
Model Exploration
• Model exploration is the process of
experimenting with different architectures,
hyperparameters, and training techniques
to develop and refine neural network
models.
• It involves exploring various configurations
and setups to understand how they impact the
performance and behavior of the model on a
given task or dataset.
Model Exploration and Hyper Parameter Tuning
Model Parameters (Learned during training)
• Model Parameters are the entities learned via training from the
training data.
• They are not set manually by the designer.
With respect to deep neural networks, the model parameters are:
-Weights
- Biases
Model Hyper-parameters (Control the parameters)
• These are parameters that govern(control) the determination of the
model parameters during training
-They are typically set manually via heuristics
-They are tuned during a cross-validation phase
Examples:
Learning rate, number of layers, number of units in each layer, activation
functions, many others
Historical Context and Motivation for Deep
Learning
• Deep learning, a subset of machine learning, has its roots in
the broader field of artificial intelligence (AI) and neural
networks.
• Deep learning is inspired by the brain
• Historical context and motivations behind the development
of deep learning:
1. Early Neural Network Research (1940s-1960s):
• The history of deep learning goes back to 1943 when
McCulloch and Pitts created a computer model based on
the neural networks of the human brain.
• In 1947, Donald Hebb had the idea that neurons in the brain
learn by modifying the strength of the connections between
neurons.
• In 1957, Frank Rosenblatt proposed the Perceptron, which
is a learning algorithm that modifies the weights of very
simple neural nets.
Historical context and motivation for deep learning
2. Connectionism and Parallel Distributed Processing (PDP)
(1980s): During the 1980s, there was a resurgence of interest in
neural networks, fueled by the connectionist approach
advocated by researchers like David Rumelhart, Geoffrey Hinton,
and James McClelland.
• Their work on Parallel Distributed Processing (PDP) systems
demonstrated the potential of neural networks for modeling
complex cognitive processes.
3. Backpropagation Algorithm (1986): The development of the
backpropagation algorithm by Geoffrey Hinton, David
Rumelhart, and Ronald Williams in 1986 was a major
breakthrough for neural networks.
• Backpropagation allowed for efficient training of multi-layer
neural networks by propagating errors backward through the
network, adjusting the weights to minimize the error.
Historical context and motivation for deep learning

4.Challenges and AI Winter (Late 1980s-1990s): Despite the progress


made in neural network research, the field faced challenges such as
limited computational power, inadequate datasets, and theoretical
limitations.
• This led to a period known as the "AI winter," characterized by
reduced funding and interest in AI research.
• Deep Learning took off again in 1985 with the emergence of
backpropagation.
5.Rise of Big Data and Computational Power (2000s): The advent of
big data and the availability of powerful computational resources in
the 2000s provided a renewed impetus for neural network research.
• Researchers realized that deep neural networks with many layers
could potentially learn hierarchical representations from large
datasets, leading to better performance on various tasks.
Historical context and motivation for deep learning

• Deep Learning took off again in 1985 with the


emergence of backpropagation.
• In 1995, the field died again and the machine
learning community abandoned the idea of
neural nets.
• In the early 2010s, people started using neural nets in speech recognition with huge performance improvements, and they later became widely deployed commercially.
Historical context and motivation for deep learning
6. ImageNet Competition and Convolutional Neural Networks
(CNNs) (2010s): The ImageNet Large Scale Visual Recognition
Challenge (ILSVRC), initiated in 2010, played a crucial role in
advancing deep learning, particularly in computer vision.
• Convolutional Neural Networks (CNNs), such as AlexNet (2012) by
Alex Krizhevsky et al., demonstrated remarkable performance in
image classification tasks, significantly outperforming traditional
computer vision techniques.
7.Applications and Success Stories: Deep learning has seen widespread
adoption across various domains, including computer vision,
natural language processing (NLP), speech recognition, healthcare,
finance, and autonomous vehicles.
• Success stories such as AlphaGo's victory over human champions in
the game of Go and the development of self-driving cars have
highlighted the potential of deep learning in real-world applications.
Historical context and motivation for deep learning

• In the early 2010s, people started using neural nets in speech recognition with huge performance improvements, and they later became widely deployed commercially.
• In 2013, computer vision started to switch to neural nets.
• In 2016, the same transition occurred in natural language processing.
• Soon, similar revolutions will occur in robotics, control, and many other fields.
Optimizers in Deep Neural Network

• Optimizers are algorithms or methods used to minimize an error function (loss function or cost function) or to maximize the efficiency of production.
• Optimizers are mathematical functions which depend on the model's learnable parameters, i.e. weights and biases.
• Optimizers determine how to change the weights and learning rate of the neural network in order to reduce the losses.
Logistic Classifier
• A logistic classifier (logistic regression) is a type of statistical model.
• A logistic classifier estimates the probability that a given input vector x belongs to a certain class by fitting the data to a logistic function, also known as the sigmoid function.
• The sigmoid function maps any real-valued number to the range [0, 1].
• Mathematically, the logistic function is defined as:
  σ(z) = 1 / (1 + e^(−z))
• where z is a linear combination of the input features:
  z = w0 + w1·x1 + w2·x2 + … + wn·xn
• where w0, w1, …, wn are the weight parameters
• and x1, x2, …, xn are the input features
• The logistic regression model is trained
using optimization techniques to find the
best values for the weights w0,w1,w2,…,wn
that minimize the error
• Once trained, the model can be used to
predict the probability that a new input
belongs to a particular class.
• A threshold value (approximately 0.5) is chosen to generate the class output, as in the sketch below.
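A minimal sketch of such a classifier in Python (the feature, weight, and threshold values are illustrative assumptions):

import numpy as np

def sigmoid(z):
    # maps any real-valued number to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, w0):
    # z = w0 + w1*x1 + ... + wn*xn  (linear combination of the input features)
    return sigmoid(w0 + np.dot(w, x))

def predict_class(x, w, w0, threshold=0.5):
    # the threshold (about 0.5) turns the probability into a class label
    return int(predict_proba(x, w, w0) >= threshold)

x = np.array([1.2, -0.7])     # input features x1, x2 (toy values)
w = np.array([0.8, 0.4])      # trained weights w1, w2 (toy values)
print(predict_proba(x, w, w0=-0.1), predict_class(x, w, w0=-0.1))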
Optimizing a logistic classifier using Gradient
Descent
• Gradient descent is a popular optimization
algorithm used to minimize the cost function of
Deep learning / machine learning models,
including logistic regression classifiers.
• The cost function used in logistic regression is the
cross-entropy loss function.
• Gradient descent is an iterative optimization algorithm that adjusts the model parameters in the direction of steepest descent of the cost function's gradient until convergence is achieved.
Steps of Optimizing a logistic classifier using Gradient
Descent
1. Initialize Parameters: Initializing the parameters (weights) of the
logistic regression model with random values or zeros.
2. Define the Cost Function: Cost function or loss function measures the
difference between the predicted values and the actual values.
• The most commonly used cost function for logistic regression is the binary
cross-entropy loss function.
3. Compute the Gradient: Compute the gradient of the cost function with
respect to the parameters.
• This gradient indicates the direction and magnitude of the steepest ascent of
the cost function.
4. Update Parameters: Update the parameters in the opposite direction of the
gradient to minimize the cost function.
• This step involves multiplying the gradient by a learning rate and
subtracting the result from the current parameter values.
5. Repeat Until Convergence: Repeat steps 3 and 4 until the cost function converges to a minimum value or until a predefined number of iterations is reached (a minimal sketch of these steps follows below).
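A minimal sketch of these five steps in Python for a logistic classifier with binary cross-entropy loss (the learning rate, epoch count, and toy dataset are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_gd(X, y, lr=0.1, epochs=1000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0                  # step 1: initialize parameters
    for _ in range(epochs):                  # step 5: repeat for a fixed number of iterations
        p = sigmoid(X @ w + b)               # predicted probabilities
        grad_w = X.T @ (p - y) / n           # step 3: gradient of binary cross-entropy w.r.t. w
        grad_b = np.mean(p - y)              # ... and w.r.t. the bias b
        w -= lr * grad_w                     # step 4: move opposite to the gradient,
        b -= lr * grad_b                     #         scaled by the learning rate
    return w, b

# toy one-feature dataset
X = np.array([[0.5], [1.5], [3.0], [4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic_gd(X, y)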
Stochastic Gradient Descent (SGD)

• Stochastic Gradient Descent (SGD) is a variant of gradient descent where, instead of computing the gradient using the entire dataset, the gradient is computed using only a single randomly chosen data point (or a small subset of data points) at each iteration.
• This makes SGD faster and more suitable for large datasets.
1. Initialize Parameters: Start by initializing the parameters (weights)
of the logistic regression model with random values or zeros.
2. Define the Cost Function: As in the standard gradient descent,
you'll need a cost function. For logistic regression, the binary cross-
entropy loss function is commonly used.
3. Stochastic Gradient Descent Update Rule: Instead of computing
the gradient using the entire dataset, compute the gradient using
only a single data point (or a small subset of data points). Then,
update the parameters based on this stochastic gradient.
4. Repeat Until Convergence: Iterate through the dataset, updating
the parameters at each step using the stochastic gradient until the
cost function converges to a minimum value or until a predefined
number of iterations is reached.
What is wrong with back-propagation?
(a plausible story, but false; Hinton ICASSP-2013)
• It requires labeled training data.
– Almost all data is unlabeled.
• The learning time does not scale well
– It is very slow in networks with multiple hidden
layers.
• It can get stuck in poor local optima.
– These are often quite good, but for deep nets
they are far from optimal
• Deep learning (partially) overcomes these difficulties by using undirected graphical models.
Back Propagation Network (BPN)
• Back propagation is a multi-layered feed-forward network.
• It has a minimum of three layers:
(i) input layer (ii) hidden layer (iii) output layer
• In this network, information propagation is only in the forward direction and there is no feedback loop.
• It does not have any feedback connection.
• Only errors are back-propagated during training.
• The name back-propagation derives from the fact that computations are passed forward from the input to the output layer, following which the calculated errors are propagated back in the other direction to change the weights and obtain better performance.
• BPN is the most widely used model in terms of practical applications, and about 90% of commercial and industry applications use this model.
• It was designed in 1986 by G. E. Hinton, D. E. Rumelhart and R. J. Williams.
• The network uses an extended gradient-descent-based delta-learning rule, commonly known as the back propagation learning rule.
Three-layer back-propagation neural network
[Figure: three-layer back-propagation network. Input signals x1 … xn enter the n-node input layer, pass through the m-node hidden layer via weights wij, and reach the l-node output layer via weights wjk, producing outputs y1 … yl; error signals propagate backward from the output layer toward the input layer.]
Limitations of the Backpropagation algorithm:
• It is slow; all previous layers are locked until the gradients for the current layer are calculated
• It suffers from vanishing and exploding gradients
problem
• It suffers from overfitting & underfitting problem
• It considers predicted value & actual value only to
calculate error and to calculate gradients, related to the
objective function, partially related to the
Backpropagation algorithm
• It doesn’t consider the spatial, associative and dis-
associative relationship between classes while
calculating errors, related to the objective function,
partially related to the Backpropagation algorithm
• The network may get trapped in a local minimum even though there is a much deeper minimum nearby
Difficulty of training deep neural networks

Vanishing and Exploding Gradient
• The vanishing gradient and exploding gradient problems are difficulties found in training certain artificial neural networks with gradient-based methods like back propagation.
• This problem makes it hard to learn and tune the parameters of the earlier layers in the network.
How does the Gradient Descent Algorithm Work?
• In mathematics, the derivative is a way to show the rate of change at a given point.
• To calculate the derivative of the error with respect to the first weight, back-propagate via the chain rule.
Vanishing Gradient Problem
Exploding Gradient Problem
Optimizers in Deep Neural Network

• Optimizers are algorithms or methods used to minimize an error function (loss function) or to maximize the efficiency of production.
• Optimizers are mathematical functions which depend on the model's learnable parameters, i.e. weights and biases.
• Optimizers determine how to change the weights and learning rate of the neural network in order to reduce the losses.
• Optimizing a logistic classifier using gradient descent: gradient descent is a popular algorithm for performing optimization in deep learning.
Gradient Descent in Machine learning and
Deep Learning
• Gradient Descent is a popular optimization
technique in Machine Learning and Deep
Learning.
• A gradient is the slope of a function and
gradient descent (a movement down to a lower
place) means descending a slope to reach the
lowest point on that surface.
• It measures the degree of change of a variable
in response to the changes of another
variable.
• Mathematically, the gradient is the vector of partial derivatives of the loss function with respect to its parameters.
• The greater the gradient, the steeper (bigger) the slope.
• Gradient descent iteratively reduces the loss function by moving in the direction opposite to the gradient, i.e. opposite to the direction of steepest ascent.
• It depends on the derivatives of the loss function for finding minima.
Gradient Descent Optimization

• Gradient Descent is an iterative optimization algorithm, used to find the minimum value of a function.
• The general idea is to initialize the parameters to
random values, and then take small steps in the
direction of the “slope” at each iteration.
• Gradient descent is highly used in supervised
learning to minimize the error function and find
the optimal values for the parameters.
Gradient Descent Optimization

[Figure: loss plotted against the value of the weight, showing the starting point and the point of convergence, i.e. where the cost function is at its minimum level.]
Gradient Descent
• Gradient descent is a way to minimize an objective function J(θ)
• J(θ): objective function
• θ ∈ R^d: model's parameters
• η: learning rate; this determines the size of the steps we take to reach a (local) minimum
• ∇θ J(θ): gradient of the objective function with respect to the parameters
• Update equation (change in weight):
  θ(new) = θ − η · ∇θ J(θ)
[Figure: J(θ) plotted against θ, with the updates descending toward the local minimum θ*.]
Advantages and Disadvantages of Gradient
Descent
• Advantages of Gradient Descent
-Easy computation
-Easy to understand
-Easy to implement
• Disadvantages of Gradient Descent
-May trap at local minima
-Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is too large this may take a very long time to converge to the minima.
-Requires large memory to calculate the gradient on the whole dataset.
Gradient DescentVariants

• There are three variants of gradient descent:
• Batch gradient descent
• Stochastic gradient descent
• Mini-batch gradient descent
• The difference between these algorithms is the amount of data used to compute the gradient for each update.
• Update equation (the gradient term differs with each method):
  θ = θ − η · ∇θ J(θ)
Mini Batch Gradient Descent
Stochastic Gradient Descent Learning
Algorithm
Stochastic Gradient Descent Learning Algorithm

• Stochastic gradient descent is an optimization algorithm used in machine learning applications to find the model parameters that correspond to the best fit between predicted and actual outputs.
• Stochastic gradient descent is widely used in machine learning applications.
Stochastic Gradient Descent
• The word ‘stochastic‘ means a system or a
process that is linked with a random probability.
• In Stochastic Gradient Descent, a few samples
are selected randomly instead of the whole data
set for each iteration.
• In gradient descent, the whole dataset is used to calculate the gradient for each iteration.
• Using the whole dataset is really useful for getting to the minima in a less noisy and less random manner, but the problem arises when our datasets get big.
• Suppose we have a million samples in the dataset; the gradient descent optimization technique uses all of the one million samples to complete one iteration, and this has to be done for every iteration until the minimum is reached.
• Hence, it becomes computationally very expensive to perform.
• This problem is solved by stochastic gradient descent.
• In SGD, only a single sample, i.e. a batch size of one, is used to perform each iteration.
• The sample is randomly shuffled and selected for performing the iteration.
Stochastic Gradient Descent
• This algorithm processes one training sample in every
iteration.
• The parameters get updated after every iteration
since only one data sample is worked on in every
iteration.
• It is quicker in comparison to batch gradient descent.
• The overhead is high if the number of training
samples in the dataset is large. This is because the
number of iterations would be high and the amount of
time taken would also be high.
• Algorithm: θ(new) = θ(old) − α · ∇J(θ; x(i), y(i)), where {x(i), y(i)} is a training example.
Stochastic gradient descent
• This method performs a parameter update for each training example x(i) and label y(i).
• Update equation:
  θ = θ − η · ∇θ J(θ; x(i), y(i))
• (By contrast, batch gradient descent needs to calculate the gradients for the whole dataset to perform just one update.)
• Note: we shuffle the training data at every epoch; a minimal code sketch follows.
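A minimal sketch of the SGD loop in Python (the gradient function grad_J, the learning rate, and the toy least-squares example are illustrative assumptions):

import numpy as np

def sgd(theta, X, y, grad_J, lr=0.01, epochs=10):
    # grad_J(theta, x_i, y_i): gradient of the loss for ONE training example
    n = len(X)
    for _ in range(epochs):
        for i in np.random.permutation(n):       # shuffle the training data at every epoch
            g = grad_J(theta, X[i], y[i])        # gradient from a single example
            theta = theta - lr * g               # θ = θ − η · ∇θ J(θ; x(i), y(i))
    return theta

# illustrative usage with a per-example least-squares gradient
grad = lambda th, x, t: 2 * (np.dot(th, x) - t) * x
theta = sgd(np.zeros(2), np.array([[1.0, 1.0], [1.0, 2.0]]), np.array([2.0, 3.0]), grad)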
• Advantages of SGD:
-Frequent updates of model parameters; hence it converges in less time.
-Requires less memory, as there is no need to store values of loss functions.
-May find new minima.
• Disadvantages:
-High variance in model parameters.
-May overshoot even after achieving the global minimum.
Stochastic Gradient Descent With Momentum

• Momentum is a very popular optimization technique that is used along with SGD.
• Momentum is a hyper-parameter symbolized by gamma 'γ'.
• It is used for reducing high variance in SGD.
• Instead of using only the gradient of the current step to guide the search, momentum accumulates the gradients of the past steps to determine the direction to go.
• It accelerates the convergence towards the relevant direction and reduces the fluctuation in the irrelevant direction.
• Algorithm:
  V(t) = γ·V(t−1) + α·∇J(θ)
• The weights are updated by θ(new) = θ(old) − V(t).
• The value of the momentum term γ lies between 0 ≤ γ ≤ 1 (a minimal update sketch follows below).
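A minimal sketch of one momentum update step in Python (the learning rate, γ value, and toy gradient are illustrative assumptions):

import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad       # V(t) = γ·V(t−1) + α·∇J(θ)
    theta = theta - v               # θ(new) = θ(old) − V(t)
    return theta, v

theta = np.array([0.5, -0.3])
v = np.zeros_like(theta)
grad = np.array([0.2, -0.1])        # toy gradient of J at the current θ
theta, v = momentum_step(theta, v, grad)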
• Advantages:
• Reduces the oscillations and high variance of
the parameters.
• Converges faster than gradient descent.
• Disadvantages:
• One more hyper-parameter is added which
needs to be selected manually and accurately.
Difference between Gradient Descent and stochastic
Gradient Descent Algorithms
1. Gradient descent uses the whole training dataset; SGD uses a single training sample.
2. GD is slow and computationally expensive; SGD is faster and less computationally expensive than batch GD.
3. GD is not suggested for huge training samples; SGD can be used for large training samples.
4. GD is deterministic in nature; SGD is stochastic in nature.
5. GD gives the optimal solution; SGD gives a good solution, but not the optimal one.
6. GD requires no random shuffling of points; for SGD the data samples should be in a random order, which is why we shuffle the training set for every epoch.
7. GD convergence is slow; SGD reaches convergence much faster.
8. GD cannot escape shallow local minima easily; SGD can escape shallow local minima more easily.
Learning Rate
• Learning rate is probably the most important aspect
of gradient descent and also other optimizers as well.
• Example: Imagine the cost function as a pit.
• We will be starting from the top and our objective is to
get to the bottom of the pit.
• If we choose a large value as learning rate, we would
be making drastic changes to the weights and bias
values, i.e we would be taking huge jumps to reach the
bottom.
• If we choose a small value as the learning rate, we reduce the risk of overshooting the minima, but our algorithm will take a longer time to converge.
• If the cost function is non-convex, our algorithm might easily be trapped in a local minimum and be unable to get out and converge to the global minimum.
First-order optimization algorithm

• First-order methods use the first derivatives of the function to minimize the loss function:
-Momentum
-Adagrad
-Adadelta
-RMSprop
-Adam
- Nesterov accelerated gradient (NAG)
Adaptive Learning Rate
● Previous algorithms use a fixed learning rate throughout the learning process
○ The learning rate either has to be set to be very small at the beginning or has to be periodically decreased during learning
● Adaptive learning rate: the learning rate is automatically decreased during the learning process
● Adaptive learning rate algorithms include:
-AdaGrad
-Adadelta
-RMSprop
-Adam
Adaptive gradient (Adagrad) Optimizer
• GD, SGD, momentum SGD, and mini-batch SGD all use the same learning rate η for every parameter:
  W(new) = W(old) − η · ∂L/∂W(old)
  i.e. W(t) = W(t−1) − η · ∂L/∂W(t−1)
• The idea of the adaptive gradient is to adapt the learning rate per parameter instead.
Idea of Adaptive Gradient
Adaptive Gradient Algorithm (Adagrad)

• The Adaptive Gradient algorithm, or AdaGrad for short, is an extension of the stochastic gradient descent optimization algorithm.
• A limitation of stochastic gradient descent is that
it uses the same step size (learning rate) for each
input variable.
• Previous methods: Same learning rate η for all
parameters θ.
• Adagrad [Duchi et al., 2011] adapts the learning
rate to the parameters (large updates for
infrequent parameters, small updates for frequent
parameters).
Adagrad (Adaptive Gradient Algorithm)

• Adagrad is an optimization algorithm used in machine learning and deep learning for training neural networks.
• It is designed to adaptively adjust the learning rate during training, providing larger updates for infrequent parameters and smaller updates for frequent parameters.
• The idea behind Adagrad is to scale the learning rate of each parameter based on the historical gradients.
• It maintains a separate learning rate for each parameter and, as training progresses, accumulates the squared gradients of each parameter.
• Mathematically, given a parameter w, the update rule for Adagrad at time step t is as follows:
  w(t+1) = w(t) − η′ · ∂L/∂w(t),  where η′ (the new learning rate) = η / √(α(t) + ε)
  or equivalently
  w(t+1) = w(t) − (η / √(α(t) + ε)) · ∂L/∂w(t)
• Where:
• η is the learning rate.
• ∂L/∂w(t) is the gradient of the loss function with respect to the parameter at time step t.
• α(t) is a diagonal matrix where each diagonal element is the sum of the squares of the gradients with respect to wi up to time step t.
• ε is a small constant (usually added for numerical stability, to avoid division by zero).
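A minimal sketch of the Adagrad update in Python (the learning rate, ε, and toy gradient sequence are illustrative assumptions):

import numpy as np

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    accum = accum + grad ** 2                       # α(t): running sum of squared gradients
    w = w - (lr / np.sqrt(accum + eps)) * grad      # per-parameter learning rate η/√(α(t)+ε)
    return w, accum

w = np.array([0.5, -1.0])
accum = np.zeros_like(w)
for grad in [np.array([0.3, 0.0]), np.array([0.1, 0.5])]:   # toy gradient sequence
    w, accum = adagrad_step(w, grad, accum)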
Advantage of Adagrad

• The main advantage of Adagrad is its automatic adjustment of learning rates, which can be particularly useful in settings where the data has sparse gradients or where different features have vastly different frequencies.
Adaptive Gradient Algorithm (Adagrad)
• In the vectorized form, Gt ∈ R^(d×d) is a diagonal matrix where each diagonal element (i, i) is the sum of the squares of the gradients with respect to θi up to time step t.
• ε is a smoothing term that avoids division by zero.
• Adagrad modifies the learning rate η based on the past gradients that have been computed for θi.
• Vectorized update – Adagrad divides the learning rate by the square root of the sum of squares of gradients:
  θ(t+1) = θ(t) − (η / √(Gt + ε)) ⊙ g(t)
Advantages and Disadvantages of Adagrad

• Advantages :
• It is well-suited for dealing with sparse data (missing or
gaps in the data).
• It greatly improved the robustness of SGD.
• It eliminates the need to manually tune the learning rate.
• Disadvantage :
• Main weakness is its accumulation of the squared
gradients in the denominator
Adadelta Optimizer Algorithm
• Adadelta is an extension of Adagrad.
• Adagrad : It accumulate all past squared gradients.
• Adadelta was proposed by Matthew D. Zeiler in 2012.
• The main idea behind Adadelta is to adaptively scale the
learning rates during training based on the historical
gradients.
• Adadelta replaces the accumulation of all past squared
gradients with a decaying average of past squared
gradients.
• This helps mitigate the problem of the learning rate monotonically decreasing, which can eventually lead to very small updates and slow convergence.
Adadelta Optimization Technique

• ADAGRAD works well for sparse settings.
• But it does not work well where:
- loss functions are non-convex
- gradients are dense, because of the rapid decay of the learning rate, since it uses all the past gradients in the update.
• The following algorithms aim to resolve this problem by mitigating the rapid decay of the learning rate, limiting the update to only the past few gradients:
-Adadelta
-RMSprop
-Adam
Adadelta
• Instead of inefficiently storing all past gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients.
• We define the running average of squared gradients E[g²]t at time t as:
  E[g²]t = γ · E[g²]t−1 + (1 − γ) · g²t
• E[g²]t : the running average at time step t.
• γ : a fraction, similar to the momentum term, around 0.9.
• gt : the gradient at time step t, i.e. gt = ∇θ J(θ).
Adadelta
• Compared with Adagrad and SGD, Adadelta replaces the diagonal matrix Gt in the Adagrad update with the decaying average over past squared gradients E[g²]t, giving:
  θ(t+1) = θ(t) − (η / √(E[g²]t + ε)) · g(t)
Adadelta
• Advantages:
-The learning rate does not decay and the
training does not stop.
• Disadvantages:
-Computationally expensive.
Root Mean Square Propagation
(RMSprop)
• RMSprop was introduced by Geoffrey Hinton in 2014
• The algorithm aims to adaptively adjust the learning rates
for different parameters based on the magnitudes of recent
gradients.
• The key idea behind RMSprop is to maintain a moving average of the squared gradients for each parameter, instead of accumulating all past squared gradients as Adagrad does.
• RMSprop uses an exponentially decaying average of past
squared gradients. This allows the algorithm to effectively
handle non-stationary objectives and noisy gradients.
Root Mean Square Propagation
(RMSprop)
• Mathematically, RMSprop updates the parameters θ at each iteration t using the following formula:
  θ(t+1) = θ(t) − (η / √(v(t) + ε)) · g(t)
• Where:
• θ(t) is the parameter vector at time step t.
• g(t) is the gradient vector at time step t.
• v(t) is the exponentially decaying average of squared gradients.
• η is the learning rate.
• ε is a small constant added for numerical stability.
• Mathematically, v(t) is updated at each time step using the following formula:
  v(t) = β · v(t−1) + (1 − β) · g(t)²
• Where:
• v(t) is the exponentially decaying average of squared gradients at time step t.
• β is a decay rate parameter, typically set to a value close to 1 (e.g., 0.9 or 0.99).
• g(t) is the gradient vector at time step t.
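A minimal sketch of one RMSprop step in Python (the learning rate, β, ε, and toy gradient are illustrative assumptions):

import numpy as np

def rmsprop_step(theta, grad, v, lr=0.001, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * grad ** 2            # v(t) = β·v(t−1) + (1−β)·g(t)²
    theta = theta - lr / np.sqrt(v + eps) * grad     # θ(t+1) = θ(t) − η/√(v(t)+ε) · g(t)
    return theta, v

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
grad = np.array([0.4, -0.2])        # toy gradient
theta, v = rmsprop_step(theta, grad, v)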
Adam (Adaptive Moment Estimation)

• Adam was introduced by Diederik P. Kingma and Jimmy Ba in 2014.
• It combines ideas from RMSprop and momentum to provide an adaptive learning rate method.
• The key idea behind Adam is to maintain two exponentially
decaying moving averages of gradients: the first moment
(mean) and the second moment (uncentered variance).
• These moving averages are then used to adaptively adjust the
learning rates for each parameter during the training process.
• The algorithm is known for its efficiency, simplicity, and
effectiveness in a wide range of deep learning tasks.
• The update rule for Adam can be summarized as follows:
1. Compute the gradient g(t) using the current minibatch.
2. Update the biased first moment estimate: m(t) = β1·m(t−1) + (1 − β1)·g(t)
3. Update the biased second raw moment estimate: v(t) = β2·v(t−1) + (1 − β2)·g(t)²
4. Correct the bias in the first and second moments:
   m̂(t) = m(t) / (1 − β1^t),  v̂(t) = v(t) / (1 − β2^t)
5. Update the parameters: θ(t+1) = θ(t) − η · m̂(t) / (√v̂(t) + ε)
• Where: θ(t) is the parameter vector at time step t.
• g(t) is the gradient vector at time step t.
• β1 and β2 are decay rate parameters for the first and second moments, respectively, typically set close to 1 (e.g., 0.9 and 0.999).
• η is the learning rate, and ε is a small constant added for numerical stability.
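A minimal sketch of one Adam step in Python (the hyper-parameter values and toy gradient sequence are illustrative assumptions):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v

theta = np.array([0.5, -0.5])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t, grad in enumerate([np.array([0.1, -0.3]), np.array([0.2, 0.1])], start=1):
    theta, m, v = adam_step(theta, grad, m, v, t)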
Advantages of Adam

• Straightforward to implement.
• Computationally efficient.
• Less memory requirements.
• Well suited for problems that are large in terms of data
and/or parameters.
• Appropriate for non-stationary objectives.
• Hyper-parameters require less tuning.
Nesterov Accelerated Gradient
(NAG)
• Nesterov Accelerated Gradient (NAG) is an optimization
algorithm used for training artificial neural networks.
• It is an extension of the momentum optimization technique
• The key idea behind Nesterov Accelerated Gradient is to
modify the momentum update rule in such a way that it
anticipates the future direction of the parameter updates,
thereby reducing oscillations and overshooting during the
optimization process.
Nesterov Accelerated Gradient

θ(new) = θ(old) − V(t)
• Momentum
• Momentum was invented for reducing high variance in
SGD and softens the convergence. It accelerates the
convergence towards the relevant direction and reduces
the fluctuation to the irrelevant direction.
• One more hyperparameter is used in this method known
as momentum symbolized by ‘γ’.
V(t)=γV(t−1)+α.∇J(θ)
• Now, the weights are updated by θ=θ−V(t).
• The momentum term γ is usually set to 0.9 or a similar
value.
• Algorithm:
  V(t) = γ·V(t−1) + α·∇J(θ)
• The weights are updated by θ(new) = θ(old) − V(t).
• The value of the momentum term γ lies between 0 ≤ γ ≤ 1.
Nesterov Accelerated Gradient
• Momentum may be a good method but if the
momentum is too high the algorithm may miss the
local minima and may continue to rise up.
• To resolve this issue the NAG algorithm was
developed.
• We know we’ll be using γV(t−1) for modifying the
weights so, θ−γV(t−1) approximately tells us the future
location.
• Now, we’ll calculate the cost based on this future
parameter rather than the current one.
• V(t) = γ·V(t−1) + α·∇J(θ − γ·V(t−1)), and then update the parameters using θ(new) = θ(old) − V(t).
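A minimal sketch of one NAG step in Python (the objective J(θ) = ||θ||² and the learning rate and γ values are illustrative assumptions):

import numpy as np

def nag_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    lookahead = theta - gamma * v            # θ − γ·V(t−1): approximate future location
    v = gamma * v + lr * grad_J(lookahead)   # gradient evaluated at the look-ahead point
    theta = theta - v                        # θ(new) = θ(old) − V(t)
    return theta, v

grad_J = lambda th: 2 * th                   # toy objective J(θ) = ||θ||², so ∇J(θ) = 2θ
theta, v = np.array([1.0, -1.0]), np.zeros(2)
theta, v = nag_step(theta, v, grad_J)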
• Advantages:
• Does not miss the local minima.
• Slows down when a minimum is near.
• Disadvantages:
• Still, the hyper-parameter needs to be selected manually.
Saddle point Problem
Saddle Point Problem
• Saddle point is a critical point of a function where the gradient is
zero (a stationary point), but it is neither a local minimum nor
a local maximum.
• At a saddle point, the function curves upwards in some directions and downwards in others. This creates a saddle-shaped surface around the point.
• The saddle point problem arises in optimization algorithms,
because gradient-based methods may get stuck at saddle points.
• At a saddle point, the gradients are zero, which can mislead the
optimization algorithm into thinking it has reached a minimum.
• However, the function continues to either increase or decrease
along different dimensions, making it difficult for the algorithm to
escape the saddle point and continue towards the true minimum or
maximum.
Saddle Point Problem
• At a saddle point the derivative is zero, so the weights are not updated.
• A saddle point is neither a minimum point nor a maximum point.
• The weights get stuck at the saddle point.
• This problem exists in non-convex functions.
• Example in three dimensions: f(x, y) = x² − y²
• ∂f/∂x = 2x; equating to zero, 2x = 0, so x = 0.
• ∂f/∂y = −2y; equating to zero, −2y = 0, so y = 0.
• At (0, 0), ∂f/∂x = 0 and ∂f/∂y = 0, so (0, 0) is the saddle point.
• Along x the point is a local minimum, and along y it is a local maximum.
• We should not get stuck at this saddle point.
How to move away from the saddle point
• At the saddle point, if we take a long jump in the y direction, we can decrease the function value.
• x(new) = x(old) − α·∂f/∂x  (∂f/∂x = 0, so x(new) = x(old))
• y(new) = y(old) − α·∂f/∂y  (∂f/∂y = 0, so y(new) = y(old))
• How to avoid this situation: add some value (a perturbation) to the gradient terms, e.g.
• x(new) = x(old) − α·(∂f/∂x + 20)
• y(new) = y(old) − α·(∂f/∂y + 20)  (by adding some value to the y update we take a long jump, the function value along y decreases, and after applying gradient descent we move towards the global minimum; a small numeric check follows)
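A minimal numeric check of this example in Python (the perturbation size and learning rate are illustrative assumptions):

import numpy as np

def f(x, y):
    return x**2 - y**2                 # saddle-shaped surface: up along x, down along y

def grad_f(x, y):
    return np.array([2 * x, -2 * y])   # df/dx = 2x, df/dy = -2y

print(grad_f(0.0, 0.0))                # [0. 0.]: the gradient vanishes at the saddle point
# a small nudge in the y direction lets plain gradient descent move off the saddle
x, y, lr = 0.0, 1e-3, 0.1
for _ in range(5):
    gx, gy = grad_f(x, y)
    x, y = x - lr * gx, y - lr * gy
print(f(x, y))                         # the function value keeps decreasing along y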
