Unit 1.2 Perceptron 2024
(21MCA24DB3)
[Figure: A multilayer network with an input layer (x_i, …, x_n), a hidden layer, and an output layer (y_k, …, y_l); weights w_ij connect input to hidden units and w_jk connect hidden to output units; error signals propagate backward from the output layer.]
Limitations of the Backpropagation algorithm:
• It is slow: all previous layers are locked until the gradients for the current layer are calculated.
• It suffers from the vanishing and exploding gradients problem.
• It suffers from overfitting and underfitting problems.
• It considers only the predicted value and the actual value when calculating the error and the gradients; this is a property of the objective function and only partially of the Backpropagation algorithm itself.
• It does not consider the spatial, associative, and dis-associative relationships between classes while calculating errors; again, a property of the objective function and only partially of the Backpropagation algorithm.
• The network may get trapped in a local minimum even though there is a much deeper minimum nearby.
Difficulty of training deep neural networks
[Figure: Gradient descent on a loss curve. From a starting point, the loss decreases as the value of the weight changes, until the point of convergence, i.e., where the cost function is at its minimum.]
Gradient Descent
• Gradient descent is a way to minimize an objective function J(θ).
• J(θ): the objective function.
• θ ∈ R^d: the model's parameters.
• η: the learning rate. This determines the size of the steps we take to reach a (local) minimum.
• ∇θ J(θ): the gradient of the objective function with respect to the parameters.
Update equation
θ(new) = θ(old) − η ∗ ∇θ J(θ)
[Figure: θ steps downhill along J(θ) toward a local minimum θ*.]
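As a minimal illustration (not from the slides), the Python sketch below applies this update rule to the assumed example J(θ) = θ², whose gradient is 2θ; the starting point and learning rate are arbitrary choices.

```python
# Gradient descent on the assumed example J(theta) = theta**2.
# Its gradient is dJ/dtheta = 2*theta.

def grad_J(theta):
    return 2 * theta

theta = 5.0  # arbitrary starting point
eta = 0.1    # learning rate

for step in range(50):
    theta = theta - eta * grad_J(theta)  # theta(new) = theta - eta * grad J(theta)

print(theta)  # approaches 0, the minimum of J
```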
Stochastic Gradient Descent (SGD)
In batch gradient descent we need to calculate the gradients for the whole dataset to perform just one update. SGD instead performs a parameter update for each training example (x(i), y(i)):
Update equation
θ(new) = θ(old) − η ∗ ∇θ J(θ; x(i), y(i))
Code
Note: we shuffle the training data at every epoch.
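The slide's listing itself is not reproduced above, so the following Python sketch is a plausible reconstruction of the per-example SGD loop with per-epoch shuffling; the grad_fn callback and the (x_i, y_i) data format are assumptions.

```python
import random

# Hypothetical SGD loop: shuffle the training data every epoch and
# update the parameters once per training example.

def sgd(theta, data, grad_fn, eta=0.01, epochs=10):
    for epoch in range(epochs):
        random.shuffle(data)              # shuffle training data at every epoch
        for x_i, y_i in data:
            g = grad_fn(theta, x_i, y_i)  # gradient on a single example
            theta = theta - eta * g       # theta = theta - eta * grad J(theta; x_i, y_i)
    return theta
```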
• Advantages of SGD:
- Frequent updates of the model parameters, hence it converges in less time.
- Requires less memory, as there is no need to store loss values for the whole dataset.
- May discover new (better) minima.
• Disadvantages:
- High variance in the model parameter updates.
- May overshoot even after reaching the global minimum.
Stochastic Gradient Descent With Momentum
In plain SGD, the update at time step t is
W(t) = W(t−1) − η ∗ ∂L/∂W(t−1)
i.e., W(new) = W(old) − η ∗ ∂L/∂W(old),
where the same learning rate η is used for every parameter at every step.
Idea of Adaptive Gradient
Adaptive Gradient Algorithm (Adagrad)
SGD uses one learning rate for all parameters; Adagrad adapts it per parameter. Let g(t) denote the gradient at time step t. Adagrad keeps a diagonal matrix G(t) ∈ R^(d×d):
G(t) = diag(G(t,11), …, G(t,dd)),
where each diagonal element G(t,ii) is the sum of the squares of the gradients with respect to θi up to time step t.
Vectorized update:
θ(t+1) = θ(t) − (η / √(G(t) + ε)) ⊙ g(t)
Adagrad divides the learning rate by the square root of the sum of squares of past gradients.
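A minimal per-parameter sketch of this rule in Python (an illustration, not the slides' code; grad_fn and the hyperparameter values are assumptions):

```python
import numpy as np

# Adagrad sketch: accumulate element-wise squared gradients and divide
# the learning rate by the square root of that running sum.

def adagrad(theta, grad_fn, eta=0.01, eps=1e-8, steps=100):
    G = np.zeros_like(theta)  # running sum of squared gradients (diagonal of G(t))
    for _ in range(steps):
        g = grad_fn(theta)
        G += g ** 2
        theta = theta - eta * g / (np.sqrt(G) + eps)
    return theta
```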
Advantages and Disadvantages of Adagrad
• Advantages:
• It is well suited for dealing with sparse data (missing values or gaps in the data).
• It greatly improves the robustness of SGD.
• It eliminates the need to manually tune the learning rate.
• Disadvantage:
• Its main weakness is the accumulation of the squared gradients in the denominator: the sum keeps growing during training, so the effective learning rate shrinks and eventually becomes infinitesimally small.
Adadelta Optimizer Algorithm
• Adadelta is an extension of Adagrad.
• Adagrad accumulates all past squared gradients.
• Adadelta was proposed by Matthew D. Zeiler in 2012.
• The main idea behind Adadelta is to adaptively scale the learning rates during training based on the historical gradients.
• Adadelta replaces the accumulation of all past squared gradients with a decaying average of past squared gradients.
• This helps mitigate the problem of the learning rate monotonically decreasing, which can eventually lead to very small updates and slow convergence.
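A rough Python sketch of this idea, following the usual Zeiler formulation (the decay rate and epsilon are assumed values):

```python
import numpy as np

# Adadelta sketch: decaying averages of squared gradients and squared
# updates replace Adagrad's ever-growing sum; no global learning rate.

def adadelta(theta, grad_fn, rho=0.95, eps=1e-6, steps=100):
    Eg2 = np.zeros_like(theta)   # decaying average of squared gradients
    Edx2 = np.zeros_like(theta)  # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        theta = theta + dx
    return theta
```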
Adadelta Optimization Technique
• Advantages:
- The learning rate does not decay away, so training does not stop.
• Disadvantages:
- Computationally expensive.
Root Mean Square Propagation
(RMSprop)
• RMSprop was introduced by Geoffrey Hinton in his Coursera lectures on neural networks (2012; unpublished).
• The algorithm adaptively adjusts the learning rates for different parameters based on the magnitudes of recent gradients.
• The key idea behind RMSprop is to maintain a moving average of the squared gradients for each parameter, instead of accumulating all past squared gradients as Adagrad does.
• RMSprop uses an exponentially decaying average of past squared gradients. This allows the algorithm to effectively handle non-stationary objectives and noisy gradients.
• Mathematically, RMSprop updates the parameters θ at each iteration t using the following formula:
θ(t+1) = θ(t) − (η / √(v(t) + ε)) ∗ g(t)
• Where:
• θ(t) is the parameter vector at time step t.
• g(t) is the gradient vector at time step t.
• v(t) is the exponentially decaying average of squared gradients.
• η is the learning rate.
• ε is a small constant added for numerical stability.
• Mathematically, v(t) is updated at each time step using the following formula:
v(t) = β ∗ v(t−1) + (1 − β) ∗ g(t)²
• Where:
• v(t) is the exponentially decaying average of squared gradients at time step t.
• β is a decay rate parameter, typically set to a value close to 1 (e.g., 0.9 or 0.99).
• g(t) is the gradient vector at time step t.
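Putting the two formulas together, a minimal Python sketch (illustrative only; the hyperparameter values are common defaults, not taken from the slides):

```python
import numpy as np

# RMSprop sketch: an exponentially decaying average of squared gradients
# scales the learning rate separately for each parameter.

def rmsprop(theta, grad_fn, eta=0.001, beta=0.9, eps=1e-8, steps=100):
    v = np.zeros_like(theta)                # decaying average of g^2
    for _ in range(steps):
        g = grad_fn(theta)
        v = beta * v + (1 - beta) * g ** 2  # v(t) = beta*v(t-1) + (1-beta)*g(t)^2
        theta = theta - eta * g / np.sqrt(v + eps)
    return theta
```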
Adam (Adaptive Moment Estimation)
• Adam combines the ideas of momentum and RMSprop: it maintains exponentially decaying averages of past gradients (first moment) and past squared gradients (second moment) and uses both to compute adaptive per-parameter learning rates.
• Its practical advantages:
• Straightforward to implement.
• Computationally efficient.
• Low memory requirements.
• Well suited to problems that are large in terms of data and/or parameters.
• Appropriate for non-stationary objectives.
• Hyperparameters require little tuning.
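The slides list Adam's properties without its update rule; the sketch below follows the standard Kingma and Ba formulation with commonly used defaults (an illustration, not the slides' material):

```python
import numpy as np

# Adam sketch: decaying averages of the gradient (first moment) and the
# squared gradient (second moment), with bias correction.

def adam(theta, grad_fn, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(theta)  # first moment estimate
    v = np.zeros_like(theta)  # second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```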
Nesterov Accelerated Gradient
(NAG)
• Nesterov Accelerated Gradient (NAG) is an optimization
algorithm used for training artificial neural networks.
• It is an extension of the momentum optimization technique.
• The key idea behind NAG is to modify the momentum update rule so that it anticipates the future direction of the parameter updates, thereby reducing oscillations and overshooting during optimization.
Nesterov Accelerated Gradient
Update equation: θ(new) = θ(old) − V(t)
• Momentum
• Momentum was invented to reduce the high variance of SGD and to soften the convergence. It accelerates convergence in the relevant direction and reduces fluctuation in irrelevant directions.
• One more hyperparameter is used in this method, known as the momentum term and symbolized by γ:
V(t) = γ ∗ V(t−1) + α ∗ ∇J(θ)
• Now, the weights are updated by θ(new) = θ(old) − V(t).
• The momentum term γ is usually set to 0.9 or a similar value.
• Algorithm (see the sketch after this list):
V(t) = γ ∗ V(t−1) + α ∗ ∇J(θ)
• The weights are updated by θ(new) = θ(old) − V(t).
• The momentum term γ can lie in the range 0 ≤ γ ≤ 1.
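A minimal Python sketch of this momentum update (illustrative; α acts as the learning rate and grad_fn is an assumed callback):

```python
import numpy as np

# Momentum sketch: the velocity V accumulates a decaying sum of past
# gradients, accelerating movement along consistent directions.

def sgd_momentum(theta, grad_fn, alpha=0.01, gamma=0.9, steps=100):
    V = np.zeros_like(theta)       # velocity
    for _ in range(steps):
        g = grad_fn(theta)
        V = gamma * V + alpha * g  # V(t) = gamma*V(t-1) + alpha*grad J(theta)
        theta = theta - V          # theta(new) = theta(old) - V(t)
    return theta
```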
Nesterov Accelerated Gradient
• Momentum may be a good method, but if the momentum term is too high the algorithm may overshoot a minimum and keep moving past it.
• The NAG algorithm was developed to resolve this issue.
• We know we will be using γV(t−1) to modify the weights, so θ − γV(t−1) approximately tells us the future location of the parameters.
• Now, we calculate the gradient of the cost at this future (look-ahead) position rather than at the current one:
V(t) = γ ∗ V(t−1) + α ∗ ∇J(θ − γV(t−1))
• and then update the parameters using θ(new) = θ(old) − V(t).
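A minimal look-ahead sketch in Python (illustrative; the hyperparameter values are assumptions):

```python
import numpy as np

# NAG sketch: evaluate the gradient at the look-ahead point
# theta - gamma*V rather than at theta itself.

def nag(theta, grad_fn, alpha=0.01, gamma=0.9, steps=100):
    V = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * V               # approximate future position
        V = gamma * V + alpha * grad_fn(lookahead)  # V(t) = gamma*V(t-1) + alpha*grad J(theta - gamma*V(t-1))
        theta = theta - V
    return theta
```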
• Advantages:
• Does not overshoot and miss local minima.
• Slows down as it approaches a minimum.
• Disadvantages:
• The hyperparameters still need to be selected manually.
Saddle Point Problem
• A saddle point is a critical point of a function where the gradient is zero (a stationary point), but it is neither a local minimum nor a local maximum.
• At a saddle point, the function curves upwards in some directions and downwards in others. This creates a saddle-shaped surface around the point.
• The saddle point problem arises in optimization algorithms,
because gradient-based methods may get stuck at saddle points.
• At a saddle point, the gradients are zero, which can mislead the
optimization algorithm into thinking it has reached a minimum.
• However, the function continues to either increase or decrease
along different dimensions, making it difficult for the algorithm to
escape the saddle point and continue towards the true minimum or
maximum.
• At a saddle point the derivative is zero, so the weights are not updated.
• A saddle point is neither a minimum point nor a maximum point.
• Weights get stuck at the saddle point.
• This problem exists in non-convex functions.
Worked example in three dimensions:
F(x, y) = x² − y²
∂F/∂x = 2x; equating to zero: 2x = 0, so x = 0.
∂F/∂y = −2y; equating to zero: −2y = 0, so y = 0.
At (0, 0), ∂F/∂x = 0 and ∂F/∂y = 0.
Along x the point is a local minimum, and along y it is a local maximum, so (0, 0) is a saddle point.
We should not get stuck at this saddle point.
How to move away from the saddle point:
At the saddle point, if we take a long jump in the y direction we can decrease the y value. But plain gradient descent cannot move:
x(new) = x(old) − α ∗ ∂F/∂x (∂F/∂x = 0, so x(new) = x(old))
y(new) = y(old) − α ∗ ∂F/∂y (∂F/∂y = 0, so y(new) = y(old))
How to avoid this situation: add some value (a perturbation) to the gradient:
x(new) = x(old) − α ∗ (∂F/∂x + 20)
y(new) = y(old) − α ∗ (∂F/∂y + 20)
By adding some value in the y direction we take a long jump, decrease the y value, and after applying gradient descent we move towards the global minimum.
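A tiny numeric illustration of the slide's perturbation trick on F(x, y) = x² − y² (the added constant 20 and the learning rate are the example's values; this only demonstrates escaping the zero-gradient point):

```python
# Saddle point demo for F(x, y) = x**2 - y**2.
# At (0, 0) both partial derivatives are zero, so plain gradient descent
# cannot move; adding a constant to the gradient frees it.

alpha = 0.1
x, y = 0.0, 0.0                  # start exactly at the saddle point

for _ in range(5):
    dFdx, dFdy = 2 * x, -2 * y   # partial derivatives of F
    x = x - alpha * (dFdx + 20)  # perturbed update (the slide's "+ 20")
    y = y - alpha * (dFdy + 20)
    print(x, y)                  # y keeps moving away from 0, so F = x**2 - y**2 decreases
```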