Unit 2.2


MIT Art Design and Technology University

MIT School of Computing, Pune


21BTCS031 – Deep Learning & Neural Networks

Class - L.Y. CORE (SEM-I)

Unit - II Deep Networks


Dr. Anant Kaulage
Dr. Sunita Parinam
Dr. Mayura Shelke
Dr. Aditya Pai
AY 2024-2025 SEM-I
Beyond Gradient Descent

Unit II
Challenges with Gradient Descent
Local Minima In Error Surface
□ The primary challenge in optimizing deep
learning models is that we are forced to
use minimal local information to infer the
global structure of the error surface.
■ E.g. Let’s assume you’re an ant on the
continental United States. You’re dropped
randomly on the map, and your goal is to
find the lowest point on this surface.
■ How do you do it?
Local Minima In Error Surface

Mini-batch gradient descent may aid in escaping shallow local minima, but often fails when dealing with deep local minima
Local Minima In Error Surface
□ Local minima pose a significant issue
□ How common are local minima in the
error surfaces of deep networks?
□ In which scenarios are they actually
problematic for training?
Model Identifiability
□ One observation about deep neural networks is
that their error surfaces are guaranteed to
have a large—and in some cases, an infinite—
number of local minima.
□ Within a layer of a fully-connected feed-forward neural network, any rearrangement of neurons will still give you the same final output at the end of the network
□ A model is said to be identifiable if a
sufficiently large training set can rule out all
but one setting of the model’s parameters
Model Identifiability

Within a layer with n neurons, there are n! ways to rearrange parameters. For a deep network with l layers, each with n neurons, we have a total of (n!)^l equivalent configurations.
Spurious Local Minima
□ Local minima are only problematic when they
are spurious.
□ A spurious local minimum corresponds to a
configuration of weights in a neural network
that incurs a higher error than the configuration
at the global minimum.
□ If these kinds of local minima are common, we quickly run into significant problems while using gradient-based optimization methods
Flat Regions in the Error Surface
□ A flat region is one where the gradient approaches zero
□ Such a point is not a local minimum, so it is unlikely to get us completely stuck
□ The zero gradient might nevertheless slow down learning if we are unlucky enough to encounter it
□ A point at which the gradient is the zero vector is called a critical point
Gradient of Functions
□ For a function with two variables z = f(x, y), the vector of partial derivatives
∇z = (∂z/∂x, ∂z/∂y)
is called the gradient of the function and is denoted by ∇z
□ The same can be generalized for a function with n variables: a multivariate function f(x1, x2, ..., xn) can also be expressed as f(x), with gradient ∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)
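To make the gradient concrete, here is a minimal NumPy sketch (illustrative, not from the slides) that approximates ∇f by central finite differences; the test function f(x, y) = x² + 3y is an assumed example:

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at point x with central differences."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda p: p[0]**2 + 3 * p[1]                     # assumed example: f(x, y) = x^2 + 3y
print(numerical_gradient(f, np.array([1.0, 2.0])))   # ~ [2.0, 3.0]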
Hessian Matrix of a Function
□ The Hessian of a multivariate function is a matrix of second-order partial derivatives.
□ For a function f(x, y, z):
H(f) = [ ∂²f/∂x²   ∂²f/∂x∂y   ∂²f/∂x∂z ]
       [ ∂²f/∂y∂x  ∂²f/∂y²    ∂²f/∂y∂z ]
       [ ∂²f/∂z∂x  ∂²f/∂z∂y   ∂²f/∂z²  ]
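Similarly, a hedged finite-difference sketch of the Hessian (the test function f(x, y, z) = x² + xy + z³ is an assumed example):

import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the matrix of second-order partials of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

f = lambda p: p[0]**2 + p[0]*p[1] + p[2]**3   # assumed example f(x, y, z)
print(numerical_hessian(f, np.array([1.0, 1.0, 1.0])))
# ~ [[2, 1, 0], [1, 0, 0], [0, 0, 6]]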
Rules for Maxima and Minima for a Univariate Function
□ The derivative of f (x) with respect to x would be
zero at maxima and minima.
□ The second derivative of f (x), which is nothing
but the derivative of the first, needs to be
investigated at the point where the first
derivative is zero
□ If the second derivative is less than zero, then
it’s a point of maxima, while if it is greater than
zero it’s a point of minima.
□ If the second derivative turns out to be zero as well, the test is inconclusive; such a point may be a point of inflection.
Rules for Maxima and Minima for a Univariate Function

For example, with f(x) = x²: the first derivative f′(x) = 2x is zero at x = 0, and the second derivative f″(x) = 2 is greater than zero for all values of x, including x = 0; hence x = 0 is the minimum point for the function f(x).
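The second-derivative test above can be checked symbolically; a small SymPy sketch (illustrative) for the example f(x) = x²:

import sympy as sp

x = sp.symbols('x')
f = x**2                                      # the example function

for c in sp.solve(sp.diff(f, x), x):          # critical points: f'(x) = 0
    second = sp.diff(f, x, 2).subs(x, c)      # evaluate f''(x) there
    if second > 0:
        print(c, "minimum")
    elif second < 0:
        print(c, "maximum")
    else:
        print(c, "test inconclusive")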
Momentum based Optimization
□ Gradient descent is one of the most popular
algorithms to perform optimization
□ Common way to optimize neural networks
□ Gradient descent is a way to minimize an objective function J(θ) parameterized by a model's parameters θ ∈ Rd by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. the parameters
□ The learning rate η determines the size of the steps we take to reach a (local) minimum
Gradient Descent Variants
□ Batch gradient descent, aka vanilla gradient descent, computes the gradient of the cost function w.r.t. θ for the entire training dataset:
θ = θ − η·∇θJ(θ)

□ Stochastic gradient descent performs a parameter update for each training example (x(i), y(i)):
θ = θ − η·∇θJ(θ; x(i), y(i))

□ Mini-batch gradient descent performs an update for every mini-batch of n training examples:
θ = θ − η·∇θJ(θ; x(i:i+n), y(i:i+n))
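A minimal NumPy sketch contrasting the three variants; it assumes a user-supplied grad(theta, X, y) that returns ∇θJ on the given data (all names here are illustrative):

import numpy as np

def batch_gd(grad, theta, X, y, eta=0.1, epochs=100):
    for _ in range(epochs):
        theta = theta - eta * grad(theta, X, y)      # whole dataset per update
    return theta

def sgd(grad, theta, X, y, eta=0.01, epochs=10):
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):      # one example per update
            theta = theta - eta * grad(theta, X[i:i+1], y[i:i+1])
    return theta

def minibatch_gd(grad, theta, X, y, eta=0.05, n=32, epochs=10):
    for _ in range(epochs):
        idx = np.random.permutation(len(X))
        for s in range(0, len(X), n):                # n examples per update
            b = idx[s:s+n]
            theta = theta - eta * grad(theta, X[b], y[b])
    return theta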


Challenges
□ Choosing a proper learning rate can be
difficult.
□ A learning rate that is too small leads to
painfully slow convergence
□ A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
□ Learning rate adaptation
Momentum
□ Intuition
■ If I am repeatedly being asked to move in
the same direction then I should
probably gain some confidence and start
taking bigger steps in that direction
■ Just as a ball gains momentum while
rolling down a slope

[Figure: a ball rolling down a slope gains velocity from its acceleration and is slowed by friction]
Momentum
□ Momentum-based gradient descent accumulates a velocity vector:
vt = γ·vt−1 + η·∇θJ(θ)
θ = θ − vt

https://distill.pub/2017/momentum/
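A minimal sketch of the momentum update (assumes a grad(theta) function that returns ∇θJ(θ); illustrative only):

import numpy as np

def momentum_gd(grad, theta, eta=0.01, gamma=0.9, steps=1000):
    """Classical momentum: v_t = gamma*v_{t-1} + eta*grad, then theta -= v_t."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + eta * grad(theta)   # velocity accumulates past gradients
        theta = theta - v
    return theta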
Observations
□ Even in the regions having gentle slopes,
momentum based gradient descent is able
to take large steps because the momentum
carries it along.
□ Is moving fast always good? Would there be a situation where momentum would cause us to run past our goal?
Nesterov Accelerated Gradient Descent
□ Can we do something to reduce these oscillations?
□ Intuition
■ Look before you leap
■ Recall that the momentum update vt = γ·vt−1 + η·∇θJ(θ) first moves by the history term γ·vt−1
■ Why not calculate the gradient at this partially updated value θ − γ·vt−1 instead?
Nesterov Accelerated Gradient Descent
□ Momentum first computes
the current gradient (small
blue vector)
□ then takes a big jump in the
direction of the updated
accumulated gradient (big
blue vector)
□ NAG first makes a big jump
in the direction of the
previous accumulated
gradient (brown vector)
□ measures the gradient
□ then makes a correction
(red vector), which results
in the complete NAG update
(green vector).
Nesterov Accelerated Gradient Descent
□ Looking ahead helps NAG correct its course more quickly than momentum-based gradient descent
□ Hence the oscillations are smaller and the chances of escaping the minima valley are also smaller
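A hedged sketch of NAG under the same assumed grad(theta) helper; the only change from momentum is where the gradient is evaluated:

import numpy as np

def nag(grad, theta, eta=0.01, gamma=0.9, steps=1000):
    """Nesterov accelerated gradient: look ahead before computing the gradient."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v         # partially updated parameters
        v = gamma * v + eta * grad(lookahead)
        theta = theta - v
    return theta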
Learning rate adaptation
□ Choosing the correct learning rate has long been one of the most troublesome aspects of training deep networks because it has a major impact on a network's performance
□ One of the major breakthroughs in modern deep network optimization was the advent of learning rate adaptation
□ The basic concept behind learning rate adaptation is that the learning rate is appropriately modified over the span of learning to achieve good convergence properties
□ An adaptive learning rate can be observed in AdaGrad, AdaDelta, RMSprop and Adam
AdaGrad—Accumulating Historical Gradients
□ Attempts to adapt the global learning
rate over time using an accumulation of
the historical gradients
□ It adapts the learning rate to the
parameters
□ Adagrad uses a different learning rate
for every parameter θi at every time step
t
AdaGrad—Accumulating Historical Gradients
□ Adagrad modifies the general learning rate η at each time step t for every parameter θi, based on the past gradients that have been computed for θi:

θt+1,i = θt,i − (η / √(Gt,ii + ε)) · gt,i

□ Gt here is a diagonal matrix where each diagonal element (i, i) is the sum of the squares of the gradients w.r.t. θi up to time step t
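A minimal AdaGrad sketch (same assumed grad(theta) helper; illustrative):

import numpy as np

def adagrad(grad, theta, eta=0.01, eps=1e-8, steps=1000):
    """AdaGrad: per-parameter step size shrinks with accumulated squared gradients."""
    G = np.zeros_like(theta)                 # running sum of g^2, one per parameter
    for _ in range(steps):
        g = grad(theta)
        G = G + g * g
        theta = theta - eta * g / np.sqrt(G + eps)
    return theta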
Adagrad - Drawback
□ Despite not having to manually tune the learning rate, there is one huge disadvantage
■ i.e., due to the monotonically decreasing learning rates, at some time step the model will stop learning, as the learning rate has become almost 0.
Adadelta
□ Extension of Adagrad
□ Instead of accumulating all past squared
gradients, Adadelta restricts the window of
accumulated past gradients to some fixed size
w
□ The running average E[g2]t at time step t then
depends only on the previous average and the
current gradient
□ Unlike the accumulator “α” in Adagrad, which keeps growing after every time step,
□ in Adadelta the exponentially weighted average over the past gradients keeps the growth of “Sdw” under control.
□ The typical “β” value is 0.9 or 0.95.
Adadelta & RMSProp
□ E[g²]t = γ·E[g²]t−1 + (1 − γ)·g²t
where gt = ∇θJ(θt)
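A minimal RMSprop sketch built on that running average (same assumed grad(theta) helper):

import numpy as np

def rmsprop(grad, theta, eta=0.001, gamma=0.9, eps=1e-8, steps=1000):
    """RMSprop: divide the step by the root of a decaying average of g^2."""
    Eg2 = np.zeros_like(theta)               # E[g^2]_t
    for _ in range(steps):
        g = grad(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g * g
        theta = theta - eta * g / np.sqrt(Eg2 + eps)
    return theta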
Adaptive Moment Estimation (ADAM)
□ Adaptive Moment Estimation (Adam) is another
method that computes adaptive learning rates
for each parameter
□ Combination of momentum and RMSprop
□ In addition to storing an exponentially decaying
average of past squared gradients vt like
Adadelta and RMSprop
□ Adam also keeps an exponentially decaying
average of past gradients mt, similar to
momentum
Adaptive Moment Estimation (ADAM)
□ mt and vt are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively
□ As mt and vt are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1).
Adaptive Moment Estimation (ADAM)
□ They counteract these biases by computing bias-corrected first and second moment estimates:
m̂t = mt / (1 − β1^t),  v̂t = vt / (1 − β2^t)

https://emiliendupont.github.io/2018/01/24/optimization-visualization/

□ Exponentially weighted average of past gradients:
mt = β1·mt−1 + (1 − β1)·gt
□ Exponentially weighted average of past squared gradients:
vt = β2·vt−1 + (1 − β2)·g²t
□ Using the above equations, the weight and bias update formula now looks like:
θt+1 = θt − (η / (√v̂t + ε)) · m̂t
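Putting the Adam equations together in a minimal sketch (same assumed grad(theta) helper; defaults follow the commonly used β1 = 0.9, β2 = 0.999):

import numpy as np

def adam(grad, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: bias-corrected momentum (m) and RMSprop-style scaling (v)."""
    m = np.zeros_like(theta)                 # first moment estimate
    v = np.zeros_like(theta)                 # second (uncentered) moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta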
Nesterov-accelerated Adaptive Moment Estimation (NADAM)
□ Combines Adam and NAG
□ In order to incorporate NAG into Adam, we need to modify its momentum term mt
□ Momentum update rule:

gt = ∇θt J(θt)
mt = γ·mt−1 + η·gt
θt+1 = θt − mt

□ where J is our objective function, γ is the momentum decay term, and η is our step size
Nesterov-accelerated Adaptive Moment Estimation (NADAM)
□ NAG then allows us to perform a more accurate step in the gradient direction by updating the parameters with the momentum step before computing the gradient:

gt = ∇θt J(θt − γ·mt−1)
mt = γ·mt−1 + η·gt
θt+1 = θt − mt

□ Dozat proposes to modify NAG the following way: rather than applying the momentum step twice – one time for updating the gradient gt and a second time for updating the parameters θt+1 – we now apply the look-ahead momentum vector directly to update the current parameters
Nesterov-accelerated Adaptive Moment Estimation (NADAM)
□ Rather than utilizing the previous momentum
vector mt-1, we now use the current momentum
vector mt to look ahead.
□ In order to add Nesterov momentum to Adam,
we can thus similarly replace the previous
momentum vector with the current momentum
vector.
Nesterov-accelerated Adaptive Moment Estimation (NADAM)
□ Expanding the second equation with the definitions of m̂t and mt in turn gives us the Nadam update rule:

θt+1 = θt − (η / (√v̂t + ε)) · (β1·m̂t + ((1 − β1)·gt) / (1 − β1^t))
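The expanded update corresponds to the following minimal Nadam sketch (same assumed grad(theta) helper); note the look-ahead term that replaces m̂t in plain Adam:

import numpy as np

def nadam(grad, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Nadam: Adam with Nesterov look-ahead momentum in the parameter update."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # look-ahead: combine bias-corrected m with the current (corrected) gradient
        m_bar = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
        theta = theta - eta * m_bar / (np.sqrt(v_hat) + eps)
    return theta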
Summary
