Unit 2.2


MIT Art Design and Technology University

MIT School of Computing, Pune


21BTCS031 – Deep Learning & Neural Networks

Class - L.Y. CORE (SEM-I)

Unit - II Deep Networks


Dr. Anant Kaulage
Dr. Sunita Parinam
Dr. Mayura Shelke
Dr. Aditya Pai
AY 2024-2025 SEM-I
Beyond Gradient Descent

Unit II
Challenges with Gradient Descent
Local Minima In Error Surface
□ The primary challenge in optimizing deep
learning models is that we are forced to
use minimal local information to infer the
global structure of the error surface.
■ E.g. Let’s assume you’re an ant on the
continental United States. You’re dropped
randomly on the map, and your goal is to
find the lowest point on this surface.
■ How do you do it?
Local Minima In Error Surface

Mini-batch gradient descent may aid in escaping shallow local minima, but often fails when dealing with deep local minima
Local Minima In Error Surface
□ Local minima pose a significant issue
□ How common are local minima in the
error surfaces of deep networks?
□ In which scenarios are they actually
problematic for training?
Model Identifiability
□ One observation about deep neural networks is
that their error surfaces are guaranteed to
have a large—and in some cases, an infinite—
number of local minima.
□ Within a layer of a fully-connected feed-forward neural network, any rearrangement of neurons will still give you the same final output at the end of the network
□ A model is said to be identifiable if a
sufficiently large training set can rule out all
but one setting of the model’s parameters
Model Identifiability

Within a layer with n neurons, there are n! ways to rearrange parameters. For a deep network with l layers, each with n neurons, we have a total of (n!)^l equivalent configurations.
Spurious Local Minima
□ Local minima are only problematic when they
are spurious.
□ A spurious local minimum corresponds to a
configuration of weights in a neural network
that incurs a higher error than the configuration
at the global minimum.
□ If these kinds of local minima are common, we quickly run into significant problems while using gradient-based optimization methods
Flat Regions in the Error Surface
□ A flat region is one where the gradient approaches zero
□ Such a point is not a local minimum, so it is unlikely to get us completely stuck
□ The zero gradient might nevertheless slow down learning if we are unlucky enough to encounter it
□ A point at which the gradient is the zero vector is called a critical point
Gradient of Functions
□ For a function with two variables z = f(x, y), the vector of partial derivatives
∇z = (∂z/∂x, ∂z/∂y)
is called the gradient of the function and is denoted by ∇z
□ The same can be generalized for a function with n variables: a multivariate function f(x1, x2, ..., xn) can also be expressed as f(x), with gradient ∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)
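To make the gradient concrete, here is a minimal NumPy sketch (illustrative, not from the slides) that approximates ∇f by central finite differences; the test function f(x, y) = x² + 3y is an assumed example:

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at point x with central differences."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda p: p[0]**2 + 3 * p[1]                     # assumed example: f(x, y) = x^2 + 3y
print(numerical_gradient(f, np.array([1.0, 2.0])))   # ~ [2.0, 3.0]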
Hessian Matrix of a Function
□ The Hessian of a multivariate function is a matrix of second-order partial derivatives.
□ For a function f(x, y, z):
H(f) = [ ∂²f/∂x²   ∂²f/∂x∂y   ∂²f/∂x∂z ]
       [ ∂²f/∂y∂x  ∂²f/∂y²    ∂²f/∂y∂z ]
       [ ∂²f/∂z∂x  ∂²f/∂z∂y   ∂²f/∂z²  ]
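Similarly, a hedged finite-difference sketch of the Hessian (the test function f(x, y, z) = x² + xy + z³ is an assumed example):

import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the matrix of second-order partials of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

f = lambda p: p[0]**2 + p[0]*p[1] + p[2]**3   # assumed example f(x, y, z)
print(numerical_hessian(f, np.array([1.0, 1.0, 1.0])))
# ~ [[2, 1, 0], [1, 0, 0], [0, 0, 6]]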
Rules for Maxima and Minima for a Univariate Function
□ The derivative of f (x) with respect to x would be
zero at maxima and minima.
□ The second derivative of f (x), which is nothing
but the derivative of the first, needs to be
investigated at the point where the first
derivative is zero
□ If the second derivative is less than zero, then
it’s a point of maxima, while if it is greater than
zero it’s a point of minima.
□ If the second derivative turns out to be zero as well, the test is inconclusive; such a point may be a point of inflection.
Rules for Maxima and Minima for a Univariate Function

For example, with f(x) = x²: the first derivative f′(x) = 2x is zero at x = 0, and the second derivative f″(x) = 2 is greater than zero for all values of x, including x = 0; hence x = 0 is the minimum point for the function f(x).
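The second-derivative test above can be checked symbolically; a small SymPy sketch (illustrative) for the example f(x) = x²:

import sympy as sp

x = sp.symbols('x')
f = x**2                                      # the example function

for c in sp.solve(sp.diff(f, x), x):          # critical points: f'(x) = 0
    second = sp.diff(f, x, 2).subs(x, c)      # evaluate f''(x) there
    if second > 0:
        print(c, "minimum")
    elif second < 0:
        print(c, "maximum")
    else:
        print(c, "test inconclusive")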
Momentum based Optimization
□ Gradient descent is one of the most popular
algorithms to perform optimization
□ Common way to optimize neural networks
□ Gradient descent is a way to minimize an objective function J(θ) parameterized by a model's parameters θ ∈ Rd by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. the parameters
□ The learning rate η determines the size of the steps we take to reach a (local) minimum
Gradient Descent Variants
□ Batch gradient descent, aka vanilla gradient descent, computes the gradient of the cost function w.r.t. θ for the entire training dataset:
θ = θ − η·∇θJ(θ)

□ Stochastic gradient descent performs a parameter update for each training example (x(i), y(i)):
θ = θ − η·∇θJ(θ; x(i), y(i))

□ Mini-batch gradient descent performs an update for every mini-batch of n training examples:
θ = θ − η·∇θJ(θ; x(i:i+n), y(i:i+n))
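A minimal NumPy sketch contrasting the three variants; it assumes a user-supplied grad(theta, X, y) that returns ∇θJ on the given data (all names here are illustrative):

import numpy as np

def batch_gd(grad, theta, X, y, eta=0.1, epochs=100):
    for _ in range(epochs):
        theta = theta - eta * grad(theta, X, y)      # whole dataset per update
    return theta

def sgd(grad, theta, X, y, eta=0.01, epochs=10):
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):      # one example per update
            theta = theta - eta * grad(theta, X[i:i+1], y[i:i+1])
    return theta

def minibatch_gd(grad, theta, X, y, eta=0.05, n=32, epochs=10):
    for _ in range(epochs):
        idx = np.random.permutation(len(X))
        for s in range(0, len(X), n):                # n examples per update
            b = idx[s:s+n]
            theta = theta - eta * grad(theta, X[b], y[b])
    return theta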


Challenges
□ Choosing a proper learning rate can be
difficult.
□ A learning rate that is too small leads to
painfully slow convergence
□ A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
□ Learning rate adaptation
Momentum
□ Intuition
■ If I am repeatedly being asked to move in
the same direction then I should
probably gain some confidence and start
taking bigger steps in that direction
■ Just as a ball gains momentum while
rolling down a slope

[Figure: a ball rolling down a slope gains velocity from its acceleration and is slowed by friction]
Momentum
□ Momentum-based gradient descent accumulates a velocity vector:
vt = γ·vt−1 + η·∇θJ(θ)
θ = θ − vt

https://distill.pub/2017/momentum/
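A minimal sketch of the momentum update (assumes a grad(theta) function that returns ∇θJ(θ); illustrative only):

import numpy as np

def momentum_gd(grad, theta, eta=0.01, gamma=0.9, steps=1000):
    """Classical momentum: v_t = gamma*v_{t-1} + eta*grad, then theta -= v_t."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + eta * grad(theta)   # velocity accumulates past gradients
        theta = theta - v
    return theta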
Observations
□ Even in the regions having gentle slopes,
momentum based gradient descent is able
to take large steps because the momentum
carries it along.
□ Is moving fast always good? Would there be a situation where momentum would cause us to run past our goal?
Nesterov Accelerated Gradient Descent
□ Can we do something to reduce these oscillations?
□ Intuition
■ Look before you leap
■ Recall that the momentum update vt = γ·vt−1 + η·∇θJ(θ) first moves by the history term γ·vt−1
■ Why not calculate the gradient at this partially updated value θ − γ·vt−1 instead?
Nesterov Accelerated Gradient Descent
□ Momentum first computes
the current gradient (small
blue vector)
□ then takes a big jump in the
direction of the updated
accumulated gradient (big
blue vector)
□ NAG first makes a big jump
in the direction of the
previous accumulated
gradient (brown vector)
□ measures the gradient
□ then makes a correction
(red vector), which results
in the complete NAG update
(green vector).
Nesterov Accelerated Gradient Descent
□ Looking ahead helps NAG correct its course more quickly than momentum-based gradient descent
□ Hence the oscillations are smaller and the chances of escaping the minima valley are also smaller
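A hedged sketch of NAG under the same assumed grad(theta) helper; the only change from momentum is where the gradient is evaluated:

import numpy as np

def nag(grad, theta, eta=0.01, gamma=0.9, steps=1000):
    """Nesterov accelerated gradient: look ahead before computing the gradient."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v         # partially updated parameters
        v = gamma * v + eta * grad(lookahead)
        theta = theta - v
    return theta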
Learning rate adaptation
□ Choosing the correct learning rate has long been one of the most troublesome aspects of training deep networks because it has a major impact on a network's performance
□ One of the major breakthroughs in modern deep network optimization was the advent of learning rate adaptation
□ The basic concept behind learning rate adaptation is that the learning rate is appropriately modified over the span of learning to achieve good convergence properties
□ An adaptive learning rate can be observed in AdaGrad, AdaDelta, RMSprop and Adam
AdaGrad—Accumulating Historical Gradients
□ Attempts to adapt the global learning
rate over time using an accumulation of
the historical gradients
□ It adapts the learning rate to the
parameters
□ Adagrad uses a different learning rate
for every parameter θi at every time step
t
AdaGrad—Accumulating Historical Gradients
□ Adagrad modifies the general learning rate η at each time step t for every parameter θi, based on the past gradients that have been computed for θi:

θt+1,i = θt,i − (η / √(Gt,ii + ε)) · gt,i

□ Gt here is a diagonal matrix where each diagonal element (i, i) is the sum of the squares of the gradients w.r.t. θi up to time step t
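A minimal AdaGrad sketch (same assumed grad(theta) helper; illustrative):

import numpy as np

def adagrad(grad, theta, eta=0.01, eps=1e-8, steps=1000):
    """AdaGrad: per-parameter step size shrinks with accumulated squared gradients."""
    G = np.zeros_like(theta)                 # running sum of g^2, one per parameter
    for _ in range(steps):
        g = grad(theta)
        G = G + g * g
        theta = theta - eta * g / np.sqrt(G + eps)
    return theta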
Adagrad - Drawback
□ Despite not having to manually tune the learning rate, there is one huge disadvantage
■ i.e., due to the monotonically decreasing learning rates, at some time step the model will stop learning, as the learning rate has become almost 0.
Adadelta
□ Extension of Adagrad
□ Instead of accumulating all past squared
gradients, Adadelta restricts the window of
accumulated past gradients to some fixed size
w
□ The running average E[g2]t at time step t then
depends only on the previous average and the
current gradient
□ Unlike the accumulator “α” in Adagrad, which keeps growing after every time step,
□ in Adadelta the exponentially weighted average over the past gradients keeps the growth of “Sdw” under control.
□ The typical “β” value is 0.9 or 0.95.
Adadelta & RMSProp
□ E[g²]t = γ·E[g²]t−1 + (1 − γ)·g²t
where gt = ∇θJ(θt)
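A minimal RMSprop sketch built on that running average (same assumed grad(theta) helper):

import numpy as np

def rmsprop(grad, theta, eta=0.001, gamma=0.9, eps=1e-8, steps=1000):
    """RMSprop: divide the step by the root of a decaying average of g^2."""
    Eg2 = np.zeros_like(theta)               # E[g^2]_t
    for _ in range(steps):
        g = grad(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g * g
        theta = theta - eta * g / np.sqrt(Eg2 + eps)
    return theta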
Adaptive Moment Estimation (ADAM)
□ Adaptive Moment Estimation (Adam) is another
method that computes adaptive learning rates
for each parameter
□ Combination of momentum and RMSprop
□ In addition to storing an exponentially decaying
average of past squared gradients vt like
Adadelta and RMSprop
□ Adam also keeps an exponentially decaying
average of past gradients mt, similar to
momentum
Adaptive Moment Estimation (ADAM)
□ mt and vt are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively
□ As mt and vt are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1).
Adaptive Moment Estimation (ADAM)
□ They counteract these biases by computing bias-corrected first and second moment estimates:
m̂t = mt / (1 − β1^t),  v̂t = vt / (1 − β2^t)

https://emiliendupont.github.io/2018/01/24/optimization-visualization/

□ Exponentially weighted average of past gradients:
mt = β1·mt−1 + (1 − β1)·gt
□ Exponentially weighted average of past squared gradients:
vt = β2·vt−1 + (1 − β2)·g²t
□ Using the above equations, the weight and bias update formula now looks like:
θt+1 = θt − (η / (√v̂t + ε)) · m̂t
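Putting the Adam equations together in a minimal sketch (same assumed grad(theta) helper; defaults follow the commonly used β1 = 0.9, β2 = 0.999):

import numpy as np

def adam(grad, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: bias-corrected momentum (m) and RMSprop-style scaling (v)."""
    m = np.zeros_like(theta)                 # first moment estimate
    v = np.zeros_like(theta)                 # second (uncentered) moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta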
Nesterov-accelerated Adaptive Moment Estimation (NADAM)
□ Combines Adam and NAG
□ In order to incorporate NAG into Adam, we need to modify its momentum term mt
□ Momentum update rule:

gt = ∇θt J(θt)
mt = γ·mt−1 + η·gt
θt+1 = θt − mt

□ where J is our objective function, γ is the momentum decay term, and η is our step size
Nesterov-accelerated Adaptive Moment Estimation (NADAM)
□ NAG then allows us to perform a more accurate step in the gradient direction by updating the parameters with the momentum step before computing the gradient:

gt = ∇θt J(θt − γ·mt−1)
mt = γ·mt−1 + η·gt
θt+1 = θt − mt

□ Dozat proposes to modify NAG the following way: rather than applying the momentum step twice – one time for updating the gradient gt and a second time for updating the parameters θt+1 – we now apply the look-ahead momentum vector directly to update the current parameters
Nesterov-accelerated Adaptive Moment Estimation (NADAM)
□ Rather than utilizing the previous momentum
vector mt-1, we now use the current momentum
vector mt to look ahead.
□ In order to add Nesterov momentum to Adam,
we can thus similarly replace the previous
momentum vector with the current momentum
vector.
Nesterov-accelerated Adaptive Moment Estimation (NADAM)
□ Expanding the second equation with the definitions of m̂t and mt in turn gives us the Nadam update rule:

θt+1 = θt − (η / (√v̂t + ε)) · (β1·m̂t + ((1 − β1)·gt) / (1 − β1^t))
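The expanded update corresponds to the following minimal Nadam sketch (same assumed grad(theta) helper); note the look-ahead term that replaces m̂t in plain Adam:

import numpy as np

def nadam(grad, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Nadam: Adam with Nesterov look-ahead momentum in the parameter update."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # look-ahead: combine bias-corrected m with the current (corrected) gradient
        m_bar = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
        theta = theta - eta * m_bar / (np.sqrt(v_hat) + eps)
    return theta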
Summary
