Deep Learning
CS60010
Abir Das
http://cse.iitkgp.ac.in/~adas/
Computer Science and Engineering | Indian Institute of Technology Kharagpur
07, 12, 13 Jan, 2022
Agenda
• Understand the basics of Matrix/Vector Calculus and Optimization concepts to be used in the course.
Resources
• "Deep Learning", I. Goodfellow, Y. Bengio, A. Courville (Chapter 8)
• The goal is to find parameters of a neural network that significantly reduce a cost function or objective function.
• Gradient-based optimization is the most popular way of training deep neural networks.
• There are other ways too, e.g., evolutionary or derivative-free optimization, but they come with issues that are particularly crucial for neural network training.
• It's easy to spend a whole semester on optimization, so these few lectures will only scratch a very small part of the surface.
• If we could minimize the expected risk, we would obtain the true optimal prediction function.
• But we don't know the actual distribution that generates the data, so what we actually minimize is the empirical risk.
• We choose a family of candidate prediction functions and find the function that minimizes the empirical risk (the two objectives are sketched below).
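For reference, the two objectives can be written as follows; the loss $\ell$ and the hypothesis family $\mathcal{H}$ are generic placeholders (assumed notation, not taken from the slides):

```latex
% Expected (true) risk: average loss under the unknown data distribution P(x, y)
\[
  R(f) \;=\; \mathbb{E}_{(\boldsymbol{x},\,y)\sim P}\big[\,\ell\big(f(\boldsymbol{x}),\, y\big)\,\big]
\]
% Empirical risk: average loss over the n observed training examples,
% minimized over the chosen family of candidate functions H
\[
  \hat{R}_n(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(\boldsymbol{x}_i),\, y_i\big),
  \qquad
  \hat{f} \;=\; \arg\min_{f \in \mathcal{H}} \hat{R}_n(f)
\]
```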
Sources of Error

Minimizes      | Staying within                     | Assumptions
expected risk  | no constraints on function family  | true data distribution known; family of functions exhaustive
expected risk  | a family of functions              | true data distribution known; family of functions not exhaustive
empirical risk | a family of functions              | true data distribution not known; family of functions not exhaustive
[Figure: two panels with x and y axes]
Slide courtesy: Malik Magdon-Ismail
[Figure: fits $g_n^{\mathcal{D}}(x)$ of $\sin(x)$ obtained from different datasets, together with the average fit $\bar{g}(x)$]
We can define
• $g_n^{\mathcal{D}}(\boldsymbol{x})$: a random value, depending on the dataset $\mathcal{D}$
• $\bar{g}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})\big] \approx \frac{1}{K}\big(g_n^{\mathcal{D}_1}(\boldsymbol{x}) + g_n^{\mathcal{D}_2}(\boldsymbol{x}) + \dots + g_n^{\mathcal{D}_K}(\boldsymbol{x})\big)$: your average prediction on $\boldsymbol{x}$
• $\operatorname{var}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2\big] = \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2$: how variable your prediction on $\boldsymbol{x}$ is
Slide courtesy: Malik Magdon-Ismail
[Figure: a single fit $g_n^{\mathcal{D}}(x)$ plotted against $x$]
$E_{out}^{\mathcal{D}}(\boldsymbol{x}) = \big(g_n^{\mathcal{D}}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2$: squared error, a random value depending on $\mathcal{D}$
$E_{out}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\big[E_{out}^{\mathcal{D}}(\boldsymbol{x})\big] = \mathbb{E}_{\mathcal{D}}\Big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\Big]$: expected value of the above random variable
Slide motivation: Malik Magdon-Ismail
$E_{out}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\Big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\Big] = \mathbb{E}_{\mathcal{D}}\big[f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, g_n^{\mathcal{D}}(\boldsymbol{x}) + g_n^{\mathcal{D}}(\boldsymbol{x})^2\big]$
$= f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})\big] + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big]$
$= f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \bar{g}(\boldsymbol{x}) + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big]$
$= f(\boldsymbol{x})^2 - \bar{g}(\boldsymbol{x})^2 + \bar{g}(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \bar{g}(\boldsymbol{x}) + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big]$
$= \underbrace{\big(f(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2}_{\text{Bias}} + \underbrace{\mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2}_{\text{Variance}}$
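The decomposition can be checked with a small Monte-Carlo simulation at a single query point; the $\sin$ target, the mean-predictor model, and the sample sizes below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                       # true target f(x)
x0 = 1.0                         # query point x at which E_out(x) is evaluated
n, K, noise = 5, 20000, 0.1      # dataset size, number of datasets D, noise std

# For each dataset D, fit a deliberately simple model: predict the mean of the noisy targets.
preds = np.empty(K)              # preds[k] = g_n^D(x0) for the k-th dataset
for k in range(K):
    xs = rng.uniform(-np.pi, np.pi, n)
    ys = f(xs) + noise * rng.normal(size=n)
    preds[k] = ys.mean()

g_bar = preds.mean()                       # average prediction at x0
e_out = np.mean((preds - f(x0)) ** 2)      # E_D[(g_n^D(x0) - f(x0))^2]
bias = (f(x0) - g_bar) ** 2                # (f(x0) - g_bar)^2
var = np.mean(preds ** 2) - g_bar ** 2     # E_D[g_n^D(x0)^2] - g_bar^2

print(f"E_out = {e_out:.4f}, bias + variance = {bias + var:.4f}")  # the two match
```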
Bias-Variance to Overfitting-Underfitting
• Suppose the true underlying function is $f(\boldsymbol{x})$.
• But the observations (data points) are noisy, i.e., $y = f(\boldsymbol{x}) + \varepsilon$.
Suppose the neural network is lazy and just produces the same constant output whatever training data we give it, i.e., $g_n^{\mathcal{D}}(\boldsymbol{x}) = c$. Then
$E_{out}(\boldsymbol{x}) = \big(f(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2 + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2$
$= \big(f(\boldsymbol{x}) - c\big)^2 + \mathbb{E}_{\mathcal{D}}\big[c^2\big] - c^2$
$= \big(f(\boldsymbol{x}) - c\big)^2 + 0$
In this case the variance term is zero, but the bias will be large, because the network has made no attempt to fit the data. We say we have extreme under-fitting.
Slide motivation: John A. Bullinaria
Now suppose instead that the network fits the noisy training data exactly, i.e., $g_n^{\mathcal{D}}(\boldsymbol{x}) = f(\boldsymbol{x}) + \varepsilon$, so that $\bar{g}(\boldsymbol{x}) = f(\boldsymbol{x})$. Then
$E_{out}(\boldsymbol{x}) = \big(f(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2 + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2$
$= \big(f(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2 + \mathbb{E}_{\mathcal{D}}\Big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2\Big]$
$= 0 + \mathbb{E}_{\mathcal{D}}\Big[\big(f(\boldsymbol{x}) + \varepsilon - f(\boldsymbol{x})\big)^2\Big]$
$= 0 + \mathbb{E}_{\mathcal{D}}\big[\varepsilon^2\big]$
In this case the bias term is zero, but the variance equals the expected squared noise on the data, which could be substantial. In this case we say we have extreme over-fitting.
Slide motivation: John A. Bullinaria
Bias-Variance Trade-off
[Figure: Bias-Variance trade-off: training error and test error curves, with the underfitting and overfitting regimes marked; the training error keeps decreasing while the test error eventually rises again]
Vector/Matrix Calculus
• If $f: \mathbb{R}^n \to \mathbb{R}$ and $\boldsymbol{x} \in \mathbb{R}^n$, then $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \Big[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\Big]^{\top}$ is called the gradient of $f$
• If and , then
Vector/Matrix Calculus
• If and , then
Vector/Matrix Calculus
• Some standard results:
• $\nabla_{\boldsymbol{x}}\big(\boldsymbol{x}^{\top} A \boldsymbol{x}\big) = (A + A^{\top})\,\boldsymbol{x}$; $= 2A\boldsymbol{x}$ if $A$ is symmetric (checked numerically after this list)
• Product rule:
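For instance, the quadratic-form identity above, together with the related result $\nabla_{\boldsymbol{x}}(\boldsymbol{a}^{\top}\boldsymbol{x}) = \boldsymbol{a}$, can be verified with a quick finite-difference check; this sketch and the choice of identities are illustrative assumptions, not read off the slide:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
a, A, x = rng.normal(size=n), rng.normal(size=(n, n)), rng.normal(size=n)

print(np.allclose(num_grad(lambda v: a @ v, x), a))                   # grad(a^T x) = a
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))   # grad(x^T A x) = (A + A^T) x
```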
Vector/Matrix Calculus
• Some standard results:
• If , then
• If , then
Vector/Matrix Calculus
• Derivatives of norms:
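Commonly used examples of such derivatives (listed here as standard illustrations; they are not necessarily the exact ones on the slide):

```latex
\[
  \nabla_{\boldsymbol{x}} \lVert \boldsymbol{x} \rVert_2^2 = 2\boldsymbol{x}, \qquad
  \nabla_{\boldsymbol{x}} \lVert \boldsymbol{x} \rVert_2 = \frac{\boldsymbol{x}}{\lVert \boldsymbol{x} \rVert_2}\ (\boldsymbol{x} \neq \boldsymbol{0}), \qquad
  \nabla_{\boldsymbol{x}} \lVert A\boldsymbol{x} - \boldsymbol{b} \rVert_2^2 = 2A^{\top}(A\boldsymbol{x} - \boldsymbol{b})
\]
```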
Optimization Problem
• Problem Statement: find $\boldsymbol{x}^{*}$ that minimizes $f(\boldsymbol{x})$ over a feasible set $\mathcal{C}$, i.e., $\boldsymbol{x}^{*} = \arg\min_{\boldsymbol{x} \in \mathcal{C}} f(\boldsymbol{x})$
[Figure: a convex set $\mathcal{C}$ containing points $\boldsymbol{x}$ and $\boldsymbol{y}$; every point $\theta \boldsymbol{x} + (1-\theta)\boldsymbol{y}$ with $\theta \in [0,1]$ on the segment joining them also lies in $\mathcal{C}$]
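For reference, the notion of a convex function that the argument on the next few slides works toward, stated in standard form (the notation is assumed to match the slides):

```latex
\[
  f \text{ is convex} \;\iff\;
  f\big(\theta\boldsymbol{x} + (1-\theta)\boldsymbol{y}\big) \;\le\; \theta f(\boldsymbol{x}) + (1-\theta) f(\boldsymbol{y})
  \qquad \forall\, \boldsymbol{x}, \boldsymbol{y} \in \operatorname{dom} f,\ \theta \in [0,1]
\]
```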
$f(\boldsymbol{y}) \;\ge\; f(\boldsymbol{x}) + \big\langle \nabla f(\boldsymbol{x}),\, \boldsymbol{y} - \boldsymbol{x} \big\rangle$ … (1)
Now, let $\boldsymbol{z} = (1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}$ with $\theta \in [0,1]$. By taking $\boldsymbol{z}$ and $\boldsymbol{x}$ in (1): $f(\boldsymbol{x}) \ge f(\boldsymbol{z}) + \big\langle \nabla f(\boldsymbol{z}),\, \boldsymbol{x} - \boldsymbol{z} \big\rangle$ … (2). By taking $\boldsymbol{z}$ and $\boldsymbol{y}$ in (1): $f(\boldsymbol{y}) \ge f(\boldsymbol{z}) + \big\langle \nabla f(\boldsymbol{z}),\, \boldsymbol{y} - \boldsymbol{z} \big\rangle$ … (3)
Multiplying (2) by $(1-\theta)$ and (3) by $\theta$ and adding … (4), and with $\boldsymbol{z} = (1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}$:
$\Rightarrow (1-\theta) f(\boldsymbol{x}) + \theta f(\boldsymbol{y}) \;\ge\; f(\boldsymbol{z}) - \theta f(\boldsymbol{z}) + \theta f(\boldsymbol{z}) + \big\langle \nabla f(\boldsymbol{z}),\, (1-\theta)(\boldsymbol{x}-\boldsymbol{z}) + \theta(\boldsymbol{y}-\boldsymbol{z}) \big\rangle$
$\Rightarrow (1-\theta) f(\boldsymbol{x}) + \theta f(\boldsymbol{y}) \;\ge\; f(\boldsymbol{z}) + \big\langle \nabla f(\boldsymbol{z}),\, \underbrace{(1-\theta)(\boldsymbol{x}-\boldsymbol{z}) + \theta(\boldsymbol{y}-\boldsymbol{z})}_{=\,0\ \text{(Why?)}} \big\rangle = f(\boldsymbol{z}) = f\big((1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}\big)$
Hence $f\big((1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}\big) \le (1-\theta) f(\boldsymbol{x}) + \theta f(\boldsymbol{y})$, i.e., $f$ is convex.
Gradient Descent
Minimize the loss function: $\min_{\boldsymbol{w}, \boldsymbol{b}} J(\theta)$
More formally, $\min_{\theta} J(\theta)$
For scalar $\theta$, the condition is $J'(\theta) = \dfrac{\partial J}{\partial \theta} = 0$
For higher-dimensional $\theta$, the condition boils down to $\nabla_{\theta} J(\theta) = \boldsymbol{0}$
(1) implies $J(\theta + \Delta\theta) \approx J(\theta) + \nabla_{\theta} J(\theta)^{\top} \Delta\theta$ [Using Taylor series expansion] [Neglecting higher-order terms]
Choosing $\Delta\theta = -\eta\, \nabla_{\theta} J(\theta)$ with $\eta > 0$ gives $J(\theta + \Delta\theta) - J(\theta) \approx -\eta\, \lVert \nabla_{\theta} J(\theta) \rVert^2 \le 0$, i.e., the update $\theta \leftarrow \theta - \eta\, \nabla_{\theta} J(\theta)$ decreases $J$ … (3)
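A minimal sketch of this update loop on a stand-in quadratic objective; the objective, the learning rate, and the iteration count are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Stand-in objective: J(theta) = 0.5 * theta^T A theta - b^T theta, with A symmetric positive definite
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_J = lambda th: A @ th - b           # analytic gradient of the quadratic

theta = np.zeros(2)
eta = 0.1                                # learning rate
for _ in range(100):
    theta = theta - eta * grad_J(theta)  # theta <- theta - eta * grad J(theta)

print(theta, np.linalg.solve(A, b))      # gradient-descent iterate vs. exact minimizer A^{-1} b
```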
Adapted from Olivier Bousquet’s invited talk at NeurIPS 2018 after winning Test of Time Award for NIPS 2007 paper:
“The Trade-Offs of Large Scale Learning” by Leon Bottou and Olivier Bousquet. Link of the talk.
• Optimization algorithms that use the entire training set to compute the gradient are called batch or deterministic gradient methods. Those that use a single training example are called stochastic or online gradient methods.
• Most of the algorithms we use for deep learning fall somewhere in between: they compute the gradient on a mini-batch of examples (see the sketch below).
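The three gradient estimators differ only in how many examples they average over; a small illustrative sketch (the squared-error loss, the toy data, and the helper name `grad` are assumptions made for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # toy dataset
theta = np.zeros(5)

def grad(theta, Xb, yb):
    """Gradient of the mean squared error 0.5*mean((Xb @ theta - yb)^2) on a batch."""
    return Xb.T @ (Xb @ theta - yb) / len(yb)

g_batch = grad(theta, X, y)                        # batch / deterministic: all examples
i = rng.integers(len(y))
g_stoch = grad(theta, X[i:i + 1], y[i:i + 1])      # stochastic / online: a single example
idx = rng.choice(len(y), size=32, replace=False)
g_mini = grad(theta, X[idx], y[idx])               # mini-batch: a small random subset

# All three are (noisy) estimates of the same expected gradient.
print(np.linalg.norm(g_batch - g_mini), np.linalg.norm(g_batch - g_stoch))
```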
Mini-batch
[Figure: a momentum update showing (i) the momentum (velocity) carried at the current point, (ii) the negative of the gradient at the point (where a plain gradient step would have landed without momentum), and the actual step, which combines the two]
Images courtesy: Goodfellow et al.
Nesterov Momentum
• The difference between Nesterov momentum and standard momentum is where the gradient is evaluated.
• With Nesterov momentum, the gradient is evaluated after the current velocity is applied, i.e., at the 'look-ahead' point (see the sketch below).
• In practice, Nesterov momentum speeds up convergence only for well-behaved loss functions (convex, with consistent curvature).
Images courtesy: Goodfellow et al.
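A minimal sketch of the two updates side by side; the helper names, the learning rate, and the momentum coefficient are placeholders assumed for this illustration:

```python
import numpy as np

def momentum_step(theta, v, grad_J, eta=0.01, alpha=0.9):
    """Standard momentum: the gradient is evaluated at the current point theta."""
    v = alpha * v - eta * grad_J(theta)
    return theta + v, v

def nesterov_step(theta, v, grad_J, eta=0.01, alpha=0.9):
    """Nesterov momentum: the gradient is evaluated after the velocity is applied (look-ahead)."""
    v = alpha * v - eta * grad_J(theta + alpha * v)
    return theta + v, v

# Tiny usage example on the quadratic bowl J(theta) = 0.5 * ||theta||^2 (gradient = theta)
grad_J = lambda th: th
theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    theta, v = nesterov_step(theta, v, grad_J)
print(theta)   # close to the minimizer at the origin
```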
• Would it not be a good idea if we could move slowly along steep directions and quickly along shallow ones?
AdaGrad
• Downscale the learning rate by the square root of the sum of squares of all historical gradient values.
• Parameters with large partial derivatives of the loss have their learning rates decreased rapidly.
AdaGrad
• Gradients over iterations: $\boldsymbol{g}_1, \boldsymbol{g}_2, \dots, \boldsymbol{g}_t$
• Accumulation and update: $\boldsymbol{r}_t = \boldsymbol{r}_{t-1} + \boldsymbol{g}_t \odot \boldsymbol{g}_t$, $\quad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \dfrac{\eta}{\delta + \sqrt{\boldsymbol{r}_t}} \odot \boldsymbol{g}_t$ (small $\delta$ for numerical stability), as sketched below
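A minimal sketch of the AdaGrad update described above; the hyperparameter values and the toy objective are illustrative assumptions:

```python
import numpy as np

def adagrad_step(theta, r, g, eta=0.1, delta=1e-7):
    """One AdaGrad update: accumulate squared gradients, scale the step per parameter."""
    r = r + g * g                                  # sum of squares of all past gradients
    theta = theta - eta / (delta + np.sqrt(r)) * g
    return theta, r

# Toy usage on J(theta) = 0.5 * ||theta||^2 (gradient = theta)
theta, r = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, r, theta)
print(theta)   # moves toward the origin; note that the effective step keeps shrinking
```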
RMSProp
• One problem with AdaGrad is that the accumulator $\boldsymbol{r}$ keeps building up and growing.
• This shrinks the learning rate too aggressively.
• RMSProp strikes a balance by exponentially decaying the contributions of past gradients.
RMSProp
• Exponentially weighted accumulation and update: $\boldsymbol{r}_t = \rho\, \boldsymbol{r}_{t-1} + (1-\rho)\, \boldsymbol{g}_t \odot \boldsymbol{g}_t$, $\quad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \dfrac{\eta}{\sqrt{\delta + \boldsymbol{r}_t}} \odot \boldsymbol{g}_t$ (sketched in code after this list)
• RMSProp uses an exponentially decaying average to discard history from the extreme past, so that the accumulation of gradients does not stall the learning.
• AdaDelta is another variant where, instead of an exponentially decaying average, a moving-window average of the past gradients is taken.
• Nesterov acceleration can also be applied to both of these variants by computing the gradients at a 'look-ahead' position (i.e., at the place where the momentum would have taken the parameters).
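The corresponding update in code, differing from AdaGrad only in the exponentially decaying accumulator; the decay rate and the other values are assumed defaults:

```python
import numpy as np

def rmsprop_step(theta, r, g, eta=0.01, rho=0.9, delta=1e-6):
    """RMSProp: exponentially decaying average of squared gradients."""
    r = rho * r + (1 - rho) * g * g      # old history decays instead of piling up
    theta = theta - eta / np.sqrt(delta + r) * g
    return theta, r

# Toy usage on J(theta) = 0.5 * ||theta||^2 (gradient = theta)
theta, r = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, r = rmsprop_step(theta, r, theta)
print(theta)   # approaches the origin; small oscillations on the order of eta remain near the minimum
```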
Adam
• Incorporates the first-order moment of the gradient, which can be thought of as equivalent to taking advantage of the momentum strategy. Here the momentum is also accumulated with exponential averaging.
• It also incorporates a second-order term, which can be thought of as an RMSProp-like exponential averaging of the past squared gradients.
• Both the first and second moments are bias-corrected to account for their initialization at zero (see the sketch below).
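A minimal sketch of the Adam update just described, following the standard formulation; the default hyperparameters and the toy usage are assumptions:

```python
import numpy as np

def adam_step(theta, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g           # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g * g       # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on J(theta) = 0.5 * ||theta||^2 (gradient = theta)
theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    theta, m, v = adam_step(theta, m, v, theta, t)
print(theta)   # close to the minimizer at the origin
```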
Visualization
Thank You!!