Deep Learning
CS60010
Abir Das
http://cse.iitkgp.ac.in/~adas/
Computer Science and Engineering | Indian Institute of Technology Kharagpur
07, 12, 13 Jan, 2022
Agenda
• Understand the basics of Matrix/Vector Calculus and Optimization concepts to be used in the course.
Resources
• "Deep Learning", I. Goodfellow, Y. Bengio, A. Courville (Chapter 8)
• The goal is to find parameters of a neural network that significantly reduce a cost function or objective function.
• Gradient-based optimization is the most popular way of training deep neural networks.
• There are other ways too, e.g., evolutionary or derivative-free optimization, but they come with issues that are particularly crucial for neural network training.
• It's easy to spend a whole semester on optimization, so these few lectures will only scratch a very small part of the surface.
• If we could minimize the expected risk, we would obtain the true optimal prediction function.
• But we don't know the actual distribution that generates the data, so what we actually minimize is the empirical risk.
• We choose a family of candidate prediction functions and find the function that minimizes the empirical risk (the two objectives are sketched below).
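For reference, the two objectives can be written as follows; the loss $\ell$ and the hypothesis family $\mathcal{H}$ are generic placeholders (assumed notation, not taken from the slides):

```latex
% Expected (true) risk: average loss under the unknown data distribution P(x, y)
\[
  R(f) \;=\; \mathbb{E}_{(\boldsymbol{x},\,y)\sim P}\big[\,\ell\big(f(\boldsymbol{x}),\, y\big)\,\big]
\]
% Empirical risk: average loss over the n observed training examples,
% minimized over the chosen family of candidate functions H
\[
  \hat{R}_n(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(\boldsymbol{x}_i),\, y_i\big),
  \qquad
  \hat{f} \;=\; \arg\min_{f \in \mathcal{H}} \hat{R}_n(f)
\]
```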
Sources of Error

Minimizes      | Staying within                     | Assumptions
expected risk  | no constraints on function family  | true data distribution known; family of functions exhaustive
expected risk  | a family of functions              | true data distribution known; family of functions not exhaustive
empirical risk | a family of functions              | true data distribution not known; family of functions not exhaustive
[Figure: two panels with x and y axes]
Slide courtesy: Malik Magdon-Ismail
[Figure: fits $g_n^{\mathcal{D}}(x)$ of $\sin(x)$ obtained from different datasets, together with the average fit $\bar{g}(x)$]
We can define
• $g_n^{\mathcal{D}}(\boldsymbol{x})$: a random value, depending on the dataset $\mathcal{D}$
• $\bar{g}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})\big] \approx \frac{1}{K}\big(g_n^{\mathcal{D}_1}(\boldsymbol{x}) + g_n^{\mathcal{D}_2}(\boldsymbol{x}) + \dots + g_n^{\mathcal{D}_K}(\boldsymbol{x})\big)$: your average prediction on $\boldsymbol{x}$
• $\operatorname{var}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2\big] = \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2$: how variable your prediction on $\boldsymbol{x}$ is
Slide courtesy: Malik Magdon-Ismail
[Figure: a single fit $g_n^{\mathcal{D}}(x)$ plotted against $x$]
$E_{out}^{\mathcal{D}}(\boldsymbol{x}) = \big(g_n^{\mathcal{D}}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2$: squared error, a random value depending on $\mathcal{D}$
$E_{out}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\big[E_{out}^{\mathcal{D}}(\boldsymbol{x})\big] = \mathbb{E}_{\mathcal{D}}\Big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\Big]$: expected value of the above random variable
Slide motivation: Malik Magdon-Ismail
$E_{out}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\Big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\Big] = \mathbb{E}_{\mathcal{D}}\big[f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, g_n^{\mathcal{D}}(\boldsymbol{x}) + g_n^{\mathcal{D}}(\boldsymbol{x})^2\big]$
$= f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})\big] + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big]$
$= f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \bar{g}(\boldsymbol{x}) + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big]$
$= f(\boldsymbol{x})^2 - \bar{g}(\boldsymbol{x})^2 + \bar{g}(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \bar{g}(\boldsymbol{x}) + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big]$
$= \underbrace{\big(f(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2}_{\text{Bias}} + \underbrace{\mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2}_{\text{Variance}}$
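The decomposition can be checked with a small Monte-Carlo simulation at a single query point; the $\sin$ target, the mean-predictor model, and the sample sizes below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                       # true target f(x)
x0 = 1.0                         # query point x at which E_out(x) is evaluated
n, K, noise = 5, 20000, 0.1      # dataset size, number of datasets D, noise std

# For each dataset D, fit a deliberately simple model: predict the mean of the noisy targets.
preds = np.empty(K)              # preds[k] = g_n^D(x0) for the k-th dataset
for k in range(K):
    xs = rng.uniform(-np.pi, np.pi, n)
    ys = f(xs) + noise * rng.normal(size=n)
    preds[k] = ys.mean()

g_bar = preds.mean()                       # average prediction at x0
e_out = np.mean((preds - f(x0)) ** 2)      # E_D[(g_n^D(x0) - f(x0))^2]
bias = (f(x0) - g_bar) ** 2                # (f(x0) - g_bar)^2
var = np.mean(preds ** 2) - g_bar ** 2     # E_D[g_n^D(x0)^2] - g_bar^2

print(f"E_out = {e_out:.4f}, bias + variance = {bias + var:.4f}")  # the two match
```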
Bias-Variance to Overfitting-Underfitting
• Suppose the true underlying function is $f(\boldsymbol{x})$.
• But the observations (data points) are noisy, i.e., $y = f(\boldsymbol{x}) + \varepsilon$.
Suppose the neural network is lazy and just produces the same constant output whatever training data we give it, i.e., $g_n^{\mathcal{D}}(\boldsymbol{x}) = c$. Then
$E_{out}(\boldsymbol{x}) = \big(f(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2 + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2$
$= \big(f(\boldsymbol{x}) - c\big)^2 + \mathbb{E}_{\mathcal{D}}\big[c^2\big] - c^2$
$= \big(f(\boldsymbol{x}) - c\big)^2 + 0$
In this case the variance term is zero, but the bias will be large, because the network has made no attempt to fit the data. We say we have extreme under-fitting.
Slide motivation: John A. Bullinaria
Now suppose instead that the network fits the noisy training data exactly, i.e., $g_n^{\mathcal{D}}(\boldsymbol{x}) = f(\boldsymbol{x}) + \varepsilon$, so that $\bar{g}(\boldsymbol{x}) = f(\boldsymbol{x})$. Then
$E_{out}(\boldsymbol{x}) = \big(f(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2 + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2$
$= \big(f(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2 + \mathbb{E}_{\mathcal{D}}\Big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2\Big]$
$= 0 + \mathbb{E}_{\mathcal{D}}\Big[\big(f(\boldsymbol{x}) + \varepsilon - f(\boldsymbol{x})\big)^2\Big]$
$= 0 + \mathbb{E}_{\mathcal{D}}\big[\varepsilon^2\big]$
In this case the bias term is zero, but the variance equals the expected squared noise on the data, which could be substantial. In this case we say we have extreme over-fitting.
Slide motivation: John A. Bullinaria
Bias-Variance Trade-off
[Figure: Bias-Variance trade-off: training error and test error curves, with the underfitting and overfitting regimes marked; the training error keeps decreasing while the test error eventually rises again]
Vector/Matrix Calculus
• If $f: \mathbb{R}^n \to \mathbb{R}$ and $\boldsymbol{x} \in \mathbb{R}^n$, then $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \Big[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\Big]^{\top}$ is called the gradient of $f$
• If and , then
Vector/Matrix Calculus
• If and , then
Vector/Matrix Calculus
• Some standard results:
• $\nabla_{\boldsymbol{x}}\big(\boldsymbol{x}^{\top} A \boldsymbol{x}\big) = (A + A^{\top})\,\boldsymbol{x}$; $= 2A\boldsymbol{x}$ if $A$ is symmetric (checked numerically after this list)
• Product rule:
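For instance, the quadratic-form identity above, together with the related result $\nabla_{\boldsymbol{x}}(\boldsymbol{a}^{\top}\boldsymbol{x}) = \boldsymbol{a}$, can be verified with a quick finite-difference check; this sketch and the choice of identities are illustrative assumptions, not read off the slide:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
a, A, x = rng.normal(size=n), rng.normal(size=(n, n)), rng.normal(size=n)

print(np.allclose(num_grad(lambda v: a @ v, x), a))                   # grad(a^T x) = a
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))   # grad(x^T A x) = (A + A^T) x
```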
Vector/Matrix Calculus
• Some standard results:
• If , then
• If , then
Vector/Matrix Calculus
• Derivatives of norms:
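Commonly used examples of such derivatives (listed here as standard illustrations; they are not necessarily the exact ones on the slide):

```latex
\[
  \nabla_{\boldsymbol{x}} \lVert \boldsymbol{x} \rVert_2^2 = 2\boldsymbol{x}, \qquad
  \nabla_{\boldsymbol{x}} \lVert \boldsymbol{x} \rVert_2 = \frac{\boldsymbol{x}}{\lVert \boldsymbol{x} \rVert_2}\ (\boldsymbol{x} \neq \boldsymbol{0}), \qquad
  \nabla_{\boldsymbol{x}} \lVert A\boldsymbol{x} - \boldsymbol{b} \rVert_2^2 = 2A^{\top}(A\boldsymbol{x} - \boldsymbol{b})
\]
```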
Optimization Problem
• Problem Statement: find $\boldsymbol{x}^{*}$ that minimizes $f(\boldsymbol{x})$ over a feasible set $\mathcal{C}$, i.e., $\boldsymbol{x}^{*} = \arg\min_{\boldsymbol{x} \in \mathcal{C}} f(\boldsymbol{x})$
[Figure: a convex set $\mathcal{C}$ containing points $\boldsymbol{x}$ and $\boldsymbol{y}$; every point $\theta \boldsymbol{x} + (1-\theta)\boldsymbol{y}$ with $\theta \in [0,1]$ on the segment joining them also lies in $\mathcal{C}$]
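For reference, the notion of a convex function that the argument on the next few slides works toward, stated in standard form (the notation is assumed to match the slides):

```latex
\[
  f \text{ is convex} \;\iff\;
  f\big(\theta\boldsymbol{x} + (1-\theta)\boldsymbol{y}\big) \;\le\; \theta f(\boldsymbol{x}) + (1-\theta) f(\boldsymbol{y})
  \qquad \forall\, \boldsymbol{x}, \boldsymbol{y} \in \operatorname{dom} f,\ \theta \in [0,1]
\]
```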
$f(\boldsymbol{y}) \;\ge\; f(\boldsymbol{x}) + \big\langle \nabla f(\boldsymbol{x}),\, \boldsymbol{y} - \boldsymbol{x} \big\rangle$ … (1)
Now, let $\boldsymbol{z} = (1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}$ with $\theta \in [0,1]$. By taking $\boldsymbol{z}$ and $\boldsymbol{x}$ in (1): $f(\boldsymbol{x}) \ge f(\boldsymbol{z}) + \big\langle \nabla f(\boldsymbol{z}),\, \boldsymbol{x} - \boldsymbol{z} \big\rangle$ … (2). By taking $\boldsymbol{z}$ and $\boldsymbol{y}$ in (1): $f(\boldsymbol{y}) \ge f(\boldsymbol{z}) + \big\langle \nabla f(\boldsymbol{z}),\, \boldsymbol{y} - \boldsymbol{z} \big\rangle$ … (3)
Multiplying (2) by $(1-\theta)$ and (3) by $\theta$ and adding … (4), and with $\boldsymbol{z} = (1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}$:
$\Rightarrow (1-\theta) f(\boldsymbol{x}) + \theta f(\boldsymbol{y}) \;\ge\; f(\boldsymbol{z}) - \theta f(\boldsymbol{z}) + \theta f(\boldsymbol{z}) + \big\langle \nabla f(\boldsymbol{z}),\, (1-\theta)(\boldsymbol{x}-\boldsymbol{z}) + \theta(\boldsymbol{y}-\boldsymbol{z}) \big\rangle$
$\Rightarrow (1-\theta) f(\boldsymbol{x}) + \theta f(\boldsymbol{y}) \;\ge\; f(\boldsymbol{z}) + \big\langle \nabla f(\boldsymbol{z}),\, \underbrace{(1-\theta)(\boldsymbol{x}-\boldsymbol{z}) + \theta(\boldsymbol{y}-\boldsymbol{z})}_{=\,0\ \text{(Why?)}} \big\rangle = f(\boldsymbol{z}) = f\big((1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}\big)$
Hence $f\big((1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}\big) \le (1-\theta) f(\boldsymbol{x}) + \theta f(\boldsymbol{y})$, i.e., $f$ is convex.
Gradient Descent
Minimize the loss function: $\min_{\boldsymbol{w}, \boldsymbol{b}} J(\theta)$
More formally, $\min_{\theta} J(\theta)$
For scalar $\theta$, the condition is $J'(\theta) = \dfrac{\partial J}{\partial \theta} = 0$
For higher-dimensional $\theta$, the condition boils down to $\nabla_{\theta} J(\theta) = \boldsymbol{0}$
(1) implies $J(\theta + \Delta\theta) \approx J(\theta) + \nabla_{\theta} J(\theta)^{\top} \Delta\theta$ [Using Taylor series expansion] [Neglecting higher-order terms]
Choosing $\Delta\theta = -\eta\, \nabla_{\theta} J(\theta)$ with $\eta > 0$ gives $J(\theta + \Delta\theta) - J(\theta) \approx -\eta\, \lVert \nabla_{\theta} J(\theta) \rVert^2 \le 0$, i.e., the update $\theta \leftarrow \theta - \eta\, \nabla_{\theta} J(\theta)$ decreases $J$ … (3)
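A minimal sketch of this update loop on a stand-in quadratic objective; the objective, the learning rate, and the iteration count are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Stand-in objective: J(theta) = 0.5 * theta^T A theta - b^T theta, with A symmetric positive definite
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_J = lambda th: A @ th - b           # analytic gradient of the quadratic

theta = np.zeros(2)
eta = 0.1                                # learning rate
for _ in range(100):
    theta = theta - eta * grad_J(theta)  # theta <- theta - eta * grad J(theta)

print(theta, np.linalg.solve(A, b))      # gradient-descent iterate vs. exact minimizer A^{-1} b
```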
Adapted from Olivier Bousquet’s invited talk at NeurIPS 2018 after winning Test of Time Award for NIPS 2007 paper:
“The Trade-Offs of Large Scale Learning” by Leon Bottou and Olivier Bousquet. Link of the talk.
• Optimization algorithms that use the entire training set to compute the gradient are called batch or deterministic gradient methods. Those that use a single training example are called stochastic or online gradient methods.
• Most of the algorithms we use for deep learning fall somewhere in between: they compute the gradient on a mini-batch of examples (see the sketch below).
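The three gradient estimators differ only in how many examples they average over; a small illustrative sketch (the squared-error loss, the toy data, and the helper name `grad` are assumptions made for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # toy dataset
theta = np.zeros(5)

def grad(theta, Xb, yb):
    """Gradient of the mean squared error 0.5*mean((Xb @ theta - yb)^2) on a batch."""
    return Xb.T @ (Xb @ theta - yb) / len(yb)

g_batch = grad(theta, X, y)                        # batch / deterministic: all examples
i = rng.integers(len(y))
g_stoch = grad(theta, X[i:i + 1], y[i:i + 1])      # stochastic / online: a single example
idx = rng.choice(len(y), size=32, replace=False)
g_mini = grad(theta, X[idx], y[idx])               # mini-batch: a small random subset

# All three are (noisy) estimates of the same expected gradient.
print(np.linalg.norm(g_batch - g_mini), np.linalg.norm(g_batch - g_stoch))
```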
Mini-batch
[Figure: a momentum update showing (i) the momentum (velocity) carried at the current point, (ii) the negative of the gradient at the point (where a plain gradient step would have landed without momentum), and the actual step, which combines the two]
Images courtesy: Goodfellow et al.
Nesterov Momentum
• The difference between Nesterov momentum and standard momentum is where the gradient is evaluated.
• With Nesterov momentum, the gradient is evaluated after the current velocity is applied, i.e., at the 'look-ahead' point (see the sketch below).
• In practice, Nesterov momentum speeds up convergence only for well-behaved loss functions (convex, with consistent curvature).
Images courtesy: Goodfellow et al.
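A minimal sketch of the two updates side by side; the helper names, the learning rate, and the momentum coefficient are placeholders assumed for this illustration:

```python
import numpy as np

def momentum_step(theta, v, grad_J, eta=0.01, alpha=0.9):
    """Standard momentum: the gradient is evaluated at the current point theta."""
    v = alpha * v - eta * grad_J(theta)
    return theta + v, v

def nesterov_step(theta, v, grad_J, eta=0.01, alpha=0.9):
    """Nesterov momentum: the gradient is evaluated after the velocity is applied (look-ahead)."""
    v = alpha * v - eta * grad_J(theta + alpha * v)
    return theta + v, v

# Tiny usage example on the quadratic bowl J(theta) = 0.5 * ||theta||^2 (gradient = theta)
grad_J = lambda th: th
theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    theta, v = nesterov_step(theta, v, grad_J)
print(theta)   # close to the minimizer at the origin
```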
• Would it not be a good idea if we could move slowly along steep directions and quickly along shallow ones?
AdaGrad
• Downscale the learning rate by the square root of the sum of squares of all historical gradient values.
• Parameters with large partial derivatives of the loss have their learning rates decreased rapidly.
AdaGrad
• Gradients over iterations: $\boldsymbol{g}_1, \boldsymbol{g}_2, \dots, \boldsymbol{g}_t$
• Accumulation and update: $\boldsymbol{r}_t = \boldsymbol{r}_{t-1} + \boldsymbol{g}_t \odot \boldsymbol{g}_t$, $\quad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \dfrac{\eta}{\delta + \sqrt{\boldsymbol{r}_t}} \odot \boldsymbol{g}_t$ (small $\delta$ for numerical stability), as sketched below
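A minimal sketch of the AdaGrad update described above; the hyperparameter values and the toy objective are illustrative assumptions:

```python
import numpy as np

def adagrad_step(theta, r, g, eta=0.1, delta=1e-7):
    """One AdaGrad update: accumulate squared gradients, scale the step per parameter."""
    r = r + g * g                                  # sum of squares of all past gradients
    theta = theta - eta / (delta + np.sqrt(r)) * g
    return theta, r

# Toy usage on J(theta) = 0.5 * ||theta||^2 (gradient = theta)
theta, r = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, r, theta)
print(theta)   # moves toward the origin; note that the effective step keeps shrinking
```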
RMSProp
• One problem with AdaGrad is that the accumulator $\boldsymbol{r}$ keeps building up and growing.
• This shrinks the learning rate too aggressively.
• RMSProp strikes a balance by exponentially decaying the contributions of past gradients.
RMSProp
• Exponentially weighted accumulation and update: $\boldsymbol{r}_t = \rho\, \boldsymbol{r}_{t-1} + (1-\rho)\, \boldsymbol{g}_t \odot \boldsymbol{g}_t$, $\quad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \dfrac{\eta}{\sqrt{\delta + \boldsymbol{r}_t}} \odot \boldsymbol{g}_t$ (sketched in code after this list)
• RMSProp uses an exponentially decaying average to discard history from the extreme past, so that the accumulation of gradients does not stall the learning.
• AdaDelta is another variant where, instead of an exponentially decaying average, a moving-window average of the past gradients is taken.
• Nesterov acceleration can also be applied to both of these variants by computing the gradients at a 'look-ahead' position (i.e., at the place where the momentum would have taken the parameters).
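The corresponding update in code, differing from AdaGrad only in the exponentially decaying accumulator; the decay rate and the other values are assumed defaults:

```python
import numpy as np

def rmsprop_step(theta, r, g, eta=0.01, rho=0.9, delta=1e-6):
    """RMSProp: exponentially decaying average of squared gradients."""
    r = rho * r + (1 - rho) * g * g      # old history decays instead of piling up
    theta = theta - eta / np.sqrt(delta + r) * g
    return theta, r

# Toy usage on J(theta) = 0.5 * ||theta||^2 (gradient = theta)
theta, r = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, r = rmsprop_step(theta, r, theta)
print(theta)   # approaches the origin; small oscillations on the order of eta remain near the minimum
```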
Adam
• Incorporates the first-order moment of the gradient, which can be thought of as equivalent to taking advantage of the momentum strategy. Here the momentum is also accumulated with exponential averaging.
• It also incorporates a second-order term, which can be thought of as an RMSProp-like exponential averaging of the past squared gradients.
• Both the first and second moments are bias-corrected to account for their initialization at zero (see the sketch below).
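A minimal sketch of the Adam update just described, following the standard formulation; the default hyperparameters and the toy usage are assumptions:

```python
import numpy as np

def adam_step(theta, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g           # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g * g       # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on J(theta) = 0.5 * ||theta||^2 (gradient = theta)
theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    theta, m, v = adam_step(theta, m, v, theta, t)
print(theta)   # close to the minimizer at the origin
```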
Visualization
Thank You!!