Lecture 5
Mitesh M. Khapra
Acknowledgements
For most of the lecture, I have borrowed ideas from the videos by Ryan Harris on "visualize backpropagation" (available on YouTube)
Some content is based on the course CS231n (http://cs231n.stanford.edu/2016/) by Andrej Karpathy and others
Module 5.1: Learning Parameters: Infeasible (Guess Work)
[Figure: a single sigmoid neuron σ with input x (weight w, bias b) and output y = f(x)]

f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}}

Input for training
\{x_i, y_i\}_{i=1}^{N} → N pairs of (x, y)

Training objective
Find w and b such that:
\min_{w,b} L(w, b) = \sum_{i=1}^{N} (y_i - f(x_i))^2
What does it mean to train the network?
Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9)
At the end of training we expect to find w^*, b^* such that:
f(0.5) → 0.2 and f(2.5) → 0.9

In other words...
We hope to find a sigmoid function such that (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid
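As an aside (not part of the original slides), a minimal Python sketch of this toy setup may help make the goal of training concrete; the function and variable names here are my own. It defines the sigmoid neuron f and checks what it predicts for the two training inputs under an arbitrary choice of w and b.

import math

def f(x, w, b):
    # single sigmoid neuron: f(x) = 1 / (1 + exp(-(w*x + b)))
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# the two training points from the slide
data = [(0.5, 0.2), (2.5, 0.9)]

# with an arbitrary (w, b) the outputs will generally not match the targets;
# training should find (w*, b*) for which f(0.5) -> 0.2 and f(2.5) -> 0.9
w, b = 0.5, 0.0
for x, y in data:
    print(x, y, f(x, w, b))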
Let us see this in more detail...
Can we try to find such a w^*, b^* manually?
Let us try a random guess... (say, w = 0.5, b = 0)
Clearly not good, but how bad is it?
Let us revisit L(w, b) to see how bad it is...
L(w, b) = \frac{1}{2} \sum_{i=1}^{N} (y_i - f(x_i))^2
        = \frac{1}{2} \left( (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 \right)
        = \frac{1}{2} \left( (0.9 - f(2.5))^2 + (0.2 - f(0.5))^2 \right)
        = 0.073
Let us try some other values of w, b

  w        b        L(w, b)
  0.50     0.00     0.0730
 -0.10     0.00     0.1481
  0.94    -0.94     0.0214
  1.42    -1.73     0.0028
  1.65    -2.08     0.0003
  1.78    -2.27     0.0000
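This brute-force "guess work" procedure is easy to reproduce in a few lines of Python (again a sketch I added, not the lecture's code). It evaluates L(w, b), with the 1/2 factor used in the loss computation above, at each guessed (w, b) pair and should reproduce the loss values in the table (0.0730, 0.1481, ..., ~0.0000) up to rounding.

import math

data = [(0.5, 0.2), (2.5, 0.9)]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def L(w, b):
    # L(w, b) = 1/2 * sum_i (y_i - f(x_i))^2
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

guesses = [(0.50, 0.00), (-0.10, 0.00), (0.94, -0.94),
           (1.42, -1.73), (1.65, -2.08), (1.78, -2.27)]
for w, b in guesses:
    print(f"w={w:+.2f}  b={b:+.2f}  L={L(w, b):.4f}")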
Let us look at something better than our "guess work" algorithm...
Since we have only 2 points and 2 parameters (w, b) we can easily plot L(w, b) for different values of (w, b) and pick the one where L(w, b) is minimum
But of course this becomes intractable once you have many more data points and many more parameters !!
Further, even here we have plotted the error surface only for a small range of (w, b) [from (-6, 6), and not from (-∞, ∞)]
Let us look at the geometric interpretation of our "guess work" algorithm in terms of this error surface
[Figure: the guessed (w, b) values from the table visualized on the 3d error surface of L(w, b)]
Module 5.2: Learning Parameters: Gradient Descent
Now let's see if there is a more efficient and principled way of doing this
Goal
Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!
θ = [w, b] : vector of parameters, say, randomly initialized
∆θ = [∆w, ∆b] : change in the values of w, b

[Figure: moving from θ to θ_new along the direction ∆θ]

We moved in the direction of ∆θ
Let us be a bit conservative: move only by a small amount η

θ_new = θ + η · ∆θ
For ease of notation, let ∆θ = u, then from the Taylor series we have,

L(θ + ηu) = L(θ) + η * u^T ∇L(θ) + \frac{η^2}{2!} * u^T ∇^2 L(θ) u + \frac{η^3}{3!} * ... + \frac{η^4}{4!} * ...
          = L(θ) + η * u^T ∇L(θ)     [η is typically small, so η^2, η^3, ... → 0]

For the move to be favorable, we want
L(θ + ηu) - L(θ) < 0     [i.e., the new loss should be less than the previous loss]

This implies,
u^T ∇L(θ) < 0
Okay, so we have,

u^T ∇L(θ) < 0

But what is the range of u^T ∇L(θ)? Let β be the angle between u and ∇L(θ), and let k = ||u|| * ||∇L(θ)||. Then,

-1 ≤ cos(β) = \frac{u^T ∇L(θ)}{||u|| * ||∇L(θ)||} ≤ 1

-k ≤ k * cos(β) = u^T ∇L(θ) ≤ k

So u^T ∇L(θ) (and hence the decrease in loss) is most negative when cos(β) = -1, i.e., when β is 180°
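The claim that the best direction is the one at 180° to the gradient is easy to check numerically. The sketch below is my own addition; it uses a simple quadratic stand-in for L(θ) (not the toy network's loss) and compares the change in loss for a small step along the negative-gradient direction against steps along a few other unit directions.

import numpy as np

# illustrative stand-in loss (NOT the toy network's loss): L(theta) = theta_1^2 + 2 * theta_2^2
def L(theta):
    return theta[0] ** 2 + 2.0 * theta[1] ** 2

def grad_L(theta):
    return np.array([2.0 * theta[0], 4.0 * theta[1]])

theta = np.array([1.0, 1.5])
eta = 0.1
g = grad_L(theta)

# candidate unit directions u: opposite to the gradient, along it, and a few random ones
rng = np.random.default_rng(0)
candidates = [-g / np.linalg.norm(g), g / np.linalg.norm(g)]
candidates += [d / np.linalg.norm(d) for d in rng.normal(size=(3, 2))]

for u in candidates:
    print(np.round(u, 2), L(theta + eta * u) - L(theta))

# among these candidates, the most negative change (largest decrease in loss)
# comes from u = -g/||g||, i.e., the direction at 180 degrees to the gradient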
Gradient Descent Rule
The direction u that we intend to move in should be at 180° w.r.t. the gradient
In other words, move in a direction opposite to the gradient

w_{t+1} = w_t - η ∇w_t
b_{t+1} = b_t - η ∇b_t

where ∇w_t = \frac{∂L(w, b)}{∂w} evaluated at w = w_t, b = b_t, and ∇b_t = \frac{∂L(w, b)}{∂b} evaluated at w = w_t, b = b_t

So we now have a more principled way of moving in the w-b plane than our "guess work" algorithm
Let's create an algorithm from this rule ...

Algorithm 1: gradient_descent()
t ← 0;
max_iterations ← 1000;
while t < max_iterations do
    w_{t+1} ← w_t - η ∇w_t;
    b_{t+1} ← b_t - η ∇b_t;
    t ← t + 1;
end

To see this algorithm in practice let us first derive ∇w and ∇b for our toy neural network
Let's assume there is only 1 point (x, y) to fit

L(w, b) = \frac{1}{2} * (f(x) - y)^2

∇w = \frac{∂L(w, b)}{∂w} = \frac{∂}{∂w} \left[ \frac{1}{2} * (f(x) - y)^2 \right]
∇w = \frac{∂}{∂w} \left[ \frac{1}{2} * (f(x) - y)^2 \right]
   = \frac{1}{2} * \left[ 2 * (f(x) - y) * \frac{∂}{∂w} (f(x) - y) \right]
   = (f(x) - y) * \frac{∂}{∂w} f(x)
   = (f(x) - y) * \frac{∂}{∂w} \left[ \frac{1}{1 + e^{-(wx+b)}} \right]
   = (f(x) - y) * f(x) * (1 - f(x)) * x

where the last step uses

\frac{∂}{∂w} \left[ \frac{1}{1 + e^{-(wx+b)}} \right]
   = \frac{-1}{(1 + e^{-(wx+b)})^2} * \frac{∂}{∂w} \left( e^{-(wx+b)} \right)
   = \frac{-1}{(1 + e^{-(wx+b)})^2} * e^{-(wx+b)} * \frac{∂}{∂w} \left( -(wx + b) \right)
   = \frac{-1}{1 + e^{-(wx+b)}} * \frac{e^{-(wx+b)}}{1 + e^{-(wx+b)}} * (-x)
   = \frac{1}{1 + e^{-(wx+b)}} * \frac{e^{-(wx+b)}}{1 + e^{-(wx+b)}} * x
   = f(x) * (1 - f(x)) * x
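A quick way to gain confidence in this derivation is a numerical gradient check. The sketch below is my addition; it compares the analytical expression (f(x) - y) * f(x) * (1 - f(x)) * x with a central-difference estimate of ∂L/∂w for one training point.

import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def L(w, b, x, y):
    # single-point loss with the 1/2 factor used above
    return 0.5 * (f(x, w, b) - y) ** 2

x, y, w, b = 2.5, 0.9, 0.5, 0.0

# analytical gradient from the derivation above
grad_w = (f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x

# numerical gradient via central differences
eps = 1e-6
grad_w_num = (L(w + eps, b, x, y) - L(w - eps, b, x, y)) / (2 * eps)

print(grad_w, grad_w_num)   # the two values should agree to several decimal places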
So if there is only 1 point (x, y), we have,

∇w = (f(x) - y) * f(x) * (1 - f(x)) * x

For our two training points,

∇w = \sum_{i=1}^{2} (f(x_i) - y_i) * f(x_i) * (1 - f(x_i)) * x_i

∇b = \sum_{i=1}^{2} (f(x_i) - y_i) * f(x_i) * (1 - f(x_i))
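Putting Algorithm 1 together with these gradients gives a complete, if minimal, implementation for the toy problem. This is a sketch of my own; the initial values of w and b, the learning rate η, and the iteration count are arbitrary illustrative choices, not values prescribed by the lecture.

import math

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def grad_w(w, b):
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in zip(X, Y))

def grad_b(w, b):
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in zip(X, Y))

def loss(w, b):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in zip(X, Y))

def do_gradient_descent(w=0.0, b=0.0, eta=1.0, max_iterations=10000):
    for t in range(max_iterations):
        dw, db = grad_w(w, b), grad_b(w, b)
        w = w - eta * dw        # w_{t+1} = w_t - eta * grad_w_t
        b = b - eta * db        # b_{t+1} = b_t - eta * grad_b_t
    return w, b

w, b = do_gradient_descent()
# the loss should have decreased, with f(0.5) and f(2.5) moving towards 0.2 and 0.9
print(w, b, loss(w, b), f(0.5, w, b), f(2.5, w, b))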
[Animation: gradient descent (Algorithm 1) in action on the error surface of the toy problem]
[Figure: the curve f(x) = x^2 + 1, with a steep region (slope ∆y_1/∆x_1) and a gentle region (slope ∆y_2/∆x_2)]

When the curve is steep the gradient (∆y_1/∆x_1) is large
When the curve is gentle the gradient (∆y_2/∆x_2) is small
Recall that our weight updates are proportional to the gradient: w = w - η∇w
Hence in the areas where the curve is gentle the updates are small whereas in the areas where the curve is steep the updates are large
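The effect on the update size can be seen with a couple of lines of arithmetic (my own illustration, not from the slides): the slope of f(x) = x^2 + 1 is 2x, so in the steep part of the curve the gradient, and hence the update η * gradient, is much larger than in the gentle part.

# treat f(x) = x^2 + 1 as an illustrative one-parameter "loss curve"
def grad(x):
    return 2.0 * x          # derivative of x^2 + 1

eta = 0.1
for x in [3.0, 1.0, 0.25, 0.05]:
    # steep region -> large gradient -> large update; gentle region -> tiny update
    print(x, grad(x), eta * grad(x))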
Let's see what happens when we start from a different point
Irrespective of where we start from, once we hit a surface which has a gentle slope, the progress slows down
Module 5.3: Contours
Visualizing things in 3d can sometimes become a bit cumbersome
Can we do a 2d visualization of this traversal along the error surface?
Yes, let's take a look at something known as contours
Suppose I take horizontal slices of this error surface at regular intervals along the vertical (error) axis
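As an aside (not from the lecture), such a contour map can be produced directly from the toy problem's loss: evaluate L(w, b) on a grid and ask matplotlib to draw the level sets, which are exactly the horizontal slices described above. The grid range and number of levels below are arbitrary choices of mine.

import numpy as np
import matplotlib.pyplot as plt

def f(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])

# evaluate L(w, b) on a grid over the same (-6, 6) range used for the 3d surface
w_grid, b_grid = np.meshgrid(np.linspace(-6, 6, 200), np.linspace(-6, 6, 200))
L = np.zeros_like(w_grid)
for x, y in zip(X, Y):
    L += 0.5 * (f(x, w_grid, b_grid) - y) ** 2

# each contour line is a horizontal slice of the error surface at one height
plt.contour(w_grid, b_grid, L, levels=20)
plt.xlabel("w")
plt.ylabel("b")
plt.show()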
Guess the 3d surface
[Figure: a sequence of contour maps shown for the reader to guess the corresponding 3d surfaces]
Now that we know what contour maps are and how to read them, let us go back to our toy example and visualize gradient descent from the point of view of contours...
[Figure: the gradient descent trajectory for the toy problem visualized on a contour plot of L(w, b), with w and b on the axes]
Module 5.4: Momentum based Gradient Descent
Some observations about gradient descent
It takes a lot of time to navigate regions having a gentle slope
This is because the gradient in these regions is very small
Can we do something better?
Yes, let's take a look at 'Momentum based gradient descent'
Intuition
If I am repeatedly being asked to move in the same direction then I should probably gain some confidence and start taking bigger steps in that direction
Just as a ball gains momentum while rolling down a slope
update_t = γ · update_{t−1} + η∇w_t
w_{t+1} = w_t − update_t
update_0 = 0
update_1 = γ · update_0 + η∇w_1 = η∇w_1
update_2 = γ · update_1 + η∇w_2 = γ · η∇w_1 + η∇w_2
update_3 = γ · update_2 + η∇w_3 = γ(γ · η∇w_1 + η∇w_2) + η∇w_3 = γ² · η∇w_1 + γ · η∇w_2 + η∇w_3
update_4 = γ · update_3 + η∇w_4 = γ³ · η∇w_1 + γ² · η∇w_2 + γ · η∇w_3 + η∇w_4
...
update_t = γ · update_{t−1} + η∇w_t = γ^{t−1} · η∇w_1 + γ^{t−2} · η∇w_2 + ... + η∇w_t
37 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
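In code, this accumulation is just a couple of lines. The following is a minimal sketch (illustrative, not the code used in the lecture); grad is assumed to be a function that returns the gradient of the loss at w, and eta, gamma stand for the learning rate and momentum hyperparameters.

# Minimal sketch of momentum based gradient descent (illustrative only).
# grad(w) is assumed to return the gradient of the loss at w.
def do_momentum_gradient_descent(w, grad, eta=1.0, gamma=0.9, max_iters=1000):
    update = 0.0                                  # update_0 = 0
    for _ in range(max_iters):
        update = gamma * update + eta * grad(w)   # update_t = gamma*update_{t-1} + eta*grad(w_t)
        w = w - update                            # w_{t+1} = w_t - update_t
    return w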
[Figure: momentum based gradient descent in action on the error surface (axes w, b)]
38 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Some observations and questions
Even in the regions having gentle slopes, momentum based gradient descent is able to take large steps because the momentum carries it along
Is moving fast always good? Would there be a situation where momentum would cause us to run past our goal?
Let us change our input data so that we end up with a different error surface and then see what happens ...
39 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
In this case, the error is high on either
side of the minima valley
Could momentum be detrimental in
such cases... let’s see....
40 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
[Figure: momentum based gradient descent trajectory on the new error surface (axes w, b)]
Momentum based gradient descent oscillates in and out of the minima valley as the momentum carries it out of the valley
Takes a lot of u-turns before finally converging
Despite these u-turns it still converges faster than vanilla gradient descent
After 100 iterations the momentum based method has reached an error of 0.00001 whereas vanilla gradient descent is still stuck at an error of 0.36
41 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Let’s look at a 3d visualization and a different
geometric perspective of the same thing...
42 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
[Figure: 3d visualization of the error surface, giving a different geometric perspective on the momentum based gradient descent trajectory]
43 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Module 5.5 : Nesterov Accelerated Gradient Descent
44 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Question
Can we do something to reduce these oscillations ?
Yes, let’s look at Nesterov accelerated gradient
45 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Intuition
Look before you leap
Recall that update_t = γ · update_{t−1} + η∇w_t
So we know that we are going to move at least by γ · update_{t−1} and then a bit more by η∇w_t
Why not calculate the gradient (∇w_{look ahead}) at this partially updated value of w (w_{look ahead} = w_t − γ · update_{t−1}) instead of calculating it using the current value w_t
46 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Update rule for NAG
w_{look ahead} = w_t − γ · update_{t−1}
update_t = γ · update_{t−1} + η∇w_{look ahead}
w_{t+1} = w_t − update_t
47 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
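A minimal sketch of the look-ahead version, under the same assumptions as the momentum sketch above (grad returns the gradient of the loss at the supplied point):

# Minimal sketch of Nesterov accelerated gradient descent (illustrative only).
def do_nag(w, grad, eta=1.0, gamma=0.9, max_iters=1000):
    update = 0.0
    for _ in range(max_iters):
        w_look_ahead = w - gamma * update                     # move by the momentum part first
        update = gamma * update + eta * grad(w_look_ahead)    # gradient at the look-ahead point
        w = w - update                                        # w_{t+1} = w_t - update_t
    return w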
[Figure: Nesterov accelerated gradient descent in action on the error surface (axes w, b)]
47 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
48 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Observations about NAG
Looking ahead helps NAG in correcting its course quicker than momentum
based gradient descent
Hence the oscillations are smaller and the chances of escaping the minima valley are also smaller
49 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Module 5.6 : Stochastic And Mini-Batch Gradient
Descent
50 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Let’s digress a bit and talk about the stochastic
version of these algorithms...
51 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Notice that the algorithm goes over the entire data once before updating the parameters
Why? Because this is the true gradient of the loss as derived earlier (sum of the gradients of the losses corresponding to each data point)
No approximation. Hence, theoretical guarantees hold (in other words each step guarantees that the loss will decrease)
What’s the flipside? Imagine we have a million points in the training data. To make 1 update to w, b the algorithm makes a million calculations. Obviously very slow!!
Can we do something better ? Yes, let’s look at stochastic gradient descent
52 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
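To make "one full pass over the data before a single update" concrete, here is a minimal sketch for the toy sigmoid neuron (illustrative, not the lecture's own code); the helper names f, grad_w, grad_b and the initial values are assumptions, with the gradient expressions matching the ones derived later in this lecture.

import math

def f(x, w, b):                                  # sigmoid neuron
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def grad_w(x, y, w, b):                          # gradient of 0.5*(f(x) - y)^2 w.r.t. w
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(x, y, w, b):                          # gradient w.r.t. b
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx)

def do_batch_gradient_descent(X, Y, eta=1.0, max_epochs=1000):
    w, b = -2.0, -2.0                            # some arbitrary initialisation
    for _ in range(max_epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):                   # go over the entire data ...
            dw += grad_w(x, y, w, b)
            db += grad_b(x, y, w, b)
        w = w - eta * dw                         # ... before making a single update
        b = b - eta * db
    return w, b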
Notice that the algorithm updates the parameters for every single data point
Now if we have a million data points we will make a million updates in each epoch (1 epoch = 1 pass over the data; 1 step = 1 update)
What is the flipside ? It is an approximate (rather stochastic) gradient
[Stochastic because we are estimating the total gradient based on a single data point. Almost like tossing a coin only once and estimating P(heads).]
No guarantee that each step will decrease the loss
Let’s see this algorithm in action when we have a few data points
53 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
[Figure: stochastic gradient descent trajectory on the error surface (axes w, b), showing many oscillations]
54 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
We see many oscillations. Why ? Because we are making greedy decisions.
Each point is trying to push the parameters in a direction most favorable to it (without being aware of how this affects other points)
A parameter update which is locally favorable to one point may harm other points (it’s almost as if the data points are competing with each other)
Can we reduce the oscillations by improving our stochastic estimates of the gradient (currently estimated from just 1 data point at a time)?
Yes, let’s look at mini-batch gradient descent
55 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Notice that the algorithm updates the parameters after it sees mini batch size number of data points
The stochastic estimates are now slightly better
Let’s see this algorithm in action when we have k = 2
56 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
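A sketch of the same loop with an update after every mini batch of points (illustrative; it reuses the grad_w and grad_b helpers from the batch sketch above). Note that a mini batch size of 1 recovers stochastic gradient descent and a mini batch size of N recovers batch gradient descent.

# Minimal sketch of mini-batch gradient descent (illustrative only);
# grad_w and grad_b are the per-point gradients from the batch sketch above.
def do_minibatch_gradient_descent(X, Y, eta=1.0, mini_batch_size=2, max_epochs=1000):
    w, b = -2.0, -2.0
    for _ in range(max_epochs):
        dw, db, num_points_seen = 0.0, 0.0, 0
        for x, y in zip(X, Y):
            dw += grad_w(x, y, w, b)
            db += grad_b(x, y, w, b)
            num_points_seen += 1
            if num_points_seen % mini_batch_size == 0:   # update after every mini batch
                w = w - eta * dw
                b = b - eta * db
                dw, db = 0.0, 0.0
    return w, b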
[Figure: mini-batch gradient descent (k = 2) trajectory on the error surface (axes w, b)]
57 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Even with a batch size of k=2 the oscillations have reduced slightly. Why ?
Because we now have slightly better estimates of the gradient [analogy: we are now tossing the coin k=2 times to estimate P(heads)]
The higher the value of k the more accurate are the estimates
In practice, typical values of k are 16, 32, 64
Of course, there are still oscillations and they will always be there as long as we are using an approximate gradient as opposed to the true gradient
58 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Some things to remember ....
1 epoch = one pass over the entire data
1 step = one update of the parameters
N = number of data points
B = Mini batch size

Algorithm                            # of steps in 1 epoch
Vanilla (Batch) Gradient Descent     1
Stochastic Gradient Descent          N
Mini-Batch Gradient Descent          N/B
59 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
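For example, with N = 1,000,000 training points and B = 100, one epoch corresponds to 1 update for batch gradient descent, 1,000,000 updates for stochastic gradient descent, and N/B = 10,000 updates for mini-batch gradient descent.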
Similarly, we can have stochastic versions of Momentum based gradient descent and Nesterov accelerated gradient descent
60 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
61 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
62 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
[Figure: stochastic gradient descent (black, top), stochastic momentum (red) and stochastic NAG (blue) trajectories on the error surface (axes w, b)]
While the stochastic versions of both Momentum [red] and NAG [blue] exhibit oscillations, the relative advantage of NAG over Momentum still holds (i.e., NAG takes relatively shorter u-turns)
Further, both of them are faster than stochastic gradient descent (after 60 steps, stochastic gradient descent [black - top figure] still exhibits a very high error whereas NAG and Momentum are close to convergence)
63 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
And, of course, you can also have the mini batch
version of Momentum and NAG...I leave that as an
exercise :-)
64 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Module 5.7 : Tips for Adjusting Learning Rate and Momentum
65 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Before moving on to advanced optimization
algorithms let us revisit the problem of learning rate
in gradient descent
66 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
[Figure: gradient descent with learning rate η = 10 on the error surface (axes w, b)]
One could argue that we could have solved the problem of navigating gentle slopes by setting the learning rate high (i.e., blow up the small gradient by multiplying it with a large η)
Let us see what happens if we set the learning rate to 10
On the regions which have a steep slope, the already large gradient blows up further
It would be good to have a learning rate which could adjust to the gradient ... we will see a few such algorithms soon
67 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Tips for initial learning rate ?
Tune the learning rate [try different values on a log scale: 0.0001, 0.001, 0.01, 0.1, 1.0]
Run a few epochs with each of these and figure out a learning rate which works best
Now do a finer search around this value [for example, if the best learning rate was 0.1 then now try some values around it: 0.05, 0.2, 0.3]
Disclaimer: these are just heuristics ... no clear winner strategy
68 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
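A sketch of this two-stage search (illustrative; validation_loss_for is a stand-in for whatever routine trains for a few epochs with a given η and returns the validation loss):

# Illustrative coarse-then-fine learning-rate search (not from the slides).
def pick_learning_rate(validation_loss_for, coarse=(0.0001, 0.001, 0.01, 0.1, 1.0)):
    best = min(coarse, key=validation_loss_for)        # coarse search on a log scale
    finer = (0.5 * best, best, 2 * best, 3 * best)      # finer search around the best value
    return min(finer, key=validation_loss_for)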
Tips for annealing learning rate
Step Decay:
Halve the learning rate after every 5 epochs or
Halve the learning rate after an epoch if the validation error is more than what it was at the end of the previous epoch
Exponential Decay: η = η_0 · e^{−kt} where η_0 and k are hyperparameters and t is the step number
1/t Decay: η = η_0 / (1 + kt) where η_0 and k are hyperparameters and t is the step number
69 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
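As a small illustrative sketch of these schedules (the function names and default values are assumptions, not from the slides):

import math

def step_decay(eta0, epoch, drop=0.5, every=5):
    # halve the learning rate after every `every` epochs
    return eta0 * (drop ** (epoch // every))

def exponential_decay(eta0, k, t):
    # eta = eta0 * exp(-k*t), with t the step number
    return eta0 * math.exp(-k * t)

def one_by_t_decay(eta0, k, t):
    # eta = eta0 / (1 + k*t), with t the step number
    return eta0 / (1 + k * t)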
Tips for momentum
The following schedule was suggested by Sutskever et al., 2013
70 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Module 5.8 : Line Search
71 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Just one last thing before we move on to some other
algorithms ...
72 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
In practice, often a line search is done to find a relatively better value of η
Update w using different values of η
Now retain that updated value of w which gives the lowest loss
Essentially, at each step we are trying to use the best η value from the available choices
What’s the flipside? We are doing many more computations in each step
We will come back to this when we talk about second order optimization methods
73 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
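A minimal sketch of this idea (illustrative; loss and grad are assumed to return the loss and its gradient at a given w, and the candidate η values are arbitrary):

# Gradient descent with a simple line search over eta (illustrative only).
def gradient_descent_with_line_search(w, loss, grad,
                                      etas=(0.1, 0.5, 1.0, 5.0, 10.0), max_iters=100):
    for _ in range(max_iters):
        g = grad(w)
        candidates = [w - eta * g for eta in etas]   # try every candidate learning rate
        w = min(candidates, key=loss)                # retain the w with the lowest loss
    return w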
Let us see line search in action
[Figure: gradient descent with line search on the error surface (axes w, b)]
Convergence is faster than vanilla gradient descent
We see some oscillations, but note that these oscillations are different from what we see in momentum and NAG
74 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Module 5.9 : Gradient Descent with Adaptive Learning
Rate
75 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
[Network: inputs x1, x2, x3, x4 feeding a sigmoid neuron σ with output y]
y = f(x) = 1 / (1 + e^{−(w·x+b)})
x = {x1, x2, x3, x4}, w = {w1, w2, w3, w4}
Given this network, it should be easy to see that given a single point (x, y)...
∇w1 = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x1
∇w2 = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x2 ... and so on
If there are n points, we can just sum the gradients over all the n points to get the total gradient
What happens if the feature x2 is very sparse? (i.e., if its value is 0 for most inputs)
∇w2 will be 0 for most inputs (see formula) and hence w2 will not get enough updates
If x2 happens to be sparse as well as important we would want to take the updates to w2 more seriously
Can we have a different learning rate for each parameter which takes care of the frequency of features ?
76 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
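A tiny illustrative check of this point (the numbers below are made up): whenever x2 = 0, the formula forces ∇w2 = 0, so w2 receives no update from that data point.

import math

def forward(x, w, b):                       # 4-input sigmoid neuron
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def gradients(x, y, w, b):
    fx = forward(x, w, b)
    common = (fx - y) * fx * (1 - fx)
    return [common * xi for xi in x]        # gradient w.r.t. w_i = (f(x)-y)*f(x)*(1-f(x))*x_i

x = [0.7, 0.0, 0.3, 0.1]                    # x2 is 0 for this input (sparse feature)
print(gradients(x, 1.0, [0.1, 0.2, 0.3, 0.4], 0.0))
# the second entry is exactly 0, so w2 is not updated by this point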
Intuition
Decay the learning rate for parameters in proportion to their update history (more updates means more decay)
Update rule for Adagrad
v_t = v_{t−1} + (∇w_t)²
w_{t+1} = w_t − (η / √(v_t + ε)) ∗ ∇w_t
77 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
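A minimal sketch of this rule for a single parameter (illustrative; grad and the values of η and ε are assumptions):

import math

# Minimal sketch of Adagrad for one parameter (illustrative only);
# grad(w) is assumed to return the gradient of the loss at w.
def do_adagrad(w, grad, eta=0.1, eps=1e-8, max_iters=1000):
    v = 0.0
    for _ in range(max_iters):
        g = grad(w)
        v = v + g ** 2                           # accumulate squared gradients (update history)
        w = w - (eta / math.sqrt(v + eps)) * g   # parameters with a larger history take smaller steps
    return w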
To see this in action we need to first create some data where one of the features is sparse
How would we do this in our toy network?
Take some time to think about it
Well, our network has just two parameters (w and b). Of these, the input/feature corresponding to b is always on (so we can't really make it sparse)
The only option is to make x sparse
Solution: We created 100 random (x, y) pairs and then for roughly 80% of these pairs we set x to 0, thereby making the feature for w sparse
78 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
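A rough sketch of how such a dataset could be generated; the lecture does not specify the exact distributions or random seed, so the choices below are assumptions.

import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.uniform(0, 2.5, size=N)     # inputs x
Y = rng.uniform(0, 1, size=N)       # targets y
mask = rng.random(N) < 0.8          # roughly 80% of the points
X[mask] = 0.0                       # make the feature corresponding to w sparse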
GD (black), momentum (red) and NAG (blue)
There is something interesting that these 3 algorithms are doing for this dataset. Can you spot it?
Initially, all three algorithms are moving mainly along the vertical (b) axis and there is very little movement along the horizontal (w) axis
Why? Because in our data, the feature corresponding to w is sparse and hence w undergoes very few updates ...on the other hand b is very dense and undergoes many updates
Such sparsity is very common in large neural networks containing 1000s of input features and hence we need to address it
Let's see what Adagrad does....
79 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
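To see this effect numerically, one could run plain gradient descent on the sparse toy data and compare the two partial derivatives; the sketch below is an assumption about the setup (initial values, learning rate), not the lecture's actual code.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 2.5, 100); Y = rng.uniform(0, 1, 100)
X[rng.random(100) < 0.8] = 0.0              # ~80% of the x values set to 0, as on the previous slide

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, eta = -2.0, -2.0, 1.0
for epoch in range(100):
    dw, db = 0.0, 0.0
    for x, y in zip(X, Y):
        fx = sigmoid(w * x + b)
        err = (fx - y) * fx * (1 - fx)
        dw += err * x                       # contributes 0 whenever x = 0 (most points)
        db += err                           # contributes for every point
    w, b = w - eta * dw, b - eta * db
print(w, b)                                 # compare how far w and b have moved from their initial values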
By using a parameter-specific learning rate, Adagrad ensures that despite sparsity w gets a higher effective learning rate and hence larger updates
Further, it also ensures that if b undergoes a lot of updates its effective learning rate decreases because of the growing denominator
In practice, this does not work so well if we remove the square root from the denominator (something to ponder about)
What's the flipside? Over time the effective learning rate for b will decay to an extent that there will be no further updates to b
Can we avoid this?
80 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Intuition
Adagrad decays the learning rate very aggressively (as the denominator grows)
As a result, after a while the frequent parameters will start receiving very small updates because of the decayed learning rate
To avoid this, why not decay the denominator and prevent its rapid growth?
Update rule for RMSProp
vt = β ∗ vt−1 + (1 − β) ∗ (∇wt)²
wt+1 = wt − (η / (√vt + ε)) ∗ ∇wt
81 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
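A minimal sketch of this update for a single parameter; the function name rmsprop_step and the values of β, η and ε are illustrative, not from the lecture code.

import numpy as np

def rmsprop_step(w, v, dw, eta=0.01, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * dw ** 2        # exponentially decayed history of squared gradients
    w = w - (eta / (np.sqrt(v) + eps)) * dw    # denominator grows much more slowly than in Adagrad
    return w, v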
Adagrad got stuck when it was close to convergence (it was no longer able to move in the vertical (b) direction because of the decayed learning rate)
RMSProp overcomes this problem by being less aggressive on the decay
82 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Intuition
Do everything that RMSProp does to solve the decay problem of Adagrad
Plus use a cumulative history of the gradients
In practice, β1 = 0.9 and β2 = 0.999
Update rule for Adam
mt = β1 ∗ mt−1 + (1 − β1) ∗ ∇wt
vt = β2 ∗ vt−1 + (1 − β2) ∗ (∇wt)²
m̂t = mt / (1 − β1^t)
v̂t = vt / (1 − β2^t)
wt+1 = wt − (η / (√v̂t + ε)) ∗ m̂t
... and a similar set of equations for bt
83 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
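A minimal sketch of the Adam update for one parameter, assuming the step counter t starts at 1 so that the bias-correction factors (1 − β^t) are well defined; the function name adam_step and the default values are illustrative.

import numpy as np

def adam_step(w, m, v, dw, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dw           # running average of gradients
    v = beta2 * v + (1 - beta2) * dw ** 2      # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction (explained on the later slides)
    v_hat = v / (1 - beta2 ** t)
    w = w - (eta / (np.sqrt(v_hat) + eps)) * m_hat
    return w, m, v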
As expected, taking a cumulative history gives a speed up ...
84 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Million dollar question: Which algorithm to use in practice?
Adam seems to be more or less the default choice now (β1 = 0.9, β2 = 0.999 and ε = 1e−8)
Although it is supposed to be robust to initial learning rates, we have observed that for sequence generation problems η = 0.001, 0.0001 works best
Having said that, many papers report that SGD with momentum (Nesterov or classical) with a simple annealing learning rate schedule also works well in practice (typically, starting with η = 0.001, 0.0001 for sequence generation problems)
Adam might just be the best choice overall!!
Some recent work suggests that there is a problem with Adam and it will not converge in some cases
85 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Explanation for why we need bias correction in Adam
86 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
Update rule for Adam
mt = β1 ∗ mt−1 + (1 − β1) ∗ ∇wt
vt = β2 ∗ vt−1 + (1 − β2) ∗ (∇wt)²
m̂t = mt / (1 − β1^t)
v̂t = vt / (1 − β2^t)
wt+1 = wt − (η / (√v̂t + ε)) ∗ m̂t
Note that we are taking a running average of the gradients as mt
The reason we are doing this is that we don't want to rely too much on the current gradient and instead rely on the overall behaviour of the gradients over many timesteps
One way of looking at this is that we are interested in the expected value of the gradients and not in a single point estimate computed at time t
However, instead of computing E[∇wt] we are computing mt as the exponentially moving average
Ideally we would want E[mt] to be equal to E[∇wt]
Let us see if that is the case
87 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
For convenience we will denote ∇wt as gt and β1 as β
mt = β ∗ mt−1 + (1 − β) ∗ gt
m0 = 0
m1 = βm0 + (1 − β)g1 = (1 − β)g1
m2 = βm1 + (1 − β)g2 = β(1 − β)g1 + (1 − β)g2
m3 = βm2 + (1 − β)g3 = β(β(1 − β)g1 + (1 − β)g2) + (1 − β)g3
   = β²(1 − β)g1 + β(1 − β)g2 + (1 − β)g3
   = (1 − β) Σ_{i=1}^{3} β^{3−i} gi
In general,
mt = (1 − β) Σ_{i=1}^{t} β^{t−i} gi
88 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
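The closed form can be verified numerically against the recursive definition; the short check below is an illustration, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)
beta, g = 0.9, rng.normal(size=10)
m = 0.0
for gt in g:
    m = beta * m + (1 - beta) * gt            # recursive definition of m_t
closed = (1 - beta) * sum(beta ** (len(g) - i) * gi for i, gi in enumerate(g, start=1))
print(np.isclose(m, closed))                  # True: both give the same value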
So we have, mt = (1 − β) Σ_{i=1}^{t} β^{t−i} gi
Taking Expectation on both sides
E[mt] = E[(1 − β) Σ_{i=1}^{t} β^{t−i} gi]
E[mt] = (1 − β) E[Σ_{i=1}^{t} β^{t−i} gi]
E[mt] = (1 − β) Σ_{i=1}^{t} E[β^{t−i} gi]
      = (1 − β) Σ_{i=1}^{t} β^{t−i} E[gi]
Assumption: All gi's come from the same distribution, i.e., E[gi] = E[g] ∀i
E[mt] = (1 − β) Σ_{i=1}^{t} β^{t−i} E[g]
      = E[g](1 − β) Σ_{i=1}^{t} β^{t−i}
      = E[g](1 − β)(β^{t−1} + β^{t−2} + · · · + β^0)
      = E[g](1 − β) (1 − β^t)/(1 − β)
(the last fraction is the sum of a GP with common ratio = β)
E[mt] = E[g](1 − β^t)
E[mt / (1 − β^t)] = E[g]
E[m̂t] = E[g]  (∵ mt / (1 − β^t) = m̂t)
Hence we apply the bias correction because then the expected value of m̂t is the same as the expected value of gt
89 / 89
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 5
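The conclusion can also be checked with a small simulation (again an assumption, not part of the lecture): for gradients drawn i.i.d. with mean μ, the uncorrected mt has mean μ(1 − β^t), which is biased toward 0 for small t, while the corrected m̂t has mean μ.

import numpy as np

rng = np.random.default_rng(0)
beta, mu, t, runs = 0.9, 2.0, 5, 100000
m_t = np.zeros(runs)
for step in range(1, t + 1):
    g = rng.normal(loc=mu, scale=1.0, size=runs)   # one gradient sample per run at this step
    m_t = beta * m_t + (1 - beta) * g
print(m_t.mean())                                  # ≈ mu * (1 - beta**t) ≈ 0.82, biased toward 0
print((m_t / (1 - beta ** t)).mean())              # ≈ mu = 2.0 after bias correction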