Lecture 3 - Gradient Descent - IITM - 23-1-200
Mitesh M. Khapra
Acknowledgements
• For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on "visualize backpropagation" (available on YouTube)
• For Module 3.5, I have borrowed ideas from this excellent book, which is available online: http://neuralnetworksanddeeplearning.com/chap4.html
• I am sure I would have been influenced by and borrowed ideas from other sources, and I apologize if I have failed to acknowledge them
Module 3.1: Sigmoid Neuron
The story ahead ...
• Enough about boolean functions!
• What about arbitrary functions of the form y = f(x), where x ∈ R^n (instead of {0, 1}^n) and y ∈ R (instead of {0, 1})?
• Can we have a network which can (approximately) represent such functions?
• Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...
Recall
• A perceptron will fire if the weighted sum of its inputs is greater than the threshold (−w0)
[Figure: a perceptron with input x1, weight w1 = 1, and bias w0 = −0.5, producing output y]
• The thresholding logic used by a perceptron is very harsh!
• For example, let us return to our problem of deciding whether we will like or dislike a movie
[Figure: the perceptron output y as a function of z = Σ_{i=1}^{n} w_i x_i — a hard step from 0 to 1 at the threshold −w0]
• This behavior is not a characteristic of the specific problem we chose or the specific weights and threshold that we chose
• It is a characteristic of the perceptron function itself
• For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1

[Figure: the sigmoid output y as a function of z = Σ_{i=1}^{n} w_i x_i — a smooth S-shaped curve from 0 to 1]
• Introducing sigmoid neurons, where the output function is much smoother than the step function
• Here is one form of the sigmoid function, called the logistic function:

      y = 1 / (1 + e^{−(w0 + Σ_{i=1}^{n} w_i x_i)})

• We no longer see a sharp transition around the threshold −w0
• Also, the output y is no longer binary but a real value between 0 and 1, which can be interpreted as a probability
• Instead of a like/dislike decision we get the probability of liking the movie
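The contrast between the harsh thresholding of the perceptron and the smooth logistic output can be seen numerically. A minimal sketch (the helper names `step` and `sigmoid` are my own):

```python
import math

def step(z):
    # Perceptron thresholding: output jumps from 0 to 1 exactly at z = 0
    return 1 if z >= 0 else 0

def sigmoid(z):
    # Logistic function: output changes gradually from 0 to 1
    return 1 / (1 + math.exp(-z))

# Near the threshold, a tiny change in z flips the perceptron's output,
# while the sigmoid's output barely moves
for z in (-0.1, 0.0, 0.1):
    print(z, step(z), round(sigmoid(z), 3))
```

Crossing z = 0 flips the step output from 0 to 1, while the sigmoid moves only from about 0.475 to 0.525.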
Perceptron vs Sigmoid (logistic) Neuron
[Figure: both neurons take inputs x0 = 1, x1, ..., xn with weights w0 = −θ, w1, ..., wn and produce output y]

Perceptron:
      y = 1 if Σ_{i=0}^{n} w_i x_i ≥ 0
        = 0 if Σ_{i=0}^{n} w_i x_i < 0

Sigmoid (logistic) neuron:
      y = 1 / (1 + e^{−Σ_{i=0}^{n} w_i x_i})
Perceptron vs Sigmoid Neuron
[Figure: the step function vs the smooth sigmoid curve, both going from 0 to 1 as functions of z = Σ_{i=1}^{n} w_i x_i]
• Perceptron: not smooth, not continuous (at the threshold −w0), not differentiable
• Sigmoid: smooth, continuous, differentiable
Module 3.2: A typical Supervised Machine Learning Setup
Sigmoid (logistic) Neuron
[Figure: inputs x0 = 1, x1, ..., xn with weights w0 = −θ, w1, ..., wn feeding a sigmoid neuron with output y]
• What next?
• Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
• Before we see such an algorithm we will revisit the concept of error
• Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable
• What does "cannot deal with" mean?
• What would happen if we use a perceptron model to classify this data?
• We would probably end up with a line like this ...
• This line doesn't seem to be too bad
• Sure, it misclassifies 3 blue points and 3 red points, but we could live with this error in most real world applications
• From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error
This brings us to a typical machine learning setup which has the following components...
• Data: {x_i, y_i}_{i=1}^{n}
• Model: Our approximation of the relation between x and y. For example,

      ŷ = 1 / (1 + e^{−(w^T x)})

  or ŷ = w^T x, or ŷ = x^T W x, or just about any function
• Parameters: In all the above cases, w is a parameter which needs to be learned from the data
• Learning algorithm: An algorithm for learning the parameters (w) of the model (for example, the perceptron learning algorithm, gradient descent, etc.)
• Objective/Loss/Error function: To guide the learning algorithm - the learning algorithm should aim to minimize the loss function
As an illustration, consider our movie example
• Data: {x_i = movie, y_i = like/dislike}_{i=1}^{n}
• Model: Our approximation of the relation between x and y (the probability of liking a movie):

      ŷ = 1 / (1 + e^{−(w^T x)})

• Parameter: w
• Learning algorithm: Gradient Descent [we will see soon]
• Objective/Loss/Error function: One possibility is

      L(w) = Σ_{i=1}^{n} (ŷ_i − y_i)^2

  (the squared error between y and ŷ). The learning algorithm should aim to find a w which minimizes this function.
Module 3.3: Learning Parameters: (Infeasible) guess work
[Figure: a sigmoid neuron with a single input x, weight w, bias b, and output ŷ = f(x) = 1 / (1 + e^{−(w·x+b)})]
• Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function
• σ stands for the sigmoid function (the logistic function in this case)
• For ease of explanation, we will consider a very simplified version of the model having just 1 input
• Further, to be consistent with the literature, from now on we will refer to w0 as b (bias)
• Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating (y) given imdbRating (x) (for no particular reason)
[Figure: the same neuron, x → σ → ŷ = f(x), with parameters w and b, where f(x) = 1 / (1 + e^{−(w·x+b)})]

Input for training
{x_i, y_i}_{i=1}^{N} → N pairs of (x, y)

Training objective
Find w and b such that:

      minimize over (w, b):  L(w, b) = Σ_{i=1}^{N} (y_i − f(x_i))^2

What does it mean to train the network?
• Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9)
• At the end of training we expect to find w*, b* such that:
• f(0.5) → 0.2 and f(2.5) → 0.9

In other words...
• We hope to find a sigmoid function such that (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid
Let us see this in more detail....
      σ(x) = 1 / (1 + e^{−(wx+b)})

• Can we try to find such a w*, b* manually?
• Let us try a random guess.. (say, w = 0.5, b = 0)
• Clearly not good, but how bad is it?
• Let us revisit L(w, b) to see how bad it is ...

      L(w, b) = (1/2) * Σ_{i=1}^{N} (y_i − f(x_i))^2
              = (1/2) * [(y_1 − f(x_1))^2 + (y_2 − f(x_2))^2]
              = (1/2) * [(0.9 − f(2.5))^2 + (0.2 − f(0.5))^2]
              = 0.073
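We can check this arithmetic directly. A small sketch (the function names are my own) that evaluates L(w, b) for the guess w = 0.5, b = 0 on the two training points:

```python
import math

def f(x, w, b):
    # the sigmoid neuron: f(x) = 1 / (1 + e^{-(wx + b)})
    return 1 / (1 + math.exp(-(w * x + b)))

def loss(w, b, data):
    # L(w, b) = 1/2 * sum over points of (y_i - f(x_i))^2
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]
print(round(loss(0.5, 0.0, data), 3))   # 0.073, matching the slide
```

The same two functions reproduce the other loss values tried in the table that follows.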
Let us try some other values of w, b

      w       b        L(w, b)
      0.50    0.00     0.0730
      -0.10   0.00     0.1481
      0.94    -0.94    0.0214
      1.42    -1.73    0.0028
      1.65    -2.08    0.0003
      1.78    -2.27    0.0000

Let us keep going in this direction, i.e., increase w and decrease b. With some guess work and intuition we were able to find the right values for w and b.
Let us look at something better than our “guess work” algorithm....
• Since we have only 2 points and 2 parameters (w, b), we can easily plot L(w, b) for different values of (w, b) and pick the one where L(w, b) is minimum
• But of course this becomes intractable once you have many more data points and many more parameters !!
• Further, even here we have plotted the error surface only for a small range of (w, b) [from (−6, 6) and not from (−∞, ∞)]
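The brute-force idea above can be sketched in a few lines: evaluate L(w, b) over a grid covering [−6, 6] × [−6, 6] and keep the best cell (the grid resolution and helper names are my choices):

```python
import math

def f(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

def loss(w, b, data):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]

# 121 x 121 grid over [-6, 6] x [-6, 6] with 0.1 spacing:
# already 14641 loss evaluations for just 2 parameters
grid = [-6 + 0.1 * i for i in range(121)]
best_loss, best_w, best_b = min(
    (loss(w, b, data), w, b) for w in grid for b in grid
)
print(round(best_w, 1), round(best_b, 1))  # near the hand-found (1.78, -2.27)
```

Doubling the resolution quadruples the work, and every extra parameter multiplies it by another grid dimension, which is exactly why this approach does not scale.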
Let us look at the geometric interpretation of our "guess work" algorithm in terms of this error surface
[Figure: the successive guesses from the table plotted as points on the error surface, moving toward the minimum]
Module 3.4: Learning Parameters : Gradient Descent
Now let us see if there is a more efficient and principled way of doing this
Goal
Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!
[Figure: a move in parameter space from θ to θ_new along direction ∆θ, with the conservative step η · ∆θ]
• θ = [w, b] is the vector of parameters, say, randomly initialized
• ∆θ = [∆w, ∆b] is the change in the values of w, b
• We moved in the direction of ∆θ
• Let us be a bit conservative: move only by a small amount η

      θ_new = θ + η · ∆θ
For ease of notation, let ∆θ = u. Then from the Taylor series, we have,

      L(θ + ηu) = L(θ) + η * u^T ∇_θ L(θ) + (η^2/2!) * u^T ∇_θ^2 L(θ) u + (η^3/3!) * ... + (η^4/4!) * ...
                = L(θ) + η * u^T ∇_θ L(θ)        [η is typically small, so η^2, η^3, ... → 0]

Note that the move (ηu) would be favorable only if

      L(θ + ηu) − L(θ) < 0        [i.e., if the new loss is less than the previous loss]

This implies,

      u^T ∇_θ L(θ) < 0
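The first-order truncation can be sanity-checked numerically on our toy loss: for a small η, L(θ + ηu) should match L(θ) + η u^T ∇_θ L(θ) up to O(η²). A sketch using a finite-difference gradient (all helper names are mine):

```python
import math

def f(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

def loss(theta, data):
    w, b = theta
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

def grad(theta, data, eps=1e-6):
    # central finite differences approximate the gradient of L at theta
    g = []
    for i in range(len(theta)):
        tp, tm = list(theta), list(theta)
        tp[i] += eps
        tm[i] -= eps
        g.append((loss(tp, data) - loss(tm, data)) / (2 * eps))
    return g

data = [(0.5, 0.2), (2.5, 0.9)]
theta = [0.5, 0.0]
u = [1.0, -1.0]   # an arbitrary direction
eta = 1e-3        # small step: higher-order terms become negligible

g = grad(theta, data)
exact = loss([theta[0] + eta * u[0], theta[1] + eta * u[1]], data)
first_order = loss(theta, data) + eta * (u[0] * g[0] + u[1] * g[1])
print(abs(exact - first_order))   # tiny: the O(eta^2) remainder
```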
Okay, so we have,

      u^T ∇_θ L(θ) < 0

But what is the range of u^T ∇_θ L(θ)? Let β be the angle between u and ∇_θ L(θ), and let k = ||u|| * ||∇_θ L(θ)||. Then,

      −1 ≤ cos(β) = (u^T ∇_θ L(θ)) / (||u|| * ||∇_θ L(θ)||) ≤ 1

Multiplying throughout by k, we get

      −k ≤ k * cos(β) = u^T ∇_θ L(θ) ≤ k

Thus, u^T ∇_θ L(θ) is most negative when cos(β) = −1, i.e., when β is 180°.
Gradient Descent Rule
• The direction u that we intend to move in should be at 180° w.r.t. the gradient
• In other words, move in a direction opposite to the gradient

      w_{t+1} = w_t − η∇w_t
      b_{t+1} = b_t − η∇b_t

      where ∇w_t = ∂L(w, b)/∂w evaluated at w = w_t, b = b_t, and ∇b_t = ∂L(w, b)/∂b evaluated at w = w_t, b = b_t

So we now have a more principled way of moving in the w-b plane than our "guess work" algorithm
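We can verify that 180° is indeed the best direction: take a point on our toy loss surface, try unit directions at several angles β to the gradient, and see which one reduces the loss the most. A sketch (the angles, step size, and names are my choices):

```python
import math

def f(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

def loss(w, b, data):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]
w, b = 0.5, 0.0
eta, eps = 0.01, 1e-6

# finite-difference gradient of L at (w, b)
gw = (loss(w + eps, b, data) - loss(w - eps, b, data)) / (2 * eps)
gb = (loss(w, b + eps, data) - loss(w, b - eps, data)) / (2 * eps)
norm = math.hypot(gw, gb)

base = loss(w, b, data)
change = {}
for deg in range(0, 360, 45):
    # unit vector at angle `deg` to the gradient direction
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    uw = (c * gw - s * gb) / norm
    ub = (s * gw + c * gb) / norm
    change[deg] = loss(w + eta * uw, b + eta * ub, data) - base

best = min(change, key=change.get)
print(best, change[best] < 0)   # the direction opposite to the gradient wins
```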
• Let us create an algorithm from this rule ...

Algorithm: gradient_descent()
t ← 0;
max_iterations ← 1000;
while t < max_iterations do
    w_{t+1} ← w_t − η∇w_t;
    b_{t+1} ← b_t − η∇b_t;
    t ← t + 1;
end

• To see this algorithm in practice let us first derive ∇w and ∇b for our toy neural network
[Figure: the toy network x → σ → ŷ = f(x), where f(x) = 1 / (1 + e^{−(w·x+b)})]
      ∇w = ∂/∂w [ (1/2) * (f(x) − y)^2 ]
         = (1/2) * [ 2 * (f(x) − y) * ∂/∂w (f(x) − y) ]
         = (f(x) − y) * ∂/∂w (f(x))
         = (f(x) − y) * ∂/∂w [ 1 / (1 + e^{−(wx+b)}) ]

where

      ∂/∂w [ 1 / (1 + e^{−(wx+b)}) ]
         = (−1 / (1 + e^{−(wx+b)})^2) * ∂/∂w (e^{−(wx+b)})
         = (−1 / (1 + e^{−(wx+b)})^2) * e^{−(wx+b)} * ∂/∂w (−(wx + b))
         = (−1 / (1 + e^{−(wx+b)})) * (e^{−(wx+b)} / (1 + e^{−(wx+b)})) * (−x)
         = (1 / (1 + e^{−(wx+b)})) * (e^{−(wx+b)} / (1 + e^{−(wx+b)})) * x
         = f(x) * (1 − f(x)) * x

So,

      ∇w = (f(x) − y) * f(x) * (1 − f(x)) * x
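A quick numerical check of the derived formula: compare (f(x) − y) * f(x) * (1 − f(x)) * x against a central finite difference of the per-example loss (helper names are mine):

```python
import math

def f(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

def grad_w_analytic(x, y, w, b):
    # the expression we just derived
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx) * x

def grad_w_numeric(x, y, w, b, eps=1e-6):
    # central finite difference of 1/2 * (f(x) - y)^2 w.r.t. w
    lp = 0.5 * (f(x, w + eps, b) - y) ** 2
    lm = 0.5 * (f(x, w - eps, b) - y) ** 2
    return (lp - lm) / (2 * eps)

x, y, w, b = 2.5, 0.9, 0.5, 0.0
print(abs(grad_w_analytic(x, y, w, b) - grad_w_numeric(x, y, w, b)) < 1e-8)
```

The two values agree to numerical precision, which is a standard way to catch mistakes in hand-derived gradients.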
[Figure: the toy network x → σ → ŷ = f(x)]

∇b is derived identically, except that ∂/∂b (−(wx + b)) = −1 replaces −x. For our two training points, the gradients are accumulated over both examples:

      ∇w = Σ_{i=1}^{2} (f(x_i) − y_i) * f(x_i) * (1 − f(x_i)) * x_i

      ∇b = Σ_{i=1}^{2} (f(x_i) − y_i) * f(x_i) * (1 − f(x_i))
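Putting the update rule and the two gradients together, we can run gradient descent on the two training points and watch it recover values close to the hand-found (1.78, −2.27). A sketch (the learning rate, iteration count, and starting point are my choices):

```python
import math

def f(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

data = [(0.5, 0.2), (2.5, 0.9)]
w, b = 0.5, 0.0   # the earlier random guess
eta = 1.0

for t in range(10000):
    # gradients accumulated over both training points
    dw = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in data)
    db = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in data)
    w, b = w - eta * dw, b - eta * db

final_loss = 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)
print(round(w, 2), round(b, 2))   # should approach roughly (1.79, -2.28)
```

Unlike the guess-work table, no intuition about "increase w, decrease b" is needed: the gradient supplies the direction at every step.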
[Figure: the path taken by gradient descent on the error surface, converging toward the minimum]