
CS7015 (Deep Learning) : Lecture 3

Sigmoid Neurons, Gradient Descent, Feedforward Neural Networks,


Representation Power of Feedforward Neural Networks

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Acknowledgements
• For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on YouTube)
• For Module 3.5, I have borrowed ideas from this excellent book* which is available online
• I am sure I would have been influenced and borrowed ideas from other sources and I apologize if I have failed to acknowledge them

*http://neuralnetworksanddeeplearning.com/chap4.html
Module 3.1: Sigmoid Neuron

The story ahead ...
• Enough about boolean functions!
• What about arbitrary functions of the form $y = f(x)$ where $x \in \mathbb{R}^n$ (instead of $\{0,1\}^n$) and $y \in \mathbb{R}$ (instead of $\{0,1\}$)?
• Can we have a network which can (approximately) represent such functions?
• Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...
Recall
• A perceptron will fire if the weighted sum of its inputs is greater than the threshold ($-w_0$)
[Figure: a perceptron with a single input $x_1$ = criticsRating, weight $w_1 = 1$, bias $w_0 = -0.5$, and output $y$]

• The thresholding logic used by a perceptron is very harsh!
• For example, let us return to our problem of deciding whether we will like or dislike a movie
• Consider that we base our decision only on one input ($x_1$ = criticsRating, which lies between 0 and 1)
• If the threshold is 0.5 ($w_0 = -0.5$) and $w_1 = 1$ then what would be the decision for a movie with criticsRating = 0.51? (like)
• What about a movie with criticsRating = 0.49? (dislike)
• It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49
[Figure: the perceptron output $y$ plotted against $z = \sum_{i=1}^{n} w_i x_i$: a step function that jumps from 0 to 1 at $z = -w_0$]

• This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose
• It is a characteristic of the perceptron function itself which behaves like a step function
• There will always be this sudden change in the decision (from 0 to 1) when $\sum_{i=1}^{n} w_i x_i$ crosses the threshold ($-w_0$)
• For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1
[Figure: the sigmoid output $y$ plotted against $z = \sum_{i=1}^{n} w_i x_i$: a smooth S-shaped curve centered at $z = -w_0$]

• Introducing sigmoid neurons where the output function is much smoother than the step function
• Here is one form of the sigmoid function called the logistic function
$$y = \frac{1}{1 + e^{-(w_0 + \sum_{i=1}^{n} w_i x_i)}}$$
• We no longer see a sharp transition around the threshold $-w_0$
• Also the output $y$ is no longer binary but a real value between 0 and 1 which can be interpreted as a probability
• Instead of a like/dislike decision we get the probability of liking the movie
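The contrast between the two neurons on the movie example can be sketched in a few lines of Python (a minimal sketch; the weights $w_1 = 1$, $w_0 = -0.5$ are the ones from the slides):

```python
import math

def perceptron(x, w1=1.0, w0=-0.5):
    """Harsh thresholding: fire iff w0 + w1*x >= 0."""
    return 1 if w0 + w1 * x >= 0 else 0

def sigmoid_neuron(x, w1=1.0, w0=-0.5):
    """Logistic output: a smooth value in (0, 1)."""
    return 1 / (1 + math.exp(-(w0 + w1 * x)))

# Two movies on either side of the threshold 0.5:
# the perceptron flips from 0 to 1, while the sigmoid
# barely moves (roughly 0.498 vs 0.502).
for rating in (0.49, 0.51):
    print(rating, perceptron(rating), round(sigmoid_neuron(rating), 3))
```

The sigmoid output can now be read as a probability of liking the movie rather than a hard decision.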
Perceptron
[Figure: a neuron with inputs $x_0 = 1, x_1, \ldots, x_n$ and weights $w_0 = -\theta, w_1, \ldots, w_n$]
$$y = 1 \text{ if } \sum_{i=0}^{n} w_i x_i \geq 0; \qquad y = 0 \text{ if } \sum_{i=0}^{n} w_i x_i < 0$$

Sigmoid (logistic) Neuron
[Figure: the same neuron with a sigmoid activation]
$$y = \frac{1}{1 + e^{-\sum_{i=0}^{n} w_i x_i}}$$
[Figure: the step function vs. the smooth sigmoid curve, both plotted against $z = \sum_{i=1}^{n} w_i x_i$]

Perceptron: not smooth, not continuous (at the threshold $-w_0$), not differentiable
Sigmoid Neuron: smooth, continuous, differentiable
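Differentiability is exactly what gradient-based learning will need. A quick numerical check (a sketch; the closed form $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is the standard identity for the logistic function, not something derived in this excerpt):

```python
import math

def step(z):
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def numerical_derivative(f, z, h=1e-6):
    """Central-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# The sigmoid has a well-defined derivative everywhere,
# matching the closed form sigma(z) * (1 - sigma(z)).
for z in (-2.0, 0.0, 2.0):
    closed_form = sigmoid(z) * (1 - sigmoid(z))
    assert abs(numerical_derivative(sigmoid, z) - closed_form) < 1e-6

# The step function's "derivative" is 0 away from the jump and
# enormous (about 1/(2h)) at the discontinuity, so gradient-based
# learning gets no useful signal from it.
print(numerical_derivative(step, 2.0))   # 0.0
print(numerical_derivative(step, 0.0))
```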
Module 3.2: A typical Supervised Machine Learning Setup

Sigmoid (logistic) Neuron
[Figure: the sigmoid neuron with inputs $x_0 = 1, x_1, \ldots, x_n$ and weights $w_0 = -\theta, w_1, \ldots, w_n$]

• What next?
• Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
• Before we see such an algorithm we will revisit the concept of error
• Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable
• What does “cannot deal with” mean?
• What would happen if we use a perceptron model to classify this data?
• We would probably end up with a line like this ...
• This line doesn’t seem to be too bad
• Sure, it misclassifies 3 blue points and 3 red points but we could live with this error in most real world applications
• From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error
This brings us to a typical machine learning setup which has the following components...
• Data: $\{x_i, y_i\}_{i=1}^{n}$
• Model: Our approximation of the relation between $x$ and $y$. For example,
$$\hat{y} = \frac{1}{1 + e^{-w^T x}}$$
or $\hat{y} = w^T x$, or $\hat{y} = x^T W x$, or just about any function
• Parameters: In all the above cases, $w$ is a parameter which needs to be learned from the data
• Learning algorithm: An algorithm for learning the parameters ($w$) of the model (for example, perceptron learning algorithm, gradient descent, etc.)
• Objective/Loss/Error function: To guide the learning algorithm - the learning algorithm should aim to minimize the loss function
As an illustration, consider our movie example
• Data: $\{x_i = \text{movie}, y_i = \text{like/dislike}\}_{i=1}^{n}$
• Model: Our approximation of the relation between $x$ and $y$ (the probability of liking a movie).
$$\hat{y} = \frac{1}{1 + e^{-w^T x}}$$
• Parameter: $w$
• Learning algorithm: Gradient Descent [we will see soon]
• Objective/Loss/Error function: One possibility is
$$\mathcal{L}(w) = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
The learning algorithm should aim to find a $w$ which minimizes the above function (squared error between $y$ and $\hat{y}$)
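The loss above can be evaluated directly once a movie is encoded as a feature vector. A minimal sketch (the two-feature encoding, the labels, and the weight vector below are made-up placeholders, not data from the lecture):

```python
import math

def model(w, x):
    """Sigmoid model: predicted probability of liking the movie."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-z))

def loss(w, X, Y):
    """Sum of squared errors between predictions and labels."""
    return sum((model(w, x) - y) ** 2 for x, y in zip(X, Y))

# Hypothetical data: each movie is [bias input, criticsRating],
# the label is like (1) / dislike (0).
X = [[1.0, 0.9], [1.0, 0.2]]
Y = [1, 0]
w = [-0.5, 1.0]
print(loss(w, X, Y))
```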
Module 3.3: Learning Parameters: (Infeasible) guess work

[Figure: a sigmoid neuron with a single input $x$, weight $w$, bias $b$, and output $\hat{y} = f(x)$]
$$f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$

• Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function
• $\sigma$ stands for the sigmoid function (logistic function in this case)
• For ease of explanation, we will consider a very simplified version of the model having just 1 input
• Further, to be consistent with the literature, from now on, we will refer to $w_0$ as $b$ (bias)
• Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating ($y$) given imdbRating ($x$) (for no particular reason)
[Figure: the same neuron: input $x$, weight $w$, bias $b$, output $\hat{y} = f(x)$, with $f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$]

Input for training
$\{x_i, y_i\}_{i=1}^{N} \rightarrow N$ pairs of $(x, y)$

Training objective
Find $w$ and $b$ such that:
$$\underset{w,b}{\text{minimize}}\ \mathcal{L}(w, b) = \sum_{i=1}^{N} (y_i - f(x_i))^2$$
What does it mean to train the network?
• Suppose we train the network with $(x, y) = (0.5, 0.2)$ and $(2.5, 0.9)$
• At the end of training we expect to find $w^*, b^*$ such that:
• $f(0.5) \rightarrow 0.2$ and $f(2.5) \rightarrow 0.9$

In other words...
• We hope to find a sigmoid function such that $(0.5, 0.2)$ and $(2.5, 0.9)$ lie on this sigmoid
Let us see this in more detail....

$$\sigma(x) = \frac{1}{1 + e^{-(wx+b)}}$$

• Can we try to find such a $w^*, b^*$ manually?
• Let us try a random guess.. (say, $w = 0.5$, $b = 0$)
• Clearly not good, but how bad is it?
• Let us revisit $\mathcal{L}(w, b)$ to see how bad it is ...
$$\begin{aligned}
\mathcal{L}(w, b) &= \frac{1}{2} \sum_{i=1}^{N} (y_i - f(x_i))^2 \\
&= \frac{1}{2} \left[ (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 \right] \\
&= \frac{1}{2} \left[ (0.9 - f(2.5))^2 + (0.2 - f(0.5))^2 \right] \\
&= 0.073
\end{aligned}$$

We want $\mathcal{L}(w, b)$ to be as close to 0 as possible
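The computation above is easy to reproduce (a small sketch; `f` and `L` below implement exactly the model and loss just defined):

```python
import math

def f(x, w, b):
    """The sigmoid neuron's output."""
    return 1 / (1 + math.exp(-(w * x + b)))

def L(w, b, points):
    """Half the sum of squared errors over the training points."""
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in points)

points = [(0.5, 0.2), (2.5, 0.9)]
print(round(L(0.5, 0.0, points), 3))   # 0.073, as on the slide
```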
Let us try some other values of $w, b$

w       b       L(w, b)
0.50    0.00    0.0730
-0.10   0.00    0.1481    (Oops!! this made things even worse...)
0.94    -0.94   0.0214    (Perhaps it would help to push w and b in the other direction...)
1.42    -1.73   0.0028    (Let us keep going in this direction, i.e., increase w and decrease b)
1.65    -2.08   0.0003
1.78    -2.27   0.0000

With some guess work and intuition we were able to find the right values for w and b
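The whole table can be reproduced by looping over the guesses (a sketch; the printed losses match the table up to small rounding differences):

```python
import math

def f(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

def L(w, b, points=((0.5, 0.2), (2.5, 0.9))):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in points)

guesses = [(0.50, 0.00), (-0.10, 0.00), (0.94, -0.94),
           (1.42, -1.73), (1.65, -2.08), (1.78, -2.27)]
for w, b in guesses:
    print(f"w={w:5.2f}  b={b:5.2f}  L={L(w, b):.4f}")
```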
Let us look at something better than our “guess work” algorithm....

• Since we have only 2 points and 2 parameters $(w, b)$ we can easily plot $\mathcal{L}(w, b)$ for different values of $(w, b)$ and pick the one where $\mathcal{L}(w, b)$ is minimum
• But of course this becomes intractable once you have many more data points and many more parameters !!
• Further, even here we have plotted the error surface only for a small range of $(w, b)$ [from $(-6, 6)$ and not from $(-\infty, \infty)$]
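The brute-force search just described can be sketched as follows (for the two-point toy dataset $(0.5, 0.2), (2.5, 0.9)$ introduced earlier; the grid resolution 0.1 is an arbitrary choice). With two parameters this works, which is exactly why it will not scale:

```python
import math

def f(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

def L(w, b, points=((0.5, 0.2), (2.5, 0.9))):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in points)

# Evaluate L on a coarse grid over [-6, 6] x [-6, 6] and keep the minimum.
steps = [i * 0.1 - 6 for i in range(121)]
best = min((L(w, b), w, b) for w in steps for b in steps)
print(best)  # loss near 0, (w, b) near (1.78, -2.27)
```

Already this evaluates the loss 121 × 121 ≈ 14,600 times; with d parameters the grid grows as 121^d, which is why a principled search direction is needed.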
Let us look at the geometric interpretation of our “guess work”
algorithm in terms of this error surface

[Figure: a sequence of snapshots showing the “guess work” trajectory traced out on the error surface]
Module 3.4: Learning Parameters : Gradient Descent

Now let us see if there is a more efficient and principled way of
doing this

Goal
Find a better way of traversing the error surface so that we can reach the minimum value
quickly without resorting to brute force search!

[Figure: in the $(w, b)$ plane, we move from $\theta$ to $\theta_{new}$ along the direction $\Delta\theta$, scaled down by $\eta$]

$\theta = [w, b]$: vector of parameters, say, randomly initialized
$\Delta\theta = [\Delta w, \Delta b]$: change in the values of $w, b$

We moved in the direction of $\Delta\theta$. Let us be a bit conservative: move only by a small amount $\eta$
$$\theta_{new} = \theta + \eta \cdot \Delta\theta$$

Question: What is the right $\Delta\theta$ to use?

The answer comes from Taylor series
For ease of notation, let $\Delta\theta = u$, then from Taylor series, we have,
$$\begin{aligned}
\mathcal{L}(\theta + \eta u) &= \mathcal{L}(\theta) + \eta \, u^T \nabla_\theta \mathcal{L}(\theta) + \frac{\eta^2}{2!} u^T \nabla^2 \mathcal{L}(\theta) u + \frac{\eta^3}{3!} \cdots + \frac{\eta^4}{4!} \cdots \\
&= \mathcal{L}(\theta) + \eta \, u^T \nabla_\theta \mathcal{L}(\theta) \quad [\eta \text{ is typically small, so } \eta^2, \eta^3, \ldots \rightarrow 0]
\end{aligned}$$

Note that the move ($\eta u$) would be favorable only if,
$$\mathcal{L}(\theta + \eta u) - \mathcal{L}(\theta) < 0 \quad [\text{i.e., if the new loss is less than the previous loss}]$$

This implies,
$$u^T \nabla_\theta \mathcal{L}(\theta) < 0$$
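A quick numerical sanity check of the first-order Taylor approximation on our toy loss (a sketch; the direction `u`, the step size `eta`, and the starting point are arbitrary choices, and the gradient is approximated by central differences rather than derived analytically):

```python
import math

def f(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

def L(w, b, points=((0.5, 0.2), (2.5, 0.9))):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in points)

def grad(w, b, h=1e-6):
    """Numerical gradient of L via central differences."""
    dw = (L(w + h, b) - L(w - h, b)) / (2 * h)
    db = (L(w, b + h) - L(w, b - h)) / (2 * h)
    return dw, db

w, b = 0.5, 0.0
dw, db = grad(w, b)
eta = 1e-3
u = (1.0, -1.0)  # some direction
lhs = L(w + eta * u[0], b + eta * u[1])                # true loss after the move
rhs = L(w, b) + eta * (u[0] * dw + u[1] * db)          # first-order prediction
print(abs(lhs - rhs))  # tiny: the dropped terms are O(eta^2)
```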
Okay, so we have,
$$u^T \nabla_\theta \mathcal{L}(\theta) < 0$$

But, what is the range of $u^T \nabla_\theta \mathcal{L}(\theta)$? Let us see....

Let $\beta$ be the angle between $u$ and $\nabla_\theta \mathcal{L}(\theta)$, then we know that,
$$-1 \leq \cos(\beta) = \frac{u^T \nabla_\theta \mathcal{L}(\theta)}{||u|| \cdot ||\nabla_\theta \mathcal{L}(\theta)||} \leq 1$$

multiply throughout by $k = ||u|| \cdot ||\nabla_\theta \mathcal{L}(\theta)||$
$$-k \leq k \cos(\beta) = u^T \nabla_\theta \mathcal{L}(\theta) \leq k$$

Thus, $\mathcal{L}(\theta + \eta u) - \mathcal{L}(\theta) = \eta \, u^T \nabla_\theta \mathcal{L}(\theta) = \eta k \cos(\beta)$ will be most negative when $\cos(\beta) = -1$, i.e., when $\beta$ is 180°
Gradient Descent Rule
• The direction u that we intend to move in should be at 180° w.r.t. the gradient
• In other words, move in a direction opposite to the gradient

Parameter Update Equations

wt+1 = wt − η∇wt
bt+1 = bt − η∇bt

where, ∇wt = ∂L(w, b)/∂w evaluated at w = wt, b = bt and ∇bt = ∂L(w, b)/∂b evaluated at w = wt, b = bt

So we now have a more principled way of moving in the w-b plane than our “guess work” algorithm
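A minimal sketch of one parameter update, using illustrative values for η, (wt, bt), and the gradients (none of these numbers are from the lecture):

```python
# One gradient descent update step. All numbers here are illustrative
# stand-ins: eta, (w_t, b_t), and the two partial derivatives at that point.
eta = 1.0
w_t, b_t = 0.5, -0.5
grad_w, grad_b = 0.25, -0.25   # pretend values of dL/dw and dL/db at (w_t, b_t)

w_next = w_t - eta * grad_w
b_next = b_t - eta * grad_b
print(w_next, b_next)  # 0.25 -0.25
```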
• Let us create an algorithm from this rule ...

Algorithm: gradient_descent()
t ← 0;
max_iterations ← 1000;
while t < max_iterations do
    wt+1 ← wt − η∇wt ;
    bt+1 ← bt − η∇bt ;
    t ← t + 1;
end

• To see this algorithm in practice let us first derive ∇w and ∇b for our toy neural network
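The pseudocode above can be transcribed directly; here is a sketch run on a made-up quadratic loss L(w, b) = (w − 3)² + (b − 1)² whose minimum (3, 1) is known, so convergence is easy to verify:

```python
# gradient_descent() from the pseudocode above, on an illustrative quadratic
# loss L(w, b) = (w - 3)^2 + (b - 1)^2 with known minimum (3, 1).
def grad_w(w, b):
    return 2 * (w - 3)

def grad_b(w, b):
    return 2 * (b - 1)

eta = 0.1                  # illustrative learning rate
w, b = 0.0, 0.0            # illustrative starting point
t, max_iterations = 0, 1000
while t < max_iterations:
    dw, db = grad_w(w, b), grad_b(w, b)
    w = w - eta * dw
    b = b - eta * db
    t = t + 1

print(round(w, 6), round(b, 6))  # 3.0 1.0
```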
x → σ → y = f(x)

f(x) = 1/(1 + e−(w·x+b))

Let’s assume there is only 1 point to fit: (x, y)

L(w, b) = 1/2 ∗ (f(x) − y)²

∇w = ∂L(w, b)/∂w = ∂/∂w [1/2 ∗ (f(x) − y)²]
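A direct encoding of the sigmoid neuron and the squared-error loss for one point (the sample values of x, y, w, b are arbitrary):

```python
import math

# The sigmoid neuron and the squared-error loss for a single point (x, y).
def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, x, y):
    return 0.5 * (f(x, w, b) - y) ** 2

# Arbitrary illustrative values:
x, y = 0.5, 0.2
print(round(f(x, 1.0, 0.0), 4))       # sigmoid of 0.5
print(round(loss(1.0, 0.0, x, y), 4))
```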
∇w = ∂/∂w [1/2 ∗ (f(x) − y)²]
   = 1/2 ∗ [2 ∗ (f(x) − y) ∗ ∂/∂w (f(x) − y)]
   = (f(x) − y) ∗ ∂/∂w (f(x))
   = (f(x) − y) ∗ ∂/∂w [1/(1 + e−(wx+b))]
   = (f(x) − y) ∗ [−1/(1 + e−(wx+b))²] ∗ ∂/∂w (e−(wx+b))
   = (f(x) − y) ∗ [−1/(1 + e−(wx+b))²] ∗ e−(wx+b) ∗ ∂/∂w (−(wx + b))
   = (f(x) − y) ∗ [−1/(1 + e−(wx+b))] ∗ [e−(wx+b)/(1 + e−(wx+b))] ∗ (−x)
   = (f(x) − y) ∗ [1/(1 + e−(wx+b))] ∗ [e−(wx+b)/(1 + e−(wx+b))] ∗ x
   = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

where the last step uses f(x) = 1/(1 + e−(wx+b)) and 1 − f(x) = e−(wx+b)/(1 + e−(wx+b))
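One way to gain confidence in the closed-form gradient is to compare it against a central finite-difference estimate at an arbitrary point (the values below are made up):

```python
import math

# Check the closed-form gradient (f(x) - y) * f(x) * (1 - f(x)) * x against
# a central finite-difference estimate of dL/dw.
def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, x, y):
    return 0.5 * (f(x, w, b) - y) ** 2

# Arbitrary illustrative point and parameters:
x, y, w, b = 1.5, 0.0, 0.8, -0.3
fx = f(x, w, b)
grad_w = (fx - y) * fx * (1 - fx) * x

eps = 1e-6
grad_w_num = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(abs(grad_w - grad_w_num) < 1e-8)  # True
```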
x → σ → y = f(x),  f(x) = 1/(1 + e−(w·x+b))

So if there is only 1 point (x, y), we have,

∇w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

For two points,

∇w = Σi=1..2 (f(xi) − yi) ∗ f(xi) ∗ (1 − f(xi)) ∗ xi
∇b = Σi=1..2 (f(xi) − yi) ∗ f(xi) ∗ (1 − f(xi))
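Putting the pieces together: a sketch of gradient descent on the sigmoid neuron with two illustrative training points, using the summed ∇w and ∇b above (η, the data, the initialization, and the iteration count are all made-up choices):

```python
import math

# Gradient descent for the sigmoid neuron on two illustrative training
# points, using the summed gradients derived above.
X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def error(w, b):
    return sum(0.5 * (f(x, w, b) - y) ** 2 for x, y in zip(X, Y))

w, b, eta = 0.0, 0.0, 1.0     # illustrative initialization and learning rate
initial = error(w, b)
for _ in range(1000):
    dw = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x
             for x, y in zip(X, Y))
    db = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b))
             for x, y in zip(X, Y))
    w, b = w - eta * dw, b - eta * db

print(error(w, b) < initial)  # True
```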