Lecture 4 - SGD, Back Propagation

This lecture covers stochastic gradient descent and backpropagation. Stochastic gradient descent is an optimization method for minimizing loss functions that updates the parameters one example (or one mini-batch) at a time. Backpropagation computes gradients in neural networks by applying the chain rule to propagate derivatives efficiently backward through the network.


Deep Learning

Vazgen Mikayelyan

October 27, 2020



Outline

1 Stochastic Gradient Descent

2 Back-Propagation



Stochastic Gradient Descent

Let L be a loss function that we know:

$$L(w) = \frac{1}{n}\sum_{i=1}^{n} \left(f_w(x_i) - y_i\right)^2,$$

$$L(w) = \frac{1}{n}\sum_{i=1}^{n} \left(-y_i \log f_w(x_i) - (1 - y_i)\log\left(1 - f_w(x_i)\right)\right),$$

$$L(w) = \frac{1}{n}\sum_{i=1}^{n} \left(-y_i^T \log f_w(x_i)\right).$$

Do you see problems in finding the minimum of these functions using GD?
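For concreteness, the three losses above (mean squared error, binary cross-entropy, categorical cross-entropy) can be written in a few lines of NumPy. This is an illustrative sketch only: the array preds stands in for the model outputs f_w(x_i), which the slides leave abstract, and the eps clipping is a numerical safeguard added here.

import numpy as np

def mse_loss(preds, y):
    # L(w) = (1/n) * sum_i (f_w(x_i) - y_i)^2
    return np.mean((preds - y) ** 2)

def binary_cross_entropy(preds, y, eps=1e-12):
    # L(w) = (1/n) * sum_i [ -y_i log f_w(x_i) - (1 - y_i) log(1 - f_w(x_i)) ]
    preds = np.clip(preds, eps, 1 - eps)   # keep log() away from 0
    return np.mean(-y * np.log(preds) - (1 - y) * np.log(1 - preds))

def categorical_cross_entropy(preds, y, eps=1e-12):
    # L(w) = (1/n) * sum_i ( -y_i^T log f_w(x_i) ), one-hot rows y_i, probability rows in preds
    preds = np.clip(preds, eps, 1.0)
    return np.mean(-np.sum(y * np.log(preds), axis=1))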


Stochastic Gradient Descent

Note that in each case we can represent the loss function in the following form:

$$L(w) = \frac{1}{n}\sum_{i=1}^{n} L_i(w).$$

The SGD algorithm is the following:

Choose an initial vector of parameters w and a learning rate α.
Repeat until an approximate minimum is obtained:
    Randomly shuffle the examples in the training set.
    For i = 1, 2, ..., n, do w ← w − α∇L_i(w).

Do you see problems in this optimization method?
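As a rough illustration of this loop, here is a minimal NumPy sketch. The per-example gradient grad_Li(w, x_i, y_i), the fixed epoch count, and the stopping rule are placeholders introduced here, not part of the slides.

import numpy as np

def sgd(w, X, y, grad_Li, alpha=0.01, epochs=10):
    # Plain SGD: one parameter update per training example.
    n = len(X)
    for _ in range(epochs):                         # "repeat until an approximate minimum"
        perm = np.random.permutation(n)             # randomly shuffle the training set
        for i in perm:
            w = w - alpha * grad_Li(w, X[i], y[i])  # w <- w - alpha * grad L_i(w)
    return w

Each step uses a single example, so the updates are cheap but noisy; that noise is the usual motivation for the mini-batch variant on the next slide.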


Mini-Batch Gradient Descent

The MBGD algorithm is the following:

Choose an initial vector of parameters w, a learning rate α and a batch size B.
Repeat until an approximate minimum is obtained:
    Randomly shuffle the examples in the training set.
    For i = 1, 2, ..., ⌈n/B⌉, do
    $$w \leftarrow w - \alpha\,\nabla\!\left(\frac{1}{B}\sum_{k=(i-1)B+1}^{iB} L_k(w)\right).$$
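Under the same assumptions as the SGD sketch above (an illustrative per-example gradient grad_Li and a fixed epoch count), the mini-batch update can be sketched as follows.

import numpy as np

def minibatch_gd(w, X, y, grad_Li, alpha=0.01, B=32, epochs=10):
    # Mini-batch GD: one update per batch of (at most) B examples, ceil(n/B) updates per pass.
    n = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(n)              # randomly shuffle the training set
        for start in range(0, n, B):
            batch = perm[start:start + B]
            # average the per-example gradients over the batch, then take one step
            g = np.mean([grad_Li(w, X[k], y[k]) for k in batch], axis=0)
            w = w - alpha * g
    return w

With B = 1 this reduces to SGD, and with B = n it is ordinary gradient descent.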


Back-Propagation

Question: how do we calculate the derivative of the function sin(x²)?

Theorem 1
Given n functions f_1, ..., f_n with the composite function

$$f = f_1 \circ \left(f_2 \circ \cdots \left(f_{n-1} \circ f_n\right)\right),$$

if each function f_i is differentiable at its immediate input, then the composite function is also differentiable by repeated application of the chain rule, and the derivative is

$$\frac{df}{dx} = \frac{df_1}{df_2}\,\frac{df_2}{df_3}\cdots\frac{df_n}{dx}.$$
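Applied to the opening question with f_1 = sin and f_2(x) = x², the theorem gives

$$\frac{d}{dx}\,\sin\!\left(x^2\right) = \cos\!\left(x^2\right)\cdot\frac{d}{dx}\,x^2 = 2x\,\cos\!\left(x^2\right).$$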


Back-Propagation

In this case we have the following function:

$$L(w_1, w_2) = \left(f_2\left(w_2 f_1\left(w_1 x\right)\right) - y\right)^2.$$

We have to calculate the derivatives $\partial L/\partial w_1$ and $\partial L/\partial w_2$:

$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial f_2}\,\frac{\partial f_2}{\partial (w_2 f_1)}\,\frac{\partial (w_2 f_1)}{\partial w_2},$$

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial f_2}\,\frac{\partial f_2}{\partial (w_2 f_1)}\,\frac{\partial (w_2 f_1)}{\partial f_1}\,\frac{\partial f_1}{\partial (w_1 x)}\,\frac{\partial (w_1 x)}{\partial w_1}.$$
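To make the chain concrete, here is a small NumPy sketch of the forward and backward pass for this two-parameter network. Taking f_1 and f_2 to be sigmoids is an assumption made here purely for illustration; the slides leave them generic.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(w1, w2, x, y):
    # Forward pass: compute and cache every intermediate value.
    a1 = w1 * x                      # w1 x
    h1 = sigmoid(a1)                 # f1(w1 x)
    a2 = w2 * h1                     # w2 f1
    h2 = sigmoid(a2)                 # f2(w2 f1)
    L = (h2 - y) ** 2                # loss

    # Backward pass: multiply the local derivatives along the chain.
    dL_dh2 = 2.0 * (h2 - y)          # dL/df2
    dh2_da2 = h2 * (1.0 - h2)        # df2/d(w2 f1), derivative of the sigmoid (assumed f2)
    dL_da2 = dL_dh2 * dh2_da2
    dL_dw2 = dL_da2 * h1             # times d(w2 f1)/dw2 = f1
    dL_dh1 = dL_da2 * w2             # times d(w2 f1)/df1 = w2
    dh1_da1 = h1 * (1.0 - h1)        # df1/d(w1 x), derivative of the sigmoid (assumed f1)
    dL_dw1 = dL_dh1 * dh1_da1 * x    # times d(w1 x)/dw1 = x
    return L, dL_dw1, dL_dw2

Note that the product for ∂L/∂w_1 reuses the factors already computed for ∂L/∂w_2; back-propagation is precisely this reuse of intermediate products, applied layer by layer from the output back to the input.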
