
Lecture 7: Backpropagation for Neural Networks.

Dr Etienne Pienaar
[email protected]  Room 5.43

University of Cape Town

October 11, 2021

Backpropagation
Based on Nielsen (2015).

Backpropagation is a strategy for calculating the gradient of the cost function with respect to the parameters of the network. Although the calculations can become tedious, the only mathematical machinery required is the chain rule for differentiation. First, one defines working variables:
A linear component:

    z_j^l = \sum_{k=1}^{d_{l-1}} a_k^{l-1} w_{kj}^l + b_j^l

A working gradient:

    \delta_j^l = \frac{\partial C}{\partial z_j^l}

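To make the indexing concrete, here is a minimal NumPy sketch of the forward computation for a single layer, written with explicit loops so that it mirrors the scalar sum above. The names (layer_forward, a_prev, W, b) are illustrative rather than taken from the slides, and a sigmoid activation is assumed.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer_forward(a_prev, W, b):
        """Compute z_j^l = sum_k a_k^{l-1} w_{kj}^l + b_j^l and a_j^l = sigmoid(z_j^l).

        a_prev : (d_{l-1},) activations a^{l-1} from the previous layer
        W      : (d_{l-1}, d_l) weight matrix with W[k, j] = w_{kj}^l
        b      : (d_l,) bias vector
        """
        d_prev, d_curr = W.shape
        z = np.zeros(d_curr)
        for j in range(d_curr):              # one linear component per node j in layer l
            for k in range(d_prev):
                z[j] += a_prev[k] * W[k, j]
            z[j] += b[j]
        a = sigmoid(z)                       # a_j^l = sigma(z_j^l)
        return z, a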
After defining the working variables, one starts at the terminal nodes of the network and works backwards. For these purposes, we calculate the gradient w.r.t. the working variable in the output layer:
    \delta_j^L = \sum_{k=1}^{d_L = q} \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L}

               = \frac{\partial C}{\partial a_j^L} \frac{\partial a_j^L}{\partial z_j^L}      (only the k = j term survives, since a_k^L = \sigma(z_k^L) depends on z_j^L only when k = j)

               = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L)

Note that everything we need to evaluate δ_j^L is already known after evaluating the forward updating equation for the network. That is, after a forward pass through the network, all we need to know when running backwards are the activation function specifications and the working variable values.

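As a concrete example, suppose the cost is quadratic, C = ½ Σ_j (a_j^L − y_j)², so that ∂C/∂a_j^L = a_j^L − y_j. The slides leave the cost generic; the quadratic choice here is only for illustration. Then δ_j^L is computed directly from the forward-pass quantities, e.g.:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def delta_output(a_L, z_L, y):
        """delta_j^L = dC/da_j^L * sigma'(z_j^L) for the quadratic cost C = 0.5*||a^L - y||^2."""
        dC_da = a_L - y                      # dC/da_j^L = a_j^L - y_j for this cost
        return dC_da * sigmoid_prime(z_L)    # element-wise product over the output nodes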
Now, in order to evaluate the gradients in the interim layers of the
network, we iterate backwards from the terminal layer. For these
purposes, we derive:
    \delta_j^{l-1} = \frac{\partial C}{\partial z_j^{l-1}}      (cost w.r.t. the linear component in layer l-1)

                   = \sum_{k=1}^{d_l} \frac{\partial C}{\partial z_k^l} \frac{\partial z_k^l}{\partial z_j^{l-1}}      (chain rule)

                   = \sum_{k=1}^{d_l} \delta_k^l \frac{\partial z_k^l}{\partial z_j^{l-1}}      (substitute the working gradient)

                   = \sum_{k=1}^{d_l} \delta_k^l \frac{\partial}{\partial z_j^{l-1}} \Big[ \sum_{m=1}^{d_{l-1}} \sigma(z_m^{l-1}) w_{mk}^l + b_k^l \Big]      (using a_m^{l-1} = \sigma(z_m^{l-1}))

                   = \sum_{k=1}^{d_l} w_{jk}^l \delta_k^l \, \sigma'(z_j^{l-1})      (only the m = j term survives)

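A minimal sketch of this backward update, again with explicit loops so that it mirrors the scalar sum; the names are illustrative, a sigmoid activation is assumed, and W[j, k] stores w_{jk}^l.

    import numpy as np

    def sigmoid_prime(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)

    def delta_previous(delta_l, W, z_prev):
        """delta_j^{l-1} = (sum_k w_{jk}^l delta_k^l) * sigma'(z_j^{l-1}).

        delta_l : (d_l,) working gradients in layer l
        W       : (d_{l-1}, d_l) weights of layer l, W[j, k] = w_{jk}^l
        z_prev  : (d_{l-1},) linear components in layer l-1
        """
        d_prev, d_curr = W.shape
        delta_prev = np.zeros(d_prev)
        for j in range(d_prev):
            total = 0.0
            for k in range(d_curr):
                total += W[j, k] * delta_l[k]            # sum_k w_{jk}^l delta_k^l
            delta_prev[j] = total * sigmoid_prime(z_prev[j])
        return delta_prev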
So, we start with the terminal working gradients δ_j^L and then work our way backwards using the backwards updating equation to find δ_j^{L-1}, δ_j^{L-2}, etc. That is, evaluate

    \delta_j^L = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L)

and then propagate:

    \delta_j^{l-1} = \sum_{k=1}^{d_l} w_{jk}^l \delta_k^l \, \sigma'(z_j^{l-1})

for all l and j.


But these are 'working' gradients. What does that have to do with the parameters of our model?

As it turns out, we can make some final manipulations to get the gradient of the cost function w.r.t. the model parameters. First, for the biases of the model, we have:

    \frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l      (1)

and then for the weights:

    \frac{\partial C}{\partial w_{kj}^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{kj}^l} = a_k^{l-1} \delta_j^l.      (2)

Equations 1 and 2, along with the terminal condition in Equation 6, define the (backwards) updating equations for the back-propagation procedure.

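In code, Equations 1 and 2 amount to a copy and an outer product once the δ's are known. A minimal sketch with illustrative names (a_prev holds a^{l-1}, delta holds δ^l):

    import numpy as np

    def parameter_gradients(a_prev, delta):
        """Gradients of C w.r.t. the biases and weights of layer l.

        a_prev : (d_{l-1},) activations a^{l-1}
        delta  : (d_l,) working gradients delta^l
        """
        grad_b = delta.copy()             # Eq. (1): dC/db_j^l = delta_j^l
        grad_W = np.outer(a_prev, delta)  # Eq. (2): dC/dw_{kj}^l = a_k^{l-1} delta_j^l
        return grad_b, grad_W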
Vector and Matrix form of backprop?

Can you write these equations in vector form? Can you write them in matrix form? Start by defining:

    \delta^l = (\delta_1^l, \delta_2^l, \ldots, \delta_{d_l}^l)^T, \quad a^l = (a_1^l, a_2^l, \ldots, a_{d_l}^l)^T, \quad z^l = (z_1^l, z_2^l, \ldots, z_{d_l}^l)^T,

each of dimension d_l × 1, and of course the parameters W_l = (w_{jk}^l)_{d_{l-1} \times d_l} and b_l = (b_j^l)_{d_l \times 1}, so that:

    z^l = W_l^T a^{l-1} + b^l
    a^l = \sigma(z^l)

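In this notation the full forward pass is only a few lines. A minimal NumPy sketch, assuming sigmoid activations in every layer and lists weights and biases holding the W_l and b_l (illustrative names, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_pass(x, weights, biases):
        """Collect z^l and a^l for every layer.

        weights : list of (d_{l-1}, d_l) matrices W_l
        biases  : list of (d_l,) vectors b_l
        """
        a = x
        zs, activations = [], [x]
        for W, b in zip(weights, biases):
            z = W.T @ a + b        # z^l = W_l^T a^{l-1} + b^l
            a = sigmoid(z)         # a^l = sigma(z^l)
            zs.append(z)
            activations.append(a)
        return zs, activations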
For the vector form of the equations, we can look at the structure of the scalar updating equations for some guidance:

    \delta_j^L = \frac{\partial C}{\partial a_j^L} \times \sigma'(z_j^L)

for all j in the final layer. So calculate a vector with elements \frac{\partial C}{\partial a_j^L} and a vector with elements \sigma'(z_j^L), and compute the vector δ^L element-wise.

Then for the backward update:

    \delta_j^{l-1} = \underbrace{\Big( \sum_{k=1}^{d_l} w_{jk}^l \delta_k^l \Big)}_{\text{matrix mult.?}} \times \sigma'(z_j^{l-1}).

The first term coincides with the definition of matrix multiplication (post-multiplication by a vector), resulting in a single element of a vector. Again, conclude with element-wise multiplication.

Written out in full, we arrive at:

    \delta^{l-1} = \frac{\partial C_i}{\partial z^{l-1}} = [W_l]_{d_{l-1} \times d_l} [\delta^l]_{d_l \times 1} \odot [\sigma'_{l-1}(z^{l-1})]_{d_{l-1} \times 1}

for l = L, L − 1, \ldots, where \sigma'_l(\cdot) denotes the derivative of the activation function in layer l, \odot denotes element-wise multiplication, and the terminal condition for the equations is given by:

    \delta^L = \frac{\partial C_i}{\partial a^L} \odot \sigma'_L(z^L).

Again, that's only half the story... What about the weights and biases? Well, for the biases we just use

    \frac{\partial C_i}{\partial b_l} = \frac{\partial C_i}{\partial z^l} = [\delta^l]_{d_l \times 1}

in each layer. Then for the weights, noting again the similarity to matrix multiplication:

    \frac{\partial C_i}{\partial W_l} = a^{l-1} (\delta^l)^T = [a^{l-1}]_{d_{l-1} \times 1} [(\delta^l)^T]_{1 \times d_l}.

Does this make sense? What are the dimensions of W_l?

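Putting the vector equations together, one complete forward-plus-backward pass can be sketched as follows, again assuming sigmoid activations and the quadratic cost used in the earlier sketches (the slides keep both generic):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def backprop(x, y, weights, biases):
        """One forward + backward pass; returns dC/db_l and dC/dW_l for every layer."""
        # Forward pass: store z^l and a^l for every layer.
        a = x
        zs, activations = [], [x]
        for W, b in zip(weights, biases):
            z = W.T @ a + b                             # z^l = W_l^T a^{l-1} + b^l
            a = sigmoid(z)
            zs.append(z)
            activations.append(a)

        # Terminal condition: delta^L = dC/da^L (element-wise) sigma'(z^L), quadratic cost assumed.
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        grad_b = [None] * len(weights)
        grad_W = [None] * len(weights)
        grad_b[-1] = delta                              # dC/db^L = delta^L
        grad_W[-1] = np.outer(activations[-2], delta)   # dC/dW_L = a^{L-1} (delta^L)^T

        # Backward recursion: delta^{l-1} = W_l delta^l (element-wise) sigma'(z^{l-1}).
        for l in range(len(weights) - 2, -1, -1):
            delta = (weights[l + 1] @ delta) * sigmoid_prime(zs[l])
            grad_b[l] = delta
            grad_W[l] = np.outer(activations[l], delta)
        return grad_b, grad_W

A gradient-descent step would then subtract a learning rate times grad_W[l] and grad_b[l] from the corresponding W_l and b_l in each layer.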
References I

[Nielsen 2015] Nielsen, Michael A.: Neural Networks and Deep Learning. Determination Press, 2015.

