Lecture Slides 4 - Backpropagation - 2021
Dr Etienne Pienaar
[email protected] Room 5.43
Backpropagation
Based on Nielsen (2015).
Backpropagation
A working gradient:
\[
\delta_j^l = \frac{\partial C}{\partial z_j^l}
\]
After defining the working variables, one starts at the terminal nodes of
the network and works backwards. For these purposes, we calculate the
gradient w.r.t. the working variable in the output layer:
\[
\delta_j^L = \sum_{k=1}^{d_L = q} \frac{\partial C}{\partial a_k^L}\,\frac{\partial a_k^L}{\partial z_j^L}
           = \frac{\partial C}{\partial a_j^L}\,\frac{\partial a_j^L}{\partial z_j^L}
           = \frac{\partial C}{\partial a_j^L}\,\sigma'(z_j^L)
\]
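As a quick numerical sanity check (not from the slides), the sketch below assumes a quadratic cost $C = \tfrac{1}{2}\|a^L - y\|^2$ and the logistic sigmoid for $\sigma$, so that $\partial C / \partial a_j^L = a_j^L - y_j$, and compares $\delta_j^L$ to a finite-difference estimate of $\partial C / \partial z_j^L$:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed setup (illustrative only): quadratic cost C = 0.5 * ||sigmoid(z_L) - y||^2.
z_L = np.array([0.2, -1.0, 0.7])
y = np.array([0.0, 1.0, 0.0])

def cost(z):
    return 0.5 * np.sum((sigmoid(z) - y) ** 2)

a_L = sigmoid(z_L)
delta_L = (a_L - y) * a_L * (1.0 - a_L)   # dC/da_j^L * sigma'(z_j^L), using sigma'(z) = a(1 - a)

# Finite-difference estimate of dC/dz_j^L for each output node j.
eps = 1e-6
for j in range(len(z_L)):
    z_pert = z_L.copy()
    z_pert[j] += eps
    print(j, delta_L[j], (cost(z_pert) - cost(z_L)) / eps)   # the two values should agree closely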
Now, in order to evaluate the gradients in the interim layers of the
network, we iterate backwards from the terminal layer. For these
purposes, we derive:
\begin{align*}
\delta_j^{l-1} &= \frac{\partial C}{\partial z_j^{l-1}}
  && \text{cost w.r.t. the linear component of layer } l-1 \\
&= \sum_{k=1}^{d_l} \frac{\partial C}{\partial z_k^l}\,\frac{\partial z_k^l}{\partial z_j^{l-1}}
  && \text{chain rule} \\
&= \sum_{k=1}^{d_l} \delta_k^l\,\frac{\partial z_k^l}{\partial z_j^{l-1}}
  && \text{swap in the working gradient} \\
&= \sum_{k=1}^{d_l} \delta_k^l\,\frac{\partial}{\partial z_j^{l-1}}\left(\sum_{m=1}^{d_{l-1}} a_m^{l-1} w_{mk}^l + b_k^l\right) \\
&= \sum_{k=1}^{d_l} w_{jk}^l\,\delta_k^l\,\sigma'(z_j^{l-1}),
\end{align*}
where the last step follows because $a_m^{l-1} = \sigma(z_m^{l-1})$, so only the $m = j$ term has a non-zero derivative with respect to $z_j^{l-1}$.
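A minimal numpy sketch of this backward step (names and shapes are illustrative, not from the slides). The weight matrix is indexed so that w[j, k] connects node j in layer $l-1$ to node k in layer $l$, matching $z_k^l = \sum_m a_m^{l-1} w_{mk}^l + b_k^l$ above; the explicit loops mirror the sum term by term:

import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backward_step(w_l, delta_l, z_prev):
    # delta_j^{l-1} = sum_k w_jk^l * delta_k^l * sigma'(z_j^{l-1})
    d_prev, d_l = w_l.shape
    delta_prev = np.zeros(d_prev)
    for j in range(d_prev):
        for k in range(d_l):
            delta_prev[j] += w_l[j, k] * delta_l[k] * sigmoid_prime(z_prev[j])
    return delta_prev

# Illustrative dimensions: 4 nodes in layer l-1 feeding 3 nodes in layer l.
rng = np.random.default_rng(0)
w_l = rng.normal(size=(4, 3))
delta_l = rng.normal(size=3)
z_prev = rng.normal(size=4)
print(backward_step(w_l, delta_l, z_prev))
print((w_l @ delta_l) * sigmoid_prime(z_prev))   # vectorised form gives the same answer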
So, we start with the terminal working gradients $\delta_j^L$ and then work our
way backwards using the backwards updating equation to find $\delta_j^{L-1}$, $\delta_j^{L-2}$,
etc. That is, evaluate
\[
\delta_j^L = \frac{\partial C}{\partial a_j^L}\,\sigma'(z_j^L)
\]
As it turns out, we can make some final manipulations to get the
gradient of the cost function w.r.t. the model parameters. First, for the
biases of the model, we have:
\begin{equation}
\frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l}\,\frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l,
\tag{1}
\end{equation}
since $\partial z_j^l / \partial b_j^l = 1$.
Vector and Matrix form of backprop?
Can you write these equations in vector-form? Can you write them in
matrix-form? Start by defining:
\[
\boldsymbol{\delta}^l =
\begin{pmatrix} \delta_1^l \\ \delta_2^l \\ \vdots \\ \delta_{d_l}^l \end{pmatrix}_{d_l \times 1}
\qquad
\mathbf{a}^l =
\begin{pmatrix} a_1^l \\ a_2^l \\ \vdots \\ a_{d_l}^l \end{pmatrix}_{d_l \times 1}
\qquad
\mathbf{z}^l =
\begin{pmatrix} z_1^l \\ z_2^l \\ \vdots \\ z_{d_l}^l \end{pmatrix}_{d_l \times 1}
\]
For the vector-form of the equations, we can look at the structure of the
scalar updating equations for some guidance:
\[
\delta_j^L = \frac{\partial C}{\partial a_j^L} \times \sigma'(z_j^L)
\]
Written out in full, we arrive at:
\[
\boldsymbol{\delta}^L =
\begin{pmatrix}
\frac{\partial C}{\partial a_1^L}\,\sigma'(z_1^L) \\
\vdots \\
\frac{\partial C}{\partial a_{d_L}^L}\,\sigma'(z_{d_L}^L)
\end{pmatrix}_{d_L \times 1}
= \nabla_{\mathbf{a}^L} C \odot \sigma'(\mathbf{z}^L).
\]
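In numpy the Hadamard product $\odot$ is simply element-wise `*` on arrays of matching shape. A small sketch using $d_L \times 1$ column vectors, again assuming a quadratic cost so that $\nabla_{\mathbf{a}^L} C = \mathbf{a}^L - \mathbf{y}$:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z_L = np.array([[0.2], [-1.0], [0.7]])        # z^L as a d_L x 1 column vector
a_L = sigmoid(z_L)                            # a^L = sigma(z^L)
y = np.array([[0.0], [1.0], [0.0]])           # target (assumed quadratic cost)

grad_a = a_L - y                              # gradient of C w.r.t. a^L under the assumed cost
delta_L = grad_a * sigmoid_prime(z_L)         # delta^L = grad_a C (Hadamard) sigma'(z^L)
print(delta_L.shape)                          # (3, 1): still a d_L x 1 column vector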
Again, that's only half the story... What about the weights and biases?
Well, for the biases we just use
\[
\frac{\partial C_i}{\partial \mathbf{b}^{l-1}} = \frac{\partial C_i}{\partial \mathbf{z}^{l-1}} = \boldsymbol{\delta}^{l-1}_{\,d_{l-1} \times 1}
\]
in each layer. Then for the weights, noting again the similarity to matrix
multiplication:
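The slide's final matrix expression is not reproduced above, but under the conventions used here ($w_{mk}^l$ connecting node $m$ in layer $l-1$ to node $k$ in layer $l$) the standard results are $\partial C / \partial b_k^l = \delta_k^l$ and $\partial C / \partial w_{mk}^l = a_m^{l-1} \delta_k^l$, i.e. the weight gradient is the outer product $\mathbf{a}^{l-1} (\boldsymbol{\delta}^l)^\top$. A sketch of a full backward sweep under those assumptions (quadratic cost, sigmoid activations, illustrative layer sizes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative network with layer sizes 2 -> 3 -> 1; W[l][m, k] connects node m to node k.
rng = np.random.default_rng(1)
sizes = [2, 3, 1]
W = [rng.normal(size=(sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]
b = [rng.normal(size=(sizes[l + 1], 1)) for l in range(len(sizes) - 1)]

x = np.array([[0.5], [-0.3]])
y = np.array([[1.0]])

# Forward pass, storing every a^l and z^l.
a, zs = [x], []
for W_l, b_l in zip(W, b):
    z = W_l.T @ a[-1] + b_l        # z_k = sum_m a_m * w_mk + b_k
    zs.append(z)
    a.append(sigmoid(z))

# Backward pass: delta^L for the assumed quadratic cost, then the updating equation.
delta = (a[-1] - y) * sigmoid_prime(zs[-1])
grad_W, grad_b = [None] * len(W), [None] * len(b)
for l in range(len(W) - 1, -1, -1):
    grad_b[l] = delta                      # bias gradient: the delta of the layer that W[l] feeds into
    grad_W[l] = a[l] @ delta.T             # weight gradient: outer product a^{l-1} (delta^l)^T
    if l > 0:
        delta = (W[l] @ delta) * sigmoid_prime(zs[l - 1])   # backwards updating equation
print(grad_W[0].shape, grad_b[0].shape)    # (2, 3) and (3, 1)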
References I
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.