How To Derive Errors in Neural Network With The Backpropagation Algorithm?
How are $\delta^{(3)}$ and $\delta^{(2)}$ derived? In fact, what does $\delta^{(3)}$ even mean? $\delta^{(4)}$ is obtained by comparing to $y$, but no such comparison is possible for the output of a hidden layer, right?
asked Apr 19 '14 at 19:27 by qed, edited May 11 '20 at 13:49
The video link is not working. Please update it or provide a link to the course. Thanks. – MadHatter Nov 28 '17 at 3:09
Though @tmangin answered it beautifully, if you still want a more detailed explanation you can refer to this GitHub repo. It is well explained and contains derivations for both the Machine Learning Course and the Deep Learning Specialization. – Aditya Saini Jun 3 '20 at 12:50
I'm going to answer your question about the $\delta_i^{(l)}$, but remember that your question is a sub-question of a larger one, namely why:

$$\nabla_{ij}^{(l)} = \sum_k \theta_{ki}^{(l+1)} \delta_k^{(l+1)} \ast \left(a_i^{(l)}\bigl(1 - a_i^{(l)}\bigr)\right) \ast a_j^{(l-1)}$$
Step 1: forward propagation: calculation of the $a_i^{(l)}$

Step 2a: backward propagation: calculation of the errors $\delta_i^{(l)}$

Step 2b: backward propagation: calculation of the gradient $\nabla_{ij}^{(l)}$ of $J(\Theta)$ using the errors $\delta_i^{(l+1)}$ and the $a_i^{(l)}$

Step 3: gradient descent: calculation of the new $\theta_{ij}^{(l)}$ using the gradients $\nabla_{ij}^{(l)}$
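To make those four steps concrete, here is a minimal NumPy sketch of one training iteration on a tiny sigmoid network. It is an illustration, not the original poster's code: the layer sizes, learning rate, and the cross-entropy assumption behind `delta3 = a3 - y` (the cost used in Ng's course) are mine, and bias units are omitted for brevity. Following the note later in this answer, `Theta2` maps layer 1 to layer 2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta2 = rng.normal(size=(3, 2))  # maps layer 1 (2 units) to layer 2 (3 units)
Theta3 = rng.normal(size=(1, 3))  # maps layer 2 (3 units) to layer 3 (1 unit)
x = np.array([0.5, -1.0])         # one training example
y = np.array([1.0])
alpha = 0.1                       # learning rate

# Step 1: forward propagation (compute the a_i^(l))
a1 = x
z2 = Theta2 @ a1; a2 = sigmoid(z2)
z3 = Theta3 @ a2; a3 = sigmoid(z3)

# Step 2a: backward propagation (compute the errors delta_i^(l));
# delta3 = a3 - y holds for a sigmoid output with the cross-entropy cost
delta3 = a3 - y
delta2 = (Theta3.T @ delta3) * a2 * (1 - a2)

# Step 2b: gradients of the cost with respect to each theta_ij^(l)
grad3 = np.outer(delta3, a2)  # same shape as Theta3
grad2 = np.outer(delta2, a1)  # same shape as Theta2

# Step 3: gradient descent update
Theta3 -= alpha * grad3
Theta2 -= alpha * grad2
```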
First, to understand what the $\delta_i^{(l)}$ are, what they represent and why Andrew Ng is talking about them, you need to understand what Andrew is actually doing at that point and why we do all these calculations: he's calculating the gradient $\nabla_{ij}^{(l)}$ of $\theta_{ij}^{(l)}$ to be used in the gradient descent algorithm.

$$\nabla_{ij}^{(l)} = \frac{\partial C}{\partial \theta_{ij}^{(l)}}$$
As we can't really solve this formula directly, we are going to modify it using TWO MAGIC TRICKS to arrive at a formula we can actually calculate. This final usable formula is the one given at the top of this answer:

$$\nabla_{ij}^{(l)} = \sum_k \theta_{ki}^{(l+1)} \delta_k^{(l+1)} \ast \left(a_i^{(l)}\bigl(1 - a_i^{(l)}\bigr)\right) \ast a_j^{(l-1)}$$

Note: here the mapping from the 1st layer to the 2nd layer is notated as theta2, and so on, instead of theta1.
To arrive at this result, the FIRST MAGIC TRICK is that we can write the gradient $\nabla_{ij}^{(l)}$ of $\theta_{ij}^{(l)}$ using the $\delta_i^{(l)}$, defined as:

$$\delta_i^{(l)} = \frac{\partial C}{\partial z_i^{(l)}}$$
And then the SECOND MAGIC TRICK uses the relation between $\delta_i^{(l)}$ and $\delta_k^{(l+1)}$ to define the deltas of all the other layers.

And as I said, we can finally write a formula for which we know all the terms. The detailed derivation follows.
We defined:

$$\nabla_{ij}^{(l)} = \frac{\partial C}{\partial \theta_{ij}^{(l)}}$$
The chain rule for higher dimensions (you should REALLY read this property of the chain rule) enables us to write:

$$\nabla_{ij}^{(l)} = \sum_k \frac{\partial C}{\partial z_k^{(l)}} \ast \frac{\partial z_k^{(l)}}{\partial \theta_{ij}^{(l)}}$$
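(For readers who haven't met that property: it is the multivariable chain rule. If $C$ depends on a variable $t$ only through intermediate variables $z_1, \dots, z_n$, then

$$\frac{\partial C}{\partial t} = \sum_{k=1}^{n} \frac{\partial C}{\partial z_k} \, \frac{\partial z_k}{\partial t},$$

which is exactly the form applied above with $t = \theta_{ij}^{(l)}$.)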
However, as:

$$\frac{\partial z_k^{(l)}}{\partial \theta_{ij}^{(l)}} = \frac{\partial}{\partial \theta_{ij}^{(l)}} \sum_m \theta_{km}^{(l)} \ast a_m^{(l-1)}$$
Because of the linearity of differentiation [$(u+v)' = u' + v'$], we can write:

$$\frac{\partial z_k^{(l)}}{\partial \theta_{ij}^{(l)}} = \sum_m \frac{\partial \theta_{km}^{(l)}}{\partial \theta_{ij}^{(l)}} \ast a_m^{(l-1)}$$
with:

$$\text{if } (k, m) \neq (i, j), \quad \frac{\partial \theta_{km}^{(l)}}{\partial \theta_{ij}^{(l)}} \ast a_m^{(l-1)} = 0$$

$$\text{if } (k, m) = (i, j), \quad \frac{\partial \theta_{km}^{(l)}}{\partial \theta_{ij}^{(l)}} \ast a_m^{(l-1)} = \frac{\partial \theta_{ij}^{(l)}}{\partial \theta_{ij}^{(l)}} \ast a_j^{(l-1)} = a_j^{(l-1)}$$
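You can check this "only one term survives" argument symbolically. Here is a quick SymPy verification on a hypothetical layer with 2 inputs and 2 outputs; the variable names are mine, not from the answer:

```python
import sympy as sp

# z = Theta @ a_prev for a hypothetical layer: 2 units in, 2 units out
theta = sp.Matrix(2, 2, lambda i, j: sp.Symbol(f"theta_{i}{j}"))
a_prev = sp.Matrix([sp.Symbol("a_0"), sp.Symbol("a_1")])
z = theta * a_prev

# dz_k/dtheta_ij equals a_j when k == i, and 0 otherwise
print(sp.diff(z[0], sp.Symbol("theta_01")))  # prints a_1 (here k = i = 0, j = 1)
print(sp.diff(z[1], sp.Symbol("theta_01")))  # prints 0   (here k = 1 != i = 0)
```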
Finally, for $k = i$:

$$\frac{\partial z_i^{(l)}}{\partial \theta_{ij}^{(l)}} = a_j^{(l-1)}$$
As a result, we can write our first expression of the gradient $\nabla_{ij}^{(l)}$:

$$\nabla_{ij}^{(l)} = \frac{\partial C}{\partial z_i^{(l)}} \ast \frac{\partial z_i^{(l)}}{\partial \theta_{ij}^{(l)}}$$

$$\nabla_{ij}^{(l)} = \frac{\partial C}{\partial z_i^{(l)}} \ast a_j^{(l-1)}$$
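In matrix form, this last expression says the whole gradient for layer $l$ is an outer product of the layer's error vector with the previous layer's activations. A one-line NumPy illustration (the array values are made up):

```python
import numpy as np

delta_l = np.array([0.1, -0.2, 0.3])  # hypothetical dC/dz^(l) for a 3-unit layer
a_prev = np.array([0.5, 0.9])         # activations a^(l-1) of a 2-unit layer

# grad[i, j] = delta_l[i] * a_prev[j], matching grad_ij = dC/dz_i^(l) * a_j^(l-1)
grad = np.outer(delta_l, a_prev)
print(grad.shape)  # (3, 2), the same shape as theta^(l)
```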
Or, introducing the delta notation:

$$\delta^{(l)} = \frac{\partial C}{\partial z^{(l)}} \quad \text{and} \quad \delta_i^{(l)} = \frac{\partial C}{\partial z_i^{(l)}}$$

so that $\nabla_{ij}^{(l)} = \delta_i^{(l)} \ast a_j^{(l-1)}$.

Now for the SECOND MAGIC TRICK: applying the chain rule again, this time across layers, we get:

$$\delta_i^{(l)} = \sum_k \frac{\partial C}{\partial z_k^{(l+1)}} \ast \frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}}$$
Replacing $\frac{\partial C}{\partial z_k^{(l+1)}}$ by $\delta_k^{(l+1)}$, we have:

$$\delta_i^{(l)} = \sum_k \delta_k^{(l+1)} \ast \frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}}$$
Now, let's focus on $\frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}}$. We have:

$$z_k^{(l+1)} = \sum_j \theta_{kj}^{(l+1)} \ast a_j^{(l)} = \sum_j \theta_{kj}^{(l+1)} \ast g\bigl(z_j^{(l)}\bigr)$$
Then we differentiate this expression with respect to $z_i^{(l)}$:

$$\frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}} = \frac{\partial \sum_j \theta_{kj}^{(l+1)} \ast g\bigl(z_j^{(l)}\bigr)}{\partial z_i^{(l)}}$$
$$\frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}} = \sum_j \theta_{kj}^{(l+1)} \ast \frac{\partial g\bigl(z_j^{(l)}\bigr)}{\partial z_i^{(l)}}$$
If $j \neq i$, then:

$$\frac{\partial \, \theta_{kj}^{(l+1)} \ast g\bigl(z_j^{(l)}\bigr)}{\partial z_i^{(l)}} = 0$$
As a consequence:

$$\frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}} = \theta_{ki}^{(l+1)} \ast \frac{\partial g\bigl(z_i^{(l)}\bigr)}{\partial z_i^{(l)}}$$
And then:

$$\delta_i^{(l)} = \sum_k \delta_k^{(l+1)} \theta_{ki}^{(l+1)} \ast \frac{\partial g\bigl(z_i^{(l)}\bigr)}{\partial z_i^{(l)}}$$
And as $g$ is the sigmoid, its derivative is $\frac{\partial g(z_i^{(l)})}{\partial z_i^{(l)}} = g\bigl(z_i^{(l)}\bigr)\bigl(1 - g(z_i^{(l)})\bigr) = a_i^{(l)}\bigl(1 - a_i^{(l)}\bigr)$, so we have:

$$\delta_i^{(l)} = \sum_k \delta_k^{(l+1)} \theta_{ki}^{(l+1)} \ast a_i^{(l)}\bigl(1 - a_i^{(l)}\bigr)$$
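In vectorized form this recursion is a single line of NumPy. A sketch assuming sigmoid activations; the array names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
Theta_next = rng.normal(size=(4, 3))  # theta^(l+1): layer l has 3 units, layer l+1 has 4
delta_next = rng.normal(size=4)       # delta^(l+1), already computed
a_l = 1.0 / (1.0 + np.exp(-rng.normal(size=3)))  # a^(l) = g(z^(l)) for sigmoid g

# delta_i^(l) = sum_k delta_k^(l+1) * theta_ki^(l+1) * a_i^(l) * (1 - a_i^(l))
delta_l = (Theta_next.T @ delta_next) * a_l * (1 - a_l)
print(delta_l.shape)  # (3,)
```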
Thank you for your answer. I upvoted you!! Could you please cite the sources you referred to in arriving at the answer... :) – Adithya Upadhya Jan 30 '17 at 11:09
@tmangin: Following Andrew Ng's talk, $\delta_j^{(l)}$ is the error of node $j$ in layer $l$. How did you get the definition $\delta_j^{(l)} = \frac{\partial C}{\partial z_j^{(l)}}$? – phuong Feb 9 '17 at 17:40
$\delta_i^{(L)}$ with the highest "l" index $L$ is defined as:

$$\delta_i^{(L)} = \frac{\partial C}{\partial z_i^{(L)}}$$

whereas the deltas with lower "l" indexes are defined by the recursive formula derived above:

$$\delta_i^{(l)} = \sum_k \delta_k^{(l+1)} \theta_{ki}^{(l+1)} \ast a_i^{(l)}\bigl(1 - a_i^{(l)}\bigr)$$
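A quick numerical sanity check of the output-layer definition, assuming for concreteness the squared-error cost $C = \frac{1}{2}\|a^{(L)} - y\|^2$ with a sigmoid output layer (Ng's course uses the cross-entropy cost instead, for which $\delta^{(L)}$ simplifies to $a^{(L)} - y$); all the names and numbers here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(z_L, y):
    # squared-error cost, chosen only for this check
    return 0.5 * np.sum((sigmoid(z_L) - y) ** 2)

z_L = np.array([0.3, -0.7])
y = np.array([1.0, 0.0])

# analytic delta^(L) = dC/dz^(L) = (a - y) * a * (1 - a) for this cost
a = sigmoid(z_L)
delta_L = (a - y) * a * (1 - a)

# central finite-difference approximation of dC/dz_i^(L)
eps = 1e-6
numeric = np.zeros_like(z_L)
for i in range(len(z_L)):
    zp, zm = z_L.copy(), z_L.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cost(zp, y) - cost(zm, y)) / (2 * eps)

print(np.allclose(delta_L, numeric))  # True: delta^(L) really is dC/dz^(L)
```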
I highly recommend reading up on the vectorized notation for calculating the backprop gradients. – CKM Mar 26 '17 at 13:30
Your final usable formula is not what Andrew Ng had, which is making it really frustrating to follow your proof. He had $\nabla_{ij}^{(l)} = \theta^{(l)T} \delta^{(l+1)} .\ast \bigl(a_i^{(l)}(1 - a_i^{(l)})\bigr) \ast a_j^{(l-1)}$, not $\theta^{(l+1)T} \delta^{(l+1)}$. – azizj Aug 5 '17 at 23:19