
How To Derive Errors in Neural Network With The Backpropagation Algorithm?

The document is a question posted on Cross Validated asking how to derive errors in a neural network using the backpropagation algorithm. The top answer explains that δ values represent error terms that are calculated during backpropagation. δ values allow the gradient of the cost function with respect to the weights to be calculated. This gradient is then used in gradient descent training. The key steps are: 1) Forward propagation to calculate activations, 2) Backpropagation to calculate error terms and gradients, 3) Gradient descent to update weights. Calculating δ terms involves applying the chain rule and relationships between errors at different layers.



How to derive errors in neural network with the backpropagation algorithm?

Asked 7 years, 7 months ago · Active 2 months ago · Viewed 8k times

From this video by Andrew Ng, around 5:00:

How are $\delta^{(3)}$ and $\delta^{(2)}$ derived? In fact, what does $\delta^{(3)}$ even mean? $\delta^{(4)}$ is obtained by comparing to $y$; no such comparison is possible for the output of a hidden layer, right?

machine-learning neural-networks backpropagation

asked Apr 19 '14 at 19:27 by qed · edited May 11 '20 at 13:49 by Aditya Saini

The video link is not working. Please update it, or provide a link to the course. Thanks. – MadHatter Nov 28 '17 at 3:09

Though @tmangin answered it beautifully, still, if you want a more detailed explanation, you can refer to this GitHub repo. It is well explained and contains derivations for both the Machine Learning Course and the Deep Learning Specialization. – Aditya Saini Jun 3 '20 at 12:50

1 Answer

I'm going to answer your question about the $\delta_i^{(l)}$, but remember that your question is a sub-question of a larger question, which is why:

$$\nabla_{ij}^{(l)} = \sum_k \theta_{ki}^{(l+1)} \delta_k^{(l+1)} * \left(a_i^{(l)}(1 - a_i^{(l)})\right) * a_j^{(l-1)}$$

Reminder about the steps in neural networks:

Step 1: forward propagation (calculation of the $a_i^{(l)}$)
Step 2a: backward propagation: calculation of the errors $\delta_i^{(l)}$
Step 2b: backward propagation: calculation of the gradient $\nabla_{ij}^{(l)}$ of $J(\Theta)$ using the errors $\delta_i^{(l+1)}$ and the $a_i^{(l)}$
Step 3: gradient descent: calculate the new $\theta_{ij}^{(l)}$ using the gradients $\nabla_{ij}^{(l)}$
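To make these steps concrete, here is a minimal sketch in Python/NumPy (my own illustration, not part of the original answer): a 3-layer network with sigmoid activations, a cross-entropy cost, and biases omitted for brevity. Following the answer's numbering, theta2 maps layer 1 to layer 2 and theta3 maps layer 2 to layer 3.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Tiny network: 2 inputs -> 3 hidden units -> 1 output (biases omitted).
    rng = np.random.default_rng(0)
    theta2 = rng.normal(size=(3, 2))   # maps layer 1 to layer 2
    theta3 = rng.normal(size=(1, 3))   # maps layer 2 to layer 3

    x = np.array([0.5, -0.2])          # one made-up training example
    y = np.array([1.0])
    alpha = 0.5                        # learning rate

    for _ in range(100):
        # Step 1: forward propagation (compute the a^(l)).
        a1 = x
        a2 = sigmoid(theta2 @ a1)
        a3 = sigmoid(theta3 @ a2)

        # Step 2a: backward propagation (compute the errors delta^(l)).
        # With a cross-entropy cost, the output-layer error is simply a3 - y.
        delta3 = a3 - y
        delta2 = (theta3.T @ delta3) * a2 * (1 - a2)

        # Step 2b: gradients of J(Theta): nabla^(l)_ij = delta^(l)_i * a^(l-1)_j.
        grad3 = np.outer(delta3, a2)
        grad2 = np.outer(delta2, a1)

        # Step 3: gradient descent update of the thetas.
        theta3 -= alpha * grad3
        theta2 -= alpha * grad2

Running this drives a3 toward y, which is all that gradient descent on a single example amounts to.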

First, to understand what the $\delta_i^{(l)}$ are, what they represent and why Andrew Ng is talking about them, you need to understand what Andrew is actually doing at that point and why we do all these calculations: he's calculating the gradient $\nabla_{ij}^{(l)}$ of $\theta_{ij}^{(l)}$ to be used in the gradient descent algorithm.
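For reference, that gradient feeds the standard gradient descent update rule, with $\alpha$ the learning rate:

$$\theta_{ij}^{(l)} \leftarrow \theta_{ij}^{(l)} - \alpha \, \nabla_{ij}^{(l)}$$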

The gradient is defined as:

$$\nabla_{ij}^{(l)} = \frac{\partial C}{\partial \theta_{ij}^{(l)}}$$

As we can't really solve this formula directly, we are going to modify it using TWO MAGIC TRICKS to arrive at a formula we can actually calculate. This final usable formula is:

$$\nabla_{ij}^{(l)} = \left(\theta^{(l+1)\,T} \delta^{(l+1)}\right) .* \left(a_i^{(l)}(1 - a_i^{(l)})\right) * a_j^{(l-1)}$$

Note: here the mapping from the 1st layer to the 2nd layer is denoted $\theta^{(2)}$, and so on, instead of $\theta^{(1)}$ as in Andrew Ng's Coursera course.

To arrive at this result, the FIRST MAGIC TRICK is that we can write the gradient $\nabla_{ij}^{(l)}$ of $\theta_{ij}^{(l)}$ using $\delta_i^{(l)}$:

$$\nabla_{ij}^{(l)} = \delta_i^{(l)} * a_j^{(l-1)}$$
With $\delta_i^{(L)}$ defined (for the last-layer index $L$ only) as:

$$\delta_i^{(L)} = \frac{\partial C}{\partial z_i^{(L)}}$$
And then the SECOND MAGIC TRICK uses the relation between $\delta_i^{(l)}$ and $\delta_i^{(l+1)}$ to define the deltas at the other indexes:

$$\delta_i^{(l)} = \left(\theta^{(l+1)\,T} \delta^{(l+1)}\right) .* \left(a_i^{(l)}(1 - a_i^{(l)})\right)$$

And as I said, we can finally write a formula for which we know all the terms:

$$\nabla_{ij}^{(l)} = \left(\theta^{(l+1)\,T} \delta^{(l+1)}\right) .* \left(a_i^{(l)}(1 - a_i^{(l)})\right) * a_j^{(l-1)}$$
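As a quick sanity check of the arithmetic (my own example, with made-up values), take a network with a single unit per layer, so every quantity is a scalar, and pick $\theta^{(l+1)} = 0.5$, $\delta^{(l+1)} = 0.2$, $a_i^{(l)} = 0.6$, $a_j^{(l-1)} = 0.9$. Then:

$$\delta_i^{(l)} = 0.5 \cdot 0.2 \cdot \left(0.6 \cdot (1 - 0.6)\right) = 0.024, \qquad \nabla_{ij}^{(l)} = 0.024 \cdot 0.9 = 0.0216$$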
DEMONSTRATION of the FIRST MAGIC TRICK: $\nabla_{ij}^{(l)} = \delta_i^{(l)} * a_j^{(l-1)}$

We defined:

$$\nabla_{ij}^{(l)} = \frac{\partial C}{\partial \theta_{ij}^{(l)}}$$

The Chain rule for higher dimensions (you should REALLY read this property of the Chain rule) enables us to write:

$$\nabla_{ij}^{(l)} = \sum_k \frac{\partial C}{\partial z_k^{(l)}} * \frac{\partial z_k^{(l)}}{\partial \theta_{ij}^{(l)}}$$
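For reference, the property being used is: if $C$ depends on $\theta$ only through intermediate variables $z_1, \dots, z_K$, then

$$\frac{\partial C}{\partial \theta} = \sum_{k=1}^{K} \frac{\partial C}{\partial z_k} \frac{\partial z_k}{\partial \theta}$$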

However, as:

$$z_k^{(l)} = \sum_m \theta_{km}^{(l)} * a_m^{(l-1)}$$

Here:
m --> unit in layer l - 1
k --> unit in layer l
i --> unit in layer l
j --> unit in layer l - 1

We then can write:

$$\frac{\partial z_k^{(l)}}{\partial \theta_{ij}^{(l)}} = \frac{\partial}{\partial \theta_{ij}^{(l)}} \sum_m \theta_{km}^{(l)} * a_m^{(l-1)}$$

Because of the linearity of differentiation [ (u + v)' = u' + v' ], we can write:

$$\frac{\partial z_k^{(l)}}{\partial \theta_{ij}^{(l)}} = \sum_m \frac{\partial \theta_{km}^{(l)}}{\partial \theta_{ij}^{(l)}} * a_m^{(l-1)}$$

with:

$$\text{if } (k, m) \neq (i, j): \quad \frac{\partial \theta_{km}^{(l)}}{\partial \theta_{ij}^{(l)}} * a_m^{(l-1)} = 0$$

$$\text{if } (k, m) = (i, j): \quad \frac{\partial \theta_{km}^{(l)}}{\partial \theta_{ij}^{(l)}} * a_m^{(l-1)} = \frac{\partial \theta_{ij}^{(l)}}{\partial \theta_{ij}^{(l)}} * a_j^{(l-1)} = a_j^{(l-1)}$$

Then for $k = i$ (otherwise it's clearly equal to zero):

$$\frac{\partial z_i^{(l)}}{\partial \theta_{ij}^{(l)}} = \frac{\partial \theta_{ij}^{(l)}}{\partial \theta_{ij}^{(l)}} * a_j^{(l-1)} + \sum_{m \neq j} \frac{\partial \theta_{im}^{(l)}}{\partial \theta_{ij}^{(l)}} * a_m^{(l-1)} = a_j^{(l-1)} + 0$$

Finally, for $k = i$:

$$\frac{\partial z_i^{(l)}}{\partial \theta_{ij}^{(l)}} = a_j^{(l-1)}$$
As a result, we can write our first expression of the gradient $\nabla_{ij}^{(l)}$:

$$\nabla_{ij}^{(l)} = \frac{\partial C}{\partial z_i^{(l)}} * \frac{\partial z_i^{(l)}}{\partial \theta_{ij}^{(l)}}$$

Which is equivalent to:

$$\nabla_{ij}^{(l)} = \frac{\partial C}{\partial z_i^{(l)}} * a_j^{(l-1)}$$

Or:

$$\nabla_{ij}^{(l)} = \delta_i^{(l)} * a_j^{(l-1)}$$
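This identity is easy to verify numerically. Here is a small sketch (my own check, not from the answer; it assumes sigmoid activations and a squared-error cost $C = \frac{1}{2}\lVert a^{(3)} - y \rVert^2$, for which $\delta^{(3)} = (a^{(3)} - y) .* a^{(3)}(1 - a^{(3)})$), comparing $\delta_i^{(2)} * a_j^{(1)}$ against a finite-difference estimate of $\partial C / \partial \theta_{ij}^{(2)}$:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta2, theta3, x, y):
        a3 = sigmoid(theta3 @ sigmoid(theta2 @ x))
        return 0.5 * np.sum((a3 - y) ** 2)

    rng = np.random.default_rng(1)
    theta2 = rng.normal(size=(3, 2))
    theta3 = rng.normal(size=(1, 3))
    x, y = np.array([0.5, -0.2]), np.array([1.0])

    # Analytic gradient via nabla^(2)_ij = delta^(2)_i * a^(1)_j.
    a1, a2 = x, sigmoid(theta2 @ x)
    a3 = sigmoid(theta3 @ a2)
    delta3 = (a3 - y) * a3 * (1 - a3)             # dC/dz^(3) for squared error
    delta2 = (theta3.T @ delta3) * a2 * (1 - a2)  # the second magic trick
    grad_analytic = np.outer(delta2, a1)

    # Finite-difference estimate of dC/dtheta^(2)_ij.
    eps = 1e-6
    grad_numeric = np.zeros_like(theta2)
    for i in range(theta2.shape[0]):
        for j in range(theta2.shape[1]):
            plus, minus = theta2.copy(), theta2.copy()
            plus[i, j] += eps
            minus[i, j] -= eps
            grad_numeric[i, j] = (cost(plus, theta3, x, y)
                                  - cost(minus, theta3, x, y)) / (2 * eps)

    print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny, on the order of 1e-10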

DEMONSTRATION OF THE SECOND MAGIC TRICK:

$$\delta_i^{(l)} = \left(\theta^{(l+1)\,T} \delta^{(l+1)}\right) .* \left(a_i^{(l)}(1 - a_i^{(l)})\right)$$

or:

$$\delta^{(l)} = \left(\theta^{(l+1)\,T} \delta^{(l+1)}\right) .* \left(a^{(l)}(1 - a^{(l)})\right)$$

Remember that we posed:

$$\delta^{(l)} = \frac{\partial C}{\partial z^{(l)}} \quad \text{and} \quad \delta_i^{(l)} = \frac{\partial C}{\partial z_i^{(l)}}$$

Again, the Chain rule for higher dimensions enables us to write:

$$\delta_i^{(l)} = \sum_k \frac{\partial C}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}}$$

Replacing $\frac{\partial C}{\partial z_k^{(l+1)}}$ by $\delta_k^{(l+1)}$, we have:

$$\delta_i^{(l)} = \sum_k \delta_k^{(l+1)} \frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}}$$

Now, let's focus on $\frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}}$. We have:

$$z_k^{(l+1)} = \sum_j \theta_{kj}^{(l+1)} * a_j^{(l)} = \sum_j \theta_{kj}^{(l+1)} * g(z_j^{(l)})$$

Then we differentiate this expression with respect to $z_i^{(l)}$:

$$\frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}} = \frac{\partial \sum_j \theta_{kj}^{(l+1)} * g(z_j^{(l)})}{\partial z_i^{(l)}}$$

Because of the linearity of differentiation, we can write:

$$\frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}} = \sum_j \theta_{kj}^{(l+1)} * \frac{\partial g(z_j^{(l)})}{\partial z_i^{(l)}}$$

If $j \neq i$, then:

$$\frac{\partial \left(\theta_{kj}^{(l+1)} * g(z_j^{(l)})\right)}{\partial z_i^{(l)}} = 0$$

As a consequence:

$$\frac{\partial z_k^{(l+1)}}{\partial z_i^{(l)}} = \theta_{ki}^{(l+1)} * \frac{\partial g(z_i^{(l)})}{\partial z_i^{(l)}}$$

And then:

$$\delta_i^{(l)} = \sum_k \delta_k^{(l+1)} \theta_{ki}^{(l+1)} \frac{\partial g(z_i^{(l)})}{\partial z_i^{(l)}}$$
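The next step uses the standard derivative of the sigmoid $g(z) = \frac{1}{1 + e^{-z}}$, which is worth spelling out once:

$$g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)\left(1 - g(z)\right)$$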

As $g'(z) = g(z)(1 - g(z))$, we have:

$$\delta_i^{(l)} = \sum_k \delta_k^{(l+1)} \theta_{ki}^{(l+1)} * g(z_i^{(l)})\left(1 - g(z_i^{(l)})\right)$$

And as $g(z_i^{(l)}) = a_i^{(l)}$, we have:

$$\delta_i^{(l)} = \sum_k \delta_k^{(l+1)} \theta_{ki}^{(l+1)} * a_i^{(l)}\left(1 - a_i^{(l)}\right)$$

And finally, using the vectorized notation:

$$\nabla_{ij}^{(l)} = \left[\theta^{(l+1)\,T} \delta^{(l+1)} .* \left(a_i^{(l)}(1 - a_i^{(l)})\right)\right] * \left[a_j^{(l-1)}\right]$$
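In code, the vectorized formula for one layer is just two lines. A self-contained NumPy sketch with made-up shapes (again my own illustration, biases omitted):

    import numpy as np

    # Illustrative shapes: layer l-1 has 2 units, layer l has 3, layer l+1 has 1.
    rng = np.random.default_rng(2)
    theta_next = rng.normal(size=(1, 3))   # theta^(l+1)
    delta_next = rng.normal(size=(1,))     # delta^(l+1)
    a_l = rng.uniform(size=(3,))           # a^(l)
    a_prev = rng.uniform(size=(2,))        # a^(l-1)

    delta_l = (theta_next.T @ delta_next) * a_l * (1 - a_l)  # delta^(l)
    grad_l = np.outer(delta_l, a_prev)     # nabla^(l), same shape as theta^(l)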

answered May 13 '15 at 15:32 by tmangin · edited Sep 20 at 12:00 by Aditya Khedekar

Thank you for your answer. I upvoted you!! Could you please cite the sources you referred to for arriving at the answer... :) – Adithya Upadhya Jan 30 '17 at 11:09

@tmangin: Following Andrew Ng's talk, $\delta_j^{(l)}$ is the error of node $j$ in layer $l$. How did you get the definition $\delta_j^{(l)} = \frac{\partial C}{\partial z_j^{(l)}}$? – phuong Feb 9 '17 at 17:40


@phuong Actually, you're right to ask: only the $\delta_i^{(L)}$ with the highest "l" index $L$ is defined as

$$\delta_i^{(L)} = \frac{\partial C}{\partial z_i^{(L)}}$$

whereas the deltas with lower "l" indexes are defined by the following formula:

$$\delta_i^{(l)} = \left(\theta^{(l+1)\,T} \delta^{(l+1)}\right) .* \left(a_i^{(l)}(1 - a_i^{(l)})\right)$$

– tmangin Feb 28 '17 at 14:20

I highly recommend reading the backprop vectorial notation for calculating the gradients. – CKM Mar 26 '17 at 13:30

Your final usable formula is not what Andrew Ng had, which is making it really frustrating to follow your proof. He had $\nabla_{ij}^{(l)} = \theta^{(l)\,T} \delta^{(l+1)} .* \left(a_i^{(l)}(1 - a_i^{(l)})\right) * a_j^{(l-1)}$, not $\theta^{(l+1)\,T} \delta^{(l+1)}$. – azizj Aug 5 '17 at 23:19


