Back Propagation

Backpropagation is a crucial algorithm that significantly speeds up the training of deep learning models by efficiently calculating derivatives using computational graphs. It operates through reverse-mode differentiation, allowing the computation of derivatives for multiple inputs simultaneously, which is particularly beneficial in neural networks with numerous parameters. The document emphasizes the importance of understanding derivatives in various fields and highlights the efficiency of backpropagation in optimizing models.

08/01/2025, 09:42 Calculus on Computational Graphs: Backpropagation -- colah's blog

Calculus on Computational Graphs: Backpropagation
Posted on August 31, 2015

Introduction
Backpropagation is the key algorithm that makes training deep models computationally
tractable. For modern neural networks, it can make training with gradient descent as
much as ten million times faster, relative to a naive implementation. That’s the difference
between a model taking a week to train and taking 200,000 years.
Beyond its use in deep learning, backpropagation is a powerful computational tool in
many other areas, ranging from weather forecasting to analyzing numerical stability – it
just goes by different names. In fact, the algorithm has been reinvented at least dozens of
times in different fields (see Griewank (2010) (http://www.math.uiuc.edu/documenta/vol-ismp/52_griewank-andreas-b.pdf)). The general, application-independent name is
“reverse-mode differentiation.”

Fundamentally, it’s a technique for calculating derivatives quickly. And it’s an essential
trick to have in your bag, not only in deep learning, but in a wide variety of numerical
computing situations.

Computational Graphs
Computational graphs are a nice way to think about mathematical expressions. For
example, consider the expression e = (a + b) ∗ (b + 1). There are three operations: two
additions and one multiplication. To help us talk about this, let’s introduce two
intermediary variables, c and d, so that every function’s output has a variable. We now
have:

c = a + b

d = b + 1

e = c ∗ d

To create a computational graph, we make each of these operations, along with the input
variables, into nodes. When one node’s value is the input to another node, an arrow goes
from one to another.

These sorts of graphs come up all the time in computer science, especially in talking
about functional programs. They are very closely related to the notions of dependency
graphs and call graphs. They’re also the core abstraction behind the popular deep
learning framework Theano (http://deeplearning.net/software/theano/).

We can evaluate the expression by setting the input variables to certain values and
computing nodes up through the graph. For example, let’s set a = 2 and b = 1 :
The expression evaluates to 6.
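As a minimal sketch of this evaluation, the graph can be computed node by node in Python. The function name `evaluate` and the dictionary layout are illustrative choices, not something from the original post.

```python
# Evaluate e = (a + b) * (b + 1) one node at a time, mirroring the graph.
# The name `evaluate` and the returned dict are assumptions for illustration.
def evaluate(a, b):
    c = a + b  # first addition node
    d = b + 1  # second addition node
    e = c * d  # multiplication node
    return {"a": a, "b": b, "c": c, "d": d, "e": e}


values = evaluate(a=2, b=1)
print(values)  # {'a': 2, 'b': 1, 'c': 3, 'd': 2, 'e': 6}
```

Keeping every intermediate value around is what later makes the derivatives on each edge cheap to read off.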

Derivatives on Computational Graphs

If one wants to understand derivatives in a computational graph, the key is to understand
derivatives on the edges. If a directly affects c, then we want to know how it affects c. If a
changes a little bit, how does c change? We call this the partial derivative
(https://en.wikipedia.org/wiki/Partial_derivative) of c with respect to a.

To evaluate the partial derivatives in this graph, we need the sum rule
(https://en.wikipedia.org/wiki/Sum_rule_in_differentiation) and the product rule
(https://en.wikipedia.org/wiki/Product_rule):

$$\frac{\partial}{\partial a}(a + b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1$$

$$\frac{\partial}{\partial u}uv = u\frac{\partial v}{\partial u} + v\frac{\partial u}{\partial u} = v$$

Below, the graph has the derivative on each edge labeled.
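One way to make the edge labels concrete is to write them out for a = 2 and b = 1. This is only a sketch: the dictionary name `edge_derivative` and the `(from, to)` key convention are assumptions made here for illustration.

```python
# Edge derivatives for e = (a + b) * (b + 1) at a = 2, b = 1.
# Keys are (from_node, to_node); this layout is an illustrative assumption.
a, b = 2, 1
c, d = a + b, b + 1

edge_derivative = {
    ("a", "c"): 1,  # sum rule: d(a + b)/da = 1
    ("b", "c"): 1,  # sum rule: d(a + b)/db = 1
    ("b", "d"): 1,  # sum rule: d(b + 1)/db = 1
    ("c", "e"): d,  # product rule: d(c * d)/dc = d = 2
    ("d", "e"): c,  # product rule: d(c * d)/dd = c = 3
}

for edge, value in edge_derivative.items():
    print(edge, value)
```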

What if we want to understand how nodes that aren’t directly connected affect each
other? Let’s consider how e is affected by a . If we change a at a speed of 1, c also changes
at a speed of 1. In turn, c changing at a speed of 1 causes e to change at a speed of 2. So
e changes at a rate of 1 ∗ 2 with respect to a .
The general rule is to sum over all possible paths from one node to the other, multiplying
the derivatives on each edge of the path together. For example, to get the derivative of e
with respect to b we get:

$$\frac{\partial e}{\partial b} = 1 * 2 + 1 * 3$$

This accounts for how b affects e through c and also how it affects it through d.

This general “sum over paths” rule is just a different way of thinking about the
multivariate chain rule (https://en.wikipedia.org/wiki/Chain_rule#Higher_dimensions).
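The "sum over paths" rule can be sketched as a short recursive function over the example graph at a = 2, b = 1. The names `edges` and `path_sum` are hypothetical, and the approach is deliberately naive: it walks every path separately.

```python
# Edge derivatives of e = (a + b) * (b + 1) at a = 2, b = 1.
# `edges` and `path_sum` are illustrative names, not from the original post.
edges = {
    ("a", "c"): 1,
    ("b", "c"): 1,
    ("b", "d"): 1,
    ("c", "e"): 2,
    ("d", "e"): 3,
}


def path_sum(source, target):
    """Sum, over every path from source to target, the product of edge derivatives."""
    if source == target:
        return 1  # the empty path contributes a factor of 1
    total = 0
    for (start, end), derivative in edges.items():
        if start == source:
            total += derivative * path_sum(end, target)
    return total


print(path_sum("b", "e"))  # 1 * 2 + 1 * 3 = 5
print(path_sum("a", "e"))  # 1 * 2 = 2
```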

Factoring Paths
The problem with just “summing over the paths” is that it’s very easy to get a
combinatorial explosion in the number of possible paths.

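In the above diagram, there are three paths from X to Y, and a further three paths from
Y to Z. If we want to get the derivative $\frac{\partial Z}{\partial X}$ by summing over all paths, we need to sum
over 3 ∗ 3 = 9 paths:

$$\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta$$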

The above only has nine paths, but the number of paths can easily grow
exponentially as the graph becomes more complicated.

Instead of just naively summing over the paths, it would be much better to factor them:

$$\frac{\partial Z}{\partial X} = (\alpha + \beta + \gamma)(\delta + \epsilon + \zeta)$$
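A quick numerical check can make the factoring concrete. The edge values below are arbitrary numbers chosen for illustration (the post does not give values for the Greek-letter edges), and the variable names are assumptions.

```python
from itertools import product

# Arbitrary illustrative values for the edge derivatives.
x_to_y = [0.5, 2.0, 3.0]   # alpha, beta, gamma
y_to_z = [1.0, 4.0, 0.25]  # delta, epsilon, zeta

# Naive: one product per path, 3 * 3 = 9 paths in total.
paths = list(product(x_to_y, y_to_z))
naive = sum(first * second for first, second in paths)

# Factored: merge the paths at Y, then multiply once.
factored = sum(x_to_y) * sum(y_to_z)

print(len(paths))       # 9
print(naive, factored)  # 28.875 28.875
```

The naive version needs one multiplication per path, while the factored version needs a single multiplication regardless of how many parallel edges there are.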

This is where “forward-mode differentiation” and “reverse-mode differentiation” come in.

They’re algorithms for efficiently computing the sum by factoring the paths. Instead of
summing over all of the paths explicitly, they compute the same sum more efficiently by
merging paths back together at every node. In fact, both algorithms touch each edge
exactly once!

Forward-mode differentiation starts at an input to the graph and moves towards the end.
At every node, it sums all the paths feeding in. Each of those paths represents one way in
which the input affects that node. By adding them up, we get the total way in which the
node is affected by the input: its derivative.

Though you probably didn’t think of it in terms of graphs, forward-mode differentiation is
very similar to what you implicitly learned to do if you took an introduction to calculus
class.

Reverse-mode differentiation, on the other hand, starts at an output of the graph and
moves towards the beginning. At each node, it merges all paths which originated at that
node.

Forward-mode differentiation tracks how one input affects every node. Reverse-mode
differentiation tracks how every node affects one output. That is, forward-mode
differentiation applies the operator $\frac{\partial}{\partial X}$ to every node, while reverse-mode differentiation
applies the operator $\frac{\partial Z}{\partial}$ to every node. [1]

Computational Victories
At this point, you might wonder why anyone would care about reverse-mode
differentiation. It looks like a strange way of doing the same thing as the forward-mode. Is
there some advantage?

Let’s consider our original example again:

We can use forward-mode differentiation from b up. This gives us the derivative of every
node with respect to b.
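Here is a minimal sketch of that forward pass on the example graph at a = 2, b = 1. The data layout (`incoming`, `topological_order`) and the function name `forward_mode` are assumptions made for illustration, not the post's own code.

```python
# incoming[node] lists (parent, derivative of node with respect to parent)
# for e = (a + b) * (b + 1) at a = 2, b = 1. Names here are illustrative.
incoming = {
    "c": [("a", 1), ("b", 1)],
    "d": [("b", 1)],
    "e": [("c", 2), ("d", 3)],
}
topological_order = ["a", "b", "c", "d", "e"]


def forward_mode(seed):
    """Return d(node)/d(seed) for every node, visiting each edge once."""
    derivative = {node: 0 for node in topological_order}
    derivative[seed] = 1  # the seed input changes at a speed of 1
    for node in topological_order:
        for parent, edge in incoming.get(node, []):
            # Sum the contributions of all paths feeding into this node.
            derivative[node] += edge * derivative[parent]
    return derivative


print(forward_mode("b"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 1, 'e': 5}
```

Getting the derivative with respect to a would need a second call, `forward_mode("a")`.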

We’ve computed $\frac{\partial e}{\partial b}$, the derivative of our output with respect to one of our inputs.

What if we do reverse-mode differentiation from e down? This gives us the derivative of e
with respect to every node:

When I say that reverse-mode differentiation gives us the derivative of e with respect to
every node, I really do mean every node. We get both $\frac{\partial e}{\partial a}$ and $\frac{\partial e}{\partial b}$, the derivatives of e
with respect to both inputs. Forward-mode differentiation gave us the derivative of our
output with respect to a single input, but reverse-mode differentiation gives us all of
them.
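The reverse pass can be sketched the same way, again on the example graph at a = 2, b = 1. The names `outgoing`, `reverse_order` and `reverse_mode` are hypothetical, chosen here for illustration.

```python
# outgoing[node] lists (child, derivative of child with respect to node)
# for e = (a + b) * (b + 1) at a = 2, b = 1. Names here are illustrative.
outgoing = {
    "a": [("c", 1)],
    "b": [("c", 1), ("d", 1)],
    "c": [("e", 2)],
    "d": [("e", 3)],
}
reverse_order = ["e", "d", "c", "b", "a"]  # output first, inputs last


def reverse_mode(output):
    """Return d(output)/d(node) for every node, visiting each edge once."""
    gradient = {node: 0 for node in reverse_order}
    gradient[output] = 1  # d(output)/d(output) = 1
    for node in reverse_order:
        for child, edge in outgoing.get(node, []):
            # Merge all paths that originate at this node.
            gradient[node] += edge * gradient[child]
    return gradient


print(reverse_mode("e"))  # {'e': 1, 'd': 3, 'c': 2, 'b': 5, 'a': 2}
```

A single call yields the derivative of e with respect to both a and b.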

For this graph, that’s only a factor of two speed up, but imagine a function with a million
inputs and one output. Forward-mode differentiation would require us to go through the
graph a million times to get the derivatives. Reverse-mode differentiation can get them all
in one fell swoop! A speed up of a factor of a million is pretty nice!
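To illustrate the difference in the number of passes, here is a small sketch with four inputs standing in for the million. The `Dual` and `Var` classes, the `backward` helper and the test function `f` are all hypothetical names written for this example; they support only addition and multiplication.

```python
class Dual:
    """Forward mode: a value plus its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


class Var:
    """Reverse mode: a value that remembers its parents and the edge derivatives."""

    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate d(output)/d(node) to every node in one sweep."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before the node

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # each node is handled before its parents
        for parent, edge in node.parents:
            parent.grad += edge * node.grad


def f(xs):
    """Many inputs, one output: the sum of squares of the inputs."""
    total = xs[0] * xs[0]
    for x in xs[1:]:
        total = total + x * x
    return total


inputs = [1.0, 2.0, 3.0, 4.0]

# Forward mode: one full evaluation of f per input.
forward_grad = []
for i in range(len(inputs)):
    duals = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(inputs)]
    forward_grad.append(f(duals).deriv)

# Reverse mode: one evaluation of f plus one backward sweep.
variables = [Var(v) for v in inputs]
backward(f(variables))
reverse_grad = [v.grad for v in variables]

print(forward_grad)  # [2.0, 4.0, 6.0, 8.0]
print(reverse_grad)  # [2.0, 4.0, 6.0, 8.0]
```

Both approaches agree on the gradient, but the forward-mode loop evaluates `f` once per input while the reverse-mode version evaluates it once in total.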

When training neural networks, we think of the cost (a value describing how badly a neural
network performs) as a function of the parameters (numbers describing how the network
behaves). We want to calculate the derivatives of the cost with respect to all the
parameters, for use in gradient descent (https://en.wikipedia.org/wiki/Gradient_descent).
Now, there are often millions, or even tens of millions, of parameters in a neural network. So,
reverse-mode differentiation, called backpropagation in the context of neural networks,
gives us a massive speedup!

(Are there any cases where forward-mode differentiation makes more sense? Yes, there
are! Where the reverse-mode gives the derivatives of one output with respect to all inputs,
the forward-mode gives us the derivatives of all outputs with respect to one input. If one
has a function with lots of outputs, forward-mode differentiation can be much, much,
much faster.)
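The opposite case can be sketched with a function of one input and several outputs. The `Dual` class and the function `powers` below are hypothetical, written only for this illustration; a single forward pass yields the derivative of every output.

```python
class Dual:
    """Forward mode: a value plus its derivative with respect to the single input."""

    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def powers(t, count=4):
    """One input, many outputs: t, t^2, ..., t^count."""
    outputs, power = [], t
    for _ in range(count):
        outputs.append(power)
        power = power * t
    return outputs


# Seed the input with derivative 1 and run the function once.
results = powers(Dual(2.0, 1.0))
derivatives = [r.deriv for r in results]
print(derivatives)  # [1.0, 4.0, 12.0, 32.0]
```

Reverse mode would need one backward sweep per output here, four in total, to recover the same list.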

Isn’t This Trivial?

When I first understood what backpropagation was, my reaction was: “Oh, that’s just the
chain rule! How did it take us so long to figure out?” I’m not the only one who’s had that
reaction. It’s true that if you ask “is there a smart way to calculate derivatives in
feedforward neural networks?” the answer isn’t that difficult.

But I think it was much more difficult than it might seem. You see, at the time
backpropagation was invented, people weren’t very focused on the feedforward neural
networks that we study. It also wasn’t obvious that derivatives were the right way to train
them. Those are only obvious once you realize you can quickly calculate derivatives. There
was a circular dependency.

Worse, it would be very easy to write off any piece of the circular dependency as
impossible on casual thought. Training neural networks with derivatives? Surely you’d
just get stuck in local minima. And obviously it would be expensive to compute all those
derivatives. It’s only because we know this approach works that we don’t immediately
start listing reasons it’s likely not to.

That’s the benefit of hindsight. Once you’ve framed the question, the hardest work is
already done.

Conclusion
Derivatives are cheaper than you think. That’s the main lesson to take away from this
post. In fact, they’re unintuitively cheap, and us silly humans have had to repeatedly
rediscover this fact. That’s an important thing to understand in deep learning. It’s also a
really useful thing to know in other fields, and only more so if it isn’t common knowledge.

Are there other lessons? I think there are.

Backpropagation is also a useful lens for understanding how derivatives flow through a
model. This can be extremely helpful in reasoning about why some models are difficult to
optimize. The classic example of this is the problem of vanishing gradients in recurrent
neural networks.

Finally, I claim there is a broad algorithmic lesson to take away from these techniques.
Backpropagation and forward-mode differentiation use a powerful pair of tricks
(linearization and dynamic programming) to compute derivatives more efficiently than
one might think possible. If you really understand these techniques, you can use them to
efficiently calculate several other interesting expressions involving derivatives. We’ll
explore this in a later blog post.

This post gives a very abstract treatment of backpropagation. I strongly recommend
reading Michael Nielsen’s chapter on it
(http://neuralnetworksanddeeplearning.com/chap2.html) for an excellent discussion, more
concretely focused on neural networks.

Acknowledgments
Thank you to Greg Corrado (http://research.google.com/pubs/GregCorrado.html), Jon
Shlens (https://shlens.wordpress.com/), Samy Bengio (http://bengio.abracadoudou.com/)
and Anelia Angelova (http://www.vision.caltech.edu/anelia/) for taking the time to
proofread this post.

Thanks also to Dario Amodei (https://www.linkedin.com/pub/dario-amodei/4/493/393),
Michael Nielsen (http://michaelnielsen.org/) and Yoshua Bengio
(http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html) for discussion of
approaches to explaining backpropagation. Also thanks to all those who tolerated me
practicing explaining backpropagation in talks and seminar series!
1. This might feel a bit like dynamic programming
(https://en.wikipedia.org/wiki/Dynamic_programming). That’s because it is!

