Back Propagation
Calculus on Computational Graphs: Backpropagation
Posted on August 31, 2015
Introduction
Backpropagation is the key algorithm that makes training deep models computationally
tractable. For modern neural networks, it can make training with gradient descent as
much as ten million times faster, relative to a naive implementation. That’s the difference
between a model taking a week to train and taking 200,000 years.
Beyond its use in deep learning, backpropagation is a powerful computational tool in
many other areas, ranging from weather forecasting to analyzing numerical stability – it
just goes by different names. In fact, the algorithm has been reinvented at least dozens of
times in different fields (see Griewank (2010)
(http://www.math.uiuc.edu/documenta/vol-ismp/52_griewank-andreas-b.pdf)). The general,
application-independent name is “reverse-mode differentiation.”
Fundamentally, it’s a technique for calculating derivatives quickly. And it’s an essential
trick to have in your bag, not only in deep learning, but in a wide variety of numerical
computing situations.
08/01/2025, 09:42 Calculus on Computational Graphs: Backpropagation -- colah's blog
Computational Graphs
Computational graphs are a nice way to think about mathematical expressions. For
example, consider the expression e = (a + b) ∗ (b + 1). There are three operations: two
additions and one multiplication. To help us talk about this, let’s introduce two
intermediary variables, c and d so that every function’s output has a variable. We now
have:
c = a + b
d = b + 1
e = c ∗ d
To create a computational graph, we make each of these operations, along with the input
variables, into nodes. When one node’s value is the input to another node, an arrow goes
from one to another.
These sorts of graphs come up all the time in computer science, especially in talking
about functional programs. They are very closely related to the notions of dependency
graphs and call graphs. They’re also the core abstraction behind the popular deep
learning framework Theano (http://deeplearning.net/software/theano/).
We can evaluate the expression by setting the input variables to certain values and
computing nodes up through the graph. For example, let’s set a = 2 and b = 1. Then
c = 3, d = 2, and the expression evaluates to e = 6.
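The evaluation just described can be sketched in a few lines of Python. The dictionary encoding and the names `GRAPH`, `evaluate`, and `one` are illustrative assumptions, not something the post defines; this is one of many ways to represent such a graph.

```python
import operator

# Hypothetical encoding of the example graph: each non-input node maps to
# (operation, names of its input nodes). Dict order is a valid evaluation
# order (every node appears after the nodes it depends on).
GRAPH = {
    "c": (operator.add, ("a", "b")),
    "d": (operator.add, ("b", "one")),
    "e": (operator.mul, ("c", "d")),
}


def evaluate(graph, inputs):
    """Compute every node's value, moving from the inputs up through the graph."""
    values = dict(inputs)
    values["one"] = 1  # the constant in d = b + 1
    for name, (op, parents) in graph.items():
        values[name] = op(*(values[p] for p in parents))
    return values


values = evaluate(GRAPH, {"a": 2, "b": 1})
print(values["c"], values["d"], values["e"])  # 3 2 6
```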
If a changes a little bit, how does c change? We call this the partial derivative
(https://en.wikipedia.org/wiki/Partial_derivative) of c with respect to a.
To evaluate the partial derivatives in this graph, we need the sum rule
(https://en.wikipedia.org/wiki/Sum_rule_in_differentiation) and the product rule
(https://en.wikipedia.org/wiki/Product_rule):
$$\frac{\partial}{\partial a}(a + b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1$$
$$\frac{\partial}{\partial u}uv = u\frac{\partial v}{\partial u} + v\frac{\partial u}{\partial u} = v$$
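Applied to the running example (c = a + b, d = b + 1, e = c ∗ d), these two rules give one derivative per edge. The sketch below assumes the values a = 2, b = 1 used earlier; the helper names `local_derivatives` and `finite_difference` are made up for illustration, and the finite-difference check is only a sanity test, not part of the method.

```python
def local_derivatives(a, b):
    """Derivative on each edge of the example graph, keyed as (from, to)."""
    c = a + b
    d = b + 1
    return {
        ("a", "c"): 1.0,  # sum rule: d(a + b)/da = 1
        ("b", "c"): 1.0,  # sum rule: d(a + b)/db = 1
        ("b", "d"): 1.0,  # sum rule: d(b + 1)/db = 1
        ("c", "e"): d,    # product rule: d(c * d)/dc = d
        ("d", "e"): c,    # product rule: d(c * d)/dd = c
    }


def finite_difference(f, x, h=1e-6):
    """Central-difference estimate of df/dx, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2 * h)


edges = local_derivatives(2.0, 1.0)
# Numerically confirm de/dc at c = 3 while holding d = 2 fixed.
check = finite_difference(lambda c: c * 2.0, 3.0)
print(edges[("c", "e")], edges[("d", "e")])
```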
What if we want to understand how nodes that aren’t directly connected affect each
other? Let’s consider how e is affected by a. If we change a at a speed of 1, c also changes
at a speed of 1. In turn, c changing at a speed of 1 causes e to change at a speed of 2. So
e changes at a rate of 1 ∗ 2 with respect to a.
The general rule is to sum over all possible paths from one node to the other, multiplying
the derivatives on each edge of the path together. For example, to get the derivative of e
with respect to b we get:
$$\frac{\partial e}{\partial b} = 1 * 2 + 1 * 3$$
This accounts for how b affects e through c and also how it affects it through d.
This general “sum over paths” rule is just a different way of thinking about the
multivariate chain rule (https://en.wikipedia.org/wiki/Chain_rule#Higher_dimensions).
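As a rough illustration of the sum-over-paths rule, the following sketch enumerates every path explicitly and adds up the products of edge derivatives. The edge values are the ones for a = 2, b = 1; the adjacency-dict layout and the names `EDGES`, `paths`, and `derivative` are assumptions made for this example.

```python
import math

# Edge derivatives of the example graph at a = 2, b = 1, stored as
# node -> {child: derivative of child with respect to node}.
EDGES = {
    "a": {"c": 1.0},
    "b": {"c": 1.0, "d": 1.0},
    "c": {"e": 2.0},
    "d": {"e": 3.0},
    "e": {},
}


def paths(edges, start, end):
    """Yield each path from start to end as a list of edge derivatives."""
    if start == end:
        yield []
        return
    for child, weight in edges[start].items():
        for rest in paths(edges, child, end):
            yield [weight] + rest


def derivative(edges, start, end):
    """Sum, over all paths, the product of the derivatives along the path."""
    return sum(math.prod(path) for path in paths(edges, start, end))


de_da = derivative(EDGES, "a", "e")
de_db = derivative(EDGES, "b", "e")
print(de_da, de_db)  # 2.0 5.0
```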
Factoring Paths
The problem with just “summing over the paths” is that it’s very easy to get a
combinatorial explosion in the number of possible paths.
In the above diagram, there are three paths from X to Y, and a further three paths from
Y to Z. If we want to get the derivative ∂Z/∂X by summing over all paths, we need to sum
over 3 ∗ 3 = 9 paths:
$$\frac{\partial Z}{\partial X} = \alpha\delta + \alpha\epsilon + \alpha\zeta + \beta\delta + \beta\epsilon + \beta\zeta + \gamma\delta + \gamma\epsilon + \gamma\zeta$$
The above only has nine paths, but it would be easy for the number of paths to grow
exponentially as the graph becomes more complicated.
Instead of just naively summing over the paths, it would be much better to factor them:
$$\frac{\partial Z}{\partial X} = (\alpha + \beta + \gamma)(\delta + \epsilon + \zeta)$$
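A quick numeric check makes the equivalence concrete. The edge values below are arbitrary numbers chosen for illustration (the post assigns none); the point is only that the nine-term sum and the factored product agree, while the factored form needs a single multiplication.

```python
from itertools import product

# Arbitrary illustrative values for the edge derivatives in the diagram.
alpha, beta, gamma = 2.0, 3.0, 5.0      # the three edges from X to Y
delta, epsilon, zeta = 7.0, 11.0, 13.0  # the three edges from Y to Z

x_to_y = [alpha, beta, gamma]
y_to_z = [delta, epsilon, zeta]

# Naive: one product per path, 3 * 3 = 9 terms.
path_by_path = sum(p * q for p, q in product(x_to_y, y_to_z))

# Factored: add the edges at each stage first, then multiply once.
factored = sum(x_to_y) * sum(y_to_z)

print(path_by_path, factored)  # 310.0 310.0
```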
Forward-mode differentiation starts at an input to the graph and moves towards the end.
At every node, it sums all the paths feeding in. Each of those paths represents one way in
which the input affects that node. By adding them up, we get the total way in which the
node is affected by the input: its derivative.
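A minimal sketch of forward-mode differentiation on the earlier example (c = a + b, d = b + 1, e = c ∗ d at a = 2, b = 1) might look like the following. The `LOCAL` table of per-edge derivatives and the function name `forward_mode` are assumptions for illustration, not an existing API.

```python
# For each non-input node, the derivative with respect to each parent,
# evaluated at a = 2, b = 1. Dict order is a topological order of the graph.
LOCAL = {
    "c": {"a": 1.0, "b": 1.0},
    "d": {"b": 1.0},
    "e": {"c": 2.0, "d": 3.0},
}


def forward_mode(local, input_names, wrt):
    """Return d(node)/d(wrt) for every node, sweeping from inputs to output."""
    # Seed: the chosen input changes at speed 1, the other inputs at speed 0.
    deriv = {name: (1.0 if name == wrt else 0.0) for name in input_names}
    for node, parents in local.items():
        # Sum over incoming edges: edge derivative times the parent's derivative.
        deriv[node] = sum(w * deriv[p] for p, w in parents.items())
    return deriv


forward_b = forward_mode(LOCAL, ["a", "b"], "b")
print(forward_b)  # {'a': 0.0, 'b': 1.0, 'c': 1.0, 'd': 1.0, 'e': 5.0}
```

One sweep visits each edge once and yields the derivative of every node with respect to the single input b.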
Reverse-mode differentiation, on the other hand, starts at an output of the graph and
moves towards the beginning. At each node, it merges all paths which originated at that
node.
Forward-mode differentiation tracks how one input affects every node. Reverse-mode
differentiation tracks how every node affects one output. That is, forward-mode
differentiation applies the operator ∂/∂X to every node, while reverse-mode differentiation
applies the operator ∂Z/∂ to every node.
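The reverse-mode counterpart can be sketched on the same example graph (c = a + b, d = b + 1, e = c ∗ d at a = 2, b = 1). As before, the `LOCAL` table and the name `reverse_mode` are illustrative assumptions rather than any library's interface.

```python
# For each non-input node, the derivative with respect to each parent,
# evaluated at a = 2, b = 1. Dict order is a topological order of the graph.
LOCAL = {
    "c": {"a": 1.0, "b": 1.0},
    "d": {"b": 1.0},
    "e": {"c": 2.0, "d": 3.0},
}


def reverse_mode(local, output):
    """Return d(output)/d(node) for every node, sweeping from output to inputs."""
    grad = {output: 1.0}  # seed: the output changes at speed 1 with itself
    # Reverse topological order: a node is finished before its parents are updated.
    for node in reversed(list(local)):
        for parent, w in local[node].items():
            # Merge every path that leaves `parent` through this edge.
            grad[parent] = grad.get(parent, 0.0) + w * grad.get(node, 0.0)
    return grad


grad_e = reverse_mode(LOCAL, "e")
print(grad_e["a"], grad_e["b"])  # 2.0 5.0
```

A single backward sweep fills in the derivative of e with respect to every node, including both inputs.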
Computational Victories
At this point, you might wonder why anyone would care about reverse-mode
differentiation. It looks like a strange way of doing the same thing as the forward-mode. Is
there some advantage?
We can use forward-mode differentiation from b up. This gives us the derivative of every
node with respect to b.
We’ve computed ∂e/∂b, the derivative of our output with respect to one of our inputs.
When I say that reverse-mode differentiation gives us the derivative of e with respect to
every node, I really do mean every node. We get both ∂e/∂a and ∂e/∂b, the derivatives of e
with respect to both inputs. Forward-mode differentiation gave us the derivative of our
output with respect to a single input, but reverse-mode differentiation gives us all of
them.
For this graph, that’s only a factor of two speed up, but imagine a function with a million
inputs and one output. Forward-mode differentiation would require us to go through the
graph a million times to get the derivatives. Reverse-mode differentiation can get them all
in one fell swoop! A speed up of a factor of a million is pretty nice!
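To make the counting argument tangible, here is a small sketch on a made-up function with n inputs and one output, out = x_0² + ... + x_(n-1)², using n = 4 instead of a million. The graph layout and the helpers `build_local`, `forward_sweep`, and `reverse_sweep` are hypothetical; the comparison of interest is n forward sweeps versus one reverse sweep.

```python
def build_local(xs):
    """Edge derivatives for s_i = x_i * x_i and out = s_0 + ... + s_(n-1)."""
    local = {}
    for i, x in enumerate(xs):
        local[f"s{i}"] = {f"x{i}": 2.0 * x}  # d(x_i^2)/dx_i = 2 * x_i
    local["out"] = {f"s{i}": 1.0 for i in range(len(xs))}  # sum rule
    return local


def forward_sweep(local, input_names, wrt):
    """One forward pass: derivative of every node with respect to one input."""
    deriv = {name: (1.0 if name == wrt else 0.0) for name in input_names}
    for node, parents in local.items():
        deriv[node] = sum(w * deriv[p] for p, w in parents.items())
    return deriv


def reverse_sweep(local, output):
    """One backward pass: derivative of the output with respect to every node."""
    grad = {output: 1.0}
    for node in reversed(list(local)):
        for parent, w in local[node].items():
            grad[parent] = grad.get(parent, 0.0) + w * grad.get(node, 0.0)
    return grad


xs = [1.0, 2.0, 3.0, 4.0]
names = [f"x{i}" for i in range(len(xs))]
local = build_local(xs)

# Forward mode: one sweep per input, so len(xs) sweeps in total.
forward_grad = [forward_sweep(local, names, n)["out"] for n in names]

# Reverse mode: a single sweep yields the whole gradient.
reverse = reverse_sweep(local, "out")
reverse_grad = [reverse[n] for n in names]

print(forward_grad, reverse_grad)  # [2.0, 4.0, 6.0, 8.0] twice
```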
When training neural networks, we think of the cost (a value describing how badly a neural
network performs) as a function of the parameters (numbers describing how the network
behaves). We want to calculate the derivatives of the cost with respect to all the
parameters, for use in gradient descent (https://en.wikipedia.org/wiki/Gradient_descent).
Now, there are often millions, or even tens of millions, of parameters in a neural network.
So, reverse-mode differentiation, called backpropagation in the context of neural networks,
gives us a massive speed up!
(Are there any cases where forward-mode differentiation makes more sense? Yes, there
are! Where the reverse-mode gives the derivatives of one output with respect to all inputs,
the forward-mode gives us the derivatives of all outputs with respect to one input. If one
has a function with lots of outputs, forward-mode differentiation can be much, much,
much faster.)
But I think it was much more difficult than it might seem. You see, at the time
backpropagation was invented, people weren’t very focused on the feedforward neural
networks that we study. It also wasn’t obvious that derivatives were the right way to train
them. Those are only obvious once you realize you can quickly calculate derivatives. There
was a circular dependency.
Worse, it would be very easy to write off any piece of the circular dependency as
impossible on casual thought. Training neural networks with derivatives? Surely you’d
just get stuck in local minima. And obviously it would be expensive to compute all those
derivatives. It’s only because we know this approach works that we don’t immediately
start listing reasons it’s likely not to.
That’s the benefit of hindsight. Once you’ve framed the question, the hardest work is
already done.
Conclusion
Derivatives are cheaper than you think. That’s the main lesson to take away from this
post. In fact, they’re unintuitively cheap, and us silly humans have had to repeatedly
rediscover this fact. That’s an important thing to understand in deep learning. It’s also a
really useful thing to know in other fields, and only more so if it isn’t common knowledge.
Backpropagation is also a useful lens for understanding how derivatives flow through a
model. This can be extremely helpful in reasoning about why some models are difficult to
optimize. The classic example of this is the problem of vanishing gradients in recurrent
neural networks.
Finally, I claim there is a broad algorithmic lesson to take away from these techniques.
Backpropagation and forward-mode differentiation use a powerful pair of tricks
(linearization and dynamic programming) to compute derivatives more efficiently than
one might think possible. If you really understand these techniques, you can use them to
efficiently calculate several other interesting expressions involving derivatives. We’ll
explore this in a later blog post.
Acknowledgments
Thank you to Greg Corrado (http://research.google.com/pubs/GregCorrado.html), Jon
Shlens (https://shlens.wordpress.com/), Samy Bengio (http://bengio.abracadoudou.com/)
and Anelia Angelova (http://www.vision.caltech.edu/anelia/) for taking the time to
proofread this post.