
Today's lecture will be entirely devoted to the backpropagation algorithm: the heart of all deep learning.


Lecture 2: Backpropagation

Peter Bloem
Deep Learning

dlvu.github.io

STOCHASTIC GRADIENT DESCENT


pick some initial weights θ (for the whole model)
loop: epochs
    for x, t in Data:
        θ ← θ - α · ∇θ loss_x,t(θ)        (α is the learning rate)

stochastic gradient descent: loss over one example per step (loop over data)
minibatch gradient descent: loss over a few examples per step (loop over data)
full-batch gradient descent: loss over the whole data

This is where we left things last lecture. We introduced gradient descent as the training algorithm for neural networks. The gradient collects all derivatives of the loss with respect to every weight in the neural network.

For simple models, like a one-layer linear or logistic regression, we can work out the gradient in closed form. That means we get a function of what the gradient is at every point in our parameter space.

For complex, multilayer functions, working out the gradient is not so easy. We need an algorithm to do it for us. That algorithm is called backpropagation, and it's what powers deep learning.
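
A minimal sketch of this loop in Python (grad_loss here is a placeholder for whatever routine computes the gradient of the loss on one example, e.g. backpropagation):

# a minimal sketch of stochastic gradient descent; grad_loss is a placeholder
# for a routine that returns the gradient of the loss on a single example
def sgd(theta, data, grad_loss, lr=0.01, epochs=10):
    for _ in range(epochs):
        for x, t in data:
            grad = grad_loss(theta, x, t)                      # gradient of the loss on this example
            theta = [w - lr * g for w, g in zip(theta, grad)]  # θ ← θ - α · ∇θ loss
    return theta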

THE PLAN

part 1: scalar backpropagation
part 2: tensor backpropagation
part 3: automatic differentiation

Because backpropagation is so important, we are going to spend an entire lecture on it, looking at it from different perspectives, and slowly building up to the completely automated, highly parallelized version of the algorithm that we use today.

In the first part we describe backpropagation in a scalar setting. That is, we will treat each individual element of the neural network as a single number, and simply loop over all these numbers to do backpropagation over the whole network. This simplifies the derivation, but it is ultimately a slow algorithm with a complicated notation.

In the second part, we translate neural networks to operations on vectors, matrices and tensors. This allows us to simplify our notation, and more importantly, massively speed up the computation of neural networks. Backpropagation on tensors is a little more difficult to do than backpropagation on scalars, but it's well worth the effort.

In the third part, we will make the final leap from a manually worked out and implemented backpropagation system to full-fledged automatic differentiation: we will show you how to build a system that takes care of the gradient computation entirely by itself. This is the technology behind software like pytorch and tensorflow.

|section|Scalar backpropagation|
|video|https://www.youtube.com/embed/idO5r5eWIrw?si=NnUUgtAroD3_Rich|

In the previous video, we looked at what neural networks are, and we saw that to train them, we need to work out the derivatives of the loss with respect to the parameters of the neural network: collectively these derivatives are known as the gradient.

PART ONE: SCALAR BACKPROPAGATION

How do we work out the gradient for a neural network?

Here is a diagram of the sort of network we'll be encountering (this one is called the GoogLeNet). We can't work out a complete gradient for this kind of architecture by hand. We need help. What we want is some sort of algorithm that lets the computer work out the gradient for us.

GRADIENT COMPUTATION: THE SYMBOLIC APPROACH


Of course, working out derivatives is a pretty
mechanical process. We could easily take all the rules
we know, and put them into some algorithm. This is
called symbolic differentiation, and it’s what systems
like Mathematica and Wolfram Alpha do for us.
Unfortunately, as you can see here, the derivatives it
returns get pretty horrendous the deeper the neural
network gets. This approach becomes impractical very
quickly.
As the depth of the network grows, the symbolic expression of its gradient (usually) grows exponentially in size.
Note that in symbolic differentiation we get a
description of the derivative that is independent of
the input. We get a function that we can then feed any
input to.

GRADIENT COMPUTATION: THE NUMERIC APPROACH


Another approach is to compute the gradient numerically, for instance by the method of finite differences: we take a small step ε and see how much the function changes. The amount of change divided by the step size is a good estimate for the gradient if ε is small enough.

    deriv ≈ (f(x + ε) - f(x)) / ε        (the "method of finite differences")

Numeric approaches are sometimes used in deep learning, but it's very expensive to make them accurate enough if you have a large number of parameters.

Note that in the numeric approach, you only get an answer for a particular input. If you want to compute the gradient at some other point in space, you have to compute another numeric approximation. Compare this to the symbolic approach (either with pen and paper or through Wolfram Alpha) where, once the differentiation is done, all we have to compute is the derivative that we've worked out.
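
A minimal sketch of the finite-difference estimate in Python (the function and the evaluation point are just placeholders):

import math

def finite_difference(f, x, eps=1e-6):
    # estimate of df/dx at x: (f(x + eps) - f(x)) / eps
    return (f(x + eps) - f(x)) / eps

# example: the toy function f(x) = 2 / sin(exp(-x)) used later in this lecture
f = lambda x: 2.0 / math.sin(math.exp(-x))
print(finite_difference(f, 1.0))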

BACKPROPAGATION: THE MIDDLE GROUND


Work out parts of the derivative symbolically.
Chain these together in a numeric computation.
Secret ingredient: the chain rule.

Backpropagation is a kind of middle ground between symbolic and numeric approaches to working out the gradient. We break the computation into parts: we work out the derivatives of the parts symbolically, and then chain these together numerically. The secret ingredient that allows us to make this work is the chain rule of differentiation.

THE CHAIN RULE

    ∂f(g(x))/∂x = ∂f(g(x))/∂g(x) · ∂g(x)/∂x

    shorthand:  ∂f/∂x = ∂f/∂g · ∂g/∂x        computation graph:  x → g → f

Here is the chain rule: if we want the derivative of a function which is the composition of two other functions, in this case f and g, we can take the derivative of f with respect to the output of g and multiply it by the derivative of g with respect to the input x.

Since we'll be using the chain rule a lot, we'll introduce a simple shorthand to make it a little easier to parse. We draw a little diagram of which function feeds into which. This means we know what the argument of each function is, so we can remove the arguments from our notation.

We call this diagram a computation graph. We'll stick with simple diagrams like this for now. At the end of the lecture, we will expand our notation a little bit.

INTUITION FOR THE CHAIN RULE

    f(g) = sf g + bf    with sf = ∂f/∂g   (the slope of f over g)
    g(x) = sg x + bg    with sg = ∂g/∂x

    f(g(x)) = sf (sg x + bg) + bf
            = sf sg x + sf bg + bf

    slope of f over x:  sf sg = ∂f/∂g · ∂g/∂x        constant:  sf bg + bf

Since the chain rule is the heart of backpropagation, and backpropagation is the heart of deep learning, we should probably take some time to see why the chain rule is true at all.

If we imagine that f and g are linear functions, it's pretty straightforward to show that this is true. They may not be, of course, but the nice thing about calculus is that locally, we can treat them as linear functions (if they are differentiable). In an infinitesimally small neighbourhood f and g are exactly linear.

If f and g are locally linear, we can describe their behavior with a slope s and an additive constant b. The slopes, sf and sg, are simply the derivatives. The additive constants, we will show, can be ignored.

In this linear view, what the chain rule says is this: if we approximate f(x) as a linear function, then its slope is the slope of f as a function of g, times the slope of g as a function of x. To prove that this is true, we just write down f(g(x)) as linear functions, and multiply out the brackets.

Note that this doesn't quite count as a rigorous proof, but it's hopefully enough to give you some intuition for why the chain rule holds.
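
As a quick numeric sanity check (with an arbitrary example, not from the slides): for f(g) = sin(g) and g(x) = x², the chain rule predicts df/dx = cos(x²) · 2x, and this matches a finite-difference estimate.

import math

x, eps = 0.7, 1e-6
chain_rule  = math.cos(x ** 2) * 2 * x                              # ∂f/∂g · ∂g/∂x
finite_diff = (math.sin((x + eps) ** 2) - math.sin(x ** 2)) / eps   # numeric estimate
print(chain_rule, finite_diff)   # the two agree up to the finite-difference error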

MULTIVARIATE CHAIN RULE

    computation graph:  x feeds into both g and h, and both g and h feed into f

    ∂f/∂x = ∂f/∂g · ∂g/∂x + ∂f/∂h · ∂h/∂x

Since we'll be looking at some pretty elaborate computation graphs, we'll need to be able to deal with this situation as well: we have a computation graph, as before, but f depends on x through two different operations. How do we take the derivative of f over x?

The multivariate chain rule tells us that we can simply apply the chain rule along g, taking h as a constant, and sum it with the chain rule along h, taking g as a constant.

INTUITION FOR THE MULTIVARIATE CHAIN RULE

    g(x) = sg x + bg
    h(x) = sh x + bh
    f(g, h) = s1 g + s2 h + bf

    with  s1 = ∂f/∂g,  s2 = ∂f/∂h,  sg = ∂g/∂x,  sh = ∂h/∂x

    f(g(x), h(x)) = s1 (sg x + bg) + s2 (sh x + bh) + bf
                  = s1 sg x + s1 bg + s2 sh x + s2 bh + bf
                  = (s1 sg + s2 sh) x + s1 bg + s2 bh + bf

    slope of f over x:  s1 sg + s2 sh = ∂f/∂g · ∂g/∂x + ∂f/∂h · ∂h/∂x        constant:  s1 bg + s2 bh + bf

We can see why this holds in the same way as before. The short story: since all functions can be taken to be locally linear, their slopes distribute out into a sum.

MULTIVARIATE CHAIN RULE

    computation graph:  x feeds into g1, …, gi, …, gN, which all feed into f

    ∂f/∂x = Σi  ∂f/∂gi · ∂gi/∂x

If we have more than two paths from the input to the output, we simply sum over all of them.

EXAMPLE

    f(x) = 2 / sin(e^-x)

With that, we are ready to show how backpropagation works. We'll start with a fairly arbitrary function to show the principle before we move on to more realistic neural networks.

ITERATING THE CHAIN RULE

    f(x) = 2 / sin(e^-x)        f(x) = d(c(b(a(x))))

    operations:                 computation graph:
    d(c) = 2/c                  x → a → b → c → d
    c(b) = sin b
    b(a) = e^a
    a(x) = -x

The first thing we do is to break up its functional form into a series of smaller operations. The entire function f is then just a chain of these small operations chained together. We can draw this in a computation graph as we did before.

Normally, we wouldn't break a function up in such small operations. This is just a simple example to illustrate the principle.
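
A minimal sketch of this decomposition in Python (the function names mirror the operations above, and the parameter names deliberately match the slide's notation):

import math

def a(x): return -x               # a(x) = -x
def b(a): return math.exp(a)      # b(a) = e^a
def c(b): return math.sin(b)      # c(b) = sin b
def d(c): return 2.0 / c          # d(c) = 2/c

def f(x): return d(c(b(a(x))))    # f(x) = d(c(b(a(x)))) = 2 / sin(e^-x)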

ITERATING THE CHAIN RULE

    computation graph:  x → a → b → c → d

    ∂f/∂x = ∂d/∂x
          = ∂d/∂c · ∂c/∂x
          = ∂d/∂c · ∂c/∂b · ∂b/∂x
          = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x

Now, to work out the derivative of f, we can iterate the chain rule. We apply it again and again, until the derivative of f over x is expressed as a long product of derivatives of operation outputs over their inputs.

GLOBAL AND LOCAL DERIVATIVES

    ∂f/∂x = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x

    global derivative (the left-hand side), local derivatives (the factors on the right)

We call the larger derivative of f over x the global derivative. And we call the individual factors, the derivatives of the operation outputs with respect to their inputs, the local derivatives.

BACKPROPAGATION

The BACKPROPAGATION algorithm:
• break your computation up into a sequence of operations
  (what counts as an operation is up to you)
• work out the local derivatives symbolically
• compute the global derivative numerically, by computing the local derivatives and multiplying them

This is how the backpropagation algorithm combines symbolic and numeric computation. We work out the local derivatives symbolically, and then work out the global derivative numerically.

WORK OUT THE LOCAL DERIVATIVES SYMBOLICALLY

    f(x) = 2 / sin(e^-x)

    operations:
    d(c) = 2/c
    c(b) = sin b
    b(a) = e^a
    a(x) = -x

    ∂f/∂x = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x
          = -2/c² · cos b · e^a · -1

For each local derivative, we work out the symbolic derivative with pen and paper.

Note that we could fill in the a, b and c in the result, but we don't. We simply leave them as is. For the symbolic part, we are only interested in the derivative of the output of each sub-operation with respect to its immediate input. The rest of the algorithm is performed numerically.

COMPUTE A FORWARD PASS (x = -4.499)

    f(-4.499) = 2

    a = -x    = 4.499
    b = e^a   = 90
    c = sin b = 1
    d = 2/c   = 2

Now that we are computing things numerically, we need a specific input, in this case x = -4.499. We start by feeding this through the computation graph. For each sub-operation, we store the output value. At the end, we get the output of the function f. This is called a forward pass: a fancy term for computing the output of f for a given input.

Note that at this point, we are no longer computing solutions in general. We are computing our function for a specific input. We will be computing the gradient for this specific input as well.

COMPUTE A FORWARD PASS (x = -4.499)

    ∂f/∂x = -2/c² · cos b · e^a · -1
          = -2/1² · cos 90 · e^4.499 · -1
          = -2 · 0 · 90 · -1 = 0

    (stored forward-pass values: a = 4.499, b = 90, c = 1, d = 2)

Keeping all intermediate values from the forward pass in memory, we go back to our symbolic expression of the derivative. Here, we fill in the intermediate values a, b and c. After we do this, we can finish the multiplication numerically, giving us a numeric value of the gradient of f at x = -4.499. In this case, the gradient happens to be 0.
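
Putting the two passes together in a short Python sketch (this uses math.sin in radians, so the numbers will differ from the slide's rounded example, but the structure is exactly the same):

import math

def forward_backward(x):
    # forward pass: compute and store every intermediate value
    a = -x
    b = math.exp(a)
    c = math.sin(b)
    d = 2.0 / c
    # backward pass: fill the stored values into the local derivatives and multiply
    dd_dc = -2.0 / c ** 2
    dc_db = math.cos(b)
    db_da = math.exp(a)
    da_dx = -1.0
    return d, dd_dc * dc_db * db_da * da_dx   # the output f(x) and the gradient df/dx

print(forward_backward(-4.499))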

BACKPROPAGATION

The BACKPROPAGATION algorithm:
• break your computation up into a sequence of operations
  (what counts as an operation is up to you)
• work out the local derivatives symbolically
• compute the global derivative numerically, by computing the local derivatives and multiplying them
• Much more accurate than finite differences
  (the only source of inaccuracy is the numeric computation of the operations)
• Much faster than symbolic differentiation
  (the backward pass has, broadly, the same complexity as the forward)

Before we try this on a neural network, here are the main things to remember about the backpropagation algorithm.

Note that backpropagation by itself does not train a neural net. It just provides a gradient. When people say that they trained a network by backpropagation, that's actually shorthand for training the network by gradient descent, with the gradients worked out by backpropagation.

BACKPROPAGATION IN A FEEDFORWARD NETWORK

    (figure: computation graph with inputs x1, x2; first-layer weights such as w12; linear outputs k1, k2, k3; a nonlinearity producing h1, h2, h3; second-layer weights such as v2; the output y; and finally the loss l, computed from y and the target t)

To explain how backpropagation works in a neural network, we extend our neural network diagram a little bit, to make it closer to the actual computation graph we'll be using.

First, we separate the hidden node into the result of the linear operation ki and the application of the nonlinearity hi. Second, since we're interested in the derivative of the loss rather than the output of the network, we extend the network with one more step: the computation of the loss (over one example, to keep things simple). In this final step, the output y of the network is compared to the target value t from the data, producing a loss value.

The loss is the function for which we want to work out the gradient, so the computation graph is the one that computes first the model output, and then the loss based on this output (and the target).

Note that the model is now just a subgraph of the computation graph. You can think of t as another input node, like x1 and x2 (but one to which the model doesn't have access).

BACKPROPAGATION IN A FEEDFORWARD NETWORK

    target derivatives:  ∂l/∂v2,  ∂l/∂w12

    operations:
    l  = (y - t)²
    y  = v1 h1 + v2 h2 + v3 h3 + bh
    h2 = 1 / (1 + exp(-k2))
    k2 = w12 x1 + w22 x2 + bx

We want to work out the gradient of the loss. This is simply the collection of the derivatives of the loss over each parameter.

We'll pick two parameters, v2 in the second layer, and w12 in the first, and see how backpropagation operates. The rest of the parameters can be worked out in the same way to give us the rest of the gradient.

First, we have to break the computation of the loss into operations. If we take the graph on the left to be our computation graph, then we end up with the operations on the right.

To simplify things, we'll compute the loss over only one instance. We'll use a simple squared error loss.

BACKPROPAGATION IN A FEEDFORWARD NETWORK

    relevant operations:
    l = (y - t)²
    y = v1 h1 + v2 h2 + v3 h3 + bh

    ∂l/∂v2 = ∂l/∂y · ∂y/∂v2
           = 2(y - t) · h2

For the derivative with respect to v2, we'll only need these two operations. Anything below them doesn't affect the result.

To work out the derivative we apply the chain rule, and work out the local derivatives symbolically.

BACKPROPAGATION IN A FEEDFORWARD NETWORK

    forward pass (figure): y = 10.1, target t = 12.1, loss l = 4, h2 = 0.99

    ∂l/∂v2 = ∂l/∂y · ∂y/∂v2
           = 2(y - t) · h2
           = 2(10.1 - 12.1) · 0.99 = -3.96

We then do a forward pass with some values. We get an output of 10.1, which should have been 12.1, so our loss is 4. We keep all intermediate values in memory.

We then take our product of local derivatives, fill in the numeric values from the forward pass, and compute the derivative over v2.

When we apply this derivative in a gradient descent update, v2 changes as shown below.

    v2 ← v2 - α · -3.96

BACKPROPAGATION IN A FEEDFORWARD NETWORK

    operations:
    l  = (y - t)²
    y  = v1 h1 + v2 h2 + v3 h3 + bh
    h2 = 1 / (1 + exp(-k2))
    k2 = w12 x1 + w22 x2 + bx

    ∂l/∂w12 = ∂l/∂y · ∂y/∂h2 · ∂h2/∂k2 · ∂k2/∂w12
            = 2(y - t) · v2 · h2(1 - h2) · x1

    (derivative of the sigmoid:  σ'(x) = σ(x)(1 - σ(x)))

Let's try something a bit earlier in the network: the weight w12. We add two operations, apply the chain rule and work out the local derivatives.

BACKPROPAGATION

    ∂l/∂w12 = ∂l/∂y · ∂y/∂h2 · ∂h2/∂k2 · ∂k2/∂w12
            = 2(y - t) · v2 · h2(1 - h2) · x1

Note that when we're computing the derivative for w12, we are also, along the way, computing the derivatives for y, h2 and k2.

This is useful when it comes to implementing backpropagation. We can walk backward down the computation graph and compute the derivative of the loss for every node. For the nodes below, we just multiply by the local gradient. This means we can very efficiently compute any derivatives we need.

In fact, this is where the name backpropagation comes from: the derivative of the loss propagates down the network in the opposite direction to the forward pass. We will show this more precisely in the last part of this lecture.
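
A minimal sketch of this backward walk in Python, for the two parameters above (the forward-pass values are passed in as arguments, and variable names like dl_dy are just for illustration):

def backward(x1, h2, y, t, v2):
    dl_dy   = 2 * (y - t)             # from l = (y - t)^2
    dl_dv2  = dl_dy * h2              # y = v1 h1 + v2 h2 + v3 h3 + bh
    dl_dh2  = dl_dy * v2              # computed along the way, and reused below
    dl_dk2  = dl_dh2 * h2 * (1 - h2)  # h2 = sigmoid(k2), so dh2/dk2 = h2 (1 - h2)
    dl_dw12 = dl_dk2 * x1             # k2 = w12 x1 + w22 x2 + bx
    return dl_dv2, dl_dw12

# with the forward-pass values from the slides (h2 = 0.99, y = 10.1, t = 12.1),
# dl_dv2 comes out as 2 * (10.1 - 12.1) * 0.99 = -3.96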

WALKING BACK DOWN THE NETWORK

    Break your function up into modules. Call the last node the loss.
    Do a forward pass.
    Start at the loss node and work your way down.
    At each input to a module, multiplying
        the deriv. of the loss over the output   <- already computed
        by the deriv. of the output over the input   <- local deriv.
    produces the deriv. of the loss over the input.

Here is a more abstract view of what is happening in backpropagation, which should apply to any kind of computation. We can think of the computations we do as modules with inputs and outputs, chained together to make a computation graph. The output of each module contributes ultimately to the result of the computation, which is the loss.

We want to know the derivative of the loss with respect to each of our input nodes. By the chain rule, this is the derivative of the loss with respect to the output times the derivative of the output with respect to the input.

If we take care to walk back down our computation graph in the correct order, then we can be sure that for every module we encounter, we will have already computed the first factor. We only need to compute the second, and multiply by the value we already have.

We'll extend this abstract view of backpropagation in the last part of the lecture.
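
Here is a minimal sketch of this module view in Python. It is a hypothetical illustration (the Square and Multiply modules and their method names are made up for this example, not taken from the lecture):

# each module knows its own forward computation and its own local derivative
class Square:                        # computes l = (y - t)^2, with t treated as a constant
    def forward(self, y, t):
        self.y, self.t = y, t        # store inputs for the backward pass
        return (y - t) ** 2
    def backward(self, dl_dout):     # dl_dout: derivative of the loss over this module's output
        return dl_dout * 2 * (self.y - self.t)   # multiply by the local derivative

class Multiply:                      # computes out = a * b
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b
    def backward(self, dl_dout):
        return dl_dout * self.b, dl_dout * self.a

# walking back down: start with dl/dl = 1 at the loss and pass derivatives backward
mul, sq = Multiply(), Square()
y = mul.forward(2.0, 3.0)            # say y = v * h
l = sq.forward(y, 12.1)              # loss against target t = 12.1
dl_dy = sq.backward(1.0)
dl_dv, dl_dh = mul.backward(dl_dy)   # derivative of the loss over each input to Multiply
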
BACKPROPAGATION

The BACKPROPAGATION algorithm:


• break your computation up into a sequence of operations
what counts as an operation is up to you

• work out the local derivatives symbolically.


• compute the global derivative numerically
by computing the local derivatives and multiplying them

• Walk backward from the loss, accumulating the derivatives.


BUILDING SOME INTUITION

    (figure: a forward pass for some weights and inputs, with output y = 10, target t = 0 and loss 100)

To finish up, let's see if we can build a little intuition for what all these accumulated derivatives mean.

Here is a forward pass for some weights and some inputs. Backpropagation starts with the loss, and walks down the network, figuring out at each step how every value contributed to the result of the forward pass.

Every value that contributed positively to a positive loss should be lowered, every value that contributed positively to a negative loss should be increased, and so on.

IMAGINE THAT y IS A PARAMETER

    if y = 10 and t = 0:    y ← y - α · 2(y - t) = y - α · 20
    if y = 0 and t = 10:    y ← y - α · 2(y - t) = y + α · 20

We'll start with the first value below the loss: y, the output of our model. Of course, this isn't a parameter of the network, so we can't just set it to any value we'd like. But let's imagine for a moment that we could. What would the gradient descent update rule look like if we try to update y?

If the output is 10, and it should have been 0, then gradient descent on y tells us to lower the output of the network. If the output is 0 and it should have been 10, GD tells us to increase the value of the output.

Even though we can't change y directly, this is the effect we want to achieve: we want to change the values we can change so that we achieve this change in y. To figure out how to do that, we take this gradient for y, and propagate it back down the network.

Note that even though these scenarios have the same loss (because of the square), the derivative of the loss has a different sign for each, so we can tell whether the output is bigger than the target or the other way around. The loss only tells us how bad we've done, but the derivative of the loss tells us where to move to make it better.

Of course, we cannot change y directly. Instead, we have to change the values that influenced y.

    (figure: hidden activations h1 = 0.1, h2 = 0.5, h3 = 0.9, all feeding into y)

    v1 ← v1 - α · 20 · 0.1
    v2 ← v2 - α · 20 · 0.5
    v3 ← v3 - α · 20 · 0.9

Here we see what that looks like for the weights of the second layer. First note that the output y in this example was too high. Since all the hidden nodes have positive values (because of the sigmoid), we end up subtracting some positive value from all the weights. This will lower the output, as expected.

Second, note that the change is proportional to the input. The first hidden node h1 only contributes a factor of 0.1 (times its weight) to the value of y, so it isn't changed as much as h3, which contributes much more to the erroneous value.

Note also that the current value of the weight doesn't factor into the update: whether v1 is 1, 10 or 100, we subtract the same amount. Only how much influence the weight had on the value of y in the forward pass is taken into account. The higher the activation of the source node, the more the weight gets adjusted.

Finally, note how the sign of the derivative with respect to y is taken into account. Here, the model output was too high, so the more a weight contributed to the output, the more it gets "punished" by being lowered. If the output had been too low, the opposite would be true, and we would be adding something to the value of each weight.

NEGATIVE ACTIVATION

(Diagram: with a tanh activation the hidden values are h1 = 0.1, h2 = −0.5, h3 = 0.9.)

v1 ← v1 − α · 20 · 0.1
v2 ← v2 + α · 20 · 0.5
v3 ← v3 − α · 20 · 0.9

The sigmoid activation we've used so far allows only positive values to emerge from the hidden layer. If we switch to an activation that also allows negative activations (like a linear activation or a tanh activation), we see that backpropagation very naturally takes the sign into account. Note the negative activation on h2.
In this case, we want to update in such a way that y decreases, but we note that the weight v2 is multiplied by a negative value. This means that (for this instance) v2 contributes negatively to the model output, and its value should be increased if we want to decrease the output.
Note that the sign of v2 itself doesn't matter: whether v2 is positive or negative, the direction of its update is determined by the activation it multiplies, not by its own sign.

h2 ← h2 − α · 2(y − t) · v2 = h2 + α · 20 · 3

We use the same principle to work our way back down the network. If we could change the output of the second node h2 directly, this is how we'd do it.
Note that we now take the value of v2 to be a constant. We are working out partial derivatives, so when we are focusing on one parameter, all the others are taken as constant.
Remember that we want to decrease the output of the network. Since v2 makes a negative contribution to the loss, we can achieve this by increasing the activation of the source node v2 is multiplied by.
Note also that we've switched back to sigmoid activations.
Moving down to k2, remember that the derivative of the sigmoid is the output of the sigmoid times 1 minus that output: σ'(x) = σ(x)(1 − σ(x)).

k2 ← k2 − α · 2(y − t) · v2 · h2(1 − h2)

(Plot: σ(x), 1 − σ(x), and the derivative σ(x)(1 − σ(x)).)

We see here that in the extreme regimes, the sigmoid is resistant to change. The closer to 1 or 0 we get, the smaller the weight update becomes.
This is actually a significant downside of the sigmoid activation, and one of the big reasons it was eventually replaced by the ReLU as the default choice for hidden units. We'll come back to this in a later lecture.
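To see this saturation concretely, here is a small Python check (a sketch of mine, not from the slides) that evaluates the local derivative σ(x)(1 − σ(x)) at a few points:

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# The local derivative of the sigmoid peaks at 0.25 for x = 0 and
# shrinks rapidly once the activation saturates towards 0 or 1.
for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x = {x:5.1f}  sigmoid = {s:.6f}  derivative = {s * (1 - s):.6f}")

The closer the activation gets to 0 or 1, the smaller the factor that gets multiplied into the gradient, and the slower everything below that node learns.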
Nevertheless, this update rule tells us what the change is to k2 that we want to achieve by changing the values we can actually change (the weights of layer 1).

Finally, we come to the weights of the first layer. As before, we want the output of the network to decrease. To achieve this, we want h2 to increase (because v2 is negative). However, the input x1 is negative, so we should decrease w12 to increase h2.

w12 ← w12 − α · 2(y − t) · v2 · h2(1 − h2) · x1
    = w12 − α · 20 · (−3) · 1/4 · (−3)
    = w12 − α · 20 · 3 · 1/4 · 3

This is all beautifully captured by the chain rule: the two negatives of x1 and v2 cancel out and we get a positive value which we subtract from w12.

FORWARD PASS IN PSEUDOCODE

To finish up let's look at how you would implement this in code. Here is the forward pass: computing the model output and the loss, given the inputs and the target value. Assume that k and y are initialized with 0s or random values. We'll talk about initialization strategies in the 4th lecture.

for j in [1 … 3]:
  for i in [1 … 2]:
    k[j] += w[i,j] * x[i]
  k[j] += b[j]

for i in [1 … 3]:
  h[i] = sigmoid(k[i])

for i in [1 … 3]:
  y += h[i] * v[i]
y += c

loss = (y - t) ** 2
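As a minimal runnable sketch (not the lecture's own code), the same forward pass in plain Python, with made-up example values for the inputs, target and parameters:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x = [1.0, -1.0]                      # two inputs
t = 0.0                              # target value
w = [[1.0, 0.5, -0.5],               # w[i][j]: from input i to hidden j
     [0.5, 1.0,  0.5]]
b = [0.0, 0.0, 0.0]
v = [1.0, -1.0, 2.0]
c = 0.0

k = [0.0, 0.0, 0.0]
for j in range(3):
    for i in range(2):
        k[j] += w[i][j] * x[i]
    k[j] += b[j]

h = [sigmoid(k[j]) for j in range(3)]

y = c
for j in range(3):
    y += h[j] * v[j]

loss = (y - t) ** 2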

BACKWARD PASS IN PSEUDOCODE

And here is the backward pass. We compute gradients for every node in the network, regardless of whether the node represents a parameter. When we do the gradient descent update, we'll use the gradients of the parameters, and ignore the rest.
Note that we don't implement the derivations from slide 44 directly. Instead, we work backwards down the neural network: computing the derivative of each node as we go, by taking the derivative of the loss over the outputs and multiplying it by the local derivative.

y' = 2 * (y - t)   # the error

for i in [1 … 3]:
  v'[i] = y' * h[i]
  h'[i] = y' * v[i]
c' = y'

for i in [1 … 3]:
  k'[i] = h'[i] * h[i] * (1 - h[i])

for j in [1 … 3]:
  for i in [1 … 2]:
    w'[i, j] = k'[j] * x[i]
  b'[j] = k'[j]
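Continuing the Python sketch above (same variable names, still an illustration rather than the lecture's code), the backward pass becomes:

y_grad = 2 * (y - t)                         # the error

v_grad = [y_grad * h[j] for j in range(3)]
h_grad = [y_grad * v[j] for j in range(3)]
c_grad = y_grad

# local derivative of the sigmoid: h * (1 - h)
k_grad = [h_grad[j] * h[j] * (1 - h[j]) for j in range(3)]

w_grad = [[k_grad[j] * x[i] for j in range(3)] for i in range(2)]
b_grad = [k_grad[j] for j in range(3)]

# A gradient descent step then updates only the parameters, e.g.
# w[i][j] -= lr * w_grad[i][j], and similarly for b, v and c.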
RECAP

Backpropagation: method for computing derivatives.

Combined with gradient descent to train NNs.

Middle ground between symbolic and numeric computation.


• Break computation up into operations.
• Work out local derivatives symbolically.
• Work out global derivative numerically.


Lecture 2: Backpropagation

Peter Bloem
Deep Learning

dlvu.github.io

|section|Tensor backpropagation|
|video|https://www.youtube.com/embed/O-xs8IyP4bQ?si=YWT0e5PGkU37kOVf|

PART THREE: TENSOR BACKPROPAGATION
Expressing backpropagation in vector, matrix and tensor operations.

We've seen what neural networks are, how to train them by gradient descent and how to compute that gradient by backpropagation.
In order to scale up to larger and more complex structures, we need to make our computations as efficient as possible, and we'll also need to simplify our notation. There's one insight that we are going to get a lot of mileage out of.

IT'S ALL JUST LINEAR ALGEBRA

When we look at the computation of a neural network, we can see that most operations can be expressed very naturally in those of linear algebra.

(Diagram: the example network in matrix form: Wx + b = k, with W a 3 × 2 matrix of weights and b a vector of biases; h = σ(k) applied element-wise; and [v1 v2 v3] h + c = y.)

The multiplication of weights by their inputs is a multiplication of a matrix of weights by a vector of inputs.
Note that the weight w12 (as we've called it so far, because it goes from node 1 to node 2) actually goes into element W21 of the matrix W. You can tell the difference by whether we're using a lower-case or capital w.
The addition of a bias is the addition of a vector of bias parameters.
The nonlinearities can be implemented as simple element-wise operations.
This perspective buys us two things. First…

Our notation becomes extremely simple. We can describe the whole operation of a neural network with one simple equation. This expressiveness will be sorely needed when we move to more complicated networks.

f(x) = V σ(Wx + b) + c

MATRIX MULTIPLICATION

# k, h, y, etc. are zero'd arrays
for j in [1 … 3]:
  for i in [1 … 2]:
    k[j] += w[i,j] * x[i]
  k[j] += b[j]

for i in [1 … 3]:
  h[i] = sigmoid(k[i])

for i in [1 … 3]:
  y += h[i] * v[i]
y += c

loss = (y - t) ** 2

The second reason is that the biggest computational bottleneck in a neural network is the multiplication of the layer input by the layer weights: the matrix multiplication in the notation of the previous slide. This operation is more than quadratic while everything else is linear. We can see that in our pseudocode, because we have one loop, nested in another.
Matrix multiplication (and other tensor operations like it) can be parallelized and implemented efficiently, but it's a lot of work. Ideally, we'd like to let somebody else do all that work (like the implementers of numpy) and then just call their code to do the matrix multiplications.
This is especially important if you have access to a GPU. A matrix multiplication can easily be offloaded to the GPU for much faster processing, but for a loop over an array, there's no benefit.
This is called vectorizing: taking a piece of code written in for loops, and getting rid of the loops by expressing the function as a sequence of linear algebra operations.
VECTORIZING

Express everything as operations on vectors, matrices and tensors.
Get rid of all the loops.
Makes the notation simpler.
Makes the execution faster.

Without vectorized implementations, deep learning would be painfully slow, and GPUs would be useless. Turning a naive, loop-based implementation into a vectorized one is a key skill for DL researchers and programmers.
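As a concrete illustration (my own sketch, with an arbitrary 3-by-2 weight matrix), this is what vectorizing the first layer looks like in numpy: the double loop collapses into a single matrix-vector product, and it is that product that can be handed off to an optimized library or a GPU.

import numpy as np

W = np.random.randn(3, 2)     # weights, indexed [hidden unit, input]
b = np.zeros(3)
x = np.random.randn(2)

k_loop = np.zeros(3)          # loop-based version
for i in range(3):
    for j in range(2):
        k_loop[i] += W[i, j] * x[j]
    k_loop[i] += b[i]

k_vec = W @ x + b             # vectorized version

assert np.allclose(k_loop, k_vec)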
FORWARD

In symbolic notation:
k = Wx + b
h = σ(k)
y = Vh + c
l = (y − t)²

In pseudocode:
k = w.dot(x) + b
h = sigmoid(k)
y = v.dot(h) + c
l = (y - t) ** 2

Here's what the vectorized forward pass looks like as a computation graph, in symbolic notation and in pseudocode. When you implement this in numpy it'll look almost the same.
Note that this doesn't just represent the network in the previous part, it represents any such network, regardless of the sizes of the input and hidden layers. We've abstracted away some details about the specifics of the network. We've also made the output and the target label vectors for the sake of generality.
So far so good. The forward pass is easy enough to vectorize.
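In numpy such a vectorized forward pass could look like the sketch below (the layer sizes are arbitrary assumptions, chosen just to make it runnable):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_in, n_hid, n_out = 2, 3, 2
W = np.random.randn(n_hid, n_in)
b = np.zeros(n_hid)
V = np.random.randn(n_out, n_hid)
c = np.zeros(n_out)

x = np.random.randn(n_in)     # input vector
t = np.zeros(n_out)           # target vector

k = W @ x + b
h = sigmoid(k)
y = V @ h + c
l = np.sum((y - t) ** 2)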
BUT WHAT ABOUT THE BACKWARD PASS?

l = (y − t)²
y = Vh + c
h = σ(k)
k = Wx + b

∂l/∂W =? ∂l/∂y · ∂y/∂h · ∂h/∂k · ∂k/∂W

Of course, we lose a lot of the benefit of vectorizing if the backward pass (the backpropagation algorithm) isn't also expressed in terms of matrix multiplications. How do we vectorize the backward pass? That's the question we'll answer in the rest of this video.
Can we do something like this? On the left, we see the forward pass of our loss computation as a set of operations on vectors and matrices. To generalize backpropagation to this view, we might ask if something similar to the chain rule exists for vectors and matrices. Firstly, can we define something analogous to the derivative of one matrix over another, and secondly, can we break this apart into a product of local derivatives, possibly giving us a sequence of matrix multiplications?
The answer is yes, there are many ways of applying calculus to vectors and matrices, and there are many chain rules available in these domains. However, things can quickly get a little hairy, so we need to tread carefully.
GRADIENTS, JACOBIANS, ETC

For a function f(A) = B, we write ∂B/∂A for the collection of derivatives of every element of B over every element of A. For instance, for a vector-to-vector function f(a1, a2, a3) = (b1, b2), we arrange the derivatives in a Jacobian matrix:

Jf = [ ∂b1/∂a1  ∂b2/∂a1
       ∂b1/∂a2  ∂b2/∂a2
       ∂b1/∂a3  ∂b2/∂a3 ]

What shape does ∂B/∂A take?

                    output is a
                    scalar   vector   matrix
input is a  scalar  scalar   vector   matrix
            vector  vector   matrix   ?
            matrix  matrix   ?        ?

The derivatives of high-dimensional objects are easily defined. We simply take the derivative of every number in the output over every number in the input, and we arrange all possibilities into some logical shape. For instance, if we have a vector-to-vector function, the natural analog of the derivative is a matrix with all the partial derivatives in it.
However, once we get to matrix/vector operations or matrix/matrix operations, the only way to logically arrange every input with every output is a tensor of higher dimension than 2.
NB: We don't normally apply the differential symbol to non-scalars like this. We'll introduce better notation later.

∂l/∂W =? ∂l/∂y · ∂y/∂h · ∂h/∂k · ∂k/∂W    (uh oh: ∂k/∂W is a vector over a matrix)

As we see, even if we could come up with this kind of chain rule, one of the local derivatives is now a vector over a matrix. The result could only be represented in a 3-tensor. There are two problems with this:
- If the operation has n inputs and n outputs, we are computing n³ derivatives, even though we were only ultimately interested in n² of them (the derivatives of W). In the scalar algorithm we only ever had two nested loops (an n² complexity), and we only ever stored one gradient for one node in the computation graph. Now we suddenly have n³ complexity and n³ memory use. We were supposed to be making things faster.
- We can easily represent a 3-tensor, but there's no obvious, default way to multiply a 3-tensor with a matrix, or with a vector (in fact there are many different ways). The multiplication of the chain rule becomes very complicated this way.
In short, we need a more careful approach.
In short, we need a more careful approach.

SIMPLIFYING ASSUMPTIONS
To work out how to do this we make these following
simplifying assumptions.
1) The computation graph always has a single, scalar output: l

2) We are only ever interested in the derivative of l.

scalar vector matrix

@l <- scalar
scalar scalar vector matrix

@W <- tensor
of
any dimens
ion
vector vector matrix ?

matrix matrix ? ?

52 3-tensor 3-tensor ? ?
Note that this doesn’t mean we can only ever train
neural networks with a single scalar output. Our
loss
neural networks can have any number of outputs of

computation
any shape and size. We can make neural networks that

of loss
generate images, natural language, raw audio, even
computation graph outputs video.
However, the loss we define over those outputs needs
computation
to map them all to a single scalar value. The
of model
m o del
computation graph is always the model, plus the
computation of the loss.
inputs

THE GRADIENT

∂l/∂W: the gradient of l (with respect to W).
Commonly written as ∇W l.
Nonstandard assumption: ∇W l has the same shape as W.

[∇W l]ijk = ∂l/∂Wijk

We call this derivative the gradient of W. This is a common term, but we will deviate from the standard approach in one respect.
Normally, the gradient is defined as a row vector, so that it can operate on a space of column vectors by matrix multiplication. In our case, we are never interested in multiplying the gradient by anything. We only ever want to sum the gradient of l wrt W with the original matrix W in a gradient update step. For this reason we define the gradient as having the same shape as the tensor W with respect to which we are taking the derivative.
In the example shown, W is a 3-tensor (a kind of matrix with 3 dimensions instead of 2). The gradient of l wrt W has the same shape as W, and at element (i, j, k) it holds the scalar derivative of l wrt Wijk.
With these rules, we can use tensors of any shape and dimension and always have a well-defined gradient.

NEW NOTATION: THE GRADIENT FOR W

∇W l = W∇    (the left-hand side always refers to the same loss l; the right-hand side puts the object we're actually interested in front and center)

b∇ = ∇b l = (∂l/∂b1, …, ∂l/∂bn)ᵀ
y∇ = ∇y l = ∂l/∂y

The standard gradient notation isn't very suitable for our purposes. It puts the loss front and center, but that will be the same in all cases. The object that we're actually interested in is relegated to a subscript. Also, it isn't very clear from the notation what the shape is of the tensor that we're looking at.
For this reason, we'll introduce a new notation. This isn't standard, so don't expect to see it anywhere else, but it will help to clarify our mathematics a lot as we go forward.
We've put the W front and center, so it's clear that the result of taking the gradient is also a matrix, and we've removed the loss, since we've assumed that we are always taking the gradient of the loss.
You can think of the nabla as an operator like a transposition or taking an inverse. It turns a matrix into another matrix, a vector into another vector and so on. The notation works the same for vectors and even for scalars. This is the gradient of l with respect to W. Since l never changes, we'll refer to this as "the gradient for W".

When we refer to a single element of the gradient, we will follow our convention for matrices, and use the non-bold version of its letter. Element (i, j) of the gradient for W is the same as the gradient for element (i, j) of W:

W∇ij = [W∇]ij = ∂l/∂Wij

To denote this element we follow the convention we have for elements of tensors: we use the same letter as we use for the matrix, but non-bold.

TENSOR BACKPROPAGATION

Work out scalar derivatives first, then vectorize.
Use the multivariate chain rule to derive the scalar derivative.
Apply the chain rule step by step.
Start at the loss and work backward, accumulating the gradients.

This gives us a good way of thinking about the gradients, but we still don't have a chain rule to base our backpropagation on.
The main trick we will use is to stick to scalar derivatives as much as possible. Once we have worked out the derivative in purely scalar terms (on pen and paper), we will then find a way to vectorize their computation.

GRADIENT FOR y

y∇i = ∂l/∂yi = ∂( Σk (yk − tk)² )/∂yi = Σk 2(yk − tk) · ∂(yk − tk)/∂yi = 2(yi − ti)

y∇ = (∂l/∂y1, …, ∂l/∂yN)ᵀ = (2(y1 − t1), …, 2(yN − tN))ᵀ = 2(y − t)

We start simple: what is the gradient for y? This is a vector, because y is a vector. Let's first work out the derivative of the i-th element of y. This is purely a scalar derivative so we can simply use the rules we already know. We get 2(yi − ti) for that particular derivative.
Then, we just re-arrange all the derivatives for yi into a vector, which gives us the complete gradient for y.
The final step requires a little creativity: we need to figure out how to compute this vector using only basic linear algebra operations on the given vectors and matrices. In this case it's not so complicated: we get the gradient for y by element-wise subtracting t from y and multiplying by 2.
We haven't needed any chain rule yet, because our computation graph for this part has only one edge.
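In the numpy sketch from earlier, this gradient is a one-liner (assuming y and t are the vectors computed in the forward pass):

y_grad = 2 * (y - t)      # same shape as y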

GRADIENT FOR V

V∇ij = ∂l/∂Vij
     = Σk (∂l/∂yk) · (∂yk/∂Vij)                  (multivariate chain rule)
     = Σk y∇k · ∂[Vh + c]k/∂Vij
     = Σk y∇k · ∂( Σl Vkl hl + ck )/∂Vij
     = y∇i · hj                                  (the term is zero unless k = i)

V∇ = [ … y∇i hj … ] = y∇ hᵀ

Let's move one step down and work out the gradient for V. We start with the scalar derivative for an arbitrary element of V: Vij.
Now, we need a chain rule, to backpropagate through y. However, since we are sticking to scalar derivatives, we can simply use the scalar multivariate chain rule. This tells us that however many intermediate values we have, we can work out the derivative for each, keeping the others constant, and sum the results. This gives us the sum in the second equality.
We've worked out the gradient y∇ already, so we can fill that in and focus on the derivative of yk over Vij. The value of yk is defined in terms of linear algebra operations like matrix multiplication, but with a little thinking we can always rewrite these as simple scalar sums. In the end we find that the derivative of yk over Vij reduces to the value of hj.
This tells us the value of every element (i, j) of V∇. All that's left is to figure out how to compute this in a vectorized way. In this case, we can compute V∇ as the outer product of the gradient for y, which we've computed already, and the vector h, which we can save during the forward pass.
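In the numpy sketch, this outer product is (with y_grad and h from before):

V_grad = np.outer(y_grad, h)      # shape (n_out, n_hid), the same shape as V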

RECIPE

To work out a gradient X∇ for some X:

• Write down a scalar derivative of the loss wrt one element of X.

• Use the multivariate chain rule to sum over all outputs.

• Work out the scalar derivative.

• Vectorize the computation of X∇ in terms of the original inputs.


GRADIENT FOR h

h∇i = ∂l/∂hi
    = Σk (∂l/∂yk) · (∂yk/∂hi) = Σk y∇k · ∂[Vh + c]k/∂hi
    = Σk y∇k · ∂( Σl Vkl hl + ck )/∂hi
    = Σk y∇k Vki

h∇ = ( … Σk y∇k Vki … )ᵀ = (y∇ᵀ V)ᵀ = Vᵀ y∇

Since this is an important principle to grasp, let's keep going until we get to the other set of parameters, W. We'll leave the biases as an exercise.
For the gradient for h, most of the derivation is the same as before, until we get to the point where the scalar derivative is reduced to a matrix multiplication. Unlike the previous derivation, the sum over k doesn't disappear. We need to take it into the description of the gradient for h.
To vectorize this, we note that each element of this vector is a dot product of y∇ and a column of V. This means that if we multiply y∇ (transposed) onto V from the left, y∇ᵀV, we get the required result as a row vector. Transposing the result (to make it a column vector) is equivalent to multiplying y∇ by the transpose of V: h∇ = Vᵀ y∇.
Note the symmetry of the forward and the backward operation.
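In the numpy sketch, this is again a single line:

h_grad = V.T @ y_grad             # shape (n_hid,), the same shape as h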

GRADIENT FOR k

k∇i = ∂l/∂ki
    = Σm h∇m · ∂hm/∂ki = h∇i · ∂σ(ki)/∂ki        (hm depends on ki only when m = i)
    = h∇i · σ(ki)(1 − σ(ki))
    = h∇i · hi(1 − hi)

k∇ = h∇ ⊗ σ'(k) = h∇ ⊗ h ⊗ (1 − h)               (⊗ here denotes element-wise multiplication)

derivative of the sigmoid: σ'(x) = σ(x)(1 − σ(x))

Now, at this point, when we analyze k, remember that we already have the gradient over h. This means that we no longer have to apply the chain rule to anything above h. We can draw this scalar computation graph, and work out the local gradient for k in terms of the gradient for h.
Given that, working out the gradient for k is relatively easy, since the operation from k to h is an element-wise one.

h k
= ⌦ (k)(1=-hh) ⌦ h ⌦ (1 - h)
y
k l b t
62
Wl c h
V t k
bh l
x
ck
xt l y
x
y
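Since the identity σ’(x) = σ(x)(1 - σ(x)) does a lot of work here, a quick numerical sanity check may be reassuring (a numpy illustration, not part of the slides):

import numpy as np

sigma = lambda x: 1 / (1 + np.exp(-x))
x, eps = np.linspace(-4, 4, 9), 1e-6
numerical = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)  # central finite differences
print(np.allclose(numerical, sigma(x) * (1 - sigma(x))))   # True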
GRADIENT FOR W

Finally, the gradient for W. The situation here is exactly the same as we saw earlier for V (matrix in, vector out, matrix multiplication), so we should expect the derivation to have the same form (and indeed it does).

W∇_ij = ∂l/∂W_ij
      = Σ_m (∂l/∂k_m)(∂k_m/∂W_ij) = Σ_m k∇_m ∂k_m/∂W_ij
      = Σ_m k∇_m ∂[Wx + b]_m/∂W_ij = Σ_m k∇_m ∂(Σ_l W_ml x_l + b_m)/∂W_ij
      = Σ_ml k∇_m ∂(W_ml x_l)/∂W_ij = k∇_i ∂(W_ij x_j)/∂W_ij
      = k∇_i x_j

W∇ is the matrix with k∇_i x_j at position (i, j), which is the outer product:

W∇ = k∇ x^T
PSEUDOCODE: BACKWARD

Here’s the backward pass that we just derived.

We’ve left the derivatives of the bias parameters out. You’ll have to work these out to implement the first homework exercise.

y’ = 2 * (y - t)
v’ = y’.mm(h.T)
h’ = v.T.mm(y’)
k’ = h’ * h * (1 - h)
w’ = k’.mm(x.T)
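A hand-derived backward pass like this is easy to get subtly wrong, so it is worth checking it against finite differences. Below is a small numpy sketch of such a check for W (an illustration with assumed shapes and a sum-of-squares loss, not part of the slides):

import numpy as np

def loss(W, x, b, V, c, t):
    h = 1 / (1 + np.exp(-(W @ x + b)))        # k = Wx + b, h = sigmoid(k)
    return np.sum((V @ h + c - t) ** 2)       # l = sum of squared errors

rng = np.random.default_rng(0)
W, x, b = rng.normal(size=(4, 3)), rng.normal(size=3), rng.normal(size=4)
V, c, t = rng.normal(size=(2, 4)), rng.normal(size=2), rng.normal(size=2)

# forward, then backward as derived above
k = W @ x + b; h = 1 / (1 + np.exp(-k)); y = V @ h + c
y_grad = 2 * (y - t)
h_grad = V.T @ y_grad
k_grad = h_grad * h * (1 - h)
W_grad = np.outer(k_grad, x)                  # W∇ = k∇ x^T

# numerical gradient for W
eps, numerical = 1e-6, np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        numerical[i, j] = (loss(Wp, x, b, V, c, t) - loss(W, x, b, V, c, t)) / eps
print(np.allclose(W_grad, numerical, atol=1e-4))   # True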

RECAP

Vectorizing makes notation simpler, and computation faster.

Vectorizing the forward pass is usually easy.

Backward pass: work out the scalar derivatives, accumulate, then vectorize.

Lecture 2: Backpropagation

Peter Bloem
Deep Learning

dlvu.github.io

|section|Automatic differentiation|
|video|https://www.youtube.com/embed/9H-
o8wESCxI?si=NyQ6djR2xavcBsIL|
PART FOUR: AUTOMATIC DIFFERENTIATION
Letting the computer do all the work.

We’ve simplified the computation of derivatives a lot. All we’ve had to do is work out local derivatives and chain them together. We can go one step further: if we let the computer keep track of our computation graph, and provide some backwards functions for basic operations, the computer can work out the whole backpropagation algorithm for us. This is called automatic differentiation (or sometimes autograd).
THE STORY SO FAR

pen and paper / in the computer:

y’ = 2 * (y - t)
v’ = y’.mm(h.T)
h’ = (y’[None, :] * v).sum(axis=1)
k’ = sigmoid(k) * sigmoid(1 - k)
w’ = k’.mm(x.T)
THE IDEAL

This is what we want to achieve. We work out on pen and paper the local derivatives of various modules, and then we chain these modules together in a computation graph in our code. The computer keeps the graph in memory, and can automatically work out the backpropagation.

pen and paper:    y’ = 2 * (y - t)

in the computer:  k = w.dot(x) + b
                  h = sigmoid(k)
                  y = v.dot(h) + c
                  l = (y - t) ** 2

                  l.backward() # start backprop
THE KEY IDEA

Whenever we work out a gradient, as we did in the previous video, we always look at the node above the one for which we’re computing the gradient, and use the multivariate chain rule to split the gradient in two. In this case, we are working out the gradient for W, and we apply the multivariate chain rule to break that gradient into the gradient for k and the derivatives of k with respect to W.

W∇_ij = ∂l/∂W_ij
      = Σ_m (∂l/∂k_m)(∂k_m/∂W_ij) = Σ_m k∇_m ∂k_m/∂W_ij     ← everything above k is ignored once we have k∇
      = Σ_m k∇_m ∂[Wx + b]_m/∂W_ij = Σ_m k∇_m ∂(Σ_l W_ml x_l + b_m)/∂W_ij
      = Σ_ml k∇_m ∂(W_ml x_l)/∂W_ij = k∇_i ∂(W_ij x_j)/∂W_ij
      = k∇_i x_j

W∇ = k∇ x^T

The key idea that powers automatic differentiation is that once we have k∇, we no longer care about anything else that happens above k in our computation graph. All we need is k∇ and we can work out W∇.
Define your computation graph in terms of operations chained together.
l

For each operation x -> y, if you know the gradient y∇ for the output, you
can work out the gradient x∇ for the input.
It doesn't matter what happens in the rest of the computation graph, before or after the operation.

With this knowledge, you can start at the top of the graph, and work your
way down, computing all the gradients.

71
AUTOMATIC DIFFERENTIATION

This kind of algorithm is called automatic differentiation. What we’ve been doing so far is called backward, or reverse mode automatic differentiation. This is efficient if you have few output nodes. Note that if we had two output nodes, we’d need to do a separate backward pass for each.

If you have few inputs, it’s more efficient to start with the inputs and apply the chain rule working forward. This is called forward mode automatic differentiation.

many inputs, few outputs: work backward (← deep learning)        few inputs, many outputs: work forward

Since we assume we only have one output node (where the gradient computation is concerned), we will always use reverse mode in deep learning.
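One common way to implement forward mode is with dual numbers: every value carries its derivative along with it, so derivatives flow forward from a chosen input. The sketch below is only an illustration of that idea (it is not how this course, or deep learning frameworks, compute gradients):

class Dual:
    def __init__(self, value, deriv):
        self.value, self.deriv = value, deriv     # carry f(x) and f'(x) together

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)  # product rule

x = Dual(2.0, 1.0)                 # seed with dx/dx = 1
y = x * x + x * Dual(3.0, 0.0)     # y = x^2 + 3x
print(y.value, y.deriv)            # 10.0 7.0, i.e. y and dy/dx at x = 2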

THE PLAN

To make this idea precise, and applicable to as broad a range of functions as possible, we'll need to set up a few ingredients. Our values will be tensors: scalars, vectors, matrices, or their higher-dimensional analogues. We then build computation graphs out of operations.

Tensors
Scalars, vectors, matrices, etc.

Building computation graphs
Define operations with tensor inputs and outputs, chain them together.

Working out backward functions
For each operation, work out how to compute the gradient for the input given the gradient for the output.

TENSORS (AKA MULTIDIMENSIONAL ARRAYS)

The basic datastructure of our system will be the tensor. A tensor is a generic name for a family of datastructures that includes a scalar, a vector, a matrix and so on.

0-tensor (scalar), 1-tensor (vector, shape 3), 2-tensor (matrix, shape 3x2), 3-tensor (shape 2x3x2), 4-tensor (shape 2x3x2x4)

There is no good way to visualize a 4-dimensional structure. We will occasionally use this form to indicate that there is a fourth dimension along which we can also index the tensor.

We will assume that whatever data we work with (images, text, sounds), will in some way be encoded into one or more tensors, before it is fed into our system. Let’s look at some examples.
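As a concrete illustration (assuming numpy as the tensor library; not part of the slides), the shapes listed above look like this in code:

import numpy as np

scalar  = np.array(2.0)             # 0-tensor, shape ()
vector  = np.zeros(3)               # 1-tensor, shape (3,)
matrix  = np.zeros((3, 2))          # 2-tensor, shape (3, 2)
tensor3 = np.zeros((2, 3, 2))       # 3-tensor, shape (2, 3, 2)
tensor4 = np.zeros((2, 3, 2, 4))    # 4-tensor, shape (2, 3, 2, 4)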

A CLASSIFICATION TASK AS TWO TENSORS

A simple dataset with numeric features can simply be represented as a matrix. For the labels, we usually create a separate corresponding vector. Any categoric features or labels should be converted to numeric features (normally by one-hot coding).

word count   nr. of recipients   class
    30               3           ham
   340               4           ham
   121               2           spam
    11               1           spam
    23               1           spam
   455               1           spam
   512               2           spam
     2              12           ham
    32               1           ham
   432               1           ham
    23               2           spam

X is the matrix of the two numeric feature columns; y is the corresponding label vector, with ham encoded as 0 and spam as 1.
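The same dataset written out as two numpy arrays (a sketch; the numbers are taken from the table above, with ham encoded as 0 and spam as 1):

import numpy as np

X = np.array([[ 30,  3], [340,  4], [121, 2], [ 11, 1], [ 23, 1], [455, 1],
              [512,  2], [  2, 12], [ 32, 1], [432, 1], [ 23, 2]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1])
print(X.shape, y.shape)   # (11, 2) (11,)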
AN IMAGE AS A 3-TENSOR

Images can be represented as 3-tensors. In an RGB image, the color of a single pixel is represented using three values between 0 and 1 (how red it is, how green it is and how blue it is). This means that an RGB image can be thought of as a stack of three color channels, represented by matrices.

This stack forms a 3-tensor, with one dimension for the height, one for the width and one for the color channels; a single pixel is then a vector of three values, for instance (0.1, 0.5, 0.0).

image source: http://www.adsell.com/scanning101.html

A DATASET OF IMAGES
If we have a dataset of images, we can represent this
as a 4-tensor, with dimensions indexing the instances,
their width, their height and their color channels
respectively. Below is a snippet of code showing that
when you load the CIFAR10 image data in Keras, you
do indeed get a 4-tensor like this.
There is no agreed standard ordering for the
dimensions in a batch of images. Tensorflow and Keras
use (batch, height, width, channels), whereas Pytorch
uses (batch, channels, height, width).
(You can remember the latter with the mnemonic “bachelor chow”.)
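The annotated snippet itself is not reproduced in this transcript; the following stand-in (using the real Keras API) shows the same check:

from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)   # (50000, 32, 32, 3): batch, height, width, channels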

Tensors

Building computation graphs

Working out backward functions

78

COMPUTATION GRAPHS

We’ll need to be a little more precise in our notation. From now on we’ll draw computation graphs with both the operations and the values as nodes: tensor nodes and operation nodes. The graph is always bipartite and directed. A tensor node is connected to the ops for which it is an input, and an operation node is connected to the tensor nodes that it produces as outputs.

bipartite: only op-to-tensor or tensor-to-op edges
tensor nodes: no input or one input, multiple outputs
operation nodes: multiple inputs and outputs

There doesn’t seem to be a standard visual notation. This is what we'll use in most of this course. In Tensorflow operations are called ops, and in Pytorch they’re called functions.
FEEDFORWARD NETWORK

As an example, here is our MLP expressed in our new notation (the graph contains the tensor nodes x, W, b, k, h, V, c, y, t and l, connected through the operation nodes ×, +, σ and the loss).

Just as before, it’s up to us how granular we make the computation. For instance, we wrap the whole computation of the loss in a single operation, but we could also separate out the subtraction and the squaring.

You may note that this graph is not directed or bipartite everywhere. This is just visual shorthand. Occasionally, we will omit the direction of an edge for clarity, or omit intermediate nodes when it’s clear that a tensor node or op node should be there. The actual computation graph we are describing in such cases is always bipartite and entirely directed.
RESTATING OUR ASSUMPTIONS

We hold on to the same assumptions we had before. They are required for automatic differentiation to work efficiently.

One output node l with a scalar value.
All gradients are of l, with respect to the tensors on the tensor nodes.
We compute gradients for all tensor nodes.

IMPLEMENTATIONS

To store a computation graph in memory, we need three classes of objects. The first two are shown here.

TensorNode:
  value: <Tensor>
  gradient: <Tensor>
  source: <OpNode>

OpNode:
  inputs: List<TensorNode>
  outputs: List<TensorNode>
  op: <Op>

A TensorNode object holds the tensor value at that node. It holds the gradient of the tensor at that node (to be filled by the backpropagation algorithm) and it holds a pointer to the Operation Node that produced it (which can be null for leaf nodes).

An OpNode object represents an instance of a particular operation being applied in the computation. It stores the particular op being used (multiplication, addition, etc.) and the inputs and the outputs. In the MLP example, we perform several matrix multiplications: each of these becomes a separate OpNode in the computation graph, all recording an instance of the matrix multiplication operation being used. Each of them refers to a single object representing this operation.

This is the third class: the Op.
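A minimal Python sketch of these two data structures (hypothetical code following the slide, not the course's actual implementation):

class TensorNode:
    def __init__(self, value=None, source=None):
        self.value = value      # the tensor stored at this node
        self.grad = None        # gradient of the loss wrt this tensor, filled in by backprop
        self.source = source    # the OpNode that produced this node (None for leaf nodes)

class OpNode:
    def __init__(self, op, inputs, outputs):
        self.op = op            # the Op whose forward/backward is being applied
        self.context = {}       # values saved by the forward for use in the backward
        self.inputs = inputs    # list of TensorNode
        self.outputs = outputs  # list of TensorNode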
DEFINING AN OPERATION

If the difference between an Op and an OpNode is unclear, consider that a single neural network may apply, say, a sigmoid operation many times. The Op object for the sigmoid defines how to compute the sigmoid operation (and its gradient). Then for each place in the computation graph where the sigmoid is applied, we create a separate OpNode object which references the inputs and outputs specific to that application of the sigmoid, together with a pointer to the single sigmoid Op. This is where we actually define how to perform the computation.

class Op:

  def forward(context, inputs):
    # given the inputs, compute the outputs
    ...

  def backward(context, outputs_gradient):
    # given the gradient of the loss wrt the outputs,
    # compute the gradient of the loss wrt the inputs
    ...

f(A) = B        forward_f: A ↦ B        backward_f: B∇ ↦ A∇

An Op is defined by two functions.

The function forward computes the outputs given the inputs (just like any function in any programming language).

The function backward takes the gradient for the outputs (the gradient of the loss wrt the outputs) and produces the gradient for the inputs.

Both functions are also given a context object. This is a data structure (a dictionary or a list) to which the forward can add any value which it needs to save for the backward; this is where we store everything we need for the backward pass. We’ll see an example of this later.

Note that the backward function does not compute the local derivative as we did in the scalar backpropagation: it computes the accumulated gradient of the loss over the inputs (given the accumulated gradient of the loss over the outputs).
BACKPROPAGATION

We’ll see how to implement ops later. First, let’s see how all this works put together. Using Ops, OpNodes and TensorNodes, we can build a computation graph (again, more on how we do that later). The OpNodes reference the TensorNodes that are their inputs and outputs, and they reference the Ops that contain the actual code for their forward and backward.

build a computation graph, perform the forward pass, bottom-up from the leaves
traverse the tree backwards breadth-first
  call backward for each OpNode
  breadth-first ensures that the output gradients are known
add the computed gradients to the tensor nodes
  sum them with any gradients already present

Once we have this computation graph, computing our model output and loss and performing backpropagation becomes almost trivially easy. We simply traverse the tree forwards from the inputs to the loss, calling all the forward functions, and then backwards from the loss to the inputs, calling all the backward functions.

The only thing we need to keep track of is that when we call forward() on an OpNode's Op, all the ancestors of that OpNode have been called, so that we have all its inputs. And, when we traverse the tree backward and call backward() on an OpNode's Op, all the descendants have been called, so that we have the gradients for all of its outputs.

Note the last point: some TensorNodes will be inputs to multiple operation nodes. By the multivariate chain rule, we should sum the gradients they get from all the OpNodes they feed into. We can achieve this easily by just initializing the gradients to zero and adding the gradients we compute to any that are already there.
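A sketch of this traversal in code, building on the hypothetical TensorNode/OpNode classes above. For brevity it assumes a scalar loss, single-output ops and a tree-shaped graph, so that every OpNode is visited exactly once with its output gradient already known:

from collections import deque

def backpropagate(loss_node):
    loss_node.grad = 1.0                         # dl/dl = 1
    queue = deque([loss_node.source])
    while queue:
        opnode = queue.popleft()
        if opnode is None:                       # we reached a leaf node
            continue
        goutput = opnode.outputs[0].grad         # gradient for the output is known
        ginputs = opnode.op.backward(opnode.context, goutput)
        for node, g in zip(opnode.inputs, ginputs):
            node.grad = g if node.grad is None else node.grad + g   # sum gradients
            queue.append(node.source)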
BUILDING COMPUTATION GRAPHS IN CODE

All this assumes that the computation graph is already there. But how do we tell the system what our computation graph should be? It turns out that the nicest way to do this is to describe the computation itself in code. This makes sense: code is a language explicitly for describing computations, so it would be nice to simply describe a computation as a piece of Python code, and have the system build a computation graph for us.

This can be done through the magic of operator overloading. We simply change what operators like * and + mean, when they are applied to TensorNode objects. Two strategies are common: lazy and eager execution.

Lazy execution: build your graph, compile it, feed data through.
Eager execution: perform the forward pass, keep track of the computation graph.
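A sketch of the eager variant of this idea, continuing the hypothetical classes above: by overloading __mul__, writing c = a * b both computes the value and records the OpNode that produced it, so the graph is built as a side effect of running the forward code.

class Mult:                                       # an Op for element-wise multiplication
    @staticmethod
    def forward(context, a, b):
        context['a'], context['b'] = a, b         # save the inputs for the backward
        return a * b

    @staticmethod
    def backward(context, goutput):
        return goutput * context['b'], goutput * context['a']

def _mul(self, other):
    opnode = OpNode(Mult, inputs=[self, other], outputs=[])
    out = TensorNode(Mult.forward(opnode.context, self.value, other.value), source=opnode)
    opnode.outputs.append(out)
    return out

TensorNode.__mul__ = _mul                         # now a * b builds the graph for us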

EXAMPLE: LAZY EXECUTION

Here’s one example of how a lazy execution API might look.

a = TensorNode()
b = TensorNode()

c = a * b

m = Model(
    in=(a, b),
    loss=c)

m.train((1, 2))

(The figure shows the resulting graph: tensor nodes a, b and c around a mult OpNode, each with a value and grad field that is still to be filled in.)

Note that when we’re building the graph, we’re not telling it which values to put at each node. We’re just defining the shape of the computation, but not performing the computation itself.

When we create the model, we define which nodes are the input nodes, and which node is the loss node. We then provide the input values and perform the forward and backward passes.

BUILDING THE COMPUTATION GRAPH: LAZY EXECUTION

Tensorflow 1.x default, Keras default
• Define the computation graph.
• Compile it.
• Iterate backward/forward over the data.

Fast. Many possibilities for optimization. Easy to serialise models. Easy to make training parallel.
Difficult to debug. Model must remain static during training.

In lazy execution, which was the norm during the early years of deep learning, we build the computation graph, but we don’t yet specify the values on the tensor nodes. When the computation graph is finished, we define the data, and we feed it through.

This is fast since the deep learning system can optimize the graph structure during compilation, but it makes models hard to debug: if something goes wrong during training, you probably made a mistake while defining your graph, but you will only get an error while passing data through it. The resulting stack trace will never point to the part of your code where you made the mistake.

EXAMPLE: EAGER EXECUTION

In eager mode deep learning systems, we create a node in our computation graph (a TensorNode) by specifying what data it should contain. The result is a tensor object that stores both the data, and the gradient over that data (which will be filled later).

a = TensorNode(value=1)
b = TensorNode(value=2)

c = a * b

c.backward()

Here we create the variables a and b. If we now apply an operation to these, for instance to multiply their values, the result is another variable c. Languages like python allow us to overload the * operator, so it looks like we’re just computing a multiplication, but behind the scenes, we are creating a computation graph that records all the computations we’ve done.

We compute the data stored in c by running the computation immediately, but we also store references to the variables that were used to create c, and the operation that created it. We create the computation graph on the fly, as we compute the forward pass.

Using this graph, we can perform the backpropagation from a given node that we designate as the loss node (node c in this case). We work our way down the graph computing the derivative of each variable with respect to c. At the start the TensorNodes do not have their grads filled in, but at the end of the backward, all gradients have been computed.

Once the gradients have been computed, and the gradient descent step has been performed, we clear the computation graph. It’s rebuilt from scratch for every forward pass.

BUILDING THE COMPUTATION GRAPH: EAGER EXECUTION

PyTorch, Tensorflow 2.0 default, Keras option
• Build the computation graph on the fly during the forward pass.

Easy to debug: problems in the model occur as the module is executing.
Flexible: the model can be entirely different from one forward to the next.
More difficult to optimize. A little more difficult to serialize.

In eager execution, we simply execute all operations immediately on the data, and collect the computation graph on the fly. We then execute the backward and ditch the graph we’ve collected.

This makes debugging much easier, and allows for more flexibility in what we can do in the forward pass. It can, however, be a little difficult to wrap your head around. Since we’ll be using pytorch later in the course, we’ll show you how eager execution works step by step.
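For reference, the same eager pattern in PyTorch (a real library, shown only as a brief illustration; this snippet is not part of the slides):

import torch

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
c = a * b
c.backward()            # backpropagate from c
print(a.grad, b.grad)   # tensor(2.) tensor(1.)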

Tensors

Building computation graphs

Working out backward functions

90

WORKING OUT THE BACKWARD FUNCTION

The final ingredient we need is a large collection of operations with backward functions worked out. We’ll show how to do this for a few examples.

First, an operation that sums two matrices element-wise: S = A + B, where S then feeds into the rest of the graph, which eventually produces the loss l.

class Plus(Op):

  def forward(context, a, b):
    # a, b are matrices of the same size
    return a + b

  def backward(context, goutput):
    return goutput, goutput

The recipe is the same as we saw in the last part:
1) Write out the scalar derivative for a single element.
2) Use the multivariate chain rule to sum over all outputs.
3) Vectorize the result.

A∇_ij = ∂l/∂A_ij = Σ_kl (∂l/∂S_kl)(∂S_kl/∂A_ij) = Σ_kl S∇_kl ∂S_kl/∂A_ij
      = Σ_kl S∇_kl ∂[A + B]_kl/∂A_ij = Σ_kl S∇_kl ∂(A_kl + B_kl)/∂A_ij
      = S∇_ij ∂A_ij/∂A_ij = S∇_ij

A∇ = S∇        B∇ = S∇

Note that when we draw the computation graph, we can think of everything that happens between S and l as a single module: we are already given the gradient of l over S, so it doesn’t matter if it’s one operation or a million.

Again, we can draw the computation graph at any granularity we like: very fine individual operations like summing and multiplying, or very coarse-grained operations like entire NNs.

For this operation, the context object is not needed: we can perform the backward pass without remembering anything about the forward pass.

BACKWARD: A FEW MORE EXAMPLES


To finish up, we’ll show you the implementation of
some more operations. You’ll be asked to do a few of
these in the second homework exercise.
Sigmoid

Row-wise sum

Expand

92

SIGMOID

class Sigmoid(Op):

  def forward(context, x):
    # x is a tensor of any shape
    sigx = 1 / (1 + (- x).exp())
    context['sigx'] = sigx

For the backward, with Y = σ(X) applied element-wise:

X∇_ijk = ∂l/∂X_ijk = Σ_abc (∂l/∂Y_abc)(∂Y_abc/∂X_ijk) = Σ_abc Y∇_abc ∂σ(X_abc)/∂X_ijk

This is a pretty simple derivation, but it shows two things:

1) We can easily do a backward over functions that output high-dimensional tensors, but we should sum over all dimensions of the output when we