Lecture02.Backpropagation.annotated
Peter Bloem
Deep Learning
dlvu.github.io
|section|Scalar backpropagation|
|video|https://www.youtube.com/embed/idO5r5eWIrw?si=NnUUgtAroD3_Rich|
THE CHAIN RULE
Here is the chain rule: if we want the derivative of a function which is the composition of two other functions, in this case f and g, we can take the derivative of f with respect to the output of g and multiply it by the derivative of g with respect to the input x:

∂f(g(x))/∂x = ∂f(g(x))/∂g(x) · ∂g(x)/∂x

Since we’ll be using the chain rule a lot, we’ll introduce a simple shorthand to make it a little easier to parse. We draw a little diagram of which function feeds into which (x → g → f). This means we know what the argument of each function is, so we can remove the arguments from our notation:

∂f/∂x = ∂f/∂g · ∂g/∂x
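To make the rule concrete, here is a small sketch that checks it numerically for one arbitrary choice of f and g (these particular functions and the input value are assumptions made for the example, not taken from the slides):

import math

def g(x): return x * x          # inner function
def f(u): return math.sin(u)    # outer function

x = 0.7
eps = 1e-6

# chain rule: d f(g(x)) / dx = f'(g(x)) * g'(x)
analytic = math.cos(g(x)) * (2 * x)

# finite-difference approximation of the same derivative
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(analytic, numeric)        # the two values agree to several decimals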
We call this diagram a computation graph. We'll stick with simple diagrams like this for now. At the end of the lecture, we will expand our notation a little bit to capture more complex computations.

The basic insight from calculus is that locally, we can treat f and g as linear functions (if they are differentiable). In an infinitesimally small neighbourhood, f and g are exactly linear:

f(g(x)) = sf (sg x + bg) + bf = sf sg x + sf bg + bf

where sf sg = ∂f/∂g · ∂g/∂x is the slope of f over x, and the rest is a constant.
If f and g are locally linear, we can describe their behavior with a slope s and an additive constant b. The slopes, sf and sg, are simply the derivatives. The additive constants, as we will show, can be ignored.

In this linear view, what the chain rule says is this: if we approximate f(g(x)) as a linear function of x, then its slope is the slope of f as a function of g, times the slope of g as a function of x. To prove that this is true, we just write down f(g(x)) as linear functions, and multiply out the brackets.
Note that this doesn’t quite count as a rigorous proof,
but it’s hopefully enough to give you some intuition for
why the chain rule holds.
The same picture works when f has two arguments, g(x) and h(x), each approximated linearly:

f(g(x), h(x)) = s1 (sg x + bg) + s2 (sh x + bh) + bf
             = s1 sg x + s1 bg + s2 sh x + s2 bh + bf
             = (s1 sg + s2 sh) x + s1 bg + s2 bh + bf

The slope of f over x is s1 sg + s2 sh = ∂f/∂g · ∂g/∂x + ∂f/∂h · ∂h/∂x, and the rest is a constant.
In general, for a function f of multiple intermediate values g1 … gN that each depend on x (diagram: x → gi → f), the multivariate chain rule sums over all paths:

∂f/∂x = Σi ∂f/∂gi · ∂gi/∂x
EXAMPLE
With that, we are ready to show how backpropagation
works. We'll start with a fairly arbitrary function to
show the principle before we move on to more
realistic neural networks.
f(x) = 2 / sin(e^-x)
GLOBAL AND LOCAL DERIVATIVES
We call the larger derivative of f over x the global derivative. And we call the individual factors, the derivatives of each operation's output with respect to its inputs, the local derivatives.

∂f/∂x = ∂f/∂d · ∂d/∂x
      = ∂f/∂d · ∂d/∂c · ∂c/∂x
      = ∂f/∂d · ∂d/∂c · ∂c/∂b · ∂b/∂x
      = ∂f/∂d · ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x

Here ∂f/∂x is the global derivative, and each factor on the right is a local derivative.
BACKPROPAGATION
This is how the backpropagation algorithm combines symbolic and numeric computation. We work out the local derivatives symbolically, and then work out the global derivative numerically.

The BACKPROPAGATION algorithm:
• break your computation up into a sequence of operations (what counts as an operation is up to you)
WORK OUT THE LOCAL DERIVATIVES SYMBOLICALLY
For each local derivative, we work out the symbolic derivative with pen and paper.

For f(x) = 2 / sin(e^-x), we break the computation into the operations

d(c) = 2/c
c(b) = sin b
b(a) = e^a
a(x) = -x

so that

∂f/∂x = ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂x = -2/c² · cos b · e^a · -1

Note that we could fill in the a, b and c in the result, but we don’t. We simply leave them as is. For the symbolic part, we are only interested in the derivative of the output of each sub-operation with respect to its immediate input. The rest of the algorithm is performed numerically.
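If you want to check the pen-and-paper step, a symbolic algebra package can reproduce the local derivatives. A minimal sketch using sympy (the variable names simply mirror the operations above):

import sympy as sp

x, a, b, c = sp.symbols('x a b c')

print(sp.diff(2 / c, c))      # -2/c**2
print(sp.diff(sp.sin(b), b))  # cos(b)
print(sp.diff(sp.exp(a), a))  # exp(a)
print(sp.diff(-x, x))         # -1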
COMPUTE A FORWARD PASS (x = -4.499)
Now that we are computing things numerically, we need a specific input, in this case x = -4.499. We start by feeding this through the computation graph. For each sub-operation, we store the output value. At the end, we get the output of the function f. This is called a forward pass.

a = -x = 4.499
b = e^a ≈ 90
c = sin b ≈ 1
d = 2/c = 2
f(-4.499) = 2

Filling the stored values into the product of local derivatives gives the global derivative, which in this case happens to be 0:

∂f/∂x = -2/c² · cos b · e^a · -1
      = -2/1² · cos 90 · e^4.499 · -1
      = -2 · 0 · 90 · -1 = 0
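Here is a minimal sketch of the same forward and backward computation in Python, following the decomposition a = -x, b = e^a, c = sin b, d = 2/c used above. (The slides evaluate sin and cos as if the argument were in degrees, so that c ≈ 1; the code below uses radians, so the numbers come out differently, but the procedure is identical.)

import math

x = -4.499

# forward pass: store every intermediate value
a = -x
b = math.exp(a)
c = math.sin(b)
d = 2 / c

# backward pass: fill the stored values into the local derivatives and multiply
dd_dc = -2 / c**2
dc_db = math.cos(b)
db_da = math.exp(a)
da_dx = -1.0
df_dx = dd_dc * dc_db * db_da * da_dx

print(d, df_dx)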
BACKPROPAGATION
Before we try this on a neural network, here are the main things to remember about the backpropagation algorithm. Note that backpropagation by itself does not train a neural net. It just provides a gradient. When people say that they trained a network by backpropagation, that's actually shorthand for training the network by gradient descent, with the gradients worked out by backpropagation.

The BACKPROPAGATION algorithm:
• break your computation up into a sequence of operations (what counts as an operation is up to you)
• work out the local derivatives symbolically
• compute the global derivative numerically, by computing the local derivatives and multiplying them
• Much more accurate than finite differences: the only source of inaccuracy is the numeric computation of the operations.
First, we have to break the computation of the loss into a sequence of operations:

l = (y - t)²
y = v1 h1 + v2 h2 + v3 h3 + bh
h2 = 1 / (1 + exp(-k2))
k2 = w12 x1 + w22 x2 + bx

Then we work out the derivative over v2:

∂l/∂v2 = ∂l/∂y · ∂y/∂v2
       = 2(y - t) · h2
       = 2(10.1 - 12.1) · 0.99 = -3.96

The rest of the parameters can be worked out in the same way to give us the rest of the gradient. When we apply this derivative in a gradient descent step, v2 moves in the direction that lowers the loss.
BACKPROPAGATION IN A FEEDFORWARD NETWORK
Let’s try something a bit earlier in the network: the weight w12. We add two operations, apply the chain rule and work out the local derivatives.

l = (y - t)²
y = v1 h1 + v2 h2 + v3 h3 + bh
h2 = 1 / (1 + exp(-k2))
k2 = w12 x1 + w22 x2 + bx

∂l/∂w12 = ∂l/∂y · ∂y/∂h2 · ∂h2/∂k2 · ∂k2/∂w12
        = 2(y - t) · v2 · h2(1 - h2) · x1

(using the derivative of the sigmoid: σ’(x) = σ(x)(1 - σ(x)))
BACKPROPAGATION
Note that when we’re computing the derivative for w12, we are also, along the way, computing the derivatives for y, h2 and k2:

∂l/∂w12 = ∂l/∂y · ∂y/∂h2 · ∂h2/∂k2 · ∂k2/∂w12 = 2(y - t) · v2 · h2(1 - h2) · x1

This is useful when it comes to implementing backpropagation: we can walk back down the computation graph and compute the derivative of the loss for every value we pass along the way. In fact, this is where the name backpropagation comes from: the derivative of the loss propagates down the network in the opposite direction to the forward pass. We will show this more precisely in the last part of this lecture.
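As a quick numeric sketch of these two expressions: y, t and h2 are the values used in the example above, while v2 and x1 are made-up values, only there to make the second line concrete.

y, t = 10.1, 12.1
h2 = 0.99

dl_dv2 = 2 * (y - t) * h2                        # = -3.96, as in the example
print(dl_dv2)

v2, x1 = 1.0, 0.5                                # assumed values, not from the slides
dl_dw12 = 2 * (y - t) * v2 * h2 * (1 - h2) * x1
print(dl_dw12)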
Here we see what gradient descent would do at the output of the network. If the output is 0 and it should have been 10, the derivative of the loss is 2(y - t) = -20, and gradient descent tells us to increase the value of the output: y ← y - α · (-20) = y + α · 20 (and if the output is 10 with a target of 0, the sign flips and the update decreases y).

Even though we can’t change y directly, this is the effect we want to achieve: we want to change the values we can change so that we achieve this change in y. To figure out how to do that, we take this gradient for y, and propagate it back down the network.

Note that even though these scenarios have the same loss (because of the square), the derivative of the loss has a different sign for each, so we can tell whether the output is bigger than the target or the other way around. The loss only tells us how bad we've done, but the derivative of the loss tells us where to move to make it better.

Of course, we cannot change y directly. Instead, we have to change the values that influenced y.
Here we see what that looks like for the weights of the second layer:

v2 ← v2 - α · 2(y - t) h2 = v2 - α · 20 · 0.5
v1 ← v1 - α · 20 · 0.1

First note that the output y in this example was too high. Since all the hidden nodes have positive values (because of the sigmoid), we end up subtracting some positive value from all the weights. This will lower the output, as expected. Second, note that the change is proportional to the activation of the corresponding hidden node.
NEGATIVE ACTIVATION
The sigmoid activation we’ve used so far allows only positive values to emerge from the hidden layer. If we switch to an activation that also allows negative activations (like a linear activation or a tanh activation), we see that backpropagation very naturally takes the sign into account. Note the negative activation on h2.

v1 ← v1 - α · 20 · 0.1
w12 ← w12 - α · 2(y - t) v2 h2(1 - h2) x1

For w12, the sign here works out negative, so we should decrease w12 to increase h2. This is all beautifully captured by the chain rule.
Forward pass (pseudocode fragment):

for i in [1 … 3]:
    y += h[i] * v[i]
y += c
loss = (y - t) ** 2

Backward pass (pseudocode fragment), with the convention k[i] = Σj w[i,j] * x[j] + b[i]:

for i in [1 … 3]:
    for j in [1 … 2]:
        w'[i, j] = k'[i] * x[j]
    b'[i] = k'[i]

For each value, we take the gradient of the outputs and multiply it by the local derivative.
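Here is a runnable, loop-based version of this computation (a sketch; the weights, inputs and target are made-up values, not from the slides). The indexing convention is k[i] = Σj w[i][j] * x[j] + b[i].

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, -1.0]                               # 2 inputs (assumed values)
t = 1.0                                       # target (assumed)
w = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]     # 3x2 first-layer weights (assumed)
b = [0.0, 0.0, 0.0]
v = [0.5, -0.5, 0.2]                          # second-layer weights (assumed)
c = 0.0

# forward pass: store every intermediate value
k, h = [0.0] * 3, [0.0] * 3
y = 0.0
for i in range(3):
    for j in range(2):
        k[i] += w[i][j] * x[j]
    k[i] += b[i]
    h[i] = sigmoid(k[i])
    y += h[i] * v[i]
y += c
loss = (y - t) ** 2

# backward pass: walk back through the graph, multiplying by local derivatives
dy = 2 * (y - t)
dv = [dy * h[i] for i in range(3)]
dc = dy
dh = [dy * v[i] for i in range(3)]
dk = [dh[i] * h[i] * (1 - h[i]) for i in range(3)]
dw = [[dk[i] * x[j] for j in range(2)] for i in range(3)]
db = dk[:]

print(loss, dw, db, dv, dc)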
RECAP
Lecture 2: Backpropagation
Peter Bloem
Deep Learning
dlvu.github.io
|section|Tensor backpropagation|
|video|https://www.youtube.com/embed/O-xs8IyP4bQ?si=YWT0e5PGkU37kOVf|
PART THREE: TENSOR BACKPROPAGATION
Expressing backpropagation in vector, matrix and tensor operations.

We’ve seen what neural networks are, how to train them by gradient descent and how to compute that gradient by backpropagation. In order to scale up to larger and more complex structures, we need to make our computations as efficient as possible, and we’ll also need to simplify our notation. There’s one insight that we are going to get a lot of mileage out of.
The multiplication of weights by their inputs is a multiplication of a matrix of weights by a vector of inputs:

f(x) = V σ(Wx + b) + c

Note that the weight w12 (as we’ve called it so far, because it goes from node 1 to node 2) actually ends up in row 2, column 1 of the weight matrix W.
MATRIX MULTIPLICATION
The second reason is that the biggest computational bottleneck in a neural network is the multiplication of the layer input by the layer weights: the matrix multiplication in the notation of the previous slide. This operation is more than quadratic, while everything else is linear. We can see that in our pseudocode: the matrix multiplication is the part that needs the nested loop, and it’s a lot of work. Ideally, we’d like to let somebody else do all that work (like the implementers of numpy) and then just call their code to do the matrix multiplications.

# k, h, y, etc. are zero’d arrays
for i in [1 … 3]:
    for j in [1 … 2]:
        k[i] += w[i,j] * x[j]
    k[i] += b[i]
    h[i] = sigmoid(k[i])
    y += h[i] * v[i]
y += c
loss = (y - t) ** 2

This is especially important if you have access to a GPU. A matrix multiplication can easily be offloaded to the GPU for much faster processing, but for a loop over an array, there’s no benefit.

This is called vectorizing: taking a piece of code written in for loops, and getting rid of the loops by expressing the function as a sequence of linear algebra operations.
VECTORIZING
Without vectorized implementations, deep learning would be painfully slow, and GPUs would be useless. Turning a naive, loop-based implementation into a vectorized one is a key skill for DL researchers and programmers.

Express everything as operations on vectors, matrices and tensors.
Get rid of all the loops.
Makes the notation simpler.
Makes the execution faster.
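A rough illustration of the speed difference, comparing the same matrix-vector product written with Python loops and with numpy (a sketch; the sizes are arbitrary choices):

import time
import numpy as np

n, m = 1000, 1000
W = np.random.randn(n, m)
x = np.random.randn(m)

start = time.perf_counter()
k_loop = [sum(W[i, j] * x[j] for j in range(m)) for i in range(n)]
loop_time = time.perf_counter() - start

start = time.perf_counter()
k_vec = W.dot(x)
vec_time = time.perf_counter() - start

print(loop_time, vec_time)   # the numpy version is typically orders of magnitude faster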
FORWARD
Here’s what the vectorized forward pass looks like as a computation graph, in symbolic notation and in pseudocode. When you implement this in numpy it’ll look almost the same.

In symbolic notation:

k = Wx + b
h = σ(k)
y = Vh + c
l = (y - t)²

In pseudocode:

k = w.dot(x) + b
h = sigmoid(k)
y = v.dot(h) + c
l = (y - t) ** 2

Note that this doesn't just represent the network in the previous part, it represents any such network, regardless of the sizes of the input and hidden layers. We've abstracted away some details about the specifics of the network. We've also made the output and the target label vectors for the sake of generality.
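A self-contained numpy version of this forward pass (a sketch; the layer sizes and the random initialisation are assumptions, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_out = 2, 3, 2
rng = np.random.default_rng(0)
W, b = rng.standard_normal((n_hid, n_in)), np.zeros(n_hid)
V, c = rng.standard_normal((n_out, n_hid)), np.zeros(n_out)

x = rng.standard_normal(n_in)
t = rng.standard_normal(n_out)

k = W.dot(x) + b
h = sigmoid(k)
y = V.dot(h) + c
l = ((y - t) ** 2).sum()
print(l)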
So far so good. The forward pass is easy enough to vectorize.
BUT WHAT ABOUT THE BACKWARD PASS?
Of course, we lose a lot of the benefit of vectorizing if the backward pass (the backpropagation algorithm) isn’t also expressed in terms of matrix multiplications. How do we vectorize the backward pass? That's the question we'll answer in the rest of this video.

On the left, we see the forward pass of our loss computation as a set of operations on vectors and matrices:

l = (y - t)²
y = Vh + c
h = σ(k)
k = Wx + b

Can we do something like this?

∂l/∂W =? ∂l/∂y · ∂y/∂h · ∂h/∂k · ∂k/∂W

To generalize backpropagation to this view, we might ask if something similar to the chain rule exists for vectors and matrices. Firstly, can we define something analogous to the derivative of one matrix over another, and secondly, can we break this apart into a product of local derivatives, possibly giving us a sequence of matrix multiplications?

The answer is yes, there are many ways of applying calculus to vectors and matrices, and there are many chain rules available in these domains. However, things can quickly get a little hairy, so we need to tread carefully.
GRADIENTS, JACOBIANS, ETC
The derivatives of high-dimensional objects are easily defined. We simply take the derivative of every number in the output over every number in the input, and we arrange all possibilities into some logical shape. For instance, if we have a vector-to-vector function, the natural analog of the derivative is a matrix with all the partial derivatives in it.

For f(A) = B, we write ∂B/∂A for the derivatives of every element of B over every element of A. For instance, for a function f that maps a vector (a1, a2, a3) to a vector (b1, b2), we can arrange the derivatives into the Jacobian matrix

Jf = ( ∂b1/∂a1  ∂b2/∂a1 )
     ( ∂b1/∂a2  ∂b2/∂a2 )
     ( ∂b1/∂a3  ∂b2/∂a3 )

Arrange derivatives in a:

                       function returns a:
                       scalar    vector    matrix
input is a:  scalar    scalar    vector    matrix
             vector    vector    matrix    ?
             matrix    matrix    ?         ?

However, once we get to matrix/vector operations or matrix/matrix operations, the only way to logically arrange every input with every output is a tensor of higher dimension than 2.

NB: We don’t normally apply the differential symbol to non-scalars like this. We’ll introduce better notation later.
SIMPLIFYING ASSUMPTIONS
To work out how to do this we make the following simplifying assumptions.

1) The computation graph always has a single, scalar output: l. The derivatives we are interested in are of the form ∂l/∂W, where l is a scalar and W is a tensor of any dimension.

                       function returns a:
                       scalar     vector    matrix
input is a:  scalar    scalar     vector    matrix
             vector    vector     matrix    ?
             matrix    matrix     ?         ?
             3-tensor  3-tensor   ?         ?

Note that this doesn’t mean we can only ever train neural networks with a single scalar output. Our neural networks can have any number of outputs of any shape and size. We can make neural networks that generate images, natural language, raw audio, even video. However, the loss we define over those outputs needs to map them all to a single scalar value. The computation graph is always the model, plus the computation of the loss.

(Diagram: inputs → computation of model → outputs → computation of loss → loss.)
TENSOR BACKPROPAGATION
This gives us a good way of thinking about the gradients, but we still don’t have a chain rule to base our backpropagation on. The main trick we will use is to stick to scalar derivatives as much as possible. Once we have worked out the derivative in purely scalar terms (on pen and paper), we will then find a way to vectorize their computation.

Work out scalar derivatives first, then vectorize.
Use the multivariate chain rule to derive the scalar derivative.
Apply the chain rule step by step.
Start at the loss and work backward, accumulating the gradients.
GRADIENT FOR y
We start simple: what is the gradient for y? This is a vector, because y is a vector. Let’s first work out the derivative of the i-th element of y. This is purely a scalar derivative:

y∇i = ∂l/∂yi
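As a quick worked step (assuming the squared-error loss l = Σi (yi - ti)² from the forward pass above), this derivative works out to y∇i = 2(yi - ti), so the whole gradient vectorizes to y∇ = 2(y - t).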
GRADIENT FOR V
Let’s move one step down and work out the gradient for V. We start with the scalar derivative for an arbitrary element of V: Vij. Now, we need a chain rule to backpropagate through y. Since l depends on Vij through every element of y, we can simply use the scalar multivariate chain rule:

V∇ij = ∂l/∂Vij = Σk ∂l/∂yk · ∂yk/∂Vij = Σk y∇k · ∂yk/∂Vij

This tells us that however many intermediate values we have, we can work out the derivative for each, keeping the others constant, and sum the results. This gives us the sum in the second equality. We've worked out the gradient y∇ already, so we can fill that in and focus on the derivative of yk over Vij. The value of yk is defined in terms of linear algebra operations like matrix multiplication, but with a little thinking we can always rewrite these as simple scalar sums:

∂yk/∂Vij = ∂[Vh + c]k/∂Vij = ∂(Σl Vkl hl + ck)/∂Vij = hj if k = i, and 0 otherwise

In the end we find that the derivative of yk over Vij reduces to the value of hj, so that

V∇ij = y∇i hj

This tells us the value of every element (i, j) of V∇. All that's left is to figure out how to compute this in a vectorized way. In this case, we can compute V∇ as the outer product of the gradient for y, which we've computed already, and the vector h, which we can save during the forward pass:

V∇ = y∇ h^T
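A minimal numpy check of V∇ = y∇ h^T (the sizes and values here are stand-ins, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
y_grad = rng.standard_normal(2)   # stand-in for y∇
h = rng.standard_normal(3)        # stand-in for the stored hidden activations

V_grad = np.outer(y_grad, h)      # V∇[i, j] = y∇[i] * h[j]
print(V_grad.shape)               # (2, 3), the same shape as V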
RECIPE
We’ll leave the biases as an exercise.

GRADIENT FOR h
For the gradient for h, most of the derivation is the same as before, until we get to the point where the scalar derivative is reduced to a matrix multiplication. Unlike the previous derivation, the sum over k doesn't disappear. We need to take it into the description of the gradient for h:

h∇i = ∂l/∂hi = Σk ∂l/∂yk · ∂yk/∂hi = Σk y∇k · ∂[Vh + c]k/∂hi
    = Σk y∇k · ∂(Σl Vkl hl + ck)/∂hi = Σk y∇k Vki

h∇ = (y∇^T V)^T = V^T y∇

To vectorize this, we note that each element of this vector is a dot product of y∇ and a column of V. This means that if we premultiply V by y∇ (transposed), we get the required result as a row vector. Transposing the result (to make it a column vector) is equivalent to multiplying y∇ by the transpose of V.

Note the symmetry of the forward and the backward operation.
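A matching numpy sketch for h∇ = V^T y∇ (again with made-up sizes):

import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 3))
y_grad = rng.standard_normal(2)

h_grad = V.T.dot(y_grad)          # h∇[i] = Σk y∇[k] * V[k, i]
print(h_grad.shape)               # (3,), the same shape as h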
GRADIENT FOR k
Now, at this point, when we analyze k, remember that we already have the gradient over h. This means that we no longer have to apply the chain rule to anything above h. We can draw this scalar computation graph, and work out the local gradient for k in terms of the gradient for h. Given that, working out the gradient for k is relatively easy, since the operation from k to h is an element-wise one:

k∇i = ∂l/∂ki = Σm ∂l/∂hm · ∂hm/∂ki = Σm h∇m · ∂σ(km)/∂ki = h∇i · ∂σ(ki)/∂ki
    = h∇i σ’(ki) = h∇i σ(ki)(1 - σ(ki)) = h∇i hi(1 - hi)

k∇ = h∇ ⊙ σ’(k) = h∇ ⊙ h ⊙ (1 - h)

(using the derivative of the sigmoid, σ’(x) = σ(x)(1 - σ(x)), and with ⊙ denoting element-wise multiplication)
h k
= ⌦ (k)(1=-hh) ⌦ h ⌦ (1 - h)
y
k l b t
62
Wl c h
V t k
bh l
x
ck
xt l y
x
y
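A minimal NumPy check of this element-wise rule (the toy downstream loss, the variable names and the finite-difference test are my own additions, not part of the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

k = np.random.randn(5)            # pre-activations
w = np.random.randn(5)            # stands in for "the rest of the network"
h = sigmoid(k)
l = (h * w).sum()                 # some scalar loss downstream of h

h_grad = w                        # h∇ for this toy loss
k_grad = h_grad * h * (1 - h)     # k∇ = h∇ ⊙ h ⊙ (1 - h)

# finite-difference check of one element
eps, i = 1e-6, 2
k_plus = k.copy(); k_plus[i] += eps
l_plus = (sigmoid(k_plus) * w).sum()
print(k_grad[i], (l_plus - l) / eps)   # the two numbers should agree closely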
GRADIENT FOR W

W∇_ij = ∂l/∂W_ij = Σ_m (∂l/∂k_m)(∂k_m/∂W_ij) = Σ_m k∇_m ∂k_m/∂W_ij
      = Σ_m k∇_m ∂[Wx + b]_m/∂W_ij = Σ_m k∇_m ∂(W_m: x + b_m)/∂W_ij
      = Σ_m k∇_m ∂(Σ_l W_ml x_l + b_m)/∂W_ij = Σ_m k∇_m Σ_l ∂(W_ml x_l)/∂W_ij
      = k∇_i ∂(W_ij x_j)/∂W_ij = k∇_i x_j

W∇ = ( ⋯ k∇_i x_j ⋯ ) = k∇ x^T

63

Finally, the gradient for W. The situation here is exactly the same as we saw earlier for V (matrix in, vector out, matrix multiplication), so we should expect the derivation to have the same form (and indeed it does).
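In code, W∇ = k∇ x^T is just an outer product. A small NumPy sketch (variable names are mine); this is what the k'.mm(x.T) line in the pseudocode on the next slide computes when k∇ is stored as a column vector:

import numpy as np

k_grad = np.random.randn(4)              # k∇
x = np.random.randn(3)                   # input vector

W_grad = np.outer(k_grad, x)             # W∇[i, j] = k∇_i * x_j, shape (4, 3)
W_grad2 = k_grad[:, None] @ x[None, :]   # same thing as a column times a row
print(np.allclose(W_grad, W_grad2))      # True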
PSEUDOCODE: BACKWARD

y' = 2 * (y - t)
v' = y'.mm(h.T)
h' = v.T.mm(y')
k' = h' * h * (1 - h)
w' = k'.mm(x.T)

64

Here's the backward pass that we just derived. We've left the derivatives of the bias parameters out. You'll have to work these out to implement the first homework exercise.
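For reference, a self-contained NumPy version of the forward pass and the backward pass above. The sizes, the random data and the column-vector convention are my own assumptions; the bias gradients are left out, exactly as on the slide:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# toy sizes: 3 inputs, 4 hidden units, 2 outputs; column vectors throughout
x = np.random.randn(3, 1)
t = np.random.randn(2, 1)
W, b = np.random.randn(4, 3), np.random.randn(4, 1)
V, c = np.random.randn(2, 4), np.random.randn(2, 1)

# forward
k = W @ x + b
h = sigmoid(k)
y = V @ h + c
l = ((y - t) ** 2).sum()

# backward, following the pseudocode above
y_grad = 2 * (y - t)             # y' = 2 * (y - t)
V_grad = y_grad @ h.T            # v' = y'.mm(h.T)
h_grad = V.T @ y_grad            # h' = v.T.mm(y')
k_grad = h_grad * h * (1 - h)    # k' = h' * h * (1 - h)
W_grad = k_grad @ x.T            # w' = k'.mm(x.T)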
RECAP

Vectorizing makes notation simpler, and computation faster.

65
Lecture 2: Backpropagation
Peter Bloem
Deep Learning
dlvu.github.io
|section|Automatic differentiation|
|video|https://www.youtube.com/embed/9H-o8wESCxI?si=NyQ6djR2xavcBsIL|
We’ve simplified the computation of derivatives a lot.
PART FOUR: AUTOMATIC DIFFERENTIATION All we’ve had to do is work out local derivatives and
chain them together. We can go one step further: if we
Letting the computer do all the work. let the computer keep track of our computation graph,
and provide some backwards functions for basic
operations, the computer can work out the whole
backpropagation algorithm for us. This is called
automatic differentiation (or sometimes autograd).
THE STORY SO FAR

pen and paper → in the computer:

y' = 2 * (y - t)
v' = y'.mm(h.T)
h' = (y'[None, :] * v).sum(axis=1)
k' = h' * sigmoid(k) * (1 - sigmoid(k))
w' = k'.mm(x.T)

68
THE IDEAL

pen and paper:
y' = 2 * (y - t)
...

in the computer:
k = w.dot(x) + b
h = sigmoid(k)
y = v.dot(h) + c
l = (y - t) ** 2

l.backward() # start backprop

This is what we want to achieve. We work out on pen and paper the local derivatives of various modules, and then we chain these modules together in a computation graph in our code. The computer keeps the graph in memory, and can automatically work out the backpropagation.
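This is essentially what an autograd system like PyTorch gives you. A hedged sketch mirroring the code above (the shapes and random data are toy choices of mine):

import torch

# leaf tensors; requires_grad=True tells PyTorch to record the computation graph
x = torch.randn(3)
t = torch.randn(2)
w = torch.randn(4, 3, requires_grad=True)
b = torch.randn(4, requires_grad=True)
v = torch.randn(2, 4, requires_grad=True)
c = torch.randn(2, requires_grad=True)

k = w @ x + b                # k = w.dot(x) + b
h = torch.sigmoid(k)         # h = sigmoid(k)
y = v @ h + c                # y = v.dot(h) + c
l = ((y - t) ** 2).sum()     # l = (y - t) ** 2

l.backward()                 # start backprop
print(w.grad.shape)          # W∇ has been filled in automatically: torch.Size([4, 3])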
THE KEY IDEA

W∇_ij = ∂l/∂W_ij = Σ_m (∂l/∂k_m)(∂k_m/∂W_ij) = Σ_m k∇_m ∂[Wx + b]_m/∂W_ij
      = Σ_m k∇_m ∂(Σ_l W_ml x_l + b_m)/∂W_ij = k∇_i x_j

W∇ = k∇ x^T

The factor ∂l/∂k_m is simply k∇_m: everything above k in the graph is ignored once we have k∇.

70

Whenever we work out a gradient, as we did in the previous video, we always look at the node above the one for which we're computing the gradient, and use the multivariate chain rule to split the gradient in two. In this case, we are working out the gradient for W, and we apply the multivariate chain rule to break that gradient into the gradient for k and the derivatives of k with respect to W.

The key idea that powers automatic differentiation is that once we have k∇, we no longer care about anything else that happens above k in our computation graph. All we need is k∇ and we can work out W∇.
Define your computation graph in terms of operations chained together.
For each operation x -> y, if you know the gradient y∇ for the output, you can work out the gradient x∇ for the input.
It doesn't matter what happens in the rest of the computation graph, before or after the operation.
With this knowledge, you can start at the top of the graph, and work your way down, computing all the gradients (as in the sketch below).

71
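A hedged sketch of that top-down walk. All the names here are hypothetical (they anticipate the TensorNode/OpNode objects introduced later in this part), and I assume a scalar loss, single-output ops, and that each OpNode keeps the context its op's forward filled in:

def backpropagate(loss_node, op_nodes_in_forward_order):
    loss_node.gradient = 1.0                        # dl/dl = 1, the loss is a scalar
    for op_node in reversed(op_nodes_in_forward_order):
        out_grad = op_node.outputs[0].gradient      # gradient of the loss wrt this op's output
        in_grads = op_node.op.backward(op_node.context, out_grad)
        for tensor_node, grad in zip(op_node.inputs, in_grads):
            if tensor_node.gradient is None:        # first contribution to this tensor
                tensor_node.gradient = grad
            else:                                   # tensor used more than once: accumulate
                tensor_node.gradient = tensor_node.gradient + grad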
AUTOMATIC DIFFERENTIATION

many inputs, few outputs: work backward (<- deep learning)
few inputs, many outputs: work forward

72

This kind of algorithm is called automatic differentiation. What we've been doing so far is called backward, or reverse mode automatic differentiation. This is efficient if you have few output nodes. Note that if we had two output nodes, we'd need to do a separate backward pass for each.
If you have few inputs, it's more efficient to start with the inputs and apply the chain rule working forward. This is called forward mode automatic differentiation.
Since we assume we only have one output node (where the gradient computation is concerned), we will always use reverse mode in deep learning.
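For contrast, a minimal illustration of forward mode using dual numbers (my own example, not from the slides): every value carries the derivative with respect to one chosen input, and the chain rule is applied as the computation moves forward.

import math

class Dual:
    # a value together with its derivative wrt one chosen input
    def __init__(self, value, deriv):
        self.value, self.deriv = value, deriv
    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)
    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sigmoid(d):
    s = 1.0 / (1.0 + math.exp(-d.value))
    return Dual(s, s * (1.0 - s) * d.deriv)   # chain rule, applied forward

x = Dual(0.5, 1.0)            # deriv = 1: we differentiate wrt x
out = sigmoid(x * x)          # value and derivative of sigmoid(x * x) at x = 0.5, in one pass
print(out.value, out.deriv)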
THE PLAN

Tensors: scalars, vectors, matrices, etc.
Building computation graphs: define operations with tensor inputs and outputs, chain them together.

73

To make this idea precise, and applicable to as broad a range of functions as possible, we'll need to set up a few ingredients. Our values will be tensors: scalars, vectors, matrices, or their higher-dimensional analogues. We then build computation graphs out of operations with tensor inputs and outputs, chained together.
[slide 75: a small example dataset, rows of numbers labelled ham/spam, of the kind we will represent as a tensor]
AN IMAGE AS A 3-TENSOR

[figure: an RGB image as a stack of three channel matrices, with axes labelled height, width and channels, and one pixel's three values (0.1, 0.5, 0.0) picked out]

Images can be represented as 3-tensors. In an RGB image, the color of a single pixel is represented using three values between 0 and 1 (how red it is, how green it is and how blue it is). This means that an RGB image can be thought of as a stack of three color channels, represented by matrices.
This stack forms a 3-tensor.
A DATASET OF IMAGES

77

If we have a dataset of images, we can represent this as a 4-tensor, with dimensions indexing the instances, their width, their height and their color channels respectively. Below is a snippet of code showing that when you load the CIFAR10 image data in Keras, you do indeed get a 4-tensor like this.
There is no agreed standard ordering for the dimensions in a batch of images. Tensorflow and Keras use (batch, height, width, channels), whereas Pytorch uses (batch, channels, height, width). (You can remember the latter with the mnemonic "bachelor chow".)
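The slide's own snippet did not survive the extraction. A hedged stand-in using the standard Keras loader, with the PyTorch reordering added as my own illustration:

from tensorflow.keras.datasets import cifar10
import torch

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)      # (50000, 32, 32, 3): batch, height, width, channels

# the same data in PyTorch's (batch, channels, height, width) order
x_torch = torch.from_numpy(x_train).permute(0, 3, 1, 2)
print(x_torch.shape)      # torch.Size([50000, 3, 32, 32])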
Tensors
78
COMPUTATION GRAPHS

Tensor nodes and operation nodes
bipartite: only op-to-tensor or tensor-to-op edges
tensor nodes: no input or one input, multiple outputs
operation nodes: multiple inputs and outputs

79

We'll need to be a little more precise in our notations. From now on we'll draw computation graphs with both the operations and the values as nodes. The graph is always bipartite and directed. A tensor node is connected to the ops for which it is an input, and an operation node is connected to the tensor nodes that it produces as outputs.
There doesn't seem to be a standard visual notation. This is what we'll use in most of this course. In Tensorflow operations are called ops, and in Pytorch they're called functions.
FEEDFORWARD NETWORK

[figure: the MLP as a computation graph, with tensor nodes x, W, b, k, h, V, c, y, t, l connected through op nodes ×, +, σ and the loss, plus the visual shorthand we use for it]

80

As an example, here is our MLP expressed in our new notation.
Just as before, it's up to us how granular we make the computation. For instance, we wrap the whole computation of the loss in a single operation, but we could also separate out the subtraction and the squaring.
You may note that this graph is not directed or bipartite everywhere. This is just visual shorthand. Occasionally, we will omit the direction of an edge for clarity, or omit intermediate nodes when it's clear that a tensor node or op node should be there. The actual computation graph we are describing in such cases is always bipartite and entirely directed.
RESTATING OUR ASSUMPTIONS

One output node l with a scalar value.
All gradients are of l, with respect to the tensors on the tensor nodes.
We compute gradients for all tensor nodes.

81

We hold on to the same assumptions we had before. They are required for automatic differentiation to work efficiently.
IMPLEMENTATIONS

TensorNode:                      OpNode:
  value: <Tensor>                  inputs: List<TensorNode>
  gradient: <Tensor>               outputs: List<TensorNode>
  source: <OpNode>                 op: <Op>

82

To store a computation graph in memory, we need three classes of objects. The first two are shown here.
A TensorNode object holds the tensor value at that node. It holds the gradient of the tensor at that node (to be filled in by the backpropagation algorithm) and it holds a pointer to the OpNode that produced it (which can be null for leaf nodes).
An OpNode object represents an instance of a particular operation being applied in the computation. It stores the particular op being used (multiplication, addition, etc.) and the inputs and the outputs. In the MLP example, we perform several matrix multiplications: each of these becomes a separate OpNode in the computation graph, all recording an instance of the matrix multiplication operation being used. Each of them refers to a single object representing this operation.
This is the third class: the Op.
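A minimal Python sketch of these two classes. The field names follow the slide; everything else (dataclasses, type hints) is my own choice:

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class TensorNode:
    value: Any                            # the tensor held at this node
    gradient: Any = None                  # filled in by the backpropagation algorithm
    source: Optional["OpNode"] = None     # the OpNode that produced this tensor (None for leaves)

@dataclass
class OpNode:
    op: Any                               # the single Op object this node is an instance of
    inputs: List[TensorNode] = field(default_factory=list)
    outputs: List[TensorNode] = field(default_factory=list)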
DEFINING AN OPERATION

class Op:

    def forward(context, ...):
        # compute the outputs from the inputs, storing everything
        # we need for the backward pass in the context
        ...

    def backward(context, outputs_gradient):
        # given the gradient of the loss wrt to the outputs
        # compute the gradient of the loss wrt to the inputs
        ...

f(A) = B
forward_f: A ↦ B
backward_f: B∇ ↦ A∇

83

If the difference between an Op and an OpNode is unclear, consider that a single neural network may apply, say, a sigmoid operation many times. The Op object for the sigmoid defines how to compute the sigmoid operation (and its gradient). Then, each time it is applied, we create a separate OpNode object which references the inputs and outputs specific to that application of the sigmoid, together with a pointer to the single sigmoid Op. This is where we actually define how to perform the computation.
An Op is defined by two functions.
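As an extra illustration in the same style (my own example, assuming the Op base class above and torch tensors; the slides themselves work out Plus and Sigmoid later), an op for the matrix-vector product k = Wx reuses the derivation from earlier in this lecture:

class MatVec(Op):    # hypothetical op computing k = W x

    def forward(context, W, x):
        context['W'], context['x'] = W, x      # saved for the backward pass
        return W.mv(x)

    def backward(context, k_gradient):
        W, x = context['W'], context['x']
        W_gradient = k_gradient[:, None] * x[None, :]   # W∇ = k∇ x^T
        x_gradient = W.t().mv(k_gradient)               # x∇ = W^T k∇
        return W_gradient, x_gradient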
source: mult(a, b)

c = a * b

m = Model(in=(a, b), loss=c)

m.train((1, 2))

[figure: tensor nodes a and b feeding a × op node that produces c; each tensor node has a value field and a grad field, unknown until the model is trained on the inputs (1, 2)]

86

We build the computation graph without telling it which values to put at each node: we're just defining the shape of the computation, but not performing the computation itself.
When we create the model, we define which nodes are the input nodes, and which node is the loss node. We then provide the input values and perform the forward and backward passes.
WORKING OUT THE BACKWARD FUNCTION

class Plus(Op):

    def forward(context, a, b):
        return a + b

    def backward(context, goutput):
        return goutput, goutput

S = A + B

A∇_ij = ∂l/∂A_ij = Σ_kl (∂l/∂S_kl)(∂S_kl/∂A_ij) = Σ_kl S∇_kl ∂S_kl/∂A_ij
      = Σ_kl S∇_kl ∂[A + B]_kl/∂A_ij = Σ_kl S∇_kl ∂(A_kl + B_kl)/∂A_ij = S∇_ij

A∇ = S∇        B∇ = S∇

[slide also shows: ABS, Row-wise sum, Expand]

The final ingredient we need is a large collection of operations with backward functions worked out. We'll show how to do this for a few examples.
The recipe is the same as we saw in the last part:
1) Write out the scalar derivative for a single element.
2) Use the multivariate chain rule to sum over all outputs.
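A quick sanity check of A∇ = S∇ and B∇ = S∇ with PyTorch (my own check, not part of the slides):

import torch

A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(3, 4, requires_grad=True)
S = A + B
S.retain_grad()                        # keep S∇ around so we can compare against it
l = (S * torch.randn(3, 4)).sum()      # an arbitrary scalar loss downstream of S

l.backward()
print(torch.allclose(A.grad, S.grad), torch.allclose(B.grad, S.grad))   # True True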
SIGMOID

class Sigmoid(Op):

    def forward(context, x):
        # x is a tensor of any shape
        sigx = 1 / (1 + (- x).exp())
        context['sigx'] = sigx

Y = σ(X)

X∇_ijk = ∂l/∂X_ijk = Σ_abc (∂l/∂Y_abc)(∂Y_abc/∂X_ijk) = Σ_abc Y∇_abc ∂Y_abc/∂X_ijk
       = Σ_abc Y∇_abc ∂σ(X_abc)/∂X_ijk = Y∇

92

This is a pretty simple derivation, but it shows two things:
1) We can easily do a backward over functions that output high dimensional tensors, but we should sum over all dimensions of the output when we