Autograd Handouts
Jonathon Hare
Expression      Derivative (with respect to t)
x = ?           ∂x/∂t = ?
y = ?           ∂y/∂t = ?
a = x y         ∂a/∂t = x ∂y/∂t + y ∂x/∂t
b = sin(x)      ∂b/∂t = cos(x) ∂x/∂t
z = a + b       ∂z/∂t = ∂a/∂t + ∂b/∂t
If we substitute t = x in the above we’ll have an algorithm for computing ∂z/∂x. To get ∂z/∂y we’d just substitute t = y.
Translating to code II
The effect is remarkably simple: to compute ∂z/∂x we just seed the algorithm with dx = 1 and dy = 0.

dx = 1
dy = 0
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db

To compute ∂z/∂y we just seed the algorithm with dx = 0 and dy = 1.

dx = 0
dy = 1
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db
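As a quick check, here is a minimal runnable version of the first program; the concrete inputs x = 0.5 and y = 4.2 are assumed purely for illustration.

import math

x, y = 0.5, 4.2                  # example inputs (values assumed for illustration)
a = x * y
b = math.sin(x)
z = a + b

# tangent (derivative) pass, seeded for dz/dx: dx = 1, dy = 0
dx, dy = 1.0, 0.0
da = y * dx + x * dy
db = math.cos(x) * dx
dz = da + db

print(dz, y + math.cos(x))       # both print y + cos(x), i.e. ∂z/∂x

Re-running the derivative lines with dx, dy = 0.0, 1.0 instead prints x, i.e. ∂z/∂y.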
Reverse Mode AD
The chain rule is symmetric — this means we can turn the derivatives
upside-down:
∂s/∂u = Σ_i^N (∂w_i/∂u)(∂s/∂w_i) = (∂w_1/∂u)(∂s/∂w_1) + (∂w_2/∂u)(∂s/∂w_2) + ··· + (∂w_N/∂u)(∂s/∂w_N)
In this form, the chain rule can be applied repeatedly to every input variable u (akin to how in forward mode we repeatedly applied it to every w). Therefore, given some s we expect this form of the rule to give us a program to compute both ∂s/∂x and ∂s/∂y in one go...
Applying ∂s/∂u = Σ_i^N (∂w_i/∂u)(∂s/∂w_i) to the expressions

x = ?
y = ?
a = x y
b = sin(x)
z = a + b

and working backwards from the output gives:

∂s/∂z = ?
∂s/∂b = (∂z/∂b)(∂s/∂z) = ∂s/∂z
∂s/∂a = (∂z/∂a)(∂s/∂z) = ∂s/∂z
∂s/∂y = (∂a/∂y)(∂s/∂a) = x ∂s/∂a
∂s/∂x = (∂a/∂x)(∂s/∂a) + (∂b/∂x)(∂s/∂b) = y ∂s/∂a + cos(x) ∂s/∂b = (y + cos(x)) ∂s/∂z
Visualising dependencies
[Computation graph: inputs x and y feed the · node giving a = x y and the sin node giving b = sin(x); a and b feed the + node giving z = a + b.]
Translating to code
Let’s now translate our derivatives into code. As before we replace the derivatives (∂s/∂z, ∂s/∂b, ...) with variables (gz, gb, ...) which we call adjoint variables:
gz = ?
gb = gz
ga = gz
gy = x * ga
gx = y * ga + cos(x) * gb
If we go back to the equations and substitute s = z we would obtain the
gradient in the last two equations. In the above program, this is equivalent
to setting gz = 1.
This means that to get both gradients ∂z/∂x and ∂z/∂y we only
need to run the program once!
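For concreteness, here is a minimal runnable version of the adjoint program with the seed gz = 1; the inputs x = 0.5 and y = 4.2 are assumed purely for illustration.

import math

x, y = 0.5, 4.2                  # example inputs (values assumed for illustration)
a = x * y
b = math.sin(x)
z = a + b

# reverse (adjoint) pass, seeded with gz = 1 (i.e. s = z)
gz = 1.0
gb = gz
ga = gz
gy = x * ga                      # ∂z/∂y = x
gx = y * ga + math.cos(x) * gb   # ∂z/∂x = y + cos(x)

print(gx, gy)                    # both gradients from a single reverse pass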
If we have multiple output variables, we’d have to run the program for each one (with different seeds on the output variables)². For example:

z = 2x + sin x
v = 4x + cos x

² there are ways to avoid this limitation...
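A minimal sketch of the two seeded reverse passes for this pair of outputs follows; the input value x = 0.5 and the flattened adjoint expression for gx are assumptions made for illustration.

import math

x = 0.5                          # example input (value assumed for illustration)
# outputs: z = 2x + sin(x), v = 4x + cos(x)

# first reverse pass, seeded for z (gz = 1, gv = 0): gives dz/dx
gz, gv = 1.0, 0.0
gx = 2 * gz + math.cos(x) * gz + 4 * gv - math.sin(x) * gv
print(gx)                        # 2 + cos(x)

# second reverse pass, seeded for v (gz = 0, gv = 1): gives dv/dx
gz, gv = 0.0, 1.0
gx = 2 * gz + math.cos(x) * gz + 4 * gv - math.sin(x) * gv
print(gx)                        # 4 - sin(x)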
Building expressions
By default, nodes do not have any children. As expressions are created, each new expression w_i registers itself as a child of each of its dependencies u, together with its weight ∂w_i/∂u, which will later be used to compute gradients:
class Var:
    ...
    def __mul__(self, other):
        z = Var(self.value * other.value)
        self.children.append((other.value, z))   # weight ∂z/∂self = other.value
        other.children.append((self.value, z))   # weight ∂z/∂other = self.value
        return z
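Putting the pieces together, one possible self-contained sketch of such a class is shown below; the grad_value cache, the grad() method and the sin helper are assumptions made for illustration, and the lecture's full implementation may differ in detail.

import math

class Var:
    def __init__(self, value):
        self.value = value
        self.children = []       # list of (weight, child) pairs
        self.grad_value = None   # cached adjoint, cf. the grad attribute

    def __mul__(self, other):
        z = Var(self.value * other.value)
        self.children.append((other.value, z))   # ∂z/∂self = other.value
        other.children.append((self.value, z))   # ∂z/∂other = self.value
        return z

    def __add__(self, other):
        z = Var(self.value + other.value)
        self.children.append((1.0, z))           # ∂z/∂self = 1
        other.children.append((1.0, z))          # ∂z/∂other = 1
        return z

    def grad(self):
        # adjoint of this node: sum over children of weight * child adjoint
        if self.grad_value is None:
            self.grad_value = sum(w * c.grad() for w, c in self.children)
        return self.grad_value

def sin(v):
    z = Var(math.sin(v.value))
    v.children.append((math.cos(v.value), z))    # ∂z/∂v = cos(v)
    return z

# usage on the running example z = x y + sin(x)
x, y = Var(0.5), Var(4.2)
z = x * y + sin(x)
z.grad_value = 1.0                               # seed the output (gz = 1)
print(x.grad(), y.value + math.cos(x.value))     # ∂z/∂x = y + cos(x)
print(y.grad(), x.value)                         # ∂z/∂y = x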
AD in the PyTorch autograd package
PyTorch’s AD is remarkably similar to the one we’ve just built (a short usage sketch follows the list):
- it eschews the use of a tape
- it builds the computation graph as it runs (recording explicit Function objects as the children of Tensors rather than grouping everything into Var objects)
- it caches the gradients in the same way we do (in the grad attribute), hence the need to call zero_grad() when recomputing the gradients of the same graph after a round of backprop.
- PyTorch does some clever memory management to work well in a reference-counted regime and aggressively frees values that are no longer needed.
- The backend is actually mostly written in C++, so it’s fast, and can be multi-threaded (avoiding problems with the GIL).
- It allows easy “turning off” of gradient computations through requires_grad.
- In-place operations which invalidate data needed to compute derivatives will cause runtime errors, as will variable aliasing...
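As a brief illustration of these points, the running z = x y + sin(x) example in PyTorch might look like the following; the concrete input values are assumed.

import torch

x = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(4.2, requires_grad=True)
z = x * y + torch.sin(x)   # the graph is built as this line runs

z.backward()               # reverse pass, implicitly seeded with dz/dz = 1
print(x.grad)              # y + cos(x)
print(y.grad)              # x

# gradients accumulate in .grad, so zero them before another backward pass
x.grad.zero_()
y.grad.zero_()

with torch.no_grad():      # locally turn off gradient tracking
    w = x * 2.0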