Autograd Handouts

Differentiate Automatically

An Introduction to Automatic Differentiation

Jonathon Hare

Vision, Learning and Control


University of Southampton

Much of this material is based on this blog post:
https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation



What is Automatic Differentiation (AD)?

To solve optimisation problems using gradient methods we need to compute the gradients (derivatives) of the objective with respect to the parameters.

In neural nets we’re talking about the gradients of the loss function L with respect to the parameters θ: ∇θL = ∂L/∂θ.

AD is important - it’s been suggested that “Differentiable programming” could be the term that ultimately replaces deep learning¹.

¹ http://forums.fast.ai/t/differentiable-programming-is-this-why-we-switched-to-pytorch/9589/5

What is Automatic Differentiation (AD)?


Computing Derivatives

There are three ways to compute derivatives:

1. Symbolically differentiate the function with respect to its parameters, by hand or using a CAS.
   Problems: static - can’t “differentiate algorithms”.
2. Make estimates using finite differences.
   Problems: numerical errors - will compound in deep nets (illustrated in the sketch below).
3. Use Automatic Differentiation.
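To make the finite-differences pitfall concrete, here is a minimal sketch (not from the handout; the function f and step size h are arbitrary illustrative choices) comparing a central-difference estimate with the exact derivative:

import math

def f(x):
    return math.sin(x) ** 2

x = 1.0
h = 1e-5

# central finite-difference estimate of f'(x)
numeric = (f(x + h) - f(x - h)) / (2 * h)

# exact derivative: f'(x) = 2 sin(x) cos(x)
exact = 2 * math.sin(x) * math.cos(x)

print(numeric, exact)   # close, but the estimate carries truncation/rounding error,
                        # and such errors compound when derivatives are chained in deep nets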



What is Automatic Differentiation (AD)?

Automatic Differentiation is:

a method to get exact derivatives efficiently, by storing information as you go forward that you can reuse as you go backwards.

It takes code that computes a function and uses that to compute the derivative of that function.

The goal isn’t to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives.


Let’s think about differentiation and programming

Example (Math)        Example (Code)

x = ?                 x = ?
y = ?                 y = ?
a = x y               a = x * y
b = sin(x)            b = sin(x)
z = a + b             z = a + b
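As a concrete, runnable version of the code column (a sketch; the values x = 0.5 and y = 4.2 are the ones reused later in the expression-graph example):

import math

x = 0.5
y = 4.2
a = x * y          # 2.1
b = math.sin(x)    # ≈ 0.4794
z = a + b
print(z)           # ≈ 2.5794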



The Chain Rule of Differentiation

Recall the chain rule for a variable/function z that depends on y, which depends on x:

dz/dx = (dz/dy)(dy/dx)

In general, the chain rule can be expressed as:

∂w/∂t = Σ_{i=1}^{N} (∂w/∂u_i)(∂u_i/∂t) = (∂w/∂u_1)(∂u_1/∂t) + (∂w/∂u_2)(∂u_2/∂t) + ⋯ + (∂w/∂u_N)(∂u_N/∂t)

where w is some output variable, and u_i denotes each input variable that w depends on.


Applying the Chain Rule

Let’s differentiate our previous expression with respect to some yet-to-be-given variable t:

Expression            Derivative
x = ?                 ∂x/∂t = ?
y = ?                 ∂y/∂t = ?
a = x y               ∂a/∂t = x ∂y/∂t + y ∂x/∂t
b = sin(x)            ∂b/∂t = cos(x) ∂x/∂t
z = a + b             ∂z/∂t = ∂a/∂t + ∂b/∂t

If we substitute t = x in the above we’ll have an algorithm for computing ∂z/∂x. To get ∂z/∂y we’d just substitute t = y.



Translating to code I

We could translate the previous expressions back into a program involving differential variables {dx, dy, ...} which represent ∂x/∂t, ∂y/∂t, ... respectively:

dx = ?
dy = ?
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db

What happens to this program if we substitute t = x into the math expression?


Translating to code II

The effect is remarkably simple: to compute ∂z/∂x we just seed the algorithm with dx = 1 and dy = 0.

dx = 1
dy = 0
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db
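A hedged, runnable version of this seeded forward pass (the concrete values are illustrative only); the result matches y + cos(x), the expression for ∂z/∂x we derive later:

import math

x, y = 0.5, 4.2

dx = 1.0
dy = 0.0
a = x * y;         da = y * dx + x * dy
b = math.sin(x);   db = math.cos(x) * dx
z = a + b;         dz = da + db

print(dz, y + math.cos(x))   # both ≈ 5.0776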



Translating to code III

To compute ∂z/∂y we just seed the algorithm with dx = 0 and dy = 1.

dx = 0
dy = 1
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db


Making Rules

We’ve successfully computed the gradients for a specific function, but the process was far from automatic.

We need to formalise a set of rules for translating a program that evaluates an expression into a program that evaluates its derivatives.

We have actually already discovered 3 of these rules:

c = a + b    =>  dc = da + db
c = a * b    =>  dc = b * da + a * db
c = sin(a)   =>  dc = cos(a) * da



More rules

These initial rules:

c = a + b    =>  dc = da + db
c = a * b    =>  dc = b * da + a * db
c = sin(a)   =>  dc = cos(a) * da

can easily be extended further using multivariable calculus:

c = a - b    =>  dc = da - db
c = a / b    =>  dc = da / b - a * db / b**2
c = a ** b   =>  dc = b * a**(b - 1) * da + log(a) * a**b * db
c = cos(a)   =>  dc = -sin(a) * da
c = tan(a)   =>  dc = da / cos(a)**2
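One possible way to hold such rules in code (purely illustrative; this table and its entry names are my own, not part of the handout) is a mapping from each primitive to a function producing the differential of its result:

import math

# hypothetical forward-mode rule table: inputs and their differentials in, differential out
FORWARD_RULES = {
    "add": lambda a, b, da, db: da + db,
    "mul": lambda a, b, da, db: b * da + a * db,
    "div": lambda a, b, da, db: da / b - a * db / b**2,
    "sin": lambda a, da: math.cos(a) * da,
    "cos": lambda a, da: -math.sin(a) * da,
}

# e.g. the rule for c = a * b at a = 0.5, b = 4.2 with da = 1, db = 0
print(FORWARD_RULES["mul"](0.5, 4.2, 1.0, 0.0))   # 4.2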


Forward Mode AD

To translate using the rules we simply replace each primitive operation in the original program by its differential analogue.

The order of computation remains unchanged: if a statement K is evaluated before another statement L, then the differential analogue of K is evaluated before the analogue statement of L.

This is Forward-mode Automatic Differentiation.



Interleaving differential computation
A careful analysis of our original program and its differential analogue shows that it’s possible to interleave the differential calculations with the original ones:

x = ?
dx = ?

y = ?
dy = ?

a = x * y
da = y * dx + x * dy

b = sin(x)
db = cos(x) * dx

z = a + b
dz = da + db

Dual Numbers

This implies that we can keep track of the value and gradient at the same time. We can use a mathematical concept called a “Dual Number” to create a very simple direct implementation of AD.
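A minimal sketch of such a dual-number implementation (my own illustration; the class name and operator choices are assumptions rather than the handout’s code):

import math

class Dual:
    # a value paired with its derivative (the coefficient of the "infinitesimal" part)
    def __init__(self, value, dvalue=0.0):
        self.value = value
        self.dvalue = dvalue

    def __add__(self, other):
        return Dual(self.value + other.value, self.dvalue + other.dvalue)

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    other.value * self.dvalue + self.value * other.dvalue)

def sin(d):
    return Dual(math.sin(d.value), math.cos(d.value) * d.dvalue)

# seed dx = 1, dy = 0 to get dz/dx in a single interleaved pass
x = Dual(0.5, 1.0)
y = Dual(4.2, 0.0)
z = x * y + sin(x)
print(z.value, z.dvalue)   # z ≈ 2.5794 and dz/dx = y + cos(x) ≈ 5.0776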


Reverse Mode AD

Whilst Forward-mode AD is easy to implement, it comes with a very big disadvantage...

For every variable we wish to compute the gradient with respect to, we have to run the complete program again.

This is obviously going to be a problem if we’re talking about the gradients of a function with very many parameters (e.g. a deep network).

A solution is Reverse Mode Automatic Differentiation.



Reversing the Chain Rule

The chain rule is symmetric - this means we can turn the derivatives upside-down:

∂s/∂u = Σ_{i=1}^{N} (∂w_i/∂u)(∂s/∂w_i) = (∂w_1/∂u)(∂s/∂w_1) + (∂w_2/∂u)(∂s/∂w_2) + ⋯ + (∂w_N/∂u)(∂s/∂w_N)

In doing so, we have inverted the input-output role of the variables: u is some input variable, the w_i’s are the output variables that depend on u. s is the yet-to-be-given variable.

In this form, the chain rule can be applied repeatedly to every input variable u (akin to how in forward mode we repeatedly applied it to every w). Therefore, given some s we expect this form of the rule to give us a program to compute both ∂s/∂x and ∂s/∂y in one go...


Reversing the chain rule: Example

Applying ∂s/∂u = Σ_{i=1}^{N} (∂w_i/∂u)(∂s/∂w_i) to our expression:

Expression:
x = ?
y = ?
a = x y
b = sin(x)
z = a + b

Derivatives:
∂s/∂z = ?
∂s/∂b = (∂z/∂b)(∂s/∂z) = ∂s/∂z
∂s/∂a = (∂z/∂a)(∂s/∂z) = ∂s/∂z
∂s/∂y = (∂a/∂y)(∂s/∂a) = x ∂s/∂a
∂s/∂x = (∂a/∂x)(∂s/∂a) + (∂b/∂x)(∂s/∂b)
      = y ∂s/∂a + cos(x) ∂s/∂b
      = (y + cos(x)) ∂s/∂z
Visualising dependencies

Differentiating in reverse can be quite mind-bending: instead of asking what input variables an output depends on, we have to ask what output variables a given input variable can affect.

We can see this visually by drawing a dependency graph of the expression:

[Dependency graph: x and y at the top; x feeds sin (giving b) and, together with y, feeds · (giving a); a and b feed + (giving z).]


Translating to code

Let’s now translate our derivatives into code. As before we replace the derivatives (∂s/∂z, ∂s/∂b, ...) with variables (gz, gb, ...) which we call adjoint variables:

gz = ?
gb = gz
ga = gz
gy = x * ga
gx = y * ga + cos(x) * gb

If we go back to the equations and substitute s = z, the last two equations become the gradients we want. In the above program, this is equivalent to setting gz = 1.

This means that to get both gradients ∂z/∂x and ∂z/∂y we only need to run the program once!
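A hedged, runnable version of the adjoint program, seeded with gz = 1 (the values x = 0.5, y = 4.2 are illustrative only):

import math

x, y = 0.5, 4.2

# forward pass (values)
a = x * y
b = math.sin(x)
z = a + b

# reverse pass (adjoints), seeded at the output
gz = 1.0
gb = gz
ga = gz
gy = x * ga
gx = y * ga + math.cos(x) * gb

print(gx, gy)   # gx = y + cos(x) ≈ 5.0776 and gy = x = 0.5, from a single pass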



Limitations of Reverse Mode AD

If we have multiple output variables, we’d have to run the program for each one (with different seeds on the output variables)². For example:

z = 2x + sin x
v = 4x + cos x

We can’t just interleave the derivative calculations (since they all appear to be in reverse)... How can we make this automatic?

² there are ways to avoid this limitation...
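A small sketch of the limitation (my own code, using the slide’s two-output example): each output needs its own seeded reverse pass, so computing both dz/dx and dv/dx means running the reverse program twice.

import math

def reverse_pass(x, gz, gv):
    # adjoint program for z = 2x + sin(x) and v = 4x + cos(x):
    # accumulate the contributions of both outputs into gx, given the seeds gz and gv
    gx = 2 * gz + math.cos(x) * gz + 4 * gv - math.sin(x) * gv
    return gx

x = 0.5
print(reverse_pass(x, gz=1.0, gv=0.0))   # dz/dx = 2 + cos(x)
print(reverse_pass(x, gz=0.0, gv=1.0))   # dv/dx = 4 - sin(x); a second run is needed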

Implementing Reverse Mode AD

There are two ways to implement Reverse AD:

1. We can parse the original program and generate the adjoint program that calculates the derivatives.
   Potentially hard to do.
   Static, so can only be used to differentiate algorithms that have parameters predefined.
   But, efficient (lots of opportunities for optimisation).
2. We can make a dynamic implementation by constructing a graph that represents the original expression as the program runs.



Constructing an expression graph

The “roots” of the graph are the independent variables x and y. The goal is to get something akin to the graph we saw earlier (x and y feeding sin and ·, which in turn feed + to give z). Constructing these nodes is as simple as creating an object:

class Var:
    def __init__(self, value):
        self.value = value
        self.children = []
    ...

...
x = Var(0.5)
y = Var(4.2)

Each Var node can have children, which are the nodes that depend directly on that node. The children allow nodes to link together in a Directed Acyclic Graph.

Building expressions
By default, nodes do not have any children. As expressions are created, each new expression u registers itself as a child of each of its dependencies w_i, together with its weight ∂u/∂w_i, which will be used to compute gradients:
class Var:
    ...
    def __mul__(self, other):
        z = Var(self.value * other.value)

        # weight = dz/dself = other.value
        self.children.append((other.value, z))

        # weight = dz/dother = self.value
        other.children.append((self.value, z))
        return z
    ...

...
# "a" is a new Var that is a child of both x and y
a = x * y
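The slide only shows __mul__; the other primitives in our running expression, addition and sin, would follow exactly the same pattern. A sketch of what they might look like (my additions, not code from the handout):

import math

class Var:
    ...
    def __add__(self, other):
        z = Var(self.value + other.value)
        # weight = dz/dself = dz/dother = 1
        self.children.append((1.0, z))
        other.children.append((1.0, z))
        return z

def sin(v):
    z = Var(math.sin(v.value))
    # weight = dz/dv = cos(v.value)
    v.children.append((math.cos(v.value), z))
    return z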
Computing gradients
Finally, to get the gradients we need to propagate the derivatives. To avoid unnecessarily traversing the tree multiple times we will cache the derivative of a node in an attribute grad_value:

class Var:
    def __init__(self):
        ...
        self.grad_value = None

    def grad(self):
        if self.grad_value is None:
            # calculate derivative using chain rule
            self.grad_value = sum(weight * var.grad()
                                  for weight, var in self.children)
        return self.grad_value
    ...

...
a.grad_value = 1.0
print("da/dx = {}".format(x.grad()))


Aside: Optimising Reverse Mode AD

The Reverse AD approach we’ve outlined is not very space efficient.

One way to get around this is to avoid storing the children directly and instead store indices in an auxiliary data structure called a Wengert list or tape.

Another interesting approach to memory reduction is to trade off computation for the memory used by the caches. The Count-Trailing-Zeros (CTZ) approach does just this³.

But, in reality, memory is relatively cheap if managed well...

³ Andreas Griewank (1992). Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation. Optimization Methods and Software, 1:1, 35-54. DOI: 10.1080/10556789208805505
AD in the PyTorch autograd package
PyTorch’s AD is remarkably similar to the one we’ve just built (a short usage sketch follows this list):

it eschews the use of a tape
it builds the computation graph as it runs (recording explicit Function objects as the children of Tensors rather than grouping everything into Var objects)
it caches the gradients in the same way we do (in the grad attribute) - hence the need to call zero_grad() when recomputing the gradients of the same graph after a round of backprop.
PyTorch does some clever memory management to work well in a reference-counted regime and aggressively frees values that are no longer needed.
The backend is actually mostly written in C++, so it’s fast, and can be multi-threaded (avoids problems with the GIL).
It allows easy “turning off” of gradient computations through requires_grad.
In-place operations which invalidate data needed to compute derivatives will cause runtime errors, as will variable aliasing...
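A minimal PyTorch usage sketch of the same running example (the values are illustrative only):

import torch

x = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(4.2, requires_grad=True)

z = x * y + torch.sin(x)   # the graph is recorded as these operations run
z.backward()               # reverse-mode pass, seeded at the scalar output

print(x.grad)   # tensor(5.0776) = y + cos(x)
print(y.grad)   # tensor(0.5000) = x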