Autograd Handouts

Differentiate Automatically

An Introduction to Automatic Differentiation

Jonathon Hare

Vision, Learning and Control


University of Southampton

Much of this material is based on this blog post:
https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation



What is Automatic Differentiation (AD)?

To solve optimisation problems using gradient methods we need to compute the gradients (derivatives) of the objective with respect to the parameters.

In neural nets we’re talking about the gradients of the loss function L with respect to the parameters θ: ∇θL = ∂L/∂θ.

AD is important - it’s been suggested that “Differentiable programming” could be the term that ultimately replaces deep learning¹.

¹ http://forums.fast.ai/t/differentiable-programming-is-this-why-we-switched-to-pytorch/9589/5

What is Automatic Differentiation (AD)?


Computing Derivatives

There are three ways to compute derivatives:

1. Symbolically differentiate the function with respect to its parameters, by hand or using a CAS.
   Problems: static - can’t “differentiate algorithms”.
2. Make estimates using finite differences.
   Problems: numerical errors - will compound in deep nets (illustrated in the sketch below).
3. Use Automatic Differentiation.
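To make the finite-differences pitfall concrete, here is a minimal sketch (not from the handout; the function f and step size h are arbitrary illustrative choices) comparing a central-difference estimate with the exact derivative:

import math

def f(x):
    return math.sin(x) ** 2

x = 1.0
h = 1e-5

# central finite-difference estimate of f'(x)
numeric = (f(x + h) - f(x - h)) / (2 * h)

# exact derivative: f'(x) = 2 sin(x) cos(x)
exact = 2 * math.sin(x) * math.cos(x)

print(numeric, exact)   # close, but the estimate carries truncation/rounding error,
                        # and such errors compound when derivatives are chained in deep nets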



What is Automatic Differentiation (AD)?

Automatic Differentiation is:

a method to get exact derivatives efficiently, by storing information as you go forward that you can reuse as you go backwards.

It takes code that computes a function and uses that to compute the derivative of that function.

The goal isn’t to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives.


Let’s think about differentiation and programming

Example (Math)        Example (Code)

x = ?                 x = ?
y = ?                 y = ?
a = x y               a = x * y
b = sin(x)            b = sin(x)
z = a + b             z = a + b
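As a concrete, runnable version of the code column (a sketch; the values x = 0.5 and y = 4.2 are the ones reused later in the expression-graph example):

import math

x = 0.5
y = 4.2
a = x * y          # 2.1
b = math.sin(x)    # ≈ 0.4794
z = a + b
print(z)           # ≈ 2.5794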



The Chain Rule of Differentiation

Recall the chain rule for a variable/function z that depends on y, which depends on x:

dz/dx = (dz/dy)(dy/dx)

In general, the chain rule can be expressed as:

∂w/∂t = Σ_{i=1}^{N} (∂w/∂u_i)(∂u_i/∂t) = (∂w/∂u_1)(∂u_1/∂t) + (∂w/∂u_2)(∂u_2/∂t) + ⋯ + (∂w/∂u_N)(∂u_N/∂t)

where w is some output variable, and u_i denotes each input variable that w depends on.


Applying the Chain Rule

Let’s differentiate our previous expression with respect to some yet-to-be-given variable t:

Expression            Derivative
x = ?                 ∂x/∂t = ?
y = ?                 ∂y/∂t = ?
a = x y               ∂a/∂t = x ∂y/∂t + y ∂x/∂t
b = sin(x)            ∂b/∂t = cos(x) ∂x/∂t
z = a + b             ∂z/∂t = ∂a/∂t + ∂b/∂t

If we substitute t = x in the above we’ll have an algorithm for computing ∂z/∂x. To get ∂z/∂y we’d just substitute t = y.



Translating to code I

We could translate the previous expressions back into a program involving differential variables {dx, dy, ...} which represent ∂x/∂t, ∂y/∂t, ... respectively:

dx = ?
dy = ?
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db

What happens to this program if we substitute t = x into the math expression?


Translating to code II

The effect is remarkably simple: to compute ∂z/∂x we just seed the algorithm with dx = 1 and dy = 0.

dx = 1
dy = 0
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db
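A hedged, runnable version of this seeded forward pass (the concrete values are illustrative only); the result matches y + cos(x), the expression for ∂z/∂x we derive later:

import math

x, y = 0.5, 4.2

dx = 1.0
dy = 0.0
a = x * y;         da = y * dx + x * dy
b = math.sin(x);   db = math.cos(x) * dx
z = a + b;         dz = da + db

print(dz, y + math.cos(x))   # both ≈ 5.0776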



Translating to code III

To compute ∂z/∂y we just seed the algorithm with dx = 0 and dy = 1.

dx = 0
dy = 1
da = y * dx + x * dy
db = cos(x) * dx
dz = da + db


Making Rules

We’ve successfully computed the gradients for a specific function, but the process was far from automatic.

We need to formalise a set of rules for translating a program that evaluates an expression into a program that evaluates its derivatives.

We have actually already discovered 3 of these rules:

c = a + b    =>  dc = da + db
c = a * b    =>  dc = b * da + a * db
c = sin(a)   =>  dc = cos(a) * da



More rules

These initial rules:

c = a + b    =>  dc = da + db
c = a * b    =>  dc = b * da + a * db
c = sin(a)   =>  dc = cos(a) * da

can easily be extended further using multivariable calculus:

c = a - b    =>  dc = da - db
c = a / b    =>  dc = da / b - a * db / b**2
c = a ** b   =>  dc = b * a**(b - 1) * da + log(a) * a**b * db
c = cos(a)   =>  dc = -sin(a) * da
c = tan(a)   =>  dc = da / cos(a)**2
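One possible way to hold such rules in code (purely illustrative; this table and its entry names are my own, not part of the handout) is a mapping from each primitive to a function producing the differential of its result:

import math

# hypothetical forward-mode rule table: inputs and their differentials in, differential out
FORWARD_RULES = {
    "add": lambda a, b, da, db: da + db,
    "mul": lambda a, b, da, db: b * da + a * db,
    "div": lambda a, b, da, db: da / b - a * db / b**2,
    "sin": lambda a, da: math.cos(a) * da,
    "cos": lambda a, da: -math.sin(a) * da,
}

# e.g. the rule for c = a * b at a = 0.5, b = 4.2 with da = 1, db = 0
print(FORWARD_RULES["mul"](0.5, 4.2, 1.0, 0.0))   # 4.2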


Forward Mode AD

To translate using the rules we simply replace each primitive operation in the original program by its differential analogue.

The order of computation remains unchanged: if a statement K is evaluated before another statement L, then the differential analogue of K is evaluated before the analogue statement of L.

This is Forward-mode Automatic Differentiation.



Interleaving differential computation
A careful analysis of our original program and its differential analogue shows that it’s possible to interleave the differential calculations with the original ones:

x = ?
dx = ?

y = ?
dy = ?

a = x * y
da = y * dx + x * dy

b = sin(x)
db = cos(x) * dx

z = a + b
dz = da + db

Dual Numbers

This implies that we can keep track of the value and gradient at the same time. We can use a mathematical concept called a “Dual Number” to create a very simple direct implementation of AD.
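A minimal sketch of such a dual-number implementation (my own illustration; the class name and operator choices are assumptions rather than the handout’s code):

import math

class Dual:
    # a value paired with its derivative (the coefficient of the "infinitesimal" part)
    def __init__(self, value, dvalue=0.0):
        self.value = value
        self.dvalue = dvalue

    def __add__(self, other):
        return Dual(self.value + other.value, self.dvalue + other.dvalue)

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    other.value * self.dvalue + self.value * other.dvalue)

def sin(d):
    return Dual(math.sin(d.value), math.cos(d.value) * d.dvalue)

# seed dx = 1, dy = 0 to get dz/dx in a single interleaved pass
x = Dual(0.5, 1.0)
y = Dual(4.2, 0.0)
z = x * y + sin(x)
print(z.value, z.dvalue)   # z ≈ 2.5794 and dz/dx = y + cos(x) ≈ 5.0776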


Reverse Mode AD

Whilst Forward-mode AD is easy to implement, it comes with a very big disadvantage...

For every variable we wish to compute the gradient with respect to, we have to run the complete program again.

This is obviously going to be a problem if we’re talking about the gradients of a function with very many parameters (e.g. a deep network).

A solution is Reverse Mode Automatic Differentiation.



Reversing the Chain Rule

The chain rule is symmetric - this means we can turn the derivatives upside-down:

∂s/∂u = Σ_{i=1}^{N} (∂w_i/∂u)(∂s/∂w_i) = (∂w_1/∂u)(∂s/∂w_1) + (∂w_2/∂u)(∂s/∂w_2) + ⋯ + (∂w_N/∂u)(∂s/∂w_N)

In doing so, we have inverted the input-output role of the variables: u is some input variable, the w_i’s are the output variables that depend on u. s is the yet-to-be-given variable.

In this form, the chain rule can be applied repeatedly to every input variable u (akin to how in forward mode we repeatedly applied it to every w). Therefore, given some s we expect this form of the rule to give us a program to compute both ∂s/∂x and ∂s/∂y in one go...


Reversing the chain rule: Example

Applying ∂s/∂u = Σ_{i=1}^{N} (∂w_i/∂u)(∂s/∂w_i) to our expression:

Expression:
x = ?
y = ?
a = x y
b = sin(x)
z = a + b

Derivatives:
∂s/∂z = ?
∂s/∂b = (∂z/∂b)(∂s/∂z) = ∂s/∂z
∂s/∂a = (∂z/∂a)(∂s/∂z) = ∂s/∂z
∂s/∂y = (∂a/∂y)(∂s/∂a) = x ∂s/∂a
∂s/∂x = (∂a/∂x)(∂s/∂a) + (∂b/∂x)(∂s/∂b)
      = y ∂s/∂a + cos(x) ∂s/∂b
      = (y + cos(x)) ∂s/∂z
Visualising dependencies

Differentiating in reverse can be quite mind-bending: instead of asking what input variables an output depends on, we have to ask what output variables a given input variable can affect.

We can see this visually by drawing a dependency graph of the expression:

[Dependency graph: x and y at the top; x feeds sin (giving b) and, together with y, feeds · (giving a); a and b feed + (giving z).]


Translating to code

Let’s now translate our derivatives into code. As before we replace the derivatives (∂s/∂z, ∂s/∂b, ...) with variables (gz, gb, ...) which we call adjoint variables:

gz = ?
gb = gz
ga = gz
gy = x * ga
gx = y * ga + cos(x) * gb

If we go back to the equations and substitute s = z, the last two equations become the gradients we want. In the above program, this is equivalent to setting gz = 1.

This means that to get both gradients ∂z/∂x and ∂z/∂y we only need to run the program once!
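A hedged, runnable version of the adjoint program, seeded with gz = 1 (the values x = 0.5, y = 4.2 are illustrative only):

import math

x, y = 0.5, 4.2

# forward pass (values)
a = x * y
b = math.sin(x)
z = a + b

# reverse pass (adjoints), seeded at the output
gz = 1.0
gb = gz
ga = gz
gy = x * ga
gx = y * ga + math.cos(x) * gb

print(gx, gy)   # gx = y + cos(x) ≈ 5.0776 and gy = x = 0.5, from a single pass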



Limitations of Reverse Mode AD

If we have multiple output variables, we’d have to run the program for each one (with different seeds on the output variables)². For example:

z = 2x + sin x
v = 4x + cos x

We can’t just interleave the derivative calculations (since they all appear to be in reverse)... How can we make this automatic?

² there are ways to avoid this limitation...
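A small sketch of the limitation (my own code, using the slide’s two-output example): each output needs its own seeded reverse pass, so computing both dz/dx and dv/dx means running the reverse program twice.

import math

def reverse_pass(x, gz, gv):
    # adjoint program for z = 2x + sin(x) and v = 4x + cos(x):
    # accumulate the contributions of both outputs into gx, given the seeds gz and gv
    gx = 2 * gz + math.cos(x) * gz + 4 * gv - math.sin(x) * gv
    return gx

x = 0.5
print(reverse_pass(x, gz=1.0, gv=0.0))   # dz/dx = 2 + cos(x)
print(reverse_pass(x, gz=0.0, gv=1.0))   # dv/dx = 4 - sin(x); a second run is needed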

Implementing Reverse Mode AD

There are two ways to implement Reverse AD:

1. We can parse the original program and generate the adjoint program that calculates the derivatives.
   Potentially hard to do.
   Static, so can only be used to differentiate algorithms that have parameters predefined.
   But, efficient (lots of opportunities for optimisation).
2. We can make a dynamic implementation by constructing a graph that represents the original expression as the program runs.



Constructing an expression graph

The “roots” of the graph are the independent variables x and y. The goal is to get something akin to the graph we saw earlier (x and y feeding sin and ·, which in turn feed + to give z). Constructing these nodes is as simple as creating an object:

class Var:
    def __init__(self, value):
        self.value = value
        self.children = []
    ...

...
x = Var(0.5)
y = Var(4.2)

Each Var node can have children, which are the nodes that depend directly on that node. The children allow nodes to link together in a Directed Acyclic Graph.

Building expressions
By default, nodes do not have any children. As expressions are created, each new expression u registers itself as a child of each of its dependencies w_i, together with its weight ∂u/∂w_i, which will be used to compute gradients:
class Var:
    ...
    def __mul__(self, other):
        z = Var(self.value * other.value)

        # weight = dz/dself = other.value
        self.children.append((other.value, z))

        # weight = dz/dother = self.value
        other.children.append((self.value, z))
        return z
    ...

...
# "a" is a new Var that is a child of both x and y
a = x * y
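The slide only shows __mul__; the other primitives in our running expression, addition and sin, would follow exactly the same pattern. A sketch of what they might look like (my additions, not code from the handout):

import math

class Var:
    ...
    def __add__(self, other):
        z = Var(self.value + other.value)
        # weight = dz/dself = dz/dother = 1
        self.children.append((1.0, z))
        other.children.append((1.0, z))
        return z

def sin(v):
    z = Var(math.sin(v.value))
    # weight = dz/dv = cos(v.value)
    v.children.append((math.cos(v.value), z))
    return z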
Computing gradients
Finally, to get the gradients we need to propagate the derivatives. To avoid unnecessarily traversing the tree multiple times we will cache the derivative of a node in an attribute grad_value:

class Var:
    def __init__(self):
        ...
        self.grad_value = None

    def grad(self):
        if self.grad_value is None:
            # calculate derivative using chain rule
            self.grad_value = sum(weight * var.grad()
                                  for weight, var in self.children)
        return self.grad_value
    ...

...
a.grad_value = 1.0
print("da/dx = {}".format(x.grad()))


Aside: Optimising Reverse Mode AD

The Reverse AD approach we’ve outlined is not very space efficient.

One way to get around this is to avoid storing the children directly and instead store indices in an auxiliary data structure called a Wengert list or tape.

Another interesting approach to memory reduction is to trade off computation for the memory used by the caches. The Count-Trailing-Zeros (CTZ) approach does just this³.

But, in reality, memory is relatively cheap if managed well...

³ Andreas Griewank (1992). Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation. Optimization Methods and Software, 1:1, 35-54. DOI: 10.1080/10556789208805505
AD in the PyTorch autograd package
PyTorch’s AD is remarkably similar to the one we’ve just built (a short usage sketch follows this list):

it eschews the use of a tape
it builds the computation graph as it runs (recording explicit Function objects as the children of Tensors rather than grouping everything into Var objects)
it caches the gradients in the same way we do (in the grad attribute) - hence the need to call zero_grad() when recomputing the gradients of the same graph after a round of backprop.
PyTorch does some clever memory management to work well in a reference-counted regime and aggressively frees values that are no longer needed.
The backend is actually mostly written in C++, so it’s fast, and can be multi-threaded (avoids problems with the GIL).
It allows easy “turning off” of gradient computations through requires_grad.
In-place operations which invalidate data needed to compute derivatives will cause runtime errors, as will variable aliasing...
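A minimal PyTorch usage sketch of the same running example (the values are illustrative only):

import torch

x = torch.tensor(0.5, requires_grad=True)
y = torch.tensor(4.2, requires_grad=True)

z = x * y + torch.sin(x)   # the graph is recorded as these operations run
z.backward()               # reverse-mode pass, seeded at the scalar output

print(x.grad)   # tensor(5.0776) = y + cos(x)
print(y.grad)   # tensor(0.5000) = x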