
Automatic Differentiation (1)

Slides Prepared By:

Atılım Güneş Baydin


[email protected]
Outline

This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, Jacobian, etc.)
- How do we compute derivatives?
- Automatic differentiation

Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)
Derivatives and machine learning
Derivatives in machine learning
"Backprop" and gradient descent are at the core of all recent advances

Computer vision
- Top-5 error rate for ImageNet (NVIDIA devblog)
- Faster R-CNN (Ren et al., 2015)
- NVIDIA DRIVE PX 2 segmentation

Speech recognition/synthesis
- Word error rates (Huang et al., 2014)

Machine translation
- Google Neural Machine Translation system (GNMT)
Derivatives in machine learning
“Backprop” and gradient descent are at the core of all recent advances

Probabilistic programming (and modeling)
- Edward (2016)
- Pyro (2017)
- ProbTorch (2017)
- TensorFlow Probability (2018)

- Variational inference
- "Neural" density estimation
- Transformed distributions via bijectors
- Normalizing flows (Rezende & Mohamed, 2015)
- Masked autoregressive flows (Papamakarios et al., 2017)
Derivatives in machine learning
At the core of all: differentiable functions (programs) whose parameters are
tuned by gradient-based optimization

(Ruder, 2017) http://ruder.io/optimizing-gradient-descent/
Automatic differentiation
Execute differentiable functions (programs) via automatic differentiation

A word on naming:
- Differentiable programming, a generalization of deep learning (Olah, LeCun)
“Neural networks are just a class of differentiable functions”
- Automatic differentiation
- Algorithmic differentiation
- AD
- Autodiff
- Algodiff
- Autograd
Also remember:
- Backprop
- Backpropagation (backward propagation of errors)
Essential concepts refresher
Derivative
Function of a real variable, f: ℝ → ℝ

The sensitivity of the function value (dependent variable) w.r.t. a change in its argument (independent variable), i.e., the instantaneous rate of change:

\frac{dy}{dx} = f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

Notation: Leibniz dy/dx (c. 1675), Lagrange f'(x), Newton ẏ (c. 1665)
Derivative
Function of a real variable

Differentiation rules (sum, product, chain rule, etc.; around 15 such rules) let us derive the derivative of a composite expression from the derivatives of its parts.

Note: the derivative is a linear operator, a.k.a. a higher-order function in programming languages: it takes a function and returns a function.
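As a toy illustration of this higher-order-function view (not from the original slides; the name derivative and the step size h are illustrative, and the result is only a numerical approximation rather than the exact derivative operator), a sketch in Python:

import math

def derivative(f, h=1e-6):
    """Higher-order function: takes a function f and returns (an approximation of) f'."""
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)

dsin = derivative(math.sin)
print(dsin(0.0))  # ~1.0, since d/dx sin(x) = cos(x) and cos(0) = 1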
Partial derivative
Function of several real variables, f: ℝⁿ → ℝ

A derivative w.r.t. one independent variable, with the others held constant:

\frac{\partial f}{\partial x_i}

The symbol ∂ is read "del" (or "partial").
Partial derivative
Function of several real variables

The gradient, given f: ℝⁿ → ℝ, is the vector of all partial derivatives

\nabla f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)

and points in the direction with the largest rate of change.

The symbol ∇ is read "nabla" or "del". Nabla is the higher-order function that maps f to ∇f.


Total derivative
Function of several real variables

The derivative w.r.t. all variables (independent & dependent): consider all partial derivatives simultaneously and accumulate all direct and indirect contributions.
(Important: this will be useful later.)
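As a reminder of the standard form (the slide's own formula is the usual chain-rule statement), for y = f(x_1, ..., x_n) where each x_i may itself depend on a variable t:

\frac{dy}{dt} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{dx_i}{dt}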
Matrix calculus and machine learning
Extension to multivariable functions:

               | Scalar output            | Vector output
Scalar input   | derivative               |
Vector input   | gradient (scalar field)  | Jacobian (vector field)

In machine learning, we construct (deep) compositions of
- vector-to-vector functions f: ℝⁿ → ℝᵐ, e.g., a neural network
- scalar-valued functions f: ℝⁿ → ℝ, e.g., a loss function, KL divergence, or log joint probability
Matrix calculus and machine learning

And many, many more rules.

Generalization to tensors (multi-dimensional arrays) for efficient batching, handling of sequences, channels in convolutions, etc.
Matrix calculus and machine learning
Finally, two constructs relevant to machine learning: the Jacobian and the Hessian.
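The slide's formulas are the standard definitions: for f: ℝⁿ → ℝᵐ with outputs y_1, ..., y_m, the Jacobian collects all first partial derivatives, and for a scalar-valued f the Hessian collects all second partial derivatives:

J_f = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix} \in \mathbb{R}^{m \times n},
\qquad
H_f = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix} \in \mathbb{R}^{n \times n}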
How to compute derivatives
Derivatives as code

We can compute the derivatives not just of mathematical functions, but of general programs (with control flow).
Manual
You can still see papers that report pages of hand-derived gradient expressions.

Analytic derivatives are needed for theoretical insight:
- analytic solutions, proofs
- mathematical analysis, e.g., stability of fixed points

They are unnecessary when we just need derivative evaluations for optimization.
Symbolic differentiation
Symbolic computation with Mathematica, Maple, Maxima, and deep learning frameworks such as Theano.

Problem: expression swell (the derivative expression can grow much larger than the original expression).
Mitigation: graph optimization (e.g., in Theano).
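A quick, hedged sketch of expression swell (not from the slides; it assumes SymPy is available, and the logistic map is just a convenient toy function): symbolically differentiating a few iterations already yields an expression far larger than the function itself.

import sympy as sp

x = sp.symbols('x')
expr = x
for _ in range(4):
    expr = 4 * expr * (1 - expr)   # iterate the logistic map symbolically

d = sp.diff(expr, x)               # symbolic derivative
print(sp.count_ops(expr), sp.count_ops(sp.expand(d)))  # the derivative has far more operations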
Symbolic differentiation
Problem: only applicable to closed-form mathematical expressions.

You can find the derivative of a closed-form expression, but not of an arbitrary program that computes the same value.

Symbolic graph builders such as Theano and TensorFlow have limited, unintuitive support for control flow, loops, and recursion.
Numerical differentiation
Finite difference approximation of ∇f, for f: ℝⁿ → ℝ:

\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x})}{h}

Problem: f needs to be evaluated n times, once with each standard basis vector e_i.

Problem: we must select the step size h, and we face approximation errors.
Numerical differentiation
Finite difference approximation of ∇f

Better approximations exist:
- Higher-order finite differences, e.g., the center difference:
  \frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x} - h \mathbf{e}_i)}{2h}
- Richardson extrapolation
- Differential quadrature

These increase rapidly in complexity and never completely eliminate the error.

Still, numerical differentiation is extremely useful as a quick check of our gradient implementations (gradient checking is good to learn).
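A minimal sketch of such a gradient check in plain Python (not from the slides; the example function and its hand-derived gradient are purely illustrative), using the center difference:

import math

def numerical_grad(f, x, h=1e-5):
    """Center-difference approximation of the gradient of a scalar-valued f at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

f = lambda x: math.log(x[0] * x[1])        # example function (same as the trace example later)
analytic = lambda x: [1 / x[0], 1 / x[1]]  # hand-derived gradient we want to check

print(numerical_grad(f, [2.0, 3.0]))       # ~[0.5, 0.333]
print(analytic([2.0, 3.0]))                # [0.5, 0.3333...]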
Automatic differentiation
If we don't need analytic derivative expressions, we can evaluate a gradient exactly with only one forward and one reverse execution of the program.

Rumelhart, Hinton & Williams, "Learning representations by back-propagating errors", Nature 323, 533-536 (9 October 1986)

In machine learning, this is known as backpropagation or "backprop".
- Automatic differentiation is more than backprop
- Or: backprop is a specialized reverse-mode automatic differentiation
- We will come back to this shortly
Backprop or automatic differentiation?
1960s
- Precursors of backpropagation: Kelley (1960); Bryson (1961); Pontryagin et al. (1961); Dreyfus (1962)
- Forward mode: Wengert (1964)

1970s
- Reverse mode: Linnainmaa (1970, 1976)
- Control parameters: Dreyfus (1973)
- Werbos (1974)

1980s
- Automatic reverse mode: Speelpenning (1980)
- First NN-specific backprop: Werbos (1982)
- Parker (1985); LeCun (1985)
- Revived backprop: Rumelhart, Hinton & Williams (1986)
- Revived reverse mode: Griewank (1989)

Recommended reading:
- Griewank, A., 2012. Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, Extra Volume ISMP, pp. 389-400.
- Schmidhuber, J., 2015. Who Invented Backpropagation? http://people.idsia.ch/~juergen/who-invented-backpropagation.html
Automatic differentiation

All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives.
- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies

Example:

f(a, b):
  c = a * b
  d = log(c)
  return d

Running the primal at (a, b) = (2, 3) traces the graph a, b -> (*) -> c -> (log) -> d, with intermediate values c = 6 and d = 1.791, so 1.791 = f(2, 3).

A derivative evaluation over the same trace propagates derivative values (tangents or adjoints) through the graph: 1 at d, 0.166 at c, and 0.5 and 0.333 at a and b, giving the "gradient" [0.5, 0.333] = f'(2, 3).
Automatic differentiation
Two main flavors:

Forward mode: propagates primals and derivatives (tangents) forward through the trace.
Reverse mode (a.k.a. backprop): propagates primals forward, then derivatives (adjoints) backward through the trace.

Nested combinations give higher-order derivatives, Hessian–vector products, etc.:
- Forward-on-reverse
- Reverse-on-forward
- ...
What happens to control flow?
It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution.

f(a, b):
  c = a * b
  if c > 0:
    d = log(c)
  else:
    d = sin(c)
  return d

f(a = 2, b = 3): c = 6 > 0, so the trace is a, b -> (*) -> c -> (log) -> d, with d = log(6) = 1.791.
f(a = 2, b = -1): c = -2, so the trace is a, b -> (*) -> c -> (sin) -> d, with d = sin(-2) = -0.909.

Each trace is a directed acyclic graph (DAG) with a topological ordering.
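To make the "linear trace" concrete, here is a hedged sketch (not from the slides; the Traced class and the tape-as-list representation are choices made for this example): it records every elementary operation that actually executes, so the branch that is taken appears in the Wengert list and the branch that is not taken simply never shows up.

import math

class Traced:
    """Wraps a value and records every elementary op executed on it."""
    def __init__(self, value, tape):
        self.value, self.tape = value, tape
    def __mul__(self, other):
        out = Traced(self.value * other.value, self.tape)
        self.tape.append(('*', self.value, other.value, out.value))
        return out
    def __gt__(self, other):
        return self.value > other       # comparisons just inspect the primal value

def log(a):
    out = Traced(math.log(a.value), a.tape)
    a.tape.append(('log', a.value, out.value))
    return out

def sin(a):
    out = Traced(math.sin(a.value), a.tape)
    a.tape.append(('sin', a.value, out.value))
    return out

def f(a, b):
    c = a * b
    d = log(c) if c > 0 else sin(c)
    return d

tape = []
f(Traced(2.0, tape), Traced(-1.0, tape))
print(tape)   # [('*', 2.0, -1.0, -2.0), ('sin', -2.0, -0.909...)]: a linear trace, no branch left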
Forward mode
Primals: propagated from independent to dependent variables.
Derivatives (tangents): propagated from independent to dependent variables.

f(x1, x2):
  v1 = x1 * x2
  v2 = log(x2)
  y1 = sin(v1)
  y2 = v1 + v2
  return (y1, y2)

Evaluating f(2, 3) with the tangent seed (1, 0) on (x1, x2):

            x1     x2     v1     v2      y1      y2
  primal    2      3      6      1.098   -0.279  7.098
  tangent   1      0      3      0       2.880   3

A single forward pass propagates primals and tangents together through the trace; here it yields the tangents (2.880, 3) on (y1, y2), i.e., the partial derivatives of both outputs w.r.t. x1 (the first column of the Jacobian).
Forward mode
Primals: propagated from independent to dependent variables.
Derivatives (tangents): propagated from independent to dependent variables.

In general, forward mode evaluates a Jacobian–vector product J r in a single forward pass.

The seed r can be any vector, not only a unit (standard basis) vector; for a scalar-valued f, the result ∇f · r is a directional derivative.
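A minimal sketch of forward mode in plain Python (a dual-number implementation written for this example, not the code behind any particular framework): each value carries a (primal, tangent) pair, and every elementary operation updates both.

import math

class Dual:
    """Forward-mode AD value: a (primal, tangent) pair."""
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent
    def __add__(self, other):
        return Dual(self.primal + other.primal, self.tangent + other.tangent)
    def __mul__(self, other):
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)

def log(a):
    return Dual(math.log(a.primal), a.tangent / a.primal)

def sin(a):
    return Dual(math.sin(a.primal), math.cos(a.primal) * a.tangent)

def f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    return sin(v1), v1 + v2

# Seed the tangents with (1, 0) to get the first column of the Jacobian at (2, 3)
y1, y2 = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(y1.primal, y2.primal)    # ~-0.279, ~7.098
print(y1.tangent, y2.tangent)  # ~2.880, ~3.0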
Reverse mode
Primals: propagated from independent to dependent variables (forward sweep).
Derivatives (adjoints): propagated from dependent to independent variables (reverse sweep).

f(x1, x2):
  v1 = x1 * x2
  v2 = log(x2)
  y1 = sin(v1)
  y2 = v1 + v2
  return (y1, y2)

Evaluating f(2, 3) forward, then sweeping the trace backwards with the adjoint seed (1, 0) on (y1, y2):

            x1      x2      v1      v2      y1      y2
  primal    2       3       6       1.098   -0.279  7.098
  adjoint   2.880   1.920   0.960   0       1       0

A single reverse pass accumulates the adjoints of all inputs; here it yields the adjoints (2.880, 1.920) on (x1, x2), i.e., the gradient of y1 w.r.t. both inputs (the first row of the Jacobian).
Reverse mode
Primals: propagated from independent to dependent variables.
Derivatives (adjoints): propagated from dependent to independent variables.

In general, reverse mode evaluates a transposed Jacobian–vector product (a vector–Jacobian product) in a single forward-plus-reverse pass.

For a scalar-valued f, with seed 1 on the output, this is exactly the gradient.
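A minimal sketch of tape-based reverse mode in plain Python (again written for this example; it is not any framework's actual implementation): the forward pass records every operation and its local derivatives on a tape, and the reverse sweep accumulates adjoints in reverse (topological) order.

import math

_tape = []   # global tape: nodes in creation order, which is a topological order

class Var:
    """Reverse-mode AD value: records its parents and their local derivatives."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.adjoint = value, parents, 0.0
        _tape.append(self)
    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.value * other.value, [(self, other.value), (other, self.value)])

def log(a): return Var(math.log(a.value), [(a, 1.0 / a.value)])
def sin(a): return Var(math.sin(a.value), [(a, math.cos(a.value))])

def backward(output):
    output.adjoint = 1.0                    # seed: adjoint 1 on the chosen output
    for node in reversed(_tape):            # sweep the tape in reverse order
        for parent, local_grad in node.parents:
            parent.adjoint += local_grad * node.adjoint

x1, x2 = Var(2.0), Var(3.0)
v1, v2 = x1 * x2, log(x2)
y1, y2 = sin(v1), v1 + v2
backward(y1)                                # adjoints w.r.t. y1
print(x1.adjoint, x2.adjoint)               # ~2.880, ~1.920 (the gradient of y1)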
Forward vs reverse summary

In the extreme f: ℝ → ℝᵐ, use forward mode to evaluate all m derivatives in one pass.
In the extreme f: ℝⁿ → ℝ, use reverse mode to evaluate the full gradient in one pass.

In general, for f: ℝⁿ → ℝᵐ the Jacobian can be evaluated in
- n passes with forward mode (one per input, i.e., per Jacobian column)
- m passes with reverse mode (one per output, i.e., per Jacobian row)

Reverse mode performs better when m << n (e.g., a scalar loss over many parameters).
Backprop through normal PDF

f(x, µ, σ) = (1 / (σ √(2π))) · exp(-(x - µ)² / (2σ²))

The computational graph builds this density from elementary operations (subtract, square, multiply, divide, negate, exp, sqrt, reciprocal) applied to x, µ, σ and the constants 2 and π.
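A hedged, hand-written sketch (not from the slides; the names v1 ... v6 are just labels for the intermediate nodes) of one forward and one reverse pass through this graph in plain Python:

import math

def normal_pdf_with_grads(x, mu, sigma):
    # Forward pass: the Wengert list of the density computation
    v1 = x - mu                        # x - mu
    v2 = v1 ** 2                       # (x - mu)^2
    v3 = 2.0 * sigma ** 2              # 2 sigma^2
    v4 = -v2 / v3                      # -(x - mu)^2 / (2 sigma^2)
    v5 = math.exp(v4)
    v6 = sigma * math.sqrt(2.0 * math.pi)
    f = v5 / v6                        # the density value

    # Reverse pass: propagate adjoints from f back to x, mu, sigma
    f_bar = 1.0
    v5_bar = f_bar / v6
    v6_bar = -f_bar * v5 / v6 ** 2
    v4_bar = v5_bar * v5               # d(exp(v4))/dv4 = exp(v4) = v5
    v2_bar = -v4_bar / v3
    v3_bar = v4_bar * v2 / v3 ** 2
    v1_bar = 2.0 * v1 * v2_bar
    x_bar = v1_bar
    mu_bar = -v1_bar
    sigma_bar = 4.0 * sigma * v3_bar + math.sqrt(2.0 * math.pi) * v6_bar
    return f, (x_bar, mu_bar, sigma_bar)

print(normal_pdf_with_grads(0.0, 0.0, 1.0))   # density ~0.3989; df/dsigma ~ -0.3989, df/dx = df/dmu = 0 here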
Summary

This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, etc.)
- How do we compute derivatives?
- Automatic differentiation

Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)
References
Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2018. Automatic differentiation in machine learning: a survey.
Journal of Machine Learning Research (JMLR), 18(153), pp. 1-43.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “Tricks from Deep Learning.” In 7th International
Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “DiffSharp: An AD Library for .NET Languages.” In 7th
International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, Atılım Güneş, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. 2018. “Online Learning Rate
Adaptation with Hypergradient Descent.” In Sixth International Conference on Learning Representations (ICLR), Vancouver,
Canada, April 30 – May 3, 2018.

Griewank, A. and Walther, A., 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation (Vol. 105).
SIAM.

Nocedal, J. and Wright, S.J., 1999. Numerical Optimization. Springer.

Extra slides
Forward mode
Primals: propagated from independent to dependent variables.
Derivatives (tangents): propagated from independent to dependent variables.

f(a, b):
  c = a * b
  d = log(c)
  return d

Evaluating f(2, 3) with the tangent seed (1, 0) on (a, b):

            a    b    c    d
  primal    2    3    6    1.791
  tangent   1    0    3    0.5

In general, forward mode evaluates a Jacobian–vector product.
Here we evaluated the partial derivative of d w.r.t. a, 0.5, with the seed (1, 0).
Reverse mode
Primals: propagated from independent to dependent variables.
Derivatives (adjoints): propagated from dependent to independent variables.

f(a, b):
  c = a * b
  d = log(c)
  return d

Evaluating f(2, 3) forward, then sweeping backwards with the adjoint seed 1 on d:

            a      b      c      d
  primal    2      3      6      1.791
  adjoint   0.5    0.333  0.166  1

In general, reverse mode evaluates a transposed Jacobian–vector product.
Here we evaluated the gradient [0.5, 0.333] of d w.r.t. (a, b) with the seed 1.
