
Automatic Differentiation (1)

Slides Prepared By:

Atılım Güneş Baydin


[email protected]
Outline

This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, Jacobian, etc.)
- How do we compute derivatives?
- Automatic differentiation

Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)
Derivatives and machine learning
Derivatives in machine learning
"Backprop" and gradient descent are at the core of all recent advances

Computer vision
- Top-5 error rate for ImageNet (NVIDIA devblog)
- Faster R-CNN (Ren et al., 2015)
- NVIDIA DRIVE PX 2 segmentation

Speech recognition/synthesis
- Word error rates (Huang et al., 2014)

Machine translation
- Google Neural Machine Translation system (GNMT)
Derivatives in machine learning
“Backprop” and gradient descent are at the core of all recent advances

Probabilistic programming (and modeling)
- Edward (2016)
- Pyro (2017)
- ProbTorch (2017)
- TensorFlow Probability (2018)

- Variational inference
- "Neural" density estimation
- Transformed distributions via bijectors
- Normalizing flows (Rezende & Mohamed, 2015)
- Masked autoregressive flows (Papamakarios et al., 2017)
Derivatives in machine learning
At the core of all: differentiable functions (programs) whose parameters are
tuned by gradient-based optimization

(Ruder, 2017) http://ruder.io/optimizing-gradient-descent/
Automatic differentiation
Execute differentiable functions (programs) via automatic differentiation

A word on naming:
- Differentiable programming, a generalization of deep learning (Olah, LeCun)
“Neural networks are just a class of differentiable functions”
- Automatic differentiation
- Algorithmic differentiation
- AD
- Autodiff
- Algodiff
- Autograd
Also remember:
- Backprop
- Backpropagation (backward propagation of errors)
Essential concepts refresher
Derivative
Function of a real variable, f: ℝ → ℝ

The sensitivity of the function value (dependent variable) w.r.t. a change in its argument (independent variable), i.e., the instantaneous rate of change:

\frac{dy}{dx} = f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

Notation: Leibniz dy/dx (c. 1675), Lagrange f'(x), Newton ẏ (c. 1665)
Derivative
Function of a real variable

Differentiation rules (sum, product, chain rule, etc.; around 15 such rules) let us derive the derivative of a composite expression from the derivatives of its parts.

Note: the derivative is a linear operator, a.k.a. a higher-order function in programming languages: it takes a function and returns a function.
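As a toy illustration of this higher-order-function view (not from the original slides; the name derivative and the step size h are illustrative, and the result is only a numerical approximation rather than the exact derivative operator), a sketch in Python:

import math

def derivative(f, h=1e-6):
    """Higher-order function: takes a function f and returns (an approximation of) f'."""
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)

dsin = derivative(math.sin)
print(dsin(0.0))  # ~1.0, since d/dx sin(x) = cos(x) and cos(0) = 1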
Partial derivative
Function of several real variables, f: ℝⁿ → ℝ

A derivative w.r.t. one independent variable, with the others held constant:

\frac{\partial f}{\partial x_i}

The symbol ∂ is read "del" (or "partial").
Partial derivative
Function of several real variables

The gradient, given f: ℝⁿ → ℝ, is the vector of all partial derivatives

\nabla f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)

and points in the direction with the largest rate of change.

The symbol ∇ is read "nabla" or "del". Nabla is the higher-order function that maps f to ∇f.


Total derivative
Function of several real variables

The derivative w.r.t. all variables (independent & dependent): consider all partial derivatives simultaneously and accumulate all direct and indirect contributions.
(Important: this will be useful later.)
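As a reminder of the standard form (the slide's own formula is the usual chain-rule statement), for y = f(x_1, ..., x_n) where each x_i may itself depend on a variable t:

\frac{dy}{dt} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{dx_i}{dt}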
Matrix calculus and machine learning
Extension to multivariable functions:

               | Scalar output            | Vector output
Scalar input   | derivative               |
Vector input   | gradient (scalar field)  | Jacobian (vector field)

In machine learning, we construct (deep) compositions of
- vector-to-vector functions f: ℝⁿ → ℝᵐ, e.g., a neural network
- scalar-valued functions f: ℝⁿ → ℝ, e.g., a loss function, KL divergence, or log joint probability
Matrix calculus and machine learning

And many, many more rules.

Generalization to tensors (multi-dimensional arrays) for efficient batching, handling of sequences, channels in convolutions, etc.
Matrix calculus and machine learning
Finally, two constructs relevant to machine learning: the Jacobian and the Hessian.
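The slide's formulas are the standard definitions: for f: ℝⁿ → ℝᵐ with outputs y_1, ..., y_m, the Jacobian collects all first partial derivatives, and for a scalar-valued f the Hessian collects all second partial derivatives:

J_f = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix} \in \mathbb{R}^{m \times n},
\qquad
H_f = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix} \in \mathbb{R}^{n \times n}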
How to compute derivatives
Derivatives as code

We can compute the derivatives not just of mathematical functions, but of general programs (with control flow).
Manual
You can still see papers that report pages of hand-derived gradient expressions.

Analytic derivatives are needed for theoretical insight:
- analytic solutions, proofs
- mathematical analysis, e.g., stability of fixed points

They are unnecessary when we just need derivative evaluations for optimization.
Symbolic differentiation
Symbolic computation with Mathematica, Maple, Maxima, and deep learning frameworks such as Theano.

Problem: expression swell (the derivative expression can grow much larger than the original expression).
Mitigation: graph optimization (e.g., in Theano).
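A quick, hedged sketch of expression swell (not from the slides; it assumes SymPy is available, and the logistic map is just a convenient toy function): symbolically differentiating a few iterations already yields an expression far larger than the function itself.

import sympy as sp

x = sp.symbols('x')
expr = x
for _ in range(4):
    expr = 4 * expr * (1 - expr)   # iterate the logistic map symbolically

d = sp.diff(expr, x)               # symbolic derivative
print(sp.count_ops(expr), sp.count_ops(sp.expand(d)))  # the derivative has far more operations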
Symbolic differentiation
Problem: only applicable to closed-form mathematical expressions.

You can find the derivative of a closed-form expression, but not of an arbitrary program that computes the same value.

Symbolic graph builders such as Theano and TensorFlow have limited, unintuitive support for control flow, loops, and recursion.
Numerical differentiation
Finite difference approximation of ∇f, for f: ℝⁿ → ℝ:

\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x})}{h}

Problem: f needs to be evaluated n times, once with each standard basis vector e_i.

Problem: we must select the step size h, and we face approximation errors.
Numerical differentiation
Finite difference approximation of ∇f

Better approximations exist:
- Higher-order finite differences, e.g., the center difference:
  \frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x} - h \mathbf{e}_i)}{2h}
- Richardson extrapolation
- Differential quadrature

These increase rapidly in complexity and never completely eliminate the error.

Still, numerical differentiation is extremely useful as a quick check of our gradient implementations (gradient checking is good to learn).
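A minimal sketch of such a gradient check in plain Python (not from the slides; the example function and its hand-derived gradient are purely illustrative), using the center difference:

import math

def numerical_grad(f, x, h=1e-5):
    """Center-difference approximation of the gradient of a scalar-valued f at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

f = lambda x: math.log(x[0] * x[1])        # example function (same as the trace example later)
analytic = lambda x: [1 / x[0], 1 / x[1]]  # hand-derived gradient we want to check

print(numerical_grad(f, [2.0, 3.0]))       # ~[0.5, 0.333]
print(analytic([2.0, 3.0]))                # [0.5, 0.3333...]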
Automatic differentiation
If we don't need analytic derivative expressions, we can evaluate a gradient exactly with only one forward and one reverse execution of the program.

Rumelhart, Hinton & Williams, "Learning representations by back-propagating errors", Nature 323, 533-536 (9 October 1986)

In machine learning, this is known as backpropagation or "backprop".
- Automatic differentiation is more than backprop
- Or: backprop is a specialized reverse-mode automatic differentiation
- We will come back to this shortly
Backprop or automatic differentiation?
1960s
- Precursors of backpropagation: Kelley (1960); Bryson (1961); Pontryagin et al. (1961); Dreyfus (1962)
- Forward mode: Wengert (1964)

1970s
- Reverse mode: Linnainmaa (1970, 1976)
- Control parameters: Dreyfus (1973)
- Werbos (1974)

1980s
- Automatic reverse mode: Speelpenning (1980)
- First NN-specific backprop: Werbos (1982)
- Parker (1985); LeCun (1985)
- Revived backprop: Rumelhart, Hinton & Williams (1986)
- Revived reverse mode: Griewank (1989)

Recommended reading:
- Griewank, A., 2012. Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, Extra Volume ISMP, pp. 389-400.
- Schmidhuber, J., 2015. Who Invented Backpropagation? http://people.idsia.ch/~juergen/who-invented-backpropagation.html
Automatic differentiation

All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives.
- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies

Example:

f(a, b):
  c = a * b
  d = log(c)
  return d

Running the primal at (a, b) = (2, 3) traces the graph a, b -> (*) -> c -> (log) -> d, with intermediate values c = 6 and d = 1.791, so 1.791 = f(2, 3).

A derivative evaluation over the same trace propagates derivative values (tangents or adjoints) through the graph: 1 at d, 0.166 at c, and 0.5 and 0.333 at a and b, giving the "gradient" [0.5, 0.333] = f'(2, 3).
Automatic differentiation
Two main flavors:

Forward mode: propagates primals and derivatives (tangents) forward through the trace.
Reverse mode (a.k.a. backprop): propagates primals forward, then derivatives (adjoints) backward through the trace.

Nested combinations give higher-order derivatives, Hessian–vector products, etc.:
- Forward-on-reverse
- Reverse-on-forward
- ...
What happens to control flow?
It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution.

f(a, b):
  c = a * b
  if c > 0:
    d = log(c)
  else:
    d = sin(c)
  return d

f(a = 2, b = 3): c = 6 > 0, so the trace is a, b -> (*) -> c -> (log) -> d, with d = log(6) = 1.791.
f(a = 2, b = -1): c = -2, so the trace is a, b -> (*) -> c -> (sin) -> d, with d = sin(-2) = -0.909.

Each trace is a directed acyclic graph (DAG) with a topological ordering.
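To make the "linear trace" concrete, here is a hedged sketch (not from the slides; the Traced class and the tape-as-list representation are choices made for this example): it records every elementary operation that actually executes, so the branch that is taken appears in the Wengert list and the branch that is not taken simply never shows up.

import math

class Traced:
    """Wraps a value and records every elementary op executed on it."""
    def __init__(self, value, tape):
        self.value, self.tape = value, tape
    def __mul__(self, other):
        out = Traced(self.value * other.value, self.tape)
        self.tape.append(('*', self.value, other.value, out.value))
        return out
    def __gt__(self, other):
        return self.value > other       # comparisons just inspect the primal value

def log(a):
    out = Traced(math.log(a.value), a.tape)
    a.tape.append(('log', a.value, out.value))
    return out

def sin(a):
    out = Traced(math.sin(a.value), a.tape)
    a.tape.append(('sin', a.value, out.value))
    return out

def f(a, b):
    c = a * b
    d = log(c) if c > 0 else sin(c)
    return d

tape = []
f(Traced(2.0, tape), Traced(-1.0, tape))
print(tape)   # [('*', 2.0, -1.0, -2.0), ('sin', -2.0, -0.909...)]: a linear trace, no branch left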
Forward mode
Primals: propagated from independent to dependent variables.
Derivatives (tangents): propagated from independent to dependent variables.

f(x1, x2):
  v1 = x1 * x2
  v2 = log(x2)
  y1 = sin(v1)
  y2 = v1 + v2
  return (y1, y2)

Evaluating f(2, 3) with the tangent seed (1, 0) on (x1, x2):

            x1     x2     v1     v2      y1      y2
  primal    2      3      6      1.098   -0.279  7.098
  tangent   1      0      3      0       2.880   3

A single forward pass propagates primals and tangents together through the trace; here it yields the tangents (2.880, 3) on (y1, y2), i.e., the partial derivatives of both outputs w.r.t. x1 (the first column of the Jacobian).
Forward mode
Primals: propagated from independent to dependent variables.
Derivatives (tangents): propagated from independent to dependent variables.

In general, forward mode evaluates a Jacobian–vector product J r in a single forward pass.

The seed r can be any vector, not only a unit (standard basis) vector; for a scalar-valued f, the result ∇f · r is a directional derivative.
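A minimal sketch of forward mode in plain Python (a dual-number implementation written for this example, not the code behind any particular framework): each value carries a (primal, tangent) pair, and every elementary operation updates both.

import math

class Dual:
    """Forward-mode AD value: a (primal, tangent) pair."""
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent
    def __add__(self, other):
        return Dual(self.primal + other.primal, self.tangent + other.tangent)
    def __mul__(self, other):
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)

def log(a):
    return Dual(math.log(a.primal), a.tangent / a.primal)

def sin(a):
    return Dual(math.sin(a.primal), math.cos(a.primal) * a.tangent)

def f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    return sin(v1), v1 + v2

# Seed the tangents with (1, 0) to get the first column of the Jacobian at (2, 3)
y1, y2 = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(y1.primal, y2.primal)    # ~-0.279, ~7.098
print(y1.tangent, y2.tangent)  # ~2.880, ~3.0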
Reverse mode
Primals: propagated from independent to dependent variables (forward sweep).
Derivatives (adjoints): propagated from dependent to independent variables (reverse sweep).

f(x1, x2):
  v1 = x1 * x2
  v2 = log(x2)
  y1 = sin(v1)
  y2 = v1 + v2
  return (y1, y2)

Evaluating f(2, 3) forward, then sweeping the trace backwards with the adjoint seed (1, 0) on (y1, y2):

            x1      x2      v1      v2      y1      y2
  primal    2       3       6       1.098   -0.279  7.098
  adjoint   2.880   1.920   0.960   0       1       0

A single reverse pass accumulates the adjoints of all inputs; here it yields the adjoints (2.880, 1.920) on (x1, x2), i.e., the gradient of y1 w.r.t. both inputs (the first row of the Jacobian).
Reverse mode
Primals: propagated from independent to dependent variables.
Derivatives (adjoints): propagated from dependent to independent variables.

In general, reverse mode evaluates a transposed Jacobian–vector product (a vector–Jacobian product) in a single forward-plus-reverse pass.

For a scalar-valued f, with seed 1 on the output, this is exactly the gradient.
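A minimal sketch of tape-based reverse mode in plain Python (again written for this example; it is not any framework's actual implementation): the forward pass records every operation and its local derivatives on a tape, and the reverse sweep accumulates adjoints in reverse (topological) order.

import math

_tape = []   # global tape: nodes in creation order, which is a topological order

class Var:
    """Reverse-mode AD value: records its parents and their local derivatives."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.adjoint = value, parents, 0.0
        _tape.append(self)
    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.value * other.value, [(self, other.value), (other, self.value)])

def log(a): return Var(math.log(a.value), [(a, 1.0 / a.value)])
def sin(a): return Var(math.sin(a.value), [(a, math.cos(a.value))])

def backward(output):
    output.adjoint = 1.0                    # seed: adjoint 1 on the chosen output
    for node in reversed(_tape):            # sweep the tape in reverse order
        for parent, local_grad in node.parents:
            parent.adjoint += local_grad * node.adjoint

x1, x2 = Var(2.0), Var(3.0)
v1, v2 = x1 * x2, log(x2)
y1, y2 = sin(v1), v1 + v2
backward(y1)                                # adjoints w.r.t. y1
print(x1.adjoint, x2.adjoint)               # ~2.880, ~1.920 (the gradient of y1)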
Forward vs reverse summary

In the extreme f: ℝ → ℝᵐ, use forward mode to evaluate all m derivatives in one pass.
In the extreme f: ℝⁿ → ℝ, use reverse mode to evaluate the full gradient in one pass.

In general, for f: ℝⁿ → ℝᵐ the Jacobian can be evaluated in
- n passes with forward mode (one per input, i.e., per Jacobian column)
- m passes with reverse mode (one per output, i.e., per Jacobian row)

Reverse mode performs better when m << n (e.g., a scalar loss over many parameters).
Backprop through normal PDF

f(x, µ, σ) = (1 / (σ √(2π))) · exp(-(x - µ)² / (2σ²))

The computational graph builds this density from elementary operations (subtract, square, multiply, divide, negate, exp, sqrt, reciprocal) applied to x, µ, σ and the constants 2 and π.
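A hedged, hand-written sketch (not from the slides; the names v1 ... v6 are just labels for the intermediate nodes) of one forward and one reverse pass through this graph in plain Python:

import math

def normal_pdf_with_grads(x, mu, sigma):
    # Forward pass: the Wengert list of the density computation
    v1 = x - mu                        # x - mu
    v2 = v1 ** 2                       # (x - mu)^2
    v3 = 2.0 * sigma ** 2              # 2 sigma^2
    v4 = -v2 / v3                      # -(x - mu)^2 / (2 sigma^2)
    v5 = math.exp(v4)
    v6 = sigma * math.sqrt(2.0 * math.pi)
    f = v5 / v6                        # the density value

    # Reverse pass: propagate adjoints from f back to x, mu, sigma
    f_bar = 1.0
    v5_bar = f_bar / v6
    v6_bar = -f_bar * v5 / v6 ** 2
    v4_bar = v5_bar * v5               # d(exp(v4))/dv4 = exp(v4) = v5
    v2_bar = -v4_bar / v3
    v3_bar = v4_bar * v2 / v3 ** 2
    v1_bar = 2.0 * v1 * v2_bar
    x_bar = v1_bar
    mu_bar = -v1_bar
    sigma_bar = 4.0 * sigma * v3_bar + math.sqrt(2.0 * math.pi) * v6_bar
    return f, (x_bar, mu_bar, sigma_bar)

print(normal_pdf_with_grads(0.0, 0.0, 1.0))   # density ~0.3989; df/dsigma ~ -0.3989, df/dx = df/dmu = 0 here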
Summary

This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, etc.)
- How do we compute derivatives?
- Automatic differentiation

Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)
References
Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2018. Automatic differentiation in machine learning: a survey.
Journal of Machine Learning Research (JMLR), 18(153), pp. 1-43.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “Tricks from Deep Learning.” In 7th International
Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “DiffSharp: An AD Library for .NET Languages.” In 7th
International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, Atılım Güneş, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. 2018. “Online Learning Rate
Adaptation with Hypergradient Descent.” In Sixth International Conference on Learning Representations (ICLR), Vancouver,
Canada, April 30 – May 3, 2018.

Griewank, A. and Walther, A., 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation (Vol. 105).
SIAM.

Nocedal, J. and Wright, S.J., 1999. Numerical Optimization. Springer.

Extra slides
Forward mode
Primals: propagated from independent to dependent variables.
Derivatives (tangents): propagated from independent to dependent variables.

f(a, b):
  c = a * b
  d = log(c)
  return d

Evaluating f(2, 3) with the tangent seed (1, 0) on (a, b):

            a    b    c    d
  primal    2    3    6    1.791
  tangent   1    0    3    0.5

In general, forward mode evaluates a Jacobian–vector product.
Here we evaluated the partial derivative of d w.r.t. a, 0.5, with the seed (1, 0).
Reverse mode
Primals: propagated from independent to dependent variables.
Derivatives (adjoints): propagated from dependent to independent variables.

f(a, b):
  c = a * b
  d = log(c)
  return d

Evaluating f(2, 3) forward, then sweeping backwards with the adjoint seed 1 on d:

            a      b      c      d
  primal    2      3      6      1.791
  adjoint   0.5    0.333  0.166  1

In general, reverse mode evaluates a transposed Jacobian–vector product.
Here we evaluated the gradient [0.5, 0.333] of d w.r.t. (a, b) with the seed 1.
