Automatic Differentiation Applied Maths
Jean Feydy
February 2018
Écoles Normales Supérieures de Paris et Paris-Saclay
Overview
What is backpropagation?
A time-efficient algorithm to compute gradients.
Where is it useful?
Automatic tuning of variables (optimal control)
or of your transform’s parameters (neural networks).
How do we compute a gradient?
$$
\nabla f(x_0) =
\begin{pmatrix}
\partial_{x_1} f(x_0) \\ \partial_{x_2} f(x_0) \\ \vdots \\ \partial_{x_n} f(x_0)
\end{pmatrix}
\simeq \frac{1}{\delta t}
\begin{pmatrix}
f(x_0 + \delta t \cdot (1, 0, \dots, 0)) - f(x_0) \\
f(x_0 + \delta t \cdot (0, 1, \dots, 0)) - f(x_0) \\
\vdots \\
f(x_0 + \delta t \cdot (0, 0, \dots, 1)) - f(x_0)
\end{pmatrix}
$$
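The finite-difference formula above can be sketched in a few lines of plain Python. The names `finite_difference_gradient` and `f`, the test function and the step size are illustrative choices, not taken from the slides. Note that the loop evaluates f once per coordinate, for n + 1 evaluations in total; this is the cost that backpropagation avoids.

```python
def finite_difference_gradient(f, x0, dt=1e-6):
    """Approximate the gradient of f at x0 with forward differences.

    Costs len(x0) + 1 evaluations of f: one per coordinate, plus f(x0).
    """
    f0 = f(x0)
    gradient = []
    for k in range(len(x0)):
        shifted = list(x0)
        shifted[k] += dt                      # x0 + dt * e_k
        gradient.append((f(shifted) - f0) / dt)
    return gradient


def f(x):
    """Illustrative test function: f(x) = x_1^2 + 3 x_1 x_2."""
    return x[0] ** 2 + 3 * x[0] * x[1]


approx = finite_difference_gradient(f, [1.0, 2.0])
exact = [2 * 1.0 + 3 * 2.0, 3 * 1.0]          # (2 x_1 + 3 x_2, 3 x_1)
```

The approximation agrees with the exact gradient only up to an error of order `dt`, and shrinking `dt` too far amplifies round-off errors.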
What is a gradient?
Let $f : x \in \mathbb{R}^n \mapsto y \in \mathbb{R}$ be smooth,
so that $\mathrm{d}f(x) : \mathrm{d}x \in \mathbb{R}^n \mapsto \mathrm{d}y \in \mathbb{R}$ is linear:
$$
\mathrm{d}f(x).\mathrm{d}x =
\begin{pmatrix} \partial_1 f(x) & \cdots & \partial_n f(x) \end{pmatrix}
\cdot
\begin{pmatrix} \mathrm{d}x_1 \\ \vdots \\ \mathrm{d}x_n \end{pmatrix}
= \mathrm{d}y
$$
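One way to connect the row vector above to the gradient, assuming the standard Euclidean dot product on $\mathbb{R}^n$ (the slides leave the choice of inner product implicit), is to read the gradient as the vector that represents the linear form $\mathrm{d}f(x)$:

```latex
\mathrm{d}f(x).\mathrm{d}x
  = \sum_{k=1}^{n} \partial_k f(x)\, \mathrm{d}x_k
  = \langle \nabla f(x), \mathrm{d}x \rangle,
\qquad\text{so that}\qquad
\nabla f(x) = \big(\mathrm{d}f(x)\big)^{T}.1
  = \begin{pmatrix} \partial_1 f(x) \\ \vdots \\ \partial_n f(x) \end{pmatrix}.
```

In other words, the gradient is obtained by applying the transposed differential to the scalar 1.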
Autodiff is simple – no magic!
$$
\begin{aligned}
f &= h \circ g \\
\mathrm{d}f(x) &= \mathrm{d}h(g(x)) \circ \mathrm{d}g(x) \\
\partial f(x) &= (\mathrm{d}h(g(x)) \circ \mathrm{d}g(x))^T = \partial g(x) \circ \partial h(g(x)) \\
\nabla f(x) &= \partial f(x).1 = \partial g(x).(\partial h(g(x)).1)
\end{aligned}
$$
Forward pass: x (input) → g → g(x) → h → h(g(x)) = f(x) (output)
Backward pass: 1 (input) → ∂h → ∂h(g(x)).1 → ∂g → ∂g(x).(∂h(g(x)).1) = ∇f(x) (output)
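The forward and backward passes above can be written out by hand on a small example. The functions `g` and `h` below are arbitrary illustrative choices, and `backward_g`, `backward_h` are hypothetical names for the transposed differentials ∂g and ∂h.

```python
def g(x):
    """g : R^2 -> R^2, g(x) = (x_0 * x_1, x_0 + x_1)."""
    return [x[0] * x[1], x[0] + x[1]]


def backward_g(x, a):
    """Apply the transposed differential of g at x to the vector a."""
    return [x[1] * a[0] + a[1], x[0] * a[0] + a[1]]


def h(u):
    """h : R^2 -> R, h(u) = u_0^2 + 3 u_1."""
    return u[0] ** 2 + 3 * u[1]


def backward_h(u, a):
    """Apply the transposed differential of h at u to the scalar a."""
    return [2 * u[0] * a, 3 * a]


def value_and_gradient(x):
    # Forward pass: compute and store the intermediate result g(x).
    u = g(x)
    y = h(u)
    # Backward pass: start from 1 and traverse the graph in reverse order.
    a = backward_h(u, 1.0)
    gradient = backward_g(x, a)
    return y, gradient


y, gradient = value_and_gradient([1.0, 2.0])
```

Here f(x) = (x_0 x_1)^2 + 3 (x_0 + x_1), whose gradient at (1, 2) is (11, 7); the backward pass recovers it with a single sweep, whatever the input dimension.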
What do you need to compute a gradient?
$$
f_i : E_{i-1} \to E_i, \qquad x \mapsto f_i(x) \tag{1}
$$
and
$$
\partial_x f_i : E_{i-1} \times E_i \to E_{i-1}, \qquad (x_0, a) \mapsto \partial_x f_i(x_0) \cdot a \tag{2}
$$
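The two ingredients (1) and (2) are enough to differentiate any chain of operators. The sketch below stores each operator as a hypothetical `(forward, backward)` pair of plain Python functions; the three operators and the helper `gradient` are illustrative, not an existing library API.

```python
import math

# Each operator is a pair (forward, backward):
#   forward(x)      returns f_i(x)
#   backward(x0, a) returns the transposed differential of f_i at x0, applied to a
square = (lambda x: [v * v for v in x],
          lambda x0, a: [2 * v * w for v, w in zip(x0, a)])
exponential = (lambda x: [math.exp(v) for v in x],
               lambda x0, a: [math.exp(v) * w for v, w in zip(x0, a)])
total = (lambda x: sum(x),
         lambda x0, a: [a for _ in x0])


def gradient(operators, x):
    """Gradient at x of the scalar function f_k o ... o f_1."""
    inputs = []
    for forward, _ in operators:           # forward pass, storing every input
        inputs.append(x)
        x = forward(x)
    a = 1.0                                # the backward pass starts from 1
    for (_, backward), x0 in zip(reversed(operators), reversed(inputs)):
        a = backward(x0, a)
    return a


# f(x) = sum_k exp(x_k^2), so that the k-th partial derivative is 2 x_k exp(x_k^2).
grad_f = gradient([square, exponential, total], [0.0, 1.0])
```

Storing the inputs of every operator during the forward pass is what makes the backward pass cheap in time, at the price of memory.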
Autodiff is easy to use
PyTorch :
A typical formula: the kernel square norm
[Figure: two arrays of shape (N, D), with rows q_1, q_2, ..., q_N and p_1, p_2, ..., p_N.]
Numpy, in practice
import numpy as np
# Re-indexing, as in the PyTorch version:
q_i = q[:,None,:] # shape (N,D) -> (N,1,D)
q_j = q[None,:,:] # shape (N,D) -> (1,N,D)
# Actual computations:
sqd = np.sum( (q_i - q_j)**2 , 2 ) # |q_i-q_j|^2
K_qq = np.exp( - sqd / s**2 ) # Gaussian kernel
v = K_qq @ p # matrix mult. (N,N)@(N,D) = (N,D)
H = 6029309.1348486
Elapsed time: 3.01s
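A self-contained version of this computation, on a small random point cloud, might look as follows. The final reduction H = ½ Σ p_i · v_i is an assumption: it is suggested by the slide title and by the output `H`, but the slide does not show that line. The loop-based function is only there as a cross-check.

```python
import numpy as np


def kernel_square_norm(q, p, s):
    """Assumed form: H = 1/2 * sum_{i,j} exp(-|q_i - q_j|^2 / s^2) <p_i, p_j>."""
    q_i = q[:, None, :]                      # shape (N, D) -> (N, 1, D)
    q_j = q[None, :, :]                      # shape (N, D) -> (1, N, D)
    sqd = np.sum((q_i - q_j) ** 2, axis=2)   # |q_i - q_j|^2, shape (N, N)
    K_qq = np.exp(-sqd / s ** 2)             # Gaussian kernel matrix
    v = K_qq @ p                             # (N, N) @ (N, D) = (N, D)
    return 0.5 * np.sum(p * v)


def kernel_square_norm_loops(q, p, s):
    """Same quantity with explicit Python loops, as a cross-check."""
    total = 0.0
    for i in range(len(q)):
        for j in range(len(q)):
            k_ij = np.exp(-np.sum((q[i] - q[j]) ** 2) / s ** 2)
            total += k_ij * np.dot(p[i], p[j])
    return 0.5 * total


rng = np.random.default_rng(0)
q = rng.standard_normal((50, 3))
p = rng.standard_normal((50, 3))
H_fast = kernel_square_norm(q, p, s=1.5)
H_slow = kernel_square_norm_loops(q, p, s=1.5)
```

The vectorized version allocates an (N, N, D) buffer for `q_i - q_j`, which is the memory bottleneck once N reaches a few thousand points.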
PyTorch, in practice
# Re-indexing:
q_i = q[:,None,:] # shape (N,D) -> (N,1,D)
q_j = q[None,:,:] # shape (N,D) -> (1,N,D)
# Actual computations:
sqd = torch.sum( (q_i - q_j)**2 , 2 ) # |q_i-q_j|^2
K_qq = torch.exp( - sqd / s**2 ) # Gaussian kernel
v = K_qq @ p # matrix mult. (N,N)@(N,D) = (N,D)
H = 6029309.0
Elapsed time: 0.31s
# Automatic differentiation is straightforward:
[dq,dp] = grad( H, [q,p] )
[Figure: backward graph of H, with N = 5000 and D = 3. Nodes: ∂Unsqueeze, ∂Sub, ∂PowConstant, ∂Sum, ∂Negate, ∂Div, ∂Exp, ∂Addmm, ∂View, ∂Dot, ∂MulConstant. Stored buffers: q_i − q_j of shape (5000, 5000, 3); −‖q_i − q_j‖² and −‖q_i − q_j‖²/s² of shape (5000, 5000); K_{q,q} of shape (5000, 5000); p and K_{q,q} p of shape (5000, 3), flattened to (15000); s² of shape (1).]
Using PyTorch for Optimal Control
Ballistic 101
PyTorch allows you to work with the proper equations!
$q_t = x_t, \quad p_t = m\,v_t,$
Setting the parameters of our model
Defining a cost to optimize
def cost(m, g, P) :
    "Cost associated to a simple ballistic problem."
    def Emec(q,p) :
        "Particle of mass m in a gravitational field g."
        return m*g*q[1] + (p**2).sum() / (2*m)
    # Initial condition:
    qt = source ; pt = P
Solving the control problem through gradient descent
Putting randomness into our model
def cost(m, g, P) :
    "Cost associated to a simple ballistic problem."
    def Emec(q,p) :
        "Particle of mass m in a gravitational field g."
        return m*g*q[1] + (p**2).sum() / (2*m)
    # Initial condition:
    qt = source ; pt = P
    # Simple Euler scheme:
    for it in range(10) :
        [dq,dp] = grad(Emec(qt,pt), [qt,pt], create_graph=True)
        dq += qt[1] * 20 * torch.randn(2)
        qt = qt + .1 * dp
        pt = pt - .1 * dq
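A complete, runnable version of this control problem might look as follows. Several parts are assumptions, since the slides show them only as figures: the `source` and `target` points, the final cost (squared distance to the target), the learning rate and the number of descent steps. The noise term is left out so that the result is reproducible.

```python
import torch
from torch.autograd import grad

source = torch.tensor([0.0, 0.0], requires_grad=True)  # assumed launch point
target = torch.tensor([5.0, 0.0])                      # assumed target


def cost(m, g, P, n_steps=10, dt=0.1):
    "Squared distance to the target after n_steps Euler steps (assumed cost)."

    def Emec(q, p):
        "Particle of mass m in a gravitational field g."
        return m * g * q[1] + (p ** 2).sum() / (2 * m)

    qt, pt = source, P
    for _ in range(n_steps):
        # Hamilton's equations: dq/dt = +dH/dp, dp/dt = -dH/dq.
        dq, dp = grad(Emec(qt, pt), [qt, pt], create_graph=True)
        qt = qt + dt * dp
        pt = pt - dt * dq
    return ((qt - target) ** 2).sum()


P = torch.tensor([1.0, 1.0], requires_grad=True)       # initial momentum
for _ in range(200):
    (dP,) = grad(cost(1.0, 9.81, P), [P])
    P.data -= 0.1 * dP                                 # plain gradient step

final_cost = cost(1.0, 9.81, P).item()
```

Passing `create_graph=True` keeps the Euler steps differentiable, so the outer call to `grad` can differentiate through the ten inner calls.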
Optimizing a noisy command
Optimizing wrt. the gravitational field
Recap on the basic usage of PyTorch
Convolutional “neural” networks:
optimizing a multiscale transform
The (Discrete) Fourier Transform
Original image.
Super-clever algorithms...
Going beyond linear signal processing
Super-clever algorithms...
$F : a \in \mathbb{R}^{W \times H} \mapsto b \in \mathbb{R}^N.$
Wavelet transforms: Fourier++
A real-life application: JPEG 2000
How do we choose the convolution filters?
Here is the Db2 pair:
offset      -1      0     +1     +2
lowpass   -.129   .224   .837   .483
highpass  -.483   .837  -.224  -.129
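The Db2 coefficients above are not arbitrary: they satisfy the algebraic constraints of an orthogonal wavelet filter bank. The short check below verifies a few of these standard properties, up to the three-digit rounding used on the slide; the helper `dot` and the dictionary `checks` are illustrative names.

```python
import math

lowpass = [-0.129, 0.224, 0.837, 0.483]
highpass = [-0.483, 0.837, -0.224, -0.129]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


# The highpass filter is the reversed lowpass filter with alternating signs.
mirrored = [(-1) ** (k + 1) * lowpass[3 - k] for k in range(4)]

# Each entry is the absolute deviation from the ideal value.
checks = {
    "lowpass sums to sqrt(2)": abs(sum(lowpass) - math.sqrt(2)),
    "highpass sums to 0": abs(sum(highpass)),
    "lowpass has unit norm": abs(dot(lowpass, lowpass) - 1),
    "lowpass is orthogonal to highpass": abs(dot(lowpass, highpass)),
    "lowpass is orthogonal to its shift by 2": abs(dot(lowpass[:2], lowpass[2:])),
}
```

Hand-designing filters under such constraints is the route taken by classical wavelet theory, as opposed to learning them from data.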
Scattering Transform: |Fourier|++
Using scattering moments to characterize textured patches
Pure maths can only take you so far.
Problem: classification of web-like images
$(x_i, y_i) \in \mathbb{R}^{W \times H} \times [\![1, C]\!] \simeq (x_i, \delta_{y_i}) \in \mathbb{R}^{W \times H} \times [0, 1]^C,$
Convolutional Neural Networks: |Fourier|+++
Convolutional Neural Networks are trainable multiscale transforms
for it in range(1_000_000):
    I = random set of 10 indices
    cost = Σ_{i∈I} ‖F_w(x_i) − y_i‖_KL
    dw = grad( cost, [w] )[0]
    w.data -= .001 * dw.data
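The training loop above is pseudocode. A runnable miniature, under loud assumptions, could look like this: the network F_w is replaced by a linear model followed by a softmax, the data is a synthetic two-class point cloud, and the gradient is written by hand instead of calling `grad`. For one-hot labels, the KL divergence between y_i and F_w(x_i) reduces to the cross-entropy used in `cost`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset (assumption): two Gaussian blobs in the plane, C = 2 classes.
n_samples, n_classes = 200, 2
labels = rng.integers(0, n_classes, size=n_samples)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
x = centers[labels] + rng.standard_normal((n_samples, 2))
y = np.eye(n_classes)[labels]                 # one-hot encoding of the labels


def F(w, x):
    """Stand-in for the network F_w: linear scores followed by a softmax."""
    scores = x @ w
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)


def cost(w, x, y):
    """Sum of KL(y_i || F_w(x_i)); for one-hot y_i this is the cross-entropy."""
    return -np.sum(y * np.log(F(w, x) + 1e-12))


w = np.zeros((2, n_classes))
initial_cost = cost(w, x, y)
for it in range(500):
    I = rng.choice(n_samples, size=10, replace=False)     # random set of 10 indices
    dw = x[I].T @ (F(w, x[I]) - y[I])                     # gradient of the batch cost
    w -= 0.01 * dw

final_cost = cost(w, x, y)
accuracy = np.mean(F(w, x).argmax(axis=1) == labels)
```

With an autodiff framework, only the line computing `dw` changes: the hand-written formula is replaced by a call to `grad`, and F can then be an arbitrary convolutional architecture.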
Convolutional Neural Networks: a good compromise
[Figure: x → CNN → µ(x), m(x), M(x).]
Deep Art: an emblematic application (Nikulin, Novak (2016))
[Figure: three rows, each labelled with the CNN features µ, m, M.]
Major autodiff frameworks in python
Extending PyTorch
Computing the Hamiltonian
# Actual computations.
q_i = q.unsqueeze(1) # shape (N,D) -> (N,1,D)
q_j = q.unsqueeze(0) # shape (N,D) -> (1,N,D)
sqd = torch.sum( (q_i - q_j)**2 , 2 ) # |q_i-q_j|^2
K_qq = torch.exp( - sqd / (s**2) ) # Gaussian kernel
v = K_qq @ p # matrix mult. (N,N)@(N,D) = (N,D)
[Figure: backward graph of the computation, with N = 5000 and D = 3. Nodes: ∂Unsqueeze, ∂Sub, ∂PowConstant, ∂Sum, ∂Negate, ∂Div, ∂Exp, ∂Addmm, ∂View, ∂Dot, ∂MulConstant. Stored buffers include q_i − q_j of shape (5000, 5000, 3) and several arrays of shape (5000, 5000).]
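The KeOps library
[Figure: backward graph with a fused kernel product, for inputs q and p of shape (5000, 3). Nodes: ∂KernelProduct, ∂View, ∂Dot, ∂MulConstant. The only stored buffers are p, q, q of shape (5000, 3), s of shape (1), and flattened arrays of shape (15000).]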
Define custom operators
class KernelProduct(torch.autograd.Function):
    @staticmethod
    def forward(ctx, s, q1, q2, p, kernel_type):
        # save everything to compute the gradient
        ctx.save_for_backward( s, q1, q2, p )
        # init gamma, the output of the convolution K_(q1,q2) @ p
        gamma = torch.zeros(
            q1.size()[0] * p.size()[1] ).type(dtype)
        # Inplace CUDA routine on the raw float arrays,
        # loaded from .dll/.so files by the "pybind11" module
        cudaconv.cuda_conv( q1, q2, p, gamma, s,
                            kernel = kernel_type)
        gamma = gamma.view( q1.size()[0], p.size()[1] )
        return gamma
Define custom operators
    @staticmethod
    def backward(ctx, a):
        (s, q1, q2, p) = ctx.saved_tensors
        # In order to get second derivatives, we encapsulated the
        # cudagradconv.cuda_gradconv routine in another
        # torch.autograd.Function object:
        kernelproductgrad_x = KernelProductGrad_x.apply
        # Call the CUDA routines
        # ...
        grad_x = kernelproductgrad_x( ... )
        # ...
        return (grad_s, grad_q1, grad_q2, grad_p, None)
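As a self-contained illustration of the forward/backward pattern above, here is a minimal sketch of a custom operator for the dense product K @ p. It is not the KeOps `KernelProduct` operator: `MatMulOp` is a hypothetical name, and the CUDA routine is replaced by a plain matrix product so that the example runs wherever PyTorch does. `torch.autograd.gradcheck` compares the hand-written backward pass with finite differences.

```python
import torch


class MatMulOp(torch.autograd.Function):
    """Custom operator (K, p) -> K @ p with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, K, p):
        # Save the inputs: both are needed to compute the gradients.
        ctx.save_for_backward(K, p)
        return K @ p

    @staticmethod
    def backward(ctx, a):
        # a is the gradient of the final scalar with respect to the output K @ p.
        K, p = ctx.saved_tensors
        grad_K = a @ p.t()      # transposed differential with respect to K
        grad_p = K.t() @ a      # transposed differential with respect to p
        return grad_K, grad_p   # one gradient per input of forward


torch.manual_seed(0)
K = torch.randn(4, 4, dtype=torch.double, requires_grad=True)
p = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

# Numerical check of the backward pass (double precision is required).
backward_is_correct = torch.autograd.gradcheck(MatMulOp.apply, (K, p))
```

The backward method returns one value per argument of `forward`; non-differentiable arguments, such as `kernel_type` in the slide, receive `None`.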
KeOps:
Online Map-Reduce Operators,
with autodiff,
without memory overflows.
www.kernel-operations.io
What we provide
with:
• “Reduction”: Sum, Max, ArgMin, LogSumExp, ...
• A vector-valued formula: F.
• Vector parameters: $p^1, p^2, \dots$
• Vector x-variables, indexed by i: $x^1_i, x^2_i, \dots$
• Vector y-variables, indexed by j: $y^1_j, y^2_j, \dots$
With KeOps you will get:
• Linear memory footprint.
• High order derivatives – thank you Joan!
• Support for block-sparse (=cluster-aware) reductions.
KeOps’ low-level interface: generic_sum
gaussian_conv = generic_sum(
    "Exp(-G*SqDist(X,Y)) * B", # Custom formula
    "A = Vx(2)", # Output, 2D, indexed by i
    "G = Pm(1)", # 1st arg, 1D, parameter
    "X = Vx(3)", # 2nd arg, 3D, indexed by i
    "Y = Vy(3)", # 3rd arg, 3D, indexed by j
    "B = Vy(2)") # 4th arg, 2D, indexed by j
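To make the semantics of the `generic_sum` formula concrete, here is a NumPy reference for the quantity it describes, A_i = Σ_j exp(−G ‖X_i − Y_j‖²) B_j. This is a sketch of the mathematics only, not of the KeOps implementation: the function names and the block size are hypothetical. The second function processes the j-indexed variables block by block, which mimics the idea of an online reduction that never stores the full kernel matrix.

```python
import numpy as np


def gaussian_conv_dense(g, x, y, b):
    """Reference implementation: builds the full (N, M) kernel matrix."""
    sqd = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)   # (N, M)
    return np.exp(-g * sqd) @ b                                # (N, 2)


def gaussian_conv_online(g, x, y, b, block=256):
    """Same result, accumulating over blocks of the j-indexed variables."""
    out = np.zeros((x.shape[0], b.shape[1]))
    for start in range(0, y.shape[0], block):
        y_blk = y[start:start + block]
        b_blk = b[start:start + block]
        sqd = ((x[:, None, :] - y_blk[None, :, :]) ** 2).sum(axis=2)
        out += np.exp(-g * sqd) @ b_blk        # running sum over j
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((300, 3))     # X = Vx(3), indexed by i
y = rng.standard_normal((1000, 3))    # Y = Vy(3), indexed by j
b = rng.standard_normal((1000, 2))    # B = Vy(2), indexed by j
a_dense = gaussian_conv_dense(0.5, x, y, b)
a_online = gaussian_conv_online(0.5, x, y, b)
```

In this sketch the temporary buffers have size N × block instead of N × M; a GPU implementation can go further and keep each partial sum in registers.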
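It works!
[Figure: benchmark plot on logarithmic axes. Horizontal axis: number of points N, from 10^2 to 10^6. Vertical axis: from 10^-3 to 10^0. One annotation reads "out of mem".]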
Recap of today’s presentation
Key points:
Going further
pytorch.org
www.kernel-operations.io
www.math.ens.fr/~feydy/Teaching/
Thank you for your attention.