Automatic differentiation

for applied mathematicians


Is PyTorch the right tool for you?

Jean Feydy
February 2018
Écoles Normales Supérieures de Paris et Paris-Saclay
Overview

What is backpropagation?
A time-efficient algorithm to compute gradients.

How can we use it?
PyTorch provides a simple syntax and transparent CPU/GPU support.

Where is it useful?
Automatic tuning of variables (optimal control)
or of your transform’s parameters (neural networks).

Bonus: you can extend it easily.
Link with your homebrew CUDA routines!

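How do we compute a gradient?

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function. Then:

$$
\nabla f(x_0) =
\begin{pmatrix} \partial_{x_1} f(x_0) \\ \partial_{x_2} f(x_0) \\ \vdots \\ \partial_{x_n} f(x_0) \end{pmatrix}
\simeq \frac{1}{\delta t}
\begin{pmatrix}
f(x_0 + \delta t \cdot (1, 0, \dots, 0)) - f(x_0) \\
f(x_0 + \delta t \cdot (0, 1, \dots, 0)) - f(x_0) \\
\vdots \\
f(x_0 + \delta t \cdot (0, 0, \dots, 1)) - f(x_0)
\end{pmatrix}
$$

⟹ this costs (n+1) evaluations of f, which scales poorly with n.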
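
The finite-difference formula can be sketched in a few lines of NumPy. This is only an illustration: the helper name `finite_difference_gradient`, the test function `f` and the step `dt` are assumptions chosen for the example, not taken from the slides.

```python
import numpy as np

def finite_difference_gradient(f, x0, dt=1e-6):
    """Approximate the gradient of f at x0 with forward differences."""
    n = x0.size
    f0 = f(x0)  # 1 evaluation at the base point
    grad = np.zeros(n)
    for i in range(n):  # n more evaluations, one per coordinate
        e_i = np.zeros(n)
        e_i[i] = 1.0
        grad[i] = (f(x0 + dt * e_i) - f0) / dt
    return grad

# Hypothetical test function: f(x) = sum(x^2), whose gradient is 2x.
calls = {"count": 0}

def f(x):
    calls["count"] += 1
    return np.sum(x**2)

x0 = np.arange(5, dtype=float)
approx = finite_difference_gradient(f, x0)
print(approx)          # close to 2 * x0
print(calls["count"])  # n + 1 = 6 evaluations
```

For a cloud of 5,000 points in 3D, the same loop would call f 15,001 times to get a single gradient.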

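What is a gradient?

Let $f : x \in \mathbb{R}^n \mapsto y \in \mathbb{R}$ be smooth,
so that $\mathrm{d}f(x) : \mathrm{d}x \in \mathbb{R}^n \mapsto \mathrm{d}y \in \mathbb{R}$ is linear:

$$
\mathrm{d}f(x).\mathrm{d}x =
\begin{pmatrix} \partial_1 f(x) & \cdots & \partial_n f(x) \end{pmatrix}
\cdot
\begin{pmatrix} \mathrm{d}x_1 \\ \vdots \\ \mathrm{d}x_n \end{pmatrix}
= \mathrm{d}y
$$

We define $\partial f(x) = (\mathrm{d}f(x))^\star \simeq (\mathrm{d}f(x))^T$,
i.e. $\partial f(x) : \mathrm{d}y^\star \in \mathbb{R} \mapsto \mathrm{d}x^\star \in \mathbb{R}^n$:

$$
\partial f(x).\mathrm{d}y^\star =
\begin{pmatrix} \partial_1 f(x) \\ \vdots \\ \partial_n f(x) \end{pmatrix}
\cdot \mathrm{d}y^\star = \mathrm{d}x^\star
\quad\text{so that}\quad
\nabla f(x) = \partial f(x).1
$$
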
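Autodiff is simple – no magic!

This definition lets us compose gradients:

$$
\begin{aligned}
f &= h \circ g \\
\mathrm{d}f(x) &= \mathrm{d}h(g(x)) \circ \mathrm{d}g(x) \\
\partial f(x) &= (\mathrm{d}h(g(x)) \circ \mathrm{d}g(x))^T = \partial g(x) \circ \partial h(g(x)) \\
\nabla f(x) &= \partial f(x).1 = \partial g(x).(\partial h(g(x)).1)
\end{aligned}
$$

Forward pass:  input x --(g)--> g(x) --(h)--> h(g(x)) = output f(x)
Backward pass: output ∇f(x) = ∂g(x).(···) <--(∂g)-- ∂h(g(x)).1 <--(∂h)-- input 1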
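
The two passes can be checked on a toy example. In this sketch, the choice g(x) = sin(x) (elementwise) and h(y) = sum(y²) is an assumption made for illustration; the backward pass is written by hand, then compared with what `torch.autograd.grad` returns.

```python
import torch

# Illustrative choice (not from the slides): g(x) = sin(x) elementwise, h(y) = sum(y^2).
def g(x):
    return torch.sin(x)

def h(y):
    return (y**2).sum()

x = torch.tensor([0.1, 0.5, -1.2], dtype=torch.float64, requires_grad=True)

# Forward pass: store the intermediate value g(x).
y = g(x)
out = h(y)

# Backward pass, by hand: start from 1 and apply the adjoints in reverse order.
a = torch.tensor(1.0, dtype=torch.float64)
a = 2 * y.detach() * a          # ∂h(g(x)).1, since dh(y).dy = <2y, dy>
a = torch.cos(x.detach()) * a   # ∂g(x).a, since dg(x) is diag(cos(x))
manual_grad = a

# Same computation, delegated to PyTorch.
[auto_grad] = torch.autograd.grad(out, [x])
print(torch.allclose(manual_grad, auto_grad))
```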

What do you need to compute a gradient?

Backpropagating through a computational graph requires:

$$
f_i : E_{i-1} \to E_i, \qquad x \mapsto f_i(x) \tag{1}
$$

and

$$
\partial_x f_i : E_{i-1} \times E_i \to E_{i-1}, \qquad (x_0, a) \mapsto \partial_x f_i(x_0) \cdot a \tag{2}
$$

encoded as computer programs.

This is what PyTorch is all about.
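
As a hedged illustration of programs (1) and (2), here is how one operator and its adjoint can be bundled with `torch.autograd.Function`. The operator `Square` is a hypothetical example chosen for brevity; PyTorch already ships its own differentiable `x**2`.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical operator f(x) = x**2 (elementwise), bundled with its adjoint."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # keep x0 for the backward pass
        return x**2               # program (1): x -> f(x)

    @staticmethod
    def backward(ctx, a):
        (x0,) = ctx.saved_tensors
        return 2 * x0 * a         # program (2): (x0, a) -> ∂f(x0).a

x = torch.tensor([1.0, -2.0, 3.0], dtype=torch.float64, requires_grad=True)
y = Square.apply(x).sum()
[dx] = torch.autograd.grad(y, [x])
print(dx)  # equals 2 * x
```

The same pattern is the usual entry point for wrapping a hand-written routine: `forward` calls the routine, `backward` calls its adjoint.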

Autodiff is easy to use

PyTorch:

• Straightforward replacement of Matlab/Numpy.
• Operators $f : x \mapsto f(x)$ are bundled with their adjoints $\partial f : (x, \mathrm{d}y^\star) \mapsto \partial f(x).\mathrm{d}y^\star = \mathrm{d}x^\star$.
• Seamless GPU support.

Let’s see how it goes in practice!

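A typical formula: the kernel square norm

[Figure: two N-by-D arrays, with rows q_1, ..., q_N and p_1, ..., p_N.]

In shape analysis, algorithms often rely on the kernel dot product:

$$
\begin{aligned}
H(q, p) &= \frac{1}{2} \sum_{i,j} \exp\!\big(-\tfrac{1}{\sigma^2} \|q_i - q_j\|^2\big) \, \langle p_i, p_j \rangle_2 \\
&= \frac{1}{2} \sum_i \Big\langle p_i, \sum_j k(q_i - q_j) \, p_j \Big\rangle_2
= \frac{1}{2} \langle p, K_q \, p \rangle_2 .
\end{aligned}
$$
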
Numpy, in practice

import numpy as np  # standard array library

N, D = 5000, 3  # cloud of 5,000 points in 3D
q = np.random.rand(N, D)
p = np.random.rand(N, D)
s = 1.0  # kernel radius sigma

# Re-indexing:
q_i = q[:, np.newaxis, :]  # shape (N,D) -> (N,1,D)
q_j = q[np.newaxis, :, :]  # shape (N,D) -> (1,N,D)

# Actual computations:
sqd = np.sum((q_i - q_j) ** 2, 2)  # |q_i-q_j|^2, shape (N,N)
K_qq = np.exp(-sqd / s**2)  # Gaussian kernel
v = K_qq @ p  # matrix mult. (N,N)@(N,D) = (N,D)

# Finally, output the kernel norm H(q,p): .5*<p,v>
H = 0.5 * np.inner(p.ravel(), v.ravel())

H = 6029309.1348486
Elapsed time: 3.01s
PyTorch, in practice

import torch  # GPU + autodiff library
from torch.autograd import grad

# With PyTorch, using the GPU is that simple:
use_gpu = torch.cuda.is_available()
device = torch.device("cuda" if use_gpu else "cpu")

# Store arbitrary arrays on the CPU or GPU, as float32 tensors:
q = torch.from_numpy(q).to(device=device, dtype=torch.float32)
p = torch.from_numpy(p).to(device=device, dtype=torch.float32)
s = torch.tensor([1.0], device=device)

# Tell PyTorch to track the variables "q" and "p"
q.requires_grad = True
p.requires_grad = True

PyTorch, in practice

# Re-indexing:
q_i = q[:, None, :]  # shape (N,D) -> (N,1,D)
q_j = q[None, :, :]  # shape (N,D) -> (1,N,D)

# Actual computations:
sqd = torch.sum((q_i - q_j) ** 2, 2)  # |q_i-q_j|^2
K_qq = torch.exp(-sqd / s**2)  # Gaussian kernel
v = K_qq @ p  # matrix mult. (N,N)@(N,D) = (N,D)

# Output the kernel norm H(q,p): .5*<p,v>
H = 0.5 * torch.dot(p.view(-1), v.view(-1))

H = 6029309.0
Elapsed time: 0.31s

# Automatic differentiation is straightforward:
[dq, dp] = grad(H, [q, p])

dq.shape == q.shape ; dp.shape == p.shape
Elapsed time: 0.03s
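
One way to gain confidence in the result is to compare it with a formula known by hand. Since the kernel matrix is symmetric, the gradient of .5*<p, K_qq p> with respect to p is K_qq p, that is, the array v itself. The sketch below reruns the computation on a small cloud; the size N = 50, the seed and the use of float64 are assumptions made so that the check runs instantly on a CPU.

```python
import torch
from torch.autograd import grad

torch.manual_seed(0)
N, D = 50, 3  # small cloud, so that the check runs instantly on a CPU
q = torch.rand(N, D, dtype=torch.float64, requires_grad=True)
p = torch.rand(N, D, dtype=torch.float64, requires_grad=True)
s = 1.0

sqd = ((q[:, None, :] - q[None, :, :]) ** 2).sum(2)  # |q_i-q_j|^2
K_qq = torch.exp(-sqd / s**2)                        # Gaussian kernel
v = K_qq @ p
H = 0.5 * torch.dot(p.view(-1), v.view(-1))

[dq, dp] = grad(H, [q, p])

# K_qq is symmetric, so the gradient of .5*<p, K_qq p> with respect to p is K_qq p.
print(torch.allclose(dp, v.detach()))
```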
[Figure: computational graph recorded by PyTorch for H(q, p). Forward values and shapes: q (5000, 3); q_i − q_j (5000, 5000, 3); −‖q_i − q_j‖² (5000, 5000); s² (1); −‖q_i − q_j‖²/s² (5000, 5000); K_{q,q} (5000, 5000); p (5000, 3); p and K_{q,q} p flattened to (15000). Backward nodes: ∂Unsqueeze, ∂Sub, ∂PowConstant, ∂Sum, ∂Negate, ∂Div, ∂Exp, ∂Addmm, ∂View, ∂Dot, ∂MulConstant.]

Using PyTorch for Optimal Control
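Ballistic 101

Take two locations in the plane $\mathbb{R}^2$:

$$
x_0 = \begin{pmatrix} 0 \\ .5 \end{pmatrix}
\quad\text{and}\quad
\tilde{x} = \begin{pmatrix} 7 \\ 2 \end{pmatrix}.
$$

Assume that the trajectory $x_t$ follows Newton’s laws of motion:

$$
\ddot{x}_t = \begin{pmatrix} 0 \\ -g \end{pmatrix}.
$$

Optimal Control problem: find a momentum $P \in \mathbb{R}^2$ such that

$$
m \, \dot{x}_0 = P \implies x_1 \simeq \tilde{x}.
$$
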
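PyTorch allows you to work with the proper equations!

Using the position-momentum coordinates

$$
q_t = x_t, \qquad p_t = m \, v_t,
$$

write down the expression of the mechanical energy

$$
\begin{aligned}
E_{mec}(x, v) &= m g \cdot x[2] + \tfrac{1}{2} m \|v\|^2, \\
E_{mec}(q, p) &= m g \cdot q[2] + \tfrac{1}{2m} \|p\|^2 .
\end{aligned}
$$

Then (Hamilton, 1833; Pontryagin, 1956):

$$
\begin{cases}
\dot{q}_t = v_t = +\tfrac{1}{m} p_t = +\dfrac{\partial E_{mec}}{\partial p}(q_t, p_t) \\[2mm]
\dot{p}_t = m \, \dot{v}_t = (0, -mg) = -\dfrac{\partial E_{mec}}{\partial q}(q_t, p_t)
\end{cases}
$$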
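
These two equations can be verified numerically by differentiating the energy. In this minimal sketch, m and g are plain Python floats and the state (q, p) is an arbitrary test point; both are assumptions made for the example.

```python
import torch
from torch.autograd import grad

m, g = 15.0, 9.81  # mass and gravity, as plain Python floats for this sketch

def Emec(q, p):
    # The slides write q[2] for the vertical coordinate (1-based);
    # with Python's 0-based indexing this is q[1].
    return m * g * q[1] + (p**2).sum() / (2 * m)

q = torch.tensor([0.0, 0.5], dtype=torch.float64, requires_grad=True)
p = torch.tensor([60.0, 30.0], dtype=torch.float64, requires_grad=True)

[dE_dq, dE_dp] = grad(Emec(q, p), [q, p])

q_dot = dE_dp    # +dE/dp
p_dot = -dE_dq   # -dE/dq
print(q_dot)  # equals p / m
print(p_dot)  # equals (0, -m*g)
```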

Setting the parameters of our model

import torch  # GPU + autodiff library
from torch import Tensor
from torch.autograd import grad

# Set the parameters of our model:
g = torch.tensor([9.81], requires_grad=True)
m = torch.tensor([15.0], requires_grad=True)
source = torch.tensor([0.0, 0.5], requires_grad=True)
target = torch.tensor([7.0, 2.0], requires_grad=True)

Defining a cost to optimize

def cost(m, g, P):
    "Cost associated to a simple ballistic problem."

    def Emec(q, p):
        "Particle of mass m in a gravitational field g."
        return m * g * q[1] + (p**2).sum() / (2 * m)

    # Initial condition:
    qt = source
    pt = P

    # Simple Euler scheme:
    for it in range(10):
        [dq, dp] = grad(Emec(qt, pt), [qt, pt], create_graph=True)
        qt = qt + 0.1 * dp
        pt = pt - 0.1 * dq

    # Return the squared distance to the target:
    return ((qt - target) ** 2).sum()

Solving the control problem through gradient descent

P = torch.tensor([60.0, 30.0], requires_grad=True)

lr = 5.0
for it in range(100):
    [dP] = grad(cost(m, g, P), [P])
    P.data -= lr * dP.data  # gradient step, not recorded by autograd
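
For this simple model, the descent can be checked against a closed form. Unrolling the ten Euler steps by hand gives q_10 = source + P/m − 0.45·(0, g), so the cost vanishes at P* = m·(target − source + 0.45·(0, g)). The sketch below puts the pieces together in one script; treating m, g and target as constants and updating P under `torch.no_grad()` are choices made for this example.

```python
import torch
from torch.autograd import grad

m = torch.tensor([15.0])
g = torch.tensor([9.81])
source = torch.tensor([0.0, 0.5], requires_grad=True)
target = torch.tensor([7.0, 2.0])

def cost(m, g, P):
    def Emec(q, p):
        return m * g * q[1] + (p**2).sum() / (2 * m)
    qt, pt = source, P
    for it in range(10):  # explicit Euler scheme with time step 0.1
        [dq, dp] = grad(Emec(qt, pt), [qt, pt], create_graph=True)
        qt = qt + 0.1 * dp
        pt = pt - 0.1 * dq
    return ((qt - target) ** 2).sum()

P = torch.tensor([60.0, 30.0], requires_grad=True)
lr = 5.0
for it in range(100):
    [dP] = grad(cost(m, g, P), [P])
    with torch.no_grad():
        P -= lr * dP

# Closed form for this scheme: q_10 = source + P/m - 0.45 * (0, g),
# so the cost vanishes at P* = m * (target - source + 0.45 * (0, g)).
P_star = m * (target - source.detach() + 0.45 * torch.tensor([0.0, 9.81]))
final_cost = cost(m, g, P).item()
print(P.detach(), P_star, final_cost)
```

With lr = 5 and m = 15, each step shrinks the error P − P* by a factor 1 − 2·lr/m² ≈ 0.956, so 100 iterations reduce it roughly a hundredfold.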


Putting randomness into our model

def cost(m, g, P):
    "Cost associated to a simple ballistic problem."

    def Emec(q, p):
        "Particle of mass m in a gravitational field g."
        return m * g * q[1] + (p**2).sum() / (2 * m)

    # Initial condition:
    qt = source
    pt = P

    # Simple Euler scheme:
    for it in range(10):
        [dq, dp] = grad(Emec(qt, pt), [qt, pt], create_graph=True)
        dq = dq + qt[1] * 20 * torch.randn(2)  # random force, scaled by the altitude
        qt = qt + 0.1 * dp
        pt = pt - 0.1 * dq

    # Return the squared distance to the target:
    return ((qt - target) ** 2).sum()

Optimizing a noisy command

P = torch.tensor([60.0, 30.0], requires_grad=True)

lr = 5.0
for it in range(100):
    [dP] = grad(cost(m, g, P), [P])
    P.data -= lr * dP.data  # gradient step, not recorded by autograd

Optimizing wrt. the gravitational field

g = torch.tensor([9.81], requires_grad=True)

lr = 0.1
for it in range(100):
    [dg] = grad(cost(m, g, P), [g])
    g.data -= lr * dg.data  # gradient step, not recorded by autograd

Recap on the basic usage of PyTorch

PyTorch is a simple replacement for numpy:

• Use torch.Tensor instead of numpy.array.
• Back-and-forth: from_numpy(...) and .numpy().
• GPU backend: .cuda() and .cpu().
• Tensor(...) = (data, graph history) container: check A.data and A.grad_fn.
• Compute gradients with grad(A, [...]).
• High-order derivatives are supported!
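
The last two bullets can be illustrated on a one-line function. This is a minimal sketch; the function y = x³ and the evaluation point x = 2 are arbitrary choices.

```python
import torch
from torch.autograd import grad

x = torch.tensor(2.0, requires_grad=True)
y = x**3

# create_graph=True records the gradient computation itself,
# so that the result can be differentiated again.
[dy] = grad(y, [x], create_graph=True)  # 3 * x^2
[d2y] = grad(dy, [x])                   # 6 * x
print(dy.item(), d2y.item())  # 12.0 12.0
```

This is the mechanism used in the ballistic example, where the cost is differentiated through the gradients of the mechanical energy.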

Convolutional “neural” networks:
optimizing a multiscale transform
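The (Discrete) Fourier Transform

Given a signal a, compute the coefficients

$$
\hat{a}(\omega) = \langle e_\omega, a \rangle_{L^2},
\quad\text{where}\quad e_\omega : x \mapsto e^{i \omega \cdot x} .
$$

[Figure: a signal f(x) and its log-spectrum log(|f̂(ω)|).]
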
The (Discrete) Fourier Transform

This transform allows us to apply Gaussian blur, unsharp filters or Wiener denoising.

[Figures, one per slide: Original image. With a Gaussian white noise. Blurred with a Gaussian filter. Denoised with a Wiener filter.]
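
As an example of such a Fourier-domain filter, here is a hedged NumPy sketch of a Gaussian blur. The random test image, its size and the value of `sigma` are assumptions; the FFT implies periodic boundary conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width = 64, 64
image = rng.random((height, width))  # stand-in for a real image

# Frequencies (in cycles per pixel) along each axis.
fy = np.fft.fftfreq(height)[:, None]
fx = np.fft.fftfreq(width)[None, :]

# The Fourier transform of a Gaussian of standard deviation sigma (in pixels)
# is a Gaussian in frequency: exp(-2 pi^2 sigma^2 |f|^2).
sigma = 2.0
transfer = np.exp(-2 * np.pi**2 * sigma**2 * (fx**2 + fy**2))

# Blur = pointwise product in the Fourier domain (periodic boundary conditions).
blurred = np.fft.ifft2(np.fft.fft2(image) * transfer).real
print(image.std(), blurred.std())  # the blur reduces the contrast
```

The transfer function equals 1 at the zero frequency, so the mean intensity of the image is preserved.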


Going beyond linear signal processing

Super-clever algorithms...

Do not scale well – at all.

As of 2018, we can only implement basic algorithms on clever representations. We strive to find relevant mappings

$$
F : a \in \mathbb{R}^{W \times H} \mapsto b \in \mathbb{R}^N .
$$

Wavelet transforms: Fourier++

Compute linear features by enforcing two priors:

• Features should be localized and translation-covariant:

$$
b_{i, x, y} = (\varphi_i \star a)(x, y).
$$

[Figure: Gabor filter responses, from the scikit-learn doc.]

• Multiscale prior: features are built in cascade from finer scales,

$$
b_{(i_1, \dots, i_k), x, y} = (\psi_{i_k} \star \cdots \star \varphi_{i_1} \star a)(x, y),
$$

with filters of (geometrically) increasing radii – this is algorithmically enforced through the subsampling of feature maps.

[Figure from: Understanding Deep Convolutional Networks (Mallat, 2016).]

A real-life application: JPEG 2000

Standard format in cinemas:

• Subsample the coarse scales.
• Only store the large coefficients.

[Figure: image by Alessio Damato, from Wikipedia.]

How do we choose the convolution filters?

Fast Wavelet Transform (Mallat, 1989): Given a lowpass and a highpass filter of size k, compute a multiscale decomposition of a signal of size n in O(k · n) operations.

Daubechies filters (1992): For a given index p, there exists a pair of filters of size 2p generating an orthogonal transform that perfectly factorizes locally polynomial signals of degree < p.

Here is the Db2 pair:

index        -1      0     +1      2
lowpass    -.129   .224   .837   .483
highpass   -.483   .837  -.224  -.129
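
Both properties can be checked numerically for p = 2. In this sketch, the closed-form Db2 coefficients are standard values that round to the table above; the helper `dwt_level`, the periodic boundary handling and the test signals are assumptions made for the example.

```python
import numpy as np

s3, s2 = np.sqrt(3.0), np.sqrt(2.0)
# Exact Db2 coefficients; rounded to 3 digits they match the table above.
lowpass = np.array([1 - s3, 3 - s3, 3 + s3, 1 + s3]) / (4 * s2)
highpass = np.array([-(1 + s3), 3 + s3, -(3 - s3), 1 - s3]) / (4 * s2)

def dwt_level(signal):
    """One level of the periodic transform: filter, then keep every other sample."""
    n = signal.size
    idx = (np.arange(0, n, 2)[:, None] + np.arange(4)[None, :]) % n
    windows = signal[idx]  # shape (n/2, 4): one window of 4 samples per output
    return windows @ lowpass, windows @ highpass

rng = np.random.default_rng(0)
signal = rng.random(16)
approx, detail = dwt_level(signal)

# Orthogonal transform: the energy of the signal is preserved.
energy_in = np.sum(signal**2)
energy_out = np.sum(approx**2) + np.sum(detail**2)
print(np.isclose(energy_in, energy_out))

# Polynomials of degree < 2 give zero highpass coefficients,
# except for the last window, which wraps around the periodic boundary.
ramp_approx, ramp_detail = dwt_level(np.arange(16.0))
print(np.allclose(ramp_detail[:-1], 0.0))
```

Each output costs k = 4 multiplications and each level halves the signal, which is where the O(k · n) total comes from.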


Scattering Transform: |Fourier|++

We use a wavelet transform:

$$
F_{wav} : \mathbb{R}^{W \times H} \to \mathbb{R}^{N \times W \times H}, \qquad
a \mapsto \big( \psi_1 \star \varphi_1 \star a \,(\cdot, \cdot),\ \psi_1 \star \varphi_2 \star a \,(\cdot, \cdot),\ \cdots \big)
$$

We use a scattering transform:

$$
F_{scat} : \mathbb{R}^{W \times H} \to \mathbb{R}_+^{N \times W \times H}, \qquad
a \mapsto \big( |\psi_1 \star |\varphi_1 \star a||(\cdot, \cdot),\ |\psi_1 \star |\varphi_2 \star a||(\cdot, \cdot),\ \cdots \big)
$$

We use scattering momenta:

$$
F^1_{scat} : \mathbb{R}^{W \times H} \to \mathbb{R}_+^{N}, \qquad
a \mapsto \big( \|\psi_1 \star |\varphi_1 \star a|\|_1,\ \|\psi_1 \star |\varphi_2 \star a|\|_1,\ \cdots \big)
$$

Texture synthesis: an optimal control problem.

Given an image Y and a transform F, find, by gradient descent from a random starting point, a synthesized image X such that

$$
F(X) \simeq F(Y).
$$
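
The synthesis loop itself is short once gradients are automatic. The sketch below is a toy version under strong assumptions: the transform `features` is a crude stand-in (L1 averages of responses to a few random filters) rather than an actual scattering transform, and the striped reference image, the step size and the iteration count are arbitrary.

```python
import math
import torch
from torch.nn.functional import conv2d

torch.manual_seed(0)

# Stand-in transform (an assumption, much cruder than a scattering transform):
# average absolute responses to a few fixed random 5x5 filters.
filters = torch.randn(8, 1, 5, 5)

def features(img):
    responses = conv2d(img[None, None], filters)  # shape (1, 8, 28, 28)
    return responses.abs().mean(dim=(0, 2, 3))    # shape (8,)

# Reference "texture": vertical stripes of period 8 pixels.
cols = torch.arange(32, dtype=torch.float32)
Y = (0.5 + 0.5 * torch.sin(2 * math.pi * cols / 8)).repeat(32, 1)
target_features = features(Y)

X = torch.rand(32, 32, requires_grad=True)  # random starting point

def loss_fn(img):
    return ((features(img) - target_features) ** 2).sum()

initial_loss = loss_fn(X).item()
lr = 0.1
for it in range(200):
    [dX] = torch.autograd.grad(loss_fn(X), [X])
    with torch.no_grad():
        X -= lr * dX  # gradient step on the pixels of X
final_loss = loss_fn(X).item()
print(initial_loss, final_loss)
```

Only the definition of `features` would change when moving to a genuine scattering transform; the descent loop stays the same.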

29
Using scattering momenta to characterize textured patches

Understanding Deep Convolutional Networks (Mallat, 2016).


Texture synthesis: Original patches.

30
Using scattering momenta to characterize textured patches

Understanding Deep Convolutional Networks (Mallat, 2016).


Texture synthesis: Synthetized from covariance momenta.

30
Using scattering momenta to characterize textured patches

Understanding Deep Convolutional Networks (Mallat, 2016).


Texture synthesis: Synthetized from scattering momenta.

30
Using scattering momenta to characterize textured patches

Understanding Deep Convolutional Networks (Mallat, 2016):


Trying to synthetize a photo using scattering momenta...

Pure maths can only take you so far.

Thankfully, you can now go beyond explicit formulas.

Problem: classification of web-like images

ImageNet: 20,000+ classes and 14 million+ images; the ILSVRC benchmark subset has 1,000 classes with roughly 1,000 samples each.

Let’s restrict ourselves to a subset of C classes.

The dataset is seen as a collection

$(x_i, y_i) \in \mathbb{R}^{W \times H} \times \{1, \dots, C\} \;\simeq\; (x_i, \delta_{y_i}) \in \mathbb{R}^{W \times H} \times [0, 1]^C,$

and we try to learn a sensible classifier

$F_w : \mathbb{R}^{W \times H} \to [0, 1]^C$

such that for every index i,

$F_w(x_i) \simeq \delta_{y_i}.$

Convolutional Neural Networks: |Fourier|+++

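Multiscale transform $F_{\mathrm{feat}} : \mathbb{R}^{W \times H} \to \mathbb{R}^N$ combined with a classifier

$F_{\mathrm{class}} : x \in \mathbb{R}^N \mapsto \mathrm{Softmax}\big(M_2 (M_1 x)_+\big),$

with $M_1$ an H-by-N matrix, $M_2$ a C-by-H matrix (H being the size of the hidden layer) and

$\mathrm{Softmax} : x \in \mathbb{R}^C \mapsto \Big( \frac{\exp(x_i)}{\sum_j \exp(x_j)} \Big)_i \in [0, 1]^C.$

[Figure from: Speed sign detection and recognition by convolutional neural networks, Peemen et al. (2011).]
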
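As an illustration of the classifier formula, here is a minimal pure-Python sketch of Softmax(M2 (M1 x)+) on hand-picked toy matrices with N = 3 features, 2 hidden units and C = 4 classes. The helper names and the numbers are illustrative only.

```python
import math

def matvec(M, v):
    """Matrix-vector product, with M given as a list of rows."""
    return [sum(m * t for m, t in zip(row, v)) for row in M]

def relu(v):
    """Positive part (.)_+, applied coordinate-wise."""
    return [max(0.0, t) for t in v]

def softmax(v):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(v)
    e = [math.exp(t - shift) for t in v]
    z = sum(e)
    return [t / z for t in e]

def classify(M1, M2, x):
    return softmax(matvec(M2, relu(matvec(M1, x))))

M1 = [[1.0, -1.0, 0.0],
      [0.0, 2.0, 1.0]]          # 2-by-3: hidden size by N
M2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [-1.0, 0.5]]              # 4-by-2: C by hidden size

probs = classify(M1, M2, [0.5, 1.0, -1.0])
```

Here M1 x = (-0.5, 1), its positive part is (0, 1), and the logits are (0, 1, 1, 0.5), so classes 2 and 3 tie for the highest probability.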
Convolutional Neural Networks are trainable multiscale transforms

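The full transform is parameterized by:

a set of convolution filters
+ a few matrices in the classifier
= a large vector w.

Let’s optimize these weights using stochastic gradient descent!

# Schematic loop: F (the network), KL (the cost) and the dataset (x, y)
# are placeholders that must be defined beforehand.
import random
from torch.autograd import grad

for it in range(1_000_000):
    I = random.sample(range(len(x)), 10)            # random set of 10 indices
    cost = sum(KL(F(w, x[i]), y[i]) for i in I)     # sum over i in I of |F_w(x_i) - y_i|_KL
    dw = grad( cost, [w] )[0]
    w.data -= .001 * dw.data

(You’d better own a good GPU!)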
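For readers who want to run the loop without a GPU, here is a dependency-free sketch on a toy problem. It makes several assumptions that are not in the talk: the "network" is a plain softmax regression on two Gaussian blobs in the plane, the KL cost against a one-hot label reduces to the cross-entropy -log F_w(x_i)[y_i], and its gradient (p - one-hot) xᵀ is written by hand instead of being obtained from `grad`. The batch size (10) and step (.001) are those of the slide; the helper names are illustrative.

```python
import math
import random

random.seed(0)
# Toy dataset: two Gaussian blobs in the plane, labeled 0 and 1.
data = [([random.gauss(-2.0, 0.5), random.gauss(0.0, 0.5)], 0) for _ in range(50)]
data += [([random.gauss(2.0, 0.5), random.gauss(0.0, 0.5)], 1) for _ in range(50)]

def predict(w, x):
    """Softmax of the linear scores w @ x."""
    logits = [sum(a * b for a, b in zip(row, x)) for row in w]
    top = max(logits)
    e = [math.exp(l - top) for l in logits]
    z = sum(e)
    return [v / z for v in e]

def mean_cost(w, samples):
    """Average cross-entropy, i.e. KL divergence to the one-hot labels."""
    return -sum(math.log(predict(w, x)[y]) for x, y in samples) / len(samples)

w = [[0.0, 0.0], [0.0, 0.0]]             # one row of weights per class
initial_cost = mean_cost(w, data)

for it in range(500):
    batch = random.sample(data, 10)       # random set of 10 samples
    dw = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in batch:
        p = predict(w, x)
        for c in range(2):
            err = p[c] - (1.0 if c == y else 0.0)
            for k in range(2):
                dw[c][k] += err * x[k]    # gradient of the batch cost
    for c in range(2):
        for k in range(2):
            w[c][k] -= 0.001 * dw[c][k]   # gradient step, as on the slide

final_cost = mean_cost(w, data)
accuracy = sum(predict(w, x).index(max(predict(w, x))) == y for x, y in data) / len(data)
```

The structure is the same as in the schematic loop: sample a mini-batch, evaluate the cost, get its gradient, take a small step. Only the gradient computation changes when an autodiff engine is available.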


Convolutional Neural Networks: Texture + Structure

Hopeful CNN visualization, from vision03.csail.mit.edu/cnn_art/.

Convolutional Neural Networks: a good compromise

Wavelets ≃ JPEG2000:

• Super cheap.
• Lots of structure.
• Encode a multiscale prior on images.

By tuning the coefficients of such a transform on a database of labeled images,
we get a CNN ≃ “JPEG 2020” that is adapted to the problem.

[Figure: the CNN maps x to µ(x), m(x), M(x).]

Deep Art: an emblematic application (Nikulin and Novak, 2016)

[Figure sequence: rows of images, each labeled µ, m, M.]

Major autodiff frameworks in Python

• Theano (MILA, 2008-2017):
    • Turns a computational graph into C++ code.
    • Relies on g++ to provide a linked executable.

• TensorFlow (Google, 2015- ):
    • Operations are pre-compiled.
    • Strong emphasis put on large-scale deployment.

• PyTorch (Facebook, 2016- ):
    • Straightforward numpy replacement.
    • Strong emphasis put on flexibility.

Extending PyTorch
Computing the Hamiltonian

import torch
from torch.autograd import grad
from torchviz import make_dot   # assumption: graph display through the torchviz package

# q, p: (N,D) tensors with requires_grad=True; s: kernel scale.

# Actual computations.
q_i = q[:, None, :]                      # shape (N,D) -> (N,1,D)
q_j = q[None, :, :]                      # shape (N,D) -> (1,N,D)
sqd = torch.sum( (q_i - q_j)**2 , 2 )    # |q_i-q_j|^2, shape (N,N)
K_qq = torch.exp( - sqd / (s**2) )       # Gaussian kernel, shape (N,N)
v = K_qq @ p                             # matrix mult. (N,N)@(N,D) = (N,D)

# Finally, compute the Hamiltonian H(q,p): .5*<p,v>
H = .5 * torch.dot( p.view(-1), v.view(-1) )

# Automatic differentiation is straightforward
[dq,dp] = grad( H, [q,p] )

On the GPU, this fails with:

RuntimeError: cuda runtime error (2) : out of memory at
/opt/conda/.../THCStorage.cu:66

# Display -- see next figure.
make_dot(H, {'q':q, 'p':p, 's':s}).render(view=True)
[Figure: computational graph of H, as rendered by make_dot.
Inputs: q (5000, 3), p (5000, 3), s² (1).
Backward nodes: Unsqueeze (twice), Sub, PowConstant (twice), Sum, Negate, Div, Exp, Addmm, View (twice), Dot, MulConstant.
Stored intermediate tensors: q_i − q_j (5000, 5000, 3); −‖q_i − q_j‖² (5000, 5000); −‖q_i − q_j‖²/s² (5000, 5000); K_qq (5000, 5000); the flattened p and K_qq p (15000 each).]

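A back-of-the-envelope count explains the out-of-memory error. The sketch below assumes float32 storage (4 bytes per entry) and only counts the large intermediate tensors that appear in the graph; the actual peak usage also depends on temporaries and on the allocator.

```python
N, D = 5000, 3
BYTES_PER_FLOAT32 = 4   # assumption: single-precision tensors

# Large tensors kept alive for the backward pass, read off the graph.
saved = {
    "q_i - q_j": (N, N, D),
    "-|q_i - q_j|^2": (N, N),
    "-|q_i - q_j|^2 / s^2": (N, N),
    "K_qq": (N, N),
}

def nbytes(shape):
    total = BYTES_PER_FLOAT32
    for dim in shape:
        total *= dim
    return total

sizes_mb = {name: nbytes(shape) / 1e6 for name, shape in saved.items()}
dense_total_mb = sum(sizes_mb.values())
inputs_mb = (2 * N * D + 1) * BYTES_PER_FLOAT32 / 1e6   # q, p and s
```

The inputs weigh about 0.12 MB, while the stored intermediates add up to 600 MB for N = 5000. Since these buffers grow like N², going to N = 50,000 would multiply them by 100.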
The KeOps library

# Compute the kernel convolution


v = kernelproduct(s, q, q, p, "gaussian")
# Then, compute the Hamiltonian H(q,p): .5*<p,v>
H = .5 * torch.dot( p.view(-1), v.view(-1) )

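[Figure: computational graph of H with KeOps.
Inputs: q (5000, 3) and p (5000, 3).
Backward nodes: KernelProduct, View (twice), Dot, MulConstant.
Stored tensors: p, q, q (5000, 3 each), s (1), and the flattened p and K_qq p (15000 each). No (5000, 5000) tensor appears.]
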
Define custom operators

class KernelProduct(torch.autograd.Function):

    @staticmethod
    def forward(ctx, s, q1, q2, p, kernel_type):
        # save everything to compute the gradient
        ctx.save_for_backward( s, q1, q2, p )
        # init gamma, the output of the convolution K_(q1,q2) @ p
        gamma = torch.zeros(
            q1.size()[0] * p.size()[1] ).type(dtype)
        # Inplace CUDA routine on the raw float arrays,
        # loaded from .dll/.so files by the "pybind11" module
        cudaconv.cuda_conv( q1, q2, p, gamma, s,
                            kernel = kernel_type)
        gamma = gamma.view( q1.size()[0], p.size()[1] )
        return gamma

    @staticmethod
    def backward(ctx, a):
        # (ctx.saved_variables in older PyTorch releases)
        (s, q1, q2, p) = ctx.saved_tensors
        # In order to get second derivatives, we encapsulated the
        # cudagradconv.cuda_gradconv routine in another
        # torch.autograd.Function object:
        kernelproductgrad_x = KernelProductGrad_x.apply
        # Call the CUDA routines
        # ...
        grad_x = kernelproductgrad_x( ... )
        # ...
        # One gradient per input of forward; None for kernel_type.
        return (grad_s, grad_q1, grad_q2, grad_p, None)

⇒ You can do it!
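The contract between `forward` and `backward` can be tested without any CUDA code. The sketch below is a pure-Python stand-in, not the talk's implementation: `forward` computes the Gaussian kernel product, `backward` returns hand-derived gradients of the pairing ⟨a, gamma⟩ with respect to q1 and p only (the gradients in s and q2 are left out), and a central finite difference checks two entries. All names and test values are illustrative.

```python
import math

def forward(s, q1, q2, p):
    """gamma_i = sum_j exp(-|q1_i - q2_j|^2 / s^2) * p_j."""
    gamma = []
    for xi in q1:
        row = [0.0] * len(p[0])
        for yj, pj in zip(q2, p):
            k = math.exp(-sum((u - v) ** 2 for u, v in zip(xi, yj)) / s ** 2)
            row = [r + k * c for r, c in zip(row, pj)]
        gamma.append(row)
    return gamma

def backward(s, q1, q2, p, a):
    """Gradients of <a, gamma> with respect to q1 and p, derived by hand."""
    grad_q1 = [[0.0] * len(q1[0]) for _ in q1]
    grad_p = [[0.0] * len(p[0]) for _ in p]
    for i, xi in enumerate(q1):
        for j, (yj, pj) in enumerate(zip(q2, p)):
            k = math.exp(-sum((u - v) ** 2 for u, v in zip(xi, yj)) / s ** 2)
            a_dot_p = sum(u * v for u, v in zip(a[i], pj))
            for d in range(len(xi)):
                grad_q1[i][d] += -2.0 * (xi[d] - yj[d]) / s ** 2 * k * a_dot_p
            for d in range(len(pj)):
                grad_p[j][d] += k * a[i][d]
    return grad_q1, grad_p

def pairing(s, q1, q2, p, a):
    gamma = forward(s, q1, q2, p)
    return sum(u * v for g_row, a_row in zip(gamma, a) for u, v in zip(g_row, a_row))

def shifted(M, i, d, h):
    """Copy of M with entry (i, d) shifted by h."""
    out = [row[:] for row in M]
    out[i][d] += h
    return out

s = 1.5
q1 = [[0.0, 0.0], [1.0, 0.5]]
q2 = [[0.2, -0.3], [1.0, 1.0], [-0.5, 0.4]]
p = [[1.0, -2.0], [0.5, 0.3], [-1.0, 1.5]]
a = [[0.7, -0.2], [0.1, 0.9]]
grad_q1, grad_p = backward(s, q1, q2, p, a)

eps = 1e-6
fd_q1 = (pairing(s, shifted(q1, 0, 1, eps), q2, p, a)
         - pairing(s, shifted(q1, 0, 1, -eps), q2, p, a)) / (2 * eps)
fd_p = (pairing(s, q1, q2, shifted(p, 1, 0, eps), a)
        - pairing(s, q1, q2, shifted(p, 1, 0, -eps), a)) / (2 * eps)
```

This kind of finite-difference comparison is a cheap sanity check to run on a custom operator before trusting its gradients inside a larger computation.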

KeOps:
Online Map-Reduce Operators,
with autodiff,
without memory overflows.

www.kernel-operations.io

pip install pykeops


(Thank you Benjamin!)

What we provide

For i = 1, …, N, you want to compute:

$a_i = \mathrm{Reduction}_{j=1,\dots,M} \big[ F(p^1, p^2, \dots, x^1_i, x^2_i, \dots, y^1_j, y^2_j, \dots) \big],$

with:
• “Reduction”: Sum, Max, ArgMin, LogSumExp, ...
• A vector-valued formula: F.
• Vector parameters: p^1, p^2, …
• Vector x-variables, indexed by i: x^1_i, x^2_i, …
• Vector y-variables, indexed by j: y^1_j, y^2_j, …

With KeOps you will get:
• Linear memory footprint.
• High order derivatives – thank you Joan!
• Support for block-sparse (=cluster-aware) reductions.
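The idea of an online reduction can be sketched in a few lines of pure Python. This is an illustration of the principle only, not of how KeOps is implemented: for each i, the values F(x_i, y_j) are consumed one at a time by a running accumulator, so the N-by-M table is never stored. For brevity, the formula is scalar-valued and parameters are captured by closure; the names `REDUCTIONS` and `reduce_rows` are hypothetical.

```python
import math

def _lse_update(acc, j, v):
    """Running (max, rescaled sum) pair for a stable LogSumExp."""
    m = max(acc[0], v)
    return (m, acc[1] * math.exp(acc[0] - m) + math.exp(v - m))

# name: (initial accumulator, update rule, read-out)
REDUCTIONS = {
    "Sum": (lambda: 0.0, lambda acc, j, v: acc + v, lambda acc: acc),
    "Max": (lambda: -math.inf, lambda acc, j, v: max(acc, v), lambda acc: acc),
    "ArgMin": (lambda: (math.inf, -1), lambda acc, j, v: min(acc, (v, j)),
               lambda acc: acc[1]),
    "LogSumExp": (lambda: (-math.inf, 0.0), _lse_update,
                  lambda acc: acc[0] + math.log(acc[1])),
}

def reduce_rows(formula, reduction, xs, ys):
    """a_i = Reduction_j formula(x_i, y_j), with O(1) working memory per row."""
    init, update, read = REDUCTIONS[reduction]
    out = []
    for x in xs:
        acc = init()
        for j, y in enumerate(ys):
            acc = update(acc, j, formula(x, y))
        out.append(read(acc))
    return out

xs, ys = [0.0, 2.9], [0.0, 1.0, 3.0]
sqdist = lambda x, y: (x - y) ** 2

nearest = reduce_rows(sqdist, "ArgMin", xs, ys)                        # nearest neighbor
gauss_sum = reduce_rows(lambda x, y: math.exp(-sqdist(x, y)), "Sum", xs, ys)
soft_min = reduce_rows(lambda x, y: -sqdist(x, y), "LogSumExp", xs, ys)
```

The memory footprint is linear in N + M because only the inputs, the outputs and one accumulator per row are kept.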
KeOps’ low-level interface: generic_sum

With x_i, y_j points in R^3 and b_j a 2D-signal:

$a_i = \sum_{j=1}^{M} \exp\!\Big( -\frac{\|x_i - y_j\|^2}{\sigma^2} \Big) \cdot b_j$

from pykeops.torch import generic_sum

gaussian_conv = generic_sum(
"Exp(-G*SqDist(X,Y)) * B", # Custom formula
"A = Vx(2)", # Output, 2D, indexed by i
"G = Pm(1)", # 1st arg, 1D, parameter
"X = Vx(3)", # 2nd arg, 3D, indexed by i
"Y = Vy(3)", # 3rd arg, 3D, indexed by j
"B = Vy(2)") # 4th arg, 2D, indexed by j

# Simply apply your routine to CPU/GPU torch tensors!


a = gaussian_conv( 1/sigma**2, x, y, b )
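To validate such a routine on small inputs, a slow reference implementation of the same formula is handy. The sketch below is plain Python on nested lists and takes its arguments in the same order (G, X, Y, B) as the formula string; `gaussian_conv_reference` is a hypothetical helper, not part of pykeops.

```python
import math

def gaussian_conv_reference(g, x, y, b):
    """a_i = sum_j exp(-g * |x_i - y_j|^2) * b_j, computed with explicit loops."""
    a = []
    for xi in x:
        acc = [0.0] * len(b[0])
        for yj, bj in zip(y, b):
            k = math.exp(-g * sum((u - v) ** 2 for u, v in zip(xi, yj)))
            acc = [t + k * c for t, c in zip(acc, bj)]
        a.append(acc)
    return a

sigma = 1.0
x = [[0.0, 0.0, 0.0]]                        # one point x_i in R^3
y = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]       # two points y_j in R^3
b = [[1.0, 2.0], [3.0, 4.0]]                 # 2D signal carried by the y_j
a = gaussian_conv_reference(1 / sigma ** 2, x, y, b)
```

On this input the two kernel weights are 1 and exp(-1), so a[0] equals (1 + 3·exp(-1), 2 + 4·exp(-1)).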
It works!

Kernel norm + gradient with N vertices on a cheap laptop’s GPU (GTX960M)

[Figure: log-log plot of the run time (sec, from 10^-3 to 10^2) against the number of points N (from 10^2 to 10^6), for PyTorch on CPU, PyTorch on GPU and PyTorch+KeOps. Two of the curves stop at an “out of mem” mark.]

+ You can go further and use multiscale, FMM-like information.


Recap of today’s presentation

Key points:

• Gradients are cheap.
• PyTorch is the perfect framework for researchers as it’s both simple and flexible.
• It generalizes regression to arbitrary models, without hassle.
• Multiscale image analysis has gone through a revolution over the past six years.

What about your field?

Going further

• pytorch.org
• www.kernel-operations.io
• www.math.ens.fr/~feydy/Teaching/
• Differential geometry and stochastic dynamics with Deep Learning numerics, Kühnel, Arnaudon, Sommer (2017)

Thank you for your attention.
