Automatic differentiation

for applied mathematicians

Is PyTorch the right tool for you?

Jean Feydy
February 2018
Écoles Normales Supérieures de Paris et Paris-Saclay
Overview

What is backpropagation?
A time-efficient algorithm to compute gradients.

How can we use it?

PyTorch provides a simple syntax, transparent CPU/GPU support.

Where is it useful?
Automatic tuning of variables (optimal control)
or of your transform’s parameters (neural networks).

Bonus: you can extend it easily.

Link with your homebrew CUDA routines!

How do we compute a gradient?

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function. Then:

$$\nabla f(x_0) = \begin{pmatrix} \partial_{x_1} f(x_0) \\ \partial_{x_2} f(x_0) \\ \vdots \\ \partial_{x_n} f(x_0) \end{pmatrix} \simeq \frac{1}{\delta t} \begin{pmatrix} f(x_0 + \delta t \cdot (1, 0, \dots, 0)) - f(x_0) \\ f(x_0 + \delta t \cdot (0, 1, \dots, 0)) - f(x_0) \\ \vdots \\ f(x_0 + \delta t \cdot (0, 0, \dots, 1)) - f(x_0) \end{pmatrix}$$

$\Longrightarrow$ costs (n+1) evaluations of f, which is poor.
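To make the (n+1) count concrete, here is a minimal sketch of the finite-difference scheme in plain Python. The helper `finite_difference_gradient`, the test function `squared_norm` and the step `dt = 1e-6` are illustrative choices, not part of the original slides.

```python
def finite_difference_gradient(f, x0, dt=1e-6):
    """Approximate the gradient of f at x0 with forward differences."""
    f0 = f(x0)  # one evaluation at the base point
    gradient = []
    for k in range(len(x0)):  # n more evaluations, one per coordinate
        shifted = list(x0)
        shifted[k] += dt
        gradient.append((f(shifted) - f0) / dt)
    return gradient

calls = 0  # counts how many times f is evaluated

def squared_norm(x):
    """f(x) = |x|^2, whose exact gradient is 2x."""
    global calls
    calls += 1
    return sum(xi * xi for xi in x)

x0 = [1.0, -2.0, 0.5]
approx = finite_difference_gradient(squared_norm, x0)
print(calls, approx)
```

With n = 3 coordinates, the counter ends at 4 evaluations, and the result is only accurate up to an error of order `dt`.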
What is a gradient?

Let $f : x \in \mathbb{R}^n \mapsto y \in \mathbb{R}$ be smooth,
so that $df(x) : dx \in \mathbb{R}^n \mapsto dy \in \mathbb{R}$ is linear:

$$df(x).dx = \begin{pmatrix} \partial_1 f(x) & \cdots & \partial_n f(x) \end{pmatrix} \cdot \begin{pmatrix} dx_1 \\ \vdots \\ dx_n \end{pmatrix} = dy$$

We define $\partial f(x) = (df(x))^\star \simeq (df(x))^T$,

i.e. $\partial f(x) : dy^\star \in \mathbb{R} \mapsto dx^\star \in \mathbb{R}^n$:

$$\partial f(x).dy^\star = \begin{pmatrix} \partial_1 f(x) \\ \vdots \\ \partial_n f(x) \end{pmatrix} \cdot dy^\star = dx^\star \quad \text{so that} \quad \nabla f(x) = \partial f(x).1$$
Autodiff is simple: no magic!

This definition lets us compose gradients:

$$f = h \circ g$$
$$df(x) = dh(g(x)) \circ dg(x)$$
$$\partial f(x) = (dh(g(x)) \circ dg(x))^T = \partial g(x) \circ \partial h(g(x))$$
$$\nabla f(x) = \partial f(x).1 = \partial g(x).(\partial h(g(x)).1)$$

Forward pass: input x → g → g(x) → h → h(g(x)) = f(x), the output.
Backward pass: input 1 → ∂h → ∂h(g(x)).1 → ∂g → ∂g(x).( · · · ) = ∇f(x), the output.
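The composition rule can be checked by hand on a small example. The sketch below uses two hypothetical functions, g(x) = (x₀x₁, x₀ + x₁) and h(u) = u₀² + 3u₁, chosen only for illustration, and writes each adjoint (transposed derivative) explicitly in plain Python.

```python
def g(x):
    return [x[0] * x[1], x[0] + x[1]]

def g_adjoint(x, du):
    """Transposed Jacobian of g at x, applied to the vector du."""
    return [x[1] * du[0] + du[1], x[0] * du[0] + du[1]]

def h(u):
    return u[0] ** 2 + 3.0 * u[1]

def h_adjoint(u, dy):
    """Transposed derivative of h at u, applied to the scalar dy."""
    return [2.0 * u[0] * dy, 3.0 * dy]

def gradient_of_f(x):
    """Gradient of f = h o g: forward through g, then backward through h and g."""
    u = g(x)                  # forward pass, keep the intermediate value
    du = h_adjoint(u, 1.0)    # the backward pass starts from 1
    return g_adjoint(x, du)

x = [2.0, 3.0]
gradient = gradient_of_f(x)
print(gradient)
```

Here f(x) = (x₀x₁)² + 3(x₀ + x₁), so the exact gradient at (2, 3) is (2·2·9 + 3, 2·4·3 + 3) = (39, 27), which is what the backward pass returns.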
What do you need to compute a gradient?

Backpropagating through a computational graph requires:

$$f_i : E_{i-1} \to E_i, \qquad x \mapsto f_i(x) \tag{1}$$

and

$$\partial_x f_i : E_{i-1} \times E_i \to E_{i-1}, \qquad (x_0, a) \mapsto \partial_x f_i(x_0) \cdot a \tag{2}$$

encoded as computer programs.

This is what PyTorch is all about.
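As a rough illustration of requirements (1) and (2), the following sketch stores each operator as a pair of programs and backpropagates through a chain of them. The `OPERATORS` table and `value_and_gradient` are hypothetical names for this toy example; this is not how PyTorch is implemented internally.

```python
import math

# Each operator is a pair (forward, backward):
#   forward(x)     -> f_i(x)
#   backward(x, a) -> transposed derivative of f_i at x, applied to a
OPERATORS = {
    "square": (lambda x: [v * v for v in x],
               lambda x, a: [2.0 * v * w for v, w in zip(x, a)]),
    "exp":    (lambda x: [math.exp(v) for v in x],
               lambda x, a: [math.exp(v) * w for v, w in zip(x, a)]),
    "sum":    (lambda x: [sum(x)],
               lambda x, a: [a[0]] * len(x)),
}

def value_and_gradient(names, x):
    """Evaluate a chain of operators, then backpropagate from 1.

    The last operator is assumed to return a single number.
    """
    inputs = []
    for name in names:              # forward pass: remember each input
        inputs.append(x)
        x = OPERATORS[name][0](x)
    a = [1.0]                       # the backward pass starts from 1
    for name, x_in in zip(reversed(names), reversed(inputs)):
        a = OPERATORS[name][1](x_in, a)
    return x[0], a

# f(x) = sum_i exp(x_i^2), with gradient 2 * x_i * exp(x_i^2)
value, gradient = value_and_gradient(["square", "exp", "sum"], [0.0, 1.0])
print(value, gradient)
```

The forward loop is requirement (1), the backward loop is requirement (2), and the list of stored inputs plays the role of the recorded graph.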
Autodiff is easy to use

PyTorch:

• Straightforward replacement of Matlab/Numpy.
• Operators $f : x \mapsto f(x)$ are bundled with their adjoints $\partial f : (x, dy^\star) \mapsto \partial f(x).dy^\star = dx^\star$.
• Seamless GPU support.

Let’s see how it goes in practice!
A typical formula: the kernel square norm

[Figure: two N-by-D arrays, with rows q_1, ..., q_N and p_1, ..., p_N.]

In shape analysis, algorithms often rely on the kernel dot product:

$$H(q, p) = \frac{1}{2} \sum_{i,j} \exp\left(-\tfrac{1}{\sigma^2} \|q_i - q_j\|^2\right) \langle p_i, p_j \rangle_2 = \frac{1}{2} \sum_i \Big\langle p_i, \sum_j k(q_i - q_j)\, p_j \Big\rangle_2 = \frac{1}{2} \langle p, K_q\, p \rangle_2 .$$
Numpy, in practice

import numpy as np # the standard array library

N = 5000 ; D = 3 # cloud of 5,000 points in 3D
q = np.random.rand(N,D)
p = np.random.rand(N,D)
s = 1.

# Re-indexing:
q_i = q[:,np.newaxis,:] # shape (N,D) -> (N,1,D)
q_j = q[np.newaxis,:,:] # shape (N,D) -> (1,N,D)

# Actual computations:
sqd = np.sum( (q_i - q_j)**2 , 2 ) # |q_i-q_j|^2
K_qq = np.exp( - sqd / s**2 )
v = K_qq @ p # matrix mult. (N,N)@(N,D) = (N,D)

# Finally, output the kernel norm H(q,p): .5*<p,v>
H = .5 * np.inner( p.ravel(), v.ravel() )

# Output: H = 6029309.1348486
# Elapsed time: 3.01s
PyTorch, in practice

import torch # GPU + autodiff library
from torch.autograd import grad

# With PyTorch, using the GPU is that simple:
use_gpu = torch.cuda.is_available()
dtype = torch.cuda.FloatTensor if use_gpu \
        else torch.FloatTensor

# Store arbitrary arrays on the CPU or GPU:
q = torch.from_numpy( q ).type(dtype)
p = torch.from_numpy( p ).type(dtype)
s = torch.Tensor( [1.] ).type(dtype)

# Tell PyTorch to track the variables "q" and "p"
q.requires_grad = True
p.requires_grad = True

# Re-indexing:
q_i = q[:,None,:] # shape (N,D) -> (N,1,D)
q_j = q[None,:,:] # shape (N,D) -> (1,N,D)

# Actual computations:
sqd = torch.sum( (q_i - q_j)**2 , 2 ) # |q_i-q_j|^2
K_qq = torch.exp( - sqd / s**2 ) # Gaussian kernel
v = K_qq @ p # matrix mult. (N,N)@(N,D) = (N,D)

# Output the kernel norm H(q,p): .5*<p,v>
H = .5 * torch.dot( p.view(-1), v.view(-1) )

# Output: H = 6029309.0
# Elapsed time: 0.31s

# Automatic differentiation is straightforward:
[dq,dp] = grad( H, [q,p] )

# dq.shape == q.shape ; dp.shape == p.shape
# Elapsed time: 0.03s
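One way to gain confidence in an automatic gradient is to compare it with a closed form. Since the Gaussian kernel matrix is symmetric, the derivative of H with respect to p is K_q p, which is the array `v` of the listings. The sketch below checks this on a tiny cloud, in plain Python so that it runs without any dependency; the function names and the sizes N = 4, D = 3 are illustrative assumptions.

```python
import math
import random

def kernel_matrix(q, s):
    """Gaussian kernel matrix K[i][j] = exp(-|q_i - q_j|^2 / s^2)."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(qi, qj)) / s ** 2)
             for qj in q] for qi in q]

def kernel_norm(q, p, s):
    """H(q, p) = 0.5 * sum_ij K[i][j] <p_i, p_j>."""
    K = kernel_matrix(q, s)
    n = len(q)
    return 0.5 * sum(K[i][j] * sum(a * b for a, b in zip(p[i], p[j]))
                     for i in range(n) for j in range(n))

def gradient_wrt_p(q, p, s):
    """Closed form dH/dp = K_q p, valid because K_q is symmetric."""
    K = kernel_matrix(q, s)
    n, d = len(p), len(p[0])
    return [[sum(K[i][j] * p[j][c] for j in range(n)) for c in range(d)]
            for i in range(n)]

random.seed(0)
N, D, s = 4, 3, 1.0
q = [[random.random() for _ in range(D)] for _ in range(N)]
p = [[random.random() for _ in range(D)] for _ in range(N)]

# Compare one entry of the closed form with a centered finite difference.
eps = 1e-6
p_plus = [row[:] for row in p]
p_plus[1][2] += eps
p_minus = [row[:] for row in p]
p_minus[1][2] -= eps
numeric = (kernel_norm(q, p_plus, s) - kernel_norm(q, p_minus, s)) / (2 * eps)
analytic = gradient_wrt_p(q, p, s)[1][2]
print(numeric, analytic)
```

The same comparison can be made against the `dp` returned by `grad`, entry by entry, up to single-precision rounding.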
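[Figure: computational graph recorded by PyTorch for H, each node being labelled with its backward operator. Starting from q (5000, 3), two ∂Unsqueeze nodes feed ∂Sub, giving q_i − q_j (5000, 5000, 3); ∂PowConstant, ∂Sum and ∂Negate give −‖q_i − q_j‖² (5000, 5000); ∂Div by s² (1), itself obtained through ∂PowConstant, gives −‖q_i − q_j‖²/s² (5000, 5000); ∂Exp gives K_{q,q} (5000, 5000); ∂Addmm with p (5000, 3) gives K_{q,q} p; two ∂View nodes flatten p and K_{q,q} p to (15000); ∂Dot and ∂MulConstant output H.]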
Using PyTorch for Optimal Control
Ballistic 101

Take two locations in the plane $\mathbb{R}^2$:

$$x_0 = \begin{pmatrix} 0 \\ .5 \end{pmatrix} \quad \text{and} \quad \tilde{x} = \begin{pmatrix} 7 \\ 2 \end{pmatrix}.$$

Assume that the trajectory $x_t$ follows Newton’s laws of motion:

$$\ddot{x}_t = \begin{pmatrix} 0 \\ -g \end{pmatrix}.$$

Optimal Control problem: find a momentum $P \in \mathbb{R}^2$ such that

$$m\, \dot{x}_0 = P \Longrightarrow x_1 \simeq \tilde{x}.$$
PyTorch allows you to work with the proper equations!

Using the position-momentum coordinates

$$q_t = x_t, \qquad p_t = m\, v_t,$$

write down the expression of the mechanical energy

$$E_{mec}(x, v) = m g \cdot x[2] + \tfrac{1}{2} m \|v\|^2,$$
$$E_{mec}(q, p) = m g \cdot q[2] + \tfrac{1}{2m} \|p\|^2.$$

Then (Hamilton, 1833; Pontryagin, 1956):

$$\begin{cases} \dot{q}_t = v_t = +\tfrac{1}{m}\, p_t = +\dfrac{\partial E_{mec}}{\partial p}(q_t, p_t) \\[1ex] \dot{p}_t = m\, \dot{v}_t = (0, -mg) = -\dfrac{\partial E_{mec}}{\partial q}(q_t, p_t) \end{cases}$$
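As a quick check, both right-hand sides can be recovered by differentiating the energy by hand, using only the expression of E_mec(q, p) given on this slide:

```latex
\frac{\partial E_{mec}}{\partial p}(q, p)
  = \frac{\partial}{\partial p}\Big( \tfrac{1}{2m}\,\|p\|^2 \Big)
  = \tfrac{1}{m}\, p ,
\qquad
\frac{\partial E_{mec}}{\partial q}(q, p)
  = \frac{\partial}{\partial q}\Big( m g \cdot q[2] \Big)
  = (0,\; m g) .
```

Hence $\dot{q}_t = p_t / m = v_t$ and $\dot{p}_t = -(0, mg)$, which is Newton’s law $m\,\ddot{x}_t = (0, -mg)$ written in position-momentum coordinates.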
Setting the parameters of our model

import torch # GPU + autodiff library
from torch.autograd import grad

# Set the parameters of our model:
# (torch.tensor accepts requires_grad, unlike the torch.Tensor constructor)
g = torch.tensor( [ 9.81], requires_grad = True )
m = torch.tensor( [ 15. ], requires_grad = True )
source = torch.tensor( [0.,.5], requires_grad = True )
target = torch.tensor( [7.,2.], requires_grad = True )
Defining a cost to optimize

def cost(m, g, P) :
    "Cost associated to a simple ballistic problem."
    def Emec(q,p) :
        "Particle of mass m in a gravitational field g."
        return m*g*q[1] + (p**2).sum() / (2*m)

    # Initial condition:
    qt = source ; pt = P

    # Simple Euler scheme:
    for it in range(10) :
        [dq,dp] = grad(Emec(qt,pt), [qt,pt], create_graph=True)
        qt = qt + .1 * dp
        pt = pt - .1 * dq

    # Return the squared distance to the target:
    return ((qt - target)**2).sum()
Solving the control problem through gradient descent

P = torch.tensor( [60., 30.], requires_grad = True )

lr = 5.
for it in range(100) :
    [dP] = grad( cost(m,g,P), [P])
    P.data -= lr * dP.data
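For this simple model the whole pipeline can be reproduced by hand, which gives a dependency-free sanity check of the descent. The sketch below assumes the same constants as the slides (g = 9.81, m = 15, ten Euler steps of size 0.1, lr = 5) and replaces the automatic gradient with a closed form: the endpoint of the Euler scheme is an affine function of P, with slope (number of steps × step size) / m.

```python
G, M = 9.81, 15.0
SOURCE, TARGET = (0.0, 0.5), (7.0, 2.0)
STEPS, DT = 10, 0.1

def shoot(P):
    """Explicit Euler scheme, with dE/dq = (0, m g) and dE/dp = p / m."""
    q, p = list(SOURCE), list(P)
    for _ in range(STEPS):
        dq = (0.0, M * G)
        dp = (p[0] / M, p[1] / M)
        q = [q[0] + DT * dp[0], q[1] + DT * dp[1]]
        p = [p[0] - DT * dq[0], p[1] - DT * dq[1]]
    return q

def cost(P):
    """Squared distance between the endpoint and the target."""
    q = shoot(P)
    return (q[0] - TARGET[0]) ** 2 + (q[1] - TARGET[1]) ** 2

def cost_gradient(P):
    """Hand-written gradient: the endpoint is affine in P."""
    q = shoot(P)
    scale = 2.0 * STEPS * DT / M
    return [scale * (q[0] - TARGET[0]), scale * (q[1] - TARGET[1])]

P = [60.0, 30.0]
initial_cost = cost(P)
lr = 5.0
for _ in range(100):
    dP = cost_gradient(P)
    P = [P[0] - lr * dP[0], P[1] - lr * dP[1]]
final_cost = cost(P)
print(initial_cost, final_cost, P)
```

Under these assumptions each descent step multiplies the distance to the target by 1 − 2·lr/m² = 1 − 2/45, so the cost shrinks steadily instead of reaching the target in one step.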
Putting randomness into our model

def cost(m, g, P) :
    "Cost associated to a simple ballistic problem."
    def Emec(q,p) :
        "Particle of mass m in a gravitational field g."
        return m*g*q[1] + (p**2).sum() / (2*m)

    # Initial condition:
    qt = source ; pt = P

    # Simple Euler scheme:
    for it in range(10) :
        [dq,dp] = grad(Emec(qt,pt), [qt,pt], create_graph=True)
        dq += qt[1] * 20 * torch.randn(2)
        qt = qt + .1 * dp
        pt = pt - .1 * dq

    # Return the squared distance to the target:
    return ((qt - target)**2).sum()
Optimizing a noisy command

P = torch.tensor( [60., 30.], requires_grad = True )

lr = 5.
for it in range(100) :
    [dP] = grad( cost(m,g,P), [P])
    P.data -= lr * dP.data
Optimizing with respect to the gravitational field

g = torch.tensor( [ 9.81], requires_grad = True )

lr = .1
for it in range(100) :
    [dg] = grad( cost(m,g,P), [g])
    g.data -= lr * dg.data
Recap on the basic usage of PyTorch

PyTorch is a simple replacement for numpy:

• Use torch.Tensor instead of numpy.array.
• Back-and-forth: from_numpy(...) and .numpy().
• GPU backend: .cuda() and .cpu().
• Tensor(...) = (data, graph history) container: check A.data and A.grad_fn.
• Compute gradients with grad(A, [...]).
• High-order derivatives are supported!
Convolutional “neural” networks:
optimizing a multiscale transform
The (Discrete) Fourier Transform

Given a signal f, compute the coefficients

$$\hat{f}(\omega) = \langle e_\omega, f \rangle_{L^2}, \quad \text{where } e_\omega : x \mapsto e^{i \omega \cdot x}.$$

[Figure: f(x) and log(|f̂(ω)|).]
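For a finite signal, these coefficients are plain sums. Here is a minimal sketch of the naive O(n²) computation, assuming the usual convention that the inner product is conjugate-linear in its first argument (so that each term carries a factor e^{-iωx}) and frequencies ω = 2πk/n. The function name `dft` and the test signal are illustrative.

```python
import cmath
import math

def dft(signal):
    """Coefficients <e_w, f> for the frequencies w = 2*pi*k/n, k = 0..n-1."""
    n = len(signal)
    return [sum(value * cmath.exp(-2j * cmath.pi * k * x / n)
                for x, value in enumerate(signal))
            for k in range(n)]

n = 8
# A pure cosine that oscillates twice over the n samples.
wave = [math.cos(2 * math.pi * 2 * x / n) for x in range(n)]
magnitudes = [abs(c) for c in dft(wave)]
print([round(m, 6) for m in magnitudes])
```

All the energy lands on the frequencies k = 2 and k = n − 2 = 6, each with magnitude n/2 = 4. In practice a Fast Fourier Transform computes the same coefficients in O(n log n) operations.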
The (Discrete) Fourier Transform

This transform allows us to apply Gaussian blur, unsharp filters or Wiener denoising.

[Figures: original image; with a Gaussian white noise; blurred with a Gaussian filter; denoised with a Wiener filter.]
Going beyond linear signal processing

Super-clever algorithms...

Do not scale well, at all.

As of 2018, we can only implement basic algorithms on clever representations. We strive to find relevant mappings

$$F : a \in \mathbb{R}^{W \times H} \mapsto b \in \mathbb{R}^N .$$
Wavelet transforms: Fourier++

Compute linear features by enforcing two priors:

• Features should be localized and translation-covariant:

$$b_{i,x,y} = (\varphi_i \star a)(x, y).$$

Gabor filter responses, from the scikit-learn doc.

• Multiscale prior: features are built in cascade from finer scales,

$$b_{(i_1, \dots, i_k),x,y} = (\psi_{i_k} \star \cdots \star \varphi_{i_1} \star a)(x, y),$$

with filters of (geometrically) increasing radii; this is algorithmically enforced through the subsampling of feature maps.

Understanding Deep Convolutional Networks (Mallat, 2016).
A real-life application: JPEG 2000

Standard format in cinemas:

• Subsample the coarse scales.
• Only store the large coefficients.

Image by Alessio Damato, from Wikipedia.

How do we choose the convolution filters?

Fast Wavelet Transform (Mallat, 1989): Given a lowpass and a highpass filter of size k, compute a multiscale decomposition of a signal of size n in O(k · n) operations.

Daubechies filters (1992): For a given index p, there exists a pair of filters of size 2p generating an orthogonal transform that perfectly factorizes locally polynomial signals of degree < p.

Here is the Db2 pair:

index       -1      0     +1     +2
lowpass   -.129   .224   .837   .483
highpass  -.483   .837  -.224  -.129
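The properties claimed for the Db2 pair can be checked numerically. The sketch below rebuilds the filters from their standard closed form, (1 ± √3)/(4√2) and (3 ± √3)/(4√2), and applies one analysis level to a linear ramp. The function `analysis_step`, the absence of any boundary handling and the choice to apply the filters as sliding dot products are simplifying assumptions of this example.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
c0, c1, c2, c3 = ((1 + SQRT3) / NORM, (3 + SQRT3) / NORM,
                  (3 - SQRT3) / NORM, (1 - SQRT3) / NORM)
LOWPASS = [c3, c2, c1, c0]      # rounds to -.129, .224, .837, .483
HIGHPASS = [-c0, c1, -c2, c3]   # rounds to -.483, .837, -.224, -.129

def analysis_step(signal):
    """One level of the transform: filter, then keep every other sample.

    Windows that would run past the end of the signal are skipped.
    """
    coarse, detail = [], []
    for start in range(0, len(signal) - 3, 2):
        window = signal[start:start + 4]
        coarse.append(sum(w * x for w, x in zip(LOWPASS, window)))
        detail.append(sum(w * x for w, x in zip(HIGHPASS, window)))
    return coarse, detail

# A locally polynomial signal of degree 1 < p = 2.
ramp = [2.0 + 0.5 * k for k in range(16)]
coarse, detail = analysis_step(ramp)

orthogonality = sum(a * b for a, b in zip(LOWPASS, HIGHPASS))
energy = sum(a * a for a in LOWPASS)
print([round(a, 3) for a in LOWPASS], max(abs(d) for d in detail))
```

The detail coefficients of the ramp vanish up to rounding, the two filters are orthogonal with unit norm, and each level costs a number of operations proportional to k · n, since every window of size k produces two outputs and the windows advance by two samples.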
Scattering Transform: |Fourier|++

We use a wavelet transform:

$$F_{wav} : \mathbb{R}^{W \times H} \to \mathbb{R}^{N \times W \times H}, \qquad a \mapsto \big(\, \psi_1 \star \varphi_1 \star a\,(\cdot, \cdot),\ \psi_1 \star \varphi_2 \star a\,(\cdot, \cdot),\ \cdots \big)$$

We use a scattering transform:

$$F_{scat} : \mathbb{R}^{W \times H} \to \mathbb{R}_+^{N \times W \times H}, \qquad a \mapsto \big(\, |\psi_1 \star |\varphi_1 \star a||(\cdot, \cdot),\ |\psi_1 \star |\varphi_2 \star a||(\cdot, \cdot),\ \cdots \big)$$

We use scattering moments:

$$F^1_{scat} : \mathbb{R}^{W \times H} \to \mathbb{R}_+^{N}, \qquad a \mapsto \big(\, \|\psi_1 \star |\varphi_1 \star a|\|_1,\ \|\psi_1 \star |\varphi_2 \star a|\|_1,\ \cdots \big)$$

Texture synthesis: an optimal control problem.

Given an image Y and a transform F, find, by gradient descent from a random starting point, a synthesized image X such that

$$F(X) \simeq F(Y).$$
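To see what a single scattering moment measures, here is a toy one-dimensional version in plain Python. The two short filters and the use of circular (periodic) convolution are assumptions made for this sketch only; real scattering transforms rely on families of wavelets at several scales.

```python
def circular_convolution(kernel, signal):
    """(kernel * signal)(x) on a periodic 1D signal."""
    n = len(signal)
    return [sum(w * signal[(x - k) % n] for k, w in enumerate(kernel))
            for x in range(n)]

def scattering_moment(psi, phi, signal):
    """|| psi * |phi * a| ||_1 for a 1D periodic signal a."""
    inner = [abs(v) for v in circular_convolution(phi, signal)]
    outer = circular_convolution(psi, inner)
    return sum(abs(v) for v in outer)

# Hypothetical filters, chosen only for illustration.
phi = [1.0, -1.0]        # finite difference
psi = [1.0, 0.0, -1.0]   # coarser finite difference

signal = [0.0, 1.0, 4.0, 2.0, -1.0, 3.0, 0.0, 0.5]
shifted = signal[3:] + signal[:3]   # same signal, translated

moment = scattering_moment(psi, phi, signal)
moment_shifted = scattering_moment(psi, phi, shifted)
print(moment, moment_shifted)
```

Both values agree: convolutions and pointwise moduli commute with translations, and the final norm discards the position, which is why such moments suit stationary textures.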
Using scattering moments to characterize textured patches

Understanding Deep Convolutional Networks (Mallat, 2016).

[Figures: texture synthesis. Original patches; synthesized from covariance moments; synthesized from scattering moments.]

Trying to synthesize a photo using scattering moments...

Pure maths can only take you so far.

Thankfully, you can now go beyond explicit formulas.
Problem: classification of web-like images

ImageNet: 100,000+ classes, with 1,000+ samples per class.

Let’s restrict ourselves to a subset of C classes.

The dataset is seen as a collection

$$(x_i, y_i) \in \mathbb{R}^{W \times H} \times [[1, C]] \ \simeq\ (x_i, \delta_{y_i}) \in \mathbb{R}^{W \times H} \times [0, 1]^C,$$

and we try to learn a sensible classifier

$$F_w : \mathbb{R}^{W \times H} \to [0, 1]^C$$

such that for every index i,

$$F_w(x_i) \simeq \delta_{y_i}.$$
Convolutional Neural Networks: |Fourier|+++

Multiscale transform $F_{feat} : \mathbb{R}^{W \times H} \to \mathbb{R}^N$ combined with a classifier

$$F_{class} : x \in \mathbb{R}^N \mapsto \mathrm{Softmax}(M_2 (M_1 x)_+),$$

with $M_1$ an N-by-H matrix, $M_2$ an H-by-C matrix and

$$\mathrm{Softmax} : x \in \mathbb{R}^C \mapsto \left( \frac{\exp(x_i)}{\sum_j \exp(x_j)} \right)_i \in [0, 1]^C .$$

Speed sign detection and recognition by convolutional neural networks, Peemen et al. (2011).
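The classifier is short enough to write out in full. The sketch below is a plain Python version with made-up weights; storing each matrix as a list of rows, one row per output coordinate, is a convention chosen for this example, and subtracting the maximum before exponentiating is a standard trick to avoid overflow.

```python
import math

def relu(v):
    """Positive part (.)_+ applied coordinate by coordinate."""
    return [max(0.0, t) for t in v]

def matvec(matrix, v):
    """Matrix-vector product, with the matrix stored as a list of rows."""
    return [sum(w * t for w, t in zip(row, v)) for row in matrix]

def softmax(v):
    shift = max(v)  # does not change the result, prevents overflow
    weights = [math.exp(t - shift) for t in v]
    total = sum(weights)
    return [w / total for w in weights]

def classify(M1, M2, x):
    """Softmax(M2 (M1 x)_+)."""
    return softmax(matvec(M2, relu(matvec(M1, x))))

# Hypothetical weights: 3 input features, 2 hidden units, 3 classes.
M1 = [[1.0, -1.0, 0.0],
      [0.5, 0.5, -1.0]]
M2 = [[1.0, 0.0],
      [0.0, 1.0],
      [-1.0, -1.0]]
probabilities = classify(M1, M2, [2.0, 1.0, 0.5])
print(probabilities)
```

The output is a vector of non-negative numbers that sum to one, which can be compared with the one-hot label of a sample.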
Convolutional Neural Networks are trainable multiscale transforms

The full transform is parameterized by:

a set of convolution filters
+ a few matrices in the classifier
= a large vector w.

Let’s optimize these weights using stochastic gradient descent!

for it in range(1_000_000) :
    I = random set of 10 indices
    cost = $\sum_{i \in I} \| F_w(x_i) - y_i \|_{KL}$
    dw = grad( cost, [w] )[0]
    w.data -= .001 * dw.data

(You’d better own a good GPU!)
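The training loop above can be run end to end on a toy problem. The sketch below is a stand-in only: a hypothetical two-class dataset of 2D points, a linear softmax classifier in place of the full network, and a hand-written gradient in place of `grad`. It keeps the structure of the loop: draw 10 random indices, sum the per-sample costs, step against the gradient.

```python
import math
import random

random.seed(0)

def softmax(v):
    shift = max(v)
    weights = [math.exp(t - shift) for t in v]
    total = sum(weights)
    return [w / total for w in weights]

def scores(w, x):
    return [sum(wc * xc for wc, xc in zip(row, x)) for row in w]

# Hypothetical toy dataset: 2D points labelled by the sign of x[0] - x[1].
points = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
labels = [0 if x[0] > x[1] else 1 for x in points]

# The "large vector w": here, the 2-by-2 matrix of a linear classifier.
w = [[0.0, 0.0], [0.0, 0.0]]

def cost_and_gradient(w, batch):
    """Cross-entropy, i.e. KL divergence from the one-hot labels."""
    cost = 0.0
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in batch:
        x, y = points[i], labels[i]
        probs = softmax(scores(w, x))
        cost -= math.log(probs[y])
        for c in range(2):
            error = probs[c] - (1.0 if c == y else 0.0)
            for d in range(2):
                grad[c][d] += error * x[d]
    return cost, grad

def accuracy(w):
    hits = 0
    for x, y in zip(points, labels):
        s = scores(w, x)
        hits += (s.index(max(s)) == y)
    return hits / len(points)

initial_accuracy = accuracy(w)
for it in range(2000):
    batch = random.sample(range(len(points)), 10)  # random set of 10 indices
    cost, grad = cost_and_gradient(w, batch)
    for c in range(2):
        for d in range(2):
            w[c][d] -= 0.1 * grad[c][d]
final_accuracy = accuracy(w)
print(initial_accuracy, final_accuracy)
```

Only one mini-batch is touched per step, so the cost of an iteration does not depend on the size of the dataset; this is what makes the approach usable on large image collections.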
Convolutional Neural Networks: Texture + Structure

[Figure: hopeful CNN visualization, from vision03.csail.mit.edu/cnn_art/.]
Convolutional Neural Networks: a good compromise

Wavelets ≃ JPEG2000:

• Super cheap.
• Lots of structure.
• Encode a multiscale prior on images.

By tuning its coefficients on a database of labeled images,
we get a CNN ≃ “JPEG 2020” that is adapted to the problem.

[Diagram: x → CNN → µ(x), m(x), M(x).]
Deep Art: an emblematic application (Nikulin, Novak (2016))

[Sequence of figures, with panels labelled µ, m and M.]
Major autodiff frameworks in Python

• Theano (MILA, 2008-2017):
    • Turns a computational graph into C++ code.
    • Relies on g++ to provide a linked executable.

• TensorFlow (Google, 2015- ):
    • Operations are pre-compiled.
    • Strong emphasis put on large-scale deployment.

• PyTorch (Facebook, 2016- ):
    • Straightforward numpy replacement.
    • Strong emphasis put on flexibility.

Extending PyTorch
Computing the Hamiltonian

# Actual computations.
q_i = q[:,None,:] # shape (N,D) -> (N,1,D)
q_j = q[None,:,:] # shape (N,D) -> (1,N,D)
sqd = torch.sum( (q_i - q_j)**2 , 2 ) # |q_i-q_j|^2
K_qq = torch.exp( - sqd / (s**2) ) # Gaussian kernel
v = K_qq @ p # kernel convolution, shape (N,D)

# Finally, compute the Hamiltonian H(q,p): .5*<p,v>
H = .5 * torch.dot( p.view(-1), v.view(-1) )

# Automatic differentiation is straightforward
# (H is a scalar, so no grad_outputs argument is needed)
[dq,dp] = grad( H, [q,p] )

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/.../THCStorage.cu:66

# Display -- see next figure.
make_dot(H, {'q':q, 'p':p, 's':s}).render(view=True)

[Figure: computational graph of H drawn by make_dot, for q and p of shape (5000, 3) and s of shape (1). Backward nodes: Unsqueeze (twice), Sub, PowConstant, Sum, Negate, Div, Exp, Addmm, View, Dot, MulConstant. Stored intermediates: q_i − q_j of shape (5000, 5000, 3); −‖q_i − q_j‖², −‖q_i − q_j‖²/s² and K_{q,q} of shape (5000, 5000); p and K_{q,q} p flattened to (15000).]

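The out-of-memory error can be anticipated from the shapes listed in the graph. The short computation below is a back-of-the-envelope sketch: it assumes float32 storage (4 bytes per entry) and only sizes individual buffers, since the exact set of arrays that autograd keeps alive depends on the PyTorch version. The helper name `megabytes` is made up for this illustration.

```python
BYTES_PER_FLOAT = 4  # assumption: float32 storage

def megabytes(*shape):
    """Size in MB (10^6 bytes) of a dense array with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_FLOAT / 1e6

N, D = 5000, 3
diff_mb = megabytes(N, N, D)    # q_i - q_j, shape (N, N, D)
kernel_mb = megabytes(N, N)     # sqd or K_qq, shape (N, N)
points_mb = megabytes(N, D)     # q or p, shape (N, D)

print(diff_mb, kernel_mb, points_mb)    # 300.0 100.0 0.06
print(megabytes(50_000, 50_000, D))     # 30000.0
```

Each (N, N, D) buffer is thousands of times larger than the inputs, and several buffers of this kind must coexist for the backward pass. Multiplying N by 10 multiplies their size by 100.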
The KeOps library

# Compute the kernel convolution

v = kernelproduct(s, q, q, p, "gaussian")
# Then, compute the Hamiltonian H(q,p): .5*<p,v>
H = .5 * torch.dot( p.view(-1), v.view(-1) )

[Figure: computational graph of H with KeOps. The inputs q and p of shape (5000, 3) and s of shape (1) feed a single KernelProduct node, followed by View, Dot and MulConstant. The largest stored arrays have shape (5000, 3) or (15000).]

Define custom operators

class KernelProduct(torch.autograd.Function):

    @staticmethod
    def forward(ctx, s, q1, q2, p, kernel_type):
        # save everything to compute the gradient
        ctx.save_for_backward( s, q1, q2, p )
        # init gamma, the output of the convolution K_(q1,q2) @ p
        gamma = torch.zeros(
            q1.size()[0] * p.size()[1] ).type(dtype)
        # Inplace CUDA routine on the raw float arrays,
        # loaded from .dll/.so files by the "pybind11" module
        cudaconv.cuda_conv( q1, q2, p, gamma, s,
                            kernel = kernel_type)

    @staticmethod
    def backward(ctx, a):
        (s, q1, q2, p) = ctx.saved_tensors
        # In order to get second derivatives, we encapsulated the
        # cudagradconv.cuda_gradconv routine in another
        # torch.autograd.Function object:
        kernelproductgrad_x = KernelProductGrad_x.apply
        # Call the CUDA routines
        # ...
        grad_x = kernelproductgrad_x( ... )
        # ...

⟹ You can do it!

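The same forward/backward pattern can be tried without any CUDA code. The sketch below is not the routine from the slides: it is a hypothetical `GaussianKernelProduct` written in plain PyTorch, assuming the Gaussian kernel exp(−‖q1_i − q2_j‖²/s²) used earlier, with a hand-written backward pass that is compared against finite differences by `torch.autograd.gradcheck`. It still builds the full (N, M) kernel matrix, so it does not solve the memory problem; it only illustrates the interface.

```python
import torch

class GaussianKernelProduct(torch.autograd.Function):
    """gamma_i = sum_j exp(-|q1_i - q2_j|^2 / s^2) * p_j  (hypothetical example)."""

    @staticmethod
    def forward(ctx, s, q1, q2, p):
        ctx.save_for_backward(s, q1, q2, p)           # inputs needed by backward
        diff = q1[:, None, :] - q2[None, :, :]        # (N, M, D)
        K = torch.exp(-(diff ** 2).sum(2) / s ** 2)   # (N, M)
        return K @ p                                  # (N, E)

    @staticmethod
    def backward(ctx, a):
        # a is the gradient of the final scalar with respect to gamma, shape (N, E).
        s, q1, q2, p = ctx.saved_tensors
        diff = q1[:, None, :] - q2[None, :, :]        # (N, M, D)
        sqd = (diff ** 2).sum(2)                      # (N, M)
        K = torch.exp(-sqd / s ** 2)                  # (N, M)
        W = K * (a @ p.t())                           # W_ij = k_ij * <a_i, p_j>
        grad_s = 2 * (W * sqd).sum() / s ** 3         # shape (1,)
        grad_q1 = (-2 / s ** 2) * (W[:, :, None] * diff).sum(1)   # (N, D)
        grad_q2 = (2 / s ** 2) * (W[:, :, None] * diff).sum(0)    # (M, D)
        grad_p = K.t() @ a                            # (M, E)
        return grad_s, grad_q1, grad_q2, grad_p

# Numerical check in double precision, on small random inputs.
torch.manual_seed(0)
s = torch.tensor([1.5], dtype=torch.float64, requires_grad=True)
q1 = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)
q2 = torch.randn(5, 3, dtype=torch.float64, requires_grad=True)
p = torch.randn(5, 2, dtype=torch.float64, requires_grad=True)

ok = torch.autograd.gradcheck(GaussianKernelProduct.apply, (s, q1, q2, p))
print("gradcheck passed:", ok)
```

Running `gradcheck` on small double-precision inputs is a cheap way to validate a hand-written backward pass before swapping in a compiled routine.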
KeOps:
Online Map-Reduce Operators,
with autodiff,
without memory overflows.

www.kernel-operations.io

⟹ pip install pykeops ⟸

(Thank you Benjamin!)

What we provide

For i = 1, . . . , N, you want to compute:

$a_i = \operatorname{Reduction}_{j=1,\dots,M} \big[ F(p^1, p^2, \dots, x^1_i, x^2_i, \dots, y^1_j, y^2_j, \dots) \big],$

with:
• “Reduction”: Sum, Max, ArgMin, LogSumExp, ...
• A vector-valued formula: $F$.
• Vector parameters: $p^1, p^2, \dots$
• Vector x-variables, indexed by $i$: $x^1_i, x^2_i, \dots$
• Vector y-variables, indexed by $j$: $y^1_j, y^2_j, \dots$
With KeOps you will get:
• Linear memory footprint.
• High order derivatives – thank you Joan!
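The linear memory footprint comes from evaluating F on the fly inside the reduction, so the N-by-M table of values F(x_i, y_j) is never stored. The pure-Python sketch below illustrates this idea for the Sum and LogSumExp reductions; it is not the KeOps implementation, the function names are made up, and F is scalar-valued here for brevity.

```python
import math

def online_sum(F, xs, ys):
    """a_i = sum_j F(x_i, y_j), with one running accumulator per row i."""
    out = []
    for x_i in xs:
        acc = 0.0
        for y_j in ys:
            acc += F(x_i, y_j)   # the value is used once, then discarded
        out.append(acc)
    return out

def online_logsumexp(F, xs, ys):
    """a_i = log sum_j exp(F(x_i, y_j)), with a running maximum for stability."""
    out = []
    for x_i in xs:
        m, acc = -math.inf, 0.0  # invariant: partial sum = acc * exp(m)
        for y_j in ys:
            f = F(x_i, y_j)
            if f > m:
                acc = acc * math.exp(m - f) + 1.0   # rescale to the new maximum
                m = f
            else:
                acc += math.exp(f - m)
        out.append(m + math.log(acc))
    return out

def neg_sqdist(x, y):
    """-|x - y|^2 for two points given as tuples."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))

xs = [(0.0, 0.0), (1.0, 0.0)]
ys = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0)]

gauss = online_sum(lambda x, y: math.exp(neg_sqdist(x, y)), xs, ys)
lse = online_logsumexp(neg_sqdist, xs, ys)
print(gauss, lse)
```

Both functions only keep a constant amount of state per output row, whatever the number M of y-variables; `lse[i]` agrees with `math.log(gauss[i])` up to rounding.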

KeOps’ low-level interface: generic_sum

With $x_i$, $y_j$ points in $\mathbb{R}^3$ and $b_j$ a 2D signal:

$a_i = \sum_{j=1}^{M} \exp\left( -\frac{\|x_i - y_j\|^2}{\sigma^2} \right) \cdot b_j$

from pykeops.torch import generic_sum

gaussian_conv = generic_sum(
"Exp(-G*SqDist(X,Y)) * B", # Custom formula
"A = Vx(2)", # Output, 2D, indexed by i
"G = Pm(1)", # 1st arg, 1D, parameter
"X = Vx(3)", # 2nd arg, 3D, indexed by i
"Y = Vy(3)", # 3rd arg, 3D, indexed by j
"B = Vy(2)") # 4th arg, 2D, indexed by j

# Simply apply your routine to CPU/GPU torch tensors!

a = gaussian_conv( 1/sigma**2, x, y, b )
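
On small inputs, the output of such a routine can be compared with a dense evaluation of the same formula. The sketch below is a plain PyTorch reference, not part of KeOps: the name `gaussian_conv_dense` is hypothetical, it takes its arguments in the same order (g = 1/σ², x, y, b), and it builds the full (N, M) kernel matrix, so it is only meant for testing.

```python
import torch

def gaussian_conv_dense(g, x, y, b):
    """a_i = sum_j exp(-g * |x_i - y_j|^2) * b_j, with an explicit (N, M) matrix."""
    sqd = ((x[:, None, :] - y[None, :, :]) ** 2).sum(dim=2)   # (N, M)
    return torch.exp(-g * sqd) @ b                             # (N, 2)

torch.manual_seed(0)
sigma = 0.5
x = torch.randn(6, 3)   # N = 6 points in R^3
y = torch.randn(4, 3)   # M = 4 points in R^3
b = torch.randn(4, 2)   # 2D signal carried by the y_j

a = gaussian_conv_dense(1 / sigma**2, x, y, b)
print(a.shape)
```

Here `g` may be a Python float or a one-element tensor; broadcasting handles both cases.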
It works!

Kernel norm + gradient with N vertices on a cheap laptop’s GPU (GTX960M)

[Figure: log-log plot of Time (sec), from 10^-3 to 10^2, against Number of points N, from 10^2 to 10^6, with three curves: PyTorch on CPU, PyTorch on GPU, PyTorch+KeOps. Two "out of mem" markers appear on the plot.]

+ You can go further and use multiscale, FMM-like information.

Recap of today’s presentation

Key points:

• Gradients are cheap.

• PyTorch is the perfect framework for researchers as it’s both
simple and flexible.
• It generalizes regression to arbitrary models, without hassle.
• Multiscale image analysis has gone through a revolution over the
past six years.

What about your field?

Going further

pytorch.org
www.kernel-operations.io
www.math.ens.fr/~feydy/Teaching/
Differential geometry and stochastic dynamics with Deep Learning numerics, Kühnel, Arnaudon, Sommer (2017)

Thank you for your attention.
