Differentiable Programming and Design Optimization
What is differentiable programming?
Derivatives in machine learning
Deep learning is behind most recent advances in AI
Computer vision, generative models, autonomous driving, speech recognition, machine translation
(Figure captions: Top-5 error rate on ImageNet (NVIDIA devblog); VQ-VAE (Razavi et al. 2019); Tesla Autopilot; word error rates (Huang et al., 2014); Google Neural Machine Translation System (GNMT))
Derivatives in machine learning
Deep learning is behind most recent advances in AI
Deep learning = nonlinear differentiable functions (programs) whose parameters are tuned by gradient-based optimization
Automatic differentiation
In practice the derivatives for gradient-based optimization come from running differentiable code via automatic differentiation
Many names:
- Automatic differentiation
- Algorithmic differentiation
- Autodiff
- Algodiff
- Autograd
- AD
Also remember:
- Backpropagation (backward propagation of errors)
- Backprop
Differentiable programming
Execute differentiable code via automatic differentiation
● Differentiable programming: writing software composed of differentiable and parameterized building blocks that are executed via automatic differentiation and optimized in order to perform a specified task
● A generalization of deep learning (neural networks are just a class of more general differentiable functions)
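As a minimal illustration (a made-up toy task in PyTorch, not from the slides): a small parameterized, differentiable program whose parameter is tuned by gradient descent to meet an objective.

    import torch

    theta = torch.tensor(0.1, requires_grad=True)   # tunable parameter of the program

    def program(x):
        # A differentiable "building block": scale the input, apply a nonlinearity.
        return torch.tanh(theta * x)

    target = torch.tensor(0.5)
    optimizer = torch.optim.SGD([theta], lr=0.1)
    for _ in range(200):
        optimizer.zero_grad()
        loss = (program(torch.tensor(2.0)) - target) ** 2
        loss.backward()    # automatic differentiation gives d(loss)/d(theta)
        optimizer.step()   # gradient-based optimization of the program's parameter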
How do we compute derivatives of computer code?
Derivatives as code
We can compute the derivatives not just of mathematical functions, but of general-purpose computer code (with control flow, loops, recursion, etc.)
(Portraits: Newton, c. 1665; Leibniz, c. 1675)
Manual differentiation
Find the analytical derivative using calculus, and implement it as code
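For instance (using the f(a, b) = log(a·b) example that appears later in these slides), one derives ∂f/∂a = 1/a and ∂f/∂b = 1/b by hand and implements both the function and its gradient:

    import math

    def f(a, b):
        return math.log(a * b)

    def grad_f(a, b):
        # Hand-derived: d/da log(a*b) = 1/a, d/db log(a*b) = 1/b
        return 1.0 / a, 1.0 / b

    print(f(2, 3), grad_f(2, 3))   # 1.791..., (0.5, 0.333...)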
Symbolic differentiation
Symbolic computation with Mathematica, Maple, Maxima, and deep learning frameworks such as Theano (graph optimization)
Problem: expression swell (e.g., in Theano)
Problem: only applicable to closed-form mathematical expressions, not to general code with control flow
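A minimal sketch of the idea, with SymPy as a stand-in for Mathematica/Maple/Maxima (the nested expression is made up to provoke swell): the symbolic derivative of a repeatedly composed closed-form expression grows much faster than the expression itself.

    import sympy as sp

    x = sp.symbols('x')

    # Repeated composition of a closed-form expression.
    expr = x
    for _ in range(4):
        expr = sp.sin(expr) * sp.cos(expr)

    dexpr = sp.diff(expr, x)   # symbolic derivative

    # Expression swell: the derivative has many more operations than the original.
    print(sp.count_ops(expr), sp.count_ops(dexpr))

Such symbolic engines only handle closed-form expressions; derivatives of code with loops or branches are outside their scope.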
Numerical differentiation
Finite difference approximation: ∂f/∂x_i ≈ (f(x + h·e_i) − f(x)) / h
More accurate schemes exist:
- Richardson extrapolation
- Differential quadrature
These increase rapidly in complexity and never completely eliminate the approximation error
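A minimal sketch of the forward-difference approximation, again on f(a, b) = log(a·b) (the step size h is an illustrative choice; too large gives truncation error, too small gives round-off error):

    import math

    def f(a, b):
        return math.log(a * b)

    def fd_grad(f, a, b, h=1e-6):
        # One extra function evaluation per input dimension.
        da = (f(a + h, b) - f(a, b)) / h
        db = (f(a, b + h) - f(a, b)) / h
        return da, db

    print(fd_grad(f, 2.0, 3.0))   # approximately (0.5, 0.333), but never exact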
Automatic differentiation
If we don’t need analytic derivative expressions, we can evaluate a gradient exactly with only one forward and one reverse execution of the program
Backprop or automatic differentiation?
(Timeline figure: 1960s, 1970s, 1980s; Griewank, 1989, revived reverse mode)
Automatic differentiation
All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives
- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies
Example: the program f(a, b) below, together with its computational graph (nodes a, b → * → c → log → d):

    f(a, b):
        c = a * b
        d = log(c)
        return d

Primal evaluation at (a, b) = (2, 3): c = 6, d = log(6) ≈ 1.791, giving 1.791 = f(2, 3).
Derivative evaluation: propagating tangents/adjoints through the same trace gives the “gradient” [0.5, 0.333] = f’(2, 3), i.e. [1/a, 1/b] (intermediate adjoints: 1 at d, 1/6 ≈ 0.166 at c).
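For reference, a minimal sketch (assuming PyTorch, which is introduced later in the deck) that reproduces these numbers with reverse-mode AD, using one forward (primal) execution and one reverse (adjoint) execution:

    import torch

    a = torch.tensor(2.0, requires_grad=True)
    b = torch.tensor(3.0, requires_grad=True)

    c = a * b           # primal: 6.0
    d = torch.log(c)    # primal: 1.791...

    d.backward()        # reverse pass: propagate adjoints through the recorded trace
    print(d.item(), a.grad.item(), b.grad.item())   # 1.791..., 0.5, 0.333...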
Automatic differentiation
Two main flavors:
- Forward mode: primals and derivatives (tangents) are propagated together, in the direction of the program’s execution
- Reverse mode: primals are computed in a forward pass, then derivatives (adjoints) are propagated backwards

Nested combinations (higher-order derivatives, Hessian–vector products, etc.):
- Forward-on-reverse
- Reverse-on-forward
- ...
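As an illustration (a minimal JAX sketch; the function f and the vector v are made up for this example, not from the slides): forward mode computes a Jacobian–vector product, reverse mode a vector–Jacobian product, and nesting them gives, e.g., a Hessian–vector product.

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sum(jnp.sin(x) ** 2)   # scalar-valued example function

    x = jnp.array([0.1, 0.2, 0.3])
    v = jnp.array([1.0, 0.0, 0.0])

    # Forward mode: push a tangent v through the program (Jacobian-vector product).
    primal, tangent = jax.jvp(f, (x,), (v,))

    # Reverse mode: pull an adjoint back through the program (vector-Jacobian product).
    primal2, vjp_fn = jax.vjp(f, x)
    gradient = vjp_fn(jnp.ones_like(primal2))[0]

    # Forward-on-reverse: Hessian-vector product without forming the full Hessian.
    hvp = jax.jvp(jax.grad(f), (x,), (v,))[1]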
Tools and communities
Two communities getting to know each other
Automatic differentiation community:
- Methods: theory of differentiation, adjoints, checkpointing, source transformation
- Applications: scientific computing, engineering design, computational fluid dynamics, Earth sciences, computational finance
- Community: 1st international conference in 1991 (http://www.autodiff.org/?module=Workshops)

Machine learning community:
- Methods: deep learning, differentiable programming, probability theory, Bayesian methods
- Applications: virtually all recent machine learning applications, pattern recognition, representation learning
- Community: 1st autodiff workshop at NeurIPS in 2016 (https://autodiff-workshop.github.io/)
Baydin, Atılım Güneş, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2018. “Automatic Differentiation in Machine Learning: a Survey.” Journal of Machine Learning Research (JMLR) 18 (153): 1–43. http://jmlr.org/papers/v18/17-468.html
Automatic differentiation
It is a (small) field of its own, with a dedicated community:
http://www.autodiff.org/
Non-machine-learning applications in industry and academia:
● Computational fluid dynamics
● Atmospheric sciences
● Computational finance
● Engineering design optimization
(Animations: fuel ignition; supersonic flow in a rocket nozzle. GIFs: Jason Koebler, SpaceX)
Tools and communities
International Conferences on AD; European Workshops on AD
Differentiable programming frameworks
Static graphs (define-and-run)
Prototypical examples: Theano, TensorFlow 1.0
- The user creates the graph using symbolic placeholders, in a mini-language (domain-specific language, DSL)
- Limited (and unintuitive) control flow and expressivity
- The graph gets “compiled” to take care of expression swell, in-place operations, etc.
(Figure: graph compilation in Theano)
Let’s implement an example function, first in pure Python and then as a static graph (code omitted)
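As a stand-in for the omitted code, a minimal define-and-run sketch of the earlier f(a, b) = log(a·b) example in TensorFlow 1.x-style graph mode (the choice of example function and the use of tf.compat.v1 are assumptions, not from the slides): the graph is built symbolically first, then executed in a session.

    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    # Define the graph symbolically, before any data flows through it.
    a = tf.placeholder(tf.float32)
    b = tf.placeholder(tf.float32)
    d = tf.log(a * b)
    grads = tf.gradients(d, [a, b])   # symbolic derivative nodes added to the graph

    # Run the compiled graph with concrete inputs.
    with tf.Session() as sess:
        print(sess.run([d] + grads, feed_dict={a: 2.0, b: 3.0}))   # [1.791..., 0.5, 0.333...]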
Dynamic graphs (define-by-run)
Prototypical example: PyTorch
General-purpose autodiff, usually via operator overloading
- The user writes regular programs in the host programming language; all language features (including control flow) are supported
- The graph is constructed automatically as the program runs
Let’s implement the same example, in pure Python and in PyTorch (code omitted)
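Again as a hedged stand-in for the omitted slide code, a minimal define-by-run sketch of the same example in PyTorch: ordinary Python (including control flow) is executed directly, and the graph is recorded as the program runs.

    import torch

    def f(a, b):
        c = a * b
        if c > 1:            # ordinary host-language control flow
            return torch.log(c)
        return c

    a = torch.tensor(2.0, requires_grad=True)
    b = torch.tensor(3.0, requires_grad=True)
    d = f(a, b)
    d.backward()
    print(d.item(), a.grad.item(), b.grad.item())   # 1.791..., 0.5, 0.333...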
Current state of differentiable programming
Evolution of frameworks
From: coarse-grained (module level) backprop
Towards: fine-grained, general-purpose automatic differentiation
- Theano (2008)
- Torch7 (2011) → torch-autograd (2015) → PyTorch (2016)
- HIPS autograd (2014)
- TensorFlow (2015) → TensorFlow eager execution (2017) → TensorFlow 2 (2019)
- JAX (2018)
Design optimization
Surrogates for differentiability
● Use a dataset of simulator runs to learn a differentiable approximation of the simulator (e.g., a deep generative model; see the sketch below)
Behl, Baydin, Gal, Torr, Vineet. “AutoSimulate: (Quickly) Learning Synthetic Data Generation.” ECCV 2020
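A minimal sketch of the general recipe (not the AutoSimulate method itself; the toy “simulator” and network sizes are made up): fit a neural network to (parameter, output) pairs from simulator runs, then use the differentiable surrogate to obtain gradients with respect to the design parameter.

    import math
    import random
    import torch
    import torch.nn as nn

    def simulator(theta):
        # Stand-in for a black-box, non-differentiable simulator.
        return math.sin(theta) + 0.1 * random.gauss(0.0, 1.0)

    # Dataset of (design parameter, simulator output) pairs.
    thetas = torch.linspace(-3.0, 3.0, 200).unsqueeze(1)
    outputs = torch.tensor([[simulator(t.item())] for t in thetas])

    # Differentiable surrogate trained to approximate the simulator.
    surrogate = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
    for _ in range(500):
        optimizer.zero_grad()
        loss = ((surrogate(thetas) - outputs) ** 2).mean()
        loss.backward()
        optimizer.step()

    # Gradient of the (approximate) simulator output w.r.t. the design parameter.
    theta = torch.tensor([[0.5]], requires_grad=True)
    surrogate(theta).sum().backward()
    print(theta.grad)   # close to cos(0.5) for this toy simulator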
Example: exoplanet radiative transfer
● Posterior probability distributions of exoplanet atmospheric parameters conditioned on observed spectra, using radiative transfer simulators
● Surrogates allow up to 180x faster inference
Himes, Harrington, Cobb, Baydin, Soboczenski, O'Beirne, Zorzan, Wright, Scheffer, Domagal-Goldman, Arney. 2020. “Accelerating Bayesian Inference via Neural Networks: Application to Exoplanet Retrievals.” In AAS/Division for Planetary Sciences Meeting Abstracts (Vol. 52, No. 6, pp. 207-07)
Example: local generative surrogates
● Deep generative surrogates (GAN) successively trained in local neighborhoods
● Optimize the SHiP muon shield (GEANT4, FairRoot): minimize the number of recorded muons by varying the magnet geometry
Shirobokov, Belavin, Kagan, Ustyuzhanin, Baydin “Black-Box Optimization with Local Generative Surrogates” NeurIPS 2020
Example: universal probabilistic surrogates
● Replace a (slow) universal probabilistic program with a (fast) LSTM-based surrogate that works in the same address space
● Enables faster Bayesian inference
● Differentiable surrogate model can enable gradient-based inference engines
Munk, Scibior, Baydin, Stewart, Fernlund, Poursartip, Wood. “Deep Probabilistic Surrogate Networks for Universal Simulator Approximation.” ProbProg 2020
Differentiability without surrogates
Forth, Shaun A.; Evans, Trevor P. “Aerofoil Optimisation via AD of a Multigrid Cell-Vertex Euler Flow Solver.” 2002
Daniele Casanova, Robin S. Sharp, Mark Final, Bruce Christianson, Pat Symonds. “Application of Automatic Differentiation to Race Car Performance Optimisation.” In Automatic Differentiation of Algorithms: From Simulation to Optimization, Springer, 2002
End-to-end differentiable pipelines
● Complex experimental setups can be composed of a pipeline of distinct simulators (e.g., SHERPA → GEANT)
● One might need to differentiate through the whole end-to-end pipeline, which can be achieved by compositionality and the chain rule (see the sketch below)
Milutinovic, Baydin, Zinkov, Harvey, Song, Wood, Shen. 2017. “End-to-End Training of Differentiable Pipelines Across Machine Learning Frameworks.” NeurIPS 2017 Autodiff Workshop
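A minimal sketch of the compositional principle (generic PyTorch modules standing in for the distinct simulators; this is not the SHERPA/GEANT pipeline itself): a single backward call differentiates through the composed pipeline via the chain rule.

    import torch
    import torch.nn as nn

    # Two differentiable stages standing in for distinct components of a pipeline.
    stage1 = nn.Linear(4, 8)
    stage2 = nn.Linear(8, 1)

    x = torch.randn(1, 4)
    output = stage2(torch.tanh(stage1(x)))   # end-to-end composition of the stages

    output.sum().backward()                  # chain rule through both stages at once
    print(stage1.weight.grad.shape, stage2.weight.grad.shape)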
Differentiable programming in particle physics
● Differentiable analysis: unify the analysis pipeline by simultaneously optimizing the free parameters of an analysis with respect to the desired physics objective
● Differentiable simulation: enable efficient simulation-based inference, reducing the number of events needed by orders of magnitude
Baydin, Cranmer, Feickert, Gray, Heinrich, Held, Melo, Neubauer, Pearkes, Simpson, Smith, Stark, Thais, Vassilev, Watts. 2020. “Differentiable Programming in High-Energy Physics.” In Snowmass 2021 Letters of Interest (LOI), Division of Particles and Fields (DPF), American Physical Society. https://snowmass21.org/loi
Optimization of experimental design
● Design of instruments is a complex task, involving a combination of performance and cost considerations
● We need the next generation of tools to optimize modern and future particle detectors and experiments
● MODE (Machine-learning Optimized Design of Experiments) collaboration: https://mode-collaboration.github.io/
Baydin, Cranmer, de Castro Manzano, Delaere, Derkach, Donini, Dorigo, Giammanco, Kieseler, Layer, Louppe, Ratnikov, Strong, Tosi, Ustyuzhanin, Vischia, Yarar. 2021. “Toward Machine Learning Optimization of Experimental Design.” Nuclear Physics News 31 (1)
Summary
● What is differentiable programming?
○ How to compute derivatives
○ Automatic differentiation
○ Tools and communities
● Differentiable programming in practice
○ Current state of differentiable programming
● Design optimization
○ Surrogates
○ Direct differentiation
Thank you for listening
Questions?
Selected references
[1] T. A. Le, A. G. Baydin, and F. Wood. Inference compilation and universal probabilistic programming. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[2] A. Munk, A. Ścibior, A. G. Baydin, A. Stewart, G. Fernlund, A. Poursartip, and F. Wood. Deep probabilistic surrogate networks for universal simulator approximation. In PROBPROG, 2020.
[3] A. G. Baydin, L. Heinrich, W. Bhimji, L. Shao, S. Naderiparizi, A. Munk, J. Liu, B. Gram-Hansen, G. Louppe, L. Meadows, P. Torr, V. Lee, Prabhat, K. Cranmer, and F. Wood. Efficient probabilistic inference in the quest for physics beyond the
standard model. In NeurIPS, 2019.
[4] A. G. Baydin, L. Shao, W. Bhimji, L. Heinrich, L. F. Meadows, J. Liu, A. Munk, S. Naderiparizi, B. Gram-Hansen, G. Louppe, M. Ma, X. Zhao, P. Torr, V. Lee, K. Cranmer, Prabhat, and F. Wood. Etalumis: Bringing probabilistic programming to scientific
simulators at scale. In SC19, 2019.
[5] K. Cranmer, J. Brehmer, and G. Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020.
[6] B. Gram-Hansen, C. Schroeder, P. H. Torr, Y. W. Teh, T. Rainforth, and A. G. Baydin. Hijacking malaria simulators with probabilistic programming. In ICML workshop on AI for Social Good, 2019.
[7] B. Gram-Hansen, C. S. de Witt, R. Zinkov, S. Naderiparizi, A. Scibior, A. Munk, F. Wood, M. Ghadiri, P. Torr, Y. W. Teh, A. G. Baydin, and T. Rainforth. Efficient bayesian inference for nested simulators. In AABI, 2019.
[8] B. Poduval, A. G. Baydin, and N. Schwadron. Studying solar energetic particles and their seed population using surrogate models. In MML for Space Sciences workshop, COSPAR, 2021.
[9] G. Acciarini, F. Pinto, S. Metz, S. Boufelja, S. Kaczmarek, K. Merz, J. A. Martinez-Heras, F. Letizia, C. Bridges, and A. G. Baydin. Spacecraft collision risk assessment with probabilistic programming. In ML4PS (NeurIPS 2020), 2020.
[10] F. Pinto, G. Acciarini, S. Metz, S. Boufelja, S. Kaczmarek, K. Merz, J. A. Martinez-Heras, F. Letizia, C. Bridges, and A. G. Baydin. Towards automated satellite conjunction management with bayesian deep learning. In AI for Earth Sciences Workshop
(NeurIPS), 2020.
[11] G. Acciarini, F. Pinto, S. Metz, S. Boufelja, S. Kaczmarek, K. Merz, J. A. Martinez-Heras, F. Letizia, C. Bridges, and A. G. Baydin. Kessler: a machine learning library for space collision avoidance. In 8th European Conference on Space Debris, 2021.
[12] S. Shirobokov, V. Belavin, M. Kagan, A. Ustyuzhanin, and A. G. Baydin. Black-box optimization with local generative surrogates. In NeurIPS, 2020.
[13] H. S. Behl, A. G. Baydin, R. Gal, P. H. S. Torr, and V. Vineet. Autosimulate: (quickly) learning synthetic data generation. In 16th European Conference on Computer Vision (ECCV), 2020.
[14] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153):1–43, 2018.
[15] A. G. Baydin, B. A. Pearlmutter, and J. M. Siskind. DiffSharp: An AD library for .net languages. In 7th International Conference on Algorithmic Differentiation, 2016.
[16] A. G. Baydin, R. Cornish, D. M. Rubio, M. Schmidt, and F. Wood. Online learning rate adaptation with hypergradient descent. In ICLR, 2018.
[17] H. Behl, A. G. Baydin, and P. H. Torr. Alpha maml: Adaptive model-agnostic meta-learning. In AutoML (ICML), 2019.
[18] A. G. Baydin, K. Cranmer, M. Feickert, L. Gray, L. Heinrich, A. Held, A. Melo, M. Neubauer, J. Pearkes, N. Simpson, N. Smith, G. Stark, S. Thais, V. Vassilev, and G. Watts. Differentiable programming in high-energy physics. In Snowmass
2021 Letters of Interest (LOI), Division of Particles and Fields (DPF), American Physical Society, 2020.
[19] A. G. Baydin, K. Cranmer, P. de Castro Manzano, C. Delaere, D. Derkach, J. Donini, T. Dorigo, A. Giammanco, J. Kieseler, L. Layer, G. Louppe, F. Ratnikov, G. C. Strong, M. Tosi, A. Ustyuzhanin, P. Vischia, and H. Yarar. Toward machine learning
optimization of experimental design. Nuclear Physics News International (Submitted), 2020.
[20] L. F. Guedes dos Santos, S. Bose, V. Salvatelli, B. Neuberg, M. Cheung, M. Janvier, M. Jin, Y. Gal, P. Boerner, and A. G. Baydin. Multi-channel auto-calibration for the atmospheric imaging assembly using machine learning. Astronomy &
Astrophysics (in press), 2021.
[21] A. D. Cobb, M. D. Himes, F. Soboczenski, S. Zorzan, M. D. O’Beirne, A. G. Baydin, Y. Gal, S. D. Domagal-Goldman, G. N. Arney, and D. Angerhausen. An ensemble of bayesian neural networks for exoplanetary atmospheric retrieval. The
Astronomical Journal, 158(1), 2019.
[22] C. Schroeder de Witt, B. Gram-Hansen, N. Nardelli, A. Gambardella, R. Zinkov, P. Dokania, N. Siddharth, A. B. Espinosa-Gonzalez, A. Darzi, P. Torr, and A. G. Baydin. Simulation-based inference for global health decisions. In ICML Workshop
on Machine Learning for Global Health, Thirty-seventh International Conference on Machine Learning (ICML 2020), 2020.
Supplementary slides
Forward vs reverse
Derivatives in machine learning
“Backprop” and gradient descent are at the core of most recent advances in machine learning
Pyro (2017), ProbTorch (2017), PyProb (2019)
- Variational inference
- “Neural” density estimation
- Transformed distributions via bijectors
- Normalizing flows (Rezende & Mohamed, 2015)
- Masked autoregressive flows (Papamakarios et al., 2017)
AD is at the core of machine learning
A new mindset and workflow, enabling differentiable algorithmic elements
● Neural Turing Machine, Differentiable Neural Computer (Graves et al. 2014, 2016)
○ Can infer algorithms: copy, sort, recall
● Stack-augmented RNN (Joulin & Mikolov, 2015)
● End-to-end memory network (Sukhbaatar et al., 2015)
● Stack, queue, deque (Grefenstette et al., 2015)
● Discrete interfaces (Zaremba & Sutskever, 2015)