
arXiv:2009.14596v1 [math.NA] 23 Sep 2020

Machine Learning and Computational Mathematics

Weinan E
Princeton University and Beijing Institute of Big Data Research
[email protected]

In memory of Professor Feng Kang (1920-1993)

Contents

1 Introduction

2 Machine learning-based algorithms for problems in computational science
  2.1 Nonlinear multi-grid method and protein folding
  2.2 Molecular dynamics with ab initio accuracy

3 Machine learning-based algorithms for high dimensional problems in scientific computing
  3.1 Stochastic control
  3.2 Nonlinear parabolic PDEs
  3.3 Moment closure for kinetic equations modeling gas dynamics

4 Mathematical theory of machine learning
  4.1 An introduction to neural network-based supervised learning
  4.2 Approximation theory
  4.3 Estimation error
  4.4 A priori estimates for regularized models

5 Machine learning from a continuous viewpoint
  5.1 Representation of functions
  5.2 The stochastic optimization problem
  5.3 Optimization: Gradient flows
  5.4 Discretizing the gradient flows
  5.5 The optimal control problem for flow-based representation

6 Concluding remarks

Abstract
Neural network-based machine learning is capable of approximating functions in
very high dimension with unprecedented efficiency and accuracy. This has opened up
many exciting new possibilities, not just in traditional areas of artificial intelligence,
but also in scientific computing and computational science. At the same time,
machine learning has also acquired the reputation of being a set of “black box”
type of tricks, without fundamental principles. This has been a real obstacle for
making further progress in machine learning.
In this article, we try to address the following two very important questions:
(1) How has machine learning already impacted, and how will it further impact, computational mathematics, scientific computing and computational science? (2) How can computational mathematics, particularly numerical analysis, impact machine learning? We describe some of the most important progress that has been made on
these issues. Our hope is to put things into a perspective that will help to integrate
machine learning with computational mathematics.

1 Introduction
Neural network-based machine learning (ML) has shown very impressive success on a
variety of tasks in traditional artificial intelligence. This includes classifying images, gen-
erating new images such as (fake) human faces and playing sophisticated games such as
Go. A common feature of all these tasks is that they involve objects in very high dimen-
sion. Indeed when formulated in mathematical terms, the image classification problem is
a problem of approximating a high dimensional function, defined on the space of images,
to the discrete set of values corresponding to the category of each image. The dimen-
sionality of the input space is typically 3 times the number of pixels in the image, where
3 is the dimensionality of the color space. The image generation problem is a problem
of generating samples from an unknown high dimensional distribution, given a set of
samples from that distribution. The Go game problem is about solving a Bellman-like
equation in dynamic programming, since the optimal strategy satisfies such an equation.
For sophisticated games such as Go, this Bellman-like equation is formulated on a huge
space.
All these are made possible by the ability to accurately approximate high dimensional
functions, using modern machine learning techniques. This opens up new possibilities for
attacking problems that suffer from the “curse of dimensionality” (CoD): As dimension-
ality grows, computational cost grows exponentially fast. This CoD problem has been an
essential obstacle for the scientific community for a very long time.
Take, for example, the problem of solving partial differential equations (PDEs) nu-
merically. With traditional numerical methods such as finite difference, finite element
and spectral methods, we can now routinely solve PDEs in three spatial dimensions plus
the temporal dimension. Most of the PDEs currently studied in computational mathe-
matics belong to this category. Well known examples include the Poisson equation, the
Maxwell equation, the Euler equation, the Navier-Stokes equations, and the PDEs for
linear elasticity. Sparse grids can increase our ability to handle PDEs in, say, 8 to 10 dimensions. This allows us to try solving problems such as the Boltzmann equation for simple molecules. But we are totally lost when faced with PDEs in, say, 100 dimensions. This makes it essentially impossible to solve Fokker-Planck or Boltzmann equations for complex molecules, the many-body Schrödinger equation, or the Hamilton-Jacobi-Bellman equations for realistic control problems.
This is exactly where machine learning can help. Indeed, starting with the work in
[20, 10, 21], machine learning-based numerical algorithms for solving high dimensional PDEs and control problems have been one of the most exciting new developments in recent years in scientific computing, and this has opened up a host of new possibilities for
computational mathematics. We refer to [17] for a review of this exciting development.
Solving PDEs is just the tip of the iceberg. There are many other problems for which
CoD is the main obstacle, including:

• classical many-body problem, e.g. protein folding

• turbulence. Even though turbulence can be modeled by the three dimensional


Navier-Stokes equation, it has so many active degrees of freedom that an effective
model for turbulence should involve many variables.

• solid mechanics. In solid mechanics, we do not even have the analog of the Navier-
Stokes equation. Why is this the case? Well, the real reason is that the behavior of
solids is essentially a multi-scale problem that involves scales from atomistic all the
way to macroscopic.

• multi-scale modeling. In fact most multi-scale problems for which there is no sepa-
ration of scales belong to this category. An immediate example is the dynamics of
polymer fluids or polymer melts.

Can machine learning help for these problems? More generally, can we extend the
success of machine learning beyond traditional AI? We will try to convince the reader
that this is indeed the case for many problems.
Besides being extremely powerful, neural network-based machine learning has also got
the reputation of being a set of tricks instead of a set of systematic scientific principles.
Its performance depends sensitively on the value of the hyper-parameters, such as the
network widths and depths, the initialization, the learning rates, etc. Indeed just a few
years ago, parameter tuning was considered to be very much of an art. Even now, this
is still the case for some tasks. Therefore a natural question is: Can we understand these
subtleties and propose better machine learning models whose performance is more robust?
In this article, we review what has been learned on these two issues. We discuss the
impact that machine learning has already made or will make on computational mathemat-
ics, and how the ideas from computational mathematics, particularly numerical analysis,
can be used to help understanding and better formulating machine learning models. On
the former, we will mainly discuss the new problems that can now be addressed using
ML-based algorithms. Even though machine learning also suggests new ways to solve
some traditional problems in computational mathematics, we will not say much on this
front.

2 Machine learning-based algorithms for problems in computational science
In this and the next section, we will discuss how neural network models can be used
to develop new algorithms. For readers who are not familiar with neural networks, just

think of them as being some replacement of polynomials. We will discuss neural networks
afterwards.

2.1 Nonlinear multi-grid method and protein folding


In the traditional multi-grid method [5], say for solving the linear systems of equations that arise from some finite difference or finite element discretization of a linear elliptic equation, our objective is to minimize a quadratic function like
$$I_h(u_h) = \frac{1}{2} u_h^T L_h u_h - f_h^T u_h$$
Here h is the grid size of the discretization. The basic idea of the multi-grid method is to
iterate between solving this problem and a reduced problem on a coarser grid with grid
size H. In order to do this, we need the following

• a projection operator $P: u_h \to u_H$, that maps functions defined on the fine grid to functions defined on the coarse grid.

• the effective operator at scale $H$: $L_H = P^T L_h P$. This defines the objective function on the coarse grid:
$$I_H(u_H) = \frac{1}{2} u_H^T L_H u_H - f_H^T u_H$$

• a prolongation operator $Q: u_H \to u_h$, that maps functions defined on the coarse grid to functions defined on the fine grid. Usually one can take $Q$ to be $P^T$.

The key idea here is coarse graining, and iterating between the fine scale and the
coarse-grained problem. The main components in coarse graining are a set of coarse-
grained variables and the effective coarse-grained problem. Formulated this way, these
are obviously general ideas that can be relevant for a wide variety of problems. In practice,
however, the difficulty lies in how to obtain the effective coarse-grained problem, a step
that is trivial for linear problems, and this is where machine learning can help.
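To make the linear case concrete, the sketch below (Python with numpy) carries out one two-grid cycle for a 1D Poisson-type problem. The 1D Laplacian, the full-weighting restriction and the damped Jacobi smoother are illustrative choices, not taken from [5]; here P is used as the restriction matrix mapping fine-grid functions to coarse-grid functions, the prolongation is taken to be its transpose, and the effective coarse operator is formed as P L_h P^T.

```python
import numpy as np

def laplacian_1d(n):
    """Standard 3-point finite difference Laplacian on n interior grid points."""
    L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return L * (n + 1) ** 2

def full_weighting(n_fine):
    """Restriction P: fine grid (n_fine points) -> coarse grid ((n_fine - 1) // 2 points)."""
    n_coarse = (n_fine - 1) // 2
    P = np.zeros((n_coarse, n_fine))
    for i in range(n_coarse):
        j = 2 * i + 1                                  # coarse point i sits at fine point j
        P[i, j - 1:j + 2] = [0.25, 0.5, 0.25]
    return P

def two_grid_step(Lh, fh, uh, P, n_smooth=3, omega=2.0 / 3.0):
    """One two-grid cycle for minimizing I_h(u) = 0.5 u^T L_h u - f_h^T u."""
    D = np.diag(Lh)
    for _ in range(n_smooth):                          # pre-smoothing (damped Jacobi)
        uh = uh + omega * (fh - Lh @ uh) / D
    LH = P @ Lh @ P.T                                  # effective operator on the coarse grid
    rH = P @ (fh - Lh @ uh)                            # restricted residual
    eH = np.linalg.solve(LH, rH)                       # coarse-grid problem, solved exactly
    uh = uh + P.T @ eH                                 # prolongate with Q = P^T and correct
    for _ in range(n_smooth):                          # post-smoothing
        uh = uh + omega * (fh - Lh @ uh) / D
    return uh

n = 63
Lh = laplacian_1d(n)
x = np.linspace(0, 1, n + 2)[1:-1]
fh = np.sin(np.pi * x)
uh = np.zeros(n)
for _ in range(10):
    uh = two_grid_step(Lh, fh, uh, full_weighting(n))
print("residual norm:", np.linalg.norm(fh - Lh @ uh))
```

For a linear problem the coarse operator is available in closed form; the point of the nonlinear setting below is precisely that this step has to be learned.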
We are going to use the protein folding problem as an example to illustrate the general
idea for nonlinear problems.
Let {xj } be the positions of the atoms in a protein and the surrounding solvent, and
U = U ({xj }) be the potential energy of the combined protein-solvent system. The poten-
tial energy consists of the energies due to chemical bonding, Van der Waals interaction,
electro-static interaction, etc. The protein folding problem is to find the “ground state”
of the energy U :
“Minimize” U.
Here we have added quotation marks since really what we want to do is to sample the
distribution
$$\rho_\beta = \frac{1}{Z} e^{-\beta U}, \qquad \beta = (k_B T)^{-1}$$
To define the coarse-grained problem, we assume that we are given a set of collective
variables: s = (s1 , · · · , sn ), sj = sj (x1 , · · · , xN ), (n < N ). One possibility is to use the

dihedral angles as the coarse-grained variables. In principle, one may also use machine
learning methods to learn the “best” set of coarse-grained variables but this direction will
not be pursued here.
Having defined the coarse-grained variables, the effective coarse-grained problem is
simply the free energy associated with this set of coarse-grained variables:
$$A(s) = -\frac{1}{\beta} \ln p_\beta(s), \qquad p_\beta(s) = \frac{1}{Z}\int e^{-\beta U(x)}\, \delta(s(x) - s)\, dx,$$
Unlike the case for linear problems, for which the effective coarse-grained model is readily
available, in the current situation, we have to find the function A first.
The idea is to approximate A by neural networks. The issue here is how to obtain the
training data.
Contrary to most standard machine learning problems where the training data is
collected beforehand, in applications to computational science and scientific computing,
the training data is collected “on-the-fly” as learning proceeds. This is referred to as
the “concurrent learning” protocol [11]. In this regard, the standard machine learning
problems for which the training data is collected beforehand are examples of “sequential
learning”. The key issue for concurrent learning is an efficient algorithm for generating
the data in the best way. The training dataset should on one hand be representative
enough and on the other hand be as small as possible.
A general procedure for generating such datasets is suggested in [11]. It is called
the EELT (exploration-examination-labeling-training) algorithm and it consists of the
following steps:
• exploration: exploring the s space. This can be done by sampling $\frac{1}{Z} e^{-\beta A(s)}$ with the current approximation of A.

• examination: for each state explored, decide whether that state should be labeled.
One way to do this is to use an a posteriori error estimator. One possible such a
posteriori error estimator is the variance of the predictions of an ensemble of machine
learning models, see [41].

• labeling: compute the mean force (say using restrained molecular dynamics)

F (s) = −∇s A(s).

from which the free energy A can be computed using standard thermodynamic
integration.

• training: train the appropriate neural network model. To come up with a good
neural network model, one has to take into account the symmetries in the prob-
lem. For example, if we coarse grain a full atom representation of a collection of
water molecules by eliminating the positions of the hydrogen atoms, then the free
energy function for the resulting system should have permutation symmetry and
this should be taken into account when designing the neural network model (see the
next subsection).

Figure 1: The folded and extended states of Trp-cage, reproduced with permission from [37].

This can also be viewed as a nonlinear multi-grid algorithm in the sense that it iterates
between sampling pβ on the space of the coarse-grained variables and the (constrained)
Gibbs distribution ρβ for the full atom description.
This is a general procedure that should work for a large class of nonlinear “multi-grid”
problems.
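The loop can be illustrated on a toy one-dimensional example. In the sketch below (Python/numpy), the unknown free energy is replaced by a cheap synthetic function so that the "labeling" step can be simulated, an ensemble of polynomial fits stands in for the neural networks, exploration is a short Metropolis walk in the current approximate free energy, and examination uses the ensemble variance as the a posteriori error estimator. This is a schematic caricature of the EELT procedure of [11, 37], not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0

def reference_free_energy(s):
    """Toy stand-in for A(s); in the real workflow only the mean force -dA/ds
    can be queried, via restrained molecular dynamics."""
    return 0.5 * s ** 2 + np.sin(3.0 * s)

def fit_ensemble(S, A, n_models=5, degree=5):
    """'Training' step: an ensemble of polynomial models fitted on bootstrap resamples."""
    return [np.polyfit(S[idx], A[idx], degree)
            for idx in (rng.integers(0, len(S), len(S)) for _ in range(n_models))]

def predict(models, s):
    preds = np.array([np.polyval(c, s) for c in models])
    return preds.mean(axis=0), preds.std(axis=0)      # prediction and ensemble spread

S = rng.uniform(-2.0, 2.0, 8)                         # a few initially labeled states
A = reference_free_energy(S)
n_explored = 0

for iteration in range(10):
    models = fit_ensemble(S, A)
    # exploration: Metropolis sampling of exp(-beta * A) with the current model of A
    s, visited = 0.0, []
    for _ in range(2000):
        s_new = np.clip(s + rng.normal(0.0, 0.5), -3.0, 3.0)   # bounded window for this toy
        a_old, _ = predict(models, np.array([s]))
        a_new, _ = predict(models, np.array([s_new]))
        if rng.random() < np.exp(-beta * (a_new[0] - a_old[0])):
            s = s_new
        visited.append(s)
    visited = np.array(visited)
    n_explored += len(visited)
    # examination: an explored state is a labeling candidate only if the ensemble disagrees there
    _, spread = predict(models, visited)
    candidates = visited[spread > 0.2]
    if len(candidates) == 0:
        break
    # labeling: query the expensive reference at a handful of selected states
    new_S = rng.choice(candidates, size=min(3, len(candidates)), replace=False)
    S = np.concatenate([S, new_S])
    A = np.concatenate([A, reference_free_energy(new_S)])

print(f"labeled {len(S)} states out of {n_explored} explored")
```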
Shown in Figure 1 is the extended and folded structure of Trp-cage. This is a small
protein with 20 amino acids. We have chosen the 38 dihedral angles as the collective
variables. The full result is presented in [37].

2.2 Molecular dynamics with ab initio accuracy


Molecular dynamics is a way of studying the behavior of molecular and material systems
by tracking the trajectories of all the nuclei in the system. The dynamics of the nuclei
is assumed to obey Newton’s law, with some potential energy function (typically called
potential energy surface or PES) V that models the effective interaction between the
nuclei:
$$m_i \frac{d^2 x_i}{dt^2} = -\nabla_{x_i} V, \qquad V = V(x_1, x_2, \ldots, x_i, \ldots, x_N),$$
How can we get the function V ? Traditionally, there have been two rather different
approaches. The first is to compute the inter-atomic forces (−∇V ) on the fly using
quantum mechanics models, the most popular one being the density functional theory
(DFT). This is known as the Car-Parrinello molecular dynamics or ab initio molecular
dynamics [6, 8]. This approach is quite accurate but is also very expensive, limiting the
size of the system that one can handle to about 1000 atoms, even with high performance
supercomputers. The other approach is to come up with empirical potentials. Basically
one guesses a functional form of V with a small set of fitting parameters which are then
determined by a small set of data. This approach is very efficient but unreliable. This
dilemma between accuracy and efficiency has been an essential road block for molecular
dynamics for a long time.
With machine learning, the new paradigm is to use DFT to generate the data, and
then use machine learning to generate an approximation to V . This approach has the

potential to produce an approximation to V that is as accurate as the DFT model and as efficient as the empirical potentials.

Figure 2: The results of EELT algorithm: Number of configurations explored vs. the number of data points labeled. Only a very small percentage of the configurations are labeled. Reproduced with permission from Linfeng Zhang. See also [40].
To achieve this goal, we have to address two issues. The first is the generation of data.
The second is coming up with the appropriate neural network model. These two issues
are the common features for all the problems that we discuss here.
The issue of adaptive data generation is very much the same as before. The EELT
procedure can still be used. The details of how each step is implemented are a little different.
We refer to [40] for details.
Figure 2 shows the effect of using the EELT algorithm. As one can see, a very small
percentage of the configurations explored are actually labeled. For the Al-Mg example, only ∼0.005% of the configurations explored are selected for labeling.
For the second issue, the design of appropriate neural networks, the most important
considerations are:
1. Extensiveness: the neural network should be extensive in the sense that if we want
to extend the system, we just have to extend the neural network accordingly. One
way of achieving this is suggested by Behler and Parrinello [4].

2. Preserving the symmetry. Besides the usual translational and rotational symmetry,
one also has the permutational symmetry: If we relabel a system of copper atoms,
its potential energy should not change. It makes a big difference in terms of the
accuracy of the neural network model whether one takes these symmetries into
account (see [23] and Figure 3).
One very nice and general way of addressing the symmetry problem is to design the
neural network model as the composition of two networks: An embedding network fol-
lowed by a fitting network. The task for the embedding network is to represent enough
symmetry-preserving functions to be fed into the fitting network [39].
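The following PyTorch sketch shows how these requirements can be built in; it is a schematic caricature in the spirit of the sum-over-atoms construction of [4] and the embedding/fitting decomposition of [39], not the actual Deep Potential architecture. Each atom's energy contribution is produced by the same fitting network applied to a pooled embedding of its pairwise distances, so the total energy is translation invariant, permutation invariant and extensive.

```python
import torch
import torch.nn as nn

class SymmetricEnergyNet(nn.Module):
    """Toy invariant energy model: E(x) = sum_i fit( sum_j embed(|x_i - x_j|) )."""

    def __init__(self, n_basis=16, hidden=64):
        super().__init__()
        # embedding network: acts on each pairwise distance separately
        self.embed = nn.Sequential(nn.Linear(1, n_basis), nn.Tanh(),
                                   nn.Linear(n_basis, n_basis), nn.Tanh())
        # fitting network: acts on the pooled (summed) embedding of each atom
        self.fit = nn.Sequential(nn.Linear(n_basis, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        # x: (batch, n_atoms, 3) Cartesian coordinates
        rij = x.unsqueeze(2) - x.unsqueeze(1)               # (B, N, N, 3), translation invariant
        dij = rij.norm(dim=-1, keepdim=True)                # (B, N, N, 1) pairwise distances
        n_atoms = x.shape[1]
        mask = 1.0 - torch.eye(n_atoms, device=x.device).view(1, n_atoms, n_atoms, 1)
        env = (self.embed(dij) * mask).sum(dim=2)           # sum over neighbors: permutation invariant
        atomic_energies = self.fit(env)                     # (B, N, 1): one contribution per atom
        return atomic_energies.sum(dim=(1, 2))              # total energy, extensive in N

model = SymmetricEnergyNet()
x = torch.randn(4, 5, 3)                                    # 4 configurations of 5 atoms
perm = torch.randperm(5)
print(torch.allclose(model(x), model(x[:, perm, :]), atol=1e-5))   # True: relabeling atoms leaves E unchanged
```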
With these issues properly addressed, one can come up with very satisfactory neural
network-based representation of V (see Figure 4). This representation is named Deep
Potential [23, 39] and the Deep Potential-based molecular dynamics is named DeePMD
[38]. As has been demonstrated recently in [30, 25], DeePMD, combined with state of the

art high performance supercomputers, can help to increase the size of the system that one can model with ab initio accuracy by 5 orders of magnitude.

Figure 3: The effect of symmetry preservation on testing accuracy. Shown in red are the results of a poor man's way of imposing symmetry (see main text for explanation). One can see that the testing accuracy is drastically improved. Reproduced with permission from Linfeng Zhang.

Figure 4: The test accuracy of the Deep Potential for a wide variety of systems. Reproduced with permission from Linfeng Zhang. See also [39].

3 Machine learning-based algorithms for high dimensional problems in scientific computing
3.1 Stochastic control
The first application of machine learning for solving high dimensional problems in scientific
computing was presented in [20]. Stochastic control was chosen as the first example due
to its close analogy with machine learning. Consider the discrete stochastic dynamical
system:
st+1 = st + bt (st , at ) + ξt+1 . (1)
Here st and at are respectively the state and control at time t, ξt is the noise at time t.
Our objective is to solve:
$$\min_{\{a_t\}_{t=0}^{T-1}} \mathbb{E}_{\{\xi_t\}}\left[\sum_{t=0}^{T-1} c_t(s_t, a_t) + c_T(s_T)\right] \qquad (2)$$

within the set of feedback controls:

at = At (st ). (3)

We approximate the functions At by neural network models:

At (s) ≈ Ãt (s|θt ), t = 0, · · · , T − 1 (4)

The optimization problem (2) then becomes:


$$\min_{\{\theta_t\}_{t=0}^{T-1}} \mathbb{E}_{\{\xi_t\}}\left[\sum_{t=0}^{T-1} c_t(s_t, \tilde{A}_t(s_t|\theta_t)) + c_T(s_T)\right]. \qquad (5)$$

Unlike the situation in standard supervised learning, here we have T neural networks to be trained simultaneously. The network architecture is shown in Figure 5.

Figure 5: Network architecture for solving stochastic control in discrete time. The whole
network has (N + 1)T layers in total that involve free parameters to be optimized simul-
taneously. Each column (except ξt ) corresponds to a sub-network at t. Reproduced with
permission from Jiequn Han. See also [20].

Compared with the standard setting for machine learning, one can see a clear analogy in which (1) plays the role of the residual network and the noise {ξt} plays the role of the data. Indeed, stochastic gradient descent (SGD) can be readily used to solve the
optimization problem (5).
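A minimal PyTorch sketch of this training loop is given below. It is not the setup of [20]: the dynamics, costs and network sizes are made-up toy choices. One small network per time step represents the feedback control, trajectories are simulated by (1), and SGD (here Adam) is applied directly to a Monte Carlo estimate of (5).

```python
import torch
import torch.nn as nn

T, d_state, d_control, batch = 20, 10, 10, 256

# one feedback-control network per time step: a_t = A_t(s_t | theta_t)
policies = nn.ModuleList([
    nn.Sequential(nn.Linear(d_state, 64), nn.ReLU(), nn.Linear(64, d_control))
    for _ in range(T)
])
opt = torch.optim.Adam(policies.parameters(), lr=1e-3)

def running_cost(s, a):            # toy quadratic running cost c_t(s_t, a_t)
    return (s ** 2).sum(dim=1) + 0.1 * (a ** 2).sum(dim=1)

def terminal_cost(s):              # toy terminal cost c_T(s_T)
    return (s ** 2).sum(dim=1)

for step in range(2000):
    s = torch.randn(batch, d_state)               # sampled initial states
    total = torch.zeros(batch)
    for t in range(T):
        a = policies[t](s)
        total = total + 0.05 * running_cost(s, a)
        xi = 0.1 * torch.randn(batch, d_state)    # the noise plays the role of the "data"
        s = s + 0.05 * (-s + a) + xi              # s_{t+1} = s_t + b_t(s_t, a_t) + xi_{t+1}
    loss = (total + terminal_cost(s)).mean()      # Monte Carlo estimate of the objective (5)
    opt.zero_grad()
    loss.backward()
    opt.step()
```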
An example of the application of this algorithm is shown in Figure 6 for the problem of
energy storage with multiple devices. Here n is the number of devices. For more details,
we refer to [20].

[Figure 6 plot: reward relative to the case n = 50 vs. training iteration, for n = 30, 40, 50 devices.]

Figure 6: Relative reward for the energy storage problem. The space of control functions is $\mathbb{R}^{n+2} \to \mathbb{R}^{3n}$ for n = 30, 40, 50, with multiple equality and inequality constraints. Reproduced with permission from Jiequn Han. See also [20].

3.2 Nonlinear parabolic PDEs
Consider parabolic PDEs of the form:
$$\frac{\partial u}{\partial t} + \frac{1}{2}\,\sigma\sigma^T : \nabla_x^2 u + \mu\cdot\nabla u + f\left(\sigma^T\nabla u\right) = 0, \qquad u(T, x) = g(x)$$
We study a terminal-value problem instead of initial-value problem since one of the main
applications we have in mind is in finance. To develop machine learning-based algorithms,
we would like to first reformulate this as a stochastic optimization problem. This can be
done using backward stochastic differential equations (BSDE) [33].

$$\inf_{Y_0,\, \{Z_t\}_{0\le t\le T}} \mathbb{E}\,|g(X_T) - Y_T|^2, \qquad (6)$$
$$\text{s.t.}\quad X_t = \xi + \int_0^t \mu(s, X_s)\, ds + \int_0^t \Sigma(s, X_s)\, dW_s, \qquad (7)$$
$$Y_t = Y_0 - \int_0^t h(s, X_s, Y_s, Z_s)\, ds + \int_0^t (Z_s)^T\, dW_s. \qquad (8)$$

It can be shown that the unique minimizer of this problem is the solution to the PDE
with:
Yt = u(t, Xt ) and Zt = σ T (t, Xt ) ∇u(t, Xt ). (9)
With this formulation, one can develop a machine learning-based algorithm along the
following lines, adopting the ideas for the stochastic control problems discussed earlier
[10, 21]:

• After time discretization, approximate the unknown functions

X0 7→ u(0, X0 ) and Xtj 7→ σ T (tj , Xtj ) ∇u(tj , Xtj )

by feedforward neural networks ψ and φ.

• Using the BSDE, one constructs an approximation $\hat{u}$ that takes the paths $\{X_{t_n}\}_{0\le n\le N}$ and $\{W_{t_n}\}_{0\le n\le N}$ as the input data and gives the final output, denoted by $\hat{u}(\{X_{t_n}\}_{0\le n\le N}, \{W_{t_n}\}_{0\le n\le N})$, as an approximation to $u(t_N, X_{t_N})$.

• The error in the matching between û and the given terminal condition defines the
expected loss function
$$l(\theta) = \mathbb{E}\left[\left|g(X_{t_N}) - \hat{u}\left(\{X_{t_n}\}_{0\le n\le N}, \{W_{t_n}\}_{0\le n\le N}\right)\right|^2\right].$$

This algorithm is called the Deep BSDE method.
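A minimal PyTorch sketch of the resulting scheme is given below. It is a simplified caricature of [10, 21], not their implementation: here μ = 0, Σ = I, and the nonlinearity h, the terminal condition g, the subnetworks and the step counts are illustrative choices. Y_0 and the maps from X_{t_n} to Z_{t_n} are parametrized, the pair (7)-(8) is marched forward by Euler-Maruyama, and the mismatch with g(X_T) is minimized by SGD.

```python
import torch
import torch.nn as nn

d, N, T, batch = 100, 20, 1.0, 256
dt = T / N

def g(x):                                    # illustrative terminal condition
    return torch.log(0.5 * (1.0 + (x ** 2).sum(dim=1, keepdim=True)))

def h(y, z):                                 # illustrative nonlinearity h(t, x, y, z)
    return -(z ** 2).sum(dim=1, keepdim=True)

y0 = nn.Parameter(torch.zeros(1))            # u(0, x0), to be learned
z_nets = nn.ModuleList([                     # one network per time step for sigma^T grad u
    nn.Sequential(nn.Linear(d, d + 10), nn.ReLU(), nn.Linear(d + 10, d))
    for _ in range(N)
])
opt = torch.optim.Adam([y0] + list(z_nets.parameters()), lr=1e-3)

x0 = torch.zeros(batch, d)                   # fixed starting point
for step in range(2000):
    x = x0.clone()
    y = y0.expand(batch, 1)
    for n in range(N):
        z = z_nets[n](x)
        dw = torch.randn(batch, d) * dt ** 0.5
        y = y - h(y, z) * dt + (z * dw).sum(dim=1, keepdim=True)   # BSDE step, cf. (8)
        x = x + dw                                                  # forward SDE step, cf. (7)
    loss = ((g(x) - y) ** 2).mean()          # mismatch with the terminal condition
    opt.zero_grad()
    loss.backward()
    opt.step()
print("estimated u(0, x0):", float(y0))
```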


As applications, let us first study a stochastic control problem, but we now solve this
problem using the Hamilton-Jacobi-Bellman (HJB) equation. Consider the well-known
LQG (linear quadratic Gaussian) problem at dimension d = 100:
$$dX_t = 2\sqrt{\lambda}\, m_t\, dt + \sqrt{2}\, dW_t, \qquad (10)$$

with the cost functional $J(\{m_t\}_{0\le t\le T}) = \mathbb{E}\left[\int_0^T \|m_t\|_2^2\, dt + g(X_T)\right]$. The corresponding HJB equation is given by
$$\frac{\partial u}{\partial t} + \Delta u - \lambda\|\nabla u\|_2^2 = 0 \qquad (11)$$
Using the Hopf-Cole transform, we can express the solution in the form:
$$u(t, x) = -\frac{1}{\lambda} \ln\left(\mathbb{E}\left[\exp\left(-\lambda g(x + \sqrt{2}\, W_{T-t})\right)\right]\right). \qquad (12)$$

This can be used to calibrate the accuracy of the Deep BSDE method.
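The reference values used for this calibration can be obtained by evaluating (12) with plain Monte Carlo. A small numpy sketch (the terminal cost g below is an illustrative choice, not necessarily the one used in [21]):

```python
import numpy as np

def u0_hopf_cole(lam, d=100, T=1.0, n_samples=200000, seed=0):
    """Monte Carlo evaluation of (12) at t = 0, x = 0:
    u(0, 0) = -(1/lam) * log E[ exp(-lam * g(sqrt(2) * W_T)) ]."""
    rng = np.random.default_rng(seed)
    w_T = rng.normal(0.0, np.sqrt(T), size=(n_samples, d))
    x = np.sqrt(2.0) * w_T
    g = np.log(0.5 * (1.0 + np.sum(x ** 2, axis=1)))     # illustrative terminal cost
    return -np.log(np.mean(np.exp(-lam * g))) / lam

for lam in (1.0, 10.0, 50.0):
    print(lam, u0_hopf_cole(lam))
```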
[Figure 7 plot: optimal cost u(0, 0, . . . , 0) vs. λ, comparing the Deep BSDE solver with Monte Carlo.]

Figure 7: Left: Relative error of the deep BSDE method for u(t=0, x=(0, . . . , 0)) when λ = 1, which
achieves 0.17% in a runtime of 330 seconds. Right: Optimal cost u(t=0, x=(0, . . . , 0)) against different
λ. Reproduced with permission from Jiequn Han. See also [21].

As a second example, we study the Black-Scholes equation with default risk:


$$\frac{\partial u}{\partial t} + \Delta u - (1-\delta)\, Q(u(t, x))\, u(t, x) - R\, u(t, x) = 0$$
where Q is some nonlinear function. This form of modeling the default risk was proposed
and/or used in the literature for low dimensional situation (d = 5, see [21] for references).
The Deep BSDE method was used for this problem with d = 100 [21].

Figure 8: The solution of the Black-Scholes equation with default risk at d = 100. The
Deep BSDE method achieves a relative error of size 0.46% in a runtime of 617 seconds.
Reproduced with permission from Jiequn Han. See also [21].

The Deep BSDE method has been applied to the pricing of basket options, interest rate-dependent options, the Libor market model, Bermudan swaptions, and barrier options (see [17] for references).

3.3 Moment closure for kinetic equations modeling gas dynamics
The dynamics of gas can be modeled very accurately by the well-known Boltzmann Equa-
tion:
$$\partial_t f + v\cdot\nabla_x f = \frac{1}{\varepsilon} Q(f), \qquad v\in\mathbb{R}^3,\ x\in\Omega\subset\mathbb{R}^3, \qquad (13)$$
where f is the phase space density function, ε is the Knudsen number:
$$\varepsilon = \frac{\text{mean free path}}{\text{macroscopic length}},$$
Q is the collision operator that models the collision process between gas particles. When
$\varepsilon \ll 1$, this can be approximated by the Euler equation:
$$\partial_t U + \nabla_x \cdot F(U) = 0, \qquad (14)$$
where
$$U = (\rho, \rho u, E)^T, \qquad \rho = \int f\, dv, \qquad u = \frac{1}{\rho}\int f v\, dv,$$
and
$$F(U) = (\rho u,\ \rho u\otimes u + pI,\ (E + p)u)^T$$
Euler's equation can be obtained by projecting the Boltzmann equation onto the low order moments involved, and making use of the ansatz that the distribution function f is close to the local Maxwellian.

[Figure 9 schematic: Euler equations, NSF equations, transition regime, kinetic regime and free flight as the Knudsen number Kn ranges from 10⁻² to 10, from equilibrium to non-equilibrium.]

Figure 9: The different regimes of gas dynamics. Reproduced with permission from Jiequn Han. See also [22].

What happens when ε is not small? In this case, a natural idea is to seek some generalization to Euler's equation using more moments. This program was initiated by Harold Grad, who constructed the well-known 13-moment system using the moments of $\{1,\ v,\ (v-u)\otimes(v-u),\ |v-u|^2(v-u)\}$. This line of work has encountered several difficulties. First, there is no guarantee that the equations obtained are well-posed. Secondly there is always the "closure problem": When projecting the Boltzmann equation on a set of moments, there are always terms which involve moments outside the set of moments considered. In order to obtain a closed system, one needs to model these terms in some way. For Euler's equation, this is done using the local Maxwellian approximation. This is accurate when ε is small, but is no longer so when ε is not small. It is highly unclear what should be used as the replacement.

In [22], Han et al. developed a machine learning-based moment method. The overall objective is to construct a uniformly accurate (generalized) moment model. The methodology consists of two steps:

1: Learn a set of optimal generalized moments through an auto-encoder. Here by optimality we mean that the set of generalized moments retains a maximum amount of information about the original distribution function and can be used to recover the distribution function with a minimum loss of accuracy. This can be done as follows: Find an encoder Ψ and decoder Φ that recovers the original f from U, W:
$$W = \Psi(f) = \int w f\, dv, \qquad \Phi(U, W)(v) = h(v; U, W).$$
$$\text{Minimize}_{w,\, h}\ \mathbb{E}_{f\sim\mathcal{D}}\, \|f - \Phi(\Psi(f))\|^2.$$
U and W form the set of generalized hydrodynamic variables that we will use to model the gas flow.

2: Learn the fluxes and source terms in the PDE for the projected PDE. The effective evolution equations for U and W can be obtained by formally projecting the Boltzmann equation on this set of (generalized) moments. This gives us a set of PDEs of the form:
$$\partial_t U + \nabla_x \cdot F(U, W; \varepsilon) = 0,$$
$$\partial_t W + \nabla_x \cdot G(U, W; \varepsilon) = R(U, W; \varepsilon). \qquad (15)$$
where $F(U, W; \varepsilon) = \int v\, U f\, dv$, $G(U, W; \varepsilon) = \int v\, W f\, dv$, $R(U, W; \varepsilon) = \varepsilon^{-1}\int W\, Q(f)\, dv$. Our task now is to learn F, G, R from the original kinetic equation.
Again the important issues are (1) getting an optimal dataset, and (2) enforcing the physical constraints. Here two notable physical constraints are (1) conservation laws and (2)
symmetries. Conservation laws are automatically respected in this approach. Regarding
symmetries, besides the usual static symmetry, there is now a new dynamic symmetry:
the Galilean invariance. These issues are all discussed in [22]. We also refer to [22] for
numerical results for the models obtained this way.
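A minimal PyTorch sketch of step 1 (learning generalized moments with an autoencoder) on a one-dimensional velocity grid is given below. The encoder computes W = ∫ w f dv with learnable weight functions w, the decoder reconstructs f from (U, W), and the classical moments U are kept alongside. This is only a schematic illustration, not the architecture or constraints of [22]; in particular, the data generation is a toy mixture of Maxwellians and neither conservation laws nor Galilean invariance are enforced here.

```python
import math
import torch
import torch.nn as nn

n_v, n_U, n_W, batch = 64, 3, 4, 128
v = torch.linspace(-5.0, 5.0, n_v)
dv = v[1] - v[0]

class MomentAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder Psi: W = integral of w(v) f(v) dv with learnable weight functions w
        self.w = nn.Parameter(0.1 * torch.randn(n_W, n_v))
        # decoder Phi: reconstruct f on the velocity grid from (U, W)
        self.decoder = nn.Sequential(nn.Linear(n_U + n_W, 128), nn.ReLU(),
                                     nn.Linear(128, n_v))

    def forward(self, f, U):
        W = (f @ self.w.t()) * dv                 # generalized moments, shape (batch, n_W)
        return self.decoder(torch.cat([U, W], dim=1))

def conserved_moments(f):
    """Classical moments U = (rho, rho u, E) evaluated on the velocity grid."""
    rho = (f * dv).sum(dim=1)
    mom = (f * v * dv).sum(dim=1)
    E = (0.5 * f * v ** 2 * dv).sum(dim=1)
    return torch.stack([rho, mom, E], dim=1)

def sample_f(batch):
    """Toy non-equilibrium data: random mixtures of two Maxwellians."""
    u = torch.randn(batch, 2, 1)
    T = 0.5 + torch.rand(batch, 2, 1)
    a = torch.rand(batch, 2, 1)
    comps = a * torch.exp(-(v - u) ** 2 / (2 * T)) / torch.sqrt(2.0 * math.pi * T)
    return comps.sum(dim=1)

model = MomentAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    f = sample_f(batch)
    U = conserved_moments(f)
    loss = ((model(f, U) - f) ** 2).mean()        # E || f - Phi(Psi(f)) ||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```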

4 Mathematical theory of machine learning


While neural network-based machine learning has demonstrated a wide range of very
impressive successes, it has also acquired a reputation of being a “black magic” rather than
a solid scientific technique. This is due to the fact that (1) we lack a basic understanding
of the fundamental reasons behind its success; (2) the performance of these models and
algorithms is quite sensitive to the choice of the hyper-parameters such as the architecture
of the network and the learning rate; and (3) some techniques, such as batch normalization, do appear to be black magic.
To change this situation, we need to (1) improve our understanding of the reasons
behind the success and the fragility of neural network-based models and algorithms and
(2) find ways to formulate more robust models and design more robust algorithms. In this
section we address the first issue. The next section will be devoted to the second issue.
Here is a list of the most basic questions that we need to address:

• Why does it work in such high dimension?

• Why does simple gradient descent work for training neural network models?

• Is over-parametrization good or bad?

• Why does neural network modeling require such extensive parameter tuning?

At this point, we do not yet have clear answers to all these questions. But some coher-
ent picture is starting to emerge. We will focus on the problem of supervised learning,
namely approximating a target function using a finite dataset. For simplicity, we will
limit ourselves to the case when the physical domain of interest is X = [0, 1]d .

4.1 An introduction to neural network-based supervised learning
The basic problem in supervised learning is as follows: Given a natural number n ∈ N
and a sequence {(xj , yj ) = (xj , f ∗ (xj )), j ∈ {1, 2, . . . , n}}, of pairs of input-output data,
we want to recover the target function f ∗ as accurately as possible. We will assume that
the input data {xj , j ∈ {1, 2, . . . , n}}, is sampled from the probability distribution µ on
Rd .
Step 1. Choose a hypothesis space. This is a set of trial functions Hm where
m ∈ N is the dimensionality of Hm . One might choose piecewise polynomials or wavelets.

In modern machine learning the most popular choice is neural network functions. Two-
layer neural network functions (one input layer, one output layer which usually does not
count, and one hidden layer) take the form
$$f_m(x, \theta) = \frac{1}{m}\sum_{j=1}^m a_j\, \sigma(\langle w_j, x\rangle) \qquad (16)$$

where σ : R → R is a fixed scalar nonlinear function and where θ = {(aj , wj )j∈{1,2,...,m} } are
the parameters to be optimized (or trained). A popular example for the nonlinear function
σ : R → R is the ReLU (rectified linear unit) activation function: σ(z) = max{z, 0}, for all
z ∈ R. We will restrict our attention to this activation function. Roughly speaking, deep
neural network (DNN) functions are obtained if one composes two-layer neural network
functions several times. One important class of DNN models are residual neural networks
(ResNet). They closely resemble discretized ordinary differential equations and take the
form
$$z_{l+1} = z_l + \sum_{j=1}^M a_{j,l}\, \sigma(\langle z_l, w_{j,l}\rangle), \qquad z_0 = V\tilde{x}, \qquad f_L(x, \theta) = \langle\alpha, z_L\rangle \qquad (17)$$

for l ∈ {0, 1, . . . , L−1} where L, M ∈ N. Here the parameters are θ = (α, V , (aj,l )j,l , (wj,l )j,l ).
ResNets are the model of choice for truly deep neural network models.
Step 2. Choose a loss function. The primary consideration for the choice of the
loss function is to fit the data. Therefore the most obvious choice is the L2 loss:
$$\hat{R}_n(f) = \frac{1}{n}\sum_{j=1}^n |f(x_j) - y_j|^2 = \frac{1}{n}\sum_{j=1}^n |f(x_j) - f^*(x_j)|^2. \qquad (18)$$

This is also called the “empirical risk”. Sometimes we also add regularization terms.
Step 3. Choose an optimization algorithm. The most popular optimization
algorithms in machine learning are different versions of the gradient descent (GD) algo-
rithm, or its stochastic analog, the stochastic gradient descent (SGD) algorithm. Assume
that the objective function we aim to minimize is of the form
 
F (θ) = Eξ∼ν l(θ, ξ) . (19)

The simplest form of SGD iteration takes the form

θk+1 = θk − η∇l(θk , ξk ), (20)

for k ∈ N0 where {ξk , k ∈ N0 = {0, 1, 2, . . . }} is a sequence of i.i.d. random variables


sampled from the distribution ν and η is the learning rate which might also change during
the course of the iteration. In contrast, GD takes the form
 
θk+1 = θk − η∇Eξ∼ν l(θk , ξ) . (21)

Obviously this form of SGD can be adapted to loss functions of the form (18) which can
be regarded as an expectation with ν being the empirical distribution on the training
dataset. This DNN-SGD paradigm is the heart of modern machine learning.
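To fix ideas, the sketch below (numpy) runs the whole pipeline for the two-layer model (16), the empirical risk (18) and the SGD iteration (20), with gradients written out by hand; the target function and all hyper-parameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, lr = 10, 200, 1000, 0.5

f_star = lambda x: np.cos(x.sum(axis=1))         # illustrative target function
X = rng.uniform(0, 1, (n, d))                    # training inputs sampled from mu
y = f_star(X)

a = rng.normal(0, 1, m)                          # hypothesis space: f_m(x) = (1/m) sum_j a_j relu(<w_j, x>)
W = rng.normal(0, 1, (m, d))

def model(x, a, W):
    return (np.maximum(x @ W.T, 0.0) @ a) / m

for step in range(20000):
    idx = rng.integers(0, n, 32)                 # SGD: a random mini-batch plays the role of xi_k
    xb, yb = X[idx], y[idx]
    pre = xb @ W.T                               # (batch, m) pre-activations
    act = np.maximum(pre, 0.0)
    err = act @ a / m - yb                       # residual of the model on the batch
    grad_a = 2 * act.T @ err / (len(idx) * m)    # gradient of the batch L2 loss w.r.t. a
    grad_W = 2 * ((err[:, None] * (pre > 0) * a[None, :]).T @ xb) / (len(idx) * m)
    a -= lr * grad_a
    W -= lr * grad_W

print("training error:", np.mean((model(X, a, W) - y) ** 2))
```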

4.2 Approximation theory
The simplest way of approximating functions is to use polynomials. For polynomial
approximation, there are two kinds of theorems. The first is the Weierstrass’ Theorem
which asserts that continuous functions can be uniformly approximated by polynomials
on compact domains. The second is Taylor’s Theorem which tells us that the rate of
convergence depends on the smoothness of the target function.
Using the terminology in neural network theory, Weierstrass’ Theorem is the “Univer-
sal Approximation Theorem” (UAT). It is a useful fact. But Taylor’s Theorem is more
useful since it tells us something about the rate of convergence. The forms of Taylor's Theorem used in approximation theory are the direct and inverse approximation theorems, which assert that a given function can be approximated by polynomials with a particular rate if and only if certain norms of that function are finite. This particular norm, which
measures the regularity of the function, is the key quantity that characterizes this approx-
imation scheme. For piecewise polynomials, these norms are some Besov space norms [36].
For $L^2$, a typical result looks as follows:
$$\inf_{f_m\in\mathcal{H}_m} \|f - f_m\|_{L^2(X)} \le C_0 h^\alpha \|f\|_{H^\alpha(X)} \qquad (22)$$

Here H α stands for the α-th order Sobolev norm [36], m is the number of degrees of
freedom. On a regular grid, the grid size is given by
h ∼ m−1/d (23)
An important thing to notice is that the convergence rate in (22) suffers from CoD: If
we want to reduce the error by a factor of $\epsilon$, we need to increase m by a factor $m \sim \epsilon^{-d}$ if α = 1. For d = 100, which is not very high dimension by the standards of machine learning, this means that we have to increase m by a factor of $\epsilon^{-100}$. This is why polynomials and
piecewise polynomials are not useful in high dimensions.
Another way to appreciate this is as follows. The number of monomials of degree p in dimension d is $C_{p+d}^{d}$. This grows very fast for large values of d and p.
What should we expect in high dimension? One example that we can learn from is
Monte Carlo methods for integration. Consider the problem of approximating
I(g) = Ex∼µ g(x)
using
$$I_m(g) = \frac{1}{m}\sum_j g(x_j)$$
where {xj , j ∈ [m]} is a set of i.i.d samples of the probability distribution µ. A direct
computation gives
$$\mathbb{E}\,(I(g) - I_m(g))^2 = \frac{\mathrm{var}(g)}{m}, \qquad \mathrm{var}(g) = \mathbb{E}_{x\sim\mu}\, g^2(x) - \left(\mathbb{E}_{x\sim\mu}\, g(x)\right)^2$$
This exact relation tells us two things. (1) The convergence rate of Monte Carlo integra-
tion is independent of dimension. (2) The error constant is given by the variance of the
integrand. Therefore to reduce error, one has to do variance reduction.

Had we used grid-based quadrature rules, the accuracy would have also suffered from
CoD.
It is possible to improve the Monte Carlo rate by more sophisticated ways of choosing
{xj , j ∈ [m]}, say using number-theoretic-based quadrature rules. But these typically
give rise to an O(1/d) improvement for the convergence rate and it diminishes as d → ∞.
Based on these considerations, we aim to find function approximations that satisfy:
$$\inf_{f\in\mathcal{H}_m} R(f) = \inf_{f\in\mathcal{H}_m} \|f - f^*\|_{L^2(d\mu)}^2 \lesssim \frac{\|f^*\|_*^2}{m}.$$
The natural questions are then:
• How can we achieve this? That is, what kind of hypothesis space should we choose?
• What should be the "norm" $\|\cdot\|_*$ (associated with the choice of $\mathcal{H}_m$)? Here we put norm in quotation marks since it does not have to be a real norm. All we need is that it controls the approximation error.
Regarding the first question, let us look at an illustrative example.
Consider the following Fourier representation of the function f and its approximation
fm (say FFT-based):
$$f(x) = \int_{\mathbb{R}^d} a(\omega) e^{i(\omega, x)}\, d\omega, \qquad f_m(x) = \frac{1}{m}\sum_j a(\omega_j) e^{i(\omega_j, x)}$$

Here $\{\omega_j\}$ is a fixed grid, e.g. uniform grid. For this approximation, we have
$$\|f - f_m\|_{L^2(X)} \le C_0\, m^{-\alpha/d}\, \|f\|_{H^\alpha(X)}$$
which suffers from CoD.
Now consider the alternative representation
$$f(x) = \int_{\mathbb{R}^d} a(\omega) e^{i(\omega, x)}\, \pi(d\omega) = \mathbb{E}_{\omega\sim\pi}\, a(\omega) e^{i(\omega, x)} \qquad (24)$$

where π is a probability distribution. Now to approximate f, it is natural to use Monte Carlo. Let $\{\omega_j\}$ be an i.i.d. sample of π and $f_m(x) = \frac{1}{m}\sum_{j=1}^m a(\omega_j) e^{i(\omega_j, x)}$; then we have
$$\mathbb{E}\,|f(x) - f_m(x)|^2 = \frac{\mathrm{var}(f)}{m}$$
This approximation does not suffer from CoD. Notice that $f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j\, \sigma(\omega_j^T x)$ is nothing but a two-layer neural network with activation function $\sigma(z) = e^{iz}$ (here $a_j = a(\omega_j)$).
We believe that this simple argument is really at the heart of why neural network
models do so well in high dimension.
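This argument can be checked numerically. In the numpy sketch below, π is taken to be the standard Gaussian and a(ω) ≡ 1, so that (using the real part of $e^{i(\omega,x)}$) the target is $f(x) = \mathbb{E}_\omega \cos(\omega\cdot x) = e^{-|x|^2/2}$; these are illustrative choices. The error of the m-term random-feature approximation decays like $1/\sqrt{m}$, essentially independently of d.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_error(d, m, n_test=200, n_repeat=5):
    """RMS error of f_m(x) = (1/m) sum_j cos(<omega_j, x>) approximating
    f(x) = E_{omega ~ N(0, I_d)} cos(<omega, x>) = exp(-|x|^2 / 2)."""
    x = rng.normal(0, 1.0 / np.sqrt(d), (n_test, d))      # test points with |x| = O(1)
    f_exact = np.exp(-0.5 * np.sum(x ** 2, axis=1))
    errs = []
    for _ in range(n_repeat):
        omega = rng.normal(0, 1, (m, d))                  # i.i.d. sample of pi
        f_m = np.cos(x @ omega.T).mean(axis=1)            # Monte Carlo / random-feature approximation
        errs.append(np.sqrt(np.mean((f_m - f_exact) ** 2)))
    return np.mean(errs)

for d in (2, 20, 200):
    print(d, [round(mc_error(d, m), 4) for m in (10, 100, 1000, 10000)])
```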
Now let us turn to a concrete example of the kind of approximation theory for neural
network models. We will consider two-layer neural networks.
$$\mathcal{H}_m = \left\{ f_m(x) = \frac{1}{m}\sum_j a_j\, \sigma(w_j^T x)\right\}, \qquad \theta = \{(a_j, w_j),\ j\in[m]\}$$

Consider a function $f: X = [0, 1]^d \mapsto \mathbb{R}$ of the following form
$$f(x) = \int_\Omega a\, \sigma(w^T x)\, \rho(da, dw) = \mathbb{E}_{(a,w)\sim\rho}\, [a\, \sigma(w^T x)], \qquad x\in X$$
where $\Omega = \mathbb{R}^1\times\mathbb{R}^{d+1}$, ρ is a probability distribution on Ω. Define:
$$\|f\|_{\mathcal{B}} = \inf_{\rho\in P_f}\ \left(\mathbb{E}_\rho\,[a^2\, \|w\|_1^2]\right)^{1/2}$$

where $P_f := \{\rho : f(x) = \mathbb{E}_\rho[a\,\sigma(w^T x)]\}$. This is called the Barron norm [2, 12, 13]. The space
$$\mathcal{B} = \{f\in C^0 : \|f\|_{\mathcal{B}} < \infty\}$$
is called the Barron space [2, 12, 13] (see also [3, 26, 14]).
In analogy with classical approximation theory, we can also prove some direct and
inverse approximation theorem [13].
Theorem 1 (Direct Approximation Theorem). If $\|f\|_{\mathcal{B}} < \infty$, then for any integer m > 0, there exists a two-layer neural network function $f_m$ such that
$$\|f - f_m\|_{L^2(X)} \lesssim \frac{\|f\|_{\mathcal{B}}}{\sqrt{m}}$$
Theorem 2 (Inverse Approximation Theorem). Let
$$N_C \stackrel{\mathrm{def}}{=} \left\{ \frac{1}{m}\sum_{k=1}^m a_k\, \sigma(w_k^T x)\ :\ \frac{1}{m}\sum_{k=1}^m |a_k|^2\, \|w_k\|_1^2 \le C^2,\ m\in\mathbb{N}^+\right\}.$$

Let f ∗ be a continuous function. Assume there exists a constant C and a sequence of


functions fm ∈ NC such that
fm (x) → f ∗ (x)
for all x ∈ X. Then there exists a probability distribution ρ∗ on Ω, such that
$$f^*(x) = \int a\, \sigma(w^T x)\, \rho^*(da, dw),$$
for all $x\in X$ and $\|f^*\|_{\mathcal{B}} \le C$.

4.3 Estimation error


Another issue we have to worry about is the performance of the machine learning model
outside the training dataset. This issue also shows up in classical approximation theory.
Illustrated in Figure 10 is the classical Runge phenomenon for polynomial interpolation
on a uniform grid. One can see that away from the grid points, the error of the interpolant
can be very large. This is a situation that we would like to avoid.
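The Runge phenomenon of Figure 10 is easy to reproduce: interpolating $f^*(x) = 1/(1+25x^2)$ at equally spaced points fits the data essentially exactly but blows up between the nodes as the degree grows. A short numpy illustration:

```python
import numpy as np

f_star = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)
x_test = np.linspace(-1, 1, 2001)

for n in (5, 9, 13, 17):
    x_train = np.linspace(-1, 1, n)                      # uniform interpolation nodes
    coeffs = np.polyfit(x_train, f_star(x_train), n - 1) # degree n-1 interpolant
    p_train = np.polyval(coeffs, x_train)
    p_test = np.polyval(coeffs, x_test)
    print(n,
          "error at the nodes: %.1e" % np.max(np.abs(p_train - f_star(x_train))),
          "max error on [-1, 1]: %.1e" % np.max(np.abs(p_test - f_star(x_test))))
```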
What we do in practice is to minimize the training error:
$$\hat{R}_n(\theta) = \frac{1}{n}\sum_j (f(x_j, \theta) - f^*(x_j))^2$$

Figure 10: The Runge phenomenon: $f^*(x) = \frac{1}{1+25x^2}$. Reproduced with permission from Chao Ma.

but we are interested in the testing error, which is a sampled version of the population
risk:
R(θ) = Ex∼µ (f (x, θ) − f ∗ (x))2
The question is how we can control the difference between these two errors.
One way of doing this is to use the notion of Rademacher complexity. The important
fact for us here is that the Rademacher complexity controls the difference between train-
ing and testing errors (also called the “generalization gap”). Indeed, let H be a set of
functions, and S = (x1 , x2 , ..., xn ) be a dataset. Then, up to logarithmic terms, we have
$$\sup_{h\in\mathcal{H}}\left(\mathbb{E}_x[h(x)] - \frac{1}{n}\sum_{i=1}^n h(x_i)\right) \sim \mathrm{Rad}_S(\mathcal{H})$$

where the Rademacher complexity of $\mathcal{H}$ with respect to S is defined as
$$\mathrm{Rad}_S(\mathcal{H}) = \frac{1}{n}\, \mathbb{E}_\xi\left[\sup_{h\in\mathcal{H}}\sum_{i=1}^n \xi_i\, h(x_i)\right], \qquad (25)$$
where $\{\xi_i\}_{i=1}^n$ are i.i.d. random variables taking values ±1 with equal probability.
The question then becomes to bound the Rademacher complexity of a hypothesis
space. For the Barron space, we have [2]:

Theorem 3. Let $\mathcal{F}_Q = \{f\in\mathcal{B},\ \|f\|_{\mathcal{B}} \le Q\}$. Then we have
$$\mathrm{Rad}_S(\mathcal{F}_Q) \le 2Q\sqrt{\frac{2\ln(2d)}{n}}$$
where n = |S|, the size of the dataset S.

4.4 A priori estimates for regularized models


Consider the regularized model
$$L_n(\theta) = \hat{R}_n(\theta) + \lambda\sqrt{\frac{\log(2d)}{n}}\, \|\theta\|_P, \qquad \hat{\theta}_n = \mathrm{argmin}\, L_n(\theta) \qquad (26)$$
where the path norm is defined by:
$$\|\theta\|_P = \left(\frac{1}{m}\sum_{k=1}^m |a_k|^2\, \|w_k\|_1^2\right)^{1/2}$$
The following result was proved in [12]:
Theorem 4. Assume $f^*: X \mapsto [0, 1]$, $f^*\in\mathcal{B}$. There exists a constant $C_0$ such that for any δ > 0, if $\lambda \ge C_0$, then with probability at least 1 − δ over the choice of the training dataset, we have
$$R(\hat{\theta}_n) \lesssim \frac{\|f^*\|_{\mathcal{B}}^2}{m} + \lambda\|f^*\|_{\mathcal{B}}\sqrt{\frac{\log(2d)}{n}} + \sqrt{\frac{\log(1/\delta) + \log(n)}{n}}.$$
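For the scaled two-layer model, the path norm and the regularized loss (26) are straightforward to compute and to minimize; a small PyTorch sketch (λ, the data, and the architecture are arbitrary illustrative choices):

```python
import torch

def path_norm(a, W):
    """||theta||_P = ( (1/m) sum_k |a_k|^2 ||w_k||_1^2 )^(1/2) for the scaled two-layer net."""
    return torch.sqrt((a ** 2 * W.abs().sum(dim=1) ** 2).mean())

def two_layer(x, a, W):
    return torch.relu(x @ W.t()) @ a / a.shape[0]

d, m, n, lam = 20, 500, 200, 0.1
X = torch.rand(n, d)
y = torch.cos(X.sum(dim=1))                                 # illustrative target

a = torch.randn(m, requires_grad=True)
W = torch.randn(m, d, requires_grad=True)
opt = torch.optim.SGD([a, W], lr=0.1)

for step in range(5000):
    emp_risk = ((two_layer(X, a, W) - y) ** 2).mean()       # empirical risk (18)
    penalty = torch.sqrt(torch.log(torch.tensor(2.0 * d)) / n) * path_norm(a, W)
    loss = emp_risk + lam * penalty                          # regularized loss (26)
    opt.zero_grad()
    loss.backward()
    opt.step()
```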
Similar approximation theory and a priori error estimates have been proved for other
machine learning models. Here is a brief summary of these results.
• Random feature model: The corresponding function space is the reproducing kernel
Hilbert space (RKHS).
• Residual networks (ResNets): The corresponding function space is the so-called
flow-induced space introduced in [13].
• Multi-layer neural networks: A candidate for the corresponding function space is
the multi-layer space introduced in [15].
What is really important is the “norms” that control the approximation error and the
generalization gap. These quantities are defined for functions in the corresponding spaces.
After the approximation theorems and Rademacher complexity estimates are in place, one
can readily prove a theorem of the following type for regularized models: Up to logarithmic
terms, the minimizers of the regularized models satisfy:
$$R(\hat{f}) \lesssim \frac{\Gamma(f^*)}{m} + \frac{\gamma(f^*)}{\sqrt{n}}$$
where m is the number of free parameters, n is the size of the training dataset. Note that
for the multilayer spaces, the results proved in [15] are not as sharp.
We only discussed the analysis of the hypothesis space. There are many other questions. We refer to [19] for more discussion on the current understanding of neural
network-based machine learning.

5 Machine learning from a continuous viewpoint
Now we turn to alternative formulations of machine learning. Motivated by the situation
for PDEs, we would like to first formulate machine learning in a continuous setting and
then discretize to get concrete models and algorithms. The key here is that continuous
problems that we come up with should be nice mathematical problems. For PDEs, this is
accomplished by requiring them to be “well-posed”. For problems in calculus of variations,
we require the problem to be “convex” in some sense and lower semi-continuous. The point
of these requirements is to make sure that the problem has a unique solution. Intuitively,
for machine learning problems, being “nice” means that the variational problem should
have a simple landscape. How to formulate this precisely is an important research problem
for the future.
As was pointed out in [16], the key ingredients for the continuous formulation are as
follows:
• representation of functions (as expectations)

• formulating the variational problem (as expectations)

• optimization, e.g. gradient flows

5.1 Representation of functions


Two kinds of representations are considered in [16]: Integral transform-based representa-
tion and flow-based representation. The simplest integral-transform based representation
is a generalization of (24):
$$f(x;\theta) = \int_{\mathbb{R}^d} a(w)\, \sigma(w^T x)\, \pi(dw) = \mathbb{E}_{w\sim\pi}\, a(w)\, \sigma(w^T x) = \mathbb{E}_{(a,w)\sim\rho}\, a\, \sigma(w^T x) = \mathbb{E}_{u\sim\rho}\, \phi(x, u)$$

Here θ denotes the parameters in the model: θ can be a(·) or the probability distributions π or ρ.
This representation corresponds to two-layer neural networks. A generalization to
multi-layer neural networks is presented in [15].
Next we turn to flow-based representation:
$$\frac{dz}{d\tau} = \mathbb{E}_{w\sim\pi_\tau}\, a(w, \tau)\, \sigma(w^T z) \qquad (27)$$
$$= \mathbb{E}_{(a,w)\sim\rho_\tau}\, a\, \sigma(w^T z) \qquad (28)$$
$$= \mathbb{E}_{u\sim\rho_\tau}\, \phi(z, u), \qquad z(0, x) = x \qquad (29)$$
$$f(x, \theta) = \mathbf{1}^T z(1, x)$$
In this representation, the parameter θ can be either {aτ (·)} or {πτ } or {ρτ }

5.2 The stochastic optimization problem
Stochastic optimization problems are of the type:

$$\min_\theta\ \mathbb{E}_{w\sim\nu}\, g(\theta, w)$$

These kinds of problems can readily be approached by stochastic algorithms, which are a key component in machine learning. For example, instead of the gradient descent algorithm:
$$\theta_{k+1} = \theta_k - \eta\nabla_\theta\, \mathbb{E}_{w\sim\nu}\, g(\theta_k, w) = \theta_k - \frac{\eta}{n}\nabla_\theta\sum_{j=1}^n g(\theta_k, w_j)$$

one can use the stochastic gradient descent:

θk+1 = θk − η∇θ g(θ, wk )

where {wk } is a set of random variables sampled from ν.


The following are some examples of the stochastic optimization problems that arise in
modern machine learning:

• Supervised learning: In this case, the minimization of the population risk becomes

R(f ) = Ex∼µ (f (x) − f ∗ (x))2

• Eigenvalue problems for quantum many-body Hamiltonian:

$$I(\phi) = \frac{(\phi, H\phi)}{(\phi, \phi)} = \mathbb{E}_{x\sim\mu_\phi}\, \frac{\phi(x)\, H\phi(x)}{\phi(x)^2}, \qquad \mu_\phi(dx) = \frac{1}{Z}|\phi(x)|^2\, dx$$

Here H is the Hamiltonian of the quantum system.

• Stochastic control problems:


$$L(\{a_t\}_{t=0}^{T-1}) = \mathbb{E}\left[\sum_{t=0}^{T-1} c_t(s_t, a_t) + c_T(s_T)\right]$$

Substituting the representations discussed earlier into these expressions for the stochastic optimization problems, we obtain the final variational problem that we need to solve.
One can either discretize these variational problems directly and then solve the dis-
cretized problem using some optimization algorithms, or one can write down continuous
forms of some optimization algorithms, typically gradient flow dynamics, and then dis-
cretize these continuous flows. We are going to discuss the second approach.

5.3 Optimization: Gradient flows
To write continuous form of the gradient flows, we draw some inspiration from statistical
physics. Take the supervised learning as an example. We regard the population risk as
being the “free energy”, and following Halperin and Hohenberg [24], we divide the pa-
rameters into two kinds, conserved and non-conserved. For example, a is a non-conserved
parameter and π is conserved since its total integral has to be 1.
For non-conserved parameter, as was suggested in [24], one can use the “model A”
dynamics:
$$\frac{\partial a}{\partial t} = -\frac{\delta R}{\delta a}$$
which is simply the usual $L^2$ gradient flow.
For conserved parameters such as π, one should use the “model B” dynamics which
works as follows: First define the “chemical potential”
$$V = \frac{\delta R}{\delta\pi}.$$
From the chemical potential, one obtains the velocity field v and the current J:

J = πv, v = −∇V

The continuity equation then gives us the gradient flow dynamics:


$$\frac{\partial\pi}{\partial t} + \nabla\cdot J = 0.$$
This is also the gradient flow under the Wasserstein metric.

5.4 Discretizing the gradient flows


To obtain practical models, one needs to discretize these continuous problems. The first
step is to replace the population risk by the empirical risk using the training data.
The more non-trivial issue is how to discretize the gradient flows in the parameter
space. The parameter space has the following characteristics: (1) It has a simple geometry
– unlike the real space which may have a complicated geometry. (2) It is also usually high
dimensional. For these reasons, the most natural numerical method for the discretization in the parameter space is the particle method, which is the dynamic version of Monte Carlo. Smoothed particle methods might be helpful to improve the performance. In relatively low dimensions, one might also consider the spectral method, particularly some sparse version of the spectral method, due to the relatively simple geometry of the parameter space.
Take for example the discretization of the conservative flow for the integral transform-
based representation. With the representation: f (x; θ) = E(a,w)∼ρ aσ(wT x), the gradient
flow equation becomes:
$$\partial_t\rho = \nabla\cdot(\rho\nabla V), \qquad V = \frac{\delta R}{\delta\rho} \qquad (30)$$

[Figure 11 heat maps: test errors as functions of log10(m) and log10(n) for the two models.]

Figure 11: (Left) continuous viewpoint; (Right) conventional NN models. Target func-
tion is a single neuron. Reproduced with permission from Lei Wu.

The particle method discretization is based on:


$$\rho(a, w, t) \sim \frac{1}{m}\sum_j \delta_{(a_j(t), w_j(t))} = \frac{1}{m}\sum_j \delta_{u_j(t)}$$

where $u_j(t) = (a_j(t), w_j(t))$. One can show that in this case, (30) reduces to
$$\frac{du_j}{dt} = -\nabla_{u_j} I(u_1, \cdots, u_m)$$
where
$$I(u_1, \cdots, u_m) = R(f_m), \qquad u_j = (a_j, w_j), \qquad f_m(x) = \frac{1}{m}\sum_j a_j\, \sigma(w_j^T x)$$

This is exactly gradient descent for “scaled” two-layer neural networks. In this case,
the continuous formulation also coincides with the “mean-field” formulation for two-layer
neural networks [7, 32, 34, 35].
The scaling factor 1/m in front of the “scaled” two-layer neural networks is actually
quite important and makes a big difference for the test performance of the network. Shown
in Figure 11 is the heat map of the test error for two-layer neural network models with
and without this scaling factor. The target function is the simple single neuron function:
f ∗ (x) = σ(x1 ). The important observation is that in the absence of this scaling factor, the
test error shows a “phase transition” between a phase where the neural network model
performs like an associated random feature model and another phase where it shows much
better performance than the random feature model. Such a phase transition is one of the
reasons that choosing the right set of hyper-parameters, here the network width m, is
so important. However, if one uses the scaled form, this phase transition phenomenon is
avoided and the performance is more robust [31].

5.5 The optimal control problem for flow-based representation


The flow-based representation naturally leads to a control problem. This viewpoint has
been used explicitly or implicitly in machine learning for quite some time (see for example

[27]). The back propagation algorithm, for instance, is an example of a control theory-based algorithm. Another more recent example is the development of maximum principle-based
algorithm, first introduced in [28]. Despite these successes, we feel that there is still a lot
of room for using the control theory viewpoint to develop new algorithms.
We consider the flow-based representation in a slightly more general form
$$\frac{dz}{d\tau} = \mathbb{E}_{u\sim\rho_\tau}\, \phi(z, u), \qquad z(0, x) = x$$

where z is the state, ρτ is the control at time τ . Our objective is to minimize R over {ρτ }
$$R(\{\rho_\tau\}) = \mathbb{E}_{x\sim\mu}\, (f(x) - f^*(x))^2 = \int_{\mathbb{R}^d} (f(x) - f^*(x))^2\, d\mu$$
where as before
f (x) = 1T z(1, x) (31)
The most important result for this control problem is Pontryagin's maximum principle (PMP). To state this result, let us define the Hamiltonian $H: \mathbb{R}^d\times\mathbb{R}^d\times\mathcal{P}_2(\Omega) \mapsto \mathbb{R}$ as
H(z, p, µ) = Eu∼µ [pT φ(z, u)].
Pontryagin’s maximum principle asserts that the solutions of the control problem must
satisfy:
$$\rho_\tau = \mathrm{argmax}_\rho\ \mathbb{E}_x\left[H\left(z_\tau^{t,x}, p_\tau^{t,x}, \rho\right)\right], \qquad \forall\tau\in[0, 1], \qquad (32)$$
where for each x, $\{(z_\tau^{t,x}, p_\tau^{t,x})\}$ are defined by the forward/backward equations:

$$\frac{dz_\tau^{t,x}}{d\tau} = \nabla_p H = \mathbb{E}_{u\sim\rho_\tau(\cdot;t)}\left[\phi(z_\tau^{t,x}, u)\right] \qquad (33)$$
$$\frac{dp_\tau^{t,x}}{d\tau} = -\nabla_z H = \mathbb{E}_{u\sim\rho_\tau(\cdot;t)}\left[\nabla_z^T\phi(z_\tau^{t,x}, u)\, p_\tau^{t,x}\right],$$

with the boundary conditions:
$$z_0^{t,x} = x \qquad (34)$$
$$p_1^{t,x} = -2\left(f(x; \rho(\cdot; t)) - f^*(x)\right)\mathbf{1}. \qquad (35)$$
Pontryagin’s maximum principle is slightly stronger than the KKT condition for the
stationary points in that (32) is a statement of optimality rather than criticality. In fact
(32) also holds when the parameters are discrete and this has been used in [29] to develop
efficient numerical algorithms for this case.
With the help of PMP, it is also easy to write down the gradient descent flow for the
optimization problem. Formally, one can simply write down the gradient descent flow for
(32) for each τ :
∂t ρτ (u, t) = ∇ · (ρτ (u, t)∇V (u; ρ)) , ∀τ ∈ [0, 1], (36)
where
$$V(u; \rho) = \mathbb{E}_x\left[\frac{\delta H}{\delta\rho}\left(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot; t)\right)\right],$$
and $\{(z_\tau^{t,x}, p_\tau^{t,x})\}$ are defined as before by the forward/backward equations.
To discretize the gradient flow, we can simply use:

• forward Euler for the flow in τ variable, with step size 1/L (L is the number of grid
points in the τ variable);

• particle method for the gradient descent dynamics, with M samples in each τ -grid
point.

This gives us
$$z_{l+1}^{t,x} = z_l^{t,x} + \frac{1}{LM}\sum_{j=1}^M \phi(z_l^{t,x}, u_l^j(t)), \qquad l = 0, \ldots, L-1 \qquad (37)$$
$$p_l^{t,x} = p_{l+1}^{t,x} + \frac{1}{LM}\sum_{j=1}^M \nabla_z\phi(z_{l+1}^{t,x}, u_{l+1}^j(t))\, p_{l+1}^{t,x}, \qquad l = 0, \ldots, L-1 \qquad (38)$$
$$\frac{du_l^j(t)}{dt} = -\mathbb{E}_x\left[\nabla_w^T\phi(z_l^{t,x}, u_l^j(t))\, p_l^{t,x}\right]. \qquad (39)$$
This recovers the GD algorithm (with back-propagation) for the (scaled) ResNet:
$$z_{l+1} = z_l + \frac{1}{LM}\sum_{j=1}^M \phi(z_l, u_l^j).$$

We call this “scaled” ResNet because of the presence of the factor 1/(LM ).
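A minimal PyTorch sketch of this scaled ResNet is given below, with the feature map φ(z, u) = a σ(w^T z) as in the flow-based representation; the depth, width and sizes are illustrative choices, and the parameters would be trained by the usual gradient descent with back-propagation.

```python
import torch
import torch.nn as nn

class ScaledResNet(nn.Module):
    """Forward Euler / particle discretization of the flow-based representation:
    z_{l+1} = z_l + (1 / (L * M)) * sum_j a_{j,l} * relu(<w_{j,l}, z_l>)."""

    def __init__(self, d, L=10, M=32):
        super().__init__()
        self.L, self.M = L, M
        self.a = nn.Parameter(torch.randn(L, M, d))       # a_{j,l} in R^d
        self.w = nn.Parameter(torch.randn(L, M, d))       # w_{j,l} in R^d
        self.alpha = nn.Parameter(torch.ones(d))          # output weights: f = <alpha, z_L>

    def forward(self, x):
        z = x                                             # z_0 = x (taking V = I)
        for l in range(self.L):
            pre = torch.relu(z @ self.w[l].t())           # (batch, M): relu(<w_{j,l}, z>)
            z = z + pre @ self.a[l] / (self.L * self.M)   # scaled residual update
        return z @ self.alpha                             # f(x) = <alpha, z_L>

net = ScaledResNet(d=20)
x = torch.randn(8, 20)
print(net(x).shape)                                       # torch.Size([8])
```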
In a similar spirit, one can also obtain an algorithm using PMP [28]. Adopting the terminology in control theory, algorithms of this kind are called the "method of successive approximation" (MSA). The basic MSA is as follows:

• Initialize: θ^0 ∈ U.

• For k = 0, 1, 2, ...:

  – Solve
    $$\frac{dz_\tau^k}{d\tau} = \nabla_p H(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad z_0^k = x.$$

  – Solve
    $$\frac{dp_\tau^k}{d\tau} = -\nabla_z H(z_\tau^k, p_\tau^k, \theta_\tau^k), \qquad p_1^k = -2\big(f(x; \theta^k) - f^*(x)\big)\mathbf{1}.$$

  – Set
    $$\theta_\tau^{k+1} = \mathrm{argmax}_{\theta\in\Theta}\, H(z_\tau^k, p_\tau^k, \theta)$$
    for each τ ∈ [0, 1].
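The following schematic sketch, reusing the helpers above, illustrates this loop; the pointwise argmax over θ is replaced by a few gradient-ascent steps on E_x[H], which is a simple illustrative choice of ours and not the implementation used in [28].

```python
def msa(particles, xs, f_star, L, n_outer=10, n_inner=5, lr=0.1):
    for _ in range(n_outer):
        # Steps 1-2: solve the forward and backward equations with the current parameters.
        sweeps = [forward_backward(x, particles, f_star(x), L=L) for x in xs]
        # Step 3: for each tau-grid point, (approximately) maximize E_x[H] over theta.
        for l in range(L):
            updated = []
            for (a, w) in particles[l]:
                for _ in range(n_inner):
                    ga, gw = np.zeros_like(a), np.zeros_like(w)
                    for (z, p, _) in sweeps:
                        s = np.dot(w, z[l])
                        ga += sigma(s) * p[l] / len(xs)                     # d(p^T phi)/da
                        gw += dsigma(s) * np.dot(a, p[l]) * z[l] / len(xs)  # d(p^T phi)/dw
                    a, w = a + lr * ga, w + lr * gw                         # ascent on H
                updated.append((a, w))
            particles[l] = updated
    return particles
```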

In practice, this basic version does not perform as well as the "extended MSA", which
works in the same way as the MSA except that the Hamiltonian is replaced by an extended
Hamiltonian [28]:
$$\tilde{H}(z, p, \theta, v, q) := H(z, p, \theta) - \tfrac{1}{2}\lambda \|v - f(z, \theta)\|^2 - \tfrac{1}{2}\lambda \|q + \nabla_z H(z, p, \theta)\|^2.$$

Figure 12: Comparison of the extended MSA with different versions of stochastic gradient
descent algorithms. The top figures show results for small initialization. The bottom
figures show results for larger initialization. Reproduced with permission from Qianxiao
Li. See also [28].

Figure 12 shows the results of the extended MSA compared with different versions of
SGD for two kinds of initialization. One can see that, in terms of the number of iterations,
the extended MSA outperforms all versions of SGD. In terms of wall-clock time, however,
the advantage of the extended MSA diminishes significantly. This is possibly due to
inefficiencies in the implementation of the optimization algorithm (here BFGS) used for solving (32).
We refer to [28] for more details. In any case, it is clear that there is a lot of room for
improvement.

6 Concluding remarks
In conclusion, we have discussed a wide range of problems for which machine learning-
based algorithms have made and/or will make a significant difference. These problems are
relatively new to computational mathematics. We believe strongly that machine learning-
based algorithms will also significantly impact the way we solve more traditional problems
in computational mathematics. However, research in this direction is still at a very early
stage.
Another important area where machine learning can be of great help is multi-scale
modeling. The moment-closure problem discussed above is an example in this direction.
There are many more possible applications; see [8]. Machine learning seems to be able to
provide the missing link in making advanced multi-scale modeling techniques really practical.
For example, in the heterogeneous multi-scale method (HMM) [1, 9], one important
component is extracting the relevant macro-scale information from micro-scale simulation
data. This step has always been a major obstacle in HMM. It seems quite clear that
machine learning techniques can be of great help here.
We also discussed how the viewpoint of numerical analysis can help to improve the
mathematical foundation of machine learning as well as propose new and possibly more
robust formulations. In particular, we have given a taste of what high-dimensional
approximation theory should look like. We also demonstrated that commonly used machine
learning models and training algorithms can be recovered from particular discretizations
of continuous models, in a scaled form. From this discussion, one can see that neural
network models are quite natural and rather inevitable.
What have we really learned from machine learning? Well, it seems that the most
important new insight from machine learning is the representation of functions as expectations.
We reproduce these representations here for convenience (a small code illustration of the
first one follows the list):

• integral-transform based:
$$f(x) = \mathbb{E}_{(a,w)\sim\rho}\, a\,\sigma(w^T x),$$
$$f^{(L)}(x) = \mathbb{E}_{\theta_L\sim\pi_L}\, a_{\theta_L}\,\sigma\Big(\mathbb{E}_{\theta_{L-1}\sim\pi_{L-1}} \cdots \sigma\big(\mathbb{E}_{\theta_1\sim\pi_1} a^1_{\theta_2,\theta_1}\,\sigma(a^0_{\theta_1}\cdot x)\big)\cdots\Big);$$

• flow-based:
$$\frac{dz}{d\tau} = \mathbb{E}_{(a,w)\sim\rho_\tau}\, a\,\sigma(w^T z), \qquad z(0, x) = x, \tag{40}$$
$$f(x, \theta) = \mathbf{1}^T z(1, x). \tag{41}$$
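To make the correspondence with finite models explicit: sampling the expectation in the integral-transform representation with m i.i.d. particles gives the scaled two-layer network (1/m) Σ_j a_j σ(w_j^T x). The snippet below is only an illustration of this correspondence; the Gaussian sampling distribution and all names are arbitrary choices of ours.

```python
import numpy as np

# Monte Carlo realization of f(x) = E_{(a,w)~rho}[ a * sigma(w^T x) ]:
# sampling m particles from rho gives the scaled two-layer network
# f_m(x) = (1/m) * sum_j a_j * sigma(w_j^T x).

rng = np.random.default_rng(0)
d, m = 10, 1000
a = rng.normal(size=m)              # samples of a (Gaussian rho, purely illustrative)
W = rng.normal(size=(m, d))         # samples of w

def f_m(x):
    return float(np.mean(a * np.maximum(W @ x, 0.0)))

print(f_m(rng.normal(size=d)))      # evaluate the sampled representation at a random x
```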

From the viewpoint of computational mathematics, this suggests that the central
issue will move from specific discretization schemes to more effective representations of
functions.
This review is rather sketchy. Interested readers can consult the three review articles
[17, 18, 19] for more details.
Acknowledgement: I am very grateful to my collaborators for their contribution to
the work described here. In particular, I would like to express my sincere gratitude to
Roberto Car, Jiequn Han, Arnulf Jentzen, Qianxiao Li, Chao Ma, Han Wang, Stephan
Wojtowytsch, and Lei Wu for the many discussions that we have had on the issues dis-
cussed here. This work is supported in part by a gift to Princeton University from
iFlytek as well as the ONR grant N00014-13-1-0338.

References
[1] Assyr Abdulle, Weinan E, Bjorn Engquist and Eric Vanden-Eijnden, The heteroge-
nous multiscale methods, Acta Numerica, vol. 21, pp. 1-87, 2012.

[2] Francis Bach, “Breaking the curse of dimensionality with convex neural networks”,
Journal of Machine Learning Research, 18(19):1-53, 2017.
[3] Andrew R. Barron, “Universal approximation bounds for superpositions of a sig-
moidal function”, IEEE Transactions on Information theory, 39(3):930-945, 1993.
[4] Jörg Behler and Michele Parrinello, “Generalized neural-network representation of
high-dimensional potential-energy surfaces”, Physical review letters, 98(14):146401,
2007.
[5] Achi Brandt, “Multiscale scientific computation: review 2001”. In Barth, T.J., Chan,
T.F. and Haimes, R. (eds.): Multiscale and Multiresolution Methods: Theory and
Applications, Springer Verlag, Heidelberg, 2001, pp. 196.
[6] Roberto Car and Michele Parrinello, “Unified approach for molecular dynamics and
density-functional theory”, Physical Review Letters, 55(22):2471, 1985.
[7] Lenaic Chizat and Francis Bach, “On the global convergence of gradient descent for
over-parameterized models using optimal transport”, In Advances in neural informa-
tion processing systems, pages 3036-3046, 2018.
[8] Weinan E, Principles of Multiscale Modeling, Cambridge University Press, 2011.
[9] Weinan E and Bjorn Engquist, The heterogeneous multiscale methods, Comm. Math.
Sci., vol. 1, no. 1, pp. 87-132, 2003.
[10] Weinan E, Jiequn Han and Arnulf Jentzen, “Deep learning-based numerical methods
for high-dimensional parabolic partial differential equations and backward stochastic
differential equations”, Communications in Mathematics and Statistics 5, 4 (2017),
349-380.
[11] Weinan E, Jiequn Han, and Linfeng Zhang, “Integrating Machine Learning with
Physics-Based Modeling”, https://arxiv.org/pdf/2006.02619.pdf, 2020.
[12] Weinan E, Chao Ma and Lei Wu, “A priori estimates of the population risk for two-
layer neural networks”, Communications in Mathematical Sciences, 17(5):1407-1425,
2019; arXiv:1810.06397, 2018.
[13] Weinan E, Chao Ma and Lei Wu, “Barron spaces and the flow-induced function
spaces for neural network models”, arXiv:1906.08039, 2019.
[14] Weinan E and Stephan Wojtowytsch, “Representation formulas and pointwise prop-
erties for Barron functions”, arXiv:2006.05982, 2020.
[15] Weinan E and Stephan Wojtowytsch, “On the Banach spaces associated with multi-
layer ReLU networks: Function representation, approximation theory and gradient
descent dynamics”, https://arxiv.org/pdf/2007.15623.pdf, 2020.
[16] Weinan E, Chao Ma, and Lei Wu, “Machine learning from a continuous viewpoint”,
arXiv:1912.12777, 2019.

[17] Weinan E, Jiequn Han and Arnulf Jentzen, “Algorithms for solving high
dimensional PDEs: From nonlinear Monte Carlo to machine learning”,
https://arxiv.org/pdf/2008.13333.pdf, 2020.

[18] Weinan E, Jiequn Han and Linfeng Zhang, “Integrating machine learning with
physics-based modeling”, https://arxiv.org/pdf/2006.02619.pdf, 2020.

[19] Weinan E, Chao Ma, Stephan Wojtowytsch and Lei Wu, “Towards a mathe-
matical understanding of machine learning: What is known and what is not”,
https://arxiv.org/pdf/2009.10713.pdf.

[20] Jiequn Han and Weinan E, “Deep learning approximation for stochastic
control problems”, Deep Reinforcement Learning Workshop, NIPS (2016),
https://arxiv.org/pdf/1611.07422.pdf.

[21] Jiequn Han, Arnulf Jentzen and Weinan E, “Solving high-dimensional partial dif-
ferential equations using deep learning”, Proceedings of the National Academy of
Sciences, 115, 34 (2018), 8505-8510.

[22] Jiequn Han, Chao Ma, Zheng Ma, Weinan E, “Uniformly Accurate Machine Learning
Based Hydrodynamic Models for Kinetic Equations”, Proceedings of the National
Academy of Sciences, 116 (44) 21983-21991; DOI: 10.1073/pnas.1909854116, 2019.

[23] Jiequn Han, Linfeng Zhang, Roberto Car, and Weinan E, “Deep potential: a gen-
eral representation of a many-body potential energy surface”, Communications in
Computational Physics, 23(3):629-639, 2018.

[24] Pierre C Hohenberg and Bertrand I Halperin, “Theory of dynamic critical phenom-
ena”, Reviews of Modern Physics, 49(3):435, 1977.

[25] Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Jiduan Liu, Lin Lin, Roberto Car,
Weinan E, and Linfeng Zhang, “Pushing the limit of molecular dynamics with ab
initio accuracy to 100 million atoms with machine learning”, arXiv:2005.00223, 2020.

[26] Jason M Klusowski and Andrew R Barron, “Risk bounds for high-dimensional ridge
function combinations including neural networks”, arXiv:1607.01434, 2016.

[27] Yann LeCun, “A theoretical framework for back propagation”, In: Touretzky, D.,
Hinton, G., Sejnouski, T. (eds.) Proceedings of the 1988 connectionist models summer
school, Carnegie-Mellon University, Morgan Kaufmann, 1989.

[28] Qianxiao Li, Long Chen, Cheng Tai and Weinan E, “Maximum Principle Based Al-
gorithms for Deep Learning”, Journal of Machine Learning Research, vol.18, no.165,
pp.1-29, 2018, https://arxiv.org/pdf/1710.09513.pdf.

[29] Qianxiao Li and Shuji Hao, “An Optimal Control Approach to Deep Learning and
Applications to Discrete-Weight Neural Networks”, Proceedings of the 35th Interna-
tional Conference on Machine Learning, 2018.

[30] Denghui Lu, Han Wang, Mohan Chen, Jiduan Liu, Lin Lin, Roberto Car, Weinan E,
Weile Jia, and Linfeng Zhang, “86 pflops deep potential molecular dynamics simula-
tion of 100 million atoms with ab initio accuracy”, arXiv:2004.11658, 2020.

[31] Chao Ma, Lei Wu, and Weinan E, “The quenching-activation behavior of the gradient
descent dynamics for two-layer neural network models”, arXiv:2006.14450, 2020.

[32] Song Mei, Andrea Montanari, and Phan-Minh Nguyen, “A mean field view of the
landscape of two-layer neural networks”, Proceedings of the National Academy of
Sciences, 115(33):E7665-E7671, 2018.

[33] Etienne Pardoux and Shige Peng, “Backward stochastic differential equations and
quasilinear parabolic partial differential equations”, in Stochastic partial differential
equations and their applications (Charlotte, NC, 1991), vol. 176 of Lecture Notes in
Control and Inform. Sci., Springer, Berlin, 1992, pp. 200-217.

[34] Grant Rotskoff and Eric Vanden-Eijnden, “Parameters as interacting particles: long
time convergence and asymptotic error scaling of neural networks”, In Advances in
neural information processing systems, pages 7146-7155, 2018.

[35] Justin Sirignano and Konstantinos Spiliopoulos, “Mean field analysis of neural net-
works: A central limit theorem”, arXiv:1808.09372, 2018.

[36] Hans Triebel, Theory of Function Spaces, Birkhäuser, 1983.

[37] Han Wang, Linfeng Zhang and Weinan E, unpublished.

[38] Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, and Weinan E, “Deep potential
molecular dynamics: A scalable model with the accuracy of quantum mechanics”,
Physical Review Letters, 120:143001, Apr 2018.

[39] Linfeng Zhang, Jiequn Han, Han Wang, Wissam A Saidi, Roberto Car, and Weinan
E, “End-to-end symmetry preserving inter-atomic potential energy model for finite
and extended systems”, In Advances of the Neural Information Processing Systems
(NIPS), 2018.

[40] Linfeng Zhang, De-Ye Lin, Han Wang, Roberto Car, and Weinan E, “Active learn-
ing of uniformly accurate interatomic potentials for materials simulation”, Physical
Review Materials, 3(2):023804, 2019.

[41] Linfeng Zhang, Han Wang and Weinan E, “Reinforced dynamics for the enhanced
sampling in large atomic and molecular systems. I. Basic Methodology”, J. Chem.
Phys., vol. 148, pp.124113, 2018.
