Machine Learning and Computational Mathematics: in Memory of Professor Feng Kang (1920-1993)
Weinan E
Princeton University and Beijing Institute of Big Data Research
Contents
arXiv:2009.14596v1 [math.NA] 23 Sep 2020
1 Introduction
6 Concluding remarks
Abstract
Neural network-based machine learning is capable of approximating functions in
very high dimension with unprecedented efficiency and accuracy. This has opened up
many exciting new possibilities, not just in traditional areas of artificial intelligence,
but also in scientific computing and computational science. At the same time,
machine learning has also acquired the reputation of being a set of “black box”
type of tricks, without fundamental principles. This has been a real obstacle for
making further progress in machine learning.
In this article, we try to address the following two very important questions:
(1) How has machine learning already impacted, and how will it further impact,
computational mathematics, scientific computing and computational science? (2) How
can computational mathematics, particularly numerical analysis, impact machine
learning? We describe some of the most important progress that has been made on
these issues. Our hope is to put things into a perspective that will help to integrate
machine learning with computational mathematics.
1 Introduction
Neural network-based machine learning (ML) has shown very impressive success on a
variety of tasks in traditional artificial intelligence. This includes classifying images, gen-
erating new images such as (fake) human faces and playing sophisticated games such as
Go. A common feature of all these tasks is that they involve objects in very high dimen-
sion. Indeed when formulated in mathematical terms, the image classification problem is
a problem of approximating a high dimensional function, defined on the space of images,
to the discrete set of values corresponding to the category of each image. The dimen-
sionality of the input space is typically 3 times the number of pixels in the image, where
3 is the dimensionality of the color space. The image generation problem is a problem
of generating samples from an unknown high dimensional distribution, given a set of
samples from that distribution. The Go game problem is about solving a Bellman-like
equation in dynamic programming, since the optimal strategy satisfies such an equation.
For sophisticated games such as Go, this Bellman-like equation is formulated on a huge
space.
All these are made possible by the ability to accurately approximate high dimensional
functions, using modern machine learning techniques. This opens up new possibilities for
attacking problems that suffer from the “curse of dimensionality” (CoD): As dimension-
ality grows, computational cost grows exponentially fast. This CoD problem has been an
essential obstacle for the scientific community for a very long time.
Take, for example, the problem of solving partial differential equations (PDEs) nu-
merically. With traditional numerical methods such as finite difference, finite element
and spectral methods, we can now routinely solve PDEs in three spatial dimensions plus
the temporal dimension. Most of the PDEs currently studied in computational mathe-
matics belong to this category. Well known examples include the Poisson equation, the
Maxwell equation, the Euler equation, the Navier-Stokes equations, and the PDEs for
linear elasticity. Sparse grids can extend our ability to handle PDEs to, say, 8 to 10
dimensions. This allows us to try solving problems such as the Boltzmann equation for
simple molecules. But we are totally lost when faced with PDEs in, say, 100 dimensions.
This makes it essentially impossible to solve the Fokker-Planck or Boltzmann equations for
complex molecules, the many-body Schrödinger equation, or the Hamilton-Jacobi-Bellman
equations for realistic control problems.
This is exactly where machine learning can help. Indeed, starting with the work in
[20, 10, 21], machine learning-based numerical algorithms for solving high dimensional
PDEs and control problems have been one of the most exciting new developments in recent
years in scientific computing, and this has opened up a host of new possibilities for
computational mathematics. We refer to [17] for a review of this exciting development.
Solving PDEs is just the tip of the iceberg. There are many other problems for which
CoD is the main obstacle, including:
• solid mechanics. In solid mechanics, we do not even have the analog of the Navier-
Stokes equation. Why is this the case? Well, the real reason is that the behavior of
solids is essentially a multi-scale problem that involves scales from atomistic all the
way to macroscopic.
• multi-scale modeling. In fact most multi-scale problems for which there is no sepa-
ration of scales belong to this category. An immediate example is the dynamics of
polymer fluids or polymer melts.
Can machine learning help for these problems? More generally, can we extend the
success of machine learning beyond traditional AI? We will try to convince the reader
that this is indeed the case for many problems.
Besides being extremely powerful, neural network-based machine learning has also got
the reputation of being a set of tricks instead of a set of systematic scientific principles.
Its performance depends sensitively on the value of the hyper-parameters, such as the
network widths and depths, the initialization, the learning rates, etc. Indeed just a few
years ago, parameter tuning was considered to be very much of an art. Even now, this
is still the case for some tasks. Therefore a natural question is: Can we understand these
subtleties and propose better machine learning models whose performance is more robust?
In this article, we review what has been learned on these two issues. We discuss the
impact that machine learning has already made or will make on computational mathemat-
ics, and how the ideas from computational mathematics, particularly numerical analysis,
can be used to help understand and better formulate machine learning models. On
the former, we will mainly discuss the new problems that can now be addressed using
ML-based algorithms. Even though machine learning also suggests new ways to solve
some traditional problems in computational mathematics, we will not say much on this
front.
think of them as being some replacement of polynomials. We will discuss neural networks
afterwards.
The key idea here is coarse graining, and iterating between the fine scale and the
coarse-grained problem. The main components in coarse graining are a set of
coarse-grained variables and the effective coarse-grained problem. Formulated this way, these
are obviously general ideas that can be relevant for a wide variety of problems. In practice,
however, the difficulty lies in how to obtain the effective coarse-grained problem, a step
that is trivial for linear problems, and this is where machine learning can help.
We are going to use the protein folding problem as an example to illustrate the general
idea for nonlinear problems.
Let $\{x_j\}$ be the positions of the atoms in a protein and the surrounding solvent, and
$U = U(\{x_j\})$ be the potential energy of the combined protein-solvent system. The
potential energy consists of the energies due to chemical bonding, Van der Waals interaction,
electro-static interaction, etc. The protein folding problem is to find the “ground state”
of the energy $U$:
“Minimize” $U$.
Here we have added quotation marks since really what we want to do is to sample the
distribution
\[ \rho_\beta = \frac{1}{Z} e^{-\beta U}, \quad \beta = (k_B T)^{-1}. \]
To define the coarse-grained problem, we assume that we are given a set of collective
variables: $s = (s_1, \cdots, s_n)$, $s_j = s_j(x_1, \cdots, x_N)$, $(n < N)$. One possibility is to use the
dihedral angles as the coarse-grained variables. In principle, one may also use machine
learning methods to learn the “best” set of coarse-grained variables but this direction will
not be pursued here.
Having defined the coarse-grained variables, the effective coarse-grained problem is
simply the free energy associated with this set of coarse-grained variables:
\[ A(s) = -\frac{1}{\beta} \ln p_\beta(s), \quad p_\beta(s) = \frac{1}{Z} \int e^{-\beta U(x)} \, \delta(s(x) - s) \, dx. \]
Unlike the case for linear problems, for which the effective coarse-grained model is readily
available, in the current situation, we have to find the function A first.
The idea is to approximate A by neural networks. The issue here is how to obtain the
training data.
Contrary to most standard machine learning problems where the training data is
collected beforehand, in applications to computational science and scientific computing,
the training data is collected “on-the-fly” as learning proceeds. This is referred to as
the “concurrent learning” protocol [11]. In this regard, the standard machine learning
problems for which the training data is collected beforehand are examples of “sequential
learning”. The key issue for concurrent learning is an efficient algorithm for generating
the data in the best way. The training dataset should on one hand be representative
enough and on the other hand be as small as possible.
A general procedure for generating such datasets is suggested in [11]. It is called
the EELT (exploration-examination-labeling-training) algorithm and it consists of the
following steps:
• exploration: exploring the $s$ space. This can be done by sampling $\frac{1}{Z} e^{-\beta A(s)}$ with
the current approximation of $A$.
• examination: for each state explored, decide whether that state should be labeled.
One way to do this is to use an a posteriori error estimator. One possible such a
posteriori error estimator is the variance of the predictions of an ensemble of machine
learning models, see [41].
• labeling: compute the mean force (say using restrained molecular dynamics)
from which the free energy A can be computed using standard thermodynamic
integration.
• training: train the appropriate neural network model. To come up with a good
neural network model, one has to take into account the symmetries in the prob-
lem. For example, if we coarse grain a full atom representation of a collection of
water molecules by eliminating the positions of the hydrogen atoms, then the free
energy function for the resulting system should have permutation symmetry and
this should be taken into account when designing the neural network model (see the
next subsection).
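The four steps above can be sketched on a toy problem. The following is a minimal illustration under stated assumptions, not the implementation of [11]: the collective variable is one-dimensional, the expensive labeling step is replaced by a hypothetical closed-form oracle `mean_force_oracle` for the free energy $A(s) = (s^2 - 1)^2$, the "neural network" is replaced by a small polynomial model, and the ensemble is built from bootstrap resamples. All function names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
BETA, DEG, N_MODELS, S_MAX = 1.0, 6, 8, 2.0
POWERS = np.arange(1, DEG + 1)


def mean_force_oracle(s):
    """Labeling stand-in: exact mean force -dA/ds for the toy A(s) = (s^2 - 1)^2."""
    return -4.0 * s * (s**2 - 1.0)


def force_features(s):
    """If A(s) = sum_k c_k s^k, the force -dA/ds equals force_features(s) @ c."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    return -POWERS * s[:, None] ** (POWERS - 1)


def free_energy(coeffs, s):
    """Free energy of one model; integrating the fitted force is analytic here."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    return (s[:, None] ** POWERS) @ coeffs


def train(s_lab, f_lab):
    """Training: ridge-regularized fits on bootstrap resamples form an ensemble."""
    ensemble = []
    for _ in range(N_MODELS):
        idx = rng.integers(0, len(s_lab), size=len(s_lab))
        phi, y = force_features(s_lab[idx]), f_lab[idx]
        ensemble.append(np.linalg.solve(phi.T @ phi + 1e-6 * np.eye(DEG), phi.T @ y))
    return np.array(ensemble)  # shape (N_MODELS, DEG)


def explore(coeffs, n_steps=200, step=0.3):
    """Exploration: Metropolis sampling of exp(-BETA * A(s)) on [-S_MAX, S_MAX]."""
    s, visited = 0.0, []
    for _ in range(n_steps):
        prop = s + step * rng.normal()
        if abs(prop) <= S_MAX:
            delta = free_energy(coeffs, prop)[0] - free_energy(coeffs, s)[0]
            if np.log(rng.uniform()) < -BETA * delta:
                s = prop
        visited.append(s)
    return np.array(visited)


def eelt(n_rounds=10, tol=0.05, max_new=20):
    s_lab = np.array([0.0, 0.5])  # small initial labeled set
    f_lab = mean_force_oracle(s_lab)
    n_explored = 0
    for _ in range(n_rounds):
        ensemble = train(s_lab, f_lab)
        visited = explore(ensemble.mean(axis=0))
        n_explored += len(visited)
        # Examination: ensemble spread of the predicted force as error indicator.
        spread = (force_features(visited) @ ensemble.T).std(axis=1)
        candidates = np.unique(visited[spread > tol])
        if len(candidates) > 0:
            new = rng.choice(candidates, size=min(max_new, len(candidates)), replace=False)
            s_lab = np.concatenate([s_lab, new])                     # labeling
            f_lab = np.concatenate([f_lab, mean_force_oracle(new)])
    return train(s_lab, f_lab), len(s_lab), n_explored


ensemble, n_labeled, n_explored = eelt()
print(f"explored {n_explored} states, labeled {n_labeled}")
```

The cap `max_new` and the threshold `tol` are design choices of this sketch: they keep the labeled set small relative to the number of explored states, which is the property the examination step is meant to deliver.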
Figure 1: The folded and extended states of Trp-cage, reproduced with permission from
[37].
This can also be viewed as a nonlinear multi-grid algorithm in the sense that it iterates
between sampling pβ on the space of the coarse-grained variables and the (constrained)
Gibbs distribution ρβ for the full atom description.
This is a general procedure that should work for a large class of nonlinear “multi-grid”
problems.
Shown in Figure 1 are the extended and folded structures of Trp-cage. This is a small
protein with 20 amino acids. We have chosen the 38 dihedral angles as the collective
variables. The full result is presented in [37].
Figure 2: The results of EELT algorithm: Number of configurations explored vs. the
number of data points labeled. Only a very small percentage of the configurations are
labeled. Reproduced with permission from Linfeng Zhang. See also [40].
2. Preserving the symmetry. Besides the usual translational and rotational symmetry,
one also has the permutational symmetry: If we relabel a system of copper atoms,
its potential energy should not change. It makes a big difference in terms of the
accuracy of the neural network model whether one takes these symmetries into
account (see [23] and Figure 3).
One very nice and general way of addressing the symmetry problem is to design the
neural network model as the composition of two networks: An embedding network fol-
lowed by a fitting network. The task for the embedding network is to represent enough
symmetry-preserving functions to be fed into the fitting network [39].
With these issues properly addressed, one can come up with very satisfactory neural
network-based representation of V (see Figure 4). This representation is named Deep
Potential [23, 39] and the Deep Potential-based molecular dynamics is named DeePMD
[38]. As has been demonstrated recently in [30, 25], DeePMD, combined with state of the
art high performance supercomputers, can help to increase the size of the system that
one can model with ab initio accuracy by 5 orders of magnitude.
Figure 3: The effect of symmetry preservation on testing accuracy. Shown in red are the
results of poor man’s way of imposing symmetry (see main text for explanation). One
can see that testing accuracy is drastically improved. Reproduced with permission from
Linfeng Zhang.
Figure 4: The test accuracy of the Deep Potential for a wide variety of systems.
Reproduced with permission from Linfeng Zhang. See also [39].
\[ a_t = A_t(s_t). \tag{3} \]
Unlike the situation in standard supervised learning, here we have a set of T neural
networks to be trained simultaneously. The network architecture is shown in Figure 5.
Figure 5: Network architecture for solving stochastic control in discrete time. The whole
network has (N + 1)T layers in total that involve free parameters to be optimized simul-
taneously. Each column (except ξt ) corresponds to a sub-network at t. Reproduced with
permission from Jiequn Han. See also [20].
Compared with the standard setting for machine learning, one can see a clear analogy
in which (1) plays the role of the residual networks and the noise {ξt } plays the role
of data. Indeed, stochastic gradient descent (SGD) can be readily used to solve the
optimization problem (5).
An example of the application of this algorithm is shown in Figure 6 for the problem of
energy storage with multiple devices. Here n is the number of devices. For more details,
we refer to [20].
[Plot for Figure 6: reward relative to the case n = 50 (vertical axis) against iteration (horizontal axis), with one curve for each number of devices n = 30, 40, 50.]
Figure 6: Relative reward for the energy storage problem. The space of control
functions is $\mathbb{R}^{n+2} \to \mathbb{R}^{3n}$ for n = 30, 40, 50, with multiple equality and inequality constraints.
Reproduced with permission from Jiequn Han. See also [20].
3.2 Nonlinear parabolic PDEs
Consider parabolic PDEs of the form:
\[ \frac{\partial u}{\partial t} + \frac{1}{2} \sigma\sigma^T : \nabla_x^2 u + \mu \cdot \nabla u + f(\sigma^T \nabla u) = 0, \quad u(T, x) = g(x). \]
We study a terminal-value problem instead of initial-value problem since one of the main
applications we have in mind is in finance. To develop machine learning-based algorithms,
we would like to first reformulate this as a stochastic optimization problem. This can be
done using backward stochastic differential equations (BSDE) [33].
It can be shown that the unique minimizer of this problem is the solution to the PDE
with:
\[ Y_t = u(t, X_t) \quad \text{and} \quad Z_t = \sigma^T(t, X_t) \, \nabla u(t, X_t). \tag{9} \]
With this formulation, one can develop a machine learning-based algorithm along the
following lines, adopting the ideas for the stochastic control problems discussed earlier
[10, 21]:
• Using the BSDE, one constructs an approximation $\hat{u}$ that takes the paths $\{X_{t_n}\}_{0 \le n \le N}$
and $\{W_{t_n}\}_{0 \le n \le N}$ as the input data and gives the final output, denoted by
$\hat{u}(\{X_{t_n}\}_{0 \le n \le N}, \{W_{t_n}\}_{0 \le n \le N})$, as an approximation to $u(t_N, X_{t_N})$.
• The error in the matching between $\hat{u}$ and the given terminal condition defines the
expected loss function
\[ l(\theta) = \mathbb{E}\Big[ \big| g(X_{t_N}) - \hat{u}\big(\{X_{t_n}\}_{0 \le n \le N}, \{W_{t_n}\}_{0 \le n \le N}\big) \big|^2 \Big]. \]
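with the cost functional: $J(\{m_t\}_{0 \le t \le T}) = \mathbb{E}\big[ \int_0^T \|m_t\|_2^2 \, dt + g(X_T) \big]$. The corresponding
HJB equation is given by
\[ \frac{\partial u}{\partial t} + \Delta u - \lambda \|\nabla u\|_2^2 = 0. \tag{11} \]
Using the Hopf-Cole transform, we can express the solution in the form:
\[ u(t, x) = -\frac{1}{\lambda} \ln\Big( \mathbb{E}\big[ \exp\big( -\lambda g(x + \sqrt{2} \, W_{T-t}) \big) \big] \Big). \tag{12} \]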
This can be used to calibrate the accuracy of the Deep BSDE method.
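Formula (12) can be evaluated by plain Monte Carlo sampling of the Brownian increment, which is how a reference value of this kind can be produced. The sketch below is only an illustration: the function name `hjb_reference`, the sample size, and the linear terminal condition used in the check are assumptions made here, not choices taken from [21]. For a linear terminal condition $g(y) = \langle c, y \rangle$, formula (12) can be computed in closed form, $u(t, x) = \langle c, x \rangle - \lambda \|c\|_2^2 (T - t)$, which gives a convenient sanity check.

```python
import numpy as np


def hjb_reference(g, x, t, T, lam, n_samples=100_000, seed=0):
    """Monte Carlo estimate of u(t, x) = -(1/lam) * ln E[exp(-lam * g(x + sqrt(2) W_{T-t}))]."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # W_{T-t} is Gaussian with mean 0 and covariance (T - t) * I.
    w = rng.normal(scale=np.sqrt(T - t), size=(n_samples, x.size))
    vals = -lam * g(x + np.sqrt(2.0) * w)
    # Log-mean-exp with the maximum subtracted, for numerical stability.
    shift = vals.max()
    return -(shift + np.log(np.mean(np.exp(vals - shift)))) / lam


# Sanity check with a hypothetical linear terminal condition g(y) = <c, y>.
c = np.array([0.1, 0.2, 0.3])
u_mc = hjb_reference(lambda y: y @ c, x=np.zeros(3), t=0.0, T=1.0, lam=1.0)
u_exact = -1.0 * (c @ c) * 1.0  # <c, x> - lam * |c|^2 * (T - t) at x = 0
print(u_mc, u_exact)
```

Any terminal condition `g` that accepts an array of shape `(n_samples, d)` and returns one value per row can be passed in the same way.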
[Plot for Figure 7 (right panel): u(0, 0, . . . , 0) (vertical axis) against lambda (horizontal axis), comparing the Deep BSDE Solver with Monte Carlo.]
Figure 7: Left: Relative error of the deep BSDE method for u(t=0, x=(0, . . . , 0)) when λ = 1, which
achieves 0.17% in a runtime of 330 seconds. Right: Optimal cost u(t=0, x=(0, . . . , 0)) against different
λ. Reproduced with permission from Jiequn Han. See also [21].
Figure 8: The solution of the Black-Scholes equation with default risk at d = 100. The
Deep BSDE method achieves a relative error of size 0.46% in a runtime of 617 seconds.
Reproduced with permission from Jiequn Han. See also [21].
The Deep BSDE method has been applied to pricing basket options, interest rate-
dependent options, Libor market model, Bermudan Swaption, barrier option (see [17] for
references).
[Diagram for Figure 9: model regimes along the Knudsen number (Kn) axis, from 10^{-2} to 10.0: Euler Eqn, NSF Eqn, transition regime, kinetic regime, free flight; the axis runs from equilibrium to non-equilibrium.]
Figure 9: The different regimes of gas dynamics. Reproduced with permission from Jiequn
Han. See also [22].
What happens when ε is not small? In this case, a natural idea is to seek some
generalization to Euler’s equation using more moments. This program was initiated by
Harold Grad who constructed the well-known 13-moment system using the moments of
$\{1, v, (v - u) \otimes (v - u), |v - u|^2 (v - u)\}$. This line of work has encountered several
difficulties. First, there is no guarantee that the equations obtained are well-posed. Second,
there is always the “closure problem”: When projecting the Boltzmann equation on a set
of moments, there are always terms which involve moments outside the set of moments
considered. In order to obtain a closed system, one needs to model these terms in some
way. For Euler’s equation, this is done using the local Maxwellian approximation. This
is accurate when ε is small, but is no longer so when ε is not small. It is highly unclear
what should be used as the replacement.
In [22], Han et al developed a machine learning-based moment method. The overall
objective is to construct a uniformly accurate (generalized) moment model. The
methodology consists of two steps:
1: Learn a set of optimal generalized moments through an auto-encoder. Here by
optimality we mean that the set of generalized moments retains a maximum amount
of information about the original distribution function and can be used to recover the
distribution function with a minimum loss of accuracy. This can be done as follows: Find
an encoder Ψ and a decoder Φ that recovers the original $f$ from $U, W$
\[ W = \Psi(f) = \int w f \, dv, \quad \Phi(U, W)(v) = h(v; U, W). \]
\[ \text{Minimize}_{w, h} \ \mathbb{E}_{f \sim D} \| f - \Phi(\Psi(f)) \|^2. \]
$U$ and $W$ form the set of generalized hydrodynamic variables that we will use to model
the gas flow.
2: Learn the fluxes and source terms in the PDE for the projected PDE. The effective
PDE for $U$ and $W$ can be obtained by formally projecting the Boltzmann equation on
this set of (generalized) moments. This gives us a set of PDEs of the form:
\[ \begin{cases} \partial_t U + \nabla_x \cdot F(U, W; \varepsilon) = 0, \\ \partial_t W + \nabla_x \cdot G(U, W; \varepsilon) = R(U, W; \varepsilon). \end{cases} \tag{15} \]
where $F(U, W; \varepsilon) = \int v U f \, dv$, $G(U, W; \varepsilon) = \int v W f \, dv$, $R(U, W; \varepsilon) = \varepsilon^{-1} \int W Q(f) \, dv$.
Our task now is to learn $F$, $G$, $R$ from the original kinetic equation.
Again the important issues are (1) get an optimal dataset, and (2) enforce the physical
constraints. Here two notable physical constraints are (1) conservation laws and (2)
symmetries. Conservation laws are automatically respected in this approach. Regarding
symmetries, besides the usual static symmetry, there is now a new dynamic symmetry:
the Galilean invariance. These issues are all discussed in [22]. We also refer to [22] for
numerical results for the models obtained this way.
• Why does simple gradient descent work for training neural network models?
• Why does neural network modeling require such extensive parameter tuning?
At this point, we do not yet have clear answers to all these questions. But some coher-
ent picture is starting to emerge. We will focus on the problem of supervised learning,
namely approximating a target function using a finite dataset. For simplicity, we will
limit ourselves to the case when the physical domain of interest is $X = [0, 1]^d$.
In modern machine learning the most popular choice is neural network functions.
Two-layer neural network functions (one input layer, one output layer which usually does not
count, and one hidden layer) take the form
\[ f_m(x, \theta) = \frac{1}{m} \sum_{j=1}^{m} a_j \sigma(\langle w_j, x \rangle) \tag{16} \]
where $\sigma : \mathbb{R} \to \mathbb{R}$ is a fixed scalar nonlinear function and where $\theta = \{(a_j, w_j)\}_{j \in \{1, 2, \ldots, m\}}$ are
the parameters to be optimized (or trained). A popular example for the nonlinear function
$\sigma : \mathbb{R} \to \mathbb{R}$ is the ReLU (rectified linear unit) activation function: $\sigma(z) = \max\{z, 0\}$, for all
$z \in \mathbb{R}$. We will restrict our attention to this activation function. Roughly speaking, deep
neural network (DNN) functions are obtained if one composes two-layer neural network
functions several times. One important class of DNN models are residual neural networks
(ResNet). They closely resemble discretized ordinary differential equations and take the
form
\[ z_{l+1} = z_l + \sum_{j=1}^{M} a_{j,l} \, \sigma(\langle z_l, w_{j,l} \rangle), \quad z_0 = V \tilde{x}, \quad f_L(x, \theta) = \langle \alpha, z_L \rangle \tag{17} \]
for $l \in \{0, 1, \ldots, L-1\}$ where $L, M \in \mathbb{N}$. Here the parameters are $\theta = (\alpha, V, (a_{j,l})_{j,l}, (w_{j,l})_{j,l})$.
ResNets are the model of choice for truly deep neural network models.
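To make the two hypothesis spaces concrete, here is a minimal NumPy sketch of the forward passes in (16) and (17). Two points are assumptions of this sketch rather than statements from the text: the input $\tilde{x}$ is simply taken to be $x$, and each $a_{j,l}$ is treated as a vector of the same dimension $D$ as $z_l$ so that the residual update is well-defined. The array layouts are likewise illustrative.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def two_layer(x, a, w):
    """Eq. (16): f_m(x) = (1/m) * sum_j a_j * relu(<w_j, x>).

    Shapes: x (d,), a (m,), w (m, d).
    """
    return np.mean(a * relu(w @ x))


def resnet(x, alpha, V, a, w):
    """Eq. (17) with x~ = x (an assumption of this sketch).

    Shapes: x (d,), V (D, d), alpha (D,), a and w (L, M, D).
    """
    z = V @ x
    for a_l, w_l in zip(a, w):
        # sum_j a_{j,l} * relu(<z_l, w_{j,l}>), with vector-valued a_{j,l}
        z = z + a_l.T @ relu(w_l @ z)
    return alpha @ z


x = np.array([1.0, -1.0])
print(two_layer(x, a=np.array([2.0, 4.0]), w=np.eye(2)))
```

With all $a_{j,l}$ set to zero the residual blocks reduce to the identity, so `resnet` returns $\langle \alpha, V x \rangle$; this is a quick way to check the wiring.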
Step 2. Choose a loss function. The primary consideration for the choice of the
loss function is to fit the data. Therefore the most obvious choice is the $L^2$ loss:
\[ \hat{R}_n(f) = \frac{1}{n} \sum_{j=1}^{n} |f(x_j) - y_j|^2 = \frac{1}{n} \sum_{j=1}^{n} |f(x_j) - f^*(x_j)|^2. \tag{18} \]
This is also called the “empirical risk”. Sometimes we also add regularization terms.
Step 3. Choose an optimization algorithm. The most popular optimization
algorithms in machine learning are different versions of the gradient descent (GD) algo-
rithm, or its stochastic analog, the stochastic gradient descent (SGD) algorithm. Assume
that the objective function we aim to minimize is of the form
\[ F(\theta) = \mathbb{E}_{\xi \sim \nu} \, l(\theta, \xi). \tag{19} \]
Obviously this form of SGD can be adapted to loss functions of the form (18) which can
be regarded as an expectation with ν being the empirical distribution on the training
dataset. This DNN-SGD paradigm is the heart of modern machine learning.
4.2 Approximation theory
The simplest way of approximating functions is to use polynomials. For polynomial
approximation, there are two kinds of theorems. The first is Weierstrass’ Theorem
which asserts that continuous functions can be uniformly approximated by polynomials
on compact domains. The second is Taylor’s Theorem which tells us that the rate of
convergence depends on the smoothness of the target function.
Using the terminology in neural network theory, Weierstrass’ Theorem is the “Univer-
sal Approximation Theorem” (UAT). It is a useful fact. But Taylor’s Theorem is more
useful since it tells us something about the rate of convergence. The forms of Taylor’s
Theorem used in approximation theory are the direct and inverse approximation theorems,
which assert that a given function can be approximated by polynomials with a particular
rate if and only if a certain norm of that function is finite. This particular norm, which
measures the regularity of the function, is the key quantity that characterizes this
approximation scheme. For piecewise polynomials, these norms are some Besov space norms [36].
For $L^2$, a typical result looks as follows:
\[ \inf_{f_m \in \mathcal{H}_m} \|f - f_m\|_{L^2(X)} \le C_0 h^{\alpha} \|f\|_{H^{\alpha}(X)} \tag{22} \]
Here $H^{\alpha}$ stands for the $\alpha$-th order Sobolev norm [36], and $m$ is the number of degrees of
freedom. On a regular grid, the grid size is given by
\[ h \sim m^{-1/d} \tag{23} \]
An important thing to notice is that the convergence rate in (22) suffers from CoD: If
we want to reduce the error by a factor of $\epsilon$, we need to increase $m$ by a factor $m \sim \epsilon^{-d}$ if
$\alpha = 1$. For $d = 100$, which is not a very high dimension by the standards of machine learning,
this means that we have to increase $m$ by a factor of $\epsilon^{-100}$. This is why polynomials and
piecewise polynomials are not useful in high dimensions.
Another way to appreciate this is as follows. The number of monomials of degree at most $p$ in
dimension $d$ is $C_{p+d}^{d}$. This grows very fast for large values of $d$ and $p$.
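Both counts are easy to tabulate. The short sketch below (function names are illustrative) evaluates the binomial coefficient $C_{p+d}^{d}$, which counts the monomials of degree at most $p$ in $d$ variables, and the factor $\epsilon^{-d/\alpha}$ by which $m$ must grow according to (22) and (23).

```python
from math import comb


def n_monomials(d, p):
    """Number of monomials of degree at most p in d variables: C(p + d, d)."""
    return comb(p + d, d)


def dof_increase(eps, d, alpha=1.0):
    """Factor by which m must grow to scale the error bound in (22) by eps.

    From (22) and (23) the bound behaves like m^(-alpha/d), so m scales
    like eps^(-d/alpha).
    """
    return eps ** (-d / alpha)


for d in (3, 10, 100):
    print(d, [n_monomials(d, p) for p in (2, 4, 8)], dof_increase(0.5, d))
```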
What should we expect in high dimension? One example that we can learn from is
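Monte Carlo methods for integration. Consider the problem of approximating
\[ I(g) = \mathbb{E}_{x \sim \mu} \, g(x) \]
using
\[ I_m(g) = \frac{1}{m} \sum_{j} g(x_j) \]
where $\{x_j, j \in [m]\}$ is a set of i.i.d. samples of the probability distribution $\mu$. A direct
computation gives
\[ \mathbb{E}(I(g) - I_m(g))^2 = \frac{\mathrm{var}(g)}{m}, \quad \mathrm{var}(g) = \mathbb{E}_{x \sim \mu} \, g^2(x) - (\mathbb{E}_{x \sim \mu} \, g(x))^2 \]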
This exact relation tells us two things. (1) The convergence rate of Monte Carlo integra-
tion is independent of dimension. (2) The error constant is given by the variance of the
integrand. Therefore to reduce error, one has to do variance reduction.
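The dimension independence can be checked numerically. In the sketch below, the test integrand is a hypothetical choice made so that $I(g) = 0$ and $\mathrm{var}(g) = 1$ in every dimension when $\mu$ is the uniform distribution on $[0, 1]^d$; the mean squared error of $I_m(g)$ should then be close to $1/m$ whatever the value of $d$.

```python
import numpy as np


def mc_mean_squared_error(d, m, n_trials=1000, seed=0):
    """Estimate E (I(g) - I_m(g))^2 for a test integrand with I(g) = 0, var(g) = 1."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n_trials, m, d))          # x ~ uniform on [0, 1]^d
    g = np.sqrt(12.0 / d) * (x - 0.5).sum(axis=-1)  # mean 0, variance 1 for every d
    i_m = g.mean(axis=1)                            # one estimate I_m(g) per trial
    return np.mean(i_m**2)                          # compare with var(g) / m = 1 / m


for d in (2, 50):
    print(d, mc_mean_squared_error(d, m=50), 1 / 50)
```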
Had we used grid-based quadrature rules, the accuracy would have also suffered from
CoD.
It is possible to improve the Monte Carlo rate by more sophisticated ways of choosing
{xj , j ∈ [m]}, say using number-theoretic-based quadrature rules. But these typically
give rise to an O(1/d) improvement for the convergence rate and it diminishes as d → ∞.
Based on these considerations, we aim to find function approximations that satisfy:
\[ \inf_{f \in \mathcal{H}_m} R(f) = \inf_{f \in \mathcal{H}_m} \|f - f^*\|^2_{L^2(d\mu)} \lesssim \frac{\|f^*\|_*^2}{m}. \]
The natural questions are then:
• How can we achieve this? That is, what kind of hypothesis space should we choose?
• What should be the “norm” k · k∗ (associated with the choice of Hm )? Here we put
norm in quotation marks since it does not have to be a real norm. All we need is
that it controls the approximation error.
Regarding the first question, let us look at an illustrative example.
Consider the following Fourier representation of the function f and its approximation
fm (say FFT-based):
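\[ f(x) = \int_{\mathbb{R}^d} a(\omega) e^{i(\omega, x)} \, d\omega, \quad f_m(x) = \frac{1}{m} \sum_{j} a(\omega_j) e^{i(\omega_j, x)} \]
Here $\{\omega_j\}$ is a fixed grid, e.g. a uniform grid. For this approximation, we have
\[ \|f - f_m\|_{L^2(X)} \le C_0 m^{-\alpha/d} \|f\|_{H^{\alpha}(X)} \]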
which suffers from CoD.
Now consider the alternative representation
\[ f(x) = \int_{\mathbb{R}^d} a(\omega) e^{i(\omega, x)} \, \pi(d\omega) = \mathbb{E}_{\omega \sim \pi} \, a(\omega) e^{i(\omega, x)} \tag{24} \]
Consider functions $f : X = [0, 1]^d \to \mathbb{R}$ of the following form
\[ f(x) = \int_{\Omega} a \sigma(w^T x) \, \rho(da, dw) = \mathbb{E}_{(a, w) \sim \rho} [a \sigma(w^T x)], \quad x \in X \]
where $P_f := \{\rho : f(x) = \mathbb{E}_{\rho} [a \sigma(w^T x)]\}$. This is called the Barron norm [2, 12, 13]. The
space
\[ \mathcal{B} = \{f \in C^0 : \|f\|_{\mathcal{B}} < \infty\} \]
is called the Barron space [2, 12, 13] (see also [3, 26, 14]).
In analogy with classical approximation theory, we can also prove some direct and
inverse approximation theorems [13].
Theorem 1 (Direct Approximation Theorem). If $\|f\|_{\mathcal{B}} < \infty$, then for any integer $m > 0$,
there exists a two-layer neural network function $f_m$ such that
\[ \|f - f_m\|_{L^2(X)} \lesssim \frac{\|f\|_{\mathcal{B}}}{\sqrt{m}} \]
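Theorem 2 (Inverse Approximation Theorem). Let
\[ \mathcal{N}_C \overset{\text{def}}{=} \Big\{ \frac{1}{m} \sum_{k=1}^{m} a_k \sigma(w_k^T x) : \frac{1}{m} \sum_{k=1}^{m} |a_k|^2 \|w_k\|_1^2 \le C^2, \ m \in \mathbb{N}^+ \Big\}. \]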
Figure 10: The Runge phenomenon: $f^*(x) = \frac{1}{1 + 25x^2}$. Reproduced with permission from
Chao Ma.
but we are interested in the testing error, which is a sampled version of the population
risk:
\[ R(\theta) = \mathbb{E}_{x \sim \mu} (f(x, \theta) - f^*(x))^2 \]
The question is how we can control the difference between these two errors.
One way of doing this is to use the notion of Rademacher complexity. The important
fact for us here is that the Rademacher complexity controls the difference between train-
ing and testing errors (also called the “generalization gap”). Indeed, let H be a set of
functions, and $S = (x_1, x_2, \ldots, x_n)$ be a dataset. Then, up to logarithmic terms, we have
\[ \sup_{h \in H} \Big| \mathbb{E}_x [h(x)] - \frac{1}{n} \sum_{i=1}^{n} h(x_i) \Big| \sim \mathrm{Rad}_S(H) \]
where $\{\xi_i\}_{i=1}^{n}$ are i.i.d. random variables taking values ±1 with equal probability.
The question then becomes how to bound the Rademacher complexity of a hypothesis
space. For the Barron space, we have [2]:
Theorem 3. Let $F_Q = \{f \in \mathcal{B}, \|f\|_{\mathcal{B}} \le Q\}$. Then we have
\[ \mathrm{Rad}_S(F_Q) \le 2Q \sqrt{\frac{2 \ln(2d)}{n}} \]
where $n = |S|$, the size of the dataset $S$.
5 Machine learning from a continuous viewpoint
Now we turn to alternative formulations of machine learning. Motivated by the situation
for PDEs, we would like to first formulate machine learning in a continuous setting and
then discretize to get concrete models and algorithms. The key here is that continuous
problems that we come up with should be nice mathematical problems. For PDEs, this is
accomplished by requiring them to be “well-posed”. For problems in calculus of variations,
we require the problem to be “convex” in some sense and lower semi-continuous. The point
of these requirements is to make sure that the problem has a unique solution. Intuitively,
for machine learning problems, being “nice” means that the variational problem should
have a simple landscape. How to formulate this precisely is an important research problem
for the future.
As was pointed out in [16], the key ingredients for the continuous formulation are as
follows:
• representation of functions (as expectations)
Here θ denotes the parameters in the model: θ can be a(·) or the probability distributions π or
ρ.
This representation corresponds to two-layer neural networks. A generalization to
multi-layer neural networks is presented in [15].
Next we turn to flow-based representation:
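\begin{align}
\frac{dz}{d\tau} &= \mathbb{E}_{w \sim \pi_\tau} \, a(w, \tau) \sigma(w^T z) \tag{27} \\
&= \mathbb{E}_{(a, w) \sim \rho_\tau} \, a \sigma(w^T z) \tag{28} \\
&= \mathbb{E}_{u \sim \rho_\tau} \, \phi(z, u), \quad z(0, x) = x \tag{29}
\end{align}
\[ f(x, \theta) = 1^T z(1, x) \]
In this representation, the parameter θ can be either $\{a_\tau(\cdot)\}$ or $\{\pi_\tau\}$ or $\{\rho_\tau\}$.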
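As a hedged illustration of how the flow-based representation connects to the residual networks in (17), consider one natural discretization, which is a sketch and not a scheme spelled out in this section: replace the expectation in (28) by an average over $M$ samples $(a_{j,l}, w_{j,l})$ drawn from $\rho_{\tau_l}$, and apply the forward Euler scheme with step size $1/L$ at the times $\tau_l = l/L$. The step size and the $1/M$ normalization are assumptions of this sketch.

```latex
\begin{aligned}
z_{l+1} &= z_l + \frac{1}{L} \cdot \frac{1}{M} \sum_{j=1}^{M} a_{j,l} \, \sigma(w_{j,l}^T z_l),
\qquad l = 0, 1, \ldots, L-1, \\
z_0 &= x, \qquad f(x, \theta) \approx 1^T z_L .
\end{aligned}
```

Up to the factor $1/(LM)$, which can be absorbed into $a_{j,l}$, each step has the same form as the residual update in (17).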
5.2 The stochastic optimization problem
Stochastic optimization problems are of the type:
\[
\min_\theta \mathbb{E}_{w \sim \nu}\, g(\theta, w).
\]
These kinds of problems can readily be approached by stochastic algorithms, which are a key
component of machine learning. For example, instead of the gradient descent algorithm:
\[
\theta_{k+1} = \theta_k - \eta \nabla_\theta \mathbb{E}_{w \sim \nu}\, g(\theta, w) = \theta_k - \eta \nabla_\theta \sum_{j=1}^n g(\theta, w_j)
\]
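The contrast between the full-gradient update and a stochastic update that uses a single sample per step can be sketched as follows. The quadratic objective g(θ, w) = ½‖θ − w‖², the Gaussian sampling distribution, the averaging of the full gradient over the n samples, and all names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=1.0, size=(1000, 2))  # w_1, ..., w_n drawn from nu


def grad_g(theta, w):
    """Gradient in theta of the toy objective g(theta, w) = 0.5 * ||theta - w||^2."""
    return theta - w


def gradient_descent(theta, eta, steps):
    """Every step uses the gradient averaged over all n samples."""
    for _ in range(steps):
        theta = theta - eta * grad_g(theta, samples).mean(axis=0)
    return theta


def stochastic_gradient_descent(theta, eta, steps):
    """Every step uses the gradient at one randomly chosen sample."""
    for _ in range(steps):
        w = samples[rng.integers(len(samples))]
        theta = theta - eta * grad_g(theta, w)
    return theta


theta_gd = gradient_descent(np.zeros(2), eta=0.1, steps=200)
theta_sgd = stochastic_gradient_descent(np.zeros(2), eta=0.01, steps=5000)
print(theta_gd, theta_sgd, samples.mean(axis=0))
```

For this toy objective the minimizer is the sample mean; the stochastic iterate fluctuates around it, while each of its steps costs one gradient evaluation instead of n.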
• Supervised learning: In this case, the minimization of the population risk becomes
Substituting the representations discussed earlier into these expressions for the stochastic
optimization problems, we obtain the final variational problem that we need to solve.
One can either discretize these variational problems directly and then solve the dis-
cretized problem using some optimization algorithms, or one can write down continuous
forms of some optimization algorithms, typically gradient flow dynamics, and then dis-
cretize these continuous flows. We are going to discuss the second approach.
5.3 Optimization: Gradient flows
To write down the continuous form of the gradient flows, we draw some inspiration from
statistical physics. Take supervised learning as an example. We regard the population risk
as the “free energy” and, following Halperin and Hohenberg [24], divide the parameters
into two kinds, conserved and non-conserved. For example, $a$ is a non-conserved
parameter and $\pi$ is conserved since its total integral has to be 1.
For a non-conserved parameter, as was suggested in [24], one can use the “model A”
dynamics:
\[
\frac{\partial a}{\partial t} = -\frac{\delta R}{\delta a}
\]
which is simply the usual $L^2$ gradient flow.
For conserved parameters such as π, one should use the “model B” dynamics which
works as follows: First define the “chemical potential”
\[
V = \frac{\delta R}{\delta \pi}.
\]
From the chemical potential, one obtains the velocity field $v$ and the current $J$:
\[
J = \pi v, \quad v = -\nabla V
\]
The dynamics of $\pi$ is then given by the continuity equation
\[
\frac{\partial \pi}{\partial t} + \nabla \cdot J = 0. \tag{30}
\]
[Figure 11: two heat-map panels, each titled “Test errors”, with log10(m) on the horizontal axis, log10(n) on the vertical axis, and a color bar.]
Figure 11: (Left) continuous viewpoint; (Right) conventional NN models. Target function
is a single neuron. Reproduced with permission from Lei Wu.
where $u_j(t) = (a_j(t), w_j(t))$. One can show that in this case, (30) reduces to
\[
\frac{du_j}{dt} = -\nabla_{u_j} I(u_1, \cdots, u_m)
\]
where
\[
I(u_1, \cdots, u_m) = R(f_m), \quad u_j = (a_j, w_j), \quad f_m(x) = \frac{1}{m} \sum_j a_j \sigma(w_j^T x).
\]
This is exactly gradient descent for “scaled” two-layer neural networks. In this case,
the continuous formulation also coincides with the “mean-field” formulation for two-layer
neural networks [7, 32, 34, 35].
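A minimal NumPy sketch of gradient descent for the scaled two-layer network is given below. It assumes a ReLU activation, an empirical risk over n Gaussian samples in place of the population risk, the single-neuron target σ(x_1), and a step size multiplied by m (a time rescaling chosen here so that each particle moves at an O(1) rate). All names and numerical values are illustrative.

```python
import numpy as np


def relu(s):
    return np.maximum(s, 0.0)


def scaled_two_layer(X, a, W):
    """f_m(x) = (1/m) * sum_j a_j * relu(w_j . x), evaluated for each row x of X."""
    return relu(X @ W.T) @ a / a.size


def risk(X, y, a, W):
    return np.mean((scaled_two_layer(X, a, W) - y) ** 2)


def gd_step(X, y, a, W, lr):
    """One explicit Euler step of du_j/dt = -m * grad_{u_j} I for the particles u_j = (a_j, w_j)."""
    n, m = X.shape[0], a.size
    pre = X @ W.T                          # (n, m) pre-activations w_j . x_i
    r = relu(pre) @ a / m - y              # (n,) residuals f_m(x_i) - y_i
    grad_a = 2.0 / (n * m) * relu(pre).T @ r
    grad_W = 2.0 / (n * m) * ((pre > 0) * r[:, None] * a[None, :]).T @ X
    return a - lr * m * grad_a, W - lr * m * grad_W


rng = np.random.default_rng(0)
d, m, n = 5, 200, 500
X = rng.normal(size=(n, d))
y = relu(X[:, 0])                          # single-neuron target relu(x_1)
a, W = rng.normal(size=m), rng.normal(size=(m, d))

initial_risk = risk(X, y, a, W)
for _ in range(500):
    a, W = gd_step(X, y, a, W, lr=0.02)
final_risk = risk(X, y, a, W)
print(initial_risk, final_risk)
```

Each pair (a_j, w_j) plays the role of one particle u_j, and the update of every particle depends on the others only through the residual r, which is the mean-field structure mentioned above.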
The scaling factor 1/m in front of the “scaled” two-layer neural networks is actually
quite important and makes a big difference for the test performance of the network. Shown
in Figure 11 is the heat map of the test error for two-layer neural network models with
and without this scaling factor. The target function is the simple single neuron function:
f ∗ (x) = σ(x1 ). The important observation is that in the absence of this scaling factor, the
test error shows a “phase transition” between a phase where the neural network model
performs like an associated random feature model and another phase where it shows much
better performance than the random feature model. Such a phase transition is one of the
reasons that choosing the right set of hyper-parameters, here the network width m, is
so important. However, if one uses the scaled form, this phase transition phenomenon is
avoided and the performance is more robust [31].
[27]). The back propagation algorithm, for example, is a control-theory based
algorithm. Another more recent example is the development of maximum principle-based
algorithms, first introduced in [28]. Despite these successes, we feel that there is still a lot
of room for using the control theory viewpoint to develop new algorithms.
We consider the flow-based representation in a slightly more general form
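\[
\frac{dz}{d\tau} = \mathbb{E}_{u \sim \rho_\tau}\, \phi(z, u), \quad z(0, x) = x
\]
where $z$ is the state, $\rho_\tau$ is the control at time $\tau$. Our objective is to minimize $R$ over $\{\rho_\tau\}$
\[
R(\{\rho_\tau\}) = \mathbb{E}_{x \sim \mu} (f(x) - f^*(x))^2 = \int_{\mathbb{R}^d} (f(x) - f^*(x))^2 \, d\mu
\]
where as before
\[
f(x) = \mathbf{1}^T z(1, x) \tag{31}
\]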
One of the most important results for this control problem is Pontryagin’s maximum principle
(PMP). To state this result, let us define the Hamiltonian $H : \mathbb{R}^d \times \mathbb{R}^d \times \mathcal{P}_2(\Omega) \to \mathbb{R}$ as
\[
H(z, p, \mu) = \mathbb{E}_{u \sim \mu}[p^T \phi(z, u)].
\]
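Pontryagin’s maximum principle asserts that the solutions of the control problem must
satisfy:
\[
\rho_\tau = \operatorname{argmax}_\rho\, \mathbb{E}_x\left[H\left(z_\tau^{t,x}, p_\tau^{t,x}, \rho\right)\right], \quad \forall \tau \in [0, 1], \tag{32}
\]
where for each $x$, $\{(z_\tau^{t,x}, p_\tau^{t,x})\}$ are defined by the forward/backward equations:
\[
\begin{aligned}
\frac{dz_\tau^{t,x}}{d\tau} &= \nabla_p H = \mathbb{E}_{u \sim \rho_\tau(\cdot;t)}[\phi(z_\tau^{t,x}, u)] \\
\frac{dp_\tau^{t,x}}{d\tau} &= -\nabla_z H = -\mathbb{E}_{u \sim \rho_\tau(\cdot;t)}[\nabla_z^T \phi(z_\tau^{t,x}, u)\, p_\tau^{t,x}],
\end{aligned} \tag{33}
\]
with the boundary conditions:
\begin{align}
z_0^{t,x} &= x \tag{34} \\
p_1^{t,x} &= -2\left(f(x; \rho(\cdot; t)) - f^*(x)\right) \mathbf{1}. \tag{35}
\end{align}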
Pontryagin’s maximum principle is slightly stronger than the KKT condition for the
stationary points in that (32) is a statement of optimality rather than criticality. In fact
(32) also holds when the parameters are discrete and this has been used in [29] to develop
efficient numerical algorithms for this case.
With the help of PMP, it is also easy to write down the gradient descent flow for the
optimization problem. Formally, one can simply write down the gradient descent flow for
(32) for each τ :
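\[
\partial_t \rho_\tau(u, t) = \nabla \cdot \left(\rho_\tau(u, t) \nabla V(u; \rho)\right), \quad \forall \tau \in [0, 1], \tag{36}
\]
where
\[
V(u; \rho) = \mathbb{E}_x\left[\frac{\delta H}{\delta \rho}\left(z_\tau^{t,x}, p_\tau^{t,x}, \rho_\tau(\cdot; t)\right)\right],
\]
and $\{(z_\tau^{t,x}, p_\tau^{t,x})\}$ are defined as before by the forward/backward equations.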
To discretize the gradient flow, we can simply use:
• forward Euler for the flow in τ variable, with step size 1/L (L is the number of grid
points in the τ variable);
• particle method for the gradient descent dynamics, with M samples in each τ -grid
point.
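This gives us
\begin{align}
z_{l+1}^{t,x} &= z_l^{t,x} + \frac{1}{LM} \sum_{j=1}^M \phi(z_l^{t,x}, u_l^j(t)), \quad l = 0, \dots, L-1 \tag{37} \\
p_l^{t,x} &= p_{l+1}^{t,x} + \frac{1}{LM} \sum_{j=1}^M \nabla_z \phi(z_{l+1}^{t,x}, u_{l+1}^j(t))\, p_{l+1}^{t,x}, \quad l = 0, \dots, L-1 \tag{38} \\
\frac{du_l^j(t)}{dt} &= -\mathbb{E}_x\left[\nabla_u^T \phi(z_l^{t,x}, u_l^j(t))\, p_l^{t,x}\right]. \tag{39}
\end{align}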
This recovers the GD algorithm (with back-propagation) for the (scaled) ResNet:
\[
z_{l+1} = z_l + \frac{1}{LM} \sum_{j=1}^M \phi(z_l, u_l^j).
\]
We call this “scaled” ResNet because of the presence of the factor 1/(LM ).
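The forward pass of the scaled ResNet and its back-propagation through a co-state variable can be sketched in NumPy as follows. The sketch assumes φ(z, u) = a tanh(wᵀz) with u = (a, w) (tanh is chosen only so that a finite-difference check is smooth), an empirical average over n samples in place of the expectation over x, and the terminal condition (35). It uses the exact discrete adjoint of the forward recursion, so the Jacobian of φ is evaluated at layer l rather than at layer l+1. With this convention p is minus the derivative of the per-sample loss with respect to z, so the risk gradient is minus the averaged co-state term. All function names are illustrative.

```python
import numpy as np


def forward(X, A, W):
    """Scaled ResNet: z_{l+1} = z_l + (1/(L*M)) * sum_j a_l^j * tanh(w_l^j . z_l)."""
    L, M, _ = A.shape
    Z = [X]
    for l in range(L):
        act = np.tanh(Z[-1] @ W[l].T)               # (n, M)
        Z.append(Z[-1] + act @ A[l] / (L * M))      # (n, d)
    return Z


def risk(X, y, A, W):
    f = forward(X, A, W)[-1].sum(axis=1)            # f(x) = 1^T z_L
    return np.mean((f - y) ** 2)


def risk_gradients(X, y, A, W):
    """Gradients of the empirical risk obtained from the backward co-state recursion."""
    L, M, _ = A.shape
    n = X.shape[0]
    Z = forward(X, A, W)
    f = Z[-1].sum(axis=1)
    P = -2.0 * (f - y)[:, None] * np.ones_like(X)   # terminal co-state p_L, as in (35)
    gA, gW = np.zeros_like(A), np.zeros_like(W)
    for l in reversed(range(L)):
        act = np.tanh(Z[l] @ W[l].T)                # (n, M)
        s = (P @ A[l].T) * (1.0 - act ** 2)         # (a_j . p_{l+1}) * tanh'(w_j . z_l)
        gA[l] = -(act.T @ P) / (n * L * M)
        gW[l] = -(s.T @ Z[l]) / (n * L * M)
        P = P + s @ W[l] / (L * M)                  # p_l computed from p_{l+1}
    return gA, gW


rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)
A, W = rng.normal(size=(4, 5, 3)), rng.normal(size=(4, 5, 3))
gA, gW = risk_gradients(X, y, A, W)
A, W = A - 0.1 * gA, W - 0.1 * gW                   # one gradient descent step
```

Here L = 4 layers and M = 5 particles per layer; the arrays A and W hold the particles u_l^j = (a_l^j, w_l^j).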
In a similar spirit, one can also obtain an algorithm using PMP [28]. Adopting the
terminology of control theory, algorithms of this kind are called the “method of successive
approximation” (MSA). The basic MSA is as follows:
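• Initialize: $\theta^0 \in U$
• For $k = 0, 1, 2, \cdots$:
– Solve
\[
\frac{dz_\tau^k}{d\tau} = \nabla_p H(z_\tau^k, p_\tau^k, \theta_\tau^k), \quad z_0^k = x
\]
– Solve
\[
\frac{dp_\tau^k}{d\tau} = -\nabla_z H(z_\tau^k, p_\tau^k, \theta_\tau^k), \quad p_1^k = -2\left(f(x; \theta^k) - f^*(x)\right) \mathbf{1}
\]
– Set
\[
\theta_\tau^{k+1} = \operatorname{argmax}_{\theta \in \Theta} H(z_\tau^k, p_\tau^k, \theta)
\]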
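The three steps of the basic MSA can be sketched on a toy problem as follows. The sketch assumes a scalar control θ_τ in Θ = [−1, 1], the dynamics φ(z, θ) = θ tanh(z) applied componentwise, a forward Euler discretization with L steps, and an empirical average over the data in the Hamiltonian. Since H is then linear in θ, the maximizer is simply the sign of a coefficient. These choices and all names are assumptions made so that the argmax has a closed form; they are not the setting of [28].

```python
import numpy as np


def msa(X, y, L=10, iters=5):
    """Basic MSA for dz/dtau = theta_tau * tanh(z) with theta_tau in [-1, 1]."""
    theta = np.zeros(L)                        # initial control theta^0
    h = 1.0 / L                                # step size in the tau variable
    for _ in range(iters):
        # Forward sweep: state equation dz/dtau = grad_p H, z_0 = x.
        Z = [X]
        for l in range(L):
            Z.append(Z[-1] + h * theta[l] * np.tanh(Z[-1]))
        # Backward sweep: co-state equation dp/dtau = -grad_z H with terminal condition.
        f = Z[-1].sum(axis=1)
        P = [None] * (L + 1)
        P[L] = -2.0 * (f - y)[:, None] * np.ones_like(X)
        for l in reversed(range(L)):
            P[l] = P[l + 1] * (1.0 + h * theta[l] * (1.0 - np.tanh(Z[l]) ** 2))
        # Hamiltonian maximization: H = theta * (p . tanh(z)) is linear in theta,
        # so the maximizer over [-1, 1] is the sign of the averaged coefficient.
        for l in range(L):
            coeff = np.mean(np.sum(P[l + 1] * np.tanh(Z[l]), axis=1))
            theta[l] = np.sign(coeff)
    return theta


rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
print(msa(X, y))
```

The control jumps between the extreme values of Θ from one sweep to the next, with no mechanism that keeps the new control close to the one used to compute the state and co-state.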
In practice, this basic version does not perform as well as the “extended MSA” which
works in the same way as the MSA except that the Hamiltonian is replaced by an extended
Hamiltonian [28]:
\[
\tilde{H}(z, p, \theta, v, q) := H(z, p, \theta) - \frac{1}{2} \lambda \|v - f(z, \theta)\|^2 - \frac{1}{2} \lambda \|q + \nabla_z H(z, p, \theta)\|^2.
\]
Figure 12: Comparison of the extended MSA with different versions of stochastic gradient
descent algorithms. The top figures show results for small initialization. The bottom
figures show results for bigger initialization. Reproduced with permission from Qianxiao
Li. See also [28].
Figure 12 shows the results of the extended MSA compared with different versions of
SGD for two kinds of initialization. One can see that in terms of the number of iterations,
extended MSA outperforms all the SGD variants. In terms of wall clock time, the advantage of
the extended MSA is diminished significantly. This is possibly due to inefficiencies in
the implementation of the optimization algorithm (here BFGS) used for solving (32).
We refer to [28] for more details. In any case, it is clear that there is a lot of room for
improvement.
6 Concluding remarks
In conclusion, we have discussed a wide range of problems for which machine learning-
based algorithms have made and/or will make a significant difference. These problems are
relatively new to computational mathematics. We believe strongly that machine learning-
based algorithms will also significantly impact the way we solve more traditional problems
in computational mathematics. However, research in this direction is still at a very early
stage.
Another important area in which machine learning can be of great help is multi-scale
modeling. The moment-closure problem discussed above is an example in this direction.
There are many more possible applications, see [8]. Machine learning seems to be able to
provide the missing link in making advanced multi-scale modeling techniques really prac-
tical. For example in the heterogeneous multi-scale method (HMM) [1, 9], one important
component is to extract the relevant macro-scale information from micro-scale simulation
data. This step has always been a major obstacle in HMM. It seems quite clear that
machine learning techniques can be of great help here.
We also discussed how the viewpoint of numerical analysis can help to improve the
mathematical foundation of machine learning as well as propose new and possibly more
robust formulations. In particular, we have given a taste of what high dimensional
approximation theory should look like. We also demonstrated that commonly used machine
learning models and training algorithms can be recovered from some particular discretiza-
tion of continuous models, in a scaled form. From this discussion, one can see that neural
network models are quite natural and rather inevitable.
What have we really learned from machine learning? Well, it seems that the most
important new insight from machine learning is the representation of functions as expec-
tations. We reproduce them here for convenience:
• integral-transform based:
• flow-based:
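\begin{align}
\frac{dz}{d\tau} &= \mathbb{E}_{(a,w) \sim \rho_\tau}\, a\, \sigma(w^T z), \quad z(0, x) = x \tag{40} \\
f(x, \theta) &= \mathbf{1}^T z(1, x) \tag{41}
\end{align}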
From the viewpoint of computational mathematics, this suggests that the central
issue will move from specific discretization schemes to more effective representations of
functions.
This review is rather sketchy. Interested readers can consult the three review articles
[17, 18, 19] for more details.
Acknowledgement: I am very grateful to my collaborators for their contribution to
the work described here. In particular, I would like to express my sincere gratitude to
Roberto Car, Jiequn Han, Arnulf Jentzen, Qianxiao Li, Chao Ma, Han Wang, Stephan
Wojtowytsch, and Lei Wu for the many discussions that we have had on the issues
discussed here. This work is supported in part by a gift to Princeton University from
iFlytek as well as the ONR grant N00014-13-1-0338.
References
[1] Assyr Abdulle, Weinan E, Bjorn Engquist and Eric Vanden-Eijnden, The heterogeneous
multiscale methods, Acta Numerica, vol. 21, pp. 1-87, 2012.
[2] Francis Bach, “Breaking the curse of dimensionality with convex neural networks”,
Journal of Machine Learning Research, 18(19):1-53, 2017.
[3] Andrew R. Barron, “Universal approximation bounds for superpositions of a sig-
moidal function”, IEEE Transactions on Information theory, 39(3):930-945, 1993.
[4] Jörg Behler and Michele Parrinello, “Generalized neural-network representation of
high-dimensional potential-energy surfaces”, Physical review letters, 98(14):146401,
2007.
[5] Achi Brandt, “Multiscale scientific computation: review 2001”. In Barth, T.J., Chan,
T.F. and Haimes, R. (eds.): Multiscale and Multiresolution Methods: Theory and
Applications, Springer Verlag, Heidelberg, 2001, pp. 196.
[6] Roberto Car and Michele Parrinello, “Unified approach for molecular dynamics and
density-functional theory”, Physical Review Letters, 55(22):2471, 1985.
[7] Lenaic Chizat and Francis Bach, “On the global convergence of gradient descent for
over-parameterized models using optimal transport”, In Advances in neural informa-
tion processing systems, pages 3036-3046, 2018.
[8] Weinan E, Principles of Multiscale Modeling, Cambridge University Press, 2011.
[9] Weinan E and Bjorn Engquist, The heterogeneous multiscale methods, Comm. Math.
Sci., vol. 1, no. 1, pp. 87-132, 2003.
[10] Weinan E, Jiequn Han and Arnulf Jentzen, “Deep learning-based numerical methods
for high-dimensional parabolic partial differential equations and backward stochastic
differential equations”, Communications in Mathematics and Statistics 5, 4 (2017),
349-380.
[11] Weinan E, Jiequn Han, and Linfeng Zhang, “Integrating Machine Learning with
Physics-Based Modeling”, https://arxiv.org/pdf/2006.02619.pdf, 2020.
[12] Weinan E, Chao Ma and Lei Wu, “A priori estimates of the population risk for two-
layer neural networks”, Communications in Mathematical Sciences, 17(5):1407-1425,
2019; arXiv:1810.06397, 2018.
[13] Weinan E, Chao Ma and Lei Wu, “Barron spaces and the flow-induced function
spaces for neural network models”, arXiv:1906.08039, 2019.
[14] Weinan E and Stephan Wojtowytsch, “Representation formulas and pointwise prop-
erties for Barron functions”, arXiv:2006.05982, 2020.
[15] Weinan E and Stephan Wojtowytsch, “On the Banach spaces associated with multi-
layer ReLU networks: Function representation, approximation theory and gradient
descent dynamics”, https://arxiv.org/pdf/2007.15623.pdf, 2020.
[16] Weinan E, Chao Ma, and Lei Wu, “Machine learning from a continuous viewpoint”,
arXiv:1912.12777, 2019.
[17] Weinan E, Jiequn Han and Arnulf Jentzen, “Algorithms for solving high
dimensional PDEs: From nonlinear Monte Carlo to machine learning”,
https://arxiv.org/pdf/2008.13333.pdf, 2020.
[18] Weinan E, Jiequn Han and Linfeng Zhang, “Integrating machine learning with
physics-based modeling”, https://arxiv.org/pdf/2006.02619.pdf, 2020.
[19] Weinan E, Chao Ma, Stephan Wojtowytsch and Lei Wu, “Towards a mathe-
matical understanding of machine learning: What is known and what is not”,
https://arxiv.org/pdf/2009.10713.pdf.
[20] Jiequn Han and Weinan E, “Deep learning approximation for stochastic
control problems”, Deep Reinforcement Learning Workshop, NIPS (2016),
https://arxiv.org/pdf/1611.07422.pdf.
[21] Jiequn Han, Arnulf Jentzen and Weinan E, “Solving high-dimensional partial dif-
ferential equations using deep learning”, Proceedings of the National Academy of
Sciences, 115, 34 (2018), 8505-8510.
[22] Jiequn Han, Chao Ma, Zheng Ma, Weinan E, “Uniformly Accurate Machine Learning
Based Hydrodynamic Models for Kinetic Equations”, Proceedings of the National
Academy of Sciences, 116 (44) 21983-21991; DOI: 10.1073/pnas.1909854116, 2019.
[23] Jiequn Han, Linfeng Zhang, Roberto Car, and Weinan E, “Deep potential: a gen-
eral representation of a many-body potential energy surface”, Communications in
Computational Physics, 23(3):629-639, 2018.
[24] Pierre C Hohenberg and Bertrand I Halperin, “Theory of dynamic critical phenom-
ena”, Reviews of Modern Physics, 49(3):435, 1977.
[25] Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Jiduan Liu, Lin Lin, Roberto Car,
Weinan E, and Linfeng Zhang, “Pushing the limit of molecular dynamics with ab
initio accuracy to 100 million atoms with machine learning”, arXiv:2005.00223, 2020.
[26] Jason M Klusowski and Andrew R Barron, “Risk bounds for high-dimensional ridge
function combinations including neural networks”, arXiv:1607.01434, 2016.
[27] Yann LeCun, “A theoretical framework for back propagation”, In: Touretzky, D.,
Hinton, G., Sejnowski, T. (eds.) Proceedings of the 1988 connectionist models summer
school, Carnegie-Mellon University, Morgan Kaufmann, 1989.
[28] Qianxiao Li, Long Chen, Cheng Tai and Weinan E, “Maximum Principle Based Al-
gorithms for Deep Learning”, Journal of Machine Learning Research, vol.18, no.165,
pp.1-29, 2018, https://arxiv.org/pdf/1710.09513.pdf.
[29] Qianxiao Li and Shuji Hao, “An Optimal Control Approach to Deep Learning and
Applications to Discrete-Weight Neural Networks”, Proceedings of the 35th Interna-
tional Conference on Machine Learning, 2018.
[30] Denghui Lu, Han Wang, Mohan Chen, Jiduan Liu, Lin Lin, Roberto Car, Weinan E,
Weile Jia, and Linfeng Zhang, “86 pflops deep potential molecular dynamics simula-
tion of 100 million atoms with ab initio accuracy”, arXiv:2004.11658, 2020.
[31] Chao Ma, Lei Wu, and Weinan E, “The quenching-activation behavior of the gradient
descent dynamics for two-layer neural network models”, arXiv:2006.14450, 2020.
[32] Song Mei, Andrea Montanari, and Phan-Minh Nguyen, “A mean field view of the
landscape of two-layer neural networks”, Proceedings of the National Academy of
Sciences, 115(33):E7665-E7671, 2018.
[33] Etienne Pardoux and Shige Peng, “Backward stochastic differential equations and
quasilinear parabolic partial differential equations”, in Stochastic partial differential
equations and their applications (Charlotte, NC, 1991), vol. 176 of Lecture Notes in
Control and Inform. Sci., Springer, Berlin, 1992, pp. 200-217.
[34] Grant Rotskoff and Eric Vanden-Eijnden, “Parameters as interacting particles: long
time convergence and asymptotic error scaling of neural networks”, In Advances in
neural information processing systems, pages 7146-7155, 2018.
[35] Justin Sirignano and Konstantinos Spiliopoulos, “Mean field analysis of neural net-
works: A central limit theorem”, arXiv:1808.09372, 2018.
[38] Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, and Weinan E, “Deep potential
molecular dynamics: A scalable model with the accuracy of quantum mechanics”,
Physical Review Letters, 120:143001, Apr 2018.
[39] Linfeng Zhang, Jiequn Han, Han Wang, Wissam A Saidi, Roberto Car, and Weinan
E, “End-to-end symmetry preserving inter-atomic potential energy model for finite
and extended systems”, In Advances of the Neural Information Processing Systems
(NIPS), 2018.
[40] Linfeng Zhang, De-Ye Lin, Han Wang, Roberto Car, and Weinan E, “Active learn-
ing of uniformly accurate interatomic potentials for materials simulation”, Physical
Review Materials, 3(2):023804, 2019.
[41] Linfeng Zhang, Han Wang and Weinan E, “Reinforced dynamics for the enhanced
sampling in large atomic and molecular systems. I. Basic Methodology”, J. Chem.
Phys., vol. 148, pp.124113, 2018.