a University of Texas at Dallas, Richardson, TX, United States; b School of Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong; c Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong; d Department of Mathematics, Southern Methodist University, Dallas, TX, United States; e Faculty of Information Technology, Nha Trang University, Nha Trang, Viet Nam; f Department of Mathematics, City University of Hong Kong, Kowloon Tong, Hong Kong
∗ Corresponding author: e-mail address: [email protected]
Contents
1 Introduction
2 Reinforcement learning
  2.1 General concepts
  2.2 Mathematical model without action
  2.3 Approximation
  2.4 Mathematical model with action
3 Control theory and deep learning
  3.1 Supervised learning
  3.2 Deep learning
  3.3 Control theory approach
4 Stochastic gradient descent and control theory
  4.1 Comments
  4.2 Stochastic gradient and MDP
  4.3 Continuous version
5 Machine learning approach of stochastic control problems
  5.1 General theory
  5.2 Parametric approach for the feedback
  5.3 Non-parametric approach for the value function and its gradient
6 Focus on the deterministic case
  6.1 Problem and algorithm
  6.2 Splitting up method
7 Convergence results
  7.1 Setting of the problem
  7.2 Preliminaries
  7.3 Main result
  7.4 Algorithm
  7.5 Linear quadratic case
8 Numerical results
Acknowledgment
References
Abstract
We survey in this chapter the connections between Machine Learning and Control Theory. Control Theory provides useful concepts and tools for Machine Learning. Conversely, Machine Learning can be used to solve large control problems. In the first part of the paper, we develop the connections between reinforcement learning and Markov Decision Processes, which are discrete time control problems. In the second part, we review the concept of supervised learning and its relation with static optimization. Deep learning, which extends supervised learning, can be viewed as a control problem. In the third part, we present the links between stochastic gradient descent and mean field theory. Conversely, in the fourth and fifth parts, we review machine learning approaches to stochastic control problems, and focus on the deterministic case to explain the numerical algorithms more easily.
Keywords
Control theory, Machine learning, Deep learning
MSC Codes
49J15, 49L20, 93E20, 93E99, 68T05, 68T99, 90C39, 90C15
1 Introduction
The Big Data phenomenon is at the origin of a new expansion of Artificial Intelligence. Machine learning (Jordan and Mitchell, 2015) is a way to implement AI, by providing the machine with the capability of learning and decision making, which characterize humans. The fact that humans use algorithms to help perform these two tasks is not new in itself: as soon as computing possibilities appeared, algorithms were developed, and the ambition of AI also came early. During the last decades, however, the momentum has been spectacular, and Machine Learning has become the new Grail. Its introduction has revolutionized all kinds of fields in science, engineering, medicine, and management. Image processing, pattern recognition, text mining, speech recognition, and automatic translation have benefited considerably from this development. An important breakthrough occurred with the methodology of deep neural networks (DNN).
Conceptually, since the objective is to improve the knowledge of the environment and to improve decision making, we are naturally dealing with optimization and statistics. This is clearly apparent in supervised learning. On the other hand, reinforcement learning and DNN add an additional variable, which is time or is ordered like time. Control theory comes in as the framework of dynamic optimization.
Control theory, see for instance Bensoussan (2018), is about how to design optimal actions for dynamical models, in continuous or discrete time. However, it is well acknowledged that numerical computation is the main barrier to putting control theory to work in practice, and many applications are unfortunately limited to the linear quadratic regulator. The curse of dimensionality, as Bellman, the creator of Dynamic Programming, coined it, has been haunting the numerical methods of control theory for quite a long time. It is therefore natural that the new possibilities of ML be considered to overcome the challenge of dimension. This explains why, in the past few years, we have been
witnessing many exciting ideas and innovative results from the perspective of merging the above two research areas, with efforts from different communities such as applied and computational mathematics, optimal control, stochastic optimization, as well as computer science. Researchers from both sides, machine learning and optimal control, have started to explore each other's techniques, tools, and problem formulations. We can roughly divide these works into two categories: control theory for machine learning and machine learning for control theory. Generally speaking, the former refers to the use of control theory as a mathematical tool to formulate and solve theoretical and practical problems in machine learning, such as optimal parameter tuning and the training of neural networks; the latter refers to the use of machine learning practice, such as kernel methods and DNN, to numerically solve complex models in control theory, which can become intractable by traditional methods (Han et al., 2018).
There is ample evidence to support our argument of close connections between machine learning and control theory. We begin with reinforcement learning (RL), which became famous when AlphaGo Zero (Silver et al., 2017) was invented. Reinforcement learning (Sutton and Barto, 2018) is a subfield of machine learning that studies how to use past data to enhance the future manipulation of a dynamical system. The control community targets the same problems as RL. However, the RL and control communities are practically disjoint due to their distinctive languages and cultures; see Recht (2019) for a recent effort to bridge this gap.
In RL, one of the simplest strategies is to first estimate a model from the given data, which is called system identification in the control community. This can be achieved by supervised learning (Chiuso and Pillonetto, 2019). In a second stage, the dynamic programming principle of control theory can then be applied to derive many popular RL algorithms such as Q-learning and Temporal Difference algorithms (Sutton and Barto, 2018).
As said above, Dynamic Programming is hard to implement numerically for high-dimensional dynamical systems. Machine learning and DNN can be helpful. For example, Han et al. (2018) proposed an efficient machine learning algorithm using DNN to approximate the value function in the high-dimensional Hamilton-Jacobi-Bellman equation, based on the equivalent stochastic control formulation of the PDE.
The bond that ties machine learning and control theory has been strengthened further in recent years by the continuous-time perspective, in various contexts (E, 2017; E et al., 2019b; Recht, 2019). For example, the deep residual neural network (ResNet) (He et al., 2016) can be recast as a dynamical system in which the network layers are viewed as a time discretization (Chang et al., 2018a,b; Chen et al., 2018; Haber and Ruthotto, 2017; Li and Hao, 2018; Li and Shi, 2017; Sonoda and Murata, 2017). Based on this point of view, the machine learning algorithm for ResNet can be viewed as a static and dynamic optimization problem for an ordinary differential equation controlled by the network parameters (E et al., 2019a). This continuous model immediately triggered several new
In the next section, we study a Stochastic Control Problem, in which the state is that of a controlled diffusion. We then propose a Machine Learning approach for this problem. We finally focus on the deterministic case in Section 6 to simplify the theory. We provide some related theoretical results, together with a few high-dimensional numerical illustrations, to demonstrate the effectiveness of the algorithms.
2 Reinforcement learning
2.1 General concepts
In the language of Control Theory, we consider a dynamical system which evolves in an uncertain environment. The evolution of this system is called a process, which can be characterized by its state. A controller decides on a strategy of actions, called a feedback, and at each time there is a cost or profit attached to the current state and the current action. In the language of RL, every time an action is made, the controller receives a reward. The controller will try to choose the actions such that the sum of rewards is maximized. Since time is discrete, the control problem is called a Markov Decision Process (MDP) and can be solved by the Dynamic Programming approach. The reward is then a function of the state and the action.
u = f + αΦu, (2.4)
where Φ denotes the transition operator of the Markov chain, (Φv)(x) = E[v(X_{n+1}) | X_n = x].
2.3 Approximation
Our main task now is to compute the function u(x). Eqs. (2.4) and (2.6) are explicit and straightforward. However, the challenge is that when the dimension d of X is large, these formulations are not of practical use. Here comes the other aspect of machine learning: how to approximate a function given by formulae (2.3) or (2.4). Since Supervised Learning does exactly that, approximating a function, we follow the ideas of SL. There are basically two methods: the parametric method and the non-parametric method. In the parametric method, we look for an approximation of the form
u(x) ≈ Σ_{i=1}^{I} θ_i φ_i(x), (2.7)
where the φ_i(x) are given functions, chosen so that the family {φ_i(·)}_{i=1}^{∞} forms a basis of the functional space to which u(x) belongs, and the θ_i are coefficients to be determined.
We need to compute the parameters minimizing the error
‖ Σ_{i=1}^{I} θ_i (φ_i − αΦφ_i) − f ‖², (2.8)
where ‖·‖ is the sup-norm, ‖G‖ = sup_x |G(x)|. To guarantee the existence and uniqueness of the parameters, we minimize instead the quadratic functional with a quadratic regularization,
γ Σ_{i=1}^{I} θ_i² + ‖ Σ_{i=1}^{I} θ_i (φ_i − αΦφ_i) − f ‖². (2.9)
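As a concrete illustration of the parametric method, the sketch below fits the coefficients θ_i of (2.7) by minimizing a discretized version of (2.9) for a finite-state chain, where Φ reduces to a transition matrix. The nearest-neighbour chain, the running cost, the monomial basis, and the replacement of the sup-norm by a sum of squares over the states are all illustrative assumptions, not the paper's setup.

import numpy as np

# Minimal sketch of the parametric approximation (2.7)-(2.9) for a finite Markov chain,
# where u = f + alpha * P u and P (a hypothetical transition matrix) plays the role of Phi.
# The basis functions phi_i(x) = x^i and all numerical values are illustrative, and the
# sup-norm in (2.8)-(2.9) is replaced by a sum of squares over the states.

d, I, alpha, gamma = 50, 8, 0.9, 1e-3            # number of states, basis size, discount, ridge weight
x = np.linspace(0.0, 1.0, d)                     # states embedded in [0, 1]
P = np.zeros((d, d))                             # nearest-neighbour random walk (illustrative)
for s in range(d):
    P[s, max(s - 1, 0)] += 0.5
    P[s, min(s + 1, d - 1)] += 0.5
f = 1.0 + np.sin(3.0 * x)                        # running cost f(x)
Phi_basis = np.column_stack([x**i for i in range(I)])   # phi_i evaluated at every state

# Columns of A are phi_i - alpha * P phi_i, so the residual in (2.8) is A @ theta - f
# and (2.9) becomes a ridge-regularized least-squares problem for theta.
A = Phi_basis - alpha * (P @ Phi_basis)
theta = np.linalg.solve(A.T @ A + gamma * np.eye(I), A.T @ f)

u_approx = Phi_basis @ theta                              # approximation (2.7)
u_exact = np.linalg.solve(np.eye(d) - alpha * P, f)       # exact solution of u = f + alpha P u
print("max approximation error:", np.abs(u_approx - u_exact).max())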
For the non-parametric approach, one can rely on Monte Carlo simulation: we simulate N trajectories (X_n^ν)_{n≥1}, ν = 1, ..., N, of the Markov chain starting from X_1 = x, and use
u(x) ≈ (1/N) Σ_{ν=1}^{N} Σ_{n=1}^{+∞} α^{n−1} f(X_n^ν). (2.10)
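A minimal sketch of the Monte Carlo estimate (2.10) is given below, again for an illustrative finite-state chain; the infinite sum is truncated once α^n falls below a tolerance, and the result is compared with the exact solution of (2.4).

import numpy as np

# Monte Carlo sketch of (2.10): average the discounted costs of N simulated trajectories
# started at a given state.  The chain and the cost f are synthetic, and the infinite
# series is truncated at a horizon where alpha**n is negligible.

rng = np.random.default_rng(1)
d, alpha, N = 50, 0.9, 500
P = rng.random((d, d)); P /= P.sum(axis=1, keepdims=True)
f = rng.random(d)
horizon = int(np.log(1e-6) / np.log(alpha))               # truncation of the series

def u_mc(x0):
    """Monte Carlo estimate of u(x0) = E[ sum_n alpha^(n-1) f(X_n) | X_1 = x0 ]."""
    total = 0.0
    for _ in range(N):
        x, disc = x0, 1.0
        for _ in range(horizon):
            total += disc * f[x]                          # accumulate alpha^(n-1) f(X_n)
            disc *= alpha
            x = rng.choice(d, p=P[x])                     # draw X_{n+1} ~ P(X_n, .)
    return total / N

u_exact = np.linalg.solve(np.eye(d) - alpha * P, f)       # exact value from (2.4)
print("Monte Carlo:", u_mc(0), "  exact:", u_exact[0])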
Remark 1. In RL, one claims that a significant difference with MDP is that the Markov chain may not be known. The controller, however, makes trials, which is similar to Monte Carlo, without referring to a selection of trajectories according to a given probability transition.
It is also interesting to introduce the cost (the Q-function) when the first action is arbitrary and the following actions are optimized. Writing Φ^a for the transition operator of the chain when action a is used, we arrive at
Q(x, a) = f(x, a) + α[Φ^a (inf_{a'} Q(·, a'))](x). (2.18)
There are two basic types of iteration to solve the above Bellman equation: value iteration and policy iteration. The value iteration is defined by
u^{k+1}(x) = inf_a [ f(x, a) + αΦ^a u^k(x) ], (2.19)
with u^0(x) = 0. When f(x, a) is bounded, the solution of the Bellman equation is unique and the sequence u^k(x) converges to the value function monotonically. When f(x, a) is not bounded, the solution of the Bellman equation is not unique; the sequence u^k(x) still converges monotonically to the value function, which is the minimum solution. We can also interpret u^k(x) as the value function of the control problem with k periods. To see this, we define
J^k_{a(·)}(x) = E[ Σ_{n=1}^{k} α^{n−1} f(X_n, a(X_n)) | X_1 = x ], (2.20)
then
u^k(x) = inf_{a(·)} J^k_{a(·)}(x). (2.21)
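The sketch below runs value iteration on a small synthetic MDP with finitely many states and actions; the transition matrices and costs are illustrative, and the greedy feedback is read off from the resulting Q-values.

import numpy as np

# Value iteration, u^0 = 0 and u^{k+1}(x) = min_a [ f(x, a) + alpha * (P_a u^k)(x) ],
# on a small synthetic MDP.  P[a, x, y] is the transition probability under action a.

rng = np.random.default_rng(2)
d, n_actions, alpha = 20, 4, 0.9
P = rng.random((n_actions, d, d)); P /= P.sum(axis=2, keepdims=True)
f = rng.random((d, n_actions))                           # cost f(x, a)

u = np.zeros(d)                                          # u^0 = 0
for k in range(500):
    Q = f + alpha * np.einsum('axy,y->xa', P, u)         # Q[x, a] = f(x, a) + alpha E[u(X') | x, a]
    u_next = Q.min(axis=1)
    if np.abs(u_next - u).max() < 1e-10:                 # monotone convergence to the value function
        break
    u = u_next

policy = Q.argmin(axis=1)                                # greedy feedback a(x) extracted from Q
print("iterations:", k, " value at state 0:", u[0])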
On the other hand, the policy iteration technique starts from a given feedback control a^k(x) and solves the linear (fixed point) problem, similar to (2.4),
u^{k+1}(x) = f(x, a^k(x)) + αΦ^{a^k(x)} u^{k+1}(x). (2.22)
in which the coefficient ρ^k is chosen such that it solves the following scalar optimization problem
By definition,
u^{k+1}(x) = Q^{k+1}(x, a^k(x)).
We can then, instead of solving the linear problem for u^{k+1}(x), use the approximation Q^k(x, a^k(x)). We proceed as follows: knowing a^k(x) and Q^k(x, a), we define
ū^{k+1}(x) := Q^k(x, a^k(x)),
Q^{k+1}(x, a) = f(x, a) + αΦ^a ū^{k+1}(x), (2.25)
and a^{k+1}(x) can then be obtained approximately through (2.23) and (2.24). In the above procedure, we start with Q^0(x, a) = f(x, a), and a^0(x) is chosen to be the minimizer of f(x, a).
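For a finite-action toy MDP, the procedure (2.25) can be sketched as below; since the action set is finite, the policy improvement is done by exact minimization of Q^{k+1}(x, ·) rather than by the gradient step (2.23)-(2.24), and all data are synthetic.

import numpy as np

# Approximate policy iteration via the Q-function, following (2.25):
#   u_bar^{k+1}(x) = Q^k(x, a^k(x)),
#   Q^{k+1}(x, a)  = f(x, a) + alpha * (P_a u_bar^{k+1})(x),
# with Q^0 = f and a^0 the minimizer of f(x, .).  The improved feedback is taken here as
# the exact minimizer of Q^{k+1}(x, .), which is possible because the action set is finite.

rng = np.random.default_rng(3)
d, n_actions, alpha = 20, 4, 0.9
P = rng.random((n_actions, d, d)); P /= P.sum(axis=2, keepdims=True)
f = rng.random((d, n_actions))

Q = f.copy()                                              # Q^0(x, a) = f(x, a)
policy = f.argmin(axis=1)                                 # a^0(x)
for k in range(100):
    u_bar = Q[np.arange(d), policy]                       # u_bar^{k+1}(x) = Q^k(x, a^k(x))
    Q = f + alpha * np.einsum('axy,y->xa', P, u_bar)      # Q^{k+1}(x, a)
    new_policy = Q.argmin(axis=1)                         # improved feedback a^{k+1}(x)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("policy stabilized after", k, "updates")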
This is the well-known parametric method. In the simplest case of neural networks, the function f(x; θ) is defined as follows. We first introduce X̄ = χ(x), where χ : R^d → R^n; of course, χ can be the identity. Now choose σ to be a scalar function, called the activation function, W to be a matrix in L(R^n; R^n), and b a vector in R^n. The pair (W, b) represents the parameter θ. We then define the vector X by
X_i = σ( Σ_{j=1}^{n} W_ij X̄_j + b_i ), (3.2)
and
f(x; θ) = g(X), (3.3)
where g : R^n → R is the output function. The minimization in (3.1) is performed by a gradient method with iterative application of the chain rule.
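The sketch below implements this single-layer model and the gradient method on a squared loss; the choice χ = identity, σ = tanh, a fixed linear output map g, and the synthetic data are all illustrative assumptions.

import numpy as np

# One-layer parametric model (3.2)-(3.3): X = sigma(W X_bar + b), f(x; theta) = g(X),
# trained by plain gradient descent on 0.5 * mean squared error (the minimization in (3.1)).
# Here chi is the identity on the input, sigma = tanh, and g is a fixed linear map.

rng = np.random.default_rng(4)
d, n, M, lr = 3, 16, 200, 0.1
x = rng.normal(size=(M, d))
y = np.sin(x.sum(axis=1))                        # illustrative target values y^m

W = rng.normal(scale=0.5, size=(n, d))           # parameter theta = (W, b)
b = np.zeros(n)
g = rng.normal(scale=0.5, size=n)                # fixed output map g

for step in range(3000):
    X = np.tanh(x @ W.T + b)                     # X_i = sigma(sum_j W_ij x_j + b_i)
    err = X @ g - y                              # f(x^m; theta) - y^m
    delta = np.outer(err, g) * (1.0 - X**2)      # chain rule: d(loss)/d(pre-activation)
    W -= lr * delta.T @ x / M                    # gradient step on W
    b -= lr * delta.mean(axis=0)                 # gradient step on b

print("final loss:", 0.5 * np.mean(err**2))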
In the non-parametric method, in particular the kernel method, one chooses a functional space H to which the approximation of F(x) belongs. One then obtains that approximation f(x) by solving
min_{f ∈ H} γ‖f‖²_H + Σ_{m=1}^{M} (f(x^m) − y^m)². (3.4)
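When H is a reproducing kernel Hilbert space, (3.4) reduces to a finite linear system by the representer theorem; the sketch below uses a Gaussian kernel, with the data, kernel width and regularization weight chosen purely for illustration.

import numpy as np

# Kernel (non-parametric) method for (3.4): the minimizer over a reproducing kernel
# Hilbert space has the form f(x) = sum_m c_m K(x, x^m), and the coefficients solve
# (K + gamma I) c = y.  Gaussian kernel, synthetic data.

rng = np.random.default_rng(5)
M, d, gamma, width = 100, 2, 1e-2, 0.5
X = rng.normal(size=(M, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=M)    # noisy samples y^m of an unknown F

def kernel(A, B):
    """Gaussian kernel matrix K(a, b) = exp(-|a - b|^2 / (2 width^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * width**2))

K = kernel(X, X)
c = np.linalg.solve(K + gamma * np.eye(M), y)     # regularized system coming from (3.4)

def f_hat(x_new):
    """Evaluate the kernel approximation at new points (rows of x_new)."""
    return kernel(np.atleast_2d(x_new), X) @ c

print("training residual:", np.abs(K @ c - y).max())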
X^{k+1} = f_k(X^k; θ^k),  k = 0, ..., K − 1, (3.5)
where
f_k(X^k; θ^k) := σ(W^{(k+1)} X^k + b^{(k+1)}),
and
X^0 = χ(x), (3.6)
with
f(x, θ) = g(X^K). (3.7)
The parameter θ is the collection {θ^0, ..., θ^{K−1}}. We have written the case of several layers of neural networks, but it is just an example.
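A minimal forward pass of the layered model (3.5)-(3.7) is sketched below; σ = tanh, a zero-padding χ, and random untrained weights are assumptions made only to show the recursion, not the training procedure discussed next.

import numpy as np

# Forward recursion (3.5)-(3.7): X^{k+1} = sigma(W^{(k+1)} X^k + b^{(k+1)}), X^0 = chi(x),
# f(x, theta) = g(X^K).  The weights are random placeholders; determining them is the
# control problem described in the text.

rng = np.random.default_rng(6)
d, n, K = 4, 16, 5
chi = lambda x: np.concatenate([x, np.zeros(n - d)])      # embed R^d into R^n
theta = [(rng.normal(scale=0.3, size=(n, n)), np.zeros(n)) for _ in range(K)]
g = rng.normal(size=n)                                    # linear output map

def forward(x):
    X = chi(x)                                            # X^0
    for W, b in theta:                                    # layers k = 0, ..., K-1
        X = np.tanh(W @ X + b)                            # X^{k+1} = f_k(X^k; theta^k)
    return g @ X                                          # f(x, theta) = g(X^K)

print(forward(rng.normal(size=d)))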
The approximation of F(x) is then f(x, θ) = g(X_T). The loss in this scenario is (y − g(X_T))² = Φ(X_T), recalling that y will be a known value. The idea is to consider θ_t as a control, and we want to minimize an analog of (3.4), which is expressed as
J(θ) := Σ_{m=1}^{M} Φ(X_T^m) + ∫_0^T L(θ_t) dt, (3.9)
with
H(X_t, p_t, θ, t) = Σ_{m=1}^{M} p_t^m · f(X_t^m, θ, t) + L(θ).
To solve (3.10), one can use the following approximation recursively: let θ_t^k be given, and define (X_t^{m,k}, p_t^{m,k}) by
dX_t^{m,k}/dt = f(X_t^{m,k}, θ_t^k, t),   X_0^{m,k} = x^m,
−dp_t^{m,k}/dt = (D_x f)^*(X_t^{m,k}, θ_t^k, t) p_t^{m,k},   p_T^{m,k} = D_x Φ(X_T^{m,k}). (3.11)
We then look for θ_t^{k+1} that minimizes
Σ_m p_t^{m,k} · f(X_t^{m,k}, θ, t) + L(θ). (3.12)
Note that the above approximation may fail to converge. We refer to E et al.
(2019a,b); Han and E (2016); Sutton and Barto (2018) for recent improvements
and techniques to deal with this issue.
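The sketch below runs this successive-approximation scheme on a toy one-dimensional model, where the dynamics f(X, θ, t) = θ_t X, the running cost L(θ) = ½·reg·θ², the quadratic terminal loss and the Euler time-stepping are all illustrative assumptions; a fairly strong regularization is used because, as noted above, the raw scheme is not guaranteed to converge.

import numpy as np

# Successive approximation (3.11)-(3.12) on a toy model with scalar states:
#   forward:  dX/dt = theta_t X,       X(0) = x^m,
#   backward: -dp/dt = theta_t p,      p(T) = X_T - y^m   (gradient of the terminal loss),
#   update:   theta_t = argmin_theta  sum_m p_t^m * theta * X_t^m + 0.5 * reg * theta^2.
# All modelling choices are illustrative.

M, steps, reg = 20, 100, 80.0
dt = 1.0 / steps
x0 = np.linspace(0.5, 1.5, M)                   # initial data x^m
y = 2.0 * x0                                    # targets y^m

theta = np.zeros(steps)                         # theta^0_t on the time grid
for sweep in range(30):
    X = np.empty((steps + 1, M)); X[0] = x0     # forward Euler pass
    for k in range(steps):
        X[k + 1] = X[k] + dt * theta[k] * X[k]
    p = np.empty((steps + 1, M)); p[-1] = X[-1] - y   # backward pass for the adjoint
    for k in reversed(range(steps)):
        p[k] = p[k + 1] + dt * theta[k] * p[k + 1]
    # pointwise minimization (3.12): theta_t = - sum_m p_t^m X_t^m / reg
    theta = -(p[:-1] * X[:-1]).sum(axis=1) / reg

print("terminal loss:", 0.5 * np.mean((X[-1] - y) ** 2))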
This is not a standard MDP, but a Mean Field type control problem in discrete
time.
X_{k+1} = X_k − η_k Df(X_k) + η_k Y_k,
with
Y_k = Df(X_k) − D_x f(X_k, Z_k).
Note that we have E[Y_k | F^{k−1}] = 0 and E[Y_k Y_k^* | F^{k−1}] = Σ(X_k), with
Σ(x) = E[D_x f(x, Z)(D_x f(x, Z))^*] − Df(x)(Df(x))^*. (4.5)
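This decomposition is easy to check numerically; in the sketch below the objective f(x, Z) = ½|x − Z|² with Gaussian Z is an illustrative choice, for which Df(x) = x − E[Z] and the noise covariance Σ(x) is the identity.

import numpy as np

# Numerical check of the decomposition X_{k+1} = X_k - eta Df(X_k) + eta Y_k:
# the stochastic gradient D_x f(X_k, Z_k) is the exact gradient Df(X_k) minus a
# zero-mean noise Y_k whose covariance is Sigma(x) from (4.5).
# Toy objective: f(x, Z) = 0.5 |x - Z|^2, Z ~ N(mu, I), so D_x f(x, Z) = x - Z.

rng = np.random.default_rng(8)
dim, eta = 5, 0.05
mu = np.ones(dim)
x = rng.normal(size=dim)

noise = []
for k in range(5000):
    Z = mu + rng.normal(size=dim)
    stoch_grad = x - Z                          # D_x f(x, Z)
    full_grad = x - mu                          # Df(x) = E_Z[D_x f(x, Z)]
    noise.append(full_grad - stoch_grad)        # Y_k, with E[Y_k | past] = 0
    x = x - eta * stoch_grad                    # identical to x - eta*Df(x) + eta*Y_k

Y = np.array(noise)
print("empirical mean of Y (should be near 0):", np.round(Y.mean(axis=0), 3))
print("diagonal of empirical Sigma (here near 1):", np.round(np.cov(Y.T).diagonal(), 2))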
We shall write
Σ(x) = σ(x)σ(x)^*. (4.6)
If we write Y_k = σ(X_k)Ỹ_k, then the process Ỹ_k satisfies E[Ỹ_k | F^{k−1}] = 0 and E[Ỹ_k Ỹ_k^* | F^{k−1}] = I.
There are two approaches to resolve the above problem: Dynamic Programming and the Stochastic Pontryagin Maximum Principle. The theory shows that the optimal control is described by a feedback. The value function is defined by
In the above problem, there are three functions of interest: the value function u(x), the optimal feedback â(x) (if it exists), and the gradient of u(x), λ(x) = Du(x). Introducing u(x) and its gradient independently may look superfluous. It turns out that the gradient has a very interesting interpretation, the shadow price in economics. Surprisingly, the gradient is the solution of a self-contained vector equation. On the numerical side, approximating the gradient of u(x) by the gradient of the approximation of u(x) results in a source of errors. This justifies the interest in the system of equations for λ(x). We may think of parametric and non-parametric approximations for these functions. We shall discuss a parametric approach for the optimal feedback, and a non-parametric approach for the value function and its gradient.
where f(x, θ) abbreviates f(x, a(x, θ)). The important simplification of this procedure is that θ(t) is regarded as deterministic.
We can write a necessary condition of optimality for the optimal new control θ̂(t). Define the optimal state x̂(t) and the adjoint state p̂(t) by
dx̂ = g(x̂, θ̂) dt + dw,   x̂(0) = x,
−dp̂/dt + αp̂ = g_x^*(x̂, θ̂) p̂ + f_x(x̂, θ̂), (5.6)
and θ̂(t) satisfies
inf_θ E[ p̂(t) · g(x̂(t), θ) + f(x̂(t), θ) ],   t-a.e. (5.7)
To obtain θ̂(t), we can use an iterative approximation coupled with a gradient method:
dx̂^k = g(x̂^k, θ^k) dt + dw,   x̂^k(0) = x,
−dp̂^k/dt + αp̂^k = g_x^*(x̂^k, θ^k) p̂^k + f_x(x̂^k, θ^k), (5.8)
θ^{k+1}(t) = θ^k(t) − ρ^k(t) E[ g_θ^*(x̂^k, θ^k) p̂^k + f_θ(x̂^k, θ^k) ],   t-a.e.
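A Monte Carlo sketch of this scheme on a one-dimensional toy problem is given below; the linear feedback a(x, θ) = −θx, the quadratic cost, the finite horizon with the truncation p̂(T) = 0, and the constant step ρ are all assumptions made for illustration, not the paper's setup.

import numpy as np

# Monte Carlo sketch of (5.8) for a parametric feedback on a 1-d toy problem:
#   a(x, theta) = -theta * x,  g(x, a) = a,  f(x, a) = x^2 + a^2,  discount alpha.
# Forward paths are simulated by Euler-Maruyama; the adjoint equation
#   -dp/dt + alpha p = g_x p + f_x
# is integrated backward along each path with the truncation p(T) = 0; theta(t) is then
# updated with the averaged gradient E[g_theta^* p + f_theta].

rng = np.random.default_rng(9)
T, steps, n_paths, alpha, rho = 2.0, 200, 500, 0.5, 0.1
dt = T / steps
x0 = 1.0
theta = np.zeros(steps)                                   # theta^0(t) on the time grid

for k in range(40):
    dw = rng.normal(scale=np.sqrt(dt), size=(n_paths, steps))
    x = np.empty((n_paths, steps + 1)); x[:, 0] = x0
    for i in range(steps):                                # forward: dx = -theta x dt + dw
        x[:, i + 1] = x[:, i] - theta[i] * x[:, i] * dt + dw[:, i]
    p = np.zeros((n_paths, steps + 1))                    # adjoint, truncated with p(T) = 0
    for i in reversed(range(steps)):
        gx = -theta[i]                                    # g_x(x, theta)
        fx = 2.0 * (1.0 + theta[i] ** 2) * x[:, i + 1]    # f_x(x, theta)
        p[:, i] = p[:, i + 1] + dt * (gx * p[:, i + 1] + fx - alpha * p[:, i + 1])
    # gradient of p*g(x,theta) + f(x,theta) in theta:  -x p + 2 theta x^2
    grad = (-x[:, :-1] * p[:, :-1] + 2.0 * theta * x[:, :-1] ** 2).mean(axis=0)
    theta = theta - rho * grad

print("feedback gain theta(0) after the sweeps:", theta[0])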
â(x) minimizes in a the function λ(x) · g(x, a) + f(x, a).
The above system has an interesting structure: there is coupling only in the last two equations. The first equation allows one to define the value function. Note that we have used the fact that Dλ(x) = (Dλ(x))^*.
We now define the following iteration: suppose that we know (â^k(x), λ^k(x)); we can find λ^{k+1}(x) by solving the differential equation system
αλ^{k+1}(x) − Dλ^{k+1}(x) g(x, â^k(x)) − (1/2)Δλ^{k+1}(x)
= D_x^* g(x, â^k(x)) λ^k(x) + D_x f(x, â^k(x)), (5.10)
such that â^{k+1}(x) minimizes in a the function λ^{k+1}(x) · g(x, a) + f(x, a).
The equations for the components of λ^{k+1}(x) are completely decoupled and can be solved in parallel. One possibility is to use simulation to define λ^{k+1}(x) at a finite number of points and to use an extrapolation by a kernel method.
where â(x) minimizes in a the function λ(x) · g(x, a) + f(x, a).
(ii) The second one is to describe the policy iteration as follows: given the functions (â^k(x), u^k(x)), we then set
We now set
λ^{k+1}(x) = Du^{k+1}(x), (6.7)
and the values of the function â^{k+1}(x) can be obtained by minimizing
Since â^k(x) satisfies the necessary condition of optimality,
â^{k+1}(x) = â^k(x) − θ^{k+1} [ D_a f(x, â^k(x)) + (D_a g)^*(x, â^k(x)) λ^{k+1}(x) ]. (6.10)
and
H^{k+1}(θ)(x) = f(x, w^{k+1}(θ)(x)) + λ^{k+1}(x) · g(x, w^{k+1}(θ)(x)), (6.12)
where we can now find θ^{k+1} by minimizing the function H^{k+1}(θ)(x) in θ. As a result, θ^{k+1} depends on x, and we plug it back into (6.10) to obtain â^{k+1}(x).
For this we propose a parallel splitting up method²: knowing λ^j(x), and writing x = (x_1, ..., x_d), we define λ^{j+l/d}(x), l = 1, ..., d, by
αλ^{j+l/d}(x) − G_l(x) ∂λ^{j+l/d}(x)/∂x_l = Z^{j+l/d}(x), (6.14)
where Z^{j+l/d}(x) = Σ_{h≠l} G_h(x) ∂λ^j(x)/∂x_h + F(x). Then λ^{j+1}(x) is defined by
λ^{j+1}(x) = (1/d) Σ_{l=1}^{d} λ^{j+l/d}(x). (6.15)
Note that (6.14) is a one-dimensional first-order differential equation, which has the explicit solution
² The parallel splitting up method not only reduces the original problem to a number of separable one-dimensional linear problems, but also enables us to compute all these one-dimensional linear problems in parallel, since the calibrations of the fractional steps are independent of each other (Lu et al., 1991).
λ^{j+l/d}(x) = − ∫_{−∞}^{x_l} [ Z^{j+l/d}(ξ_l, x̄_l) / G_l(ξ_l, x̄_l) ] exp( α ∫_{ξ_l}^{x_l} dη_l / G_l(η_l, x̄_l) ) dξ_l. (6.16)
Here, we have used the notation x = (x_l, x̄_l), where x̄_l ∈ R^{d−1}.
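A small numerical sketch of the splitting iteration (6.14)-(6.15) is given below for d = 2 and a scalar unknown (each component of λ is treated identically). It uses a manufactured solution and a first-order upwind line solver instead of the closed-form integral (6.16), boundary data on x_l = 0 supplied from the known test solution, and a large value of α, echoing the conditions of Section 7; all of these are illustrative assumptions.

import numpy as np

# Parallel splitting-up iteration (6.14)-(6.15) in d = 2 on [0,1]^2 for a scalar unknown:
#   alpha*lam - G1 d(lam)/dx1 - G2 d(lam)/dx2 = F,  with G_l(x) = -(1 + x_l).
# F is manufactured from a known solution lam_true so the result can be checked.
# Each fractional step solves independent one-dimensional problems along grid lines
# (upwind sweeps), which could be dispatched in parallel; (6.15) then averages them.

alpha, N = 80.0, 40
h = 1.0 / (N - 1)
x1, x2 = np.meshgrid(np.linspace(0, 1, N), np.linspace(0, 1, N), indexing="ij")
G = [-(1.0 + x1), -(1.0 + x2)]
lam_true = np.sin(x1) * np.cos(x2)
F = alpha * lam_true - G[0] * (np.cos(x1) * np.cos(x2)) - G[1] * (-np.sin(x1) * np.sin(x2))

def sweep(Z, absG, boundary):
    """Upwind sweep along axis 0 for alpha*lam + |G| d(lam)/dx = Z, with lam[0, :] = boundary."""
    lam = np.empty_like(Z)
    lam[0] = boundary
    for i in range(1, Z.shape[0]):
        lam[i] = (Z[i] + absG[i] / h * lam[i - 1]) / (alpha + absG[i] / h)
    return lam

lam = np.zeros((N, N))                                     # lambda^0
for j in range(40):                                        # outer splitting iterations
    d1, d2 = np.gradient(lam, h)                           # d(lam^j)/dx1, d(lam^j)/dx2
    lam1 = sweep(G[1] * d2 + F, -G[0], lam_true[0])        # fractional step l = 1 (x2-term frozen)
    lam2 = sweep((G[0] * d1 + F).T, (-G[1]).T, lam_true[:, 0]).T   # fractional step l = 2
    lam = 0.5 * (lam1 + lam2)                              # averaging step (6.15)

print("max error vs. manufactured solution:", np.abs(lam - lam_true).max())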
7 Convergence results
7.1 Setting of the problem
We take
g(x, a) = A(x) + Ba, (7.1)
such that ‖DA(x)‖ ≤ γ and
‖DA(x_1) − DA(x_2)‖ ≤ b|x_1 − x_2| / (1 + |x_1| + |x_2|).
The pay-off functional is defined with
f(x, a) = F(x) + (1/2) a^* N a, (7.2)
where x ↦ F(x) : R^n → R with |DF(x)| ≤ M|x|, and N ∈ L(R^d; R^d) is symmetric and invertible.
We then have
â(x) = −N^{-1} B^* λ(x), (7.3)
and thus the second relation in (6.1) becomes
7.2 Preliminaries
We will need conditions on α and b: α sufficiently large and b sufficiently small. We first assume that
α − 2γ > 2√(M‖BN^{-1}B^*‖). (7.5)
We set
β = (α − 2γ)² / (4M‖BN^{-1}B^*‖) > 1. (7.6)
We define
μ = [α − 2γ − √((α − 2γ)² − 4M‖BN^{-1}B^*‖)] / (2‖BN^{-1}B^*‖), (7.7)
which is a solution of
μ²‖BN^{-1}B^*‖ − μ(α − 2γ) + M = 0. (7.8)
We need that
b < √(M‖BN^{-1}B^*‖) (β − 1) / (√β − √(β − 1)). (7.10)
We then define
ν = [α − 2γ − √((α − 2γ)² − 4(M + bμ)‖BN^{-1}B^*‖)] / (2‖BN^{-1}B^*‖). (7.11)
Since A(x) and λ(x) are uniformly Lipschitz, this equation has a unique solution. We then define Γ(x) = (T λ)(x) by the formula
Γ(x) = ∫_0^{+∞} exp(−αs) [ DF(y(s)) + DA^*(y(s)) λ(y(s)) ] ds. (7.14)
This integral is well-defined. Indeed, from (7.13), the second assumption (7.1)
and the first property (7.12), we can assert that
then
DΓ(x) = ∫_0^{+∞} exp(−αs) [ D²F(y(s)) + DA(y(s)) Dλ(y(s)) + D²A(y(s)) λ(y(s)) ] Y(s) ds. (7.18)
So
‖DΓ(x)‖ ≤ (M + γν + bμ) ∫_0^{+∞} exp(−αs) ‖Y(s)‖ ds,
and from (7.17) it follows that
‖DΓ(x)‖ ≤ (M + γν + bμ) ∫_0^{+∞} exp(−(α − γ − ‖BN^{-1}B^*‖ν)s) ds
= (M + γν + bμ) / (α − γ − ‖BN^{-1}B^*‖ν) = ν,
and thus Γ(x) satisfies the second condition in (7.12).
We consider the Banach space of functions λ(x) : R^n → R^n, with the norm
‖λ‖ = sup_x |λ(x)| / |x|,
and the closed subset
|d(y_1 − y_2)/ds| ≤ (γ + ν‖BN^{-1}B^*‖)|y_1 − y_2| + ‖BN^{-1}B^*‖ ‖λ_1 − λ_2‖ |x| exp((γ + μ‖BN^{-1}B^*‖)s),
y_1(0) = y_2(0) = x,
therefore
|y_1(s) − y_2(s)| ≤ (‖λ_1 − λ_2‖ |x| / (ν − μ)) [ exp((γ + ν‖BN^{-1}B^*‖)s) − exp((γ + μ‖BN^{-1}B^*‖)s) ]. (7.19)
Writing
DA^*(y_1(s)) λ_1(y_1(s)) − DA^*(y_2(s)) λ_2(y_2(s))
= [DA^*(y_1(s)) − DA^*(y_2(s))] λ_1(y_1(s)) + DA^*(y_2(s)) [λ_1(y_1(s)) − λ_2(y_2(s))].
Moreover
Rearranging and using the definition of ν, see (7.9), we finally obtain that
‖Γ_1 − Γ_2‖ ≤ [(γ + ‖BN^{-1}B^*‖ν) / (α − γ − ‖BN^{-1}B^*‖μ)] ‖λ_1 − λ_2‖. (7.20)
It remains to check that
(γ + ‖BN^{-1}B^*‖ν) / (α − γ − ‖BN^{-1}B^*‖μ) < 1, (7.21)
which is equivalent to
which is true from the definition of μ and ν, see (7.7) and (7.11). If we call λ(x) the unique fixed point of T, it satisfies (7.12) and the equation
λ(x) = ∫_0^{+∞} exp(−αs) [ DF(y(s)) + DA^*(y(s)) λ(y(s)) ] ds. (7.22)
It is standard to check that (7.12) together with (7.22) is equivalent to (7.12) together with (7.4). This concludes the proof of the Theorem.
7.4 Algorithm
We can write the algorithm (6.3) which leads to
and its solution is λ(x) = Px, with P the solution of the Riccati equation
αP = M + A^*P + PA − PBN^{-1}B^*P. (7.26)
We have
μ = ν = [α − 2‖A‖ − √((α − 2‖A‖)² − 4M‖BN^{-1}B^*‖)] / (2‖BN^{-1}B^*‖). (7.28)
8 Numerical results
We now present numerical tests for the algorithm. We consider the following values for m and n: m = 10, n = 30; the matrices M, N, A, B are chosen arbitrarily and their values are not displayed here.
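To make the experiment reproducible in spirit, the sketch below iterates one plausible fixed-point reading of the algorithm for the Riccati equation (7.26), monitoring P^(k) − P^(k+1) as in Figs. 2 and 4; the matrices, the value of α, and the precise form of the iteration are assumptions, since the paper's own choices and its exact iteration are not reproduced here.

import numpy as np

# Fixed-point iteration for the Riccati equation (7.26),
#   alpha*P = M + A^T P + P A - P B N^{-1} B^T P,
# monitoring the differences P^(k) - P^(k+1).  The matrices are arbitrary, and the simple
# update P^(k+1) = (M + A^T P^(k) + P^(k) A - P^(k) B N^{-1} B^T P^(k)) / alpha is only an
# illustrative reading of the algorithm; it converges quickly when alpha is large,
# consistent with the behaviour reported below for Figs. 2 and 4.

rng = np.random.default_rng(10)
n, m, alpha = 30, 10, 100.0
A = rng.normal(size=(n, n)) / np.sqrt(n)
B = rng.normal(size=(n, m)) / np.sqrt(n)
M = np.eye(n)
N_inv = np.eye(m)                                         # take N = identity

P = np.zeros((n, n))                                      # initial guess P^(0)
for k in range(26):
    P_next = (M + A.T @ P + P @ A - P @ B @ N_inv @ B.T @ P) / alpha
    if k % 5 == 0:
        print(f"k = {k:2d}   ||P^(k) - P^(k+1)|| = {np.linalg.norm(P - P_next):.2e}")
    P = P_next

residual = alpha * P - (M + A.T @ P + P @ A - P @ B @ N_inv @ B.T @ P)
print("Riccati residual norm:", np.linalg.norm(residual))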
In Fig. 2, we choose α = 1770.3688 and 0 < P^(0) < 6.8153 for the 4 samples of the initial guess P^(0). These choices satisfy the two conditions (7.5) and (7.6). We can see that as the number of iterations k increases, the difference P^(k) − P^(k+1) becomes very small, and after 15 iterations these differences are essentially 0.
In Fig. 4, we take the first choice of P^(0) arbitrarily and vary the value of α over 1, 5, 20, 100, 200, 300. Using the results of our Python code, we plot the values of P^(5) − P^(6), P^(10) − P^(11), P^(15) − P^(16), P^(20) − P^(21), and P^(25) − P^(26) for each value of α as a curve. We can see that the algorithm converges very fast if α is big: the curves for α = 100, 200, 300 (red, purple and brown) are almost the 0-line.
Acknowledgment
Alain Bensoussan acknowledges the financial support from the National Science Founda-
tion under grants DMS-1612880, DMS-1905449, and the Research Grant Council of Hong
Kong Special Administrative Region under grant GRF 11303316. Minh-Binh Tran is partially
supported by NSF Grant DMS-1854453, SMU URC Grant 2020, SMU DCII Research Clus-
ter Grant, Dedman College Linking Fellowship, Alexander von Humboldt Fellowship. Dinh
Phan Cao Nguyen and Minh-Binh Tran would like to thank Prof. T. Hagstrom and Prof. A.
Aceves for the computational resources. Phillip Yam acknowledges the financial support from
HKGRF-14300717 with the project title “New kinds of Forward-backward Stochastic Systems
with Applications”, HKGRF-14300319 with the project title “Shape-constrained Inference:
Testing for Monotonicity”, and Direct Grant for Research 2014/15 (Project No. 4053141) of-
fered by CUHK. Xiang Zhou acknowledges the support of Hong Kong RGC GRF grants
11337216 and 11305318.
References
Bensoussan, A., 2018. Estimation and Control of Dynamical Systems. Interdisciplinary Applied
Mathematics. Springer International Publishing.
Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., Holtham, E., 2018a. Reversible architec-
tures for arbitrarily deep residual neural networks. In: The Thirty-Second AAAI Conference on
Artificial Intelligence (AAAI-18), p. 2811.
Chang, Bo, Meng, Lili, Haber, Eldad, Tung, Frederick, Begert, David, 2018b. Multi-level residual
networks from dynamical systems view.
Chen, Ricky T.Q., Rubanova, Yulia, Bettencourt, Jesse, Duvenaud, David, 2018. Neural ordinary differential equations. In: Advances in Neural Information Processing Systems, pp. 6571–6583.
Chiuso, A., Pillonetto, G., 2019. System identification: a machine learning perspective. Annual Re-
view of Control, Robotics, and Autonomous Systems 2 (1), 281–304.
E, W., 2017. A proposal on machine learning via dynamical systems. Communications in Mathe-
matics and Statistics 5 (1), 1–11.
E, W., Han, J., Li, Q., 2019a. A mean-field optimal control formulation of deep learning. Research
in the Mathematical Sciences 6 (1), 1–41.
E, W., Ma, C., Wu, L., 2019b. Machine learning from a continuous viewpoint. arXiv:1912.12777.
Haber, E., Ruthotto, L., 2017. Stable architectures for deep neural networks. Inverse Problems 34
(1), 014004.
Han, J., E, W., 2016. Deep learning approximation for stochastic control problems. arXiv:1611.
07422.
Han, Jiequn, Jentzen, Arnulf, E, Weinan, 2018. Solving high-dimensional partial differential
equations using deep learning. Proceedings of the National Academy of Sciences 115 (34),
8505–8510.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian, 2016. Deep residual learning for image
recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778.
Jordan, M.I., Mitchell, T.M., 2015. Machine learning: trends, perspectives, and prospects. Sci-
ence 349 (6245), 255–260.
Li, Q., Chen, L., Tai, C., E, W., 2017a. Maximum principle based algorithms for deep learning.
Journal of Machine Learning Research 18 (1), 5998–6026.
Li, Q., Hao, S., 2018. An optimal control approach to deep learning and applications to discrete-
weight neural networks. arXiv preprint. arXiv:1803.01299.
Li, Q., Tai, C., E, W., 2017b. Stochastic modified equations and adaptive stochastic gradient algo-
rithms. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70.
JMLR.org, pp. 2101–2110.
Li, Q., Tai, C., E, W., 2019. Stochastic modified equations and dynamics of stochastic gradient
algorithms I: mathematical foundations. Journal of Machine Learning Research 20 (40), 1–40.
Li, Qianxiao, Tai, Cheng, E, Weinan, 2015. Dynamics of stochastic gradient algorithms. arXiv
preprint. arXiv:1511.06251.
Li, Z., Shi, Z., 2017. Deep residual learning and PDEs on manifold. arXiv preprint. arXiv:1708.
05115.
Lu, T., Neittaanmaki, P., Tai, X.-C., 1991. A parallel splitting up method and its application to
Navier-Stokes equations. Applied Mathematics Letters 4, 25–29.
Lu, Y., Zhong, A., Li, Q., Dong, B., 2017. Beyond finite layer neural networks: bridging deep
architectures and numerical differential equations. arXiv preprint. arXiv:1710.10121.
Mei, S., Montanari, A., Nguyen, P.-M., 2018. A mean field view of the landscape of two-layer neural
networks. Proceedings of the National Academy of Sciences 115 (33), E7665–E7671.
Recht, Benjamin, 2019. A tour of reinforcement learning: the view from continuous control. Annual
Review of Control, Robotics, and Autonomous Systems 2 (1), 253–279.
Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez,
Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, Chen, Yutian, Lillicrap,
Timothy, Hui, Fan, Sifre, Laurent, van den Driessche, George, Graepel, Thore, Hassabis, Demis,
2017. Mastering the game of go without human knowledge. Nature 550 (7676), 354.
Sonoda, S., Murata, N., 2017. Double continuum limit of deep neural networks. In: ICML Workshop
Principled Approaches to Deep Learning.
Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction, second edition. MIT Press.
Wang, Haoran, Zariphopoulou, Thaleia, Zhou, Xunyu, 2018. Exploration versus exploitation in re-
inforcement learning: a stochastic control approach. arXiv:1812.01552.