
Machine learning and control theory

Alain Bensoussan (a,b), Yiqun Li (c), Dinh Phan Cao Nguyen (d,e), Minh-Binh Tran (d), Sheung Chi Phillip Yam (c), and Xiang Zhou (b,f,*)

(a) International Center for Decision and Risk Analysis, Jindal School of Management, The University of Texas at Dallas, Richardson, TX, United States; (b) School of Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong; (c) Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong; (d) Department of Mathematics, Southern Methodist University, Dallas, TX, United States; (e) Faculty of Information Technology, Nha Trang University, Nha Trang, Viet Nam; (f) Department of Mathematics, City University of Hong Kong, Kowloon Tong, Hong Kong

* Corresponding author: e-mail address: [email protected]

Contents
1 Introduction
2 Reinforcement learning
  2.1 General concepts
  2.2 Mathematical model without action
  2.3 Approximation
  2.4 Mathematical model with action
3 Control theory and deep learning
  3.1 Supervised learning
  3.2 Deep learning
  3.3 Control theory approach
4 Stochastic gradient descent and control theory
  4.1 Comments
  4.2 Stochastic gradient and MDP
  4.3 Continuous version
5 Machine learning approach of stochastic control problems
  5.1 General theory
  5.2 Parametric approach for the feedback
  5.3 Non-parametric approach for the value function and its gradient
6 Focus on the deterministic case
  6.1 Problem and algorithm
  6.2 Splitting up method
7 Convergence results
  7.1 Setting of the problem
  7.2 Preliminaries
  7.3 Main result
  7.4 Algorithm
  7.5 Linear quadratic case
8 Numerical results
Acknowledgment
References

Abstract
We survey in this chapter the connections between Machine Learning and Control Theory. Control Theory provides useful concepts and tools for Machine Learning. Conversely, Machine Learning can be used to solve large control problems. In the first part of the paper, we develop the connections between reinforcement learning and Markov Decision Processes, which are discrete time control problems. In the second part, we review the concept of supervised learning and its relation with static optimization. Deep learning, which extends supervised learning, can be viewed as a control problem. In the third part, we present the links between stochastic gradient descent and mean field theory. Conversely, in the fourth and fifth parts, we review machine learning approaches to stochastic control problems, and focus on the deterministic case to explain the numerical algorithms more easily.

Keywords
Control theory, Machine learning, Deep learning

MSC Codes
49J15, 49L20, 93E20, 93E99, 68T05, 68T99, 90C39, 90C15

1 Introduction
The Big Data phenomenon is at the origin of a new expansion of Artificial Intelligence. Machine learning (Jordan and Mitchell, 2015) is a way to implement AI, by providing the machine with the capabilities of learning and decision making that characterize humans. The fact that humans use algorithms to help perform these two tasks is not new in itself: as soon as computing became possible, algorithms were developed, and the ambition of AI also appeared early. Over the last decades, however, the momentum has been spectacular, and Machine Learning has become the new Grail. It has revolutionized fields across science, engineering, medicine, and management. Image processing, pattern recognition, text mining, speech recognition, and automatic translation have benefited considerably from this development. An important breakthrough occurred with the methodology of deep neural networks (DNN).

Conceptually, since the objective is to improve the knowledge of the environment and to improve decision making, we are naturally dealing with optimization and statistics. This is clearly apparent in supervised learning. On the other hand, reinforcement learning and DNN add an additional variable, which is time or ordered like time. Control theory comes in as the framework of dynamic optimization.

Control theory, see for instance Bensoussan (2018), is about how to design optimal actions for dynamical models, in continuous or discrete time. However, it is widely acknowledged that numerical computation is the main barrier to putting control theory to work in practice, and many applications are unfortunately limited to the linear quadratic regulator. The curse of dimensionality, as Bellman, the creator of Dynamic Programming, coined it, has been haunting the numerical methods of control theory for a long time. It is therefore natural that the new possibilities of ML be considered to overcome the challenge of dimension. This explains why, in the past few years, we have been
witnessing many exciting ideas and innovative results from the perspective of merging the above two research areas, with efforts from different communities such as applied and computational mathematics, optimal control, stochastic optimization, as well as computer science. The two sides, researchers from machine learning and from optimal control, have started to explore each other's techniques, tools, and problem formulations. We can roughly divide these works into two categories: control theory for machine learning and machine learning for control theory. Generally speaking, the former refers to the use of control theory as a mathematical tool to formulate and solve theoretical and practical problems in machine learning, such as optimal parameter tuning and the training of neural networks; the latter concerns how to use machine learning practice, such as kernel methods and DNN, to numerically solve complex models in control theory which become intractable by traditional methods (Han et al., 2018).

There is much evidence to support our argument of close connections between machine learning and control theory. We begin with reinforcement learning (RL), which became famous when AlphaGo Zero (Silver et al., 2017) was invented. Reinforcement learning (Sutton and Barto, 2018) is a subfield of machine learning that studies how to use past data to enhance the future manipulation of a dynamical system. The control community targets the same problems as RL. However, the RL and control communities are practically disjoint, due to distinct languages and cultures; see Recht (2019) for a recent effort to bridge this gap.

In RL, one of the simplest strategies is to first estimate a model from the given data, which is called system identification in the control community. This can be achieved by supervised learning (Chiuso and Pillonetto, 2019). Then, in a second stage, the dynamic programming principle of control theory can be applied to derive many popular RL algorithms, such as Q-learning and Temporal Difference algorithms (Sutton and Barto, 2018).

As said above, Dynamic Programming is hard to implement numerically for high dimensional dynamical systems. Machine learning and DNN can be helpful. For example, Han et al. (2018) proposed an efficient machine learning algorithm that uses a DNN to approximate the value function of a high dimensional Hamilton-Jacobi-Bellman equation, based on the equivalent stochastic control formulation of the PDE.

The bond that ties machine learning and control theory together has been strengthened in recent years by the continuous-time perspective in various contexts (E, 2017; E et al., 2019b; Recht, 2019). For example, the deep residual neural network (ResNet) (He et al., 2016) can be recast as a dynamical system, with network layers viewed as a time discretization (Chang et al., 2018a,b; Chen et al., 2018; Haber and Ruthotto, 2017; Li and Hao, 2018; Li and Shi, 2017; Sonoda and Murata, 2017). From this point of view, a machine learning algorithm for ResNet can be viewed as a static and dynamic optimization problem for an ordinary differential equation controlled by the network parameters (E et al., 2019a). This continuous model immediately triggered several new
training methods based on well-known techniques in control theory: Li et al. (2017a) from the Pontryagin Maximum Principle and Chen et al. (2018) from the adjoint approach. This viewpoint of continuous modeling is also becoming more and more popular in the optimization community for machine learning, particularly for stochastic gradient descent (SGD), for which a stochastic differential equation (SDE) emerges as the continuous model (Li et al., 2019). The acceleration of SGD is then regarded as an optimal control problem for the SDE to reach the minimum point as early as possible (Li et al., 2017b). The contribution of control theory is certainly not restricted to training algorithms. In RL, the trade-off between exploration and exploitation is a serious and daunting practical problem; Wang et al. (2018) recently analyzed this problem theoretically through the lens of stochastic control. Similarly, the need to provide a solid mathematical framework to analyze deep neural networks is very pressing. Recent works have pointed out that new mathematical properties of deep neural networks can be obtained by recasting deep learning as a dynamical system (cf. Chang et al., 2018a,b; Chen et al., 2018; Haber and Ruthotto, 2017; Li et al., 2017a; Li and Hao, 2018; Li and Shi, 2017; Lu et al., 2017; Sonoda and Murata, 2017).
Nowadays, it is difficult to ignore the interaction and synthesis between machine learning and control theory, and the fusion of these two fields at certain boundaries is pushing forward tremendous research progress with accelerating momentum. This paper gives a brief introduction to, and a short review of, some selected works at the overlap of these communities. The interaction between the data-driven approach of machine learning and the model-based approach of control theory is still at a very early stage, and there are certainly many challenges at the control-learning interface to advance deeper developments both in theory and in practice. We hope that the gap between the learning-centric views of ML and the model-centric views of control can diminish in the foreseeable future, along the arduous journey of understanding machine learning and artificial intelligence. As a result, a new territory may emerge (e.g. actionable intelligence in Recht, 2019) from these joint efforts across the disciplines.
In the first part of the chapter, we discuss Markov Decision Processes (MDP), which provide a mathematical framework for modeling decision making in a stochastic environment, where outcomes are partly random and partly under the control of a decision maker. MDPs can indeed be solved via Dynamic Programming and provide a very useful framework for Reinforcement Learning.

The second part of the chapter is devoted to Supervised Learning and Deep Learning, which concern the approximation of a function given some preliminary observations. Supervised Learning is an optimization problem. Deep Learning can be recast as a control theory problem and can be solved using various strategies, including the Pontryagin Maximum Principle approach.

Recent mean field and stochastic control views of Stochastic Gradient Descent methods are provided in the third part of this paper.
In the next section, we study a Stochastic Control Problem, in which the state is that of a controlled diffusion. We then propose a Machine Learning approach for this problem.

We finally focus on the deterministic case in Section 6 to simplify the theory. We provide some related theoretical results, together with a few high-dimensional numerical illustrations, to demonstrate the effectiveness of the algorithms.

2 Reinforcement learning
2.1 General concepts
In the language of Control Theory, we consider a dynamical system which evolves in an uncertain environment. The evolution of this system is called a process, which can be characterized by its state. A controller decides on a strategy of actions, called a feedback, and at each time there is a cost or profit attached to the current state and the current action. In the language of RL, every time an action is taken, the controller receives a reward, and the controller tries to choose the actions such that the sum of rewards is maximized. Since time is discrete, the control problem is called a Markov Decision Process (MDP) and can be solved by a Dynamic Programming approach. The reward is then a function of the state and the action.

2.2 Mathematical model without action


We suppose that the states belong to a space X. On X, there is a σ-algebra, denoted by \mathcal{X}. A transition probability is a (regular enough) function \pi(x;\Gamma) on (X,\mathcal{X}): for any fixed x, we define the probability of \Gamma \in \mathcal{X} to be \pi(x;\Gamma). If B is the space of bounded functions on X, equipped with the norm \|f\| = \sup_x |f(x)|, we associate to the transition probability a linear operator \Phi from B to B as follows:

\Phi f(x) = \int_X f(\eta)\,\pi(x;d\eta),    (2.1)

and clearly \|\Phi\| \le 1. A Markov chain \{X_n\}_{n=1}^{\infty} on X with transition probability \pi(x;d\eta) is a stochastic process on X such that

E[f(X_{n+1}) \,|\, X_n = x] = \int_X f(\eta)\,\pi(x;d\eta), \quad n = 1,2,\cdots.    (2.2)

Assuming stationarity in (2.2), and choosing \alpha to be a discount factor, we can express the function

u(x) = E\Big[\sum_{n=1}^{+\infty}\alpha^{n-1} f(X_n) \,\Big|\, X_1 = x\Big].    (2.3)
This is the sum of rewards, in the terminology of RL. There is no action to modify the trajectory; we just add the discounted rewards. We can give an explicit analytic expression (not probabilistic) of the function u(x): it is the unique solution of the analytic problem

u = f + \alpha\Phi u.    (2.4)

It then follows that

u = (I - \alpha\Phi)^{-1} f,    (2.5)

and, using the generator \Phi, we can also rewrite

u = \sum_{n=1}^{\infty}\alpha^{n-1}\Phi^{n-1} f.    (2.6)
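As a minimal illustration, when X is a finite set the operator \Phi is simply a stochastic matrix, and (2.4)-(2.6) can be evaluated directly. The following Python sketch uses an arbitrary small transition matrix and reward vector (illustrative assumptions, not taken from the text) to compute u both by the linear solve (2.5) and by truncating the series (2.6).

```python
import numpy as np

# Hypothetical 3-state example: P[i, j] = pi(x_i; {x_j}), rows sum to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
f = np.array([1.0, 0.0, 2.0])    # reward f(x) at each state
alpha = 0.9                      # discount factor

# Direct solve of (2.5): u = (I - alpha Phi)^{-1} f
u_direct = np.linalg.solve(np.eye(3) - alpha * P, f)

# Truncation of the series (2.6): u = sum_{n>=1} alpha^{n-1} Phi^{n-1} f
u_series = np.zeros(3)
term = f.copy()
for _ in range(500):
    u_series += term
    term = alpha * P @ term

print(u_direct, u_series)        # the two computations agree
```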

2.3 Approximation
Our main task now is to compute the function u(x). Eqs. (2.4) and (2.6) are explicit and straightforward. However, the challenge is that when the dimension d of X is large, these formulations are not of practical use. Here comes the other aspect of machine learning: how to approximate a function given by formula (2.3) or (2.4). Since Supervised Learning does exactly that, approximating a function, we follow the ideas of SL. There are basically two methods, the parametric method and the non-parametric method. In the parametric method, we look for an approximation of the form

u(x) \approx \sum_{i=1}^{I}\theta_i\varphi_i(x),    (2.7)

where the \varphi_i(x) are given functions, such that the family \{\varphi_i(\cdot)\}_{i=1}^{\infty} forms a basis of the functional space to which u(x) belongs, and the \theta_i are coefficients to be determined. We need to compute the parameters minimizing the error

\Big\|\sum_{i=1}^{I}\theta_i(\varphi_i - \alpha\Phi\varphi_i) - f\Big\|^2,    (2.8)

where \|\cdot\| is the sup-norm \|G\| = \sup|G|. To guarantee the existence and uniqueness of the parameters, we minimize the quadratic functional with a quadratic regularization:

\gamma\sum_{i=1}^{I}\theta_i^2 + \Big\|\sum_{i=1}^{I}\theta_i(\varphi_i - \alpha\Phi\varphi_i) - f\Big\|^2.    (2.9)
In the non-parametric method, used in supervised learning, we do not refer to a functional equation for u(x). We assume that we can compute its value at a finite number of points. For a given x, u(x) can be calculated from formula (2.3) by Monte Carlo simulation. We then find

u(x) \approx \frac{1}{N}\sum_{\nu=1}^{N}\sum_{n=1}^{+\infty}\alpha^{n-1} f(X_n^{\nu}),    (2.10)

where X_1^{\nu} = x, \cdots, X_n^{\nu} = X_n(\omega^{\nu}), \cdots represents one trajectory, indexed by \nu, of the Markov chain, corresponding to one sample point \omega^{\nu}. We choose M points x^1, \cdots, x^M in R^d, and then compute u(x^1) = y^1, \cdots, u(x^M) = y^M by using the Monte Carlo method and the approximation formula (2.10). The number M is chosen arbitrarily. If we assume that f is continuous and bounded, then u(x) is also bounded and continuous. The goal is to extrapolate u(x) from the knowledge of y^1, \cdots, y^M.

We now choose a subset H of C(R^d). This subset is called the hypothesis space. We select an element of H that is the closest possible to y^1, \cdots, y^M at the points x^1, \cdots, x^M. We assume naturally that H is a nice enough functional space. The theory of reproducing kernels allows us to define H as a Hilbert space, with a continuous injection into C(R^d). The function u(x) can be defined as the solution of the minimization problem

\min_{u\in H}\; \gamma\|u\|_H^2 + \sum_{m=1}^{M}\big(u(x^m) - y^m\big)^2.    (2.11)

Remark 1. In RL, one claims that a significant difference with MDP is that the Markov chain may not be known. The controller, however, makes trials, which is similar to Monte Carlo, without referring to a selection of trajectories according to a given transition probability.
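The Monte Carlo estimate (2.10) can be implemented directly once trajectories of the chain can be sampled. The minimal sketch below, again for a small hypothetical finite-state chain and with the infinite sum truncated at a horizon where the discount is negligible, produces the values y^1, ..., y^M at chosen points, which can then be fed to the regression problem (2.11).

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])           # hypothetical transition matrix
f = np.array([1.0, 0.0, 2.0])
alpha, N, horizon = 0.9, 500, 100         # horizon truncates the infinite sum in (2.10)

def u_monte_carlo(x):
    """Estimate u(x) by formula (2.10), averaging N simulated trajectories."""
    total = 0.0
    for _ in range(N):
        state, discount = x, 1.0
        for _ in range(horizon):
            total += discount * f[state]
            discount *= alpha
            state = rng.choice(len(f), p=P[state])
    return total / N

y = [u_monte_carlo(x) for x in range(3)]  # values y^1, ..., y^M used in (2.11)
print(y)
```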

2.4 Mathematical model with action


The Markov chain now has a transition probability depending on an auxiliary variable a, called the action: \pi(x,a;d\eta). When the action is a function of the state, also called a feedback, a(x), we get \pi(x,a(x);d\eta). We now define the operators \Phi^a f(x) and \Phi^{a(x)} f(x) by

\Phi^a f(x) = \int_X f(\eta)\,\pi(x,a;d\eta); \qquad \Phi^{a(x)} f(x) = \int_X f(\eta)\,\pi(x,a(x);d\eta).    (2.12)

We consider a reward depending on the state and the action, f(x,a). For convenience, this reward is taken to be a cost, and we assume f(x,a) \ge 0. We then
set the aggregate cost to be

J_{a(\cdot)}(x) = E\Big[\sum_{n=1}^{+\infty}\alpha^{n-1} f\big(X_n, a(X_n)\big) \,\Big|\, X_1 = x\Big],    (2.13)

in which X_n evolves according to the transition probability \pi(x,a(x);d\eta). We also define the value function

u(x) = \inf_{a(\cdot)} J_{a(\cdot)}(x),    (2.14)

which is the solution of the following Bellman equation

u(x) = \inf_a\big[f(x,a) + \alpha\Phi^a u(x)\big].    (2.15)

It is also interesting to introduce the cost (Q-function) when the first action is arbitrary and the following actions are optimized, namely

Q(x,a) = f(x,a) + \alpha\Phi^a u(x).    (2.16)

Taking into account the fact that

u(x) = \inf_a Q(x,a),    (2.17)

we arrive at

Q(x,a) = f(x,a) + \alpha\big[\Phi^a\big(\inf_{a'} Q(\cdot,a')\big)\big](x).    (2.18)

There are two basic types of iteration to solve the above Bellman equation: value iteration and policy iteration. The value iteration is defined by

u^{k+1}(x) = \inf_a\big[f(x,a) + \alpha\Phi^a u^k(x)\big],    (2.19)

with u^0(x) = 0. When f(x,a) is bounded, the solution of the Bellman equation is unique and the sequence u^k(x) converges to the value function monotonically. When f(x,a) is not bounded, the solution of the Bellman equation is not unique; the sequence u^k(x) still converges monotonically to the value function, which is the minimal solution. We can also interpret u^k(x) as the value function of the control problem with k periods. To see this, we define

J^k_{a(\cdot)}(x) = E\Big[\sum_{n=1}^{k}\alpha^{n-1} f\big(X_n, a(X_n)\big) \,\Big|\, X_1 = x\Big],    (2.20)

then

u^k(x) = \inf_{a(\cdot)} J^k_{a(\cdot)}(x).    (2.21)
On the other hand, the policy iteration technique starts with a given feedback control a^k(x) and solves the linear (fixed point) problem, similar to (2.4),

u^{k+1}(x) = f(x, a^k(x)) + \alpha\Phi^{a^k(x)} u^{k+1}(x).    (2.22)

With this u^{k+1}(x), the next feedback a^{k+1}(x) is defined by the minimization

\inf_a\big[f(x,a) + \alpha\Phi^a u^{k+1}(x)\big],

and we start the iteration with a function a^0(x) which minimizes

\inf_a f(x,a).

In both types of iteration, we have to solve the respective minimization problems
(i) value iteration: \inf_a\big[f(x,a) + \alpha\Phi^a u^k(x)\big];
(ii) policy iteration: \inf_a\big[f(x,a) + \alpha\Phi^a u^{k+1}(x)\big],
in which u^k(x) and u^{k+1}(x) are known, respectively (obtained a priori by some approximation method, such as the use of reproducing kernels).

In both cases, the minimization problem can be solved by a gradient descent technique. We set

Q^{k+1}(x,a) = f(x,a) + \alpha\Phi^a u^{k+1}(x),

which is an approximation of Q(x,a). Since we cannot find the infimum exactly, we use the following approximation for a^{k+1}(x):

a^{k+1}(x) = a^k(x) - \rho^k D_a Q^{k+1}(x, a^k(x)),    (2.23)

in which the coefficient \rho^k is chosen such that it solves the scalar optimization problem

\inf_{\rho} Q^{k+1}\big(x, a^k(x) - \rho D_a Q^{k+1}(x, a^k(x))\big).    (2.24)

By definition,

u^{k+1}(x) = Q^{k+1}(x, a^k(x)).

We can then, instead of solving the linear problem for u^{k+1}(x), use the approximation Q^k(x, a^k(x)). We proceed as follows: knowing a^k(x) and Q^k(x,a), we define

\bar{u}^{k+1}(x) := Q^k(x, a^k(x)), \qquad Q^{k+1}(x,a) = f(x,a) + \alpha\Phi^a\bar{u}^{k+1}(x),    (2.25)

and a^{k+1}(x) can then be obtained approximately through (2.23) and (2.24). In the above procedure, we start with Q^0(x,a) = f(x,a), and a^0(x) is chosen to be the minimizer of f(x,a).
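For a finite state and action space, the value iteration (2.19) is a short loop. The sketch below assumes a hypothetical cost array f[x, a] and a family of transition matrices P[a] (random data, purely for illustration), and iterates the Bellman update until it stabilizes, returning the approximate value function and the greedy feedback.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, alpha = 4, 2, 0.9

# Hypothetical data: cost f(x, a) >= 0 and transition kernels P[a][x, :].
f = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
P = rng.uniform(size=(n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)           # make each row a probability

u = np.zeros(n_states)                      # u^0 = 0
for k in range(1000):
    # Q[x, a] = f(x, a) + alpha * (Phi^a u)(x), cf. (2.16)
    Q = f + alpha * np.einsum('axy,y->xa', P, u)
    u_new = Q.min(axis=1)                   # Bellman update (2.19)
    if np.max(np.abs(u_new - u)) < 1e-10:
        break
    u = u_new

policy = Q.argmin(axis=1)                   # greedy feedback a(x)
print(u, policy)
```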

3 Control theory and deep learning


3.1 Supervised learning
Supervised learning basically concerns approximating an unknown function F(x): R^d \to R, given some noisy observations y^m = F(x^m) + \varepsilon^m, in which x^m is known and the noise \varepsilon^m models the (unknown) uncertainty. There are two methods that can be used to solve this approximation problem. In the first method, we consider a function f(x;\theta), where \theta \in R^n for some n, and we solve the minimization problem in the parameter \theta

\min_{\theta}\; \gamma|\theta|^2 + \sum_{m=1}^{M}\big(f(x^m;\theta) - y^m\big)^2.    (3.1)

This is the well-known parametric method. In the simplest case of neural networks, the function f(x;\theta) is defined as follows. We first introduce X = \chi(x), where \chi: R^d \to R^n; of course, \chi can be the identity. Now choose \sigma to be a scalar function, called the activation function, W a matrix in L(R^n;R^n), and b a vector in R^n. The pair (W,b) represents the parameter \theta. We now define the vector X (with a slight abuse of notation) by

X_i = \sigma\Big(\sum_{j=1}^{n} W_{ij} X_j + b_i\Big),    (3.2)

and

f(x;\theta) = g(X),    (3.3)

where g: R^n \to R is the output function. The minimization in (3.1) is performed by a gradient method with iterative application of the chain rule.

In the non-parametric method, in particular the kernel method, one chooses a functional space H to which the approximation of F(x) belongs. One then obtains the approximation f(x) by solving

\min_{f\in H}\; \gamma\|f\|_H^2 + \sum_{m=1}^{M}\big(f(x^m) - y^m\big)^2.    (3.4)
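When H is a reproducing kernel Hilbert space with kernel k, a standard consequence of the reproducing property is that the minimizer of (3.4) (and likewise of (2.11)) has the finite representation f(x) = \sum_m c_m k(x, x^m) with (K + \gamma I)c = y, where K_{ml} = k(x^m, x^l). A minimal sketch, with a Gaussian kernel and synthetic data as illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_kernel(a, b, width=1.0):
    """k(a, b) = exp(-|a - b|^2 / (2 width^2)) for row-wise inputs."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# Illustrative noisy observations y^m = F(x^m) + eps^m of an unknown F.
M, d, gamma = 30, 2, 1e-2
X = rng.normal(size=(M, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=M)

K = gaussian_kernel(X, X)
c = np.linalg.solve(K + gamma * np.eye(M), y)     # (K + gamma I) c = y

def f_hat(x_new):
    """Kernel approximation f(x) = sum_m c_m k(x, x^m), minimizing (3.4)."""
    return gaussian_kernel(np.atleast_2d(x_new), X) @ c

print(f_hat(X[:3]))   # approximately reproduces y[:3]
```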

3.2 Deep learning


Deep learning is a generalization of supervised learning with a sequence of layers. We generalize (3.3) to K layers as follows, with X^k \in R^n and \theta^k := (W^{(k+1)}, b^{(k+1)}) \in L(R^n;R^n) \times R^n:

X^{k+1} = f_k(X^k;\theta^k), \quad k = 0,\cdots,K-1,    (3.5)

where

f_k(X^k;\theta^k) := \sigma\big(W^{(k+1)} X^k + b^{(k+1)}\big),

and

X^0 = \chi(x),    (3.6)

with

f(x,\theta) = g(X^K).    (3.7)

The parameter \theta is the collection \{\theta^0,\cdots,\theta^{K-1}\}. We have written the case of several layers of a neural network, but it is just an example.
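The recursion (3.5)-(3.7) is simply a loop over layers. A minimal sketch of the forward map, with \chi and g taken as linear maps and \sigma = tanh (illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, K = 4, 8, 3                    # input dimension, width, number of layers
sigma = np.tanh                      # activation function

# Parameters theta^k = (W^{(k+1)}, b^{(k+1)}), k = 0, ..., K-1, cf. (3.5).
theta = [(0.3 * rng.normal(size=(n, n)), np.zeros(n)) for _ in range(K)]
chi = rng.normal(size=(n, d))        # lifting chi: R^d -> R^n (here linear)
g = rng.normal(size=n)               # output map g: R^n -> R (here linear)

def network(x, theta):
    X = chi @ x                      # X^0 = chi(x), cf. (3.6)
    for W, b in theta:
        X = sigma(W @ X + b)         # X^{k+1} = f_k(X^k; theta^k), cf. (3.5)
    return g @ X                     # f(x, theta) = g(X^K), cf. (3.7)

print(network(rng.normal(size=d), theta))
```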

3.3 Control theory approach


This approach has been introduced by Li et al. (2015, 2017b, 2019). The idea is to consider a continuous time extension of (3.5), (3.6), and (3.7). We write

\frac{dX_t}{dt} = f(X_t,\theta_t,t), \qquad X_0 = \chi(x).    (3.8)

The approximation of F(x) is then f(x,\theta) = g(X_T). The loss in this scenario is (y - g(X_T))^2 = \Phi(X_T), recalling that y is a known value. The idea is to consider \theta_t as a control, and we want to minimize an analog of (3.4), which is expressed as

J(\theta) := \sum_{m=1}^{M}\Phi(X_T^m) + \int_0^T L(\theta_t)\,dt,    (3.9)

where L(\theta) is a regularization function, for instance \gamma\theta^2. Using a Pontryagin Maximum Principle approach, we can write a necessary condition of optimality for the control \theta_t. The optimal state and optimal adjoint state (\hat{X}^m_t, \hat{p}^m_t) are solutions of

\frac{d\hat{X}^m_t}{dt} = f(\hat{X}^m_t,\hat{\theta}_t,t), \qquad \hat{X}^m_0 = x^m,
-\frac{d\hat{p}^m_t}{dt} = (D_x f)^*(\hat{X}^m_t,\hat{\theta}_t,t)\,\hat{p}^m_t, \qquad \hat{p}^m_T = D_x\Phi(\hat{X}^m_T),    (3.10)

together with the optimality condition

\hat{\theta}_t minimizes H(\hat{X}_t,\hat{p}_t,\theta,t), a.e. t,
with

H(\hat{X}_t,\hat{p}_t,\theta,t) = \sum_{m=1}^{M}\hat{p}^m_t\cdot f(\hat{X}^m_t,\theta,t) + L(\theta).

To solve (3.10), one can use the following approximation recursively: let \hat{\theta}^k_t be given, and define (\hat{X}^{m,k}_t, \hat{p}^{m,k}_t) by

\frac{d\hat{X}^{m,k}_t}{dt} = f(\hat{X}^{m,k}_t,\hat{\theta}^k_t,t), \qquad \hat{X}^{m,k}_0 = x^m,
-\frac{d\hat{p}^{m,k}_t}{dt} = (D_x f)^*(\hat{X}^{m,k}_t,\hat{\theta}^k_t,t)\,\hat{p}^{m,k}_t, \qquad \hat{p}^{m,k}_T = D_x\Phi(\hat{X}^{m,k}_T).    (3.11)

We then look for \hat{\theta}^{k+1}_t that minimizes

\sum_m \hat{p}^{m,k}_t\cdot f(\hat{X}^{m,k}_t,\theta,t) + L(\theta).    (3.12)

Note that the above approximation may fail to converge. We refer to E et al. (2019a,b); Han and E (2016); Sutton and Barto (2018) for recent improvements and techniques to deal with this issue.
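As a minimal sketch of the successive-approximation scheme (3.11)-(3.12), consider a scalar toy model with dynamics f(X,\theta,t) = tanh(\theta X), terminal loss \Phi(X_T) = (X_T - y)^2 and L(\theta) = \gamma\theta^2, discretized by forward/backward Euler; the pointwise minimization (3.12) is replaced by a few gradient steps on the Hamiltonian. All modelling and discretization choices below are illustrative, and, as noted above, convergence is not guaranteed in general.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy scalar version of (3.8)-(3.12); data, dynamics and targets are illustrative.
M, T, n_steps, gamma = 8, 1.0, 50, 1e-3
dt = T / n_steps
x0 = rng.normal(size=M)               # data points x^m
y = 2.0 * x0                          # synthetic targets y^m
theta = np.zeros(n_steps)             # control theta_t on the time grid

def forward(theta):
    X = np.empty((n_steps + 1, M)); X[0] = x0
    for i in range(n_steps):          # Euler scheme for (3.8)
        X[i + 1] = X[i] + dt * np.tanh(theta[i] * X[i])
    return X

def backward(theta, X):
    p = np.empty((n_steps + 1, M))
    p[-1] = 2.0 * (X[-1] - y)         # p_T = D_x Phi(X_T)
    for i in reversed(range(n_steps)):  # adjoint equation in (3.10), backward in time
        dfdx = theta[i] / np.cosh(theta[i] * X[i]) ** 2
        p[i] = p[i + 1] + dt * dfdx * p[i + 1]
    return p

for k in range(200):                  # successive approximations (3.11)-(3.12)
    X = forward(theta)
    p = backward(theta, X)
    for i in range(n_steps):          # approximate pointwise minimization (3.12)
        for _ in range(5):            # a few gradient steps on the Hamiltonian in theta
            dH = np.sum(p[i + 1] * X[i] / np.cosh(theta[i] * X[i]) ** 2) + 2 * gamma * theta[i]
            theta[i] -= 0.01 * dH

print(np.mean((forward(theta)[-1] - y) ** 2))   # terminal loss after training
```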

4 Stochastic gradient descent and control theory


4.1 Comments
The gradient descent algorithm plays an essential role in various parts of ML, as we have seen above. There has been a considerable amount of work devoted to improving its efficiency. The stochastic version of it, described below, offers another example of connection to control theory, stochastic control, and even mean field control theory. We limit ourselves to some basic considerations. In particular, we do not discuss the use of SGD for DNN, as in Mei et al. (2018), because that connection is of a different nature: it does not involve control theory, but introduces interesting PDEs. In this section, we define SGD and relate the choice of the optimal descent parameters to an MDP problem. We then give a continuous version and also connect with mean field control.

4.2 Stochastic gradient and MDP


We recall the definition of gradient descent. If f (x) is a function on Rd , for
which we want to find a minimum x ∗ . The gradient descent algorithm is defined
by the sequence
xk+1 = xk − ηk Df (xk ), (4.1)
ARTICLE IN PRESS

Machine learning and control theory 13

where ηk is a positive constant, which can be chosen independent of k. This is


simpler, but by all means not optimal. Suppose now that the function f (x) is an
expected value
f (x) = E(f (x, Z)). (4.2)
Applying the gradient method to this function leads to

xk+1 = xk − ηk E(Dx f (xk , Z)).

In the SG descent method, one chooses a sequence of independent versions of


Z, called Zk and define

Xk+1 = Xk − ηk Dx f (Xk , Zk ). (4.3)

We clearly obtain a controlled Markov chain, in which ηk is the control. If we


define the σ -algebra F k = σ (Z1 , · · · , Zk ), then ηk is adapted to F k−1 , k  1.
F 0 is the trivial σ -algebra. The process Xk is also adapted to F k−1 . We have
to define the pay off to optimize. Suppose we stop at K. We naturally want XK
as close as possible to x ∗ . One way to proceed would be to minimize Ef (XK ).
But this requires the computation of f (x), for random values of the argument,
which we want to avoid. In fact, if we insure that XK is close to a constant,
that constant will be necessarily x ∗ , provided ηk is larger than a fixed positive
constant. So a good criterion will be to minimize

E|XK − EXX |2 . (4.4)

This is not a standard MDP, but a Mean Field type control problem in discrete
time.
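As an illustration, the sketch below runs the controlled chain (4.3) on a quadratic example f(x,Z) = \frac{1}{2}|x - Z|^2 (so that f(x) = E f(x,Z) is minimized at x^* = EZ), with simple step-size schedules playing the role of the control \eta_k (bounded below by a positive constant, as required above), and estimates the criterion (4.4) over independent runs. The example and the schedules are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(5)

# f(x, Z) = 0.5 * |x - Z|^2, so f(x) = E f(x, Z) is minimized at x* = E[Z].
d, K, n_runs = 2, 200, 500
z_mean = np.array([1.0, -2.0])

def run_sgd(eta):
    """One realization of the controlled chain (4.3)."""
    X = np.zeros(d)
    for k in range(K):
        Z = z_mean + rng.normal(size=d)      # independent copy Z_k
        X = X - eta(k) * (X - Z)             # D_x f(x, z) = x - z
    return X

def criterion(eta):
    """Monte Carlo estimate of the mean field criterion (4.4): E|X_K - E X_K|^2."""
    finals = np.array([run_sgd(eta) for _ in range(n_runs)])
    return np.mean(np.sum((finals - finals.mean(axis=0)) ** 2, axis=1))

# Two admissible schedules eta_k, both bounded below by a positive constant:
print(criterion(lambda k: 0.5))                       # constant step
print(criterion(lambda k: max(1.0 / (1 + k), 0.05)))  # decreasing step, floored
```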

4.3 Continuous version


We first write (4.3) as follows:

X_{k+1} = X_k - \eta_k Df(X_k) + \eta_k Y_k,

with

Y_k = Df(X_k) - D_x f(X_k, Z_k).

Note that we have E[Y_k \,|\, \mathcal{F}^{k-1}] = 0 and

E[Y_k(Y_k)^* \,|\, \mathcal{F}^{k-1}] = \Sigma(X_k),

with

\Sigma(x) = E\big(D_x f(x,Z)(D_x f(x,Z))^*\big) - Df(x)(Df(x))^*.    (4.5)

We shall write

\Sigma(x) = \sigma(x)\sigma(x)^*.    (4.6)

If we write Y_k = \sigma(X_k)\tilde{Y}_k, then the process \tilde{Y}_k satisfies

E[\tilde{Y}_k \,|\, \mathcal{F}^{k-1}] = 0, \qquad E[\tilde{Y}_k(\tilde{Y}_k)^* \,|\, \mathcal{F}^{k-1}] = I.    (4.7)

We obtain the algorithm

X_{k+1} = X_k - \eta_k Df(X_k) + \eta_k\sigma(X_k)\tilde{Y}_k.    (4.8)

We can then follow Li et al. (2017b) to define a diffusion approximation of (4.8) as follows:

dX = -u(t)\,Df(X(t))\,dt + u(t)\,\eta\,\sigma(X(t))\,dB(t),    (4.9)

where B(t) is a Brownian motion and u(t) is adapted to the filtration generated by the Brownian motion, with values in [0,1]. The number \eta is a scaling constant. We can then choose the control to minimize the payoff E f(X(T)), which defines a stochastic control problem, or

\min_{u(\cdot) \ge u_0 > 0} E\big[|X(T) - EX(T)|^2\big],    (4.10)

which defines a mean field type control problem.
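An Euler-Maruyama discretization of (4.9), for the same quadratic example as above (so Df(x) = x - EZ and \sigma can be taken constant), allows a quick comparison of different controls u(\cdot) with respect to the mean field criterion (4.10); the constants and the controls below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Euler-Maruyama for (4.9) with f(x) = 0.5*|x - m|^2, Df(x) = x - m, sigma = I.
d, T, n_steps, eta, n_paths = 2, 5.0, 500, 0.1, 400
dt = T / n_steps
m = np.array([1.0, -2.0])

def simulate(u):
    """Sample paths of dX = -u(t) Df(X) dt + u(t) * eta * sigma dB, X(0) = 0."""
    X = np.zeros((n_paths, d))
    for i in range(n_steps):
        t = i * dt
        dB = np.sqrt(dt) * rng.normal(size=(n_paths, d))
        X = X - u(t) * (X - m) * dt + u(t) * eta * dB
    return X

def mean_field_cost(X):
    """Estimate of (4.10): E|X(T) - E X(T)|^2."""
    return np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))

print(mean_field_cost(simulate(lambda t: 1.0)))                    # constant control
print(mean_field_cost(simulate(lambda t: max(0.1, 1.0 - t / T))))  # decaying control
```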

5 Machine learning approach of stochastic control problems


5.1 General theory
Let us now consider the following problem, in which the state equation is a controlled diffusion

dx(t) = g(x(t), a(t))\,dt + dw(t), \qquad x(0) = x,    (5.1)

and the pay-off is given by

J_x(a(\cdot)) = E\Big[\int_0^{+\infty}\exp(-\alpha t)\, f(x(t), a(t))\,dt\Big].    (5.2)

There are two approaches to solve the above problem: Dynamic Programming and the Stochastic Pontryagin Maximum Principle. The theory shows that the optimal control is described by a feedback. The value function is defined by

u(x) = \inf_{a(\cdot)} J_x(a(\cdot)).    (5.3)

In the above problem, there are three functions of interest: the value function u(x), the optimal feedback \hat{a}(x) (if it exists), and the gradient of u(x), \lambda(x) = Du(x). Introducing u(x) and its gradient independently may look superfluous. It turns out that the gradient has a very interesting interpretation, the shadow price in economics. Surprisingly, the gradient is the solution of a self-contained vector equation. On the numerical side, approximating the gradient of u(x) by the gradient of the approximation of u(x) is a source of errors. This justifies the interest in the system of equations for \lambda(x). We may think of parametric and non-parametric approximations for these functions. We shall discuss a parametric approach for the optimal feedback, and a non-parametric approach for the value function and its gradient.

5.2 Parametric approach for the feedback


Let us now replace the candidate feedback a(x) by a special function of the form a(x,\theta), with \theta a parameter in R^p. The function g(x,a) is then replaced by g(x,a(x,\theta)), which can be renamed g(x,\theta) by abuse of notation. We obtain the control problem

dx(t) = g(x(t),\theta)\,dt + dw(t), \qquad x(0) = x,    (5.4)

J_x(\theta(\cdot)) = E\Big[\int_0^{+\infty}\exp(-\alpha t)\, f(x(t),\theta(t))\,dt\Big],    (5.5)

where f(x,\theta) abbreviates f(x,a(x,\theta)). The important simplification of this procedure is that \theta(t) is regarded as deterministic.

We can write a necessary condition of optimality for the optimal new control \hat{\theta}(t). Define the optimal state \hat{x}(t) and the adjoint state \hat{p}(t) by

d\hat{x} = g(\hat{x},\hat{\theta})\,dt + dw, \qquad \hat{x}(0) = x,
-\frac{d\hat{p}}{dt} + \alpha\hat{p} = g_x^*(\hat{x},\hat{\theta})\hat{p} + f_x(\hat{x},\hat{\theta}),    (5.6)

and \hat{\theta}(t) satisfies

\inf_{\theta} E\big[\hat{p}(t)\cdot g(\hat{x}(t),\theta) + f(\hat{x}(t),\theta)\big], \quad t-a.e.    (5.7)

To obtain \hat{\theta}(t), we can use an iterative approximation coupled with a gradient method:

d\hat{x}^k = g(\hat{x}^k,\hat{\theta}^k)\,dt + dw, \qquad \hat{x}^k(0) = x,
-\frac{d\hat{p}^k}{dt} + \alpha\hat{p}^k = g_x^*(\hat{x}^k,\hat{\theta}^k)\hat{p}^k + f_x(\hat{x}^k,\hat{\theta}^k),    (5.8)

\hat{\theta}^{k+1}(t) = \hat{\theta}^k(t) - \rho^k(t)\,E\big[g_\theta^*(\hat{x}^k,\hat{\theta}^k)\hat{p}^k + f_\theta(\hat{x}^k,\hat{\theta}^k)\big], \quad t-a.e.,

where \rho^k(t) minimizes in \rho the function

E\big[\hat{p}^k(t)\cdot g(\hat{x}^k(t),\theta) + f(\hat{x}^k(t),\theta)\big],

in which \theta = \hat{\theta}^k(t) - \rho\,E\big[g_\theta^*(\hat{x}^k,\hat{\theta}^k)\hat{p}^k + f_\theta(\hat{x}^k,\hat{\theta}^k)\big].

5.3 Non-parametric approach for the value function and its gradient
First, we notice that the value function u(x), the gradient \lambda(x) = Du(x) and the optimal feedback \hat{a}(x) are linked as follows (the first relation is the HJB equation, and the second follows by differentiating the first with respect to x):

\alpha u(x) = \lambda(x)\cdot g(x,\hat{a}(x)) + f(x,\hat{a}(x)) + \tfrac{1}{2}\mathrm{tr}(D\lambda(x)),
\alpha\lambda(x) = D\lambda(x)\,g(x,\hat{a}(x)) + D_x^* g(x,\hat{a}(x))\lambda(x) + D_x f(x,\hat{a}(x)) + \tfrac{1}{2}\Delta\lambda(x),    (5.9)

where \hat{a}(x) minimizes in a the function \lambda(x)\cdot g(x,a) + f(x,a).

The above system has an interesting structure, in which there is coupling only between the last two relations. The first equation then defines the value function. Note that we have used the fact that D\lambda(x) = (D\lambda(x))^*.

We now define the following iteration: supposing that we know (\hat{a}^k(x), \lambda^k(x)), we can find \lambda^{k+1}(x) by solving the differential equation system

\alpha\lambda^{k+1}(x) - D\lambda^{k+1}(x)\,g(x,\hat{a}^k(x)) - \tfrac{1}{2}\Delta\lambda^{k+1}(x) = D_x^* g(x,\hat{a}^k(x))\lambda^k(x) + D_x f(x,\hat{a}^k(x)),    (5.10)

and then \hat{a}^{k+1}(x) minimizes in a the function \lambda^{k+1}(x)\cdot g(x,a) + f(x,a).

The equations for the components of \lambda^{k+1}(x) are completely decoupled, and can be solved in parallel. One possibility is to use simulation to define \lambda^{k+1}(x) at a finite number of points and to use an extrapolation by a kernel method.

6 Focus on the deterministic case


In this section, we shall simplify the theory by considering the case of deterministic dynamics. Some theoretical and numerical results will be presented to illustrate the efficiency of the numerical algorithms.
6.1 Problem and algorithm


We first state the relations between the three functions u(x), \lambda(x) and \hat{a}(x) (the special case of (5.9)):

\alpha u(x) = f(x,\hat{a}(x)) + \lambda(x)\cdot g(x,\hat{a}(x)),
\alpha\lambda(x) = D\lambda(x)\,g(x,\hat{a}(x)) + D_x^* g(x,\hat{a}(x))\lambda(x) + D_x f(x,\hat{a}(x)),    (6.1)

where \hat{a}(x) minimizes in a the function \lambda(x)\cdot g(x,a) + f(x,a).

We propose two iterations.

(i) The first one is: for given functions (\hat{a}^k(x), \lambda^k(x)), we define u^k(x) by

\alpha u^k(x) = f(x,\hat{a}^k(x)) + \lambda^k(x)\cdot g(x,\hat{a}^k(x)).    (6.2)

Next, we find \lambda^{k+1}(x) by solving

\alpha\lambda^{k+1}(x) - D\lambda^{k+1}(x)\,g(x,\hat{a}^k(x)) = D_x^* g(x,\hat{a}^k(x))\lambda^k(x) + D_x f(x,\hat{a}^k(x)).    (6.3)

We then obtain \hat{a}^{k+1}(x) by minimizing the function \lambda^{k+1}(x)\cdot g(x,a) + f(x,a) in a, and u^{k+1}(x) is constructed by

\alpha u^{k+1}(x) = f(x,\hat{a}^{k+1}(x)) + \lambda^{k+1}(x)\cdot g(x,\hat{a}^{k+1}(x)).    (6.4)

(ii) The second one is the policy iteration, described as follows: given the functions (\hat{a}^k(x), u^k(x)), we set

\lambda^k(x) = Du^k(x).    (6.5)

We obtain u^{k+1}(x) by solving

\alpha u^{k+1}(x) = f(x,\hat{a}^k(x)) + Du^{k+1}(x)\cdot g(x,\hat{a}^k(x)).    (6.6)

We now set

\lambda^{k+1}(x) = Du^{k+1}(x),    (6.7)

and the feedback \hat{a}^{k+1}(x) is obtained by minimizing

the function \lambda^{k+1}(x)\cdot g(x,a) + f(x,a) in a.    (6.8)

Since \hat{a}^k(x) satisfies the necessary condition of optimality

D_a f(x,\hat{a}^k(x)) + (D_a g)^*(x,\hat{a}^k(x))\lambda^k(x) = 0,    (6.9)

we can use a gradient descent method

\hat{a}^{k+1}(x) = \hat{a}^k(x) - \theta^{k+1}\big[D_a f(x,\hat{a}^k(x)) + (D_a g)^*(x,\hat{a}^k(x))\lambda^{k+1}(x)\big].    (6.10)

A suitable scalar \theta^{k+1} can be obtained from a one-dimensional optimization problem: setting

w^{k+1}(\theta)(x) = \hat{a}^k(x) - \theta\big[D_a f(x,\hat{a}^k(x)) + (D_a g)^*(x,\hat{a}^k(x))\lambda^{k+1}(x)\big],    (6.11)

and

H^{k+1}(\theta)(x) = f(x, w^{k+1}(\theta)(x)) + \lambda^{k+1}(x)\cdot g(x, w^{k+1}(\theta)(x)),    (6.12)

we find \theta^{k+1} by minimizing the function H^{k+1}(\theta)(x) in \theta. As a result, \theta^{k+1} depends on x, and plugging it back into (6.10) yields \hat{a}^{k+1}(x).

6.2 Splitting up method


As a part of both iterations (6.3) and (6.6) described above, we have to solve a generic linear PDE

\alpha\lambda(x) - D\lambda(x)\cdot G(x) = F(x).    (6.13)

For this we propose a parallel splitting up method (see the footnote below): knowing \lambda^j(x), and writing x = (x_1,\cdots,x_d), we define \lambda^{j+l/d}(x), l = 1,\cdots,d, by

\alpha\lambda^{j+l/d}(x) - G_l(x)\,\frac{\partial\lambda^{j+l/d}(x)}{\partial x_l} = Z^{j+l/d}(x),    (6.14)

where Z^{j+l/d}(x) = \sum_{h\ne l} G_h(x)\,\frac{\partial\lambda^j(x)}{\partial x_h} + F(x). Then \lambda^{j+1}(x) is defined by

\lambda^{j+1}(x) = \frac{1}{d}\sum_{l=1}^{d}\lambda^{j+l/d}(x).    (6.15)

Note that (6.14) is a one dimensional first order differential equation, which has the explicit solution

\lambda^{j+l/d}(x) = -\int_{-\infty}^{x_l}\frac{Z^{j+l/d}(\xi_l,\bar{x}_l)}{G_l(\xi_l,\bar{x}_l)}\exp\Big(\alpha\int_{\xi_l}^{x_l}\frac{d\eta_l}{G_l(\eta_l,\bar{x}_l)}\Big)\,d\xi_l.    (6.16)

Here, we have used the notation x = (x_l, \bar{x}_l), where \bar{x}_l \in R^{d-1}.

Footnote: The parallel splitting up method not only reduces the original problem into a number of separable one dimensional linear problems, but also enables us to compute all these one dimensional problems in parallel, since the calibrations of the fractional steps are independent of each other (Lu et al., 1991).
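As a low-dimensional sanity check for (6.13), one can use, instead of the splitting-up formula (6.16), the characteristic representation \lambda(x) = \int_0^{\infty} e^{-\alpha s} F(y(s))\,ds along dy/ds = G(y), y(0) = x, which is the same type of representation that appears in (7.14) below. The following sketch, with an illustrative contractive G and smooth F (hypothetical choices, not from the text), verifies the PDE residual numerically; it is a check on (6.13), not an implementation of the splitting-up method itself.

```python
import numpy as np

# Characteristic representation of (6.13): lam(x) = int_0^inf exp(-alpha s) F(y(s)) ds,
# with dy/ds = G(y), y(0) = x.  G, F and alpha below are illustrative.
alpha = 2.0
G = lambda x: -x                        # a contractive drift, so y(s) stays bounded
F = lambda x: np.cos(x[0]) + x[1] ** 2

def lam(x, S=20.0, n_steps=4000):
    """Quadrature of the discounted integral along the characteristic through x."""
    ds = S / n_steps
    y, total = np.array(x, dtype=float), 0.0
    for i in range(n_steps):
        total += np.exp(-alpha * i * ds) * F(y) * ds
        y = y + ds * G(y)               # explicit Euler step along dy/ds = G(y)
    return total

# Residual alpha*lam - Dlam.G - F at a point, with Dlam by finite differences.
x = np.array([0.7, -0.3]); h = 1e-4
grad = np.array([(lam(x + h * e) - lam(x - h * e)) / (2 * h) for e in np.eye(2)])
print(alpha * lam(x) - grad @ G(x) - F(x))   # small, up to discretization error
```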

7 Convergence results
7.1 Setting of the problem
We take

g(x,a) = A(x) + Ba,    (7.1)

such that

x \mapsto A(x): R^n \to R^n, \quad |A(x)| \le \gamma|x|, \quad B \in L(R^d; R^n),

and

\|DA(x_1) - DA(x_2)\| \le \frac{b\,|x_1 - x_2|}{1 + |x_1| + |x_2|}.

The pay-off functional is

f(x,a) = F(x) + \frac{1}{2}a^* N a,    (7.2)

with

x \mapsto F(x): R^n \to R, \quad |DF(x)| \le M|x|, \quad N \in L(R^d; R^d) symmetric and invertible.

We then have

\hat{a}(x) = -N^{-1}B^*\lambda(x),    (7.3)

and thus the second relation in (6.1) becomes

\alpha\lambda(x) - DA^*(x)\lambda(x) - D\lambda(x)\big(A(x) - BN^{-1}B^*\lambda(x)\big) = DF(x).    (7.4)

7.2 Preliminaries
We will need conditions on \alpha and b: \alpha sufficiently large and b sufficiently small. We first assume that

\alpha - 2\gamma > 2\sqrt{M\|BN^{-1}B^*\|}.    (7.5)

We set

\beta = \frac{(\alpha - 2\gamma)^2}{4M\|BN^{-1}B^*\|} > 1.    (7.6)

We define

\ell = \frac{\alpha - 2\gamma - \sqrt{(\alpha - 2\gamma)^2 - 4M\|BN^{-1}B^*\|}}{2\|BN^{-1}B^*\|},    (7.7)

which is a solution of

\ell^2\|BN^{-1}B^*\| - (\alpha - 2\gamma)\ell + M = 0.    (7.8)

We next need to solve the equation

\nu^2\|BN^{-1}B^*\| - (\alpha - 2\gamma)\nu + (M + b\ell) = 0, \qquad \nu > \ell.    (7.9)

For this we need

b < \sqrt{\frac{\|BN^{-1}B^*\|}{M}}\;\frac{\beta - 1}{\sqrt{\beta} - \sqrt{\beta - 1}}.    (7.10)

We then define

\nu = \frac{\alpha - 2\gamma - \sqrt{(\alpha - 2\gamma)^2 - 4(M + b\ell)\|BN^{-1}B^*\|}}{2\|BN^{-1}B^*\|}.    (7.11)

7.3 Main result


We can state

Theorem 2. We assume (7.1), (7.2), (7.5), and (7.10). Then Eq. (7.4) has a unique solution such that

|\lambda(x)| \le \ell|x|, \qquad \|D\lambda(x)\| \le \nu.    (7.12)

Proof. We will use a contraction mapping argument. Let \lambda(x) be a vector of functions satisfying (7.12). We shall define a function \Lambda(x) as follows. We consider the differential equation

\frac{dy}{ds} = A(y) - BN^{-1}B^*\lambda(y), \qquad y(0) = x.    (7.13)

Since A(x) and \lambda(x) are uniformly Lipschitz, this equation has a unique solution. We then define \Lambda(x) by the formula

\Lambda(x) = \int_0^{+\infty}\exp(-\alpha s)\,\big[DF(y(s)) + DA^*(y(s))\lambda(y(s))\big]\,ds.    (7.14)

This integral is well defined. Indeed, from (7.13), the second assumption in (7.1) and the first property in (7.12), we can assert that

|y(s)| \le |x|\exp\big((\gamma + \|BN^{-1}B^*\|\ell)s\big),    (7.15)
and, from (7.14), we get

|\Lambda(x)| \le (M + \ell\gamma)\int_0^{+\infty}\exp(-\alpha s)\,|y(s)|\,ds
\le (M + \ell\gamma)|x|\int_0^{+\infty}\exp\big(-(\alpha - \gamma - \|BN^{-1}B^*\|\ell)s\big)\,ds
= \frac{(M + \ell\gamma)|x|}{\alpha - \gamma - \|BN^{-1}B^*\|\ell} = \ell|x|,    (7.16)

from the definition of \ell in (7.7) and (7.8). In particular, \Lambda(x) satisfies the first property in (7.12). We next differentiate the formula (7.14) in x. We set Y(s) = D_x y(s). From Eq. (7.13) we obtain

\frac{dY(s)}{ds} = DA(y(s))Y(s) - BN^{-1}B^* D\lambda(y(s))Y(s), \qquad Y(0) = I,    (7.17)

then

D\Lambda(x) = \int_0^{+\infty}\exp(-\alpha s)\,\big[D^2F(y(s)) + DA(y(s))D\lambda(y(s)) + D^2A(y(s))\lambda(y(s))\big]\,Y(s)\,ds.    (7.18)

So

\|D\Lambda(x)\| \le (M + \gamma\nu + b\ell)\int_0^{+\infty}\exp(-\alpha s)\,\|Y(s)\|\,ds,

and from (7.17) it follows that

\|D\Lambda(x)\| \le (M + \gamma\nu + b\ell)\int_0^{+\infty}\exp\big(-(\alpha - \gamma - \|BN^{-1}B^*\|\nu)s\big)\,ds = \frac{M + \gamma\nu + b\ell}{\alpha - \gamma - \|BN^{-1}B^*\|\nu} = \nu,

and thus \Lambda(x) satisfies the second condition in (7.12).

We consider the Banach space of functions \lambda(x): R^n \to R^n, with the norm

\|\lambda\| = \sup_x \frac{|\lambda(x)|}{|x|},

and the closed subset

C = \{\lambda(\cdot)\;|\; \|\lambda\| \le \ell,\ \|D\lambda(x)\| \le \nu,\ \forall x\}.

We consider the map T: \lambda \mapsto \Lambda defined by the formula (7.14). We want to show that it is a contraction from C to C. We pick two functions \lambda_1, \lambda_2 in C, let
y_1(s), y_2(s) be defined by (7.13) with \lambda = \lambda_1, \lambda_2 respectively, and \Lambda_1 = T\lambda_1, \Lambda_2 = T\lambda_2. We have

\frac{d}{ds}(y_1 - y_2) = A(y_1) - A(y_2) - BN^{-1}B^*\big(\lambda_1(y_1) - \lambda_2(y_2)\big).

Noting that

|\lambda_1(y_1) - \lambda_2(y_2)| \le |\lambda_1(y_1) - \lambda_2(y_1)| + |\lambda_2(y_1) - \lambda_2(y_2)| \le \|\lambda_1 - \lambda_2\|\,|y_1| + \nu|y_1 - y_2|,

we get, using the estimate (7.15),

\frac{d}{ds}|y_1 - y_2| \le (\gamma + \nu\|BN^{-1}B^*\|)|y_1 - y_2| + \|BN^{-1}B^*\|\,\|\lambda_1 - \lambda_2\|\,|x|\exp\big((\gamma + \ell\|BN^{-1}B^*\|)s\big),
(y_1 - y_2)(0) = 0,

therefore

|y_1(s) - y_2(s)| \le \|BN^{-1}B^*\|\,\|\lambda_1 - \lambda_2\|\,|x|\exp\big((\gamma + \nu\|BN^{-1}B^*\|)s\big)\int_0^s\exp\big(-(\nu - \ell)\|BN^{-1}B^*\|\tau\big)\,d\tau.

Finally, we obtain that

|y_1(s) - y_2(s)| \le \frac{\|\lambda_1 - \lambda_2\|\,|x|}{\nu - \ell}\Big(\exp\big((\gamma + \nu\|BN^{-1}B^*\|)s\big) - \exp\big((\gamma + \ell\|BN^{-1}B^*\|)s\big)\Big).    (7.19)

Next, from the definition of \Lambda(x), we get

\Lambda_1(x) - \Lambda_2(x) = \int_0^{+\infty}\exp(-\alpha s)\,\big[DF(y_1(s)) - DF(y_2(s)) + DA^*(y_1(s))\lambda_1(y_1(s)) - DA^*(y_2(s))\lambda_2(y_2(s))\big]\,ds.

We write

DA^*(y_1(s))\lambda_1(y_1(s)) - DA^*(y_2(s))\lambda_2(y_2(s)) = \big(DA^*(y_1(s)) - DA^*(y_2(s))\big)\lambda_1(y_1(s)) + DA^*(y_1(s))\big(\lambda_1(y_1(s)) - \lambda_2(y_2(s))\big).

From the third assumption in (7.1) we obtain

\big|\big(DA^*(y_1(s)) - DA^*(y_2(s))\big)\lambda_1(y_1(s))\big| \le b\ell\,|y_1(s) - y_2(s)|.
Moreover

|\lambda_1(y_1(s)) - \lambda_2(y_2(s))| \le \nu|y_1(s) - y_2(s)| + \|\lambda_1 - \lambda_2\|\,|x|\exp\big((\gamma + \ell\|BN^{-1}B^*\|)s\big).

Collecting results, we can write

|\Lambda_1(x) - \Lambda_2(x)| \le (M + b\ell + \gamma\nu)\int_0^{+\infty}\exp(-\alpha s)\,|y_1(s) - y_2(s)|\,ds + \frac{\gamma\|\lambda_1 - \lambda_2\|\,|x|}{\alpha - \gamma - \ell\|BN^{-1}B^*\|}.

Making use of (7.19), it follows that

|\Lambda_1(x) - \Lambda_2(x)| \le \frac{\gamma\|\lambda_1 - \lambda_2\|\,|x|}{\alpha - \gamma - \ell\|BN^{-1}B^*\|} + (M + b\ell + \gamma\nu)\,\frac{\|\lambda_1 - \lambda_2\|\,|x|}{\nu - \ell}\Big(\frac{1}{\alpha - \gamma - \|BN^{-1}B^*\|\nu} - \frac{1}{\alpha - \gamma - \|BN^{-1}B^*\|\ell}\Big).

Rearranging and using the definition of \nu, see (7.9), we finally obtain that

\|\Lambda_1 - \Lambda_2\| \le \frac{\gamma + \|BN^{-1}B^*\|\nu}{\alpha - \gamma - \|BN^{-1}B^*\|\ell}\,\|\lambda_1 - \lambda_2\|.    (7.20)

We need to check that

\frac{\gamma + \|BN^{-1}B^*\|\nu}{\alpha - \gamma - \|BN^{-1}B^*\|\ell} < 1,    (7.21)

which is equivalent to

\alpha - 2\gamma - \|BN^{-1}B^*\|(\ell + \nu) > 0,

which is true from the definitions of \ell and \nu, see (7.7) and (7.11). If we call \lambda(x) the unique fixed point of T, it satisfies (7.12) and the equation

\lambda(x) = \int_0^{+\infty}\exp(-\alpha s)\,\big[DF(y(s)) + DA^*(y(s))\lambda(y(s))\big]\,ds.    (7.22)

It is standard to check that (7.12) and (7.22) together are equivalent to (7.12) and (7.4). This concludes the proof of the Theorem.
7.4 Algorithm
We can write the algorithm (6.3), which leads to

\alpha\lambda^{k+1}(x) - D\lambda^{k+1}(x)\big(A(x) - BN^{-1}B^*\lambda^k(x)\big) = DF(x) + DA^*(x)\lambda^k(x).    (7.23)

From the contraction property obtained in Theorem 2, we immediately obtain

Corollary 3. Under the assumptions of Theorem 2, if we start the iteration with \lambda^0 such that |\lambda^0(x)| \le \ell|x| and \|D\lambda^0(x)\| \le \nu, we have

\|\lambda^k - \lambda\| \to 0,    (7.24)

where \lambda is the solution of (7.4).

7.5 Linear quadratic case


We take A(x) = Ax and F(x) = \frac{1}{2}x^* M x. Then Eq. (7.4) becomes

\alpha\lambda(x) = Mx + A^*\lambda(x) + D\lambda(x)\big(Ax - BN^{-1}B^*\lambda(x)\big),    (7.25)

and its solution is \lambda(x) = Px, with P the solution of the Riccati equation

\alpha P = M + A^*P + PA - PBN^{-1}B^*P.    (7.26)

We have \gamma = \|A\| and b = 0. Assumption (7.5) becomes

\alpha > 2\|A\| + 2\sqrt{M\|BN^{-1}B^*\|}.    (7.27)

We have

\ell = \nu = \frac{\alpha - 2\|A\| - \sqrt{(\alpha - 2\|A\|)^2 - 4M\|BN^{-1}B^*\|}}{2\|BN^{-1}B^*\|}.    (7.28)

The iteration (7.23) becomes \lambda^k(x) = P^k x, with

P^{k+1}\big(\alpha I - A + BN^{-1}B^*P^k\big) = M + A^*P^k,    (7.29)

and if \|P^0\| \le \ell, we obtain \|P^k - P\| \to 0 as k \to +\infty.
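The iteration (7.29) is straightforward to implement. The sketch below uses small random matrices chosen only for illustration, with \alpha taken large enough for (7.27) to hold, and checks the limit against the Riccati equation (7.26) by computing its residual.

```python
import numpy as np

rng = np.random.default_rng(7)

# Iteration (7.29) for the linear quadratic case; matrices are random, purely illustrative.
n, d = 6, 3
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, d))
M = (lambda R: R @ R.T)(rng.normal(size=(n, n)))    # symmetric positive semidefinite
N = np.eye(d)                                       # symmetric and invertible
S = B @ np.linalg.inv(N) @ B.T                      # B N^{-1} B^*

# Choose alpha large enough for (7.27): alpha > 2||A|| + 2 sqrt(||M|| * ||S||).
alpha = 2 * np.linalg.norm(A, 2) + 2 * np.sqrt(np.linalg.norm(M, 2) * np.linalg.norm(S, 2)) + 1.0

P = np.zeros((n, n))                                # P^0 = 0 satisfies ||P^0|| <= ell
for k in range(200):
    # P^{k+1} (alpha I - A + B N^{-1} B^* P^k) = M + A^* P^k, i.e. (7.29)
    P_new = np.linalg.solve((alpha * np.eye(n) - A + S @ P).T, (M + A.T @ P).T).T
    if np.linalg.norm(P_new - P) < 1e-12:
        P = P_new
        break
    P = P_new

# Residual of the Riccati equation (7.26): alpha P = M + A^* P + P A - P S P
residual = alpha * P - (M + A.T @ P + P @ A - P @ S @ P)
print(k, np.linalg.norm(residual))
```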

8 Numerical results
We now present numerical tests for the algorithm. We consider the following
values for m, n: m = 10, n = 30; the matrices M, N , A, B are chosen arbitrarily
and their values are not displayed here.
In Fig. 1, we choose \alpha = 1 and we pick 4 samples of the initial guess P^{(0)}. These choices do not satisfy the two conditions (7.5) and (7.6). Using the results of our Python code, we display the differences \|P^{(5)} - P^{(6)}\|, \|P^{(10)} - P^{(11)}\|, \|P^{(15)} - P^{(16)}\|, \|P^{(20)} - P^{(21)}\|, \|P^{(25)} - P^{(26)}\|. We can see that as the number of iterations k increases, the difference \|P^{(k)} - P^{(k+1)}\| does not become small. This shows that the algorithm does not converge.

FIGURE 1 Solving for P: 4 tests where \alpha = 1.

In Fig. 2, we choose \alpha = 1770.3688 and 0 < \|P^{(0)}\| < 6.8153 for the 4 samples of the initial guess P^{(0)}. These choices satisfy the two conditions (7.5) and (7.6). We can see that as the number of iterations k increases, the difference \|P^{(k)} - P^{(k+1)}\| becomes very small; after 15 iterations, these differences are essentially 0.

FIGURE 2 Solving for P: 4 tests where \alpha = 1770.3688.

In Fig. 3, we choose \alpha = 225. In this test, the condition (7.5) corresponds to \alpha > 270. Using the results of our Python code, we display the difference \|P^{(k)} - P^{(k+1)}\|. We can see that the difference does not converge to 0 even after 10000 iterations. Therefore, the condition (7.5) is quite sharp.

FIGURE 3 Solving for P: 4 tests where \alpha = 225.

In Fig. 4, we take the first choice of P^{(0)} arbitrarily, and vary the value of \alpha over 1, 5, 20, 100, 200, 300. Using the results of our Python code, we plot the values of \|P^{(5)} - P^{(6)}\|, \|P^{(10)} - P^{(11)}\|, \|P^{(15)} - P^{(16)}\|, \|P^{(20)} - P^{(21)}\|, \|P^{(25)} - P^{(26)}\| for each value of \alpha as a curve. We can see that the algorithm converges very fast when \alpha is big: the curves for \alpha = 100, 200, 300 (red, purple and brown) are almost the 0-line.

FIGURE 4 Solving for P: The convergence rate for different values of \alpha.

Acknowledgment
Alain Bensoussan acknowledges the financial support from the National Science Founda-
tion under grants DMS-1612880, DMS-1905449, and the Research Grant Council of Hong
Kong Special Administrative Region under grant GRF 11303316. Minh-Binh Tran is partially
supported by NSF Grant DMS-1854453, SMU URC Grant 2020, SMU DCII Research Clus-
ter Grant, Dedman College Linking Fellowship, Alexander von Humboldt Fellowship. Dinh
Phan Cao Nguyen and Minh-Binh Tran would like to thank Prof. T. Hagstrom and Prof. A.
Aceves for the computational resources. Phillip Yam acknowledges the financial support from
HKGRF-14300717 with the project title “New kinds of Forward-backward Stochastic Systems
with Applications”, HKGRF-14300319 with the project title “Shape-constrained Inference:
Testing for Monotonicity”, and Direct Grant for Research 2014/15 (Project No. 4053141) of-
fered by CUHK. Xiang Zhou acknowledges the support of Hong Kong RGC GRF grants
11337216 and 11305318.

References
Bensoussan, A., 2018. Estimation and Control of Dynamical Systems. Interdisciplinary Applied
Mathematics. Springer International Publishing.
Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., Holtham, E., 2018a. Reversible architec-
tures for arbitrarily deep residual neural networks. In: The Thirty-Second AAAI Conference on
Artificial Intelligence (AAAI-18), p. 2811.
Chang, Bo, Meng, Lili, Haber, Eldad, Tung, Frederick, Begert, David, 2018b. Multi-level residual
networks from dynamical systems view.
Chen, Ricky T.Q., Rubanova, Yulia, Bettencourt, Jesse, Duvenaud, David, 2018. Neural ordinary
differential equations. In: Advances in Neural Information Processing Systems,
pp. 6571–6583.
Chiuso, A., Pillonetto, G., 2019. System identification: a machine learning perspective. Annual Re-
view of Control, Robotics, and Autonomous Systems 2 (1), 281–304.
E, W., 2017. A proposal on machine learning via dynamical systems. Communications in Mathe-
matics and Statistics 5 (1), 1–11.
E, W., Han, J., Li, Q., 2019a. A mean-field optimal control formulation of deep learning. Research
in the Mathematical Sciences 6 (1), 1–41.
E, W., Ma, C., Wu, L., 2019b. Machine learning from a continuous viewpoint. arXiv:1912.12777.
Haber, E., Ruthotto, L., 2017. Stable architectures for deep neural networks. Inverse Problems 34
(1), 014004.
Han, J., E, W., 2016. Deep learning approximation for stochastic control problems. arXiv:1611.
07422.
Han, Jiequn, Jentzen, Arnulf, E, Weinan, 2018. Solving high-dimensional partial differential
equations using deep learning. Proceedings of the National Academy of Sciences 115 (34),
8505–8510.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian, 2016. Deep residual learning for image
recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778.
Jordan, M.I., Mitchell, T.M., 2015. Machine learning: trends, perspectives, and prospects. Sci-
ence 349 (6245), 255–260.
Li, Q., Chen, L., Tai, C., E, W., 2017a. Maximum principle based algorithms for deep learning.
Journal of Machine Learning Research 18 (1), 5998–6026.
Li, Q., Hao, S., 2018. An optimal control approach to deep learning and applications to discrete-
weight neural networks. arXiv preprint. arXiv:1803.01299.
Li, Q., Tai, C., E, W., 2017b. Stochastic modified equations and adaptive stochastic gradient algo-
rithms. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70.
JMLR.org, pp. 2101–2110.
Li, Q., Tai, C., E, W., 2019. Stochastic modified equations and dynamics of stochastic gradient
algorithms I: mathematical foundations. Journal of Machine Learning Research 20 (40), 1–40.
Li, Qianxiao, Tai, Cheng, E, Weinan, 2015. Dynamics of stochastic gradient algorithms. arXiv
preprint. arXiv:1511.06251.
Li, Z., Shi, Z., 2017. Deep residual learning and PDEs on manifold. arXiv preprint. arXiv:1708.
05115.
Lu, T., Neittaanmaki, P., Tai, X.-C., 1991. A parallel splitting up method and its application to
Navier-Stokes equations. Applied Mathematics Letters 4, 25–29.
Lu, Y., Zhong, A., Li, Q., Dong, B., 2017. Beyond finite layer neural networks: bridging deep
architectures and numerical differential equations. arXiv preprint. arXiv:1710.10121.
Mei, S., Montanari, A., Nguyen, P.-M., 2018. A mean field view of the landscape of two-layer neural
networks. Proceedings of the National Academy of Sciences 115 (33), E7665–E7671.
Recht, Benjamin, 2019. A tour of reinforcement learning: the view from continuous control. Annual
Review of Control, Robotics, and Autonomous Systems 2 (1), 253–279.
Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez,
Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, Chen, Yutian, Lillicrap,
Timothy, Hui, Fan, Sifre, Laurent, van den Driessche, George, Graepel, Thore, Hassabis, Demis,
2017. Mastering the game of go without human knowledge. Nature 550 (7676), 354.
Sonoda, S., Murata, N., 2017. Double continuum limit of deep neural networks. In: ICML Workshop
Principled Approaches to Deep Learning.
Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction, 2 edition. MIT Press.
Wang, Haoran, Zariphopoulou, Thaleia, Zhou, Xunyu, 2018. Exploration versus exploitation in re-
inforcement learning: a stochastic control approach. arXiv:1812.01552.
