a University of Texas at Dallas, Richardson, TX, United States; b School of Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong; c Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong; d Department of Mathematics, Southern Methodist University, Dallas, TX, United States; e Faculty of Information Technology, Nha Trang University, Nha Trang, Viet Nam; f Department of Mathematics, City University of Hong Kong, Kowloon Tong, Hong Kong
∗ Corresponding author: e-mail address: [email protected]
Contents
1 Introduction
2 Reinforcement learning
  2.1 General concepts
  2.2 Mathematical model without action
  2.3 Approximation
  2.4 Mathematical model with action
3 Control theory and deep learning
  3.1 Supervised learning
  3.2 Deep learning
  3.3 Control theory approach
4 Stochastic gradient descent and control theory
  4.1 Comments
  4.2 Stochastic gradient and MDP
  4.3 Continuous version
5 Machine learning approach of stochastic control problems
  5.1 General theory
  5.2 Parametric approach for the feedback
  5.3 Non-parametric approach for the value function and its gradient
6 Focus on the deterministic case
  6.1 Problem and algorithm
  6.2 Splitting up method
7 Convergence results
  7.1 Setting of the problem
  7.2 Preliminaries
  7.3 Main result
  7.4 Algorithm
  7.5 Linear quadratic case
8 Numerical results
Acknowledgment
References
Abstract
We survey in this chapter the connections between Machine Learning and Control Theory. Control Theory provides useful concepts and tools for Machine Learning. Conversely, Machine Learning can be used to solve large control problems. In the first part of the paper, we develop the connections between reinforcement learning and Markov Decision Processes, which are discrete time control problems. In the second part, we review the concept of supervised learning and its relation with static optimization. Deep learning, which extends supervised learning, can be viewed as a control problem. In the third part, we present the links between stochastic gradient descent and mean field theory. Conversely, in the fourth and fifth parts, we review machine learning approaches to stochastic control problems, and focus on the deterministic case to explain the numerical algorithms more easily.
Keywords
Control theory, Machine learning, Deep learning
MSC Codes
49J15, 49L20, 93E20, 93E99, 68T05, 68T99, 90C39, 90C15
1 Introduction
The Big Data phenomenon is at the origin of a new expansion of Artificial Intelligence. Machine learning (Jordan and Mitchell, 2015) is a way to implement AI, by providing the machine with the capability of learning and decision making, which characterize humans. The fact that humans use algorithms to help perform these two tasks is not new in itself: as soon as computing possibilities appeared, algorithms were developed, and the ambition of AI also came early. During the last decades, however, the momentum has been spectacular, and Machine Learning has become the new Grail. Its introduction has revolutionized all kinds of fields in science, engineering, medicine, and management. Image processing, pattern recognition, text mining, speech recognition, and automatic translation have benefited considerably from this development. An important breakthrough occurred with the methodology of deep neural networks (DNN).
Conceptually, since the objective is to improve the knowledge of the environment and to improve decision making, we are naturally dealing with optimization and statistics. This is clearly apparent in supervised learning. On the other hand, reinforcement learning and DNN add an additional variable, which is time or is ordered like time. Control theory comes in as the framework of dynamic optimization.
Control theory, see for instance Bensoussan (2018), is about how to design optimal actions for dynamical models, in continuous or discrete time. However, it is well acknowledged that numerical computation is the main barrier to putting control theory to work in practice, and many applications are unfortunately limited to the linear quadratic regulator. The curse of dimensionality, as Bellman, the creator of Dynamic Programming, coined it, has been haunting the numerical methods of control theory for quite a long time. It is therefore natural that the new possibilities of ML be considered to overcome the challenge of dimension. This explains why, in the past few years, we have been
witnessing many exciting ideas and innovative results from the perspective of merging the above two research areas, with efforts from different communities such as applied and computational mathematics, optimal control, stochastic optimization, as well as computer science. Researchers from both sides, machine learning and optimal control, have started to explore each other's techniques, tools, and problem formulations. We can roughly divide these works into two categories: control theory for machine learning and machine learning for control theory. Generally speaking, the former refers to the use of control theory as a mathematical tool to formulate and solve theoretical and practical problems in machine learning, such as optimal parameter tuning and the training of neural networks; the latter refers to the use of machine learning practice, such as kernel methods and DNN, to numerically solve complex models in control theory, which can become intractable by traditional methods (Han et al., 2018).
There is ample evidence to support our argument of close connections between machine learning and control theory. We begin with reinforcement learning (RL), which became famous when AlphaGo Zero (Silver et al., 2017) was invented. Reinforcement learning (Sutton and Barto, 2018) is a subfield of machine learning that studies how to use past data to enhance the future manipulation of a dynamical system. The control community targets the same problems as RL. However, the RL and control communities are practically disjoint due to their distinctive languages and cultures; see Recht (2019) for a recent effort to bridge this gap.
In RL, one of the simplest strategies is to first estimate a model from the given data, which is called system identification in the control community. This can be achieved by supervised learning (Chiuso and Pillonetto, 2019). In a second stage, the dynamic programming principle of control theory can then be applied to derive many popular RL algorithms such as Q-learning and Temporal Difference algorithms (Sutton and Barto, 2018).
As said above, Dynamic Programming is hard to implement numerically for high-dimensional dynamical systems. Machine learning and DNN can be helpful. For example, Han et al. (2018) proposed an efficient machine learning algorithm using DNN to approximate the value function in the high-dimensional Hamilton-Jacobi-Bellman equation, based on the equivalent stochastic control formulation of the PDE.
The bond that ties machine learning and control theory has been strengthened further in recent years by the continuous-time perspective, in various contexts (E, 2017; E et al., 2019b; Recht, 2019). For example, the deep residual neural network (ResNet) (He et al., 2016) can be recast as a dynamical system in which the network layers are viewed as a time discretization (Chang et al., 2018a,b; Chen et al., 2018; Haber and Ruthotto, 2017; Li and Hao, 2018; Li and Shi, 2017; Sonoda and Murata, 2017). Based on this point of view, the machine learning algorithm for ResNet can be viewed as a static and dynamic optimization problem for an ordinary differential equation controlled by the network parameters (E et al., 2019a). This continuous model immediately triggered several new
In the next section, we study a Stochastic Control Problem, in which the state is that of a controlled diffusion. We then propose a Machine Learning approach for this problem. We finally focus on the deterministic case in Section 6 to simplify the theory. We provide some related theoretical results, together with a few high-dimensional numerical illustrations, to demonstrate the effectiveness of the algorithms.
2 Reinforcement learning
2.1 General concepts
In the language of Control Theory, we consider a dynamical system which evolves in an uncertain environment. The evolution of this system is called a process, which can be characterized by its state. A controller decides on a strategy of actions, called a feedback, and at each time there is a cost or profit attached to the current state and the current action. In the language of RL, every time an action is made, the controller receives a reward. The controller will try to choose the actions such that the sum of rewards is maximized. Since time is discrete, the control problem is called a Markov Decision Process (MDP) and can be solved by the Dynamic Programming approach. The reward is then a function of the state and the action.
u = f + αΦu, (2.4)
where Φ denotes the transition operator of the Markov chain, (Φv)(x) = E[v(X_{n+1}) | X_n = x].
2.3 Approximation
Our main task now is to compute the function u(x). Eqs. (2.4) and (2.6) are explicit and straightforward. However, the challenge is that when the dimension d of X is large, these formulations are not of practical use. Here comes the other aspect of machine learning: how to approximate a function given by formulae (2.3) or (2.4). Since Supervised Learning does exactly that, approximating a function, we follow the ideas of SL. There are basically two methods: the parametric method and the non-parametric method. In the parametric method, we look for an approximation of the form
u(x) ≈ Σ_{i=1}^{I} θ_i φ_i(x), (2.7)
where the φ_i(x) are given functions, chosen so that the family {φ_i(·)}_{i=1}^{∞} forms a basis of the functional space to which u(x) belongs, and the θ_i are coefficients to be determined.
We need to compute the parameters minimizing the error
‖ Σ_{i=1}^{I} θ_i (φ_i − αΦφ_i) − f ‖², (2.8)
where ‖·‖ is the sup-norm, ‖G‖ = sup_x |G(x)|. To guarantee the existence and uniqueness of the parameters, we minimize instead the quadratic functional with a quadratic regularization,
γ Σ_{i=1}^{I} θ_i² + ‖ Σ_{i=1}^{I} θ_i (φ_i − αΦφ_i) − f ‖². (2.9)
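As a concrete illustration of the parametric method, the sketch below fits the coefficients θ_i of (2.7) by minimizing a discretized version of (2.9) for a finite-state chain, where Φ reduces to a transition matrix. The nearest-neighbour chain, the running cost, the monomial basis, and the replacement of the sup-norm by a sum of squares over the states are all illustrative assumptions, not the paper's setup.

import numpy as np

# Minimal sketch of the parametric approximation (2.7)-(2.9) for a finite Markov chain,
# where u = f + alpha * P u and P (a hypothetical transition matrix) plays the role of Phi.
# The basis functions phi_i(x) = x^i and all numerical values are illustrative, and the
# sup-norm in (2.8)-(2.9) is replaced by a sum of squares over the states.

d, I, alpha, gamma = 50, 8, 0.9, 1e-3            # number of states, basis size, discount, ridge weight
x = np.linspace(0.0, 1.0, d)                     # states embedded in [0, 1]
P = np.zeros((d, d))                             # nearest-neighbour random walk (illustrative)
for s in range(d):
    P[s, max(s - 1, 0)] += 0.5
    P[s, min(s + 1, d - 1)] += 0.5
f = 1.0 + np.sin(3.0 * x)                        # running cost f(x)
Phi_basis = np.column_stack([x**i for i in range(I)])   # phi_i evaluated at every state

# Columns of A are phi_i - alpha * P phi_i, so the residual in (2.8) is A @ theta - f
# and (2.9) becomes a ridge-regularized least-squares problem for theta.
A = Phi_basis - alpha * (P @ Phi_basis)
theta = np.linalg.solve(A.T @ A + gamma * np.eye(I), A.T @ f)

u_approx = Phi_basis @ theta                              # approximation (2.7)
u_exact = np.linalg.solve(np.eye(d) - alpha * P, f)       # exact solution of u = f + alpha P u
print("max approximation error:", np.abs(u_approx - u_exact).max())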
For the non-parametric approach, one can rely on Monte Carlo simulation: we simulate N trajectories (X_n^ν)_{n≥1}, ν = 1, ..., N, of the Markov chain starting from X_1 = x, and use
u(x) ≈ (1/N) Σ_{ν=1}^{N} Σ_{n=1}^{+∞} α^{n−1} f(X_n^ν). (2.10)
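A minimal sketch of the Monte Carlo estimate (2.10) is given below, again for an illustrative finite-state chain; the infinite sum is truncated once α^n falls below a tolerance, and the result is compared with the exact solution of (2.4).

import numpy as np

# Monte Carlo sketch of (2.10): average the discounted costs of N simulated trajectories
# started at a given state.  The chain and the cost f are synthetic, and the infinite
# series is truncated at a horizon where alpha**n is negligible.

rng = np.random.default_rng(1)
d, alpha, N = 50, 0.9, 500
P = rng.random((d, d)); P /= P.sum(axis=1, keepdims=True)
f = rng.random(d)
horizon = int(np.log(1e-6) / np.log(alpha))               # truncation of the series

def u_mc(x0):
    """Monte Carlo estimate of u(x0) = E[ sum_n alpha^(n-1) f(X_n) | X_1 = x0 ]."""
    total = 0.0
    for _ in range(N):
        x, disc = x0, 1.0
        for _ in range(horizon):
            total += disc * f[x]                          # accumulate alpha^(n-1) f(X_n)
            disc *= alpha
            x = rng.choice(d, p=P[x])                     # draw X_{n+1} ~ P(X_n, .)
    return total / N

u_exact = np.linalg.solve(np.eye(d) - alpha * P, f)       # exact value from (2.4)
print("Monte Carlo:", u_mc(0), "  exact:", u_exact[0])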
Remark 1. In RL, one claims that a significant difference with MDP is that the Markov chain may not be known. The controller, however, makes trials, which is similar to Monte Carlo, without referring to a selection of trajectories according to a given probability transition.
It is also interesting to introduce the cost (the Q-function) when the first action is arbitrary and the following actions are optimized. Writing Φ^a for the transition operator of the chain when action a is used, we arrive at
Q(x, a) = f(x, a) + α[Φ^a (inf_{a'} Q(·, a'))](x). (2.18)
There are two basic types of iteration to solve the above Bellman equation: value iteration and policy iteration. The value iteration is defined by
u^{k+1}(x) = inf_a [ f(x, a) + αΦ^a u^k(x) ], (2.19)
with u^0(x) = 0. When f(x, a) is bounded, the solution of the Bellman equation is unique and the sequence u^k(x) converges to the value function monotonically. When f(x, a) is not bounded, the solution of the Bellman equation is not unique; the sequence u^k(x) still converges monotonically to the value function, which is the minimum solution. We can also interpret u^k(x) as the value function of the control problem with k periods. To see this, we define
J^k_{a(·)}(x) = E[ Σ_{n=1}^{k} α^{n−1} f(X_n, a(X_n)) | X_1 = x ], (2.20)
then
u^k(x) = inf_{a(·)} J^k_{a(·)}(x). (2.21)
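The sketch below runs value iteration on a small synthetic MDP with finitely many states and actions; the transition matrices and costs are illustrative, and the greedy feedback is read off from the resulting Q-values.

import numpy as np

# Value iteration, u^0 = 0 and u^{k+1}(x) = min_a [ f(x, a) + alpha * (P_a u^k)(x) ],
# on a small synthetic MDP.  P[a, x, y] is the transition probability under action a.

rng = np.random.default_rng(2)
d, n_actions, alpha = 20, 4, 0.9
P = rng.random((n_actions, d, d)); P /= P.sum(axis=2, keepdims=True)
f = rng.random((d, n_actions))                           # cost f(x, a)

u = np.zeros(d)                                          # u^0 = 0
for k in range(500):
    Q = f + alpha * np.einsum('axy,y->xa', P, u)         # Q[x, a] = f(x, a) + alpha E[u(X') | x, a]
    u_next = Q.min(axis=1)
    if np.abs(u_next - u).max() < 1e-10:                 # monotone convergence to the value function
        break
    u = u_next

policy = Q.argmin(axis=1)                                # greedy feedback a(x) extracted from Q
print("iterations:", k, " value at state 0:", u[0])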
On the other hand, the policy iteration technique starts from a given feedback control a^k(x) and solves the linear (fixed point) problem, similar to (2.4),
u^{k+1}(x) = f(x, a^k(x)) + αΦ^{a^k(x)} u^{k+1}(x). (2.22)
in which the coefficient ρ^k is chosen such that it solves the following scalar optimization problem
By definition,
u^{k+1}(x) = Q^{k+1}(x, a^k(x)).
We can then, instead of solving the linear problem for u^{k+1}(x), use the approximation Q^k(x, a^k(x)). We proceed as follows: knowing a^k(x) and Q^k(x, a), we define
ū^{k+1}(x) := Q^k(x, a^k(x)),
Q^{k+1}(x, a) = f(x, a) + αΦ^a ū^{k+1}(x), (2.25)
and a^{k+1}(x) can then be obtained approximately through (2.23) and (2.24). In the above procedure, we start with Q^0(x, a) = f(x, a), and a^0(x) is chosen to be the minimizer of f(x, a).
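For a finite-action toy MDP, the procedure (2.25) can be sketched as below; since the action set is finite, the policy improvement is done by exact minimization of Q^{k+1}(x, ·) rather than by the gradient step (2.23)-(2.24), and all data are synthetic.

import numpy as np

# Approximate policy iteration via the Q-function, following (2.25):
#   u_bar^{k+1}(x) = Q^k(x, a^k(x)),
#   Q^{k+1}(x, a)  = f(x, a) + alpha * (P_a u_bar^{k+1})(x),
# with Q^0 = f and a^0 the minimizer of f(x, .).  The improved feedback is taken here as
# the exact minimizer of Q^{k+1}(x, .), which is possible because the action set is finite.

rng = np.random.default_rng(3)
d, n_actions, alpha = 20, 4, 0.9
P = rng.random((n_actions, d, d)); P /= P.sum(axis=2, keepdims=True)
f = rng.random((d, n_actions))

Q = f.copy()                                              # Q^0(x, a) = f(x, a)
policy = f.argmin(axis=1)                                 # a^0(x)
for k in range(100):
    u_bar = Q[np.arange(d), policy]                       # u_bar^{k+1}(x) = Q^k(x, a^k(x))
    Q = f + alpha * np.einsum('axy,y->xa', P, u_bar)      # Q^{k+1}(x, a)
    new_policy = Q.argmin(axis=1)                         # improved feedback a^{k+1}(x)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("policy stabilized after", k, "updates")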
This is the well-known parametric method. In the simplest case of neural networks, the function f(x; θ) is defined as follows. We first introduce X̄ = χ(x), where χ : R^d → R^n; of course, χ can be the identity. Now choose σ to be a scalar function, called the activation function, W to be a matrix in L(R^n; R^n), and b a vector in R^n. The pair (W, b) represents the parameter θ. We then define the vector X by
X_i = σ( Σ_{j=1}^{n} W_ij X̄_j + b_i ), (3.2)
and
f(x; θ) = g(X), (3.3)
where g : R^n → R is the output function. The minimization in (3.1) is performed by a gradient method with iterative application of the chain rule.
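The sketch below implements this single-layer model and the gradient method on a squared loss; the choice χ = identity, σ = tanh, a fixed linear output map g, and the synthetic data are all illustrative assumptions.

import numpy as np

# One-layer parametric model (3.2)-(3.3): X = sigma(W X_bar + b), f(x; theta) = g(X),
# trained by plain gradient descent on 0.5 * mean squared error (the minimization in (3.1)).
# Here chi is the identity on the input, sigma = tanh, and g is a fixed linear map.

rng = np.random.default_rng(4)
d, n, M, lr = 3, 16, 200, 0.1
x = rng.normal(size=(M, d))
y = np.sin(x.sum(axis=1))                        # illustrative target values y^m

W = rng.normal(scale=0.5, size=(n, d))           # parameter theta = (W, b)
b = np.zeros(n)
g = rng.normal(scale=0.5, size=n)                # fixed output map g

for step in range(3000):
    X = np.tanh(x @ W.T + b)                     # X_i = sigma(sum_j W_ij x_j + b_i)
    err = X @ g - y                              # f(x^m; theta) - y^m
    delta = np.outer(err, g) * (1.0 - X**2)      # chain rule: d(loss)/d(pre-activation)
    W -= lr * delta.T @ x / M                    # gradient step on W
    b -= lr * delta.mean(axis=0)                 # gradient step on b

print("final loss:", 0.5 * np.mean(err**2))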
In the non-parametric method, in particular the kernel method, one chooses a functional space H to which the approximation of F(x) belongs. One then obtains that approximation f(x) by solving
min_{f ∈ H} γ‖f‖²_H + Σ_{m=1}^{M} (f(x^m) − y^m)². (3.4)
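When H is a reproducing kernel Hilbert space, (3.4) reduces to a finite linear system by the representer theorem; the sketch below uses a Gaussian kernel, with the data, kernel width and regularization weight chosen purely for illustration.

import numpy as np

# Kernel (non-parametric) method for (3.4): the minimizer over a reproducing kernel
# Hilbert space has the form f(x) = sum_m c_m K(x, x^m), and the coefficients solve
# (K + gamma I) c = y.  Gaussian kernel, synthetic data.

rng = np.random.default_rng(5)
M, d, gamma, width = 100, 2, 1e-2, 0.5
X = rng.normal(size=(M, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=M)    # noisy samples y^m of an unknown F

def kernel(A, B):
    """Gaussian kernel matrix K(a, b) = exp(-|a - b|^2 / (2 width^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * width**2))

K = kernel(X, X)
c = np.linalg.solve(K + gamma * np.eye(M), y)     # regularized system coming from (3.4)

def f_hat(x_new):
    """Evaluate the kernel approximation at new points (rows of x_new)."""
    return kernel(np.atleast_2d(x_new), X) @ c

print("training residual:", np.abs(K @ c - y).max())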
X^{k+1} = f_k(X^k; θ^k),  k = 0, ..., K − 1, (3.5)
where
f_k(X^k; θ^k) := σ(W^{(k+1)} X^k + b^{(k+1)}),
and
X^0 = χ(x), (3.6)
with
f(x, θ) = g(X^K). (3.7)
The parameter θ is the collection {θ^0, ..., θ^{K−1}}. We have written the case of several layers of neural networks, but it is just an example.
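A minimal forward pass of the layered model (3.5)-(3.7) is sketched below; σ = tanh, a zero-padding χ, and random untrained weights are assumptions made only to show the recursion, not the training procedure discussed next.

import numpy as np

# Forward recursion (3.5)-(3.7): X^{k+1} = sigma(W^{(k+1)} X^k + b^{(k+1)}), X^0 = chi(x),
# f(x, theta) = g(X^K).  The weights are random placeholders; determining them is the
# control problem described in the text.

rng = np.random.default_rng(6)
d, n, K = 4, 16, 5
chi = lambda x: np.concatenate([x, np.zeros(n - d)])      # embed R^d into R^n
theta = [(rng.normal(scale=0.3, size=(n, n)), np.zeros(n)) for _ in range(K)]
g = rng.normal(size=n)                                    # linear output map

def forward(x):
    X = chi(x)                                            # X^0
    for W, b in theta:                                    # layers k = 0, ..., K-1
        X = np.tanh(W @ X + b)                            # X^{k+1} = f_k(X^k; theta^k)
    return g @ X                                          # f(x, theta) = g(X^K)

print(forward(rng.normal(size=d)))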
The approximation of F(x) is then f(x, θ) = g(X_T). The loss in this scenario is (y − g(X_T))² = Φ(X_T), recalling that y will be a known value. The idea is to consider θ_t as a control, and we want to minimize an analog of (3.4), which is expressed as
J(θ) := Σ_{m=1}^{M} Φ(X_T^m) + ∫_0^T L(θ_t) dt, (3.9)
with
H(X_t, p_t, θ, t) = Σ_{m=1}^{M} p_t^m · f(X_t^m, θ, t) + L(θ).
To solve (3.10), one can use the following approximation recursively: let θ_t^k be given, and define (X_t^{m,k}, p_t^{m,k}) by
dX_t^{m,k}/dt = f(X_t^{m,k}, θ_t^k, t),   X_0^{m,k} = x^m,
−dp_t^{m,k}/dt = (D_x f)^*(X_t^{m,k}, θ_t^k, t) p_t^{m,k},   p_T^{m,k} = D_x Φ(X_T^{m,k}). (3.11)
We then look for θ_t^{k+1} that minimizes
Σ_m p_t^{m,k} · f(X_t^{m,k}, θ, t) + L(θ). (3.12)
Note that the above approximation may fail to converge. We refer to E et al.
(2019a,b); Han and E (2016); Sutton and Barto (2018) for recent improvements
and techniques to deal with this issue.
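The sketch below runs this successive-approximation scheme on a toy one-dimensional model, where the dynamics f(X, θ, t) = θ_t X, the running cost L(θ) = ½·reg·θ², the quadratic terminal loss and the Euler time-stepping are all illustrative assumptions; a fairly strong regularization is used because, as noted above, the raw scheme is not guaranteed to converge.

import numpy as np

# Successive approximation (3.11)-(3.12) on a toy model with scalar states:
#   forward:  dX/dt = theta_t X,       X(0) = x^m,
#   backward: -dp/dt = theta_t p,      p(T) = X_T - y^m   (gradient of the terminal loss),
#   update:   theta_t = argmin_theta  sum_m p_t^m * theta * X_t^m + 0.5 * reg * theta^2.
# All modelling choices are illustrative.

M, steps, reg = 20, 100, 80.0
dt = 1.0 / steps
x0 = np.linspace(0.5, 1.5, M)                   # initial data x^m
y = 2.0 * x0                                    # targets y^m

theta = np.zeros(steps)                         # theta^0_t on the time grid
for sweep in range(30):
    X = np.empty((steps + 1, M)); X[0] = x0     # forward Euler pass
    for k in range(steps):
        X[k + 1] = X[k] + dt * theta[k] * X[k]
    p = np.empty((steps + 1, M)); p[-1] = X[-1] - y   # backward pass for the adjoint
    for k in reversed(range(steps)):
        p[k] = p[k + 1] + dt * theta[k] * p[k + 1]
    # pointwise minimization (3.12): theta_t = - sum_m p_t^m X_t^m / reg
    theta = -(p[:-1] * X[:-1]).sum(axis=1) / reg

print("terminal loss:", 0.5 * np.mean((X[-1] - y) ** 2))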
This is not a standard MDP, but a Mean Field type control problem in discrete
time.
X_{k+1} = X_k − η_k Df(X_k) + η_k Y_k,
with
Y_k = Df(X_k) − D_x f(X_k, Z_k).
Note that we have E[Y_k | F^{k−1}] = 0 and E[Y_k Y_k^* | F^{k−1}] = Σ(X_k), with
Σ(x) = E[D_x f(x, Z)(D_x f(x, Z))^*] − Df(x)(Df(x))^*. (4.5)
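This decomposition is easy to check numerically; in the sketch below the objective f(x, Z) = ½|x − Z|² with Gaussian Z is an illustrative choice, for which Df(x) = x − E[Z] and the noise covariance Σ(x) is the identity.

import numpy as np

# Numerical check of the decomposition X_{k+1} = X_k - eta Df(X_k) + eta Y_k:
# the stochastic gradient D_x f(X_k, Z_k) is the exact gradient Df(X_k) minus a
# zero-mean noise Y_k whose covariance is Sigma(x) from (4.5).
# Toy objective: f(x, Z) = 0.5 |x - Z|^2, Z ~ N(mu, I), so D_x f(x, Z) = x - Z.

rng = np.random.default_rng(8)
dim, eta = 5, 0.05
mu = np.ones(dim)
x = rng.normal(size=dim)

noise = []
for k in range(5000):
    Z = mu + rng.normal(size=dim)
    stoch_grad = x - Z                          # D_x f(x, Z)
    full_grad = x - mu                          # Df(x) = E_Z[D_x f(x, Z)]
    noise.append(full_grad - stoch_grad)        # Y_k, with E[Y_k | past] = 0
    x = x - eta * stoch_grad                    # identical to x - eta*Df(x) + eta*Y_k

Y = np.array(noise)
print("empirical mean of Y (should be near 0):", np.round(Y.mean(axis=0), 3))
print("diagonal of empirical Sigma (here near 1):", np.round(np.cov(Y.T).diagonal(), 2))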
We shall write
Σ(x) = σ(x)σ(x)^*. (4.6)
If we write Y_k = σ(X_k)Ỹ_k, then the process Ỹ_k satisfies E[Ỹ_k | F^{k−1}] = 0 and E[Ỹ_k Ỹ_k^* | F^{k−1}] = I.
There are two approaches to resolve the above problem: Dynamic Programming and the Stochastic Pontryagin Maximum Principle. The theory shows that the optimal control is described by a feedback. The value function is defined by
In the above problem, there are three functions of interest: the value function u(x), the optimal feedback â(x) (if it exists), and the gradient of u(x), λ(x) = Du(x). Introducing u(x) and its gradient independently may look superfluous. It turns out that the gradient has a very interesting interpretation, the shadow price in economics. Surprisingly, the gradient is the solution of a self-contained vector equation. On the numerical side, approximating the gradient of u(x) by the gradient of the approximation of u(x) results in a source of errors. This justifies the interest in the system of equations for λ(x). We may think of parametric and non-parametric approximations for these functions. We shall discuss a parametric approach for the optimal feedback, and a non-parametric approach for the value function and its gradient.
where f(x, θ) abbreviates f(x, a(x, θ)). The important simplification of this procedure is that θ(t) is regarded as deterministic.
We can write a necessary condition of optimality for the optimal new control θ̂(t). Define the optimal state x̂(t) and the adjoint state p̂(t) by
dx̂ = g(x̂, θ̂) dt + dw,   x̂(0) = x,
−dp̂/dt + αp̂ = g_x^*(x̂, θ̂) p̂ + f_x(x̂, θ̂), (5.6)
and θ̂(t) satisfies
inf_θ E[ p̂(t) · g(x̂(t), θ) + f(x̂(t), θ) ],   t-a.e. (5.7)
To obtain θ̂(t), we can use an iterative approximation coupled with a gradient method:
dx̂^k = g(x̂^k, θ^k) dt + dw,   x̂^k(0) = x,
−dp̂^k/dt + αp̂^k = g_x^*(x̂^k, θ^k) p̂^k + f_x(x̂^k, θ^k), (5.8)
θ^{k+1}(t) = θ^k(t) − ρ^k(t) E[ g_θ^*(x̂^k, θ^k) p̂^k + f_θ(x̂^k, θ^k) ],   t-a.e.
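A Monte Carlo sketch of this scheme on a one-dimensional toy problem is given below; the linear feedback a(x, θ) = −θx, the quadratic cost, the finite horizon with the truncation p̂(T) = 0, and the constant step ρ are all assumptions made for illustration, not the paper's setup.

import numpy as np

# Monte Carlo sketch of (5.8) for a parametric feedback on a 1-d toy problem:
#   a(x, theta) = -theta * x,  g(x, a) = a,  f(x, a) = x^2 + a^2,  discount alpha.
# Forward paths are simulated by Euler-Maruyama; the adjoint equation
#   -dp/dt + alpha p = g_x p + f_x
# is integrated backward along each path with the truncation p(T) = 0; theta(t) is then
# updated with the averaged gradient E[g_theta^* p + f_theta].

rng = np.random.default_rng(9)
T, steps, n_paths, alpha, rho = 2.0, 200, 500, 0.5, 0.1
dt = T / steps
x0 = 1.0
theta = np.zeros(steps)                                   # theta^0(t) on the time grid

for k in range(40):
    dw = rng.normal(scale=np.sqrt(dt), size=(n_paths, steps))
    x = np.empty((n_paths, steps + 1)); x[:, 0] = x0
    for i in range(steps):                                # forward: dx = -theta x dt + dw
        x[:, i + 1] = x[:, i] - theta[i] * x[:, i] * dt + dw[:, i]
    p = np.zeros((n_paths, steps + 1))                    # adjoint, truncated with p(T) = 0
    for i in reversed(range(steps)):
        gx = -theta[i]                                    # g_x(x, theta)
        fx = 2.0 * (1.0 + theta[i] ** 2) * x[:, i + 1]    # f_x(x, theta)
        p[:, i] = p[:, i + 1] + dt * (gx * p[:, i + 1] + fx - alpha * p[:, i + 1])
    # gradient of p*g(x,theta) + f(x,theta) in theta:  -x p + 2 theta x^2
    grad = (-x[:, :-1] * p[:, :-1] + 2.0 * theta * x[:, :-1] ** 2).mean(axis=0)
    theta = theta - rho * grad

print("feedback gain theta(0) after the sweeps:", theta[0])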
â(x) minimizes in a the function λ(x) · g(x, a) + f(x, a).
The above system has an interesting structure: there is coupling only in the last two equations. The first equation allows one to define the value function. Note that we have used the fact that Dλ(x) = (Dλ(x))^*.
We now define the following iteration: suppose that we know (â^k(x), λ^k(x)); we can find λ^{k+1}(x) by solving the differential equation system
αλ^{k+1}(x) − Dλ^{k+1}(x) g(x, â^k(x)) − (1/2)Δλ^{k+1}(x)
= D_x^* g(x, â^k(x)) λ^k(x) + D_x f(x, â^k(x)), (5.10)
such that â^{k+1}(x) minimizes in a the function λ^{k+1}(x) · g(x, a) + f(x, a).
The equations for the components of λ^{k+1}(x) are completely decoupled and can be solved in parallel. One possibility is to use simulation to define λ^{k+1}(x) at a finite number of points and to use an extrapolation by a kernel method.
where â(x) minimizes in a the function λ(x) · g(x, a) + f(x, a).
(ii) The second one is to describe the policy iteration as follows: given the functions (â^k(x), u^k(x)), we then set
We now set
λ^{k+1}(x) = Du^{k+1}(x), (6.7)
and the values of the function â^{k+1}(x) can be obtained by minimizing
Since â^k(x) satisfies the necessary condition of optimality,
â^{k+1}(x) = â^k(x) − θ^{k+1} [ D_a f(x, â^k(x)) + (D_a g)^*(x, â^k(x)) λ^{k+1}(x) ]. (6.10)
and
H^{k+1}(θ)(x) = f(x, w^{k+1}(θ)(x)) + λ^{k+1}(x) · g(x, w^{k+1}(θ)(x)), (6.12)
where we can now find θ^{k+1} by minimizing the function H^{k+1}(θ)(x) in θ. As a result, θ^{k+1} depends on x, and we plug it back into (6.10) to obtain â^{k+1}(x).
For this we propose a parallel splitting up method²: knowing λ^j(x), and writing x = (x_1, ..., x_d), we define λ^{j+l/d}(x), l = 1, ..., d, by
αλ^{j+l/d}(x) − G_l(x) ∂λ^{j+l/d}(x)/∂x_l = Z^{j+l/d}(x), (6.14)
where Z^{j+l/d}(x) = Σ_{h≠l} G_h(x) ∂λ^j(x)/∂x_h + F(x). Then λ^{j+1}(x) is defined by
λ^{j+1}(x) = (1/d) Σ_{l=1}^{d} λ^{j+l/d}(x). (6.15)
Note that (6.14) is a one-dimensional first-order differential equation, which has the explicit solution
² The parallel splitting up method not only reduces the original problem to a number of separable one-dimensional linear problems, but also enables us to compute all these one-dimensional linear problems in parallel, since the calibrations of the fractional steps are independent of each other (Lu et al., 1991).
λ^{j+l/d}(x) = − ∫_{−∞}^{x_l} [ Z^{j+l/d}(ξ_l, x̄_l) / G_l(ξ_l, x̄_l) ] exp( α ∫_{ξ_l}^{x_l} dη_l / G_l(η_l, x̄_l) ) dξ_l. (6.16)
Here, we have used the notation x = (x_l, x̄_l), where x̄_l ∈ R^{d−1}.
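A small numerical sketch of the splitting iteration (6.14)-(6.15) is given below for d = 2 and a scalar unknown (each component of λ is treated identically). It uses a manufactured solution and a first-order upwind line solver instead of the closed-form integral (6.16), boundary data on x_l = 0 supplied from the known test solution, and a large value of α, echoing the conditions of Section 7; all of these are illustrative assumptions.

import numpy as np

# Parallel splitting-up iteration (6.14)-(6.15) in d = 2 on [0,1]^2 for a scalar unknown:
#   alpha*lam - G1 d(lam)/dx1 - G2 d(lam)/dx2 = F,  with G_l(x) = -(1 + x_l).
# F is manufactured from a known solution lam_true so the result can be checked.
# Each fractional step solves independent one-dimensional problems along grid lines
# (upwind sweeps), which could be dispatched in parallel; (6.15) then averages them.

alpha, N = 80.0, 40
h = 1.0 / (N - 1)
x1, x2 = np.meshgrid(np.linspace(0, 1, N), np.linspace(0, 1, N), indexing="ij")
G = [-(1.0 + x1), -(1.0 + x2)]
lam_true = np.sin(x1) * np.cos(x2)
F = alpha * lam_true - G[0] * (np.cos(x1) * np.cos(x2)) - G[1] * (-np.sin(x1) * np.sin(x2))

def sweep(Z, absG, boundary):
    """Upwind sweep along axis 0 for alpha*lam + |G| d(lam)/dx = Z, with lam[0, :] = boundary."""
    lam = np.empty_like(Z)
    lam[0] = boundary
    for i in range(1, Z.shape[0]):
        lam[i] = (Z[i] + absG[i] / h * lam[i - 1]) / (alpha + absG[i] / h)
    return lam

lam = np.zeros((N, N))                                     # lambda^0
for j in range(40):                                        # outer splitting iterations
    d1, d2 = np.gradient(lam, h)                           # d(lam^j)/dx1, d(lam^j)/dx2
    lam1 = sweep(G[1] * d2 + F, -G[0], lam_true[0])        # fractional step l = 1 (x2-term frozen)
    lam2 = sweep((G[0] * d1 + F).T, (-G[1]).T, lam_true[:, 0]).T   # fractional step l = 2
    lam = 0.5 * (lam1 + lam2)                              # averaging step (6.15)

print("max error vs. manufactured solution:", np.abs(lam - lam_true).max())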
7 Convergence results
7.1 Setting of the problem
We take
g(x, a) = A(x) + Ba, (7.1)
such that ‖DA(x)‖ ≤ γ and
‖DA(x_1) − DA(x_2)‖ ≤ b|x_1 − x_2| / (1 + |x_1| + |x_2|).
The pay-off functional is defined with
f(x, a) = F(x) + (1/2) a^* N a, (7.2)
where x ↦ F(x) : R^n → R with |DF(x)| ≤ M|x|, and N ∈ L(R^d; R^d) is symmetric and invertible.
We then have
â(x) = −N^{-1} B^* λ(x), (7.3)
and thus the second relation in (6.1) becomes
7.2 Preliminaries
We will need conditions on α and b: α sufficiently large and b sufficiently small. We first assume that
α − 2γ > 2√(M‖BN^{-1}B^*‖). (7.5)
We set
β = (α − 2γ)² / (4M‖BN^{-1}B^*‖) > 1. (7.6)
We define
μ = [α − 2γ − √((α − 2γ)² − 4M‖BN^{-1}B^*‖)] / (2‖BN^{-1}B^*‖), (7.7)
which is a solution of
μ²‖BN^{-1}B^*‖ − μ(α − 2γ) + M = 0. (7.8)
We need that
b < √(M‖BN^{-1}B^*‖) (β − 1) / (√β − √(β − 1)). (7.10)
We then define
ν = [α − 2γ − √((α − 2γ)² − 4(M + bμ)‖BN^{-1}B^*‖)] / (2‖BN^{-1}B^*‖). (7.11)
Since A(x) and λ(x) are uniformly Lipschitz, this equation has a unique solution. We then define Γ(x) = (T λ)(x) by the formula
Γ(x) = ∫_0^{+∞} exp(−αs) [ DF(y(s)) + DA^*(y(s)) λ(y(s)) ] ds. (7.14)
This integral is well-defined. Indeed, from (7.13), the second assumption (7.1)
and the first property (7.12), we can assert that
then
DΓ(x) = ∫_0^{+∞} exp(−αs) [ D²F(y(s)) + DA(y(s)) Dλ(y(s)) + D²A(y(s)) λ(y(s)) ] Y(s) ds. (7.18)
So
‖DΓ(x)‖ ≤ (M + γν + bμ) ∫_0^{+∞} exp(−αs) ‖Y(s)‖ ds,
and from (7.17) it follows that
‖DΓ(x)‖ ≤ (M + γν + bμ) ∫_0^{+∞} exp(−(α − γ − ‖BN^{-1}B^*‖ν)s) ds
= (M + γν + bμ) / (α − γ − ‖BN^{-1}B^*‖ν) = ν,
and thus Γ(x) satisfies the second condition in (7.12).
We consider the Banach space of functions λ(x) : R^n → R^n, with the norm
‖λ‖ = sup_x |λ(x)| / |x|,
and the closed subset
|d(y_1 − y_2)/ds| ≤ (γ + ν‖BN^{-1}B^*‖)|y_1 − y_2| + ‖BN^{-1}B^*‖ ‖λ_1 − λ_2‖ |x| exp((γ + μ‖BN^{-1}B^*‖)s),
y_1(0) = y_2(0) = x,
therefore
|y_1(s) − y_2(s)| ≤ (‖λ_1 − λ_2‖ |x| / (ν − μ)) [ exp((γ + ν‖BN^{-1}B^*‖)s) − exp((γ + μ‖BN^{-1}B^*‖)s) ]. (7.19)
Writing
DA^*(y_1(s)) λ_1(y_1(s)) − DA^*(y_2(s)) λ_2(y_2(s))
= [DA^*(y_1(s)) − DA^*(y_2(s))] λ_1(y_1(s)) + DA^*(y_2(s)) [λ_1(y_1(s)) − λ_2(y_2(s))].
Moreover
Rearranging and using the definition of ν, see (7.9), we finally obtain that
‖Γ_1 − Γ_2‖ ≤ [(γ + ‖BN^{-1}B^*‖ν) / (α − γ − ‖BN^{-1}B^*‖μ)] ‖λ_1 − λ_2‖. (7.20)
It remains to check that
(γ + ‖BN^{-1}B^*‖ν) / (α − γ − ‖BN^{-1}B^*‖μ) < 1, (7.21)
which is equivalent to
which is true from the definition of μ and ν, see (7.7) and (7.11). If we call λ(x) the unique fixed point of T, it satisfies (7.12) and the equation
λ(x) = ∫_0^{+∞} exp(−αs) [ DF(y(s)) + DA^*(y(s)) λ(y(s)) ] ds. (7.22)
It is standard to check that (7.12) together with (7.22) is equivalent to (7.12) together with (7.4). This concludes the proof of the Theorem.
7.4 Algorithm
We can write the algorithm (6.3) which leads to
and its solution is λ(x) = Px, with P the solution of the Riccati equation
αP = M + A^*P + PA − PBN^{-1}B^*P. (7.26)
We have
μ = ν = [α − 2‖A‖ − √((α − 2‖A‖)² − 4M‖BN^{-1}B^*‖)] / (2‖BN^{-1}B^*‖). (7.28)
8 Numerical results
We now present numerical tests for the algorithm. We consider the following values for m and n: m = 10, n = 30; the matrices M, N, A, B are chosen arbitrarily and their values are not displayed here.
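To make the experiment reproducible in spirit, the sketch below iterates one plausible fixed-point reading of the algorithm for the Riccati equation (7.26), monitoring P^(k) − P^(k+1) as in Figs. 2 and 4; the matrices, the value of α, and the precise form of the iteration are assumptions, since the paper's own choices and its exact iteration are not reproduced here.

import numpy as np

# Fixed-point iteration for the Riccati equation (7.26),
#   alpha*P = M + A^T P + P A - P B N^{-1} B^T P,
# monitoring the differences P^(k) - P^(k+1).  The matrices are arbitrary, and the simple
# update P^(k+1) = (M + A^T P^(k) + P^(k) A - P^(k) B N^{-1} B^T P^(k)) / alpha is only an
# illustrative reading of the algorithm; it converges quickly when alpha is large,
# consistent with the behaviour reported below for Figs. 2 and 4.

rng = np.random.default_rng(10)
n, m, alpha = 30, 10, 100.0
A = rng.normal(size=(n, n)) / np.sqrt(n)
B = rng.normal(size=(n, m)) / np.sqrt(n)
M = np.eye(n)
N_inv = np.eye(m)                                         # take N = identity

P = np.zeros((n, n))                                      # initial guess P^(0)
for k in range(26):
    P_next = (M + A.T @ P + P @ A - P @ B @ N_inv @ B.T @ P) / alpha
    if k % 5 == 0:
        print(f"k = {k:2d}   ||P^(k) - P^(k+1)|| = {np.linalg.norm(P - P_next):.2e}")
    P = P_next

residual = alpha * P - (M + A.T @ P + P @ A - P @ B @ N_inv @ B.T @ P)
print("Riccati residual norm:", np.linalg.norm(residual))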
In Fig. 2, we choose α = 1770.3688 and 0 < P^(0) < 6.8153 for the 4 samples of the initial guess P^(0). These choices satisfy the two conditions (7.5) and (7.6). We can see that as the number of iterations k increases, the difference P^(k) − P^(k+1) becomes very small, and after 15 iterations these differences are essentially 0.
In Fig. 4, we take the first choice of P^(0) arbitrarily and vary the value of α over 1, 5, 20, 100, 200, 300. Using the results of our Python code, we plot the values of P^(5) − P^(6), P^(10) − P^(11), P^(15) − P^(16), P^(20) − P^(21), and P^(25) − P^(26) for each value of α as a curve. We can see that the algorithm converges very fast if α is big: the curves for α = 100, 200, 300 (red, purple and brown) are almost the 0-line.
Acknowledgment
Alain Bensoussan acknowledges the financial support from the National Science Founda-
tion under grants DMS-1612880, DMS-1905449, and the Research Grant Council of Hong
Kong Special Administrative Region under grant GRF 11303316. Minh-Binh Tran is partially
supported by NSF Grant DMS-1854453, SMU URC Grant 2020, SMU DCII Research Clus-
ter Grant, Dedman College Linking Fellowship, Alexander von Humboldt Fellowship. Dinh
Phan Cao Nguyen and Minh-Binh Tran would like to thank Prof. T. Hagstrom and Prof. A.
Aceves for the computational resources. Phillip Yam acknowledges the financial support from
HKGRF-14300717 with the project title “New kinds of Forward-backward Stochastic Systems
with Applications”, HKGRF-14300319 with the project title “Shape-constrained Inference:
Testing for Monotonicity”, and Direct Grant for Research 2014/15 (Project No. 4053141) of-
fered by CUHK. Xiang Zhou acknowledges the support of Hong Kong RGC GRF grants
11337216 and 11305318.
References
Bensoussan, A., 2018. Estimation and Control of Dynamical Systems. Interdisciplinary Applied
Mathematics. Springer International Publishing.
Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., Holtham, E., 2018a. Reversible architec-
tures for arbitrarily deep residual neural networks. In: The Thirty-Second AAAI Conference on
Artificial Intelligence (AAAI-18), p. 2811.
Chang, Bo, Meng, Lili, Haber, Eldad, Tung, Frederick, Begert, David, 2018b. Multi-level residual
networks from dynamical systems view.
Chen, Ricky T.Q., Rubanova, Yulia, Bettencourt, Jesse, Duvenaud, David, 2018. Neural ordinary differential equations. In: Advances in Neural Information Processing Systems, pp. 6571–6583.
Chiuso, A., Pillonetto, G., 2019. System identification: a machine learning perspective. Annual Re-
view of Control, Robotics, and Autonomous Systems 2 (1), 281–304.
E, W., 2017. A proposal on machine learning via dynamical systems. Communications in Mathe-
matics and Statistics 5 (1), 1–11.
E, W., Han, J., Li, Q., 2019a. A mean-field optimal control formulation of deep learning. Research
in the Mathematical Sciences 6 (1), 1–41.
E, W., Ma, C., Wu, L., 2019b. Machine learning from a continuous viewpoint. arXiv:1912.12777.
Haber, E., Ruthotto, L., 2017. Stable architectures for deep neural networks. Inverse Problems 34
(1), 014004.
Han, J., E, W., 2016. Deep learning approximation for stochastic control problems. arXiv:1611.
07422.
Han, Jiequn, Jentzen, Arnulf, E, Weinan, 2018. Solving high-dimensional partial differential
equations using deep learning. Proceedings of the National Academy of Sciences 115 (34),
8505–8510.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian, 2016. Deep residual learning for image
recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778.
Jordan, M.I., Mitchell, T.M., 2015. Machine learning: trends, perspectives, and prospects. Sci-
ence 349 (6245), 255–260.
Li, Q., Chen, L., Tai, C., E, W., 2017a. Maximum principle based algorithms for deep learning.
Journal of Machine Learning Research 18 (1), 5998–6026.
Li, Q., Hao, S., 2018. An optimal control approach to deep learning and applications to discrete-
weight neural networks. arXiv preprint. arXiv:1803.01299.
Li, Q., Tai, C., E, W., 2017b. Stochastic modified equations and adaptive stochastic gradient algo-
rithms. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70.
JMLR.org, pp. 2101–2110.
Li, Q., Tai, C., E, W., 2019. Stochastic modified equations and dynamics of stochastic gradient
algorithms I: mathematical foundations. Journal of Machine Learning Research 20 (40), 1–40.
Li, Qianxiao, Tai, Cheng, E, Weinan, 2015. Dynamics of stochastic gradient algorithms. arXiv
preprint. arXiv:1511.06251.
Li, Z., Shi, Z., 2017. Deep residual learning and PDEs on manifold. arXiv preprint. arXiv:1708.
05115.
Lu, T., Neittaanmaki, P., Tai, X.-C., 1991. A parallel splitting up method and its application to
Navier-Stokes equations. Applied Mathematics Letters 4, 25–29.
Lu, Y., Zhong, A., Li, Q., Dong, B., 2017. Beyond finite layer neural networks: bridging deep
architectures and numerical differential equations. arXiv preprint. arXiv:1710.10121.
Mei, S., Montanari, A., Nguyen, P.-M., 2018. A mean field view of the landscape of two-layer neural
networks. Proceedings of the National Academy of Sciences 115 (33), E7665–E7671.
Recht, Benjamin, 2019. A tour of reinforcement learning: the view from continuous control. Annual
Review of Control, Robotics, and Autonomous Systems 2 (1), 253–279.
Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez,
Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, Chen, Yutian, Lillicrap,
Timothy, Hui, Fan, Sifre, Laurent, van den Driessche, George, Graepel, Thore, Hassabis, Demis,
2017. Mastering the game of go without human knowledge. Nature 550 (7676), 354.
Sonoda, S., Murata, N., 2017. Double continuum limit of deep neural networks. In: ICML Workshop
Principled Approaches to Deep Learning.
Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction, second edition. MIT Press.
Wang, Haoran, Zariphopoulou, Thaleia, Zhou, Xunyu, 2018. Exploration versus exploitation in re-
inforcement learning: a stochastic control approach. arXiv:1812.01552.