
Convergence of the Value Function in Optimal Control Problems with Unknown Dynamics

Andrea Pesare¹, Michele Palladino² and Maurizio Falcone¹

arXiv:2105.13708v1 [math.OC] 28 May 2021

Abstract—We deal with the convergence of the value function of an approximate control problem with uncertain dynamics to the value function of a nonlinear optimal control problem. The assumptions on the dynamics and the costs are rather general, and uncertainty in the dynamics is represented by a probability distribution. The proposed framework aims to describe and motivate some model-based Reinforcement Learning algorithms where the model is probabilistic. We also show some numerical experiments which confirm the theoretical results.

Keywords—Reinforcement learning, optimal control, nonlinear systems, system identification, convergence.

I. INTRODUCTION

Reinforcement Learning (RL) is an important branch of Machine Learning aiming to provide satisfactory policies that an agent can easily implement in an uncertain environment [1]. The agent acquires knowledge about (learns) the environment from its past experience, which is usually represented by a series of statistical data, but it can also learn while interacting with the system. Optimal control [2], [3] and RL are strongly connected [4], [5], to the point that in [4] the authors write "Reinforcement Learning is direct adaptive optimal control".

RL algorithms are generally classified into two categories: model-based methods, which build a predictive model of the environment and use it to construct a controller, and model-free methods, which directly learn a policy or a value function by interacting with the environment. Model-free algorithms have shown great performance [6], [7], [8], [9], although they are generally quite expensive to train, especially in terms of sample complexity; this often limits their application to simulated domains. On the contrary, model-based techniques show a higher sample efficiency [10], [11], which results in faster learning. Furthermore, recent algorithms have managed to limit the model-bias phenomenon by using probabilistic models, which capture the uncertainty of the learned model [10], [12], [13]. In Bayesian Reinforcement Learning (BRL), the dynamics model is updated when new data are available [14]. Finally, the recently introduced probabilistic model ensembles [15], [16] have allowed model-based methods to achieve the same asymptotic performance as state-of-the-art model-free methods, with higher sample efficiency. Thanks to these features, model-based methods seem to be the most suitable for solving complex real-world problems.

In this paper, we consider the class of BRL algorithms for continuous state-action spaces and we analyze them from the viewpoint of control theory. In particular, we consider nonlinear control systems in which the dynamics is only partially known, and we assume that the agent's belief about the dynamics is represented by a probability distribution π on a space of functions [17]. The task is thus written as an optimal control problem with averaged cost, a formulation that has received growing interest in the last few years (see e.g. [18], [19], [20]). In probabilistic model-based RL algorithms, this corresponds to the policy improvement step, where the agent seeks the best control given a probabilistic model π, using a policy search method [10], [12], [16], MPC [13], [15] or other methods. In the framework we propose, the precision with which π describes the true dynamics improves as the dataset becomes wider. This reflects the situation in BRL where the agent learns about the surrounding environment as it gains more experience and updates the dynamics model.

The main objective of the paper is a convergence result of the value function Vπ of an averaged (with respect to a probability measure π) optimal control problem to the "true value function" V. Here, by true value function we mean the one defined by the optimal control problem governed by the true, underlying dynamics. Roughly speaking, the main result of the paper can be stated as follows: the value function Vπ is close to V as soon as π provides an accurate representation of the true, underlying dynamics. Similar results have recently been obtained for a general Linear Quadratic Regulator problem with finite horizon (see [21]). In the present paper, we focus our attention on a general, nonlinear optimal control problem over a finite horizon, under globally Lipschitz assumptions on the costs and the controlled dynamics.

In the next section, we present the precise setting of the paper, whereas in section III we state and prove the main result. Section IV is devoted to some numerical tests, and section V provides the conclusions and an overview of open questions.

¹ A. Pesare and M. Falcone are with the Department of Mathematics, Sapienza Università di Roma, 00185 Rome, Italy (e-mail: [email protected], [email protected]).
² M. Palladino is with the Gran Sasso Science Institute - GSSI, 67100 L'Aquila, Italy (e-mail: [email protected]).

II. PROBLEM FORMULATION

This section aims to present the nonlinear optimal control framework that we want to solve. Before that, let us introduce some notation which will be used throughout the paper.

For vectors v ∈ R^n, |v| will denote the Euclidean norm. For continuous functions f : D → R with D ⊂ R^n, the notations ||f||_∞ and ||f||_∞,K will indicate the sup norm over the function domain D or over a compact set K ⊂ D, respectively.

Let π and π′ be two probability distributions on a compact metric space (X, d). The 1-Wasserstein distance (see [22]) between them is defined as

    W1(π, π′) := inf_{γ ∈ Γ(π,π′)} ∫_{X×X} ||g − f||_∞ dγ(g, f),    (1)

where Γ(π, π′) is the collection of all probability measures on X × X having π and π′ as marginals, f and g are generic elements of X, and the symbol dγ indicates that the integral is taken with respect to the measure γ. Given a probability space (Ω, F, π) and a random variable Y on Ω, Eπ[Y] denotes the expected value of Y with respect to π.
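To make definition (1) concrete, here is a small computational sketch (not taken from the paper; the helper name, the grid-based approximation of the sup norm, and the use of SciPy are our own illustrative choices). It computes W1 between two finite-support measures on a family of dynamics by solving the transport linear program; when one of the two measures is a Dirac mass, as in the convergence questions studied below, the optimal coupling is forced and W1 reduces to a plain expectation of ||g − f||_∞.

```python
# Illustrative sketch only: W1 between finite-support measures over a set of dynamics,
# with ground cost ||g - f||_inf approximated by sampling on a grid of states.
import numpy as np
from scipy.optimize import linprog

def w1_finite(p, q, funcs_p, funcs_q, grid):
    """p, q: probability vectors; funcs_*: lists of callables; grid: sample points."""
    m, n = len(funcs_p), len(funcs_q)
    # Ground cost C[i, j]: sup-norm distance between the i-th and j-th dynamics on the grid.
    C = np.array([[np.max(np.abs(f(grid) - g(grid))) for g in funcs_q] for f in funcs_p])
    # Transport plan gamma has m*n entries; its marginals must equal p and q.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # row sums of gamma = p
    for j in range(n):
        A_eq[m + j, j::n] = 1.0               # column sums of gamma = q
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

# Example in the spirit of the tests of section IV: pi^N versus the Dirac mass at f_1.
fs = [lambda x, lam=lam: lam * x + np.sin(x) for lam in (0.0, 1.0, -1.0, 0.5, -0.5)]
alpha = np.array([0.75, 0.0625, 0.0625, 0.0625, 0.0625])
grid = np.linspace(-1.0, 1.0, 201)
print(w1_finite(alpha, np.array([1.0]), fs, fs[:1], grid))  # = E_{pi^N}[||f_1 - g||_inf]
```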

A. Problem A: a classical optimal control problem

Let us consider a classical finite horizon optimal control problem ([2], [3]), which we will call Problem A. For 0 ≤ s < T, let us consider the controlled dynamics

    ẋ(t) = f(x(t), u(t)),  t ∈ [s, T],    x(s) = x0,    (2)

where the nonlinear dynamics f : R^n × R^m → R^n is continuous in the pair (x, u) and Lipschitz continuous with respect to x, uniformly with respect to u. These conditions guarantee that the Cauchy problem (2) is well-posed, in the sense that for every measurable control u and initial condition x(s) = x0 ∈ R^n there exists a unique solution of (2).

The goal is to minimize the cost functional

    J_{s,x0}[u] := ∫_s^T ℓ(x(t), u(t)) dt + h(x(T)),    (3)

over the class of admissible controls U_s = {u : [s, T] → U, measurable}, where U is a closed subset of R^m and ℓ and h are respectively the running cost and the terminal cost, which we require to be Lipschitz continuous with respect to x. For each (s, x0) ∈ [0, T] × R^n, the value function and the corresponding optimal control associated to the optimal control problem (2)-(3) are respectively defined as

    V(s, x0) := inf_{u ∈ U_s} J_{s,x0}[u]    and    u*(s, x0) := arg min_{u ∈ U_s} J_{s,x0}[u].

B. Problem B: an optimal control problem with uncertain dynamics

Let us now introduce another control problem in which the real dynamics f is unknown, meaning that one has merely partial knowledge of f. Such a model uncertainty is captured by a probability distribution on a space of functions X (to which f belongs). More precisely, X is a compact subset of C^0(R^n × U; R^n) with respect to the ||·||_∞ norm (the Arzelà-Ascoli Theorem provides necessary and sufficient conditions for the set X to be compact). For 0 ≤ s < T and every g ∈ X, let us define the dynamical system

    ẋ^g(t) = g(x^g(t), u(t)),  t ∈ [s, T],    x^g(s) = x0.    (4)

Given a probability distribution π over X, one can define a cost functional for Problem B:

    J_{π,s,x0}[u] := Eπ[ ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ]    (5)
                  = ∫_X ( ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ) dπ(g).

Note that, even if there is a different trajectory x^g(t) for each g ∈ X, the task is to look for a single control u to be applied to every system dynamics g ∈ X. The notions of value function and optimal control are analogous to those given for Problem A:

    Vπ(s, x0) := inf_{u ∈ U_s} J_{π,s,x0}[u]    and    u*π(s, x0) := arg min_{u ∈ U_s} J_{π,s,x0}[u].

Remark II.1. It is worth pointing out that the theory developed here works both for non-parametric models (e.g. Gaussian processes [10], [13]) and for parametric models (e.g. deep neural networks [12], [15], [16]). In the latter case, we are assuming that the support of π is a family of functions {f_λ : λ ∈ R^d} described by a parameter λ ∈ R^d, with the dimension d arbitrarily large. π can thus be seen as a probability distribution on the parameter space R^d and is then easier to work with (see, for instance, the numerical tests in section IV).

There are several model-based RL methods that rely on the design of probability distributions representing the belief that an agent has about the environment. In BRL algorithms [10], [12], [13], [15], [16], [14], the probability distribution, which is built upon the data collected while exploring the partially unknown environment, is updated when new experience is gained. In this context, it is reasonable to expect that the newest probability distribution representing the environment will be more accurate than the initial one.

In the framework of our paper, π is the probability distribution, i.e. the probabilistic model, which represents the knowledge that the agent has of the environment. Clearly, as the accuracy of π increases, one should expect that the value function of Problem B is close (in a sense that will be made precise) to the value function of Problem A. In particular, we will investigate the following questions:

(A) How far is Vπ from V?
(B) If π^N → δ_f, is it true that V_{π^N} → V, where {V_{π^N}}_{N∈N} are the value functions of a sequence of problems of type B, and {π^N}_{N∈N} are the respective probability distributions on X?
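Before moving to the main results, it may help to see Problem B at work in the simplest situation, a finite-support model of the kind analyzed in section III-A: the averaged cost (5) is then a weighted sum of standard costs (3), one for each candidate dynamics, all driven by the same control. The sketch below is purely illustrative (a crude forward-Euler integrator and hypothetical function names, not the scheme used in the numerical tests of section IV).

```python
# Illustrative sketch only: the averaged cost (5) for pi = sum_i alpha_i * delta_{g_i}.
import numpy as np

def averaged_cost(alphas, dynamics, u, x0, ell, h, s=0.0, T=1.0, dt=1e-3):
    """J_{pi,s,x0}[u] = sum_i alpha_i * ( int_s^T ell(x_i(t), u(t)) dt + h(x_i(T)) )."""
    total = 0.0
    for alpha, g in zip(alphas, dynamics):
        x, running = np.asarray(x0, dtype=float), 0.0
        for t in np.arange(s, T, dt):
            running += ell(x, u(t)) * dt              # running cost along this trajectory
            x = x + dt * np.asarray(g(x, u(t)))       # Euler step of xdot = g(x, u)
        total += alpha * (running + h(x))             # same control u for every dynamics g
    return total

# A Test-1-like setup: g_i(x, u) = lambda_i * x + sin(x) + u, ell = u^2, h = -x(T).
lambdas = [0.0, 1.0, -1.0, 0.5, -0.5]
dyn = [lambda x, u, lam=lam: lam * x + np.sin(x) + u for lam in lambdas]
alphas = [0.75, 0.0625, 0.0625, 0.0625, 0.0625]
J = averaged_cost(alphas, dyn, u=lambda t: 1.0, x0=[1.0],
                  ell=lambda x, u: u ** 2, h=lambda x: -x[0])
print(J)
```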

III. MAIN RESULTS

In this section, we will state and prove the main result of the paper, valid for a general family of probability distributions.

Theorem III.1. Let us consider two Problems of type B as described in section II-B, one with distribution π^N and the other with distribution π^∞. We make the further assumptions:

(H1) There exists a constant L_f > 0 such that
    |f(x, u) − f(y, u)| ≤ L_f |x − y|
for each f ∈ supp(π^∞), x, y ∈ R^n, u ∈ U;

(H2) The two cost functions ℓ and h are Lipschitz continuous in the first argument with constants respectively L_ℓ and L_h.

Then the following estimate holds:

    ||V_{π^N} − V_{π^∞}||_∞ ≤ C(L_f, L_ℓ, L_h, T) W1(π^N, π^∞),    (6)

where C(L_f, L_ℓ, L_h, T) is a constant which depends only on the Lipschitz constants and on T, and W1(π^N, π^∞) is the 1-Wasserstein distance defined in (1).

Proof. We divide the proof in three steps.

STEP 1: Fix two dynamics g ∈ X and f ∈ supp(π^∞), an initial condition x(s) = x0 with s ∈ [0, T] and x0 ∈ R^n, and a control u ∈ U_s. We estimate how far x^g(t) is from x^f(t), using Gronwall's Lemma, for each t ∈ [s, T]. Recall that x^f(t) and x^g(t) are solutions of the dynamical systems (4):

    x^f(t) = x0 + ∫_s^t f(x^f(τ), u(τ)) dτ,    x^g(t) = x0 + ∫_s^t g(x^g(τ), u(τ)) dτ.

Then we have the following estimate:

    |x^g(t) − x^f(t)| ≤ ∫_s^t |g(x^g(τ), u(τ)) − f(x^f(τ), u(τ))| dτ
                      ≤ ∫_s^t |g(x^g(τ), u(τ)) − f(x^g(τ), u(τ))| dτ + ∫_s^t |f(x^g(τ), u(τ)) − f(x^f(τ), u(τ))| dτ
                      ≤ (t − s) ||f − g||_∞ + L_f ∫_s^t |x^g(τ) − x^f(τ)| dτ.

Then, by Gronwall's Lemma,

    |x^g(t) − x^f(t)| ≤ (t − s) ||f − g||_∞ e^{L_f (t−s)} ≤ t ||f − g||_∞ e^{L_f t}.    (7)

STEP 2: Fix an initial condition x(s) = x0 with s ∈ [0, T] and x0 ∈ R^n, and a control u ∈ U_s. We estimate how far J_{π^N,s,x0}[u] is from J_{π^∞,s,x0}[u]. To lighten the notation, we will write J_{π^N}[u] = J_{π^N,s,x0}[u] and J_{π^∞}[u] = J_{π^∞,s,x0}[u].

For each g ∈ X and f ∈ supp(π^∞) it holds

    |ℓ(x^g(t), u(t)) − ℓ(x^f(t), u(t))| ≤ L_ℓ |x^g(t) − x^f(t)| ≤ L_ℓ t ||f − g||_∞ e^{L_f t}    (8)

and

    |h(x^g(T)) − h(x^f(T))| ≤ L_h T ||f − g||_∞ e^{L_f T}.    (9)

As a property of W1, there exists (see Theorem 4.1 in [22]) a distribution γ* on X × X with marginal distributions π^N and π^∞ such that

    W1(π^N, π^∞) = ∫_{X×X} ||g − f||_∞ dγ*(g, f).

Let (g(x), f(x))_{x∈R^n} be a continuous process which has γ* as distribution. For each event ω ∈ Ω, we consider the realization of the process (g^ω, f^ω). Summing over g ∈ supp(π^N) and f ∈ supp(π^∞), we obtain

    J_{π^N}[u] − J_{π^∞}[u]
      = ∫_X ( ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ) dπ^N(g) − ∫_X ( ∫_s^T ℓ(x^f(t), u(t)) dt + h(x^f(T)) ) dπ^∞(f)
      = ∫_{X×X} ( ∫_s^T [ℓ(x^g(t), u(t)) − ℓ(x^f(t), u(t))] dt + [h(x^g(T)) − h(x^f(T))] ) dγ*(g, f).

Passing to the absolute value and using the bounds in (8) and (9), we recover the expression of W1:

    |J_{π^N}[u] − J_{π^∞}[u]| ≤ ( L_ℓ ∫_0^T t e^{L_f t} dt + L_h T e^{L_f T} ) ∫_{X×X} ||g − f||_∞ dγ*(g, f)
      = ( L_ℓ (e^{L_f T}(L_f T − 1) + 1) / L_f^2 + L_h T e^{L_f T} ) W1(π^N, π^∞)
      = C(L_f, L_ℓ, L_h, T) W1(π^N, π^∞).    (10)

Note that this estimate does not depend on x0 or u.

STEP 3: We now prove the estimate (6) using (10). Fix an initial condition x(s) = x0 and some ε > 0. By the definition of V_{π^∞}, there exists some control u_ε ∈ U_s such that

    J_{π^∞}[u_ε] ≤ V_{π^∞}(s, x0) + ε.

Then one has

    V_{π^N}(s, x0) − V_{π^∞}(s, x0) = inf_{u ∈ U_s} J_{π^N}[u] − inf_{u ∈ U_s} J_{π^∞}[u]
      < inf_{u ∈ U_s} J_{π^N}[u] − J_{π^∞}[u_ε] + ε
      ≤ J_{π^N}[u_ε] − J_{π^∞}[u_ε] + ε
      ≤ sup_{u ∈ U_s} |J_{π^N}[u] − J_{π^∞}[u]| + ε.

In the same way, we get

    V_{π^∞}(s, x0) − V_{π^N}(s, x0) < sup_{u ∈ U_s} |J_{π^N}[u] − J_{π^∞}[u]| + ε,

but ε is arbitrary, so eventually

    |V_{π^N}(s, x0) − V_{π^∞}(s, x0)| ≤ sup_{u ∈ U_s} |J_{π^N}[u] − J_{π^∞}[u]|.

Finally, noting that the estimate is independent of x0, we get the result:

    ||V_{π^N} − V_{π^∞}||_∞ ≤ sup_{u ∈ U_s} |J_{π^N}[u] − J_{π^∞}[u]| ≤ C(L_f, L_ℓ, L_h, T) W1(π^N, π^∞).  ∎

In the particular case where π^∞ ≡ δ_f, f being the true underlying dynamics, one can express Theorem III.1 in a more expressive fashion as follows:

Corollary III.2 (Convergence of the value functions). Suppose that f : R^n × U → R^n is a Lipschitz continuous function and that assumption (H2) on ℓ and h is satisfied. Let {π^N}_{N∈N} be a sequence of probability distributions on a compact set X ⊂ C^0(R^n × U), with π^N → δ_f in the W1 distance. Then the value function V_{π^N} of Problem B converges uniformly on [0, T] × R^n to the value function V of Problem A.

The previous Corollary provides a positive answer to the questions (A)-(B) posed in the introduction. More precisely, it tells us that, whenever it is possible to construct a probability distribution π sufficiently close (with respect to the Wasserstein distance) to the real dynamics f, the value function Vπ is a good approximation of the value function V.
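As a rough sanity check of the convergence stated above, one can approximate both V_{π^N} and V on a toy problem and watch the gap shrink as π^N approaches δ_f. The sketch below is only indicative and is not the method used in section IV: it restricts the minimization to piecewise-constant controls on a coarse grid of control levels, uses a simple Euler discretization, and all names are our own.

```python
# Illustrative sketch only: brute-force upper approximations of V_pi and V on a scalar toy
# problem, to observe |V_{pi^N}(0, x0) - V(0, x0)| shrinking as pi^N concentrates on f_1.
import itertools
import numpy as np

def single_cost(lam, u_pieces, x0, T=1.0, substeps=50):
    """Cost (3) with ell = u^2, h = -x(T) and dynamics xdot = lam*x + sin(x) + u."""
    dt = T / (substeps * len(u_pieces))
    x, running = x0, 0.0
    for u in np.repeat(u_pieces, substeps):
        running += (u ** 2) * dt
        x += dt * (lam * x + np.sin(x) + u)
    return running - x

def approx_value(lambdas, alphas, x0, pieces=4, levels=np.linspace(-1, 1, 5)):
    """Minimize the averaged cost over piecewise-constant controls only."""
    return min(sum(a * single_cost(lam, u, x0) for a, lam in zip(alphas, lambdas))
               for u in itertools.product(levels, repeat=pieces))

lambdas = [0.0, 1.0, -1.0, 0.5, -0.5]     # lambda_1 = 0 plays the role of the true dynamics
V_true = approx_value(lambdas[:1], [1.0], x0=1.0)
for N in (1, 2, 3):
    a1 = 1 - 1 / 2 ** N
    alphas = [a1] + [(1 - a1) / 4] * 4
    print(N, abs(approx_value(lambdas, alphas, x0=1.0) - V_true))   # gap roughly halves with N
```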

A. A case study: finite support measures converging to δ_f

Let us consider the case in which the distribution π^N is a linear combination of a finite number of Dirac deltas defined on a family of equi-bounded and equi-Lipschitz continuous functions X := {f_1, ..., f_M}:

    π^N := Σ_{i=1}^M α_i^N δ_{f_i},    (11)

where α_i^N ≥ 0 and Σ_{i=1}^M α_i^N = 1 for every i = 1, ..., M and N ∈ N. In this case, the cost functional (5) can be written as

    J_{π^N,s,x0}[u] := Σ_{i=1}^M α_i^N ( ∫_s^T ℓ(x^{f_i}(t), u(t)) dt + h(x^{f_i}(T)) ).

Without loss of generality, let us also assume that f ≡ f_1 is the real underlying dynamics. Then the next result follows from Theorem III.1:

Corollary III.3. Let us consider a sequence of probability distributions π^N defined as in (11). Assume that f_1 : R^n × U → R^n is Lipschitz continuous and that assumption (H2) holds for the functions ℓ and h. Then
1) ||V_{π^N} − V||_∞ ≤ C(L_f, L_ℓ, L_h, T) E_{π^N}[||f − g||_∞] for all N ∈ N, where V is the value function of Problem A relative to the true dynamics f_1.
2) If {π^N}_{N∈N} is a sequence of probability distributions with the same support {f_1, ..., f_M} and π^N converges to δ_{f_1}, then V_{π^N} converges to V.

IV. NUMERICAL TESTS

In this section, we will present two numerical tests. Both tests deal with an extremely simplified situation: a parametric model, where the parameter can take only a finite number of values (cf. section III-A). We remark that these tests are intended for illustrative purposes only and their main goal is to verify that Theorem III.1 and Corollary III.2 hold in a particular case. Indeed, the theory developed here includes far more general cases than this, including parametric models with a large number of parameters and infinitely many possible values for each parameter (e.g. deep neural networks [12], [15], [16]) or even non-parametric models (e.g. Gaussian processes [10], [13]).

A. Test 1

We consider a dynamical system governed by the differential equation

    ẋ(t) = λ x(t) + sin(x(t)) + u(t),  t ∈ [s, T],    x(s) = x0,

with u(t) ∈ U = [−1, 1] for t ∈ [s, T]. The agent, who does not know the parameter λ, has a probability distribution on a set of 5 possible values for λ:

    {λ_1 = 0, λ_2 = 1, λ_3 = −1, λ_4 = 0.5, λ_5 = −0.5}.

For each N ∈ N the probability distribution can be written, similarly to (11), as

    π^N = Σ_{i=1}^5 α_i^N δ_{λ_i}.

Note that a probability distribution on a set of parameters is equivalent to a probability distribution on a family of functions {f_i}_i (see also Remark II.1), where

    f_i(x, u) = λ_i x + sin(x) + u.

We set s ≡ 0. The agent needs to minimize an averaged cost

    J_{π^N,0,x0}[u] = Σ_{i=1}^5 α_i^N ( ∫_0^T u(t)^2 dt − x_i(T) ) = ∫_0^T u(t)^2 dt + Σ_{i=1}^5 α_i^N [−x_i(T)]    (12)

over the set U_0 of measurable controls. The Problem B associated to a fixed π^N can thus be seen as an optimal control problem in dimension 5, where the state variable includes the five trajectories x_1, ..., x_5.

We solved this problem numerically when the probabilities α_i^N were defined according to the following rule:

    α_1^N = 1 − 1/2^N,    α_i^N = 1/(4 · 2^N)  for i = 2, ..., 5.

It is clear that the sequence is converging to δ_{λ_1}, which we assume to be the parameter λ of the true dynamics, following the setting of subsection III-A. An example of optimal (multi-)trajectory of Problem B relative to π^1 is plotted in Fig. 1. In order to minimize the cost functional (12), the agent tries to steer all the trajectories towards the positive values of the real axis, using a control close to +1; at the same time, the optimal control cannot be constantly +1, since the cost functional penalizes larger values of the control (Fig. 2).

[Fig. 1. Test 1: Optimal (multi-)trajectory starting from x0 = 1.]

[Fig. 2. Test 1: Optimal control starting from x0 = 1.]

For each N = 1, ..., 8 we computed the value function of Problem B by solving the equations given by the Pontryagin Maximum Principle (see e.g. [2]), for a grid of initial points x0 ∈ [−1, 1]. Then, we compared them to the value function of Problem A relative to the true dynamics f_1, computing the sup norm of the difference V_{π^N} − V over the interval [−1, 1]; the results are reported in Table I. Note that at each iteration the Wasserstein distance W1 between π^N and δ_{f_1} is halved and so is the error; this means that the numerical convergence order is 1, which agrees with the estimate given by Corollary III.3.

TABLE I. Test 1: Errors for the value functions related to π^N, N = 1, ..., 8, with respect to the true value function of Problem A.

    N    α_1^N     ||V_{π^N} − V||_{∞,[−1,1]}    order
    1    0.5       1.57e-1                       -
    2    0.75      7.87e-2                       1.00
    3    0.875     3.94e-2                       1.00
    4    0.9375    1.97e-2                       1.00
    5    0.9687    9.84e-3                       1.00
    6    0.9844    4.92e-3                       1.00
    7    0.9922    2.46e-3                       1.00
    8    0.9961    1.23e-3                       1.00

B. Test 2

The convergence results presented in this work hold under the assumption that the cost functions ℓ and h are both globally Lipschitz continuous. With the following numerical test, we show that there are other examples of practical interest where we can observe a similar convergent behavior in the error ||V_{π^N} − V||_∞, even though this hypothesis is not verified.

In this second example, the state of the system is 2-dimensional. The three possible dynamics are all linear in the space variable and they differ only by the system matrix A_i ∈ R^{2×2}:

    ẋ^i(t) = A_i x^i(t) + (cos(u(t)), sin(u(t)))^T,    with x^i = (x_1^i, x_2^i)^T,

where u is a 1-dimensional control which lies in [0, 2π]. The three possible matrices are

    A_1 = [1 0; 0 1],    A_2 = [0.5 0; 0 2],    A_3 = [0.5 −0.5; 0.5 0.5].

Without loss of generality, we assume that the true dynamics corresponds to the first matrix. The agent only knows a probability distribution π^N on the set of the three matrices, defined as in the previous example:

    π^N = Σ_{i=1}^3 α_i^N δ_{A_i},

where the weights are defined according to the rule

    α_1^N = 1 − 1/2^N,    α_i^N = 1/2^{N+1}  for i = 2, 3.

We set again s ≡ 0. The functional cost to be minimized is

    J_{π^N,0,x0}[u] = (1/2) Σ_{i=1}^3 α_i^N ( ∫_0^T ||x^i(t)||^2 dt + ||x^i(T)||^2 ),    (13)

where ||·|| indicates the Euclidean norm in R^2. In Fig. 3 we can see an example of optimal (multi-)trajectory for this problem, when the distribution π has been chosen to be (1/3, 1/3, 1/3). In this case, the agent has to minimize the average squared distance of the trajectories from the origin, thus it looks for a single control to steer all three trajectories towards the origin at the same time. Clearly, we can see that none of the trajectories reaches exactly the origin, since a control which is optimal for one of the three dynamics may not be optimal for the other two.

[Fig. 3. Test 2: Optimal (multi-)trajectory for π = (1/3, 1/3, 1/3) starting from x0 = (−0.4, 0.3).]

In Table II the distance between the value function of Problem B and the true value function of Problem A is reported, for different distributions π^N. Even in this case, although the hypotheses of Corollary III.3 are not satisfied, we observe a clear convergence with order 1.
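As an aside on how the "order" column of the tables can be read: since W1(π^N, δ_{f_1}) is halved at each step, an order of 1 corresponds to the error being halved as well, i.e. the base-2 logarithm of the ratio of successive errors being close to 1. A minimal sketch (our own; it merely re-reads the Table I entries as illustrative input) is:

```python
# Illustrative sketch only: empirical convergence order from successive errors, assuming the
# Wasserstein distance is halved at each step (as in Tests 1 and 2).
import numpy as np

errors = np.array([1.57e-1, 7.87e-2, 3.94e-2, 1.97e-2,     # ||V_{pi^N} - V||_inf from Table I
                   9.84e-3, 4.92e-3, 2.46e-3, 1.23e-3])
orders = np.log2(errors[:-1] / errors[1:])
for N, order in enumerate(orders, start=2):
    print(f"N = {N}: order ~ {order:.2f}")
```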
TABLE II. Test 2: Errors for the value functions related to π^N, N = 1, ..., 6, with respect to the true value function of Problem A.

    N    α_1^N     ||V_{π^N} − V||_{∞,[−1,1]^2}    order
    1    0.5       2.52e-0                         -
    2    0.75      1.32e-0                         0.94
    3    0.875     6.81e-1                         0.95
    4    0.9375    3.47e-1                         0.97
    5    0.9687    1.75e-1                         0.98
    6    0.9844    8.81e-2                         0.99

V. CONCLUSIONS

In this paper, we have shown some convergence properties of the value function for optimal control problems in an uncertain environment. In the framework of the paper, the degree of uncertainty of the control system is captured by a probability measure defined on a compact space of functions. Furthermore, such a probability measure is updated as soon as more information on the environment is gained. The paper framework is closely related to many model-based RL algorithms which aim at designing a suitable probability distribution rather than providing a pointwise estimate of the dynamics.

Similar results have been proved for the Linear Quadratic Regulator problem [21]. The main novelty of the paper consists in the assumptions on the dynamical system and the cost, since we abandon the classical linear quadratic setting to deal with general nonlinear optimal control problems. We believe that the hypotheses of our theoretical result can be further relaxed. Some numerical examples seem to be a good omen for future investigations in this direction.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.
[2] W. H. Fleming and R. W. Rishel, Deterministic and Stochastic Optimal Control. Springer Science & Business Media, 2012, vol. 1.
[3] M. Bardi and I. Capuzzo-Dolcetta, Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Birkhäuser, 1997.
[4] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement Learning is Direct Adaptive Optimal Control," IEEE Control Systems, vol. 12, no. 2, pp. 19–22, 1992.
[5] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[7] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning. PMLR, 2015, pp. 1889–1897.
[8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in 4th International Conference on Learning Representations (ICLR), 2016.
[9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[10] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, "Gaussian processes for data-efficient learning in robotics and control," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 408–423, 2013.
[11] T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, "Benchmarking model-based reinforcement learning," arXiv preprint arXiv:1907.02057, 2019.
[12] Y. Gal, R. McAllister, and C. E. Rasmussen, "Improving PILCO with Bayesian neural network dynamics models," in Data-Efficient Machine Learning Workshop, ICML, vol. 4, no. 34, 2016, p. 25.
[13] S. Kamthe and M. Deisenroth, "Data-efficient reinforcement learning with probabilistic model predictive control," in International Conference on Artificial Intelligence and Statistics. PMLR, 2018, pp. 1701–1710.
[14] M. Ghavamzadeh, S. Mannor, J. Pineau, A. Tamar et al., "Bayesian Reinforcement Learning: A Survey," Foundations and Trends in Machine Learning, vol. 8, no. 5-6, pp. 359–483, 2015.
[15] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models," in Advances in Neural Information Processing Systems, 2018, pp. 4754–4765.
[16] M. Janner, J. Fu, M. Zhang, and S. Levine, "When to trust your model: Model-based policy optimization," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[17] R. Murray and M. Palladino, "A model for system uncertainty in reinforcement learning," Systems & Control Letters, vol. 122, pp. 24–31, 2018.
[18] P. Bettiol and N. Khalil, "Necessary optimality conditions for average cost minimization problems," Discrete & Continuous Dynamical Systems - B, vol. 24, no. 5, p. 2093, 2019.
[19] E. Zuazua, "Averaged control," Automatica, vol. 50, no. 12, pp. 3077–3087, 2014.
[20] J. Lohéac and E. Zuazua, "From averaged to simultaneous controllability," in Annales de la Faculté des sciences de Toulouse: Mathématiques, vol. 25, no. 4, 2016, pp. 785–828.
[21] A. Pesare, M. Palladino, and M. Falcone, "A convergent approximation of the linear quadratic optimal control problem for Reinforcement Learning," arXiv:2011.03447, 2020.
[22] C. Villani, Optimal Transport: Old and New. Springer Science & Business Media, 2008, vol. 338.
