
Convergence of the Value Function in Optimal Control Problems with Unknown Dynamics

Andrea Pesare¹, Michele Palladino² and Maurizio Falcone¹

arXiv:2105.13708v1 [math.OC] 28 May 2021

Abstract—We deal with the convergence of the value function of an approximate control problem with uncertain dynamics to the value function of a nonlinear optimal control problem. The assumptions on the dynamics and the costs are rather general, and uncertainty in the dynamics is represented by a probability distribution. The proposed framework aims to describe and motivate some model-based Reinforcement Learning algorithms where the model is probabilistic. We also show some numerical experiments which confirm the theoretical results.

Keywords—Reinforcement learning, optimal control, nonlinear systems, system identification, convergence.

I. INTRODUCTION

Reinforcement Learning (RL) is an important branch of Machine Learning aiming to provide satisfactory policies that an agent can easily implement in an uncertain environment [1]. The agent acquires knowledge about (learns) the environment from its past experience, which is usually represented by a series of statistical data, but it can also learn while interacting with the system. Optimal control [2], [3] and RL are strongly connected [4], [5], to the point that in [4] the authors write "Reinforcement Learning is direct adaptive optimal control".

RL algorithms are generally classified into two categories: model-based methods, which build a predictive model of the environment and use it to construct a controller, and model-free methods, which directly learn a policy or a value function by interacting with the environment. Model-free algorithms have shown great performance [6], [7], [8], [9], although they are generally quite expensive to train, especially in terms of sample complexity; this often limits their application to simulated domains. On the contrary, model-based techniques show a higher sample efficiency [10], [11], which results in faster learning. Furthermore, recent algorithms have managed to limit the model-bias phenomenon by using probabilistic models, which capture the uncertainty of the learned model [10], [12], [13]. In Bayesian Reinforcement Learning (BRL), the dynamics model is updated when new data are available [14]. Finally, the recently introduced probabilistic model ensembles [15], [16] have allowed model-based methods to achieve the same asymptotic performance as state-of-the-art model-free methods, with higher sample efficiency. Thanks to these features, model-based methods seem to be the most suitable for solving complex real-world problems.

In this paper, we consider the class of BRL algorithms for continuous state-action spaces and we analyze them from the viewpoint of control theory. In particular, we consider nonlinear control systems in which the dynamics is only partially known, and we assume that the agent's belief about the dynamics is represented by a probability distribution π on a space of functions [17]. The task is thus written as an optimal control problem with averaged cost, a formulation that has received growing interest in the last few years (see e.g. [18], [19], [20]). In probabilistic model-based RL algorithms, this corresponds to the policy improvement step, where the agent seeks the best control given a probabilistic model π, using a policy search method [10], [12], [16], MPC [13], [15] or other methods. In the framework we propose, the precision with which π describes the true dynamics improves as the dataset becomes wider. This reflects the situation in BRL where the agent learns about the surrounding environment as it gains more experience and updates the dynamics model.

The main objective of the paper is a convergence result of the value function Vπ of an averaged (with respect to a probability measure π) optimal control problem to the "true value function" V. Here, by true value function we mean the one defined by the optimal control problem governed by the true, underlying dynamics. Roughly speaking, the main result of the paper can be stated as follows: the value function Vπ is close to V as soon as π provides an accurate representation of the true, underlying dynamics. Similar results have recently been obtained for a general Linear Quadratic Regulator problem with finite horizon (see [21]). In the present paper, we focus our attention on a general, nonlinear optimal control problem over a finite horizon, under globally Lipschitz assumptions on the costs and the controlled dynamics.

In the next section, we present the precise setting of the paper, whereas in section III we state and prove the main result. Section IV is devoted to some numerical tests, and section V provides the conclusions and an overview of open questions.

¹ A. Pesare and M. Falcone are with the Department of Mathematics, Sapienza Università di Roma, 00185 Rome, Italy (e-mail: [email protected], [email protected]).
² M. Palladino is with the Gran Sasso Science Institute - GSSI, 67100 L'Aquila, Italy (e-mail: [email protected]).

II. PROBLEM FORMULATION

This section aims to present the nonlinear optimal control framework that we want to solve. Before that, let us introduce some notation which will be used throughout the paper.

For vectors v ∈ R^n, |v| will denote the Euclidean norm. For continuous functions f : D → R with D ⊂ R^n, the notations ||f||_∞ and ||f||_∞,K will indicate the sup norm over the function domain D or over a compact set K ⊂ D, respectively.

Let π and π′ be two probability distributions on a compact metric space (X, d). The 1-Wasserstein distance (see [22]) between them is defined as

    W1(π, π′) := inf_{γ ∈ Γ(π,π′)} ∫_{X×X} ||g − f||_∞ dγ(g, f),    (1)

where Γ(π, π′) is the collection of all probability measures on X × X having π and π′ as marginals, f and g are generic elements of X, and the symbol dγ indicates that the integral is taken with respect to the measure γ. Given a probability space (Ω, F, π) and a random variable Y on Ω, Eπ[Y] denotes the expected value of Y with respect to π.
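To make definition (1) concrete, here is a small computational sketch (not taken from the paper; the helper name, the grid-based approximation of the sup norm, and the use of SciPy are our own illustrative choices). It computes W1 between two finite-support measures on a family of dynamics by solving the transport linear program; when one of the two measures is a Dirac mass, as in the convergence questions studied below, the optimal coupling is forced and W1 reduces to a plain expectation of ||g − f||_∞.

```python
# Illustrative sketch only: W1 between finite-support measures over a set of dynamics,
# with ground cost ||g - f||_inf approximated by sampling on a grid of states.
import numpy as np
from scipy.optimize import linprog

def w1_finite(p, q, funcs_p, funcs_q, grid):
    """p, q: probability vectors; funcs_*: lists of callables; grid: sample points."""
    m, n = len(funcs_p), len(funcs_q)
    # Ground cost C[i, j]: sup-norm distance between the i-th and j-th dynamics on the grid.
    C = np.array([[np.max(np.abs(f(grid) - g(grid))) for g in funcs_q] for f in funcs_p])
    # Transport plan gamma has m*n entries; its marginals must equal p and q.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # row sums of gamma = p
    for j in range(n):
        A_eq[m + j, j::n] = 1.0               # column sums of gamma = q
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

# Example in the spirit of the tests of section IV: pi^N versus the Dirac mass at f_1.
fs = [lambda x, lam=lam: lam * x + np.sin(x) for lam in (0.0, 1.0, -1.0, 0.5, -0.5)]
alpha = np.array([0.75, 0.0625, 0.0625, 0.0625, 0.0625])
grid = np.linspace(-1.0, 1.0, 201)
print(w1_finite(alpha, np.array([1.0]), fs, fs[:1], grid))  # = E_{pi^N}[||f_1 - g||_inf]
```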

A. Problem A: a classical optimal control problem

Let us consider a classical finite horizon optimal control problem ([2], [3]), which we will call Problem A. For 0 ≤ s < T, let us consider the controlled dynamics

    ẋ(t) = f(x(t), u(t)),  t ∈ [s, T],    x(s) = x0,    (2)

where the nonlinear dynamics f : R^n × R^m → R^n is continuous in the pair (x, u) and Lipschitz continuous with respect to x, uniformly with respect to u. These conditions guarantee that the Cauchy problem (2) is well-posed, in the sense that for every measurable control u and initial condition x(s) = x0 ∈ R^n there exists a unique solution of (2).

The goal is to minimize the cost functional

    J_{s,x0}[u] := ∫_s^T ℓ(x(t), u(t)) dt + h(x(T)),    (3)

over the class of admissible controls U_s = {u : [s, T] → U, measurable}, where U is a closed subset of R^m and ℓ and h are respectively the running cost and the terminal cost, which we require to be Lipschitz continuous with respect to x. For each (s, x0) ∈ [0, T] × R^n, the value function and the corresponding optimal control associated to the optimal control problem (2)-(3) are respectively defined as

    V(s, x0) := inf_{u ∈ U_s} J_{s,x0}[u]    and    u*(s, x0) := arg min_{u ∈ U_s} J_{s,x0}[u].

B. Problem B: an optimal control problem with uncertain dynamics

Let us now introduce another control problem in which the real dynamics f is unknown, meaning that one has merely partial knowledge of f. Such a model uncertainty is captured by a probability distribution on a space of functions X (to which f belongs). More precisely, X is a compact subset of C^0(R^n × U; R^n) with respect to the ||·||_∞ norm (the Arzelà-Ascoli Theorem provides necessary and sufficient conditions for the set X to be compact). For 0 ≤ s < T and every g ∈ X, let us define the dynamical system

    ẋ^g(t) = g(x^g(t), u(t)),  t ∈ [s, T],    x^g(s) = x0.    (4)

Given a probability distribution π over X, one can define a cost functional for Problem B:

    J_{π,s,x0}[u] := Eπ[ ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ]    (5)
                  = ∫_X ( ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ) dπ(g).

Note that, even if there is a different trajectory x^g(t) for each g ∈ X, the task is to look for a single control u to be applied to every system dynamics g ∈ X. The notions of value function and optimal control are analogous to those given for Problem A:

    Vπ(s, x0) := inf_{u ∈ U_s} J_{π,s,x0}[u]    and    u*π(s, x0) := arg min_{u ∈ U_s} J_{π,s,x0}[u].

Remark II.1. It is worth pointing out that the theory developed here works both for non-parametric models (e.g. Gaussian processes [10], [13]) and for parametric models (e.g. deep neural networks [12], [15], [16]). In the latter case, we are assuming that the support of π is a family of functions {f_λ : λ ∈ R^d} described by a parameter λ ∈ R^d, with the dimension d arbitrarily large. π can thus be seen as a probability distribution on the parameter space R^d and is then easier to work with (see, for instance, the numerical tests in section IV).

There are several model-based RL methods that rely on the design of probability distributions representing the belief that an agent has about the environment. In BRL algorithms [10], [12], [13], [15], [16], [14], the probability distribution, which is built upon the data collected while exploring the partially unknown environment, is updated when new experience is gained. In this context, it is reasonable to expect that the newest probability distribution representing the environment will be more accurate than the initial one.

In the framework of our paper, π is the probability distribution, i.e. the probabilistic model, which represents the knowledge that the agent has of the environment. Clearly, as the accuracy of π increases, one should expect that the value function of Problem B is close (in a sense that will be made precise) to the value function of Problem A. In particular, we will investigate the following questions:

(A) How far is Vπ from V?
(B) If π^N → δ_f, is it true that V_{π^N} → V, where {V_{π^N}}_{N∈N} are the value functions of a sequence of problems of type B, and {π^N}_{N∈N} are the respective probability distributions on X?
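Before moving to the main results, it may help to see Problem B at work in the simplest situation, a finite-support model of the kind analyzed in section III-A: the averaged cost (5) is then a weighted sum of standard costs (3), one for each candidate dynamics, all driven by the same control. The sketch below is purely illustrative (a crude forward-Euler integrator and hypothetical function names, not the scheme used in the numerical tests of section IV).

```python
# Illustrative sketch only: the averaged cost (5) for pi = sum_i alpha_i * delta_{g_i}.
import numpy as np

def averaged_cost(alphas, dynamics, u, x0, ell, h, s=0.0, T=1.0, dt=1e-3):
    """J_{pi,s,x0}[u] = sum_i alpha_i * ( int_s^T ell(x_i(t), u(t)) dt + h(x_i(T)) )."""
    total = 0.0
    for alpha, g in zip(alphas, dynamics):
        x, running = np.asarray(x0, dtype=float), 0.0
        for t in np.arange(s, T, dt):
            running += ell(x, u(t)) * dt              # running cost along this trajectory
            x = x + dt * np.asarray(g(x, u(t)))       # Euler step of xdot = g(x, u)
        total += alpha * (running + h(x))             # same control u for every dynamics g
    return total

# A Test-1-like setup: g_i(x, u) = lambda_i * x + sin(x) + u, ell = u^2, h = -x(T).
lambdas = [0.0, 1.0, -1.0, 0.5, -0.5]
dyn = [lambda x, u, lam=lam: lam * x + np.sin(x) + u for lam in lambdas]
alphas = [0.75, 0.0625, 0.0625, 0.0625, 0.0625]
J = averaged_cost(alphas, dyn, u=lambda t: 1.0, x0=[1.0],
                  ell=lambda x, u: u ** 2, h=lambda x: -x[0])
print(J)
```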

III. MAIN RESULTS

In this section, we will state and prove the main result of the paper, valid for a general family of probability distributions.

Theorem III.1. Let us consider two Problems of type B as described in section II-B, one with distribution π^N and the other with distribution π^∞. We make the further assumptions:

(H1) There exists a constant L_f > 0 such that
    |f(x, u) − f(y, u)| ≤ L_f |x − y|
for each f ∈ supp(π^∞), x, y ∈ R^n, u ∈ U;

(H2) The two cost functions ℓ and h are Lipschitz continuous in the first argument with constants respectively L_ℓ and L_h.

Then the following estimate holds:

    ||V_{π^N} − V_{π^∞}||_∞ ≤ C(L_f, L_ℓ, L_h, T) W1(π^N, π^∞),    (6)

where C(L_f, L_ℓ, L_h, T) is a constant which depends only on the Lipschitz constants and on T, and W1(π^N, π^∞) is the 1-Wasserstein distance defined in (1).

Proof. We divide the proof in three steps.

STEP 1: Fix two dynamics g ∈ X and f ∈ supp(π^∞), an initial condition x(s) = x0 with s ∈ [0, T] and x0 ∈ R^n, and a control u ∈ U_s. We estimate how far x^g(t) is from x^f(t), using Gronwall's Lemma, for each t ∈ [s, T]. Recall that x^f(t) and x^g(t) are solutions of the dynamical systems (4):

    x^f(t) = x0 + ∫_s^t f(x^f(τ), u(τ)) dτ,    x^g(t) = x0 + ∫_s^t g(x^g(τ), u(τ)) dτ.

Then we have the following estimate:

    |x^g(t) − x^f(t)| ≤ ∫_s^t |g(x^g(τ), u(τ)) − f(x^f(τ), u(τ))| dτ
                      ≤ ∫_s^t |g(x^g(τ), u(τ)) − f(x^g(τ), u(τ))| dτ + ∫_s^t |f(x^g(τ), u(τ)) − f(x^f(τ), u(τ))| dτ
                      ≤ (t − s) ||f − g||_∞ + L_f ∫_s^t |x^g(τ) − x^f(τ)| dτ.

Then, by Gronwall's Lemma,

    |x^g(t) − x^f(t)| ≤ (t − s) ||f − g||_∞ e^{L_f (t−s)} ≤ t ||f − g||_∞ e^{L_f t}.    (7)

STEP 2: Fix an initial condition x(s) = x0 with s ∈ [0, T] and x0 ∈ R^n, and a control u ∈ U_s. We estimate how far J_{π^N,s,x0}[u] is from J_{π^∞,s,x0}[u]. To lighten the notation, we will write J_{π^N}[u] = J_{π^N,s,x0}[u] and J_{π^∞}[u] = J_{π^∞,s,x0}[u].

For each g ∈ X and f ∈ supp(π^∞) it holds

    |ℓ(x^g(t), u(t)) − ℓ(x^f(t), u(t))| ≤ L_ℓ |x^g(t) − x^f(t)| ≤ L_ℓ t ||f − g||_∞ e^{L_f t}    (8)

and

    |h(x^g(T)) − h(x^f(T))| ≤ L_h T ||f − g||_∞ e^{L_f T}.    (9)

As a property of W1, there exists (see Theorem 4.1 in [22]) a distribution γ* on X × X with marginal distributions π^N and π^∞ such that

    W1(π^N, π^∞) = ∫_{X×X} ||g − f||_∞ dγ*(g, f).

Let (g(x), f(x))_{x∈R^n} be a continuous process which has γ* as distribution. For each event ω ∈ Ω, we consider the realization of the process (g^ω, f^ω). Summing over g ∈ supp(π^N) and f ∈ supp(π^∞), we obtain

    J_{π^N}[u] − J_{π^∞}[u]
      = ∫_X ( ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ) dπ^N(g) − ∫_X ( ∫_s^T ℓ(x^f(t), u(t)) dt + h(x^f(T)) ) dπ^∞(f)
      = ∫_{X×X} ( ∫_s^T [ℓ(x^g(t), u(t)) − ℓ(x^f(t), u(t))] dt + [h(x^g(T)) − h(x^f(T))] ) dγ*(g, f).

Passing to the absolute value and using the bounds in (8) and (9), we recover the expression of W1:

    |J_{π^N}[u] − J_{π^∞}[u]| ≤ ( L_ℓ ∫_0^T t e^{L_f t} dt + L_h T e^{L_f T} ) ∫_{X×X} ||g − f||_∞ dγ*(g, f)
      = ( L_ℓ (e^{L_f T}(L_f T − 1) + 1) / L_f^2 + L_h T e^{L_f T} ) W1(π^N, π^∞)
      = C(L_f, L_ℓ, L_h, T) W1(π^N, π^∞).    (10)

Note that this estimate does not depend on x0 or u.

STEP 3: We now prove the estimate (6) using (10). Fix an initial condition x(s) = x0 and some ε > 0. By the definition of V_{π^∞}, there exists some control u_ε ∈ U_s such that

    J_{π^∞}[u_ε] ≤ V_{π^∞}(s, x0) + ε.

Then one has

    V_{π^N}(s, x0) − V_{π^∞}(s, x0) = inf_{u ∈ U_s} J_{π^N}[u] − inf_{u ∈ U_s} J_{π^∞}[u]
      < inf_{u ∈ U_s} J_{π^N}[u] − J_{π^∞}[u_ε] + ε
      ≤ J_{π^N}[u_ε] − J_{π^∞}[u_ε] + ε
      ≤ sup_{u ∈ U_s} |J_{π^N}[u] − J_{π^∞}[u]| + ε.

In the same way, we get

    V_{π^∞}(s, x0) − V_{π^N}(s, x0) < sup_{u ∈ U_s} |J_{π^N}[u] − J_{π^∞}[u]| + ε,

but ε is arbitrary, so eventually

    |V_{π^N}(s, x0) − V_{π^∞}(s, x0)| ≤ sup_{u ∈ U_s} |J_{π^N}[u] − J_{π^∞}[u]|.

Finally, noting that the estimate is independent of x0, we get the result:

    ||V_{π^N} − V_{π^∞}||_∞ ≤ sup_{u ∈ U_s} |J_{π^N}[u] − J_{π^∞}[u]| ≤ C(L_f, L_ℓ, L_h, T) W1(π^N, π^∞).  ∎

In the particular case where π^∞ ≡ δ_f, f being the true underlying dynamics, one can express Theorem III.1 in a more expressive fashion as follows:

Corollary III.2 (Convergence of the value functions). Suppose that f : R^n × U → R^n is a Lipschitz continuous function and that assumption (H2) on ℓ and h is satisfied. Let {π^N}_{N∈N} be a sequence of probability distributions on a compact set X ⊂ C^0(R^n × U), with π^N → δ_f in the W1 distance. Then the value function V_{π^N} of Problem B converges uniformly on [0, T] × R^n to the value function V of Problem A.

The previous Corollary provides a positive answer to the questions (A)-(B) posed in the introduction. More precisely, it tells us that, whenever it is possible to construct a probability distribution π sufficiently close (with respect to the Wasserstein distance) to the real dynamics f, the value function Vπ is a good approximation of the value function V.
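As a rough sanity check of the convergence stated above, one can approximate both V_{π^N} and V on a toy problem and watch the gap shrink as π^N approaches δ_f. The sketch below is only indicative and is not the method used in section IV: it restricts the minimization to piecewise-constant controls on a coarse grid of control levels, uses a simple Euler discretization, and all names are our own.

```python
# Illustrative sketch only: brute-force upper approximations of V_pi and V on a scalar toy
# problem, to observe |V_{pi^N}(0, x0) - V(0, x0)| shrinking as pi^N concentrates on f_1.
import itertools
import numpy as np

def single_cost(lam, u_pieces, x0, T=1.0, substeps=50):
    """Cost (3) with ell = u^2, h = -x(T) and dynamics xdot = lam*x + sin(x) + u."""
    dt = T / (substeps * len(u_pieces))
    x, running = x0, 0.0
    for u in np.repeat(u_pieces, substeps):
        running += (u ** 2) * dt
        x += dt * (lam * x + np.sin(x) + u)
    return running - x

def approx_value(lambdas, alphas, x0, pieces=4, levels=np.linspace(-1, 1, 5)):
    """Minimize the averaged cost over piecewise-constant controls only."""
    return min(sum(a * single_cost(lam, u, x0) for a, lam in zip(alphas, lambdas))
               for u in itertools.product(levels, repeat=pieces))

lambdas = [0.0, 1.0, -1.0, 0.5, -0.5]     # lambda_1 = 0 plays the role of the true dynamics
V_true = approx_value(lambdas[:1], [1.0], x0=1.0)
for N in (1, 2, 3):
    a1 = 1 - 1 / 2 ** N
    alphas = [a1] + [(1 - a1) / 4] * 4
    print(N, abs(approx_value(lambdas, alphas, x0=1.0) - V_true))   # gap roughly halves with N
```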

A. A case study: finite support measures converging to δ_f

Let us consider the case in which the distribution π^N is a linear combination of a finite number of Dirac deltas defined on a family of equi-bounded and equi-Lipschitz continuous functions X := {f_1, ..., f_M}:

    π^N := Σ_{i=1}^M α_i^N δ_{f_i},    (11)

where α_i^N ≥ 0 and Σ_{i=1}^M α_i^N = 1 for every i = 1, ..., M and N ∈ N. In this case, the cost functional (5) can be written as

    J_{π^N,s,x0}[u] := Σ_{i=1}^M α_i^N ( ∫_s^T ℓ(x^{f_i}(t), u(t)) dt + h(x^{f_i}(T)) ).

Without loss of generality, let us also assume that f ≡ f_1 is the real underlying dynamics. Then the next result follows from Theorem III.1:

Corollary III.3. Let us consider a sequence of probability distributions π^N defined as in (11). Assume that f_1 : R^n × U → R^n is Lipschitz continuous and that assumption (H2) holds for the functions ℓ and h. Then
1) ||V_{π^N} − V||_∞ ≤ C(L_f, L_ℓ, L_h, T) E_{π^N}[||f − g||_∞] for all N ∈ N, where V is the value function of Problem A relative to the true dynamics f_1.
2) If {π^N}_{N∈N} is a sequence of probability distributions with the same support {f_1, ..., f_M} and π^N converges to δ_{f_1}, then V_{π^N} converges to V.

IV. NUMERICAL TESTS

In this section, we will present two numerical tests. Both tests deal with an extremely simplified situation: a parametric model, where the parameter can take only a finite number of values (cf. section III-A). We remark that these tests are intended for illustrative purposes only and their main goal is to verify that Theorem III.1 and Corollary III.2 hold in a particular case. Indeed, the theory developed here includes far more general cases than this, including parametric models with a large number of parameters and infinitely many possible values for each parameter (e.g. deep neural networks [12], [15], [16]) or even non-parametric models (e.g. Gaussian processes [10], [13]).

A. Test 1

We consider a dynamical system governed by the differential equation

    ẋ(t) = λ x(t) + sin(x(t)) + u(t),  t ∈ [s, T],    x(s) = x0,

with u(t) ∈ U = [−1, 1] for t ∈ [s, T]. The agent, who does not know the parameter λ, has a probability distribution on a set of 5 possible values for λ:

    {λ_1 = 0, λ_2 = 1, λ_3 = −1, λ_4 = 0.5, λ_5 = −0.5}.

For each N ∈ N the probability distribution can be written, similarly to (11), as

    π^N = Σ_{i=1}^5 α_i^N δ_{λ_i}.

Note that a probability distribution on a set of parameters is equivalent to a probability distribution on a family of functions {f_i}_i (see also Remark II.1), where

    f_i(x, u) = λ_i x + sin(x) + u.

We set s ≡ 0. The agent needs to minimize an averaged cost

    J_{π^N,0,x0}[u] = Σ_{i=1}^5 α_i^N ( ∫_0^T u(t)^2 dt − x_i(T) ) = ∫_0^T u(t)^2 dt + Σ_{i=1}^5 α_i^N [−x_i(T)]    (12)

over the set U_0 of measurable controls. The Problem B associated to a fixed π^N can thus be seen as an optimal control problem in dimension 5, where the state variable includes the five trajectories x_1, ..., x_5.

We solved this problem numerically when the probabilities α_i^N were defined according to the following rule:

    α_1^N = 1 − 1/2^N,    α_i^N = 1/(4 · 2^N)  for i = 2, ..., 5.

It is clear that the sequence is converging to δ_{λ_1}, which we assume to be the parameter λ of the true dynamics, following the setting of subsection III-A. An example of optimal (multi-)trajectory of Problem B relative to π^1 is plotted in Fig. 1. In order to minimize the cost functional (12), the agent tries to steer all the trajectories towards the positive values of the real axis, using a control close to +1; at the same time, the optimal control cannot be constantly +1, since the cost functional penalizes larger values of the control (Fig. 2).

[Fig. 1. Test 1: Optimal (multi-)trajectory starting from x0 = 1.]

[Fig. 2. Test 1: Optimal control starting from x0 = 1.]

For each N = 1, ..., 8 we computed the value function of Problem B by solving the equations given by the Pontryagin Maximum Principle (see e.g. [2]), for a grid of initial points x0 ∈ [−1, 1]. Then, we compared them to the value function of Problem A relative to the true dynamics f_1, computing the sup norm of the difference V_{π^N} − V over the interval [−1, 1]; the results are reported in Table I. Note that at each iteration the Wasserstein distance W1 between π^N and δ_{f_1} is halved and so is the error; this means that the numerical convergence order is 1, which agrees with the estimate given by Corollary III.3.

TABLE I. Test 1: Errors for the value functions related to π^N, N = 1, ..., 8, with respect to the true value function of Problem A.

    N    α_1^N     ||V_{π^N} − V||_{∞,[−1,1]}    order
    1    0.5       1.57e-1                       -
    2    0.75      7.87e-2                       1.00
    3    0.875     3.94e-2                       1.00
    4    0.9375    1.97e-2                       1.00
    5    0.9687    9.84e-3                       1.00
    6    0.9844    4.92e-3                       1.00
    7    0.9922    2.46e-3                       1.00
    8    0.9961    1.23e-3                       1.00

B. Test 2

The convergence results presented in this work hold under the assumption that the cost functions ℓ and h are both globally Lipschitz continuous. With the following numerical test, we show that there are other examples of practical interest where we can observe a similar convergent behavior in the error ||V_{π^N} − V||_∞, even though this hypothesis is not verified.

In this second example, the state of the system is 2-dimensional. The three possible dynamics are all linear in the space variable and they differ only by the system matrix A_i ∈ R^{2×2}:

    ẋ^i(t) = A_i x^i(t) + (cos(u(t)), sin(u(t)))^T,    with x^i = (x_1^i, x_2^i)^T,

where u is a 1-dimensional control which lies in [0, 2π]. The three possible matrices are

    A_1 = [1 0; 0 1],    A_2 = [0.5 0; 0 2],    A_3 = [0.5 −0.5; 0.5 0.5].

Without loss of generality, we assume that the true dynamics corresponds to the first matrix. The agent only knows a probability distribution π^N on the set of the three matrices, defined as in the previous example:

    π^N = Σ_{i=1}^3 α_i^N δ_{A_i},

where the weights are defined according to the rule

    α_1^N = 1 − 1/2^N,    α_i^N = 1/2^{N+1}  for i = 2, 3.

We set again s ≡ 0. The functional cost to be minimized is

    J_{π^N,0,x0}[u] = (1/2) Σ_{i=1}^3 α_i^N ( ∫_0^T ||x^i(t)||^2 dt + ||x^i(T)||^2 ),    (13)

where ||·|| indicates the Euclidean norm in R^2. In Fig. 3 we can see an example of optimal (multi-)trajectory for this problem, when the distribution π has been chosen to be (1/3, 1/3, 1/3). In this case, the agent has to minimize the average squared distance of the trajectories from the origin, thus it looks for a single control to steer all three trajectories towards the origin at the same time. Clearly, we can see that none of the trajectories reaches exactly the origin, since a control which is optimal for one of the three dynamics may not be optimal for the other two.

[Fig. 3. Test 2: Optimal (multi-)trajectory for π = (1/3, 1/3, 1/3) starting from x0 = (−0.4, 0.3).]

In Table II the distance between the value function of Problem B and the true value function of Problem A is reported, for different distributions π^N. Even in this case, although the hypotheses of Corollary III.3 are not satisfied, we observe a clear convergence with order 1.
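As an aside on how the "order" column of the tables can be read: since W1(π^N, δ_{f_1}) is halved at each step, an order of 1 corresponds to the error being halved as well, i.e. the base-2 logarithm of the ratio of successive errors being close to 1. A minimal sketch (our own; it merely re-reads the Table I entries as illustrative input) is:

```python
# Illustrative sketch only: empirical convergence order from successive errors, assuming the
# Wasserstein distance is halved at each step (as in Tests 1 and 2).
import numpy as np

errors = np.array([1.57e-1, 7.87e-2, 3.94e-2, 1.97e-2,     # ||V_{pi^N} - V||_inf from Table I
                   9.84e-3, 4.92e-3, 2.46e-3, 1.23e-3])
orders = np.log2(errors[:-1] / errors[1:])
for N, order in enumerate(orders, start=2):
    print(f"N = {N}: order ~ {order:.2f}")
```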
TABLE II. Test 2: Errors for the value functions related to π^N, N = 1, ..., 6, with respect to the true value function of Problem A.

    N    α_1^N     ||V_{π^N} − V||_{∞,[−1,1]^2}    order
    1    0.5       2.52e-0                         -
    2    0.75      1.32e-0                         0.94
    3    0.875     6.81e-1                         0.95
    4    0.9375    3.47e-1                         0.97
    5    0.9687    1.75e-1                         0.98
    6    0.9844    8.81e-2                         0.99

V. CONCLUSIONS

In this paper, we have shown some convergence properties of the value function for optimal control problems in an uncertain environment. In the framework of the paper, the degree of uncertainty of the control system is captured by a probability measure defined on a compact space of functions. Furthermore, such a probability measure is updated as soon as more information on the environment is gained. The paper framework is closely related to many model-based RL algorithms which aim at designing a suitable probability distribution rather than providing a pointwise estimate of the dynamics.

Similar results have been proved for the Linear Quadratic Regulator problem [21]. The main novelty of the paper consists in the assumptions on the dynamical system and the cost, since we abandon the classical linear quadratic setting to deal with general nonlinear optimal control problems. We believe that the hypotheses of our theoretical result can be further relaxed. Some numerical examples seem to be a good omen for future investigations in this direction.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.
[2] W. H. Fleming and R. W. Rishel, Deterministic and Stochastic Optimal Control. Springer Science & Business Media, 2012, vol. 1.
[3] M. Bardi and I. Capuzzo-Dolcetta, Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Birkhäuser, 1997.
[4] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement Learning is Direct Adaptive Optimal Control," IEEE Control Systems, vol. 12, no. 2, pp. 19–22, 1992.
[5] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[7] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning. PMLR, 2015, pp. 1889–1897.
[8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in 4th International Conference on Learning Representations (ICLR), 2016.
[9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[10] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, "Gaussian processes for data-efficient learning in robotics and control," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 408–423, 2013.
[11] T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, "Benchmarking model-based reinforcement learning," arXiv preprint arXiv:1907.02057, 2019.
[12] Y. Gal, R. McAllister, and C. E. Rasmussen, "Improving PILCO with Bayesian neural network dynamics models," in Data-Efficient Machine Learning Workshop, ICML, vol. 4, no. 34, 2016, p. 25.
[13] S. Kamthe and M. Deisenroth, "Data-efficient reinforcement learning with probabilistic model predictive control," in International Conference on Artificial Intelligence and Statistics. PMLR, 2018, pp. 1701–1710.
[14] M. Ghavamzadeh, S. Mannor, J. Pineau, A. Tamar et al., "Bayesian Reinforcement Learning: A Survey," Foundations and Trends in Machine Learning, vol. 8, no. 5-6, pp. 359–483, 2015.
[15] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models," in Advances in Neural Information Processing Systems, 2018, pp. 4754–4765.
[16] M. Janner, J. Fu, M. Zhang, and S. Levine, "When to trust your model: Model-based policy optimization," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[17] R. Murray and M. Palladino, "A model for system uncertainty in reinforcement learning," Systems & Control Letters, vol. 122, pp. 24–31, 2018.
[18] P. Bettiol and N. Khalil, "Necessary optimality conditions for average cost minimization problems," Discrete & Continuous Dynamical Systems - B, vol. 24, no. 5, p. 2093, 2019.
[19] E. Zuazua, "Averaged control," Automatica, vol. 50, no. 12, pp. 3077–3087, 2014.
[20] J. Lohéac and E. Zuazua, "From averaged to simultaneous controllability," in Annales de la Faculté des sciences de Toulouse: Mathématiques, vol. 25, no. 4, 2016, pp. 785–828.
[21] A. Pesare, M. Palladino, and M. Falcone, "A convergent approximation of the linear quadratic optimal control problem for Reinforcement Learning," arXiv:2011.03447, 2020.
[22] C. Villani, Optimal Transport: Old and New. Springer Science & Business Media, 2008, vol. 338.
