Control and Reinforcement Learning
Abstract—We deal with the convergence of the value function of an approximate control problem with uncertain dynamics to the value function of a nonlinear optimal control problem. The assumptions on the dynamics and the costs are rather general, and uncertainty in the dynamics is represented by a probability distribution. The proposed framework aims to describe and motivate some model-based Reinforcement Learning algorithms where the model is probabilistic. We also show some numerical experiments which confirm the theoretical results.

Keywords—Reinforcement learning, optimal control, nonlinear systems, system identification, convergence.

I. INTRODUCTION

Reinforcement Learning (RL) is an important branch of Machine Learning aiming to provide satisfactory policies that an agent can easily implement in an uncertain environment [1]. The agent acquires knowledge (learns) on the environment from its past experience, which is usually represented by series of statistical data, but it can also learn while interacting with the system. Optimal control [2], [3] and RL are strongly connected [4], [5], to the point that in [4] the authors write "Reinforcement Learning is direct adaptive optimal control".

RL algorithms are generally classified in two categories: model-based methods, which build a predictive model of the environment and use it to construct a controller, and model-free methods, which directly learn a policy or a value function by interacting with the environment. Model-free algorithms have shown great performance [6], [7], [8], [9], although they are generally quite expensive to train, especially in terms of sample complexity; this often limits their application to simulated domains. On the contrary, model-based techniques show a higher sample efficiency [10], [11], which results in faster learning. Furthermore, recent algorithms have managed to limit the model-bias phenomenon by using probabilistic models, which capture the uncertainty of the learned model [10], [12], [13]. In Bayesian Reinforcement Learning (BRL), the dynamics model is updated when new data are available [14]. Finally, probabilistic model ensembles [15], [16] have recently allowed model-based methods to achieve the same asymptotic performance as state-of-the-art model-free methods, with higher sample efficiency. Thanks to these features, model-based methods seem to be the most suitable for solving complex real-world problems.

In this paper, we consider the class of BRL algorithms for continuous state-action spaces and we analyze them from the viewpoint of control theory. In particular, we consider nonlinear control systems in which the dynamics is partially known, and we assume that the belief an agent has on the dynamics is represented by a probability distribution π on a space of functions [17]. The task is thus written as an optimal control problem with averaged cost, a kind of formulation that has received growing interest in the last few years (see e.g. [18], [19], [20]). In probabilistic model-based RL algorithms, this corresponds to the policy improvement step, where the agent seeks the best control given a probabilistic model π, using a policy search method [10], [12], [16], MPC [13], [15], or other methods. In the framework we propose, the precision with which π describes the true dynamics improves as the dataset becomes wider. This reflects a situation in BRL where the agent learns about the surrounding environment as it gathers more experience and updates the dynamics model.

The main objective of the paper is a convergence result: the value function V_π of an averaged (with respect to a probability measure π) optimal control problem converges to the "true value function" V. Here, by true value function we mean the one defined by the optimal control problem governed by the true, underlying dynamics. Roughly speaking, the main result of the paper can be stated as follows: the value function V_π is close to V as soon as π provides an accurate representation of the true, underlying dynamics. Similar results have been recently obtained for a general Linear Quadratic Regulator problem with finite horizon (see [21]). In the present paper, we focus our attention on a general, nonlinear optimal control problem over a finite horizon, under globally Lipschitz assumptions on the costs and the controlled dynamics.

In the next section, we present the precise setting of the paper, whereas in section III we state and prove the main result. Section IV is devoted to some numerical tests and section V provides the conclusions and an overview of open questions.

1 A. Pesare and M. Falcone are with the Department of Mathematics, Sapienza Università di Roma, 00185 Rome, Italy (e-mail: [email protected], [email protected])
2 M. Palladino is with the Gran Sasso Science Institute - GSSI, 67100 L'Aquila, Italy (e-mail: [email protected])

II. PROBLEM FORMULATION

This section presents the nonlinear optimal control framework we will work with. Before that, let us introduce some notation which will be used throughout the paper.

For vectors v ∈ R^n, |v| will denote the Euclidean norm. For continuous functions f : D → R with D ⊂ R^n, the notations ‖f‖_∞ and ‖f‖_{∞,K} will indicate the sup norm over the whole domain D and over a compact set K ⊂ D, respectively.

Let π and π′ be two probability distributions on a compact metric space (X, d). The 1-Wasserstein distance (see [22]) between them is defined as

    W_1(π, π′) := inf_{γ ∈ Γ(π,π′)} ∫_{X×X} ‖g − f‖_∞ dγ(g, f) ,        (1)

where Γ(π, π′) is the collection of all probability measures on X × X having π and π′ as marginals, f and g are generic elements of X, and the symbol dγ indicates that the integral is with respect to the measure γ.
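For intuition, when π and π′ are supported on finitely many dynamics, (1) reduces to a finite linear program with costs c_ij = ‖g_i − f_j‖_∞. The following minimal sketch is our own Python/SciPy illustration, not code from the paper: the candidate dynamics, the weights and the evaluation grid are placeholder assumptions, and the sup norm is only estimated on a bounded grid.

```python
import numpy as np
from scipy.optimize import linprog

# Two discrete models: pi puts weight a_i on g_i, pi' puts weight b_j on f_j.
G = [lambda x, u: -1.0 * x + u, lambda x, u: -1.2 * x + u]   # support of pi
F = [lambda x, u: -1.0 * x + u, lambda x, u: -0.8 * x + u]   # support of pi'
a = np.array([0.5, 0.5])                                     # weights of pi
b = np.array([0.7, 0.3])                                     # weights of pi'

# Sup-norm distance between two dynamics, estimated on a bounded (x, u) grid.
xs, us = np.linspace(-2, 2, 41), np.linspace(-1, 1, 21)
X, U = np.meshgrid(xs, us)
def sup_dist(g, f):
    return np.max(np.abs(g(X, U) - f(X, U)))

C = np.array([[sup_dist(g, f) for f in F] for g in G])       # cost matrix c_ij

# W1 = min_gamma sum_ij gamma_ij * c_ij  s.t.  row sums = a, column sums = b.
m, n = C.shape
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0     # marginal constraint for pi
for j in range(n):
    A_eq[m + j, j::n] = 1.0              # marginal constraint for pi'
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
print("approximate W1(pi, pi'):", res.fun)
```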
Given a probability space (Ω, F, π) and a random variable Y on Ω, E_π[Y] denotes the expected value of Y with respect to π.

A. Problem A: a classical optimal control problem

Let us consider a classical finite horizon optimal control problem ([2], [3]), which we will call Problem A. For 0 ≤ s < T, let us consider the controlled dynamics

    ẋ(t) = f(x(t), u(t)),   t ∈ [s, T],
    x(s) = x_0 ,                                              (2)

where the nonlinear dynamics f : R^n × R^m → R^n is continuous in the pair (x, u) and Lipschitz continuous with respect to x, uniformly with respect to u. These conditions guarantee that the Cauchy problem (2) is well-posed, in the sense that for every measurable control u and initial condition x(s) = x_0 ∈ R^n there exists a unique solution of (2).

The goal is to minimize the cost functional

    J_{s,x_0}[u] := ∫_s^T ℓ(x(t), u(t)) dt + h(x(T)) ,        (3)

over the class of admissible controls U_s = {u : [s, T] → U, measurable}, where U is a closed subset of R^m and ℓ and h are respectively the running cost and the terminal cost, which we require to be Lipschitz continuous with respect to x. For each (s, x_0) ∈ [0, T] × R^n, the value function and the corresponding optimal control associated to the optimal control problem (2)-(3) are respectively defined as

    V(s, x_0) := inf_{u∈U_s} J_{s,x_0}[u]   and   u*(s, x_0) := argmin_{u∈U_s} J_{s,x_0}[u] .
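For later reference, V(s, x_0) can be approximated numerically by a direct method: discretize the control as a piecewise-constant signal, integrate (2) with an explicit Euler scheme and minimize the resulting finite-dimensional cost. The sketch below is only an illustration under these assumptions; the scalar dynamics, costs and control bounds are placeholders, not the test problems of section IV.

```python
import numpy as np
from scipy.optimize import minimize

T, K = 1.0, 20                        # horizon and number of time steps
dt = T / K
f = lambda x, u: -x + u               # placeholder dynamics (Lipschitz in x)
ell = lambda x, u: x**2 + 0.1 * u**2  # placeholder running cost
h = lambda x: x**2                    # placeholder terminal cost

def cost(u_seq, x0):
    """Explicit Euler rollout of (2) with a piecewise-constant control, plus cost (3)."""
    x, J = x0, 0.0
    for u in u_seq:
        J += ell(x, u) * dt
        x = x + dt * f(x, u)
    return J + h(x)

# Approximate V(0, x0) and u*(0, x0) by minimizing over the K control values.
x0 = 0.5
res = minimize(cost, np.zeros(K), args=(x0,), bounds=[(-1.0, 1.0)] * K)
print("approximate V(0, 0.5):", res.fun)
```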
B. Problem B: an optimal control problem with uncertain dynamics

Let us now introduce another control problem in which the real dynamics f is unknown, meaning that one has merely a partial knowledge of f. Such model uncertainty is captured by a probability distribution on a space of functions X (to which f belongs). More precisely, X is a compact subset of C^0(R^n × U; R^n) with respect to the ‖·‖_∞ norm (the Arzelà-Ascoli Theorem provides necessary and sufficient conditions for the set X to be compact). For 0 ≤ s < T and every g ∈ X, let us define the dynamical system

    ẋ^g(t) = g(x^g(t), u(t)),   t ∈ [s, T],
    x^g(s) = x_0 .                                            (4)

Given a probability distribution π over X, one can define a cost functional for Problem B:

    J_{π,s,x_0}[u] := E_π[ ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ]                          (5)
                    = ∫_X [ ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ] dπ(g) .
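In practice, the expectation in (5) can be evaluated by rolling out the same control under each dynamics in the support of a discrete model (or under dynamics sampled from π) and averaging the resulting costs. A minimal sketch under illustrative assumptions follows; the three candidate dynamics, the weights and the costs are placeholders, not quantities from the paper.

```python
import numpy as np

T, K = 1.0, 20                        # horizon and number of time steps
dt = T / K
ell = lambda x, u: x**2 + 0.1 * u**2  # placeholder running cost
h = lambda x: x**2                    # placeholder terminal cost

# Discrete model: pi puts weight w_i on the candidate dynamics g_i (placeholders).
models = [lambda x, u: -1.0 * x + u,
          lambda x, u: -1.3 * x + u,
          lambda x, u: -0.7 * x + u]
weights = [1/3, 1/3, 1/3]

def averaged_cost(u_seq, x0):
    """Cost (5): one control u, one Euler trajectory per dynamics g in supp(pi)."""
    total = 0.0
    for w, g in zip(weights, models):
        x, Jg = x0, 0.0
        for u in u_seq:
            Jg += ell(x, u) * dt
            x = x + dt * g(x, u)
        total += w * (Jg + h(x))
    return total

print(averaged_cost(np.zeros(K), 0.5))   # J_pi,0,x0[u] for the zero control
```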
Note that, even if there is a different trajectory x^g(t) for each g ∈ X, the task consists in looking for a single control u to be applied to every system dynamics g ∈ X. The notions of value function and optimal control are analogous to those given for Problem A:

    V_π(s, x_0) := inf_{u∈U_s} J_{π,s,x_0}[u]   and   u*_π(s, x_0) := argmin_{u∈U_s} J_{π,s,x_0}[u] .

Remark II.1. It is worth pointing out that the theory developed here works both for non-parametric models (e.g. Gaussian processes [10], [13]) and for parametric models (e.g. deep neural networks [12], [15], [16]). In the latter case, we are assuming that the support of π is a family of functions {f_λ}_{λ∈R^d} described by a parameter λ ∈ R^d, with the dimension d arbitrarily large. π can thus be seen as a probability distribution on the parameter space R^d and is then easier to work with (see, for instance, the numerical tests in section IV).
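In the parametric case of Remark II.1, a belief on the dynamics can be stored directly as a distribution on the parameter λ and turned into a discrete model by sampling. The short sketch below is our own illustration under stated assumptions: the linear family f_λ and the Gaussian belief are placeholders, not a model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_lambda(lam):
    """Parametric family f_lambda(x, u) = lam[0]*x + lam[1]*u (purely illustrative)."""
    return lambda x, u: lam[0] * x + lam[1] * u

# Belief pi seen as a distribution on the parameter space R^2:
# here a Gaussian centered at lam_bar with standard deviation sigma.
lam_bar = np.array([-1.0, 1.0])
sigma = 0.1

# Sampling M parameters gives a discrete surrogate {f_{lambda_i}} with uniform
# weights, which can be plugged into the averaged cost (5) as in the sketch above.
M = 5
lams = lam_bar + sigma * rng.standard_normal((M, 2))
models = [f_lambda(lam) for lam in lams]
weights = np.full(M, 1.0 / M)

print([g(0.5, 0.2) for g in models])   # the sampled dynamics evaluated at one (x, u)
```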
There are several model-based RL methods that rely on the design of probability distributions representing the belief that an agent has on the environment. In BRL algorithms [10], [12], [13], [14], [15], [16], the probability distribution, which is built upon the data collected while exploring the partially unknown environment, is updated when new experience is gained. In this context, it is reasonable to expect that the newest probability distribution representing the environment will be more accurate than the initial one.

In the framework of our paper, π is the probability distribution, i.e. the probabilistic model, which represents the knowledge that the agent has on the environment. Clearly, as the accuracy of π increases, one should expect the value function of Problem B to be close (in a sense that will be made precise) to the value function of Problem A. In particular, we will investigate the following questions:

(A) How far is V_π from V?
(B) If π^N → δ_f, is it true that V_{π^N} → V, where {V_{π^N}}_{N∈N} are the value functions of a sequence of problems of type B, and {π^N}_{N∈N} are the respective probability distributions on X?

III. MAIN RESULTS

In this section, we state and prove the main result of the paper, valid for a general family of probability distributions.

Theorem III.1. Let us consider two Problems of type B as described in section II-B, one with distribution π^N and the other with distribution π^∞. We make the further assumptions:

(H1) There exists a constant L_f > 0 such that

    |f(x, u) − f(y, u)| ≤ L_f |x − y|

for each f ∈ supp(π^∞), x, y ∈ R^n, u ∈ U;

(H2) The two cost functions ℓ and h are Lipschitz continuous in the first argument with constants respectively L_ℓ and L_h.

Then the following estimate holds:

    ‖V_{π^N} − V_{π^∞}‖_∞ ≤ C(L_f, L_ℓ, L_h, T) W_1(π^N, π^∞) ,        (6)

where C(L_f, L_ℓ, L_h, T) is a constant which depends only on the Lipschitz constants and on T, and W_1(π^N, π^∞) is the 1-Wasserstein distance defined in (1).
Proof. We divide the proof into three steps.

STEP 1: Fix two dynamics g ∈ X and f ∈ supp(π^∞), an initial condition x(s) = x_0 with s ∈ [0, T] and x_0 ∈ R^n, and a control u ∈ U_s. We estimate how far x^g(t) is from x^f(t), using Gronwall's Lemma, for each t ∈ [s, T].

Recall that x^f(t) and x^g(t) are solutions of the dynamical systems (4):

    x^f(t) = x_0 + ∫_s^t f(x^f(τ), u(τ)) dτ ,
    x^g(t) = x_0 + ∫_s^t g(x^g(τ), u(τ)) dτ .

Then we have the following estimate:

    |x^g(t) − x^f(t)| ≤ ∫_s^t |g(x^g(τ), u(τ)) − f(x^f(τ), u(τ))| dτ
                      ≤ ∫_s^t |g(x^g(τ), u(τ)) − f(x^g(τ), u(τ))| dτ + ∫_s^t |f(x^g(τ), u(τ)) − f(x^f(τ), u(τ))| dτ
                      ≤ (t − s) ‖f − g‖_∞ + L_f ∫_s^t |x^g(τ) − x^f(τ)| dτ .

Then, by Gronwall's Lemma,

    |x^g(t) − x^f(t)| ≤ (t − s) ‖f − g‖_∞ e^{L_f (t−s)} ≤ t ‖f − g‖_∞ e^{L_f t} .        (7)

STEP 2: Fix an initial condition x(s) = x_0 with s ∈ [0, T] and x_0 ∈ R^n and a control u ∈ U_s. We estimate how far J_{π^N,s,x_0}[u] is from J_{π^∞,s,x_0}[u]. To lighten the notation, we will write J_{π^N}[u] = J_{π^N,s,x_0}[u] and J_{π^∞}[u] = J_{π^∞,s,x_0}[u].

For each g ∈ X and f ∈ supp(π^∞) it holds

    |ℓ(x^g(t), u(t)) − ℓ(x^f(t), u(t))| ≤ L_ℓ |x^g(t) − x^f(t)| ≤ L_ℓ t ‖f − g‖_∞ e^{L_f t}        (8)

and

    |h(x^g(T)) − h(x^f(T))| ≤ L_h T ‖f − g‖_∞ e^{L_f T} .        (9)

As a property of W_1, there exists (see Theorem 4.1 in [22]) a distribution γ* on X × X with marginal distributions π^N and π^∞ such that

    W_1(π^N, π^∞) = ∫_{X×X} ‖g − f‖_∞ dγ*(g, f) .

Let (g(x), f(x))_{x∈R^n} be a continuous process which has γ* as distribution. For each event ω ∈ Ω, we consider the realization (g^ω, f^ω) of the process.

Let us now integrate over g ∈ supp(π^N) and f ∈ supp(π^∞):

    J_{π^N}[u] − J_{π^∞}[u]
        = ∫_X [ ∫_s^T ℓ(x^g(t), u(t)) dt + h(x^g(T)) ] dπ^N(g) − ∫_X [ ∫_s^T ℓ(x^f(t), u(t)) dt + h(x^f(T)) ] dπ^∞(f)
        = ∫_{X×X} [ ∫_s^T ( ℓ(x^g(t), u(t)) − ℓ(x^f(t), u(t)) ) dt + h(x^g(T)) − h(x^f(T)) ] dγ*(g, f) .

Passing to the absolute value and using the bounds in (8) and (9), we recover the expression of W_1:

    |J_{π^N}[u] − J_{π^∞}[u]| ≤ ∫_{X×X} [ ∫_0^T L_ℓ t e^{L_f t} dt + L_h T e^{L_f T} ] ‖g − f‖_∞ dγ*(g, f)
        = [ L_ℓ (e^{L_f T}(L_f T − 1) + 1) / L_f^2 + L_h T e^{L_f T} ] W_1(π^N, π^∞)
        = C(L_f, L_ℓ, L_h, T) W_1(π^N, π^∞) .        (10)

Note that this estimate does not depend on x_0 or u.

STEP 3: We now prove the estimate (6) using (10). Fix an initial condition x(s) = x_0 and some ε > 0. By the definition of V_{π^∞}, there exists a control u_ε ∈ U_s such that

    J_{π^∞}[u_ε] ≤ V_{π^∞}(s, x_0) + ε .

Then one has

    V_{π^N}(s, x_0) − V_{π^∞}(s, x_0) = inf_{u∈U_s} J_{π^N}[u] − inf_{u∈U_s} J_{π^∞}[u]
        < inf_{u∈U_s} J_{π^N}[u] − J_{π^∞}[u_ε] + ε
        ≤ J_{π^N}[u_ε] − J_{π^∞}[u_ε] + ε
        ≤ sup_{u∈U_s} |J_{π^N}[u] − J_{π^∞}[u]| + ε .

In the same way, we get

    V_{π^∞}(s, x_0) − V_{π^N}(s, x_0) < sup_{u∈U_s} |J_{π^N}[u] − J_{π^∞}[u]| + ε ,

and since ε is arbitrary, we eventually obtain

    |V_{π^N}(s, x_0) − V_{π^∞}(s, x_0)| ≤ sup_{u∈U_s} |J_{π^N}[u] − J_{π^∞}[u]| .

Finally, noting that the estimate is independent of x_0, we get the result:

    ‖V_{π^N} − V_{π^∞}‖_∞ ≤ sup_{u∈U_s} |J_{π^N}[u] − J_{π^∞}[u]| ≤ C(L_f, L_ℓ, L_h, T) W_1(π^N, π^∞) .

In the particular case where π^∞ ≡ δ_f, f being the true underlying dynamics, one can express Theorem III.1 in a more expressive fashion as follows:

Corollary III.2 (Convergence of the value functions). Suppose that f : R^n × U → R^n is a Lipschitz continuous function and that assumption (H2) on ℓ and h is satisfied. Let {π^N}_{N∈N} be a sequence of probability distributions on a compact set X ⊂ C^0(R^n × U; R^n), with π^N → δ_f in the 1-Wasserstein distance. Then the value function V_{π^N} of Problem B converges uniformly on [0, T] × R^n to the value function V of Problem A.

The previous Corollary provides a positive answer to the questions (A)-(B) posed in the introduction. More precisely, it tells us that, whenever it is possible to construct a probability distribution π sufficiently close (with respect to the Wasserstein distance) to the real dynamics f, the value function V_π is a good approximation of the value function V.
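Corollary III.2 can be illustrated on a toy example by computing V_{π^N} and V with the same discretization while the model distribution concentrates on the true dynamics. The sketch below is such an illustration under placeholder assumptions (scalar dynamics, quadratic costs, a coarse Euler/L-BFGS-B discretization); it is not one of the tests of section IV.

```python
import numpy as np
from scipy.optimize import minimize

T, K = 1.0, 20                        # horizon and number of Euler steps
dt = T / K
ell = lambda x, u: x**2 + 0.1 * u**2  # placeholder running cost
h = lambda x: x**2                    # placeholder terminal cost
f_true = lambda x, u: -x + u          # placeholder "true" dynamics

def value(dyns, weights, x0):
    """Approximate V_pi(0, x0) for a discrete model pi = sum_i w_i * delta_{g_i}."""
    def J(u_seq):
        total = 0.0
        for w, g in zip(weights, dyns):
            x, Ji = x0, 0.0
            for u in u_seq:
                Ji += ell(x, u) * dt
                x = x + dt * g(x, u)
            total += w * (Ji + h(x))
        return total
    return minimize(J, np.zeros(K), bounds=[(-1.0, 1.0)] * K).fun

V_true = value([f_true], [1.0], 0.5)          # value of Problem A
for eps in (0.5, 0.25, 0.1, 0.05):
    # pi^N: three dynamics obtained by perturbing the drift coefficient by +/- eps;
    # on the bounded region visited in this test their sup distance to f_true is O(eps).
    dyns = [lambda x, u, d=d: -(1.0 + d) * x + u for d in (-eps, 0.0, eps)]
    V_N = value(dyns, [1/3, 1/3, 1/3], 0.5)
    print(f"eps = {eps:4.2f}   |V_piN - V| = {abs(V_N - V_true):.5f}")
```

As the perturbation size eps shrinks, the gap |V_{π^N} − V| is expected to shrink at a comparable rate, consistently with the linear dependence on W_1 in (6).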
where α_i^N ≥ 0 for every i = 1, …, M and ∑_{i=1}^M α_i^N = 1 for every N ∈ N. In this case, the cost functional (5) can be written as

    J_{π^N,s,x_0}[u] := ∑_{i=1}^M α_i^N [ ∫_s^T ℓ(x^{f_i}(t), u(t)) dt + h(x^{f_i}(T)) ] .
Fig. 3. Test 2: Optimal (multi-)trajectory for π = (1/3, 1/3, 1/3), starting from x_0 = (−0.4, 0.3).
REFERENCES
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction,
2nd ed. Cambridge, MA: MIT Press, 2018.
[2] W. H. Fleming and R. W. Rishel, Deterministic and Stochastic Optimal
Control. Springer Science & Business Media, 2012, vol. 1.
[3] M. Bardi and I. Capuzzo-Dolcetta, Optimal Control and Viscosity
Solutions of Hamilton-Jacobi-Bellman Equations. Birkhäuser, 1997.
[4] R. S. Sutton, A. G. Barto, and R. J. Williams, “Reinforcement Learning
is Direct Adaptive Optimal Control,” IEEE Control Systems, vol. 12,
no. 2, pp. 19–22, 1992.
[5] B. Recht, “A tour of reinforcement learning: The view from continuous
control,” Annual Review of Control, Robotics, and Autonomous Systems,
vol. 2, pp. 253–279, 2019.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski
et al., “Human-level control through deep reinforcement learning,”
Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[7] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust
region policy optimization,” in International Conference on Machine
Learning. PMLR, 2015, pp. 1889–1897.
[8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforcement
learning,” in 4th International Conference on Learning Representations
(ICLR), 2016.
[9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic
actor,” in International Conference on Machine Learning. PMLR, 2018,
pp. 1861–1870.