Reinforcement Learning-Based Model Predictive Control for Discrete-Time Systems

Abstract— This article proposes a novel reinforcement learning-based model predictive control (RLMPC) scheme for discrete-time systems. The scheme integrates model predictive control (MPC) and reinforcement learning (RL) through policy iteration (PI), where MPC is a policy generator and the RL technique is employed to evaluate the policy. Then the obtained value function is taken as the terminal cost of MPC, thus improving the generated policy. The advantage of doing so is that it rules out the need for the offline design paradigm of the terminal cost, the auxiliary controller, and the terminal constraint in traditional MPC. Moreover, RLMPC proposed in this article enables a more flexible choice of prediction horizon due to the elimination of the terminal constraint, which has great potential in reducing the computational burden. We provide a rigorous analysis of the convergence, feasibility, and stability properties of RLMPC. Simulation results show that RLMPC achieves nearly the same performance as traditional MPC in the control of linear systems and exhibits superiority over traditional MPC for nonlinear ones.

Index Terms— Discrete-time systems, model predictive control, policy iteration (PI), reinforcement learning (RL).

Manuscript received 30 March 2022; revised 31 January 2023; accepted 3 May 2023. Date of publication 19 May 2023; date of current version 1 March 2024. This work was supported in part by the Beijing Municipal Science Foundation under Grant 4222052; and in part by the National Natural Science Foundation of China under Grant 62003040, Grant 61836001, and Grant 61720106010. (Corresponding author: Zhongqi Sun.) Min Lin, Yuanqing Xia, and Jinhui Zhang are with the School of Automation, Beijing Institute of Technology, Beijing 100081, China. Zhongqi Sun is with the School of Automation, Beijing Institute of Technology, Beijing 100081, China, and also with the Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing 314019, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TNNLS.2023.3273590

I. INTRODUCTION

THE development of modern unmanned systems places higher demands on the performance of the controllers. Typically, these systems are constrained, and violation of constraints may lead to unsafe behaviors that can severely affect system operation. Model predictive control (MPC) is capable of providing an optimal solution while handling the constraints explicitly, which has seen significant success in recent decades with wide applications in diverse fields [1], [2], [3], [4], [5], [6].

The successful applications of MPC attracted tremendous academic interest, and soon a rigorous stability-centered MPC theoretical foundation was established. The early stability results on MPC employ a zero-state terminal equality constraint [7], [8]. They were subsequently extended to the use of a terminal inequality constraint by taking a control Lyapunov function as the terminal cost [9], [10], [11], which established the currently well-known framework for stability. Based on this framework, a commonly adopted MPC design paradigm involves finding an appropriate terminal cost and a terminal controller as well as a terminal set that strictly meet specific conditions. While this is relatively easy for linear systems, it is generally a challenging task for nonlinear ones. Even if a qualified terminal cost, a terminal controller, and the corresponding terminal set are found, the conditions for stability and recursive feasibility are usually quite conservative. In practice, however, the theoretically well-established MPC approaches are prohibitively difficult to apply due to the high computational burden. Instead, MPC without terminal cost and terminal constraint enjoys wide popularity. Its closed-loop stability and recursive feasibility can be guaranteed under some conditions [12], [13]. But the absence of a terminal cost entails that the residual costs outside the truncated prediction horizon are completely ignored, which makes it difficult to achieve optimal performance under a very limited prediction horizon. Motivated by the analysis above, we aim to develop a learning-based MPC scheme to learn an appropriate terminal cost function without complex offline design procedures while improving the performance of the controller.

In recent years, striking progress of reinforcement learning (RL), such as AlphaGo [14] and Atari games [15], has drawn the attention of the control community. Different from MPC's offline design, RL optimizes the control policy through online data-based adaptation [16], [17]. Observed state transitions and costs (or rewards) are the only inputs to the RL agent, and no prior knowledge of the system dynamics is needed [18], [19]. A central idea in RL is temporal-difference (TD) learning [20], [21], [22], which estimates the value function directly from raw experience in a bootstrapping way, without waiting for a final outcome. The value function is a prediction of the expected long-term reward at each state. When the optimality is reached, it encodes the global optimal information so that the infinite-horizon optimal control policy can be obtained.

Nevertheless, to approximate the global optimal policy, the RL agent tends to try different policies and learns via trial and error, thereby struggling to provide safe guarantees on the resulting behaviors. This problem becomes prominent when confronted with some safety-critical systems, and safety in this context is defined in terms of stability. There have been some achievements centered on safe RL recently [23], [24], [25], but it remains largely an open field.
In view of the powerful data-driven optimization capability of RL, if combined with MPC to optimize the controller with running data, it can be expected to improve the control performance. At the same time, benefiting from the solid theoretical foundation of MPC, the properties of the behaviors produced by the RL agent would be easier to analyze. This motivates the attempts to integrate the best of both worlds. However, only limited pioneering work has been reported in this field. For example, in [26], a novel "plan online and learn offline" framework was proposed by taking MPC as the trajectory optimizer of RL. It is shown that MPC enables a more efficient exploration, thereby accelerating the convergence and reducing the approximation errors in the value function. The related work was extended to the framework of actor-critic (AC) in [27] to handle the case of sparse and binary rewards, which shows that MPC is an important sampler that effectively propagates global information. These results were further applied to tasks defined by simple and easily interpretable high-level objectives on a real unmanned ground vehicle in [28], and the expression of the value function was improved. Yet, the focus of these works is on the improvements that MPC brings to RL practice, with no reference to stability. The stability issue was highlighted in [29], where an economic MPC was used as a function approximator in RL. This scheme was further analyzed in [30] under an improved algorithmic framework. These two papers paved the way for [31], where the constraint satisfaction concern was addressed with the help of robust linear MPC. The practicality of these approaches, however, remains to be investigated and verified. To the best of the authors' knowledge, few studies can balance the theoretical and practical guarantees of the "RL + MPC" approaches, that is, to provide a theoretically safe and operationally efficient scheme, and this article aims to fill this gap.

We propose an "RL + MPC" scheme from a new perspective: policy iteration (PI). Considering that MPC's ability to handle constraints can provide constraint-enforcing policies for RL agents, safety referred to in this article is not only limited to stability, but also to constraint satisfaction. The main contributions of this article are threefold, as follows.

1) We provide a new idea for combining RL and MPC by bridging them through PI. In this way, a complete RL-based model predictive control (RLMPC) scheme is developed for discrete-time systems. In each iteration, the current policy is evaluated through learning to obtain the value function. Then it is employed as the terminal cost to compensate for the suboptimality induced by the truncated prediction horizon of MPC, thus improving the policy. This solves the challenge of complex offline design procedures while progressively improving the performance to the (near-)optimum.

2) The convergence of learning, recursive feasibility, and closed-loop stability of the proposed RLMPC are closely investigated, thereby theoretically guaranteeing its safety. We demonstrate that no constraint is violated even before the optimal value function is well-learned, which verifies the ability of MPC to effectively constrain the RL-produced policies within safe limits.

3) We incorporate the value function approximation (VFA) technique into the developed RLMPC approach to approximate the global optimal value function with high computational efficiency. The effect of the prediction horizon on the control performance is scrutinized. Results show that the proposed RLMPC scheme can achieve nearly the same optimal performance as traditional MPC in linear system control and outperform it in nonlinear cases, which is due to the conservativeness of the offline MPC design for nonlinear systems. In addition, thanks to the elimination of the terminal controller and terminal constraint, the RLMPC scheme is particularly useful in dealing with systems where it is difficult or even impossible to design a terminal controller, such as nonholonomic systems, while reducing the computational burden by flexibly adjusting its prediction horizon.

The rest of this article is organized as follows. In Section II, we formally outline the problem formulation. The RLMPC scheme is developed in Section III. Its convergence, optimality, feasibility, and stability properties are analyzed in Section IV. Section V discusses the implementation of the overall scheme. We test the proposed scheme in simulations of both linear and nonlinear examples in Section VI. Section VII provides the final conclusion.

Notation: We use the following convention: R denotes the set of reals and R^n is the set of n-dimensional real vectors. N is the set of natural numbers. For some r1, r2 ∈ R, we use R_{≥r1}, N_{>r2}, and N_{[r1,r2]} to represent the sets {r ∈ R | r ≥ r1}, {r ∈ N | r > r2}, and {r ∈ N | r1 ≤ r ≤ r2}, respectively. We label the variables of the optimal solution with ·*, feasible solutions with ~·, and estimated ones with ^·, respectively. Moreover, the notations x_{i|k} and u_{i|k} indicate the state and input predictions i steps ahead from the current time k, respectively. The sequence {u_{0|k}, u_{1|k}, ..., u_{N−1|k}} is denoted by u_k, or by u_{N∥k} if we want to emphasize its length N.

II. PROBLEM FORMULATION

A. Optimal Control Problem

Consider a dynamic system described by the following state-space difference equation:

x_{k+1} = f(x_k, u_k)    (1)

where x_k ∈ R^n and u_k ∈ R^m are the system state and control input at time k ∈ N, respectively. It is assumed that the system is subject to the constraints

x_k ∈ X,  u_k ∈ U    (2)

where X and U are compact sets and contain the origin as an interior point. Also, system (1) is assumed to satisfy the following conditions.

Assumption 1: The function f : R^n × R^m → R^n is continuous with f(0, 0) = 0. Under the constraints (2), the system (1) is stabilizable with respect to the equilibrium at x_k = 0, u_k = 0, and the system state x_k is measurable.

Remark 1: Assumption 1 is a nonintrusive and common assumption in the MPC community and can be found in numerous works (e.g., [9], [10], [11], [12], [13]). For a system x̆_{k+1} = f(x̆_k, ŭ_k) with a general equilibrium (x_e, u_e), one can always make the variable changes x_k = x̆_k − x_e and u_k = ŭ_k − u_e to translate it to system (1) with the equilibrium (0, 0).
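To fix the notation of (1) and (2) in code for later sketches, the following minimal Python fragment defines a discrete-time map f, box constraint sets X and U, and a positive-definite quadratic stage cost. All numbers here are hypothetical and serve only as an illustration of the setup, not as quantities taken from the article.

```python
import numpy as np

n, m = 2, 1
Q, R = np.eye(n), 0.5 * np.eye(m)          # weights of a quadratic running cost

def f(x, u):                               # system (1): x_{k+1} = f(x_k, u_k)
    return np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u[0]])

def in_X(x):                               # compact state set X containing the origin
    return bool(np.all(np.abs(x) <= 5.0))

def in_U(u):                               # compact input set U containing the origin
    return bool(np.all(np.abs(u) <= 1.0))

def running_cost(x, u):                    # positive-definite stage cost l(x, u)
    return float(x @ Q @ x + u @ R @ u)
```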
For the particular setup considered above, the infinite-horizon optimal control problem at time k with the initial state x_k ∈ X is then described by Problem 1.

Problem 1: Find a control policy u_k = π(x_k), which is defined as a map from the state space to the control space, π : R^n → R^m, and results in a control sequence u_∞ = {u_k, u_{k+1}, ...}, such that it stabilizes the zero equilibrium of (1) and minimizes the infinite-horizon cost

J_∞(x_k, u_∞) = Σ_{i=k}^{∞} ℓ(x_i, u_i)    (3)

subject to constraints (1) and (2), with the running cost ℓ : R^n × R^m → R_{≥0} satisfying

ℓ(0, 0) = 0
α_1^{K∞}(∥x∥) ≤ ℓ̌(x) ≜ inf_{u∈U} ℓ(x, u) ≤ α_2^{K∞}(∥x∥)    (4)

for ∀x ∈ X, where α_1^{K∞} and α_2^{K∞} are K_∞ functions [32].

Remark 2: The running cost defined by (4) is a fairly general form and is common in many MPC research studies (e.g., [13], [33]). Its specific form is task-dependent; for example, it usually takes a quadratic form in regulation problems: ℓ(x, u) = x^T Q x + u^T R u, where Q and R are positive-definite matrices with proper dimensions.

B. Background of Model Predictive Control

Generally, there is no analytic solution to Problem 1, especially for constrained nonlinear systems. An efficient method to solve this problem is MPC [9], which provides an approximated solution by recursively solving the following OP 1 at each x_k ∈ X.

OP 1: Find the optimal control sequence u_k* by solving the finite-horizon minimization problem

min_{u_k} J(x_k, u_k) = Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + F(x_{N|k})    (5a)
s.t.  x_{0|k} = x_k    (5b)
      x_{i+1|k} = f(x_{i|k}, u_{i|k}),  0 ≤ i ≤ N−1    (5c)
      u_{i|k} ∈ U,  0 ≤ i ≤ N−1    (5d)
      x_{i|k} ∈ X,  1 ≤ i ≤ N−1    (5e)
      x_{N|k} ∈ X_f    (5f)

where N ∈ N_{>0} is the prediction horizon, F(·) is the terminal cost, u_k = {u_{0|k}, u_{1|k}, ..., u_{N−1|k}} is the predicted control sequence, X_f is the terminal set, and (5f) is the terminal constraint.

After solving OP 1, apply only the first element in u_k*, that is, u_k = u*_{0|k}, to system (1) and repeat this procedure.

To guarantee recursive feasibility and stability, it is required that N is chosen properly so that x_{N|k} is guaranteed to enter a positively invariant set X_f (under a terminal controller κ(x_k) ∈ U) containing the origin. Moreover, the parameters of the running and terminal costs should be designed to satisfy F(x_{N|k}) ≥ Σ_{i=N}^{∞} ℓ(x_{i|k}, κ(x_{i|k})) for ∀x_k ∈ X_f, such that F(·) is a local Lyapunov function for κ(x_k) in X_f.

As mentioned in Section I, it is generally a difficult task to design such terminal conditions (i.e., terminal cost, terminal constraint, and terminal controller) for nonlinear systems. Although MPC without terminal conditions (MPCWTC) is proposed as an option to circumvent this challenge, its performance under a very limited (but necessary) prediction horizon tends to be inferior to that of the kind with a terminal cost. Therefore, we propose to introduce RL into MPC to learn the terminal cost online, thereby generating a policy that is closer to the optimal one in the sense of an infinite horizon.

C. Background of RL

An important class of RL techniques aims to learn the optimal policy π*(x) by iteratively obtaining the evaluation of the current policy and then improving the policy accordingly [34]. At state x_k, the value function V_π : R^n → R, or the evaluation, for a given policy u_k = π(x_k) is the accumulated reward (or running cost) incurred by following the policy from x_k, that is, V_π(x_k) = Σ_{i=k}^{∞} ℓ(x_i, u_i), which can be rearranged into the well-known Bellman equation

V_π(x_k) = ℓ(x_k, u_k) + V_π(x_{k+1}).    (6)

According to Bellman's principle, when the optimality is achieved, the optimal value function V_π*(x_k) satisfies the Bellman optimality equation [35]

V_π*(x_k) = min_{u_k} { ℓ(x_k, u_k) + V_π*(x_{k+1}) }    (7)

and the corresponding optimal policy is given by

π*(x_k) = arg min_{u_k} { ℓ(x_k, u_k) + V_π*(x_{k+1}) }.    (8)

Nevertheless, the calculation of V_π(x_k) is generally computationally difficult, and V_π*(x_k) is unknown before all the policies π(x_k) = u_k ∈ U are considered. This motivates the TD learning approach [36], which provides a more tractable computation. In TD learning, the running costs are observed to construct the TD target V_TD(x_k), thereby iteratively updating the value function as

V_π(x_k) ← V_π(x_k) + α[V_TD(x_k) − V_π(x_k)]    (9)

where α ∈ (0, 1) is a constant called the learning step size, and V_TD(x_k) − V_π(x_k) is known as the TD error. The Bellman equation (6) holds whenever the TD error is zero, thus obtaining the evaluation of policy π(x_k). Here, to be consistent with MPC, the TD target can be constructed in the N-step form

V_TD(x_k) = Σ_{i=k}^{k+N−1} ℓ(x_i, u_i) + V_π(x_{k+N}).    (10)

It can be seen that this TD target has a similar form as the MPC cost function (5a), which builds up the connection between MPC and RL, and their combination gives rise to the RLMPC algorithm to be presented in Sections III–V.
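The structural parallel between (5a) and (10) can be made concrete with a short sketch: both quantities truncate the infinite sum after N steps and append a tail term, the MPC cost over a model-predicted rollout and the TD target over an observed trajectory. The dynamics f, stage cost, and policy pi below are hypothetical placeholders introduced only for illustration.

```python
import numpy as np

def f(x, u):                 # placeholder system (1)
    return 0.9 * x + u

def stage_cost(x, u):        # placeholder running cost l(x, u), quadratic as in Remark 2
    return x**2 + 0.5 * u**2

def pi(x):                   # placeholder current policy to be evaluated
    return -0.5 * x

def mpc_cost(x0, u_seq, terminal_cost):
    """Finite-horizon MPC cost (5a): stage costs along a predicted rollout plus F(x_{N|k})."""
    J, x = 0.0, x0
    for u in u_seq:                      # i = 0, ..., N-1
        J += stage_cost(x, u)
        x = f(x, u)                      # prediction x_{i+1|k} = f(x_{i|k}, u_{i|k})
    return J + terminal_cost(x)

def n_step_td_target(x0, N, value_fn):
    """N-step TD target (10): N observed running costs plus V_pi(x_{k+N})."""
    G, x = 0.0, x0
    for _ in range(N):                   # i = k, ..., k+N-1
        u = pi(x)
        G += stage_cost(x, u)
        x = f(x, u)                      # observed transition
    return G + value_fn(x)
```

RLMPC exploits exactly this shared structure: the learned value function takes the place of the terminal cost F in (5a).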
where V̂_π(x_k, W) is an approximation of V_π(x_k), W = [W_1, ..., W_p]^T ∈ R^p is the weight vector, and Φ(x_k) = [Φ_1(x_k), ..., Φ_p(x_k)]^T is the basis vector. The approximation error e(x_k) converges uniformly to zero as p → ∞. Note that this approximation is essentially a single-layer network, and all we have to do is to feed training data into this network.

Consider a set containing q (q ∈ N_{>0}) states sampled in X at time k, denoted as S_k = {x_{k_1}, x_{k_2}, ..., x_{k_q}} ⊂ X. Take each of its elements separately as an initial state and propagate one step using this policy; then we record the incurred costs as training data. Specifically, for each x_{k_j} ∈ S_k, j ∈ N_{[1,q]}, solve OP 2 (whose terminal cost is a fixed heuristic function corresponding to the current policy) with the subscript k replaced by k_j, obtain the solution u*_{k_j}, and calculate the cost J(x_{k_j}, u*_{k_j}) according to (11a). Note that we do not apply any u*_{k_j} to the real system in this process, so the new subscript k_j is adopted here to distinguish it from the actual system state. With all these costs J(x_{k_j}, u*_{k_j}), j ∈ N_{[1,q]}, we are now able to train the network in a manner similar to TD learning. Recall (9); here J(x_{k_j}, u*_{k_j}) plays the role of the TD target, that is,

V̂_π(x_{k_j}, W) ← V̂_π(x_{k_j}, W) + α[J(x_{k_j}, u*_{k_j}) − V̂_π(x_{k_j}, W)]    (15)

for ∀j ∈ N_{[1,q]}, and we aim to minimize E(x_{k_j}) = (1/2) e(x_{k_j})^2, where e(x_{k_j}) = J(x_{k_j}, u*_{k_j}) − V̂_π(x_{k_j}, W). The stochastic gradient descent method [36] achieves this minimization by adjusting the weight vector W in a small step size α ∈ (0, 1) at a time in the direction of the negative gradient (with respect to W) of the squared error

W ← W − (1/2) α ∇_W [J(x_{k_j}, u*_{k_j}) − V̂_π(x_{k_j}, W)]^2
  = W + α [J(x_{k_j}, u*_{k_j}) − V̂_π(x_{k_j}, W)] ∇_W V̂_π(x_{k_j}, W)
  = W + α [J(x_{k_j}, u*_{k_j}) − W^T Φ(x_{k_j})] Φ(x_{k_j}),  j ∈ N_{[1,q]}    (16)

in which ∇_W denotes the partial derivatives with respect to the components of W.

After W converges for all the training data, that is, for ∀x_{k_j} ∈ S_k, [J(x_{k_j}, u*_{k_j}) − V̂_π(x_{k_j}, W)] → 0 holds, we obtain the (approximated) evaluation of the given fixed control policy. Provided with sufficient training data and a proper choice of the basis vector, we have V̂_π(x_k, W) → V_π(x_k), ∀x_k ∈ X. It is worth noting that with VFA, we can characterize the value function over the state space using only the storage space of a p-dimensional vector W, which enhances the utility of the proposed approach.
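A minimal sketch of the weight update (16) in Python is given below. The polynomial basis phi and the step sizes are hypothetical choices for illustration; samples and targets stand for the sampled states in S_k and the corresponding costs J(x_{k_j}, u*_{k_j}) obtained from OP 2.

```python
import numpy as np

def phi(x):
    """Hypothetical polynomial basis vector Phi(x) for a 2-D state (p = 5 here)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

def sgd_update(W, samples, targets, alpha=1e-3, eps=1e-8, max_sweeps=10000):
    """Fit V_hat(x, W) = W^T Phi(x) to the targets via repeated sweeps of the SGD rule (16)."""
    for _ in range(max_sweeps):
        W_prev = W.copy()
        for x_kj, J_kj in zip(samples, targets):
            feats = phi(x_kj)
            td_err = J_kj - W @ feats            # J(x_kj, u*_kj) - W^T Phi(x_kj)
            W = W + alpha * td_err * feats       # update (16)
        if np.linalg.norm(W - W_prev) <= eps:    # stop once the weights settle
            break
    return W
```

Because V̂_π is linear in W, this is an ordinary least-squares fit solved incrementally, which is why a single p-dimensional vector suffices to store the learned value function.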
C. PI in RLMPC

In RLMPC, the value function and the control policy are updated by iterations with the index t ∈ N increasing from zero to infinity. For convenience, we denote the heuristic term employed in the tth iteration as V_π^t(·), and the policy generated by the MPC with V_π^t(·) as the terminal cost is denoted as π^t(·). We are now in a position to present the PI mechanism in RLMPC.

1) Initialization: Given an initial state x_k ∈ X at time k, we start the iteration with an initial heuristic term V_π^0(·) ≡ 0 and a corresponding stabilizing policy π^0(x_k) whose existence is already guaranteed by Assumptions 2 and 3. For t = 0, 1, 2, ..., do the following two steps iteratively.

2) Policy Evaluation Step: Determine the value function V_π^{t+1}(·) for policy π^t(·) by following the steps presented in Section III-B: first, sample in X to form S_k. Second, generate the training data J^t(x_{k_j}, u^t_{k_j}), x_{k_j} ∈ S_k, with

J^t(x_{k_j}, u^t_{k_j}) = min_{u_{k_j}∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k_j}, u_{i|k_j}) + V_π^t(x_{N|k_j}) }
                    = Σ_{i=0}^{N−1} ℓ(x^t_{i|k_j}, u^t_{i|k_j}) + V_π^t(x^t_{N|k_j})    (17)

deduced from the model (11c) and satisfying (11d) and (11e). Third, train the network using (16) to obtain the value function

V_π^{t+1}(x_k) → J^t(x_k, u^t_k)  ∀x_k ∈ X.    (18)

3) Policy Improvement Step: According to the descriptions in Section III-A, the updated policy π^{t+1}(x_k) is yielded by solving OP 2 with the terminal cost V_π^{t+1}(·)

u^{t+1}_k = arg min_{u^{t+1}_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^{t+1}(x_{N|k}) }    (19)

and u^{t+1}_k is implemented in a receding horizon manner.

In what follows, we focus on analyzing the theoretical properties of this RLMPC scheme.

IV. CONVERGENCE, FEASIBILITY, AND STABILITY ANALYSIS

In this section, we analyze the convergence, feasibility, and stability properties. We show that for ∀x_k ∈ X, the value function V_π^t(x_k) → V_π*(x_k) and the control policy π^t(x_k) → π*(x_k) as the iteration index t → ∞. Also, we demonstrate that the policy obtained in each iteration is a feasible and stabilizing policy for the closed-loop system. The convergence proof is inspired by [42], which demonstrates the convergence of optimal control (unconstrained infinite-horizon optimization) under PI. This article leverages its proof idea to extend the results to RLMPC. Before proceeding, we introduce the following definition.

Definition 1: For the RLMPC optimization problem OP 2, a control sequence ũ_k = {ũ_{0|k}, ..., ũ_{N−1|k}} is said to be feasible if ũ_{i|k} ∈ U, i ∈ N_{[0,N−1]}, and the resulting system trajectory x̃_k = {x̃_{0|k}, ..., x̃_{N|k}} satisfies x̃_{i|k} ∈ X, i ∈ N_{[0,N]}, rendering the cost function J(x̃_k, ũ_k) in the form of (11a) finite.
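Definition 1 amounts to a rollout-and-membership test. The following sketch checks a candidate sequence against it; the set membership tests u_in_U and x_in_X are placeholders for the problem-specific constraint sets (box sets in the example comment).

```python
import numpy as np

def is_feasible(x_k, u_seq, f, u_in_U, x_in_X):
    """Check a candidate sequence u_seq against Definition 1: every input lies in U
    and the model rollout stays in X for i = 0, ..., N."""
    x = x_k
    if not x_in_X(x):                 # x_tilde_{0|k} = x_k must lie in X
        return False
    for u in u_seq:                   # i = 0, ..., N-1
        if not u_in_U(u):
            return False
        x = f(x, u)                   # x_tilde_{i+1|k} = f(x_tilde_{i|k}, u_tilde_{i|k})
        if not x_in_X(x):
            return False
    return True                       # J(x_tilde_k, u_tilde_k) is then finite on the compact sets

# Example with hypothetical box sets:
# is_feasible(x0, u_seq, f,
#             u_in_U=lambda u: np.all(np.abs(u) <= 1.0),
#             x_in_X=lambda x: np.all(np.abs(x) <= 5.0))
```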
A. Convergence Analysis

We first present two theorems to respectively demonstrate that the value function V_π^t(x_k) is uniformly monotonically nondecreasing with respect to the iteration index t and show that it has an upper bound. Then, based on the two theorems, it is shown in the third theorem that V_π^t(x_k) converges to the optimal solution as t → ∞.

Theorem 1: For ∀t ∈ N and x_k ∈ X, let V_π^t(x_k) and π^t(x_k) be obtained by the PI presented in Section III-C; then V_π^t(x_k) is a uniformly monotonically nondecreasing sequence in t, that is, ∀t : V_π^t(x_k) ≤ V_π^{t+1}(x_k).

Proof: Define a new value function Γ^t(x_k) with Γ^0(·) ≡ 0, which follows the update law

Γ^t(x_k) = Σ_{i=0}^{N−1} ℓ(x̃_{i|k}, ũ_{i|k}) + Γ^{t−1}(x̃_{N|k})    (20)

where {ũ_{0|k}, ..., ũ_{N−1|k}} = ũ_k is an arbitrary feasible control sequence as defined in Definition 1, x̃_{0|k} = x_k, and x̃_{i|k}, i ∈ N_{[1,N]}, is the resulting trajectory of ũ_k. Since Γ^0(·) = V_π^0(·) = 0, we have

Γ^1(x_k) = Σ_{i=0}^{N−1} ℓ(x̃_{i|k}, ũ_{i|k}) + Γ^0(x̃_{N|k})
        ≥ min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + Γ^0(x_{N|k}) }
        = min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^0(x_{N|k}) }
        = Σ_{i=0}^{N−1} ℓ(x^0_{i|k}, u^0_{i|k}) + V_π^0(x^0_{N|k}) = V_π^1(x_k).    (21)

The last two equalities hold from (17) and (18), indicating the following update law:

V_π^{t+1}(x_k) = J^t(x_k, u^t_k)
            = Σ_{i=0}^{N−1} ℓ(x^t_{i|k}, u^t_{i|k}) + V_π^t(x^t_{N|k})
            = min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^t(x_{N|k}) }.    (22)

Assume that at t = l, Γ^l(x_k) ≥ V_π^l(x_k) holds for ∀x_k ∈ X; then for t = l + 1

Γ^{l+1}(x_k) = Σ_{i=0}^{N−1} ℓ(x̃_{i|k}, ũ_{i|k}) + Γ^l(x̃_{N|k})
           ≥ min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + Γ^l(x_{N|k}) }
           = Σ_{i=0}^{N−1} ℓ(x̄^l_{i|k}, ū^l_{i|k}) + Γ^l(x̄^l_{N|k})
           ≥ Σ_{i=0}^{N−1} ℓ(x̄^l_{i|k}, ū^l_{i|k}) + V_π^l(x̄^l_{N|k})
           ≥ min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^l(x_{N|k}) }
           = Σ_{i=0}^{N−1} ℓ(x^l_{i|k}, u^l_{i|k}) + V_π^l(x^l_{N|k}) = V_π^{l+1}(x_k)

where ū^l_{i|k}, x̄^l_{i|k}, i = 0, 1, ..., are the solutions to the problem min_{u_k∈U}{ Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + Γ^l(x_{N|k}) }. Through induction, we conclude that

∀t : Γ^t(x_k) ≥ V_π^t(x_k).    (23)

Next, we use this conclusion to prove that V_π^t(x_k) is a uniformly monotonically nondecreasing sequence in t. Again, the mathematical induction method is adopted. First, since V_π^1(x_k) = Σ_{i=0}^{N−1} ℓ(x^0_{i|k}, u^0_{i|k}) ≥ 0 and Γ^0(x_k) = 0, it is obvious that Γ^0(x_k) ≤ V_π^1(x_k). Second, if we let the arbitrary feasible policy ũ_k be u^t_k in (20), we have the following update law at t = l:

Γ^l(x_k) = Σ_{i=0}^{N−1} ℓ(x^l_{i|k}, u^l_{i|k}) + Γ^{l−1}(x^l_{N|k}).    (24)

At this time, assume that Γ^{l−1}(x_k) ≤ V_π^l(x_k) holds for ∀x_k ∈ X; then at t = l + 1, from (24) and (22), we derive

Γ^l(x_k) − V_π^{l+1}(x_k) = Γ^{l−1}(x^l_{N|k}) − V_π^l(x^l_{N|k}) ≤ 0.    (25)

Hence, Γ^l(x_k) ≤ V_π^{l+1}(x_k) holds. Therefore, we conclude that ∀t : Γ^t(x_k) ≤ V_π^{t+1}(x_k). Combining (23), we can draw the conclusion that

∀t : V_π^t(x_k) ≤ Γ^t(x_k) ≤ V_π^{t+1}(x_k)    (26)

which completes the proof.

Remark 5: Since V_π^0(·) = 0 and the running cost (4) is positive-definite, Theorem 1 indicates that the value function V_π^t(x_k) obtained in any iteration t ∈ N_{≥1} is positive-definite.

Theorem 2: For ∀t ∈ N and x_k ∈ X, let V_π^t(x_k) and π^t(x_k) be obtained by the PI presented in Section III-C, and let V_π*(x_k) be the optimal value function satisfying the Bellman optimality equation

V_π*(x_k) = min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π*(x_{N|k}) }    (27)

then V_π*(x_k) is an upper bound of V_π^t(x_k), that is,

∀t : V_π^t(x_k) ≤ V_π*(x_k).    (28)

Proof: We prove this by induction. First, at t = 0, V_π^0(x_k) = 0 ≤ V_π*(x_k). Second, we assume that at t = l, V_π^l(x_k) ≤ V_π*(x_k) holds for ∀x_k ∈ X; then at t = l + 1, we have

V_π^{l+1}(x_k) = min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^l(x_{N|k}) }
            ≤ min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π*(x_{N|k}) } = V_π*(x_k).
Thus, V_π^t(x_k) ≤ V_π*(x_k) holds for ∀t ∈ N, and V_π*(x_k) is an upper bound. The proof is completed.

Based on the properties revealed by Theorems 1 and 2, we give the following theorem to show the convergence of RLMPC.

Theorem 3: For ∀t ∈ N and x_k ∈ X, let V_π^t(x_k) and π^t(x_k) be obtained by the PI presented in Section III-C. When t → ∞, lim_{t→∞} V_π^t(x_k) = V_π^∞(x_k) = V_π*(x_k) and lim_{t→∞} π^t(x_k) = π^∞(x_k) = π*(x_k).

Proof: From Theorems 1 and 2, we know that V_π^t(x_k) is a uniformly monotonically nondecreasing and upper-bounded sequence in t. According to the monotone convergence theorem [43], V_π^t(x_k) converges to a value, denoted by V_π^∞(x_k), as t → ∞. Since lim_{t→∞} V_π^t(x_k) = lim_{t→∞} V_π^{t+1}(x_k) = V_π^∞(x_k), we obtain

V_π^∞(x_k) = lim_{t→∞} V_π^{t+1}(x_k)
          = lim_{t→∞} min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^t(x_{N|k}) }
          = min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + lim_{t→∞} V_π^t(x_{N|k}) }
          = min_{u_k∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + lim_{t→∞} V_π^{t+1}(x_{N|k}) }
          = Σ_{i=0}^{N−1} ℓ(x^∞_{i|k}, u^∞_{i|k}) + V_π^∞(x^∞_{N|k})

which is the Bellman optimality equation. Thus, V_π^∞(x_k) = V_π*(x_k) and the resulting policy lim_{t→∞} π^t(x_k) = π^∞(x_k) = π*(x_k). The proof is completed.

B. Stability and Feasibility Analysis

Theorem 4: For ∀t ∈ N and x_k ∈ X, let V_π^t(x_k) and π^t(x_k) be obtained by the PI presented in Section III-C; then for every t ∈ N, the RLMPC policy π^t(x_k) is feasible and the closed-loop system is asymptotically stable under π^t(x_k).

Proof: For t = 0, Assumption 3 has already ensured the asymptotic stability and feasibility of the initial policy π^0(x_k), ∀x_k ∈ X. Before proceeding, we define

V_Ñ(x_k) = min_{u_{Ñ∥k}∈U} Σ_{i=0}^{Ñ−1} ℓ(x_{i|k}, u_{i|k})    (30)

subject to (11b)–(11e) with N replaced by Ñ, where Ñ can be an arbitrary positive integer and u_{Ñ∥k} = {u_{0|k}, u_{1|k}, ..., u_{Ñ−1|k}} is the decision variable of length Ñ. By letting Ñ = N, it is obvious that V_N(x_k) = J^0(x_k, u^0_{N∥k}) = V_π^1(x_k) holds for ∀x_k ∈ X. Furthermore, we can also obtain

V_{2N}(x_k) = min_{u_{2N∥k}∈U} Σ_{i=0}^{2N−1} ℓ(x_{i|k}, u_{i|k})
           = min_{u_{2N∥k}∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + Σ_{i=N}^{2N−1} ℓ(x_{i|k}, u_{i|k}) }
           = min_{u_{N∥k}∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_N(x_{N|k}) }    (31)

where the last equation holds because of Bellman's optimality principle. As a result, invoking (22) leads to

V_{2N}(x_k) = min_{u_{N∥k}∈U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^1(x_{N|k}) } = J^1(x_k, u^1_{N∥k}) = V_π^2(x_k).    (32)

This illustrates that the policy generated by RLMPC with V_π^1(x_{N|k}) as the terminal cost is equivalent to that of the MPCWTC with a prediction horizon length of 2N. By induction, it is easy to verify

V_{(t+1)·N}(x_k) = J^t(x_k, u^t_{N∥k}) = V_π^{t+1}(x_k)  ∀t ∈ N    (33)

which implies that the policy π^t(x_k) is equivalent to the MPCWTC with prediction horizon (t + 1)·N. Given that N satisfies (13), and hence (t + 1)·N ≥ 2 ln β/[ln β − ln(β − 1)] also holds, the asymptotic stability and feasibility of π^t(x_k) can be guaranteed according to Lemma 1. The proof is completed.

Remark 6: The above proof reveals that the essence of RLMPC is to achieve the effect of accumulating the prediction horizon on the basis of the initial controller through iterations. As t → ∞, π^t(x_k) converges to the policy generated by the MPCWTC with an infinite prediction horizon, that is, the optimal policy. This property echoes the results given by Theorems 1–3. In this sense, (33) serves as a performance indicator for each iteration, and N determines the convergence rate (CR).

Remark 7: For specific control tasks, the MPCWTC could generally achieve (near-)optimal performance with a very limited prediction horizon 0 < N̄ ≪ ∞, without requiring an infinite one. In other words, if the prediction horizon of MPCWTC is no less than N̄, increasing its prediction horizon hardly improves the performance. This indicates that we do not have to iterate RLMPC to t → ∞, but can stop at some 0 < t̄ ≪ ∞ if V_π^t̄(x_k) hardly changes with a larger t̄. Furthermore, with V_π^t̄(x_k) being the terminal cost, one can flexibly adjust the prediction horizon of RLMPC and obtain almost the same control policy π^t̄(x_k) → π*(x_k), as the Bellman optimality equation (7) inherently holds at any prediction horizon. Since the prediction horizon determines the dimensionality and hence the computational burden of the optimization problem, such a property provides a way to significantly reduce the computational burden by shortening the prediction horizon of RLMPC.
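Before turning to implementation, the monotone convergence of Theorems 1–3 and the horizon-accumulation property (33) of Remark 6 can be checked numerically on a toy unconstrained scalar linear-quadratic problem, where the minimization in (22) reduces to a closed-form Riccati recursion. The numbers a, b, q, r below are hypothetical and purely illustrative.

```python
import numpy as np

a, b, q, r = 1.1, 1.0, 1.0, 0.5         # hypothetical scalar LQ data
N = 3                                   # prediction horizon of OP 2

def riccati_step(p):
    """One-step value recursion: min_u q*x^2 + r*u^2 + p*(a*x + b*u)^2 = p_next*x^2."""
    return q + a**2 * p * r / (r + b**2 * p)

def evaluate(p_terminal, horizon):
    """Quadratic weight of the horizon-`horizon` problem with terminal weight p_terminal."""
    p = p_terminal
    for _ in range(horizon):
        p = riccati_step(p)
    return p

p_t, history = 0.0, [0.0]               # V_pi^0 = 0
for t in range(30):                     # PI update (22): N-step recursion starting from p_t
    p_t = evaluate(p_t, N)
    history.append(p_t)

assert all(h2 >= h1 for h1, h2 in zip(history, history[1:]))     # Theorem 1: nondecreasing
for t in range(5):                      # (33): t+1 iterations of horizon N = MPCWTC of horizon (t+1)*N
    assert abs(history[t + 1] - evaluate(0.0, (t + 1) * N)) < 1e-9
print("p_t converges to", history[-1])  # Theorem 3: the DARE fixed point
```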
V. IMPLEMENTATION ISSUES ON RLMPC

A. On the PI

Theorems 1 and 2 reveal that as t increases, V_π^t(x_k) exhibits a monotonic approximation to V_π*(x_k). This implies that each iteration will necessarily lead to some degree of improvement in V_π^t(x_k). More importantly, Theorem 4 demonstrates that V_π^t(x_k) obtained at any t ∈ N can result in a stabilizing control policy. This allows us to stop the iteration at any t without affecting the stability of the closed-loop system.

As we can see in Section III-B, the update of V_π^t(x_k) is determined by the weight vector W. Denote the weight vector
obtained in the tth iteration as W^t (which has converged for all the training data in this iteration), and select a small constant ε > 0 as the threshold of change. Instead of iterating infinitely many steps, if ∥W^{t+1} − W^t∥ ≤ ε is satisfied for some t, then V_π^t(x_k) can be regarded as converged to the neighborhood of V_π*(x_k) and the iteration can be stopped.

Remark 8: In contrast to PI, value iteration (VI) is considered less computationally expensive by reducing the policy evaluation step to a single iteration. Although it is possible to adapt RLMPC to the VI update manner, the properties presented in Section IV may be lost. This is due to the fact that the intermediate policies of VI are not guaranteed to be stable and the intermediate value functions may not correspond to any policy. Taking such value functions as the terminal costs, it is difficult to ensure that the corresponding RLMPC generates suitable policies for value function updates. How to ensure the convergence, stability, and feasibility in the case of VI would be our future research.

B. Overall Algorithm of the RLMPC Scheme

Given the state x_k, use the policy π^0(x_k) to control the real system for r^0 ≥ 1 steps. In the meanwhile, perform the policy evaluation for π^0(·) as presented in Section III-B. Once the value function V_π^1(·) is obtained, substitute it into OP 2 as the terminal cost and update the policy to π^1(·). This procedure is repeated for ∀t ≥ 1, and π^t(·) is applied to the real system for r^t steps (r^t ∈ N_{≥1} is a constant at a given t). This learning procedure is characterized by the fact that the policy is continually improved throughout the control process. The pseudo-code of the RLMPC algorithm is shown in Algorithm 1, which is an example of updating the policy at every control step, that is, r^t ≡ 1, ∀t ∈ N.

Algorithm 1 RLMPC Algorithm
Initialize: N, Q, R, α, ε, p, Φ(x_k), W^0 = 0, t = 0, Flag = 0;
1: for k = 1, 2, 3, . . . do
2:   Measure the current state x_k at time k;
3:   Solve OP 2 for x_k with terminal cost (W^t)^T Φ(x_{N|k});
4:   Apply π^t(x_k) = u^t_{0|k} to system (1);
5:   if Flag == 0 then
6:     W^t_{k_0} ← W^t;
7:     Construct the sample set S_k = {x^t_{k_1}, . . . , x^t_{k_q}} ⊂ X;
8:     for j = 1, 2, . . . , q do
9:       Solve OP 2 for x^t_{k_j} with terminal cost (W^t)^T Φ(·);
10:      Calculate J(x^t_{k_j}, u^t_{k_j}) according to (17);
11:      W^t_{k_j} ← W^t_{k_{j−1}};
12:      Update W^t_{k_j} using (16) until ∥ΔW^t_{k_j}∥ ≤ ε;
13:      if ∥W^t_{k_j} − W^t_{k_{j−1}}∥ ≤ ε then break;
14:      end if
15:    end for
16:    W^{t+1} ← W^t_{k_j};
17:  end if
18:  if ∥W^{t+1} − W^t∥ ≤ ε then
19:    Flag ← 1;
20:  end if
21:  t ← t + 1;
22: end for
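For concreteness, a minimal Python skeleton of the loop in Algorithm 1 is sketched below. The callables solve_op2, phi, sample_states, and f_real are hypothetical stand-ins for the OP 2 solver, the basis vector, the sampling of S_k, and the real plant; they are not specified by the article, and the sketch omits details such as warm starting.

```python
import numpy as np

def rlmpc(x0, solve_op2, phi, sample_states, f_real,
          alpha=1e-6, eps=1e-8, p=9, n_steps=200, q=100):
    """Minimal skeleton of Algorithm 1. solve_op2(x, W) is assumed to return
    (u_sequence, cost) of OP 2 with terminal cost W^T Phi(.)."""
    W, learning = np.zeros(p), True
    x = x0
    for k in range(n_steps):                       # receding-horizon loop
        u_seq, _ = solve_op2(x, W)                 # policy pi^t(x_k) from OP 2
        x = f_real(x, u_seq[0])                    # apply only u_{0|k} to the system
        if learning:                               # policy evaluation via (17) and (16)
            W_new = W.copy()
            for x_kj in sample_states(q):
                _, J_kj = solve_op2(x_kj, W)       # training target J^t(x_kj, u^t_kj)
                while True:                        # inner SGD sweep on one sample
                    delta = alpha * (J_kj - W_new @ phi(x_kj)) * phi(x_kj)
                    W_new += delta
                    if np.linalg.norm(delta) <= eps:
                        break
            if np.linalg.norm(W_new - W) <= eps:   # weights settled: stop learning
                learning = False
            W = W_new                              # policy improvement: new terminal cost
    return W
```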
VI. SIMULATION EXAMPLES

To evaluate the effectiveness of the proposed RLMPC scheme, two examples are studied: one involving a linear system and the other involving a nonlinear one. As is known, traditional MPC (which refers to the one described in OP 1) can attain linear quadratic optimal performance for linear systems with properly designed terminal conditions, so the linear system example can serve as a benchmark for RLMPC performance. For nonlinear systems, however, it can only achieve suboptimal control performance, which is where RLMPC comes in. We will highlight the superiority of RLMPC in the nonlinear system example. In both examples, we investigate RLMPC (following Algorithm 1) for two episodes. The first episode (labeled "RLMPC Epi. 1"), called the learning episode, is to learn from scratch, starting with the initial policy to control the system, while the second episode (labeled "RLMPC Epi. 2"), also referred to as the re-execution episode, is to redo the task under the learned value function (LVF) to check the performance.

A. Linear System

Consider the following linear system:

x_{k+1} = A x_k + B u_k    (34)

where x_k = [x_{1k}, x_{2k}]^T ∈ R^2, u_k ∈ R^1, A = [1, 0.5; −0.1, 0.9], and B = [1; 0]. We set the running cost to be ℓ(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k with weights Q = I ∈ R^{2×2} and R = 0.5. Then, according to the traditional MPC design procedure [9], we solve a Riccati equation and obtain P = [1.4388, −0.3409; −0.3409, 5.0793], with terminal cost F(x_{N|k}) = x_{N|k}^T P x_{N|k}, and the terminal controller u_k = K x_k with K = [−0.7597, −0.2128] being the linear quadratic optimal one. Correspondingly, we set the RLMPC parameters to be α = 10^{−6}, ε = 10^{−8}, p = 9, and Φ(x) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2, x_1^3, x_1^2 x_2, x_2^3, x_1 x_2^2]. Let the prediction horizon be N = 3 and the initial state be x_0 = [2.9, 2]^T. We collect 100 random samples in each iteration in the state space x_{1k} ∈ [−0.5, 3.5], x_{2k} ∈ [−0.5, 2.5], as well as the actual system trajectory, to form S_k. Our task is to drive the system states from x_0 to the origin while minimizing (3). We examine the performance of RLMPC and compare it with those of traditional MPC and MPCWTC (labeled "LQR (MPC)" and "MPCWTC," respectively, in the following figures). To facilitate the comparison with the optimal control performance in the infinite horizon, we do not impose any constraints on this control problem, at which point the above-designed traditional MPC is fully equivalent to LQR.

The relevant state trajectories and control inputs are reported in Fig. 2. It can be seen that the trajectories of "RLMPC Epi. 1" and "MPCWTC" are very close in the initial few steps, which is due to the fact that the initial policy of the learning episode is essentially the MPCWTC. As the policy of the learning episode improves with each control step k, its trajectory gradually deviates from that of "MPCWTC" and converges to that of "LQR (MPC)." The evolution curves of the weight vector W in the learning episode are shown in Fig. 3. We observe that W converges at the 11th step (at this moment, the system states are not yet regulated to the origin), indicating the convergence of the learning process.
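The terminal ingredients quoted above can be reproduced with standard tools. The following sketch (using SciPy, and assuming the input matrix reconstructed here as B = [1, 0]^T) solves the discrete algebraic Riccati equation and recovers the stated P and K.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# System and weights from Section VI-A (B reconstructed as [1, 0]^T).
A = np.array([[1.0, 0.5], [-0.1, 0.9]])
B = np.array([[1.0], [0.0]])
Q = np.eye(2)
R = np.array([[0.5]])

P = solve_discrete_are(A, B, Q, R)                   # terminal weight P
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # LQR gain, u_k = K x_k

print(np.round(P, 4))   # approx [[ 1.4388 -0.3409]
                        #         [-0.3409  5.0793]]
print(np.round(K, 4))   # approx [[-0.7597 -0.2128]]
```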
Fig. 2. Trajectories and inputs (linear system).
Fig. 4. ACCs (linear system).
TABLE II. Comparison of traditional MPC and RLMPC in the learning episode with different prediction horizons.

Fig. 8. Convergence of the weight vector and LVF (nonlinear system).

Fig. 10. Comparison of trajectories and inputs of RLMPC with the learned (near-)optimal value function under prediction horizons N = 1 and N = 5 (nonlinear system).

ACC, CR, and the average computational time (ACT) at each step. RLMPC in the learning episode with four different prediction horizons, namely N = 5, 7, 9, 30, is compared (settings of other parameters remain unchanged), as well as traditional MPC with N = 30. The corresponding numerical results are listed in Table II.

Consistent with the linear system example, increasing the prediction horizon of RLMPC effectively enhances the CR and leads to an improved (lower ACC) trajectory in the learning episode. Conceivably, this comes at the cost of increased computation time at each step. However, we observe that even with the same prediction horizon of 30, the ACT of RLMPC is still slightly lower than that of traditional MPC. This is mainly due to the removal of the terminal constraint, which simplifies the optimization problem.

We now perform a verification of Remark 7. Taking the learned (near-)optimal value function as the terminal cost of RLMPC, we reduce the prediction horizon to 1 and re-execute the regulation task. As we can see from Fig. 10, the trajectories and input under N = 1 basically coincide with those under N = 5, which confirms that shortening the prediction horizon does not affect the (near-)optimal performance. More importantly, the ACT is reduced from 0.0826 to 0.0222 s (an improvement of about 73%), which illustrates the effectiveness of RLMPC in reducing the computational burden.
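With the learned value function in hand, the N = 1 re-execution of Remark 7 reduces OP 2 to a single-stage optimization. A hedged sketch of that one-step problem is given below; f, stage_cost, phi, W_hat, and u_bound are problem-specific placeholders (not quantities from the article), and SciPy's generic minimizer stands in for the nonlinear-programming solver.

```python
import numpy as np
from scipy.optimize import minimize

def one_step_rlmpc(x, f, stage_cost, phi, W_hat, u_dim, u_bound):
    """Re-execution with N = 1: minimize l(x, u) + W_hat^T Phi(f(x, u)) over u in U."""
    def objective(u):
        x_next = f(x, u)                                 # one-step model prediction
        return stage_cost(x, u) + W_hat @ phi(x_next)    # learned terminal cost as tail
    bounds = [(-u_bound, u_bound)] * u_dim               # box input constraint u in U
    res = minimize(objective, x0=np.zeros(u_dim), bounds=bounds)
    return res.x                                         # applied input u_{0|k}
```

Because only one input vector is optimized, the problem dimension, and hence the per-step computation time, drops sharply, which is consistent with the reported reduction in ACT.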
[45] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, "Gaussian processes for data-efficient learning in robotics and control," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 408–423, Feb. 2015.
[46] R. W. Brockett, "Asymptotic stability and feedback stabilization," Differential Geometric Control Theory, vol. 27, no. 1, pp. 181–191, Dec. 1983.
[47] Y. Zhu and U. Ozguner, "Robustness analysis on constrained model predictive control for nonholonomic vehicle regulation," in Proc. Amer. Control Conf., 2009, pp. 3896–3901.

Min Lin received the B.S. degree in automation from the Beijing Institute of Technology, Beijing, China, in 2018, where he is currently pursuing the Ph.D. degree with the School of Automation. His research interests cover model predictive control, machine learning, active disturbance rejection control, and robotic systems.

Yuanqing Xia (Senior Member, IEEE) received the Ph.D. degree in control theory and control engineering from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 2001. From 2002 to 2003, he was a Post-Doctoral Research Associate with the Institute of Systems Science, Academy of Mathematics and System Sciences, Chinese Academy of Sciences, Beijing. From 2003 to 2004, he was with the National University of Singapore, Singapore, as a Research Fellow, where he worked on variable structure control. From 2004 to 2006, he was with the University of Glamorgan, Pontypridd, U.K., as a Research Fellow. From 2007 to 2008, he was a Guest Professor with Innsbruck Medical University, Innsbruck, Austria. Since 2004, he has been with the School of Automation, Beijing Institute of Technology, Beijing, first as an Associate Professor, then, since 2008, as a Professor. His current research interests are in the fields of networked control systems, robust control and signal processing, and active disturbance rejection control.