

Reinforcement Learning-Based Model Predictive Control for Discrete-Time Systems

Min Lin, Zhongqi Sun, Member, IEEE, Yuanqing Xia, Senior Member, IEEE, and Jinhui Zhang

Abstract—This article proposes a novel reinforcement learning-based model predictive control (RLMPC) scheme for discrete-time systems. The scheme integrates model predictive control (MPC) and reinforcement learning (RL) through policy iteration (PI), where MPC is a policy generator and the RL technique is employed to evaluate the policy. Then the obtained value function is taken as the terminal cost of MPC, thus improving the generated policy. The advantage of doing so is that it rules out the need for the offline design paradigm of the terminal cost, the auxiliary controller, and the terminal constraint in traditional MPC. Moreover, the RLMPC proposed in this article enables a more flexible choice of prediction horizon due to the elimination of the terminal constraint, which has great potential in reducing the computational burden. We provide a rigorous analysis of the convergence, feasibility, and stability properties of RLMPC. Simulation results show that RLMPC achieves nearly the same performance as traditional MPC in the control of linear systems and exhibits superiority over traditional MPC for nonlinear ones.

Index Terms—Discrete-time systems, model predictive control, policy iteration (PI), reinforcement learning (RL).

I. INTRODUCTION

THE development of modern unmanned systems places higher demands on the performance of the controllers. Typically, these systems are constrained, and violation of constraints may lead to unsafe behaviors that can severely affect system operation. Model predictive control (MPC) is capable of providing an optimal solution while handling the constraints explicitly, and it has seen significant success in recent decades with wide applications in diverse fields [1], [2], [3], [4], [5], [6].

The successful applications of MPC attracted tremendous academic interest, and soon a rigorous stability-centered MPC theoretical foundation was established. The early stability results on MPC employ a zero-state terminal equality constraint [7], [8]. They were subsequently extended to the use of a terminal inequality constraint by taking a control Lyapunov function as the terminal cost [9], [10], [11], which established the currently well-known framework for stability. Based on this framework, a commonly adopted MPC design paradigm involves finding an appropriate terminal cost and a terminal controller, as well as a terminal set, that strictly meet specific conditions. While this is relatively easy for linear systems, it is generally a challenging task for nonlinear ones. Even if a qualified terminal cost, a terminal controller, and the corresponding terminal set are found, the conditions for stability and recursive feasibility are usually quite conservative. In practice, however, the theoretically well-established MPC approaches are prohibitively difficult to apply due to the high computational burden. Instead, MPC without terminal cost and terminal constraint enjoys wide popularity. Its closed-loop stability and recursive feasibility can be guaranteed under some conditions [12], [13]. But the absence of a terminal cost entails that the residual costs outside the truncated prediction horizon are completely ignored, which makes it difficult to achieve optimal performance under a very limited prediction horizon. Motivated by the analysis above, we aim to develop a learning-based MPC scheme that learns an appropriate terminal cost function without complex offline design procedures while improving the performance of the controller.

In recent years, striking progress of reinforcement learning (RL), such as AlphaGo [14] and Atari games [15], has drawn the attention of the control community. Different from MPC's offline design, RL optimizes the control policy through online data-based adaptation [16], [17]. Observed state transitions and costs (or rewards) are the only inputs to the RL agent, and no prior knowledge of the system dynamics is needed [18], [19]. A central idea in RL is temporal-difference (TD) learning [20], [21], [22], which estimates the value function directly from raw experience in a bootstrapping way, without waiting for a final outcome. The value function is a prediction of the expected long-term reward at each state. When optimality is reached, it encodes the global optimal information so that the infinite-horizon optimal control policy can be obtained.

Nevertheless, to approximate the global optimal policy, the RL agent tends to try different policies and learn via trial and error, thereby struggling to provide safety guarantees on the resulting behaviors. This problem becomes prominent when confronted with safety-critical systems, and safety in this context is defined in terms of stability. There have been some achievements centered on safe RL recently [23], [24], [25], but it remains largely an open field.

Manuscript received 30 March 2022; revised 31 January 2023; accepted 3 May 2023. Date of publication 19 May 2023; date of current version 1 March 2024. This work was supported in part by the Beijing Municipal Science Foundation under Grant 4222052; and in part by the National Natural Science Foundation of China under Grant 62003040, Grant 61836001, and Grant 61720106010. (Corresponding author: Zhongqi Sun.)
Min Lin, Yuanqing Xia, and Jinhui Zhang are with the School of Automation, Beijing Institute of Technology, Beijing 100081, China.
Zhongqi Sun is with the School of Automation, Beijing Institute of Technology, Beijing 100081, China, and also with the Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing 314019, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNNLS.2023.3273590


In view of the powerful data-driven optimization capability of RL, combining it with MPC to optimize the controller with running data can be expected to improve the control performance. At the same time, benefiting from the solid theoretical foundation of MPC, the properties of the behaviors produced by the RL agent would be easier to analyze. This motivates attempts to integrate the best of both worlds. However, only limited pioneering work has been reported in this field. For example, in [26], a novel "plan online and learn offline" framework was proposed by taking MPC as the trajectory optimizer of RL. It is shown that MPC enables a more efficient exploration, thereby accelerating the convergence and reducing the approximation errors in the value function. The related work was extended to the actor-critic (AC) framework in [27] to handle the case of sparse and binary rewards, which shows that MPC is an important sampler that effectively propagates global information. These results were further applied to tasks defined by simple and easily interpretable high-level objectives on a real unmanned ground vehicle in [28], and the expression of the value function was improved. Yet, the focus of these works is on the improvements that MPC brings to RL practice, with no reference to stability. The stability issue was highlighted in [29], where an economic MPC was used as a function approximator in RL. This scheme was further analyzed in [30] under an improved algorithmic framework. These two papers paved the way for [31], where the constraint satisfaction concern was addressed with the help of robust linear MPC. The practicality of these approaches, however, remains to be investigated and verified. To the best of the authors' knowledge, few studies can balance the theoretical and practical guarantees of "RL + MPC" approaches, that is, provide a theoretically safe and operationally efficient scheme, and this article aims to fill this gap.

We propose an "RL + MPC" scheme from a new perspective: policy iteration (PI). Considering that MPC's ability to handle constraints can provide constraint-enforcing policies for RL agents, safety referred to in this article is not limited to stability but also includes constraint satisfaction. The main contributions of this article are threefold.

1) We provide a new idea for combining RL and MPC by bridging them through PI. In this way, a complete RL-based model predictive control (RLMPC) scheme is developed for discrete-time systems. In each iteration, the current policy is evaluated through learning to obtain the value function. Then it is employed as the terminal cost to compensate for the suboptimality induced by the truncated prediction horizon of MPC, thus improving the policy. This solves the challenge of complex offline design procedures while progressively improving the performance to the (near-)optimum.

2) The convergence of learning, recursive feasibility, and closed-loop stability of the proposed RLMPC are closely investigated, thereby theoretically guaranteeing its safety. We demonstrate that no constraint is violated even before the optimal value function is well learned, which verifies the ability of MPC to effectively constrain the RL-produced policies within safe limits.

3) We incorporate the value function approximation (VFA) technique into the developed RLMPC approach to approximate the global optimal value function with high computational efficiency. The effect of the prediction horizon on the control performance is scrutinized. Results show that the proposed RLMPC scheme can achieve nearly the same optimal performance as traditional MPC in linear system control and outperform it in nonlinear cases, which is due to the conservativeness of the offline MPC design for nonlinear systems. In addition, thanks to the elimination of the terminal controller and terminal constraint, the RLMPC scheme is particularly useful in dealing with systems where it is difficult or even impossible to design a terminal controller, such as nonholonomic systems, while reducing the computational burden by flexibly adjusting its prediction horizon.

The rest of this article is organized as follows. Section II formally outlines the problem formulation. The RLMPC scheme is developed in Section III. Its convergence, optimality, feasibility, and stability properties are analyzed in Section IV. Section V discusses the implementation of the overall scheme. We test the proposed scheme in simulations of both linear and nonlinear examples in Section VI. Section VII provides the final conclusion.

Notation: R denotes the set of reals and R^n is the n-dimensional real vector space. N is the set of natural numbers. For some r_1, r_2 ∈ R, we use R_{≥r_1}, N_{>r_2}, and N_{[r_1,r_2]} to represent the sets {r ∈ R | r ≥ r_1}, {r ∈ N | r > r_2}, and {r ∈ N | r_1 ≤ r ≤ r_2}, respectively. We label the variables of the optimal solution with (·)^*, feasible solutions with a tilde, and estimated ones with a hat, respectively. Moreover, the notations x_{i|k} and u_{i|k} indicate the state and input predicted i steps ahead from the current time k, respectively. The sequence {u_{0|k}, u_{1|k}, ..., u_{N−1|k}} is denoted by u_k, or by u_{N∥k} if we want to emphasize its length N.

II. PROBLEM FORMULATION

A. Optimal Control Problem

Consider a dynamic system described by the following state-space difference equation:

  x_{k+1} = f(x_k, u_k)    (1)

where x_k ∈ R^n and u_k ∈ R^m are the system state and control input at time k ∈ N, respectively. It is assumed that the system is subject to the constraints

  x_k ∈ X,  u_k ∈ U    (2)

where X and U are compact sets that contain the origin as an interior point. Also, system (1) is assumed to satisfy the following conditions.

Assumption 1: The function f : R^n × R^m → R^n is continuous with f(0, 0) = 0. Under the constraints (2), system (1) is stabilizable with respect to the equilibrium at x_k = 0, u_k = 0, and the system state x_k is measurable.

Remark 1: Assumption 1 is a nonintrusive and common assumption in the MPC community and can be found in numerous works (e.g., [9], [10], [11], [12], [13]).


For a system x̆_{k+1} = f(x̆_k, ŭ_k) with a general equilibrium (x_e, u_e), one can always make the variable changes x_k = x̆_k − x_e and u_k = ŭ_k − u_e to translate it to system (1) with the equilibrium (0, 0).

For the particular setup considered above, the infinite-horizon optimal control problem at time k with the initial state x_k ∈ X is then described by Problem 1.

Problem 1: Find a control policy u_k = π(x_k), defined as a map from the state space to the control space π : R^n → R^m and resulting in a control sequence u_∞ = {u_k, u_{k+1}, ...}, such that it stabilizes the zero equilibrium of (1) and minimizes the infinite-horizon cost

  J_∞(x_k, u_∞) = Σ_{i=k}^{∞} ℓ(x_i, u_i)    (3)

subject to constraints (1) and (2), with the running cost ℓ : R^n × R^m → R_{≥0} satisfying

  ℓ(0, 0) = 0
  α_1^{K∞}(∥x∥) ≤ ℓ̌(x) ≜ inf_{u∈U} ℓ(x, u) ≤ α_2^{K∞}(∥x∥)    (4)

for ∀x ∈ X, where α_1^{K∞} and α_2^{K∞} are K_∞ functions [32].

Remark 2: The running cost defined by (4) is a fairly general form and is common in many MPC research studies (e.g., [13], [33]). Its specific form is task-dependent; for example, it usually takes a quadratic form in regulation problems, ℓ(x, u) = x^T Q x + u^T R u, where Q and R are positive-definite matrices with proper dimensions.

B. Background of Model Predictive Control

Generally, there is no analytic solution to Problem 1, especially for constrained nonlinear systems. An efficient method to solve this problem is MPC [9], which provides an approximated solution by recursively solving the following OP 1 at each x_k ∈ X.

OP 1: Find the optimal control sequence u_k^* by solving the finite-horizon minimization problem

  min_{u_k} J(x_k, u_k) = Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + F(x_{N|k})    (5a)
  s.t. x_{0|k} = x_k    (5b)
       x_{i+1|k} = f(x_{i|k}, u_{i|k}),  0 ≤ i ≤ N−1    (5c)
       u_{i|k} ∈ U,  0 ≤ i ≤ N−1    (5d)
       x_{i|k} ∈ X,  1 ≤ i ≤ N−1    (5e)
       x_{N|k} ∈ X_f    (5f)

where N ∈ N_{>0} is the prediction horizon, F(·) is the terminal cost, u_k = {u_{0|k}, u_{1|k}, ..., u_{N−1|k}} is the predicted control sequence, X_f is the terminal set, and (5f) is the terminal constraint.

After solving OP 1, apply only the first element of u_k^*, that is, u_k = u^*_{0|k}, to system (1) and repeat this procedure.

To guarantee recursive feasibility and stability, it is required that N be chosen properly so that x_{N|k} is guaranteed to enter a positively invariant set X_f (under a terminal controller κ(x_k) ∈ U) containing the origin. Moreover, the parameters of the running and terminal costs should be designed to satisfy F(x_{N|k}) ≥ Σ_{i=N}^{∞} ℓ(x_{i|k}, κ(x_{i|k})) for ∀x_k ∈ X_f, such that F(·) is a local Lyapunov function for κ(x_k) in X_f.

As mentioned in Section I, it is generally a difficult task to design such terminal conditions (i.e., terminal cost, terminal constraint, and terminal controller) for nonlinear systems. Although MPC without terminal conditions (MPCWTC) has been proposed as an option to circumvent this challenge, its performance under a very limited (but necessary) prediction horizon tends to be inferior to that of the variant with a terminal cost. Therefore, we propose to introduce RL into MPC to learn the terminal cost online, thereby generating a policy that is closer to the optimal one in the sense of an infinite horizon.

C. Background of RL

An important class of RL techniques aims to learn the optimal policy π^*(x) by iteratively obtaining the evaluation of the current policy and then improving the policy accordingly [34]. At state x_k, the value function V_π : R^n → R, or the evaluation, for a given policy u_k = π(x_k) is the accumulated reward (or running cost) obtained by following the policy from x_k, that is, V_π(x_k) = Σ_{i=k}^{∞} ℓ(x_i, u_i), which can be rearranged into the well-known Bellman equation

  V_π(x_k) = ℓ(x_k, u_k) + V_π(x_{k+1}).    (6)

According to Bellman's principle, when optimality is achieved, the optimal value function V_π^*(x_k) satisfies the Bellman optimality equation [35]

  V_π^*(x_k) = min_{u_k} { ℓ(x_k, u_k) + V_π^*(x_{k+1}) }    (7)

and the corresponding optimal policy is given by

  π^*(x_k) = arg min_{u_k} { ℓ(x_k, u_k) + V_π^*(x_{k+1}) }.    (8)

Nevertheless, the calculation of V_π(x_k) is generally computationally difficult, and V_π^*(x_k) is unknown before all the policies π(x_k) = u_k ∈ U are considered. This motivates the TD learning approach [36], which provides a more tractable computation. In TD learning, the running costs are observed to construct the TD target V_TD(x_k), thereby iteratively updating the value function as

  V_π(x_k) ← V_π(x_k) + α[V_TD(x_k) − V_π(x_k)]    (9)

where α ∈ (0, 1) is a constant called the learning step size, and V_TD(x_k) − V_π(x_k) is known as the TD error. The Bellman equation (6) holds whenever the TD error is zero, thus yielding the evaluation of policy π(x_k). Here, to be consistent with MPC, the TD target can be constructed in the N-step form

  V_TD(x_k) = Σ_{i=k}^{k+N−1} ℓ(x_i, u_i) + V_π(x_{k+N}).    (10)

It can be seen that this TD target has a similar form to the MPC cost function (5a), which builds up the connection between MPC and RL, and their combination gives rise to the RLMPC algorithm to be presented in Sections III–V.
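As a concrete illustration of the N-step construction in (9) and (10), the following minimal Python sketch performs one update of a stored value estimate from an observed trajectory segment. The dictionary-based value table, the state labels, and the constant costs are illustrative assumptions made only for this sketch; they are not part of the scheme above.

def n_step_td_update(V, costs, states, k, N, alpha):
    # N-step TD target, cf. (10): accumulated running costs plus the bootstrapped tail value
    v_td = sum(costs[k + i] for i in range(N)) + V[states[k + N]]
    # TD update, cf. (9): move the estimate toward the target with step size alpha
    V[states[k]] += alpha * (v_td - V[states[k]])
    return V

# toy usage on a five-state trajectory segment with constant running cost 1.0
states = ["s0", "s1", "s2", "s3", "s4"]
costs = [1.0, 1.0, 1.0, 1.0]
V = {s: 0.0 for s in states}
V = n_step_td_update(V, costs, states, k=0, N=3, alpha=0.5)
print(V["s0"])  # 0.5 * (3.0 + V["s3"] - 0.0) = 1.5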


Fig. 1. RLMPC pipeline.

III. RL-BASED MPC

In the proposed RLMPC, the value function and the control policy are updated in the PI manner. Specifically, we take the currently obtained heuristic term as the terminal cost of MPC, and the corresponding control input is the current control policy. The costs incurred by following this policy, which can be regarded as the rewards for the controller interacting with the system dynamics, are used as training data for the learning of the value function. The VFA technique is employed to approximate the underlying value function, that is, the evaluation of the current policy. With the latest learned value function taken as the heuristic term in the cost function, the MPC controller, that is, the current policy, is updated. Fig. 1 illustrates the RLMPC pipeline.

A. Policy Generator—MPC

In this section, we formulate the MPC problem to be solved at each step (see OP 2). Since MPC only looks N steps forward at each control period, it ultimately produces a locally optimal policy for Problem 1 unless coupled with a terminal cost that propagates global information [26]. Motivated by this, we take the heuristic term learned through RL, namely the value function, as the terminal cost. Based on the results presented in [13], we modify the formulation of MPC as follows.

OP 2: Find the optimizer u_k of the following problem:

  min_{u_k} J(x_k, u_k) = Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π(x_{N|k})    (11a)
  s.t. x_{0|k} = x_k    (11b)
       x_{i+1|k} = f(x_{i|k}, u_{i|k}),  0 ≤ i ≤ N−1    (11c)
       u_{i|k} ∈ U,  0 ≤ i ≤ N−1    (11d)
       x_{i|k} ∈ X,  1 ≤ i ≤ N    (11e)

where the control sequence u_k = {u_{0|k}, u_{1|k}, ..., u_{N−1|k}} is the decision variable, and V_π(·) is a heuristic term obtained through the learning technique to be presented in Section III-B. We define the MPC control policy generated under a fixed V_π(·) to be a fixed policy.

Up to this point, we obtain the policy π(x_k) implicitly defined by OP 2: solving OP 2 at time k yields a control sequence u_k^* = {u^*_{0|k}, u^*_{1|k}, ..., u^*_{N−1|k}}, and the first element is applied to system (1).

Remark 3: A significant advantage of this formulation lies in the fact that it not only inherits the advantages of MPCWTC [13], but also has the potential to further improve the control performance by compensating for the costs neglected by the truncated prediction horizon through the learned heuristic function term.

In this article, the initial policy π^0(x_k) is generated by solving OP 2 with V_π(·) ≡ 0 and implemented in the receding-horizon fashion, which is exactly the MPCWTC. To ensure its stability and recursive feasibility, we make the following assumptions.

Assumption 2 (Controllability Assumption): There exists a positive constant β ∈ R_{>0} such that

  V_π^*(x_k) ≤ β · ℓ̌(x_k)  ∀x_k ∈ X    (12)

where V_π^*(x_k) and ℓ̌(x_k) are defined in (7) and (4), respectively.

Lemma 1 (See [13]): If the prediction horizon N of the MPCWTC is chosen to satisfy

  N > 2 ln β / (ln β − ln(β − 1))    (13)

it is guaranteed that the closed-loop system is asymptotically stable and the MPCWTC optimization problem is recursively feasible.

Assumption 3: In this context, the prediction horizon of the initial controller satisfies (13).

Remark 4: Assumption 2 is a fairly standard condition in the MPCWTC literature (e.g., [13], [37], [38]) and is also known as the boundedness condition of the optimal value function. It indicates the controllability of system (1) with respect to the equilibrium, since otherwise V_π^*(x_k) would go to infinity [13]. Lemma 1 and Assumption 3 jointly ensure that the initial controller is stabilizing, which is consistent with the requirement on the initial policy for PI (to be introduced in Section III-C). The method to calculate a suitable β can be found in [39] and [40].

B. Learning of the Value Function

Now we focus on how to perform policy evaluation for a fixed control policy π(x_k) at time k. As mentioned in Section II-C, the essence of policy evaluation is to obtain V_π(x_k) for π(x_k) on the set X. For a continuous state space, however, it is impossible to obtain V_π(x_k) at every x_k ∈ X, because we cannot traverse all states over the state space. Therefore, the VFA technique [36] is adopted in this article to approximate the underlying true value function.

According to the higher-order Weierstrass approximation theorem [41], there exists a dense polynomial basis set {Φ_i(x_k)} such that one can approximate the true value function V_π(x_k), ∀x_k ∈ X, in the following form:

  V_π(x_k) = Σ_{i=1}^{∞} W_i Φ_i(x_k)
           = Σ_{i=1}^{p} W_i Φ_i(x_k) + Σ_{i=p+1}^{∞} W_i Φ_i(x_k)
           = W^T Φ(x_k) + e(x_k) = V̂_π(x_k, W) + e(x_k)    (14)

where V̂_π(x_k, W) is an approximation of V_π(x_k), W = [W_1, ..., W_p]^T ∈ R^p is the weight vector, and Φ(x_k) = [Φ_1(x_k), ..., Φ_p(x_k)]^T is the basis vector. The approximation error e(x_k) converges uniformly to zero as p → ∞. Note that this approximation is essentially a single-layer network, and all we have to do is feed training data into this network.

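As a concrete instance of the parameterization (14), the sketch below evaluates V̂_π(x, W) = WᵀΦ(x) with a nine-term monomial basis; this particular basis is the one used later in the linear example of Section VI-A and is quoted here purely for illustration.

import numpy as np

def phi(x):
    # monomial basis Φ(x) with p = 9 terms (the choice used in Section VI-A)
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1*x2,
                     x1**3, x1**2*x2, x2**3, x1*x2**2])

def v_hat(x, W):
    # approximate value function V̂_π(x, W) = WᵀΦ(x), cf. (14)
    return W @ phi(x)

W = np.zeros(9)                          # V_π^0(·) ≡ 0 corresponds to W = 0
print(v_hat(np.array([2.9, 2.0]), W))    # prints 0.0 for the zero initialization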

Consider a set containing q (q ∈ N_{>0}) states sampled in X at time k, denoted as S_k = {x_{k_1}, x_{k_2}, ..., x_{k_q}} ⊂ X. Take each of its elements separately as an initial state and propagate one step using this policy; then we record the incurred costs as training data. Specifically, for each x_{k_j} ∈ S_k, j ∈ N_{[1,q]}, solve OP 2 (whose terminal cost is the fixed heuristic function corresponding to the current policy) with the subscript k replaced by k_j, obtain the solution u^*_{k_j}, and calculate the cost J(x_{k_j}, u^*_{k_j}) according to (11a). Note that we do not apply any u^*_{k_j} to the real system in this process, so the new subscript k_j is adopted here to distinguish these samples from the actual system state. With all these costs J(x_{k_j}, u^*_{k_j}), j ∈ N_{[1,q]}, we are now able to train the network in a way similar to TD learning. Recalling that the purpose of (9) is to approximate the TD target with V_π(x_k), we now directly use V̂_π(x_{k_j}, W) to approximate the targets J(x_{k_j}, u^*_{k_j})

  V̂_π(x_{k_j}, W) ← V̂_π(x_{k_j}, W) + α[J(x_{k_j}, u^*_{k_j}) − V̂_π(x_{k_j}, W)]    (15)

for ∀j ∈ N_{[1,q]}, and we aim to minimize E(x_{k_j}) = (1/2) e(x_{k_j})^2, where e(x_{k_j}) = J(x_{k_j}, u^*_{k_j}) − V̂_π(x_{k_j}, W). The stochastic gradient descent method [36] achieves this minimization by adjusting the weight vector W in a small step of size α ∈ (0, 1) at a time in the direction of the negative gradient (with respect to W) of the squared error

  W ← W − α ∇_W (1/2)[J(x_{k_j}, u^*_{k_j}) − V̂_π(x_{k_j}, W)]^2
    = W + α[J(x_{k_j}, u^*_{k_j}) − V̂_π(x_{k_j}, W)] ∇_W V̂_π(x_{k_j}, W)
    = W + α[J(x_{k_j}, u^*_{k_j}) − W^T Φ(x_{k_j})] Φ(x_{k_j}),   j ∈ N_{[1,q]}    (16)

in which ∇_W denotes the partial derivatives with respect to the components of W.

After W converges for all the training data, that is, [J(x_{k_j}, u^*_{k_j}) − V̂_π(x_{k_j}, W)] → 0 holds for ∀x_{k_j} ∈ S_k, we obtain the (approximated) evaluation of the given fixed control policy. Provided with sufficient training data and a proper choice of the basis vector, we have V̂_π(x_k, W) → V_π(x_k), ∀x_k ∈ X. It is worth noting that with VFA, we can characterize the value function over the state space using only the storage space of a p-dimensional vector W, which enhances the utility of the proposed approach.

C. PI in RLMPC

In RLMPC, the value function and the control policy are updated by iterations with index t ∈ N increasing from zero to infinity. For convenience, we denote the heuristic term employed in the tth iteration as V_π^t(·), and the policy generated by the MPC with V_π^t(·) as the terminal cost is denoted as π^t(·). We are now in a position to present the PI mechanism in RLMPC.

1) Initialization: Given an initial state x_k ∈ X at time k, we start the iteration with an initial heuristic term V_π^0(·) ≡ 0 and a corresponding stabilizing policy π^0(x_k), whose existence is already guaranteed by Assumptions 2 and 3. For t = 0, 1, 2, ..., do the following two steps iteratively.

2) Policy Evaluation Step: Determine the value function V_π^{t+1}(·) for policy π^t(·) by following the steps presented in Section III-B: first, sample in X to form S_k. Second, generate training data J^t(x_{k_j}, u^t_{k_j}), x_{k_j} ∈ S_k, with

  J^t(x_{k_j}, u^t_{k_j}) = min_{u_{k_j} ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k_j}, u_{i|k_j}) + V_π^t(x_{N|k_j}) }
                          = Σ_{i=0}^{N−1} ℓ(x^t_{i|k_j}, u^t_{i|k_j}) + V_π^t(x^t_{N|k_j})    (17)

where x_{k_j} = x^t_{0|k_j}, and x^t_{i|k_j} and u^t_{i|k_j}, i ∈ N_{[0,N−1]}, are deduced from model (11c), satisfying (11d) and (11e). Third, train the network using (16) to obtain the value function

  V_π^{t+1}(x_k) → J^t(x_k, u^t_k)  ∀x_k ∈ X.    (18)

3) Policy Improvement Step: According to the descriptions in Section III-A, the updated policy π^{t+1}(x_k) is yielded by solving OP 2 with terminal cost V_π^{t+1}(·)

  u^{t+1}_k = arg min_{u^{t+1}_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^{t+1}(x_{N|k}) }    (19)

and u^{t+1}_k is implemented in a receding-horizon manner.

In what follows, we focus on analyzing the theoretical properties of this RLMPC scheme.

IV. CONVERGENCE, FEASIBILITY, AND STABILITY ANALYSIS

In this section, we analyze the convergence, feasibility, and stability properties. We show that for ∀x_k ∈ X, the value function V_π^t(x_k) → V_π^*(x_k) and the control policy π^t(x_k) → π^*(x_k) as the iteration index t → ∞. Also, we demonstrate that the policy obtained in each iteration is a feasible and stabilizing policy for the closed-loop system. The convergence proof is inspired by [42], which demonstrates the convergence of optimal control (unconstrained infinite-horizon optimization) under PI. This article leverages its proof idea to extend the results to RLMPC. Before proceeding, we introduce the following definition.

Definition 1: For the RLMPC optimization problem OP 2, a control sequence ũ_k = {ũ_{0|k}, ..., ũ_{N−1|k}} is said to be feasible if ũ_{i|k} ∈ U, i ∈ N_{[0,N−1]}, and the resulting system trajectory x̃_k = {x̃_{0|k}, ..., x̃_{N|k}} satisfies x̃_{i|k} ∈ X, i ∈ N_{[0,N]}, rendering the cost function J(x̃_k, ũ_k) in the form of (11a) finite.

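Before turning to the analysis, the following sketch isolates the policy-evaluation computation of Section III-B: it fits the weight vector to a batch of sampled targets with the stochastic gradient rule (16) and stops once a full sweep no longer changes W, which is the same convergence test used in Algorithm 1 later. The quadratic toy targets below are an assumption made only to keep the sketch self-contained; in RLMPC the targets are the costs J(x_{k_j}, u^*_{k_j}) obtained by solving OP 2 at the sampled states as in (17).

import numpy as np

def phi(x):
    # nine-term monomial basis from the linear example in Section VI-A (illustrative choice)
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1*x2, x1**3, x1**2*x2, x2**3, x1*x2**2])

def evaluate_policy(samples, targets, W, alpha=0.05, eps=1e-8, max_sweeps=10000):
    # fit V̂_π(x, W) = WᵀΦ(x) to the sampled costs via the update (16)
    for _ in range(max_sweeps):
        W_prev = W.copy()
        for x, J in zip(samples, targets):
            f = phi(x)
            W = W + alpha * (J - W @ f) * f          # W ← W + α [J − WᵀΦ(x)] Φ(x)
        if np.linalg.norm(W - W_prev) <= eps:        # stop when a sweep no longer changes W
            break
    return W

# toy usage: targets generated from a known value function x1^2 + 2 x2^2
rng = np.random.default_rng(0)
samples = [rng.uniform(-1, 1, size=2) for _ in range(100)]
targets = [x[0]**2 + 2 * x[1]**2 for x in samples]
W = evaluate_policy(samples, targets, W=np.zeros(9))
print(np.round(W, 3))  # the weights on x1^2 and x2^2 should approach 1 and 2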

A. Convergence Analysis

We first present two theorems to respectively demonstrate that the value function V_π^t(x_k) is uniformly monotonically nondecreasing with respect to the iteration index t and that it has an upper bound. Then, based on these two theorems, it is shown in a third theorem that V_π^t(x_k) converges to the optimal solution as t → ∞.

Theorem 1: For ∀t ∈ N and x_k ∈ X, let V_π^t(x_k) and π^t(x_k) be obtained by the PI presented in Section III-C. Then V_π^t(x_k) is a uniformly monotonically nondecreasing sequence in t, that is, ∀t: V_π^t(x_k) ≤ V_π^{t+1}(x_k).

Proof: Define a new value function Γ^t(x_k) with Γ^0(·) ≡ 0, which follows the update law

  Γ^t(x_k) = Σ_{i=0}^{N−1} ℓ(x̃_{i|k}, ũ_{i|k}) + Γ^{t−1}(x̃_{N|k})    (20)

where ũ_k = {ũ_{0|k}, ..., ũ_{N−1|k}} is an arbitrary feasible control sequence as defined in Definition 1, x̃_{0|k} = x_k, and x̃_{i|k}, i ∈ N_{[1,N]}, is the resulting trajectory of ũ_k. Since Γ^0(·) = V_π^0(·) = 0, we have

  Γ^1(x_k) = Σ_{i=0}^{N−1} ℓ(x̃_{i|k}, ũ_{i|k}) + Γ^0(x̃_{N|k})
           ≥ min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + Γ^0(x_{N|k}) }
           = min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^0(x_{N|k}) }
           = Σ_{i=0}^{N−1} ℓ(x^0_{i|k}, u^0_{i|k}) + V_π^0(x^0_{N|k}) = V_π^1(x_k).    (21)

The last two equalities hold from (17) and (18), indicating the following update law:

  V_π^{t+1}(x_k) = J^t(x_k, u^t_k)
                 = Σ_{i=0}^{N−1} ℓ(x^t_{i|k}, u^t_{i|k}) + V_π^t(x^t_{N|k})
                 = min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^t(x_{N|k}) }.    (22)

Assume that at t = l, Γ^l(x_k) ≥ V_π^l(x_k) holds for ∀x_k ∈ X. Then, for t = l + 1,

  Γ^{l+1}(x_k) = Σ_{i=0}^{N−1} ℓ(x̃_{i|k}, ũ_{i|k}) + Γ^l(x̃_{N|k})
              ≥ min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + Γ^l(x_{N|k}) }
              = Σ_{i=0}^{N−1} ℓ(x̄^l_{i|k}, ū^l_{i|k}) + Γ^l(x̄^l_{N|k})
              ≥ Σ_{i=0}^{N−1} ℓ(x̄^l_{i|k}, ū^l_{i|k}) + V_π^l(x̄^l_{N|k})
              ≥ min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^l(x_{N|k}) }
              = Σ_{i=0}^{N−1} ℓ(x^l_{i|k}, u^l_{i|k}) + V_π^l(x^l_{N|k}) = V_π^{l+1}(x_k)

where ū^l_{i|k}, x̄^l_{i|k}, i = 0, 1, ..., are the solutions to the problem min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + Γ^l(x_{N|k}) }. Through induction, we conclude that

  ∀t: Γ^t(x_k) ≥ V_π^t(x_k).    (23)

Next, we use this conclusion to prove that V_π^t(x_k) is a uniformly monotonically nondecreasing sequence in t. Again, the mathematical induction method is adopted. First, since V_π^1(x_k) = Σ_{i=0}^{N−1} ℓ(x^0_{i|k}, u^0_{i|k}) ≥ 0 and Γ^0(x_k) = 0, it is obvious that Γ^0(x_k) ≤ V_π^1(x_k). Second, if we let the arbitrary feasible sequence ũ_k be u^t_k in (20), we have the following update law at t = l:

  Γ^l(x_k) = Σ_{i=0}^{N−1} ℓ(x^l_{i|k}, u^l_{i|k}) + Γ^{l−1}(x^l_{N|k}).    (24)

At this time, assume that Γ^{l−1}(x_k) ≤ V_π^l(x_k) holds for ∀x_k ∈ X. Then, at t = l + 1, from (24) and (22), we derive

  Γ^l(x_k) − V_π^{l+1}(x_k) = Γ^{l−1}(x^l_{N|k}) − V_π^l(x^l_{N|k}) ≤ 0.    (25)

Hence, Γ^l(x_k) ≤ V_π^{l+1}(x_k) holds. Therefore, we conclude that ∀t: Γ^t(x_k) ≤ V_π^{t+1}(x_k). Combining (23), we can draw the conclusion that

  ∀t: V_π^t(x_k) ≤ Γ^t(x_k) ≤ V_π^{t+1}(x_k)    (26)

which completes the proof.

Remark 5: Since V_π^0(·) = 0 and the running cost (4) is positive-definite, Theorem 1 indicates that the value function V_π^t(x_k) obtained in any iteration t ∈ N_{≥1} is positive-definite.

Theorem 2: For ∀t ∈ N and x_k ∈ X, let V_π^t(x_k) and π^t(x_k) be obtained by the PI presented in Section III-C, and let V_π^*(x_k) be the optimal value function satisfying the Bellman optimality equation

  V_π^*(x_k) = min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^*(x_{N|k}) }.    (27)

Then V_π^*(x_k) is an upper bound of V_π^t(x_k), that is,

  ∀t: V_π^t(x_k) ≤ V_π^*(x_k).    (28)

Proof: We prove this by induction. First, at t = 0, V_π^0(x_k) = 0 ≤ V_π^*(x_k). Second, we assume that at t = l, V_π^l(x_k) ≤ V_π^*(x_k) holds for ∀x_k ∈ X. Then, at t = l + 1, we have

  V_π^{l+1}(x_k) = min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^l(x_{N|k}) }
                ≤ min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^*(x_{N|k}) }
                = V_π^*(x_k).    (29)


Thus, V_π^t(x_k) ≤ V_π^*(x_k) holds for ∀t ∈ N, and V_π^*(x_k) is an upper bound. The proof is completed.

Based on the properties revealed by Theorems 1 and 2, we give the following theorem to show the convergence of RLMPC.

Theorem 3: For ∀t ∈ N and x_k ∈ X, let V_π^t(x_k) and π^t(x_k) be obtained by the PI presented in Section III-C. When t → ∞, lim_{t→∞} V_π^t(x_k) = V_π^∞(x_k) = V_π^*(x_k) and lim_{t→∞} π^t(x_k) = π^∞(x_k) = π^*(x_k).

Proof: From Theorems 1 and 2, we know that V_π^t(x_k) is a uniformly monotonically nondecreasing and upper-bounded sequence in t. According to the monotone convergence theorem [43], V_π^t(x_k) converges to a value, denoted by V_π^∞(x_k), as t → ∞. Since lim_{t→∞} V_π^t(x_k) = lim_{t→∞} V_π^{t+1}(x_k) = V_π^∞(x_k), we obtain

  V_π^∞(x_k) = lim_{t→∞} V_π^{t+1}(x_k)
             = lim_{t→∞} min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^t(x_{N|k}) }
             = min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + lim_{t→∞} V_π^t(x_{N|k}) }
             = min_{u_k ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + lim_{t→∞} V_π^{t+1}(x_{N|k}) }
             = Σ_{i=0}^{N−1} ℓ(x^∞_{i|k}, u^∞_{i|k}) + V_π^∞(x^∞_{N|k})

which is the Bellman optimality equation. Thus, V_π^∞(x_k) = V_π^*(x_k), and the resulting policy satisfies lim_{t→∞} π^t(x_k) = π^∞(x_k) = π^*(x_k). The proof is completed.

B. Stability and Feasibility Analysis

Theorem 4: For ∀t ∈ N and x_k ∈ X, let V_π^t(x_k) and π^t(x_k) be obtained by the PI presented in Section III-C. Then, for every t ∈ N, the RLMPC policy π^t(x_k) is feasible and the closed-loop system is asymptotically stable under π^t(x_k).

Proof: For t = 0, Assumption 3 has already ensured the asymptotic stability and feasibility of the initial policy π^0(x_k), ∀x_k ∈ X. Before proceeding, we define

  V_Ñ(x_k) = min_{u_{Ñ∥k} ∈ U} Σ_{i=0}^{Ñ−1} ℓ(x_{i|k}, u_{i|k})    (30)

subject to (11b)–(11e) with N replaced by Ñ, where Ñ can be an arbitrary positive integer and u_{Ñ∥k} = {u_{0|k}, u_{1|k}, ..., u_{Ñ−1|k}} is the decision variable of length Ñ. By letting Ñ = N, it is obvious that V_N(x_k) = J^0(x_k, u^0_{N∥k}) = V_π^1(x_k) holds for ∀x_k ∈ X. Furthermore, we can also obtain

  V_{2N}(x_k) = min_{u_{2N∥k} ∈ U} Σ_{i=0}^{2N−1} ℓ(x_{i|k}, u_{i|k})
             = min_{u_{2N∥k} ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + Σ_{i=N}^{2N−1} ℓ(x_{i|k}, u_{i|k}) }
             = min_{u_{N∥k} ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_N(x_{N|k}) }    (31)

where the last equality holds because of Bellman's optimality principle. As a result, invoking (22) leads to

  V_{2N}(x_k) = min_{u_{N∥k} ∈ U} { Σ_{i=0}^{N−1} ℓ(x_{i|k}, u_{i|k}) + V_π^1(x_{N|k}) }
             = J^1(x_k, u^1_{N∥k}) = V_π^2(x_k).    (32)

This illustrates that the policy generated by RLMPC with V_π^1(x_{N|k}) as the terminal cost is equivalent to that of the MPCWTC with a prediction horizon length of 2N. By induction, it is easy to verify

  V_{(t+1)·N}(x_k) = J^t(x_k, u^t_{N∥k}) = V_π^{t+1}(x_k)  ∀t ∈ N    (33)

which implies that the policy π^t(x_k) is equivalent to the MPCWTC with prediction horizon (t + 1)·N. Given that N satisfies (13), and hence (t + 1)·N ≥ 2 ln β/[ln β − ln(β − 1)] also holds, the asymptotic stability and feasibility of π^t(x_k) can be guaranteed according to Lemma 1. The proof is completed.

Remark 6: The above proof reveals that the essence of RLMPC is to achieve the effect of an accumulating prediction horizon on the basis of the initial controller through iterations. As t → ∞, π^t(x_k) converges to the policy generated by the MPCWTC with an infinite prediction horizon, that is, the optimal policy. This property echoes the results given by Theorems 1–3. In this sense, (33) serves as a performance indicator for each iteration, and N determines the convergence rate (CR).

Remark 7: For specific control tasks, the MPCWTC can generally achieve (near-)optimal performance with a very limited prediction horizon 0 < N̄ ≪ ∞, without requiring an infinite one. In other words, if the prediction horizon of the MPCWTC is no less than N̄, increasing its prediction horizon hardly improves the performance. This indicates that we do not have to iterate RLMPC to t → ∞, but can stop at 0 < t̄ ≪ ∞ if V_π^{t̄}(x_k) hardly changes with a larger t̄. Furthermore, with V_π^{t̄}(x_k) being the terminal cost, one can flexibly adjust the prediction horizon of RLMPC and still obtain almost the same control policy π^{t̄}(x_k) → π^*(x_k), as the Bellman optimality equation (7) inherently holds at any prediction horizon. Since the prediction horizon determines the dimensionality, and hence the computational burden, of the optimization problem, this property provides a way to significantly reduce the computational burden by shortening the prediction horizon of RLMPC.

V. IMPLEMENTATION ISSUES ON RLMPC

A. On the PI

Theorems 1 and 2 reveal that as t increases, V_π^t(x_k) monotonically approaches V_π^*(x_k). This implies that each iteration will necessarily lead to some degree of improvement in V_π^t(x_k). More importantly, Theorem 4 demonstrates that V_π^t(x_k) obtained at any t ∈ N results in a stabilizing control policy. This allows us to stop the iteration at any t without affecting the stability of the closed-loop system.

As we can see in Section III-B, the update of V_π^t(x_k) is determined by the weight vector W. Denote the weight vector obtained in the tth iteration as W^t (which has converged for all the training data in this iteration) and select a small constant ε > 0 as the threshold of change. Instead of iterating infinitely many steps, if ∥W^{t+1} − W^t∥ ≤ ε is satisfied for some t, then V_π^t(x_k) can be regarded as having converged to a neighborhood of V_π^*(x_k) and the iteration can be stopped.

Remark 8: In contrast to PI, value iteration (VI) is considered less computationally expensive because it reduces the policy evaluation step to a single iteration. Although it is possible to adapt RLMPC to the VI update manner, the properties presented in Section IV may be lost. This is due to the fact that the intermediate policies of VI are not guaranteed to be stable and the intermediate value functions may not correspond to any policy. Taking such value functions as the terminal costs, it is difficult to ensure that the corresponding RLMPC generates suitable policies for the value function updates. How to ensure convergence, stability, and feasibility in the case of VI will be our future research.

B. Overall Algorithm of the RLMPC Scheme

Given the state x_k, use the policy π^0(x_k) to control the real system for r^0 ≥ 1 steps. In the meanwhile, perform the policy evaluation for π^0(·) as presented in Section III-B. Once the value function V_π^1(·) is obtained, substitute it into OP 2 as the terminal cost and update the policy to π^1(·). This procedure is repeated for ∀t ≥ 1, and π^t(·) is applied to the real system for r^t steps (r^t ∈ N_{≥1} is a constant at a given t). This learning procedure is characterized by the fact that the policy is continually improved throughout the control process.

The pseudo-code of the RLMPC algorithm is shown in Algorithm 1, which is an example of updating the policy at every control step, that is, r^t ≡ 1, ∀t ∈ N.

Algorithm 1 RLMPC Algorithm
Initialize: N, Q, R, α, ε, p, Φ(x_k), W^0 = 0, t = 0, Flag = 0;
1:  for k = 1, 2, 3, ... do
2:    Measure the current state x_k at time k;
3:    Solve OP 2 for x_k with terminal cost (W^t)^T Φ(x_{N|k});
4:    Apply π^t(x_k) = u^t_{0|k} to system (1);
5:    if Flag == 0 then
6:      W^{tq}_{k_0} ← W^t;
7:      Construct the sample set S_k = {x^t_{k_1}, ..., x^t_{k_q}} ⊂ X;
8:      for j = 1, 2, ..., q do
9:        Solve OP 2 for x^t_{k_j} with terminal cost (W^t)^T Φ(·);
10:       Calculate J(x^t_{k_j}, u^t_{k_j}) according to (17);
11:       W^{tq}_{k_j} ← W^{tq}_{k_{j−1}};
12:       Update W^{tq}_{k_j} using (16) until ∥ΔW^{tq}_{k_j}∥ ≤ ε;
13:       if ∥W^{tq}_{k_j} − W^{tq}_{k_{j−1}}∥ ≤ ε then break;
14:       end if
15:     end for
16:     W^{t+1} ← W^{tq}_{k_j};
17:   end if
18:   if ∥W^{t+1} − W^t∥ ≤ ε then
19:     Flag ← 1;
20:   end if
21:   t ← t + 1;
22: end for

VI. SIMULATION EXAMPLES

To evaluate the effectiveness of the proposed RLMPC scheme, two examples are studied: one involving a linear system and the other involving a nonlinear one. As is known, traditional MPC (which refers to the one described in OP 1) can attain linear quadratic optimal performance for linear systems with properly designed terminal conditions, so the linear system example serves as a benchmark for RLMPC performance. For nonlinear systems, however, it can only achieve suboptimal control performance, which is where RLMPC comes in. We will highlight the superiority of RLMPC in the nonlinear system example. In both examples, we investigate RLMPC (following Algorithm 1) for two episodes. The first episode (labeled "RLMPC Epi. 1"), called the learning episode, learns from scratch, starting with the initial policy to control the system; the second episode (labeled "RLMPC Epi. 2"), also referred to as the re-execution episode, redoes the task under the learned value function (LVF) to check the performance.

A. Linear System

Consider the following linear system:

  x_{k+1} = A x_k + B u_k    (34)

where x_k = [x_{1k}, x_{2k}]^T ∈ R^2, u_k ∈ R^1,

  A = [  1    0.5
        −0.1  0.9 ],   B = [ 1
                             0 ].

We set the running cost to ℓ(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k with weights Q = I ∈ R^{2×2} and R = 0.5. Then, according to the traditional MPC design procedure [9], we solve a Riccati equation and obtain

  P = [  1.4388  −0.3409
        −0.3409   5.0793 ]

with terminal cost F(x_{N|k}) = x_{N|k}^T P x_{N|k}, and the terminal controller u_k = K x_k with K = [−0.7597  −0.2128] being the linear quadratic optimal one. Correspondingly, we set the RLMPC parameters to α = 10^{−6}, ε = 10^{−8}, p = 9, and Φ(x) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2, x_1^3, x_1^2 x_2, x_2^3, x_1 x_2^2]. Let the prediction horizon be N = 3 and the initial state be x_0 = [2.9, 2]^T. We collect 100 random samples in each iteration from the state space x_{1k} ∈ [−0.5, 3.5], x_{2k} ∈ [−0.5, 2.5], together with the actual system trajectory, to form S_k. Our task is to drive the system states from x_0 to the origin while minimizing (3). We examine the performance of RLMPC and compare it with those of traditional MPC and MPCWTC (labeled "LQR (MPC)" and "MPCWTC," respectively, in the following figures). To facilitate comparison with the optimal control performance in the infinite horizon, we do not impose any constraints on this control problem, in which case the above-designed traditional MPC is fully equivalent to LQR.

The relevant state trajectories and control inputs are reported in Fig. 2. It can be seen that the trajectories of "RLMPC Epi. 1" and "MPCWTC" are very close in the initial few steps, which is due to the fact that the initial policy of the learning episode is essentially the MPCWTC. As the policy of the learning episode improves with each control step k, its trajectory gradually deviates from that of "MPCWTC" and converges to that of "LQR (MPC)." The evolution curves of the weight vector W in the learning episode are shown in Fig. 3. We observe that W converges at the 11th step (at this moment, the system states are not yet regulated to the origin), indicating the convergence of the learning process.

Fig. 2. Trajectories and inputs (linear system).

Fig. 3. Convergence of the weight vector (linear system).

Fig. 4. ACCs (linear system).

Fig. 5. Trajectories of RLMPC in the learning episode under different prediction horizons (linear system).

Since Algorithm 1 is an example of performing one iteration at each step, we know that the learning process converges at the 11th iteration. To examine the optimality property, we compare the performance of "LQR (MPC)" with that of "RLMPC Epi. 2." It can be seen that their trajectories basically coincide within a certain error range, which verifies the optimality of RLMPC; the error mainly depends on the selection of ε for terminating the iterations.

Fig. 4 illustrates the accumulated costs (ACCs), defined as the sum of the running costs from the initial time to the current time k

  ACC = Σ_{i=0}^{k} ℓ(x_i, u_i).    (35)

As expected, the ACC of "MPCWTC" is the highest because it only looks N steps ahead and ignores all the residual costs in each control period. The ACC of "RLMPC Epi. 1" ranks second highest, as its policy is gradually improving from "MPCWTC." "RLMPC Epi. 2" achieves almost the same minimum as "LQR (MPC)," while the slightly higher ACC value (about 4 × 10^{−5}) is mainly due to the precision setting of the iteration termination. In conclusion, the proposed RLMPC approach can deliver the (near-)optimal policy in the sense of an infinite horizon for the control of a linear system.

To verify the statements in Remark 6, we first test the equivalence of RLMPC and the MPCWTC with an accumulating prediction horizon. Since the prediction horizon of RLMPC is set to N = 3, following Algorithm 1 and the results in Section IV-B, it is theoretically equivalent to the MPCWTC with N = 3·k, where k ∈ N_{>0} is the step index. We observe from the two subgraphs in the upper part of Fig. 5 that their inputs, and hence their trajectories, are basically the same, which confirms this equivalence. We then investigate the effect of increasing the prediction horizon by comparing the performance of RLMPC in the learning episode with N = 3, 5, 7, 9, 11, respectively. As can be seen from the lower subgraph of Fig. 5, the larger the prediction horizon, the closer the corresponding trajectory is to the optimal trajectory "LQR (MPC)." This can be explained by the fact that the equivalent MPCWTC has a larger accumulating prediction horizon in every iteration, so the policy at each step is closer to the optimal one. Furthermore, we present the ACCs and the CRs (measured by the number of iterations required for convergence) in Table I. It is obvious that a larger prediction horizon brings a lower ACC and faster convergence in the learning episode, which is consistent with our theoretical results.

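For reference, the accumulated cost (35) can be evaluated along any closed-loop trajectory of (34); the short sketch below does so for the LQR gain quoted in the text, with the horizon of 50 steps chosen arbitrarily for illustration.

import numpy as np

A = np.array([[1.0, 0.5], [-0.1, 0.9]])
B = np.array([[1.0], [0.0]])
Q, R = np.eye(2), np.array([[0.5]])
K = np.array([[-0.7597, -0.2128]])        # LQR gain reported in Section VI-A

def acc(x0, K, steps=50):
    # accumulated cost (35) along x_{k+1} = (A + BK) x_k starting from x0
    x, total = x0.copy(), 0.0
    for _ in range(steps):
        u = K @ x
        total += x @ Q @ x + u @ R @ u    # running cost ℓ(x_k, u_k)
        x = A @ x + B @ u
    return total

print(acc(np.array([2.9, 2.0]), K))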

TABLE I
ACCs AND CRs OF RLMPC IN THE LEARNING EPISODE UNDER DIFFERENT PREDICTION HORIZONS

We finally investigate the data efficiency of the proposed scheme and compare it with Q-learning [44] and PILCO [45]. The former employs the action-value function and has a very similar learning process to the proposed scheme, except that it is model-free. The latter is a model-based policy search scheme and is well known for its data efficiency. By implementing Q-learning and PILCO¹ in the optimal control of (34), we observe that the amount of training data required for them to converge to the optimal policy is 1.3 × 10^4 and 400 (10 trials with 40 training data in each trial), respectively. In contrast, RLMPC needs only 341 training data, far less than Q-learning, and achieves efficiency comparable to that of PILCO. This is mainly because Q-learning utilizes no prior knowledge of the system and therefore requires a large amount of data from interactions with the environment to provide a basis for policy improvement, while PILCO and RLMPC optimize the policy with the assistance of the model, thus significantly reducing the number of interactions and accelerating convergence. In addition, due to the combination with MPC, the significant advantage of RLMPC over PILCO is the theoretical guarantee for the stability and feasibility of the policies generated throughout the learning process.

B. Nonlinear System

Consider the regulation problem of a nonholonomic vehicle system

  x_{k+1} = x_k + g(x_k) u_k    (36)

where x_k = [χ_k, y_k, θ_k]^T is the state vector, with χ_k, y_k, and θ_k representing the abscissa, ordinate, and yaw angle in the geodetic coordinate system, respectively; u_k = [v_k, ω_k]^T is the control input, with v_k and ω_k representing the linear velocity and the angular velocity, respectively; δ is the sampling interval; and

  g(x_k) = δ [ cos θ_k  0
               sin θ_k  0
               0        1 ].

Set δ = 0.2 s, the input constraints |v_k| ≤ 1 m/s and |ω_k| ≤ 4 rad/s, and the state constraint 0 ≤ χ ≤ 2 m. The control objective is to drive the system from x_0 = [1.98, 5, −π/3]^T to the origin while minimizing (3) with a running cost of the quadratic form ℓ(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k, where Q = diag{1, 2, 0.06} and R = diag{0.01, 0.005}.

Remark 9: According to Brockett's theory [46], there does not exist a smooth, time-invariant feedback that stabilizes system (36), which makes the widely adopted design methods for terminal conditions, such as [10], no longer applicable to this system, thus posing a great challenge to the traditional MPC design here. Even if qualified terminal conditions can be found, they tend to be quite conservative, as evidenced by a small terminal set, a large required prediction horizon, and an overestimated terminal cost. This can lead to conservative performance of the designed traditional MPC.

1) Design of the Traditional MPC and RLMPC: In [47], a traditional MPC is proposed for (36): the terminal cost is F(x_{N|k}) = χ_{N|k}^2 + y_{N|k}^2 + θ_{N|k}^2, and the terminal set X_f is defined as X_f = {x_{N|k} ∈ X : F(x_{N|k}) ≤ ϱ_v}, where ϱ_v = cϱ, ϱ = min{v̄^2/η^2, ω̄^2/ξ^2}, c = (1 + T^2 max{η^2, ξ^2} − max{ηT^2 + q_{11}/η + r_{11}η, 2ξT}) < 1, and v̄ and ω̄ are the upper bounds of v and ω, respectively, that is, v̄ = 1 and ω̄ = 4; q_{11} and r_{11} are the first diagonal elements of Q and R, respectively. By choosing η = ξ = 8 and N ≥ 30, system (36) can be stabilized to the origin.

Remark 10: This traditional MPC design is computationally demanding because the prediction horizon must be at least 30 to meet the terminal constraint, which confirms Remark 9. Moreover, there is no guarantee that the control performance of the designed MPC is optimal. In contrast, RLMPC avoids artificially designing the terminal conditions and achieves (near-)optimal performance. With the learned (near-)optimal terminal cost, its prediction horizon can be flexibly adjusted without affecting its performance. Therefore, it is expected to show unique superiority in this problem.

Now we present the RLMPC design: α = 10^{−6}, ε = 10^{−7}, p = 30, and the basis functions are chosen as polynomials of order one to six. We generate 100 random samples in each iteration in the state space χ_k ∈ [0, 2], y_k ∈ [−1, 6], θ_k ∈ [−π, π], together with the actual system trajectory, to form S_k.

2) Comparison of the Traditional MPC and RLMPC: In the following simulations, the prediction horizons of the traditional MPC and RLMPC are set to 30 and 5 steps, respectively.

As we can see from Figs. 6 and 7, no constraint is violated during either episode, which demonstrates that RLMPC is able to restrict the policy within the input and state constraints and thus perform safely. Fig. 8 illustrates the convergence of the weight vector W. We observe that it takes 25 iterations to converge, and the evolution of the LVF is also visualized. Consistent with Theorem 1, this evolution exhibits the uniformly monotonically nondecreasing property. To verify Theorem 4, we apply the policies obtained in the first 25 iterations of the learning episode, respectively, to the system and show the corresponding trajectories and ACCs [as defined in (35)] in Fig. 9. We observe that each policy generates a safe and stabilizing trajectory to the origin and that the ACCs show a gradual improvement, which also corroborates Theorem 1. The comparison of the two episodes of RLMPC and traditional MPC is shown in Fig. 7. Since the traditional MPC design for nonlinear systems does not ensure optimality, RLMPC achieves better performance (15.09% less ACC) even in the learning episode. There is a further 16% reduction in ACC in the re-execution episode, where a near-optimal policy is employed.

3) Prediction Horizon and Computational Burden: We now demonstrate the relationship between the prediction horizon, control performance, convergence, and computational burden, using three metrics in the quantitative evaluation: the ACC, the CR, and the average computational time (ACT) at each step.

¹The Q-learning scheme is based on the adaptation of the program provided in https://github.com/XiaozhuFang/Simple-Comparison-between-Q-learning-and-MPC, while the PILCO scheme is based on the adaptation of the "cart-pole" example provided in https://github.com/UCL-SML/pilco-MATLAB. We mainly modify the system and rewards to suit this example, leaving most of the parameters unchanged.

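To make the simulation model explicit, the sketch below implements the discrete-time vehicle dynamics (36) with δ = 0.2 s, saturates the inputs to the stated bounds |v_k| ≤ 1 m/s and |ω_k| ≤ 4 rad/s, and evaluates the quadratic running cost. The test input at the end is arbitrary and only demonstrates the interface; it is not the RLMPC controller.

import numpy as np

DELTA = 0.2                      # sampling interval δ in seconds
V_MAX, W_MAX = 1.0, 4.0          # input bounds |v| ≤ 1 m/s, |ω| ≤ 4 rad/s

def g(x):
    # input matrix of the nonholonomic vehicle model (36)
    theta = x[2]
    return DELTA * np.array([[np.cos(theta), 0.0],
                             [np.sin(theta), 0.0],
                             [0.0,           1.0]])

def step(x, u):
    # one step of x_{k+1} = x_k + g(x_k) u_k with input saturation
    u = np.clip(u, [-V_MAX, -W_MAX], [V_MAX, W_MAX])
    return x + g(x) @ u

def running_cost(x, u,
                 Q=np.diag([1.0, 2.0, 0.06]),
                 R=np.diag([0.01, 0.005])):
    # quadratic running cost ℓ(x, u) = xᵀQx + uᵀRu of the nonlinear example
    return x @ Q @ x + u @ R @ u

x0 = np.array([1.98, 5.0, -np.pi / 3])   # initial state of the regulation task
u0 = np.array([-1.0, 0.5])               # arbitrary admissible test input
print(step(x0, u0), running_cost(x0, u0))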

3) Prediction Horizon and Computational Burden: We now demonstrate the relationship among the prediction horizon, control performance, convergence, and computational burden, using the following three metrics in the quantitative evaluation: ACC, CR, and the average computational time (ACT) at each step. We compare RLMPC in the learning episode under four different prediction horizons, namely N = 5, 7, 9, 30 (with the settings of the other parameters unchanged), as well as traditional MPC with N = 30. The corresponding numerical results are listed in Table II.

Fig. 6. Inputs of RLMPC (nonlinear system).
Fig. 7. System trajectories and ACCs of traditional MPC and RLMPC (nonlinear system).
Fig. 8. Convergence of the weight vector and LVF (nonlinear system).
Fig. 9. System trajectories and ACCs of RLMPC policies in the 1st–25th iterations of the learning episode (nonlinear system).
Fig. 10. Comparison of trajectories and inputs of RLMPC with the learned (near-)optimal value function under prediction horizons N = 1 and N = 5 (nonlinear system).
TABLE II. Comparison of Traditional MPC and RLMPC in the Learning Episode With Different Prediction Horizons.

Consistent with the linear system example, increasing the prediction horizon of RLMPC effectively enhances the CR and leads to an improved (lower-ACC) trajectory in the learning episode. Conceivably, this comes at the cost of increased computation time at each step. However, we observe that even with the same prediction horizon of 30, the ACT of RLMPC is still slightly lower than that of traditional MPC. This is mainly due to the removal of the terminal constraint, which simplifies the optimization problem.

We now verify Remark 7. Taking the learned (near-)optimal value function as the terminal cost of RLMPC, we reduce the prediction horizon to 1 and re-execute the regulation task. As can be seen from Fig. 10, the trajectories and inputs under N = 1 basically coincide with those under N = 5, which confirms that shortening the prediction horizon does not affect the (near-)optimal performance. More importantly, the ACT is reduced from 0.0826 to 0.0222 s (an improvement of about 73%), which illustrates the effectiveness of RLMPC in reducing the computational burden.
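To illustrate why a horizon of N = 1 suffices once the value function has been learned, the sketch below is a schematic of the idea rather than the authors' implementation: it solves a horizon-N problem whose objective adds a learned terminal cost to the accumulated running cost. The names `rlmpc_action`, `model`, `phi`, and `w` are our placeholders, the linear-in-weights form w^T phi(x_N) is an assumption consistent with the weight-vector LVF shown in Fig. 8, input and state constraints are omitted for brevity, and SLSQP is used only as a convenient off-the-shelf solver.

```python
import numpy as np
from scipy.optimize import minimize

def rlmpc_action(x0, model, phi, w, Q, R, N=1, m=2):
    """One receding-horizon step of RLMPC with a learned terminal cost.

    Minimizes  sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + w^T phi(x_N)
    over the input sequence u_0, ..., u_{N-1}, subject to x_{k+1} = model(x_k, u_k).
    `model`, `phi`, and `w` are placeholders for the dynamics, the feature map
    of the learned value function, and its weight vector.
    """
    def objective(u_flat):
        u_seq = u_flat.reshape(N, m)
        x, cost = np.asarray(x0, dtype=float), 0.0
        for u in u_seq:
            cost += float(x @ Q @ x + u @ R @ u)
            x = np.asarray(model(x, u), dtype=float)
        # Learned terminal cost replaces the offline-designed one.
        return cost + float(w @ phi(x))

    res = minimize(objective, np.zeros(N * m), method="SLSQP")
    return res.x.reshape(N, m)[0]  # apply only the first input
```

With N = 1 the decision variable shrinks to a single input vector, which is consistent with the drop in ACT from 0.0826 to 0.0222 s reported above.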
VII. CONCLUSION

To remove the need to design the terminal cost, terminal controller, and terminal set in traditional MPC and to improve the closed-loop performance, this article developed a new RLMPC scheme for discrete-time systems by combining RL and MPC through PI. The core of this scheme is to take a value function, obtained through learning, as the terminal cost of traditional MPC to stabilize the system. We showed that the value function monotonically converges to the optimal one under PI. Based on this property, the closed-loop stability of RLMPC was further proved. We tested the proposed scheme on a linear system example to show its convergence, safety, optimality, and stability. Furthermore, we also implemented it on a nonholonomic vehicle system to show that RLMPC outperforms traditional MPC in the nonlinear case. Finally, we discussed the impact of the prediction horizon on the learning process and the control performance of RLMPC, highlighting the ability of RLMPC to flexibly adjust the prediction horizon to reduce the computational burden.
REFERENCES

[1] T. Wang, H. Gao, and J. Qiu, "A combined adaptive neural network and nonlinear model predictive control for multirate networked industrial process control," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 2, pp. 416–425, Feb. 2016.
[2] M. L. Darby and M. Nikolaou, "MPC: Current practice and challenges," Control Eng. Pract., vol. 20, no. 4, pp. 328–342, Apr. 2012.
[3] M. Neunert et al., "Whole-body nonlinear model predictive control through contacts for quadrupeds," IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 1458–1465, Jul. 2018.
[4] Z. Li, Y. Xia, C.-Y. Su, J. Deng, J. Fu, and W. He, "Missile guidance law based on robust model predictive control using neural-network optimization," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 8, pp. 1803–1809, Aug. 2015.
[5] J. Kabzan et al., "AMZ driverless: The full autonomous racing system," J. Field Robot., vol. 37, no. 7, pp. 1267–1294, 2020.
[6] Z. Li, W. Yuan, Y. Chen, F. Ke, X. Chu, and C. L. P. Chen, "Neural-dynamic optimization-based model predictive control for tracking and formation of nonholonomic multirobot systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 12, pp. 6113–6122, Dec. 2018.
[7] C. C. Chen and L. Shaw, "On receding horizon feedback control," Automatica, vol. 18, no. 3, pp. 349–352, May 1982.
[8] M. Alamir and G. Bornard, "On the stability of receding horizon control of nonlinear discrete-time systems," Syst. Control Lett., vol. 23, no. 4, pp. 291–296, Oct. 1994.
[9] J. B. Rawlings and D. Q. Mayne, Model Predictive Control: Theory and Design. Madison, WI, USA: Nob Hill Pub., 2009.
[10] H. Chen and F. Allgöwer, "A quasi-infinite horizon nonlinear model predictive control scheme with guaranteed stability," Automatica, vol. 34, no. 10, pp. 1205–1217, 1998.
[11] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. M. Scokaert, "Constrained model predictive control: Stability and optimality," Automatica, vol. 36, no. 6, pp. 789–814, Jun. 2000.
[12] M. Reble, Model Predictive Control for Nonlinear Continuous-Time Systems With and Without Time-Delays. Berlin, Germany: Logos Verlag Berlin GmbH, 2013.
[13] A. Boccia, L. Grüne, and K. Worthmann, "Stability and feasibility of state constrained MPC without stabilizing terminal constraints," Syst. Control Lett., vol. 72, pp. 14–21, Oct. 2014.
[14] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[15] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
[16] H. Li, Q. Zhang, and D. Zhao, "Deep reinforcement learning-based automatic exploration for navigation in unknown environment," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 6, pp. 2064–2076, Jun. 2020.
[17] R. Kamalapurkar, L. Andrews, P. Walters, and W. E. Dixon, "Model-based reinforcement learning for infinite-horizon approximate optimal tracking," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 753–758, Mar. 2017.
[18] Y. Li, "Reinforcement learning applications," 2019, arXiv:1908.06973.
[19] M. Wiering and M. V. Otterlo, Reinforcement Learning, vol. 12. Berlin, Germany: Springer, 2012.
[20] R. S. Sutton, "Learning to predict by the methods of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9–44, Aug. 1988.
[21] G. Tesauro, "Practical issues in temporal difference learning," Mach. Learn., vol. 8, nos. 3–4, pp. 257–277, May 1992.
[22] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Trans. Autom. Control, vol. 42, no. 5, pp. 674–690, May 1997.
[23] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437–1480, 2015.
[24] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, "Safe model-based reinforcement learning with stability guarantees," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 908–918.
[25] M. Gallieri et al., "Safe interactive model-based learning," 2019, arXiv:1911.06556.
[26] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, "Plan online, learn offline: Efficient learning and exploration via model-based control," 2018, arXiv:1811.01848.
[27] F. Farshidian, D. Hoeller, and M. Hutter, "Deep value model predictive control," 2019, arXiv:1910.03358.
[28] K. Napat, M. I. Valls, D. Hoeller, and M. Hutter, "Practical reinforcement learning for MPC: Learning from sparse objectives in under an hour on a real robot," in Proc. Annu. Conf. Learn. Dyn. Control, 2020, pp. 211–224.
[29] S. Gros and M. Zanon, "Data-driven economic NMPC using reinforcement learning," IEEE Trans. Autom. Control, vol. 65, no. 2, pp. 636–648, Feb. 2020.
[30] M. Zanon, S. Gros, and A. Bemporad, "Practical reinforcement learning of stabilizing economic MPC," in Proc. 18th Eur. Control Conf. (ECC), Jun. 2019, pp. 2258–2263.
[31] M. Zanon and S. Gros, "Safe reinforcement learning using robust MPC," IEEE Trans. Autom. Control, vol. 66, no. 8, pp. 3638–3652, Aug. 2021.
[32] S. V. Raković and W. S. Levine, Handbook of Model Predictive Control. Berlin, Germany: Springer, 2018.
[33] D. Limón, T. Alamo, F. Salas, and E. F. Camacho, "On the stability of constrained MPC without terminal constraint," IEEE Trans. Autom. Control, vol. 51, no. 5, pp. 832–836, May 2006.
[34] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Aug. 2009.
[35] D. E. Kirk, Optimal Control Theory: An Introduction. Chelmsford, MA, USA: Courier Corporation, 2004.
[36] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[37] L. Grüne, "NMPC without terminal constraints," IFAC Proc. Volumes, vol. 45, no. 17, pp. 1–13, 2012.
[38] L. Grüne, "Analysis and design of unconstrained nonlinear MPC schemes for finite and infinite dimensional systems," SIAM J. Control Optim., vol. 48, no. 2, pp. 1206–1228, Jan. 2009.
[39] K. Worthmann, M. W. Mehrez, M. Zanon, G. K. I. Mann, R. G. Gosine, and M. Diehl, "Model predictive control of nonholonomic mobile robots without stabilizing constraints and costs," IEEE Trans. Control Syst. Technol., vol. 24, no. 4, pp. 1394–1406, Jul. 2016.
[40] M. W. Mehrez, K. Worthmann, J. P. V. Cenerini, M. Osman, W. W. Melek, and S. Jeon, "Model predictive control without terminal constraints or costs for holonomic mobile robots," Robot. Auto. Syst., vol. 127, May 2020, Art. no. 103468.
[41] B. A. Finlayson, The Method of Weighted Residuals and Variational Principles. Philadelphia, PA, USA: SIAM, 2013.
[42] D. Liu and Q. Wei, "Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 621–634, Mar. 2014.
[43] J. Bibby, "Axiomatisations of the average and a further generalisation of monotonic sequences," Glasgow Math. J., vol. 15, no. 1, pp. 63–65, Mar. 1974.
[44] J. Clifton and E. Laber, "Q-learning: Theory and applications," Annu. Rev. Statist. Appl., vol. 7, pp. 279–301, Mar. 2020.
[45] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, "Gaussian processes for data-efficient learning in robotics and control," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 408–423, Feb. 2015.
[46] R. W. Brockett, "Asymptotic stability and feedback stabilization," Differ. Geometric Control Theory, vol. 27, no. 1, pp. 181–191, Dec. 1983.
[47] Y. Zhu and U. Ozguner, "Robustness analysis on constrained model predictive control for nonholonomic vehicle regulation," in Proc. Amer. Control Conf., 2009, pp. 3896–3901.
Fellow, where he worked on variable structure
control. From 2004 to 2006, he was with the University of Glamorgan,
Min Lin received the B.S. degree in automation Pontypridd, U.K., as a Research Fellow. From 2007 to 2008, he was a Guest
from the Beijing Institute of Technology, Beijing, Professor with Innsbruck Medical University, Innsbruck, Austria. Since 2004,
China, in 2018, where he is currently pursuing the he has been with the School of Automation, Beijing Institute of Technology,
Ph.D. degree with the School of Automation. Beijing, first as an Associate Professor, then, since 2008, as a Professor. His
His research interests cover model predictive con- current research interests are in the fields of networked control systems, robust
trol, machine learning, active disturbance rejection control and signal processing, and active disturbance rejection control.
control, and robotic systems.

Jinhui Zhang received the Ph.D. degree in control


science and engineering from the Beijing Institute
of Technology, Beijing, China, in 2011.
He was a Research Associate with the Depart-
ment of Mechanical Engineering, The Univer-
Zhongqi Sun (Member, IEEE) was born in Hebei, sity of Hong Kong, Hong Kong, in 2010, as a
China, in 1986. He received the B.E. degree in Senior Research Associate with the Department
computer and automation from Hebei Polytechnic of Manufacturing Engineering and Engineering
University, Hebei, in 2010, and the Ph.D. degree Management, City University of Hong Kong,
in control science and engineering from the Beijing Hong Kong, from 2010 to 2011, and a Visiting
Institute of Technology, Beijing, China, in 2018. Fellow with the School of Computing, Engineering
From 2018 to 2019, he was a Post-Doctoral Fellow and Mathematics, University of Western Sydney, Sydney, NSW, Australia,
with the Faculty of Science and Engineering, Uni- in 2013. He was an Associate Professor with the Beijing University of
versity of Groningen, Groningen, The Netherlands. Chemical Technology, Beijing, from 2011 to 2016, and a Professor with the
He is currently an Assistant Professor with the School of Electrical and Automation Engineering, Tianjin University, Tianjin,
School of Automation, Beijing Institute of Tech- China, in 2016. He joined the Beijing Institute of Technology in 2016, where
nology. His research interests include multiagent systems, model predictive he is currently a Professor. His research interests include networked control
control, machine learning, and robotic systems. systems, composite disturbance rejection control, and biomedical engineering.