Adaptive optimal output regulation of unknown linear continuous-time systems by dynamic output feedback and value iteration

Keywords: Adaptive dynamic programming; Dynamic output-feedback control; Optimal output regulation; Value iteration

Abstract: This paper proposes a value iteration based learning algorithm to solve the optimal output regulation problem for linear continuous-time systems, which aims at achieving disturbance rejection and asymptotic tracking simultaneously. Notably, the state is unmeasurable and the system dynamics are completely unknown, which greatly increases the challenge of solving the output regulation problem. First, we present a dynamic output feedback controller design method that combines the internal model with a virtual exo-system and an augmented internal state constructed without using any knowledge of the system. Then, by establishing a novel iterative learning equation that requires no repeated finite window integral operations, an adaptive dynamic programming based learning algorithm employing a value iteration scheme is proposed to estimate the optimal feedback control gain, which may lead to a reduction of the computational load. The solvability and convergence analysis shows that the estimated control gain converges to the optimal control gain. Finally, a physical experiment on the control of an unmanned quadrotor illustrates the effectiveness of the proposed algorithm.
1. Introduction

Recent years have witnessed the tremendous development of solving control problems by adaptive dynamic programming (ADP) (Jiang and Jiang, 2017) and reinforcement learning (RL) (Lewis et al., 2012); see the latest survey on ADP for control (Liu et al., 2021) and the references therein. As one of the major topics included in Liu et al. (2021), investigating ADP-based learning algorithms for solving optimal regulation problems is of wide interest because the requirement of prior knowledge of the system dynamics can be relaxed, which is of practical significance (Gonzalez-Garcia et al., 2021; Jiang and Jiang, 2012; Jiang et al., 2020; Wei et al., 2017; Yao et al., 2022).

In the last decade, several RL-based learning algorithms, e.g., Jiang and Jiang (2012), Chen et al. (2019), Zhang et al. (2022), have been proposed for solving optimal regulation problems. A novel ADP-based learning algorithm using the integral reinforcement learning (IRL) method and a policy iteration (PI) scheme was proposed in Jiang and Jiang (2012) to approximate the online adaptive optimal controller without any prior knowledge of the system dynamics, and it was extended in Gao et al. (2018), Gao et al. (2022), Jiang et al. (2020). In Modares and Lewis (2014), the linear quadratic tracking (LQT) control problem for partially unknown systems was first solved by using the IRL technique, which was developed to solve the LQT problem for completely unknown dynamics in Chen et al. (2019). To further deal with system uncertainty and disturbance, an ADP based learning algorithm with a PI scheme was proposed in Gao and Jiang (2016a) under the linear optimal output regulation problem (LOORP) framework for achieving asymptotic tracking and disturbance rejection simultaneously. The state feedback control gain estimated in Gao and Jiang (2016a) requires separately solving the corresponding algebraic Riccati equation (ARE) and the regulator equations. This ADP-based method was extended to solve the LOORP for discrete time-delay systems (Gao and Jiang, 2019) and multi-agent systems (Gao et al., 2018; 2022).

Among the aforementioned RL/ADP based learning algorithms for solving the state/tracking regulation problems, most of the PI based learning algorithms, for instance, Jiang et al. (2020), Chen et al. (2019), require an initial admissible control policy, which is hard to obtain in practice, especially for unknown systems. To obviate this limitation,
an ADP-based learning algorithm with a value iteration (VI) scheme was proposed in Bian and Jiang (2016) for solving the state regulation problem without requiring an initial admissible controller, and it was extended in Jiang et al. (2022), where a convergence rate requirement is considered. A mixed iteration scheme between PI and VI was presented in Xu and Wu (2023). In addition, most of these ADP-based learning algorithms (Gao and Jiang, 2016a; 2019; Gao et al., 2018; 2022; Jiang et al., 2020) were proposed with state feedback controllers, which require the state to be available. Only a few works took unmeasurable state into consideration, and only for discrete-time systems, e.g., Sun et al. (2019), Chen et al. (2020), Gao and Jiang (2019), where the state can be explicitly reconstructed by using a sequence of input/output data under the observability assumption. In Modares et al. (2016), a model-free off-policy RL-based output feedback (OPFB) control method was proposed by reconstructing the state through a sequence of output data. However, the estimated controller in Modares et al. (2016), which consists of the reconstructed state, suffers from the bias generated by the exploration noises. To deal with this issue, a dynamic OPFB controller constructed from an internal state was proposed in Rizvi and Lin (2020) for solving the state regulation problem. In Gao and Jiang (2016b, 2019), which addressed the LOORP for discrete systems with unavailable state, the direct linear relationship between the reconstructed state and the input/output data is the key property used to establish the corresponding learning algorithms. Compared to discrete-time systems, the relationship between the reconstructed state and the input/output data of continuous-time systems is nonlinear, which makes the ADP-based learning algorithm design more challenging. In addition, although the requirement of prior knowledge of the system dynamics can be relaxed by using RL/ADP based methods to solve stabilizing and tracking problems, the solution of the regulator equations, one essential point of solving the LOORP, still cannot be directly obtained when the system matrices are unknown. Therefore, for solving the LOORP of continuous-time systems, the technical challenge arises from the practical case where unknown dynamics and unmeasurable state exist simultaneously.

In this paper, we address the LOORP for linear continuous-time systems with unknown system dynamics and unmeasurable state. The contributions of this paper are highlighted as follows. First, compared with Zhang et al. (2020), Modares et al. (2016), Rizvi and Lin (2020), we further take disturbance rejection into account, and a dynamic OPFB controller is designed to solve the LOORP. Motivated by Rizvi and Lin (2020), a new augmented internal state is constructed whose derivative is calculable. The design analysis shows that there exists a linear relationship between the state and the augmented internal state, which ensures that the proposed iterative learning equation (ILE) can be established in a linear-in-parameter form. Second, different from Gao and Jiang (2016a), Jiang et al. (2020), and Fan et al. (2020), where the regulator equations need to be solved, an internal model based controller that requires no solution to the regulator equations was used in Gao et al. (2018). However, both the state and the exo-state were assumed to be available in Gao et al. (2018). In this paper, the unmeasurable state and exo-state are taken into consideration, and the proposed dynamic output feedback controller is designed by setting up a virtual exo-system and constructing the augmented internal state. Third, the proposed ILE has no requirement on the usage of repeated finite window integrals (FWIs), whereas in the existing off-policy learning algorithms with the IRL method for continuous-time systems, such as Gao et al. (2022), Rizvi and Lin (2020), Gao and Jiang (2016a), Gao et al. (2018), a large number of FWIs are required for constructing the ILE due to the absence of the derivative of the state. Compared to these works, the proposed ILE is established by employing direct reinforcement learning (DRL), i.e., combining the ARE with the data from the augmented internal system directly. This renders no repeated FWIs in the proposed ILE, which could lead to a reduction of the computational load. Last but not least, although exploration noise is still needed to ensure that the rank condition is satisfied, as in the IRL based algorithms (Fan et al., 2020; Gao and Jiang, 2016a; Modares et al., 2016; Zhang et al., 2020), immunity of the proposed learning algorithm with DRL to the exploration bias is verified in this paper.

The rest of this paper is organized as follows. Section 2 presents the preliminaries and problem formulation of the LOORP. In Section 3, the ADP-based learning algorithm with the VI scheme is developed with the proposed dynamic OPFB controller to solve the LOORP; the solvability analysis is also presented in this section. Finally, the experiment example is given in Section 4, and Section 5 concludes the paper.

Notations: Throughout this paper, for a square matrix A, σ(A) is the complex spectrum of A. A > 0 and A ≥ 0 represent that A is positive definite and positive semi-definite, respectively. Z₊ is the set of nonnegative integers. For any given n ∈ Z₊, I_n ∈ R^{n×n} is the identity matrix. For two column vectors x ∈ R^n and v ∈ R^m, define col(x, v) = [xᵀ, vᵀ]ᵀ ∈ R^{n+m}. vec(B) = [b₁ᵀ, b₂ᵀ, …, b_nᵀ]ᵀ, where b_i ∈ R^m are the columns of B ∈ R^{m×n}. vech(C) = [c₁₁, c₁₂ + c₂₁, …, c₁ₙ + cₙ₁, c₂₂, c₂₃ + c₃₂, …, c_{n−1,n} + c_{n,n−1}, c_{nn}]ᵀ ∈ R^{n(n+1)/2}, with c_{ij} being the elements of the matrix C ∈ R^{n×n}. diag(x) denotes a diagonal matrix D with d_{ii} = x_i. For any matrix F, F† = (FᵀF)⁻¹Fᵀ is the pseudo-inverse of F. For any matrices G_i, G = block diag[G₁, …, G_n] is an augmented block diagonal matrix with block matrices G_{ii} = G_i. Pⁿ denotes the normed space of all n-by-n real symmetric matrices and Pⁿ₊ := {P ∈ Pⁿ : P ≥ 0}. The symbol ⊗ denotes the Kronecker product of two matrices.

2. Preliminaries and problem formulation

Consider the following linear continuous-time system

ẋ = Ax + Bu + Ed,  y = Cx,  e = y − Fr    (1)

where x ∈ R^{n_x}, u ∈ R^m, d ∈ R^{n_d}, y ∈ R^p, e ∈ R^p, and r ∈ R^{n_r} represent the state of the plant, the control input, the disturbance, the output, the output error, and the reference signal, respectively, and A, B, C, E, F are the system matrices with appropriate dimensions. Following the theory of output regulation (Huang, 2004), the disturbance and reference can be viewed as signals generated by the following autonomous systems

ḋ = S_d d,  ṙ = S_r r    (2)

with S_d ∈ R^{n_d×n_d} and S_r ∈ R^{n_r×n_r}, which can be written as an exogenous system (exo-system) v̇ = Sv with v = col(d, r) and S = block diag[S_d, S_r]. Without loss of generality, we present the following assumptions, which are quite standard in existing works on solving the linear output regulation problem (LORP), for example, Saberi et al. (2003), Huang (2004), Gao and Jiang (2016a), Gao et al. (2018), Gao and Jiang (2019).

Assumption 1: rank[A − λI_{n_x}, B; C, 0] = n_x + p, ∀λ ∈ σ(S).

Assumption 2: The pair (A, B) is stabilizable and the pair (A, C) is observable.

Assumption 3: The minimal polynomial of S_φ with φ = d, r, i.e.,

α_φ(s) = ∏_{i=1}^{M} (s − λ_i)^{a_i} ∏_{j=1}^{N} (s² − 2μ_j s + μ_j² + ω_j²)^{b_j}    (3)

is available, with degree q_φ ≤ n_φ, a_i and b_j being positive integers, and λ_i, μ_j, ω_j ∈ R for i = 1, …, M and j = 1, …, N.

The LORP aims at finding a control law u that achieves disturbance rejection and asymptotic tracking simultaneously, i.e., lim_{t→∞} e(t) = 0. Under Assumptions 1 and 2, the following regulator equations (Huang, 2004)

XS = AX + BU + Ē,  0 = CX + F̄    (4)
have a unique solution (X*, U*), where Ē = [E 0] and F̄ = [0 −F]. With the solution to (4), u_ss = U*v is the steady-state input with which the closed-loop system of (1) achieves output regulation.
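When the system matrices are known, (4) is simply a linear matrix equation in (X, U) and can be solved by vectorization with Kronecker products. The sketch below is such a model-based baseline, useful only for checking a learned solution in simulation; the function name and the stacking convention are illustrative and are not part of the proposed data-driven algorithm, which never forms this system because A, B, C, E, F are unknown.

```python
import numpy as np

def solve_regulator_equations(A, B, C, S, E_bar, F_bar):
    """Solve XS = AX + BU + E_bar, 0 = CX + F_bar for (X, U) when the model is known."""
    nx, m = B.shape
    p = C.shape[0]
    nv = S.shape[0]
    Inx, Inv = np.eye(nx), np.eye(nv)
    # vec(XS) = (S^T kron I) vec(X); vec(AX) = (I kron A) vec(X); vec(BU) = (I kron B) vec(U)
    top = np.hstack([np.kron(S.T, Inx) - np.kron(Inv, A), -np.kron(Inv, B)])
    bot = np.hstack([np.kron(Inv, C), np.zeros((p * nv, m * nv))])
    lhs = np.vstack([top, bot])
    rhs = np.concatenate([E_bar.flatten(order="F"), -F_bar.flatten(order="F")])
    sol, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)   # unique under Assumption 1
    X = sol[: nx * nv].reshape((nx, nv), order="F")
    U = sol[nx * nv:].reshape((m, nv), order="F")
    return X, U
```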
Then, the linear optimal output regulation problem (LOORP) can be described as the following optimization problem.

Problem 1:

min_u ∫₀^∞ ( eᵀQe + (u − u_ss)ᵀR(u − u_ss) ) dt
subject to (1)

where Q = Qᵀ ∈ R^{p×p} > 0 and R = Rᵀ ∈ R^{m×m} > 0 are the weight matrices.

In this paper, we consider the case where the system matrices A, B, C, E, F are constant but unknown, and both the system state x and the exo-system state v are unmeasurable. Note that two challenges arise from the unknown system dynamics and unmeasurable states when solving the LOORP. On the one hand, it is difficult to solve the regulator equations (4) when the system matrices are completely unknown. On the other hand, since the states x and v are unmeasurable, it is necessary to establish an OPFB controller to solve the LOORP. However, the design of the OPFB controller becomes more complicated in the absence of the system matrices. To handle these two issues, a VI-based learning algorithm for directly estimating an optimal control policy, without any prior knowledge of the system dynamics or the explicit solution of the regulator equations, is developed with the dynamic OPFB controller proposed in Section 3.

Remark 1. Assumptions 1–3 are the standard conditions for solving the LORP with a dynamic measurement feedback controller, as shown in Huang (2004), Gao and Jiang (2016b), Gao et al. (2022). Assumption 3 is typically used in solving the LOORP (Gao et al., 2018; 2022). In fact, the exo-system (2) under Assumption 3 can generate a large class of signals, for example, combinations of sinusoidal signals of arbitrary amplitudes and initial phases, and ramp signals of arbitrary slopes. Thus, the disturbance and reference signals with known frequencies arising in many practical scenarios can be described. One instance is an aircraft in the presence of wind disturbances whose frequencies can be measured, while the amplitude of the force acting on the aircraft and the corresponding aerodynamic coefficients of the aircraft are unknown.

3. DRL-based learning algorithm with dynamic OPFB control

In this section, we first propose a dynamic OPFB control design method for solving the LORP. Then, a DRL-based algorithm with a VI scheme is developed to estimate the optimal control gain for the dynamic OPFB controller. Finally, the stability and convergence analyses are given.

3.1. Design of dynamic OPFB controller

To begin with, under Assumption 3, design a virtual exo-system in the form of

v̂̇ = col(d̂̇, r̂̇) = block diag[S_d̂, S_r̂] col(d̂, r̂) := S_v̂ v̂    (5)

where d̂ ∈ R^{q_d}, r̂ ∈ R^{q_r}, v̂ = col(d̂, r̂) ∈ R^q with q = q_d + q_r, and S_d̂ and S_r̂ are known matrices that can be determined by the minimal polynomials of S_d and S_r, respectively. As shown in Huang (2004) and Gao and Jiang (2016a), not only does there always exist a constant matrix S_v̂ ∈ R^{q×q} such that the state of the exo-system v can be obtained by

v(t) = col(d, r) = block diag[A_d, A_r] col(d̂, r̂) := A_s v̂(t)    (6)

where v̂(t) evolves according to (5) with a given v̂(0), but also we can always find a pair (G₁, G₂) for the internal model of the matrix S, which is given as

ż_v = G₁ z_v + G₂ e    (7)

where z_v ∈ R^{pq} is the state of the incorporated internal model (7) of both S_v̂ and S, with the pair (G₁, G₂) controllable (refer to Huang, 2004, Remark 1.23 for choosing G₁ and G₂). It follows from (5) and (6) that system (1) can be rewritten as

ẋ = Ax + Bu + Êv̂,  y = Cx,  e = Cx + F̂v̂    (8)

where Ê = [EA_d, 0] and F̂ = [0, −FA_r].

Then, consider the state observer for system (8) in the form of

x̂̇(t) = Ax̂(t) + Bu(t) + EA_d d̂(t) + L(y(t) − Cx̂(t)) = A_L x̂(t) + Bu(t) + EA_d d̂(t) + Ly(t)    (9)

where x̂ ∈ R^{n_x} is the estimate of the state x, L ∈ R^{n_x×p} is the observer gain, and A_L = A − LC is the observer matrix.

Considering the input u, the virtual disturbance d̂, and the output y as the inputs to the observer system (9), the estimated state x̂ can be written as

x̂(t) = (sI − A_L)⁻¹(B[u] + EA_d[d̂] + L[y]) = Σ_{i=1}^{m} (U_i(s)/Λ(s))[u_i] + Σ_{j=1}^{q_d} (D_j(s)/Λ(s))[d̂_j] + Σ_{k=1}^{p} (Y_k(s)/Λ(s))[y_k]    (10)

where U_i(s), D_j(s), and Y_k(s), which depend on (A, B, C, EA_d, L), are n-dimensional polynomial vectors in the differential operator s, i.e., s[u] = (d/dt)u, and u_i, d̂_j, and y_k represent the ith, jth, and kth elements of u, d̂, and y, respectively, x̂(0) = 0, and Λ(s) = det(sI − A_L).

For each i = 1, …, m, it follows from Franklin et al. (2010, Chapter 7) and Rizvi and Lin (2020) that the first term of (10) can be expressed in the following canonical form

(U_i(s)/Λ(s))[u_i] = col( (a^{i1}_{n_x−1}s^{n_x−1} + a^{i1}_{n_x−2}s^{n_x−2} + ⋯ + a^{i1}_0)/Λ(s), …, (a^{in_x}_{n_x−1}s^{n_x−1} + a^{in_x}_{n_x−2}s^{n_x−2} + ⋯ + a^{in_x}_0)/Λ(s) )[u_i]
= [a^{i1}_0 ⋯ a^{i1}_{n_x−1}; ⋮ ⋱ ⋮; a^{in_x}_0 ⋯ a^{in_x}_{n_x−1}] col( (1/Λ(s))[u_i], …, (s^{n_x−1}/Λ(s))[u_i] ) ≜ M_u^i z_u^i(t)

where the unknown constant matrix M_u^i ∈ R^{n_x×n_x} contains the coefficients of U_i(s). For each j = 1, …, q_d and k = 1, …, p, along the same lines, the second and third terms on the right-hand side of (10) can be rewritten as

(D_j(s)/Λ(s))[d̂_j] ≜ M_d^j z_d^j(t),  (Y_k(s)/Λ(s))[y_k] ≜ M_y^k z_y^k(t)

where M_d^j ∈ R^{n_x×n_x} and M_y^k ∈ R^{n_x×n_x} contain the coefficients of D_j(s) and Y_k(s), respectively. Under Assumption 2, Λ(s) is a user-defined known polynomial.
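To make the constructions in (5)–(7) concrete, the following sketch builds the companion-form realization of a known minimal polynomial, which can serve as S_d̂ or S_r̂ in (5), and forms one controllable candidate for the internal-model pair (G₁, G₂) in (7) by repeating that companion block for each of the p error channels. This is a minimal illustration in the spirit of the standard construction referenced above (Huang, 2004, Remark 1.23); the helper names and the example frequency are placeholders.

```python
import numpy as np

def companion_from_minimal_polynomial(alpha):
    """Companion matrix of s^q + alpha[q-1] s^{q-1} + ... + alpha[0].

    Realizes S_dhat (or S_rhat) from the minimal polynomial known by
    Assumption 3; the true S_d, S_r themselves remain unknown.
    """
    q = len(alpha)
    M = np.zeros((q, q))
    M[:-1, 1:] = np.eye(q - 1)
    M[-1, :] = -np.asarray(alpha, dtype=float)
    return M

def internal_model_pair(alpha, p):
    """One controllable (G1, G2) candidate: p copies of the companion block,
    each driven by one component of the tracking error e, so z_v lives in R^{pq}."""
    q = len(alpha)
    G1_block = companion_from_minimal_polynomial(alpha)
    G2_block = np.zeros((q, 1))
    G2_block[-1, 0] = 1.0
    G1 = np.kron(np.eye(p), G1_block)   # (pq) x (pq)
    G2 = np.kron(np.eye(p), G2_block)   # (pq) x p
    return G1, G2

# Example: minimal polynomial s^2 + w^2, i.e., a sinusoid of known frequency w.
w = 2.0
G1, G2 = internal_model_pair([w ** 2, 0.0], p=2)
```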
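Similarly, the filtered signals z_u^i, z_d^j, and z_y^k appearing in (10) are obtained by passing each scalar channel of u, d̂, and y through the chain of user-defined stable filters 1/Λ(s), s/Λ(s), …, s^{n_x−1}/Λ(s). A minimal sketch of one such filter bank, realized in controllable canonical form, is given below; the choice of Λ(s), the forward-Euler integration, and the names are illustrative only.

```python
import numpy as np

class ChannelFilter:
    """State z = [ (1/Lambda)[w], (s/Lambda)[w], ..., (s^{n-1}/Lambda)[w] ] for one
    scalar signal w, where Lambda(s) = s^n + l_{n-1}s^{n-1} + ... + l_0 is chosen by
    the user to be Hurwitz. Because Lambda is known, both z and zdot are computable
    from measured data, which is what the learning equation later exploits.
    """

    def __init__(self, lam_coeffs):
        n = len(lam_coeffs)
        self.A = np.zeros((n, n))
        self.A[:-1, 1:] = np.eye(n - 1)
        self.A[-1, :] = -np.asarray(lam_coeffs, dtype=float)
        self.b = np.zeros(n)
        self.b[-1] = 1.0
        self.z = np.zeros(n)

    def step(self, w, dt):
        zdot = self.A @ self.z + self.b * w   # derivative of the filter state
        self.z = self.z + dt * zdot           # simple forward-Euler update
        return self.z.copy(), zdot

# Example: Lambda(s) = (s + 5)^2 = s^2 + 10s + 25 for a second-order channel.
filt = ChannelFilter([25.0, 10.0])
z, zdot = filt.step(w=1.0, dt=0.001)
```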
u = −K_x x − K_v z_v,  ż_v = G₁ z_v + G₂ e    (31)

where K_x and K_v are control gains, and z_v is defined as in (7).

Lemma 1. Under Assumption 2, for any given observer gain L satisfying σ(A − LC) ⊂ C⁻, there exists a dynamic OPFB controller (20) which converges to a dynamic state feedback control policy (31).

Proof. It follows from (8) and (9) that the error dynamics of the observer is

ε(t) = x(t) − x̂(t) = e^{(A−LC)t} x̃(0)    (32)

with x̃(0) = x(0) − x̂(0). Then, substituting (16) into (32), the state x can be described as

where K = [K_x, K_v] represents the control gain. Define ū = u − (−K_x X* − K_v Z*)v̂ and combine (27) with the solution of the equations (26). The closed-loop system (35) is then transformed into the following augmented system

x̄̇_z = A_z x̄_z + B_z ū,  e = C_z x̄_z    (37)

with x̄_z = col(x̄, z̄_v). Then, define the cost function

V_z = ∫₀^∞ ( eᵀQe + z̄_vᵀQ_z′z̄_v + ūᵀRū ) dt = ∫₀^∞ ( x̄_zᵀQ_z x̄_z + ūᵀRū ) dt

where Q_z = block diag[CᵀQC, Q_z′] with Q = Qᵀ ∈ R^{p×p} ≥ 0, Q_z′ = Q_z′ᵀ ∈ R^{pq×pq} ≥ 0, (A_z, √Q_z) observable, and R = Rᵀ ∈ R^{m×m} > 0. The aim is to minimize the cost function V_z with respect to the policy ū* = −K*x̄_z, where K* = [K*_x, K*_v] is the optimal control gain given by

K* = R⁻¹B_zᵀP*    (38)

with P* the solution of the corresponding algebraic Riccati equation (ARE).

In this subsection, the DRL-based learning algorithm with the VI scheme is proposed to approximate K*.
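If A_z and B_z were known, P* and K* in (38) could be computed directly from the continuous-time ARE. The sketch below does exactly that with SciPy and serves only as a model-based reference against which a learned gain can be compared in simulation; the matrix values are placeholders, not the quadrotor model used later.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def optimal_gain(Az, Bz, Qz, R):
    """Model-based reference: P* from the ARE and K* = R^{-1} Bz^T P* as in (38)."""
    P_star = solve_continuous_are(Az, Bz, Qz, R)
    K_star = np.linalg.solve(R, Bz.T @ P_star)
    return P_star, K_star

# Placeholder augmented-system matrices, for illustration only.
Az = np.array([[0.0, 1.0], [-2.0, -3.0]])
Bz = np.array([[0.0], [1.0]])
Qz = np.eye(2)
R = np.array([[1.0]])
P_star, K_star = optimal_gain(Az, Bz, Qz, R)
```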
To begin with, an ADP-based algorithm with the VI scheme proposed in Bian and Jiang (2016) is recalled in Algorithm 1, where bounded sets {B_i}_{i=0}^∞ with nonempty interiors and a deterministic sequence {ϵ_j}_{j=0}^∞ satisfy B_i ⊂ B_{i+1}, i ∈ Z₊, lim_{i→∞} B_i = Pⁿ₊, and ϵ_j > 0, j ∈ Z₊, Σ_{j=0}^∞ ϵ_j = ∞, Σ_{j=0}^∞ ϵ_j² < ∞.

Algorithm 1. Model-based VI scheme (Bian and Jiang, 2016).

Define L_j = (d/dt)(x_zᵀP_j x_z) + x_zᵀQ_z x_z for each iteration of P_j. It follows from system (35) that

L_j = ẋ_zᵀP_j x_z + x_zᵀP_j ẋ_z + x_zᵀQ_z x_z
    = (A_z x_z + B_z u + E_z v̂)ᵀP_j x_z + x_zᵀP_j(A_z x_z + B_z u + E_z v̂) + x_zᵀQ_z x_z    (43)
    = x_zᵀH_j x_z + 2uᵀRK_j x_z + 2v̂ᵀE_zᵀP_j x_z

where H_j = A_zᵀP_j + P_jA_z + Q_z and K_j = R⁻¹B_zᵀP_j.

By using the Kronecker product representation and the operators predefined in the Notations, we have

ẋ_zᵀP_j x_z = (x_z ⊗ ẋ_z)ᵀ vec(P_j)    (44)
x_zᵀH_j x_z = vech(2x_z x_zᵀ − diag(x_z)diag(x_z))ᵀ vech(H_j)    (45)
uᵀRK_j x_z = [(x_z ⊗ u)(I_n ⊗ R)]ᵀ vec(K_j)    (46)
v̂ᵀE_zᵀP_j x_z = (x_z ⊗ v̂)ᵀ vec(E_zᵀP_j)    (47)

Define x̂_z = col(x̂, z_v). It follows from (7), (16), and (19) that

x̂_z = M_z z,  x̂̇_z = M_z ż    (51)

where M_z = block diag[M, I_{pq}]. Since lim_{t→∞}(x̂(t) − x(t)) = 0 holds by Lemma 1, we have lim_{t→∞}(x̂_z(t) − x_z(t)) = 0 and lim_{t→∞}(x̂̇_z(t) − ẋ_z(t)) = 0. Note that the dynamics of z given in (19) can be user-defined, which implies that both the internal state z and its derivative ż are available.

It follows from (43) and (51) that

2żᵀP̄_j z + zᵀM_zᵀQ_zM_z z = zᵀH̄_j z + 2uᵀRK̄_j z + 2v̂ᵀĒ_z z    (52)

where P̄_j = M_zᵀP_jM_z, H̄_j = M_zᵀH_jM_z, K̄_j = K_jM_z, and Ē_z = E_zᵀP_jM_z. With the same representation form as in (44)–(47), and using the operators Γ_{a,b} and Δ_{c,d} defined in (48) and (49), respectively, the ILE with the dynamic OPFB controller, transformed from the key equation (52), is given in the following linear-in-parameter form

Ψ col( vech(H̄_j), vec(K̄_j), vec(Ē_z) ) = Φ_j(P̄_j)    (53)

where Ψ = [Δ_{z,z}, 2Γ_{z,u}(I_{n_z} ⊗ R), 2Γ_{z,v̂}] and Φ_j = Δ_{e,e} vech(Q) + Δ_{z_v,z_v} vech(Q_z′) + 2Γ_{z,ż} vec(P̄_j), with n_z = n_x(m + q_d + p) + pq.

Finally, the DRL-based learning algorithm with the VI scheme for estimating the optimal control gain K* of the optimal controller (41) is described in Algorithm 2. The control system diagram with the proposed Algorithm 2 is shown in Fig. 1. Note that Ψ is not a square matrix in most cases; thus, (53) is solved by using the pseudo-inverse method as given in (54).

Algorithm 2. DRL-based learning algorithm for solving the LOORP with the dynamic OPFB control.
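For intuition, the model-based recursion that Algorithm 1 builds on can be sketched as the damped Riccati update P_{j+1} = P_j + ϵ_j(A_zᵀP_j + P_jA_z + Q_z − P_jB_zR⁻¹B_zᵀP_j). The code below follows that recursion in spirit; the projection onto the expanding sets {B_i}, the stopping rule of the exact scheme (42), and the step-size choice are simplified placeholders.

```python
import numpy as np

def value_iteration_riccati(Az, Bz, Qz, R, P0=None, n_iter=5000):
    """VI sketch in the spirit of Bian and Jiang (2016); the projection step of
    Algorithm 1 is omitted, and eps_j = 1/j satisfies sum eps_j = inf, sum eps_j^2 < inf."""
    n = Az.shape[0]
    P = np.zeros((n, n)) if P0 is None else P0.copy()
    R_inv = np.linalg.inv(R)
    for j in range(1, n_iter + 1):
        eps = 1.0 / j
        riccati_residual = Az.T @ P + P @ Az + Qz - P @ Bz @ R_inv @ Bz.T @ P
        P = P + eps * riccati_residual
    K = R_inv @ Bz.T @ P          # K_j = R^{-1} Bz^T P_j
    return P, K
```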
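In the data-driven setting, each iteration instead solves the linear-in-parameter equation (53) in a least-squares sense via the pseudo-inverse, as noted before (54). The sketch below shows the shape of that step: a vech operator matching the definition in the Notations and a single regression recovering the stacked unknowns from a data matrix Psi and a target vector Phi_j. How Psi and Phi_j are assembled from the operators Δ and Γ of (48)–(49) is not reproduced here, and the array names are placeholders.

```python
import numpy as np

def vech(C):
    """vech as defined in the Notations: each diagonal entry c_ii followed by the
    sums c_ij + c_ji of the symmetric off-diagonal pairs of that row."""
    n = C.shape[0]
    out = []
    for i in range(n):
        out.append(C[i, i])
        for j in range(i + 1, n):
            out.append(C[i, j] + C[j, i])
    return np.asarray(out)

def ile_step(Psi, Phi_j):
    """One least-squares solve of (53): theta stacks vech(H_bar_j), vec(K_bar_j),
    and vec(E_bar_z); equivalent to theta = pinv(Psi) @ Phi_j when Psi has full rank."""
    theta, *_ = np.linalg.lstsq(Psi, Phi_j, rcond=None)
    return theta
```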
In contrast to Gao and Jiang (2016a) and Jiang et al. (2020), where the solution to the regulator equations needs to be calculated explicitly, the explicit solution to the regulator equations is not required in the proposed Algorithm 2, owing to the internal model used in the proposed dynamic OPFB controller (41).

3.4. Convergence analysis

The result on the convergence of the proposed algorithm is summarized in the following theorem.

Theorem 2. If there exists a sampling instant t_l such that the following rank condition is satisfied,

rank([Δ_{z,z}, Γ_{z,u}, Γ_{z,v̂}]) = κ    (56)

where κ = n_z(n_z + q + m + 1)/2, then the estimated control scheme generated by employing Algorithm 2 converges to the optimal dynamic OPFB controller (41) as j → ∞.

Proof. Note that Ψ is independent of P̄_j and relies only on a series of past data. Thus, if the rank condition (56) is satisfied, then Ψ is of full row rank, which implies that (54) has a unique solution at every iteration. As proven in Bian and Jiang (2016, Theorem 3.3), for any given arbitrary P₀, the P̃_{j+1} calculated from (42) will converge to P* as j → ∞. For Algorithm 2, for any given P̄₀, substitute P̄_j into (54). Since Ψ is of full rank, (54) has a unique solution H̄_j and K̄_j at every iteration j. Thus, P̄_{j+1} can also be recalculated uniquely from (55). It should be pointed out that (55) is transformed from (42) by using the equivalent relationship between the augmented internal state z and the estimated state x̂_z. Therefore, P̄_{j+1} computed from (54) and (55) converges to P̄* as j → ∞. This implies that K̄_j converges to K̄* as j → ∞. □

Remark 6. The rank condition can be easily satisfied by simply adding exploration noise in the data collection process, as stated in Jiang and Jiang (2012), and this has been utilized in most existing ADP based learning algorithms; see, for example, Gao and Jiang (2016a), Jiang et al. (2022), Rizvi and Lin (2020). Typically, the exploration noise is chosen as a combination of several sinusoids with different frequencies and Gaussian noise. Notably, not only the choice of the exploration noise but also the choice of the data collection instants will affect whether the rank condition is satisfied in the implementation of Algorithm 2.

In fact, the proposed Algorithm 2 has the following exploration bias immunity property.

Theorem 3. Algorithm 2 is immune to the exploration bias problem.

Proof. To begin with, define the control policy in the data collection step as

û = u + ξ.

It follows from K̂_j = R⁻¹B_zᵀP̂_j, as given in (43), that

ξᵀB_zᵀP̂_j x_z = ξᵀRK̂_j x_z.

Then, (57) can be further rewritten as

2(A_z x_z + B_z u + E_z v̂)ᵀP̂_j x_z + x_zᵀQ_z x_z = x_zᵀĤ_j x_z + 2uᵀRK̂_j x_z + 2v̂ᵀE_zᵀP̂_j x_z

which is equivalent to the same learning equation with control u, given as

2ẋ_zᵀP̂_j x_z + x_zᵀQ_z x_z = x_zᵀĤ_j x_z + 2uᵀRK̂_j x_z + 2v̂ᵀE_zᵀP̂_j x_z    (58)

where ẋ_z is the state derivative of the closed-loop system (35) with control policy u. It follows from (43) and (58) that Ĥ_j = H_j and K̂_j = K_j for any given P̂₀ = P₀, which indicates that immunity to the exploration bias is achieved for the iterative learning method with the dynamic state feedback controller. Then, for the ADP learning equation (52) with u, we have

2żᵀP̄_j z + zᵀM_zᵀQ_zM_z z = zᵀH̄_j z + 2uᵀRK̄_j z + 2v̂ᵀĒ_z z.

Along the same lines as the above proof for the dynamic state feedback controller, the immunity of Algorithm 2 to exploration bias can be verified. □

Remark 7. Similar to the exploration noise immunity of the existing learning methods in Jiang and Jiang (2012), Rizvi and Lin (2020), Gao and Jiang (2016a), this paper shows that the DRL-based iterative learning equation with control u is free from exploration noise bias. Most ADP-based learning algorithms are established by using the fact that the solution of the ARE (38) depends only on the system matrices A_z and B_z. That is, as long as the input/output or input/state data are able to reflect full information about the system dynamic matrices, the optimal control gain can be obtained regardless of the existence of exploration noise in the data collection step.
K. Xie et al. Control Engineering Practice 141 (2023) 105675
… virtual exo-system to replace the exo-system (59). As shown in (6), for t ≥ t_d, under Assumption 3, there exist the transformations d(t) = A_d d̂(t) …

… satisfied, the exploration noise ξ is chosen as a combination of five …
Fig. 3. Six snapshots of the UAV flight trajectories. (a) t = 20.1 s: the data collection start instant. (b) t = 90.5 s: the reference at the current instant is r(t) = (−0.6, 0) m. (c) t = 94.4 s: r(t) = (0, 0.6) m. (d) t = 98.3 s: r(t) = (0.6, 0) m. (e) t = 102.2 s: r(t) = (0, −0.6) m. (f) t = 106.2 s: r(t) = (−0.6, 0) m.

Fig. 4. The reference orbit and the 3-D flight trajectories of the UAV.
Fig. 8. The convergence of the proposed algorithm.

K̃*_X = [77.7294, 11.1377, 0.1555, 0.0223, 59.7417, 128.6298, 5.6294, −1.7779, 7.3090],
K̃*_Y = [92.3849, 12.3442, 0.1849, 0.0248, 72.8860, 155.6868, 6.8898, −2.2335, 8.9205].

In fact, the quadrotor system suffers from a number of noises in addition to the disturbance d. Moreover, there usually exist other uncertainties in a physical experiment, such as measurement noise, time delay, and data packet loss. For example, output data packet loss occurs during the periods t = [24.9000, 25.7333] s, t = [136.6667, 136.9667] s, and t = [138.7000, 139.0333] s, which results in performance degradation of the estimated control policy. Even under this situation, the proposed approach still shows its effectiveness.

To sum up, the results of the physical experiment show that the LOORP with unavailable state and disturbance is solved by using the proposed DRL-based learning algorithm without any prior knowledge of the system matrices. A video clip of this demo is available at https://xiaoyu.xmu.edu.cn/demos/dataOR4CEP.mp4.

5. Conclusion

References

Bian, T., & Jiang, Z. P. (2016). Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design. Automatica, 71, 348–360.
Chen, C., Modares, H., Xie, K., Lewis, F. L., Wan, Y., & Xie, S. (2019). Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Transactions on Automatic Control, 64(11), 4423–4438.
Chen, C., Sun, W., Zhao, G., & Peng, Y. (2020). Reinforcement Q-learning incorporated with internal model method for output feedback tracking control of unknown linear systems. IEEE Access, 8, 134456–134467.
Fan, J., Wu, Q., Jiang, Y., Chai, T., & Lewis, F. L. (2020). Model-free optimal output regulation for linear discrete-time lossy networked control systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 50(11), 4033–4042.
Franklin, G. F., Powell, J. D., & Emami-Naeini, A. F. (2010). Feedback control of dynamic systems (6th ed.). Pearson.
Gao, W., & Jiang, Z. P. (2016a). Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE Transactions on Automatic Control, 61(12), 4164–4169.
Gao, W., & Jiang, Z. P. (2016b). Adaptive optimal output regulation via output-feedback: An adaptive dynamic programming approach. In Proceedings of the IEEE 55th Conference on Decision and Control (CDC 2016). Las Vegas, USA: IEEE.
Gao, W., & Jiang, Z. P. (2019). Adaptive optimal output regulation of time-delay systems via measurement feedback. IEEE Transactions on Neural Networks and Learning Systems, 30(3), 938–945.
Gao, W., Jiang, Z. P., & Lewis, F. L. (2018). Leader-to-formation stability of multi-agent systems: An adaptive optimal control approach. IEEE Transactions on Automatic Control, 63(10), 3581–3588.
Gao, W., Mynuddin, M., Wunsch, D. C., & Jiang, Z.-P. (2022). Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model. IEEE Transactions on Neural Networks and Learning Systems, 33(10), 5229–5240.
Gonzalez-Garcia, A., Barragan-Alcantar, D., Collado-Gonzalez, I., & Garrido, L. (2021). Adaptive dynamic programming and deep reinforcement learning for the control of an unmanned surface vehicle: Experimental results. Control Engineering Practice, 111, 104807.
Huang, J. (2004). Nonlinear output regulation: Theory and applications. Philadelphia, PA, USA: SIAM.
Jiang, Y., Gao, W., Na, J., Zhang, D., Hämäläinen, T. T., Stojanovic, V., & Lewis, F. L. (2022). Value iteration and adaptive optimal output regulation with assured convergence rate. Control Engineering Practice, 121, 105042.
Jiang, Y., & Jiang, Z. P. (2012). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 48(10), 2699–2704.
Jiang, Y., & Jiang, Z. P. (2017). Robust adaptive dynamic programming. Hoboken, NJ, USA: Wiley.
Jiang, Y., Kiumarsi, B., Fan, J., Chai, T., Li, J., & Lewis, F. L. (2020). Optimal output regulation of linear discrete-time systems with unknown dynamics using reinforcement learning. IEEE Transactions on Cybernetics, 50(7), 3147–3156.
Lewis, F. L., Vrabie, D. L., & Syrmos, V. L. (2012). Reinforcement learning and optimal adaptive control. John Wiley & Sons, Inc.
Liu, D., Xue, S., Zhao, B., Luo, B., & Wei, Q. (2021). Adaptive dynamic programming for control: A survey and recent advances. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 51(1), 142–160.
Modares, H., & Lewis, F. L. (2014). Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Transactions on Automatic Control, 59(11), 3051–3056.
Modares, H., Lewis, F. L., & Jiang, Z. (2016). Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning. IEEE Transactions on Cybernetics, 46(11), 2401–2410.
Rizvi, S. A. A., & Lin, Z. (2020). Reinforcement learning-based linear quadratic regulation of continuous-time systems using dynamic output feedback. IEEE Transactions on Cybernetics, 50(11), 4670–4679.
Saberi, A., Stoorvogel, A. A., Sannuti, P., & Shi, G. (2003). On optimal output regulation for linear systems. International Journal of Control, 76(4), 319–333.
Sun, W., Zhao, G., & Peng, Y. (2019). Adaptive optimal output feedback tracking control for unknown discrete-time linear systems using a combined reinforcement Q-learning and internal model method. IET Control Theory and Applications, 13(18), 3075–3086.
Wei, Q., Liu, D., Lin, Q., & Song, R. (2017). Discrete-time optimal control via local policy iteration adaptive dynamic programming. IEEE Transactions on Cybernetics, 47(10), 3367–3379.
Xu, Y., & Wu, Z.-G. (2023). Human-in-the-loop distributed cooperative tracking control with applications to autonomous ground vehicles: A data-driven mixed iteration approach. Control Engineering Practice, 136, 105496.
Yao, Y., Ding, J., Zhao, C., Wang, Y., & Chai, T. (2022). Data-driven constrained reinforcement learning for optimal control of a multistage evaporation process. Control Engineering Practice, 129, 105345.
Zhang, H., Liu, Y., Xiao, G., & Jiang, H. (2020). Data-based adaptive dynamic programming for a class of discrete-time systems with multiple delays. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 50(2), 432–441.
Zhang, H., Zhao, C., & Ding, J. (2022). Online reinforcement learning with passivity-based stabilizing term for real time overhead crane control without knowledge of the system model. Control Engineering Practice, 127, 105302.