
Control Engineering Practice 141 (2023) 105675


Adaptive optimal output regulation of unknown linear continuous-time systems by dynamic output feedback and value iteration

Kedi Xie a,b, Yiwei Zheng a, Weiyao Lan a,c, Xiao Yu a,c,*

a Department of Automation, Xiamen University, Xiamen 361005, China
b School of Automation, Beijing Institute of Technology, Beijing 100081, China
c Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen 361005, China

A R T I C L E I N F O A B S T R A C T

Keywords: This paper proposes a value iteration based learning algorithm to solve the optimal output regulation problem for
Adaptive dynamic programming linear continuous-time systems, which aims at achieving disturbance rejection and asymptotic tracking simul­
Dynamic output-feedback control taneously. Notably, the state is unmeasurable and the system dynamics is completely unknown, which greatly
Optimal output regulation
increases the challenge of solving the output regulation problem. Firstly, we present a dynamic output feedback
Value iteration
controller design method by combining the internal model, setting a virtual exo-system, and constructing the
augmented internal state without using any knowledge of system. Then, by establishing a novel iterative learning
equation which requires no repeated finite window integral operations, an adaptive dynamic programming based
learning algorithm with employing value-iteration scheme is proposed to estimate the optimal feedback control
gain, which may lead to a reduction of the computational load. The analysis on solvability and convergence
shows that the estimated control gain converges to the optimal control gain. Finally, a physical experiment on
control of an unmanned quadrotor illustrates the effectiveness of the proposed algorithm.

☆ This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0112600, in part by the National Natural Science Foundation of China under Grants 62173283 and 62273285, and in part by the Natural Science Foundation of Fujian Province of China under Grant 2021J01051.
* Corresponding author. E-mail address: [email protected] (X. Yu).
https://doi.org/10.1016/j.conengprac.2023.105675
Received 5 June 2023; Received in revised form 8 August 2023; Accepted 29 August 2023; Available online 18 September 2023.
0967-0661/© 2023 Elsevier Ltd. All rights reserved.

1. Introduction

Recent years have witnessed the tremendous development of solving control problems by adaptive dynamic programming (ADP) (Jiang and Jiang, 2017) and reinforcement learning (RL) (Lewis et al., 2012); see the latest survey on ADP for control (Liu et al., 2021) and the references therein. As one of the major topics included in Liu et al. (2021), investigating ADP-based learning algorithms for solving optimal regulation problems is of wide interest because the requirement of prior knowledge of the system dynamics can be relaxed, leading to practical significance (Gonzalez-Garcia et al., 2021; Jiang and Jiang, 2012; Jiang et al., 2020; Wei et al., 2017; Yao et al., 2022).

In the last decade, several RL-based learning algorithms, e.g., Jiang and Jiang (2012), Chen et al. (2019), Zhang et al. (2022), have been proposed for solving optimal regulation problems. A novel ADP-based learning algorithm using the integral reinforcement learning (IRL) method and a policy iteration (PI) scheme was proposed in Jiang and Jiang (2012) to approximate the online adaptive optimal controller without any prior knowledge of the system dynamics, which was extended in Gao et al. (2018), Gao et al. (2022), Jiang et al. (2020). In Modares and Lewis (2014), the linear quadratic tracking (LQT) control problem for partially unknown systems was first solved by using the IRL technique, which was developed to solve the LQT for completely unknown dynamics in Chen et al. (2019). To further deal with system uncertainty and disturbance, an ADP based learning algorithm with a PI scheme was proposed in Gao and Jiang (2016a) under the linear optimal output regulation problem (LOORP) framework for solving asymptotic tracking and disturbance rejection simultaneously. The state feedback control gain estimated in Gao and Jiang (2016a) requires separately solving the corresponding algebraic Riccati equation (ARE) and the regulator equations. This ADP-based method was extended to solve the LOORP for discrete time-delay systems (Gao and Jiang, 2019) and multi-agent systems (Gao et al., 2018; 2022).

In these aforementioned RL/ADP based learning algorithms for solving the state/tracking regulation problems, most of the PI based learning algorithms, for instance, Jiang et al. (2020), Chen et al. (2019), require an initial admissible control policy, which is hard to obtain in practice, especially for unknown systems. To obviate this limitation,

an ADP-based learning algorithm with a value iteration (VI) scheme was proposed in Bian and Jiang (2016) for solving the state regulation problem without the requirement of an initial admissible controller, which was extended in Jiang et al. (2022), where a convergence rate requirement is considered. A mixed iteration scheme between PI and VI was presented in Xu and Wu (2023). In addition, most of these ADP-based learning algorithms (Gao and Jiang, 2016a; 2019; Gao et al., 2018; 2022; Jiang et al., 2020) were proposed with state feedback controllers, which require the state to be available. Only a few works took unmeasurable states into consideration for discrete-time systems, e.g., Sun et al. (2019), Chen et al. (2020), Gao and Jiang (2019), where the state can be explicitly reconstructed by using a sequence of input/output data under the observability assumption. In Modares et al. (2016), a model-free off-policy RL-based output feedback (OPFB) control method was proposed by reconstructing the state through a sequence of output data. However, the estimated controller in Modares et al. (2016), which consists of the reconstructed state, suffers from the bias generated by the exploration noises. To deal with this issue, a dynamic OPFB controller constructed from an internal state was proposed in Rizvi and Lin (2020) for solving the state regulation problem. In Gao and Jiang (2016b, 2019), which addressed the LOORP for discrete systems with unavailable state, the direct linear relationship between the reconstructed state and the input/output data is the key property used to establish the corresponding learning algorithms. Compared to discrete-time systems, the relationship between the reconstructed state and the input/output data of continuous-time systems is nonlinear, which makes the ADP-based learning algorithm design more challenging. In addition, although the requirement of prior knowledge of the system dynamics can be relaxed by using RL/ADP based methods to solve stabilization and tracking problems, the solution of the regulator equations, an essential point of solving the LOORP, still cannot be directly obtained when the system matrices are unknown. Therefore, for solving the LOORP of continuous-time systems, the technical challenge arises from the practical case where unknown dynamics and unmeasurable states exist simultaneously.

In this paper, we address the LOORP for linear continuous-time systems with unknown system dynamics and unmeasurable states. The contributions of this paper are highlighted as follows. First, compared with Zhang et al. (2020), Modares et al. (2016), Rizvi and Lin (2020), we further take disturbance rejection into account, and a dynamic OPFB controller is designed to solve the LOORP. Motivated by Rizvi and Lin (2020), a new augmented internal state is constructed whose derivative is calculable. The design analysis shows that there exists a linear relationship between the state and the augmented internal state, which ensures that the proposed iterative learning equation (ILE) can be established in a linear-in-parameter form. Second, different from Gao and Jiang (2016a), Jiang et al. (2020), Fan et al. (2020), where the regulator equations need to be solved, the internal model based controller used in Gao et al. (2018) requires no solution to the regulator equations. However, both the state and the exo-state were assumed to be available in Gao et al. (2018), whereas the unmeasurable state and exo-state are taken into consideration in this paper, and the proposed dynamic output feedback controller is designed by setting a virtual exo-system and constructing the augmented internal state. Third, the proposed ILE requires no repeated finite window integrals (FWIs). In the existing off-policy learning algorithms with the IRL method for continuous-time systems, such as Gao et al. (2022), Rizvi and Lin (2020), Gao and Jiang (2016a), Gao et al. (2018), a large number of FWIs are required for constructing the ILE due to the absence of the derivative of the state. Compared to these works, the proposed ILE is established by employing direct reinforcement learning (DRL), i.e., by combining the ARE with the data from the augmented internal system directly. This leaves no repeated FWIs in the proposed ILE, which could reduce the computational load. Last but not least, although exploration noise is still needed to ensure that the rank condition is satisfied, as in the IRL based algorithms (Fan et al., 2020; Gao and Jiang, 2016a; Modares et al., 2016; Zhang et al., 2020), immunity of the proposed DRL learning algorithm to the exploration bias is verified in this paper.

The rest of this paper is organized as follows. Section 2 presents the preliminaries and problem formulation of the LOORP. In Section 3, the ADP-based learning algorithm with the VI scheme is developed together with the proposed dynamic OPFB controller to solve the LOORP. The solvability analysis is presented in this section. Finally, the experimental example is given in Section 4, and Section 5 concludes the paper.

Notations: Throughout this paper, for a square matrix A, σ(A) is the complex spectrum of A. A > 0 and A ≥ 0 represent that A is positive definite and positive semi-definite, respectively. Z+ is the set of nonnegative integers. For any given n ∈ Z+, In ∈ R^{n×n} is the identity matrix. For two column vectors x ∈ R^n and v ∈ R^m, define col(x, v) = [xᵀ, vᵀ]ᵀ ∈ R^{n+m}. vec(B) = [b1ᵀ, b2ᵀ, …, bnᵀ]ᵀ, where bi ∈ R^m are the columns of B ∈ R^{m×n}. vech(C) = [c11, c12 + c21, …, c1n + cn1, c22, c23 + c32, …, c_{n−1,n} + c_{n,n−1}, cnn]ᵀ ∈ R^{n(n+1)/2} with cij being the elements of the matrix C ∈ R^{n×n}. diag(x) denotes a diagonal matrix D with dii = xi. For any matrix F, F† = (FᵀF)⁻¹Fᵀ is the pseudo-inverse of F. For any matrices Gi, G = block diag[G1, …, Gn] is an augmented block diagonal matrix with block matrices Gii = Gi. P^n denotes the normed space of all n-by-n real symmetric matrices and P^n_+ := {P ∈ P^n : P ≥ 0}. The symbol ⊗ denotes the Kronecker product of two matrices.
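As a concrete illustration of the vectorization operators defined in the Notations (and used later in (44)–(49) and (53)), the following minimal NumPy sketch is ours, not part of the paper; the function names are hypothetical.

```python
import numpy as np

def col(*vectors):
    # Stack column vectors: col(x, v) = [x^T, v^T]^T.
    return np.concatenate(vectors)

def vec(B):
    # Stack the columns of B on top of each other.
    return B.flatten(order="F")

def vech(C):
    # Half-vectorization used in the paper: diagonal entries kept,
    # symmetric off-diagonal pairs summed (c_ij + c_ji), row by row.
    n = C.shape[0]
    out = []
    for i in range(n):
        out.append(C[i, i])
        for j in range(i + 1, n):
            out.append(C[i, j] + C[j, i])
    return np.array(out)

C = np.array([[1.0, 2.0], [3.0, 4.0]])
print(vech(C))   # [1., 5., 4.]  -> length n(n+1)/2 = 3
print(vec(C))    # [1., 3., 2., 4.]
```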
2. Preliminaries and problem formulation

Consider the following linear continuous-time system

ẋ = Ax + Bu + Ed
y = Cx                    (1)
e = y − Fr

where x ∈ R^{nx}, u ∈ R^m, d ∈ R^{nd}, y ∈ R^p, e ∈ R^p, and r ∈ R^{nr} represent the state of the plant, the control input, the disturbance, the output, the output error, and the reference signal, respectively, and A, B, C, E, F are the system matrices with appropriate dimensions. Following the theory of output regulation (Huang, 2004), the disturbance and reference can be viewed as signals generated by the following autonomous systems

ḋ = Sd d,  ṙ = Sr r                    (2)

with Sd ∈ R^{nd×nd} and Sr ∈ R^{nr×nr}, which can be written as an exogenous system (exo-system) v̇ = Sv with v = col(d, r) and S = block diag[Sd, Sr]. Without loss of generality, we present the following assumptions, which are quite standard in existing works on solving the linear output regulation problem (LORP), for example, Saberi et al. (2003), Huang (2004), Gao and Jiang (2016a), Gao et al. (2018), Gao and Jiang (2019).

Assumption 1: rank([A − λI_{nx}, B; C, 0]) = nx + p, ∀λ ∈ σ(S).

Assumption 2: The pair (A, B) is stabilizable and the pair (A, C) is observable.

Assumption 3: The minimal polynomial of Sϕ with ϕ = d, r, i.e.,

αϕ(s) = ∏_{i=1}^{M} (s − λi)^{ai} ∏_{j=1}^{N} (s² − 2μj s + μj² + ωj²)^{bj}                    (3)

is available, with degree qϕ ≤ nϕ, ai and bj being positive integers, and λi, μj, ωj ∈ R for i = 1, …, M and j = 1, …, N.

The LORP aims at finding a control law u that achieves disturbance rejection and asymptotic tracking simultaneously, i.e., lim_{t→∞} e(t) = 0. Under Assumptions 1 and 2, the following regulator equations (Huang, 2004)

X S = A X + B U + Ē
0 = C X + F̄                    (4)


have a unique solution (X*, U*), where Ē = [E 0] and F̄ = [0 −F]. With the solution to (4), u_ss = U* v is the stabilizing input when the closed-loop system of (1) achieves output regulation.

Then, the linear optimal output regulation problem (LOORP) can be described as the following optimization problem.

Problem 1:

min_u ∫₀^∞ ( eᵀQe + (u − u_ss)ᵀ R (u − u_ss) ) dt
subject to (1)

where Q = Qᵀ ∈ R^{p×p} > 0 and R = Rᵀ ∈ R^{m×m} > 0 are the weight matrices.

In this paper, we consider the case where the system matrices A, B, C, E, F are constant but unknown, and both the system state x and the exo-system state v are unmeasurable. Note that two challenges arise from unknown system dynamics and unmeasurable states when solving the LOORP. On the one hand, it is difficult to solve the regulator equations (4) when the system matrices are completely unknown. On the other hand, since the states x and v are unmeasurable, it is necessary to establish an OPFB controller to solve the LOORP. However, the design of the OPFB controller becomes more complicated in the absence of the system matrices. To handle these two issues, a VI-based learning algorithm for directly estimating an optimal control policy, without any prior knowledge of the system dynamics or the explicit solution of the regulator equations, is developed together with the dynamic OPFB controller proposed in Section 3.

Remark 1. Assumptions 1–3 are the standard conditions for solving the LORP with a dynamic measurement feedback controller, as shown in Huang (2004), Gao and Jiang (2016b), Gao et al. (2022). Assumption 3 is typically used in solving the LOORP (Gao et al., 2018; 2022). In fact, the exo-system (2) under Assumption 3 can generate a large class of signals, for example, a combination of sinusoidal signals of arbitrary amplitudes and initial phases, and ramp signals of arbitrary slopes. Thus, the disturbance and reference signals with known frequencies in many practical scenarios can be described. For instance, consider an aircraft in the presence of wind disturbances where the frequencies of the wind can be measured, while the amplitude of the force acting on the aircraft and the corresponding aerodynamic coefficients of the aircraft are unknown.
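To make Remark 1 and Assumptions 1–3 concrete, the following small NumPy sketch (ours, not from the paper) builds an exo-system S = block diag[Sd, Sr] for a constant disturbance and a sinusoidal reference of known frequency ω, and checks the rank condition of Assumption 1 for a toy double-integrator plant chosen only for illustration.

```python
import numpy as np
from scipy.linalg import block_diag

# Toy plant (double integrator), chosen only for illustration.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
nx, p = A.shape[0], C.shape[0]

# Exo-system (2): constant disturbance (Sd = 0) and a sinusoid of frequency w.
w = 0.4
Sd = np.array([[0.0]])
Sr = np.array([[0.0, w], [-w, 0.0]])
S = block_diag(Sd, Sr)
# Minimal polynomials (Assumption 3): alpha_d(s) = s, alpha_r(s) = s^2 + w^2.

# Assumption 1: rank [A - lambda*I, B; C, 0] = nx + p for all lambda in sigma(S).
for lam in np.linalg.eigvals(S):
    M = np.block([[A - lam * np.eye(nx), B],
                  [C.astype(complex), np.zeros((p, B.shape[1]))]])
    assert np.linalg.matrix_rank(M) == nx + p
print("Assumption 1 holds for all eigenvalues of S")
```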
3. DRL-based learning algorithm with dynamic OPFB control

In this section, we first propose a dynamic OPFB control design method for solving the LORP. Then, a DRL-based algorithm with the VI scheme is developed to estimate the optimal control gain for the dynamic OPFB controller. Finally, the stability and convergence analyses are given.

3.1. Design of dynamic OPFB controller

To begin with, under Assumption 3, design a virtual exo-system in the form of

v̂̇ = col(d̂̇, r̂̇) = [Sd̂ 0; 0 Sr̂] col(d̂, r̂) := Sv̂ v̂                    (5)

where d̂ ∈ R^{qd}, r̂ ∈ R^{qr}, v̂ = col(d̂, r̂) ∈ R^q with q = qd + qr, and Sd̂ and Sr̂ are known matrices determined by the minimal polynomials of Sd and Sr, respectively. As shown in Huang (2004) and Gao and Jiang (2016a), not only does there always exist a constant matrix Sv̂ ∈ R^{q×q} such that the state of the exo-system v can be obtained by

v(t) = col(d, r) = [Ad 0; 0 Ar] col(d̂, r̂) := As v̂(t)                    (6)

where v̂(t) evolves according to (5) with a given v̂(0), but also we can always find a pair (G1, G2) for the internal model of the matrix S, which is given as

żv = G1 zv + G2 e                    (7)

where zv ∈ R^{pq} is the state of the incorporated internal model (7) of both Sv̂ and S, with the pair (G1, G2) controllable (refer to Huang, 2004, Remark 1.23 for choosing G1 and G2). It follows from (5) and (6) that system (1) can be rewritten as

ẋ = Ax + Bu + Ê v̂
y = Cx                    (8)
e = Cx + F̂ v̂

where Ê = [EAd 0] and F̂ = [0 −FAr].

Then, consider the state observer for system (8) in the form of

x̂̇(t) = A x̂(t) + Bu(t) + EAd d̂(t) + L(y(t) − C x̂(t))
      = AL x̂(t) + Bu(t) + EAd d̂(t) + L y(t)                    (9)

where x̂ ∈ R^{nx} is the estimate of the state x, L ∈ R^{nx×p} is the observer gain, and AL = A − LC is the observer matrix.

Considering the input u, the virtual disturbance d̂, and the output y as the inputs to the observer system (9), the estimated state x̂ can be written as

x̂(t) = (sI − AL)⁻¹ (B[u] + EAd[d̂] + L[y])
     = ∑_{i=1}^{m} (Ui(s)/Λ(s))[ui] + ∑_{j=1}^{qd} (Dj(s)/Λ(s))[d̂j] + ∑_{k=1}^{p} (Yk(s)/Λ(s))[yk]                    (10)

where Ui(s), Dj(s), and Yk(s), which depend on (A, B, C, EAd, L), are n-dimensional polynomial vectors in the differential operator s, i.e., s[u] = (d/dt)u, and ui, d̂j, and yk represent the ith, jth, and kth elements of u, d̂, and y, respectively, x̂(0) = 0, and Λ(s) = det(sI − AL).

For each i = 1, ⋯, m, it follows from Franklin et al. (2010, Chapter 7) and Rizvi and Lin (2020) that the first term of (10) can be expressed in the following canonical form

(Ui(s)/Λ(s))[ui] = col( (a^{i1}_{nx−1} s^{nx−1} + a^{i1}_{nx−2} s^{nx−2} + ⋯ + a^{i1}_0)/Λ(s), …, (a^{inx}_{nx−1} s^{nx−1} + a^{inx}_{nx−2} s^{nx−2} + ⋯ + a^{inx}_0)/Λ(s) ) [ui]
                 = [a^{i1}_0 ⋯ a^{i1}_{nx−1}; ⋮ ⋮ ⋮; a^{inx}_0 ⋯ a^{inx}_{nx−1}] col( (1/Λ(s))[ui], …, (s^{nx−1}/Λ(s))[ui] )
                 ≜ M^i_u z^i_u(t)

where the unknown constant matrix M^i_u ∈ R^{nx×nx} contains the coefficients of Ui(s). For each j = 1, ⋯, qd and k = 1, ⋯, p, along the same lines, the second and third terms on the right-hand side of (10) can be rewritten as

(Dj(s)/Λ(s))[d̂j] ≜ M^j_d z^j_d(t),  (Yk(s)/Λ(s))[yk] ≜ M^k_y z^k_y(t)

where M^j_d ∈ R^{nx×nx} and M^k_y ∈ R^{nx×nx} contain the coefficients of Dj(s) and Yk(s), respectively. Under Assumption 2, Λ(s) is a user-defined known polynomial as

Λ(s) = s^{nx} + α_{nx−1} s^{nx−1} + ⋯ + α1 s + α0.                    (11)

The obtainable internal states z^i_u(t), z^j_d(t), z^k_y(t) are defined as

ż^i_u(t) = 𝒜 z^i_u(t) + b ui(t)                    (12)
ż^j_d(t) = 𝒜 z^j_d(t) + b d̂j(t)                    (13)
ż^k_y(t) = 𝒜 z^k_y(t) + b yk(t)                    (14)

where 𝒜 and b are given as

𝒜 = [0 1 0 ⋯ 0; 0 0 1 ⋯ 0; ⋮ ⋮ ⋱ ⋮ ⋮; 0 0 0 ⋯ 1; −α0 −α1 −α2 ⋯ −α_{nx−1}],  b = [0; 0; ⋮; 0; 1]                    (15)

and 𝒜 is determined by Λ(s) = det(sI − AL). Note that 𝒜 is designed without any exact knowledge of AL since Λ(s) can be user-defined as in (11), which only requires Assumption 2 to be satisfied. This implies that the internal states zu, zd, and zy are available by accessing the data u, d̂, and y without any prior knowledge of the system dynamics.

By (10)–(14), the estimated state x̂ can be parameterized as

x̂(t) ≜ M z_x̂(t)                    (16)

where M = [M^1_u, ⋯, M^m_u, M^1_d, ⋯, M^{qd}_d, M^1_y, ⋯, M^p_y] ∈ R^{nx×nx(m+qd+p)} is the unknown constant parametrization matrix and z_x̂ = [(z^1_u)ᵀ, ⋯, (z^m_u)ᵀ, (z^1_d)ᵀ, ⋯, (z^{qd}_d)ᵀ, (z^1_y)ᵀ, ⋯, (z^p_y)ᵀ]ᵀ ∈ R^{nx(m+qd+p)} is the available internal state with

ż_x̂(t) = G3 z_x̂(t) + G4 ζ_x̂(t).                    (17)

The aforementioned system (17) is called the internal system, where the system matrices G3 and G4 are

G3 = block diag[𝒜, …, 𝒜],  G4 = block diag[b, …, b]                    (18)

each with m + qd + p diagonal blocks, and the extended measurement data ζ_x̂ = col(u, d̂, y) is the input of the internal system (17).

Then, define the augmented internal state as z = col(z_x̂, zv) ∈ R^{nx(m+qd+p)+pq}. It follows from the definitions of zv and z_x̂ in (7) and (17) that the augmented internal system can be given as

ż = 𝒢1 z + 𝒢2 ζ                    (19)

where 𝒢1 = block diag[G3, G1] and 𝒢2 = block diag[G4, G2], with (G1, G2) and (G3, G4) defined in (7) and (18), respectively, and ζ = col(u, d̂, y, e).
unknown constant parametrization matrix and zx̂ = Define
T T T qd T p T T T
nx (m+qd +p)
[(z1u ) , ⋯, (zm 1 1
u ) , (zd ) , ⋯, (zd ) , (zy ) , ⋯, (zy ) ] ∈ R is the
x̄ = x − X * ̂
v , δx = ̂x − x, z̄v = zv − Z * ̂v . (27)
available internal state as
żx̂ (t) = G3 zx̂ (t) + G4 ζx̂ (t). (17) Combining the solution to (26) with (27), we have
⎡ ⎤ ⎡ ⎤⎡ ⎤ ⎡ ⎤
ẋ A − BKx − BKx − BKv x x
The aforementioned system (17) is called the internal system where the ⎣ δ˙x ⎦ = ⎣ 0 A − LC 0 ⎦ ⎣ δ x = Ac ⎣ δ x ⎦ .
⎦ (28)
system matrices G3 and G4 are żv G2 C 0 G1 zv zv

(18) The error output is given as


e = Cx̄. (29)
and the extended measurement data ζx̂ = col(u, ̂ d, y) is input of the in­
ternal system (17). It is obvious that σ(Ac ) = σ(AL ) ∪ σ (Ak ). Since there exists L and (Kx , Kv )
Then, define the augmented internal state as z = col(zx̂ , satisfying σ (AL ) ∈ C− and σ (Ak ) ∈ C− under Assumption 2, the closed-
zv ) ∈ Rnx (m+qd +p)+pq . It follows from the definition of zv and zx̂ in (7) and loop system (28) is exponential stable, i.e., limt→∞ x̄(t) = 0, which
(17) that the augmented internal system can be given as gives limt→∞ e(t) = 0 from (29). That is, the LORP is solved by controller
(21).
ż = G 1 z + G 2 ζ (19) According to the linear relationship between the internal state zx̂
with the estimate state ̂ x as shown in (16), the control policy u in (21)
where G 1 = block diag[G3 , G1 ] and G 2 = block diag[G4 , G2 ] with (G1 , can be rewritten as
G2 ) and (G3 , G4 ) defined in (7) and (18), respectively, and ζ = col(u, ̂
d,y,
e). u = − Kx Mzx̂ − Kv zv = − K z
Finally, a dynamic OPFB controller with the augmented internal
where z = col(zx̂ , zv ) is defined in (19) with zx̂ and zv defined in (7) and
state z is proposed in the following Theorem.
(17), respectively, and
Theorem 1. Under Assumptions 1–3, there exists a dynamic OPFB
K = [Kx M Kv ]. (30)
controller given as
{
u=− K z Therefore, the LORP of system (1) is solved by the dynamic OPFB
(20) controller (20).□
ż = G 1 z + G 2 ζ
Furthermore, Theorem 1 leads to the following lemma which shows
with control gain K to solve the LORP of system (1). the equivalence relationship between the dynamic OPFB controller (20)
and the state feedback controller as in form of
Proof. Consider the dynamic OPFB controller as in form of
u = − Kx ̂x − Kv zv (21)

where Kx ∈ Rm×nx and Kv ∈ Rm×pq are control gains and


u = −Kx x − Kv zv                    (31)
żv = G1 zv + G2 e

where Kx and Kv are control gains, and zv is defined the same as in (7).

Lemma 1. Under Assumption 2, for any given observer gain L satisfying σ(A − LC) ∈ C⁻, there exists a dynamic OPFB controller (20) which converges to a dynamic state feedback control policy (31).

Proof. It follows from (8) and (9) that the error dynamics of the observer is

ε(t) = x(t) − x̂(t) = e^{(A−LC)t} x̃(0)                    (32)

with x̃(0) = x(0) − x̂(0). Then, substituting (16) into (32), the state x can be described as

x(t) = M z_x̂(t) + e^{(A−LC)t} x̃(0).                    (33)

Then, the dynamic state feedback control policy can be rewritten as

u(x, zv) = −Kx x(t) − Kv zv(t) = −Kx M z_x̂(t) − Kv zv(t) − Kx ε(t).

By (30), the dynamic OPFB controller (20) is rewritten as

u(z) = −𝒦 z = −Kx M z_x̂(t) − Kv zv(t).                    (34)

Since σ(A − LC) ∈ C⁻, it follows from (32) that lim_{t→∞} ε(t) = 0. Then, we have u(z) → u(x, zv) as t → ∞. □

Although M is determined by the choice of L and the system matrices, it is unnecessary to calculate M explicitly, as is shown in Section 3.3 by using the linear relationship between the state and the internal state.

Remark 2. Theorem 1 shows that the dynamic OPFB controller (20) is independent of the solution of the regulator equations. It only requires Assumptions 1–3 to be satisfied, which guarantees the existence of (Kx, Kv) and L such that σ(Ak) ∈ C⁻ and σ(AL) ∈ C⁻.

Remark 3. By designing the virtual exo-system (5) and considering the virtual disturbance d̂ as an input to the observer system (9), the augmented internal state system (19) is established by combining the internal model, which provides a way to establish the dynamic OPFB controller for the system (1) with both unmeasurable state and exo-state.

3.2. LOORP formulation with dynamic OPFB controller

Lemma 1 implies that the LOORP formulation of system (1) with the dynamic OPFB controller (20) can be described as an optimal control problem with the dynamic state feedback controller (31).

Now, we begin with formulating the LOORP with the dynamic state feedback controller (31). Define Az = [A, 0; G2 C, G1], Bz = [B; 0], Ez = [Ê; G2 F̂], Cz = [C 0], Fz = F̂, and xz = col(x, zv). Combining system (8) with (31) leads to the closed-loop system

ẋz = Az xz + Bz u + Ez v̂
v̂̇ = Sv̂ v̂                    (35)
e = Cz xz + Fz v̂

Rewrite the controller (31) as

u = −Kx x − Kv zv = −K xz                    (36)

where K = [Kx Kv] represents the control gain.

Define ū = u − (−Kx X* − Kv Z*) v̂ and combine (27) with the solution of the equations (26). The closed-loop system (35) is transformed into the augmented system

ẋ̄z = Az x̄z + Bz ū
e = Cz x̄z                    (37)

with x̄z = col(x̄, z̄v).

Then, define the cost function

Vz = ∫₀^∞ ( eᵀQe + z̄vᵀ Qz′ z̄v + ūᵀRū ) dt = ∫₀^∞ ( x̄zᵀ Qz x̄z + ūᵀRū ) dt

where Qz = block diag[CᵀQC, Qz′] with Q = Qᵀ ∈ R^{p×p} ≥ 0, Qz′ = Qz′ᵀ ∈ R^{pq×pq} ≥ 0, (Az, √Qz) observable, and R = Rᵀ ∈ R^{m×m} > 0. Minimizing the cost function Vz yields the policy ū* = −K* x̄z, with K* = [K*x K*v] being the optimal control gain, where

K* = R⁻¹ Bzᵀ P*                    (38)

with P* the symmetric positive definite solution of the ARE

Azᵀ P* + P* Az + Qz − P* Bz R⁻¹ Bzᵀ P* = 0.                    (39)

Therefore, the optimal dynamic state feedback controller for solving the LOORP of system (35) is

u* = ū* − K*x X* v̂ − K*v Z* v̂ = −K* xz.                    (40)

Finally, combining controller (40) with Lemma 1 and Theorem 1, the optimal dynamic OPFB control policy for solving the LOORP of system (1) can be given as

u = −𝒦* z
ż = 𝒢1 z + 𝒢2 ζ                    (41)

where z is the augmented internal state with the pair (𝒢1, 𝒢2) defined in (19), and 𝒦* = [K*x M  K*v] is the optimal control gain with the matrix M and the pair (K*x, K*v) described in (16) and (38), respectively. Now, we give an analysis of how to design and obtain the parameters in the proposed optimal dynamic OPFB controller (41), stated as follows.

• As given in (19), the pair (𝒢1, 𝒢2) consisting of (G1, G2, G3, G4) can be user-defined in advance. To be more specific, the pairs (G1, G2) and (G3, G4) only rely on the minimal polynomial of the matrix S (refer to Huang, 2004, Remark 1.28) and the eigenvalues of the matrix AL (refer to (18) and (15)), which are available under Assumptions 2 and 3. This implies that (𝒢1, 𝒢2) can be completely determined by the user.
• The control gain 𝒦*, as the only undetermined parameter in the dynamic OPFB controller (41), is very difficult to find directly since both the pair (K*x, K*v) and the matrix M depend on the exact system matrices, which are unknown. Instead of separately calculating (K*x, K*v) and M to obtain 𝒦*, the DRL-based learning algorithm proposed in the following subsection aims at estimating the optimal control gain 𝒦* directly without any prior knowledge of the system matrices; a model-based computation for comparison is sketched below.

The above statements show that the optimal dynamic OPFB controller (41) can be approximated by using the proposed learning algorithm given in the following subsection without any prior knowledge of the system dynamics.




ẋz = Az xz + Bz u + Ez ̂
v The above statements show that the optimal dynamic OPFB
v˙ = Sv̂ ̂v
̂ . (35) controller (41) can be approximated by using the proposed learning


⎩ algorithm given in the following subsection without any prior knowl­
e = Cz xz + Fz ̂v
edge of the system dynamics.
Rewrite the controller (31) as
u = − Kx x − Kv zv = − Kxz (36) 3.3. DRL-based learning algorithm

where K = [Kx Kv ] represents the control gain. In this subsection, the DRL-based learning algorithm with VI scheme
Define ū = u − (− Kx X* − Kv Z* )̂
v and combine (27) with the solution is proposed to approximate K * . To begin with, an ADP-based algorithm
of the equations (26). The closed-loop system (35) is transformed into with VI scheme proposed in Bian and Jiang (2016) is recalled in Algo­
the augmented system as follows rithm 1 where a bounded sets {B i }∞ i=0 with nonempty interiors and a


Algorithm 1. Model-based VI scheme (Bian and Jiang, 2016). [Pseudocode table, including the update equation (42), not reproduced in this copy.]

In Algorithm 1, the bounded sets {Bi}_{i=0}^{∞} with nonempty interiors and the deterministic sequence {ϵj}_{j=0}^{∞} satisfy Bi ⊂ Bi+1, i ∈ Z+, lim_{i→∞} Bi = P^{n̄}_+, and ϵj > 0, j ∈ Z+, ∑_{j=0}^{∞} ϵj = ∞, ∑_{j=0}^{∞} ϵj² < ∞.
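Since the pseudocode of Algorithm 1 and its update equation (42) are not legible in this copy, the following Python sketch gives only our hedged reading of the VI recursion in Bian and Jiang (2016), not the paper's exact statement: a Riccati-residual step with step sizes ϵj, reset to P0 whenever the iterate leaves the current bounded set. All names are ours.

```python
import numpy as np

def model_based_vi(Az, Bz, Qz, R, P0, eps, bound, max_iter=200000, tol=1e-9):
    # Sketch of the VI recursion in the spirit of Bian & Jiang (2016):
    #   P_{j+1} = P_j + eps(j) * (Az^T P_j + P_j Az + Qz - P_j Bz R^{-1} Bz^T P_j),
    # with P reset to P0 and the bounded set enlarged whenever the iterate escapes it.
    P, i = P0.copy(), 0
    R_inv = np.linalg.inv(R)
    for j in range(max_iter):
        step = Az.T @ P + P @ Az + Qz - P @ Bz @ R_inv @ Bz.T @ P
        P_new = P + eps(j) * step
        if np.linalg.norm(P_new) > bound(i):   # iterate left the set B_i
            i += 1
            P_new = P0.copy()
        if np.linalg.norm(P_new - P) < tol:
            return P_new
        P = P_new
    return P

# Step sizes and set radii mirroring the experiment section:
# eps = lambda j: 50.0 / (5000.0 + j); bound = lambda i: 200.0 * (i + 1).
```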
Define Lj = (d/dt)(xzᵀ Pj xz) + xzᵀ Qz xz for each iteration of Pj. It follows from system (35) that

Lj = ẋzᵀ Pj xz + xzᵀ Pj ẋz + xzᵀ Qz xz
   = (Az xz + Bz u + Ez v̂)ᵀ Pj xz + xzᵀ Pj (Az xz + Bz u + Ez v̂) + xzᵀ Qz xz                    (43)
   = xzᵀ Hj xz + 2uᵀ R Kj xz + 2v̂ᵀ Ezᵀ Pj xz

where Hj = Azᵀ Pj + Pj Az + Qz and Kj = R⁻¹ Bzᵀ Pj.

By using the Kronecker product representation and the predefined operators in the Notations, we have

ẋzᵀ Pj xz = (xz ⊗ ẋz)ᵀ vec(Pj)                    (44)
xzᵀ Hj xz = vech(2 xz xzᵀ − diag(xz) diag(xz))ᵀ vech(Hj)                    (45)
uᵀ R Kj xz = (xz ⊗ u)ᵀ (I_{n̄} ⊗ R) vec(Kj)                    (46)
v̂ᵀ Ezᵀ Pj xz = (xz ⊗ v̂)ᵀ vec(Ezᵀ Pj)                    (47)

where n̄ = nx + pq. For convenience of description, we further define a set of functions Γa,b and Δc,d as

Γa,b = [a(t1) ⊗ b(t1) ⋯ a(ts) ⊗ b(ts)]ᵀ                    (48)
Δc,d = [δc,d(t1) ⋯ δc,d(ts)]ᵀ                    (49)

with δc,d(t) = vech(c(t) d(t)ᵀ + d(t) c(t)ᵀ − diag(c(t)) diag(d(t))), where a, b, c, and d are column vectors.

Combining (43)–(49), the iterative learning equation (ILE) with dynamic state feedback controller (31) is given in the following linear-in-parameter form

Ψ̂ [vech(Hj); vec(Kj); vec(Ezᵀ Pj)] = Φ̂j(Pj)                    (50)

where Ψ̂ = [Δ_{xz,xz}, 2Γ_{xz,u}(I_{n̄} ⊗ R), 2Γ_{xz,v̂}] and Φ̂j = Δ_{xz,xz} vech(Qz) + 2Γ_{xz,ẋz} vec(Pj).

Notably, there exists a linear relationship between the estimated state x̂ and the internal state z_x̂ as shown in (16). This linear property ensures that the following DRL-based learning algorithm can be established by accessing the data from the augmented internal state z instead of the state xz.

Define x̂z = col(x̂, zv). It follows from (7), (16), and (19) that

x̂z = Mz z,  x̂̇z = Mz ż                    (51)

where Mz = [M, 0; 0, I_{pq}]. Since lim_{t→∞}(x̂(t) − x(t)) = 0 holds by Lemma 1, we have lim_{t→∞}(x̂z(t) − xz(t)) = 0 and lim_{t→∞}(x̂̇z(t) − ẋz(t)) = 0. Note that the dynamics of z given in (19) can be user-defined, which implies that both the internal state z and its derivative ż are available.

It follows from (43) and (51) that

2żᵀ P̄j z + eᵀ Q e + zvᵀ Qz′ zv = zᵀ H̄j z + 2uᵀ R K̄j z + 2v̂ᵀ Ēz z                    (52)

where P̄j = Mzᵀ Pj Mz, H̄j = Mzᵀ Hj Mz, K̄j = Kj Mz, and Ēz = Ezᵀ Pj Mz. With the same representation form as in (44)–(47), and using the operations Γa,b and Δc,d defined in (48) and (49), respectively, the ILE with dynamic OPFB controller transformed from the key equation (52) is given in the following linear-in-parameter form

Ψ [vech(H̄j); vec(K̄j); vec(Ēz)] = Φj(P̄j)                    (53)

where Ψ = [Δ_{z,z}, 2Γ_{z,u}(I_{nz} ⊗ R), 2Γ_{z,v̂}] and Φj = Δ_{e,e} vech(Q) + Δ_{zv,zv} vech(Qz′) + 2Γ_{z,ż} vec(P̄j), with nz = nx(m + qd + p) + pq.

Finally, the DRL-based learning algorithm with the VI scheme for estimating the optimal control gain 𝒦* of the optimal controller (41) is described in Algorithm 2. The control system diagram with the proposed Algorithm 2 is shown in Fig. 1. Note that Ψ is not a square matrix in most cases; thus, (53) is solved by using the pseudo-inverse method as given in (54).

Algorithm 2. DRL-based learning algorithm for solving the LOORP with the dynamic OPFB control. [Pseudocode table, including the update equations (54) and (55), not reproduced in this copy.]

Fig. 1. The control system diagram with the proposed Algorithm 2.
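Because the table of Algorithm 2 and Eqs. (54)–(55) are not legible in this copy, the following sketch is only our hedged reading of the procedure described in the text: stack the data matrices of the ILE (53) over the sampled instants, solve it in the least-squares (pseudo-inverse) sense, and perform a VI-type update of P̄j. Function and variable names are ours, and the quadratic-feature construction below is an equivalent product-feature form of the paper's δc,d convention rather than a literal transcription.

```python
import numpy as np

def quad_features(x):
    # Pairs with the paper's vech of a symmetric matrix:
    # quad_features(x) @ vech(H) == x^T H x.
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

def unvech(theta, n):
    # Rebuild a symmetric matrix from the paper's vech ordering.
    H = np.zeros((n, n)); k = 0
    for i in range(n):
        for j in range(i, n):
            H[i, j] = H[j, i] = theta[k] if i == j else theta[k] / 2.0
            k += 1
    return H

def drl_vi_step(Z, Zdot, U, Vh, Err, Zv, Q, Qzp, R, Pbar, eps_j):
    # One iteration of the data-driven scheme of Section 3.3: assemble the ILE (53)
    # from sampled (z, zdot, u, vhat, e, zv), solve it by least squares, then apply
    # a VI-type update of Pbar (our reading of the step Algorithm 2 borrows from (42)).
    nz, m = Z.shape[1], U.shape[1]
    Psi = np.hstack([
        np.vstack([quad_features(z) for z in Z]),
        2.0 * np.vstack([np.kron(z, u) for z, u in zip(Z, U)]) @ np.kron(np.eye(nz), R),
        2.0 * np.vstack([np.kron(z, v) for z, v in zip(Z, Vh)]),
    ])
    Phi = np.array([e @ Q @ e + zv @ Qzp @ zv
                    + 2.0 * np.kron(z, zd) @ Pbar.flatten(order="F")
                    for z, zd, e, zv in zip(Z, Zdot, Err, Zv)])
    theta, *_ = np.linalg.lstsq(Psi, Phi, rcond=None)   # pseudo-inverse solve
    nH = nz * (nz + 1) // 2
    Hbar = unvech(theta[:nH], nz)
    Kbar = theta[nH:nH + m * nz].reshape((m, nz), order="F")
    Pbar_next = Pbar + eps_j * (Hbar - Kbar.T @ R @ Kbar)
    return Pbar_next, Kbar
```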


Remark 4. The constant parametrization matrix Mz is unique once the Hurwitz matrix AL is determined under Assumption 2. The proposed Algorithm 2 only requires the existence of the matrices AL and Mz, not their exact knowledge.

Remark 5. In the IRL-based learning algorithms, for instance, Gao and Jiang (2016a), Gao et al. (2022), Modares et al. (2016), Rizvi and Lin (2020), Gao et al. (2018), Jiang and Jiang (2012), the ILEs require a large number of repeated FWIs. In this paper, since the derivative of the internal state ż is obtainable, the proposed ILE (53) is established directly with the DRL method by using the collected data (z, ż) at every instant. Thus, the computational load of the proposed ADP-based learning algorithm is reduced since FWIs are not required anymore. Moreover, unlike Fan et al. (2020), Gao and Jiang (2016a), Jiang et al. (2020), where the solution to the regulator equations needs to be calculated explicitly, owing to the internal model used in the proposed dynamic OPFB controller (41), the explicit solution to the regulator equations is not required in the proposed Algorithm 2.

3.4. Convergence analysis

The result on the convergence of the proposed algorithm is summarized in the following theorem.

Theorem 2. If there exists a sampling instant tl such that the following rank condition is satisfied,

rank([Δ_{z,z}, Γ_{z,u}, Γ_{z,v̂}]) = κ                    (56)

where κ = nz(nz + q + m + 1)/2, then the estimated control scheme generated by employing Algorithm 2 converges to the optimal dynamic OPFB controller (41) as j → ∞.

Proof. Note that Ψ is independent of P̄j and only relies on a series of past data. Thus, if the rank condition (56) is satisfied, then Ψ is of full column rank, which implies that (54) has a unique solution at every iteration. As proven in Bian and Jiang (2016, Theorem 3.3), for any given P0, the P̃j+1 calculated from (42) converges to P* as j → ∞. For Algorithm 2, for any given P̄0, substitute P̄j into (54). Since Ψ is of full rank, (54) has a unique solution H̄j and K̄j at every iteration j. Thus, P̄j+1 can also be recalculated uniquely from (55). It should be pointed out that (55) is transformed from (42) by using the equivalent relationship between the augmented internal state z and the estimated state x̂z. Therefore, P̄j+1 computed from (54) and (55) converges to P̄* as j → ∞. This implies that K̄j converges to 𝒦* as j → ∞. □

Remark 6. The rank condition can be easily satisfied by adding exploration noise in the data collection process, as stated in Jiang and Jiang (2012) and as utilized in most existing ADP based learning algorithms; see, for example, Gao and Jiang (2016a), Jiang et al. (2022), Rizvi and Lin (2020). Typically, the exploration noise is chosen as a combination of several sinusoids with different frequencies and Gaussian noise. Notably, not only the choice of exploration noise but also the choice of the data collection instants affects whether the rank condition is satisfied in the implementation of Algorithm 2.

In fact, the proposed Algorithm 2 has the following exploration bias immunity property.

Theorem 3. Algorithm 2 is immune to the exploration bias problem.

Proof. To begin with, define the control policy in the data collection step as

û = u + ξ

where ξ is the exploration noise.

To show Theorem 3, we first verify that the equivalent iterative learning method with the state feedback controller, expressed as (50), which is derived from the DRL-based learning equation (43), is immune to the exploration bias. Let the two pairs (Ĥj, K̂j, P̂j) and (Hj, Kj, Pj) be the estimates obtained by using û and u, respectively.

Combining (43) with the control input û, we have

2ẋzᵀ P̂j xz + xzᵀ Qz xz = xzᵀ Ĥj xz + 2ûᵀ R K̂j xz + 2v̂ᵀ Ezᵀ P̂j xz                    (57)

where ẋz is the state derivative of the closed-loop system (35) with controller û. Substituting the description of the closed-loop system (35) with û = u + ξ into (57), we have

2(Az xz + Bz u + Bz ξ + Ez v̂)ᵀ P̂j xz + xzᵀ Qz xz = xzᵀ Ĥj xz + 2(u + ξ)ᵀ R K̂j xz + 2v̂ᵀ Ezᵀ P̂j xz.

It follows from K̂j = R⁻¹ Bzᵀ P̂j, as given in (43), that

ξᵀ Bzᵀ P̂j xz = ξᵀ R K̂j xz.

Then, (57) can be further rewritten as

2(Az xz + Bz u + Ez v̂)ᵀ P̂j xz + xzᵀ Qz xz = xzᵀ Ĥj xz + 2uᵀ R K̂j xz + 2v̂ᵀ Ezᵀ P̂j xz

which is equivalent to the same learning equation with control u given as

2ẋzᵀ P̂j xz + xzᵀ Qz xz = xzᵀ Ĥj xz + 2uᵀ R K̂j xz + 2v̂ᵀ Ezᵀ P̂j xz                    (58)

where ẋz is the state derivative of the closed-loop system (35) with control policy u.

It follows from (43) and (58) that Ĥj = Hj and K̂j = Kj for any given P̂0 = P0, which indicates that immunity to the exploration bias is achieved for the iterative learning method with the dynamic state feedback controller. Then, for the ADP learning equation (52) with u, we have

2żᵀ P̄j z + zᵀ Mzᵀ Qz Mz z = zᵀ H̄j z + 2uᵀ R K̄j z + 2v̂ᵀ Ēz z.

Along the same lines as the aforementioned proof for the dynamic state feedback controller, the immunity to exploration bias of Algorithm 2 can be verified. □

Remark 7. Similar to the exploration noise immunity of the existing learning methods in Jiang and Jiang (2012), Rizvi and Lin (2020), Gao and Jiang (2016a), this paper shows that the DRL-based iterative learning equation with control u is free from exploration noise. Most ADP-based learning algorithms are established by using the fact that the solution of the ARE (39) only depends on the dynamic system matrices Az and Bz. That is, as long as the input/output or input/state data is able to reflect full information of the system dynamic matrices, the optimal control gain can be obtained regardless of the existence of exploration noise in the data collection step.
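As a small illustration of Remark 6 and the rank condition (56), the following sketch (ours, not from the paper) generates exploration noise as a sum of sinusoids, as done in the experiment section, and checks the rank of the stacked data matrix numerically before stopping data collection. The quadratic block uses the same product features as in the earlier sketch.

```python
import numpy as np

def exploration_noise(t, freqs_hz=(0.2, 8.0, 13.0, 50.0, 130.0), amp=0.1):
    # Sum of sinusoids, as used for xi in the experiment section; amplitude is ours.
    return amp * sum(np.sin(2.0 * np.pi * f * t) for f in freqs_hz)

def rank_condition_met(Z, U, Vh):
    # Check rank([Delta_{z,z}, Gamma_{z,u}, Gamma_{z,vhat}]) >= kappa, cf. (56).
    nz, m, q = Z.shape[1], U.shape[1], Vh.shape[1]
    kappa = nz * (nz + q + m + 1) // 2
    quad = np.vstack([[z[i] * z[j] for i in range(nz) for j in range(i, nz)]
                      for z in Z])
    M = np.hstack([quad,
                   np.vstack([np.kron(z, u) for z, u in zip(Z, U)]),
                   np.vstack([np.kron(z, v) for z, v in zip(Z, Vh)])])
    return np.linalg.matrix_rank(M) >= kappa
```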


4. Experiment

In this section, we illustrate the practical feasibility of the proposed learning algorithm by presenting a physical experiment with a ZY-UAV-200 quadrotor unmanned aerial vehicle (UAV) and a motion capture OptiTrack system, shown in Fig. 2. The PX4 open-source autopilot flight controller is used for the inner-loop attitude control of the aircraft. The motion capture OptiTrack system is used to obtain the position of the UAV for feedback. The control frequency is set to 30 Hz.

Fig. 2. Equipment for the physical experiment.

In this experiment example, the objective is to find the optimal control policy which can track a reference trajectory described by an orbit with unknown radius and initial positions, and simultaneously achieve disturbance rejection of a steady wind with unknown strength.

Define xX = [pX vX]ᵀ and xY = [pY vY]ᵀ, where pX and vX are the position and velocity along the X-axis, respectively, and pY and vY are the position and velocity along the Y-axis, respectively. The dynamics of the motion of the UAV along the X-axis and Y-axis are not linear and independent. By local approximation, the dynamics are approximated as linear and independent, and locally described by

ẋX = AX xX + BX uX(t) + EX d(t)
ẋY = AY xY + BY uY(t) + EY d(t)

where AX = AY = [0 1; 0 0], BX = [0; a1], BY = [0; a2], EX = [0; b1], EY = [0; b2] are system matrices with parameters a1, a2, b1, and b2, uX and uY denote the inputs of the X-axis and Y-axis, respectively, and d is the disturbance signal. The tracking errors of the X-axis, eX, and the Y-axis, eY, are defined as

eX(t) = CX xX(t) − FX r(t)
eY(t) = CY xY(t) − FY r(t)

where CX = CY = [1 0], and FX and FY are system matrices. Note that the system matrices AX, AY, BX, BY, CX, CY, EX, EY, FX, and FY are given just for experimental analysis, while our proposed DRL-based learning Algorithm 2 is established without using any prior knowledge of these matrices.

To imitate the effect of a steady wind on the UAV and to model the reference orbit trajectory, we use the following exo-system to generate the disturbance d and the reference r:

ḋ(t) = { 0, t < td; 0.0001 d(t), t ≥ td },  ṙ(t) = [0 0.4; −0.4 0] r(t)                    (59)

where t0 is the takeoff instant of the UAV, and td = 10 + t0.

Since d(t) and r(t) are immeasurable exo-system states, we use a virtual exo-system to replace the exo-system (59). As shown in (6), for t ≥ td, under Assumption 3, there exist the transformations d(t) = Ad d̂(t) and r(t) = Ar r̂(t) with unknown matrices Ad and Ar. Define v̂(t) = col(d̂(t), r̂(t)); then the virtual exo-system can be written as

v̂̇(t) = [0.0001 0 0; 0 0 −0.4; 0 0.4 0] v̂(t) = Sv̂ v̂(t).

To further show the transformation relationship between the pairs (E, F) in (1) and (Ê, F̂) in (8) under a user-defined v̂(0), we give the following example. Let b1 = b2 = 2, FX = [0.6 0] and FY = [0 0.6]. If we choose the initial state of the virtual exo-system as v̂(t0) = [1 0 0.6]ᵀ, then there exist matrices ÊX = ÊY = [0 0 0; 0.002 0 0], F̂X = [0 1 0], F̂Y = [0 0 −1]. Note that the matrices ÊX, ÊY, F̂X, and F̂Y are given just to illustrate their existence, and they are not used in our proposed learning algorithm.

Now, we choose the pair (G1, G2) for the internal state zv defined in (7) as

G1 = Sv̂,  G2 = [1 0 1]ᵀ.

Then, we set both eigenvalues of the observer matrix AL to −1, which gives the pair (G3, G4) of the internal state z_x̂ defined in (17) as

G3 = block diag[𝒜, 𝒜, 𝒜],  G4 = block diag[b, b, b]

with

𝒜 = [0 1; −1 −2],  b = [0; 1].

According to (19), the computable internal states zX = col(z_x̂X, zvX) ∈ R^{9×1} and zY = col(z_x̂Y, zvY) ∈ R^{9×1} are described as

żX = 𝒢1 zX + 𝒢2 ζX
żY = 𝒢1 zY + 𝒢2 ζY

where 𝒢1 = block diag[G3, G1], 𝒢2 = block diag[G4, G2], ζX = col(uX, d̂, yX, eX), and ζY = col(uY, d̂, yY, eY). Thus, the dynamic OPFB controller is written as

uX = −𝒦X zX, żX = 𝒢1 zX + 𝒢2 ζX;  uY = −𝒦Y zY, żY = 𝒢1 zY + 𝒢2 ζY

where 𝒦X and 𝒦Y are the optimal control gains to be estimated by using the proposed Algorithm 2.
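To make the construction above concrete, the following sketch (ours) assembles 𝒢1 and 𝒢2 for the numbers used in this experiment (observer eigenvalues at −1, the virtual exo-system Sv̂ above, G2 = [1 0 1]ᵀ) and propagates one axis's internal state with forward Euler at the 30 Hz control rate; the integration scheme and sample values are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import block_diag

# Filter pair from Lambda(s) = (s + 1)^2, cf. (11), (15).
Acal = np.array([[0.0, 1.0], [-1.0, -2.0]])
b = np.array([[0.0], [1.0]])

# Internal-model pair for zv and the user-defined virtual exo-system.
S_vhat = np.array([[1e-4, 0.0, 0.0],
                   [0.0, 0.0, -0.4],
                   [0.0, 0.4, 0.0]])
G1, G2 = S_vhat, np.array([[1.0], [0.0], [1.0]])
G3 = block_diag(Acal, Acal, Acal)      # one filter per channel of (u, d_hat, y)
G4 = block_diag(b, b, b)

# Augmented internal system (19): z = col(z_xhat, z_v), zeta = col(u, d_hat, y, e).
Gcal1 = block_diag(G3, G1)
Gcal2 = block_diag(G4, G2)

def step_internal_state(z, u, d_hat, y, e, dt=1.0 / 30.0):
    # Forward-Euler propagation of z_dot = Gcal1 z + Gcal2 zeta at the 30 Hz rate.
    zeta = np.array([[u], [d_hat], [y], [e]])
    return z + dt * (Gcal1 @ z + Gcal2 @ zeta)

z = np.zeros((9, 1))
z = step_internal_state(z, u=0.0, d_hat=1.0, y=0.05, e=0.05)
print(z.ravel())
```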
In this experiment, choose the weight matrices as QX = QY = 100, Qz′X = Qz′Y = I3, RX = 0.03, RY = 0.02. For comparison, setting the parameters to a1 = a2 = 1 and b1 = b2 = 2 gives the optimal control gains for the X-axis, 𝒦*X, and the Y-axis, 𝒦*Y, as

𝒦*X = [82.2318 10.9794 0.1645 0.02196 60.2731 131.5256 5.7800 −1.9418 7.9307],
𝒦*Y = [97.7694 12.1258 0.1955 0.02425 73.5177 159.1613 7.0790 −2.4480 9.6957].

For implementing the proposed learning algorithm, the related parameters in the algorithm are set to P̄0 = 0.001 I9, 𝒦0 = 0 ∈ R^{1×9}, ε = 0.01, z(0) = 0 ∈ R^{9×1}, {Bi}_{i=0}^{∞} = {P̄ > 0 | ‖P̄‖ < 200(i + 1)}, and {ϵj}_{j=0}^{∞} = 50/(5000 + j). To ensure that the rank condition (56) is satisfied, the exploration noise ξ is chosen as a combination of five

sinusoids with frequencies 0.2 Hz, 8 Hz, 13 Hz, 50 Hz, and 130 Hz.

Then, we use the proposed Algorithm 2 to find the optimal control gain 𝒦̃* for both the X-axis and Y-axis dynamic motion systems. The data collection process starts at t = 20.1 s, and it ends when the rank condition is satisfied.

The reference orbit and the 3-D flight trajectories of the UAV during the whole experiment, with six highlighted 3-D positions at different instants, are shown in Fig. 4. In particular, the snapshots of the experiment at t = 20.1 s, t = 90.5 s, t = 94.4 s, t = 98.3 s, t = 102.2 s, and t = 106.2 s are shown in Fig. 3, which indicates that the trajectories during the period t = [90.5, 106.2] s track the reference orbit. The trajectories of the input, output and the reference signal are depicted in Figs. 5 and 6, respectively. Fig. 6 shows that the positions of the UAV along both the X-axis and Y-axis track the reference trajectory. Fig. 7 shows the state observer errors, where x̂X = MX zX and x̂Y = MY zY with MX = MY = [1 0 0.002 0 1 2; 2 1 0.004 0.002 0 1]. It should be pointed out that the matrices MX and MY are only used to verify the convergence of the internal state; they are not used in the whole learning process.

Fig. 3. Six snapshots of the UAV flight trajectories. (a) t = 20.1 s: the data collection start instant. (b) t = 90.5 s: the reference at the current instant is r(t) = (−0.6, 0) m. (c) t = 94.4 s: the reference at the current instant is r(t) = (0, 0.6) m. (d) t = 98.3 s: the reference at the current instant is r(t) = (0.6, 0) m. (e) t = 102.2 s: the reference at the current instant is r(t) = (0, −0.6) m. (f) t = 106.2 s: the reference at the current instant is r(t) = (−0.6, 0) m.
Fig. 4. The reference orbit and the 3-D flight trajectories of the UAV.
Fig. 5. The trajectories of the inputs.
Fig. 6. The trajectories of the outputs and references.

The convergence of the proposed learning algorithm is shown in Fig. 8. To be specific, 111 instants of data are collected during t = [20.1000, 23.3333] s such that the rank condition is satisfied for both the X-axis and the Y-axis. For the X-axis, the learning based control law is updated at t = 24.9000 s after 4965 iterations. For the Y-axis, the learning based control law is updated at t = 24.9000 s after 4928 iterations. The estimated optimal control gains 𝒦̃*X and 𝒦̃*Y are

𝒦̃*X = [77.7294 11.1377 0.1555 0.0223 59.7417 128.6298 5.6294 −1.7779 7.3090],
𝒦̃*Y = [92.3849 12.3442 0.1849 0.0248 72.8860 155.6868 6.8898 −2.2335 8.9205].

In fact, the quadrotor system suffers from a number of noises in addition to the disturbance d. Moreover, there usually exist other uncertainties in the physical experiment, such as measurement noise, time delay, and data packet loss. For example, output data packet loss occurs during the periods t = [24.9000, 25.7333] s, t = [136.6667, 136.9667] s, and t = [138.7000, 139.0333] s, among others, which results in performance degradation of the estimated control policy. Even under this situation, the proposed approach still shows its effectiveness.

To sum up, the results of the physical experiment show that the LOORP with unavailable state and disturbance is solved by using the proposed DRL-based learning algorithm without any prior knowledge of the system matrices. Please also find the video clip of this demo at https://xiaoyu.xmu.edu.cn/demos/dataOR4CEP.mp4.

Fig. 7. The trajectories of the observer error.
Fig. 8. The convergence of the proposed algorithm.

5. Conclusion

This paper develops an ADP-based learning algorithm with the DRL technique to solve the LOORP for linear continuous-time systems with unmeasurable states and unknown system matrices. The dynamic OPFB controller is designed by constructing the augmented internal state with access to the input, output, and error data. The linear relationship between the estimated state and the augmented internal state is established to construct an ILE which aims at approximating the optimal control gain for the dynamic OPFB controller. Furthermore, immunity of the proposed algorithm to the exploration noise is achieved, and the rank condition is given to guarantee the convergence of the proposed algorithm. For future work, data-driven cooperative optimal output regulation for multi-agent systems and the nonlinear optimal output regulation problem will be studied.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Bian, T., & Jiang, Z. P. (2016). Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design. Automatica, 71, 348–360.
Chen, C., Modares, H., Xie, K., Lewis, F. L., Wan, Y., & Xie, S. (2019). Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Transactions on Automatic Control, 64(11), 4423–4438.
Chen, C., Sun, W., Zhao, G., & Peng, Y. (2020). Reinforcement Q-learning incorporated with internal model method for output feedback tracking control of unknown linear systems. IEEE Access, 8, 134456–134467.
Fan, J., Wu, Q., Jiang, Y., Chai, T., & Lewis, F. L. (2020). Model-free optimal output regulation for linear discrete-time lossy networked control systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 50(11), 4033–4042.
Franklin, G. F., Powell, J. D., & Emami-Naeini, A. F. (2010). Feedback control of dynamic systems (6th ed.). Pearson.
Gao, W., & Jiang, Z. P. (2016a). Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE Transactions on Automatic Control, 61(12), 4164–4169.
Gao, W., & Jiang, Z. P. (2016b). Adaptive optimal output regulation via output-feedback: An adaptive dynamic programming approach. In Proceedings of the IEEE 55th Conference on Decision and Control (CDC 2016). Las Vegas, USA: IEEE.
Gao, W., & Jiang, Z. P. (2019). Adaptive optimal output regulation of time-delay systems via measurement feedback. IEEE Transactions on Neural Networks and Learning Systems, 30(3), 938–945.
Gao, W., Jiang, Z. P., & Lewis, F. L. (2018). Leader-to-formation stability of multi-agent systems: An adaptive optimal control approach. IEEE Transactions on Automatic Control, 63(10), 3581–3588.
Gao, W., Mynuddin, M., Wunsch, D. C., & Jiang, Z.-P. (2022). Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model. IEEE Transactions on Neural Networks and Learning Systems, 33(10), 5229–5240.
Gonzalez-Garcia, A., Barragan-Alcantar, D., Collado-Gonzalez, I., & Garrido, L. (2021). Adaptive dynamic programming and deep reinforcement learning for the control of an unmanned surface vehicle: Experimental results. Control Engineering Practice, 111, 104807.
Huang, J. (2004). Nonlinear output regulation: Theory and applications. Philadelphia, PA, USA: SIAM.
Jiang, Y., Gao, W., Na, J., Zhang, D., Hämäläinen, T. T., Stojanovic, V., & Lewis, F. L. (2022). Value iteration and adaptive optimal output regulation with assured convergence rate. Control Engineering Practice, 121, 105042.
Jiang, Y., & Jiang, Z. P. (2012). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 48(10), 2699–2704.
Jiang, Y., & Jiang, Z. P. (2017). Robust adaptive dynamic programming. Hoboken, NJ, USA: Wiley.
Jiang, Y., Kiumarsi, B., Fan, J., Chai, T., Li, J., & Lewis, F. L. (2020). Optimal output regulation of linear discrete-time systems with unknown dynamics using reinforcement learning. IEEE Transactions on Cybernetics, 50(7), 3147–3156.
Lewis, F. L., Vrabie, D. L., & Syrmos, V. L. (2012). Reinforcement learning and optimal adaptive control. John Wiley & Sons, Inc.
Liu, D., Xue, S., Zhao, B., Luo, B., & Wei, Q. (2021). Adaptive dynamic programming for control: A survey and recent advances. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 51(1), 142–160.
xiaoyu.xmu.edu.cn/demos/dataOR4CEP.mp4.


Modares, H., & Lewis, F. L. (2014). Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Transactions on Automatic Control, 59(11), 3051–3056.
Modares, H., Lewis, F. L., & Jiang, Z. (2016). Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning. IEEE Transactions on Cybernetics, 46(11), 2401–2410.
Rizvi, S. A. A., & Lin, Z. (2020). Reinforcement learning-based linear quadratic regulation of continuous-time systems using dynamic output feedback. IEEE Transactions on Cybernetics, 50(11), 4670–4679.
Saberi, A., Stoorvogel, A. A., Sannuti, P., & Shi, G. (2003). On optimal output regulation for linear systems. International Journal of Control, 76(4), 319–333.
Sun, W., Zhao, G., & Peng, Y. (2019). Adaptive optimal output feedback tracking control for unknown discrete-time linear systems using a combined reinforcement Q-learning and internal model method. IET Control Theory and Applications, 13(18), 3075–3086.
Wei, Q., Liu, D., Lin, Q., & Song, R. (2017). Discrete-time optimal control via local policy iteration adaptive dynamic programming. IEEE Transactions on Cybernetics, 47(10), 3367–3379.
Xu, Y., & Wu, Z.-G. (2023). Human-in-the-loop distributed cooperative tracking control with applications to autonomous ground vehicles: A data-driven mixed iteration approach. Control Engineering Practice, 136, 105496.
Yao, Y., Ding, J., Zhao, C., Wang, Y., & Chai, T. (2022). Data-driven constrained reinforcement learning for optimal control of a multistage evaporation process. Control Engineering Practice, 129, 105345.
Zhang, H., Liu, Y., Xiao, G., & Jiang, H. (2020). Data-based adaptive dynamic programming for a class of discrete-time systems with multiple delays. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 50(2), 432–441.
Zhang, H., Zhao, C., & Ding, J. (2022). Online reinforcement learning with passivity-based stabilizing term for real time overhead crane control without knowledge of the system model. Control Engineering Practice, 127, 105302.
