
Learning-Based Control of Continuous-Time Systems Using Output Feedback




Leilei Cui and Zhong-Ping Jiang

Abstract

This paper presents an adaptive optimal control approach for continuous-time linear systems with output feedback. The method fills a gap in the literature on reinforcement learning and adaptive dynamic programming, which has focused exclusively on either discrete-time systems or continuous-time systems with full-state information. The approach utilizes the historical continuous-time input-output trajectory to reconstruct the current state, without discretizing the system dynamics or using a state observer. By exploiting the policy iteration (PI) method, suboptimal output-feedback controllers can be obtained directly from collected input-output trajectory data in the absence of an accurate dynamic model. The effectiveness of the proposed learning-based PI algorithm is demonstrated through a practical example of F-16 aircraft control.

1 Introduction

Dynamic programming is a powerful tool for solving sequential decision-making and optimal control problems. However, traditional dynamic programming is not applicable to real-world systems due to the "curse of dimensionality" and the "curse of modeling" [2, 29]. To address these limitations, approximate dynamic programming and reinforcement learning have been proposed for systems described by Markov decision processes with discrete time, state, and input [2, 33, 14]. In the physical world, dynamical systems are mathematically modeled as differential equations and evolve in continuous time, state, and input spaces. Stability is essential for the safe operation of these real-world engineering systems. To ensure stability, adaptive dynamic programming (ADP) has been introduced for adaptive optimal control of systems with continuous input and state spaces. Learning-based control algorithms have been developed based on ADP for various classes of linear, nonlinear, periodic, and time-delay dynamical systems, and for optimal stabilization and output regulation problems [16, 18, 11, 3, 27, 6, 8]. The applications of ADP span various fields such as autonomous driving [7, 25], human motor control [26], and wheel-legged robots [9]. However, these methods require full-state information for controller design, which is costly and sometimes impossible to measure in practice. Hence, designing a learning-based control method that uses input-output measurements instead of input-state measurements is a significant yet challenging research topic. A central challenge in developing learning-based control for systems without full-state measurement is simultaneously estimating the state and optimizing the control policy.

For learning-based output-feedback control of discrete-time systems, several approaches have been proposed in the literature. Lewis et al. [23] reconstruct the current state from a finite segment of input-output trajectory data, followed by learning-based PI and VI algorithms. Kiumarsi et al. [20] propose a learning-based control approach for linear quadratic optimal tracking of discrete-time systems. The authors of [10] discretize continuous-time systems and propose an output-feedback robust ADP algorithm to handle dynamic uncertainty. In [36], the authors propose learning-based methods for event-triggered adaptive optimal control with output feedback, while in [12], an adaptive optimal output regulation approach is proposed for discrete-time linear systems with output feedback and input delay.

Another approach to learning-based output-feedback control is to estimate the current state using a state observer, and then parameterize the control policy and value function in terms of the state observer. For example, the adaptive observer from [35, 13] is adopted for the learning-based output-feedback control of continuous- and discrete-time systems [15]. In [37], the authors adopt the adaptive observer in [1] to solve the Bellman equation by data-driven methods. Based on the observer in [34, Chapter 5], learning-based output-feedback control algorithms are developed for optimal stabilization [31], zero-sum differential games [30], and optimal output regulation [4]. However, to offset the influence of the initial estimation error of the observer and obtain an accurate state estimate, the system should run for a sufficiently long time before data collection [4, Lemma 8].

∗ This work has been supported in part by the NSF grants EPCN-1903781 and ECCS-2210320.
† L. Cui and Z. P. Jiang are with the Control and Networks Lab, Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, 370 Jay Street, Brooklyn, NY 11201 (email: l.cui, [email protected]).

In this paper, we propose a new learning-based output-feedback control approach for continuous-time linear systems. Inspired by [23], we reconstruct the current state from the continuous-time input-output trajectory, enabling the expression of the control policy and value function as functionals of the continuous-time input-output trajectory. By combining the model-based PI [21] with RL techniques, a learning-based PI algorithm is proposed such that, given an initial admissible controller, the optimal control policy is iteratively approximated using input-output data in the absence of the system model. Our contributions are as follows: 1) we extend the state reconstruction technique from discrete-time systems [23] to continuous-time systems without discretization; 2) we propose a learning-based output-feedback control approach directly designed for continuous-time systems without discretization; and 3) we analyze the theoretical convergence of the proposed learning-based PI algorithm.

The remaining contents of the paper are organized as follows. In Section 2, we review the preliminaries of optimal control. Section 3 presents the proposed state reconstruction technique for continuous-time systems. Next, in Section 4, we develop a learning-based PI algorithm and provide a theoretical analysis of its convergence. We demonstrate the effectiveness of the proposed algorithm using an example of linearized F-16 aircraft dynamics in Section 5. Finally, we conclude the paper with Section 6, where we summarize our contributions and discuss potential future directions.

Notations: S^n denotes the set of n-dimensional real symmetric matrices. |·| denotes the Euclidean norm of a vector or the Frobenius norm of a matrix. ∥·∥_∞ denotes the supremum norm of a function. C^∞(X, Y) denotes the class of smooth functions from the linear space X to the linear space Y. For a matrix A ∈ R^{m×n}, vec(A) := [a_1^T, ..., a_n^T]^T, where a_i is the ith column of A. For a symmetric matrix P ∈ S^n, vecs(P) = [p_11, 2p_12, ..., 2p_1n, p_22, 2p_23, ..., 2p_(n−1)n, p_nn]^T, vecu(P) = [2p_12, ..., 2p_1n, 2p_23, ..., 2p_(n−1)n]^T, and diag(P) = [p_11, p_22, ..., p_nn]^T. For two arbitrary vectors ν, µ ∈ R^n, vecd(ν, µ) = [ν_1µ_1, ..., ν_nµ_n]^T, vecv(ν) = [ν_1^2, ν_1ν_2, ..., ν_1ν_n, ν_2^2, ..., ν_{n−1}ν_n, ν_n^2]^T, and vecp(ν, µ) = [ν_1µ_2, ..., ν_1µ_n, ν_2µ_3, ..., ν_{n−1}µ_n]^T. [X]_{i,j} denotes the submatrix of the matrix X comprised of the rows between the ith and jth rows of X. I_n denotes the n-dimensional identity matrix. A† denotes the Moore–Penrose pseudo-inverse of A.

2 Preliminaries and Problem Formulation

In this section, the linear quadratic regulator (LQR) and model-based PI are reviewed.

2.1 System Description We consider the following continuous-time linear time-invariant system with output measurement

(2.1)  ẋ = Ax + Bu,  x(0) = x_0,
       y = Cx,

where x ∈ R^n, u ∈ R^m, and y ∈ R^q are the state, input, and output of the system; x_0 is the initial state; and A, B, and C are constant matrices with compatible dimensions. We assume that (A, B) is controllable and (C, A) is observable. Under this assumption, LQR is the problem of minimizing the following quadratic cost:

(2.2)  J(x_0, u) = ∫_0^∞ (y^T Q y + u^T R u) dt,

with Q = Q^T ⪰ 0, R = R^T ≻ 0, and (√Q C, A) being observable. If the state x can be measured directly, it is well known that the state-feedback optimal controller is u*(t) = −K* x(t), where

(2.3)  K* = R^{−1} B^T P*,

and P* is the unique positive-definite solution of the following algebraic Riccati equation (ARE)

(2.4)  A^T P + P A − P B R^{−1} B^T P + C^T Q C = 0.

In addition, (A − BK*) is Hurwitz [24, Section 6.2].

2.2 Model-based Policy Iteration The celebrated model-based PI is reviewed as the foundation of the learning-based PI approach. Given an initial stabilizing controller K_1, PI first evaluates the control policy K_i at the ith iteration (i = 1, 2, ...) by calculating the corresponding cost J(x_0, −K_i x) = x_0^T P_i x_0, and then the control gain is improved [21]. In detail, given an initial stabilizing controller, PI is presented as

1. Policy evaluation

(2.5)  A_i^T P_i + P_i A_i + C^T Q C + K_i^T R K_i = 0,  A_i = A − B K_i.

2. Policy improvement

(2.6)  K_{i+1} = R^{−1} B^T P_i.

The following lemma shows that the cost matrix P_i is monotonically decreasing and that K_i updated at each iteration maintains stability.

Lemma 2.1. [21] Starting from an initial stabilizing control gain K_1, PI has the following properties:
1) A − BK_i is Hurwitz for any i ∈ Z_+;
2) P_1 ⪰ P_2 ⪰ ··· ⪰ P_i ⪰ ··· ⪰ P*;
3) lim_{i→∞} P_i = P* and lim_{i→∞} K_i = K*.
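For readers who want to experiment with this baseline, the following minimal Python/NumPy sketch (not from the paper) implements the policy-evaluation/policy-improvement loop (2.5)-(2.6) and checks it against the ARE solution (2.4); the toy matrices A, B, C below are placeholders chosen so that the open-loop system is stable and K_1 = 0 is therefore an admissible initial gain.

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def model_based_pi(A, B, C, Q, R, K1, num_iter=15):
    """Kleinman policy iteration (2.5)-(2.6) for the output-weighted LQR cost (2.2)."""
    K = K1
    for _ in range(num_iter):
        Ai = A - B @ K
        # Policy evaluation (2.5): Ai^T Pi + Pi Ai + C^T Q C + K^T R K = 0
        Pi = solve_continuous_lyapunov(Ai.T, -(C.T @ Q @ C + K.T @ R @ K))
        # Policy improvement (2.6): K_{i+1} = R^{-1} B^T Pi
        K = np.linalg.solve(R, B.T @ Pi)
    return Pi, K

# Placeholder system (chosen here only for illustration): A is Hurwitz, so K1 = 0 is stabilizing.
A = np.array([[-1.0, 1.0], [0.0, -2.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Q, R = np.eye(1), np.eye(1)

P_pi, K_pi = model_based_pi(A, B, C, Q, R, K1=np.zeros((1, 2)))
P_are = solve_continuous_are(A, B, C.T @ Q @ C, R)   # P* from the ARE (2.4)
print(np.linalg.norm(P_pi - P_are))                  # close to machine precision
```

Lemma 2.1 is visible numerically in such a run: the iterates P_i decrease monotonically toward P*, which is why a moderate fixed number of iterations suffices here.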

Given the model-based PI for solving the LQR optimal control problem, the problem to be studied in this paper is formulated as follows.

Problem 1. In the absence of the accurate system matrices (A, B, and C), develop a learning-based PI algorithm to iteratively approximate the optimal output-feedback controller for minimizing (2.2) using the input-output trajectories of system (2.1).

3 State Reconstruction

Since the optimal controller (2.3) requires full state measurement, in this section we will reconstruct the state x(t) using a segment of the historical input-output trajectory within the interval [t − T, t]. Let u_t(θ) := u(t + θ) and y_t(θ) := y(t + θ), ∀θ ∈ [−T, 0], denote the segments of the input and output trajectories, respectively. For any θ ∈ [−T, 0], according to [5, Equation (4.5)], x(t) can be expressed as

(3.7)  x(t) = e^{−Aθ} x(t + θ) + ∫_{t+θ}^{t} e^{A(t−τ)} B u(τ) dτ.

Pre-multiplying (3.7) with e^{A^T θ} C^T C e^{Aθ} and integrating both sides of the equation from −T to 0 with respect to θ, we have

(3.8)  ∫_{−T}^{0} e^{A^T θ} C^T y_t(θ) dθ = G x(t) − ∫_{−T}^{0} e^{A^T θ} C^T ∫_{t+θ}^{t} C e^{A(t+θ−τ)} B u(τ) dτ dθ,

where G is defined as

(3.9)  G := ∫_{−T}^{0} e^{A^T θ} C^T C e^{Aθ} dθ.

Lemma 3.1. G is nonsingular for any T > 0.

Proof. Via integration by substitution with µ = −θ, it is obtained that

(3.10)  G = ∫_{0}^{T} e^{−A^T µ} C^T C e^{−Aµ} dµ.

It is observed that G is the observability Gramian for the pair (C, −A). Under the assumption that (C, A) is observable, G is nonsingular for any T > 0 [5, Theorem 6.4].

With Lemma 3.1 and by (3.8), the state x(t) is reconstructed as

(3.11)  x(t) = G^{−1} ( ∫_{−T}^{0} e^{A^T θ} C^T y_t(θ) dθ + ∫_{−T}^{0} e^{A^T θ} C^T ∫_{t+θ}^{t} C e^{A(t+θ−τ)} B u(τ) dτ dθ ).

Via integration by substitution with ν = τ − t, and changing the order of integration between ν and θ, it follows that

(3.12)  x(t) = ∫_{−T}^{0} G^{−1} e^{A^T θ} C^T y_t(θ) dθ + ∫_{−T}^{0} G^{−1} ( ∫_{−T}^{ν} e^{A^T θ} C^T C e^{Aθ} dθ ) e^{−Aν} B u_t(ν) dν.

Define M(θ) := [M_u(θ), M_y(θ)] and z(t) := [u^T(t), y^T(t)]^T, where

(3.13)  M_u(θ) := G^{−1} ( ∫_{−T}^{θ} e^{A^T τ} C^T C e^{Aτ} dτ ) e^{−Aθ} B,
        M_y(θ) := G^{−1} e^{A^T θ} C^T.

Following (3.12), x(t) is reconstructed as

(3.14)  x(t) = ∫_{−T}^{0} M(θ) z_t(θ) dθ.

It is seen from (3.14) that x(t) is expressed in terms of a segment of the input-output trajectory within [t − T, t].

By (3.14), the optimal controller and the minimal value function can be rewritten as

(3.15)  u*(t) = −K* x(t) = −∫_{−T}^{0} K̄*(θ) z_t(θ) dθ,
        J(x(t), u*) = ∫_{−T}^{0} ∫_{−T}^{0} z_t^T(ξ) P̄*(ξ, θ) z_t(θ) dξ dθ,

where

(3.16)  K̄*(θ) = K* M(θ),
        P̄*(ξ, θ) = M^T(ξ) P* M(θ).

Remark 1. It is noticed that in (3.14), a segment of the input-output trajectory is applied to reconstruct the current state. Therefore, (3.14) can be considered as an extension of the state-reconstruction method for discrete-time systems [23, Lemma 1] to continuous-time systems. The difference is that in [23, Lemma 1], M is a finite-dimensional matrix, while in (3.14), M(θ) is a matrix-valued function.

Remark 2. It is seen from (3.15) that the optimal controller and value function are expressed as functionals of the input-output trajectory within [t − T, t]. The kernels of the optimal controller and value function are K̄*(θ) and P̄*(ξ, θ), respectively.

The computation of the kernel matrices K̄*(θ) and P̄*(ξ, θ) depends on the system matrices (A, B, and C). In the next section, we will design a learning-based PI algorithm to approximate K̄*(θ) and P̄*(ξ, θ) without requiring the accurate system matrices.
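When the model is available, the kernels in (3.13) and the reconstruction formula (3.14) can be checked numerically. The sketch below (ours, not the paper's code) approximates the integrals by a simple Riemann sum on a grid over [−T, 0]; the system matrices, window length T, grid size, test input, and simulation horizon are arbitrary choices for this illustration.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

# Placeholder observable pair (C, A); only observability of (C, A) is needed for Lemma 3.1.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])

T = 0.5
theta = np.linspace(-T, 0.0, 501)          # grid on [-T, 0]
h = theta[1] - theta[0]                    # quadrature step

# G in (3.9): after the substitution (3.10), the observability Gramian of (C, -A) over [0, T]
Phi = [expm(A * th) for th in theta]       # e^{A theta}
G = h * sum(P.T @ C.T @ C @ P for P in Phi)
Ginv = np.linalg.inv(G)

# Kernels (3.13): M_y(theta) and M_u(theta) on the grid
My = [Ginv @ P.T @ C.T for P in Phi]
Mu, cum = [], np.zeros_like(A)
for P in Phi:
    cum = cum + h * P.T @ C.T @ C @ P                 # int_{-T}^{theta} e^{A^T tau} C^T C e^{A tau} d tau
    Mu.append(Ginv @ cum @ np.linalg.inv(P) @ B)      # times e^{-A theta} B

# Simulate with a known input, then reconstruct x(t) from (u, y) on [t-T, t] via (3.14)
u = lambda t: np.array([np.sin(3.0 * t)])
t_end = 2.0
sol = solve_ivp(lambda t, x: A @ x + B @ u(t), (0.0, t_end), [1.0, -1.0],
                t_eval=t_end + theta, rtol=1e-9, atol=1e-12)
y_seg = (C @ sol.y).T                      # y(t+theta) for theta in [-T, 0]
u_seg = np.array([u(t) for t in sol.t])    # u(t+theta)
x_hat = h * sum(Mu[k] @ u_seg[k] + My[k] @ y_seg[k] for k in range(len(theta)))
print(x_hat, sol.y[:, -1])                 # reconstruction vs. simulated x(t), equal up to quadrature error
```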

4 Learning-Based Policy Iteration

In this section, we will develop a learning-based PI algorithm using the input-output data collected from system (2.1).

4.1 Algorithm Development Recalling that A_i = A − BK_i, where K_i is the updated feedback gain at the ith iteration of PI, system (2.1) can be rewritten as

(4.1)  ẋ = A_i x + B(K_i x + u),
       y = Cx.

Taking the derivative of x^T P_i x along the trajectories of (4.1) yields

(4.2)  d/dt [x^T(t) P_i x(t)] = x^T(t)(A_i^T P_i + P_i A_i) x(t) + 2u^T(t) B^T P_i x(t) + 2x^T(t) P_i B K_i x(t).

Plugging (2.5) into (4.2), we have

(4.3)  d/dt [x^T(t) P_i x(t)] = −x^T(t)(C^T Q C + K_i^T R K_i) x(t) + 2u^T(t) B^T P_i x(t) + 2x^T(t) K_i^T B^T P_i x(t).

To simplify the notation, define

(4.4)  K̄_i(θ) := K_i M(θ),
       P̄_i(ξ, θ) := M^T(ξ) P_i M(θ).

Integrating (4.3) from t_l to t_{l+1} and considering (2.6) and (3.14), we have

(4.5)  ∫_{−T}^{0} ∫_{−T}^{0} z_t^T(ξ) P̄_i(ξ, θ) z_t(θ) dξ dθ |_{t_l}^{t_{l+1}} = −∫_{t_l}^{t_{l+1}} y^T Q y dt
       − ∫_{t_l}^{t_{l+1}} ∫_{−T}^{0} ∫_{−T}^{0} z_t^T(ξ) K̄_i^T(ξ) R K̄_i(θ) z_t(θ) dξ dθ dt
       + ∫_{t_l}^{t_{l+1}} ∫_{−T}^{0} 2u^T(t) R K̄_{i+1}(θ) z_t(θ) dθ dt
       + ∫_{t_l}^{t_{l+1}} ∫_{−T}^{0} ∫_{−T}^{0} 2 z_t^T(ξ) K̄_i^T(ξ) R K̄_{i+1}(θ) z_t(θ) dξ dθ dt.

Since P̄_i(ξ, θ) ∈ C^∞([−T, 0]^2, S^{q+m}) and K̄_i(θ) ∈ C^∞([−T, 0], R^{m×(q+m)}), we will use linear combinations of basis functions to parameterize these functions. Let Φ(θ), Ψ(ξ, θ), and Λ(ξ, θ) denote N-dimensional linearly-independent basis functions. By approximation theory [28], we have

(4.6)  diag(P̄_i(ξ, θ)) = W_i^N Ψ(ξ, θ) + e^N_{Ψ,i}(ξ, θ),
       vecu(P̄_i(ξ, θ)) = V_i^N Λ(ξ, θ) + e^N_{Λ,i}(ξ, θ),
       vec(K̄_i(θ)) = U_i^N Φ(θ) + e^N_{Φ,i}(θ),

where W_i^N ∈ R^{(q+m)×N}, V_i^N ∈ R^{n_1×N} with n_1 = (q+m−1)(q+m)/2, and U_i^N ∈ R^{(q+m)m×N}; e^N_{Ψ,i} ∈ C^∞([−T, 0]^2, R^{q+m}), e^N_{Λ,i} ∈ C^∞([−T, 0]^2, R^{n_1}), and e^N_{Φ,i} ∈ C^∞([−T, 0], R^{(q+m)m}) are truncation errors and they uniformly converge to zero, i.e.

(4.7)  lim_{N→∞} ∥e^N_{Ψ,i}(ξ, θ)∥_∞ = 0,  lim_{N→∞} ∥e^N_{Λ,i}(ξ, θ)∥_∞ = 0,  lim_{N→∞} ∥e^N_{Φ,i}(θ)∥_∞ = 0.

The main idea of learning-based PI is to approximate the weighting matrices W_i^N, V_i^N, and U_i^N using input-output data of system (4.1). Define Θ_i^N as

(4.8)  Θ_i^N = [vec^T(W_i^N), vec^T(V_i^N), vec^T(U_{i+1}^N)]^T.

Let Θ̂_i^N denote the approximation of Θ_i^N. By the definition of Θ_i^N and (4.6), P̄̂_i and K̄̂_i, the approximations of P̄_i and K̄_i respectively, can be reconstructed from Θ̂_i^N, i.e.

(4.9)  Ŵ_i^N = vec^{−1}([Θ̂_i^N]_{1,n_2}),
       diag(P̄̂_i(ξ, θ)) = Ŵ_i^N Ψ(ξ, θ),
       V̂_i^N = vec^{−1}([Θ̂_i^N]_{n_2+1,n_3}),
       vecu(P̄̂_i(ξ, θ)) = V̂_i^N Λ(ξ, θ),
       Û_{i+1}^N = vec^{−1}([Θ̂_i^N]_{n_3+1,n_4}),
       K̄̂_{i+1}(θ) = vec^{−1}(Û_{i+1}^N Φ(θ)),

where n_2 = (q+m)N, n_3 = n_2 + n_1 N, and n_4 = n_3 + (q+m)mN.

Based on the parameterization in (4.6), we can transform (4.5) into a linear equation with respect to the weighting matrices encoded in Θ_i^N. In detail, define the data-dependent matrices of the form

(4.10)  Γ_{Ψzz}(t) := ∫_{−T}^{0} ∫_{−T}^{0} Ψ^T(ξ, θ) ⊗ vecd^T(z_t(ξ), z_t(θ)) dξ dθ,
        Γ_{Λzz}(t) := ∫_{−T}^{0} ∫_{−T}^{0} Λ^T(ξ, θ) ⊗ vecp^T(z_t(ξ), z_t(θ)) dξ dθ,
        Γ_{Φzu}(t) := ∫_{−T}^{0} Φ^T(θ) ⊗ z_t^T(θ) ⊗ (u^T(t) R) dθ,
        Γ_{ΦzK̄̂_i}(t) := ∫_{−T}^{0} Φ^T(θ) ⊗ z_t^T(θ) ⊗ (û_i^T(t) R) dθ,

where

(4.11)  û_i(t) = −∫_{−T}^{0} K̄̂_i(θ) z_t(θ) dθ.
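To indicate how the regressors in (4.10) might be assembled from sampled data, here is a rough Python sketch under simplifying assumptions: the basis functions, grid, and random data below are placeholders rather than the paper's choices, plain 1-D arrays stand in for the row regressors, and the Γ_{ΦzK̄̂_i} term is omitted because it has the same structure as Γ_{Φzu} with û_i(t) in place of u(t).

```python
import numpy as np

def vecd(nu, mu):
    """[nu_1 mu_1, ..., nu_n mu_n]^T (see Notations)."""
    return nu * mu

def vecp(nu, mu):
    """[nu_1 mu_2, ..., nu_1 mu_n, nu_2 mu_3, ..., nu_{n-1} mu_n]^T (see Notations)."""
    n = len(nu)
    return np.array([nu[i] * mu[j] for i in range(n) for j in range(i + 1, n)])

# Placeholder basis functions (low-order polynomials); Psi is symmetric in (xi, theta) as Remark 4 requires.
Phi_basis = lambda th: np.array([1.0, th, th**2, th**3, th**4])
Psi_basis = lambda xi, th: np.array([1.0, xi + th, xi**2 + th**2, xi * th])
Lam_basis = lambda xi, th: np.kron(Phi_basis(xi), Phi_basis(th))

def data_matrices(theta, h, z_seg, u_now, R):
    """Riemann-sum approximations of Gamma_Psi_zz(t), Gamma_Lam_zz(t), Gamma_Phi_zu(t) in (4.10)."""
    g_psi, g_lam = 0.0, 0.0
    for k, xi in enumerate(theta):
        for l, th in enumerate(theta):
            g_psi = g_psi + h * h * np.kron(Psi_basis(xi, th), vecd(z_seg[k], z_seg[l]))
            g_lam = g_lam + h * h * np.kron(Lam_basis(xi, th), vecp(z_seg[k], z_seg[l]))
    g_phi = sum(h * np.kron(np.kron(Phi_basis(th), z_seg[l]), u_now @ R) for l, th in enumerate(theta))
    return g_psi, g_lam, g_phi

# Smoke test with random data; m = q = 1, so z_t(theta) = [u(t+theta), y(t+theta)]^T has dimension 2.
rng = np.random.default_rng(1)
T = 0.1
theta = np.linspace(-T, 0.0, 6)
h = theta[1] - theta[0]
z_seg = rng.standard_normal((len(theta), 2))
u_now, R = rng.standard_normal(1), np.eye(1)
g_psi, g_lam, g_phi = data_matrices(theta, h, z_seg, u_now, R)
print(g_psi.shape, g_lam.shape, g_phi.shape)   # N_Psi*(q+m), N_Lam*n_1, N_Phi*(q+m)*m
```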

With (4.10), each term in (4.5) is expressed as

(4.12)  ∫_{−T}^{0} ∫_{−T}^{0} z_t^T(ξ) P̄_i(ξ, θ) z_t(θ) dξ dθ = ϵ^N_{1,i}(t) + Γ_{Ψzz}(t) vec(W_i^N) + Γ_{Λzz}(t) vec(V_i^N),
        ∫_{−T}^{0} 2u^T(t) R K̄_{i+1}(θ) z_t(θ) dθ = 2Γ_{Φzu}(t) vec(U_{i+1}^N) + ϵ^N_{2,i}(t),
        ∫_{−T}^{0} ∫_{−T}^{0} 2 z_t^T(ξ) K̄_i^T(ξ) R K̄_{i+1}(θ) z_t(θ) dξ dθ = 2Γ_{ΦzK̄̂_i}(t) vec(U_{i+1}^N) + ϵ^N_{3,i}(t) + ρ^N_{1,i}(t),

where ϵ^N_{1,i}(t), ϵ^N_{2,i}(t), ϵ^N_{3,i}(t), and ρ^N_{1,i}(t) are induced by truncation errors and expressed as

(4.13)  ϵ^N_{1,i}(t) = ∫_{−T}^{0} ∫_{−T}^{0} vecd^T(z_t(ξ), z_t(θ)) e^N_{Ψ,i}(ξ, θ) + vecp^T(z_t(ξ), z_t(θ)) e^N_{Λ,i}(ξ, θ) dξ dθ,
        ϵ^N_{2,i}(t) = ∫_{−T}^{0} [2z_t^T(θ) ⊗ (u^T(t) R)] e^N_{Φ,i+1}(θ) dθ,
        ϵ^N_{3,i}(t) = ∫_{−T}^{0} ∫_{−T}^{0} [2z_t^T(θ) ⊗ (z_t^T(ξ) K̄̂_i^T(ξ) R)] e^N_{Φ,i+1}(θ) dξ dθ,
        ρ^N_{1,i}(t) = ∫_{−T}^{0} ∫_{−T}^{0} 2 z_t^T(ξ) (K̄_i(ξ) − K̄̂_i(ξ))^T R K̄_{i+1}(θ) z_t(θ) dξ dθ.

Plugging (4.12) into (4.5), we have

(4.14)  H_{i,l} Θ_i^N + E^N_{i,l} = Y_{i,l},

where

(4.15)  H_{i,l} = [ Γ_{Ψzz}(t)|_{t_l}^{t_{l+1}}, Γ_{Λzz}(t)|_{t_l}^{t_{l+1}}, −2 ∫_{t_l}^{t_{l+1}} ( Γ_{Φzu}(t) + Γ_{ΦzK̄̂_i}(t) ) dt ],
        Y_{i,l} = −∫_{t_l}^{t_{l+1}} ( y^T Q y + ∫_{−T}^{0} ∫_{−T}^{0} z_t^T(ξ) K̄̂_i^T(ξ) R K̄̂_i(θ) z_t(θ) dξ dθ ) dt,
        E^N_{i,l} = ∫_{t_l}^{t_{l+1}} ( ϵ^N_{1,i}(t) − ϵ^N_{2,i}(t) − ϵ^N_{3,i}(t) − ρ^N_{1,i}(t) + ρ^N_{2,i}(t) ) dt,

and ρ^N_{2,i}(t), induced by the truncation errors, is expressed as

(4.16)  ρ^N_{2,i}(t) = ∫_{−T}^{0} ∫_{−T}^{0} z_t^T(ξ) [ (K̄_i(ξ) − K̄̂_i(ξ))^T R K̄_i(θ) + K̄̂_i^T(ξ) R (K̄_i(θ) − K̄̂_i(θ)) ] z_t(θ) dξ dθ.

Stacking (4.14) for l = 1, ..., L yields

(4.17)  H_i Θ_i^N + E_i^N = Y_i,
        H_i = [H_{i,1}^T, ..., H_{i,L}^T]^T,
        E_i^N = [E^N_{i,1}, ..., E^N_{i,L}]^T,
        Y_i = [Y_{i,1}, ..., Y_{i,L}]^T.

Assumption 1. Given N ∈ Z_+, there exist L* ∈ Z_+ and α > 0, such that for all L > L* and i ∈ Z_+, the following inequality holds:

(4.18)  (1/L) H_i^T H_i ⪰ αI.

Remark 3. Assumption 1 is reminiscent of the persistent excitation (PE) condition in adaptive control [19, 38]. It guarantees the uniqueness of the least-squares solution to (4.17). As in the literature of ADP-based learning control [17, 22], one can fulfill it by adding exploration noise, such as sinusoidal signals and random noise.

Under Assumption 1 and according to (4.17), Θ_i^N is approximated by the least-squares method as

(4.19)  Θ̂_i^N = H_i† Y_i.

Now, we are ready to present the learning-based PI algorithm for optimal output-feedback controller design in Algorithm 1.

Algorithm 1 Learning-based Policy Iteration
1: Select the basis functions Φ(θ), Ψ(ξ, θ), and Λ(ξ, θ).
2: Select the sampling instants t_k ∈ [t_1, t_{L+1}].
3: Select a stabilizing gain K̄̂_1(θ), and use u(t) = −∫_{−T}^{0} K̄̂_1(θ) z_t(θ) dθ + e(t), where e is an exploration signal, to explore system (2.1) and collect the input-output data u(t), y(t), t ∈ [0, t_{L+1}].
4: Set the threshold δ > 0 and i = 1.
5: repeat
6:   Construct H_i and Y_i by (4.15) and (4.17).
7:   Get Θ̂_i^N by solving (4.19).
8:   Get K̄̂_{i+1}(θ) by (4.9).
9:   i ← i + 1
10: until |Θ̂_i^N − Θ̂_{i−1}^N| < δ.
11: Use û_i(z_t) = −∫_{−T}^{0} K̄̂_i(θ) z_t(θ) dθ as the approximated optimal controller.

Remark 4. Since P̄_i^T(ξ, θ) = P̄_i(θ, ξ), the diagonal elements of P̄_i(ξ, θ) satisfy diag(P̄_i(ξ, θ)) = diag(P̄_i(θ, ξ)). Hence, the vector of basis functions Ψ should be chosen to satisfy Ψ(ξ, θ) = Ψ(θ, ξ).

4.2 Convergence Analysis The convergence of the proposed learning-based PI algorithm is rigorously studied. The following lemma shows that at each iteration, the value functional P̄_i(ξ, θ) and the feedback gain K̄_{i+1}(θ) are well approximated as long as the number of basis functions N is large enough.

Lemma 4.1. For any i ∈ Z_+ and η > 0, there exists N*(i, η) > 0, such that if N ≥ N*(i, η),

(4.20)  ∥P̄_i(ξ, θ) − P̄̂_i(ξ, θ)∥_∞ ≤ η,
        ∥K̄_{i+1}(θ) − K̄̂_{i+1}(θ)∥_∞ ≤ η.

Proof. Let Θ̃_i^N = Θ_i^N − Θ̂_i^N and

(4.21)  Ê_i^N := Y_i − H_i Θ̂_i^N.
Subtracting (4.21) from (4.17) yields

(4.22)  Ê_i^N = H_i Θ̃_i^N + E_i^N.

Since Θ̂_i^N is the least-squares solution to (4.17), it is obtained that

(4.23)  (1/L) (Ê_i^N)^T Ê_i^N ≤ (1/L) (E_i^N)^T E_i^N.

It follows from (4.22) and (4.23) that

(4.24)  (1/L) (Θ̃_i^N)^T H_i^T H_i Θ̃_i^N = (1/L) (Ê_i^N − E_i^N)^T (Ê_i^N − E_i^N) ≤ (4/L) (E_i^N)^T E_i^N.

Under Assumption 1, we have

(4.25)  (Θ̃_i^N)^T Θ̃_i^N ≤ (4/(αL)) (E_i^N)^T E_i^N ≤ (4/α) max_{1≤l≤L} (E^N_{i,l})^2.

Then, the lemma is demonstrated by induction. When i = 1, K̄_1(θ) = K̄̂_1(θ), and ρ^N_{1,1} = ρ^N_{2,1} = 0 is obtained from (4.13) and (4.16). In addition, it follows from (4.7) and (4.13) that lim_{N→∞} ϵ^N_{1,1}(t) = lim_{N→∞} ϵ^N_{2,1}(t) = lim_{N→∞} ϵ^N_{3,1}(t) = 0. Consequently, lim_{N→∞} (E^N_{1,l})^2 = 0, and lim_{N→∞} (Θ̃_1^N)^T Θ̃_1^N = 0 is obtained from (4.25). Comparing (4.6) with (4.9), it is obtained that for any η > 0, there exists N*(1, η) > 0, such that if N ≥ N*(1, η), (4.20) holds for i = 1.

Suppose that for some i > 1 and any η > 0, there exists N*(i, η) > 0, such that if N ≥ N*(i, η), (4.20) holds. Then, it follows from (4.13) and (4.16) that lim_{N→∞} ρ^N_{1,i+1} = lim_{N→∞} ρ^N_{2,i+1} = 0. In addition, from (4.7) and (4.13), we have lim_{N→∞} ϵ^N_{1,i+1}(t) = lim_{N→∞} ϵ^N_{2,i+1}(t) = lim_{N→∞} ϵ^N_{3,i+1}(t) = 0. Consequently, lim_{N→∞} (E^N_{i+1,l})^2 = 0, and lim_{N→∞} (Θ̃_{i+1}^N)^T Θ̃_{i+1}^N = 0 is obtained from (4.25). Comparing (4.6) with (4.9), it is obtained that for any η > 0, there exists N*(i + 1, η) > 0, such that if N ≥ N*(i + 1, η), (4.20) holds for i + 1. The proof is thus completed by induction.

Theorem 4.1. For any η > 0, there exist i* ∈ Z_+ and N** > 0, such that if N ≥ N**,

(4.26)  ∥P̄*(ξ, θ) − P̄̂_{i*}(ξ, θ)∥_∞ ≤ η,
        ∥K̄*(θ) − K̄̂_{i*}(θ)∥_∞ ≤ η.

Proof. According to Lemma 2.1, there exists i* ∈ Z_+ such that

(4.27)  ∥P̄*(ξ, θ) − P̄_{i*}(ξ, θ)∥_∞ ≤ η/2,
        ∥K̄*(θ) − K̄_{i*}(θ)∥_∞ ≤ η/2.

Furthermore, by Lemma 4.1, if N ≥ N*(i*, η/2),

(4.28)  ∥P̄_{i*}(ξ, θ) − P̄̂_{i*}(ξ, θ)∥_∞ ≤ η/2,
        ∥K̄_{i*}(θ) − K̄̂_{i*}(θ)∥_∞ ≤ η/2.

Therefore, (4.26) is obtained by the triangle inequality.

5 Application to F-16 Aircraft Control

In this section, we demonstrate the efficacy of the proposed learning-based PI algorithm by the example of the linearized F-16 aircraft model [32]. The state of the system is x = [ζ, q, δ_e]^T, where ζ is the angle of attack, q is the pitch rate, and δ_e is the elevator deflection angle. Then, the linearized F-16 dynamics is

(5.29)  ẋ = [ −1.01887   0.90506  −0.00215 ]     [  0   ]
            [  0.82225  −1.07741  −0.17555 ] x + [  0   ] u,
            [  0          0       −20.2    ]     [ 20.2 ]
        y = [ 0  57.2958  0 ] x.

The factor 57.2958 is used to convert the unit from radians to degrees. The length of the trajectory used to reconstruct the state is T = 0.1. The basis functions are chosen as Ψ(ξ, θ) = [1, ξ + θ, ξ^2 + θ^2, ξθ, ξ^3 + θ^3, ξ^2θ + ξθ^2, ξ^4 + θ^4, ξ^3θ + ξθ^3, ξ^2θ^2, ξ^4θ + ξθ^4, ξ^3θ^2 + ξ^2θ^3, ξ^4θ^2 + ξ^2θ^4, ξ^3θ^3, ξ^4θ^3 + ξ^3θ^4, ξ^4θ^4]^T, Λ = [1, ξ, ξ^2, ξ^3, ξ^4]^T ⊗ [1, θ, θ^2, θ^3, θ^4]^T, and Φ = [1, θ, θ^2, θ^3, θ^4]^T. The cost weighting matrices are Q = 1 and R = 1. The initial state is x(0) = [0.5, 0.5, 0.5]^T.
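Before describing the data-collection phase, we note for reference how the model-based benchmark K* behind Fig. 1 can be obtained from (2.3)-(2.4) for the model (5.29). The short sketch below assumes SciPy is available and is only the known-model baseline, not the learning-based algorithm; by (3.16), the corresponding kernel is K̄*(θ) = K* M(θ).

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Linearized F-16 dynamics (5.29)
A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -20.2   ]])
B = np.array([[0.0], [0.0], [20.2]])
C = np.array([[0.0, 57.2958, 0.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])

# Model-based benchmark: P* solves the ARE (2.4), and K* = R^{-1} B^T P* as in (2.3).
P_star = solve_continuous_are(A, B, C.T @ Q @ C, R)
K_star = np.linalg.solve(R, B.T @ P_star)
print(K_star)   # state-feedback gain; the output-feedback kernel is Kbar*(theta) = K_star @ M(theta)
```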

From t = 0 s to t = 10 s, the exploratory/behavior policy is u(t) = 0.2 Σ_{i=1}^{500} sin(w_i t), where the w_i (i = 1, ..., 500) are randomly sampled from the uniform distribution over [−250, 250]. The integration interval is t_{k+1} − t_k = 0.01. For the PI algorithm, the input-output data is collected from t = 0 s to t = 10 s and Algorithm 1 starts at t = 10 s. The algorithm converges after eleven iterations, when the stopping criterion |Θ̂_i^N − Θ̂_{i−1}^N| < 0.01 is satisfied. The relative approximation error ∥K̄̂_i(θ) − K̄*(θ)∥_∞ / ∥K̄*(θ)∥_∞ is plotted in Fig. 1. It is seen that at the eleventh iteration, the approximation error is ∥K̄̂_11(θ) − K̄*(θ)∥_∞ / ∥K̄*(θ)∥_∞ = 3.20%. The state trajectory is shown in Fig. 2. It is observed that the state quickly converges to zero after the controller is updated by Algorithm 1.

[Figure 1: Convergence of K̄̂_i(θ) to the optimal value K̄*(θ) by the learning-based PI algorithm.]

[Figure 2: Trajectory of the state. The markers indicate when data collection ends and when the controller is updated.]

6 Conclusion

This paper has proposed a novel learning-based output-feedback control approach for adaptive optimal control of continuous-time linear systems. The first major contribution of the paper is extending the state reconstruction technique in [23] to continuous-time systems without discretizing the system dynamics. By integrating RL techniques, a learning-based PI algorithm is developed and its convergence is theoretically analyzed. The efficacy of the proposed algorithm is validated by a practical example arising from F-16 flight control. Our future work will be directed at extending the proposed learning-based control methodology to robust ADP and to multi-agent systems with output feedback.

References

[1] F. Abdollahi, H. Talebi, and R. Patel, A stable neural network-based observer with application to flexible-joint manipulators, IEEE Transactions on Neural Networks, 17 (2006), pp. 118–129.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[3] T. Bian and Z. P. Jiang, Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design, Automatica, 71 (2016), pp. 348–360.
[4] C. Chen, L. Xie, K. Xie, F. L. Lewis, and S. Xie, Adaptive optimal output tracking of continuous-time systems via output-feedback-based reinforcement learning, Automatica, 146 (2022), p. 110581.
[5] C.-T. Chen, Linear System Theory and Design, Oxford University Press, New York, NY, 3rd ed., 1999.
[6] L. Cui and Z. P. Jiang, A reinforcement learning look at risk-sensitive linear quadratic Gaussian control, arXiv preprint arXiv:2212.02072, (2022).
[7] L. Cui, K. Ozbay, and Z. P. Jiang, Combined longitudinal and lateral control of autonomous vehicles based on reinforcement learning, in 2021 American Control Conference (ACC), 2021, pp. 1929–1934.
[8] L. Cui, B. Pang, and Z. P. Jiang, Learning-based adaptive optimal control of linear time-delay systems: A policy iteration approach, arXiv preprint arXiv:2210.00204, (2022).
[9] L. Cui, S. Wang, J. Zhang, D. Zhang, J. Lai, Y. Zheng, Z. Zhang, and Z. P. Jiang, Learning-based balance control of wheel-legged robots, IEEE Robotics and Automation Letters, 6 (2021), pp. 7667–7674.

[10] W. Gao, Y. Jiang, Z. P. Jiang, and T. Chai, Output-feedback adaptive optimal control of interconnected systems based on robust adaptive dynamic programming, Automatica, 72 (2016), pp. 37–45.
[11] W. Gao and Z. P. Jiang, Adaptive dynamic programming and adaptive optimal output regulation of linear systems, IEEE Transactions on Automatic Control, 61 (2016), pp. 4164–4169.
[12] W. Gao and Z. P. Jiang, Adaptive optimal output regulation of time-delay systems via measurement feedback, IEEE Transactions on Neural Networks and Learning Systems, 30 (2019), pp. 938–945.
[13] A. Guyader and Q. Zhang, Adaptive observer for discrete time linear time varying systems, IFAC Proceedings Volumes, 36 (2003), pp. 1705–1710. 13th IFAC Symposium on System Identification, Rotterdam, The Netherlands, 27-29 August, 2003.
[14] R. A. Howard, Dynamic Programming and Markov Processes, MIT Press, Cambridge, MA, 1960.
[15] Y. Jiang and Z. P. Jiang, Approximate dynamic programming for output feedback control, in Proceedings of the 29th Chinese Control Conference, 2010, pp. 5815–5820.
[16] Y. Jiang and Z. P. Jiang, Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics, Automatica, 48 (2012), pp. 2699–2704.
[17] Y. Jiang and Z. P. Jiang, Robust Adaptive Dynamic Programming, Wiley-IEEE Press, NJ, USA, 2017.
[18] Z. P. Jiang, T. Bian, and W. Gao, Learning-based control: A tutorial and some recent results, Foundations and Trends in Systems and Control, 8 (2020), pp. 176–284.
[19] Z. P. Jiang, C. Prieur, and A. Astolfi (Editors), Trends in Nonlinear and Adaptive Control: A Tribute to Laurent Praly for His 65th Birthday, Springer Nature, NY, USA, 2021.
[20] B. Kiumarsi, F. L. Lewis, M.-B. Naghibi-Sistani, and A. Karimpour, Optimal tracking control of unknown discrete-time linear systems using input-output measured data, IEEE Transactions on Cybernetics, 45 (2015), pp. 2770–2779.
[21] D. Kleinman, On an iterative technique for Riccati equation computations, IEEE Transactions on Automatic Control, 13 (1968), pp. 114–115.
[22] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, NJ, USA, 2013.
[23] F. L. Lewis and K. G. Vamvoudakis, Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41 (2011), pp. 14–25.
[24] D. Liberzon, Calculus of Variations and Optimal Control Theory: A Concise Introduction, Princeton University Press, NJ, USA, 2012.
[25] T. Liu, L. Cui, B. Pang, and Z. P. Jiang, Data-driven adaptive optimal control of mixed-traffic connected vehicles in a ring road, in 60th IEEE Conference on Decision and Control (CDC), 2021, pp. 77–82.
[26] B. Pang, L. Cui, and Z. P. Jiang, Human motor learning is robust to control-dependent noise, Biological Cybernetics, 116 (2022), pp. 307–325.
[27] B. Pang and Z. P. Jiang, Adaptive optimal control of linear periodic systems: An off-policy value iteration approach, IEEE Transactions on Automatic Control, 66 (2021), pp. 888–894.
[28] M. J. D. Powell, Approximation Theory and Methods, Cambridge University Press, New York, NY, 1981.
[29] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley Series in Probability and Statistics, Wiley, Hoboken, NJ, USA, 2nd ed., 2011.
[30] S. A. A. Rizvi and Z. Lin, Output feedback adaptive dynamic programming for linear differential zero-sum games, Automatica, 122 (2020), p. 109272.
[31] S. A. A. Rizvi and Z. Lin, Adaptive dynamic programming for model-free global stabilization of control constrained continuous-time systems, IEEE Transactions on Cybernetics, 52 (2022), pp. 1048–1060.
[32] B. L. Stevens, F. L. Lewis, and E. N. Johnson, Aircraft Control and Simulation: Dynamics, Controls Design, and Autonomous Systems, John Wiley & Sons, Hoboken, NJ, 2015.
[33] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 2018.
[34] G. Tao, Adaptive Control Design and Analysis, John Wiley & Sons, Hoboken, NJ, 2003.
[35] Q. Zhang, Adaptive observer for multiple-input-multiple-output (MIMO) linear time-varying systems, IEEE Transactions on Automatic Control, 47 (2002), pp. 525–529.
[36] F. Zhao, W. Gao, T. Liu, and Z. P. Jiang, Adaptive optimal output regulation of linear discrete-time systems based on event-triggered output-feedback, Automatica, 137 (2022), p. 110103.
[37] L. M. Zhu, H. Modares, G. O. Peen, F. L. Lewis, and B. Yue, Adaptive suboptimal output-feedback control for linear systems using integral reinforcement learning, IEEE Transactions on Control Systems Technology, 23 (2015), pp. 264–273.
[38] K. J. Åström and B. Wittenmark, Adaptive Control, 2nd Edition, Addison-Wesley, MA, USA, 1997.
