
Automatica 148 (2023) 110761


Brief paper

Linear quadratic tracking control of unknown systems: A two-phase reinforcement learning method✩

Jianguo Zhao a,b, Chunyu Yang a,b,∗, Weinan Gao c, Hamidreza Modares d, Xinkai Chen e, Wei Dai a,b

a Engineering Research Center of Intelligent Control for Underground Space, Ministry of Education, Xuzhou, 221116, China
b School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
c State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, 110819, China
d Department of Mechanical Engineering, Michigan State University, East Lansing, MI 48863, USA
e Department of Electronic and Information Systems, Shibaura Institute of Technology, Saitama, 337-8570, Japan

✩ This work was supported by the National Natural Science Foundation of China under Grant 61873272, Grant 62073327, and Grant 62273350, and in part by the Natural Science Foundation of Jiangsu Province under Grant BK20200086 and Grant BK20200631. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Kyriakos G. Vamvoudakis under the direction of Editor Miroslav Krstic.
∗ Corresponding author at: Engineering Research Center of Intelligent Control for Underground Space, Ministry of Education, Xuzhou, 221116, China.
E-mail addresses: [email protected] (J. Zhao), [email protected] (C. Yang), [email protected] (W. Gao), [email protected] (H. Modares), [email protected] (X. Chen), [email protected] (W. Dai).

Article info

Article history:
Received 21 November 2021
Received in revised form 27 September 2022
Accepted 12 October 2022
Available online 29 November 2022

Keywords:
Reinforcement learning
Linear quadratic tracking control
Discounted cost function
Singular perturbation theory

Abstract

This paper considers the problem of linear quadratic tracking control (LQTC) with a discounted cost function for unknown systems. The existing design methods often require the discount factor to be small enough to guarantee the closed-loop stability. However, solving the discounted algebraic Riccati equation (ARE) may lead to ill-conditioned numerical issues if the discount factor is too small. By singular perturbation theory, we decompose the full-order discounted ARE into a reduced-order ARE and a Sylvester equation, which facilitate designing the feedback and feedforward control gains. The obtained controller is proved to be a stabilizing and near-optimal solution to the original LQTC problem. In the framework of reinforcement learning, both on-policy and off-policy two-phase learning algorithms are derived to design the near-optimal tracking control policy without knowing the discount factor. The advantages of the developed results are illustrated by comparative simulation results.

© 2022 Published by Elsevier Ltd.

1. Introduction

Reinforcement learning (RL), as a computational intelligence technique, has attracted increasing research interest in the last decade (Gao, Deng, Jiang, & Jiang, 2022; Kiumarsi, Vamvoudakis, Modares, & Lewis, 2018; Liu, Xue, Zhao, Luo, & Wei, 2021; Mukherjee, Bai, & Chakrabortty, 2021; Mukherjee & Vu, 2022; Vamvoudakis & Ferraz, 2018; Wang, Ha, & Qiao, 2020). RL enables an agent to learn its action autonomously so as to optimize a pre-specified long-term performance via active interactions with the external environment (Sutton & Barto, 2018). In the context of the control community, RL has been extensively employed to address various optimal control problems for partially or completely unknown systems, such as the linear quadratic regulator (Jiang & Jiang, 2017; Lewis & Vrabie, 2009; Pang, Jiang, & Mareels, 2020), H∞ control (Liu & Wu, 2019; Wu & Luo, 2013), output regulation (Chen et al., 2019; Gao & Jiang, 2016), robust adaptive dynamic programming (Gao, Jiang, Jiang, & Chai, 2016), and zero-sum differential games (Vamvoudakis, 2015; Vamvoudakis & Hespanha, 2018; Vamvoudakis, Modares, Kiumarsi, & Lewis, 2017).

In this paper, we focus on the problem of linear quadratic tracking control (LQTC) with a discounted cost function for systems with unknown dynamics. As an important topic in optimal control theory, LQTC refers to designing a control policy that makes the system output follow a desired trajectory while minimizing a predefined quadratic cost function (Modares & Lewis, 2014a). The designed controller usually consists of a feedback component and a feedforward component, which are traditionally developed based on the system model (Lewis, Vrabie, & Syrmos, 2012). Because the knowledge of the model is hard to obtain for many emerging control systems, the RL technique has been extended to solve the LQTC problem for unknown systems.


In previous RL-based work (Kamalapurkar, Dinh, Bhasin, & Dixon, 2015; Zhang, Cui, Zhang, & Luo, 2011), the cost function contains only the feedback term, while the feedforward term is designed by dynamic inversion, and thus the obtained design method only optimizes the feedback part. The generalized cost function containing both the feedback and feedforward terms could be unbounded over an infinite horizon. One feasible strategy to avoid this issue is revising the cost function as a discounted one (Modares & Lewis, 2014a). By adopting a discounted cost function, the LQTC problem was transformed into an optimal regulation problem of an augmented system composed of the original system and the signal generator in Modares and Lewis (2014a), and an RL algorithm was proposed to find the solution to the discounted algebraic Riccati equation (ARE). Subsequently, H∞ tracking and output-feedback tracking problems with discounted cost functions were also studied in Modares, Lewis, and Jiang (2015, 2016).

Stability analysis is a challenging task for LQTC with a discounted cost function. It was shown that there exists an upper bound for the discount factor γ such that the closed-loop system is stable for any γ less than that upper bound (Modares et al., 2015, 2016). However, the upper bound is highly related to the system model parameters and is difficult to obtain for unknown systems (Gaitsgory, Grüne, & Thatcher, 2015; Granzotto, Postoyan, Busoniu, Nešić, & Daafouz, 2021; Postoyan, Busoniu, Nešić, & Daafouz, 2017). RL algorithms in Modares et al. (2015, 2016) were proposed to solve the discounted AREs associated with tracking control problems, where the discount factor γ is chosen as a sufficiently small value to guarantee closed-loop stability. By analogy with singular perturbation optimal control theory (Kokotovic, Khalil, & O'Reilly, 1999), solving the discounted ARE with a small γ may lead to ill-conditioned numerical issues. To the best knowledge of the authors, there is no discussion on how the discount factor affects the computational issues of the discounted ARE, which motivates our study.

This paper considers the LQTC problem of unknown systems whose desired trajectory is generated by a linear autonomous system. The purpose of this study is to reveal the influence of a small γ on ARE-solving algorithms and to develop a well-conditioned computational methodology to solve the discounted ARE in terms of RL. The major contributions are as follows:

(1) By theoretical analysis, we find that solving the discounted ARE may lead to ill-conditioned numerical issues if the discount factor is too small.
(2) Using singular perturbation theory (Kokotovic et al., 1999), the full-order discounted ARE is decomposed into a reduced-order ARE and a Sylvester equation, which facilitate designing the feedback and feedforward control gains without ill-conditioned numerical issues. The obtained controller, which is independent of the small discount factor, is proved to be a stabilizing and near-optimal solution to the original LQTC problem.
(3) In the framework of RL, both on-policy and off-policy well-conditioned two-phase learning algorithms are derived to design the near-optimal tracking control policy without knowing the discount factor.

The remainder of this paper is organized as follows. In Section 2, the LQTC problem under consideration is formulated and the ill-conditioned numerical issues in solving the discounted ARE are analyzed. Section 3 proposes a near-optimal design scheme avoiding the potential ill-conditioned numerical issues and develops both on-policy and off-policy two-phase RL algorithms for solving the LQTC problem. Comparative simulations of a spring–mass–damper system are given in Section 4. Section 5 concludes the paper.

2. Problem statement and preliminaries

2.1. LQTC problem formulation

Consider the continuous-time linear system

ẋ = Ax + Bu,  y = Cx    (1)

where x ∈ R^n is the measurable state vector, u ∈ R^m is the control input, and y ∈ R^q is the system output. The state matrix A ∈ R^{n×n} and input matrix B ∈ R^{n×m} are unknown, but the output matrix C ∈ R^{q×n} is known.

Assumption 1. The desired trajectory y_d ∈ R^q is generated by an autonomous system with the dynamics

ẋ_d(t) = E x_d(t),  y_d(t) = F x_d(t)    (2)

where x_d ∈ R^r is the measurable state vector and x_d(0) ≠ 0. F ∈ R^{q×r} is a known constant matrix, and E ∈ R^{r×r} is an unknown matrix whose eigenvalues satisfy Re[λ_j(E)] = 0 for all j = 1, 2, ..., r.

Define the following discounted cost function (Modares & Lewis, 2014a):

J(e_d, u) = (1/2) ∫_0^∞ e^{-γt} (e_d^T Q e_d + u^T R u) dt    (3)

where γ > 0 is a discount factor, e_d = y - y_d is the tracking error, and Q ≻ 0, R ≻ 0 are symmetric weight matrices.

Assumption 2. The pair (A, B) is stabilizable and the pair (A, C) is observable.

Fig. 1. Output tracking control block diagram of system (1).

Problem 1. The LQTC problem in this paper, see Fig. 1, is to design a control input

u = -K_1 x - K_2 x_d    (4)

such that the system output y follows the desired trajectory y_d while minimizing the quadratic cost function (3).

Remark 1. Assumption 1 is without loss of generality. System (2) can generate numerous command trajectories such as a sinusoidal signal, a unit step, a harmonic oscillator, a ramp, and more (Baldi, Azzollini, & Ioannou, 2021; Gao, Jiang, Lewis, & Wang, 2018). Assumption 2 is a common condition in linear optimal control (Lewis et al., 2012). A way to check this condition is to suppose that we have the knowledge of a nominal model (Pang et al., 2020; Vamvoudakis & Ferraz, 2018).

Remark 2. The discounted cost function is widely utilized to address the optimal tracking problem (Modares & Lewis, 2014a, 2014b; Modares et al., 2016; Zhao, Yang, Dai and Gao, 2022). Since the poles of the autonomous system (2) are on the imaginary axis, the feedforward input causes an unbounded value of the generalized cost function. Thus, the discount factor γ in (3) is an indispensable parameter for solving Problem 1.
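As a purely illustrative aside (not part of the original paper), the role of the discount factor highlighted in Remark 2 can be checked numerically: with a persistent sinusoidal steady-state input, the undiscounted integrand u^T R u does not decay, so the cost grows with the horizon, whereas the discounted integrand e^{-γt} u^T R u has a finite integral. A minimal Python sketch, with an assumed scalar input signal:

import numpy as np

gamma, R = 0.1, 1.0
t = np.linspace(0.0, 200.0, 200001)
u = np.sin(t)                                              # persistent steady-state input
undiscounted = np.trapz(R * u**2, t)                       # grows roughly like R*T/2 with the horizon T
discounted = np.trapz(np.exp(-gamma * t) * R * u**2, t)    # remains finite as the horizon grows
print(undiscounted, discounted)

Increasing the final time makes the first value grow without bound while the second saturates, which is exactly why γ > 0 is indispensable in (3).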

2.2. Review of optimal control theory

For simplicity, we construct the augmented state X = [x^T, x_d^T]^T. By Modares and Lewis (2014a), the optimal tracking control policy of Problem 1 is determined by

u* = -R^{-1} N^T P X    (5)

where P ≻ 0 is a symmetric solution to the discounted ARE

M^T P + P M - γ P - P N R^{-1} N^T P + Q_d = 0    (6)

with

M = \begin{bmatrix} A & 0 \\ 0 & E \end{bmatrix},  N = \begin{bmatrix} B \\ 0 \end{bmatrix},  Q_d = \begin{bmatrix} C^T Q C & -C^T Q F \\ -F^T Q C & F^T Q F \end{bmatrix}

Since the discounted ARE is nonlinear in P, it is difficult to solve for P from (6) directly. To this end, Modares and Lewis (2014a) proposed the following model-based iterative algorithm:

Q_d + (K̄^i)^T R K̄^i + (M - 0.5γ I_{n_0} - N K̄^i)^T P^i + P^i (M - 0.5γ I_{n_0} - N K̄^i) = 0    (7)

where n_0 = n + r, K̄^{i+1} = R^{-1} N^T P^i, and K̄^0 is chosen such that the matrix M - 0.5γ I_{n_0} - N K̄^0 is Hurwitz. Moreover, lim_{i→∞} P^i = P.

In order to obviate the model knowledge of A, B, and E, the iterative procedure (7) can be implemented to solve the discounted ARE (6) by the following off-policy RL algorithm (Modares et al., 2016):

e^{-γτ} X^T P^i X |_t^{t+δt} = 2 ∫_t^{t+δt} e^{-γτ} (u^0 + K̄^i X)^T R K̄^{i+1} X dτ - ∫_t^{t+δt} e^{-γτ} X^T (Q_d + (K̄^i)^T R K̄^i) X dτ    (8)

where u^0 is the behavior policy used to generate data, and P^i and K̄^{i+1} are obtained simultaneously in each iteration.

2.3. Motivation

Under Assumptions 1 and 2, Najafi Birgani, Moaveni, and Khaki-Sedigh (2018) have shown that the kernel matrix P ≻ 0 of (6) is always uniquely determined for any discount factor γ > 0. However, a small enough discount factor γ should be chosen carefully to guarantee stability of the closed-loop system (1) under the optimal control (5) (Modares et al., 2015, 2016; Najafi Birgani et al., 2018). This subsection will show that solving the discounted ARE directly may lead to ill-conditioned numerical issues if the discount factor γ is too small.

Example 1. Consider the scalar system ẋ = x + u, y = x, which is given in Modares et al. (2016). Suppose that ẋ_d = 0, x_d(0) = 1, y_d = x_d, Q = 1, R = 1. By solving the ARE (6), the optimal control (5) is determined as u = -kx + 1/(k + γ - 1) with k = 1 - 0.5γ + √((1 - 0.5γ)^2 + 1). Then, the closed-loop system is stable if and only if γ < γ* = 2. Let

P = \begin{bmatrix} p_1 & p_2 \\ p_2 & p_3 \end{bmatrix}

It follows from (6) that p_1 = k, p_2 = 1/(1 - γ - p_1), and p_3 = (1 - p_2^2)/γ. As γ → 0, p_1 → 1 + √2 and p_2 → -√2/2, so the solution P is bounded for any γ > 0 and the entries p_1, p_2, p_3 are coupled. Nevertheless, Fig. 2 shows that the entries p_1, p_2 remain bounded, but the entry p_3 of the kernel matrix P in (6) increases to infinity as γ goes to zero, which implies that ARE (6) is ill-conditioned if γ is too small.

We next analyze the properties of the algorithm (7) when γ is too small. Letting K̄^i := [K̄_1^i, K̄_2^i] with K̄_1^i ∈ R^{m×n} and K̄_2^i ∈ R^{m×r}, we get from (7)

M - 0.5γ I_{n_0} - N K̄^i = \begin{bmatrix} A - B K̄_1^i - 0.5γ I_n & -B K̄_2^i \\ 0 & E - 0.5γ I_r \end{bmatrix}

Obviously, when the discount factor γ is too small, the eigenvalues λ_j(M - 0.5γ I_{n_0} - N K̄^i), j = 1, 2, ..., n_0, cluster into two different orders of magnitude, i.e., λ(A - B K̄_1^i - 0.5γ I_n) = O(1) and λ(E - 0.5γ I_r) = O(γ) ≪ 1, respectively. Recently, Zhao, Yang and Gao (2022) have pointed out that solving such an algebraic Lyapunov equation (7) may lead to ill-conditioned numerical issues. This is because the condition number of the coefficient matrix of the matrix algebraic equation (7) is too large. In this case, a small perturbation caused by measurement error and/or finite-precision arithmetic can result in wrong solutions or even algorithm divergence, which motivates this study.

3. Main results

In this section, we first show how to design a near-optimal tracking control policy that alleviates the ill-conditioned numerical issues when the discount factor is small enough. Then, we devise on-policy and off-policy two-phase RL algorithms that approximate the near-optimal solution of Problem 1 in the absence of knowledge of the system dynamics.

3.1. Near-optimal design scheme for LQTC

Throughout this paper, we assume that the discount factor γ is small enough, i.e., γ ≪ 1. Inspired by optimal control theory for singularly perturbed systems (Kokotovic et al., 1999), the kernel matrix P of the discounted ARE (6) can be partitioned into the following form:

P = \begin{bmatrix} P̄_1 & P̄_2 \\ P̄_2^T & γ^{-1} P̄_3 \end{bmatrix}    (9)

Then, using (9) and (6), (5) is rewritten as

u* = -K̄_1 x - K̄_2 x_d    (10)

where K̄_1 = R^{-1} B^T P̄_1, K̄_2 = R^{-1} B^T P̄_2, and P̄_1 ∈ R^{n×n}, P̄_2 ∈ R^{n×r} satisfy the following matrix algebraic equations:

A^T P̄_1 + P̄_1 A - γ P̄_1 + C^T Q C - P̄_1 B R^{-1} B^T P̄_1 = 0    (11)
A^T P̄_2 + P̄_2 E - γ P̄_2 - C^T Q F - P̄_1 B R^{-1} B^T P̄_2 = 0    (12)

Notice that the optimal tracking policy (10) depends directly on the matrices P̄_1, P̄_2 but not on the matrix P̄_3, which is also not contained in the reduced-order ARE (11) or the Sylvester equation (12). Thus it is feasible for Problem 1 to design the optimal feedback input and the optimal feedforward input based on (11) and (12).

Setting γ = 0 in (11) and (12) yields the following zero-order approximate equations:

A^T P_1 + P_1 A + C^T Q C - P_1 B R^{-1} B^T P_1 = 0    (13)
A^T P_2 + P_2 E - C^T Q F - P_1 B R^{-1} B^T P_2 = 0    (14)

where P_1 and P_2 differ from P̄_1 and P̄_2 by O(γ)¹ (Gajic & Lim, 2001; Kodra & Gajic, 2017), namely,

P_1 = P̄_1 + O(γ)    (15)
P_2 = P̄_2 + O(γ)    (16)

¹ O(ε^i) is defined by O(ε^i) < a ε^i, where a is a bounded constant and i is a real number (Gajic & Lim, 2001).

Fig. 2. Evolution of entries p_1, p_2, and p_3.
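The closed-form expressions of Example 1 can be evaluated directly. The short sketch below (illustrative only; it simply codes the formulas p_1 = k, p_2 = 1/(1 - γ - p_1), p_3 = (1 - p_2^2)/γ stated above) reproduces the trend of Fig. 2: p_1 and p_2 approach 1 + √2 and -√2/2, while p_3 grows like 1/(2γ) as γ → 0.

import numpy as np

for gamma in [1e-1, 1e-2, 1e-3, 1e-4]:
    k = 1 - 0.5 * gamma + np.sqrt((1 - 0.5 * gamma) ** 2 + 1)   # p1 = k
    p1 = k
    p2 = 1.0 / (1.0 - gamma - p1)
    p3 = (1.0 - p2 ** 2) / gamma
    print(f"gamma={gamma:7.0e}  p1={p1:.4f}  p2={p2:.4f}  p3={p3:.1f}")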

In order to avoid the negative impacts caused by the introduction of γ, we design the near-optimal tracking control policy for Problem 1 as

u = -K_1 x - K_2 x_d    (17)

where K_1 = R^{-1} B^T P_1 and K_2 = R^{-1} B^T P_2.

The following theorem concerns the existence of the tracking control policy (17).

Theorem 1. Under Assumptions 1 and 2, the reduced-order ARE (13) and the Sylvester equation (14) have a unique solution P_1 ≻ 0 and P_2.

Proof. By linear optimal control theory (Lewis et al., 2012), under Assumption 2, the solution matrix P_1 ≻ 0 of (13) is uniquely determined and A - BK_1 is Hurwitz. Using K_1 = R^{-1} B^T P_1, the Sylvester equation (14) can be rewritten as

(A - BK_1)^T P_2 + P_2 E = C^T Q F    (18)

Since A - BK_1 is Hurwitz, we have Re[λ(A - BK_1)] < 0. This means that A - BK_1 and -E do not share any eigenvalues, and thus the solution matrix P_2 of the Sylvester equation (18) exists and is unique (Bartels & Stewart, 1972). This concludes the proof. ■

We next analyze the stability and the approximation property of the proposed control policy u.

Theorem 2. Let Assumptions 1 and 2 hold. Then the control gain matrix in (17) is a zero-order approximation of (10), that is,

K = K̄ + O(γ)    (19)

where K = [K_1, K_2] and K̄ = [K̄_1, K̄_2]. In addition, the control policy u developed in (17) stabilizes system (1).

Proof. In view of (15) and (16), there exist constants c_1 > 0 and c_2 > 0 such that

‖P_1 - P̄_1‖ = ‖O(γ)‖ ≤ c_1 γ,  ‖P_2 - P̄_2‖ = ‖O(γ)‖ ≤ c_2 γ

For any c_1 and c_2, there exists a c_3 such that c_3 > c_1 and c_3 > c_2. Then, we have

‖K - K̄‖ = ‖[R^{-1}B^T, R^{-1}B^T] \begin{bmatrix} P_1 - P̄_1 & 0 \\ 0 & P_2 - P̄_2 \end{bmatrix}‖ ≤ ‖[R^{-1}B^T, R^{-1}B^T]‖ c_3 γ

which implies K = K̄ + O(γ). Also, by the proof of Theorem 1, it is known that A - BK_1 is Hurwitz and thus u is a stabilizing policy for system (1). ■

The following theorem quantifies the near-optimality and reveals that u in (17) achieves an O(γ^2) near-optimal solution with respect to the target cost J* = (1/2) X(0)^T P X(0) in (3).

Theorem 3. Let J_r be the cost in (3) under the designed control policy (17). Then J_r is bounded as α J_r ≤ J* with α = 1/λ_max(P_r P^{-1}) and P_r defined in (21). Moreover, the obtained cost J_r satisfies

J_r = J* + O(γ^2)    (20)

Proof. Applying the designed control policy (17) to Problem 1, by Lewis et al. (2012), we have J_r = (1/2) X(0)^T P_r X(0) with P_r satisfying

M_r^T P_r + P_r M_r + K^T R K + Q_d = 0    (21)

where M_r = M - 0.5γ I_{n_0} - NK. Since α = 1/λ_max(P_r P^{-1}), one has α J_r ≤ J*. Then, subtracting (6) from (21), we obtain the following Lyapunov equation:

M_r^T V + V M_r = -(K - K̄)^T R (K - K̄)    (22)

where V = P_r - P. By the power series expansion of V in γ, we have

V = \begin{bmatrix} V_{10} & V_{20} \\ V_{20}^T & γ^{-1} V_{30} \end{bmatrix} + \sum_{i=1}^{∞} (γ^i / i!) \begin{bmatrix} V_{1i} & V_{2i} \\ V_{2i}^T & γ^{-1} V_{3i} \end{bmatrix}    (23)

From Theorem 2, with (19), it is easily derived that

(K - K̄)^T R (K - K̄) = O(γ^2)

Substituting (23) into (22), and proceeding as in the proof of Theorem 2 in Yang, Zhong, Liu, Dai, and Zhou (2020), we can get V_{j0} = 0 and V_{j1} = 0, j = 1, 2, 3, because A - BK_1 - 0.5γ I_n and E - 0.5γ I_r are Hurwitz matrices. Hence, we have V = O(γ^2), which implies (20). ■

Remark 3. The tracking control policy (17) does not require knowledge of the small discount factor γ. Consequently, compared with the existing works (Modares & Lewis, 2014a, 2014b; Modares et al., 2016; Najafi Birgani et al., 2018), the proposed near-optimal design scheme bypasses the challenging task of choosing the discount factor and thus does not need to know the upper bound γ* a priori, which is of more practical significance, especially for completely unknown systems.

Remark 4. The discount factor γ is a predefined design parameter in the existing works. Modares and Lewis (2014a) argued that the smaller γ is, the smaller the steady-state tracking error e_d is. Since the obtained tracking control policy (17) is independent of γ, it can achieve a smaller steady-state tracking error than the existing γ-dependent control (10), as will be illustrated by the example in Section 4.
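When a nominal model is available, the near-optimal design of Section 3.1 can be implemented directly. The following sketch is not from the paper: it assumes known matrices (A, B, C, E, F) and weights (Q, R) and uses SciPy's ARE and Sylvester solvers (in place of the MATLAB lqr/lyap route mentioned in Section 4) to solve (13) and (14), the latter in the form (18), and to return the gains of (17).

import numpy as np
from scipy.linalg import solve_continuous_are, solve_sylvester

def near_optimal_gains(A, B, C, E, F, Q, R):
    # Reduced-order ARE (13): A'P1 + P1 A + C'QC - P1 B R^{-1} B' P1 = 0
    P1 = solve_continuous_are(A, B, C.T @ Q @ C, R)
    K1 = np.linalg.solve(R, B.T @ P1)            # K1 = R^{-1} B' P1
    # Sylvester equation (14), rewritten as (18): (A - B K1)' P2 + P2 E = C' Q F
    P2 = solve_sylvester((A - B @ K1).T, E, C.T @ Q @ F)
    K2 = np.linalg.solve(R, B.T @ P2)            # K2 = R^{-1} B' P2
    return K1, K2

For the unknown-system case treated next, the same two objects P_1 and P_2 are instead identified from measured data by the two-phase RL algorithms of Sections 3.2 and 3.3.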

3.2. On-policy two-phase RL algorithm to learn the near-optimal tracking control policy

In numerical calculation, the proposed tracking control policy u in (17) is easier to realize than the optimal control (5). Therefore, ordinary model-based algorithms can be employed to solve the reduced-order ARE (13) and the Sylvester equation (14) without ill-conditioned numerical issues. For solving Problem 1, we now derive a model-free approach to learn the O(γ^2) near-optimal control (17).

Using the Kleinman algorithm (Kleinman, 1968), ARE (13) can be solved iteratively by

(A - BK_1^i)^T P_1^i + P_1^i (A - BK_1^i) + C^T Q C + (K_1^i)^T R K_1^i = 0    (24)

where K_1^{i+1} = R^{-1} B^T P_1^i and K_1^0 is chosen such that A - BK_1^0 is Hurwitz.

The following control policy is applied to system (1) to collect data:

u^i = -K_1^i x - K_2^i x_d + u_e    (25)

where u_e is the exploration noise. Then the closed-loop system turns into

ẋ = (A - BK_1^i) x + B u_e^i    (26)

where u_e^i = -K_2^i x_d + u_e. By (24) and (26), we have

x^T P_1^i x |_t^{t+δt} - 2 ∫_t^{t+δt} (u_e^i)^T R K_1^{i+1} x dτ = - ∫_t^{t+δt} x^T (C^T Q C + (K_1^i)^T R K_1^i) x dτ    (27)

where δt > 0 is the sampling interval.

For matrices Z = [z_1, ..., z_m] ∈ R^{n×m}, D = [d_{ij}] ∈ R^{n×n}, and a vector G = [g_i] ∈ R^n, define

vec(Z) ≜ [z_1^T, z_2^T, ..., z_m^T]^T ∈ R^{nm}
vecs(D) ≜ [d_{11}, 2d_{12}, ..., 2d_{1n}, d_{22}, 2d_{23}, ..., 2d_{n-1,n}, d_{nn}]^T ∈ R^{n(n+1)/2}
vecv(G) ≜ [g_1^2, g_1 g_2, ..., g_1 g_n, g_2^2, g_2 g_3, ..., g_{n-1} g_n, g_n^2]^T ∈ R^{n(n+1)/2}

Furthermore, for any two vectors φ, ϕ and a sufficiently large number s > 0, define

Π_φ = [vecv(φ(t_1)) - vecv(φ(t_0)), ..., vecv(φ(t_s)) - vecv(φ(t_{s-1}))]^T
Λ_{φ,ϕ} = [φ ⊗ ϕ |_{t_0}^{t_1}, φ ⊗ ϕ |_{t_1}^{t_2}, ..., φ ⊗ ϕ |_{t_{s-1}}^{t_s}]^T
Γ_{φ,ϕ} = [∫_{t_0}^{t_1} φ ⊗ ϕ dτ, ∫_{t_1}^{t_2} φ ⊗ ϕ dτ, ..., ∫_{t_{s-1}}^{t_s} φ ⊗ ϕ dτ]^T

Then, using the Kronecker product, Eq. (27) is reorganized as

Θ_1^i \begin{bmatrix} vecs(P_1^i) \\ vec(K_1^{i+1}) \end{bmatrix} = Ξ_1^i    (28)

where

Θ_1^i = [Π_x, -2Γ_{x,u_e^i} (I ⊗ R)]
Ξ_1^i = -Γ_{x,x} · vec(C^T Q C + (K_1^i)^T R K_1^i)

Hence, using (28), the feedback control gain K_1 in (17) can be learned from input/state data of system (1) without involving knowledge of A and B.

Next, we devise a novel learning method to design the feedforward control gain. We use the index i to rewrite (18) as

(A - BK_1^i)^T P_2^i + P_2^i E - C^T Q F = 0    (29)

Then, by (26) and (2), differentiating x^T P_2^i x_d yields

d(x^T P_2^i x_d)/dt = x^T C^T Q F x_d + (u_e^i)^T B^T P_2^i x_d    (30)

Letting K_2^{i+1} = R^{-1} B^T P_2^i and integrating both sides of (30) over an interval [t, t+δt], we obtain

x^T P_2^i x_d |_t^{t+δt} - ∫_t^{t+δt} (u_e^i)^T R K_2^{i+1} x_d dτ = ∫_t^{t+δt} x^T C^T Q F x_d dτ    (31)

Using the Kronecker product, the above equation is reformulated as

Θ_2^i \begin{bmatrix} vec(P_2^i) \\ vec(K_2^{i+1}) \end{bmatrix} = Ξ_2^i    (32)

where

Θ_2^i = [Λ_{x_d,x}, -Γ_{x_d,u_e^i} (I ⊗ R)]
Ξ_2^i = Γ_{x_d,x} · vec(C^T Q F)

Then the solution to the Sylvester equation (14) can be computed. In other words, we can utilize (32) to learn the feedforward control gain K_2 in (17) without using the knowledge of A, B, and E.

The uniqueness of the solutions to (28) and (32) is guaranteed under one rank condition. Due to the limited space, we omit the proof of Lemma 1, which adopts the same line of proof as in Gao and Jiang (2016) and Jiang and Jiang (2017).

Lemma 1. For each i = 0, 1, 2, ..., if there exists an integer s* such that for all s > s*

rank([Γ_{x,x}, Γ_{x,u_e^i}, Γ_{x_d,x}, Γ_{x_d,u_e^i}]) = n(1+n)/2 + mn + (m+n) r    (33)

then the matrices Θ_1^i, Θ_2^i have full column rank for all i.

Now, the on-policy two-phase RL algorithm for solving the LQTC problem is summarized in Algorithm 1. Since the feedback control and feedforward control terms are not captured as a whole, the proposed methodology is referred to as the two-phase learning algorithm.

Algorithm 1 On-policy two-phase RL
1: Select s, a feedback control gain K_1^0 such that A - BK_1^0 is Hurwitz, a bounded feedforward control gain K_2^0, a threshold σ > 0, and a sampling interval δt > 0.
2: i ← 0
3: repeat
4:   Apply (25) to system (1), and collect the data such that (33) holds.
5:   Compute P_1^i and K_1^{i+1} by (28).
6:   Compute P_2^i and K_2^{i+1} by (32).
7:   i ← i + 1
8: until ‖K_1^{i+1} - K_1^i‖ + ‖K_2^{i+1} - K_2^i‖ ≤ σ
9: Use u = -K_1^i x - K_2^i x_d as the near-optimal tracking control policy of Problem 1.
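The vectorization operators underpin the data equations (28) and (32). Below is a minimal NumPy sketch of vec, vecs, and vecv as defined above (array names and shapes are ours, not the paper's); the key identity is x^T D x = vecs(D)·vecv(x) for symmetric D, which is what turns the quadratic relations (27) and (31) into linear regressions in the unknown entries of P_1^i and P_2^i.

import numpy as np

def vec(Z):
    # Stack the columns of Z into one column vector.
    return Z.reshape(-1, order="F")

def vecs(D):
    # [d11, 2*d12, ..., 2*d1n, d22, 2*d23, ..., dnn] for a symmetric n x n matrix D.
    n = D.shape[0]
    out = []
    for i in range(n):
        out.append(D[i, i])
        out.extend(2.0 * D[i, i + 1:n])
    return np.array(out)

def vecv(g):
    # [g1^2, g1*g2, ..., g1*gn, g2^2, ..., gn^2]; satisfies g @ D @ g == vecs(D) @ vecv(g)
    # for symmetric D.
    n = g.shape[0]
    return np.concatenate([g[i] * g[i:n] for i in range(n)])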

We are ready to discuss the convergence of Algorithm 1 in the following theorem.

Theorem 4. Consider (1)–(3), and if (33) is satisfied, then we have
(1) A - BK_1^i is Hurwitz for each i;
(2) lim_{i→∞} P_1^i = P_1 and lim_{i→∞} K_1^{i+1} = K_1;
(3) lim_{i→∞} P_2^i = P_2 and lim_{i→∞} K_2^{i+1} = K_2.

Proof. For convenience, we say that Step 5 is the phase-one learning of the feedback control term and Step 6 is the phase-two learning of the feedforward control term. Dividing (27) by δt and taking the limit gives (24). This means that the model-free integral equation (27) is equivalent to the model-based algorithm (24). On the other hand, the condition (33) ensures that P_1^i and K_1^{i+1} from (28) are uniquely determined. By Kleinman (1968), we have that A - BK_1^i is Hurwitz, lim_{i→∞} P_1^i = P_1, and lim_{i→∞} K_1^{i+1} = K_1. It is checkable that the matrix P_2^i in the phase-two learning also satisfies (29), based on the derivation of (31). By property (1), A - BK_1^i and -E have no common eigenvalues. Thus the solution P_2^i, for any i, of the Sylvester equation (29) is uniquely determined (Bartels & Stewart, 1972). Since K_1^i = K_1 when i goes to infinity in the phase-one learning, lim_{i→∞} P_2^i = P_2. This implies lim_{i→∞} K_2^{i+1} = K_2 by continuity. The proof is thus completed. ■

3.3. Off-policy two-phase RL algorithm to learn the near-optimal tracking control policy

In Algorithm 1, new input/output data is collected in every iteration to find the near-optimal tracking control policy, which could be inconvenient and costly (Pang et al., 2020). To this end, we develop an off-policy two-phase RL algorithm in this subsection.

We consider the following control policy for system (1) to collect data:

u^0 = -K_1^0 x - K_2^0 x_d + u_e    (34)

where u_e is the exploration noise. Applying (34) to (1), we have the closed-loop system

ẋ = (A - BK_1^i) x + B (u^0 + K_1^i x)    (35)

Motivated by Jiang and Jiang (2017), by (24) and (35), we can obtain the following model-free integral equation:

x^T P_1^i x |_t^{t+δt} - 2 ∫_t^{t+δt} (u^0 + K_1^i x)^T R K_1^{i+1} x dτ = - ∫_t^{t+δt} x^T (C^T Q C + (K_1^i)^T R K_1^i) x dτ    (36)

By the Kronecker product, the above equation is rearranged as

Θ̂_1^i \begin{bmatrix} vecs(P_1^i) \\ vec(K_1^{i+1}) \end{bmatrix} = Ξ̂_1^i    (37)

where

Θ̂_1^i = [Π_x, -2Γ_{x,x} (I ⊗ (K_1^i)^T R) - 2Γ_{x,u^0} (I ⊗ R)]
Ξ̂_1^i = -Γ_{x,x} · vec(C^T Q C + (K_1^i)^T R K_1^i)

Next, we present the learning procedure to design the feedforward control gain. We define K_1 = K_1^{i+1}, where K_1^{i+1} is obtained from (37). Substituting the gain matrix K_1 into (18), one has

(A - BK_1)^T P_2† + P_2† E - C^T Q F = 0    (38)

Similar to (31), by (35), (2), and (38), we have

x^T P_2† x_d |_t^{t+δt} - ∫_t^{t+δt} (u^0 + K_1 x)^T R K_2† x_d dτ = ∫_t^{t+δt} x^T C^T Q F x_d dτ    (39)

where K_2† = R^{-1} B^T P_2†. By the Kronecker product, the above equation is reformulated as

Θ̂_2 \begin{bmatrix} vec(P_2†) \\ vec(K_2†) \end{bmatrix} = Ξ̂_2    (40)

where

Θ̂_2 = [Λ_{x_d,x}, -Γ_{x_d,x} (I ⊗ K_1^T R) - Γ_{x_d,u^0} (I ⊗ R)]
Ξ̂_2 = Γ_{x_d,x} · vec(C^T Q F)

Again, we need the following lemma to guarantee the uniqueness of the solutions to (37) and (40) in the spirit of the exploration noise.

Lemma 2. For all i = 0, 1, 2, ..., if there exists an integer s* such that for all s > s*

rank([Γ_{x,x}, Γ_{x,u^0}, Γ_{x_d,x}, Γ_{x_d,u^0}]) = n(1+n)/2 + mn + (m+n) r    (41)

then the matrices Θ̂_1^i, Θ̂_2 have full column rank for all i.

The detailed off-policy two-phase RL algorithm for solving the LQTC problem is presented in Algorithm 2.

Algorithm 2 Off-policy two-phase RL
1: Select s, a feedback control gain K_1^0 such that A - BK_1^0 is Hurwitz, a bounded feedforward control gain K_2^0, a threshold σ > 0, and a sampling interval δt > 0.
2: Apply (34) to system (1), and collect the data such that (41) holds.
3: i ← 0
4: repeat
5:   Compute P_1^i and K_1^{i+1} by (37).
6:   i ← i + 1
7: until ‖K_1^{i+1} - K_1^i‖ ≤ σ
8: Update the feedback control gain K_1 ← K_1^{i+1}.
9: Compute P_2† and K_2† by (40).
10: Use u† = -K_1 x - K_2† x_d as the near-optimal tracking control policy of Problem 1.

The convergence results of Algorithm 2 are shown in the following theorem, whose proof is similar to that of Theorem 4 and is thus omitted.

Theorem 5. Consider (1)–(3), and if (41) is satisfied, then we have
(1) A - BK_1^i is Hurwitz for each i;
(2) lim_{i→∞} P_1^i = P_1 and lim_{i→∞} K_1^{i+1} = K_1;
(3) P_2† = P_2 and K_2† = K_2 as i → ∞.

In conclusion, when the discount factor γ in cost function (3) is small enough, we can obtain an O(γ) approximate stabilizing tracking policy that achieves O(γ^2) near-optimality for the LQTC problem of unknown systems by performing the two-phase RL in Algorithms 1 and 2.
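Before either algorithm is run, the richness of the collected data can be verified directly from the rank conditions. A minimal sketch of the check in (41) (illustrative only; the data-matrix names and shapes follow Section 3.2, but the function itself is ours):

import numpy as np

def rank_condition_holds(Gamma_xx, Gamma_xu0, Gamma_xdx, Gamma_xdu0, n, m, r):
    # (41): rank([Gamma_{x,x}, Gamma_{x,u0}, Gamma_{xd,x}, Gamma_{xd,u0}])
    #        = n(n+1)/2 + m*n + (m+n)*r, with n = dim(x), m = dim(u), r = dim(xd)
    data = np.hstack([Gamma_xx, Gamma_xu0, Gamma_xdx, Gamma_xdu0])
    required = n * (n + 1) // 2 + m * n + (m + n) * r
    return np.linalg.matrix_rank(data) == required

If the test fails, more exploration (e.g., a richer probing noise u_e) is needed before the least-squares problems (37) and (40) admit unique solutions.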

Remark 5. The computational complexity of inverting an m-dimensional matrix is O(m^2.376) (Cormen, Leiserson, Rivest, & Stein, 2009). The developed near-optimal design methodology does not involve the matrix γ^{-1} P̄_3 in (9) for solving Problem 1. Therefore, the two-phase RL has a lower dimension of data matrices during learning than the existing RL methods. Moreover, the feedforward control gain in Algorithm 2 is directly solved without an iterative loop, i.e., in a one-step calculation. Hence, the overall computational complexity for designing the tracking control policy is significantly reduced.

Remark 6. The stabilizing initial gain K_1^0, i.e., such that A - BK_1^0 is Hurwitz, is required in Algorithms 1 and 2. One way to check this condition is to use the knowledge of a nominal system model. When the information of the system matrices is completely unknown, it is desirable to find such a stabilizing initial gain by trial and error. In addition, we can also utilize the hybrid iteration algorithm (Gao et al., 2022) to obtain an initial admissible gain.

Remark 7. In Algorithm 1, the estimated (learned) policy is applied to system (1) to generate data for learning. In Algorithm 2, on the contrary, the behavior policy used to generate data can in fact be unrelated to the estimated policy that is learned. Moreover, the data used in Algorithm 1 are obtained online, whereas the data used in Algorithm 2 are obtained offline and reused to update the estimated policy during learning. As a result, when the optimal tracking control policy needs to be computed in real time, Algorithm 1 is the better choice. If there is a large amount of offline data, we recommend using Algorithm 2 to learn the optimal tracking policy. For more details on the difference between on-policy and off-policy learning methods, refer to Kiumarsi et al. (2018).

4. Simulation

In this section, an example is used to confirm the feasibility of the proposed two-phase RL methodology and compare it with the existing optimal design scheme.

Example 2. Consider an output tracking problem for the spring–mass–damper system with the following dynamics (Modares & Lewis, 2014b):

ẋ_1 = x_2
ẋ_2 = -(k/m) x_1 - (c/m) x_2 + (1/m) u    (42)
y = x_1

where x_1 and x_2 are the position and velocity, respectively, and m, c, and k are constant system parameters, which are not required to be known for our two-phase RL algorithms. In this simulation, we set the parameters m = 1 kg, c = 0.5 N s/m, k = 5 N/m.

The desired trajectory for y is y_d = 0.5 sin(√5 t), which can be generated by the autonomous system

ẋ_d = \begin{bmatrix} 0 & 1 \\ -5 & 0 \end{bmatrix} x_d    (43)

with initial condition x_d(0) = [0.5, 0.5]^T and F = [1, 0].

Let Q = 10, R = 0.1, and γ = 0.1 in cost function (3). The lqr and lyap functions in MATLAB are employed to solve the reduced-order ARE (11) and the Sylvester equation (12), respectively. The obtained optimal feedback gain and feedforward gain are

K̄_1 = [6.01704784  2.96234903]    (44)
K̄_2 = [-6.08312307  -3.49933454]    (45)

4.1. Simulation results on Algorithm 1

We perform Algorithm 1 with s = 50, σ = 10^{-5}, δt = 0.01 s, K_1^0 = 0, K_2^0 = 0, and the probing noise ς = Σ_{i=1}^{100} sin(w_i t), where w_i is selected randomly from [-500, 500]. Fig. 3 shows the norm of the difference of P_1^i, K_1^i, P_2^i, K_2^i during learning. The feedback gain and the feedforward gain learned by Algorithm 1 are

K_1 = [6.18034547  3.05115080]    (46)
K_2 = [-6.10406531  -3.50730167]    (47)

We can find that K_1 = K̄_1 + O(0.1) and K_2 = K̄_2 + O(0.1). Fig. 4 shows that the near-optimal tracking control policy learned from Algorithm 1 steers the system output y to track the desired trajectory y_d.

Fig. 3. Convergence of P_1, K_1, P_2, K_2 in Algorithm 1.
Fig. 4. The output y versus the trajectory y_d in Algorithm 1.

4.2. Simulation results on Algorithm 2

We perform Algorithm 2 with s = 200, σ = 10^{-10}, and δt, K_1^0, K_2^0, ς being the same as in Algorithm 1. Fig. 5 shows the norm of the difference of P_1^i, K_1^i during learning. The feedback gain and the feedforward gain learned by Algorithm 2 are

K_1 = [6.18033989  3.05115190]    (48)
K_2 = [-6.10403939  -3.50731051]    (49)

By (44) and (45), there exist K_1 = K̄_1 + O(0.1) and K_2 = K̄_2 + O(0.1). Fig. 6 shows that the learned tracking control policy makes the system output y follow the desired trajectory y_d.

Fig. 5. Convergence of P_1, K_1 in Algorithm 2.
Fig. 6. The output y versus the trajectory y_d in Algorithm 2.

4.3. Comparison with the optimal design scheme

Next, we compare the proposed near-optimal design methodology with the optimal design scheme (Modares & Lewis, 2014a; Modares et al., 2016) through three cases.

Case 1: We evaluate the tracking performance with different values of the discount factor γ. Under the same scenarios, the tracking errors of the optimal design scheme (5) with γ = 1, 0.1, 0.01 and of the γ-independent near-optimal design method are shown in Fig. 7. We can clearly find that the proposed γ-independent near-optimal tracking policy (17) results in the smallest steady-state tracking error.

Fig. 7. Evolution of tracking errors with optimal and near-optimal design schemes.

Case 2: We test the convergence rate with different values of γ. The off-policy two-phase RL Algorithm 2 and the full-order RL algorithm (8) given in Modares et al. (2016) are employed to solve Problem 1, respectively. We set the same termination condition σ = 10^{-10}. To make the comparison fair, Step 9 of Algorithm 2 is also counted as one iteration. Table 1 shows that the number of iterations of the γ-independent two-phase learning does not change with the decrease of γ, but the number of iterations of the full-order learning increases significantly. This is because γ^{-1} P̄_3 in the kernel matrix P becomes larger and larger as γ goes to zero.

Table 1
Number of iterations.

γ         Algorithm 2    Method of Modares et al. (2016)
0.1       9+1            9
0.01      9+1            12
0.001     9+1            154
0.0001    9+1            2273

Case 3: We examine the convergence performance with small numerical noise. An error of 1% of the measured amplitude is added to the sampled data from 1 s to 1.1 s to simulate a slight perturbation in practice. We find that Algorithm 2 converges to

K_1 = [6.17791106  3.04994539]    (50)
K_2 = [-6.10277113  -3.51001929]    (51)

after 10 iterations. However, owing to the ill-conditioned numerical issues, the full-order learning (8) with γ = 0.01 cannot converge in this case. Therefore, this numerical test suggests better robustness of the two-phase RL methodology with respect to bounded additive measurement noise than the existing work for solving the LQTC problem.

5. Conclusion

In this paper, using singular perturbation theory, we proposed a near-optimal tracking control policy design methodology for the LQTC problem of unknown systems. The proposed methodology can ensure closed-loop stability without any bound requirement on the discount factor, overcome the potential ill-conditioned numerical issues associated with the full-order discounted ARE, and reduce the computational complexity. The idea in this paper may serve as a basic framework to investigate related control problems, such as H∞ tracking, optimal tracking of nonlinear systems, and output-feedback tracking.

References

Baldi, S., Azzollini, I. A., & Ioannou, P. A. (2021). A distributed indirect adaptive approach to cooperative tracking in networks of uncertain single-input single-output systems. IEEE Transactions on Automatic Control, 66(10), 4844–4851.
Bartels, R. H., & Stewart, G. W. (1972). Solution of the matrix equation AX + XB = C. Communications of the ACM, 15(9), 820–826.
Chen, C., Modares, H., Xie, K., Lewis, F. L., Wan, Y., & Xie, S. (2019). Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Transactions on Automatic Control, 64(11).
Cormen, T. H., Leiserson, C., Rivest, R., & Stein, C. (2009). Introduction to algorithms. Cambridge, MA: MIT Press.
Gaitsgory, V., Grüne, L., & Thatcher, N. (2015). Stabilization with discounted optimal control. Systems & Control Letters, 82, 91–98.
Gajic, Z., & Lim, M. (2001). Optimal control of singularly perturbed linear systems and applications: High-accuracy techniques. New York: Marcel Dekker.
Gao, W., Deng, C., Jiang, Y., & Jiang, Z.-P. (2022). Resilient reinforcement learning and robust output regulation under denial-of-service attacks. Automatica, 142, Article 110366.
Gao, W., & Jiang, Z.-P. (2016). Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE Transactions on Automatic Control, 61(12), 4164–4169.
Gao, W., Jiang, Y., Jiang, Z.-P., & Chai, T. (2016). Output-feedback adaptive optimal control of interconnected systems based on robust adaptive dynamic programming. Automatica, 72, 37–45.
Gao, W., Jiang, Z.-P., Lewis, F. L., & Wang, Y. (2018). Leader-to-formation stability of multiagent systems: An adaptive optimal control approach. IEEE Transactions on Automatic Control, 63(10), 3581–3587.
Granzotto, M., Postoyan, R., Busoniu, L., Nešić, D., & Daafouz, J. (2021). Finite-horizon discounted optimal control: Stability and performance. IEEE Transactions on Automatic Control, 66(2), 550–565.
Jiang, Y., & Jiang, Z.-P. (2017). Robust adaptive dynamic programming. Hoboken: Wiley-IEEE Press.
Kamalapurkar, R., Dinh, H., Bhasin, S., & Dixon, W. E. (2015). Approximate optimal trajectory tracking for continuous-time nonlinear systems. Automatica, 51, 40–48.
Kiumarsi, B., Vamvoudakis, K. G., Modares, H., & Lewis, F. L. (2018). Optimal and autonomous control using reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 29(6), 2042–2062.
Kleinman, D. (1968). On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control, 13(1), 114–115.
Kodra, K., & Gajic, Z. (2017). Optimal control for a new class of singularly perturbed linear systems. Automatica, 81, 203–208.
Kokotovic, P. V., Khalil, H. K., & O'Reilly, J. (1999). Singular perturbation methods in control: Analysis and design. Philadelphia: SIAM.
Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3), 32–50.
Lewis, F. L., Vrabie, D., & Syrmos, V. L. (2012). Optimal control. Hoboken, NJ: Wiley.
Liu, Z., & Wu, H.-N. (2019). New insight into the simultaneous policy update algorithms related to H∞ state feedback control. Information Sciences, 484, 84–94.
Liu, D., Xue, S., Zhao, B., Luo, B., & Wei, Q. (2021). Adaptive dynamic programming for control: A survey and recent advances. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 51(1), 142–160.
Modares, H., & Lewis, F. L. (2014a). Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Transactions on Automatic Control, 59(11), 3051–3056.
Modares, H., & Lewis, F. L. (2014b). Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning. Automatica, 50(7), 1780–1792.
Modares, H., Lewis, F. L., & Jiang, Z.-P. (2015). H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 26(10), 2550–2562.
Modares, H., Lewis, F. L., & Jiang, Z.-P. (2016). Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning. IEEE Transactions on Cybernetics, 46(11), 2401–2410.
Mukherjee, S., Bai, H., & Chakrabortty, A. (2021). Reduced-dimensional reinforcement learning control using singular perturbation approximations. Automatica, 126, Article 109451.
Mukherjee, S., & Vu, T. L. (2022). Reinforcement learning of structured stabilizing control for linear systems with unknown state matrix. IEEE Transactions on Automatic Control.
Najafi Birgani, S., Moaveni, B., & Khaki-Sedigh, A. (2018). Infinite horizon linear quadratic tracking problem: A discounted cost function approach. Optimal Control Applications & Methods, 39(4), 1549–1572.
Pang, B., Jiang, Z.-P., & Mareels, I. (2020). Reinforcement learning for adaptive optimal control of continuous-time linear periodic systems. Automatica, 118, Article 109035.
Postoyan, R., Busoniu, L., Nešić, D., & Daafouz, J. (2017). Stability analysis of discrete-time infinite-horizon optimal control with discounted cost. IEEE Transactions on Automatic Control, 62(6), 2736–2749.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). Cambridge, Massachusetts: MIT Press.
Vamvoudakis, K. G. (2015). Non-zero-sum Nash Q-learning for unknown deterministic continuous-time linear systems. Automatica, 61, 274–281.
Vamvoudakis, K. G., & Ferraz, H. (2018). Model-free event-triggered control algorithm for continuous-time linear systems with optimal performance. Automatica, 87, 412–420.
Vamvoudakis, K. G., & Hespanha, J. P. (2018). Cooperative Q-learning for rejection of persistent adversarial inputs in networked linear quadratic systems. IEEE Transactions on Automatic Control, 63(4), 1018–1031.
Vamvoudakis, K. G., Modares, H., Kiumarsi, B., & Lewis, F. L. (2017). Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online. IEEE Control Systems Magazine, 37(1), 33–52.
Wang, D., Ha, M., & Qiao, J. (2020). Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Transactions on Automatic Control, 65(3), 1272–1279.
Wu, H.-N., & Luo, B. (2013). Simultaneous policy update algorithms for learning the solution of linear continuous-time H∞ state feedback control. Information Sciences, 222, 472–485.
Yang, C., Zhong, S., Liu, X., Dai, W., & Zhou, L. (2020). Adaptive composite suboptimal control for linear singularly perturbed systems with unknown slow dynamics. International Journal of Robust and Nonlinear Control, 30(7), 2625–2643.
Zhang, H., Cui, L., Zhang, X., & Luo, Y. (2011). Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Transactions on Neural Networks, 22(12), 2226–2236.
Zhao, J., Yang, C., Dai, W., & Gao, W. (2022). Reinforcement learning-based composite optimal operational control of industrial systems with multiple unit devices. IEEE Transactions on Industrial Informatics, 18(2), 1091–1101.
Zhao, J., Yang, C., & Gao, W. (2022). Reinforcement learning based optimal control of linear singularly perturbed systems. IEEE Transactions on Circuits and Systems II: Express Briefs, 69(3), 1362–1366.

Jianguo Zhao received the B.S. degree in Automation from Lanzhou University of Technology, Lanzhou, China, in 2017, and the M.S. degree in Control Science and Engineering from China University of Mining and Technology, Xuzhou, China, in 2020, where he is currently working toward the Ph.D. degree in control science and engineering. He is also a visiting Ph.D. student with the Department of Electrical Engineering, Yeungnam University, Kyongsan, South Korea, from 2022 to 2023. His research interests include multi-time-scale systems, singular perturbation theory with applications to power and mechatronic systems, optimal control, and reinforcement learning.

Chunyu Yang received his B.S. degree in Applied Mathematics and his Ph.D. degree in Control Theory and Control Engineering from Northeastern University, Shenyang, China, in 2002 and 2009, respectively. He was a Visiting Scholar at the University of Michigan from 2014 to 2016. Currently, he is a Professor in the School of Information and Control Engineering at China University of Mining and Technology, Xuzhou, China. His research interests include singularly perturbed systems, industrial process operational control, cyber–physical systems, and robust control.

Weinan Gao received the B.Sc. degree in Automation from Northeastern University, Shenyang, China, in 2011, the M.Sc. degree in Control Theory and Control Engineering from Northeastern University, Shenyang, China, in 2013, and the Ph.D. degree in Electrical Engineering from New York University, Brooklyn, NY, in 2017. His research interests include reinforcement learning, adaptive dynamic programming, optimal control, cooperative adaptive cruise control, intelligent transportation systems, sampled-data control systems, and output regulation theory. He was the recipient of the Best Paper Award at the IEEE International Conference on Real-time Computing and Robotics, and the David Goodman Research Award at New York University. Dr. Gao is an Associate Editor or a Guest Editor of IEEE Transactions on Neural Networks and Learning Systems, IEEE/CAA Journal of Automatica Sinica, Control Engineering Practice, IEEE Transactions on Circuits and Systems II: Express Briefs, Neurocomputing, and Complex & Intelligent Systems, a member of the Editorial Board of Neural Computing and Applications, and a Technical Committee Member in the IEEE Control Systems Society on Nonlinear Systems and Control and in IFAC TC 1.2 Adaptive and Learning Systems.

Hamidreza Modares received the B.S. degree in electrical engineering from the University of Tehran, Tehran, Iran, in 2004, the M.S. degree in electrical engineering from the Shahrood University of Technology, Shahrood, Iran, in 2006, and the Ph.D. degree in electrical engineering from the University of Texas at Arlington, Arlington, TX, USA, in 2015. He was a Senior Lecturer with the Shahrood University of Technology from 2006 to 2009, and a Faculty Research Associate with the University of Texas at Arlington from 2015 to 2016. He is currently an Assistant Professor with the Mechanical Engineering Department, Michigan State University, East Lansing, MI, USA. His current research interests include cyber–physical systems, reinforcement learning, distributed control, robotics, and machine learning. Dr. Modares was the recipient of the Best Paper Award from the 2015 IEEE International Symposium on Resilient Control Systems. He is an Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems.

Xinkai Chen received his Ph.D. degree in engineering from Nagoya University, Japan, in 1999. He is currently a professor in the Department of Electronic and Information Systems, Shibaura Institute of Technology, Japan. His research interests include adaptive control, smart materials, hysteresis, sliding mode control, machine vision, and observers. Dr. Chen has served as an Associate Editor of several journals, including IEEE Transactions on Automatic Control, IEEE Transactions on Control Systems Technology, IEEE Transactions on Industrial Electronics, IEEE/ASME Transactions on Mechatronics, and European Journal of Control. He has also served international conferences as an organizing committee member, including as General Chair, Program Chair, General Co-Chair, and Program Co-Chair. He is a Fellow of IEEE.

Wei Dai received the M.S. and Ph.D. degrees in control theory and control engineering from Northeastern University, Shenyang, China, in 2009 and 2015, respectively. From 2013 to 2015, he was a Teaching Assistant with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University. He is currently a Professor and an Outstanding Young Backbone Teacher with the China University of Mining and Technology, Xuzhou, China. He acted as the project leader in several funded research projects (funds from the Natural Science Foundation of China, the Natural Science Foundation of Jiangsu Province, the Postdoctoral Science Foundation of China, and so on). His current research interests include modeling, optimization, and control of complex industrial processes, data mining, and machine learning.