Learning-Based Control of Continuous-Time Systems Using Output Feedback
Leilei Cui and Zhong-Ping Jiang
value function as a functional of the continuous-time input-output trajectory. By combining the model-based PI [21] with the RL technique, a learning-based PI algorithm is proposed such that, given an initial admissible controller, the optimal control policy is iteratively approximated using input-output data in the absence of the system model. Our contributions are as follows: 1) we extend the state reconstruction technique from discrete-time systems [23] to continuous-time systems without discretization; 2) we propose a learning-based output-feedback control approach directly designed for continuous-time systems without discretization; and 3) we analyze the theoretical convergence of the proposed algorithm.

(2.1)  $\dot{x} = Ax + Bu, \quad x(0) = x_0,$
       $y = Cx,$

where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$, and $y \in \mathbb{R}^q$ are the state, input, and output of the system; $x_0$ is the initial state; $A$, $B$, and $C$ are constant matrices with compatible dimensions. We assume that $(A, B)$ is controllable and $(C, A)$ is observable. Under this assumption, LQR is the problem of minimizing the following quadratic cost:

(2.2)  $J(x_0, u) = \int_0^{\infty} \left( y^T Q y + u^T R u \right) dt,$

with $Q = Q^T \succeq 0$, $R = R^T \succ 0$, and $(\sqrt{Q}\,C, A)$ being observable.
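As a point of reference, the cost (2.2) with $y = Cx$ is an ordinary LQR problem with state weighting $C^T Q C$, so when the model is known the optimal gain follows directly from the algebraic Riccati equation. The minimal Python sketch below (the function name and the double-integrator example are illustrative choices, not from the paper) shows this model-based baseline that the learning-based algorithm is designed to avoid:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_output_weighted(A, B, C, Q, R):
    """Optimal state-feedback gain for the cost (2.2): J = ∫ (y'Qy + u'Ru) dt with y = Cx."""
    Qx = C.T @ Q @ C                       # output cost y'Qy equals x'(C'QC)x
    P = solve_continuous_are(A, B, Qx, R)  # stabilizing solution of the Riccati equation
    K = np.linalg.solve(R, B.T @ P)        # K* = R^{-1} B^T P, so the optimal input is u = -K*x
    return K, P

# Toy example (not the F-16 model of Section 5): a double integrator with position output.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
K, P = lqr_output_weighted(A, B, C, Q=np.eye(1), R=np.eye(1))
```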
3: Apply $u(t) = -\int_{-T}^{0} \bar{K}_1(\theta) z_t(\theta)\, d\theta + e(t)$, where $e$ is an exploration signal, to explore system (2.1) and collect the input-output data $u(t)$, $y(t)$, $t \in [0, t_{L+1}]$.
4: Set the threshold $\delta > 0$ and $i = 1$.
5: repeat
6:   Construct $H_i$ and $S_i$ by (4.15) and (4.17).
7:   Get $\hat{\Theta}_i^N$ by solving (4.19).
8:   Get $\hat{\bar{K}}_{i+1}(\theta)$ by (4.9).
9:   $i \leftarrow i + 1$
10: until $|\hat{\Theta}_i^N - \hat{\Theta}_{i-1}^N| < \delta$.
11: Use $\hat{u}_i(z_t) = -\int_{-T}^{0} \hat{\bar{K}}_i(\theta) z_t(\theta)\, d\theta$ as the approximated optimal controller.
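To make the loop structure of steps 4-10 concrete, here is a schematic Python sketch. The helpers `construct_H_S`, `solve_theta`, and `gain_from_theta` are hypothetical placeholders for equations (4.15)/(4.17), (4.19), and (4.9), which are not reproduced in this excerpt; only the iteration and the stopping criterion follow Algorithm 1:

```python
import numpy as np

def learning_based_pi(data, construct_H_S, solve_theta, gain_from_theta,
                      delta=1e-2, max_iter=100):
    """Schematic outline of steps 4-10 of Algorithm 1.

    `construct_H_S`, `solve_theta`, and `gain_from_theta` stand in for equations
    (4.15)/(4.17), (4.19), and (4.9); `data` holds the input-output trajectory
    recorded in step 3.
    """
    theta_prev, K_hat = None, None
    for i in range(1, max_iter + 1):
        H_i, S_i = construct_H_S(data, i)      # step 6: build the data matrices
        theta_i = solve_theta(H_i, S_i)        # step 7: least-squares estimate of Theta_i^N
        K_hat = gain_from_theta(theta_i)       # step 8: recover the gain kernel K_{i+1}(theta)
        if theta_prev is not None and np.linalg.norm(theta_i - theta_prev) < delta:
            break                              # step 10: stopping criterion met
        theta_prev = theta_i                   # step 9: move on to the next iteration
    return K_hat, theta_i                      # step 11: approximated optimal controller gain
```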
Subtracting (4.21) from (4.17) yields

(4.22)  $\hat{E}_i^N = H_i \tilde{\Theta}_i^N + E_i^N.$

Since $\hat{\Theta}_i^N$ is the least-square solution to (4.17), it is obtained that

(4.23)  $\frac{1}{L}(\hat{E}_i^N)^T \hat{E}_i^N \le \frac{1}{L}(E_i^N)^T E_i^N.$

It follows from (4.22) and (4.23) that

$\frac{1}{L}(\tilde{\Theta}_i^N)^T H_i^T H_i \tilde{\Theta}_i^N \le \cdots$

$\lim_{N\to\infty}(\tilde{\Theta}_{i+1}^N)^T \tilde{\Theta}_{i+1}^N = 0$ is obtained from (4.25). Comparing (4.6) with (4.9), it is obtained that for any $\eta > 0$, there exists $N^*(i+1, \eta) > 0$ such that if $N \ge N^*(i+1, \eta)$, (4.20) holds for $i+1$. The proof is thus completed by induction.

Theorem 4.1. For any $\eta > 0$, there exist $i^* \in \mathbb{Z}_+$ and $N^{**} > 0$ such that if $N \ge N^{**}$,

(4.26)  $\|\bar{P}^*(\xi,\theta) - \hat{\bar{P}}_{i^*}(\xi,\theta)\|_\infty \le \eta, \qquad \|\bar{K}^*(\theta) - \hat{\bar{K}}_{i^*}(\theta)\|_\infty \le \eta.$

Proof. According to Lemma 2.1, there exists $i^* \in \mathbb{Z}_+$ such that

(4.27)  $\|\bar{P}^*(\xi,\theta) - \bar{P}_{i^*}(\xi,\theta)\|_\infty \le \eta/2, \qquad \|\bar{K}^*(\theta) - \bar{K}_{i^*}(\theta)\|_\infty \le \eta/2.$

Furthermore, by Lemma 4.1, if $N \ge N^*(i^*, \eta/2)$,

(4.28)  $\|\bar{P}_{i^*}(\xi,\theta) - \hat{\bar{P}}_{i^*}(\xi,\theta)\|_\infty \le \eta/2, \qquad \|\bar{K}_{i^*}(\theta) - \hat{\bar{K}}_{i^*}(\theta)\|_\infty \le \eta/2.$

Therefore, (4.26) is obtained by the triangle inequality.
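The step from (4.22) to (4.23) uses only the defining property of a least-squares solution: its residual norm cannot exceed the residual norm at any other parameter value, in particular at the true one. A small self-contained numerical check of this property, with generic random matrices standing in for $H_i$ and $S_i$ (which are defined in equations not reproduced here), is:

```python
import numpy as np

rng = np.random.default_rng(0)
L, p = 200, 5                        # number of data rows and parameter dimension (illustrative)
H = rng.normal(size=(L, p))          # stands in for the data matrix H_i
theta_true = rng.normal(size=p)      # stands in for the underlying parameter
S = H @ theta_true + 0.1 * rng.normal(size=L)     # noisy targets, standing in for S_i

theta_ls, *_ = np.linalg.lstsq(H, S, rcond=None)  # least-squares estimate (cf. Theta_i^N)
res_ls = S - H @ theta_ls            # residual at the least-squares solution (cf. E_hat_i^N)
res_true = S - H @ theta_true        # residual at the true parameter (cf. E_i^N)

# Property behind (4.23): the least-squares residual norm is minimal.
assert res_ls @ res_ls <= res_true @ res_true + 1e-12
```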
5 Application to F-16 Aircraft Control

[Figure 1: relative approximation error $\|\hat{\bar{K}}_i(\theta) - \bar{K}^*(\theta)\|_\infty / \|\bar{K}^*(\theta)\|_\infty$ over iterations 1 to 12.]

From t = 0 s to t = 10 s, the exploratory/behavior policy is $u(t) = 0.2 \sum_{i=1}^{500} \sin(w_i t)$, where $w_i$ ($i = 1, \ldots, 500$) are randomly sampled from the uniform distribution over $[-250, 250]$. The integrating interval is $t_{k+1} - t_k = 0.01$.

For the PI algorithm, the input-output data is collected from t = 0 s to t = 10 s, and Algorithm 1 starts at t = 10 s. The algorithm converges after eleven iterations, when the stopping criterion $|\hat{\Theta}_i^N - \hat{\Theta}_{i-1}^N| < 0.01$ is satisfied. The relative approximation error $\|\hat{\bar{K}}_i(\theta) - \bar{K}^*(\theta)\|_\infty / \|\bar{K}^*(\theta)\|_\infty$ is plotted in Fig. 1. It is seen that at the eleventh iteration, the approximation error is $\|\hat{\bar{K}}_{11}(\theta) - \bar{K}^*(\theta)\|_\infty / \|\bar{K}^*(\theta)\|_\infty = 3.20\%$. The state trajectory is shown in Fig. 2. It is observed that the state quickly converges to zero after the controller is updated by Algorithm 1.

[Figure 2: Trajectory of the state, with annotations marking where data collection ends and the controller is updated at t = 10 s.]
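For illustration, the exploratory input described above can be generated as in the following sketch; the amplitude, frequency range, horizon, and sampling step are taken from the text, while the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(-250.0, 250.0, size=500)   # frequencies w_i sampled uniformly over [-250, 250]
dt = 0.01                                  # integrating interval t_{k+1} - t_k
t = np.arange(0.0, 10.0 + dt, dt)          # exploration window: t = 0 s to t = 10 s

# Exploratory/behavior policy u(t) = 0.2 * sum_{i=1}^{500} sin(w_i * t)
u = 0.2 * np.sin(np.outer(t, w)).sum(axis=1)
```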
6 Conclusion

This paper has proposed a novel learning-based output-feedback control approach for adaptive optimal control of continuous-time linear systems. The first major contribution of the paper is extending the state reconstruction technique in [23] to continuous-time systems without discretizing the system dynamics. By integrating RL techniques, a learning-based PI algorithm is developed and its convergence is theoretically analyzed. The efficacy of the proposed algorithm is validated by a practical example arising from F-16 flight control. Our future work will be directed at extending the proposed learning-based control methodology to robust ADP and multi-agent systems with output feedback.

References

[1] F. Abdollahi, H. Talebi, and R. Patel, A stable neural network-based observer with application to flexible-joint manipulators, IEEE Transactions on Neural Networks, 17 (2006), pp. 118-129.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[3] T. Bian and Z. P. Jiang, Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design, Automatica, 71 (2016), pp. 348-360.
[4] C. Chen, L. Xie, K. Xie, F. L. Lewis, and S. Xie, Adaptive optimal output tracking of continuous-time systems via output-feedback-based reinforcement learning, Automatica, 146 (2022), p. 110581.
[5] C.-T. Chen, Linear System Theory and Design, Oxford University Press, New York, NY, 3rd ed., 1999.
[6] L. Cui and Z. P. Jiang, A reinforcement learning look at risk-sensitive linear quadratic Gaussian control, arXiv preprint arXiv:2212.02072, (2022).
[7] L. Cui, K. Ozbay, and Z. P. Jiang, Combined longitudinal and lateral control of autonomous vehicles based on reinforcement learning, in 2021 American Control Conference (ACC), 2021, pp. 1929-1934.
[8] L. Cui, B. Pang, and Z. P. Jiang, Learning-based adaptive optimal control of linear time-delay systems: A policy iteration approach, arXiv preprint arXiv:2210.00204, (2022).
[9] L. Cui, S. Wang, J. Zhang, D. Zhang, J. Lai, Y. Zheng, Z. Zhang, and Z. P. Jiang, Learning-based balance control of wheel-legged robots, IEEE Robotics and Automation Letters, 6 (2021), pp. 7667-7674.
[11] W. Gao and Z. P. Jiang, Adaptive dynamic programming and adaptive optimal output regulation of linear systems, IEEE Transactions on Automatic Control, 61 (2016), pp. 4164-4169.
[12] W. Gao and Z. P. Jiang, Adaptive optimal output regulation of time-delay systems via measurement feedback, IEEE Transactions on Neural Networks and Learning Systems, 30 (2019), pp. 938-945.
[13] A. Guyader and Q. Zhang, Adaptive observer for discrete time linear time varying systems, IFAC Proceedings Volumes, 36 (2003), pp. 1705-1710. 13th IFAC Symposium on System Identification, Rotterdam, The Netherlands, 27-29 August, 2003.
[14] R. A. Howard, Dynamic Programming and Markov Processes, MIT Press, Cambridge, MA, 1960.
[15] Y. Jiang and Z. P. Jiang, Approximate dynamic programming for output feedback control, in Proceedings of the 29th Chinese Control Conference, 2010, pp. 5815-5820.
[16] Y. Jiang and Z. P. Jiang, Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics, Automatica, 48 (2012), pp. 2699-2704.
[17] Y. Jiang and Z. P. Jiang, Robust Adaptive Dynamic Programming, Wiley-IEEE Press, NJ, USA, 2017.
[18] Z. P. Jiang, T. Bian, and W. Gao, Learning-based control: A tutorial and some recent results, Foundations and Trends in Systems and Control, 8 (2020), pp. 176-284.
[19] Z. P. Jiang, C. Prieur, and A. Astolfi (Editors), Trends in Nonlinear and Adaptive Control: A Tribute to Laurent Praly for His 65th Birthday, Springer Nature, NY, USA, 2021.
[20] B. Kiumarsi, F. L. Lewis, M.-B. Naghibi-Sistani, and A. Karimpour, Optimal tracking control of unknown discrete-time linear systems using input-output measured data, IEEE Transactions on Cybernetics, 45 (2015), pp. 2770-2779.
[21] D. Kleinman, On an iterative technique for Riccati equation computations, IEEE Transactions on Automatic Control, 13 (1968), pp. 114-115.
[22] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, NJ, USA, 2013.
[23] F. L. Lewis and K. G. Vamvoudakis, Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41 (2011), pp. 14-25.
[24] D. Liberzon, Calculus of Variations and Optimal Control Theory: A Concise Introduction, Princeton University Press, NJ, USA, 2012.
[25] T. Liu, L. Cui, B. Pang, and Z. P. Jiang, Data-…
[26] …, learning is robust to control-dependent noise, Biological Cybernetics, 116 (2022), pp. 307-325.
[27] B. Pang and Z. P. Jiang, Adaptive optimal control of linear periodic systems: An off-policy value iteration approach, IEEE Transactions on Automatic Control, 66 (2021), pp. 888-894.
[28] M. J. D. Powell, Approximation Theory and Methods, Cambridge University Press, New York, NY, 1981.
[29] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley Series in Probability and Statistics, Wiley, Hoboken, NJ, USA, 2nd ed., 2011.
[30] S. A. A. Rizvi and Z. Lin, Output feedback adaptive dynamic programming for linear differential zero-sum games, Automatica, 122 (2020), p. 109272.
[31] S. A. A. Rizvi and Z. Lin, Adaptive dynamic programming for model-free global stabilization of control constrained continuous-time systems, IEEE Transactions on Cybernetics, 52 (2022), pp. 1048-1060.
[32] B. L. Stevens, F. L. Lewis, and E. N. Johnson, Aircraft Control and Simulation: Dynamics, Controls Design, and Autonomous Systems, John Wiley & Sons, Hoboken, NJ, 2015.
[33] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 2018.
[34] G. Tao, Adaptive Control Design and Analysis, John Wiley & Sons, Hoboken, NJ, 2003.
[35] Q. Zhang, Adaptive observer for multiple-input-multiple-output (MIMO) linear time-varying systems, IEEE Transactions on Automatic Control, 47 (2002), pp. 525-529.
[36] F. Zhao, W. Gao, T. Liu, and Z. P. Jiang, Adaptive optimal output regulation of linear discrete-time systems based on event-triggered output-feedback, Automatica, 137 (2022), p. 110103.
[37] L. M. Zhu, H. Modares, G. O. Peen, F. L. Lewis, and B. Yue, Adaptive suboptimal output-feedback control for linear systems using integral reinforcement learning, IEEE Transactions on Control Systems Technology, 23 (2015), pp. 264-273.
[38] K. J. Åström and B. Wittenmark, Adaptive Control, 2nd Edition, Addison-Wesley, MA, USA, 1997.