Linear Quadratic Control Using Model-Free Reinforcement Learning
Abstract—In this article, we consider the linear quadratic (LQ) control problem with process and measurement noises. We analyze the LQ problem in terms of the average cost and the structure of the value function. We assume that the dynamics of the linear system is unknown and only noisy measurements of the state variable are available. Using noisy measurements of the state variable, we propose two model-free iterative algorithms to solve the LQ problem. The proposed algorithms are variants of the policy iteration routine where the policy is greedy with respect to the average of all previous iterations. We rigorously analyze the properties of the proposed algorithms, including stability of the generated controllers and convergence. We analyze the effect of measurement noise on the performance of the proposed algorithms, the classical off-policy, and the classical Q-learning routines. We also investigate a model-building approach, inspired by adaptive control, where a model of the dynamical system is estimated and the optimal control problem is solved assuming that the estimated model is the true model. We use a benchmark to evaluate and compare our proposed algorithms with the classical off-policy, the classical Q-learning, and the policy gradient. We show that our model-building approach performs nearly identically to the analytical solution, and our proposed policy iteration-based algorithms outperform the classical off-policy and the classical Q-learning algorithms on this benchmark but do not outperform the model-building approach.

Index Terms—Linear quadratic (LQ) control, reinforcement learning (RL).

Manuscript received 24 July 2020; revised 28 January 2021 and 2 September 2021; accepted 8 January 2022. Date of publication 25 January 2022; date of current version 30 January 2023. The work of Farnaz Adib Yaghmaie was supported by the Excellence Center at Linköping–Lund in Information Technology (ELLIIT), the Vinnova Competence Center LINK-SIC, the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP), and CENIIT. The work of Fredrik Gustafsson was supported by the Vinnova Competence Center LINK-SIC and the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP). The work of Lennart Ljung was supported in part by the Swedish Research Council under Grant 2019-04956 and in part by the Vinnova Competence Center LINK-SIC. Recommended by Associate Editor J. Lavaei. (Corresponding author: Farnaz Adib Yaghmaie.) The authors are with the Department of Electrical Engineering, Linköping University, 58431 Linköping, Sweden. Digital Object Identifier 10.1109/TAC.2022.3145632.

I. INTRODUCTION

ADAPTIVE control studies data-driven approaches for control of unknown dynamical systems [4], [5]. If we combine adaptive control with optimal control theory, it is possible to control unknown dynamical systems adaptively and optimally. Reinforcement learning (RL) refers to a class of such routines, and its history dates back decades [6], [7]. With recent progress in machine learning (ML), specifically using deep networks, the RL field has been reinvigorated. RL algorithms have recently shown impressive performance in many challenging problems, including playing Atari games [1], agile robotics [3], control of continuous-time systems [2], [8]–[14], and distributed control of multiagent systems [15], [16].

In a typical RL setting, the model of the dynamical system is unknown and the optimal controller is learned through interaction with the dynamical system. One possible approach to solve an RL problem is the class of dynamic programming (DP) based solutions, also called temporal difference (TD) learning. The idea is to learn the value function to satisfy the Bellman equation [17], [18] by minimizing the Bellman error [17]. The celebrated Q-learning algorithm belongs to this class [19]. Least squares temporal difference learning (LSTD) [19], [20] refers to the case where least squares is used to minimize the TD error. Another class contains policy search algorithms, where the policy is learned by directly optimizing the performance index. Examples are policy-gradient methods that use sampled returns from the dynamical system [21] and actor–critic methods that use estimated value functions in learning the policy [22]. In the control community, adaptive control is used for optimal control of the unknown system by first estimating (possibly recursively) a model [23], [24] and then solving the optimal control problem using the estimated model [5]. This category of solutions is called model-based RL in the RL community, but we coin the term model-building RL to emphasize that no known/given model is used or assumed.

As the scope of RL expands to more demanding tasks, it is crucial for the RL algorithms to provide guaranteed stability and performance. Still, we are far away from analyzing the RL algorithms because of the inherent complexity of the tasks and deep networks. This motivates considering a simplified case study where analysis is possible. The linear quadratic (LQ) problem is a classical control problem where the dynamical system obeys linear dynamics and the cost function to be minimized is quadratic. The LQ problem has a celebrated closed-form solution and is an ideal benchmark for studying and comparing the RL algorithms because, first, it is theoretically tractable and, second, it is practical in various engineering domains.

Because of the aforementioned reasons, the LQ problem has attracted much attention from the RL community [11],
[25]–[27]; see [28] for a complete survey on RL approaches and properties for the LQ problems. For example, the authors in [12] and [13] study the linear quadratic regulator (LQR) control problems in completely deterministic setups via LSTD. Global convergence of policy gradient methods for the deterministic LQR problem is shown in [29]. Considering a stochastic setup is more demanding [11], [25]–[27]. In [11], a stochastic LQ problem with state- and control-dependent process noise is considered. The authors assume that the noise is measurable and propose an LSTD approach to solve the optimal control problem. In [25], the authors give LSTD algorithms for the LQ problem with process noise and analyze the regret bound, i.e., the difference between the incurred cost and the optimal cost. The regret bound is also studied in [30] for a model-building routine. The sample complexity of the LSTD for the LQ problem is studied in [27], and the gap between a model-building approach and a model-free routine is analyzed in [31]. In [26], a model-building approach, called coarse-ID control, is developed to solve the LQ problem with process noise, and sample complexity bounds are derived. The idea of coarse-ID control is to estimate the linear model and the uncertainty, and then to solve the optimal control problem using the information of the model and the uncertainty. A convex procedure to design a robust controller for the LQ problem with unknown dynamics is suggested in [32]. The total cost to be optimized in the LQ problem, or in general in RL, can be considered as the average cost [31] or the discounted cost [13], [27], and it is a design parameter. If the cost is equally important at all times, for example, when the cost is energy consumption, one may consider an average cost setting. If the immediate costs are more important, one may consider a discounted cost. In a discounted setting, one needs to select the discounting factor carefully to prevent loss of stability, while this can be avoided by considering the average cost.

In all the aforementioned results, only process noise is considered in the problem formulation. In practice, both process and measurement noises appear in the dynamical systems and one should carefully consider the effect of measurement noise. This is particularly important in feedback controller design, where only a noisy measurement of the output is available for the control purpose. When a model-free approach is used to control the system, it is not possible to use a filter, e.g., a Kalman filter, to estimate the state variable because the dynamics of the system is unknown. Yaghmaie and Gustafsson [33] consider both process and measurement noises in the problem formulation, but only stability of the generated policy is established and the effect of measurement noise is not studied. A closely related topic is RL for partially observable Markov decision processes (POMDPs), where the state is only partially observable [34]–[36]. The authors in [34] and [35] consider LQ systems with both process and observation noises and propose algorithms to estimate the model of the dynamical system and to learn a near optimal controller via gradient descent. Lewis and Vamvoudakis [36] consider the noise-free LQ problem and propose to use a history of input–output data in the controller design.

In this article, we focus on DP-based RL routines to solve the LQ problem. There is much research on RL algorithms for the LQ problem assuming that the state variable is exactly measurable [12], [13], [25], [27]. We take one step further and replace the state with a noisy measurement of the state, i.e., the output. There are two contributions to this article. Our first contribution is to study the LQ problem in terms of the average cost and the structure of the value function. Our second contribution is to propose two model-free DP-based RL algorithms and analyze their properties. We study the effect of measurement noise in the proposed algorithms and in the classical off-policy and the classical Q-learning routines. Such a discussion is absent in [33]. Moreover, we provide the convergence proof for the proposed algorithms, while such a discussion in a simpler setup in [25] is missing. We also investigate a model-building approach, inspired by adaptive control, which can be easily extended to cover partially observable dynamical systems. We leave the analysis of such model-building approaches for our future research. We would like to stress that the solutions we discuss (including the model-building one) are all in the model-free family of classical RL. The only information that is used is that the outputs come from a linear system of known order; no known model is assumed to generate the data. We have summarized the notations and abbreviations in Table I.

II. LQ PROBLEM

A. Linear Gaussian Dynamical Systems

Consider a linear Gaussian dynamical system

x_{k+1} = A x_k + B u_k + w_k    (1)
y_k = x_k + v_k    (2)

where x_k ∈ R^n and u_k ∈ R^m are the state and the control input vectors, respectively. The vector w_k ∈ R^n denotes the process noise drawn independent identically distributed (i.i.d.) from a Gaussian distribution N(0, W_w). The state variable x_k is not measurable, and the output y_k ∈ R^n in (2) denotes a noisy measurement of the state x_k, where v_k ∈ R^n is the measurement noise drawn i.i.d. from a Gaussian distribution N(0, W_v) where W_v is diagonal.

The model (A, B) is unknown but stabilizable. The model order n is assumed to be known. The measurements y_k, u_k can be used to form the next output in order to achieve a specific goal of the control. When the control input u_k is selected as a function of the output y_k, it is called a policy. In this article, we design stationary policies of the form u_k = K y_k to control the system in (1) and (2). Let L := A + BK. The policy gain K is stabilizing if ρ(L) < 1.

B. LQ Optimal Control Problem

We define the quadratic running cost as

r(y_k, u_k) = r_k = y_k^T R_y y_k + u_k^T R_u u_k    (3)

where R_y ≥ 0 and R_u > 0 are the output and the control weighting matrices, respectively. The average cost associated with the policy u_k = K y_k is defined by

λ(K) = lim_{τ→∞} (1/τ) E[ Σ_{t=1}^{τ} r(y_t, K y_t) ]    (4)

which does not depend on the initial state of the system [6].
TABLE I: NOTATION AND ABBREVIATIONS
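As a rough illustration of the setup in (1)–(4), the following sketch simulates a linear Gaussian system under a fixed policy u_k = K y_k and approximates the average cost λ(K) by the empirical mean of the running cost (3). The matrices A, B, K and the noise covariances are hypothetical placeholders, not the benchmark used in this article.

```python
import numpy as np

def average_cost(A, B, K, Ry, Ru, Ww, Wv, tau=200_000, seed=None):
    """Monte Carlo estimate of the average cost (4) for the policy u_k = K y_k."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = np.zeros(n)
    total = 0.0
    for _ in range(tau):
        y = x + rng.multivariate_normal(np.zeros(n), Wv)   # noisy output (2)
        u = K @ y                                          # stationary policy
        total += y @ Ry @ y + u @ Ru @ u                   # running cost (3)
        x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), Ww)  # dynamics (1)
    return total / tau

# Hypothetical two-state example (placeholder system, assumed stabilizing gain).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-1.0, -1.5]])          # rho(A + B K) < 1 for these numbers
Ry, Ru = np.eye(2), np.eye(1)
Ww, Wv = 0.01 * np.eye(2), 0.01 * np.eye(2)
print(average_cost(A, B, K, Ry, Ru, Ww, Wv, seed=0))
```

Because of the persistent process and measurement noises, the estimate converges to a strictly positive value, which is exactly the λ(K) that is subtracted in the value function below.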
For the dynamical system in (1) and (2), we define the value function [6] (or bias function [37]) associated with a given policy u_k = K y_k

V(y_k, K) = E[ Σ_{t=k}^{+∞} (r(y_t, K y_t) − λ(K)) | y_k ]    (5)

where the expectation is taken with respect to w_k, v_k.

Problem 1: Consider the dynamical system in (1) and (2). Design the gain K* such that the policy K* y_k minimizes (5).

In this article, we are interested in optimizing (5). The value function (5) quantifies the quality of the transient response using u_k = K y_k. Because of the presence of stationary process and measurement noises in (1) and (2), the controller u_k = K y_k results in a permanent average cost, which is represented by λ(K). In the following section, we will show that λ(K) is a function of W_w, W_v and has a nonzero value if the process noise or the measurement noise, or both, are present. By subtracting λ(K) in (5), we end up with the transient cost or the controller cost.

Another relevant problem is the classical LQR problem.

Problem 2 (LQR): Consider the dynamical system in (1) and (2) without noise terms (w_k = 0, v_k = 0). Design the controller u_k = K x_k to minimize (5).

Note that when Problem 2 is considered, λ(K) = 0 in (5) because w_k = 0, v_k = 0. The solution is well known to be u_k = K_LQR x_k, where K_LQR is obtained through an algebraic Riccati equation (ARE) using (A, B, R_y, R_u); see, e.g., [38].

A further relevant problem is the classical linear quadratic Gaussian (LQG) problem.

Problem 3 (LQG): Consider the dynamical system in (1) and (2). Design the controller u_k(y_1, ..., y_k) to minimize (4).

The optimal controller for Problem 3 is u_k = K_LQR x̂_{k|k}, where x̂_{k|k} is the a posteriori state estimate at time k given the observations (y_1, ..., y_k) (see [41, Corollary 9.2]). The cost to be minimized in the LQG problem contains the following two parts: 1) the LQR cost and 2) the mean-square state estimation error. The LQR cost measures the quality of the transient response and K_LQR is designed to minimize it. The mean-square error of the state estimate x̂_{k|k} is minimized by means of a Kalman filter, where the Kalman gain depends on (A, B, W_w, W_v).

Solving Problem 1 is in fact the same as the LQR problem in the sense that the gain K* turns out to be the same as the LQR gain: K* = K_LQR obtained from the standard ARE machinery. This is shown in the following lemma, and the proof is given in Appendix II.A.

Lemma 1: Consider Problem 1 and (1)–(5). The optimal policy gain K* is

K* = −(R_u + B^T P* B)^{-1} B^T P* A    (6)

where P* > 0 is the solution to the ARE

A^T P* A − P* − A^T P* B (B^T P* B + R_u)^{-1} B^T P* A + R_y = 0.    (7)

The average cost associated with K* is given by

λ(K*) = Tr(K*^T B^T P* B K* W_v) + Tr(P* (W_w + W_v)) − Tr((A + BK*)^T P* (A + BK*) W_v).    (8)

Remark 1: Lemma 1 states that Problem 1 is equivalent to the classical LQR problem and the solution can be found from a standard ARE machinery. We explain this point intuitively: the process and measurement noises result in a permanent average cost. We take away the effect of the process and measurement noises by subtracting the average cost in (5). This leaves us with the transient cost or the controller cost, which can be optimized through the ARE in (7). Also, note that we can cast solving Problem 1 as minimizing a regret function. The regret framework R_{u_k} evaluates the quality of a policy u_k by comparing its running cost to a baseline b [28]

R_{u_k} = Σ_{t=1}^{+∞} (r_t − b).    (9)

Let b := λ(K) in (4) represent the baseline. Then, the regret framework in (9) is indeed equivalent to the value function in (5), and minimizing V with respect to K (Problem 1) is cast as minimizing the regret function.
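As a concrete illustration of Lemma 1, the sketch below computes P* from the ARE (7) with scipy.linalg.solve_discrete_are, forms K* by (6), and evaluates the average cost by (8). The system matrices are hypothetical placeholders; the gain returned here is the analytical baseline that the model-free routines in this article try to recover from data.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lq_solution(A, B, Ry, Ru, Ww, Wv):
    """Optimal gain (6) and average cost (8) obtained from the ARE (7)."""
    P = solve_discrete_are(A, B, Ry, Ru)                    # P* in (7)
    K = -np.linalg.solve(Ru + B.T @ P @ B, B.T @ P @ A)     # K* in (6)
    L = A + B @ K
    lam = (np.trace(K.T @ B.T @ P @ B @ K @ Wv)
           + np.trace(P @ (Ww + Wv))
           - np.trace(L.T @ P @ L @ Wv))                    # lambda(K*) in (8)
    return K, P, lam

# Hypothetical example matrices (same placeholders as before).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Ry, Ru = np.eye(2), np.eye(1)
Ww, Wv = 0.01 * np.eye(2), 0.01 * np.eye(2)
K_star, P_star, lam_star = lq_solution(A, B, Ry, Ru, Ww, Wv)
```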
C. Iterative Model-Based Approach for the LQ Problem

According to Lemma 1, the solution to Problem 1 can be found from the ARE in (7). If the dynamics is known, one can use Hewer's method [39] in Algorithm 1. The algorithm is initiated
Algorithm 2: Average Off-Policy Learning.
1: Initialize: Select a stabilizing policy gain K^1, set i = 1.
2: for i = 1, .., N do
3:   Execute K^i y_k for τ rounds and estimate P̂^i, λ̄^i from (21)–(22).
4:   Z = CollectData(K^i, τ′, τ, W_η).
5:   Estimate ξ̂^i by (27) using Z.
6:   Update the policy gain K^{i+1} by (28).
7: end for

Algorithm 3: CollectData.
Input: K, τ′, τ, W_η.
Output: Z.
Z = {}.
for t = 1, .., τ do
  Execute K y for τ′ rounds and observe y.
  Sample η ∼ N(0, W_η) and set u = K y + η.
  Take u and observe y_+.
  Add (y, u, y_+) to Z.
end for

as an estimate of λ^i

λ̄^i = (1/τ) Σ_{t=1}^{τ} r_t.    (21)

The average-cost LSTD estimator of P^i is given by [37]

vecs(P̂^i) = ( Σ_{t=1}^{τ} Φ_t (Φ_t − Φ_{t+1})^T )^{-1} ( Σ_{t=1}^{τ} Φ_t (r_t − λ̄^i) ).    (22)

2) Policy Improvement: We find the improved policy gain by the concept of off-policy learning. The idea is to apply a behavioral policy u_k to the system while learning the target (or optimal) policy K^i y_k. Note that the behavioral policy should be different from the target policy. Define the behavioral policy as u_k = K^i y_k + η_k, where we sample η_k from N(0, W_η) at each time k and W_η is the covariance of the probing. Using the behavioral policy, the closed-loop system of (1) reads

y_{k+1} = A x_k + B u_k + w_k + v_{k+1}
        = L^i x_k + B(u_k − K^i y_k) + B K^i v_k + w_k + v_{k+1}.    (23)

Using the abovementioned equation, we can obtain the off-policy Bellman equation

y_k^T (R_y + K^{iT} R_u K^i − P^i) y_k + E[y_{k+1}^T P^i y_{k+1} | y_k] − λ^i
  = 2 (u_k − K^i y_k)^T B^T P^i A y_k + (u_k + K^i y_k)^T B^T P^i B (u_k − K^i y_k).    (24)

We will show in Theorem 1 that (24) is equivalent to the model-based Bellman equation (10). We can write (24) as a linear regression

c_k = Υ_k^T ξ^i + n_2    (25)

where

c_k = y_k^T (R_y + K^{iT} R_u K^i − P̂^i) y_k + y_{k+1}^T P̂^i y_{k+1} − λ̄^i
Υ_k = [ 2 y_k ⊗ (u_k − K^i y_k) ; vecv(u_k) − vecv(K^i y_k) ]
ξ^i = [ vec(B^T P^i A) ; vecs(B^T P^i B) ]
n_2 = y_{k+1}^T P̂^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | y_k] + y_k^T P^i y_k − y_k^T P̂^i y_k.    (26)

In (25), ξ^i is the parameter vector to be estimated, and c_k and Υ_k are known from the collected data.

Line 4: To estimate ξ^i, we run the CollectData routine in Algorithm 3 to collect τ sample points. The procedure is as follows: First, the policy K^i y_k is applied for τ′ rounds to ensure that E[x_k] = 0 and the output y_k is sampled. Then, we sample a random action η_k from η_k ∼ N(0, W_η). We apply the action u_k = K^i y_k + η_k and sample the next output y_{k+1}. This builds one data point (y_k, u_k, y_{k+1}). We repeat this procedure and collect τ data points.

Line 5: The estimate of ξ^i is given by

ξ̂^i = ( Σ_{t=1}^{τ} Υ_t Υ_t^T )^{-1} ( Σ_{t=1}^{τ} Υ_t c_t ).    (27)

Line 6: Using ξ̂^i, we now obtain the improved policy. Let H^i := B^T P^i A and N^i := B^T P^i B, and let Ĥ^i and N̂^i be their estimates using (27) [see (26) and (27)]. The improved policy is selected to be greedy with respect to the average of all previously estimated value functions

K^{i+1} = −( Σ_{j=1}^{i} (N̂^j + R_u) )^{-1} ( Σ_{j=1}^{i} Ĥ^j ).    (28)
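A minimal sketch of one iteration of this routine is given below. It assumes the CollectData convention above, uses a full-vec variant of the features in (26)–(27) instead of vecs/vecv for the control block, and treats rollout lengths, noise covariances, and the warm-up length as placeholders; (A, B) appear only inside the simulator that stands in for the unknown system. It illustrates the regression (27) and the averaged greedy update (28), not a drop-in implementation of Algorithm 2.

```python
import numpy as np

def vecv(y):
    """Quadratic features such that vecv(y) @ vecs(P) = y @ P @ y for symmetric P."""
    n = len(y)
    return np.array([(1.0 if i == j else 2.0) * y[i] * y[j]
                     for i in range(n) for j in range(i, n)])

def unvecs(p, n):
    """Rebuild the symmetric matrix from its half-vectorization."""
    P, k = np.zeros((n, n)), 0
    for i in range(n):
        for j in range(i, n):
            P[i, j] = P[j, i] = p[k]
            k += 1
    return P

def rollout(A, B, K, Ry, Ru, Ww, Wv, tau, rng, x=None):
    """On-policy rollout under u = K y; returns outputs, running costs, final state."""
    n = A.shape[0]
    x = np.zeros(n) if x is None else x
    ys, rs = [], []
    for _ in range(tau):
        y = x + rng.multivariate_normal(np.zeros(n), Wv)
        u = K @ y
        ys.append(y)
        rs.append(y @ Ry @ y + u @ Ru @ u)
        x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), Ww)
    return np.array(ys), np.array(rs), x

def collect_data(A, B, K, Ww, Wv, W_eta, warmup, tau, rng):
    """Algorithm 3: warm up under u = K y, then record one probed transition per sample."""
    n, m = B.shape
    x, Z = np.zeros(n), []
    for _ in range(tau):
        _, _, x = rollout(A, B, K, np.zeros((n, n)), np.zeros((m, m)), Ww, Wv, warmup, rng, x)
        y = x + rng.multivariate_normal(np.zeros(n), Wv)
        u = K @ y + rng.multivariate_normal(np.zeros(m), W_eta)
        x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n), Ww)
        Z.append((y, u, x + rng.multivariate_normal(np.zeros(n), Wv)))
    return Z

def avg_off_policy_iteration(A, B, K, Ry, Ru, Ww, Wv, W_eta, tau, warmup, sums, rng):
    """Lines 3-6 of Algorithm 2 (simplified sketch)."""
    n, m = B.shape
    # Line 3: estimate lambda_bar (21) and P_hat (22) by average-cost LSTD.
    ys, rs, _ = rollout(A, B, K, Ry, Ru, Ww, Wv, tau + 1, rng)
    lam_bar = rs.mean()
    Phi = np.array([vecv(y) for y in ys])
    M = Phi[:-1].T @ (Phi[:-1] - Phi[1:])
    P_hat = unvecs(np.linalg.lstsq(M, Phi[:-1].T @ (rs[:-1] - lam_bar), rcond=None)[0], n)
    # Lines 4-5: CollectData and the regression (25)-(27) with full-vec features.
    feats, cs = [], []
    for y, u, y_next in collect_data(A, B, K, Ww, Wv, W_eta, warmup, tau, rng):
        cs.append(y @ (Ry + K.T @ Ru @ K - P_hat) @ y + y_next @ P_hat @ y_next - lam_bar)
        feats.append(np.concatenate([2.0 * np.kron(y, u - K @ y),
                                     np.kron(u, u) - np.kron(K @ y, K @ y)]))
    theta = np.linalg.lstsq(np.array(feats), np.array(cs), rcond=None)[0]
    H_hat = theta[:n * m].reshape(m, n, order="F")          # estimate of B.T P A
    N_hat = theta[n * m:].reshape(m, m, order="F")          # estimate of B.T P B
    N_hat = 0.5 * (N_hat + N_hat.T)
    # Line 6: greedy with respect to the average of all previous estimates, as in (28).
    sums["N"] += N_hat + Ru
    sums["H"] += H_hat
    return -np.linalg.solve(sums["N"], sums["H"])
```

In a full run of Algorithm 2, `sums` would start as zero matrices of the appropriate sizes, the initial gain would have to be stabilizing, and the returned gain would be fed back in as `K` for the next iteration.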
B. Average Q-Learning

Our second routine, called average Q-learning, is given in Algorithm 4. Similar to the average off-policy learning in Section IV-A, the average Q-learning is a policy iteration routine. Different from the average off-policy learning, the policy is evaluated by finding the Q function. The algorithm is an extended version of the model-free linear quadratic (MFLQ) control in [25] to accommodate the effect of observation noise. In [25], only process noise is present and the average cost is parameterized based on the covariance of the process noise. Such an approach is not possible here because the average cost depends on the process and observation noises, and on the dynamics of the system. Note that there is no proof of convergence for the MFLQ algorithm in [25]. We give a proof of convergence for Algorithm 4 in Section V-B, which is also applicable to the MFLQ algorithm in [25].

The algorithm is initiated with a stabilizing policy gain K^1 in Line 1. Then, the algorithm iterates N times in Line 2 over the policy evaluation and the policy improvement steps. The policy evaluation step is given in Lines 3–5 and discussed in Section IV-B1. The policy gain K^i is evaluated by estimating its
TABLE III: PERCENTAGE OF STABILITY IN ALL ITERATIONS IN 100 SIMULATIONS
TABLE IV: MEDIAN OF THE RELATIVE ERROR |λ(K) − λ(K*)| / λ(K*)

see that the model-building approach has the lowest relative error and it is almost identical to the analytical solution. The average off-policy and the average Q-learning routines suffer from 4%–5% relative error at their best. The reason is that the estimates of P^i in the average off-policy and of G^i in the average Q-learning are biased, and they will not improve further by increasing the rollout length. However, the relative errors for the average off-policy and the average Q-learning are much smaller than those of the classical off-policy and Q-learning algorithms (∼ 35%–40%). When τ is small, the average off-policy and the average Q-learning reach their best performance. This is because Algorithm 3 is used to collect data for estimation, where the dynamics is run for τ′ steps before collecting a sample data point. It helps us to have independent sample points and E[x] = 0. The policy gradient algorithm performs worse than the other algorithms. It is because the policy gradient algorithms try to directly optimize the performance index and, as such, they are very sensitive to noise.

C. Discussion

In summary, one may consider different factors when choosing a model-building or a model-free approach. Our empirical evaluation shows that the model-building solution outperforms the model-free approaches and is the closest one to the analytical solution. This point has also been shown in [25] and [26] for the LQ problem with process noise only (no measurement noise), where it is possible to estimate (A, B) using ordinary least squares. When the measurement noise is present, however, this point remains to be proved.

The main difference between the model-building and the model-free approaches comes from the way the data are used. In the model-building approach, the data are used to identify the dynamics of the system, while in the model-free approach, they are used to learn the value function and the optimal policy. The main advantage of a model-free approach is to eliminate the system identification part and to directly solve the optimal control problem. Such an approach can be extremely useful for nonlinear systems, where it is difficult to identify the dynamics and solve the optimal control problem.

VII. CONCLUSION

In this article, we have considered the LQ control problem with process and measurement noises. We have assumed that the dynamical model is unknown and developed two model-free DP-based routines (Algorithms 2 and 4) to solve Problem 1. We have proved stability of the generated controllers and convergence of the algorithms. We have shown that the measurement noise can deteriorate the performance of DP-based RL routines. We have also presented a model-building approach in Algorithm 8 as a natural way of solving Problem 1 in the adaptive control context, using the same information from the system as Algorithms 2 and 4. We have used a popular RL benchmark for the LQ problem to evaluate our proposed algorithms. Our empirical evaluation shows that our DP-based algorithms produce more stable policies in comparison with the classical off-policy and the classical Q-learning routines for this benchmark, and they are the closest ones to the analytical solution. It has also turned out that our model-building approach outperforms the DP-based solutions, which is consistent with the previous results on linear systems with process noise only [21], [26].

APPENDIX I
CLASSICAL MODEL-FREE ALGORITHMS AND THE MODEL-BUILDING APPROACH

A. Classical Off-Policy Learning

The classical off-policy learning algorithm for deterministic systems is given in [12]. Here, we bring a modified version in Algorithm 5 to accommodate the effect of process and observation noises. The algorithm is initiated with a stabilizing policy gain K^1 in Line 1. Then, the algorithm iterates N times in Line 2 over the policy evaluation and the policy improvement steps. The policy evaluation step is given in Lines 3–5 and discussed in Appendix I.A.1. The policy improvement step is given in Line 6 and discussed in Appendix I.A.2.

Algorithm 5: Classical Off-Policy Learning.
1: Initialize: Select a stabilizing policy gain K^1, set i = 1.
2: for i = 1, .., N do
3:   Z = {}.
4:   for t = 1, ..., τ do
       Sample η_t ∼ N(0, W_η) and let u_t = K^i y_t + η_t.
       Take u_t and observe y_{t+1}.
       Add (y_t, u_t, y_{t+1}) to Z.
5:   end for
6:   Estimate λ̄^i from (21), and P^i, B^T P^i A, B^T P^i B from (36).
7:   Update the policy gain K^{i+1} by (37).
8: end for

1) Policy Evaluation: Lines 3–4: To evaluate the policy gain K^i, we collect τ samples of (y_k, u_k, y_{k+1}) in the following way: we observe y_k and then we sample η_k from N(0, W_η). We apply the policy u_k = K^i y_k + η_k and observe y_{k+1}.

Line 5: We set the average cost λ^i to the empirical average cost from τ samples (21). The off-policy Bellman equation (24) reads

r_k − λ^i = Ω_k^T ℓ^i + n_1    (34)
B. Classical Q-Learning

We bring the classical Q-learning routine from [27] in Algorithm 6 and modify it to accommodate the effect of process and measurement noises. The algorithm is initiated with a stabilizing policy gain K^1 in Line 1. Then, the algorithm iterates N times in Line 2 over the policy evaluation and the policy improvement steps. The policy evaluation step is given in Lines 3–5 and discussed in Appendix I.B.1. The policy improvement step is given in Line 6 and discussed in Appendix I.B.2.

1) Policy Evaluation: The policy gain K^i is evaluated by estimating the quadratic kernel G^i of the Q function.

Lines 3–4: We collect τ samples of (y_k, u_k, y_{k+1}) in the following way: we observe y_k and then we sample η_k from N(0, W_η). We apply the policy u_k = K^i y_k + η_k and observe y_{k+1}.

Line 5: We set λ^i to the empirical average cost from τ samples (21). The quality-based Bellman equation (15) reads

r_k − λ^i = (Ψ_k − Ψ_{k+1}) vecs(G^i) + n_4    (38)

2) Policy Improvement: Line 6: Let the matrix Ĝ^i be partitioned as in (32). The improved policy gain is given by

K^{i+1} = arg min_a Q̂^i(y_k, a) = −(Ĝ^i_22)^{-1} Ĝ^{iT}_12.    (41)
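For reference, the policy improvement (41) is just a block operation on the estimated kernel. A minimal sketch, assuming the kernel is arranged with the output block first (the partitioning that (41) presupposes) and that Ĝ^i has already been estimated, is:

```python
import numpy as np

def improved_gain(G_hat, n):
    """Greedy gain (41) from the (n+m)x(n+m) Q-function kernel G_hat.

    The leading n rows/columns correspond to y and the trailing block to u.
    """
    G12 = G_hat[:n, n:]          # cross term between output and control
    G22 = G_hat[n:, n:]          # control block
    return -np.linalg.solve(G22, G12.T)
```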
C. Model-Building Approach

A possible alternative to the model-free solutions to Problem 1, where the dynamics is not known, is to estimate (A, B) as a separate system identification problem and use these estimates for the solution of Problem 1. We call this a model-building approach to the problem. The estimated (Â, B̂) can be used in the ARE (7) to solve for the optimal policy gain in the corresponding LQ control problem (see Lemma 1).

The system identification problem to find (A, B) in (1) is actually a simple problem since the model order n is known. Let [U, Y] denote the input–output data and m0 denote the initial state-space model (which can be set to zero matrices when no initial knowledge about the dynamics is
known). To solve this structured problem in a general system identification code, like the system identification toolbox in MATLAB [42], the code in Algorithm 7 can be used. One can also use total least squares [41] to estimate the dynamics [33].

Since in this article we examine the model-building approach along with the model-free algorithms, we bring the model-building approach in a recursive way such that the controller gain in the model-building approach is updated at the same pace as in the model-free approaches. Algorithm 8 summarizes the model-building approach. In Line 1, we initialize the algorithm by selecting a stabilizing controller gain. Note that if the uncontrolled system is unstable, using only noise for system identification results in numerical instability. Otherwise, we can set K^1 = 0. We set empty matrices for the input–output data U, Y, assign zero matrices to the initial state-space model, and fix C = I (y_k = I x_k + v_k). In Line 2, the algorithm iterates N times. In Line 3, we collect τ input–output samples and append them to U, Y. In Line 4, we estimate (Â, B̂) using U, Y and m0. In Line 5, we update m0 by (Â, B̂). In Line 6, we solve the ARE (7) using (Â, B̂).

Our proposed model-building routine in Algorithm 7 can be easily extended (by further estimating C) to cover partially observable dynamical systems, similar to [34] and [35].
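The sketch below illustrates the spirit of this model-building loop in Python rather than with the MATLAB toolbox in [42]. It estimates (Â, B̂) by regressing y_{k+1} on (y_k, u_k) with ordinary least squares, which is a simplification: under measurement noise this regression is biased, which is exactly why the article points to structured identification or total least squares [41]. The estimates are then used to solve the ARE (7) as in Lemma 1.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def estimate_AB(Y, U):
    """Least-squares fit of y_{k+1} ~ A y_k + B u_k from stacked data (T x n, T x m)."""
    n = Y.shape[1]
    Phi = np.hstack([Y[:-1], U[:-1]])                    # regressors [y_k, u_k]
    Theta, *_ = np.linalg.lstsq(Phi, Y[1:], rcond=None)
    return Theta[:n].T, Theta[n:].T                      # A_hat, B_hat

def model_building_gain(Y, U, Ry, Ru):
    """Estimate the model, then certainty-equivalent LQ design via the ARE (7)."""
    A_hat, B_hat = estimate_AB(Y, U)
    P = solve_discrete_are(A_hat, B_hat, Ry, Ru)
    return -np.linalg.solve(Ru + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
```

In a recursive use as in Algorithm 8, Y and U would simply be grown by τ samples per iteration before refitting.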
D. Policy Gradient

A policy gradient routine is given in Algorithm 9 [21]. In Line 1, we initialize the algorithm with a controller gain and by selecting a pdf for the probabilistic policy (usually a multivariate Gaussian distribution). In Line 2, the algorithm is iterated until the ending condition is satisfied. In Lines 3–4, we collect T samples of (y_k, u_k, r_k) by sampling the actions from the pdf u_k ∼ p(a; K^i y_k) and storing them in Z. In Line 5, we calculate the total cost of T steps. In Line 6, we compute the gradient of the total cost with respect to K^i. Note that b in Line 6 is a baseline to reduce variance [28]. Among many options, one can select the baseline as the mean cost of previous iterates. In Line 7, we update the policy gain by gradient descent using δK^i.

Algorithm 9: Policy Gradient.
1: Initialize: Policy gain K^1, i = 1, probability density function p(a; K^i y_k).
2: while ending condition is not satisfied do
3:   Z = {}.
4:   for t = 1, ..., T do
       Observe y_t, sample u_t ∼ p(a; K^i y_t), take u_t.
       Observe r_t. Add (y_t, u_t, r_t) to Z.
5:   end for
6:   Let R(T) = Σ_{t=1}^{T} r_t.
7:   Set δK^i = Σ_{t=1}^{T} (R(T) − b) ∇_{K^i} log p(u_t; K^i y_t).
8:   Update K^{i+1} by gradient descent using δK^i.
9: end while
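A compact sketch of this REINFORCE-style update, assuming a Gaussian exploration policy u ∼ N(K y, Σ) (so that ∇_K log p(u; K y) = Σ^{-1}(u − K y) y^T) and a user-supplied rollout routine that returns (y_t, u_t, r_t) samples, could look as follows; it illustrates Lines 3–8 and is not the exact implementation evaluated in the article.

```python
import numpy as np

def policy_gradient_step(rollout, K, Sigma, baseline, lr, T, rng):
    """One pass of Lines 3-8 of Algorithm 9 with a Gaussian exploration policy."""
    Sigma_inv = np.linalg.inv(Sigma)
    ys, us, rs = rollout(K, Sigma, T, rng)        # Lines 3-4: collect T samples
    R_T = float(np.sum(rs))                        # Line 6: total cost of the episode
    grad = np.zeros_like(K)
    for y, u in zip(ys, us):
        score = Sigma_inv @ (u - K @ y)            # gradient of log N(u; K y, Sigma) w.r.t. K is score @ y.T
        grad += (R_T - baseline) * np.outer(score, y)
    return K - lr * grad                           # Line 8: descend, since the cost is minimized
```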
APPENDIX II
PROOFS

In this section, we bring the proofs of the theorems and lemmas in the body of this article. To facilitate the derivations, we frequently use the identity vecv(v)^T vecs(P) = (v ⊗ v) vec(P) = v^T P v, where P is a symmetric matrix and v is a vector of appropriate dimension. The following lemma is useful throughout the proofs.

Lemma 9: Consider (1)–(5). For Ξ ∈ R^{n×n}, we have

E[x_k^T Ξ v_k | y_k] = E[x_k^T Ξ w_k | y_k] = E[x_k^T Ξ v_{k+1} | y_k] = 0    (42)
E[x_k^T Ξ x_k | y_k] = y_k^T Ξ y_k − Tr(Ξ W_v).    (43)

Proof: Based on (1), w_k, v_k, v_{k+1} do not affect x_k and (42) follows. To see (43), note that

y_k^T Ξ y_k = E[y_k^T Ξ y_k | y_k] = E[(x_k + v_k)^T Ξ (x_k + v_k) | y_k]
           = E[x_k^T Ξ x_k | y_k] + 2 E[x_k^T Ξ v_k | y_k] + E[v_k^T Ξ v_k | y_k]
           = E[x_k^T Ξ x_k | y_k] + Tr(Ξ W_v).

A. Proof of Lemma 1

Let V* = y_k^T P* y_k be the optimal value function. Define the Bellman equation for the optimal value function (5)

y_k^T P* y_k = y_k^T R_y y_k + y_k^T K*^T R_u K* y_k − λ(K*) + E[y_{k+1}^T P* y_{k+1} | y_k].    (44)

The term E[y_{k+1}^T P* y_{k+1} | y_k] reads

E[y_{k+1}^T P* y_{k+1} | y_k]
  = E[x_k^T (A + BK*)^T P* (A + BK*) x_k | y_k]
  + 2 E[x_k^T (A + BK*)^T P* B K* v_k | y_k]
  + 2 E[x_k^T (A + BK*)^T P* (w_k + v_{k+1}) | y_k]
  + E[v_k^T K*^T B^T P* B K* v_k | y_k]
  + 2 E[v_k^T K*^T B^T P* (w_k + v_{k+1}) | y_k]
  + E[(w_k + v_{k+1})^T P* (w_k + v_{k+1}) | y_k].

Using (42) and (43) in the abovementioned equation

E[y_{k+1}^T P* y_{k+1} | y_k] = y_k^T (A + BK*)^T P* (A + BK*) y_k
  + Tr(K*^T B^T P* B K* W_v) + Tr(P* W_w)
  − Tr((A + BK*)^T P* (A + BK*) W_v) + Tr(P* W_v).

Substituting the abovementioned result in (44) and matching terms, we obtain the optimal average cost in (8) and

(A + BK*)^T P* (A + BK*) − P* + R_y + K*^T R_u K* = 0.

Optimizing the abovementioned equation with respect to K* results in the optimal policy gain in (6).

B. Proof of Lemma 3

Similar to the proof of Lemma 2, we show that the quadratic form satisfies the quality-based Bellman equation (15). Let z_k = [y_k^T, u_k^T]^T, z_{k+1} = [y_{k+1}^T, (K y_{k+1})^T]^T, and
M := [I  K^T] G [I  K^T]^T. Let u_k = K y_k + η_k. Using the control u_k at time k, the next output y_{k+1} reads

y_{k+1} = L x_k + B η_k + B K v_k + w_k + v_{k+1}.    (45)

By (16), E[V(y_{k+1}, K) | z_k] = E[Q(y_{k+1}, K y_{k+1}, K) | z_k]. Now, we use (45) to compute E[Q(y_{k+1}, K y_{k+1}, K) | z_k]

E[Q(y_{k+1}, K y_{k+1}, K) | z_k] = E[y_{k+1}^T M y_{k+1} | z_k]
  = E[(L x_k + B η_k + B K v_k + w_k + v_{k+1})^T M (L x_k + B η_k + B K v_k + w_k + v_{k+1}) | z_k]
  = E[x_k^T L^T M L x_k | z_k] + 2 E[x_k^T L^T M B η_k | z_k]
  + E[η_k^T B^T M B η_k | z_k] + E[v_k^T K^T B^T M B K v_k | z_k]
  + E[w_k^T M w_k + v_{k+1}^T M v_{k+1} | z_k].

Using (43), the abovementioned equation reads

E[y_{k+1}^T M y_{k+1} | z_k]
  = y_k^T L^T M L y_k + 2 y_k^T L^T M B η_k + η_k^T B^T M B η_k
  − Tr(L^T M L W_v) + Tr(K^T B^T M B K W_v) + Tr(M W_w) + Tr(M W_v)
  = [y_k^T  u_k^T] [A^T M A,  A^T M B; B^T M A,  B^T M B] [y_k; u_k]
  − Tr(L^T M L W_v) + Tr(K^T B^T M B K W_v) + Tr(M W_w) + Tr(M W_v)
  = z_k^T [A^T M A,  A^T M B; B^T M A,  B^T M B] z_k − Tr(L^T M L W_v)
  + Tr(K^T B^T M B K W_v) + Tr(M W_w) + Tr(M W_v).

Replacing the abovementioned result in the quality-based Bellman equation (15)

z_k^T G z_k = z_k^T [A^T M A + R_y,  A^T M B; B^T M A,  B^T M B + R_u] z_k
  − Tr(L^T M L W_v) + Tr(K^T B^T M B K W_v) + Tr(M W_w) + Tr(M W_v) − λ(K).

By matching terms, (17) and (18) are concluded.

C. Proof of Lemma 4

According to (16)

P = [I  K^T] G [I  K^T]^T.    (46)

By Lemma 3, the quadratic kernel of the Q function satisfies (17). Premultiplying (17) by [I  K^T] and postmultiplying (17) by [I  K^T]^T, we have

[I  K^T] ( [A  B]^T P [A  B] + [R_y, 0; 0, R_u] ) [I  K^T]^T − P = 0

where we have used (46) in the abovementioned equation. By expanding the abovementioned equation, (10) is concluded.

D. Proof of Theorem 3

The proof of convergence of the algorithm contains two steps. The first step is to show that solving the off-policy Bellman equation (24) is equivalent to the model-based Bellman equation (13). This is guaranteed by Theorem 1 and by assuming that the estimation errors in (22) and (27) are small enough: P̂^i ≈ P^i, the estimate of B^T P^i A ≈ B^T P^i A, and the estimate of B^T P^i B ≈ B^T P^i B. The second step is to show that, by improving the policy to be greedy with respect to the average of all previous value functions (28), P^{i+1} ≤ P^i for i = 1, ..., N. By Theorem 2, the algorithm produces a stabilizing controller gain K^i. Let P^i > 0 denote the solution to the model-based Bellman equation (13)

P^i = (A + BK^i)^T P^i (A + BK^i) + K^{iT} R_u K^i + R_y.    (47)

Since A + BK^i is stable, the unique positive definite solution of (47) may be written as [39]

P^i = Σ_{k=0}^{+∞} ((A + BK^i)^T)^k (K^{iT} R_u K^i + R_y) (A + BK^i)^k.    (48)

Consider two iteration indices i and j. Using (47), P^i − P^j reads

P^i − P^j = (A + BK^i)^T P^i (A + BK^i) + K^{iT} R_u K^i − (A + BK^j)^T P^j (A + BK^j) − K^{jT} R_u K^j
  = (A + BK^j)^T P^i (A + BK^j) + K^{iT} R_u K^i − K^{jT} B^T P^i (A + BK^j) − A^T P^i B K^j
    + K^{iT} B^T P^i (A + BK^i) + A^T P^i B K^i − (A + BK^j)^T P^j (A + BK^j) − K^{jT} R_u K^j
  = (A + BK^j)^T (P^i − P^j)(A + BK^j) + (K^i − K^j)^T (R_u + B^T P^i B)(K^i − K^j)
    + (K^i − K^j)^T [(R_u + B^T P^i B) K^j + B^T P^i A] + [K^{jT} (R_u + B^T P^i B) + A^T P^i B](K^i − K^j)
  = Σ_{k=0}^{+∞} ((A + BK^j)^T)^k ( (K^i − K^j)^T (R_u + B^T P^i B)(K^i − K^j)
    + (K^i − K^j)^T [(R_u + B^T P^i B) K^j + B^T P^i A]
    + [K^{jT} (R_u + B^T P^i B) + A^T P^i B](K^i − K^j) ) (A + BK^j)^k    (49)

where we have used (48) to obtain the last equation. By (28)

K^{i+1} = −( Σ_{j=1}^{i} (B^T P^j B + R_u) )^{-1} ( Σ_{j=1}^{i} B^T P^j A )
K^i = −( Σ_{j=1}^{i−1} (B^T P^j B + R_u) )^{-1} ( Σ_{j=1}^{i−1} B^T P^j A ).    (50)

Rearranging K^{i+1} using (50) and letting M := Σ_{j=1}^{i−1} (B^T P^j B + R_u) > 0, we have

M (K^i − K^{i+1}) = (R_u + B^T P^i B) K^{i+1} + B^T P^i A.    (51)

− vec(W_w)^T − vec(W_v)^T − (vec(W_v)^T)((K^{iT} B^T) ⊗ (K^{iT} B^T)).    (53)

The instrumental variable in the estimation (22) is y_k ⊗ y_k. To have an unbiased estimate, E[(y_k ⊗ y_k) ν^T] = 0. In the sequel, we analyze the correlation matrix

E[(y_k ⊗ y_k) ν^T] = E[(x_k ⊗ x_k) ν^T] + 2 E[(x_k ⊗ v_k) ν^T] + E[(v_k ⊗ v_k) ν^T].

Remembering the facts that x_k, v_k, v_{k+1}, w_k are mutually independent, W_v is diagonal, and using (14), we get the following results. The term E[(x_k ⊗ x_k) ν^T] reads

E[(x_k ⊗ x_k) ν^T] = E[x_k ⊗ x_k] ( vec(W_w)^T + vec(W_v)^T − vec(W_v)^T (A^T ⊗ A^T + 2 (K^{iT} B^T) ⊗ A^T)
  + (vec(W_v)^T)(L^{iT} ⊗ L^{iT}) − vec(W_w)^T − vec(W_v)^T − (vec(W_v)^T)((K^{iT} B^T) ⊗ (K^{iT} B^T)) ) = 0.

2 E[(x_k ⊗ v_k) ν^T] = −4 E[(x_k ⊗ v_k)(v_k^T ⊗ x_k^T)] (A^T ⊗ L^{iT}).    (54)

= x_k^T L^{iT} (P̂^i − P^i) L^i x_k + 2 x_k^T L^{iT} (P̂^i − P^i) B K^i v_k + 2 x_k^T L^{iT} P̂^i (w_k + v_{k+1})
  + v_k^T K^{iT} B^T (P̂^i − P^i) B K^i v_k + 2 v_k^T K^{iT} B^T P̂^i (w_k + v_{k+1}) + x_k^T (P^i − P̂^i) x_k
  + v_k^T (P^i − P̂^i) v_k + 2 x_k^T (P^i − P̂^i) v_k + (w_k + v_{k+1})^T P̂^i (w_k + v_{k+1}) − λ^i
  − v_k^T A^T P^i A v_k − 2 v_k^T A^T P^i B K^i v_k − 2 x_k^T L^{iT} P^i A v_k.    (58)

To see if the estimate is biased, we examine E[Υ_k n_2]

E[Υ_k n_2] = [0; η_k ⊗ η_k] E[n_2] + [2 E[y_k n_2] ⊗ η_k; 2 η_k ⊗ K^i E[y_k n_2]].

So, it is enough to analyze E[n_2] and E[y_k n_2]. Remembering the facts that x_k, v_k, v_{k+1}, w_k are mutually independent, W_v is
diagonal, E[x_k] = 0 (because of running Algorithm 3 to collect data), and using (14), we have

E[n_2] = E[x_k^T L^{iT} (P̂^i − P^i) L^i x_k] + E[x_k^T (P^i − P̂^i) x_k]
  + E[v_k^T K^{iT} B^T (P̂^i − P^i) B K^i v_k] + Tr((P̂^i − P^i)(W_w + W_v))    (59)

E[y_k n_2] = E[(x_k + v_k) n_2]
  = E[(x_k + v_k) x_k^T L^{iT} (P̂^i − P^i) L^i x_k] + 2 E[(x_k + v_k) x_k^T L^{iT} (P̂^i − P^i) B K^i v_k]
  + E[2 (x_k + v_k) x_k^T L^{iT} P̂^i (w_k + v_{k+1})] + E[(x_k + v_k) v_k^T K^{iT} B^T (P̂^i − P^i) B K^i v_k]
  + E[2 (x_k + v_k) v_k^T K^{iT} B^T P̂^i (w_k + v_{k+1})] + E[(x_k + v_k)(x_k^T (P^i − P̂^i) x_k + v_k^T (P^i − P̂^i) v_k)]
  + 2 E[(x_k + v_k) x_k^T (P^i − P̂^i) v_k] + E[(x_k + v_k)(w_k + v_{k+1})^T P̂^i (w_k + v_{k+1}) − (x_k + v_k) λ^i]
  + E[(x_k + v_k)(−v_k^T A^T P^i A v_k − 2 v_k^T A^T P^i B K^i v_k)] − 2 E[(x_k + v_k)(x_k^T L^{iT} P^i A v_k)] = 0    (60)

where every expectation on the right-hand side of (60) vanishes term by term.

Based on (60), E[y_k n_2] = 0 and only E[n_2] contributes to the bias. If the estimation error of P^i is negligible, P̂^i − P^i ≈ 0, then by (59), E[n_2] = 0 and the estimate is unbiased.

G. Proof of Lemma 6

We repeat the estimation problem in (34)–(36)

r_k − λ^i = Ω_k^T ℓ^i + n_1
n_1 = y_{k+1}^T P^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | y_k]
ℓ^i = [vec(P^i); vec(B^T P^i A); vec(B^T P^i B)],   Ω_k = [y_k ⊗ y_k − y_{k+1} ⊗ y_{k+1}; 2 y_k ⊗ (u_k − K^i y_k); (u_k − K^i y_k) ⊗ (u_k + K^i y_k)]
Θ_k = [y_k ⊗ y_k; 2 y_k ⊗ η_k; η_k ⊗ (η_k + 2 K^i y_k)].    (61)

We have shown in the proof of Lemma 5 that the noise term can be written as n_1 = ν^T vec(P^i), where ν is given in (53). In the sequel, we study E[Θ_k ν^T]. Since ν is zero mean and η_k is known, we only need to analyze E[(y_k ⊗ y_k) ν^T] and E[y_k ν^T]. The term E[(y_k ⊗ y_k) ν^T] is given in (56). Remembering the facts that x_k, v_k, v_{k+1}, w_k are mutually independent and W_v is diagonal, we compute E[y_k ν^T], where ν is given in (53)

E[y_k ν^T] = E[(x_k + v_k) ν^T]
  = −E[(x_k)(v_k^T ⊗ v_k^T)](A^T ⊗ A^T + 2 (K^{iT} B^T) ⊗ A^T) − 2 E[(v_k)(v_k^T ⊗ x_k^T)](A^T ⊗ L^{iT})
  + E[x_k](vec(W_v)^T)(L^{iT} ⊗ L^{iT}) − E[x_k](vec(W_v)^T)((K^{iT} B^T) ⊗ (K^{iT} B^T)).    (62)

Hence, the bias in the classical off-policy is a function of (56) and (62).

H. Proof of Lemma 7

We repeat the estimation problem in (29)–(31)

c_k = Ψ_k^T vec(G^i) + n_3
c_k = r(y_k, u_k) − λ^i + y_{k+1}^T P̂^i y_{k+1}
Ψ_k = [y_k^T ⊗ y_k^T,  y_k^T ⊗ u_k^T,  u_k^T ⊗ y_k^T,  u_k^T ⊗ u_k^T]^T
n_3 = y_{k+1}^T P̂^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | z_k].    (63)

The noise term n_3 reads

n_3 = y_{k+1}^T P̂^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | z_k]
  = x_k^T A^T (P̂^i − P^i) A x_k + 2 x_k^T A^T (P̂^i − P^i) B u_k + u_k^T B^T (P̂^i − P^i) B u_k
  − 2 v_k^T A^T P^i (A x_k + B u_k) + 2 (x_k^T A^T + u_k^T B^T) P̂^i (w_k + v_{k+1})
  + (w_k + v_{k+1})^T P̂^i (w_k + v_{k+1}) − v_k^T A^T P^i A v_k
  + Tr(A^T P^i A W_v) − Tr(P^i W_w) − Tr(P^i W_v).    (64)

To see if the estimate is biased, we need to analyze E[Ψ_k n_3]. Since u_k is deterministic (we know u_k), we examine i := (u_k ⊗ u_k) E[n_3], ii := u_k ⊗ E[y_k n_3], and iii := E[(y_k ⊗ y_k) n_3]. Remembering the facts that x_k, v_k, v_{k+1}, w_k are mutually independent, W_v is diagonal, and E[x_k] = 0 (because of running Algorithm 3 to collect data), we get the following results

i := (u_k ⊗ u_k)( E[x_k^T A^T (P̂^i − P^i) A x_k] + u_k^T B^T (P̂^i − P^i) B u_k + Tr((P̂^i − P^i)(W_w + W_v)) )    (65)

ii := u_k ⊗ E[(x_k + v_k) n_3]
   = 2 u_k ⊗ E[x_k x_k^T A^T (P̂^i − P^i) B u_k] − 2 u_k ⊗ E[v_k v_k^T A^T P^i B u_k].    (66)

Next, we analyze iii := E[(y_k ⊗ y_k) n_3] = E[(x_k ⊗ x_k) n_3] + 2 E[(x_k ⊗ v_k) n_3] + E[(v_k ⊗ v_k) n_3]. First, we obtain E[(x_k ⊗ x_k) n_3]

E[(x_k ⊗ x_k) n_3] = E[(x_k ⊗ x_k) x_k^T A^T (P̂^i − P^i) A x_k]
  + E[x_k ⊗ x_k] u_k^T B^T (P̂^i − P^i) B u_k + E[x_k ⊗ x_k] Tr((P̂^i − P^i)(W_w + W_v)).    (67)
Second, we obtain 2 E[(x_k ⊗ v_k) n_3]

2 E[(x_k ⊗ v_k) n_3] = −4 E[(x_k ⊗ v_k) v_k^T A^T P^i A x_k].    (68)

Third, we obtain E[(v_k ⊗ v_k) n_3]

E[(v_k ⊗ v_k) n_3] = E[(v_k ⊗ v_k) x_k^T A^T (P̂^i − P^i) A x_k]
  + E[(v_k ⊗ v_k) u_k^T B^T (P̂^i − P^i) B u_k]
  + E[v_k ⊗ v_k] Tr((P̂^i − P^i)(W_v + W_w))
  + E[v_k ⊗ v_k] Tr(A^T P^i A W_v) − E[(v_k ⊗ v_k) v_k^T A^T P^i A v_k].    (69)

As a result, iii is given by the summation of the terms in (67)–(69). Because the terms in (65)–(69) are nonzero, the estimate in (31) is biased.

I. Proof of Lemma 8

We repeat the estimation problem in (38)–(40)

r_k − λ^i = (Ψ_k − Ψ_{k+1}) vecs(G^i) + n_4
Ψ_k^T = [z_k^T ⊗ z_k^T] = [y_k^T ⊗ y_k^T,  y_k^T ⊗ u_k^T,  u_k^T ⊗ y_k^T,  u_k^T ⊗ u_k^T]
n_4 = z_{k+1}^T G^i z_{k+1} − E[z_{k+1}^T G^i z_{k+1} | z_k] = y_{k+1}^T P^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | z_k].    (70)

The noise term n_4 is obtained by setting P̂^i ≡ P^i in (64)

n_4 = −2 v_k^T A^T P^i (A x_k + B u_k) + 2 (x_k^T A^T + u_k^T B^T) P^i (w_k + v_{k+1})
  + (w_k + v_{k+1})^T P^i (w_k + v_{k+1}) − v_k^T A^T P^i A v_k
  + Tr(A^T P^i A W_v) − Tr(P^i W_w) − Tr(P^i W_v).    (71)

We study E[Ψ_k n_4]. Since u_k is deterministic (we know u_k), E[(u_k ⊗ u_k) n_4] = 0. We examine u_k ⊗ E[y_k n_4] and E[(y_k ⊗ y_k) n_4]

u_k ⊗ E[y_k n_4] = −2 u_k ⊗ ( E[v_k v_k^T A^T P^i (A x_k + B u_k)]
  + E[x_k] E[(w_k + v_{k+1})^T P^i (w_k + v_{k+1})] − E[x_k] E[v_k^T A^T P^i A v_k]
  + E[x_k] Tr(A^T P^i A W_v) − E[x_k](Tr(P^i W_w) + Tr(P^i W_v)) )
  = −2 u_k ⊗ E[v_k v_k^T A^T P^i (A x_k + B u_k)]    (72)

E[(y_k ⊗ y_k) n_4] = −4 E[(x_k ⊗ v_k) v_k^T A^T P^i (A x_k + B u_k)]
  + E[v_k ⊗ v_k] Tr(A^T P^i A W_v) − E[(v_k ⊗ v_k) v_k^T A^T P^i A v_k].    (73)

Since (72) and (73) are nonzero, the estimate is biased.

REFERENCES

[1] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
[2] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1329–1338. [Online]. Available: http://arxiv.org/abs/1604.06778
[3] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, "An application of reinforcement learning to aerobatic helicopter flight," in Proc. Adv. Neural Inf. Process. Syst., 2007, vol. 19, pp. 1–8.
[4] M. Krstic et al., Nonlinear and Adaptive Control Design, vol. 222. New York, NY, USA: Wiley, 1995.
[5] K. J. Åström and B. Wittenmark, Adaptive Control, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1994.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[7] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.
[8] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, "Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers," IEEE Control Syst., vol. 32, no. 6, pp. 76–105, Dec. 2012.
[9] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Jul.–Sep. 2009.
[10] F. A. Yaghmaie and D. J. Braun, "Reinforcement learning for a class of continuous-time input constrained optimal control problems," Automatica, vol. 99, pp. 221–227, 2019.
[11] T. Bian, Y. Jiang, and Z.-P. Jiang, "Adaptive dynamic programming for stochastic systems with state and control dependent noise," IEEE Trans. Autom. Control, vol. 61, no. 12, pp. 4170–4175, Dec. 2016.
[12] B. Kiumarsi, F. L. Lewis, and Z.-P. Jiang, "H∞ control of linear discrete-time systems: Off-policy reinforcement learning," Automatica, vol. 78, pp. 144–152, 2017.
[13] H. Modares and F. L. Lewis, "Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning," IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 3051–3056, Nov. 2014.
[14] F. A. Yaghmaie, S. Gunnarsson, and F. L. Lewis, "Output regulation of unknown linear systems using average cost reinforcement learning," Automatica, vol. 110, 2019, Art. no. 108549.
[15] F. Adib Yaghmaie, F. L. Lewis, and R. Su, "Output regulation of heterogeneous linear multi-agent systems with differential graphical game," Int. J. Robust Nonlinear Control, vol. 26, no. 10, pp. 2256–2278, 2016.
[16] F. A. Yaghmaie, K. Hengster Movric, F. L. Lewis, and R. Su, "Differential graphical games for H∞ control of linear heterogeneous multiagent systems," Int. J. Robust Nonlinear Control, vol. 29, no. 10, pp. 2995–3013, 2019.
[17] D. P. Bertsekas, Reinforcement Learning and Optimal Control. Belmont, MA, USA: Athena Scientific, 2019.
[18] R. E. Bellman and S. E. Dreyfus, Applied Dynamic Programming. Princeton, NJ, USA: Princeton Univ. Press, 1962.
[19] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," J. Mach. Learn. Res., vol. 4, no. 6, pp. 1107–1149, 2003.
[20] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Mach. Learn., vol. 22, no. 1–3, pp. 33–57, 2004.
[21] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annu. Rev. Control, Robot., Auton. Syst., vol. 2, pp. 253–279, 2018.
[22] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proc. 31st Int. Conf. Mach. Learn., 2014, vol. 1, pp. 605–619.
[23] L. Ljung, System Identification—Theory for the User (Information and System Sciences), 2nd ed. Princeton, NJ, USA: Prentice-Hall, 1999.
[24] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification. Cambridge, MA, USA: MIT Press, 1987.
[25] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvari, "Model-free linear quadratic control via reduction to expert prediction," in Proc. 22nd Int. Conf. Artif. Intell. Statist., 2019, pp. 3108–3117.
[26] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, "On the sample complexity of the linear quadratic regulator," Found. Comput. Math., vol. 20, no. 4, pp. 633–679, 2019.
[27] S. Tu and B. Recht, "Least-squares temporal difference learning for the linear quadratic regulator," in Proc. Int. Conf. Mach. Learn., 2018, pp. 5005–5014.
[28] N. Matni, A. Proutiere, A. Rantzer, and S. Tu, "From self-tuning regulators to reinforcement learning and back again," in Proc. IEEE 58th Conf. Decis. Control, 2019, pp. 3724–3740.
[29] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi, "Global convergence of policy gradient methods for the linear quadratic regulator," in Proc. Int. Conf. Mach. Learn., 2018, pp. 1467–1476.
[30] A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar, "Online linear quadratic control," in Proc. Int. Conf. Mach. Learn., 2018, pp. 1029–1038.
[31] S. Tu and B. Recht, "The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint," in Proc. Conf. Learn. Theory, 2019, pp. 3036–3083. [Online]. Available: http://arxiv.org/abs/1812.03565
[32] M. Ferizbegovic, J. Umenberger, H. Hjalmarsson, and T. B. Schon, "Learning robust LQ-controllers using application oriented exploration," IEEE Control Syst. Lett., vol. 4, no. 1, pp. 19–24, Jan. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/8732482/
[33] F. A. Yaghmaie and F. Gustafsson, "Using reinforcement learning for model-free linear quadratic Gaussian control with process and measurement noises," in Proc. IEEE Conf. Decis. Control, 2019, pp. 6510–6517.
[34] M. Simchowitz, K. Singh, and E. Hazan, "Improper learning for non-stochastic control," in Proc. Conf. Learn. Theory, 2020, pp. 3320–3436. [Online]. Available: http://arxiv.org/abs/2001.09254
[35] S. Lale, K. Azizzadenesheli, B. Hassibi, and A. Anandkumar, "Logarithmic regret bound in partially observable linear dynamical systems," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 20876–20888.
[36] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 41, no. 1, pp. 14–25, Feb. 2011.
[37] H. Yu and D. P. Bertsekas, "Convergence results for some temporal difference methods based on least squares," IEEE Trans. Autom. Control, vol. 54, no. 7, pp. 1515–1531, Jul. 2009.
[38] T. Glad and L. Ljung, Control Theory—Multivariable and Nonlinear Methods. New York, NY, USA: Taylor and Francis, 2000.
[39] G. A. Hewer, "An iterative technique for the computation of the steady state gains for the discrete optimal regulator," IEEE Trans. Autom. Control, vol. AC-16, no. 4, pp. 382–384, Aug. 1971.
[40] L. A. Prashanth, N. Korda, and R. Munos, "Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control," in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discov. Databases, 2014, pp. 66–81.
[41] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD, USA: The Johns Hopkins Univ. Press, 2013.
[42] The MathWorks, MATLAB System Identification Toolbox (R2018a), Portola Valley, CA, USA, 2018.

Farnaz Adib Yaghmaie received the B.E. and M.E. degrees in electrical engineering, control, from the K. N. Toosi University of Technology, Tehran, Iran, in 2009 and 2011, respectively, and the Ph.D. degree in electrical and electronic engineering, control, from Nanyang Technological University (EEE-NTU), Singapore, in 2017. She is currently an Assistant Professor with the Department of Electrical Engineering, Linköping University, Linköping, Sweden. Her current research interest is reinforcement learning for controlling dynamical systems. Dr. Yaghmaie is the recipient of the Best Thesis Award from EEE-NTU among 160 Ph.D. students.

Fredrik Gustafsson (Fellow, IEEE) received the M.Sc. degree in electrical engineering and the Ph.D. degree in automatic control from Linköping University, Linköping, Sweden, in 1988 and 1992, respectively. He has been a Professor in Sensor Informatics with the Department of Electrical Engineering, Linköping University, since 2005. His research interests are in stochastic signal processing, adaptive filtering, and change detection, with applications to communication, vehicular, airborne, and audio systems. He is a co-founder of the companies NIRA Dynamics (automotive safety systems), Softube (audio effects), and Senion (indoor positioning systems). Dr. Gustafsson was an Associate Editor for IEEE TRANSACTIONS ON SIGNAL PROCESSING in 2000–2006, IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS in 2010–2012, and EURASIP Journal on Applied Signal Processing in 2007–2012. He was the recipient of the Arnberg prize from the Royal Swedish Academy of Science (KVA) in 2004, was elected member of the Royal Academy of Engineering Sciences (IVA) in 2007, and was elevated to IEEE Fellow in 2011. He was also the recipient of the Harry Rowe Mimno Award 2011 for the tutorial "Particle Filter Theory and Practice with Positioning Applications," which was published in the AESS Magazine in July 2010, and was a coauthor of "Smoothed state estimates under abrupt changes using sum-of-norms regularization," which received the Automatica paper prize in 2014.

Lennart Ljung (Fellow, IEEE) received the Ph.D. degree in automatic control from the Lund Institute of Technology, Lund, Sweden, in 1974. Since 1976, he has been Professor of the Chair of Automatic Control in Linköping, Sweden. He has held visiting positions with Stanford and MIT and has written several books on system identification and estimation. Dr. Ljung is currently an IFAC Fellow and an IFAC Advisor. He is a Member of the Royal Swedish Academy of Sciences (KVA), a member of the Royal Swedish Academy of Engineering Sciences (IVA), an Honorary Member of the Hungarian Academy of Engineering, an Honorary Professor of the Chinese Academy of Mathematics and Systems Science, and a Foreign Member of the U.S. National Academy of Engineering (NAE). He has received honorary doctorates from the Baltic State Technical University in St. Petersburg, from Uppsala University, Uppsala, Sweden, from the Technical University of Troyes, Troyes, France, from the Catholic University of Leuven, Leuven, Belgium, and from the Helsinki University of Technology, Espoo, Finland. He was the recipient of both the Quazza Medal (2002) and the Nichols Medal (2017) from IFAC, the Hendrik W. Bode Lecture Prize from the IEEE Control Systems Society in 2003, the IEEE Control Systems Award in 2007, and the Great Gold Medal from the Royal Swedish Academy of Engineering in 2018.