0% found this document useful (0 votes)
34 views

Linear Quadratic Control Using Model-Free Reinforcement Learning

Uploaded by

Gary Rey
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views

Linear Quadratic Control Using Model-Free Reinforcement Learning

Uploaded by

Gary Rey
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO.

2, FEBRUARY 2023 737

Linear Quadratic Control Using Model-Free


Reinforcement Learning
Farnaz Adib Yaghmaie , Fredrik Gustafsson , Fellow, IEEE, and Lennart Ljung , Life Fellow, IEEE

Abstract—In this article, we consider linear quadratic adaptive control with optimal control theory, it is possible to
(LQ) control problem with process and measurement control unknown dynamical systems adaptively and optimally.
noises. We analyze the LQ problem in terms of the average Reinforcement learning (RL) refers to a class of such routines
cost and the structure of the value function. We assume
that the dynamics of the linear system is unknown and only and its history dates back decades [6], [7]. By recent progress
noisy measurements of the state variable are available. Us- in machine learning (ML), specifically using deep networks,
ing noisy measurements of the state variable, we propose the RL field is also reinvented. RL algorithms have recently
two model-free iterative algorithms to solve the LQ prob- shown impressive performances in many challenging problems
lem. The proposed algorithms are variants of policy itera- including playing Atari games [1], agile robotics [3], control of
tion routine where the policy is greedy with respect to the
average of all previous iterations. We rigorously analyze the continuous-time systems [2], [8]–[14], and distributed control
properties of the proposed algorithms, including stability of of multiagent systems [15], [16].
the generated controllers and convergence. We analyze the In a typical RL setting, the model of the dynamical system
effect of measurement noise on the performance of the pro- is unknown and the optimal controller is learned through in-
posed algorithms, the classical off-policy, and the classical teraction with the dynamical system. One possible approach to
Q-learning routines. We also investigate a model-building
approach, inspired by adaptive control, where a model of solve an RL problem is the class of dynamic programming (DP)
the dynamical system is estimated and the optimal control based solutions, also called temporal difference (TD) learning.
problem is solved assuming that the estimated model is the The idea is to learn the value function to satisfy the Bellman
true model. We use a benchmark to evaluate and compare equation [17], [18] by minimizing the Bellman error [17]. The
our proposed algorithms with the classical off-policy, the celebrated Q-learning algorithm belongs to this class [19]. Least
classical Q-learning, and the policy gradient. We show that
our model-building approach performs nearly identical to squares temporal difference learning (LSTD) [19], [20] refers
the analytical solution and our proposed policy iteration- to the case where the least-squares is used to minimize the TD
based algorithms outperform the classical off-policy and error. Another class contains policy search algorithms where
the classical Q-learning algorithms on this benchmark but the policy is learned by directly optimizing the performance
do not outperform the model-building approach. index. Examples are policy-gradient methods that use sampled
Index Terms—Linear quadratic (LQ) control, reinforce- returns from the dynamical system [21] and actor–critic methods
ment learning (RL). that use estimated value functions in learning the policy [22].
In the control community, adaptive control is used for optimal
I. INTRODUCTION control of the unknown system by first estimating (possibly
recursively) a model [23], [24] and then, solving the optimal
DAPTIVE control studies data-driven approaches for con-
A trol of unknown dynamical systems [4], [5]. If we combine
control problem using the estimated model [5]. This category
of solutions is called model-based RL in the RL community
but we coin the term model-building RL to emphasize that no
Manuscript received 24 July 2020; revised 28 January 2021 and 2 known/given model is used or assumed.
September 2021; accepted 8 January 2022. Date of publication 25 As the scope of RL expands to more demanding tasks, it is
January 2022; date of current version 30 January 2023. The work
of Farnaz Adib Yaghmaie was supported by the Excellence Center crucial for the RL algorithms to provide guaranteed stability
at Linköping–Lund in Information Technology (ELLIIT), in the Vinnova and performance. Still, we are far away from analyzing the
Competence Center LINK-SIC, the Wallenberg Artificial Intelligence, RL algorithms because of the inherent complexity of the tasks
Autonomous Systems and Software Program (WASP), and CENIIT. The
work of Fredrik Gustafsson was supported by the Vinnova Competence and deep networks. This motivates considering a simplified case
Center LINK-SIC and the Wallenberg Artificial Intelligence, Autonomous study where analysis is possible. linear quadratic (LQ) problem
Systems and Software Program (WASP). The work of Lennart Ljung is a classical control problem where the dynamical system
was supported in part by the Swedish Research Council under Grant
2019-04956 and in part by the Vinnova Competence Center LINK-SIC. obeys linear dynamics and the cost function to be minimized
Recommended by Associate Editor J. Lavaei. (Corresponding author: is quadratic. The LQ problem has a celebrated closed-form
Farnaz Adib Yaghmaie.) solution and is an ideal benchmark for studying and comparing
The authors are with the Department of Electrical Engineering,
Linköping University, 58431 Linköping, Sweden (e-mail: farnaz.adib. the RL algorithms because first, it is theoretically tractable and
[email protected]; [email protected]; [email protected]). second, it is practical in various engineering domains.
Color versions of one or more figures in this article are available at Because of the aforementioned reasons, the LQ problem
https://doi.org/10.1109/TAC.2022.3145632.
Digital Object Identifier 10.1109/TAC.2022.3145632 has attracted much attention from the RL community [11],

0018-9286 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
738 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO. 2, FEBRUARY 2023

[25]–[27], see [28], for a complete survey on RL approaches and [13], [25], [27]. We take one step further and replace the state
properties for the LQ problems. For example, the authors in [12] with a noisy measurement of the state; i.e., the output. There
and [13] study the linear quadratic regulator (LQR) control are two contributions to this article. Our first contribution is
problems in completely deterministic setups via LSTD. Global to study the LQ problem in terms of the average cost and the
convergence of policy gradient methods for deterministic LQR structure of the value function. Our second contribution is to
problem is shown in [29]. Considering a stochastic setup is more propose two model-free DP-based RL algorithms and analyze
demanding [11], [25]–[27]. In [11], a stochastic LQ problem their properties. We study the effect of measurement noise in
with state- and control-dependent process noise is considered. the proposed algorithms and the classical off-policy and the
The authors assume that the noise is measurable and propose an classical Q-learning routines. Such a discussion is absent in [33].
LSTD approach to solve the optimal control problem. In [25], Moreover, we provide the convergence proof for the proposed
the authors give LSTD algorithms for the LQ problem with algorithms while such a discussion in a simpler setup in [25]
process noise and analyze the regret bound, i.e., the difference is missing. We also investigate a model-building approach,
between the occurred cost and the optimal cost. The regret inspired by adaptive control, which can be easily extended to
bound is also studied in [30] for a model-building routine. The cover the partially observable dynamical systems. We leave
sample complexity of the LSTD for the LQ problem is studied the analysis of such model-building approaches for our future
in [27] and the gap between a model-building approach and a research. We would like to stress that the solutions we discuss
model-free routine is analyzed in [31]. In [26], a model-building (including the model-building one) are all in the model-free
approach, called coarse ID control, is developed to solve the LQ family of classical RL. The only information that is used is that
problem with process noise and sample complexity bounds are the outputs come from a linear system of known order; no known
derived. The idea of coarse-ID control is to estimate the linear model is assumed to generate the data. We have summarized
model and the uncertainty, and then to solve the optimal control the notations and abbreviations in Table I.
problem using the information of the model and uncertainty.
A convex procedure to design a robust controller for the LQ II. LQ PROBLEM
problem with unknown dynamics is suggested in [32]. The total
cost to be optimized in the LQ problem or in general in RL can be A. Linear Gaussian Dynamical Systems
considered as the average cost [31] or discounted cost [13], [27] Consider a linear Gaussian dynamical system
and it is a design parameter. If the cost is equally important at
xk+1 = Axk + Buk + wk (1)
all times, for example, the cost is energy consumption, one may
consider an average cost setting. If the immediate costs are more yk = x k + v k (2)
important, one may consider a discounted cost. In a discounted where xk ∈ Rn and uk ∈ Rm are the state and the control
setting, one needs to select the discounting factor carefully to input vectors, respectively. The vector wk ∈ Rn denotes the
prevent loss of stability while this can be avoided by considering process noise drawn independent identically distributed (i.i.d.)
the average cost. from a Gaussian distribution N (0, Ww ). The state variable xk
In all the aforementioned results, only process noise is con- is not measurable and the output yk ∈ Rn in (2) denotes a noisy
sidered in the problem formulation. In practice, both process and measurement of the state xk where vk ∈ Rn is the measurement
measurement noises appear in the dynamical systems and one noise drawn i.i.d. from a Gaussian distribution N (0, Wv ) where
should carefully consider the effect of measurement noise. This Wv is diagonal.
is particularly important in feedback controller design, where The model (A, B) is unknown but stabilizable. The model
only a noisy measurement of the output is available for the order n is assumed to be known. The measurements yk , uk can
control purpose. When a model-free approach is used to control be used to form the next output in order to achieve a specific
the system, it is not possible to use a filter, e.g., a Kalman filter, to goal of the control. When the control input uk is selected as a
estimate the state variable because the dynamics of the system is function of the output yk , it is called a policy. In this article, we
unknown. Yaghmaie and Gustafsson [33] consider both process design stationary policies of the form uk = Kyk to control the
and measurement noises in the problem formulation but only system in (1) and (2). Let L := A + BK. The policy gain K is
stability of the generated policy is established and the effect stabilizing if ρ(L) < 1.
of measurement noise is not studied. A closely related topic is
RL for partial observable Markov decision processes (POMDP),
B. LQ Optimal Control Problem
where the state is only partially observable [34]–[36]. The au-
thors in [34] and [35] consider LQ systems with both process We define the quadratic running cost as
and observation noises and propose algorithms to estimate the r(yk , uk ) = rk = ykT Ry yk + uTk Ru uk (3)
model of the dynamical system and to learn a near optimal
controller via gradient descent. Lewis and Vamvoudakis [36] where Ry ≥ 0 and Ru > 0 are the output and the control weight-
consider noise-free LQ problem and proposes to use a history ing matrices, respectively. The average cost associated with the
of input–output data in the controller design. policy uk = Kyk is defined by
 τ 
In this article, we focus on DP-based RL routines to solve LQ 1 
problem. There is much research on RL algorithms for LQ prob- λ(K) = lim E r(yt , Kyt ) (4)
τ →∞ τ
lem assuming that the state variable is exactly measurable [12], t=1

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
YAGHMAIE et al.: LINEAR QUADRATIC CONTROL USING MODEL-FREE REINFORCEMENT LEARNING 739

TABLE I
NOTATION AND ABBREVIATIONS

which does not depend on the initial state of the system [6]. Solving problem 1 is in fact the same as the LQR problem in
For the dynamical system in (1) and (2), we define the value the sense that the gain K ∗ turns out to be the same as the LQR
function [6] (or bias function [37]) associated with a given policy gain: K ∗ = KLQR obtained from the standard ARE machinery.
uk = Kyk This is shown in the following lemma and the proof is given in
 +∞  Appendix II.A.

V (yk , K) = E (r(yt , Kyt ) − λ(K))|yk (5) Lemma 1: Consider Problem 1 and (1)–(5). The optimal
t=k policy gain K ∗ is
where the expectation is taken with respect to wk , vk . K ∗ = −(Ru + B T P ∗ B)−1 B T P ∗ A (6)

Problem 1: Consider the dynamical system in (1) and (2). where P > 0 is the solution to the ARE
Design the gain K ∗ such that the policy K ∗ yk minimizes (5). AT P ∗ A − P ∗ − AT P ∗ B(B T P ∗ B + Ru )−1 B T P ∗ A
In this article, we are interested in optimizing (5). The value (7)
function (5) quantifies the quality of transient response using + Ry = 0.
uk = Kyk . Because of the presence of stationary process and The average cost associated with K ∗ is given by
measurement noises in (1) and (2), the controller uk = Kyk
λ(K ∗ ) = Tr(K ∗T B T P ∗ BK ∗ Wv ) + Tr(P ∗ (Ww + Wv ))
results in a permanent average cost, which is represented by
λ(K). In the following section, we will show that λ(K) is a − Tr((A + BK ∗ )T P ∗ (A + BK ∗ )Wv ). (8)
function of Ww , Wv and has a nonzero value if the process noise Remark 1: Lemma 1 states that Problem 1 is equivalent
or the measurement noise, or both are present. By subtracting to the classical LQR problem and the solution can be found
λ(K) in (5), we end up with the transient cost or the controller from a standard ARE machinery. We explain this point intu-
cost. itively: The process and measurement noises result in a per-
Another relevant problem is the classical LQR problem. manent average cost. We take away the effect of the process
Problem 2 (LQR): Consider the dynamical system in (1) and and measurement noises by subtracting the average cost in
(2) without noise terms (wk = 0, vk = 0). Design the controller (5). This leaves us with the transient cost or the controller
uk = Kxk to minimize (5). cost, which can be optimized through the ARE in (7). Also,
Note that when Problem 2 is considered, λ(K) = 0 in (5) note that we can cast solving Problem 1 to minimizing the
because wk = 0, vk = 0. The solution is well known to be uk = regret function. The regret framework Ruk evaluates the quality
KLQR xk , where KLQR is obtained through an algebraic Riccati of a policy uk by comparing its running cost to a baseline
equation (ARE) using (A, B, Ry , Ru ), see, e.g., [38]. b [28]
A further relevant problem is the classical linear quadratic +∞
Gaussian (LQG) problem: 
Ruk = rk − b. (9)
Problem 3 (LQG): Consider the dynamical system in (1) and t=1
(2). Design the controller uk (y1 , . . ., yk ) to minimize (4).
Let b := λ(K) in (4) represent the baseline. Then, the regret
The optimal controller for Problem 3 is uk = KLQR x̂k|k
framework in (9) is indeed equivalent to the value function in
where x̂k|k is a posteriori state estimation at time k given
(5), and minimizing V with respect to K (Problem 1) is cast to
observations (y1 , . . ., yk ) (see [41, Corollary 9.2]). The cost to
minimizing the regret function.
be minimized in the LQG problem contains the following two
parts: 1) The LQR cost and 2) the mean-square state estimation
C. Iterative Model-Based Approach for the LQ Problem
error. The LQR cost measures the quality of transient response
and KLQR is designed to minimize it. The mean-square error of According to Lemma 1, the solution to Problem 1 can be found
the state estimation x̂k|k is minimized by the means of a Kalman from the ARE in (7). If the dynamics is known, one can use the
filter where the Kalman gain depends on (A, B, Ww , Wv ). Hewer’s method [39] in Algorithm 1. The algorithm is initiated
Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
740 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO. 2, FEBRUARY 2023

Lemma 3: Consider (1)–(5) and (15). Assume that the policy


Algorithm 1: Model-Based LQ [39].
gain K is stabilizing. Then, Q(zk , K) = zkT Gzk where G ≥ 0
1: Initialize: Select a stabilizing policy gain K 1 , set is 
a solution to
i = 1.     
AT   I   Ry 0
2: for i = 1, .., N do I K T G
A B −G+ =0
3: Find P i from the model-based Bellman equation BT K 0 Ru
(A + BK i )T P i (A + BK i ) − P i + Ry (17)
(10) and the average cost is given by
+ K iT Ru K i = 0.    
  I
4: Update the policy gain λ(K) = Tr K T B T I K T G BKWv
K
K i+1 = −(Ru + B T P i B)−1 B T P i A. (11)    
5: end for   I
+ Tr I K G T (Ww + Wv )
K
   
with a stabilizing gain. In each iteration, the gain is evaluated by   I
solving the model-based Bellman equation in (10). Then, using − Tr L I K G
T T LWv . (18)
the solution to the Bellman equation, the gain will be improved. K
It has been proved in [39] that K i converges to KLQR . Lemma 4: Consider (1)–(5) and (15). Assume that the policy
gain K is stabilizing. Then, the quality-based Bellman equation
III. VALUE AND QUALITY FUNCTIONS (15) is equivalent to the model-based Bellman equation (10);
that is the solutions are the same.
In this article, we will present two iterative model-free algo-
The proofs of Lemmas 3 and 4 are given in Appendices
rithms to solve Problem 1. Our algorithms rely on the evaluation
II.B–II.C.
of a policy gain K, which is done by finding its associated value
function V (defined in Section II-B) or quality function Q (to
IV. PROPOSED MODEL-FREE ALGORITHMS
be defined in this section).
In this section, we present two model-free iterative algorithms
A. Value Function V to solve Problem 1. At a high level, each of these algorithms is
a variant of policy iteration routine. Each algorithm iterates N
The value function is defined in (5). Using (5), the value-based
times over the following two steps: 1) policy evaluation and
Bellman equation reads
2) policy improvement. The policy is evaluated by finding the
V (yk , K) = r(yk , Kyk ) − λ(K) + E[V (yk+1 , K)|yk ]. (12) associated value function or Q function, and the average cost.
It is straight forward to show that the value function associated The improved policy is then selected to be greedy with respect
with a linear policy is quadratic and it is equivalent to the model- to the average of all previously estimated value or Q functions.
based Bellman equation (10): The number of iterations N is a design variable, often a small
Lemma 2 (see[33]): Consider (1)–(5). Assume that the policy number.
gain K is stabilizing. Then, V (yk , K) = ykT P yk where P > 0
satisfies A. Average Off-Policy Learning
(A + BK)T P (A + BK) − P + Ry + K T Ru K = 0 (13) Our first routine is called average off-policy learning in Al-
and the average cost is given by gorithm 2. This routine is given in [33] but its properties are not
λ(K) = Tr(K T B T P BKWv ) + Tr(P (Ww + Wv )) discussed. We analyze the properties in Section V-A.
The algorithm is initiated with a stabilizing policy gain K 1 in
− Tr((A + BK)T P (A + BK)Wv ). (14) Line 1. Then, the algorithm iterates N times in Line 2 over the
Proof: See [33, proof of Lemma 1].  policy evaluation and the policy improvement steps. The policy
evaluation step is given in Line 3 and discussed in Section IV-A1.
B. Quality Q Function The policy gain K i is evaluated by estimating its associated
value function and average cost. The policy improvement step
Along with the value function (5), it is also handy to define
is given in Lines 4–6 and discussed in Section IV-A2.
the Quality (Q) function. Let zk = [ykT , uTk ]T . The Q function is
1) Policy Evaluation: Line 3: Considering V i (yk , K i ) =
equal to the expected return cost for taking an action uk at time T i
yk P yk , the value-based Bellman equation (12) reads
k and then following the existing policy Kyk
rk − λi = (Φk − Φk+1 )T vecs(P i ) + n1 (19)
Q(yk , uk , K) = r(yk , uk ) − λ(K) + E[V (yk+1 , K)|zk ].
(15) where
Note that (15) is indeed the quality-based Bellman equation. The Φk = vecv(yk ), rk = r(yk , K i yk )
(20)
value function V (yk , K) can be obtained from Q(yk , uk , K) by n1 = yk+1
T
P i yk+1 − E[yk+1
T
P i yk+1 |yk ].
selecting uk = Kyk
To estimate P i and λi , we execute K i yk and collect τ samples of
V (yk , K) = Q(yk , Kyk , K). (16) the output yk . We use the empirical average cost from τ samples,

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
YAGHMAIE et al.: LINEAR QUADRATIC CONTROL USING MODEL-FREE REINFORCEMENT LEARNING 741

where
Algorithm 2: Average Off-Policy Learning.
1: Initialize: Select a stabilizing policy gain K 1 , set ck = ykT (Ry + K iT Ru K i − P̂ i )yk + yk+1
T
P̂ i yk+1 − λ̄i
 
i = 1. 2yk ⊗ (uk − K i yk )
2: for i = 1, .., N do Υk =
3: Execute K i yk for τ rounds and estimate P̂ i , λ̄i from vecv(uk ) − vecv(K i yk )
(21)–(22).  
vec(B T P i A)
4: Z = CollectData (K i , τ  , τ  , Wη ). ξ =
i
vecs(B T P i B)
5: Estimate ξˆi by (27) using Z.
6: Update the policy gain K i+1 by (28). n2 = yk+1
T
P̂ i yk+1 − E[yk+1
T
P i yk+1 |yk ] + ykT P i yk − ykT P̂ i yk .
7: end for (26)
In (25), ξ i is the parameter vector to be estimated and ck and Υk
Algorithm 3: CollectData. are known using the collected data.
Input: K, τ  , τ  , Wη . Line 4: To estimate ξ i , we run the CollectData routine in
Output: Z. Algorithm 3 to collect τ  sample points. The procedure is as
Z = {}. follows: First, the policy K i yk is applied for τ  rounds to ensure
for t = 1, .., τ  do that E[xk ] = 0 and the output yk is sampled. Then, we sample
Execute Ky for τ  rounds and observe y. a random action ηk from ηk ∼ N (0, Wη ). We apply the action
Sample η ∼ N (0, Wη ) and set u = Ky + η. uk = K i yk + ηk and sample the next output yk+1 . This builds
Take u and observe y+ . one data point (yk , uk , yk+1 ). We repeat this procedure and
Add (y, u, y+ ) to Z. collect τ  data points.
end for Line 5: The estimation of ξ i is given by
 τ −1  τ  
 
ξˆ =
i
Υt Υ T
Υt ct . (27)
as an estimate of λi t
t=1 t=1
1 
τ
λ̄ =
i
rt . (21) Line 6: Using ξˆi , we now obtain the improved policy. Let
τ t=1 H i := B T P i A and N i := B T P i B, and let Ĥ i and N̂ i be their
The average-cost LSTD estimator of P i is given by [37] estimates using (27) [see (26) and (27)]. The improved policy is
 τ −1  τ  selected to be greedy with respect to the average of all previously
 
vecs(P̂ i ) = Φt (Φt − Φt+1 )T Φt (rt − λ̄i ) . estimated value functions
⎛ ⎞−1 ⎛ ⎞
t=1 t=1
i i
(22) K i+1 = − ⎝ (N̂ j + Ru )⎠ ⎝ Ĥ j ⎠ . (28)
2) Policy Improvement: We find the improved policy gain j=1 j=1
by the concept of off-policy learning. The idea is to apply a
behavioral policy uk to the system while learning the target (or B. Average Q-Learning
optimal) policy K i yk . Note that the behavioral policy should be
Our second routine is called average Q-learning in Algorithm
different from the target policy. Define the behavioral policy as
4. Similar to the average off-policy learning in Section IV-A,
uk = K i yk + ηk where we sample ηk from N (0, Wη ) at each
the average Q-learning is a policy iteration routine. Different
time k and Wη is the covariance of probing. Using the behavioral
from the average off-policy learning, the policy is evaluated by
policy, the closed-loop system of (1) reads
finding the Q function. The algorithm is an extended version of
yk+1 = Axk + Buk + wk + vk+1 the model-free linear quadratic (MFLQ) control in [25] to ac-
= Li xk + B(uk − K i yk ) + BK i vk + wk + vk+1 . commodate the effect of observation noise. In [25], only process
(23) noise is present and the average cost is parameterized based on
the covariance of the process noise. Such an approach is not
Using the abovementioned equation, we can obtain the off-policy possible here because the average cost depends on the process
Bellman equation and observation noises, and the dynamics of the system. Note
ykT (Ry + K iT Ru K i − P i )yk + E[yk+1
T
P i yk+1 |yk ] − λi that there is no proof of convergence for the MFLQ algorithm
in [25]. We give a proof of convergence for Algorithm 4 in
= 2(uk − K i yk )T B T P i Ayk
Section V-B, which is also applicable to the MFLQ algorithm
+ (uk + K i yk )T B T P i B(uk − K i yk ). (24) in [25].
The algorithm is initiated with a stabilizing policy gain K 1
We will show in Theorem 1 that (24) is equivalent to the
in Line 1. Then, the algorithm iterates N times in Line 2 over
model-based Bellman equation (10). We can write (24) as a
the policy evaluation and the policy improvement steps. The
linear regression
policy evaluation step is given in Lines 3–5 and discussed in
ck = ΥTk ξ i + n2 (25) Section IV-B1. The policy gain K i is evaluated by estimating its

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
742 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO. 2, FEBRUARY 2023

associated Q function and average cost. The policy improvement


Algorithm 4: Average Q-Learning.
step is given in Line 6 and discussed in Section IV-B2.
1) Policy Evaluation: We evaluate the policy gain K i by 1: Initialize: Select a stabilizing policy gain K 1 , set
finding the quadratic kernel Gi of the Q function and the average i = 1.
cost λi . 2: for i = 1, .., N do
Line 3: We execute K i yk and collect τ samples of the output 3: Execute K i yk for τ rounds and estimate P̂ i , λ̄i from
yk and then we estimated P i and λi from (21) and (22). (21)–(22).
Line 4: We run CollectData routine in Algorithm 3 to collect 4: Z = CollectData (K i , τ  , τ  , Wη ).
τ  sample points. 5: Estimate Ĝi from (31) using Z.
Line 5: Considering Qi (zk , K i ) = zkT Gi zk , the quality-based 6: Update the policy gain K i+1 by (33).
Bellman equation (15) can be written as 7: end for
ck = ΨTk vecs(Gi ) + n3 (29)
where Theorem 2 (see[33]): Assume that the estimation errors in
(22) and (27) are small enough. Then, Algorithm 2 produces
ck = r(yk , uk ) − λ̄i + yk+1
T
P̂ i yk+1 stabilizing policy gains K i+1 , i = 2, . . ., N.
Ψk = vecv(zk ), zk = [ykT , uTk ]T Theorem 3: Assume that the estimation errors in (22) and
(27) are small enough. Then, the sequence of P i , i = 1, . . ., N
n3 = yk+1
T
P̂ i yk+1 − E[yk+1
T
P i yk+1 |zk ]. (30) associated with the controller gain K i , i = 1, . . ., N generated
in Algorithm 2 is converging P ∗ ≤ P i+1 ≤ P i ≤ P 1 .
We estimate Gi using τ  sample points collected in Line 5
The off-policy Bellman equation is given in (24). In both the
 τ −1  τ  
  average off-policy learning and the classical off-policy learning,

vecs(Ĝ ) =
i
Ψt Ψt
T
Ψt ct . (31) the solution to this equation is found by estimating P i , B T P i A
t=1 t=1 and B T P i B. In the classical off-policy learning P i , B T P i A
2) Policy Improvement: Line 6: Let partition matrix Ĝi as and B T P i B are estimated at the same time, see (35)–(36). In
  the average off-policy learning, we solve (24) by first estimating
Ĝi11 Ĝi12 the quadratic kernel of the value function; i.e., P i and then,
Ĝ =
i
(32)
ĜiT
12 Ĝi22 the matrices B T P i A and B T P i B are estimated by running the
CollectData routine in Algorithm 3. In the sequel, we show that
such that Ĝi11 ∈ Rn×n , Ĝi12 ∈ Rn×m , Ĝi22 ∈ Rm×m . The im- the estimation of P i in (22) is biased and as a result, the average
proved policy is selected to be greedy with respect to the average off-policy learning routine in Algorithm 2 is biased. But if this
of all previously estimated Q functions bias in the estimation of P i is near zero, the average off-policy
1 j  learning returns an unbiased estimate.
i i
K i+1 = arg min Q̂ (yk , a) = −(Ĝj22 )−1 ĜjT
12 . Lemma 5: Consider Problem 1 and the estimation problem
a i j=1 j=1 in (19)–(22). The LSTD estimate of P̂ i in (22) is biased by an
(33) amount that is related to the correlation between the noise n1 in
Remark 2: There are two main differences between the av- (20) and the square of yk .
erage algorithms and the classical counterparts in Algorithms Theorem 4: Consider Problem 1 and the estimation problem
5–6: The first difference is related to the sampling and that is we in (25)–(27). Assume that we run Algorithm 3 to collect data
apply the control to the system for τ  steps before taking one to find the estimated solution ξˆi in (27). The estimate ξˆi in (27)
sample. As a result, we will have independent sample points in the average off-policy learning is biased and the bias in ξˆi
and E[xk ] = 0. The second difference is related to the updating is proportional to the bias in P̂ i in (22). If the bias in P̂ i is
the controller gain in each iteration of the algorithms. In the negligible, the estimate ξˆi is unbiased.
average Q-learning and average off-policy learning, the policy Lemma 6: Consider Problem 1 and the estimation problem
is greedy with respect to the average of all previously estimated in (34)–(36). The estimated solution i in (36) in the classical
value functions rather than the last value function. This causes off-policy learning is biased by an amount that is related to the
the gain to adapt slowly toward its optimal value. However if the correlation between n1 and yk , and the correlation between n1
estimation of the value function is poor (so it is away from its and the square of yk where n1 is the noise in (35).
correct value), it helps us to avoid getting an unstable controller The proofs of Theorems 3 and 4 and Lemmas 5 and 6 are
gain. This feature has a positive effect when the trajectory length given in Appendices II.D–II.G.
to estimate the value or Q function is short.
B. Properties of Average Q-Learning
V. ANALYSIS OF THE PROPOSED ALGORITHMS Based on Lemma 4 and Theorems 1–3, the following corol-
laries are concluded immediately.
A. Properties of Average Off-Policy Learning
Corollary 1: Assume that the estimation errors in (22) and
Theorem 1 (see[33]): The off-policy Bellman equation (24) (31) are small enough. Then, Algorithm 4 produces stabilizing
is equivalent to the model-based Bellman equation (10). policy gains K i+1 , i = 2. . ., N.

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
YAGHMAIE et al.: LINEAR QUADRATIC CONTROL USING MODEL-FREE REINFORCEMENT LEARNING 743

Corollary 2: Assume that the estimation errors in (22) and TABLE II


COMPUTATIONAL COMPLEXITY IN EACH ITERATION OF ALGORITHMS 2,
(31) are small enough. Then, the sequence of P i , i = 1, . . ., N 4–6, AND 8
associated with the controller gain K i , i = 1, . . ., N generated
in Algorithm 4 is converging P ∗ ≤ P i+1 ≤ P i ≤ P 1 .
The quality-based Bellman equation is given in (15). In the
average Q-learning, we solve this equation by first estimating
P i and then, estimating Gi by collecting data according to the
CollectData routine in Algorithm 3. In comparison, the classical
Q-learning algorithm estimates the kernel Gi directly, without
going through the estimation of P i . Both algorithms return
biased estimates.
Lemma 7: Consider Problem 1 and the estimation problem in
(29)–(31). Assume that we run Algorithm 3 to collect data for
the estimation of Gi in (31). The estimate (31) in the average
Q-learning is biased by an amount that is related to the average
of noise n3 in (30), the correlation between n3 and yk , and the
correlation between n3 and the square of yk .
Lemma 8: Consider Problem 1 and the estimation problem in
(38)–(40). The estimated solution Gi in (40) in the classical Q- d : The number of parameters to be estimated, n, m : The dimensions of the
learning is biased by an amount that is related to the correlation state and the input, τ : The rollout length, and τ  : The exploration length
between n4 and yk , and the correlation between n4 and the square in CollectData in Algorithm 3 (Only for the average off-policy and the
of yk where n4 is the noise in (39). average Q-learning routines).

Remark 3: It is not possible to compare the bias terms in the


average Q-learning and the classical Q learning, as they contain The complexity of the model-building approach can be dif-
different terms that are not comparable. ficult to establish. For example, the command SSEST in MAT-
Remark 4: Setting vk ≡ 0 and Wv ≡ 0 in the proofs of LAB initializes the parameter estimation using a noniterative
Lemmas 5–8 and Theorem 4, it is easy to verify that in the subspace approach where the numerical calculations consist of
absence of measurement noise, all estimates in Algorithms 2, 4, a QR factorization and an SVD (Singular Value Decomposition).
5, 6 are unbiased. It then refines the parameter values iteratively using the predic-
The proofs of Lemmas 7 and 8 are given in Appendices tion error minimization approach (see [26, Sec. 10.7]). One can
II.H–II.I. also use total least squares [41] to estimates the dynamics [33].
For this, one needs to find singular value decomposition of a
matrix of order τ × (n2 + nm + 1). For τ d, the complexity
of the computation is O(2(n2 + mn + 1)2 τ + 11(n2 + mn +
C. Computational Complexity
1)3 ) [41].
In each iteration of model-free algorithms, we solve one
or more linear regression problems with solutions like (22). VI. SIMULATION RESULTS
Considering d as the number of parameters to be estimated
and τ as the number of samples, the complexity of (22) for We consider an idealized instance of data center cooling with
τ d is O(d2 τ ) [40]. Table II summarizes the computational three sources coupled to their own cooling devices with the
complexities of the estimations in each iteration of the routines following dynamics [21], [25], [26]:
⎡ ⎤ ⎡ ⎤
where we have simplified the complexity terms and kept the 1.01 0.01 0 1 0 0
⎢ ⎥ ⎢ ⎥
dominant terms only. xk+1 = ⎣0.01 1.01 0.01⎦ xk + ⎣0 1 0⎦ uk + wk
For the sake of comparison, assume τ  = τ . According to 0 0.01 1.01 0 0 1
Table II, the average off-policy learning has a lower compu-
tational complexity in compared to the classical off-policy and yk = x k + v k
Q-learning algorithms. To see this point, note that the complexity and the quadratic running cost
O(d2 τ ) is quadratic in the number of parameters d. In the aver-
age off-policy learning, these d = (n + m)(n + m + 1)/2 un- r(yk , uk ) = 0.001ykT yk + uTk uk .
known parameters are estimated in the following two parts: first Let Ww = I, Wv = I. We select the initial state of the system
n(n + 1)/2 parameters are estimated in (22) and then, nm + as zero. We set the initial stabilizing policy gain K 1 for all
m(m + 1)/2 parameters are estimated in (27). So, the overall algorithms by solving a modified ARE(A, B, 200Ry , Ru ). We
complexity of the average off-policy learning is O((n(n + set the covariance of the probing noise ηk as Wη = I. If an
1)/2)2 τ ) + O((mn + m(m + 1)/2)2 τ  ). In the classical off- algorithm produces an unstable controller gain, we assign an
policy and Q-learning algorithms all d = (n + m)(n + m + infinite cost to that iteration and algorithm, and restart the
1)/2 parameters are estimated at the same time, which has the system. Let τ and N denote the rollout length and the number
higher complexity O(((n + m)(n + m + 1)/2)2 τ ). of iterations in Algorithms 2, 4, 5, and 6.

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
744 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO. 2, FEBRUARY 2023

TABLE III
PERCENTAGE OF STABILITY IN ALL ITERATIONS IN 100 SIMULATIONS

Fig. 1. Fraction of stable policy gains generated by each algorithm in


all iterations.

Average off-policy learning and Average Q-learning: In


Algorithms 2 and 4, we set τ  = τ and select τ  = 10. We select
the behavioral policy as uk = K i yk + ηk where we sample ηk
from N (0, Wη ).
Classical off-policy learning and Classical Q-learning: We
run Algorithms 5 and 6. We select uk = K i yk + ηk where we
sample ηk from N (0, Wη ).
|λ(K)−λ(K ∗ )|
Model-building approach: We run Algorithm 8 for the model- Fig. 2. Median of the relative error λ(K ∗ )
for 100 stable policies.
building approach. We collect the input–output data by applying Shaded regions display quartiles. The numerical values are given in
Table IV.
the control signal uk = K i yk + 10ηk , where we sample ηk
from N (0, Wη ). We estimate the matrices (Â, B̂) from the
experimental data as explained in Appendix I.C. learning and the average Q-learning algorithms produce more
Policy gradient: A policy gradient algorithm is given in stable policies among DP-based model-free approaches and
Algorithm 9 in Appendix I.D. We consider a multivariate they are the most stable model-free routines. This is because
Gaussian distribution for the probability density function (pdf) in the average off-policy and the average Q-learning routines,
of the probabilistic policy with the covariance 0.01I. In each the policy is greedy with respect to the average of all previously
iteration, we collect 10 mini-batches of trajectories of length estimated value functions rather than the last value function.
T = 10, so in each iteration, 10 × 10 data is used. We update This causes the gain to adapt slowly toward its optimal value
the controller gain using an ADAM optimizer with the learning and helps us to get less unstable gains when the trajectory length
rate as 0.01 and the default values for other hyperparameters. to estimate the value or Q function is short. By increasing the
To match the sample budget of Algorithms 2, 4, 5, 6, we set the rollout length, τ ≥ 2000, the average off-policy learning and the
number of iterations in the policy gradient as τ × N/100. average Q-learning routines produce stable policy always while
Analytical solution: Our baseline for comparison is the ana- the classical Q-learning and the classical off-policy learning
lytical solution assuming that the matrices (A, B) are exactly need at least 4000 − 5000 samples to produce stable policies
known. Using the full information of the system dynamics always. For the policy gradient algorithm, as more iteration is
(A, B), we solve ARE(A, B, Ry , Ru ) and obtain P ∗ and K ∗ . done, the algorithm returns less stable policies.
We compute the analytical average cost λ(K ∗ ) from (8) using
A, B, P ∗ , K ∗ , Wv , and Ww . B. Performance Analysis
Our metric of interest for analyzing the performance
A. Stability Analysis is the relative error |λ(K)−λ(K

)|
where K is the policy
λ(K ∗ )
For Algorithms 2, 4, 5, 6, we set the number of it- gain obtained from a model-free algorithm or the model-
erations as N = 3 and we change the rollout length τ = building approach and K ∗ is the optimal policy gain obtained
[100, 500, 1000, 2000, 3000, 4000, 5000]. In the policy gra- analytically using full information of the system. We run each
dient algorithm, we change the number of iterations as τ × algorithm until we obtain 100 stable policies. We set the num-
3/100 = [3, 15, 30, 60, 90, 120, 150] to match the sample ber of iterations as N = 3 and we change the rollout length
budget of Algorithms 2, 4, 5, 6. We run each algorithm 100 τ = [100, 500, 1000, 2000, 3000, 4000, 5000]. In the policy
times and in Fig. 1, we show the fraction of times each algorithm gradient algorithm, we change the number of iterations as
produces stable policy gains in all iterations. In Table III, we τ × 3/100 = [3, 15, 30, 60, 90, 120, 150].
list the percentage of stability numerically. From this figure, In Fig. 2, we plot the median of the relative errors for the
we can see that the model-building approach always produces aforementioned approaches and in Table IV, we list the median
stable policies. We can also see that the average off-policy of the relative errors numerically. From this figure, we can

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
YAGHMAIE et al.: LINEAR QUADRATIC CONTROL USING MODEL-FREE REINFORCEMENT LEARNING 745

TABLE IV
|λ(K)−λ(K ∗ )| Algorithm 5: Classical Off-Policy Learning.
MEDIAN OF THE RELATIVE ERROR λ(K ∗ )
1: Initialize: Select a stabilizing policy gain K 1 , set
i = 1.
2: for i = 1, .., N do
3: Z = {}.
4: for t = 1, . . ., τ do
Sample ηt ∼ N (0, Wη ) and let ut = K i yt + ηt .
Take ut and observe yt+1 .
Add (yt , ut , yt+1 ) to Z.
5: end for
see that the model-building approach has the lowest relative 6: Estimate λ̄i from (21), and P i , B T P i A, B T P i B
error and it is almost identical to the analytical solution. The from (36).
average off-policy and the average Q-learning routines suffer 7: Update the policy gain K i+1 by (37).
from 4% − 5% relative error at their best. The reason is that 8: end for
the estimations of P i in the average off-policy and Gi in the
average Q-learning are biased and they will not improve further
by increasing the rollout length. However, the relative errors for
proved stability of the generated controllers and convergence of
the average off-policy and the average Q-learning are much less
the algorithms. We have shown that the measurement noise can
than those of the classical off-policy and Q-learning algorithms
deteriorate the performance of DP-based RL routines. We have
(∼ 35% − 40%). When τ is small, the average off-policy and the
also presented a model-building approach in Algorithm 8 as a
average Q-learning reach their best performance. This is because
natural way of solving Problem 1 in the adaptive control context,
Algorithm 3 is used to collect data for estimation where the dy-
using the same information from the system as Algorithms 2
namics is run for τ  steps before collecting a sample data point.
and 4. We have used a popular RL benchmark for the LQ
It helps us to have independent sample points and E[x] = 0.
problem to evaluate our proposed algorithms. Our empirical
The policy gradient algorithm performance is worse than other
evaluation shows that our DP-based algorithms produce more
algorithms. It is because the policy gradient algorithms try to
stable policies in comparison with the classical off-policy and
directly optimize the performance index and as such, they are
the classical Q-learning routines for this benchmark and they
very sensitive to noise.
are the closest ones to the analytical solution. It has also turned
out that our model-building approach outperforms the DP-based
C. Discussion solutions, which is consistent with the previous results on linear
In summary, one may consider different factors when choos- systems with process noise only [21], [26].
ing a model-building or a model-free approach. Our empirical
evaluation shows that the model-building solution outperforms
the model-free approaches and is the closest one to the analytical APPENDIX I
solution. This point has also been shown in [25] and [26] for CLASSICAL MODEL-FREE ALGORITHMS AND THE
the LQ problem with the process noise only (no measurement MODEL-BUILDING APPROACH
noise), where it is possible to estimate (A, B) using the ordinary A. Classical Off-Policy Learning
least-squares. Although when the measurement noise is present,
The classical off-policy learning algorithm for deterministic
this point needs to be proved.
systems is given in [12]. Here, we bring a modified version in
The main difference between the model-building and the
Algorithm 5 to accommodate the effect of process and observa-
model-free approaches comes from the way the data is used.
tion noises. The algorithm is initiated with a stabilizing policy
In the model-building approach, the data are used to identify
gain K 1 in Line 1. Then, the algorithm iterates N times in Line
the dynamics of the system while in the model-free approach,
2 over the policy evaluation and the policy improvement steps.
to learn the value function and the optimal policy. The main
The policy evaluation step is given in Lines 3–5 and discussed in
advantage of a model-free approach is to eliminate the system
Appendix I.A.1. The policy improvement step is given in Line
identification part and to directly solve the optimal control
6 and discussed in Appendix I.A.2.
problem. Such an approach can be extremely useful for nonlinear
1) Policy Evaluation: Lines 3–4: To evaluate the policy gain
systems where it is difficult to identify the dynamics and solve
K i , we collect τ samples of (yk , uk , yk+1 ) in the following way:
the optimal control problem.
we observe yk and then, we sample ηk from N (0, Wη ). We apply
the policy uk = K i yk + ηk and observe yk+1 .
VII. CONCLUSION
Line 5: We set the average cost λi to the empirical average
In this article, we have considered the LQ control problem cost from τ samples (21). The off-policy Bellman equation (24)
with process and measurement noises. We have assumed that the reads
dynamical model is unknown and developed two model-free DP-
based routines (Algorithms 2 and 4) to solve Problem 1. We have rk − λi = ΩTk i + n1 (34)

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
746 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO. 2, FEBRUARY 2023

Algorithm 6: Classical Q-Policy Learning. Algorithm 7: EstimateDynamics.


1 Input: U , Y , m0.
1: Initialize: Select a stabilizing policy gain K , set
i = 1. Output: (A_hat,B_hat)
2: for i = 1, .., N do data_exp = iddata(U,Y);
3: Z = {}. me=ssest(data_exp,m0,ssestOptions,
4: for t = 1, . . ., τ do ’DisturbanceModel,’ ‘estimate’);
Sample ηt ∼ N (0, Wη ) and let ut = K i yt + ηt . A_hat=me.a; B_hat=me.b;
Take ut and observe yt+1 .
Add (yt , ut , yt+1 ) to Z.
5: end for Algorithm 8: Model-Building Approach.
6: Estimate λ̄i from (21) and Ĝi from (40). 1: Initialize: Select a stabilizing policy gain K 1 , set
7: Update the policy gain K i+1 by (41). i = 1, U = {}, Y = {}.
8: end for m0=idss(zeros(n,n),zeros(n,m),eye(n),
zeros(n,m),’ts,’1); m0.Structure.C.Free=zeros(n,n);
2: for i = 1, .., N do
where 3: for t = 1, . . ., τ do
⎡ ⎤ ⎡ ⎤
vecs(P i ) vecv(yk ) − vecv(yk+1 ) Sample ηt ∼ N (0, Wη ) and let ut = K i yt + ηt .
⎢ ⎥ ⎢ ⎥ Take ut and U ← ut , Y ← yt .
i = ⎣ vec(B T P i A) ⎦ , Ωk = ⎣ 2yk ⊗ (uk − K i yk ) ⎦
4: end for
vecs(B T P i B) vecv(uk ) − vecv(K i yk ) 5: (Â, B̂) =EstimateDynamics(U, Y, m0)
⎡ ⎤ 6: m0.A=Â; m0.B=B̂;
vecv(yk )
⎢ ⎥ 7: Find the optimal policy gain K ∗ from (6)–(7) and set
Θk = ⎣ 2yk ⊗ (uk − K i yk ) ⎦ K i+1 = K ∗ .
vecv(uk ) − vecv(K i yk ) 8: end for
n1 = yk+1
T
P i yk+1 − E[yk+1
T
P i yk+1 |yk ]. (35)
The estimated solution to (24) in the classical off-policy learning where
using instrumental variable method is given by
 τ −1  τ  Ψk = vecv(zk )
 
ˆ =
i
Θt Ωt T
Θt (rt − λ̄ ) .
i
(36) n4 = zk+1
T
Gi zk+1 − E[zk+1
T
Gi zk+1 |zk ]. (39)
t=1 t=1
Let zk = [ykT , uTk ] and zk+1 = [yk+1
T
, (K i yk+1
T
)]. In the clas-
2) Policy Improvement: Line 6: Let H i := B T P i A and i
sical Q-learning, the LSTD estimator of G approximates the
N i := B T P i B, and let Ĥ i and N̂ i be their estimates using (36). solution to the quality-based Bellman equation (15) by
The improved policy is given by  τ −1  τ 
 
K i+1 = −(N̂ i + Ru )−1 Ĥ i . (37) vecs(Ĝ ) =
i
Ψt (Ψt − Ψt+1 ) T
Ψt (rt − λ̄ ) .
i

t=1 t=1
B Classical Q-Learning (40)
2) Policy Improvement: Line 6: Let partition matrix Ĝi as
We bring the classical Q-learning routine from [27] in Algo- (32). The improved policy gain is given by
rithm 6 and modify it to accommodate the effect of process and
measurement noises. The algorithm is initiated with a stabilizing K i+1 = arg min Q̂i (yk , a) = −(Ĝi22 )−1 ĜiT
12 . (41)
a
policy gain K 1 in Line 1. Then, the algorithm iterates N times
in Line 2 over the policy evaluation and the policy improvement C. Model-Building Approach
steps. The policy evaluation step is given in Lines 3–5 and
discussed in Appendix I.B.1. The policy improvement step is A possible variant to model-free solutions to Problem 1 where
given in Line 6 and discussed in Appendix I.B.2. the dynamics is not known is to estimate (A, B) as a separate
1) Policy Evaluation: The policy gain K i is evaluated by system identification problem and use these estimates for the so-
estimating the quadratic kernel Gi of the Q function. lution of Problem 1. We can call this a model-building approach
Lines 3–4: We collect τ samples of (yk , uk , yk+1 ) in the to the problem. The estimated (Â, B̂) can be used in the ARE (7)
following way: we observe yk and then, we sample ηk from to solve for optimal policy gain in the corresponding LQ control
N (0, Wη ). We apply the policy uk = K i yk + ηk and observe (see Lemma 1).
yk+1 . The system identification problem to find (A, B) in (1)
Line 5: We set λi to the empirical average cost from τ samples is actually a simple problem since the model order n is
(21). The quality-based Bellman equation (15) reads known. Let [U, Y ] denote the input–output data and m0 de-
note the initial state-space model (which can be set to zero
rk − λi = (Ψk − Ψk+1 )vecs(Gi ) + n4 (38) matrices when no initial knowledge about the dynamics is

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
YAGHMAIE et al.: LINEAR QUADRATIC CONTROL USING MODEL-FREE REINFORCEMENT LEARNING 747

APPENDIX II
Algorithm 9: Policy Gradient.
PROOFS
1: Initialize: Policy gain K 1 , i = 1, probability density
function p(a; K i yk ). In this section, we bring the proofs of theorems and lem-
2: while ending condition is not satisfied do mas in the body of this article. To facilitate the derivations,
3: Z = {}. we use the following identity frequently vecv(v)T vecs(P ) =
4: for t = 1, . . ., T do (v ⊗ v)vec(P ) = v T P v, where P is a symmetric matrix and v
Observe yt , sample ut ∼ p(a; K i yk ), take ut . is a vector with appropriate dimension. The following lemma is
Observe rt . Add (yt , ut , rt ) to Z. useful through the proofs.
5: end for Lemma 9: Consider (1)–(5). For Ξ ∈ Rn×n , we have

6: Let R(T ) = Tt=1 rt . E[xTk Ξvk |yk ] = E[xTk Ξwk |yk ] = E[xTk Ξvk+1 |yk ] = 0 (42)

7: Set δK i = Tt=1 (R(T ) − b)∇K i log p(ut , K i yt ).
E[xTk Ξxk |yk ] = ykT Ξyk − Tr(ΞWv ). (43)
8: Update K i+1 by gradient descent using δK i .
9: end while Proof: Based on (1), wk , vk , vk+1 do not affect xk and (42)
follows. To see (43), note that
ykT Ξyk = E[ykT Ξyk |yk ] = E[(xk + vk )T Ξ(xk + vk )|yk ]
known). To solve this structured problem in a general sys- = E[xTk Ξxk |yk ] + 2E[xTk Ξvk |yk ] + E[vkT Ξvk |yk ]
tem identification code, like the system identification toolbox
in MATLAB [42], the code in Algorithm 7 can be used. = E[xTk Ξxk |yk ] + Tr(ΞWv ).
One can also use total least squares [41] to estimate the 
dynamics [33].
Since in this article we examine the model-building approach A. Proof of Lemma 1
along with the model-free algorithms, we bring the model-
building approach in a recursive way such that the controller Let V ∗ = ykT P ∗ yk be the optimal value function. Define the
gain in the model-building approach is updated at the same Bellman equation for the optimal value function (5)
pace as the model-free approaches. Algorithm 8 summarizes ykT P ∗ yk = ykT Ry yk + ykT K ∗T Ru K ∗ yk − λ(K ∗ )
the model-building approach. In Line 1, we initialize the algo- (44)
+ E[yk+1
T
P ∗ yk+1 |yk ].
rithm by selecting a stabilizing controller gain. Note that if the
uncontrolled system is unstable, using only noise for system The term E[yk+1
T
P ∗ yk+1 |yk ] reads
identification results in numerical instability. Otherwise, we can E[yk+1
T
P ∗ yk+1 |yk ]
set K 1 = 0. We set empty matrices for the input–output data
U, Y , assign zero matrices to the initial state-space model, and = E[xTk (A + BK ∗ )T P ∗ (A + BK ∗ )xk |yk ]
fix C = I (yk = Ixk + vk ). In Line 2, the algorithm iterates N
+ 2E[xTk (A + BK ∗ )T P ∗ BK ∗ vk |yk ]
times. In Line 3, we collect τ input–output samples and append
to U, Y . In Line 4, we estimate (Â, B̂) using U, Y and m0. In + 2E[xTk (A + BK ∗ )T P ∗ (wk + vk+1 )|yk ]
Line 5, we update m0 by (Â, B̂). In Line 6, we solve the ARE
(7) using (Â, B̂). + E[vkT K ∗T B T P ∗ BK ∗ vk |yk ]
Our proposed model-building routine in Algorithm 7 can + 2E[vkT K ∗T B T P ∗ (wk + vk+1 )|yk ]
be easily extended (by further estimating C) to cover the
partially observable dynamical systems similar to [34] and + E[(wk + vk+1 )T P ∗ (wk + vk+1 )|yk ].
[35]. Using (42) and (43) in the abovementioned equation
E[yk+1
T
P ∗ yk+1 |yk ] = ykT (A + BK ∗ )T P ∗ (A + BK ∗ )yk
D. Policy Gradient
+ Tr(K ∗T B T P ∗ BK ∗ Wv ) + Tr(P ∗ Ww )
A Policy gradient routine is given in Algorithm 9 [21]. In
Line 1, we initialize the algorithm by a controller gain and − Tr((A + BK ∗ )T P ∗ (A + BK ∗ )Wv ) + Tr(P ∗ Wv ).
selecting a pdf for the probabilistic policy (usually a multivariate Substituting the abovementioned result in (44) and matching
Gaussian distribution). In Line 2, the algorithm is iterated until terms, we have the optimal average cost in (8) and

the ending condition is satisfied. In Lines 3–4, we collect T (A + BK ∗ )T P ∗ (A + BK ∗ ) − P ∗ + Ry + K T
Ru K ∗ = 0.
samples of (yk , uk , rk ) by sampling the actions from the pdf
Optimizing the abovementioned equation with respect to K ∗ ,
uk ∼ p(a; K i yk ) and storing in Z. In Line 5, we calculate
results the optimal policy gain in (6).
the total cost of T steps. In Line 6, we give the gradient of
the total cost with respect to K i . Note that b in Line 6 is a
baseline to reduce variance [28]. Among many options, one B. Proof of Lemma 3
can select the baseline as the mean cost of previous iterates. Similar to the proof of Lemma 2, we show that the
In Line 7, we update the policy gain by gradient descent given quadratic form satisfies the quality-based Bellman equa-
using δK i . tion (15). Let zk = [ykT , uTk ]T , zk+1 = [yk+1
T
, (Kyk+1 )T ]T and

Authorized licensed use limited to: University of New Hampshire. Downloaded on November 12,2024 at 01:27:41 UTC from IEEE Xplore. Restrictions apply.
748 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 68, NO. 2, FEBRUARY 2023

 
M = [I  K^T] G [I; K]. Let u_k = K y_k + η_k. Using the control u_k at time k, the next output y_{k+1} reads

y_{k+1} = L x_k + B η_k + B K v_k + w_k + v_{k+1}.   (45)

By (16), E[V(y_{k+1}, K) | z_k] = E[Q(y_{k+1}, K y_{k+1}, K) | z_k]. Now, we use (45) to compute E[Q(y_{k+1}, K y_{k+1}, K) | z_k]

E[Q(y_{k+1}, K y_{k+1}, K) | z_k] = E[y_{k+1}^T M y_{k+1} | z_k]
= E[(L x_k + B η_k + B K v_k + w_k + v_{k+1})^T M (L x_k + B η_k + B K v_k + w_k + v_{k+1}) | z_k]
= E[x_k^T L^T M L x_k | z_k] + 2 E[x_k^T L^T M B η_k | z_k] + E[η_k^T B^T M B η_k | z_k] + E[v_k^T K^T B^T M B K v_k | z_k] + E[w_k^T M w_k + v_{k+1}^T M v_{k+1} | z_k].

Using (43), the abovementioned equation reads

E[y_{k+1}^T M y_{k+1} | z_k]
= y_k^T L^T M L y_k + 2 y_k^T L^T M B η_k + η_k^T B^T M B η_k − Tr(L^T M L W_v) + Tr(K^T B^T M B K W_v) + Tr(M W_w) + Tr(M W_v)
= [y_k^T  u_k^T] [A^T M A   A^T M B; B^T M A   B^T M B] [y_k; u_k] − Tr(L^T M L W_v) + Tr(K^T B^T M B K W_v) + Tr(M W_w) + Tr(M W_v)
= z_k^T [A^T M A   A^T M B; B^T M A   B^T M B] z_k − Tr(L^T M L W_v) + Tr(K^T B^T M B K W_v) + Tr(M W_w) + Tr(M W_v).

Replacing the abovementioned result in the quality-based Bellman equation (15),

z_k^T G z_k = z_k^T [A^T M A + R_y   A^T M B; B^T M A   B^T M B + R_u] z_k − Tr(L^T M L W_v) + Tr(K^T B^T M B K W_v) + Tr(M W_w) + Tr(M W_v) − λ(K).

By matching terms, (17) and (18) are concluded.

C. Proof of Lemma 4

According to (16),

P = [I  K^T] G [I; K].   (46)

By Lemma 3, the quadratic kernel of the Q function satisfies (17). Premultiplying (17) by [I  K^T] and postmultiplying (17) by [I; K], we have

[I  K^T] ( [A  B]^T P [A  B] + [R_y  0; 0  R_u] ) [I; K] − P = 0,

where we have used (46) in the abovementioned equation. By expanding the abovementioned equation, (10) is concluded.

D. Proof of Theorem 3

The proof of convergence of the algorithm contains two steps. The first step is to show that solving the off-policy Bellman equation (24) is equivalent to solving the model-based Bellman equation (13). This is guaranteed by Theorem 1 and by assuming that the estimation errors in (22) and (27) are small enough, P̂^i ≈ P^i, B^T P̂^i A ≈ B^T P^i A, B^T P̂^i B ≈ B^T P^i B. The second step is to show that, by improving the policy to be greedy with respect to the average of all previous value functions (28), P^{i+1} ≤ P^i for i = 1, ..., N. By Theorem 2, the algorithm produces stabilizing controller gains K^i. Let P^i > 0 denote the solution to the model-based Bellman equation (13)

P^i = (A + B K^i)^T P^i (A + B K^i) + K^{iT} R_u K^i + R_y.   (47)

Since A + B K^i is stable, the unique positive definite solution of (47) may be written as [39]

P^i = Σ_{k=0}^{+∞} ((A + B K^i)^T)^k (K^{iT} R_u K^i + R_y) (A + B K^i)^k.   (48)

Consider two iteration indices i and j. Using (47), P^i − P^j reads

P^i − P^j = (A + B K^i)^T P^i (A + B K^i) + K^{iT} R_u K^i − (A + B K^j)^T P^j (A + B K^j) − K^{jT} R_u K^j
= (A + B K^j)^T P^i (A + B K^j) + K^{iT} R_u K^i − K^{jT} B^T P^i (A + B K^j) − A^T P^i B K^j + K^{iT} B^T P^i (A + B K^i) + A^T P^i B K^i − (A + B K^j)^T P^j (A + B K^j) − K^{jT} R_u K^j
= (A + B K^j)^T (P^i − P^j)(A + B K^j) + (K^i − K^j)^T (R_u + B^T P^i B)(K^i − K^j) + (K^i − K^j)^T [(R_u + B^T P^i B) K^j + B^T P^i A] + [K^{jT} (R_u + B^T P^i B) + A^T P^i B](K^i − K^j)
= Σ_{k=0}^{+∞} ((A + B K^j)^T)^k ( (K^i − K^j)^T (R_u + B^T P^i B)(K^i − K^j) + (K^i − K^j)^T [(R_u + B^T P^i B) K^j + B^T P^i A] + [K^{jT} (R_u + B^T P^i B) + A^T P^i B](K^i − K^j) ) (A + B K^j)^k,   (49)

where we have used (48) to obtain the last equation. By (28),
K^{i+1} = −( Σ_{j=1}^{i} (B^T P^j B + R_u) )^{-1} ( Σ_{j=1}^{i} B^T P^j A ),
K^i = −( Σ_{j=1}^{i−1} (B^T P^j B + R_u) )^{-1} ( Σ_{j=1}^{i−1} B^T P^j A ).   (50)

Rearranging K^{i+1} in (50) and letting M := Σ_{j=1}^{i−1} (B^T P^j B + R_u) > 0,

M (K^i − K^{i+1}) = (R_u + B^T P^i B) K^{i+1} + B^T P^i A.   (51)

By setting j = i + 1 in (49) and using (51), P^i − P^{i+1} reads

P^i − P^{i+1} = Σ_{k=0}^{+∞} ((A + B K^{i+1})^T)^k [ (K^i − K^{i+1})^T (R_u + B^T P^i B)(K^i − K^{i+1}) + 2 (K^i − K^{i+1})^T M (K^i − K^{i+1}) ] (A + B K^{i+1})^k ≥ 0,

which shows that P^{i+1} ≤ P^i and the sequence is converging. By repeating this procedure for i = 1, ..., we can see that P^* ≤ P^{i+1} ≤ P^i ≤ P^1.
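The monotonicity established above can also be checked numerically. The sketch below (a hypothetical system and cost weights; the model-based quantities are used only for this verification) iterates the Lyapunov equation (47) together with the averaged-greedy update (50) and checks P^{i+1} ⪯ P^i at every step.

```python
# Numerical sanity check that the averaged-greedy update yields P^1 >= P^2 >= ...
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [0.5]])
Ry, Ru = np.eye(2), 0.1 * np.eye(1)

def policy_value(K):
    """Solve P = (A + BK)^T P (A + BK) + K^T Ru K + Ry, i.e., Eq. (47)."""
    Acl = A + B @ K
    P = solve_discrete_lyapunov(Acl.T, K.T @ Ru @ K + Ry)
    return 0.5 * (P + P.T)                      # symmetrize against round-off

K = np.zeros((1, 2))                            # initial gain (stabilizing: A is stable)
sum_BPB = np.zeros((1, 1))                      # accumulates B^T P^j B + Ru
sum_BPA = np.zeros((1, 2))                      # accumulates B^T P^j A
P_prev = None
for i in range(1, 11):
    P = policy_value(K)
    if P_prev is not None:
        # P^{i-1} - P^{i} should be positive semidefinite
        print(i, np.min(np.linalg.eigvalsh(P_prev - P)) >= -1e-9)
    P_prev = P
    sum_BPB += Ru + B.T @ P @ B
    sum_BPA += B.T @ P @ A
    K = -np.linalg.solve(sum_BPB, sum_BPA)      # averaged-greedy improvement, Eq. (50)
print("final gain:", K)
```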
E. Proof of Lemma 5

We repeat the estimation problem in (19)–(22)

r_k − λ^i = (y_k ⊗ y_k − y_{k+1} ⊗ y_{k+1})^T vec(P^i) + n_1,
n_1 = y_{k+1}^T P^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | y_k].   (52)

The noise in the abovementioned equation appears in the regressor (also called errors-in-variables [20]) and can be written as n_1 = ν^T vec(P^i), where vec(P^i) is our parameter vector and

ν^T = y_{k+1}^T ⊗ y_{k+1}^T − E[y_{k+1}^T ⊗ y_{k+1}^T | y_k]
= (2 (w_k + v_{k+1})^T ⊗ x_k^T)(I ⊗ L^{iT})
+ (2 (w_k + v_{k+1})^T ⊗ v_k^T)(I ⊗ (K^{iT} B^T))
+ (w_k + v_{k+1})^T ⊗ (w_k + v_{k+1})^T
− (v_k^T ⊗ v_k^T)(A^T ⊗ A^T)
− (2 v_k^T ⊗ v_k^T)((K^{iT} B^T) ⊗ A^T)
− (2 v_k^T ⊗ x_k^T)(A^T ⊗ L^{iT})
+ (vec(W_v)^T)(L^{iT} ⊗ L^{iT})
− vec(W_w)^T − vec(W_v)^T
− (vec(W_v)^T)((K^{iT} B^T) ⊗ (K^{iT} B^T)).   (53)

The instrumental variable in the estimation (22) is y_k ⊗ y_k. To have an unbiased estimate, we need E[(y_k ⊗ y_k) ν^T] = 0. In the sequel, we analyze the correlation matrix

E[(y_k ⊗ y_k) ν^T] = E[(x_k ⊗ x_k) ν^T] + 2 E[(x_k ⊗ v_k) ν^T] + E[(v_k ⊗ v_k) ν^T].

Remembering the facts that x_k, v_k, v_{k+1}, w_k are mutually independent, that W_v is diagonal, and using (14), we get the following results. The term E[(x_k ⊗ x_k) ν^T] reads

E[(x_k ⊗ x_k) ν^T] = E[x_k ⊗ x_k] ( vec(W_w)^T + vec(W_v)^T − vec(W_v)^T (A^T ⊗ A^T + 2 (K^{iT} B^T) ⊗ A^T) + (vec(W_v)^T)(L^{iT} ⊗ L^{iT}) − vec(W_w)^T − vec(W_v)^T − (vec(W_v)^T)((K^{iT} B^T) ⊗ (K^{iT} B^T)) ) = 0.

The term 2 E[(x_k ⊗ v_k) ν^T] reads

2 E[(x_k ⊗ v_k) ν^T] = −4 E[(x_k ⊗ v_k)(v_k^T ⊗ x_k^T)](A^T ⊗ L^{iT}).   (54)

The term E[(v_k ⊗ v_k) ν^T] reads

E[(v_k ⊗ v_k) ν^T] = −E[(v_k ⊗ v_k)(v_k^T ⊗ v_k^T)](A^T ⊗ A^T + 2 (K^{iT} B^T) ⊗ A^T) + vec(W_v)(vec(W_v)^T)(L^{iT} ⊗ L^{iT}) − vec(W_v)(vec(W_v)^T)((K^{iT} B^T) ⊗ (K^{iT} B^T)).   (55)

Finally, by summing (54) and (55), we have the bias term in the estimation (22)

E[(y_k ⊗ y_k) ν^T] = −4 E[(x_k ⊗ v_k)(v_k^T ⊗ x_k^T)](A^T ⊗ L^{iT}) − E[(v_k ⊗ v_k)(v_k^T ⊗ v_k^T)](A^T ⊗ A^T + 2 (K^{iT} B^T) ⊗ A^T) + vec(W_v)(vec(W_v)^T)(L^{iT} ⊗ L^{iT}) − vec(W_v)(vec(W_v)^T)((K^{iT} B^T) ⊗ (K^{iT} B^T)),   (56)

which is nonzero, meaning that the estimate is biased.
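To illustrate the consequence of this bias, the following Monte Carlo sketch (hypothetical system, gain, and noise levels; not the article's benchmark) forms the instrumental-variable estimate of vec(P^i) from on-policy data as in (19)–(22), with the average cost replaced by its empirical mean, and compares it with the Lyapunov solution (47). One should observe a small error when W_v = 0 and a visibly larger one when W_v ≠ 0.

```python
# Monte Carlo illustration of the measurement-noise bias in the estimate of P.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(2)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [0.5]])
Ry, Ru = np.eye(2), 0.1 * np.eye(1)
Ww = 1e-2 * np.eye(2)
K = np.array([[-0.2, -0.5]])                     # fixed evaluation policy u_k = K y_k
P_true = solve_discrete_lyapunov((A + B @ K).T, K.T @ Ru @ K + Ry)   # Eq. (47)

def iv_estimate_P(Wv, T=50_000):
    """Instrumental-variable estimate of P from on-policy data, cf. (19)-(22)."""
    x = np.zeros(2)
    y = x + rng.multivariate_normal(np.zeros(2), Wv)
    phi, inst, cost = [], [], []
    for _ in range(T):
        u = K @ y
        x1 = A @ x + B @ u + rng.multivariate_normal(np.zeros(2), Ww)
        y1 = x1 + rng.multivariate_normal(np.zeros(2), Wv)
        cost.append(y @ Ry @ y + u @ Ru @ u)          # stage cost r_k
        phi.append(np.kron(y, y) - np.kron(y1, y1))   # regressor in (19)
        inst.append(np.kron(y, y))                    # instrumental variable
        x, y = x1, y1
    Phi, Z = np.array(phi), np.array(inst)
    c = np.array(cost) - np.mean(cost)                # r_k minus (empirical) average cost
    vecP = np.linalg.lstsq(Z.T @ Phi, Z.T @ c, rcond=None)[0]
    return vecP.reshape(2, 2)

for Wv in (np.zeros((2, 2)), 1e-2 * np.eye(2)):
    P_hat = iv_estimate_P(Wv)
    print("Wv diag =", np.diag(Wv), " ||P_hat - P_true|| =", np.linalg.norm(P_hat - P_true))
```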
F. Proof of Theorem 4

We repeat the estimation problem in (25)–(27)

c_k = Υ_k^T ξ^i + n_2,
c_k = y_k^T (R_y + K^{iT} R_u K^i − P̂^i) y_k + y_{k+1}^T P̂^i y_{k+1} − λ̄^i,
Υ_k = [ 2 y_k ⊗ (u_k − K^i y_k) ; (u_k − K^i y_k) ⊗ (u_k + K^i y_k) ],   ξ^i = [ vec(B^T P^i A) ; vec(B^T P^i B) ],
n_2 = y_{k+1}^T P̂^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | y_k] + y_k^T P^i y_k − y_k^T P̂^i y_k.   (57)

The noise term n_2 reads

n_2 = y_{k+1}^T P̂^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | y_k] + y_k^T (P^i − P̂^i) y_k
= x_k^T L^{iT} (P̂^i − P^i) L^i x_k + 2 x_k^T L^{iT} (P̂^i − P^i) B K^i v_k + 2 x_k^T L^{iT} P̂^i (w_k + v_{k+1}) + v_k^T K^{iT} B^T (P̂^i − P^i) B K^i v_k + 2 v_k^T K^{iT} B^T P̂^i (w_k + v_{k+1}) + x_k^T (P^i − P̂^i) x_k + v_k^T (P^i − P̂^i) v_k + 2 x_k^T (P^i − P̂^i) v_k + (w_k + v_{k+1})^T P̂^i (w_k + v_{k+1}) − λ^i − v_k^T A^T P^i A v_k − 2 v_k^T A^T P^i B K^i v_k − 2 x_k^T L^{iT} P^i A v_k.   (58)

To see if the estimate is biased, we examine E[Υ_k n_2]

E[Υ_k n_2] = [ 0 ; η_k ⊗ η_k ] E[n_2] + [ 2 E[y_k n_2] ⊗ η_k ; 2 η_k ⊗ K^i E[y_k n_2] ].

So, it is enough to analyze E[n_2] and E[y_k n_2]. Remembering the facts that x_k, v_k, v_{k+1}, w_k are mutually independent, W_v is
diagonal, E[x_k] = 0 (because of running Algorithm 3 to collect data), and using (14), we have

E[n_2] = E[x_k^T L^{iT} (P̂^i − P^i) L^i x_k] + E[x_k^T (P^i − P̂^i) x_k] + E[v_k^T K^{iT} B^T (P̂^i − P^i) B K^i v_k] + Tr((P̂^i − P^i)(W_w + W_v)),   (59)

E[y_k n_2] = E[(x_k + v_k) n_2]
= E[(x_k + v_k) x_k^T L^{iT} (P̂^i − P^i) L^i x_k] + 2 E[(x_k + v_k) x_k^T L^{iT} (P̂^i − P^i) B K^i v_k]
+ 2 E[(x_k + v_k) x_k^T L^{iT} P̂^i (w_k + v_{k+1})] + E[(x_k + v_k) v_k^T K^{iT} B^T (P̂^i − P^i) B K^i v_k]
+ 2 E[(x_k + v_k) v_k^T K^{iT} B^T P̂^i (w_k + v_{k+1})] + E[(x_k + v_k)(x_k^T (P^i − P̂^i) x_k + v_k^T (P^i − P̂^i) v_k)]
+ 2 E[(x_k + v_k) x_k^T (P^i − P̂^i) v_k] + E[(x_k + v_k)(w_k + v_{k+1})^T P̂^i (w_k + v_{k+1}) − (x_k + v_k) λ^i]
+ E[(x_k + v_k)(−v_k^T A^T P^i A v_k − 2 v_k^T A^T P^i B K^i v_k)] − 2 E[(x_k + v_k)(x_k^T L^{iT} P^i A v_k)] = 0,   (60)

where each expectation on the right-hand side of (60) is zero. Based on (60), E[y_k n_2] = 0 and only E[n_2] contributes to the bias. If the estimation error of P^i is negligible, P̂^i − P^i ≈ 0, then by (59), E[n_2] = 0 and the estimate is unbiased.

G. Proof of Lemma 6

We repeat the estimation problem in (34)–(36)

r_k − λ^i = Ω_k^T ϑ^i + n_1,
n_1 = y_{k+1}^T P^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | y_k],
ϑ^i = [ vec(P^i) ; vec(B^T P^i A) ; vec(B^T P^i B) ],   Ω_k = [ y_k ⊗ y_k − y_{k+1} ⊗ y_{k+1} ; 2 y_k ⊗ (u_k − K^i y_k) ; (u_k − K^i y_k) ⊗ (u_k + K^i y_k) ],
Θ_k = [ y_k ⊗ y_k ; 2 y_k ⊗ η_k ; η_k ⊗ (η_k + 2 K^i y_k) ].   (61)

We have shown in the proof of Lemma 5 that the noise term can be written as n_1 = ν^T vec(P^i), where ν is given in (53). In the sequel, we study E[Θ_k ν^T]. Since ν is zero mean and η_k is known, we only need to analyze E[(y_k ⊗ y_k) ν^T] and E[y_k ν^T]. The term E[(y_k ⊗ y_k) ν^T] is given in (56). Remembering the facts that x_k, v_k, v_{k+1}, w_k are mutually independent and W_v is diagonal, we compute E[y_k ν^T], where ν is given in (53)

E[y_k ν^T] = E[(x_k + v_k) ν^T]
= −E[x_k (v_k^T ⊗ v_k^T)](A^T ⊗ A^T + 2 (K^{iT} B^T) ⊗ A^T) − 2 E[v_k (v_k^T ⊗ x_k^T)](A^T ⊗ L^{iT}) + E[x_k](vec(W_v)^T)(L^{iT} ⊗ L^{iT}) − E[x_k](vec(W_v)^T)((K^{iT} B^T) ⊗ (K^{iT} B^T)).   (62)

Hence, the bias in the classical off-policy routine is a function of (56) and (62).

H. Proof of Lemma 7

We repeat the estimation problem in (29)–(31)

c_k = Ψ_k^T vec(G^i) + n_3,
c_k = r(y_k, u_k) − λ^i + y_{k+1}^T P̂^i y_{k+1},
Ψ_k = [ y_k^T ⊗ y_k^T   y_k^T ⊗ u_k^T   u_k^T ⊗ y_k^T   u_k^T ⊗ u_k^T ]^T,
n_3 = y_{k+1}^T P̂^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | z_k].   (63)

The noise term n_3 reads

n_3 = y_{k+1}^T P̂^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | z_k]
= x_k^T A^T (P̂^i − P^i) A x_k + 2 x_k^T A^T (P̂^i − P^i) B u_k + u_k^T B^T (P̂^i − P^i) B u_k − 2 v_k^T A^T P^i (A x_k + B u_k) + 2 (x_k^T A^T + u_k^T B^T) P̂^i (w_k + v_{k+1}) + (w_k + v_{k+1})^T P̂^i (w_k + v_{k+1}) − v_k^T A^T P^i A v_k + Tr(A^T P^i A W_v) − Tr(P^i W_w) − Tr(P^i W_v).   (64)

To see if the estimate is biased, we need to analyze E[Ψ_k n_3]. Since u_k is deterministic (we know u_k), we examine i := u_k ⊗ u_k E[n_3], ii := u_k ⊗ E[y_k n_3], and iii := E[y_k ⊗ y_k n_3]. Remembering the facts that x_k, v_k, v_{k+1}, w_k are mutually independent, W_v is diagonal, and E[x_k] = 0 (because of running Algorithm 3 to collect data), we get the following results

i := (u_k ⊗ u_k)( E[x_k^T A^T (P̂^i − P^i) A x_k] + u_k^T B^T (P̂^i − P^i) B u_k + Tr((P̂^i − P^i)(W_w + W_v)) ),   (65)

ii := u_k ⊗ E[(x_k + v_k) n_3] = 2 u_k ⊗ E[x_k x_k^T A^T (P̂^i − P^i) B u_k] − 2 u_k ⊗ E[v_k v_k^T A^T P^i B u_k].   (66)

Next, we analyze iii := E[y_k ⊗ y_k n_3] = E[x_k ⊗ x_k n_3] + 2 E[x_k ⊗ v_k n_3] + E[v_k ⊗ v_k n_3]. First, we obtain E[x_k ⊗ x_k n_3]

E[x_k ⊗ x_k n_3] = E[x_k ⊗ x_k x_k^T A^T (P̂^i − P^i) A x_k] + E[x_k ⊗ x_k] u_k^T B^T (P̂^i − P^i) B u_k + E[x_k ⊗ x_k] Tr((P̂^i − P^i)(W_w + W_v)).   (67)
Second, we obtain 2 E[x_k ⊗ v_k n_3]

2 E[x_k ⊗ v_k n_3] = −4 E[x_k ⊗ v_k v_k^T A^T P^i A x_k].   (68)

Third, we obtain E[v_k ⊗ v_k n_3]

E[v_k ⊗ v_k n_3] = E[v_k ⊗ v_k x_k^T A^T (P̂^i − P^i) A x_k] + E[v_k ⊗ v_k u_k^T B^T (P̂^i − P^i) B u_k] + E[v_k ⊗ v_k] Tr((P̂^i − P^i)(W_v + W_w)) + E[v_k ⊗ v_k] Tr(A^T P^i A W_v) − E[v_k ⊗ v_k v_k^T A^T P^i A v_k].   (69)

As a result, iii is given by the summation of the terms in (67)–(69). Because the terms in (65)–(69) are nonzero, the estimate in (31) is biased.

I. Proof of Lemma 8

We repeat the estimation problem in (38)–(40)

r_k − λ^i = (Ψ_k − Ψ_{k+1})^T vec(G^i) + n_4,
Ψ_k^T = z_k^T ⊗ z_k^T = [ y_k^T ⊗ y_k^T   y_k^T ⊗ u_k^T   u_k^T ⊗ y_k^T   u_k^T ⊗ u_k^T ],
n_4 = z_{k+1}^T G^i z_{k+1} − E[z_{k+1}^T G^i z_{k+1} | z_k] = y_{k+1}^T P^i y_{k+1} − E[y_{k+1}^T P^i y_{k+1} | z_k].   (70)

The noise term n_4 is obtained by setting P̂^i ≡ P^i in (64)

n_4 = −2 v_k^T A^T P^i (A x_k + B u_k) + 2 (x_k^T A^T + u_k^T B^T) P^i (w_k + v_{k+1}) + (w_k + v_{k+1})^T P^i (w_k + v_{k+1}) − v_k^T A^T P^i A v_k + Tr(A^T P^i A W_v) − Tr(P^i W_w) − Tr(P^i W_v).   (71)

We study E[Ψ_k n_4]. Since u_k is deterministic (we know u_k), E[u_k ⊗ u_k n_4] = 0. We examine u_k ⊗ E[y_k n_4] and E[y_k ⊗ y_k n_4]

u_k ⊗ E[y_k n_4] = −2 u_k ⊗ ( E[v_k v_k^T A^T P^i (A x_k + B u_k)] + E[x_k] E[(w_k + v_{k+1})^T P^i (w_k + v_{k+1})] − E[x_k] E[v_k^T A^T P^i A v_k] + E[x_k] Tr(A^T P^i A W_v) − E[x_k](Tr(P^i W_w) + Tr(P^i W_v)) )
= −2 u_k ⊗ E[v_k v_k^T A^T P^i (A x_k + B u_k)],   (72)

E[y_k ⊗ y_k n_4] = −4 E[x_k ⊗ v_k v_k^T A^T P^i (A x_k + B u_k)] + E[v_k ⊗ v_k] Tr(A^T P^i A W_v) − E[v_k ⊗ v_k v_k^T A^T P^i A v_k].   (73)

Since (72) and (73) are nonzero, the estimate is biased.

REFERENCES

[1] V. Mnih et al., “Playing Atari with deep reinforcement learning,” 2013, arXiv:1312.5602.
[2] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1329–1338. [Online]. Available: http://arxiv.org/abs/1604.06778
[3] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in Proc. Adv. Neural Inf. Process. Syst., 2007, vol. 19, pp. 1–8.
[4] M. Krstic et al., Nonlinear and Adaptive Control Design, vol. 222. New York, NY, USA: Wiley, 1995.
[5] K. J. Åström and B. Wittenmark, Adaptive Control, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1994.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[7] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.
[8] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Syst., vol. 32, no. 6, pp. 76–105, Dec. 2012.
[9] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Jul.–Sep. 2009.
[10] F. A. Yaghmaie and D. J. Braun, “Reinforcement learning for a class of continuous-time input constrained optimal control problems,” Automatica, vol. 99, pp. 221–227, 2019.
[11] T. Bian, Y. Jiang, and Z.-P. Jiang, “Adaptive dynamic programming for stochastic systems with state and control dependent noise,” IEEE Trans. Autom. Control, vol. 61, no. 12, pp. 4170–4175, Dec. 2016.
[12] B. Kiumarsi, F. L. Lewis, and Z.-P. Jiang, “H∞ control of linear discrete-time systems: Off-policy reinforcement learning,” Automatica, vol. 78, pp. 144–152, 2017.
[13] H. Modares and F. L. Lewis, “Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning,” IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 3051–3056, Nov. 2014.
[14] F. A. Yaghmaie, S. Gunnarsson, and F. L. Lewis, “Output regulation of unknown linear systems using average cost reinforcement learning,” Automatica, vol. 110, 2019, Art. no. 108549.
[15] F. Adib Yaghmaie, F. L. Lewis, and R. Su, “Output regulation of heterogeneous linear multi-agent systems with differential graphical game,” Int. J. Robust Nonlinear Control, vol. 26, no. 10, pp. 2256–2278, 2016.
[16] F. A. Yaghmaie, K. Hengster Movric, F. L. Lewis, and R. Su, “Differential graphical games for H∞ control of linear heterogeneous multiagent systems,” Int. J. Robust Nonlinear Control, vol. 29, no. 10, pp. 2995–3013, 2019.
[17] D. P. Bertsekas, Reinforcement Learning and Optimal Control. Belmont, MA, USA: Athena Scientific, 2019.
[18] R. E. Bellman and S. E. Dreyfus, Applied Dynamic Programming. Princeton, NJ, USA: Princeton Univ. Press, 1962.
[19] M. G. Lagoudakis and R. Parr, “Least-squares policy iteration,” J. Mach. Learn. Res., vol. 4, no. 6, pp. 1107–1149, 2003.
[20] S. J. Bradtke and A. G. Barto, “Linear least-squares algorithms for temporal difference learning,” Mach. Learn., vol. 22, no. 1–3, pp. 33–57, 2004.
[21] B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annu. Rev. Control, Robot., Auton. Syst., vol. 2, pp. 253–279, 2018.
[22] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. 31st Int. Conf. Mach. Learn., 2014, vol. 1, pp. 605–619.
[23] L. Ljung, System Identification—Theory for the User (Information and System Sciences), 2nd ed. Princeton, NJ, USA: Prentice-Hall, 1999.
[24] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification. Cambridge, MA, USA: MIT Press, 1987.
[25] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvari, “Model-free linear quadratic control via reduction to expert prediction,” in Proc. 22nd Int. Conf. Artif. Intell. Statist., 2019, pp. 3108–3117.
[26] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” Found. Comput. Math., vol. 20, no. 4, pp. 633–679, 2019.
[27] S. Tu and B. Recht, “Least-squares temporal difference learning for the linear quadratic regulator,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 5005–5014.
[28] N. Matni, A. Proutiere, A. Rantzer, and S. Tu, “From self-tuning regulators to reinforcement learning and back again,” in Proc. IEEE 58th Conf. Decis. Control, 2019, pp. 3724–3740.
[29] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1467–1476.
[30] A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar, “Online linear quadratic control,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1029–1038.
[31] S. Tu and B. Recht, “The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint,” in Proc. Conf. Learn. Theory, 2019, pp. 3036–3083. [Online]. Available: http://arxiv.org/abs/1812.03565
[32] M. Ferizbegovic, J. Umenberger, H. Hjalmarsson, and T. B. Schon, “Learning robust LQ-controllers using application oriented exploration,” IEEE Control Syst. Lett., vol. 4, no. 1, pp. 19–24, Jan. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/8732482/
[33] F. A. Yaghmaie and F. Gustafsson, “Using reinforcement learning for model-free linear quadratic Gaussian control with process and measurement noises,” in Proc. IEEE Conf. Decis. Control, 2019, pp. 6510–6517.
[34] M. Simchowitz, K. Singh, and E. Hazan, “Improper learning for non-stochastic control,” in Proc. Conf. Learn. Theory, 2020, pp. 3320–3436. [Online]. Available: http://arxiv.org/abs/2001.09254
[35] S. Lale, K. Azizzadenesheli, B. Hassibi, and A. Anandkumar, “Logarithmic regret bound in partially observable linear dynamical systems,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 20876–20888.
[36] F. L. Lewis and K. G. Vamvoudakis, “Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data,” IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 41, no. 1, pp. 14–25, Feb. 2011.
[37] H. Yu and D. P. Bertsekas, “Convergence results for some temporal difference methods based on least squares,” IEEE Trans. Autom. Control, vol. 54, no. 7, pp. 1515–1531, Jul. 2009.
[38] T. Glad and L. Ljung, Control Theory—Multivariable and Nonlinear Methods. New York, NY, USA: Taylor and Francis, 2000.
[39] G. A. Hewer, “An iterative technique for the computation of the steady state gains for the discrete optimal regulator,” IEEE Trans. Autom. Control, vol. AC-16, no. 4, pp. 382–384, Aug. 1971.
[40] L. A. Prashanth, N. Korda, and R. Munos, “Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discov. Databases, 2014, pp. 66–81.
[41] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD, USA: The Johns Hopkins Univ. Press, 2013.
[42] The MathWorks, MATLAB System Identification Toolbox (R2018a), Portola Valley, CA, USA, 2018.

Farnaz Adib Yaghmaie received the B.E. and M.E. degrees in electrical engineering, control, from the K. N. Toosi University of Technology, Tehran, Iran, in 2009 and 2011, respectively, and the Ph.D. degree in electrical and electronic engineering, control, from Nanyang Technological University (EEE-NTU), Singapore, in 2017.
She is currently an Assistant Professor with the Department of Electrical Engineering, Linköping University, Linköping, Sweden. Her current research interest is reinforcement learning for controlling dynamical systems.
Dr. Yaghmaie is the recipient of the Best Thesis Award from EEE-NTU among 160 Ph.D. students.

Fredrik Gustafsson (Fellow, IEEE) received the M.Sc. degree in electrical engineering and the Ph.D. degree in automatic control from Linköping University, Linköping, Sweden, in 1988 and 1992, respectively.
He has been a Professor in Sensor Informatics with the Department of Electrical Engineering, Linköping University, since 2005. His research interests are in stochastic signal processing, adaptive filtering, and change detection, with applications to communication, vehicular, airborne, and audio systems. He is a Co-Founder of the companies NIRA Dynamics (automotive safety systems), Softube (audio effects), and Senion (indoor positioning systems).
Dr. Gustafsson was an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING in 2000–2006, the IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS in 2010–2012, and the EURASIP Journal on Applied Signal Processing in 2007–2012. He was the recipient of the Arnberg Prize from the Royal Swedish Academy of Sciences (KVA) in 2004, was elected a member of the Royal Swedish Academy of Engineering Sciences (IVA) in 2007, and was elevated to IEEE Fellow in 2011. He was also the recipient of the Harry Rowe Mimno Award 2011 for the tutorial “Particle Filter Theory and Practice with Positioning Applications,” published in the AESS Magazine in July 2010, and was a coauthor of “Smoothed state estimates under abrupt changes using sum-of-norms regularization,” which received the Automatica paper prize in 2014.

Lennart Ljung (Fellow, IEEE) received the Ph.D. degree in automatic control from the Lund Institute of Technology, Lund, Sweden, in 1974.
Since 1976, he has been Professor of the Chair of Automatic Control in Linköping, Sweden. He has held visiting positions with Stanford and MIT and has written several books on System Identification and Estimation.
Dr. Ljung is currently an IFAC Fellow and an IFAC Advisor. He is a member of the Royal Swedish Academy of Sciences (KVA), a member of the Royal Swedish Academy of Engineering Sciences (IVA), an Honorary Member of the Hungarian Academy of Engineering, an Honorary Professor of the Chinese Academy of Mathematics and Systems Science, and a Foreign Member of the U.S. National Academy of Engineering (NAE). He was the recipient of honorary doctorates from the Baltic State Technical University in St. Petersburg, from Uppsala University, Uppsala, Sweden, from the Technical University of Troyes, Troyes, France, from the Catholic University of Leuven, Leuven, Belgium, and from the Helsinki University of Technology, Espoo, Finland. He was also the recipient of both the Quazza Medal (2002) and the Nichols Medal (2017) from IFAC, the Hendrik W. Bode Lecture Prize from the IEEE Control Systems Society in 2003, the IEEE Control Systems Award in 2007, and the Great Gold Medal from the Royal Swedish Academy of Engineering in 2018.