0% found this document useful (0 votes)
5 views13 pages

Tracking Control of Completely Unknown Continuous-Time Systems Via Off-Policy Reinforcement Learning

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views13 pages

Tracking Control of Completely Unknown Continuous-Time Systems Via Off-Policy Reinforcement Learning

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

2550 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO.

10, OCTOBER 2015

H ∞ Tracking Control of Completely Unknown


Continuous-Time Systems via Off-Policy
Reinforcement Learning
Hamidreza Modares, Frank L. Lewis, Fellow, IEEE, and Zhong-Ping Jiang, Fellow, IEEE

Abstract— This paper deals with the design of an H ∞ tracking the perfect tracking. Second, a feedback control input is
controller for nonlinear continuous-time systems with completely designed by solving a Hamilton–Jacobi–Isaacs (HJI) equation
unknown dynamics. A general bounded L 2 -gain tracking to stabilize the tracking error dynamics. These methods are
problem with a discounted performance function is introduced
for the H ∞ tracking. A tracking Hamilton–Jacobi–Isaac (HJI) suboptimal as they ignore the cost of the feedforward control
equation is then developed that gives a Nash equilibrium solution input in the performance function. Moreover, in these methods,
to the associated min–max optimization problem. A rigorous procedures for computing the feedback and feedforward terms
analysis of bounded L 2 -gain and stability of the control solution are based on the offline solution methods that require complete
obtained by solving the tracking HJI equation is provided. knowledge of the system dynamics.
An upper-bound is found for the discount factor to assure local
asymptotic stability of the tracking error dynamics. An off-policy During the last few years, reinforcement learning (RL)
reinforcement learning algorithm is used to learn the solution [10]–[13] has been extensively used to solve the optimal
to the tracking HJI equation online without requiring any H2 [14]–[25] and H∞ [26]–[37] regulation problems, and
knowledge of the system dynamics. Convergence of the proposed has been successfully applied to several real-world applica-
algorithm to the solution to the tracking HJI equation is shown. tions [38]–[43]. Offline iterative RL algorithms [26], [27],
Simulation examples are provided to verify the effectiveness of
the proposed method. online synchronous RL algorithms [28]–[31], and
simultaneous RL algorithms [32]–[34] were proposed to
Index Terms— Bounded L 2 -gain, H∞ tracking controller, approximate the solution to the HJI equation arising in
reinforcement learning (RL), tracking Hamilton–Jacobi–
Isaac (HJI) equation. the H∞ regulation problem. These mentioned methods
require complete knowledge of the system dynamics.
Vrabie and Lewis [35] and Li et al. [36] used an
I. I NTRODUCTION integral RL (IRL) algorithm [15], [16] to learn the solution
to the HJI equation for systems with unknown dynamics.
T HE H∞ optimal control has been extensively used in the
effort to attenuate the effect of disturbances on the system
performance. The H∞ control theory has mostly concentrated
Although efficient, these methods require the disturbance to
be adjustable. However, this is not practical in most systems,
on designing regulators to drive the states of the system because the disturbance is independent and cannot be
to zero in the presence of disturbance [1]–[5]. In practice, specified. Luo et al. [37], inspired by [21] and [22], proposed
however, it is often required to force the states or outputs of an efficient off-policy RL algorithm to learn the solution to the
the system to track a reference trajectory. Existing solutions to HJI equation. In the off-policy RL algorithm, the system data,
the H∞ tracking problem are composed of two steps [6]–[9]. which are used to learn the HJI solution, can be generated
First, a feedforward control input is designed to guarantee with arbitrary policies rather than the evaluating policy.
Their method does not require an adjustable disturbance
Manuscript received October 27, 2014; revised April 25, 2015 and input. However, it requires partial knowledge of the system
May 17, 2015; accepted May 31, 2015. Date of publication June 24, 2015; dynamics.
date of current version September 16, 2015. This work was supported in
part by the National Science Foundation (NSF) under Grant ECCS-1405173 While significant progress has been achieved by the use of
and Grant IIS-1208623, in part by the Office of Naval Research, Arlington, RL algorithms for the design of the H∞ optimal controllers,
VA, USA, under Grant N00014-13-1-0562 and Grant N000141410718, and in and these algorithms are limited to the case of the regulation
part by the U.S. Army Research Office under Grant W911NF-11-D-0001. The
work of Z.-P. Jiang was supported by NSF under Grant ECCS-1101401 and problem. In practice, however, it is desired to make the system
Grant ECCS-1230040. to follow a reference trajectory. Therefore, the H∞ optimal
H. Modares is with the University of Texas at Arlington Research Institute, tracking controllers are required. Although the RL algorithms
Fort Worth, TX 76118 USA (e-mail: [email protected]).
F. L. Lewis is with the University of Texas at Arlington Research Institute, have been recently presented for solving H2 optimal
Fort Worth, TX 76118 USA, and also with the State Key Laboratory tracking [44]–[50], only Liu et al. [51] proposed an
of Synthetical Automation for Process Industries, Northeastern University, RL solution to the H∞ tracking. However, their solution is
Shenyang 110004, China (e-mail: [email protected]).
Z.-P. Jiang is with the Department of Electrical and Computer Engineering suboptimal as the cost of the feedforward control input is
with the Polytechnic School of Engineering, New York University, NY 11201 ignored in the performance function, and it requires complete
USA (e-mail: [email protected]). knowledge of the system dynamics.
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. In this paper, an online off-policy RL algorithm is developed
Digital Object Identifier 10.1109/TNNLS.2015.2441749 to find the solution to the H∞ optimal tracking problem of
2162-237X © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
MODARES et al.: H∞ TRACKING CONTROL OF COMPLETELY UNKNOWN CONTINUOUS-TIME SYSTEMS 2551

nonlinear completely unknown systems. It is not required


that the disturbance be adjustable. An augmented system
is constructed from the tracking error dynamics and the
command generator dynamics, and a new discounted perfor-
mance function is introduced for the H∞ optimal tracking
problem. This allows developing a more general version of
the L 2 -gain, where the whole control input and the tracking
error energies are weighted by an exponential discount factor
in the performance function. This is in contrast to the existing
methods that include only the cost of the feedback part of
the control input in the performance function. A tracking Fig. 1. State-feedback H∞ tracking control configuration.
HJI equation associated with the discounted performance
function is derived, which gives both the feedforward and Fig. 1 shows the system dynamics (1) and its inputs and
feedback parts of the control input simultaneously. Stability outputs. The goal of the H∞ tracking is to attenuate the effect
and L 2 -gain boundness of the solution to the tracking of the disturbance input d on the performance output z. Before
HJI equation are discussed. An upper-bound is obtained for defining the H∞ tracking control problem, we define the
the discount factor to assure local asymptotic stability of following general L 2 -gain or disturbance attenuation
the tracking error dynamics. An off-policy RL algorithm condition.
is then developed to find the solution to the tracking Definition 1 (Bounded L 2 -Gain or Disturbance
HJI equation online using only the measured data and without Attenuation): The nonlinear system (1) is said to have
any knowledge about the system dynamics. Convergence of L 2 -gain less than or equal to γ if the following disturbance
this algorithm to the solution to the tracking HJI equation attenuation condition is satisfied for all d ∈ L 2 [0, ∞) :
 ∞
is shown.
e−α(τ −t ) z(τ )2 dτ
II. H∞ T RACKING P ROBLEM  ∞
t
≤ γ2 (6)
−α(τ −t )
e d(τ ) dτ
2
In this section, a new formulation for the H∞ tracking t
is presented. A general L 2 -gain condition is defined. In this where α > 0 is the discount factor and γ represents the amount
L 2 -gain condition, a discounted performance index is used, of attenuation from the disturbance input d(t) to the defined
which penalizes both tracking error and control effort. performance output variable z(t).
A solution to this problem is presented in the subsequent Remark 1: The disturbance attenuation condition (6) implies
sections III and IV. that the effect of the disturbance input to the desired per-
Consider the affine nonlinear system defined as formance output is attenuated by a degree at least equal
ẋ = f (x) + g(x)u + k(x)d (1) to γ. The minimum value of γ for which the disturbance
attenuation condition (6) is satisfied gives the so-called optimal
where x ∈ Rn is the state, u = [u 1 , . . . , u m ] ∈ Rm is the robust control solution. However, there exists no way to
control input, d = [d1, . . . , dq ] ∈ Rq denotes the external find the smallest amount of the disturbance attenuation for
disturbance, f (x) ∈ Rn is the drift dynamics, g(x) ∈ Rn×m general nonlinear systems, and a large enough value is usually
is the input dynamics, and k(x) ∈ Rn×q is the disturbance predetermined for γ [3].
dynamics. It is assumed that f (x), g(x), and k(x) are unknown Using (5) in (6), one has
Lipchitz functions with f (0) = 0, and that the system (1) is  ∞  ∞
 
robustly stabilizing. e−α(τ −t ) edT Qed +u T Ru dτ ≤ γ 2 e−α(τ −t )(d T d)dτ.
Assumption 1: Let r (t) be the bounded reference trajectory, t t
and assume that there exists a Lipschitz continuous command (7)
generator function h d (.) ∈ Rn with h d (0) = 0 such that Definition 2 (H∞ Tracking): The H∞ tracking control
ṙ (t) = h d (r (t)). (2) problem is to find a control policy u = β(e, r ) for some
smooth function β depending on the tracking error e and the
Define the tracking error reference trajectory r , such that the following holds.
ed  x(t) − r (t). (3) 1) The closed-loop system ẋ = f (x)+g(x) β(e, r )+k(x) d
satisfies the attenuation condition (7).
Using (1)–(3), the tracking error dynamics is given by 2) The tracking error dynamics (4) with d = 0 is locally
asymptotically stable.
ėd (t) = f (x(t)) + g(x(t)) u(t) + k(x(t)) d(t) − h d (r (t)).
The main difference between Definition 2 and the
(4) standard definition of H∞ tracking control problem
The fictitious performance output to be controlled is defined, (see [6, Definition 5.2.1]) is that a more general disturbance
such that it satisfies attenuation condition is defined here. Since the whole
control input and the tracking error are penalized in the
z(t)2 = edT Q ed + u T R u. (5) disturbance attenuation condition (7), the problem formulated

Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
2552 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 10, OCTOBER 2015

in Definition 1 gives an optimal solution, in contrast to the Using the augmented system (9), the disturbance attenuation
standard definition that results in a suboptimal solution, condition (7) becomes
as stated in [6].  ∞  ∞
−α(τ −t )
Remark 2: The performance function in the left-hand side e (X Q T X +u Ru)dτ ≤ γ
T T 2
e−α(τ −t )(d T d)dτ
t t
of the disturbance attenuation condition (7) represents a (11)
meaningful cost in the sense that it includes a positive penalty
on the tracking error and a positive penalty on the control where
 
effort. The use of the discount factor is essential. This is Q 0
because the feedforward part of the control input does not QT = . (12)
0 0
converge to zero in general, and thus, penalizing the control
Based on (11), define the performance function
input in the performance function without a discount factor  ∞
makes the performance function unbounded. J (u, d) = e−α(τ −t )(X T Q T X + u T R u − γ 2 d T d) dτ.
Remark 3: Previous work on the H∞ optimal tracking t
divides the control input into feedback and feedforward parts. (13)
First, the feedforward part is obtained separately without Remark 4: Note that the problem of finding a control
considering any optimality criterion. Then, the problem of policy that satisfies bounded L 2 -gain condition for the
optimal design of the feedback part is reduced to an optimal tracking problem is equivalent to minimizing
H∞ optimal regulation problem. In contrast, in the new the discounted performance function (13) subject to the
formulation, both the feedback and feedforward parts of the augmented system (9).
control input are obtained simultaneously and optimally as a It is well-known that the H∞ control problem is closely
result of the defined L 2 -gain with a discount factor in (7). related to the two-player zero-sum differential game theory [5].
The control solution to the H∞ tracking problem with In fact, solvability of the H∞ control problem is equivalent to
the proposed attenuation condition (7) is provided in the solvability of the following zero-sum game [5]:
subsequent sections III and IV. We shall see in the subsequent
sections that this general disturbance attenuation condition V ∗ (X (t)) = J (u ∗ , d ∗ ) = min max J (u, d) (14)
u d
enables us to find both the feedback and feedforward parts
of the control input simultaneously, and therefore extends the where J is defined in (13) and V ∗ (X (t)) is defined as
method of off-policy RL for solving the problem in hand the optimal value function. This two-player zero-sum game
without requiring any knowledge of the system dynamics. control problem has a unique solution if a game theoretic
saddle point exists, i.e., if the following Nash condition holds:
III. HJI E QUATION FOR H∞ T RACKING V ∗ (X (t)) = min max J (u, d) = max min J (u, d). (15)
u d d u
In this section, it is first shown that the problem of
solving the H∞ tracking problem can be transformed into Note that differentiating (13) and noting that V (X (t)) =
a min–max optimization problem subject to an augmented J (u(t), d(t)) give the following Bellman equation:
system composed of the tracking error dynamics and the 
H (V, u, d) = X T Q T X + u T R u − γ 2 d T d − αV
command generator dynamics. A tracking HJI equation is
then developed, which gives the solution to the min–max + V XT (F + G u + K d) = 0 (16)
optimization problem. The stability and L 2 -gain boundedness   
where F(X) = F, G = G(X), K = K (X), and V X = ∂ V /∂ X.
of the tracking HJI control solution are discussed.
Applying stationarity conditions ∂ H (V ∗ , u, d)/∂u = 0 and
∂ H (V ∗ , u, d)/∂d = 0 [52] give the optimal control and
A. Tracking HJI Equation disturbance inputs as
In this section, a tracking HJI equation is formulated, 1
which gives the solution to the H∞ tracking problem stated u ∗ = − R −1 G T V X∗ (17)
2
in Definition 2. 1
Define the augmented system state d∗ = K T V X∗ (18)
2γ 2
X (t) = [ed (t)T r (t)T ]T ∈ R2n (8) where V ∗ is the optimal value function defined in (14).
where ed (t) is the tracking error defined in (3) and r (t) is the Substituting the control input u (17) and the disturbance d (18)
reference trajectory. into (16), the following tracking HJI equation is obtained:
Putting (2) and (4) together yields the augmented system 
H (V ∗ , u ∗ , d ∗ ) = X T Q T X + V X∗T F − αV X
Ẋ(t) = F(X (t)) + G(X (t)) u(t) + K (X (t)) d(t) (9) 1
− V X∗T G T R −1 G V X∗
where u(t) = u(X (t)) and 4
    1
+ 2 V X∗T K K T V X∗ = 0. (19)
f (ed + r ) − h d (r ) g(ed + r )
F(X) = , G(X) = 4γ
h d (r ) 0
  In the following, it is shown that the control solution (17),
k(ed + r ) which is found by solving the HJI equation (19), solves the
K (X) = . (10)
0 H∞ tracking problem formulated in Definition 2.
Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
MODARES et al.: H∞ TRACKING CONTROL OF COMPLETELY UNKNOWN CONTINUOUS-TIME SYSTEMS 2553

B. Disturbance Attenuation and Stability of the for every T > 0 and every d ∈ L 2 [0, ∞). Since V ∗ (.) ≥ 0
Solution to the Tracking HJI Equation the above equation yields
 T
In this section, first, it is shown that the control solution (17)
e−ατ (X T Q T X + u ∗T Ru ∗ )dτ
satisfies the disturbance attenuation condition (11) 0  T
[part 1) of Definition 2]. Then, the stability of the tracking
≤ e−ατ (γ 2 d T d)dτ + V ∗ (X (0)). (26)
error dynamics (4) without the disturbance is discussed 0
[part 2) of Definition 2]. It is shown that there exists an upper This completes the proof. 
bound α ∗ such that if the discount factor is <α ∗ , the control Theorem 2 solves part 1) of the state-feedback H∞ tracking
solution (17) makes the system locally asymptotically stable. control problem given in Definition 2. In the following,
Theorem 1 (Saddle Point Solution): Consider the we consider the problem of stability of the closed-loop system
H∞ tracking control problem as a two-player zero-sum without disturbance, which is part 2) of Definition 2.
game problem with the performance function (13). Then, the Theorem 3 (Stability of the Optimal Solution for α → 0):
pair of strategies (u ∗ , d ∗ ) defined in (17) and (18) provides a Suppose that V ∗ (X) is a smooth positive-semidefinite and
saddle point solution to the game. locally quadratic solution to the tracking HJI equation. Then,
Proof: See [26] for the same proof.  the control input given by (17) makes the error dynamics (4)
Theorem 2 (L 2 -Gain of System for the Solution to the with d = 0 asymptotically stable in the limit as the discount
HJI Equation): Assume that there exists a continuous positive- factor goes to zero.
semidefinite solution V ∗ (X) to the tracking HJI equation (19). Proof: Differentiating V ∗ along the trajectories of the
Then, u ∗ in (17) makes the closed-loop system (9) to have closed-loop system with d = 0, and using the tracking
L 2 -gain less than or equal to γ. HJI equation give
Proof: The Hamiltonian (16) for the optimal value
function V ∗ , and any control policy u and disturbance policy w V X∗T (F + G u ∗ ) = αV ∗ − X T Q T X − u ∗T R u ∗ + γ 2 d T d
become (27)
or equivalently
H (V ∗ , u, d) = X T Q T X + u T R u − γ 2 d T d − αV ∗
d −αt ∗
+ V X∗T (F + G u + K d). (20) (e V (X)) = e−αt (−X T Q T X − u ∗T Ru ∗ +γ 2 d T d) ≤ 0.
dt
(28)
On the other hand, using (17)–(19), one has
If the discount factor goes to zero, then LaSalle’s extension can
H (V ∗ , u, d) = H (V ∗ , u ∗ , d ∗ ) + (u − u ∗ )T R (u − u ∗ ) be used to show that the tracking error is locally asymptotically
+ γ 2 (d − d ∗ )T (d − d ∗ ). (21) stable. More specifically, if α → 0, based on LaSalle’s
extension, X (t) = [ed (t)T r (t)T ]T goes to a region wherein
Based on the HJI equation (19), we have H (V ∗ , u ∗ , d ∗ ) = 0. V̇ = 0. Since X T Q T X = ed (t)T Q ed (t), where Q is the
Therefore, (20) and (21) give positive definite, V̇ = 0 only if ed (t) = 0, and u = 0
when d = 0. On the other hand, u = 0 also requires that
X T Q T X + u T Ru − γ 2 d T d − αV ∗ + V X∗T (F + G u + K d) ed (t) = 0; therefore, for γ = 0, the tracking error is locally
= −(u − u ∗ )T R (u − u ∗ ) − γ 2 (d − d ∗ )T (d − d ∗ ). (22) asymptotically stable. 
Theorem 3 shows that if the discount factor goes to zero,
Substituting the optimal control policy u = u ∗ in the above then optimal control solution found by solving the tracking
equation yields HJI equation makes the system locally asymptotically stable.
However, if the discount factor is nonzero, the local asymptotic
X T Q T X + u ∗T Ru ∗ − γ 2 d T d − αV ∗ stability of the optimal control solution cannot be guaranteed
+V X∗T (F + Gu ∗ + K d) = −γ 2 (d − d ∗ )T (d − d ∗ ) ≤ 0. by Theorem 3. In Theorem 4, it is shown that the local
asymptotic stability of the optimal solution is guaranteed as
(23)
long as the discount factor is smaller than an upper bound.
Multiplying both the sides of this equation by e−αt , and Before presenting the proof of local asymptotic stability, the
defining V̇ ∗ = V X∗T (F + G u ∗ + K d) as the derivative of V ∗ following example shows that if the discount factor is not
along the trajectories of the closed-loop system, it gives small, the control solution obtained by solving the tracking
HJI equation can make the system unstable.
d −αt ∗ Example 1: Consider the scalar dynamical system
(e V (X)) ≤ e−αt (−X T Q T X − u ∗T R u ∗ + γ 2 d T d).
dt Ẋ = X + u + d. (29)
(24)
Assume that in the HJI equation (19), we have Q T = R = 1
Integrating from both the sides of this equation yields and the attenuation level is γ = 1. For this linear system with
quadratic performance, the value function is quadratic. That
e−αT V ∗ (X (T )) − V ∗ (X (0)) is, V (X) = p X 2 , and therefore, the HJI equation reduces to
 T
≤ e−ατ (−X T Q T X − u ∗T Ru ∗ + γ 2 d T d)dτ (25) 3
(2 − α) p − p2 + 1 = 0 (30)
0 4
Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
2554 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 10, OCTOBER 2015

and the optimal control solution becomes Equation (42) gives the augmented system dynamics (9),
and (43) is equivalent to the HJI equation (19) with μ = V X∗ .
u = − p X. (31)
In order to prove the local stability of the closed-loop system,
Solving this equation gives the optimal solution as the stability of the closed-loop linearized system is investi-
  gated. Using (33) for the system dynamics, (41) becomes
4 2 4
u = − (1 − 0.5α) + √ (1 − 0.5α)2 + 1 X. (32)
3 3 3 H m = (X T Q T X + u ∗T R u ∗ − γ 2 d ∗T d ∗ )
However, this optimal solution does not make the system + μT (AX + Bu ∗ + Dd ∗ + F̄(X)). (44)
stable for all values of the discount factor α. If fact, if
Then, the costate can be written as the sum of a linear and a
α > α ∗ = 27/12, then the system is unstable. The next
nonlinear term as
theorem shows how to find an upper bound α ∗ for the discount
factor to assure the stability of the system without disturbance. μ = 2P X + ϕ0 (X) ≡ μ1 + ϕ0 (X). (45)
Before presenting the stability theorem, note that the
augmented system dynamics (9) can be written as Using ∂ H m /∂u = 0, ∂ H m /∂d = 0 and (45), one has

Ẋ = F(X) + G(X)u + K (X)d = AX + Bu + Dd + F̄(X) u ∗ = −R −1 B T P X + ϕ1 (X) (46)


1
(33) d ∗ = 2 D T P X + ϕ2 (X) (47)
γ
where AX + Bu + K d is the linearized model with
  for some ϕ1 (X) and ϕ2 (X) depending on ϕ0 (X), F̄(X), and P.
Al1 Al1 − Al2 T T
A= , B = BlT 0T , D = DlT 0T Using (44)–(47), conditions (42) and (43) become
0 Al2 ⎡  ⎤
(34)   −1 B T − 1 D D T
 
Ẋ ⎣ A − B R ⎦ X
= γ 2
where Al1 and Al2 are the linearized models of the drift μ̇1 μ 1
−Q T −A T + α I
system dynamics f and the command generator dynamics h d ,      
F1 (X)  X F1 (X)
respectively, and F̄(X) is the remaining nonlinear term. + =W + (48)
Theorem 4 (Stability of the Optimal Solution and Upper F2 (X) μ1 F2 (X)
Bound for α): Consider the system (9). Define for some nonlinear functions F1 (X) and F2 (X). The linear
1 part of costate is a stable manifold of W , and thus, based on
L l = Bl R −1 BlT + 2 Dl DlT (35) the linear part of (48), it satisfies the following game algebraic
γ
Riccati equation (GARE):
where Bl and Dl are defined in (34). Then, the control
solution (17) makes the error system (4) with d = 0 locally Q T + A T P + P A − α P − P B R −1 B T P +
1
P D D T P = 0.
asymptotically stable if γ2
(49)
α ≤ α ∗ = 2(L l Q)1/2 . (36)
Proof: Given the augmented dynamics (9) and the Define
 
performance function (13), the Hamiltonian function in terms P11 P12
P= .
of the optimal control and disturbance is defined as [52] P12 P22
H (ρ, u ∗ , d ∗ ) = e−αt (X T Q T X + u ∗T Ru ∗ − γ 2 d ∗T d ∗ ) Then, based on (12) and (34), the upper left-hand side of the
GARE (49) becomes
+ ρ T (F + G u ∗ + K d ∗ ) (37)
where ρ is known as the costate variable. Using Pontryagin’s Q + Al1
T
P11 + P11 Al1 − α P11 − P11 Bl R −1 BlT P11
maximum principle, the optimal solutions u ∗ and d ∗ satisfy 1
+ 2 P11 Dl DlT P11 = 0. (50)
the following state and costate equations: γ
Ẋ = Hρ (X, ρ) (38) The closed-loop system dynamics for the control input (46)
without the disturbance is
ρ̇ = −H X (X, ρ). (39)
Define the new variable Ẋ = (A − B R −1 B T P)X + F f (X) (51)

μ = eαt ρ. (40) for some nonlinear function F f (X) with F f = [F Tf1 , F Tf2 ]T ,
which gives the following tracking error dynamics:
Based on (40), define the modified Hamiltonian function as  
ėd = Al1 − Bl R −1 BlT P11 ed + F f 1 = Ac ed + F f 1 . (52)
H m = e−αt H = (X T Q T X + u ∗T Ru ∗ − γ 2 d ∗T d ∗ )
+ μT (F + G u ∗ + K d ∗ ). (41) The GARE (50) based on the closed-loop error dynamics Ac
becomes
Then, conditions (38) and (39) become
Q + AcT P11 + P11 Ac − α P11 + P11 Bl R −1 BlT P11
Ẋ = Hμm (X, μ) (42) 1
μ̇ = α μ − H Xm (X, μ). (43) + 2 P11 Dl DlT P11 = 0. (53)
γ
Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
MODARES et al.: H∞ TRACKING CONTROL OF COMPLETELY UNKNOWN CONTINUOUS-TIME SYSTEMS 2555

To find a condition on the discount factor to assure stability of Algorithm 1 Offline RL Algorithm
the linearized error dynamics, assume that λ is an eigenvalue Initialization: Start with an admissible stabilizing control
of the closed-loop error dynamics Ac . That is, Ac x = λx policy u 0
with x, the eigenvector corresponding to λ. Then, multiplying 1) For a control input u i and a disturbance policy di , find
the left- and right-hand sides of the GARE (53) by x T and x, Vi using the following Bellman equation
respectively, one has
H (Vi , u i , di ) = X T Q T X + V XTi (F + G u i + K di )
T
2 (Re(λ) − 0.5α)x P11 x
  − αVi + u iT R u i − γ 2 diT di = 0 (58)
= −x T Q x − x T P11 Bl R −1 BlT + D D T P11 x. (54)
2) Update the disturbance using
Using the inequality a 2 + b2 ≥ 2ab and since P11 > 0,
1
(54) becomes di+1 = arg max[H (Vi , u i , d)] = K T VX i (59)
   2γ 2
−1 1/2 
d
(Re(λ) − 0.5α) ≤ − Q P11 (L l P11 )1/2  (55)
and the control policy using
or equivalently 1
 1/2  u i+1 = arg min[H (Vi , u, d)] = − R −1 G T V X i (60)
Re(λ) ≤ − QP−1
11
(L l P11 )1/2  + 0.5α (56) u 2
3) Go to 1.
where L l is defined in (35). Using the fact that AB ≥
AB gives
Re(λ) ≤ −(LQ)1/2  + 0.5α. (57) In fact, one can always pick a very small discount factor,
and/or a large weighting matrix Q (which is a design matrix)
Therefore, the linear error dynamics in (52) is stable if to assure that condition (36) is satisfied.
condition (36) is satisfied, and this completes the proof. 
Remark 5: Note that the GARE (49) can be written as IV. O FF -P OLICY IRL FOR L EARNING THE
Q T + (A − 0.5α I )T P + P(A − 0.5α I ) T RACKING HJI E QUATION
1 In this section, an offline RL algorithm is first given
−P B R −1 B T P + P D D T P = 0.
γ2 to solve the problem of H∞ optimal tracking by learning
the solution to the tracking HJI equation. An off-policy
This amounts to a GARE without the discount factor and
IRL algorithm is then developed to learn the solution to the
with the system dynamics given by A − 0.5α I , B, and D.
HJI equation online and without requiring any knowledge of
Therefore, existence of a unique solution to the GARE (30)
the system dynamics. Three neural networks (NNs) on an
requires (A−0.5α I, B) be stabilizable. Based on the definition
actor–critic–disturbance structure are used to implement the
of A and B in (34), this requires that (Al1 − 0.5α I, Bl )
proposed off-policy IRL algorithm.
be stabilizable and (Al2 − 0.5α I ) is stable. However, since
(Al1 , Bl ) be stabilizable, as the system dynamics in (1) is
assumed robustly stabilizing, then (Al1 − 0.5α I, Bl ) is also A. Off-Policy Reinforcement Learning Algorithm
stabilizable for any α > 0. Moreover, since the reference The Bellman equation (16) is linear in the cost
trajectory is assumed bounded, the linearized model of the function V , while the HJI equation (19) is nonlinear in the
command generator dynamics in (2), i.e., Al2 , is marginally value function V ∗ . Therefore, solving the Bellman equation
stable, and thus, (Al2 − 0.5α I ) is stable. Therefore, the for V is easier than solving the HJI for V ∗ . Instead of
discount factor does not affect the existence of the solution directly solving for V ∗ , a policy iteration (PI) algorithm
to the GARE. iterates on both the control and disturbance players to break the
Remark 6: Theorem 4 shows that the asymptotic stability HJI equation into a sequence of differential equations linear in
of only the first n variables of X is guaranteed, which are the the cost. An offline PI algorithm for solving the H∞ optimal
error dynamic states. This is reasonable as the last n variables tracking problem is given in Algorithm 1.
of X are the reference command generator variables, which Algorithm 1 extends the results of the simultaneous
are not under our control. RL algorithm in [33] to the tracking problem. The
Remark
√ 7: For Example 1, condition (35) gives the bound convergence of this algorithm to the minimal nonnegative
α < 80/12 to assure the stability. This bound is very close solution of the HJI equation was shown in [33]. In fact, similar
to the actual bound obtained in Example 1. However, it is to [33], the convergence of Algorithm 1 can be established by
obvious that condition (35) gives a conservative bound for the proving that iteration on (58) is essentially Newton’s iterative
discount factor to assure the stability. sequence, which converges to the unique solution of the
Remark 8: Theorem 4 confirms the existence of an upper HJI equation (19).
bound for the discount factor to assure the stability of the Algorithm 1 requires complete knowledge of the system
solution to the HJI tracking equation, and relates this bound dynamics. In the following, the off-policy IRL algorithm,
to the input and disturbance dynamics, and the weighting which was presented in [21] and [22] for solving the
matrices in the performance function. Condition (36) is not a H2 optimal regulation problem, is extended here to solve
restrictive condition even if the system dynamics are unknown. the H∞ optimal tracking for systems with completely

Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
2556 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 10, OCTOBER 2015

unknown dynamics. To this end, the system dynamics (9) is one has
first written as  t +T
 
e−α(τ −t ) X T Q T X + u iT Ru i − γ 2 diT di dτ
Ẋ = F + G u i + K di + G (u − u i ) + K (d − di ) (61) lim t
T →0 T
where u i = [u i,1 , . . . , u i,m ] ∈ Rm and di = [di,1 , . . . , di,q ] ∈ Rq = X Q T X + u iT R u i −γ 2 diT di
T
(66)
are policies to be updated. Differentiating Vi (X) along with  t +T  T 
the system dynamics (61) and using (58)–(60) give e−α(τ −t ) 2u i+1 R(u −u i )−2γ 2 di+1
T
(d − di ) dτ
t
V̇i = V XTi (F + G u i + K di )+V XTi G(u − u i )+V XTi K (d − di ) lim
T →0 T
= α Vi − X T Q T X − u iT R u i + γ 2 diT di −2 u i+1
T
R (u − u i ) = T
2 u i+1 R (u − u i ) − 2γ 2 di+1
T
(d − di ). (67)
+ 2γ 2 di+1
T
(d − di ). (62)
Substituting (65)–(67) in (64) yields
Multiplying both the sides of (62) by e−α(τ −t ) and integrating
−α Vi + V X i (F + G u i + K di + G (u − u i ) + K (d − di ))
from both the sides yield the following off-policy IRL Bellman
equation: +X T Q T X + u iT R u i − γ 2 diT di + 2u i+1
T
R (u − u i )

e−αT Vi (X (t + T )) − Vi (X (t)) − 2γ 2 di+1


T
(d − di ) = 0. (68)
 t +T
  Substituting the updated policies u i+1 and di+1 from
= e−α(τ −t ) − X T Q T X − u iT R u i + γ 2 diT di dτ
t (59) and (60) into (68) gives the Bellman equation (58). This
 t +T completes the proof. 
 
+ e−α(τ −t ) − 2 u i+1
T
R (u −u i )+2γ 2di+1
T
(d −di ) dτ. Remark 9: In the off-policy IRL Bellman equation (63),
t the control input u, which is applied to the system, can
(63) be different from the control policy u i , which is evaluated
and updated. The fixed control policy u should be a stable
Note that for a fixed control policy u (the policy that is
and exploring control policy. Moreover, in this off-policy
applied to the system) and a given disturbance d (the actual
IRL Bellman equation, the disturbance input d is the actual
disturbance that is applied to the system), (63) can be solved
external disturbance that comes from a disturbance source,
for both the value function Vi and the updated policies u i+1
and is not under our control. However, di is the disturbance,
and di+1 , simultaneously.
which is evaluated and updated. One advantage of this
Lemma 1: The off-policy IRL equation (63) gives the same
off-policy IRL Bellman equation is that, in contrast to
solution for the value function as the Bellman equation (58),
on-policy RL-based methods, the disturbance input, which is
and the same updated control and disturbance policies
applied to the system does not require to be adjustable.
as (59) and (60).
The following algorithm uses the off-policy tracking
Proof: Dividing both the sides of the off-policy IRL
Bellman equation (63) to iteratively solve the
Bellman equation (63) by T , and taking limit results in
HJI equation (19) without requiring any knowledge of
e−α T Vi (X (t + T )) − Vi (X (t)) the system dynamics. The implementation of this algorithm is
lim discussed in Section IV-B. It is shown how the data collected
T →0 T
 t +T from a fixed control policy u are reused to evaluate many
 
e−α(τ −t ) X T Q T X + u iT R u i − γ 2 diT di dτ updated control policies u i sequentially until convergence to
+ lim t the optimal solution is achieved.
T →0 T Remark 10: Inspired by the off-policy algorithm in [21],
 t +T
 T  Algorithm 2 has two separate phases. First, a fixed initial
e−α(τ −t ) 2u i+1 R(u −u i )−2γ 2 di+1
T
(d − di ) dτ exploratory control policy u is applied and the system
+ lim t information is recorded over the time interval T . Second,
T →0 T
= 0. (64) without requiring any knowledge of the system dynamics,
the information collected in phase 1 is repeatedly used to
By L’Hopital’s rule, the first term in (64) becomes find a sequence of updated policies u i and di converging to
u ∗ and d ∗ . Note that (69) is a scalar equation, and can be
e−α T Vi (X (t + T )) − Vi (X (t)) solved in a least-square sense after collecting enough number
lim
T →0 T of data samples from the system. It is shown in Section IV-B
= lim [−α e−α T Vi (X (t + T )) + e−α T
V̇i (X (t + T ))] how to collect required information in phase 1 and reuse them
T →0 in phase 2 in a least-square sense to solve (69) for Vi , u i+1 ,
= −αVi + V X i (F + Gu i + K di + G(u − u i ) + K (d − di )) and di+1 simultaneously. After the learning is done and the
(65) optimal control policy u ∗ is found, it can then be applied to
the system.
where the last term in the right-hand side is obtained using Theorem 5 (Convergence of Algorithm 2): The off-policy
V̇ = V X Ẋ . Similarly, for the second and third terms of (64), Algorithm 2 converges to the optimal control and disturbance

Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
MODARES et al.: H∞ TRACKING CONTROL OF COMPLETELY UNKNOWN CONTINUOUS-TIME SYSTEMS 2557

Algorithm 2 Online Off-Policy RL Algorithm for Solving vectors, Ŵ1 ∈ Rl1 , Ŵ2 ∈ Rm×l2 , and Ŵ3 ∈ Rq×l3 are
Tracking HJI Equation constant weight vectors, and l1 , l2 , and l2 are the number
Phase 1 (data collection using a fixed control policy): Apply a of neurons. Define v 1 = [v 11 , . . . , v 1m ]T = u − u i ,
fixed control policy u to the system and collect required system v 2 = [v 12 , . . . , v q2 ]T = d −di and assume R = diag(r, . . . , rm ).
information about the state, control input and disturbance at Then, substituting (70)–(72) in (69) yields
N different sampling interval T . e(t) = Ŵ1T(e−αT σ (X (t + T )) − σ (X (t)))
Phase 2 (reuse of collected data sequentially to find an t +T  
optimal policy iteratively): Given u i and di , use collected − e−α(τ −t ) − X T Q T X − u iT Ru i +γ 2 diT di dτ
information in phase 1 to Solve the following Bellman
t
 m  t +T
equation for Vi , u i+1 and di+1 simultaneously: +2 rl e−α(τ −t ) Ŵ2,l
T
φ(X (t)) vl1 dτ
l=1 t
e−αT Vi (X (t + T )) − Vi (X (t)) q 
 t +T
 t +T − 2γ 2 e−α(τ −t ) Ŵ3,k
T
ϕ(X (t)) v k2 dτ
  (73)
= e−α(τ −t ) − X T Q T X − u iT R u i + γ 2 diT di dτ k=1 t
t
 t +T  where e(t) is the Bellman approximation error, Ŵ2,l is the
+ e−α(τ −t ) − 2u i+1
T
R(u − u i ) lth column of Ŵ2 , and Ŵ3,k is the kth column of Ŵ3 . The
t
 Bellman approximation error is the continuous-time counter-
+ 2γ 2 di+1
T
(d − di ) dτ (69) part of the temporal difference (TD) [10]. In order to bring
Stop if a stopping criterion is met, otherwise set i = i + 1 the TD error to its minimum value, a least-squares method is
and got to 2. used. To this end, rewrite equation (73) as
y(t) + e(t) = Ŵ T h(t) (74)
solutions given by (17) and (18), where the value function where
satisfies the tracking HJI equation (19). T
Ŵ = Ŵ1T , Ŵ2,l
T T
, . . . , Ŵ2,m T
, Ŵ3,1 T
, . . . , Ŵ3,q
Proof: It was shown in Lemma 1 that the off-policy
tracking Bellman equation (69) gives the same value function ∈ Rl1 +m×l2 +q×l3 (75)
⎡ ⎤
as the Bellman equation (58) and the same updated policies e−αT σ (X (t+ T )) − σ (X (t)))
⎢  t +T ⎥
as (59) and (60). Therefore, both Algorithms 1 and 2 have the ⎢ 2r1
same convergence properties. Convergence of Algorithm 1 is ⎢ e−α(τ −t ) φ(X (t)) v 11 dτ ⎥ ⎥
⎢ t ⎥
proved in [33]. This confirms that Algorithm 2 converges to ⎢ .. ⎥
⎢  t +T . ⎥
the optimal solution.  ⎢ ⎥
⎢ −α(τ −t ) ⎥
Remark 11: Although both Algorithms 1 and 2 have the ⎢ 2rm e φ(X (t)) v 1
dτ ⎥
h(t) = ⎢ 
m ⎥ (76)
same convergence properties, Algorithm 2 is a model-free ⎢ t t +T ⎥
⎢ ⎥
algorithm, which finds an optimal control policy without ⎢ −2γ 2 e−α(τ −t ) ϕ(X (t)) v 12 dτ ⎥
⎢ ⎥
requiring any knowledge of the system dynamics. This is ⎢ t
.. ⎥
⎢ . ⎥
in contrast to Algorithm 1 that requires full knowledge of ⎢  ⎥
⎣ t +T ⎦
the system dynamics. Moreover, Algorithm 1 is an on-policy −α(τ −t )
−2γ 2 e ϕ(X (t)) v q dτ
2
RL algorithm, which requires the disturbance input to be  t +T t
specified and adjustable. On the other hand, Algorithm 2 is  
y(t) = e−α(τ −t ) − X T Q T X − u iT Ru i + γ 2 diT di dτ .
an off-policy RL algorithm, which obviates this requirement. t
(77)
B. Implementing Algorithm 2 Using Neural Networks
The parameter vector Ŵ , which gives the approximated
In order to implement the off-policy RL Algorithm 2, it is value function, actor, and disturbance (70)–(72), is found by
required to reuse the collected information found by applying minimizing, in the least-squares sense, the Bellman error (74).
a fixed control policy u to the system to solve (69) for Vi , Assume that the systems state, input, and disturbance
u i+1 , and di+1 iteratively. Three NNs, i.e., the actor NN, the information are collected at N ≥ l1 + m × l2 + q × l3
critic NN, and the disturber NN, are used here to approximate (the number of independent elements in Ŵ ) points t1 to t N
the value function and the updated control and disturbance in the state space, over the same time interval T in phase 1.
policies in the Bellman equation (69). That is, the solution Vi , Then, for a given u i and di , one can use this information to
u i+1 , and di+1 of the Bellman equation (69) is approximated evaluate (76) and (77) at N points to form
by three NNs as
H = [h(t1 ), . . . ., h(t N )] (78)
V̂i (X) = Ŵ1T σ (X) (70)
Y = [y(t1 ), . . . ., y(t N )]T. (79)
û i+1 (X) = Ŵ2T φ(X) (71)
The least-squares solution to (74) is then equal to
d̂i+1 (X) = Ŵ3T ϕ(X) (72)
Ŵ = (H H T )−1 H Y (80)
where σ = [σ1 , . . . , σl1 ] ∈ Rl1 , φ = [φ1 , . . . , φl2 ] ∈ Rl2 ,
and ϕ = [ϕ1 , . . . , ϕl3 ] ∈ Rl3 provide suitable basis function which gives Vi , u i+1 , and di+1 .

Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
2558 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 10, OCTOBER 2015

Remark 12: Note that, although X (t + T ) appears in (73),


this equation is solved in a least-square sense after observing
N samples X (t), X (t + T ), . . . , X (t + N T ). Therefore, the
knowledge of the system is not required to predict the future
state X (t + T ) at time t to solve (73).

V. S IMULATION R ESULTS
In this section, the proposed off-policy IRL method is first
applied to a linear system to show that it converges to the
optimal solution. Then, it is tested on a nonlinear system.
Fig. 2. Convergence of the kernel matrix P to its optimal value for
F-16 example.
A. Linear System: F16 Aircraft Systems
Consider the F16 aircraft system described by ẋ = Ax +
Bu + Dd with the following dynamics:
⎡ ⎤
−1.01887 0.90506 −0.00215
A = ⎣ 0.82225 −1.07741 −0.17555 ⎦
0 0 −1
⎡ ⎤ ⎡ ⎤
0 1
B = ⎣ 0 ⎦, D = ⎣ 0 ⎦. (81)
5 0
The system state vector is x = [x 1 x 2 x 3 ] = [α q δe ],
Fig. 3. Convergence of the control gain to its optimal value for F-16 example.
where α denotes the angle of attack, q is the pitch rate,
and δe is the elevator deflection angle. The control input is
the elevator actuator voltage, and the disturbance is wind their optimal values. In fact, P converges to
gusts on angle of attack. It is assumed that the output is ⎡ ⎤
12.675 5.418 −0.432 −7.481 5.424 −0.439
y = α, and the desired value is constant. Thus, the command ⎢ 5.420 3.412 −0.330 −4.985 3.404
⎢ −0.329 ⎥

generator dynamics become ṙ = 0. Therefore, the augmented ⎢ −0.427 −0.323 0.042 0.546 −0.333 0.046 ⎥
dynamics (9) becomes equal to (82), as shown at the bottom of P =⎢ ⎢ −7.495 −4.973 0.545 201.408 −4.985

⎢ 0.527 ⎥

this page. Since only e1 = x 1 −r1 is concerned as the tracking ⎣ 5.419 3.406 −0.328 −4.968 3.405 −0.339 ⎦
error, the first element of the matrix Q T in (12) is considered
−0.421 −0.347 −0.201 0.036 −0.333 0.046
to be 20, and all other elements are zero. It is also assumed
here that R = 1 and γ = 10. The offline solution to the which is very close to its optimal value given in (83). These
GARE (49) and consequently the optimal control policy (46) results and Figs. 2 and 3 confirm that the proposed method
are given in (83), as shown at the bottom of this page. converses with the optimal tracking solution without requiring
We now implement the off-policy IRL Algorithm 2. The the knowledge of the system dynamics. The optimal control
reinforcement interval is chosen as T = 0.05. The initial solution found in (83) is now applied to the system to test
control gain is chosen as zero. Figs. 2 and 3 show the its performance. To this end, it is assumed that the desired
convergence of the kernel matrix P and the control gain to value for the output is r1 = 2 for 0–30 s, and changes

⎡ ⎤ ⎡ ⎤ ⎡ ⎤
−1.01887 0.90506 −0.00215 −1.01887 0.90506 −0.00215 0 1
⎢ 0.82225 −1.07741 −0.17555 0.82225 −1.07741 −0.17555 ⎥ ⎢0⎥ ⎢0⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ −1 −1 ⎥ ⎢5⎥ ⎢0⎥
⎢ 0 0 0 0 ⎥ ⎢ ⎥ ⎢ ⎥
Ẋ = ⎢ ⎥ X + ⎢ ⎥u + ⎢ ⎥d (82)
⎢ 0 0 0 0 0 0 ⎥ ⎢0⎥ ⎢0⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎣ 0 0 0 0 0 0 ⎦ ⎣0⎦ ⎣0⎦
0 0 0 0 0 0 0 0
⎡ ⎤
12.677 5.420 −0.432 −7.474 5.420 −0.432
⎢ 5.420 3.405 −0.332 −4.980 3.405 −0.332 ⎥
⎢ ⎥
⎢ −0.432 −0.332 −0.332 0.040 ⎥
∗ ⎢ 0.040 0.544 ⎥
P =⎢ ⎥
⎢ −7.474 −4.980 0.544 201.451 −4.980 0.544 ⎥
⎢ ⎥
⎣ 5.420 3.405 −0.332 −4.980 3.405 −0.332 ⎦
−0.432 −0.332 −0.205 0.040 −0.332 0.040
u ∗ = −[−2.1620, −1.6623, 0.2005, 2.7198, −1.6623, 0.2005]X (83)

Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
MODARES et al.: H∞ TRACKING CONTROL OF COMPLETELY UNKNOWN CONTINUOUS-TIME SYSTEMS 2559

Fig. 4. Reference trajectory versus output for F-16 systems using the
proposed control method. Fig. 6. Reference trajectory versus the second state of the robot manipulator
systems during and after learning.

Fig. 5. Reference trajectory versus the first state of the robot manipulator
systems during and after learning.
Fig. 7. Reference trajectory versus the third state of the robot manipulator
to r1 = 3 at 30 s. The disturbance is assumed to be systems during and after learning.
d = 0.1e−0.1t sin(0.1t). Fig. 4 shows how the output converges
to its desired values after the control solution (83) is applied
to the system, and confirms that the proposed optimal control
solution achieves suitable results.

B. Nonlinear System: A Two-Link Robot Manipulator


In this section, the proposed off-policy IRL algorithm is
applied to a two-link manipulator [46], [53], which is modeled
using
M q̈ + Vm q̇ + Fd q̇ + G(q) = u + d (84)
where q = [q1 q2 ]T is the vector of joint angles and
q̇ = [q̇1 q̇2 ]T is the vector of joint angular velocities, and
  Fig. 8. Reference trajectory versus the fourth state of the robot manipulator
p 1 + 2 p 3 c2 p 2 + p 3 c2 systems during and after learning.
M =
p 2 + p 3 c2 p2
  The objective is to find the control input u to make the state
− p3s2 q̇2 − p3 s2 (q̇1 + q̇2 )
Vm = follow the desired trajectory given as:
p3 s2 q̇1 0
are the inertia and Coriolis-centripetal matrices, respectively, r = [ 0.5 cos(2t) 0.33 cos(3t) − sin(2t) − sin(3t) ]T
with c2 = cos(q2 ), s2 = sin(q2 ), p1 = 3.473 kgm2, which is generated by the command generator (2) with
p2 = 0.196 kgm2, and p3 = 0.242 kgm2. Moreover, Fd =
diag [5.3, 1.1], G(q) = [8.45 tanh(q̇1 ), 2.35 tanh(q̇2 )]T , u, h d (r ) = [ r3 r4 −4r1 −9r2 ]T.
and τd are the static friction, the dynamic friction, the control
It is assumed in the disturbance attenuation condition (7) that
torque, and the disturbance, respectively.
Q = 10 I , R = 1, and γ = 20. Based on (8), the aug-
Defining the state vector as x = [q1 q2 q̇1 q̇2 ]T , the
mented state becomes X = [ e1 e2 e3 e4 r1 r2 r3 r4 ]T with
state-space equations for (84) become (1) with [46]
  ei = x i − ri , i = 1, 2, 3, 4. A power series NN containing
   T T even powers of the state variables of the system up to order
−1 x3
f (x) = x 3 x 4 M (−Vm − Fd ) −G(q) four is used for the critic in (70). The activation functions
x4
for the control and disturbance policies in (71) and (72)
T
g(x) = k(x) = [0 0]T [0 0]T [0 0]T (M −1 )T . are chosen as polynomials of all powers of the states up

Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
2560 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 10, OCTOBER 2015

[5] T. Başar and P. Bernard, H∞ -Optimal Control and Related Minimax


Design Problems. Boston, MA, USA: Birkhäuser, 1995.
[6] M. D. S. Aliyu, Nonlinear H∞ Control, Hamiltonian Systems and
Hamilton-Jacobi Equations. Boca Raton, FL, USA: CRC Press, 2011.
[7] S. Devasia, D. Chen, and B. Paden, “Nonlinear inversion-based output
tracking,” IEEE Trans. Autom. Control, vol. 41, no. 7, pp. 930–942,
Jul. 1996.
[8] G. J. Toussaint, T. Basar, and F. Bullo, “H∞ -optimal tracking con-
trol techniques for nonlinear underactuated systems,” in Proc. 39th
IEEE Conf. Decision Control, Sydney, NSW, Australia, Dec. 2000,
pp. 2078–2083.
[9] J. A. Ball, P. Kachroo, and A. J. Krener, “H∞ tracking control for a
class of nonlinear systems,” IEEE Trans. Autom. Control, vol. 44, no. 6,
pp. 1202–1206, Jun. 1999.
Fig. 9. Disturbance attenuation for the final controller. [10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.
Cambridge, MA, USA: MIT Press, 1998.
[11] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming.
to order three. We now implement Algorithm 2 to find the Belmont, MA, USA: Athena Scientific, 1996.
optimal control solution online. The reinforcement interval [12] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive
is chosen as T = 0.05. The proposed algorithm starts the Control and Differential Games by Reinforcement Learning Principles
(Control Engineering). Stevenage, U.K.: IET Press, 2012.
learning process from the beginning of the simulation and [13] P. J. Werbos, “Approximate dynamic programming for real-time control
finishes it after 25 s, when the control policy is updated. The and neural modeling,” in Handbook of Intelligent Control, D. A. White
plots of state trajectories of the closed-loop system and the and D. A. Sofge, Eds. Brentwood, U.K.: Multiscience Press, 1992.
[14] K. G. Vamvoudakis and F. L. Lewis, “Online actor–critic algorithm
reference trajectory are shown in Figs. 5–8. The disturbance to solve the continuous-time infinite horizon optimal control problem,”
is assumed to be d = 0.1e−0.1t sin(0.1t) after the learning Automatica, vol. 46, no. 5, pp. 878–888, 2010.
is done. From Figs. 5–8, it is obvious that the system tracks [15] D. Vrabie and F. L. Lewis, “Neural network approach to continuous-
the reference trajectory after the learning is finished and the time direct adaptive optimal control for partially unknown nonlinear
systems,” Neural Netw., vol. 22, no. 3, pp. 237–246, 2009.
optimal controller is found. Fig. 9 shows the disturbance [16] D. Vrabie, O. Pastravanu, M. Abou-Khalaf, and F. L. Lewis, “Adaptive
attenuation level of the optimal control policy found by the optimal control for continuous-time linear systems based on policy
proposed method after the learning is done. iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009.
[17] R. Song, W. Xiao, H. Zhang, and C. Sun, “Adaptive dynamic pro-
gramming for a class of complex-valued nonlinear systems,” IEEE
VI. C ONCLUSION Trans. Neural Netw. Learn. Syst., vol. 25, no. 9, pp. 1733–1739,
Sep. 2014.
A model-free H∞ tracker was developed for nonlinear [18] H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, “Adaptive optimal
continuous-time systems in the presence of disturbance. control of unknown constrained-input systems using policy iteration and
A generalized discounted L 2 -gain condition was proposed neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 10,
pp. 1513–1525, Oct. 2013.
for obtaining the solution to this problem in which the [19] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis,
norm of the performance output includes both the feedback F. L. Lewis, and W. E. Dixon, “A novel actor–critic–identifier architec-
and feedforward control inputs. This enables us to extend ture for approximate optimal control of uncertain nonlinear systems,”
Automatica, vol. 49, no. 1, pp. 82–92, 2013.
RL algorithms for solving the H∞ optimal tracking prob-
[20] T. Bian, Y. Jiang, and Z.-P. Jiang, “Adaptive dynamic programming and
lem without requiring complete knowledge of the system optimal control of nonlinear nonaffine systems,” Automatica, vol. 50,
dynamics. A tracking HJI equation is developed to find the no. 10, pp. 2624–2632, 2014.
solution to the problem in hand. The stability and optimality [21] Y. Jiang and Z.-P. Jiang, “Robust adaptive dynamic programming and
feedback stabilization of nonlinear systems,” IEEE Trans. Neural Netw.
of the resulting solution were analyzed, and an upper bound Learn. Syst., vol. 25, no. 5, pp. 882–893, May 2014.
for the discount factor was found to assure the stability of the [22] Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for
control solution found by solving the tracking HJI equation. continuous-time linear systems with completely unknown dynamics,”
Automatica, vol. 48, no. 10, pp. 2699–2704, 2012.
An online off-policy RL algorithm was proposed to learn [23] D. Liu, H. Li, and D. Wang, “Error bounds of adaptive dynamic
the solution to the tracking HJI equation without requiring programming algorithms for solving undiscounted optimal control
any knowledge of the system dynamics. It is shown that, problems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 6,
using off-policy RL, the disturbance input does not require pp. 1323–1334, Jun. 2015.
[24] B. Luo, H.-N. Wu, T. Huang, and D. Liu, “Data-based approximate
to be specified and adjusted. Simulation results confirmed the policy iteration for affine nonlinear continuous-time optimal control
suitability of the proposed method. design,” Automatica, vol. 50, no. 12, pp. 3281–3290, 2014.
[25] B. Luo, H.-N. Wu, and H.-X. Li, “Adaptive optimal control of highly
dissipative nonlinear spatially distributed processes with neuro-dynamic
R EFERENCES programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 4,
[1] G. Zames, “Feedback and optimal sensitivity: Model reference trans- pp. 684–696, Apr. 2015.
formations, multiplicative seminorms, and approximate inverses,” IEEE [26] M. Abu-Khalaf, F. L. Lewis, and J. Huang, “Neurodynamic program-
Trans. Autom. Control, vol. 26, no. 2, pp. 301–320, Apr. 1981. ming and zero-sum games for constrained control systems,” IEEE Trans.
[2] J. C. Doyle, K. Glover, P. Khargonekar, and B. A. Francis, “State-space Neural Netw., vol. 19, no. 7, pp. 1243–1252, Jul. 2008.
solutions to standard H2 and H∞ control problems,” IEEE Trans. Autom. [27] H. Zhang, Q. Wei, and D. Liu, “An iterative adaptive dynamic pro-
Control, vol. 34, no. 8, pp. 831–847, Aug. 1989. gramming method for solving a class of nonlinear zero-sum differential
[3] A. J. Van der Schaft, “L2 -gain analysis of nonlinear systems and games,” Automatica, vol. 47, no. 1, pp. 207–214, 2011.
nonlinear state-feedback H∞ control,” IEEE Trans. Autom. Control, [28] K. G. Vamvoudakis and F. L. Lewis, “Online solution of nonlinear
vol. 37, no. 6, pp. 770–784, Jun. 1992. two-player zero-sum games using synchronous policy iteration,” Int.
[4] A. Isidori and W. Lin, “Global L 2 -gain design for a class of nonlinear J. Robust Nonlinear Control, vol. 22, no. 13, pp. 1460–1483,
systems,” Syst. Control Lett., vol. 34, no. 5, pp. 295–302, 1998. 2012.

Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 09,2024 at 11:17:46 UTC from IEEE Xplore. Restrictions apply.
MODARES et al.: H∞ TRACKING CONTROL OF COMPLETELY UNKNOWN CONTINUOUS-TIME SYSTEMS 2561

[29] K. G. Vamvoudakis and F. L. Lewis, “Online gaming: Real time solution of nonlinear two-player zero-sum games using synchronous policy iteration,” in Advances in Reinforcement Learning, A. Mellouk, Ed. Delhi, India: Intech, 2011.
[30] H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, “Online solution of nonquadratic two-player zero-sum games arising in the H∞ control of constrained input systems,” Int. J. Adapt. Control Signal Process., vol. 28, nos. 3–5, pp. 232–254, 2014.
[31] H. Zhang, C. Qin, B. Jiang, and Y. Luo, “Online adaptive policy learning algorithm for H∞ state feedback control of unknown affine nonlinear discrete-time systems,” IEEE Trans. Cybern., vol. 44, no. 12, pp. 2706–2718, Dec. 2014.
[32] H.-N. Wu and B. Luo, “Simultaneous policy update algorithms for learning the solution of linear continuous-time H∞ state feedback control,” Inf. Sci., vol. 222, pp. 472–485, Feb. 2013.
[33] H.-N. Wu and B. Luo, “Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 12, pp. 1884–1895, Dec. 2012.
[34] B. Luo and H.-N. Wu, “Computationally efficient simultaneous policy update algorithm for nonlinear H∞ state feedback control with Galerkin’s method,” Int. J. Robust Nonlinear Control, vol. 23, no. 9, pp. 991–1012, 2013.
[35] D. Vrabie and F. L. Lewis, “Adaptive dynamic programming for online solution of a zero-sum differential game,” J. Control Theory Appl., vol. 9, no. 3, pp. 353–360, 2011.
[36] H. Li, D. Liu, and D. Wang, “Integral reinforcement learning for linear continuous-time zero-sum games with completely unknown dynamics,” IEEE Trans. Autom. Sci. Eng., vol. 11, no. 3, pp. 706–714, Jul. 2014.
[37] B. Luo, H.-N. Wu, and T. Huang, “Off-policy reinforcement learning for H∞ control design,” IEEE Trans. Cybern., vol. 45, no. 1, pp. 65–76, Jan. 2015.
[38] Q. Wei, D. Liu, G. Shi, and Y. Liu, “Multibattery optimal coordination control for home energy management systems via distributed iterative adaptive dynamic programming,” IEEE Trans. Ind. Electron., vol. 62, no. 7, pp. 4203–4214, Jul. 2015.
[39] S. Jagannathan and G. Galan, “Adaptive critic neural network-based object grasping control using a three-finger gripper,” IEEE Trans. Neural Netw., vol. 15, no. 2, pp. 395–407, Mar. 2004.
[40] Q. Wei, D. Liu, and G. Shi, “A novel dual iterative Q-learning method for optimal battery management in smart residential environments,” IEEE Trans. Ind. Electron., vol. 62, no. 4, pp. 2509–2518, Apr. 2015.
[41] Q. Wei and D. Liu, “Data-driven neuro-optimal temperature control of water–gas shift reaction using stable iterative adaptive dynamic programming,” IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6399–6408, Nov. 2014.
[42] H. Modares, I. Ranatunga, F. L. Lewis, and D. O. Popa, “Optimized assistive human–robot interaction using reinforcement learning,” IEEE Trans. Cybern., to be published.
[43] Q. Wei and D. Liu, “Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification,” IEEE Trans. Autom. Sci. Eng., vol. 11, no. 4, pp. 1020–1036, Oct. 2014.
[44] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M.-B. Naghibi-Sistani, “Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics,” Automatica, vol. 50, no. 4, pp. 1167–1175, 2014.
[45] H. Modares and F. L. Lewis, “Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning,” IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 3051–3065, Nov. 2014.
[46] R. Kamalapurkar, H. Dinh, S. Bhasin, and W. E. Dixon, “Approximate optimal trajectory tracking for continuous-time nonlinear systems,” Automatica, vol. 51, pp. 40–48, Jan. 2015.
[47] H. Modares and F. L. Lewis, “Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning,” Automatica, vol. 50, no. 7, pp. 1780–1792, 2014.
[48] H. Zhang, R. Song, Q. Wei, and T. Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1851–1862, Dec. 2011.
[49] B. Kiumarsi and F. L. Lewis, “Actor–critic-based optimal tracking for partially unknown nonlinear discrete-time systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 1, pp. 140–151, Jan. 2015.
[50] C. Qin, H. Zhang, and Y. Luo, “Online optimal tracking control of continuous-time linear systems with unknown dynamics by using adaptive dynamic programming,” Int. J. Control, vol. 87, no. 5, pp. 1000–1009, 2014.
[51] D. Liu, Y. Huang, and Q. Wei, “Neural network H∞ tracking control of nonlinear systems using GHJI method,” in Advances in Neural Networks, C. Guo, Z.-G. Huo, and Z. Zeng, Eds. Dalian, China: Springer-Verlag, 2013.
[52] F. L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, 3rd ed. New York, NY, USA: Wiley, 2012.
[53] F. L. Lewis, D. M. Dawson, and C. T. Abdallah, Robot Manipulator Control: Theory and Practice, 2nd ed. New York, NY, USA: CRC Press, 2003.

Hamidreza Modares received the B.S. degree from the University of Tehran, Tehran, Iran, in 2004, the M.S. degree from the Shahrood University of Technology, Shahrood, Iran, in 2006, and the Ph.D. degree from The University of Texas at Arlington, Arlington, TX, USA, in 2015.

He was a Senior Lecturer with the Shahrood University of Technology from 2006 to 2009. His current research interests include cyber-physical systems, reinforcement learning, distributed control, robotics, and pattern recognition.

Frank L. Lewis (S’70–M’81–SM’86–F’94) received the bachelor’s degree in physics/electrical engineering and the M.S. degree in electrical engineering from Rice University, Houston, TX, USA, the M.S. degree in aeronautical engineering from the University of West Florida, Pensacola, FL, USA, and the Ph.D. degree from the Georgia Institute of Technology, Atlanta, GA, USA.

He is currently a U.K. Chartered Engineer, the IEEE Control Systems Society Distinguished Lecturer, a University of Texas at Arlington Distinguished Scholar Professor, a UTA Distinguished Teaching Professor, and a Moncrief-O’Donnell Chair with the University of Texas at Arlington Research Institute, Fort Worth, TX, USA. He is a Qian Ren Thousand Talents Consulting Professor with Northeastern University, Shenyang, China. He is involved in feedback control, reinforcement learning, intelligent systems, and distributed control systems. He has authored six U.S. patents, 301 journal papers, 396 conference papers, 20 books, 44 chapters, and 11 journal special issues.

Dr. Lewis is a member of the National Academy of Inventors. He is a fellow of the International Federation of Automatic Control and the U.K. Institute of Measurement and Control, and a Professional Engineer in Texas. He was a Founding Member of the Board of Governors of the Mediterranean Control Association. He received the Fulbright Research Award, the NSF Research Initiation Grant, the ASEE Terman Award, the International Neural Network Society Gabor Award in 2009, and the U.K. Institute of Measurement and Control Honeywell Field Engineering Medal in 2009. He was a recipient of the IEEE Computational Intelligence Society Neural Networks Pioneer Award in 2012, the Distinguished Foreign Scholar Award from the Nanjing University of Science and Technology, the 111 Project Professorship at Northeastern University, China, the Outstanding Service Award from the Dallas IEEE Section, and Engineer of the Year from the Fort Worth IEEE Section. He was listed in the Fort Worth Business Press Top 200 Leaders in Manufacturing. He was also a recipient of the 2010 IEEE Region Five Outstanding Engineering Educator Award, the 2010 UTA Graduate Dean’s Excellence in Doctoral Mentoring Award, and the 2013 Texas Regents Outstanding Teaching Award, and was elected to the UTA Academy of Distinguished Teachers in 2012. He served on the NAE Committee on Space Station in 1995. He also received the IEEE Control Systems Society Best Chapter Award (as a Founding Chairman of the DFW Chapter), the National Sigma Xi Award for Outstanding Chapter (as a President of the UTA Chapter), and the U.S. SBA Tibbets Award in 1996 (as the Director of ARRI’s SBIR Program).
Zhong-Ping Jiang (M’94–SM’02–F’08) received the B.Sc. degree in mathematics from the University of Wuhan, Wuhan, China, in 1988, the M.Sc. degree in statistics from the University of Paris XI, Paris, France, in 1989, and the Ph.D. degree in automatic control and mathematics from the Ecole des Mines de Paris, Paris, France, in 1993.

He is currently a Professor of Electrical and Computer Engineering with the Polytechnic School of Engineering, New York University, New York, NY, USA. His current research interests include stability theory, robust, adaptive and distributed nonlinear control, adaptive dynamic programming, and their applications to information, mechanical and biological systems. He has co-authored the books Stability and Stabilization of Nonlinear Systems (Springer, 2011) with Dr. I. Karafyllis, and Nonlinear Control of Dynamic Networks (Taylor & Francis, 2014) with Dr. T. Liu and Dr. D. J. Hill.

Prof. Jiang is a fellow of the International Federation of Automatic Control. He was a recipient of the prestigious Queen Elizabeth II Fellowship Award from the Australian Research Council, the CAREER Award from the U.S. National Science Foundation (NSF), and the Distinguished Overseas Chinese Scholar Award from the NSF of China. He was also a recipient of recent awards recognizing his research work, including the Best Theory Paper Award, with Y. Wang, at the 2008 World Congress on Intelligent Control and Automation, the Guan Zhao Zhi Best Paper Award, with T. Liu and D. Hill, at the 2011 Chinese Control Conference, and the Shimemura Young Author Prize, with his student Y. Jiang, at the 2013 Asian Control Conference in Istanbul, Turkey. He is a Deputy co-Editor-in-Chief of the Journal of Control and Decision, an Editor for the International Journal of Robust and Nonlinear Control, and has served as an Associate Editor for several journals, including the Mathematics of Control, Signals and Systems, Systems and Control Letters, the IEEE TRANSACTIONS ON AUTOMATIC CONTROL, the European Journal of Control, and Science China: Information Sciences.