

Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning

Hamidreza Modares, Frank L. Lewis
University of Texas at Arlington Research Institute, 7300 Jack Newell Blvd. S., Ft. Worth, TX 76118, USA

Article history: Received 20 June 2013; received in revised form 25 November 2013; accepted 21 March 2014.

Keywords: Optimal tracking control; Integral reinforcement learning; Input constraints; Neural networks

Abstract: In this paper, a new formulation for the optimal tracking control problem (OTCP) of continuous-time nonlinear systems is presented. This formulation extends the integral reinforcement learning (IRL) technique, a method for solving optimal regulation problems, to learn the solution to the OTCP. Unlike existing solutions to the OTCP, the proposed method does not need to have or to identify knowledge of the system drift dynamics, and it also takes into account the input constraints a priori. An augmented system composed of the error system dynamics and the command generator dynamics is used to introduce a new nonquadratic discounted performance function for the OTCP. This encodes the input constraints into the optimization problem. A tracking Hamilton–Jacobi–Bellman (HJB) equation associated with this nonquadratic performance function is derived which gives the optimal control solution. An online IRL algorithm is presented to learn the solution to the tracking HJB equation without knowing the system drift dynamics. Convergence to a near-optimal control solution and stability of the whole system are shown under a persistence of excitation condition. Simulation examples are provided to show the effectiveness of the proposed method.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Reinforcement learning (RL) (Bertsekas & Tsitsiklis, 1996; Powell, 2007; Sutton & Barto, 1998), inspired by learning mechanisms observed in mammals, is concerned with how an agent or actor ought to take actions so as to optimize a cost of its long-term interactions with the environment. The agent or actor learns an optimal policy by modifying its actions based on stimuli received in response to its interaction with its environment. Similar to RL, optimal control involves finding an optimal policy based on optimizing a long-term performance criterion. Strong connections between RL and optimal control have prompted a major effort towards introducing and developing online and model-free RL algorithms to learn the solution to optimal control problems (Lewis & Liu, 2012; Vrabie, Vamvoudakis, & Lewis, 2013; Zhang, Liu, Luo, & Wang, 2012).

During the last few years, RL methods have been successfully used to solve the optimal regulation problems by learning the solution to the so-called Hamilton–Jacobi–Bellman (HJB) equation (Lewis, Vrabie, & Syrmos, 2012; Liu & Wei, 2013). For continuous-time systems, Vrabie and Lewis (2009) and Vrabie, Pastravanu, Abou-Khalaf, and Lewis (2009) proposed a promising RL algorithm, called integral reinforcement learning (IRL), to learn the solution to the HJB equation using only partial knowledge about the system dynamics. They used an iterative online policy iteration (PI) (Howard, 1960) procedure to implement their IRL algorithm. Later, inspired by Vrabie and Lewis (2009) and Vrabie et al. (2009), some online PI algorithms were presented to solve the optimal regulation problem for completely unknown linear systems (Jiang & Jiang, 2012; Lee, Park, & Choi, 2012). Also, in Liu, Yang, and Li (2013) the authors presented an IRL algorithm to find the solution to the HJB equation related to a discounted cost function. Other than the IRL-based PI algorithms, efficient synchronous PI algorithms with guaranteed closed-loop stability were proposed in Bhasin et al. (2012), Modares, Naghibi-Sistani, and Lewis (2013) and Vamvoudakis and Lewis (2010) to learn the solution to the HJB equation. Synchronous IRL algorithms were also presented for solving the HJB equation in Modares, Naghibi-Sistani, and Lewis (2014) and Vamvoudakis, Vrabie, and Lewis (in press).

Although RL algorithms have been widely used to solve the optimal regulation problems, few results considered solving the optimal tracking control problem (OTCP) for both discrete-time (Dierks & Jagannathan, 2009; Kiumarsi, Lewis, Modares, Karimpour, & Naghibi-Sistani, 2014; Wang, Liu, & Wei, 2012; Zhang, Wei, & Luo, 2008) and continuous-time systems (Dierks & Jagannathan, 2010; Zhang, Cui, Zhang, & Luo, 2011). Moreover, existing methods for continuous-time systems require the exact knowledge of the system dynamics a priori while finding the feedforward part of the control input using the dynamic inversion concept. In order to attain the required knowledge of the system dynamics, in Zhang et al. (2011) a plant model was first identified and then an RL-based optimal tracking controller was synthesized using the identified model. To our knowledge, there has been no attempt to develop RL-based techniques to solve the OTCP for continuous-time systems with unknown or partially-unknown dynamics using only measured data in real time. While the importance of the IRL algorithm is well understood for solving optimal regulation problems using only partial knowledge of the system dynamics, the requirement of the exact knowledge of the system dynamics for finding the steady-state part of the control input in the existing OTCP formulation does not allow extending the IRL algorithm for solving the OTCP.

Another important issue which is ignored in the existing RL-based solutions to the OTCP is the amplitude limitation on the control inputs. In fact, in the existing formulation for the OTCP, it is not possible to encode the input constraints into the optimization problem a priori, as only the cost of the feedback part of the control input is considered in the performance function. Therefore, the existing RL-based solutions to the OTCP offer no guarantee that the control inputs remain within their permitted bounds during and after learning. This may result in performance degradation or even system instability. In the context of the constrained optimal regulation problem, however, an offline PI algorithm (Abou-Khalaf & Lewis, 2005) and online PI algorithms (Modares et al., 2013, 2014) were presented to find the solution to the constrained HJB equation.

In this paper, we develop an online adaptive controller based on the IRL technique to learn the OTCP solution for nonlinear continuous-time systems without knowing the system drift dynamics or the command generator dynamics. The contributions of this paper are as follows. First, a new formulation for the OTCP is presented. In fact, an augmented system is constructed from the tracking error dynamics and the command generator dynamics to introduce a new discounted performance function for the OTCP. Second, the input constraints are encoded into the optimization problem a priori by employing a suitable nonquadratic performance function. Third, a tracking HJB equation related to this nonquadratic performance function is derived which gives both feedforward and feedback parts of the control input simultaneously. Fourth, the IRL algorithm is extended for solving the OTCP. An IRL algorithm, implemented on an actor–critic structure, is used to find the solution to the tracking HJB equation online using only partial knowledge about the system dynamics. In contrast to the existing work, a preceding identification procedure is not needed and the optimal policy is learned using only measured data from the system. Convergence of the proposed learning algorithm to a near-optimal control solution and the boundedness of the tracking error and the actor and critic NN weights during learning are also shown.

2. Optimal tracking control problem (OTCP)

In this section, a review of the OTCP for continuous-time nonlinear systems is given. It is pointed out that the standard solution to the given problem requires complete knowledge of the system dynamics. It is also pointed out that the input constraints caused by the actuator saturation cannot be encoded into the standard performance function a priori. A new formulation of the OTCP is given in the next section to overcome these shortcomings.

2.1. Problem formulation

Consider the affine CT dynamical system described by

ẋ(t) = f(x(t)) + g(x(t)) u(t)  (1)

where x ∈ R^n is the measurable system state vector, f(x) ∈ R^n is the drift dynamics of the system, g(x) ∈ R^{n×m} is the input dynamics of the system, and u(t) ∈ R^m is the control input. The elements of u(t) are defined by ui(t), i = 1, ..., m.

Assumption 1. It is assumed that f(0) = 0 and f(x) and g(x) are Lipschitz, and that the system (1) is controllable in the sense that there exists a continuous control on a set Ω ⊆ R^n which stabilizes the system.

Assumption 2 (Bhasin et al., 2012; Vamvoudakis & Lewis, 2010). The following assumptions are considered on the system dynamics:
(a) ∥f(x)∥ ≤ bf ∥x∥ for some constant bf.
(b) g(x) is bounded by a constant bg, i.e. ∥g(x)∥ ≤ bg.

Note that Assumption 2(a) requires f(x) to be Lipschitz and f(0) = 0 (see Assumption 1), which is a standard assumption to make sure the solution x(t) of the system (1) is unique for any finite initial condition. On the other hand, although Assumption 2(b) restricts the considered class of nonlinear systems, many physical systems, such as robotic systems (Slotine & Li, 1991) and aircraft systems (Sastry, 1991), fulfill such a property.

The goal of the optimal tracking problem is to find the optimal control policy u*(t) so as to make the system (1) track a desired (reference) trajectory xd(t) ∈ R^n in an optimal manner by minimizing a predefined performance function. Moreover, the input must be constrained to remain within predefined limits |ui(t)| ≤ λ, i = 1, ..., m.

Define the tracking error as

ed(t) ≜ x(t) − xd(t).  (2)

A general performance function leading to the optimal tracking controller can be expressed as

V(ed(t), xd(t)) = ∫_t^∞ e^{−γ(τ−t)} [E(ed(τ)) + U(u(τ))] dτ  (3)

where E(ed) is a positive-definite function, U(u) is a positive-definite integrand function, and γ is the discount factor.

Note that the performance function (3) contains both the tracking error cost and the whole control input energy cost. The following assumption is made in accordance with other work in the literature.

Assumption 3. The desired reference trajectory xd(t) is bounded and there exists a Lipschitz continuous command generator function hd(xd(t)) ∈ R^n such that

ẋd(t) = hd(xd(t))  (4)

and hd(0) = 0.

Note that the reference dynamics needs only to be stable in the sense of Lyapunov, not necessarily asymptotically stable.
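To make the problem setup concrete, the following minimal sketch (in Python, using hypothetical scalar dynamics, a placeholder stabilizing policy, and illustrative numerical values that are not taken from the paper) simulates a system of the form (1) tracking a command generator trajectory (4) and numerically evaluates the discounted performance function (3) for a simple quadratic choice of E(ed) and U(u).

```python
import numpy as np

# Hypothetical one-dimensional example (not the paper's): evaluate the discounted
# performance function (3) along a simulated trajectory by forward Euler integration.
gamma = 0.1                       # discount factor
dt, horizon = 1e-3, 40.0          # integration step and truncation horizon

f  = lambda x: -x                 # drift dynamics f(x)        (placeholder)
g  = lambda x: np.array([1.0])    # input dynamics g(x)        (placeholder)
hd = lambda xd: np.array([0.0])   # command generator hd(xd)   (constant reference)

E = lambda ed: 10.0 * float(ed @ ed)    # E(ed) = ed' Q ed with Q = 10 (example choice)
U = lambda u:  1.0 * float(u @ u)       # placeholder positive-definite U(u)

def policy(ed, xd):
    """Example admissible policy u = -K ed (placeholder gain)."""
    return -2.0 * ed

def performance(x0, xd0):
    x, xd, cost = np.array(x0, float), np.array(xd0, float), 0.0
    for k in range(int(horizon / dt)):
        ed = x - xd                                             # tracking error (2)
        u = policy(ed, xd)
        cost += np.exp(-gamma * k * dt) * (E(ed) + U(u)) * dt   # integrand of (3)
        x = x + dt * (f(x) + g(x) * u)                          # system (1)
        xd = xd + dt * hd(xd)                                   # command generator (4)
    return cost

print("V(ed(0), xd(0)) approx:", performance(x0=[1.0], xd0=[0.5]))
```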

2.2. Standard formulation and solution to the OTCP

In this subsection, the standard solution to the OTCP and its shortcomings are discussed.

In the existing standard solution to the OTCP, the desired or steady-state part of the control input ud(t) is obtained by assuming that the desired reference trajectory satisfies

ẋd(t) = f(xd(t)) + g(xd(t)) ud(t).  (5)

If the dynamics of the system is known and the inverse of the input dynamics g^{-1}(xd(t)) exists, the steady-state control input which guarantees perfect tracking is given by

ud(t) = g^{-1}(xd(t)) (ẋd(t) − f(xd(t))).  (6)

On the other hand, the feedback part of the control is designed to stabilize the tracking error dynamics in an optimal manner by minimizing the following performance function:

V(ed(t)) = ∫_t^∞ [ed(τ)^T Q ed(τ) + ue^T R ue] dτ  (7)

where ue(t) = u(t) − ud(t) is the feedback control input.

The optimal feedback control solution ue*(t) which minimizes (7) can be obtained by solving the HJB equation related to this performance function (Lewis et al., 2012). The standard optimal solution to the OTCP is then constituted by the optimal feedback control ue*(t) obtained.

Remark 1. The optimal feedback part of the control input ue*(t) can be learned using the integral reinforcement learning method (Vrabie & Lewis, 2009) to obviate knowledge of the system drift dynamics. However, the exact knowledge of the system dynamics is required to find the steady-state part of the control input given by (6), which cancels the usefulness of the IRL technique.

Remark 2. Because only the feedback part of the control input is obtained by minimizing the performance function (7), it is not possible to encode the input constraints into the optimization problem by using a nonquadratic performance function, as has been performed in the optimal regulation problem (Abou-Khalaf & Lewis, 2005; Modares et al., 2013, 2014).

3. A new formulation for the OTCP of constrained-input systems

In this section, a new formulation for the OTCP is presented. In this formulation, both the steady-state and feedback parts of the control input are obtained simultaneously by minimizing a new discounted performance function in the form of (3). The input constraints are also encoded into the optimization problem a priori. A tracking HJB equation for the constrained OTCP is derived and an iterative offline integral reinforcement learning (IRL) algorithm is presented to find its solution. This algorithm provides a basis to develop an online IRL algorithm for learning the optimal solution to the OTCP for partially-unknown systems, which is discussed in the next section.

3.1. An augmented system and a new discounted performance function for the OTCP

In this subsection, first an augmented system composed of the tracking error dynamics and the command generator dynamics is constructed. Then, based on this augmented system, a new discounted performance function for the OTCP is presented. It is shown that this performance function is identical to the performance function (3).

The tracking error dynamics can be obtained by using (1) and (2), and the result is

ėd(t) = f(x(t)) − hd(xd(t)) + g(x(t)) u(t).  (8)

Define the augmented system state

X(t) = [ed(t)^T  xd(t)^T]^T ∈ R^{2n}.  (9)

Then, putting (4) and (8) together yields the augmented system

Ẋ(t) = F(X(t)) + G(X(t)) u(t)  (10)

where u(t) = u(X(t)) and

F(X(t)) = [f(ed(t) + xd(t)) − hd(xd(t)); hd(xd(t))]  (11)

G(X(t)) = [g(ed(t) + xd(t)); 0].  (12)

Based on the augmented system (10), we introduce the following discounted performance function for the OTCP:

V(X(t)) = ∫_t^∞ e^{−γ(τ−t)} [X(τ)^T QT X(τ) + U(u(τ))] dτ  (13)

where γ > 0 is the discount factor,

QT = [Q 0; 0 0],  Q > 0  (14)

and U(u) is a positive-definite integrand function defined as

U(u) = 2 ∫_0^u (λ β^{-1}(v/λ))^T R dv  (15)

where v ∈ R^m, β(·) = tanh(·), λ is the saturating bound for the actuators and R = diag(r1, ..., rm) > 0 is assumed to be diagonal for simplicity of analysis. This nonquadratic performance function is used in the optimal regulation problem of constrained-input systems to deal with the input constraints (Abou-Khalaf & Lewis, 2005; Lyshevski, 1998; Modares et al., 2013, 2014). In fact, using this nonquadratic performance function, the following constraints are always satisfied:

|ui(t)| ≤ λ,  i = 1, ..., m.  (16)

Definition 1 (Admissible Control). A control policy µ(X) is said to be admissible with respect to (13) on Ω, denoted by µ(X) ∈ π(Ω), if µ(X) is continuous on Ω, µ(0) = 0, u(t) = µ(X) stabilizes the error dynamics (8) on Ω, and V(X) is finite ∀X ∈ Ω.

Note that from (10)–(12) it is clear that, as expected, the command generator dynamics is not under our control. Since the command generator states are assumed to be bounded, the admissibility of the control input implies the boundedness of the states of the augmented system.

Remark 3. Note that for the first term under the integral we have X^T QT X = ed^T Q ed. Therefore, this performance function is identical to the performance function (3) with E(ed(τ)) = ed^T Q ed and U(u) given in (15).

Remark 4. The use of the discount factor in the performance function (13) is essential. This is because the control input contains a steady-state part which in general makes (13) unbounded without using a discount factor, and therefore the meaning of minimality is lost.

Remark 5. Note that both steady-state and feedback parts of the control input are obtained simultaneously by minimizing the discounted performance function (13) along the trajectories of the augmented system (10). As is shown in the subsequent sections, this formulation enables us to extend the IRL technique to find the solution to the OTCP without requiring the augmented system dynamics F. That is, both the system drift dynamics f and the command generator dynamics hd are not required.
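As a brief illustration of this construction, the sketch below (with a hypothetical two-state, single-input placeholder system, illustrative bound λ and weights Q, R; none of these values are prescribed by the paper) assembles the augmented state X = [ed; xd], the augmented dynamics F(X) and G(X) of (11)–(12), and evaluates the nonquadratic integrand U(u) of (15) by numerical quadrature.

```python
import numpy as np

lam = 5.0                                   # saturating actuator bound lambda (example)
R   = np.diag([1.0])                        # R = diag(r1, ..., rm) > 0
Q   = 10.0 * np.eye(2)                      # Q > 0 in (14)
QT  = np.block([[Q, np.zeros((2, 2))],      # Q_T = blkdiag(Q, 0) as in (14)
                [np.zeros((2, 2)), np.zeros((2, 2))]])

# Placeholder dynamics and command generator (n = 2, m = 1), chosen for illustration only.
f  = lambda x:  np.array([x[1], -5.0 * x[0] - 0.5 * x[1]])
g  = lambda x:  np.array([[0.0], [1.0]])
hd = lambda xd: np.array([xd[1], -5.0 * xd[0]])

def F(X):
    """Augmented drift dynamics (11), with X = [ed; xd]."""
    ed, xd = X[:2], X[2:]
    return np.concatenate([f(ed + xd) - hd(xd), hd(xd)])

def G(X):
    """Augmented input matrix (12)."""
    ed, xd = X[:2], X[2:]
    return np.vstack([g(ed + xd), np.zeros((2, 1))])

def U(u, npts=201):
    """Nonquadratic integrand (15): 2 * sum_i of integral_0^{u_i} lam*atanh(v/lam)*r_i dv."""
    total = 0.0
    for i, ui in enumerate(np.atleast_1d(u)):
        v = np.linspace(0.0, ui, npts)
        total += 2.0 * np.trapz(lam * np.arctanh(v / lam) * R[i, i], v)
    return total

X = np.array([0.5, -0.2, 0.5, 0.5])         # augmented state [ed; xd] of (9)
u = np.array([2.0])
print("F(X) =", F(X))
print("running cost X'Q_T X + U(u) =", X @ QT @ X + U(u))
```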

3.2. Tracking Bellman and tracking HJB equations

In this subsection, the optimal tracking Bellman equation and the optimal tracking HJB equation related to the defined performance function (13) are given.

Using Leibniz's rule to differentiate V along the augmented system trajectories (10), the following tracking Bellman equation is obtained:

V̇(X) = ∫_t^∞ (∂/∂t) e^{−γ(τ−t)} (X^T QT X + U(u)) dτ − X^T QT X − U(u).  (17)

Using (15) in (17) and noting that the first term on the right-hand side of (17) is equal to γ V(X), gives

X^T QT X + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv − γ V(X) + V̇(X) = 0  (18)

or, by defining the Hamiltonian function,

H(X, u, ∇V) ≡ X^T QT X + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv − γ V(X) + ∇V^T(X) (F(X) + G(X) u(X)) = 0  (19)

where ∇V(X) = ∂V(X)/∂X ∈ R^{2n}. Let V*(X) be the optimal cost function defined as

V*(X(t)) = min_{u∈π(Ω)} ∫_t^∞ e^{−γ(τ−t)} [X^T QT X + U(u)] dτ.  (20)

Then, based on (19), V*(X) satisfies the tracking HJB equation

H(X, u*, ∇V*) = X^T QT X + 2 ∫_0^{u*} (λ tanh^{-1}(v/λ))^T R dv − γ V*(X) + ∇V*^T(X) (F(X) + G(X) u*(X)) = 0.  (21)

The optimal control input for the given problem is obtained by employing the stationarity condition (Lewis et al., 2012) on the Hamiltonian (19). The result is

u*(X) = arg min_{u∈π(Ω)} [H(X, u, ∇V*)] = −λ tanh((1/2λ) R^{-1} G^T(X) ∇V*(X)).  (22)

This control is within its permitted bounds ±λ. The nonquadratic cost (15) for u* is (Abou-Khalaf & Lewis, 2005)

U(u*) = 2 ∫_0^{u*} (λ tanh^{-1}(v/λ))^T R dv = 2λ (tanh^{-1}(u*/λ))^T R u* + λ^2 R̄ ln(1 − (u*/λ)^2)  (23)

where 1 is a column vector having all of its elements equal to one, and R̄ = [r1, ..., rm] ∈ R^{1×m}. Putting (22) in (23) results in

U(u*) = λ ∇V*^T(X) G(X) tanh(D*) + λ^2 R̄ ln(1 − tanh^2(D*))  (24)

where D* ≜ (1/2λ) R^{-1} G^T(X) ∇V*(X). Substituting u* from (22) back into (21) and using (24), the tracking HJB equation (21) becomes

H(X, u*, ∇V*) = X^T QT X − γ V*(X) + ∇V*^T(X) F(X) + λ^2 R̄ ln(1 − tanh^2(D*)) = 0.  (25)

To solve the OTCP, one solves the HJB equation (25) for the optimal value V*, then the optimal control is given as a feedback u(V*) in terms of the HJB solution using (22).

Now a formal proof is given that the solution to the tracking HJB equation for constrained-input systems provides the optimal tracking control solution and, when the discount factor is zero, it locally asymptotically stabilizes the error dynamics (8). The following key fact is instrumental.

Lemma 1. For any admissible control policy u(X), let V(X) ≥ 0 be the corresponding solution to the Bellman equation (19). Define u*(X) = u(V(X)) by (22) in terms of V(X). Then

H(X, u, ∇V) = H(X, u*, ∇V) + ∇V^T(X) G(X)(u − u*) + 2 ∫_{u*}^u (λ tanh^{-1}(v/λ))^T R dv.  (26)

Proof. The Hamiltonian function is

H(X, u, ∇V) = X^T QT X + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv − γ V(X) + ∇V^T(X) (F(X) + G(X) u(X)).  (27)

Adding and subtracting the terms 2 ∫_0^{u*} (λ tanh^{-1}(v/λ))^T R dv and ∇V^T(X) G(X) u*(X) in (27) yields

H(X, u, ∇V) = X^T QT X + 2 ∫_0^{u*} (λ tanh^{-1}(v/λ))^T R dv − γ V(X) + ∇V^T(X) (F(X) + G(X) u*(X)) + ∇V^T(X) G(X)(u(X) − u*(X)) + 2 ∫_{u*}^u (λ tanh^{-1}(v/λ))^T R dv  (28)

which gives (26) and completes the proof. □

Theorem 1. Consider the optimal tracking control problem for the augmented system (10) with performance function (13). Suppose that V* is a smooth positive definite solution to the tracking HJB equation (25). Define the control u* = u(V*(X)) as given by (22). Then, u* minimizes the performance index (13) over all admissible controls constrained to |ui| ≤ λ, i = 1, ..., m, and the optimal value on [0, ∞) is given by V*(X(0)). Moreover, in the limit as the discount factor goes to zero, the control input u* makes the error dynamics (8) asymptotically stable.

Proof. We first show the optimality of the HJB solution. Note that for any continuous value function V(X), one can write the performance function (13) as

V(X(0), u) = ∫_0^∞ e^{−γτ} [X^T QT X + U(u)] dτ + ∫_0^∞ (d/dt)(e^{−γτ} V(X)) dτ + V(X(0))
= ∫_0^∞ e^{−γτ} [X^T QT X + U(u)] dτ + ∫_0^∞ e^{−γτ} [∇V(X)^T (F + Gu) − γ V(X)] dτ + V(X(0))
= ∫_0^∞ e^{−γτ} H(X, u, ∇V) dτ + V(X(0)).  (29)

Now, suppose V(X) satisfies the HJB equation (25). Then H(X, u*, ∇V*) = 0 and (26) yields

V(X(0), u) = ∫_0^∞ e^{−γτ} [2 ∫_{u*}^u (λ tanh^{-1}(v/λ))^T R dv + ∇V*^T(X) G(X)(u − u*)] dτ + V*(X(0)).  (30)

To prove that u* is the optimal control solution and the optimal value is V*(X(0)), it remains to show that the integral term on the right-hand side of the above equation is bigger than zero for all u ≠ u* and attains its minimum value, i.e., zero, at u = u*. That is, to show that

H = 2 ∫_{u*}^u (λ tanh^{-1}(v/λ))^T R dv + ∇V*^T(X) G(X)(u − u*)  (31)

is bigger than or equal to zero. To show this, note that using (22) one has

∇V*^T(X) G(X) = −2 (λ tanh^{-1}(u*/λ))^T R.  (32)

Substituting (32) in (31) and noting ϕ^{-1}(·) = (λ tanh^{-1}(·/λ))^T yields

H = 2 ϕ^{-1}(u*) R (u* − u) − 2 ∫_u^{u*} ϕ^{-1}(v) R dv.  (33)

As R is symmetric positive definite, one can rewrite it as R = Λ Σ Λ, where Σ is a diagonal matrix whose entries are the singular values of R and Λ is an orthogonal symmetric matrix. Substituting for R in (33) and applying the coordinate change u = Λ^{-1} ū, one has

H = 2 β(ū*) Σ (ū* − ū) − 2 ∫_ū^{ū*} β(ξ) Σ dξ  (34)

where β(ū) = ϕ^{-1}(Λ^{-1} ū) Λ. Note that β is monotone odd because tanh^{-1} is monotone odd. Since Σ is diagonal, one can decouple the transformed input vector as

H = 2 Σ_{k=1}^m σkk [β(ū*k)(ū*k − ūk) − ∫_{ūk}^{ū*k} β(ξk) dξk]  (35)

where σkk > 0, k = 1, ..., m, since R > 0. To complete the proof it remains to show that the term

Lk = β(ū*k)(ū*k − ūk) − ∫_{ūk}^{ū*k} β(ξk) dξk  (36)

is bigger than zero for u* ≠ u and is zero for u* = u. To show this, first assume that ū*k > ūk. Then, using the mean value theorem for integrals, there exists a uk ∈ (ūk, ū*k) such that

∫_{ūk}^{ū*k} β(ξk) dξk = β(uk)(ū*k − ūk) < β(ū*k)(ū*k − ūk)  (37)

where the inequality is obtained by the fact that β is monotone odd, and hence β(uk) < β(ū*k). Therefore, Lk > 0 for ū*k > ūk. Now suppose that ū*k < ūk. Then, using the mean value theorem for integrals, there exists a uk ∈ (ū*k, ūk) such that

∫_{ūk}^{ū*k} β(ξk) dξk = −∫_{ū*k}^{ūk} β(ξk) dξk = −β(uk)(ūk − ū*k) < −β(ū*k)(ūk − ū*k) = β(ū*k)(ū*k − ūk)  (38)

where the inequality is obtained by the fact that β is monotone odd, and hence β(uk) > β(ū*k). Therefore Lk > 0 also for ū*k < ūk. This completes the proof of the optimality.

Now the stability of the error dynamics is shown. Note that for any continuous value function V(X), by differentiating V(X) along the augmented system trajectories, one has

dV(X)/dt = ∂V(X)/∂t + (∂V(X)/∂X)^T Ẋ = (∂V(X)/∂X)^T (F(X) + G(X)u)  (39)

so that

H(X, u, ∇V) = dV(X)/dt − γ V(X) + X^T QT X + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv.  (40)

Suppose now that V(X) satisfies the HJB equation H(X, u*, ∇V*) = 0 and is positive definite. Then, substituting u = u* gives

dV(X)/dt − γ V(X) + X^T QT X + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv = 0  (41)

or equivalently,

dV(X)/dt − γ V(X) = −X^T QT X − 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv.  (42)

Multiplying both sides of (42) by e^{−γt} gives

(d/dt)(e^{−γt} V(X)) = e^{−γt} [−X^T QT X − 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv] ≤ 0.  (43)

Eq. (43) shows that the tracking error is bounded for the optimal solution, but its asymptotic stability cannot be concluded. However, if γ = 0 (which can be chosen only if the reference input goes to zero), LaSalle's extension can be used to show that the tracking error is locally asymptotically stable. In fact, based on LaSalle's extension, the augmented state X = [ed, xd] goes to a region of R^{2n} wherein V̇ = 0. Considering that X^T QT X = ed^T Q ed with Q > 0, V̇ = 0 only if ed = 0 and u = 0. Since u = 0 also requires that ed = 0, therefore, for γ = 0 the tracking error is locally asymptotically stable with Lyapunov function V(X) > 0. This confirms that in the limit as the discount factor goes to zero, the control input u* makes the error dynamics (8) asymptotically stable. □

Note that although for γ ≠ 0 (which is essential if the reference trajectory does not go to zero) only boundedness of the tracking error is guaranteed for the optimal solution, one can make the tracking error as small as desired by choosing a small discount factor and/or a large Q. To demonstrate this, assume that the tracking error is nonzero. Then, considering that X^T QT X = ed^T Q ed

with Q > 0, the derivative of the Lyapunov function in (43) becomes negative and therefore the tracking error decreases until the exponential term e^{−γt} becomes zero and makes the derivative of the Lyapunov function zero. After that, we can only conclude that the tracking error does not increase anymore. The larger Q is, the faster the tracking error decreases and the smaller the tracking error that can be achieved. Moreover, the smaller the discount factor is, the more slowly the derivative of the Lyapunov function decays to zero and the smaller the tracking error that can be achieved. Consequently, by choosing a smaller discount factor and/or a larger Q, one can make the tracking error as small as desired before the value of e^{−γt} becomes very small.

Remark 6. The use of discounted cost functions is common in optimal regulation control problems and the same conclusion can be drawn for asymptotic stability of the system state in the optimal regulator problem, as is drawn here for asymptotic stability of the tracking error in the OTCP. However, the discount factor is a design parameter and, as is shown in optimal regulation control problems in the literature, it can be chosen small enough to make sure the system state goes to a very small region around zero. Simulation results in Section 5 confirm this conclusion for the OTCP.

3.3. Offline policy iteration algorithms for solving the OTCP

The tracking HJB equation (25) is a nonlinear partial differential equation which is extremely difficult to solve. In this subsection, two iterative offline policy iteration (PI) algorithms are presented for solving this equation. An IRL-based offline PI algorithm is given which is a basis for our online IRL algorithm presented in the next section.

Note that the tracking HJB equation (25) is nonlinear in the value function derivative ∇V*, while the tracking Bellman equation (18) is linear in the cost function derivative ∇V. Therefore, finding the value of a fixed control policy by solving (18) is easier than finding the optimal value function by solving (25). This is the motivation for introducing an iterative policy iteration (PI) algorithm for approximating the tracking HJB solution. The PI algorithm performs the following sequence of two-step iterations to find the optimal control policy.

Algorithm 1 (Offline PI Algorithm).
1. (policy evaluation) Given a control input u^i(X), find V^i(X) using the following Bellman equation:

X^T QT X + 2 ∫_0^{u^i} (λ tanh^{-1}(v/λ))^T R dv − γ V^i(X) + ∇V^{iT}(X)(F(X) + G(X) u^i) = 0.  (44)

2. (policy improvement) Update the control policy using

u^{i+1}(X) = −λ tanh((1/2λ) R^{-1} G^T(X) ∇V^i(X)).  (45)

Algorithm 1 is an extension of the offline PI algorithm in Abou-Khalaf and Lewis (2005) to the optimal tracking problem. The following theorem shows that this algorithm converges to the optimal solution of the HJB equation (25).

Theorem 2. If u^0 ∈ π(Ω), then u^i ∈ π(Ω), ∀i ≥ 1. Moreover, u^i converges to u* and V^i converges to V* uniformly on Ω.

Proof. See Abou-Khalaf and Lewis (2005) and Liu et al. (2013) for the same proof. □

The tracking Bellman equation (44) requires complete knowledge of the system dynamics. In order to find an equivalent formulation of the tracking Bellman equation that does not involve the dynamics, we use the IRL idea introduced in Vrabie and Lewis (2009) for the optimal regulation problem. Note that for any integral reinforcement interval T > 0, the value function (13) satisfies

V(X(t − T)) = ∫_{t−T}^t e^{−γ(τ−t+T)} [X(τ)^T QT X(τ) + U(u(τ))] dτ + e^{−γT} V(X(t)).  (46)

This IRL form of the tracking Bellman equation does not involve the system dynamics.

Lemma 2. The IRL tracking Bellman equation (46) and the tracking Bellman equation (18) are equivalent and have the same positive semi-definite solution for the value function.

Proof. See Liu et al. (2013) and Vrabie and Lewis (2009) for the same proof. □

Using the IRL tracking Bellman equation (46), the following IRL-based PI algorithm can be used to solve the tracking HJB equation (25) using only partial knowledge about the system dynamics.

Algorithm 2 (Offline IRL Algorithm).
1. (policy evaluation) Given a control input u^i(X), find V^i(X) using the tracking Bellman equation

V^i(X(t − T)) = ∫_{t−T}^t e^{−γ(τ−t+T)} [X(τ)^T QT X(τ) + U(u(τ))] dτ + e^{−γT} V^i(X(t)).  (47)

2. (policy improvement) Update the control policy using

u^{i+1}(X) = −λ tanh((1/2λ) R^{-1} G^T(X) ∇V^i(X)).  (48)

4. Online actor–critic for solving the tracking HJB equation using the IRL technique

In this section, an online solution to the tracking HJB equation (25) is presented which only requires partial knowledge about the system dynamics. The learning structure uses value function approximation (Finlayson, 1990; Werbos, 1989, 1992) with two NNs, namely an actor and a critic. Instead of sequentially updating the critic and actor NNs, as in Algorithm 2, both are updated simultaneously in real time. We call this synchronous online PI.

4.1. Critic NN and value function approximation

Assuming the value function is a smooth function, according to the Weierstrass high-order approximation theorem (Finlayson, 1990), there exists a single-layer neural network (NN) such that the solution V(X) and its gradient ∇V(X) can be uniformly approximated as

V(X) = W1^T φ(X) + εv(X)  (49)
∇V(X) = ∇φ(X)^T W1 + ∇εv(X)  (50)

where φ(X) ∈ R^l provides a suitable basis function vector, εv(X) is the approximation error, W1 ∈ R^l is a constant parameter vector and l is the number of neurons. Eq. (49) defines a critic NN with weights W1. It is known that the NN approximation error and its gradient are bounded over the compact set Ω, i.e. ∥εv(X)∥ ≤ bε and ∥∇εv(X)∥ ≤ bεx (Hornik, Stinchcombe, & White, 1990).
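Note that the policy-evaluation step (47) of Algorithm 2, combined with the critic parameterization (49), is linear in the unknown weights and can therefore be solved batchwise by least squares from measured data. The following sketch assumes that samples along the closed-loop trajectory under the current policy have already been collected; the function names, the basis φ, and the data layout are illustrative and not part of the paper.

```python
import numpy as np

def irl_policy_evaluation(Phi_prev, Phi_now, rewards, gamma, T):
    """
    One policy-evaluation step of Algorithm 2 (Eq. (47)), solved by least squares.

    Each sample k supplies:
      Phi_prev[k] = phi(X(t_k - T)),   Phi_now[k] = phi(X(t_k)),
      rewards[k]  = integral over [t_k - T, t_k] of
                    exp(-gamma*(tau - t_k + T)) * (X' Q_T X + U(u)) dtau,
    measured under the current policy u_i. With V^i(X) = W' phi(X), (47) reads
      W' (phi(X(t-T)) - exp(-gamma*T) phi(X(t))) = reward,
    which is linear in W.
    """
    A = Phi_prev - np.exp(-gamma * T) * Phi_now     # one row per sample
    W, *_ = np.linalg.lstsq(A, rewards, rcond=None)
    return W

def improved_policy(W, grad_phi, G, R_inv, lam):
    """Policy improvement (48): u_{i+1}(X) = -lam*tanh((1/(2*lam)) R^{-1} G(X)' dphi(X)' W)."""
    def u(X):
        return -lam * np.tanh((1.0 / (2.0 * lam)) * R_inv @ G(X).T @ grad_phi(X).T @ W)
    return u
```

In this sketch grad_phi(X) is the l-by-2n Jacobian of the chosen basis; collecting the data and iterating the two steps until the weights stop changing reproduces the offline IRL iteration described above under these assumptions.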

Assumption 4. The critic NN activation functions and their gradients are bounded, i.e. ∥φ(X)∥ ≤ bφ and ∥∇φ(X)∥ ≤ bφx.

The critic NN (49) is used to approximate the value function related to the IRL tracking Bellman equation (46). Using the value approximation (49) in the IRL tracking Bellman equation (46) yields

εB(t) ≡ ∫_{t−T}^t e^{−γ(τ−t+T)} [X(τ)^T QT X(τ) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv] dτ + W1^T ∆φ(X(t))  (51)

where

∆φ(X(t)) = e^{−γT} φ(X(t)) − φ(X(t − T))  (52)

and εB is the tracking Bellman equation error due to the NN approximation error. Under Assumption 4, this approximation error is bounded on the compact set Ω. That is, there exists a constant bound εmax for εB such that sup_{∀t} ∥εB∥ ≤ εmax.

We now present the tuning and convergence of the critic NN weights for a fixed control policy, in effect designing an observer for the unknown value function. As the ideal critic NN weights vector W1, which provides the best approximate solution to the tracking Bellman equation (51), is unknown, it is approximated in real time as

V̂(X) = Ŵ1^T φ(X)  (53)

where Ŵ1 is the current estimation of W1. Therefore, the approximate IRL tracking Bellman equation becomes

eB(t) = ∫_{t−T}^t e^{−γ(τ−t+T)} [X(τ)^T QT X(τ) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv] dτ + Ŵ1^T ∆φ(X(t)).  (54)

Eq. (54) can be written as

eB(t) = Ŵ1(t)^T ∆φ(X(t)) + p(t)  (55)

where

p(t) = ∫_{t−T}^t e^{−γ(τ−t+T)} [X(τ)^T QT X(τ) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv] dτ  (56)

is the integral reinforcement reward. The tracking Bellman error eB in Eqs. (54) and (55) is the continuous-time counterpart of the temporal difference (TD) error (Sutton & Barto, 1998). The problem of finding the value function is now converted to adjusting the critic NN weights such that the TD error eB is minimized. Consider the objective function

EB = (1/2) eB^2.  (57)

From (54) and using the chain rule, the gradient descent algorithm for EB is given by

Ŵ̇1 = −α1 (1/(1 + ∆φ^T ∆φ)^2) ∂EB/∂Ŵ1 = −α1 (∆φ/(1 + ∆φ^T ∆φ)^2) eB  (58)

where α1 > 0 is the learning rate and (1 + ∆φ^T ∆φ)^2 is used for normalization. Note that the square of the denominator, i.e., (1 + ∆φ^T ∆φ)^2, is used in (58) to assure the stability of the critic weights error W̃1. Define

∆φ̄ = ∆φ/(1 + ∆φ^T ∆φ).  (59)

The proof of the convergence of the critic NN weights is given in the following theorem.

Theorem 3. Let u be any admissible bounded control policy and consider the adaptive law (58) for tuning the critic NN weights. If ∆φ̄ in (59) is persistently exciting (PE), i.e. if there exist γ1 > 0 and γ2 > 0 such that ∀t > 0

γ1 I ≤ ∫_t^{t+T1} ∆φ̄(τ) ∆φ̄^T(τ) dτ ≤ γ2 I,  (60)

then,
(a) For εB(t) = 0 (no reconstruction error), the critic weight estimation error converges to zero exponentially fast.
(b) For bounded reconstruction error, i.e., ∥εB(t)∥ < εmax, the critic weight estimation error converges exponentially fast to a residual set.

Proof. Using the IRL tracking Bellman equation (51) one has

∫_{t−T}^t e^{−γ(τ−t+T)} [2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv + X(τ)^T QT X(τ)] dτ = −W1^T ∆φ(X(t)) + εB(t).  (61)

Substituting (61) in (54), the tracking Bellman equation error becomes

eB(t) = W̃1^T(t) ∆φ(t) + εB(t)  (62)

where W̃1 = W1 − Ŵ1 is the critic weights estimation error. Using (62) in (58) and denoting m = 1 + ∆φ^T ∆φ, the critic weights estimation error dynamics becomes

W̃̇1(t) = −α1 ∆φ̄(t) ∆φ̄(t)^T W̃1(t) + α1 (∆φ̄(t)/m(t)) εB(t).  (63)

This estimation error is the same as the critic weight estimation error obtained in Vamvoudakis and Lewis (2010) and the remainder of the proof is identical to the proof of Theorem 1 in Vamvoudakis and Lewis (2010). □

Remark 7. The critic estimation error equation (63) implies that ∆φ̄^T W̃1 is bounded. However, in general the boundedness of ∆φ̄^T W̃1 does not imply the boundedness of W̃1. Theorem 3 shows that if the PE condition (60) is satisfied, then the boundedness of ∆φ̄^T W̃1 implies the boundedness of the state W̃1. We shall use this property in the proof of Theorem 4.

4.2. Synchronous actor–critic based IRL algorithm to learn the solution to the OTCP

In this section, an online IRL algorithm is given which involves simultaneous or synchronous tuning of the actor and critic NNs to find the optimal value function and control policy related to the OTCP, adaptively.

Assume that the optimal value function solution to the tracking HJB equation is approximated by the critic NN in (49). Then, using (50) in (22), the optimal policy is obtained by

u = −λ tanh((1/2λ) R^{-1} G^T (∇φ^T W1 + ∇εv)).  (64)

To see the effect of the error ∇εv on the tracking HJB equation, note that using integration by parts we have

∫_{t−T}^t e^{−γ(τ−t+T)} φ̇ dτ = ∫_{t−T}^t e^{−γ(τ−t+T)} ∇φ (F + Gu) dτ = ∆φ(X) + γ ∫_{t−T}^t e^{−γ(τ−t+T)} φ(X) dτ  (65)

or equivalently

∆φ(X) = ∫_{t−T}^t e^{−γ(τ−t+T)} ∇φ(X)(F + Gu) dτ − γ ∫_{t−T}^t e^{−γ(τ−t+T)} φ(X) dτ.  (66)

Also, note that U(u) in (23) for the optimal control input given by (64) becomes

U(u) = 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv = −W1^T ∇φ G u + λ^2 R̄ ln(1 − tanh^2(D + (1/2λ) R^{-1} G^T ∇εv)).  (67)

Using (66) and (67) for the third and second terms of (51), respectively, the following tracking HJB equation is obtained:

∫_{t−T}^t e^{−γ(τ−t+T)} (X^T QT X − γ W1^T φ + W1^T ∇φ F + λ^2 R̄ ln(1 − tanh^2(D))) dτ + εHJB = 0  (68)

where D = (1/2λ) R^{-1} G^T ∇φ^T W1 and εHJB, i.e., the HJB approximation error due to the function approximation error, is

εHJB = ∫_{t−T}^t e^{−γ(τ−t+T)} [∇εv^T F + λ^2 R̄ ln(1 − tanh^2(D + (1/2λ) R^{-1} G^T ∇εv)) − λ^2 R̄ ln(1 − tanh^2(D)) − γ εv] dτ.  (69)

Since the NN approximation error is bounded, there exists a constant error bound εh, so that sup ∥εHJB∥ ≤ εh. We should note that the choice of the NN structure to make the error bound εh arbitrarily small is commonly carried out by computer simulation in the literature. We assume here that the NN structure is specified by the designer, and the only unknowns are the NN weights.

To approximate the solution to the tracking HJB equation (68), the critic and actor NNs are employed. The critic NN given by (53) is used to approximate the unknown approximate optimal value function. Assuming that Ŵ1 is the current estimation of the optimal critic NN weights W1, then using (64) the policy update law can be obtained as

u1 = −λ tanh((1/2λ) R^{-1} G^T ∇φ^T Ŵ1).  (70)

However, this policy update law does not guarantee the stability of the closed-loop system. It is necessary to use a second neural network Ŵ2^T ∇φ for the actor because the control input must not only solve the stationarity condition (22), but also guarantee system stability while converging to the optimal solution. This is seen in the Lyapunov proof of Theorem 4. Hence, to assure stability in a Lyapunov sense, the following actor NN is employed:

û1 = −λ tanh((1/2λ) R^{-1} G^T ∇φ^T Ŵ2)  (71)

where Ŵ2 is the actor NN weights vector and it is considered as the current estimated value of W1. Define the actor NN estimation error as

W̃2 = W1 − Ŵ2.  (72)

Note that using the actor û1 in (71), the IRL Bellman equation error is now given by

∫_{t−T}^t e^{−γ(τ−t+T)} [X(τ)^T QT X(τ) + Û] dτ + Ŵ1^T ∆φ(X(t)) = êB(t)  (73)

where

Û = 2 ∫_0^{û1} (λ tanh^{-1}(v/λ))^T R dv.  (74)

Then, the critic update law (58) becomes

Ŵ̇1 = −α1 (∆φ/(1 + ∆φ^T ∆φ)^2) êB.  (75)

Define the error eu as the difference between the control input û1 in (71) applied to the system and the control input u1 in (70), which is an approximation of the optimal control input given by (22) with V* approximated by (53). That is,

eu = û1 − u1 = λ [tanh((1/2λ) R^{-1} G^T ∇φ^T Ŵ2) − tanh((1/2λ) R^{-1} G^T ∇φ^T Ŵ1)].  (76)

The objective function to be minimized by the actor NN is now defined as

Eu = eu^T R eu.  (77)

Then, the gradient-descent update law for the actor NN weights becomes

Ŵ̇2 = −α2 (∇φ G eu + ∇φ G tanh^2(D̂) eu + Y Ŵ2)  (78)

where

D̂ = (1/2λ) R^{-1} G^T ∇φ^T Ŵ2,  (79)

Y > 0 is a design parameter and the last term of (78) is added to assure stability.

Before presenting our main theorem, note that based on Assumption 2 and the boundedness of the command generator dynamics hd, for the drift dynamics of the augmented system F one has

∥F(X)∥ ≤ bF1 ∥ed∥ + bF2  (80)

for some bF1 and bF2.

Theorem 4. Given the dynamical system (1) and the command generator (4), let the tracking control law be given by the actor NN (71). Let the update laws for tuning the critic and actor NNs be provided by (75) and (78), respectively. Let Assumptions 1–4 hold and ∆φ̄ in (59) be persistently exciting. Then there exists a T0 defined by (A.25) such that for the integral reinforcement interval T < T0 the tracking error ed in (2), the critic NN error W̃1, and the actor NN error W̃2 in (72) are UUB.

Proof. See Appendix. □

Remark 8. The stability analysis in the proof of Theorem 4 differs from the stability proofs presented in Modares et al. (2013), Vamvoudakis and Lewis (2010) and Vamvoudakis et al. (in press) from at least two different perspectives. First, the actor update law in the mentioned papers is derived entirely by the stability analysis, whereas the actor update law in our paper is based on the minimization of the error between the actor neural network and the approximate optimal control input. Moreover, in this paper the optimal tracking problem is considered, not the optimal regulation problem, and the tracking HJB equation has an additional term depending on the discount factor compared to the regulation HJB equation considered in the mentioned papers.

Remark 9. The proof of Theorem 4 shows that the integral reinforcement learning time interval T cannot be too big. Moreover, since d and N in Eqs. (A.24) and (A.28) should be bigger than zero for an arbitrary ε > 0, one can conclude that the bigger the reinforcement interval T is, the bigger the parameter Y in learning rule (78) should be chosen to assure stability.
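As an implementation-level sketch of the synchronous laws, the code below performs one Euler step of the critic update (75) and the actor update (78), with D̂ computed as in (79) and eu as in (76). It assumes the quantities ∆φ, êB, ∇φ(X) and G(X) have been measured or computed elsewhere, interprets the product tanh²(D̂)eu elementwise, and uses illustrative learning rates and step size; none of the names or values below are prescribed by the paper.

```python
import numpy as np

def synchronous_update(W1, W2, dphi, e_hat_B, grad_phi_X, G_X, R_inv, lam,
                       alpha1=50.0, alpha2=1.0, Y=0.1, dt=1e-3):
    """
    One Euler step of the synchronous IRL tuning laws (sketch).

    dphi       : Delta-phi of (52), e^{-gamma*T} phi(X(t)) - phi(X(t-T))
    e_hat_B    : IRL Bellman error of (73)
    grad_phi_X : l x 2n Jacobian of the critic basis at X
    G_X        : 2n x m augmented input matrix (12) at X
    """
    # controls generated by the actor and by the critic weights, (71) and (70)
    D_hat = (1.0 / (2.0 * lam)) * R_inv @ G_X.T @ grad_phi_X.T @ W2   # (79)
    D1    = (1.0 / (2.0 * lam)) * R_inv @ G_X.T @ grad_phi_X.T @ W1
    u_hat1, u1 = -lam * np.tanh(D_hat), -lam * np.tanh(D1)
    e_u = u_hat1 - u1                                                 # (76)

    # critic update (75): normalized gradient descent on the IRL Bellman error
    m = 1.0 + dphi @ dphi
    W1_dot = -alpha1 * dphi / m**2 * e_hat_B

    # actor update (78), with tanh^2(D_hat)*e_u taken elementwise
    W2_dot = -alpha2 * (grad_phi_X @ G_X @ e_u
                        + grad_phi_X @ G_X @ (np.tanh(D_hat)**2 * e_u)
                        + Y * W2)

    return W1 + dt * W1_dot, W2 + dt * W2_dot, u_hat1
```

In such a sketch, u_hat1 is the control actually applied to the system at each step while both weight vectors are tuned simultaneously, which is the synchronous structure analyzed in Theorem 4.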

Fig. 1. Mass, spring and damper system.
Fig. 2. Convergence of the critic NN weights.

5. Simulation results

In this section, a simulation example is given to show the effectiveness of the proposed method. Fig. 1 shows a spring, mass and damper system.

The simulation results constitute two parts. In the first part, the spring and damper are considered to be linear and the actuator bound is chosen large enough to make sure the control input does not exceed this bound, and it is shown how the proposed algorithm converges to the optimal solution for a linear system in the absence of the input constraints. Note that there are no known solutions to optimal control problems for linear systems with input constraints to compare our results to. In the second part, the spring is considered to be nonlinear and the actuator saturation is also considered, to show the effectiveness of the proposed method for control of nonlinear systems in the presence of the input constraints.

5.1. Linear system without actuator saturation

In this subsection, the results of the proposed method are compared to the results of the standard solution given in Section 2.2, and it is also shown that the proposed method converges to the optimal solution in the absence of the control bounds. To this end, the actuator bounds are chosen large enough to make sure the control input does not exceed these bounds.

Assuming that both spring and damper are linear, the spring–mass–damper system is described by the following dynamics:

ẋ1 = x2
ẋ2 = −(k/m) x1 − (c/m) x2 + (1/m) u(t)  (81)

where y = x1, x1 and x2 are the position and velocity, m is the mass of the object, k is the stiffness constant of the spring and c is the damping. The true parameters are set as m = 1 kg, c = 0.5 N·s/m and k = 5 N/m. Note that in our control design, only the input dynamics is needed to be known, which is given by m.

The desired trajectories for x1 and x2 are considered as

xd1(t) = 0.5 sin(√5 t)  (82)

and

xd2(t) = 0.5 √5 cos(√5 t)  (83)

which are generated by the following command generator dynamics:

ẋd = [0 1; −5 0] xd  (84)

with initial condition xd(0) = [0.5, 0.5]. Therefore, the augmented system (10) becomes

Ẋ = [0 1 0 0; −5 −0.5 0 −0.5; 0 0 0 1; 0 0 −5 0] X + [0; 1; 0; 0] u ≡ T X + B1 u  (85)

where X = [X1, X2, X3, X4] = [ed1, ed2, xd1, xd2].

The input saturation limit is considered as 5 N, i.e., |u| ≤ 5. The nonquadratic performance index is chosen as (13) with R = 1, Q = 10 I and γ = 0.1.

As actuator saturation does not occur, the optimal value function should be close to the value function of the linear quadratic tracking (LQT) problem. By the LQT problem, we mean the optimal tracking problem for linear systems with quadratic performance functions. In fact, for the augmented system (85) with a quadratic performance function U(u) = u^T R u in (13), the value function is in the quadratic form V(X) = X^T P X and therefore the HJB equation (21) converts to the following algebraic Riccati equation (ARE):

0 = T^T P + P T − γ P − P B1 R^{-1} B1^T P + QT.  (86)

Efficient numerical methods exist to find the solution to this ARE, which we can compare our results to.

We now simulate our proposed method as in Theorem 4. As we expect the optimal critic to be quadratic in the system state in the absence of the control bounds, the critic NN is chosen as

V(X) = W^T φ(X)  (87)

where

W = [W1, ..., W10]^T,
φ(X) = [X1^2, X1X2, X1X3, X1X4, X2^2, X2X3, X2X4, X3^2, X3X4, X4^2]^T.  (88)

The reinforcement interval T is selected as 0.1. A small probing noise is added to the control input to excite the system states. Fig. 2 shows the convergence of the critic parameters, which converge to

W = [17.94, 0.77, −2.01, −0.29, 2.86, 0.07, −0.59, 9.86, −0.08, 1.84].

The optimal control solution (22) then becomes

u = −5 tanh(0.155 ed1 + 0.577 ed2 + 0.014 xd1 − 0.112 xd2).

Note that the optimal critic weights obtained by solving the ARE are

W* = [18.05, 0.77, −1.98, −0.34, 2.88, 0.08, −0.56, 9.77, −0.08, 1.87]

which are the components of the ARE solution matrix P in (86) and confirm the convergence of our algorithm close to the optimal control solution.
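For this LQT comparison, the discounted ARE (86) can be solved with standard tools by noting that the −γP term is absorbed by shifting T to T − (γ/2)I, which turns (86) into a standard continuous-time ARE. The sketch below assumes SciPy is available; the matrix entries are assembled here from (81)–(84) with the stated parameters, and the mapping to critic weights follows the quadratic basis (88).

```python
import numpy as np
from scipy.linalg import solve_continuous_are

gamma = 0.1
# Augmented matrices of (85), assembled from (81)-(84) with m = 1, k = 5, c = 0.5.
T_mat = np.array([[ 0.0,  1.0,  0.0,  0.0],
                  [-5.0, -0.5,  0.0, -0.5],
                  [ 0.0,  0.0,  0.0,  1.0],
                  [ 0.0,  0.0, -5.0,  0.0]])
B1 = np.array([[0.0], [1.0], [0.0], [0.0]])
QT = np.diag([10.0, 10.0, 0.0, 0.0])      # Q_T = blkdiag(Q, 0) with Q = 10 I
R  = np.array([[1.0]])

# (86): T'P + PT - gamma*P - P B1 R^{-1} B1' P + Q_T = 0 is a standard ARE for the
# shifted matrix T - (gamma/2) I, since (T - (gamma/2)I)'P + P(T - (gamma/2)I)
# equals T'P + PT - gamma*P.
P = solve_continuous_are(T_mat - 0.5 * gamma * np.eye(4), B1, QT, R)

# Map P to critic weights in the quadratic basis (88): [X1^2, X1X2, ..., X4^2].
idx = [(i, j) for i in range(4) for j in range(i, 4)]
W_star = [P[i, j] if i == j else 2.0 * P[i, j] for (i, j) in idx]
print(np.round(W_star, 2))
```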

Fig. 3. Control input for the standard method.
Fig. 4. The system state x1 versus the desired trajectory x1d for the standard method.
Fig. 5. The system state x2 versus the desired trajectory x2d for the standard method.
Fig. 6. Control input for the proposed method.
Fig. 7. The system state x1 versus the desired trajectory x1d for the proposed method.
Fig. 8. The system state x2 versus the desired trajectory x2d for the proposed method.

For the standard solution, the steady-state part of the control input obtained using (6) and the system and command generator dynamics becomes

ud = [0.25, 0.25] xd(t).

The optimal feedback part of the control input obtained by minimizing the performance index (7) is

ue = [−0.50, −0.25] ed(t).

Thus, the optimal control is given by

u = −0.50 ed1 − 0.25 ed2 + 0.25 xd1 + 0.25 xd2.

Figs. 3–8 show the system state and the control input for both the proposed and the standard methods, starting the system from a specific initial condition. From these figures, it can be concluded that although, in contrast to the standard method, the proposed method does not require the system drift dynamics, its transient response is better than that of the standard method.

Remark 10. According to Theorem 4, the error bounds for the optimal control solution depend on the NN approximation errors, the HJB residual, and the unknown critic NN weights. If the number of NN hidden layers is chosen appropriately, which is fulfilled for the linear system provided here, all of these go to zero except for the unknown critic NN weights. However, these bounds are in fact conservative and the simulation results show that the value function and the optimal control solution are closely identified.

5.2. Nonlinear system and considering the actuator bound

In this subsection, it is considered that the spring is nonlinear with the nonlinearity k(x) = −x^3 and therefore the system dynamics becomes

ẋ1 = x2
ẋ2 = −x1^3 − 0.5 x2 + u(t).  (89)

Now suppose that the control bound is |u| ≤ 0.25.

To find the optimal solution using the proposed method, the critic NN is chosen as a power series neural network with 45 activation functions containing powers of the state variables of the augmented system up to order four. That is, the critic is chosen as

Fig. 9. Control input while considering the actuator saturation.
Fig. 10. The system state x1 versus the desired trajectory x1d while considering the actuator saturation.
Fig. 11. The system state x2 versus the desired trajectory x2d while considering the actuator saturation.

(87) with weights and activation functions as

W = [W1, ..., W45]^T
φ(X) = [X1^2, X1X2, X1X3, X1X4, X2^2, X2X3, X2X4, X3^2, X3X4, X4^2, X1^4, X1^3X2, X1^3X3, X1^3X4, X1^2X2^2, X1^2X2X3, X1^2X2X4, X1^2X3^2, X1^2X3X4, X1^2X4^2, X1X2^3, X1X2^2X3, X1X2^2X4, X1X2X3^2, X1X2X3X4, X1X2X4^2, X1X3^3, X1X3^2X4, X1X3X4^2, X1X4^3, X2^4, X2^3X3, X2^3X4, X2^2X3^2, X2^2X3X4, X2^2X4^2, X2X3^3, X2X3^2X4, X2X3X4^2, X2X4^3, X3^4, X3^3X4, X3^2X4^2, X3X4^3, X4^4]^T.  (90)

The reinforcement interval T is selected as 0.1. As no verifiable method exists to ensure PE in nonlinear systems, a small exploratory signal consisting of sinusoids of varying frequencies, i.e., n(t) = 0.3 sin^2(8t) cos(2t) + 0.3 sin^4(20t) cos(7t), is added to the control input to excite the system states and ensure the PE condition qualitatively. The critic weights vector finally converges to

W = [9.038, 3.95, −1.20, −1.64, 2.41, 0.71, −1.06, 14.28, 0.38, 2.93, −2.97, −0.75, 4.60, −2.40, −3.33, 1.79, 2.18, 3.11, 0.69, −2.45, −2.23, 1.70, 2.02, 0.94, 0.43, 1.21, −0.47, −0.75, 0.54, 1.31, 0.03, 1.70, 0.81, 0.88, −0.02, −0.76, 0.84, −0.15, −3.14, −0.83, 4.11, 0.29, 0.86, −0.88, 0.07].

Figs. 9–11 show the performance of the proposed method.

6. Conclusion

A new formulation of the optimal tracking control problem was presented in this paper. A tracking constrained HJB equation was derived where both feedback and feedforward parts of the bounded optimal control input are obtained simultaneously by solving this HJB equation. An online integral reinforcement learning algorithm was presented to find the solution to the tracking HJB equation for partially-unknown constrained-input systems. The proposed method does not require any preceding identification procedure. The stability of the whole system and convergence to a near-optimal control solution were shown.

Acknowledgments

This work is supported by NSF grant ECCS-1128050, ONR grant N00014-13-1-0562, AFOSR EOARD Grant 13-3055, ARO grant W911NF-11-D-0001, China NNSF grant 61120106011, and China Education Ministry Project 111 (No. B08015).

Appendix. Proof of Theorem 4

Consider the following Lyapunov function:

J(t) = V(t) + (1/2) W̃1(t)^T α1^{-1} W̃1(t) + (1/2) W̃2(t)^T α2^{-1} W̃2(t)  (A.1)

where V(t) is the optimal value function. The derivative of the Lyapunov function is given by

J̇ = V̇ + W̃1^T α1^{-1} W̃̇1 + W̃2^T α2^{-1} W̃̇2.  (A.2)

Before evaluating (A.2), note that putting (66) and the tracking HJB equation (68) in the IRL tracking Bellman equation (73) gives

êB(t) = ∫_{t−T}^t e^{−γ(τ−t+T)} (X^T QT X + Û − γ Ŵ1(t)^T φ + Ŵ1(t)^T ∇φ (F − G λ tanh(D̂))) dτ
= ∫_{t−T}^t e^{−γ(τ−t+T)} (Û − U − γ Ŵ1(t)^T φ + Ŵ1(t)^T ∇φ (F − G λ tanh(D̂)) + γ W1^T φ − W1^T ∇φ (F − G λ tanh(D)) + εHJB) dτ  (A.3)

where Û is defined in (74) and is given by

Û = Ŵ2^T ∇φ G λ tanh(D̂) + λ^2 R̄ ln(1 − tanh^2(D̂))  (A.4)

and

U = W1^T ∇φ G λ tanh(D) + λ^2 R̄ ln(1 − tanh^2(D))  (A.5)

is the cost (15) for the optimal control input u = −λ tanh((1/2λ) R^{-1} G^T ∇φ^T W1).

Using (66) and some manipulations, (A.3) becomes

êB(t) = −∆φ^T W̃1(t) + ∫_{t−T}^t e^{−γ(τ−t+T)} (Û − U + W1^T ∇φ (F − G λ tanh(D̂)) − W1^T ∇φ (F − G λ tanh(D)) + εHJB) dτ.  (A.6)

Using (A.4) and (A.5) and some manipulations, Û − U can be written as (Modares et al., 2014)

Û − U = Ŵ2^T ∇φ G λ tanh(D̂) + W̃2^T ∇φ G λ sgn(D̂) − W1^T ∇φ G λ tanh(D) − W1^T ∇φ G λ (sgn(D̂) − sgn(D)) + λ^2 R̄ (εD̂ − εD)  (A.7)

where εD̂ and εD are some bounded approximation errors. Substituting (A.7) in (A.6) gives

êB(t) = −∆φ^T W̃1(t) + ∫_{t−T}^t e^{−γ(τ−t+T)} W̃2^T M dτ + E  (A.8)

where

M = ∇φ G λ (tanh(D̂) − sgn(D̂))  (A.9)

and

E = ∫_{t−T}^t e^{−γ(τ−t+T)} [W1^T ∇φ G λ (sgn(D̂) − sgn(D)) + λ^2 R̄ (εD̂ − εD) + εHJB] dτ.  (A.10)

Note that M and E are bounded.

We now evaluate the derivative of the Lyapunov function (A.2). For the first term of (A.2), one has

V̇ = W1^T ∇φ (F − G λ tanh(D̂)) + ε0  (A.11)

where

ε0(x) = ∇εv^T (F − G λ tanh(D̂)).  (A.12)

According to Assumption 2 and the definition of G in (12), one has

∥G∥ ≤ bG.  (A.13)

Using Assumption 4, (A.13) and (80), and taking the norm of ε0 in (A.12) yields

∥ε0(x)∥ ≤ bεx bF1 ∥ed∥ + λ bεx bG + bF2.  (A.14)

Using the tracking HJB equation (68), the first term of (A.11) becomes

W1^T ∇φ F = −ed^T Q ed − U − γ W1^T φ + W1^T ∇φ G λ tanh(D) + εHJB  (A.15)

where U > 0 is defined in (A.5).

Also, using W1 = Ŵ2 + W̃2 and the fact that x^T tanh(x) > 0 ∀x, for the second term of (A.11) one has

W1^T ∇φ G λ tanh(D̂) > W̃2^T ∇φ G λ tanh(D̂).  (A.16)

Using (A.14)–(A.16) and Assumption 4, (A.11) becomes

V̇ < −λmin(Q) ∥e∥^2 + k1 ∥e∥ + k2 − W̃2^T ∇φ G λ tanh(D̂)  (A.17)

where k1 = bεx bF1 and k2 = 2λ bG bφx ∥W1∥ + γ ∥W1∥ ∥φ∥ + λ bεx bG + bF2 + εh, and εh is the bound for εHJB.

For the second term of (A.2), using (A.8) in (58), W̃̇1(t) becomes

W̃̇1(t) = −α1 ∆φ̄ ∆φ̄^T W̃1(t) − α1 (∆φ̄/m) ∫_{t−T}^t e^{−γ(τ−t+T)} W̃2^T(τ) M dτ − α1 (∆φ̄/m) E  (A.18)

and therefore

J̇1 ≜ W̃1^T(t) α1^{-1} W̃̇1(t) = −W̃1^T(t) ∆φ̄ ∆φ̄^T W̃1(t) − W̃1^T(t) (∆φ̄/m) ∫_{t−T}^t e^{−γ(τ−t+T)} W̃2^T(τ) M dτ − W̃1^T(t) (∆φ̄/m) E.  (A.19)

For a small enough reinforcement interval, the integral term of (A.19) can be approximated by the right-hand rectangle method (with only one rectangle) as

∫_{t−T}^t e^{−γ(τ−t+T)} W̃2^T(τ) M dτ ≈ T e^{−γT} M^T W̃2(t).  (A.20)

Using (A.20) in (A.19) gives

J̇1 = −W̃1^T(t) ∆φ̄ ∆φ̄^T W̃1(t) − W̃1^T(t) (∆φ̄/m) E − (T e^{−γT}/m) W̃1^T(t) ∆φ̄ M^T W̃2(t).  (A.21)

By applying the Young inequality to the last term of (A.21), one has

(T e^{−γT}/m) W̃1^T(t) ∆φ̄ M^T W̃2(t) ≤ (T^2 e^{−2γT}/(2ε)) W̃1^T(t) ∆φ̄ ∆φ̄^T W̃1(t) + (ε/(2m^2)) W̃2(t)^T M M^T W̃2(t)  (A.22)

for every ε > 0. Using (A.22) in (A.21) yields

J̇1 ≤ −d W̃1^T(t) ∆φ̄ ∆φ̄^T W̃1(t) − W̃1^T(t) (∆φ̄/m) E + (ε/(2m^2)) W̃2(t)^T M M^T W̃2(t)  (A.23)

where

d = 1 − T^2 e^{−2γT}/(2ε).  (A.24)

Define T0 as a constant that satisfies

T0^2 e^{−2γT0} = 2ε.  (A.25)

Then d > 0 if T < T0.

Finally, for the last term of (A.2), using (78) and the definitions Ŵ1(t) = W1 − W̃1(t) and Ŵ2(t) = W1 − W̃2(t), one has

J̇2 ≜ W̃2(t)^T α2^{-1} W̃̇2(t) = −W̃2(t)^T Y W̃2(t) + W̃2(t)^T λ ∇φ G tanh(D̂) + W̃2(t)^T k3  (A.26)

where k3 = −λ ∇φ G tanh((1/2λ) R^{-1} G^T ∇φ^T Ŵ1) + Y W1 + λ ∇φ G tanh^2(D̂) eu. Based on the definitions of eu in (76) and G in (12) and Assumptions 3 and 4, k3 is bounded.

Using (A.17), (A.23) and (A.26) in (A.2), J̇ becomes

J̇ < −λmin(Q) ∥e∥^2 + k1 ∥e∥ + k2 − d W̃1^T(t) ∆φ̄ ∆φ̄^T W̃1(t) − W̃1(t)^T (∆φ̄/m) E − W̃2(t)^T N W̃2(t) + W̃2(t)^T k3  (A.27)

where

N = Y − (ε/(2m^2)) M M^T.  (A.28)

If we choose T and Y such that d in (A.24) and N in (A.28) are bigger than zero, then J̇ becomes negative, provided that

\[
\|e\| > \frac{k_1}{2\lambda_{\min}(Q)} + \sqrt{\frac{k_1^2}{4\lambda_{\min}^2(Q)} + \frac{k_2}{\lambda_{\min}(Q)}} \tag{A.29}
\]
\[
\bigl\|\Delta\bar\phi^T \tilde W_1\bigr\| > \frac{\|E\|}{d} \tag{A.30}
\]
\[
\bigl\|\tilde W_2\bigr\| > \frac{\|k_3\|}{\lambda_{\min}(N)}. \tag{A.31}
\]
Since (A.30) holds for the output $\Delta\bar\phi^T \tilde W_1$ of the error dynamics (63), as shown in Theorem 3, the persistently exciting signal $\Delta\bar\phi$ guarantees that the state $\tilde W_1$ is UUB. $\square$
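As a clarifying remark (not part of the original proof), the radius in (A.29) is simply the positive root of the quadratic formed by the first three terms of (A.27): for $\|e\| \ge 0$,
\[
-\lambda_{\min}(Q)\,\|e\|^2 + k_1\,\|e\| + k_2 < 0 \iff \|e\| > \frac{k_1 + \sqrt{k_1^2 + 4\,\lambda_{\min}(Q)\,k_2}}{2\,\lambda_{\min}(Q)} = \frac{k_1}{2\lambda_{\min}(Q)} + \sqrt{\frac{k_1^2}{4\lambda_{\min}^2(Q)} + \frac{k_2}{\lambda_{\min}(Q)}},
\]
which is exactly the bound in (A.29).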
Hamidreza Modares received the B.S. degree from University of Tehran in 2004 and the M.S. degree from Shahrood University of Technology in 2006. He joined Shahrood University of Technology as a University Lecturer from 2006 to 2009. He is currently working towards the Ph.D. degree at University of Texas at Arlington. His research interests include optimal control, reinforcement learning, approximate dynamic programming, neural adaptive control and pattern recognition.

Frank L. Lewis is a Member of National Academy of Inventors, Fellow IEEE, Fellow IFAC, Fellow UK Institute of Measurement and Control, PE Texas, and UK Chartered Engineer. He is also a UTA Distinguished Scholar Professor, UTA Distinguished Teaching Professor, and Moncrief-O’Donnell Chair at The University of Texas at Arlington Research Institute. He is also an IEEE Control Systems Society Distinguished Lecturer. He obtained the Bachelor’s Degree in Physics/EE and the MSEE at Rice University, the M.S. in Aeronautical Engineering from University of West Florida, and the Ph.D. at Georgia Institute of Technology. He works in feedback control, reinforcement learning, intelligent systems, and distributed control systems. He is the author of 6 US patents, 273 journal papers, 375 conference papers, 15 books, 44 chapters, and 11 journal special issues. He received the Fulbright Research Award, NSF Research Initiation Grant, ASEE Terman Award, Int. Neural Network Soc. Gabor Award 2009, and UK Institute of Measurement and Control Honeywell Field Engineering Medal 2009. He received IEEE Computational Intelligence Society Neural Networks Pioneer Award 2012. He was a Distinguished Foreign Scholar at Nanjing University of Science and Technology. He was also Project 111 Professor at Northeastern University, China. He received Outstanding Service Award from Dallas IEEE section and was selected as Engineer of the Year by Ft. Worth IEEE Section. He was listed in Ft. Worth Business Press Top 200 Leaders in Manufacturing. He received the 2010 IEEE Region 5 Outstanding Engineering Educator Award and the 2010 UTA Graduate Dean’s Excellence in Doctoral Mentoring Award. He was elected to UTA Academy of Distinguished Teachers in 2012. He served on the NAE Committee on Space Station in 1995. He is the Founding Member of the Board of Governors of the Mediterranean Control Association. He helped win the IEEE Control Systems Society Best Chapter Award (as Founding Chairman of DFW Chapter), the National Sigma Xi Award for Outstanding Chapter (as President of UTA Chapter), and the US SBA Tibbetts Award in 1996 (as Director of ARRI’s SBIR Program).