Optimal Tracking Control of Nonlinear Partially-unknown Constrained-Input Systems Using Integral Reinforcement Learning

H. Modares, F.L. Lewis

Automatica

1. Introduction

Reinforcement learning (RL) (Bertsekas & Tsitsiklis, 1996; Powell, 2007; Sutton & Barto, 1998), inspired by learning mechanisms observed in mammals, is concerned with how an agent or actor ought to take actions so as to optimize a cost of its long-term interactions with the environment. The agent or actor learns an optimal policy by modifying its actions based on stimuli received in response to its interaction with its environment. Similar to RL, optimal control involves finding an optimal policy based on optimizing a long-term performance criterion. Strong connections between RL and optimal control have prompted a major effort towards introducing and developing online and model-free RL algorithms to learn the solution to optimal control problems (Lewis & Liu, 2012; Vrabie, Vamvoudakis, & Lewis, 2013; Zhang, Liu, Luo, & Wang, 2012).

During the last few years, RL methods have been successfully used to solve optimal regulation problems by learning the solution to the so-called Hamilton–Jacobi–Bellman (HJB) equation (Lewis, Vrabie, & Syrmos, 2012; Liu & Wei, 2013). For continuous-time systems, Vrabie and Lewis (2009) and Vrabie, Pastravanu, Abou-Khalaf, and Lewis (2009) proposed a promising RL algorithm, called integral reinforcement learning (IRL), to learn the solution to the HJB equation using only partial knowledge about the system dynamics. They used an iterative online policy iteration (PI) (Howard, 1960) procedure to implement their IRL algorithm. Later, inspired by Vrabie and Lewis (2009) and Vrabie et al. (2009), some online PI algorithms were presented to solve the optimal regulation problem for completely unknown linear systems (Jiang & Jiang, 2012; Lee, Park, & Choi, 2012). Also, in Liu, Yang, and Li (2013) the authors presented an IRL algorithm to find the solution to the HJB equation related to a discounted cost function. Other than the IRL-based PI algorithms, efficient synchronous PI algorithms with guaranteed closed-loop stability were proposed in Bhasin et al. (2012), Modares, Naghibi-Sistani, and Lewis (2013), and Vamvoudakis and Lewis (2010) to learn the solution to the HJB equation. Synchronous IRL algorithms were also presented for solving the HJB equation in Modares, Naghibi-Sistani, and Lewis (2014) and Vamvoudakis, Vrabie, and Lewis (in press).
Although RL algorithms have been widely used to solve optimal regulation problems, few results considered solving the optimal tracking control problem (OTCP) for both discrete-time (Dierks & Jagannathan, 2009; Kiumarsi, Lewis, Modares, Karimpour, & Naghibi-Sistani, 2014; Wang, Liu, & Wei, 2012; Zhang, Wei, & Luo, 2008) and continuous-time systems (Dierks & Jagannathan, 2010; Zhang, Cui, Zhang, & Luo, 2011). Moreover, existing methods for continuous-time systems require the exact knowledge of the system dynamics a priori while finding the feedforward part of the control input using the dynamic inversion concept. In order to attain the required knowledge of the system dynamics, in Zhang et al. (2011) a plant model was first identified and then an RL-based optimal tracking controller was synthesized using the identified model. To our knowledge, there has been no attempt to develop RL-based techniques to solve the OTCP for continuous-time systems with unknown or partially-unknown dynamics using only measured data in real time. While the importance of the IRL algorithm is well understood for solving optimal regulation problems using only partial knowledge of the system dynamics, the requirement of the exact knowledge of the system dynamics for finding the steady-state part of the control input in the existing OTCP formulation does not allow extending the IRL algorithm to the OTCP.

Another important issue which is ignored in the existing RL-based solutions to the OTCP is the amplitude limitation on the control inputs. In fact, in the existing formulation for the OTCP, it is not possible to encode the input constraints into the optimization problem a priori, as only the cost of the feedback part of the control input is considered in the performance function. Therefore, the existing RL-based solutions to the OTCP offer no guarantee that the control inputs remain within their permitted bounds during and after learning. This may result in performance degradation or even system instability. In the context of the constrained optimal regulation problem, however, an offline PI algorithm (Abou-Khalaf & Lewis, 2005) and online PI algorithms (Modares et al., 2013, 2014) were presented to find the solution to the constrained HJB equation.

In this paper, we develop an online adaptive controller based on the IRL technique to learn the OTCP solution for nonlinear continuous-time systems without knowing the system drift dynamics or the command generator dynamics. The contributions of this paper are as follows. First, a new formulation for the OTCP is presented: an augmented system is constructed from the tracking error dynamics and the command generator dynamics to introduce a new discounted performance function for the OTCP. Second, the input constraints are encoded into the optimization problem a priori by employing a suitable nonquadratic performance function. Third, a tracking HJB equation related to this nonquadratic performance function is derived which gives both feedforward and feedback parts of the control input simultaneously. Fourth, the IRL algorithm is extended for solving the OTCP. An IRL algorithm, implemented on an actor–critic structure, is used to find the solution to the tracking HJB equation online using only partial knowledge about the system dynamics. In contrast to existing work, a preceding identification procedure is not needed and the optimal policy is learned using only measured data from the system. Convergence of the proposed learning algorithm to a near-optimal control solution and boundedness of the tracking error and the actor and critic NN weights during learning are also shown.

It is also pointed out that the input constraints caused by actuator saturation cannot be encoded into the standard performance function a priori. A new formulation of the OTCP is given in the next section to overcome these shortcomings.

2.1. Problem formulation

Consider the affine CT dynamical system described by

ẋ(t) = f(x(t)) + g(x(t)) u(t)   (1)

where x ∈ R^n is the measurable system state vector, f(x) ∈ R^n is the drift dynamics of the system, g(x) ∈ R^{n×m} is the input dynamics of the system, and u(t) ∈ R^m is the control input. The elements of u(t) are denoted by u_i(t), i = 1, …, m.

Assumption 1. It is assumed that f(0) = 0 and f(x) and g(x) are Lipschitz, and that the system (1) is controllable in the sense that there exists a continuous control on a set Ω ⊆ R^n which stabilizes the system.

Assumption 2 (Bhasin et al., 2012; Vamvoudakis & Lewis, 2010). The following assumptions are considered on the system dynamics:
(a) ‖f(x)‖ ≤ b_f ‖x‖ for some constant b_f.
(b) g(x) is bounded by a constant b_g, i.e. ‖g(x)‖ ≤ b_g.

Note that Assumption 2(a) requires f(x) to be Lipschitz with f(0) = 0 (see Assumption 1), which is a standard assumption to make sure the solution x(t) of the system (1) is unique for any finite initial condition. On the other hand, although Assumption 2(b) restricts the considered class of nonlinear systems, many physical systems, such as robotic systems (Slotine & Li, 1991) and aircraft systems (Sastry, 1991), fulfill such a property.

The goal of the optimal tracking problem is to find the optimal control policy u*(t) so as to make the system (1) track a desired (reference) trajectory x_d(t) ∈ R^n in an optimal manner by minimizing a predefined performance function. Moreover, the input must be constrained to remain within predefined limits |u_i(t)| ≤ λ, i = 1, …, m.

Define the tracking error as

e_d(t) ≜ x(t) − x_d(t).   (2)

A general performance function leading to the optimal tracking controller can be expressed as

V(e_d(t), x_d(t)) = ∫_t^∞ e^{−γ(τ−t)} [E(e_d(τ)) + U(u(τ))] dτ   (3)

where E(e_d) is a positive-definite function, U(u) is a positive-definite integrand function, and γ is the discount factor.

Note that the performance function (3) contains both the tracking error cost and the whole control input energy cost. The following assumption is made in accordance with other work in the literature.

Assumption 3. The desired reference trajectory x_d(t) is bounded and there exists a Lipschitz continuous command generator function h_d(x_d(t)) ∈ R^n such that

ẋ_d(t) = h_d(x_d(t)).   (4)
2.2. Standard formulation and solution to the OTCP

In this subsection, the standard solution to the OTCP and its shortcomings are discussed.

In the existing standard solution to the OTCP, the desired or steady-state part of the control input u_d(t) is obtained by assuming that the desired reference trajectory satisfies

ẋ_d(t) = f(x_d(t)) + g(x_d(t)) u_d(t).   (5)

If the dynamics of the system is known and the inverse of the input dynamics g^{−1}(x_d(t)) exists, the steady-state control input which guarantees perfect tracking is given by

u_d(t) = g^{−1}(x_d(t)) (ẋ_d(t) − f(x_d(t))).   (6)

On the other hand, the feedback part of the control is designed to stabilize the tracking error dynamics in an optimal manner by minimizing the following performance function:

V(e_d(t)) = ∫_t^∞ [e_d(τ)^T Q e_d(τ) + u_e^T R u_e] dτ   (7)

where u_e(t) = u(t) − u_d(t) is the feedback control input. The optimal feedback control solution u_e*(t) which minimizes (7) can be obtained by solving the HJB equation related to this performance function (Lewis et al., 2012). The standard optimal solution to the OTCP is then the sum of the steady-state control (6) and the obtained optimal feedback control u_e*(t).

Remark 1. The optimal feedback part of the control input u_e*(t) can be learned using the integral reinforcement learning method (Vrabie & Lewis, 2009) to obviate knowledge of the system drift dynamics. However, the exact knowledge of the system dynamics is required to find the steady-state part of the control input given by (6), which cancels the usefulness of the IRL technique.

Remark 2. Because only the feedback part of the control input is obtained by minimizing the performance function (7), it is not possible to encode the input constraints into the optimization problem by using a nonquadratic performance function, as has been done in the optimal regulation problem (Abou-Khalaf & Lewis, 2005; Modares et al., 2013, 2014).

3. A new formulation for the OTCP of constrained-input systems

In this section, a new formulation for the OTCP is presented. In this formulation, both the steady-state and feedback parts of the control input are obtained simultaneously by minimizing a new discounted performance function in the form of (3). The input constraints are also encoded into the optimization problem a priori. A tracking HJB equation for the constrained OTCP is derived and an iterative offline integral reinforcement learning (IRL) algorithm is presented to find its solution. This algorithm provides a basis to develop an online IRL algorithm for learning the optimal solution to the OTCP for partially-unknown systems, which is discussed in the next section.

3.1. An augmented system and a new discounted performance function for the OTCP

In this subsection, first an augmented system composed of the tracking error dynamics and the command generator dynamics is constructed. Then, based on this augmented system, a new discounted performance function for the OTCP is presented. It is shown that this performance function is identical to the performance function (3).

The tracking error dynamics can be obtained by using (1) and (2), and the result is

ė_d(t) = f(x(t)) − h_d(x_d(t)) + g(x(t)) u(t).   (8)

Define the augmented system state

X(t) = [e_d(t)^T  x_d(t)^T]^T ∈ R^{2n}.   (9)

Then, putting (4) and (8) together yields the augmented system

Ẋ(t) = F(X(t)) + G(X(t)) u(t)   (10)

where u(t) = u(X(t)) and

F(X(t)) = [f(e_d(t) + x_d(t)) − h_d(x_d(t)); h_d(x_d(t))]   (11)

G(X(t)) = [g(e_d(t) + x_d(t)); 0].   (12)

Based on the augmented system (10), we introduce the following discounted performance function for the OTCP:

V(X(t)) = ∫_t^∞ e^{−γ(τ−t)} [X(τ)^T Q_T X(τ) + U(u(τ))] dτ   (13)

where γ > 0 is the discount factor,

Q_T = [Q 0; 0 0],  Q > 0   (14)

and U(u) is a positive-definite integrand function defined as

U(u) = 2 ∫_0^u (λ β^{−1}(v/λ))^T R dv   (15)

where v ∈ R^m, β(·) = tanh(·), λ is the saturating bound for the actuators and R = diag(r_1, …, r_m) > 0 is assumed to be diagonal for simplicity of analysis. This nonquadratic performance function is used in the optimal regulation problem of constrained-input systems to deal with the input constraints (Abou-Khalaf & Lewis, 2005; Lyshevski, 1998; Modares et al., 2013, 2014). In fact, using this nonquadratic performance function, the following constraints are always satisfied:

|u_i(t)| ≤ λ,  i = 1, …, m.   (16)

Definition 1 (Admissible Control). A control policy µ(X) is said to be admissible with respect to (13) on Ω, denoted by µ(X) ∈ π(Ω), if µ(X) is continuous on Ω, µ(0) = 0, u(t) = µ(X) stabilizes the error dynamics (8) on Ω, and V(X) is finite ∀X ∈ Ω.

Note that from (10)–(12) it is clear that, as expected, the command generator dynamics is not under our control. Since they are assumed to be bounded, the admissibility of the control input implies the boundedness of the states of the augmented system.

Remark 3. Note that for the first term under the integral we have X^T Q_T X = e_d^T Q e_d. Therefore, this performance function is identical to the performance function (3) with E(e_d(τ)) = e_d^T Q e_d and U(u) given in (15).

Remark 4. The use of the discount factor in the performance function (13) is essential. This is because the control input contains a steady-state part which in general makes (13) unbounded without a discount factor, and therefore the meaning of minimality is lost.

Remark 5. Note that both steady-state and feedback parts of the control input are obtained simultaneously by minimizing the discounted performance function (13) along the trajectories of the augmented system (10). As is shown in the subsequent sections, this formulation enables us to extend the IRL technique to find the solution to the OTCP without requiring the augmented system drift dynamics F. That is, both the system drift dynamics f and the command generator dynamics h_d are not required.
3.2. Tracking Bellman and tracking HJB equations

In this subsection, the optimal tracking Bellman equation and the optimal tracking HJB equation related to the defined performance function (13) are given.

Using Leibniz's rule to differentiate V along the augmented system trajectories (10), the following tracking Bellman equation is obtained:

V̇(X) = ∫_t^∞ (∂/∂t) e^{−γ(τ−t)} [X(τ)^T Q_T X(τ) + U(u(τ))] dτ − X^T Q_T X − U(u).   (17)

Using (15) in (17) and noting that the first term on the right-hand side of (17) is equal to γ V(X) gives

X^T Q_T X + 2 ∫_0^u (λ tanh^{−1}(v/λ))^T R dv − γ V(X) + V̇(X) = 0   (18)

or, by defining the Hamiltonian function,

H(X, u, ∇V) ≡ X^T Q_T X + 2 ∫_0^u (λ tanh^{−1}(v/λ))^T R dv − γ V(X) + ∇V^T(X) (F(X) + G(X) u(X)) = 0.   (19)

To solve the OTCP, one solves the HJB equation (25) for the optimal value V*; then the optimal control is given as a feedback u(V*) in terms of the HJB solution using (22).

Now a formal proof is given that the solution to the tracking HJB equation for constrained-input systems provides the optimal tracking control solution and, when the discount factor is zero, locally asymptotically stabilizes the error dynamics (8). The following key fact is instrumental.

Lemma 1. For any admissible control policy u(X), let V(X) ≥ 0 be the corresponding solution to the Bellman equation (19). Define u*(X) = u(V(X)) by (22) in terms of V(X). Then

H(X, u, ∇V) = H(X, u*, ∇V) + ∇V^T(X) G(X)(u − u*) + 2 ∫_{u*}^{u} (λ tanh^{−1}(v/λ))^T R dv.   (26)

Proof. The Hamiltonian function is

H(X, u, ∇V) = X^T Q_T X + 2 ∫_0^u (λ tanh^{−1}(v/λ))^T R dv − γ V(X) + ∇V^T(X) (F(X) + G(X) u(X)).   (27)

Adding and subtracting the terms 2 ∫_0^{u*} (λ tanh^{−1}(v/λ))^T R dv and
Now, suppose V(X) satisfies the HJB equation (25). Then H(X, u*, ∇V*) = 0 and (26) yields

V(X(0), u) = ∫_0^∞ e^{−γτ} [2 ∫_{u*}^{u} (λ tanh^{−1}(v/λ))^T R dv + ∇V*^T(X) G(X)(u − u*)] dτ + V*(X(0)).   (30)

To prove that u* is the optimal control solution and that the optimal value is V*(X(0)), it remains to show that the integral term on the right-hand side of the above equation is bigger than zero for all u ≠ u* and attains its minimum value, i.e., zero, at u = u*. That is, it must be shown that

2 ∫_{u*}^{u} (λ tanh^{−1}(v/λ))^T R dv + ∇V*^T(X) G(X)(u − u*) ≥ 0.   (31)

To show this, note that using (22) one has

∇V*^T(X) G(X) = −2 (λ tanh^{−1}(u*/λ))^T R.   (32)

By the mean value theorem for integrals, there exists a u_k ∈ (ū_k*, ū_k) such that

∫_{ū_k}^{ū_k*} β(ξ_k) dξ_k = −∫_{ū_k*}^{ū_k} β(ξ_k) dξ_k = −β(u_k)(ū_k − ū_k*) < −β(ū_k*)(ū_k − ū_k*) = β(ū_k*)(ū_k* − ū_k)   (38)

where the inequality follows from the fact that β is monotone odd, and hence β(u_k) > β(ū_k*). Therefore L_k > 0 also for ū_k* < ū_k. This completes the proof of optimality.

Now the stability of the error dynamics is shown. Note that for any continuous value function V(X), differentiating V(X) along the augmented system trajectories gives

dV(X)/dt = ∂V(X)/∂t + (∂V(X)/∂X)^T Ẋ = (∂V(X)/∂X)^T (F(X) + G(X)u)   (39)

so that

H(X, u, ∇V) = dV(X)/dt − γ V(X) + X^T Q_T X + 2 ∫_0^u (λ tanh^{−1}(v/λ))^T R dv.
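Given any candidate value function, the Hamiltonian (19) can be evaluated numerically, which is convenient for checking Bellman residuals along trajectories. The sketch below does this for an assumed quadratic candidate V(X) = X^T P X with placeholder F and G; none of these choices come from the paper.

```python
import numpy as np

lam, gamma = 1.0, 0.2
R = np.diag([1.0])
Q_T = np.diag([5.0, 1.0, 0.0, 0.0])         # block-diag(Q, 0) as in (14)

F = lambda X: np.array([X[1], -X[0] - 0.5*X[1] + X[2], X[3], -X[2]])   # placeholder F(X)
G = lambda X: np.array([[0.0], [1.0], [0.0], [0.0]])                   # placeholder G(X)

P = 0.5*np.eye(4)                           # candidate value V(X) = X^T P X
V     = lambda X: X @ P @ X
gradV = lambda X: 2.0 * P @ X

def U(u):
    """Nonquadratic integrand of (15) in closed form (diagonal R)."""
    u, r = np.atleast_1d(u), np.diag(R)
    return float(np.sum(2*r*lam*u*np.arctanh(u/lam) + r*lam**2*np.log(1 - (u/lam)**2)))

def hamiltonian(X, u):
    """H(X, u, grad V) as defined in (19)."""
    return X @ Q_T @ X + U(u) - gamma*V(X) + gradV(X) @ (F(X) + G(X) @ np.atleast_1d(u))

X = np.array([0.2, -0.1, 0.3, 0.0])
print(hamiltonian(X, np.array([0.4])))      # nonzero residual: P is not the HJB solution
```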
with Q > 0, the derivative of the Lyapunov function in (43) becomes negative and therefore the tracking error decreases until the exponential term e^{−γt} becomes zero and makes the derivative of the Lyapunov function zero. After that, we can only conclude that the tracking error does not increase anymore. The larger Q is, the faster the tracking error decreases and the smaller the tracking error that can be achieved. Moreover, the smaller the discount factor is, the more slowly the derivative of the Lyapunov function decays to zero and the smaller the tracking error that can be achieved. Consequently, by choosing a smaller discount factor and/or a larger Q one can make the tracking error as small as desired before the value of e^{−γt} becomes very small.

Remark 6. The use of discounted cost functions is common in optimal regulation control problems, and the same conclusion can be drawn for asymptotic stability of the system state in the optimal regulator problem as is drawn here for asymptotic stability of the tracking error in the OTCP. However, the discount factor is a design parameter and, as is shown for optimal regulation control problems in the literature, it can be chosen small enough to make sure the system state goes to a very small region around zero. Simulation results in Section 5 confirm this conclusion for the OTCP.

3.3. Offline policy iteration algorithms for solving the OTCP

The tracking HJB equation (25) is a nonlinear partial differential equation which is extremely difficult to solve. In this subsection, two iterative offline policy iteration (PI) algorithms are presented for solving this equation. An IRL-based offline PI algorithm is given which is a basis for our online IRL algorithm presented in the next section.

Note that the tracking HJB equation (25) is nonlinear in the value function derivative ∇V*, while the tracking Bellman equation (18) is linear in the cost function derivative ∇V. Therefore, finding the value of a fixed control policy by solving (18) is easier than finding the optimal value function by solving (25). This is the motivation for introducing an iterative policy iteration (PI) algorithm for approximating the tracking HJB solution. The PI algorithm performs the following sequence of two-step iterations to find the optimal control policy.

Algorithm 1 (Offline PI Algorithm).
1. (policy evaluation) Given a control input u^i(X), find V^i(X) using the following Bellman equation:

X^T Q_T X + 2 ∫_0^{u^i} (λ tanh^{−1}(v/λ))^T R dv − γ V^i(X) + ∇V^{iT}(X)(F(X) + G(X)u^i) = 0.   (44)

2. (policy improvement) Update the control policy using

u^{i+1}(X) = −λ tanh((1/2λ) R^{−1} G^T(X) ∇V^i(X)).   (45)

Algorithm 1 is an extension of the offline PI algorithm in Abou-Khalaf and Lewis (2005) to the optimal tracking problem. The following theorem shows that this algorithm converges to the optimal solution of the HJB equation (25).

The tracking Bellman equation (44) requires complete knowledge of the system dynamics. In order to find an equivalent formulation of the tracking Bellman equation that does not involve the dynamics, we use the IRL idea introduced in Vrabie and Lewis (2009) for the optimal regulation problem. Note that for any integral reinforcement interval T > 0, the value function (13) satisfies

V(X(t − T)) = ∫_{t−T}^{t} e^{−γ(τ−t+T)} [X(τ)^T Q_T X(τ) + U(u(τ))] dτ + e^{−γT} V(X(t)).   (46)

This IRL form of the tracking Bellman equation does not involve the system dynamics.

Lemma 2. The IRL tracking Bellman equation (46) and the tracking Bellman equation (18) are equivalent and have the same positive semi-definite solution for the value function.

Proof. See Liu et al. (2013) and Vrabie and Lewis (2009) for the same proof.

Using the IRL tracking Bellman equation (46), the following IRL-based PI algorithm can be used to solve the tracking HJB equation (25) using only partial knowledge about the system dynamics.

Algorithm 2 (Offline IRL Algorithm).
1. (policy evaluation) Given a control input u^i(X), find V^i(X) using the tracking Bellman equation

V^i(X(t − T)) = ∫_{t−T}^{t} e^{−γ(τ−t+T)} [X(τ)^T Q_T X(τ) + U(u(τ))] dτ + e^{−γT} V^i(X(t)).   (47)

2. (policy improvement) Update the control policy using

u^{i+1}(X) = −λ tanh((1/2λ) R^{−1} G^T(X) ∇V^i(X)).   (48)

4. Online actor–critic for solving the tracking HJB equation using the IRL technique

In this section, an online solution to the tracking HJB equation (25) is presented which only requires partial knowledge about the system dynamics. The learning structure uses value function approximation (Finlayson, 1990; Werbos, 1989, 1992) with two NNs, namely an actor and a critic. Instead of sequentially updating the critic and actor NNs, as in Algorithm 2, both are updated simultaneously in real time. We call this synchronous online PI.

4.1. Critic NN and value function approximation

Assuming the value function is a smooth function, according to the Weierstrass high-order approximation theorem (Finlayson, 1990), there exists a single-layer neural network (NN) such that the solution V(X) and its gradient ∇V(X) can be uniformly approximated as

V(X) = W_1^T φ(X) + ε_v(X)   (49)

∇V(X) = ∇φ(X)^T W_1 + ∇ε_v(X)   (50)

where φ(X) ∈ R^l provides a suitable basis function vector and ε_v(X) is the NN approximation error.
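With the critic parameterized as in (49), the policy-evaluation step (47) of Algorithm 2 is linear in the unknown weights and can be solved by least squares from stored integral-reinforcement samples, after which (48) gives the improved policy. The following sketch illustrates a few such iterations; the basis functions, placeholder augmented dynamics, probing noise, and data-collection details are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

gamma, T, lam = 0.2, 0.1, 1.0
R = np.diag([1.0])
Q_T = np.diag([5.0, 1.0, 0.0, 0.0])

# Placeholder augmented dynamics and quadratic basis phi(X) (cf. (49)).
F = lambda X: np.array([X[1], -X[0] - 0.5*X[1] + X[2], X[3], -X[2]])
G = lambda X: np.array([[0.0], [1.0], [0.0], [0.0]])
iu, ju = np.triu_indices(4)
phi = lambda X: np.outer(X, X)[iu, ju]                       # 10 quadratic features
def grad_phi(X):                                             # Jacobian of phi, 10 x 4
    J = np.zeros((10, 4))
    for k, (i, j) in enumerate(zip(iu, ju)):
        J[k, i] += X[j]; J[k, j] += X[i]
    return J

def U(u):
    u, r = np.atleast_1d(u), np.diag(R)
    return float(np.sum(2*r*lam*u*np.arctanh(u/lam) + r*lam**2*np.log(1 - (u/lam)**2)))

def policy(W):
    """Policy improvement (48): u(X) = -lam * tanh((1/(2*lam)) R^-1 G^T grad V(X))."""
    return lambda X: -lam*np.tanh((1/(2*lam)) * np.linalg.solve(R, G(X).T @ (grad_phi(X).T @ W)))

def collect(W_policy, n_intervals=200, dt=1e-3, rng=np.random.default_rng(0)):
    """Simulate the augmented system and store (regressor, target) pairs for (47)."""
    u_of = policy(W_policy)
    rows, targets = [], []
    X = rng.uniform(-0.5, 0.5, size=4)
    for _ in range(n_intervals):
        X0, rho, s = X.copy(), 0.0, 0.0
        while s < T:                                          # integral reinforcement over [t-T, t]
            u = u_of(X) + 0.05*rng.standard_normal(1)         # small probing noise
            u = np.clip(u, -0.99*lam, 0.99*lam)
            rho += np.exp(-gamma*s) * (X @ Q_T @ X + U(u)) * dt
            X = X + (F(X) + G(X) @ u) * dt
            s += dt
        rows.append(phi(X0) - np.exp(-gamma*T)*phi(X))        # V^i(X(t-T)) - e^{-gamma T} V^i(X(t))
        targets.append(rho)
    return np.array(rows), np.array(targets)

W = np.zeros(10)                                              # initial critic weights
for it in range(5):                                           # a few PI iterations
    A, b = collect(W)
    W = np.linalg.lstsq(A, b, rcond=None)[0]                  # policy evaluation (47)
print(W)
```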
Assumption 4. The critic NN activation functions and their gradients are bounded, i.e. ‖φ(X)‖ ≤ b_φ and ‖∇φ(X)‖ ≤ b_{φx}.

The critic NN (49) is used to approximate the value function related to the IRL tracking Bellman equation (46). Using the value function approximation (49) in the IRL tracking Bellman equation (46) yields

ε_B(t) ≡ ∫_{t−T}^{t} e^{−γ(τ−t+T)} [X(τ)^T Q_T X(τ) + 2 ∫_0^u (λ tanh^{−1}(v/λ))^T R dv] dτ + W_1^T ∆φ(X(t))   (51)

where

∆φ(X(t)) = e^{−γT} φ(X(t)) − φ(X(t − T))   (52)

and ε_B is the tracking Bellman equation error due to the NN approximation error. Under Assumption 4, this approximation error is bounded on the compact set Ω. That is, there exists a constant bound ε_max for ε_B such that sup_{∀t} ‖ε_B‖ ≤ ε_max.

Theorem 3. Let u be any admissible bounded control policy and consider the adaptive law (58) for tuning the critic NN weights. If ∆φ̄ in (59) is persistently exciting (PE), i.e. if there exist γ_1 > 0 and γ_2 > 0 such that ∀t > 0

γ_1 I ≤ ∫_t^{t+T_1} ∆φ̄(τ) ∆φ̄^T(τ) dτ ≤ γ_2 I,   (60)

then,
(a) For ε_B(t) = 0 (no reconstruction error), the critic weight estimation error converges to zero exponentially fast.
(b) For bounded reconstruction error, i.e., ‖ε_B(t)‖ < ε_max, the critic weight estimation error converges exponentially fast to a residual set.

Proof. Using the IRL tracking Bellman equation (51), one has

∫_{t−T}^{t} e^{−γ(τ−t)} 2 ∫_0^u (λ tanh^{−1}(v/λ))^T R dv
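The regressor (52) and the PE condition (60) can be checked directly from logged data. Since the definition (59) of ∆φ̄ does not appear in this excerpt, the sketch below assumes the normalization ∆φ̄ = ∆φ/(1 + ∆φ^T ∆φ), which is consistent with the form of the critic update (75); the recorded trajectory and basis are placeholders.

```python
import numpy as np

gamma, T, dt = 0.2, 0.1, 1e-3
steps = int(T/dt)
iu, ju = np.triu_indices(4)
phi = lambda X: np.outer(X, X)[iu, ju]          # placeholder basis, l = 10

# Placeholder recorded augmented-state trajectory X(t) (e.g. logged during learning).
t_grid = np.arange(0.0, 20.0, dt)
X_traj = np.stack([np.sin(t_grid), np.cos(2*t_grid),
                   0.5*np.sin(5**0.5*t_grid), 0.5*np.cos(3*t_grid)], axis=1)

def delta_phi(k):
    """Delta phi(X(t)) = e^{-gamma T} phi(X(t)) - phi(X(t-T)), cf. (52)."""
    return np.exp(-gamma*T)*phi(X_traj[k]) - phi(X_traj[k - steps])

def delta_phi_bar(k):
    """Assumed normalized regressor (definition (59) is not shown in this excerpt)."""
    d = delta_phi(k)
    return d / (1.0 + d @ d)

# PE check (60): the Gram matrix over a window should be positive definite.
window = range(steps, len(t_grid), steps)
Gram = sum(np.outer(delta_phi_bar(k), delta_phi_bar(k)) for k in window) * T
eigs = np.linalg.eigvalsh(Gram)
print("min/max eigenvalue of the excitation Gram matrix:", eigs[0], eigs[-1])
```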
or equivalently

∆φ(X) = ∫_{t−T}^{t} e^{−γ(τ−t+T)} ∇φ(X)(F + Gu) dτ − γ ∫_{t−T}^{t} e^{−γ(τ−t+T)} φ(X) dτ.   (66)

Also, note that U(u) in (23) for the optimal control input given by (64) becomes

U(u) = 2 ∫_0^u (λ tanh^{−1}(v/λ))^T R dv = W_1^T ∇φ F + λ² R̄ ln(1 − tanh²(D + (1/2λ) R^{−1} G^T ∇ε_v)).   (67)

Using (66) and (67) for the third and second terms of (51), respectively, the following tracking HJB equation is obtained:

∫_{t−T}^{t} e^{−γ(τ−t+T)} (X^T Q_T X − γ W_1^T φ + W_1^T ∇φ F + λ² R̄ ln(1 − tanh²(D)) + ε_HJB) dτ = 0   (68)

where D = (1/2λ) R^{−1} G^T ∇φ^T W_1 and ε_HJB, i.e., the HJB approximation error due to the function approximation error, is

ε_HJB = ∫_{t−T}^{t} e^{−γ(τ−t+T)} (∇ε_v^T F + λ² R̄ ln(1 − tanh²(D + (1/2λ) R^{−1} G^T ∇ε_v)) − λ² R̄ ln(1 − tanh²(D)) − γ ε_v) dτ.   (69)

Since the NN approximation error is bounded, there exists a constant error bound ε_h, so that sup ‖ε_HJB‖ ≤ ε_h. We should note that the choice of the NN structure to make the error bound ε_h arbitrarily small is commonly carried out by computer simulation in the literature. We assume here that the NN structure is specified by the designer, and the only unknowns are the NN weights.

To approximate the solution to the tracking HJB equation (68), the critic and actor NNs are employed. The critic NN given by (53) is used to approximate the unknown approximate optimal value function. Assuming that Ŵ_1 is the current estimate of the optimal critic NN weights W_1, then using (64) the policy update law can be obtained as

u_1 = −λ tanh((1/2λ) R^{−1} G^T ∇φ^T Ŵ_1).   (70)

However, this policy update law does not guarantee the stability of the closed-loop system. It is necessary to use a second neural network Ŵ_2^T ∇φ for the actor because the control input must not only solve the stationarity condition (22), but also guarantee system stability while converging to the optimal solution. This is seen in the Lyapunov proof of Theorem 4. Hence, to assure stability in a Lyapunov sense, the following actor NN is employed:

û_1 = −λ tanh((1/2λ) R^{−1} G^T ∇φ^T Ŵ_2)   (71)

where Ŵ_2 is the actor NN weight vector, considered as the current estimated value of W_1. Define the actor NN estimation error as

W̃_2 = W_1 − Ŵ_2.   (72)

Note that using the actor û_1 in (71), the IRL Bellman equation error is now given by

∫_{t−T}^{t} e^{−γ(τ−t+T)} [X(τ)^T Q_T X(τ) + Û] dτ + Ŵ_1^T ∆φ(X(t)) = ê_B(t)   (73)

where

Û = 2 ∫_0^{û_1} (λ tanh^{−1}(v/λ))^T R dv.   (74)

Then, the critic update law (58) becomes

Ŵ̇_1 = −α_1 (∆φ / (1 + ∆φ^T ∆φ)²) ê_B.   (75)

Define the error e_u as the difference between the control input û_1 (71) applied to the system and the control input u_1 (70), which approximates the optimal control input given by (22) with V* approximated by (53). That is,

e_u = û_1 − u_1 = λ tanh((1/2λ) R^{−1} G^T ∇φ^T Ŵ_1) − λ tanh((1/2λ) R^{−1} G^T ∇φ^T Ŵ_2).   (76)

The objective function to be minimized by the actor NN is now defined as

E_u = e_u^T R e_u.   (77)

Then, the gradient-descent update law for the actor NN weights becomes

Ŵ̇_2 = −α_2 (∇φ G e_u + ∇φ G tanh²(D̂) e_u + Y Ŵ_2)   (78)

where

D̂ = (1/2λ) R^{−1} G^T ∇φ^T Ŵ_2,   (79)

Y > 0 is a design parameter and the last term of (78) is added to assure stability.

Before presenting our main theorem, note that based on Assumption 2 and the boundedness of the command generator dynamics h_d, for the drift dynamics of the augmented system F one has

‖F(X)‖ ≤ b_{F1} ‖e_d‖ + b_{F2}   (80)

for some b_{F1} and b_{F2}.

Theorem 4. Given the dynamical system (1) and the command generator (4), let the tracking control law be given by the actor NN (71). Let the update laws for tuning the critic and actor NNs be provided by (75) and (78), respectively. Let Assumptions 1–4 hold and ∆φ̄ in (59) be persistently exciting. Then there exists a T_0 defined by (A.25) such that for the integral reinforcement interval T < T_0 the tracking error e_d in (2), the critic NN error W̃_1, and the actor NN error W̃_2 in (72) are UUB.

Proof. See Appendix.

Remark 8. The stability analysis in the proof of Theorem 4 differs from the stability proofs presented in Modares et al. (2013), Vamvoudakis and Lewis (2010) and Vamvoudakis et al. (in press) from at least two different perspectives. First, the actor update law in the mentioned papers is derived entirely from the stability analysis, whereas the actor update law in our paper is based on the minimization of the error between the actor neural network and the approximate optimal control input. Moreover, in this paper the optimal tracking problem is considered, not the optimal regulation problem, and the tracking HJB equation has an additional term depending on the discount factor compared to the regulation HJB equation considered in the mentioned papers.

Remark 9. The proof of Theorem 4 shows that the integral reinforcement learning time interval T cannot be too big. Moreover, since d and N in Eqs. (A.24) and (A.28) should be bigger than zero for an arbitrary ε > 0, one can conclude that the bigger the reinforcement interval T is, the bigger the parameter Y in the learning rule (78) should be chosen to assure stability.
5.1. Linear system without actuator saturation

In this subsection, the results of the proposed method are compared to the results of the standard solution given in Section 2.2, and it is also shown that the proposed method converges to the optimal solution in the absence of the control bounds. To this end, the actuator bounds are chosen large enough to make sure the control input does not exceed these bounds.

Assuming that both spring and damper are linear, the spring–mass–damper system is described by the following dynamics:

ẋ_1 = x_2
ẋ_2 = −(k/m) x_1 − (c/m) x_2 + (1/m) u(t)   (81)

where y = x_1, x_1 and x_2 are the position and velocity, m is the mass of the object, k is the stiffness constant of the spring and c is the damping. The true parameters are set as m = 1 kg, c = 0.5 N·s/m and k = 5 N/m. Note that in our control design, only the input dynamics is needed to be known, which is given by m.

The desired trajectories for x_1 and x_2 are considered as

x_{d1}(t) = 0.5 sin(√5 t)   (82)

and

x_{d2}(t) = 0.5 √5 cos(√5 t).   (83)

The critic NN weights and activation functions are chosen as

W = [W_1, …, W_10]^T,
φ(X) = [X_1², X_1X_2, X_1X_3, X_1X_4, X_2², X_2X_3, X_2X_4, X_3², X_3X_4, X_4²]^T.   (88)

The reinforcement interval T is selected as 0.1. A small probing noise is added to the control input to excite the system states. Fig. 2 shows the convergence of the critic parameters, which converge to

W = [17.94, 0.77, −2.01, −0.29, 2.86, 0.07, −0.59, 9.86, −0.08, 1.84].

The optimal control solution (22) then becomes

u = −5 tanh(0.155 e_{d1} + 0.577 e_{d2} + 0.014 x_{d1} − 0.112 x_{d2}).

Note that the optimal critic weights obtained by solving the ARE are

W* = [18.05, 0.77, −1.98, −0.34, 2.88, 0.08, −0.56, 9.77, −0.08, 1.87]

which are the components of the ARE solution matrix P in (86) and confirm the convergence of our algorithm close to the optimal control solution.
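The reported converged controller can be checked by closed-loop simulation. Assuming a saturation level λ = 5 and using scipy's solve_ivp, the sketch below integrates the spring–mass–damper (81) under u = −5 tanh(0.155 e_{d1} + 0.577 e_{d2} + 0.014 x_{d1} − 0.112 x_{d2}) against the reference (82)–(83) and reports the late-time tracking error.

```python
import numpy as np
from scipy.integrate import solve_ivp

m, c, k = 1.0, 0.5, 5.0                       # true spring-mass-damper parameters

def reference(t):
    """Desired trajectory (82)-(83)."""
    w = np.sqrt(5.0)
    return np.array([0.5*np.sin(w*t), 0.5*w*np.cos(w*t)])

def learned_control(t, x):
    """Converged control reported above (saturation level lambda = 5 assumed)."""
    xd = reference(t)
    e = x - xd
    return -5.0*np.tanh(0.155*e[0] + 0.577*e[1] + 0.014*xd[0] - 0.112*xd[1])

def closed_loop(t, x):
    u = learned_control(t, x)
    return [x[1], (-k*x[0] - c*x[1] + u)/m]   # dynamics (81)

sol = solve_ivp(closed_loop, (0.0, 20.0), [1.0, 0.0], max_step=1e-2, dense_output=True)
t_eval = np.linspace(0.0, 20.0, 200)
err = sol.sol(t_eval).T - np.array([reference(t) for t in t_eval])
print("max |e_d1| over the last 5 s:", np.abs(err[t_eval > 15.0, 0]).max())
```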
Fig. 4. The system state x_1 versus the desired trajectory x_{1d} for the standard method.
Fig. 5. The system state x_2 versus the desired trajectory x_{2d} for the standard method.
Fig. 7. The system state x_1 versus the desired trajectory x_{1d} for the proposed method.
Fig. 8. The system state x_2 versus the desired trajectory x_{2d} for the proposed method.

For the standard solution, the steady-state part of the control input using (6) and the system and command generator dynamics becomes

u_d = [0.25, 0.25] x_d(t).

The optimal feedback part of the control input obtained by minimizing the performance index (7) is

u_e = [−0.50, −0.25] e_d(t).

residual, and the unknown critic NN weights. If the number of NN hidden layers is chosen appropriately, which is fulfilled for the linear system provided here, all of these go to zero except for the unknown critic NN weights. However, these bounds are in fact conservative and the simulation results show that the value function and the optimal control solution are closely identified.

5.2. Nonlinear system and considering the actuator bound

(87) with weights and activation functions as

W = [W_1, …, W_45]^T
φ(X) = [X_1², X_1X_2, X_1X_3, X_1X_4, X_2², X_2X_3, X_2X_4, X_3², X_3X_4, X_4², X_1⁴, X_1³X_2, X_1³X_3, X_1³X_4, X_1²X_2²,

the control input to excite the system states and ensure the PE qualitatively. The critic weight vector finally converges to

W = [9.038, 3.95, −1.20, −1.64, 2.41, 0.71, −1.06, 14.28, 0.38, 2.93, −2.97, −0.75, 4.60, −2.40, −3.33, 1.79, 2.18, 3.11, 0.69, −2.45, −2.23, 1.70, 2.02, 0.94, 0.43, 1.21, −0.47, −0.75, 0.54, 1.31, 0.03, 1.70, 0.81, 0.88, −0.02, −0.76, 0.84, −0.15, −3.14, −0.83, 4.11, 0.29, 0.86, −0.88, 0.07].

Figs. 9–11 show the performance of the proposed method.

Fig. 9. Control input while considering the actuator saturation.
Fig. 10. The system state x_1 versus the desired trajectory x_{1d} while considering the actuator saturation.
Fig. 11. The system state x_2 versus the desired trajectory x_{2d} while considering the actuator saturation.

Acknowledgments

Appendix. Proof of Theorem 4

Consider the following Lyapunov function:

J(t) = V(t) + (1/2) W̃_1(t)^T α_1^{−1} W̃_1(t) + (1/2) W̃_2(t)^T α_2^{−1} W̃_2(t)   (A.1)

where V(t) is the optimal value function. The derivative of the Lyapunov function is given by

+ Ŵ_1(t)^T ∇φ (F − G λ tanh(D̂)) + γ W_1(t)^T φ − W_1^T ∇φ (F − G λ tanh(D̂)) + ε_HJB) dτ   (A.3)

where Û is defined in (74) and is given by

Û = Ŵ_2^T ∇φ G λ tanh(D̂) + λ² R̄ ln(1 − tanh²(D̂))   (A.4)

and
Figs. 9–11 show the performance of the proposed method. and
Using (A.4) and (A.5) and some manipulations, Û − U can be written and therefore
as (Modares et al., 2014) ˙
J̇1 = W̃1T (t )α1−1 W̃ 1 (t ) = −W̃1T (t )∆φ̄ ∆φ̄ T W̃1 (t )
Û − U = Ŵ2T ∇φ Gλ tanh (D̂) + W̃2T ∇φ G ∆φ̄
t
− W̃1T (t ) e−γ (τ −t +T ) W̃2T (τ )Mdτ
× λ sgn(D̂) − ∇φ Gλ tanh(D) −
W1T ∇φ W1T m t −T
∆φ̄
× Gλ sgn(D̂) − sgn(D) + λ2 R̄ (εD̂ − εD ) (A.7)
− W̃1T (t ) E. (A.19)
m
where εD̂ and εD are some bounded approximation errors. Substi- For small enough reinforcement interval, the integral term of
tuting (A.7) in (A.6) gives (A.19) can be approximated by the right-hand rectangle method
t (with only one rectangle) as
êB (t ) = −∆φ T W̃1 (t ) + e−γ (τ −t +T ) W̃2T Mdτ + E (A.8) t
t −T
e−γ (τ −t +T ) W̃2T (τ ) Mdτ ≈ Te−γ T M T W̃2 (t ). (A.20)
where t −T