Neural Network Based Online Simultaneous Policy Update Algorithm For Solving The HJI Equation in Nonlinear H∞ Control
Abstract— It is well known that the nonlinear H∞ state feedback control problem relies on the solution of the Hamilton–Jacobi–Isaacs (HJI) equation, which is a nonlinear partial differential equation that has proven to be impossible to solve analytically. In this paper, a neural network (NN)-based online simultaneous policy update algorithm (SPUA) is developed to solve the HJI equation, in which knowledge of the internal system dynamics is not required. First, we propose an online SPUA, which can be viewed as a reinforcement learning technique for two players to learn their optimal actions in an unknown environment. The proposed online SPUA updates the control and disturbance policies simultaneously; thus, only one iterative loop is needed. Second, the convergence of the online SPUA is established by proving that it is mathematically equivalent to Newton's method for finding a fixed point in a Banach space. Third, we develop an actor-critic structure for the implementation of the online SPUA, in which only one critic NN is needed for approximating the cost function, and a least-squares method is given for estimating the NN weight parameters. Finally, simulation studies are provided to demonstrate the effectiveness of the proposed algorithm.

Index Terms— H∞ state feedback control, Hamilton–Jacobi–Isaacs equation, neural network, online, simultaneous policy update algorithm.

Manuscript received January 6, 2012; revised August 29, 2012; accepted August 29, 2012. Date of publication October 12, 2012; date of current version November 20, 2012. This work was supported in part by the National Basic Research Program of China through the 973 Program under Grant 2012CB720003 and in part by the National Natural Science Foundation of China under Grant 61074057, Grant 61121003, and Grant 91016004. The authors are with the Science and Technology on Aircraft Control Laboratory, School of Automation Science and Electrical Engineering, Beihang University (formerly Beijing University of Aeronautics and Astronautics), Beijing 100191, China (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2012.2217349

I. INTRODUCTION

OVER the past few decades, a large number of theoretical results on H∞ control have been reported [1]–[6]. Although nonlinear H∞ control theory has been well developed, the main bottleneck for its practical application is the need to solve the Hamilton–Jacobi–Isaacs (HJI) equation. The HJI equation, similar to the Hamilton–Jacobi–Bellman (HJB) equation of nonlinear optimal control, is a first-order nonlinear partial differential equation (PDE) that is difficult or impossible to solve and may not have global analytic solutions even in simple cases.

In recent years, reinforcement learning (RL) and approximate dynamic programming (ADP) have appeared to be promising techniques for approximately solving nonlinear optimal control problems [7]–[18]. RL [19]–[21] is a kind of machine learning method that refers to an actor or agent that interacts with its environment and aims to learn the optimal actions, or control policies, by observing their responses from the environment. ADP [20]–[23] solves the dynamic programming problem approximately and forward in time; thus, it affords a methodology for learning the feedback control actions online in real time based on system performance, without necessarily knowing the system dynamics. This overcomes the computational complexity, such as the curse of dimensionality [22], that exists in classical dynamic programming, which is an offline technique requiring a backward-in-time solution procedure. ADP has many implementation structures, such as heuristic dynamic programming (HDP), dual heuristic programming (DHP), and globalized DHP, which are widely employed for nonlinear discrete-time systems. In [7], an HDP algorithm was developed to solve the discrete-time HJB equation appearing in infinite-horizon discrete-time nonlinear optimal control, and a full proof of convergence was provided. In [12], the near-optimal control problem for a class of nonlinear discrete-time systems with control constraints was solved using a DHP method. Wang et al. [14] studied the finite-horizon optimal control problem of discrete-time nonlinear systems and suggested an iterative ADP algorithm to obtain the optimal control law, which makes the performance index function close to the greatest lower bound of all performance indices within an ε-error bound. Policy iteration is one of the most popular RL methods [20] for feedback controller design. In [10] and [11], the optimal control problems of linear and nonlinear continuous-time systems were solved online by policy iteration, respectively. Vamvoudakis and Lewis [13] gave an online synchronous policy iteration algorithm to learn the continuous-time optimal control solution with infinite-horizon cost for nonlinear systems with known dynamics, in which an actor and a critic neural network (NN) are involved, and the weights of both NNs are tuned at the same time instant.

However, it is clear that the HJI equation associated with the nonlinear H∞ control problem is generally more difficult to solve than the HJB equation appearing in nonlinear optimal control, since the disturbance inputs are additionally reflected in the HJI equation. The main difference between the HJB and HJI equations is that the HJB equation has a “negative semi-definite quadratic term,” while the HJI equation has an
“indefinite quadratic term” [26]. Thus, the methods for the HJB equation may not be directly applied to the HJI equation. In [27], the linear H∞ control problem was considered, in which the H∞ algebraic Riccati equation (ARE) with an indefinite quadratic term was converted to a sequence of H2 AREs with a negative semi-definite quadratic term. This work was subsequently extended to solve the HJI equation for nonlinear systems in [26]. In [28], the solution of the HJI equation was approximated by a Taylor series expansion, and an efficient algorithm was furnished to generate the coefficients of the Taylor series. In [6], it was proven that there exists a sequence of policy iterations on the control input that pursues the smooth solution of the HJI equation, where the HJI equation was successively approximated by a sequence of HJB equations. Then, the methods for solving HJB equations can be used for the solution of the HJI equation. In [29], the HJB equation was successively approximated by a sequence of linear generalized HJB (GHJB) equations, which can be solved by Galerkin's approximation as in [30] and [31]. Based on [6] and [29]–[31], policy iteration on the disturbance was used to approximate the HJI equation in [32], where each HJB equation in [6] was further successively approximated by a sequence of GHJB equations and solved by Galerkin's approximation. This obviously results in two iterative loops for the solution of the HJI equation, i.e., the inner loop solves an HJB equation by iteratively solving a sequence of GHJB equations, and the outer loop solves the HJI equation by iteratively solving a sequence of HJB equations. Following such a thought, the method in [13] was extended to solve the HJI equation in [33] with known dynamics. A policy iteration scheme was also developed in [34] for nonlinear systems with actuator saturation, and its implementation was facilitated on the basis of neurodynamic programming in [35] and [36], where NNs were used for approximating the value function.

Most of the methods mentioned in the above paragraph for solving the HJI equation of the H∞ control problem, such as [26]–[28] and [32]–[36], require full knowledge of the system dynamics. Furthermore, these approaches follow the thought that the HJI equation is first successively approximated with a sequence of HJB equations, and then each HJB equation is solved by the existing methods [26], [32]–[36]. This often brings two iterative loops because the control and disturbance policies are updated at different iterative steps. Such a procedure may lead to redundant equation solutions (i.e., redundant iterations) and thus a waste of resources, resulting in low efficiency. In [37], ADP was used to solve the linear H∞ control problem online without the need of the internal system dynamics, but it is still a linear special case based on the same procedure as the works in [32]–[36], i.e., it also involves two iterative loops.

In this paper, we propose an online simultaneous policy update algorithm (SPUA) for solving the HJI equation in nonlinear H∞ state feedback control. The main contributions of this paper include three aspects.
1) Propose an online SPUA, in which knowledge of the internal system dynamics is not required. To the best of our knowledge, this paper may be the first work that uses the RL technique for the online H∞ control design of nonlinear continuous-time systems with unknown internal system dynamics.
2) The online SPUA updates the control and disturbance policies simultaneously, which needs only one iterative loop rather than two. This is the essential difference between the online SPUA and the existing methods in [32]–[37]. Moreover, the theory of Newton's method in Banach space is introduced to prove the convergence of the online SPUA.
3) Develop an actor-critic structure for nonlinear H∞ control design without requiring knowledge of the internal system dynamics, where only one critic NN is needed for approximating the cost function and a least-squares (LS) method is given to estimate the NN weight parameters.

The rest of this paper is organized as follows. In Section II, we give the problem description. In Section III, we propose the NN-based online SPUA and discuss some related issues. Simulation studies are conducted in Section IV. Finally, a brief conclusion is drawn in Section V.

Notations: R, R^n, and R^{n×m} are the set of real numbers, the n-dimensional Euclidean space, and the set of all real n × m matrices, respectively. ‖·‖ denotes the vector norm or matrix norm in R^n or R^{n×m}, respectively. For a symmetric matrix M, M > (≥) 0 means that it is a positive (semi-positive) definite matrix. The superscript T is used for the transpose, and I denotes the identity matrix of appropriate dimension. ∇ = ∂/∂x denotes the gradient operator. L2[0, ∞) is a Banach space: for ∀w(t) ∈ L2[0, ∞), ∫_0^∞ ‖w(t)‖² dt < ∞. For a column vector function s(x), ‖s(x)‖ = (∫_Ω s^T(x)s(x) dx)^{1/2}, x ∈ Ω ⊂ R^n. H^{m,p}(Ω) is the Sobolev space that consists of the functions in L^p(Ω) whose derivatives up to order m are also in L^p(Ω).

II. PROBLEM DESCRIPTION

Consider the following partially unknown continuous-time nonlinear system with external disturbance:

ẋ(t) = f(x) + g(x)u(t) + k(x)w(t)    (1)
z(t) = h(x)    (2)

where x ∈ Ω ⊂ R^n is the state, u ∈ R^m is the control input with u(t) ∈ L2[0, ∞), w ∈ R^q is the external disturbance with w(t) ∈ L2[0, ∞), and z ∈ R^p is the objective output. f(x) is an unknown continuous nonlinear vector function satisfying f(0) = 0, which represents the internal system dynamics. g(x), k(x), and h(x) are known continuous vector or matrix functions of appropriate dimensions.

The H∞ control problem under consideration is to find a state feedback control law u(t) = u(x(t)) such that the closed-loop system (1) and (2) is asymptotically stable and has L2-gain less than or equal to γ, that is,

∫_0^∞ (‖z(t)‖² + ‖u(t)‖²_R) dt ≤ γ² ∫_0^∞ ‖w(t)‖² dt    (3)

for all w(t) ∈ L2[0, ∞), where ‖u(t)‖²_R = u^T R u, R > 0, and γ > 0 is some prescribed level of disturbance attenuation.
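For illustration, the finite-horizon counterpart of the L2-gain condition (3) can be checked numerically from sampled closed-loop trajectories. The following sketch (the function name and array shapes are illustrative assumptions, not part of the paper) evaluates the ratio of the two integrals by trapezoidal quadrature; the same quantity, started at the end of the learning phase, is what is reported as r(t) in Section IV.

```python
import numpy as np

def attenuation_ratio(t, z, u, w, R):
    """Finite-horizon check of condition (3).
    t: (T,) sample times; z: (T, p), u: (T, m), w: (T, q) sampled signals.
    Returns int(||z||^2 + ||u||^2_R) dt / int(||w||^2) dt, to compare with gamma^2."""
    running_cost = (z * z).sum(axis=1) + np.einsum('ij,jk,ik->i', u, R, u)
    num = np.trapz(running_cost, t)
    den = np.trapz((w * w).sum(axis=1), t)
    return num / den
```

A ratio at or below γ² over a sufficiently long horizon is consistent with (3) for the particular disturbance realization used.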
Remark 2: Although the procedure of policy iteration is also included in the methods of [32]–[36], there are two main differences between these methods and the online SPUA. 1) The methods in [32]–[36] are model-based ones, which require full knowledge of the system dynamics, while the online SPUA does not require the internal system dynamics f(x). 2) The methods in [32]–[36] update the control and disturbance policies at different iterative steps (i.e., one player updates its policy while the other remains invariant), which brings two iterative loops. In contrast, the online SPUA updates the control and disturbance policies at the same iterative step, in which only one iterative loop is needed. This means that the online SPUA is essentially different from the methods in [32]–[36]. Because the methods in [32]–[36] are based on the same procedure as in [32], their convergence can be directly guaranteed by the results in [6] and [32]. However, new tools are needed for the online SPUA to establish its convergence.

Remark 3: Notice that the SPUA avoids the identification of f(x), whose information is embedded in the online measurement of the states x(t) and x(t + Δt) and the evaluation of the cost ∫_t^{t+Δt} (‖h(x(τ))‖² + ‖u(τ)‖²_R − γ²‖w(τ)‖²) dτ. That is to say, the lack of knowledge about f(x) does not have any impact on the online SPUA obtaining the equilibrium solution. Thus, the resulting equilibrium behavior policies of the two players will not be affected by any errors between the dynamics of a model of the system and the dynamics of the real system.

C. Convergence of Online SPUA

In this section, we will prove the convergence of the online SPUA. Namely, we want to show that the solution of equation (13) converges to the solution of the HJI equation (4) as i goes to infinity. Just as mentioned in Remark 2, the online SPUA is essentially different from the algorithm framework in [32]–[36]. Hence, its convergence proof is also different.

To this end, let us consider a Banach space V ⊂ {V(x) | V(x): Ω → R, V(0) = 0} equipped with a norm ‖·‖, and consider the mapping G: V → V defined in (4). Define a mapping T: V → V as follows:

T V = V − (G′(V))⁻¹ G(V)    (16)

where G′(V) is the Fréchet derivative of G(·) at the point V. It should be noticed that both G′(V) and (G′(V))⁻¹ are operators on the Banach space V.

The Fréchet derivative is often difficult to compute directly; thus, we introduce the Gâteaux derivative.

Definition 1 (Gâteaux Derivative) [38]: Let G: U(V) ⊆ X → Y be a given map, with X and Y Banach spaces. Here, U(V) denotes a neighborhood of V. The map G is Gâteaux differentiable at V if there exists a bounded linear operator L: X → Y such that

G(V + sW) − G(V) = sL(W) + o(s),  s → 0    (17)

for all W with ‖W‖ = 1 and all real numbers s in some neighborhood of zero, where lim_{s→0} (o(s)/s) = 0. L is called the Gâteaux derivative of G at V. The Gâteaux differential at V is defined by L(W).

From (17), the Gâteaux differential at V can be defined equivalently through the following expression [38]:

L(W) = lim_{s→0} (G(V + sW) − G(V))/s.    (18)

Equation (18) gives a method to compute the Gâteaux derivative, rather than the Fréchet derivative required in (16). Thus, we introduce the following lemma to give the relationship between them.

Lemma 2 [38]: If G′ exists as a Gâteaux derivative in some neighborhood of V, and if G′ is continuous at V, then L = G′(V) is also a Fréchet derivative at V.

Now, it follows from Lemma 2 that we can compute the Fréchet derivative G′(V) in (16) via (18). We have the following result.

Lemma 3: Let G: V → V be the mapping defined by (4). Then, for ∀V ∈ V, the Fréchet differential of G at V is

G′(V)W = L(W) = (∇W)^T f − (1/4)(∇W)^T g R⁻¹ g^T ∇V − (1/4)(∇V)^T g R⁻¹ g^T ∇W + (1/(4γ²))(∇W)^T k k^T ∇V + (1/(4γ²))(∇V)^T k k^T ∇W.    (19)

Proof: See Appendix.

The following theorem provides an interesting result, in which we discover that the online SPUA is mathematically equivalent to Newton's iteration in the Banach space V.

Theorem 1: Let T be the mapping defined by (16). Then, the iteration from (13) to (15) is equivalent to the following Newton's iteration together with (14) and (15):

V^(i+1) = T V^(i),  i = 0, 1, 2, …    (20)

Proof: See Appendix.

Under some proper assumptions, Newton's iteration (20) can converge to the unique solution of the fixed-point equation T V* = V*, that is, the solution of the equation G(V*) = 0. The convergence of Newton's method is guaranteed by the following Kantorovich theorem [39], [40].

Lemma 4 (Kantorovich's Theorem): Assume that for some V^(0) ∈ V1 ⊂ V, (G′(V^(0)))⁻¹ exists and that:
1) ‖(G′(V^(0)))⁻¹‖ ≤ B0;    (21)
2) ‖(G′(V^(0)))⁻¹ G(V^(0))‖ ≤ η;    (22)
3) for all V^(1), V^(2) ∈ V1,
‖G′(V^(1)) − G′(V^(2))‖ ≤ K‖V^(1) − V^(2)‖    (23)
with h = B0 K η ≤ 1/2. Let
V2 = {V | ‖V − V^(0)‖ ≤ σ}, where σ = ((1 − √(1 − 2h))/h) η.    (24)
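To make the fixed-point iteration (16) and (20) concrete, the following toy sketch (a scalar plant with illustrative values of our own choosing, not taken from the paper) applies the same Newton update to the scalar analogue of the HJI equation: for ẋ = ax + b2·u + b1·w with cost weights q and r and V(x) = p·x², the HJI reduces to G(p) = 2ap + q + (b1²/γ² − b2²/r)p² = 0.

```python
import numpy as np

# Illustrative scalar values (not from the paper).
a, b1, b2, q, r, gamma = -1.0, 1.0, 1.0, 1.0, 1.0, 2.0
s = b1**2 / gamma**2 - b2**2 / r      # indefinite quadratic coefficient

def G(p):                             # scalar "HJI" residual
    return 2*a*p + q + s*p**2

def G_prime(p):                       # derivative of G (scalar Frechet derivative)
    return 2*a + 2*s*p

p = 0.0                               # initial guess V^(0) = 0
for i in range(20):
    step = G(p) / G_prime(p)
    p -= step                         # V^(i+1) = T V^(i), cf. (16) and (20)
    if abs(step) < 1e-12:
        break
print(i, p, G(p))                     # converges to the stabilizing root p ≈ 0.4305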
where ∇Ψ_N(x) = ((∂ψ1/∂x), …, (∂ψN/∂x))^T is the Jacobian of Ψ_N.

Remark 5: Note that only one NN is needed in the online SPUA (Algorithm 1), i.e., the critic NN for approximating the cost function via (28). After the weight vector is computed via (29), the control and disturbance policies in (14) and (15) can be approximately updated by (30) and (31) accordingly. Therefore, the iteration from (13) to (15) in the online SPUA is converted to the weight iteration from (29) to (31).

It is noticed that the NN weight parameters c^(i+1) have N unknown elements. Thus, in order to solve for c^(i+1), at least N equations are required. Here, we construct N̄ (N̄ ≥ N) equations and use an LS method to estimate c^(i+1). In each time interval [t, t + Δt], we collect N̄ sample data along the state trajectories and construct the LS solution of the NN weights as follows:

c^(i+1) = (X X^T)⁻¹ X Y    (32)

where

X = [Ψ_N(x(t)) − Ψ_N(x(t + δt))  ⋯  Ψ_N(x(t + (N̄ − 1)δt)) − Ψ_N(x(t + N̄δt))]
Y = [y(x(t), û^(i)(t), ŵ^(i)(t))  ⋯  y(x(t + (N̄ − 1)δt), û^(i)(t + (N̄ − 1)δt), ŵ^(i)(t + (N̄ − 1)δt))]^T

with δt = Δt/N̄ and

y(x(t + kδt), û^(i)(t + kδt), ŵ^(i)(t + kδt)) = ∫_{t+kδt}^{t+(k+1)δt} (‖h(x(τ))‖² + ‖û^(i)(τ)‖²_R − γ²‖ŵ^(i)(τ)‖²) dτ,  k = 0, …, N̄ − 1.

It is worth mentioning that the LS method (32) requires a nonsingular matrix X X^T. To attain this goal, we can inject probing noises into the states or reset the system states.

Based on the actor-critic structure and the above LS estimation of the NN weights, we develop an implementable NN-based online SPUA procedure, as shown in Algorithm 2.

Algorithm 2 NN-Based Online SPUA
Step 1: Select N activation functions Ψ_N(x). Given initial weights c^(0) such that V̂^(0) ∈ V0, let
û^(0) = −(1/2) R⁻¹ g^T ∇Ψ_N^T c^(0),  ŵ^(0) = (1/(2γ²)) k^T ∇Ψ_N^T c^(0),
and i = 0.
Step 2: With policies û^(i) and ŵ^(i), collect N̄ sample data along the state trajectories in the time interval [iΔt, (i + 1)Δt]. Compute c^(i+1) via (32) at time instant (i + 1)Δt.
Step 3: Update the control and disturbance policies by (30) and (31) at time instant (i + 1)Δt.
Step 4: Set i = i + 1. If ‖c^(i) − c^(i−1)‖ ≤ ε (ε is a small positive real number), stop and use V̂^(i) as the solution of the HJI equation (4), i.e., use û^(i) as the H∞ controller; else, go back to Step 2 and continue.
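As a computational illustration of Steps 2 and 3, the following sketch (function and variable names are illustrative assumptions, not from the paper) forms the LS update (32) from basis-function samples and integrated costs, and then evaluates the policy updates (30) and (31) from the resulting weights.

```python
import numpy as np

def ls_weight_update(Phi, Y):
    """LS update (32): c = (X X^T)^{-1} X Y.
    Phi: (N, Nbar+1) basis vectors Psi_N(x) at the Nbar+1 sampling instants.
    Y:   (Nbar,) integrated costs y(...) over the Nbar subintervals."""
    X = Phi[:, :-1] - Phi[:, 1:]              # columns Psi_N(x_k) - Psi_N(x_{k+1})
    return np.linalg.solve(X @ X.T, X @ Y)    # requires X X^T nonsingular

def policies_from_weights(c, grad_psi, g, k, R, gamma):
    """Policy updates (30)-(31) at one state x.
    grad_psi: (N, n) Jacobian of Psi_N at x; g: (n, m); k: (n, q)."""
    dV = grad_psi.T @ c                        # approximate gradient of V_hat at x
    u = -0.5 * np.linalg.solve(R, g.T @ dV)    # u = -1/2 R^{-1} g^T (grad Psi)^T c
    w = (0.5 / gamma**2) * (k.T @ dV)          # w = 1/(2 gamma^2) k^T (grad Psi)^T c
    return u, w
```

Each column of X pairs with the integrated cost over the same subinterval, exactly as in (32); if X X^T becomes ill-conditioned, the probing-noise or state-reset devices mentioned above restore its rank.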
Remark 6: It should be pointed out that the word “simultaneous” in this paper and the word “synchronous” in [33] represent different meanings. The former emphasizes the same “iterative step,” while the latter emphasizes the same “time instant.” In this paper, the online SPUA updates the control and disturbance policies at the same iterative step, while the algorithm in [33] updates the control and disturbance policies at different iterative steps.

Remark 7: It is worth emphasizing that, different from the result in [11], which was used for solving the HJB equation of the nonlinear optimal control problem, the proposed online SPUA is developed for solving the HJI equation of the nonlinear H∞ control problem. Moreover, there are two main differences between the approach in [34] and the proposed online SPUA. 1) The former is an offline approach that requires the system model, while the latter is an online one that does not need knowledge of the internal system dynamics. 2) The method in [34] brings two iterative loops, while the online SPUA involves only one loop.

E. Convergence of LS NN-Based Online SPUA

In this section, we show that the LS NN-based online SPUA converges to the solution of the HJI equation (4).

It is seen that the LS NN method derived in the above section is used for solving (13) for the cost function V^(i+1)(x). It follows from the proof of Theorem 1 that (13) is equal to (A5), which means that the LS NN approach is nothing but a means of solving (A5) mathematically.

We notice that (A5) is essentially the same as the Lyapunov equation in [45] from a purely mathematical view, because both of them are first-order linear PDEs of the same form. Some important theories have been established in the computational mathematics community to solve these types of PDEs, such as the LS approach in [45] and Galerkin's method in [30] and [32]. Moreover, in [45], the convergence of the LS NN approach for solving the first-order linear PDE was established, which provides the theoretical foundation of the online LS NN method in this paper. We can directly obtain the following lemma by using the results in [45].

Lemma 6: For i = 0, 1, 2, …, assume that the solution of equation (A5) satisfies V^(i+1) ∈ H^{1,2}(Ω), the NN activation functions ψ_j ∈ H^{1,2}(Ω), j = 1, 2, …, N, are chosen such that they are complete as N → ∞ and V^(i+1) and ∇V^(i+1) can be uniformly approximated, and the set {φ_j(x1, x2) = ψ_j(x1) − ψ_j(x2)}_{j=1}^N, ∀x1, x2 ∈ Ω, x1 ≠ x2, is linearly independent and complete. Then, for i = 0, 1, 2, …

sup_{x∈Ω} |V̂^(i+1)(x) − V^(i+1)(x)| → 0
sup_{x∈Ω} ‖∇V̂^(i+1)(x) − ∇V^(i+1)(x)‖ → 0
sup_{x∈Ω} ‖û^(i+1)(x) − u^(i+1)(x)‖ → 0
sup_{x∈Ω} ‖ŵ^(i+1)(x) − w^(i+1)(x)‖ → 0.

Proof: In order to use the results in [45], we first show that the set {∇ψ_j^T (f + gu^(i) + kw^(i))} is linearly independent by contradiction. Suppose this is not true; then there exists a
nonzero vector c = [c1 ⋯ cN] ∈ R^N such that

Σ_{j=1}^N c_j ∇ψ_j^T (f + gu^(i) + kw^(i)) = 0

which implies that for ∀x(t) ∈ Ω

∫_t^{t+Δt} Σ_{j=1}^N c_j ∇ψ_j^T (f + gu^(i) + kw^(i)) dτ = Σ_{j=1}^N c_j [ψ_j(x(t + Δt)) − ψ_j(x(t))] = 0

which means

Σ_{j=1}^N c_j φ_j(x(t + Δt), x(t)) = 0.

This contradicts the linear independence of {φ_j}_{j=1}^N. Thus, the set {∇ψ_j^T (f + gu^(i) + kw^(i))} is linearly independent.

Then, the first three items of Lemma 6 can be proved by following the same procedure used in Theorem 2 and Corollary 2 of [45]. The result sup_{x∈Ω} ‖ŵ^(i+1)(x) − w^(i+1)(x)‖ → 0 can also be proved by arguments similar to those for sup_{x∈Ω} ‖û^(i+1)(x) − u^(i+1)(x)‖ → 0.

Lemma 6 shows that the LS NN approach can achieve uniform approximation of the solution of (A5).

Theorem 2: If the conditions in Lemma 6 hold, then for ∀ς > 0, ∃ i0, N0 such that, when i ≥ i0 and N ≥ N0, we have

sup_{x∈Ω} |V̂^(i)(x) − V*(x)| < ς
sup_{x∈Ω} ‖û^(i)(x) − u*(x)‖ < ς
sup_{x∈Ω} ‖ŵ^(i)(x) − w*(x)‖ < ς.

Proof: The first two items can be proved by following the same procedure used in the proofs of Theorems 3 and 4 of [45]. The result sup_{x∈Ω} ‖ŵ^(i)(x) − w*(x)‖ < ς can also be proved in a way similar to sup_{x∈Ω} ‖û^(i)(x) − u*(x)‖ < ς.

Theorem 2 demonstrates the uniform convergence of the proposed online SPUA with LS NN approximation.

Remark 8: Observe that the convergence proof of the proposed LS NN approach in this paper is almost the same as that in [45]. The reason is that both the proposed LS NN approach in this paper and the one in [45] are developed for solving a first-order linear PDE. However, the final goal of [45] is to solve an HJB equation of the nonlinear optimal control problem, while our aim is to solve an HJI equation of the nonlinear H∞ control problem. On the other hand, the method in [45] is an offline one that requires the system model, while our method is an online one that does not require knowledge of the internal system dynamics.

IV. SIMULATION STUDIES

In this section, we present simulation studies on two examples to illustrate the effectiveness of the developed NN-based online SPUA.

A. Simulations on Linear System

The first example considers the following F-16 aircraft plant studied in [47] and [48]:

ẋ = Ax + B1 w + B2 u,  z = Cx    (33)

where C = I and

A = [[−1.01887, 0.90506, −0.00215], [0.82225, −1.07741, −0.17555], [0, 0, −1]],  B1 = [1, 0, 0]^T,  B2 = [0, 0, 1]^T.

The system state vector is x = [α, q, δe]^T, where α denotes the angle of attack, q is the pitch rate, and δe is the elevator deflection angle. The control input u is the elevator actuator voltage, and the disturbance w is a wind gust on the angle of attack. Select R = I and γ = 5. Letting V*(x) = x^T P x, the HJI equation (4) for the linear system (33) becomes the following H∞ ARE:

A^T P + P A + C^T C + γ⁻² P B1 B1^T P − P B2 R⁻¹ B2^T P = 0    (34)

and the corresponding H∞ control law (5) is

u*(x) = −R⁻¹ B2^T P x.    (35)

Solving the ARE (34) with the MATLAB command CARE, we obtain

P = [[1.6573, 1.3954, −0.1661], [1.3954, 1.6573, −0.1804], [−0.1661, −0.1804, 0.4371]].    (36)
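An equivalent computation can be sketched in Python (the names and the solution route via the Hamiltonian matrix are our own illustrative choices; the paper itself uses CARE): the stabilizing solution of (34) is extracted from the stable invariant subspace of the associated Hamiltonian matrix, assuming that solution exists for the chosen γ.

```python
import numpy as np
from scipy.linalg import schur

# Data of the F-16 model (33) with R = I and gamma = 5.
A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B1 = np.array([[1.0], [0.0], [0.0]])
B2 = np.array([[0.0], [0.0], [1.0]])
C, R, gamma = np.eye(3), np.eye(1), 5.0

n = A.shape[0]
S = B1 @ B1.T / gamma**2 - B2 @ np.linalg.solve(R, B2.T)   # indefinite quadratic term of (34)
H = np.block([[A, S], [-C.T @ C, -A.T]])                    # Hamiltonian matrix of the game ARE

T, Z, sdim = schur(H, output='real', sort='lhp')            # order stable eigenvalues first
U1, U2 = Z[:n, :n], Z[n:, :n]                               # basis of the stable invariant subspace
P = U2 @ np.linalg.inv(U1)
P = 0.5 * (P + P.T)                                          # symmetrize; should reproduce (36)
c_true = np.array([P[0, 0], 2*P[0, 1], 2*P[0, 2],
                   P[1, 1], 2*P[1, 2], P[2, 2]])             # the NN weights in (37) below
```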
Here, we use the proposed NN-based online SPUA to solve the H∞ control problem of system (33). Select six polynomials as activation functions as follows:

Ψ_N(x) = [x1², x1x2, x1x3, x2², x2x3, x3²]^T

thus, the true values of the NN weights c are

c = [P11, 2P12, 2P13, P22, 2P23, P33]^T = [1.6573, 2.7908, −0.3322, 1.6573, −0.3608, 0.4370]^T.    (37)

Select the stop criterion ε = 10⁻⁷, the initial state x(0) = [1, 1, 1]^T, the initial NN weights c^(0) = 0, and the sampling interval δt = 0.05 s. In each iterative step, after collecting 10 (i.e., N̄ = 10) system state measurements, the LS method (32) is used to update the NN weights; that is, the NN weights are updated every 0.5 s (i.e., Δt = 0.5 s). After each update, we reset the system state to the initial state x0. Fig. 2 shows the weights c^(i) at each iterative step, where we can observe that the NN weights converge to the true values in (37) at t0 = 2.5 s. Then the solution of the HJI equation is computed via (28) and the corresponding H∞ controller is obtained by (30). Select a disturbance signal as

w(t) = 8e^{−(t−t0)} cos(t − t0) for t ≥ t0, and w(t) = 0 for t < t0    (38)
and use the resulting H∞ controller for closed-loop system simulations. Fig. 3 shows the closed-loop state trajectories and the control action u(t). The trajectories in the first 2.5 s correspond to the phase in which the online SPUA is applied to learn the NN weights.

Fig. 2. NN weights for the first example with Δt = 0.5 s.
Fig. 3. For the first example: (a) closed-loop state trajectories and (b) control action u(t).

In order to test the influence of different Δt on the performance of the online SPUA, we run the online SPUA on Example 1 again under the same parameters as above, except that Δt = 1 s (i.e., δt = 0.1 s). Fig. 4 gives the NN weights at each iterative step. It is noticed that the weights converge at t0 = 5 s, which is doubled compared with Fig. 2. However, it is also found that the online SPUA converges at iterative step 5 (i.e., i = 5), which is the same as the result obtained by setting Δt = 0.5 s. This means that the change of Δt has a great effect on the convergence time, but little effect on the number of iterative steps needed for convergence.

Fig. 4. NN weights for the first example with Δt = 1 s.
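The closed-loop replay behind Fig. 3 can be sketched as follows (an illustrative reconstruction under our own naming, using the model (33), the gain from (35) and (36), and the disturbance (38) with t0 = 2.5 s; it is not the authors' simulation code).

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B1 = np.array([[1.0], [0.0], [0.0]])
B2 = np.array([[0.0], [0.0], [1.0]])
P = np.array([[ 1.6573,  1.3954, -0.1661],
              [ 1.3954,  1.6573, -0.1804],
              [-0.1661, -0.1804,  0.4371]])       # ARE solution (36)
R, t0 = np.eye(1), 2.5

K = np.linalg.solve(R, B2.T @ P)                  # H-infinity gain of (35): u = -K x

def w_dist(t):                                    # disturbance signal (38)
    return 8.0 * np.exp(-(t - t0)) * np.cos(t - t0) if t >= t0 else 0.0

def closed_loop(t, x):
    u = -K @ x
    return A @ x + B2 @ u + B1.ravel() * w_dist(t)

sol = solve_ivp(closed_loop, (0.0, 30.0), [1.0, 1.0, 1.0], max_step=0.01)
# sol.t, sol.y give the kind of state trajectories plotted in Fig. 3.
```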
B. Simulations on Nonlinear System

The second example is constructed by using the converse HJB approach [49]. The system model is given as follows:

ẋ = [−0.25x1; 0.5x1²x2 − 0.5γ⁻²x2³ + 0.5x2] + [0; x1] w + [0; x2] u
z = x.

With the choice of γ = 2, the solution of the associated HJI equation is V*(x) = 2x1² + x2².

Fig. 5. NN weights for the second example.

Select R = I, x(0) = [0.4, 0.5]^T, ε = 10⁻⁷, the NN activation functions Ψ_N(x) = [x1², x1x2, x2², x1⁴, x2⁴]^T, and the initial NN weights c^(0) = 0. The parameters of the state sampling are the same as in Example 1. After each update, we reset the system state to the initial state x0. With the proposed NN-based online SPUA, the simulation results are shown in Figs. 5 and 6. Fig. 5 indicates the weights c^(i) at each iterative step, where it can be observed that the NN weights converge to the true weights (i.e., [2, 0, 1, 0, 0]^T) at t0 = 3 s. With the resulting NN weights at instant t0 = 3 s, we can obtain the solution of the HJI equation (4) by (28) and the corresponding H∞ controller by (30). Select a disturbance signal as in (38) with t0 = 3 s
and define the function r(t) as

r(t) = ∫_{t0}^t (‖z(τ)‖² + ‖u(τ)‖²_R) dτ / ∫_{t0}^t ‖w(τ)‖² dτ.    (39)

Applying the resulting H∞ controller to the system, Fig. 6 shows the closed-loop state trajectories, the control action u(t), and the evolution of r(t). The trajectories in the first 3 s of Fig. 6 correspond to the learning phase, in which the online SPUA is used to learn the NN weights. It can also be seen from Fig. 6 that the closed-loop system is asymptotically stable, and r(t) converges to 0.9232, which satisfies the L2-gain requirement (i.e., r(t) < γ² = 4) as t → ∞.

Fig. 6. For the second example: (a) closed-loop state trajectories, (b) control action u(t), and (c) evolution of r(t).

V. CONCLUSION

In this paper, an NN-based online SPUA has been developed to solve the HJI equation of the H∞ state feedback control problem for nonlinear systems. The H∞ control problem was viewed as a zero-sum game, where the control is a minimizing player and the disturbance is a maximizing one. By updating the two players' policies simultaneously, an online SPUA was developed.

APPENDIX
PROOF OF LEMMA 3

For ∀V ∈ V and W ∈ Ṽ ⊂ V, where Ṽ is a neighborhood of V, we have

G(V + sW) − G(V) = (∇(V + sW))^T f + h^T h − (1/4)(∇(V + sW))^T g R⁻¹ g^T ∇(V + sW) + (1/(4γ²))(∇(V + sW))^T k k^T ∇(V + sW) − [(∇V)^T f + h^T h − (1/4)(∇V)^T g R⁻¹ g^T ∇V + (1/(4γ²))(∇V)^T k k^T ∇V]
= s(∇W)^T f − (s/4)(∇W)^T g R⁻¹ g^T ∇V − (s/4)(∇V)^T g R⁻¹ g^T ∇W − (s²/4)(∇W)^T g R⁻¹ g^T ∇W + (s/(4γ²))(∇W)^T k k^T ∇V + (s/(4γ²))(∇V)^T k k^T ∇W + (s²/(4γ²))(∇W)^T k k^T ∇W.

Thus, the Gâteaux differential at V is

L(W) = lim_{s→0} (G(V + sW) − G(V))/s = (∇W)^T f − (1/4)(∇W)^T g R⁻¹ g^T ∇V − (1/4)(∇V)^T g R⁻¹ g^T ∇W + (1/(4γ²))(∇W)^T k k^T ∇V + (1/(4γ²))(∇V)^T k k^T ∇W.    (A1)

Next, we will prove that the map L = G′(V) is continuous. For ∀W0 ∈ Ṽ, it is immediate that

L(W) − L(W0) = (∇(W − W0))^T f − (1/4)(∇(W − W0))^T g R⁻¹ g^T ∇V − (1/4)(∇V)^T g R⁻¹ g^T ∇(W − W0) + (1/(4γ²))(∇(W − W0))^T k k^T ∇V + (1/(4γ²))(∇V)^T k k^T ∇(W − W0).
‖L(W) − L(W0)‖ ≤ M‖W − W0‖ < ε    (A3)

when ‖W − W0‖ < δ. This means L = G′(V) is continuous on Ṽ; thus, according to Lemma 2, L(W) = G′(V)W [i.e., (A1)] is the Fréchet differential, and L = G′(V) is the Fréchet derivative at V.

Proof of Theorem 1: We will give the proof in two steps.

1) On the one hand, with policies u^(i) and w^(i), the system state x(t) satisfies

ẋ(t) = f(x) + g(x)u^(i)(t) + k(x)w^(i)(t).    (A4)

Let V^(i+1) be a solution of the following equation:

(∇V^(i+1))^T (f + gu^(i) + kw^(i)) + ‖h‖² + ‖u^(i)‖²_R − γ²‖w^(i)‖² = 0.    (A5)

Using (A4), (A5) can be rewritten as

(d/dt) V^(i+1)(x(t)) = −(‖h‖² + ‖u^(i)‖²_R − γ²‖w^(i)‖²).    (A6)

Integrating (A6) from t to t + Δt yields

V^(i+1)(x(t + Δt)) − V^(i+1)(x(t)) = −∫_t^{t+Δt} (‖h(x(τ))‖² + ‖u^(i)(τ)‖²_R − γ²‖w^(i)(τ)‖²) dτ    (A7)

which implies that the solution V^(i+1) satisfies (13). Obviously, (13) can be rewritten as

V^(i+1)(x(t)) = ∫_t^∞ (‖h(x(τ))‖² + ‖u^(i)(τ)‖²_R − γ²‖w^(i)(τ)‖²) dτ.    (A8)

which must hold for any x ∈ Ω. Thus, we have

(d/dt)(V̄^(i+1) − V^(i+1)) = 0.    (A13)

This means V̄^(i+1)(x) = V^(i+1)(x) − d, where d is a constant. Due to V̄^(i+1)(0) = V^(i+1)(0) = 0 and V̄^(i+1)(0) = V^(i+1)(0) − d, then d = 0. Thus, V̄^(i+1)(x) = V^(i+1)(x). Therefore, (13) is equal to (A5).

2) On the other hand, it follows from (16) and (20) that

V^(i+1) = T V^(i) = V^(i) − (G′(V^(i)))⁻¹ G(V^(i))

which can be rewritten as

G′(V^(i)) V^(i+1) = G′(V^(i)) V^(i) − G(V^(i)).    (A14)

From (14), (15), and (19), we have

G′(V^(i)) V^(i+1) = (∇V^(i+1))^T f − (1/4)(∇V^(i+1))^T g R⁻¹ g^T ∇V^(i) − (1/4)(∇V^(i))^T g R⁻¹ g^T ∇V^(i+1) + (1/(4γ²))(∇V^(i+1))^T k k^T ∇V^(i) + (1/(4γ²))(∇V^(i))^T k k^T ∇V^(i+1)
= (∇V^(i+1))^T (f − (1/2) g R⁻¹ g^T ∇V^(i) + (1/(2γ²)) k k^T ∇V^(i))
= (∇V^(i+1))^T (f + gu^(i) + kw^(i))    (A15)

G′(V^(i)) V^(i) = (∇V^(i))^T f − (1/4)(∇V^(i))^T g R⁻¹ g^T ∇V^(i) − (1/4)(∇V^(i))^T g R⁻¹ g^T ∇V^(i) + (1/(4γ²))(∇V^(i))^T k k^T ∇V^(i) + (1/(4γ²))(∇V^(i))^T k k^T ∇V^(i)
= (∇V^(i))^T f − 2‖u^(i)‖²_R + 2γ²‖w^(i)‖²    (A16)

G(V^(i)) = (∇V^(i))^T f + h^T h − (1/4)(∇V^(i))^T g R⁻¹ g^T ∇V^(i) + (1/(4γ²))(∇V^(i))^T k k^T ∇V^(i)
= (∇V^(i))^T f + ‖h‖² − ‖u^(i)‖²_R + γ²‖w^(i)‖².    (A17)

Substituting (A15)–(A17) into (A14) gives

(∇V^(i+1))^T (f + gu^(i) + kw^(i)) = (∇V^(i))^T f − 2‖u^(i)‖²_R + 2γ²‖w^(i)‖² − ((∇V^(i))^T f + ‖h‖² − ‖u^(i)‖²_R + γ²‖w^(i)‖²)

that means

(∇V^(i+1))^T (f + gu^(i) + kw^(i)) + ‖h‖² + ‖u^(i)‖²_R − γ²‖w^(i)‖² = 0.

This means that (20) in Newton's iteration is also equal to (A5). This completes the proof.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the anonymous reviewers for their helpful comments and suggestions, which have improved the presentation of this paper.

REFERENCES

[1] T. Başar and P. Bernhard, H∞ Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, 2nd ed. Boston, MA: Birkhäuser, 1995.
[2] A. J. van der Schaft, L2-Gain and Passivity Techniques in Nonlinear Control. Berlin, Germany: Springer-Verlag, 1996.
[3] K. Zhou, J. C. Doyle, and K. Glover, Robust and Optimal Control. Upper Saddle River, NJ: Prentice-Hall, 1996.
[4] A. Isidori and W. Kang, “H∞ control via measurement feedback for general nonlinear systems,” IEEE Trans. Autom. Control, vol. 40, no. 3, pp. 466–472, Mar. 1995.
[5] G. Bianchini, R. Genesio, A. Parenti, and A. Tesi, “Global H∞ controllers for a class of nonlinear systems,” IEEE Trans. Autom. Control, vol. 49, no. 2, pp. 244–249, Feb. 2004.
[6] A. J. van der Schaft, “L2-gain analysis of nonlinear systems and nonlinear state-feedback H∞ control,” IEEE Trans. Autom. Control, vol. 37, no. 6, pp. 770–784, Jun. 1992.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008.
[8] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008.
[9] Q. M. Yang, J. B. Vance, and S. Jagannathan, “Control of nonaffine nonlinear discrete-time systems using reinforcement-learning-based linearly parameterized neural networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 994–1001, Aug. 2008.
[10] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009.
[11] D. Vrabie and F. L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems,” Neural Netw., vol. 22, no. 3, pp. 237–246, 2009.
[12] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.
[13] K. G. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010.
[14] F. Wang, N. Jin, D. Liu, and Q. Wei, “Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 24–36, Jan. 2011.
[15] T. Dierks and S. Jagannathan, “Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1118–1129, Jul. 2012.
[16] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2226–2236, Dec. 2011.
[17] Y. Jiang and Z. P. Jiang, “Approximate dynamic programming for optimal stationary control with control-dependent noise,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2392–2398, Dec. 2011.
[18] H. Zhang, R. Song, Q. Wei, and T. Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1851–1862, Dec. 2011.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[20] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Sep. 2009.
[21] F. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: An introduction,” IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009.
[22] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. Hoboken, NJ: Wiley, 2007.
[23] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[24] X. Xu, C. Liu, S. X. Yang, and D. Hu, “Hierarchical approximate policy iteration with binary-tree state space decomposition,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1863–1877, Dec. 2011.
[25] M. Fairbank, E. Alonso, and D. Prokhorov, “Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1671–1676, Oct. 2012.
[26] Y. Feng, B. Anderson, and M. Rotkowitz, “A game theoretic algorithm to compute local stabilizing solutions to HJBI equations in nonlinear H∞ control,” Automatica, vol. 45, no. 4, pp. 881–888, 2009.
[27] A. Lanzon, Y. Feng, B. D. O. Anderson, and M. Rotkowitz, “Computing the positive stabilizing solution to algebraic Riccati equations with an indefinite quadratic term via a recursive method,” IEEE Trans. Autom. Control, vol. 53, no. 10, pp. 2280–2291, Nov. 2008.
[28] J. Huang and C. Lin, “Numerical approach to computing nonlinear H∞ control laws,” AIAA J. Guidance, Control, Dynamics, vol. 18, no. 5, pp. 989–994, 1995.
[29] G. N. Saridis and C. G. Lee, “An approximation theory of optimal control for trainable manipulators,” IEEE Trans. Syst., Man, Cybern., vol. 9, no. 3, pp. 152–159, Mar. 1979.
[30] R. Beard, G. N. Saridis, and J. Wen, “Galerkin approximations of the generalized Hamilton–Jacobi–Bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997.
[31] R. Beard, G. N. Saridis, and J. Wen, “Approximate solutions to the time-invariant Hamilton–Jacobi–Bellman equation,” J. Optim. Theory Appl., vol. 96, no. 3, pp. 589–626, 1998.
[32] R. W. Beard and T. W. McLain, “Successive Galerkin approximation algorithms for nonlinear optimal and robust control,” Int. J. Control, vol. 71, no. 5, pp. 717–743, 1998.
[33] K. G. Vamvoudakis and F. L. Lewis, “Online solution of nonlinear two-player zero-sum games using synchronous policy iteration,” Int. J. Robust Nonlinear Control, vol. 22, no. 13, pp. 1460–1483, 2011.
[34] M. Abu-Khalaf, F. L. Lewis, and J. Huang, “Policy iterations on the Hamilton–Jacobi–Isaacs equation for H∞ state feedback control with input saturation,” IEEE Trans. Autom. Control, vol. 51, no. 12, pp. 1989–1995, Dec. 2006.
[35] M. Abu-Khalaf, F. L. Lewis, and J. Huang, “Neurodynamic programming and zero-sum games for constrained control systems,” IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1243–1252, Jul. 2008.
[36] M. Abu-Khalaf, J. Huang, and F. L. Lewis, Nonlinear H2/H-Infinity Constrained Feedback Control: A Practical Design Approach Using Neural Networks. New York: Springer-Verlag, 2006.
[37] D. Vrabie and F. L. Lewis, “Adaptive dynamic programming for online solution of a zero-sum differential game,” J. Control Theory Appl., vol. 9, no. 3, pp. 353–360, 2011.
[38] E. Zeidler, Nonlinear Functional Analysis: Fixed Point Theorems, vol. 1. New York: Springer-Verlag, 1985.
[39] L. Kantorovitch, “The method of successive approximation for functional equations,” Acta Math., vol. 71, no. 1, pp. 63–97, 1939.
[40] R. A. Tapia, “The Kantorovich theorem for Newton's method,” Amer. Math. Monthly, vol. 78, no. 4, pp. 389–392, 1971.
[41] L. B. Rall, “A note on the convergence of Newton's method,” SIAM J. Numer. Anal., vol. 11, no. 1, pp. 34–36, 1974.
[42] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Trans. Syst., Man, Cybern., vol. 13, no. 5, pp. 834–846, May 1983.
[43] V. Konda, “On actor-critic algorithms,” Ph.D. dissertation, Dept. Electr. Eng. & Comput. Sci., Massachusetts Inst. Technology, Cambridge, 2002.
[44] V. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[45] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005.
[46] B. A. Finlayson, The Method of Weighted Residuals and Variational Principles. New York: Academic, 1972.
[47] B. Stevens and F. L. Lewis, Aircraft Control and Simulation, 2nd ed. New York: Wiley, 2003.
[48] K. G. Vamvoudakis, “Online learning algorithms for differential dynamic games and optimal control,” Ph.D. dissertation, Faculty of the Graduate School, Univ. Texas at Arlington, Arlington, 2011.
[49] V. Nevistić and J. A. Primbs, “Constrained nonlinear optimal control: A converse HJB approach,” Dept. Control & Dynamical Syst., California Inst. Technology, Pasadena, Tech. Rep. TR96-021, 1996.

Huai-Ning Wu was born in Anhui, China, on November 15, 1972. He received the B.E. degree in automation from the Shandong Institute of Building Materials Industry, Jinan, China, and the Ph.D. degree in control theory and control engineering from Xi'an Jiaotong University, Xi'an, China, in 1992 and 1997, respectively.
He was a Post-Doctoral Researcher with the Department of Electronic Engineering, Beijing Institute of Technology, Beijing, China, from 1997 to 1999. He joined the School of Automation Science and Electrical Engineering, Beihang University (formerly Beijing University of Aeronautics and Astronautics), Beijing, in 1999. From 2005 to 2006, he was a Senior Research Associate with the Department of Manufacturing Engineering and Engineering Management (MEEM), City University of Hong Kong, Kowloon, Hong Kong, where he was a Research Fellow from 2006 to 2008 and from July to August 2010. From July to August 2011, he was a Research Fellow with the Department of Systems Engineering and Engineering Management, City University of Hong Kong. He is currently a Professor with Beihang University. His current research interests include robust control and filtering, fault-tolerant control, distributed parameter systems, and fuzzy and neural modeling and control.
Dr. Wu is a member of the Committee of Technical Process Failure Diagnosis and Safety of the Chinese Association of Automation.

Biao Luo received the B.E. degree in measuring and control technology and instrumentations and the M.E. degree in control theory and control engineering from Xiangtan University, Xiangtan, China, in 2006 and 2009, respectively. He is currently pursuing the Ph.D. degree in control science and engineering with Beihang University (formerly Beijing University of Aeronautics and Astronautics), Beijing, China.
His current research interests include distributed parameter systems, optimal control, data-based control, fuzzy and neural modeling and control, hypersonic entry and re-entry guidance, reinforcement learning, approximate dynamic programming, and evolutionary computation.
Mr. Luo was a recipient of the Excellent Master Dissertation Award of Hunan Province in 2011.