Neural Network Based Online Simultaneous Policy Update Algorithm For Solving The HJI Equation in Nonlinear H∞ Control
Abstract— It is well known that the nonlinear H∞ state feedback control problem relies on the solution of the Hamilton–Jacobi–Isaacs (HJI) equation, which is a nonlinear partial differential equation that has proven to be impossible to solve analytically. In this paper, a neural network (NN)-based online simultaneous policy update algorithm (SPUA) is developed to solve the HJI equation, in which knowledge of the internal system dynamics is not required. First, we propose an online SPUA, which can be viewed as a reinforcement learning technique for two players to learn their optimal actions in an unknown environment. The proposed online SPUA updates the control and disturbance policies simultaneously; thus, only one iterative loop is needed. Second, the convergence of the online SPUA is established by proving that it is mathematically equivalent to Newton's method for finding a fixed point in a Banach space. Third, we develop an actor-critic structure for the implementation of the online SPUA, in which only one critic NN is needed for approximating the cost function, and a least-squares method is given for estimating the NN weight parameters. Finally, simulation studies are provided to demonstrate the effectiveness of the proposed algorithm.

Index Terms— H∞ state feedback control, Hamilton–Jacobi–Isaacs equation, neural network, online, simultaneous policy update algorithm.

Manuscript received January 6, 2012; revised August 29, 2012; accepted August 29, 2012. Date of publication October 12, 2012; date of current version November 20, 2012. This work was supported in part by the National Basic Research Program of China through the 973 Program under Grant 2012CB720003 and in part by the National Natural Science Foundation of China under Grant 61074057, Grant 61121003, and Grant 91016004. The authors are with the Science and Technology on Aircraft Control Laboratory, School of Automation Science and Electrical Engineering, Beihang University (formerly Beijing University of Aeronautics and Astronautics), Beijing 100191, China (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2012.2217349

I. INTRODUCTION

OVER the past few decades, a large number of theoretical results on H∞ control have been reported [1]–[6]. Although nonlinear H∞ control theory has been well developed, the main bottleneck for its practical application is the need to solve the Hamilton–Jacobi–Isaacs (HJI) equation. The HJI equation, similar to the Hamilton–Jacobi–Bellman (HJB) equation of nonlinear optimal control, is a first-order nonlinear partial differential equation (PDE) that is difficult or impossible to solve and may not have global analytic solutions even in simple cases.

In recent years, reinforcement learning (RL) and approximate dynamic programming (ADP) have appeared to be promising techniques for approximately solving nonlinear optimal control problems [7]–[18]. RL [19]–[21] is a kind of machine learning method that refers to an actor or agent that interacts with its environment and aims to learn the optimal actions, or control policies, by observing their responses from the environment. ADP [20]–[23] solves the dynamic programming problem approximately and forward in time; thus, it affords a methodology for learning the feedback control actions online in real time based on system performance, without necessarily knowing the system dynamics. This overcomes the computational complexity, such as the curse of dimensionality [22], that exists in classical dynamic programming, which is an offline technique requiring a backward-in-time solution procedure. ADP has many implementation structures, such as heuristic dynamic programming (HDP), dual heuristic programming (DHP), and globalized DHP, which are widely employed for nonlinear discrete-time systems. In [7], an HDP algorithm was developed to solve the discrete-time HJB equation appearing in infinite-horizon discrete-time nonlinear optimal control, and a full proof of convergence was provided. In [12], the near-optimal control problem for a class of nonlinear discrete-time systems with control constraints was solved using a DHP method. Wang et al. [14] studied the finite-horizon optimal control problem of discrete-time nonlinear systems and suggested an iterative ADP algorithm to obtain the optimal control law, which makes the performance index function close to the greatest lower bound of all performance indices within an ε-error bound. Policy iteration is one of the most popular RL methods [20] for feedback controller design. In [10] and [11], the optimal control problems of linear and nonlinear continuous-time systems were solved online by policy iteration, respectively. Vamvoudakis and Lewis [13] gave an online synchronous policy iteration algorithm to learn the continuous-time optimal control solution with infinite-horizon cost for nonlinear systems with known dynamics, in which an actor and a critic neural network (NN) are involved, and the weights of both NNs are tuned at the same time instant.

However, it is clear that the HJI equation associated with the nonlinear H∞ control problem is generally more difficult to solve than the HJB equation appearing in nonlinear optimal control, since the disturbance inputs are additionally reflected in the HJI equation. The main difference between the HJB and HJI equations is that the HJB equation has a “negative semi-definite quadratic term,” while the HJI equation has an
“indefinite quadratic term” [26]. Thus, the methods for the HJB equation may not be directly applied to the HJI equation. In [27], the linear H∞ control problem was considered, in which the H∞ algebraic Riccati equation (ARE) with an indefinite quadratic term was converted to a sequence of H2 AREs with a negative semi-definite quadratic term. This work was subsequently extended to solve the HJI equation for nonlinear systems in [26]. In [28], the solution of the HJI equation was approximated by a Taylor series expansion, and an efficient algorithm was furnished to generate the coefficients of the Taylor series. In [6], it was proven that there exists a sequence of policy iterations on the control input that pursues the smooth solution of the HJI equation, where the HJI equation was successively approximated by a sequence of HJB equations. Then, the methods for solving HJB equations can be used for the solution of the HJI equation. In [29], the HJB equation was successively approximated by a sequence of linear generalized HJB (GHJB) equations, which can be solved by Galerkin's approximation as in [30] and [31]. Based on [6] and [29]–[31], policy iteration on the disturbance was used to approximate the HJI equation in [32], where each HJB equation in [6] was further successively approximated by a sequence of GHJB equations and solved by Galerkin's approximation. This obviously results in two iterative loops for the solution of the HJI equation, i.e., the inner loop solves an HJB equation by iteratively solving a sequence of GHJB equations, and the outer loop solves the HJI equation by iteratively solving a sequence of HJB equations. Following such a thought, the method in [13] was extended to solve the HJI equation in [33] with known dynamics. A policy iteration scheme was also developed in [34] for nonlinear systems with actuator saturation, and its implementation was facilitated on the basis of neurodynamic programming in [35] and [36], where NNs were used for approximating the value function.

Most of the methods mentioned in the above paragraph for solving the HJI equation of the H∞ control problem, such as [26]–[28] and [32]–[36], require full knowledge of the system dynamics. Furthermore, these approaches follow the thought that the HJI equation is first successively approximated with a sequence of HJB equations, and then each HJB equation is solved by the existing methods [26], [32]–[36]. This often brings two iterative loops because the control and disturbance policies are updated at different iterative steps. Such a procedure may lead to redundant equation solutions (i.e., redundant iterations) and thus a waste of resources, resulting in low efficiency. In [37], ADP was used to solve the linear H∞ control problem online without the need of the internal system dynamics, but it is still a linear special case based on the same procedure as the works in [32]–[36], i.e., it also involves two iterative loops.

In this paper, we propose an online simultaneous policy update algorithm (SPUA) for solving the HJI equation in nonlinear H∞ state feedback control. The main contributions of this paper include three aspects.
1) Propose an online SPUA, in which knowledge of the internal system dynamics is not required. To the best of our knowledge, this paper may be the first work that uses the RL technique for the online H∞ control design of nonlinear continuous-time systems with unknown internal system dynamics.
2) The online SPUA updates the control and disturbance policies simultaneously, which needs only one iterative loop rather than two. This is the essential difference between the online SPUA and the existing methods in [32]–[37]. Moreover, the theory of Newton's method in Banach space is introduced to prove the convergence of the online SPUA.
3) Develop an actor-critic structure for nonlinear H∞ control design without requiring knowledge of the internal system dynamics, where only one critic NN is needed for approximating the cost function and a least-squares (LS) method is given to estimate the NN weight parameters.

The rest of this paper is organized as follows. In Section II, we give the problem description. In Section III, we propose the NN-based online SPUA and discuss some related issues. Simulation studies are conducted in Section IV. Finally, a brief conclusion is drawn in Section V.

Notations: R, R^n, and R^{n×m} are the set of real numbers, the n-dimensional Euclidean space, and the set of all real n × m matrices, respectively. ‖·‖ denotes the vector norm or matrix norm in R^n or R^{n×m}, respectively. For a symmetric matrix M, M > (≥) 0 means that it is a positive (semi-positive) definite matrix. The superscript T is used for the transpose, and I denotes the identity matrix of appropriate dimension. ∇ = ∂/∂x denotes the gradient operator. L2[0, ∞) is a Banach space: for ∀w(t) ∈ L2[0, ∞), ∫_0^∞ ‖w(t)‖² dt < ∞. For a column vector function s(x), ‖s(x)‖ = (∫_Ω s^T(x)s(x) dx)^{1/2}, x ∈ Ω ⊂ R^n. H^{m,p}(Ω) is the Sobolev space that consists of the functions in L^p(Ω) whose derivatives up to order m are also in L^p(Ω).

II. PROBLEM DESCRIPTION

Consider the following partially unknown continuous-time nonlinear system with external disturbance:

ẋ(t) = f(x) + g(x)u(t) + k(x)w(t)    (1)
z(t) = h(x)    (2)

where x ∈ Ω ⊂ R^n is the state, u ∈ R^m is the control input with u(t) ∈ L2[0, ∞), w ∈ R^q is the external disturbance with w(t) ∈ L2[0, ∞), and z ∈ R^p is the objective output. f(x) is an unknown continuous nonlinear vector function satisfying f(0) = 0, which represents the internal system dynamics. g(x), k(x), and h(x) are known continuous vector or matrix functions of appropriate dimensions.

The H∞ control problem under consideration is to find a state feedback control law u(t) = u(x(t)) such that the closed-loop system (1) and (2) is asymptotically stable and has L2-gain less than or equal to γ, that is,

∫_0^∞ (‖z(t)‖² + ‖u(t)‖²_R) dt ≤ γ² ∫_0^∞ ‖w(t)‖² dt    (3)

for all w(t) ∈ L2[0, ∞), where ‖u(t)‖²_R = u^T R u, R > 0, and γ > 0 is some prescribed level of disturbance attenuation.
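For illustration, the finite-horizon counterpart of the L2-gain condition (3) can be checked numerically from sampled closed-loop trajectories. The following sketch (the function name and array shapes are illustrative assumptions, not part of the paper) evaluates the ratio of the two integrals by trapezoidal quadrature; the same quantity, started at the end of the learning phase, is what is reported as r(t) in Section IV.

```python
import numpy as np

def attenuation_ratio(t, z, u, w, R):
    """Finite-horizon check of condition (3).
    t: (T,) sample times; z: (T, p), u: (T, m), w: (T, q) sampled signals.
    Returns int(||z||^2 + ||u||^2_R) dt / int(||w||^2) dt, to compare with gamma^2."""
    running_cost = (z * z).sum(axis=1) + np.einsum('ij,jk,ik->i', u, R, u)
    num = np.trapz(running_cost, t)
    den = np.trapz((w * w).sum(axis=1), t)
    return num / den
```

A ratio at or below γ² over a sufficiently long horizon is consistent with (3) for the particular disturbance realization used.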
Remark 2: Although the procedure of policy iteration is also included in the methods of [32]–[36], there are two main differences between these methods and the online SPUA. 1) The methods in [32]–[36] are model-based ones, which require full knowledge of the system dynamics, while the online SPUA does not require the internal system dynamics f(x). 2) The methods in [32]–[36] update the control and disturbance policies at different iterative steps (i.e., one player updates its policy while the other remains invariant), which brings two iterative loops. In contrast, the online SPUA updates the control and disturbance policies at the same iterative step, in which only one iterative loop is needed. This means that the online SPUA is essentially different from the methods in [32]–[36]. Because the methods in [32]–[36] are based on the same procedure as in [32], their convergence can be directly guaranteed by the results in [6] and [32]. However, new tools are needed for the online SPUA to establish its convergence.

Remark 3: Notice that the SPUA avoids the identification of f(x), whose information is embedded in the online measurement of the states x(t) and x(t + Δt) and the evaluation of the cost ∫_t^{t+Δt} (‖h(x(τ))‖² + ‖u(τ)‖²_R − γ²‖w(τ)‖²) dτ. That is to say, the lack of knowledge about f(x) does not have any impact on the online SPUA obtaining the equilibrium solution. Thus, the resulting equilibrium behavior policies of the two players will not be affected by any errors between the dynamics of a model of the system and the dynamics of the real system.

C. Convergence of Online SPUA

In this section, we will prove the convergence of the online SPUA. Namely, we want to show that the solution of equation (13) converges to the solution of the HJI equation (4) as i goes to infinity. Just as mentioned in Remark 2, the online SPUA is essentially different from the algorithm framework in [32]–[36]. Hence, its convergence proof is also different.

To this end, let us consider a Banach space V ⊂ {V(x) | V(x): Ω → R, V(0) = 0} equipped with a norm ‖·‖, and consider the mapping G: V → V defined in (4). Define a mapping T: V → V as follows:

T V = V − (G′(V))⁻¹ G(V)    (16)

where G′(V) is the Fréchet derivative of G(·) at the point V. It should be noticed that both G′(V) and (G′(V))⁻¹ are operators on the Banach space V.

The Fréchet derivative is often difficult to compute directly; thus, we introduce the Gâteaux derivative.

Definition 1 (Gâteaux Derivative) [38]: Let G: U(V) ⊆ X → Y be a given map, with X and Y Banach spaces. Here, U(V) denotes a neighborhood of V. The map G is Gâteaux differentiable at V if there exists a bounded linear operator L: X → Y such that

G(V + sW) − G(V) = sL(W) + o(s),  s → 0    (17)

for all W with ‖W‖ = 1 and all real numbers s in some neighborhood of zero, where lim_{s→0} (o(s)/s) = 0. L is called the Gâteaux derivative of G at V. The Gâteaux differential at V is defined by L(W).

From (17), the Gâteaux differential at V can be defined equivalently through the following expression [38]:

L(W) = lim_{s→0} (G(V + sW) − G(V))/s.    (18)

Equation (18) gives a method to compute the Gâteaux derivative, rather than the Fréchet derivative required in (16). Thus, we introduce the following lemma to give the relationship between them.

Lemma 2 [38]: If G′ exists as a Gâteaux derivative in some neighborhood of V, and if G′ is continuous at V, then L = G′(V) is also a Fréchet derivative at V.

Now, it follows from Lemma 2 that we can compute the Fréchet derivative G′(V) in (16) via (18). We have the following result.

Lemma 3: Let G: V → V be the mapping defined by (4). Then, for ∀V ∈ V, the Fréchet differential of G at V is

G′(V)W = L(W) = (∇W)^T f − (1/4)(∇W)^T g R⁻¹ g^T ∇V − (1/4)(∇V)^T g R⁻¹ g^T ∇W + (1/(4γ²))(∇W)^T k k^T ∇V + (1/(4γ²))(∇V)^T k k^T ∇W.    (19)

Proof: See Appendix.

The following theorem provides an interesting result, in which we discover that the online SPUA is mathematically equivalent to Newton's iteration in the Banach space V.

Theorem 1: Let T be the mapping defined by (16). Then, the iteration from (13) to (15) is equivalent to the following Newton's iteration together with (14) and (15):

V^(i+1) = T V^(i),  i = 0, 1, 2, …    (20)

Proof: See Appendix.

Under some proper assumptions, Newton's iteration (20) can converge to the unique solution of the fixed-point equation T V* = V*, that is, the solution of the equation G(V*) = 0. The convergence of Newton's method is guaranteed by the following Kantorovich theorem [39], [40].

Lemma 4 (Kantorovich's Theorem): Assume that for some V^(0) ∈ V1 ⊂ V, (G′(V^(0)))⁻¹ exists and that:
1) ‖(G′(V^(0)))⁻¹‖ ≤ B0;    (21)
2) ‖(G′(V^(0)))⁻¹ G(V^(0))‖ ≤ η;    (22)
3) for all V^(1), V^(2) ∈ V1,
‖G′(V^(1)) − G′(V^(2))‖ ≤ K‖V^(1) − V^(2)‖    (23)
with h = B0 K η ≤ 1/2. Let
V2 = {V | ‖V − V^(0)‖ ≤ σ}, where σ = ((1 − √(1 − 2h))/h) η.    (24)
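To make the fixed-point iteration (16) and (20) concrete, the following toy sketch (a scalar plant with illustrative values of our own choosing, not taken from the paper) applies the same Newton update to the scalar analogue of the HJI equation: for ẋ = ax + b2·u + b1·w with cost weights q and r and V(x) = p·x², the HJI reduces to G(p) = 2ap + q + (b1²/γ² − b2²/r)p² = 0.

```python
import numpy as np

# Illustrative scalar values (not from the paper).
a, b1, b2, q, r, gamma = -1.0, 1.0, 1.0, 1.0, 1.0, 2.0
s = b1**2 / gamma**2 - b2**2 / r      # indefinite quadratic coefficient

def G(p):                             # scalar "HJI" residual
    return 2*a*p + q + s*p**2

def G_prime(p):                       # derivative of G (scalar Frechet derivative)
    return 2*a + 2*s*p

p = 0.0                               # initial guess V^(0) = 0
for i in range(20):
    step = G(p) / G_prime(p)
    p -= step                         # V^(i+1) = T V^(i), cf. (16) and (20)
    if abs(step) < 1e-12:
        break
print(i, p, G(p))                     # converges to the stabilizing root p ≈ 0.4305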
where ∇Ψ_N(x) = ((∂ψ1/∂x), …, (∂ψN/∂x))^T is the Jacobian of Ψ_N.

Remark 5: Note that only one NN is needed in the online SPUA (Algorithm 1), i.e., the critic NN for approximating the cost function via (28). After the weight vector is computed via (29), the control and disturbance policies in (14) and (15) can be approximately updated by (30) and (31) accordingly. Therefore, the iteration from (13) to (15) in the online SPUA is converted to the weight iteration from (29) to (31).

It is noticed that the NN weight parameters c^(i+1) have N unknown elements. Thus, in order to solve for c^(i+1), at least N equations are required. Here, we construct N̄ (N̄ ≥ N) equations and use an LS method to estimate c^(i+1). In each time interval [t, t + Δt], we collect N̄ sample data along the state trajectories and construct the LS solution of the NN weights as follows:

c^(i+1) = (X X^T)⁻¹ X Y    (32)

where

X = [Ψ_N(x(t)) − Ψ_N(x(t + δt))  ⋯  Ψ_N(x(t + (N̄ − 1)δt)) − Ψ_N(x(t + N̄δt))]
Y = [y(x(t), û^(i)(t), ŵ^(i)(t))  ⋯  y(x(t + (N̄ − 1)δt), û^(i)(t + (N̄ − 1)δt), ŵ^(i)(t + (N̄ − 1)δt))]^T

with δt = Δt/N̄ and

y(x(t + kδt), û^(i)(t + kδt), ŵ^(i)(t + kδt)) = ∫_{t+kδt}^{t+(k+1)δt} (‖h(x(τ))‖² + ‖û^(i)(τ)‖²_R − γ²‖ŵ^(i)(τ)‖²) dτ,  k = 0, …, N̄ − 1.

It is worth mentioning that the LS method (32) requires a nonsingular matrix X X^T. To attain this goal, we can inject probing noises into the states or reset the system states.

Based on the actor-critic structure and the above LS estimation of the NN weights, we develop an implementable NN-based online SPUA procedure, as shown in Algorithm 2.

Algorithm 2 NN-Based Online SPUA
Step 1: Select N activation functions Ψ_N(x). Given initial weights c^(0) such that V̂^(0) ∈ V0, let
û^(0) = −(1/2) R⁻¹ g^T ∇Ψ_N^T c^(0),  ŵ^(0) = (1/(2γ²)) k^T ∇Ψ_N^T c^(0),
and i = 0.
Step 2: With policies û^(i) and ŵ^(i), collect N̄ sample data along the state trajectories in the time interval [iΔt, (i + 1)Δt]. Compute c^(i+1) via (32) at time instant (i + 1)Δt.
Step 3: Update the control and disturbance policies by (30) and (31) at time instant (i + 1)Δt.
Step 4: Set i = i + 1. If ‖c^(i) − c^(i−1)‖ ≤ ε (ε is a small positive real number), stop and use V̂^(i) as the solution of the HJI equation (4), i.e., use û^(i) as the H∞ controller; else, go back to Step 2 and continue.
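As a computational illustration of Steps 2 and 3, the following sketch (function and variable names are illustrative assumptions, not from the paper) forms the LS update (32) from basis-function samples and integrated costs, and then evaluates the policy updates (30) and (31) from the resulting weights.

```python
import numpy as np

def ls_weight_update(Phi, Y):
    """LS update (32): c = (X X^T)^{-1} X Y.
    Phi: (N, Nbar+1) basis vectors Psi_N(x) at the Nbar+1 sampling instants.
    Y:   (Nbar,) integrated costs y(...) over the Nbar subintervals."""
    X = Phi[:, :-1] - Phi[:, 1:]              # columns Psi_N(x_k) - Psi_N(x_{k+1})
    return np.linalg.solve(X @ X.T, X @ Y)    # requires X X^T nonsingular

def policies_from_weights(c, grad_psi, g, k, R, gamma):
    """Policy updates (30)-(31) at one state x.
    grad_psi: (N, n) Jacobian of Psi_N at x; g: (n, m); k: (n, q)."""
    dV = grad_psi.T @ c                        # approximate gradient of V_hat at x
    u = -0.5 * np.linalg.solve(R, g.T @ dV)    # u = -1/2 R^{-1} g^T (grad Psi)^T c
    w = (0.5 / gamma**2) * (k.T @ dV)          # w = 1/(2 gamma^2) k^T (grad Psi)^T c
    return u, w
```

Each column of X pairs with the integrated cost over the same subinterval, exactly as in (32); if X X^T becomes ill-conditioned, the probing-noise or state-reset devices mentioned above restore its rank.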
Remark 6: It should be pointed out that the word “simultaneous” in this paper and the word “synchronous” in [33] represent different meanings. The former emphasizes the same “iterative step,” while the latter emphasizes the same “time instant.” In this paper, the online SPUA updates the control and disturbance policies at the same iterative step, while the algorithm in [33] updates the control and disturbance policies at different iterative steps.

Remark 7: It is worth emphasizing that, different from the result in [11], which was used for solving the HJB equation of the nonlinear optimal control problem, the proposed online SPUA is developed for solving the HJI equation of the nonlinear H∞ control problem. Moreover, there are two main differences between the approach in [34] and the proposed online SPUA. 1) The former is an offline approach that requires the system model, while the latter is an online one that does not need knowledge of the internal system dynamics. 2) The method in [34] brings two iterative loops, while the online SPUA involves only one loop.

E. Convergence of LS NN-Based Online SPUA

In this section, we show that the LS NN-based online SPUA converges to the solution of the HJI equation (4).

It is seen that the LS NN method derived in the above section is used for solving (13) for the cost function V^(i+1)(x). It follows from the proof of Theorem 1 that (13) is equal to (A5), which means that the LS NN approach is nothing but a means of solving (A5) mathematically.

We notice that (A5) is essentially the same as the Lyapunov equation in [45] from a purely mathematical view, because both of them are first-order linear PDEs of the same form. Some important theories have been established in the computational mathematics community to solve these types of PDEs, such as the LS approach in [45] and Galerkin's method in [30] and [32]. Moreover, in [45], the convergence of the LS NN approach for solving the first-order linear PDE was established, which provides the theoretical foundation of the online LS NN method in this paper. We can directly obtain the following lemma by using the results in [45].

Lemma 6: For i = 0, 1, 2, …, assume that the solution of equation (A5) satisfies V^(i+1) ∈ H^{1,2}(Ω), the NN activation functions ψ_j ∈ H^{1,2}(Ω), j = 1, 2, …, N, are chosen such that they are complete as N → ∞ and V^(i+1) and ∇V^(i+1) can be uniformly approximated, and the set {φ_j(x1, x2) = ψ_j(x1) − ψ_j(x2)}_{j=1}^N, ∀x1, x2 ∈ Ω, x1 ≠ x2, is linearly independent and complete. Then, for i = 0, 1, 2, …

sup_{x∈Ω} |V̂^(i+1)(x) − V^(i+1)(x)| → 0
sup_{x∈Ω} ‖∇V̂^(i+1)(x) − ∇V^(i+1)(x)‖ → 0
sup_{x∈Ω} ‖û^(i+1)(x) − u^(i+1)(x)‖ → 0
sup_{x∈Ω} ‖ŵ^(i+1)(x) − w^(i+1)(x)‖ → 0.

Proof: In order to use the results in [45], we first show that the set {∇ψ_j^T (f + gu^(i) + kw^(i))} is linearly independent by contradiction. Suppose this is not true; then there exists a
nonzero vector c = [c1 ⋯ cN] ∈ R^N such that

Σ_{j=1}^N c_j ∇ψ_j^T (f + gu^(i) + kw^(i)) = 0

which implies that for ∀x(t) ∈ Ω

∫_t^{t+Δt} Σ_{j=1}^N c_j ∇ψ_j^T (f + gu^(i) + kw^(i)) dτ = Σ_{j=1}^N c_j [ψ_j(x(t + Δt)) − ψ_j(x(t))] = 0

which means

Σ_{j=1}^N c_j φ_j(x(t + Δt), x(t)) = 0.

This contradicts the linear independence of {φ_j}_{j=1}^N. Thus, the set {∇ψ_j^T (f + gu^(i) + kw^(i))} is linearly independent.

Then, the first three items of Lemma 6 can be proved by following the same procedure used in Theorem 2 and Corollary 2 of [45]. The result sup_{x∈Ω} ‖ŵ^(i+1)(x) − w^(i+1)(x)‖ → 0 can also be proved by arguments similar to those for sup_{x∈Ω} ‖û^(i+1)(x) − u^(i+1)(x)‖ → 0.

Lemma 6 shows that the LS NN approach can achieve uniform approximation of the solution of (A5).

Theorem 2: If the conditions in Lemma 6 hold, then for ∀ς > 0, ∃ i0, N0 such that, when i ≥ i0 and N ≥ N0, we have

sup_{x∈Ω} |V̂^(i)(x) − V*(x)| < ς
sup_{x∈Ω} ‖û^(i)(x) − u*(x)‖ < ς
sup_{x∈Ω} ‖ŵ^(i)(x) − w*(x)‖ < ς.

Proof: The first two items can be proved by following the same procedure used in the proofs of Theorems 3 and 4 of [45]. The result sup_{x∈Ω} ‖ŵ^(i)(x) − w*(x)‖ < ς can also be proved in a way similar to sup_{x∈Ω} ‖û^(i)(x) − u*(x)‖ < ς.

Theorem 2 demonstrates the uniform convergence of the proposed online SPUA with LS NN approximation.

Remark 8: Observe that the convergence proof of the proposed LS NN approach in this paper is almost the same as that in [45]. The reason is that both the proposed LS NN approach in this paper and the one in [45] are developed for solving a first-order linear PDE. However, the final goal of [45] is to solve an HJB equation of the nonlinear optimal control problem, while our aim is to solve an HJI equation of the nonlinear H∞ control problem. On the other hand, the method in [45] is an offline one that requires the system model, while our method is an online one that does not require knowledge of the internal system dynamics.

IV. SIMULATION STUDIES

In this section, we present simulation studies on two examples to illustrate the effectiveness of the developed NN-based online SPUA.

A. Simulations on Linear System

The first example considers the following F-16 aircraft plant studied in [47] and [48]:

ẋ = Ax + B1 w + B2 u,  z = Cx    (33)

where C = I and

A = [[−1.01887, 0.90506, −0.00215], [0.82225, −1.07741, −0.17555], [0, 0, −1]],  B1 = [1, 0, 0]^T,  B2 = [0, 0, 1]^T.

The system state vector is x = [α, q, δe]^T, where α denotes the angle of attack, q is the pitch rate, and δe is the elevator deflection angle. The control input u is the elevator actuator voltage, and the disturbance w is a wind gust on the angle of attack. Select R = I and γ = 5. Letting V*(x) = x^T P x, the HJI equation (4) for the linear system (33) becomes the following H∞ ARE:

A^T P + P A + C^T C + γ⁻² P B1 B1^T P − P B2 R⁻¹ B2^T P = 0    (34)

and the corresponding H∞ control law (5) is

u*(x) = −R⁻¹ B2^T P x.    (35)

Solving the ARE (34) with the MATLAB command CARE, we obtain

P = [[1.6573, 1.3954, −0.1661], [1.3954, 1.6573, −0.1804], [−0.1661, −0.1804, 0.4371]].    (36)
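An equivalent computation can be sketched in Python (the names and the solution route via the Hamiltonian matrix are our own illustrative choices; the paper itself uses CARE): the stabilizing solution of (34) is extracted from the stable invariant subspace of the associated Hamiltonian matrix, assuming that solution exists for the chosen γ.

```python
import numpy as np
from scipy.linalg import schur

# Data of the F-16 model (33) with R = I and gamma = 5.
A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B1 = np.array([[1.0], [0.0], [0.0]])
B2 = np.array([[0.0], [0.0], [1.0]])
C, R, gamma = np.eye(3), np.eye(1), 5.0

n = A.shape[0]
S = B1 @ B1.T / gamma**2 - B2 @ np.linalg.solve(R, B2.T)   # indefinite quadratic term of (34)
H = np.block([[A, S], [-C.T @ C, -A.T]])                    # Hamiltonian matrix of the game ARE

T, Z, sdim = schur(H, output='real', sort='lhp')            # order stable eigenvalues first
U1, U2 = Z[:n, :n], Z[n:, :n]                               # basis of the stable invariant subspace
P = U2 @ np.linalg.inv(U1)
P = 0.5 * (P + P.T)                                          # symmetrize; should reproduce (36)
c_true = np.array([P[0, 0], 2*P[0, 1], 2*P[0, 2],
                   P[1, 1], 2*P[1, 2], P[2, 2]])             # the NN weights in (37) below
```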
Here, we use the proposed NN-based online SPUA to solve the H∞ control problem of system (33). Select six polynomials as activation functions as follows:

Ψ_N(x) = [x1², x1x2, x1x3, x2², x2x3, x3²]^T

thus, the true values of the NN weights c are

c = [P11, 2P12, 2P13, P22, 2P23, P33]^T = [1.6573, 2.7908, −0.3322, 1.6573, −0.3608, 0.4370]^T.    (37)

Select the stop criterion ε = 10⁻⁷, the initial state x(0) = [1, 1, 1]^T, the initial NN weights c^(0) = 0, and the sampling interval δt = 0.05 s. In each iterative step, after collecting 10 (i.e., N̄ = 10) system state measurements, the LS method (32) is used to update the NN weights; that is, the NN weights are updated every 0.5 s (i.e., Δt = 0.5 s). After each update, we reset the system state to the initial state x0. Fig. 2 shows the weights c^(i) at each iterative step, where we can observe that the NN weights converge to the true values in (37) at t0 = 2.5 s. Then the solution of the HJI equation is computed via (28) and the corresponding H∞ controller is obtained by (30). Select a disturbance signal as

w(t) = 8e^{−(t−t0)} cos(t − t0) for t ≥ t0, and w(t) = 0 for t < t0    (38)
and use the resulting H∞ controller for closed-loop system simulations. Fig. 3 shows the closed-loop state trajectories and the control action u(t). The trajectories in the first 2.5 s correspond to the phase in which the online SPUA is applied to learn the NN weights.

Fig. 2. NN weights for the first example with Δt = 0.5 s.
Fig. 3. For the first example: (a) closed-loop state trajectories and (b) control action u(t).

In order to test the influence of different Δt on the performance of the online SPUA, we run the online SPUA on Example 1 again under the same parameters as above, except that Δt = 1 s (i.e., δt = 0.1 s). Fig. 4 gives the NN weights at each iterative step. It is noticed that the weights converge at t0 = 5 s, which is doubled compared with Fig. 2. However, it is also found that the online SPUA converges at iterative step 5 (i.e., i = 5), which is the same as the result obtained by setting Δt = 0.5 s. This means that the change of Δt has a great effect on the convergence time, but little effect on the number of iterative steps needed for convergence.

Fig. 4. NN weights for the first example with Δt = 1 s.
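The closed-loop replay behind Fig. 3 can be sketched as follows (an illustrative reconstruction under our own naming, using the model (33), the gain from (35) and (36), and the disturbance (38) with t0 = 2.5 s; it is not the authors' simulation code).

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B1 = np.array([[1.0], [0.0], [0.0]])
B2 = np.array([[0.0], [0.0], [1.0]])
P = np.array([[ 1.6573,  1.3954, -0.1661],
              [ 1.3954,  1.6573, -0.1804],
              [-0.1661, -0.1804,  0.4371]])       # ARE solution (36)
R, t0 = np.eye(1), 2.5

K = np.linalg.solve(R, B2.T @ P)                  # H-infinity gain of (35): u = -K x

def w_dist(t):                                    # disturbance signal (38)
    return 8.0 * np.exp(-(t - t0)) * np.cos(t - t0) if t >= t0 else 0.0

def closed_loop(t, x):
    u = -K @ x
    return A @ x + B2 @ u + B1.ravel() * w_dist(t)

sol = solve_ivp(closed_loop, (0.0, 30.0), [1.0, 1.0, 1.0], max_step=0.01)
# sol.t, sol.y give the kind of state trajectories plotted in Fig. 3.
```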
B. Simulations on Nonlinear System

The second example is constructed by using the converse HJB approach [49]. The system model is given as follows:

ẋ = [−0.25x1; 0.5x1²x2 − 0.5γ⁻²x2³ + 0.5x2] + [0; x1] w + [0; x2] u
z = x.

With the choice of γ = 2, the solution of the associated HJI equation is V*(x) = 2x1² + x2².

Fig. 5. NN weights for the second example.

Select R = I, x(0) = [0.4, 0.5]^T, ε = 10⁻⁷, the NN activation functions Ψ_N(x) = [x1², x1x2, x2², x1⁴, x2⁴]^T, and the initial NN weights c^(0) = 0. The parameters of the state sampling are the same as in Example 1. After each update, we reset the system state to the initial state x0. With the proposed NN-based online SPUA, the simulation results are shown in Figs. 5 and 6. Fig. 5 indicates the weights c^(i) at each iterative step, where it can be observed that the NN weights converge to the true weights (i.e., [2, 0, 1, 0, 0]^T) at t0 = 3 s. With the resulting NN weights at instant t0 = 3 s, we can obtain the solution of the HJI equation (4) by (28) and the corresponding H∞ controller by (30). Select a disturbance signal as in (38) with t0 = 3 s
and define the function r(t) as

r(t) = ∫_{t0}^t (‖z(τ)‖² + ‖u(τ)‖²_R) dτ / ∫_{t0}^t ‖w(τ)‖² dτ.    (39)

Applying the resulting H∞ controller to the system, Fig. 6 shows the closed-loop state trajectories, the control action u(t), and the evolution of r(t). The trajectories in the first 3 s of Fig. 6 correspond to the learning phase, in which the online SPUA is used to learn the NN weights. It can also be seen from Fig. 6 that the closed-loop system is asymptotically stable, and r(t) converges to 0.9232, which satisfies the L2-gain requirement (i.e., r(t) < γ² = 4) as t → ∞.

Fig. 6. For the second example: (a) closed-loop state trajectories, (b) control action u(t), and (c) evolution of r(t).

V. CONCLUSION

In this paper, an NN-based online SPUA has been developed to solve the HJI equation of the H∞ state feedback control problem for nonlinear systems. The H∞ control problem was viewed as a zero-sum game, where the control is a minimizing player and the disturbance is a maximizing one. By updating the two players' policies simultaneously, an online SPUA was developed.

APPENDIX
PROOF OF LEMMA 3

For ∀V ∈ V and W ∈ Ṽ ⊂ V, where Ṽ is a neighborhood of V, we have

G(V + sW) − G(V) = (∇(V + sW))^T f + h^T h − (1/4)(∇(V + sW))^T g R⁻¹ g^T ∇(V + sW) + (1/(4γ²))(∇(V + sW))^T k k^T ∇(V + sW) − [(∇V)^T f + h^T h − (1/4)(∇V)^T g R⁻¹ g^T ∇V + (1/(4γ²))(∇V)^T k k^T ∇V]
= s(∇W)^T f − (s/4)(∇W)^T g R⁻¹ g^T ∇V − (s/4)(∇V)^T g R⁻¹ g^T ∇W − (s²/4)(∇W)^T g R⁻¹ g^T ∇W + (s/(4γ²))(∇W)^T k k^T ∇V + (s/(4γ²))(∇V)^T k k^T ∇W + (s²/(4γ²))(∇W)^T k k^T ∇W.

Thus, the Gâteaux differential at V is

L(W) = lim_{s→0} (G(V + sW) − G(V))/s = (∇W)^T f − (1/4)(∇W)^T g R⁻¹ g^T ∇V − (1/4)(∇V)^T g R⁻¹ g^T ∇W + (1/(4γ²))(∇W)^T k k^T ∇V + (1/(4γ²))(∇V)^T k k^T ∇W.    (A1)

Next, we will prove that the map L = G′(V) is continuous. For ∀W0 ∈ Ṽ, it is immediate that

L(W) − L(W0) = (∇(W − W0))^T f − (1/4)(∇(W − W0))^T g R⁻¹ g^T ∇V − (1/4)(∇V)^T g R⁻¹ g^T ∇(W − W0) + (1/(4γ²))(∇(W − W0))^T k k^T ∇V + (1/(4γ²))(∇V)^T k k^T ∇(W − W0).
‖L(W) − L(W0)‖ ≤ M‖W − W0‖ < ε    (A3)

when ‖W − W0‖ < δ. This means L = G′(V) is continuous on Ṽ; thus, according to Lemma 2, L(W) = G′(V)W [i.e., (A1)] is the Fréchet differential, and L = G′(V) is the Fréchet derivative at V.

Proof of Theorem 1: We will give the proof in two steps.

1) On the one hand, with policies u^(i) and w^(i), the system state x(t) satisfies

ẋ(t) = f(x) + g(x)u^(i)(t) + k(x)w^(i)(t).    (A4)

Let V^(i+1) be a solution of the following equation:

(∇V^(i+1))^T (f + gu^(i) + kw^(i)) + ‖h‖² + ‖u^(i)‖²_R − γ²‖w^(i)‖² = 0.    (A5)

Using (A4), (A5) can be rewritten as

(d/dt) V^(i+1)(x(t)) = −(‖h‖² + ‖u^(i)‖²_R − γ²‖w^(i)‖²).    (A6)

Integrating (A6) from t to t + Δt yields

V^(i+1)(x(t + Δt)) − V^(i+1)(x(t)) = −∫_t^{t+Δt} (‖h(x(τ))‖² + ‖u^(i)(τ)‖²_R − γ²‖w^(i)(τ)‖²) dτ    (A7)

which implies that the solution V^(i+1) satisfies (13). Obviously, (13) can be rewritten as

V^(i+1)(x(t)) = ∫_t^∞ (‖h(x(τ))‖² + ‖u^(i)(τ)‖²_R − γ²‖w^(i)(τ)‖²) dτ.    (A8)

which must hold for any x ∈ Ω. Thus, we have

(d/dt)(V̄^(i+1) − V^(i+1)) = 0.    (A13)

This means V̄^(i+1)(x) = V^(i+1)(x) − d, where d is a constant. Due to V̄^(i+1)(0) = V^(i+1)(0) = 0 and V̄^(i+1)(0) = V^(i+1)(0) − d, then d = 0. Thus, V̄^(i+1)(x) = V^(i+1)(x). Therefore, (13) is equal to (A5).

2) On the other hand, it follows from (16) and (20) that

V^(i+1) = T V^(i) = V^(i) − (G′(V^(i)))⁻¹ G(V^(i))

which can be rewritten as

G′(V^(i)) V^(i+1) = G′(V^(i)) V^(i) − G(V^(i)).    (A14)

From (14), (15), and (19), we have

G′(V^(i)) V^(i+1) = (∇V^(i+1))^T f − (1/4)(∇V^(i+1))^T g R⁻¹ g^T ∇V^(i) − (1/4)(∇V^(i))^T g R⁻¹ g^T ∇V^(i+1) + (1/(4γ²))(∇V^(i+1))^T k k^T ∇V^(i) + (1/(4γ²))(∇V^(i))^T k k^T ∇V^(i+1)
= (∇V^(i+1))^T (f − (1/2) g R⁻¹ g^T ∇V^(i) + (1/(2γ²)) k k^T ∇V^(i))
= (∇V^(i+1))^T (f + gu^(i) + kw^(i))    (A15)

G′(V^(i)) V^(i) = (∇V^(i))^T f − (1/4)(∇V^(i))^T g R⁻¹ g^T ∇V^(i) − (1/4)(∇V^(i))^T g R⁻¹ g^T ∇V^(i) + (1/(4γ²))(∇V^(i))^T k k^T ∇V^(i) + (1/(4γ²))(∇V^(i))^T k k^T ∇V^(i)
= (∇V^(i))^T f − 2‖u^(i)‖²_R + 2γ²‖w^(i)‖²    (A16)

G(V^(i)) = (∇V^(i))^T f + h^T h − (1/4)(∇V^(i))^T g R⁻¹ g^T ∇V^(i) + (1/(4γ²))(∇V^(i))^T k k^T ∇V^(i)
= (∇V^(i))^T f + ‖h‖² − ‖u^(i)‖²_R + γ²‖w^(i)‖².    (A17)

Substituting (A15)–(A17) into (A14) gives

(∇V^(i+1))^T (f + gu^(i) + kw^(i)) = (∇V^(i))^T f − 2‖u^(i)‖²_R + 2γ²‖w^(i)‖² − ((∇V^(i))^T f + ‖h‖² − ‖u^(i)‖²_R + γ²‖w^(i)‖²)

that means

(∇V^(i+1))^T (f + gu^(i) + kw^(i)) + ‖h‖² + ‖u^(i)‖²_R − γ²‖w^(i)‖² = 0.

This means that (20) in Newton's iteration is also equal to (A5). This completes the proof.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the anonymous reviewers for their helpful comments and suggestions, which have improved the presentation of this paper.

REFERENCES

[1] T. Başar and P. Bernhard, H∞ Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, 2nd ed. Boston, MA: Birkhäuser, 1995.
[2] A. J. van der Schaft, L2-Gain and Passivity Techniques in Nonlinear Control. Berlin, Germany: Springer-Verlag, 1996.
[3] K. Zhou, J. C. Doyle, and K. Glover, Robust and Optimal Control. Upper Saddle River, NJ: Prentice-Hall, 1996.
[4] A. Isidori and W. Kang, “H∞ control via measurement feedback for general nonlinear systems,” IEEE Trans. Autom. Control, vol. 40, no. 3, pp. 466–472, Mar. 1995.
[5] G. Bianchini, R. Genesio, A. Parenti, and A. Tesi, “Global H∞ controllers for a class of nonlinear systems,” IEEE Trans. Autom. Control, vol. 49, no. 2, pp. 244–249, Feb. 2004.
[6] A. J. van der Schaft, “L2-gain analysis of nonlinear systems and nonlinear state-feedback H∞ control,” IEEE Trans. Autom. Control, vol. 37, no. 6, pp. 770–784, Jun. 1992.
[7] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008.
[8] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008.
[9] Q. M. Yang, J. B. Vance, and S. Jagannathan, “Control of nonaffine nonlinear discrete-time systems using reinforcement-learning-based linearly parameterized neural networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 994–1001, Aug. 2008.
[10] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009.
[11] D. Vrabie and F. L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems,” Neural Netw., vol. 22, no. 3, pp. 237–246, 2009.
[12] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.
[13] K. G. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010.
[14] F. Wang, N. Jin, D. Liu, and Q. Wei, “Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 24–36, Jan. 2011.
[15] T. Dierks and S. Jagannathan, “Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1118–1129, Jul. 2012.
[16] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2226–2236, Dec. 2011.
[17] Y. Jiang and Z. P. Jiang, “Approximate dynamic programming for optimal stationary control with control-dependent noise,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2392–2398, Dec. 2011.
[18] H. Zhang, R. Song, Q. Wei, and T. Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1851–1862, Dec. 2011.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[20] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Sep. 2009.
[21] F. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: An introduction,” IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009.
[22] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. Hoboken, NJ: Wiley, 2007.
[23] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[24] X. Xu, C. Liu, S. X. Yang, and D. Hu, “Hierarchical approximate policy iteration with binary-tree state space decomposition,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1863–1877, Dec. 2011.
[25] M. Fairbank, E. Alonso, and D. Prokhorov, “Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1671–1676, Oct. 2012.
[26] Y. Feng, B. Anderson, and M. Rotkowitz, “A game theoretic algorithm to compute local stabilizing solutions to HJBI equations in nonlinear H∞ control,” Automatica, vol. 45, no. 4, pp. 881–888, 2009.
[27] A. Lanzon, Y. Feng, B. D. O. Anderson, and M. Rotkowitz, “Computing the positive stabilizing solution to algebraic Riccati equations with an indefinite quadratic term via a recursive method,” IEEE Trans. Autom. Control, vol. 53, no. 10, pp. 2280–2291, Nov. 2008.
[28] J. Huang and C. Lin, “Numerical approach to computing nonlinear H∞ control laws,” AIAA J. Guidance, Control, Dynamics, vol. 18, no. 5, pp. 989–994, 1995.
[29] G. N. Saridis and C. G. Lee, “An approximation theory of optimal control for trainable manipulators,” IEEE Trans. Syst., Man, Cybern., vol. 9, no. 3, pp. 152–159, Mar. 1979.
[30] R. Beard, G. N. Saridis, and J. Wen, “Galerkin approximations of the generalized Hamilton–Jacobi–Bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997.
[31] R. Beard, G. N. Saridis, and J. Wen, “Approximate solutions to the time-invariant Hamilton–Jacobi–Bellman equation,” J. Optim. Theory Appl., vol. 96, no. 3, pp. 589–626, 1998.
[32] R. W. Beard and T. W. McLain, “Successive Galerkin approximation algorithms for nonlinear optimal and robust control,” Int. J. Control, vol. 71, no. 5, pp. 717–743, 1998.
[33] K. G. Vamvoudakis and F. L. Lewis, “Online solution of nonlinear two-player zero-sum games using synchronous policy iteration,” Int. J. Robust Nonlinear Control, vol. 22, no. 13, pp. 1460–1483, 2011.
[34] M. Abu-Khalaf, F. L. Lewis, and J. Huang, “Policy iterations on the Hamilton–Jacobi–Isaacs equation for H∞ state feedback control with input saturation,” IEEE Trans. Autom. Control, vol. 51, no. 12, pp. 1989–1995, Dec. 2006.
[35] M. Abu-Khalaf, F. L. Lewis, and J. Huang, “Neurodynamic programming and zero-sum games for constrained control systems,” IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1243–1252, Jul. 2008.
[36] M. Abu-Khalaf, J. Huang, and F. L. Lewis, Nonlinear H2/H-Infinity Constrained Feedback Control: A Practical Design Approach Using Neural Networks. New York: Springer-Verlag, 2006.
[37] D. Vrabie and F. L. Lewis, “Adaptive dynamic programming for online solution of a zero-sum differential game,” J. Control Theory Appl., vol. 9, no. 3, pp. 353–360, 2011.
[38] E. Zeidler, Nonlinear Functional Analysis: Fixed Point Theorems, vol. 1. New York: Springer-Verlag, 1985.
[39] L. Kantorovitch, “The method of successive approximation for functional equations,” Acta Math., vol. 71, no. 1, pp. 63–97, 1939.
[40] R. A. Tapia, “The Kantorovich theorem for Newton's method,” Amer. Math. Monthly, vol. 78, no. 4, pp. 389–392, 1971.
[41] L. B. Rall, “A note on the convergence of Newton's method,” SIAM J. Numer. Anal., vol. 11, no. 1, pp. 34–36, 1974.
[42] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Trans. Syst., Man, Cybern., vol. 13, no. 5, pp. 834–846, May 1983.
[43] V. Konda, “On actor-critic algorithms,” Ph.D. dissertation, Dept. Electr. Eng. & Comput. Sci., Massachusetts Inst. Technology, Cambridge, 2002.
[44] V. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[45] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005.
[46] B. A. Finlayson, The Method of Weighted Residuals and Variational Principles. New York: Academic, 1972.
[47] B. Stevens and F. L. Lewis, Aircraft Control and Simulation, 2nd ed. New York: Wiley, 2003.
[48] K. G. Vamvoudakis, “Online learning algorithms for differential dynamic games and optimal control,” Ph.D. dissertation, Faculty of the Graduate School, Univ. Texas at Arlington, Arlington, 2011.
[49] V. Nevistić and J. A. Primbs, “Constrained nonlinear optimal control: A converse HJB approach,” Dept. Control & Dynamical Syst., California Inst. Technology, Pasadena, Tech. Rep. TR96-021, 1996.

Huai-Ning Wu was born in Anhui, China, on November 15, 1972. He received the B.E. degree in automation from the Shandong Institute of Building Materials Industry, Jinan, China, and the Ph.D. degree in control theory and control engineering from Xi'an Jiaotong University, Xi'an, China, in 1992 and 1997, respectively.
He was a Post-Doctoral Researcher with the Department of Electronic Engineering, Beijing Institute of Technology, Beijing, China, from 1997 to 1999. He joined the School of Automation Science and Electrical Engineering, Beihang University (formerly Beijing University of Aeronautics and Astronautics), Beijing, in 1999. From 2005 to 2006, he was a Senior Research Associate with the Department of Manufacturing Engineering and Engineering Management (MEEM), City University of Hong Kong, Kowloon, Hong Kong, where he was a Research Fellow from 2006 to 2008 and from July to August 2010. From July to August 2011, he was a Research Fellow with the Department of Systems Engineering and Engineering Management, City University of Hong Kong. He is currently a Professor with Beihang University. His current research interests include robust control and filtering, fault-tolerant control, distributed parameter systems, and fuzzy and neural modeling and control.
Dr. Wu is a member of the Committee of Technical Process Failure Diagnosis and Safety of the Chinese Association of Automation.

Biao Luo received the B.E. degree in measuring and control technology and instrumentations and the M.E. degree in control theory and control engineering from Xiangtan University, Xiangtan, China, in 2006 and 2009, respectively. He is currently pursuing the Ph.D. degree in control science and engineering with Beihang University (formerly Beijing University of Aeronautics and Astronautics), Beijing, China.
His current research interests include distributed parameter systems, optimal control, data-based control, fuzzy and neural modeling and control, hypersonic entry and re-entry guidance, reinforcement learning, approximate dynamic programming, and evolutionary computation.
Mr. Luo was a recipient of the Excellent Master Dissertation Award of Hunan Province in 2011.