Communicated by Dr Chenguang Yang

Accepted Manuscript

Robust control scheme for a class of uncertain nonlinear systems with completely unknown dynamics using data-driven reinforcement learning method

He Jiang, Huaguang Zhang, Yang Cui, Geyang Xiao

PII: S0925-2312(17)31381-4
DOI: 10.1016/j.neucom.2017.07.058
Reference: NEUCOM 18771

To appear in: Neurocomputing

Received date: 20 March 2017


Revised date: 16 July 2017
Accepted date: 31 July 2017

Please cite this article as: He Jiang, Huaguang Zhang, Yang Cui, Geyang Xiao, Robust control scheme for a class of uncertain nonlinear systems with completely unknown dynamics using data-driven reinforcement learning method, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.07.058

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
Robust control scheme for a class of uncertain nonlinear systems with completely unknown dynamics using data-driven reinforcement learning method

He Jiang, Huaguang Zhang∗, Yang Cui, Geyang Xiao

College of Information Science and Engineering, Northeastern University, Box 134, 110819, Shenyang, P. R. China

Abstract

This paper deals with the robust control issues for a class of uncertain nonlinear systems with completely unknown dynamics via a data-driven reinforcement learning method. First, we formulate the optimal regulation control problem for the nominal system, and the robust controller for the original uncertain system is then designed by adding a constant feedback gain to the optimal controller of the nominal system. This scheme is subsequently extended to optimal tracking control by means of an augmented system and a discount factor. It is also demonstrated that the proposed robust controller achieves optimality with respect to a newly defined performance index function when there is no control perturbation. It is well known that the nonlinear optimal control problem relies on the solution of the Hamilton-Jacobi-Bellman (HJB) equation, a nonlinear partial differential equation that cannot be solved analytically. In order to overcome this difficulty, we introduce a model-based iterative learning algorithm to successively approximate the solution of the HJB equation and provide its convergence proof. Subsequently, based on the structure of the model-based approach, a data-driven reinforcement learning method is derived, which only requires sampled data from the real system under different control inputs rather than an accurate mathematical system model. Neural networks (NNs) are utilized to implement this model-free method to approximate the optimal solutions, and the least-squares approach is employed to minimize the NN approximation residual errors. Finally, two numerical simulation examples are given to illustrate the effectiveness of our proposed method.

Keywords: Reinforcement learning; Adaptive dynamic programming; Data-driven; Model-free; Neural networks.

∗Corresponding author
Email addresses: [email protected] (He Jiang), [email protected] (Huaguang Zhang), [email protected] (Yang Cui), [email protected] (Geyang Xiao)

Preprint submitted to Neurocomputing August 21, 2017
1. Introduction

Model uncertainties, which may severely affect the control performance of closed-loop feedback systems, usually occur in real-world complex systems, such as manufacturing systems and power systems. Therefore, research on robust control problems has received considerable attention, and, so far, many significant relevant results have been achieved. The authors of [1] pointed out that the robust control problem of the uncertain original system can be translated into the optimal control problem of its nominal system, but detailed results were not shown. It is known that solving the optimal control problem relies on the solution of the Hamilton-Jacobi-Bellman (HJB) equation. For linear optimal control, i.e., the linear quadratic regulator (LQR) problem, the HJB equation reduces to the well-known Riccati equation, which can be computed directly. Nevertheless, for the nonlinear optimal control problem, the HJB equation becomes a nonlinear partial differential equation, which cannot be solved analytically.

Dynamic programming is regarded as a conventional approach for solving the optimal control problem. However, due to its backward-in-time procedure, it generally suffers from the "curse of dimensionality". In recent years, reinforcement learning (RL), which can obtain the optimal action of an agent via the responses from its environment, has provided many effective ways to overcome this bottleneck of dynamic programming, such as approximation/adaptive dynamic programming (ADP), which serves as a forward-in-time method to solve optimal control problems [2]. ADP can be divided into two mainstream iterative algorithms, i.e., value iteration (VI) [3, 4] and policy iteration (PI) [5, 6, 7]. The convergence proof of the VI algorithm for discrete-time (DT) systems was first given in [3], and a novel PI algorithm for the DT case was presented in [6]. For continuous-time (CT) systems, a PI algorithm was proposed in [5] without requiring knowledge of the internal system dynamics. Following the theoretical structure of these classical works [3, 4, 5, 6, 7], a variety of optimal control issues have been addressed, such as optimal control with constrained control input [8, 9, 10, 11] and time delay [12, 13, 14], optimal control for zero-sum [15, 16, 17, 18] and non-zero-sum games [19, 20, 21, 22], and optimal control applied to Markov jump systems [23, 24], robot systems [25, 26] and multi-agent systems [27, 28, 29, 30, 31]. Among these issues, the optimal tracking control problem (OTCP) has become one of the most attractive topics within the scope of ADP. The integral RL technique was employed to address the OTCP for partially unknown CT systems with constrained control inputs in [32], and the DT case was investigated in [33] by means of an actor-critic-based RL algorithm. In [34], a novel infinite-time optimal tracking control scheme was provided for a class of DT nonlinear systems via the greedy heuristic dynamic programming (HDP) iteration algorithm, and a finite-horizon version of the OTCP was studied in [35] through an adaptive dynamic programming approach. However, model uncertainties, which are generally inevitable for most real-world systems, are not considered in the studies above.
The main idea of this paper is to employ the ADP technique to obtain the optimal solution of the nominal system, and then extend it to the robust controller design of the uncertain original system. In [36], a neural-network-based robust tracking controller was designed for a class of electrically driven nonholonomic mechanical systems, but optimality was not considered. The authors of [37] proposed a novel ADP scheme to solve the optimal robust control problem for a class of uncertain nonlinear systems, and then, based on this work [37], a data-based robust control approach was developed via the neural network identification technique in [38]. However, for many industrial systems such as aerospace control systems, power systems and chemical engineering processes, the system structures are so complicated that there may be no accurate mathematical system models to support the control design. Hence, model-based approaches become invalid for these practical complex systems. Although the identification technique is viewed as an effective way to overcome this difficulty, prior model identification is costly and introduces unexpected identification errors which may affect the control performance. On the other hand, conventional identification approaches [15, 21, 23, 38, 39, 40, 41] generally utilize the obtained identification results to approximate and replace the real system dynamics, and, as a consequence, they are in fact still model-based rather than truly model-free. Owing to the rapid development of digital sensor technology, vast volumes of system information can be acquired, which inspires data-driven RL control methods [42, 43, 44, 45]. Therefore, it is interesting and challenging to handle robust control issues by using a data-driven RL method in a model-free environment, which motivates our research.
In this paper, we present a data-driven RL scheme to solve the robust control issues for a class of uncertain CT nonlinear systems with completely unknown system dynamics. First of all, the problem formulation is derived in Section 2. Secondly, the robust controller of the uncertain original system is designed by adding a constant feedback gain to the optimal controller of the nominal system, and the proofs of optimality and stability are provided in Section 3. Subsequently, based on the introduced model-based iterative learning algorithm, a data-driven model-free RL method is derived and implemented by two neural networks (NNs), whose tuning laws are updated by the least-squares approach in Section 4. Two numerical simulation examples are given to demonstrate the effectiveness of our proposed scheme in Section 5. Finally, a brief conclusion is drawn in Section 6.
2. Problem formulation

Optimal control issues can be generally classified into two main groups: optimal regulation control and optimal tracking control [33].

2.1. Optimal regulation control

Consider a class of uncertain nonlinear systems with control input perturbation denoted by the following original system:

\[ \dot{x}(t) = f(x(t)) + g(x(t))\big(\bar{u}(t) + \bar{d}(x(t))\big) \tag{1} \]

where x(t) ∈ R^n is the state; ū(t) ∈ R^m denotes the control input; d̄(x) ∈ R^m represents the finite control input perturbation, which is assumed to be bounded by ‖d̄(x)‖ ≤ k_d‖x‖ with the positive constant k_d; f(x(t)) ∈ R^n and g(x(t)) ∈ R^{n×m} are the system matrices, which, in this paper, are both considered to be unknown.

The corresponding nominal system of (1) can be described by

\[ \dot{x}(t) = f(x(t)) + g(x(t))u(t). \tag{2} \]

Define the performance index function

\[ V(x(t)) = \int_{t}^{\infty} \big[ x^{T}(\tau) P x(\tau) + u^{T}(x(\tau)) R u(x(\tau)) \big] \, d\tau \tag{3} \]

where P ∈ R^{n×n} > 0 and R ∈ R^{m×m} > 0 are symmetric positive definite weight matrices.

The associated optimal control policy which minimizes V(x) can be derived as

\[ u^{*}(x) = -\tfrac{1}{2} R^{-1} g^{T}(x) \nabla V^{*}(x) \tag{4} \]

where V^{*}(x) denotes the optimal performance index function and ∇V^{*}(x) = ∂V^{*}(x)/∂x.

Then, one can obtain the following Hamilton-Jacobi-Bellman (HJB) equation for the optimal regulation control problem:

\[ \nabla V^{*T}(x) f(x) + x^{T} P x - \tfrac{1}{4} \nabla V^{*T}(x) g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) = 0. \tag{5} \]

2.2. Optimal tracking control
Let the command generator, which can produce a large class of command trajectories, be the reference system to be tracked:

\[ \dot{x}_d(t) = h(x_d(t)) \tag{6} \]

where x_d(t) ∈ R^n denotes the tracking objective state and h(x_d) ∈ R^n is the command generator function.

Thus, the tracking error e_d(t) can be defined as

\[ e_d(t) = x(t) - x_d(t). \tag{7} \]

Bringing (2) and (6) together yields the tracking error dynamics

\[ \dot{e}_d(t) = f(x(t)) - h(x_d(t)) + g(x(t))u(t). \tag{8} \]

Let X(t) = [e_d^T(t) \; x_d^T(t)]^T be the state of the augmented system. Next, combining (6) and (8) yields the dynamics of the augmented system:

\[ \dot{X}(t) = F(X(t)) + G(X(t))u(t) \tag{9} \]

where

\[ F(X(t)) = \begin{bmatrix} f(e_d(t) + x_d(t)) - h(x_d(t)) \\ h(x_d(t)) \end{bmatrix} \quad \text{and} \quad G(X(t)) = \begin{bmatrix} g(e_d(t) + x_d(t)) \\ 0 \end{bmatrix}. \]

X(t) ∈ 𝒳 ⊂ R^{2n} and u(t) ∈ 𝒰 ⊂ R^m. D = {(X, u) | X ∈ 𝒳, u ∈ 𝒰}, where 𝒳 and 𝒰 denote compact sets.
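To make the construction of (9) concrete, the following Python sketch (an illustration only, not part of the paper's algorithm) assembles the augmented drift F(X) and input matrix G(X) from user-supplied functions f, g and h; the function names and the example dynamics below are assumptions chosen for illustration.

```python
import numpy as np

def augmented_dynamics(f, g, h, n):
    """Build F(X), G(X) of the augmented system (9) from f, g, h.

    X = [e_d; x_d] with e_d = x - x_d, so the original state is x = e_d + x_d.
    """
    def F(X):
        e_d, x_d = X[:n], X[n:]
        x = e_d + x_d
        return np.concatenate([f(x) - h(x_d), h(x_d)])

    def G(X):
        e_d, x_d = X[:n], X[n:]
        x = e_d + x_d
        gx = np.atleast_2d(g(x))                 # n x m input matrix of the original system
        return np.vstack([gx, np.zeros_like(gx)])

    return F, G

# Hypothetical dynamics, for illustration only:
f = lambda x: np.array([x[1], -0.5 * (x[0] + x[1])])
g = lambda x: np.array([[0.0], [1.0]])
h = lambda xd: np.array([xd[1], -xd[0]])
F, G = augmented_dynamics(f, g, h, n=2)
X = np.array([0.1, -0.2, 1.0, 0.0])              # stacked [e_d; x_d]
print(F(X), G(X))
```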
For the nominal augmented system (9), in order to solve the infinite-horizon OTCP, one needs to design a state feedback control policy u(X) which minimizes the following discounted performance index function:

\[ V(X(t)) = \int_{t}^{\infty} e^{-\alpha(\tau - t)} \big[ X^{T}(\tau) Q X(\tau) + u^{T}(X(\tau)) R u(X(\tau)) \big] \, d\tau \tag{10} \]

where α > 0 is the discount factor; Q = \begin{bmatrix} P & 0 \\ 0 & 0 \end{bmatrix} with P > 0, and R > 0 is a symmetric positive definite matrix.

Remark 1. Note that it is necessary to employ a discount factor in the performance index function (10). This is because the trajectory of the reference system x_d(t) in (6) to be tracked may not go to zero, which is a common case in most practical systems, and then the performance index function, which contains the control policy u(X), will become infinite without the discount factor.
Definition 1. A control policy u(X) ∈ Ψ(𝒳) is said to be admissible with respect to (10) on 𝒳 if u(X) not only stabilizes the tracking error system (8) but also guarantees that the performance index function V(X(t)) in (10) is finite.

Assumption 1. Assume that there exists at least one admissible control policy u(X) on the compact set 𝒳 such that the tracking error system (8) is asymptotically stable and the performance index function V(X) in (10) is finite.

By means of Leibniz's rule, differentiating V(X(t)) along the augmented system trajectories (9) yields

\[ \dot{V}(X(t)) = \alpha V(X(t)) - X^{T}(t) Q X(t) - u^{T}(t) R u(t). \tag{11} \]

In light of (11), the Hamiltonian function can be defined as

\[ H(X, \nabla V(X), u) = \nabla V^{T}(X)\big(F(X) + G(X)u\big) - \alpha V(X) + X^{T} Q X + u^{T} R u \tag{12} \]

where ∇V(X) = ∂V(X)/∂X.

The optimal performance index function V^{*}(X) is given by

\[ V^{*}(X(t)) = \min_{u \in \Psi(\mathcal{X})} \int_{t}^{\infty} e^{-\alpha(\tau - t)} \big[ X^{T}(\tau) Q X(\tau) + u^{T}(X(\tau)) R u(X(\tau)) \big] \, d\tau, \tag{13} \]

which also satisfies the following HJB equation:

\[ 0 = \min_{u \in \Psi(\mathcal{X})} H(X, \nabla V^{*}(X), u). \tag{14} \]

If the minimum on the right-hand side of (14) exists and is unique, the corresponding optimal control policy u^{*}(X) can be obtained as

\[ u^{*}(X) = -\tfrac{1}{2} R^{-1} G^{T}(X) \nabla V^{*}(X). \tag{15} \]

Inserting (15) into (14), the HJB equation can be rewritten as

\[ H(V^{*}(X)) = \nabla V^{*T}(X) F(X) + X^{T} Q X - \alpha V^{*}(X) - \tfrac{1}{4} \nabla V^{*T}(X) G(X) R^{-1} G^{T}(X) \nabla V^{*}(X) = 0. \tag{16} \]

Remark 2. For linear systems, the HJB equation reduces to the well-known Riccati equation, which can be solved directly. However, for nonlinear systems, the HJB equation becomes a nonlinear partial differential equation, for which an analytical solution is generally unobtainable. In Section 4, we will introduce two iterative ADP methods to overcome this difficulty.
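As a concrete illustration of the linear special case mentioned in Remark 2 (for the undiscounted regulation problem, i.e., α = 0 and cost (3)), the following Python sketch solves the Riccati equation with SciPy and recovers the corresponding optimal gain; the numerical system matrices are assumptions chosen only for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical linear system x_dot = A x + B u with quadratic cost (3):
# here f(x) = A x and g(x) = B, so V*(x) = x^T S x and u*(x) = -R^{-1} B^T S x.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
P = np.eye(2)          # state weight (called P in (3))
R = np.array([[1.0]])  # control weight

S = solve_continuous_are(A, B, P, R)   # solves A^T S + S A - S B R^{-1} B^T S + P = 0
K = np.linalg.solve(R, B.T @ S)        # u*(x) = -K x, consistent with (4) since grad V*(x) = 2 S x

print("Riccati solution S =\n", S)
print("Optimal feedback gain K =", K)
```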

3. Robust controller design

In this section, we first provide the robust regulation controller, and then extend the results to the design of the robust tracking controller.

3.1. Robust regulation controller design

In light of the optimal control policy u^{*}(x) in (4) for the nominal system (2), the robust controller ū(x) for the uncertain original system (1) is designed by adding a constant feedback gain η to u^{*}(x):

\[ \bar{u}(x) = \eta u^{*}(x). \tag{17} \]

Define a new performance index function for the system (1) as

\[ \mathcal{J}(x(t)) = \int_{t}^{\infty} \big[ \mathcal{P}(x(\tau)) + \bar{u}^{T}(x(\tau)) \bar{R} \bar{u}(x(\tau)) \big] \, d\tau \tag{18} \]

where 𝒫(x) = x^T P x + (η − 1) u^{*T}(x) R u^{*}(x) with η ≥ 1 and R̄ = η^{−1} R.

Theorem 1. Consider the system (1) and let η ≥ 1. One can attain:

1. If there is no control input perturbation, i.e., d̄ = 0, then the control policy ū(x) in (17) achieves optimality with the performance index function (18).
2. If the constant feedback gain η is selected appropriately, then the robust control policy ū(x) in (17) guarantees the system (1) to be asymptotically stable.
Proof. 1) Let J^{*}(x) be the optimal performance index function for the system (1) under the condition d̄ = 0. One can derive the associated optimal control policy ū^{*}(x) and HJB equation, respectively, as

\[ \bar{u}^{*}(x) = -\tfrac{1}{2} \bar{R}^{-1} g^{T}(x) \nabla J^{*}(x), \tag{19} \]

and

\[ \nabla J^{*T}(x) f(x) + x^{T} P x + (\eta - 1) u^{*T} R u^{*} - \tfrac{1}{4} \nabla J^{*T}(x) g(x) \bar{R}^{-1} g^{T}(x) \nabla J^{*}(x) = 0 \tag{20} \]

where ∇J^{*}(x) = ∂J^{*}(x)/∂x.

Based on (5), replacing J^{*}(x) by V^{*}(x) and inserting (4) into (20) yields

\[
\begin{aligned}
& \nabla V^{*T}(x) f(x) + x^{T} P x + \tfrac{1}{4}(\eta - 1) \nabla V^{*T}(x) g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) - \tfrac{1}{4} \eta \nabla V^{*T}(x) g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) \\
&= \nabla V^{*T}(x) f(x) + x^{T} P x - \tfrac{1}{4} \nabla V^{*T}(x) g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) = 0.
\end{aligned} \tag{21}
\]

From (21), it can be observed that V^{*}(x) is a solution of the HJB equation (20), which also implies that the optimal control policy (19) can be rewritten as

\[ \bar{u}^{*}(x) = -\tfrac{1}{2} \eta R^{-1} g^{T}(x) \nabla V^{*}(x) = \eta u^{*}(x). \tag{22} \]

This completes the proof.


2) Consider the Lyapunov function candidate

\[ \Theta(x) = V^{*}(x). \tag{23} \]

Then, one has

\[
\begin{aligned}
\dot{\Theta}(x) &= \nabla V^{*T}(x)\big(f(x) + g(x)(\bar{u}(x) + \bar{d}(x))\big) \\
&= -x^{T} P x + \tfrac{1}{4} \nabla V^{*T}(x) g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) + \nabla V^{*T}(x) g(x) \bar{u}(x) + \nabla V^{*T}(x) g(x) \bar{d}(x) \\
&= -x^{T} P x - \big(\tfrac{1}{2}\eta - \tfrac{1}{4}\big) \nabla V^{*T}(x) g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) + \nabla V^{*T}(x) g(x) \bar{d}(x) \\
&\le -x^{T} P x - \big(\tfrac{1}{2}\eta - \tfrac{1}{4}\big) \nabla V^{*T}(x) g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) + \tfrac{1}{2} \nabla V^{*T}(x) g(x) g^{T}(x) \nabla V^{*}(x) + \tfrac{1}{2} \bar{d}^{T}(x) \bar{d}(x) \\
&\le -\big(\lambda_{\min}(P) - \tfrac{1}{2} k_{d}^{2}\big) \|x\|^{2} - \Big(\big(\tfrac{1}{2}\eta - \tfrac{1}{4}\big)\lambda_{\min}(R^{-1}) - \tfrac{1}{2}\Big) \big\| g^{T}(x) \nabla V^{*}(x) \big\|^{2}
\end{aligned} \tag{24}
\]

where λ_min(P) and λ_min(R^{−1}) denote the minimum eigenvalues of the matrices P and R^{−1}, respectively.

In order to obtain Θ̇(x) < 0, the matrix P and the constant feedback gain η should be chosen to satisfy the following condition:

\[ \lambda_{\min}(P) > \tfrac{1}{2} k_{d}^{2}, \qquad \eta > \frac{1}{\lambda_{\min}(R^{-1})} + \frac{1}{2}. \tag{25} \]

Since P and R are both symmetric positive definite matrices, one can easily choose a large enough constant feedback gain η to satisfy the condition η > max{1, 1/λ_min(R^{−1}) + 1/2}. If P and η are selected appropriately such that the condition (25) holds, then Θ̇(x) < 0, which indicates that the system is asymptotically stable under the robust controller (17).

The proof is completed. □

Remark 3. Theorem 1 shows that the control policy (17) is a successful robust controller design for the uncertain system (1). Furthermore, if there is no control input perturbation, (17) is also an optimal control policy with respect to the performance index function (18). For some given P, if the control perturbation becomes so large that the condition (25) is not satisfied, one can still stabilize the system (1) by enhancing the feedback gain η. This will be demonstrated in the simulation results.
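The gain selection in condition (25) is easy to check numerically. The short Python sketch below (an illustration under assumed values of P, R and k_d, not taken from the paper) verifies both inequalities and returns a feedback gain satisfying them together with η ≥ 1.

```python
import numpy as np

def robust_gain(P, R, k_d, margin=0.1):
    """Check condition (25) and propose a feedback gain eta."""
    lam_P = np.min(np.linalg.eigvalsh(P))
    lam_Rinv = np.min(np.linalg.eigvalsh(np.linalg.inv(R)))
    if lam_P <= 0.5 * k_d**2:
        raise ValueError("lambda_min(P) must exceed k_d^2 / 2; increase P.")
    # strict inequalities, enforced with a small margin
    eta = max(1.0, 1.0 / lam_Rinv + 0.5) + margin
    return eta

# Hypothetical weights and perturbation bound, for illustration only:
P = np.eye(2)
R = np.array([[1.0]])
k_d = 0.5
print("suggested eta =", robust_gain(P, R, k_d))
```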

3.2. Robust tracking controller design

Similar to the nominal augmented system (9), the original one can be expressed as

\[ \dot{X}(t) = F(X(t)) + G(X(t))\big(\bar{u}(t) + \bar{d}(t)\big). \tag{26} \]

Based on the optimal control policy u^{*}(X) in (15) for the nominal augmented system, the robust controller for the uncertain original augmented system (26) is designed by

\[ \bar{u}(X) = \eta u^{*}(X). \tag{27} \]

Define a new performance index function for the augmented system (26) as

\[ J(X(t)) = \int_{t}^{\infty} e^{-\alpha(\tau - t)} \big[ L(X(\tau)) + \bar{u}^{T}(X(\tau)) \bar{R} \bar{u}(X(\tau)) \big] \, d\tau \tag{28} \]

where L(X) = X^T Q X + (η − 1) u^{*T}(X) R u^{*}(X) with η ≥ 1 and R̄ = η^{−1} R.

Corollary 1. Consider the augmented system (26). If η ≥ 1 and there is no control input perturbation, i.e., d̄ = 0, then the control policy ū in (27) achieves optimality with the performance index function (28).
Proof. The derivation is similar to that of Theorem 1. Let J^{*}(X) be the optimal performance index function for the augmented system (26) with d̄ = 0. The corresponding optimal control policy ū^{*}(X) and HJB equation can be given, respectively, by

\[ \bar{u}^{*}(X) = -\tfrac{1}{2} \bar{R}^{-1} G^{T}(X) \nabla J^{*}(X), \tag{29} \]

and

\[ \nabla J^{*T}(X) F(X) + X^{T} Q X + (\eta - 1) u^{*T}(X) R u^{*}(X) - \alpha J^{*}(X) - \tfrac{1}{4} \nabla J^{*T}(X) G(X) \bar{R}^{-1} G^{T}(X) \nabla J^{*}(X) = 0 \tag{30} \]

where ∇J^{*}(X) = ∂J^{*}(X)/∂X.

According to (16), replacing J^{*}(X) by V^{*}(X) and substituting (15) into (30) yields

\[
\begin{aligned}
& \nabla V^{*T}(X) F(X) + X^{T} Q X + \tfrac{1}{4}(\eta - 1) \nabla V^{*T}(X) G(X) R^{-1} G^{T}(X) \nabla V^{*}(X) - \alpha V^{*}(X) - \tfrac{1}{4} \eta \nabla V^{*T}(X) G(X) R^{-1} G^{T}(X) \nabla V^{*}(X) \\
&= \nabla V^{*T}(X) F(X) + X^{T} Q X - \alpha V^{*}(X) - \tfrac{1}{4} \nabla V^{*T}(X) G(X) R^{-1} G^{T}(X) \nabla V^{*}(X) = 0.
\end{aligned} \tag{31}
\]

From (31), it can be deduced that V^{*}(X) is a solution of the HJB equation (30), which also indicates that the optimal tracking control policy (29) can be rewritten as

\[ \bar{u}^{*}(X) = -\tfrac{1}{2} \eta R^{-1} G^{T}(X) \nabla V^{*}(X) = \eta u^{*}(X). \tag{32} \]

The proof is completed. □

Corollary 2. Consider the augmented system (26) and let η ≥ 1. If the constant feedback gain η is selected large enough, then the robust control policy ū(X) in (27) makes the tracking error dynamics asymptotically stable in the limit as the discount factor goes to zero.

Remark 4. If the discount factor goes to zero, then in light of the results of Theorem 1 and previous relevant works [32, 33, 45, 46], it can be easily shown that the tracking error is asymptotically stable. Nevertheless, if the discount factor is nonzero, the stability analysis becomes difficult [47, 48]. According to the results of [32, 33, 45, 46], one can make the tracking error as small as desired by selecting a small enough discount factor. If the discount factor is not small, stability may not be guaranteed.

4. Data-driven reinforcement learning algorithm

In order to solve the HJB equation, we first introduce a model-based iterative learning algorithm to approximate the solution and provide the corresponding convergence proof. Second, based on the model-based algorithm, we present a data-driven model-free one, which only requires real system data under different control inputs rather than an accurate mathematical system model.

4.1. Model-based iterative learning algorithm

Step 1. Let i = 0. Given an initial function V^{(0)}(X) ∈ V_0, where V_0 is determined by Lemma 5 in [49], set u^{(0)} = -\tfrac{1}{2} R^{-1} G^{T}(X) \nabla V^{(0)}(X).

Step 2. Solve for V^{(i+1)}(X) by using the following equation:

\[ [\nabla V^{(i+1)}(X)]^{T}\big(F(X) + G(X)u^{(i)}\big) - \alpha V^{(i+1)}(X) + X^{T} Q X + u^{(i)T} R u^{(i)} = 0. \tag{33} \]

Step 3. Update the control policy by

\[ u^{(i+1)} = -\tfrac{1}{2} R^{-1} G^{T}(X) \nabla V^{(i+1)}(X). \tag{34} \]

If ‖V^{(i+1)} − V^{(i)}‖ ≤ ε, where ε is a small enough positive constant, then stop at Step 3; otherwise, let i = i + 1 and go back to Step 2. A structural sketch of this iteration is given below.
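The following Python sketch only illustrates the structure of Steps 1–3; the inner routine `solve_evaluation_equation`, which would return V^{(i+1)} satisfying (33) for the current policy (e.g., via projection onto a set of basis functions), and `policy_from_value`, which applies (34), are hypothetical placeholders not specified by the paper.

```python
# Structural sketch of the model-based iterative learning algorithm (Steps 1-3).
# `solve_evaluation_equation`, `policy_from_value` and `distance` are user-supplied
# callables: the first solves (33) for V^{(i+1)}, the second applies the update (34),
# i.e. u = -0.5 R^{-1} G^T(X) grad V(X), and the third measures ||V^{(i+1)} - V^{(i)}||.

def model_based_iteration(V0, solve_evaluation_equation, policy_from_value,
                          distance, eps=1e-6, max_iter=100):
    V = V0
    u = policy_from_value(V)                    # u^{(0)} from the admissible initial V^{(0)}
    for i in range(max_iter):
        V_next = solve_evaluation_equation(u)   # Step 2: solve (33) for V^{(i+1)}
        u = policy_from_value(V_next)           # Step 3: policy update (34)
        if distance(V_next, V) <= eps:          # stopping criterion of Step 3
            return V_next, u
        V = V_next
    return V, u
```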
Following the idea of previous works [49, 50], the convergence proof of the proposed model-based iterative learning algorithm is provided as follows.

Let us consider a Banach space Λ ⊂ {V(x) | V(x) : 𝒳 → R, V(0) = 0} with the norm ‖·‖_𝒳 and the mapping H : Λ → Λ given by (16). Define another mapping Γ : Λ → Λ as

\[ \Gamma V = V - (H'(V))^{-1} H(V) \tag{35} \]

where H'(V) represents the Fréchet derivative of H(·) at the point V.

Due to the difficulty of calculating the Fréchet derivative on the Banach space Λ directly, we introduce the following Gâteaux derivative.

Definition 2. [50, 51] (Gâteaux Derivative) Let H : σ(V) ⊆ X → Y be a mapping with two Banach spaces X and Y, where σ(V) represents a neighborhood of V. If there exists a bounded linear operator L : X → Y such that

\[ H(V + sM) - H(V) = sL(M) + o(s), \quad s \to 0, \tag{36} \]

for all s in the neighborhood of zero with \lim_{s \to 0}(o(s)/s) = 0 and all M with ‖M‖_X = 1, then the mapping H is Gâteaux differentiable at the point V and L denotes the Gâteaux derivative of H at V.

Therefore, according to (36), the Gâteaux differential at V can be given by L(M) as

\[ L(M) = \lim_{s \to 0} \frac{H(V + sM) - H(V)}{s}. \tag{37} \]

It should be pointed out that (37) provides an easier way to compute the Gâteaux derivative than the Fréchet derivative. The relationship between these two derivatives is shown in the following lemma.

Lemma 1. [50, 51] If H' is continuous at V and exists as the Gâteaux derivative in the neighborhood of V, then L = H'(V) is also the Fréchet derivative at V.

Lemma 2. Let H be the mapping expressed in (16). For any V ∈ Λ, the Fréchet differential of H at V can be given by

\[ H'(V)M = L(M) = \nabla M^{T} F - \alpha M - \tfrac{1}{2} \nabla V^{T} G R^{-1} G^{T} \nabla M. \tag{38} \]

Proof. According to (16), one has

\[
\begin{aligned}
H(V + sM) &= \nabla (V + sM)^{T} F + X^{T} Q X - \alpha (V + sM) - \tfrac{1}{4} \nabla (V + sM)^{T} G R^{-1} G^{T} \nabla (V + sM), \\
H(V) &= \nabla V^{T} F + X^{T} Q X - \alpha V - \tfrac{1}{4} \nabla V^{T} G R^{-1} G^{T} \nabla V, \\
H(V + sM) - H(V) &= s \nabla M^{T} F - \alpha s M - \tfrac{1}{4} s^{2} \nabla M^{T} G R^{-1} G^{T} \nabla M - \tfrac{1}{2} s \nabla V^{T} G R^{-1} G^{T} \nabla M.
\end{aligned} \tag{39}
\]

By means of (37) and (39), the Gâteaux derivative at V is obtained by

\[ L(M) = \lim_{s \to 0} \frac{H(V + sM) - H(V)}{s} = \nabla M^{T} F - \alpha M - \tfrac{1}{2} \nabla V^{T} G R^{-1} G^{T} \nabla M. \tag{40} \]

Based on Lemma 1, one gets H'(V)M = L(M). This completes the proof.

Next, construct a Newton iterative sequence as below:

\[ V^{(i+1)} \triangleq \Gamma V^{(i)} = V^{(i)} - (H'(V^{(i)}))^{-1} H(V^{(i)}), \quad i = 0, 1, 2, \cdots. \tag{41} \]

Remark 5. According to Lemma 4 (Kantorovich's Theorem) in [49], the Newton iterative sequence {V^{(i)}} in (41) converges to the optimal value V^{*} as i → ∞. Lemma 5 in [49] also provides a method to select an appropriate initial value V^{(0)} which guarantees the convergence of the Newton iteration.

Theorem 2. The sequence {V^{(i)}} produced by the model-based iterative learning algorithm is equivalent to that of the Newton iteration (41).

Proof. Based on Lemma 2, one attains

\[ H'(V^{(i)}) V^{(i+1)} = \nabla V^{(i+1)T} F - \alpha V^{(i+1)} - \tfrac{1}{2} \nabla V^{(i)T} G R^{-1} G^{T} \nabla V^{(i+1)}, \tag{42} \]
\[ H'(V^{(i)}) V^{(i)} = \nabla V^{(i)T} F - \alpha V^{(i)} - \tfrac{1}{2} \nabla V^{(i)T} G R^{-1} G^{T} \nabla V^{(i)}, \tag{43} \]
\[ H(V^{(i)}) = \nabla V^{(i)T} F + X^{T} Q X - \alpha V^{(i)} - \tfrac{1}{4} \nabla V^{(i)T} G R^{-1} G^{T} \nabla V^{(i)}. \tag{44} \]

According to the model-based iterative learning algorithm, i.e., (33) and (34), and combining (42), (43) and (44), one has

\[ H'(V^{(i)}) V^{(i+1)} = H'(V^{(i)}) V^{(i)} - H(V^{(i)}), \tag{45} \]

which is equivalent to the Newton iteration (41). □

Thus, as mentioned in Remark 5, once the initial value V^{(0)} is determined by Lemma 5 in [49], then according to Lemma 4 (Kantorovich's Theorem) in [49] the proposed model-based iterative learning algorithm ensures that V^{(i)} converges to the optimal value V^{*} as i → ∞, which also means u^{(i)} → u^{*} as i → ∞.

4.2. Derivation of the data-driven reinforcement learning method

To derive the data-driven reinforcement learning algorithm, first of all, we rewrite the system (9) as

\[ \dot{X} = F(X) + G(X)u = F(X) + G(X)u^{(i)} + G(X)\big(u - u^{(i)}\big). \tag{46} \]

Next, by means of (46), one gets

\[ \frac{dV^{(i+1)}(X(t))}{dt} = [\nabla V^{(i+1)}(X)]^{T}\big(F(X) + G(X)u\big) = [\nabla V^{(i+1)}(X)]^{T}\big(F(X) + G(X)u^{(i)}\big) + [\nabla V^{(i+1)}(X)]^{T} G(X)\big(u - u^{(i)}\big). \tag{47} \]

Based on the updating laws (33) and (34), (47) becomes

\[ \frac{dV^{(i+1)}(X(t))}{dt} = \alpha V^{(i+1)}(X) - X^{T} Q X - u^{(i)T} R u^{(i)} - 2[u^{(i+1)}]^{T} R\big(u - u^{(i)}\big). \tag{48} \]

Integrating both sides of (48) on the interval [t, t + ∆t], one attains

\[
\begin{aligned}
V^{(i+1)}(X(t + \Delta t)) - V^{(i+1)}(X(t)) = {} & \int_{t}^{t+\Delta t} \alpha V^{(i+1)}(X(\tau))\, d\tau - \int_{t}^{t+\Delta t} \big( X^{T}(\tau) Q X(\tau) + u^{(i)T}(\tau) R u^{(i)}(\tau) \big) d\tau \\
& - 2 \int_{t}^{t+\Delta t} [u^{(i+1)}(\tau)]^{T} R\big(u(\tau) - u^{(i)}(\tau)\big) d\tau
\end{aligned} \tag{49}
\]

where V^{(i+1)}(X) and u^{(i+1)}(X) are the unknown functions to be determined.

Remark 6. From the aforementioned derivation, it can be observed that the main idea of the data-driven RL method is to solve equation (49) iteratively rather than the iterative equations (33) and (34). Furthermore, different from (33) and (34), equation (49) only requires arbitrary system data (X, u) ∈ D instead of the system models, i.e., F(X) and G(X).

The following lemma shows that the data-driven model-free algorithm is equivalent to the model-based iterative learning algorithm, which also implies the convergence of the data-driven approach, that is, {V^{(i)}} and {u^{(i)}} generated by the iterative equation (49) converge to the optimal values V^{*} and u^{*}, respectively, as i → ∞.

Lemma 3. [46] Let V^{(i+1)}(0) = 0 for all i = 0, 1, 2, ⋯. Then, (V^{(i+1)}(X), u^{(i+1)}(X)) is the solution of (49) if and only if it is the solution of (33) and (34), i.e., the iterative equation (49) is equivalent to the iterative equations (33) and (34).

4.3. NN-based implementation of the data-driven model-free algorithm

The NN structure used in this paper is similar to that in [52, 53]. Two NNs, namely a critic NN and an actor NN, are utilized to approximate the iterative performance index function V^{(i)} and control policy u^{(i)} as below:

\[ \hat{V}^{(i)}(X(t)) = \phi_{V}^{T}(X(t)) W_{V}^{(i)}, \tag{50} \]
\[ \hat{u}^{(i)}(X(t)) = \phi_{u}^{T}(X(t)) W_{u}^{(i)}, \tag{51} \]

where W_V^{(i)} and W_u^{(i)} are the weight vectors of the critic NN and actor NN, respectively, and φ_V and φ_u are their corresponding NN activation functions. For the following derivation, we take the single control input case into consideration, that is, m = 1, which is convenient for the mathematical expressions.

Since there will be NN approximation errors brought by the NN implementation, replacing V^{(i+1)}, u^{(i)} and u^{(i+1)} in (49) by V̂^{(i+1)}, û^{(i)} and û^{(i+1)} yields the following residual error Ξ^{(i)}:

\[
\begin{aligned}
\Xi^{(i)}(X(t), X(t+\Delta t), u(t)) \triangleq {} & \hat{V}^{(i+1)}(X(t)) - \hat{V}^{(i+1)}(X(t+\Delta t)) + \int_{t}^{t+\Delta t} \alpha \hat{V}^{(i+1)}(X(\tau))\, d\tau \\
& - \int_{t}^{t+\Delta t} \big( X^{T}(\tau) Q X(\tau) + [\hat{u}^{(i)}(X(\tau))]^{T} R\, \hat{u}^{(i)}(X(\tau)) \big) d\tau \\
& + 2 \int_{t}^{t+\Delta t} [\hat{u}^{(i+1)}(X(\tau))]^{T} R \big( \hat{u}^{(i)}(X(\tau)) - u(\tau) \big) d\tau \\
= {} & \big( \phi_{V}(X(t)) - \phi_{V}(X(t+\Delta t)) \big)^{T} W_{V}^{(i+1)} + \int_{t}^{t+\Delta t} \alpha \phi_{V}^{T}(X(\tau)) W_{V}^{(i+1)}\, d\tau \\
& - \int_{t}^{t+\Delta t} \big( X^{T}(\tau) Q X(\tau) + [\hat{u}^{(i)}(X(\tau))]^{T} R\, \hat{u}^{(i)}(X(\tau)) \big) d\tau \\
& + 2 \int_{t}^{t+\Delta t} \big( \hat{u}^{(i)}(X(\tau)) - u(\tau) \big)^{T} R \big( \phi_{u}^{T}(X(\tau)) W_{u}^{(i+1)} \big) d\tau.
\end{aligned} \tag{52}
\]

For simplicity of mathematical expression, let

\[
\begin{aligned}
\eta_{1} &= \phi_{V}(X(t)) - \phi_{V}(X(t+\Delta t)), \qquad \eta_{2} = \int_{t}^{t+\Delta t} \alpha \phi_{V}^{T}(X(\tau))\, d\tau, \\
\eta_{3} &= 2 \int_{t}^{t+\Delta t} \big( \hat{u}^{(i)}(X(\tau)) - u(\tau) \big)^{T} R\, \phi_{u}^{T}(X(\tau))\, d\tau, \\
\Upsilon^{(i)} &= \int_{t}^{t+\Delta t} \big( X^{T}(\tau) Q X(\tau) + [\hat{u}^{(i)}(X(\tau))]^{T} R\, \hat{u}^{(i)}(X(\tau)) \big) d\tau.
\end{aligned} \tag{53}
\]

It is worth pointing out that if ∆t is selected as a small enough time period, one can utilize the trapezoidal rule to approximate the definite integrals η_2, η_3 and Υ^{(i)} as

\[ \int_{t}^{t+\Delta t} p(\tau)\, d\tau \approx \frac{\Delta t}{2}\big[ p(t) + p(t+\Delta t) \big]. \tag{54} \]

Therefore, (52) can be given by

\[ \Xi^{(i)}(X(t), X(t+\Delta t), u(t)) = \eta_{1}^{T} W_{V}^{(i+1)} + \eta_{2} W_{V}^{(i+1)} - \Upsilon^{(i)} + \eta_{3} W_{u}^{(i+1)}. \tag{55} \]

Define Φ^{(i)} ≜ [η_1^T + η_2, η_3] and W_{Vu}^{(i+1)} ≜ [(W_V^{(i+1)})^T (W_u^{(i+1)})^T]^T. Then, (52) can be rewritten as

\[ \Xi^{(i)}(X(t), X(t+\Delta t), u(t)) = \Phi^{(i)} W_{Vu}^{(i+1)} - \Upsilon^{(i)}. \tag{56} \]

In order to minimize the residual error Ξ^{(i)}, we employ the least-squares approach, and thus multiple data sets are required. Let the size of the data sampling set be M, where M is a large enough number. Choosing different control inputs u_k(t) with a small enough time period ∆t, one can obtain the system data sampling sets (X_k(t), X_k(t+∆t), u_k(t)), where k = 1, 2, ⋯, M. Subsequently, the database can be constructed as ξ^{(i)} = [(Φ_1^{(i)})^T (Φ_2^{(i)})^T ⋯ (Φ_M^{(i)})^T]^T and θ^{(i)} = [(Υ_1^{(i)})^T (Υ_2^{(i)})^T ⋯ (Υ_M^{(i)})^T]^T. Consequently, the least-squares solution for the updating law of W_{Vu}^{(i+1)} is derived by

\[ W_{Vu}^{(i+1)} = \big[ (\xi^{(i)})^{T} \xi^{(i)} \big]^{-1} (\xi^{(i)})^{T} \theta^{(i)}. \tag{57} \]
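As an implementation sketch (not the authors' code), the following Python functions build one data row (Φ^{(i)}, Υ^{(i)}) from a sampled tuple (X(t), X(t+∆t), u(t)) using the trapezoidal rule (54), and stack M such rows to solve (57) by least squares. The feature maps `phi_V`, `phi_u`, the weights Q and the scalar R are supplied by the user; the exploratory input is assumed to be held constant over each sampling interval.

```python
import numpy as np

def data_row(X_t, X_next, u_t, W_u_i, phi_V, phi_u, Q, R, alpha, dt):
    """One data row (Phi, Upsilon) of (56) for a single tuple, via the trapezoidal rule (54)."""
    trap = lambda a, b: 0.5 * dt * (a + b)              # approximation (54)
    u_hat = lambda X: phi_u(X) @ W_u_i                  # actor estimate (51), scalar since m = 1
    eta1 = phi_V(X_t) - phi_V(X_next)
    eta2 = trap(alpha * phi_V(X_t), alpha * phi_V(X_next))
    eta3 = trap(2.0 * R * (u_hat(X_t) - u_t) * phi_u(X_t),
                2.0 * R * (u_hat(X_next) - u_t) * phi_u(X_next))
    cost = lambda X: X @ Q @ X + R * u_hat(X) ** 2      # integrand of Upsilon^{(i)} in (53)
    upsilon = trap(cost(X_t), cost(X_next))
    return np.concatenate([eta1 + eta2, eta3]), upsilon

def least_squares_update(rows):
    """Solve (57); np.linalg.lstsq returns the same minimizer as (xi^T xi)^{-1} xi^T theta."""
    xi = np.vstack([r[0] for r in rows])
    theta = np.array([r[1] for r in rows])
    W, *_ = np.linalg.lstsq(xi, theta, rcond=None)
    return W                                            # stacked [W_V^{(i+1)}; W_u^{(i+1)}]
```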

4.4. NN-based data-driven model-free algorithm

Based on the above, the data-driven reinforcement learning algorithm can be summarized as follows.

Step 1. With different control inputs u_k(t), collect the sampling sets of system data (X_k(t), X_k(t+∆t), u_k(t)), where k = 1, 2, ⋯, M. Let i = 0. Choose an initial NN weight W_{Vu}^{(0)} such that V̂^{(0)} ∈ V_0.

Step 2. Use the collected sampling data sets to compute ξ^{(i)} and θ^{(i)}.

Step 3. Tune the NN weights W_{Vu}^{(i+1)} through the updating law (57). If ‖W_{Vu}^{(i+1)} − W_{Vu}^{(i)}‖ ≤ ε, where ε is a small enough positive constant, then stop at Step 3; otherwise, let i = i + 1 and go back to Step 2. A sketch of this outer loop is given after the following remark.

Remark 7. It can be observed that if we set the discount factor α = 0 and use the performance index function (3), this data-driven method can also be applied to solve the optimal regulation control problem, which implies that optimal regulation control is a special case of optimal tracking control.
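A minimal Python sketch of Steps 1–3 is given below; it reuses the hypothetical `data_row` and `least_squares_update` helpers from the previous sketch, and the stopping tolerance, initial weights and data set are assumptions for illustration only.

```python
import numpy as np

def data_driven_rl(data, W0, phi_V, phi_u, Q, R, alpha, dt, n_V, eps=1e-4, max_iter=50):
    """Outer iteration of the NN-based data-driven algorithm (Steps 1-3).

    `data` is a list of tuples (X_t, X_next, u_t) collected under exploratory inputs;
    `W0` stacks the initial critic/actor weights [W_V^{(0)}; W_u^{(0)}]; `n_V` is the
    number of critic features, so W[:n_V] are critic weights and W[n_V:] actor weights.
    `data_row` and `least_squares_update` are as defined in the previous sketch.
    """
    W = np.asarray(W0, dtype=float)
    for i in range(max_iter):
        W_u_i = W[n_V:]                                    # actor weights at iteration i
        rows = [data_row(X_t, X_next, u_t, W_u_i,
                         phi_V, phi_u, Q, R, alpha, dt)    # Step 2: build xi^{(i)}, theta^{(i)}
                for (X_t, X_next, u_t) in data]
        W_next = least_squares_update(rows)                # Step 3: update law (57)
        if np.linalg.norm(W_next - W) <= eps:              # stopping criterion of Step 3
            return W_next, i + 1
        W = W_next
    return W, max_iter
```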
AN
5. Simulation Results

In this section, two simulation examples for both robust regulation control and robust tracking control are shown to demonstrate the effectiveness of our proposed scheme.

5.1. Robust regulation control

Consider the following uncertain original system:

\[ \dot{x} = \begin{bmatrix} -x_{1} + x_{2} \\ -0.5(x_{1} + x_{2}) + 0.5 x_{1}^{2} x_{2} \end{bmatrix} + \begin{bmatrix} 0 \\ x_{1} \end{bmatrix} (\bar{u} + \bar{d}) \tag{58} \]

where ū = ηu^{*} with η = 4, and d̄ = δ_1 sin(δ_2 x_1 + δ_3 x_2) denotes the control input perturbation with the random constants δ_1 ∈ [−0.5, 0.5], δ_2 ∈ [−1, 1] and δ_3 ∈ [−1.5, 1.5].

The corresponding nominal system of (58) can be expressed as

\[ \dot{x} = \begin{bmatrix} -x_{1} + x_{2} \\ -0.5(x_{1} + x_{2}) + 0.5 x_{1}^{2} x_{2} \end{bmatrix} + \begin{bmatrix} 0 \\ x_{1} \end{bmatrix} u \tag{59} \]

with the associated performance index function

\[ V(x(t)) = \int_{t}^{\infty} \big[ x^{T}(\tau) P x(\tau) + u^{T}(x(\tau)) R u(x(\tau)) \big] \, d\tau \tag{60} \]

where P = I_{2×2} (I denotes the identity matrix) and R = 1.

We select the activation functions for the critic and actor NNs, respectively, as

\[ \phi_{V}(x) = [\, x_{1}^{2} \;\; x_{1}x_{2} \;\; x_{2}^{2} \,]^{T}, \tag{61} \]

and

\[ \phi_{u}(x) = [\, x_{1} \;\; x_{2} \;\; x_{1}^{2} \;\; x_{1}x_{2} \;\; x_{2}^{2} \,]^{T}. \tag{62} \]
It should be pointed out that this simulation example is constructed by the converse HJB method [54], so the optimal solutions are known to be V^{*} = 0.5x_1^2 + x_2^2 and u^{*} = -x_1 x_2. From the simulation results in Fig. 1 and Fig. 2, it can be seen that the NN weights converge to their optimal values after several iterations. In Fig. 3, the proposed robust controller overcomes the random control perturbation and the state trajectories eventually stabilize. When the control perturbation becomes so large that condition (25) no longer holds and the system (58) may be unstable, one can enhance the feedback gain η to handle this problem, as mentioned in Remark 3. As shown in Fig. 4, when η = 4 is kept for a much larger control perturbation, the system becomes unstable; when the feedback gain η is enhanced, the system becomes stable, which demonstrates the statement in Remark 3. A simulation sketch of this example is given below.
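For reproducibility of the closed-loop behaviour discussed above, the following Python sketch integrates the uncertain system (58) under the robust controller ū = ηu^{*} with the known optimal policy u^{*} = -x_1 x_2; the integration step, horizon, initial state and the sampled perturbation constants are assumptions, and the plotting details of Figs. 3–4 are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random perturbation constants as in the example: delta1, delta2, delta3
d1, d2, d3 = rng.uniform(-0.5, 0.5), rng.uniform(-1, 1), rng.uniform(-1.5, 1.5)

def closed_loop(x, eta):
    """Right-hand side of (58) with u_bar = eta * u*, where u*(x) = -x1*x2."""
    x1, x2 = x
    u_bar = eta * (-x1 * x2)
    d_bar = d1 * np.sin(d2 * x1 + d3 * x2)
    f = np.array([-x1 + x2, -0.5 * (x1 + x2) + 0.5 * x1**2 * x2])
    g = np.array([0.0, x1])
    return f + g * (u_bar + d_bar)

def simulate(x0, eta, dt=0.01, T=30.0):
    """Fourth-order Runge-Kutta integration of the closed loop."""
    x, traj = np.array(x0, dtype=float), [np.array(x0, dtype=float)]
    for _ in range(int(T / dt)):
        k1 = closed_loop(x, eta)
        k2 = closed_loop(x + 0.5 * dt * k1, eta)
        k3 = closed_loop(x + 0.5 * dt * k2, eta)
        k4 = closed_loop(x + dt * k3, eta)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(x)
    return np.array(traj)

traj = simulate([0.2, 0.1], eta=4)
print("final state:", traj[-1])   # should approach the origin for eta = 4
```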
[Figure: critic NN weights WV1, WV2, WV3 versus iteration step.]
Fig. 1. Convergence of the critic NN weights.

[Figure: actor NN weights Wu1-Wu5 versus iteration step.]
Fig. 2. Convergence of the actor NN weights.

[Figure: state trajectories x1 and x2 versus time step.]
Fig. 3. Evolution of the state trajectories x1 and x2.

[Figure: state trajectories x1 and x2 versus time step for η = 4, 6, 8 and 10.]
Fig. 4. Evolution of the state trajectories x1 and x2 with different η.

5.2. Robust tracking control

Consider the uncertain original system as follows:

\[ \dot{x} = \begin{bmatrix} x_{2} \\ -0.5(x_{1} + x_{2}) + 0.5 x_{1}^{2} x_{2} \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} (\bar{u} + \bar{d}) \tag{63} \]

where we set ū = ηu^{*} with the constant feedback gain η = 2, and select the control input perturbation as d̄ = δ_1 sin(δ_2 x_1 + δ_3 x_2) with the random constants δ_1 ∈ [−1, 1], δ_2 ∈ [−3, 3] and δ_3 ∈ [−4, 4].

Thus, the nominal nonlinear system of (63) is described by

\[ \dot{x} = \begin{bmatrix} x_{2} \\ -0.5(x_{1} + x_{2}) + 0.5 x_{1}^{2} x_{2} \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u. \tag{64} \]

The reference system, which generates the desired trajectory dynamics, is given by

\[ \dot{x}_{d} = \begin{bmatrix} x_{d2} \\ -x_{d1} \end{bmatrix}. \tag{65} \]

Let X_1 = x_1 − x_{d1}, X_2 = x_2 − x_{d2}, X_3 = x_{d1} and X_4 = x_{d2}. Then, we can obtain the augmented system dynamics as below:

\[ \dot{X} = \begin{bmatrix} X_{2} \\ -0.5(X_{1} + X_{2} - X_{3} + X_{4}) + 0.5(X_{1} + X_{3})^{2}(X_{2} + X_{4}) \\ X_{4} \\ -X_{3} \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} u \tag{66} \]

with the following discounted performance index function:

\[ V(X(t)) = \int_{t}^{\infty} e^{-\alpha(\tau - t)} \big[ X^{T}(\tau) Q X(\tau) + u^{T}(\tau) R u(\tau) \big] \, d\tau \tag{67} \]

where the discount factor α = 0.01; Q = \begin{bmatrix} P & 0 \\ 0 & 0 \end{bmatrix} with P = 10 I_{2×2}, and R = 1.

The activation functions for the critic NN are selected as

\[ \phi_{V}(X) = [\, X_{1}^{2} \;\; X_{1}X_{2} \;\; X_{1}X_{3} \;\; X_{1}X_{4} \;\; X_{2}^{2} \;\; X_{2}X_{3} \;\; X_{2}X_{4} \;\; X_{3}^{2} \;\; X_{3}X_{4} \;\; X_{4}^{2} \;\; X_{1}^{3} \;\; X_{2}^{3} \;\; X_{3}^{3} \;\; X_{4}^{3} \;\; X_{1}^{4} \;\; X_{2}^{4} \;\; X_{3}^{4} \;\; X_{4}^{4} \,]^{T}, \tag{68} \]

and the activation functions for the actor NN are given by

\[
\begin{aligned}
\phi_{u}(X) = [\, & X_{1} \;\; X_{2} \;\; X_{3} \;\; X_{4} \;\; X_{1}^{2} \;\; X_{1}X_{2} \;\; X_{1}X_{3} \;\; X_{1}X_{4} \;\; X_{2}^{2} \;\; X_{2}X_{3} \;\; X_{2}X_{4} \;\; X_{3}^{2} \;\; X_{3}X_{4} \;\; X_{4}^{2} \\
& X_{1}^{3} \;\; X_{1}X_{2}X_{3} \;\; X_{1}X_{2}X_{4} \;\; X_{1}X_{3}X_{4} \;\; X_{1}^{2}X_{2} \;\; X_{1}^{2}X_{3} \;\; X_{1}^{2}X_{4} \;\; X_{2}^{3} \;\; X_{2}^{2}X_{1} \;\; X_{2}^{2}X_{3} \;\; X_{2}^{2}X_{4} \;\; X_{2}X_{3}X_{4} \\
& X_{3}^{3} \;\; X_{3}^{2}X_{1} \;\; X_{3}^{2}X_{2} \;\; X_{3}^{2}X_{4} \;\; X_{4}^{3} \;\; X_{4}^{2}X_{1} \;\; X_{4}^{2}X_{2} \;\; X_{4}^{2}X_{3} \,]^{T}.
\end{aligned} \tag{69}
\]


CE

We set ∆t = 0.1s, and then, with different control inputs uk , real system
data sampling sets (Xk (t), Xk (t+∆t), uk ) can be collected. After that, update
the two NNs via the least-square method (57). From Fig. 5 and Fig. 6, it
can be seen that the NN weights achieve convergence finally.
AC

22
ACCEPTED MANUSCRIPT
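A possible way to generate such data tuples for the augmented system (66) is sketched below in Python; the choice of exploratory inputs (uniformly random, held constant over each sampling interval), the initial-state distribution and the number of samples M are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def F_aug(X):
    """Drift term of the augmented system (66)."""
    X1, X2, X3, X4 = X
    return np.array([X2,
                     -0.5 * (X1 + X2 - X3 + X4) + 0.5 * (X1 + X3)**2 * (X2 + X4),
                     X4,
                     -X3])

G_aug = np.array([0.0, 1.0, 0.0, 0.0])          # input matrix of (66)

def collect_data(M=200, dt=0.1, substeps=10):
    """Collect M tuples (X_k(t), X_k(t+dt), u_k) with piecewise-constant exploratory inputs."""
    data = []
    X = rng.uniform(-1.0, 1.0, size=4)          # assumed initial augmented state
    h = dt / substeps
    for _ in range(M):
        u = rng.uniform(-2.0, 2.0)              # exploratory input held over [t, t+dt]
        X_t = X.copy()
        for _ in range(substeps):               # simple Euler integration inside the interval
            X = X + h * (F_aug(X) + G_aug * u)
        data.append((X_t, X.copy(), u))
    return data

data = collect_data()
print(len(data), "tuples collected; first tuple:", data[0])
```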

[Figure: critic NN weights versus iteration step.]
Fig. 5. Convergence of the critic NN weights.

[Figure: actor NN weights versus iteration step.]
Fig. 6. Convergence of the actor NN weights.

[Figure: state trajectories x1 and xd1 versus time step.]
Fig. 7. Evolution of the state trajectories x1 and xd1.

[Figure: state trajectories x2 and xd2 versus time step.]
Fig. 8. Evolution of the state trajectories x2 and xd2.

Subsequently, the obtained results, together with the constant feedback gain, are used to control the uncertain original system (63) under the random control input perturbation. As indicated in Fig. 7 and Fig. 8, the uncertain original system (63) successfully synchronizes with the reference system (65).

6. Conclusion

In this paper, the robust control issues for a class of uncertain systems have been investigated. A novel reinforcement learning scheme has been employed to obtain the optimal control policy of the nominal system without requiring knowledge of the system model. Adding a constant feedback gain to this optimal control policy yields the robust controller, which has been proved to achieve optimality under a newly defined performance index function when there is no control input perturbation. To implement the model-free algorithm, two neural networks updated by the least-squares method have been utilized to learn the solution of the HJB equation iteration by iteration. Simulation results have demonstrated the feasibility and effectiveness of our proposed scheme. It is expected that, with the powerful ability of ADP in solving optimal control problems, our research results can be extended to other decision support systems.
Acknowledgment

This work was supported by the National Natural Science Foundation of China (61433004, 61627809, 61621004), and IAPI Fundamental Research Funds 2013ZCX14.
References

[1] F. Lin, R. D. Brandt, J. Sun, Robust control of nonlinear systems: compensating for uncertainty, International Journal of Control 56 (6) (1992) 1453–1459.

[2] R. Song, W. Xiao, H. Zhang, C. Sun, Adaptive dynamic programming for a class of complex-valued nonlinear systems, IEEE Transactions on Neural Networks and Learning Systems 25 (9) (2014) 1733–1739.

[3] A. Al-Tamimi, F. L. Lewis, M. Abu-Khalaf, Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 38 (4) (2008) 943–949.

[4] Q. Wei, D. Liu, H. Lin, Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems, IEEE Transactions on Cybernetics 46 (3) (2016) 840–853.

[5] J. J. Murray, C. J. Cox, G. G. Lendaris, R. Saeks, Adaptive dynamic programming, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 32 (2) (2002) 140–153.

[6] D. Liu, Q. Wei, Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems, IEEE Transactions on Neural Networks and Learning Systems 25 (3) (2014) 621–634.

[7] K. G. Vamvoudakis, F. L. Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica 46 (5) (2010) 878–888.

[8] H. Zhang, C. Qin, Y. Luo, Neural-network-based constrained optimal control scheme for discrete-time switched nonlinear system using dual heuristic programming, IEEE Transactions on Automation Science and Engineering 11 (3) (2014) 839–849.

[9] M. Abu-Khalaf, F. L. Lewis, Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach, Automatica 41 (5) (2005) 779–791.

[10] H. Zhang, Y. Luo, D. Liu, Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints, IEEE Transactions on Neural Networks 20 (9) (2009) 1490–1503.

[11] H. Modares, F. L. Lewis, M.-B. Naghibi-Sistani, Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks, IEEE Transactions on Neural Networks and Learning Systems 24 (10) (2013) 1513–1525.

[12] B. Wang, D. Zhao, C. Alippi, D. Liu, Dual heuristic dynamic programming for nonlinear discrete-time uncertain systems with state delay, Neurocomputing 134 (2014) 222–229.

[13] H. Zhang, R. Song, Q. Wei, T. Zhang, Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming, IEEE Transactions on Neural Networks 22 (12) (2011) 1851–1862.

[14] Q. Wei, H. Zhang, D. Liu, Y. Zhao, An optimal control scheme for a class of discrete-time nonlinear systems with time delays using adaptive dynamic programming, Acta Automatica Sinica 36 (1) (2010) 121–129.

[15] Q. Wei, R. Song, P. Yan, Data-driven zero-sum neuro-optimal control for a class of continuous-time unknown nonlinear systems with disturbance using ADP, IEEE Transactions on Neural Networks and Learning Systems 27 (2) (2016) 444–458.

[16] K. G. Vamvoudakis, F. Lewis, Online solution of nonlinear two-player zero-sum games using synchronous policy iteration, International Journal of Robust and Nonlinear Control 22 (13) (2012) 1460–1483.

[17] D. Liu, H. Li, D. Wang, Neural-network-based zero-sum game for discrete-time nonlinear systems via iterative adaptive dynamic programming algorithm, Neurocomputing 110 (2013) 92–100.

[18] H. Zhang, Q. Wei, D. Liu, An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games, Automatica 47 (1) (2011) 207–214.

[19] K. G. Vamvoudakis, F. L. Lewis, Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton-Jacobi equations, Automatica 47 (8) (2011) 1556–1569.

[20] H. Zhang, L. Cui, Y. Luo, Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP, IEEE Transactions on Cybernetics 43 (1) (2013) 206–216.

[21] D. Liu, H. Li, D. Wang, Online synchronous approximate optimal learning algorithm for multi-player non-zero-sum games with unknown dynamics, IEEE Transactions on Systems, Man, and Cybernetics: Systems 44 (8) (2014) 1015–1027.

[22] R. Song, F. L. Lewis, Q. Wei, Off-policy integral reinforcement learning method to solve nonlinear continuous-time multiplayer nonzero-sum games, IEEE Transactions on Neural Networks and Learning Systems 28 (3) (2017) 704–713.

[23] X. Zhong, H. He, H. Zhang, Z. Wang, Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming, IEEE Transactions on Neural Networks and Learning Systems 25 (12) (2014) 2141–2155.

[24] X. Zhong, H. He, H. Zhang, Z. Wang, A neural network based online learning and control approach for Markov jump systems, Neurocomputing 149 (2015) 116–123.

[25] C. Yang, K. Huang, H. Cheng, Y. Li, C.-Y. Su, Haptic identification by ELM-controlled uncertain manipulator, IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99) (2017) 1–12.

[26] C. Yang, X. Wang, L. Cheng, H. Ma, Neural-learning-based telerobot control with guaranteed performance, IEEE Transactions on Cybernetics PP (99) (2016) 1–12.

[27] H. Zhang, T. Feng, G. H. Yang, H. Liang, Distributed cooperative optimal control for multiagent systems on directed graphs: An inverse optimal approach, IEEE Transactions on Cybernetics 45 (7) (2015) 1315–1326.

[28] K. G. Vamvoudakis, F. L. Lewis, G. R. Hudas, Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality, Automatica 48 (8) (2012) 1598–1611.

[29] H. Zhang, J. Zhang, G. H. Yang, Y. Luo, Leader-based optimal coordination control for the consensus problem of multiagent differential games via fuzzy adaptive dynamic programming, IEEE Transactions on Fuzzy Systems 23 (1) (2015) 152–163.

[30] Q. Wei, D. Liu, F. L. Lewis, Optimal distributed synchronization control for continuous-time heterogeneous multi-agent differential graphical games, Information Sciences 317 (2015) 96–113.

[31] M. I. Abouheaf, F. Lewis, K. Vamvoudakis, S. Haesaert, R. Babuska, Multi-agent discrete-time graphical games and reinforcement learning solutions, Automatica 50 (2014) 3038–3053.

[32] H. Modares, F. L. Lewis, Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning, Automatica 50 (7) (2014) 1780–1792.

[33] B. Kiumarsi, F. L. Lewis, Actor-critic-based optimal tracking for partially unknown nonlinear discrete-time systems, IEEE Transactions on Neural Networks and Learning Systems 26 (1) (2015) 140–151.

[34] H. Zhang, Q. Wei, Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 38 (4) (2008) 937–942.

[35] D. Wang, D. Liu, Q. Wei, Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach, Neurocomputing 78 (1) (2012) 14–22.

[36] H.-M. Yen, T.-H. S. Li, Y.-C. Chang, Design of a robust neural network-based tracking controller for a class of electrically driven nonholonomic mechanical systems, Information Sciences 222 (2013) 559–575.

[37] D. Wang, D. Liu, H. Li, H. Ma, Neural-network-based robust optimal control design for a class of uncertain nonlinear systems via adaptive dynamic programming, Information Sciences 282 (2014) 167–179.

[38] D. Wang, D. Liu, Q. Zhang, D. Zhao, Data-based adaptive critic designs for nonlinear robust optimal control with uncertain dynamics, IEEE Transactions on Systems, Man, and Cybernetics: Systems 46 (11) (2015) 1544–1555.

[39] T. Dierks, B. T. Thumati, S. Jagannathan, Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence, Neural Networks 22 (5) (2009) 851–860.

[40] D. Liu, D. Wang, D. Zhao, Q. Wei, N. Jin, Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming, IEEE Transactions on Automation Science and Engineering 9 (3) (2012) 628–634.

[41] H. Zhang, L. Cui, X. Zhang, Y. Luo, Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method, IEEE Transactions on Neural Networks 22 (12) (2011) 2226–2236.

[42] R. Song, F. L. Lewis, Q. Wei, H. Zhang, Off-policy actor-critic structure for optimal control of unknown systems with disturbances, IEEE Transactions on Cybernetics 46 (5) (2016) 1041–1050.

[43] R. Song, F. Lewis, Q. Wei, H.-G. Zhang, Z.-P. Jiang, D. Levine, Multiple actor-critic structures for continuous-time optimal control using input-output data, IEEE Transactions on Neural Networks and Learning Systems 26 (4) (2015) 851–865.

[44] B. Luo, H.-N. Wu, T. Huang, D. Liu, Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design, Automatica 50 (12) (2014) 3281–3290.

[45] H. Modares, F. L. Lewis, Z.-P. Jiang, H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems 26 (10) (2015) 2550–2562.

[46] G. Xiao, H. Zhang, Y. Luo, H. Jiang, Data-driven optimal tracking control for a class of affine non-linear continuous-time systems with completely unknown dynamics, IET Control Theory & Applications 10 (6) (2016) 700–710.

[47] R. Kamalapurkar, H. Dinh, S. Bhasin, W. E. Dixon, Approximate optimal trajectory tracking for continuous-time nonlinear systems, Automatica 51 (2015) 40–48.

[48] R. Kamalapurkar, L. Andrews, P. Walters, W. E. Dixon, Model-based reinforcement learning for infinite-horizon approximate optimal tracking, IEEE Transactions on Neural Networks and Learning Systems PP (99) (2016) 1–6.

[49] B. Luo, H. Wu, Computationally efficient simultaneous policy update algorithm for nonlinear H∞ state feedback control with Galerkin's method, International Journal of Robust and Nonlinear Control 23 (9) (2013) 991–1012.

[50] H.-N. Wu, B. Luo, Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control, IEEE Transactions on Neural Networks and Learning Systems 23 (12) (2012) 1884–1895.

[51] E. Zeidler, Nonlinear Functional Analysis and Its Applications: III: Variational Methods and Optimization, Springer Science & Business Media, 2013.

[52] C. Yang, Y. Jiang, Z. Li, W. He, C.-Y. Su, Neural control of bimanual robots with guaranteed global stability and motion precision, IEEE Transactions on Industrial Informatics PP (99) (2016) 1–9.

[53] C. Yang, X. Wang, Z. Li, Y. Li, C.-Y. Su, Teleoperation control based on combination of wave variable and neural networks, IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99) (2016) 1–12.

[54] V. Nevistic, J. A. Primbs, Optimality of nonlinear design techniques: a converse HJB approach, Technical Report TR96-022, California Institute of Technology (1996).
He Jiang received the B.S. degree in automation control in 2014 from Northeastern University, Shenyang, China, where he is currently pursuing the Ph.D. degree in control theory and control engineering. His current research interests include adaptive dynamic programming, fuzzy control, multi-agent system control and their industrial applications.

Huaguang Zhang received the B.S. degree and the M.S. degree in control engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively. He received the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, School of Information Science and Engineering, Northeastern University, Shenyang, China. His main research interests are fuzzy control, stochastic system control, neural networks based control, nonlinear control, and their applications. He has authored and coauthored over 280 journal and conference papers, six monographs and co-invented 90 patents. Dr. Zhang is a fellow of IEEE, the E-letter Chair of the IEEE CIS Society, and the former Chair of the Adaptive Dynamic Programming & Reinforcement Learning Technical Committee of the IEEE Computational Intelligence Society. He is an Associate Editor of AUTOMATICA, IEEE TRANSACTIONS ON NEURAL NETWORKS, IEEE TRANSACTIONS ON CYBERNETICS, and NEUROCOMPUTING, respectively. He was an Associate Editor of IEEE TRANSACTIONS ON FUZZY SYSTEMS (2008-2013). He was awarded the Outstanding Youth Science Foundation Award from the National Natural Science Foundation Committee of China in 2003. He was named a Cheung Kong Scholar by the Education Ministry of China in 2005. He is a recipient of the IEEE Transactions on Neural Networks 2012 Outstanding Paper Award.

Yang Cui received the B.S. degree in information and computing science and the M.S. degree in applied mathematics from Liaoning University of Technology, Jinzhou, China. She is currently working toward the Ph.D. degree in control theory and control engineering at Northeastern University. Her research interests include dynamic surface control, neural networks and network control.

Geyang Xiao received the B.S. degree in Automation Control from Northeastern University, Shenyang, China, in 2012. He has been pursuing the Ph.D. degree with Northeastern University, Shenyang, China, since 2012. His current research interests include neural-network-based control, nonlinear control, adaptive dynamic programming and their industrial applications.
