[15] Robust control scheme for a class of uncertain nonlinear systems with completely unknown dynamics using data-driven reinforcement learning method
Accepted Manuscript
PII: S0925-2312(17)31381-4
DOI: 10.1016/j.neucom.2017.07.058
Reference: NEUCOM 18771
Please cite this article as: He Jiang, Huaguang Zhang, Yang Cui, Geyang Xiao, Robust control scheme for a class of uncertain nonlinear systems with completely unknown dynamics using data-driven reinforcement learning method, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.07.058
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
He Jiang, Huaguang Zhang∗, Yang Cui, Geyang Xiao
College of Information Science and Engineering, Northeastern University, Box 134,
110819, Shenyang, P. R. China
Abstract
This paper deals with the robust control issues for a class of uncertain nonlinear systems with completely unknown dynamics via a data-driven reinforcement learning method. Firstly, we formulate the optimal regulation control problem for the nominal system, and then the robust controller for the original uncertain system is designed by adding a constant feedback gain to the optimal controller of the nominal system. Then, this scheme is extended to the optimal tracking control by means of an augmented system and a discount factor. It is also demonstrated that the proposed robust controller can achieve optimality with a newly defined performance index function when there is no control input perturbation. To implement the scheme without any knowledge of the system dynamics, a data-driven model-free reinforcement learning algorithm is derived and realized by two neural networks whose weights are updated by the least-square method using only measured system data. Finally, two simulation examples are provided to demonstrate the effectiveness of the proposed scheme.
∗Corresponding author.
Email addresses: [email protected] (He Jiang), [email protected] (Huaguang Zhang), [email protected] (Yang Cui), [email protected] (Geyang Xiao)
Keywords: Reinforcement learning; Adaptive dynamic programming;
Data-driven; Model-free; Neural networks.
1. Introduction
Model uncertainties, which may severely affect the control performance of closed-loop feedback systems, usually occur in real-world complex systems, such as manufacturing systems and power systems. Therefore, research on robust control problems has received considerable attention, and, so far, many significant relevant results have been achieved. The authors of [1] pointed out that the robust control problem of the uncertain original system can be translated into the optimal control problem of its nominal system, but the detailed results were not shown. It is known that solving the optimal control problem relies on the solution of the Hamilton-Jacobi-Bellman (HJB) equation. For the linear optimal control, i.e., the linear quadratic regulation problem, the HJB equation reduces to an algebraic Riccati equation, while for nonlinear systems it is generally intractable to solve analytically. To overcome this difficulty, adaptive dynamic programming (ADP) provides two mainstream iterative algorithms, i.e., value iteration (VI) [3, 4] and policy iteration (PI) [5, 6, 7]. The convergence proof of the VI algorithm for
discrete-time (DT) systems was first given in [3], and a novel PI algorithm for
the DT version was presented in [6]. For the continuous-time (CT) systems,
a PI algorithm was proposed without the requirement of the knowledge of
internal system dynamics in [5]. Following the theoretical structure of these
classical works [3, 4, 5, 6, 7], a variety of optimal control issues have been ad-
dressed, such as optimal control with constrained control input [8, 9, 10, 11]
and time-delay [12, 13, 14], optimal control for zero-sum [15, 16, 17, 18]
and non-zero-sum games [19, 20, 21, 22], and optimal control applied on the
Markov jump systems [23, 24], robot systems [25, 26] and multi-agent systems
[27, 28, 29, 30, 31]. Among these issues, optimal tracking control problem
(OTCP) has become one of the most attractive topics within the scope of
ADP. The integral reinforcement learning (RL) technique was employed to address the OTCP for the
partially unknown CT systems with constrained control inputs in [32], and
the DT case was investigated in [33] by means of an actor-critic-based RL
algorithm. In [34], a novel infinite-time optimal tracking control scheme was
provided for a class of DT nonlinear systems via the greedy heuristic dy-
namic programming (HDP) iteration algorithm, and a finite-horizon OTCP
version was studied in [35] through an adaptive dynamic programming ap-
proach. However, model uncertainties, which are generally inevitable in most real-world systems, are not considered in the studies above.
The main idea of this paper is to employ the ADP technique to ob-
tain the optimal solution of the nominal system, and then, extend it to the
robust controller design of the uncertain original system. In [36], a neural-network-based robust tracking controller was designed for a class of electrically driven nonholonomic mechanical systems; the optimal robust control problem for a class of uncertain nonlinear systems was studied in [37], and then, based on this work [37], a data-based robust control approach was developed via the neural network identification technique in [38]. However, for most industrial systems, such as aerospace control systems and power systems, accurate models are difficult to obtain, which calls for data-driven, model-free methods [42, 43, 44, 45]. Therefore, it will be interesting and challenging to
handle the robust control issues by using the data-driven RL method in the
model-free environment, which motivates our research.
In this paper, we present a data-driven RL scheme to solve the robust
control issues for a class of uncertain CT nonlinear systems with completely
unknown system dynamics. First of all, the problem formulation is derived
in Section 2. Secondly, the robust controller of the uncertain original system
is designed by adding a constant feedback gain to the optimal controller of
the nominal system, and the proofs of optimality and stability are provided
in Section 3. Subsequently, based on the introduced model-based iterative learning algorithm, a data-driven model-free RL method is derived and implemented by two neural networks (NNs), whose weights are updated by the least-square approach, in Section 4. Two numerical simulation examples are given to demonstrate the effectiveness of our proposed scheme in Section 5. Finally, a brief conclusion is drawn in Section 6.
2. Problem formulation
Optimal control issues can be generally classified into two main groups:
optimal regulation control and optimal tracking control [33].
Consider the following continuous-time uncertain nonlinear system:

ẋ(t) = f(x(t)) + g(x(t))(ū(t) + d̄(x)), (1)

where x(t) ∈ Rⁿ is the state; ū(t) ∈ Rᵐ denotes the control input; d̄(x) ∈ Rᵐ represents the finite control input perturbation, which is assumed to be bounded as ‖d̄(x)‖ ≤ kd‖x‖; f(x(t)) ∈ Rⁿ and g(x(t)) ∈ Rⁿˣᵐ are the system dynamics, which, in this paper, are both considered to be unknown.
The corresponding nominal system of (1) can be described by

ẋ(t) = f(x(t)) + g(x(t))u(t), (2)

with the associated performance index function

V(x(t)) = ∫ₜ^∞ [xᵀ(τ)Px(τ) + uᵀ(x(τ))Ru(x(τ))] dτ, (3)

where P ∈ Rⁿˣⁿ > 0 and R ∈ Rᵐˣᵐ > 0 are symmetric positive definite weight matrices.
The associated optimal control policy which minimizes V(x) can be derived as

u∗(x) = −(1/2)R⁻¹gᵀ(x)∇V∗(x), (4)

where V∗(x) denotes the optimal performance index function and ∇V∗(x) = ∂V∗(x)/∂x.
Then, one can obtain the following Hamilton-Jacobi-Bellman (HJB) equation for the optimal regulation control problem:

∇V∗ᵀ(x)f(x) + xᵀPx − (1/4)∇V∗ᵀ(x)g(x)R⁻¹gᵀ(x)∇V∗(x) = 0. (5)

For the optimal tracking control problem, the reference trajectory is generated by the command system

ẋd(t) = h(xd(t)), (6)

where xd(t) ∈ Rⁿ denotes the tracking objective state and h(xd) ∈ Rⁿ is the command generator function.
Define the tracking error as

ed(t) = x(t) − xd(t). (7)

Bringing (2) and (6) together yields the tracking error dynamics

ėd(t) = f(ed(t) + xd(t)) + g(ed(t) + xd(t))u(t) − h(xd(t)). (8)

Let X(t) = [edᵀ(t), xdᵀ(t)]ᵀ be the state of the augmented system. Next, combining (6) and (8) yields the dynamics of the augmented system:
Ẋ(t) = F(X(t)) + G(X(t))u(t), (9)

where

F(X(t)) = [ f(ed(t) + xd(t)) − h(xd(t)) ; h(xd(t)) ] and G(X(t)) = [ g(ed(t) + xd(t)) ; 0 ].
The augmented state and input satisfy X(t) ∈ 𝒳 ⊂ R²ⁿ and u(t) ∈ 𝒰 ⊂ Rᵐ. Define D ≜ {(X, u) | X ∈ 𝒳, u ∈ 𝒰}, where 𝒳 and 𝒰 denote compact sets.
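To make the construction of (9) concrete, the following minimal sketch (our illustration, not code from the paper; the helper name and array conventions are assumptions) assembles F(X) and G(X) from user-supplied f, g and h:

```python
import numpy as np

# Minimal sketch: build the augmented dynamics (9) from f, g, h.
# Assumed conventions: x has dimension n, X = [e_d; x_d] has dimension 2n,
# f(x) -> (n,), g(x) -> (n, m), h(x_d) -> (n,).

def make_augmented_dynamics(f, g, h, n):
    def F(X):
        e_d, x_d = X[:n], X[n:]
        x = e_d + x_d                            # recover x from e_d = x - x_d
        return np.concatenate([f(x) - h(x_d),    # tracking-error block
                               h(x_d)])          # reference block
    def G(X):
        gx = g(X[:n] + X[n:])
        return np.vstack([gx, np.zeros_like(gx)])  # input enters only the top block
    return F, G
```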
For the nominal augmented system (9), in order to solve the infinite-horizon OTCP, one needs to design a state feedback control policy u(X) which minimizes the following discounted performance index function:

V(X(t)) = ∫ₜ^∞ e^(−α(τ−t)) [Xᵀ(τ)QX(τ) + uᵀ(X(τ))Ru(X(τ))] dτ, (10)

where α > 0 is the discount factor; Q = [P, 0; 0, 0] with P > 0, and R > 0 is a symmetric positive definite matrix.
Remark 1. Note that it is necessary to employ a discount factor in the performance index function (10). This is because the trajectory of the reference system xd(t) (6) to be tracked may not go to zero, which is a common case in most practical systems; without the discount factor, the performance index function, which contains the control policy u(X), would become infinite.
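As a simple illustration of Remark 1 (our example, not from the paper): if the utility along the closed-loop trajectory approaches a constant c > 0 because xd(t) does not vanish, then

∫ₜ^∞ c dτ = ∞, whereas ∫ₜ^∞ e^(−α(τ−t)) c dτ = c/α < ∞,

so any α > 0 keeps (10) finite.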
Assumption 1. Assume that there exists at least one admissible control policy u(X) on the compact set 𝒳 such that the tracking error system (8) is asymptotically stable and the performance index function V(X) (10) is finite.
Define the optimal performance index function as

V∗(X(t)) = min_{u∈Ψ(𝒳)} ∫ₜ^∞ e^(−α(τ−t)) [Xᵀ(τ)QX(τ) + uᵀ(X(τ))Ru(X(τ))] dτ, (13)

where Ψ(𝒳) denotes the set of admissible control policies, which also satisfies the following HJB equation:

min_{u∈Ψ(𝒳)} [∇V∗ᵀ(X)(F(X) + G(X)u) − αV∗(X) + XᵀQX + uᵀRu] = 0. (14)

If the minimum on the right-hand side of (14) exists and is unique, the corresponding optimal control policy u∗(X) can be obtained as

u∗(X) = −(1/2)R⁻¹Gᵀ(X)∇V∗(X). (15)
Inserting (15) into (14), the HJB equation can be rewritten as

∇V∗ᵀ(X)F(X) + XᵀQX − αV∗(X) − (1/4)∇V∗ᵀ(X)G(X)R⁻¹Gᵀ(X)∇V∗(X) = 0. (16)

For linear systems, the HJB equation (16) reduces to an algebraic Riccati equation, which is easy to solve directly. However, for the nonlinear cases, the HJB equation becomes a nonlinear partial differential equation, for which an analytical solution is generally impossible to obtain. In Section 4, we will introduce two iterative ADP methods to overcome this difficulty.
3. Robust controller design

3.1. Robust regulation controller design

Based on the optimal control policy u∗(x) (4) for the nominal system (2), the robust controller for the uncertain original system (1) is designed by

ū(x) = ηu∗(x), (17)

where η ≥ 1 is a constant feedback gain. Define a new performance index function for the system (1) as

J(x(t)) = ∫ₜ^∞ [xᵀ(τ)Px(τ) + (η − 1)u∗ᵀ(x(τ))Ru∗(x(τ)) + ūᵀ(x(τ))R̄ū(x(τ))] dτ, (18)

where R̄ = R/η.
Theorem 1. Consider the system (1) and let η ≥ 1. One can attain:
1. If there is no control input perturbation, i.e., d̄ = 0, then the control policy ū(x) (17) achieves optimality with the performance index function (18).
2. If the constant feedback gain η is selected appropriately, then the robust control policy ū(x) (17) guarantees the system (1) to be asymptotically stable.
Proof. 1) Let J∗(x) be the optimal performance index function for the system (1) with the condition d̄ = 0. One can derive the associated optimal control policy ū∗(x) and HJB equation, respectively, as

ū∗(x) = −(1/2)R̄⁻¹gᵀ(x)∇J∗(x), (19)

and

∇J∗ᵀ(x)f(x) + xᵀPx + (η − 1)u∗ᵀRu∗ − (1/4)∇J∗ᵀ(x)g(x)R̄⁻¹gᵀ(x)∇J∗(x) = 0, (20)

where ∇J∗(x) = ∂J∗(x)/∂x.
Based on (5), replacing J∗(x) by V∗(x) and inserting (4) into (20) yields

∇V∗ᵀ(x)f(x) + xᵀPx + (1/4)(η − 1)∇V∗ᵀ(x)g(x)R⁻¹gᵀ(x)∇V∗(x) − (1/4)η∇V∗ᵀ(x)g(x)R⁻¹gᵀ(x)∇V∗(x)
= ∇V∗ᵀ(x)f(x) + xᵀPx − (1/4)∇V∗ᵀ(x)g(x)R⁻¹gᵀ(x)∇V∗(x) = 0. (21)

From (21), it can be observed that V∗(x) is a solution of the HJB equation (20), which also implies the optimal control policy (19) can be rewritten as

ū∗(x) = −(1/2)ηR⁻¹gᵀ(x)∇V∗(x) = ηu∗(x). (22)
2) Select the Lyapunov function candidate as

Θ(x) = V∗(x). (23)

Then, along the trajectories of (1), one has

Θ̇(x) = ∇V∗ᵀ(x)(f(x) + g(x)(ū(x) + d̄(x)))
= −xᵀPx + (1/4)∇V∗ᵀ(x)g(x)R⁻¹gᵀ(x)∇V∗(x) + ∇V∗ᵀ(x)g(x)ū(x) + ∇V∗ᵀ(x)g(x)d̄(x)
= −xᵀPx − ((1/2)η − (1/4))∇V∗ᵀ(x)g(x)R⁻¹gᵀ(x)∇V∗(x) + ∇V∗ᵀ(x)g(x)d̄(x)
≤ −xᵀPx − ((1/2)η − (1/4))∇V∗ᵀ(x)g(x)R⁻¹gᵀ(x)∇V∗(x) + (1/2)d̄ᵀ(x)d̄(x) + (1/2)∇V∗ᵀ(x)g(x)gᵀ(x)∇V∗(x)
≤ −(λmin(P) − (1/2)kd²)‖x‖² − (((1/2)η − (1/4))λmin(R⁻¹) − (1/2))‖gᵀ(x)∇V∗(x)‖², (24)

where λmin(P) and λmin(R⁻¹) denote the minimum eigenvalues of the matrices P and R⁻¹, respectively.
In order to obtain Θ̇(x) < 0, the matrix P and the constant feedback gain η should be chosen to satisfy the following condition:

λmin(P) > (1/2)kd² and η > 1/λmin(R⁻¹) + 1/2. (25)

Since P and R are both symmetric positive definite matrices, one can easily choose a large enough constant feedback gain η to satisfy the condition η > max{1, 1/λmin(R⁻¹) + 1/2}. If P and η are selected appropriately such that the condition (25) holds, it can be acquired that Θ̇(x) < 0, which indicates that the system (1) is asymptotically stable. This completes the proof.
Remark 3. When there is no control input perturbation, the robust control policy ū(x) (17) achieves optimality with respect to the performance index function (18). For some given P, if the control perturbation becomes much larger such that the condition (25) is not satisfied, one can also stabilize the system (1) by enhancing the feedback gain η. This will be demonstrated in the simulation results; a quick numerical check of (25) is sketched below.
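The following minimal sketch (an illustrative helper, not from the paper; it relies on the condition (25) as reconstructed above) checks whether given P, R, kd and η satisfy the sufficient condition:

```python
import numpy as np

# Check the sufficient condition (25): lambda_min(P) > k_d^2 / 2 and
# eta > 1 / lambda_min(R^{-1}) + 1/2. Illustrative helper only.

def check_condition_25(P, R, k_d, eta):
    lam_P = np.linalg.eigvalsh(P).min()                     # lambda_min(P)
    lam_Rinv = np.linalg.eigvalsh(np.linalg.inv(R)).min()   # lambda_min(R^{-1})
    return lam_P > 0.5 * k_d**2 and eta > 1.0 / lam_Rinv + 0.5

# With P = I_2, R = 1 and k_d = 1, any eta > 1.5 passes:
print(check_condition_25(np.eye(2), np.eye(1), k_d=1.0, eta=2.0))  # True
```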
3.2. Robust tracking controller design

Similar to the nominal augmented system (9), the original uncertain one can be expressed as

Ẋ(t) = F(X(t)) + G(X(t))(ū(t) + d̄(t)). (26)
Based on the optimal control policy u∗(X) (15) for the nominal augmented system, the robust controller for the uncertain original augmented system (26) is designed by

ū(X) = ηu∗(X). (27)
Define a new performance index function for the augmented system (26) as

J(X(t)) = ∫ₜ^∞ e^(−α(τ−t)) [L(X(τ)) + ūᵀ(X(τ))R̄ū(X(τ))] dτ, (28)

where L(X(τ)) = Xᵀ(τ)QX(τ) + (η − 1)u∗ᵀ(X(τ))Ru∗(X(τ)) and, as before, R̄ = R/η.
Analogously to (19) and (20), with d̄ = 0 one can derive the optimal tracking control policy ū∗(X) (29) and the associated HJB equation (30) for (28). Replacing J∗(X) by V∗(X) therein and inserting (15) yields

∇V∗ᵀ(X)F(X) + XᵀQX − αV∗(X) + (1/4)(η − 1)∇V∗ᵀ(X)G(X)R⁻¹Gᵀ(X)∇V∗(X) − (1/4)η∇V∗ᵀ(X)G(X)R⁻¹Gᵀ(X)∇V∗(X)
= ∇V∗ᵀ(X)F(X) + XᵀQX − αV∗(X) − (1/4)∇V∗ᵀ(X)G(X)R⁻¹Gᵀ(X)∇V∗(X) = 0. (31)
From (31), it can be deduced that V∗(X) is a solution of the HJB equation (30), which also indicates that the optimal tracking control policy (29) can be rewritten as

ū∗(X) = −(1/2)ηR⁻¹Gᵀ(X)∇V∗(X) = ηu∗(X). (32)

Furthermore, following the proof of Theorem 1, if the constant feedback gain η is selected large enough, then the robust control policy ū(X) (27) makes the tracking error dynamics asymptotically stable in the limit as the discount factor goes to zero.
Remark 4. If the discount factor goes to zero, in light of the results of Theorem 1 and previous relevant works [32, 33, 45, 46], it can be easily acquired that the tracking error is asymptotically stable. Nevertheless, if the discount factor is nonzero, there will be difficulty in the stability analysis [47, 48]. According to the results of [32, 33, 45, 46], one can make the tracking error as small as desired by selecting a small enough discount factor.
4. Data-driven reinforcement learning method

In this section, a model-based iterative learning algorithm is first introduced, and a data-driven model-free RL method is then derived from it. The model-based algorithm proceeds as follows.

Step 1. Let i = 0 and select an initial admissible control policy u^(0).

Step 2. With u^(i), solve the following equation for V^(i+1):

[∇V^(i+1)(X)]ᵀ(F(X) + G(X)u^(i)) − αV^(i+1)(X) + XᵀQX + u^(i)ᵀRu^(i) = 0. (33)

Step 3. Update the control policy by

u^(i+1) = −(1/2)R⁻¹Gᵀ(X)∇V^(i+1)(X). (34)

If ‖V^(i+1) − V^(i)‖ ≤ ε, where ε is a small enough positive constant, then stop at Step 3; else, let i = i + 1 and go back to Step 2. A scalar illustration of this iteration is sketched below.
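As a sanity check of Steps 1-3 (a scalar sketch under assumed dynamics ẋ = ax + bu, V(x) = px², u = kx, with all numerical values chosen for illustration only), (33) reduces to a linear equation in p and (34) to a gain update:

```python
# Scalar sketch of the model-based iteration (33)-(34); all values assumed.
a, b, q, r, alpha = -1.0, 1.0, 1.0, 1.0, 0.01
k = -1.0                                    # Step 1: initial admissible gain
for i in range(20):
    # Step 2 / Eq. (33): 2p(a + bk) - alpha*p + q + r*k^2 = 0, solved for p
    p = (q + r * k**2) / (alpha - 2.0 * (a + b * k))
    # Step 3 / Eq. (34): u = -(1/2)(1/r)*b*dV/dx = -(b*p/r)*x
    k = -b * p / r
print(p, k)   # approaches the solution of the discounted scalar HJB equation
```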
Following the idea of previous works [49, 50], the convergence proof of the proposed model-based iterative learning algorithm is provided as follows. Let us consider a Banach space Λ ⊂ {V(x) | V(x): 𝒳 → R, V(0) = 0} with the norm ‖·‖𝒳 and the mapping H: Λ → Λ given by (16). Define another mapping Γ: Λ → Λ as
ΓV = V − (H′(V))⁻¹H(V). (35)

Suppose that there exists a continuous linear mapping L such that

H(V + sM) − H(V) = sL(M) + o(s) (36)

for all s in the neighborhood of zero with lim_{s→0}(o(s)/s) = 0 and all M with ‖M‖𝒳 = 1; then the mapping H is Gâteaux differentiable at the point V, and L denotes the Gâteaux derivative of H at V, which can be computed by

L(M) = lim_{s→0} [H(V + sM) − H(V)]/s. (37)
It should be pointed out that (37) provides an easier way to compute the Gâteaux derivative rather than the Fréchet derivative. The relationship between these two derivatives will be shown in the following lemma.

Lemma 1 ([50, 51]). If H′ is continuous at V and exists as the Gâteaux derivative in the neighborhood of V, then L = H′(V) is also a Fréchet derivative at V.
Lemma 2. Let H be a mapping expressed as (16). For any V ∈ Λ, the Fréchet differential of H at V can be given by

H′(V)M = L(M) = ∇MᵀF − αM − (1/2)∇VᵀGR⁻¹Gᵀ∇M. (38)
Proof. According to (16), one has

H(V) = ∇VᵀF + XᵀQX − αV − (1/4)∇VᵀGR⁻¹Gᵀ∇V,
H(V + sM) − H(V) = s∇MᵀF − αsM − (1/4)s²∇MᵀGR⁻¹Gᵀ∇M − (1/2)s∇VᵀGR⁻¹Gᵀ∇M. (39)
By means of (37) and (39), the Gâteaux derivative at V is obtained by

L(M) = lim_{s→0} [H(V + sM) − H(V)]/s = ∇MᵀF − αM − (1/2)∇VᵀGR⁻¹Gᵀ∇M. (40)

Based on Lemma 1, one gets H′(V)M = L(M). This completes the proof.
Next, construct a Newton iterative sequence as below:

V^(i+1) = ΓV^(i) = V^(i) − (H′(V^(i)))⁻¹H(V^(i)), i = 0, 1, 2, ⋯. (41)
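Continuing the scalar sketch above (assumed values, not from the paper), one can check numerically that the iteration (33)-(34) and the Newton iteration (41) produce identical iterates, which is exactly the statement proved next:

```python
# Scalar check that policy iteration (33)-(34) equals Newton iteration (41)
# applied to H(p) = 2pa - alpha*p + q - p^2 b^2 / r (scalar form of (16)).
a, b, q, r, alpha = -1.0, 1.0, 1.0, 1.0, 0.01
H  = lambda p: 2*p*a - alpha*p + q - p**2 * b**2 / r
dH = lambda p: 2*a - alpha - 2*p * b**2 / r          # derivative of H

p_pi = p_nt = (q + r) / (alpha - 2*(a - b))          # common start (k0 = -1)
for i in range(10):
    k = -b * p_pi / r                                # policy from current p
    p_pi = (q + r * k**2) / (alpha - 2*(a + b*k))    # PI step, Eq. (33)
    p_nt = p_nt - H(p_nt) / dH(p_nt)                 # Newton step, Eq. (41)
    assert abs(p_pi - p_nt) < 1e-12                  # iterates coincide
```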
Theorem 2. The sequence {V^(i)} produced by the model-based iterative learning algorithm is equivalent to that of the Newton iteration (41).
Proof. Based on Lemma 2, one attains

H′(V^(i))V^(i+1) = ∇V^(i+1)ᵀF − αV^(i+1) − (1/2)∇V^(i)ᵀGR⁻¹Gᵀ∇V^(i+1), (42)
H′(V^(i))V^(i) = ∇V^(i)ᵀF − αV^(i) − (1/2)∇V^(i)ᵀGR⁻¹Gᵀ∇V^(i). (43)

Substituting (42), (43) and (16) into the Newton iteration (41), and noting that u^(i) = −(1/2)R⁻¹Gᵀ∇V^(i) by (34), one recovers exactly the iterative equation (33). This completes the proof.

To remove the requirement of the system models, rewrite the nominal augmented system (9) driven by an arbitrary control input u ∈ 𝒰 as

Ẋ = F(X) + G(X)u = F(X) + G(X)u^(i) + G(X)(u − u^(i)). (46)
Taking the time derivative of V^(i+1)(X) along (46) yields

dV^(i+1)(X(t))/dt = [∇V^(i+1)(X)]ᵀ(F(X) + G(X)u^(i)) + [∇V^(i+1)(X)]ᵀG(X)(u − u^(i)). (47)

Based on the updating laws (33) and (34), (47) becomes

dV^(i+1)(X(t))/dt = αV^(i+1)(X) − XᵀQX − u^(i)ᵀRu^(i) − 2[u^(i+1)]ᵀR(u − u^(i)). (48)
Integrating both sides of (48) on the interval [t, t + Δt], one attains

V^(i+1)(X(t + Δt)) − V^(i+1)(X(t)) = ∫ₜ^(t+Δt) αV^(i+1)(X(τ)) dτ − ∫ₜ^(t+Δt) (Xᵀ(τ)QX(τ) + u^(i)ᵀ(τ)Ru^(i)(τ)) dτ − 2∫ₜ^(t+Δt) [u^(i+1)(τ)]ᵀR(u(τ) − u^(i)(τ)) dτ, (49)
where V^(i+1)(X) and u^(i+1)(X) are the unknown functions to be determined. Different from (33) and (34), the equation (49) only requires the arbitrary system data (X, u) ∈ D instead of the system models, i.e., F(X) and G(X). Following [49, 50], the iterates V^(i) and u^(i) obtained from (49) still converge to V∗ and u∗, respectively, as i → ∞.
Two NNs are introduced to approximate the iterative performance index function and control policy, respectively, as

V̂^(i)(X(t)) = φVᵀ(X(t))WV^(i), (50)
û^(i)(X(t)) = φuᵀ(X(t))Wu^(i), (51)

where WV^(i) and Wu^(i) are the weight vectors of the critic NN and actor NN, respectively, and φV and φu are their corresponding NN activation functions. For the following derivation, we take the single control input case into consideration, that is, m = 1, which is convenient for the mathematical expression.
US
Since there will be NN approximation errors brought by the NN imple-
mentation, replacing V (i+1) , u(i) and u(i+1) in (49) by V̂ (i+1) , û(i) and û(i+1)
yields the following residual error Ξ(i) :
AN
∆ (i+1) (i+1)
=V̂ (X(t)) − V̂ (X(t + ∆t)) + αV̂ (i+1) (X(τ ))dτ
t
Z t+∆t
T
− (X T (τ )QX(τ ) + [û(i) (X(τ ))] Rû(i) (X(τ )))dτ
ED
t
Z t+∆t
T
+2 [û(i+1) (X(τ ))] R(û(i) (X(τ )) − u(τ ))dτ
t
Z t+∆t
PT
T (i+1) (i+1)
=(φV (X(t)) − φV (X(t + ∆t))) WV + αφTV (X(τ ))WV dτ
t
Z t+∆t
T
(X T (τ )QX(τ ) + [û(i) (X(τ ))] Rû(i) (X(τ )))dτ
CE
−
t
Z t+∆t
T
+2 (û(i) (X(τ )) − u(τ )) R(φTu (X(τ ))Wu(i+1) )dτ . (52)
t
Define

η1 ≜ (φV(X(t)) − φV(X(t + Δt)))ᵀ,
η2 ≜ ∫ₜ^(t+Δt) αφVᵀ(X(τ)) dτ,
η3 ≜ 2∫ₜ^(t+Δt) (û^(i)(X(τ)) − u(τ))ᵀRφuᵀ(X(τ)) dτ, (53)
Υ^(i) ≜ ∫ₜ^(t+Δt) (Xᵀ(τ)QX(τ) + [û^(i)(X(τ))]ᵀRû^(i)(X(τ))) dτ.

It is worth pointing out that if Δt is selected as a small enough time period, one can utilize the trapezoidal rule to calculate the definite integrals η2, η3 and Υ^(i) as

∫ₜ^(t+Δt) p(τ) dτ ≈ (Δt/2)[p(t) + p(t + Δt)], (54)

with the approximation becoming exact as Δt → 0; a data-assembly sketch based on (53) and (54) is given below.
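A minimal sketch of this data assembly (illustrative helper names, single-input case m = 1 as above; not code from the paper):

```python
import numpy as np

# Build (Phi, Upsilon) for one transition using the trapezoidal rule (54).
# phi_V, phi_u: activation functions; u_i: current policy; all illustrative.

def data_row(X_t, X_next, u, phi_V, phi_u, u_i, Q, R, alpha, dt):
    trap = lambda p0, p1: 0.5 * dt * (p0 + p1)          # Eq. (54)
    eta1 = phi_V(X_t) - phi_V(X_next)
    eta2 = alpha * trap(phi_V(X_t), phi_V(X_next))
    d0, d1 = u_i(X_t) - u, u_i(X_next) - u              # u is held on [t, t+dt]
    eta3 = 2.0 * trap(d0 * R * phi_u(X_t), d1 * R * phi_u(X_next))
    utility = lambda X: X @ Q @ X + u_i(X) * R * u_i(X)
    upsilon = trap(utility(X_t), utility(X_next))       # Upsilon^(i)
    return np.concatenate([eta1 + eta2, eta3]), upsilon  # one row of (56)
```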
Therefore, stacking the NN weights as WVu^(i+1) ≜ [(WV^(i+1))ᵀ, (Wu^(i+1))ᵀ]ᵀ and defining the regression row Φ^(i) ≜ [η1 + η2, η3], (55)

(52) can be rewritten as

Ξ^(i)(X(t), X(t + Δt), u(t)) = Φ^(i)WVu^(i+1) − Υ^(i). (56)
Choosing different control inputs uk(t) with the small enough time period Δt, one can obtain the system data sampling sets (Xk(t), Xk(t + Δt), uk(t)), where k = 1, 2, ⋯, M. Subsequently, the database can be constructed as
ξ^(i) ≜ [Φ1^(i); Φ2^(i); ⋯; ΦM^(i)] and θ^(i) ≜ [Υ1^(i); Υ2^(i); ⋯; ΥM^(i)]ᵀ. Minimizing the sum of the squared residual errors over the database yields the least-square updating law

WVu^(i+1) = (ξ^(i)ᵀξ^(i))⁻¹ξ^(i)ᵀθ^(i). (57)

The data-driven model-free RL algorithm is summarized as follows.

Step 1. Collect the system data sampling sets (Xk(t), Xk(t + Δt), uk(t)) and let i = 0. Choose an initial NN weight WVu^(0) such that V̂^(0) ∈ V0.

Step 2. Use the collected sampling data sets to compute ξ^(i) and θ^(i).

Step 3. Tune the NN weights WVu^(i+1) through the updating law (57). If ‖WVu^(i+1) − WVu^(i)‖ ≤ ε, where ε is a small enough positive constant, then stop at Step 3; else, let i = i + 1 and go back to Step 2.
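For the least-square solve in Step 3, a minimal sketch (building on the data_row helper above; illustrative, not the paper's implementation):

```python
import numpy as np

def ls_update(rows):
    """rows: list of (Phi_k, Upsilon_k) pairs, k = 1, ..., M."""
    xi = np.vstack([Phi for Phi, _ in rows])         # xi^(i), M x (N_V + N_u)
    theta = np.array([ups for _, ups in rows])       # theta^(i), length M
    # Eq. (57): W = (xi^T xi)^{-1} xi^T theta, via a numerically stable solver
    W, *_ = np.linalg.lstsq(xi, theta, rcond=None)
    return W                                         # stacked [W_V; W_u]
```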
Note that if the tracking objective is set to zero and the performance index function (3) is used, this data-driven method can also be applied to solve the optimal regulation control issue, which implies that optimal regulation control is a special case of optimal tracking control.
5. Simulation Results
In this section, two simulation examples, for robust regulation control and robust tracking control respectively, are provided to demonstrate the effectiveness of our proposed scheme.
Example 1. Consider the following uncertain nonlinear system:

ẋ = [ −x1 + x2 ; −0.5(x1 + x2) + 0.5x1²x2 ] + [ 0 ; x1 ](ū + d̄), (58)

whose nominal system is

ẋ = [ −x1 + x2 ; −0.5(x1 + x2) + 0.5x1²x2 ] + [ 0 ; x1 ]u, (59)

with the associated performance index function

V(x(t)) = ∫ₜ^∞ [xᵀ(τ)Px(τ) + uᵀ(x(τ))Ru(x(τ))] dτ. (60)
The activation functions for the critic NN and the actor NN are selected, respectively, as

φV(x) = [x1², x1x2, x2²]ᵀ, (61)

and

φu(x) = [x1, x2, x1², x1x2, x2²]ᵀ. (62)
It should be pointed out that this simulation example is constructed by the converse HJB method [54]. The optimal solutions can be given by V∗ = 0.5x1² + x2² and u∗ = −x1x2. From the simulation results in Fig. 1 and Fig. 2, it can be seen that the NN weights converge to their optimal values after iteration. In Fig. 3, the proposed robust controller overcomes the random control perturbation and the state trajectories finally become stable. When the control perturbation becomes much larger, such that the condition (25) does not hold and the system (58) may be unstable, one can enhance the feedback gain η to handle this problem, as we have mentioned in Remark 3. From Fig. 4, when we still use η = 4 for the much larger control perturbation, the system goes to unstable states; when we enhance the feedback gain η, the states are stabilized again. The reported optimal pair can also be spot-checked numerically, as sketched below.
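A numerical spot-check of (5) for this example (our sketch, assuming P = I₂ and R = 1, weight choices consistent with the stated optimum; the paper's exact weights are not shown here):

```python
import numpy as np

# Spot-check of the HJB equation (5) for Example 1 with assumed P = I, R = 1.
f = lambda x: np.array([-x[0] + x[1], -0.5*(x[0] + x[1]) + 0.5*x[0]**2*x[1]])
g = lambda x: np.array([0.0, x[0]])
dV = lambda x: np.array([x[0], 2.0*x[1]])   # gradient of V* = 0.5*x1^2 + x2^2

rng = np.random.default_rng(0)
for x in rng.uniform(-2.0, 2.0, size=(5, 2)):
    gv = g(x) @ dV(x)                       # g^T(x) grad V*(x), scalar here
    residual = dV(x) @ f(x) + x @ x - 0.25 * gv**2
    assert abs(residual) < 1e-10            # (5) holds; u* = -0.5*gv = -x1*x2
```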
Fig. 1. Convergence of the critic NN weights.
Fig. 2. Convergence of the actor NN weights.
Fig. 3. Evolution of the state trajectories x1 and x2.
Fig. 4. Evolution of the state trajectories x1 and x2 with different η.
Example 2. Consider the following uncertain nonlinear system:

ẋ = [ x2 ; −0.5(x1 + x2) + 0.5x1²x2 ] + [ 0 ; 1 ](ū + d̄), (63)

where we set ū = ηu∗ with the constant feedback gain η = 2, and select the control input perturbation as d̄ = δ1 sin(δ2x1 + δ3x2) with the random parameters δ1, δ2 and δ3. Its nominal system is

ẋ = [ x2 ; −0.5(x1 + x2) + 0.5x1²x2 ] + [ 0 ; 1 ]u. (64)

The tracking objective xd is given by the command generator

ẋd = [ xd2 ; −xd1 ]. (65)
According to (9), with X = [edᵀ, xdᵀ]ᵀ = [X1, X2, X3, X4]ᵀ, the nominal augmented system can be written as

Ẋ = F(X) + G(X)u, F(X) = [ X2 ; −0.5(X1 + X2 + X3 + X4) + 0.5(X1 + X3)²(X2 + X4) + X3 ; X4 ; −X3 ], G(X) = [ 0 ; 1 ; 0 ; 0 ], (66)

with the following discounted performance index function:

V(X(t)) = ∫ₜ^∞ e^(−α(τ−t)) [Xᵀ(τ)QX(τ) + uᵀ(τ)Ru(τ)] dτ, (67)

where the discount factor α = 0.01; Q = [P, 0; 0, 0] with P = 10I₂ₓ₂, and R = 1.
With the activation functions for the critic and actor NNs selected as polynomials of the augmented state, we set Δt = 0.1 s, and then, with different control inputs uk, real system data sampling sets (Xk(t), Xk(t + Δt), uk) can be collected, for instance as sketched below. After that, the two NNs are updated via the least-square method (57). From Fig. 5 and Fig. 6, it can be seen that the NN weights finally achieve convergence.
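A sketch of such a data-collection phase (Euler integration, the initial conditions and the exploratory input schedule are our assumptions, not the paper's settings):

```python
import numpy as np

# Collect (X_k(t), X_k(t+dt), u_k) from the nominal system (64) and the
# command generator (65); integration scheme and constants are illustrative.
f = lambda x: np.array([x[1], -0.5*(x[0] + x[1]) + 0.5*x[0]**2*x[1]])
g = np.array([0.0, 1.0])
h = lambda xd: np.array([xd[1], -xd[0]])

dt, rng = 0.1, np.random.default_rng(1)
x, xd, samples = np.array([1.0, -1.0]), np.array([0.5, 0.0]), []
for k in range(200):
    u = rng.uniform(-1.0, 1.0)                   # exploratory control input
    X_t = np.concatenate([x - xd, xd])           # augmented state [e_d; x_d]
    x = x + dt * (f(x) + g * u)                  # Euler step of (64)
    xd = xd + dt * h(xd)                         # Euler step of (65)
    samples.append((X_t, np.concatenate([x - xd, xd]), u))
```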
Fig. 5. Convergence of the critic NN weights.
Fig. 6. Convergence of the actor NN weights.
Fig. 7. Evolution of the state trajectories x1 and xd1 .
Fig. 8. Evolution of the state trajectories x2 and xd2.
Subsequently, use the obtained results along with the constant feedback gain to control the uncertain original system (63) with the random control perturbation. From Fig. 7 and Fig. 8, it can be seen that the state trajectories finally track the desired ones.
6. Conclusion
In this paper, the robust control issues for a class of uncertain systems have been investigated. A novel reinforcement learning scheme has been employed to obtain the optimal control policy of the nominal version without the requirement of the knowledge of system models. Adding a constant feedback gain to the optimal control policy yields the robust controller, which has been proved to achieve optimality under a newly defined performance function when there is no control input perturbation. To implement the model-free algorithm, two neural networks updated by the least-square method have been utilized to learn the solution of the HJB equation iteration by iteration. Simulation results have demonstrated the feasibility and effectiveness of our proposed scheme. It is expected that, with the powerful ability of ADP in solving optimal control problems, our research results will be extended to other decision support systems.
Acknowledgment
This work was supported by the National Natural Science Foundation of China.
References
[1] F. Lin, R. D. Brandt, J. Sun, Robust control of nonlinear systems: compensating for uncertainty, International Journal of Control 56 (6) (1992) 1453–1459.
[2] R. Song, W. Xiao, H. Zhang, C. Sun, Adaptive dynamic programming for a class of complex-valued nonlinear systems, IEEE Transactions on Neural Networks and Learning Systems.
[5] J. J. Murray, C. J. Cox, G. G. Lendaris, R. Saeks, Adaptive dynamic programming, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 32 (2) (2002) 140–153.
[6] D. Liu, Q. Wei, Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems, IEEE Transactions on Neural Networks and Learning Systems 25 (3) (2014) 621–634.
[14] R. Song, H. Zhang, Y. Luo, Q. Wei, Optimal control laws for time-delay systems with saturating actuators based on heuristic dynamic programming, Acta Automatica Sinica 36 (1) (2010) 121–129.
[15] Q. Wei, R. Song, P. Yan, Data-driven zero-sum neuro-optimal control for a class of continuous-time unknown nonlinear systems with disturbance using ADP, IEEE Transactions on Neural Networks and Learning Systems 27 (2) (2016) 444–458.
[16] K. G. Vamvoudakis, F. L. Lewis, Online solution of nonlinear two-player zero-sum games using synchronous policy iteration, International Journal of Robust and Nonlinear Control 22 (13) (2012) 1460–1483.
[24] X. Zhong, H. He, H. Zhang, Z. Wang, A neural network based online learning and control approach for Markov jump systems, Neurocomputing 149 (2015) 116–123.
[25] C. Yang, K. Huang, H. Cheng, Y. Li, C.-Y. Su, Haptic identification by ELM-controlled uncertain manipulator, IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99) (2017) 1–12.
[26] C. Yang, X. Wang, L. Cheng, H. Ma, Neural-learning-based telerobot control with guaranteed performance, IEEE Transactions on Cybernetics PP (99) (2016) 1–12.
[27] H. Zhang, T. Feng, G. H. Yang, H. Liang, Distributed cooperative optimal control for multiagent systems on directed graphs: An inverse optimal approach, IEEE Transactions on Cybernetics 45 (7) (2015) 1315–1326.
[33] B. Kiumarsi, F. L. Lewis, Actor-critic-based optimal tracking for partially unknown nonlinear discrete-time systems, IEEE Transactions on Neural Networks and Learning Systems 26 (1) (2015) 140–151.
[34] H. Zhang, Q. Wei, Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 38 (4) (2008) 937–942.
[35] D. Wang, D. Liu, Q. Wei, Finite-horizon neuro-optimal tracking control
for a class of discrete-time nonlinear systems using adaptive dynamic
programming approach, Neurocomputing 78 (1) (2012) 14–22.
[36] H.-M. Yen, T.-H. S. Li, Y.-C. Chang, Design of a robust neural network-
based tracking controller for a class of electrically driven nonholonomic
mechanical systems, Information Sciences 222 (2013) 559–575.
[42] R. Song, F. L. Lewis, Q. Wei, H. Zhang, Off-policy actor-critic structure for optimal control of unknown systems with disturbances, IEEE Transactions on Cybernetics 46 (5) (2016) 1041–1050.
[43] R. Song, F. Lewis, Q. Wei, H.-G. Zhang, Z.-P. Jiang, D. Levine, Multiple actor-critic structures for continuous-time optimal control using input-output data, IEEE Transactions on Neural Networks and Learning Systems 26 (4) (2015) 851–865.
[44] B. Luo, H.-N. Wu, T. Huang, D. Liu, Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design, Automatica 50 (12) (2014) 3281–3290.
[45] H. Modares, F. L. Lewis, Z.-P. Jiang, H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems 26 (10) (2015) 2550–2562.
[46] G. Xiao, H. Zhang, Y. Luo, H. Jiang, Data-driven optimal tracking control for a class of affine nonlinear continuous-time systems with completely unknown dynamics, IET Control Theory & Applications 10 (6) (2016) 700–710.
[50] H.-N. Wu, B. Luo, Neural network based online simultaneous policy
update algorithm for solving the HJI equation in nonlinear H∞ control,
IEEE Transactions on Neural Networks and Learning Systems 23 (12)
(2012) 1884–1895.
[51] E. Zeidler, Nonlinear Functional Analysis and Its Applications: III: Variational Methods and Optimization, Springer Science & Business Media, 2013.
[52] C. Yang, Y. Jiang, Z. Li, W. He, C.-Y. Su, Neural control of biman-
ual robots with guaranteed global stability and motion precision, IEEE
Transactions on Industrial Informatics PP (99) (2016) 1–9.
[53] C. Yang, X. Wang, Z. Li, Y. Li, C.-Y. Su, Teleoperation control based on
combination of wave variable and neural networks, IEEE Transactions
on Systems, Man, and Cybernetics: Systems PP (99) (2016) 1–12.
[54] V. Nevistic, J. A. Primbs, Optimality of nonlinear design techniques: a
converse HJB approach, Technical Report TR96-022, California Insti-
tute of Technology (1996).
He Jiang received the B.S. degree in 2014 from Northeastern University, Shenyang, China, where he is currently pursuing the Ph.D. degree in control theory and control engineering. His current research interests include adaptive dynamic programming, fuzzy control, multi-agent system control and their industrial applications.
Huaguang Zhang received the B.S. degree and the M.S. degree in control engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively. He received the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, Northeastern University. His main research interests include fuzzy control, neural-networks-based control, nonlinear control, and their applications. He has authored and coauthored over 280 journal and conference papers, six monographs and co-invented 90 patents. Dr. Zhang is a fellow of IEEE, the E-letter Chair of the IEEE CIS Society, and the former Chair of the Adaptive Dynamic Programming & Reinforcement Learning Technical Committee of the IEEE Computational Intelligence Society.
Yang Cui received the B.S. degree in information and computing science and the M.S. degree in applied mathematics from Liaoning University of Technology, Jinzhou, China. She is currently working toward the Ph.D. degree in control theory and control engineering at Northeastern University. Her research interests include dynamic surface control, neural networks and network control.
Geyang Xiao has been pursuing the Ph.D. degree with Northeastern University, Shenyang, China, since 2012. His current research interests include neural-network-based control, nonlinear control, adaptive dynamic programming and their industrial applications.