Adaptive Dynamic Programming: Single and Multiple Controllers
Ruizhuo Song
Qinglai Wei
Qing Li
Studies in Systems, Decision and Control
Volume 166
Series editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland
e-mail: [email protected]
The series “Studies in Systems, Decision and Control” (SSDC) covers both new
developments and advances, as well as the state of the art, in the various areas of
broadly perceived systems, decision making and control–quickly, up to date and
with a high quality. The intent is to cover the theory, applications, and perspectives
on the state of the art and future developments relevant to systems, decision
making, control, complex processes and related areas, as embedded in the fields of
engineering, computer science, physics, economics, social and life sciences, as well
as the paradigms and methodologies behind them. The series contains monographs,
textbooks, lecture notes and edited volumes in systems, decision making and
control spanning the areas of Cyber-Physical Systems, Autonomous Systems,
Sensor Networks, Control Systems, Energy Systems, Automotive Systems,
Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace
Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power
Systems, Robotics, Social Systems, Economic Systems and others. Of particular
value to both the contributors and the readership are the short publication timeframe
and the world-wide distribution and exposure which enable both a wide and rapid
dissemination of research output.
Ruizhuo Song
University of Science and Technology Beijing
Beijing, China

Qinglai Wei
Institute of Automation, Chinese Academy of Sciences
Beijing, China

Qing Li
University of Science and Technology Beijing
Beijing, China
The print edition is not for sale in China Mainland. Customers from China Mainland please order the
print book from: Science Press, Beijing.
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019
This work is subject to copyright. All rights are reserved by the Publishers, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publishers, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publishers nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Contents

1 Introduction
1.1 Optimal Control
1.1.1 Continuous-Time LQR
1.1.2 Discrete-Time LQR
1.2 Adaptive Dynamic Programming
1.3 Review of Matrix Algebra
References
2 Neural-Network-Based Approach for Finite-Time Optimal Control
2.1 Introduction
2.2 Problem Formulation and Motivation
2.3 The Data-Based Identifier
2.4 Derivation of the Iterative ADP Algorithm with Convergence Analysis
2.5 Neural Network Implementation of the Iterative Control Algorithm
2.6 Simulation Study
2.7 Conclusions
References
3 Nearly Finite-Horizon Optimal Control for Nonaffine Time-Delay Nonlinear Systems
3.1 Introduction
3.2 Problem Statement
3.3 The Iteration ADP Algorithm and Its Convergence
3.3.1 The Novel ADP Iteration Algorithm
3.3.2 Convergence Analysis of the Improved Iteration Algorithm
3.3.3 Neural Network Implementation of the Iteration ADP Algorithm
About the Authors

Qinglai Wei Ph.D., Professor, The State Key Laboratory of Management and
Control for Complex Systems, Institute of Automation, Chinese Academy of
Sciences.
He received the B.S. degree in Automation and the Ph.D. degree in control
theory and control engineering from Northeastern University, Shenyang, China,
in 2002 and 2009, respectively. From 2009 to 2011, he was a postdoctoral
fellow with The State Key Laboratory of Management and Control for Complex
Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
He is currently a professor at the institute. He has authored three books and
published over 80 international journal papers. His research interests include
adaptive dynamic programming, neural-network-based control, computational
intelligence, optimal control, nonlinear systems and their industrial applications.
Dr. Wei has been an Associate Editor of IEEE Transactions on Automation Science and
Engineering since 2017, IEEE Transactions on Consumer Electronics since 2017,
Control Engineering (in Chinese) since 2017, and IEEE Transactions on Cognitive and
Developmental Systems, among others.
Qing Li Ph.D., received his B.E. from North China University of Science and
Technology, Tangshan, China, in 1993, and his Ph.D. in control theory and its
applications from the University of Science and Technology Beijing, Beijing, China, in
2000. He is currently a Professor at the School of Automation and Electrical
Engineering, University of Science and Technology Beijing, Beijing, China. He was
a Visiting Scholar at Ryerson University, Toronto, Canada, from February
2006 to February 2007. His research interests include intelligent control and
intelligent optimization.
Symbols
x State vector
u Control vector
F System function
i Index
ℝn State space
X State set
J, V Performance index functions
U Utility function
J* Optimal performance index function
u* Law of optimal control
N Terminal time
W Weight matrix between the hidden layer and output layer
Q, R Positive definite matrices
a, b Learning rate
H Hamiltonian function
ec Estimation error
Ec Squared residual error
Chapter 1
Introduction
Optimal control is one particular branch of modern control. It deals with the problem
of finding a control law for a given system such that a certain optimality criterion is
achieved. A control problem includes a cost functional that is a function of the state and
control variables. An optimal control is a set of differential equations describing the
paths of the control variables that minimize the cost functional. The optimal control
can be derived using Pontryagin’s maximum principle (a necessary condition also
known as Pontryagin’s minimum principle or simply Pontryagin’s Principle), or by
solving the Hamilton–Jacobi–Bellman (HJB) equation (a sufficient condition). For
linear systems with quadratic performance function, the HJB equation reduces to the
algebraic Riccati equation (ARE) [1].
Dynamic programming is based on Bellman’s principle of optimality: an optimal
(control) policy has the property that no matter what the previous decisions were,
the remaining decisions must constitute an optimal policy with regard to the state
resulting from those previous decisions. Dynamic programming is a very useful tool
in solving optimization and optimal control problems.
In what follows, we introduce the optimal control problem for continuous-time and
discrete-time linear systems.
1.1 Optimal Control

1.1.1 Continuous-Time LQR

A special case of the general optimal control problem is the linear quadratic
regulator (LQR) optimal control problem [2]. The LQR considers the linear
time-invariant dynamical system described by
$$\dot{x}(t) = Ax(t) + Bu(t), \qquad (1.1)$$
with state $x(t) \in \mathbb{R}^n$ and control input $u(t) \in \mathbb{R}^m$. To this system is associated the
infinite-horizon quadratic cost function, or performance index,
$$V(x(t_0), t_0) = \frac{1}{2}\int_{t_0}^{\infty} \big( x^T(\tau)Qx(\tau) + u^T(\tau)Ru(\tau) \big)\, d\tau. \qquad (1.2)$$
The optimal feedback gain is determined by the matrix $P$ that solves the algebraic Riccati equation
$$A^T P + PA + Q - PBR^{-1}B^T P = 0. \qquad (1.5)$$
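As a quick numerical illustration (a minimal sketch, not from the book; the system matrices below are arbitrary placeholders), the ARE (1.5) can be solved directly with SciPy and the optimal gain recovered as $K = R^{-1}B^TP$:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Placeholder continuous-time LQR data (a double integrator), not from the book.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)          # state weighting matrix (positive definite)
R = np.array([[1.0]])  # control weighting matrix (positive definite)

# Solve A^T P + P A + Q - P B R^{-1} B^T P = 0 for the kernel matrix P.
P = solve_continuous_are(A, B, Q, R)

# Optimal state feedback u = -K x with K = R^{-1} B^T P.
K = np.linalg.solve(R, B.T @ P)
print("P =\n", P)
print("K =", K)
```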
1.1.2 Discrete-Time LQR

Consider the discrete-time LQR problem, where the dynamical system is described by
$$x(k+1) = Ax(k) + Bu(k), \qquad (1.6)$$
with $k$ the discrete time index. The associated infinite-horizon performance index
has deterministic stage costs and is
$$V(k) = \frac{1}{2}\sum_{i=k}^{\infty} \big( x^T(i)Qx(i) + u^T(i)Ru(i) \big). \qquad (1.7)$$
The value function for a fixed policy depends only on the initial state $x(k)$. A difference
equation equivalent to this infinite sum is given by
$$V(x(k)) = \frac{1}{2}\big( x^T(k)Qx(k) + u^T(k)Ru(k) \big) + V(x(k+1)). \qquad (1.8)$$
Assuming the value is quadratic in the state, so that
$$V(x(k)) = \frac{1}{2}\, x^T(k)Px(k) \qquad (1.9)$$
for some kernel matrix $P$, yields the Bellman equation form
Assuming a constant state feedback policy u(k) = −Kx(k) for some stabilizing
gain K, we write
minimum fuel, minimum energy, minimum risk, or maximum reward. Based on the
assessment of the performance, one of several schemes can then be used to modify
or improve the control policy in the sense that the new policy yields a value that is
improved relative to the previous value. In this scheme, reinforcement learning is a
means of learning optimal behaviors by observing the real-time responses from the
environment to nonoptimal control policies.
Werbos developed actor-critic techniques for feedback control of discrete-time
dynamical systems that learn optimal policies online in real time using data measured
along the system trajectories [4–7]. These methods, known as approximate dynamic
programming or adaptive dynamic programming (ADP), comprise a family of basic
learning methods: heuristic dynamic programming (HDP), action-dependent HDP
(ADHDP), dual heuristic dynamic programming (DHP), ADDHP, globalized DHP
(GDHP), and ADGDHP. These methods build a critic to approximate the cost function
and an actor to approximate the optimal control in dynamic programming, using a
function approximation structure such as neural networks (NNs) [8].
Based on the Bellman equation, which allows optimal decision problems to be solved
forward in time using data measured along the system trajectories, the ADP algorithm,
as an effective intelligent control method, has played an important role in seeking
solutions of the optimal control problem [9]. ADP has two main iteration forms,
namely policy iteration (PI) and value iteration (VI) [10]. PI algorithms consist of
policy evaluation and policy improvement steps. An initial stabilizing control law is
required, which is often difficult to obtain. Compared with VI algorithms, in most
applications PI requires fewer iterations, since it behaves like a Newton's method, but
every iteration is more computationally demanding. VI algorithms solve the optimal
control problem without requiring an initial stabilizing control law, which makes them
easy to implement. For system (1.6) and the cost function (1.7), the detailed procedures
of PI and VI are given as follows; a minimal numerical sketch of both follows the two lists.
1. PI Algorithm
(1) Initialize. Select any admissible (i.e., stabilizing) control policy $h^{[0]}(x(k))$.
(2) Policy Evaluation Step. Determine the value of the current policy using the
Bellman equation (1.8).
2. VI Algorithm
(1) Initialize. Select any control policy $h^{[0]}(x(k))$, not necessarily admissible or
stabilizing.
(2) Value Update Step. Update the value using one step of the Bellman recursion (1.8).
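The following sketch is illustrative only (the system matrices, the initial stabilizing gain and the iteration counts are assumed placeholders, and the 1/2 factor of (1.7) is dropped since it cancels from the Bellman equation); it implements the PI and VI procedures above for the discrete-time LQR:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Placeholder discrete-time LQR data for (1.6)-(1.7), not taken from the book.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

def greedy_gain(P):
    """Policy improvement: gain of u = -Kx minimizing the Bellman right-hand side."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def policy_iteration(K0, iters=20):
    """PI: needs an initial stabilizing gain K0."""
    K = K0
    for _ in range(iters):
        Acl = A - B @ K
        # Policy evaluation: P = Acl^T P Acl + Q + K^T R K (Bellman equation for V = x^T P x).
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        K = greedy_gain(P)
    return P, K

def value_iteration(iters=2000):
    """VI: starts from P = 0; no stabilizing policy is required."""
    P = np.zeros_like(Q)
    for _ in range(iters):
        K = greedy_gain(P)
        Acl = A - B @ K
        P = Q + K.T @ R @ K + Acl.T @ P @ Acl   # one-step value update
    return P, K

P_pi, K_pi = policy_iteration(K0=np.array([[1.0, 2.0]]))
P_vi, K_vi = value_iteration()
print("PI and VI agree:", np.allclose(P_pi, P_vi, atol=1e-6))
```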
1.3 Review of Matrix Algebra

In this book, some matrix manipulations are the basic mathematical vehicle and, for
those whose memory needs refreshing, we provide a short review.
1. For any $n \times n$ matrices $A$ and $B$, $(AB)^T = B^T A^T$.
2. For any $n \times n$ matrices $A$ and $B$, if $A$ and $B$ are nonsingular, then $(AB)^{-1} = B^{-1} A^{-1}$.
3. The Kronecker product of two matrices $A = [a_{ij}] \in \mathbb{C}^{m \times n}$ and $B = [b_{ij}] \in \mathbb{C}^{p \times q}$
is $A \otimes B = [a_{ij} B] \in \mathbb{C}^{mp \times nq}$.
4. If $A = [a_1, a_2, \ldots, a_n] \in \mathbb{C}^{m \times n}$, where $a_i$ are the columns of $A$, the stacking
operator is defined by $s(A) = [a_1^T, a_2^T, \ldots, a_n^T]^T$. It converts $A \in \mathbb{C}^{m \times n}$ into a
vector $s(A) \in \mathbb{C}^{mn}$. Then for matrices $A$, $B$ and $D$ we have $s(ABD) = (D^T \otimes A)\, s(B)$.
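The following snippet (illustrative only, with random matrices) numerically verifies the stacking identity $s(ABD) = (D^T \otimes A)\, s(B)$ stated above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
D = rng.standard_normal((2, 5))

def s(M):
    # Stacking operator: stack the columns of M into one long vector.
    return M.reshape(-1, order="F")

lhs = s(A @ B @ D)
rhs = np.kron(D.T, A) @ s(B)
print(np.allclose(lhs, rhs))  # True
```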
References
1. Zhang, H., Liu, D., Luo, Y., Wang, D.: Adaptive Dynamic Programming for Control-Algorithms
and Stability. Springer, London (2013)
2. Vrabie, D., Vamvoudakis, K., Lewis, F.: Optimal Adaptive Control and Differential Games by
Reinforcement Learning Principles. The Institution of Engineering and Technology, London
(2013)
3. Barto, A., Sutton, R., Anderson, C.: Neuron-like adaptive elements that can solve difficult
learning control problems. IEEE Trans. Syst. Man Cybern. Part B, Cybern. SMC–13(5), 834–
846 (1983)
4. Werbos, P.: A menu of designs for reinforcement learning over time. In: Miller, W.T., Sutton,
R.S., Werbos, P.J. (eds.) Neural Networks for Control, pp. 67–95. MIT Press, Cambridge (1991)
5. Werbos, P.: Approximate dynamic programming for real-time control and neural modeling.
In: White, D.A., Sofge, D.A. (eds.) Handbook of Intelligent Control. Van Nostrand Reinhold,
New York (1992)
6. Werbos, P.: Neural networks for control and system identification. In: Proceedings of the IEEE
Conference on Decision and Control, Tampa, FL, pp. 260–265 (1989)
7. Werbos, P.: Advanced forecasting methods for global crisis warning and models of intelligence.
General Syst. Yearbook 22, 25–38 (1977)
8. Liu, D., Wei, Q., Yang, X., Li, H., Wang, D.: Adaptive Dynamic Programming with Applications
in Optimal Control. Springer International Publishing, Berlin (2017)
9. Werbos, P.: ADP: the key direction for future research in intelligent control and understanding
brain intelligence. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 38(4), 898–900 (2008)
10. Lewis, F., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback
control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
Chapter 2
Neural-Network-Based Approach for
Finite-Time Optimal Control
This chapter proposes a novel finite-time optimal control method based on input-output
data for unknown nonlinear systems using the ADP algorithm. In this method,
a single-hidden-layer feed-forward network (SLFN) with extreme learning machine
(ELM) is used to construct a data-based identifier of the unknown system dynamics.
Based on the data-based identifier, the finite-time optimal control method is established
by the ADP algorithm. Two other SLFNs with ELM are used in the ADP method
to facilitate the implementation of the iterative algorithm; they approximate the
performance index function and the optimal control law at each iteration,
respectively. A simulation example is provided to demonstrate the effectiveness of
the proposed control scheme.
2.1 Introduction
The linear optimal control problem with a quadratic cost function is probably the most
well-known control problem [1, 2], and it can be translated into a Riccati equation.
In contrast, the optimal control of nonlinear systems is usually a challenging and difficult
problem [3, 4]. Furthermore, compared with the case of known system dynamics, it
is more intractable to solve the optimal control problem when the system dynamics is
unknown. Generally speaking, most real systems are far too complex to admit
perfect mathematical models. Whenever no model is available for designing
the system controller, nor is one easy to produce, a standard approach is to resort to data-based
techniques [5]: (1) on the basis of input-output data, a model of the unknown
system dynamics is identified; (2) on the basis of the estimated model of the system
dynamics, the controller is designed by model-based design techniques.
It is well known that the neural network is an effective tool for intelligent
identification based on input-output data, owing to its nonlinearity, adaptivity,
self-learning and fault tolerance [6–10]. Among these, the SLFN is one of the most
useful types [11]. In [12], Hornik proved that if the activation function is continuous,
bounded, and non-constant, then continuous mappings can be approximated by an
SLFN with additive hidden nodes over compact input sets. In [13], Leshno improved
the results of [12] and proved that an SLFN with additive hidden nodes and a
non-polynomial activation function can approximate any continuous target function. In
[11], it is proven that an SLFN with randomly generated additive hidden nodes and a
broad class of activation functions can universally approximate any continuous target
function on any compact subset of Euclidean space. For SLFN training, there are
three main approaches: (1) gradient-descent based, for example the back-propagation
(BP) method [14]; (2) least-squares based, for example the ELM method used in this chapter;
(3) standard-optimization based, for example the support vector machine (SVM).
However, the learning speed of feed-forward neural networks is in general far slower
than required, and this has been a major bottleneck in their applications for the past decades
[15]. Two key reasons are: (1) slow gradient-based learning algorithms are extensively
used to train the networks, and (2) all the parameters of the networks are tuned
iteratively by such learning algorithms. Unlike conventional neural network
training, in this chapter the ELM method is used to train the SLFN. Such an SLFN can
serve as a universal approximator: one may simply choose the hidden nodes randomly
and then only needs to adjust the output weights linking the hidden layer and the output
layer. For a given network architecture, ELM does not require manually tuned
parameters, so it converges quickly and is easy to use.
Based on the SLFN identifier, the finite-time optimal control method is presented
in this chapter. For finite-time control problems, the system must be stabilized to
zero within finite time. The controller design for finite-time problems still presents
a challenge to control engineers, because of the lack of methodology and because the
control horizon is difficult to determine. Few results relate to finite-time optimal control
based on the ADP algorithm. To our knowledge, [16] solved the finite-horizon optimal
control problem for a class of discrete-time nonlinear systems using the ADP algorithm,
but the method in [16] adopts BP networks to obtain the optimal control, which
converge slowly.
In this chapter, we design the finite-time optimal controller based on the SLFN
with ELM for unknown nonlinear systems. First, the identifier is established from the
input-output data, and it is proven that the identification error converges to zero. Upon
the data-based identifier, the optimal control method is proposed. We prove that the
iterative performance index function converges to the optimum, and the optimal control
is also obtained. Compared with other popular implementation methods such as BP,
the SLFN with ELM has a fast response speed and is fully automatic; except for the
target error and the allowed maximum number of hidden nodes, no control parameters
need to be manually tuned by the user.
The rest of this chapter is organized as follows. In Sect. 2.2, the problem formu-
lation is presented. In Sect. 2.3, the identifier is developed based on the input-output
data. In Sect. 2.4, the iterative ADP algorithm and the convergence proof are given.
In Sect. 2.6, an example is given to demonstrate the effectiveness of the proposed
control scheme. In Sect. 2.7, the conclusion is drawn.
2.2 Problem Formulation and Motivation

Consider the following unknown discrete-time nonlinear system:
$$x(k+1) = F\big(x(k), u(k)\big), \qquad (2.1)$$
where the state $x(k) \in \mathbb{R}^n$ and the control $u(k) \in \mathbb{R}^m$. $F(x(k), u(k))$ is an unknown
continuous function. Assume that the state is completely controllable and bounded
on $\Omega$, and $F(0, 0) = 0$. The finite-time performance index function is defined as
follows:
$$J\big(x(k), U(k, K)\big) = \sum_{i=k}^{K}\big\{ x^T(i)Qx(i) + u^T(i)Ru(i) \big\}, \qquad (2.2)$$
where Q and R are positive definite matrices, K is a finite positive integer, and the
control sequence U (k, K ) = (u(k), u(k + 1), . . . , u(K )) is finite-time admissible
[16]. The length of U (k, K ) is defined as (K − k + 1).
This chapter aims to find the optimal control for system (2.1) based on the
performance index function (2.2). Since the system dynamics is completely unknown,
the optimal control problem cannot be solved directly. Therefore, it is desirable to propose
a novel method that does not need the exact system dynamics but only the input-output
data, which can be obtained during the operation of the system. In this chapter,
we propose a data-based optimal control scheme using the SLFN with ELM and the ADP
method for general unknown nonlinear systems. The design of the proposed controller
is divided into two steps:
(1) The unknown nonlinear system dynamics is identified by an SLFN identification
scheme with a convergence proof.
(2) The optimal controller is designed based on the data-based identifier.
In the following sections, we discuss the establishment of the data-based
identifier and the controller design in detail.
2.3 The Data-Based Identifier

In this section, the ELM method is introduced and the data-based identifier is
established with a convergence proof. The structure of the SLFN is shown in Fig. 2.1.
Consider $N_1$ arbitrary distinct samples $(\bar{x}(i), \bar{y}(i))$, where $\bar{x}(i) \in \mathbb{R}^{n_1}$, $\bar{y}(i) \in \mathbb{R}^{m_1}$,
$i = 1, 2, \ldots, N_1$. The weight vector between the input neurons and the $j$th hidden
neuron is $w_j \in \mathbb{R}^{n_2}$. The weight vector between the output neurons and
the $j$th hidden neuron is $\bar{\beta}_j \in \mathbb{R}^{n_3}$, which will be designed by the ELM method [17].
The number of hidden neurons is $L$. The threshold of the $j$th hidden neuron is
$b_j$. The hidden-layer activation function $g_L(\bar{x})$ is infinitely differentiable, and the
output of the SLFN is

Fig. 2.1 The structure of the SLFN

$$f_L(\bar{x}(i)) = \sum_{j=1}^{L} \bar{\beta}_j\, g_L\big(w_j^T \bar{x}(i) + b_j\big), \quad i = 1, 2, \ldots, N_1. \qquad (2.3)$$
Equation (2.3) can be written compactly as
$$H \beta_L = \bar{Y}. \qquad (2.4)$$
The output weights are then determined as the least-squares solution
$$\beta_L = H^+ \bar{Y}, \qquad (2.5)$$
where $H^+ = (H^T H)^{-1} H^T$.
For the SLFN in (2.3), the output weight $\beta_L$ is the only value we want to obtain. In
the following, a theorem is given to prove that $\beta_L$ exists, which means that $H$ is
invertible.
Theorem 2.1 Let the SLFN be defined as in (2.3), with $L$ hidden neurons. For $N_1$
arbitrary distinct input samples $\bar{x}(i)$ and any given $w_j$ and $b_j$, $H$ in (2.4) is invertible.
Proof Since the $\bar{x}(i)$ are distinct, for any vector $w_j$ drawn according to any continuous
probability distribution, with probability one the values $w_j^T \bar{x}(1), w_j^T \bar{x}(2), \ldots, w_j^T \bar{x}(N_1)$
are different from each other. Define the $j$th column of $H$ as $c(j) = [g_L(w_j^T \bar{x}(1) + b_j),
g_L(w_j^T \bar{x}(2) + b_j), \ldots, g_L(w_j^T \bar{x}(N_1) + b_j)]^T$; then $c(j)$ does not belong
to any subspace whose dimension is less than $N_1$ [19]. It means that for any given
$w_j$ and $b_j$, according to any continuous probability distribution, $H$ in (2.4) can be
made full rank, i.e., $H$ is invertible.
Remark 2.1 The ELM algorithm can work with a wide class of activation functions, such
as sigmoidal, radial basis, sine, cosine and exponential functions.
Feed-forward networks with arbitrarily assigned input weights and hidden-layer
biases can universally approximate any continuous function on any compact input
set [21].
Remark 2.2 It is important to point out that $\beta_L$ in (2.5) has the smallest norm among
all least-squares solutions of $H\beta_L = \bar{Y}$. Since the input weights and hidden-neuron
biases of the SLFN are simply assigned random values, training an SLFN is
equivalent to finding a least-squares solution of the linear system $H\beta_L = \bar{Y}$. Although
almost all learning algorithms aim to reach the minimum training error, most of them
cannot, because they become trapped in local minima or because an unlimited number
of training iterations is not allowed in applications [21]. Fortunately, the special unique
solution $\beta_L$ in (2.5) has the smallest norm among all least-squares solutions.
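As a concrete illustration (a minimal sketch with assumed data, network size and activation function, not the book's implementation), ELM training of an SLFN amounts to fixing the hidden layer randomly and solving a single least-squares problem for the output weights, as in (2.5):

```python
import numpy as np

rng = np.random.default_rng(1)

n1, m1, L, N1 = 3, 1, 20, 200                 # input dim, output dim, hidden nodes, samples
X = rng.uniform(-1.0, 1.0, size=(N1, n1))     # input samples x_bar(i)   (placeholder data)
Y = np.sin(X.sum(axis=1, keepdims=True))      # target outputs y_bar(i)  (placeholder data)

# Step 1: randomly assign the input weights w_j and biases b_j (never retrained).
W = rng.standard_normal((n1, L))
b = rng.standard_normal((1, L))

# Step 2: compute the hidden-layer output matrix H (here with a sigmoid activation).
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))

# Step 3: output weights as the minimum-norm least-squares solution beta_L = H^+ Y.
beta = np.linalg.pinv(H) @ Y

# The trained SLFN evaluated on the training inputs.
Y_hat = H @ beta
print("training RMSE:", np.sqrt(np.mean((Y - Y_hat) ** 2)))
```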
For the unknown nonlinear system (2.1), the data-based identifier is established.
Then we can design the iterative ADP algorithm to get the solution of the finite-time
optimal control problem.
2.4 Derivation of the Iterative ADP Algorithm with Convergence Analysis

First, the derivations of the optimal control $u^*(k)$ and $J^*(x(k))$ are given in detail.
It is known that, for the case of finite-horizon optimization, the optimal performance
index function $J^*(x(k))$ satisfies [16]
$$J^*(x(k)) = \inf_{U(k,K)} J\big(x(k), U(k,K)\big),$$
where $U(k, K)$ stands for a finite-time control sequence. The length of the control
sequence is not assigned.
According to Bellman's optimality principle, the following Hamilton–Jacobi–Bellman
(HJB) equation holds:
$$J^*(x(k)) = \inf_{u(k)}\big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + J^*(x(k+1)) \big\}.$$
Define the law of the optimal control sequence starting at $k$ by
$$U^*(k, K) = \arg\inf_{U(k,K)} J\big(x(k), U(k,K)\big).$$
Based on the above preparation, the finite-time ADP method for the unknown system
is proposed. The iterative procedure is as follows.
For the iterative step $i = 1$, the performance index function is computed as
$$V^{[1]}(x(k)) = x^T(k)Qx(k) + u^{[1]T}(k)Ru^{[1]}(k) + V^{[0]}(x(k+1)), \qquad (2.11)$$
where
$$u^{[1]}(x(k)) = \arg\inf_{u(k)}\big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[0]}(x(k+1)) \big\}, \qquad (2.12)$$
and $V^{[0]}(x(k+1))$ has two expression forms according to two different cases.
If for $x(k)$ there exists $U(k, K) = (u(k))$ such that $F(x(k), u(k)) = 0$, then $V^{[0]}(x(k+1))$ is
$$V^{[0]}(x(k+1)) = J\big(x(k+1), U^*(k+1, k+1)\big) = 0, \qquad (2.13)$$
where $U^*(k+1, k+1) = (0)$. In this situation, the constraint $F(x(k), u^{[1]}(k)) = 0$
for (2.11) is necessary.
If for $x(k)$ there exists $U(k, \bar{K}) = (u(k), u(k+1), \ldots, u(\bar{K}))$ such that
$F(x(k), U(k, \bar{K})) = 0$, then $V^{[0]}(x(k+1))$ is given by (2.14).
For the iterative step $i = 1, 2, \ldots$, the performance index function is updated as
$$V^{[i+1]}(x(k)) = x^T(k)Qx(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) + V^{[i]}(x(k+1)), \qquad (2.15)$$
where
$$u^{[i+1]}(x(k)) = \arg\inf_{u(k)}\big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[i]}(x(k+1)) \big\}. \qquad (2.16)$$
In the above recurrent iterative procedure, the index i is the iterative step and k is
the time step. The optimal control and optimal performance index function can be
obtained by the iterative ADP algorithm (2.11)–(2.16).
In the following part, we will present the convergence analysis of the iterative
ADP algorithm (2.11)–(2.16).
Theorem 2.2 For an arbitrary state vector x(k), the performance index func-
tion V [i+1] (x(k)) is obtained by the iterative ADP algorithm (2.11)–(2.16), then
{V [i+1] (x(k))} is a monotonically nonincreasing sequence for i ≥ 1, i.e., V [i+1] (x(k))
≤ V [i] (x(k)), ∀i ≥ 1.
Proof First, for $i = 1$, we have
$$V^{[2]}(x(k)) = x^T(k)Qx(k) + u^{[2]T}(k)Ru^{[2]}(k) + V^{[1]}(x(k+1)). \qquad (2.17)$$
It follows that
$$V^{[2]}(x(k)) = \inf_{U(k,k+1)} \sum_{l=k}^{k+1}\big\{ x^T(l)Qx(l) + u^T(l)Ru(l) \big\}. \qquad (2.20)$$
If $U(k, k+1)$ in (2.20) is taken as $U(k, k+1) = (u(k), u(k+1)) = (u^{[1]}(k), 0)$, then we have
$$\sum_{l=k}^{k+1}\big\{ x^T(l)Qx(l) + u^T(l)Ru(l) \big\} = x^T(k)Qx(k) + u^{[1]T}(k)Ru^{[1]}(k) = V^{[1]}(x(k)). \qquad (2.21)$$
So, according to (2.20) and (2.21), we have $V^{[2]}(x(k)) \le V^{[1]}(x(k))$ for $i = 1$.
Second, we assume that for $i = j - 1$ the corresponding expression holds. Then,
according to (2.15), for $i = j$ we can expand $V^{[j+1]}(x(k))$ along the iterated controls,
and so we obtain
$$V^{[j+1]}(x(k)) = \inf_{U(k)} \sum_{l=k}^{k+j}\big\{ x^T(l)Qx(l) + u^T(l)Ru(l) \big\}. \qquad (2.24)$$
Now consider the sum
$$\sum_{l=k}^{k+j}\big\{ x^T(l)Qx(l) + u^T(l)Ru(l) \big\}$$
for a particular choice of the control sequence. As mentioned in the iterative algorithm,
the constraint $F(x(k), u^{[1]}(k)) = 0$, $\forall x(k)$, for (2.11) is necessary. So we can get
$$\sum_{l=k}^{k+j}\big\{ x^T(l)Qx(l) + u^T(l)Ru(l) \big\} = V^{[j]}(x(k)). \qquad (2.27)$$
Therefore, we obtain $V^{[j+1]}(x(k)) \le V^{[j]}(x(k))$.
For the situation (2.14), it can easily be proven according to the above method.
Therefore, we can conclude that V [i+1] (x(k)) ≤ V [i] (x(k)), ∀i.
From Theorem 2.2, the sequence of iterative performance index functions is
monotonically nonincreasing and bounded below by zero, so it is convergent. We can
therefore define the limit of the sequence $\{V^{[i+1]}(x(k))\}$ as $V^o(x(k))$.
In the next theorem, we will prove that $V^o(x(k))$ satisfies the HJB equation.
Theorem 2.3 Let $V^o(x(k)) = \lim_{i\to\infty} V^{[i+1]}(x(k))$; then $V^o(x(k))$ satisfies the HJB equation
$$V^o(x(k)) = \inf_{u(k)}\big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1)) \big\}.$$
Proof According to (2.15) and (2.16), for any admissible control vector η(k), we
have
On the other hand, according to the definition $V^o(x(k)) = \lim_{i\to\infty} V^{[i+1]}(x(k))$,
for an arbitrary positive number $\varepsilon$ there exists a positive integer $p$ such that
Hence we have
From Theorems 2.2 and 2.3, it can be concluded that V o (x(k)) is the optimal
performance index function and V o (x(k)) = J ∗ (x(k)). So we can have the following
corollary.
In this section, the iterative control algorithm has been proposed for data-based
unknown systems with convergence analysis. In the next section, the neural network
implementation of the iterative control algorithm will be presented.
2.5 Neural Network Implementation of the Iterative Control Algorithm

The input-output data are used to identify the unknown nonlinear system until the
identification error is within the desired precision range. Then the data-based identifier
is used for the controller design. The diagram of the whole structure is shown in
Fig. 2.2.
In Fig. 2.2, the SLFNs module is the identifier, the action network module is
used to approximate the iterative control $u^{[i]}(k)$, and the critic network module is
used to approximate the iterative performance index function. The SLFNs with ELM
are used in the ADP algorithm as the action network and the critic network. The
detailed implementation steps are as follows; a minimal sketch of the resulting loop
is given after the list.
Step 1. Train the identifier by the input-output data.
Step 2. Choose an error bound $\varepsilon$, and randomly choose an initial state $x(0)$.
Step 3. Calculate the initial finite-time admissible control sequence for $x(0)$,
which is $U(0, K) = (u(0), u(1), \ldots, u(K))$. The corresponding state sequence is
$(x(0), x(1), \ldots, x(K+1))$, where $x(K+1) = 0$.
Fig. 2.2 The diagram of the whole structure
Step 4. For the state $x(K)$, run the iterative ADP algorithm (2.11)–(2.13) for
$i = 1$, and (2.15)–(2.16) for $i > 1$. If $|V^{[i+1]}(x(K)) - V^{[i]}(x(K))| < \varepsilon$, then stop.
Step 5. For the state $x(k)$, $k = K-1, K-2, \ldots, 0$, run the iterative ADP algorithm
(2.11)–(2.12) and (2.14) for $i = 1$, and (2.15)–(2.16) for $i > 1$, until
$|V^{[i+1]}(x(k)) - V^{[i]}(x(k))| < \varepsilon$.
Step 6. Stop.
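The loop below is a minimal sketch of Steps 1–6 under simplifying assumptions: a scalar placeholder model stands in for the data-based identifier, the arg-inf in (2.16) is taken over a control grid, and an interpolated tabular critic replaces the SLFN/ELM critic and action networks. It is meant only to make the structure of the iteration (2.15)–(2.16) concrete, not to reproduce the book's implementation.

```python
import numpy as np

def f_hat(x, u):
    # Stand-in for the data-based identifier of x(k+1) = F(x(k), u(k)) (placeholder dynamics).
    return 0.8 * np.sin(x) + u

Q, R, eps = 1.0, 1.0, 1e-8
x_grid = np.linspace(-2.0, 2.0, 201)        # states where the critic is fitted
u_grid = np.linspace(-2.0, 2.0, 401)        # candidate controls for the arg-inf

V = np.zeros_like(x_grid)                   # V^[0] = 0 (terminal value)
for i in range(200):                        # iteration index i
    X_next = f_hat(x_grid[:, None], u_grid[None, :])   # successor states for all (x, u)
    V_next = np.interp(X_next.ravel(), x_grid, V).reshape(X_next.shape)  # critic V^[i]
    costs = Q * x_grid[:, None] ** 2 + R * u_grid[None, :] ** 2 + V_next
    V_new = costs.min(axis=1)               # (2.15): value update
    u_new = u_grid[costs.argmin(axis=1)]    # (2.16): greedy (actor) update
    if np.max(np.abs(V_new - V)) < eps:     # stopping test of Steps 4-5
        V = V_new
        break
    V = V_new

print("stopped after", i + 1, "iterations")
print("V(0.5) ~", np.interp(0.5, x_grid, V), " u(0.5) ~", np.interp(0.5, x_grid, u_new))
```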
2.6 Simulation Study

To evaluate the performance of our iterative ADP algorithm with the data-based
identifier, an example is provided in this section.
Consider the following nonlinear system [16]:
[Figure: identification results — test state, train state and accuracy versus time steps]
[Figure: performance index function versus iteration steps]
The results of the ELM method are faster and smoother than the results of the BP method.
It can be concluded that the learning speed of the ELM method is faster than that of the
BP method, while obtaining better generalization performance.
[Figure: performance index function versus iteration steps]
[Figure: state trajectory versus time steps]
2.7 Conclusions
This chapter studied the ELM method for optimal control of unknown nonlinear
systems. Using the input-output data, a data-based identifier was established. The
finite-time optimal control scheme was proposed based on the iterative ADP algorithm.
The results of the theorems showed that the proposed iterative algorithm is convergent.
The simulation study has demonstrated the effectiveness of the proposed control
algorithm.
References
1. Duncan, T., Guo, L., Pasik-Duncan, B.: Adaptive continuous-time linear quadratic gaussian
control. IEEE Trans. Autom. Control 44(9), 1653–1662 (1999)
2. Gabasov, R., Kirillova, F., Balashevich, N.: Open-loop and closed-loop optimization of linear
control systems. Asian J. Control 2(3), 155–168 (2000)
3. Jin, X., Yang, G., Peng, L.: Robust adaptive tracking control of distributed delay systems with
actuator and communication failures. Asian J. Control 14(5), 1282–1298 (2012)
4. Zhang, H., Song, R., Wei, Q., Zhang, T.: Optimal tracking control for a class of nonlinear
discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans.
Neural Netw. 22(12), 1851–1862 (2011)
5. Guardabassi, G., Savaresi, S.: Virtual reference direct design method: an off-line approach to
data-based control system design. IEEE Trans. Autom. Control 45(5), 954–959 (2000)
6. Jagannathan, S.: Neural Network Control of Nonlinear Discrete-Time Systems. CRC Press,
Boca Raton (2006)
7. Yu, W.: Recent Advances in Intelligent Control Systems. Springer, London (2009)
8. Fernández-Navarro, F., Hervás-Martínez, C., Gutierrez, P.: Generalised Gaussian radial basis
function neural networks. Soft Comput. 17, 519–533 (2013)
9. Richert, D., Masaud, K., Macnab, C.: Discrete-time weight updates in neural-adaptive control.
Soft Comput. 17, 431–444 (2013)
10. Kuntal, M., Pratihar, D., Nath, A.: Analysis and synthesis of laser forming process using neural
networks and neuro-fuzzy inference system. Soft Comput. 17, 849–865 (2013)
11. Huang, G., Chen, L., Siew, C.: Universal approximation using incremental constructive feed-
forward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006)
12. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4,
251–257 (1991)
13. Leshno, M., Lin, V., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a non-
polynomial activation function can approximate any function. Neural Netw. 6, 861–867 (1993)
14. Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class
of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst.
Man Cybern. Part B: Cybern. 38(4), 937–942 (2008)
15. Huang, G., Siew, C.: Extreme learning machine: RBF network case. In: Proceedings of the
Eighth International Conference on Control, Automation, Robotics and Vision (ICARCV
2004), Dec 6–9, Kunming, China, vol. 2, pp. 1029–1036. (2004)
16. Wang, F., Jin, N., Liu, D., Wei, Q.: Adaptive dynamic programming for finite-horizon optimal
control of discrete-time nonlinear systems with ε-error bound. IEEE Trans. Neural Netw. 22,
24–36 (2011)
17. Zhang, R., Huang, G., Sundararajan, N., Saratchandran, P.: Multi-category classification using
extreme learning machine for microarray gene expression cancer diagnosis. IEEE/ACM Trans.
Comput. Biol. Bioinform. 4(3), 485–495 (2007)
18. Tamura, S., Tateishi, M.: Capabilities of a four-layered feedforward neural network: four layers
versus three. IEEE Trans. Neural Netw. 8(2), 251–255 (1997)
19. Huang, G.: Learning capability and storage capacity of two-hidden-layer feedforward networks.
IEEE Trans. Neural Netw. 14(2), 274–281 (2003)
20. Huang, G., Wang, D., Lan, Y.: Extreme learning machines: a survey. Int. J. Mach. Learn.
Cybern. 2(2), 107–122 (2011)
21. Huang, G., Zhu, Q., Siew, C.: Extreme learning machine: theory and applications. Neurocom-
puting 70, 489–501 (2006)
Chapter 3
Nearly Finite-Horizon Optimal Control
for Nonaffine Time-Delay Nonlinear
Systems
In this chapter, a novel ADP algorithm is developed to solve the nearly optimal finite-horizon
control problem for a class of deterministic nonaffine nonlinear time-delay
systems. The idea is to use the ADP technique to obtain a nearly optimal control which
makes the optimal performance index function close to the greatest lower bound of
all performance index functions within finite time. The proposed algorithm contains
two cases with different initial iterations. In the first case, there exists a control policy
which makes an arbitrary state of the system reach zero in one time step.
In the second case, there exists a control sequence which makes the system reach
zero in multiple time steps. State updating is used to determine the optimal state.
Convergence analysis of the performance index function is given. Furthermore, the
relationship between the iteration steps and the length of the control sequence is
presented. Two neural networks are used to approximate the performance index function
and compute the optimal control policy, facilitating the implementation of the ADP
iteration algorithm. Finally, two examples are used to demonstrate the effectiveness
of the proposed ADP iteration algorithm.
3.1 Introduction
the controllability of linear time-delay systems [5, 6]. They proposed some related
theorems to judge the controllability of linear time-delay systems. In addition,
the optimal control problem is often encountered in industrial production, so the
investigation of optimal control for time-delay systems is significant. In [7], D. H.
Chyung pointed out the disadvantages of rewriting a discrete time-delay system as
an extended system by the dimension-augmentation method in order to deal with the
optimal control problem. Some direct methods for linear time-delay systems were
therefore presented in [7, 8]. For nonlinear time-delay systems, due to their complexity,
the optimal control problem is rarely studied. To our knowledge, [9] solved the
finite-horizon optimal control problem for a class of discrete-time nonlinear systems
using the ADP algorithm, but the method in [9] cannot be used for nonlinear time-delay
systems, because the delayed states in time-delay systems are coupled with each other:
the state at the current time k is decided by the states before k and by the control law,
while the control law is not known before it is obtained. So, based on the research
results in [9], we propose a new ADP algorithm to solve the nearly finite-horizon
optimal control problem for discrete time-delay systems through the framework of
the Hamilton–Jacobi–Bellman (HJB) equation.
In this chapter, the optimal controller is designed based on the original time-delay
system directly. The state updating method is proposed to determine the optimal
state of the time-delay system. For finite-horizon optimal control, the system should
reach zero when the final running step N is finite, but this is impossible to achieve
exactly in practice, so the results in this chapter are in the sense of an error bound.
The main contributions of this chapter can be summarized as follows.
(1) The finite-horizon optimal control for deterministic discrete time-delay systems
is studied for the first time based on the ADP algorithm.
(2) The state updating is used to determine the optimal states of HJB equation.
(3) The relationship between the iteration steps and the length of the control
sequence is given.
This chapter is organized as follows. In Sect. 3.2, the problem formulation is presented.
In Sect. 3.3, the nearly finite-horizon optimal control scheme is developed
based on the iteration ADP algorithm, and the convergence proof is given. In Sect. 3.4,
two examples are given to demonstrate the effectiveness of the proposed control
scheme. In Sect. 3.5, the conclusion is drawn.
3.2 Problem Statement

Consider the following nonaffine nonlinear time-delay system:
$$x(k+1) = F\big(x(k), x(k-h_1), \ldots, x(k-h_l), u(k)\big), \qquad (3.1)$$
with the initial condition $x(t) = \chi(t)$, $-h_l \le t \le 0$, where $x(k)$ is the state and
$u(k)$ is the control. The finite-horizon performance index function is defined as
$$J\big(x(k), U(k, N+k-1)\big) = \sum_{j=k}^{N+k-1}\big\{ x^T(j)Qx(j) + u^T(j)Ru(j) \big\}, \qquad (3.2)$$
where $Q$ and $R$ are positive definite matrices and $U(k, N+k-1)$ is a control sequence
of length $N$.
Definition 3.1 N time steps control sequence: For any time step k, we define the
N time steps control sequence U (k, N + k − 1) = (u(k), u(k + 1), . . . , u(N + k −
1)). The length of U (k, N + k − 1) is N .
Definition 3.2 Final state: we define the final state as $x_f = x_f(x(k), U(k, N+k-1))$,
i.e., $x_f = x(N+k)$.
Remark 3.1 Definitions 3.1 and 3.2 are used to state conveniently the admissible
control sequence (Definition 3.3), which is necessary for the theorems of this
chapter.
Remark 3.2 It is important to point out that the length of the control sequence N cannot
be designated in advance. It is calculated by the proposed algorithm. If we calculate
that the length of the optimal control sequence is L at time step k, then we consider that
the optimal control sequence length at time step k is N = L.
Remark 3.3 From Remark 3.2, we can see that the length N of the optimal control
sequence is an unknown finite number and cannot be designated in advance. So we can
say that if at time step k, the length of the optimal control sequence is N , then at time
step k + 1, the length of the optimal control sequence is N − 1. Therefore, the HJB
equation (3.7) is established.
In the following, we give an explanation of the validity of Eq. (3.4). First,
we define $U^*(k, N+k-1) = (u^*(k), u^*(k+1), \ldots, u^*(N+k-1))$. Then we have
$$J^*(x(k)) = \sum_{j=k}^{N+k-1}\big\{ x^T(j)Qx(j) + u^{*T}(j)Ru^*(j)\big\}
= x^T(k)Qx(k) + u^{*T}(k)Ru^*(k) + \cdots + x^T(N+k-1)Qx(N+k-1) + u^{*T}(N+k-1)Ru^*(N+k-1). \qquad (3.10)$$
Since $u^*(N+k-1)$ minimizes the final stage cost, we also obtain
$$J^*(x(k)) = x^T(k)Qx(k) + u^{*T}(k)Ru^*(k) + \cdots + x^T(N+k-2)Qx(N+k-2) + u^{*T}(N+k-2)Ru^*(N+k-2)
+ \inf_{u(N+k-1)}\big\{ x^T(N+k-1)Qx(N+k-1) + u^T(N+k-1)Ru(N+k-1)\big\}.$$
So, proceeding backward stage by stage, we have
$$J^*(x(k)) = \inf_{u(k)}\Big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + \cdots
+ \inf_{u(N+k-2)}\big\{ x^T(N+k-2)Qx(N+k-2) + u^T(N+k-2)Ru(N+k-2)
+ \inf_{u(N+k-1)}\{ x^T(N+k-1)Qx(N+k-1) + u^T(N+k-1)Ru(N+k-1)\}\big\}\cdots\Big\}.$$
3.3 The Iteration ADP Algorithm and Its Convergence

3.3.1 The Novel ADP Iteration Algorithm

In this subsection, we give the novel iteration ADP algorithm in detail. For the
state $x(k)$ of system (3.1), there exist two cases. Case 1: there exists $U(k, k)$ which makes
$x(k+1) = 0$. Case 2: there exists $U(k, k+m)$, $m > 0$, which makes $x(k+m+1) = 0$. In
the following, we discuss the two cases, respectively.
Case 1: There exists $U(k, k) = (\beta(k))$ which makes $x(k+1) = 0$ for system
(3.1). We set the optimal control sequence $U^*(k+1, k+1) = (0)$. The states of the
system are driven by a given initial state $\chi(t)$, $-h_l \le t \le 0$, and the initial control
policy $\beta(t)$. We set $V^{[0]}(x(k+1)) = J(x(k+1), U^*(k+1, k+1)) = 0$, $\forall x(k+1)$;
then, for time step $k$, we have the following iterations:
$$u^{[1]}(k) = \arg\inf_{u(k)}\big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[0]}(x(k+1)) \big\} \qquad (3.15)$$
and
$$V^{[1]}(x^{[1]}(k)) = x^{[1]T}(k)Qx^{[1]}(k) + u^{[1]T}(k)Ru^{[1]}(k) + V^{[0]}(x^{[0]}(k+1)), \qquad (3.16)$$
with the corresponding state updating. For the iteration index $i = 1, 2, \ldots$, the
iterations are
$$u^{[i+1]}(k) = \arg\inf_{u(k)}\big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[i]}(x(k+1)) \big\} \qquad (3.19)$$
and
$$V^{[i+1]}(x^{[i+1]}(k)) = x^{[i+1]T}(k)Qx^{[i+1]}(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) + V^{[i]}(x^{[i]}(k+1)), \qquad (3.20)$$
again with the corresponding state updating.
Case 2: There exists $U(k, k+m) = (\beta(k), \beta(k+1), \ldots, \beta(k+m))$, $m > 0$, which
makes $x(k+m+1) = 0$ for system (3.1). For time step $k$, the iterations are
$$u^{[1]}(k) = \arg\inf_{u(k)}\big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[0]}(x(k+1)) \big\} \qquad (3.24)$$
and
$$V^{[1]}(x^{[1]}(k)) = x^{[1]T}(k)Qx^{[1]}(k) + u^{[1]T}(k)Ru^{[1]}(k) + V^{[0]}(x^{[0]}(k+1)). \qquad (3.25)$$
In (3.25), $V^{[0]}(x^{[0]}(k+1))$ is obtained by the similar Eq. (3.26), and the states in (3.25)
are obtained by the corresponding state-updating equations. For the iteration index
$i = 1, 2, \ldots$, the iterations are
$$u^{[i+1]}(k) = \arg\inf_{u(k)}\big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[i]}(x(k+1)) \big\} \qquad (3.29)$$
and
$$V^{[i+1]}(x^{[i+1]}(k)) = x^{[i+1]T}(k)Qx^{[i+1]}(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) + V^{[i]}(x^{[i]}(k+1)), \qquad (3.30)$$
with the corresponding state updating.
This completes the iteration algorithm. From the two cases we can see that, if
$V^{[0]} = 0$ in (3.25), then Case 1 is a special case of Case 2. In the following, the
two cases of the algorithm are summarized.
Remark 3.4 For the state $x(k)$ of system (3.1), which is driven by the fixed initial
states $\chi(t)$, $-h_l \le t \le 0$: if there exists a control sequence $U(k, k) = (\beta(k))$ which
makes $x(k+1) = 0$ hold, then we will use Case 1 of the algorithm to obtain the optimal
control. Otherwise, i.e., if there does not exist $U(k, k)$ which makes $x(k+1) = 0$ hold,
but there is a control sequence $U(k, k+m) = (\beta(k), \beta(k+1), \ldots, \beta(k+m))$
which makes $x_f(x(k), U(k, k+m)) = 0$, then we will adopt Case 2 of the proposed
algorithm to obtain the optimal control.
3.3.2 Convergence Analysis of the Improved Iteration Algorithm
In the above subsection, the novel algorithm for finite-horizon time-delay nonlinear
systems has been proposed in detail. In the following part, we will prove that the
algorithm is convergent and that the limit of the sequence of performance index
functions $V^{[i+1]}(x^{[i+1]}(k))$ satisfies the HJB equation (3.7).
Theorem 3.1 For system (3.1), the states of the system are driven by a given initial
state χ (t), −h l ≤ t ≤ 0, and the initial finite-horizon admissible control policy β(t).
The iteration algorithm is as in (3.15)–(3.33). For time step k, ∀ x(k) and U (k, k + i),
we define
where $V^{[0]}(x(k+i+1))$ is as in (3.26) and $V^{[i+1]}(x(k))$ is updated as in (3.31). Then
we have
$$V^{[i+1]}(x(k)) = \inf_{u(k)}\Big\{ x^T(k)Qx(k) + u^T(k)Ru(k) + \cdots
+ \inf_{u(k+i)}\big\{ x^T(k+i)Qx(k+i) + u^T(k+i)Ru(k+i) + V^{[0]}(x(k+i+1))\big\}\cdots\Big\}, \qquad (3.36)$$
while, for an arbitrary control sequence $U(k, k+i)$,
$$\Lambda^{[i+1]}\big(x(k), U(k, k+i)\big) = x^T(k)Qx(k) + u^T(k)Ru(k)
+ x^T(k+1)Qx(k+1) + u^T(k+1)Ru(k+1) + \cdots
+ x^T(k+i)Qx(k+i) + u^T(k+i)Ru(k+i) + V^{[0]}(x(k+i+1)). \qquad (3.37)$$
Based on Theorem 3.1, we give the monotonicity theorem about the sequence of
performance index functions V [i+1] (x [i+1] (k)), ∀x [i+1] (k).
Theorem 3.2 For system (3.1), let the iteration algorithm be as in (3.15)–(3.33).
Then we have V [i+1] (x [i] (k)) ≤ V [i] (x [i] (k)), ∀i > 0, for Case 1; V [i+1] (x [i] (k)) ≤
V [i] (x [i] (k)), ∀i ≥ 0, for Case 2.
Proof We first give the proof for Case 2. Define $\hat{U}(k, k+i) = (u^{[i]}(k), u^{[i-1]}(k+1),
\ldots, u^{[1]}(k+i-1), u^*(k+i))$; then, according to the definition of
$\Lambda^{[i+1]}(x(k), \hat{U}(k, k+i))$ in (3.34), we have
So we have ∀x(k),
From Theorem 3.2, we can conclude that the sequence of performance index functions
$\{V^{[i+1]}(x(k))\}$ is monotonically nonincreasing. Since the performance index function
is positive definite, and hence bounded below, the sequence is convergent. Thus we
define $V^\infty(x(k)) = \lim_{i\to\infty} V^{[i+1]}(x(k))$, $u^\infty(k) = \lim_{i\to\infty} u^{[i+1]}(k)$, and
$x^\infty(k)$ is the state under $u^\infty(k)$. In the following, we give a theorem to indicate that
$V^\infty(x^\infty(k))$ satisfies the HJB equation.
Theorem 3.3 For system (3.1), the iteration algorithm is as in (3.15)–(3.33). Then
we have
Proof Let $\varepsilon$ be an arbitrary positive number. Since $V^{[i+1]}(x(k))$ is nonincreasing and
$V^\infty(x(k)) = \lim_{i\to\infty} V^{[i+1]}(x(k))$, there exists a positive integer $p$ such that
So we have
Let i → ∞, we have
According to (3.29), we obtain u ∞ (k). From (3.32) and (3.33), we have the cor-
responding state x ∞ (k), thus the following expression
So we can say that $V^\infty(x^\infty(k)) = J^*(x^*(k))$. Until now, we have proven that,
for all $k$, the iteration algorithm converges to the optimal performance index function
as the iteration index $i \to \infty$. For the finite-horizon optimal control problem of
time-delay systems, another aspect is the length $N$ of the optimal control sequence. In this
chapter, the specific value of $N$ is not known, but we can analyse the relationship
between the iteration index $i$ and the terminal time $N$.
$$\begin{aligned} V^{[i+1]}(x^{[i+1]}(k)) = {} & x^{[i+1]T}(k)Qx^{[i+1]}(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) \\
& + x^{[i]T}(k+1)Qx^{[i]}(k+1) + u^{[i]T}(k+1)Ru^{[i]}(k+1) \\
& + \cdots \\
& + x^{[1]T}(k+i)Qx^{[1]}(k+i) + u^{[1]T}(k+i)Ru^{[1]}(k+i) \\
& + V^{[0]}(x^{[0]}(k+i+1)). \end{aligned} \qquad (3.55)$$
According to [9], we can see that the optimal control sequence for $x^{[i+1]}(k)$ is
$U^*(k, k+i) = (u^{[i+1]}(k), u^{[i]}(k+1), \ldots, u^{[1]}(k+i))$. As we have $V^{[0]}(x^{[0]}(k+$
For Case 1 of the proposed iteration algorithm, we have the following corollary.
Corollary 3.1 Let the iteration algorithm be as in (3.15)–(3.23). Then, for system (3.1),
the state at time step k can reach zero in N = i + 1 steps for Case 1.
Proof For Case 1 of the iteration algorithm, we have
$$\begin{aligned} V^{[i+1]}(x^{[i+1]}(k)) = {} & x^{[i+1]T}(k)Qx^{[i+1]}(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) \\
& + x^{[i]T}(k+1)Qx^{[i]}(k+1) + u^{[i]T}(k+1)Ru^{[i]}(k+1) \\
& + \cdots \\
& + x^{[1]T}(k+i)Qx^{[1]}(k+i) + u^{[1]T}(k+i)Ru^{[1]}(k+i) \\
& + V^{[0]}(x^{[0]}(k+i+1)) \\
= {} & J\big(x^{[i+1]}(k), U^*(k, k+i)\big), \end{aligned} \qquad (3.56)$$
where $U^*(k, k+i) = (u^{[i+1]}(k), u^{[i]}(k+1), \ldots, u^{[1]}(k+i))$, and each element of
$U^*(k, k+i)$ is obtained from (3.29). According to Case 1, $x^{[0]}(k+i+1) = 0$. So
the state at time step $k$ can reach zero in $N = i+1$ steps.
We can see that for time step k the optimal controller is obtained when i → ∞,
which induces the time steps N → ∞ according to Theorem 3.4 and Corollary 3.1.
In this chapter, we want to get the nearly optimal performance index function within
finite N time steps. The following corollary is used to prove the existence of the
nearly optimal performance index function and the nearly optimal control.
Corollary 3.2 For system (3.1), let the iteration algorithm be as in (3.15)–(3.33); then
$\forall \varepsilon > 0$, $\exists I \in \mathbb{N}$ such that $\forall i > I$ we have
$$\big| V^{[i+1]}(x^{[i+1]}(k)) - J^*(x^*(k)) \big| \le \varepsilon. \qquad (3.57)$$
Proof From Theorems 3.2 and 3.3, we can see that $\lim_{i\to\infty} V^{[i]}(x^{[i]}(k)) = J^*(x^*(k))$;
then, from the definition of the limit, the conclusion is obtained easily.
So we can say that V [i] (x [i] (k)) is the nearly optimal performance index function
in the sense of ε; the corresponding nearly optimal control is defined as follows:
Remark 3.5 From Theorem 3.4 and Corollary 3.1, we can see that the length of the
control sequence N is dependent on the iteration step. In addition, from Corollary 3.2,
we know that the iteration step is dependent on ε. So it is concluded that the length
of the control sequence N is dependent on ε.
From (3.57), we can see that the inequality is hard to check in practice. So, in practice, we
adopt the following criterion in place of (3.57):
$$\big| V^{[i+1]}(x^{[i+1]}(k)) - V^{[i]}(x^{[i]}(k)) \big| \le \varepsilon. \qquad (3.59)$$
3.3.3 Neural Network Implementation of the Iteration ADP Algorithm

The nonlinear optimal control solution relies on solving the HJB equation, whose
exact solution is generally impossible to obtain for nonlinear time-delay systems. So
we employ neural networks to approximate $u^{[i]}(k)$ and $J^{[i+1]}(x(k))$ in this section.
Assume the number of hidden layer neurons is denoted by l, the weight matrix
between the input layer and hidden layer is denoted by V , the weight matrix between
the hidden layer and output layer is denoted by W , then the output of three-layer
neural network is represented by
$$\hat{F}(X) = W^T \sigma(V^T X),$$
where $\sigma(V^T X) \in \mathbb{R}^l$ and $[\sigma(z)]_i = \dfrac{e^{z_i} - e^{-z_i}}{e^{z_i} + e^{-z_i}}$, $i = 1, 2, \ldots, l$, is the
activation function. The gradient descent rule is adopted for the weight update of each
neural network.
Here there are two networks, namely the critic network and the action network. Both
are chosen as three-layer back-propagation (BP) neural networks. The whole structure
diagram is shown in Fig. 3.1.
Fig. 3.1 The whole structure diagram (action network, critic network and plant)
The critic network is used to approximate the performance index function V [i+1]
(x(k)). The output of the critic network is denoted as
$$V^{[i+1]}(x(k)) = x^T(k)Qx(k) + \hat{u}^{[i+1]T}(k)R\hat{u}^{[i+1]}(k) + \hat{V}^{[i]}(x(k+1)). \qquad (3.62)$$
Then we define the error function for the critic network as follows:
$$E_c^{[i+1]}(k) = \frac{1}{2}\big(e_c^{[i+1]}(k)\big)^2. \qquad (3.64)$$
So the gradient-based weight update rule for the critic network is given by the
increments
$$\Delta w_c^{[i+1]}(k) = -\alpha_c \frac{\partial E_c^{[i+1]}(k)}{\partial w_c^{[i+1]}(k)}, \qquad (3.67)$$
$$\Delta v_c^{[i+1]}(k) = -\alpha_c \frac{\partial E_c^{[i+1]}(k)}{\partial v_c^{[i+1]}(k)}. \qquad (3.68)$$
In the action network, the states $x(k), x(k-h_1), \ldots, x(k-h_l)$ are used as inputs to
create the optimal control, with $\hat{u}^{[i]}(k)$ as the output of the network.
The weights in the action network are updated to minimize the following performance
error measure:
$$E_a^{[i]}(k) = \frac{1}{2}\, e_a^{[i]T}(k)\, e_a^{[i]}(k). \qquad (3.71)$$
The weight updating algorithm is similar to the one for the critic network. By the
gradient descent rule, we can obtain the increments
$$\Delta w_a^{[i]}(k) = -\alpha_a \frac{\partial E_a^{[i]}(k)}{\partial w_a^{[i]}(k)}, \qquad (3.74)$$
$$\Delta v_a^{[i]}(k) = -\alpha_a \frac{\partial E_a^{[i]}(k)}{\partial v_a^{[i]}(k)}. \qquad (3.75)$$
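For concreteness, the following snippet is a minimal sketch (with assumed layer sizes and placeholder data, not the book's implementation) of one gradient-descent update of the critic weights in the spirit of (3.62), (3.64) and (3.67)–(3.68); the action network is trained in the same way with its own error measure (3.71).

```python
import numpy as np

rng = np.random.default_rng(2)

n_in, l_hid = 2, 8                 # input size and number of hidden neurons (assumed)
Vc = rng.uniform(-0.1, 0.1, (n_in, l_hid))   # input-to-hidden weights v_c
Wc = rng.uniform(-0.1, 0.1, (l_hid, 1))      # hidden-to-output weights w_c
alpha_c = 0.05                               # learning rate

# One training pair: critic input and its target value from (3.62)
# (the numbers below are placeholders, not data from the book).
X = np.array([[0.5, -0.3]])
V_target = np.array([[1.7]])

# Forward pass through the three-layer tanh network, error e_c and E_c = 0.5 e_c^2 (cf. (3.64)).
h = np.tanh(X @ Vc)
V_hat = h @ Wc
e_c = V_hat - V_target
E_c = 0.5 * (e_c ** 2).item()

# Backward pass: gradients of E_c, then the updates (3.67)-(3.68).
dWc = h.T @ e_c                              # dE_c / dW_c
dVc = X.T @ ((e_c @ Wc.T) * (1.0 - h ** 2))  # dE_c / dV_c through the tanh layer
Wc -= alpha_c * dWc
Vc -= alpha_c * dVc
print("E_c before the update:", E_c)
```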
3.4 Simulation Study

[Figure: performance index function versus iteration steps]
Example 3.1 Consider the nonlinear time-delay system (3.76). We give the initial
states as $\chi_1(-2) = \chi_1(-1) = \chi_1(0) = 1.5$ and the initial control policy as
$\beta(t) = \sin^{-1}(x(t+1) - x(t-2)) - 0.1x^2(t)$. We implement the proposed algorithm
at the time instant $k = 3$.
First, according to the initial control policy $\beta(t) = \sin^{-1}(x(t+1) - x(t-2)) - 0.1x^2(t)$
of system (3.76), we give the first group of state data: $x(1) = 0.8$, $x(2) = 0.7$,
$x(3) = 0.5$, $x(4) = 0$. We can also get the second group of state data: $x(1) = 1.4$,
$x(2) = 1.2$, $x(3) = 1.1$, $x(4) = 0.8$, $x(5) = 0.7$, $x(6) = 0.5$, $x(7) = 0$. Obviously,
for the first sequence of states we can get the optimal controller by Case 1 of the
proposed algorithm. For the second one, the optimal controller can be obtained by
Case 2 of the proposed algorithm, and the optimal control sequence $U^o(k+1, k+j+1)$
can be obtained from the first group of state data. We select $Q = R = 1$.
Three-layer BP neural networks are used for the critic network and the action
network, with structures 2–8–1 and 6–8–1, respectively. The number of weight-update
iterations for the two neural networks is 200. The initial weights are chosen randomly
from $(-0.1, 0.1)$, and the learning rates are $\alpha_a = \alpha_c = 0.05$. The performance index
trajectories for the first and second state data groups are shown in Figs. 3.2 and 3.3,
respectively. According to Theorem 3.2, for the first state group the performance index
is decreasing for $i > 0$, and for the second state group it is decreasing for $i \ge 0$. The
state trajectory and the control trajectory for the second state data group are shown in
Figs. 3.4 and 3.5. From the figures, we can see that the system is asymptotically stable.
The simulation study shows that the new iteration ADP algorithm is feasible.
[Figure: performance index function versus iteration steps]
Fig. 3.4 The state trajectory using the second state data group
Example 3.2 To demonstrate the effectiveness of the proposed iteration algorithm
in this chapter, we give a more substantial application: the ball-and-beam experiment.
A ball is placed on a beam, as shown in Fig. 3.6.
Fig. 3.5 The control trajectory using the second state data group
The beam angle $\alpha$ can be expressed in terms of the servo gear angle $\theta$ as $\alpha \approx \dfrac{2d}{L}\theta$.
The equation of motion for the ball is given as
$$\Big(\frac{M}{R^2} + m\Big)\ddot{r} + mg\sin\alpha - mr(\dot{\alpha})^2 = 0, \qquad (3.77)$$
where r is the ball position coordinate. The mass of the ball m = 0.1 kg, the radius
of the ball R = 0.015 m, the radius of the lever gear d = 0.03 m, the length of the
beam $L = 1.0$ m, and the ball's moment of inertia $M = 10^{-5}$ kg·m². Given the time
step $h$, let $r(t) = r(th)$, $\alpha(t) = \alpha(th)$ and $\theta(t) = \theta(th)$; then Eq. (3.77)
is discretized as
$$\begin{cases} x(t+1) = x(t) + y(t) - A\sin\Big(\dfrac{2d}{L}\theta(t)\Big) + Bx(t)\big(\theta(t) - z(t)\big)^2, \\
y(t+1) = y(t) - A\sin\Big(\dfrac{2d}{L}\theta(t)\Big) + Bx(t)\big(\theta(t) - z(t)\big)^2, \\
z(t+1) = \theta(t), \end{cases} \qquad (3.78)$$
where $A = \dfrac{mgh^2R^2}{M + mR^2}$ and $B = \dfrac{4d^2mR^2}{L^2(M + mR^2)}$. The state is
$X(t) = (x(t), y(t), z(t))^T$, in which $x(t) = r(t)$, $y(t) = r(t) - r(t-1)$ and
$z(t) = \theta(t-1)$. The control input is $u(t) = \theta(t)$. For the convenience of analysis,
system (3.78) is rewritten as follows:
$$\begin{cases} x(t+1) = x(t-2) + y(t) - A\sin\Big(\dfrac{2d}{L}\theta(t)\Big) + Bx(t)\big(\theta(t) - z(t)\big)^2, \\
y(t+1) = y(t) - A\sin\Big(\dfrac{2d}{L}\theta(t)\Big) + Bx(t)\big(\theta(t) - z(t-2)\big)^2, \\
z(t+1) = \theta(t). \end{cases} \qquad (3.79)$$
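As a hedged sketch only (written from the equations and parameters quoted above, not taken from the book; the value of $g$ and the test inputs in the final call are assumptions), one step of the discretized ball-and-beam model (3.79) can be coded as:

```python
import numpy as np

# Parameters from the example; g = 9.81 is an assumption (not listed in the text).
m, R, d, L, M, g, h = 0.1, 0.015, 0.03, 1.0, 1e-5, 9.81, 0.1
A = m * g * h**2 * R**2 / (M + m * R**2)
B = 4 * d**2 * m * R**2 / (L**2 * (M + m * R**2))

def step(x, x_d2, y, z, z_d2, theta):
    """Advance (x, y, z) one step of (3.79); x_d2 and z_d2 are the delayed values
    x(t-2) and z(t-2), and theta is the control u(t)."""
    x_next = x_d2 + y - A * np.sin(2 * d / L * theta) + B * x * (theta - z) ** 2
    y_next = y - A * np.sin(2 * d / L * theta) + B * x * (theta - z_d2) ** 2
    z_next = theta
    return x_next, y_next, z_next

# Placeholder evaluation around the chapter's initial data.
print(step(x=1.0, x_d2=1.0027, y=0.0016, z=1.0, z_d2=1.0, theta=1.0))
```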
In this chapter, $h$ is selected as 0.1, and the states of the time-delay system (3.79) are
$X(1) = [1.0027, 0.0098, 1]^T$, $X(2) = [0.0000, 0.0057, 1.0012]^T$, $X(3) = [1.0000, 0.0016, 1.0000]^T$,
$X(4) = [1.0002, -0.0025, 0.9994]^T$ and $X(5) = [0, 0, 0]^T$. The initial states are
$\chi(-2) = [0.9929, 0.0221, 1.0000]^T$, $\chi(-1) = [-0.0057, 0.0180, 1.0000]^T$ and
$\chi(0) = [0.9984, 0.0139, 1.0000]^T$. The initial control sequence is
$(1.0000, 1.0012, 1.0000, 0.9994, 0.0000)$. Obviously, the initial control sequence
and states are not the optimal ones, so the algorithm proposed in this chapter is
adopted to obtain the optimal solution. We select $Q = R = 1$. The number of
weight-update iterations for the two neural networks is 200. The initial weights of the
critic network are chosen randomly from $(-0.1, 0.1)$, the initial weights of the action
network are chosen randomly from $[-2, 2]$, and the learning rates are $\alpha_a = \alpha_c = 0.001$.
Consider the states $X(4) = [1.0002, -0.0025, 0.9994]^T$ and $X(1) = [1.0027, 0.0098, 1]^T$.
Obviously, for the state $X(4)$ we can get the optimal controller by Case 1 of the
proposed algorithm, and for the state $X(1)$ the optimal controller can be obtained by
Case 2 of the proposed algorithm. We then obtain the performance index function
trajectories of the two states as shown in Figs. 3.7 and 3.8, which satisfy Theorem 3.2,
i.e., for the state $X(4)$ the performance index is decreasing for $i > 0$, and for the state
$X(1)$ it is decreasing for $i \ge 0$. The state trajectories and the control trajectory of the
state $X(1)$ are shown in Figs. 3.9, 3.10, 3.11 and 3.12. From the figures, we can see
that the states of the system are asymptotically stable. Based on the above analysis,
we can conclude that the proposed iteration ADP algorithm is satisfactory.
[Figs. 3.7 and 3.8: the performance index function versus iteration steps for states X(4) and X(1); Figs. 3.9–3.11: the trajectories of the states x, y and z versus time steps; Fig. 3.12: the control trajectory versus time steps]
3.5 Conclusion
This chapter proposed a novel ADP algorithm to deal with the nearly finite-horizon optimal control problem for a class of deterministic nonaffine time-delay nonlinear systems. To determine the optimal state, state updating was included in the novel ADP algorithm. The theorems showed that the proposed iterative algorithm is convergent. Moreover, the relationship between the iteration steps and the time steps was given. The simulation study has demonstrated the effectiveness of the proposed control algorithm.
References
1. Niculescu, S.: Delay Effects on Stability: A Robust Control Approach. Springer, Berlin (2001)
2. Gu, K., Kharitonov, V., Chen, J.: Stability of Time-Delay Systems. Birkhäuser, Boston (2003)
3. Song, R., Zhang, H., Luo, Y., Wei, Q.: Optimal control laws for time-delay systems with
saturating actuators based on heuristic dynamic programming. Neurocomputing 73(16–18),
3020–3027 (2010)
4. Huang, J., Lewis, F.: Neural-network predictive control for nonlinear dynamic systems with
time-delay. IEEE Trans. Neural Netw. 14(2), 377–389 (2003)
5. Chyung, D.: On the controllability of linear systems with delay in control. IEEE Trans. Autom.
Control 15(2), 255–257 (1970)
6. Phat, V.: Controllability of discrete-time systems with multiple delays on controls and states.
Int. J. Control 49(5), 1645–1654 (1989)
7. Chyung, D.: Discrete optimal systems with time delay. IEEE Trans. Autom. Control 13(1), 117
(1968)
8. Chyung, D., Lee, E.: Linear optimal systems with time delays. SIAM J. Control 4(3), 548–575
(1966)
9. Wang, F., Jin, N., Liu, D., Wei, Q.: Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans. Neural Netw. 22, 24–36 (2011)
10. Manu, M., Mohammad, J.: Time-Delay Systems Analysis, Optimization and Applications.
North-Holland, New York (1987)
Chapter 4
Multi-objective Optimal Control
for Time-Delay Systems
4.1 Introduction
For a class of unknown discrete time nonlinear systems the multi-objective opti-
mal control problem was discussed in [1]. In [2], for nonaffine nonlinear unknown
discrete-time systems an optimal control scheme with discount factor was developed.
However, as far as we know, how to obtain the multi-objective optimal control solu-
tion of nonlinear time-delay systems based on ADP algorithm is still an intractable
problem. This chapter will discuss this difficult problem explicitly. First, the simple
objective optimal control problem is obtained by the weighted sum technology. Then,
the iterative ADP optimal control method of time-delay systems is established, and
the convergence analysis is presented. The neural network implementation program
is also given. At last, for illustrating the control effect of the proposed multi-objective
optimal control method, two simulation examples are introduced.
The rest of this chapter is organized as follows. Section 4.2 presents the problem formulation. Section 4.3 develops the multi-objective optimal control scheme
and gives the corresponding convergence proof. Section 4.4 presents the implementation process by neural networks. Section 4.5 gives examples to demonstrate the validity of the proposed control scheme. Section 4.6 draws the conclusion.
where u(t) ∈ R^m, and the states x(t), x(t − h) ∈ R^n with x(t) = 0 for t ≤ 0. F(x(t), x(t − h), u(t)) = f(x(t), x(t − h)) + g(x(t), x(t − h))u(t) is the unknown continuous function.
For system (4.1), we will consider the following multi-objective optimal control problem
where P_i(x(t), u(t)) is the ith performance index function, defined as the following expression
P_i(x(t), u(t)) = \sum_{j=t}^{\infty}\big[L_i(x(j)) + u^T(j)R_iu(j)\big],   (4.3)
The weighted-sum performance index function is then defined as
P(x(t)) = \sum_{i=1}^{h} w_iP_i(x(t)),   (4.4)
where W = [w_1, w_2, \ldots, w_h]^T, w_i \ge 0 and \sum_{i=1}^{h} w_i = 1.
So we have the following expression
P(x(t)) = \sum_{i=1}^{h} w_iI_i(t) + \sum_{i=1}^{h} w_iP_i(x(t+1)) = W^TI(t) + P(x(t+1)),   (4.5)
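As a small illustration of the weighted-sum combination in (4.4) and (4.5), the following sketch assumes that the utility vector I(t) collects the per-objective terms L_i(x) + u^T R_i u (the symbol I_i(t) appears in (4.5) but is not defined in the surrounding text, so this particular form is an assumption).

import numpy as np

def stage_cost_vector(x, u, L_list, R_list):
    # Assumed per-objective utilities I_i(t) = L_i(x) + u^T R_i u
    return np.array([L(x) + u @ R @ u for L, R in zip(L_list, R_list)])

def weighted_stage_cost(x, u, W, L_list, R_list):
    # Scalarized utility W^T I(t) appearing in (4.5); W >= 0 and sum(W) = 1
    return W @ stage_cost_vector(x, u, L_list, R_list)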
In fact, the first-order necessary condition for the optimal control policy u^*(t) follows from Bellman's principle of optimality, i.e.,
u^*(t) = -\frac{1}{2}\Big(\sum_{i=1}^{h} w_iR_i\Big)^{-1}\Big(\frac{\partial F}{\partial u}\Big)^T\sum_{i=1}^{h} w_i\frac{\partial P_i}{\partial F}.   (4.8)
Then the optimal performance index function satisfies
P^*(x(t)) = \sum_{i=1}^{h} w_iL_i(x(t)) + \sum_{i=1}^{h} w_iP_i^*(x(t+1)) + \frac{1}{4}\Big[\Big(\sum_{i=1}^{h} w_iR_i\Big)^{-1}\Big(\frac{\partial F}{\partial u}\Big)^T\sum_{i=1}^{h} w_i\frac{\partial P_i}{\partial F}\Big]^T\Big(\sum_{i=1}^{h} w_iR_i\Big)\Big[\Big(\sum_{i=1}^{h} w_iR_i\Big)^{-1}\Big(\frac{\partial F}{\partial u}\Big)^T\sum_{i=1}^{h} w_i\frac{\partial P_i}{\partial F}\Big].   (4.9)
The aim of this chapter is to obtain the solution of the optimal control problem (4.14), so we propose the following ADP algorithm. Start from i = 0 and define the initial iterative performance index function P^{[0]}(·) = 0. Then we can calculate the initial iterative control vector u^{[0]}(t):
u^{[0]}(t) = \arg\inf_{u(t)}\big\{W^TI(x(t), u(t))\big\}.   (4.10)
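The iterative relations (4.11)–(4.14) referred to below are not reproduced in the text above. Assuming the standard value-iteration structure that is consistent with (4.10) and (4.16), a minimal sketch of the recursion on a finite set of sampled states and candidate controls could look as follows; the system map F, the scalarized utility W^T I(x, u) and the sampling scheme are user-supplied placeholders.

import numpy as np

def adp_value_iteration(states, controls, F, utility, n_iter=50):
    # Hedged sketch: P^[0] = 0, then alternate greedy control and value update,
    # mirroring the structure suggested by (4.10) and (4.16).
    P = {s: 0.0 for s in states}
    policy = {}
    for _ in range(n_iter):
        P_new = {}
        for s in states:
            # evaluate W^T I(s, u) + P^[i](s') for every candidate control u
            costs = [utility(s, u) + P.get(F(s, u), 0.0) for u in controls]
            k = int(np.argmin(costs))
            policy[s], P_new[s] = controls[k], costs[k]
        P = P_new
    return P, policy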
In the next part, we will prove that the proposed algorithm is convergent and that the closed-loop system is asymptotically stable.

Lemma 4.1 Let {μ^{[i]}(t)} be an arbitrary control sequence, and let u^{[i]}(t) be obtained from (4.13). The iterative performance index function P^{[i]} is obtained from (4.14), and Υ^{[i]} is defined as
Υ^{[i+1]}(x(t)) = W^TI(x(t), μ^{[i]}(t)) + Υ^{[i]}(x(t+1)),   (4.16)
where x(t+1) is obtained under μ^{[i]}(t). If the initial values of P^{[i]} and Υ^{[i]} are the same for i = 0, then P^{[i+1]}(x(t)) ≤ Υ^{[i+1]}(x(t)), ∀i.

Lemma 4.2 Let the iterative performance index function P^{[i]}(x(t)) be obtained from (4.14). Then P^{[i]}(x(t)) is bounded, i.e., 0 ≤ P^{[i+1]}(x(t)) ≤ B, ∀i, where B is a positive constant.

Theorem 4.1 Let the iterative performance index function P^{[i]}(x(t)) be obtained from (4.14). Then we have P^{[i+1]}(x(t)) ≥ P^{[i]}(x(t)), ∀i.
Proof Let Θ^{[i]}(x(t)) be defined as follows:
The proof proceeds by mathematical induction; first, the conclusion is established for i = 0.
Suppose that the conclusion holds for i − 1, i.e., P^{[i]}(x(t)) ≥ Θ^{[i−1]}(x(t)), ∀x(t). Then we have
and
Θ^{[i]}(x(t)) = W^TI(x(t), u^{[i]}(t)) + Θ^{[i−1]}(x(t+1)).   (4.21)
Thus,
P^{[i+1]}(x(t)) − Θ^{[i]}(x(t)) = P^{[i]}(x(t+1)) − Θ^{[i−1]}(x(t+1)) ≥ 0.   (4.22)
So, we have
P^{[i+1]}(x(t)) ≥ Θ^{[i]}(x(t)), ∀i.   (4.23)
Furthermore, it is known from Lemma 4.1 that P^{[i]}(x(t)) ≤ Θ^{[i]}(x(t)). So the conclusion can be drawn that
Hence, as i → ∞, P^{[i]}(x(t)) → P^*(x(t)), and u^{[i]}(t) → u^*(t) accordingly.
Theorem 4.2 Let the iterative performance index function P^{[i]}(x(t)) be obtained from (4.14), let u^*(t) be expressed as (4.13), and let P^*(x(t)) be expressed as (4.14). Then u^*(t) stabilizes the system (4.1) asymptotically.
Remark 4.1 In fact, the proposed method is a further development of [3]. In [3], although the multi-objective control problem is considered, the time-delay state of the nonlinear system is not taken into account, and in [1] the multi-objective optimal control problem of time-delay systems is not discussed. In this chapter, the multi-objective optimal control problem of time-delay systems is solved successfully, so this chapter extends the traditional ADP literature.
There are two neural networks in the ADP algorithm, i.e., the critic network and the action network. The approximate performance index function P^{[i+1]}(x(t)) is obtained from the critic network. The error measure for training the critic network is defined as
Y_c^{[i+1]}(t) = \frac{1}{2}e_c^{[i+1]T}(t)e_c^{[i+1]}(t).   (4.29)
Thus the critic network weights are updated by
where
\Delta v_c^{[i+1]}(t) = -\beta_c\frac{\partial Y_c^{[i+1]}(t)}{\partial v_c^{[i+1]}(t)}.   (4.31)
And
where
\Delta w_c^{[i+1]}(t) = -\beta_c\frac{\partial Y_c^{[i+1]}(t)}{\partial w_c^{[i+1]}(t)}.   (4.33)
To obtain the update rule of the action network, we define the following performance error measure
Y_a^{[i]}(t) = \frac{1}{2}e_a^{[i]T}(t)e_a^{[i]}(t).   (4.36)
The action network weights are updated by
where
\Delta v_a^{[i]}(t) = -\beta_a\frac{\partial Y_a^{[i]}(t)}{\partial v_a^{[i]}(t)}.   (4.38)
And
where
\Delta w_a^{[i]}(t) = -\beta_a\frac{\partial Y_a^{[i]}(t)}{\partial w_a^{[i]}(t)}.   (4.40)
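Since the intermediate equations (4.25)–(4.28), (4.30), (4.32), (4.34)–(4.35), (4.37) and (4.39) are not reproduced above, the sketch below only illustrates a generic gradient-descent step for a three-layer BP network trained on a squared error of the form (4.29)/(4.36); the layer sizes, target construction and activation choice are assumptions, not the exact update rules of this chapter.

import numpy as np

def mlp(W_in, W_out, x):
    # Three-layer BP network: input -> tanh hidden layer -> linear output
    return W_out.T @ np.tanh(W_in.T @ x)

def gradient_step(W_in, W_out, x, target, lr):
    # One gradient-descent step on Y = 0.5*||e||^2 with e = output - target,
    # a generic stand-in for the updates (4.30)-(4.33) and (4.37)-(4.40)
    h = np.tanh(W_in.T @ x)
    e = W_out.T @ h - target
    W_out -= lr * np.outer(h, e)                          # hidden-to-output weights
    W_in  -= lr * np.outer(x, (W_out @ e) * (1 - h**2))   # input-to-hidden weights
    return W_in, W_out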
Example 4.1 To illustrate the detailed implementation procedure of the presented method, we discuss the following nonlinear time-delay system [4]:
where
f(x(t), x(t-2)) = \begin{bmatrix} 0.2x_1(t)\exp\big((x_2(t))^2\big)x_2(t-2) \\ 0.3(x_2(t))^2x_1(t-2) \end{bmatrix},
and
g(x(t), x(t-1), x(t-2)) = \begin{bmatrix} x_2(t-2) & 0.2 \\ 0.1 & 1 \end{bmatrix}.
The performance index functions P_i(x(t)) are similar to those in [1], and are defined as follows:
P_1(x(t)) = \sum_{k=t}^{\infty}\big[\ln(x_1^2(k) + 1) + u^2(k)\big],   (4.42)
P_2(x(t)) = \sum_{k=t}^{\infty}\big[\ln(x_2^2(k) + 1) + u^2(k)\big],   (4.43)
and
P_3(x(t)) = \sum_{k=t}^{\infty}\big[\ln(x_1^2(k) + x_2^2(k) + 1) + u^2(k)\big].   (4.44)
According to the requirements on the system performance, the weight vector W is selected as W = [0.1, 0.3, 0.6]^T. To implement the established algorithm, we let the initial state of time-delay system (4.41) be x_0 = [0.5, −0.5]^T. We also use neural networks to implement the iterative ADP algorithm. The critic network is a three-layer BP neural network, which is used to approximate the iterative performance index function. Another three-layer BP neural network is used as the action network, which approximates the iterative control. To increase the neural network accuracy, the two networks are trained for 1000 steps, and the regulation parameters are both 0.001. Then we get the iterative performance index function as shown in Fig. 4.1, which converges to a constant. The system runs for 200 steps, and the simulation results are obtained. The state trajectories are shown in Fig. 4.2, and the control input trajectories are shown in Fig. 4.3. It is quite clear that the time-delay system converges after about 60 time steps. Thus, the iterative multi-objective optimal control method constructed in this chapter has good performance.
[Fig. 4.1: the iterative performance index function versus iteration steps; Fig. 4.2: the state trajectories x1, x2 versus time steps; Fig. 4.3: the control input trajectories u1, u2 versus time steps]
P_1(x(t)) = \sum_{k=t}^{\infty}\big[\ln(x_1^2(k) + x_2^2(k)) + u^2(k)\big],   (4.46)
P_2(x(t)) = \sum_{k=t}^{\infty}\big[\ln(x_2^2(k) + x_3^2(k)) + u^2(k)\big],   (4.47)
P_3(x(t)) = \sum_{k=t}^{\infty}\big[\ln(x_1^2(k) + x_3^2(k)) + u^2(k)\big],   (4.48)
and
P_4(x(t)) = \sum_{k=t}^{\infty}\big[\ln(x_1^2(k) + 1) + u^2(k)\big].   (4.49)
The weight vector is W = [0.1, 0.3, 0.1, 0.5]^T. Then the performance index function P(x(t)) can be obtained based on the weighted-sum method. The initial state is selected as x_0 = [0.5, −0.5, 1]^T. The structures, the learning rates and the initial weights of the critic network and action network are the same as in Example 4.1. After 10 iteration steps, the performance index function trajectory is obtained as in Fig. 4.4. The state trajectories are shown in Fig. 4.5, and the control trajectory is given in Fig. 4.6.
[Fig. 4.4: the performance index function versus iteration steps; Fig. 4.5: the state trajectories x1, x2, x3 versus time steps; Fig. 4.6: the control trajectory versus time steps; Figs. 4.7 and 4.8: the state and control trajectories obtained with the method in [1]]
To further illustrate the performance, the method in [1] has been used to obtain the optimal control of system (4.45). After 1500 time steps, the system states and control converge to zero, as shown in Figs. 4.7 and 4.8. From the comparison, we can see that the convergence speed of the method presented in this chapter is faster, and its performance is better than that of the method in [1].
4.6 Conclusions
This chapter aimed at nonlinear time-delay systems and solved the multi-objective optimal control problem using the presented ADP method. By the weighted-sum technique, the original multi-objective optimal control problem was converted into a single-objective one. Then an ADP method was established for the considered nonlinear time-delay systems, and the convergence analysis proved that the iterative performance index functions converge to the optimal one. The critic and action networks were used to obtain the iterative performance index function and the iterative control policy, respectively. In the simulation, multiple performance index functions were given, and the results achieved using the proposed method illustrate the validity and effectiveness of the proposed multi-objective optimal control method.
References
1. Wei, Q., Zhang, H., Dai, J.: Model-free multiobjective approximate dynamic programming for
discrete-time nonlinear systems with general performance index functions. Neurocomputing
72(7–9), 1839–1848 (2009)
2. Wang, D., Liu, D., Wei, Q., Zhao, D., Jin, N.: Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8), 1825–1832
(2012)
3. Song, R., Xiao, W., Zhang, H.: Multi-objective optimal control for a class of unknown nonlinear
systems based on finite-approximation-error ADP algorithm. Neurocomputing 119(7), 212–221
(2013)
4. Zhang, H., Song, R., Wei, Q., Zhang, T.: Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans. Neural Netw. 22(12), 1851–1862 (2011)
5. Gyurkovics, É., Takács, T.: Quadratic stabilisation with H∞ -norm bound of non-linear discrete-
time uncertain systems with bounded control. Syst. Control Lett. 50, 277–289 (2003)
Chapter 5
Multiple Actor-Critic Optimal
Control via ADP
5.1 Introduction
possible responses into different categories based on context and match with previ-
ously encountered situations. These categories can be viewed as representing stored
behavior responses or patterns. Werbos [3, 4] discusses novel ADP structures with
multiple critic levels based on functional regions in the human cerebral cortex. The
interplay between using previously stored experiences to quickly decide possible
actions, and real-time exploration and learning for precise control is emphasized.
This agrees with the work in [1, 2], which details how stored behavior patterns can be used to enact fast decisions by drawing on previous experiences when there is a match between observed attributes and stored patterns. In the event of risk, mismatch, or
anxiety, higher-level control mechanisms are recruited by the ACC that involve more
focus on real-time observations and exploration.
There are existing approaches to adaptive control and ADP that take into account
some features of these new studies. Included is the multiple-model adaptive control
method of [5]. Multiple actor-critic structures were developed by [5–8]. These works
used multiple modules, each of which contains a stored dynamics model of the
environment and a controller. Incoming data were classified into modules based on
prediction errors between observed data and the prediction of the dynamics model
in each module.
In industrial process control, the system dynamics are difficult and expensive to
estimate and cannot be accurately obtained [9]. Therefore, it is difficult to design
optimal controllers for these unknown systems. On the other hand, new imperatives
in minimizing resource use and pollution, while maximizing throughput and yield,
make it important to control industrial processes in an optimal fashion. Using proper
control techniques, the input-output data generated in the process of system operation
can be accessed and used to design optimal controllers for unknown systems [10].
Recently, many studies have been done on data-based control schemes for unknown
systems [11–17]. For large-scale industrial processes, there are usually different
production performance measures according to desired process properties. Therefore,
more extensive architectures are needed to design optimal controllers for large-scale
industrial processes, which respond quickly based on previously learned behavior
responses, allow adaptive online learning to guarantee real-time performance, and
accommodate different performance measures for different situations.
Such industrial imperatives show the need for more comprehensive approaches to
data-driven process control. Multiple performance objectives may be needed depend-
ing on different important features hidden in observed data. This requires different
control categories that may not only depend on closeness of match with predictions
based on stored dynamics models as used in [2, 18]. The work of Levine [19] and
Werbos [20] indicates that more complex structures are responsible for learning in
the human brain than the standard three-level actor-critic based on networks for actor,
model, and critic.
It has been shown [1, 2] that shunting inhibition is a powerful computational
mechanism and plays an important role in sensory neural information processing
systems [21]. Bouzerdoum [22] proposed the shunting inhibitory artificial neural
network (SIANN) which can be used for highly effective classification and func-
tion approximation. In SIANN the synaptic interactions among neurons are medi-
ated via shunting inhibition. Shunting inhibition is more powerful than multi-layer
perceptrons in that each shunting neuron has a higher discrimination capacity than
a perceptron neuron and allows the network to construct complex decision surfaces
much more readily. Therefore, SIANN is a powerful and effective classifier. In [23],
efficient training algorithms for a class of shunting inhibitory convolution neural
networks are presented.
In this chapter we bring together the recent studies in neurocognitive psychology
[19, 20] and recent mathematical machinery for implementing shunting inhibition
[21, 22] to develop a new class of process controllers that have a novel multiple actor
critic structure. This multiple network structure has learning on several timescales.
Stored experiences are first used to train a SIANN to classify environmental cues into
different behavior performance categories according to different technical require-
ments. This classification is general, and does not only depend on match of observe
data with stored dynamics models. Then, data observed in real-time are used to
learn deliberative responses for more precise online optimal control. This results in
faster immediate control using stored experiences, along with real-time deliberative
learning.
The contributions of this chapter are as follows.
(1) Based on neurocognitive psychology, a novel controller based on multiple actor-
critic structures is developed for unknown systems. This controller trades off fast
actions based on stored behavior patterns with real-time exploration using current
input-output data.
(2) A SIANN is used to classify input-output data into categories based on salient
features of previously recorded data. Shunting inhibition allows for higher dis-
criminatory capabilities and has been shown important in neural information
processing.
(3) In each category, an RNN is used to identify the system dynamics, novel parameter update algorithms are given, and it is proven that the parameter errors are uniformly ultimately bounded (UUB).
(4) Action-critic networks are developed in each category to obtain the optimal per-
formance measure function and the optimal controller based on current observed
input-output data. It is proven that the closed-loop systems and the weight errors
are UUB based on Lyapunov techniques.
The rest of the chapter is organized as follows. In Sect. 5.2, the problem moti-
vations and preliminaries are presented. In Sect. 5.3, the SIANN architecture-based
category technique is developed. In Sect. 5.4, the model network, critic network and
action network are introduced. In Sect. 5.5, two examples are given to demonstrate
the effectiveness of the proposed optimal control scheme. In Sect. 5.6, the conclusion
is drawn.
For most complex industrial systems, it is difficult, time-consuming, and expensive to identify an accurate mathematical model. Therefore, optimal controllers are difficult
to design. Motivated by this problem, a multiple actor-critic structure is proposed
to obtain optimal controllers based on measured input-output data. This allows the
construction of adaptive optimal control systems for industrial processes that are
responsive to changing plant conditions and performance requirements.
This structure uses fast classification based on stored memories, such as that
occurs in the amygdala and orbitofrontal cortex (OFC) in [1, 2]. It also allows for
real-time exploration and learning, such as occurs in the ACC in [1, 2]. This structure
conforms to the ideas of [3], which stress the importance of an integrated utilization
of stored memories and real-time exploration.
The overall multiple actor-critic structure is shown in Fig. 5.1. A SIANN is trained
to classify previously recorded data into different memory categories [24]. In Fig. 5.1,
xi stands for the measured system data, which are also the input of the SIANN. The
trained SIANN is used to classify data recorded in real time into categories, and its
output y = 1, 2, . . . , L, is the category label. Within each category, ADP is used to
establish the optimal control. If the output of SIANN is y = i, then the ith ADP
structure is activated. For each ADP, an RNN is trained to model the dynamics.
Then, the critic network and the action network of that ADP are used to determine
the performance measure function and the optimal control. The actor-critic structure
in the active category is tuned online based on real-time recorded data. The details
are given in the next sections.
In industrial process control, there are usually different production performance mea-
sures according to different desired process properties. Therefore, the input-output
data should be classified into different categories based on various features inherent
in the data. It has been shown by [1] that shunting inhibition is important to explain
the manner in which the amygdala and OFC classify data based on environmental
cues. In this section, the structure and updating method are introduced for a SIANN
[21] that is used to classify the measured data into different control response cat-
egories depending on match between certain attributes of the data and previously
stored experiences. In this chapter, SIANN consists of one input layer, one hidden
layer and one output layer. In the hidden units, the shunting inhibition is the basic
synaptic interaction. Generalized shunting neurons are used in the hidden layer of
the SIANN. The structure of each neuron is shown in Fig. 5.2 and given as follows
[21].
The output of each shunting neuron is expressed as
s_j = \frac{g_h\big(w_j^Tx + w_{j0}\big) + b_j}{a_j + f_h\big(c_j^Tx + c_{j0}\big)},   (5.1)
and the SIANN output is
y = \sum_{j=1}^{n} v_js_j + d = v^Ts + d.   (5.2)
The training error measure is defined as
E = \frac{1}{2}e^2.   (5.3)
According to the gradient descent algorithm, the updates for the parameters are given by backpropagation as follows [25]:
\dot{v} = -\gamma_v\frac{\partial E}{\partial e}\frac{\partial y}{\partial v} = -\gamma_ves,   (5.4)
\dot{d} = -\gamma_de,   (5.5)
\dot{w}_j = -\gamma_w\frac{\partial E}{\partial e}\frac{\partial y}{\partial s_j}\frac{\partial s_j}{\partial w_j} = -\gamma_wev_j\frac{g_h'\big(w_j^Tx + w_{j0}\big)x^T}{a_j + f_h\big(c_j^Tx + c_{j0}\big)},   (5.6)
\dot{b}_j = -\gamma_b\frac{\partial E}{\partial e}\frac{\partial y}{\partial s_j}\frac{\partial s_j}{\partial b_j} = -\gamma_bev_j\frac{1}{a_j + f_h\big(c_j^Tx + c_{j0}\big)},   (5.7)
\dot{a}_j = -\gamma_a\frac{\partial E}{\partial e}\frac{\partial y}{\partial s_j}\frac{\partial s_j}{\partial a_j} = -\gamma_aev_j\frac{-\big(g_h\big(w_j^Tx + w_{j0}\big) + b_j\big)}{\big(a_j + f_h\big(c_j^Tx + c_{j0}\big)\big)^2},   (5.8)
\dot{c}_j = -\gamma_c\frac{\partial E}{\partial e}\frac{\partial y}{\partial s_j}\frac{\partial s_j}{\partial c_j} = -\gamma_cev_j\frac{-\big(g_h\big(w_j^Tx + w_{j0}\big) + b_j\big)}{\big(a_j + f_h\big(c_j^Tx + c_{j0}\big)\big)^2}f_h'\big(c_j^Tx + c_{j0}\big)x^T.   (5.9)
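A compact sketch of the generalized shunting neuron (5.1) and the SIANN output (5.2) is given below; tanh is an assumed choice for the activations g_h and f_h, and only the output-layer updates (5.4)–(5.5) are illustrated.

import numpy as np

def shunting_neuron(x, w, w0, b, c, c0, a, g=np.tanh, f=np.tanh):
    # Generalized shunting neuron output s_j, cf. (5.1); g, f stand in for g_h, f_h
    return (g(w @ x + w0) + b) / (a + f(c @ x + c0))

def siann_output(x, neuron_params, v, d):
    # SIANN output y = v^T s + d as in (5.2); neuron_params is a list of
    # (w, w0, b, c, c0, a) tuples, one per hidden shunting neuron
    s = np.array([shunting_neuron(x, *p) for p in neuron_params])
    return v @ s + d, s

def update_output_layer(v, d, e, s, gamma_v, gamma_d):
    # Gradient-descent updates of v and d following (5.4)-(5.5)
    return v - gamma_v * e * s, d - gamma_d * e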
In the previous section a SIANN is trained using previously recorded data to classify
the input-output data in real-time into different categories based on data attributes
and features. Based on this SIANN classification, an ADP structure in each category
is used to determine, based on data measured in real time, the optimal control for the
data attributes and performance measure in that category. In this section, a novel ADP
structure is designed that captures the required system identification and optimality
factors in process control.
This structure uses fast classification based on stored memories, such as occurs
in the amygdala and OFC in [7, 8], and also allows for real-time exploration and
learning, such as occurs in the ACC in [7, 8]. This structure provides an integrated
utilization of stored memories and real-time exploration [9].
The ADP structure is used in each category as shown in Fig. 5.1. The ADP structure
has the standard three networks for model, critic, and actor. In this section, first,
the optimization problem is introduced. Then a recursive neural network (RNN) is
trained to identify the system dynamics. Novel neural network (NN) parameter tuning
algorithms for the model RNN are given and the state estimation error is proven to
converge with time. After this model identification phase, the ADP critic network
and action network are designed to use data recorded in real time to compute online
the performance measure value function and the optimal control action, respectively.
A theorem is given to prove that the closed-loop system is UUB.
Based on the previous section, suppose the input-output data is classified into cate-
gory y = i. Then we can train the ith ADP using that input-output data. Throughout
this section, it is understood that the design refers to the ADP structure in each
category i. To conserve notation, we do not use subscripts i throughout.
The performance measure function for the ith ADP is given as
J(x) = \int_t^{\infty} r(x(\tau), u(\tau))\,d\tau,   (5.10)
where x and u are the state and control inputs of the ith ADP, r (x, u) = Q(x) +
u T Ru, where R is a positive definite matrix. It is assumed that there exists a scalar
P > 0, s.t. Q(x) > P x T x.
The infinitesimal version of (5.10) is the Bellman equation
0 = J_x^T\dot{x} + r,   (5.11)
where J_x = \frac{\partial J}{\partial x}. The Hamiltonian function is given as
In the following subsections, the detailed derivations for the model neural network,
the critic network and the action network are given.
The previous section showed how to use the SIANN to classify the data into one of the L categories; see Fig. 5.1. In each category i, an RNN is used to identify the
system dynamics for the ith ADP structure. The dynamics in category i is taken to
be modeled by the RNN given as [26]
This is a very general model of a nonlinear system that can closely fit many dynamical models. Here, A_1, A_2, A_3 and A_4 are the unknown ideal weight matrices for the ith ADP, which are assumed to satisfy \|A_i\|_F \le A_i^B, where A_i^B > 0 is a constant, i = 1, 2, 3, 4. ε is the bounded approximation error, and it is assumed to satisfy ε^Tε \le β_Ae_x^Te_x, where β_A is a positive constant and the state estimation error is e_x = x - \hat{x}. The activation function f(x) is a monotonically increasing function, and it is taken to satisfy 0 \le \|f(x_1) - f(x_2)\| \le k\|x_1 - x_2\|, where x_1 > x_2 and k > 0. τ_d is the disturbance, and it is bounded, i.e., \|τ_d\| \le d_B, where d_B is a positive constant.
The approximate dynamics model is constructed as
where \hat{A}_1, \hat{A}_2, \hat{A}_3 and \hat{A}_4 are the estimated values of the ideal unknown weights, and A_5 is a square matrix that satisfies
A_5 - A_1^T - \frac{1}{2}A_2^TA_2 - \Big(\frac{k^2}{2} + \frac{1}{2} + \frac{β_A}{2}\Big)I > 0.
Define the parameter identification errors as Ã1 = A1 − Â1 , Ã2 = A2 − Â2 , Ã3 =
A3 − Â3 , Ã4 = A4 − Â4 , and define f˜(ex ) = f (x) − f (x̂).
Due to disturbances and modeling errors, the model (5.16) cannot exactly recon-
struct the unknown dynamics (5.15). It is desired to tune the parameters in (5.16)
such that the parameter estimation errors and state estimation error are uniformly
ultimately bounded.
where β_i > 0, i = 1, 2, 3, 4. If the initial values of the state estimation error e_x(0) and the parameter identification errors \tilde{A}_i(0) are bounded, then the state estimation error e_x and the parameter identification errors \tilde{A}_i are UUB.
Proof Let the initial values of the state estimation error ex (0), and the parameter
identification errors Ãi (0) be bounded, then the NN approximation property (5.15)
holds for the state x [29].
Define the Lyapunov function candidate:
V = V1 + V2 , (5.17)
where V_1 = \frac{1}{2}e_x^Te_x and V_2 = V_{A1} + V_{A2} + V_{A3} + V_{A4}, in which V_{A1} = \frac{1}{2β_1}\mathrm{tr}\{\tilde{A}_1^T\tilde{A}_1\}, V_{A2} = \frac{1}{2β_2}\mathrm{tr}\{\tilde{A}_2^T\tilde{A}_2\}, V_{A3} = \frac{1}{2β_3}\mathrm{tr}\{\tilde{A}_3^T\tilde{A}_3\}, V_{A4} = \frac{1}{2β_4}\tilde{A}_4\tilde{A}_4^T.
Using
A_1^Tx - \hat{A}_1^T\hat{x} = (A_1^Tx - A_1^T\hat{x}) + (A_1^T\hat{x} - \hat{A}_1^T\hat{x}) = A_1^Te_x + \tilde{A}_1^T\hat{x}   (5.18)
and
A_2^Tf(x) - \hat{A}_2^Tf(\hat{x}) = (A_2^Tf(x) - A_2^Tf(\hat{x})) + (A_2^Tf(\hat{x}) - \hat{A}_2^Tf(\hat{x})) = A_2^T\tilde{f}(e_x) + \tilde{A}_2^Tf(\hat{x}),   (5.19)
the state estimation error dynamics can be written as
\dot{e}_x = \dot{x} - \dot{\hat{x}} = A_1^Te_x + \tilde{A}_1^T\hat{x} + A_2^T\tilde{f}(e_x) + \tilde{A}_2^Tf(\hat{x}) + \tilde{A}_3^Tu + \tilde{A}_4^T + ε - A_5e_x + τ_d.   (5.20)
Therefore,
\dot{V}_1 = e_x^T\big(A_1^Te_x + \tilde{A}_1^T\hat{x} + A_2^T\tilde{f}(e_x) + \tilde{A}_2^Tf(\hat{x}) + \tilde{A}_3^Tu + \tilde{A}_4^T + ε - A_5e_x + τ_d\big).   (5.21)
As
e_x^TA_2^T\tilde{f}(e_x) \le \frac{1}{2}e_x^TA_2^TA_2e_x + \frac{1}{2}k^2e_x^Te_x   (5.22)
and
e_x^Tε \le \frac{1}{2}e_x^Te_x + \frac{1}{2}ε^Tε \le \Big(\frac{1}{2} + \frac{β_A}{2}\Big)e_x^Te_x,   (5.23)
we have
\dot{V}_1 \le e_x^TA_1^Te_x + e_x^T\tilde{A}_1^T\hat{x} + e_x^T\tilde{A}_2^Tf(\hat{x}) + \frac{1}{2}e_x^TA_2^TA_2e_x + \Big(\frac{1}{2}k^2 + \frac{1}{2} + \frac{β_A}{2}\Big)e_x^Te_x + e_x^T\tilde{A}_3^Tu + e_x^T\tilde{A}_4^T - e_x^TA_5e_x + e_x^Tτ_d.   (5.24)
Therefore,
\dot{V}_2 \le -e_x^T\tilde{A}_1^T\hat{x} - e_x^T\tilde{A}_2^Tf(\hat{x}) - e_x^T\tilde{A}_3^Tu - e_x^T\tilde{A}_4^T + \|e_x\|\|\tilde{A}_1\|_FA_1^B - \|e_x\|\|\tilde{A}_1\|_F^2 + \|e_x\|\|\tilde{A}_2\|_FA_2^B - \|e_x\|\|\tilde{A}_2\|_F^2 + \|e_x\|\|\tilde{A}_3\|_FA_3^B - \|e_x\|\|\tilde{A}_3\|_F^2 + \|e_x\|\|\tilde{A}_4\|_FA_4^B - \|e_x\|\|\tilde{A}_4\|_F^2.   (5.29)
Let K_V = A_5 - A_1^T - \frac{1}{2}A_2^TA_2 - \Big(\frac{k^2}{2} + \frac{1}{2} + \frac{β_A}{2}\Big)I, and then
or
\|\tilde{A}_1\|_F > \frac{A_1^B}{2} + \sqrt{\Big(\frac{A_1^B}{2}\Big)^2 + d_B},   (5.33)
or
\|\tilde{A}_2\|_F > \frac{A_2^B}{2} + \sqrt{\Big(\frac{A_2^B}{2}\Big)^2 + d_B},   (5.34)
or
\|\tilde{A}_3\|_F > \frac{A_3^B}{2} + \sqrt{\Big(\frac{A_3^B}{2}\Big)^2 + d_B},   (5.35)
or
\|\tilde{A}_4\|_F > \frac{A_4^B}{2} + \sqrt{\Big(\frac{A_4^B}{2}\Big)^2 + d_B}.   (5.36)
Thus, V̇ (t) is negative outside a compact set. This demonstrates the UUB of ||ex || and
|| Ã1 || F , || Ã2 || F , || Ã3 || F , || Ã4 || F .
Remark 5.2 In fact, many NN activation functions are bounded and have bounded
derivatives. In Theorem 5.1, the unknown system is bounded, and the NN param-
eters and output are bounded. Therefore, the initial state estimation error ex and
the initial parameter identification errors Ãi are bounded. The approximation error
boundedness property is established in [25].
From Theorem 5.1, it can be seen that as t → ∞, the parameter estimates Âi
converge to bounded regions, such that the state estimation error ex is bounded. Let
the steady state value of Âi be denoted as Bi . Then after convergence of the model
parameters, (5.16) can be rewritten as
In the following, the optimal control for the ith ADP will be designed based on the well-trained RNN model (5.37).
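A minimal sketch of one integration step of the RNN identifier is given below, assuming the model structure x_hat_dot = A1^T x_hat + A2^T f(x_hat) + A3^T u + A4^T + A5 e_x that appears later in (5.88); the tuning laws of Theorem 5.1 and the trained steady-state weights B_i are not reproduced here, and b4 is a vector stand-in for the bias term A4^T.

import numpy as np

def rnn_model_step(x_hat, x, u, A1h, A2h, A3h, b4, A5, dt, f=np.tanh):
    # One Euler step of the RNN identifier (sketch consistent with (5.88)):
    # x_hat_dot = A1h^T x_hat + A2h^T f(x_hat) + A3h^T u + b4 + A5 (x - x_hat)
    e_x = x - x_hat                      # state estimation error e_x = x - x_hat
    x_hat_dot = A1h.T @ x_hat + A2h.T @ f(x_hat) + A3h.T @ u + b4 + A5 @ e_x
    return x_hat + dt * x_hat_dot, e_x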
Based on the trained RNN model (5.37), the critic network expression in each cate-
gory i is
J = WcT ϕc + εc , (5.38)
where Wc is the ideal critic network weight matrix, ϕc is the activation function and
εc is the approximation error. We assume that ||∇ϕc || ≤ ϕcd M .
For the actual neural networks, let the estimate of Wc be Ŵc , then the actual output
of the critic network in each category i is
Jˆ = ŴcT ϕc . (5.39)
Substitute this into the Bellman equation (5.11) to obtain the equation error
where
Let
E_c = \frac{1}{2}e_c^Te_c.   (5.43)
The critic network weights are tuned by the normalized gradient descent rule
\dot{\hat{W}}_c = -α_c\frac{\partial E_c}{\partial\hat{W}_c} = -α_c\frac{ξ_1(ξ_1^T\hat{W}_c + r)}{(ξ_1^Tξ_1 + 1)^2},   (5.44)
where α_c > 0 is the learning rate of the critic network and ξ_1 = ∇ϕ_c\dot{x}.
Define ξ_2 = \frac{ξ_1}{ξ_3} and ξ_3 = ξ_1^Tξ_1 + 1, with \|ξ_2\| \ge ξ_{2m}. Then
\dot{\tilde{W}}_c = α_c\frac{ξ_1(ξ_1^T\hat{W}_c + r)}{ξ_3^2} = -α_cξ_2ξ_2^T\tilde{W}_c + α_cξ_2\frac{ε_H}{ξ_3}.   (5.45)
The action network is used to obtain the control policy u in each category i and
is given by
u = WaT ϕa + εa , (5.46)
where Wa is the ideal weight matrix of the action network, ϕa is the activation
function. According to the persistent excitation condition, we have ϕ_{aM} ≥ \|ϕ_a\| ≥ ϕ_{am}.
εa is the action network approximation error. The actual output of the action network
is
û = ŴaT ϕa , (5.47)
e_a = \hat{W}_a^Tϕ_a + \frac{1}{2}R^{-1}B_3∇ϕ_c^T\hat{W}_c.   (5.49)
Define the objective function as follows:
E_a = \frac{1}{2}e_a^Te_a.   (5.50)
Then the weight update law for the action network is a gradient descent algorithm, which is given by
\dot{\hat{W}}_a = -α_aϕ_a\Big(\hat{W}_a^Tϕ_a + \frac{1}{2}R^{-1}B_3∇ϕ_c^T\hat{W}_c\Big)^T.   (5.51)
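The two weight update laws (5.44) and (5.51) can be integrated numerically; a minimal Euler-step sketch is shown below, where phi_c_grad denotes the critic activation gradient (shape n_c × n), phi_a the action activations, and B3, R the quantities from the trained model (5.37). The shapes and the explicit Euler scheme are illustrative assumptions.

import numpy as np

def critic_action_update(Wc, Wa, phi_c_grad, phi_a, x_dot, r, R, B3, ac, aa, dt):
    # One Euler step of the critic update (5.44) and the action update (5.51).
    xi1 = phi_c_grad @ x_dot                           # xi_1 = grad(phi_c) * x_dot
    ec = xi1 @ Wc + r                                  # Bellman residual xi_1^T Wc + r
    Wc_dot = -ac * xi1 * ec / (xi1 @ xi1 + 1.0)**2     # (5.44)
    ea = Wa.T @ phi_a + 0.5 * np.linalg.inv(R) @ B3 @ phi_c_grad.T @ Wc
    Wa_dot = -aa * np.outer(phi_a, ea)                 # (5.51)
    return Wc + dt * Wc_dot, Wa + dt * Wa_dot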
The next theorem proves the asymptotic convergence of the optimal control scheme.
The result extends the results in [30].
Theorem 5.2 Let the optimal control input for (5.37) be provided by (5.47), and
the weight updating laws of the critic network and the action network be given as
in (5.44) and (5.51), respectively. Suppose there exist positive scalars l1 , l2 and l3
satisfying
l2 ϕam
2
l1 < , (5.55)
4||B3 ||2 ϕa2M
2
2ξ2m
l2 < , (5.56)
||R −1 B3 ||2 ϕcd
2
M
and
||B3 ||2 l1 2l1 ||B1 || + l1 ||B2 ||2 + l1 k 2 + 4l1
l3 > max , . (5.57)
λmin (R) P
If the initial values of the state x(0), the control u(0) and the weight estimation errors \tilde{W}_c(0) and \tilde{W}_a(0) are bounded, then the state x in (5.54), the control input u, and the weight estimation errors \tilde{W}_c and \tilde{W}_a are UUB.
Proof Consider the Lyapunov function candidate
Γ = Γ_1 + Γ_2 + Γ_3,   (5.58)
where Γ_1 = \frac{1}{2α_c}\tilde{W}_c^T\tilde{W}_c, Γ_2 = \frac{l_2}{2α_a}\mathrm{tr}\{\tilde{W}_a^T\tilde{W}_a\}, l_2 > 0, and Γ_3 = l_1x^Tx + l_3J, l_1 > 0, l_3 > 0.
Then the time derivative of the Lyapunov function candidate (5.58) along the trajectories of the closed-loop system (5.54) is computed as \dot{Γ} = \dot{Γ}_1 + \dot{Γ}_2 + \dot{Γ}_3.
According to (5.45), it can be obtained that
\dot{Γ}_1 = \tilde{W}_c^T\Big(-ξ_2ξ_2^T\tilde{W}_c + ξ_2\frac{ε_H}{ξ_3}\Big) \le -\|ξ_2\|^2\|\tilde{W}_c\|^2 + \frac{1}{2}\|ξ_2\|^2\|\tilde{W}_c\|^2 + \frac{1}{2}\Big\|\frac{ε_H}{ξ_3}\Big\|^2 \le -\frac{1}{2}ξ_{2m}^2\|\tilde{W}_c\|^2 + \frac{1}{2}\Big\|\frac{ε_H}{ξ_3}\Big\|^2.   (5.59)
Define ε_{12} = W_a^Tϕ_a + \frac{1}{2}R^{-1}B_3∇ϕ_c^TW_c and assume \|ε_{12}\| \le ε_{12M}; then, based on (5.53), one has
\dot{Γ}_2 = -l_2\mathrm{tr}\Big\{\tilde{W}_a^Tϕ_a\Big(\tilde{W}_a^Tϕ_a + \frac{1}{2}R^{-1}B_3∇ϕ_c^T\tilde{W}_c - ε_{12}\Big)^T\Big\} \le -l_2\|ϕ_a\|^2\|\tilde{W}_a\|^2 + \frac{l_2}{2}\|ϕ_a\|^2\|\tilde{W}_a\|^2 + \frac{l_2}{2}ε_{12M}^2 + \frac{l_2}{4}\|ϕ_a\|^2\|\tilde{W}_a\|^2 + \frac{l_2}{4}\|R^{-1}B_3∇ϕ_c^T\|^2\|\tilde{W}_c\|^2 \le -\frac{l_2}{4}ϕ_{am}^2\|\tilde{W}_a\|^2 + \frac{l_2}{4}\|R^{-1}B_3\|^2ϕ_{cdM}^2\|\tilde{W}_c\|^2 + \frac{l_2}{2}ε_{12M}^2.   (5.60)
The time derivative of Γ_3 is calculated as follows:
\dot{Γ}_3 = 2l_1x^T\big(B_1^Tx + B_2^Tf(x) + B_3^Tu - B_3^Tε_a - B_3^T\tilde{W}_a^Tϕ_a + B_4^T\big) + l_3(-r(x, u)).   (5.61)
As
-2x^TB_3^T\tilde{W}_a^Tϕ_a \le \|x\|^2 + \|B_3\|^2ϕ_{aM}^2\|\tilde{W}_a\|^2,   (5.63)
it follows that
\dot{Γ}_3 \le \big(2l_1\|B_1\| + l_1\|B_2\|^2 + l_1k^2 + 4l_1 - l_3P\big)\|x\|^2 + \big(\|B_3\|^2l_1 - l_3λ_{\min}(R)\big)\|u\|^2 + l_1\|B_3\|^2ϕ_{aM}^2\|\tilde{W}_a\|^2 + l_1\|B_3\|^2ε_{aM}^2 + l_1\|B_4\|^2.   (5.66)
Let ε_Γ = \frac{l_2}{2}ε_{12M}^2 + \frac{1}{2}\Big\|\frac{ε_H}{ξ_3}\Big\|^2 + l_1\|B_3\|^2ε_{aM}^2 + l_1\|B_4\|^2. Then one has
\dot{Γ} \le -\big(l_3P - 2l_1\|B_1\| - l_1\|B_2\|^2 - l_1k^2 - 4l_1\big)\|x\|^2 - \big(l_3λ_{\min}(R) - \|B_3\|^2l_1\big)\|u\|^2 - \Big(\frac{1}{2}ξ_{2m}^2 - \frac{l_2}{4}\|R^{-1}B_3\|^2ϕ_{cdM}^2\Big)\|\tilde{W}_c\|^2 - \Big(\frac{l_2}{4}ϕ_{am}^2 - l_1\|B_3\|^2ϕ_{aM}^2\Big)\|\tilde{W}_a\|^2 + ε_Γ.   (5.67)
Thus, \dot{Γ} < 0 if
\|x\| > \sqrt{\frac{ε_Γ}{l_3P - 2l_1\|B_1\| - l_1\|B_2\|^2 - l_1k^2 - 4l_1}}   (5.68)
or
\|u\| > \sqrt{\frac{ε_Γ}{l_3λ_{\min}(R) - \|B_3\|^2l_1}}   (5.69)
or
\|\tilde{W}_c\| > \sqrt{\frac{ε_Γ}{\frac{1}{2}ξ_{2m}^2 - \frac{l_2}{4}\|R^{-1}B_3\|^2ϕ_{cdM}^2}}   (5.70)
or
\|\tilde{W}_a\| > \sqrt{\frac{ε_Γ}{\frac{l_2}{4}ϕ_{am}^2 - l_1\|B_3\|^2ϕ_{aM}^2}}   (5.71)
holds. Therefore, according to the standard Lyapunov extension, x, u, \tilde{W}_c and \tilde{W}_a are UUB.
Theorem 5.3 Suppose the hypotheses of Theorem 5.2 hold, and \|ϕ_a\| \le ϕ_{aM}, \|∇ϕ_c\| \le ϕ_{cdM}, \|W_c\| \le W_{cM}, \|W_a\| \le W_{aM} and \|ε_a\| \le ε_{aM}. Then
(1) H(x, \hat{W}_a, \hat{W}_c) is UUB, where H(x, \hat{W}_a, \hat{W}_c) = \hat{J}_x^T\big(B_1^Tx + B_2^Tf(x) + B_3^T\hat{u} + B_4^T\big) + Q(x) + \hat{u}^TR\hat{u}.
(2) \hat{u} is close to u within a small bound.
Proof (1) For a fixed admissible control policy, H(x, W_a, W_c) is bounded, so one can let \|H(x, W_a, W_c)\| \le H_B. Therefore, (5.73) can be written as
H(x, \hat{W}_a, \hat{W}_c) = -∇ϕ_c^T\tilde{W}_c\big(B_1^Tx + B_2^Tf(x) + B_4^T\big) - ∇ϕ_c^TW_cB_3^T\tilde{W}_a^Tϕ_a - ∇ϕ_c^T\tilde{W}_cB_3^TW_a^Tϕ_a + ∇ϕ_c^T\tilde{W}_cB_3^T\tilde{W}_a^Tϕ_a - ϕ_a^TW_aR\tilde{W}_a^Tϕ_a - ϕ_a^T\tilde{W}_aRW_a^Tϕ_a + ϕ_a^T\tilde{W}_aR\tilde{W}_a^Tϕ_a + H_B.   (5.74)
According to Theorem 5.2, the signals on the right-hand side of (5.75) are UUB; therefore H(x, \hat{u}, \hat{J}_x) is UUB.
(2) As
Equation (5.77) means that û is close to the control input u within a small bound.
Remark 5.3 According to Theorem 5.1 and the properties of NN, we can see that
the initial weight estimation errors W̃c and W̃a are bounded. Therefore, the initial
value H is bounded in Theorem 5.3.
Remark 5.4 It should be mentioned that Theorems 5.1–5.3 are based on the following fact: the model network (RNN) is trained first, and then the critic and action network weights are tuned according to the steady-state system model (5.37).
In the next theorem, the simultaneous tuning method for the model, critic and action networks will be discussed further.
Theorem 5.4 Let the optimal control input for (5.16) be provided by (5.47). The
update methods for the tunable parameters of Âi in (5.16), i = 1, 2, 3, 4 are given
in Theorem 5.1. The weight updating laws of the critic and the action networks are
given as
ξ1 (ξ1T Ŵc + r )
Ŵ˙ c = − αc 2
, (5.78)
(ξ1T ξ1 + 1)
where Θ7 = 2l6 || Â1 || + l6 || Â2 ||2 + l6 || Â5 ||2 + l6 k 2 + 4l6 . Then the weight estima-
tion errors W̃c and W̃a are UUB if the model error ex is UUB.
Proof Consider the Lyapunov function candidate
Θ = V + Γ_4 + Γ_5 + Γ_6.   (5.84)
Similar to (5.59) and (5.60), it can be obtained that
\dot{Γ}_4 \le -\frac{1}{2}ξ_{2m}^2\|\tilde{W}_c\|^2 + \frac{1}{2}\Big\|\frac{ε_H}{ξ_3}\Big\|^2,   (5.86)
and
\dot{Γ}_5 \le -\frac{l_5}{4}ϕ_{am}^2\|\tilde{W}_a\|^2 + \frac{l_5}{4}\|R^{-1}B_3\|^2ϕ_{cdM}^2\|\tilde{W}_c\|^2 + \frac{l_5}{2}ε_{12M}^2.   (5.87)
Since the approximate dynamics model (5.16) with controller (5.47) is
\dot{\hat{x}} = \hat{A}_1^T\hat{x} + \hat{A}_2^Tf(\hat{x}) + \hat{A}_3^Tu - \hat{A}_3^Tε_a - \hat{A}_3^T\tilde{W}_a^Tϕ_a + \hat{A}_4^T + A_5e_x.   (5.88)
Then
Let d_{VΓ} = \frac{1}{2}\Big\|\frac{ε_H}{ξ_3}\Big\|^2 + \frac{l_5}{2}ε_{12M}^2 + l_6\|\hat{A}_3\|^2ε_{aM}^2 + l_6\|\hat{A}_4\|^2, and Θ_A = A_5 - A_1^T - \frac{1}{2}A_2^TA_2 - \Big(\frac{k^2}{2} + \frac{3}{2} + \frac{β_A}{2}\Big)I. Then
\|e_x\| > \frac{d_{VΓ}}{\sum_{i=1}^{4}\big(\|\tilde{A}_i\|_F^2 - \|\tilde{A}_i\|_FA_i^B\big) + d_B}.   (5.91)
Remark 5.5 From Theorem 5.4, it can be seen that once the system model error enters a bounded region, the weights of the critic and action networks will be UUB.
In this section two simulations are provided to demonstrate the versatility of the
proposed SIANN/Multiple ADP structure.
Example 5.1 We consider a simplified model of a continuously stirred tank reactor
with an exothermic reaction [33]. The model is given by
\begin{bmatrix}\dot{x}_1 \\ \dot{x}_2\end{bmatrix} = \begin{bmatrix}\dfrac{13}{6} & \dfrac{5}{12} \\[4pt] -\dfrac{50}{3} & -\dfrac{8}{3}\end{bmatrix}\begin{bmatrix}x_1 \\ x_2\end{bmatrix} + \begin{bmatrix}-2 \\ 0\end{bmatrix}u,   (5.92)
where x1 represents the temperature and x2 represents the concentration of the initial
product of the chemical reaction.
Define two categories as follows:
y_d = \begin{cases} 1, & \|x\| \ge 0.1, \\ -1, & \|x\| < 0.1. \end{cases}   (5.93)
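As an illustration, the sketch below simulates the CSTR model (5.92) under a simple state feedback and labels every sample according to (5.93); the feedback gain, step size and horizon are illustrative assumptions, not the admissible control used in the book.

import numpy as np

A = np.array([[13/6, 5/12], [-50/3, -8/3]])   # CSTR model (5.92)
B = np.array([[-2.0], [0.0]])
K = np.array([[0.1, 0.1]])                    # illustrative stabilizing gain (assumption)

def generate_labeled_data(x0, dt=0.01, steps=1500):
    # Simulate (5.92) with u = -K x (Euler) and label each state by (5.93)
    x, data = np.array(x0, dtype=float), []
    for _ in range(steps):
        u = -K @ x
        x = x + dt * (A @ x + B @ u).flatten()
        data.append((x.copy(), 1 if np.linalg.norm(x) >= 0.1 else -1))
    return data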
[Fig. 5.3: the norm of the system state \|x\| versus time steps; Fig. 5.4: the category labels y_d and y versus time steps]
c j be selected randomly in (−1, 1). Let the initial parameter in the output of SIANN
be d = 1, and the initial weight v be selected randomly in (−1, 1). The number of
shunting neurons is 10.
The SIANN is trained using the historical data generated above and the parameter tuning algorithms in the previous section. When the historical data are fed into the trained
SIANN, the classification results are shown in Fig. 5.4. Note that, in Fig. 5.3, the
norm of the state falls below 0.1 at 300 time steps. This corresponds to the category
classification change observed in Fig. 5.4.
Second, the RNN estimation model is trained, which is used to estimate the system dynamics. Let the initial elements of the matrices \hat{A}_i, i = 1, 2, 3, 4, be selected in (−0.5, 0.5). Let A_5 = \begin{bmatrix}2 & 2\\ 2 & 2\end{bmatrix}, β_i = 0.05, i = 1, 2, 3, 4. Let the activation function be f(·) = tansig(·). The stored data generated as in Fig. 5.3 are used to tune the model RNN. The training RNN error is shown in Fig. 5.5.
Finally, the optimal controller is simulated based on the trained SIANN classifier
and RNN estimation model. The structures of the critic network and action network
are 2-8-1 and 2-8-1, respectively, in each category. Let the initial weights of critic and
action networks for the two categories be selected in (−0.7, 0.7). Let the activation
function be hyperbolic function. Let the learning rate of critic and action networks be
0.02. The trained SIANN is used to classify the system state obtained by the optimal
controller. For increasing the accuracy, the weight of SIANN is modified online. The
classification result is shown in Fig. 5.6. Once the state belongs to one category, then
the RNN is used to get the system state. The test RNN estimation error is shown in
Fig. 5.7.
[Fig. 5.5: the RNN training error e_x versus time steps; Fig. 5.6: the classification results y and y_d versus time steps]
After 600 time steps, the convergent trajectory of Wc is obtained, which is shown
in Fig. 5.8. The control and state trajectories are shown in Figs. 5.9 and 5.10. The
simulation results reveal that the proposed optimal control method for the unknown system operates properly.
[Fig. 5.7: the test RNN estimation error e_x versus time steps; Fig. 5.8: the convergent trajectory of W_c versus time steps; Fig. 5.9: the control trajectory; Fig. 5.10: the state trajectories x_1, x_2; Fig. 5.11: the SIANN training result (category labels y_d, y and error e) for Example 5.2]
For system (5.95), an admissible control is used to obtain the historical data of the system state. Based on the obtained system data, the SIANN classifier and the RNN are trained.
First, the SIANN is trained. The parameters are similar to those in Example 5.1. The SIANN training result is shown in Fig. 5.11.
Second, the RNN estimation model is trained based on the classified states. Let the initial parameters \hat{A}_i, i = 1, 2, 3, 4, of the RNN be selected in (−0.2, 0.2) and A_5 = [2, 2; 2, 2]. Let β_i = 0.05, i = 1, 2, 3, 4. The RNN training error is shown in Fig. 5.12.
Finally, the optimal controller is established based on the trained SIANN classifier and RNN estimation model. The structures of the action network and critic network are both 2-8-1. Let the initial weights W_a and W_c be selected in (−1, 1) and the activation function be the sigmoid function. The trained SIANN is used to classify the system state online, and the classification result is shown in Fig. 5.13. Once the state belongs to one category, the corresponding RNN is used to estimate the system state. The RNN estimation error is shown in Fig. 5.14. The control and state trajectories are shown in Figs. 5.15 and 5.16. From the simulation results, it can be seen that the proposed control method for unknown systems is effective.
[Fig. 5.12: the RNN training error e_x versus time steps; Fig. 5.13: the test classification results versus time steps; Fig. 5.14: the test RNN estimation error e_x versus time steps; Fig. 5.15: the control trajectory; Fig. 5.16: the state trajectories x_1, x_2]
5.6 Conclusions
This chapter proposed multiple actor-critic structures to obtain the optimal control from input-output data for unknown systems. First, the input-output data were classified into several categories by a SIANN, and a performance measure function was defined for each category. Then the optimal controller was obtained by the ADP algorithm. An RNN was used to reconstruct the unknown system dynamics using input-output data, and neural networks were used to approximate the critic network and action network, respectively. It was proven that the model error and the closed-loop system are UUB. Simulation results demonstrated the effectiveness of the proposed optimal control scheme for the unknown nonlinear system.
References
1. Levine, D., Ramirez Jr., P.: An attentional theory of emotional influences on risky decisions.
Prog. Brain Res. 202(2), 369–388 (2013)
2. Levine, D., Mills, B., Estrada, S.: Modeling emotional influences on human decision making
under risk. In: Proceedings of International Joint Conference on Neural Networks, pp. 1657–
1662 (2005)
3. Werbos, P.: Intelligence in the brain: a theory of how it works and how to build it. Neural Netw.
22, 200–212 (2009)
4. Werbos, P.: Stable adaptive control using new critic designs. In: Proceedings of Adaptation, Noise, and Self-Organizing Systems (1998)
92 5 Multiple Actor-Critic Optimal Control via ADP
5. Narendra, K., Balakrishnan, J.: Adaptive control using multiple models. IEEE Trans. Autom.
control 42(2), 171–187 (1997)
6. Sugimoto, N., Morimoto, J., Hyon, S., Kawato, M.: The eMOSAIC model for humanoid robot
control. Neural Netw. 29–30, 8–19 (2012)
7. Doya, K.: What are the computations of the cerebellum, the basal ganglia and the cerebral
cortex? Neural Netw. 12(7–8), 961–974 (1999)
8. Hikosaka, O., Nakahara, H., Rand, M., Sakai, K., Lu, X., Nakamura, K., Miyachi, S., Doya, K.:
Parallel neural networks for learning sequential procedures. Trends Neurosci. 22(10), 464–471
(1999)
9. Lee, J., Lee, J.: Approximate dynamic programming-based approaches for input-output data-
driven control of nonlinear processes. Automatica 41(7), 1281–1288 (2005)
10. Song, R., Xiao, W., Zhang, H.: Multi-objective optimal control for a class of unknown nonlinear systems based on finite-approximation-error ADP algorithm. Neurocomputing 119(7), 212–221 (2013)
11. Li, H., Liu, D., Wang, D.: Integral reinforcement learning for linear continuous-time zero-sum
games with completely unknown dynamics. IEEE Trans. Autom. Sci. Eng. 11(3), 706–714
(2014)
12. Yang, X., Liu, D., Huang, Y.: Neural-network-based online optimal control for uncertain non-
linear continuous-time systems with control constraints. IET Control Theory Appl. 7(17),
2037–2047 (2013)
13. Lewis, F., Vamvoudakis, K.: Reinforcement learning for partially observable dynamic pro-
cesses: adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man
Cybern. Part B: Cybern. 41(1), 14–25 (2011)
14. Li, Z., Duan, Z., Lewis, F.: Distributed robust consensus control of multi-agent systems with
heterogeneous matching uncertainties. Automatica 50(3), 883–889 (2014)
15. Modares, H., Lewis, F., Naghibi-Sistani, M.: Integral reinforcement learning and experience
replay for adaptive optimal control of partially unknown constrained-input continuous-time
systems. Automatica 50(1), 193–202 (2014)
16. Zhang, H., Lewis, F.: Adaptive cooperative tracking control of higher-order nonlinear systems
with unknown dynamics. Automatica 48(7), 1432–1439 (2012)
17. Wei, Q., Liu, D.: Adaptive dynamic programming for optimal tracking control of unknown
nonlinear systems with application to coal gasification. IEEE Trans. Autom. Sci. Eng. 11(4),
1020–1036 (2014)
18. Doya, K., Samejima, K., Katagiri, K., Kawato, M.: Multiple model-based reinforcement learn-
ing. Neural Comput. 14, 1347–1369 (2002)
19. Levine, D.: Neural dynamics of affect, gist, probability, and choice. Cogn. Syst. Res. 15–16,
57–72 (2012)
20. Werbos, P.: Using ADP to understand and replicate brain intelligence: the next level design.
IEEE International Symposium on Approximate Dynamic Programming and Reinforcement
Learning, pp. 209–216 (2007)
21. Arulampalam, G., Bouzerdoum, A.: A generalized feedforward neural network architecture
for classification and regression. Neural Netw. 16, 561–568 (2003)
22. Bouzerdoum, A.: Classification and function approximation using feedforward shunting
inhibitory artificial neural networks. In: Proceedings of the IEEE-INNS-ENNS International
Joint Conference on Neural Networks, vol. 6, pp. 613–618 (2000)
23. Tivive, F., Bouzerdoum, A.: Efficient training algorithms for a class of shunting inhibitory
convolutional neural networks. IEEE Trans. Neural Netw. 16(3), 541–556 (2005)
24. Song, R., Lewis, F., Wei, Q., Zhang, H.: Off-policy actor-critic structure for optimal control of
unknown systems with disturbances. IEEE Trans. Cybern. 46(5), 1041–1050 (2016)
25. Hornik, K., Stinchcombe, M., White, H., Auer, P.: Degree of approximation results for feedfor-
ward networks approximating unknown mappings and their derivatives. Neural Comput. 6(6),
1262–1275 (1994)
26. Zhang, H., Cui, L., Zhang, X., Luo, Y.: Data-driven robust approximate optimal tracking control
for unknown general nonlinear systems using adaptive dynamic programming method. IEEE
Trans. Neural Netw. 22(12), 2226–2236 (2011)
References 93
27. Kim, Y., Lewis, F.: Neural network output feedback control of robot manipulators. IEEE Trans.
Robot. Autom. 15(2), 301–309 (1999)
28. Khalil, H.: Nonlinear System. Prentice-Hall, NJ (2002)
29. Lewis, F., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators and
Nonlinear Systems. Taylor and Francis, London (1999)
30. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-
valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
31. Yang, C., Li, Z., Li, J.: Trajectory planning and optimized adaptive control for a class of wheeled
inverted pendulum vehicle models. IEEE Trans. Cybern. 43(1), 24–36 (2013)
32. Yang, C., Li, Z., Cui, R., Xu, B.: Neural network-based motion control of an underactuated
wheeled inverted pendulum model. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 2004–2016
(2014)
33. Beard, R.: Improving the Closed-Loop Performance of Nonlinear Systems, Ph.D. thesis, Rens-
selaer Polytechnic Institute, Troy, NY (1995)
34. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating
actuators using a neural network HJB approach. Automatica 41, 779–791 (2005)
Chapter 6
Optimal Control for a Class of
Complex-Valued Nonlinear Systems
6.1 Introduction
In many science problems and engineering applications, the parameters and signals
are complex-valued [1, 2], such as quantum systems [3] and complex-valued neural
networks [4]. In [5], complex-valued filters were proposed for complex signals and
systems. In [6], a complex-valued B-spline neural network was proposed to model the
complex-valued Wiener system. In [7], a complex-valued pipelined recurrent neural
network for nonlinear adaptive prediction of complex nonlinear and nonstationary
signals was proposed. In [8], the output feedback stabilization of complex-valued
reaction-advection-diffusion systems was studied. In [4], the global asymptotic sta-
bility of delayed complex-valued recurrent neural networks was studied. In [9], a
reinforcement learning algorithm with complex-valued functions was proposed. In
the investigations of complex-valued systems, many system designers want to find
where z ∈ C^n is the system state, f(z) ∈ C^n and f(0) = 0. Let g(z) ∈ C^{n×n} be a bounded input gain, i.e., \|g(·)\| \le \bar{g}, where \bar{g} is a positive constant, and \|·\| stands for the 2-norm unless otherwise stated. Let u ∈ C^n be the control vector, and let z_0 be the initial state. For system (6.1), the infinite-horizon performance index function is defined as
J(z) = \int_t^{\infty}\bar{r}(z(τ), u(τ))\,dτ,   (6.2)
The aim of this chapter is to obtain the optimal control of the complex-valued
nonlinear system (6.1). In order to achieve this purpose, the following assumptions
are necessary.
Assumption 6.1 [4] Let i 2 = −1, and then z = x + i y, where x ∈ Rn and y ∈ Rn .
If C(z) ∈ Cn is the complex-valued function, then it can be expressed as C(z) =
C R (x, y) + iC I (x, y), where C R (x, y) ∈ Rn and C I (x, y) ∈ Rn .
Assumption 6.2 Let f(z) = (f_1(z), f_2(z), \ldots, f_n(z))^T, and f_j(z) = f_j^R(x, y) + if_j^I(x, y), j = 1, 2, \ldots, n. The partial derivatives of f_j(z) with respect to x and y satisfy \Big\|\frac{\partial f_j^R}{\partial x}\Big\|_1 \le λ_j^{RR}, \Big\|\frac{\partial f_j^R}{\partial y}\Big\|_1 \le λ_j^{RI}, \Big\|\frac{\partial f_j^I}{\partial x}\Big\|_1 \le λ_j^{IR} and \Big\|\frac{\partial f_j^I}{\partial y}\Big\|_1 \le λ_j^{II}, where λ_j^{RR}, λ_j^{RI}, λ_j^{IR} and λ_j^{II} are positive constants. Here \|·\|_1 stands for the 1-norm.
According to the above preparations, the system transformation for system (6.1) will now be given. Let f(z) = f^R(x, y) + if^I(x, y), g(z) = g^R(x, y) + ig^I(x, y) and u = u^R + iu^I. Then, system (6.1) can be written as
\dot{x} + i\dot{y} = f^R(x, y) + if^I(x, y) + \big(g^R(x, y) + ig^I(x, y)\big)\big(u^R + iu^I\big).   (6.3)
Let η = \begin{bmatrix}x \\ y\end{bmatrix}, ν = \begin{bmatrix}u^R \\ u^I\end{bmatrix}, F(η) = \begin{bmatrix}f^R(x, y) \\ f^I(x, y)\end{bmatrix} and G(η) = \begin{bmatrix}g^R(x, y) & -g^I(x, y) \\ g^I(x, y) & g^R(x, y)\end{bmatrix}. Then system (6.3) becomes
\dot{η} = F(η) + G(η)ν,   (6.4)
where η ∈ R^{2n}, F(η) ∈ R^{2n}, G(η) ∈ R^{2n×2n} and ν ∈ R^{2n}. From (6.4) we can see that F(0) = 0.
Remark 6.1 In fact, the system transformations between system (6.1) and system (6.4) are equivalent and reversible, which can be seen from the following chain of equations:
\dot{η} = F(η) + G(η)ν
⇔ \begin{bmatrix}\dot{x} \\ \dot{y}\end{bmatrix} = \begin{bmatrix}f^R(x, y) + g^R(x, y)u^R - g^I(x, y)u^I \\ f^I(x, y) + g^I(x, y)u^R + g^R(x, y)u^I\end{bmatrix}
⇔ \dot{x} + i\dot{y} = f^R(x, y) + if^I(x, y) + \big(g^R(x, y) + ig^I(x, y)\big)u^R + \big(ig^I(x, y) + g^R(x, y)\big)iu^I
⇔ \dot{x} + i\dot{y} = f^R(x, y) + if^I(x, y) + \big(g^R(x, y) + ig^I(x, y)\big)\big(u^R + iu^I\big)
⇔ \dot{z} = f(z) + g(z)u.
Therefore, if the optimal control for (6.4) is acquired, then the optimal control problem of system (6.1) is also solved. In the following section, the optimal control scheme for system (6.4) will be developed based on the continuous-time ADP method.
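The stacking in (6.3)–(6.4) can be expressed as a small helper that converts a complex-valued pair (f, g) into the real-valued pair (F, G); this is a direct transcription of the definitions above, with the function handles supplied by the user.

import numpy as np

def complex_to_real_dynamics(f, g):
    # Build F(eta), G(eta) of (6.4) from complex-valued f(z), g(z) in (6.1)
    def F(eta):
        n = eta.size // 2
        z = eta[:n] + 1j * eta[n:]
        fz = f(z)
        return np.concatenate([fz.real, fz.imag])
    def G(eta):
        n = eta.size // 2
        z = eta[:n] + 1j * eta[n:]
        gz = g(z)
        return np.block([[gz.real, -gz.imag], [gz.imag, gz.real]])
    return F, G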
For an arbitrary admissible control law ν, if the associated performance index function J(η) is given in (6.6), then an infinitesimal version of (6.6) is the so-called nonlinear Lyapunov equation [12]
where J_η = \frac{\partial J}{\partial η} is the partial derivative of the performance index function J. Define the optimal performance index function as
J^*(η) = \min_{ν∈\mathcal{U}}\int_t^{\infty} r(η(τ), ν(η(τ)))\,dτ.   (6.8)
According to (6.9) and (6.10), the optimal control law can be expressed as
ν^*(η) = -\frac{1}{2}R^{-1}G^T(η)J_η^*(η).   (6.11)
Remark 6.2 In this chapter, the system transformations between (6.1) and (6.4) are
necessary. We should say that the optimal control cannot be obtained from (6.1)
and (6.2) directly. For example, if the optimal control is calculated from (6.1) and
1
(6.2), according to Bellman optimality principle, we have u = − R1−1 g H (z)Jz (z).
2
∂J
Let z = x + i y and J = J R + i J I . The partial derivative Jz = exists only if
∂z
∂J R
∂J I
∂JR ∂JI
Cauchy–Riemann conditions are satisfied, i.e., = and =− .
∂x ∂y ∂y ∂x
6.2 Motivations and Preliminaries 99
In the next section, the ADP-based optimal control method will be given in detail.
In this section, neural networks are introduced to implement the optimal control
method. Let the number of hidden layer neurons of the neural network be L. Let
the weight matrix between the input layer and hidden layer be Y . Let the weight
matrix between the hidden layer and output layer be W and let the input vector be X .
Then, the output of the neural network is represented as FN (X, Y, W ) = W T σ (Y X ),
where σ (Y X ) is the activation function. For convenience of analysis, only the output
weight W is updated, while the hidden weight is kept fixed [13]. Hence, the neural
network function can be simplified by the expression FN (X, W ) = W T σ̄ (X ), where
σ̄ (X ) = σ (Y X ).
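A minimal numerical sketch of this network structure is given below; the layer sizes and random weights are arbitrary illustrations, not values used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, L_hidden, n_out = 4, 8, 1              # illustrative sizes only
Y = rng.standard_normal((L_hidden, n_in))    # hidden-layer weights, kept fixed
W = rng.standard_normal((L_hidden, n_out))   # output weights, the only tuned part

def sigma_bar(X):
    """Fixed hidden layer: sigma_bar(X) = sigma(Y X) with tanh activation."""
    return np.tanh(Y @ X)

def F_N(X, W):
    """Network output F_N(X, W) = W^T sigma_bar(X)."""
    return W.T @ sigma_bar(X)

X = rng.standard_normal(n_in)
print(F_N(X, W))                             # scalar output since n_out = 1
```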
There are two neural networks in the ADP method, which are critic and action
networks, respectively. In the following subsections, the detailed design methods for
the critic and action networks will be given.
The critic network is utilized to approximate the performance index function $J(z)$, and the ideal critic network is expressed as $J(z) = \bar W_c^H \bar\varphi_c(z) + \varepsilon_c$, where $\bar W_c \in \mathbb{C}^{n_{c1}}$ is the ideal critic network weight matrix. Let $\bar\varphi_c(z) \in \mathbb{C}^{n_{c1}}$ be the activation function and let $\varepsilon_c \in \mathbb{R}$ be the approximation error of the critic network.
Let W̄c = W̄cR + i W̄cI and ϕ̄c = ϕ̄cR + i ϕ̄cI . Then, we have
where ε H is the residual error, which is expressed as ε H = −∇εcT (F + Gν). Let Ŵc
be the estimate of Wc , and then the output of the critic network is
Then, we can define the weight estimation error of the critic network as
Note that, for a fixed admissible control law ν, Hamiltonian function (6.9) becomes
H (η, ν, Jη ) = 0, which means H (η, ν, Wc ) = 0. Therefore, from (6.15), we have
It is desired to select $\hat W_c$ to minimize the squared residual error $E_c = \frac{1}{2} e_c^T e_c$. A normalized gradient descent algorithm is used to train the critic network [12]. Then, the weight update rule of the critic network $\dot{\hat W}_c$ is derived as
where αc > 0 is the learning rate of the critic network and ξ1 = ∇ϕc (F + Gν).
It is a modified Levenberg–Marquardt algorithm, i.e., $(\xi_1^T\xi_1 + 1)$ is replaced by $(\xi_1^T\xi_1 + 1)^2$, which is used for normalization and will be required in the proofs [12]. Let $\xi_2 = \frac{\xi_1}{\xi_3}$ and $\xi_3 = \xi_1^T\xi_1 + 1$. We have
$$\dot{\tilde W}_c = \alpha_c\,\frac{\xi_1\big(\xi_1^T \hat W_c + r\big)}{\xi_3^2} = -\alpha_c\,\xi_2\xi_2^T\tilde W_c + \alpha_c\,\frac{\xi_1\big(\xi_1^T W_c + r\big)}{\xi_3^2} = -\alpha_c\,\xi_2\xi_2^T\tilde W_c + \alpha_c\,\xi_2\,\frac{\varepsilon_H}{\xi_3}. \quad (6.22)$$
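The error dynamics (6.22) correspond to a normalized gradient step on the Bellman-type residual. A discretized sketch of the critic update implied by (6.22) is shown below; the step size, dimensions, and the sampled quantities ξ1 and r are illustrative placeholders, not the chapter's simulation settings.

```python
import numpy as np

def critic_update(W_c_hat, xi1, r, alpha_c, dt):
    """One Euler step of the normalized-gradient critic tuning law:
    d(W_c_hat)/dt = -alpha_c * xi1 * (xi1^T W_c_hat + r) / (xi1^T xi1 + 1)^2,
    which yields the weight-error dynamics written in (6.22)."""
    xi3 = xi1 @ xi1 + 1.0                    # normalization term
    e_c = xi1 @ W_c_hat + r                  # Bellman-type residual
    W_dot = -alpha_c * xi1 * e_c / xi3**2
    return W_c_hat + dt * W_dot

# Illustrative values only (not from the chapter).
rng = np.random.default_rng(1)
W_c_hat = rng.standard_normal(6)             # critic weight estimate
xi1 = rng.standard_normal(6)                 # xi1 = grad(phi_c) (F + G nu)
r = 0.8                                      # instantaneous utility r(eta, nu)
W_c_hat = critic_update(W_c_hat, xi1, r, alpha_c=0.5, dt=0.01)
```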
The action network is used to obtain the control law $u$. The ideal expression of the action network is $u = \bar W_a^T\bar\varphi_a(z) + \bar\varepsilon_a$, where $\bar W_a \in \mathbb{C}^{n_{a1}\times n}$ is the ideal weight matrix of the action network. Let $\bar\varphi_a(z) \in \mathbb{C}^{n_{a1}}$ be the activation function and let $\bar\varepsilon_a \in \mathbb{C}^n$ be the approximation error of the action network.
Let $\bar W_a = \bar W_a^R + i\bar W_a^I$, $\bar\varphi_a = \bar\varphi_a^R + i\bar\varphi_a^I$ and $\bar\varepsilon_a = \bar\varepsilon_a^R + i\bar\varepsilon_a^I$. Let $W_a^T = \begin{bmatrix}\bar W_a^{RT} & -\bar W_a^{IT}\\ \bar W_a^{IT} & \bar W_a^{RT}\end{bmatrix}$, $\varphi_a = \begin{bmatrix}\bar\varphi_a^R\\ \bar\varphi_a^I\end{bmatrix}$ and $\varepsilon_a = \begin{bmatrix}\bar\varepsilon_a^R\\ \bar\varepsilon_a^I\end{bmatrix}$. We have
where Ŵa is the estimation of Wa . We can define the output error of the action
network as
$$e_a = \hat W_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \hat W_c. \quad (6.25)$$
The objective function to be minimized by the action network is defined as
$$E_a = \frac{1}{2}\, e_a^T e_a. \quad (6.26)$$
The weight update law for the action network weight is a gradient descent algo-
rithm, which is given by
$$\dot{\hat W}_a = -\alpha_a\,\varphi_a \left( \hat W_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \hat W_c \right)^T, \quad (6.27)$$
where αa is the learning rate of the action network. Define the weight estimation
error of the action network as
Then, we have
As $\nu = -\frac{1}{2} R^{-1} G^T J_\eta$, according to (6.14) and (6.23), we have
$$W_a^T\varphi_a + \varepsilon_a = -\frac{1}{2} R^{-1} G^T \nabla\varphi_c^T W_c - \frac{1}{2} R^{-1} G^T \nabla\varepsilon_c. \quad (6.30)$$
Thus, (6.29) can be rewritten as
$$\dot{\tilde W}_a = -\alpha_a\,\varphi_a \left( \tilde W_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \tilde W_c - \varepsilon_{12} \right)^T, \quad (6.31)$$
where $\varepsilon_{12} = -\varepsilon_a - \frac{1}{2} R^{-1} G^T \nabla\varepsilon_c$.
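For illustration, a discretized sketch of the actor tuning law (6.27) is given below; all dimensions and numerical values are assumptions made here, not the chapter's settings.

```python
import numpy as np

def actor_update(W_a_hat, W_c_hat, phi_a, grad_phi_c, G, R_inv, alpha_a, dt):
    """One Euler step of the actor tuning law (6.27):
    d(W_a_hat)/dt = -alpha_a * phi_a * (W_a_hat^T phi_a
                     + 0.5 * R^{-1} G^T grad(phi_c)^T W_c_hat)^T."""
    e_a = W_a_hat.T @ phi_a + 0.5 * R_inv @ G.T @ grad_phi_c.T @ W_c_hat
    W_dot = -alpha_a * np.outer(phi_a, e_a)
    return W_a_hat + dt * W_dot

# Illustrative dimensions only: 2n = 4 states, 5 actor neurons, 6 critic neurons.
rng = np.random.default_rng(2)
W_a_hat = rng.standard_normal((5, 4))        # actor weights (n_a1 x 2n)
W_c_hat = rng.standard_normal(6)             # critic weights (n_c1)
phi_a = rng.standard_normal(5)               # actor activations
grad_phi_c = rng.standard_normal((6, 4))     # d(phi_c)/d(eta)
G = rng.standard_normal((4, 4))              # input matrix G(eta)
R_inv = np.eye(4)                            # R = I for illustration
W_a_hat = actor_update(W_a_hat, W_c_hat, phi_a, grad_phi_c, G, R_inv,
                       alpha_a=0.3, dt=0.01)
```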
$$\nu_c = -\frac{K_c\, G^T \eta}{\eta^T\eta + b}, \quad (6.32)$$
where $K_c \ge \frac{\eta^T\eta + b}{2\,\eta^T G G^T \eta}\,\|G\|^2 \varepsilon_{aM}^2$, and $b > 0$ is a constant. Then, the optimal control law can be expressed as
νall = ν̂ + νc , (6.33)
where νc is the compensation controller, and ν̂ is the output of the action network.
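The sketch below assembles the total control (6.33) from the actor output and the compensation term (6.32). The gain Kc, the constant b, and all numerical quantities are illustrative assumptions.

```python
import numpy as np

def compensation_control(eta, G, K_c, b):
    """Compensation controller (6.32): nu_c = -K_c * G^T eta / (eta^T eta + b)."""
    return -K_c * (G.T @ eta) / (eta @ eta + b)

def total_control(W_a_hat, phi_a, eta, G, K_c, b):
    """Total control law (6.33): nu_all = nu_hat + nu_c,
    where nu_hat = W_a_hat^T phi_a is the action network output."""
    nu_hat = W_a_hat.T @ phi_a
    return nu_hat + compensation_control(eta, G, K_c, b)

# Illustrative values only.
rng = np.random.default_rng(3)
eta = rng.standard_normal(4)
G = rng.standard_normal((4, 4))
W_a_hat = rng.standard_normal((5, 4))
phi_a = np.tanh(rng.standard_normal(5))
nu_all = total_control(W_a_hat, phi_a, eta, G, K_c=2.0, b=0.1)
```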
Substituting (6.33) into (6.4), we can get
holds.
Remark 6.3 This assumption ensures that system (6.4) is sufficiently persistently excited for tuning the critic and action networks. In fact, the persistent excitation assumption ensures $\xi_{2m} \le \|\xi_2\|$, where $\xi_{2m}$ is a positive number [12].
Before giving the main result, the following preparatory results are presented.
Lemma 6.1 For $\forall x \in \mathbb{R}^n$, we have
$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2. \quad (6.37)$$
Proof Let $x = (x_1, x_2, \ldots, x_n)^T$. As $\|x\|_2^2 = \sum_{i=1}^{n}|x_i|^2 \le \big(\sum_{i=1}^{n}|x_i|\big)^2 = \|x\|_1^2$, we can get $\|x\|_2 \le \|x\|_1$. As $\|x\|_1^2 = \big(\sum_{i=1}^{n}|x_i|\big)^2 \le n\sum_{i=1}^{n}|x_i|^2 = n\|x\|_2^2$, we can obtain $\|x\|_1 \le \sqrt{n}\,\|x\|_2$.
Theorem 6.1 For system (6.1), if f (z) satisfies Assumptions 6.1 and 6.2, then we
have
Proof According to Assumption 6.2 and the mean value theorem for multi-variable
functions, we have
$$\|f_j^R(x, y) - f_j^R(x', y')\|_1 \le \lambda_j^{RR}\|x - x'\|_1 + \lambda_j^{RI}\|y - y'\|_1. \quad (6.39)$$
According to the definition of the 1-norm, we have $\|\eta - \eta'\|_1 = \|x - x'\|_1 + \|y - y'\|_1$, and
$$\|f^R(x, y) - f^R(x', y')\|_1 \le \sum_{j=1}^{n} k_j^R\,\|\eta - \eta'\|_1. \quad (6.41)$$
According to the idea from (6.40) and (6.41), we can also obtain
$$\|f^I(x, y) - f^I(x', y')\|_1 \le \sum_{j=1}^{n} k_j^I\,\|\eta - \eta'\|_1. \quad (6.42)$$
and
$$\|\eta - \eta'\|_1 \le \sqrt{2n}\,\|\eta - \eta'\|_2. \quad (6.45)$$
From (6.43)–(6.45), we can obtain $\|F(\eta) - F(\eta')\|_2 \le k\|\eta - \eta'\|_2$. The proof is completed.
The next theorems show that the estimation errors of the critic and action networks are UUB.
Theorem 6.2 Let the weights of the critic network $\hat W_c$ be updated by (6.21). If the inequality $\left\|\frac{\xi_1^T}{\xi_3}\tilde W_c\right\| > \varepsilon_{HM}$ holds, then the estimation error $\tilde W_c$ converges exponentially to the set $\left\{\tilde W_c : \|\tilde W_c\| \le \beta_1^{-1}\beta_2\sqrt{T}\,(1 + 2\rho\beta_2\alpha_c)\,\varepsilon_{HM}\right\}$, where $\rho > 0$.
Proof Let $\tilde W_c$ be defined as in (6.18). Define the following Lyapunov function candidate
$$L = \frac{1}{2\alpha_c}\,\tilde W_c^T \tilde W_c. \quad (6.46)$$
As $\xi_3 \ge 1$, we have $\left\|\frac{\varepsilon_H}{\xi_3}\right\| < \varepsilon_{HM}$. If $\left\|\frac{\xi_1^T}{\xi_3}\tilde W_c\right\| > \varepsilon_{HM}$, then we can get $\dot L \le 0$. That means $L$ decreases and $\|\xi_2^T\tilde W_c\|$ is bounded. In light of [14] and Technical Lemma 2 in [12], $\|\tilde W_c\| \le \beta_1^{-1}\beta_2\sqrt{T}\,(1 + 2\rho\beta_2\alpha_c)\,\varepsilon_{HM}$.
Theorem 6.3 Let the optimal control law be expressed as in (6.33). The weight
update laws of the critic and action networks are given as in (6.21) and (6.27),
respectively. If there exist parameters $l_2$ and $l_3$ that satisfy
and
$$l_3 > \max\left\{\frac{\|G\|^2}{\lambda_{\min}(R)},\ \frac{2k+3}{\lambda_{\min}(Q)}\right\}, \quad (6.49)$$
respectively, then the system state η in (6.4) is UUB and the weights of the critic and
action networks, i.e., Ŵc and Ŵa converge to finite neighborhoods of the optimal
ones.
Proof Choose the Lyapunov function candidate as
V = V1 + V2 + V3 , (6.50)
where $V_1 = \frac{1}{2\alpha_c}\tilde W_c^T \tilde W_c$, $V_2 = \frac{l_2}{2\alpha_a}\mathrm{tr}\{\tilde W_a^T \tilde W_a\}$, $V_3 = \eta^T\eta + l_3 J(\eta)$, $l_2 > 0$, $l_3 > 0$.
Taking the derivative of the Lyapunov function candidate (6.50), we can get V̇ =
V̇1 + V̇2 + V̇3 . According to (6.22), we have
$$\dot V_1 = -(\tilde W_c^T\xi_2)^T \tilde W_c^T\xi_2 + (\tilde W_c^T\xi_2)^T\,\frac{\varepsilon_H}{\xi_3}. \quad (6.51)$$
$$\dot V_3 = \big(2\eta^T F - 2\eta^T G\tilde W_a^T\varphi_a + 2\eta^T G\nu - 2\eta^T G\varepsilon_a + 2\eta^T G\nu_c\big) + l_3\big(-r(\eta,\nu)\big). \quad (6.53)$$
Let $Z = \big[\eta^T,\ \nu^T,\ (\tilde W_c^T\xi_2)^T,\ (\tilde W_a^T\varphi_a)^T\big]^T$ and $N_V = \big[0,\ 0,\ \frac{\varepsilon_H^T}{\xi_3},\ M_{V4}\big]^T$, where $M_{V4} = -\frac{1}{2}R^{-1}G^T\nabla\varphi_c^T\tilde W_c + l_2\varepsilon_{12}$, and $M_V = \mathrm{diag}\big\{(l_3\lambda_{\min}(Q) - 2k - 3),\ (l_3\lambda_{\min}(R) - \|G\|^2),\ 1,\ (l_2 - \|G\|^2)\big\}$. Thus, we have
$$\dot V \le -Z^T M_V Z + Z^T N_V \le -\|Z\|^2\lambda_{\min}(M_V) + \|Z\|\,\|N_V\|. \quad (6.57)$$
According to (6.48)–(6.49), we can see that if $\|Z\| \ge \frac{\|N_V\|}{\lambda_{\min}(M_V)} \equiv Z_B$, then $\dot V \le 0$. As $M_V$ and $\frac{\varepsilon_H}{\xi_3}$ are both upper bounded, $\|N_V\|$ is upper bounded. Therefore, the state $\eta$ and the weight errors $\tilde W_c$ and $\tilde W_a$ are UUB [15].
The proof is completed.
Theorem 6.4 Let the weight updating laws of the critic and the action networks be
given by (6.21) and (6.27), respectively. If Theorem 6.3 holds, then the control law
νall converges to a finite neighborhood of the optimal control law ν ∗ .
Proof From Theorem 6.3, there exist bounds $\bar\nu_c > 0$ and $\bar W_a > 0$ such that $\lim_{t\to\infty}\|\nu_c\| \le \bar\nu_c$ and $\lim_{t\to\infty}\|\tilde W_a\| \le \bar W_a$. From (6.33), we have
$$\nu_{all} - \nu^* = \hat\nu + \nu_c - \nu^* = \hat W_a^T\varphi_a + \nu_c - W_a^T\varphi_a - \varepsilon_a = \tilde W_a^T\varphi_a + \nu_c - \varepsilon_a. \quad (6.58)$$
Therefore, we have
Example 6.1 Our first example is chosen as Example 3 in [3] with modifications.
Consider the following nonlinear complex-valued harmonic oscillator system
$$\dot z = i^{-1}\,\frac{-2z\big(z^2 - 25\big)}{2z^2 - 1} + 10(1 + i)u, \quad (6.60)$$
Example 6.2 In the second example, the effectiveness of the developed ADP method
will be justified by a complex-valued Chen system [16]. The system can be expressed
as
$$\begin{cases}\dot z_1 = -\mu z_1 + z_2(z_3 + \alpha) + i z_1 u_1,\\ \dot z_2 = -\mu z_2 + z_1(z_3 - \alpha) + 10 u_2,\\ \dot z_3 = 1 - 0.5(\bar z_1 z_2 + z_1\bar z_2) + u_3,\end{cases} \quad (6.61)$$
where μ = 0.8 and α = 1.8. Let z = [z 1 , z 2 , z 3 ]T ∈ C3 and g(z) = diag(i z 1 , 10, 1).
Let z̄ 1 and z̄ 2 denote the complex conjugate vectors of z 1 and z 2 , respectively. Define
η = [z 1R , z 2R , z 3R , z 1I , z 2I , z 3I ]T and ν = [u 1R , u 2R , u 3R , u 1I , u 2I , u 3I ]T .
The structures of action and critic networks are 6-8-6 and 6-6-1, respectively.
Let the training rules of the neural networks be the same as in Example 1. Let
(Figure: real and imaginary parts of the control u versus time steps)
(Figure: real and imaginary parts of the state z versus time steps)
(Figure: real and imaginary parts of the controls u1, u2, u3 versus time steps)
(Figure: real and imaginary parts of the states z1, z2, z3 versus time steps)
6.5 Conclusion
In this chapter, for the first time, an optimal control scheme based on the ADP method for complex-valued systems has been developed. First, the performance index function is defined based on the complex-valued state and control. Then, system transformations are used to overcome the Cauchy–Riemann conditions. Based on the transformed system and the corresponding performance index function, a new ADP-based optimal control method is established. A compensation controller is presented to compensate for the approximation errors of the neural networks. Finally, simulation examples are given to show the effectiveness of the developed optimal control scheme.
References
1. Adali, T., Schreier, P., Scharf, L.: Complex-valued signal processing: the proper way to deal
with impropriety. IEEE Trans. Signal Process. 59(11), 5101–5125 (2011)
2. Fang, T., Sun, J.: Stability analysis of complex-valued impulsive system. IET Control Theory
Appl. 7(8), 1152–1159 (2013)
3. Yang, C.: Stability and quantization of complex-valued nonlinear quantum systems. Chaos,
Solitons Fractals 42, 711–723 (2009)
4. Hu, J., Wang, J.: Global stability of complex-valued recurrent neural networks with time-delays.
IEEE Trans. Neural Netw. Learn. Syst. 23(6), 853–865 (2012)
5. Huang, S., Li, C., Liu, Y.: Complex-valued filtering based on the minimization of complex-error
entropy. IEEE Trans. Neural Netw. Learn. Syst. 24(5), 695–708 (2013)
6. Hong, X., Chen, S.: Modeling of complex-valued wiener systems using B-spline neural net-
work. IEEE Trans. Neural Netw. 22(5), 818–825 (2011)
7. Goh, S., Mandic, D.: Nonlinear adaptive prediction of complex-valued signals by complex-
valued PRNN. IEEE Trans. Signal Process. 53(5), 1827–1836 (2005)
8. Bolognani, S., Smyshlyaev, A., Krstic, M.: Adaptive output feedback control for complex-
valued reaction-advection-diffusion systems, In: Proceedings of American Control Conference,
Seattle, Washington, USA, pp. 961–966, (2008)
9. Hamagami, T., Shibuya, T., Shimada, S.: Complex-valued reinforcement learning. In: Proceed-
ings of IEEE International Conference on Systems, Man, and Cybernetics, Taipei, Taiwan, pp.
4175–4179 (2006)
10. Paulraj, A., Nabar, R., Gore, D.: Introduction to Space-Time Wireless Communications. Cam-
bridge University Press, Cambridge (2003)
11. Mandic, D.P., Goh, V.S.L.: Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely
Linear and Neural Models. Wiley, New York (2009)
12. Vamvoudakis, K.G., Lewis, F.L.: Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5), 878–888 (2010)
13. Dierks, T., Jagannathan, S.: Online optimal control of affine nonlinear discrete-time systems
with unknown internal dynamics by using time-based policy update. IEEE Trans. Neural Netw.
Learn. Syst. 23(7), 1118–1129 (2012)
14. Khalil, H.K.: Nonlinear System. Prentice-Hall, Upper Saddle River (2002)
15. Lewis, F.L., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators
and Nonlinear Systems. Taylor & Francis, New York (1999)
16. Mahmoud, G.M., Aly, S.A., Farghaly, A.A.: On chaos synchronization of a complex two
coupled dynamos system. Chaos, Solitons Fractals 33, 178–187 (2007)
Chapter 7
Off-Policy Neuro-Optimal Control
for Unknown Complex-Valued
Nonlinear Systems
7.1 Introduction
Policy iteration (PI) and value iteration (VI) are the two main algorithms in ADP [1, 2]. In [3], a VI algorithm for nonlinear systems was proposed, which finds the optimal control law by iterating the performance index function and the control law, starting from a zero initial performance index function. It is proven that, as the iteration index increases to infinity, the iterative performance index function is a nondecreasing and bounded sequence. In [4], a PI algorithm was proposed, which obtains the optimal control by constructing a sequence of stabilizing iterative controls. Conventional PI requires the system dynamics, while in many industrial systems the dynamics is difficult to obtain. The off-policy method, which is based on PI, can solve the HJB equation with completely unknown dynamics. In [5], H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning was discussed. Note that Ref. [6] studied the optimal control problem for complex-valued nonlinear systems in the framework of an infinite-horizon ADP algorithm
with known dynamics. In this chapter, we further study the optimal control problem of unknown complex-valued nonlinear systems based on off-policy PI.
In this chapter, we consider the optimal control problem of unknown complex-valued systems. The PI algorithm is used to obtain the solution of the HJB equation. Off-policy learning allows the iterative performance index and the iterative control to be obtained with completely unknown dynamics. Action and critic networks are used to approximate the iterative performance index and the iterative control. Therefore, the established method has two steps, policy evaluation and policy improvement, which are executed by the critic and the actor, respectively. It is proven that the closed-loop system is asymptotically stable, the iterative performance index function is convergent, and the weight error is uniformly ultimately bounded (UUB).
The rest of the chapter is organized as follows. In Sect. 7.2, the problem motiva-
tions and preliminaries are presented. In Sect. 7.3, the optimal control design scheme
is developed based on PI, and the asymptotic stability is proved. In Sect. 7.4, two
examples are given to demonstrate the effectiveness of the proposed optimal control
scheme. In Sect. 7.5, the conclusion is drawn.
where $\Omega_s = \frac{\partial\Omega}{\partial s}$ is regarded here as a column vector. Given that $q$ is an admissible control policy, if $\Omega$ satisfies (7.5) with $r(s,q) \ge 0$, then it can be shown that $\Omega$ is a Lyapunov function for the system (7.3) with control policy $q$.
We define the Hamiltonian function $H$:
Then the optimal performance index function Ω ∗ satisfies the HJB equation
Assuming that the minimum on the right-hand side of (7.7) exists and is unique,
then the optimal control function is
$$q^* = -\frac{1}{2} N^{-1}\Upsilon^T\Omega_s^*. \quad (7.8)$$
The optimal control problem can now be formulated: Find an admissible control
policy q(t) = q(s(t)) such that the performance index function (7.4) associated with
the system (7.3) is minimized.
The PI algorithm given in [7] which can obtain the optimal control, requires full
system dynamics, because both Γ and Υ appear in the Bellman equation (7.5). In
[8–10], the data is used to construct the optimal control. In this section, the off-policy
PI algorithm will be presented to solve (7.5) without the dynamics of system (7.3).
First, the PI algorithm is given as follows. The PI algorithm begins from an
admissible control q [0] , then the iterative performance index function Ω [i] is updated
as
$$q^{[i+1]} = -\frac{1}{2} N^{-1}\Upsilon^T\Omega_s^{[i]}. \quad (7.10)$$
Based on the PI algorithm (7.9) and (7.10), the off-policy PI algorithm will be derived. Note that, for any time $t$ and time interval $T > 0$, the iterative performance index function satisfies
$$\Omega^{[i]}(s(t-T)) = \int_{t-T}^{t} r\big(s, q^{[i]}\big)\,d\tau + \Omega^{[i]}(s(t)). \quad (7.11)$$
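As a data-based illustration of (7.11), the sketch below evaluates the interval Bellman residual from sampled trajectories over [t−T, t]; the trapezoidal quadrature, the quadratic value-function placeholder, and all sampled signals are illustrative assumptions.

```python
import numpy as np

def bellman_residual(Omega, s_traj, q_traj, t_grid, M, N):
    """Residual of the interval Bellman equation (7.11):
    Omega(s(t-T)) - integral_{t-T}^{t} (s^T M s + q^T N q) dtau - Omega(s(t)).
    Omega is a callable value-function approximation; s_traj and q_traj hold
    samples of the state and control over [t-T, t]."""
    r = np.array([s @ M @ s + q @ N @ q for s, q in zip(s_traj, q_traj)])
    integral = np.trapz(r, t_grid)           # trapezoidal quadrature of r(s, q)
    return Omega(s_traj[0]) - integral - Omega(s_traj[-1])

# Illustrative data only (not from the chapter's examples).
rng = np.random.default_rng(4)
t_grid = np.linspace(0.0, 0.1, 11)           # interval of length T = 0.1
s_traj = rng.standard_normal((11, 4))
q_traj = rng.standard_normal((11, 4))
M, N = np.eye(4), np.eye(4)
Omega = lambda s: s @ s                      # placeholder quadratic value function
print(bellman_residual(Omega, s_traj, q_traj, t_grid, M, N))
```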
In this subsection, we will analyze the stability of the closed-loop system and the
convergence of the established off-policy PI algorithm.
Theorem 7.1 Let the iterative performance index function Ω [i] satisfy
$$\Omega^{[i]}(s) = \int_{t}^{\infty}\big(s^T M s + q^{[i]T} N q^{[i]}\big)\,d\tau. \quad (7.18)$$
Let the iterative control input q [i+1] be obtained from (7.10). Then the closed-loop
system (7.3) is asymptotically stable.
Proof Take $\Omega^{[i]}$ as the Lyapunov function candidate. Taking the derivative of $\Omega^{[i]}$ along the system $\dot s = \Gamma + \Upsilon q^{[i+1]}$, we have
Then we have
Since $\Lambda_{kk}$ is a singular value and $\Lambda_{kk} > 0$, we get
$$\dot\Omega^{[i]} = \sum_{k=1}^{m}\Lambda_{kk}\big(2y^{[i+1]T}y^{[i]} - 2y^{[i+1]T}y^{[i+1]} - y^{[i]T}y^{[i]}\big) - s^T M s < 0. \quad (7.25)$$
Theorem 7.2 Let the iterative performance index $\Omega^{[i]}$ satisfy (7.9) and let $q^{[i+1]}$ be obtained from (7.10). Then $\Omega^{[i]}$ is non-increasing as $i \to \infty$, i.e., $\Omega^*(s) \le \Omega^{[i+1]}(s) \le \Omega^{[i]}(s)$.
Furthermore, taking the derivatives of $\Omega^{[i+1]}$ and $\Omega^{[i]}$ along the system $\dot s = \Gamma + \Upsilon q^{[i+1]}$, we can obtain
$$\begin{aligned}
\Omega^{[i+1]}(s) - \Omega^{[i]}(s) &= -\int_0^{\infty}\Big(\frac{d(\Omega^{[i+1]} - \Omega^{[i]})}{ds}\Big)^T(\Gamma + \Upsilon q^{[i+1]})\,d\tau\\
&= -\int_0^{\infty}\Omega_s^{[i+1]T}(\Gamma + \Upsilon q^{[i+1]})\,d\tau + \int_0^{\infty}\Omega_s^{[i]T}(\Gamma + \Upsilon q^{[i+1]})\,d\tau\\
&= -\int_0^{\infty}\Omega_s^{[i+1]T}(\Gamma + \Upsilon q^{[i+1]})\,d\tau - \int_0^{\infty}\big(s^T M s + q^{[i]T}N q^{[i]} - 2q^{[i+1]T}N q^{[i]} + 2q^{[i+1]T}N q^{[i+1]}\big)\,d\tau. \quad (7.29)
\end{aligned}$$
From the off-policy Bellman equation (7.17), we can see that the system dynamics is
not required. Therefore, the off-policy PI algorithm depends on the system state s and
iterative control q [i] . In (7.17), the function structures Ω [i] and q [i] are unknown, so in
the following subsection the critic and action networks are presented to approximate
the iterative performance index function and the iterative control.
The critic network is expressed as
where $W_\Omega^{[i]}$ is the ideal weight of the critic network, $\phi_\Omega(s)$ is the activation function, and $\varepsilon_\Omega^{[i]}$ is the residual error. The estimate of $\Omega^{[i]}(s)$ is given as follows:
where $W_q^{[i]}$ is the ideal weight of the action network, $\phi_q(s)$ is the activation function, and $\varepsilon_q^{[i]}$ is the residual error. Accordingly, the estimate of $q^{[i]}(s)$ is given as follows:
where $\Delta\phi_\Omega = \phi_\Omega(s(t)) - \phi_\Omega(s(t-T))$.
According to (7.36), we have
i.e.,
$$e_H^{[i]} = \int_{t-T}^{t}\Big[\,\Delta\phi_\Omega^T \otimes I \quad 2\big((q - \hat q^{[i]})^T N\big)\otimes\phi_q^T\,\Big]\,d\tau\begin{bmatrix}\hat W_\Omega^{[i]}\\ \operatorname{vec}(\hat W_q^{[i+1]})\end{bmatrix} + \int_{t-T}^{t}\big(s^T M s + \hat q^{[i]T} N\hat q^{[i]}\big)\,d\tau. \quad (7.41)$$
Define $\bar C^{[i]} = \int_{t-T}^{t}\big[\,\Delta\phi_\Omega^T\otimes I \quad 2\big((q - \hat q^{[i]})^T N\big)\otimes\phi_q^T\,\big]\,d\tau$ and $\bar D^{[i]} = \int_{t-T}^{t}\big(s^T M s + \hat q^{[i]T} N\hat q^{[i]}\big)\,d\tau$. Let $E^{[i]} = 0.5\,e_H^{[i]T}e_H^{[i]}$ and $\hat W^{[i]} = \begin{bmatrix}\hat W_\Omega^{[i]}\\ \operatorname{vec}(\hat W_q^{[i+1]})\end{bmatrix}$. Then the update method
for the weight of critic and action networks is
$$\dot{\hat W}^{[i]} = -\eta_W\,\bar C^{[i]T}\big(\bar C^{[i]}\hat W^{[i]} + \bar D^{[i]}\big), \quad (7.43)$$
where ηW > 0.
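For illustration, a discretized sketch of the update rule (7.43) for the stacked critic/actor weight vector is given below; the shapes and the synthetic data matrices are assumptions made here, not the chapter's experiment.

```python
import numpy as np

def weight_update(W_hat, C_bar, D_bar, eta_W, dt):
    """One Euler step of (7.43):
    d(W_hat)/dt = -eta_W * C_bar^T (C_bar W_hat + D_bar),
    where W_hat stacks the critic weights and the vectorized actor weights."""
    return W_hat - dt * eta_W * C_bar.T @ (C_bar @ W_hat + D_bar)

# Illustrative shapes only: 3 collected data rows, 8 stacked weights.
rng = np.random.default_rng(5)
W_hat = rng.standard_normal(8)
C_bar = rng.standard_normal((3, 8))          # integrated regressor over [t-T, t]
D_bar = rng.standard_normal(3)               # integrated utility term
for _ in range(100):
    W_hat = weight_update(W_hat, C_bar, D_bar, eta_W=0.5, dt=0.05)
```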
In the following theorem, we will prove that the proposed implementation method
is convergent.
Theorem 7.3 Let the control input q [i] be equal to (7.36), the updating methods
for critic and action networks be as in (7.43). Define the weight error as W̃ [i] =
W [i] − Ŵ [i] , then for every iterative step, W̃ [i] is UUB.
Proof Consider the following Lyapunov function candidate:
$$\Sigma = \frac{l_1}{2\eta_W}\,\tilde W^{[i]T}\tilde W^{[i]}, \quad (7.44)$$
where l1 > 0.
According to (7.43), we can obtain
Then we have
$$\begin{aligned}\dot\Sigma &= -l_1\tilde W^{[i]T}\bar C^{[i]T}\bar C^{[i]}\tilde W^{[i]} + l_1\tilde W^{[i]T}\bar C^{[i]T}\big(\bar C^{[i]}W^{[i]} + \bar D^{[i]}\big)\\ &\le -l_1\|\bar C^{[i]}\|^2\|\tilde W^{[i]}\|^2 + \|\tilde W^{[i]}\|^2 + \|S^{[i]}\|^2\\ &= \big(-l_1\|\bar C^{[i]}\|^2 + 1\big)\|\tilde W^{[i]}\|^2 + \|S^{[i]}\|^2, \end{aligned} \quad (7.46)$$
where $S^{[i]} = \frac{1}{2}\big\|l_1\bar C^{[i]T}\bar C^{[i]}W^{[i]}\big\|^2 + \frac{1}{2}\big\|l_1\bar C^{[i]T}\bar D^{[i]}\big\|^2$.
$$\|\tilde W^{[i]}\| > \frac{\|S^{[i]}\|}{l_1\|\bar C^{[i]}\|^2 - 1}. \quad (7.47)$$
In this section, two examples are presented to demonstrate the effectiveness of the
proposed optimal control method.
Example 7.1 In this chapter, we first consider the following nonlinear complex-
valued system [11]
ż = f (z) + u, (7.48)
where
8 0 2 + 3i 3 − i
f (z) = − z+ f¯(z), (7.49)
0 6 4 − 2i 1 + 2i
and
1 − exp(−s j ) 1
f¯j (z) = +i , j = 1, 2. (7.50)
1 + exp(−s j ) 1 + exp(−y j )
(Figure: state trajectories z1R, z2R, z1I, z2I versus time steps)
where z = [z 1 , z 2 ]T ∈ C2 , u = [u 1 , u 2 ]T ∈ C2 , z j = z Rj + i z Ij and u j = u Rj + iu Ij ,
j = 1, 2. Let s = [z 1R , z 2R , z 1I , z 2I ]T , and q = [u 1R , u 2R , u 1I , u 2I ]T . According to the pro-
posed method, design the critic and action networks to approximate the performance
index function and control. The initial weights are selected in [0.5, 0.5]. The acti-
vation functions of critic and the action networks are sigmoid functions. After 100
time steps, the critic network weight ŴΩ and the action network weight Ŵq are con-
vergent, and the optimal control is achieved. The state and control trajectories are
obtained and demonstrated in Figs. 7.3 and 7.4. They are all convergent.
(Figure: control trajectories u1R, u2R, u1I, u2I versus time steps)
(Figure: state trajectories z1R, z2R, z1I, z2I versus time steps)
(Figure: control trajectories u1R, u2R, u1I, u2I versus time steps)
7.5 Conclusion
References
1. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction, A Bradford Book. The MIT
Press, Cambridge (2005)
2. Lewis, F., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback
control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
3. Al-Tamimi, A., Lewis, F., Abu-Khalaf, M.: Discrete-time nonlinear HJB solution using approx-
imate dynamic programming: convergence proof. IEEE Trans. Syst. Man Cybern. B Cybern.
38(4), 943–949 (2008)
4. Murray, J., Cox, C., Lendaris, G., Saeks, R.: Adaptive dynamic programming. IEEE Trans.
Syst. Man Cybern. Syst. 32(2), 140–153 (2002)
5. Modares, H., Lewis, F., Jiang, Z.: H∞ tracking control of completely unknown continuous-time
systems via off-policy reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 26(10),
2550–2562 (2015)
6. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-
valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
7. Wang, J., Xu, X., Liu, D., Sun, Z., Chen, Q.: Self-learning cruise control using kernel-based
least squares policy iteration. IEEE Trans. Control Syst. Technol. 22(3), 1078–1087 (2014)
8. Luo, B., Wu, H., Huang, T., Liu, D.: Data-based approximate policy iteration for affine nonlinear
continuous-time optimal control design. Automatica 50(12), 3281–3290 (2014)
9. Modares, H., Lewis, F.: Linear quadratic tracking control of partially-unknown continuous-time
systems using reinforcement learning. IEEE Trans. Autom. Control 59, 3051–3056 (2014)
10. Kiumarsi, B., Lewis, F., Modares, H., Karimpur, A., Naghibi-Sistani, M.: Reinforcement Q-
learning for optimal tracking control of linear discrete-time systems with unknown dynamics.
Automatica 50(4), 1167–1175 (2014)
11. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41, 779–791 (2005)
Chapter 8
Approximation-Error-ADP-Based
Optimal Tracking Control for Chaotic
Systems
In this chapter, an optimal tracking control scheme is proposed for a class of discrete-
time chaotic systems using the approximation-error-based ADP algorithm. Via the
system transformation, the optimal tracking problem is transformed into an optimal
regulation problem, and then the novel optimal tracking control method is proposed.
It is shown that for the iterative ADP algorithm with finite approximation error, the
iterative performance index functions can converge to a finite neighborhood of the
greatest lower bound of all performance index functions under some convergence
conditions. Two examples are given to demonstrate the validity of the proposed
optimal tracking control scheme for chaotic systems.
8.1 Introduction
During recent years, the control problem of chaotic systems has received considerable attention [1, 2]. Many different methods have been applied theoretically and experimentally to control chaotic systems, such as the adaptive synchronization control method [3] and the impulsive control method [4]. However, the methods mentioned above focus only on designing controllers for chaotic systems. Few of them consider the optimal tracking control problem, which is an important performance index for chaotic systems.
Although the iterative ADP algorithm has advanced greatly in the control field, it is still an open problem how to solve the optimal tracking control problem for chaotic systems based on the ADP algorithm. The reason is that in most implementations of ADP algorithms, the accurate optimal control and performance index function are not obtained, because of the approximation error between the approximate function and the expected one. This approximation error will influence the control performance of the chaotic systems, so it needs to be considered in the ADP algorithm. This motivates our research. In this chapter, we propose an approximation-error ADP algorithm to deal with the optimal tracking control problem for chaotic systems. First, the optimal tracking problem of the chaotic
system is transformed into the optimal regulation problem, and the corresponding
performance index function is defined. Then, the approximation-error ADP algo-
rithm is established. It is proved that the ADP algorithm with approximation error
makes the iterative performance index functions converge to a finite neighborhood
of the optimal performance index function. Finally, two simulation examples are
given to show the effectiveness of the proposed optimal tracking control algorithm
for chaotic systems.
This chapter is organized as follows. In Sect. 8.2, we present the problem for-
mulation. In Sect. 8.3, the optimal tracking control scheme is developed and the
convergence proof is given. In Sect. 8.4, two examples are given to demonstrate the
effectiveness of the proposed tracking control scheme. In Sect. 8.5, the conclusion is
given.
Consider the MIMO chaotic dynamic system which can be represented by the fol-
lowing form
$$\begin{aligned} x_1(k+1) &= f_1(x(k)) + \sum_{j=1}^{m} g_{1j} u_j(k),\\ &\ \,\vdots\\ x_n(k+1) &= f_n(x(k)) + \sum_{j=1}^{m} g_{nj} u_j(k), \end{aligned} \quad (8.1)$$
where x(k) = [x1 (k), x2 (k), . . . , xn (k)]T is the system state vector which is assumed
to be available for measurement. u(k) = [u 1 (k), u 2 (k), . . . , u m (k)]T is the control
input. gi j , i = 1, 2, . . . , n, j = 1, 2, . . . , m is the constant control gain. If we denote
$f(x(k)) = [f_1(x(k)), f_2(x(k)), \ldots, f_n(x(k))]^T$ and
$$g = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1m}\\ \vdots & \vdots & & \vdots\\ g_{n1} & g_{n2} & \cdots & g_{nm}\end{bmatrix}. \quad (8.2)$$
In fact, system (8.3) covers a large family of systems, and many chaotic systems can be described as in (8.3), such as the Hénon map [5], the new discrete chaotic system proposed in [6], and many others.
and w(k) = 0, for k < 0, where u d (k) denotes the expected control, which can be
given as
where ε0 is a constant matrix, which is used to guarantee the existence of the matrix
inverse.
Then the tracking error is obtained as follows:
e(k + 1) = x(k + 1) − xd (k + 1)
= f (x(k)) − f (xd (k)) + gw(k)
= f (e(k) + xd (k)) − f (xd (k)) + gw(k). (8.8)
Remark 8.1 Actually, the term $g^T(gg^T + \varepsilon_0)^{-1}$ in Eq. (8.7) is the generalized inverse of $g$. As $g \in \mathbb{R}^{n\times m}$, the generalized inverse technique is used to obtain the inverse of $g$.
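As a numerical illustration of the generalized inverse discussed in Remark 8.1, the sketch below computes an expected control of the form ud(k) = g^T(gg^T + ε0)^{-1}(xd(k+1) − f(xd(k))). This explicit formula, the placeholder dynamics, and the numbers used are assumptions made here for illustration, consistent with the description of Eq. (8.7) in the remark.

```python
import numpy as np

def expected_control(f, g, x_d_now, x_d_next, eps0):
    """Expected control via the generalized inverse g^T (g g^T + eps0)^{-1},
    assuming the desired trajectory satisfies x_d(k+1) ~ f(x_d(k)) + g u_d(k)."""
    g_pinv = g.T @ np.linalg.inv(g @ g.T + eps0)   # regularized generalized inverse
    return g_pinv @ (x_d_next - f(x_d_now))

# Illustrative data only: n = 3 states, m = 2 controls.
f = lambda x: np.array([0.9 * x[0], x[0] * x[1], 0.8 * x[2]])   # placeholder dynamics
g = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.0]])
eps0 = 1e-6 * np.eye(3)                      # small regularization matrix
x_d_now = np.array([0.5, -0.5, 0.4])
x_d_next = np.array([0.45, -0.45, 0.35])
u_d = expected_control(f, g, x_d_now, x_d_next, eps0)
```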
To solve the optimal tracking control problem, we present the following perfor-
mance index function
$$J(e(k), w(k)) = \sum_{l=k}^{\infty} U(e(l), w(l)), \quad (8.9)$$
where U (e(k), w(k)) > 0, ∀e(k), w(k), is the utility function. In this chapter, we
define
U (e(k), w(k)) = eT (k)Qe(k) + (w(k) − w(k − 1))T R(w(k) − w(k − 1)), (8.10)
where Q and R are both diagonal positive definite matrices. In (8.10), the first term
means the tracking error, and the second term means the difference of the control
error.
According to Bellman’s principle of optimality, the optimal performance index
function satisfies the HJB equation as follows:
We can see that if we want to obtain the optimal tracking control u ∗ (k), we
must obtain w∗ (k) and the optimal performance index function J ∗ (e(k)). Generally,
J ∗ (e(k)) is unknown before all the control error w∗ (k) is considered. If we adopt the
traditional dynamic programming method to obtain the optimal performance index
function, then we have to face the “the curse of dimensionality”. This makes the
optimal control nearly impossible to be obtained by the HJB equation. So, in the
next part, we present an iterative ADP algorithm, based on Bellman’s principle of
optimality.
First, for ∀e(k), let the initial function Υ (e(k)) be an arbitrary function that satisfies
Υ (e(k)) ∈ Ῡe(k) , where Ῡe(k) is defined as follows.
Ῡe(k) = {Υ (e(k)) : Υ (e(k)) > 0, and Υ (e(k + 1)) < Υ (e(k))} (8.15)
8.3 Optimal Tracking Control Scheme Based … 131
be the initial positive definite function set, where $e(k+1) = F(x(k), u(k)) - x_d(k+1)$.
For ∀e(k), let the initial performance index function J [0] (e(k)) = θ̂Υ (e(k)), where
θ̂ > 0 is a large enough finite positive constant. The initial iterative control policy
w[0] (k) can be computed as follows:
$$w^{[0]}(k) = \arg\inf_{w(k)}\big\{U(e(k), w(k)) + J^{[0]}(e(k+1))\big\}, \quad (8.16)$$
where $J^{[0]}(e(k+1)) = \hat\theta\,\Upsilon(e(k+1))$. The performance index function can be updated as
$$w^{[i]}(e(k)) = \arg\inf_{w(k)}\big\{U(e(k), w(k)) + J^{[i]}(e(k+1))\big\}, \quad (8.18)$$
In fact, the accurate iterative control policy $w^{[i]}(k)$ and the iterative performance index function $J^{[i]}(e(k))$ are generally impossible to obtain. For example, if neural networks are used to implement the iterative ADP algorithm, then no matter what kind of neural networks we choose, an approximation error between the output of the neural networks and the expected output must exist. Due to the existence of this approximation error, the accurate iterative control policy of the chaotic systems cannot generally be obtained. So the iterative ADP algorithm with finite approximation error is expressed as follows:
$$\hat w^{[0]}(e(k)) = \arg\inf_{w(k)}\big\{U(e(k), w(k)) + \hat J^{[0]}(e(k+1))\big\} + \alpha^{[0]}(e(k)), \quad (8.20)$$
where Jˆ[0] (e(k + 1)) = θ̂Υ (e(k + 1)). The performance index function can be
updated as
$$\hat J^{[1]}(e(k)) = U\big(e(k), \hat w^{[0]}(k)\big) + \hat J^{[0]}(e(k+1)) + \beta^{[0]}(e(k)), \quad (8.21)$$
where α[0] (e(k)) and β [0] (e(k)) are the approximation error functions of the iterative
control and iterative performance index function, respectively.
For i = 1, 2, . . ., the iterative ADP algorithm will iterate between
$$\hat w^{[i]}(e(k)) = \arg\inf_{w(k)}\big\{U(e(k), w(k)) + \hat J^{[i]}(e(k+1))\big\} + \alpha^{[i]}(e(k)), \quad (8.22)$$
where α[i] (e(k)) and β [i] (e(k)) are the approximation error functions of the iterative
control and iterative performance index function, respectively.
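A schematic sketch of the iteration (8.20)–(8.23) on a discretized error space is given below; the one-dimensional system, the coarse control grid used for the arg-inf, and the injected approximation errors are all illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

# Illustrative scalar error dynamics e(k+1) = 0.8*e(k) + w(k) and quadratic utility.
next_e = lambda e, w: 0.8 * e + w
U = lambda e, w: e**2 + w**2
w_grid = np.linspace(-1.0, 1.0, 41)          # coarse control grid for the arg-inf

def iterate(J_prev, e, approx_err=1e-3):
    """One step of the approximation-error ADP iteration (8.22)-(8.23):
    minimize U(e, w) + J_prev(e(k+1)) over w, then perturb the results by the
    approximation errors alpha^[i] and beta^[i] (modeled as small constants here)."""
    costs = np.array([U(e, w) + J_prev(next_e(e, w)) for w in w_grid])
    j = int(np.argmin(costs))
    w_hat = w_grid[j] + approx_err           # iterative control with error alpha^[i]
    J_hat = costs[j] + approx_err            # performance index with error beta^[i]
    return w_hat, J_hat

J_prev = lambda e: 6.0 * e**2                # J_hat^[0] = theta_hat * Upsilon(e)
w_hat, J_hat = iterate(J_prev, e=0.3)
```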
where Jˆ[i−1] (e(k + 1)) is defined in (8.23), and w(k) can accurately be obtained. If
the initial iterative performance index function Jˆ[0] (e(k)) = Δ[0] (e(k)), and there
exists a finite constant 1 that makes
So, we get
It is obvious that the property analysis of the iterative performance index function
Jˆ[i] (e(k)) and iterative control policy ŵ[i] (k) are very difficult. In next part, the novel
convergence analysis is built.
and
where we define $\sum_{j}^{i}(\cdot) = 0$ for all $j > i$, $i, j = 0, 1, \ldots$.
Proof The theorem can be proved by mathematical induction. First, let i = 0. Then,
(8.36) becomes
$$\begin{aligned}
\Delta^{[i]}(e(k)) &= \inf_{w(k)}\big\{U(e(k), w(k)) + \hat J^{[i-1]}(e(k+1))\big\}\\
&\le \inf_{w(k)}\left\{U(e(k), w(k)) + \iota\left(1 + \sum_{j=1}^{i-1}\frac{\zeta^{j}\iota^{j-1}(\iota - 1)}{(\zeta + 1)^{j}} + \frac{\zeta^{i-1}\iota^{i-1}(\eta - 1)}{(\zeta + 1)^{i-1}}\right) J^*(e(k+1))\right\}. \quad (8.41)
\end{aligned}$$
Then, according to (8.35), we can obtain (8.36) which proves the conclusion for
∀ i = 0, 1, . . ..
Theorem 8.3 Let e(k) be an arbitrary controllable state. Suppose Theorem 8.2 holds
for ∀e(k). If for ζ < ∞ and ι ≥ 1, the inequality
$$\iota < \frac{\zeta + 1}{\zeta} \quad (8.43)$$
holds, then as i → ∞, the iterative performance index function Jˆ[i] (e(k)) in the iter-
ative ADP algorithm (8.20)–(8.23) is uniformly convergent into a bounded neigh-
borhood of the optimal performance index function J ∗ (e(k)), i.e.,
$$\lim_{i\to\infty}\hat J^{[i]}(e(k)) = \hat J^{\infty}(e(k)) \le \iota\left(1 + \frac{\zeta(\iota - 1)}{1 - \zeta(\iota - 1)}\right)J^*(e(k)). \quad (8.44)$$
As $i\to\infty$, if $1 \le \iota < \frac{\zeta + 1}{\zeta}$, then (8.45) becomes
$$\lim_{i\to\infty}\Delta^{[i]}(e(k)) = \Delta^{\infty}(e(k)) \le \left(1 + \frac{\zeta(\iota - 1)}{1 - \zeta(\iota - 1)}\right)J^*(e(k)). \quad (8.46)$$
Corollary 8.1 Let e(k) be an arbitrary controllable state. Suppose Theorem 8.2
holds for ∀e(k). If for ζ < ∞ and ι ≥ 1, the inequality (8.43) holds, then the iterative
control policy ŵ[i] (k) of the iterative ADP algorithm (8.20)–(8.23) is convergent, i.e.,
$$\hat w^{\infty}(k) = \lim_{i\to\infty}\hat w^{[i]}(k) = \arg\inf_{w(k)}\big\{U(e(k), w(k)) + \hat J^{\infty}(e(k+1))\big\}. \quad (8.49)$$
Therefore, it is proved that the new iterative ADP algorithm with finite approximation
error is convergent. In the next section, the examples will be given to illustrate the
performance of the proposed method for chaotic systems.
To evaluate the performance of our iterative ADP algorithm with finite approximation
error, we choose two examples with quadratic utility functions.
with
$$A = \begin{bmatrix}\cos(wT) & \sin(wT)\\ -\sin(wT) & \cos(wT)\end{bmatrix},$$
(Figure 8.2: performance index function versus iteration steps)
the performance index function is not convergent. The state tracking trajectories are shown in Figs. 8.3 and 8.4, and the tracking error trajectories are shown in Fig. 8.5. We can see that the iterative control cannot make the system track the desired orbit satisfactorily. For an approximation error of $10^{-8}$, the convergence conditions are satisfied, and the performance index function is shown in Fig. 8.6. The state trajectories are shown in Figs. 8.7 and 8.8, and the tracking error trajectories are shown in Fig. 8.9. From the simulation results, we can see that the optimal tracking control of the chaotic systems can be obtained if the convergence conditions are satisfied.
Example 8.2 Consider the following continuous time chaotic system [11]
$$\begin{cases}\dot x_1 = \theta_1\big(x_2 - h(x_1)\big) + u_1,\\ \dot x_2 = x_1 - x_2 + x_3 + u_2,\\ \dot x_3 = -\theta_2 x_2 + u_3,\end{cases} \quad (8.52)$$
where $h(x_1) = m_1 x_1 + \frac{1}{2}(m_0 - m_1)\big(|x_1 + \theta_3| - |x_1 - \theta_3|\big)$, $\theta_1 = 9$, $\theta_2 = 14.28$, $\theta_3 = 1$, $m_0 = -\frac{1}{7}$ and $m_1 = \frac{2}{7}$.
According to Euler’s discretization method, the continuous time chaotic system
can be represented as follows:
$$\begin{cases} x_1(k+1) = x_1(k) + \hat T\big(\theta_1(x_2(k) - h(x_1(k))) + u_1(k)\big),\\ x_2(k+1) = x_2(k) + \hat T\big(x_1(k) - x_2(k) + x_3(k) + u_2(k)\big),\\ x_3(k+1) = x_3(k) + \hat T\big(-\theta_2 x_2(k) + u_3(k)\big), \end{cases} \quad (8.53)$$
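For illustration, the sketch below implements the Euler-discretized dynamics (8.53); the sampling interval T̂ and the zero control used for the rollout are assumed values, not the chapter's settings.

```python
import numpy as np

theta1, theta2, theta3 = 9.0, 14.28, 1.0
m0, m1 = -1.0 / 7.0, 2.0 / 7.0
T_hat = 0.01                                  # assumed sampling interval

def h(x1):
    """Piecewise-linear nonlinearity of (8.52)."""
    return m1 * x1 + 0.5 * (m0 - m1) * (abs(x1 + theta3) - abs(x1 - theta3))

def step(x, u):
    """One Euler step of the discretized chaotic system (8.53)."""
    x1, x2, x3 = x
    u1, u2, u3 = u
    return np.array([
        x1 + T_hat * (theta1 * (x2 - h(x1)) + u1),
        x2 + T_hat * (x1 - x2 + x3 + u2),
        x3 + T_hat * (-theta2 * x2 + u3),
    ])

x = np.array([0.2, 0.13, 0.17])               # initial state used in the simulation
for _ in range(1000):
    x = step(x, np.zeros(3))                  # uncontrolled evolution for illustration
```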
(Figure 8.3: state x1 and desired trajectory xd1 versus time steps)
(Figure 8.4: state x2 and desired trajectory xd2 versus time steps)
(Figure 8.5: tracking errors e1 and e2 versus time steps)
(Figure 8.6: performance index function versus iteration steps)
(Figure 8.7: state x1 and desired trajectory xd1 versus time steps)
(Figure 8.8: state x2 and desired trajectory xd2 versus time steps)
(Figure 8.9: tracking errors e1 and e2 versus time steps)
The initial state is selected as [0.2, 0.13, 0.17]T , and the reference trajectory is
[0.5, −0.5, 0.4]T . The utility function is defined as U (e(k), w(k)) = eT (k)Qe(k) +
wT (k)Rw(k), where Q = R = I3 . Let θ̂ = 6 and Υ (e(k)) = eT (k)Qe(k) to initialize
the algorithm. The performance index function is shown as in Fig. 8.11. The system
states are shown as in Fig. 8.12. The state errors are given in Fig. 8.13. It can be seen
that the proposed algorithm is effective and the simulation results are satisfactory.
(Figure 8.11: performance index function versus iteration steps)
(Figure 8.12: states x1, x2, x3 versus time steps)
(Figure 8.13: state errors e1, e2, e3 versus time steps)
8.5 Conclusion
This chapter proposed an optimal tracking control method for chaotic systems. Via
system transformation, the optimal tracking problem was transformed into an optimal
regulation problem, and then the approximation-error ADP algorithm was introduced
to deal with the optimal regulation problem with rigorous convergence analysis.
Finally, two examples have been given to demonstrate the validity of the proposed
optimal tracking control scheme for chaotic systems.
References
1. Zhang, H., Huang, W., Wang, Z., Chai, T.: Adaptive synchronization between two different
chaotic systems with unknown parameters. Phys. Lett. A 350(5–6), 363–366 (2006)
2. Chen, S., Lu, J.: Synchronization of an uncertain unified chaotic system via adaptive control.
Chaos, Solitons Fractals 14(4), 643–647 (2002)
3. Zhang, H., Wang, Z., Liu, D.: chaotifying fuzzy hyperbolic model using adaptive inverse
optimal control approach. Int. J. Bifurc. Chaos 14(10), 3505–3517 (2004)
4. Zhang, H., Ma, T., Fu, J., Tong, S.: Robust lag synchronization of two different chaotic systems
via dual-stage impulsive control. Chin. Phys. B 18(9), 3751–3757 (2009)
5. Hénon, M.: A two-dimensional mapping with a strange attractor. Commun. Math. Phys. 50(1), 69–77 (1976)
6. Lu, J., Wu, X., Lü, J., Kang, L.: A new discrete chaotic system with rational fraction and its
dynamical behaviors. Chaos, Solitons Fractals 22(2), 311–319 (2004)
7. Zhang, H., Cui, L., Zhang, X., Luo, Y.: Data-driven robust approximate optimal tracking control
for unknown general nonlinear systems using adaptive dynamic programming method. IEEE
Trans. Neural Netw. 22(12), 2226–2236 (2011)
8. Zhang, H., Song, R., Wei, Q., Zhang, T.: Optimal tracking control for a class of nonlinear
discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans.
Neural Netw. 22(12), 1851–1862 (2011)
9. Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst. Man Cybern. B Cybern. 38(4), 937–942 (2008)
10. Liu, D., Wei, Q.: Finite-approximation-error-based optimal control approach for discrete-time
nonlinear systems. IEEE Trans. Syst. Man Cybern. B Cybern. 43(2), 779–789 (2013)
11. Ma, T., Zhang, H., Fu, J.: Exponential synchronization of stochastic impulsive perturbed chaotic
Lur’e systems with time-varying delay and parametric uncertainty. Chin. Phys. B 17(12), 4407–
4417 (2008)
Chapter 9
Off-Policy Actor-Critic Structure
for Optimal Control of Unknown
Systems with Disturbances
An optimal control method is developed for unknown continuous time systems with
unknown disturbances in this chapter. The integral reinforcement learning (IRL) algo-
rithm is presented to obtain the iterative control. Off-policy learning is used to allow
the dynamics to be completely unknown. Neural networks (NN) are used to con-
struct critic and action networks. It is shown that if there are unknown disturbances,
off-policy IRL may not converge or may be biased. For reducing the influence of
unknown disturbances, a disturbance compensation controller is added. It is proven
that the weight errors are uniformly ultimately bounded (UUB) based on Lyapunov
techniques. Convergence of the Hamiltonian function is also proven. The simulation
study demonstrates the effectiveness of the proposed optimal control method for
unknown systems with disturbances.
9.1 Introduction
Adaptive control is a body of online design techniques that use measured data along
system trajectories to learn unknown system dynamics, and compensate for distur-
bances and modeling errors to provide guaranteed performance [1–3]. Reinforce-
ment learning (RL), which improves the action in a given uncertain environment,
can obtain the optimal adaptive control in an uncertain noisy environment with less
computational burden [4–6]. Integral RL (IRL) is based on the integral temporal dif-
ference and uses RL ideas to find the value of the parameters of the infinite horizon
performance index function [7], which has been one of the focus area in the theory
of optimal control [8, 9].
IRL is conceptually based on the policy iteration (PI) technique [10, 11]. PI algo-
rithm [12, 13] is an iterative approach to solving the HJB equation by constructing a
sequence of stabilizing control policies that converge to the optimal control solution.
IRL allows the development of a Bellman equation that does not contain the system
dynamics. In [14], an online algorithm that uses integral reinforcement knowledge
for learning the continuous-time optimal control solution of nonlinear systems with
infinite horizon costs and partial knowledge of the system dynamics was introduced.
It is worth noting that most of the IRL algorithms are on-policy, i.e., the performance
index function is evaluated by using system data generated with policies being eval-
uated. It means on-policy learning methods use the “inaccurate” data to learn the
performance index function, which will increase the accumulated error. On-policy
IRL algorithms generally require knowledge of the system input coupling function
g(x).
A recently developed approach for IRL is off-policy learning, which can learn
the solution of HJB equation from the system data generated by an arbitrary control
[15]. In [16], the off-policy IRL method was developed to solve the H-infinity control
problem of continuous-time systems with completely unknown dynamics. In [17], a
novel PI approach for finding online adaptive optimal controllers for continuous-time
linear systems with completely unknown system dynamics was presented. In Refs. [1,
18, 19], robust optimal control designs for uncertain nonlinear systems were studied
based on robust ADP, in which the unmolded dynamics was taken into accounted
and the interactions between the system model and the dynamic uncertainty was
studied using robust nonlinear control theory. However, the external disturbances
exist in many industrial systems, which depend on the time scale directly. If the
external disturbances are considered as an interior part of the systems, then the
original systems become time-varying ones. It is not the desire of the engineers.
Therefore, effective optimal control methods are required to develop to deal with the
systems with disturbances. This motivates our research interest.
In this chapter, an off-policy IRL algorithm is established to obtain the optimal control of unknown systems with unknown disturbances. Because the unknown external
disturbances exist, off-policy IRL methods may be biased and fail to give optimal
controllers. A disturbances compensation redesigned off-policy IRL controller is
given here. It is proven that the weight errors are uniformly ultimately bounded
(UUB) based on Lyapunov techniques.
||d|| ≤ Bd , (9.2)
where r (x, u) = Q(x) + u T Ru, in which Q(x) > 0 and R is a symmetric positive
definite matrix. It is assumed that there exists q > 0 satisfying Q(x) ≥ q||x||2 .
To begin with the algorithms, let us introduce the concept of the admissible control
[18, 20].
Definition 9.1 For system (9.1) with d = 0, a control policy u(x) is defined as
admissible, if u(x) is continuous on a set Ω ∈ Rn , u(0) = 0, u(x) stabilizes the
system, and J (x) defined in (9.3) is finite for all x.
0 = JxT ẋ + r, (9.4)
If the solution J ∗ exists and it is continuously differentiable, then the optimal control
can be expressed as
u ∗ = arg min H (x, u, Jx∗ ) . (9.9)
u∈U
Algorithm 2 PI
1: Let i = 0, select an admissible control u [0] .
2: Let the iteration index i ≥ 0, and solve J [i] form
Let d = 0, for any time T > 0, and t > T , the performance index function (9.3)
can be written in IRL form as
$$J(x(t-T)) = \int_{t-T}^{t} r\big(x(\tau), u(\tau)\big)\,d\tau + J(x(t)). \quad (9.12)$$
Define $r_I = \int_{t-T}^{t} r\big(x(\tau), u(\tau)\big)\,d\tau$. Then (9.12) can be expressed as
Let u [i] be obtained by (9.11). Then the original system (9.1) with d = 0 can be
rewritten as
ẋ = f + gu [i] . (9.14)
This is an IRL Bellman equation that can be solved instead of (9.10) at each step
of the PI algorithm. This means that f (x) is not needed in this IRL PI algorithm.
Details are given in [7].
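The sketch below shows how the integral reinforcement r_I and the residual of the interval relation (9.12)/(9.13) can be computed from sampled data over one interval of length T; the quadrature rule, the quadratic stage cost Q(x) = x^T Q x, and all sampled signals are illustrative assumptions.

```python
import numpy as np

def integral_reinforcement(x_traj, u_traj, t_grid, Q, R):
    """r_I = integral_{t-T}^{t} (Q(x) + u^T R u) dtau, evaluated by trapezoidal
    quadrature with Q(x) = x^T Q x assumed for illustration."""
    r = np.array([x @ Q @ x + u @ R @ u for x, u in zip(x_traj, u_traj)])
    return np.trapz(r, t_grid)

def irl_residual(J, x_traj, u_traj, t_grid, Q, R):
    """Residual of the IRL Bellman relation: J(x(t-T)) - r_I - J(x(t))."""
    r_I = integral_reinforcement(x_traj, u_traj, t_grid, Q, R)
    return J(x_traj[0]) - r_I - J(x_traj[-1])

# Illustrative data only.
rng = np.random.default_rng(6)
t_grid = np.linspace(0.0, 0.05, 6)           # interval of length T = 0.05
x_traj = rng.standard_normal((6, 2))
u_traj = rng.standard_normal((6, 1))
J = lambda x: 2.0 * x @ x                    # placeholder value function
print(irl_residual(J, x_traj, u_traj, t_grid, np.eye(2), np.eye(1)))
```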
Based on Sect. 9.2, we will analyze the case of d is not equal to zero. In this section,
on-policy IRL for nonzero disturbance is first presented, and then the off-policy IRL
is given.
If disturbance d is not equal to zero, the IRL PI algorithm just gives incorrect results.
Let u = u [i] , then the original system (9.1) can be rewritten as
ẋ = f + gu [i] + d. (9.17)
This shows that equation error el is biased by a term depending on the unknown
disturbance d. Therefore:
(1) The least squares method in [17] will not give the correct solution for J [i] .
(2) If the unknown disturbance d is not equal to zero, the iterative control u [i] cannot
guarantee the stability of the closed-loop system when dynamics uncertainty
occurs.
Therefore, the original method for d = 0 is not adapted for the nonlinear system
with unknown external disturbance. In the following subsection, an off-policy IRL
algorithm is used to decrease el in (9.19), and makes the closed-loop system with
external disturbance stable.
Here we detail off-policy IRL for the case of nonzero disturbance. Off-policy IRL
allows learning with completely unknown dynamics. However, it is seen here that
there are unknown disturbance, off-policy IRL may not perform properly.
Let u [i] be obtained by (9.11), and then the original system (9.1) can be rewritten
as
Algorithm 3 PI
1: Let i = 0, select an admissible control u [0] .
2: Let the iteration index i ≥ 0, and solve J [i] and u [i] simultaneously from (9.21).
Since
$$J^{[i]}(x(t)) - J^{[i]}(x(t-T)) = \int_{t-T}^{t} J_x^{[i]T}\dot x\,d\tau. \quad (9.23)$$
Here, (9.25) is the off-policy Bellman equation, which is the main equation in
Algorithm 3. From (9.22)–(9.25), it can be seen that from Algorithm 2, we can derive
Algorithm 3.
In (9.20), if one lets d = 0 and u = u [i] , then (9.24) can be written as
$$J^{[i]}(x(t)) - J^{[i]}(x(t-T)) = \int_{t-T}^{t} J_x^{[i]T}\dot x\,d\tau = \int_{t-T}^{t} J_x^{[i]T}\big(f + g u^{[i]}\big)\,d\tau. \quad (9.26)$$
which means
From (9.26)–(9.28), it can be seen that from Algorithm 3, we can derive Algorithm 2.
(2) In [21] and [22], it was shown that as the iteration of Algorithm 2 goes on, $J^{[i]}$ and $u^{[i]}$ converge to the optimal solution $J^*$ and $u^*$, respectively. Therefore, one
can get the optimal solution J ∗ and u ∗ by Algorithm 3.
In fact, the iteration (9.10) in Algorithm 2 needs the knowledge of system dynam-
ics. While in (9.21) of Algorithm 3, system dynamics are not expressed explicitly.
Therefore, Algorithms 2 and 3 aim at different situations, although the two algo-
rithms are same essentially. The online solution of (9.21) is detailed in Sect. 9.3.3.
According to (9.21), one can define
This equation was developed for the case d = 0 in [19]. Therefore, the equation
error e2 is biased by a term depending on unknown disturbance d. This may cause
nonconvergence or biased results.
It is noted that in (9.29), $d$ is the unknown external disturbance and $J_x^{[i]}$ may be
nonanalytic. For solving Jx[i] and u [i] from (9.29), critic and action networks are
introduced to obtain Jx[i] and u [i] approximately as shown next.
Here we introduce neural network approximation structures for Jx[i] and u [i] . These are
termed respectively by the critic NN and the actor NN. For off-policy learning, these
two structures are updated simultaneously using the off-policy Bellman equation
(9.21), as shown here.
Let the ideal critic network expression be
Let the estimation of Wc[i] be Ŵc[i] , and then the estimation of J [i] can be expressed
as
Let $\Delta\varphi_c = \varphi_c(x(t)) - \varphi_c(x(t-T))$; then the first term of (9.21) is expressed as
Define $D_{cc} = -\big(\Delta\varphi_c^T \otimes I\big)$, $D_{xx} = \int_{t-T}^{t} Q(x)\,d\tau + \int_{t-T}^{t} \hat u^{[i]T} R\,\hat u^{[i]}\,d\tau$, $D_{aa} = -2\int_{t-T}^{t} \big((u - \hat u^{[i]})^T R\big) \otimes \varphi_a^T\,d\tau$ and $D_{dd} = \int_{t-T}^{t} (\nabla\varphi_c\, d)^T \otimes I\,d\tau$. Then (9.40) is expressed as
e3 = Ψ Ŵ [i] − Dx x . (9.42)
This error allows the update of the weights for the critic NN and the actor NN
simultaneously. Data is repeatedly collected at the end of each interval of length
T. When sufficient samples have been collected, this equation can be solved using
the gradient descent method for the weight vector. Unfortunately, $d$ is unknown, so that $D_{dd}$ and $\Psi$ are unknown. Therefore we define $\bar D_{dd} = \int_{t-T}^{t} (\nabla\varphi_c\, B_d)^T \otimes I\,d\tau$ and $\bar\Psi = [D_{cc} + \bar D_{dd},\ D_{aa}]$, where $B_d$ is the bound of $d$ in (9.2).
Then, one has the estimated residual error expressed as
e4 = Ψ̄ Ŵ [i] − Dx x . (9.43)
Let $E = \frac{1}{2} e_4^T e_4$. According to the gradient descent method, a solution can be found by updating $\hat W^{[i]}$ using
where αw is a positive number. This yields an on-line method for updating the weights
for the critic NN and the actor NN simultaneously.
Define the weight error W̃ [i] = W [i] − Ŵ [i] , then according to (9.44), one has
$$\dot{\tilde W}^{[i]} = \alpha_w\,\bar\Psi^T\bar\Psi\,\hat W^{[i]} - \alpha_w\,\bar\Psi^T D_{xx} = -\alpha_w\,\bar\Psi^T\bar\Psi\,\tilde W^{[i]} + \alpha_w\,\bar\Psi^T\bar\Psi\, W^{[i]} - \alpha_w\,\bar\Psi^T D_{xx}. \quad (9.45)$$
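Since (9.44) is a gradient descent step on E = ½ e4ᵀe4, a discretized sketch of the implied update Ŵ̇ = −αw Ψ̄ᵀ(Ψ̄Ŵ − Dxx) is given below; the shapes and synthetic data are illustrative assumptions, not the chapter's experiment.

```python
import numpy as np

def off_policy_update(W_hat, Psi_bar, D_xx, alpha_w, dt):
    """One Euler step of the gradient-descent law behind (9.44)-(9.45):
    d(W_hat)/dt = -alpha_w * Psi_bar^T (Psi_bar W_hat - D_xx),
    i.e. a descent step on E = 0.5 * e4^T e4 with e4 = Psi_bar W_hat - D_xx."""
    e4 = Psi_bar @ W_hat - D_xx
    return W_hat - dt * alpha_w * Psi_bar.T @ e4

# Illustrative shapes only: 4 collected data rows, 10 stacked critic/actor weights.
rng = np.random.default_rng(7)
W_hat = rng.standard_normal(10)
Psi_bar = rng.standard_normal((4, 10))       # [D_cc + D_dd_bar, D_aa] built from data
D_xx = rng.standard_normal(4)
for _ in range(200):
    W_hat = off_policy_update(W_hat, Psi_bar, D_xx, alpha_w=0.01, dt=1.0)
```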
It has been seen that if there is unknown disturbance, off-policy IRL may not perform
properly and may yield biased solution. In this section we show how to redesign the
off-policy IRL method by adding a disturbance compensator. It is shown that this
method yields proper performance by using Lyapunov techniques.
Here we propose the structure of the disturbance compensated off-policy IRL method.
Stability analysis is given in terms of Lyapunov theory. The following assumption is
first given.
Assumption 9.1 The activation function of the action network satisfies $\|\varphi_a\| \le \varphi_{aM}$. The partial derivative of the critic activation function satisfies $\|\nabla\varphi_c\| \le \varphi_{cdM}$. The ideal weights of the critic and action networks satisfy $\|W_c^{[i]}\| \le W_{cM}^{[i]}$ and $\|W_a^{[i]}\| \le W_{aM}^{[i]}$ on a compact set. The approximation error of the action network satisfies $\|\varepsilon_a^{[i]}\| \le \varepsilon_{aM}^{[i]}$ on a compact set.
Remark 9.1 Assumption 9.1 is a standard assumption in NN control theory. Many
NN activation functions are bounded and have bounded derivatives. Examples
include the sigmoid, symmetric sigmoid, hyperbolic tangent, and radial basis func-
tion. Continuous functions are bounded on a compact set. Hence the NN weights are
bounded. The approximation error boundedness property is established in [23].
From Assumption 9.1, it can be seen that Ψ̄ and Dx x are bounded. Without loss
of generality, write ||Ψ̄ || ≤ BΨ and ||Dx x || ≤ Bx x for positive numbers BΨ and Bx x .
Disturbance Compensated Off-Policy IRL
For unknown disturbance $d \neq 0$, the methods in these papers [17, 19] can be modified to compensate for the unknown disturbance as now shown.
Disturbance compensation controller is designed as
$$u_c^{[i]} = -K_c\, g_M^T x / (x^T x + b), \quad (9.46)$$
$$\dot x = f + g\hat u^{[i]} + g u_c^{[i]} + d. \quad (9.48)$$
The whole structure of system (9.48) is shown in the figure (Fig. 9.1).
Based on (9.48), the following theorems can be obtained.
Our main result follows. It verifies the performance of the disturbance compensation
redesigned off-policy IRL algorithm.
Theorem 9.1 Let the control input be equal to (9.49), and let the updating methods
for critic and action networks be as in (9.44). Suppose Assumption 9.1 holds and let
the initial states be in the set such that the NN approximation error is bounded as in
the assumption. Then for every iterative step i, the weight errors W̃ [i] are UUB.
Proof Choose Lyapunov function candidate as follows:
Σ = Σ1 + Σ2 + Σ3 , (9.49)
where $\Sigma_1 = x^T x$, $\Sigma_2 = l_1 J^{[i]}(x)$ and $\Sigma_3 = \frac{l_2}{2\alpha_w}\tilde W^{[i]T}\tilde W^{[i]}$, $l_1 > 0$, $l_2 > 0$.
As f is locally Lipchitz, then there exists B f > 0, s.t. || f (x)|| ≤ B f ||x||. There-
fore, from (9.48), one has
≤(2B f + 1)||x||2 + Bg2 ||û [i] ||2 − Bd2 ||gg TM ||||x||2 + ||x||2 Bd2 + 1
≤(2B f + 1)||x||2 + Bg2 ||û [i] ||2 + 1. (9.50)
Thus,
Σ̇ ≤ −Z T M Z + Z T N + 1, (9.54)
and
l2 > 0. (9.56)
If
$$\|Z\| \ge \frac{\|N\|}{2\lambda_{\min}(M)} + \sqrt{\frac{\|N\|^2}{4\lambda_{\min}^2(M)} + \frac{1}{\lambda_{\min}(M)}}, \quad (9.57)$$
Theorem 9.2 Suppose the hypotheses of Theorem 9.1 hold. Define $H(x, u_s^{[i]}, \hat J_x^{[i]}) = \hat J_x^{[i]T}\big(f + g\hat u^{[i]} + g u_c^{[i]} + d\big) + Q(x) + \hat u^{[i]T} R\,\hat u^{[i]}$. Then
(1) $u_s^{[i]}$ is close to $u^{[i]}$ within a finite bound as $t \to \infty$.
(2) $H(x, u_s^{[i]}, \hat J_x^{[i]}) = H(x, \hat W_a^{[i]}, \hat W_c^{[i]})$ is UUB.
Proof (1) From Theorem 9.1, $x$ will converge to a finite bound of zero as $t \to \infty$. Therefore, there exist $B_{uc} > 0$ and $B_{wa} > 0$ satisfying $\|\hat u_c^{[i]}\| \le B_{uc}$ and $\|\tilde W_a^{[i]}\| \le B_{wa}$. Since $u_s^{[i]} - u^{[i]} = \hat W_a^{[i]T}\varphi_a - W_a^{[i]T}\varphi_a - \varepsilon_a^{[i]} + u_c^{[i]} = -\tilde W_a^{[i]T}\varphi_a - \varepsilon_a^{[i]} + u_c^{[i]}$, one can get
$$\|u_s^{[i]} - u^{[i]}\| \le \varphi_{aM} B_{wa} + \varepsilon_{aM}^{[i]} + B_{uc}. \quad (9.58)$$
$$\begin{aligned} H(x, \hat W_a^{[i]}, \hat W_c^{[i]}) ={}& (W_c^{[i]} - \tilde W_c^{[i]})^T\nabla\varphi_c f + (W_c^{[i]} - \tilde W_c^{[i]})^T\nabla\varphi_c\, g u_c^{[i]}\\ &+ (W_c^{[i]} - \tilde W_c^{[i]})^T\nabla\varphi_c\big(g W_a^{[i]T}\varphi_a - g\tilde W_a^{[i]T}\varphi_a + d\big)\\ &+ Q(x) + \varphi_a^T(W_a^{[i]} - \tilde W_a^{[i]})R(W_a^{[i]T} - \tilde W_a^{[i]T})\varphi_a, \end{aligned} \quad (9.59)$$
H (x, Ŵa[i] , Ŵc[i] ) = Wc[i]T ∇ϕc f − W̃c[i]T ∇ϕc f + Wc[i]T ∇ϕc gu [i]
c
As εa → 0 and εc → 0, for a fixed admissible control policy, one has H (x, Wa[i] , Wc[i] )
is bounded, i.e., there exists H B , s.t. ||H (x, Wa[i] , Wc[i] )|| ≤ H B . Therefore, (9.60) can
be written as
H (x, Ŵa[i] , Ŵc[i] ) = − W̃c[i]T ∇ϕc f + Wc[i]T ∇ϕc gu [i] [i]T [i]
c − W̃c ∇ϕc gu c
||H (x, Ŵa[i] , Ŵc[i] )|| ≤ϕcd M B f ||W̃c[i] ||||x|| + ϕcd M Bg ||Wc[i] ||Buc
+ ϕcd M Bg ||W̃c[i] ||Buc + ϕcd M Bg ϕa M ||Wa[i] ||||W̃c[i] ||
+ ϕcd M Bg ϕa M ||Wc[i] ||||W̃a[i] ||
+ ϕcd M Bg ϕa M ||W̃a[i] ||||W̃c[i] ||
+ 2ϕa2M ||Wa[i] ||||R||||W̃a[i] || + ϕa2M ||R||||W̃a[i] ||2
+ ϕcd M ||Wc[i] ||Bd + ϕcd M ||W̃c[i] ||Bd + H B . (9.62)
According to Theorems 9.1 and 9.2, the signals on the right-hand side of (9.62)
are bounded, therefore H (x, u [i] ˆ[i]
s , Jx ) is UUB.
Theorem 9.3 Let $J^{[i]}$ and $u^{[i]}$ be defined in (9.10) and (9.11). Then, for all $i = 0, 1, 2, \ldots$, $u^{[i+1]}$ is admissible and $0 \le J^{[i+1]} \le J^{[i]}$. Furthermore, $\lim_{i\to\infty} J^{[i]} = J^*$ and $\lim_{i\to\infty} u^{[i]} = u^*$.
In this section we present simulation results that verify the proper performance of
the disturbance compensated off-policy IRL algorithm.
Consider the following torsional pendulum system [25]
$$\begin{cases} \dfrac{d\theta}{dt} = \omega + d_1,\\[6pt] J\dfrac{d\omega}{dt} = u - Mgl\sin\theta - f_d\dfrac{d\theta}{dt} + d_2, \end{cases} \quad (9.63)$$
where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar,
respectively. The system states are the current angle θ and the angular velocity ω. Let
$J = \frac{4}{3}Ml^2$ and $f_d = 0.2$ be the rotary inertia and the frictional factor, respectively. Let $g = 9.8\,\mathrm{m/s^2}$ be the gravitational acceleration, and let $d = [d_1; d_2]$ be white noise.
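A simulation sketch of the torsional pendulum (9.63) under white-noise disturbances is given below; the time step, the disturbance intensity, and the zero control input are illustrative assumptions, not the chapter's simulation settings.

```python
import numpy as np

M, l, g = 1.0 / 3.0, 2.0 / 3.0, 9.8
J = 4.0 / 3.0 * M * l**2                      # rotary inertia
f_d = 0.2                                     # frictional factor
dt, sigma_d = 0.01, 0.05                      # assumed step size and noise level

def pendulum_step(theta, omega, u, rng):
    """One Euler step of (9.63) with white-noise disturbances d1, d2."""
    d1, d2 = sigma_d * rng.standard_normal(2)
    theta_dot = omega + d1
    omega_dot = (u - M * g * l * np.sin(theta) - f_d * theta_dot + d2) / J
    return theta + dt * theta_dot, omega + dt * omega_dot

rng = np.random.default_rng(8)
theta, omega = 0.5, 0.0                       # arbitrary initial condition
for _ in range(1500):
    theta, omega = pendulum_step(theta, omega, u=0.0, rng=rng)
```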
The activation functions $\varphi_c$ and $\varphi_a$ are chosen as hyperbolic tangent functions. The structures of the critic and action networks are 2-8-1 and 2-8-1, respectively. Let $Q = I_2$, $R = 1$ and $\alpha_w = 0.01$; then the system state and control input trajectories are displayed in Figs. 9.2 and 9.3. Therefore, we can confirm the effectiveness of the disturbance compensated off-policy IRL algorithm in this chapter.
(Figure: control input trajectory versus time steps)
(Figure: state trajectories x(1) and x(2) versus time steps)
9.6 Conclusion
This chapter proposes an optimal controller for unknown continuous time systems
with unknown disturbances. Based on policy iteration, an off-policy IRL algorithm is established to obtain the iterative control. Critic and action networks are used to obtain
the iterative performance index function and control approximately. The weight
updating method is given based on off-policy IRL. A compensation controller is
constructed to reduce the influence of unknown disturbances. Based on Lyapunov
techniques, one proves the weight errors are UUB. The convergence of the Hamilto-
nian function is also proven. Simulation study demonstrates the effectiveness of the
proposed optimal control method for unknown systems with disturbances.
From this chapter, we can see that the weight error bound depends on the dis-
turbance bound. In the future, we will further study the method that focus on the
unknown disturbance instead of the disturbance bound.
References
1. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming for large-scale systems with an
application to multimachine power systems. IEEE Trans. Circuits Syst. II: Express Br. 59(10),
693–697 (2012)
2. Chen, B., Liu, K., Liu, X., Shi, P., Lin, C., Zhang, H.: Approximation-based adaptive neural
control design for a class of nonlinear systems. IEEE Trans. Cybern. 44(5), 610–619 (2014)
3. Lewis, F., Vamvoudakis, K.: Reinforcement learning for partially observable dynamic pro-
cesses: adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man
Cybern. Part B: Cybern. 41(1), 14–25 (2011)
4. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-
valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
5. Lee, J., Park, J., Choi, Y.: Integral Q-learning and explorized policy iteration for adaptive
optimal control of continuous-time linear systems. Automatica 48(11), 2850–2859 (2012)
6. Kiumarsi, B., Lewis, F., Modares, H., Karimpour, A., Naghibi-Sistani, M.: Reinforcement Q-
learning for optimal tracking control of linear discrete-time systems with unknown dynamics.
Automatica 50(4), 1167–1175 (2014)
7. Vrabie, D., Pastravanu, O., Lewis, F., Abu-Khalaf, M.: Adaptive optimal control for continuous-
time linear systems based on policy iteration. Automatica 45(2), 477–484 (2009)
8. Lewis, F., Vrabie, D., Vamvoudakis, K.: Reinforcement learning and feedback control: using
natural decision methods to design optimal adaptive controllers. IEEE Control Syst. Mag.
32(6), 76–105 (2012)
9. Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K., Lewis, F., Dixon, W.: A novel
actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear sys-
tems. Automatica 49(1), 82–92 (2013)
10. Vrabie, D., Lewis, F.: Adaptive dynamic programming for online solution of a zero-sum dif-
ferential game. J. Control Theory Appl. 9(3), 353–360 (2011)
11. Vrabie, D., Lewis, F.: Integral reinforcement learning for online computation of feedback nash
strategies of nonzero-sum differential games. In: Proceedings of Decision and Control, Atlanta,
GA, USA, pp. 3066–3071 (2010)
12. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. A Bradford Book. The MIT
Press, Cambridge (2005)
164 9 Off-Policy Actor-Critic Structure for Optimal Control …
13. Wang, J., Xu, X., Liu, D., Sun, Z., Chen, Q.: Self-learning cruise control using Kernel-based
least squares policy iteration. IEEE Trans. Control Syst. Technol. 22(3), 1078–1087 (2014)
14. Vamvoudakis, K., Vrabie, D., Lewis, F.: Online adaptive algorithm for optimal control with
integral reinforcement learning. Int. J. Robust Nonlinear Control 24(17), 2686–2710 (2015)
15. Li, H., Liu, D., Wang, D.: Integral reinforcement learning for linear continuous-time zero-sum
games with completely unknown dynamics. IEEE Trans. Autom. Sci. Eng. 11(3), 706–714
(2014)
16. Luo, B., Wu, H., Huang, T.: Off-policy reinforcement learning for H∞ control design. IEEE
Trans. Cybern. 45(1), 65–76 (2015)
17. Jiang, Y., Jiang, Z.: Computational adaptive optimal control for continuous-time linear systems
with completely unknown dynamics. Automatica 48(10), 2699–2704 (2012)
18. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming for nonlinear control design. In:
Proceedings of IEEE Conference on Decision and Control, Maui, Hawaii, USA, pp. 1896–1901
(2012)
19. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming and feedback stabilization of non-
linear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(5), 882–893 (2014)
20. Beard, R., Saridis, G., Wen, J.: Galerkin approximations of the generalized Hamilton-Jacobi-
Bellman equation. Automatica 33(12), 2159–2177 (1997)
21. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating
actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
22. Saridis, G., Lee, C.: An approximation theory of optimal control for trainable manipulators.
IEEE Trans. Syst. Man Cybern. Part B: Cybern. 9(3), 152–159 (1979)
23. Hornik, K., Stinchcombe, M., White, H., Auer, P.: Degree of approximation results for feedfor-
ward networks approximating unknown mappings and their derivatives. Neural Comput. 6(6),
1262–1275 (1994)
24. Lewis, F., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators and
Nonlinear Systems. Taylor and Francis, London (1999)
25. Liu, D., Wei, Q.: Policy iteration adaptive dynamic programming algorithm for discrete-time
nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(3), 621–634 (2014)
Chapter 10
An Iterative ADP Method to Solve for a
Class of Nonlinear Zero-Sum Differential
Games
10.1 Introduction
A large class of real systems are controlled by more than one controller or decision
maker with each using an individual strategy. These controllers often operate in a
group with a general quadratic performance index function as a game [1]. Zero-sum
(ZS) differential game theory has been widely applied to decision making problems
[2–7], stimulated by a vast number of applications, including those in economics,
management, communication networks, power networks, and in the design of com-
plex engineering systems. In these situations, many control schemes are presented
in order to reach some form of optimality [8, 9]. Traditional approaches to deal with
ZS differential games are to find out the optimal solution or the saddle point of the
games. So many interests are developed to discuss the existence conditions of the
differential ZS games [10, 11].
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 165
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_10
166 10 An Iterative ADP Method to Solve for a Class …
In the real world, however, the existence conditions of the saddle point for ZS
differential games are so difficult to satisfy that many applications of the ZS differ-
ential games are limited to linear systems [12–14]. On the other hand, for many ZS
differential games, especially in nonlinear case, the optimal solution of the game (or
saddle point) doesn’t exist inherently. Therefore, it is necessary to study the optimal
control approach for the ZS differential games that the saddle point is invalid. The
earlier optimal control scheme is to adopt the mixed trajectory method [15, 16],
one player selects an optimal probability distribution over his control set and the
other player selects an optimal probability distribution over his own control set, and
then the expected solution of the game can be obtained under the meaning of the
probability. The expected solution of the game is called mixed optimal solution and
the corresponding performance index function is mixed optimal performance index
function. The main difficulty of the mixed trajectory for the ZS differential game is
that the optimal probability distribution is too hard to obtain if not impossible under
the whole real space. Furthermore, the mixed optimal solution is hardly reached once
the control schemes are determined. In most cases (i.e. in engineering cases), how-
ever, the optimal solution or mixed optimal solution of the ZS differential games has
to be achieved by a determined optimal or mixed optimal control scheme. In order
to overcome these difficulties, a new iterative approach is proposed in this chapter
to solve the ZS differential games for the nonlinear system.
In this chapter, it is the first time that the continuous-time two-person ZS differ-
ential games for nonlinear systems are solved by the iterative ADP method. When
the saddle point exists, using the proposed iterative ADP method, the optimal control
pair is obtained to make the performance index function reach the saddle point. When
the saddle point does not exist, according to mixed trajectory method, a determined
mixed optimal control scheme is proposed to obtain the mixed optimal performance
index function. In a brief, the main contributions of this chapter include:
(1) Construct a new iterative method to solve two-person ZS differential games
for a class of nonlinear systems using ADP.
(2) Obtain the optimal control pair that makes the performance index function
reach the saddle point with rigid stability and convergence proof.
(3) Design a determined mixed optimal control scheme to obtain the mixed optimal
performance index function under the condition that there is no saddle point, and
give the analysis of stability and convergence.
This chapter is organized as follows. Section 10.2 presents the preliminaries and
assumptions. In Sect. 10.3, iterative ADP method for ZS differential games is pro-
posed. In Sect. 10.4, the neural network implementation for the optimal control
scheme is presented. In Sect. 10.5, simulation studies are given to demonstrate the
effectiveness of the proposed method. The conclusion is drawn in Sect. 10.6.
Consider the following two-person ZS differential game. The state trajectory at time t
of the game denoted by x = x(t) is described by the continuous-time affine nonlinear
function
10.2 Preliminaries and Assumptions 167
be the lower performance index function with the obvious inequality V (x) ≥ V (x).
Define the optimal control pairs be (u, w) and (u, w) for upper and lower performance
index function respectively. Then we have
and
we say that the optimal performance index function of the ZS differential game or the
saddle point exists and the corresponding optimal control pair is denoted by (u ∗ , w∗ ).
The following assumptions and lemmas are proposed that are in effect in the
remaining sections.
Assumption 10.2 The upper performance index function and the lower performance
index function both exist.
Based on the above assumptions, the following two lemmas are important to apply
the dynamic programming method.
Lemma 10.1 (principle of optimality) If the upper and lower performance index
function are defined as (10.3) and (10.4) respectively, then for 0 ≤ t ≤ tˆ < ∞, x ∈
Rn , u ∈ Rk , w ∈ Rm , we have
tˆ
V (x) = inf sup (x T Ax + u T Bu
u∈U [t,tˆ) w∈W [t,tˆ) t
+ w Cw + 2u Dw + 2x Eu + 2x Fw)dt + V (x(tˆ)) ,
T T T T
(10.8)
and
tˆ
V (x) = sup inf (x T Ax + u T Bu
w∈W [t,tˆ) u∈U [t,tˆ) t
T ˆ
+ w Cw + 2u Dw + 2x Eu + 2x Fw)dt + V (x(t )) .
T T T
(10.9)
Lemma 10.2 (HJI equation) If the upper and lower performance index function are
defined as (10.3) and (10.4) respectively, we can obtain the following Hamilton–
Jacobi–Isaacs (HJI) equations
dV (x) dV (x)
where V t (x) = , V x (x) = , H (V x (x), u, w) = inf sup {V x (a(x) +
dt dx u∈U w∈W
b(x)u + c(x)w) + (x T Ax + u T Bu + wT Cw + 2u T Dw + 2x T Eu + 2x T Fw)} is
called upper Hamilton function, and
dV (x) dV (x)
where V t (x) = , V x (x) = , H (V x (x), u, w) = sup inf {V x (a(x) +
dt dx w∈W u∈U
b(x)u + c(x)w) + (x Ax + u Bu + w Cw + 2u Dw + 2x Eu + 2x T Fw)} is
T T T T T
The optimal control pair is obtained by solving the HJI equations (10.10) and (10.11),
but these equations cannot generally be solved. There is no current method for rigor-
ously confronting this type of equation to find the optimal performance index function
of the system. This is the reason why we introduce the iterative ADP method. In this
section, the iterative ADP method for ZS differential games is proposed and we will
show that the iterative ADP method can be expanded to ZS differential games.
Theorem 10.1 If Assumptions 10.1–10.3 hold, (u, w) is the optimal control pair
for the upper performance index function and (u, w) is the optimal control pair
170 10 An Iterative ADP Method to Solve for a Class …
for the lower performance index function, then there exist the control pairs (u, w),
(u, w) which make V o (x) = V (x, u, w) = V (x, u, w). Furthermore, if the saddle
point exists, then V o (x) = V ∗ (x).
Proof According to (10.3) and (10.5), we have V o (x) ≤ V (x, u, w). Simultaneously,
we also have V (x, u, w) ≤ V (x, u, w). As the system (10.1) is controllable and w is
continuous on Rm , there exists control pair (u, w) which makes V o (x) = V (x, u, w).
On the other hand, according to (10.4) and (10.6) we have V o (x) ≥ V (x, u, w).
And we can also find that V (x, u, w) ≥ V (x, u, w). As u is continuous on Rk , there
exists control pair (u, w) which makes V o (x) = V (x, u, w). Then we have V o (x) =
V (x, u, w) = V (x, u, w).
If the saddle point exists, we have V ∗ (x) = V (x) = V (x). On the other hand,
V (x) ≤ V o (x) ≤ V (x). Then obviously V o (x) = V ∗ (x).
The above theorem builds up the relationship between the optimal or mixed opti-
mal performance index function and the upper and lower performance index func-
tions. It also implies that the mixed optimal performance index function can be
solved through regulating the optimal control pairs for the upper and lower perfor-
mance index function. So firstly, it is necessary to find out the optimal control pair
for both the upper and lower performance index function.
Differentiating the HJI equation (10.10) through the derivative of the control w
for the upper performance index function, it yields
∂H T
= V x c(x) + 2wT C + 2u T D + 2x T F = 0. (10.12)
∂w
Then we can get
1
w = − C −1 (2D T u + 2F T x + cT (x)V x ). (10.13)
2
Substitute (10.13) into (10.10) and take the derivative of u then obtain
1
u = − (B − DC −1 D T )−1 (2(E T − DC −1 F T )x
2
+ (bT (x) − DC −1 cT (x))V x ). (10.14)
Thus, the detailed expression of the optimal control pair (u, w) for upper perfor-
mance index function is obtained.
For the lower performance index function, according to (10.11), take the derivative
of the control u and we have
∂H
= V Tx b(x) + 2u T B + 2wT D + 2x T E = 0. (10.15)
∂u
Then we can get
10.3 Iterative Approximate Dynamic Programming… 171
1
u = − B −1 (2Dw + 2E T x + bT (x)V x ). (10.16)
2
Substitute (10.16) into (10.11) and take the derivative of w, then we obtain
1
w = − (C − D T B D)−1 (2(F T − D T B −1 E T )x
2
− (cT (x) − D T B −1 bT (x))V x ). (10.17)
So the optimal control pair (u, w) for lower performance index function is also
obtained.
If the equality (10.7) holds under the optimal control pairs (u, w) and (u, w),
we have a saddle point; if not, we have a game without saddle point. For such a
differential game that the saddle point does not exist, we adopt a mixed trajectory
method to achieve the mathematical expectation of the performance index func-
tion. To apply mixed trajectory method, the game matrix is necessary to obtain
under the trajectory sets of the control pair (u, w). Small enough Gaussian noise
γu ∈ Rk and γw ∈ Rm are introduced that are added to the optimal control u and
j
w respectively, where γui (0, σi2 ), i = 1, 2, . . . , k and γw (0, σ j2 ), j = 1, 2, . . . , m are
zero-mean exploration noise with variances σi and σ j respectively.
2 2
Therefore, the upper and lower performance index functions (10.3) and (10.4)
become V (x, u, (w + γw )) and V (x, (u + γu ), w) respectively, where V (x, u, (w +
γw )) > V (x, (u + γu ), w) holds. Define the following game matrix
I I1 I I2
I1 L 11 L 12 = L , (10.18)
I2 L 21 L 22
2
2
E(V (x)) = min max PI i L i j PI I j , (10.19)
PI i PI I j
i=1 j=1
Vo − V
where α = .
V −V
For example, let the game matrix be
I I1 I I2
I1 L 11 = 11 L 12 = 7 = L .
I2 L 21 = 5 L 22 = 9
E(L) = L 11 PI PI I + L 12 PI (1 − PI I ) + L 21 (1 − PI )PI I
+ L 22 (1 − PI )(1 − PI I )
= 11PI PI I + 7PI (1 − PI I ) + 5(1 − PI )PI I + 9(1 − PI )(1 − PI I )
1 1
= 8 PI − PI I − + 8. (10.21)
2 4
1 1
E(L) = L 11 + L 21 .
2 2
Remark 10.1 From the above example we can see that once the trajectory in the
trajectory set is determined, the expected value 8 can not be obtained in reality. In most
practical optimal control environment, however, the expected optimal performance
(or mixed optimal performance) has to be achieved.
So in the following part, the expatiation of the method to achieve the mixed optimal
performance index function will be displayed. Calculating the expected performance
index function for N times under the exploration noise γw ∈ Rm and γu ∈ Rk in the
control w and u, we can obtain E 1 (V (x)), E 2 (V (x)), . . . , E N (V (x)). Then the mixed
optimal performance index function can be written by
1
N
V o (x) = E(E i (V (x))) = E i (V (x))
N i=1
= αV (x) + (1 − α)V (x), (10.22)
Vo − V
where α = .
V −V
10.3 Iterative Approximate Dynamic Programming… 173
Then according to Theorem 10.1, the mixed optimal control pair can be obtained
by regulating the control w in the control pair (u, w) that minimizes the error between
V (x) and V o (x) where the performance index function V (x) is defined by
V (x(0)) = V (x(0), u, w)
∞
= (x T Ax + u T Bu + wT Cw + 2u T Dw
0
+ 2x T Eu + 2x T Fw)dt, (10.24)
(x))2 .
min(V (10.26)
w
(x), w) = V
H J B(V x (x), x, w) = 0,
t (x) + H (V (10.27)
where V t (x) = d V (x) , V
x (x) = d V (x) , and the Hamilton function is H (V
x (x),
dt dx
x, w) = min {V x (a(x) + b(x)u + c(x)w) + l(x, w)}.
w∈W
When V (x) < 0, we have −V (x) = −(V(x) − V o (x)) > 0, then we also have
the HJB equation described by
(x)), w) = (−V
H J B((−V t (x)) + H ((−V x (x)), x, w)
=V t (x) + H (V x (x), x, w)
= 0, (10.28)
which is same as (10.27). Then the optimal control w can be obtained by differenti-
ating the HJB equation (10.27) through the derivative of control w:
174 10 An Iterative ADP Method to Solve for a Class …
1 x ).
w = − C −1 (2D T u + 2F T x + cT (x)V (10.29)
2
Remark 10.2 We can also obtain the mixed optimal control pair by regulating the
control u in the control pair (u, w) that minimizes the error between V (x) and V o (x)
where the performance index function V (x) is defined by
V (x(0)) =V (x(0), u, w)
∞
= (x T Ax + u T Bu + wT Cw + 2u T Dw + 2x T Eu + 2x T Fw)dt.
0
(10.30)
Remark 10.3 From (10.24) to (10.30) we can see that effective mixed optimal control
schemes are proposed to obtain the mixed optimal performance index function when
the saddle point does not exist, while the uniqueness of the mixed optimal control
pair is rarely satisfied.
Given the above preparation, we now formulate the iterative approximate dynamic
programming method for ZS differential games as follows.
Step 1 Initialize the algorithm with a stabilizing performance index function V [0]
and control pair (u [0] , w[0] ) where Assumptions 10.1–10.3 hold. Give the computation
precision ζ > 0.
Step 2 For i = 0, 1, . . ., from the same initial state x(0) run the system with control
pair (u [i] , w[i] ) for the upper performance index function and run the system with
control pair (u [i] , w[i] ) for the lower performance index function.
1
u [i+1] = − (B − DC −1 D T )−1 (2(E T − DC −1 F T )x
2
[i]
+ (bT (x) − DC −1 cT (x))V x ), (10.32)
10.3 Iterative Approximate Dynamic Programming… 175
and
1 [i]
w[i+1] = − C −1 (2D T u [i+1] + 2F T x + cT (x)V x ), (10.33)
2
[i]
where (u [i] , w[i] ) is satisfied with H J I (V (x), u [i] , w[i] ) = 0.
[i+1] [i]
Step 4 If
V (x(0)) − V (x(0))
< ζ , let u = u [i] , w = w[i] and V (x) =
[i+1]
V (x) go to step 5, else i = i + 1 and go to step 3.
1
w[i+1] = − (C − D T B D)−1 (2(F T − D T B −1 E)x
2
+ (cT (x) − D T B −1 bT (x))V [i]
x ), (10.35)
and
1
u [i+1] = − B −1 (2Dw[i+1] + 2E T x + bT (x)V [i]
x ), (10.36)
2
where (u [i] , w[i] ) is satisfied with the HJI equation H J I (V [i] (x), u [i] , w[i] ) = 0.
Step 6 If
V [i+1] (x(0)) − V [i] (x(0))
< ζ , let u = u [i] , w = w[i] and V (x) =
V [i+1] (x) go to next step, else i = i + 1 and go to step 5.
Step 7 If
V (x(0)) − V (x(0))
< ζ stop, the saddle point is achieved, else go to the
next step.
Step 8 For i = 0, 1, . . ., regulate the control w for the upper performance index
function and let
1 x[i+1] ).
w[i] = − C −1 (2D T u + 2F T x + cT (x)V (10.38)
2
Remark 10.4 For step 8 of the above process, we can also regulate the control u for
the lower performance index function and the new performance index function is
expressed as
∞
V [i+1] (x(0)) = (x T Ax + u [i]T Bu [i] + wT Cw
0
+ 2u [i]T Dw + 2x T Eu [i] + 2x T Fw − l o (x, u, w, u, w))dt,
(10.39)
In this subsection, we present the proofs to show that the proposed iterative ADP
method for ZS differential games can be used to improve the properties of the system.
The following definition is proposed which is necessary for the remaining proofs.
1 [i]
w[i+1] = − C −1 (2D T u [i+1] + 2F T x + cT (x)V x ),
2
[i+1] 1
u = − (B − DC −1 D T )−1 (2(E T − D T C −1 F T )x
2
[i]
+ (bT (x) − DC −1 cT (x))V x ), (10.40)
[i]
dV (x) dV [i] (x, u [i+1] , w[i+1] )
=
dt dt
[i]T [i]T [i]T
= V x a(x) + V x b(x)u [i+1] + V x c(x)w[i+1] . (10.41)
[i]
dV (x)
Thus we can derive ≤ 0.
dt
[i]
As V (x) ≥ 0, there exist two functions α(
x
) and β(
x
) belong to class
[i]
K and satisfy α(
x
) ≤ V (x) ≤ β(
x
).
For ∀ε > 0, there exists δ(ε) > 0 that makes β(δ) ≤ α(ε). Let t0 is any initial
[i]
dV (x)
time. According to ≤ 0, for t ∈ [t0 , ∞) we have
dt
[i]
[i] [i]
t
dV (x)
V (x(t)) − V (x(t0 )) = dτ ≤ 0. (10.47)
t0 dτ
As α(
x
) belongs to class K, we can obtain
x
≤ ε. (10.49)
Theorem 10.3 If Assumptions 10.1–10.3 hold, u [i] ∈ Rk and w[i] ∈ Rm , V [i] (x) ∈
C1 satisfies the HJI equation H J I (V [i] (x), u [i] , w[i] ) = 0, i = 0, 1, . . ., and ∀ t,
l(x, u [i] , w[i] ) = x T Ax + u [i]T Bu [i] + w[i]T Cw[i] + 2u [i]T Dw[i] + 2x T Eu [i] + 2x T
Fw[i] < 0, then the control pair (u [i] , w[i] ) formulated by
1 −1
u [i+1] = − B (2Dw[i+1] + 2E T x + bT (x)V [i]
x ),
2
1
w[i+1] = − (C − D T B −1 D)−1 (2(F T − D T B −1 E T )x
2
+ (cT (x) − D T B −1 bT (x))V [i]
x ), (10.50)
which is satisfied with the performance index function (10.34) makes the system
(10.1) asymptotically stable.
10.3 Iterative Approximate Dynamic Programming… 179
dV [i] (x)
=V [i]T [i]T
x a(x) + V x (c(x) − b(x)B
−1
D)w[i+1]
dt
− V [i]T
x b(x)B
−1 T
E x + V [i]T
x c(x)w
[i+1]
1
− V [i]T b(x)B −1 bT (x)V [i]T
x . (10.52)
2 x
From the HJI equation we have
0 =V [i]T
x f (x, u [i] , w[i] ) + l(x, u [i] , w[i] )
=V [i]T [i]T
x a(x) + V x (c(x) − b(x)B
−1
D)w[i]
1
− V [i]T b(x)B −1 bT (x)V [i]
4 x x
+ x Ax + w (C − D B D)w[i]
T [i]T T −1
− V [i]T
x b(x)B
−1 T
E x − x T F B −1 E T x + 2x T (F − E B −1 D)w[i] . (10.53)
dV [i] (x)
=V [i]T
x (c(x) − b(x)B
−1
D)(w[i+1] − w[i] )
dt
− x T Ax − w[i]T (C − D T B −1 D)w[i]
+ x T F B −1 E T x − 2x T (F − E B −1 D)w[i]
1
− V [i]T b(x)B −1 bT (x)V [i]
x . (10.54)
4 x
According to (10.50) we have
dV [i] (x)
= − (w[i+1] − w[i] )T (C − D T B −1 D)(w[i+1] − w[i] )
dt
− w[i+1]T (C − D T B −1 D)w[i+1] − x T Ax
+ x T F B −1 E T x − 2x T (F − E B −1 D)w[i+1]
1
− V [i]T b(x)B −1 bT (x)V [i]
x . (10.55)
4 x
If we substitute (10.50) into the utility function, we obtain
180 10 An Iterative ADP Method to Solve for a Class …
<0. (10.56)
dV [i] (x)
So we have > 0.
dt
[i]
As V (x) < 0, there exist two functions α(
x
) and β(
x
) belong to class
K and satisfy α(
x
) ≤ −V [i] (x) ≤ β(
x
).
For ∀ ε > 0, there exists δ(ε) > 0 that makes β(δ) ≤ α(ε). According to
dV [i] (x)
− < 0, for t ∈ [t0 , ∞) we have
dt
t
dV [i] (x)
−V [i] (x(t)) − (−V [i] (x(t0 ))) = − dτ < 0. (10.57)
t0 dτ
As α(
x
) belongs to class K, we can obtain
x
< ε. (10.59)
Corollary 10.1 If Assumptions 10.1–10.3 hold, u [i] ∈ Rk and w[i] ∈ Rm , V [i] (x) ∈
C1 satisfies the HJI equation H J I (V [i] (x), u [i] , w[i] ) = 0, i = 0, 1, . . ., and for ∀ t,
l(x, u [i] , w[i] ) = x T Ax + u [i]T Bu [i] + w[i]T Cw[i] + 2u [i]T Dw[i] + 2x T Eu [i] + 2x T
Fw[i] ≥ 0, then the control pairs (u [i] , w[i] ) which satisfies the performance index
function (10.34) makes the system (10.1) asymptotically stable.
[i]
Proof According to (10.31) and (10.34), we have V [i] (x) ≤ V (x). As the utility
[i]
function l(x, u [i] , w[i] ) ≥ 0, we have V [i] (x) ≥ 0. So we get 0 ≤ V [i] (x) ≤ V (x).
From Proposition 10.1, we know that for ∀ t0 , there exist two functions α(
x
)
and β(
x
) belong to class K and satisfy
[i] [i]
α(ε) ≥ β(δ) ≥ V (x(t0 )) ≥ V (x(t)) ≥ α(
x
). (10.60)
10.3 Iterative Approximate Dynamic Programming… 181
[i] [i]
According to V (x) ∈ C1 , V [i] (x) ∈ C1 and V (x) → 0, there exist time instants
t1 and t2 (not loss of generality, let t0 < t1 < t2 ) that satisfies
[i] [i] [i]
V (x(t0 )) ≥ V (x(t1 )) ≥ V [i] (x(t0 )) ≥ V (x(t2 )). (10.61)
[i]
Choose ε1 > 0 that satisfies V [i] (x(t0 )) ≥ α(ε1 ) ≥ V (x(t2 )). Then there exists
[i]
δ1 (ε1 ) > 0 that makes α(ε1 ) ≥ β(δ1 ) ≥ V (x(t2 )). Then we can obtain
[i]
V [i] (x(t0 )) ≥ α(ε1 ) ≥ β(δ1 ) ≥ V (x(t2 ))
[i]
≥ V (x(t)) ≥ V [i] (x(t)) ≥ α(
x
). (10.62)
As α(
x
) belongs to class K, we can obtain
x
≤ ε. (10.64)
[i] [i]
According to V (x) ∈ C1 , V [i] (x) ∈ C1 and V (x) → 0, there exist time
instants t1 and t2 that satisfies
[i]
−V [i] (x(t0 )) ≥ −V [i] (x(t1 )) ≥ −V (x(t0 )) ≥ −V [i] (x(t2 )). (10.66)
[i]
Choose ε1 > 0 that satisfies −V (x(t0 )) ≥ α(ε1 ) ≥ −V [i] (x(t2 )). Then there
exists δ1 (ε1 ) > 0 that makes α(ε1 ) ≥ β(δ1 ) ≥ −V [i] (x(t2 )). Then we can obtain
182 10 An Iterative ADP Method to Solve for a Class …
[i]
−V (x(t0 )) ≥ α(ε1 ) ≥ β(δ1 ) ≥ −V [i] (x(t2 ))
[i]
≥ −V [i] (x(t)) ≥ −V (x(t)) ≥ α(
x
). (10.67)
As α(
x
) belongs to class K, we can obtain
x
< ε. (10.69)
the utility function, then the control pair (u [i] , w[i] ) which is satisfied with the upper
performance index function (10.31) is a pair of asymptotically stable controls for
system (10.1).
Proof For the time sequel t0 < t1 < t2 < · · · < tm < tm+1 < · · · , not loss of gen-
erality, we suppose l(x, u [i] , w[i] ) ≥ 0 in [ t2n , t(2n+1) ) and l(x, u [i] , w[i] ) < 0 in
[ t2n+1 , t(2(n+1)) ) where n = 0, 1, . . .. t
For t ∈ [t0 , t1 ) we have l(x, u [i] , w[i] ) ≥ 0 and t01 l(x, u [i] , w[i] )dt ≥ 0. Accord-
ing to Theorem 10.2, we have
x(t0 )
≥
x(t1 )
≥
x(t1 )
, (10.70)
where t1 ∈ [t0 , t1 ). t2
For t ∈ [t1 , t2 ) we have l(x, u [i] , w[i] ) < 0 and l(x, u [i] , w[i] )dt < 0. Accord-
t1
ing to Corollary 10.2 we have
x(t1 )
>
x(t2 )
>
x(t2 )
, (10.71)
where t2 ∈ [t1 , t2 ).
So we can obtain
x(t0 )
≥
x(t0 )
>
x(t2 )
, (10.72)
where t0 ∈ [t0 , t2 ).
10.3 Iterative Approximate Dynamic Programming… 183
Theorem 10.5 If Assumptions 10.1–10.3 hold, u [i] ∈ Rk and w[i] ∈ Rm , V [i] (x) ∈
C1 satisfies the HJI equation H J I (V [i] (x), u [i] , w[i] ) = 0, i = 0, 1, . . . and l(x, u [i] ,
w[i] ) = x T Ax + u [i]T Bu [i] + w[i]T Cw[i] + 2u [i]T Dw[i] + 2x T Eu [i] + 2x T Fw[i] is
the utility function, then the control pair (u [i] , w[i] ) which is satisfied with the upper
performance index function (10.39) is a pair of asymptotically stable controls for
system (10.1).
Proof Similar to the proof of Theorem 10.3, the conclusion can be obtained according
to Theorem 10.3 and Corollary 10.1 and the proof process is omitted.
In the following part, the analysis of convergence property for the ZS differential
games is presented to guarantee the iterative control pair reach the optimal.
[i]
Proposition 10.1 If Assumptions 10.1–10.3 hold, u [i] ∈ Rk , w[i] ∈ Rm and V (x) ∈
[i]
C1 satisfies the HJI equation H J I (V (x), u [i] , w[i] ) = 0, then the iterative control
pair (u [i] , w[i] ) formulated by
1 [i]
w[i+1] = − C −1 (2D T u [i+1] + 2F T x + cT (x)V x ),
2
1
u [i+1] = − (B − DC −1 D T )−1 (2(E T − D T C −1 F T )x
2
[i]
+ (bT (x) − DC −1 cT (x))V x ), (10.73)
[i]
makes the the upper performance index function V (x) → V (x) as i → ∞.
Proof To show the convergence of the upper performance index function, we will
[i+1] [i]
primarily consider the property of d(V (x) − V (x)) dt.
[i]
According to the HJI equation H J I (V (x), u [i] , w[i] ) = 0, we can obtain
[i+1]
dV (x) dt by replacing the index “i” by the index “i + 1”:
[i+1]
dV (x)
= − (x T Ax + u [i+1]T (B − DC −1 D T )u [i+1]
dt
1 [i]T [i]
+ V x c(x)C −1 cT (x)V x
4
+ 2x T (E − FC −1 D T )u [i+1]
− x T FC −1 F T x). (10.74)
[i+1] [i]
d(V (x) − V (x))
dt
[i+1] [i]
dV (x)
dV (x)
= −
dt dt
= − (x T Ax + u [i+1]T (B − DC −1 D T )u [i+1]
1 [i]T [i]
+ V x c(x)C −1 cT (x)V x + 2x T (E − FC −1 D T )u [i+1] − x T FC −1 F T x)
4
− (−(u [i+1] − u [i] )T (B − DC −1 D T )(u [i+1] − u [i] )
1 [i]T [i]
− u [i+1]T (B − DC −1 D T )u [i+1] − x T Ax − V x c(x)C −1 cT (x)V x
4
− 2x T (E − FC −1 D T )u [i+1] + x T FC −1 F T x)
= u [i+1]T (B − DC −1 D T )u [i+1]
>0. (10.75)
Remark 10.5 The convergence of the upper performance index function can not
guarantee the convergence of the lower performance index function. In fact, the
lower performance index function may be boundary but not convergent. So it is
necessary to analyze the convergence of lower performance index function.
Proposition 10.2 If Assumptions 10.1–10.3 hold, u [i] ∈ Rk , w[i] ∈ Rm and V [i] (x) ∈
C1 satisfies the HJI function H J I (V [i] (x), u [i] , w[i] ) = 0, then the iterative control
pair (u [i] , w[i] ) formulated by
1 −1
u [i+1] = − B (2Dw[i+1] + 2E T x + bT (x)V [i]
x )
2
1
w[i+1] = − (C − D T B D)−1 (2(F T − D T B −1 E T )x
2
+ (cT (x) − D T B −1 bT (x))V [i]
x ) (10.76)
Proof To show the convergence of the lower performance index function, we also
consider the property of d(V [i+1] (x) − V [i] (x)) dt.
From the HJI equation H J I (V [i] (x), u [i] , w[i] ) = 0, we can obtain dV [i+1] (x) dt
by replacing the index “i” by the index “i + 1”:
10.3 Iterative Approximate Dynamic Programming… 185
dV [i+1] (x)
= − w[i+1]T (C − D T B −1 D)w[i+1] − x T Ax
dt
+ x T F B −1 E T x − 2x T (F − E B −1 D)w[i+1]
1
− V [i]T b(x)B −1 bT (x)V [i]
x . (10.77)
4 x
According to (10.50), we have
d(V [i+1] (x) − V [i] (x)) dt
=dV [i+1] (x) dt − dV [i] (x) dt
= − w[i+1]T (C − D T B −1 D)w[i+1] − x T Ax
1
+ x T F B −1 E T x − 2x T (F − E B −1 D)w[i+1] − V [i]T b(x)B −1 bT (x)V [i]
4 x x
[i+1] [i] T T −1 [i+1] [i]
− (−(w − w ) (C − D B D)(w −w )
− w[i+1]T (C − D T B −1 D)w[i+1] − x T Ax + x T F B −1 E T x
1
− 2x T (F − E B −1 D)w[i] − V [i]T b(x)B −1 bT (x)V [i]
x )
4 x
=(w[i+1]T − w[i]T )(C − D T B −1 D)(w[i+1] − w[i] )
<0. (10.78)
[i+1] [i]
Because the system f is stable, and V (x) − V (x) converge to zero. From
[i+1] [i]
the result d(V (x) − V (x)) dt ≤ 0, we can conclude V [i+1] (x) − V [i] (x) ≤ 0,
that is V [i+1] (x) ≥ V [i] (x). As such, V [i] (x) is an increasing sequence of positive
[i+1] [i+1]
number i. On the other hand, we know that V [i+1] (x) ≤ V (x), and V (x) is
[i+1]
a convergent performance index function, and therefore V (x) is convergent i.e.
V [i+1] (x) → V (x) as i → ∞.
Theorem 10.6 If the optimal performance index function of the ZS differential game
or the saddle point exists that is V (x) = V (x) = V ∗ (x), then the control pairs
[i]
(u [i+1] , w[i+1] ) and (u [i+1] , w[i+1] ) make V (x) → V ∗ (x) and V [i] (x) → V ∗ (x)
respectively as i → ∞.
1
w∗ = − C −1 (2D T u + 2F T x + cT (x)V x ),
2
1
u ∗ = − (B − DC −1 D T )−1 (2(E T − D T C −1 F T )x
2
+ (bT (x) − DC −1 cT (x))V x ), (10.79)
which is the same as (10.9) and (10.10). So the control pair (u ∗ , w∗ ) in (10.79)
uniquely decides the upper performance index function V (x) for the same initial
state x(0). It is contradiction against the assumption.
Similarly, we can derive V (x) → V ∗ (x) under the control pair (u [i+1] , w[i+1] ) as
i → ∞.
For the situation the saddle point does not exist, we regulate the control w in
the control pair (u, w) to make the performance index function V (x) reach the the
mixed optimal performance index function V o (x). The following propositions and
theorems are considered the situation.
Proof According to Theorems 10.4 and 10.5, we can derive the result directly.
[i+1] (x) d V
dV [i+1] (x, u, w[i] )
=
dt dt
=V x[i+1]T a(x) + V x[i+1]T b(x)u + V
x[i+1]T c(x)w[i] . (10.80)
[i+1] (x)
dV x[i+1]T c(x)(w[i] − w[i+1] ) − x T Ax
=V
dt
− u T Bu − w[i+1]T Cw[i+1] − 2u T Dw[i+1] − 2x T Eu − 2x T Fw[i+1] .
(10.82)
[i+1] (x)
dV
= − (w[i] − w[i+1] )T C(w[i] − w[i+1] )
dt
− x T Ax − u T Bu − w[i]T Cw[i] − 2u T Dw[i] − 2x T Eu − 2x T Fw[i] .
(10.83)
[i] (x) dt by replacing the index
According to the HJB equation, we can obtain d V
“i + 1 by the index “i”:
[i] (x)
dV
= − x T Ax − u T Bu − w[i]T Cw[i] − 2u T Dw[i] − 2x T Eu − 2x T Fw[i] .
dt
(10.84)
Then we have
[i+1] (x) − V
d(V [i] (x))
dt
= − (w[i] − w[i+1] )T C(w[i] − w[i+1] )
− x T Ax − u T Bu − w[i]T Cw[i] − 2u T Dw[i] − 2x T Eu − 2x T Fw[i]
− (−x T Ax − u T Bu − w[i]T Cw[i] − 2u T Dw[i] − 2x T Eu − 2x T Fw[i] )
= − (w[i] − w[i+1] )T C(w[i] − w[i+1] )
≥0. (10.85)
gent as i → ∞.
[i+1]
Proof Since the system is continuous and V (x) ∈ C1 , we have
188 10 An Iterative ADP Method to Solve for a Class …
[i+1] (x))
d(−V [i+1] (x, u, w[i] )
dV
=−
dt dt
= − Vx[i+1]T x[i+1]T b(x)u − V
a(x) − V x[i+1]T c(x)w[i] . (10.86)
[i+1] (x))
d(−V
=−V x[i+1]T c(x)(w[i] − w[i+1] ) + x T Ax + u T Bu
dt
+ w[i+1]T Cw[i+1] + 2u T Dw[i+1] + 2x T Eu + 2x T Fw[i+1] .
(10.88)
[i+1] (x))
d(−V
=(w[i] − w[i+1] )T C(w[i] − w[i+1] ) + x T Ax
dt
+ u T Bu + w[i]T Cw[i] + 2u T Dw[i] + 2x T Eu + 2x T Fw[i] .
(10.89)
[i] (x))
d(−V
=x T Ax + u T Bu + w[i]T Cw[i] + 2u T Dw[i] + 2x T Eu + 2x T Fw[i] .
dt
(10.90)
Then we have
that satisfies the performance index function (10.37) makes V [i] (x) convergent as
i → ∞.
Proof For the time sequel t0 < t1 < t2 < · · · < tm < tm+1 < · · · , not loss of gener-
˜ w[i] ) ≥ 0 in [ t2n , t(2n+1) ) and l(x,
ality, we suppose l(x, ˜ w[i] ) < 0 in [ t2n+1 , t(2(n+1)) )
where n = 0, 1, 2, . . ..
˜ w[i] ) ≥ 0 and t1 l(x,
For t ∈ [ t2n , t(2n+1) ) we have l(x, ˜ w[i] )dt ≥ 0. According
t0
to Proposition 10.5, we have V [i+1] (x).
(x) ≤ V [i]
˜ w[i] ) < 0 and t2 l(x,
For t ∈ [ t2n+1 , t(2(n+1)) ) we have l(x, ˜ w[i] )dt < 0. Accord-
t1
ing to Proposition 10.5 we have V [i+1] (x) > V [i] (x).
Then for ∀ t0 , we have
t1 t2
[i+1] (x(t0 ))
=
V ˜ w[i] )dt
+
l(x,
t0 t1
t(m+1)
+ ···+
˜ w[i] )dt
+ · · ·
l(x,
tm
[i] (x(t0 ))
.
<
V (10.92)
that satisfies the performance index function (10.37) makes V[i] (x) → V o (x) as
i → ∞.
Proof It is proved by contradiction. Suppose that the control pair (u, w[i] ) makes the
performance index function V [i] (x) converge to V (x) and V (x) = V o (x).
According to Theorem 10.7, as i → ∞ we have the following HJB equation based
on the principle of optimality
(x), w) = V
H J B(V x (x), x, w) = 0.
t (x) + H (V (10.93)
H J B(V t (x) + H (V
(x), w ) = V x (x), x, w ) = 0. (10.94)
190 10 An Iterative ADP Method to Solve for a Class …
It is contradiction. So the assumption does not hold. Thus we have V[i] (x) → V o (x)
as i → ∞.
Remark 10.6 If we regulate the control u for the lower performance index function
which satisfies (10.39), we can also prove that the iterative control pair (u [i] , w)
stabilizes the nonlinear system (10.1) and the performance index function V [i] (x) →
V o (x) as i → ∞. The proof procedure is similar to the proof of Propositions 10.1–
10.5 and Theorems 10.7 and 10.8, and it is omitted.
As the computer can only deal with the digital and discrete signal it is necessary to
transform the continuous-time system and performance index function to the corre-
sponding discrete-time form. Discretization of the system function and performance
index function using Euler and trapezoidal methods [19, 20] leads to
x(t + 1) = x(t) + a(x(t)) + b(x(t))u(t) + c(x(t))w(t) t, (10.95)
and
∞
V (x(0)) = (x T (t)Ax(t) + u T (t)Bu(t) + wT (t)Cw(t)
t=0
+ 2u T (t)Dw(t) + 2x T (t)Eu(t) + 2x T (t)Fw(t))t, (10.96)
where Y ∗ , W ∗ are the ideal weight parameters, ξ(X ) is the estimation error.
There are ten neural networks to implement the iterative ADP method, where
three are model networks, three are critic networks and four are action networks
respectively. All the neural networks are chosen as three-layer feed-forward neural
network. The whole structure diagram is shown in Fig. 10.1.
10.4 Neural Network Implementation 191
Fig. 10.1 The structure diagram of the iterative ADP method for ZS differential games
For the nonlinear system, before carrying out iterative ADP method, we should first
train the model network. The output of the model network is given as
1 2
E m (t) = e (t). (10.100)
2 m
192 10 An Iterative ADP Method to Solve for a Class …
Then the gradient-based weight update rule for the critic network can be described
by
The critic network is used to approximate the performance index functions i.e.
[i]
V (x), V [i] (x) and V[i] (x). The output of the critic network is denoted as
[i]
where Vˆ (x(t + 1)) is the output of the upper critic network.
For the lower performance index function, the target function can be written as
V [i] (x(t)) = x T (t)Qx(t) + u [i+1]T (t)Ru [i+1] (t)
+ w[i+1]T Cw[i+1] + 2u [i+1]T Dw[i+1]
[i]
+ 2x T Eu [i+1] + 2x T Fw[i+1] t + V̂ (x(t + 1)), (10.105)
[i]
where V̂ (x(t + 1)) is the output of the lower critic network.
Then for the upper performance index function, we define the error function for
the critic network by
10.4 Neural Network Implementation 193
[i]
where Vˆ (x(t)) is the output of the upper critic neural network.
And the objective function to be minimized in the critic network is
1 [i]
E c[i] (t) = (e (t))2 . (10.107)
2 c
So the gradient-based weight update rule for the critic network [21, 22] is given
by
∂ E c[i] (t)
wc[i] (t) = ηc − , (10.109)
∂wc[i] (t)
where ηc > 0 is the learning rate of critic network and wc (t) is the weight vector in
the critic network.
For the lower performance index function, the error function for the critic network
is defined by
[i]
e[i] [i]
c (t) = V̂ (x(t)) − V (x(t)). (10.111)
And for the mixed optimal performance index function, the error function can be
expressed as
The action network is used to approximate the optimal and mixed optimal controls.
There are four action networks, two are used to approximate the optimal control
194 10 An Iterative ADP Method to Solve for a Class …
pair for the upper performance index function and two are used to approximate the
optimal control pair for the lower performance index function.
For the two action networks for upper performance index function, x(t) is used
T
as the input for the first action network to create the control u(t), and x(t) u(t)
is used as the input for the other action network to create the control w(t).
The target function of the first action network (u network) is the discretization
formulation of Eq. (10.32):
1
u [i+1] = − (B − DC −1 D T )−1 (2(E T − DC −1 F T )x
2
[i]
∂ V (x(t + 1))
+ (bT (x) − DC −1 cT (x)) . (10.113)
∂ x(t + 1)
And the target function of the second action network (w network) is the discretiza-
tion formulation of Eq. (10.33):
[i]
[i+1] 1 −1 T [i+1] ∂ V (x(t + 1))
w =− C 2D u + 2F x + c (x)
T T
. (10.114)
2 ∂ x(t + 1)
While in the two action network for lower performance index function, x(t) is used
T
as the input for the first action network to create the control w(t), and x(t) u(t)
is used as the input for the other action network to create the control u(t).
The target function of the first action network (w network) is the discretization
formulation of Eq. (10.35):
1
w[i+1] = − (C − D T B D)−1 (2(F T − D T B −1 E)x
2
∂ V [i] (x(t + 1))
+ (cT (x) − D T B −1 bT (x)) . (10.115)
∂ x(t + 1)
The target function of the second action network (u network) is the discretization
formulation of Eq. (10.36):
1 ∂ V [i] (x(t + 1))
u [i+1] = − B −1 2Dw[i+1] + 2E T x + bT (x) . (10.116)
2 ∂ x(t + 1)
The output of the action network i.e. the first network for the upper performance
index function can be formulated as
And the target of the output of the action network is given by (10.113). So we can
define the output error of the action network as
10.4 Neural Network Implementation 195
The weighs in the action network are updated to minimize the following perfor-
mance error measure
1 [i]
E a[i] (t) = (e (t))2 . (10.119)
2 a
The weights update algorithm is similar to the one for the critic network. By the
gradient descent rule, we can obtain
∂ E a[i] (t)
wa[i] (t) = ηa − [i] , (10.121)
∂wa (t)
In this section, two examples are used to illustrate the effectiveness of the proposed
approach for continuous-time nonlinear quadratic ZS game.
196 10 An Iterative ADP Method to Solve for a Class …
ẋ = x + u + w (10.125)
∞
J= (x 2 + u 2 − γ 2 w2 )dt, (10.126)
0
where γ 2 = 2.
We choose three-layer neural networks as the critic network and the model network
with the structure 1-8-1 and 2-8-1 respectively. The structure of action networks is
also three layers. For the upper performance index function, the structure of the u
action network is 1-8-1 and the structure of the w action network is 2-8-1; for the
lower performance index function, the structure of the u action network is 2-8-1
and the structure of the w action network is 1-8-1. The initial weights of action
networks, critic network and model network are all set to be random in [−1, 1]. It
should be mentioned that the model network should be trained first. For the given
initial state x(0) = 1, we train the model network for 20 000 steps under the learning
rate ηm = 0.01. After the training of the model network is completed, the weights
of model network keep unchanged. Then the critic network and the action networks
are trained for 100 time steps so that the given accuracy ζ = 10−6 is reached. In the
training process, the learning rate ηa = ηc = 0.01.
The system and the performance index function are transformed according to
(10.95) and (10.96) where the sample time interval is t = 0.01. Take the iteration
number i = 4. The convergence trajectory of the performance index function is
shown in Fig. 10.2.
From Theorem 4.1 in [17] we know that the saddle point exists for the system
(10.125) with the performance index function (10.126). Then from the simulation
Fig. 10.2, we can see that the performance index functions reach the saddle point after
Fig. 10.4 iterations which demonstrates the effectiveness of the proposed method in
the chapter.
Figure 10.3 shows the iterative trajectories of the control variables u. The optimal
control trajectories are displayed in Fig. 10.4 and the corresponding state trajectory
is shown in Fig. 10.5.
0.1x12 + 0.05x2 0.1 + x1 x2 x1
ẋ = + w
0.2x12 − 0.15x2 x1 0.2 + x1 x2
0.1 + x1 x2 0.5 + x1
+ u, (10.127)
x22 0.1 + x12 0.3 + x1 x2
10.5 Simulation Study 197
T T T
where x = x1 x2 , u = u 1 u 2 u 3 and w = w1 w2 .
The performance index function is formulated by
∞
V (x(0), u, w) = (x T Ax + u T Bu + wT Cw + 2u T Dw + 2x T Eu
0
+ 2x T Fw)dt, (10.128)
⎡ ⎤
100 T
10 −3 0 100 101
where A = , B = ⎣ 0 1 0 ⎦, C = ,D= ,E =
01 0 −3 011 011
001
−1 0
and F = .
0 −1
The system and the performance index function are transformed according to
(10.95) and (10.96) where the sample time interval is t = 0.01. The critic network
and the model network are also chosen as three-layer neural networks with the struc-
ture 2-8-1 and 7-8-2 respectively. The action network is also chosen as three-layer
neural networks. For the upper performance index function, the structure of the u
action network is 2-8-3 and the structure of the w action network is 5-8-2; for the
lower performance index function, the structure of the u action network is 2-8-2 and
the structure of the w action network is 4-8-3. The initial weight is also randomly
chosen in [−1, 1]. For the given initial state x(0) = [−1 0.5 ]T , we train the model
network for 20000 steps. After the training of the model network is completed, the
weights keep unchanged. Then the critic network and the action network are trained
0.6
0.5
performance index function
0.1
0
0 10 20 30 40 50 60 70 80 90 100
time steps
Fig. 10.2 The trajectories of upper and lower performance index functions
198 10 An Iterative ADP Method to Solve for a Class …
-0.1
-0.2
1st iteration for upper performance index
-0.5
-0.6
-0.7
-0.8
0 10 20 30 40 50 60 70 80 90 100
time steps
-0.1
-0.2
-0.3
optimal control
-0.4 control u
-0.5
-0.6
control w
-0.7
-0.8
-0.9
-1
0 10 20 30 40 50 60 70 80 90 100
time steps
0.9
0.8
0.7
0.6
state
0.5
0.4
0.3
0.2
0.1
0
0 10 20 30 40 50 60 70 80 90 100
time steps
1.6
1.4
1.2
performance index function
0.8
st
1 iteration for lower performance index function
0.6
1st iteration for upper performance index function
0.4
limiting iteration for upper performance index function
0.2
0
0 100 200 300 400 500 600 700 800 900 1000
time steps
for 1000 time steps so that the given accuracy ζ = 10−4 is reached. The convergence
curves of the performance index functions are shown in Fig. 10.6.
Figure 10.7 shows the convergent trajectories of control variable u 1 for upper
performance index function. And the optimal control trajectories for upper perfor-
200 10 An Iterative ADP Method to Solve for a Class …
0.05
0.045
0.035
1st iteration
0.03
0.025
0.015
0.01
0.005
0
0 100 200 300 400 500 600 700 800 900 1000
time steps
0.2
controls for upper performance index function
0.15
u3
0.1 u2
u1
0.05
-0.05
w2
-0.1 w1
-0.15
0 100 200 300 400 500 600 700 800 900 1000
time steps
Fig. 10.8 The optimal control for upper performance index function
10.5 Simulation Study 201
0.04
0.02
0.01
limiting
0
iteration
-0.01
-0.02
-0.03
0 100 200 300 400 500 600 700 800 900 1000
time steps
0.15
controls for lower performance index function
0.1 u3
w 2 u1
0.05 w1
u2
-0.05
-0.1
0 100 200 300 400 500 600 700 800 900 1000
time steps
Fig. 10.10 The optimal control for lower performance index function
mance index functions is displayed in Fig. 10.8. Figure 10.9 shows the convergent
trajectories of control variable w2 for lower performance index function. And the
optimal control trajectories for lower performance index functions is displayed in
Fig. 10.10.
202 10 An Iterative ADP Method to Solve for a Class …
1.4
1.2
upper performance index function
performance index function
0.6
0.4
0
0 100 200 300 400 500 600 700 800 900 1000
time steps
Gaussian noise γw and γu into w and u and take N = 100. According to (10.18)–
(10.23) we can get the value of the mixed optimal performance index function
45.15594227340300. And the mixed optimal performance index function can be
expressed as V o (x) = 0.4716V (x) + 0.5284V (x).
Figure 10.11 displays the trajectories of the mixed optimal performance index
function, the corresponding mixed optimal control trajectories and state trajectories
are shown in Figs. 10.12 and 10.13 respectively. From the above simulation results we
can see that using the proposed iterative ADP method, we obtain the mixed optimal
performance index function successfully as the saddle point of the differential game
does not exists.
10.5 Simulation Study 203
0.25
0.2
0.15
mixed optimal control
u1
0.1
u2
u3
0.05
-0.05
w2
-0.1 w1
-0.15
0 100 200 300 400 500 600 700 800 900 1000
time steps
0.5
x
1
x
2
0
state trajectories
-0.5
-1
0 200 400 600 800 1000
time steps
10.6 Conclusions
References
19. Padhi, R., Unnikrishnan, N., Wang, X., Balakrishman, S.: A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw. 19(10), 1648–1660 (2006)
20. Gupta, S.: Numerical Methods for Engineerings. Wiley Eastern Ltd. and New Age International
Company, New Delhi (1995)
21. Si, J., Wang, Y.: On-line learning control by association and reinforcement. IEEE Trans. Neural
Netw. 12(2), 264–275 (2001)
22. Enns, R., Si, J.: Helicopter trimming and tracking control using direct neural dynamic pro-
gramming. IEEE Trans. Neural Netw. 14(7), 929–939 (2003)
Chapter 11
Neural-Network-Based Synchronous
Iteration Learning Method for
Multi-player Zero-Sum Games
In this chapter, a synchronous solution method for multi-player zero-sum (ZS) games
without system dynamics is established based on neural network. The policy iteration
(PI) algorithm is presented to solve the Hamilton–Jacobi–Bellman (HJB) equation.
It is proven that the obtained iterative cost function is convergent to the optimal
game value. For avoiding system dynamics, off-policy learning method is given
to obtain the iterative cost function, controls and disturbances based on PI. Critic
neural network (CNN), action neural networks (ANNs) and disturbance neural net-
works (DNNs) are used to approximate the cost function, controls and disturbances.
The weights of neural networks compose the synchronous weight matrix, and the
uniformly ultimately bounded (UUB) of the synchronous weight matrix is proven.
Two examples are given to show that the effectiveness of the proposed synchronous
solution method for multi-player ZS games.
11.1 Introduction
The importance of strategic behavior in the human and social world is increasingly
recognized in theory and practice. As a result, game theory has emerged as a fun-
damental instrument in pure and applied research [1]. Modern day society relies on
the operation of complex systems, including aircraft, automobiles, electric power
systems, economic entities, business organizations, banking and finance systems,
computer networks, manufacturing systems, and industrial processes. Networked
dynamical agents have cooperative team-based goals as well as individual selfish
goals, and their interplay can be complex and yield unexpected results in terms
of emergent teams. Cooperation and conflict of multiple decision-makers for such
systems can be studied within the field of cooperative and noncooperative game
theory [2]. It knows that many real-world systems are often controlled by more
than one controller or decision maker with each using an individual strategy. These
controllers often operate in a group with a general quadratic performance index
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 207
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_11
208 11 Neural-Network-Based Synchronous Iteration...
p
q
ẋ = f (x) + g(x) u i + h(x) dj, (11.1)
i=1 j=1
∞
p
q
J (x(0), U p , Dq ) = x T Qx+ u iT Ri u i − d Tj S j d j dt, (11.2)
0 i=1 j=1
The multi-player ZS differential game selects the minimizing player set U p and
the maximizing player set Dq such that the saddle point U p∗ and Dq∗ satisfies the
following inequalities
If we give the feedback policy (U p (x), Dq (x)), then the value or cost of the policy
is
∞
p
q
V (x(t)) = x T Qx+ u iT Ri u i − d Tj S j d j dt. (11.6)
t i=1 j=1
p
q
H (x, ∇V, U p , D q ) = x Qx + T
u iT Ri u i − d Tj S j d j
i=1 j=1
⎛ ⎞
p
q
+ ∇V T ⎝ f + g ui + h dj⎠
i=1 j=1
= 0, (11.7)
∂V
where ∇V = . The stationary conditions are
∂x
210 11 Neural-Network-Based Synchronous Iteration...
∂H
= 0, i = 1, 2, . . . , p, (11.8)
∂u i
and ∂H
= 0, j = 1, 2, . . . , q. (11.9)
∂d j
According to (11.7), we have the optimal controls and the disturbances are
1
u i∗ = − Ri−1 g T ∇V ∗ , i = 1, 2, . . . , p, (11.10)
2
and 1 −1 T
d ∗j = S h ∇V ∗ , j = 1, 2, . . . , q. (11.11)
2 j
From Bellman equation (11.7), we can derive V ∗ from the solution of the HJI equation
1
p
0 = x T Qx + ∇V T f − ∇V T g Ri−1 g T ∇V
4 i=1
1
q
+ ∇V T h S −1
j h ∇V.
T
(11.12)
4 j=1
Note that if (11.12) is solved, then the optimal controls are obtained. In general
case, the PI algorithm can be applied to get V ∗ . The algorithm implementation process
is given in Algorithm 1.
The convergence of Algorithm 4 will be analyzed in the next theorem.
p
q
0 = x T Qx + u i[k]T Ri u i[k] − d [k]T
j S j d [k]
j (11.13)
i=1 j=1
p
q
+ ∇V [k]T f + g u i[k] + h d [k]
j .
i=1 j=1
Theorem 11.1 Define V [k] as in (11.13). Let control policy u i[k] and disturbance
policy d [k]
j be in (11.14) and (11.15), respectively. Then the iterative values V
[k]
∗
converge to the optimal game values V , as k → ∞.
p
q
V̇ [k+1] = −x T Qx − u i[k+1]T Ri u i[k+1] + d [k+1]T
j S j d [k+1]
j . (11.16)
i=1 j=1
Then
p
q
V̇ [k] = − x T Qx − u i[k]T Ri u i[k] + d [k]T
j S j d [k]
j
i=1 j=1
p
q
− u i[k+1]T Ri u i[k+1] + d [k+1]T
j S j d [k+1]
j
i=1 j=1
p
q
+ u i[k+1]T Ri u i[k+1] − d [k+1]T
j S j d [k+1]
j
i=1 j=1
p
q
=V̇ [k+1]
+ u i[k+1]T Ri u i[k+1] − d [k+1]T
j S j d [k+1]
j
i=1 j=1
p
q
− u i[k]T Ri u i[k] + d [k]T
j S j d [k]
j . (11.17)
i=1 j=1
By transformation, we have
p
V̇ [k] = V̇ [k+1] − (u i[k+1] − u i[k] )T Ri (u i[k+1] − u i[k] )
i=1
p
+2 u i[k+1]T Ri (u i[k+1] − u i[k] )
i=1
q
+ (d [k+1]
j − d [k] [k+1]
j ) S j (d j
T
− d [k]
j )
j=1
p
−2 d [k+1]T
j S j (d [k+1]
j − d [k]
j ). (11.18)
i=1
p
p
V̇ [k] = V̇ [k+1] − u i[k]T Ri u i[k] + 2 u i[k+1]T Ri u i[k]
i=1 i=1
q
p
+ d [k]T
j S j d [k]
j −2 d [k+1]T
j S j d [k]
j . (11.19)
j=1 i=1
and
∇V [k]T h = 2d [k+1]T
j Sj. (11.21)
p
p
V̇ [k] = V̇ [k+1] − u i[k]T Ri u i[k] − V [k]T gu i[k]
i=1 i=1
q
p
+ d [k]T
j S j d [k]
j − V [k]T hd [k]
j . (11.22)
j=1 i=1
and
d [k]T
j S j d [k]
j − V
[k]T
hd [k]
j < 0. (11.24)
From Algorithm 4, we can see that the PI algorithm depends on system dynam-
ics, which is unknown in this chapter. Therefore, in the next section, off-policy PI
algorithm will be presented which can solve the control and disturbance policies
synchronously.
11.3 Synchronous Solution of Multi-player ZS Games 213
p
q
p
q
ẋ = f + g u i[k] +h d [k]
j +g (u i − u i[k] ) +h (d j − d [k]
j ). (11.25)
i=1 j=1 i=1 j=1
Then (11.27) is the off-policy Bellman equation for multi-player ZS games, which
is expressed as
214 11 Neural-Network-Based Synchronous Iteration...
It can be seen that (11.28) shows two points. First, the system dynamics is not nec-
essary for obtaining V [k] . Second, u i[k] , d [k]
j and V
[k]
can be obtained synchronously.
In the next part, the implementation method for solving (11.28) will be presented.
In this part, the method for solving off-policy Bellman equation (11.28) is given.
Critic, action and disturbance networks are applied to approximate V [k] , u i[k] and
d [k]
j . The implementation block diagram is shown in Fig. 11.1. Here CNN, ANNs
and DNNs are used to approximate the cost, control policies and disturbances.
In the neural network, if the number of hidden layer neurons is L, the weight
matrix between the input layer and hidden layer is Y , the weight matrix between the
hidden layer and output layer is W and the input vector of the neural network is X ,
then the output of three-layer neural network is represented by
ANN
ANN
Plant CNN
DNN
DNN
FN (X, Y, W ) = W T σ̂ (Y X ), (11.29)
where σ̂ (Y X ) is the activation function. For convenience of analysis, only the output
weight W is updating during the training, while the hidden weight is kept unchanged.
Hence, in the following part, the neural network function (11.29) can be simplified
by the expression
FN (X, W ) = W T σ (X ). (11.30)
where A[k] is the ideal weight of critic network, φV (x) is the active function, and
δV (x) is residual error. Let the estimation of A[k] is Â[k] . Then the estimation of
V [k] (x) is
and
where Bi[k] is the ideal weight of action network, φu (x) is the active function, and
δu (x) is residual error. Let B̂i[k] be the estimation of Bi[k] , then the estimation of u i[k]
is
d [k] [k]T
j = Cj φd (x) + δd (x), (11.36)
where C [k]
j is the ideal weight of action network, φd (x) is the active function, and
δ (x) is residual error. Let Cˆ[k] be the estimation of C [k] , then the estimation of d [k]
d j j j
is
d̂ [k] [k]T
j = Ĉ j φd (x). (11.37)
Since
p
p T
φuT B̂i[k+1]T Ri (u i − û i[k] ) = (u i − û i[k] Ri ⊗ φuT vec( B̂i[k+1] ),
i=1 i=1
(11.40)
q
φdT Ĉ [k+1]T
j Sj (d j − d̂ [k]
j )
j=1
q T
[k]
= (d j − d̂ j S j ⊗ φd vec(Ĉ [k+1]
T
j ). (11.41)
j=1
Define
t+T
p T
Πu = −2 (u i − û i[k] ) Ri ⊗ φuT dτ, (11.45)
t i=1
t+T
q T
Πd = − (d j − d̂ [k]
j ) Sj ⊗ φdT dτ. (11.46)
t j=1
Then we have
ΠΠ = [ΠV Πu Πd ] , (11.48)
Then (11.47) is
Define E [k] = 1/2e[k]T e[k] , then according to gradient descent algorithm, the
update method of the weight Ŵi,[k]j is
Ŵ˙ i,[k]j = −ηi,[k]j ΠΠT ΠΠ Ŵi,[k]j − Π , (11.51)
Theorem 11.2 Let the update method for critic, action and disturbance networks
be as in (11.51). Define the weight estimation error as W̃i,[k]j = Wi,[k]j − Ŵi,[k]j , Then
W̃i,[k]j is UUB.
Proof Let Lyapunov function candidate be
α
Λi,[k]j = W̃i,[k]T [k]
j W̃i, j , ∀i, j, k, (11.52)
2ηi,[k]j
where α > 0.
According to (11.51), we have
W̃˙ i,[k]j =ηi,[k]j ΠΠT ΠΠ (Wi,[k]j − W̃i,[k]j ) − Π
= − ηi,[k]j ΠΠT ΠΠ W̃i,[k]j + ηi,[k]j ΠΠT ΠΠ Wi,[k]j − ηi,[k]j ΠΠT Π. (11.53)
1 α2
≤ − α||W̃i,[k]j ||2 ||ΠΠ ||2 + ||W̃i,[k]j ||2 ||ΠΠ ||2 + ||Wi,[k]j ||2 ||ΠΠ ||2
2 2
1 α 2
+ ||W̃i,[k]j ||2 ||ΠΠ ||2 + ||Π ||2 . (11.54)
2 2
11.3 Synchronous Solution of Multi-player ZS Games 219
By transformation, (11.54) is
α2 α2
Λ̇i,[k]j ≤ (−α + 1) ||W̃i,[k]j ||2 ||ΠΠ ||2 + ||Wi,[k]j ||2 ||ΠΠ ||2 + ||Π ||2 . (11.55)
2 2
Define
α2 α2
Σi,[k]j = ||Wi,[k]j ||2 ||ΠΠ ||2 + ||Π ||2 . (11.56)
2 2
Then (11.55) is
Thus, if
α > 1, (11.58)
and
Σi,[k]j
||W̃i,[k]j ||2 > , (11.59)
(α − 1)||ΠΠ ||2
ẋ = x + u + d. (11.60)
In this simulation, the initial state is x(0) = 1. We select hyperbolic tangent functions as the activation functions of the critic, action and disturbance networks. The structures of the critic, action and disturbance networks are 1-8-1. The initial weight W, of dimension 24 × 1, is selected arbitrarily from (−1, 1). For the cost function, Q, R and S in the utility function are identity matrices of appropriate dimensions. After 500 time steps, the simulation results are obtained. The cost function is shown in Fig. 11.2; it converges to zero as time increases.
(Fig. 11.2: cost function versus time steps.)
(Fig. 11.3: control trajectory versus time steps.)
The control and disturbance trajectories are given in Figs. 11.3 and 11.4. Under the action of the obtained control and disturbance inputs, the state trajectory is displayed in Fig. 11.5. It is clear that the method presented in this chapter is effective and feasible.
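For readers who want to reproduce the qualitative behaviour of Example 11.1, the following is a minimal closed-loop sketch; the linear state-feedback gains used for the control and disturbance are illustrative stand-ins for the converged network policies, not values reported in the book.

```python
import numpy as np

def simulate(x0=1.0, ku=-2.0, kd=0.5, dt=0.01, steps=500):
    """Euler simulation of x_dot = x + u + d for Example 11.1, with illustrative
    linear policies u = ku*x and d = kd*x standing in for the converged
    action and disturbance networks."""
    x = x0
    traj = []
    for _ in range(steps):
        u, d = ku * x, kd * x
        traj.append((x, u, d))
        x = x + dt * (x + u + d)      # x_dot = x + u + d
    return np.array(traj)

traj = simulate()
print(traj[-1])   # state, control and disturbance all decay toward zero when 1 + ku + kd < 0
```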
(Fig. 11.4: disturbance trajectory versus time steps.)
(Fig. 11.5: state trajectory versus time steps.)
(Fig. 11.6: cost function of Example 11.2 versus time steps.)
Example 11.2 Consider the following affine-in-control-input nonlinear system [11]
ẋ = f(x) + g(x) Σ_{i=1}^{p} u_i + h(x) Σ_{j=1}^{q} d_j,   (11.61)
where
f(x) = [ x_2 ;  −x_2 − ½x_1 + ¼x_2(cos(2x_1) + 2)² + ¼x_2(sin(4x_1²) + 2)² ],
g(x) = [ 0 ; cos(2x_1) + 2 ],  h(x) = [ 0 ; sin(4x_1²) + 2 ],  p = q = 1.
In this simulation, the initial state is x(0) = [1, −1]^T. Hyperbolic tangent functions are used as the activation functions of the critic, action and disturbance networks. The structures of the networks are 2-8-1. The initial weight W, of dimension 24 × 1, is selected arbitrarily from (−1, 1). For the cost function of (11.61), Q, R and S in the utility function are identity matrices of appropriate dimensions. The simulation results are obtained after 2500 time steps. The cost function is shown in Fig. 11.6; it reflects the zero-sum character of the game. The control and disturbance trajectories are given in Figs. 11.7 and 11.8. The state trajectories are displayed in Fig. 11.9. We can see that the closed-loop system state, control and disturbance inputs converge to zero as the time step increases. So the proposed synchronous method for multi-player ZS games in this chapter is effective.
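A minimal encoding of the dynamics in Example 11.2, which may help in reproducing the simulation; only the system functions from (11.61) are coded, and the zero control and disturbance in the demo call are placeholders, not the learned policies.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([
        x2,
        -x2 - 0.5 * x1
        + 0.25 * x2 * (np.cos(2 * x1) + 2) ** 2
        + 0.25 * x2 * (np.sin(4 * x1 ** 2) + 2) ** 2,
    ])

def g(x):
    return np.array([0.0, np.cos(2 * x[0]) + 2])        # control input map

def h(x):
    return np.array([0.0, np.sin(4 * x[0] ** 2) + 2])   # disturbance input map

def xdot(x, u, d):
    """Right-hand side of (11.61) with p = q = 1."""
    return f(x) + g(x) * u + h(x) * d

x0 = np.array([1.0, -1.0])          # initial state used in the example
print(xdot(x0, 0.0, 0.0))           # open-loop vector field at x(0)
```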
(Fig. 11.7: control trajectory versus time steps.)
(Fig. 11.8: disturbance trajectory versus time steps.)
(Fig. 11.9: state trajectories x(1) and x(2) versus time steps.)
11.5 Conclusions
References
1. Yeung, D., Petrosyan, L.: Cooperative Stochastic Differential Games. Springer, Berlin (2006)
2. Lewis, F., Vrabie, D., Syrmos, V.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)
3. Song, R., Lewis, F., Wei, Q.: Off-policy integral reinforcement learning method to solve nonlin-
ear continuous-time multi-player non-zero-sum games. IEEE Trans. Neural Networks Learn.
Syst. 28(3), 704–713 (2016)
4. Liu, D., Wei, Q.: Multiperson zero-sum differential games for a class of uncertain nonlinear
systems. Int. J. Adap. Control Signal Process. 28(3–5), 205–231 (2014)
5. Mu, C., Sun, C., Song, A., Yu, H.: Iterative GDHP-based approximate optimal tracking control
for a class of discrete-time nonlinear systems. Neurocomputing 214(19), 775–784 (2016)
6. Fang, X., Zheng, D., He, H., Ni, Z.: Data-driven heuristic dynamic programming with virtual
reality. Neurocomputing 166(20), 244–255 (2015)
7. Feng, T., Zhang, H., Luo, Y., Zhang, J.: Stability analysis of heuristic dynamic programming
algorithm for nonlinear systems. Neurocomputing 149(Part C, 3), 1461–1468 (2015)
8. Feng, T., Zhang, H., Luo, Y., Liang, H.: Globally optimal distributed cooperative control for
general linear multi-agent systems. Neurocomputing 203(26), 12–21 (2016)
9. Lewis, F., Vrabie, D., Syrmos, V.: Optimal Control. Wiley, NewYork (2012)
10. Basar, T., Olsder, G.: Dynamic Noncooperative Game Theory. Academic Press, New York
(1982)
11. Vamvoudakis, K., Lewis, F.: Multi-player non-zero-sum games: online adaptive learning solu-
tion of coupled Hamilton-Jacobi equations. Automatica 47(8), 1556–1569 (2011)
Chapter 12
Off-Policy Integral Reinforcement
Learning Method for Multi-player
Non-zero-Sum Games
12.1 Introduction
Non-zero-sum (NZS) games with N players rely on solving the coupled Hamilton–Jacobi (HJ) equations; that is, each player determines its Nash-equilibrium policy from HJ equations that are coupled through their quadratic terms. For linear NZS games, this reduces to the coupled algebraic Riccati equations [1]. For nonlinear NZS games, the coupled HJ equations are difficult to solve and in general do not admit analytic solutions. Therefore, many intelligent methods have been proposed to obtain approximate solutions [2–6].
IRL allows the development of a Bellman equation that does not contain the system dynamics [7–10]. It is worth noting that most IRL algorithms are on-policy, i.e., the performance index function is evaluated using system data generated with the policies being evaluated. This means that on-policy learning methods use the "inaccurate" data to learn the performance index function, which increases the accumulated error. Off-policy IRL, in contrast, can learn the solution of the HJB equation from system data generated by an arbitrary control. Moreover, off-policy IRL can be regarded as a direct learning method for NZS games, which avoids the drawbacks of on-policy learning.
12.2 Problem Statement

Consider the following continuous-time nonlinear system with N players:
ẋ(t) = f(x(t)) + g(x(t)) Σ_{j=1}^{N} u_j(t),   (12.1)
where state x(t) ∈ Rn and controls u j (t) ∈ Rm . This system has N inputs or play-
ers, and they influence each other through their joint effects on the overall system
state dynamics. The system functions f(x) and g(x) are unknown. Let Ω, containing the origin, be a closed subset of R^n to which all motions of (12.1) are restricted. Let f + g Σ_{j=1}^{N} u_j be Lipschitz continuous on Ω, and assume that system (12.1) is stabilizable in the sense that there exists an admissible control on Ω that asymptotically stabilizes the system. Define u_{−i} as the supplementary set of player i: u_{−i} = {u_j : j ∈ {1, 2, ..., i−1, i+1, ..., N}}. Define the performance index function of player i as
J_i(x(0), u_i, u_{−i}) = ∫_0^∞ r_i(x(t), u_i, u_{−i}) dt,   (12.2)
where the utility function r_i(x, u_i, u_{−i}) = Q_i(x) + Σ_{j=1}^{N} u_j^T R_{ij} u_j, in which Q_i(x) ≥ 0 and R_{ij} > 0 are symmetric matrices.
Define the multi-player differential game
V_i^*(x(t), u_i, u_{−i}) = min_{u_i} ∫_t^∞ ( Q_i(x) + Σ_{j=1}^{N} u_j^T R_{ij} u_j ) dτ.   (12.3)
This game implies that all the players have the same competitive hierarchical level and seek to attain a Nash equilibrium as given by the following definition.
Definition 12.1 (Nash equilibrium [14, 15]) Policies {u_i^*, u_{−i}^*} = {u_1^*, u_2^*, ..., u_i^*, ..., u_N^*} are said to constitute a Nash equilibrium solution for the N-player game if
J_i^* ≜ J_i(u_i^*, u_{−i}^*) ≤ J_i(u_i, u_{−i}^*),  ∀u_i, i = 1, 2, ..., N;   (12.4)
hence the N-tuple {J_1^*, J_2^*, ..., J_N^*} is known as a Nash equilibrium value set or outcome of the N-player game.
12.3 Multi-player Learning PI Solution for NZS Games

In this section, the PI solution for NZS games is introduced with convergence analysis. From Definition 12.1, it can be seen that if any player unilaterally changes his control policy while the policies of all other players remain the same, then that player will obtain worse performance. For fixed stabilizing feedback control policies u_i and u_{−i}, define the value function
V_i(x(t)) = ∫_t^∞ r_i(x, u_i, u_{−i}) dτ = ∫_t^∞ ( Q_i(x) + Σ_{j=1}^{N} u_j^T R_{ij} u_j ) dτ.   (12.5)
Using the Leibniz formula, the Hamiltonian function is given by the Bellman equation
H_i(x, ∇V_i, u_i, u_{−i}) = Q_i(x) + Σ_{j=1}^{N} u_j^T R_{ij} u_j + ∇V_i^T ( f(x) + g(x) Σ_{j=1}^{N} u_j ) = 0.   (12.6)
Then the following policies can be obtained from ∂H_i/∂u_i = 0:
u_i(x) = −½ R_ii^{-1} g^T(x) ∇V_i.   (12.7)
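As a small numerical illustration of (12.7), the sketch below evaluates the stationarity condition for a quadratic value function V_i(x) = x^T P_i x, for which ∇V_i = 2P_i x; the matrices used are made up for illustration and are not from the chapter.

```python
import numpy as np

def player_policy(x, P_i, R_ii, g_x):
    """u_i(x) = -1/2 * R_ii^{-1} g(x)^T grad V_i(x), cf. (12.7),
    with a quadratic value V_i(x) = x^T P_i x so that grad V_i = 2 P_i x."""
    grad_V = 2.0 * P_i @ x
    return -0.5 * np.linalg.solve(R_ii, g_x.T @ grad_V)

# Illustrative two-dimensional example with a scalar control input.
P_i = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed value-function matrix
R_ii = np.array([[1.0]])
g_x = np.array([[0.0], [1.0]])             # input map evaluated at x
x = np.array([1.0, -1.0])
print(player_policy(x, P_i, R_ii, g_x))    # grad V = [3, -1], so u_i = -0.5 * (-1) = 0.5
```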
Therefore, one obtains the N-coupled Hamilton–Jacobi (HJ) equations
0 = ∇V_i^T ( f(x) − ½ Σ_{j=1}^{N} g(x) R_jj^{-1} g^T(x) ∇V_j ) + Q_i(x) + ¼ Σ_{j=1}^{N} ∇V_j^T g(x) R_jj^{-1} R_{ij} R_jj^{-1} g^T(x) ∇V_j.   (12.8)
Define the best response HJ equation as the Bellman equation (12.6) with the control u_i^* given by (12.7) and arbitrary policies u_{−i}:
H_i(x, ∇V_i, u_i^*, u_{−i}) = ∇V_i^T f(x) + Q_i + ∇V_i^T g(x) Σ_{j≠i} u_j − ¼ ∇V_i^T g(x) R_ii^{-1} g^T(x) ∇V_i + Σ_{j≠i} u_j^T R_{ij} u_j.   (12.9)
In [1], the following policy iteration (PI) algorithm for N-player games has been proposed to solve (12.8). Here the superscripts [0] and [k] denote the iteration index. The following two theorems establish the convergence of Algorithm 5.
Theorem 12.1 Assume that Ω is bounded and that the system functions f(x) and g(x) are known. Let the iterative control u_i^{[k]} of player i be obtained by the PI algorithm (12.10)–(12.12), while the controls u_{−i} do not update their policies. Then the iterative cost function is convergent, and the values converge to the solution of the best response HJ equation (12.9).
Proof Let
H_i^o(x, ∇V_i^{[k]}, u_{−i}) = min_{u_i^{[k]}} H_i(x, ∇V_i^{[k]}, u_i^{[k]}, u_{−i}) = H_i(x, ∇V_i^{[k]}, u_i^{[k+1]}, u_{−i}).   (12.13)
H_i(x, ∇V_i^{[k]}, u_i^{[k]}, u_{−i}^{[k]}) = ∇V_i^{[k]T} ( f(x) + g(x) Σ_{j=1}^{N} u_j^{[k]} ) + r_i(x, u_1^{[k]}, u_2^{[k]}, ..., u_N^{[k]}) = 0,   (12.10)
and the policy improvement step is
u_i^{[k+1]} = arg min_{u_i} H_i(x, ∇V_i^{[k]}, u_i, u_{−i}^{[k]}),   (12.11)
which explicitly is
u_i^{[k+1]} = −½ R_ii^{-1} g^T(x) ∇V_i^{[k]}.   (12.12)
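For linear dynamics ẋ = Ax + B Σ_j u_j and quadratic values V_i = x^T P_i x, the policy iteration (12.10)–(12.12) specializes to solving one Lyapunov equation per player at every step. The sketch below is an assumption-laden specialization (not the book's algorithm): the dynamics, weights, initial gains and iteration count are illustrative.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def pi_linear_nzs(A, B, Q, R, K0, iters=50):
    """Policy iteration for an N-player linear NZS game.

    Policy evaluation (12.10): with u_j = -K_j x, solve for P_i from
        Ac^T P_i + P_i Ac + Q_i + sum_j K_j^T R_ij K_j = 0,  Ac = A - B * sum_j K_j.
    Policy improvement (12.12): K_i <- R_ii^{-1} B^T P_i  (since grad V_i = 2 P_i x).
    """
    N = len(Q)
    K = [k.copy() for k in K0]
    for _ in range(iters):
        Ac = A - B @ sum(K)
        P = []
        for i in range(N):
            Qi_bar = Q[i] + sum(K[j].T @ R[i][j] @ K[j] for j in range(N))
            # solve_continuous_lyapunov(a, q) solves a X + X a^H = q
            P.append(solve_continuous_lyapunov(Ac.T, -Qi_bar))
        K = [np.linalg.solve(R[i][i], B.T @ P[i]) for i in range(N)]
    return P, K

# Illustrative scalar two-player game (all values chosen only for demonstration).
A, B = np.array([[1.0]]), np.array([[1.0]])
Q = [np.array([[2.0]]), np.array([[1.0]])]
R = [[np.array([[1.0]]), np.array([[1.0]])],
     [np.array([[1.0]]), np.array([[2.0]])]]
K0 = [np.array([[2.0]]), np.array([[2.0]])]   # initial stabilizing gains: A - B*(K1+K2) < 0
P, K = pi_linear_nzs(A, B, Q, R, K0)
print(K)   # converged feedback gains of the two players
```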
Theorem 12.2 If the system functions f (x) and g(x) are known, all the iterative
controls u i[k] of players i are obtained by PI algorithm in (12.10)–(12.12). If Ω is
bounded, then the iterative values Vi[k] converge to the optimal game values Vi∗ , as
k → ∞.
Proof As
V̇_i^{[k+1]} = −Q_i(x) − Σ_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]},   (12.23)
and
V̇_i^{[k]} = −Q_i(x) − Σ_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} + Σ_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]} − Σ_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]}
 = V̇_i^{[k+1]} − Σ_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} + Σ_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]}
 = V̇_i^{[k+1]} − Σ_{j=1}^{N} (u_j^{[k+1]} − u_j^{[k]})^T R_{ij} (u_j^{[k+1]} − u_j^{[k]}) + 2 Σ_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]} − 2 Σ_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k]}
 = V̇_i^{[k+1]} − Σ_{j=1}^{N} (u_j^{[k+1]} − u_j^{[k]})^T R_{ij} (u_j^{[k+1]} − u_j^{[k]}) + 2 Σ_{j=1}^{N} u_j^{[k+1]T} R_{ij} (u_j^{[k+1]} − u_j^{[k]}).   (12.24)
j=1
Let
u [k] [k+1]
j = uj − u [k]
j . (12.25)
N
N
u [k]T
j Ri j u [k]
j −2 u [k+1]T
j Ri j u [k]
j ≥ 0. (12.26)
j=1 j=1
u [k]T
j Ri j u [k] [k]T
j + ∇V j g(x)R −T [k]
j j Ri j u j ≥ 0. (12.27)
||u [k]T
j Ri j u [k] [k]T
j || ≥ ||∇V j g(x)R −T [k]
j j Ri j u j ||, (12.28)
i.e.,
where δ L (·) is the operator which takes the minimum singular value, δ H (·) is the
operator which takes the maximum singular value. Specifically, (12.29) holds if
[k]
δ H (R −T
j j Ri j ) = 0. By integration of V̇i ≤ V̇i[k+1] , it follows that
12.4 Off-Policy Integral Reinforcement Learning Method

From Algorithm 5, it can be seen that the PI algorithm depends on the system dynamics, which are not known in this chapter. Therefore, an off-policy IRL algorithm is established to solve the NZS games. In this section, the off-policy IRL is first presented. Then the method for solving the off-policy Bellman equation is developed. Finally, the theoretical analyses and the implementation method are given.
Let u_i^{[k]} be obtained by (12.12), and then the original system (12.1) can be rewritten as
ẋ = f(x) + g(x) Σ_{j=1}^{N} u_j^{[k]} + g(x) Σ_{j=1}^{N} (u_j − u_j^{[k]}).   (12.32)
Then, according to (12.10),
∇V_i^{[k]T} ( f(x) + g(x) Σ_{j=1}^{N} u_j^{[k]} ) = −Q_i(x) − Σ_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]}.   (12.34)
Remark 12.2 Notice that in (12.37), the term ∇Vi[k]T g(x) depending on the unknown
function g(x) is replaced by u i[k+1]T Rii , which can be obtained by measuring the state
online. Therefore, (12.37) plays an important role in separating the system dynamics
from the iterative process. It is referred to as off-policy Bellman equation. By off-
policy Bellman equation (12.37), one can obtain the optimal control of the N -player
nonlinear differential game without the requirement of the system dynamics.
The next part will give the implementation method for Algorithm 6.
In this part, the method for solving the off-policy Bellman equation (12.37) will be given. Critic and action networks are used to approximate V_i^{[k]}(x(t)) and u_i^{[k]} for each player, respectively. The neural network expression of the critic network is given as
V_i^{[k]}(x) = M_i^{[k]T} ϕ_i(x) + ε_i^{[k]}(x),   (12.38)
where M_i^{[k]} is the ideal weight of the critic network, ϕ_i(x) is the activation function, and ε_i^{[k]}(x) is the residual error. Let the estimate of M_i^{[k]} be M̂_i^{[k]}, and then the estimate of V_i^{[k]}(x) is
V̂_i^{[k]}(x) = M̂_i^{[k]T} ϕ_i(x).   (12.39)
Accordingly,
∇V̂_i^{[k]}(x) = ∇ϕ_i^T(x) M̂_i^{[k]}.   (12.40)
Similarly, the neural network expression of the action network is
u_i^{[k]}(x) = N_i^{[k]T} φ_i(x) + δ_i^{[k]}(x),   (12.41)
where N_i^{[k]} is the ideal weight of the action network, φ_i(x) is the activation function, and δ_i^{[k]}(x) is the residual error. Let N̂_i^{[k]} be the estimate of N_i^{[k]}; then the estimate of u_i^{[k]}(x) is
û_i^{[k]}(x) = N̂_i^{[k]T} φ_i(x).   (12.42)
Substituting (12.39) and (12.42) into (12.43), one can get (12.44). As
Σ_{j=1}^{N} φ_i^T(x) N̂_i^{[k+1]} R_ii (u_j − û_j^{[k]}) = ( ( Σ_{j=1}^{N} (u_j − û_j^{[k]}) )^T R_ii ⊗ φ_i^T ) vec(N̂_i^{[k+1]}).   (12.45)
Here the symbol ⊗ stands for the Kronecker product. Then (12.44) becomes (12.46). Define
I_{xx,i} = (ϕ_i(x(t+T)) − ϕ_i(x(t)))^T ⊗ I,
I_{xu,i}^{[k]} = −∫_t^{t+T} Q_i(x) dτ − ∫_t^{t+T} Σ_{j=1}^{N} û_j^{[k]T} R_{ij} û_j^{[k]} dτ,
I_{uu,i}^{[k]} = 2 ∫_t^{t+T} ( ( Σ_{j=1}^{N} (u_j − û_j^{[k]}) )^T R_ii ⊗ φ_i^T ) dτ,
and let I_{w,i}^{[k]} = [I_{xx,i}, I_{uu,i}^{[k]}], with Ŵ_i^{[k]} denoting the corresponding stacked vector of the critic and action network weight estimates.
Therefore (12.46) is written as
e_i^{[k]} = I_{w,i}^{[k]} Ŵ_i^{[k]} + I_{xu,i}^{[k]}.   (12.48)
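The sketch below checks the vectorization step (12.45) numerically and shows how the windowed data equations of the form (12.48) could be stacked and solved in a batch least-squares sense; the dimensions, the random data and the least-squares solve are illustrative assumptions, whereas the chapter itself tunes the weights with a gradient rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Check of (12.45):  phi^T N R s = ((s^T R) kron phi^T) vec(N),  s = sum_j (u_j - u_hat_j)
L, m = 8, 2                        # illustrative basis and input dimensions
phi = rng.normal(size=L)
N = rng.normal(size=(L, m))        # action-network weight
R = np.array([[2.0, 0.3], [0.3, 1.0]])
s = rng.normal(size=m)
lhs = phi @ N @ R @ s
rhs = np.kron(s @ R, phi) @ N.reshape(-1, order="F")   # column-stacking vec(N)
assert np.isclose(lhs, rhs)

# With that identity, each integration window [t, t+T] contributes one linear
# equation I_w,i^[k] W_hat_i^[k] ~= -I_xu,i^[k]; stacking windows gives a batch
# least-squares alternative to a gradient-based update.
I_w = rng.normal(size=(30, L + L * m))     # 30 windows, stacked critic/action regressors
W_true = rng.normal(size=L + L * m)
target = I_w @ W_true                      # plays the role of -I_xu
W_hat, *_ = np.linalg.lstsq(I_w, target, rcond=None)
print(np.linalg.norm(W_hat - W_true))      # near zero when the data are persistently exciting
```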
To obtain the update rule of the weights of the critic and action networks, the optimization objective is defined as
E_i^{[k]} = ½ e_i^{[k]T} e_i^{[k]}.   (12.49)
Thus, using the gradient descent algorithm, one can get
Ŵ˙_i^{[k]} = −η_i^{[k]} I_{w,i}^{[k]T} ( I_{w,i}^{[k]} Ŵ_i^{[k]} + I_{xu,i}^{[k]} ),   (12.50)
where η_i^{[k]} > 0 is the learning rate.

Theorem 12.3 Let the weights of the critic and action networks be updated by (12.50), and define the weight estimation error as W̃_i^{[k]} = W_i^{[k]} − Ŵ_i^{[k]}. Then W̃_i^{[k]} is UUB.

Proof Let the Lyapunov function candidate be
L_i^{[k]} = (l / (2η_i^{[k]})) W̃_i^{[k]T} W̃_i^{[k]},   (12.51)
where l > 0.
According to (12.50), one has
W̃˙_i^{[k]} = η_i^{[k]} I_{w,i}^{[k]T} ( I_{w,i}^{[k]} (W_i^{[k]} − W̃_i^{[k]}) + I_{xu,i}^{[k]} )
 = −η_i^{[k]} I_{w,i}^{[k]T} I_{w,i}^{[k]} W̃_i^{[k]} + η_i^{[k]} I_{w,i}^{[k]T} ( I_{w,i}^{[k]} W_i^{[k]} + I_{xu,i}^{[k]} ).   (12.52)
Then, along (12.52), the derivative of (12.51) satisfies
L̇_i^{[k]} ≤ −(l − 1) ||W̃_i^{[k]}||² ||I_{w,i}^{[k]}||² + (l²/2) ||W_i^{[k]}||² ||I_{w,i}^{[k]}||² + (l²/2) ||I_{xu,i}^{[k]}||².   (12.54)
Let C_i^{[k]} = (l²/2) ||W_i^{[k]}||² ||I_{w,i}^{[k]}||² + (l²/2) ||I_{xu,i}^{[k]}||². Then (12.54) is
L̇_i^{[k]} ≤ −(l − 1) ||W̃_i^{[k]}||² ||I_{w,i}^{[k]}||² + C_i^{[k]}.   (12.55)
Therefore, if l satisfies
l > 1,   (12.56)
and
||W̃_i^{[k]}||² > C_i^{[k]} / ( (l − 1) ||I_{w,i}^{[k]}||² ),   (12.57)
then L̇_i^{[k]} < 0. Therefore, the weight estimation error W̃_i^{[k]} is UUB. The proof is completed.
Based on Theorem 12.3, one can assume that V̂i[k] (x) → Vi[k] (x) and û i[k] (x) →
u i[k] (x).
Then one can get the following theorem, which further proves the asymptotic stability of the closed-loop system.
Theorem 12.4 Let the control be given as in (12.12), then the closed-loop system
is asymptotically stable.
Proof Taking the derivative of V_i^{[k]} along the closed-loop trajectories gives
V̇_i^{[k]}(x(t)) = −Q_i(x) − Σ_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} < 0.   (12.59)
Therefore, the closed-loop system is asymptotically stable. The proof is completed.
To show that the converged policies constitute a Nash equilibrium, expand the Hamiltonian as
H_i(x, ∇V_i, u_i, u_{−i}) = H_i(x, ∇V_i, u_i^*, u_{−i}^*) + Σ_{j=1}^{N} (u_j − u_j^*)^T R_{ij} (u_j − u_j^*) + Σ_{j=1}^{N} ∇V_i^T g(x)(u_j − u_j^*) + 2 Σ_{j=1}^{N} u_j^{*T} R_{ij} (u_j − u_j^*).   (12.63)
J_i(x(0), u_i, u_{−i}) = V_i(x(0)) + ∫_0^∞ Σ_{j=1, j≠i}^{N} ∇V_i^T g(x)(u_j − u_j^*) dt + ∫_0^∞ ∇V_i^T g(x)(u_i − u_i^*) dt
 + ∫_0^∞ 2 Σ_{j=1, j≠i}^{N} u_j^{*T} R_{ij} (u_j − u_j^*) dt + ∫_0^∞ 2 u_i^{*T} R_ii (u_i − u_i^*) dt
 + ∫_0^∞ Σ_{j=1, j≠i}^{N} (u_j − u_j^*)^T R_{ij} (u_j − u_j^*) dt + ∫_0^∞ (u_i − u_i^*)^T R_ii (u_i − u_i^*) dt.   (12.65)
Note that, at the equilibrium point, one has u_i = u_i^* and u_{−i} = u_{−i}^*. Thus
J_i(x(0), u_i^*, u_{−i}^*) = V_i(x(0)).   (12.66)
Note that u_i^* = u_i(V_i(x)); then ∇V_i^T g(x) = −2u_i^{*T} R_ii. Therefore, Eq. (12.67) is
J_i(x(0), u_i, u_{−i}^*) = V_i(x(0)) + ∫_0^∞ (u_i − u_i^*)^T R_ii (u_i − u_i^*) dt.   (12.68)
Then clearly J_i(x(0), u_i^*, u_{−i}^*) in (12.66) and J_i(x(0), u_i, u_{−i}^*) in (12.68) satisfy (12.4). It means {u_i^*, i = 1, 2, ..., N} is in Nash equilibrium.
Based on the above analyses, it is clear that off-policy IRL obtains V̂_i^{[k]}(x) and û_i^{[k]}(x) simultaneously. Based on the weight update method (12.50), V̂_i^{[k]}(x) → V_i^{[k]}(x) and û_i^{[k]}(x) → u_i^{[k]}(x). It is proven that u_i^{[k]}(x) makes the closed-loop system asymptotically stable and that u_i^{[k]}(x) → u_i^*(x) as k → ∞. The final theorem demonstrates that {u_i^*, i = 1, 2, ..., N} is in Nash equilibrium. Therefore, the following subsection gives the computational algorithm for practical online implementation.
12.5 Simulation Study

Here we present simulations of linear and nonlinear systems to show that the games can be solved by the off-policy IRL method of this chapter.
Example 12.1 Consider the following linear system with modification [18, 19]
ẋ = 2x + 3u 1 + 3u 2 . (12.71)
Define
J_1 = ∫_0^∞ (9x² + 0.64u_1² + u_2²) dt,   (12.72)
and
J_2 = ∫_0^∞ (3x² + u_1² + u_2²) dt.   (12.73)
Let the initial state be x(0) = 0.5. For each player the structures of the critic and action networks are 1-8-1 and 1-8-1, respectively. The initial weights are selected in (−0.5, 0.5). Let Q_1 = 9, Q_2 = 3, R_11 = 0.64 and R_12 = R_21 = R_22 = 1. The activation functions ϕ_i and φ_i are hyperbolic functions, and η_i = 0.01. After 100 time steps, the state and control trajectories are shown in Figs. 12.1 and 12.2. From the above analysis, the iterative value functions are monotonically decreasing, as shown in Figs. 12.3 and 12.4.
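As a cross-check on Example 12.1, the Nash feedback gains of the linear game (12.71)–(12.73) can also be computed directly from the coupled algebraic Riccati equations using the Lyapunov-based policy iteration sketched earlier; the initial gains, the iteration count and the interpretation of the printout are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Example 12.1 data: x_dot = 2x + 3u1 + 3u2, with (Q1, R11, R12) = (9, 0.64, 1)
# and (Q2, R21, R22) = (3, 1, 1).
A, B = np.array([[2.0]]), np.array([[3.0]])
Q = [np.array([[9.0]]), np.array([[3.0]])]
R = [[np.array([[0.64]]), np.array([[1.0]])],
     [np.array([[1.0]]), np.array([[1.0]])]]

K = [np.array([[2.0]]), np.array([[2.0]])]     # assumed initial stabilizing gains
for _ in range(100):
    Ac = A - B @ (K[0] + K[1])
    P = []
    for i in range(2):
        Qbar = Q[i] + sum(K[j].T @ R[i][j] @ K[j] for j in range(2))
        P.append(solve_continuous_lyapunov(Ac.T, -Qbar))   # Ac^T P + P Ac = -Qbar
    K = [np.linalg.solve(R[i][i], B.T @ P[i]) for i in range(2)]

print(P)   # quadratic value coefficients P_i, i.e. V_i(x) = P_i x^2 at the Nash solution
print(K)   # Nash feedback gains, u_i = -K_i x
```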
Example 12.2 Consider the following nonlinear system in [20] with modification
ẋ = f (x) + gu 1 + gu 2 + gu 3 , (12.74)
(Fig. 12.1: state trajectory x versus time steps.)
(Fig. 12.2: control trajectories u_1 and u_2 versus time steps.)
(Fig. 12.3: iterative value function V_1 versus iteration steps.)
(Fig. 12.4: iterative value function V_2 versus iteration steps.)
(Fig. 12.5: state trajectories x_1 and x_2 versus time steps.)
where f(x) = [ f_1(x) ; f_2(x) ], f_1(x) = −2x_1 + x_2, f_2(x) = −0.5x_1 − x_2 + x_1²x_2 + 0.25x_2(cos(2x_1) + 2)² + 0.25x_2(sin(4x_1²) + 2)², and g(x) = [ 0 ; 2x_1 ]. Define
J_1(x) = ∫_0^∞ ( ⅛x_1² + ¼x_2² + u_1² + 0.9u_2² + u_3² ) dt,   (12.75)
J_2(x) = ∫_0^∞ ( ½x_1² + x_2² + u_1² + u_2² + 5u_3² ) dt,   (12.76)
J_3(x) = ∫_0^∞ ( ¼x_1² + ½x_2² + 3u_1² + 2u_2² + u_3² ) dt.   (12.77)
Let the initial state be x(0) = [1, −1]^T. For each player the structures of the critic and action networks are 2-8-1 and 2-8-1, respectively. The activation functions ϕ_i and φ_i are hyperbolic functions. Let η_i = 0.05. After 1000 time steps, the state and control trajectories are shown in Figs. 12.5 and 12.6. Figures 12.7, 12.8 and 12.9 show the value function of each player, which is monotonically decreasing.
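A minimal encoding of the Example 12.2 dynamics (12.74), which may help in reproducing the simulation; the zero controls in the demo call are placeholders for the learned policies.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([
        -2.0 * x1 + x2,
        -0.5 * x1 - x2 + x1 ** 2 * x2
        + 0.25 * x2 * (np.cos(2 * x1) + 2) ** 2
        + 0.25 * x2 * (np.sin(4 * x1 ** 2) + 2) ** 2,
    ])

def g(x):
    return np.array([0.0, 2.0 * x[0]])      # common input map for all three players

def xdot(x, u1, u2, u3):
    """Right-hand side of (12.74): x_dot = f(x) + g(x)(u1 + u2 + u3)."""
    return f(x) + g(x) * (u1 + u2 + u3)

x0 = np.array([1.0, -1.0])
print(xdot(x0, 0.0, 0.0, 0.0))              # open-loop vector field at x(0)
```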
(Fig. 12.6: control trajectories u_1, u_2 and u_3 versus time steps.)
(Fig. 12.7: iterative value function V_1 versus iteration steps.)
(Fig. 12.8: iterative value function V_2 versus iteration steps.)
(Fig. 12.9: iterative value function V_3 versus iteration steps.)
12.6 Conclusion
This chapter establishes an off-policy IRL method for CT multi-player NZS games with unknown dynamics. Since the system dynamics are unknown, off-policy IRL is used to perform policy evaluation and policy improvement in the PI algorithm. Critic and action networks are used to obtain the performance index and control for each player. The convergence of the weights is proven, as are the asymptotic stability of the closed-loop system and the existence of the Nash equilibrium. The simulation study demonstrates the effectiveness of the proposed method.
Furthermore, it is noted that the condition required in the proof of Theorem 12.2 is restrictive. In future work, we will concentrate on relaxing this condition. We also notice that the unknown system functions f and g are the same for each player in this chapter. In our future work, we will discuss the off-policy IRL method for player-dependent unknown functions f_i or g_i, which will make the research more extensive and deeper.
References
1. Vamvoudakis, K., Lewis, F.: Multi-player non-zero-sum games: online adaptive learning solu-
tion of coupled Hamilton-Jacobi equations. Automatica 47(8), 1556–1569 (2011)
2. Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class
of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst.
Man Cybern. Part B-Cybern. 38(4), 937–942 (2008)
3. Wei, Q., Wang, F., Liu, D., Yang, X.: Finite-approximation-error based discrete-time iterative
adaptive dynamic programming. IEEE Trans. Cybern. 44(12), 2820–2833 (2014)
4. Wei, Q., Liu, D.: A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems. IEEE Trans. Autom. Sci. Eng. 11(4), 1176–1190 (2014)
5. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-
valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
6. Song, R., Lewis, F., Wei, Q., Zhang, H., Jiang, Z., Levine, D.: Multiple Actor-Critic Structures
for Continuous-Time Optimal Control Using Input-Output Data. IEEE Trans. Neural Netw.
Learn. Syst. 26(4), 851–865 (2015)
7. Modares, H., Lewis, F., Naghibi-Sistani, M.B.: Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans. Neural
Netw. Learn. Syst. 24(10), 1513–1525 (2013)
8. Modares, H., Lewis, F.: Optimal tracking control of nonlinear partially-unknown constrained-
input systems using integral reinforcement learning. Automatica 50(7), 1780–1792 (2014)
9. Modares, H., Lewis, F., Naghibi-Sistani, M.: Integral reinforcement learning and experience
replay for adaptive optimal control of partially-unknown constrained-input continuous-time
systems. Automatica 50, 193–202 (2014)
10. Kiumarsi, B., Lewis, F., Naghibi-Sistani, M., Karimpour, A.: Approximate dynamic program-
ming for optimal tracking control of unknown linear systems using measured data. IEEE Trans.
Cybern. 45(12), 2770–2779 (2015)
11. Jiang, Y., Jiang, Z.: Computational adaptive optimal control for continuous-time linear systems
with completely unknown dynamics. Automatica 48(10), 2699–2704 (2012)
12. Luo, B., Wu, H., Huang, T.: Off-policy reinforcement learning for H∞ control design. IEEE Trans. Cybern. 45(1), 65–76 (2015)
13. Song, R., Lewis, F., Wei, Q., Zhang, H.: Off-policy actor-critic structure for optimal control of
unknown systems with disturbances. IEEE Trans. Cybern. 46(5), 1041–1050 (2016)
14. Lewis, F., Vrabie, D., Syrmos, V.L.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)
15. Vamvoudakis, K., Lewis, F., Hudas, G.: Multi-agent differential graphical games: online adap-
tive learning solution for synchronization with optimality. Automatica 48(8), 1598–1611 (2012)
16. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
17. Leake, R., Liu, R.: Construction of suboptimal control sequences. SIAM J. Control 5(1), 54–63
(1967)
18. Jungers, M., De Pieri, E., Abou-Kandil, H.: Solving coupled algebraic Riccati equations from
closed-loop Nash strategy, by lack of trust approach. Int. J. Tomogr. Stat. 7(F07), 49–54 (2007)
19. Limebeer, D., Anderson, B., Hendel, H.: A Nash game approach to mixed H2/H∞ control. IEEE Trans. Autom. Control 39(1), 69–82 (1994)
20. Liu, D., Li, H., Wang, D.: Online synchronous approximate optimal learning algorithm for
multiplayer nonzero-sum games with unknown dynamics. IEEE Trans. Syst. Man Cybern.:
Syst. 44(8), 1015–1027 (2014)
Chapter 13
Optimal Distributed Synchronization
Control for Heterogeneous Multi-agent
Graphical Games
In this chapter, a new optimal coordination control for the consensus problem of het-
erogeneous multi-agent differential graphical games by iterative ADP is developed.
The main idea is to use iterative ADP technique to obtain the iterative control law
which makes all the agents track a given dynamics and simultaneously makes the
iterative performance index function reach the Nash equilibrium. In the developed
heterogeneous multi-agent differential graphical games, the agent of each node is
different from the one of other nodes. The dynamics and performance index function
for each node depend only on local neighbor information. A cooperative policy iter-
ation algorithm for graphical differential games is developed to achieve the optimal
control law for the agent of each node, where the coupled Hamilton–Jacobi (HJ)
equations for optimal coordination control of heterogeneous multi-agent differential
games can be avoided. Convergence analysis is developed to show that the perfor-
mance index functions of heterogeneous multi-agent differential graphical games
can converge to the Nash equilibrium. Simulation results will show the effectiveness
of the developed optimal control scheme.
13.1 Introduction
A large class of real systems are controlled by more than one controller or decision
maker with each using an individual strategy. These controllers often operate in a
group with coupled performance index functions as a game [1]. Stimulated by a vast
number of applications, including those in economics, management, communication
networks, power networks, and in the design of complex engineering systems, game
theory [2] has been very successful in modeling strategic behavior, where the outcome
for each player depends on the actions of himself and all the other players.
The previous policy iteration ADP algorithms for multi-player games generally require the system states of each agent to converge to the equilibrium of the system. In many real-world games, however, it is required that the states of each agent track a desired trajectory.
13.2 Graphs and Synchronization of Multi-agent Systems

Let G = (V, E, A) be a weighted graph of N nodes with the nonempty finite set of nodes V = {v_1, v_2, ..., v_N}, where the set of edges E belongs to the product space of V (i.e., E ⊆ V × V), an edge of G is denoted by ρ_ij = (v_j, v_i), which is a direct path from node j to node i, and A = [ρ_ij] is a weighted adjacency matrix with nonnegative adjacency elements, i.e., ρ_ij > 0 if (v_j, v_i) ∈ E and ρ_ij = 0 otherwise. Denote the set of neighbors of node i by N_i = {j : (v_j, v_i) ∈ E} and the weighted in-degree of node i by d_i = Σ_{j∈N_i} ρ_ij. The dynamics of the agent at node i is given by
ẋi = Ai xi + Bi u i , (13.1)
where xi (t) ∈ Rn is the state of node vi , and u i (t) ∈ Rm is the input coordination
control. Let Ai and Bi for ∀ i ∈ N be constant matrices. In this chapter, we assume that
the control gain matrix Bi satisfies rank{Bi } ≥ n for the convenience of our analysis.
For ∀ i ∈ N, assume that the state xi of the agent for each node is controllable. The
local neighborhood tracking error for ∀ i ∈ N is defined as
δ_i = Σ_{j∈N_i} ρ_ij (x_i − x_j) + σ_i (x_i − x_0),   (13.2)
where σ_i ≥ 0 is the pinning gain. Note that σ_i > 0 for at least one i. We let σ_i = 0 if and only if there is no direct path from the control node to the ith node in G; otherwise σ_i > 0.
The target node or state is x0 (t) ∈ Rn , which satisfies the dynamics
ẋ0 = A0 x0 . (13.3)
The synchronization control design problem is to design local control protocols for
all the nodes in G to synchronize to the state of the control node, i.e., for ∀ i ∈ N, we
require xi (t) → x0 (t).
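To make the graph quantities concrete, the sketch below computes the local neighborhood tracking errors (13.2) from an adjacency matrix and pinning gains; the three-node graph and the state values are made up for illustration.

```python
import numpy as np

def tracking_errors(x, x0, rho, sigma):
    """delta_i = sum_j rho_ij (x_i - x_j) + sigma_i (x_i - x_0), cf. (13.2).

    x     : node states, shape (N, n)
    x0    : leader state, shape (n,)
    rho   : weighted adjacency matrix, shape (N, N), rho[i, j] > 0 iff j -> i
    sigma : pinning gains, shape (N,)
    """
    N = x.shape[0]
    delta = np.zeros_like(x)
    for i in range(N):
        for j in range(N):
            delta[i] += rho[i, j] * (x[i] - x[j])
        delta[i] += sigma[i] * (x[i] - x0)
    return delta

# Illustrative three-node graph: node 0 is pinned to the leader, with edges 0 -> 1 -> 2.
rho = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
sigma = np.array([1.0, 0.0, 0.0])
x = np.array([[1.0, 0.0], [0.5, -0.2], [0.0, 0.3]])
x0 = np.array([0.0, 0.0])
print(tracking_errors(x, x0, rho, sigma))
```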
In compact form, the node dynamics (13.1) can be written as
ẋ = Ax + Bu.   (13.4)
For each node i, define the state tracking error
x_ei = x_i − x_0,   (13.5)
and the control error u_ei = u_i − u_di, where the feedforward control u_di is defined through
ẋ_0 = A_i x_0 + B_i u_di.   (13.6)
According to (13.2), the dynamics of the local neighborhood tracking error for the network G can be expressed as
δ̇_i = Σ_{j∈N_i} ρ_ij (ẋ_i − ẋ_j) + σ_i (ẋ_i − ẋ_0)
 = Σ_{j∈N_i} ρ_ij (ẋ_ei − ẋ_ej) + σ_i ẋ_ei
 = Σ_{j∈N_i} ρ_ij (A_i x_ei + B_i u_ei − A_j x_ej − B_j u_ej) + σ_i (A_i x_ei + B_i u_ei)
 = Σ_{j∈N_i} ρ_ij A_i (x_ei − x_ej) + Σ_{j∈N_i} ρ_ij (A_i − A_j) x_ej + Σ_{j∈N_i} ρ_ij (B_i u_ei − B_j u_ej) + σ_i A_i x_ei + σ_i B_i u_ei
 = A_i δ_i + Σ_{j∈N_i} ρ_ij (A_i − A_j) x_ej + (d_i + σ_i) B_i u_ei − Σ_{j∈N_i} ρ_ij B_j u_ej.   (13.9)
From (13.9), we can see that the tracking error dynamics are a function of δ_i and x_ej. This means that the distributed control u_i should be designed using the information of δ_i and x_ej, so a system transformation is necessary.
Let j_1^i, j_2^i, ..., j_{N_i}^i be the nodes in N_i. Define new state and control vectors as x̃_ei = [x_{ej_1^i}^T, x_{ej_2^i}^T, ..., x_{ej_{N_i}^i}^T]^T and ũ_ei = [u_{ej_1^i}^T, u_{ej_2^i}^T, ..., u_{ej_{N_i}^i}^T]^T. If we let z_i = [δ_i^T, x̃_ei^T]^T and ū_ei = [u_ei^T, ũ_ei^T]^T, then the error dynamics (13.9) can be written in the compact form
ż_i = Ā_i z_i + B̄_i ū_ei,   (13.10)
where
Ā_i = [ A_i, ρ_{i j_1^i}(A_i − A_{j_1^i}), ..., ρ_{i j_{N_i}^i}(A_i − A_{j_{N_i}^i}) ; 0, A_{j_1^i}, ..., 0 ; ⋮ ; 0, 0, ..., A_{j_{N_i}^i} ]   (13.11)
and
B̄_i = [ (d_i + σ_i)B_i, −ρ_{i j_1^i}B_{j_1^i}, ..., −ρ_{i j_{N_i}^i}B_{j_{N_i}^i} ; 0, B_{j_1^i}, ..., 0 ; ⋮ ; 0, 0, ..., B_{j_{N_i}^i} ],   (13.12)
respectively. We can see that system (13.10) is a multi-input system. The controls
are u ei and all the u ej from its neighbors, where the controls for agent i are coupled
with the agents of its neighbors. This makes the controller design very difficult. In
the next section, an optimal distributed control by iterative ADP algorithm will be
developed that makes all agents reach the target state.
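The following sketch assembles the augmented matrices Ā_i and B̄_i of (13.10)–(13.12) with numpy block operations for a node with two neighbors; all matrices, weights and gains are illustrative, and equal input dimensions are assumed for all nodes.

```python
import numpy as np

def augmented_matrices(Ai, Bi, A_nbrs, B_nbrs, rho_i, d_i, sigma_i):
    """Build A_bar_i and B_bar_i for z_i = [delta_i; x_ej1; ...; x_ejNi] and
    u_bar_ei = [u_ei; u_ej1; ...; u_ejNi], following (13.11)-(13.12)."""
    n = Ai.shape[0]
    Ni = len(A_nbrs)
    A_bar = np.zeros((n * (Ni + 1), n * (Ni + 1)))
    A_bar[:n, :n] = Ai
    for k, (Aj, r) in enumerate(zip(A_nbrs, rho_i)):
        A_bar[:n, n * (k + 1):n * (k + 2)] = r * (Ai - Aj)   # coupling through x_ej
        A_bar[n * (k + 1):n * (k + 2), n * (k + 1):n * (k + 2)] = Aj
    m = Bi.shape[1]                                          # assumes all B_j share m columns
    B_bar = np.zeros((n * (Ni + 1), m * (Ni + 1)))
    B_bar[:n, :m] = (d_i + sigma_i) * Bi
    for k, (Bj, r) in enumerate(zip(B_nbrs, rho_i)):
        B_bar[:n, m * (k + 1):m * (k + 2)] = -r * Bj
        B_bar[n * (k + 1):n * (k + 2), m * (k + 1):m * (k + 2)] = Bj
    return A_bar, B_bar

# Illustrative node with two neighbors, unit edge weights and pinning gain 1.
Ai = np.array([[-1.0, 1.0], [-2.0, -1.0]]); Bi = np.eye(2)
A_nbrs = [np.array([[-2.0, 1.0], [-1.0, -1.0]]), np.array([[-1.0, 1.0], [-4.0, -1.0]])]
B_nbrs = [2.0 * np.eye(2), 3.0 * np.eye(2)]
A_bar, B_bar = augmented_matrices(Ai, Bi, A_nbrs, B_nbrs, rho_i=[1.0, 1.0], d_i=2.0, sigma_i=1.0)
print(A_bar.shape, B_bar.shape)   # (6, 6) (6, 6)
```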
In this section, our goal is to design an optimal distributed control to reach a consensus while simultaneously minimizing the local performance index function for system (13.10). Define u_{−ei} as the vector of the controls of the neighbors of node i, i.e., u_{−ei} ≜ {u_ej : j ∈ N_i}.
Define the utility function for node i as
U_i(z_i, u_ei, u_{−ei}) = z_i^T Q̄_ii z_i + u_ei^T R_ii u_ei + Σ_{j∈N_i} u_ej^T R_ij u_ej,   (13.13)
where Q_ii ≥ 0, R_ii > 0 and R_ij are symmetric weighting matrices and Q̄_ii = diag{Q_ii, 0, ..., 0}, so that
z_i^T Q̄_ii z_i = [δ_i^T, x_{ej_1^i}^T, ..., x_{ej_{N_i}^i}^T] diag{Q_ii, 0, ..., 0} [δ_i ; x_{ej_1^i} ; ... ; x_{ej_{N_i}^i}] = δ_i^T Q_ii δ_i.   (13.14)
Define the local performance index function of node i as
J_i(z_i(0), u_ei, u_{−ei}) = ∫_0^∞ U_i(z_i, u_ei, u_{−ei}) dt.   (13.15)
From (13.15), we can see that the performance index function includes only the information about the inputs of node i and its neighbors. The goal of this chapter is to design the local optimal distributed control to minimize the local performance index functions (13.15) subject to (13.10) and to make all nodes (agents) reach consensus at the target state x_0.
Definition 13.2 (Admissible Coordination Control Law [3]) Control law u ei , i ∈ N
is defined as admissible coordination control law if u ei is continuous, u ei (0) = 0, u ei
stabilizes agent (13.10) locally, and the performance index function (13.15) is finite.
If u_ei and u_ej, j ∈ N_i, are admissible control laws, then we can define the local performance index function V_i(z_i) as
V_i(z_i(0)) = ∫_0^∞ U_i(z_i, u_ei, u_{−ei}) dt.   (13.16)
For admissible control laws u_ei and u_ej, j ∈ N_i, the Hamiltonian H_i(z_i, ∂V_i/∂z_i, u_ei, u_{−ei}) satisfies the following cooperative Hamilton–Jacobi (HJ) equation
H_i(z_i, ∂V_i/∂z_i, u_ei, u_{−ei}) ≡ (∂V_i/∂z_i)^T (Ā_i z_i + B̄_i ū_ei) + δ_i^T Q_ii δ_i + u_ei^T R_ii u_ei + Σ_{j∈N_i} u_ej^T R_ij u_ej = 0,   (13.17)
with boundary condition V_i(0) = 0. If V_i^*(z_i) is the local performance index function, then V_i^*(z_i) satisfies the coupled HJ equations
min_{u_ei} H_i(z_i, ∂V_i^*/∂z_i, u_ei, u_{−ei}) = 0.   (13.18)
In this subsection, the optimality conditions for the multi-agent graphical game are developed, together with the corresponding properties. According to [3], we introduce the Nash equilibrium definition for multi-agent differential games.
Policies {u_e1^*, u_e2^*, ..., u_eN^*} are said to constitute a Nash equilibrium solution for the N-player graphical game if, for each i,
J_i^* = J_i(u_e1^*, ..., u_ei^*, ..., u_eN^*) ≤ J_i(u_e1^*, ..., u_ei, ..., u_eN^*),   (13.20)
for all u_ei ≠ u_ei^*. The N-tuple of the local performance index functions {J_1^*, J_2^*, ..., J_N^*} is known as the Nash equilibrium of the N multi-agent game in G.
Theorem 13.1 Let J_i^* be the Nash equilibrium solution that satisfies (13.20). If u_e1^*, u_e2^*, ..., u_eN^* are optimal control laws, then for ∀i we have
u_ei^* = arg min_{u_ei} H_i(z_i, ∂J_i^*/∂z_i, u_ei, u_{−ei}^*).   (13.21)
Find a performance index function J_l^o(z_l(0), u_el^o, u_{−el}^*) that satisfies
H_l(z_l, ∂J_l^o/∂z_l, u_el^o, u_{−el}^*) = (∂J_l^o/∂z_l)^T (Ā_l z_l + B̄_l ū_el^o) + δ_l^T Q_ll δ_l + u_el^{*T} R_ll u_el^* + Σ_{j∈N_l} u_ej^{*T} R_lj u_ej^* = 0.   (13.27)
From Theorem 13.1, we can see that if, for ∀ i ∈ N, all the controls of its neighbors are optimal, i.e., u_ej = u_ej^*, j ∈ N_i, then the optimal distributed control u_ei^* can be obtained by (13.21). The following results can be derived according to the properties of u_ei^*, i ∈ N.
Proof The proof can be seen in [3] and the details are omitted here.
From the previous subsection, we know that if the optimal performance index function has been obtained, then the optimal control law can be obtained by solving the HJ equation (13.18). However, Eq. (13.18) is difficult or even impossible to solve directly by dynamic programming. In this section, a policy iteration (PI) algorithm of adaptive dynamic programming is developed which solves the HJ equation (13.18) iteratively. Convergence properties of the developed algorithm will also be presented in this section.
In the developed policy iteration algorithm, the performance index function and control law are updated by iterations, with the iteration index k increasing from 0 to infinity. For ∀ i ∈ N, let u_ei^{[0]} be an arbitrary admissible control law, and let V_i^{[0]} be the iterative performance index function constructed by u_ei^{[0]}, which satisfies
H_i(z_i, ∂V_i^{[0]}/∂z_i, u_ei^{[0]}, u_{−ei}^{[0]}) = 0.   (13.30)
From the multi-agent policy iteration (13.30)–(13.33), we can see that V_i^{[k]} is used to approximate J_i^* and u_ei^{[k]} is used to approximate u_ei^*. Therefore, when k → ∞, the algorithm is expected to converge, i.e., V_i^{[k]} and u_ei^{[k]} converge to the optimal ones. In the next subsection, we will show such properties of the developed policy iteration algorithm.
In this subsection, we first present the convergence property of the multi-agent policy iteration algorithm. Initialized by admissible control laws, the monotonicity of the iterative performance index functions is discussed. In [3], the properties of the iterative performance index function are analyzed for the linear agents ẋ_i = Ax_i + B_i u_i. In this chapter, inspired by [3], the properties of the iterative performance index function are analyzed for ẋ_i = A_i x_i + B_i u_i.
Lemma 2 (Solution for the best iterative control law) Given fixed neighbor policies u_{−ei} ≜ {u_j : j ∈ N_i}, the best iterative control law can be expressed as
u_ei^{[k+1]} = −½ (d_i + σ_i) R_ii^{-1} B_i^T ∂V_i^{[k]}/∂δ_i.   (13.34)
Proof Minimizing the Hamiltonian with respect to u_ei and using the structure of B̄_i gives
u_ei^{[k+1]} = −½ R_ii^{-1} [ (d_i + σ_i)B_i^T, 0, ..., 0 ] × [ ∂V_i^{[k]T}/∂δ_i, ∂V_i^{[k]T}/∂x_{ej_1^i}, ..., ∂V_i^{[k]T}/∂x_{ej_{N_i}^i} ]^T
 = −½ (d_i + σ_i) R_ii^{-1} B_i^T ∂V_i^{[k]}/∂δ_i.   (13.35)
Theorem 13.2 (Convergence when the neighbor policies are fixed) For node i, let the neighbor control laws u_{−ei}^o be fixed and admissible, and let V_i^{[k]} and u_ei^{[k+1]} be generated by the policy iteration (13.30)–(13.33). Then the iterative performance index functions are monotonically nonincreasing and convergent, and the converged value and control law satisfy the best response HJ equation (13.36).
Proof Taking the derivative of V_i^{[k]} along the trajectories generated by u_ei^{[k+1]} and the fixed neighbor policies u_{−ei}^o gives
V̇_i^{[k]} = (∂V_i^{[k]}/∂z_i)^T (Ā_i z_i + B̄_i ū_ei^{[k+1]})
 = H_i(z_i, ∂V_i^{[k]}/∂z_i, u_ei^{[k+1]}, u_{−ei}^o) − ( δ_i^T Q_ii δ_i + u_ei^{[k+1]T} R_ii u_ei^{[k+1]} + Σ_{j∈N_i} u_ej^{oT} R_ij u_ej^o ).   (13.37)
Since u_ei^{[k+1]} minimizes the Hamiltonian constructed with V_i^{[k]},
H_i(z_i, ∂V_i^{[k]}/∂z_i, u_ei^{[k+1]}, u_{−ei}^o) ≤ H_i(z_i, ∂V_i^{[k]}/∂z_i, u_ei^{[k]}, u_{−ei}^o) = 0.   (13.38)
On the other hand, the policy evaluation step gives
H_i(z_i, ∂V_i^{[k+1]}/∂z_i, u_ei^{[k+1]}, u_{−ei}^o) = 0,   (13.39)
which means
V̇_i^{[k+1]} = −( δ_i^T Q_ii δ_i + u_ei^{[k+1]T} R_ii u_ei^{[k+1]} + Σ_{j∈N_i} u_ej^{oT} R_ij u_ej^o ).   (13.40)
According to (13.37), (13.38) and (13.40), we can obtain V̇_i^{[k+1]} ≥ V̇_i^{[k]}. By integrating both sides of V̇_i^{[k+1]} ≥ V̇_i^{[k]}, we can obtain V_i^{[k+1]} ≤ V_i^{[k]}. As V_i^{[k]} is lower bounded, V_i^{[k]} is convergent as k → ∞, i.e., lim_{k→∞} V_i^{[k]} = V_i^∞. According to (13.32) and (13.33), letting k → ∞, we can obtain (13.36). The proof is completed.
From Theorem 13.2, we can see that for i = 1, 2, ..., N, if the neighbor control laws u_{−ei}^o are fixed, then the iterative performance index functions and iterative control laws converge to the optimum. In the next subsection, the convergence property of the iterative performance index functions and iterative control laws when all nodes update their control laws will be developed.
Theorem 13.3 (Convergence of the policy iteration algorithm when all agents update their policies) Assume all nodes i update their policies at each iteration of the policy iteration algorithm (13.30)–(13.33). Define ς(R_ij R_jj^{-1}) as the maximum singular value of R_ij R_jj^{-1}. For small ς(R_ij R_jj^{-1}), the iterative performance index function V_i^{[k]} converges to the Nash equilibrium value J_i^* as k → ∞.
Proof The derivative of V_i^{[k]} along the system trajectories is
dV_i^{[k]}/dt = (∂V_i^{[k]}/∂z_i)^T (Ā_i z_i + B̄_i ŭ_ei^{[k+1]}).   (13.41)
V̇_i^{[k+1]} − V̇_i^{[k]}
 = (∂V_i^{[k]}/∂z_i)^T B̄_i (ū_ei^{[k]} − ŭ_ei^{[k+1]}) − u_ei^{[k+1]T} R_ii u_ei^{[k+1]} − Σ_{j∈N_i} u_ej^{[k+1]T} R_ij u_ej^{[k+1]} + u_ei^{[k]T} R_ii u_ei^{[k]} + Σ_{j∈N_i} u_ej^{[k]T} R_ij u_ej^{[k]}
 = (d_i + σ_i) (∂V_i^{[k]}/∂δ_i)^T B_i (u_ei^{[k]} − u_ei^{[k+1]}) − u_ei^{[k+1]T} R_ii u_ei^{[k+1]} − Σ_{j∈N_i} u_ej^{[k+1]T} R_ij u_ej^{[k+1]} + u_ei^{[k]T} R_ii u_ei^{[k]} + Σ_{j∈N_i} u_ej^{[k]T} R_ij u_ej^{[k]}
 = 2 u_ei^{[k+1]T} R_ii (u_ei^{[k+1]} − u_ei^{[k]}) − u_ei^{[k+1]T} R_ii u_ei^{[k+1]} + u_ei^{[k]T} R_ii u_ei^{[k]} + Σ_{j∈N_i} (u_ej^{[k+1]} − u_ej^{[k]})^T R_ij (u_ej^{[k+1]} − u_ej^{[k]}) − 2 Σ_{j∈N_i} (u_ej^{[k+1]} − u_ej^{[k]})^T R_ij u_ej^{[k+1]}
 = (u_ei^{[k+1]} − u_ei^{[k]})^T R_ii (u_ei^{[k+1]} − u_ei^{[k]}) + Σ_{j∈N_i} (u_ej^{[k+1]} − u_ej^{[k]})^T R_ij (u_ej^{[k+1]} − u_ej^{[k]}) − 2 Σ_{j∈N_i} (u_ej^{[k+1]} − u_ej^{[k]})^T R_ij u_ej^{[k+1]}.   (13.43)
Therefore, V̇_i^{[k+1]} ≥ V̇_i^{[k]} holds if
(u_ei^{[k+1]} − u_ei^{[k]})^T R_ii (u_ei^{[k+1]} − u_ei^{[k]}) + Σ_{j∈N_i} (u_ej^{[k+1]} − u_ej^{[k]})^T R_ij (u_ej^{[k+1]} − u_ej^{[k]}) ≥ 2 Σ_{j∈N_i} (u_ej^{[k+1]} − u_ej^{[k]})^T R_ij u_ej^{[k+1]} = − Σ_{j∈N_i} (d_j + σ_j)(u_ej^{[k+1]} − u_ej^{[k]})^T R_ij R_jj^{-1} B_j^T ∂V_j^{[k]}/∂δ_j,   (13.44)
where the differences u_ej^{[k+1]} − u_ej^{[k]} are taken for j ∈ {i} ∪ N_i. Let ς(R_ij) be the maximum singular value of R_ij. We can see that for ∀ i, if ς(R_ij R_jj^{-1}), ρ_ij, j ∈ N_i, and σ_i are small, then inequality (13.44) holds, which means V̇_i^{[k+1]} ≥ V̇_i^{[k]}. By integration of V̇_i^{[k+1]} ≥ V̇_i^{[k]}, we can obtain V_i^{[k+1]} ≤ V_i^{[k]}. Hence, the iterative performance index function V_i^{[k]} is monotonically nonincreasing and lower bounded. As such V_i^{[k]} is convergent as k → ∞, i.e., lim_{k→∞} V_i^{[k]} = V_i^∞.
It is obvious that V_i^∞ ≥ J_i^*. On the other hand, let {u_e1^{[ℓ]}, u_e2^{[ℓ]}, ..., u_eN^{[ℓ]}} be arbitrary admissible control laws. For ∀i, there must exist a performance index function V_i^{[ℓ]} that satisfies
H_i(z_i, ∂V_i^{[ℓ]}/∂z_i, u_ei^{[ℓ]}, u_{−ei}^{[ℓ]}) = 0.   (13.45)
Let u_ei^{[ℓ+1]} = arg min_{u_ei} H_i(z_i, ∂V_i^{[ℓ]}/∂z_i, u_ei, u_{−ei}^{[ℓ]}); then we have V_i^∞ ≤ V_i^{[ℓ+1]} ≤ V_i^{[ℓ]}. As {u_e1^{[ℓ]}, u_e2^{[ℓ]}, ..., u_eN^{[ℓ]}} are arbitrary, let
{u_e1^{[ℓ]}, u_e2^{[ℓ]}, ..., u_eN^{[ℓ]}} = {u_e1^*, u_e2^*, ..., u_eN^*},   (13.46)
and then we can obtain V_i^∞ ≤ J_i^*. Therefore, we can obtain that lim_{k→∞} V_i^{[k]} = J_i^*. The proof is completed.
Remark 2 In [3], it is shown that for the linear multi-agent system ẋ_i = Ax_i + B_i u_i, if the edge weights ρ_ij and ς(R_ij R_jj^{-1}) are small, then the iterative performance index functions converge to the optimum. From Theorem 13.3, we have that for the multi-agent system (13.1), if ς(R_ij R_jj^{-1}) is small, then the iterative performance index function V_i^{[k]} will also converge to the optimum as k → ∞, while the constraint on ρ_ij is omitted.
4: Do Policy Evaluation:
   H_i(z_i, ∂V_i^{[k]}/∂z_i, u_ei^{[k]}, u_{−ei}^{[k]}) = 0;
5: If V_i^{[k+1]} ≤ V_i^{[k]}, go to the next step. Else, let R_ij = ζ R_ij, j ∈ N_i, and go to Step 2;
6: If V_i^{[k]} − V_i^{[k+1]} ≤ ε, then the optimal performance index function and optimal control law are obtained; go to Step 7. Else go to Step 3;
7: Return u_ei^{[k]} and V_i^{[k]}. Let u_i^* = u_ei^{[k]} + u_di.
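The listing below sketches only the control flow of the cooperative PI algorithm above (Steps 2–7): the monotonicity safeguard that shrinks R_ij by ζ, and the ε stopping test. The evaluate_policy and improve_policy callables, as well as the toy scalar demonstration, are placeholders the reader would replace with an actual evaluation and improvement routine; they are not implementations from the chapter.

```python
def cooperative_pi(evaluate_policy, improve_policy, V0, R, zeta=0.9, eps=1e-6, max_iter=200):
    """Control-flow skeleton of the cooperative PI algorithm for node i.

    evaluate_policy(u, R) -> V : stands in for solving H_i(z_i, dV/dz_i, u_ei, u_-ei) = 0
    improve_policy(V, R)  -> u : stands in for u_ei <- argmin_u H_i(z_i, dV/dz_i, u, u_-ei)
    """
    V_prev = V0
    u = improve_policy(V_prev, R)                 # Step 3: policy improvement
    for _ in range(max_iter):
        V = evaluate_policy(u, R)                 # Step 4: policy evaluation
        if not (V <= V_prev):                     # Step 5: monotonicity safeguard
            R = [[zeta * Rij for Rij in row] for row in R]
            u = improve_policy(V_prev, R)
            continue
        if V_prev - V <= eps:                     # Step 6: convergence test
            return V, u, R                        # Step 7: return the converged pair
        V_prev = V
        u = improve_policy(V, R)                  # back to Step 3
    return V_prev, u, R

# Toy demonstration of the control flow only (scalar stand-ins for value and policy).
V, u, R = cooperative_pi(lambda u, R: u + 1.0, lambda V, R: 0.5 * V, V0=4.0, R=[[1.0]])
print(V, u)
```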
13.5 Simulation Study
Consider a directed graph G with five nodes and one leader (target) node, as shown in Fig. 13.1. The node dynamics are given as follows.
Node 2:
[ẋ_21 ; ẋ_22] = [[−1, 1], [−4, −1]] [x_21 ; x_22] + [[2, 0], [0, 3]] [u_21 ; u_22],   (13.48)
Node 3:
[ẋ_31 ; ẋ_32] = [[−2, 1], [−1, −1]] [x_31 ; x_32] + [[2, 0], [0, 2]] [u_31 ; u_32],   (13.49)
Node 4:
[ẋ_41 ; ẋ_42] = [[−1, 1], [−2, −1]] [x_41 ; x_42] + [[1, 0], [0, 1]] [u_41 ; u_42],   (13.50)
(Fig. 13.1: the communication graph, with nodes 1–5 and the leader.)
(Fig. 13.2: Tracking errors x_ei, i = 1, 2, 3, 4. a Tracking error x_e1. b Tracking error x_e2. c Tracking error x_e3. d Tracking error x_e4.)
Node 5:
[ẋ_51 ; ẋ_52] = [[−2, 1], [−3, −1]] [x_51 ; x_52] + [[3, 0], [0, 2]] [u_51 ; u_52].   (13.51)
where I denotes the identity matrix with suitable dimensions. Define the utility function as in (13.13), where the weight matrices are expressed as
Q_11 = Q_22 = Q_33 = Q_44 = Q_55 = I,
R_11 = 4I, R_12 = I, R_13 = I, R_14 = I,
R_21 = I, R_22 = 4I,
R_31 = I, R_32 = I, R_33 = 5I,
R_41 = I, R_44 = 9I, R_45 = I,
R_52 = I, R_54 = I, R_55 = 9I.   (13.53)
(Fig. 13.3: Tracking error x_e5 and the control errors u_ei, i = 1, 2, 3. a Tracking errors x_e5. b Control error u_e1. c Control error u_e2. d Control error u_e3.)
(Fig. 13.4: Control errors u_e4 and u_e5. a Control error u_e4. b Control error u_e5.)
(Figure: state trajectories x_i, i = 1, ..., 5, of the five nodes together with the target state x_0.)
(Figure: control trajectories u_i, i = 1, ..., 5, of the five nodes.)
13.6 Conclusion
In this chapter, an effective policy-iteration-based ADP algorithm is developed to solve the optimal coordination control problem for heterogeneous multi-agent differential graphical games. The developed heterogeneous differential graphical games permit the agent dynamics of each node to be different from those of the other nodes. An optimal cooperative policy iteration algorithm for graphical differential games is developed to achieve the optimal control law for the agent of each node, guaranteeing that the dynamics of each node can track the desired one. Convergence analysis shows that the performance index functions of heterogeneous multi-agent differential graphical games converge to the Nash equilibrium. Finally, simulation results show the effectiveness of the developed optimal control scheme.
References
1. Jamshidi, M.: Large-Scale Systems-Modeling and Control. The Netherlands Press, Amsterdam
(1982)
2. Owen, G.: Game Theory. Academic Press, New York (1982)
3. Vamvoudakis, K., Lewis, F., Hudas, G.: Multi-agent differential graphical games: online adaptive
learning solution for synchronization with optimality. Automatica 48(8), 1598–1611 (2012)
Index
C
Critic neural network, 207
D
Disturbance neural networks, 207
E
Extreme learning machine, 7
H
Hamilton-Jacobi-Bellman, 1
Hamilton–Jacobi–Isaacs, 208
I
Integral reinforcement learning, vi
L
Linear quadratic regulator, 1
N
Neural network, 4, 69
R
Recurrent neural network, 63
Reinforcement learning, 3
S
Shunting inhibitory artificial neural network, 63, 64
Single-hidden layer feed-forward network, 7
Support vector machine, 8
U
Uniformly ultimately bounded, 63
V
Value iteration, 4
W
Wheeled inverted pendulum, 83
Z
Zero-sum, 165