Derong Liu
Qinglai Wei
Ding Wang
Xiong Yang
Hongliang Li
Adaptive Dynamic Programming with Applications in Optimal Control
Advances in Industrial Control
Series editors
Michael J. Grimble, Glasgow, UK
Michael A. Johnson, Kidlington, UK
More information about this series at http://www.springer.com/series/1412
Derong Liu · Qinglai Wei · Ding Wang

Adaptive Dynamic Programming with Applications in Optimal Control

Derong Liu
Institute of Automation, Chinese Academy of Sciences, Beijing, China

Qinglai Wei
Institute of Automation, Chinese Academy of Sciences, Beijing, China

Ding Wang
Institute of Automation, Chinese Academy of Sciences, Beijing, China

Xiong Yang
Tianjin University, Tianjin, China

Hongliang Li
Tencent Inc., Shenzhen, China
Series Editors' Foreword
The series Advances in Industrial Control aims to report and encourage technology
transfer in control engineering. The rapid development of control technology has an
impact on all areas of the control discipline: new theory, new controllers, actuators,
sensors, new industrial processes, computer methods, new applications, new design
philosophies, and new challenges. Much of this development work resides in
industrial reports, feasibility study papers, and the reports of advanced collaborative
projects. The series offers an opportunity for researchers to present an extended
exposition of such new work in all aspects of industrial control for wider and rapid
dissemination.
The method of dynamic programming has a long history in the field of optimal
control. It dates back to those days when the subject of control was emerging in a
modern form in the 1950s and 1960s. It was devised by Richard Bellman who gave
it a modern revision in a publication of 1954 [1]. The name of Bellman became
linked to an optimality equation, key to the method, and like the name of Kalman
became uniquely associated with the early development of optimal control. One
notable extension to the method was that of differential dynamic programming due
to David Q. Mayne in 1966 and developed at length in the book by Jacobson and
Mayne [2]. Their new technique used locally quadratic models for the system
dynamics and cost functions and improved the convergence of the dynamic pro-
gramming method for optimal trajectory control problems.
Since those early days, the subject of control has taken many different directions,
but dynamic programming has always retained a place in the theory of optimal
control fundamentals. It is therefore instructive for the Advances in Industrial
Control monograph series to have a contribution that presents new ways of solving
dynamic programming and demonstrating these methods with some up-to-date
industrial problems. This monograph, Adaptive Dynamic Programming with
Applications in Optimal Control, by Derong Liu, Qinglai Wei, Ding Wang, Xiong
Yang and Hongliang Li, has precisely that objective.
The authors open the monograph with a very interesting and relevant discussion
of another computationally difficult problem, namely devising a computer program
to defeat human master players at the Chinese game of Go. Inspiration from the
better programming techniques used in the Go-master problem was used by the
authors to defeat the “curse of dimensionality” that arises in dynamic programming
methods.
More formally, the objective of the techniques reported in the monograph is to
control in an optimal fashion an unknown or uncertain nonlinear multivariable
system using recorded and instantaneous output signals. The algorithms’ technical
framework is then constructed through different categories of the usual state-space
nonlinear ordinary differential system model. The system model can be continuous
or discrete, have affine or nonaffine control inputs, be subject to no constraints, or
have constraints present. A set of 11 chapters contains the theory for various
formulations of the system features.
Since standard dynamic programming schemes suffer from various implemen-
tation obstacles, adaptive dynamic programming procedures have been developed
to find computable practical suboptimal control solutions. A key technique used by
the authors is that of neural networks which are trained using recorded data and
updated, or “adapted,” to accommodate uncertain system knowledge. The theory
chapters are arranged in two parts: Part 1 Discrete-Time Systems—five chapters;
and Part 2 Continuous-Time Systems—five chapters.
An important feature of the monographs of the Advances in Industrial Control
series is a demonstration of potential or actual application to industrial problems.
After a comprehensive presentation of the theory of adaptive dynamic program-
ming, the authors devote Part 3 of their monograph to three chapter-length appli-
cation studies. Chapter 12 examines the scheduling of energy supplies in a smart
home environment, a topic and problem of considerable contemporary interest.
Chapter 13 uses a coal gasification process that is suitably challenging to demon-
strate the authors’ techniques. And finally, Chapter 14 concerns the control of the
water gas shift reaction. In this example, the data used was taken from a real-world
operational system.
This monograph is very comprehensive in its presentation of the adaptive
dynamic programming theory and has demonstrations with three challenging pro-
cesses. It should find a wide readership in both the industrial control engineering
and the academic control theory communities. Readers in other fields such as
computer science and chemical engineering may also find the monograph of con-
siderable interest.
Michael J. Grimble
Michael A. Johnson
Industrial Control Centre
University of Strathclyde
Glasgow, Scotland, UK
References
1. Bellman R (1954) The theory of dynamic programming. Bulletin of the American Mathematical
Society 60(6):503–515
2. Jacobson DH, Mayne DQ (1970) Differential dynamic programming. American Elsevier Pub. Co., New York
Preface
With the rapid development in information science and technology, many busi-
nesses and industries have undergone great changes, such as chemical industry,
electric power engineering, electronics industry, mechanical engineering, trans-
portation, and logistics business. While the scale of industrial enterprises is
increasing, production equipment and industrial processes are becoming more and
more complex. For these complex systems, decision and control are necessary to
ensure that they perform properly and meet prescribed performance objectives.
Under this circumstance, how to design safe, reliable, and efficient control for
complex systems is essential for our society. As modern systems become more
complex and performance requirements become more stringent, advanced control
methods are greatly needed to achieve guaranteed performance and satisfactory
goals.
In general, optimal control deals with the problem of finding a control law for a
given system such that a certain optimality criterion is achieved. The main differ-
ence between optimal control of linear and nonlinear systems lies in that the latter
often requires solving the nonlinear Bellman equation instead of the Riccati
equation. Although dynamic programming is a conventional method in solving
optimization and optimal control problems, it often suffers from the “curse of
dimensionality.” To overcome this difficulty, based on function approximators such
as neural networks, adaptive/approximate dynamic programming (ADP) was pro-
posed by Werbos as a method for solving optimal control problems
forward-in-time.
This book presents the recent results of ADP with applications in optimal
control. It is composed of 14 chapters which cover most of the hot research areas of
ADP and are divided into three parts. Part I concerns discrete-time systems,
including five chapters from Chaps. 2 to 6. Part II concerns continuous-time sys-
tems, including five chapters from Chaps. 7 to 11. Part III concerns applications,
including three chapters from Chaps. 12 to 14.
In Chap. 1, an introduction to the history of ADP is provided, including the basic
and iterative forms of ADP. The review begins with the origin of ADP and
describes the basic structures and the algorithm development in detail. Connections
between ADP and reinforcement learning are also discussed.
Part I: Discrete-Time Systems (Chaps. 2–6)
In Chap. 2, optimal control problems of discrete-time nonlinear dynamical systems,
including optimal regulation, optimal tracking control, and constrained optimal
control, are studied using a series of value iteration ADP approaches. First, an ADP
scheme based on general value iteration is developed to obtain near-optimal control
for discrete-time affine nonlinear systems with continuous state and control spaces.
The present scheme is also employed to solve infinite-horizon optimal tracking
control problems for a class of discrete-time nonlinear systems. In particular, using
the globalized dual heuristic programming technique, a value iteration-based
optimal control strategy of unknown discrete-time nonlinear dynamical systems
with input constraints is established as a case study. Second, an iterative θ-ADP
algorithm is given to solve the optimal control problem of infinite-horizon
discrete-time nonlinear systems, which shows that each of the iterative controls can
stabilize the nonlinear dynamical systems and the condition of initial admissible
control is avoided effectively.
In Chap. 3, a series of iterative ADP algorithms are developed to solve the
infinite-horizon optimal control problems for discrete-time nonlinear dynamical
systems with finite approximation errors. Iterative control laws are obtained by
using the present algorithms such that the iterative value functions reach the opti-
mum. Then, the numerical optimal control problems are solved by a novel
numerical adaptive learning control scheme based on ADP algorithm. Moreover, a
general value iteration algorithm with finite approximation errors is developed to
guarantee that the iterative value function converges to the solution of the Bellman
equation. The general value iteration algorithm can be initialized by an arbitrary
positive semidefinite function, which overcomes the disadvantage of traditional
value iteration algorithms.
In Chap. 4, a discrete-time policy iteration ADP method is developed to solve
the infinite-horizon optimal control problems for nonlinear dynamical systems. The
idea is to use an iterative ADP technique to obtain iterative control laws that
optimize the iterative value functions. The convergence, stability, and optimality
properties are analyzed for policy iteration method for discrete-time nonlinear
dynamical systems, and it is shown that the iterative value functions are nonin-
creasingly convergent to the optimal solution of the Bellman equation. It is also
proven that any of the iterative control laws obtained from the present policy
iteration algorithm can stabilize the nonlinear dynamical systems.
In Chap. 5, a generalized policy iteration algorithm is developed to solve the
optimal control problems for infinite-horizon discrete-time nonlinear systems.
The generalized policy iteration algorithm uses the idea of interleaving the policy iter-
ation algorithm and the value iteration algorithm of ADP. It permits an arbitrary
positive semidefinite function to initialize the algorithm, where two iteration indices
are used for policy evaluation and policy improvement, respectively. The
control design, which extends the application scope of ADP methods to nonlinear
and uncertain environment.
In Chap. 10, by using neural network-based online learning optimal control
approach, a decentralized control strategy is developed to stabilize a class of
continuous-time large-scale interconnected nonlinear systems. The decentralized
control strategy of the overall system can be established by adding appropriate
feedback gains to the optimal control laws of isolated subsystems. Then, an online
policy iteration algorithm is presented to solve the Hamilton–Jacobi–Bellman
equations related to the optimal control problems. Furthermore, as a generalization,
a neural network-based decentralized control law is developed to stabilize the
large-scale interconnected nonlinear systems with unknown dynamics by using an
online model-free integral policy iteration algorithm.
In Chap. 11, differential game problems of continuous-time systems, including
two-player zero-sum games, multiplayer zero-sum games, and multiplayer
nonzero-sum games, are studied via a series of ADP approaches. First, an integral
policy iteration algorithm is developed to learn online the Nash equilibrium solution
of two-player zero-sum differential games with completely unknown
continuous-time linear dynamics. Second, multiplayer zero-sum differential games
for a class of continuous-time uncertain nonlinear systems are solved by using an
iterative ADP algorithm. Finally, an online synchronous approximate optimal
learning algorithm based on policy iteration is developed to solve multiplayer
nonzero-sum games of continuous-time nonlinear systems without requiring exact
knowledge of system dynamics.
Part III: Applications (Chaps. 12–14)
In Chap. 12, intelligent optimization methods based on ADP are applied to the
challenges of intelligent price-responsive management of residential energy, with
an emphasis on home battery use connected to the power grid. First, an
action-dependent heuristic dynamic programming is developed to obtain the opti-
mal control law for residential energy management. Second, a dual iterative
Q-learning algorithm is developed to solve the optimal battery management and
control problem in smart residential environments where two iterations are intro-
duced, which are respectively internal and external iterations. Based on the dual
iterative Q-learning algorithm, the convergence property of iterative Q-learning
method for the optimal battery management and control problem is proven. Finally,
a distributed iterative ADP method is developed to solve the multibattery optimal
coordination control problem for home energy management systems.
In Chap. 13, a coal gasification optimal tracking control problem is solved
through a data-based iterative optimal learning control scheme by using iterative
ADP approach. According to system data, neural networks are used to construct the
dynamics of coal gasification process, coal quality, and reference control, respec-
tively. Via system transformation, the optimal tracking control problem with
approximation errors and disturbances is effectively transformed into a two-person
zero-sum optimal control problem. An iterative ADP algorithm is developed to
obtain the optimal control laws for the transformed system.
The authors would like to acknowledge the help and encouragement they have
received from colleagues in Beijing and Chicago during the course of writing this
book. Some materials presented in this book are based on the research conducted
with several Ph.D. students, including Yuzhu Huang, Dehua Zhang, Pengfei Yan,
Yancai Xu, Hongwen Ma, Chao Li, and Guang Shi. The authors also wish to thank
Oliver Jackson, Editor (Engineering) from Springer for his patience and
encouragement.
The authors are very grateful to the National Natural Science Foundation of
China (NSFC) for providing necessary financial support to our research in the past
five years. The present book is the result of NSFC Grants 61034002, 61233001,
61273140, 61304086, and 61374105.
Contents
2.4 Conclusions  87
References  87
3 Finite Approximation Error-Based Value Iteration ADP  91
3.1 Introduction  91
3.2 Iterative θ-ADP Algorithm with Finite Approximation Errors  92
3.2.1 Properties of the Iterative ADP Algorithm with Finite Approximation Errors  93
3.2.2 Neural Network Implementation  100
3.2.3 Simulation Study  104
3.3 Numerical Iterative θ-Adaptive Dynamic Programming  107
3.3.1 Derivation of the Numerical Iterative θ-ADP Algorithm  107
3.3.2 Properties of the Numerical Iterative θ-ADP Algorithm  111
3.3.3 Summary of the Numerical Iterative θ-ADP Algorithm  120
3.3.4 Simulation Study  121
3.4 General Value Iteration ADP Algorithm with Finite Approximation Errors  125
3.4.1 Derivation and Properties of the GVI Algorithm with Finite Approximation Errors  125
3.4.2 Designs of Convergence Criteria with Finite Approximation Errors  133
3.4.3 Simulation Study  140
3.5 Conclusions  147
References  147
4 Policy Iteration for Optimal Control of Discrete-Time Nonlinear Systems  151
4.1 Introduction  151
4.2 Policy Iteration Algorithm  152
4.2.1 Derivation of Policy Iteration Algorithm  153
4.2.2 Properties of Policy Iteration Algorithm  154
4.2.3 Initial Admissible Control Law  160
4.2.4 Summary of Policy Iteration ADP Algorithm  162
4.3 Numerical Simulation and Analysis  162
4.4 Conclusions  173
References  174
Symbols
∈  Belong to
∀  For all
⇒  Implies
⇔  Equivalent, or if and only if
⊗  Kronecker product
∅  The empty set
≜  Equal to by definition
C^n(Ω)  The class of functions having continuous nth derivative on Ω
ℒ_2(Ω)  The ℒ_2 space defined on Ω, i.e., ∫_Ω ‖f(x)‖² dx < ∞ for f ∈ ℒ_2(Ω)
ℒ_∞(Ω)  The ℒ_∞ space defined on Ω, i.e., sup_{x∈Ω} ‖f(x)‖ < ∞ for f ∈ ℒ_∞(Ω)
λ_min(A)  The minimum eigenvalue of matrix A
λ_max(A)  The maximum eigenvalue of matrix A
I_n  The n × n identity matrix
A > 0  Matrix A is positive definite
det(A)  Determinant of matrix A
A^{−1}  The inverse of matrix A
tr(A)  The trace of matrix A
1 Overview of Adaptive Dynamic Programming

1.1 Introduction
Big data, artificial intelligence (AI), and deep learning have lately been the three most
talked-about topics in information technology. The recent emergence of deep learning
[10, 17, 38, 68, 88] has pushed neural networks (NNs) to become a hot research
topic again. It has also gained huge success in almost every branch of AI, includ-
ing machine learning, pattern recognition, speech recognition, computer vision, and
natural language processing [17, 25, 26, 35, 74]. On the other hand, the study of
big data often uses AI technologies such as machine learning [80] and deep learning
[17]. One particular subject of study in AI, i.e., the computer game of Go, faced
a great challenge of dealing with vast amounts of data. The ancient Chinese board
game Go has been studied for years with the hope that one day, computer programs
can defeat human professional players. The board of the game of Go consists of a 19 × 19
grid, so at the beginning of the game each of the two players has roughly
360 options for placing a stone. However, the number of potential legal board
positions grows exponentially, and it quickly becomes greater than the total number
of atoms in the whole universe [103]. Such a number leads to so many directions in which any
given game can move that it is impossible for a computer to play by brute-force
computation of all possible outcomes.
Previous computer programs focused less on evaluating the state of the board
positions and more on speeding up simulations of how the game might play out.
The Monte Carlo tree search approach was used often in computer game programs,
which samples only some of the possible sequences of plays randomly at each step to
choose between different possible moves instead of trying to calculate every possible
one. Google DeepMind, an AI company in London acquired by Google in 2014,
developed a program called AlphaGo [92] that has shown performance previously
thought to be impossible for at least a decade. Instead of exploring various sequences
of moves, AlphaGo learns to make a move by evaluating the strength of its position on
the board. Such an evaluation was made possible by NN’s deep learning capabilities.
The remaining sections of this chapter introduce the basic forms of ADP as well as its iterative forms. A few related books will be briefly reviewed before
the end of this chapter.
The main research results in RL can be found in the book by Sutton and Barto
[98] and the references cited in the book. Even though both RL and the main topic
studied in the present book, i.e., ADP, provide approximate solutions to dynamic
programming, research in these two directions has been somewhat independent [7]
in the past. The most famous algorithms in RL are the temporal difference algorithm
[97] and the Q-learning algorithm [112, 113]. Compared to ADP, the area of RL is
more mature and has a vast amount of literature (cf. [27, 34, 47, 98]).
An RL system typically consists of the following four components: {S, A, R, F},
where S is the set of states, A is the set of actions, R is the set of scalar reinforcement
signals or rewards, and F is the function describing the transition from one state
to the next under a given action, i.e., F : S × A → S. A policy π is defined as a
mapping π : S → A. At any given time t, the system can be in a state st ∈ S, take
an action at ∈ A determined by the policy π , i.e., at = π(st ), transition to the next
state s′ = st+1 , which is denoted by st+1 = F(st , at ), and at the same time, receive a
reward signal rt+1 = r(st , at , st+1 ) ∈ R. The goal of RL is to determine a policy to
maximize the accumulated reward starting from initial state s0 at t = 0.
An RL task always involves estimating some kind of value function. A value
function estimates how good it is to be in a given state s and it is defined as,
V^π(s) = Σ_{k=0}^{∞} γ^k r_{k+1} |_{s_0 = s} = Σ_{k=0}^{∞} γ^k r(s_k, a_k, s_{k+1}) |_{s_0 = s},  (1.2.1)
where at+k = π(st+k ) and st+k+1 = F(st+k , at+k ) for k = 0, 1, . . . . V π (s) is referred
to as the state-value function for policy π . On the other hand, the action-value function
for policy π estimates how good it is to perform a given action a in a given state s
under the policy π and it is defined as,
Q^π(s, a) = Σ_{k=0}^{∞} γ^k r_{t+k+1} |_{s_t = s, a_t = a} = Σ_{k=0}^{∞} γ^k r(s_{t+k}, a_{t+k}, s_{t+k+1}) |_{s_t = s, a_t = a},  (1.2.2)
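To make these definitions concrete, the short sketch below evaluates V^π and Q^π by rolling out a policy and accumulating discounted rewards on a tiny deterministic MDP. The states, actions, transition function F, reward r, and policy π used here are invented purely for illustration; only the discounted-sum definitions themselves follow the text.

```python
# Illustrative sketch: evaluating V^pi and Q^pi on a toy deterministic MDP.
# The states, actions, transition F, reward r, and policy pi are hypothetical;
# only the discounted-sum definitions follow the text.

GAMMA = 0.9

states = [0, 1, 2]
actions = [0, 1]

def F(s, a):
    """Deterministic transition s_{t+1} = F(s_t, a_t)."""
    return (s + a + 1) % 3

def r(s, a, s_next):
    """Reward received when moving from s to s_next under action a."""
    return 1.0 if s_next == 2 else 0.0

def pi(s):
    """A fixed (arbitrary) policy pi: S -> A."""
    return 0 if s == 2 else 1

def V_pi(s, horizon=200):
    """Approximate V^pi(s) by truncating the infinite discounted sum."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)
        s_next = F(s, a)
        total += discount * r(s, a, s_next)
        discount *= GAMMA
        s = s_next
    return total

def Q_pi(s, a, horizon=200):
    """Approximate Q^pi(s, a): take action a first, then follow pi."""
    s_next = F(s, a)
    return r(s, a, s_next) + GAMMA * V_pi(s_next, horizon)

if __name__ == "__main__":
    for s in states:
        print(s, V_pi(s), [Q_pi(s, a) for a in actions])
```

Truncating the infinite sum at a finite horizon is adequate here because γ < 1 makes the discounted tail negligible.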
The optimal value function satisfies the Bellman optimality equation

V^*(s) = max_a { r(s, a, s′) + γ V^*(s′) },  (1.2.5)

or

Q^*(s, a) = r(s, a, s′) + γ max_{a′} Q^*(s′, a′),  (1.2.6)

where s′ = F(s, a) is the next state.
The optimal policy can then be obtained as

π^*(s) = arg max_a { r(s, a, s′) + γ V^*(s′) },  (1.2.7)

or

π^*(s) = arg max_a Q^*(s, a).  (1.2.8)
The computation shall continue until convergence, e.g., when |Vi+1 (s) − Vi (s)| ≤
ε, ∀s, to obtain V ∗ (s) ≈ Vi+1 (s), where ε is a small positive number. Then, the
optimal policy is approximated as π ∗ (s) ≈ πi+1 (s).
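A minimal tabular sketch of this iteration and stopping rule is given below, assuming a small finite MDP with a known deterministic transition function and reward (both hypothetical):

```python
# Sketch of tabular value iteration with the stopping rule
# |V_{i+1}(s) - V_i(s)| <= eps for all s. The MDP below is hypothetical.

GAMMA = 0.9
EPS = 1e-6

states = [0, 1, 2]
actions = [0, 1]

def F(s, a):            # deterministic transition
    return (s + a + 1) % 3

def r(s, a, s_next):    # reward
    return 1.0 if s_next == 2 else 0.0

V = {s: 0.0 for s in states}          # V_0(s) = 0 for all s
while True:
    V_new = {}
    for s in states:
        V_new[s] = max(r(s, a, F(s, a)) + GAMMA * V[F(s, a)] for a in actions)
    converged = max(abs(V_new[s] - V[s]) for s in states) <= EPS
    V = V_new
    if converged:
        break

# Greedy policy extracted from the (approximately) converged value function
policy = {s: max(actions, key=lambda a: r(s, a, F(s, a)) + GAMMA * V[F(s, a)])
          for s in states}
print(V, policy)
```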
The temporal difference (TD) method [97] is developed to estimate the value
function for a given policy. The TD algorithm is described by
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)],  (1.2.15)
or
V_{i+1}(s_t) = V_i(s_t) + α [r_{t+1} + γ V_i(s_{t+1}) − V_i(s_t)], i = 0, 1, 2, . . . ,  (1.2.16)
where α > 0 is a step size. This algorithm is also called TD(0) (compared to TD(λ)
to be introduced later). The update in (1.2.15) is obtained according to the following
general formula:
NewEstimate ← OldEstimate + StepSize [Target − OldEstimate].

The Q-learning algorithm [112, 113] is described by

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],  (1.2.17)
or
Q_{i+1}(s_t, a_t) = Q_i(s_t, a_t) + α [r_{t+1} + γ max_a {Q_i(s_{t+1}, a)} − Q_i(s_t, a_t)], i = 0, 1, 2, . . . .  (1.2.18)
Based on the value function obtained above, an improved policy can be determined
as
π(s_t) = arg max_a Q(s_t, a),  (1.2.19)

or

π_{i+1}(s_t) = arg max_a Q_{i+1}(s_t, a).  (1.2.20)
The Sarsa algorithm [82] is described by

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)],  (1.2.21)

or
Qi+1 (st , at ) = Qi (st , at ) + α[rt+1 + γ Qi (st+1 , at+1 ) − Qi (st , at )], i = 0, 1, 2, . . . . (1.2.22)
A more general TD algorithm called TD(λ) [97] has been quite popular [8, 13, 15,
16, 20, 24, 87, 89, 92, 93, 100, 101, 137, 138]. Using TD(λ), the update equation
(1.2.15) now becomes

V_{t+1}(s) = V_t(s) + α z_t(s) [r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)], ∀s,  (1.2.23)

where z_t(s) is the eligibility trace of state s, given by z_t(s) = γλ z_{t−1}(s) + 1 if s = s_t, and z_t(s) = γλ z_{t−1}(s) otherwise.
Similar ideas can be applied to Sarsa [82] to obtain the following Sarsa(λ) update
equation
Q_{t+1}(s, a) = Q_t(s, a) + α z_t(s, a) [r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)],  (1.2.24)

where

z_t(s, a) = γλ z_{t−1}(s, a) + 1, if s = s_t and a = a_t,
z_t(s, a) = γλ z_{t−1}(s, a), otherwise,
with z_0(s, a) = 0, ∀s, a. It can also be applied to Q-learning [112, 113] to obtain the
following Q(λ) update equation

Q_{t+1}(s, a) = Q_t(s, a) + α z_t(s, a) δ_t,  (1.2.25)

where

δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t),

z_t(s, a) = I_{s,s_t} I_{a,a_t} + { γλ z_{t−1}(s, a), if Q_{t−1}(s_t, a_t) = max_a {Q_{t−1}(s_t, a)}; 0, otherwise },

and I_{x,y} = 1 if x = y and I_{x,y} = 0 otherwise.
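For concreteness, the following sketch implements the basic one-step Q-learning update with ε-greedy exploration on a hypothetical deterministic MDP; the environment, learning rate, and exploration parameter are invented for illustration and are not taken from the references above.

```python
import random

# Sketch of tabular Q-learning, following the update
# Q(s_t, a_t) <- Q(s_t, a_t) + alpha*[r_{t+1} + gamma*max_a Q(s_{t+1}, a) - Q(s_t, a_t)].
# The toy environment below is hypothetical.

GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1
states, actions = [0, 1, 2], [0, 1]

def F(s, a):
    return (s + a + 1) % 3

def r(s, a, s_next):
    return 1.0 if s_next == 2 else 0.0

Q = {(s, a): 0.0 for s in states for a in actions}

s = 0
for t in range(5000):
    # epsilon-greedy action selection
    if random.random() < EPSILON:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda b: Q[(s, b)])
    s_next = F(s, a)
    reward = r(s, a, s_next)
    td_target = reward + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])
    s = s_next

greedy = {state: max(actions, key=lambda b: Q[(state, b)]) for state in states}
print(greedy)
```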
There are several schemes of dynamic programming [9, 11, 23, 41]. One can con-
sider discrete-time systems or continuous-time systems, linear systems or nonlin-
ear systems, time-invariant systems or time-varying systems, deterministic systems
or stochastic systems, etc. Discrete-time (deterministic) nonlinear (time-invariant)
dynamical systems will be discussed first. Time-invariant nonlinear systems cover
most of the application areas and discrete-time is the basic consideration for digital
implementation.
Consider the discrete-time nonlinear dynamical system

x_{k+1} = F(x_k, u_k), k = 0, 1, 2, . . . ,  (1.3.1)

where x_k ∈ R^n is the state vector and u_k ∈ R^m is the control vector. The cost function associated with this system is

J(x_k, u_k) = Σ_{i=k}^{∞} γ^{i−k} U(x_i, u_i),  (1.3.2)
where uk = (uk , uk+1 , . . . ) denotes the control sequence starting at time k, U(·, ·) ≥ 0
is called the utility function and γ is the discount factor with 0 < γ ≤ 1. Note that the
function J is dependent on the initial time k and the initial state xk , and it is referred to
as the cost-to-go of state xk . The cost in this case accumulates indefinitely; such
problems are referred to as infinite-horizon problems in dynamic programming. On
the other hand, in finite-horizon problems, the cost accumulates over a finite number
of steps. Very often, it is desired to determine u0 = (u0 , u1 , . . . ) so that J(x0 , u0 ) is
optimized (i.e., maximized or minimized). We will use u∗0 = (u0∗ , u1∗ , . . . ) and J ∗ (x0 )
to denote the optimal control sequence and the optimal cost function, respectively.
The objective of the dynamic programming problem in this book is to determine a
control sequence u_k^*, k = 0, 1, . . . , so that the function J (i.e., the cost) in (1.3.2) is
minimized. The optimal cost function is defined as

J^*(x_k) = inf_{u_k} J(x_k, u_k).

For a given control policy μ, define the cost function

J^μ(x_k) = Σ_{i=k}^{∞} γ^{i−k} U(x_i, μ(x_i)),
which is the cost function for system (1.3.1) starting at xk when the policy uk = μ(xk )
is applied. The optimal cost for system (1.3.1) starting at x0 is determined as
J^*(x_0) = inf_μ J^μ(x_0) = J^{μ^*}(x_0),
where xk is the state of the system at time k and xk+1 is determined by (1.3.1).
According to Bellman, the optimal cost from time k on is equal to

J^*(x_k) = min_{u_k} { U(x_k, u_k) + γ J^*(x_{k+1}) }.  (1.3.4)
Equation (1.3.4) is the principle of optimality for discrete-time systems. Its impor-
tance lies in the fact that it allows one to optimize over only one control vector at a
time by working backward in time.
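The backward-in-time use of (1.3.4) can be illustrated with a short finite-horizon sketch in which only one control is optimized at each stage, starting from the terminal stage. The scalar system, utility function, horizon, and grids below are hypothetical choices, and interpolation on a state grid stands in for an exact representation of the cost-to-go.

```python
# Sketch of finite-horizon dynamic programming by backward recursion,
# illustrating how (1.3.4) optimizes one control vector at a time.
# The system x_{k+1} = F(x_k, u_k) and utility U are hypothetical choices.

import numpy as np

N = 10                                   # horizon
x_grid = np.linspace(-2.0, 2.0, 81)      # discretized state space
u_grid = np.linspace(-1.0, 1.0, 21)      # discretized control space

def F(x, u):
    return 0.9 * x + 0.5 * u

def U(x, u):
    return x**2 + u**2

J = np.zeros_like(x_grid)                # terminal cost J_N(x) = 0
policy = []                              # near-optimal control law at each stage

for k in reversed(range(N)):
    J_new = np.empty_like(x_grid)
    u_opt = np.empty_like(x_grid)
    for i, x in enumerate(x_grid):
        # cost of each candidate control: U(x,u) + J_{k+1}(F(x,u))
        costs = [U(x, u) + np.interp(F(x, u), x_grid, J) for u in u_grid]
        j_min = int(np.argmin(costs))
        J_new[i], u_opt[i] = costs[j_min], u_grid[j_min]
    J, policy = J_new, [u_opt] + policy

print(J[len(x_grid) // 2])               # approximate optimal cost from x = 0
```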
Dynamic programming is a very useful tool in solving optimization and optimal
control problems. In particular, it can easily be applied to nonlinear systems with or
without constraints on the control and state variables. Equation (1.3.4) is called the
functional equation of dynamic programming or Bellman equation and is the basis
for computer implementation of dynamic programming. In the above, if the function
F in (1.3.1) and the cost function J in (1.3.2) are known, the solution for u∗ becomes
a simple optimization problem. However, it is often computationally untenable to
run true dynamic programming due to the backward numerical process required for
its solutions, i.e., as a result of the well-known “curse of dimensionality” [9, 23,
41]. The optimal cost function J ∗ , which is the theoretical solution to the Bellman
equation (1.3.4), is very difficult to obtain, except systems satisfying some very good
conditions such as linear time-invariant systems. Over the years, progress has been
made to circumvent the curse of dimensionality by building a system, called critic, to
approximate the cost function in dynamic programming. The idea is to approximate
dynamic programming solutions using a function approximation structure such as
NNs to approximate the cost function. Such an approach is called adaptive dynamic
programming in this book, though it was previously called adaptive critic designs or
approximate dynamic programming.
In 1977, Paul Werbos [124] introduced an approach for approximate dynamic pro-
gramming that was later called adaptive critic designs (ACDs). ACDs have received
increasing attention (cf. [2, 6, 7, 15, 21, 22, 36, 39, 50, 55, 65, 72, 73, 78, 79, 83,
90, 91, 104, 125–128, 137]). In the literature, there are several synonyms used for
“adaptive critic designs” including “approximate dynamic programming” [40, 76,
90, 128], “asymptotic dynamic programming” [83], “adaptive dynamic program-
ming” [72, 73, 139], “neuro-dynamic programming” [12], “neural dynamic pro-
gramming” [133], “relaxed dynamic programming” [46, 81], and “reinforcement
learning” [14, 40, 98, 106, 127]. No matter what we call it, in all these cases, the
goal is to approximate the solutions of dynamic programming. Because of this, the
term “approximate dynamic programming” has been quite popular in the past.
In this book, we will use the term ADP to represent “adaptive dynamic pro-
gramming” or “approximate dynamic programming.” ADP has potential applica-
tions in many fields, including controls, management, logistics, economy, military,
aerospace, etc. This book contains mostly applications in optimal control of nonlin-
ear systems. A typical design of ADP consists of three modules—critic, model, and
action [127, 128], as shown in Fig. 1.1. The critic network will give an estimation
of the cost function J, which is often a Lyapunov function, at least for some of the
deterministic systems.
The present book considers the case where each module is an NN (refer to, e.g.,
[5, 141, 145] for ADP implementations using fuzzy systems). In the ADP scheme
shown in Fig. 1.1, the critic network outputs the function Ĵ, which is an estimate
of the function J in (1.3.2). This is done by minimizing the following square error
measure over time
E_h = Σ_k E_k = (1/2) Σ_k [Ĵ_k − U_k − γ Ĵ_{k+1}]²,  (1.3.6)
where Ĵk = Ĵ(xk , Wc ) and Wc represents the parameters of the critic network. The
function Uk is the same utility function as the one in (1.3.2) which indicates the
performance of the overall system. The function Uk given in a problem is usually a
function of xk and uk , i.e., Uk = U(xk , uk ). When Ek = 0 for all k, (1.3.6) implies
that
Ĵ_k = U_k + γ Ĵ_{k+1}
    = U_k + γ (U_{k+1} + γ Ĵ_{k+2})
    = · · ·
    = Σ_{i=k}^{∞} γ^{i−k} U_i,  (1.3.7)
which is exactly the same as the cost function in (1.3.2). It is therefore clear that, by
minimizing the error function in (1.3.6), we will have an NN trained so that its
output Ĵ becomes an estimate of the cost function J defined in (1.3.2).
The model network in Fig. 1.1 learns the nonlinear function F given in equation
(1.3.1); it can be trained previously offline [79, 128], or trained in parallel with the
critic and action networks [83].
After the model network is obtained, the critic network will be trained. The critic
network gives an estimate of the cost function. The training of the critic network in
this case is achieved by minimizing the error function defined in (1.3.6), for which
many standard NN training algorithms can be utilized [29, 146]. Note that in Fig. 1.1,
the output of the critic network Ĵk+1 = Ĵ(x̂k+1 , Wc ) is an approximation to the cost
function J at time k + 1, where x̂k+1 is not a real trajectory but a prediction of the
states from the model network.
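As a simple illustration of this training step, the sketch below fits a linear-in-features critic Ĵ(x) = w^Tφ(x) by stochastic gradient descent on the squared error in (1.3.6), using a model prediction for x̂_{k+1}; the plant model, utility function, fixed control law, and feature map are hypothetical stand-ins rather than the networks described in the text.

```python
import numpy as np

# Sketch of critic training by minimizing (1.3.6) with a linear-in-features
# critic J_hat(x) = w^T phi(x). The plant model F_hat, utility U, control law,
# and feature map phi below are hypothetical choices.

GAMMA, LR = 0.95, 0.01

def F_hat(x, u):                 # model-network stand-in (assumed given here)
    return 0.9 * x + 0.5 * u

def U(x, u):                     # utility function
    return x**2 + u**2

def phi(x):                      # quadratic feature for a scalar state
    return np.array([x**2, 1.0])

w = np.zeros(2)                  # critic parameters W_c
rng = np.random.default_rng(0)

for _ in range(20000):
    x = rng.uniform(-1.0, 1.0)           # sampled state x_k
    u = -0.3 * x                         # some fixed control law u_k
    x_next = F_hat(x, u)                 # predicted state x_hat_{k+1}
    J_k = w @ phi(x)
    J_next = w @ phi(x_next)
    e = J_k - (U(x, u) + GAMMA * J_next)           # E_k in (1.3.6)
    # Gradient step on (1/2)*e^2 with the target U_k + gamma*J_next held fixed
    w -= LR * e * phi(x)

print(w)   # w[0]*x^2 + w[1] approximates the cost of the fixed control law
```

Holding the target U_k + γĴ_{k+1} fixed during each gradient step corresponds to the forward-in-time choice of training target discussed later in this section.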
After the critic network’s training is finished, one can start the action network’s
training with the objective of minimizing Uk + γ Ĵk+1 , through the use of the control
signal uk = u(xk , Wa ), where Wa represents the parameters of the action network.
Once an action network is trained this way, we will have an NN trained so that it will
generate as its output an optimal, or at least suboptimal, control signal, depending
on how good the performance of the critic network is. Recall that the goal of dynamic
programming is to obtain an optimal control sequence as in (1.3.5), which minimizes
the function J in (1.3.2). The key here is to interactively build a link between present
actions and future consequences via an estimate of the cost function.
After the action network’s training cycle is complete, one may check the system
performance, then stop or continue the training procedure by going back to the critic
network’s training cycle again, if the performance is not acceptable yet [78, 79].
This process will be repeated until an acceptable system performance is reached.
The three networks will be connected as shown in Fig. 1.1. As a part of the process,
the control signal uk will be applied to the external environment and obtain a new
state xk+1 . Meanwhile, the model network gives an approximation of the next state
x̂k+1 . By minimizing the error ‖xk+1 − x̂k+1 ‖, the model network can be trained.
The training of the action network is done through its parameter updates to mini-
mize the values of Uk + γ Ĵk+1 while keeping the parameters of the critic and model
networks fixed. The gradient information is propagated backward through the critic
network to the model network and then to the action network, as if the three networks
formed one large feedforward network (cf. Fig. 1.1). This implies that the model net-
work in Fig. 1.1 is required for the implementation of ADP in the present case. Even
in the case of known function F, one still needs to build a model network so that the
action network can be trained by backpropagation algorithm.
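To show how the gradient information flows backward through the critic and model to the action network, the sketch below works out the chain rule explicitly for a hypothetical linear-quadratic special case (linear model, quadratic critic and utility, linear action network); in the general ADP setting these derivatives would be obtained by backpropagation through the three NNs.

```python
import numpy as np

# Sketch of the action-network update: gradients of U(x,u) + gamma*J_hat(x_{k+1})
# are propagated back through a (fixed) critic and a (fixed) model to the action
# parameters. To keep the chain rule explicit, everything here is a hypothetical
# linear-quadratic special case: model x_{k+1} = A x + B u, critic J_hat(x) = x^T P x,
# action network u = -K x, and utility U = x^T Q x + u^T R u.

GAMMA, LR = 0.95, 0.01
A, B = np.array([[1.0, 0.1], [0.0, 1.0]]), np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])
P = np.eye(2) * 5.0                 # fixed critic parameters (held constant here)
K = np.zeros((1, 2))                # action-network parameters to be trained

rng = np.random.default_rng(1)
for _ in range(5000):
    x = rng.uniform(-1.0, 1.0, size=(2, 1))       # sampled state x_k
    u = -K @ x                                    # action-network output u_k
    x_next = A @ x + B @ u                        # model-network output x_hat_{k+1}
    # d(U + gamma*J_hat(x_next))/du, obtained by backpropagating through
    # the critic (2*P*x_next) and the model (B):
    dcost_du = 2.0 * R @ u + GAMMA * B.T @ (2.0 * P @ x_next)
    # Chain rule through u = -K x gives the gradient with respect to K.
    dK = dcost_du @ (-x.T)
    K -= LR * dK

print(K)    # trained feedback gain of the action network
```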
Two approaches for the training of critic network are provided in [64]: a forward-
in-time approach and a backward-in-time approach. Figure 1.2 shows the diagram of
forward-in-time approach. In this approach, we view Ĵk in (1.3.6) as the output of the
critic network to be trained and choose Uk + γ Ĵk+1 as the training target. Note that
Ĵk and Ĵk+1 are obtained using state variables at different time instances. Figure 1.3
shows the diagram of backward-in-time approach. In this approach, we view Ĵk+1 in
(1.3.6) as the output of the critic network to be trained and choose (Ĵk − Uk )/γ as
the training target. Note that both forward-in-time and backward-in-time approaches
try to minimize the error measure in (1.3.6) and satisfy the requirement in (1.3.7). In
Figs. 1.2 and 1.3, x̂k+1 is the output from the model network.
From the TD algorithm in (1.2.15), we can see that the learning objective is to
minimize |rt+1 + γ V (st+1 ) − V (st )| by using rt+1 + γ V (st+1 ) as the learning target.
This gives the same idea as in the forward-in-time approach shown in Fig. 1.2, where
the target is Uk + γ Ĵk+1 . The only difference is the definition of reward function.
In (1.2.15), it is defined as rt+1 = r(st , at , st+1 ), whereas in Fig. 1.2, it is defined as
Uk = U(xk , uk ), where the current times are t and k, respectively. We will make clear
later the reason behind this one-step time difference between rt+1 and Uk . Even in
TD(λ) given by (1.2.23), the same learning objective is utilized. However, in TD and
TD(λ), the update of value functions at each step only makes a move according to
the step size toward the target, and presumably, it does not reach the target. On the
other hand, in the forward-in-time and backward-in-time approaches, the training
will only be performed for a certain number of steps, e.g., 3–5 steps [39] or 50 steps
[91]. Such a move may or may not reach the target, but for sure will lead to a move
in the direction of the target.
The two most important advances of ADP for control start with [79] and [91]. Ref-
erence [79] provides a detailed summary of the major developments in ADP up to
1997. Before that, major references are papers by Werbos such as [124–128]. Refer-
ence [91] makes significant contributions to model-free ADP. Using the approach of
[91], the model network in Fig. 1.1 is not needed anymore. Several practical exam-
ples are included in [91] for demonstration which include single inverted pendulum
and triple inverted pendulum. The training approach of [91] can be considered as
a backward-in-time approach. Reference [64] is also about model-free ADP. It is
a model-free, action-dependent adaptive critic design since we can view the model
network and the critic network together as another NN, which we still call a critic
network, as illustrated in Fig. 1.4.
The model-free ADP has been called action-dependent ACDs by Werbos. Accord-
ing to [79, 128], ADP approaches were classified into several main schemes: heuris-
tic dynamic programming (HDP), action-dependent HDP (ADHDP; note the pre-
fix “action-dependent” (AD) used hereafter), dual heuristic dynamic programming
(DHP), ADDHP, globalized DHP (GDHP), and ADGDHP. HDP is the basic version
of ADP, which is described in Fig. 1.1 or the left side of Fig. 1.4. According to Werbos
[128], TD algorithms share the same idea as that of HDP to use the same learning
objective. On the other hand, the equivalence of ADHDP and Q-learning has been
argued by Werbos [128] as well. Both ADHDP and Q-learning (including Sarsa as
well) use value functions that are functions of the state and action. For ADHDP,
which is described in the right side of Fig. 1.4, the critic network training is done by minimizing
E_q = Σ_k E_{qk} = (1/2) Σ_k [Q_{k−1} − U_k − γ Q_k]²,  (1.3.8)
where Qk = Q(xk , uk , Wqc ) and Wqc represents the parameters of the critic network.
When Eqk = 0 for all k, (1.3.8) implies that
Q_k = U_{k+1} + γ Q_{k+1}
    = U_{k+1} + γ (U_{k+2} + γ Q_{k+2})
    = · · ·
    = Σ_{i=k+1}^{∞} γ^{i−k−1} U_i,  (1.3.9)
which is exactly the cost function Ĵk+1 defined in (1.3.7). From Fig. 1.4, we can see
that Qk = Q(xk , uk , Wqc ) is equivalent to Ĵk+1 = Ĵ(x̂k+1 , Wc ) = Ĵ(F̂(xk , uk ), Wc ).
The two outputs will be the same given the same inputs xk and uk . However, the two
relationships are different. In model-free ADP, the output Qk is explicitly a function
of xk and uk , while the model network F̂ becomes totally hidden and internal. On
the other hand, with model network, the output Ĵk+1 is an explicit function of x̂k+1
which in turn is a function of xk and uk through the model network F̂.
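A minimal sketch of such an action-dependent critic is given below: Q̂(x, u) = w^Tφ(x, u) is fitted from state-action data alone, with no model network, by driving Q_{k−1} toward U_k + γQ_k as in (1.3.8). The plant, utility, control law, exploration noise, and feature map are all hypothetical choices.

```python
import numpy as np

# Sketch of an action-dependent (ADHDP-style) critic Q_hat(x, u) = w^T phi(x, u),
# trained from state-action data only (no model network) by driving Q_{k-1}
# toward U_k + gamma*Q_k as in (1.3.8). The plant, utility, control law, and
# feature map below are hypothetical choices.

GAMMA, LR = 0.95, 0.01
rng = np.random.default_rng(0)

def plant(x, u):                 # unknown to the learner; only generates data
    return 0.9 * x + 0.5 * u

def U(x, u):
    return x**2 + u**2

def phi(x, u):                   # quadratic state-action features
    return np.array([x**2, x * u, u**2, 1.0])

def control(x):                  # fixed control law plus exploration noise
    return -0.3 * x + 0.1 * rng.standard_normal()

w = np.zeros(4)                  # critic parameters W_qc

for episode in range(2000):
    x_prev = rng.uniform(-1.0, 1.0)
    u_prev = control(x_prev)
    x = plant(x_prev, u_prev)
    for _ in range(10):
        u = control(x)
        # E_{q,k} from (1.3.8): Q_{k-1} - (U_k + gamma*Q_k)
        e = w @ phi(x_prev, u_prev) - (U(x, u) + GAMMA * (w @ phi(x, u)))
        w -= LR * e * phi(x_prev, u_prev)   # semi-gradient step, target held fixed
        x_prev, u_prev = x, u
        x = plant(x, u)

print(w)                         # Q_hat(x, u) = w @ phi(x, u)
```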
The one-step time difference here between Ĵk in (1.3.7) and Qk in (1.3.9) has
exactly the same reasoning behind the one-step time difference between the TD
algorithm’s rt+1 and the HDP algorithm’s Uk mentioned earlier. In the HDP structure
described above, a model network is used, which leads to the expression (1.3.7) for Ĵ_k, starting with U_k.
In the ADHDP structure, there is no model network, and thus the expression becomes (1.3.9), where Q_k starts with U_{k+1}.
As a matter of fact, for the RL system discussed in the TD methods, the value function
is defined without model network [97, 98]. Therefore, the value function V (st ) in
(1.2.15) is defined similarly to the function Qk above and it starts with rt+1 instead
of rt .
In addition to the basic structures reviewed above for ADP, there are also other
structures proposed in the literature, such as [30, 75].
A popular method to determine the cost function and the optimal cost function is
to use value iteration. In this case, a different set of notation has been used in the
literature. The cost function J μ defined above will be called value function V μ , i.e.,
V μ (xk ) = J μ (xk ), ∀xk . Similarly, V ∗ (x0 ) = inf μ V μ (x0 ) is called the optimal value
function.
Note that similar to the case of cost function J defined above, the value function
V will also have the following three forms. (i) V (xk , uk ) represents the value of
the cost function of system (1.3.1) starting at xk when the control sequence uk =
(uk , uk+1 , . . . ) is applied. (ii) V μ (xk ) tells the value of the cost function of system
(1.3.1) starting at xk when the control policy uk = μ(xk ) is applied. (iii) V ∗ (xk ) is
the optimal cost function of system (1.3.1) starting at xk . When the context is clear,
for convenience in this book, the notation V (xk ) will be used to represent V (xk , uk )
and V μ (xk ). Such a use of notation for value functions has been quite standard in the
literature. In subsequent chapters, there will also be cases where the value function
is a function of xk and the single control uk (not of the whole control sequence). Thus, V (xk , uk ) will
be appropriate. Also, there are cases where the value function is time-varying, such
that V (xk , uk , k) will be appropriate. As a convention, we will use V (xk ) to represent
V (xk , uk ) and V (xk , uk , k) when the context is clear.
As stated earlier in (1.2.11) and (1.2.13) as well as in (1.2.16), (1.2.18), and
(1.2.22), the Bellman equation can be solved using iterative approaches.
We rewrite the Bellman optimality equation (1.3.4) here,

J^*(x_k) = min_{u_k} { U(x_k, u_k) + γ J^*(x_{k+1}) }.  (1.3.10)
Our goal is to solve for a function J ∗ which satisfies (1.3.10) and which in turn leads
to the optimal control solution u_k^* given by (1.3.5), rewritten below,

u_k^* = arg min_{u_k} { U(x_k, u_k) + γ J^*(x_{k+1}) }.  (1.3.11)
One way to solve (1.3.10) is to use the following iterative approach. Replace the
function J ∗ in (1.3.10) using V , i.e., using a value function. Now, we need to solve
for function V from
V(x_k) = min_{u_k} { U(x_k, u_k) + γ V(x_{k+1}) }.  (1.3.12)

This can be done by the following iterative process:

V_i(x_k) = min_{u_k} { U(x_k, u_k) + γ V_{i−1}(x_{k+1}) }, i = 1, 2, . . . .  (1.3.13)
This is similar to solving the algebraic equation x = f (x) using an iterative method
from xi = f (xi−1 ). Starting from x0 , we use the iteration to obtain x1 = f (x0 ), x2 =
f (x1 ), . . . . We can get the solution as x∞ = f (x∞ ), if the iterative process is conver-
gent. For (1.3.13), one can start with a function V0 and the iteration gives V1 , V2 ,
and so on. One would hope that a solution V∞ (xk ) is obtained when i reaches ∞. Of
course, such a solution can only be obtained if the iterative process is convergent.
Using the procedure above, we would hope to obtain a solution which is also the
optimal solution as required by the solution of dynamic programming. During the
iterative solution process, corresponding to each iterative value function Vi obtained,
there will also be a control signal determined as
v_i(x_k) = arg min_{u_k} { U(x_k, u_k) + γ V_i(F(x_k, u_k)) }.  (1.3.14)
This sequence of control signals {v0 , v1 , . . . } is called the iterative control law
sequence. We would hope to obtain a sequence of stable control laws, or at least
a stable control law when it is optimal.
The above gives the rationale of iterative ADP based on value iteration. We need
theoretical results regarding qualitative analysis of the iterative solution including
stability, convergence, and optimality. Stability is the fundamental requirement of any
control system. Convergence of the iterative solution process is required in order for
the above procedure to be meaningful. Furthermore, only requiring the convergence
of {Vi } is not enough, since we will also need the iterative solutions to converge to
the optimal solution J ∗ .
The simplest choice of V0 to start the iteration is V0 (xk ) ≡ 0, ∀xk [3, 46, 81]. In
[3, 46, 81], undiscounted optimal control problems are considered, where γ = 1. In
this case, (1.3.13) becomes
V_i(x_k) = min_{u_k} { U(x_k, u_k) + V_{i−1}(x_{k+1}) }, i = 1, 2, . . . .  (1.3.15)
Proposition 1.3.1 (Convergence of value iteration [46, 81]) Suppose that, for
all xk and for all uk , the inequality J ∗ (F(xk , uk )) ≤ ρU(xk , uk ) holds uniformly
for some ρ < ∞ and that ηJ ∗ (xk ) ≤ V0 (xk ) ≤ J ∗ (xk ) for some 0 ≤ η ≤ 1. Then,
the sequence {Vi } defined iteratively by (1.3.15) approaches J ∗ according to the
inequalities
[1 + (η − 1)/(1 + ρ^{−1})^i] J^*(x_k) ≤ V_i(x_k) ≤ J^*(x_k), ∀x_k.  (1.3.16)
Proposition 1.3.1 clearly shows the convergence of value iteration, i.e., Vi (xk ) →
J ∗ (xk ) as i → ∞.
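The sketch below carries out the value iteration (1.3.15) with V_0(x_k) ≡ 0 on a state grid for a scalar nonlinear plant; the plant, utility function, and grids are hypothetical, and grid interpolation plays the role of the critic's function approximator.

```python
import numpy as np

# Sketch of the value iteration (1.3.15) with V_0(x) = 0, carried out on a
# state grid for a scalar nonlinear plant. The plant and utility below are
# hypothetical; interpolation stands in for a function approximator.

x_grid = np.linspace(-1.0, 1.0, 101)
u_grid = np.linspace(-1.0, 1.0, 51)

def F(x, u):                        # discrete-time nonlinear plant
    return 0.8 * np.sin(x) + 0.5 * u

def U(x, u):                        # utility function
    return x**2 + u**2

V = np.zeros_like(x_grid)           # V_0(x) = 0 for all x
for i in range(200):
    V_new = np.empty_like(x_grid)
    for j, x in enumerate(x_grid):
        V_new[j] = min(U(x, u) + np.interp(F(x, u), x_grid, V) for u in u_grid)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

# Greedy (near-optimal) control law recovered from the converged value function
def v(x):
    return min(u_grid, key=lambda u: U(x, u) + np.interp(F(x, u), x_grid, V))

print(V[50], v(0.5))                # approximate optimal cost at x = 0, control at x = 0.5
```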
On the other hand, the affine version of system (1.3.1) has been studied often,
which is given by

x_{k+1} = f(x_k) + g(x_k) u_k.

Starting from V_0(·) ≡ 0, the initial control law is obtained as

v_0(x_k) = arg min_{u_k} { U(x_k, u_k) + V_0(x_{k+1}) },

where V_0(x_{k+1}) = 0. Once the policy v_0 is determined, the next value function is
computed as

V_1(x_k) = U(x_k, v_0(x_k)) + V_0(f(x_k) + g(x_k) v_0(x_k)).
For continuous-time systems, the cost function J is also the key to dynamic pro-
gramming. By minimizing J, one gets the optimal cost function J ∗ , which is often
a Lyapunov function of the system. As a consequence of the Bellman’s principle of
optimality, J ∗ satisfies the Hamilton–Jacobi–Bellman (HJB) equation. But usually,
one cannot get the analytical solution of the HJB equation. Even to find an accurate
numerical solution is very difficult due to the so-called curse of dimensionality.
Continuous-time nonlinear systems can be described by

ẋ(t) = F(x(t), u(t)), t ≥ t_0,  (1.3.17)
where x ∈ Rn and u ∈ Rm are the state and the control vectors, and F(x, u) is a
continuous nonlinear system function. The cost in this case is defined as
J(x_0, u) = ∫_{t_0}^{∞} U(x(τ), u(τ)) dτ,
with nonnegative utility function U(x, u) ≥ 0, where x(t_0) = x_0. Bellman's prin-
ciple of optimality can also be applied to continuous-time systems. In this case, the
optimal cost

J^*(x(t)) = min_{u(t)} { J(x(t), u(t)) }, t ≥ t_0,

satisfies the Hamilton–Jacobi–Bellman (HJB) equation

0 = min_{u(t)} { U(x(t), u(t)) + (∂J^*(x)/∂x)^T F(x(t), u(t)) }.  (1.3.18)
The HJB equation in (1.3.18) can be derived from Bellman's principle of optimal-
ity (1.3.4) [41]. Meanwhile, the optimal control u∗ (t) will be the one that minimizes
the cost function,
u^*(t) = arg min_{u(t)} { J(x(t), u(t)) }, t ≥ t_0.  (1.3.19)
In 1994, Saridis and Wang [86] studied the nonlinear stochastic systems described
by
dx = f (x, t) dt + g(x, t)u dt + h(x, t)dw, t0 ≤ t ≤ T , (1.3.20)
where x ∈ Rn , u ∈ Rm , and w ∈ Rk are the state vector, the control vector, and a
separable Wiener process; f , g and h are measurable system functions; and Q and φ
are nonnegative functions. A value function V is defined as
V(x, t) = E{ ∫_t^T [Q(x, t) + u^T u] dt + φ(x(T), T) : x(t_0) = x_0 }, t ∈ I,
where I ≜ [t_0, T]. The HJB equation is modified to become the following equation

∂V/∂t + L_u V + Q(x, t) + u^T u = ∇V,  (1.3.21)
The benefit of the suboptimal control is that the bound V̄ of the optimal cost J^* can
be approximated by an iterative process. Beginning from certain chosen functions
u0 and V0 , let
u_i(x, t) = −(1/2) g^T(x, t) ∂V_{i−1}(x, t)/∂x, i = 1, 2, . . . .  (1.3.22)
Then, by repeatedly applying (1.3.21) and (1.3.22), one will get a sequence of func-
tions Vi . This sequence {Vi } will converge to the bound V̄ (or V̲) of the cost function
J ∗ . Consequently, ui will approximate the optimal control when i tends to ∞. It is
important to note that the sequences {Vi } and {ui } are obtainable by computation and
they approximate the optimal cost and the optimal control law, respectively.
Some further theoretical results for ADP have been obtained in [72, 73]. These
works investigated the stability and optimality for some special cases of ADP. In
[72, 73], Murray et al. studied the (deterministic) continuous-time affine nonlinear
systems
ẋ = f (x) + g(x)u, x(t0 ) = x0 , (1.3.23)
with the cost function

J(x_0, u) = ∫_{t_0}^{∞} U(x(τ), u(τ)) dτ,  (1.3.24)

where U(x, u) = Q(x) + u^T R(x) u, Q(x) > 0 for x ≠ 0 and Q(0) = 0, and R(x) > 0
for all x. Similar to [86], an iterative procedure is proposed to find the control law as
follows. For the plant (1.3.23) and the cost function (1.3.24), the HJB equation leads
to the following optimal control law
u^*(x) = −(1/2) R^{−1}(x) g^T(x) dJ^*(x)/dx.  (1.3.25)
Applying (1.3.24) and (1.3.25) repeatedly, one will get sequences of estimations of
the optimal cost function J ∗ and the optimal control u∗ . Starting from an initial sta-
bilizing control v0 (x), for i = 0, 1, . . . , the approximation is given by the following
iterations between value functions
V_{i+1}(x) = ∫_t^{∞} U(x(τ), v_i(x(τ))) dτ

and control laws

v_{i+1}(x) = −(1/2) R^{−1}(x) g^T(x) dV_{i+1}(x)/dx.

The following properties were established in [72, 73]:
(1) The sequence of functions {Vi } obtained above converges to the optimal cost
function J ∗ .
(2) Each of the control laws vi+1 obtained above stabilizes the plant (1.3.23), for all
i = 0, 1, . . . .
(3) Each of the value functions Vi+1 (x) is a Lyapunov function of the plant, for all
i = 0, 1, . . . .
Abu-Khalaf and Lewis [1] also studied the system (1.3.23) with the following
value function
V(x(t)) = ∫_t^{∞} U(x(τ), u(τ)) dτ = ∫_t^{∞} [x^T(τ) Q x(τ) + u^T(τ) R u(τ)] dτ,

where Q and R are positive definite matrices. The iterative control laws are computed as

v_{i+1}(x) = −(1/2) R^{−1} g^T(x) ∇V_i(x),
where ∇Vi (x) = ∂Vi (x)/∂x. In [1], the above iterative approach was applied to sys-
tems (1.3.23) with saturating actuators through a modified utility function, with
convergence and optimality proofs showing that Vi → J ∗ and vi → u∗ , as i → ∞.
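The flavor of these continuous-time iterations can be seen in a scalar linear-quadratic special case, where each value function has the closed form V_i(x) = p_i x² and both steps of the iteration can be written in one line each. The numbers below are hypothetical, and the Riccati solution is computed only to check that p_i and the feedback gain converge to the optimal values.

```python
# Sketch of the policy iteration described above, specialized to a scalar
# linear system x_dot = a*x + b*u with U = q*x^2 + r*u^2, where each value
# function has the closed form V_i(x) = p_i*x^2. All numbers are hypothetical.

a, b, q, r = 1.0, 1.0, 1.0, 1.0

k = 2.0                     # initial stabilizing gain, u = -k*x (needs a - b*k < 0)
for i in range(20):
    # Policy evaluation: V_i(x) = p*x^2 satisfies
    # 2*p*(a - b*k) + q + r*k**2 = 0 along the closed loop.
    p = (q + r * k**2) / (2.0 * (b * k - a))
    # Policy improvement: v_{i+1}(x) = -(1/2)*R^{-1}*g^T*dV_i/dx = -(b*p/r)*x.
    k = b * p / r

# For comparison, the algebraic Riccati equation 2*a*p - (b**2/r)*p**2 + q = 0
# gives the optimal cost J*(x) = p_star*x^2.
p_star = (a + (a**2 + b**2 * q / r)**0.5) * r / b**2
print(p, k, p_star)        # p and k converge to p_star and the optimal gain
```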
For continuous-time optimal control problems, attempts have been going on for a
long time in the quest for successive solutions to the HJB equation. Published works
can date back to as early as 1967 by Leake and Liu [37]. The brief overview presented
here only serves as a beginning of many more recent results [1, 32, 33, 45, 49, 51,
52, 56, 57, 62, 69, 70, 105, 107, 134–136, 140].
1.3.4 Remarks
Most of the early applications of ADP are in the areas of aircraft flight control
[6, 21, 78] and missile guidance [22, 28]. Some other applications have also been
reported, e.g., for ship steering [55], in power systems [99, 104], in communication
networks [65, 67], in engine control [36, 50], and for locomotive planning [77].
The most successful application in industry is perhaps the fleet management and
truckload operation problems as reported by Forbes Magazine [19]. Warren Powell
from Princeton University has been working with Snyder International [94, 95], one
of the largest freight haulers in the USA, in order to find more efficient ways to
plan routes for parcels and freight. Applications reported in this book are related to
22 1 Overview of Adaptive Dynamic Programming
optimal control approaches in the areas of energy management in smart homes [31,
119, 120], coal gasification process [116], and water gas shift reaction system [118].
Newcomers to the field of ADP should first take a look at the challenging control
problems listed in [4]. Interested readers should also read reference [39], especially
the proposed training strategies for the critic network and the action network. There
are also several good survey papers to read, e.g., [42, 43, 48, 53, 110].
Paul Werbos, who is the inventor of the backpropagation algorithm and adaptive
critic designs, has often talked about brain-like intelligence. He has pointed out that
“ADP may be the only approach that can achieve truly brain-like intelligence” [85,
125, 129, 131]. More and more evidence has accumulated, suggesting that optimality
is an organizing principle for understanding brain intelligence [129–131]. There has
been a great interest in brain research around the world in recent years. We would
certainly hope ADP can make contributions to brain research in general and to brain-
like intelligence in particular. On the other hand, with more and more advances
in the understanding of brain-learning functions, new ADP algorithms can then be
developed.
Deep reinforcement learning has been of great interest lately. With the current
trends in deep learning, big data, artificial intelligence, as well as cyber-physical
systems and Internet of things, we believe that ADP will have a bright future. There
are still many pending issues to be solved, and most of them are related to obtaining
good approximations to solutions of dynamic programming with less computation.
Deep reinforcement learning is able to output control signal directly based on input
images, which incorporates both the advantages of the perception of deep learning
and the decision-making of reinforcement learning. This mechanism makes artificial
intelligence much closer to human thinking. Combining deep learning with rein-
forcement learning/ADP will help us construct systems with more intelligence
and attain a higher level of brain-like intelligence.
1.4 Related Books

There have been a few books published on the topics of reinforcement learning and
adaptive dynamic programming. A quick overview of these books will be given in
this section.
In their book published in 1996 [12], Bertsekas and Tsitsiklis give an overview of
neuro-dynamic programming. The book draws on the theory of function approxima-
tion, iterative optimization, neural network training, and dynamic programming. It
provides the background, gives a detailed introduction to dynamic programming, dis-
cusses the neural network architectures and methods for training them, and develops
general convergence theorems for stochastic approximation methods as the founda-
tion for the analysis of various neuro-dynamic programming algorithms. It aims at
explaining with mathematical analysis, examples, speculative insight, and case stud-
ies, a number of computational ideas and phenomena that collectively can provide
the foundation for understanding and applying the methodology of neuro-dynamic programming.
The book by Powell on approximate dynamic programming aims at providing practical
and high-quality solutions to problems that involve making decisions in the presence
of uncertainty. The book integrates the disciplines of Markov decision processes, math-
ematical programming, simulation, and statistics, to demonstrate how to successfully
model and solve a wide range of real-life problems using the idea of approximate
dynamic programming (ADP). It starts with a simple introduction using a discrete
representation of states. The background of dynamic programming and Markov deci-
sion processes is given, and meanwhile the phenomenon of the curse of dimensional-
ity is discussed. A detailed description on how to model a dynamic program and some
important algorithmic strategies are presented next. The most important dimensions
of ADP, i.e., modeling real applications, the interface with stochastic approximation
methods, techniques for approximating general value functions, and a more in-depth
presentation of ADP algorithms for finite- and infinite-horizon applications are pro-
vided, respectively. Several specific problems, including information acquisition and
resource allocation, and algorithms that arise in this setting are introduced in the third
part. The well-known exploration versus exploitation problem is proposed to discuss
how to visit a state. These applications bring out the richness of ADP techniques. In
summary, it models complex, high-dimensional problems in a natural and practical
way; introduces and emphasizes the power of estimating a value function around the
post-decision state; and presents a thorough discussion of recursive estimation. It is
shown in this book that ADP is an accessible introduction to dynamic modeling and
is also a valuable guide for the development of high-quality solutions to problems
that exist in operations research and engineering.
The book by Busoniu, et al. [14] provides an accessible in-depth treatment of
dynamic programming (DP) and reinforcement learning (RL) methods using func-
tion approximators. Even though DP and RL are methods for solving problems where
actions are applied to a dynamical plant to achieve a desired goal, the former requires
a model of the systems behavior while the latter does not since it works using only
data obtained from the system. However, a core obstacle of them lies in that the
solutions cannot be represented exactly for problems with large discrete state-action
spaces or continuous spaces. As a result, compact representations relying on func-
tion approximators must be constructed. The book adopts a control-theoretic point of
view, employing control-theoretical notation and terminology and choosing control
systems as examples to illustrate the behavior of DP and RL algorithms. It starts
with introducing the basic problems and their solutions, the representative classical
algorithms, and the behavior of some algorithms via examples with discrete states
and actions. Then, it gives an extensive account of DP and RL methods with function
approximation, which are applicable to large- and continuous-space problems. The
three major classes of algorithms, including value iteration, policy iteration, and pol-
icy search, are presented, respectively. Next, a value iteration algorithm with fuzzy
approximation is discussed, and an extensive theoretical analysis of this algorithm is
given to illustrate how convergence and consistency guarantees can be developed to
perform approximate DP. Moreover, an algorithm for approximate policy iteration is
studied and an online version is also developed, in order to emphasize the important
issues of online RL. Finally, a policy search approach relying on the cross-entropy
method for optimization is described, which highlights the possibility of developing
1.5 About This Book
This book covers some of the most recent developments in adaptive dynamic programming
(ADP). After Derong Liu landed his first academic job in 1995, he was exposed to
ADP at the suggestion of Paul Werbos. Within four years, he was lucky enough to
receive an NSF CAREER Award, for a project on ADP for network traffic control.
He started publishing papers in ADP in the same year [55], with publications ranging
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Al-Tamimi A, Abu-Khalaf M, Lewis FL (2007) Adaptive critic designs for discrete-time
zero-sum games with application to H∞ control. IEEE Trans Syst Man Cybern Part B Cybern
37(1):240–247
3. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part
B Cybern 38(4):943–949
4. Anderson CW, Miller WT III (1990) Challenging control problems. In: Miller WT III, Sutton
RS, Werbos PJ (eds) Neural networks for control (Appendix). MIT Press, Cambridge, MA
5. Bai X, Zhao D, Yi J (2009) Coordinated multiple ramps metering based on neuro-fuzzy
adaptive dynamic programming. In: Proceedings of the international joint conference on
neural networks, pp 241–248
6. Balakrishnan SN, Biega V (1996) Adaptive-critic-based neural networks for aircraft optimal
control. AIAA J Guid Control Dyn 19:893–898
7. Barto AG (1992) Reinforcement learning and adaptive critic methods. In: White DA, Sofge
DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive approaches (chapter
12). Van Nostrand Reinhold, New York
8. Baudis P, Gailly JL (2012) PACHI: state of the art open source Go program. In: Advances in
computer games (Lecture notes in computer science), vol 7168. pp 24–38
9. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton, NJ
10. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspec-
tives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
11. Bertsekas DP (2005) Dynamic programming and optimal control. Athena Scientific, Belmont,
MA
12. Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont,
MA
13. Buro M (1998) From simple features to sophisticated evaluation functions. In: Proceedings
of the international conference on computers and games (Lecture notes in computer science),
vol 1558. pp 126–145
14. Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement learning and dynamic
programming using function approximators. CRC Press, Boca Raton, FL
15. Cai X, Wunsch DC (2001) A parallel computer-Go player, using HDP method. In: Proceedings
of the international joint conference on neural networks, pp 2373–2375
16. Campbell M, Hoane AJ, Hsu FH (2002) Deep blue. Artif Intell 134(1–2):57–83
17. Chen XW, Lin X (2014) Big data deep learning: challenges and perspectives. IEEE Access
2:514–525
18. Clark C, Storkey AJ (2015) Training deep convolutional neural networks to play Go. In:
Proceedings of the international conference on machine learning, pp 1766–1774
19. Coster H (2011) Schneider National uses data to survive a bumpy economy. Forbes, 12 Sept
2011
20. Coulom R (2007) Computing Elo ratings of move patterns in the game of Go. ICGA J
30(4):198–208
21. Cox C, Stepniewski S, Jorgensen C, Saeks R, Lewis C (1999) On the design of a neural
network autolander. Int J Robust Nonlinear Control 9:1071–1096
22. Dalton J, Balakrishnan SN (1996) A neighboring optimal adaptive critic for missile guidance.
Math Comput Model 23:175–188
23. Dreyfus SE, Law AM (1977) The art and theory of dynamic programming. Academic Press,
New York
24. Enzenberger M (2004) Evaluation in Go by a neural network using soft segmentation. In:
Advances in computer games - many games, many challenges (Proceedings of the advances
in computer games conference), pp 97–108
25. Fu ZP, Zhang YN, Hou HY (2014) Survey of deep learning in face recognition. In: Proceedings
of the IEEE international conference on orange technologies, pp 5–8
26. Ghesu FC, Krubasik E, Georgescu B, Singh V, Zheng Y, Hornegger J, Comaniciu D (2016)
Marginal space deep learning: efficient architecture for volumetric image parsing. IEEE Trans
Med Imaging 35(5):1217–1228
27. Gosavi A (2009) Reinforcement learning: a tutorial survey and recent advances. INFORMS
J Comput 21(2):178–192
28. Han D, Balakrishnan SN (2002) State-constrained agile missile control with adaptive-critic-
based neural networks. IEEE Trans Control Syst Technol 10(4):481–489
29. Haykin S (2009) Neural networks and learning machines, 3rd edn. Prentice-Hall, Upper
Saddle River, NJ
30. He H, Ni Z, Fu J (2012) A three-network architecture for on-line learning and optimization
based on adaptive dynamic programming. Neurocomputing 78(1):3–13
31. Huang T, Liu D (2013) A self-learning scheme for residential energy system control and
management. Neural Comput Appl 22(2):259–269
32. Jiang Y, Jiang ZP (2012) Robust adaptive dynamic programming for large-scale systems with
an application to multimachine power systems. IEEE Trans Circuits Syst II: Express Briefs
59(10):693–697
33. Jiang Y, Jiang ZP (2013) Robust adaptive dynamic programming with an application to power
systems. IEEE Trans Neural Netw Learn Syst 24(7):1150–1156
34. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell
Res 4:237–285
35. Konoplich GV, Putin EO, Filchenkov AA (2016) Application of deep learning to the problem
of vehicle detection in UAV images. In: Proceedings of the IEEE international conference on
soft computing and measurements, pp 4–6
36. Kulkarni NV, KrishnaKumar K (2003) Intelligent engine control using an adaptive critic.
IEEE Trans Control Syst Technol 11:164–173
37. Leake RJ, Liu RW (1967) Construction of suboptimal control sequences. SIAM J Control
5(1):54–63
38. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
39. Lendaris GG, Paintz C (1997) Training strategies for critic and action neural networks in
dual heuristic programming method. In: Proceedings of the IEEE international conference on
neural networks, pp 712–717
40. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken, NJ
41. Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
42. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Systems Mag 9(3):32–50
43. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst
Mag 32(6):76–105
44. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
45. Li H, Liu D, Wang D (2014) Integral reinforcement learning for linear continuous-time zero-
sum games with completely unknown dynamics. IEEE Trans Autom Sci Eng 11(3):706–714
46. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
47. Littman ML (2015) Reinforcement learning improves behaviour from evaluative feedback.
Nature 521:445–451
48. Liu D (2005) Approximate dynamic programming for self-learning control. Acta Autom Sin
31(1):13–18
49. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
50. Liu D, Javaherian H, Kovalenko O, Huang T (2008) Adaptive critic learning techniques
for engine torque and air-fuel ratio control. IEEE Trans Syst Man Cybern Part B Cybern
38(4):988–993
51. Liu D, Li C, Li H, Wang D, Ma H (2015) Neural-network-based decentralized control of
continuous-time nonlinear interconnected systems with unknown dynamics. Neurocomputing
165:90–98
52. Liu D, Li H, Wang D (2014) Online synchronous approximate optimal learning algorithm for
multiplayer nonzero-sum games with unknown dynamics. IEEE Trans Syst Man Cybern Syst
44(8):1015–1027
53. Liu D, Li H, Wang D (2013) Data-based self-learning optimal control: research progress and
prospects. Acta Autom Sin 39(11):1858–1870
54. Liu D, Li H, Wang D (2015) Error bounds of adaptive dynamic programming algorithms
for solving undiscounted optimal control problems. IEEE Trans Neural Netw Learn Syst
26(6):1323–1334
55. Liu D, Patino HD (1999) A self-learning ship steering controller based on adaptive critic
designs. In: Proceedings of the IFAC triennial world congress, pp 367–372
56. Liu D, Wang D, Li H (2014) Decentralized stabilization for a class of continuous-time non-
linear interconnected systems using online learning optimal control approach. IEEE Trans
Neural Netw Learn Syst 25(2):418–428
57. Liu D, Wang D, Wang FY, Li H, Yang X (2014) Neural-network-based online HJB solution
for optimal robust guaranteed cost control of continuous-time uncertain nonlinear systems.
IEEE Trans Cybern 44(12):2834–2847
58. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
59. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
60. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
61. Liu D, Wei Q (2014) Policy iterative adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
62. Liu D, Wei Q (2014) Multi-person zero-sum differential games for a class of uncertain non-
linear systems. Int J Adapt Control Signal Process 28(3–5):205–231
63. Liu D, Wei Q, Yan P (2015) Generalized policy iteration adaptive dynamic programming for
discrete-time nonlinear systems. IEEE Trans Syst Man Cybern Syst 45(12):1577–1591
64. Liu D, Xiong X, Zhang Y (2001) Action-dependent adaptive critic designs. In: Proceedings
of the international joint conference on neural networks, pp 990–995
65. Liu D, Zhang Y, Zhang H (2005) A self-learning call admission control scheme for CDMA
cellular networks. IEEE Trans Neural Netw 16(5):1219–1228
66. Maddison CJ, Huang A, Sutskever I, Silver D (2015) Move evaluation in Go using deep con-
volutional neural networks. In: The 3rd international conference on learning representations.
http://arxiv.org/abs/1412.6564
67. Marbach P, Mihatsch O, Tsitsiklis JN (2000) Call admission control and routing in integrated
service networks using neuro-dynamic programming. IEEE J Sel Areas Commun 18(2):197–
208
68. Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement
learning. Nature 518:529–533
69. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural
Netw Learn Syst 24(10):1513–1525
70. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50:193–202
71. Moyer C (2016) How Google’s AlphaGo beat a Go world champion. The Atlantic, 28 Mar
2016
72. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE
Trans Syst Man Cybern Part C Appl Rev 32(2):140–153
73. Murray JJ, Cox CJ, Saeks RE (2003) The adaptive dynamic programming theorem. In: Liu
D, Antsaklis PJ (eds) Stability and control of dynamical systems with applications (chapter
19). Birkhäuser, Boston
74. Nguyen HD, Le AD, Nakagawa M (2015) Deep neural networks for recognizing online
handwritten mathematical symbols. In: Proceedings of the IAPR Asian conference on pattern
recognition, pp 121–125
75. Padhi R, Unnikrishnan N, Wang X, Balakrishnan SN (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
76. Powell WB (2007) Approximate dynamic programming: solving the curses of dimensionality.
Wiley, Hoboken, NJ
77. Powell WB, Bouzaiene-Ayari B, Lawrence C et al (2014) Locomotive planning at Nor-
folk Southern: an optimizing simulator using approximate dynamic programming. Interfaces
44(6):567–578
78. Prokhorov DV, Santiago RA, Wunsch DC (1995) Adaptive critic designs: a case study for
neurocontrol. Neural Netw 8:1367–1372
79. Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8:997–
1007
80. Qiu J, Wu Q, Ding G, Xu Y, Feng S (2016) A survey of machine learning for big data
processing. EURASIP J Adv Signal Process. doi:10.1186/s13634-016-0355-x
81. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc Control
Theory Appl 153(5):567–574
82. Rummery GA, Niranjan M (1994) On-line Q-learning using connectionist systems. Technical
Report CUED/F-INFENG/TR 166. Engineering Department, Cambridge University, UK
83. Saeks RE, Cox CJ, Mathia K, Maren AJ (1997) Asymptotic dynamic programming: prelim-
inary concepts and results. In: Proceedings of the IEEE international conference on neural
networks, pp 2273–2278
110. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
111. Wang FY, Zhang JJ, Zheng X et al (2016) Where does AlphaGo go: from Church-Turing
thesis to AlphaGo thesis and beyond. IEEE/CAA J Autom Sin 3(2):113–120
112. Watkins CJCH (1989) Learning from delayed rewards. Ph.D. Thesis, Cambridge University,
UK
113. Watkins CJCH, Dayan P (1992) Q-learning. Mach Learn 8:279–292
114. Wei Q, Liu D (2013) Numerical adaptive learning control scheme for discrete-time non-linear
systems. IET Control Theory Appl 7(11):1472–1486
115. Wei Q, Liu D (2014) A novel iterative θ-adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
116. Wei Q, Liu D (2014) Adaptive dynamic programming for optimal tracking control of
unknown nonlinear systems with application to coal gasification. IEEE Trans Autom Sci
Eng 11(4):1020–1036
117. Wei Q, Liu D (2014) Stable iterative adaptive dynamic programming algorithm with approx-
imation errors for discrete-time nonlinear systems. Neural Comput Appl 24(6):1355–1367
118. Wei Q, Liu D (2014) Data-driven neuro-optimal temperature control of water-gas shift reaction
using stable iterative adaptive dynamic programming. IEEE Trans Ind Electron 61(11):6399–
6408
119. Wei Q, Liu D, Shi G (2015) A novel dual iterative Q-learning method for optimal battery
management in smart residential environments. IEEE Trans Ind Electron 62(4):2509–2518
120. Wei Q, Liu D, Shi G, Liu Y (2015) Multibattery optimal coordination control for home energy
management systems via distributed iterative adaptive dynamic programming. IEEE Trans
Ind Electron 62(7):4203–4214
121. Wei Q, Liu D, Xu Y (2014) Neuro-optimal tracking control for a class of discrete-time non-
linear systems via generalized value iteration adaptive dynamic programming. Soft Comput
20(2):697–706
122. Wei Q, Liu D, Yang X (2015) Infinite horizon self-learning optimal control of nonaffine
discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 26(4):866–879
123. Wei Q, Wang FY, Liu D, Yang X (2014) Finite-approximation-error-based discrete-time iter-
ative adaptive dynamic programming. IEEE Trans Cybern 44(12):2820–2833
124. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. Gen Syst Yearb 22:25–38
125. Werbos PJ (1987) Building and understanding adaptive systems: a statistical/numerical
approach to factory automation and brain research. IEEE Trans Syst Man Cybern SMC
17(1):7–20
126. Werbos PJ (1990) Consistency of HDP applied to a simple reinforcement learning problem.
Neural Netw 3:179–189
127. Werbos PJ (1990) A menu of designs for reinforcement learning over time. In: Miller WT,
Sutton RS, Werbos PJ (eds) Neural networks for control (chapter 3). MIT Press, Cambridge,
MA
128. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural mod-
eling. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and
adaptive approaches (chapter 13). Van Nostrand Reinhold, New York
129. Werbos PJ (2007) Using ADP to understand and replicate brain intelligence: the next level
design. In: Proceedings of the IEEE symposium on approximate dynamic programming and
reinforcement learning, pp 209–216
130. Werbos PJ (2008) ADP: the key direction for future research in intelligent control and under-
standing brain intelligence. IEEE Trans Syst Man Cybern Part B Cybern 38(4):898–900
131. Werbos PJ (2009) Intelligence in the brain: a theory of how it works and how to build it.
Neural Netw 22(3):200–212
132. Yan P, Wang D, Li H, Liu D (2016) Error bound analysis of Q-function for discounted
optimal control problems with policy iteration. IEEE Trans Syst Man Cybern Syst. doi:10.
1109/TSMC.2016.2563982
133. Yang L, Enns R, Wang YT, Si J (2003) Direct neural dynamic programming. In: Liu D,
Antsaklis PJ (eds) Stability and control of dynamical systems with applications (chapter 10).
Birkhäuser, Boston
134. Yang X, Liu D, Huang Y (2013) Neural-network-based online optimal control for uncer-
tain non-linear continuous-time systems with control constraints. IET Control Theory Appl
7(17):2037–2047
135. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of
unknown continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–
566
136. Yang X, Liu D, Wei Q (2014) Online approximate optimal control for affine non-linear systems
with unknown internal dynamics using adaptive dynamic programming. IET Control Theory
Appl 8(16):1676–1688
137. Zaman R, Prokhorov D, Wunsch DC (1997) Adaptive critic design in learning to play game
of Go. In: Proceedings of the international conference on neural networks, pp 1–4
138. Zaman R, Wunsch DC (1999) TD methods applied to mixture of experts for learning 9×9 Go
evaluation function. In: Proceedings of the international joint conference on neural networks,
pp 3734–3739
139. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algo-
rithms and stability. Springer, London
140. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
141. Zhang H, Zhang J, Yang GH, Luo Y (2015) Leader-based optimal coordination control for the
consensus problem of multiagent differential games via fuzzy adaptive dynamic programming.
IEEE Trans Fuzzy Syst 23(1):152–163
142. Zhao DB, Shao K, Zhu YH et al (2016) Review of deep reinforcement learning and discussions
on the development of computer Go. Control Theory Appl 33(6):701–717
143. Zhao Q, Xu H, Jagannathan S (2014) Near optimal output feedback control of nonlinear
discrete-time systems based on reinforcement neural network learning. IEEE/CAA J Autom
Sin 1(4):372–384
144. Zhong X, He H, Zhang H, Wang Z (2014) Optimal control for unknown discrete-time nonlinear
Markov jump systems using adaptive dynamic programming. IEEE Trans Neural Netw Learn
Syst 25(12):2141–2155
145. Zhu Y, Zhao D, He H (2012) Integration of fuzzy controller with adaptive dynamic pro-
gramming. In: Proceedings of the world congress on intelligent control and automation, pp
310–315
146. Zurada JM (1992) Introduction to artificial neural systems. West, St. Paul, MN
Part I
Discrete-Time Systems
Chapter 2
Value Iteration ADP for Discrete-Time
Nonlinear Systems
2.1 Introduction
Nonlinear optimal control has been a focus of the control field for many decades
[7, 10, 23, 39]. It often requires solving the nonlinear Bellman equation, which
is more difficult to work with than the Riccati equation because it involves
solving nonlinear partial difference equations. Although dynamic programming has
been a useful technique in handling optimal control problems for nonlinear systems,
it is often computationally untenable to run it to obtain the optimal solutions
because of the well-known “curse of dimensionality” [9, 14]. Fortunately, relying
on the strong abilities of self-learning and adaptivity of artificial neural networks
(ANNs), the ADP method was proposed by Werbos [46, 47] to deal with optimal
control problems forward-in-time. In recent years, ADP and related research have
gained much attention from scholars (see the recent books [22, 40, 50] and the
references cited therein).
It is important to note that iterative methods are often used in ADP to obtain the
solution of the Bellman equation indirectly, and they have received more and more attention. In
[24], iterative ADP algorithms were classified into two main schemes, namely policy
iteration (PI) and value iteration (VI) [38, 40]. PI algorithms contain
policy evaluation and policy improvement [18, 38, 40]; an initial stabilizing control
law is required, which is often difficult to obtain. Compared with VI algorithms, PI typically
requires fewer iterations (it behaves like a Newton's method), but each
iteration is more computationally demanding. VI algorithms solve the optimal control
problem without requiring an initial stabilizing control law and are therefore easy to
implement. However, a stabilizing control law cannot be obtained until the value
function converges. This means that only the converged optimal control (a function
of the system state xk), u∗(xk), can be used to control the nonlinear system, while
the iterative controls vi(xk), i = 0, 1, . . ., may be invalid. Hence, the computational
efficiency of the VI ADP method is low. Besides, most VI algorithms are
implemented off-line, which greatly limits their applications. In this chapter, the
Assumption 2.2.1 F(0, 0) = 0, and the state feedback control law u(·) satisfies
u(0) = 0, i.e., xk = 0 is an equilibrium state of system (2.2.1) under the control
uk = 0.
Assumption 2.2.3 System (2.2.1) is controllable in the sense that there exists a
continuous control law on Ω that asymptotically stabilizes the system.
First, in Sects. 2.2.1 and 2.2.2, we develop a GVI-based optimal control scheme
for discrete-time nonlinear systems with affine form [25]. Consider the following
affine nonlinear system
xk+1 = f(xk) + g(xk)uk, (2.2.2)
where f (·) ∈ Rn and g(·) ∈ Rn×m are differentiable and f (0) = 0. Our goal is to
find a state feedback control law u(·) such that uk = u(xk ) can stabilize the system
(2.2.2) and simultaneously minimize the infinite-horizon cost function given by
J(x0, u) = J^u(x0) = Σ_{k=0}^{∞} U(xk, uk), (2.2.3)
where U(xk, uk) is a positive-definite utility function, i.e., U(0, 0) = 0 and U(xk, uk) > 0 for all
(xk, uk) ≠ (0, 0). Note that the control law u(·) must not only stabilize
the system on Ω but also guarantee (2.2.3) to be finite, i.e., the control law must be
admissible.
Definition 2.2.1 (cf. [5, 51]) A control law u(·) is said to be admissible with respect
to (2.2.2) (or (2.2.1)) on Ω if u(·) is continuous on Ω, u(0) = 0, uk = u(xk ) stabilizes
(2.2.2) (or (2.2.1)) on Ω, and J(x0 , u) is finite, ∀x0 ∈ Ω.
Let A (Ω) be the set of admissible control laws associated with the controllable
set Ω of states. For optimal control problems we study in this book, the set A (Ω)
is assumed to be nonempty, i.e., A(Ω) ≠ ∅.
Define the optimal cost function as J∗(xk) = min_{u(·)∈A(Ω)} J(xk, u).
According to [9, 11, 14, 23], the optimal cost function J ∗ (xk ) satisfies the Bellman
equation
J∗(xk) = min_{uk} {U(xk, uk) + J∗(xk+1)}. (2.2.4)
Solving (2.2.4) by dynamic programming requires computing the optimal control one
vector at a time by working backward in time. The optimal control law u∗(·) should
satisfy
uk∗ = u∗(xk) = arg min_{uk} {U(xk, uk) + J∗(xk+1)}. (2.2.5)
In general, the utility function can be chosen as the quadratic form given by
U(xk, uk) = xk^T Q xk + uk^T R uk, (2.2.6)
where Q ∈ Rn×n and R ∈ Rm×m are positive-definite matrices. The optimal control
uk∗ satisfies the first-order necessary condition, from which we obtain
uk∗ = −(1/2) R^{−1} (∂xk+1/∂uk)^T ∂J∗(xk+1)/∂xk+1 = −(1/2) R^{−1} g^T(xk) ∂J∗(xk+1)/∂xk+1.
Equation (2.2.4) reduces to Riccati equation in the case of linear quadratic regulator
problem. However, in the nonlinear case, the cost function of the optimal control
problem cannot be obtained directly. Therefore, we will solve the Bellman equation
by the GVI algorithm.
V (xk ) = J u (xk ).
= xkT Qxk + viT (xk )Rvi (xk ) + Vi (f (xk ) + g(xk )vi (xk )). (2.2.9)
an iteration. The iteration goes from top to bottom within each column and from the
bottom block to the top block in the next column, as shown in Fig. 2.1.
Similarly, the ADP algorithm in (2.2.10)–(2.2.12) can be described by Table 2.2.
Comparing the two tables, one can see that they contain exactly the same
iterations, except that Table 2.2 does not start in the very first block.
Note that i is the iteration index and k is the time index. As a VI algorithm,
this iterative ADP algorithm does not require an initial stabilizing controller. The
value function and control law are updated until they converge to the optimal ones.
Furthermore, it should satisfy that Vi (0) = 0, vi (0) = 0, ∀i ≥ 0.
It should be mentioned that the initial value function here is chosen as V0 (xk ) =
xkT P0 xk instead of V0 (·) = 0 as in most traditional VI algorithms [3–5, 51, 52]. In
what follows, we will prove the convergence of the iterations between (2.2.11) and
(2.2.12), i.e., Vi → J ∗ and vi → u∗ as i → ∞.
Λi+1 (xk ) = xkT Qxk + μTi (xk )Rμi (xk ) + Λi (f (xk ) + g(xk )μi (xk )),
The lemma can easily be proved by noting that Vi is the result of minimizing the
right-hand side of (2.2.11) with respect to the control input uk , while Λi is the result
of an arbitrary control input.
Theorem 2.2.1 Define the value function sequence {Vi (xk )} and the control law
sequence {vi (xk )} as in (2.2.10)–(2.2.12) with V0 (xk ) = xkT P0 xk in (2.2.7).
If V0 (xk ) ≥ V1 (xk ) holds for all xk , the value function sequence {Vi } is a monotoni-
cally nonincreasing sequence, i.e., Vi+1 (xk ) ≤ Vi (xk ), ∀xk , ∀i ≥ 0. If V0 (xk ) ≤ V1 (xk )
holds for all xk , the value function sequence {Vi (xk )} is a monotonically nondecreas-
ing sequence, i.e., Vi (xk ) ≤ Vi+1 (xk ), ∀xk , ∀i ≥ 0.
Proof First, suppose that V0 (xk ) ≥ V1 (xk ) holds for any xk . Define a new sequence
{Φi }, which is updated according to
Φ1(xk) = xk^T Q xk + v0^T(xk) R v0(xk) + Φ0(f(xk) + g(xk)v0(xk)),
Φi+1(xk) = xk^T Q xk + vi−1^T(xk) R vi−1(xk) + Φi(f(xk) + g(xk)vi−1(xk)), i ≥ 1,
where Φ0 (xk ) = V0 (xk ) = xkT P0 xk and {vi } are obtained by (2.2.10) and (2.2.12).
Now, we use mathematical induction to demonstrate that Φi+1(xk) ≤ Vi(xk), ∀i ≥ 0, ∀xk.
Noticing Φ1 (xk ) = V1 (xk ), it is clear that Φ1 (xk ) ≤ V0 (xk ). Then, we assume that it
holds for i − 1, i.e., Φi(xk) ≤ Vi−1(xk), ∀i ≥ 1, ∀xk. According to
Vi(xk) = xk^T Q xk + vi−1^T(xk) R vi−1(xk) + Vi−1(xk+1), i ≥ 1,
and
Φi+1(xk) = xk^T Q xk + vi−1^T(xk) R vi−1(xk) + Φi(xk+1), i ≥ 1,
we have
Vi (xk ) − Φi+1 (xk ) = Vi−1 (xk+1 ) − Φi (xk+1 ) ≥ 0, i ≥ 1,
First, it is easy to see Γ0 (xk ) = V0 (xk ) ≤ V1 (xk ). Then, we assume that it holds for
i − 1, i.e., Γi−1 (xk ) ≤ Vi (xk ), ∀i ≥ 1, ∀xk .
According to
Γi(xk) = xk^T Q xk + vi^T(xk) R vi(xk) + Γi−1(xk+1), i ≥ 1,
and
Vi+1(xk) = xk^T Q xk + vi^T(xk) R vi(xk) + Vi(xk+1), i ≥ 1,
we have
Vi+1 (xk ) − Γi (xk ) = Vi (xk+1 ) − Γi−1 (xk+1 ) ≥ 0, i ≥ 1,
Remark 2.2.1 From Theorem 2.2.1, we can see that the monotonicity property of
the value function Vi is determined by the relationship between V0 and V1 , i.e.,
V0 (xk ) ≥ V1 (xk ) or V0 (xk ) ≤ V1 (xk ), ∀xk . In the traditional VI algorithm, the initial
value function is selected as V0 (·) = 0. We can easily find that this is just a special
case of our general scheme, i.e., V0 (xk ) ≤ V1 (xk ), which leads to a nondecreasing
value function sequence. Furthermore, the monotonicity property is still valid starting
from p if we can find that Vp (xk ) ≥ Vp+1 (xk ) or Vp (xk ) ≤ Vp+1 (xk ) for all xk and
some p. For example,
Vp (xk ) ≥ Vp+1 (xk ) for all xk and some p ≥ 0 ⇒ Vi (xk ) ≥ Vi+1 (xk ), ∀xk , ∀i ≥ p.
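To make the monotonicity property above concrete, the following minimal Python sketch runs the GVI recursion for a scalar linear-quadratic problem. The dynamics xk+1 = a xk + b uk and all parameter values are illustrative assumptions (this is not one of the examples of Sect. 2.2.5); because Vi(x) = pi x^2 here, the minimization in the value update is available in closed form, and choosing p0 above or below the fixed point reproduces the nonincreasing and nondecreasing cases.

```python
# Minimal GVI sketch for a scalar LQ problem (hypothetical example).
# Dynamics: x_{k+1} = a*x + b*u, utility: q*x^2 + r*u^2, V_i(x) = p_i * x^2.

def gvi(p0, a=1.1, b=1.0, q=1.0, r=1.0, iters=30):
    """Run general value iteration starting from V_0(x) = p0 * x^2."""
    p = p0
    history = [p]
    for _ in range(iters):
        # v_i(x) = k*x minimizes q*x^2 + r*u^2 + p*(a*x + b*u)^2
        k = -a * b * p / (r + b * b * p)
        # V_{i+1}(x) = (q + r*k^2 + p*(a + b*k)^2) * x^2
        p = q + r * k * k + p * (a + b * k) ** 2
        history.append(p)
    return history

for p0 in (0.0, 20.0):   # nondecreasing vs. nonincreasing initialization
    hist = gvi(p0)
    print(f"p0 = {p0:5.1f} -> p_i: " + ", ".join(f"{p:.4f}" for p in hist[:6])
          + f" ... -> {hist[-1]:.6f}")
```

Both runs converge to the same limit, which in this linear case is the solution of the discrete-time algebraic Riccati equation; only the direction of approach differs, as Theorem 2.2.1 indicates.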
Next, we will demonstrate the uniform convergence of value function using the
technique of [27, 35], and we will show that the control sequence converges to the
optimal control law by a corollary. The following theorem is due to Rantzer and his
coworkers [27, 35].
Proof First, we demonstrate that the system defined in this section satisfies the con-
ditions of Theorem 2.2.2. According to Assumption 2.2.2, the system state cannot
jump to infinity by any one step of finite control input, i.e., f (xk ) + g(xk )uk is finite.
Because U(xk , uk ) is a positive-definite function, there exists some 0 < γ < ∞ such
that 0 ≤ J ∗ (f (xk ) + g(xk )uk ) ≤ γ U(xk , uk ) holds uniformly. For any finite positive-
definite initial value function V0 , there exist α and β such that 0 ≤ αJ ∗ ≤ V0 ≤ βJ ∗
is satisfied, where 0 ≤ α ≤ 1 and 1 ≤ β < ∞. Next, we will demonstrate the lower
bound of the inequality (2.2.13) by mathematical induction, i.e.,
[1 + (α − 1)/(1 + γ^{−1})^i] J∗(xk) ≤ Vi(xk). (2.2.14)
When i = 1, since
[(α − 1)/(1 + γ)] (γ U(xk, uk) − J∗(xk+1)) ≤ 0, 0 ≤ α ≤ 1,
Now, assume that the inequality (2.2.14) holds for i − 1. Then, we have
Vi(xk) = min_{uk} {U(xk, uk) + Vi−1(xk+1)}
  ≥ min_{uk} {U(xk, uk) + [1 + (α − 1)/(1 + γ^{−1})^{i−1}] J∗(xk+1)}
  ≥ min_{uk} {[1 + (α − 1)γ^i/(γ + 1)^i] U(xk, uk) + [1 + (α − 1)/(1 + γ^{−1})^{i−1} − (α − 1)γ^{i−1}/(γ + 1)^i] J∗(xk+1)}
  = [1 + (α − 1)γ^i/(γ + 1)^i] min_{uk} {U(xk, uk) + J∗(xk+1)}
  = [1 + (α − 1)/(1 + γ^{−1})^i] J∗(xk).
Thus, the lower bound of (2.2.13) is proved. The upper bound of (2.2.13) can be
shown by the same procedure.
Lastly, we demonstrate the uniform convergence of value function as the iteration
index i goes to ∞. When i → ∞, for 0 < γ < ∞, we have
lim_{i→∞} [1 + (α − 1)/(1 + γ^{−1})^i] J∗(xk) = J∗(xk),
and
lim_{i→∞} [1 + (β − 1)/(1 + γ^{−1})^i] J∗(xk) = J∗(xk).
Define V∞(xk) = lim_{i→∞} Vi(xk). Then, we can get V∞(xk) = J∗(xk). Hence, Vi(xk)
converges pointwise to J∗(xk). Because Ω is compact, we can get the uniform convergence
of the value function immediately from Dini's theorem [6]. The proof is complete.
From Theorem 2.2.2, we can determine the upper and lower bounds for every
iterative value function. As the iteration index i increases, the upper bound will
exponentially approach the lower bound. When the iteration index i goes to ∞, the
upper bound will be equal to the lower bound, which is just the optimal cost. Addi-
tionally, we can also analyze the convergence speed of the value function, which is
not available using the approaches in [3–5, 24, 51, 52]. According to the inequal-
ity (2.2.13), smaller γ will lead to faster convergence speed of the value function.
Moreover, it should be mentioned that conditions of Theorem 2.2.2 can be satisfied
according to Assumptions 2.2.1–2.2.3, which are mild for general control problems.
Note that larger α will lead to faster convergence speed of the value function.
When V0 (xk ) ≥ V1 (xk ), ∀xk , according to Theorems 2.2.1 and 2.2.2, we can
deduce that V0 (xk ) ≥ J ∗ (xk ). So, the constants α and β satisfy α = 1 and β ≥ 1.
Then, the corresponding inequality becomes
β −1
J ∗ (xk ) ≤ Vi (xk ) ≤ 1 + J ∗ (xk ).
(1 + γ −1 )i
Note that smaller β will lead to faster convergence speed of the value function.
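As a concrete numerical reading of these bounds (with illustrative constants, not values derived from any example in this chapter): if γ = 1, α = 0.5, and β = 2, then
(1 − 0.5/2^i) J∗(xk) ≤ Vi(xk) ≤ (1 + 1/2^i) J∗(xk),
so after ten iterations the iterative value function is within roughly 0.1% of J∗(xk), since 1/2^10 ≈ 0.001.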
According to the results of Theorem 2.2.2, we can derive the following corollary.
Corollary 2.2.1 Define the value function sequence {Vi } and the control law
sequence {vi } as in (2.2.10)–(2.2.12) with V0 (xk ) = xkT P0 xk . If the system state
xk is controllable, then the control sequence {vi } converges to the optimal control
law u∗ as i → ∞, i.e., limi→∞ vi (xk ) = u∗ (xk ).
Proof According to Theorem 2.2.2, we have proved that limi→∞ Vi (xk ) = V∞ (xk ) =
J ∗ (xk ). Thus,
V∞ (xk ) = min xkT Qxk + ukT Ruk + V∞ (xk+1 ) .
uk
That is to say that the value function sequence {Vi } converges to the optimal value
function of the Bellman equation. Comparing (2.2.5)–(2.2.12), the corresponding
control law {vi } converges to the optimal control law u∗ as i → ∞. This completes
the proof of the corollary.
Next, we will complete the stability analysis for nonlinear systems under the
condition of control Lyapunov function.
Theorem 2.2.3 The value function sequence {Vi } and the control law sequence {vi }
are iteratively updated by (2.2.10)–(2.2.12). If V0 (xk ) = xkT P0 xk ≥ V1 (xk ) holds for
any controllable xk , then the value function Vi (xk ) is a Lyapunov function and the
system using the control law vi (xk ) is asymptotically stable.
Second, we have
Note that v0 (xk ) satisfies the first-order necessary condition, which is given by the
gradient of the right-hand side of (2.2.10) with respect to uk as
∂(xk^T Q xk + uk^T R uk)/∂uk + (∂xk+1/∂uk)^T ∂V0(xk+1)/∂xk+1 = 0.
That is,
2Ruk + 2gT (xk )P0 (f (xk ) + g(xk )uk ) = 0.
The control law v0 (xk ) exists since P0 and R are both positive-definite matrices.
Remark 2.2.2 If the condition V0(xk) ≥ V1(xk) holds, V0(xk) = xk^T P0 xk is called a
control Lyapunov function if the associated feedback control law v0(xk) guarantees
the stability of the closed-loop system. Compared with PI algorithms, the condition
V0(xk) ≥ V1(xk) is easier to satisfy than finding an initial stabilizing control law. In particular,
we can simply choose P0 = κ In with κ ≥ 0, where In is the n × n identity matrix;
by choosing a sufficiently large κ, V0(xk) ≥ V1(xk) is satisfied. Besides, similar to [12, 49], it
should be mentioned that the condition V0(xk) ≥ V1(xk) in Theorem 2.2.3 cannot be
replaced by V0(xk) ≥ J∗(xk), because the nonincreasing property of the value function is
guaranteed by V0(xk) ≥ V1(xk). However, if the condition V0(xk) ≤ V1(xk) holds, we
cannot conclude that vi(xk) is a stable and admissible control for nonlinear systems. For
linear time-invariant discrete-time systems, Primbs and Nevistic [33] demonstrated that
there exists a finite iteration index i∗ such that the closed-loop system is asymptotically
stable for all i ≥ i∗.
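In practice, the condition V0(xk) ≥ V1(xk) can be checked numerically before running the GVI iterations. The sketch below does this by brute force for a hypothetical scalar affine system; the dynamics, the sampled operation region, and the control grid are illustrative assumptions (they are not the examples of Sect. 2.2.5). It increases κ in P0 = κ In until V0(x) = κ x^2 dominates V1(x) on the sampled states.

```python
import numpy as np

# Hypothetical scalar affine dynamics x_{k+1} = f(x) + g(x)*u used only to
# illustrate the check; Q = R = 1, so U(x, u) = x^2 + u^2.
f = lambda x: 0.8 * x + 0.2 * x ** 2
g = lambda x: 1.0

xs = np.linspace(-1.0, 1.0, 41)          # sampled operation region
us = np.linspace(-2.0, 2.0, 401)         # control grid for the minimization

def V1(x, kappa):
    """V_1(x) = min_u { x^2 + u^2 + kappa * (f(x) + g(x)*u)^2 } over the grid."""
    xn = f(x) + g(x) * us
    return np.min(x ** 2 + us ** 2 + kappa * xn ** 2)

kappa = 1.0
while not all(kappa * x ** 2 >= V1(x, kappa) - 1e-9 for x in xs):
    kappa *= 2.0                          # enlarge P0 = kappa*I until V0 >= V1
print("V0(x) = kappa*x^2 dominates V1(x) on the grid for kappa =", kappa)
```

For genuinely multidimensional systems the same idea applies with a state-sampling set and any convenient numerical minimizer in place of the coarse control grid.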
We have demonstrated the convergence of the value function above under the
assumption that the control laws and value functions can be solved exactly at each
iteration. However, it is difficult to solve these equations exactly for nonlinear systems.
Fig. 2.2 Structural diagram of the implementation (critic network producing V̂i+1(xk); legend: signal line, back-propagating path, weight transmission)
where χk = [xkT , v̂iT (xk )]T is the input vector of model network and χ̄k = YmT χk .
The input-to-hidden-layer weights Ym are an (n + m) × l matrix and the hidden-to-
output-layer weights Wm are an l ×n matrix, where l is the number of hidden neurons,
n is the dimension of state vector, and m is the dimension of control input vector.
The activation function is chosen as σ(z) = tanh(z), and its derivative is denoted as σ̇(z) = dσ(z)/dz ∈ R^{l×l} for z ∈ R^l.
dz
The stopping criterion is that the performance function is within a prespecified
threshold, or the training step reaches the maximum value. When the weights of
model network converge, they are kept unchanged. Then, the estimated value of the
control coefficient matrix ĝ(xk ) is given by
Note that V̂i(xk) is the estimated value function of the iterative algorithm (2.2.10)–
(2.2.12) at the ith iteration, where Wc(i) and Yc(i) are the critic NN weights to be
obtained from NN training during the i th iteration. The target function for critic NN
training is given by
where V̂i−1 (x̂k+1 ) = Wc(i−1)T σ Yc(i−1)T x̂k+1 . Then, the error function for training
critic network is defined by ec(i) (xk ) = Vi (xk )− V̂i (xk ), and the performance function
to be minimized is defined by
Ec(i)(xk) = (1/2) ec(i)^2(xk).
The weight tuning algorithm of critic network is the same as model network.
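As a rough illustration of this inner loop, the numpy sketch below forms a critic target of the form xk^T Q xk + v̂^T R v̂ + V̂i−1(x̂k+1) and takes one gradient-descent step on the critic weights for a single state sample. The network sizes, the random weights standing in for the trained model network and the previous critic, the control v̂, and the exact indexing of the target (which in the text is fixed by (2.2.15)) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 2, 1, 8                       # state, control, and hidden-layer sizes

# Placeholder weights standing in for the trained model network, the critic of
# the previous iteration, and the critic currently being trained (illustrative).
Ym, Wm = rng.normal(0, 0.1, (n + m, l)), rng.normal(0, 0.1, (l, n))
Yc_prev, Wc_prev = rng.normal(0, 0.1, (n, l)), rng.normal(0, 0.1, (l, 1))
Yc, Wc = rng.normal(0, 0.1, (n, l)), rng.normal(0, 0.1, (l, 1))

Q, R = np.eye(n), np.eye(m)
xk = np.array([0.5, -0.8])
v_hat = np.array([0.1])                 # assumed output of the action network

def critic(W, Y, x):
    """V_hat(x) = W^T sigma(Y^T x) with sigma = tanh."""
    return float(W.T @ np.tanh(Y.T @ x))

# Critic target in the spirit of (2.2.15): utility plus the previous critic
# evaluated at the model-predicted next state x_hat_{k+1}.
x_next = Wm.T @ np.tanh(Ym.T @ np.concatenate([xk, v_hat]))
target = float(xk @ Q @ xk + v_hat @ R @ v_hat) + critic(Wc_prev, Yc_prev, x_next)

# One gradient-descent step on E_c = 0.5 * (target - V_hat(xk))^2.
alpha_c = 0.05
h = np.tanh(Yc.T @ xk)
e = target - float(Wc.T @ h)
grad_Wc = -e * h[:, None]                                 # dE_c / dWc
grad_Yc = -e * np.outer(xk, Wc[:, 0] * (1.0 - h ** 2))    # dE_c / dYc (tanh backprop)
Wc, Yc = Wc - alpha_c * grad_Wc, Yc - alpha_c * grad_Yc
print(f"critic target = {target:.4f}, error before the step = {e:.4f}")
```

In the full algorithm this step is repeated over the whole sample set and until the prescribed accuracy or the maximum inner-loop iteration number is reached.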
In the action network, the state xk is used as input to obtain the optimal control.
The output can be formulated as v̂i(xk) = Wai^T σ(Yai^T xk), where Wai and Yai are the
action NN weights to be obtained from NN training during the ith iteration of the
ADP algorithm (2.2.10)–(2.2.12). The target of action NN training is given by
vi(xk) = −(1/2) R^{−1} ĝ^T(xk) ∂V̂i(x̂k+1)/∂x̂k+1, (2.2.16)
where x̂k+1 = WmT σ (YmT [xkT , v̂iT ]T ). The convergence of action network weights is
shown in [13]. The error function of the action network can be defined as ea(i) (xk ) =
vi (xk ) − v̂i (xk ). The weights of the action network are updated to minimize the
following performance function:
Ea(i)(xk) = (1/2) ea(i)^T(xk) ea(i)(xk).
The Levenberg–Marquardt (LM) algorithm ensures that Ea(i)(xk) decreases each time the
parameters of the action network are updated.
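For the action-network target (2.2.16), the only critic quantity needed is its gradient with respect to its input; for a critic of the form V̂(x) = Wc^T σ(Yc^T x) with σ = tanh, that gradient is Yc (Wc ⊙ σ̇(Yc^T x)). The short sketch below evaluates the target for one sample; the weights, the predicted next state, and the constant estimate ĝ(xk) are placeholder assumptions, not quantities from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, l = 2, 1, 8
Yc, Wc = rng.normal(0, 0.1, (n, l)), rng.normal(0, 0.1, (l, 1))

R = np.eye(m)
g_hat = np.array([[0.0], [1.0]])        # placeholder estimate of g(x_k), shape n x m

def critic_grad(W, Y, x):
    """d V_hat / d x for V_hat(x) = W^T tanh(Y^T x)."""
    h = np.tanh(Y.T @ x)
    return Y @ (W[:, 0] * (1.0 - h ** 2))    # shape (n,)

x_next_hat = np.array([0.3, -0.2])      # placeholder model-network prediction of x_{k+1}

# Action-network training target in the form of (2.2.16):
#   v_i(x_k) = -0.5 * R^{-1} * g_hat(x_k)^T * dV_hat_i(x_hat_{k+1}) / dx_hat_{k+1}
v_target = -0.5 * np.linalg.solve(R, g_hat.T @ critic_grad(Wc, Yc, x_next_hat))
print("action-network training target v_i(x_k) =", v_target)
```

The action network is then trained toward this target exactly as the critic is trained toward its own target.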
At last, a summary of the present general value iteration adaptive dynamic pro-
gramming algorithm for optimal control is given in Algorithm 2.2.1.
by (2.2.15). Train the critic network until the given accuracy εc or the maximum number of iterations jmax^c is reached.
Step 6. If i > 1, then go to Step 7. Else if V0 > V1 is true for all xk, go to Step 7; otherwise, increase κ and go to Step 3.
Step 7. Compute the target of action network training
{vi(xk^1), vi(xk^2), . . . , vi(xk^p)}
by (2.2.16), and train the action network until the given accuracy εa or the maximum number of iterations jmax^a is reached.
Step 8. If i > imax or
|Vi(xk^s) − Vi−1(xk^s)| ≤ ξ, s = 1, 2, . . . , p,
go to Step 9; otherwise, go to Step 4.
Step 9. Compute the output of the action network {v̂i(xk^1), v̂i(xk^2), . . . , v̂i(xk^p)}. Obtain the final near-optimal control law
u∗(·) = v̂i(·),
and stop the algorithm.
The above GVI-based ADP approach can be employed to solve the optimal tracking
control problem [45]. Consider the nonaffine nonlinear system (2.2.1), for infinite-
time optimal tracking problem, the objective is to design an optimal control u∗ (xk ),
such that the state xk tracks the specified desired trajectory ξk ∈ Rn , k = 0, 1, . . .. In
this section, we assume that there exists a feedback control ue,k , which satisfies the
following equation:
ξk+1 = F(ξk , ue,k ), (2.2.17)
where zk = xk − ξk denotes the tracking error, μk = uk − ue,k, and ue,k is the desired control that satisfies (2.2.17). The
quadratic cost function is
J(z0, μ0) = Σ_{k=0}^{∞} U(zk, μk) = Σ_{k=0}^{∞} [zk^T Q zk + (uk − ue,k)^T R (uk − ue,k)], (2.2.18)
where V0 (zk+1 ) = Ψ (zk+1 ). For i = 1, 2, . . ., the iterative ADP algorithm will iterate
between value function update
vi(zk) = arg min_{μk} {U(zk, μk) + Vi(zk+1)} = arg min_{μk} {U(zk, μk) + Vi(F(zk, μk))}. (2.2.24)
Note that the ADP algorithm described above in (2.2.22)–(2.2.24) (for nonaffine
nonlinear systems) is essentially the same as that in (2.2.10)–(2.2.12) (for affine
nonlinear systems). The only difference between the two is the choice of initial value
function (see (2.2.7) and (2.2.21)).
Additional properties of the GVI-based ADP algorithm are given as follows.
and
are satisfied uniformly, then the iterative value function Vi (zk ) satisfies
[1 + (α − 1)/(1 + γ^{−1})^i] J∗(zk) ≤ Vi(zk) ≤ [1 + (β − 1)/(1 + ρ^{−1})^i] J∗(zk). (2.2.28)
1 ≤ α ≤ β < ∞,
respectively. If ∀zk , the inequalities (2.2.26) and (2.2.27) hold uniformly, then the
iterative value function Vi (zk ) satisfies (2.2.28).
Corollary 2.2.2 For i = 0, 1, . . ., let vi (zk ) and Vi (zk ) be obtained by (2.2.21)–
(2.2.24). Let ρ, γ , α, and β be constants that satisfy (2.2.25) and
0 ≤ α ≤ β < ∞, (2.2.29)
respectively. If, ∀zk, the inequalities (2.2.26) and (2.2.27) hold uniformly, then the
iterative value function Vi(zk) converges to the optimal cost function J∗(zk), i.e., lim_{i→∞} Vi(zk) = J∗(zk).
The VI-based optimal control [5, 41], constrained optimal control [28, 51], and opti-
mal tracking control [19, 52] methods are special cases of the results in Sect. 2.2.3, by
noting that the initial value function is chosen as zero. Among these, input constraints
are often encountered in practical problems, which causes considerable difficulty
in designing the optimal controller [17, 28, 51]. Therefore, in this section, we develop
a VI-based constrained optimal control scheme via the GDHP technique [28].
Consider the discrete-time nonaffine nonlinear system (2.2.1) and define Ω̄u =
{uk : uk = [u1k , u2k , . . . , umk ]T ∈ Rm , |ulk | ≤ ūl , l = 1, 2, . . . , m}, where ūl is the
saturation bound for the lth actuator. Let Ū = diag{ū1 , ū2 , . . . , ūm } be a constant
diagonal matrix.
In much of the optimal control literature [5, 13, 41, 42], the utility function is chosen
as the quadratic form (2.2.6). However, this is no longer appropriate when dealing with
constrained optimal control problems. Inspired by the work of [1, 29, 51], we
can employ a generalized nonquadratic functional
Y(uk) = 2 ∫_0^{uk} Φ^{−T}(Ū^{−1}s) Ū R ds (2.2.30)
and
u∗(xk) = arg min_{uk} {xk^T Q xk + 2 ∫_0^{uk} Φ^{−T}(Ū^{−1}s) Ū R ds + J∗(xk+1)},
respectively.
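For the common choice Φ(·) = tanh(·) and a scalar control, the integral in (2.2.30) can be evaluated in closed form, which is convenient when the utility must be computed many times during training. The sketch below is a scalar, single-input illustration with arbitrarily chosen Ū and R; it checks the closed form against a simple trapezoidal quadrature.

```python
import numpy as np

U_bar, R = 0.5, 1.0   # saturation bound and (scalar) control weighting, chosen arbitrarily

def Y_closed(u):
    """2 * int_0^u artanh(s/U_bar) * U_bar * R ds, valid for |u| < U_bar."""
    return 2.0 * U_bar * R * (u * np.arctanh(u / U_bar)
                              + 0.5 * U_bar * np.log(1.0 - (u / U_bar) ** 2))

def Y_quad(u, num=20001):
    """Trapezoidal quadrature of the same integral, as a sanity check."""
    s = np.linspace(0.0, u, num)
    f = 2.0 * np.arctanh(s / U_bar) * U_bar * R
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s)))

for u in (0.1, 0.3, 0.45):
    print(f"u = {u:4.2f}:  closed form = {Y_closed(u):.6f},  quadrature = {Y_quad(u):.6f}")
```

The utility is positive for any nonzero control and grows rapidly as the control approaches the saturation bound, which is what penalizes constraint violation in the constrained formulation.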
The traditional VI-based iterative ADP algorithm is performed as follows. First,
we start with the initial value function V0 (·) = 0 and solve Vi (xk ) and vi (xk ) using
the iterative algorithm described by (2.2.22)–(2.2.24).
In this section, the GDHP technique is employed to implement the iterative ADP
algorithm. In the iterative GDHP algorithm, there are three NNs, which are model
network, critic network, and action network. Here, all the NNs are chosen as three-
layer feedforward ones. It is important to note that the critic network of GDHP
outputs both the value function V (xk ) and its derivative ∂V (xk )/∂xk [34], which is
schematically depicted in Fig. 2.3. It is a combination of HDP and dual heuristic
programming (DHP).
The training of model network is complete after the system identification process,
and its weights will be kept unchanged. As a result, we avoid the requirement of
knowing F(xk , uk ) during the implementation of the iterative GDHP algorithm. Next,
the learned NN model will be used in the training process of critic network and action
network.
We denote λi(xk) = ∂Vi(xk)/∂xk in our discussion. Hence, the critic network
is used to approximate both Vi(xk) and λi(xk). The output of the critic network is
expressed as
[V̂i(xk); λ̂i(xk)] = [Wc1^{iT}; Wc2^{iT}] σ(Yc^{iT} xk) = Wc^{iT} σ(Yc^{iT} xk),
and
Then, we define the error functions of critic network training as e^v_{cik} = Vi(xk) − V̂i(xk)
and e^λ_{cik} = λi(xk) − λ̂i(xk). The objective function to be minimized in the critic
network is
Ecik = (1 − τ)E^v_{cik} + τ E^λ_{cik},
where 0 ≤ τ ≤ 1 is a parameter that adjusts how HDP and DHP are combined in
GDHP,
E^v_{cik} = (1/2)(e^v_{cik})^2
and
E^λ_{cik} = (1/2) e^{λT}_{cik} e^λ_{cik}.
The weight update rule for training critic network is the gradient-based adaptation
which is given by
Wc^i(p + 1) = Wc^i(p) − αc [(1 − τ) ∂E^v_{cik}/∂Wc^i(p) + τ ∂E^λ_{cik}/∂Wc^i(p)],
Yc^i(p + 1) = Yc^i(p) − αc [(1 − τ) ∂E^v_{cik}/∂Yc^i(p) + τ ∂E^λ_{cik}/∂Yc^i(p)],
where αc > 0 is the learning rate of critic network and p is the inner-loop iteration
step for updating NN weight parameters. The detailed discussion on superiority of
GDHP-based iterative ADP algorithm can be found in [42].
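The sketch below carries out one such combined HDP/DHP gradient step for a single state sample, using a two-output critic Wc^T σ(Yc^T xk) whose first output approximates Vi(xk) and whose remaining outputs approximate λi(xk). The targets and all weights are illustrative placeholders; in the algorithm the targets come from the value-function and costate updates of the iterative GDHP scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
n, l, tau, alpha_c = 2, 8, 0.5, 0.05

Yc = rng.normal(0, 0.1, (n, l))          # input-to-hidden weights
Wc = rng.normal(0, 0.1, (l, 1 + n))      # hidden-to-output: [V_hat, lambda_hat]

xk = np.array([0.5, -0.8])
V_target = 1.2                           # placeholder target V_i(x_k)
lam_target = np.array([0.4, -0.9])       # placeholder target lambda_i(x_k)

z = Yc.T @ xk
h = np.tanh(z)
out = Wc.T @ h                           # out[0] = V_hat, out[1:] = lambda_hat
e_v = V_target - out[0]
e_lam = lam_target - out[1:]

# Gradient of E = (1-tau)*0.5*e_v^2 + tau*0.5*e_lam^T e_lam w.r.t. the outputs.
dE_dout = np.concatenate(([-(1.0 - tau) * e_v], -tau * e_lam))

grad_Wc = np.outer(h, dE_dout)                            # dE/dWc
grad_Yc = np.outer(xk, (Wc @ dE_dout) * (1.0 - h ** 2))   # dE/dYc (backprop through tanh)

Wc -= alpha_c * grad_Wc
Yc -= alpha_c * grad_Yc
print(f"HDP error e_v = {e_v:.4f}, DHP error |e_lambda| = {np.linalg.norm(e_lam):.4f}")
```

Setting τ = 0 or τ = 1 recovers a pure HDP or pure DHP weight update, respectively, which is the sense in which GDHP interpolates between the two.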
In the action network, the state xk is used as input to obtain the approximate
optimal control as output of the network, which is formulated as
v̂i(xk) = Wai^T σ(Yai^T xk). The error function is defined as ea(i)k = vi(xk) − v̂i(xk), and the performance function to be minimized is
Ea(i)k = (1/2) ea(i)k^T ea(i)k.
Similarly, the weight update algorithm is
Wa^i(p + 1) = Wa^i(p) − αa ∂Ea(i)k/∂Wa^i(p),
Ya^i(p + 1) = Ya^i(p) − αa ∂Ea(i)k/∂Ya^i(p),
where αa > 0 is the learning rate of action network and p is the inner-loop iteration
step for updating weight parameters.
In this section, several examples are provided to demonstrate the effectiveness of the
present control methods.
and R = 1. Note that the open-loop poles are −0.1083 and 1.1083, which indicates
that the system is unstable.
Algorithm 2.2.1 will be used here. To reduce the influence of the NN approxi-
mation errors, we choose three-layer BP NNs as model network, critic network and
action network with the structures of 3–9–2, 2–8–1, and 2–8–1, respectively. The
initial weights of NNs are chosen randomly in [−0.1, 0.1].
Before implementing the GVI algorithm, we need to train the model network first.
The operation region of system (2.2.31) is selected as −1 ≤ x1 ≤ 1 and −1 ≤ x2 ≤ 1.
One thousand samples are randomly chosen from this operation region as the training set,
and the model network is trained until the given accuracy εm = 10−8 is reached with
jmax^m = 10000. The inner-loop iteration number of the critic network and action network
is jmax^c = jmax^a = 1000, and the given accuracy is εc = εa = 10−6. The maximum
outer-loop iteration number is selected as imax = 10, and the prespecified accuracy is selected
as ξ = 10−6. The number of samples at each iteration is p = 2000.
Set P0 = I2 . We find that V0 ≥ V1 holds for all states, which can be seen from
Fig. 2.4. After implementing the outer-loop iteration for 10 times, the convergence
of the value function is observed. The 3-D plots of the approximate value function at i = 0
and i = 10 are given in Fig. 2.5, and the 3-D plot of the error between the optimal cost
function J∗ and the approximate optimal value function V10 is given in Fig. 2.6. From
Fig. 2.6, we can see that this error is roughly within 10−3 over the operation region.
For the initial state x0 = [1, −1]T , the convergence process of value function is
given in Fig. 2.7. We apply the control law v10 to the system for 20 time steps. The
corresponding state trajectories are given in Fig. 2.8, and the control input is shown
in Fig. 2.9.
These simulation results indicate that our algorithm is effective in obtaining the
optimal control law via learning in a timely manner.
where xk = [x1k , x2k ]T and uk = [u1k , u2k ]T . The desired trajectory is set to
Fig. 2.4 3-D plot of V0 − V1 over the operation region (x1, x2)
Fig. 2.5 3-D plots of the approximate value function at i = 0 and i = 10
Fig. 2.6 Error between the optimal cost function J ∗ and the approximate optimal value function
V10
Fig. 2.7 Convergence of the value function with respect to the iteration index i
Fig. 2.8 The state trajectories x1 and x2
Fig. 2.9 The control input
ξk = [sin(k + π/2), 0.5 cos(k)]^T. (2.2.33)
According to (2.2.32) and (2.2.33), we can easily obtain the desired control
ue,k = −diag(5, 5) (ξk+1 − [0.2 ξ1k exp(ξ2k^2), 0.3 ξ2k^3]^T).
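Assuming, as the reconstructed formula above suggests, that in (2.2.32) the drift is f(xk) = [0.2 x1k exp(x2k^2), 0.3 x2k^3]^T and the control matrix is the constant g = diag(−0.2, −0.2) (both inferred here, so they should be checked against (2.2.32)), the desired control along the trajectory (2.2.33) can be generated as in the short sketch below.

```python
import numpy as np

def xi(k):
    """Desired trajectory (2.2.33)."""
    return np.array([np.sin(k + np.pi / 2), 0.5 * np.cos(k)])

def f(x):
    # Drift term inferred from the desired-control formula (assumption).
    return np.array([0.2 * x[0] * np.exp(x[1] ** 2), 0.3 * x[1] ** 3])

g_inv = -np.diag([5.0, 5.0])   # inverse of the assumed control matrix diag(-0.2, -0.2)

# u_{e,k} = g^{-1} (xi_{k+1} - f(xi_k)), cf. the displayed formula above.
for k in range(3):
    u_e = g_inv @ (xi(k + 1) - f(xi(k)))
    print(f"k = {k}: u_e,k = {u_e}")
```

The tracking controller then only has to learn the increment μk = uk − ue,k that regulates the error zk to zero.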
are all 1. It is desired to control the system with control constraint of |u| ≤ 0.5. The
cost function is chosen as
Fig. 2.10 The trajectories of the iterative value functions with initial value function given by Ψ j (zk ),
j = 1, 2, 3, 4. a Ψ 1 (zk ). b Ψ 2 (zk ). c Ψ 3 (zk ). d Ψ 4 (zk )
J(x0) = Σ_{k=0}^{∞} [xk^T Q xk + 2 ∫_0^{uk} tanh^{−T}(Ū^{−1}s) Ū R ds],
where Q and R are identity matrices with suitable dimensions and Ū = 0.5.
In this example, the three NNs are chosen with structures of 3–8–2, 2–8–3, and
2–8–1, respectively. Here, the initial weights of the critic network and action network
are all set to be random in [−0.1, 0.1]. Then, letting the parameter τ = 0.5 and the
learning rate αc = αa = 0.05, we train the critic network and action network for
26 iterations. When k = 0, the convergence process of the value function and its
derivatives is depicted in Fig. 2.12.
Fig. 2.11 The tracking errors z1 and z2
Fig. 2.12 a The convergence process of the value function. b The convergence process of the
derivatives of the value function
Fig. 2.13 Simulation results of Example 2.2.3. a The state trajectory x. b The control input u.
c The state trajectory x without considering the control constraint. d The control input u without
considering the control constraint
Next, for given initial state x0 = [0.5, −1]T , we apply the optimal control laws
designed by the iterative GDHP algorithm, with and without considering the control
constraints, to system (2.2.34) for 20 time steps, respectively. The simulation results
are shown in Fig. 2.13, which also exhibits excellent control results of the iterative
GDHP algorithm.
2.3 Iterative θ-Adaptive Dynamic Programming Algorithm for Nonlinear Systems

Consider again the infinite-horizon cost function
J(x0, u0) = Σ_{k=0}^{∞} U(xk, uk),
We can see that if we want to obtain the optimal control law u∗ (xk ), we must obtain the
optimal value function J ∗ (xk ). Generally speaking, J ∗ (xk ) is unknown before all the
controls uk ∈ Rm are considered. If we adopt the traditional dynamic programming
method to obtain the optimal value function one step at a time, then we have to face
the “curse of dimensionality.” In [5, 43], iterative algorithms of ADP were used to
obtain the solution of Bellman equation indirectly. However, we pointed out that the
stability of the system cannot be guaranteed in [5] and an admissible control sequence
In the present iterative θ -ADP algorithm, the value function and control law are
updated with the iteration index i increasing from 0 to ∞. The following definition
is necessary to begin the algorithm.
Definition 2.3.1 For xk ∈ Rn , let
Ψ̄xk = {Ψ(xk) : Ψ(xk) > 0, and ∃ ν̄(xk) ∈ Ak, s.t. Ψ(F(xk, ν̄(xk))) < Ψ(xk)} (2.3.5)
Let Ψ(xk) ∈ Ψ̄xk and define the initial value function V0(xk) = θΨ(xk), ∀xk ∈ Rn, where θ > 0 is a finite positive constant. The iterative control law v0(xk)
can be computed as follows:
and
We note that the ADP algorithm (2.3.7)–(2.3.9) described above is essentially the
same as those in (2.2.10)–(2.2.12) and (2.2.22)–(2.2.24). The only difference is the
choice of initial value function and the choice of utility function. Here, the utility
function may be nonquadratic.
Remark 2.3.1 Equations (2.3.7)–(2.3.9) in the iterative θ -ADP algorithm are similar
to the Bellman equation (2.3.4), but they are not the same. There are at least three
obvious differences.
(1) The Bellman equation (2.3.4) possesses a unique optimal cost function, i.e.,
J∗(xk), ∀xk, while in the iterative ADP equations (2.3.7)–(2.3.9), the value functions
are different for different iteration indices i, i.e., Vi(xk) ≠ Vj(xk), ∀i ≠ j.
(2) The control law obtained by the Bellman equation (2.3.4) is the optimal control law,
i.e., u∗(xk), ∀xk, while the control laws from the iterative ADP equations (2.3.7)–
(2.3.9) are different for each iteration index i, i.e., vi(xk) ≠ vj(xk), ∀i ≠ j, and
are not optimal in general.
(3) For any finite i, the iterative value function Vi(xk) is a finite sum with a
terminal constraint term, and the property of Vi(xk) can be seen in the
following lemma (Lemma 2.3.1). But the optimal cost function J∗(xk) in (2.3.4)
is a sum of an infinite sequence. So, in general, Vi(xk) ≠ J∗(xk).
Lemma 2.3.1 Let xk be an arbitrary state vector. If the iterative value function
Vi (xk ) and the control law vi (xk ) are obtained by (2.3.7)–(2.3.9), then Vi (xk ) can be
expressed as
Vi(xk) = Σ_{j=0}^{i} U(xk+j, vi−j(xk+j)) + θΨ(xk+i+1).
where
V1(xk+i) = min_{uk+i} {U(xk+i, uk+i) + θΨ(xk+i+1)}.
Define
uk^N = (uk, uk+1, . . . , uN)
In the above, we can see that the optimal value function J ∗ (xk ) is replaced by
a sequence of iterative value functions Vi (xk ) and the optimal control law u∗ (xk ) is
replaced by a sequence of iterative control laws vi (xk ), where i ≥ 0 is the iteration
index. As (2.3.8) is not a Bellman equation, generally speaking, the iterative value
function Vi (xk ) is not optimal. However, we can prove that J ∗ (xk ) is the limit of
Vi (xk ) as i → ∞. Next, the convergence properties will be analyzed.
Lemma 2.3.2 Let μ(xk ) ∈ Ak be an arbitrary control law, and let Vi (xk ) and vi (xk )
be expressed as in (2.3.7)–(2.3.9), respectively. Define a new value function Pi(xk) as
Pi+1(xk) = U(xk, μ(xk)) + Pi(xk+1), with P0(xk) = θΨ(xk). (2.3.11)
In general, we have
Pi (xk ) ≥ J ∗ (xk ), ∀i, xk .
Theorem 2.3.1 Let xk be an arbitrary state vector. The iterative control law vi (xk )
and the iterative value function Vi (xk ) are obtained by (2.3.7)–(2.3.9). If Assumptions
2.2.1–2.2.3 and 2.3.1 hold, then for any finite i = 0, 1, . . ., there exists a finite
θ > 0 such that the iterative value function Vi (xk ) is a monotonically nonincreasing
sequence for i = 0, 1, . . ., i.e.,
Proof To obtain the conclusion, we will show that for an arbitrary finite i < ∞,
there exists a finite θi > 0 such that (2.3.12) holds. We prove this by mathematical
induction.
First, we let i = 0. Let μ(xk ) ∈ Ak be an arbitrary stable control law. Define the
value function Pi (xk ) as in (2.3.11). For i = 0, we have
According to Definition 2.3.1, there exists a stable control law ν̄k = ν̄(xk) such that
Ψ(F(xk, ν̄(xk))) < Ψ(xk).
As ν̄(xk) is a stable control law, the utility function U(xk, ν̄(xk)) is finite. Then,
there exists a finite θ0 > 0 such that
U(xk, ν̄(xk)) + θ0 Ψ(F(xk, ν̄(xk))) ≤ θ0 Ψ(xk).
We can get
V1(xk) = min_{uk} {U(xk, uk) + θ0 Ψ(xk+1)} ≤ U(xk, ν̄(xk)) + θ0 Ψ(F(xk, ν̄(xk))) ≤ θ0 Ψ(xk) = V0(xk).
Vl(xk) = Σ_{j=0}^{l−1} U(xk+j, vl−j−1(xk+j)) + θ̃l Ψ(xk+l), (2.3.15)
Let
uk = vl−1 (xk ), uk+1 = vl−2 (xk+1 ), . . . , uk+l−1 = v0 (xk+l−1 ).
where μ(xk+l) ∈ Ak+l. According to Definition 2.3.1, there exists a stable control
law ν̄(xk+l) ∈ Ak+l such that Ψ(F(xk+l, ν̄(xk+l))) < Ψ(xk+l), and hence a finite θl > 0 such that
U(xk+l, ν̄(xk+l)) + θl Ψ(xk+l+1) ≤ θl Ψ(xk+l). Therefore,
Vl(xk) = Σ_{j=0}^{l−1} U(xk+j, vl−j−1(xk+j)) + θl Ψ(xk+l)
  ≥ Σ_{j=0}^{l−1} U(xk+j, vl−j−1(xk+j)) + U(xk+l, ν̄(xk+l)) + θl Ψ(xk+l+1)
  = Pl+1(xk).
According to Lemma 2.3.2, we have Vl+1(xk) ≤ Pl+1(xk). Therefore, we obtain
Vl+1(xk) ≤ Vl(xk), which completes the induction. Hence, if such a finite θi can be found for every i so that the above inequality
holds, then we can obtain (2.3.12). In this situation, the iterative value function Vi(xk)
is a monotonically nonincreasing sequence for i = 0, 1, . . ..
Theorem 2.3.2 Let xk be an arbitrary state vector. If Assumptions 2.2.1–2.2.3 and
2.3.1 hold and there exists a control law ν̄(xk ) ∈ Ak which satisfies (2.3.5) such that
the following limit
lim_{xk→0} U(xk, ν̄(xk)) / [Ψ(xk) − Ψ(F(xk, ν̄(xk)))] (2.3.18)
exists, then there exists a finite θ > 0 such that (2.3.12) is true.
Proof According to (2.3.17) in Theorem 2.3.1, we can see that for any finite i < ∞,
the parameter θi should satisfy
θi ≥ U(xk+i, ν̄(xk+i)) / [Ψ(xk+i) − Ψ(F(xk+i, ν̄(xk+i)))],
which implies
lim_{i→∞} θi ≥ lim_{i→∞} U(xk+i, ν̄(xk+i)) / [Ψ(xk+i) − Ψ(F(xk+i, ν̄(xk+i)))]. (2.3.19)
We can see that if the limit of the right-hand side of (2.3.19) exists, then θ∞ = lim_{i→∞} θi
can be defined. Therefore, if we define
θ̄ = sup{θ0 , θ1 , . . . , θ∞ }, (2.3.20)
then θ̄ can be well defined. Hence, we can choose an arbitrary finite θ which satisfies
θ ≥ θ̄, (2.3.21)
θ ≥ lim_{xk→0} U(xk, ν̄(xk)) / [Ψ(xk) − Ψ(F(xk, ν̄(xk)))].
If we put u∗(xk) into (2.3.11), then lim_{i→∞} Pi(xk) = J∗(xk) holds for any finite θ.
From Theorems 2.3.1 and 2.3.2, we can see that if there exists a finite θ such that
(2.3.12) holds, then Vi(xk) ≥ 0 is a nonincreasing sequence with a lower bound
for the iteration index i = 0, 1, . . .. We can derive the following theorem.
Theorem 2.3.3 Let xk be an arbitrary state vector. Define the value function V∞ (xk )
as the limit of the iterative value function Vi(xk), i.e., V∞(xk) = lim_{i→∞} Vi(xk). Then,
V∞(xk) = min_{uk} {U(xk, uk) + V∞(xk+1)}. (2.3.22)
Proof Let μ(xk ) be an arbitrary stable control law. According to Theorem 2.3.1,
∀i = 0, 1, . . ., we have
So,
Let ε > 0 be an arbitrary positive number. Since Vi (xk ) is nonincreasing for all i and
limi→∞ Vi (xk ) = V∞ (xk ), there exists a positive integer p such that
Then, we let
Hence,
V∞(xk) ≥ Vp(xk) − ε
       ≥ U(xk, vp−1(xk)) + Vp−1(xk+1) − ε
       ≥ U(xk, vp−1(xk)) + V∞(xk+1) − ε
       ≥ min_{uk} {U(xk, uk) + V∞(xk+1)} − ε.
Combining (2.3.23) and (2.3.24), we have (2.3.22) which proves the conclusion of
this theorem.
Remark 2.3.4 There are two important properties we must point out. First, from the iterative
θ -ADP algorithm (2.3.7)–(2.3.9), we see that the initial function Ψ (xk ) is arbitrarily
chosen in the set Ψ̄ (xk ). The parameter θ is also arbitrarily chosen if it satisfies
(2.3.21). Actually, it is not necessary to find all θi to construct the set in (2.3.20).
What we should do is to choose a θ large enough to run the iterative θ -ADP algorithm
(2.3.7)–(2.3.9) and guarantee the iterative value function to be convergent. This
allows for very convenient implementation of the present algorithm. Second, for different initial values of θ and different initial functions Ψ(xk), the iterative value function of the iterative θ-ADP algorithm will converge to the same value function. We will
show this property after two necessary lemmas.
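Before turning to the lemmas, the following minimal sketch illustrates both points of Remark 2.3.4 on a toy problem. The recursion has the structure of (2.3.7)–(2.3.9): V0 = θΨ followed by repeated min-backups on a state grid. The scalar system xk+1 = 0.8xk + uk, the quadratic utility, the grids, and the two values of θ are illustrative assumptions, not the book's examples.

```python
import numpy as np

# Illustrative scalar system and quadratic utility (assumptions for this sketch,
# not the book's example): x_{k+1} = 0.8 x_k + u_k, U(x,u) = x^2 + u^2, Psi(x) = x^2.
f = lambda x, u: 0.8 * x + u
U = lambda x, u: x**2 + u**2
Psi = lambda x: x**2

xs = np.linspace(-1.0, 1.0, 41)   # sampled states x_k
us = np.linspace(-1.0, 1.0, 81)   # sampled controls u_k

def theta_adp(theta, iters=30):
    """V_0 = theta*Psi; V_{i+1}(x) = min_u {U(x,u) + V_i(F(x,u))} on the grid."""
    V = theta * Psi(xs)
    history = [V.copy()]
    for _ in range(iters):
        V_next = np.empty_like(V)
        for j, x in enumerate(xs):
            succ = f(x, us)                         # successor states for all controls
            V_next[j] = np.min(U(x, us) + np.interp(succ, xs, V))
        V = V_next
        history.append(V.copy())
    return np.array(history)

h1, h2 = theta_adp(theta=5.0), theta_adp(theta=10.0)
# Nonincreasing iterations at x = 1, and both runs reach (almost) the same limit:
print(np.all(np.diff(h1[:, -1]) <= 1e-9), np.allclose(h1[-1], h2[-1], atol=1e-3))
```

In this sketch both properties can be observed directly: the iterates decrease monotonically once θ is large enough, and the two different initial scalings converge to the same limiting value function.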
Lemma 2.3.3 Let ν̄(xk) ∈ Ak be an arbitrary stable control law, and let the value function Pi(xk) be defined in (2.3.11) with μ = ν̄. Define a new value function P′i(xk) as

P′i+1(xk) = U(xk, ν̄(xk)) + P′i(xk+1),

with P′0(xk) = θ′ Ψ(xk), ∀xk. Let θ and θ′ be two different finite constants which satisfy (2.3.21), i.e., let θ ≥ θ̄ and θ′ ≥ θ̄, such that (2.3.12) is true. Then, P∞(xk) = P′∞(xk) = Γ∞(xk), where Γ∞(xk) is defined as

Γ∞(xk) = lim_{i→∞} Σ_{j=0}^{i} U(xk+j, ν̄(xk+j)).
Proof According to (2.3.11), for i = 0, 1, . . ., we have

Pi(xk) = Σ_{j=0}^{i} U(xk+j, ν̄(xk+j)) + θ Ψ(xk+i+1),

P′i(xk) = Σ_{j=0}^{i} U(xk+j, ν̄(xk+j)) + θ′ Ψ(xk+i+1),

where θ and θ′ both satisfy (2.3.21), and θ ≠ θ′. As ν̄(xk) is a stable control law, we have xk+i → 0 as i → ∞. Then, lim_{k→∞} θ Ψ(xk) = lim_{k→∞} θ′ Ψ(xk) = 0 since xk → 0.
So, we can get
P∞(xk) = P′∞(xk) = lim_{i→∞} Σ_{j=0}^{i} U(xk+j, ν̄(xk+j)) = Γ∞(xk).
Next, we will prove that the iterative value function Vi (xk ) converges to the optimal
value function J ∗ (xk ) as i → ∞. Before we give the optimality theorem, the following
lemma is necessary.
Next, let μk^{k+q−1} = uk^{∗(k+q−1)}. We can obtain

Pq(xk) − J∗(xk) = θ Ψ(xk+q) − Σ_{j=q}^{∞} U(xk+j, u∗k+j) ≥ 0.
From (2.3.11), as μ(xk ) is a stable control law, the control sequence μk = (μk , μk+1 ,
. . .) under the stable control law μ(xk ) is a stable control sequence. Hence, we can
get θ Ψ (xk+q ) → 0 as q → ∞. Then, from the fact that
lim_{q→∞} Σ_{j=q}^{∞} U(xk+j, u∗k+j) = 0,

we can obtain

lim_{q→∞} [ θ Ψ(xk+q) − Σ_{j=q}^{∞} U(xk+j, u∗k+j) ] = 0.
Therefore, ∀ε > 0, there exists a finite q such that Pq (xk ) − J ∗ (xk ) ≤ ε holds. This
completes the proof of the lemma.
Theorem 2.3.4 Let Vi(xk) be defined by (2.3.8), where θ satisfies (2.3.21). If the system state xk is controllable, then Vi(xk) converges to the optimal cost function J∗(xk) as i → ∞, i.e.,

lim_{i→∞} Vi(xk) = J∗(xk).

J∗(xk) ≤ Vi(xk).
Let ε > 0 be an arbitrary positive number. According to Lemma 2.3.4, there exists
a finite positive integer q such that
V∞ (xk ) = J ∗ (xk ).
According to Theorem 2.3.1, we have Vi+1(xk) ≤ Vi(xk), ∀i ≥ 0. Then, for all xk ≠ 0, we can obtain
For i = 0, 1, . . ., the iterative value function Vi (xk ) is a Lyapunov function [20, 26,
30]. Therefore, the conclusion is proved.
Next, we will prove that the optimal control law u∗ (xk ) is an admissible control
law for system (2.3.1).
Theorem 2.3.6 Let xk be an arbitrary controllable state. For i = 0, 1, . . ., if Assump-
tions 2.2.1–2.2.3 and 2.3.1 hold and the iterative value function Vi (xk ) and iterative
control law vi (xk ) are defined by (2.3.7)–(2.3.9) where θ satisfies (2.3.21), then the
optimal control law u∗ (xk ) is an admissible control law for system (2.3.1).
The proof of this theorem can be done by considering the fact that J ∗ (xk ) is finite.
Therefore, we omit the details here.
Remark 2.3.5 From the above analysis, we can see that the present iterative θ -ADP
algorithm is different from VI algorithms in [5, 52]. The main differences can be
summarized as follows.
(1) The initial conditions are different. In [5, 52], VI algorithms are initialized by
    zero, i.e., V0(xk) ≡ 0, ∀xk. In this section, the iterative θ-ADP algorithm is
    initialized by the positive definite function θΨ(xk).
(2) The convergence properties are different. For VI algorithms in [5, 52], the iter-
ative value function Vi (xk ) is monotonically nondecreasing and converges to
the optimum. In this section, the iterative value function Vi (xk ) in the θ -ADP
algorithm is monotonically nonincreasing and converges to the optimal one.
(3) We emphasize that the properties of the iterative control laws are different. For
the VI algorithms in [5, 52], the stability of iterative control laws cannot be
guaranteed, which means the VI algorithm can only be implemented off-line. In
this section, it is proved that for all i = 0, 1, . . ., the iterative control law vi (xk )
is a stable control law. This means that the present iterative θ -ADP algorithm
is feasible for implementations both online and off-line. This is an obvious
merit of the present iterative θ -ADP algorithm. In the simulation study, we will
provide simulation comparisons between the VI algorithms in [5, 52] and the
present iterative θ -ADP algorithm. This conclusion echoes the observation in
Remark 2.2.2.
Theorem 2.3.7 Let xk be an arbitrary controllable state, and let J ∗ (xk ) be the opti-
mal cost function expressed by (2.3.2). If Assumptions 2.2.1–2.2.3 and 2.3.1 hold,
then
J ∗ (xk ) ∈ Ψ̄xk .
Proof By Assumption 2.3.1 and the definition of J ∗ (xk ) in (2.3.2) and (2.3.4), we
can see that
J∗(xk) = lim_{N→∞} Σ_{j=0}^{N} U(xk+j, u∗(xk+j))
According to Theorem 2.3.7, Ψ̄xk is not an empty set. In general, however, the optimal value function is difficult to obtain before the algorithm is complete. Therefore, some other methods are established to obtain Ψ(xk).
Remark 2.3.6 According to the definition of admissible control law, we can see that Ψ(xk) ∈ Ψ̄xk is equivalent to Ψ(xk) being a Lyapunov function. There are two properties we should point out. First, the general purpose of choosing a Lyapunov function Ψ(xk) is to find a control ν̄(xk) that stabilizes the system. In this section, however, the purpose of choosing the initial function θΨ(xk) is to obtain the optimal control of the system (not only to stabilize the system but also to minimize the value function). Second, if we adopt V0(xk) = Ψ(xk) to initialize the algorithm, then the initial iterative control law v0(xk) can be obtained by

v0(xk) = arg min_{uk} {U(xk, uk) + Ψ(F(xk, uk))}.
We should point out that v0 (xk ) may not be a stable control law for the system,
although the algorithm is initialized by a Lyapunov function. Using the present
iterative θ -ADP algorithm (2.3.7)–(2.3.9) in this section, we can prove that all the
iterative controls vi (xk ) for i = 0, 1, . . ., are stable and simultaneously guarantee the
iterative value function to converge to the optimum. Hence, the present algorithm is effective for obtaining the optimal control law both online and off-line.
From Corollary 2.3.2, we can see that if we obtain a Lyapunov function of system (2.3.1), then Ψ(xk) can be obtained. As a Lyapunov function is also difficult to obtain in general, we will give some simple methods to choose the function Ψ(xk).
First, it is recommended to use the utility function U(xk , 0) to start the iterative
θ -ADP algorithm, where we set V0 (xk ) = θ U(xk , 0) with a large θ . If we get a V1 (xk )
such that V1 (xk ) < V0 (xk ), then U(xk , 0) ∈ Ψ̄xk .
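A quick numerical check of this first rule might look as follows: one value-iteration backup is performed from V0(xk) = θU(xk, 0) and V1 < V0 is tested pointwise on sampled nonzero states. The scalar system, utility, grids, and θ are illustrative assumptions.

```python
import numpy as np

# Check of the first initialization rule: V0 = theta*U(x,0), one backup, then V1 < V0.
f = lambda x, u: 0.8 * x + u          # assumed example system
U = lambda x, u: x**2 + u**2          # assumed quadratic utility
theta = 10.0

xs = np.linspace(-1.0, 1.0, 41)
us = np.linspace(-1.0, 1.0, 81)

V0 = theta * U(xs, 0.0)                                   # V_0(x) = theta * U(x, 0)
V1 = np.array([np.min(U(x, us) + np.interp(f(x, us), xs, V0)) for x in xs])

nonzero = np.abs(xs) > 1e-12
print(np.all(V1[nonzero] < V0[nonzero]))   # True suggests U(x,0) can serve as Psi
```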
Second, we can use NN structures of ADP to generate an initial function Ψ (xk ).
We first randomly initialize the weights of the action NN. Given an arbitrary positive definite function G(xk) > 0, we train the critic NN to satisfy the equation
where Ψ̂ (xk ) and v̂(xk ) are outputs of critic and action networks, respectively. The NN
structure and the training rule can be seen in the next section. If the critic network
training is convergent, then let Ψ (xk ) = Ψ̂ (xk ) and the initial value function is
determined.
Remark 2.3.7 For many nonlinear systems and utility functions, such as [2, 52], we
can obtain U(xk , 0) ∈ Ψ̄xk . In this situation, we only need to set a large θ for the initial
condition and run the iterative θ-ADP algorithm (2.3.7)–(2.3.9). This can greatly reduce the amount of computation. If there does not exist a stable control law such that
that (2.3.18) is finite, then there may not exist a finite θ such that (2.3.12) is true.
In this case, we can find an initial admissible control law η(xk ) such that xk+N = 0,
where N ≥ 1 is an arbitrary positive integer. Let
V0(xk) = Σ_{τ=0}^{N} U(xk+τ, η(xk+τ)).
Then, using the algorithm (2.3.7)–(2.3.9), we can also obtain Vi (xk ) ≤ Vi+1 (xk ). The
details of proof are available in [43].
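A minimal sketch of this alternative initialization is given below: the system is rolled out under an admissible control law η that reaches the origin in N steps, and V0(xk) is the sum of utilities along the rollout. The scalar system and the deadbeat law η(x) = −0.8x (N = 1) are assumptions made only for illustration.

```python
import numpy as np

# Remark 2.3.7-style initialization: V_0(x_k) = sum_{tau=0}^{N} U(x_{k+tau}, eta(x_{k+tau})).
f = lambda x, u: 0.8 * x + u       # assumed example system
U = lambda x, u: x**2 + u**2       # assumed quadratic utility
eta = lambda x: -0.8 * x           # admissible (deadbeat) control law, reaches 0 in one step

def V0_from_rollout(xk, N=1):
    total, x = 0.0, xk
    for _ in range(N + 1):
        u = eta(x)
        total += U(x, u)           # accumulate the utility along the trajectory
        x = f(x, u)
    return total

print(V0_from_rollout(1.0))        # 1.64 for this example: 1^2 + 0.8^2, plus 0 at the origin
```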
Remark 2.3.8 The iterative θ -ADP algorithm is different from the policy itera-
tion algorithm in [1, 31]. For the policy iteration algorithm, an admissible control
sequence is necessary to initialize the algorithm, while for the iterative θ -ADP algo-
rithm developed in this section, the initial admissible control sequence is avoided.
Instead, we only need an arbitrary function Ψ (xk ) ∈ Ψ̄xk to start the algorithm.
Generally speaking, for nonlinear systems, admissible control sequences are diffi-
cult to obtain, while the function Ψ (xk ) can easily be obtained (for many cases,
U(xk , 0) ∈ Ψ̄xk ). Second, for PI algorithms in [1, 31], during every iteration step,
we need to solve a generalized Bellman equation to update the iterative control law,
while in the present iterative θ -ADP algorithm, the generalized Bellman equation is
unnecessary. Therefore, the iterative θ -ADP algorithm has more advantages than the
PI algorithm.
Remark 2.3.9 Generally speaking, NNs are used to implement the present iterative θ-ADP algorithm. In order to approximate the functions Vi(xk) and vi(xk), a large number of states xk in the state space is required to train the NNs. In this situation, as we have declared in Step 1, we should randomly choose an array of initial states xk in the state space to initialize the algorithm. For all i = 0, 1, . . ., according to the array of states xk, we can obtain the iterative value function Vi(xk) and the iterative control law vi(xk) by training NNs, respectively. To the best of our knowledge, all NN implementations of ADP require a large number of states xk in the state space to approximate the iterative control laws and the iterative value functions, such as [36, 48]. The detailed NN implementation of the present iterative θ-ADP algorithm can be found in [44].
To evaluate the performance of our iterative θ -ADP algorithm, we choose two exam-
ples with quadratic utility functions for numerical experiments.
Example 2.3.1 This example is chosen from [43, 52] with modifications. Consider the nonlinear system

xk+1 = [0.8 x1k^2 exp(x2k), 0.9 x2k^3]T + diag(−0.2, −0.2) uk,
where xk = [x1k, x2k]T and uk = [u1k, u2k]T are the state and control variables, respectively. The initial state is x0 = [1, −1]T. The cost function is the quadratic form expressed as

J(x0, u0^∞) = Σ_{k=0}^{∞} (xkT Q xk + ukT R uk),

where the matrices Q and R are given as identity matrices with suitable dimensions.
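For reference, the plant and utility of this example can be written directly in code; the one-step evaluation at x0 at the end is only an illustrative usage, and any grids or solvers built on top of it would be our own choices.

```python
import numpy as np

# The system and quadratic utility of Example 2.3.1.
def F(x, u):
    x1, x2 = x
    return np.array([0.8 * x1**2 * np.exp(x2) - 0.2 * u[0],
                     0.9 * x2**3 - 0.2 * u[1]])

def U(x, u):
    return float(x @ x + u @ u)        # Q = R = identity

x0 = np.array([1.0, -1.0])
u_zero = np.zeros(2)
print(F(x0, u_zero), U(x0, u_zero))    # one-step prediction and its utility
```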
Two NNs are used to implement the iterative θ -ADP algorithm. The critic and
action networks are both chosen as three-layer BP NNs with the structures of 2–8–1
and 2–8–2, respectively. For each iteration step, the critic and action networks are
trained for 200 steps using the learning rate of 0.02 so that the NN training errors
become less than 10−6 . To show the effectiveness of the iterative θ -ADP algorithm,
we choose four values of θ (namely, θ = 3.5, 5, 7, 10) to initialize the algorithm. Let the algorithm run for 35 iteration steps for each θ to obtain the optimal value
function. The convergence curves of the value functions are shown in Fig. 2.14.
From Fig. 2.14, we can see that all the convergence curves of value functions are
monotonically nonincreasing.

Fig. 2.14 The convergence of value functions for Example 2.3.1. a θ = 3.5. b θ = 5. c θ = 7. d θ = 10

For convenience of analysis, we let

θ0 = lim_{xk→0} U(xk, v0(xk)) / [U(xk, 0) − U(F(xk, v0(xk)), 0)],
where v0 (xk ) is obtained in (2.3.7). Then, for θ = 3.5, we have θ0 = 1.9015. For
θ = 5, we have θ0 = 1.90984. For θ = 7, we have θ0 = 2.04469. For θ = 10, we
have θ0 = 2.16256. We can also see that if the iterative value function is convergent, then it converges to the optimum, and the optimal value function is independent of the parameter θ. We apply the optimal control law to the system for Tf = 20 time steps and obtain the following results. The optimal state and control trajectories are shown in Fig. 2.15a and b, respectively.
From the above simulation results, we can see that if we choose θ large enough
to initialize the iterative θ -ADP algorithm, the iterative value function Vi (xk ) will be
monotonically nonincreasing and converge to the optimum, which verifies the effec-
tiveness of the present algorithm. Next, we increase the complexity of the system: we consider the situation where the autonomous system is unstable and show that the present iterative θ-ADP algorithm is still effective.
Fig. 2.15 The optimal trajectories. a Optimal state trajectories. b Optimal control trajectories
Fig. 2.16 The convergence of value functions. a θ = 3. b θ = 5. c θ = 7. d θ = 10
where xk = [x1k , x2k ]T denotes the system state vector and uk denotes the control.
The value function is the same as the one in Example 2.3.1.
The initial state is x0 = [1, −1]T . From system (2.3.29), we can see that xk = 0
is an equilibrium state and the autonomous system F(xk , 0) is unstable. We also
use NNs to implement the iterative ADP algorithm, where four values of θ (namely, θ = 3, 5, 7, 10) are chosen to initialize the algorithm, and the convergence curves of the
value functions are shown in Fig. 2.16.
Applying the optimal control law to the system for Tf = 15 time steps, the optimal
state and control trajectories are shown in Fig. 2.17a and b, respectively.
Fig. 2.17 The optimal trajectories. a Optimal state trajectories. b Optimal control trajectory
2.4 Conclusions
In this chapter, we developed several VI-based ADP methods for optimal control
problems of discrete-time nonlinear systems. First, a GVI-based ADP scheme is
established to obtain optimal control for discrete-time affine nonlinear systems. Then,
the GVI ADP algorithm is used to solve the optimal tracking control problem for
discrete-time nonlinear systems as a generalization. Furthermore, as a case study, the
VI-based ADP approach is developed to derive optimal control for discrete-time non-
linear systems with unknown dynamics and input constraints. It is emphasized that
using the ADP approach, affine and nonaffine nonlinear systems can be treated uni-
formly. Next, an iterative θ -ADP technique is presented to solve the optimal control
problem of discrete-time nonlinear systems. Convergence analysis and optimality
analysis results are established for the iterative θ -ADP algorithm. Simulation results
are provided to show the effectiveness of the present algorithm.
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Abu-Khalaf M, Lewis FL, Huang J (2008) Neurodynamic programming and zero-sum games
for constrained control systems. IEEE Trans Neural Netw 19(7):1243–1252
3. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2007) Adaptive critic designs for discrete-time zero-
sum games with application to H∞ control. IEEE Trans Syst Man Cybern.-Part B: Cybern
37(1):240–247
4. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2007) Model-free Q-learning designs for linear
discrete-time zero-sum games with application to H-infinity control. Automatica 43(3):473–
481
5. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B:
Cybern 38(4):943–949
6. Apostol TM (1974) Mathematical analysis: A modern approach to advanced calculus. Addison-
Wesley, Boston, MA
7. Athans M, Falb PL (1966) Optimal control: an introduction to the theory and its applications.
McGraw-Hill, New York
8. Beard R, Saridis G, Wen J (1997) Galerkin approximations of the generalized Hamilton–
Jacobi–Bellman equation. Automatica 33(12):2158–2177
9. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton, NJ
10. Berkovitz LD, Medhin NG (2013) Nonlinear optimal control theory. CRC Press, Boca Raton,
FL
11. Bertsekas DP (2005) Dynamic programming and optimal control. Athena Scientific, Belmont,
MA
12. Bitmead RR, Gever M, Petersen IR (1985) Monotonicity and stabilizability properties of solu-
tions of the Riccati difference equation: Propositions, lemmas, theorems, fallacious conjectures
and counterexamples. Syst Control Lett 5:309–315
13. Dierks T, Thumati BT, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5):851–860
14. Dreyfus SE, Law AM (1977) The art and theory of dynamic programming. Academic Press,
New York
15. Fu J, He H, Zhou X (2011) Adaptive learning and control for MIMO system based on adaptive
dynamic programming. IEEE Trans Neural Netw 22(7):1133–1148
16. Hagan MT, Menhaj MB (1994) Training feedforward networks with the Marquardt algorithm.
IEEE Trans Neural Netw 5(6):989–993
17. Heydari A, Balakrishnan SN (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
18. Howard RA (1960) Dynamic programming and Markov processes. MIT Press, Cambridge,
MA
19. Huang Y, Liu D (2014) Neural-network-based optimal tracking control scheme for a class
of unknown discrete-time nonlinear systems using iterative ADP algorithm. Neurocomputing
125:46–56
20. Koppel LB (1968) Introduction to control theory with applications to process control. Prentice-
Hall, Englewood Cliffs, NJ
21. Levin AU, Narendra KS (1993) Control of nonlinear dynamical systems using neural networks:
controllability and stabilization. IEEE Trans Neural Netw 4(2):192–206
22. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken, NJ
23. Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
24. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
25. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
26. Liao X, Wang L, Yu P (2007) Stability of dynamical systems. Elsevier, Amsterdam, Netherlands
27. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
28. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
29. Lyshevski SE (1998) Optimal control of nonlinear continuous-time systems: design of bounded
controllers via generalized nonquadratic functionals. In: Proceedings of the American control
conference. pp 205–209
30. Michel AN, Hou L, Liu D (2015) Stability of dynamical systems: On the role of monotonic
and non-monotonic Lyapunov functions. Birkhäuser, Boston, MA
31. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern-Part C: Appl Rev 32(2):140–153
32. Navarro-Lopez EM (2007) Local feedback passivation of nonlinear discrete-time systems
through the speed-gradient algorithm. Automatica 43(7):1302–1306
33. Primbs JA, Nevistic V (2000) Feasibility and stability of constrained finite receding horizon
control. Automatica 36(7):965–971
34. Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8(5):997–
1007
35. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc-Control
Theory Appl 153(5):567–574
36. Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
37. Sira-Ramirez H (1991) Non-linear discrete variable structure systems in quasi-sliding mode.
Int J Control 54(5):1171–1187
38. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge,
MA
39. Vincent TL, Grantham WJ (1997) Nonlinear and optimal control systems. Wiley, New York
40. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
41. Wang D, Liu D (2013) Neuro-optimal control for a class of unknown nonlinear dynamic systems
using SN-DHP technique. Neurocomputing 121:218–225
42. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
43. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36
44. Wei Q, Liu D (2014) A novel iterative θ-adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
45. Wei Q, Liu D, Xu Y (2014) Neuro-optimal tracking control for a class of discrete-time non-
linear systems via generalized value iteration adaptive dynamic programming. Soft Comput
20(2):697–706
46. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. Gen Syst Yearbook 22:25–38
47. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
48. Yang Q, Jagannathan S (2012) Reinforcement learning controller design for affine nonlin-
ear discrete-time systems using online approximators. IEEE Trans Syst Man Cybern-Part B:
Cybern 42(2):377–390
49. Zhang H, Huang J, Lewis FL (2009) An improved method in receding horizon control with
updating of terminal cost function. In: Valavanis KP (ed) Applications of intelligent control to
engineering systems. Springer, New York, pp 365–393
50. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
51. Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of
discrete-time affine nonlinear systems with control constraints. IEEE Trans Neural Netw
20(9):1490–1503
52. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans
Syst Man Cybern-Part B: Cybern 38(4):937–942
Chapter 3
Finite Approximation Error-Based Value
Iteration ADP
3.1 Introduction
The ADP approach, proposed by Werbos [45–47], has played an important role in seeking approximate solutions to dynamic programming problems as a way to solve optimal control problems [1, 8, 17, 19, 21, 25, 27–29, 34, 37, 40, 42, 52]. In these control strategies, iterative methods are an effective tool for obtaining the solution of the Bellman equation indirectly, and they have received much attention. There are two main classes of iterative ADP algorithms: policy iteration and value iteration [16, 31].

Although iterative ADP algorithms have attracted more and more attention [14, 20, 24, 30, 43, 44, 48, 53], nearly all of them require the control of each iteration to be obtained accurately. These iterative ADP algorithms can be called "accurate iterative ADP algorithms." For most real-world control systems, however, the accurate control laws in the iterative ADP algorithms cannot be obtained. For example, during the implementation of an iterative ADP algorithm, approximation structures, such as neural networks and fuzzy systems, are used to approximate the iterative value functions and the iterative control laws. No matter what kind of neural networks [12] or fuzzy systems are used, and no matter what approximation precision is achieved, there always exist approximation errors between the approximated functions and the expected ones. This shows that the accurate value functions and control laws cannot be obtained in iterative ADP algorithms for real-world control systems [6, 9]. When the accurate iterative control laws cannot be obtained, the convergence properties established for the accurate iterative ADP algorithms may not hold anymore. It is therefore necessary to study the convergence and stability properties of iterative ADP algorithms when the iterative control cannot be obtained accurately. In this chapter, several new iterative ADP schemes with finite approximation errors are developed to solve infinite-horizon optimal control problems, with convergence, stability, and optimality proofs [22, 35, 37, 39, 43].
J(x0, u0) = Σ_{k=0}^{∞} U(xk, uk), (3.2.2)
Assumption 3.2.3 System (3.2.1) is controllable in the sense that there exists a
continuous control law on Ω that asymptotically stabilizes the system.
For optimal control problems, the designed feedback control must not only stabi-
lize the system (3.2.1) on Ω but also guarantee that the cost function (3.2.2) is finite,
i.e., the control must be admissible (cf. Definition 2.2.1 or [4]). Define A (Ω) as the
set of admissible control sequences associated with the controllable set Ω of states.
The optimal cost function is defined as

J∗(xk) = inf_{uk} { J(xk, uk) : uk ∈ A(Ω) }.
We can see that if we want to obtain the optimal control u ∗k , we must obtain the
optimal cost function J ∗ (xk ). Generally speaking, J ∗ (xk ) is unknown before the
whole control sequence u k is considered. If we adopt the traditional dynamic pro-
gramming method to obtain the optimal cost function at every time step, then we have
to face the “curse of dimensionality.” This implies that the optimal control sequence
is nearly impossible to obtain analytically by the Bellman equation (3.2.4). Hence,
ADP approaches will be investigated. In particular, in this chapter, we study ADP
algorithms under a more realistic situation, where neural network approximations are not assumed to be perfect as they were in the previous chapter.
In Chap. 2, a θ -ADP algorithm is developed to obtain the optimal cost function and
optimal control law iteratively. We have shown that in the iterative θ -ADP algorithm
(2.3.7)–(2.3.9), the iterative value function Vi (xk ) converges to the optimal cost
function J ∗ (xk ) and J ∗ (xk ) = inf u k {J (xk , u k ) : u k ∈ A (Ω)}, satisfying the Bellman
equation (3.2.4) for any controllable state xk ∈ Rn . In fact, due to the existence of
approximation errors, the accurate iterative control law cannot be obtained in general.
In this case, the iterative ADP algorithm with no approximation errors may only exist in ideal situations. In addition, new analysis methods should be developed [22, 39].
Next, we present the iterative ADP algorithm with finite approximation errors.
where V̂0 (xk+1 ) = θ Ψ (xk+1 ) and ρ0 (xk ) is the approximation error of the control
law v̂0 (xk ). For i = 1, 2, . . ., the iterative ADP algorithm will iterate between
V̂i(xk) = min_{uk} {U(xk, uk) + V̂i−1(xk+1)} + πi(xk)
        = U(xk, v̂i−1(xk)) + V̂i−1(F(xk, v̂i−1(xk))) + πi(xk), (3.2.7)

and

v̂i(xk) = arg min_{uk} {U(xk, uk) + V̂i(xk+1)} + ρi(xk)
        = arg min_{uk} {U(xk, uk) + V̂i(F(xk, uk))} + ρi(xk), (3.2.8)
where πi (xk ) and ρi (xk ) are approximation errors of the iterative value functions and
the iterative control laws, respectively.
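To make the role of the error terms concrete, the following sketch perturbs each exact value-iteration backup on a state grid by a bounded disturbance standing in for πi(xk). The scalar system, grids, θ, and the error level are illustrative assumptions, not the NN implementation discussed later in this chapter.

```python
import numpy as np

# Error-perturbed value iteration in the spirit of (3.2.7)-(3.2.8): each exact grid
# backup is followed by a bounded disturbance pi_i standing in for the approximation error.
rng = np.random.default_rng(0)
f = lambda x, u: 0.8 * x + u          # assumed example system
U = lambda x, u: x**2 + u**2          # assumed quadratic utility
xs = np.linspace(-1.0, 1.0, 41)
us = np.linspace(-1.0, 1.0, 81)

def approx_vi(eps_bar, theta=10.0, iters=30):
    V = theta * xs**2                                 # V_hat_0 = theta * Psi
    for _ in range(iters):
        backup = np.array([np.min(U(x, us) + np.interp(f(x, us), xs, V)) for x in xs])
        V = backup + rng.uniform(-eps_bar, eps_bar, xs.shape)   # V_hat_i = Gamma_i + pi_i
    return V

V_exact = approx_vi(0.0)
V_noisy = approx_vi(1e-2)
print(np.max(np.abs(V_noisy - V_exact)))   # remains of the order of the error bound
```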
We are now in a position to present the following theorem.
where V̂i (xk ) is defined in (3.2.7) and u k can accurately be obtained in Rm . Let the
initial value function Γ0 (xk ) = V̂0 (xk ) = θ Ψ (xk ). If for i = 1, 2, . . ., there exists a
finite constant ε̄ ≥ 0 such that
for i = 1, 2, . . .. Then, we can get −ε̄ ≤ V̂1 (xk ) − V1 (xk ) ≤ ε̄, which proves
|V̂1 (xk ) − V1 (xk )| ≤ ε̄. Assume that (3.2.11) holds for i = l − 1, l = 2, 3, . . .. Then,
= Vl (xk ) − (l − 1)ε̄,
and
= Vl (xk ) + (l − 1)ε̄.
Then, using (3.2.12) and similarly considering the left side of (3.2.13), we can get
(3.2.11). This completes the proof.
From Theorem 3.2.1, we can see that if we let a finite constant −ε̄ ≤ ε ≤ ε̄ such
that
V̂i (xk ) − Γi (xk ) ≤ ε (3.2.14)
J ∗ (F(xk , u k )) ≤ γ U (xk , u k ),
and
J ∗ (xk ) ≤ V0 (xk ) ≤ β J ∗ (xk ),
hold uniformly. If there exists 0 < η < ∞ such that (3.2.16) holds uniformly, then
V̂i(xk) ≤ η [ 1 + Σ_{j=1}^{i} γ^j η^{j−1}(η − 1)/(γ + 1)^j + γ^i η^i (β − 1)/(γ + 1)^i ] J∗(xk), (3.2.17)

where we define Σ_{j}^{i}(·) = 0 for all j > i and i, j = 0, 1, . . ..
≤ min_{uk} { [ 1 + γ ( Σ_{j=1}^{l−1} γ^{j−1} η^{j−1}(η − 1)/(γ + 1)^j + γ^{l−1} η^{l−1}(ηβ − 1)/(γ + 1)^l ) ] U(xk, uk)
           + [ η ( 1 + Σ_{j=1}^{l−1} γ^j η^{j−1}(η − 1)/(γ + 1)^j + γ^{l−1} η^{l−1}(β − 1)/(γ + 1)^{l−1} )
           − ( Σ_{j=1}^{l−1} γ^{j−1} η^{j−1}(η − 1)/(γ + 1)^j + γ^{l−1} η^{l−1}(ηβ − 1)/(γ + 1)^l ) ] J∗(F(xk, uk)) }
 = [ 1 + Σ_{j=1}^{l} γ^j η^{j−1}(η − 1)/(γ + 1)^j + γ^l η^l (β − 1)/(γ + 1)^l ] J∗(xk). (3.2.18)
Then, according to (3.2.16), we can obtain (3.2.17) which proves the conclusion for
i = 0, 1, . . .. The proof is complete.
From (3.2.17), we can see that for arbitrary finite i, η, and β, there exists a bounded error between the iterative value function V̂i(xk) and the optimal cost function J∗(xk). However, as i → ∞, the bound of the approximation errors may increase to infinity. Thus, in the following theorems, we will establish the convergence properties of the iterative ADP algorithm (3.2.6)–(3.2.8) using the error bound method.
holds, then as i → ∞, the value function V̂i (xk ) in the iterative ADP algorithm
(3.2.6)–(3.2.8) is uniformly convergent into a bounded neighborhood of the optimal
cost function J ∗ (xk ), i.e.,
lim_{i→∞} V̂i(xk) = V̂∞(xk) ≤ [ η / (1 − γ(η − 1)) ] J∗(xk). (3.2.20)
γ^j η^{j−1}(η − 1)/(γ + 1)^j
Combining (3.2.22) and (3.2.23), we can obtain (3.2.20). The proof is complete.
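The behavior of the bound is easy to check numerically. The snippet below evaluates the bracketed factor of (3.2.17) for increasing i and compares it with the limit η/(1 − γ(η − 1)) in (3.2.20); the particular values of γ, η, and β are arbitrary choices satisfying γ(η − 1) < 1.

```python
import numpy as np

# Numerical illustration of (3.2.17) and its limit (3.2.20).
gamma, eta, beta = 2.0, 1.2, 3.0            # arbitrary values with gamma*(eta-1) = 0.4 < 1

def bound_coeff(i):
    """eta * [1 + sum_{j=1}^{i} gamma^j eta^(j-1) (eta-1)/(gamma+1)^j
              + gamma^i eta^i (beta-1)/(gamma+1)^i]  -- the factor in (3.2.17)."""
    j = np.arange(1, i + 1)
    s = np.sum(gamma**j * eta**(j - 1) * (eta - 1) / (gamma + 1)**j)
    tail = gamma**i * eta**i * (beta - 1) / (gamma + 1)**i
    return eta * (1.0 + s + tail)

for i in (0, 1, 5, 20, 50):
    print(i, bound_coeff(i))
print("limit:", eta / (1 - gamma * (eta - 1)))   # the bound in (3.2.20)
```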
For the iterative ADP algorithm with finite approximation errors, we can see that
for different approximation errors, the limits of the iterative value function V̂i (xk )
are different. This property can be proved by the following theorem.
be the least upper bound of the limit of the iterative value function. If Theorem 3.2.3
holds for all xk ∈ Rn , then V̄∞ (xk ) is a monotonically increasing function of the
approximation error η.
Proof From (3.2.24), we can see that the least upper bound of the limit of the iterative
value function V̄∞ (xk ) is a differentiable function of the approximation error η.
Hence, we can take the derivative of the approximation error η on both sides of the
equation (3.2.24). Then, we can obtain
∂ V̄∞ (xk ) 1+γ
= J ∗ (xk ) > 0.
∂η (γ (η − 1) − 1)2
According to the definitions of iterative value functions V̂i (xk ) and Γi (xk ) in
(3.2.7) and (3.2.9), if for i = 0, 1, . . ., we let
In this section, the approximation error η which satisfies (3.2.16) is used to analyze
the convergence properties of the iterative ADP algorithm. Generally speaking, the
approximation error ε is obtained by (3.2.14) instead of obtaining η. So, if we use ε
to express the approximation error, then a transformation is needed between the two
approximation errors ε and η.
For i = 0, 1, . . ., according to (3.2.14) and (3.2.16), for any εi expressed in
(3.2.25), there exists a ηi such that
V̂i(xk) − εi(xk) = V̂i(xk) / ηi(xk).
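In code the transformation is a one-line rearrangement, ηi = V̂i/(V̂i − εi); the numbers below are only illustrative.

```python
# Transformation between the two error measures, from V_hat_i - eps_i = V_hat_i / eta_i.
def eta_from_eps(V_hat, eps):
    """eta_i = V_hat_i / (V_hat_i - eps_i), valid for 0 <= eps_i < V_hat_i."""
    return V_hat / (V_hat - eps)

print(eta_from_eps(5.0, 0.05))   # a small absolute error on V_hat = 5 gives eta ~= 1.0101
```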
In the case of linear systems, the control law is linear and the cost function is quadratic.
In the nonlinear case, this is not necessarily true, and therefore we use backpropaga-
tion (BP) neural networks to approximate Vi (xk ) and vi (xk ).
For a given nonlinear function ξ(x) ∈ Rm, a BP NN with two layers of weights can be used to approximate it. Assume that the number of hidden layer neurons is denoted by L, the weight matrix between the input layer and the hidden layer is denoted by Yξ ∈ R^{L×n}, and the weight matrix between the hidden layer and the output layer is denoted by Wξ ∈ R^{m×L}. Then, the NN representation of ξ is given by

ξ(x) = Wξ∗ σ(Yξ∗ x) + ε(x),

where σ(·) = tanh(·) is the componentwise activation function, Yξ∗ and Wξ∗ are the ideal weight parameters of Yξ and Wξ, respectively, and ε(x) is the reconstruction error. The NN approximation of the function ξ is expressed by

ξ̂(x) = Wξ σ(Yξ x).
Remark 3.2.1 In Chap. 2 and several other chapters, weight matrices in the NN
expressions are in transposed forms. It is a matter of convenience, for instance, to
use Wξ or WξT in NN expressions. In this chapter, to avoid cumbersome notations,
we opt to use weight matrices without transposition in NN expressions.
Here, there are two networks, namely the critic network and the action network. Both neural networks are chosen as three-layer feedforward networks. The whole structural diagram is shown in Fig. 3.1.
The objective function to be minimized in the critic network is Eci(k) = (1/2) eci^2(k). So, the gradient-based weight update rule [13, 28] for the critic network is given by

Xc(p + 1) = Xc(p) + ΔXc(p),

ΔXc(p) = αc [ −∂Eci(k)/∂Xc(p) ],

∂Eci(k)/∂Xc(p) = [ ∂Eci(k)/∂V̂i(xk) ] [ ∂V̂i(xk)/∂Xc(p) ],
where p represents the inner-loop iteration step for updating critic NN weight para-
meters, αc > 0 is the learning rate of critic network, and X c is the weight matrix of
the critic network which can be replaced by Wc and Yc , respectively.
If we define

ql(p) = Σ_{j=1}^{n} Yc,lj(p) xjk, l = 1, 2, . . . , Lc,

rl(p) = tanh(ql(p)), l = 1, 2, . . . , Lc,

then

V̂i(xk) = Σ_{l=1}^{Lc} Wcl(p) rl(p),
The weights of input to hidden layer of the critic network are updated as
The weights of the action network are updated to minimize the performance error measure Eai(k) = (1/2) eai^T(k) eai(k). The weight updating algorithm is similar to the one for the critic network. By the gradient descent rule, we can obtain
Xa(p + 1) = Xa(p) + ΔXa(p),

ΔXa(p) = βa [ −∂Eai(k)/∂Xa(p) ],
where βa > 0 is the learning rate of action network and X a is the weight matrix of
the action network which can be replaced by Wa and Ya .
If we define

gl(p) = Σ_{j=1}^{n} Ya,lj(p) xjk, l = 1, . . . , La,

hl(p) = tanh(gl(p)), l = 1, . . . , La,

then

v̂it(xk) = Σ_{l=1}^{La} Wa,tl(p) hl(p), t = 1, 2, . . . , m,
where L a is the number of hidden nodes in the action network, v̂i (xk ) = (v̂i1 (xk ), . . . ,
v̂im (xk ))T ∈ Rm , Ya = (Ya,lj ) ∈ R L a ×n , and Wa = (Wa,tl ) ∈ Rm×L a . By applying the
chain rule, the adaptation of the action network is summarized as follows.
The weight of hidden-to-output layer of the action network is updated as
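A minimal numpy sketch of the critic and action updates described in this subsection is given below. The three-layer tanh structure and the gradient-descent rules follow the text; the error signals eci(k) and eai(k) are assumed to be the differences between the network outputs and externally supplied targets, and the hidden sizes, learning rates, and placeholder targets are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, Lc, La = 2, 2, 8, 8                    # state/control dimensions, hidden sizes (assumed)
Yc, Wc = 0.1 * rng.standard_normal((Lc, n)), 0.1 * rng.standard_normal(Lc)
Ya, Wa = 0.1 * rng.standard_normal((La, n)), 0.1 * rng.standard_normal((m, La))
alpha_c, beta_a = 0.02, 0.02                 # learning rates (assumed)

def critic(x):                               # V_hat(x) = Wc * tanh(Yc x)
    r = np.tanh(Yc @ x)
    return Wc @ r, r

def action(x):                               # v_hat(x) = Wa * tanh(Ya x)
    h = np.tanh(Ya @ x)
    return Wa @ h, h

def critic_step(x, target):
    """One gradient step on E_c = 0.5*e_c^2 with e_c = V_hat(x) - target (assumed target)."""
    global Wc, Yc
    V, r = critic(x)
    ec = V - target
    grad_Wc = ec * r                                      # dE_c/dWc (hidden-to-output)
    grad_Yc = ec * np.outer(Wc * (1.0 - r**2), x)         # dE_c/dYc (input-to-hidden)
    Wc, Yc = Wc - alpha_c * grad_Wc, Yc - alpha_c * grad_Yc

def action_step(x, u_target):
    """One gradient step on E_a = 0.5*e_a^T e_a with e_a = v_hat(x) - u_target (assumed)."""
    global Wa, Ya
    v, h = action(x)
    ea = v - u_target
    grad_Wa = np.outer(ea, h)                             # dE_a/dWa (hidden-to-output)
    grad_Ya = np.outer((Wa.T @ ea) * (1.0 - h**2), x)     # dE_a/dYa (input-to-hidden)
    Wa, Ya = Wa - beta_a * grad_Wa, Ya - beta_a * grad_Ya

# One inner-loop step at a sampled state with placeholder targets:
critic_step(np.array([1.0, -1.0]), target=2.0)
action_step(np.array([1.0, -1.0]), u_target=np.zeros(m))
```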
Consider the following discrete-time nonaffine nonlinear system given by [14, 33]
[Figure: surface of the approximation error over the state space and iteration steps]
∂F(xk, uk)/∂xk |(0,0) = 1,
Fig. 3.3 Trajectories of the iterative value functions. a Approximation error ε = 10−6. b Approximation error ε = 10−4. c Approximation error ε = 10−3. d Approximation error ε = 10−1
Fig. 3.4 Control and state trajectories. a Approximation error ε = 10−8. b Approximation error ε = 10−4. c Approximation error ε = 10−3. d Approximation error ε = 10−1
For approximation error ε = 10−8, the trajectories of the control and state are displayed in Fig. 3.4a. For approximation error ε = 10−4, the trajectories of the control and state are displayed in Fig. 3.4b. When the approximation error is ε = 10−3, we can see that the iterative value function is not monotonic; the trajectories of the control and state are displayed in Fig. 3.4c. When the approximation error is ε = 10−1, we can see that the iterative value function is not convergent. In this situation, the control system is not stable, and the trajectories of the control and state are displayed in Fig. 3.4d.
With the development of digital computers, numerical control has attracted more and more attention from researchers [15, 49, 50]. In real-world implementations, especially for numerical control systems, for each i, the accurate iterative value function Vi(xk) and the accurate iterative control law vi(xk) are generally impossible to obtain. For the situation where the control uk ∈ A, the iterative θ-ADP algorithm in Sect. 2.3 may be invalid, where A denotes the set of numerical controls and we assume 0 ∈ A. First, for numerical control systems, the set of numerical controls A is discrete. This means that there are only finitely many elements in the set of numerical controls A, which implies that the iterative control law and iterative value function can only be obtained with
errors. Second, as u k ∈ A, the convergence property of the iterative value function
cannot be guaranteed, and the stability of the system under such controls cannot be
proved either. Furthermore, even if we solve the iterative control law and the iterative
value function at every iteration step, it is not clear whether the errors between the
iterative value functions Vi (xk ) and the optimal cost function J ∗ (xk ) are finite or
not for all i. Thus, the optimal cost function and optimal control law are nearly
impossible to obtain by the iterative θ -ADP algorithm in Sect. 2.3.
In this section, a numerical iterative θ -ADP algorithm is developed to obtain the
numerical optimal controller for nonaffine nonlinear system (3.2.1) [36, 37].
In the present numerical iterative θ -ADP algorithm, the value functions and control
laws are updated by iterations, with the iteration index i increasing from 0 to infinity.
For xk ∈ Rn, let the initial value function be

V̂0(xk) = θ Ψ(xk), (3.3.1)

where θ > 0 is a large finite positive constant. The numerical iterative control law
v̂0 (xk ) can be computed as follows:
v̂0(xk) = arg min_{uk∈A} {U(xk, uk) + V̂0(xk+1)}
        = arg min_{uk∈A} {U(xk, uk) + V̂0(F(xk, uk))}, (3.3.2)
where V̂0 (xk+1 ) = θ Ψ (xk+1 ). For i = 1, 2, . . ., the numerical iterative θ -ADP algo-
rithm will iterate between the iterative value functions
V̂i(xk) = min_{uk∈A} {U(xk, uk) + V̂i−1(xk+1)}
Proof The lemma can be proved by mathematical induction and by following similar
steps as in the proof of Theorem 3.2.1. First, let i = 1. We have
= V1 (xk ).
Then, according to (3.3.6), we can get V̂1 (xk ) − V1 (xk ) ≤ ε. Assume that (3.3.7)
holds for i − 1. Then, for i, we have
= Vi (xk ) + (i − 1)ε.
Lemma 3.3.1 shows that although the approximation error for each single step
is finite and may be small, as the iteration index i increases, the bounds of the approximation errors between V̂i(xk) and Vi(xk) may also increase to infinity. To
overcome these difficulties, we must discuss the convergence and stability properties
of the iterative ADP algorithm in numerical implementation with finite approxima-
tion errors. For convenience of analysis, we perform transformation of approximation
errors. According to the definitions of V̂i (xk ) and Γi (xk ) in (3.3.3) and (3.3.5), we
have Γi (xk ) ≤ V̂i (xk ). Then, for i = 0, 1, . . ., there exists a η ≥ 1 such that
and
J ∗ (xk ) ≤ V0 (xk ) ≤ β J ∗ (xk ) (3.3.10)
hold uniformly. If there exists an η, 1 ≤ η < ∞, such that (3.3.8) is satisfied and

η ≤ 1 + (β − 1)/(γβ), (3.3.11)
then the iterative value function V̂i (xk ) converges to a bounded neighborhood of
J ∗ (xk ), as i → ∞.
This theorem can be proved following the same procedure as Theorem 3.2.3 by
noting that condition (3.3.11) implies (3.2.19) of Theorem 3.2.3. We omit the details.
Proof As V̂0 (xk ) = V0 (xk ) = θ Ψ (xk ), V̂0 (xk ) is a positive definite function for i = 0.
Using the mathematical induction, assume that the iterative value function V̂i (xk ),
i = 0, 1, . . ., is positive definite. Then, according to Assumptions 3.2.1–3.2.3, we
can get
V̂i (0) = U (0, v̂i−1 (0)) + V̂i−1 (F(0, v̂i−1 (0))) = 0
where v̄i (xk ) = arg minu k {U (xk , u k ) + V̄i (xk+1 )}. From Theorem 3.2.2, V̄i (xk ) =
χi J ∗ (xk ) is the upper bound of any iterative value function. Thus, we have Pi+1 (xk ) ≤
V̄i+1 (xk ) ≤ V̄i (xk ), which implies
Hence, V̄i (xk ) is a Lyapunov function and v̄i (xk ) is an asymptotically stable control
law for i = 0, 1, . . .. As v̄i (xk ) is an asymptotically stable control law, we have
xk+N → 0 as N → ∞, that is V̄i (xk+N ) → 0. Since 0 < V̂i (xk ) ≤ V̄i (xk ) holds for
all xk , then we can get 0 < V̂i (xk+N ) ≤ V̄i (xk+N ) as N → ∞. So, V̂i (xk+N ) → 0 as
N → ∞. As V̂i (xk ) is a positive definite function, we can conclude xk+N → 0 as
N → ∞ under the control law v̂i (xk ). The proof is complete.
Although Theorem 3.3.1 gives a convergence criterion, we can see that the parameters η, γ, and β are very difficult to obtain, and hence the convergence criterion (3.3.11) is quite difficult to verify. To overcome this difficulty, a new convergence condition must be developed to guarantee the convergence of the numerical iterative θ-ADP algorithm. For the convenience of analysis, we define a new value function as

Vˆ0(xk, 0) = V̂0(xk),
Vˆi(xk, v̂i(xk)) = U(xk, v̂i(xk)) + V̂i−1(F(xk, v̂i(xk))), i = 1, 2, . . . , (3.3.12)
where we can see that V̂i (xk ) = Vˆi (xk , v̂i (xk )). We have the following definition.
Definition 3.3.1 The iterative value function Vˆi (xk , v̂i (xk )) is a Lipschitz continuous
function for all v̂i(xk), if there exists an L ∈ R such that

|Vˆi(xk, v̂i(xk)) − Vˆi(xk, v̂′i(xk))| ≤ L ‖v̂i(xk) − v̂′i(xk)‖, (3.3.13)
A = {u(p1, . . . , pm) : p1, . . . , pm ∈ Z, 1 ≤ p1 ≤ P1, . . . , 1 ≤ pm ≤ Pm} (3.3.14)

to denote all the control elements in A, where P1, . . . , Pm are all positive integers.
Thus, for all xk ∈ Rn and i = 0, 1, . . ., there exists a sequence of positive numbers
p1i , . . . , pmi such that
u( p1i , . . . , pmi ) = v̂i (xk ), (3.3.15)
Ā( p1i , . . . , pmi ) = {u( p̄1i , . . . , p̄mi ) : ( p̄1i , . . . , p̄mi ) ∈ A, | pij − p̄ij | ≤ r }, (3.3.17)
We can see that Γi (xk ) = Υi (xk , ṽi (xk )). As ṽi (xk ) cannot be obtained, it is very
difficult to analyze its properties. However, for any u(p1i, . . . , pmi), we can obtain Ā(p1i, . . . , pmi) by (3.3.17) immediately. So, if ṽi(xk) is inside Ā(p1i, . . . , pmi) with the radius r ≥ 1,
then the properties of ṽi (xk ) can be obtained. We will analyze the relationship between
ṽi (xk ) and Ā( p1i , . . . , pmi ) next. Before that, some lemmas are necessary.
Lemma 3.3.2 Let O = ( p1i , . . . , pmi ) denote the origin of the m-dimensional coor-
dinate system, and let O L be an arbitrary vector in the m-dimensional space. If we
let ϑ j , j = 1, 2, . . . , m, be the intersection angle between O L and the jth coordinate
axis, then we have
Σ_{j=1}^{m} cos^2 ϑj = 1.
where ‖OL‖ = √((l1i − p1i)^2 + · · · + (lmi − pmi)^2). Then,

Σ_{j=1}^{m} cos^2 ϑj = [ (l1i − p1i)^2 + · · · + (lmi − pmi)^2 ] / [ (l1i − p1i)^2 + · · · + (lmi − pmi)^2 ] = 1.
Lemma 3.3.3 Let O = (p1i, . . . , pmi) denote the origin of the m-dimensional coordinate system and let OL be an arbitrary vector in the m-dimensional space. Let Aj, j = 1, 2, . . . , m, be points on the jth coordinate axis of the m-dimensional space such that, ∀j = 1, 2, . . . , m, ‖OAj‖ = ‖OL‖. If we let ϑj = min{ϑ1, . . . , ϑm}, then we have

‖Aj L‖ ≤ [ √((m − 1)/m) / cos((1/2) arcsin √((m − 1)/m)) ] ‖OL‖. (3.3.19)
2
√
Proof Letting ϑ1 = · · · = ϑm = arccos(1/ m), then ϑ j reaches the maximum. Let
α j = ∠O A j L. We can get α j = 21 (π − ϑ j ). According to sine rule, we have
A j L ≤ sin ϑ j / sin α j O L. (3.3.20)
√
Since cosϑ j = 1/ m, we have
1
sin ϑ j = (m − 1)/m, sin α j = cos arcsin (m − 1)/m .
2
Taking sin ϑ j and sin α j into (3.3.20), we can obtain the conclusion. The proof is
complete.
Lemma 3.3.4 Let O = (p1i, . . . , pmi) denote the origin of the m-dimensional coordinate system and let Aℓ = (p1iℓ, . . . , pmiℓ) be an arbitrary point in Ā(p1i, . . . , pmi), where 1 ≤ ℓ ≤ (2r + 1)^m. Let L = (p̄1i, . . . , p̄mi) be an arbitrary point such that ‖OL‖ ≥ max_{1≤ℓ≤(2r+1)^m} ‖OAℓ‖. If the radius r of Ā(p1i, . . . , pmi) satisfies the following inequality

r ≥ [ 3(m − 1) + √(3(m − 1)) ] / (3m), (3.3.21)

then there exists an ℓ such that

‖Aℓ L‖ ≤ ‖OL‖. (3.3.22)
Proof Without loss of generality, let L = (p̄1i, . . . , p̄mi) be located in the first quadrant. According to Lemmas 3.3.2 and 3.3.3, we can see that if the intersection angles satisfy ϑ1 = · · · = ϑm = arccos(1/√m), the value on the left-hand side of (3.3.19) reaches its maximum for each coordinate axis. In this situation, we can let L = (p1i + r, . . . , pmi + r). Let A′ = (0, . . . , 0, pmi + r) and A″ = (p1i + r − 1, . . . , p(m−1)i + r − 1, pmi + r). Then, we can see that the points A′, A″, and L are on the same line. Let ϑ = ∠A″OL. Then, we have

cos ϑ = ( ‖OA″‖^2 + ‖OL‖^2 − ‖LA″‖^2 ) / ( 2 ‖OA″‖ ‖OL‖ ), (3.3.23)

which gives

cos ϑ = ( (r − 1)m + 1 ) / ( √m √((r − 1)^2 m − (r − 1)^2 + r^2) ). (3.3.24)
Proof According to the definitions of the iterative value functions Vˆi (xk ,
u( p1i , . . . , pmi )) and Υi (xk , ṽi (xk )) in (3.3.16) and (3.3.18), respectively, we can
see that if we put the control ṽi (xk ) into the set of numerical controls A, then
For the control u( p1i , . . . , pmi ), according to Definition 3.3.1, there exists an L such
that
|Vˆi(xk, u(p1i, . . . , pmi)) − Υi(xk, ṽi(xk))| ≤ L ‖u(p1i, . . . , pmi) − ṽi(xk)‖.
Since ṽi (xk ) cannot be obtained accurately, the distance between ṽi (xk ) and
u( p1i , . . . , pmi ) is unknown. Hence, ṽi (xk ) must be replaced by other known vec-
tor. Next, we will show
‖u(p1i, . . . , pmi) − ṽi(xk)‖ ≤ max_{1≤ℓ≤(2r+1)^m} ‖u(p1i, . . . , pmi) − u(p1iℓ, . . . , pmiℓ)‖. (3.3.25)
As
v̂i (xk ) = u( p1i , p2i , . . . , pmi ) ∈ A,
then ṽi (xk ) becomes the neighboring point of u( p1i , . . . , pmi ) and we can put ṽi (xk )
into Ā( p1i , . . . , pmi ) such that ṽi (xk ) ∈ Ā( p1i , . . . , pmi ). Next, we will prove the con-
clusion by contradiction. Assume that the inequality (3.3.25) does not hold. Then,
‖u(p1i, . . . , pmi) − ṽi(xk)‖ > max_{1≤ℓ≤(2r+1)^m} ‖u(p1i, . . . , pmi) − u(p1iℓ, . . . , pmiℓ)‖,

as ṽi(xk) belongs to the set Ā(p1i, . . . , pmi). As there are m dimensions in Ā(p1i, . . . , pmi), we can divide it into 2^m quadrants.
Without loss of generality, let ṽi (xk ) be located in the first quadrant where L =
( p̄1i , . . . , p̄mi ) is the corresponding coordinate. If we let O = ( p1i , . . . , pmi ) be the
origin, then
O L = u( p1i , . . . , pmi ) − ṽi (xk ) .
According to the definition of Υi (xk , ṽi (xk )) in (3.3.18), we know Vˆi (xk , u( p1i , . . . ,
pmi )) ≥ Υi (xk , ṽi (xk )) and Vˆi (xk , u i ( p1i , . . . , pmi )) ≥ Υi (xk , ṽi (xk )),
for i = 0, 1, . . .. Thus, according to (3.3.26), we can obtain
Vˆi (xk ,u( p1i ,. . . , pmi )) > Vˆi (xk ,u i ( p1i , . . . , pmi )).
This contradicts the definition of Vˆi (xk , u( p1i , . . . , pmi )) in (3.3.16). Therefore, the
assumption is false and the inequality (3.3.25) holds. The proof is complete.
According to the definitions of iterative value functions Vˆi (xk , u( p1i , . . . , pmi ))
and Υi (xk , ṽi (xk )) in (3.3.16) and (3.3.18), respectively, for i = 0, 1, . . ., we can
define
Vˆi (xk , u( p1i , . . . , pmi )) − Υi (xk , ṽi (xk )) = εi (xk ), (3.3.27)
where ε0 (xk ) = 0. Then, for any εi (xk ) expressed in (3.3.27), there exists a ηi (xk )
such that
Theorem 3.3.4 Let the iterative value function Vˆi (xk , u( p1i , . . . , pmi )) and the
numerical iterative control u( p1i , . . . , pmi ) be obtained by (3.3.15) and (3.3.16),
respectively. If for xk ∈ Rn , we define the admissible approximation error as
ε̄i(xk) = Li(p1i, . . . , pmi) max_{1≤ℓ≤(2r+1)^m} ‖u(p1i, . . . , pmi) − u(p1iℓ, . . . , pmiℓ)‖, (3.3.28)
where L i ( p1i , . . . , pmi ) are Lipschitz constants and ε̄0 (xk ) = 0, then for i = 0, 1, . . .,
we have
εi (xk ) ≤ ε̄i (xk ).
Proof As Vˆi (xk , u( p1i , . . . , pmi )) are Lipschitz continuous, according to (3.3.13), we
have
where L i ( p1i , . . . , pmi ) are the Lipschitz constants. According to (3.3.25), we can
draw the conclusion. The proof is complete.
|Vˆi(xk, u(p1i, . . . , pmi)) − Vˆi(xk, u(p1iℓ, . . . , pmiℓ))|
= L̄i(p1i, . . . , pmi) ‖u(p1i, . . . , pmi) − u(p1iℓ, . . . , pmiℓ)‖, (3.3.29)

|Vˆi(xk, u(p1i, . . . , pmi)) − Vˆi(xk, u(p1iℓ, . . . , pmiℓ))|
≤ L̄i(p1i, . . . , pmi) ‖u(p1i, . . . , pmi) − u(p1iℓ, . . . , pmiℓ)‖.
For all u( p1i , . . . , pmi ) ∈ A, we can define the global Lipschitz constant L̄ i as
L̄i = max { L̄i(p1i, . . . , pmi) : 1 ≤ pij ≤ Pj, j = 1, 2, . . . , m }. (3.3.31)
Thus, from (3.3.30) and (3.3.31), we can easily obtain L i ( p1i , . . . , pmi ) ≤ L̄ i .
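A simple way to estimate such a global Lipschitz constant numerically is to maximize finite-difference ratios of the iterative value function over neighbouring points of the control set. In the sketch below, the one-dimensional control set, the frozen state, and the stand-in function Vhat are illustrative assumptions.

```python
import numpy as np

# Estimating a global Lipschitz constant in the spirit of (3.3.29)-(3.3.31).
varsigma = 0.05
A = np.arange(-2.0, 2.0 + varsigma, varsigma)          # numerical control set (assumed)

def Vhat(x, u):                                        # stand-in for V_hat_i(x_k, u)
    return x**2 + u**2 + 0.5 * np.abs(u)

def global_lipschitz(x, A):
    vals = Vhat(x, A)
    ratios = np.abs(np.diff(vals)) / np.abs(np.diff(A))    # neighbouring controls only
    return np.max(ratios)

print(global_lipschitz(1.0, A))     # close to max_u |dVhat/du| = 4.5 for this stand-in
```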
In the above, we give an effective method to obtain the approximation error ε̄i+1(xk) of the numerical iterative θ-ADP algorithm. We will now show how to obtain the admissible approximation error that guarantees the convergence criterion of the present numerical iterative ADP algorithm. According to (3.3.9), we can define γ = max { J∗(F(xk, uk))/U(xk, uk) : xk ∈ Rn, uk ∈ A }. If we let

γ̃i = max { Vi(F(xk, uk))/U(xk, uk) : xk ∈ Rn, uk ∈ A }, (3.3.32)
then we can get γ̃i ≥ γ . Before the next theorem, we introduce some notation. Let
and

βi(xk) = Vˆ0(xk, 0) / Vˆi(xk, u(p1i, . . . , pmi)). (3.3.34)
Theorem 3.3.5 Let the iterative value function Vˆi (xk , u( p1i , . . . , pmi )) be defined in
(3.3.12) and the numerical iterative control u( p1i , . . . , pmi ) be defined in (3.3.15).
For i = 0, 1, . . ., if the approximation error satisfies
then the numerical iterative control law u( p1i , . . . , pmi ) stabilizes the nonlinear sys-
tem (3.2.1) and simultaneously guarantees the iterative value function Vˆi (xk , u( p1i ,
. . . , pmi )) to converge to a finite neighborhood of J ∗ (xk ), as i → ∞.
β = max { V0(xk)/J∗(xk) : xk ∈ Rn }.
So, if

η̄i(xk) ≤ 1 + (βi(xk) − 1)/(γ̃i βi(xk)), (3.3.37)
then (3.3.36) holds. Putting (3.3.33) and (3.3.34) into (3.3.37), we can obtain (3.3.35).
On the other hand, according to (3.3.32) and the definition of η in (3.3.8), we have
By Theorems 3.3.1 and 3.3.2, we can draw the conclusion. The proof is complete.
Theorem 3.3.6 Let the iterative value function Vˆi (xk , u( p1i , . . . , pmi )) be defined in
(3.3.12) and the numerical iterative control u( p1i , . . . , pmi ) be defined in (3.3.15). If
for all xk ∈ Rn , we have
U (xk , u k ) ≥ U (xk , 0), (3.3.38)
where
Δi1 (xk , u( p1i , . . . , pmi )) = Vi (xk , u( p1i , . . . , pmi ))−U (xk , 0)
and
then the numerical iterative control law u( p1i , . . . , pmi ) stabilizes the nonlinear
system (3.2.1) and simultaneously guarantees the iterative value functions
Vˆi (xk , u( p1i , . . . , pmi )) to converge to a finite neighborhood of J ∗ (xk ), as i → ∞.
Proof If we let
So, if

η̄i(xk) ≤ 1 + (βi(xk) − 1)/(γ̂i βi(xk)),
then (3.3.11) holds, which means that iterative value functions Vˆi (xk , u( p1i , . . . , pmi ))
converge to a finite neighborhood of J∗(xk) according to Theorem 3.3.1. According to (3.3.33), (3.3.34), and (3.3.40), we can get (3.3.39). The proof is complete.
From Theorems 3.3.5 and 3.3.6, we can see that the information of the parameter γi+1 should be used, whereas the value of γi+1 is usually difficult to obtain. So, in the next theorem, we give a simpler convergence criterion for the iterative ADP algorithm.
Theorem 3.3.7 Let the iterative value function Vˆi (xk , u( p1i , . . . , pmi )) and the
numerical iterative control u( p1i , . . . , pmi ) be obtained by (3.3.15) and (3.3.16),
respectively. Let ε̄i be expressed as in (3.3.28). For i = 0, 1, . . ., if the utility function
U (xk , u k ) satisfies (3.3.38) and the iterative approximation error satisfies
then the numerical iterative control law u( p1i , . . . , pmi ) stabilizes the nonlinear
system (3.2.1) and simultaneously guarantees the iterative value function
Vˆi (xk , u( p1i , . . . , pmi )) to converge to a finite neighborhood of J ∗ (xk ), as i → ∞.
Proof First, we look at (3.3.9) and (3.3.10). Without loss of generality, we let
and

max_{xk∈Rn, i=0,1,...} { Vˆ0(xk, 0)/U(xk, 0) − β̃i(xk) } ≥ max_{xk∈Rn} { γ̃(xk) β̃(xk) } = γβ.
then
Substituting (3.3.33) and (3.3.34) into (3.3.42), we can obtain (3.3.41). The proof is
complete.
Step 5: If |V̂i (xk ) − V̂i−1 (xk )| ≤ ζ , then the iterative value function is converged and
go to Step 9; elseif i > i max , then go to Step 6; else, let i = i + 1 and go to
Step 2.
Step 6: Stop the algorithm.
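A compact sketch of this loop is given below: the minimization in (3.3.2)–(3.3.3) is an exhaustive search over the discrete control set A, and the iteration stops once the change of the value function is no larger than ζ everywhere or the iteration index exceeds imax. The scalar system, Ψ, θ, ζ, and grid choices are assumptions made only for illustration.

```python
import numpy as np

# Numerical iterative theta-ADP over a discretized control set with a stopping rule.
f = lambda x, u: 0.8 * x + u                       # assumed example system
U = lambda x, u: x**2 + u**2                       # assumed quadratic utility
theta, zeta, i_max = 10.0, 1e-6, 200

varsigma = 0.05
A = np.arange(-2.0, 2.0 + varsigma, varsigma)      # numerical control set with grid step varsigma
xs = np.linspace(-1.0, 1.0, 41)                    # sampled states

V = theta * xs**2                                  # V_hat_0 = theta * Psi
for i in range(1, i_max + 1):
    V_new = np.array([np.min(U(x, A) + np.interp(f(x, A), xs, V)) for x in xs])
    if np.max(np.abs(V_new - V)) <= zeta:          # Step 5 convergence test
        V = V_new
        break
    V = V_new
print(i, V[-1])                                    # iterations used and V_hat at x = 1
```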
Let xk = [x1 (k), x2 (k)]T denote the system state vector and u k = u(k) denote the
control. Let
A = {−2, −2 + ς, −2 + 2ς, . . . , 2},
where ς is the grid step. The cost function is defined as (3.2.3) with the utility function
U(xk, uk) = xkT Q xk + ukT R uk,

where Q and R are given as identity matrices with suitable dimensions. The initial state is x0 = [1, −1]T.

Fig. 3.5 The curve of the admissible approximation error obtained by (3.3.39)

Fig. 3.6 The curve of the admissible approximation error obtained by (3.3.41)

Fig. 3.7 The trajectories of the iterative value functions. a ς = 10−8. b ς = 10−4. c ς = 10−2. d ς = 10−1

Fig. 3.8 The control and state trajectories. a State trajectories for ς = 10−8. b Control trajectory for ς = 10−8. c State trajectories for ς = 10−4. d Control trajectory for ς = 10−4
The iterative ADP algorithm runs for 30 iteration steps to guarantee the convergence of the iterative value function. The curves of the admissible approximation error obtained by (3.3.39) and (3.3.41) are displayed in Figs. 3.5 and 3.6, respectively. From Fig. 3.6, we can see that for some states xk, the admissible approximation error is smaller than zero, so that the convergence criterion (3.3.41) is invalid. In contrast, from Fig. 3.5, we can see that the admissible approximation error curve is above zero, which implies that the convergence criterion (3.3.39) is effective for all xk.
To show the effectiveness of the numerical iterative ADP algorithm, we choose
four different grid steps. Let ς = 10−8 , 10−4 , 10−2 , 10−1 , respectively. The trajec-
tories of the iterative value function are shown in Fig. 3.7a, b, c and d, respectively.
For ς = 10−8 and ς = 10−4, we implement the approximate optimal control for
system (3.3.43), respectively, with the implementation time Tf = 40. The trajectories
of the states and controls are displayed in Fig. 3.8a, b, c and d, respectively. When
ς = 10−2 , we can see that the iterative value function is not completely converged
Fig. 3.9 The state and control trajectories. a State trajectories for ς = 10−2 . b Control trajectory
for ς = 10−2
within 30 iteration steps. The trajectories of the state are displayed in Fig. 3.9a, and
the corresponding control trajectory is displayed in Fig. 3.9b.
In this section, it is shown that if the inequality (3.3.35) holds, then for i = 0, 1, . . .,
the numerical iterative control v̂i (xk ) stabilizes the system (3.3.43), which means that
the numerical iterative θ -ADP algorithm can be implemented both online and offline.
In Fig. 3.10a–d, we give the system state and control trajectories of the system (3.3.43)
under the iterative control law v̂0 (xk ) with ς = 10−8 and ς = 10−4 , respectively. In
Fig. 3.11a–d, we give the system state and control trajectories of the system (3.3.43)
under the iterative control law v̂0 (xk ) with ς = 10−2 and ς = 10−1 , respectively.
When ς = 10−1, we can see that the iterative value function is not convergent and
the control system is not stable.
Fig. 3.10 The state and control trajectories under v̂0 (xk ). a State trajectories for ς = 10−8 .
b Control trajectory for ς = 10−8 . c State trajectories for ς = 10−4 . d Control trajectory for
ς = 10−4
Fig. 3.11 The state and control trajectories under v̂0 (xk ). a State trajectories for ς = 10−2 . b Con-
trol trajectory for ς = 10−2 . c State trajectories for ς = 10−1 . d Control trajectory for ς = 10−1
where V̂0 (xk+1 ) = Ψ (xk+1 ) and ρ0 (xk ) is the approximation error function of the
iterative control law v̂0 (xk ). For i = 1, 2, . . ., the iterative ADP algorithm will iterate
between
V̂i (xk ) = U (xk , v̂i−1 (xk )) + V̂i−1 (F(xk , v̂i−1 (xk ))) + πi (xk ) (3.4.3)
and
v̂i(xk) = arg min_{uk} { U(xk, uk) + V̂i(xk+1) } + ρi(xk), (3.4.4)
where πi (xk ) and ρi (xk ) are finite approximation error functions of the iterative value
function and control law, respectively. For convenience of analysis, for i = 0, 1, . . .,
we assume that ρi (xk ) = 0 and πi (xk ) = 0 for xk = 0.
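To make the role of these error functions concrete, the following is a minimal numerical sketch (not the authors' implementation) of the iteration (3.4.2)–(3.4.4) for an assumed scalar system: the exact minimization over a control grid produces the target values, and the residual of a least-squares critic fit plays the role of πi(xk).

```python
import numpy as np

F = lambda x, u: 0.9 * x + 0.1 * x**2 + u     # assumed dynamics F(x_k, u_k)
U = lambda x, u: x**2 + u**2                  # assumed utility U(x_k, u_k)

X = np.linspace(-1.0, 1.0, 101)               # sampled states in Omega_x
A = np.linspace(-1.0, 1.0, 201)               # control grid Omega_u

def fit(values):
    """Quadratic least-squares 'critic'; its residual is the error pi_i."""
    return np.polyval(np.polyfit(X, values, 2), X)

V_hat = 0.5 * X**2                            # V_hat_0(x_k) = Psi(x_k) >= 0
for i in range(50):
    # exact minimization over the control grid (the roles of (3.4.3)-(3.4.4) combined)
    targets = np.array([min(U(x, a) + np.interp(F(x, a), X, V_hat) for a in A)
                        for x in X])
    V_new = fit(targets)                      # fitted value function
    pi_i = V_new - targets                    # finite approximation error pi_i(x_k)
    if np.max(np.abs(V_new - V_hat)) < 1e-6:
        break
    V_hat = V_new
print("iterations:", i + 1, "max |pi_i|:", float(np.abs(pi_i).max()))
```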
Remark 3.4.1 In [4], for V0 (xk ) = 0, it was proven that the iterative value func-
tion in (3.4.5) is monotonically nondecreasing and converges to the optimum. In
[3, 5, 10], the convergence property of value iteration for the discounted case has been
investigated, and it was proven that the iterative value function converges to the
optimum if the discount factor satisfies 0 < α < 1. For the undiscounted cost function
(3.2.2) and a positive semidefinite initial value function, i.e., α ≡ 1 and V0(xk) = Ψ(xk),
the convergence analysis methods in [3–5, 7, 10, 18] are no longer applicable. Hence,
a new convergence analysis method needs to be developed. In [11, 18, 26], a “func-
tion bound” method was proposed for the traditional value iteration algorithm with
the zero initial value function. Based on the previous contribution of value iteration
algorithms [3–5, 7, 10, 18], and inspired by [18], a new convergence analysis method
for the general value iteration algorithm is developed in this section.
Theorem 3.4.1 For i = 0, 1, . . ., let Vi (xk ) and vi (xk ) be updated by (3.4.5). Then,
the iterative value function Vi (xk ) converges to J ∗ (xk ) as i → ∞.
Proof We have 0 < U(xk, uk) < ∞, 0 ≤ V0(xk) < ∞, 0 < J∗(xk) < ∞, and 0 ≤ J∗(F(xk, uk)) < ∞.
Hence, there exist constants γ , α, and β such that J ∗ (F(xk , u k )) ≤ γ U (xk , u k ) and
α J ∗ (xk ) ≤ V0 (xk ) ≤ β J ∗ (xk ), where 0 < γ < ∞ and 0 ≤ α ≤ 1 ≤ β, respectively.
As J ∗ (xk ) is unknown, the values of γ , α, and β cannot be obtained directly. We will
first prove that for i = 0, 1, . . ., the following inequality
Vi(xk) ≥ [ 1 + (α − 1)/(1 + γ^−1)^i ] J∗(xk) holds.
The mathematical induction is complete. On the other hand, according to (3.4.6) and
(3.4.7), for i = 0, 1, . . ., we can get
Vi(xk) ≤ [ 1 + (β − 1)/(1 + γ^−1)^i ] J∗(xk).
where
V̂0 (xk ) = Γ0 (xk ) = Ψ (xk ).
From (3.4.9), we can see that the iterative value function Γi (xk ) can be larger or
smaller than V̂i (xk ) and the convergence properties are different for different values
of ηi . As the convergence analysis methods for 0 < ηi < 1 and ηi ≥ 1 are different,
the convergence properties for different ηi will be discussed separately.
Proof If 0 < ηi < 1, according to (3.4.9), we have 0 ≤ V̂i (xk ) < Γi (xk ). Using
mathematical induction, we can prove that for i = 1, 2, . . ., the following inequality
If for i = 1, 2, . . ., there exists 1 ≤ ηi < ∞ such that (3.4.9) holds uniformly, then
V̂i(xk) ≤ ηi [ 1 + Σ_{j=1}^{i−1} ηi−1 ηi−2 · · · ηi−j+1 (ηi−j − 1) (γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1)) ] Vi(xk), (3.4.13)
where we define Σ_{j}^{i}(·) = 0 for all j > i, i, j = 0, 1, . . ., and
Thus, the conclusion holds for i = 1. Next, let i = 2. According to (3.4.8), we have
Γ2(xk) = min_{uk} { U(xk, uk) + V̂1(F(xk, uk)) }
≤ min_{uk} { [1 + γ1(η1 − 1)/(γ1 + 1)] U(xk, uk) + [η1 − (η1 − 1)/(γ1 + 1)] V1(xk+1) }
= [1 + γ1(η1 − 1)/(γ1 + 1)] min_{uk} { U(xk, uk) + V1(xk+1) }
= [1 + γ1(η1 − 1)/(γ1 + 1)] V2(xk),
which shows that (3.4.13) holds for i = 2. Assume that (3.4.13) holds for i = l − 1,
where l = 2, 3, . . .. Then, for i = l, we obtain
Γl(xk) = min_{uk} { U(xk, uk) + V̂l−1(F(xk, uk)) }
≤ min_{uk} { U(xk, uk) + ηl−1 [ 1 + Σ_{j=1}^{l−2} ηl−2 · · · ηl−j (ηl−j−1 − 1) (γl−2 γl−3 · · · γl−j−1) / ((γl−2 + 1)(γl−3 + 1) · · · (γl−j−1 + 1)) ] Vl−1(xk+1) }
≤ [ 1 + Σ_{j=1}^{l−1} ηl−1 ηl−2 · · · ηl−j+1 (ηl−j − 1) (γl−1 · · · γl−j) / ((γl−1 + 1) · · · (γl−j + 1)) ] min_{uk} { U(xk, uk) + Vl−1(xk+1) }
= [ 1 + Σ_{j=1}^{l−1} ηl−1 ηl−2 · · · ηl−j+1 (ηl−j − 1) (γl−1 · · · γl−j) / ((γl−1 + 1) · · · (γl−j + 1)) ] Vl(xk).
Then, by (3.4.9), we can obtain (3.4.13) which proves the conclusion for i = 0, 1, . . ..
The proof is complete.
From (3.4.13), we can see that for i = 0, 1, . . ., there exists an error between
V̂i (xk ) and Vi (xk ). As i → ∞, the bound of the approximation errors may increase to
infinity. Thus, we will give the convergence properties of the iterative ADP algorithm
(3.4.2)–(3.4.4) using an error bound method. Before presenting the next theorem, the
following lemma is necessary.
where qi is an arbitrary constant such that γi/(γi + 1) < qi < 1, then as i → ∞, the
iterative value function V̂i(xk) of the general value iteration algorithm (3.4.2)–(3.4.4)
converges to a bounded neighborhood of the optimal cost function J∗(xk).
Proof For (3.4.13) in Theorem 3.4.3, if we let
Δi = Σ_{j=1}^{i−1} ηi−1 ηi−2 · · · ηi−j+1 (ηi−j − 1) (γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1)),
then Δi can equivalently be written as
Δi = Σ_{j=1}^{i−1} (ηi−1 ηi−2 · · · ηi−j γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1)) − Σ_{j=1}^{i−1} (ηi−1 ηi−2 · · · ηi−j+1 γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1)). (3.4.15)
As γi/(γi + 1) < q < 1 and γi−1 is finite for i = 1, 2, . . ., letting i → ∞, lim_{i→∞} Σ_{j=1}^{i−1} bij
is finite. On the other hand, for i = 1, 2, . . . and j = 1, 2, . . . , i − 1, we have
0 < ηi ≤ qi−1 (γi−1 + 1)/γi−1 (3.4.18)
holds, where 0 < qi < 1 is an arbitrary constant, then as i → ∞, the iterative value
function V̂i (xk ) converges to a bounded neighborhood of the optimal cost function
J ∗ (xk ).
In the previous section, we have discussed the convergence property of the general
value iteration algorithm. From (3.4.18), for i = 0, 1, . . ., the approximation error
ηi is a function of γi . Hence, if we can obtain the parameter γi for i = 1, 2, . . ., then
we can design an iterative approximation error ηi+1 to guarantee the iterative value
function V̂i (xk ) to be convergent. Next, the detailed design method of the convergence
criteria will be discussed.
According to (3.4.12), we give the following definition. Define Ωγi as
Ωγi = { γi : γi U(xk, uk) ≥ Vi(F(xk, uk)), ∀xk ∈ Ωx, ∀uk ∈ Ωu }.
Because of the approximation errors, the accurate iterative value function Vi(xk)
cannot be obtained in general. Hence, the parameter γi cannot be obtained directly
from (3.4.12). In this section, several methods will be introduced on how to estimate γi.
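As a rough illustration (an assumption-laden sketch, not one of the estimation algorithms discussed in this section), γi can be estimated on a grid by taking the largest ratio Vi(F(xk, uk))/U(xk, uk) over sampled states and controls, using the available approximate value function in place of the exact Vi(xk).

```python
import numpy as np

def estimate_gamma(V_hat, F, U, X, A, eps=1e-12):
    """Grid estimate of gamma_i: the smallest constant with
    gamma * U(x, u) >= V_hat(F(x, u)) over the sampled states/controls.
    V_hat, F, U are assumed callables; X and A are 1-D grids."""
    ratios = [V_hat(F(x, a)) / U(x, a)
              for x in X for a in A if U(x, a) > eps]   # exclude U = 0 at the origin
    return max(ratios)

# usage sketch with assumed scalar quantities
gamma_1 = estimate_gamma(lambda x: 2.0 * x**2,          # approximate V_1
                         lambda x, u: 0.9 * x + u,      # assumed F
                         lambda x, u: x**2 + u**2,      # assumed U
                         np.linspace(-1, 1, 51), np.linspace(-1, 1, 51))
```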
Lemma 3.4.3 Let νi (xk ) be an arbitrary control law. Define Vi (xk ) as in (3.4.5) and
define Λi (xk ) as
Λi (xk ) = U (xk , νi−1 (xk )) + Λi−1 (xk+1 ). (3.4.19)
where P0(xk) = V0(xk) = Ψ(xk). If there exists a constant γ̃i such that (3.4.21) holds,
then γ̃i ∈ Ωγi.
Proof As μ(xk ) is an arbitrary admissible control law, according to Lemma 3.4.3,
we have Pi (xk ) ≥ Vi (xk ). If γ̃i satisfies (3.4.21), then we can get
Lemma 3.4.4 For i = 1, 2, . . ., let Γi(xk) be expressed as in (3.4.8) and V̂i(xk) be
expressed as in (3.4.3). If for i = 1, 2, . . ., the approximation error satisfies πi(xk) ≥ 0,
then
V̂i(xk) ≥ Vi(xk),
So, the conclusion holds for i = 1. Assume that the conclusion holds for i = l − 1,
l = 2, 3, . . .. Then, for i = l, we can similarly obtain V̂l(xk) ≥ Vl(xk).
where V̂0 (xk ) = Γ0 (xk ) = Ψ (xk ). If π0 (xk ) ≥ 0, then we can get V̂1 (xk ) ≥ V1 (xk ).
Using the mathematical induction, we can prove that V̂i (xk ) ≥ Vi (xk ) holds for
i = 1, 2, . . .. According to (3.4.23), we can get γ̂i U (xk , u k ) ≥ Vi (F(xk , u k )), which
proves γ̂i ∈ Ωγi . This completes the proof of the theorem.
From Theorem 3.4.7, we can see that if for i = 0, 1, . . ., we can force the approxi-
mation error πi (xk ) ≥ 0, then the parameter γi can be estimated. An effective method
is to add another approximation error |πi (xk )| to the iterative value function V̂i+1 (xk ).
The new iterative value function can be defined as
V̄i(xk) = U(xk, v̂i−1(xk)) + V̂i−1(xk+1) + πi(xk) + |πi(xk)| = V̂i(xk) + |πi(xk)|. (3.4.24)
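A tiny numerical sketch of the shift in (3.4.24), using assumed, made-up numbers: adding |πi(xk)| to the fitted values makes the resulting approximation error nonnegative, which is what allows γi to be estimated as in Theorem 3.4.7.

```python
import numpy as np

targets = np.array([0.0, 0.5, 2.1, 4.8])   # U + V_hat_{i-1} on a few grid points (assumed)
fitted  = np.array([0.1, 0.4, 2.2, 4.6])   # critic output V_hat_i, with errors pi_i
pi      = fitted - targets                  # approximation error (mixed signs)
V_bar   = fitted + np.abs(pi)               # (3.4.24): V_bar_i = V_hat_i + |pi_i|
assert np.all(V_bar - targets >= 0)         # shifted error pi_i + |pi_i| >= 0
```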
If V̂i (xk ) ≥ Γi (xk ), then we define ηi = 1. We can derive the following results.
V̂i(xk) ≥ ηi [ 1 + Σ_{j=1}^{i−1} ηi−1 ηi−2 · · · ηi−j+1 (ηi−j − 1) (γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1)) ] Vi(xk),
where we define Σ_{j}^{i}(·) = 0 for all j > i, i, j = 0, 1, . . ., and we let
Theorem 3.4.8 For i = 1, 2, . . ., let 0 < γi < ∞ be a constant such that (3.4.12) holds.
Define a new iterative value function as
V̲i(xk) = ηi [ 1 + Σ_{j=1}^{i−1} ηi−1 ηi−2 · · · ηi−j+1 (ηi−j − 1) (γi−1 γi−2 · · · γi−j) / ((γi−1 + 1)(γi−2 + 1) · · · (γi−j + 1)) ] Vi(xk),
ηl [ 1 + Σ_{j=1}^{l−1} ηl−1 ηl−2 · · · ηl−j+1 (ηl−j − 1) (γl−1 γl−2 · · · γl−j) / ((γl−1 + 1)(γl−2 + 1) · · · (γl−j + 1)) ]
= ηl [ 1 + (γl−1/(γl−1 + 1)) ( ηl−1 [ 1 + Σ_{j=1}^{l−2} ηl−2 ηl−3 · · · ηl−j (ηl−j−1 − 1) (γl−2 γl−3 · · · γl−j−1) / ((γl−2 + 1)(γl−3 + 1) · · · (γl−j−1 + 1)) ] − 1 ) ]. (3.4.27)
0 < ηl−1 [ 1 + Σ_{j=1}^{l−2} ηl−2 ηl−3 · · · ηl−j (ηl−j−1 − 1) (γl−2 γl−3 · · · γl−j−1) / ((γl−2 + 1)(γl−3 + 1) · · · (γl−j−1 + 1)) ] ≤ 1. (3.4.28)
Substituting (3.4.28) into (3.4.27), we can obtain the conclusion for i = l. The proof
is complete.
From Theorem 3.4.8, we can estimate the lower bound of the iterative value func-
tion Vi (xk ). Hence, we can derive the following theorem.
Theorem 3.4.9 For i = 1, 2, . . ., let V̂i(xk) be the iterative value function such that
(3.4.22) holds. If for i = 1, 2, . . ., there exists γ̂i such that (3.4.23) holds, then
γ̲i = γ̂i / Δ̲i ∈ Ωγi, (3.4.29)
where
Δ̲i = ηi [ 1 + Σ_{j=1}^{i−1} ηi−1 ηi−2 · · · ηi−j+1 (ηi−j − 1) (γ̲i−1 γ̲i−2 · · · γ̲i−j) / ((γ̲i−1 + 1)(γ̲i−2 + 1) · · · (γ̲i−j + 1)) ]. (3.4.30)
Algorithm 3.4.4 General value iteration algorithm with finite approximation errors
Initialization:
Choose randomly an array of system states xk in Ωx . Choose a semi-positive definite function
Ψ (xk ) ≥ 0. Choose a convergence precision ε. Give a sequence {qi }, i = 0, 1, . . ., where 0 <
qi < 1. Give two constants 0 < ς < 1, 0 < ξ < 1.
Iteration:
Step 1. Let the iteration index i = 0.
Step 2. Let V0 (xk ) = Ψ (xk ) and obtain γ 0 by (3.4.17). Compute v̂0 (xk ) by (3.4.2).
Step 3. Let i = i + 1.
Step 4. Compute V̂i (xk ) by (3.4.3) and obtain v̂i (xk ) by (3.4.4).
Step 5. Obtain ηi by (3.4.9). If ηi (xk ) satisfies (3.4.14), then estimate γi by Algorithm Υ ,
Υ = 3.4.1, 3.4.2 or 3.4.3, and goto next step. Otherwise, decrease ρi−1 (xk ) and πi−1 (xk ), i.e.,
ρi−1 (xk ) = ςρi−1 (xk ) and πi−1 (xk ) = ξ πi−1 (xk ), respectively, e.g., by further NN training.
Goto Step 4.
Step 6. If |V̂i (xk ) − V̂i−1 (xk )| ≤ ε, then the optimal cost function is obtained and goto Step 7; else,
goto Step 3.
Step 7. Return V̂i (xk ) and v̂i (xk ).
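A schematic reading of Algorithm 3.4.4 as code (a sketch only; the callables stand in for the critic and action network training routines, and all names and tolerances here are assumptions): the inner loop realizes Step 5, shrinking the training tolerances by ς and ξ until the admissibility condition on ηi, e.g., a bound of the form (3.4.18), is met.

```python
import numpy as np

def max_abs_diff(a, b):
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def general_vi_with_errors(V0, evaluate_value, improve_policy, eta_of,
                           gamma_of, eta_bound, zeta=0.5, xi=0.5,
                           eps=1e-5, i_max=200):
    """Skeleton of Algorithm 3.4.4.  evaluate_value/improve_policy wrap the
    critic and action networks; tol_v, tol_u are their training tolerances."""
    V_prev, v, tol_v, tol_u = V0, None, 1e-2, 1e-2
    for i in range(1, i_max + 1):
        while True:
            V = evaluate_value(V_prev, v, tol_v)     # Step 4: V_hat_i via (3.4.3)
            v = improve_policy(V, tol_u)             #          v_hat_i via (3.4.4)
            eta = eta_of(V, V_prev)                  # Step 5: eta_i from (3.4.9)
            if eta <= eta_bound(gamma_of(V)):        # admissibility check (3.4.14)
                break
            tol_v, tol_u = zeta * tol_v, xi * tol_u  # shrink errors, retrain networks
        if max_abs_diff(V, V_prev) <= eps:           # Step 6: |V_i - V_{i-1}| <= eps
            return V, v                              # Step 7
        V_prev = V
    return V_prev, v
```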
εi(xk) ≤ V̂i(xk) / (γi−1 + 1). (3.4.31)
From (3.4.31), we can see that different system states xk require different
approximation errors εi to guarantee the convergence of V̂i(xk). If xk is large, then
the present general value iteration algorithm permits large approximation errors
while still converging; if xk is small, then small approximation errors are required
to ensure the convergence of the general value iteration algorithm. This is an
important property for the neural network implementation of the finite-approximation-error-based
general value iteration algorithm. For large xk, the approximation errors of the
neural networks can be large, whereas as xk → 0, the approximation errors of the
neural networks are also required to approach zero. As is known, the approximation
of neural networks cannot be globally accurate with zero approximation error, so
the implementation of the general value iteration algorithm by neural networks may
be invalid as xk → 0. On the other hand, in practice, neural networks are generally
trained under a globally uniform training precision or approximation error. Thus,
the present general value iteration algorithm requires a high training precision and
small approximation errors, which guarantees the convergence of the iterative value
function over most of the state space.
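The following short sketch, with assumed values for V̂i and γi−1, makes the same point numerically: a single uniform network training tolerance satisfies the state-dependent bound (3.4.31) only away from the origin.

```python
import numpy as np

X = np.linspace(-1.0, 1.0, 201)        # state grid
V_hat = 2.0 * X**2                     # assumed iterative value function on the grid
gamma_prev = 5.0                       # assumed estimate of gamma_{i-1}
tol = 1e-3                             # uniform neural network training precision

bound = V_hat / (gamma_prev + 1.0)     # right-hand side of (3.4.31)
ok = tol <= bound                      # where the uniform tolerance is admissible
print("admissible fraction of the grid:", float(ok.mean()))
print("smallest |x| still admissible:", float(np.abs(X[ok]).min()))
```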
where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar,
respectively. The system states are the current angle θ and the angular velocity ω. Let
J = (4/3)Ml² and fd = 0.2 be the rotary inertia and frictional factor, respectively. Let
g = 9.8 m/s² be the gravitational acceleration. Discretization of the system function and value function
Fig. 3.12 The trajectories of γ ’s by Ψ 1 (xk )–Ψ 4 (xk ). a γ I , γ II , and γ III by Ψ 1 (xk ).
b γ I , γ II , and γ III by Ψ 2 (xk ). c γ I , γ II , and γ III by Ψ 3 (xk ). d γ I , γ II , and γ III by Ψ 4 (xk )
Fig. 3.13 The admissible errors corresponding to Ψ 1 (xk ) and Ψ 2 (xk ). a Admissible errors by γ I
and Ψ 1 (xk ). b Admissible errors by γ II and Ψ 1 (xk ). c Admissible errors by γ III and Ψ 1 (xk ). d
Admissible errors by γ I and Ψ 2 (xk ). e Admissible errors by γ II and Ψ 2 (xk ). f Admissible errors
by γ III and Ψ 2 (xk )
using Euler and trapezoidal methods with the sampling interval Δt = 0.1 sec leads
to
where x1k = θk and x2k = ωk . Let the initial state be x0 = [1, −1]T . We choose
10000 states in Ωx . The critic and the action networks are also chosen as three-layer
backpropagation (BP) neural networks, where the structures are set as 2–12–1 and
2–12–1, respectively.
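For reference, a minimal numpy sketch of a 2–12–1 three-layer BP network of the kind used for the critic here (the learning rate, initialization, and training loop are assumptions, not the book's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((12, 2)), np.zeros((12, 1))   # input -> hidden (2-12)
W2, b2 = 0.1 * rng.standard_normal((1, 12)), np.zeros((1, 1))    # hidden -> output (12-1)

def critic(x):
    """x: (2, N) batch of states; returns V_hat(x) of shape (1, N)."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def train(x, target, lr=0.05, steps=500):
    """Plain gradient-descent backpropagation on the squared fitting error."""
    global W1, b1, W2, b2
    for _ in range(steps):
        h = np.tanh(W1 @ x + b1)
        y = W2 @ h + b2
        e = y - target                              # approximation error of the critic
        n = x.shape[1]
        dW2, db2 = e @ h.T / n, e.mean(axis=1, keepdims=True)
        dh = (W2.T @ e) * (1.0 - h**2)              # backpropagate through tanh layer
        dW1, db1 = dh @ x.T / n, dh.mean(axis=1, keepdims=True)
        W2, b2, W1, b1 = W2 - lr * dW2, b2 - lr * db2, W1 - lr * dW1, b1 - lr * db1
    return float((e**2).mean())
```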
To illustrate the effectiveness of the present algorithm, we also choose four differ-
ent initial value functions which are expressed by Ψj(xk) = xkT Pj xk, j = 1, . . . , 4.
Let P1 = 0. Let P2–P4 be initialized by positive definite matrices given by
Fig. 3.14 The plots of the admissible errors corresponding to Ψ 1 (xk ) and Ψ 2 (xk ). a Admissible
errors by γ I and Ψ 3 (xk ). b Admissible errors by γ II and Ψ 3 (xk ). c Admissible errors by γ III and
Ψ 3 (xk ). d Admissible errors by γ I and Ψ 4 (xk ). e Admissible errors by γ II and Ψ 4 (xk ). f Admissible
errors by γ III and Ψ 4 (xk )
Fig. 3.15 The trajectories of the iterative value functions. a Vγ I , Vγ II and Vγ III by Ψ 1 (xk ). b Vγ I ,
Vγ II and Vγ III by Ψ 2 (xk ). c Vγ I , Vγ II and Vγ III by Ψ 3 (xk ). d Vγ I , Vγ II and Vγ III by Ψ 4 (xk )
Fig. 3.16 Iterative control signals. a Control signals by Ψ 1 (xk ) and γ I . b Control signals by Ψ 2 (xk )
and γ II . c Control signals by Ψ 3 (xk ) and γ III . d Control signals by Ψ 4 (xk ) and γ III
The admissible errors corresponding to Ψ3(xk) are shown in Fig. 3.14a, b and c, respectively.
The admissible errors corresponding to Ψ4(xk) are shown in Fig. 3.14d, e and f,
respectively.
Initialized by Ψ1(xk)–Ψ4(xk), the trajectories of the iterative value functions
Vγ I, Vγ II, and Vγ III are shown in Fig. 3.15a, b, c and d, respectively.
The iterative control and state trajectories by Ψ 1 (xk ) and γ I are shown in
Figs. 3.16a and 3.17a, respectively. The iterative control and state trajectories by
Ψ 2 (xk ) and γ II are shown in Figs. 3.16b and 3.17b, respectively. The iterative control
and state trajectories by Ψ 3 (xk ) and γ III are shown in Figs. 3.16c and 3.17c, respec-
tively. The iterative control and state trajectories by Ψ 4 (xk ) and γ III are shown in
Figs. 3.16d and 3.17d, respectively.
From Figs. 3.15–3.17, we can see that for different initial value functions under
γ I , γ II and γ III , the iterative value functions can converge to a finite neighborhood of
the optimal cost function. The corresponding iterative controls and iterative states
are also convergent. Therefore, the effectiveness of the present finite-approximation-
error-based general value iteration algorithm has been shown for nonlinear systems.
Fig. 3.17 Iterative state trajectories. a Trajectories by Ψ 1 (xk ) and γ I . b Trajectories by Ψ 2 (xk )
and γ II . c Trajectories by Ψ 3 (xk ) and γ III . d Trajectories by Ψ 4 (xk ) and γ III
On the other hand, if the initial weights of the neural networks, such as weights
of the initial admissible control law, are changed, then the convergence property
will also be different. Now, we also use the action network to obtain the admissible
control law [23] with the expression given by
Fig. 3.18 Comparisons for different μ(xk). a γI and γ′I by μ(xk) and μ′(xk). b VγI and Vγ′I. c
System states by γI and γ′I. d Optimal control by γI and γ′I
We choose Ψ3(xk) as the initial value function to facilitate our analysis of the
simulation results. The trajectories of γI by μ(xk) and γ′I by μ′(xk) are shown in
Fig. 3.18a. We can see that if the initial weights of the neural network for the admissible
control law are changed, then the parameter γI also changes. For different γI,
according to Theorem 3.4.4, the iterative value function can be convergent under
different approximation errors. The trajectories of the iterative value functions by
γI and γ′I are shown in Fig. 3.18b, where we can see that for different
initial weights of the neural network for the admissible control law, the iterative value
function will converge to different values. Hence, the initial weights of the neural
network lead to different convergence properties. The optimal trajectories of the
system states and control by γI and γ′I are shown in Fig. 3.18c and d, respectively.
3.5 Conclusions
In this chapter, several optimal control problems are solved by using value iteration
ADP algorithms for infinite-horizon discrete-time nonlinear systems with finite
approximation errors. Iterative control laws that guarantee the iterative
value functions to converge to the optimum are obtained by using the iterative ADP
algorithms. Then, a novel numerical adaptive learning control scheme based on the
iterative ADP algorithms is presented to solve numerical optimal control problems,
which permits both online and offline implementation. Finally, a GVI ADP
algorithm is developed with the necessary analysis results, which overcomes the
disadvantages of the traditional VI algorithm.
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Agricola I, Friedrich T (2008) Elementary geometry. American Mathematical Society, Prov-
idence
3. Almudevar A, Arruda EF (2012) Optimal approximation schedules for a class of iterative algo-
rithms with an application to multigrid value iteration. IEEE Trans Autom Control 57(12):3132–
3146
4. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B
Cybern 38(4):943–949
5. Arruda EF, Ourique FO, LaCombe J, Almudevar A (2013) Accelerating the convergence of
value iteration by using partial transition functions. Eur J Oper Res 229(1):190–198
6. Beard R (1995) Improving the closed-loop performance of nonlinear systems. Ph.D. thesis,
Rensselaer Polytechnic Institute
7. Bertsekas DP (2007) Dynamic programming and optimal control, 3rd edn. Athena Scientific,
Belmont
8. Cheng T, Lewis FL, Abu-Khalaf M (2007) A neural network solution for fixed-final time
optimal control of nonlinear systems. Automatica 43(3):482–490
9. Dorf RC, Bishop RH (2011) Modern control systems, 12th edn. Prentice-Hall, Upper Saddle
River
10. Feinberg EA, Huang J (2014) The value iteration algorithm is not strongly polynomial for
discounted dynamic programming. Oper Res Lett 42(2):130–131
11. Grune L, Rantzer A (2008) On the infinite horizon performance of receding horizon controllers.
IEEE Trans Autom Control 53(9):2100–2111
12. Hagan MT, Demuth HB, Beale MH (1996) Neural network design. PWS Publishing Company,
Boston
13. Haykin S (2009) Neural networks and learning machines, 3rd edn. Prentice-Hall, Upper Saddle
River
14. Jin N, Liu D, Huang T, Pang Z (2007) Discrete-time adaptive dynamic programming using
wavelet basis function neural networks. In: Proceedings of the IEEE symposium on approximate
dynamic programming and reinforcement learning, pp 135–142
15. Kushner HJ (2010) Numerical algorithms for optimal controls for nonlinear stochastic systems
with delays. IEEE Trans Autom Control 55(9):2170–2176
16. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
17. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag
32(6):76–105
18. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
19. Liu D, Zhang Y, Zhang H (2005) A self-learning call admission control scheme for CDMA
cellular networks. IEEE Trans Neural Netw 16(5):1219–1228
20. Liu D, Javaherian H, Kovalenko O, Huang T (2008) Adaptive critic learning techniques for
engine torque and air-fuel ratio control. IEEE Trans Syst Man Cybern Part B Cybern 38(4):988–
993
21. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
22. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
23. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
24. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern Part C Appl Rev 32(2):140–153
25. Navarro-Lopez EM (2007) Local feedback passivation of nonlinear discrete-time systems
through the speed-gradient algorithm. Automatica 43(7):1302–1306
26. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc Control
Theory Appl 153(5):567–574
27. Seiffertt J, Sanyal S, Wunsch DC (2008) Hamilton-Jacobi-Bellman equations and approximate
dynamic programming on time scales. IEEE Trans Syst Man Cybern Part B Cybern 38(4):918–
923
28. Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
29. Sira-Ramirez H (1991) Non-linear discrete variable structure systems in quasi-sliding mode.
Int J Control 54(5):1171–1187
30. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learn-
ing solution of coupled Hamilton-Jacobi equations. Automatica 47(8):1556–1569
31. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
32. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
33. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36
34. Wei Q, Liu D (2012) An iterative ε-optimal control scheme for a class of discrete-time nonlinear
systems with unfixed initial state. Neural Netw 32:236–244
35. Wei Q, Liu D (2012) A new optimal control method for discrete-time nonlinear systems with
approximation errors. In: Proceedings of the world congress on intelligent control and automa-
tion, pp 185–190
36. Wei Q, Liu D (2012) Adaptive dynamic programming with stable value iteration algorithm for
discrete-time nonlinear systems. In: Proceedings of the IEEE international joint conference on
neural networks, pp 1–6
37. Wei Q, Liu D (2013) Numerical adaptive learning control scheme for discrete-time non-linear
systems. IET Control Theory Appl 7(18):1472–1486
38. Wei Q, Liu D (2014) A novel iterative θ-Adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
39. Wei Q, Liu D (2014) Stable iterative adaptive dynamic programming algorithm with approxi-
mation errors for discrete-time nonlinear systems. Neural Comput Appl 24(6):1355–1367
40. Wei Q, Liu D (2015) A novel policy iteration based deterministic Q-learning for discrete-time
nonlinear systems. Sci China Inf Sci 58(12):1–15
41. Wei Q, Liu D (2015) Neural-network-based adaptive optimal tracking control scheme for
discrete-time nonlinear systems with approximation errors. Neurocomputing 149:106–115
42. Wei Q, Liu D, Lewis FL (2015) Optimal distributed synchronization control for continuous-
time heterogeneous multi-agent differential graphical games. Inf Sci 317:96–113
43. Wei Q, Wang FY, Liu D, Yang X (2014) Finite-approximation-error-based discrete-time itera-
tive adaptive dynamic programming. IEEE Trans Cybern 44(12):2820–2833
44. Wei Q, Zhang H, Dai J (2009) Model-free multiobjective approximate dynamic programming
for discrete-time nonlinear systems with general performance index functions. Neurocomputing
72(7–9):1839–1848
45. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. Gen Syst Yearb 22:25–38
46. Werbos PJ (1991) A menu of designs for reinforcement learning over time. In: Miller WT,
Sutton RS, Werbos PJ (eds) Neural networks for control. MIT Press, Cambridge, pp 67–95
47. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: Neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
48. Yang Q, Vance JB, Jagannathan S (2008) Control of nonaffine nonlinear discrete-time systems
using reinforcement-learning-based linearly parameterized neural networks. IEEE Trans Syst
Man Cybern Part B Cybern 38(4):994–1001
49. Younkin G, Hesla E (2008) Origin of numerical control. IEEE Ind Appl Mag 14(5):10–12
50. Zhang C, Ordonez R (2007) Numerical optimization-based extremum seeking control with
application to ABS design. IEEE Trans Autom Control 52(3):454–467
51. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans
Syst Man Cybern Part B Cybern 38(4):937–942
52. Zhang H, Luo Y, Liu D (2009) The RBF neural network-based near-optimal control for a class
of discrete-time affine nonlinear systems with control constraint. IEEE Trans Neural Netw
20(9):1490–1503
53. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
Chapter 4
Policy Iteration for Optimal Control
of Discrete-Time Nonlinear Systems
4.1 Introduction
Value iteration and policy iteration ADP algorithms are effective methods for obtaining
solutions of the Bellman equation indirectly [6, 8–11, 13, 19–21,
24]. Many discussions on value iteration have been presented in the last two chapters.
On the other hand, policy iteration is the main focus of this chapter.
Policy iteration algorithm for continuous-time systems was proposed in [14]. In
[14], Murray et al. proved that for continuous-time affine nonlinear systems, the iter-
ative value function converges to the optimum nonincreasingly and each of the
iterative control laws stabilizes the nonlinear system. This is a merit of the policy
iteration algorithm, and hence it has found many applications
in solving optimal control problems of nonlinear systems. In 2005, Abu-Khalaf
and Lewis [1] proposed a policy iteration algorithm for continuous-time nonlinear
systems with control constraints. Zhang et al. [25] applied policy iteration algorithm
to solve continuous-time nonlinear two-person zero-sum games. Vamvoudakis et al.
[17] proposed a multiagent differential graphical games for continuous-time linear
systems using policy iteration algorithms. Bhasin et al. [3] proposed an online actor–
critic–identifier architecture to obtain optimal control law for uncertain continuous-
time nonlinear systems by policy iteration algorithms. Till now, nearly all the online
iterative ADP algorithms are policy iteration algorithms. However, most of the dis-
cussions on the policy iteration algorithms are based on continuous-time systems
[3, 17, 23, 25]. The discussions on policy iteration algorithms for discrete-time con-
trol systems are scarce. Only in [8, 9, 16] was a brief sketch of the policy iteration algorithm
for discrete-time systems mentioned, while the stability and convergence
properties were not discussed. To the best of our knowledge, there has not been much discussion
focused on policy iteration algorithms for discrete-time systems. Therefore, in this
chapter, policy iteration method for optimal control is developed for discrete-time
nonlinear systems [12].
be an arbitrary sequence of controls from k to ∞. The cost function for state x0 under
the control sequence u0 = (u0 , u1 , . . . ) is defined as
J(x0, u0) = Σ_{k=0}^{∞} U(xk, uk), (4.2.2)
Define the law of optimal control as u∗ (·). Each uk∗ = u∗ (xk ) is obtained by
uk∗ = arg min_{uk} { U(xk, uk) + J∗(F(xk, uk)) }.
We can see that if we want to obtain the optimal control sequence uk , we must
obtain the optimal cost function J ∗ (xk ). Generally speaking, J ∗ (xk ) is unknown before
the whole control sequence uk is considered. If we adopt the traditional dynamic
programming method to obtain the optimal cost function at every time step, then
we have to face the “curse of dimensionality”. Furthermore, the optimal control
is discussed in infinite horizon. This means the length of the control sequence is
infinite, which implies that the optimal control sequence is nearly impossible to
obtain analytically by the Bellman equation (4.2.4). To overcome this difficulty,
a novel discrete-time policy iteration ADP algorithm will be developed to obtain
the optimal control sequence for nonlinear systems. The goal of the present policy
iteration ADP algorithm is to construct iterative control laws vi (xk ), which move an
arbitrary initial state x0 to the equilibrium xk = 0, and simultaneously guarantee that
the iterative value functions reach the optimum.
In the present policy iteration algorithm, the value functions and control laws are
updated by iterations, with the iteration index i increasing from 0 to ∞. Starting
from an arbitrary admissible control law v0 (xk ), the PI algorithm can be described
as follows.
For i = 1, 2, . . . , the iterative value function Vi (xk ) is constructed from vi−1 (xk )
by solving the following generalized Bellman equation
In the above PI algorithm, (4.2.5) is called policy evaluation (or value update)
and (4.2.6) is called policy improvement (or policy update) [16, 18]. The iterative
Table 4.1 The iterative process of the PI algorithm in (4.2.5) and (4.2.6)
i = 1: v0 → V1 by (4.2.5) (evaluation), then V1 → v1 by (4.2.6) (minimization)
i = 2: v1 → V2 by (4.2.5) (evaluation), then V2 → v2 by (4.2.6) (minimization)
i = 3: v2 → V3 by (4.2.5) (evaluation), then V3 → v3 by (4.2.6) (minimization)
· · ·
process of the PI algorithm in (4.2.5) and (4.2.6) is illustrated in Table 4.1, where the
iteration flows as in Fig. 2.1. The algorithm starts with an admissible control v0 (xk ),
and it obtains v1(xk) at the end of the first iteration. Comparing Tables 2.2
and 4.1, we note that the “calculation” performed in Table 2.2 using (2.2.11) is a simpler
operation than the “evaluation” in Table 4.1 based on (4.2.5). In (4.2.5), one needs to
solve for Vi(xk) from a nonlinear functional equation, whereas (2.2.11) is a
simple calculation.
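To see why the evaluation step is heavier than a plain calculation, here is a hedged sketch (dynamics, utility, and policy are assumptions) that solves the fixed-point equation (4.2.5) on a state grid by successive substitution; in the book this role is played by training the critic network instead.

```python
import numpy as np

F = lambda x, u: 0.8 * x + u          # assumed dynamics
U = lambda x, u: x**2 + u**2          # assumed utility
v = lambda x: -0.4 * x                # an assumed admissible control law v_{i-1}

X = np.linspace(-1.0, 1.0, 201)
V = np.zeros_like(X)
for _ in range(2000):
    # one pass of V_i(x) = U(x, v(x)) + V_i(F(x, v(x))) on the grid
    V_new = U(X, v(X)) + np.interp(F(X, v(X)), X, V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
# (4.2.6) would then improve the policy by minimizing U(x, u) + V(F(x, u)) over u.
```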
From the policy iteration algorithm (4.2.5) and (4.2.6), we can see that the iterative
value function Vi (xk ) is used to approximate J ∗ (xk ), and the iterative control law vi (xk )
is used to approximate u∗(xk). As (4.2.5) is generally not a Bellman equation, the
iterative value functions and the iterative control laws are, in general, not optimal in each iteration.
Theorem 4.2.1 For i = 0, 1, . . ., let Vi (xk ) and vi (xk ) be obtained by (4.2.5) and
(4.2.6). If Assumption 4.2.1 holds, then the iterative value function Vi (xk ), ∀xk ∈ Rn ,
is a monotonically nonincreasing sequence, i.e., Vi+1(xk) ≤ Vi(xk), ∀i = 0, 1, . . .,
where vi (xk ) is obtained by (4.2.6). According to (4.2.5), (4.2.6), and (4.2.8), we can
obtain
Then, we derive
We can obtain
Let ε > 0 be an arbitrary positive number. Since Vi (xk ) is nonincreasing, there exists
a positive integer p such that
= V∞ (xk ).
(3) Show that the value function V∞(xk) equals the optimal cost function J∗(xk).
According to the definition of J ∗ (xk ) in (4.2.3), for i = 0, 1, . . ., we have
Vi (xk ) ≥ J ∗ (xk ).
On the other hand, for an arbitrary admissible control law μ(xk ), (4.2.19) holds. Let
μ(xk ) = u∗ (xk ), where u∗ (xk ) is the optimal control law. Thus, when μ(xk ) = u∗ (xk ),
we can get
According to (4.2.20) and (4.2.21), we can obtain (4.2.12). The proof is complete.
Remark 4.2.1 In Theorems 4.2.1 and 4.2.2, we have proven that the iterative value
functions are monotonically non-increasing and convergent to the optimal cost func-
tion as i → ∞. The stability of the nonlinear systems can also be guaranteed under the
iterative control laws. Hence, the convergence and stability properties of the policy
iteration algorithms for continuous-time nonlinear systems are also valid for the policy
iteration algorithm developed in this chapter for discrete-time nonlinear systems.
Remark 4.2.2 In [2], a value iteration ADP algorithm was proposed to obtain the optimal
solution of the Bellman equation for the following discrete-time affine nonlinear
systems
xk+1 = f(xk) + g(xk)uk,
with the quadratic utility function U(xk, uk) = xkT Qxk + ukT Ruk, where Q and R are
defined as the penalizing matrices for the system state and control vectors, respectively.
Let Q ∈ Rn×n and R ∈ Rm×m both be positive definite matrices.
Starting from V0(xk) ≡ 0, the value iteration algorithm can be expressed as
ui(xk) = −(1/2) R−1 gT(xk) ∂Vi(xk+1)/∂xk+1,
Vi+1(xk) = xkT Qxk + uiT(xk) R ui(xk) + Vi(xk+1). (4.2.22)
As the policy iteration algorithm requires to start with an admissible control law, the
method of obtaining the admissible control law is important to the implementation
of the algorithm. Actually, for all the policy iteration algorithms of ADP, including
[1, 3, 14, 17, 23, 25], the initial admissible control law is necessary to implement
their algorithms. Unfortunately, to the best of our knowledge, there is no theory on
how to design the initial admissible control law. In this section, we will give an
effective method to obtain the initial admissible control law by experiments using
neural networks.
First, let Ψ (xk ) ≥ 0 be an arbitrary semipositive definite function. For i = 0, 1, . . .,
we define a new value function as
U(xk+j , μ(xk+j )) → 0 as j → ∞.
Step 5. Use the trained cnet1 to get Φi+1 (xk ) and use cnet2 to get Φi (xk ). If |Φi+1 (xk ) − Φi (xk )| <
ε, then goto Step 7; else, goto next step.
Step 6. If i > imax , then goto Step 1; else, goto Step 3.
Step 7. Let v0 (xk ) = μ(xk ).
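A simulation-based sanity check in the same spirit (a hedged sketch, not Algorithm 4.2.1 itself; F, U, μ, and the sampled initial states are assumptions): a candidate law μ(xk) is rolled out and the tail of the stage costs must vanish, reflecting the requirement U(xk+j, μ(xk+j)) → 0 as j → ∞.

```python
import numpy as np

def looks_admissible(mu, F, U, x0_samples, horizon=500, tail=50, tol=1e-6):
    """Roll out the closed loop under mu and require the tail stage costs to vanish."""
    for x0 in x0_samples:
        x = np.asarray(x0, dtype=float)
        costs = []
        for _ in range(horizon):
            u = mu(x)
            costs.append(U(x, u))
            x = F(x, u)
        if np.sum(costs[-tail:]) > tol:      # cost has not died out: not admissible
            return False
    return True
```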
Remark 4.2.3 We can see that the above training procedure can easily be imple-
mented by computer program. After creating an action network, the weights of the
critic networks cnet1 and cnet2 can be updated iteratively. If the weights of the critic
network are convergent, then μ(xk ) must be an admissible control law which is gener-
ated by a neural network with random weights. As the weights of action network are
chosen randomly, the admissibility of the control law is unknown before the weights
of the critic networks are convergent. Hence, the iteration process of Algorithm 4.2.1
is implemented off-line.
Step 3. If |Vi (xk ) − Vi+1 (xk )| < ε, goto Step 5; else, goto Step 4.
Step 4. If i < imax , then goto Step 1; else, goto Step 6.
Step 5. The optimal control law is achieved as u∗ (xk ) = vi (xk ). Stop the algorithm.
Step 6. The optimal control law is not achieved within imax iterations.
To evaluate the performance of our policy iteration algorithm, four examples are
chosen: (1) a linear system, (2) a discrete-time nonlinear system, (3) a torsional
pendulum system, and (4) a complex nonlinear system.
Example 4.3.1 In the first example, a linear system is considered. The results will
be compared to the traditional linear quadratic regulation (LQR) method to verify
the effectiveness of the present algorithm. We consider the following linear system
xk+1 = Axk + Buk, (4.3.1)
where
A = [0, 0.1; 0.3, −1], B = [0; 0.5]. (4.3.2)
The initial state is x0 = [1, −1]T . Let the cost function be expressed by (4.2.2). The
utility function is of quadratic form, expressed as U(xk, uk) = xkT Qxk + ukT Ruk,
where Q is the 2 × 2 identity matrix and R = 0.5.
Neural networks are used to implement the present policy iteration algorithm. The
critic network and the action network are chosen as three-layer BP neural networks
with the structures of 2–8–1 and 2–8–1, respectively. For each iteration step, the critic
network and the action network are trained for 80 steps so that the neural network
training error becomes less than 10−5 . The initial admissible control law is obtained
by Algorithm 4.2.1, where the weights and thresholds of action network are obtained
as
Va,initial = [ −4.1525 −1.1980
                0.3693 −0.8828
                1.8071  2.8088
                0.4104 −0.9845
                0.7319 −1.7384
                1.2885 −2.5911
               −0.3403  0.8154
               −0.5647  1.3694 ],
and
respectively. Implement the policy iteration algorithm for 6 iterations to reach the
computation precision ε = 10−5 , and the convergence trajectory of the iterative value
functions is shown in Fig. 4.1a. During each iteration, the iterative control law is
updated. Applying the iterative control law to the given system (4.3.1) for Tf = 15
time steps, we can obtain the states and the controls, which are displayed in Fig. 4.1b, c
respectively.
Fig. 4.1 Numerical results of Example 4.3.1. a The convergence trajectory of iterative value func-
tion. b The states. c The controls. d The trajectories of x2 under the optimal control of policy
iteration and ARE
We can see that the optimal control law of system (4.3.1) is obtained after 6
iterations. On the other hand, it is known that the solution of the optimal control
problem for linear systems is quadratic in the state given by J ∗ (xk ) = xkT Pxk , where
P is the solution of the algebraic Riccati equation (ARE) [7]. The solution of the ARE
for the given linear system (4.3.1) can be determined as P = [1.091, −0.309; −0.309, 2.055].
The optimal control can be obtained as u∗(xk) = [−0.304, 1.029] xk. Applying this
optimal control law to the linear system (4.3.1), we can obtain the same optimal
control results. The trajectories of system state x2 under the optimal control laws
obtained by the present policy iteration algorithm and by the ARE are displayed in
Fig. 4.1d.
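As an independent cross-check of this comparison, the ARE solution can be reproduced with an off-the-shelf discrete-time Riccati solver (a sketch; SciPy is assumed to be available):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.0, 0.1], [0.3, -1.0]])
B = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = np.array([[0.5]])

P = solve_discrete_are(A, B, Q, R)                  # solution of the ARE
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain, u* = -K x
print(np.round(P, 3))    # close to [[1.091, -0.309], [-0.309, 2.055]]
print(np.round(-K, 3))   # close to [[-0.304, 1.029]]
```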
Example 4.3.2 The second example is chosen from the example in [2, 20]. We
consider the following nonlinear system
xk+1 = f(xk) + g(xk)uk, (4.3.3)
where xk = [x1k, x2k]T and uk are the state and control variables, respectively. The
system functions are given as f(xk) = [0.2 x1k exp(x2k²), 0.3 x2k³]T and g(xk) = [0, −0.2]T.
The initial state is x0 = [2, −1]T. In this example, two utility functions, which are quadratic
and nonquadratic forms, will be considered, respectively. The first utility function
is a quadratic form which is the same as the one in Example 4.3.1, where Q and
R are chosen as identity matrices. The configurations of the critic network and the
action network are chosen the same as the ones in Example 4.3.1. For each iteration
step, the critic network and the action network are trained for 100 steps so that the
neural network training error becomes less than 10−5 . Implement the policy iteration
algorithm for 6 iterations to reach the computation precision of 10−5 . The conver-
gence trajectories of the iterative value functions are shown in Fig. 4.2a. Applying
the optimal control law to the given system (4.3.3) for Tf = 10 time steps, we can
obtain the iterative states trajectories and control, which are displayed in Fig. 4.2b–d,
respectively.
In [2], it is proven that the optimal control law can be obtained by the value
iteration algorithm described in (4.2.22). The convergence trajectory of the iterative
value function is shown in Fig. 4.3a. The optimal trajectories of system states and
control are displayed in Fig. 4.3b–d, respectively.
Next, we change the utility function into a nonquadratic form as in [21] with
modifications, where the utility function is expressed as
Fig. 4.2 Numerical results using policy iteration algorithm. a The convergence trajectory of iterative
value function. b The optimal trajectory of state x1 . c The optimal trajectory of state x2 . d The optimal
control
Fig. 4.3 Numerical results using value iteration algorithm. a The convergence trajectory of iterative
value function. b The optimal trajectory of state x1 . c The optimal trajectory of state x2 . d The optimal
control
Let the other parameters keep unchanged. Using the present policy iteration algo-
rithm, we can also obtain the optimal control law for the system. The value function
is shown in Fig. 4.4a. The optimal trajectories of iterative states and controls are
shown in Fig. 4.4b–d, respectively.
Remark 4.3.1 From the numerical results, we can see that for both quadratic and
nonquadratic utility functions, the optimal control law of the nonlinear system can be
effectively obtained. On the other hand, we have shown that using the value iteration
algorithm in [2], we can also obtain the optimal control law of the system. But
we should point out that the convergence properties of the iterative value functions
by the policy and value iteration algorithms are obviously different. Thus, stability
behaviors of control systems under the two iterative control algorithms will be quite
different. In the next example, detailed comparisons will be displayed.
Example 4.3.3 We now examine the performance of the present algorithm in a tor-
sional pendulum system [15]. The dynamics of the pendulum is as follows
dθ/dt = ω,
J dω/dt = u − Mgl sin θ − fd dθ/dt,
Fig. 4.4 Numerical results for nonquadratic utility function using policy iteration algorithm. a The
convergence trajectory of iterative value function. b The states. c The controls. d The optimal states
and control
where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar,
respectively. The system states are the current angle θ and the angular velocity ω. Let
J = (4/3)Ml² and fd = 0.2 be the rotary inertia and frictional factor, respectively. Let
g = 9.8 m/s² be the gravitational acceleration. Discretization of the system function and cost function
using Euler and trapezoidal methods [4] with the sampling interval Δt = 0.1s leads
to
where x1k = θk and x2k = ωk . Let the initial state be x0 = [1, −1]T . Let the utility
function be of quadratic form and the structures of the critic and action networks be
2–12–1 and 2–12–1, respectively. The initial admissible control law is obtained by Algo-
rithm 4.2.1, where the weight matrices are obtained as
Va,initial = [ −0.6574  1.1803
                1.5421 −2.9447
               −4.3289 −3.7448
                5.7354  2.8933
                0.4354 −0.8078
               −1.9680  3.6870
                1.9285  1.4044
               −4.9011 −4.3527
                1.0914 −0.0344
               −1.5746  2.8033
                1.4897 −0.0315
                0.2992 −0.0784 ],
and
For each iteration step, the critic network and the action network are trained for 400
steps so that the neural network training error becomes less than 10−5 . Implement the
policy iteration algorithm for 16 iterations to reach the computation precision 10−5 ,
and the convergence trajectory of the iterative value function is shown in Fig. 4.5a.
Apply the iterative control laws to the given system for Tf = 100 time steps and the
trajectories of the iterative control and states are displayed in Fig. 4.5b, c respectively.
The optimal trajectories of system states and control are displayed in Fig. 4.5d.
From the numerical results, we can see that using the present policy iteration
algorithm, any of the iterative control laws can stabilize the system. But for the value
iteration algorithm, the situation is quite different. For the value iteration algorithm,
we choose the initial value function V0 (xk ) ≡ 0 as in [2]. Run the value iteration
algorithm (4.2.22) for 30 iterations to reach the computation precision 10−5, and the
trajectory of the value function is shown in Fig. 4.6a. Applying the iterative control
law to the given system (4.3.4) for Tf = 100 time steps, we can obtain the iterative
states and iterative controls, which are displayed in Fig. 4.6b, c respectively. The
optimal trajectories of control and system states are displayed in Fig. 4.6d.
For unstable control systems, although the optimal control law can be obtained by
both value and policy iteration algorithms, for the value iteration algorithm we can see that
not all of the iterative controls can stabilize the system. Moreover, the properties of the iterative
Fig. 4.5 Numerical results of Example 4.3.3 using policy iteration algorithm. a The convergence
trajectory of iterative value function. b The controls. c The states. d The optimal states and control
controls obtained by the value iteration algorithm cannot be analyzed, and thus, the
value iteration algorithm can only be implemented off-line. For the present policy
iteration algorithm, the stability property can be guaranteed. This demonstrates the
effectiveness of the policy iteration algorithm developed in this chapter.
Example 4.3.4 As a real-world application of the present method, the problem of
nonlinear satellite attitude control has been selected [5, 22]. The satellite dynamics is
represented by
dω/dt = Υ−1 (Nnet − ω × Υω),
where Υ, ω, and Nnet are the inertia tensor, the angular velocity vector of the body frame with
respect to the inertial frame, and the vector of the total torque applied on the satellite,
respectively. The selected satellite is an inertial pointing satellite. Hence, we are
interested in its attitude with respect to the inertial frame. All vectors are represented
in the body frame and the sign × denotes the cross product of two vectors. Let
Fig. 4.6 Numerical results of Example 4.3.3 using value iteration algorithm. a The convergence
trajectory of iterative value function. b The controls. c The states. d The optimal states and control
Nnet = Nctrl + Ndis , where Nctrl is the control, and Ndis is disturbance. Following [22]
and its order of transformation, the kinematic equation of the satellite becomes
d/dt [φ, θ, Ψ]T = [1, sin φ tan θ, cos φ tan θ; 0, cos φ, −sin φ; 0, sin φ/cos θ, cos φ/cos θ] [ωx, ωy, ωz]T, (4.3.5)
where φ, θ , and Ψ are the three Euler angles describing the attitude of the satellite with
respect to x, y, and z axes of the inertial coordinate system, respectively. Subscripts
x, y, and z are the corresponding elements of the angular velocity vector ω. The three
Euler angles and the three elements of the angular velocity form the elements of the
state space for the satellite attitude control problem. The state equation is given as
follows,
ẋ = f (x) + g(x)u,
Fig. 4.7 The convergence trajectories of the critic network weights (wc1–wc15) versus iteration steps
where
x = [φ, θ, Ψ, ωx, ωy, ωz]T,  u = [ux, uy, uz]T,
f(x) = [ M3×1 ; Υ−1(Ndis − ω × Υω) ],  g(x) = [ 03×3 ; Υ−1 ],
and M3×1 denotes the right hand side of (4.3.5). The three-by-three null matrix is
denoted by 03×3 . The moment of inertia matrix of the satellite is chosen as
Υ = [100, 2, 0.5; 2, 100, 1; 0.5, 1, 110] kg·m².
The initial states are 60◦ , 20◦ , and 70◦ for the Euler angles of φ, θ , and Ψ ,
respectively, and −0.001, 0.001, and 0.001 rad/s for the angular rates around x, y,
and z axes, respectively. For convenience of analysis, we assume that there is no
disturbance in the system. Discretization of the system function and value function
using Euler and trapezoidal methods [4] with the sampling interval Δt = 0.25 s leads
to
xk+1 = (Δtf (xk ) + xk ) + Δt g(xk )uk .
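A hedged sketch of this discretized model in code (the kinematic matrix follows (4.3.5), the disturbance is set to zero as assumed in the text, and everything not stated in the text, such as variable names, is an assumption):

```python
import numpy as np

dt = 0.25
Upsilon = np.array([[100.0,   2.0,   0.5],
                    [  2.0, 100.0,   1.0],
                    [  0.5,   1.0, 110.0]])        # inertia tensor, kg*m^2
Upsilon_inv = np.linalg.inv(Upsilon)

def f(x):
    phi, theta, _, wx, wy, wz = x
    w = np.array([wx, wy, wz])
    # Euler-angle kinematics, right-hand side of (4.3.5)
    K = np.array([[1.0, np.sin(phi) * np.tan(theta), np.cos(phi) * np.tan(theta)],
                  [0.0, np.cos(phi),                 -np.sin(phi)],
                  [0.0, np.sin(phi) / np.cos(theta),  np.cos(phi) / np.cos(theta)]])
    d_angles = K @ w
    d_omega = Upsilon_inv @ (-np.cross(w, Upsilon @ w))   # N_dis = 0
    return np.concatenate([d_angles, d_omega])

def g(x):
    return np.vstack([np.zeros((3, 3)), Upsilon_inv])

def step(x, u):
    """One Euler step: x_{k+1} = (dt*f(x_k) + x_k) + dt*g(x_k)*u_k."""
    return x + dt * f(x) + dt * (g(x) @ u)

x0 = np.array([np.deg2rad(60.0), np.deg2rad(20.0), np.deg2rad(70.0),
               -0.001, 0.001, 0.001])               # initial state from the text
x1 = step(x0, np.zeros(3))                          # one closed-loop-free step
```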
Fig. 4.8 The convergence trajectories of the first column of weights of the action network
Let the utility function be of quadratic form, where the state and control weight matrices
are selected as
Q = diag{0.25, 0.25, 0.25, 25, 25, 25}
and
R = diag{25, 25, 25},
respectively.
Neural networks are used to implement the present policy iteration algorithm.
The critic network and the action network are chosen as three-layer BP neural net-
works with the structures of 6–15–1 and 6–15–3, respectively. For each iteration
step, the critic network and the action network are trained for 800 steps so that the
neural network training error becomes less than 10−5 . Implement the policy iteration
algorithm for 150 iterations. We have proven that the weights of critic and action net-
works are convergent in each iteration and thus convergent to the optimal ones. The
convergence trajectories of the critic network weights are shown in Fig. 4.7. The
convergence trajectories of the first column of the action network weights are shown in Fig. 4.8.
The iterative value functions are shown in Fig. 4.9a. After the weights of the critic
and action networks are convergent, we apply the neuro-optimal controller to the
given system for Tf = 1500 time steps. The optimal state trajectories of φ, θ and Ψ
are shown in Fig. 4.9b. The trajectories of angular velocities ωx , ωy , and ωz are shown
Fig. 4.9 Numerical results of Example 4.3.4 using policy iteration algorithm. a The convergence
trajectory of the iterative value function. b The trajectories of the Euler angles φ, θ, and Ψ. c The
trajectories of the angular velocities ωx, ωy, and ωz. d The optimal control trajectories
in Fig. 4.9c, and the optimal control signals are shown in Fig. 4.9d. The numerical
results illustrate the effectiveness of the present policy iteration algorithm.
4.4 Conclusions
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B:
Cybern 38(4):943–949
3. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
4. Gupta SK (1995) Numerical methods for engineers. New Age International, India
5. Heydari A, Balakrishnan SN (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
6. Huang T, Liu D (2013) A self-learning scheme for residential energy system control and
management. Neural Comput Appl 22(2):259–269
7. Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
8. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
9. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst
32(6):76–105
10. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
11. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
12. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
13. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw
Learn Syst 24(10):1513–1525
14. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern-Part C: Appl Rev 32(2):140–153
15. Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
16. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
17. Vamvoudakis KG, Lewis FL, Hudas GR (2012) Multi-agent differential graphical games: online
adaptive learning solution for synchronization with optimality. Automatica 48(8):1598–1611
18. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
19. Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of
discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocom-
puting 78(1):14–22
20. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36
21. Wei Q, Zhang H, Dai J (2009) Model-free multiobjective approximate dynamic programming
for discrete-time nonlinear systems with general performance index functions. Neurocomputing
72(7):1839–1848
22. Wertz JR (1978) Spacecraft attitude determination and control. Kluwer, Netherlands
23. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
24. Zhang H, Song R, Wei Q, Zhang T (2011) Optimal tracking control for a class of nonlinear
discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans
Neural Netw 22(12):1851–1862
25. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
Chapter 5
Generalized Policy Iteration ADP
for Discrete-Time Nonlinear Systems
5.1 Introduction
According to [14, 15], most discrete-time reinforcement learning methods can be described
as generalized policy iteration (GPI) algorithms. A GPI algorithm interleaves two iterative
procedures: policy evaluation, which makes the value function consistent with the current
policy, and policy improvement, which produces a new policy that improves on the previous
one [14]. GPI allows one of these two procedures to be performed without completing the
other step a priori. Investigations
of GPI algorithms are important for the development of ADP. There exist inherent
differences between GPI algorithms and the value and policy iteration algorithms.
Analysis methods for traditional value and policy iteration algorithms are not valid
for GPI algorithms anymore. For a long time, discussions on the properties of GPI
algorithms for discrete-time control systems have been scarce. To the best of our knowledge,
the properties of GPI algorithms were analyzed only in [18], while the stability of the
system under the iterative control law in [18] cannot be guaranteed.
Therefore, together with the GPI algorithms to be developed in this chapter, we will
establish convergence, admissibility, and optimality property analysis results as well
[11, 18].
Consider the discrete-time nonlinear system
xk+1 = F(xk, uk),   (5.2.1)
where xk ∈ Rn is the state vector and uk ∈ Rm is the control vector. Let x0 be the
initial state. Let F(xk, uk) denote the system function, which is known. For any
k = 0, 1, . . ., let u_k = (uk, uk+1, . . .) be an arbitrary sequence of controls from k to
∞. The cost function for state x0 under the control sequence u_0 = (u0, u1, . . .) is
defined as
J(x0, u_0) = Σ_{k=0}^{∞} U(xk, uk),   (5.2.2)
where
u_0 = (u0, u1, . . .) = (μ(x0), μ(x1), . . .).
We have studied value iteration (VI) algorithms as well as policy iteration (PI)
algorithms in previous chapters. Both VI and PI algorithms have their own advan-
tages. It is desirable to develop a scheme that combines the two so that we can enjoy
the benefits of both algorithms. This motivates the work in this chapter, and we call
the new scheme generalized policy iteration (GPI) [11].
For i = 1, 2, . . . and for all xk ∈ Ω, the generalized policy iteration algorithm can
be expressed by the following two iterative procedures.
j-iteration:
For ji = 1, 2, . . . , Ni and for all xk ∈ Ω, we update the iterative value function by
Vi,ji(xk) = U(xk, vi−1(xk)) + Vi,ji−1(F(xk, vi−1(xk))),   (5.2.7)
where
Vi,0(xk+1) = Vi−1(xk+1)
and V0(·) is obtained in (5.2.5). For all xk ∈ Ω, define the iterative value function at the end of the j-iteration as
Vi(xk) = Vi,Ni(xk).   (5.2.8)
i-iteration:
For all xk ∈ Ω, the iterative control law is improved by
vi(xk) = arg min_{uk} {U(xk, uk) + Vi(F(xk, uk))}.   (5.2.9)
From the above generalized policy iteration algorithm, we can see that for each
j-iteration, it computes the iterative value function for the control law vi−1 (xk ), which
tries to solve the following generalized Bellman equation
Vi, ji (xk ) = U (x, vi−1 (xk )) + Vi, ji (F(xk , vi−1 (xk ))). (5.2.10)
This iterative procedure is called “policy evaluation” [8, 14]. In this procedure, the
iterative value function Vi, ji (xk ) is updated, while the control law is kept unchanged.
For each i-iteration, based on the iterative value function Vi (xk ) for some control law,
we use it to find another policy that is better, or at least not worse. This procedure is
known as “policy improvement” [8, 14]. In this iterative procedure, the control law
is updated.
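To make the interplay of the two procedures concrete, the following Python sketch runs the same j-iteration/i-iteration structure on a small, randomly generated finite-state model. The finite model, the discount factor, and the schedule N_sched are illustrative assumptions only; they stand in for the nonlinear system F(xk, uk) and utility U(xk, uk) treated in this chapter. Setting every Ni to 1 recovers value-iteration-like behavior, while large Ni approaches policy iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 3
# next_state[s, a]: deterministic transition; cost[s, a] > 0 plays the role of U(x_k, u_k)
next_state = rng.integers(0, n_states, size=(n_states, n_actions))
cost = rng.uniform(0.1, 1.0, size=(n_states, n_actions))
gamma = 0.95                       # discount added only to keep this tabular toy example well posed

V = np.zeros(n_states)             # initial value function V_0
policy = np.zeros(n_states, dtype=int)
N_sched = [1, 2, 5, 5, 10]         # N_i: number of policy-evaluation backups per outer step

idx = np.arange(n_states)
for N_i in N_sched:
    # j-iteration (policy evaluation): N_i backups under the fixed control law
    for _ in range(N_i):
        V = cost[idx, policy] + gamma * V[next_state[idx, policy]]
    # i-iteration (policy improvement): greedy update of the control law
    policy = np.argmin(cost + gamma * V[next_state], axis=1)

print("approximate value function:", np.round(V, 3))
print("greedy policy:", policy)
```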
B. Properties of the GPI Algorithm
In this section, under the assumption that perfect function approximation is available,
we will prove that for any Ni > 0 and for all xk ∈ Ω, the iterative value function
Vi, ji (xk ) will converge to J ∗ (xk ) as i → ∞. The admissibility property of the iterative
control law vi (xk ) will also be presented.
Theorem 5.2.1 Let v0 (xk ) ∈ A (Ω) be the admissible control law obtained by
(5.2.6). For i = 1, 2, . . . and for all xk ∈ Ω, let the iterative value function Vi, ji (xk )
and the iterative control law vi (xk ) be obtained by (5.2.7)–(5.2.9). Let {N1 , N2 , . . .}
be a sequence of nonnegative integers. Then, we have the following properties.
(i) For i = 1, 2, . . . and ji = 1, 2, . . . , Ni + 1, we have
Vi,ji(xk) ≤ Vi,ji−1(xk).   (5.2.11)
(ii) For i = 1, 2, . . ., let ji and j(i+1) be arbitrary constant integers which satisfy
0 ≤ ji ≤ Ni and 0 ≤ j(i+1) ≤ Ni+1 , respectively. Then, we have
Proof The theorem can be proven in two steps. We first prove (5.2.11) by mathe-
matical induction. Let i = 1. From V1,0 (·) = V0 (·) and (5.2.7), we have for j1 = 1,
Assume that
V1, j1 (xk ) ≤ V1, j1 −1 (xk )
Hence, (5.2.11) holds for i = 1. According to (5.2.9), the iterative control law v1 (xk )
is expressed by
where
V1 (xk+1 ) = V1,N1 (xk+1 ).
Next, let i = 2. From V2,0 (·) = V1 (·), V1 (·) = V1,N1 (·), and (5.2.7), we can get for
j2 = 1 and for all xk ∈ Ω,
So, (5.2.11) holds for i = 2 and j2 = 1. Assume that (5.2.11) holds for j2 = l2 ,
l2 = 1, 2, . . . , N2 . Then, for j2 = l2 + 1, we obtain
where Vr (xk+1 ) = Vr,Nr (xk+1 ). Then, for i = r + 1, we have Vr +1,0 (xk ) = Vr (xk ).
According to (5.2.7), for jr +1 = 1, we have
So, (5.2.11) holds for i = r + 1 and jr +1 = 1. Assume that (5.2.11) holds for jr +1 =
lr +1 , lr +1 = 1, 2, . . . , Nr +1 . Then, for jr +1 = lr +1 + 1, we have
which shows that (5.2.12) holds for i = 1. Using mathematical induction, it is easy
to prove that (5.2.12) holds for i = 1, 2, . . .. This completes the proof of the theorem.
Lemma 5.2.1 Suppose that Assumptions 5.2.1–5.2.4 hold. For i = 1, 2, . . . , and for
ji = 0, 1, . . . , Ni , the iterative value function Vi, ji (xk ) is a positive-definite function
of xk .
Proof Let vk(−1) = v(−1) (xk ), v(−1) (xk+1 ), . . . . As v(−1) (xk ) ∈ A (Ω) is admissi-
ble, according to (5.2.5), for all xk ∈ Ω, the iterative value function
V0(xk) = Σ_{l=0}^{∞} U(xk+l, v(−1)(xk+l))   (5.2.19)
Vi,∞ (xk ) = U (x, vi−1 (xk )) + Vi,∞ (F(xk , vi−1 (xk ))), (5.2.20)
where
Proof According to (5.2.11), for i = 1, 2, . . . and for all xk ∈ Ω, the iterative value
function Vi, ji (xk ) is monotonically nonincreasing as ji increases from 0 to Ni . On
the other hand, according to Lemma 5.2.1, Vi, ji (xk ) is a positive-definite function for
i = 1, 2, . . . and ji = 0, 1, . . . , Ni, i.e., Vi,ji(xk) > 0 for all xk ≠ 0. This means that the
iterative value function Vi, ji (xk ) is monotonically nonincreasing and lower bounded.
Hence, for all xk ∈ Ω, the limit of Vi, ji (xk ) exists when ji → ∞. Then, we can obtain
(5.2.20) directly. This completes the proof of the theorem.
Vi, j̄i (xk ) = U (xk , vi−1 (xk )) + Vi, j̄i −1 (F(xk , vi−1 (xk ))), (5.2.21)
Vi,j̄i(xk) = Σ_{l=0}^{j̄i−Ni−1} U(xk+l, vi−1(xk+l)) + Vi,Ni(xk+j̄i−Ni).
According to Theorem 5.2.2, for all xk ∈ Ω, the iterative value function Vi,∞(xk), which is expressed by
Vi,∞(xk) = lim_{j̄i→∞} Σ_{l=0}^{j̄i−Ni−1} U(xk+l, vi−1(xk+l)) + lim_{j̄i→∞} Vi,Ni(xk+j̄i−Ni),
is finite. According to Assumption 5.2.4, the utility function U(xk, vi−1(xk)) > 0 for all xk ≠ 0. Then,
lim_{k→∞} U(xk, vi−1(xk)) = 0,
As Σ_{l=0}^{Ni} U(xk+l, vi−1(xk+l)) is finite, we have
Σ_{l=0}^{∞} U(xk+l, vi−1(xk+l)) = Σ_{l=0}^{Ni} U(xk+l, vi−1(xk+l)) + Vi,∞(xk+Ni+1)
is also finite. The above shows that vi−1 (xk ) is admissible. It is easy to conclude that
vi (xk ) is also admissible. The proof is complete.
Next, the convergence property of the generalized policy iteration algorithm will
be established. As the iteration index i increases to ∞, we will show that the optimal
cost function and optimal control law can be achieved using the present generalized
policy iteration algorithm. Before the main theorem, some lemmas are necessary.
Lemma 5.2.2 (cf. [3]) If a monotonically nonincreasing sequence an , n =
0, 1, . . ., contains an arbitrary convergent subsequence, then sequence an is con-
vergent.
Lemma 5.2.3 For i = 1, 2, . . ., let the iterative value function Vi (xk ) be defined
as in (5.2.8). Then, the iterative value function sequence {Vi (xk )} is monotonically
nonincreasing and convergent.
Theorem 5.2.3 For i = 0, 1, . . . and for all xk ∈ Ω, let Vi, ji (xk ) and vi (xk ) be
obtained by (5.2.5)–(5.2.9). If Assumptions 5.2.1–5.2.4 hold, then for any Ni > 0,
the iterative value function Vi, ji (xk ) converges to the optimal cost function J ∗ (xk ),
as i → ∞, i.e.,
lim_{i→∞} Vi,ji(xk) = J*(xk).   (5.2.22)
there exists a positive integer p such that V p (xk ) − ε ≤ V∞ (xk ) ≤ V p (xk ). Hence,
we can get
Next, for all xk ∈ Ω, let μ(xk ) be an arbitrary admissible control law, and define
a new value function P(xk ) as
According to the definition of the optimal cost function J ∗ (xk ) given in (5.2.3),
for i = 0, 1, . . . and for all xk ∈ Ω, we have Vi (xk ) ≥ J ∗ (xk ). Then, let i → ∞. We
can obtain V∞ (xk ) ≥ J ∗ (xk ).
On the other hand, for an arbitrary admissible control law μ(xk ), (5.2.28) holds.
For all xk ∈ Ω, let μ(xk ) = u ∗ (xk ), where u ∗ (xk ) is the optimal control law. Then,
we can get V∞ (xk ) ≤ J ∗ (xk ). Hence, (5.2.22) holds. The proof is complete.
Corollary 5.2.2 For i = 0, 1, . . ., let Vi,ji(xk) and vi(xk) be obtained by (5.2.5)–(5.2.9).
If ji ≡ 1 for i = 1, 2, . . ., then the iterative value function Vi,ji(xk) converges to the
optimal cost function J*(xk) as i → ∞.
Algorithm 5.2.1 Finite-step policy evaluation algorithm for initial value function
Initialization:
Choose randomly an array of system states xk in Ω, i.e., Xk = {xk^(1), xk^(2), . . . , xk^(p)}, where p is a large positive integer.
Choose an arbitrary positive semidefinite function Ψ(xk) ≥ 0.
Determine an initial admissible control law v(−1)(xk).
Iteration:
Step 1. Let the iteration index j0 = 0 and let V0,0(xk) = Ψ(xk).
Step 2. For all xk ∈ Xk, update the control law v0^{j0}(xk) by
v0^{j0}(xk) = arg min_{uk} {U(xk, uk) + V0,j0(F(xk, uk))},   (5.2.29)
V0,j0(xk) = U(xk, v(−1)(xk)) + V0,j0−1(F(xk, v(−1)(xk))).   (5.2.31)
Goto Step 2.
Step 5. Let V0(xk) = V1,0(xk) and v0(xk) = v0^{j0}(xk).
V0,j0+1(xk) = Σ_{l=0}^{j0} U(xk+l, v(−1)(xk+l)) + Ψ(xk+j0+1).
Since v(−1)(xk) is admissible, Σ_{l=0}^{∞} U(xk+l, v(−1)(xk+l)) is finite. Hence, we conclude that
lim_{j0→∞} V0,j0(xk) is finite, which means V0,j0(xk) is convergent as j0 → ∞. This completes
the proof of the lemma.
Using the admissible control law v(−1) (xk ), from Lemma 5.2.4, V0, j0 +1 (xk ) =
V0, j0 (xk ) holds as j0 → ∞. It means that there must exist N0 > 0 such that
V1,0(xk) ≤ V0,N0(xk). Hence, if we have obtained an admissible control law, then
we can construct the initial value function by finite-step policy evaluation to replace
the value function V0 (xk ) in (5.2.5). On the other hand, Algorithm 5.2.1 requires an
admissible control law v(−1) (xk ) to implement. Usually, the admissible control law
of the nonlinear system is difficult to obtain. To overcome this difficulty, a policy
improvement algorithm can be implemented by experiment. The details are given in
Algorithm 5.2.2.
Theorem 5.2.4 For all xk ∈ Ω, let the iterative control law v0^{ς0}(xk) be expressed as
in (5.2.33) and let the iterative value function V1,0^{ς0}(xk) be expressed as in (5.2.34). If
the iterative value functions satisfy (5.2.35), then the convergence properties (5.2.11)
and (5.2.12) of Theorem 5.2.1 hold for i = 1, 2, . . . and ji = 0, 1, . . . , Ni.
Proof Let i = 1 and j1 = 0. As V0(xk) = V1,0^{ς0}(xk) and v0(xk) = v0^{ς0}(xk), according
to (5.2.7) and (5.2.35), we have
Following similar steps as in the proof of Theorem 5.2.1, the convergence proper-
ties (5.2.11) and (5.2.12) can be shown for i = 1, 2, . . . and ji = 0, 1, . . . , Ni . This
completes the proof of the theorem.
Remark 5.2.1 From Algorithm 5.2.2, we can see that the admissible control law
v(−1) (xk ) in Algorithm 5.2.1 is avoided. This is a merit of Algorithm 5.2.2. However,
in Algorithm 5.2.2, we should find a positive-definite function Ψ̄ ς0 (xk ) that satisfies
(5.2.35). As Ψ̄ ς0 (xk ) is randomly chosen, it may take some iterations to determine
Ψ̄ ς0 (xk ). This is a disadvantage of the algorithm.
Vi, ji (xk ) = U (xk , vi−1 (xk )) + Vi, ji −1 (F(xk , vi−1 (xk ))),
Step 6. For all xk ∈ X k , if |Vi−1 (xk ) − Vi (xk )| < ε, then the optimal cost function and optimal
control law are obtained, and goto Step 7; else, let i = i + 1 and goto Step 3.
Step 7. Return Vi (xk ) and vi (xk ) as the optimal cost function J ∗ (xk ) and the optimal control law
u ∗ (xk ).
To evaluate the performance of our generalized policy iteration algorithm, two exam-
ples are given for numerical experiments to solve the approximate optimal control
problems.
Example 5.2.1 First, let us consider the following spring–mass–damper system [7]
M d²y/dt² + b dy/dt + κy = u,
where y is the position and u is the control input. Let M = 0.1 kg denote the mass
of object. Let κ = 2 kgf/m be the stiffness coefficient of spring and let b = 0.1 be
the wall friction. Let x1 = y, x2 = dy/d t. Discretizing the system function with the
sampling interval Δt = 0.1s leads to
[ x1(k+1); x2(k+1) ] = [ 1, Δt; −(κ/M)Δt, 1 − (b/M)Δt ] [ x1k; x2k ] + [ 0; Δt/M ] uk.   (5.2.36)
Let the initial state be x0 = [1, −1]T . Let the cost function be expressed by (5.2.2).
The utility function is expressed as U (xk , u k ) = xkTxk + u 2k .
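A minimal Python sketch of this discretized model and utility (illustrative helper names, not the authors' code) is given below; it rolls the closed-loop system forward under the admissible gain K quoted later in this example.

```python
import numpy as np

M, kappa, b, dt = 0.1, 2.0, 0.1, 0.1
A = np.array([[1.0, dt],
              [-(kappa / M) * dt, 1.0 - (b / M) * dt]])    # system matrix of (5.2.36)
B = np.array([0.0, dt / M])                                # input vector of (5.2.36)

def step(x, u):
    """x_{k+1} = A x_k + B u_k."""
    return A @ x + B * u

def utility(x, u):
    """U(x_k, u_k) = x_k^T x_k + u_k^2."""
    return float(x @ x + u ** 2)

x = np.array([1.0, -1.0])                  # initial state
K = np.array([0.13, -0.17])                # admissible gain quoted below
for k in range(5):
    u = float(K @ x)                       # u_k = K x_k
    print(k, np.round(x, 4), round(utility(x, u), 4))
    x = step(x, u)
```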
Let the state space be Ω = {xk : − 1 ≤ x1k ≤ 1, −1 ≤ x2k ≤ 1}. We randomly
choose an array of p = 5000 states in Ω to implement the generalized policy iteration
algorithm to obtain the optimal control law. Neural networks are used to implement
the present generalized policy iteration algorithm. The critic network and the action
network are chosen as three-layer backpropagation (BP) neural networks with the
structures of 2–8–1 and 2–8–1, respectively. Define the two neural networks as group
“NN1”. For system (5.2.36), we can obtain an admissible control law u(xk ) = K xk ,
where K = [0.13, −0.17]. Let Ψ (xk ) = xkTP0 xk , where
P0 = [ 80, 1; 1, 2 ].
As the initial admissible control law u(xk ) = K xk is known, the finite-step pol-
icy evaluation in Algorithm 5.2.1 is implemented. We can see that it takes 3
iterations to obtain V0 (xk ) and v0 (xk ) and the results for the initial iteration are
displayed in Fig. 5.1. (See the trajectories of the iterative value functions for
i = 0). Let the maximum iteration index be imax = 10. To illustrate the effectiveness of the
algorithm, we choose four different iteration sequences {Ni^γ}, γ = 1, 2, 3, 4. For γ = 1
and i = 1, 2, . . . , 10, we let Ni^1 = 1; for γ = 4, we let Ni^4 = 20; the sequences {Ni^2}
and {Ni^3} are chosen as arbitrary nonnegative integer sequences.
Fig. 5.1 Iterative value functions Vi, ji (xk ) for i = 1, 2, . . . , 10 and xk = x0 . a Vi, ji (xk ) for {Ni1 }.
b Vi, ji (xk ) for {Ni2 }. c Vi, ji (xk ) for {Ni3 }. d Vi, ji (xk ) for {Ni4 }
The trajectories of the iterative value functions Vi,ji(xk) under the four sequences are
shown in Fig. 5.1. The curves of the iterative value functions Vi(xk) are shown in Fig. 5.2,
where we use "In" to denote the initial iteration and "Lm" to denote the limiting iteration.
For Ni^1 = 1, the generalized policy iteration algorithm reduces to the value iteration
algorithm [5, 6, 16, 17]. From Figs. 5.1a and 5.2a, we can see that the iterative value
function converges to the approximate optimum, which verifies the effectiveness of the
present algorithm. For Ni^4 = 20, we can see that for each i = 1, . . . , 10, the iterative
value function Vi,ji(xk) is convergent in ji. In this case, the generalized policy iteration
algorithm can be considered as a policy iteration algorithm [5, 8, 10] for each i, where the
convergence can be verified. For an arbitrary sequence {Ni}, such as {Ni^2} and {Ni^3},
Figs. 5.1b, c and 5.2b, c show that the iterative value function can also approximate the
optimum. Hence, value and policy iteration algorithms are special cases of the present
generalized policy iteration algorithm, and the convergence properties of the present
algorithm can be verified. The stability of system (5.2.36) under the iterative control laws
vi(xk) is illustrated by the control and state trajectories in Figs. 5.3 and 5.4, respectively.
Fig. 5.2 Iterative value functions Vi (xk ), for i = 0, 1, . . . , 10. a Vi (xk ) for {Ni1 }. b Vi (xk ) for
{Ni2 }. c Vi (xk ) for {Ni3 }. d Vi (xk ) for {Ni4 }
From the above simulation results, we can see that for i = 0, 1, . . ., the itera-
tive control laws vi (xk ) are admissible. For linear system (5.2.36), the optimal cost
function is quadratic and in this case given by J ∗ (xk ) = xkT P ∗ xk . According to the
discrete algebraic Riccati equation (DARE) [7], we obtain
P* = [ 26.61, 1.81; 1.81, 1.90 ],
and the effectiveness of the present algorithm can be verified for linear systems.
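As an independent sanity check (not part of the book's procedure), the quoted P* can be compared with the DARE solution computed by SciPy for the discretized model (5.2.36), using Q = I and R = 1 from the utility function of this example.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

M, kappa, b, dt = 0.1, 2.0, 0.1, 0.1
A = np.array([[1.0, dt], [-(kappa / M) * dt, 1.0 - (b / M) * dt]])
B = np.array([[0.0], [dt / M]])
Q, R = np.eye(2), np.array([[1.0]])        # from U(x_k, u_k) = x_k^T x_k + u_k^2

P_star = solve_discrete_are(A, B, Q, R)
print(np.round(P_star, 2))                 # should be close to the P* quoted above
```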
On the other hand, the structure of the neural networks is important for its approx-
imation performance. To show the influence of the neural network structure, we
change the structures of the critic and action networks to 2–4–1 and 2–4–1, respec-
tively, and other parameters of the neural networks are kept unchanged. Define the
two neural networks as group “NN2”. Choose the same {Ni2 } for the j-iteration.
Implement the present algorithm for i = 10 iterations. The iterative value functions
by NN1 and NN2 are shown in Fig. 5.5a. We can see that if the number of hidden-layer
neurons is reduced, the neural network approximation accuracy for the value function may
Fig. 5.3 Trajectories of iterative control law vi (xk ), i = 1, 2, . . . , 10. a vi (xk ) for {Ni1 }. b vi (xk )
for {Ni2 }. c vi (xk ) for {Ni3 }. d vi (xk ) for {Ni4 }
decrease. The plot of Vi (xk ) is shown in Fig. 5.5b. The corresponding trajectories of
states and control are shown in Fig. 5.5c, d, respectively. We can see that if the struc-
ture of neural networks is not chosen appropriately, the performance of the control
system will be poor.
Example 5.2.2 We now examine the performance of the present algorithm in a tor-
sional pendulum system [10, 13] with modifications. The dynamics of the pendulum
is given by
dθ/dt = ω,
J dω/dt = u − Mgl sin θ − fd dθ/dt,
Fig. 5.4 Trajectories of system states. a State trajectories for {Ni1 }. b State trajectories for {Ni2 }. c
State trajectory for {Ni3 }. d State trajectories for {Ni4 }
where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar,
respectively. Let J = 4/3 Ml 2 and f d = 0.2 be the rotary inertia and frictional fac-
tor, respectively. Let x1 = θ and x2 = ω. Let g = 9.8 m/s2 be the gravity and the
sampling time interval Δt = 0.1 s. Then, the discretized system can be expressed by
[ x1(k+1); x2(k+1) ] = [ x1k + 0.1 x2k; −0.49 sin(x1k) − 0.1 fd x2k + x2k ] + [ 0; 0.1 ] uk.   (5.2.37)
Let the initial state be x0 = [1, −1]T and let the utility function be the same as the
one in Example 5.2.1.
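For readers who wish to reproduce the trajectories reported below, a minimal Python sketch of one step of the discretized model (5.2.37) is given here (illustrative only).

```python
import numpy as np

dt, fd = 0.1, 0.2

def pendulum_step(x, u):
    """One step of (5.2.37) with x = [theta, omega] and scalar control u."""
    x1, x2 = x
    x1_next = x1 + dt * x2
    x2_next = -0.49 * np.sin(x1) - dt * fd * x2 + x2 + dt * u
    return np.array([x1_next, x2_next])

x = np.array([1.0, -1.0])                  # initial state of this example
for k in range(3):
    x = pendulum_step(x, 0.0)              # uncontrolled rollout, for illustration
    print(k + 1, np.round(x, 4))
```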
Neural networks are also used to implement the generalized policy iteration algo-
rithm, where the structures of the critic network and the action network are the
same as the ones in Example 5.2.1. We choose p = 10000 states in Ω to implement
the generalized policy iteration algorithm. For nonlinear system (5.2.37), four iteration sequences {Ni^γ}, γ = 1, 2, 3, 4, are chosen.
Fig. 5.5 Simulation results for i = 0, 1, . . . , 10 and {Ni2 }. a Value function at x = x0 for NN1 and
NN2. b Vi (xk ) by NN2. c Iterative control law by NN2. d System states by NN2
For γ = 2, let Ni^2, i = 1, 2, . . . , 30, be an arbitrary nonnegative integer such that 0 < Ni^2 ≤ 4.
For γ = 3, let Ni^3, i = 1, 2, . . . , 30, be an arbitrary nonnegative integer such that 0 < Ni^3 ≤ 10.
For γ = 4 and for i = 0, 1, . . . , 30, let Ni^4 = 20. Train the critic and the action networks under
the learning rate 0.01 and set the threshold of the neural network training error as 10−6. Under
the iteration indices
i and ji , the trajectories of iterative value functions Vi, ji (xk ) for x = x0 are shown
in Fig. 5.6. The curves of the iterative value functions Vi (xk ) are shown in Fig. 5.7.
Fig. 5.6 Iterative value functions Vi, ji (xk ) for i = 0, 1, . . . , 30 and xk = x0 . a Vi, ji (xk ) for {Ni1 }.
b Vi, ji (xk ) for {Ni2 } c Vi, ji (xk ) for {Ni3 }. d Vi, ji (xk ) for {Ni4 }
From Figs. 5.6 and 5.7, we can see that given an arbitrary nonnegative integer
sequence {Ni }, i = 0, 1, . . ., the iterative value function Vi, ji (xk ) is monotonically
nonincreasing and converges to the approximate optimum using the present general-
ized policy iteration algorithm. The convergence property of the generalized policy
iteration algorithm for nonlinear systems can be verified. The convergence properties
of value and policy iteration algorithms can also be verified by the present algorithm.
The stability of system (5.2.37) under the iterative control laws vi(xk) is illustrated by the
state and control trajectories shown in Figs. 5.8 and 5.9, respectively.
We can see that for i = 0, 1, . . ., the iterative control law vi (xk ) is admissible,
and hence, the effectiveness of the present algorithm can be verified for nonlinear
systems.
Fig. 5.7 Iterative value functions Vi (xk ), i = 0, 1, . . . , 30. a Vi (xk ) for {Ni1 }. b Vi (xk ) for {Ni2 }.
c Vi (xk ) for {Ni3 }. d Vi (xk ) for {Ni4 }
Fig. 5.8 Trajectories of system state. a State trajectories for {Ni1 }. b State trajectories for {Ni2 }. c
State trajectories for {Ni3 }. d State trajectories for {Ni4 }
iteration indices increase from 0. The detailed generalized policy iteration algorithm
is described as follows.
Let Ψ(xk) be a positive semidefinite function and let the initial value function be
V0(xk) = Ψ(xk).   (5.3.1)
Remark 5.3.1 From the generalized policy iteration algorithm (5.2.6)–(5.2.9) with
initial condition (5.3.1), for i = 1, 2, . . ., if we let Ni ≡ 1, then the generalized policy
iteration algorithm is reduced to a value iteration algorithm [2, 6, 9] (see Chaps. 2
and 3). For i = 1, 2, . . ., if we let Ni → ∞, then the generalized policy iteration
Fig. 5.9 Trajectories of iterative control law vi (xk ). a vi (xk ) for {Ni1 }. b vi (xk ) for {Ni2 }. c vi (xk )
for {Ni3 }. d vi (xk ) for {Ni4 }
algorithm becomes a policy iteration algorithm [10] (see Chap. 4). Hence, the value
and policy iteration algorithms are special cases of the present generalized policy
iteration algorithm. On the other hand, we can see that the generalized policy itera-
tion algorithm (5.2.6)–(5.2.9) is inherently different from policy and value iteration
algorithms. Analysis results for (5.2.5)–(5.2.9) have been established in Sect. 5.2. In
this section, further analysis results will be established for (5.2.6)–(5.2.9) with initial
condition (5.3.1) using the approach due to Rantzer et al. [9, 12].
B. Properties of the GPI Algorithm
In this section, the properties of the GPI algorithm are analyzed. First, for i =
1, 2, . . ., the convergence property of the iterative value function in j-iteration (local
convergence property) is analyzed. The local convergence criterion is obtained. Sec-
ond, the convergence property of the iterative value function for i → ∞ (global
convergence property) is developed and the corresponding convergence criterion is
obtained. The admissibility and optimality analysis are also presented in this section.
Theorem 5.3.1 For i = 1, 2, . . ., let Vi, ji (xk ) and vi (xk ) be obtained by (5.2.6)–
(5.2.9) with initial condition (5.3.1). Let 0 < γi < ∞ and 1 ≤ σi < ∞ be constants
such that
and
Then, for ji = 0, 1, . . . , Ni,
Vi,ji(xk) ≤ (1 + Σ_{ρ=1}^{ji} γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ) Vi,0(xk),   (5.3.4)
where we define Σ_{ρ=1}^{ji}(·) = 0 for ji < 1.
Thus, (5.3.4) holds for ji = 1. Assume that the conclusion holds for ji = l − 1,
l = 1, 2, . . . , Ni . Then, for ji = l, we have
Hence, (5.3.4) holds for ji = l. The mathematical induction is complete. This com-
pletes the proof of the theorem.
σi < (1 + γi)/γi,   (5.3.5)
then
lim_{ji→∞} Vi,ji(xk) ≤ (1/(1 + γi − γi σi)) Vi,0(xk).   (5.3.6)
Proof According to (5.3.4), the sequence {γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ} is geometric in ρ. Then, (5.3.4) can be written as
Vi,ji(xk) ≤ (1 + (γi(σi − 1)/(γi + 1)) (1 − (γi σi/(γi + 1))^{ji}) / (1 − γi σi/(γi + 1))) Vi,0(xk).
For 1 ≤ σi < (γi + 1)/γi, we have γi σi/(γi + 1) < 1, and letting ji → ∞ in the above inequality yields (5.3.6).
For optimal control problems, the present control scheme must not only stabilize
the control systems, but also guarantee the cost function to be finite, i.e., the control
law must be admissible [2]. Next, the admissibility property of the iterative control
law vi (xk ) will be analyzed.
Vi, j̄i (xk ) = U (xk , vi−1 (xk )) + Vi, j̄i −1 (F(xk , vi−1 (xk ))), (5.3.7)
where
From (5.3.7) and (5.3.8), we have Vi, j̄i (xk ) = Vi, j̄i (xk ) for j̄i ≤ Ni . For j̄i = 0, 1, . . .,
according to (5.3.7), we can obtain
Vi,j̄i(xk) = Σ_{l=0}^{j̄i−1} U(xk+l, vi−1(xk+l)) + Vi−1(xk+j̄i).
Let σi satisfy (5.3.5). According to Theorem 5.3.2, letting j̄i → ∞, we can obtain
lim_{j̄i→∞} Vi,j̄i(xk) = lim_{j̄i→∞} Σ_{l=0}^{j̄i−1} U(xk+l, vi−1(xk+l)) + lim_{j̄i→∞} Vi−1(xk+j̄i)
 ≤ (σi/(1 + γi − γi σi)) Vi−1(xk).
Since Vi−1(xk) is finite for i = 1, 2, . . ., Σ_{l=0}^{∞} U(xk+l, vi(xk+l)) is also finite. Define
Vi,∞(xk) = lim_{j̄i→∞} Vi,j̄i(xk). If σi satisfies (5.3.5), letting j̄i → ∞, the iterative control
law vi(xk) satisfies the following generalized Bellman equation
Vi,∞ (xk ) = U (xk , vi−1 (xk )) + Vi,∞ (F(xk , vi−1 (xk ))). (5.3.9)
According to Lemma 5.2.1, we can derive that Vi,∞ (xk ) is a positive-definite function.
From (5.3.9), we get
Theorem 5.3.4 For i = 0, 1, . . ., let Vi, ji (xk ) and vi (xk ) be obtained by (5.2.6)–
(5.2.9) with initial condition (5.3.1). Let 0 < γ < ∞ and 1 ≤ σ < ∞ be constants
such that J ∗ (F(xk , u k )) ≤ γ U (xk , u k ) and V0 (xk ) ≤ σ J ∗ (xk ). For i = 0, 1, . . ., if
σi satisfies (5.3.5), then the iterative value function Vi, ji (xk ) satisfies
Vi,ji(xk) ≤ (1 + Σ_{ρ=1}^{ji} γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ)
 × [ 1 + Σ_{l=1}^{i−1} γ^{i−l} γl(σl − 1) / ((1 + γ)^{i−l} Π_{η=l}^{i−1}(1 − γη(ση − 1)))
 + γ^i(σ − 1) / ((1 + γ)^i Π_{η=1}^{i−1}(1 − γη(ση − 1))) ] J*(xk),   (5.3.10)
where we define Σ_k^l(·) = 0 and Π_k^l(·) = 1 for k > l.
Proof The statement can be proven by mathematical induction. First, the conclusion
is obviously true for i = 0. Let i = 1. For ji = 1, 2, . . ., we have from (5.3.4),
V1,j1(xk) ≤ (1 + Σ_{ρ=1}^{j1} γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ) V1,0(xk)
 ≤ (1 + Σ_{ρ=1}^{j1} γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ) min_{uk} {U(xk, uk) + σ J*(xk+1)}
 ≤ (1 + Σ_{ρ=1}^{j1} γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ) min_{uk} {(1 + γ(σ − 1)/(1 + γ)) U(xk, uk) + (σ − (σ − 1)/(1 + γ)) J*(xk+1)}
 ≤ (1 + Σ_{ρ=1}^{j1} γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ) (1 + γ(σ − 1)/(1 + γ)) J*(xk),
which proves inequality (5.3.10) for i = 1. Assume that the conclusion holds for
i = ϑ − 1, ϑ = 1, 2, . . .. As σi satisfies (5.3.5), we can get
Vϑ−1,jϑ−1(xk) ≤ (1/(1 − γϑ−1(σϑ−1 − 1)))
 × [ 1 + Σ_{l=1}^{ϑ−2} γ^{ϑ−1−l} γl(σl − 1) / ((1 + γ)^{ϑ−1−l} Π_{η=l}^{ϑ−2}(1 − γη(ση − 1)))
 + γ^{ϑ−1}(σ − 1) / ((1 + γ)^{ϑ−1} Π_{η=l}^{ϑ−2}(1 − γη(ση − 1))) ] J*(xk).
Then, for i = ϑ and jϑ = 1, 2, . . ., substituting this bound into the iteration (5.2.7)–(5.2.9) and regrouping the coefficients of U(xk, uk) and J*(xk+1) in the same way as in the case i = 1 yields
Fig. 5.10 Iterative value function Vi (xk ) for different Ψ (xk )’s. a Vi (xk ) for Ψ 1 (xk ). b Vi (xk ) for
Ψ 2 (xk ). c Vi (xk ) for Ψ 3 (xk ). d Vi (xk ) for Ψ 4 (xk )
Vϑ,jϑ(xk) ≤ (1 + Σ_{ρ=1}^{jϑ} γϑ^ρ σϑ^{ρ−1}(σϑ − 1)/(1 + γϑ)^ρ)
 × [ 1 + Σ_{l=1}^{ϑ−1} γ^{ϑ−l} γl(σl − 1) / ((1 + γ)^{ϑ−l} Π_{η=l}^{ϑ−1}(1 − γη(ση − 1)))
 + γ^ϑ(σ − 1) / ((1 + γ)^ϑ Π_{η=l}^{ϑ−1}(1 − γη(ση − 1))) ] J*(xk).
The mathematical induction is complete for inequality (5.3.10). This completes the
proof of the theorem.
Fig. 5.11 The trajectories of Θi for different Ψ (xk )’s. a Θi for Ψ 1 (xk ). b Θi for Ψ 2 (xk ). c Θi for
Ψ 3 (xk ). d Θi for Ψ 4 (xk )
Theorem 5.3.5 (Global convergence criterion) For i = 0, 1, . . ., let Vi, ji (xk ) and
vi (xk ) be obtained by (5.2.6)–(5.2.9) with initial condition (5.3.1). If for i = 0, 1, . . .,
σi satisfies
σi < qi/(γi(1 + γ)) + 1,   (5.3.11)
where 0 < qi < 1 is a constant, then for ji = 0, 1, . . . , Ni , the iterative value func-
tion Vi, ji (xk ) is convergent as i → ∞.
Fig. 5.12 Iterative value function Vi, ji (xk ) for xk = x0 and different Ψ (xk )’s. a Vi, ji (xk ) for
Ψ 1 (xk ). b Vi, ji (xk ) for Ψ 2 (xk ). c Vi, ji (xk ) for Ψ 3 (xk ). d Vi, ji (xk ) for Ψ 4 (xk )
lim_{i→∞} Σ_{l=1}^{i−1} γ^{i−l} γl(σl − 1) / ((1 + γ)^{i−l} Π_{η=l}^{i−1}(1 − γη(ση − 1)))
 < lim_{i→∞} Σ_{l=1}^{i−1} γ^{i−l} (q/(1 + γ)) / ((1 + γ)^{i−l} (1 − q/(1 + γ))^{i−l})
 = qγ/((1 − q)(1 + γ)).   (5.3.12)
lim_{i→∞} γ^i(σ − 1) / ((1 + γ)^i Π_{η=l}^{i−1}(1 − γη(ση − 1))) = 0.   (5.3.13)
Fig. 5.13 Trajectories of iterative control law vi (xk ) for different Ψ (xk )’s. a vi (xk ) for Ψ 1 (xk ). b
vi (xk ) for Ψ 2 (xk ). c vi (xk ) for Ψ 3 (xk ). d vi (xk ) for Ψ 4 (xk )
qi/(γi(1 + γ)) + 1 < (1 + γi)/γi,   (5.3.14)
we can obtain
Σ_{ρ=1}^{ji} γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ < lim_{ji→∞} Σ_{ρ=1}^{ji} γi^ρ σi^{ρ−1}(σi − 1)/(1 + γi)^ρ < q/(γ + 1 − q).   (5.3.15)
Fig. 5.14 Trajectories of system states for different Ψ (xk )’s. a State trajectories for Ψ 1 (xk ). b State
trajectories for Ψ 2 (xk ). c State trajectories for Ψ 3 (xk ). d State trajectories for Ψ 4 (xk )
From Theorem 5.3.5, we can see that the convergence criterion (5.3.11) is indepen-
dent of the parameter σ . For i = 1, 2, . . ., σi and γi can be obtained by (5.3.2) and
(5.3.3), respectively. If the parameter γ can be estimated, then the convergence crite-
rion (5.3.11) can be implemented. As the optimal cost function J ∗ (xk ) is unknown,
the parameter γ cannot be obtained directly. In this section, relaxations of the con-
vergence criterion are discussed. First, we define a new set Ωγ as
Ωγ = {γ : γ U(xk, uk) ≥ J*(F(xk, uk))}.
Lemma 5.3.1 Let Pi (xk ) be the iterative value function that satisfies
then γ̃ ∈ Ωγ .
Proof As μ(xk ) is an admissible control law, according to Theorem 5.2.3, we have
P∞ (xk ) ≥ V∞ (xk ). If γ̃ satisfies (5.3.16), then we can get
= Φ1 (xk ).
Hence, we have
Φ0 (xk ) ≥ J ∗ (xk ).
Corollary 5.3.1 Let Φ0 (xk ) be a positive-definite function and let γ̄ and σ̄ be con-
stants that satisfy (5.3.17) and
If γ̄ and σ̄ satisfy
Step 3. If Φ1 (xk ) ≤ Φ0 (xk ), then goto Step 7; else, goto next step.
Step 4. Choose two parameters 0 < γ̄ < ∞ and 1 ≤ σ̄ < ∞ that satisfy (5.3.17) and (5.3.19).
Step 5. If γ̄ and σ̄ satisfy (5.3.22), then for l = 1, 2, . . ., implement (5.3.20) until the computation
precision is achieved, i.e.,
where P0 (xk ) = Φ0 (xk ), and goto next step; else goto Step 1.
Step 6. Let Φ0 (xk ) = Pl (xk ).
Step 7. Choose γ such that Φ0 (F(xk , u k )) ≤ γ U (xk , u k ).
Step 8. Return γ .
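A minimal Python sketch of this last step is shown below: a value of γ with Φ0(F(xk, uk)) ≤ γU(xk, uk) is estimated by maximizing the ratio over sampled state–action pairs. The system function F, utility U, and function Φ0 used here are placeholders, not the ones of this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)

def F(x, u):                               # placeholder system function
    return 0.9 * x + 0.1 * u

def U(x, u):                               # placeholder positive-definite utility
    return float(x @ x + u @ u)

def Phi0(x):                               # placeholder positive-definite function
    return float(5.0 * (x @ x))

samples = [(rng.uniform(-1, 1, 2), rng.uniform(-1, 1, 2)) for _ in range(5000)]
gamma = max(Phi0(F(x, u)) / U(x, u) for x, u in samples if U(x, u) > 1e-9)
print("estimated gamma:", round(gamma, 3))
```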
σ̄ < (1 + γ̄)/γ̄,   (5.3.22)
and there exists a constant γ̂ such that γ̂ U(xk, uk) ≥ lim_{l→∞} Pl(xk), then γ̂ ∈ Ωγ.
Proof If γ̄ and σ̄ satisfy (5.3.22), according to Theorem 5.3.3, the iterative control
law ν(xk ) in (5.3.21) is admissible. Then, the limit of iterative value function Pl (xk )
exists, as l → ∞. According to Lemma 5.3.1, we can obtain γ̂ ∈ Ωγ . This completes
the proof of the corollary.
Vi, ji (xk ) = U (xk , vi−1 (xk )) + Vi, ji −1 (F(xk , vi−1 (xk )));
else, let
[ ẋ1; ẋ2 ] = [ x2; (g/ℓ) sin(x1) − κ x2 ] + [ 0; 1/(mℓ²) ] u,
where m = 1/2 kg and ℓ = 1/3 m are the mass and length of the pendulum bar,
respectively. Let κ = 0.2 and g = 9.8 m/s² be the frictional factor and the gravitational
acceleration, respectively. Discretization of the system function with the sampling
interval Δt = 0.1 s leads to
[ x1(k+1); x2(k+1) ] = [ x1k + Δt x2k; (g/ℓ)Δt sin(x1k) + (1 − κΔt)x2k ] + [ 0; Δt/(mℓ²) ] uk.
Let the initial state be x0 = [1, −1]T . Let the state space be Ξ = {x : − 1 ≤ x1 ≤
1, −1 ≤ x2 ≤ 1}. NNs are used to implement the present generalized policy iter-
ation algorithm. The critic network and the action network are chosen as three-
layer BP NNs with the structures of 2–8–1 and 2–8–1, respectively. We choose
p = 5000 states in Ξ to train the action and critic networks. To illustrate the
effectiveness of the algorithm, four different initial value functions are chosen
which are expressed by Ψ ς (xk ) = xkT Pς xk , ς = 1, 2, 3, 4. Let P1 = 0. Let P2 –P4
be initialized by positive-definite matrices given by P2 = [ 2.98, 1.05; 1.05, 5.78],
P3 = [ 6.47, −0.33; −0.33, 6.55], and P4 = [ 22.33, 4.26; 4.26, 7.18], respectively.
For i = 0, 1, . . ., let qi = 0.9999. First, implement Algorithm 5.3.1, which returns
γ = 5.40. Let the iteration sequence be {Ni^ς}, where Ni^ς ∈ [1, 10] is a random nonnegative
integer. Then, initialized by Ψ^ς(xk), ς = 1, 2, 3, 4, the generalized policy iteration
algorithm in Algorithm 5.3.2 is implemented for i = 15 iterations. Train the critic and
action networks under the learning rate of 0.01 and set the NN training error threshold
as 10−6. The curves of the iterative value functions Vi(xk) are shown in
Fig. 5.10, where we let “In” denote “initial iteration” and let “Lm” denote “limiting
iteration”.
For i = 1, 2, . . . , 15, define the function Θi as
Θi = σi γi(1 + γ) / (qi + γi(1 + γ)).   (5.3.23)
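The following small Python sketch (with assumed illustrative pairs (σi, γi)) evaluates Θi for γ = 5.40 and qi = 0.9999 used in this example, and checks that Θi < 1 is equivalent to the criterion (5.3.11).

```python
def theta(sigma_i, gamma_i, gamma, q_i):
    """Theta_i of (5.3.23)."""
    return sigma_i * gamma_i * (1.0 + gamma) / (q_i + gamma_i * (1.0 + gamma))

gamma, q_i = 5.40, 0.9999                          # values used in this example
for sigma_i, gamma_i in [(1.02, 2.0), (1.10, 2.0), (1.02, 5.0)]:   # assumed test values
    rhs = q_i / (gamma_i * (1.0 + gamma)) + 1.0    # right-hand side of (5.3.11)
    print(sigma_i, gamma_i,
          round(theta(sigma_i, gamma_i, gamma, q_i), 4),
          sigma_i < rhs)                           # True exactly when Theta_i < 1
```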
It can easily be shown that Θi < 1 implies σi satisfies (5.3.11). The trajectories of
the function Θi are shown in Fig. 5.11. Under the iteration indices i and ji , the trajec-
tories of iterative value functions Vi, ji (xk ) for xk = x0 are shown in Fig. 5.12. From
Fig. 5.15 Iterative value function Vi, ji (xk ) for xk = x0 and different Ψ̄ (xk )’s. a Vi, ji (xk ) for
Ψ̄ 1 (xk ). b Vi, ji (xk ) for Ψ̄ 2 (xk ). c Vi, ji (xk ) for Ψ̄ 3 (xk ). d Vi, ji (xk ) for Ψ̄ 4 (xk )
Figs. 5.10 and 5.12, we can see that given an arbitrary positive semidefinite func-
tion, the iterative value function Vi, ji (xk ) converges to the optimum using the present
generalized policy iteration algorithm. The convergence property of the present gen-
eralized policy iteration algorithm for nonlinear systems can be verified. From Figs.
5.11 and 5.12, we can see that when Θi < 1, the GPI algorithm is convergent and both
the policy evaluation and improvement procedures can be implemented. If Θi < 1,
the iterative control law vi (xk ) is admissible. Let the execution time T f be 100 time
steps. The trajectories of the iterative control laws and system states are shown in
Figs. 5.13 and 5.14, respectively, where the effectiveness of the present generalized
policy iteration algorithm for nonlinear systems can be verified.
Example 5.3.2 Our second example is chosen as a nonaffine nonlinear system in [1]
with modifications. The system is expressed by
Fig. 5.16 The trajectories of Θi for different Ψ̄ (xk )’s. a Θi for Ψ̄ 1 (xk ). b Θi for Ψ̄ 2 (xk ). c Θi for
Ψ̄ 3 (xk ). d Θi for Ψ̄ 4 (xk )
ẋ1 = 2 sin x1 + x2 + x3 ,
ẋ2 = x1 + sin(−x2 + u 2 ),
ẋ3 = x3 + sin u 1 .
Let x = [x1 , x2 , x3 ]T be the system state and let u = [u 1 , u 2 ]T be the system control
input. Let the initial state be x0 = [1, −1, 1]ᵀ. Let the sampling interval be Δt = 0.1 s.
Let the cost function be expressed by J(x0) = Σ_{k=0}^{∞} (xkᵀ Q xk + R(uk)), where
R(uk) = ∫_0^{uk} (Φ^{−1}(v))ᵀ R dv.
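For Φ(·) = tanh(·), as specified just below in this example, each component of this integral has the closed form r(u·arctanh(u) + ½ ln(1 − u²)) for a diagonal R. The following Python sketch (not from the book) evaluates it and verifies the closed form against numerical quadrature; the values of u and R are assumptions for illustration.

```python
import numpy as np
from scipy.integrate import quad

def R_cost(u, R_diag):
    """Closed form of R(u) = int_0^u (tanh^{-1}(v))^T R dv for diagonal R."""
    u = np.asarray(u, dtype=float)
    return float(np.sum(R_diag * (u * np.arctanh(u) + 0.5 * np.log(1.0 - u ** 2))))

u = np.array([0.4, -0.7])                  # |u| < 1, consistent with the tanh saturation
R_diag = np.array([1.0, 1.0])              # R = I assumed for the illustration
numeric = sum(r * quad(np.arctanh, 0.0, ui)[0] for r, ui in zip(R_diag, u))
print(R_cost(u, R_diag), numeric)          # the two values should agree
```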
Let Q be an identity matrix with suitable dimension. Let Φ(·) be a sigmoid func-
tion, i.e., Φ(·) = tanh(·). Let the state space be Ξ̄ = {x : − 1 ≤ x1 ≤ 1, −1 ≤
x2 ≤ 1, −1 ≤ x3 ≤ 1}. NNs are used to implement the present generalized policy
Fig. 5.17 Trajectories of weights for the critic network for different Ψ̄ (xk )’s. a Wcl , l = 1, 3, 6, 9,
for Ψ̄ 1 (xk ). b Wcl , l = 1, 3, 6, 9, for Ψ̄ 2 (xk ). c Wcl , l = 1, 3, 6, 9, for Ψ̄ 3 (xk ). d Wcl , l = 1, 3, 6, 9,
for Ψ̄ 4 (xk )
iteration algorithm. The critic and action networks are chosen as three-layer BP
NNs with the structures of 3–10–1 and 3–10–2, respectively. We choose randomly
p = 10000 states in Ξ̄ to train the action and critic networks. To illustrate the effec-
tiveness of the algorithm, we also choose four different initial value functions which
are expressed by Ψ̄ ς (xk ) = xkT P̄ς xk , ς = 1, . . . , 4. Let P̄1 – P̄4 be positive-definite
matrices given by P̄1 = 0.01 × [0.11, 0, −0.18; 0, 0.12, 0.04; −0.18, 0.04, 0.19],
P̄2 = [ 9.09, −2.06, 1.77; −2.06, 2.93, −1.89; 1.77, −1.89, 7.69], P̄3 = [ 6.89,
−8.49, −0.46; −8.49, 15.60, 0.44; −0.46, 0.45, 6.81], and P̄4 = [ 51.39, −18.53,
−4.64; −18.53, 36.72, 6.22; −4.64, 6.22, 25.34], respectively. For i = 0, 1, . . ., let
qi = 0.9999. First, implement Algorithm 5.3.1, which returns γ = 6.3941. Let the
iteration sequence be {Ni^ς}, where Ni^ς ∈ [1, 10] is a random nonnegative integer.
Then, initialized by Ψ̄ ς (xk ), ς = 1, . . . , 4, the generalized policy iteration algorithm
(Algorithm 5.3.2) is implemented for a total of i = 20 iterations. Train the critic and
action networks under the learning rate of 0.01 and set the NN training error threshold
Fig. 5.18 Trajectories of iterative control law vi (xk ), for different Ψ̄ (xk )’s. a vi (xk ) for Ψ̄ 1 (xk ). b
vi (xk ) for Ψ̄ 2 (xk ). c vi (xk ) for Ψ̄ 3 (xk ). d vi (xk ) for Ψ̄ 4 (xk )
as 10−6 . Under the iteration indices i and ji , the trajectories of iterative value func-
tions Vi, ji (xk ) for xk = x0 are shown in Fig. 5.15, where we can see that initialized
by an arbitrary positive semidefinite function, the iterative value function Vi, ji (xk )
converges to the optimum using the present generalized policy iteration algorithm.
For i = 1, 2, . . . , 15, define the function Θi as (5.3.23) and the trajectories of the
function Θi are shown in Fig. 5.16.
Let Wcl , l = 1, 2, . . . , 10, be the first column of the hidden-output weight matrix
for the critic network. The convergence trajectories of weights Wc1 , Wc3 , Wc6 , Wc9
for the critic network are shown in Fig. 5.17, where we can see that the weights for
critic network are convergent to the optimum. Other weights of the NNs are omitted
here.
Fig. 5.19 Trajectories of system states for different Ψ̄ (xk )’s. a State trajectories for Ψ̄ 1 (xk ). b State
trajectories for Ψ̄ 2 (xk ). c State trajectories for Ψ̄ 3 (xk ). d State trajectories for Ψ̄ 4 (xk )
From Figs. 5.15 and 5.16, the convergence property of the present generalized
policy iteration algorithm for nonlinear systems can be verified. We can see that
for Θi < 1, the GPI algorithm is convergent and both the policy evaluation and
improvement procedures can be implemented. Let the execution time T f be 100 time
steps. The trajectories of the iterative control laws and system states are shown in
Figs. 5.18 and 5.19, respectively. From Fig. 5.16d, we can see that for Ψ̄⁴(xk), Θi < 1
for i = 1, 2, . . . , 15. In this case, the iterative control law vi (xk ) is admissible. This
property can be verified from results shown in Figs. 5.18d and 5.19d, respectively.
From these simulation results, the effectiveness of the present generalized policy
iteration algorithm for nonlinear systems can be verified.
5.4 Conclusions
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B
Cybern 38(4):943–949
3. Apostol TM (1974) Mathematical analysis, 2nd edn. Addison-Wesley, Boston
4. Beard RW (1995) Improving the closed-loop performance of nonlinear systems. Ph.D. thesis,
Rensselaer Polytechnic Institute, Troy, NY
5. Bertsekas DP (2007) Dynamic programming and optimal control, 3rd edn. Athena Scientific,
Belmont
6. Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont
7. Dorf RC, Bishop RH (2011) Modern control systems, 12th edn. Prentice-Hall, Upper Saddle
River
8. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag
32(6):76–105
9. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
10. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Network Learn Syst 25(3):621–634
11. Liu D, Wei Q, Yan P (2015) Generalized policy iteration adaptive dynamic programming for
discrete-time nonlinear systems. IEEE Trans Syst Man Cybern Syst 45(12):1577–1591
12. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc Control
Theory Appl 153(5):567–574
13. Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans
Neural Network 12(2):264–276
14. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
15. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
16. Wei Q, Liu D (2014) A novel iterative θ-adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
17. Wei Q, Liu D, Lin H (2016) Value iteration adaptive dynamic programming for optimal control
of discrete-time nonlinear systems. IEEE Trans Cybern 46(3):840–853
18. Wei Q, Liu D, Yang X (2015) Infinite horizon self-learning optimal control of nonaffine discrete-
time nonlinear systems. IEEE Trans Neural Network Learn Syst 26(4):866–879
Chapter 6
Error Bounds of Adaptive Dynamic
Programming Algorithms
6.1 Introduction
Value iteration and policy iteration are two basic classes of ADP algorithms which
can solve optimal control problems for nonlinear dynamical systems with continu-
ous state and action spaces. Value iteration algorithms can solve the optimal con-
trol problems for nonlinear systems without requiring an initial stabilizing control
policy. Value iteration algorithms iterate between value function update and policy
improvement until the iterative value functions converge to the optimal one. Value
iteration-based ADP algorithms have been used for solving the Bellman equation
[3, 8, 17, 19, 30, 31], the near-optimal control of discrete-time affine nonlinear sys-
tems with control constraints [20, 34], the finite-horizon optimal control problem
[10, 28], the optimal tracking control problem [33, 35], and the optimal control of
unknown nonaffine nonlinear discrete-time systems with discount factor in the cost
function [19, 29]. In contrast to value iteration algorithms, policy iteration algorithms
[1, 11, 14, 18, 27] always require an initial stabilizing control policy. The policy
iteration is built to iterate between policy evaluation and policy improvement until
the policy converges to the optimal one. The policy iteration-based ADP approaches
were also developed for optimal control of discrete-time nonlinear dynamical sys-
tems [7, 18]. For all the iterative ADP algorithms mentioned above, it is assumed
that the value function and control policy update equations can be exactly solved at
each iteration. Furthermore, optimistic policy iteration [25] represents a spectrum of
iterative algorithms which includes value iteration and policy iteration algorithms,
and it is also known as generalized policy iteration [24, 26] or modified policy iter-
ation [22]. The optimistic policy iteration algorithm has not been widely applied to
ADP for solving optimal control problems of nonlinear dynamical systems.
On the other hand, an inequality version of the Bellman equation was used to derive
bounds on the optimal cost function [12]. For undiscounted optimal control problems
of discrete-time nonlinear systems, a relaxed value iteration scheme [23] based on
the inequality version of Bellman equation [12] was introduced to derive the upper
and lower bounds of the optimal cost function, where the distance from optimal
values can be kept within prescribed bounds. In [16], the relaxed value iteration
scheme was used to solve the optimal switching between linear systems, the optimal
control of a linear system with piecewise linear cost, and a partially observable
Markov decision problem. In [9], the relaxed value iteration scheme was applied
to receding horizon control schemes for discrete-time nonlinear systems. Based on
[23], a new expression of approximation errors at each iteration was introduced to
establish convergence analysis results for the approximate value iteration algorithm
[17].
For optimal control problems with continuous state and action spaces, ADP meth-
ods use a critic neural network to approximate the value function, and use an action
neural network to approximate the control policy. Iterating on these approximate
schemes will inevitably give rise to approximation errors. Therefore, we establish
error bounds for the approximate value iteration, approximate policy iteration, and
approximate optimistic policy iteration in this chapter [21]. A new assumption is uti-
lized instead of the contraction assumption in discounted optimal control problems
[16]. It is shown that the iterative approximate value function can converge to a finite
neighborhood of the optimal value function under some mild conditions. To imple-
ment the present algorithms, two multilayer feedforward neural networks are used
to approximate the value function and control policy. A simulation example is given
to demonstrate the present results. Furthermore, we also present error bound analy-
sis results of Q-function for the action-dependent ADP to solve discounted optimal
control problems of unknown discrete-time nonlinear systems [32]. It is shown that
the approximate Q-function will converge to a finite neighborhood of the optimal
Q-function. Results in this chapter will complement the analysis results developed in
Chaps. 2–5 by introducing various error conditions for different VI and PI algorithms
[17–19, 28–30].
6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems

Consider the discrete-time nonlinear system
xk+1 = F(xk, uk),   (6.2.1)
where xk ∈ Rn is the system state at time k and uk ∈ Rm is the control input. Let x0 be
the initial state. Assume that the system function F(xk , u k ) is Lipschitz continuous
on a compact set Ω ⊂ Rn containing the origin, and F(0, 0) = 0. Hence, x = 0
is an equilibrium state of the system (6.2.1) under the control u = 0. Assume that
system (6.2.1) is stabilizable on the compact set Ω [3].
Note that the optimal value function defined here is the same as the optimal cost
function, i.e.,
V*(x) = J*(x) ≜ inf_u { J(x, u) : u ∈ A(Ω) }.
According to Bellman’s principle of optimality [4, 13], the optimal value function
V ∗ (x) satisfies the Bellman equation
V*(x) = inf_μ { U(x, μ) + V*(F(x, μ)) }.   (6.2.4)
If it can be solved for V*, the optimal control policy μ*(x) can be obtained by
μ*(x) = arg inf_μ { U(x, μ) + V*(F(x, μ)) }.
For convenience, Tμ^k denotes the k-fold composition of the operator Tμ, i.e.,
(Tμ^k V)(x) = (Tμ(Tμ^{k−1} V))(x).   (6.2.5)
V ∗ = T V ∗,
We first present the algorithm for exact value iteration and then the algorithm for
approximate value iteration with analysis results for error bounds.
A. Value Iteration
The value iteration algorithm starts with any initial positive-definite value function V0(x)
or with V0(·) = 0. Then, the control policy v0(x) can be obtained by
v0(x) = arg min_u { U(x, u) + V0(F(x, u)) }.   (6.2.6)
For i = 1, 2, . . . , the value iteration algorithm iterates between the value function
update
The lemma can be proved by considering the key expressions in Table 6.1.
For undiscounted optimal control problems, the convergence of value iteration
algorithm has been given in the following theorem [23] (cf. Theorem 2.2.2).
Theorem 6.2.1 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that 0 ≤ αV ∗ ≤
V0 ≤ βV ∗ , 0 ≤ α ≤ 1, and 1 ≤ β < ∞. The value function Vi and the control
policy vi+1 are iteratively updated by (6.2.7) and (6.2.8). Then, the value function
sequence {Vi } approaches V ∗ according to the inequalities
hold uniformly. Similarly, we also assume that there exist finite positive constants
ζ ≤ 1 and ζ̄ ≥ 1 such that
by denoting
η ≜ ζ δ ≤ 1,  η̄ ≜ ζ̄ δ̄ ≥ 1.   (6.2.13)
Based on Assumptions 6.2.1 and 6.2.2, we can establish the error bounds for the
approximate value iteration by the following theorem.
Theorem 6.2.2 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that 0 ≤ αV ∗ ≤
V0 ≤ βV ∗ , 0 ≤ α ≤ 1 and 1 ≤ β < ∞. The approximate value function V̂i
satisfies
the iterative error condition (6.2.12). Then, the value function sequence
V̂i approaches V ∗ according to the following inequalities
6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems 229
i
λ j η j−1 (1 − η) λi ηi (1 − α) ∗
η 1− − V
j=1
(λ + 1) j (λ + 1)i
i
λ j η j−1 (η − 1) λi η i (β − 1) ∗
≤ V̂i+1 ≤ η 1 + + V , i ≥ 0, (6.2.14)
j=1
(λ + 1) j (λ + 1)i
i
where we define (·) = 0 for i < 1. Moreover, the value function sequence V̂i
j=1
converges to a finite neighborhood of V ∗ uniformly on Ω as i → ∞, i.e.,
η η
V ∗ ≤ lim V̂i ≤ V ∗, (6.2.15)
1 + λ − λη i→∞ 1 + λ − λη
1
under the condition η < + 1.
λ
Proof First, we prove the lower bound of the approximate value function V̂i+1 by
mathematical induction. Letting i = 0 in (6.2.12), we can obtain V̂1 ≥ ηT V̂0 =
ηT V0 . Considering αV ∗ ≤ V0 , we can get
Thus, the lower bound of V̂i+1 holds for i = 0. According to (6.2.12), Assumptions
6.2.1 and 6.2.2, we can get
V̂2 ≥ ηT V̂1
= η min U (x, u) + V̂1 F(x, u)
u
≥ η min U (x, u) + αηV ∗ F(x, u)
u
αη − 1 αη − 1
≥ η min 1+λ U (x, u) + αη − V ∗ F(x, u)
u λ+1 λ+1
αη − 1
=η 1+λ min U (x, u) + V ∗ F(x, u)
λ+1 u
λ(1 − η) λη(1 − α)
=η 1− − V ∗.
λ+1 λ+1
Hence, the lower bound of V̂i+1 holds for i = 1. Assume the lower bound in (6.2.14)
holds for i = l, i.e.,
l
λ j η j−1 (1 − η) λl ηl (1 − α) ∗
V̂l+1 ≥η 1− − V Θ V ∗.
j=1
(λ + 1) j (λ + 1)l
230 6 Error Bounds of Adaptive Dynamic Programming Algorithms
V̂l+2 ≥ ηT V̂l+1
= η min U (x, u) + V̂l+1 F(x, u)
u
≥ η min U (x, u) + Θ V ∗ F(x, u) .
u
Let
l+1
λ j−1 η j−1 (1 − η) λl ηl+1 (1 − α)
Ξ − − . (6.2.16)
j=1
(λ + 1) j (λ + 1) l+1
l
λ j η j (1 − η) λl ηl+1 (1 − α)
=η − −
j=1
(λ + 1) j (λ + 1)l
= Θ.
Also, the upper bound can be proved similarly. Therefore, the lower and upper
bounds of V̂i+1 in (6.2.14) have been proved.
Finally, we prove that the value function sequence V̂i converges to a finite
neighborhood of V ∗ uniformly on Ω as i → ∞ if η < 1/λ + 1. Since the sequence
λ j η j−1 (1 − η)
is geometric, we have
(λ + 1) j
6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems 231
λ(1 − η) λη i
i
λ j η j−1 (1 − η) 1−
= λ+1 λ+1 .
(λ + 1) j λη
j=1 1−
λ+1
λη λη
Considering ≤ < 1, we have
λ+1 λ+1
η
lim V̂i ≥ V ∗.
i→∞ 1 + λ − λη
For the upper bound, if λη/(λ + 1) < 1, i.e., η < 1/λ + 1, we can show
η
lim V̂i ≤ V ∗.
i→∞ 1 + λ − λη
Remark 6.2.1 We can find that the lower and upper bounds in (6.2.15) are both
monotonically increasing functions of η and η, respectively. The condition η <
1/λ+1 should be satisfied such that the upper bound in (6.2.15) is finite and positive.
Since η ≤ 1, the lower bound in (6.2.15) is always positive. The values of η and η
may gradually be refined during the iterative process similar to [2], in which a crude
initial operator approximation is gradually refined with new iterations. We can also
derive that a larger λ will lead to a slower convergence rate and a larger error bound. A
larger λ also requires more accurate iteration to converge. When η = η = 1, Theorem
6.2.2 reduces to Theorem 6.2.1 and the value function sequence V̂i converges to
V ∗ uniformly on Ω as i → ∞.
We first present and analyze the exact policy iteration and then establish the error
bounds for approximate policy iteration.
A. Policy Iteration
For the policy iteration algorithm, an initial stabilizing control policy is usually
required. In this section, we start the policy iteration from an initial value function
V0 such that V0 ≥ T V0 . We can see that the control policy π 0 (x) obtained by
π 0 (x) = arg min U (x, u) + V0 F(x, u) (6.2.18)
u
V0 F x, π 0 (x) − V0 (x) ≤ V0 (F x, π 0 (x) − (T V0 )(x)
= −U x, π 0 (x) ≤ 0,
Note that a short expression for (6.2.19) is given by Viπ = Tπi−1 Viπ .
The flow of iterations of the PI algorithm in (6.2.18)–(6.2.20) is illustrated in
Table 6.2, where the iteration flows as in Fig. 2.1. The algorithm starts with a special
initial value function so that an admissible control law π 0 (xk ) is obtained from
(6.2.18). Comparing between Tables 6.2 and 4.1, we note that the two descriptions
are the same except the initialization step. The algorithm in (6.2.18)–(6.2.20) starts
at i = 0 using an initial value function V0 satisfying V0 ≥ T V0 , whereas the
PI algorithm in (4.2.5)–(4.2.6) starts with an admissible control law v0 (xk ). For a
particular problem, the one with initial condition easier to obtain will usually be
chosen.
When the policy evaluation equation (6.2.19) cannot be solved directly, the follow-
ing iterative process can be used to solve the value function at the policy evaluation
step
π( j) π( j−1)
Vi (x) = U x, πi−1 (x) + Vi F x, πi−1 (x) , j > 0, (6.2.21)
Lemma 6.2.2 Let Assumption 6.2.1 hold. Suppose that V0 ≥ T V0 . Let πi+1 and
π( j) π( j)
Vi be updated by (6.2.20) and (6.2.21). Then, the sequence {Vi } is a monoton-
π( j) π( j−1)
ically nonincreasing sequence, i.e., Vi ≤ Vi , ∀i ≥ 1. Moreover, as j → ∞,
π( j) π(∞)
the limit of Vi denoted by Vi exists, and it is equal to Viπ , ∀i ≥ 1.
π( j) π( j−1)
Assume that V1 ≤ V1 is true for j = l, l = 1, 2, . . . , i.e., V1π(l) ≤ V1π(l−1) .
For j = l + 1, we have
π( j) π( j−1)
Therefore, we obtain Vi ≤ Vi for i = 1 by induction. Since the sequence
π( j) π( j) π( j)
{V1 } is a monotonically nonincreasing sequence and V1 ≥ 0, the limit of V1
π( j)
exists, which is denoted by V1π(∞) and satisfies V1π(∞) ≤ V1 . Considering
234 6 Error Bounds of Adaptive Dynamic Programming Algorithms
π( j+1) π( j)
V1 (x) = U x, π 0 (x) + V1 F x, π 0 (x) , j ≥ 0,
we have
π( j+1)
V1 (x) ≥ U x, π0 (x) + V1π(∞) F x, π 0 (x) , j ≥ 0. (6.2.23)
Similarly, we obtain
π( j+1) π( j)
V1π(∞) (x) ≤ V1 (x) = U x, π 0 (x) + V1 F x, π 0 (x) , j ≥ 0. (6.2.25)
Considering (6.2.19), we can obtain that Viπ(∞) (x) = Viπ (x) holds for i = 1.
π( j) π( j+1)
We assume that Vi ≥ Vi holds and Viπ(∞) (x) = Viπ (x), ∀i ≥ 1. Then,
considering (6.2.19)–(6.2.21), for j = 0, we can get
π(1) π(0) π(0)
Vi+1 = Tπi Vi+1 = Tπi Viπ = T Viπ ≤ Viπ = Vi+1 .
π( j+1) π( j)
Similarly, we can obtain that Vi+1 ≤ Vi+1 holds for j = 0, 1, . . . , by induction,
π(∞) π π( j) π( j−1)
and Vi+1 (x) = Vi+1 (x). We have now shown Vi ≤ Vi for all i = 1, 2, . . . ,
π(∞) π
and all j = 0, 1, . . . , and Vi = Vi for all i = 1, 2, . . . . Therefore, the proof is
complete.
Note that Lemma 6.2.2 presented the same result as Theorems 5.2.1(i) and 5.2.2.
We state and prove it here using our new notation.
Lemma 6.2.3 (cf. Lemma 5.2.3) Let Assumption 6.2.1 hold. Suppose that V0 ≥
π( j)
T V0 . Let πi and Vi be updated by (6.2.20) and (6.2.21). Then, the sequence {Viπ }
is a monotonically nonincreasing sequence, i.e., Viπ ≥ Vi+1π
, ∀i ≥ 0.
Proof According to Lemma 6.2.2, we can get
6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems 235
π(0) π(∞) π
Vi+1 ≥ Vi+1 = Vi+1 .
Then, considering
π(0)
Viπ = Tπi−1 Viπ ≥ T Viπ = Tπi Viπ = Tπi Vi+1 ,
Therefore, considering (6.2.3) and Theorem 6.2.1, we can obtain the conclusion.
This completes the proof of the theorem.
hold uniformly, where Viπ̂ is the exact value function associated with π̂i . Considering
Lemma 6.2.2, we have
Similarly, we assume that there exist finite positive constants ζ ≤ 1 and ζ ≥ 1 such
that
π̂ π̂ π̂
ζ T V̂i−1 ≤ Tπ̂i V̂i−1 ≤ ζ T V̂i−1 , ∀i = 1, 2, . . . , (6.2.29)
V̂iπ̂ ≤ ζ δT V̂i−1
π̂
.
On the other hand, considering (6.2.27), (6.2.29), and Assumption 6.2.1, we can
obtain
V̂iπ̂ ≥ δViπ̂
π̂
= δ(Tπ̂i · · · Tπ̂i Tπ̂i )V̂i−1
π̂
≥ δ(T · · · T Tπ̂i )V̂i−1
π̂
≥ ζ δ(T · · · T T )V̂i−1
≥ ζ δV ∗ .
Therefore, the whole approximation errors in the value function and control policy
update equations can be expressed by
ζ δV ∗ ≤ V̂iπ̂ ≤ ζ δT V̂i−1
π̂
. (6.2.30)
ηV ∗ ≤ V̂iπ̂ ≤ ηT V̂i−1
π̂
. (6.2.31)
Similar to Sect. 6.2.2B, we can establish the error bounds for approximate policy
iteration by the following theorem.
Theorem 6.2.4 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that V ∗ ≤ V0 ≤
βV ∗ , 1 ≤ β < ∞, and that V0 ≥ T V0 . The approximate value function V̂iπ̂
satisfies
π̂ the iterative∗ error condition (6.2.31). Then, the value function sequence
V̂i approaches V according to the following inequalities
i
λj η j−1 (η − 1) λi ηi (β − 1) ∗
ηV ∗ ≤ V̂i+1
π̂
≤η 1+ + V , i ≥ 0.
j=1
(λ + 1) j (λ + 1)i
6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems 237
Moreover, the approximate value function sequence {V̂iπ } converges to a finite neigh-
borhood of V ∗ uniformly on Ω as i → ∞, i.e.,
η
ηV ∗ ≤ lim V̂iπ̂ ≤ V ∗, (6.2.32)
i→∞ 1 + λ − λη
1
under the condition η < + 1.
λ
We will prove the convergence of exact optimistic policy iteration and establish the
error bounds for approximate optimistic policy iteration.
where
μ( j) μ( j−1)
Vi (x) = U x, μi−1 (x) + Vi F x, μi−1 (x) , 0 < j ≤ Ni , (6.2.34)
μ(0) μ μ(0) μ
Vi = Vi−1 , ∀i ≥ 1, and V1 = V0 = V0 . Using the definition in (6.2.5), the
μ μ μ
value function Vi (x) can be expressed by Vi (x) = TμNi−1i Vi−1 (x). The optimistic
policy iteration algorithm updates the control policy by
μ
μi (x) = arg min U (x, u) + Vi F(x, u) . (6.2.35)
u
In this case, the optimistic policy iteration algorithm becomes the generalized
policy iteration algorithm studied in Chap. 4. The above algorithm becomes value
iteration as Ni = 1 and becomes the policy iteration as Ni → ∞. For policy
iteration, it solves the value function associated with the current control policy at
each iteration, while it takes only one iteration toward that value function for value
238 6 Error Bounds of Adaptive Dynamic Programming Algorithms
iteration. However, the value function update in (6.2.34) has to stop before j → ∞
in practical implementations.
We list some key expressions in Table 6.4 which will be used in this section for
establishing analysis results.
Next, we will show the monotonicity property of value function which is given
in [6] and then establish the convergence property of optimistic policy iteration.
μ
Lemma 6.2.4 Let Assumption 6.2.1 hold. Suppose that V0 ≥ T V0 . Let Vi and μi
μ
be updated by (6.2.34) and (6.2.35). Then, the value function sequence {Vi } is a
μ μ
monotonically nonincreasing sequence, i.e., Vi ≥ Vi+1 , ∀i ≥ 0.
and
μ μ μ(0) μ μ
T V1 = Tμ1 V1 = Tμ1 V2 ≥ Tμ1 V2 = V2 .
Theorem 6.2.5 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that V ∗ ≤ V0 ≤
μ
βV ∗ , 1 ≤ β < ∞, and that V0 ≥ T V0 . The value function Vi and the control
policy μi are updated by (6.2.34) and (6.2.35). Then, the value function sequence
{Viμ } approaches V ∗ according to the inequalities
μ β −1
V ∗ ≤ Vi ≤ 1 + V ∗ , i ≥ 1. (6.2.37)
(1 + λ−1 )i
μ
Moreover, the value function Vi converges to V ∗ uniformly on Ω.
6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems 239
μ
Proof First, we prove that Vi ≤ Vi , ∀i ≥ 1, holds by mathematical induction, where
Vi is defined in (6.2.7). According to Lemma 6.2.2, we have
μ μ(N1 ) μ(1)
V1 = V1 ≤ Vi = Tμ0 V0 = T V0 = V1 .
μ
Thus, (6.2.37) holds for i = 1. Assume that it holds for i ≥ 1, i.e., Vi ≤ Vi .
According to Lemma 6.2.2, we have
μ μ(N ) μ(1) μ(0) μ μ
Vi+1 = Vi+1 i ≤ Vi+1 = Tμi Vi+1 = Tμi Vi = T Vi .
T Viμ ≤ T Vi = Vi+1 .
μ μ
Thus, we can obtain Vi+1 ≤ Vi+1 . Then, it can also be proved that Vi ≥ V ∗ , ∀i ≥ 1,
by mathematical induction. Therefore, considering Theorem 6.2.1, we can obtain
μ
(6.2.37). As i → ∞, the value function Vi converges to V ∗ uniformly on Ω. The
proof is complete.
μ̂ μ̂ μ̂
δTμ̂Ki V̂i−1 ≤ V̂i ≤ δTμ̂Ki V̂i−1 , ∀i = 1, 2, . . . , (6.2.38)
μ̂ μ̂ μ̂
V̂i ≤ δTμ̂Ki V̂i−1 ≤ δTμ̂i V̂i−1 . (6.2.39)
Similarly, we assume that there exist finite positive constants ζ ≤ 1 and ζ ≥ 1 such
that
μ̂ μ̂ μ̂
ζ T V̂i−1 ≤ Tμ̂i V̂i−1 ≤ ζ T V̂i−1 , ∀i = 1, 2, . . . , (6.2.40)
μ̂ μ̂
V̂i ≤ ζ δT V̂i−1 .
On the other hand, considering (6.2.38), (6.2.40), and Assumption 6.2.1, we get
240 6 Error Bounds of Adaptive Dynamic Programming Algorithms
−1) μ̂ μ̂
V̂iμ̂ ≥ δTμ̂Ki V̂i−1
μ̂
≥ δT Tμ̂(K
i
V̂i−1 ≥ · · · ≥ δT (K −1) Tμ̂i V̂i−1 μ̂
≥ ζ δT K V̂i−1 .
Therefore, the whole approximation errors in the value function and control policy
update equations can be expressed by
μ̂ μ̂ μ̂
ζ δT K V̂i−1 ≤ V̂i ≤ ζ δT V̂i−1 . (6.2.41)
μ̂ μ̂ μ̂
ηT K V̂i−1 ≤ V̂i ≤ ηT V̂i−1 . (6.2.42)
Next, we can establish the error bounds for the approximate optimistic policy
iteration by the following theorem.
Theorem 6.2.6 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that V ∗ ≤ V0 ≤
μ̂
βV ∗ , 1 ≤ β < ∞, and that V0 ≥ T V0 . The approximate value function V̂i
satisfies the iterative error condition (6.2.42). Then, the value function sequence
μ̂
V̂i approaches V ∗ according to the following inequalities
i
λ j η j−1 (1 − η) ∗
η 1− V
j=1
(λ + 1) j
μ̂
≤ V̂i+1
i
λ j η j−1 (η − 1) λi ηi (β − 1) ∗
≤η 1+ + V , i ≥ 0. (6.2.43)
j=1
(λ + 1) j (λ + 1)i
μ̂
Moreover, the approximate value function sequence V̂i converges to a finite neigh-
borhood of V ∗ uniformly on Ω as i → ∞, i.e.,
η μ̂ η
V ∗ ≤ lim V̂i ≤ V ∗, (6.2.44)
1 + λ − λη i→∞ 1 + λ − λη
1
under the condition η < + 1.
λ
Proof First, we prove the lower bound in (6.2.43). According to (6.2.42) and
Assumption 6.2.1, we have
μ̂
V̂1 ≥ ηT K V0 ≥ ηT K V ∗ = ηV ∗ ,
V̂2μ̂ ≥ ηT K V̂1μ̂
μ̂
= η min U (x, u) + T K −1 V̂1 F(x, u)
u
≥ η min U (x, u) + ηT K −1 V ∗ F(x, u)
u
= η min U (x, u) + ηV ∗ F(x, u)
u
1−η 1−η
≥ η min 1−λ U (x, u) + η + V ∗ F(x, u)
u λ+1 λ+1
λ(1 − η)
=η 1− V ∗,
λ+1
i.e., it holds for i = 1. Therefore, we can obtain the lower bound by continuing the
above process. Similar to Theorem 6.2.2, we can obtain the upper bound in (6.2.43)
and the conclusion in (6.2.44). The proof is complete.
We have just proven that the approximate value iteration, approximate policy itera-
tion, and approximate optimistic policy iteration algorithms can converge to a finite
neighborhood of the optimal value function associated with the Bellman equation.
It should be mentioned that we consider approximation errors in both value function
and control policy update equations at each iteration. This makes it feasible to use
neural network approximation for solving undiscounted optimal control problems of
nonlinear systems. Since the optimistic policy iteration contains the value iteration
and policy iteration, we only present a detailed implementation of the approximate
optimistic policy iteration using neural networks in this section. The neural net-
work implementation of approximate value iteration has been discussed in previous
chapters [15, 17].
The whole structural diagram of the approximate optimistic policy iteration is
shown in Fig. 6.1 (cf. (6.2.34)), where two multilayer feedforward neural networks
are used. The critic neural network is used to approximate the value function, and
the action neural network is used to approximate the control policy.
A neural network can be used to approximate some smooth function on a pre-
μ( j)
scribed compact set. The value function Vi (xk ) in (6.2.34) is approximated by the
critic neural network
μ̂( j) j T j T
V̂i (xk ) = Wc(i) φ Yc(i) xk ,
where the activation functions are selected as tanh(·). The target function of the critic
neural network training is given by
242 6 Error Bounds of Adaptive Dynamic Programming Algorithms
Critic Vˆi uˆ ( j ) ( xk )
Network
Signal line
Back-propagating path
Weight transimission
where
xk+1 = F xk , μ̂i−1 (xk ) .
Then, the error function for training critic neural network is defined by
j μ̂( j) μ̂( j)
ec(i) (xk ) = Vi (xk ) − V̂i (xk ),
j 1 j 2
E c(i) (xk ) = e (xk ) . (6.2.45)
2 c(i)
The control policy μi (xk ) in (6.2.35) is approximated by the action neural network
T
μ̂i (xk ) = Wa(i)
T
φ Ya(i) xk .
Then, the error function for training the action neural network is given by
The weights of the action neural network are updated to minimize the following
performance function
1 T
E a(i) (xk ) = ea(i) (xk ) ea(i) (xk ). (6.2.46)
2
We use the gradient descent method to tune the weights of neural networks on a
training set constructed from the compact set Ω. The details of this tuning method
can be found in [15]. Some other tuning methods can also be used, such as Newton’s
method and the Levenberg–Marquardt method, in order to increase the convergence
rate of neural network training.
A detailed process of the approximate optimistic policy iteration is given in
Algorithm 6.2.1, where the approximate value iteration can be regarded as a spe-
cial case. If we have an initial stabilizing control policy, the algorithm can iterate
between Steps 4 and 5 directly. It should be mentioned that Algorithm 6.2.1 runs in
an off-line manner. Note that it can also be implemented online but a persistence of
excitation condition is usually required.
1.5
1
V0 −V1
0.5
0
1
0.5 1
0 0.5
x2 0
−0.5 −0.5 x1
−1 −1
3.5
3
Value function
2.5
1.5
1
0 5 10 15 20
The iteration index: j
μ̂( j)
Fig. 6.3 The convergence curve of the value function V̂1 at x0
4
K=1
K=3
3.5 K=10
3
Value function
2.5
1.5
0.5
0 5 10 15 20
The iteration index: i
μ̂
Fig. 6.4 The convergence curves of the value function V̂1 at x0 when K = 1, K = 3, and K = 10
tion, respectively. After implementing the algorithms for i max = 20, the convergence
μ̂
curves of the value functions V̂1 at x0 are shown in Fig. 6.4. It can be seen that all
the value functions are basically convergent with the iteration index i > 10, and the
obtained approximate optimal value functions at i = 20 are quite close. Although
246 6 Error Bounds of Adaptive Dynamic Programming Algorithms
0.4
K=1
K=3
0.2 K=10
−0.2
x2
−0.4
−0.6
−0.8
−1
−0.2 0 0.2 0.4 0.6 0.8 1 1.2
x1
0.7
K=1
0.6 K=3
K=10
0.5
Control inputs
0.4
0.3
0.2
0.1
−0.1
0 10 20 30 40 50 60
Time steps: k
there exist approximation errors in both value function and control policy update
steps, the approximate value function can converge to a finite neighborhood of the
optimal value function.
Finally, we apply the approximate optimal control policies obtained above to the
system (6.2.47) for 60 time steps. The corresponding state trajectories are displayed
in Fig. 6.5, and the control inputs are displayed in Fig. 6.6. Examining the results,
6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems 247
it is observed that all the control policies obtain very good performance, and the
differences between the three trajectories are quite small.
where xk ∈ Rn is the state vector, and u k ∈ Rm is the control input. The system
(6.3.1) is assumed to be controllable which implies that there exists a continuous
control policy on a compact set Ω ⊆ Rn that stabilizes the system asymptotically.
We assume that xk = 0 is an equilibrium state of the system (6.3.1) and F(0, 0) = 0.
The system function F(xk , u k ) is Lipschitz continuous with respect to xk and u k . The
infinite-horizon cost function with discount factor for any initial state x0 is given by
∞
J (x0 , u 0 ) = γ k U (xk , u k ), (6.3.2)
k=0
In the case of action-dependent ADP schemes, which will be studied in this section,
Q-function (also known as the state-action value function) will be used and it is
defined as
Q μ (x, u) = U (x, u) + γ Q μ (x + , μ(x + )), (6.3.7)
where x ∈ Rn , u ∈ Rm , and x + is the state of the next moment, i.e., x + = F(x, u).
According to (6.3.3), the relationship between value function and Q-function is
where u k+1 is the action of the next moment. The connection between the optimal
value function and the optimal Q-function is
In most situations, the optimal control problem for nonlinear systems has no
analytical solution and the traditional dynamic programming faces the “curse of
dimensionality.” In this section, we develop an iterative ADP algorithm by using
Q-function which depends on the state and action to solve the nonlinear optimal
control problem. Similar to most ADP methods, function approximation structures
such as neural networks are used to approximate the Q-function (or state-action
value function) and the control policy. The approximation errors may increase along
with the iteration processes. Therefore, it is necessary to establish the error bounds of
6.3 Error Bounds of Q-Function for Discounted Optimal Control Problems 249
Q-function for the present iteration algorithm considering the function approximation
errors.
We assume that the nonlinear dynamical system (6.3.1) is unknown, and only an
off-line data set {xk , u k , xk+1 } N is available, where xk+1 is the next state given xk and
u k , and N is the number of samples in the data set. xk+1 and xk stand for the dynamic
behavior of one-shot data and do not necessarily mean that the data set has to take
samples from one trajectory. In general, the data set contains a variety of trajectories
and scattered data.
For the policy iteration of action-dependent ADP algorithms, it starts with an
initial admissible control μ0 . For i = 0, 1, . . ., the policy iteration algorithm contains
policy evaluation phase and policy improvement phase given as follows.
Policy evaluation:
μ μ
i
Q j+1 (xk , u k ) = U (xk , u k ) + γ Q j i (xk+1 , μi (xk+1 )) (6.3.10)
Policy improvement:
μi+1 (xk ) = arg min Q μi (xk , u k ) (6.3.11)
uk
μ
where j is the policy evaluation index and i is the policy improvement index. Q j i
represents the jth evaluation for the ith control policy μi , and
μ
Q 0 i = Q μ∞i−1 .
μ
Let Q μi denote the Q-function for μi . Next, we will prove that the limit of Q j i as
μ
j → ∞ exists and Q ∞i = Q μi .
Assumption 6.3.1 There exists a finite positive constant λ such that the condition
Remark 6.3.1 Assumption 6.3.1 is a basic assumption which ensures the conver-
gence of ADP algorithms. For most nonlinear systems, it is easy to find a sufficiently
large number λ to satisfy this assumption as Q ∗ (·) and U (·) are finite and positive.
Lemma 6.3.1 Let Assumption 6.3.1 hold. Suppose that μ0 is an admissible control
μ
policy and Q μ0 is the Q-function of μ0 . Let Q j i and μi be updated by (6.3.10) and
(6.3.11). Then we can obtain the following conclusions:
μ μ μi
(i) the sequence {Q j i } is monotonically nonincreasing, i.e., Q j i ≥ Q j+1 , ∀i ≥ 1.
250 6 Error Bounds of Adaptive Dynamic Programming Algorithms
μ μ
Moreover, as j → ∞, the limit of Q j i exists and is denoted by Q ∞i , and equals to
Q μi , ∀i ≥ 1.
(ii) the sequence {Q μi } is monotonically nonincreasing, i.e., Q μi ≥ Q μi+1 , ∀i ≥ 1.
μ μ
Letting j → ∞, we have Q ∞1 (xk , u k ) ≥ U (xk , u k ) + γ Q ∞1 (xk+1 , μ1 (xk+1 )). Simi-
larly, we can obtain
μ
Q μ∞1 (xk , u k ) ≤ U (xk , u k ) + γ Q j 1 (xk+1 , μ1 (xk+1 )), j ≥ 0. (6.3.15)
μ μ
Letting j → ∞, we have Q ∞1 (xk , u k ) ≤ U (xk , u k )+γ Q ∞1 (xk+1 , μ1 (xk+1 )). Hence,
we can obtain
μi
and Q ∞ = Q μi hold for any i = l, l ≥ 1. According to (6.3.10) and (6.3.11), we
can obtain
6.3 Error Bounds of Q-Function for Discounted Optimal Control Problems 251
μ
Q 0 l+1 (xk , u k ) = Q μl (xk , u k )
= U (xk , u k ) + γ Q μl (xk+1 , μl (xk+1 ))
≥ U (xk , u k ) + γ min Q μl (xk+1 , u k+1 )
u k+1
Proof According to the definitions of Q ∗ and Q μi , the left-hand side of the inequality
(6.3.20) always holds for any i ≥ 1. Next, we prove the right-hand side of (6.3.20)
by induction. According to Assumption 6.3.1, we have
β −1
= 1+λ U (xk , u k )
λ+1
β −1
+ β+ γ min Q ∗ (xk+1 , u k+1 )
λ+1 u k+1
β −1
= 1+ Q ∗ (xk , u k ). (6.3.22)
(1 + λ−1 )
which shows that the right-hand side of (6.3.20) holds for i = 1. Assume that
μi β −1
Q (xk , u k ) ≤ 1 + Q ∗ (xk , u k )
(1 + λ−1 )i
which shows that the right-hand side of (6.3.20) holds for i = l + 1. According to the
mathematical induction, the right-hand side of (6.3.20) holds. The proof is complete.
In policy iteration, an initial admissible control policy is required, which is usually
obtained by experience or trial. However, for most nonlinear systems, it is hard to
obtain an admissible control policy, especially in action-dependent ADP for unknown
systems. So we present a novel initial condition for policy iteration.
Lemma 6.3.2 Let Assumption 6.3.1 hold. Suppose that there is a positive-definite
μ μ
function Q 0 satisfying γ Q 0 ≥ Q 1 1 for any xk , u k . Let Q j 1 and μ1 be obtained by
μ
(6.3.10) and (6.3.11). Then μ1 (x) is an admissible control policy and Q μ1 = Q ∞1 is
the Q-function of μ1 (x).
6.3 Error Bounds of Q-Function for Discounted Optimal Control Problems 253
and
μ
Q 1 1 (xk , u k ) = U (xk , u k ) + γ Q 0 (xk+1 , μ1 (xk+1 )). (6.3.26)
we can conclude that the control policy μ1 is asymptotically stable for the system
μ
(6.3.1). Then, similar to (6.3.12)–(6.3.16), we can obtain that Q μ1 = Q ∞1 ≤ Q 0 .
Thus, the value function of μ1 satisfies
Therefore, we can conclude that μ1 (xk ) is an admissible control. The proof is com-
plete.
From Theorem 6.3.1 and Corollary 6.3.1, we can see that as i → ∞, Q μi con-
verges to Q ∗ under ideal conditions, i.e., the control policy and Q-function in each
iteration can be obtained accurately. They also give a convergence rate of Q μi with
policy iteration. When the discount factor γ = 1, the discounted optimal control
problem turns into an undiscounted optimal control problem, and Theorem 6.3.1 and
Corollary 6.3.1 still hold.
However, in practice, considering that the iteration indices i and j cannot reach
infinity as the algorithm must stop in finite steps, there exist convergence errors in
254 6 Error Bounds of Adaptive Dynamic Programming Algorithms
the iteration process. In addition, the control policy and Q-function in each iteration
are obtained by approximation structures, so there exist approximate errors between
approximate and accurate values. Hence, Theorem 6.3.1 and Corollary 6.3.1 may
be invalid and the policy-iteration-based action-dependent ADP may even be diver-
gent. To overcome this difficulty, in the following section we establish new error
bound analysis results for Q-function considering the convergence and approxima-
tion errors.
For the approximate policy iteration, function approximation structures are used to
approximate the Q-function and the control policy. The approximate expressions
of μi and Q μi are μ̂i and Q̂ μ̂i , respectively. We assume that there exist two finite
positive constants δ ≤ 1 and δ̄ ≥ 1 such that
holds uniformly, for any i ≥ 1, where Q μ̂i is the exact Q-function associated with
μ̂i . δ and δ̄ imply the convergence error in j-iteration and the approximation error of
Q μ̂i in policy evaluation phase. When δ = δ̄ = 1, both errors are zero. Considering
Lemma 6.3.1, we obtain
μ̂
Q̂ μ̂i ≤ δ̄ Q μ̂i ≤ δ̄ Q̂ 1 i , (6.3.32)
where
Q̂ μ̂1 i (xk , u k ) = U (xk , u k ) + γ Q̂ μ̂i−1 (xk+1 , μ̂i (xk+1 )).
Similarly, we assume that there exist two finite positive constants ζ ≤ 1 and ζ̄ ≥ 1
such that
μ̂ μ̂ μ̂
ζ Q 1 i ≤ Q̂ 1 i ≤ ζ̄ Q 1 i (6.3.33)
μ̂
Q 1 i (xk , u k ) = U (xk , u k ) + γ Q̂ μ̂i−1 (xk+1 , μ̂i (xk+1 )).
ζ and ζ̄ imply the approximation errors of μ̂i in the policy improvement phase. If
the iterative control policy can be obtained accurately, then ζ = ζ̄ = 1. Considering
(6.3.32) and (6.3.33), we can get
μ̂
Q̂ μ̂i ≤ ζ̄ δ̄ Q 1 i . (6.3.34)
6.3 Error Bounds of Q-Function for Discounted Optimal Control Problems 255
Therefore, the approximation errors in the Q-function and control policy update step
can be expressed by
μ̂
ηQ ∗ ≤ Q̂ μ̂i ≤ ηQ 1 i , (6.3.36)
where η = δ and η = ζ̄ δ̄. We establish the error bounds for approximate policy
iteration by the following theorem.
Theorem 6.3.2 Let Assumption 6.3.1 hold. Suppose that Q ∗ ≤ Q μ0 ≤ β Q ∗ , 1 ≤
β ≤ ∞. μ0 is an admissible control policy and Q μ0 is the Q-function of μ0 . The
approximate Q-function Q̂ μ̂i satisfies the iterative error condition (6.3.36). Then, the
Q-function sequence { Q̂ μ̂i } approaches Q ∗ according to the following inequalities
⎡ ⎤
i
λj η j−1 (η − 1) λ η (β − 1) ⎦ ∗
i i
ηQ ∗ ≤ Q̂ μ̂i+1 ≤ η ⎣1 + + Q , ∀i ≥ 0.
j=1
(λ + 1) j (λ + 1)i
(6.3.37)
Moreover, the approximate value function sequence { Q̂ μ̂i } converges to a finite neigh-
borhood of Q ∗ uniformly on Ω as i → ∞, i.e.,
η
ηQ ∗ ≤ lim Q̂ μ̂i ≤ Q∗, (6.3.38)
i→∞ 1 + λ − λη
1
under the condition η ≤ + 1.
λ
Proof First, the left-hand side of (6.3.37) holds clearly according to (6.3.36). Next,
we prove the right-hand side of (6.3.37) for i ≥ 1. Considering (6.3.36), we obtain
μ̂
Q̂ μ̂1 (xk , u k ) ≤ ηQ 1 1 (xk , u k )
μ̂
≤ ηQ 0 1 (xk , u k )
= ηQ μ0 (xk , u k )
≤ ηβ Q ∗ (xk , u k ), (6.3.39)
which means (6.3.37) is true for i = 0. Then, considering (6.3.36) and Assumption
6.3.1, we can derive
μ̂
Q̂ μ̂2 (xk , u k ) ≤ ηQ 1 2 (xk , u k )
= η[U (xk , u k ) + γ Q̂ μ̂1 (xk+1 , μ̂2 (xk+1 ))]
= η[U (xk , u k ) + γ min Q μ̂1 (xk+1 , u k+1 )]
u k+1
256 6 Error Bounds of Adaptive Dynamic Programming Algorithms
ηβ − 1
=η 1+λ U (xk , u k )
λ+1
ηβ − 1
+ η ηβ − γ min Q ∗ (xk+1 , u k+1 )
λ+1 u k+1
ηβ − 1
=η 1+λ Q ∗ (xk , u k )
λ+1
λ(η − 1) λη(β − 1)
=η 1+ + Q ∗ (xk , u k ) (6.3.40)
λ+1 λ+1
Hence, the upper bound of (6.3.37) holds for i = 1. Suppose that the upper bound
of (6.3.37) holds for Q̂ μ̂i (i ≥ 1). Then, for Q̂ μ̂i+1 , we derive
Q̂ μ̂i+1 (xk , u k ) ≤ η U (xk , u k ) + γ min Q̂ μ̂i (xk+1 , u k+1 )
u k+1
≤ η U (xk , u k ) + γ ηρ min Q ∗ (xk+1 , u k+1 )
u k+1
≤ η U (xk , u k ) + γ ηρ min Q ∗ (xk+1 , u k+1 )
u k+1
+ ηΔ λU (xk , u k ) − γ min Q ∗ (xk+1 , u k+1
u k+1
= η(1 + Δλ) U (xk , u k ) + γ min Q ∗ (xk+1 , u k+1 )
u k+1
where
i−1
λ j η j−1 (η − 1) λi−1 ηi−1 (β − 1)
ρ =1+ +
j=1
(λ + 1) j (λ + 1)i−1
i
λ j−1 η j−1 (η − 1) λi−1 ηi (β − 1)
= + . (6.3.42)
j=1
(λ + 1) j (λ + 1)i
6.3 Error Bounds of Q-Function for Discounted Optimal Control Problems 257
i
μ̂i+1 λ j η j−1 (η − 1) λi ηi (β − 1)
Q̂ ≤η 1+ + Q∗. (6.3.43)
j=1
(λ + 1) j (λ + 1)i
Thus, the upper bound of Q̂ μ̂i holds for i + 1. According to the mathematical induc-
tion, the right-hand side of (6.3.37) holds.
The conclusion in (6.3.38) can be proved following the same steps as in the proof
of Theorem 6.2.2. The proof of the theorem is complete.
Remark 6.3.2 We can find that the upper bound is a monotonically increasing func-
tion of η. The condition η ≤ 1/λ + 1 ensures that the upper bound in (6.3.38) is
finite and positive. A larger λ will lead to a slower convergence rate and a larger error
bound. Besides, a larger λ also requires more accurate iteration to converge. When
η = η = 1, the approximate Q-function sequence Q̂ μ̂i converges to Q ∗ uniformly
on Ω as i → ∞.
For the undiscounted optimal control problem, the discount factor γ = 1, and the
Q-function is redefined as
From Theorems 6.3.1 and 6.3.2, when γ = 1, all the deductions still hold. So we
have the following corollary.
Corollary 6.3.2 For the undiscounted optimal control problem with Assumption
6.3.1 and the admissible control policy μ0 satisfying Q ∗ ≤ Q μ0 ≤ β Q ∗ , 1 ≤ β ≤ ∞,
if the approximate Q-function Q̂ μ̂i satisfies the iterative error condition (6.3.36), the
approximate Q-function sequence { Q̂ μ̂i } converges to a finite neighborhood of Q ∗
uniformly on Ω as i → ∞, i.e.,
η
ηQ ∗ ≤ lim Q̂ μ̂i ≤ Q∗, (6.3.46)
i→∞ 1 + λ − ηλ
In the previous section, the approximate Q-function with policy iteration is proven
to converge to a finite neighborhood of the optimal one. Hence, it is feasible to
258 6 Error Bounds of Adaptive Dynamic Programming Algorithms
approximate the Q-function and the control policy using neural networks. We present
the detailed implementation of the present algorithm using neural networks in this
section.
The structural diagram of the action-dependent iterative ADP algorithm in this
section is shown in Fig. 6.7. The outputs of critic network and the action network are
the approximations of the Q-function and the control policy, respectively.
The approximate Q-function Q̂ ûj i (xk , u k ) is expressed by
where ζ (·) is the activation function, which is selected as tanh(·). The target function
of the critic neural network training is given by
Q̂ i,∗ j+1 (xk , u k ) = U (xk , u k ) + Q̂ ûj i (xk+1 , μ̂i (xk+1 )), (6.3.48)
where xk+1 = F(xk , μ̂i (xk )). Then, the error function for the critic network training
is defined by
ec(i, j+1) (xk ) = Q̂ i,∗ j+1 (xk , u k ) − Q̂ ûj+1
i
(xk , u k ), (6.3.49)
1 2
E c(i, j+1) = e . (6.3.50)
2 c(i, j+1)
xk
Critic Qˆ jˆi 1
uk Network
xk
xk 1 Action
1
Critic Qˆ jˆi U ( xk , uk )
Network ˆ i ( xk 1 ) Network
Signal line
Back-propagating path
Weight transimission
Then, the error function for training the action network is defined by
∗
ea(i+1) (xk ) = μ̂i+1 (xk ) − μ̂i+1 (xk ). (6.3.53)
1 T
E a(i+1) = e ea(i+1) . (6.3.54)
2 a(i+1)
We use the gradient descent method to update the weights of critic and action
networks on a training data set. The detailed process of the approximate policy
iteration is given in Algorithm 6.3.1.
where
0.1 0
Q= ,
0 0.1
2.5
2
Q-function
1.5
0.5
10 20 30 40 50 60
Iteration index j
μ̂
Fig. 6.8 The convergence of Q-function Q̂ j i on state [0.5, −0.5]T at i = 1
6.3 Error Bounds of Q-Function for Discounted Optimal Control Problems 261
5
Q−function
0
1 2 3 4 5 6 7 8 9 10
Iteration index i
(a) (b)
1 1
Trajectory under the Trajectory under the
initial control policy initial control policy
0.8 Trajectory under the Trajectory under the
optimal control policy 0.5 optimal control policy
0.6
0
0.4
0.2 −0.5
x2
x1
0
−1
−0.2
−1.5
−0.4
−0.6 −2
0 100 200 300 0 100 200 300
Time steps Time steps
Fig. 6.10 The state trajectories starting from the state [1, −1]T a x1 ; b x2
262 6 Error Bounds of Adaptive Dynamic Programming Algorithms
2
Initial control policy
Optimal control policy
1.5
0.5
Control action
−0.5
−1
−1.5
−2
0 50 100 150 200 250 300
Time steps
Fig. 6.11 The control signals corresponding to the states from [1, −1]T
i max = 10. The compact subset Ω of the state space is chosen as 0 ≤ x1 ≤ 1 and
0 ≤ x2 ≤ 1. The training set {xk } contains 1000 samples choosing randomly from the
compact set Ω. The initial admissible control policy is chosen as μ0 = [−3, −1]xk .
We train the action and critic neural networks off-line with the Algorithm 6.3.1.
μ̂
Figure 6.8 illustrates the convergence process of the Q-function Q̂ j i with the iteration
index j on state x = [−0.5, 0.5]T at i = 1. Figure 6.9 shows the convergence curve
of Q-function Q̂ μ̂i on state [0.5, 0.5]T with the iteration index i. Figure 6.10 shows
the state trajectories from the initial state [1, −1]T to the equilibrium under the initial
control policy and the approximate optimal control policy obtained by our method,
respectively. Figure 6.11 shows the action trajectories of the initial control policy and
the approximate optimal control policy obtained by our method, respectively.
6.4 Conclusions
In this chapter, we established error bounds for approximate value iteration, approx-
imate policy iteration, and approximate optimistic policy iteration. We considered
approximation errors in both value function and control policy update equations.
It was shown that the iterative approximate value function can converge to a finite
neighborhood of the optimal value function under some mild conditions. The results
provided theoretical guarantees for using neural network approximation for solving
6.4 Conclusions 263
undiscounted optimal control problems. We also developed the error bound analy-
sis method of Q-function with approximate policy iteration for optimal control of
unknown discounted discrete-time nonlinear systems.
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Almudevar A, Arruda EF (2012) Optimal approximation schedules for a class of iterative
algorithms, with an application to multigrid value iteration. IEEE Trans Autom Control
57(12):3132–3146
3. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B:
Cybern 38(4):943–949
4. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton
5. Bertsekas DP (2012) Weighted sup-norm contractions in dynamic programming: a review and
some new applications. Technical report LIDS-P-2884, MIT
6. Bertsekas DP (2013) Abstract dynamic programming. Athena Scientific, Belmont
7. Bertsekas DP (2016) Value and policy iterations in optimal control and adaptive dynamic
programming. IEEE Trans Neural Netw Learn Syst (Online Available). doi:10.1109/TNNLS.
2015.2503980
8. Dierks T, Thumati BT, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5):851–860
9. Grune L, Rantzer A (2008) On the infinite horizon performance of receding horizon controllers.
IEEE Trans Autom Control 53(9):2100–2111
10. Heydari A, Balakrishnan SN (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
11. Howard RA (1960) Dynamic programming and Markov processes. MIT Press, Cambridge
12. Leake RJ, Liu RW (1967) Construction of suboptimal control sequences. SIAM J Control
Optim 5(1):54–63
13. Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
14. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
15. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
16. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
17. Liu D, Wei Q (2013) Finite-approximation-error based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
18. Liu D, Wei Q (2014) Policy iterative adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
19. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
20. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
21. Liu D, Li H, Wang D (2015) Error bounds of adaptive dynamic programming algorithms
for solving undiscounted optimal control problems. IEEE Trans Neural Netw Learn Syst
26(6):1323–1334
264 6 Error Bounds of Adaptive Dynamic Programming Algorithms
22. Puterman M, Shin M (1978) Modified policy iteration algorithms for discounted Markov deci-
sion problems. Manag Sci 24(11):1127–1137
23. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc-Control
Theory Appl 153(5):567–574
24. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
25. Tsitsiklis JN (2002) On the convergence of optimistic policy iteration. J Mach Learn Res
3:59–72
26. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
27. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
28. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with -error bound. IEEE Trans Neural
Netw 22(1):24–36
29. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
30. Wei Q, Wang FY, Liu D, Yang X (2014) Finite-approximation-error based discrete-time iterative
adaptive dynamic programming. IEEE Trans Cybern 44(12):2820–2833
31. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: Neural, fuzzy, and adaptive
approaches (Chap. 13). Van Nostrand Reinhold, New York
32. Yan P, Wang D, Li H, Liu D (2016) Error bound analysis of Q-function for discounted optimal
control problems with policy iteration. IEEE Trans Syst Man Cybern Syst (Online Available)
doi:10.1109/TSMC.2016.2563982
33. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans
Syst Man Cybern-Part B: Cybern 38(4):937–942
34. Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of
discrete-time affine nonlinear systems with control constraints. IEEE Trans Neural Netw
20(9):1490–1503
35. Zhang H, Song R, Wei Q, Zhang T (2011) Optimal tracking control for a class of nonlinear
discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans
Neural Netw 22(1):24–36
Part II
Continuous-Time Systems
Chapter 7
Online Optimal Control of Continuous-Time
Affine Nonlinear Systems
7.1 Introduction
where x(t) ∈ Rn is the state vector available for measurement, u(t) ∈ Rm is the
control vector, f (x) ∈ Rn is an unknown nonlinear function with f (0) = 0, and
g(x) ∈ Rn×m is a matrix of known nonlinear functions. For the convenience of
subsequent analysis, we provide two assumptions as follows.
© Springer International Publishing AG 2017 267
D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_7
268 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
Assumption 7.2.2 The control matrix g(x) is bounded over a compact set Ω; that
is, there exist constants gm and g M (0 < gm < g M ) such that gm ≤ g(x)F ≤ g M
for all x ∈ Ω.
Note that g(·) ∈ Rn×m is a matrix function. Our analysis results in this chapter
(and Chap. 8) will use Frobenius matrix norm, which is defined as
n
m
AF = ai2j
i=1 j=1
for matrix A = (ai j ) ∈ Rn×m . For the convenience of presentation, we will drop the
subscript “F” for Frobenius matrix norm in this chapter and the next. Accordingly,
vector norm will also be chosen as the compatible one, i.e., the Euclidean norm for
vectors which is defined as
n
x = xi2
i=1
for x ∈ Rn . In other words, whenever we use norms in this chapter and Chap. 8, we
mean Euclidean norm for vectors and Frobenius norm for matrices.
The value function for system (7.2.1) is defined by
∞
V (x(t)) = r x(s), u(s) ds, (7.2.2)
t
where
r (x, u) = x TQx + u TRu,
Definition 7.2.1 (cf. [11]) The solution x(t) of system (7.2.1) is said to be uniformly
ultimately bounded (UUB), if there exist positive constants b and c, independent of
t0 ≥ 0, and for every a ∈ (0, c), there is T = T (a, b) > 0, independent of t0 , such
that
x(t0 ) ≤ a ⇒ x(t) ≤ b, ∀t ≥ t0 + T.
The control objective of this chapter is to obtain an online adaptive control not
only stabilizes system (7.2.1) but also minimizes the value function (7.2.2), while
ensuring that all signals in the closed-loop system are UUB.
7.2 Online Optimal Control of Partially Unknown Affine Nonlinear Systems 269
where F (x) = f (x) − Ax and A ∈ Rn×n is a known Hurwitz matrix. Using (7.2.3),
(7.2.4) can be rewritten as
ẋ(t) = Ax + WmT σm YmT x + g(x)u + εm (x). (7.2.5)
where x̂(t) ∈ Rn is the identifier NN state, Ŵm ∈ R N0 ×n and Ŷm ∈ Rn×N0 are
estimated weights, x̃(t) is the identification error x̃(t) x(t) − x̂(t), and the design
matrix C ∈ Rn×n selected such that A − C is a Hurwitz matrix.
Using (7.2.5) and (7.2.6), we obtain the identification error dynamics as
˙ = Ac x̃(t) + W̃ T σm Ŷ T x̂ + δ(x),
x̃(t) (7.2.7)
m m
Before proceeding further, we provide some assumptions and facts, which have
been used in the literature [16, 17, 23, 25, 31–33, 35, 38].
270 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
Wm ≤ W̄ M , Ym ≤ Ȳ M .
Fact 7.2.1 The NN activation function is bounded over Ω, i.e., there exists a known
constant σ M > 0 such that σm (x) ≤ σ M for every x ∈ Ω.
ATc P + P Ac = −θ In ,
τ1 > l1 A−1
c /4, τ2 > l2 ,
2
(7.2.10)
Φ ŶmT x̂ = diag σm1
2 T
Ŷm1 x̂ , . . . , σm2 N0 ŶmTN0 x̂ ,
T
I N0 is the N0 × N0 identity matrix, sgn(x̂) = sgn(x̂1 ), . . . , sgn(x̂n ) , and sgn(·) is
the componentwise sign function. Then, the identifier NN given in (7.2.6) can ensure
that the identification error x̃(t) in (7.2.7) converges to a compact set
Ωx̃ = x̃ : x̃ ≤ 2ς/θ , (7.2.11)
where ς > 0 is a constant to be determined later (see (7.2.20) below). In addition, the
NN weight estimation errors W̃m = Wm − Ŵm and Ỹm = Ym − Ŷm are all guaranteed
to be UUB.
where
1 T 1 1
L 11 (x) = x̃ P x̃ and L 12 (x) = tr W̃mT l1−1 W̃m + tr ỸmT l2−1 Ỹm .
2 2 2
Taking the time derivative of L 11 (x) along the solutions of (7.2.7) and using Facts
7.2.1 and 7.2.2, it follows
1 ˙T
L̇ 11 (x) = x̃ P x̃ + x̃ TP x̃˙
2
θ
= − x̃ Tx̃ + x̃ TP W̃mT σm ŶmT x̂ + δ(x)
2
θ
≤ − x̃2 + x̃ P W̃m σ M + δ M , (7.2.13)
2
where δ M is the upper bound of δ(x), i.e., δ(x) ≤ δ M . Actually, noticing that u(x)
is a continuous function defined on Ω, there exists a constant u M > 0 such that
u(x) ≤ u M for every x ∈ Ω. Then, by Assumptions 7.2.2–7.2.4 and Fact 7.2.1,
we can conclude that δ(x) given in (7.2.7) is an upper bounded function.
Taking the time derivative of L 12 (x) and using (7.2.8) and (7.2.9), we obtain
L̇ 12 (x) = tr W̃mT l1−1 W̃˙ m
τ1
= tr W̃mT σm ŶmT x̂ x̃ TA−1
c + x̃ W̃ T
m W m − W̃ m
l1
T
+ tr ỸmTsgn(x̂)x̃ TA−1c Wm − W̃m
T τ2
× I N0 − Φ Ŷm x̂ + x̃ỸmT Ym − Ỹm . (7.2.14)
l2
Note that
tr X 1 X 2T = tr X 2T X 1 = X 2T X 1 , ∀X 1 , X 2 ∈ Rn×1 ,
and
tr Z̃ T(Z − Z̃ ) ≤ Z̃ F Z F − Z̃ 2F , ∀Z , Z̃ ∈ Rm×n . (7.2.15)
We emphasize that (7.2.15) is true for Frobenius matrix norm, but it is not true for
other matrix norms in general. As we have declared earlier, Frobenius norm for
matrices and Euclidean norm for vectors are used in this chapter. We do not use the
subscript “F” for Frobenius matrix norm for the convenience of presentation in this
chapter.
Then, from (7.2.14), it follows
272 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
T τ1
L̇ 12 (x) = x̃ TA−1
c W̃m σm Ŷm x̂ +
T
x̃tr W̃mT Wm − W̃m
l1
T
+ x̃ TA−1
c Wm − W̃m I N0 − Φ ŶmT x̂ ỸmT sgn(x̂)
τ2
+ x̃tr ỸmT Ym − Ỹm
l2
τ1
≤ ασ M x̃W̃m + x̃ W̄ M W̃m − W̃m 2
l1
T
+ α I N0 − Φ Ŷm x̂ x̃ W̄ M + W̃m Ỹm
τ2
+ x̃ Ȳ M Ỹm − Ỹm 2 , (7.2.16)
l2
T
where α = A−1 ≤
c . Combining (7.2.13) and (7.2.16) and noticing I N0−Φ Ŷm x̂
1, we obtain the time derivative of the Lyapunov function as
θ τ1
L̇ 1 (x) ≤ − x̃2 + δ M P + (P + α)σ M + W̄ M W̃m
2 l1
2
τ2 τ1 α
+ α W̄ M + Ȳ M Ỹm − − W̃m 2
l2 l1 4
τ2 α 2
− − 1 Ỹm −2
W̃m − Ỹm x̃
l2 2
θ τ1 α2 2 τ2
= − x̃2 + δ M P + − β1 + − 1 β22
2 l1 4 l2
τ1 α2 τ2
− − (W̃m + β1 )2 − − 1 (Ỹm + β2 )2
l1 4 l2
α 2
− W̃m − Ỹm x̃, (7.2.17)
2
where
2l1 (α + P)σ M + 2τ1 W̄ M αl2 W̄ M + τ2 Ȳ M
β1 = , β2 = .
α 2 l1 − 4τ1 2(l2 − τ2 )
where
τ1 α2 2 τ2
ς = δ M P + − β1 + − 1 β22 . (7.2.20)
l1 4 l2
Hence, L̇ 1 (x) is negative as long as x̃ > 2ς/θ . That is, the identification error
x̃(t) converges to Ωx̃ defined as in (7.2.11). Meanwhile, according to the standard
Lyapunov’s extension theorem [12, 13] (or the Lagrange stability result [22]), this
verifies the uniform ultimate boundedness of the NN weight estimation errors W̃m
and Ỹm , which completes the proof of the theorem.
Remark 7.2.1 The first terms of (7.2.8) and (7.2.9) are derived from the standard
backpropagation algorithm, and the last terms are employed to ensure the bounded-
ness of parameter estimations. The size of Ωx̃ in (7.2.11) can be kept sufficiently
small by properly choosing parameters, for example, θ , τi , and li (i = 1, 2), such that
higher accuracy of identification is guaranteed. Although (7.2.8) and (7.2.9) share
similar feature as in [16], a significant difference between [16] and the present work
is that, in our case, we do not use Taylor series in the process of identification. Due
to residual errors from the Taylor series expansion, our method is considered to be
more accurate in estimating unknown system dynamics.
Since system (7.2.1) can be approximated well by (7.2.6) outside of the compact
set Ωx̃ , in what follows we replace system (7.2.1) with (7.2.6). Meanwhile, we replace
the actual system state x(t) with the estimated state x̂(t). In this circumstance, system
(7.2.1) can be represented using (7.2.6) as
˙ = h(x̂) + g(x̂)u,
x̂(t) (7.2.21)
where
h(x̂) = A x̂ + ŴmT σm ŶmT x̂ + C x̃.
If the control u(x̂) ∈ A (Ω) and the value function V (x̂) ∈ C 1 (Ω), then (7.2.21)
and (7.2.22) are equivalent to
Vx̂T h(x̂) + g(x̂)u + Q(x̂) + u TRu = 0,
274 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
where Vx̂ ∈ Rn represents the partial derivative of V (x̂) with respect to x̂, i.e.,
∂ V (x̂)
Vx̂ = .
∂ x̂
Now, define the Hamiltonian for the control u(x̂) and the value function V (x̂) as
H (x̂, Vx̂ , u) = Vx̂T h(x̂) + g(x̂)u + Q(x̂) + u TRu.
Then, the optimal value function V ∗ (x̂) is obtained by solving the Hamilton–Jacobi–
Bellman (HJB) equation
min H x̂, Vx̂∗ , u = 0. (7.2.23)
u(x̂)∈A (Ω)
Therefore, the closed-form expression for the optimal control can be derived as
1
u ∗ (x̂) = − R −1 g T(x̂)Vx̂∗ . (7.2.24)
2
Substituting (7.2.24) into (7.2.23), we get the HJB equation as
T 1 T
Vx̂∗ h(x̂) + Q(x̂) − Vx∗ g(x̂)R −1 g T(x̂)Vx̂∗ = 0. (7.2.25)
4
In this sense, one shall find that (7.2.25) is actually a nonlinear partial differential
equation with respect to V ∗ (x̂), which is difficult to solve by analytical methods. To
confront the challenge, an online NN-based optimal control scheme will be devel-
oped. Prior to proceeding further, we provide the following required assumption.
with L 2x̂ the partial derivative of L 2 (x̂) with respect to x̂. Meanwhile, there exists a
symmetric positive definite matrix Λ2 (x̂) ∈ Rn×n defined on Ω such that
L T2x̂ h(x̂) + g(x̂)u ∗ = −L T2x̂ Λ2 (x̂)L 2x̂ . (7.2.26)
is the vector of activation function with σ j (x̂) ∈ C 1 (Ω) and σ j (0) = 0, the set
{σ j (x̂)}1N1 is often selected to be linearly independent, N1 is the number of the neurons,
and ε N1 (x̂) is the NN function reconstruction error. The derivative of V ∗ (x̂) with
respect to x̂ is given as
Vx̂∗ = ∇σ T(x̂)Wc + ∇ε N1(x̂), (7.2.27)
1 1
u ∗ (x̂) = − R −1 g T(x̂)∇σ T Wc − R −1 g T(x̂)∇ε N1 . (7.2.28)
2 2
By the same token, (7.2.25) can be rewritten as
1
WcT ∇σ h(x̂) + Q(x̂) − WcT ∇σ g(x̂)R −1 g T(x̂)∇σ T Wc + εHJB = 0, (7.2.29)
4
where εHJB is the residual error converging to zero when N1 is large enough [1]; that
is, there exists a small constant εa1 > 0 such that εHJB ≤ εa1 .
Since the ideal NN weight vector Wc is often unavailable, (7.2.28) cannot be
implemented in real control process. Hence, we employ a critic NN to approximate
the value function given in (7.2.22) as
where Ŵc is the estimated weight of Wc . The weight estimation error for the critic
NN is defined as
W̃c = Wc − Ŵc . (7.2.31)
1
û(x̂) = − R −1 g T(x̂)∇σ T Ŵc . (7.2.32)
2
The approximate Hamiltonian is derived as
1
H (x̂, Ŵc ) = ŴcT ∇σ h(x̂) + Q(x̂) − ŴcT ∇σ G(x̂)∇σ T Ŵc
4
e1 , (7.2.33)
where
G(x̂) = g(x̂)R −1 g T(x̂).
1
e1 = (Wc − W̃c )T ∇σ h(x̂) + Q(x̂) − (Wc − W̃c )T ∇σ G(x̂)∇σ T(Wc − W̃c )
4
1 T
= Wc ∇σ h(x̂) + Q(x̂) − Wc ∇σ G(x̂)∇σ T Wc
T
4
−εHJB
1 1
− W̃cT ∇σ h(x̂) + W̃cT ∇σ G(x̂)∇σ T Wc − W̃cT ∇σ G(x̂)∇σ T W̃c
2 4
∗ 1 −1 T
= − W̃c ∇σ h(x̂) + W̃c ∇σ g(x̂) − u (x̂) − R g (x̂)∇ε N1
T T
2
1 T
− W̃c ∇σ G(x̂)∇σ W̃c − εHJB
T
4
1 1
= − W̃cT ∇σ ξ(x̂) + G(x̂)∇ε N1 − W̃cT ∇σ G(x̂)∇σ T W̃c − εHJB , (7.2.34)
2 4
η1 ∂E
Ŵ˙ c = − 2
1 + φ φ ∂ Ŵc
T
η1 ∂e1
= − 2 e1
1 + φTφ ∂ Ŵc
φ
= −η1 2 e1 , (7.2.35)
1 + φTφ
where φ = ∇σ h(x̂) + g(x̂)û(x̂) , η1 > 0 is a design constant, and the term (1 +
φ T φ)2 is employed for normalization.
7.2 Online Optimal Control of Partially Unknown Affine Nonlinear Systems 277
However, there exist two issues about the tuning rule (7.2.35):
(i) If the initial control is not admissible, then tuning the critic NN weights based on
(7.2.35) to minimize E = (1/2)e1T e1 may not guarantee the stability of system
(7.2.21) during the critic NN learning process.
(ii) The persistence of excitation (PE) of φ/(1 + φ T φ) is required to guarantee the
weights of the critic NN to converge to actual optimal values
[3, 7, 26, 29, 36, 37]. Nevertheless, the PE condition is often intractable to
verify. In addition, a small exploratory signal is often added to the control input
in order to satisfy the PE condition, which might cause instability of the closed-
loop system during the implementation of the algorithm.
To address the above two issues, a novel weight update law for the critic NN is
developed as follows:
Remark 7.2.3 Several notes about the weight tuning rule (7.2.36) are listed as
follows:
(a) The first term in (7.2.36) shares the same feature as (7.2.35), which aims to
minimize the objective function E = (1/2)e1T e1 .
(b) The second term in (7.2.36) is utilized to relax the PE condition. If there is no
second term in (7.2.36), one shall find Ŵ˙ c = 0 when x̂ = 0. That is, the weight
of the critic NN will not be updated. In this circumstance, the critic NN might not
converge to the optimal weights. Therefore, the PE condition is often employed
[3, 7, 26, 29, 36, 37]. Interestingly, the second term given in (7.2.36) can also
avoid this issue as long as the set {φ̄( j) }1N1 is selected to be linearly independent.
Now, we show this fact by contradiction. Suppose that Ŵ˙ = 0 when x̂ = 0. c
From (7.2.36), we obtain
278 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
N1
φ̄( j) e1 j = 0,
j=1
where
1
e1 j = Ξ (x̂t j ) − ŴcT ∇σ( j) G(x̂t j )∇σ(Tj) Ŵc .
4
˙
Θ = L̇ 2 (x̂) = L T2x̂ x̂(t)
= L T2x̂ h(x̂) + g(x̂)û
1
= L T2x̂ h(x̂) − g(x̂)R −1 g T(x̂)∇σ T Ŵc
2
1
= L 2x̂ h(x̂) − G(x̂)∇σ T Ŵc
T
(7.2.38)
2
Equation (7.2.39) shows the reason that we employ the last term of (7.2.36). In
fact, observing the definition of (x̂, û) given in (7.2.37), we find that if system
(7.2.21) is stable (i.e., Θ < 0), then (x̂, û) = 0 and the last term given in
(7.2.36) disappears. If system (7.2.21) is unstable, then (x̂, û) = 1 and the
last term given in (7.2.36) is activated. Due to the existence of the last term of
(7.2.36), it does not require an initial stabilizing control law for system (7.2.21).
The property shall be shown in the subsequent simulation study.
7.2 Online Optimal Control of Partially Unknown Affine Nonlinear Systems 279
By Remark 7.2.3, the set {φ̄( j) }1N1 should be linearly independent. Nevertheless,
it is not an easy task to directly check this condition. Hence, we introduce a lemma
as follows.
N
Lemma 7.2.1 If the set σ (x̂t j ) 1 1 is linearly independent and û(x̂) stabilizes
system (7.2.21), then the following set
N
∇σ( j) h(x̂t j ) + g(x̂t j )û 1 1
where
φ(x̂t j ) = ∇σ( j) h(x̂t j ) + g(x̂t j )û .
N
By Lemma 7.2.1, we find that if σ (x̂t j ) 1 1 is linearly independent, then {φ̄( j) }1N1 is
also linearly independent. That is, to ensure {φ̄( j) }1N1 to be linearly independent, the
following condition should be satisfied.
Remark 7.2.4 Condition 7.2.1 can be satisfied by selecting and recording data dur-
ing the learning process of NNs over a finite time interval. Compared with the PE
condition, a clear advantage of Condition 7.2.1 is that it can easily be checked online
[5]. In addition, Condition 7.2.1 makes full use of history data, which can improve the
speed of convergence of parameters. This feature will be also shown in the simulation
study.
By the definition of φ given in (7.2.35) and using (7.2.28), (7.2.31) and (7.2.32),
we have
φ = ∇σ (h(x̂) + g(x̂)û(x̂))
1
= ∇σ h(x̂) − g(x̂)R −1 g T(x̂)σ T (Wc − W̃c )
2
1 1
= ∇σ ξ(x̂) + G(x̂)∇ε N1 + ∇σ G(x̂)∇σ T W̃c , (7.2.40)
2 2
where ξ(x̂) is given in (7.2.34). From (7.2.31), (7.2.34), (7.2.36), and (7.2.40), we
obtain
280 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
N1
η1 1
− 2
∇σ( j) ζ (x̂t j ) + G(x̂t j )W̃c W̃cT ∇σ( j) ζ (x̂t j )
j=1
m sj 2
1 η1
+ W̃cT G(x̂t j )W̃c + εHJB − (x̂, û)∇σ G(x̂)L 2x̂ , (7.2.41)
4 2
where
1
ζ (x̂) = ξ(x̂) + G(x̂)∇ε N1, G(x̂) = ∇σ G(x̂)∇σ T, G(x̂t j) = ∇σ( j) G(x̂t j )∇σ(Tj) .
2
The schematic diagram of the present control algorithm is shown in Fig. 7.1.
We start with an assumption and then present the stability analysis result.
Assumption 7.2.6 There exist known constants bσ > 0 and εb1 > 0 such that
∇σ (x̂) < bσ and ∇ε N1(x̂) < εb1 for every x̂ ∈ Ω.
Theorem 7.2.2 Given the input-affine nonlinear dynamics described by (7.2.1) with
associated HJB equation (7.2.25), let Assumptions 7.2.1–7.2.6 hold and take the
control input for system (7.2.1) as in (7.2.32). Moreover, let weight update laws of
the identifier NN be (7.2.8) and (7.2.9), and let weight tuning rule for the critic NN
be (7.2.36). Then, the identification error x̃(t), and the weight estimation errors W̃m ,
Ỹm , and W̃c are all UUB.
1
L(x) = L 1 (x) + L 2 (x̂) + W̃cT η1−1 W̃c , (7.2.42)
2
where ς is given in (7.2.20). Using (7.2.41), the last term of (7.2.43) is developed as
N1
1 1
N2 = − 2
W̃cT ∇σ( j) ζ (x̂t j ) + W̃cT Ḡ(x̂t j )W̃c
j=1
ms j 2
1
× W̃cT ∇σ( j) ζ (x̂t j ) + W̃cT Ḡ(x̂t j )W̃c + εHJB .
4
Now, we consider the first term N1 . We have
7.2 Online Optimal Control of Partially Unknown Affine Nonlinear Systems 283
1 T 2 1 2
N1 = − 2
W̃c ∇σ ζ (x̂) + W̃cT Ḡ(x̂)W̃c
ms 8
3 T T
+ W̃c ∇σ ζ (x̂) W̃c Ḡ(x̂)W̃c
4
1
+ W̃cT ∇σ L(x̂)εHJB + W̃cT Ḡ(x̂)W̃c εHJB . (7.2.45)
2
Applying (7.2.18) (with ν = 1) to the last three terms of (7.2.45), it follows
1 1 1 2
N1 = − 2
3W̃cT ∇σ ζ (x̂) + W̃cT Ḡ(x̂)W̃c
ms 2 4
1 T 2 1 1 T 2
+ W̃c ∇σ ζ (x̂) + εHJB + W̃ Ḡ(x̂)W̃c + 2εHJB
2 2 4 c
2 1 T 2 5 2
− 4 W̃cT ∇σ ζ (x̂) + W̃c Ḡ(x̂)W̃c − εHJB
16 2
1 1 2 2 5 2
≤− 2 W̃cT Ḡ(x̂)W̃c − 4 W̃cT ∇σ ζ (x̂) − εHJB . (7.2.46)
m s 16 2
Similarly, we have
N1
1 1 T 2 T 2 5 2
N2 ≤ − W̃ Ḡ( x̂ t ) W̃ c − 4 W̃ ∇σ( j) ζ ( x̂ t ) − εHJB . (7.2.47)
j=1
m 2s j 16 c j c j
2
m s = 1 + φ Tφ
1 1
τ0 ≤ = ≤1
m 2s (1 + φ Tφ)2
and
1 1
τ0 ≤ = ≤ 1.
m 2s j (1 + φ(xt j )Tφ(xt j ))2
N1
1 T 2 1 2
+4 W̃ ∇σ( j) ζ (x̂t j ) + 2 W̃cT ∇σ ζ (x̂)
j=1
m 2s j c ms
284 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
N1
5 1 1 1
+ + εHJB
2
− W̃cT (x̂, û)∇σ G(x̂)L 2x̂
2 m 2s j=1
m 2
sj 2
N1
τ0
≤− μ Ḡ(x̂t j ) + μinf Ḡ(x̂) W̃c 4
2 2
16 j=1 inf
N1
+ 4bσ2 ϑsup
2
ζ (x̂t j ) + ϑsup ζ (x̂) W̃c 2
2
j=1
5 1
+ (N1 + 1)εa21 − W̃cT (x̂, û)∇σ G(x̂)L 2x̂ , (7.2.48)
2 2
where μinf (Y ) denotes the lower bound of Y Y = Ḡ(x̂t j ), Ḡ(x̂) , ϑsup (Z ) repre-
sents the upper bound of Z Z = ζ (x̂t j ), ζ (x̂) , N1 is the number of nodes in the
hidden layer of the critic NN, and εHJB ≤ εa1 as assumed earlier in (7.2.29).
Combining (7.2.43) and (7.2.48), we derive
1 1
L̇(x) ≤ L T2x̂ h(x̂) − G(x̂)∇σ T Ŵc − W̃cT (x̂, û)∇σ G(x̂)L 2x̂
2 2
T1 θ ς 2 ς2
− W̃c 4 + 4T2 W̃c 2 − x̃ − +
16 2 θ 2θ
5
+ (N1 + 1)εa1 , 2
(7.2.49)
2
where
N1
T1 = μ2inf Ḡ(x̂) + μ2inf Ḡ(x̂t j ) τ0 ,
j=1
and
N1
T2 = bσ2 ϑsup
2
ζ (x̂) + bσ2 ϑsup
2
ζ (x̂t j ) .
j=1
In view of the definition of (x̂, û) in (7.2.37), we divide (7.2.49) into the fol-
lowing two cases for discussion.
Case 1: (x̂, û) = 0. In this case, we can derive that the first term in (7.2.49) is
negative. Observing the fact that L T2x̂ x̂˙ < 0 and by using the dense property of R
[28], we can conclude that there exists a constant τ > 0 such that
θ ς 2
L̇(x) ≤ − τ L 2x̂ − x̃ −
2 θ
2
T1 32T 2 64T22
− W̃c 2 − +
16 T1 T1
1
+ ς 2 + 5θ (N1 + 1)εa21 . (7.2.50)
2θ
Hence, (7.2.50) yields L̇(x) < 0 as long as one of the following conditions holds:
or
8T1 ς 2 /θ + 5(N1 + 1)εa21 + 1024T22
32T2
W̃c > + . (7.2.53)
T1 T1
Case 2: (x̂, û) = 1. By the definition of (x̂, û) given in (7.2.37), we find that,
in this case, the first term in (7.2.49) is nonnegative which implies that the control
(7.2.32) may not stabilize system (7.2.21). Then, using (7.2.29) and (7.2.34), (7.2.49)
becomes
1 θ ς 2 ς2
L̇(x) ≤ L 2x̂ ξ(x̂) + G(x̂)∇ε N1 −
T
x̃ − +
2 2 θ 2θ
2
T1 2 32T2 64T22
5
− W̃c − + + (N1 + 1)εa21 . (7.2.54)
16 T1 T1 2
or !
2d ς
x̃ > + , (7.2.57)
θ θ
or
8T
W̃c > 2 2
+
d
, (7.2.58)
T1 T1
where
εb21 ϑsup
2
G(x̂) 64T22 1 2
d= + + ς + 5θ (N1 + 1)εa21 .
16λmin Λ2 (x̂) T1 2θ
Combining Case 1 and Case 2 and using the standard Lyapunov’s extension the-
orem [12, 13] (or the Lagrange stability result [22]), one can conclude that the state
identification error x̃(t) and NN weight estimation errors W̃m , Ỹm , and W̃c are all
UUB. This completes the proof.
Remark 7.2.5 The uniform ultimate boundedness of W̃m and Ỹm is obtained as fol-
lows: Inequalities (7.2.51)–(7.2.53) (or (7.2.56)–(7.2.58)) guarantee L̇(x) < 0. Then,
we can conclude that L(x(t)) is the strictly decreasing function with respect to t
(t ≥ 0). Hence, we can derive L(x(t)) < L(x(0)), where L(x(0)) is a bounded
positive constant. Using L(x) defined as in (7.2.42), we have 21 tr W̃mT l1−1 W̃m +
1
2
tr ỸmT l2−1 Ỹm < L(x(0)). Using the definition of Frobenius matrix norm [9, 16],
we obtain that W̃m and Ỹm are UUB.
where
" # " #
−x1 + x2 0
f (x) = , g(x) = .
−0.5x1 − 0.5x2 + 0.5x2 [cos(2x1 ) + 2]2 cos(2x1 ) + 2
The value function is defined as in (7.2.2), where Q and R are chosen as identity
matrices of appropriate dimensions. The prior knowledge of f (x) is assumed to be
unavailable. To obtain the knowledge of system dynamics, an identifier NN given in
(7.2.6) is employed. The gains for the identifier NN are selected as
7.2 Online Optimal Control of Partially Unknown Affine Nonlinear Systems 287
" # " #
−1 1 1 0
A= , C= ,
−0.5 −0.5 −0.5 0
l1 = 20, l2 = 10, τ1 = 6.1, τ2 = 15,
N0 = 8,
and the gain for the critic NN is selected as η1 = 2.5. The activation function of the
critic NN is chosen with N1 = 3 neurons with σ (x) = [x12 , x22 , x1 x2 ]T , and the critic
NN weight is denoted as Ŵc = [Ŵc1 , Ŵc2 , Ŵc3 ]T .
Remark 7.2.6 It is significant to emphasize that the number of neurons required for
any particular application is still an open problem. Selecting the proper number of
neurons for NNs is more of an art than science [8, 27]. In this example, the number
of neurons is obtained by computer simulations. We find that selecting eight neurons
in the hidden layer for the identifier NN can lead to satisfactory simulation results.
Meanwhile, in order to compare our algorithm with the algorithms proposed in [3,
29], we choose three neurons in the hidden layer for the critic NN, and the simulation
results are satisfied.
The initial weights Ŵm and Ŷm for the identifier NN are selected randomly within
an interval of [−10, 10] and [−5, 5], respectively. Meanwhile, the initial weights
of the critic NN are chosen to be zeros, and the initial system state is set to x0 =
[3.5, −3.5]T. In this circumstance, the initial control cannot stabilize system (7.2.59).
In the present algorithm, no initial stabilizing control is required. Moreover, by using
the method proposed in [5, 6], the recorded data can easily be made qualified for
Condition 7.2.1.
x̃1 (t)
0.5
x̃2 (t)
System identification errors
−0.5
−1
−1.5
−2
0 3 6 9 12 15
Time (s)
Fig. 7.2 System identification errors x̃1 (t) and x̃2 (t)
288 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
Computer simulation results are shown in Figs. 7.3, 7.4, 7.5, 7.6, 7.7 and 7.8.
Figure 7.2 illustrates the performance of the system identification errors x̃1 (t) and
x̃2 (t). Figure 7.3 shows the trajectories of system states x(t). Figure 7.4 shows the
Frobenius norm of the weights Ŵm and Ŷm of the identifier NN. Figure 7.5 shows
the convergence of the critic NN weights. Figure 7.6 shows the control u. Figure 7.7
illustrates the system states without considering the third term in (7.2.36), where the
system becomes unstable.
To compare with [7], we use Fig. 7.8 to show the system states with the algorithm
proposed in [7]. It should be mentioned that the PE condition is necessary in [7].
To guarantee the PE condition, a small exploratory signal N (t) = sin5 (t) cos(t) +
sin5 (2t) cos(0.1t) is added to the control u for the first 9 s. That is, Fig. 7.8 is obtained
with the exploratory signal added to the control input during the first 9 s. In addition,
it is worth pointing out that by using the methods proposed in [3] and [29] and
employing the small exploratory signal N (t), one can also get stable system states.
We omit the simulation results here since the results are similar to Fig. 7.8 [3, 29].
It is quite straightforward to notice that the trajectories of system states given in [3,
29] share common feature with Fig. 7.8, which is oscillatory before it converges to
the equilibrium point. This is caused by the PE exploratory signal.
Several notes about the simulation results are listed as follows.
• From Fig. 7.2, it is observed that the identifier NN can approximate the real system
very fast and well.
• From Fig. 7.3, one can observe that there are almost no oscillations of system state.
As aforementioned, the PE signal always leads to oscillations of the system states
x1 (t)
2 x2 (t)
Evolution of system states
−2
−4
−6
0 3 6 9 12 15
Time (s)
15
Ŷm
9
0
0 3 6 9 12 15
Time (s)
1
Convergence of critic NN weights
Ŵc1
0.75
Ŵc2
0.5 Ŵc3
0.25
−0.25
0 3 6 9 12 15
Time (s)
12
Control input u 9
−3
0 3 6 9 12 15
Time (s)
200
0
System states
−200
−400
x1 (t)
−600
x2 (t)
−800
0 0.8 1.6 2.4 3.2
Time (s)
Fig. 7.7 Trajectories of states without considering the third term in (7.2.36)
7.2 Online Optimal Control of Partially Unknown Affine Nonlinear Systems 291
x1 (t)
6
x2 (t)
3
System states
−3
−6
0 3 6 9 12 15
Time (s)
(see Fig. 7.8 and the simulation results presented in [3, 7, 29]). This verifies that
the restrictive PE condition is relaxed by using recorded and instantaneous data
simultaneously. Hence, an advantage of the present algorithm as compared to the
methods in [3, 7, 29] lies in that the PE condition is relaxed.
• From Figs. 7.4 and 7.5, we can find that the identifier NN and the critic NN are
tuned simultaneously. Meanwhile, the estimated weights of the identifier NN and
the critic NN are all guaranteed to be UUB.
• From Figs. 7.5 and 7.6, one can find that the initial weights of the critic NN and the
initial control are all zeros. In this circumstance, the initial control cannot stabilize
system (7.2.59). Nevertheless, the initial control must stabilize the system in [3,
29]. Therefore, in comparison with the methods proposed in [3, 29], a distinct
advantage of the algorithm developed in this section lies in that the initial stabilizing
control is not required any more.
• From Fig. 7.7, one shall find that without the third term in (7.2.36), the system is
unstable during the learning process of the critic NN.
where Q(x) is continuously differentiable and positive definite, i.e., Q(x) ∈ C 1 (Ω),
Q(x) > 0 for all x = 0 and x = 0 ⇔ Q(x) = 0, and Y (u) is positive definite. To
confront bounded controls in system (7.3.1) and inspired by the work of [1, 20, 24],
we define Y (u) as
u m
ui
−T
Y (u) = 2τ Ψ (υ/τ )dυ = 2τ ψ −1 (υi /τ )dυi ,
0 i=1 0
T
where Ψ −1 (υ/τ ) = ψ −1 (υ1 /τ ), ψ −1 (υ2 /τ ), . . . , ψ −1 (υm /τ ) , Ψ ∈ Rm , and
Ψ −T denotes (Ψ −1 )T . Meanwhile, ψ(·) is a strictly monotonic odd function satisfying
|ψ(·)| < 1 and belonging to C p ( p ≥ 1) and L2 (Ω). It is significant to state that
Y (u) is positive definite since ψ −1 (·) is a monotonic odd function. Without loss of
generality, in this section, we choose ψ(·) = tanh(·).
Given a control u(x) ∈ A (Ω), if the associated value function V (x) ∈ C 1 (Ω),
the infinitesimal version of (7.3.2) is the so-called Lyapunov equation
u
VxT f (x) + g(x)u + Q(x) + 2τ tanh−T(υ/τ )dυ = 0,
0
where Vx ∈ Rn denotes the partial derivative of V (x) with respect to x. Define the
Hamiltonian for the control u(x) ∈ A (Ω) and the associated value function V (x)
as u
H (x, Vx , u) = VxT f (x) + g(x)u + Q(x) + 2τ tanh−T(υ/τ )dυ.
0
The optimal value V ∗ (x) can be obtained by solving the HJB equation
1 T
u ∗ (x) = −τ tanh g (x)Vx∗ . (7.3.4)
2τ
Substituting (7.3.4) into (7.3.3), the HJB equation for constrained nonlinear systems
becomes
T
Vx∗ f (x) − 2τ 2 CT(x) tanh(C(x)) + Q(x)
−τ tanh(C(x))
+ 2τ tanh−T(υ/τ )dυ = 0, (7.3.5)
0
1 T
where C(x) = g (x)Vx∗ . Using the integration formulas of inverse hyperbolic
2τ
functions, we note that
−τ tanh(C(x))
2τ tanh−T(υ/τ )dυ
0
m
−τ tanh(Ci (x))
= 2τ tanh−T(υi /τ )dυi
i=1 0
m
= 2τ C (x) tanh(C(x)) + τ
2 T 2
ln 1 − tanh2 (Ci (x)) ,
i=1
m
Vx∗ T f (x) + Q(x) + τ 2 ln 1 − tanh2 (Ci (x)) = 0. (7.3.6)
i=1
with L 3x the partial derivative of L 3 (x) with respect to x. Meanwhile, there exists a
positive-definite matrix Λ2 (x) ∈ Rn×n defined on Ω such that
L T3x f (x) + g(x)u ∗ = −L T3x Λ2 (x)L 3x . (7.3.7)
294 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
In this section, we design an online optimal control scheme using a single critic NN.
According to [8, 10], the optimal value V ∗ (x) can be represented by a single-layer
NN on a compact set Ω as
1 T
u ∗ (x) = −τ tanh g (x)∇σ T Wc + εu ∗ , (7.3.9)
2τ
1
where εu ∗ = − 1 − tanh2 (χ ) g T(x)∇ε N2 (x), χ ∈ Rm is chosen between
2
1 T 1 T T
g (x)∇σ T Wc and g (x) ∇σ Wc + ∇ε N2 (x) ,
2τ 2τ
m
WcT ∇σ f (x) + Q(x) + τ 2 ln 1 − tanh2 (B1i (x)) + εHJB = 0, (7.3.10)
i=1
where
1 T
B1 (x) = g (x)∇σ T Wc ,
2τ
and B1 (x) = (B11 (x), . . . , B1m (x))T with B1i (x) ∈ R, i = 1, . . . , m, and εHJB is
the HJB approximation error [1].
Remark 7.3.1 It was shown in [1] that there exists a small constant εh > 0 and a
positive integer N (depending only on εh ) such that N2 > N implies εHJB ≤ εh .
Since the ideal critic NN weight matrix Wc is typically unknown, (7.3.9) cannot
be implemented in the real control process. Therefore, we employ a critic NN to
7.3 Online Optimal Control of Affine Nonlinear Systems with Constrained Inputs 295
where Ŵc is the estimate of Wc . The weight estimation error for the critic NN is
defined as
W̃c = Wc − Ŵc . (7.3.12)
1 T
û(x) = −τ tanh g (x)∇σ T Ŵc . (7.3.13)
2τ
From (7.3.11) and (7.3.13), we derive the approximate Hamiltonian as
m
H (x, Ŵc ) = ŴcT ∇σ f (x) + Q(x) + τ 2 ln 1 − tanh2 (B2i (x)) e2 , (7.3.14)
i=1
where
1 T
B2 (x) = g (x)∇σ T Ŵc ,
2τ
and B2 (x) = (B21 (x), . . . , B2m (x))T with B2i (x) ∈ R, i = 1, . . . , m. Subtracting
(7.3.10) from (7.3.14) and using (7.3.12), we obtain
m
e2 = −W̃cT ∇σ f (x) + τ 2 Γ (B2i (x)) − Γ (B1i (x)) − εHJB , (7.3.15)
i=1
That is,
Note that
m
Γ (Bi (x)) = m ln 4 − 2BT (x)sgn(B (x))
i=1
m
−2 ln 1 + exp(−2Bi (x)sgn(Bi (x))) . (7.3.16)
i=1
where
m
1 + exp[−2B1i (x)sgn(B1i (x))]
ΔB = 2 ln ,
i=1
1 + exp[−2B2i (x)sgn(B2i (x))]
D1 (x) = τ WcT ∇σ g(x)[sgn(B1 (x)) − sgn(B2 (x))] + τ 2 ΔB − εHJB .
where φ̄ = φ/m 2s , m s = 1 + φ T φ,
Remark 7.3.3 The first term in (7.3.18) is used to minimize E = (1/2)e2T e2 and
is obtained by utilizing the normalized gradient descent algorithm. The rest are
employed to ensure the stability of the optimal closed-loop system while the critic
NN learns the optimal weights. Σ(x, û) is designed based on Lyapunov’s condition
for stability. Observing the expression of Σ(x, û) and noticing that L 3 (x) is the
Lyapunov function candidate for system (7.3.1) defined in Assumption 7.3.1, if
system (7.3.1) is stable (i.e., L T3x f (x)−τ g(x) tanh(B2 (x)) < 0), then Σ(x, û) = 0
and the second term given in (7.3.18) disappears. If system (7.3.1) is unstable, then
Σ(x, û) = 1 and the second term given in (7.3.18) is activated. Due to this property
of Σ(x, û), it does not require an initial stabilizing control law for system (7.3.1).
Remark 7.3.4 From the expressions of (7.3.14), one shall find that x = 0 gives rise
to H (x, Ŵc ) = e2 = 0. If F2 = F1 ϕ T in (7.3.18), then Ŵ˙ c = 0. In this sense,
the critic NN will no longer be updated. However, the optimal control might not
have been obtained at finite time t0 such that x(t0 ) = 0. To avoid this circumstance
from occurring, a small exploratory signal will be needed, i.e., the PE condition is
required. It is worth pointing out that by using the method presented in Sect. 7.2, the
PE condition can be removed. However, due to the purpose of this section, we do
not employ the approach of Sect. 7.2.
ϕ
W̃˙ c = η2 − W̃cT φ + τ W̃cT ∇σ g(x)N(x) + D1 (x)
ms
η2
− Σ(x, û)∇σ g(x) [Im − N (B2 (x))] g T(x)L 3x
2
" #
ϕT
+ η2 τ ∇σ g(x)N(x) Ŵc + F2 Ŵc − F1 ϕ T Ŵc . (7.3.21)
ms
1
L(x) = L 3 (x) + W̃cTη2−1 W̃c , (7.3.22)
2
where L 3 (x) is defined as in Assumption 7.3.1.
Taking the time derivative of (7.3.22) using Assumption 7.3.1 and (7.3.13), we
have
L̇(x) = L T3x ( f (x) − τ g(x) tanh(B2 (x))) + W̃˙ cT η2−1 W̃c . (7.3.23)
ϕT
W̃˙ cT η2−1 W̃c = − W̃cT φ + τ W̃cT ∇σ g(x)N(x) + D1 (x) W̃c
ms
1
− Σ(x, û)L T3x g(x) Im − N (B2 (x)) g T(x)∇σ T W̃c
2
ϕT
+ τ W̃cT ∇σ g(x)N(x) Ŵc + W̃cT F2 Ŵc − F1 ϕ T Ŵc
ms
= − W̃c ϕϕ W̃c + D̄1 (x)ϕ T W̃c + W̃cT D̄2 (x)
T T
7.3 Online Optimal Control of Affine Nonlinear Systems with Constrained Inputs 299
1
− Σ(x, û)L T3x g(x) Im − N (B2 (x)) g T(x)∇σ T W̃c
2
+ W̃cT F2 Ŵc − F1 ϕ T Ŵc , (7.3.24)
D1 (x) ϕT
where D̄1 (x) = and D̄2 (x) = τ ∇σ g(x)N(x) Wc .
ms ms
Observe that
W̃cT F2 Ŵc − F1 ϕ T Ŵc = W̃cT F2 Wc − W̃cT F2 W̃c − W̃cT F1 ϕ T Wc + W̃cT F1 ϕ T W̃c .
Denote Z T = W̃cTϕ, W̃cT . Then, (7.3.24) can be developed as
Combining (7.3.23) with (7.3.25) and selecting F1 and F2 such that K is positive
definite, we obtain
Therefore, (7.3.27) implies L̇(x) < 0 as long as one of the following conditions
holds:
ϑM
2
ϑM
L 3x > , or Z > . (7.3.28)
4τ λmin (K ) λmin (K )
300 7 Online Optimal Control of Continuous-Time Affine Nonlinear Systems
Since
a 1
≤ , ∀a,
(1 + a) 2 4
φTφ
ϕ2 =
(1 + φ T φ)2
1 %
and thus ϕ ≤ . Noticing that Z ≤ 1 + ϕ2 W̃c , we obtain
2
√
5
Z ≤ W̃c .
2
Then, from (7.3.28), we have
2ϑ M
W̃c > √ .
5λmin (K )
Case 2: Σ(x, û) = 1. In this case, the first term given in (7.3.26) is nonnega-
tive which implies that the control (7.3.13) may not stabilize system (7.3.1). Then,
(7.3.26) becomes
Denote B(Bi (x)) = tanh(Bi (x)), i = 1, 2. By using the Taylor series, we have
1
tanh(B2 (x)) + [Im − N (B2 (x))] g T(x)∇σ T W̃c
2τ
= tanh(B1 (x)) − O (B1 (x) − B2 (x))2 . (7.3.31)
7.3 Online Optimal Control of Affine Nonlinear Systems with Constrained Inputs 301
Substituting (7.3.31) into (7.3.29) and adding and subtracting L T1x g(x)εu ∗ to the
right-hand side of (7.3.29), we derive
L̇(x) ≤ L T3x f (x) + g(x)u ∗ − L T3x g(x)εu ∗
+ τ L T3x g(x)O (B1 (x) − B2 (x))2
2
ϑM ϑM2
− λmin (K ) Z − + . (7.3.32)
2λmin (K ) 4λmin (K )
Hence, (7.3.33) yields L̇(x) < 0 as long as one of the following conditions holds:
!
ρM μ0
L 3x > + ,
2λmin (Λ2 (x)) λmin (Λ2 (x))
or !
ϑM μ0
Z > + . (7.3.34)
2λmin (K ) λmin (K )
√
5
Observe that Z ≤ W̃c . Then, from (7.3.34), we have
2
!
ϑM μ0
W̃c > √ +2 .
5λmin (K ) 5λ min (K )
Combining Case 1 and Case 2 and by the standard Lyapunov’s extension theorem
[12, 13] (or the Lagrange stability result [22]), we conclude that the function L 3x
and the weight estimation error W̃c are UUB. This completes the proof.
where
" #
−x1 + x2
f (x) = ,
−0.5x1 − 0.5x2 + 0.5x2 [cos(2x1 ) + 2]2
" #
0
g(x) = .
cos(2x1 ) + 2
The objective is to control the system with control limits of |u| ≤ 1.2. The value
function is given by
∞ u
−T
V (x(0)) = Q(x) + 2τ tanh (υ/τ )dυ dt,
0 0
where
Q(x) = x12 + x22 .
The gains of the critic NN are chosen as η2 = 38, τ = 1.2, N2 = 8. The activation
function of the critic NN is chosen as
T
σ (x) = x12 , x22 , x1 x2 , x14 , x24 , x13 x2 , x12 x22 , x1 x23 ,
The initial weights of the critic NN are chosen as zeros, and the initial system state is
selected as x0 = [3, −0.5]T . It is significant to point out that under this circumstance,
the initial control cannot stabilize system (7.3.35). That is, no initial stabilizing
control is required for the implementation of this algorithm. To guarantee the PE
condition, a small exploratory signal
2.4 x1 (t)
x2 (t)
1.8
System state x(t)
1.2
0.6
−0.6
−1.2
0 4 8 12 16 20
Time (s)
4
Convergence of critic NN weights
−1
−2
−3
0 4 8 12 16 20
Time (s)
1000
800
Optimal value V*
600
400
200
0
4
2 4
0 2
x2 0
−2 −2 x1
−4 −4
1.8
1.2
1.2 1
Control input u
0.6 0.8
0.2 0.4
−0.6
−1.2
0 4 8 12 16 20
Time (s)
1.8
Upper bound
1.2
0.6
Control input u
−0.6
−1.2
Lower bound
−1.8
0 4 8 12 16 20
Time (s)
of the exploratory signal. Figure 7.10 shows the convergence of critic NN weights.
Figure 7.11 shows the optimal value function. Figure 7.12 illustrates the optimal
controller with control constraints. In order to make comparisons, we use Fig. 7.13
to show the controller designed without considering control constraints. The actuator
saturation actually exists; therefore, the control input is limited to the bounded value
when it overruns the saturation bound. From Figs. 7.9–7.13, it is observed that the
optimal control can be obtained using a single NN. Meanwhile, the system state and
the estimated weights of the critic NN are all guaranteed to be UUB, while keeping
the closed-loop system stable. Moreover, comparing Fig. 7.12 with Fig. 7.13, one can
find that the restriction of control constraints has been overcome.
7.4 Conclusions
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Beard RW, Saridis GN, Wen JT (1997) Galerkin approximations of the generalized Hamilton-
Jacobi-Bellman equation. Automatica 33(12):2159–2177
3. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
4. Bryson AE, Ho YC (1975) Applied optimal control: optimization, estimation and control. CRC
Press, Boca Raton
5. Chowdhary G (2010) Concurrent learning for convergence in adaptive control without persis-
tency of excitation. Ph.D. Thesis, Georgia Institute of Technology, USA
6. Chowdhary G, Johnson E (2011) A singular value maximizing data recording algorithm for
concurrent learning. In: Proceedings of the American control conference. pp 3547–3552
7. Dierks T, Jagannathan S (2010) Optimal control of affine nonlinear continuous-time systems.
In: Proceedings of the American control conference. pp 1568–1573
8. Haykin S (2009) Neural networks and learning machines, 3rd edn. Prentice-Hall, Upper Saddle
River
9. Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press, New York
10. Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping
and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
11. Khalil HK (2001) Nonlinear systems. Prentice-Hall, Upper Saddle River
12. LaSalle JP, Lefschetz S (1967) Stability by Liapunov’s direct method with applications. Aca-
demic Press, New York
13. Lewis FL, Jagannathan S, Yesildirak A (1999) Neural network control of robot manipulators
and nonlinear systems. Taylor & Francis, London
14. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control. Wiley, Hoboken
15. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag
32(6):76–105
16. Lewis FL, Yesildirek A, Liu K (1996) Multilayer neural-net robot controller with guaranteed
tracking performance. IEEE Trans Neural Netw 7(2):388–399
17. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
18. Liu D, Wang D, Wang FY, Li H, Yang X (2014) Neural-network-based online HJB solution for
optimal robust guaranteed cost control of continuous-time uncertain nonlinear systems. IEEE
Trans Cybern 44(12):2834–2847
19. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
20. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
21. Lyshevski SE (1998) Optimal control of nonlinear continuous-time systems: design of bounded
controllers via generalized nonquadratic functionals. In: Proceedings of the American Control
Conference. pp 205–209
22. Michel AN, Hou L, Liu D (2015) Stability of dynamical systems: on the role of monotonic
and non-monotonic Lyapunov functions. Birkhäuser, Boston
23. Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown
constrained-input systems using integral reinforcement learning. Automatica 50(7):1780–1792
24. Modares H, Lewis F, Naghibi-Sistani MB (2014) Online solution of nonquadratic two-player
zero-sum games arising in the H∞ control of constrained input systems. Int J Adapt Control
Signal Process 28(3–5):232–254
References 307
25. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
26. Nodland D, Zargarzadeh H, Jagannathan S (2013) Neural network-based optimal adaptive
output feedback control of a helicopter UAV. IEEE Trans Neural Netw Learn Syst 24(7):1061–
1073
27. Padhi R, Unnikrishnan N, Wang X, Balakrishnan S (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
28. Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, New York
29. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
30. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
31. Yang X, Liu D, Huang Y (2013) Neural-network-based online optimal control for uncertain non-
linear continuous-time systems with control constraints. IET Control Theory Appl 7(17):2037–
2047
32. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of unknown
continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–566
33. Yang X, Liu D, Wang D, Wei Q (2014) Discrete-time online learning control for a class of
unknown nonaffine nonlinear systems using reinforcement learning. Neural Netw 55:30–41
34. Yang X, Liu D, Wei Q (2014) Online approximate optimal control for affine non-linear systems
with unknown internal dynamics using adaptive dynamic programming. IET Control Theory
Appl 8(16):1676–1688
35. Yu W (2009) Recent advances in intelligent control systems. Springer, London
36. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
37. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
38. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
39. Zhang H, Qin C, Luo Y (2014) Neural-network-based constrained optimal control scheme for
discrete-time switched nonlinear system using dual heuristic programming. IEEE Trans Autom
Sci Eng 11(3):839–849
40. Zhong X, He H, Zhang H, Wang Z (2014) Optimal control for unknown discrete-time nonlinear
markov jump systems using adaptive dynamic programming. IEEE Trans Neural Netw Learn
Syst 25(12):2141–2155
Chapter 8
Optimal Control of Unknown
Continuous-Time Nonaffine Nonlinear
Systems
8.1 Introduction
The output of nonaffine nonlinear systems depends nonlinearly on the control signal,
which is significantly different from affine nonlinear systems. Due to this feature,
control approaches of affine nonlinear systems do not always hold for nonaffine non-
linear systems. Therefore, it is necessary to develop control methods for nonaffine
nonlinear systems. Recently, optimal control problems for nonaffine nonlinear sys-
tems have attracted considerable attention. Many remarkable approaches have been
proposed to tackle the problem [6, 19, 29, 32].
In this chapter, we present two novel control algorithms based on ADP methods
to obtain the optimal control of continuous-time nonaffine nonlinear systems with
completely unknown dynamics. First, we construct an ADP-based identifier–actor–
critic architecture to achieve the approximate optimal control of continuous-time
unknown nonaffine nonlinear systems [28]. The identifier is constructed by a dynamic
neural network, which transforms nonaffine nonlinear systems into a kind of affine
nonlinear systems. Actor–critic dual networks are employed to derive the optimal
control for the newly formulated affine nonlinear systems. Then, we develop a novel
ADP-based observer–critic architecture to obtain the approximate optimal control
of nonaffine nonlinear systems with unknown dynamics [19]. The present observer
is composed of a three-layer feedforward neural network, which aims to get the
knowledge of system states. Meanwhile, a single critic neural network is employed
for estimating the performance of the systems as well as for constructing the optimal
control signal.
where x(t) ∈ Rn is the state vector available for measurement and u(t) ∈ U ⊂ Rm
is the control vector, and U = {u ∈ Rm : |u i | ≤ τ, i = 1, . . . , m} with the saturation
bound τ > 0. F(x, u) is an unknown nonaffine nonlinear function. We assume that
F(x, u) is Lipschitz continuous on Ω containing the origin, such that the solution
x(t) of system (8.2.1) is unique for arbitrarily given initial state x0 ∈ Ω and control
u ∈ U, and F(0, 0) = 0.
The value function for system (8.2.1) is defined by
∞
V (x(t)) = Q(x(s)) + Y (u(s)) ds, (8.2.2)
t
where Q(x) is continuously differentiable and positive definite, i.e., Q(x) ∈ C 1 (Ω),
Q(x) > 0 for all x = 0 and x = 0 ⇔ Q(x) = 0, and Y (u) is defined as
u T m ui
Y (u) = 2 τ Ψ −1 (υ/τ ) Rdυ 2τ ri ψ −1 (υi /τ )dυi ,
0 i=1 0
T
where Ψ −1 (υ/τ ) = ψ −1 (υ1 /τ ), ψ −1 (υ2 /τ ), . . . , ψ −1 (υm /τ ) ∈ Rm , and Ψ −T
denotes (Ψ −1 )T . Meanwhile, ψ(·) is a strictly monotonic odd function satisfying
|ψ(·)| < 1 and belonging to C p ( p ≥ 1) and L2 (Ω). R = diag{r1 , . . . , rm } with
ri > 0, i = 1, . . . , m. It is necessary to state that Y (u) is positive definite since
ψ −1 (·) is a monotonic odd function and R is positive definite. Without loss of gen-
erality, we choose ψ(·) = tanh(·) and R is assumed to be the m × m identity matrix,
i.e., R = Im .
If the value function V (x) ∈ C 1 , by taking the time derivative of both sides of
(8.2.2) and moving the terms on the right-hand side to the left, we have
u
VxT F(x, u) + Q(x) + 2τ tanh−T(υ/τ )dυ = 0, (8.2.3)
0
Then, the optimal cost V ∗ (x) can be obtained by solving the Hamilton–Jacobi–
Bellman (HJB) equation
where f (x(t)) ∈ Rn and g(x(t)) ∈ Rn×m . Then, using (8.2.4) and (8.2.5), the closed-
form expression for constrained optimal control is derived as
1
u ∗ (x) = −τ tanh g T(x)Vx∗ . (8.2.6)
2τ
However, for continuous-time nonaffine nonlinear system (8.2.1), the optimal
control u ∗ (x) cannot be obtained as given in (8.2.6) using (8.2.4) and (8.2.5). The
main reason is that
∂ H (x, Vx , u)/∂u = 0
φ(ζ1 ) − φ(ζ2 ) ≤ κ1 ζ1 − ζ2 ,
ρ(ζ1 ) − ρ(ζ2 ) ≤ κ2 ζ1 − ζ2 . (8.2.8)
312 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
Remark 8.2.1 Though (8.2.7) shares the same feature as in [30], the matrix A is
assumed to be unknown in (8.2.7), rather than a stable matrix as in [30]. Therefore,
(8.2.7) can be considered as a more general case to estimate system (8.2.1) using
dynamic NNs.
where x̂(t) ∈ Rn is the dynamic NN state, Â(t) ∈ Rn×n , Ŵm 1 (t) ∈ Rn×n , and Ŵm 2
(t) ∈ Rn×n are estimated dynamic NN weight matrices, and η x̃(t) is the residual error
term with the design parameter η > 0 and the identification error x̃(t) = x(t) − x̂(t).
By (8.2.7) and (8.2.9), the identification error dynamics can be developed as
where Ã(t) = A − Â(t), W̃m 1 (t) = Wm 1 − Ŵm 1 (t), W̃m 2 (t) = Wm 2 − Ŵm 2 (t), φ̃ =
φ(x(t)) − φ(x̂(t)), and ρ̃ = ρ(x(t)) − ρ(x̂(t)).
Prior to showing the stability of the identification error x̃(t), two assumptions are
introduced. It should be mentioned that these assumptions are common techniques,
which have been used in [9, 31, 32].
εT(t)ε(t) ≤ δ M x̃ T(t)x̃(t),
Theorem 8.2.1 Let Assumptions 8.2.1 and 8.2.2 be satisfied. If the estimated
dynamic NN weight matrices Â(t), Ŵm 1 (t), and Ŵm 2 (t) are updated as
˙ = Λ x̂(t)x̃ T(t),
Â(t) 1
where
1 T
L 1 (x) = x̃ (t)x̃(t),
2
1
L 2 (x) = tr ÃT(t)Λ−1 −1 −1
1 Ã(t) + W̃m 1 (t)Λ2 W̃m 1 (t) + W̃m 2 (t)Λ3 W̃m 2 (t) .
T T
2
Taking the time derivative of L 1 (x) and using the identification error (8.2.10), we
obtain
1 T T
L̇ 1 (x) ≤ x̃ (t) A A + WmT1 Wm 1 + WmT2 Wm 2 x̃(t)
2
1
+ 2 + κ12 + α 2 κ22 − 2η x̃ T(t)x̃(t)
2
+ x̃ T(t) Ã(t)x̂(t) + x̃ T(t)W̃mT1 (t)φ(x̂(t))
1
+ x̃ T(t)W̃mT2 (t)ρ(x̂(t))u(t) + εT(t)ε(t). (8.2.16)
2
On the other hand, taking the time derivative of L 2 (x) and using the weight update
law (8.2.11), we obtain
L̇ 2 (x) = − tr ÃT(t)x̂(t)x̃ T(t) + W̃mT1 (t)φ(x̂(t))x̃ T(t)
+ W̃mT2 (t)ρ(x̂(t))u(t)x̃ T(t) . (8.2.17)
Observe that tr X 1 X 2T = tr X 2T X 1 = X 2T X 1 , ∀X 1 , X 2 ∈ Rn×1 . Then, (8.2.17) can
be rewritten as
Combining (8.2.16) and (8.2.18) and employing Assumptions 8.2.1 and 8.2.2, we
obtain
1 T T
L̇(x) ≤ x̃ (t) A A + WmT1 Wm 1 + WmT2 Wm 2 x̃(t)
2
1 1
+ κ12 + κ22 α 2 + 2 − 2η x̃ T(t)x̃(t) + εT(t)ε(t)
2 2
1 T 3
≤ − x̃ (t) 2η − κ1 − α κ2 − δ M − 2 In −
2 2 2
P̄i x̃(t).
2 i=1
Denote
3
B = 2η − κ1 − α κ2 − δ M − 2 In −
2 2 2
P̄i .
i=1
1 1
L̇(x) ≤ − x̃ T(t)Bx̃(t) ≤ − λmin (B) x̃(t) 2 . (8.2.19)
2 2
Equations (8.2.12) and (8.2.19) guarantee that x̃(t), Ã(t), W̃m 1 (t), and W̃m 2 (t) are
bounded since L(x) is decreasing. Integrating both sides of (8.2.19), we obtain
8.2 Optimal Control of Unknown Nonaffine Nonlinear Systems with Constrained Inputs 315
∞
2(L(0) − L(∞))
x̃(t) 2 dt ≤ .
0 λmin (B)
By Barbalat’s lemma [13], lim x̃(t) = 0, i.e., lim x̃(t) = 0. Note that the right-
t→∞ t→∞
hand side of the above expression is finite since λmin (B) > 0 and L(∞) < L(0) < ∞
by virtue of (8.2.19). This completes the proof.
By Remark 8.2.3, and using (8.2.6), we derive the optimal control for the given
problem as
1
u ∗ (x) = −τ tanh ρ T(x)Ŵm 2 Vx∗ . (8.2.21)
2τ
Then, the HJB equation (8.2.5) becomes
Vx∗T Âx + ŴmT1 φ(x) − 2τ 2 Φ T(x) tanh Φ(x) + Q(x)
−τ tanh(Φ(x))
+ 2τ tanh−T(υ/τ )dυ = 0, (8.2.22)
0
where
1 T
Φ(x) = ρ (x)Ŵm 2 Vx∗ .
2τ
Denote
m
= 2τ Φ (x) tanh Φ(x) + τ
2 T 2
ln 1 − tanh2 Φi (x) . (8.2.23)
i=1
m
Vx∗T Âx + ŴmT1 φ(x) + Q(x) + τ 2
ln 1 − tanh2 Φi (x) = 0.
i=1
is the activation function with σ j (x) ∈ C 1 (Ω) and σ j (0) = 0, the set {σ j (x)}1N0 is
often selected to be linearly independent, and ε N0 (x) is a bounded NN function
reconstruction error. The derivative of V (x) with respect to x is given by
where Ŵc is the estimation of Wc . The weight estimation error for the critic NN is
defined as
W̃c = Wc − Ŵc .
where δε is the Bellman residual error. From (8.2.26) and (8.2.28), we derive
δε = −W̃cT ∇σ Âx + ŴmT1 φ(x) + ŴmT2 ρ(x)u + εLE . (8.2.29)
1 2
E= δ .
2 ε
Using the gradient descent algorithm, the weight update law for the critic NN is
developed as
∂E
Ŵ˙ c = −
lc h
= −lc δε , (8.2.30)
(1 + h T h)2 ∂ Ŵc (1 + h T h)2
where
h = ∇σ Âx + ŴmT1 φ(x) + ŴmT2 ρ(x)u ,
lc > 0 is the critic NN learning rate, and the term (1 + h T h)2 is employed for nor-
malization.
On the other hand, by (8.2.21) and (8.2.27), the control policy can be approximated
by an action NN as
1
û c (x) = −τ tanh ρ T(x)Ŵm 2 ∇σ T Ŵc .
2τ
From the expression û c (x), one can find the action NN shares the same weights
Ŵc as the critic NN. However, in standard weight tuning laws [2, 26], the critic NN
and the action NN are tuned sequentially, with the weights of the other NN being
kept constant. It is generally considered that this type of weight update law is more
318 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
Remark 8.2.4 Unlike the actor update law derived in [5, 32] by minimizing the
Bellman residual error, we develop the actor tuning law based on the stability analysis
which shares similar spirits as [21, 25]. The concrete form of the action NN update
law is developed next.
We start with three assumptions and then establish the stability result.
Assumption 8.2.3 The parameters defined in (8.2.20) are upper bounded such that
 ≤ a1 , Ŵm1 ≤ a2 , and Ŵm2 ≤ a3 .
Assumption 8.2.4 There exist known constants bφ > 0 and bρ > 0 such that φ(x)
≤ bφ x and ρ(x) ≤ bρ x for every x ∈ Ω. Meanwhile, there exist known con-
stants bσ > 0 and bσ x > 0 such that σ (x) < bσ and ∇σ (x) < bσ x for every
x ∈ Ω.
Assumption 8.2.5 The NN reconstruction error ε N0 (x) and its derivative with
respect to x are upper bounded as ε N0 (x) < bε and ∇ε N0 (x) < bεx .
Theorem 8.2.2 Consider the nonlinear system described by (8.2.1) with the struc-
ture unknown. Let Assumptions 8.2.3–8.2.5 hold. Take the critic NN and the action
NN as in (8.2.27) and (8.2.31), respectively. Let the weight update law for the critic
NN be
û a
Ŵ˙ c = −lc
h
h T Ŵc + 2τ tanh−T(υ/τ )dυ + Q(x) , (8.2.32)
(1 + h T h)2 0
where h = ∇σ Âx + ŴmT1 φ(x) + ŴmT2 ρ(x)û a . Define h̄ h (1 + h T h). Let the
action NN be turned by
8.2 Optimal Control of Unknown Nonaffine Nonlinear Systems with Constrained Inputs 319
Ŵ˙ a = − la K 2 Ŵa − K 1 h̄ T Ŵc − τ ∇σ (x)ŴmT2 ρ(x)
h̄ T
× tanh(Aa ) − sgn(Aa ) Ŵc , (8.2.33)
1 + hTh
where
1 T
Aa = ρ (x)Ŵm 2 ∇σ T(x)Ŵa ,
2τ
K 1 and K 2 are the tuning vector and the tuning matrix with suitable dimensions
which will be detailed in the proof, and la > 0 is the learning rate of the action NN.
Then, the system state x(t), and the actor–critic weight estimation errors W̃a , and
W̃c are all guaranteed to be UUB, when the number of neurons N0 is selected large
enough.
where
1 T −1
L1 = W̃ l W̃c
2 c c
and
1 T −1
L2 = W̃ l W̃a .
2 a a
Taking the time derivative of V (x) based on (8.2.20) and (8.2.24), we have
V̇ (x) = WcT ∇σ + ∇εTN0 Âx + ŴmT1 φ(x) + ŴmT2 ρ(x)û a
= WcT ∇σ Âx + ŴmT1 φ(x) − τ WcT ∇σ ŴmT2 ρ(x) tanh(Aa ) + Ξ, (8.2.35)
On the other hand, from (8.2.21), (8.2.22), and (8.2.25), we define the residual error
as
320 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
WcT ∇σ Âx + ŴmT1 φ(x) − τ WcT ∇σ ŴmT2 ρ(x) tanh(Ac )
−τ tanh(Ac )
+ 2τ tanh−T (υ/τ )dυ + Q(x) = εHJB ,
0
(8.2.37)
where
1 T
Ac = ρ (x)Ŵm 2 ∇σ T(x)Wc ,
2τ
and εHJB converges to zero when the number of neurons N0 is large enough [2, 25].
Assume that εHJB is upper bounded such that εHJB ≤ εmax , where εmax > 0 is a
small number.
According to (8.2.23),
−τ tanh(Ac )
2τ tanh−T (υ/τ )dυ
0
m
= 2τ 2
AiT tanh(Ai ) + τ 2
ln 1 − tanh2 (Ai j ) . (8.2.40)
j=1
For every Ai j ∈ R, ln 1 − tanh2 (Ai j ) can be expressed as
ln 4 − 2Ai j − 2 ln 1 + exp(−2Ai j ) , Ai j > 0;
ln 1 − tanh2 (Ai j ) =
ln 4 + 2Ai j − 2 ln 1 + exp(2Ai j ) , Ai j < 0.
That is,
ln 1 − tanh2 (Ai j ) = ln 4 − 2Ai j sgn(Ai j ) − 2 ln 1 + exp(−2Ai j sgn(Ai j )) ,
(8.2.41)
where sgn(Ai j ) ∈ R is the sign function with respect to Ai j .
Combining (8.2.40) and (8.2.41), it follows
Γ (Aa ) − Γ (Ac ) = 2τ 2 AaT tanh(Aa ) − ATc tanh(Ac )
+ 2τ 2 ATc sgn(Ac ) − AaT sgn(Aa ) + ΘA , (8.2.42)
where
m
1 + exp(−2AT2 j sgn(A2 j ))
ΘA = 2 ln .
j=1
1 + exp(−2AT1 j sgn(A1 j ))
ΘA ∈ [−m ln 4, m ln 4].
Γ (Aa ) − Γ (Ac ) = τ ŴaT ∇σ ŴmT2 ρ(x) tanh(Aa ) − τ WcT ∇σ ŴmT2 ρ(x) tanh(Ac )
− τ WcT ∇σ ŴmT2 ρ(x) sgn(Aa ) − sgn(Ac )
+ τ W̃aT ∇σ ŴmT2 ρ(x)sgn(Aa ) + ΘA . (8.2.43)
h
L̇ 1 (x) = W̃cT τ W̃aT ∇σ ŴmT2 ρ(x) sgn(Aa ) − tanh(Aa )
(1 + h h)
T 2
− τ WcT ∇σ ŴmT2 ρ(x) sgn(Aa ) − sgn(Ac ) − W̃cT h + εHJB + ΘA
h̄ T
= τ W̃aT ∇σ ŴmT2 ρ(x) tanh(Aa ) − sgn(Aa ) Ŵc
1 + hTh
− W̃cT h̄ h̄ T W̃c + W̃cT h̄ D̄1 (x) + W̃aT D̄2 (x), (8.2.44)
where
1
D̄1 (x) = τ WcT ∇σ ŴmT2 ρ(x)[sgn(Ac ) − sgn(Aa )] − εHJB − ΘA ,
1+h h T
h̄ T
D̄2 (x) = τ ∇σ ŴmT2 ρ(x) sgn(Aa ) − tanh(Aa ) Wc .
1 + hTh
L̇(x) < − Q(x) − W̃cT h̄ h̄ T W̃c + W̃cT h̄ D̄1 (x) + W̃aT D̄2 (x) + β x + εmax
h̄ T
− W̃aT la−1 Ŵ˙ a − τ ∇σ ŴmT2 ρ(x) tanh(Aa ) − sgn(Aa ) Ŵc .
1 + hTh
(8.2.45)
To guarantee L̇(x) < 0, we use the weight update law for the action NN as in (8.2.33).
Observe that
W̃aT K 2 Ŵa − W̃aT K 1 h̄ T Ŵc = W̃aT K 2 (Wc − W̃a ) − W̃aT K 1 h̄ T (Wc − W̃c )
= W̃aT K 2 Wc − W̃aT K 2 W̃a − W̃aT K 1 h̄ T Wc
+ W̃aT K 1 h̄ T W̃c . (8.2.46)
L̇(x) < − Q(x) − W̃cT h̄ h̄ T W̃c − W̃aT K 2 W̃a + W̃cT h̄ D̄1 (x) + W̃aT K 1 h̄ T W̃c
+ β x + W̃aT D̄2 (x) + K 2 Wc − K 1 h̄ T Wc + εmax . (8.2.47)
Since Q(x) > 0, there exists a positive value q ∈ R such that x Tq x < Q(x). Let
Z T = [x T, W̃cT h̄, W̃aT ]. Then, (8.2.47) can be rewritten as
8.2 Optimal Control of Unknown Nonaffine Nonlinear Systems with Constrained Inputs 323
where
⎡ ⎤
qI 0 0 ⎡ ⎤
⎢ K T⎥ β
⎢ − 1 ⎥
G=⎢0 I
2 ⎥ , M =⎣ D̄1 (x) ⎦.
⎣ ⎦
K1 D̄2 (x) + K 2 Wc − K 1 h̄ Wc
T
0 − K2
2
Select K 1 and K 2 such that G is positive definite. Then, (8.2.48) implies
According to the standard Lyapunov’s extension theorem [16, 17] (or the Lagrange
stability result [20]), this demonstrates the uniform ultimate boundedness of Z .
Therefore, the system state x(t), and actor–critic weight estimates errors W̃a (t), and
W̃c (t) are guaranteed to be UUB. This completes the proof.
It is desired to control the system with constrained inputs |u| ≤ 0.45. The non-
quadratic value function is given as
∞ u
V (x) = x12 + x22 + 2τ tanh−T(υ/τ )dυ dt.
0 0
0.09
x̃1 (t)
0.06
x̃2 (t)
Identification errors
0.03
−0.03
−0.06
0 8 16 24 32 40
Time (s)
Fig. 8.1 Errors in estimating system state x(t) by the dynamic NN identifier
simulation result of the system identification error is illustrated in Fig. 8.1. From
Fig. 8.1, one observes that the dynamic NN identifier can ensure asymptotic identi-
fication of the unknown nonaffine nonlinear system (8.2.49). Then, the process of
identifying the unknown system dynamics is finished and the identifier NN weights
are kept unchanged.
The activation function of the critic NN is chosen with N0 = 24 neurons as
σ (x) = x12 , x22 , x1 x2 , x14 , x24 , x13 x2 , x12 x22 , x1 x23 , x16 , x26 , x15 x2 , x14 x22 ,
T
x13 x23 , x12 x24 , x1 x25 , x18 x28 , x17 x2 , x16 x22 , x15 x23 , x14 x24 , x13 x25 , x12 x26 , x1 x27 .
(8.2.50)
1.2
0.9 x1 (t)
x2 (t)
System state x(t)
0.6
0.3
−0.3
0 10 20 30 40 50 60
Time (s)
0.9
Convergence of action NN weights
0.6
−0.3
−0.6
0 10 20 30 40 50 60
Time (s)
is added to the control input during the first 40 s. Clearly, the learning process is com-
pleted before the end of the exploratory signal. Figure 8.3 indicates the convergence
curves of the first 8 weights of the action NN. In fact, after 40 s the action NN weight
vector converges to Ŵa = [0.1371, 0.5229, −0.5539, −0.2209, −0.1081, −0.0534,
326 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
0.1
0
−0.35
−0.1
Control input u(x)
−0.4
−0.2
−0.3 −0.45
1.4 1.8 2.2 2.6 3
−0.45
−0.4
−0.5
0 10 20 30 40 50 60
Time (s)
0.1
−0.1 −0.35
Control input u(x)
−0.2 −0.4
−0.3 −0.45
1.4 1.8 2.2 2.6 3
−0.45
−0.4
−0.5
0 10 20 30 40 50 60
Time (s)
we use Fig. 8.5 to illustrate the controller designed without considering control
constraints, where the control signal is saturated for a short period of time.
From Figs. 8.2 and 8.3, it is observed that the system states and the estimated
weights of the action NN are guaranteed to be UUB, while keeping closed-loop
system stable. Meanwhile, one can find that persistence of excitation ensures the
weights converge to their optimal values (Ŵa∗ ) after 40 s. That is, the final weights
Ŵa∗ are obtained. Under this circumstance, based on (8.2.31) and (8.2.50), we can
derive the optimal control for system (8.2.49). In addition, comparing Fig. 8.4 with
Fig. 8.5, we shall find that the restriction of control constraints has been successfully
overcome.
where x(t) = [x1 (t), . . . , xn (t)]T ∈ Rn is the state, u(t) = [u 1 (t), . . . , u m (t)]T ∈ Rm
is the control input, y(t) = [y1 (t), . . . , yl (t)]T ∈ Rl is the output, and F(x, u) is
an unknown nonaffine nonlinear smooth function with F(0, 0) = 0. The state of
system (8.3.1) is not available, only the system output y(t) can be measured. For
convenience of later analysis, we provide an assumption as follows (for definition of
various concepts, see [1, 15]).
Assumption 8.3.1 System (8.3.1) is observable and the system state is bounded in
L∞ (Ω). In addition, C ∈ Rl×n (l ≤ n) is a full row rank matrix, i.e., rank(C) = l.
For optimal output regulator problems, the control objective is to find an admis-
sible control for system (8.3.1) which minimizes the infinite-horizon value function
∞
V (x(t)) = y T(s)Qy(s) + u T(s)Ru(s) ds, (8.3.2)
t
Due to the fact that the system dynamics and system states are completely unknown,
we cannot directly apply existing ADP methods to system (8.3.1). In this section,
we employ a multilayer feedforward NN state observer to obtain estimated states of
system (8.3.1).
From (8.3.1), we have
where G(x, u) = F(x, u) − Ax, A is a Hurwitz matrix, and the pair (C, A) is observ-
able. Then, the state observer for system (8.3.1) is given by
where x̂(t) and ŷ(t) denote the state and output of the observer, respectively, and the
observer gain K ∈ Rn×l is selected such that A − K C is a Hurwitz matrix.
To design an NN state observer, one often uses an NN to identify the nonlinearity
and a conventional observer to estimate system states [1, 3, 23]. The structure of the
designed NN observer is shown in Fig. 8.6.
It has been proved that a three-layer NN with a single hidden layer can approximate
nonlinear systems with any degree of accuracy [11, 12]. According to the universal
approximation property of NNs, G(x, u) can be represented on a compact set Ω as
where Wo ∈ Rk×n and Yo ∈ R(n+m)×k are the ideal weight matrices for the hidden
layer to the output layer and the input layer to the hidden layer, respectively, k is
the number of neurons in the hidden layer, x̄ = [x T, u T ]T is the NN input, and εo (x)
is the bounded NN functional approximation error. It is often assumed that there
exists a constant ε M > 0 such that εo (x) ≤ ε M . σ (·) is the NN activation function
− −
+
8.3 Optimal Output Regulation of Unknown Nonaffine Nonlinear Systems 329
Assumption 8.3.2 The ideal NN weights Wo and Yo are bounded by known positive
constants W̄ M and Ȳ M , respectively. That is,
Wo ≤ W̄ M and Yo ≤ Ȳ M .
ˆ
Ĝ(x̂, u) = ŴoT σ (ŶoT x̄),
where x̂ is the estimated state vector, x̄ˆ = [x̂ T , u T ]T , Ŵo and Ŷo are the corresponding
estimates of the ideal weight matrices. Then, the NN state observer (8.3.5) can be
represented as
x̂˙ = A x̂ + ŴoT σ ŶoT x̄ˆ + K (y − C x̂), ŷ = C x̂. (8.3.7)
Define the state and output estimation errors as x̃ = x − x̂ and ỹ = y − ŷ. Then,
using (8.3.4), (8.3.6) and (8.3.7), the error dynamics is obtained as
x̃˙ = (A − K C)x̃ + WoT σ YoT x̄ − ŴoT σ ŶoT x̄ˆ + εo (x), ỹ = C x̃. (8.3.8)
ˆ to (8.3.8), it follows
Adding and subtracting WoT σ (ŶoT x̄)
x̃˙ = Ao x̃ + W̃oT σ ŶoT x̄ˆ + ζ (x), ỹ = C x̃, (8.3.9)
where W̃o = Wo − Ŵo , Ao = A − K C, and ζ (x) = WoT σ YoT x̄ − σ ŶoT x̄ˆ + εo
(x).
Remark 8.3.1 It is worth pointing out that ζ (x) is a bounded disturbance term. That
is, there exists a known constant ζ M > 0 such that ζ (x) ≤ ζ M , because of the
boundedness of the hyperbolic tangent function, the NN approximation error εo (x),
and the ideal NN weights Wo and Yo .
Theorem 8.3.1 Consider system (8.3.1) and the observer dynamics (8.3.7). Let
Assumptions 8.3.1 and 8.3.2 hold. If the NN weight tuning algorithms are given as
330 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
Ŵ˙ o = −η1 σ ŶoT x̄ˆ ỹ TC A−1
o − θ1 ỹ Ŵo ,
˙
ˆ ỹ C Ao ŴoT Ik − Γ ŶoT x̄ˆ − θ2 ỹ Ŷo ,
Ŷo = −η2 sgn(x̄) T −1
(8.3.10)
where
ˆ = diag σ12 (Ŷo1
Γ (ŶoTx̄) T ˆ T ˆ
x̄), . . . , σk2 (Ŷok x̄) ,
Ŷoi is the i th column of Ŷo , sgn is the componentwise sign function, η1 > 0 (1 =
1, 2) are learning rates, and θ2 > 0 (2 = 1, 2) are design parameters, then the state
estimation error x̃ converges to the compact set
2d
Ωx̃ = x̃ : x̃ ≤ , (8.3.11)
ρ C λmin (C + )T C +
1 T 1 1
L o (x) = x̃ P x̃ + tr W̃oT W̃o + tr ỸoT Ỹo , (8.3.12)
2 2 2
where P is a symmetric positive-definite matrix satisfying
ATo P + P Ao = −ρ In (8.3.13)
L˙o (x) = x̃˙ TP x̃ + x̃ TP x̃˙ + tr W̃oT W̃˙ o + tr ỸoT Ỹ˙o .
1 1
(8.3.14)
2 2
Using (8.3.10), we obtain
W̃˙ o = η1 σ ŶoT x̄ˆ ỹ T C A−1
o + θ1 ỹ Ŵo ,
˙ T
ˆ ỹ T C A−1
Ỹo = η2 sgn(x̄) ˆ + θ2 ỹ Ŷo .
o Ŵo Ik − Γ Ŷo x̄
T
(8.3.15)
where l1 = η1 C A−1 −1
o and l2 = η2 C Ao .
Replace x̃ using x̃ = C ỹ, where C + is the Moore–Penrose pseudoinverse of the
+
Note that (8.3.17) is true for Frobenius matrix norm, but it is not true for other matrix
norms in general. As we have declared earlier, Frobenius norm for matrices and
Euclidean norm for vectors are used in this chapter. We do not use the subscript “F”
for Frobenius matrix norm for convenience of presentation. The last inequality in
(8.3.17) is obtained based on the fact that, for given matrices A and B, the following
relationship holds:
Observing that
Ŵo ≤ W̄ M + W̃o , 1 − σ M2 ≤ 1,
ρ
L˙o ≤ − λmin (C + )T C + ỹ 2 + ζ M P C + − θ1 − K 12 W̃o 2
2
+ σ M P C + + σ M l1 + θ1 W̄ M W̃o + θ2 Ȳ M + l2 W̄ M Ỹo
2
− (θ2 − 1) Ỹo 2 − K 1 W̃o − Ỹo ỹ . (8.3.21)
Denote K 2 and K 3 as
σM P C + + σ M l1 + θ1 W̄ M θ2 Ȳ M + l2 W̄ M
K2 = , K3 = .
2(θ1 − K 1 )
2 2(θ2 − 1)
To complete the squares for the terms W̃o and Ỹo , K 22 ỹ and K 32 ỹ are added
to and subtracted from (8.3.21), and we obtain
ρ
L˙o (x) ≤ − λmin (C + )T C + ỹ 2 + ζ M P C + + θ1 − K 12 K 22
2
2
+ (θ2 − 1)K 32 − θ1 − K 12 K 2 − W̃o
2 2
− (θ2 − 1) K 3 − Ỹo − K 1 W̃o − Ỹo ỹ . (8.3.22)
ρ
L˙o ≤ − λmin (C + )T C + ỹ 2
+ d ỹ , (8.3.23)
2
where
d = ζM P C + + θ1 − K 12 K 22 + θ2 − 1 K 32 . (8.3.24)
Therefore, for guaranteeing L˙o < 0, the following condition should hold, i.e.,
2d
ỹ > . (8.3.25)
ρλmin (C + )T C +
2d
x̃ > .
ρ C λmin (C + )T C +
That is, the state estimation error x̃ converges to Ωx̃ defined as in (8.3.11). Mean-
while, by using the standard Lyapunov’s extension theorem [16, 17] (or the Lagrange
stability result [20]), we conclude that the weight estimation errors W̃o and Ỹo are
UUB. This completes the proof.
Remark 8.3.2 By linear matrix theory [7, 10], we can obtain that rank(C) =
rank(C + ) and rank(C + ) = rank (C + )T C + . Accordingly, using Assumption 8.3.1,
8.3 Optimal Output Regulation of Unknown Nonaffine Nonlinear Systems 333
we derive rank (C + )T C + = rank(C) = l. Noticing that (C + )T C + ∈ Rl×l and
(C + )T C + is a symmetric matrix, + T +
we can conclude that (C ) C is positive def-
inite. Therefore, λmin (C + )T C + > 0. This shows that the compact set Ωx̃ makes
sense.
Remark 8.3.3 The explanation about selecting an NN observer rather than system
identification technique is given here. In control engineering, a common approach is
to start from the measurement of system behavior and external influences (inputs to
the system) and try to determine a mathematical relationship between them without
going into the details of what is actually happening inside the system [8, 27]. This
approach is called system identification. So, we can conclude that based on system
identification, we are generally able to obtain a “black box” model of the nonlinear
system [14], but do not get any in depth knowledge about system states because
they are the internal properties of the system. In most real cases, the state variables
are unavailable for direct online measurements, and merely input and output of the
system are measurable. Therefore, estimating the state variables by observers plays
an important role in the control of processes to achieve better performances. Once
the estimated states are obtained, we can directly design a state feedback controller to
achieve the optimization of system performance [24]. In conclusion, here we employ
an NN observer rather than system identification techniques.
By (8.3.1) and (8.3.3) and using the observed state x̂ to replace the system state x,
we obtain the Hamiltonian as
where
r (x̂, u) = x̂ TQ c x̂ + u TRu
∂ V (x̂)
as in (8.3.3) and Vx̂ = . The optimal value function V ∗ (x̂(t)) is
∂ x̂
∞
∗
V (x̂(t)) = min r (x̂(τ ), u(τ ))dτ,
u∈A (Ω) t
∂ V ∗ (x̂)
where Vx̂∗ = . Assume that the minimum on the right-hand side of (8.3.26)
∂ x̂
∂ H (x̂, u, Vx̂∗ )
exists and is unique. Then, by solving the equation = 0, the optimal
∂u
control can be obtained as
T
1 ∂ F(x̂, u)
u ∗ = − R −1 Vx̂∗ . (8.3.27)
2 ∂u
To obtain solutions of the optimal control problem, we only need to solve (8.3.28).
However, due to the nonlinear nature of the HJB equation, finding its solutions is
generally difficult or impossible. Therefore, a scheme shall be developed using NNs
for solving the above optimal control problems. The structural diagram of the NN
observer-based controller is shown in Fig. 8.7.
According to the universal approximation property of NNs, the value function
V ∗ (x̂) can be represented on a compact set Ω as
V ∗ (x̂) = WcT σ YcT x̂ + εc (x̂), (8.3.29)
where Wc ∈ Rkc and Yc ∈ Rn×kc are the ideal weight matrices for the hidden layer
to the output layer and the input layer to the hidden layer, respectively, kc is the
the number of neurons in the hidden layer, and εc is the bounded NN functional
approximation error. In our design, based on [12], for both simplicity of learning and
efficiency of approximation, the output layer weight Wc is adapted online, whereas
the input layer weight Yc is selected initially at random and held fixed during the
entire learning process. It is demonstrated in [11, 12] that if the number of neurons
kc is sufficiently large, then the NN approximation error εc can be kept arbitrarily
small.
The output of the critic NN is given as
V̂ (x̂) = ŴcT σ YcT x̂ = ŴcT σ (z),
where Ŵc is the estimate of Wc . Since the hidden layer weight matrix Yc is fixed, we
write the activation function σ (YcT x̂) as σ (z) with z = YcT x̂.
The derivative of the value function V (x̂) with respect to x̂ is
V x
x
u x
x
x x x y
x F xu C
S
y y
K
u y
G x x
x C
S
where ∇σcT = Yc (∂σ T(z)/∂z) and ∇εc = ∂εc /∂ x̂. In addition, the derivative of V̂ (x̂)
with respect to x̂ is obtained as V̂x̂ = ∇σcT Ŵc . Then, the approximate Hamiltonian
is derived as
H (x̂, u, Ŵc ) = ŴcT ∇σc F(x̂, u) + r (x̂, u) = ec . (8.3.31)
It is worth pointing out that, to get the error ec , the knowledge of the system dynamics
is required. To overcome this limitation, the NN observer developed in (8.3.7) is used
to replace F(x̂, u). Then, (8.3.31) becomes
Given that u ∈ A (Ω), it is desired to select Ŵc to minimize the squared residual
error E c (Ŵc ) as
1
E c (Ŵc ) = ec2 .
2
336 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
Using the normalized gradient descent algorithm, the weight update law for the critic
NN is derived as
α ∂ Ec ψ T
Ŵ˙ c = − = −α T ψ Ŵc + r (x̂, u) , (8.3.33)
(ψ T ψ+ 1) ∂ Ŵc
2 (ψ ψ + 1) 2
˙
ψ = ∇σc x̂.
where ϑ = −∇εc x̂˙ is the residual error due to the NN approximation. Substituting
(8.3.34) into (8.3.33), and using the notation
ψ
ψ1 = T , ψ2 = ψ T ψ + 1,
ψ ψ +1
ϑ
W̃˙ c = −αψ1 ψ1T W̃c + αψ1 .
ψ2
Ŵ˙ c = −W̃˙ c
ϑ
= αψ1 ψ1T W̃c − αψ1
ψ2
ψ1 T
= −α ψ (Ŵc − Wc ) + ϑ
ψ2
ψ1 T
= −α T ψ Ŵc + r (x̂, û) . (8.3.35)
ψ ψ +1
From the expression ψ1 , there exists a constant ψ1M > 1 such that ψ1 < ψ1M .
It is important to note that the persistence of excitation (PE) condition is required for
tuning the critic NN. To satisfy the PE condition, a small exploratory signal is added
to the control input [25]. Furthermore, the PE condition ensures ψ1 ≥ ψ1m , with
ψ1m being a positive constant.
8.3 Optimal Output Regulation of Unknown Nonaffine Nonlinear Systems 337
Therefore, ∂ F̂(x̂, u)/∂u can be obtained by the backpropagation from the output of
the observer NN to its input u.
Assumption 8.3.4 The NN activation function and its gradient are bounded. That
is, σ < σcM and ∇σc < σd M , where σcM and σd M are known positive constants.
Theorem 8.3.2 Consider the NN observer (8.3.7) and the feedback control given in
(8.3.37). Let weight tuning laws for the observer NN be provided as in (8.3.10) and
for the critic NN be provided as in (8.3.35). Then, x, x̃, W̃o , Ỹo and W̃c in the NN
observer-based control system are all UUB.
1 T
L(x) = L o (x) + W̃c W̃c + α x Tx + γ V (x) ,
!2α "# $ ! "# $
L c2
L c1
and the residual error is upper bounded, i.e., there exists a constant ϑ M > 0 such that
ϑ ≤ ϑ M . Taking the time derivative of L c1 , we obtain
1 T ˙
L̇ c1 = W̃ W̃c
α c
1 T ϑ
= W̃c − αψ1 ψ1 W̃c + αψ1
T
α ψ2
ϑ
= − W̃cTψ1 2 + W̃cTψ1
ψ2
% %
%ϑ %
≤ − W̃cTψ1 2 + W̃cT ψ1 % %ψ %
%
2
≤ − W̃cTψ1 2 + W̃cTψ1 ϑ M
ϑ M 2 ϑ M 2
= − W̃cTψ1 − + .
2 4
Using Assumptions 8.3.3 and 8.3.4, and
we have
L̇ c2 = 2αx T ẋ + αγ − x TQ c x − û TR û
= 2αx T (Ax + WoTσ (YoTx̄) + ε) + αγ − x TQ c x − û TR û
T
≤ α x Tx + Ax + WoTσ (YoTx̄) + ε Ax + WoTσ (YoTx̄) + ε
+ γ − x TQ c x − û TR û
≤ α (1 + 3 A 2 ) x + 3 WoTσ (YoTx̄) 2 + 3 ε
2 2
− γ λmin (Q c ) x 2 − γ λmin (R) û 2
≤ − α γ λmin (Q c ) − 1 − 3 A 2 x 2
− γ λmin (R) û 2 + 3W̄ M σ M + 3ε2M .
2 2
Thus, we obtain
L̇ c1 + L̇ c2 ≤ − α γ λmin (Q c ) − 1 − 3 A 2 x 2
ϑ M 2
− W̃cTψ1 −
2
− αγ λmin (R) û 2 + D M , (8.3.38)
8.3 Optimal Output Regulation of Unknown Nonaffine Nonlinear Systems 339
where
ϑM2
DM = + 3α W̄ M σ M + 3αε2M .
2 2
4
Combining (8.3.23) and (8.3.38), it follows
ρ
L̇(x) ≤ − λmin (C + )T C + C x̃ 2 + d C x̃ − α γ λmin (Q c ) − 1 − 3 A 2 x 2
2
ϑ M 2
− W̃cTψ1 − − αγ λmin (R) û 2 + D M
2
ρ d 2
≤ − λmin (C + )T C + C x̃ −
2 ρλmin (C + )T C +
ϑ M 2
− α γ λmin (Q c ) − 1 − 3 A 2 x 2 − W̃cTψ1 −
2
2
d
− αγ λmin (R) û 2 + D M + .
2ρλmin (C + )T C +
! "# $
D̃ M
3 A 2+1
θ1 ≥ K 12 , θ2 ≥ 1, γ > , (8.3.39)
λmin (Q c )
% T %
or
%W̃ ψ1 % > D̃ M + ϑ M B2 ,
c
2
then, we derive L̇(x) < 0. By the dense property of R [22], we can obtain a
positive constant κ1 (0 < κ1 ≤ C ) such that C x̃ > κ1 x̃ > B1 . Similarly,
we can derive that there exists a positive constant κ2 (0 < κ2 ≤ ψ1m ) such that
W̃cT ψ1 > κ2 W̃c > B2 . Then, it follows that L̇(x) < 0, if (8.3.39) is true and
if one of the following inequalities is true
B1 d 1 2 D̃ M
x̃ > = T + T (8.3.41)
κ1 κ1 ρλmin C + C + κ 1 ρλmin C + C +
340 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
or
B2 1 ϑM
W̃c > = D̃ M + . (8.3.42)
κ2 κ2 2
By using Lyapunov’s extension theorem [16, 17] (or the Lagrange stability result
[20]), it can be concluded that the observer error x̃, the state x, and the NN weight
estimation errors W̃o , Ỹo , and W̃c are all UUB. This completes the proof.
Remark 8.3.4 It should be mentioned that, in (8.3.39) and (8.3.40), the constraints
for θ1 , θ2 and x̃ are the same as the NN observer designed earlier. In fact, a nonlinear
separation principle is not applicable here. But for the proof of the NN observer-
based control system, the closed-loop dynamics incorporates the observer dynamics,
then we can develop simultaneous weight tuning laws for the observer NN and the
critic NN.
x˙1 = −x1 + x2 ,
x˙2 = −x1 − 1 − sin2 (x1 ) x2 + sin(x1 )u + 0.1u 2 ,
y = x1 ,
with initial conditions x1 (0) = 1 and x2 (0) = −0.5. The value function is given by
(8.3.2), where Q and R are chosen as identity matrices with appropriate dimensions. It
is assumed that the system dynamics is unknown, the system states are unavailable for
measurements and only the input and output of the system are measurable. To estimate
the system states, an NN observer is employed and the corresponding parameters are
chosen as
0 1
A= ,
−6 − 5
and
10
K = .
−2
0.8
Error of real state and observed state
0.6
0.4
0.2
-0.2
0 5 10 15 20 25 30 35 40 45 50
Time (s)
Fig. 8.8 The error between the actual state x1 and observed state x̂1
0.8
Error of real state and observed state
0.6
0.4
0.2
-0.2
-0.4
-0.6
-0.8
-1
0 5 10 15 20 25 30 35 40 45 50
Time (s)
Fig. 8.9 The error between the actual state x2 and observed state x̂2
342 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
1.2
0.6
0.3
−0.3
0 10 20 30 40 50
Time (s)
1.5
0
Control input
−1.5
−3
0 10 20 30 40 50
Time (s)
The activation functions of the critic NN are chosen from the sixth-order series
expansion of the value function. Only polynomial terms of even order are considered,
that is,
' (T
σc = x12 , x1 x2 , x22 , x14 , x13 x2 , x12 x22 , x1 x23 , x24 , x16 , x15 x2 , x14 x22 , x13 x23 , x12 x24 , x1 x25 , x26 .
8.3 Optimal Output Regulation of Unknown Nonaffine Nonlinear Systems 343
The critic NN weight vector is denoted as Ŵc = [Ŵc1 , . . . , Ŵc15 ]T . The learning rate
for the critic NN is selected as α = 0.5. Moreover, the initial weight vector of Ŵc
is set to be [1, . . . , 1]T . Moreover, in order to maintain the PE condition, a small
exploratory signal
8.4 Conclusions
References
1. Abdollahi F, Talebi HA, Patel RV (2006) A stable neural network-based observer with appli-
cation to flexible-joint manipulators. IEEE Trans Neural Netw 17(1):118–129
2. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
3. Ahmed M, Riyaz S (2000) Dynamic observer-a neural net approach. J Intell Fuzzy Syst
9(1–2):113–127
4. Apostol TM (1974) Mathematical analysis. Addison-Wesley, Boston, MA
5. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
6. Bian T, Jiang Y, Jiang ZP (2014) Adaptive dynamic programming and optimal control of
nonlinear nonaffine systems. Automatica 50(10):2624–2632
7. Campbell SL, Meyer CD (1991) Generalized inverses of linear transformations. Dover Publi-
cations, New York
8. Goodwin GC, Payne RL (1977) Dynamic system identification: experiment design and data
analysis. Academic Press, New York
344 8 Optimal Control of Unknown Continuous-Time Nonaffine Nonlinear Systems
9. Hayakawa T, Haddad WM, Hovakimyan N (2008) Neural network adaptive control for a class
of nonlinear uncertain dynamical systems with asymptotic stability guarantees. IEEE Trans
Neural Netw 19(1):80–89
10. Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press, New York
11. Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping
and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
12. Igelnik B, Pao YH (1995) Stochastic choice of basis functions in adaptive function approxi-
mation and the functional-link net. IEEE Trans Neural Netw 6(6):1320–1329
13. Ioannou P, Sun J (1996) Robust adaptive control. Prentice-Hall, Upper Saddle River, NJ
14. Jin G, Sain M, Pham K, Spencer B, Ramallo J (2001) Modeling MR-dampers: A nonlinear
blackbox approach. In: Proceedings of the American control conference. pp 429–434
15. Khalil HK (2001) Nonlinear systems. Prentice-Hall, Upper Saddle River, NJ
16. LaSalle JP, Lefschetz S (1967) Stability by Liapunov’s direct method with applications. Aca-
demic Press, New York
17. Lewis F, Jagannathan S, Yesildirak A (1999) Neural network control of robot manipulators and
nonlinear systems. Taylor & Francis, London
18. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control. Wiley, Hoboken, NJ
19. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
20. Michel AN, Hou L, Liu D (2015) Stability of dynamical systems: on the role of monotonic
and non-monotonic Lyapunov functions. Birkhäuser, Boston, MA
21. Modares H, Lewis F, Naghibi-Sistani MB (2014) Online solution of nonquadratic two-player
zero-sum games arising in the H∞ control of constrained input systems. Int J Adap Control
Signal Process 28(3–5):232–254
22. Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, New York
23. Selmic RR, Lewis FL (2001) Multimodel neural networks identification and failure detection
of nonlinear systems. In: Proceedings of the IEEE conference on decision and control. pp
3128–3133
24. Theocharis J, Petridis V (1994) Neural network observer for induction motor control. IEEE
Control Syst Mag 14(2):26–37
25. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
26. Vrabie D, Lewis F (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
27. Walter E, Pronzato L (1997) Identification of parametric models from experimental data.
Springer, Heidelberg
28. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of unknown
continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–566
29. Yang X, Liu D, Wang D, Wei Q (2014) Discrete-time online learning control for a class of
unknown nonaffine nonlinear systems using reinforcement learning. Neural Netw 55:30–41
30. Yu W (2009) Recent advances in intelligent control systems. Springer, London
31. Yu W, Li X (2001) Some new results on system identification with dynamic neural networks.
IEEE Trans Neural Netw 12(2):412–417
32. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
33. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
Chapter 9
Robust and Optimal Guaranteed Cost
Control of Continuous-Time Nonlinear
Systems
9.1 Introduction
ADP method has been often used in feedback control applications, both for discrete-
time systems [8, 12, 17, 20, 22, 25, 34, 35, 43–46, 50, 57] and for continuous-time
systems [1, 5, 7, 21, 23, 26, 27, 31, 36, 41, 42, 55, 60]. Besides, various traditional
control problems, such as robust control [2, 3], decentralized control [29], networked
control [56], power system control [18], are studied under the new framework, which
greatly extends the application scope of ADP methods.
In this chapter, we first employ online policy iteration (PI) algorithm to tackle
the robust control problem [47]. The robust control problem is transformed into an
optimal control problem with the cost function modified to account for uncertainties.
Then, an online PI algorithm is developed to solve the HJB equation by constructing
and training a critic network. It is shown that an approximate closed-form expression
of the optimal control law is available. Hence, there is no need to build an action net-
work. The uniform ultimate boundedness of the closed-loop system is analyzed by
using the Lyapunov approach. Since the ADP method is effective in solving optimal
control problem and NNs can be constructed to facilitate the implementation process,
it is convenient to employ the PI algorithm to handle robust control problem. Thus,
the present robust control approach is easy to understand and implement. It can be
used to solve a broad class of nonlinear robust control problems. Next, we inves-
tigate the optimal guaranteed cost control of continuous-time uncertain nonlinear
systems using NN-based online solution of HJB equation [28]. The optimal robust
guaranteed cost control problem is transformed into an optimal control problem by
introducing an appropriate cost function. It can be proved that the optimal cost func-
tion of the nominal system is the optimal guaranteed cost of the uncertain system.
Then, a critic network is constructed for facilitating the solution of modified HJB
equation. Moreover, inspired by the work of [7, 36], an additional stabilizing term
is introduced to ensure the stability, which relaxes the need for an initial stabilizing
control. The uniform ultimate boundedness of the closed-loop system is also proved
using Lyapunov’s direct approach. Besides, it is shown that the approximate control
input can converge to the optimal control within a small bound.
where x(t) ∈ Rn is the state vector and u(t) ∈ Rm is the control vector, f (·) and g(·)
are differentiable in their arguments with f (0) = 0, and Δf (x(t)) is the unknown
perturbation. Here, we let x(0) = x0 be the initial state.
Denote
f¯(x) = f (x) + Δf (x).
9.2 Robust Control of Uncertain Nonlinear Systems 347
Suppose that the function f¯(x) is known only up to an additive perturbation, which
is bounded by a known function in the range of g(x). Note that the condition for
the unknown perturbation to be in the range space of g(x) is called the matching
condition. Thus, we write Δf (x) = g(x)d(x) with d(x) ∈ Rm , which represents
the matched uncertainty of the system dynamics. Assume that the function d(x) is
bounded by a known function d M (x), i.e., d(x) ≤ d M (x) with d M (0) = 0. We also
assume that d(0) = 0, so that x = 0 is an equilibrium.
For system (9.2.1), to deal with the robust control problem, we should find a feed-
back control law u(x), such that the closed-loop system is globally asymptotically
stable for all admissible uncertainties d(x). Here, we will show that this problem
can be converted into designing an optimal controller for the corresponding nominal
system with appropriate cost function.
Consider the nominal system corresponding to (9.2.1)
where ρ > 0 is a constant, U (x, u) ≥ 0 is the utility function with U (0, 0) = 0. In this
section, the utility function is chosen as the quadratic form U (x, u) = x TQx + u TRu,
where Q and R are positive definite matrices with Q ∈ Rn×n and R ∈ Rm×m . The
cost function described in (9.2.3) gives a modification with respect to the ordinary
optimal control problem, which appropriately reflects the uncertainties, regulation,
and control simultaneously.
When dealing with the optimal control problem, the designed feedback control
must be admissible. A control u(x) : Rn → Rm is said to be admissible with respect
to (9.2.2) on Ω, written as u(x) ∈ A (Ω), if u(x) is continuous on Ω, u(0) = 0,
u(x) stabilizes system (9.2.2) on Ω and J (x0 , u) is finite for every x0 ∈ Ω. Given a
control μ ∈ A (Ω), if the associated value function
∞ 2
V (x(t)) = ρd M (x(τ )) + U (x(τ ), μ(x(τ ))) dτ (9.2.4)
t
ρd M
2
(x) + U (x, μ(x)) + (∇V (x))T ( f (x) + g(x)μ(x)) = 0 (9.2.5)
348 9 Robust and Optimal Guaranteed Cost Control …
with V (0) = 0. In (9.2.5), the term ∇V (x) denotes the partial derivative of the value
∂ V (x)
function V (x) with respect to x, i.e., ∇V (x) = .
∂x
Define the Hamiltonian of the problem and the optimal cost function as
H (x, μ, ∇V (x)) = ρd M
2
(x) + U (x, μ) + (∇V (x))T ( f (x) + g(x)μ)
and ∞
∗
2
V (x) = min ρd M (x(τ )) + U (x(τ ), μ(x(τ ))) dτ. (9.2.6)
μ∈A (Ω) t
The optimal value function defined above will be used interchangeably with optimal
cost function J ∗ (x0 ). The optimal value function V ∗ (x) satisfies the HJB equation
∂ V ∗ (x)
where ∇V ∗ (x) = . Assume that the minimum on the right-hand side of
∂x
(9.2.7) exists and is unique. Then, the optimal control law for the given problem is
1
u ∗ (x) = − R −1 g T(x)∇V ∗ (x). (9.2.8)
2
Substituting (9.2.8) into the nonlinear Lyapunov equation (9.2.5), we can obtain the
HJB equation in terms of ∇V ∗ (x) as follows:
ρd M
2
(x) + x TQx + (∇V ∗ (x))T f (x)
1
− (∇V ∗ (x))T g(x)R −1 g T(x)∇V ∗ (x) = 0 (9.2.9)
4
with V ∗ (0) = 0.
In this section, we establish the following theorem for the transformation between
the robust control problem and the optimal control problem.
Theorem 9.2.1 For nominal system (9.2.2) with the cost function (9.2.3), assume
that the HJB equation (9.2.7) has a solution V ∗ (x). Then, using this solution, the
optimal control law obtained in (9.2.8) ensures closed-loop asymptotic stability of
uncertain nonlinear system (9.2.1), provided that the following condition is satisfied:
ρd M
2
(x) ≥ d T(x)Rd(x). (9.2.10)
9.2 Robust Control of Uncertain Nonlinear Systems 349
Proof Let V ∗ (x) be the optimal solution of the HJB equation (9.2.7) and u ∗ (x) be
the optimal control law defined by (9.2.8). Now, we prove that u ∗ (x) is a solution to
the robust control problem, namely system (9.2.1) is asymptotically stable under the
control u ∗ (x) for all possible uncertainties d(x). To do this, one needs to prove that
V ∗ (x) is a Lyapunov function of (9.2.1).
According to (9.2.6), V ∗ (x) > 0 for any x = 0 and V ∗ (x) = 0 when x = 0. This
means that V ∗ (x) is positive definite. Using (9.2.7), we have
Using (9.2.10), we can conclude that V̇ ∗ (x) ≤ −x TQx < 0 for any x = 0. Then,
the conditions for Lyapunov local stability theory are satisfied. Thus, there exists
a neighborhood Φ = {x : x(t) < c} for some c > 0 such that if x(t) ∈ Φ, then
limt→∞ x(t) = 0.
However, x(t) cannot remain forever outside Φ. Otherwise, x(t) ≥ c for all
t ≥ 0. Denote T
q = inf x Qx > 0.
x(t)≥c
Clearly,
∗
V̇(9.2.1) (x) ≤ −x TQx ≤ −q.
Then,
t
∗ ∗ ∗
V (x(t)) − V (x(0)) = V̇(9.2.1) (x(τ ))dτ ≤ −qt. (9.2.14)
0
350 9 Robust and Optimal Guaranteed Cost Control …
V ∗ (x(t)) ≤ V ∗ (x(0)) − qt → −∞ as t → ∞.
This contradicts the fact that V ∗ (x(t)) > 0 for any x = 0. Therefore, limt→∞
x(t) = 0 no matter where the trajectory starts from in Ω. This completes the proof.
optimal control problem without the need for any additional conditions. Otherwise,
the formula (9.2.10) should be satisfied in order to ensure the equivalence of problem
transformation.
In light of Theorem 9.2.1, by acquiring the solution of the HJB equation (9.2.9)
and then deriving the optimal control law (9.2.8), we can obtain the robust control
law for system (9.2.1) in the presence of matched uncertainties. However, due to
the nonlinear nature of the HJB equation, finding its solution is often difficult. In
what follows, we will introduce an online policy iteration (PI) algorithm to solve the
problem based on NN techniques.
Step 1: Choose a small number ε > 0. Let i = 0 and V (0) = 0. Then, start with an
initial admissible control law μ(0) (x) ∈ A (Ω).
Step 2: Let i = i + 1. Based on the control law μ(i−1) (x), solve the nonlinear
Lyapunov equation
T
ρd M
2
(x) + U x, μ(i−1) (x) + ∇V (i) (x) f (x) + g(x)μ(i−1) (x) = 0
1
μ(i) (x) = − R −1 g T(x)∇V (i) (x).
2
Step 4: If |V (i) (x) − V (i−1) (x)| ≤ ε, stop and obtain the approximate optimal con-
trol law μ(i) (x); else, go back to Step 2.
9.2 Robust Control of Uncertain Nonlinear Systems 351
The algorithm will converge to the optimal value function and optimal control law,
i.e., V (i) (x) → V ∗ (x) and μ(i) (x) → u ∗ (x) as i → ∞. The convergence proofs of
the PI algorithm above have been given in [1, 4, 21, 32]
Next, we present the implementation process of the PI algorithm based on a critic
NN. Assume that the value function V (x) is continuously differentiable. According
to the universal approximation property of NNs, V (x) can be reconstructed by an
NN on a compact set Ω as
where ∇V (x) = ∂ V (x)/∂ x, ∇σc (x) = ∂σc (x)/∂ x ∈ Rl×n and ∇εc (x) = ∂εc (x)/
∂ x ∈ Rn are the gradient of the activation function and NN approximation error,
respectively.
Using (9.2.16), the Lyapunov equation (9.2.5) takes the following form:
ρd M
2
(x) + U (x, μ) + WcT ∇σc (x) + (∇εc (x))T ẋ(9.2.2) = 0, (9.2.17)
where we use ẋ(9.2.2) to indicate it is from (9.2.2). Assume that the weight vector Wc ,
the gradient ∇σc (x), and the approximation error εc (x) and its derivative ∇εc (x)
are all bounded on a compact set Ω. We also have εc (x) → 0 and ∇εc (x) → 0 as
l → ∞ [41].
Since the ideal weights are unknown, a critic NN can be built in terms of the
estimated weights as
V̂ (x) = ŴcT σc (x) (9.2.18)
to approximate the value function. In (9.2.18), σc (x) is selected such that V̂ (x) > 0
for any x = 0 and V̂ (x) = 0 when x = 0. Then,
H (x, μ, Ŵc ) = ρd M
2
(x) + U (x, μ) + ŴcT ∇σc (x)ẋ(9.2.2) ec . (9.2.20)
To train the critic NN, it is desired to design Ŵc to minimize the objective function
1 2
Ec = e .
2 c
352 9 Robust and Optimal Guaranteed Cost Control …
We employ the standard steepest descent algorithm to tune the weights of the critic
network, i.e.,
∂ Ec ∂ec
Ŵ˙ c = −αc = −αc ec , (9.2.21)
∂ Ŵc ∂ Ŵc
H (x, μ, Wc ) = ρd M
2
(x) + U (x, μ) + WcT ∇σc (x)ẋ(9.2.2) = ecH , (9.2.22)
where ecH = −(∇εc (x))Tẋ is the residual error due to the NN approximation.
Denote θ = ∇σc (x)ẋ. Assume that there exists a constant θ M > 0 such that θ ≤
θ M , and let the weight estimation error of the critic NN be W̃c = Wc − Ŵc . Then,
considering (9.2.20) and (9.2.22), we have ecH − ec = W̃cTθ. Therefore, the dynamics
of weight estimation error is
1
μ(x) = − R −1 g T(x) (∇σc (x))T Wc + ∇εc (x) . (9.2.24)
2
The approximate control law can be formulated as
1
μ̂(x) = − R −1 g T(x)(∇σc (x))T Ŵc . (9.2.25)
2
Equation (9.2.25) implies that based on the trained critic NN, the approximate control
law can be derived directly. The actor-critic architecture is maintained, but training
of the action NN is not required in this case since we have closed-form solution
available. The diagram of the online PI algorithm is depicted in Fig. 9.1.
9.2 Robust Control of Uncertain Nonlinear Systems 353
Fig. 9.1 Diagram of the online PI algorithm (solid lines represent signals and the dashed line
represents the backpropagation path)
The weight estimation dynamics and the closed-loop system based on the approxi-
mate optimal control law are uniformly ultimately bounded (UUB) as shown in the
following two theorems.
Theorem 9.2.2 For the controlled system (9.2.2), the weight update law for tuning
the critic NN is given by (9.2.21). Then, the dynamics of the weight estimation error
of the critic NN is UUB.
Proof Select the Lyapunov function candidate as L(x) = (1/αc ) W̃cT W̃c . Taking the
time derivative of L(x) along the trajectory of error dynamics (9.2.23) and consid-
ering the Cauchy–Schwarz inequality, it follows
2 T ˙
L̇(x) = W̃ W̃c
αc c
2
= W̃cT αc ecH − W̃cT θ θ
αc
2 2
= ecH αc W̃cT θ − αc W̃cT θ
αc
2 1 2 1
≤ ecH + αc2 (W̃cT θ )2 − αc (W̃cT θ )2
αc 2 2
1 2 2
= ecH − (2 − αc ) W̃cT θ .
αc
2
ecH
|W̃cT θ | > .
αc (2 − αc )
354 9 Robust and Optimal Guaranteed Cost Control …
By employing the dense property of real numbers [38], we derive that there exist a
positive constant κ (0 < κ ≤ θ M ) such that
2
ecH
|W̃cT θ | > κW̃c > . (9.2.26)
αc (2 − αc )
By Lyapunov’s extension theorem [13, 15] (or the Lagrange stability result [30]), the
dynamics of the weight estimation error is UUB. The norm of the weight estimation
error is bounded as well. This completes the proof.
Theorem 9.2.3 For system (9.2.2), the weight update law of the critic NN given by
(9.2.21) and the approximate optimal control law obtained by (9.2.25) ensure that,
for any initial state x0 , there exists a time T (x0 , M) such that x(t) is UUB. Here, the
bound M > 0 is given by
βM
x(t) ≤ ≡ M, t ≥ T,
ρρ02 + λmin (Q)
where β M > 0 and ρ0 > 0 are constants to be determined later and λmin (Q) is the
smallest eigenvalue of Q.
Proof Taking the time derivative of V (x) along the trajectory of (9.2.2) generated
by the approximate control law μ̂(x), we can obtain
1
0 = ρd M
2
(x) + x TQx + (∇V (x))T f (x) − (∇V (x))T g(x)R −1 g T (x)∇V (x).
4
(9.2.29)
1
V̇(9.2.2) = − ρd M
2
(x) − x TQx + (∇V (x))T g(x)R −1 g T(x)∇V (x)
4
+ (∇V (x))T g(x)μ̂. (9.2.30)
9.2 Robust Control of Uncertain Nonlinear Systems 355
Adding and subtracting (∇V (x))T g(x)μ to (9.2.30) and using (9.2.24) and (9.2.25),
it follows
1
V̇(9.2.2) = − ρd M
2
(x) − x TQx + (∇V (x))T g(x)R −1 g T(x)∇V (x)
4
+ (∇V (x))T g(x)μ + (∇V (x))T g(x)μ̂ − (∇V (x))T g(x)μ
1
= − ρd M
2
(x) − x TQx − (∇V (x))T g(x)R −1 g T(x)∇V (x)
4
1 −1 T
+ (∇V (x)) g(x)R g (x) ∇V (x) − ∇ V̂ (x) .
T
(9.2.31)
2
Substituting (9.2.16) and (9.2.19) into (9.2.31), we can further obtain
1
V̇(9.2.2) = − ρd M
2
(x) − x TQx − (∇V (x))T g(x)R −1 g T(x)∇V (x)
4
1 T
+ Wc ∇σc (x) + (∇εc (x))T g(x)R −1 g T(x)
2
× (∇σc (x))T W̃c + ∇εc (x) .
Here, we denote
1 T
β= Wc ∇σc (x) + (∇εc (x))T g(x)R −1 g T (x) (∇σc (x))T W̃c + ∇εc (x) .
2
Considering the fact that R −1 is positive definite, the assumption that Wc , ∇σc (x), and
∇εc (x) are bounded, and Theorem 9.2.2, we can conclude that β is upper bounded
by β ≤ β M , where β M > 0 is a constant. Therefore, V̇(9.2.2) takes the following form:
V̇(9.2.2) ≤ −ρd M
2
(x) − x TQx + β M . (9.2.32)
In many cases, we can determine a quadratic bound of d(x). Under such circum-
stances, we assume that d M (x) = ρ0 x, where ρ0 > 0 is a constant. Then, (9.2.32)
becomes
V̇(9.2.2) ≤ − ρρ02 + λmin (Q) x2 + β M .
Hence, we can observe that V̇(9.2.2) < 0 whenever x(t) lies outside the compact set
βM
Ωx = x : x ≤ .
ρρ02 + λmin (Q)
Therefore, based on the approximate optimal control law, the state trajectories of the
closed-loop system are UUB and x(t) ≤ M. This completes the proof.
In the next theorem, the equivalence of the NN-based HJB solution of the optimal
control problem and the solution of robust control problem is established.
356 9 Robust and Optimal Guaranteed Cost Control …
Theorem 9.2.4 Assume that the NN-based HJB solution of the optimal control prob-
lem exists. Then, the control law defined by (9.2.25) ensures closed-loop asymptotic
stability of uncertain nonlinear system (9.2.1) if the condition described in (9.2.10)
is satisfied.
and μ̂(x) be the approximate optimal control law defined by (9.2.25). Then,
T
2μ̂T(x)R = − ∇ V̂ (x) g(x).
Now, we show that with the approximate optimal control μ̂(x), the closed-loop
system remains asymptotically stable for all possible uncertainties d(x). According
to (9.2.18) and the selection of σc (x), we have V̂ (0) = 0 and V̂ (x) > 0 when x = 0.
Taking the manipulations similar to the proof of Theorem 9.2.1, we can easily obtain
which implies that V̂˙ (x) < 0 for any x = 0. This completes the proof.
Remark 9.2.2 In [47], an iterative algorithm for online design of robust control for a
class of continuous-time nonlinear systems was developed; however, the optimality
of the robust controller with respect to a specified cost function was not discussed. In
fact, recently, there are few results on robust optimal control of uncertain nonlinear
systems based on ADP, not to mention the decentralized optimal control of large-
scale systems. In [48], the robust optimal control scheme for a class of uncertain
nonlinear systems via ADP technique and without using an initial admissible control
was established. In addition, the developed results of [48] were also extended to
deal with the decentralized optimal control for a class of continuous-time nonlinear
interconnected systems.
where x = [x1 , x2 ]T ∈ R2 and u ∈ R are the state and control variables, respectively,
and p is an unknown parameter. The term d(x) = 0.5 px1 sin x2 reflects the uncer-
tainty of the control plant. For simplicity, we assume that p ∈ [−1, 1]. Here, we
choose d M (x) = x and we select ρ = 1 for the purpose of simulation.
We aim at obtaining a robust control law that can stabilize system (9.2.33) for
all possible p. This problem can be formulated as the following optimal control
problem. For the nominal system, we need to find a feedback control law u(x) that
minimizes the cost function
∞
J (x0 ) = x2 + x TQx + u TRu dτ, (9.2.34)
0
and
u ∗ (x) = −(cos(2x1 ) + 2)x2 .
We adopt the online PI algorithm to tackle the optimal control problem for the
nominal system, where a critic network is constructed to approximate the value func-
tion. We denote the weight vector of the critic network as Ŵc = [Ŵc1 , Ŵc2 , Ŵc3 ]T .
During the simulation process, the initial weights of the critic network are chosen
randomly in [0, 2] and the weight normalization is not used. The activation func-
tion of the critic network is chosen as σc (x) = [x12 , x1 x2 , x22 ]T , so the ideal weight is
[1, 0, 2]T . Let the learning rate of the critic network be αc = 0.1 and the initial state
of the controlled plant be x0 = [1, −1]T .
During the implementation process of the PI algorithm, the following small sinu-
soidal exploratory signal with various frequencies will be added to satisfy the PE
condition,
N (t) = sin2 (t) cos(t) + sin2 (2t) cos(0.1t) + sin2 (−1.2t) cos(0.5t)
+ sin5 (t) + sin2 (1.12t) + cos(2.4t) sin3 (2.4t). (9.2.35)
It is introduced into the control input and thus affect the system states. The weights
of the critic network converge to [0.9978, 0.0008, 1.9997]T as shown in Fig. 9.2a,
which displays a good approximation of the ideal ones. Actually, we can observe
that the convergence of the weight occurred after 750 s. Then, the exploratory signal
is turned off. The evolution of the state trajectory is depicted in Fig. 9.2b. We see that
the state converges to zero quickly after the exploratory signal is turned off.
Using (9.2.18) and (9.2.25), the value function and control law can be approxi-
mated as
358 9 Robust and Optimal Guaranteed Cost Control …
(a) (b)
2.5
Weight of the critic network
x x
2 1 2
2
State trajectory
1.5 ωac1 ωac2 ωac3
1
1
0
0.5
0
−1
−0.5
0 200 400 600 800 0 200 400 600 800
Time (s) Time (s)
Fig. 9.2 Simulation results. a Convergence of the weight vector of the critic network (ωac1 , ωac2 ,
and ωac3 represent Ŵc1 , Ŵc2 , and Ŵc3 , respectively). b Evolution of the state trajectory during the
implementation process
⎡ ⎤T ⎡ 2 ⎤
0.9978 x1
V̂ (x) = ⎣ 0.0008 ⎦ ⎣ x1 x2 ⎦
1.9997 x22
and
⎡ ⎤ ⎡ ⎤
T 2x1 0 T 0.9978
1 −1 0 ⎣ x2 x1 ⎦ ⎣ 0.0008 ⎦ ,
μ̂(x) = − R (9.2.36)
2 cos(2x1 ) + 2
0 2x2 1.9997
respectively. The error between the optimal cost function and the approximate one is
presented in Fig. 9.3a. The error between the optimal control law and the approximate
version is displayed in Fig. 9.3b. Both approximation errors are close to zero, which
verifies good performance of the learning algorithm.
(a) (b)
−3
x 10
Approximation error
Approximation error
0.015
1
0.01
0
0.005
−1
0
2 2
2 2
0 0 0 0
x2 −2 −2 x2 −2 −2 x1
x
1
Fig. 9.3 Simulation results. a 3D plot of the approximation error of the cost function, i.e., V ∗ (x) −
V̂ (x). b 3D plot of the approximation error of the control law, i.e., u ∗ (x) − μ̂(x)
9.2 Robust Control of Uncertain Nonlinear Systems 359
2
2
ρd M(x) and d (x)Rd(x)
ρd M(x)
1.5 T
d (x)Rd(x)
T
0.5
2
0
0.5
1
0 0.8
0.6
−0.5 0.4
x2 0.2
−1 0 x
1
0
Va and dVa
−2
−4
−6
−8
0.5
Va
1
0 0.8
dVa
0.6
−0.5 0.4
x 0.2
2
−1 0
x1
Fig. 9.5 The Lyapunov function and its derivative (Va and d Va represent V̂ and V̂˙ , respectively)
360 9 Robust and Optimal Guaranteed Cost Control …
1
x
1
0.8 x
2
0.6
0.4
State trajectory
0.2
−0.2
−0.4
−0.6
−0.8
−1
0 5 10 15
Time (s)
Fig. 9.6 The state trajectory under the robust control law μ̂(x) when setting p = 1
Next, the scalar parameter p = 1 is chosen for evaluating the robust control per-
formance. On the one hand, as shown in Fig. 9.4, the condition of Theorem 9.2.1 is
satisfied. On the other hand, for all the values of x, the Lyapunov function V̂ (x) ≥ 0
and its derivative V̂˙ (x) ≤ 0, which are described in Fig. 9.5. Therefore, the control
law designed by solving the NN-based HJB equation is the robust control of the orig-
inal uncertain nonlinear system (9.2.33). In other words, we can employ the control
law (9.2.36) to stabilize system (9.2.33). In fact, under the action of the control law,
the state vector converges to the equilibrium point. Figure 9.6 presents the evolution
process of the state trajectory when applying the control law to system (9.2.33) for
15 s. These simulation results demonstrate the effectiveness of the robust control
method established in this section.
In this section, we study the optimal guaranteed cost control of uncertain nonlinear
systems using the ADP approach [28]. Consider a class of continuous-time uncertain
nonlinear systems given by
where x(t) ∈ Rn is the state vector and u(t) ∈ Rm is the control input. The known
functions f (·) and g(·) are differentiable in their arguments with f (0) = 0, and
Δf (x(t)) is the nonlinear perturbation of the corresponding nominal system
Assumption 9.3.1 Assume that the uncertainty Δf (x) has the form
where
d T(ϕ(x))d(ϕ(x)) ≤ h T(ϕ(x))h(ϕ(x)). (9.3.4)
In (9.3.3) and (9.3.4), G(·) ∈ Rn×r and ϕ(·) satisfying ϕ(0) = 0 are known functions
denoting the structure of the uncertainty, d(·) ∈ Rr is an uncertain function with
d(0) = 0, and h(·) ∈ Rr is a given function with h(0) = 0.
where
U (x, u) = Q(x) + u TRu,
Next, we will prove that the optimal robust guaranteed cost control problem of
system (9.3.1) can be transformed into an optimal control problem of the nominal
system (9.3.2). The ADP technique can be employed to deal with the optimal control
362 9 Robust and Optimal Guaranteed Cost Control …
problem of system (9.3.2). Note that in this section, the feedback control u(x) is
often written as u for simplicity.
In this section, we show that the guaranteed cost of the uncertain nonlinear system is
closely related to the modified cost function of the nominal system. The next theorem
is from [10] with relaxed conditions.
Theorem 9.3.1 Assume that there exist a continuously differentiable and radially
unbounded value function V (x) satisfying V (x) > 0 for all x = 0 and V (0) = 0, a
bounded function Γ (x) satisfying Γ (x) ≥ 0, and a feedback control function u(x)
such that
where the symbol ∇V (x) denotes the partial derivative of the value function V (x)
with respect to x, i.e., ∇V (x) = ∂ V (x)/∂ x. Then, with the feedback control function
u(x), there exists a neighborhood of the origin such that system (9.3.1) is asymptot-
ically stable. Furthermore,
Proof First, we show the asymptotic stability of system (9.3.1) under the feedback
control u(x). Let
V̇(9.3.1) (x) = (∇V (x))T F̄(x, u).
Considering (9.3.6) and (9.3.7), we obtain V̇(9.3.1) (x) < 0 for any x = 0. This implies
that V (·) is a Lyapunov function for system (9.3.1), which proves the asymptotic
stability.
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems 363
According to the definitions of J (x0 , u) and J¯(x0 , u), we can see that
since Γ (x) ≥ 0.
When Δf (x) = 0, we can still find that (9.3.6)–(9.3.8) are true since Γ (x) ≥ 0.
In this case, we derive
Then,
Based on (9.3.11) and (9.3.12), we can easily find that (9.3.9) is true. This completes
the proof.
Theorem 9.3.1 shows that the bounded function Γ (x) takes an important role in
deriving the guaranteed cost of the controlled system. The following lemma presents
a specific form of Γ (x).
Lemma 9.3.1 For any continuously differentiable and radially unbounded function
V (x), define
1
Γ (x) = h T(ϕ(x)) h(ϕ(x)) + (∇V (x))T G(x) G T(x)∇V (x). (9.3.13)
4
Then,
(∇V (x))T Δf (x) ≤ Γ (x). (9.3.14)
364 9 Robust and Optimal Guaranteed Cost Control …
Remark 9.3.1 For any continuously differentiable and radially unbounded function
V (x), since
(∇V (x))T F̄(x, u) = (∇V (x))T F(x, u) + (∇V (x))T Δf (x), (9.3.15)
we can easily find that the bounded function (9.3.13) satisfies (9.3.6). Note that
Lemma 9.3.1 seems only to imply (9.3.6), but in fact, it presents a specific form of
Γ (x) satisfying (9.3.6), (9.3.7), and (9.3.8). The reason is that (9.3.7) and (9.3.8) are
implicit assumptions of Theorem 9.3.1, noticing the framework of the generalized
HJB equation [4] and the fact that (∇V (x))T F(x, u) + Γ (x) = −U (x, u) < 0 when
x = 0. Hence, it can be used for problem transformation. In fact, based on (9.3.6)
and (9.3.15), we can find that the positive semidefinite bounded function Γ (x) gives
an upper bound of the term (∇V (x))T Δf (x), which facilitates us to solve the opti-
mal robust guaranteed cost control problem of a class of nonlinear systems with
uncertainties.
Remark 9.3.2 It is important to note that Theorem 9.3.1 indicates the existence of
the guaranteed cost of the uncertain nonlinear system (9.3.1). In addition, in order to
derive the optimal guaranteed cost controller, we should minimize the upper bound
J (x0 , u) with respect to u. Therefore, we should solve the optimal control problem
of system (9.3.2) with V (x0 ) considered as the value function.
we have
1 T
lim V (x(T )) − V (x0 ) + U (x, u) + Γ (x) dτ = 0. (9.3.17)
T →0 T 0
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems 365
where J (x0 , u) is given in (9.3.10). Note that V ∗ (x) satisfies the modified HJB
equation
0 = min H (x, u, ∇V ∗ (x)), (9.3.19)
u∈A (Ω)
where ∇V ∗ (x) = ∂ V ∗ (x)/∂ x. Assume that the minimum on the right-hand side of
(9.3.19) exists and is unique. Then, the optimal control of system (9.3.2) is
Assumption 9.3.2 Consider system (9.3.2) with value function (9.3.16) and the
optimal feedback control function (9.3.20). Let L s (x) be a continuously differentiable
Lyapunov function candidate formed as a polynomial and satisfying
where ∇ L s (x) = ∂ L s (x)/∂ x. Assume there exists a positive definite matrix Λ(x)
such that the following relation holds:
Remark 9.3.3 This is a common assumption that has been used in the literature, for
instance [7, 36, 60], to facilitate discussing the stability issue of closed-loop system.
According to [7], we assume that the closed-loop dynamics with optimal control is
bounded by a function of system state on the compact set of this section. Without
loss of generality, we assume that
Let λm and λ M be the minimum and maximum eigenvalues of matrix Λ(x), then
Therefore, by noticing (9.3.23) and (9.3.24), we can conclude that the Assump-
tion 9.3.2 is reasonable. Specifically, in this section, L s (x) can be obtained by prop-
erly selecting a polynomial when implementing the ADP method.
The following theorem illustrates how to develop the optimal robust guaranteed
cost control scheme for system (9.3.1).
Theorem 9.3.2 Consider system (9.3.1) with cost function (9.3.5). Suppose the mod-
ified HJB equation (9.3.22) has a continuously differentiable solution V ∗ (x). Then,
for any admissible control function u, the cost function (9.3.5) satisfies
J¯(x0 , u) ≤ Φ(u),
where ∞
∗
Φ(u) V (x0 ) + (u − u ∗ )TR(u − u ∗ )dτ.
0
Moreover, the optimal robust guaranteed cost of the controlled uncertain nonlinear
system is given by
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems 367
Φ ∗ = Φ(u ∗ ) = V ∗ (x0 ).
Clearly, the optimal robust guaranteed cost can be obtained when setting u = u ∗ ,
i.e., Φ(u ∗ ) = V ∗ (x0 ). Furthermore, we can derive that
368 9 Robust and Optimal Guaranteed Cost Control …
and
ū ∗ = arg inf Φ(u) = u ∗ .
u
Remark 9.3.4 According to Theorem 9.3.2, the optimal robust guaranteed cost con-
trol of uncertain nonlinear system is transformed into the optimal control of nominal
system, where the modified cost function is considered as the upper bound function. In
other words, once the solution of the modified HJB equation (9.3.22) corresponding
to nominal system (9.3.2) is derived, we can establish the optimal robust guaranteed
cost control scheme of system (9.3.1).
In this section, inspired by the work of [5, 7, 41], an improved online technique
without utilizing the iterative strategy and an initial stabilizing control is developed
by constructing a single network, namely the critic network. Here, the ADP method
is introduced to the framework of infinite horizon optimal robust guaranteed cost
control of nonlinear systems with uncertainties.
Assume that the value function V (x) is continuously differentiable. According
to the universal approximation property of NNs, V (x) can be reconstructed by a
single-layer NN on a compact set Ω as
Following the framework of [5, 7, 41], we assume that the weight vector Wc , the
gradient ∇σc (x), and the approximation error εc (x) and its derivative ∇εc (x) are all
bounded on a compact set Ω.
Since the ideal weights are unknown, a critic NN can be built in terms of the
estimated weights as
V̂ (x) = ŴcTσc (x) (9.3.32)
to approximate the value function. Under the framework of ADP method, the selec-
tion of the activation function of the critic network is often a natural choice guided
by engineering experience and intuition [1, 4]. Then,
∂ V̂ (x)
where ∇ V̂ (x) = .
∂x
According to (9.3.20) and (9.3.31), we derive
1
u(x) = − R −1 g T(x) (∇σc (x))T Wc + ∇εc (x) , (9.3.34)
2
which, in fact, represents the expression of optimal control u ∗ (x) if the value function
in (9.3.30) is considered as the optimal one V ∗ (x). Besides, in light of (9.3.20) and
(9.3.33), the approximate control function can be given as
1
û(x) = − R −1 g T(x)(∇σc (x))T Ŵc . (9.3.35)
2
Applying (9.3.35) to system (9.3.2), the closed-loop system dynamics is expressed
as
1
ẋ = f (x) − g(x)R −1 g T(x)(∇σc (x))T Ŵc . (9.3.36)
2
Recalling the definition of the Hamiltonian (9.3.18) and the modified HJB equa-
tion (9.3.19), we can easily obtain
H (x, u ∗ , ∇V ∗ ) = 0.
370 9 Robust and Optimal Guaranteed Cost Control …
The NN expressions (9.3.31) and (9.3.34) imply that u ∗ and ∇V ∗ can be formulated
based on the ideal weight of the critic network, i.e., Wc . As a result, the Hamiltonian
becomes
H (x, Wc ) = 0,
where
1
ecH = (∇εc (x))T f (x) − (∇εc (x))T g(x)R −1 g T(x)(∇σc (x))T Wc
2
1
− (∇εc (x)) g(x)R −1 g T(x)∇εc (x)
T
4
1
+ (∇εc (x))T G(x)G T(x)(∇σc (x))T Wc
2
1
+ (∇εc (x))T G(x)G T(x)∇εc (x). (9.3.38)
4
In Eq. (9.3.38), ecH denotes the residual error generated due to NN approximation.
Then, using the estimated weight vector, the approximate Hamiltonian can be
derived as
Then, based on (9.3.37), (9.3.39), and (9.3.40), we can obtain the formulation of ec
in terms of W̃c as follows:
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems 371
For training the critic network, it is desired to design Ŵc to minimize the objective
function
1
E c = ec2 . (9.3.42)
2
Here, the weights of the critic network are tuned based on the standard steepest
descent algorithm with an additional term introduced to ensure the boundedness of
system state, i.e.,
∂ Ec
Ŵ˙ c = −αc
1
+ αs Π (x, û)∇σc (x)g(x)R −1 g T(x)∇ L s (x), (9.3.43)
∂ Ŵc 2
where αc > 0 is the learning rate of the critic network, αs > 0 is the learning rate of
the additional term, and L s (x) is the Lyapunov function candidate given in Assump-
tion 9.3.2. In (9.3.43), the Π (x, û) is the additional stabilizing term defined as
0, if J˙s (x) = (∇ L s (x))T F(x, û) < 0,
Π (x, û) =
1, else.
∂ec 1
= ∇σc (x) f (x) − ∇σc (x)g(x)R −1 g T(x)(∇σc (x))T Ŵc
∂ Ŵc 2
1
+ ∇σc (x)G(x)G T(x)(∇σc (x))T Ŵc . (9.3.44)
2
In light of (9.3.40), (9.3.42), and (9.3.43), the dynamics of the weight estimation
error is
˙ ∂ec 1
W̃c = αc ec − αs Π (x, û)∇σc (x)g(x)R −1 g T(x)∇ L s (x). (9.3.45)
∂ Ŵc 2
372 9 Robust and Optimal Guaranteed Cost Control …
Fig. 9.7 Structural diagram of NN implementation (solid lines represent the signals and dashed
lines represent the backpropagation paths)
Then, combining (9.3.40), (9.3.41), and (9.3.44), the error dynamics (9.3.45) becomes
W̃˙ c = αc − W̃cT ∇σc (x) f (x) − W̃cT ∇σc (x)g(x)R −1 g T(x)(∇σc (x))T W̃c
1
4
1
+ W̃cT ∇σc (x)g(x)R −1 g T(x)(∇σc (x))T Wc
2
1
+ W̃cT ∇σc (x)G(x)G T(x)(∇σc (x))T W̃c
4
1
− W̃cT ∇σc (x)G(x)G T(x)(∇σc (x))T Wc − ecH
2
1
× ∇σc (x) f (x) − ∇σc (x)g(x)R −1 g T(x)(∇σc (x))T Wc
2
1 −1 T
+ ∇σc (x)g(x)R g (x)(∇σc (x))T W̃c
2
1
+ ∇σc (x)G(x)G T(x)(∇σc (x))T Wc
2
1
− ∇σc (x)G(x)G T(x)(∇σc (x))T W̃c
2
1
− αs Π (x, û)∇σc (x)g(x)R −1 g T(x)∇ L s (x). (9.3.46)
2
For continuous-time uncertain nonlinear systems (9.3.1) satisfying (9.3.3) and
(9.3.4), we summarize the design procedure of optimal robust guaranteed cost control
as follows.
Step 1: Select G(x) and ϕ(x), determine h(ϕ(x)), and conduct the problem trans-
formation based on the bounded function Γ (x) as in (9.3.13).
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems 373
Step 2: Choose the Lyapunov function candidate L s (x), construct a critic network
as (9.3.32), and set its initial weights to zeros.
Step 3: Solve the transformed optimal control problem via online solution of the
modified HJB equation, using the expressions of approximate control func-
tion (9.3.35), approximate Hamiltonian (9.3.39), and weights update crite-
rion (9.3.43).
Step 4: Derive the optimal robust guaranteed cost and optimal robust guaranteed
cost control of original uncertain nonlinear system based on the converged
weights of critic network.
Remark 9.3.5 It is observed from (9.3.32) and (9.3.39), both the approximate value
function and the approximate Hamiltonian become zero when x = 0. In this case,
we can find that Ŵ˙ c = 0. Thus, when the system state converges to zero, the weights
of the critic network are no longer updated. This can be viewed as a PE requirement
of the NN inputs. In other words, the system state must be persistently exciting
long enough in order to ensure the critic network to learn the optimal value function
as accurately as possible. In this chapter, the PE condition is satisfied by adding a
small exploratory signal to the control input. The condition can be removed once the
weights of the critic network converge to their target values. Actually, it is for this
reason that there always exists a trade-off between computational accuracy and time
consumption for practical realization.
Next, the stability analysis of the NN-based feedback control system is presented
using the Lyapunov theory.
In this section, the error dynamics of the critic network and the closed-loop system
based on the approximate optimal control are proved to be UUB.
Theorem 9.3.3 Consider the nonlinear system given by (9.3.2). Let the control input
be provided by (9.3.35) and the weights of the critic network be tuned by (9.3.43).
Then, the state x of the closed-loop system and the weight estimation error W̃c of the
critic network are UUB.
1 αs
L(x) = W̃ T W̃c + L s (x), (9.3.47)
2αc c αc
374 9 Robust and Optimal Guaranteed Cost Control …
where L s (x) is presented in Assumption 9.3.2. The derivative of the Lyapunov func-
tion candidate (9.3.47) with respect to time along the solutions of (9.3.36) and (9.3.46)
is
1 T ˙ αs
L̇(x) = W̃c W̃c + (∇ L s (x))T ẋ. (9.3.48)
αc αc
1
L̇(x) = W̃cT − W̃cT ∇σc (x) f (x) − W̃cT ∇σc (x)g(x)R −1 g T(x)(∇σc (x))T W̃c
4
1 T −1 T
+ W̃c ∇σc (x)g(x)R g (x)(∇σc (x))T Wc
2
1
+ W̃cT ∇σc (x)G(x)G T(x)(∇σc (x))T W̃c
4
1
− W̃cT ∇σc (x)G(x)G T(x)(∇σc (x))T Wc − ecH
2
1
× ∇σc (x) f (x) − ∇σc (x)g(x)R −1 g T(x)(∇σc (x))T Wc
2
1
+ ∇σc (x)g(x)R −1 g T(x)(∇σc (x))T W̃c
2
1
+ ∇σc (x)G(x)G T(x)(∇σc (x))T Wc
2
1
− ∇σc (x)G(x)G T(x)(∇σc (x))T W̃c
2
αs αs
− Π (x, û)W̃cT ∇σc (x)g(x)R −1 g T(x)∇ L s (x) + (∇ L s (x))T ẋ.
2αc αc
(9.3.49)
1 T 1 1 1
L̇(x) = − W̃cT ∇σc (x) f (x) + W̃ A W̃c − W̃cT AWc − W̃cT B W̃c + W̃cT BWc + ecH
4 c 2 4 2
1 1 1 1
× W̃cT ∇σc (x) f (x) + W̃cT A W̃c − W̃cT AWc − W̃cT B W̃c + W̃cT BWc
2 2 2 2
αs αs
− Π(x, û)W̃cT ∇σc (x)g(x)R −1 g T(x)∇ L s (x) + (∇ L s (x))T ẋ.
2αc αc
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems 375
1 1 1
L̇(x) = − W̃cT ∇σc (x)ẋ − W̃cT A W̃c − W̃cT B W̃c + W̃cT BWc + ecH
4 4 2
1 1
× W̃cT ∇σc (x)ẋ − W̃cT B W̃c + W̃cT BWc
2 2
αs
− Π (x, û)W̃cT ∇σc (x)g(x)R −1 g T(x)∇ L s (x)
2αc
αs
+ (∇ L s (x))T ẋ.
αc
and
ẋ = f (x) + g(x)û,
1 1
ẋ = ẋ ∗ + g(x)R −1 g T (x)(∇σc (x))T W̃c + g(x)R −1 g T(x)∇εc (x).
2 2
Then, we have
1
L̇(x) = − W̃cT ∇σc (x)ẋ ∗ + W̃cT A W̃c
4
1 T 1 1
+ W̃c ∇σc (x)g(x)R −1 g T(x)∇εc (x) − W̃cT B W̃c + W̃cT BWc + ecH
2 4 2
∗ 1 T
× W̃c ∇σc (x)ẋ + W̃c A W̃c
T
2
1 T 1 1
+ W̃c ∇σc (x)g(x)R −1 g T(x)∇εc (x) − W̃cT B W̃c + W̃cT BWc
2 2 2
αs αs
− Π (x, û)W̃cT ∇σc (x)g(x)R −1 g T(x)∇ L s (x) + (∇ L s (x))T ẋ.
2αc αc
(9.3.50)
As in the work of [7], we assume that λ1m > 0 and λ1M > 0 are the lower
and upper bounds of the norm of matrix A. Similarly, assume that λ2m > 0 and
λ2M > 0 are the lower and upper bounds of the norm of matrix B. Assume that
R −1 ≤ R −1
M , g(x) ≤ g M , ∇σ (x) ≤ σd M , BWc ≤ λ4 , ∇εc (x) ≤ λ10 , and
ecH ≤ λ12 , where R −1
M , g M , σd M , λ4 , λ10 , and λ12 are positive constants. In addi-
∗
tion,
√ assume that ∇σ c (x) ẋ ≤ λ3 , where λ3 is a positive constant. Let λ5 =
( 6/2)λ12 , λ9 = g M R M , and λ11 = σd M g 2M R −1
2 −1 −1 T
M λ10 , then g(x)R g (x) ≤ λ9
and ∇σ (x)g(x)R −1 g T (x)∇εc (x) ≤ λ11 . Using the relations
376 9 Robust and Optimal Guaranteed Cost Control …
1 b 2 b2
ab = − φ+ a − + φ+ a + 2 ,
2 2
2 φ+ φ+
1 b 2 b2
−ab = − φ− a + − φ− a − 2 ,
2 2
2 φ− φ−
we have
3
− (W̃cT ∇σc (x)ẋ ∗ )(W̃cTA W̃c )
4
3 W̃ TA W̃c 2 (W̃cTA W̃c )2
=− φ1 W̃cT ∇σc (x)ẋ ∗ + c − φ12 (W̃cT ∇σc (x)ẋ ∗ )2 −
8 φ1 φ12
3 2 T (W̃cT A W̃c )2
≤ φ1 (W̃c ∇σc (x)ẋ ∗ )2 +
8 φ12
3 2 3
≤ λ1M W̃c 4 + φ12 λ23 W̃c 2 ,
8φ12 8
where φ+ , φ− , and φ1 are nonzero constants. Other terms in the expression of (9.3.50)
can be handled the same way. Then, we can find that
where
1 1 3 3
λ7 = λ21m + λ22m − 2 λ21M − 2 λ22M
8 8 8φ1 8φ2
3 3 3 3
− φ32 λ21M − φ42 λ21M − φ52 λ211 − φ62 λ22M ,
16 16 16 16
3 3 3 2 3 2
λ8 = φ12 λ23 + φ22 λ23 + λ +
2 11
λ
8 8 16φ3 16φ42 4
3 2 3 2
+ λ +
2 11
λ ,
16φ5 16φ62 4
and φi , i = 1, 2, . . . , 6, are nonzero design constants. Note that with proper choices
of φi , i = 1, 2, . . . , 6, the relation λ7 > 0 can be guaranteed.
We will consider next the cases of Π (x, û) = 0 and Π (x, û) = 1, respectively.
Case 1: Π (x, û) = 0. Since (∇ L s (x))T ẋ < 0, we have −(∇ L s (x))T ẋ > 0.
According to the density property of real numbers [38], there exists a positive con-
stant λ6 such that 0 < λ6 ∇ L s (x) ≤ −(∇ L s (x))T ẋ holds for all x ∈ Ω, i.e.,
or
αc (4λ25 λ7 + λ28 )
∇ L s (x) ≥ B1 (9.3.52)
4αs λ6 λ7
αs λ2
λ6 ∇ L s (x) ≥ max {−λ7 W̃c 4 + λ8 W̃c 2 + λ25 } = λ25 + 8 .
αc W̃c 2 4λ7
to the right-hand side of (9.3.51) and taking Assumption 9.3.2 into consideration
yield
or
1 1
u ∗ − û = − R −1 g T(x)(∇σ (x))T W̃c − R −1 g T(x)∇εc (x).
2 2
In light of Theorem 9.3.3, we have W̃c < A , where A is defined in the proof above.
Then, the terms R −1 g T(x)(∇σ (x))T W̃c and R −1 g T(x)∇εc (x) are both bounded. Thus,
we can further determine
1 −1 1
u ∗ − û ≤ R M g M σd M A + R −1 g M λ10 εu ,
2 2 M
where λ10 is given in the proof of Theorem 9.3.3 and εu is the finite bound. This
completes the proof.
In this section, two simulation examples are provided to demonstrate the effectiveness
of the optimal robust guaranteed cost control strategy derived based on the online
HJB solution. We first consider a continuous-time linear system and then a nonlinear
system, both with system uncertainty.
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems 379
where x = [x1 , x2 ]T and Δf (x) = [ px1 sin x2 , 0]T with p ∈ [−0.5, 0.5]. According
to the form of system uncertainty, we choose G(x) = [1, 0]T and ϕ(x) = x. Then,
d(ϕ(x)) = px1 sin x2 . Besides, we select h(ϕ(x)) = 0.5x1 sin x2 .
In this example, we first choose Q(x) = x Tx, R = I , where I is an identity matrix
with suitable dimension. In order to solve the transformed optimal control problem,
a critic network is constructed to approximate the value function as
Let the initial state of the controlled plant be x0 = [1, −1]T . Select the Lyapunov
function candidate of the weights tuning criterion as L s (x) = (1/2)x Tx. Let the
learning rate of the critic network and the additional term be αc = 0.8 and αs = 0.5,
respectively. During the NN implementation process, the exploratory signal N (t)
given in (9.2.35) is added to the control input to satisfy the PE condition. It is
introduced into the control input and thus affects the system state. After a learning
session, the weights of the critic network converge to [0.3461, −0.1330, 0.1338]T
as shown in Fig. 9.8. Here, it is important to note that the initial weights of the critic
network are all set to zeros, which implies that no initial stabilizing control is needed
0.4
0.3
0.2
Weight of the critic network
0.1
-0.1
-0.2
-0.3
ω
ac1
ω
-0.4 ac2
ω
ac3
-0.5
0 20 40 60 80 100 120 140 160 180 200
Time (s)
Fig. 9.8 Convergence of weight vector of the critic network (ωac1 , ωac2 , and ωac3 represent Ŵc1 ,
Ŵc2 , and Ŵc3 , respectively)
380 9 Robust and Optimal Guaranteed Cost Control …
0.4
0.3
0.2
Weight of the critic network
0.1
-0.1
-0.2
-0.3 ω ac1
ω ac2
-0.4
ω
ac3
-0.5
0 1 2 3 4 5 6 7 8 9 10
Time (s)
Fig. 9.9 Updating process of weight vector during the first 10 s (ωac1 , ωac2 , and ωac3 represent
Ŵc1 , Ŵc2 , and Ŵc3 , respectively)
for implementing the control strategy. This can be verified by observing Fig. 9.9,
which displays the updating process of weight vector during the first 10 s.
Based on the converged weight vector, the optimal robust guaranteed cost of the
controlled system is Φ(u ∗ ) = V ∗ (x0 ) = 0.6129. Next, the scalar parameter p = 0.5
is chosen for evaluating the control performance. Under the action of the obtained
control function, the system trajectory during the first 20 s is presented in Fig. 9.10,
which shows good performance of the control approach.
Next, we set Q(x) = 8x Tx, R = 5I , and conduct the NN implementation again by
increasing the learning rates of the critic network and the additional term properly. In
this case, the weights of the critic network converge to [5.4209, −3.5088, 1.2605]T ,
which is depicted in Fig. 9.11. Similarly, the system trajectory during the first 20 s
when choosing p = 0.5 is displayed in Fig. 9.12. These simulation results show that
the parameters Q(x) and R play an important role in the design process. In addition,
the power of the present control technique is demonstrated again.
Example 9.3.2 Consider the following continuous-time nonlinear system:
⎡ ⎤ ⎡ ⎤
−x1 + x2 0
ẋ = ⎣ 0.1x1 − x2 − x1 x3 ⎦ + ⎣ 1 ⎦ u + Δf (x), (9.3.53)
x1 x2 − x3 0
where x = [x1 , x2 , x3 ]T ,
1
x1
0.8
x
2
0.6
0.4
System state
0.2
-0.2
-0.4
-0.6
-0.8
-1
0 2 4 6 8 10 12 14 16 18 20
Time (s)
4
Weight of the critic network
0 ω
ac1
ω
ac2
-2 ω
ac3
-4
-6
0 20 40 60 80 100 120 140 160 180 200
Time (s)
Fig. 9.11 Convergence of weight vector of the critic network (ωac1 , ωac2 , and ωac3 represent Ŵc1 ,
Ŵc2 , and Ŵc3 , respectively)
382 9 Robust and Optimal Guaranteed Cost Control …
1
x1
0.8
x2
0.6
0.4
System state
0.2
-0.2
-0.4
-0.6
-0.8
-1
0 2 4 6 8 10 12 14 16 18 20
Time (s)
and p ∈ [−1, 1]. Similarly, if we choose G(x) = [0, 0, 1]T and ϕ(x) = x based on
the form of system uncertainty, then d(ϕ(x)) = px1 sin x2 cos x3 . Clearly, we can
select h(ϕ(x)) = x1 sin x2 cos x3 .
In this example, Q(x) and R are chosen the same as the first case of Example 9.3.1.
However, the critic network is constructed by using the following equation:
V̂ (x) = Ŵc1 x12 + Ŵc2 x22 + Ŵc3 x32 + Ŵc4 x1 x2 + Ŵc5 x1 x3 + Ŵc6 x2 x3 + Ŵc7 x14
+ Ŵc8 x24 + Ŵc9 x34 + Ŵc10 x12 x22 + Ŵc11 x12 x32 + Ŵc12 x22 x32 + Ŵc13 x12 x2 x3
+ Ŵc14 x1 x22 x3 + Ŵc15 x1 x2 x32 + Ŵc16 x13 x2 + Ŵc17 x13 x3 + Ŵc18 x1 x23
+ Ŵc19 x23 x3 + Ŵc20 x1 x33 + Ŵc21 x2 x33 .
Here, let the initial state of the controlled system be x0 = [1, −1, 0.5]T . Besides,
let the learning rate of the critic network and the additional term be αc = 0.3
and αs = 0.5, respectively. Same as earlier, a small exploratory signal N (t) (cf.
(9.2.35)) is added to satisfy the PE condition during the NN implementation process.
Besides, all the elements of the weight vector of critic network are initialized
to zero. After a sufficient learning session, the weights of the critic network converge
to [0.4759, 0.5663, 0.1552, 0.4214, 0.0911, 0.0375, 0.0886, −0.0099, 0.0986,
0.1539, 0.0780, −0.0192, −0.1335, −0.0052, −0.0639, −0.1583, 0.0456,
0.0576, −0.0535, 0.0885, −0.0227]T .
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems 383
1
x1
0.8
x
2
0.6 x
3
0.4
System state
0.2
-0.2
-0.4
-0.6
-0.8
-1
0 2 4 6 8 10 12 14 16 18 20
Time (s)
Similarly, the optimal robust guaranteed cost of the nonlinear system is Φ(u ∗ ) =
∗
V (x0 ) = 1.1841. In this example, the scalar parameter p = −1 is chosen for eval-
uating the robust control performance. The system trajectory is depicted in Fig. 9.13
when applying the obtained control to system (9.3.53) for 20 s. These simulation
results verify the effectiveness of the present control approach.
9.4 Conclusions
In this chapter, a novel strategy is developed to solve the robust control problem of a
class of uncertain nonlinear systems. The robust control problem is transformed into
an optimal control problem with appropriate cost function. The online PI algorithm
is presented to solve the HJB equation by constructing a critic network. Then, a
strategy is developed to derive the optimal guaranteed cost control of uncertain
nonlinear systems. This is accomplished by properly modifying the cost function to
account for system uncertainty, so that the solution of the transformed optimal control
problem also solves the optimal robust guaranteed cost control problem of the original
system. A critic network is constructed to solve the modified HJB equation online.
Several simulation examples are presented to reinforce the theoretical results.
Chapter 10
Decentralized Control of Continuous-Time
Interconnected Nonlinear Systems
10.1 Introduction
forward-in-time [27, 28]. In recent years, great efforts have been devoted to ADP-related research in both theory and applications [1, 5–8, 10–12, 15, 16, 23–25]. In light of [9, 10, 29], the ADP technique is closely related to reinforcement learning in the context of feedback control. Policy iteration (PI) is a fundamental algorithm of reinforcement learning-based ADP in optimal control. Vrabie and Lewis [24] derived an integral reinforcement learning method to obtain direct adaptive optimal control for nonlinear input-affine continuous-time systems with partially unknown dynamics. Jiang and Jiang [5] presented a novel PI approach for continuous-time linear systems with completely unknown dynamics. Lee et al. [8] presented an integral Q-learning algorithm for continuous-time systems without exact knowledge of the system dynamics.
In this chapter, we first employ the online PI algorithm to tackle the decentralized
control problem for a class of large-scale interconnected nonlinear systems [13]. The
decentralized control strategy can be established by adding appropriate feedback
gains to the local optimal control laws. The online PI algorithm is developed to
solve the HJB equations related to the optimal control problem by constructing and
training some critic neural networks. The uniform ultimate boundedness (UUB) of
the dynamics of the NN weight estimation errors is analyzed by using the Lyapunov
approach. Since it is difficult to obtain the exact knowledge of the system dynamics for
large-scale nonlinear systems, such as chemical engineering processes, transportation
systems, and power systems, we relax the assumptions of exact knowledge of the
system dynamics required in the optimal controller design presented in [13]. We
use an online model-free integral PI algorithm to solve the decentralized control problem for a class of continuous-time large-scale interconnected nonlinear systems [14]. The development of the decentralized control laws relies on solving the optimal control problems of the isolated subsystems with unknown dynamics. To implement this algorithm, a critic NN and an action NN are used to approximate the value function and the control law of each isolated subsystem, respectively.
10.2 Decentralized Control of Interconnected Nonlinear Systems

Consider a class of continuous-time large-scale systems composed of N interconnected nonlinear subsystems described by

\dot{x}_i(t) = f_i(x_i(t)) + g_i(x_i(t))\big[\bar{u}_i(x_i(t)) + \bar{Z}_i(x(t))\big], \quad i = 1, \ldots, N, \qquad (10.2.1)

where x_i(t) \in R^{n_i} and \bar{u}_i(t) \in R^{m_i} are the state vector and the control vector of the ith subsystem, respectively. In large-scale system (10.2.1), x = [x_1^T, x_2^T, \ldots, x_N^T]^T \in R^n denotes the overall state, where n = \sum_{i=1}^{N} n_i. Correspondingly, x_1, x_2, \ldots, x_N are called local states while \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N are local controls. Note that for subsystem i, f_i(x_i), g_i(x_i), and g_i(x_i)\bar{Z}_i(x) represent the nonlinear internal dynamics, the input gain matrix, and the interconnected term, respectively.
Let xi (0) = xi0 be the initial state of the ith subsystem, i = 1, . . . , N. Additionally,
we let the following assumptions hold throughout this chapter.
Assumption 10.2.1 The state vector xi = 0 is the equilibrium of the ith subsystem,
i = 1, . . . , N.
Assumption 10.2.2 The functions fi (·) and gi (·) are differentiable in their arguments
with fi (0) = 0, where i = 1, . . . , N.
Assumption 10.2.3 The feedback control vector ūi (xi ) = 0 when xi = 0, where
i = 1, . . . , N.
Let R_i \in R^{m_i \times m_i}, i = 1, \ldots, N, be symmetric positive-definite matrices. Then, we denote Z_i(x) = R_i^{1/2}\bar{Z}_i(x), where Z_i(x) \in R^{m_i}, i = 1, \ldots, N, are bounded as follows:

\|Z_i(x)\| \le \sum_{j=1}^{N} \rho_{ij}\, h_{ij}(x_j), \quad i = 1, \ldots, N. \qquad (10.2.2)

In (10.2.2), \rho_{ij} are nonnegative constant parameters and h_{ij}(x_j) are positive-semidefinite functions with i, j = 1, \ldots, N.
If we define h_i(x_i) = \max\{h_{1i}(x_i), h_{2i}(x_i), \ldots, h_{Ni}(x_i)\}, i = 1, \ldots, N, then (10.2.2) can be formulated as

\|Z_i(x)\| \le \sum_{j=1}^{N} \lambda_{ij}\, h_j(x_j), \quad i = 1, \ldots, N, \qquad (10.2.3)

where

\lambda_{ij} \ge \rho_{ij}\,\frac{h_{ij}(x_j)}{h_j(x_j)}, \quad i, j = 1, \ldots, N,
HJB equations. Then, the decentralized control strategy can be constructed based on
the optimal control laws.
Consider the N isolated subsystems corresponding to (10.2.1) given by

\dot{x}_i(t) = f_i(x_i(t)) + g_i(x_i(t))\, u_i(x_i(t)), \quad i = 1, \ldots, N. \qquad (10.2.4)
For the ith isolated subsystem, we further assume that fi +gi ui is Lipschitz continuous
on a set Ωi in Rni containing the origin, and the subsystem is controllable in the sense
that there exists a continuous control policy on Ωi that asymptotically stabilizes the
subsystem.
In this section, in order to deal with the infinite-horizon optimal control problem,
we need to find the control laws ui (xi ), i = 1, . . . , N, which minimize the cost
functions
J_i(x_{i0}) = \int_0^{\infty} \big[ Q_i^2(x_i(\tau)) + u_i^T(x_i(\tau)) R_i u_i(x_i(\tau)) \big] \mathrm{d}\tau, \quad i = 1, \ldots, N, \qquad (10.2.5)
Based on the theory of optimal control, the designed feedback controls must not
only stabilize the subsystems on Ωi , i = 1, . . . , N, but also guarantee that the cost
functions (10.2.5) are finite. In other words, the control laws must be admissible. We
use Ai (Ωi ) to denote the set of admissible controls of subsystem i on Ωi .
For any admissible control laws μi ∈ Ai (Ωi ), i = 1, . . . , N, if the associated
value functions
V_i(x_i) = \int_t^{\infty} \big[ Q_i^2(x_i(\tau)) + \mu_i^T(x_i(\tau)) R_i \mu_i(x_i(\tau)) \big] \mathrm{d}\tau, \quad i = 1, \ldots, N, \qquad (10.2.7)
are continuously differentiable, then the infinitesimal versions of (10.2.7) are the
so-called nonlinear Lyapunov equations
0 = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) + \nabla V_i^T(x_i)\big( f_i(x_i) + g_i(x_i)\mu_i(x_i) \big), \quad i = 1, \ldots, N. \qquad (10.2.8)

Define the Hamiltonian functions of the isolated subsystems as

H_i(x_i, \mu_i, \nabla V_i(x_i)) = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) + \nabla V_i^T(x_i)\big( f_i(x_i) + g_i(x_i)\mu_i(x_i) \big),

where i = 1, \ldots, N.
The optimal cost functions of the N isolated subsystems can be formulated as
J_i^*(x_i) = \min_{\mu_i \in \mathcal{A}_i(\Omega_i)} \int_t^{\infty} \big[ Q_i^2(x_i(\tau)) + \mu_i^T(x_i(\tau)) R_i \mu_i(x_i(\tau)) \big] \mathrm{d}\tau, \quad i = 1, \ldots, N. \qquad (10.2.9)

In view of the theory of optimal control, the optimal cost functions J_i^*(x_i), i = 1, \ldots, N, satisfy the HJB equations

\min_{\mu_i \in \mathcal{A}_i(\Omega_i)} H_i(x_i, \mu_i, \nabla J_i^*(x_i)) = 0, \quad i = 1, \ldots, N, \qquad (10.2.10)

where

\nabla J_i^*(x_i) = \frac{\partial J_i^*(x_i)}{\partial x_i}, \quad i = 1, \ldots, N.

Assume that the minima on the left-hand side of (10.2.10) exist and are unique. Then, the optimal control laws for the N isolated subsystems are

u_i^*(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i), \quad i = 1, \ldots, N. \qquad (10.2.11)

Substituting (10.2.11) into (10.2.10), the HJB equations become

Q_i^2(x_i) + (\nabla J_i^*(x_i))^T f_i(x_i) - \frac{1}{4}(\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i) = 0 \qquad (10.2.12)
with Ji∗ (0) = 0 and i = 1, . . . , N.
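As a quick sanity check of (10.2.11) and (10.2.12), consider a hypothetical scalar subsystem \dot{x}_i = a x_i + b u_i with Q_i(x_i) = |x_i| and R_i = r > 0; these choices are illustrative and are not taken from the examples of this chapter. Guessing a quadratic cost J_i^*(x_i) = p x_i^2 with p > 0, equation (10.2.12) reduces to an algebraic equation in p:

x_i^2 + 2ap\,x_i^2 - \frac{1}{4}(2p x_i)\,\frac{b^2}{r}\,(2p x_i) = 0
\;\Longrightarrow\; \frac{b^2}{r}\,p^2 - 2ap - 1 = 0
\;\Longrightarrow\; p = \frac{r}{b^2}\Big( a + \sqrt{a^2 + \tfrac{b^2}{r}} \Big),

and (10.2.11) gives u_i^*(x_i) = -\tfrac{1}{2r}\,b\,(2p x_i) = -\tfrac{pb}{r}\,x_i, so the closed-loop dynamics \dot{x}_i = \big( a - \tfrac{pb^2}{r} \big) x_i = -\sqrt{a^2 + \tfrac{b^2}{r}}\; x_i are asymptotically stable, as Lemma 10.2.1 below predicts with \pi_i = 1.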
According to (10.2.11), we have expressed the optimal control policies, i.e.,
u1∗ (x1 ), u2∗ (x2 ), . . . , uN∗ (xN ), for the N isolated subsystems (10.2.4). We will show
that by proportionally increasing some local feedback gains, a stabilizing decentral-
ized control scheme can be established for the interconnected system (10.2.1). Now,
we give the following lemma, indicating how the feedback gains can be added, in
order to guarantee the asymptotic stability of the isolated subsystems.
Lemma 10.2.1 Consider the isolated subsystems (10.2.4). The feedback controls given by

\bar{u}_i(x_i) = \pi_i u_i^*(x_i) = -\frac{\pi_i}{2} R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i), \quad i = 1, \ldots, N, \qquad (10.2.13)

ensure that the closed-loop isolated subsystems are asymptotically stable for all \pi_i \ge 1/2.
Proof The lemma can be proved by showing that J_i^*(x_i), i = 1, \ldots, N, are Lyapunov functions. First of all, in light of (10.2.9), we can find that J_i^*(x_i) > 0 for any x_i \ne 0 and J_i^*(x_i) = 0 when x_i = 0, which implies that J_i^*(x_i), i = 1, \ldots, N, are positive-definite functions. Next, the derivatives of J_i^*(x_i), i = 1, \ldots, N, along the
corresponding trajectories of the closed-loop isolated subsystems are given by

\dot{J}_i^*(x_i) = (\nabla J_i^*(x_i))^T \big( f_i(x_i) + g_i(x_i)\bar{u}_i(x_i) \big), \qquad (10.2.14)
where i = 1, . . . , N. Then, by adding and subtracting (1/4)(∇Ji∗ (xi ))T gi (xi )ui∗ (xi )
to (10.2.14) and considering (10.2.11)–(10.2.13), we have
\dot{J}_i^*(x_i) = (\nabla J_i^*(x_i))^T f_i(x_i) - \frac{1}{4}(\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i)
\quad\; - \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big)(\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i)
= -Q_i^2(x_i) - \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| R_i^{-1/2} g_i^T(x_i) \nabla J_i^*(x_i) \big\|^2, \qquad (10.2.15)
where i = 1, \ldots, N. Observing (10.2.15), we can obtain \dot{J}_i^*(x_i) < 0 for all \pi_i \ge 1/2 and x_i \ne 0, where i = 1, \ldots, N. Therefore, the conditions of the Lyapunov local stability theorem are satisfied and the proof is complete.
Remark 10.2.1 Lemma 10.2.1 reveals that any feedback controls ūi (xi ), i = 1, . . . , N,
can ensure the asymptotic stability of the closed-loop isolated subsystems as long as
\pi_i \ge 1/2, i = 1, \ldots, N. However, the feedback controls are optimal only when \pi_i = 1, i = 1, \ldots, N. In fact, similar results have been given in [3, 4, 22], showing
that the optimal controls ui∗ (xi ), i = 1, . . . , N, are robust in the sense that they have
infinite gain margins.
Now, we present the main theorem of this section, based on which the acquired
decentralized control strategy can be established.
Theorem 10.2.1 For the interconnected system (10.2.1), there exist N positive num-
bers πi∗ > 0, i = 1, 2, . . . , N, such that for any πi ≥ πi∗ , i = 1, 2, . . . , N, the feed-
back controls developed by (10.2.13) ensure that the closed-loop interconnected sys-
tem is asymptotically stable. In other words, the control vector (ū1 (x1 ), ū2 (x2 ), . . . ,
ūN (xN )) is the decentralized control strategy of large-scale system (10.2.1).
Proof In accordance with Lemma 10.2.1, we observe that Ji∗ (xi ), i = 1, . . . , N, are
all Lyapunov functions. Here, we choose a composite Lyapunov function given by
L(x) = \sum_{i=1}^{N} \theta_i J_i^*(x_i),

where \theta_i > 0, i = 1, \ldots, N, are constants to be determined. Taking the time derivative of L(x) along the trajectories of the closed-loop interconnected system yields

\dot{L}(x) = \sum_{i=1}^{N} \theta_i \Big[ (\nabla J_i^*(x_i))^T \big( f_i(x_i) + g_i(x_i)\bar{u}_i(x_i) \big) + (\nabla J_i^*(x_i))^T g_i(x_i) \bar{Z}_i(x) \Big]. \qquad (10.2.16)
In light of (10.2.15) and (10.2.3), it follows that

\dot{L}(x) \le -\sum_{i=1}^{N} \theta_i \Big[ Q_i^2(x_i) + \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| R_i^{-1/2} g_i^T(x_i) \nabla J_i^*(x_i) \big\|^2 - (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} Z_i(x) \Big]

\le -\sum_{i=1}^{N} \theta_i \Big\{ Q_i^2(x_i) + \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} \big\|^2 - \big\| (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} \big\| \sum_{j=1}^{N} \lambda_{ij} Q_j(x_j) \Big\}. \qquad (10.2.17)
and
\Pi = \mathrm{diag}\Big\{ \frac{1}{2}\Big(\pi_1 - \frac{1}{2}\Big), \frac{1}{2}\Big(\pi_2 - \frac{1}{2}\Big), \ldots, \frac{1}{2}\Big(\pi_N - \frac{1}{2}\Big) \Big\}.
Clearly, the focal point of designing the decentralized control strategy becomes deriving the optimal controllers for the N isolated subsystems on the basis of Theorem 10.2.1. Then, we should put our emphasis on solving the HJB equations, which is generally regarded as a difficult task [10, 25]. Hence, in what follows we shall employ a more pragmatic approach to obtain approximate solutions based on the online PI algorithm and NN techniques.
In this section, the online PI algorithm is introduced to solve the HJB equations. The
PI algorithm consists of policy evaluation based on (10.2.8) and policy improvement
based on (10.2.11) [21]. Specifically, its iteration procedure can be described as
follows.
Step 1: Choose a small positive number \varepsilon. Let p = 0 and V_i^{(0)}(x_i) = 0, where i = 1, \ldots, N. Then, start with N initial admissible control laws \mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \ldots, \mu_N^{(0)}(x_N).
Step 2: Let p = p + 1. Based on the control laws \mu_i^{(p-1)}(x_i), i = 1, \ldots, N, solve the following nonlinear Lyapunov equations

0 = Q_i^2(x_i) + \big(\mu_i^{(p-1)}(x_i)\big)^T R_i\, \mu_i^{(p-1)}(x_i) + \big(\nabla V_i^{(p)}(x_i)\big)^T \big( f_i(x_i) + g_i(x_i)\mu_i^{(p-1)}(x_i) \big) \qquad (10.2.19)

with V_i^{(p)}(0) = 0 and i = 1, \ldots, N.
Step 3: Update the control laws via

\mu_i^{(p)}(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla V_i^{(p)}(x_i), \qquad (10.2.20)

where i = 1, \ldots, N.
Step 4: If \big| V_i^{(p)}(x_i) - V_i^{(p-1)}(x_i) \big| \le \varepsilon, i = 1, \ldots, N, stop and obtain the approximate optimal controls of the N isolated subsystems; else, go back to Step 2.
Note that N initial admissible control laws are required in the above algorithm.
In the following theorems, we present the convergence analysis of the online PI
algorithm for the isolated subsystems.
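As a concrete, if simplified, illustration of Steps 1–4, note that for a linear subsystem \dot{x}_i = A x_i + B u_i with Q_i^2(x_i) = x_i^T Q x_i and quadratic value functions V_i^{(p)}(x_i) = x_i^T P_p x_i, the nonlinear Lyapunov equation (10.2.19) reduces to a matrix Lyapunov equation and (10.2.20) reduces to a gain update. The sketch below iterates these two steps until the stopping tolerance is met; the matrices are illustrative assumptions and are not taken from the chapter.

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative linear isolated subsystem (assumed data, not from the chapter).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # plays the role of Q_i^2(x_i) = x_i^T Q x_i
R = np.array([[1.0]])
K = np.zeros((1, 2))   # initial admissible control law mu^(0)(x) = -K x (A is Hurwitz here)
eps = 1e-9
P_prev = np.zeros_like(Q)

for p in range(1, 100):
    # Step 2 (policy evaluation): (A - B K)^T P + P (A - B K) + Q + K^T R K = 0,
    # which is (10.2.19) specialized to the linear-quadratic case.
    Acl = A - B @ K
    P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
    # Step 3 (policy improvement): mu^(p)(x) = -(1/2) R^{-1} B^T (2 P x) = -R^{-1} B^T P x.
    K = np.linalg.solve(R, B.T @ P)
    # Step 4 (stopping rule on the value function parameters).
    if np.max(np.abs(P - P_prev)) <= eps:
        break
    P_prev = P

print("iterations:", p, "\nP =", P, "\nK =", K)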
Theorem 10.2.2 Consider the N isolated subsystems (10.2.4), given N initial admissible control laws \mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \ldots, \mu_N^{(0)}(x_N). Then, using the PI algorithm established in (10.2.19) and (10.2.20), the value functions and control laws converge to the optimal ones as p \to \infty, i.e., V_i^{(p)}(x_i) \to J_i^*(x_i) and \mu_i^{(p)}(x_i) \to u_i^*(x_i) as p \to \infty, where i = 1, \ldots, N.

Proof First, we consider the subsystem i. According to [1], when given an initial admissible control law \mu_i^{(0)}(x_i), we have \mu_i^{(p)}(x_i) \in \mathcal{A}_i(\Omega_i) for any p \ge 0. Additionally, for any \zeta > 0, there exists an integer p_{0i}, such that for any p \ge p_{0i}, the inequalities

\sup_{x_i \in \Omega_i} \big| V_i^{(p)}(x_i) - J_i^*(x_i) \big| < \zeta \qquad (10.2.21)

and

\sup_{x_i \in \Omega_i} \big\| \mu_i^{(p)}(x_i) - u_i^*(x_i) \big\| < \zeta \qquad (10.2.22)

hold simultaneously.
Next, we consider the N isolated subsystems. When given \mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \ldots, \mu_N^{(0)}(x_N), where \mu_i^{(0)}(x_i) is the initial admissible control law corresponding to the ith subsystem, we can acquire that \mu_i^{(p)}(x_i) \in \mathcal{A}_i(\Omega_i) for any p \ge 0, where i = 1, \ldots, N. Moreover, we denote p_0 = \max\{p_{01}, p_{02}, \ldots, p_{0N}\}. Thus, we can conclude that for any \zeta > 0, there exists an integer p_0, such that for any p \ge p_0, (10.2.21) and (10.2.22)
are true with i = 1, . . . , N. In other words, the algorithm will converge to the optimal
cost functions and optimal control laws of the N isolated subsystems. The proof is
complete.
According to the universal approximation property of NNs, the value functions can be represented by critic NNs as

V_i(x_i) = W_{ci}^T \sigma_{ci}(x_i) + \varepsilon_{ci}(x_i), \quad i = 1, \ldots, N, \qquad (10.2.23)

where W_{ci} \in R^{l_i} is the ideal weight, \sigma_{ci}(x_i) \in R^{l_i} is the activation function, l_i is the number of neurons in the hidden layer, and \varepsilon_{ci}(x_i) \in R is the approximation error of the ith NN, i = 1, \ldots, N.
The derivatives of the value functions with respect to their state vectors are formulated as

\nabla V_i(x_i) = \nabla\sigma_{ci}^T(x_i) W_{ci} + \nabla\varepsilon_{ci}(x_i),

where \nabla\sigma_{ci}(x_i) = \partial\sigma_{ci}(x_i)/\partial x_i \in R^{l_i \times n_i} and \nabla\varepsilon_{ci}(x_i) = \partial\varepsilon_{ci}(x_i)/\partial x_i \in R^{n_i} are the gradients of the activation function and approximation error of the ith NN, respectively, i = 1, \ldots, N. Based on (10.2.23), the Lyapunov equations (10.2.8) become

0 = Q_i^2(x_i) + \mu_i^T R_i \mu_i + \big( W_{ci}^T \nabla\sigma_{ci}(x_i) + \nabla\varepsilon_{ci}^T(x_i) \big)\dot{x}_i, \quad i = 1, \ldots, N.

The approximate value functions are given by

\hat{V}_i(x_i) = \hat{W}_{ci}^T \sigma_{ci}(x_i),

where \hat{W}_{ci}, i = 1, \ldots, N, are the estimated weights. Here, \sigma_{ci}(x_i), i = 1, \ldots, N, are selected such that \hat{V}_i(x_i) > 0 for any x_i \ne 0 and \hat{V}_i(x_i) = 0 when x_i = 0.
Similarly, the derivatives of the approximate value functions with respect to the state vectors can be expressed as

\nabla\hat{V}_i(x_i) = \nabla\sigma_{ci}^T(x_i) \hat{W}_{ci},

where \nabla\hat{V}_i(x_i) = \partial\hat{V}_i(x_i)/\partial x_i, i = 1, \ldots, N. Then, the approximate Hamiltonians can be obtained as

H_i(x_i, \mu_i, \hat{W}_{ci}) = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) + \hat{W}_{ci}^T \nabla\sigma_{ci}(x_i)\dot{x}_i \triangleq e_{ci}, \quad i = 1, \ldots, N. \qquad (10.2.24)
For the purpose of training the critic networks of the isolated subsystems, it is desired
to obtain Ŵci , i = 1, . . . , N, to minimize the following objective functions:
E_{ci} = \frac{1}{2} e_{ci}^2, \quad i = 1, \ldots, N.
The standard steepest descent algorithm is introduced to tune the critic networks.
Then, their weights are updated through
\dot{\hat{W}}_{ci} = -\alpha_{ci} \frac{\partial E_{ci}}{\partial \hat{W}_{ci}}, \quad i = 1, \ldots, N, \qquad (10.2.25)
where αci > 0, i = 1, . . . , N, are the learning rates of the critic networks.
On the other hand, based on (10.2.23), the Hamiltonians take the following forms:

H_i(x_i, \mu_i, W_{ci}) = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) + W_{ci}^T \nabla\sigma_{ci}(x_i)\dot{x}_i \triangleq e_{cHi}, \quad i = 1, \ldots, N, \qquad (10.2.26)

where

e_{cHi} = -(\nabla\varepsilon_{ci}(x_i))^T \dot{x}_i, \quad i = 1, \ldots, N,

is the residual error caused by the NN approximation, and we denote \delta_i = \nabla\sigma_{ci}(x_i)\dot{x}_i. Moreover, we define the weight estimation errors of the critic networks as \tilde{W}_{ci} = W_{ci} - \hat{W}_{ci}, where i = 1, \ldots, N. Then, combining (10.2.24) with (10.2.26) yields the weight estimation error dynamics

\dot{\tilde{W}}_{ci} = \alpha_{ci}\big( e_{cHi} - \tilde{W}_{ci}^T \delta_i \big)\delta_i, \quad i = 1, \ldots, N. \qquad (10.2.28)

Incidentally, the persistency of excitation condition is required to tune the ith critic network to guarantee that \|\delta_i\| \ge \delta_{mi}, where \delta_{mi}, i = 1, \ldots, N, are positive constants.
Thus, a set of small exploratory signals will be added to the isolated subsystems in order to satisfy this condition in practice.
When implementing the online PI algorithm, in order to accomplish the policy improvement, we should obtain the control policies that minimize the value functions. Hence, according to (10.2.11) and (10.2.23), we have

\mu_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla V_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \big( \nabla\sigma_{ci}^T(x_i) W_{ci} + \nabla\varepsilon_{ci}(x_i) \big),

where i = 1, \ldots, N. Hence, the approximate control policies can be obtained as

\hat{\mu}_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla\hat{V}_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla\sigma_{ci}^T(x_i) \hat{W}_{ci}, \qquad (10.2.29)

where i = 1, \ldots, N.
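The critic update (10.2.25) and the approximate policy (10.2.29) can be simulated directly once f_i, g_i, and the basis \sigma_{ci} are fixed. The sketch below uses a two-state linear subsystem with a quadratic basis purely as an assumed illustration (none of these numerical choices come from the chapter, except the random initialization in [0, 2] used in Example 10.2.1); the learning rate, probing signal, and horizon would need tuning in practice.

import numpy as np

# Illustrative isolated subsystem dx/dt = f(x) + g(x) u (assumed, not from the chapter).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
f = lambda x: A @ x
g = lambda x: np.array([[0.0], [1.0]])
R = np.array([[1.0]])
Q2 = lambda x: float(x @ x)                              # Q_i^2(x_i)

sigma = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])     # activation functions
grad_sigma = lambda x: np.array([[2*x[0], 0.0],
                                 [x[1],   x[0]],
                                 [0.0,    2*x[1]]])           # d(sigma)/dx, shape (3, 2)

alpha, dt, T = 0.1, 1e-3, 200.0
x = np.array([1.0, -1.0])
W = np.random.uniform(0.0, 2.0, size=3)                  # critic weight estimate

for k in range(int(T / dt)):
    t = k * dt
    # Approximate policy (10.2.29) plus a small exploratory signal for persistency of excitation.
    u = -0.5 * np.linalg.solve(R, g(x).T @ (grad_sigma(x).T @ W))
    u = u + 0.1 * np.array([np.sin(t) + np.sin(2.3 * t)])
    xdot = f(x) + g(x) @ u
    delta = grad_sigma(x) @ xdot                         # delta_i = grad(sigma) * xdot
    e_c = Q2(x) + float(u @ R @ u) + W @ delta           # approximate Hamiltonian residual (10.2.24)
    W = W - dt * alpha * e_c * delta                     # Euler step of the update law (10.2.25)
    x = x + dt * xdot

print("learned critic weights:", W)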
When considering the critic networks, the weight estimation dynamics are uni-
formly ultimately bounded (UUB) as described in the following theorem.
Theorem 10.2.3 For the N isolated subsystems (10.2.4), the weight update laws for
tuning the critic networks are given by (10.2.25). Then, the dynamics of the weight
estimation errors of the critic networks are UUB.
Proof Consider the following Lyapunov function candidates:

L_i(x) = \frac{1}{\alpha_{ci}} \mathrm{tr}\big\{ \tilde{W}_{ci}^T \tilde{W}_{ci} \big\} = \frac{1}{\alpha_{ci}} \tilde{W}_{ci}^T \tilde{W}_{ci}, \quad i = 1, \ldots, N.
The time derivatives of the Lyapunov functions Li (x), i = 1, . . . , N, along the tra-
jectories of the error dynamics (10.2.28) are
\dot{L}_i(x) = \frac{2}{\alpha_{ci}} \tilde{W}_{ci}^T \dot{\tilde{W}}_{ci} = \frac{2}{\alpha_{ci}} \tilde{W}_{ci}^T \alpha_{ci}\big( e_{cHi} - \tilde{W}_{ci}^T \delta_i \big)\delta_i
= 2 e_{cHi}\, \tilde{W}_{ci}^T \delta_i - 2\big( \tilde{W}_{ci}^T \delta_i \big)^2
\le \frac{1}{\alpha_{ci}} e_{cHi}^2 - (2 - \alpha_{ci})\big( \tilde{W}_{ci}^T \delta_i \big)^2,
\dot{L}(x) \le -\xi^T \Xi\, \xi + \Sigma_e,
where Σe is the sum of the approximation errors. Hence, we can conclude that based
on the approximate optimal control laws μ̂i (xi ), i = 1, . . . , N, the present control
vector (ūˆ 1 (x1 ), ūˆ 2 (x2 ), . . . , ūˆ N (xN )) can ensure the uniform ultimate boundedness
of the state trajectories of the closed-loop interconnected system. It is in this sense
that we accomplish the design of the decentralized control scheme by adopting the
learning optimal control approach based on online PI algorithm.
Remark 10.2.4 Note that the controller presented here is a decentralized stabilizing one. Though the optimal decentralized controller of interconnected systems has been studied before [19], in this section, we aim at developing a novel decentralized control strategy based on ADP. How to extend the present results to the design of optimal decentralized control for interconnected nonlinear systems is a topic of our future research. In fact, in recent work, Wang et al. [26] have provided some preliminary results on this topic.
In order to relax the requirement of exact knowledge of the system dynamics, a small exploratory signal e_i(t) is added to the control input of each isolated subsystem, which gives

\dot{x}_i(t) = f_i(x_i(t)) + g_i(x_i(t))\big[ u_i(x_i(t)) + e_i(t) \big]. \qquad (10.2.33)
The derivative of the value function with respect to time along the trajectory of the
subsystem (10.2.33) is calculated using (10.2.8) as
\dot{V}_i(x_i) = \nabla V_i^T(x_i)\big( f_i(x_i) + g_i(x_i)[\mu_i(x_i) + e_i] \big) = -r_i(x_i, \mu_i) + \nabla V_i^T(x_i) g_i(x_i) e_i, \qquad (10.2.34)

where

r_i(x_i, \mu_i) = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i).
integrate (10.2.34) over the interval [t, t + T ] to obtain (10.2.35). This means that the
unique solution of (10.2.34), Vi (xi ), also satisfies (10.2.35). To complete the proof,
we show that (10.2.35) has a unique solution by contradiction.
Thus, we assume that there exists another value function V̄i (xi ) which satisfies
(10.2.35) with the end condition (boundary condition) V̄i (0) = 0. This value function
also satisfies
V̄˙i (xi ) = −ri (xi , μi ) + ∇ V̄iT(xi )gi (xi )ei .
which must hold for any xi on the system trajectories generated by the stabilizing
policy μi (xi ). According to (10.2.36), we have V̄i (xi ) = Vi (xi ) + c. As this relation
must hold for xi (t) = 0, we have V̄i (0) = Vi (0) + c ⇒ c = 0. Thus, V̄i (xi ) = Vi (xi ),
i.e., (10.2.35) has a unique solution which is equal to the solution of (10.2.34). The
proof is complete.
Integrating (10.2.34) from t to t +T with any time interval T > 0, and considering
(10.2.19) and (10.2.20), we have
V_i^{(p)}(x_i(t+T)) - V_i^{(p)}(x_i(t)) = -2\int_t^{t+T} \big( \mu_i^{(p)}(x_i) \big)^T R_i e_i\, \mathrm{d}\tau - \int_t^{t+T} \Big[ Q_i^2(x_i) + \big( \mu_i^{(p)}(x_i) \big)^T R_i \mu_i^{(p)}(x_i) \Big] \mathrm{d}\tau. \qquad (10.2.37)
Note that N initial admissible control laws are required in this algorithm, and we let V_i^{(p)}(x_i) = 0 when p = 0.
Proof In [1], it was shown that in the iteration process of (10.2.20) and (10.2.34),
if the initial control policy μ(0)
i (xi ) is admissible, all the subsequent control laws
will be admissible. Moreover, the iteration result will converge to the solution of
the HJB equation. Based on (10.2.37) and the proven equivalence between (10.2.34)
and (10.2.35), we can conclude that the present online model-free PI algorithm will
converge to the solution of the optimal control problem for subsystem (10.2.33)
without using the knowledge of system dynamics. The proof is complete.
subscripts “c” and “a” denote the critic and the action, respectively. Since the ideal
weights are unknown, the outputs of the critic NN and the action NN are estimated
as

\hat{V}_i(x_i) = \hat{W}_{ci}^T \varphi_{ci}(x_i), \qquad \hat{\mu}_i(x_i) = \hat{W}_{ai}^T \varphi_{ai}(x_i),

where \varphi_{ci}(x_i) and \varphi_{ai}(x_i) denote the activation function vectors of the critic NN and the action NN, respectively. Substituting these expressions into (10.2.37) over the interval [t+(k-1)T, t+kT] gives the linear equations

\psi_{ki}^T \begin{bmatrix} \hat{W}_{ci} \\ \hat{W}_{ai} \end{bmatrix} = \theta_{ki}, \quad k = 1, 2, \ldots, K_i,

with

\theta_{ki} = \int_{t+(k-1)T}^{t+kT} \big[ Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) \big] \mathrm{d}\tau,

\psi_{ki} = \Big[ \big( \varphi_{ci}(x_i(t+(k-1)T)) - \varphi_{ci}(x_i(t+kT)) \big)^T, \; -2\int_{t+(k-1)T}^{t+kT} (R_i e_i)^T \varphi_{ai}^T(x_i)\, \mathrm{d}\tau \Big]^T,

and we collect \Phi_i = [\psi_{1i}, \psi_{2i}, \ldots, \psi_{K_i i}] and \Theta_i = [\theta_{1i}, \theta_{2i}, \ldots, \theta_{K_i i}]^T. If \Phi_i^T has full column rank, the parameters can be obtained by solving the equation

\begin{bmatrix} \hat{W}_{ci} \\ \hat{W}_{ai} \end{bmatrix} = (\Phi_i \Phi_i^T)^{-1} \Phi_i \Theta_i. \qquad (10.2.42)
Therefore, the number of collected data points K_i should satisfy K_i \ge \mathrm{rank}(\Phi_i) = N_{ci} + N_{ai}, which guarantees the existence of (\Phi_i \Phi_i^T)^{-1}. The least squares problem in (10.2.42) can be solved in real time by collecting enough data points generated from the system (10.2.33).
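The following sketch shows how (10.2.42) can be evaluated from logged data for a subsystem with a scalar control input. The function and variable names are assumptions for illustration; each data segment is the record of one interval [t+(k-1)T, t+kT] sampled while (10.2.33) is running.

import numpy as np

def trapz_int(y, t):
    # Simple trapezoidal quadrature along the first axis (works for 1-D and 2-D y).
    y = np.asarray(y, dtype=float)
    return np.tensordot(np.diff(t), 0.5 * (y[1:] + y[:-1]), axes=1)

def integral_pi_ls_update(segments, phi_c, phi_a, R):
    # segments: list of tuples (ts, xs, mus, es) with sample times, states,
    # applied policy values, and exploratory signal values on one interval.
    rows, rhs = [], []
    for ts, xs, mus, es in segments:
        # theta_k: integral of Q_i^2(x) + mu^T R mu (Q_i^2(x) = x^T x is an illustrative choice).
        util = np.array([x @ x + mu * float(R) * mu for x, mu in zip(xs, mus)])
        theta_k = trapz_int(util, ts)
        # First block of psi_k: difference of critic features over the interval.
        dphi_c = phi_c(xs[0]) - phi_c(xs[-1])
        # Second block: -2 * integral of (R e) * phi_a(x), the cross term in (10.2.37).
        cross = trapz_int([-2.0 * float(R) * e * phi_a(x) for x, e in zip(xs, es)], ts)
        rows.append(np.concatenate([dphi_c, cross]))
        rhs.append(theta_k)
    w, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    n_c = len(phi_c(segments[0][1][0]))
    return w[:n_c], w[n_c:]          # estimated critic and action weights

# Minimal synthetic usage (shapes only; real segments come from logged trajectories).
phi_c = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])
phi_a = lambda x: np.array([x[0], x[1]])
fake = [(np.linspace(0.0, 0.1, 11), np.random.randn(11, 2),
         np.random.randn(11), 0.01 * np.random.randn(11)) for _ in range(10)]
Wc, Wa = integral_pi_ls_update(fake, phi_c, phi_a, R=1.0)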
Clearly, the problem of designing the decentralized control law reduces to deriving the optimal controllers for the N isolated subsystems. Based on the online model-free integral PI algorithm and NN techniques, we obtain the approximate solutions of the HJB equations, from which the approximate optimal control policies \hat{\mu}_i(x_i) are obtained. Therefore, according to Theorem 10.2.1 and Remark 10.2.3, the stabilizing decentralized control law of the large-scale interconnected system can be derived.
Two simulation examples are provided to show the applicability of the decentralized
control strategy established in this section.
where x1 = [x11 , x12 ]T ∈ R2 and ū1 (x1 ) ∈ R are the state and control variables of
subsystem 1, and x2 = [x21 , x22 ]T ∈ R2 and ū2 (x2 ) ∈ R are the state and control
variables of subsystem 2. Let R_1 = R_2 = I be identity matrices with suitable dimensions. Additionally, let h_1(x_1) = \|x_1\| and h_2(x_2) = |x_{22}|. Then, we find that Z_1(x) and Z_2(x) with x = [x_1^T, x_2^T]^T are upper bounded as in (10.2.3). For example, we can select \lambda_{11} = \lambda_{12} = 1 and \lambda_{21} = \lambda_{22} = 1/2.
In order to design the decentralized controller of interconnected system (10.2.43),
we first deal with the optimal control problem of two isolated subsystems. Here, we
choose Q_1(x_1) = \|x_1\| and Q_2(x_2) = |x_{22}|. Hence, the cost functions of the optimal control problem are, respectively,

J_1(x_{10}) = \int_0^{\infty} \big( x_{11}^2 + x_{12}^2 + u_1^T u_1 \big)\, \mathrm{d}\tau

and

J_2(x_{20}) = \int_0^{\infty} \big( x_{22}^2 + u_2^T u_2 \big)\, \mathrm{d}\tau.
We adopt the online PI algorithm to tackle the optimal control problem, where
two critic networks are constructed to approximate the cost functions. We denote
the weight vectors of the two critic networks as \hat{W}_c^1 = [\hat{W}_{c1}^1, \hat{W}_{c2}^1, \hat{W}_{c3}^1]^T and \hat{W}_c^2 = [\hat{W}_{c1}^2, \hat{W}_{c2}^2, \hat{W}_{c3}^2]^T. During the simulation process, the initial weights of the critic networks are chosen randomly in [0, 2]. Moreover, the activation functions of the two critic networks are chosen as \sigma_{c1}(x_1) = [x_{11}^2, x_{11}x_{12}, x_{12}^2]^T and \sigma_{c2}(x_2) = [x_{21}^2, x_{21}x_{22}, x_{22}^2]^T, respectively. Besides, choose the learning rates of the critic networks as \alpha_{c1} = \alpha_{c2} = 0.1 and the initial states of the two isolated subsystems as x_{10} = x_{20} = [1, -1]^T.
During the implementation process of the online PI algorithm, for each isolated
subsystem, we add the following small exploratory signals to satisfy the persistency
of excitation condition:
N_1(t) = \sin^2(t)\cos(t) + \sin^2(2t)\cos(0.1t) + \sin^2(-1.2t)\cos(0.5t) + \sin^5(t) + \sin^2(1.12t) + \cos(2.4t)\sin^3(2.4t)
and N2 (t) = 1.6N1 (t). We can observe that the convergence of the weights occurred
after 750 and 180 s, respectively. Then, the exploratory signals are turned off. Actu-
ally, the weights of the critic networks converge to \hat{W}_c^1 = [0.498969, 0.000381, 0.999843]^T and \hat{W}_c^2 = [1.000002, -0.000021, 0.999992]^T, respectively, which are depicted in Figs. 10.2 and 10.3.
Based on the converged weights \hat{W}_c^1 and \hat{W}_c^2, we can obtain the approximate
value function and control law for each isolated subsystem, namely, V̂1 (x1 ), μ̂1 (x1 ),
V̂2 (x2 ), and μ̂2 (x2 ). In comparison, for the method proposed in [23], the optimal
cost function and control law of isolated subsystem 1 are J_1^*(x_1) = 0.5x_{11}^2 + x_{12}^2 and u_1^*(x_1) = -(\cos(2x_{11}) + 2)x_{12}, respectively. Similarly, the optimal cost function and control law of isolated subsystem 2 are J_2^*(x_2) = x_{21}^2 + x_{22}^2 and u_2^*(x_2) = -x_{21}x_{22}.
As a result, for isolated subsystem 1, the error between the optimal cost function
and the approximate one is presented in Fig. 10.4. Moreover, the error between the
optimal control law and the approximate version is shown in Fig. 10.5. It is clear
to see that both the approximation errors are close to zero, which verifies the good
Fig. 10.2 Convergence of the weight vector of the critic network 1 (\omega_{ac11}, \omega_{ac12}, and \omega_{ac13} represent \hat{W}_{c1}^1, \hat{W}_{c2}^1, and \hat{W}_{c3}^1, respectively)
Fig. 10.3 Convergence of the weight vector of the critic network 2 (\omega_{ac21}, \omega_{ac22}, and \omega_{ac23} represent \hat{W}_{c1}^2, \hat{W}_{c2}^2, and \hat{W}_{c3}^2, respectively)
Fig. 10.4 3-D plot of the approximation error of the cost function of isolated subsystem 1, i.e.,
J1∗ (x1 ) − V̂1 (x1 )
Fig. 10.5 3-D plot of the approximation error of the control law of isolated subsystem 1, i.e.,
u1∗ (x1 ) − μ̂1 (x1 )
performance of the online learning algorithm. For isolated subsystem 2, similar simulation results are shown in Figs. 10.6 and 10.7.
Next, by choosing θ1 = θ2 = 1 and π1 = π2 = 2, we can guarantee the
positive definiteness of the matrix Ξ . Thus, (π1 μ̂1 (x1 ), π2 μ̂2 (x2 )) is the decentralized
control strategy of the original interconnected system (10.2.43). Here, we apply the
decentralized control scheme to plant (10.2.43) for 40 s and obtain the evolution
processes of the state trajectories illustrated in Figs. 10.8 and 10.9. By zooming in
on the state trajectories near zero, it is demonstrated that the state trajectories of the
closed-loop system are UUB. Clearly, these simulation results confirm the validity of the decentralized control approach developed in this section.
Example 10.2.2 Consider the classical multi-machine power system with governor
controllers [6]
Fig. 10.6 3-D plot of the approximation error of the cost function of isolated subsystem 2, i.e.,
J2∗ (x2 ) − V̂2 (x2 )
Fig. 10.7 3-D plot of the approximation error of the control law of isolated subsystem 2, i.e.,
u2∗ (x2 ) − μ̂2 (x2 )
Fig. 10.8 The state trajectories of subsystem 1 under the action of the decentralized control strategy
(π1 μ̂1 (x1 ), π2 μ̂2 (x2 ))
Fig. 10.9 The state trajectories of subsystem 2 under the action of the decentralized control strategy
(π1 μ̂1 (x1 ), π2 μ̂2 (x2 ))
where, for 1 \le i, j \le N, \delta_i(t) represents the angle of the ith generator; \delta_{ij}(t) =
δi (t) − δj (t) is the angular difference between the ith and jth generators; ωi (t) is the
relative rotor speed; Pmi (t) and Pei (t) are the mechanical power and the electrical
power, respectively; Eqi is the transient electromotive force in quadrature axis and is
assumed to be constant under high-gain SCR (silicon-controlled rectifier) controllers;
Di , Hi , and Ti are the damping constant, the inertia constant, and the governor time
constant, respectively; Bij and G ij are the imaginary and real parts of the admittance
matrix, respectively; ugi (t) is the speed governor control signal for the ith generator;
and ω 0 is the steady-state frequency.
A three-machine power system is considered in our numerical simulation. The
parameters of the system are the same as those in [6]. The weighting matrices are
set to be Q_i^2(x_i) = x_i^T (1000 I_3) x_i and R_i = 1, for i = 1, 2, 3. Similarly, as in [6], the multi-machine power system can be rewritten in the following form
where
Δδi (t) = δi (t) − δi0 ,
and
d_i(t) = \sum_{j=1, j \ne i}^{N} E_{qi} E_{qj} \big[ B_{ij}\cos\delta_{ij}(t) - G_{ij}\sin\delta_{ij}(t) \big]\big[ \Delta\omega_i(t) - \Delta\omega_j(t) \big].
For each isolated subsystem, we denote the weight vectors of the action and critic
networks as
\hat{W}_a^i = \big[ \hat{W}_{a1}^i, \hat{W}_{a2}^i, \hat{W}_{a3}^i \big]^T, \qquad \hat{W}_c^i = \big[ \hat{W}_{c1}^i, \hat{W}_{c2}^i, \hat{W}_{c3}^i, \hat{W}_{c4}^i, \hat{W}_{c5}^i, \hat{W}_{c6}^i \big]^T.
From these parameters, Nci = 6 and Nai = 3. So, we conduct the simulation with
Ki = 10. We set the initial state and the initial weights of the critic networks as
x_{i0} = [1, 1, 1]^T, \hat{W}_c^i = 100 \times [1, 1, 1, 1, 1, 1]^T, for i = 1, 2, 3. The initial weights of the action networks are chosen as \hat{W}_a^1 = -[30, 30, 30]^T, \hat{W}_a^2 = -[10, 20, 50]^T, and \hat{W}_a^3 = -[10, 20, 30]^T, respectively. The period T = 0.1 s and the exploratory signals

N_i(t) = 0.01\big( \sin(2\pi t) + \cos(2\pi t) \big)

are used in the learning process. The least squares problem is solved after 10 samples
are acquired, and thus, the weights of the NNs are updated every 1 s. Figs. 10.10,
10.11, and 10.12 illustrate the evolutions of the weights of the action network for
the isolated subsystems 1, 2, and 3, respectively. It is clear that the weights converge
after some update steps.
Then, we can choose π1 = π2 = π3 = 1 to obtain a combined control vector
(π1 μ̂1 (x1 ), π2 μ̂2 (x2 ), π3 μ̂3 (x3 )), which can be regarded as the stabilizing decentral-
ized control law of the interconnected system. By applying the decentralized control
law to the interconnected power system for 10 s, we obtain the evolution process of
the power angle deviations and frequencies of the generators shown in Figs. 10.13
and 10.14, respectively. Obviously, the applicability of the decentralized control law
developed in this section has been verified by these simulation results.
Fig. 10.13 Angles of the generators under the action of the decentralized control law
Fig. 10.14 Frequencies of the generators under the action of the decentralized control law
10.3 Conclusions
In this chapter, a decentralized control strategy is developed to deal with the stabi-
lization problem of a class of continuous-time large-scale nonlinear systems using an
online PI algorithm. It is shown that the decentralized control strategy of the overall
system can be established by adding feedback gains to the obtained optimal control
policies. Then, a stabilizing decentralized control law for a class of large-scale non-
linear systems with unknown dynamics is established by using an NN-based online
model-free integral PI algorithm, which employs exploratory signals to solve the HJB equations related to the optimal control problems of the isolated subsystems.
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Bakule L (2008) Decentralized control: an overview. Ann Rev Control 32(1):87–98
3. Beard RW, Saridis GN, Wen JT (1997) Galerkin approximations of the generalized Hamilton–
Jacobi–Bellman equation. Automatica 33(12):2159–2177
4. Glad ST (1984) On the gain margin of nonlinear and optimal regulators. IEEE Trans Autom
Control AC–29(7):615–620
5. Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear
systems with completely unknown dynamics. Automatica 48(10):2699–2704
6. Jiang Y, Jiang ZP (2012) Robust adaptive dynamic programming for large-scale systems
with an application to multimachine power systems. IEEE Trans Circ Syst II Express Briefs
59(10):693–697
7. Khan SG, Herrmann G, Lewis FL, Pipe T, Melhuish C (2012) Reinforcement learning and opti-
mal adaptive control: an overview and implementation examples. Ann Rev Control 36(1):42–59
8. Lee JY, Park JB, Choi YH (2012) Integral Q-learning and explorized policy iteration for adaptive
optimal control of continuous-time linear systems. Automatica 48(11):2850–2859
9. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken
10. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circ Syst Mag 9(3):32–50
11. Liang J, Venayagamoorthy GK, Harley RG (2012) Wide-area measurement based dynamic
stochastic optimal power flow control for smart grids with high variability and uncertainty.
IEEE Trans Smart Grid 3(1):59–69
12. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
13. Liu D, Wang D, Li H (2014) Decentralized stabilization for a class of continuous-time nonlinear
interconnected systems using online learning optimal control approach. IEEE Trans Neural
Network Learn Syst 25(2):418–428
14. Liu D, Li C, Li H, Wang D, Ma H (2015) Neural-network-based decentralized control of
continuous-time nonlinear interconnected systems with unknown dynamics. Neurocomputing
165:90–98
15. Mehraeen S, Jagannathan S (2011) Decentralized optimal control of a class of interconnected
nonlinear discrete-time systems by using online Hamilton-Jacobi-Bellman formulation. IEEE
Trans Neural Network 22(11):1757–1769
16. Park JW, Harley RG, Venayagamoorthy GK (2005) Decentralized optimal neuro-controllers
for generation and transmission devices in an electric power network. Eng Appl Artif Intell
18(1):37–46
17. Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, New York
18. Saberi A (1988) On optimality of decentralized control for a class of nonlinear interconnected
systems. Automatica 24(1):101–104
19. Siljak DD (1991) Decentralized control of complex systems. Academic Press, Boston
20. Siljak DD, Zecevic AI (2005) Control of large-scale systems: beyond decentralized feedback.
Ann Rev Control 29(2):169–179
21. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
22. Tsitsiklis JN, Athans M (1984) Guaranteed robustness properties of multivariable nonlinear
stochastic optimal regulators. IEEE Trans Autom Control AC–29(8):690–696
23. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
24. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Network 22(3):237–246
25. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
26. Wang D, Liu D, Li H, Ma H (2014) Neural-network-based robust optimal control design for a
class of uncertain nonlinear systems via adaptive dynamic programming. Inf Sci 282:167–179
27. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. General Syst Yearb 22:25–38
28. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
29. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
Chapter 11
Learning Algorithms for Differential
Games of Continuous-Time Systems
11.1 Introduction
Adaptive dynamic programming (ADP) [15, 16, 31, 33] has received significantly
increasing attention owing to its self-learning capability. Value iteration-based ADP
algorithms [6, 20, 32, 37, 40] can solve optimal control problems for discrete-time
nonlinear systems without the requirement of complete knowledge of the dynamics
by building a model neural network (NN). Policy iteration (PI)-based ADP algorithms
can be used to solve optimal control problems of continuous-time nonlinear systems
with the requirement of full knowledge [1, 24] or only partial knowledge [27, 30] of
the dynamics. Some extended PI algorithms can solve optimal control problems of
continuous-time nonlinear systems with completely unknown dynamics by building
NN identifiers [10, 38] or with no NN identifiers [13, 14].
However, most of the previous works on ADP solving the optimal control prob-
lems assume that the system is only affected by a single control law. Game theory
provides an ideal environment to study multi-player optimal decision and control
problems. Two-player non-cooperative zero-sum game [9] has received much atten-
tion since it also provides the solution of H∞ optimal control problems [8]. For
a zero-sum game, it relies on solving the Hamilton–Jacobi–Isaacs (HJI) equation
which reduces to solving the game algebraic Riccati equation (GARE) [9] when
the system has linear dynamics and the cost function is quadratic. Different from
the non-cooperative zero-sum game, nonzero-sum game offers a suitable theoretical
method considering cooperative and non-cooperative objectives. For a multi-player
nonzero-sum game, it will require to solve the coupled Hamilton–Jacobi (HJ) equa-
tions, which reduce to the coupled algebraic Riccati equations in the linear quadratic
case. Generally speaking, both the HJI and coupled HJ equations cannot be solved
analytically due to its nonlinear nature.
ADP algorithms have been used to solve the zero-sum game problems for discrete-
time nonlinear systems [4, 5, 19]. For continuous-time linear systems, an online
ADP algorithm was proposed for two-player zero-sum games without requiring the
11.2 Integral Policy Iteration for Two-Player Zero-Sum Games

Consider the following continuous-time linear system

\dot{x} = Ax + B_1 u + B_2 w, \qquad (11.2.1)
where x ∈ Rn is the system state with initial state x0 , u ∈ Rm is the control input, and
w ∈ Rq is the external disturbance input with w ∈ L2 [0, ∞). A ∈ Rn×n , B1 ∈ Rn×m ,
and B2 ∈ Rn×q are the unknown system matrices.
Define the infinite horizon performance index (or cost function)

J(x_0, u, w) = \int_0^{\infty} \big( x^T Q x + u^T R u - \gamma^2 w^T w \big)\, \mathrm{d}\tau \triangleq \int_0^{\infty} r(x, u, w)\, \mathrm{d}\tau,
where the control policy player u seeks to minimize the value function while the disturbance policy player w desires to maximize it. The goal is to find the saddle point (u^*, w^*) which satisfies the inequalities

J(x_0, u^*, w) \le J(x_0, u^*, w^*) \le J(x_0, u, w^*)

for all admissible u and w. The saddle point is characterized by the game algebraic Riccati equation (GARE)

A^T P + P A + Q - P B_1 R^{-1} B_1^T P + \gamma^{-2} P B_2 B_2^T P = 0. \qquad (11.2.3)
Defining P ∗ as the unique positive definite solution of (11.2.3), the saddle point of
the zero-sum game is
u^* = -K^* x = -R^{-1} B_1^T P^* x, \qquad w^* = L^* x = \gamma^{-2} B_2^T P^* x. \qquad (11.2.4)
The solution of the H∞ control problem can be obtained by solving the saddle
point of the equivalent two-player zero-sum game problem. The following H∞ norm
(the L2 -gain) is used to measure the performance of the control system.
Definition 11.2.1 Let \gamma \ge 0 be a certain prescribed level of disturbance attenuation. The system (11.2.1) is said to have L_2-gain less than or equal to \gamma if

\int_0^{\infty} \big( x^T Q x + u^T R u \big)\, \mathrm{d}\tau \le \gamma^2 \int_0^{\infty} w^T w\, \mathrm{d}\tau \qquad (11.2.5)

for all w \in L_2[0, \infty).
In this section, we will develop an online model-free integral PI algorithm for the lin-
ear continuous-time zero-sum differential game with completely unknown dynam-
ics. First, we assume that an initial stabilizing control gain matrix K_1 is given. Define V_i(x) = x^T P_i x, u_i(x) = -K_i x, and w_i(x) = L_i x as the value function, control policy, and disturbance policy, respectively, for each iterative step i \ge 1.
To relax the assumptions of exact knowledge of A, B1 , and B2 , we use N1 and N2
to denote the small exploratory signals added to the control policy u i and disturbance
policy wi , respectively. The exploratory signals are assumed to be any nonzero mea-
surable signal which is bounded by e M > 0, i.e., N1 ≤ e M , N2 ≤ e M . Then,
the original system (11.2.1) becomes
ẋ = Ax + B1 (u i + N1 ) + B2 (wi + N2 ). (11.2.6)
Taking the time derivative of V_i(x) = x^T P_i x along the trajectories of (11.2.6) gives

\frac{\mathrm{d}}{\mathrm{d}\tau}\big( x^T P_i x \big) = -r(x, u_i, w_i) + 2 x^T K_{i+1}^T R N_1 + 2\gamma^2 x^T L_{i+1}^T N_2, \qquad (11.2.7)

where we have used K_{i+1} = R^{-1} B_1^T P_i and L_{i+1} = \gamma^{-2} B_2^T P_i according to (11.2.4). Integrating (11.2.7) from t to t + T with any time interval T > 0, we have

x_{t+T}^T P_i x_{t+T} - x_t^T P_i x_t = -\int_t^{t+T} r(x, u_i, w_i)\, \mathrm{d}\tau + 2\int_t^{t+T} x^T K_{i+1}^T R N_1\, \mathrm{d}\tau + 2\gamma^2 \int_t^{t+T} x^T L_{i+1}^T N_2\, \mathrm{d}\tau,
where the values of the state x at time t and t + T are denoted by xt and xt+T .
Therefore, we obtain the online model-free integral PI algorithm (Algorithm 11.2.1)
for zero-sum differential games.
Step 1. Start with an initial stabilizing control policy u_1 = -K_1 x and let i = 1.
Step 2. Using online data measured along the system trajectories, solve the following equation for P_i, K_{i+1}, and L_{i+1}:

x_t^T P_i x_t - x_{t+T}^T P_i x_{t+T} = \int_t^{t+T} r(x, u_i, w_i)\, \mathrm{d}\tau - 2\int_t^{t+T} x^T K_{i+1}^T R N_1\, \mathrm{d}\tau - 2\gamma^2 \int_t^{t+T} x^T L_{i+1}^T N_2\, \mathrm{d}\tau. \qquad (11.2.8)

Step 3. If \|P_i - P_{i-1}\| \le \xi (\xi is a prescribed small positive real number), stop and return P_i; else, set i = i + 1 and go to Step 2.
Remark 11.2.1 Equation (11.2.8) plays an important role in relaxing the assumption
of the knowledge of system dynamics, since A, B1 , and B2 do not appear in (11.2.8).
Only online data measured along the system trajectories are required to run this
algorithm. Our method avoids the identification of A, B1 , and B2 whose information
is embedded in the online measured data. In other words, the lack of knowledge
about the system dynamics does not have any impact on our method to obtain the
Nash equilibrium. Thus, our method will not be affected by the errors between the
identification model and the real system, and it can respond fast to the changes of
the system dynamics.
Remark 11.2.2 This algorithm is actually a PI method, but the policy evaluation and
policy improvement are performed at the same time. Compared with the model-based
method [25] and partially model-free method [35], our algorithm is a fully model-
free method which does not require knowledge of the system dynamics. Different
from the iterative method with inner loop on disturbance policy and outer loop on
control policy [25], and the method with only one iterative loop by updating control
and disturbance policies simultaneously [35], the method developed here updates the
value function, control, and disturbance policies at the same time.
Remark 11.2.3 Small exploratory signals can satisfy the persistence of excitation
(PE) condition to efficiently update the value function and the policies. To guarantee
the PE condition, the state may need to be reset during the iterative process, but
it results in technical problems for stability analysis of the closed-loop system. An
alternative way is to add exploratory signals. The solution obtained by our method
is exactly the same as the one determined by the GARE by considering the effects
of exploratory signals.
Next, we will show the relationship between the present algorithm and the
Q-learning algorithm by extending the concept of Q-function to zero-sum games
that are continuous in time, state, and action space. The optimal continuous-time
Q-function for zero-sum games is defined as the following quadratic form:
Q^*(x, u, w) = [x^T\; u^T\; w^T]\, H^*\, [x^T\; u^T\; w^T]^T
= [x^T\; u^T\; w^T] \begin{bmatrix} H_{11}^* & H_{12}^* & H_{13}^* \\ H_{21}^* & H_{22}^* & H_{23}^* \\ H_{31}^* & H_{32}^* & H_{33}^* \end{bmatrix} [x^T\; u^T\; w^T]^T
= [x^T\; u^T\; w^T] \begin{bmatrix} A^T P^* + P^* A + Q & P^* B_1 & P^* B_2 \\ B_1^T P^* & R & 0 \\ B_2^T P^* & 0 & -\gamma^2 I \end{bmatrix} [x^T\; u^T\; w^T]^T. \qquad (11.2.9)
By setting \partial Q^*(x, u, w)/\partial u = 0 and \partial Q^*(x, u, w)/\partial w = 0, we can obtain

u^* = -(H_{22}^*)^{-1} (H_{12}^*)^T x, \qquad w^* = -(H_{33}^*)^{-1} (H_{13}^*)^T x,

and Q^*(x_0, u^*, w^*) = V^*(x_0),
where H_{21}^i = (H_{12}^i)^T and H_{31}^i = (H_{13}^i)^T.
Now, we are in a position to develop the next online integral Q-learning algorithm
(Algorithm 11.2.2).
x_t^T H_{11}^i x_t = x_{t+T}^T H_{11}^i x_{t+T} + \int_t^{t+T} r(x, u_i, w_i)\, \mathrm{d}\tau - 2\int_t^{t+T} x^T H_{12}^i N_1\, \mathrm{d}\tau - 2\int_t^{t+T} x^T H_{13}^i N_2\, \mathrm{d}\tau. \qquad (11.2.10)

K_{i+1} = R^{-1} (H_{12}^i)^T, \qquad L_{i+1} = \gamma^{-2} (H_{13}^i)^T.
Remark 11.2.4 Note that the model-free integral PI algorithm (Algorithm 11.2.1)
is equivalent to the integral Q-learning algorithm (Algorithm 11.2.2) for zero-sum
games. As PI methods, the algorithms developed above require an initial stabilizing
control policy which is usually obtained by experience. We can also obtain B_1 = (H_{11}^i)^{-1} H_{12}^i and B_2 = (H_{11}^i)^{-1} H_{13}^i.
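In code, extracting the gains and the input matrices from a learned matrix H^i amounts to a few matrix operations; the block and function names below are illustrative only.

import numpy as np

def gains_from_H(H11, H12, H13, R, gamma):
    # Gain updates of Algorithm 11.2.2: K_{i+1} = R^{-1}(H12^i)^T, L_{i+1} = gamma^{-2}(H13^i)^T.
    K = np.linalg.solve(R, H12.T)
    L = H13.T / gamma**2
    # As noted in Remark 11.2.4, the input matrices can also be recovered from the learned blocks.
    B1 = np.linalg.solve(H11, H12)
    B2 = np.linalg.solve(H11, H13)
    return K, L, B1, B2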
In this section, we will provide a convergence analysis of the present algorithms for
two-player zero-sum differential games. It can be shown that the present model-free
integral PI and Q-learning algorithms are equivalent to Newton’s method.
Theorem 11.2.1 For an initial stabilizing control policy u_1 = -K_1 x, the sequence \{P_i\}_{i=1}^{\infty} obtained by solving (11.2.8) in Algorithm 11.2.1 converges to the optimal solution P^* of the GARE (11.2.3), and the sequences \{K_i\}_{i=1}^{\infty} and \{L_i\}_{i=1}^{\infty} converge to the saddle point K^* and L^*, respectively, as i \to \infty.

Proof First, for an initial stabilizing control policy u_1 = -K_1 x, we can prove that the present Algorithm 11.2.1 is equivalent to the following Lyapunov equation

A_i^T P_i + P_i A_i = -M_i, \qquad (11.2.11)

where

A_i = A - B_1 K_i + B_2 L_i, \qquad M_i = Q + K_i^T R K_i - \gamma^2 L_i^T L_i.

Indeed, since u_i = -K_i x and w_i = L_i x, the system (11.2.6) can be written as

\dot{x} = A_i x + B_1 N_1 + B_2 N_2,

and it follows from (11.2.8) that, along the trajectories of this system,

x^T (A_i^T P_i + P_i A_i) x = -r(x, u_i, w_i) = -x^T (Q + K_i^T R K_i - \gamma^2 L_i^T L_i) x,

i.e.,

A_i^T P_i + P_i A_i = -M_i,

where M_i = Q + K_i^T R K_i - \gamma^2 L_i^T L_i.
Then, according to the results in [35], the sequence \{P_i\}_{i=1}^{\infty} generated by (11.2.11) is equivalent to Newton's method and converges to the optimal solution P^* of the GARE, as i \to \infty. Furthermore, the sequences \{K_i\}_{i=1}^{\infty} and \{L_i\}_{i=1}^{\infty} converge to the saddle point K^* and L^*, respectively, as i \to \infty. This completes the proof of the theorem.
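The model-based iteration (11.2.11) to which the theorem refers is easy to sketch numerically: given the current gains, solve a Lyapunov equation for P_i and update K_{i+1} and L_{i+1} as in (11.2.4). The sketch below is an illustration of this equivalent iteration, not of the data-based Algorithm 11.2.1 itself; the system data are those of the load-frequency control example presented later in this section, where a zero initial gain is taken to be admissible, as in the chapter's own simulation.

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Load-frequency control model used later in this section.
A = np.array([[-0.0665, 8.0, 0.0, 0.0],
              [0.0, -3.663, 3.663, 0.0],
              [-6.86, 0.0, -13.736, -13.736],
              [0.6, 0.0, 0.0, 0.0]])
B1 = np.array([[0.0], [0.0], [13.736], [0.0]])
B2 = np.array([[-8.0], [0.0], [0.0], [0.0]])
Q, R, gamma = np.eye(4), np.array([[1.0]]), 3.5

K = np.zeros((1, 4))      # initial control gain (assumed stabilizing, as in the simulation)
L = np.zeros((1, 4))
P_prev = np.zeros((4, 4))
for i in range(1, 50):
    Ai = A - B1 @ K + B2 @ L
    Mi = Q + K.T @ R @ K - gamma**2 * (L.T @ L)
    # Lyapunov equation (11.2.11): Ai^T Pi + Pi Ai = -Mi.
    P = solve_continuous_lyapunov(Ai.T, -Mi)
    K = np.linalg.solve(R, B1.T @ P)        # K_{i+1} = R^{-1} B1^T P_i
    L = (B2.T @ P) / gamma**2               # L_{i+1} = gamma^{-2} B2^T P_i
    if np.max(np.abs(P - P_prev)) < 1e-10:
        break
    P_prev = P

print("converged after", i, "iterations")
print(np.round(P, 4))     # expected to be close to the matrix P* reported in the example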
The next theorem will show the convergence of the model-free integral Q-learning
algorithm for zero-sum games.
Theorem 11.2.2 For an initial stabilizing control policy u_1 = -K_1 x, the sequence \{H^i\}_{i=1}^{\infty} obtained by solving (11.2.10) in Algorithm 11.2.2 converges to the optimal solution H^*; i.e., Q^* can be achieved, as i \to \infty.
Proof Because the iteration process of H i with solving (11.2.10) in Algorithm 11.2.2
is equivalent to that of Pi with solving (11.2.8) in Algorithm 11.2.1, then as i → ∞
H^i \to \begin{bmatrix} A^T P^* + P^* A + Q & P^* B_1 & P^* B_2 \\ B_1^T P^* & R & 0 \\ B_2^T P^* & 0 & -\gamma^2 I \end{bmatrix} = H^*.
Hence,

x_{t+(k-1)T}^T P_i x_{t+(k-1)T} - x_{t+kT}^T P_i x_{t+kT} = (\bar{x}_{t+(k-1)T} - \bar{x}_{t+kT})^T \hat{P}_i,

x^T K_{i+1}^T R N_1 = (x \otimes N_1)^T (I_n \otimes R)\, \mathrm{vec}(K_{i+1}),

x^T L_{i+1}^T N_2 = (x \otimes N_2)^T \mathrm{vec}(L_{i+1}).

Using the expressions established above, (11.2.8) can be rewritten in a general compact form as

\psi_k^T \begin{bmatrix} \hat{P}_i \\ \mathrm{vec}(K_{i+1}) \\ \mathrm{vec}(L_{i+1}) \end{bmatrix} = \theta_k, \quad \forall i \in \mathbb{Z}_+, \qquad (11.2.15)

where

\theta_k = \int_{t+(k-1)T}^{t+kT} r(x, u_i, w_i)\, \mathrm{d}\tau,

\psi_k = \Big[ (\bar{x}_{t+(k-1)T} - \bar{x}_{t+kT})^T, \; 2\Big( \int_{t+(k-1)T}^{t+kT} (x \otimes N_1)^T \mathrm{d}\tau \Big)(I_n \otimes R), \; 2\gamma^2 \int_{t+(k-1)T}^{t+kT} (x \otimes N_2)^T \mathrm{d}\tau \Big]^T,
t+(k−1)T
i P
K
ui K i x Wi Li x
x Ax B ui B wi
Pi Ki Li
Pi
T
Ki
Li
Pi Pi
The least squares problem built from (11.2.15) can be solved when the number of collected data samples is no less than

N_{\min} = \mathrm{rank}(\Phi),

i.e.,

N_{\min} = \frac{n(n+1)}{2} + nm + nq,
To satisfy the persistence of excitation condition, the exploratory signals can be chosen as sinusoidal signals with different frequencies [13], random noise [4, 39], and exponentially decreasing probing noise [25].
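To make the data-collection step concrete, the sketch below builds one pair (\psi_k, \theta_k) from samples of x, N_1, N_2, u_i, and w_i over a single interval, and stacks such rows into a least squares problem. For simplicity it parameterizes x^T P_i x with the full Kronecker product \bar{x} = x \otimes x and \hat{P}_i = \mathrm{vec}(P_i), rather than the reduced n(n+1)/2 quadratic basis used in the text; both represent the same quadratic form, and all names here are illustrative assumptions.

import numpy as np

def trapz_int(y, t):
    # Trapezoidal quadrature along the first axis (1-D or 2-D y).
    y = np.asarray(y, dtype=float)
    return np.tensordot(np.diff(t), 0.5 * (y[1:] + y[:-1]), axes=1)

def build_row(ts, xs, N1s, N2s, us, ws, Q, R, gamma):
    # One equation psi_k^T [vec(P); vec(K); vec(L)] = theta_k of (11.2.15).
    n = xs.shape[1]
    r = np.array([x @ Q @ x + u @ R @ u - gamma**2 * (w @ w)
                  for x, u, w in zip(xs, us, ws)])
    theta_k = trapz_int(r, ts)
    dxbar = np.kron(xs[0], xs[0]) - np.kron(xs[-1], xs[-1])
    blk_K = 2.0 * trapz_int([np.kron(x, N1) for x, N1 in zip(xs, N1s)], ts) @ np.kron(np.eye(n), R)
    blk_L = 2.0 * gamma**2 * trapz_int([np.kron(x, N2) for x, N2 in zip(xs, N2s)], ts)
    return np.concatenate([dxbar, blk_K, blk_L]), theta_k

def solve_iteration(rows, rhs, n, m, q):
    # rows/rhs should be accumulated from at least n(n+1)/2 + nm + nq intervals.
    w, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    P = w[:n*n].reshape(n, n, order="F")
    P = 0.5 * (P + P.T)                     # symmetrize the redundant parameterization
    K = w[n*n:n*n + n*m].reshape(m, n, order="F")
    L = w[n*n + n*m:].reshape(q, n, order="F")
    return P, K, L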
\dot{x} = Ax + B_1 u + B_2 w \qquad (11.2.17)
= \begin{bmatrix} -0.0665 & 8 & 0 & 0 \\ 0 & -3.663 & 3.663 & 0 \\ -6.86 & 0 & -13.736 & -13.736 \\ 0.6 & 0 & 0 & 0 \end{bmatrix} x
+ \begin{bmatrix} 0 \\ 0 \\ 13.736 \\ 0 \end{bmatrix} u
+ \begin{bmatrix} -8 \\ 0 \\ 0 \\ 0 \end{bmatrix} w,

where
Δf (Hz) is the incremental frequency deviation, ΔPg (p.u. MW) is the incremental
change in generator output, ΔX g (p.u. MW) is the incremental change in governor
valve position, and \Delta E is the incremental change in integral control. We assume that
the dynamics of system (11.2.17) is unknown. The matrices Q and R in the cost
function are identity matrices of appropriate dimensions, and γ = 3.5. Using the
system model (11.2.17), the matrix in the optimal value function of the zero-sum
game is
P^* = \begin{bmatrix} 0.8335 & 0.9649 & 0.1379 & 0.8005 \\ 0.9649 & 1.4751 & 0.2358 & 0.8046 \\ 0.1379 & 0.2358 & 0.0696 & 0.0955 \\ 0.8005 & 0.8046 & 0.0955 & 2.6716 \end{bmatrix}.
Now we will use the present online model-free integral PI algorithm to solve this
problem. The initial state is selected as x0 = [0.1, 0.2, 0.2, 0.1]T . The simulation is
conducted using data obtained along the system trajectory at every 0.01 s. The least
squares problem is solved after 50 data samples are acquired, and thus, the parameters
of the control policy are updated every 0.5 s. The parameters of the critic network, the
control action network, and the disturbance action network are all initialized to zero.
Similar to [39], the PE condition is ensured by adding small zero-mean Gaussian noises with appropriate variances to the control and disturbance inputs.
Figure 11.2 presents the evolution of the parameters of the game value function
during the learning process. It is clear that Algorithm 11.2.1 is convergent after 10
iterative steps. The obtained approximate game value function is given by the matrix
P_{10} = \begin{bmatrix} 0.8335 & 0.9649 & 0.1379 & 0.8005 \\ 0.9649 & 1.4752 & 0.2359 & 0.8047 \\ 0.1379 & 0.2359 & 0.0696 & 0.0956 \\ 0.8005 & 0.8047 & 0.0956 & 2.6718 \end{bmatrix},
and \|P_{10} - P^*\| = 2.9375 \times 10^{-4}. We can find that the solution obtained by the online model-free integral PI algorithm is quite close to the exact one obtained by
solving GARE. Figures 11.3 and 11.4 show the convergence process of the control
and disturbance action network parameters. The obtained H∞ state feedback control
policy is u 11 = −[1.8941, 3.2397, 0.9563, 1.3126]x.
Fig. 11.3 Convergence of the control action network weights K₁ⁱ–K₄ⁱ

Fig. 11.4 Convergence of the disturbance action network weights L₁ⁱ–L₄ⁱ
11.3 Iterative Adaptive Dynamic Programming for Multi-player Zero-Sum Games

Consider the following multi-player zero-sum differential game. The state trajectory at time t of the game, denoted by x = x(t), is described by the continuous-time affine uncertain nonlinear equation
$$\dot x = f(x, u_1, u_2, \ldots, u_p, w_1, w_2, \ldots, w_q) = a(x) + b_1(x)u_1 + b_2(x)u_2 + \cdots + b_p(x)u_p + c_1(x)w_1 + c_2(x)w_2 + \cdots + c_q(x)w_q, \qquad(11.3.1)$$
where the utility function of the associated value function is
$$l(x, u_1, \ldots, u_p, w_1, \ldots, w_q) = x^{\mathrm T}Ax + u_1^{\mathrm T}B_1u_1 + \cdots + u_p^{\mathrm T}B_pu_p + w_1^{\mathrm T}C_1w_1 + \cdots + w_q^{\mathrm T}C_qw_q.$$
For the above multi-player zero-sum differential game, there are two groups of con-
trollers or players where group I (including u 1 , u 2 , . . . , u p ) tries to minimize the value
function V (x), while group II (including w1 , w2 , . . . , wq ) attempts to maximize it.
According to the situation of the two groups, we have the following definitions.
Let
$$\overline V(x) = \inf_{u_1}\cdots\inf_{u_p}\,\sup_{w_1}\cdots\sup_{w_q} V(x, u_1, \ldots, u_p, w_1, \ldots, w_q) \qquad(11.3.3)$$
be the upper value function and
$$\underline V(x) = \sup_{w_1}\cdots\sup_{w_q}\,\inf_{u_1}\cdots\inf_{u_p} V(x, u_1, \ldots, u_p, w_1, \ldots, w_q) \qquad(11.3.4)$$
be the lower value function, with the obvious inequality V̄(x) ≥ V̲(x) (see [9] for details). Define the optimal control vectors for the upper and lower value functions as (ū₁, ..., ūₚ, w̄₁, ..., w̄_q) and (u̲₁, ..., u̲ₚ, w̲₁, ..., w̲_q), respectively. Then,
$$\overline V(x) = V(x, \bar u_1, \ldots, \bar u_p, \bar w_1, \ldots, \bar w_q)$$
and
$$\underline V(x) = V(x, \underline u_1, \ldots, \underline u_p, \underline w_1, \ldots, \underline w_q).$$
If V̄(x) = V̲(x) holds, then the optimal value function of the zero-sum differential game (the optimal solution) exists, and the corresponding optimal control vector is denoted by (u₁*, u₂*, ..., uₚ*, w₁*, w₂*, ..., w_q*).
The following assumptions and lemmas are needed.
Assumption 11.3.2 The upper value function and the lower value function both
exist.
Based on the above assumptions, the following two lemmas are important in
applications of the ADP method.
Lemma 11.3.1 If Assumptions 11.3.1–11.3.3 hold, then for 0 ≤ t ≤ t̂ < ∞, x ∈ ℝⁿ, u ∈ ℝᵏ, w ∈ ℝᵐ, we have
$$\overline V(x(t)) = \inf_{u_1}\cdots\inf_{u_p}\,\sup_{w_1}\cdots\sup_{w_q}\Big\{\int_t^{\hat t} l(x, u_1, \ldots, u_p, w_1, \ldots, w_q)\,\mathrm d\tau + \overline V(x(\hat t))\Big\},$$
$$\underline V(x(t)) = \sup_{w_1}\cdots\sup_{w_q}\,\inf_{u_1}\cdots\inf_{u_p}\Big\{\int_t^{\hat t} l(x, u_1, \ldots, u_p, w_1, \ldots, w_q)\,\mathrm d\tau + \underline V(x(\hat t))\Big\}.$$
Lemma 11.3.2 If the upper and lower value functions are defined as (11.3.3) and (11.3.4), respectively, then they satisfy the corresponding HJI equations (11.3.5) and (11.3.6).

Remark 11.3.1 Optimal control problems do not necessarily have smooth or even continuous value functions [1]. In [7], it is shown that if the Hamiltonians are strictly convex in u and concave in w, then V̄(x) ∈ C¹ and V̲(x) ∈ C¹ satisfy the HJI equations (11.3.5) and (11.3.6) everywhere. If the smoothness property is removed, the theory of viscosity solutions [7] shows that, for infinite horizon optimal control problems with unbounded value functionals, V̄(x) and V̲(x) are the unique viscosity solutions of the HJI equations (11.3.5) and (11.3.6), respectively, under Assumptions 11.3.1–11.3.3.

The optimal control vector can be obtained by solving the HJI equations (11.3.5) and (11.3.6), but these equations cannot, in general, be solved analytically, and there is no systematic method for solving them exactly to obtain the optimal value functions of the system. We therefore introduce an iterative ADP method to tackle this problem. In this section, the iterative ADP method for multi-player zero-sum differential games is developed.
Theorem 11.3.1 Suppose Assumptions 11.3.1–11.3.3 hold. If the upper and lower value functions V̄(x) and V̲(x) are defined as (11.3.3) and (11.3.4), respectively, then
$$\overline V(x) = \inf_{u_1}\cdots\inf_{u_m}\cdots\inf_{u_n}\cdots\inf_{u_p}\,\sup_{w_1}\cdots\sup_{w_{\bar m}}\cdots\sup_{w_{\bar n}}\cdots\sup_{w_q}\{V(x, u_1, \ldots, u_p, w_1, \ldots, w_q)\} \qquad(11.3.7)$$
and
$$\underline V(x) = \sup_{w_1}\cdots\sup_{w_{\bar m}}\cdots\sup_{w_{\bar n}}\cdots\sup_{w_q}\,\inf_{u_1}\cdots\inf_{u_m}\cdots\inf_{u_n}\cdots\inf_{u_p}\{V(x, u_1, \ldots, u_p, w_1, \ldots, w_q)\}, \qquad(11.3.8)$$
where the infimum and supremum operations within each group of players can be taken in an arbitrary order.

Proof Taking the derivative of the upper Hamiltonian with respect to w_j, we have
$$\frac{\partial\overline H}{\partial w_j} = \Big(\frac{\partial f(x,u_1,\ldots,u_p,w_1,\ldots,w_q)}{\partial w_j}\Big)^{\mathrm T}\overline V_x + \frac{\partial l(x,u_1,\ldots,u_p,w_1,\ldots,w_q)}{\partial w_j} = 0,$$
which yields
$$\bar w_j = -\frac{1}{2}C_j^{-1}c_j^{\mathrm T}(x)\overline V_x. \qquad(11.3.9)$$
Taking the derivative with respect to u_k, we have
$$\frac{\partial\overline H}{\partial u_k} = \Big(\frac{\partial f(x,u_1,\ldots,u_p,w_1,\ldots,w_q)}{\partial u_k}\Big)^{\mathrm T}\overline V_x + \frac{\partial l(x,u_1,\ldots,u_p,w_1,\ldots,w_q)}{\partial u_k} = 0.$$
Substituting (11.3.9) into (11.3.5) and using Assumption 11.3.3, we can obtain
$$\bar u_k = -\frac{1}{2}B_k^{-1}b_k^{\mathrm T}(x)\overline V_x. \qquad(11.3.10)$$
From (11.3.9), we can see that for any j = 1, ..., q, w̄_j is independent of w̄_{j′} for j′ ≠ j, and for any k = 1, ..., p, ū_k is independent of ū_{k′} for k′ ≠ k, which proves the conclusion in (11.3.7).

On the other hand, according to ∂H̲/∂u_k = 0 and ∂H̲/∂w_j = 0, we can get
$$\underline u_k = -\frac{1}{2}B_k^{-1}b_k^{\mathrm T}(x)\underline V_x \qquad(11.3.11)$$
and
$$\underline w_j = -\frac{1}{2}C_j^{-1}c_j^{\mathrm T}(x)\underline V_x. \qquad(11.3.12)$$
From (11.3.12), we can also see that for any j = 1, ..., q, w̲_j is independent of w̲_{j′} for j′ ≠ j, and for any k = 1, ..., p, u̲_k is independent of u̲_{k′} for k′ ≠ k, which proves the conclusion in (11.3.8). This completes the proof of the theorem.
From (11.3.9) to (11.3.12), we can see that if the system functions b_k(x) and c_j(x) are obtained, then the upper and lower control pairs (ū_k, w̄_j) and (u̲_k, w̲_j) are well defined. Next, NNs are introduced to construct the uncertain nonlinear system (11.3.1). For convenience of training the NNs, discretization of the continuous-time system function is necessary. According to the Euler and trapezoidal methods [11], we have ẋ(t) = (x(t+1) − x(t))/Δt, where Δt is the sampling time interval, which satisfies Shannon's sampling theorem [22]. Then, the uncertain nonlinear system can be constructed as
$$x(t+1) = x(t) + \Big(a(x(t)) + \sum_{k=1}^{p} b_k(x(t))u_k(t) + \sum_{j=1}^{q} c_j(x(t))w_j(t)\Big)\Delta t. \qquad(11.3.13)$$
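A minimal sketch of the Euler discretization (11.3.13) is given below; the function handles a(·), b_k(·), and c_j(·) are placeholders to be supplied by the user (or by the trained model NN), and the scalar usage example simply instantiates the linear plant of Example 11.3.1 later in this section.

```python
import numpy as np

def euler_step(x, u_list, w_list, a, b_list, c_list, dt):
    """One Euler step of the multi-player affine system, as in (11.3.13):
    x(t+1) = x(t) + (a(x) + sum_k b_k(x) u_k + sum_j c_j(x) w_j) * dt."""
    drift = a(x)
    drift = drift + sum(b(x) @ u for b, u in zip(b_list, u_list))
    drift = drift + sum(c(x) @ w for c, w in zip(c_list, w_list))
    return x + drift * dt

# Scalar illustration (Example 11.3.1): xdot = x + u + w, dt = 1e-2
x_next = euler_step(np.array([1.0]), [np.array([0.0])], [np.array([0.0])],
                    a=lambda x: x,
                    b_list=[lambda x: np.eye(1)],
                    c_list=[lambda x: np.eye(1)],
                    dt=1e-2)
```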
We can see that the system functions b_k(x(t)) and c_j(x(t)) are constructed by the NN once the weights W_m and V_m are obtained. This guarantees that the upper and lower control pairs (ū_k, w̄_j) and (u̲_k, w̲_j) in (11.3.9)–(11.3.12) are well defined.
According to (11.3.13)–(11.3.16), the nonlinear system (11.3.1) can be written as
$$\dot x = f(x, U, W) = a(x) + W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial U}U + W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial W}W. \qquad(11.3.17)$$
The upper and lower value functions can be, respectively, rewritten as
$$\overline V(x) = \inf_U\sup_W V(x, U, W)$$
and
$$\underline V(x) = \sup_W\inf_U V(x, U, W).$$
Hence, if Theorem 11.3.1 holds, the multi-player zero-sum differential game can be transformed into a two-player differential game, which simplifies the problem. However, the upper and lower control vectors still cannot be solved directly. For example, to obtain the upper control vectors (ū₁, ..., ūₚ, w̄₁, ..., w̄_q), we must obtain the upper value function V̄(x). Generally speaking, V̄(x) is unknown before all the control vectors
$$(\bar u_1, \ldots, \bar u_p, \bar w_1, \ldots, \bar w_q) \in \mathbb R^{n_1+\cdots+n_p+m_1+\cdots+m_q}$$
are obtained.
and
$$\overline W^{(i)} = -\frac{1}{2}C^{-1}\Big(W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial \overline W^{(i)}}\Big)^{\mathrm T}\overline V_x^{(i)}, \qquad(11.3.20)$$
where (Ū^{(i)}, W̄^{(i)}) satisfies the HJI equation HJI(V̄^{(i)}(x), Ū^{(i)}, W̄^{(i)}) = 0.
For i = 0, 1, ..., let the lower iterative value function be
$$\underline V^{(i)}(x(0)) = \int_0^\infty l\big(x, \underline U^{(i)}, \underline W^{(i)}\big)\,\mathrm dt. \qquad(11.3.21)$$
and
$$\underline W^{(i)} = -\frac{1}{2}C^{-1}\Big(W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial \underline W^{(i)}}\Big)^{\mathrm T}\underline V_x^{(i)}, \qquad(11.3.23)$$
where (U̲^{(i)}, W̲^{(i)}) satisfies the HJI equation HJI(V̲^{(i)}(x), U̲^{(i)}, W̲^{(i)}) = 0.
Remark 11.3.2 We should point out the following important fact. In [39], for two-player zero-sum differential games, it is proved that the iterative value functions converge to the optimal solution when the system function is given. As the nonlinear system (11.3.1) is replaced by (11.3.17), the upper control pair (ū_k, w̄_j) in (11.3.9) and (11.3.10) is replaced by (11.3.19) and (11.3.20), and the lower control pair (u̲_k, w̲_j) in (11.3.11) and (11.3.12) is replaced by (11.3.22) and (11.3.23), respectively. From (11.3.9) to (11.3.12), we can see that the system functions b_k(x), k = 1, 2, ..., p, and c_j(x), j = 1, 2, ..., q, are functions of x only, while the NN-based functions in (11.3.19) and (11.3.20) are functions of x, Ū^{(i)}, and W̄^{(i)}, and those in (11.3.22) and (11.3.23) are functions of x, U̲^{(i)}, and W̲^{(i)}. On the other hand, in [39] the system function is invariant for all i, whereas in this chapter the system functions differ for different i = 0, 1, .... These are the two obvious differences from the results in [39].
In the next section, we will show that using NNs to construct the uncertain non-
linear system (11.3.1), the iterative control pairs can also guarantee the upper and
lower iterative value functions to converge to the optimal solution of the game.
11.3.2 Properties
In this section, we show that the present iterative ADP algorithm for multi-player zero-sum differential games guarantees the stability of the nonlinear system and the convergence of the iterative value functions.
Theorem 11.3.2 Let Assumptions 11.3.1–11.3.3 hold, and let Ū^{(i)} ∈ ℝᵏ, W̄^{(i)} ∈ ℝᵐ, and V̄^{(i)}(x) ∈ C¹ satisfy the HJI equation
$$\mathrm{HJI}\big(\overline V^{(i)}(x), \overline U^{(i)}, \overline W^{(i)}\big) = 0, \quad i = 0, 1, \ldots$$
If for any t, l(x, Ū^{(i)}, W̄^{(i)}) ≥ 0, then the new control pair (Ū^{(i+1)}, W̄^{(i+1)}) given by (11.3.19) and (11.3.20), which satisfies (11.3.18), guarantees the asymptotic stability of the nonlinear system (11.3.1).

Proof Since V̄^{(i)}(x) ∈ C¹, according to (11.3.20), we can get
$$\frac{\mathrm d\overline V^{(i)}(x)}{\mathrm dt} = \big(\overline V_x^{(i)}\big)^{\mathrm T}a(x) + \big(\overline V_x^{(i)}\big)^{\mathrm T}W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial\overline U^{(i)}}\overline U^{(i+1)} - \frac{1}{2}\big(\overline V_x^{(i)}\big)^{\mathrm T}W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial\overline W^{(i)}}\,C^{-1}\Big(W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial\overline W^{(i)}}\Big)^{\mathrm T}\overline V_x^{(i)}. \qquad(11.3.24)$$
On the other hand, substituting (11.3.20) into the utility function, it follows that
$$l\big(x, \overline U^{(i+1)}, \overline W^{(i+1)}\big) = \big(\overline U^{(i+1)}\big)^{\mathrm T}B\,\overline U^{(i+1)} + x^{\mathrm T}Ax + \frac{1}{4}\big(\overline V_x^{(i)}\big)^{\mathrm T}W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial\overline W^{(i)}}\,C^{-1}\Big(W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X))}{\partial(V_m^{\mathrm T}X)}V_m^{\mathrm T}\frac{\partial X}{\partial\overline W^{(i)}}\Big)^{\mathrm T}\overline V_x^{(i)} \geq 0. \qquad(11.3.27)$$
Combining (11.3.26) and (11.3.27), we can derive dV̄^{(i)}(x)/dt ≤ 0. As V̄^{(i)}(x) is positive definite (since l(x, u, w) is positive definite) and dV̄^{(i)}(x)/dt ≤ 0, V̄^{(i)}(x) is a Lyapunov function. We can conclude that the system (11.3.1) is asymptotically stable. The proof is complete.
Theorem 11.3.3 Let Assumptions 11.3.1–11.3.3 hold, and let U̲^{(i)} ∈ ℝᵏ, W̲^{(i)} ∈ ℝᵐ, and V̲^{(i)}(x) ∈ C¹ satisfy the HJI equation HJI(V̲^{(i)}(x), U̲^{(i)}, W̲^{(i)}) = 0, i = 0, 1, .... If for all t, l(x, U̲^{(i)}, W̲^{(i)}) < 0, then the control pair (U̲^{(i)}, W̲^{(i)}) formulated by (11.3.22) and (11.3.23), which satisfies the value function (11.3.21), guarantees system (11.3.1) to be asymptotically stable.

Since the proof is similar to that of Theorem 11.3.2, it is omitted here.

Corollary 11.3.1 Let Assumptions 11.3.1–11.3.3 hold, and let U̲^{(i)} ∈ ℝᵏ, W̲^{(i)} ∈ ℝᵐ, and V̲^{(i)}(x) ∈ C¹ satisfy the HJI equation HJI(V̲^{(i)}(x), U̲^{(i)}, W̲^{(i)}) = 0, i = 0, 1, .... If for all t, l(x, U̲^{(i)}, W̲^{(i)}) ≥ 0, then the control pair (U̲^{(i)}, W̲^{(i)}) which satisfies the value function (11.3.21) guarantees system (11.3.1) to be asymptotically stable.

Corollary 11.3.2 Let Assumptions 11.3.1–11.3.3 hold, and let Ū^{(i)} ∈ ℝᵏ, W̄^{(i)} ∈ ℝᵐ, and V̄^{(i)}(x) ∈ C¹ satisfy the HJI equation HJI(V̄^{(i)}(x), Ū^{(i)}, W̄^{(i)}) = 0, i = 0, 1, .... If for any t, l(x, Ū^{(i)}, W̄^{(i)}) < 0, then the control pair (Ū^{(i)}, W̄^{(i)}) which satisfies the value function (11.3.18) guarantees system (11.3.1) to be asymptotically stable.

Theorem 11.3.4 Let Assumptions 11.3.1–11.3.3 hold. Also, let Ū^{(i)} ∈ ℝᵏ, W̄^{(i)} ∈ ℝᵐ, and V̄^{(i)}(x) ∈ C¹ satisfy the HJI equation HJI(V̄^{(i)}(x), Ū^{(i)}, W̄^{(i)}) = 0, for i = 0, 1, .... Let l(x, Ū^{(i)}, W̄^{(i)}) be the utility function. Then, the control pair (Ū^{(i)}, W̄^{(i)}), which satisfies the upper value function (11.3.18), guarantees system (11.3.1) to be asymptotically stable.

Theorem 11.3.5 Let Assumptions 11.3.1–11.3.3 hold, and let U̲^{(i)} ∈ ℝᵏ, W̲^{(i)} ∈ ℝᵐ, and V̲^{(i)}(x) ∈ C¹ satisfy the HJI equation HJI(V̲^{(i)}(x), U̲^{(i)}, W̲^{(i)}) = 0, i = 0, 1, .... Let l(x, U̲^{(i)}, W̲^{(i)}) be the utility function. Then, the control pair (U̲^{(i)}, W̲^{(i)}), which satisfies the lower value function (11.3.21), is a pair of asymptotically stable controls for system (11.3.1).
Next, the analysis of convergence property for the zero-sum differential games is
presented to guarantee that the iterative control pairs reach the optimal solution.
Proposition 11.3.1 Let Assumptions 11.3.1–11.3.3 hold, and let Ū^{(i)} ∈ ℝᵏ, W̄^{(i)} ∈ ℝᵐ, and V̄^{(i)}(x) ∈ C¹ satisfy the HJI equation HJI(V̄^{(i)}(x), Ū^{(i)}, W̄^{(i)}) = 0. Then, the iterative control pairs (Ū^{(i)}, W̄^{(i)}) formulated by (11.3.19) and (11.3.20) guarantee V̄^{(i)}(x) → V̄(x) as i → ∞.

Since the system (11.3.1) is asymptotically stable, its state trajectory x converges to zero, and so does V̄^{(i+1)}(x) − V̄^{(i)}(x). Since d(V̄^{(i+1)}(x) − V̄^{(i)}(x))/dt ≥ 0 on these trajectories, it implies V̄^{(i+1)}(x) − V̄^{(i)}(x) ≤ 0, i.e., V̄^{(i+1)}(x) ≤ V̄^{(i)}(x). Thus, V̄^{(i)}(x) is convergent as i → ∞ since V̄^{(i)}(x) ≥ 0.

(2) Show that V̄^{(i)}(x) → V̄(x) as i → ∞. We define
$$\lim_{i\to\infty}\overline V^{(i)}(x) = \overline V^{(\infty)}(x).$$
Let
$$\overline W^* = \arg\max_{\overline W}\Big\{\int_t^{\hat t} l(x, \overline U, \overline W)\,\mathrm d\tau + \overline V^{(i)}(x(\hat t))\Big\}.$$
Then,
$$\overline V^{(i)}(x) \leq \sup_{\overline W}\Big\{\int_t^{\hat t} l(x, \overline U, \overline W)\,\mathrm d\tau + \overline V^{(i)}(x(\hat t))\Big\} = \int_t^{\hat t} l(x, \overline U, \overline W^*)\,\mathrm d\tau + \overline V^{(i)}(x(\hat t)).$$
Since V̄^{(i+1)}(x) ≤ V̄^{(i)}(x), we have
$$\overline V^{(\infty)}(x) \leq \int_t^{\hat t} l(x, \overline U, \overline W^*)\,\mathrm d\tau + \overline V^{(i)}(x(\hat t)).$$
So, it follows that
$$\overline V^{(\infty)}(x) \leq \inf_{\overline U}\sup_{\overline W}\Big\{\int_t^{\hat t} l(x, \overline U, \overline W)\,\mathrm d\tau + \overline V^{(\infty)}(x(\hat t))\Big\}. \qquad(11.3.28)$$
Let ε > 0 be an arbitrary positive number. Since the upper value function is nonincreasing and convergent, there exists a positive integer i such that
$$\overline V^{(i)}(x) - \varepsilon \leq \overline V^{(\infty)}(x) \leq \overline V^{(i)}(x).$$
Let
$$\overline U^* = \arg\min_{\overline U}\Big\{\int_t^{\hat t} l(x, \overline U, \overline W^*)\,\mathrm d\tau + \overline V^{(i)}(x(\hat t))\Big\}.$$
Then, we can get
$$\overline V^{(i)}(x) = \int_t^{\hat t} l\big(x, \overline U^*, \overline W^*\big)\,\mathrm d\tau + \overline V^{(i)}(x(\hat t)).$$
Thus,
$$\overline V^{(\infty)}(x) \geq \int_t^{\hat t} l(x, \overline U^*, \overline W^*)\,\mathrm d\tau + \overline V^{(i)}(x(\hat t)) - \varepsilon \geq \int_t^{\hat t} l(x, \overline U^*, \overline W^*)\,\mathrm d\tau + \overline V^{(\infty)}(x(\hat t)) - \varepsilon = \inf_{\overline U}\sup_{\overline W}\Big\{\int_t^{\hat t} l(x, \overline U, \overline W)\,\mathrm d\tau + \overline V^{(\infty)}(x(\hat t))\Big\} - \varepsilon.$$
Since ε is arbitrary, we obtain
$$\overline V^{(\infty)}(x) \geq \inf_{\overline U}\sup_{\overline W}\Big\{\int_t^{\hat t} l(x, \overline U, \overline W)\,\mathrm d\tau + \overline V^{(\infty)}(x(\hat t))\Big\}. \qquad(11.3.29)$$
Combining (11.3.28) and (11.3.29), we have
$$\overline V^{(\infty)}(x) = \inf_{\overline U}\sup_{\overline W}\Big\{\int_t^{\hat t} l(x, \overline U, \overline W)\,\mathrm d\tau + \overline V^{(\infty)}(x(\hat t))\Big\}.$$
Letting t̂ → ∞, we have
$$\overline V^{(\infty)}(x) = \inf_{\overline U}\sup_{\overline W} V(x, \overline U, \overline W),$$
which is the same as (11.3.3). So, V̄^{(i)}(x) → V̄(x) as i → ∞. The proof is complete.
Proposition 11.3.2 Let Assumptions 11.3.1–11.3.3 hold, and let U̲^{(i)} ∈ ℝᵏ, W̲^{(i)} ∈ ℝᵐ, and V̲^{(i)}(x) ∈ C¹ satisfy the HJI equation HJI(V̲^{(i)}(x), U̲^{(i)}, W̲^{(i)}) = 0. Then, the iterative control pair (U̲^{(i)}, W̲^{(i)}) formulated by (11.3.22) and (11.3.23) guarantees V̲^{(i)}(x) → V̲(x) as i → ∞.

Remark 11.3.3 In [39], the convergence property was proved for two-player zero-sum differential games with known nonlinear systems. We should point out that in [39] the system function is invariant for all i, whereas for the multi-player zero-sum differential games considered here the NN-based system functions are updated for every i = 0, 1, .... From Propositions 11.3.1 and 11.3.2, we can see that the convergence property can still be guaranteed under the NN approximation of the nonlinear system, which shows the effectiveness of the present method.

Theorem 11.3.6 If the optimal value function of the zero-sum differential game (the optimal solution) exists, then the control pairs (Ū^{(i+1)}, W̄^{(i+1)}) and (U̲^{(i+1)}, W̲^{(i+1)}) guarantee V̄^{(i)}(x) → V*(x) and V̲^{(i)}(x) → V*(x), respectively, as i → ∞.

Proof For the upper value function, according to Proposition 11.3.1, we have V̄^{(i)}(x) → V̄(x) under the control pair (Ū^{(i+1)}, W̄^{(i+1)}) as i → ∞. So, the optimal control pair for the upper value function satisfies (11.3.30). On the other hand, there exists an optimal control pair (U*, W*) under which the value function reaches the optimal solution. According to the property of the optimal solution, the optimal control pair (U*, W*) satisfies the same relation as (11.3.30). So, V̄^{(i)}(x) → V*(x) under the control pair (Ū^{(i+1)}, W̄^{(i+1)}) as i → ∞. Similarly, we can derive V̲^{(i)}(x) → V*(x) under the control pair (U̲^{(i+1)}, W̲^{(i+1)}) as i → ∞. This completes the proof of the theorem.

Remark 11.3.4 From Theorem 11.3.6, we can see that if the optimal solution exists, the upper and lower iterative value functions converge to the optimum under the iterative control pairs (Ū^{(i)}, W̄^{(i)}) and (U̲^{(i)}, W̲^{(i)}), respectively. Since the existence criteria for the optimal solution of the games required in [2, 3, 8] are unnecessary here, the present iterative ADP algorithm is more effective for solving zero-sum differential games.
As computers can only deal with digital and discrete signals, it is necessary to transform the continuous-time value function into a corresponding discrete-time form. Discretization of the value function using the Euler and trapezoidal methods [11] leads to
$$V(x(0)) = \sum_{t=0}^{\infty}\big(x^{\mathrm T}(t)Ax(t) + U^{\mathrm T}(t)BU(t) + W^{\mathrm T}(t)CW(t)\big)\Delta t,$$
where σ(Y_f^T X) ∈ ℝ^λ and the activation functions are
$$[\sigma(z)]_i = \frac{e^{z_i} - e^{-z_i}}{e^{z_i} + e^{-z_i}}, \quad i = 1, \ldots, \lambda.$$
Using NNs, the estimation error can be expressed accordingly, where V_f*, W_f* are the ideal weight parameters and ξ(X) is the estimation error. In the model network, x̂(t+1) is the estimated system state vector and Ŵ_m(t) is the estimate of the ideal weight matrix W_m. According to (11.3.14), we can define the system identification error as x̃(t+1) = x̂(t+1) − x(t+1), where W̃_m(t) = Ŵ_m(t) − W_m. Let φ(t) = W̃_m^T(t)σ(Z̄(t)). Define the performance measure
$$E_m(t) = \frac{1}{2}\tilde x^{\mathrm T}(t+1)\tilde x(t+1).$$
Then, the gradient-based weight update rule for the model network can be described as
$$\tilde W_m(t+1) = \tilde W_m(t) + \Delta\tilde W_m(t), \qquad(11.3.33)$$
$$\Delta\tilde W_m(t) = \eta_m\Big(-\frac{\partial E_m(t)}{\partial\tilde W_m(t)}\Big), \qquad(11.3.34)$$
where η_m > 0 is the learning rate of the model network. After the model network is trained, its weights are kept unchanged.
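The following sketch shows one possible gradient training loop in the spirit of (11.3.33)–(11.3.34). The assumed structure x̂(t+1) = W_m^T tanh(V_m^T z(t)) with z(t) = [x(t); u(t); w(t)], the fixed hidden-layer weights, and the data format are all illustrative assumptions; the chapter's three-layer network may also adapt V_m.

```python
import numpy as np

def train_model_nn(data, n_hidden=8, eta_m=0.002, epochs=1000, rng=None):
    """Gradient training of a model NN (a sketch inspired by (11.3.31)-(11.3.34)).

    data: list of (z_t, x_next) pairs collected from the plant, where
    z_t = [x(t); u(t); w(t)] and x_next = x(t+1)."""
    rng = rng or np.random.default_rng(0)
    n_in, n_out = data[0][0].size, data[0][1].size
    Vm = rng.uniform(-0.5, 0.5, (n_in, n_hidden))   # hidden weights (kept fixed here)
    Wm = rng.uniform(-0.5, 0.5, (n_hidden, n_out))  # output weights (trained)
    for _ in range(epochs):
        for z, x_next in data:
            h = np.tanh(Vm.T @ z)                   # sigma(Vm^T z)
            x_tilde = Wm.T @ h - x_next             # identification error x_tilde(t+1)
            Wm -= eta_m * np.outer(h, x_tilde)      # Wm <- Wm - eta_m * dE_m/dWm
    return Vm, Wm
```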
Next, we give the convergence theorem for the model NN.

Theorem 11.3.7 Let the NN be expressed by (11.3.31) and let the NN weights be updated by (11.3.33) and (11.3.34). The approximation error e_m is expressed as in (11.3.14). If there exists a constant 0 < λ_M < 1 such that (11.3.35) holds, then the system identification error x̃(t) is asymptotically stable while the parameter estimation error W̃_m(t) is bounded.

Proof To obtain the conclusion, we construct the Lyapunov function candidate
$$L(x) = \tilde x^{\mathrm T}(t)\tilde x(t) + \frac{1}{\eta_m}\mathrm{tr}\big(\tilde W_m^{\mathrm T}(t)\tilde W_m(t)\big).$$
As ‖σ(Z̄(t))‖ is finite, we can define ‖σ(Z̄(t))‖ ≤ σ_M, where σ_M > 0 is the upper bound. According to (11.3.35), we can get
$$\Delta L(x) \leq -\big(1 - 2\eta_m\sigma_M^2\big)\|\phi(t)\|^2 - \big(1 - \lambda_M - 2\eta_m\lambda_M\sigma_M^2\big)\|\tilde x(t)\|^2, \qquad(11.3.36)$$
where β = 0. As long as the parameters are selected as discussed above, we have ΔL(x) ≤ 0. Therefore, the identification error x̃(t) and the weight estimation error W̃_m(t) are bounded if x̃(0) and W̃_m(0) are bounded. Furthermore, by summing both sides of (11.3.37) to infinity and taking the limit as t → ∞, it can be concluded that the identification error x̃(t) approaches zero as t → ∞. This completes the proof of the theorem.

Remark 11.3.5 In [26], an ADP algorithm was used to solve a nonlinear multi-player nonzero-sum game which can be converted into a zero-sum one. In [26], the system function is necessary in order to obtain the optimal control pair, whereas in this chapter the control pair is obtained without the system model. If the conditions of Theorem 11.3.7 are satisfied, the system dynamics can be well reconstructed by an NN, which guarantees the effectiveness of the present algorithm.
Fig. 11.5 Structural diagram of the iterative ADP method for multi-person zero-sum games
The critic network is used to approximate the upper and lower iterative value functions, i.e., V̄^{(i)}(x) and V̲^{(i)}(x). The output of the critic network is denoted as
$$\hat V^{(i)}(x(t)) = W_c^{(i)\mathrm T}\sigma\big(V_c^{(i)\mathrm T}X(t)\big),$$
where X(t) = [x^T(t), u^T(t), w^T(t)]^T is the input of the critic networks. The two critic networks have the following target functions. For the upper value function, the target function can be written as
$$\overline V^{(i)}(x(t)) = \big(x^{\mathrm T}(t)Ax(t) + \overline U^{(i+1)\mathrm T}(t)B\overline U^{(i+1)}(t) + \overline W^{(i+1)\mathrm T}(t)C\overline W^{(i+1)}(t)\big)\Delta t + \hat{\overline V}^{(i)}(x(t+1)),$$
where V̂̄^{(i)}(x(t+1)) is the output of the upper critic network. For the lower value function, the target function can be written as
$$\underline V^{(i)}(x(t)) = \big(x^{\mathrm T}(t)Ax(t) + \underline U^{(i+1)\mathrm T}(t)B\underline U^{(i+1)}(t) + \underline W^{(i+1)\mathrm T}(t)C\underline W^{(i+1)}(t)\big)\Delta t + \hat{\underline V}^{(i)}(x(t+1)), \qquad(11.3.38)$$
where V̲̂^{(i)}(x(t+1)) is the output of the lower critic network.

Then, for the upper value function, we define the error function of the critic network by
$$e_c^{(i)}(t) = \overline V^{(i)}(x(t)) - \hat{\overline V}^{(i)}(x(t)),$$
where V̂̄^{(i)}(x(t)) is the output of the upper critic network. The objective function to be minimized in the critic network is
$$E_c^{(i)}(t) = \frac{1}{2}\big(e_c^{(i)}(t)\big)^2.$$
So, the gradient-based weight update rule for the critic network is given by
$$W_c^{(i)}(t+1) = W_c^{(i)}(t) + \Delta W_c^{(i)}(t), \qquad \Delta W_c^{(i)}(t) = \eta_c\Big(-\frac{\partial E_c^{(i)}(t)}{\partial W_c^{(i)}(t)}\Big), \qquad \frac{\partial E_c^{(i)}(t)}{\partial W_c^{(i)}(t)} = \frac{\partial E_c^{(i)}(t)}{\partial\hat{\overline V}^{(i)}(x(t))}\frac{\partial\hat{\overline V}^{(i)}(x(t))}{\partial W_c^{(i)}(t)},$$
where η_c > 0 is the learning rate of the critic network and W_c(t) is the weight vector in the critic network, which can be replaced by W_c^{(i)} and V_c^{(i)}.
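A minimal sketch of one such gradient step for the upper critic is given below. It assumes a single-hidden-layer critic V̂(x) = W_c^T tanh(V_c^T x) driven by the state only and treats the bootstrapped target as a constant; both simplifications are assumptions, since the text also allows the augmented input X(t).

```python
import numpy as np

def critic_update(Wc, Vc, x_t, x_next, utility_dt, eta_c=0.01):
    """One gradient step toward the target in (11.3.38):
    target = utility*dt + V_hat(x(t+1)),  e_c = target - V_hat(x(t)),
    Wc <- Wc - eta_c * dE_c/dWc with E_c = 0.5*e_c^2."""
    h_t = np.tanh(Vc.T @ x_t)
    v_hat_t = Wc @ h_t                        # V_hat^{(i)}(x(t))
    v_hat_next = Wc @ np.tanh(Vc.T @ x_next)  # V_hat^{(i)}(x(t+1)), held fixed
    e_c = (utility_dt + v_hat_next) - v_hat_t
    # dE_c/dWc = -e_c * h_t, so the descent step moves the output toward the target
    return Wc + eta_c * e_c * h_t
```

In practice this step is repeated for a fixed number of inner-loop iterations per ADP iteration, as in the examples later in this section.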
For the lower value function, the error function of the critic network is defined by
$$e_c^{(i)}(t) = \underline V^{(i)}(x(t)) - \hat{\underline V}^{(i)}(x(t)).$$
For the lower iterative value function, the weight update rule of the critic network is the same as the one for the upper value function, and the details are omitted here. To implement the present iterative ADP algorithm, four action networks are used: two approximate the laws of the upper iterative control pair and the other two approximate the laws of the lower iterative control pair.

The target functions of the upper U and W action networks are the discretized forms of (11.3.19) and (11.3.20), respectively, which can be written as
$$\overline U^{(i+1)}(t) = -\frac{1}{2}B^{-1}\Big(W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X(t)))}{\partial(V_m^{\mathrm T}X(t))}V_m^{\mathrm T}\frac{\partial X(t)}{\partial\overline U^{(i)}(t)}\Big)^{\mathrm T}\frac{\mathrm d\overline V^{(i)}(x(t+1))}{\mathrm dx(t+1)}, \qquad(11.3.39)$$
and
$$\overline W^{(i+1)}(t) = -\frac{1}{2}C^{-1}\Big(W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X(t)))}{\partial(V_m^{\mathrm T}X(t))}V_m^{\mathrm T}\frac{\partial X(t)}{\partial\overline W^{(i)}(t)}\Big)^{\mathrm T}\frac{\mathrm d\overline V^{(i)}(x(t+1))}{\mathrm dx(t+1)}.$$
The target functions of the lower U and W action networks are the discretized forms of (11.3.22) and (11.3.23), respectively, which can be written as
$$\underline U^{(i+1)}(t) = -\frac{1}{2}B^{-1}\Big(W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X(t)))}{\partial(V_m^{\mathrm T}X(t))}V_m^{\mathrm T}\frac{\partial X(t)}{\partial\underline U^{(i)}(t)}\Big)^{\mathrm T}\frac{\mathrm d\underline V^{(i)}(x(t+1))}{\mathrm dx(t+1)},$$
and
$$\underline W^{(i+1)}(t) = -\frac{1}{2}C^{-1}\Big(W_m^{\mathrm T}\frac{\partial(\sigma(V_m^{\mathrm T}X(t)))}{\partial(V_m^{\mathrm T}X(t))}V_m^{\mathrm T}\frac{\partial X(t)}{\partial\underline W^{(i)}(t)}\Big)^{\mathrm T}\frac{\mathrm d\underline V^{(i)}(x(t+1))}{\mathrm dx(t+1)}.$$
The output of the action network (the upper U action network, for example) can be formulated as
$$\hat{\overline U}^{(i)}(t) = \overline W_{ua}^{(i)\mathrm T}\sigma\big(\overline Y_{ua}^{(i)\mathrm T}x(t)\big).$$
The target of the output of the action network is given by (11.3.39). So, we can define the output error of the action network as
$$e_{ua}^{(i)}(t) = \overline U^{(i)}(t) - \hat{\overline U}^{(i)}(t).$$
The weights of the action network are updated to minimize the performance error measure
$$E_{ua}^{(i)}(t) = \frac{1}{2}\big(e_{ua}^{(i)}(t)\big)^2.$$
The weight update algorithm is similar to the one for the critic network and follows the gradient descent rule, where η_a > 0 is the learning rate of the action network and W_a^{(i)}(t) is the weight vector of the action network, which can be replaced by W̄_{ua}^{(i)} and Ȳ_{ua}^{(i)}, respectively. The weight update rule for the other action networks is similar and is omitted.
Given the above preparation, we now formulate the iterative ADP algorithm for
the nonlinear multi-player zero-sum differential games as follows.
Step 1: Initialize the algorithm with a stabilizing control pair (U (0) , W (0) ), where
Assumptions 11.3.1–11.3.3 hold. Choose the computation precision ζ > 0.
Step 2: Discretize the nonlinear system (11.3.1) as (11.3.13) and construct a model
NN. Train the model NN according to (11.3.31)–(11.3.34), and obtain the system functions b_k(x(t)), k = 1, 2, ..., p, and c_j(x(t)), j = 1, 2, ..., q, according to (11.3.15) and (11.3.16), respectively.
The subsequent steps compute the upper and lower iterative value functions and control pairs, where
$$l\big(x(t), \overline U^{(i+1)}(t), \overline W^{(i+1)}(t)\big) = x^{\mathrm T}(t)Ax(t) + \overline U^{(i+1)\mathrm T}(t)B\overline U^{(i+1)}(t) + \overline W^{(i+1)\mathrm T}(t)C\overline W^{(i+1)}(t).$$
Once the upper iterative value function has converged within the precision ζ, let Ū(t) = Ū^{(i)}(t), W̄(t) = W̄^{(i)}(t), and V̄(x(t)) = V̄^{(i+1)}(x(t)); once the lower iterative value function has converged, let U̲(t) = U̲^{(i)}(t), W̲(t) = W̲^{(i)}(t), and V̲(x(t)) = V̲^{(i+1)}(x(t)). If the converged upper and lower value functions coincide, stop: the optimal solution is achieved; otherwise, stop: the optimal solution does not exist.
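The overall procedure can be summarized at a high level as follows. This sketch paraphrases the step structure above under stated assumptions (the detailed per-step numbering of the original algorithm is not reproduced in this excerpt); the callables stand for the NN training routines described earlier, and the convergence/existence tests follow Theorem 11.3.6.

```python
def iterative_adp_zero_sum(train_model, update_upper, update_lower, zeta, i_max=100):
    """High-level sketch of the iterative ADP procedure for multi-player
    zero-sum games. Returns the converged upper control pair if the optimal
    solution exists, otherwise None."""
    model = train_model()                            # Step 2: identify the dynamics
    v_up_prev, v_lo_prev = None, None
    for i in range(i_max):
        v_up, U_up, W_up = update_upper(model, i)    # upper value and controls
        v_lo, U_lo, W_lo = update_lower(model, i)    # lower value and controls
        converged = (v_up_prev is not None
                     and abs(v_up - v_up_prev) <= zeta
                     and abs(v_lo - v_lo_prev) <= zeta)
        if converged:
            # optimal solution exists iff upper and lower values coincide
            return (U_up, W_up) if abs(v_up - v_lo) <= zeta else None
        v_up_prev, v_lo_prev = v_up, v_lo
    return None
```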
Example 11.3.1 Our first example is chosen from [9] with modifications. Consider the linear system
$$\dot x = x + u + w,$$
with the initial state x(0) = 1. The value function is defined by (11.3.2) with A = 1, B = 1/4, and C = −1. The value function is discretized using the Euler and trapezoidal methods with the sampling time interval Δt = 10⁻², and NNs are used to implement the iterative ADP algorithm.

To obtain a good approximation, it is important to choose the NN structure carefully. Since three-layer NNs are used to implement the iterative ADP algorithm, the number of hidden layer neurons λ can be chosen by the following empirical rule:
$$\lambda = \sqrt{N_I + N_O} + a, \qquad(11.3.40)$$
where N_I and N_O are the dimensions of the input and output vectors, respectively, and the constant a is an integer between 1 and 10. Equation (11.3.40) shows that for a low-dimensional system the number of hidden layer neurons can be small, while for high-dimensional complex nonlinear systems the number of hidden layer neurons should increase correspondingly.
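Reading (11.3.40) as the common square-root rule, the choice can be scripted as below; the value a = 6 in the example call is only an illustrative choice that reproduces the 8 hidden neurons used next.

```python
import math

def hidden_layer_size(n_inputs, n_outputs, a=5):
    """Empirical rule (11.3.40): lambda = sqrt(N_I + N_O) + a, with a in 1..10."""
    return round(math.sqrt(n_inputs + n_outputs)) + a

# Model NN of Example 11.3.1: 2 inputs, 1 output -> 8 hidden neurons with a = 6
print(hidden_layer_size(2, 1, a=6))
```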
According to (11.3.40), the number of the hidden layer neurons is chosen as 8. The
structure of the model NN is chosen as 2–8–1. The structures of the critic networks for
upper and lower value functions are chosen as 1–8–1. For the upper value functions,
the structures of the U action network and the W action network are chosen as 1–8–1
and 2–8–1, respectively. For the lower value functions, the structures of the U action
network and the W action network are chosen as 2–8–1 and 1–8–1, respectively. The
initial weights of action networks, critic networks, and model network are all set to
be random in [−0.5, 0.5]. It should be mentioned that the model network should be
trained first. For the given initial state x(0) = 1, we train the model network for 1000
steps under the learning rate ηm = 0.002 to reach the given accuracy of ε = 10−6 .
After the training of the model network is complete, the weights are kept unchanged.
During each iteration step, the critic networks and the action networks are trained
for 200 steps to reach the given accuracy of ε = 10−6 . In the training process, the
learning rate ηa = ηc = 0.01. The convergence curves of the upper and lower value
functions are shown in Fig. 11.6.
The convergence trajectories of the first row of the weights of the critic network
are shown in Fig. 11.7. The convergence trajectories of the hidden layer weights for
the U action network are shown in Fig. 11.8. The corresponding state and control
trajectories are displayed in Fig. 11.9a and b, respectively.
Fig. 11.6 Convergence of the upper and lower iterative value functions toward the optimal value function
Fig. 11.7 Convergence trajectories of the first row of weights of the critic network
Fig. 11.8 Convergence trajectories of the first row of weights of the U action network
Fig. 11.9 State and control trajectories: a, b obtained by the iterative ADP algorithm; c, d obtained by the GARE approach
On the other hand, it is known that the solution of the optimal control problem for the linear system is quadratic in the state [8]. We can easily obtain P* = 2.4142, and the optimal control laws are u* = −2P*x and w* = P*x, respectively. The state and control trajectories obtained by the GARE approach are displayed in Fig. 11.9c, d, respectively. From the comparison, we can see that the present ADP algorithm is effective for zero-sum differential games of linear systems.

Example 11.3.2 Consider the following affine nonlinear system with three u-players and two w-players:
$$\dot x = \begin{bmatrix}0.1x_1^2 + 0.05x_2\\ 0.2x_1^2 - 0.15x_2\end{bmatrix} + \begin{bmatrix}0.1 + x_1 + x_2^2 & 0.1 + x_2 + x_1^2 & 0\\ 0 & 0 & 0.8 + x_1 + x_1x_2\end{bmatrix}\begin{bmatrix}u_1\\u_2\\u_3\end{bmatrix} + \begin{bmatrix}0.1 + x_1 + x_1x_2 & 0\\ 0 & 0.2 + x_1 + x_1x_2\end{bmatrix}\begin{bmatrix}w_1\\w_2\end{bmatrix}$$
Fig. 11.10 Convergence trajectories of the first row of weights of the critic network
with x(0) = [−1, 0.5]^T. The value function is defined as (11.3.2), where C = −2I, and A and B are identity matrices of appropriate dimensions. The value function is discretized using the Euler and trapezoidal methods with the sampling time Δt = 10⁻³. According to (11.3.40), in order to obtain good approximations, the number of hidden neurons
is chosen as 10. The structure of the model NN is chosen as 2–8–1. The structures
of the critic networks for upper and lower value functions both are chosen as 2–8–1.
For the upper value functions, the structures of the U action network and the W
action network are chosen as 3–8–1 and 5–8–1, respectively. For the lower value
functions, the structures of the U action network and the W action network are
chosen as 5–8–1 and 3–8–1, respectively. The initial weights of action networks,
critic networks, and model network are all set to be random in [−1, 1]. It should
be mentioned that the model network should be trained first. For the given initial
state x(0) = [−1, 0.5]T , we train the model network for 10,000 steps under the
learning rate ηm = 0.005 to reach the given accuracy of ε = 10−6 . After the training
of the model network is complete, the weights are kept unchanged. During each
iteration step, the critic networks and the action networks are trained for 3000 steps
Fig. 11.11 Convergence of the upper and lower value functions

to reach the given accuracy of ε = 10⁻⁶. In the training process, the learning rates are η_a = η_c = 0.005. The convergence trajectories of the first row of weights of the critic network are shown in Fig. 11.10. The convergence curves of the upper and lower value functions are shown in Fig. 11.11.

We can see that the upper and lower value functions converge to the optimal solution of the zero-sum game, so the optimal value function exists. The iterative control pairs also converge to the optimal ones. The convergence trajectories of the first row of weights of the U action network are shown in Fig. 11.12. Next, we give the convergence results of the upper iterative control pairs for the upper value function. The convergence curves of the iterative controls w₁ and w₂ for the upper value function are shown in Fig. 11.13a, b, respectively. The convergence curves of the iterative controls u₁, u₂, and u₃ for the upper value function are shown in Fig. 11.14a–c, respectively, and the curves of the optimal controls are shown in Fig. 11.14d. Then, we apply the optimal control pair to the system for T_f = 500 time steps, and the corresponding state trajectories are given in Fig. 11.15.
Fig. 11.12 Convergence trajectories of the first row of weights of the U action network
Fig. 11.13 Convergence of the upper iterative controls w₁ and w₂
Fig. 11.14 Convergence of upper iterative controls u 1 , u 2 , u 3 and optimal control trajectories
Fig. 11.15 State trajectories x₁ and x₂ under the optimal control pair
11.4 Synchronous Approximate Optimal Learning for Multi-player Nonzero-Sum Games

Consider the N-player nonlinear differential game described by
$$\dot x = f(x) + \sum_{j=1}^{N}g_j(x)u_j, \qquad(11.4.1)$$
where x(t) ∈ ℝⁿ is the system state with initial state x₀, f(x) ∈ ℝⁿ, g_j(x) ∈ ℝ^{n×m_j}, and u_j ∈ ℝ^{m_j} is the controller or player. We assume that f(0) = 0, f(x) and g_j(x) are Lipschitz continuous on a compact set Ω ⊆ ℝⁿ containing the origin, and the system is stabilizable on Ω. The system dynamics, i.e., f(x) and g_j(x), j = 1, 2, ..., N, are assumed to be unknown.

Define the infinite-horizon cost functions associated with each player as
$$J_i(x_0, u_1, \ldots, u_N) = \int_0^\infty\Big(x^{\mathrm T}Q_ix + \sum_{j=1}^{N}u_j^{\mathrm T}R_{ij}u_j\Big)\mathrm dt \triangleq \int_0^\infty r_i(x, u_1, \ldots, u_N)\,\mathrm dt, \quad i \in \mathbf N. \qquad(11.4.2)$$
It is desired to find the optimal admissible control vector {μ₁*, ..., μ_N*} such that the cost functions (11.4.2) are minimized. The control vector {μ₁*, ..., μ_N*} corresponds to the Nash equilibrium of the differential game.

Definition 11.4.1 ([9]) An N-tuple of policies {μ₁*, ..., μ_N*} with μ_i* ∈ A(Ω) is said to constitute a Nash equilibrium for an N-player game if the following N inequalities are satisfied:
$$J_i(\mu_1^*, \ldots, \mu_i^*, \ldots, \mu_N^*) \leq J_i(\mu_1^*, \ldots, \mu_{i-1}^*, \mu_i, \mu_{i+1}^*, \ldots, \mu_N^*), \quad \forall\mu_i \in \mathcal A(\Omega),\ i \in \mathbf N.$$
Assuming that the value functions (11.4.3) are continuously differentiable, the infinitesimal version of (11.4.3) is
$$0 = r_i(x, \mu_1, \ldots, \mu_N) + (\nabla V_i)^{\mathrm T}\Big(f(x) + \sum_{j=1}^{N}g_j(x)\mu_j\Big), \quad i \in \mathbf N. \qquad(11.4.4)$$
The optimal value functions are
$$V_i^*(x_0) = \min_{\mu_i\in\mathcal A(\Omega)}\int_0^\infty\Big(x^{\mathrm T}Q_ix + \sum_{j=1}^{N}\mu_j^{\mathrm T}R_{ij}\mu_j\Big)\mathrm d\tau, \quad i \in \mathbf N.$$
From the stationarity condition, the optimal controls satisfy
$$\frac{\partial H_i}{\partial\mu_i} = 0 \;\Rightarrow\; \mu_i(x) = -\frac{1}{2}R_{ii}^{-1}g_i^{\mathrm T}(x)\nabla V_i^*, \quad i \in \mathbf N. \qquad(11.4.6)$$
Substituting (11.4.6) into (11.4.4) yields the coupled Hamilton–Jacobi equations
$$0 = x^{\mathrm T}Q_ix + (\nabla V_i^*)^{\mathrm T}f(x) - \frac{1}{2}(\nabla V_i^*)^{\mathrm T}\sum_{j=1}^{N}g_j(x)R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j^* + \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j^*)^{\mathrm T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j^*, \quad i \in \mathbf N. \qquad(11.4.7)$$
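Candidate value functions can be checked numerically against (11.4.7) by evaluating its residual pointwise. The sketch below assumes gradients and states are supplied as 1-D NumPy arrays and the weighting matrices as 2-D arrays; these conventions are illustrative, not prescribed by the text.

```python
import numpy as np

def coupled_hj_residual(x, i, grad_V, f, g, Q, R):
    """Residual of the coupled HJ equation (11.4.7) for player i at state x.

    grad_V[j](x): gradient of V_j (length-n array); g[j](x): n x m_j input matrix;
    Q[i]: n x n state weight; R[i][j]: m_j x m_j weights. A residual near zero
    means the candidate value functions satisfy (11.4.7) at x."""
    dVi = grad_V[i](x)
    res = x @ Q[i] @ x + dVi @ f(x)
    for j in range(len(g)):
        dVj = grad_V[j](x)
        Rjj_inv = np.linalg.inv(R[j][j])
        Gj = g[j](x) @ Rjj_inv @ g[j](x).T
        res -= 0.5 * dVi @ Gj @ dVj
        res += 0.25 * dVj @ g[j](x) @ Rjj_inv @ R[i][j] @ Rjj_inv @ g[j](x).T @ dVj
    return res
```

For instance, the analytic value functions of the three-player example later in this section should drive this residual to (numerical) zero over Ω.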
The convergence of policy iteration (PI) has been demonstrated for nonlinear systems [26, 28]. In this chapter, we will prove the convergence of the online PI algorithm for N-player nonzero-sum games with nonlinear dynamics. It can also be shown that the online PI algorithm for the N-player nonzero-sum game is equivalent to a quasi-Newton method. The analysis is based on the work of [28, 34].
Consider a Banach space V ⊂ {V(x): Ω → ℝ, V(0) = 0} with a norm ‖·‖_Ω. Define a mapping G_i: V × V × ··· × V (N copies) → V as follows:
$$\mathcal G_i = x^{\mathrm T}Q_ix + (\nabla V_i)^{\mathrm T}f(x) - \frac{1}{2}(\nabla V_i)^{\mathrm T}\sum_{j=1}^{N}g_j(x)R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j + \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{\mathrm T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j, \quad i \in \mathbf N. \qquad(11.4.10)$$
A mapping T_i: V → V is defined as
$$\mathcal T_iV_i = V_i - \big(\mathcal G_{iV_i}'\big)^{-1}\mathcal G_i, \qquad(11.4.11)$$
where G′_{iV_i} denotes the Fréchet derivative of G_i taken with respect to V_i. Since the Fréchet derivative is difficult to compute directly, we introduce the following Gâteaux derivative [34, 36].

Definition 11.4.2 Let G: U(V) ⊆ X → Y be a given map, where X and Y are Banach spaces, and U(V) denotes a neighborhood of V. The map G is Gâteaux differentiable at V if and only if there exists a bounded linear operator L: X → Y such that G(V + sM) − G(V) = sL(M) + o(s), s → 0, for all M with ‖M‖_Ω = 1 and all real numbers s in some neighborhood of zero, where lim_{s→0}(o(s)/s) = 0. L is called the Gâteaux derivative of G at V, which is given by
$$\mathcal L(M) = \lim_{s\to 0}\frac{\mathcal G(V + sM) - \mathcal G(V)}{s}. \qquad(11.4.12)$$
The Gâteaux derivative of G_i with respect to V_i acts on M ∈ V as
$$\mathcal G_{iV_i}'M = \mathcal L_{iV_i}(M) = (\nabla M)^{\mathrm T}f - \frac{1}{2}(\nabla M)^{\mathrm T}\sum_{j=1}^{N}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j.$$
Indeed, substituting V_i + sM for V_i in (11.4.10) and subtracting G_i(V_i), the terms independent of M cancel, and we obtain
$$\mathcal G_i(V_i + sM) - \mathcal G_i(V_i) = s(\nabla M)^{\mathrm T}f - \frac{s}{4}(\nabla M)^{\mathrm T}g_iR_{ii}^{-1}g_i^{\mathrm T}\nabla V_i - \frac{s}{4}(\nabla V_i)^{\mathrm T}g_iR_{ii}^{-1}g_i^{\mathrm T}\nabla M - \frac{s^2}{4}(\nabla M)^{\mathrm T}g_iR_{ii}^{-1}g_i^{\mathrm T}\nabla M - \frac{s}{2}\sum_{j=1, j\neq i}^{N}(\nabla M)^{\mathrm T}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j = s(\nabla M)^{\mathrm T}f - \frac{s}{2}(\nabla M)^{\mathrm T}\sum_{j=1}^{N}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j - \frac{s^2}{4}(\nabla M)^{\mathrm T}g_iR_{ii}^{-1}g_i^{\mathrm T}\nabla M.$$
Therefore,
$$\mathcal L_{iV_i}(M) = \lim_{s\to 0}\frac{\mathcal G_i(V_i + sM) - \mathcal G_i(V_i)}{s} = (\nabla M)^{\mathrm T}f - \frac{1}{2}(\nabla M)^{\mathrm T}\sum_{j=1}^{N}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j. \qquad(11.4.13)$$
Moreover, for M, M₀ ∈ V,
$$\|\mathcal L_{iV_i}(M) - \mathcal L_{iV_i}(M_0)\| \leq \big\|(\nabla(M - M_0))^{\mathrm T}f\big\|_\Omega + \Big\|\frac{1}{2}(\nabla(M - M_0))^{\mathrm T}\sum_{j=1}^{N}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j\Big\|_\Omega \leq \Big(\|f\|_\Omega + \Big\|\frac{1}{2}\sum_{j=1}^{N}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j\Big\|_\Omega\Big)\|\nabla(M - M_0)\|_\Omega \leq \Big(\|f\|_\Omega + \Big\|\frac{1}{2}\sum_{j=1}^{N}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j\Big\|_\Omega\Big)\alpha\|M - M_0\|_\Omega \triangleq \Phi\|M - M_0\|_\Omega.$$
With the results in Lemma 11.4.2, we can prove that the online PI algorithm is mathematically equivalent to the quasi-Newton iteration in the Banach space V.

Theorem 11.4.1 Let T_i be the mapping defined in (11.4.11). Then, the iteration between (11.4.8) and (11.4.9) is equivalent to the following quasi-Newton iteration:
$$V_i^{k+1} = \mathcal T_iV_i^k = V_i^k - \big(\mathcal G_{iV_i^k}'\big)^{-1}\mathcal G_i, \quad k = 0, 1, \ldots \qquad(11.4.14)$$
Proof According to (11.4.13), we have
$$\mathcal G_{iV_i^k}'V_i^{k+1} = \big(\nabla V_i^{k+1}\big)^{\mathrm T}f - \frac{1}{2}\big(\nabla V_i^{k+1}\big)^{\mathrm T}\sum_{j=1}^{N}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j^k = \big(\nabla V_i^{k+1}\big)^{\mathrm T}\Big(f - \sum_{j=1}^{N}\frac{1}{2}g_jR_{jj}^{-1}g_j^{\mathrm T}\nabla V_j^k\Big) = \big(\nabla V_i^{k+1}\big)^{\mathrm T}\Big(f + \sum_{j=1}^{N}g_j\mu_j^k\Big)$$
and
$$\mathcal G_{iV_i^k}'V_i^k = \big(\nabla V_i^k\big)^{\mathrm T}\Big(f + \sum_{j=1}^{N}g_j\mu_j^k\Big).$$
Thus,
$$\mathcal G_{iV_i^k}'V_i^k - \mathcal G_i = -r_i\big(x, \mu_1^k, \ldots, \mu_N^k\big).$$
Since the PI algorithm for N -player nonzero-sum games is equivalent to the quasi-
Newton’s iteration, the value function Vik+1 will converge to the optimal value func-
tion Vi∗ as k → ∞, ∀i ∈ N.
Based on the above results, we will develop an online synchronous approximate optimal learning algorithm using NN approximation for the multi-player nonzero-sum game with unknown dynamics (see Fig. 11.16). By using a model NN for the nonzero-sum game problem, neither the internal dynamics nor the drift dynamics is required. Compared with the algorithm in [26], there are fewer parameters to tune in the present algorithm.
A. Model NN Design
In this section, a model NN is used to reconstruct the unknown system dynamics by
using input–output data [38]. The unknown nonlinear system dynamics (11.4.1) can
be represented as
Fig. 11.16 Structural diagram of the online synchronous approximate optimal learning algorithm
$$\dot x = Ax + W_f^{\mathrm T}\sigma_f(x) + \varepsilon_f(x) + \sum_{j=1}^{N}\big(W_{g_j}^{\mathrm T}\sigma_{g_j}(x) + \varepsilon_{g_j}(x)\big)u_j, \qquad(11.4.15)$$
where A is a stable matrix, W_f ∈ ℝ^{n×n} and W_{g_j} ∈ ℝ^{n×n} are the unknown bounded ideal weight matrices, σ_f(·) ∈ ℝⁿ and σ_{g_j}(·) ∈ ℝⁿ are the activation functions, and ε_f(·) ∈ ℝⁿ and ε_{g_j}(·) ∈ ℝⁿ are the bounded reconstruction errors, respectively. The model (11.4.15) is obtained by letting m_j = 1 in (11.4.1), and it can easily be extended to the general case. Moreover, f(x) is approximated by Ax + W_f^Tσ_f(x) + ε_f(x).

Assumption 11.4.1 The ideal NN weights are bounded, i.e., W_f^TW_f ≤ W̄_f^TW̄_f and W_{g_j}^TW_{g_j} ≤ W̄_{g_j}^TW̄_{g_j}, where W̄_f and W̄_{g_j} are known positive definite matrices.

Assumption 11.4.2 The reconstruction errors ε_f(x) and ε_{g_j}(x) are assumed to be upper bounded by a function of the modeling error such that ε_f^T(x)ε_f(x) ≤ λx̃^Tx̃ and ε_{g_j}^T(x)ε_{g_j}(x) ≤ λx̃^Tx̃, where λ is a constant value and x̃ = x − x̂ is the system modeling error (x̂ is the estimated system state).

Based on (11.4.15), the model NN used to identify the system (11.4.1) is given by
$$\dot{\hat x} = A\hat x + \hat W_f^{\mathrm T}\sigma_f(\hat x) + \sum_{j=1}^{N}\hat W_{g_j}^{\mathrm T}\sigma_{g_j}(\hat x)u_j, \qquad(11.4.18)$$
where Ŵ_f and Ŵ_{g_j} are the estimates of the ideal weight matrices W_f and W_{g_j}, respectively. Then, the modeling error dynamics can be written as
$$\dot{\tilde x} = A\tilde x + \tilde W_f^{\mathrm T}\sigma_f(\hat x) + \sum_{j=1}^{N}\tilde W_{g_j}^{\mathrm T}\sigma_{g_j}(\hat x)u_j + W_f^{\mathrm T}\big(\sigma_f(x) - \sigma_f(\hat x)\big) + \sum_{j=1}^{N}W_{g_j}^{\mathrm T}\big(\sigma_{g_j}(x) - \sigma_{g_j}(\hat x)\big)u_j + \varepsilon_f(x) + \sum_{j=1}^{N}\varepsilon_{g_j}(x)u_j, \qquad(11.4.19)$$
where W̃_f = W_f − Ŵ_f and W̃_{g_j} = W_{g_j} − Ŵ_{g_j} are the weight estimation errors. The weight update laws for the model NN are chosen as
$$\dot{\hat W}_f = \Gamma_f\sigma_f(\hat x)\tilde x^{\mathrm T}, \qquad \dot{\hat W}_{g_j} = \Gamma_{g_j}\sigma_{g_j}(\hat x)u_j\tilde x^{\mathrm T}, \quad j = 1, \ldots, N, \qquad(11.4.20)$$
where Γ_f and Γ_{g_j} are positive definite gain matrices.
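One integration step of this identifier and of the update laws (11.4.20) can be sketched as follows. The explicit Euler integration and the choice of activation callables are assumptions for illustration; scalar inputs u_j are used, matching m_j = 1 above.

```python
import numpy as np

def identifier_step(x, x_hat, u_list, Wf_hat, Wg_hat, A, sigma_f, sigma_g,
                    Gamma_f, Gamma_g, dt):
    """One Euler step of the identifier (11.4.18) with the adaptation laws (11.4.20).

    Wg_hat, sigma_g, Gamma_g are lists indexed by player j; u_list holds scalar inputs."""
    x_tilde = x - x_hat                                  # modeling error x - x_hat
    # identifier dynamics (11.4.18)
    x_hat_dot = A @ x_hat + Wf_hat.T @ sigma_f(x_hat)
    for j, u in enumerate(u_list):
        x_hat_dot = x_hat_dot + (Wg_hat[j].T @ sigma_g[j](x_hat)) * u
    # weight adaptation (11.4.20), integrated with the same step size
    Wf_hat = Wf_hat + dt * Gamma_f @ np.outer(sigma_f(x_hat), x_tilde)
    for j, u in enumerate(u_list):
        Wg_hat[j] = Wg_hat[j] + dt * Gamma_g[j] @ np.outer(sigma_g[j](x_hat), x_tilde) * u
    return x_hat + dt * x_hat_dot, Wf_hat, Wg_hat
```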
Consider the Lyapunov function candidate
$$L(x) = \frac{1}{2}\tilde x^{\mathrm T}\tilde x + \frac{1}{2}\mathrm{tr}\big(\tilde W_f^{\mathrm T}\Gamma_f^{-1}\tilde W_f\big) + \frac{1}{2}\sum_{j=1}^{N}\mathrm{tr}\big(\tilde W_{g_j}^{\mathrm T}\Gamma_{g_j}^{-1}\tilde W_{g_j}\big). \qquad(11.4.21)$$
The time derivative of the Lyapunov function (11.4.21) along the trajectories of the modeling error system (11.4.19) is computed as
$$\dot L(t) = \tilde x^{\mathrm T}\dot{\tilde x} + \mathrm{tr}\big(\tilde W_f^{\mathrm T}\Gamma_f^{-1}\dot{\tilde W}_f\big) + \sum_{j=1}^{N}\mathrm{tr}\big(\tilde W_{g_j}^{\mathrm T}\Gamma_{g_j}^{-1}\dot{\tilde W}_{g_j}\big). \qquad(11.4.22)$$
Using Young's inequality and the Lipschitz continuity of the activation functions (with Lipschitz constant k), we have
$$\tilde x^{\mathrm T}W_f^{\mathrm T}\big[\sigma_f(x) - \sigma_f(\hat x)\big] \leq \frac{1}{2}\tilde x^{\mathrm T}W_f^{\mathrm T}W_f\tilde x + \frac{1}{2}k^2\tilde x^{\mathrm T}\tilde x$$
and
$$\tilde x^{\mathrm T}W_{g_j}^{\mathrm T}\big[\sigma_{g_j}(x) - \sigma_{g_j}(\hat x)\big] \leq \frac{1}{2}\tilde x^{\mathrm T}W_{g_j}^{\mathrm T}W_{g_j}\tilde x + \frac{1}{2}k^2\tilde x^{\mathrm T}\tilde x.$$
According to Assumption 11.4.2, we obtain
$$\tilde x^{\mathrm T}\varepsilon_f(x) \leq \frac{1}{2}\tilde x^{\mathrm T}\tilde x + \frac{1}{2}\lambda\tilde x^{\mathrm T}\tilde x \quad\text{and}\quad \tilde x^{\mathrm T}\varepsilon_{g_j}(x) \leq \frac{1}{2}\tilde x^{\mathrm T}\tilde x + \frac{1}{2}\lambda\tilde x^{\mathrm T}\tilde x.$$
Therefore, (11.4.23) can be upper bounded as
$$\dot L(t) \leq \tilde x^{\mathrm T}A\tilde x + \frac{1}{2}\tilde x^{\mathrm T}W_f^{\mathrm T}W_f\tilde x + \frac{1}{2}k^2\tilde x^{\mathrm T}\tilde x + \frac{1}{2}\tilde x^{\mathrm T}\tilde x + \frac{1}{2}\lambda\tilde x^{\mathrm T}\tilde x + \frac{1}{2}\tilde x^{\mathrm T}\sum_{j=1}^{N}u_jW_{g_j}^{\mathrm T}W_{g_j}\tilde x + \frac{1}{2}k^2\sum_{j=1}^{N}u_j\tilde x^{\mathrm T}\tilde x + \frac{1+\lambda}{2}\sum_{j=1}^{N}u_j\tilde x^{\mathrm T}\tilde x = \tilde x^{\mathrm T}\Xi\tilde x,$$
where
$$\Xi = A + \frac{1}{2}W_f^{\mathrm T}W_f + \frac{1}{2}\sum_{j=1}^{N}u_jW_{g_j}^{\mathrm T}W_{g_j} + \Big(\frac{1}{2} + \frac{1}{2}\lambda + \frac{1}{2}k^2 + \frac{1+\lambda}{2}\sum_{j=1}^{N}u_j + \frac{1}{2}k^2\sum_{j=1}^{N}u_j\Big)I_n.$$
If A is chosen such that Ξ is negative definite, then L̇(t) ≤ 0, and the model NN is a stable and asymptotic identifier; thus, exact knowledge of the system dynamics is not required. Consequently, we can use the following model NN in place of (11.4.1):
$$\dot x = Ax + W_f^{\mathrm T}\sigma_f(x) + \sum_{j=1}^{N}\big(W_{g_j}^{\mathrm T}\sigma_{g_j}(x)\big)u_j. \qquad(11.4.24)$$
The value functions are approximated by critic NNs as
$$V_i(x) = W_{ci}^{\mathrm T}\phi_i(x) + \varepsilon_i(x), \quad i \in \mathbf N, \qquad(11.4.25)$$
where W_{ci} ∈ ℝ^K are the unknown bounded ideal weights (‖W_{ci}‖ ≤ W̄_{ci}), φ_i(x) ∈ ℝ^K are the activation functions, K is the number of neurons in the hidden layer, and ε_i(x) ∈ ℝ are the bounded NN approximation errors. The activation functions can be selected as polynomial, sigmoid, tanh, etc.

The derivatives of the value functions with respect to x are
$$\frac{\partial V_i(x)}{\partial x} = \Big(\frac{\partial\phi_i(x)}{\partial x}\Big)^{\mathrm T}W_{ci} + \frac{\partial\varepsilon_i(x)}{\partial x} = \nabla\phi_i^{\mathrm T}W_{ci} + \nabla\varepsilon_i, \quad i \in \mathbf N, \qquad(11.4.26)$$
where ∇φ_i ∈ ℝ^{K×n} and ∇ε_i ∈ ℝⁿ are the bounded gradients of the activation functions and approximation errors, respectively. As the number of neurons K → ∞, the approximation errors ε_i → 0 and the derivatives ∇ε_i → 0 uniformly. The approximation errors ε_i and the derivatives ∇ε_i are bounded by constants on the compact set Ω. Thus, (11.4.4) can be rewritten as
$$0 = r_i(x, \mu_1, \ldots, \mu_N) + \big(W_{ci}^{\mathrm T}\nabla\phi_i + (\nabla\varepsilon_i)^{\mathrm T}\big)\dot x, \quad i \in \mathbf N. \qquad(11.4.27)$$
Since the ideal weights are unknown, the critic NNs are written in terms of the weight estimates as
$$\hat V_i(x) = \hat W_{ci}^{\mathrm T}\phi_i(x), \quad i \in \mathbf N.$$
To avoid using knowledge of the system dynamics, the model NN is used to approximate the system dynamics. Then, (11.4.27) can be rewritten as
$$0 = r_i(x, \mu_1, \ldots, \mu_N) + \big(W_{ci}^{\mathrm T}\nabla\phi_i + (\nabla\varepsilon_i)^{\mathrm T}\big)\Big(Ax + W_f^{\mathrm T}\sigma_f(x) + \sum_{j=1}^{N}W_{g_j}^{\mathrm T}\sigma_{g_j}(x)u_j\Big).$$
With the weight estimates Ŵ_{ci} and the regressor
$$\theta_i \triangleq \nabla\phi_i\Big(Ax + W_f^{\mathrm T}\sigma_f(x) + \sum_{j=1}^{N}W_{g_j}^{\mathrm T}\sigma_{g_j}(x)u_j\Big),$$
the residual error of the approximate equation is defined as
$$e_i = r_i(x, \mu_1, \ldots, \mu_N) + \hat W_{ci}^{\mathrm T}\theta_i, \quad i \in \mathbf N.$$
Define the objective function to be minimized as
$$E_i(\hat W_{ci}) = \frac{1}{2}e_i^2, \quad i \in \mathbf N.$$
The tuning law for the critic NN weights is a standard steepest descent algorithm, given by
$$\dot{\hat W}_{ci} = -\eta_i\Big(\frac{\partial E_i}{\partial\hat W_{ci}}\Big)^{\mathrm T} = -\eta_i\theta_i\big(r_i + \hat W_{ci}^{\mathrm T}\theta_i\big), \quad i \in \mathbf N, \qquad(11.4.28)$$
where η_i > 0 is the learning rate, and the regressor is bounded, i.e., ‖θ_i‖ ≤ θ_{iM}.

With the ideal weights, the residual error due to the NN approximation errors is
$$e_{Hi} = -(\nabla\varepsilon_i)^{\mathrm T}\Big(Ax + W_f^{\mathrm T}\sigma_f(x) + \sum_{j=1}^{N}W_{g_j}^{\mathrm T}\sigma_{g_j}(x)u_j\Big) = r_i + W_{ci}^{\mathrm T}\theta_i, \quad i \in \mathbf N.$$
Define the weight estimation errors of the critic NNs as W̃_{ci} = W_{ci} − Ŵ_{ci}. Then, we have the error dynamics
$$\dot{\tilde W}_{ci} = \eta_i\theta_i\big(e_{Hi} - \tilde W_{ci}^{\mathrm T}\theta_i\big). \qquad(11.4.29)$$
The PE condition is needed when tuning the critic NNs to ensure ‖θ_i‖ ≥ θ_{im}, i ∈ N, where θ_{im} are positive constants. To satisfy the PE condition, a small exploratory signal can be injected into the system, or the system states can be reset.

The objective of the action NNs is to select policies that minimize the current estimates of the value functions in (11.4.25). Since a closed-form expression for the optimal control is available, there is no need to train action NNs. Substituting the NN representations of g_i(x) and ∇V_i(x), the expressions in (11.4.6) can be rewritten as
$$u_i = -\frac{1}{2}R_{ii}^{-1}\big(W_{gi}^{\mathrm T}\sigma_{gi} + \varepsilon_{gi}\big)^{\mathrm T}\big(\nabla\phi_i^{\mathrm T}W_{ci} + \nabla\varepsilon_i\big), \quad i \in \mathbf N,$$
and the approximate control policies are
$$\hat u_i = -\frac{1}{2}R_{ii}^{-1}\big(W_{gi}^{\mathrm T}\sigma_{gi}\big)^{\mathrm T}\nabla\phi_i^{\mathrm T}\hat W_{ci}, \quad i \in \mathbf N. \qquad(11.4.30)$$
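The critic tuning (11.4.28) and the approximate control (11.4.30) can be combined into one online step, as sketched below. Forming the regressor θ_i from the model-NN estimate of ẋ follows the text; the explicit Euler integration of (11.4.28) and treating R_ii as a scalar (since m_i = 1 here) are assumptions.

```python
import numpy as np

def critic_and_control_step(x, x_dot_model, Wc_hat, grad_phi, r_i, R_ii,
                            Wg_i, sigma_g_i, eta_i, dt):
    """One step of the critic tuning law (11.4.28) plus the approximate
    control (11.4.30) for player i.

    grad_phi(x): K x n Jacobian of the activation vector phi_i;
    x_dot_model: model-NN estimate of x_dot used to build theta_i."""
    theta_i = grad_phi(x) @ x_dot_model                  # regressor theta_i
    e_i = r_i + Wc_hat @ theta_i                         # residual e_i
    Wc_hat = Wc_hat - dt * eta_i * theta_i * e_i         # (11.4.28), Euler step
    gi_hat = Wg_i.T @ sigma_g_i(x)                       # g_i(x) ~ Wg_i^T sigma_gi(x)
    u_i = -0.5 / R_ii * gi_hat @ grad_phi(x).T @ Wc_hat  # (11.4.30), scalar input
    return Wc_hat, u_i
```

A persistently exciting probing signal would be added to u_i during learning, as discussed above.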
The UUB stability of the closed-loop system can be proved based on the Lyapunov approach.

Theorem 11.4.3 Consider the system described by (11.4.24). Let the weight tuning laws of the critic NNs be given by (11.4.28), and let the control policies be updated by (11.4.30). The initial weights of the critic NNs are chosen to generate an initial set of admissible control policies. Then, the weight estimation errors of the critic NNs are UUB.

Proof Consider the Lyapunov function candidate
$$L(x) = \sum_{i=1}^{N}\frac{1}{2\eta_i}\mathrm{tr}\big(\tilde W_{ci}^{\mathrm T}\tilde W_{ci}\big) \triangleq \sum_{i=1}^{N}S_i(t).$$
The time derivatives of its terms along the trajectories of the error dynamics (11.4.29) are
$$\dot L_i(x) = \frac{1}{\eta_i}\mathrm{tr}\big(\tilde W_{ci}^{\mathrm T}\dot{\tilde W}_{ci}\big) = \frac{1}{\eta_i}\mathrm{tr}\big(\tilde W_{ci}^{\mathrm T}\eta_i\theta_i(e_{Hi} - \tilde W_{ci}^{\mathrm T}\theta_i)\big) = \tilde W_{ci}^{\mathrm T}\theta_ie_{Hi} - \big(\tilde W_{ci}^{\mathrm T}\theta_i\big)^2, \quad i \in \mathbf N.$$
Hence, whenever |W̃_{ci}^Tθ_i| is larger than the bound of the residual |e_{Hi}|, we have L̇_i(x) < 0. Using Lyapunov theory, it can be concluded that the weight estimation errors of the critic NNs W̃_{ci} are UUB. This completes the proof of the theorem.

The next result shows that the system state under the approximate control policies (11.4.30) is also UUB, where the bounds ξ_i ∈ ℝ⁺, i ∈ N, are defined in (11.4.33) below.
Proof To show the stability under the approximate control policies in (11.4.30), we take the derivative of V_i(x) along the trajectories generated by the approximate control policies û_i:
$$\dot V_i(t) = \big(\nabla V_i(x)\big)^{\mathrm T}\Big(f(x) + \sum_{j=1}^{N}g_j(x)\hat u_j\Big), \quad i \in \mathbf N. \qquad(11.4.31)$$
From (11.4.7), we have
$$\big(\nabla V_i(x)\big)^{\mathrm T}f = -x^{\mathrm T}Q_ix + \frac{1}{2}(\nabla V_i)^{\mathrm T}\sum_{j=1}^{N}g_j(x)R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j - \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{\mathrm T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j.$$
Substituting (∇V_i(x))^T f into (11.4.31) yields
$$\dot V_i(t) = -x^{\mathrm T}Q_ix + (\nabla V_i)^{\mathrm T}\sum_{j=1}^{N}g_j(x)\hat u_j + \frac{1}{2}(\nabla V_i)^{\mathrm T}\sum_{j=1}^{N}g_j(x)R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j - \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{\mathrm T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j = -x^{\mathrm T}Q_ix - \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{\mathrm T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j - (\nabla V_i)^{\mathrm T}\sum_{j=1}^{N}g_j(x)\big(u_j - \hat u_j\big). \qquad(11.4.32)$$
Substituting the NN expressions for u_j and û_j into (11.4.32), we obtain
$$\dot V_i(t) = -x^{\mathrm T}Q_ix - \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{\mathrm T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j + \frac{1}{2}\sum_{j=1}^{N}\big(\nabla\phi_i^{\mathrm T}W_{ci} + \nabla\varepsilon_i\big)^{\mathrm T}W_{g_j}^{\mathrm T}\sigma_{g_j}(x)R_{jj}^{-1}\big(\sigma_{g_j}^{\mathrm T}W_{g_j}\nabla\phi_j^{\mathrm T}\tilde W_{cj} + \sigma_{g_j}^{\mathrm T}W_{g_j}\nabla\varepsilon_j\big) \triangleq -x^{\mathrm T}Q_ix - \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{\mathrm T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j + \Lambda_i,$$
where
$$\Lambda_i \leq \Big\|\frac{1}{2}\sum_{j=1}^{N}\big(\nabla\phi_i^{\mathrm T}W_{ci} + \nabla\varepsilon_i\big)^{\mathrm T}W_{g_j}^{\mathrm T}\sigma_{g_j}(x)R_{jj}^{-1}\big(\sigma_{g_j}^{\mathrm T}W_{g_j}\nabla\phi_j^{\mathrm T}\tilde W_{cj} + \sigma_{g_j}^{\mathrm T}W_{g_j}\nabla\varepsilon_j\big)\Big\| \leq \xi_i. \qquad(11.4.33)$$
Note that
$$\frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{\mathrm T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j = \frac{1}{4}\sum_{j=1}^{N}\big(R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j\big)^{\mathrm T}R_{ij}\big(R_{jj}^{-1}g_j^{\mathrm T}(x)\nabla V_j\big) > 0. \qquad(11.4.34)$$
Combining (11.4.33) and (11.4.34), we can obtain that V̇_i(t) is upper bounded by
$$\dot V_i(t) \leq -x^{\mathrm T}Q_ix + \xi_i \leq -\lambda_{\min}(Q_i)\|x\|^2 + \xi_i.$$
For each value function V_i(x(t)), its derivative V̇_i(t) is therefore negative whenever x(t) lies outside the compact set
$$\Omega_i \triangleq \Big\{x: \|x\| \leq \sqrt{\xi_i/\lambda_{\min}(Q_i)}\Big\}.$$
Denote the compact set Ω_x as
$$\Omega_x \triangleq \Big\{x: \|x\| \leq \max\Big\{\sqrt{\xi_1/\lambda_{\min}(Q_1)}, \ldots, \sqrt{\xi_N/\lambda_{\min}(Q_N)}\Big\}\Big\}.$$
Then, all the derivatives V̇_i(t) are negative whenever x(t) lies outside the compact set Ω_x; i.e., x(t) is UUB. If we increase λ_min(Q_i), the size of Ω_x can be made smaller. This completes the proof of the theorem.
Fig. 11.17 Modeling error curves of the model NN

Fig. 11.18 Evolution of the system states x₁ and x₂

Fig. 11.19 Convergence of the critic NN weights for player 1

Fig. 11.20 Convergence of the critic NN weights for player 2

Fig. 11.21 Convergence of the critic NN weights for player 3
Fig. 11.22 3-D plot of the approximation error of the value function for player 1
Fig. 11.23 3-D plot of the approximation error of the control policy for player 1
As an example, consider the three-player affine nonlinear system ẋ = f(x) + g₁(x)u₁ + g₂(x)u₂ + g₃(x)u₃ with
$$f(x) = \begin{bmatrix}-2x_1 + x_2\\ -\tfrac{1}{2}x_1 - x_2 + x_1^2x_2 + \tfrac{1}{4}x_2(\cos(2x_1) + 2)^2 + \tfrac{1}{4}x_2(\sin(4x_1^2) + 2)^2\end{bmatrix},$$
$$g_1(x) = \begin{bmatrix}0\\ 2x_1\end{bmatrix}, \quad g_2(x) = \begin{bmatrix}0\\ \cos(2x_1) + 2\end{bmatrix}, \quad g_3(x) = \begin{bmatrix}0\\ \sin(4x_1^2) + 2\end{bmatrix}.$$
The optimal value functions for the three players are
$$V_1^*(x) = \frac{1}{8}x_1^2 + \frac{1}{4}x_2^2, \quad V_2^*(x) = \frac{1}{2}x_1^2 + x_2^2, \quad V_3^*(x) = \frac{1}{4}x_1^2 + \frac{1}{2}x_2^2.$$
The corresponding optimal control policies for the three players are
$$u_1^*(x) = -x_1x_2, \quad u_2^*(x) = -\frac{1}{2}(\cos(2x_1) + 2)x_2, \qquad(11.4.35)$$
and
$$u_3^*(x) = -\frac{1}{2}(\sin(4x_1^2) + 2)x_2. \qquad(11.4.36)$$
First, we use a model NN to identify the unknown nonlinear system. The model
NN is selected as in (11.4.18) with A = [−10, 0; 0, −10]. The activation functions
are selected as hyperbolic tangent function tanh(·). Select the parameters in Theorem
11.4.2 as Γ f = [1, 0.1; 0.1, 1], Γg j = [1, 0.1; 0.1, 1]. The curves of modeling error
are shown in Fig. 11.17. We can observe that the obtained model NN can reconstruct
the unknown nonlinear system successfully.
The activation functions for the critic NNs are selected as φ₁(x) = φ₂(x) = φ₃(x) = [x₁², x₁x₂, x₂²]^T. The critic NN weight vectors for the three players are denoted as Ŵ¹ = [Ŵ₁¹, Ŵ₂¹, Ŵ₃¹]^T, Ŵ² = [Ŵ₁², Ŵ₂², Ŵ₃²]^T, and Ŵ³ = [Ŵ₁³, Ŵ₂³, Ŵ₃³]^T. The initial weights of the three critic NNs are randomly selected in [0, 1.5], and the learning rates for the three critic NNs are all 0.1. The initial state is selected as x₀ = [1, −1]^T. A small exploratory signal is used to satisfy the PE condition. After the exploratory signal is turned off at t = 950 s, the states converge to zero, and Fig. 11.18 presents the evolution of the system states. From Figs. 11.19, 11.20 and 11.21, we can observe that the weight vector Ŵ¹ converges to [0.1277, −0.0012, 0.2503]^T, the weight vector Ŵ² converges to [0.5022, −0.0009, 1.0002]^T, and the weight vector Ŵ³ converges to [0.2516, −0.0007, 0.5001]^T at t = 1000 s. For player 1, Fig. 11.22 shows the 3-D plot of the difference between
the approximated value function and the optimal one, and Fig. 11.23 shows the 3-D
plot of the difference between the approximated control policy and the optimal one.
We can find that these errors are close to zero, and other players also have similar
results. Thus, the approximate value functions converge to the optimal ones within
a small bound.
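As a quick numerical check, the converged critic weights reported above can be compared with the coefficients of the analytic value functions (in the basis [x₁², x₁x₂, x₂²] given above); the largest componentwise error is only a few thousandths.

```python
import numpy as np

# Converged critic weights reported above
W_hat = [np.array([0.1277, -0.0012, 0.2503]),
         np.array([0.5022, -0.0009, 1.0002]),
         np.array([0.2516, -0.0007, 0.5001])]

# Coefficients of the analytic value functions V1*, V2*, V3*
W_star = [np.array([1/8, 0.0, 1/4]),
          np.array([1/2, 0.0, 1.0]),
          np.array([1/4, 0.0, 1/2])]

for k, (wh, ws) in enumerate(zip(W_hat, W_star), start=1):
    print(f"player {k}: max weight error = {np.max(np.abs(wh - ws)):.4f}")
```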
11.5 Conclusions
References
1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Abu-Khalaf M, Lewis FL, Huang J (2006) Policy iterations and the Hamilton-Jacobi-Isaacs
equation for H∞ state feedback control with input saturation. IEEE Trans Autom Control
51(12):1989–1995
References 479
3. Abu-Khalaf M, Lewis FL, Huang J (2008) Neurodynamic programming and zero-sum games
for constrained control systems. IEEE Trans Neural Netw 19(7):1243–1252
4. Al-Tamimi A, Abu-Khalaf M, Lewis FL (2007) Adaptive critic designs for discrete-time zero-
sum games with application to H∞ control. IEEE Trans Syst Man Cybern-Part B: Cybern
37(1):240–247
5. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2007) Model-free Q-learning designs for linear
discrete-time zero-sum games with application to H∞ control. Automatica 43(3):473–481
6. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B:
Cybern 38(4):943–949
7. Bardi M, Capuzzo-Dolcetta I (1997) Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations. Birkhäuser, Boston
8. Basar T, Bernhard P (1995) H∞ optimal control and related minimax design problems: a dynamic game approach. Birkhäuser, Boston
9. Basar T, Olsder GJ (1999) Dynamic noncooperative game theory. SIAM, Philadelphia
10. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
11. Gupta SK (1995) Numerical methods for engineers. Wiley, New York
12. Hecht-Nielsen R (1989) Theory of the backpropagation neural network. In: Proceedings of the
international joint conference on neural networks, pp 593–605
13. Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear
systems with completely unknown dynamics. Automatica 48(10):2699–2704
14. Lee JY, Park JB, Choi YH (2012) Integral Q-learning and explorized policy iteration for adaptive
optimal control of continuous-time linear systems. Automatica 48(11):2850–2859
15. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken
16. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
17. Li H, Liu D, Wang D (2014) Integral reinforcement learning for linear continuous-time zero-
sum games with completely unknown dynamics. IEEE Trans Autom Sci Eng 11(3):706–714
18. Liu D, Wei Q (2014) Multi-person zero-sum differential games for a class of uncertain nonlinear
systems. Int J Adapt Control Signal Process 28(3–5):205–231
19. Liu D, Li H, Wang D (2013) Neural-network-based zero-sum game for discrete-time nonlinear
systems via iterative adaptive dynamic programming algorithm. Neurocomputing 110:92–100
20. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
21. Liu D, Li H, Wang D (2014) Data-based online synchronous optimal learning algorithm for
multi-player non-zero-sum games. IEEE Trans Syst Man Cybern: Syst 44(8):1015–1027
22. Marks RJ (1991) Introduction to Shannon sampling and interpolation theory. Springer, New
York
23. Nevistić V, Primbs JA (1996) Constrained nonlinear optimal control: a converse HJB approach.
California Institute of Technology, TR96-021
24. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
25. Vamvoudakis KG, Lewis FL (2011) Online solution of nonlinear two-player zero-sum games
using synchronous policy iteration. Int J Robust Nonlinear Control 22(13):1460–1483
26. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learn-
ing solution of coupled Hamilton-Jacobi equations. Automatica 47(8):1556–1569
27. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
28. Vrabie D, Lewis FL (2010) Integral reinforcement learning for online computation of feedback
Nash strategies of nonzero-sum differential games. In: Proceedings of the IEEE conference on
decision and control, pp 3066–3071
480 11 Learning Algorithms for Differential Games of Continuous-Time Systems
29. Vrabie D, Lewis FL (2011) Adaptive dynamic programming for online solution of a zero-sum
differential game. J Control Theory Appl 9(3):353–360
30. Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for
continuous-time linear systems based on policy iteration. Automatica 45(2):477–484
31. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
32. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
33. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
34. Wu H, Luo B (2012) Neural network based online simultaneous policy update algorithm
for solving the HJI equation in nonlinear H∞ control. IEEE Trans Neural Netw Learn Syst
23(12):1884–1895
35. Wu H, Luo B (2013) Simultaneous policy update algorithms for learning the solution of linear
continuous-time H∞ state feedback control. Inf Sci 222(10):472–485
36. Zeidler E (1985) Nonlinear functional analysis. Fixed point theorems, vol 1. Springer, New
York
37. Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of
discrete-time affine nonlinear systems with control constraints. IEEE Trans Neural Netw
20(9):1490–1503
38. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
39. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
40. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
Part III
Applications
Chapter 12
Adaptive Dynamic Programming
for Optimal Residential Energy
Management
12.1 Introduction
With the rising cost, environmental concerns, and reliability issues, there is an
increasing need to develop optimal control and management systems for residential
environments. Smart residential energy systems, composed of power grids, battery
systems, and residential loads which are interconnected over a power management
unit, provide end users with the optimal management of energy usage to improve
the operational efficiency of power systems [4, 23, 31]. On the other hand, with
the rapidly evolving technology of electric storage devices, energy storage-based
optimal management has attracted much attention [3, 19, 36]. Along with the devel-
opment of smart grids, increasing intelligence is required in the optimal design of
residential energy systems [10, 34, 39]. Hence, the intelligent optimization of battery
management becomes a key tool for saving the power expense in smart residential
environments.
Different techniques have been used to implement optimal controllers in residen-
tial energy management systems; for example, dynamic programming is used in [32,
37] and a genetic algorithm is proposed in [11]. In addition, Liu and Huang [21] proposed an ADP scheme using only a critic network and considering only three possible controls for the battery (charging, discharging, or idle), choosing the best one for every time slot. A particle swarm optimization method is employed in [18], and a mixed integer linear programming procedure is chosen in [33].
In this chapter, the optimal management of the total electrical system is viewed as optimal battery management for each time slot: Step by step, the controller provides the best energy management decision, charging or discharging the battery to reduce the total cost according to the external environment. First, an action-
dependent heuristic dynamic programming method is developed to obtain the optimal
residential energy control law [21, 22]. Second, a dual iterative Q-learning algorithm
is developed to solve the optimal battery management and control problem in smart
residential environments where two iterations, internal and external iterations, are
employed [42]. Based on the dual iterative Q-learning algorithm, the convergence
property of iterative Q-learning method for the optimal battery management and
control problem is proven. Finally, a distributed iterative ADP method is developed
to solve the multi-battery optimal coordination control problems for home energy
management systems [43].
comparable to that delivered from the power grids. The battery storage system consists of lead-acid batteries, which are the most commonly used rechargeable battery
type. The optimum battery size for a particular residential household can be obtained
by performing various test scenarios, which is beyond the scope of the present book.
Generally speaking, the battery is sized to enable it to supply power to the residential
load for a period of twelve hours.
There are three operational modes for the batteries in the residential energy system under consideration.
(1) Charging mode: When system load is low and the electricity price is inexpensive,
the power grids will supply the residential load directly and, at the same time,
charge the batteries.
(2) Idle mode: The power grids directly supply the residential load at certain hours when, from an economic point of view, it is more cost-effective to reserve the fully charged batteries for the evening peak hours.
(3) Discharging mode: By taking the subsequent load demands and time-varying
electricity rate into account, batteries alone supply the residential load at hours
when the price of electricity from the grid is high.
This system can easily be expanded; i.e., other power sources, such as photovoltaic (PV) panels or wind generators, can be integrated into the system alongside the power grid when they become available.
For this section, the optimal scheduling problem is treated as a discrete-time problem with a time step of one hour, and it is assumed that the residential load over each hourly time step varies with noise. Thus, the daily load profile is divided into twenty-four one-hour periods representing each hour of the day. Each day could be divided into a greater number of periods for higher resolution; however, for simplicity and for agreement with the existing literature [6, 9, 13, 30], we use twenty-four one-hour periods per day in this work. A typical weekday load profile is shown in Fig. 12.2. The load PL is expressed as PLt during hour t (t = 1, 2, . . . , 24). For instance, at time t = 19, the load is 7.8 kW, which requires 7.8 kWh of energy over that hour. Since the load profile is divided into one-hour steps, the power of the energy sources can be expressed equivalently in kW or kWh. Residential real-time pricing is one of the load management policies used to shift electricity usage from peak load hours to light load hours in order to improve power system efficiency and defer new power system construction projects [24].
the electricity rate varies from hour to hour based on the wholesale market prices.
Hourly, market-based electricity prices typically change as the demand for electricity
changes; higher demand usually means higher hourly prices. In general, there tends
to be a small price spike in the morning and another slightly larger spike in the
evening when the corresponding demand is high. Figure 12.3 demonstrates a typical
daily real-time pricing from [12]. The varying electricity rate is expressed as Ct ,
representing the energy cost during the hour t in cents. For the residential customer
with real-time pricing, energy charges are functions of the time of electricity usage.
Fig. 12.2 A typical weekday residential load profile (load in kW versus time in hours)
Fig. 12.3 A typical daily real-time electricity pricing profile (rate in cents versus time in hours)
Therefore, when batteries are charged during low rate hours and discharged during high rate hours, one may expect, from an economic point of view, that cost savings can be achieved by storing energy during low rate hours and releasing it during high rate hours. In this way, the battery storage system can be used to reduce the total electricity cost of the residential household. The energy stored
in a battery can be expressed as [24, 44]:
Ebt = Eb0 − Σ_{τ=0}^{t} Pbτ,
Pbτ = Vbτ Ib ατ ,   τ = 0, 1, . . . , t,
ατ = 1 for τ ≤ τ0,   and   ατ = K1(Ib) dVbτ/dτ for τ > τ0,
where Ebt is the battery energy at time t, Eb0 is the peak energy level when the battery
is fully charged (capacity of the battery), Pbτ is the battery power output at time τ ,
Vbτ is the terminal voltage of the battery, Ib is the battery discharge current, ατ is the
current weight factor as a function of discharge time, τ0 is the battery manufacturer
specified length of time for constant power output under constant discharge current
rate, K1 (Ib ) is the weight factor as a function of the magnitude of the current, Vs is
the battery internal voltage, Kc is the polarization coefficient (ohm × cm2 ), Δ is the
available amount of active material (coulombs per cm2 ), Jc is the apparent current
density (amperes per cm2 ), N is the internal resistance per cm2 , and A and B are
constants.
Apart from the battery itself, the losses of other equipment such as inverters, transformers, and transmission lines should also be considered in the battery model. The efficiency of these devices was derived in [44] as follows:

η(Pbt) = 0.898 − 0.173 |Pbt| / Prate,   Prate > 0,   (12.2.1)
where Prate is the rated power output of the battery and η(Pbt) is the total efficiency of all the auxiliary equipment in the battery system.
Assume that all the losses caused by this equipment occur during the charging period. The battery model used in this work is then expressed as follows: When the battery charges,

Eb(t+1) = Ebt − Pb(t+1) × η(Pb(t+1)),   Pb(t+1) < 0.
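To make the bookkeeping concrete, the following Python sketch implements the efficiency relation (12.2.1) and the charging update above. It is only an illustration, not the authors' code: the 16 kW rated power is the value used in the numerical study later in this chapter, and the discharging/idle branch simply reflects the stated assumption that all equipment losses occur during charging.

```python
def efficiency(p_b, p_rate=16.0):
    """Auxiliary-equipment efficiency from (12.2.1); p_b in kW, p_rate > 0 in kW."""
    return 0.898 - 0.173 * abs(p_b) / p_rate

def next_energy(e_b, p_b, p_rate=16.0):
    """One-hour battery energy update (kWh) for battery power p_b (kW).

    p_b < 0 means charging: conversion losses are applied, as assumed in the text.
    p_b >= 0 means discharging or idle: the stored energy drops by the delivered
    power (a simple reading of the assumption that losses occur only while charging).
    """
    if p_b < 0:
        return e_b - p_b * efficiency(p_b, p_rate)
    return e_b - p_b

# Example: charging at 10 kW for one hour starting from 40 kWh stores about 7.9 kWh.
print(next_energy(40.0, -10.0))   # ~47.9
```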
In general, to improve the battery efficiency and extend the battery's lifetime as far as possible, two constraints need to be considered:
(1) The battery has a storage limit. A battery's lifetime may be reduced if it operates at a low state of charge. In order to avoid damage, the energy stored in the battery must always satisfy

Ebmin ≤ Ebt ≤ Ebmax.   (12.2.2)
(2) For safety, the battery cannot be charged or discharged at a rate exceeding the maximum or minimum values, to prevent damage. This constraint represents the upper and lower limits for the hourly charging and discharging power. A negative Pbt means that the battery is being charged, while a positive Pbt means that the battery is discharging:

Pbmin ≤ Pbt ≤ Pbmax.   (12.2.3)
At any time, the sum of the power from the power grids and the batteries must be equal to the demand of the residential user:

PLt = Pbt + Pgt,   (12.2.4)

where Pgt is the power from the power grids, and Pbt can be positive (batteries discharging), negative (batteries charging), or zero (idle). This expresses the fact that the power generation (power grids and batteries) must balance the load demand for each hour of the scheduling period. We assume here that the supply from the power grids is sufficient for the residential demand. The objective of the optimization policy is, given the residential load profile and real-time pricing, to find the optimal battery charge/discharge/idle schedule at each time step which minimizes the total cost

CT = Σ_{t=1}^{T} Ct × Pgt,

while satisfying the load balance equation (12.2.4) and the operational constraints (12.2.1)–(12.2.3). CT represents the operational cost to the residential customer over a period of T hours. Making the best possible use of the batteries for the benefit of residential customers, with time-of-day pricing signals given by Ct, is a complex multistage stochastic optimization problem. Adaptive dynamic programming (ADP), which provides approximate optimal solutions to dynamic programming problems, is applicable here. Using ADP, we will develop a self-learning
gramming is applicable to this problem. Using ADP, we will develop a self-learning
optimization strategy for residential energy system control and management. During
real-time operations under uncertain changes in the environment, the performance of
the optimal strategy can be further refined and improved through continuous learning
and adaptation.
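As a concrete illustration of this objective, the short sketch below evaluates the total cost CT of a candidate charge/discharge schedule under the load balance (12.2.4) and the limits (12.2.2)–(12.2.3). The numerical limits (20/80 kWh, ±16 kW, 60 kWh initial energy) are the values used in the numerical study of Sect. 12.3 and serve only as illustrative defaults; the nonnegativity of the grid power is the assumption adopted later in that section.

```python
def total_cost(loads, rates, schedule, e0=60.0,
               e_min=20.0, e_max=80.0, p_min=-16.0, p_max=16.0, p_rate=16.0):
    """Total electricity cost of a battery schedule (kW per hour, negative = charging)."""
    e, cost = e0, 0.0
    for p_l, c, p_b in zip(loads, rates, schedule):
        assert p_min <= p_b <= p_max                 # (12.2.3) power limit
        p_g = p_l - p_b                              # (12.2.4) load balance
        assert p_g >= 0                              # grid supply assumed nonnegative
        eta = 0.898 - 0.173 * abs(p_b) / p_rate      # efficiency (12.2.1)
        e = e - p_b * eta if p_b < 0 else e - p_b    # losses only while charging
        assert e_min <= e <= e_max                   # (12.2.2) storage limit
        cost += c * p_g                              # hourly energy charge
    return cost
```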
The smart residential energy system under consideration can be described by

xt+1 = F(xt, ut),   (12.2.5)

where xt ∈ Rn denotes the state vector of the system, ut ∈ Rm represents the control action, and F is the transition from the current state xt to the next state xt+1 under the state feedback control action ut = u(xt) at time t. Suppose that this system is
associated with the performance index
J(xt0, u) = J^u(xt0) = Σ_{k=t0}^{∞} γ^{k−t0} U(xk, uk),   (12.2.6)
where U is called the utility function and γ is the discount factor with 0 < γ ≤ 1. It
is important to realize that J depends on the initial time t0 and the initial state xt0 . The
performance index J is also referred to as the cost-to-go of state xt0 . The objective
of the dynamic programming problem is to choose a sequence of control actions ut =
u(xt ), t = t0 , t0 + 1, . . . , so that the performance index J in (12.2.6) is minimized.
According to Bellman, the optimal cost from time t onward is equal to

J*(xt) = min_{ut} { U(xt, ut) + γ J*(F(xt, ut)) }.
The optimal control ut∗ at time t is the ut that achieves this minimum.
Generally speaking, there are three design families of ADP: heuristic dynamic pro-
gramming (HDP), dual heuristic programming (DHP), and globalized dual heuristic
programming (GDHP) as well as their action-dependent versions. A typical ADHDP
is shown in Fig. 12.4 [28]. Both the critic and action networks can be trained using
the strategy in [25] as described in Sect. 1.3.1 of this book.
The learning control architecture for residential energy system control and management is based on ADHDP. However, as described below, only a single module (the single-critic approach) will be used instead of the two or three modules in the original scheme. The single-critic technique retains all the powerful features of the original ADP while eliminating the action module completely. There is no need for iterative training loops between the action and critic networks, which greatly simplifies the training process.
There exists a class of problems in realistic applications that have a finite control action space. Typical examples include the inverted pendulum or cart–pole problem, where the control action takes only a few finite values. When
there is only a finite control action space in the application, the decisions that can
be made are constrained to a limited number of choices, e.g., a ternary choice in the
case of the residential energy control and management problem. When there is a power demand from the residential household, the decisions that can be made are constrained to three choices, i.e., to discharge the batteries, to charge the batteries, or to do nothing to
batteries. Let us denote the three options using ut = 1 for “discharge,” ut = −1 for
“charge,” and ut = 0 for “idle.” In the present case, we note that the control actions
are limited to a ternary choice or to only three possible options. Therefore, we can
further simplify the ADHDP introduced in Fig. 12.4 so that only the critic network
is needed in the design.
Figure 12.5 illustrates our self-learning control scheme for residential energy sys-
tem control and management using ADHDP. The control scheme works as follows:
When there is a power demand from the residential household, we will first ask the
critic network to see which action (discharge, charge, or idle) generates the smallest output value of the critic network; then, the control action from ut = 1, −1, 0
that generates the smallest critic network output will be chosen. As in the case of
Fig. 12.4, the critic network in our design will also need the system states as input
variables. It is important to realize that Fig. 12.5 is only a diagrammatic layout that
illustrates how the computation takes place while making battery control and man-
agement decisions. In Fig. 12.5, the three blocks for the critic network stand for the
same critic network or computer program. From the block diagram in Fig. 12.5, it is
clear that the critic network will be utilized three times in the calculation, with different values of ut, to decide whether to discharge the batteries, charge them, or keep them idle. The previous description is based on the assumption that the critic net-
work has been successfully trained. Once the critic network is learned and obtained
(off-line or online), it will be applied to perform the task of residential energy sys-
tem control and management as in Fig. 12.5. The performance of the overall system
can be further refined and improved through continuous learning as it gains more experience in real-time operation. In this way, the overall residential energy system will achieve optimal individual performance both now and in future environments under uncertain changes.
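Computationally, this ternary decision rule amounts to querying one trained critic three times per hour and taking the cheapest action. A minimal Python sketch of that selection step follows; the critic function is assumed to be an already trained approximator of the cost-to-go (a concrete example appears later in this section), so the names used here are illustrative rather than taken from the text.

```python
def select_action(critic, rate, load, energy):
    """Pick u in {1 (discharge), -1 (charge), 0 (idle)} minimizing the critic output.

    critic(state, u) is assumed to return the estimated cost-to-go of taking action u
    in the given state (electricity rate, residential load, battery energy level).
    """
    state = (rate, load, energy)
    return min((1, -1, 0), key=lambda u: critic(state, u))
```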
In a stationary environment, where the residential energy system configuration remains unchanged, a set of simple static if–then rules would be able to achieve the optimal scheduling described previously. However, the system configuration, including the user power demand, the capacity of the battery, and the power rate, may differ significantly from time to time. To cope with uncertain changes in the environment, a static energy control and management algorithm would not be appropriate. The present control and management scheme therefore proceeds in the following four steps.
Step 1: Collecting data: During this stage, whenever there is a power demand from the residential household, we can take any of the following actions: discharge the batteries, charge the batteries, or keep the batteries idle, and calculate the utility function for the system. The utility function is chosen as Ut = Ct × Pgt/Umax, as defined later in this section; a schematic sketch of this data collection step is given after the list of steps.
During the data collection step, we simply choose actions 1, −1, 0 randomly
with the same probability of 1/3. Meanwhile, the states corresponding to
each action are collected. The environmental states we collect for each action
are the electricity rate, the residential load, and the energy level of the battery.
Step 2: Training the critic network: We use the data collected to train the critic
network as presented in Chap. 1. The input variables chosen for the critic
network are states including the electricity rate, the residential load, the
energy level of the battery, and the action.
Step 3: Applying the critic network: We apply the trained critic network as illustrated
in Fig. 12.5. Three values of action ut will be provided to the critic network
at each time step. The action with the smallest output of the critic network
is the one the system is going to take.
Step 4: Further updating the critic network: We will update the critic network as
needed while it is applied in the residential energy system to cope with envi-
ronmental changes, for example, user demand changes or new requirements
for the system. In such a case, the data have to be collected again and the critic network has to be retrained; that is, the previous three steps will be repeated.
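A minimal sketch of the data collection of Step 1 is given below. The env object is a stand-in for any simulator or interface to the residential system; its observe and apply methods are hypothetical placeholders, and the equal 1/3 action probabilities follow the description above.

```python
import random

def collect_data(env, n_steps):
    """Step 1: explore by choosing discharge (1), charge (-1), or idle (0) uniformly."""
    data = []
    for _ in range(n_steps):
        rate, load, energy = env.observe()      # electricity rate, load, battery energy
        u = random.choice([1, -1, 0])           # each action with probability 1/3
        utility = env.apply(u)                  # assumed to return the utility U_t
        data.append(((rate, load, energy), u, utility))
    return data                                 # used in Step 2 to train the critic
```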
user and, at the same time, charge batteries. It is expected that batteries are charged
during the low rate hours, idle in some mid-rate hours, and discharged during high
rate hours. In this way, both energy and cost savings are achieved.
The critic network in the present application is a multilayer feedforward neural
network with 4–9–1 structure, i.e., four neurons at the input layer, nine neurons
at the hidden layer, and one linear neuron at the output layer. The hidden layer
uses the hyperbolic tangent function as the activation function. The critic network
outputs function Q, which is an approximation to the function J(xt , u) as defined
in (12.2.6). The four inputs to the critic network are as follows: energy level of
batteries, residential power demand, real-time pricing, and the action of operation
(1 for discharging batteries, −1 for charging batteries, 0 for keeping batteries idle).
The local utility function used in (12.2.6) is

Ut = Ct × Pgt / Umax,
where Ct is real-time electricity rate, Pgt is the supply from power grids for residential
power demand, and Umax is the possible maximum cost for all time. The utility
function chosen in this way will lead to a control objective of minimizing the overall
cost for the residential user.
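To make the critic structure concrete, the sketch below builds a 4–9–1 feedforward critic of the kind described above (hyperbolic tangent hidden layer, linear output) with NumPy. The random weights and the input ordering are placeholders for illustration only; training is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(9, 4))   # input (4) -> hidden (9) weights
W2 = rng.normal(scale=0.1, size=(1, 9))   # hidden (9) -> linear output (1) weights

def critic(state, u):
    """Estimated cost-to-go for inputs (rate, load, battery energy) and u in {1,-1,0}."""
    x = np.array([*state, u], dtype=float)    # four inputs to the network
    h = np.tanh(W1 @ x)                       # hidden layer with tanh activation
    return float(W2 @ h)                      # single linear output neuron
```

This is the same critic interface assumed by the action-selection sketch earlier in the section.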
The typical residential load profile in one week is shown in Fig. 12.6 [12], with random noise added to the load curve. From the load curve, we can see that, during weekdays, there
are two load peaks occurring in the period of 7:00–8:00 and 18:00–20:00, while
during weekend, the residential demand gradually increases until the peak appears
at 19:00. Thus, the residential demand pattern during weekdays and during weekend
is different.
Fig. 12.6 A typical residential load profile for one week (load in kW versus time in hours)
Fig. 12.7 Battery energy level during a typical one-week residential load cycle (energy in kWh versus time in hours)
Fig. 12.8 Optimal scheduling of the home batteries: bars show the battery power output (kW); the dotted line shows the real-time electricity rate
Figure 12.7 shows the change in the electrical energy level in the batteries during a typical one-week residential load cycle. From Fig. 12.7, it is shown that
the batteries are fully charged around midnight when the price of electricity is low. After that, the batteries discharge during peak or medium load hours and are charged again during the light load hours around midnight. This cycle repeats, which means that the scheme settles into a regular pattern of charging and discharging. Therefore, the peak of the load curve is shaved by the output of the batteries, which results in a lower cost of power from the power grids. Figure 12.8 illustrates the optimal scheduling
of home batteries. The bars in Fig. 12.8 represent the power output of batteries,
while the dotted line denotes the electricity rate in real time. From Fig. 12.8, we can
see that batteries are charged during hours from 23:00 to 5:00 next day when the
electricity rate is in the lowest range and discharge when the price of electricity is
expensive. It is observed that batteries discharge from 6:00 to 20:00 during weekdays
and from 7:00 to 19:00 during weekend to supply the residential power demand. The
difference lies in the fact that the power demand during the weekend is generally
bigger than the weekdays’ demand, which demonstrates that the present scheme can
adapt to varying load conditions. From Fig. 12.8, we can also see that there are some
hours that the batteries are idle, such as from 3:00 to 5:00 and from 21:00 to 22:00.
Obviously, the self-learning algorithm has determined that, considering the subsequent load demand and electricity rate, keeping the batteries idle during these hours achieves the most economic return and results in the lowest overall cost to the customer. The cost
of serving this typical residential load in one week is 2866.64 cents. Compared to
the cost using the power grids alone to supply the residential load which is 4124.13
cents, it gives a savings of 1257.49 cents in a week period. This illustrates that a
considerable saving on the electricity cost is achieved. In this case, the self-learning
scheme has the ability to learn the system characteristics and provide the minimum
cost to the residential user.
In order to better evaluate the performance of the self-learning scheme, we con-
duct comparison studies with a fixed daily cycle scheme. The daily cycle scheme
charges batteries during the day time and releases the energy into the residential user
load when required during the expensive peak hours at night. Figure 12.9 shows the
scheduling of batteries by the fixed daily cycle scheme. The overall cost is 3284.37
cents. This demonstrates that the present ADHDP scheme has a lower cost.
Fig. 12.9 Scheduling of the batteries by the fixed daily cycle scheme (battery power in kW versus time in hours)
Comparing Fig. 12.8 with Fig. 12.9, we can see that the self-learning scheme discharges the batteries one hour later, from 7:00 to 19:00 during the weekend instead of from 6:00 to 20:00 as during weekdays, to achieve optimal performance, while the fixed daily cycle
scheme ignores the differences in the demand between weekdays and weekend due
to the static nature of the algorithm. Therefore, we conclude that the present self-
learning algorithm performs better than the fixed algorithm because the self-learning scheme can adapt to varying load conditions and environmental changes.
The smart residential energy system described by (12.2.5) is composed of the power
grid, the residential load, a battery system, which is located at the side of residential
load (including a battery and a sine wave inverter), and a power management unit
(controller). The schematic diagram of the smart residential energy system can be
described in Fig. 12.1. The battery model used in this work is based on [22, 24, 44],
where the battery model is expressed by
Let Pbt > 0 denote battery discharging; let Pbt < 0 denote battery charging; and let
Pbt = 0 denote the battery idle. Let the efficiency of battery charging/discharging be
derived as in (12.2.1).
In this section, the power flow from the battery to the grid is not permitted,
i.e., we define Pgt ≥ 0, to guarantee the power quality of the grid. For convenience
of analysis, we introduce delays in Pbt and PLt , and then, we can define the load
balance as PL(t−1) = Pb(t−1) + Pgt . The total cost function expected to be minimized
is defined as
Σ_{t=0}^{∞} γ^t [ m1 (Ct Pgt)² + m2 (Ebt − Ebo)² + r (Pbt)² ],   (12.3.1)

where 0 < γ < 1 and Ebo = ½ (Ebmin + Ebmax). The physical meaning of the first term
of the cost function is to minimize the total cost from the grid. The second term aims
to guarantee the stored energy of the battery to be close to the middle of storage limit,
which avoids fully charging/discharging of the battery. The third term is to prevent
large charging/discharging power of the battery. Hence, the second and third terms
aim to extend the lifetime of the battery. Let x1t = Pgt and x2t = Ebt − Ebo . Letting
xt = [x1t , x2t ]T and ut = Pbt , the equation of the residential energy system can then
be written as
xt+1 = F(xt, ut, t) = [ PLt − ut ;  x2t − η(ut)ut ].   (12.3.2)

Define the matrix

Mt = diag{ m1 Ct², m2 },

which is used in the utility function U(xt, ut, t) = xtᵀ Mt xt + r ut².
Let x0 be the initial state. Then, the cost function (12.3.1) can be written as

J(x0, u0, 0) = Σ_{t=0}^{∞} γ^t U(xt, ut, t).
Assumption 12.3.1 The residential load PLt and the electricity rate Ct are periodic
functions with the period λ = 24 h.
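In code, this periodicity means any absolute hour t can be split into a day count ρ and an hour-of-day θ with t = ρλ + θ, for example:

```python
t = 67                       # an absolute hour index
rho, theta = divmod(t, 24)   # t = rho*24 + theta, here rho = 2 and theta = 19
# Under Assumption 12.3.1, PLt = PL_theta and Ct = C_theta.
```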
Define the control sequence set as Ut = { ut : ut = (ut, ut+1, . . .), ut+i ∈ R, i = 0, 1, . . . }. Then, the optimal cost function can be defined as follows:

J*(xt, t) = inf_{ut} { J(xt, ut, t) : ut ∈ Ut }.

Define the optimal Q-function Q*(xt, ut, t) such that min_{ut} Q*(xt, ut, t) = J*(xt, t). Hence, the Q-function is an action-dependent value function. According to [40, 41], the optimal Q-function satisfies the following Bellman equation:

Q*(xt, ut, t) = U(xt, ut, t) + γ min_{ut+1} Q*(xt+1, ut+1, t + 1).   (12.3.3)
In this section, a novel dual iterative Q-learning algorithm is developed to obtain the
optimal control law for residential energy systems [42]. A new convergence analysis
method will also be developed in this section. From (12.3.3), we can see that the
optimal Q-function Q∗ (xt , ut , t) is a nonlinear function which is difficult to obtain.
According to Assumption 12.3.1, there exist ρ = 0, 1, . . . and θ = 0, 1, . . . , 23 such
that t = ρλ + θ , ∀t = 0, 1, . . .. Let k = ρλ. Then, PLt = PL(k+θ) = PLθ and Ct =
Ck+θ = Cθ , respectively. Define Uk as the control sequence in 24 h from k to k +
λ − 1, i.e., Uk = (uk , uk+1 , . . . , uk+λ−1 ). We can define a new utility function as
Π(xk, Uk) = Σ_{θ=0}^{λ−1} γ^θ U(xk+θ, uk+θ, θ),   ∀k ∈ {0, λ, 2λ, . . .}.   (12.3.4)
The utility function in (12.3.4) is time-invariant for k = 0, λ, 2λ, . . . , since the matrix Mt used in the definition of U(xt, ut, t) is periodic with period λ. Then, (12.3.3) can be expressed as

Q*(xk, Uk) = Π(xk, Uk) + γ̃ min_{Uk+λ} Q*(xk+λ, Uk+λ),

where γ̃ = γ^λ.
Based on the preparations above, a new dual iterative Q-learning algorithm can
be developed. In the present algorithm, two iterations are utilized, which are external
iterations (i-iterations in brief) and internal iterations (j-iterations in brief), respec-
tively. Let i = 0, 1, . . . be the external iteration index.
Let Ψ(xk, uk) be an arbitrary positive-semidefinite function. The initial Q-function Q0(xk, Uk) and the initial control law sequence A0(xk) are then determined by (12.3.5) and (12.3.6), respectively. Let j = 0, 1, . . . , 24 be the internal iteration index. For i = 0 and j = 0, the initial Q-function of the internal iteration is constructed from Ψ(xk, uk), and the corresponding control law is

u0^j(xk) = arg min_{uk} Q0^j(xk, uk),   (12.3.12)
where

x_{k+1}^{(j)} = F(xk, uk, λ − j) = [ PL(λ−j) − uk ;  x2k − η(uk)uk ]   (12.3.13)

and

U(xk, uk, λ − j) = xkᵀ Mλ−j xk + r uk².
Note that there are 24 such systems in (12.3.13) according to j = 1, 2, . . . , 24, and
they are system (12.3.2) working at different hours according to λ − j.
For i = 1, 2, . . ., let Qi^0(xk, uk) = Q_{i−1}^{24}(xk, uk). For j = 0, we calculate Qi^j(xk, uk) according to (12.3.14) and

ui^j(xk) = arg min_{uk} Qi^j(xk, uk).   (12.3.15)

The iterative control law sequence is then constructed as

Ai(xk) = {ui^0(xk), ui^23(xk), ui^22(xk), . . . , ui^1(xk)},   ∀i = 0, 1, . . . .   (12.3.16)
Such an ordering of control laws can be understood with the proof of the next theorem
(cf. (12.3.17) when j = 24).
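The two nested loops can be pictured with the schematic, tabular Python sketch below. States and controls are discretized here purely for illustration (the chapter itself uses neural-network approximators), step and utility stand for the system (12.3.13) and U(xk, uk, λ − j), and the inner backup written here is the one suggested by the expansion used in the proof of Theorem 12.3.1; it is a sketch under these assumptions, not the authors' implementation.

```python
import itertools

LAM = 24        # period (hours)
GAMMA = 0.995   # discount factor used in the numerical study

def dual_iterative_q_learning(states, controls, step, utility, psi, n_outer=20):
    """states/controls: finite grids; step(x, u, hour) -> next state (assumed to lie
    on the grid, e.g., after nearest-neighbor projection); utility(x, u, hour) -> float;
    psi(x, u) -> initial positive-semidefinite Q-value."""
    Q = {(x, u): psi(x, u) for x in states for u in controls}   # initial Q from Psi
    for _ in range(n_outer):                    # i-iteration (external)
        for j in range(1, LAM + 1):             # j-iteration (internal), hour = LAM - j
            hour = LAM - j
            Q_new = {}
            for x, u in itertools.product(states, controls):
                x_next = step(x, u, hour)
                q_min = min(Q[(x_next, up)] for up in controls)
                Q_new[(x, u)] = utility(x, u, hour) + GAMMA * q_min
            Q = Q_new                           # becomes Q_i^j; carries over as Q_{i+1}^0
    policy = {x: min(controls, key=lambda u: Q[(x, u)]) for x in states}
    return Q, policy                            # greedy action choice for the hour-0 stage
```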
In the present section, the convergence property of the dual iterative Q-learning algorithm will be established. First, we will show that the iterative control law sequence Ai(xk) obtained by the j-iteration minimizes the total cost over each 24-h period.

Theorem 12.3.1 For i = 0, 1, . . . and j = 0, 1, . . . , 24, let the iterative Q-functions Qi(xk, Uk) and Qi^j(xk, uk) be obtained by (12.3.5)–(12.3.15). Then,

min_{uk} Qi^24(xk, uk) = min_{Uk} Qi(xk, Uk).   (12.3.17)
Note that the superscripts used for xk+1, xk+2, . . . can be dropped when j = 24. For example, when j = 24, the calculation of x_{k+2}^{(j−1)} = x_{k+2}^{(23)} requires system (12.3.13) at hour 1 (or equivalently, at k + 1), and the superscript 23 indicates exactly this. The conclusion holds for i = 0. Assume that the conclusion holds for i = τ − 1. Then, for i = τ, we have

min_{uk} Qτ^24(xk, uk)
 = min_{uk} { U(xk, uk, 0) + γ min_{uk+1} [ U(x_{k+1}^{(24)}, uk+1, 1) + γ min_{uk+2} Qτ^22(x_{k+2}^{(23)}, uk+2) ] }
 = min_{uk} { U(xk, uk, 0) + γ min_{uk+1} [ U(x_{k+1}^{(24)}, uk+1, 1) + γ min_{uk+2} [ U(x_{k+2}^{(23)}, uk+2, 2) + γ min_{uk+3} [ U(x_{k+3}^{(22)}, uk+3, 3) + · · ·
   + γ min_{uk+23} [ U(x_{k+23}^{(2)}, uk+23, 23) + γ min_{uk+λ} Qτ^0(x_{k+λ}^{(1)}, uk+λ) ] · · · ] ] ] }
 = min_{Uk} { Π(xk, Uk) + γ̃ min_{uk+λ} Q_{τ−1}^{24}(xk+λ, uk+λ) }
 = min_{Uk} { Π(xk, Uk) + γ̃ min_{Uk+λ} Q_{τ−1}(xk+λ, Uk+λ) }
 = min_{Uk} Qτ(xk, Uk).
and define Qi^j(xk, uk) as in (12.3.14). For i = 0, 1, . . ., let Pi^0(xk, uk) = Qi^0(xk, uk). Then, for j = 0, 1, . . . , 24, we have

Qi^j(xk, uk) ≤ Pi^j(xk, uk).
From Theorem 12.3.1 and Corollary 12.3.1, for i = 0, 1, . . ., the total cost in each
period can be minimized by the iterative control law sequence Ai (xk ) according to
the j-iteration (12.3.9)–(12.3.16). Next, the convergence property of the i-iteration will be developed.
Proof For the functions Q*(xk, Uk), Π(xk, Uk), and Q0(xk, Uk), inspired by [26], let ς̲, ς̄, δ̲, and δ̄ be constants such that

ς̲ Π(xk, Uk) ≤ γ̃ min_{Uk+λ} Q*(xk+λ, Uk+λ) ≤ ς̄ Π(xk, Uk)

and

δ̲ Q*(xk, Uk) ≤ Q0(xk, Uk) ≤ δ̄ Q*(xk, Uk).

Note that only the existence of these constants is required to show that the iterative Q-function Qi(xk, Uk) will converge to the optimum; the estimation of these constants can be omitted. The
proof proceeds in four steps. First, we show that if 0 ≤ δ̲ ≤ δ̄ < 1, then for i = 0, 1, . . ., the iterative Q-function Qi(xk, Uk) satisfies

(1 + (δ̲ − 1)/(1 + ς̲^{−1})^i) Q*(xk, Uk) ≤ Qi(xk, Uk) ≤ (1 + (δ̄ − 1)/(1 + ς̄^{−1})^i) Q*(xk, Uk).   (12.3.19)

For i = 1, the lower bound follows from

Q1(xk, Uk) = Π(xk, Uk) + γ̃ min_{Uk+λ} Q0(xk+λ, Uk+λ)
 ≥ (1 + ς̲(δ̲ − 1)/(1 + ς̲)) Π(xk, Uk) + γ̃ (δ̲ − (δ̲ − 1)/(1 + ς̲)) min_{Uk+λ} Q*(xk+λ, Uk+λ)
 = (1 + ς̲(δ̲ − 1)/(1 + ς̲)) (Π(xk, Uk) + γ̃ min_{Uk+λ} Q*(xk+λ, Uk+λ))
 = (1 + (δ̲ − 1)/(1 + ς̲^{−1})) Q*(xk, Uk),   (12.3.20)

and, similarly,

Q1(xk, Uk) ≤ (1 + (δ̄ − 1)/(1 + ς̄^{−1})) Q*(xk, Uk).

Assuming that (12.3.19) holds for i = l − 1, the induction step for i = l gives

Ql(xk, Uk) ≥ (1 + ς̲^l (δ̲ − 1)/(1 + ς̲)^l) (Π(xk, Uk) + γ̃ min_{Uk+λ} Q*(xk+λ, Uk+λ))
 = (1 + (δ̲ − 1)/(1 + ς̲^{−1})^l) Q*(xk, Uk),   (12.3.21)

with the upper bound obtained in the same way, which establishes (12.3.19). Second, if δ̄ ≥ 1, we show that

(1 + (δ̲ − 1)/(1 + ς̲^{−1})^i) Q*(xk, Uk) ≤ Qi(xk, Uk) ≤ (1 + (δ̄ − 1)/(1 + ς̄^{−1})^i) Q*(xk, Uk).   (12.3.22)

The lower bound of (12.3.22) can be proven by steps similar to (12.3.20) and (12.3.21). For the upper bound of (12.3.22), letting i = 0, we have

Q1(xk, Uk) = Π(xk, Uk) + γ̃ min_{Uk+λ} Q0(xk+λ, Uk+λ)
 ≤ Π(xk, Uk) + δ̄ γ̃ min_{Uk+λ} Q*(xk+λ, Uk+λ) + (δ̄ − 1)/(1 + ς̄) (ς̄ Π(xk, Uk) − γ̃ min_{Uk+λ} Q*(xk+λ, Uk+λ))
 ≤ (1 + (δ̄ − 1)/(1 + ς̄^{−1})) Q*(xk, Uk).
In this section, neural networks are introduced to implement the dual iterative Q-
learning algorithm. There are two neural networks, which are critic and action net-
works, respectively, in the dual iterative Q-learning algorithm. Both neural networks
are chosen as three-layer backpropagation (BP) networks. The whole structural dia-
gram is shown in Fig. 12.10.
Fig. 12.10 The structural diagram of the dual iterative Q-learning algorithm (the discount factor γ is not shown in the diagram)
The role of the action network is to approximate the iterative control law ui^j(xk); its output can be expressed as

ûi^{j,l}(xk) = Wa^{ij}(l)ᵀ σ(Za(xk)),

where Za(xk) = Yaᵀ xk and σ(·) is a sigmoid function [38]. To enhance the training speed, only the hidden–output weight Wa^{ij}(l) is updated during the neural network training, while the input–hidden weight Ya is fixed [15]. According to [38], the action
network’s weight update is expressed as follows:
j
∂E (l)
Waij (l + 1) = Waij (l) − βa ai
ij
,
∂ Wa (l)
where
j 1 j
Eai (l) = (e (l))2 ,
2 ai
j j,l j
eai (l) = ûi (xk ) − ui (xk ),
and βa > 0 is the learning rate of the action network. The goal of the critic network
j
is to obtain Qi (xk , Uk ) by updating Qi (xk , uk ) in (12.3.14) for i = 0, 1, . . . and j =
0, 1, . . . , 24, iteratively. The critic network is constructed with 3 input neurons, 15 sigmoidal hidden neurons, and 1 linear output neuron. Let Zck = [xkᵀ, uk]ᵀ be the input vector of the critic network. Then, the output of the critic network can be expressed as

Q̂i^{j,l}(xk, uk) = Wc^{ij}(l)ᵀ σ(Z̄ck),

where Z̄ck = Ycᵀ Zck and σ(·) is a sigmoid function [38]. During the neural network training, the hidden–output weight Wc^{ij}(l) is updated, while the input–hidden weight Yc is fixed. According to [38], the critic network weight update is expressed as follows:

Wc^{ij}(l + 1) = Wc^{ij}(l) − αc ∂Eci^j(l) / ∂Wc^{ij}(l),

where

Eci^j(l) = ½ (eci^j(l))²,
eci^j(l) = Q̂i^{j,l}(xk, uk) − Qi^j(xk, uk),

and αc > 0 is the learning rate of the critic network. The dual iterative Q-learning algorithm implemented by the action and critic networks is explained step by step in Algorithm 12.3.1.
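As a concrete illustration of the critic update above, the NumPy sketch below performs one gradient step on the squared error, adapting only the hidden-to-output weights while the input-to-hidden weights stay fixed. The 3–15–1 sizes follow the text; the learning-rate value, tanh activation, and the target value passed in are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Yc = rng.normal(scale=0.1, size=(15, 3))      # fixed input-to-hidden weights
Wc = rng.normal(scale=0.1, size=(1, 15))      # trainable hidden-to-output weights
alpha_c = 0.01                                # illustrative learning rate

def critic_q(x, u):
    z = np.concatenate([x, [u]])              # Zck = [xk^T, uk]^T, three inputs
    h = np.tanh(Yc @ z)                       # sigmoidal (tanh) hidden layer
    return float(Wc @ h), h

def critic_update(x, u, q_target):
    """One gradient step on E = 0.5*(Q_hat - q_target)^2 with respect to Wc only."""
    global Wc
    q_hat, h = critic_q(x, u)
    e = q_hat - q_target                      # prediction error e_ci
    Wc = Wc - alpha_c * e * h[np.newaxis, :]  # dE/dWc = e * h^T
    return 0.5 * e ** 2
```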
In this section, the performance of the dual iterative Q-learning algorithm will be
examined by numerical experiments. Comparisons will also be given to show the
superiority of the present algorithm. The profiles of the residential load demand
and the real-time electricity rate are taken from [8, 17, 22], where the real-time
electricity rate and the residential load demand for one week (168 h) are shown in
Fig. 12.11a and c, respectively. We can see that the real-time electricity rate and the
residential load demand are both periodic-like functions with the period λ = 24. The
average trajectories of the electricity rate and the residential load demand are shown
in Fig. 12.11b, d. In this section, we use average residential load demand and average
electricity rate as the periodic residential load demand and electricity rate.
Fig. 12.11 Residential electricity rate and load demand. a Real-time electricity rate for 168 h. b
Average electricity rate. c Residential load demand for 168 h. d Average residential load demand
We assume that the supply from the power grid guarantees the residential load
demand at any time. Define the capacity of the battery as 100 kWh. Let the lower and upper storage limits of the battery be Ebmin = 20 kWh and Ebmax = 80 kWh, respectively. Let the rated power output of the battery and the maximum charging/discharging rate be 16 kW. The initial level of the battery is 60 kWh. Let the cost function be
expressed as in (12.3.1), where we set m1 = 1, m2 = 0.2, r = 0.1, and γ = 0.995.
Let the initial function Ψ (xk , uk ) = [xkT , ukT ]P[xkT , ukT ]T , where P = I is the identity
matrix with a suitable dimension. Let the initial state be x0 = [8, 60]T . After nor-
malizing the data of the residential load demand and the electricity rate [1, 38],
we implement the present dual iterative Q-learning algorithm by neural networks for
i = 20 iterations to guarantee the computation precision ε = 10−4 . The learning rates
of the action and critic networks are 0.01, and the training precisions of the neural net-
works are 10−6 . Let Qi (x0 , ū) = min Qi (x0 , u). The trajectory of Qi (x0 , ū) is shown in
j j j
u
j j
Fig. 12.12. After i = 20 iterations, we get Qi (x0 , ū) = Qi−1 (x0 , ū), j = 0, 1, . . . , 24,
which means the iterative Q-function is convergent to the optimum. According to
the one week’s residential load demand and electricity rate, the optimal control of
the battery is shown in Fig. 12.13.
In the present study, time-based Q-learning (TBQL) algorithm [22] and particle
swarm optimization (PSO) algorithm [17] will be compared to illustrate the superi-
ority of the present dual iterative Q-learning algorithm. For t = 0, 1, . . ., the goal of
TBQL algorithm [22, 38] is to design an iterative control law that satisfies the following optimality equation:

Q(xt−1, ut−1) = U(xt, ut) + γ Q(xt, ut).
Fig. 12.12 The trajectory of the iterative Q-function Qi^j(x0, ū) over the iteration indices i and j
Fig. 12.13 The optimal control of the battery for one week (power in kW versus time in hours)
Let the initial function and the structures of the action and critic networks which
implement the TBQL algorithm be the same as those in our example. For PSO
algorithm [17], let G = 30 be the swarm size. The position of each particle ℓ at time t is represented by xt^ℓ, ℓ = 1, 2, . . . , G, and its movement by the velocity vector vt^ℓ. Then, the update rule of PSO can be expressed as

v_{t+1}^ℓ = ω vt^ℓ + ρ1 φ1 (p^ℓ − xt^ℓ) + ρ2 φ2 (pg − xt^ℓ),   x_{t+1}^ℓ = xt^ℓ + v_{t+1}^ℓ.
Let the inertia factor be ω = 0.7. Let the correction factors ρ1 = ρ2 = [1, 1]T . Let ϕ1
and ϕ2 be random numbers in [0, 1]. Let p^ℓ be the best position of particle ℓ, and let pg be the global best position. Implement the TBQL for 100 time steps, and implement
PSO algorithm for 100 iterations. Let the real-time cost function be Rct = Ct Pgt ,
and the corresponding real-time cost functions are shown in Fig. 12.14a, where the
term “original” denotes “no battery system.” The comparison of the total cost for
168 h is displayed in Table 12.1. From Table 12.1, the superiority of our dual iterative
Q-learning algorithm can be verified.
Fig. 12.14 Numerical comparisons. a Real-time cost comparison among dual iterative Q-learning, TBQL, and PSO algorithms. b Battery energy comparison between dual iterative Q-learning and TBQL algorithms
The trajectories of the battery energy under the dual iterative Q-learning and TBQL algorithms are shown in Fig. 12.14b. We can see that
using the TBQL algorithm, the battery is fully charged each day, while the battery
level is more reasonable by the dual iterative Q-learning algorithm.
In the above optimizations, we give more importance to the electricity rate than
the cost of the battery system, i.e., m1 in the cost function is large. On the other hand,
the discharging rate and depth are also important for the battery system to be kept
“alive” for as long as possible. Hence, we enlarge the parameters m2 and r in the
cost function. Let m2 = 1, r = 1, and let m1 be unchanged. The iterative Q-function
is shown in Fig. 12.15. The optimal battery control is shown in Fig. 12.16, and the
battery energy under the new cost function is shown in Fig. 12.17a.
Fig. 12.15 The iterative Q-function under the new cost function
Fig. 12.16 The optimal battery control under the new cost function (power in kW versus time in hours)
Fig. 12.17 Batteries' energy. a New cost function with m1 = 1, m2 = 1, and r = 1. b Battery I. c Battery II
Enlarging m2
and r, we can see that the value of the iterative Q-function is increased. The battery output power is reduced, and the battery energy stays closer to Ebo, which extends the lifetime of the battery. However, the total cost for one week is 2955.35 cents, which
means the cost saving is reduced.
On the other hand, the battery model is important to the optimal control law of the
battery. To illustrate the effectiveness of the present algorithm, different elements of
the battery will be considered.
For convenience of analysis, we let m1 = 1, m2 = 0.2, r = 0.1. First, let the effi-
ciency of battery charging/discharging be η(Pbt ) = 0.698 − 0.173|Pbt |/Prate and let
the capacity of the battery be 80 kWh. Define the battery as Battery I. Implementing
the dual iterative Q-learning algorithm with Battery I, the trajectory of Qi^j(x0, ū) is
shown in Fig. 12.18. We can see that the iterative Q-function is also convergent to
the optimum after i = 20 iterations and the values of the Q-functions are larger than
the ones in Fig. 12.12, which indicates that the optimization ability decreases. The
Fig. 12.18 The trajectory of the iterative Q-function Qi^j(x0, ū) for Battery I
Fig. 12.19 The optimal control trajectories for Batteries I and II (power in kW versus time in hours)
optimal control trajectory for Battery I is shown in Fig. 12.19. The battery energy of
Battery I is shown in Fig. 12.17b, and the total cost in one week is 2914.70 cents.
Next, we further reduce the performance of the battery. Let the capacity of
the battery decrease to 60 kWh. Let the rated power output of the battery and the
maximum charging/discharging rate be 12 kW. Define the battery as Battery II. The
optimal control trajectory for Battery II is shown in Fig. 12.19. The battery energy
of Battery II is shown in Fig. 12.17c, and the total cost in one week is 3027.17 cents.
From the numerical results, we can see that for different battery models, the
present dual iterative Q-learning algorithm will guarantee the iterative value function
to converge to the optimum and obtain the optimal battery control law. We can also
see that as the performance of the battery decreases, the optimization ability of the
battery also decreases.
In this section, the multi-battery home energy management system will be described
and the optimization objective of the multi-battery coordination control will be intro-
duced [43].
The optimal multi-battery control problem is treated as a discrete-time problem
with the time step of 1 h, and it is assumed that the load demand varies hourly. The
schematic diagram of the smart home energy system is described in Fig. 12.20, which
is composed of the power grid, the load demand, the multi-battery system (including
N batteries, N ∈ Z+ , and sine wave inverters), and the power management unit
(controller). Given the real-time home load and electricity rate, our goal is to find
the optimal coordination control laws for the N batteries which minimize the total
expense of the power from the grid.
The models of batteries are taken from [22, 24, 44]. The battery characteris-
tics, battery hardware, battery software, and inverter/rectifier model of the battery
were presented in [44], and the derivation of battery model was presented in [24].
Define Nς as Nς = {ς : ς ∈ Z+ , ς ≤ N}. Let Ebς,t be the energy of battery ς at
time t, and let ης (·) be the charging/discharging efficiency of battery ς , ∀ς ∈ Nς .
Then, the model of battery ς can be expressed as Ebς,t+1 = Ebς,t − Pbς,t × η(Pbς,t ),
where Pbς,t is the power output of battery ς at time t. Let Pbς,t > 0 denote battery
ς discharging, ∀ς ∈ Nς . Let Pbς,t < 0 denote battery ς charging, and let Pbς,t = 0
denote battery ς idle. Let the efficiency of battery charging/discharging be derived
as η(Pbς,t ) = η̄ς − 0.173|Pbς,t |/Pςrate , where Pςrate > 0 is the rated power output of
battery ς and η̄ς is the maximum efficiency constant such that η̄ς ≤ 0.898. The storage limit is defined as Ebς^min ≤ Ebς,t ≤ Ebς^max, where Ebς^min and Ebς^max are the minimum and maximum storage energies of battery ς, respectively. The corresponding charging and discharging power limits are defined as Pbς^min ≤ Pbς,t ≤ Pbς^max, where Pbς^min and Pbς^max are the minimum and maximum charging/discharging powers of battery ς, respectively.
Based on the models of batteries, the optimization objectives can be formulated.
Let Ct be the electricity rate at time t. Let PL,t be the power of the home load at time
t, and let Pg,t be the power from the power grid. For the convenience of analysis, we
assume that the home load PL,t and the electricity rate Ct are periodic functions with
the period λ = 24 h. For ς ∈ Nς , with delays in PL,t and Pbς,t for battery ς , the load
balance can be expressed as
N
PL,t−1 = Pbς,t−1 + Pg,t .
ς=1
In this section, power flow from the batteries to the grid is not permitted, i.e., we
define Pg,t ≥ 0, to guarantee the power quality of the grid. To extend the lifetime
of the batteries, for ς ∈ Nς, we desire that the stored energy of battery ς stays near the middle of its storage limit, Ebς^o = ½ (Ebς^min + Ebς^max). Define ΔEbς,t = Ebς,t − Ebς^o. Let α, βς, rς, ς ∈ Nς, be given positive constants, and let 0 < γ < 1 be the discount factor. Then, the total cost function to be minimized can be defined as

Σ_{t=0}^{∞} γ^t [ α (Ct Pg,t)² + Σ_{ς=1}^{N} ( βς ΔEbς,t² + rς Pbς,t² ) ],   (12.4.1)
The physical meaning of the first term of the cost function is to minimize the total
cost from the grid. The second term aims to guarantee the stored energy of batteries
to be close to the middle of storage limit, which avoids fully charging/discharging
of the batteries. The third term is to prevent large charging/discharging power of
the batteries. Hence, the second and third terms aim to extend the lifetime of the
batteries. Let x1,t = Pg,t and x2ς,t = ΔEbς,t , ς ∈ Nς , be the system states. Let uς,t =
Pbς,t be the control input. Let M2 = diag{βς } and Mt = diag{αCt2 , M2 }. Let xt =
[x1,t , x21,t , . . . , x2N,t ]T be the state vector and ut = [u1,t , . . . , uN,t ]T be the control
vector. The home energy management system is defined as
xt+1 = F(xt, ut, t) = [ PL,t − Σ_{ς=1}^{N} uς,t ;  x21,t − η(u1,t)u1,t ;  . . . ;  x2N,t − η(uN,t)uN,t ].   (12.4.2)
Let x0 be the initial state. Then, the cost function (12.4.1) can be written as J(x0, u0, 0) = Σ_{t=0}^{∞} γ^t U(xt, ut, t), where U(xt, ut, t) = xtᵀ Mt xt + utᵀ R ut, R = diag{rς}, and ut = (ut, ut+1, . . .) denotes the control sequence from t to ∞. The optimal cost function can be defined as J*(xt, t) = inf_{ut} { J(xt, ut, t) }. According to Bellman's principle of optimality [7], we can obtain the following Bellman equation:

J*(xt, t) = min_{ut} { U(xt, ut, t) + γ J*(xt+1, t + 1) }.   (12.4.3)
In this section, a novel distributed iterative ADP algorithm is developed to solve the
optimal multi-battery coordination control problem for home energy management
systems. Convergence analysis results will be developed. From (12.4.2), the system
state xt is an (N + 1)-dimensional vector and the control ut is an N-dimensional
vector. Generally speaking, J ∗ (xt , t) in (12.4.3) is a highly nonlinear and nonanalytic
function. For the multi-battery home energy management system (12.4.2), J*(xt, t) is highly complex and computationally untenable to obtain by directly solving the Bellman equation (12.4.3). To overcome these difficulties, some system transformations are necessary. First, for t = 0, 1, . . ., there exist ρ = 0, 1, . . . and θ =
0, 1, . . . , 23 such that t = ρλ + θ . Let k = ρλ. Then, PL,t = PL,k+θ = PL,θ and Ct =
Ck+θ = Cθ , respectively. Let Uk denote the control sequence for N batteries in 24 h,
i.e., Uk = [uk , uk+1 , . . . , uk+λ−1 ]T ∈ Rλ×N . Similar to (12.3.4), define a new utility
function as
Υ(xk, Uk) = Σ_{θ=0}^{λ−1} γ^θ U(xk+θ, uk+θ, θ).
Then, for all k ∈ {0, λ, 2λ, . . .}, the Bellman equation (12.4.3) can be expressed as

J*(xk) = min_{Uk} { Υ(xk, Uk) + γ̃ J*(xk+λ) },   (12.4.4)

where γ̃ = γ^λ.
Let the charging/discharging efficiency of a new battery be

η(Pb,t) = η̲ − 0.173 |Pb,t| / P̄rate,

and let Pb,t be its charging/discharging power such that P̄bmin ≤ Pb,t ≤ Pbmax. Here, we assume that P̄bmin < Pbmax and η(Pb,t) > 0 to facilitate our analysis. According to the above notations and assumptions, we can construct a new battery, called Battery ℵ, which is expressed as

Eb,t+1 = Eb,t − Pb,t × η(Pb,t),   (12.4.5)

where ΔEb,t = Eb,t − Ebo, Ebo = ½ (Ēb − E̲b), and Eb,t is the energy of the new battery at time t.
N batteries (e.g., it is the “smallest” battery). If we replace all batteries ς by Battery
ℵ, i.e., x2ς,t = ΔEb,t and uς,t = Pb,t , ∀ς ∈ Nς , then the home energy management
system (12.4.2) can be simplified as
zt+1 = F(zt, vt, t) = [ PL,t − N vt ;  x2,t − η(vt)vt ],   (12.4.6)

where zt = [x1,t, x2,t]ᵀ, x2,t = ΔEb,t, and vt = Pb,t. The cost function can be rewritten as

Σ_{t=0}^{∞} γ^t ( ztᵀ M̄t zt + vt² Σ_{ς=1}^{N} rς ),   (12.4.7)
where M̄t = diag{ αCt², Σ_{ς=1}^{N} βς }. Next, we aim to design an optimal control law for Battery ℵ.
Define Uk = [vk, vk+1, . . . , vk+λ−1]ᵀ, which is the control sequence for Battery ℵ in 24 h. According to (12.4.6), there exists a function F̃ such that zk+λ = F̃(zk, Uk). Let Ū(zt, vt, t) = ztᵀ M̄t zt + r̄ vt², where r̄ = Σ_{ς=1}^{N} rς. We can define another new utility function as Γ(zk, Uk) = Σ_{θ=0}^{λ−1} γ^θ Ū(zk+θ, vk+θ, θ), ∀k ∈ {0, λ, 2λ, . . .}. Then, the Bellman equation (12.4.4) can be expressed as

J^o(zk) = min_{Uk} { Γ(zk, Uk) + γ̃ J^o(zk+λ) },   (12.4.8)

where J^o(zk) denotes the optimal cost function of the Battery ℵ system.
In this section, an iterative ADP algorithm will be developed to obtain the optimal control laws for Battery ℵ. Let i = 0, 1, . . . be the iteration index. Let Ψ(zk) be an arbitrary positive-semidefinite function and choose the initial value function V^0(zk) = Ψ(zk). We obtain

A^0(zk) = arg min_{Uk} { Γ(zk, Uk) + γ̃ V^0(zk+λ) }.   (12.4.9)

For i = 1, 2, . . ., the iterative value function and the iterative control law sequence are obtained by

V^i(zk) = min_{Uk} { Γ(zk, Uk) + γ̃ V^{i−1}(zk+λ) }   (12.4.10)

and

A^i(zk) = arg min_{Uk} { Γ(zk, Uk) + γ̃ V^i(zk+λ) }.   (12.4.11)
Proof For the functions J^o(zk), Γ(zk, Uk), and V^0(zk), inspired by [26], there exist constants ζ, δ̲, and δ̄ such that γ̃ J^o(zk+λ) ≤ ζ Γ(zk, Uk) and δ̲ J^o(zk) ≤ V^0(zk) ≤ δ̄ J^o(zk), respectively, where 0 < ζ < ∞ and 0 ≤ δ̲ ≤ 1 ≤ δ̄ < ∞. We will show that for i = 0, 1, . . ., the iterative value function V^i(zk) satisfies

(1 + (δ̲ − 1)/(1 + ζ^{−1})^i) J^o(zk) ≤ V^i(zk) ≤ (1 + (δ̄ − 1)/(1 + ζ^{−1})^i) J^o(zk).   (12.4.12)

The base case is given by (12.4.13), and the induction step for i = l yields

V^l(zk) ≤ (1 + (δ̄ − 1)/(1 + ζ^{−1})^l) J^o(zk).   (12.4.14)

Based on the mathematical induction (12.4.13) and (12.4.14), we can obtain the upper bound of (12.4.12). On the other hand, if δ̲ ≥ 1 and δ̄ ≤ 1, we can let δ̲ = 1 and δ̄ = 1, and (12.4.12) can also be verified. Hence, according to (12.4.12), we can obtain

lim_{i→∞} V^i(zk) = J^o(zk).
We calculate

μϑ0^0(xk) = arg min_{μϑ0,k} { Υ(xk, [μϑ0,k, Ūϑ0(xk)]) + γ̃ Vϑ0^0( F̄(xk, [μϑ0,k, Ūϑ0(xk)]) ) }.   (12.4.16)

Note that Ūϑ0(xk) is Uϑ0(xk) with its ϑ0-th column removed. For ℓ = 0, ϑ0 ∈ Nς, and τ = 1, 2, . . ., the distributed iterative ADP algorithm proceeds between

Vϑ0^τ(xk) = Υ(xk, [μϑ0^{τ−1}(xk), Ūϑ0(xk)]) + γ̃ Vϑ0^{τ−1}( F̄(xk, [μϑ0^{τ−1}(xk), Ūϑ0(xk)]) )   (12.4.17)

and

μϑ0^τ(xk) = arg min_{μϑ0,k} { Υ(xk, [μϑ0,k, Ūϑ0(xk)]) + γ̃ Vϑ0^τ( F̄(xk, [μϑ0,k, Ūϑ0(xk)]) ) }.   (12.4.18)

For ℓ = 1, 2, . . ., the iteration for battery ϑℓ proceeds in the same manner through (12.4.19), (12.4.20), and

μϑℓ^τ(xk) = arg min_{μϑℓ,k} { Υ(xk, [μϑℓ,k, Ūϑℓ(xk)]) + γ̃ Vϑℓ^τ( F̄(xk, [μϑℓ,k, Ūϑℓ(xk)]) ) }.   (12.4.21)
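Structurally, the distributed iteration is a Gauss–Seidel-style sweep: one battery's control law is improved at a time while the other batteries' laws are held fixed, and the batteries are visited cyclically so that each one is optimized infinitely often (the condition required by Theorem 12.4.3 below). The Python sketch below shows only this outer loop, with the single-battery optimization left abstract; the function names are illustrative, not from the text.

```python
def distributed_iterative_adp(batteries, optimize_single_battery, n_sweeps=10):
    """Schematic outer loop of the distributed iteration.

    optimize_single_battery(policies, b) is assumed to run the tau-iteration
    (12.4.17)-(12.4.18) for battery b with the other batteries' control laws held
    fixed, returning the improved control law for b.
    """
    policies = {b: None for b in batteries}      # None stands for the initial law
    for _ in range(n_sweeps):                    # each sweep visits every battery,
        for b in batteries:                      # so every battery is optimized
            policies[b] = optimize_single_battery(policies, b)   # infinitely often
    return policies
```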
Vϑℓ^∞(xk) = lim_{τ→∞} Vϑℓ^τ(xk),

where Vϑℓ^∞(xk) satisfies the following Bellman equation:

Vϑℓ^∞(xk) = min_{μϑℓ,k} { Υ(xk, [μϑℓ,k, Ūϑℓ(xk)]) + γ̃ Vϑℓ^∞( F̄(xk, [μϑℓ,k, Ūϑℓ(xk)]) ) }
 = Υ(xk, [μϑℓ^∞(xk), Ūϑℓ(xk)]) + γ̃ Vϑℓ^∞( F̄(xk, [μϑℓ^∞(xk), Ūϑℓ(xk)]) ).   (12.4.22)
Assume that (12.4.23) holds for τ = τ̄ − 1, τ̄ = 1, 2, . . ., i.e., Vϑ0^τ̄(xk) ≤ Vϑ0^{τ̄−1}(xk). Then, for τ = τ̄, we can obtain

Vϑ0^{τ̄+1}(xk) = min_{μϑ0,k} { Υ(xk, [μϑ0,k, Ūϑ0(xk)]) + γ̃ Vϑ0^τ̄( F̄(xk, [μϑ0,k, Ūϑ0(xk)]) ) }
 ≤ min_{μϑ0,k} { Υ(xk, [μϑ0,k, Ūϑ0(xk)]) + γ̃ Vϑ0^{τ̄−1}( F̄(xk, [μϑ0,k, Ūϑ0(xk)]) ) }
 ≤ Vϑ0^τ̄(xk).

Then, for ℓ = ℓ̄, the same argument applies, where Vϑℓ̄−1^∞(xk) satisfies (12.4.26). According to (12.4.24) and (12.4.25), we can obtain that the limit Vϑℓ^∞(xk) = lim_{τ→∞} Vϑℓ^τ(xk)
can be defined and Vϑτ (xk ) converges to the solution of the Bellman equation (12.4.22)
as τ → ∞. However, the optimization by a single battery cannot guarantee the
iterative value function to converge to the solution of the Bellman equation (12.4.4).
Hence, a global convergence analysis will be needed.
Before the next theorem, we define some notation. Let Tς = {ℓ : ϑℓ = ς} and let πς be the number of elements in Tς.

Theorem 12.4.3 (Global convergence property) For ℓ = 0, 1, . . . and τ = 0, 1, . . ., let Vϑℓ^τ(xk) and μϑℓ^τ(xk) be obtained by (12.4.15)–(12.4.21). If the sequence {ϑℓ} satisfies
(1) for all ℓ = 0, 1, . . ., ϑℓ ∈ Nς,
(2) for all ς ∈ Nς, πς → ∞,
then for all τ = 0, 1, . . ., the iterative value function Vϑℓ^τ(xk) converges to the optimum as ℓ → ∞, i.e.,

lim_{ℓ→∞} Vϑℓ^τ(xk) = J*(xk).
Let σ1, σ2, . . . , σπς be the elements of Tς such that σ1 < σ2 < · · · < σπς, i.e., Tς = {σℓ : ℓ = 1, 2, . . . , πς}, where πς → ∞. For τ̃ = 0, 1, . . ., we define

{Vσℓ^τ̃(xk)} = { Vσ1^τ̃(xk), Vσ2^τ̃(xk), . . . , Vσℓ^τ̃(xk), . . . };   (12.4.27)

then {Vσℓ^τ̃(xk)} is a subsequence of {Vϑℓ^τ(xk)}. According to (12.4.22) and (12.4.23),
(2) Show that the limit of the iterative value function Vϑℓ^τ(xk) satisfies the Bellman equation as ℓ → ∞.
= Vσℓ−1^∞(xk) ≥ Vσℓ^{τ+1}(xk)
 = Υ(xk, [μσℓ^τ(xk), Ūσℓ(xk)]) + γ̃ Vσℓ^τ( F̄(xk, [μσℓ^τ(xk), Ūσℓ(xk)]) )
 ≥ Vσℓ+1^∞(xk)
 = min_{μσℓ+1,k} { Υ(xk, [μσℓ+1,k, Ūσℓ+1(xk)]) + γ̃ Vσℓ+1^∞( F̄(xk, [μσℓ+1,k, Ūσℓ+1(xk)]) ) },

which means

V∞(xk) = min_{Uk} { Υ(xk, Uk) + γ̃ V∞( F̄(xk, Uk) ) }.
Next, let Ũ(xk) be an arbitrary admissible control law sequence [2, 27] for the N batteries. Define a new value function P(xk); it can then be shown that (12.4.31) holds.
(4) Show that the value function V∞(xk) equals the optimal cost function J*(xk). According to the definition of J*(xk) in (12.4.4), for ℓ = 0, 1, . . ., we have
Fig. 12.21 Electricity rate and home load demand. a Typical electricity rate for nonsummer seasons. b Typical home load demand
Fig. 12.22 The trajectory of the iterative value function over the battery optimization sequence (ℵ, I, II, III, IV) and iteration steps
Fig. 12.23 Optimal usum* of the batteries in one week
Fig. 12.24 Optimal control of the batteries in one week. a Battery I. b Battery II. c Battery III. d Battery IV
Let ℓ → ∞. We can obtain V∞(xk) ≥ J*(xk). On the other hand, for an arbitrary admissible control law Ũ(xk), (12.4.31) holds. Let Ũ(xk) = U*(xk), where U*(xk) is an optimal control law. Then, we can get V∞(xk) ≤ J*(xk). Hence, we obtain lim_{ℓ→∞} Vϑℓ^τ(xk) = J*(xk). The proof is complete.
Fig. 12.25 The energy of the batteries in one week. a Battery I. b Battery II. c Battery III. d Battery IV
In this section, the performance of the present distributed iterative ADP algorithm will
be examined by numerical experiments. Choose the profiles of real-time electricity
rate in nonsummer seasons from ComEd Company in [14], and choose the home load
demand from NAHB Research Report in [35], where trajectories of the electricity
rate and the home load demand are shown in Fig. 12.21a and b, respectively.
Fig. 12.26 Real-time cost comparison among the original (no battery) case, TBQL, DIADP, Battery ℵ, and PSO
Fig. 12.27 Real-time electricity rate for summer
Fig. 12.28 The trajectory of the iterative value function for the summer data
Fig. 12.29 Optimal usum* of the batteries for one week in summer
Based on the optimal control law of Battery ℵ, the distributed iterative ADP
algorithm is implemented. Let the optimization sequence of the batteries be chosen
as {I, II, . . . , IV, I, II, . . . , IV, . . .}. The trajectory of the iterative value function is
shown in Fig. 12.22, where we can see that the iterative value function is monoton-
ically nonincreasing and converges to the optimum. The optimal sum control usum
for the four batteries is shown in Fig. 12.23, where the charging/discharging power
is obviously larger than that of Battery ℵ. The optimal control laws for Batteries
I, . . ., IV are shown in Fig. 12.24, and the energies of the batteries are displayed in
Fig. 12.25.
Next, time-based Q-learning (TBQL) algorithm [22] and particle swarm opti-
mization (PSO) algorithm [17] will be compared to illustrate the superiority of our
ADP algorithm. For t = 0, 1, . . ., the goal of TBQL algorithm [22] is to design
an iterative control that satisfies the following optimality equation Q(xt−1 , ut−1 ) =
U(xt , ut ) + γ Q(xt , ut ).
Three-layer backpropagation (BP) neural networks are implemented to approxi-
mate the Q-function. The detailed neural network implementation of TBQL can be
seen in [8, 22, 38], which is omitted here. For PSO algorithm [17], let G = 100
be the swarm size. The position of each particle at time t is represented by xt ,
= 1, 2, . . . , G , and its movement by the velocity vector vt . Then, the updating
rule of PSO can be expressed as
12.4 Multi-battery Optimal Coordination Control for Residential Energy Systems 531
1 1
0.5 0.5
0 0
−0.5 −0.5
−1 −1
0 50 100 150 0 50 100 150
Time (Hours) Time (Hours)
(a) (b)
0.4
Optimal control of Battery IV (kW)
Optimal control of Battery III (kW)
0.6
0.4 0.2
0.2
0
0
−0.2
−0.2
−0.4 −0.4
−0.6
−0.6
0 50 100 150 0 50 100 150
Time (Hours) Time (Hours)
(c) (d)
Fig. 12.30 Optimal control of the batteries for one week in summer. a Battery I. b Battery II. c
Battery III. d Battery IV
Let the inertia factor be ω = 0.7. Let the correction factors ρ1 = ρ2 = 1. Let ϕ1 and
ϕ2 be random numbers in [0, 1]. Let p be the best position of particles, and let pg
be the global best position. Implement the TBQL for 500 time steps, and implement
the PSO algorithm for 500 iterations. The comparison of the total cost for 168 h
is displayed in Table 12.2, where we can see that the minimum cost is obtained by
our distributed iterative ADP algorithm. Let the real-time cost function be Rct =
Ct Pgt , and the corresponding real-time cost functions are shown in Fig. 12.26, where
the term “original” denotes “no battery system” and “DIADP” denotes “distributed
iterative ADP algorithm.” According to the numerical comparisons, the superiority
of our distributed iterative ADP algorithm can be verified.
532 12 Adaptive Dynamic Programming for Optimal Residential Energy Management
It should be pointed out that if the residential environment, such as the electricity
rate, is changed, the optimal multi-battery control will change correspondingly. It is
pointed out by ComEd Company [14] that in summer, the electricity rate is different
from other seasons, which is shown in Fig. 12.27. Let all the other parameters remain
unchanged, and implement the iterative ADP algorithm. The iterative value function
is shown in Fig. 12.28, where the nonincreasing monotonicity and optimality are
verified. The optimal control of usum for 168 h in summer is shown in Fig. 12.29. The
corresponding optimal multi-battery coordination controls are shown in Fig. 12.30a–
d, respectively. Next, implementing TBQL and PSO algorithms for the summer data,
the total costs by TBQL and PSO, and the present iterative ADP algorithms for one
week are shown in Table 12.3, where the superiority of our distributed iterative ADP
algorithm can be verified.
12.5 Conclusions 533
12.5 Conclusions
Given the residential load and the real-time electricity rate, the objective of the opti-
mal control in this chapter is to find the optimal battery charging/discharging/idle
control law at each time step which minimizes the total expense of the power from
the grid while considering the battery limitations. First, the ADP scheme based on
ADHDP that is suitable for applications to residential energy system control and
management problem is introduced. Then, a new iterative ADP algorithm, called
dual iterative Q-learning algorithm, is developed to solve the optimal battery man-
agement and control problem in smart residential environments. The main idea of
the present dual iterative Q-learning algorithm is to update the iterative value func-
tion and iterative control laws by ADP technique according to the i-iteration and the
j-iteration, respectively. The convergence and optimality of the present algorithm
are established. Next, optimal multi-battery management problems in smart home
energy management systems are solved by iterative ADP algorithms. To obtain the
optimal coordination control law for multi-batteries, a new distributed iterative ADP
algorithm is developed, which avoids the increasing dimension of controls. Conver-
gence properties are developed to guarantee the optimality of the algorithm. Neural
networks are introduced to implement these ADP algorithms. Effectiveness of the
algorithms is verified by numerical results.
References
8. Boaro M, Fuselli D, Angelis FD, Liu D, Wei Q, Piazza F (2013) Adaptive dynamic programming
algorithm for renewable energy scheduling and battery management. Cogn Comput 5(2):264–
277
9. Chacra FA, Bastard P, Fleury G, Clavreul R (2005) Impact of energy storage costs on economical
performance in a distribution substation. IEEE Trans Power Syst 20:684–691
10. Chaouachi A, Kamel RM, Andoulsi R, Nagasaka K (2013) Multiobjective intelligent energy
management for a microgrid. IEEE Trans Ind Electron 60(4):1688–1699
11. Chen C, Duan S, Cai T, Liu B, Yin J (2009) Energy trading model for optimal microgrid schedul-
ing based on genetic algorithm. In: Proceedings of the IEEE International Power Electronics
and Motion Control Conference. pp 2136–2139
12. ComEd, Real time pricing in USA. [Online Available] http://www.thewattspot.com
13. Corrigan PM, Heydt GT (2007) Optimized dispatch of a residential solar energy system. In:
Proceedings of the North American power symposium. pp 4183–4188
14. Data of electricity rate from ComEd Company, USA. [Online Available] https://rrtp.comed.
com/live-prices
15. Dierks T, Jagannathan S (2012) Online optimal control of affine nonlinear discrete-time systems
with unknown internal dynamics by using time-based policy update. IEEE Trans Neural Netw
Learn Syst 23(7):1118–1129
16. Fung CC, Ho SCY, Nayar CV (1993) Optimisation of a hybrid energy system using simulated
annealing technique. In: Proceedings of the ieee region 10 conference on computer, commu-
nication, control and power engineering, pp 235–238
17. Fuselli D, Angelis FD, Boaro M, Liu D, Wei Q, Squartini S, Piazza F (2013) Action dependent
heuristic dynamic programming for home energy resource scheduling. Int J Electr Power
Energy Syst 48:148–160
18. Gudi N, Wang L, Devabhaktuni V, Depuru SSSR (2011) A demand-side management sim-
ulation platform incorporating optimal management of distributed renewable resources. In:
Proceedings of the ieee/pes power systems conference and exposition, pp 1–7
19. Guerrero JM, Loh PC, Lee TL, Chandorkar M (2013) Advanced control architectures for
intelligent microgrids-part II: power quality, energy storage, and AC/DC microgrids. IEEE
Trans Ind Electron 60(4):1263–1270
20. Hagan MT, Menhaj MB (1994) Training feedforward networks with the Marquardt algorithm.
IEEE Trans Neural Netw 5(6):989–993
21. Huang T, Liu D (2011) Residential energy system control and management using adaptive
dynamic programming. In: Proceedings of the ieee international joint conference on neural
networks, pp 119–124
22. Huang T, Liu D (2013) A self-learning scheme for residential energy system control and
management. Neural Comput Appl 22(2):259–269
23. Jian L, Xue H, Xu G, Zhu X, Zhao D, Shao ZY (2013) Regulated charging of plug-in hybrid
electric vehicles for minimizing load variance in household smart microgrid. IEEE Trans Ind
Electron 60(8):3218–3226
24. Lee TY (2007) Operating schedule of battery energy storage system in a time-of-use rate
industrial user with wind turbine generators: A multipass iteration particle swarm optimization
approach. IEEE Trans Energy Convers 22(3):774–782
25. Lendaris GG, Paintz C (1997) Training strategies for critic and action neural networks in dual
heuristic programming method. In: Proceedings of the ieee international conference neural
networks, pp 712–717
26. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
27. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
28. Liu D, Xiong X, Zhang Y (2001) Action-dependent adaptive critic designs. In: Proceedings of
the international joint conference on neural networks, pp 990–995
29. Liu D, Zhang Y, Zhang H (2005) A self-learning call admission control scheme for CDMA
cellular networks. IEEE Trans Neural Netw 16(5):1219–1228
References 535
13.1 Introduction
Coal is the world’s most abundant energy resource and the cheapest fossil fuel. The
development of coal gasification technologies, which is a primary component of the
carbon-based process industries, is of primary importance to deal with the limited
petroleum reserves [12]. Hence, optimal control for the coal gasification process is a
key problem of carbon-based process industries. To describe the process of coal gasi-
fication, many discussions focus on coal gasification modeling approaches [1, 13,
14, 19]. The established models are usually very complex with high nonlinearities.
To simplify the controller design, the traditional control method for the coal gasifica-
tion process adopts feedback linearization control method [6, 10, 18]. However, the
controller designed by feedback linearization technique is only effective in the neigh-
borhood of the equilibrium point. When the required operating range is large, the
nonlinearities in the system cannot be properly compensated by using a linear model.
Therefore, it is necessary to study an optimal control approach for the original nonlin-
ear system [4, 7–9, 17]. But to the best of our knowledge, there are no discussions on
the optimal controller design for the nonlinear coal gasification processes. One of the
difficulties is the complexity of the coal gasification processes, which leads to very
complex expressions for the optimal control law. Generally speaking, the optimal
control law cannot be expressed analytically. Another difficulty to obtain the optimal
control law lies in solving the time-varying Bellman equation which is usually too
difficult to solve analytically. On the other hand, in the real-world control systems
of coal gasification processes, the coal quality is also unknown for control systems.
This makes it more difficult to obtain the optimal control law of the coal gasification
systems. To overcome these difficulties, iterative ADP algorithm will be employed.
For the coal gasification process, the accurate system model is complex and can-
not be obtained in general. In each iteration of the iterative ADP algorithms, the
accurate iterative control laws and the cost function cannot be accurately obtained
either. In this situation, approximation structures, such as neural networks, can be
used to approximate the system model, the iterative control law, and the iterative
value function, respectively. So, there must exist approximation errors between the
approximated functions and the expected ones, no matter what the approximation
precisions are obtained. When the accurate system model, iterative control laws, and
the iterative value function cannot be obtained, the convergence properties of the
accurate iterative ADP algorithms may be invalid. Till now, only in [11], approxi-
mation errors for the iterative control law and iterative value function in the iterative
ADP algorithm were considered, but the accurate system model is required. To the
best of our knowledge, there are no discussions on the optimal control scheme of the
iterative ADP algorithms, where modeling errors of unknown systems and iteration
errors are both considered. In this chapter, an integrated self-learning optimal con-
trol method of the coal gasification process using iterative ADP is developed, where
modeling errors and iteration errors are considered [16].
The coal gasification inputs the coal water slurry (including coal and water) and
combines with oxygen into the gasifier. The coal gasification process in the gasifier
operates at a high temperature and the output of coal gasification process include
synthesis gas and char. The diagram of coal gasification process is given in Fig. 13.1.
The composition of coal contains carbon (C), hydrogen (H), oxygen (O), and char
(Char), which is expressed by
4
where θki = 1 and k = 0, 1, . . . , is the discrete-time. Let Θk = [θk1 , θk2 , θk3 , θk4 ]T
i=1
denote the coal quality function. The coal gasification reaction can be classified
into two phases [13]. The first phase is coal combustion reaction, and the chemical
equations are expressed by
1 1
C + O2 = CO − 123.1 kJ/mol, CO + O2 = CO2 − 282.9 kJ/mol, (13.2.1)
2 2
where CO is carbon monoxide and CO2 is carbon dioxide. The other phase is water
gas shift reaction which is reversible and mildly exothermic
where H2 O is water.
13.2 Data-Based Modeling and Properties 539
The coal combustion reaction is instantaneous and nonreversible. The water gas
shift reaction is reversible and the reaction is strongly dependent on the reaction tem-
perature. Let xk be the reaction temperature and let Tk denote the reaction equilibrium
coefficient. Then, we have the following empirical formula [13]
0.921914
nCO2 · nH2 202.362
Tk = = , (13.2.3)
nCO · nH2 O xk − 635.52
u k = [u 1k , u 2k , u 3k ]T = [Pcoal
k
, PHk2 O , POk2 ]T . (13.2.4)
xk+1 = F(xk , u k , Θk )
yk = G(xk , u k , Θk ), (13.2.6)
exp(ζi ) − exp(−ζi )
where σ̄(Y f X ) ∈ R L , [σ̄(ζ)]i = , i = 1, . . . L , are the activation
exp(ζi ) + exp(−ζi )
function.
The NN estimation error can be expressed by
where Y ∗f and W ∗f are the ideal weight parameters, ε(X ) is the reconstruction error.
For the convenience of analysis, only the output weights W f are updated during
the training, while the hidden weights Y f are kept fixed [3, 20]. Hence, the NN
function (13.2.7) will be simplified by the expression F̂N (X, W f ) = W f σ f (X ),
where σ f (X ) = σ̄(Y f X ).
Next, using input-state-output data, two BP NNs are used to reconstruct the system
(13.2.6). Let the number of hidden layer neurons be denoted by L m1 and L m2 . Let the
∗ ∗
ideal weights be denoted by Wm1 and Wm2 , respectively. According to the universal
13.2 Data-Based Modeling and Properties 541
x̂k+1 = Ŵm1k
T
σ1 (z k ),
ŷk = Ŵm2k
T
σ2 (z k ), (13.2.8)
where x̂k is the estimated system state vector and ŷk is the estimated system output
∗
vector. Let Ŵm1k be the estimation of the ideal weight matrix Wm1 and let Ŵm2k be the
∗
estimation of the ideal weight matrix Wm2 . Then, we define the system identification
errors as
∗ ∗
where W̃m1k = Ŵm1k − Wm1 and W̃m2k = Ŵm2k − Wm2 .
Let φm1k = W̃m1k σ1 (z k ) and φm2k = W̃m2k σ2 (z k ). Then, we can get
T T
1 2 1
E mk = x̃ + ỹ T ỹk .
2 k+1 2 k
By a gradient-based adaptation rule, the weights are updated as
Assumption 13.2.1 The NN approximation errors εm1k and εm2k are assumed to be
upper bounded by a function of estimation error such that
ε2m1k ≤ λm1 x̃ 2 ,
φTm2k εm2k ≤ λm2 φTm2k φm2k ,
where 0 < λm1 < 1 and 0 < λm2 < 1 are bounded constant values.
Then, we have the following theorem.
Theorem 13.2.1 Let the identification scheme (13.2.8) be used to identify the non-
linear system (13.2.6), and let the NN weights be updated by (13.2.10). If Assump-
tion 13.2.1 holds, then the system identification error x̃k approaches zero asymptoti-
cally and the error matrices W̃m1k and W̃m2k both converge to zero, as k → ∞.
1 T 1 T
L(x̃k , W̃m1k , W̃m2k ) = x̃k2 + tr W̃m1k W̃m1k + tr W̃m2k W̃m2k .
lm1 lm2
With the identification error dynamics (13.2.9) and the weight tuning rules of Ŵm1,k+1
and Ŵm2,k+1 in (13.2.10), we can obtain
ΔL(x̃k , W̃m1k , W̃m2k ) = φ2m1k −2φm1k εm1k + ε2m1k + lm1 σ1T(z k )σ1 (z k )x̃k+1
2
− x̃k2 +lm2 σ2T(z k )σ2 (z k ) ỹkT ỹk −2 W̃m2k σ2T(z k ) ỹk −2φm1k x̃k+1 .
1 1 − λm1
lm1 < min ,
2σ 2M 2λm1 σ 2M
1 − λm2
lm2 < .
σ 2M (1
+ χm2 )
Next, NN will be used to identify the coal quality function Θk and solve the
reference control law u f k using the system data. Different from the system modeling,
the coal quality data cannot generally be detected and identified in real-time coal
gasification process. This means that the coal quality data can only be obtained
offline. Noticing this feature, an iterative training method of the neural networks can
be adopted.
According to (13.2.6), we can solve Θk , which is expressed as
Usually, FΘ (·) is a highly nonlinear function and the analytical expression of FΘ (·)
is nearly impossible to obtain. Thus, a BP NN (Θ network for brief) is established
to identify the coal quality function Θk .
Let the number of hidden layer neurons be denoted by L Θ . Let the ideal weights
be denoted by WΘ∗ . The NN representation of (13.2.12) can be written as
where z Θk = [xk , xk+1 , ykT , u Tk ]T and εΘk is the reconstruction error. Let σΘ (z Θk ) =
σ̄(YΘ z Θk ) where YΘ is an arbitrary matrix. The NN coal quality function is
constructed as
Θ̂k = ŴΘk
T
σΘ (z Θk ), (13.2.14)
where Θ̂k is the estimated coal quality function, and ŴΘk is the estimated weight
matrix. According to (13.2.12), we notice that solving Θk needs the data xk+1 . As
we adopt offline data to train the NN, the corresponding data can be obtained. Define
the identification error as
j j j
Θ̃k = Θk − Θ̂k = φΘk − εΘk ,
Let the number of hidden layer neurons be denoted by L u . Let the ideal weights
be denoted by Wu∗ . The NN representation of (13.2.16) can be written as
where z uk = [xk , xk+1 , ΘkT ]T and εuk is the reconstruction error. Let σu (z k ) =
σ̄(Yu z k ) where Yu is an arbitrary matrix. The NN reference control is constructed as
T
where û f k is the estimated reference control, and Ŵuk is the estimated weight matrix.
Define the identification error as
j j j
ũ f k = u f k − û f k = φuk − εuk , (13.2.19)
jT jT j
φΘk εΘk ≤ λΘ φΘk φΘk ,
jT jT j
φuk εuk ≤ λu φukφuk (13.2.21)
hold, where 0 < λΘ < 1 and 0 < λu < 1, then the error matrices W̃Θk and W̃uk
both converge to zero, as j → ∞.
j j 1 jT j 1 jT j
L W̃Θk , W̃uk = tr W̃Θk W̃Θk + tr W̃uk W̃uk .
lΘ lu
j j 1 ( j+1)T j+1 jT j
ΔL(W̃Θk , W̃uk ) = tr W̃Θk W̃Θk − W̃Θk W̃Θk
lΘ
1 ( j+1)T j+1 jT j
+ tr W̃uk W̃uk − W̃uk W̃uk
lu
jT j jT j
= − 2WΘk σΘk Θ̃k + lΘ Θ̃k σΘk
T
σΘk Θ̃k
jT j jT j
− 2Wuk σuk ũ k + lΘ ũ k σuk
T
σuk ũ k
jT j jT j
≤ − 2 φΘk φΘu − φΘk εΘk + lΘ σΘ̄ 2
φΘk − εΘk 2
jT j jT j
− 2 φukφuk − φukεu k + lu σū2 φuk − εuk 2 , (13.2.22)
where σΘk = σΘ (z Θk ) and σuk = σu (z uk ). As εΘk and εuk are bounded, there exist
χΘ > 0 and χu > 0 such that
jT j
εTΘk εΘk ≤ χΘ φΘk φΘk ,
jT j
εTuk εuk ≤ χu φukφuk .
546 13 Adaptive Dynamic Programming for Optimal Control …
1 − λΘ
lΘ < ,
σΘ̄
2
(1+ χΘ )
1 − λu
lu < .
σū2 (1 + χu )
j j
Hence, we can obtain ΔL W̃Θk , W̃uk ≤ 0. The proof is complete.
Remark 13.2.1 From Theorem 13.2.2, the coal quality function Θk and the reference
control law u f k can be approximated by neural networks. It should be pointed out
that, in real-world applications, the coal quality is generally a slow time-varying
function. It implies that when the coal quality function Θk is identified, it can be
considered as a constant vector. Hence, from the current coal gasification system,
we can obtain the current state, input and output data. Then, we can first use neural
network to identify Θk according to (13.2.12)–(13.2.15). Then, taking the estimated
Θ̂k and the state xk into the u f network, we can obtain the reference control law û f k
immediately, according to (13.2.16)–(13.2.20).
In the previous section, we have shown how to use the system data to approximate
the dynamics of system (13.2.6). NNs are also adopted to solve the reference control
and obtain the coal quality function, respectively. In this section, we will present the
iterative ADP algorithm to obtain the optimal tracking control law under system and
iteration errors.
Although the control system, the reference control, and the coal quality function are
approximated by NNs, the system errors are still unknown. It is difficult to design
the optimal tracking control system with unknown system errors. Thus, an effective
system transformation is performed in this section.
13.3 Design and Implementation of Optimal Tracking Control 547
In order to transform the system, for the desired system state η, a desired reference
control (desired control for brief) can be obtained. Substituting the desired state
trajectory η into (13.2.16), we can obtain the reference control trajectory
u dk = Fu (η, η, Θk ),
where u dk is defined as the desired control or reference control. Let F̂u (η, η, Θk ) =
T
Ŵuk σu (η, η, Θk ) be the neural network function which approximates the reference
control u dk . If the weights Ŵuk converge to Wu∗ sufficiently, then
Let z duk = [η, η, ΘkT ]T and ẑ duk = [η, η, Θ̂kT ]T . According to the mean value
theorem, we have
F̂u (ẑ duk ) = F̂u (z duk ) − ∇(ξΘ )εΘk ,
where
∂ F̂u (η, η, ξΘ )
∇(ξΘ ) = ,
∂ξΘ
where ε̂uk = ∇(ξΘ )εΘk + εuk . Let u δk be the error between the control u k and the
reference control u dk , then we can obtain
Remark 13.3.1 From (13.3.3), we can see that if we have obtained the estimated
control uˆk and the control error ε̂u k , then the control input u k can be determined.
548 13 Adaptive Dynamic Programming for Optimal Control …
where ε̄uk = Δεuk − ε̂uk . Next, the disturbance of the control is considered.
∗ ∗
According to (13.2.8), let the NN weights be convergent to Wm1 and Wm2
sufficiently. If we let F̂(z k ) = Ŵm1k
T
σ1 (z k ), then the system state equation can be
written as
where
∂ F̂(xk , ξuk , Θ̂k )
∇(ξu ) = ,
∂ξuk
ek = xk − η, (13.3.7)
u ek = u k − u dk , (13.3.8)
ū k = u ek + û dk + ε̃uk , (13.3.9)
13.3 Design and Implementation of Optimal Tracking Control 549
where
∇(ξ˜u ) = ∂ F̂ (ek + η), ξ˜u , Θ̂k /∂ ξ˜u ,
where wk = ∇(ξu )ε̄uk + ∇(ξΘ )εΘk + εm1k + ∇(ξ˜u )ε̃uk . As ∇(ξu )ε̄uk , ∇(ξΘ )εΘk ,
∇(ξ˜u )ε̃uk , and εm1k are all bounded, the system disturbance will also be bounded.
Let |εm1k | ≤ |ε̄m1 | and ∇(ξ˜u )ε̃uk ≤ ε̃u , then we can get
On the other hand, as mentioned in Remark 13.2.1, Θ̂k is a constant vector after it
is identified. Hence, according to (13.3.1), û dk can also be seen as a constant vector.
Then, system (13.3.10) can be transformed into the following regulation system
where
F̄(ek , u ek , Θ̂k ) = F̂ (ek + η), (u ek + û dk ), Θ̂k − η.
From (13.3.11), we can see that the nonlinear tracking control system (13.2.6) is
transformed into a regulation system, where the system errors and the control fluc-
tuation are transformed into an unknown-bounded system disturbance.
Our goal is to obtain an optimal control such that the tracking error ek converges
to zero under the system disturbance wk . As the system disturbance wk is unknown,
the design of the optimal controller becomes very difficult. In [2], the optimal control
problem for system (13.3.11) was transformed into a two-person zero-sum optimal
control problem, where the system disturbance wk was defined as a control variable.
The optimal control law is obtained under the worst case of the disturbance (the
disturbance control maximizes the cost function). Inspired by [2], we define wk as a
disturbance control of the system and the two controls u ek and wk of system (13.3.11)
are designed to optimize the following quadratic cost function
∞
J (e0 , u e 0 , w0 ) = Aek2 + u Tek Bu ek − Cwk2 , (13.3.12)
k=0
550 13 Adaptive Dynamic Programming for Optimal Control …
where we let u ek = (u ek , u e,k+1 , . . .) and wk = (wk , wk+1 , . . .). Let A > 0 and
C > 0 be constants, and let B be a positive definite matrix. Then, the optimal cost
function can be defined as
J ∗ (ek ) = inf sup J (ek , u ek , wk ) .
u ek wk
Let U (ek , u ek , wk ) = Aek2 + u Tek Bu ek − Cwk2 be the utility function. In this chapter,
we assume that the utility function U (ek , u ek , wk ) > 0, ∀ek , u ek , wk = 0. Generally
speaking, the system errors are small. This requires the system disturbance wk to
be small and the utility function to be larger than zero. If wk are large, we can
reduce the value of C or enlarge value of A and the matrix B. Hence, the assumption
U (ek , u ek , wk ) > 0 can be guaranteed.
According to Bellman’s principle of optimality, J ∗ (ek ) satisfies the discrete-time
Hamilton–Jacobi–Isaacs (HJI) equation
J ∗ (ek ) = min max U (ek , u ek , wk )+ J ∗ (ek+1 ) . (13.3.13)
u ek wk
We can see that if we want to obtain the optimal control laws u ∗e (ek ) and w∗ (ek ), we
must obtain the optimal cost function J ∗ (ek ). Generally speaking, J ∗ (ek ) is unknown
before all the controls u ek and wk are considered, which means that the HJI equation
is generally unsolvable. In this chapter, an iterative ADP algorithm with system
and approximation errors is employed to overcome these difficulties. In the present
iterative ADP algorithm, the cost function and control law are updated by iterations,
with the iteration index i increasing from 0 to infinity. Let the initial value function
V̂0 (ek ) ≡ 0, ∀ek .
From V̂0 (·) = 0, we calculate
v̂0 (ek ) = arg min U (ek , u ek , ω 0 (ek )) + V̂0 (ek+1 ) + ρ0 (ek ), (13.3.15)
u ek
13.3 Design and Implementation of Optimal Tracking Control 551
where V̂0 (ek+1 ) = 0 and ρ0 (ek ) is the finite approximation error function of the
iterative control v̂0 (ek ). For i = 1, 2, . . ., the iterative algorithm calculates the itera-
tive value function V̂i (ek ),
where ek+1 is expressed as in (13.3.11) and πi (ek ) is the finite approximation error
function of the iterative value function and the control laws ωi (ek ) and v̂i (ek ),
v̂i (ek ) = arg min {U (ek , u ek , ωi (ek )) +V̂i (ek+1 ) + ρi (ek ), (13.3.18)
u ek
where ρi (ek ) is the finite approximation error function of the iterative control.
Remark 13.3.2 From (13.3.11), the system is affine for the disturbance control wk
(it is actually linear in this case). According to (13.3.14) and (13.3.17), using the
necessary condition of optimality, for i = 0, 1, 2, . . ., ωi (ek ) can be obtained as
1 d V̂i (ek+1 )
ωi (ek ) = C −1 .
2 dek+1
Next, we consider the properties of the iterative ADP algorithm with system errors,
iteration errors, and control disturbance. For the two-person zero-sum iterative ADP
algorithm described in (13.3.14)–(13.3.18), as the iteration errors are unknown, the
properties of the iterative value functions V̂i (ek ) and the iterative control laws ωi (ek )
and v̂i (ek ) are very difficult to analyze, for i = 0, 1, . . .. On the other hand, in [11],
for nonlinear systems with a single controller, an “error bound” analysis method is
proposed to prove the convergence of the iterative value function. In this chapter,
we will establish similar “error bound” convergence analysis results for the iterative
value functions for nonlinear two-person zero-sum optimal control problems.
For i = 1, 2, . . ., define an iterative value function as
where
V0 (ek+1 ) = V̂0 (ek+1 ) = 0
and
552 13 Adaptive Dynamic Programming for Optimal Control …
is the accurate iterative control law. According to (13.3.18), there exists a finite
constant τ ≥ 1 such that
J ∗ (ek+1 ) ≤ γ U (ek , u ek , wk )
holds uniformly. If there exists 1 ≤ τ < ∞ such that (13.3.21) holds uniformly, then
i
γ j τ j−1 (τ − 1) ∗
V̂i (ek ) ≤ τ 1 + J (ek ), (13.3.22)
j=1
(γ + 1) j
i
where we define j (·) = 0, ∀ j > i and i, j = 0, 1, . . ..
which shows that (13.3.22) holds for i = 1. Assume that (13.3.22) holds for i = l −1,
where l = 1, 2, . . .. Then, for i = l, we have
13.3 Design and Implementation of Optimal Tracking Control 553
According to (13.3.21), we can obtain (13.3.22) which proves the conclusion for
i = 0, 1, . . .. This completes the proof of the theorem.
Theorem 13.3.2 Suppose Theorem 13.3.1 holds. If for 0 < γ < ∞, the inequality
γ+1
1≤τ < (13.3.23)
γ
holds, then as i → ∞, the iterative value function V̂i (ek ) in the iterative ADP
algorithm (13.3.14)–(13.3.18) is uniformly convergent to a bounded neighborhood
of the optimal cost function J ∗ (ek ), i.e.,
γ(τ − 1)
lim V̂i (ek ) = V̂∞ (ek ) ≤ τ 1 + J ∗ (ek ). (13.3.24)
i→∞ 1 − γ(τ − 1)
According to (13.3.25) and (13.3.26), we can obtain (13.3.24). This completes the
proof of the theorem.
Corollary 13.3.1 Suppose Theorem 13.3.1 holds. If for 0 < γ < ∞ and 1 ≤ τ <
∞, the inequality (13.3.23) holds, then the iterative control laws ωi (ek ) and v̂i (ek )
of the iterative ADP algorithm (13.3.14)–(13.3.18) are convergent, i.e.,
⎧
⎨ ω∞ (ek ) = i→∞
lim ωi (ek ),
In this section, neural networks, including action network and critic network, are
used to implement the present iterative ADP algorithm. Both the neural networks
are chosen as three-layer BP networks. The whole structural diagram is shown in
Fig. 13.2.
For all i = 1, 2, . . ., the critic network is used to approximate the value function
Vi (ek ) in (13.3.19). The output of the critic network is denoted by
j jT
V̂i (ek ) = Wci σc (ek )
for j = 0, 1, . . .. Let Wci0 be random weight matrices. Let σc (ek ) = σ̄(Yc ek ) where
Yc is an arbitrary matrix. Then, σc (ek ) is upper bounded, i.e., σc (ek ) ≤ σ̄c for a
positive constant σ̄c > 0. The target function can be written as
Critic J (ek )
-
Network
ˆ
k
uˆdk uk
u uek uˆk uk
ek + + Model xk 1 ek 1 Critic J (ek 1 )
Action + - +
xk Network Network
Network +
U (ek , uek , wk )
w
Control
Module wk
j 1 j 2
E cik = e .
2 cik
The gradient-based weight update rule [15] can be applied here to train the critic
network
i( j+1) ij ij
Wck = Wck + ΔWck ,
j j
ij ∂ E cik ∂ V̂i (ek )
= Wck − lc j ij
∂ V̂i (ek ) ∂Wck
ij j
= Wck − lc ecik σc (ek ), (13.3.27)
where lc > 0 is the learning rate of critic network. If the training precision is achieved,
then Vi (ek ) can be approximated by the critic network.
The action network is used to approximate the iterative control law vi (ek ), where
vi (ek ) is defined by (13.3.20). The output can be formulated as
j
v̂i (ek ) = Wai jTσa (ek ).
Let σa (ek ) = σ̄(Ya ek ) where Ya is an arbitrary matrix. Then, σa (ek ) is upper bounded,
i.e., σa (ek ) ≤ σ̄a for a positive constant σ̄a > 0. So, we can define the output error
of the action network as
j j
eaik = vi (ek ) − v̂i (ek ).
The weights in the action network are updated to minimize the following performance
error measure:
j 1 jT j
E aik = eaik eaik .
2
The weight updating algorithm is similar to the one for the critic network. By the
gradient descent rule, we can obtain
i( j+1) ij ij
Wak = Wak + ΔWak ,
j j j
ij ∂ E aik ∂eaik ∂ v̂ik
= Wak − la j j ij
∂eaik ∂ v̂ik ∂Wak
ij j
= Wak − la eaik σa (ek ), (13.3.28)
respectively, where εcik and εaik are reconstruction errors. Let the critic and action
networks be trained by (13.3.27) and (13.3.28), respectively. Let W̃c = Wc − Wc∗i
ij ij
ij ij ∗i
and W̃a = Wa − Wa . If for j = 1, 2, . . ., there exist 0 < λc < 1 and 0 < λa < 1
such that
j j 2 jT jT j
φcik εcik ≤ λc φcik , φaik εaik ≤ λa φaik φaik , (13.3.29)
j i jT j i jT
where φcik = W̃ck σc (ek ) and φaik = W̃ak σa (ek ), then the error matrices W̃ck
i
and
W̃ak both converge to zero, as j → ∞.
i
j j j j
Let Ṽi (ek ) = Vi (ek ) − V̂i (ek ) and ṽi (ek ) = vi (ek ) − v̂i (ek ). Then, the difference
of the Lyapunov function candidate (13.3.30) is given by
1 1
ΔL W̃ci j , W̃ai j = tr W̃ci jT W̃ci j + W̃ai jT W̃ai j − tr W̃ci jT W̃ci j + W̃ai jT W̃ai j
lc la
i jT j i jT j j
= − 2Wck σc (ek )Ṽi+1 (ek ) − 2Wak σa (ek )ṽi (ek ) + lc Ṽi+1 (ek )σcT(ek )
j jT j
× σc (ek )Ṽi+1 (ek ) + lΘ ṽi (ek )σaT(ek )σa (ek )ṽi (ek )
j j jT j jT
≤ − 2 (φcik )2 − φcik εcik − 2 φaik φaik − φaik εaik
j 2 j
+ lc σc 2 φcik − εcik + la σa 2 φaik − εaik 2 . (13.3.31)
13.3 Design and Implementation of Optimal Tracking Control 557
As εcik and εaik are both finite, there exist χc > 0 and χa > 0 such that
j 2 jT j
ε2cik ≤ χc φcik , εaik
T
εaik ≤ χa φaik φaik . (13.3.32)
1 − λc
lc < ,
σc2 (1 + χc )
1 − λa
lu < 2 ,
σa (1 + χa )
ij ij
we can obtain ΔL W̃c , W̃a < 0. The proof is complete.
In this section, numerical experiments are studied to show the effectiveness of our
iterative ADP algorithm. Let the coal gasification control system be expressed as
in (13.2.6). We let the initial reaction temperature in the gasifier be x0 = 1000 ◦ C.
Observe the corresponding system input and output data (kg/h) which are
Let the desired reaction temperature η = 1320 ◦ C. To model the coal gasification
control system (13.2.6), we collect 20, 000 temperature data from the real-world
coal gasification system. The corresponding 20, 000 system input data and 20, 000
system output data are also recorded. Then, a three-layer BP NN is established with
the structure of 8–20–1 to approximate the state equation in (13.2.6) and the NN is
the model network. The control input is expressed by (13.2.4). We also use three-
layer BP NN with structure of 8–20–5 to approximate the input–output equation in
(13.2.6) and the NN is the input–output network. Let the learning rates of the model
network and input–output network be lm = 0.002. Use the gradient-based weight
update rule [15] to train the neural networks for 20, 000 iteration steps to reach the
training precision of 10−6 . The converged weights, respectively, are given by
and ⎡
0.1062 1.0652 0.2797 −0.0631 −0.1974
⎢ −0.0843 1.6798 −0.9755 0.0069 0.5187
⎢
Ŵm2 =⎢
⎢ 0.9926 0.3194 −0.3022 −0.0434 0.0509
⎣ −0.7054 1.4218 0.3699 0.0479 0.0666
0.0301 4.9841 0.0175 0.1136 −0.0059
Next, we adopt three-layer BP NNs to identify the coal quality equation (13.2.13)
and the reference control equation (13.2.16). The structure of Θ network and u f
network is chosen as 10–20–4 and 6–20–3, respectively. Using the gradient-based
weight update rule, train the two neural networks for 20,000 iteration steps under the
learning rate 0.002 to reach the training precision of 10−6 . The converged weights
are given by
⎡
0.003 0.2629 −0.0030 0.0010 −0.0868
⎢ 0.001 0.4992 −0.0001 0.0001 0.0020
ŴΘ = ⎢
⎣ 0.001 −0.0391 0.0010 −0.0001 −0.0285
−0.004 0.2651 0.0020 −0.0010 0.1151
and
⎡
0.0108 −0.0011 −0.0013 0.0464 −0.0515
Ŵu = ⎣ −0.0389 −0.0038 −0.0126 −0.1154 −0.0345
0.0990 0.0010 −0.0011 0.5440 0.1863
Taking the current system data x0 , u 0 , and y0 into Θ network, we can obtain the
coal quality as
Θ̂k = [0.6789, 0.0373, 0.1149, 0.1689]T .
Taking the desired state η = 1320 and the coal quality Θ̂k into u f network, we can
obtain the desired control input expressed by û dk = [61408.74, 44430.69, 51200]T .
According to the weights of model network, Θ network, and u f network, we can
easily obtain the system disturbance w ≤ 26.44.
Next, the present iterative ADP algorithm is established to obtain the optimal
tracking control law. Let the cost function be defined as in (13.3.12), where A = 1
and C = 0.5, and B id the identity matrix with suitable dimensions. The critic and
action networks are both chosen as three-layer BP neural networks with the structures
of 1–8–3 and 1–8–1, respectively. For each iteration, the critic and action networks
are trained for 1000 steps using the learning rate of lc = la = 0.01 so that the neural
network training errors become less than 10−6 . Let the iteration index i = 100. The
converged weights of the critic network and the action network are expressed as
and ⎡
−0.0625 0.0746 −0.0122 −0.0717
Wa = ⎣ 0.0497 −0.2990 −0.1878 0.0750
0.0265 0.5098 0.3543 0.0404
⎤
−0.0263 −0.0236 −0.0311 0.0031
−0.1385 0.0146 0.1016 0.0823 ⎦ ,
0.2876 0.1056 −0.0604 −0.0862
1
Iterative value function
0.8
0.6
0.4
0.2
0
0 10 20 30 40 50 60 70 80 90 100
Iteration steps
1300
1250
State (oC)
1200
1150
1100
1050
1000
0 10 20 30 40 50 60 70 80 90 100
Time steps
13.4 Numerical Analysis 561
4 4
x 10 x 10
6.14 4.7
1
2
6.12 4.6
6.1 4.5
6.08 4.4
0 50 100 0 50 100
Time steps Time steps
(a) (b)
4 4
x 10 x 10
System output of y (CO, Kg/h)
5.2
Control input of u3 (O2, Kg/h)
7.9
5 7.8
1
7.7
4.8
7.6
4.6
7.5
4.4 7.4
0 50 100 0 50 100
Time steps Time steps
(c) (d)
Fig. 13.5 Trajectories of control inputs and system output. a Coal input trajectory. b H2 O input
trajectory. c O2 input trajectory. d CO output trajectory
4
x 10
System output of y2 (CO2, Kg/h)
4200
3.4
4000
3.2 3800
3600
3
3400
2.8
0 20 40 60 80 100 0 20 40 60 80 100
Time steps Time steps
(a) (b)
4 4
x 10 x 10
3.6
1.036
3.4
4
1.034
3.2
1.032
3 1.03
0 20 40 60 80 100 0 20 40 60 80 100
Time steps Time steps
(c) (d)
Fig. 13.6 Trajectories of system outputs. a CO2 output trajectory. b H2 output trajectory. c H2 O
output trajectory. d Char output trajectory
1
Iterative value function
0.8
0.6
0.4
0.2
0
0 20 40 60 80 100 120 140 160 180 200
Iteration steps
13.4 Numerical Analysis 563
1350
1300
1250
State (oC)
1200
1150
1100
1050
1000
0 10 20 30 40 50 60 70 80 90 100
Time steps
4 4
Control input of u (coal, Kg/h)
x 10 x 10
6.15 4.9
4.8
6.14
4.7
1
6.13
4.6
6.12
4.5
6.11 4.4
0 50 100 0 50 100
Time steps Time steps
(a) (b)
4 4
x 10 x 10
5.2
System output of y (CO, Kg/h)
Control input of u (O , Kg/h)
7.9
5
2
7.8
3
4.8
7.7
4.6 7.6
4.4 7.5
0 50 100 0 50 100
Time steps Time steps
(c) (d)
Fig. 13.9 Trajectories of control inputs and system output. a Coal input trajectory. b H2 O input
trajectory. c O2 input trajectory. d CO output trajectory
564 13 Adaptive Dynamic Programming for Optimal Control …
4
System output y (CO2, Kg/h) x 10
3.6 4500
3
2
3.2
3500
3
2.8 3000
0 50 100 0 50 100
Time steps Time steps
(a) (b)
4 4
x 10 x 10
System output y4 (H2O, Kg/h)
3.4
1.036
3.2
1.034
3
2.8 1.032
0 50 100 0 50 100
Time steps Time steps
(c) (d)
Fig. 13.10 Trajectories of system outputs. a CO2 output trajectory. b H2 output trajectory. c H2 O
output trajectory. d Char output trajectory
T f = 100 time steps and obtain the following results. The optimal state trajec-
tory is shown in Fig. 13.8. The corresponding control trajectories and system output
trajectories are shown in Figs. 13.9 and 13.10, respectively.
As is known, for the real-world coal gasification, the flow fluctuation of the control
inputs is important and cannot be ignored. We will display the control system per-
formance under the control disturbance. Let Δu k be a zero-expectation white noise
of control input, with |Δu 1k | ≤ 100, |Δu 2k | ≤ 40, |Δu 3k | ≤ 40. The disturbance
trajectories of the control input are displayed in Figs. 13.11a, b and c, respectively.
Let the training precisions of model network, Θ network, and u f network be 10−3
and the training precisions of critic and action networks are kept at 10−6 . The conver-
gence trajectory of the iterative value function is shown in Fig. 13.12. The optimal
state trajectory is shown in Fig. 13.13. The corresponding control trajectories and
system output trajectories are shown in Figs. 13.14 and 13.15, respectively. From the
numerical results, we can see that under the disturbance of the control input, we can
also obtain the optimal tracking control of the system which shows the effectiveness
13.4 Numerical Analysis 565
50 20
2
0 0
−50 −20
−100 −40
0 50 100 0 50 100
Time steps Time steps
(a) (b)
−3
x 10
40 1.5
1
20
0.5
0
0
−20
−0.5
−40 −1
0 50 100 0 50 100
Time steps Time steps
(c) (d)
Fig. 13.11 Control disturbance and the input–output mass error. a Control disturbance Δu 1k .
b Control disturbance Δu 2k . c Control disturbance Δu 3k . d The error between the input and output
mass
0.5
0
0 20 40 60 80 100 120 140 160 180 200
Iteration steps
566 13 Adaptive Dynamic Programming for Optimal Control …
1350
1300
1250
State (oC)
1200
1150
1100
1050
1000
0 10 20 30 40 50 60 70 80 90 100
Time steps
4 4
Control input of u (H2O, Kg/h )
Control input of u1 (coal, Kg/h )
x 10 x 10
6.2 5.2
6.15
5
6.1
2
6.05 4.8
6
4.6
5.95
5.9 4.4
0 20 40 60 80 100 0 20 40 60 80 100
Time steps Time steps
(a) (b)
4 4
x 10 x 10
Control input of u (O2, Kg/h )
5.5 8
5 7.8
3
4.5 7.6
4 7.4
3.5 7.2
0 20 40 60 80 100 0 20 40 60 80 100
Time steps Time steps
(c) (d)
Fig. 13.14 Trajectories of control inputs and system output. a Coal input trajectory. b H2 O input
trajectory. c O2 input trajectory. d CO output trajectory
13.4 Numerical Analysis 567
3.2
4500
3.1
4000
3
3500
2.9
2.8 3000
0 20 40 60 80 100 0 20 40 60 80 100
Time steps Time steps
(a) (b)
System output of y4 (H2O, Kg/h)
4 4
3.6 1.04
3.4 1.03
3.2 1.02
3 1.01
2.8 1
0 20 40 60 80 100 0 20 40 60 80 100
Time steps Time steps
(c) (d)
Fig. 13.15 Trajectories of the system outputs. a CO2 output trajectory. b H2 output trajectory.
c H2 O output trajectory. d Char output trajectory
and robustness of the present iterative ADP method. To verify the correctness of the
model and the present method, the mass errors between the input and output are
given in Fig. 13.11d.
From the numerical results, we can see that when the system errors and the distur-
bance of the control input are enlarged, the iterative ADP algorithm is still effective
to find the optimal tracking control scheme for the system. On the other hand, if we
enlarge the iteration errors, the control property is quite different. Let the disturbance
of the control Δu k = 0. Let the training precisions for model network, Θ network,
and u f network be kept at 10−3 . We change the training precisions of critic and
action networks to 10−3 . Let the iteration index i = 100. The convergence trajectory
of the iterative value function is shown in Fig. 13.16a, where we can see that the iter-
ative value function is not convergent any more. The corresponding state trajectory
is shown in Fig. 13.16b, where we notice that the desired state is not achieved.
568 13 Adaptive Dynamic Programming for Optimal Control …
1400
1200
Iterative value function
1000
800
600
400
200
0
0 10 20 30 40 50 60 70 80 90 100
Iteration steps
(a)
1500
1400
State (oC)
1300
1200
1100
1000
0 10 20 30 40 50 60 70 80 90 100
Time steps
(b)
Fig. 13.16 Simulation results. a The trajectory of iterative value function. b The trajectory of state
13.5 Conclusions
In this chapter, an effective iterative ADP algorithm is established to solve the optimal
tracking control problem for coal gasification systems. Using the input-state-output
data of the system, NNs are used to approximate the system model, the coal qual-
ity, and the reference control, respectively, and the mathematical model of the coal
gasification is unnecessary. Considering the system errors of NNs and the control
disturbance, the optimal tracking control problem is transformed into a two-person
zero-sum optimal regulation control problem. Iterative ADP algorithm is then estab-
lished to obtain the optimal control law where the approximation errors in each
iteration are considered. Convergence analysis is given to guarantee that the iterative
value functions are convergent to a finite neighborhood of the optimal cost function.
References 569
References
1. Abani N, Ghoniem AF (2013) Large eddy simulations of coal gasification in an entrained flow
gasifier. Fuel 104:664–680
2. Basar T, Bernard P (1995) H∞ Optimal control and related minimax design problems.
Birkhauser, Boston
3. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
4. Chen Y, Li Z, Zhou M (2014) Optimal supervisory control of flexible manufacturing systems
by petri nets: a set classification approach. IEEE Trans Autom Sci Eng 11(2):549–563
5. Gopalsami N, Raptis AC (1984) Acoustic velocity and attenuation measurements in thin rods
with application to temperature profiling in coal gasification systems. IEEE Trans Sonics Ultra-
son 31(1):32–39
6. Guo R, Cheng G, Wang Y (2006) Texaco coal gasification quality prediction by neural estimator
based on dynamic PCA. In: Proceedings of the IEEE international conference on mechatronics
and automation, pp 2241–2246
7. Jia QS (2011) An adaptive sampling algorithm for simulation-based optimization with descrip-
tive complexity preference. IEEE Trans Autom Sci Eng 8(4):720–731
8. Jin X, Hu SJ, Ni J, Xiao G (2013) Assembly strategies for remanufacturing systems with
variable quality returns. IEEE Trans Autom Sci Eng 10(1):76–85
9. Kang Q, Zhou M, An J, Wu Q (2013) Swarm intelligence approaches to optimal power flow
problem with distributed generator failures in power networks. IEEE Trans Autom Sci Eng
10(2):343–353
10. Kostur K, Kacur J (2012) Developing of optimal control system for UCG. In: Proceedings of
the international carpathian control conference, pp 347–352
11. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
12. Matveev IB, Messerle VE, Ustimenko AB (2009) Investigation of plasma-aided bituminous
coal gasification. IEEE Trans Plasma Sci 37(4):580–585
13. Ruprecht P, Schafer W, Wallace P (1988) A computer model of entrained coal gasification.
Fuel 67(6):739–742
14. Serbin SI, Matveev IB (2010) Theoretical investigations of the working processes in a plasma
coal gasification system. IEEE Transactions on Plasma Science 12(38):3300–3305
15. Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
16. Wei Q, Liu D (2014) Adaptive dynamic programming for optimal tracking control of unknown
nonlinear systems with application to coal gasification. IEEE Trans Autom Sci Eng 11(4):1020–
1036
17. Wigstrom O, Lennartson B, Vergnano A, Breitholtz C (2013) High-level scheduling of energy
optimal trajectories. IEEE Trans Autom Sci Eng 10(1):57–64
18. Wilson JA, Chew M, Jones WE (2006) State estimation-based control of a coal gasifier. IEE
Proc-Control Theory Appl 153(3):268–276
19. Xu J, Qiao L, Gore J (2013) Multiphysics well-stirred reactor modeling of coal gasification
under intense thermal radiation. Int J Hydrog Energy 38(17):7007–7015
20. Yang Q, Jagannathan S (2012) Reinforcement learning controller design for affine nonlinear
discrete-time systems using online approximators. IEEE Trans Syst, Man, Cybern-Part B:
Cybern 42(2):377–390
Chapter 14
Data-Based Neuro-Optimal Temperature
Control of Water Gas Shift Reaction
14.1 Introduction
Water gas shift (WGS) reactor is an essential component of the coal-based chemical
industry [11]. The WGS reactor combines carbon monoxide (CO) and water (H2 O)
in the reactant stream to produce carbon dioxide (CO2 ) and hydrogen (H2 ). Proper
regulation of the operating temperature is critical to achieving adequate CO conver-
sion during transients [3]. Hence, optimal control of the reaction temperate is a key
problem for WGS reaction process. To describe the dynamics of the WGS reaction
process, many discussions focused on WGS modeling approaches [6, 16]. Unfortu-
nately, the established WGS models are generally complex with high nonlinearities.
Thus, the traditional linearized control method [28, 31, 32] is only effective in the
neighborhood of the equilibrium point. When the required operating range is large,
the nonlinearities in the system cannot be properly compensated by using a linear
model. Therefore, it is necessary to study optimal control approaches for the original
nonlinear system [3, 11]. Although optimal control of nonlinear systems has been
the focus of control field in the last several decades [1, 2, 4, 5, 9, 18, 19, 22, 29],
the optimal controller design for WGS reaction systems (WGS systems in brief) is
still challenging, due to the complexity of the WGS reaction process.
Iterative ADP method has played an important role as an effective way to solve
Bellman equation indirectly and received lots of attentions [12, 14, 23, 24, 27, 30,
33]. For most previous ADP algorithms, it requires that the system model, the iterative
control, and the cost function can accurately be approximated which guarantees
the convergence property of the proposed algorithms. In real-world implementation
of ADP, e.g., for WGS systems, the reconstruction errors by approximators and
the disturbances of system states and controls inherently exist. Thus, the system
models, iterative control laws, and cost functions are impossible to obtain accurately.
Although in [8, 21], ADP was explored to design optimal temperature controller
of the WGS system, the effects of approximation errors and disturbances were not
considered. Furthermore, the convergence and stability properties were not discussed.
In this chapter, a stable iterative ADP algorithm is developed to obtain the optimal
control law for the WGS system, such that the temperature of WGS system tracks
the desired temperature [26].
The WGS reaction inputs the water gas, which includes CO, CO2 , H2 , and H2 O, into
the WGS reactor. The WGS reaction, which is slightly exothermic, converts CO to
CO2 and H2 as shown in the following equation
where the rate is in (kmol/m3/s). The catalyst density is ρcat = 1.8 × 10−4 kg/m3 .
T is the current reaction temperature in K (Kelvin). The rate constant is kr = 1.32 ×
109 kmol/kg/s. The reaction
equilibrium coefficient
K T is given as in [17], which is
4577.7 K
expressed by K T = exp − 4.33 .
T
For the WGS reaction (14.2.1), we can see that the reaction temperature is the
key parameter [3, 11].
Let u k denote the control input representing the flow of water gas (m3/s). Let
where θCO , θCO2 , θH2 , and θH2 O denote the given percentage compositions of CO,
CO2 , H2 , and H2 O. Generally speaking, the water gas of WGS systems comes from
the previous reaction process, such as coal gasification [16]. This means that the
composition ratios of the mixed gas are uncontrollable for the WGS systems and
the amount of water gas flow is the only one to be controlled. Let xk denote the
temperature of the WGS reactor. The WGS system can be expressed as
where F(·) is an unknown system function. Let xk ∈ R and u k ∈ R. Let the desired
state be τ . Then, our goal is to design an optimal state feedback tracking control law
u ∗k = u ∗ (xk ), such that the system state tracks the desired state trajectory.
14.2 System Description and Data-Based Modeling 573
Z = F̂N (X, Y f , W f ) = W Tf σ (Y f X ),
where z k = [ xk , u k ]T denotes the NN input and Wm∗ ∈ R L m ×1 denotes the ideal weight
matrix of model NN, where L m is the number of hidden layer neurons. Let σm (z k ) =
σ (Ym z k ), where Ym is an arbitrary weight matrix with a suitable dimension. Let
σm (·) ≤ σ̄ M for a constant σ̄ M > 0, and let εm1k be the bounded NN reconstruction
error such that |εm1k | ≤ ε̄m1 for a constant ε̄m1 > 0. To train the model NN, it requires
an array of WGS system and control data, such as the data from a period of time.
The NN model for the system is constructed as
x̂k+1 = Ŵmk
T
σm (z k ), (14.2.3)
where x̂k is the estimated system state vector. Let Ŵmk be the estimation of the ideal
weight matrix Wm∗ . Then, we define the system identification error as
1 T 1 2
E mk = x̃k+1 x̃k+1 = x̃k+1 .
2 2
574 14 Data-Based Neuro-Optimal Temperature Control …
By a gradient descent adaptation rule [20, 25], the weights are updated as
1 T
L(x̃k , W̃mk ) = x̃k2 + tr W̃mk W̃mk .
lm
≤−(1−2lm σ M2 )φmk
2
−(1−λm (1+2lm σ M2 ))x̃k2 .
u f k = Wu∗Tσu (z uk ) + εu1k ,
where z uk = [xk , xk+1 ]T and εu1k is the NN reconstruction error such that |εu1k | ≤ ε̄u1
for a constant ε̄u1 > 0. Let σu (z uk ) = σ (Yu z uk ), where Yu is an arbitrary weight
matrix with a suitable dimension.
The NN reference control is constructed as
where û f k is the estimated reference control, and Ŵuk is the estimated weight matrix.
Define the identification error as
14.2 System Description and Data-Based Modeling 575
ũ f k = û f k − u f k = φuk − εu1k ,
1 2
Euk = ũ .
2 fk
By gradient-based adaptation rule, the weight is updated as
In this section, a stable iterative ADP algorithm will be employed to obtain the
optimal control law such that the temperature of WGS system tracks the desired one
with convergence and stability analysis.
For WGS system (14.2.2), if we let the desired state be τ , then we can define the
tracking error ek = xk − τ . Let u dk be the corresponding desired reference control
(desired control in brief) for the desired state τ . As the system function is unknown,
the desired control u dk cannot directly be obtained by the WGS system (14.2.2).
On the other hand, in the real-world WGS systems, the disturbances of the system
and control input are both unavoidable. Thus, the system transformation method with
accurate system model [34] is difficult to implement. To overcome these difficulties, a
system transformation with NN reconstruction errors and disturbances is developed.
First, according to the desired state τ , we can obtain u dk = Fu (τ, τ ).
Let
û dk = F̂u (τ, τ ) = Ŵuk
T
σu (τ, τ )
u ek = u k − û dk − εuk ,
where εuk = εu1k + εu2k . As εu1k and εu2k are bounded, there exits a constant ε̄u > 0
such that |εuk | ≤ ε̄u . On the other hand, let F̂(z k ) = Ŵmk
T
σm (z k ) be the model NN
function. Let εm2k be an unknown bounded system disturbance such that |εm2k | ≤ ε̄m2
for a constant ε̄m2 > 0. Then, the tracking error system ek+1 can be defined as
ek+1 = F̄(ek , u ek )
= F̂((ek + τ ), (u ek + û dk )) − τ + ∇ F̂(ξu )εu + εmk , (14.3.1)
where
∂ F̂((ek + τ ), ξu )
∇ F̂(ξu ) = ,
∂ξu
ξu = cu (u ek + û dk ) + (1 − cu )(u ek + û dk + εuk ),
and 0 ≤ cu ≤ 1. Let εmk = εm1k + εm2k . We have |εmk | ≤ ε̄m for a constant ε̄m > 0.
Let the NN tracking error êk+1 be expressed as
êk+1 = Fe (ek , û ek )
= F̂((ek + τ ), (u ek + û dk )) − τ. (14.3.2)
as the system error such that |εek | ≤ ε̄e for a constant ε̄e > 0.
In this section, our goal is to design an optimal control scheme such that the tracking
error ek converges to zero. Let
be the utility function, where Q and R are positive constants. Define the cost function
as
∞
J (e0 , u e0 ) = U (ek , u ek ),
k=0
14.3 Design of Neuro-Optimal Temperature Controller 577
where we let u ek = (u ek , u e,k+1 , . . .). The optimal cost function can be defined as
J ∗ (ek ) = inf J (ek , u ek ) .
u ek
Let the initial value function V̂0 (ek ) = P(ek ). The control law v̂0 (ek ) is obtained by
V̂i (ek ) = U (ek , v̂i−1 (ek ))+ V̂i−1 (êk+1 ) + πi (ek ), (14.3.7)
and
law v̂i (ek ) is used to approximate u ∗ (ek ). Therefore, when i → ∞, the algorithm
should be convergent, i.e., V̂i (ek ) and v̂i (ek ) converge to the optimal ones. In the next
section, we will show the properties of the present iterative ADP algorithm.
From the iterative ADP algorithm (14.3.6)–(14.3.8), as the existence of system errors,
iteration errors, and disturbances, the convergence analysis methods for the accurate
ADP algorithms are no longer valid. In this chapter, inspired by [13, 14], an “error
bound”-based convergence and stability analysis will be developed. First, we define
a new value function
holds uniformly.
then
v̄i (ek ) = v̂i (ek ) − ρi (ek ).
V̂i (ek ) = U (ek , (v̄i−1 (ek ) + ρi−1 (ek )) + πi (ek ) + Vi−1 (Fe (ek , (v̄i−1 (ek ) + ρi−1 (ek )))).
Let
∂U (ek , ξ ) ∂ V̂i (Fe (ek , ξ ))
∇U (ξ ) = and ∇Vi (ξ ) = .
∂ξ ∂ξ
ξV i = cV i V̂i (Fe (ek , v̄i (ek ))) + (1 − cV i )V̂i (Fe (ek , v̂i (ek ))),
and
ξ V i = cV i V̂i (Fe (ek , û e (ek ))) + (1 − cV i )V̂i (Fe (ek , u e (ek ))).
V̂i (ek ) = U (ek , (v̄i−1 (ek ) + ρi−1 (ek ))) + πi (ek ) + Vi−1 (Fe (ek , (v̄i−1 (ek ) + ρi−1 (ek ))))
= min{U (ek , u ek ) + ∇U (ξU i )εuk + Vi−1 (ek+1 ) + ∇Vi (ξV i )εek } + πi (ek )
u ek
As ∇U (ξU i ), ∇Vi (ξV i ), ∇U (ξU i ), and ∇Vi (ξV i ) are upper bounded, if we let
|∇U (ξU i )εuk | ≤ ε̄U i , |∇Vi (ξV i )εek | ≤ ε̄V i , |∇U (ξU i )ρi (ek )| ≤ εU i , |∇Vi (ξV i )ρi (ek )|
≤ εV i , and |πi (ek )| ≤ επi for constants ε̄U i , ε̄V i , εU i , and εV i , then
J ∗ ( F̄(ek , u ek )) ≤ γ U (ek , u ek )
V0 (ek ) ≤ δ J ∗ (ek )
δ−1
σ ≤1+ , (14.3.13)
γδ
then the iterative value function V̂i (ek ) converges to a finite neighborhood of the
optimal cost function J ∗ (ek ).
Proof The theorem can be proven in two steps. First, using mathematical induction,
we will prove that, for i = 0, 1, . . ., the iterative value function V̂i (ek ) satisfies
⎛ ⎞
i
γ j σ j−1 (σ − 1) γ i σ i (δ − 1) ⎠ ∗
V̂i (ek ) ≤ σ ⎝1 + + J (ek ). (14.3.14)
j=1
(γ + 1) j (γ + 1)i
580 14 Data-Based Neuro-Optimal Temperature Control …
Let i = 0. Then, (14.3.14) becomes V̂0 (ek ) ≤ σ δ J ∗ (ek ). We have the conclusion
holds for i = 0. Assume that (14.3.14) holds for i = l − 1, l = 1, 2, . . .. Then, for
i = l, we have
⎧⎛
⎨ l−1
γ j−1 σ j−1 (σ − 1)
Γl (ek ) ≤ min ⎝1 + γ
u ek ⎩ (γ + 1) j
j=1
γ l−1 σ l−1 (σ δ − 1)
+ U (ek , u ek )
(γ + 1)l
⎡ ⎛ ⎞
l
γ σ
j j−1
(σ − 1) γ σ
l l
(δ − 1)
+ ⎣σ ⎝1 + + ⎠
j=1
(γ + 1) j (γ + 1)l
⎛ ⎞⎤
l−1
γ σ
j−1 j−1
(σ − 1)s γ σ
l−1 l−1
(σ δ − 1)
−⎝ + ⎠⎦
j=1
(γ + 1) j (γ + 1)l
⎫
⎬
×J ∗ ( F̄(ek , u ek ))
⎭
⎛ ⎞
l
γ σ
j j−1
(σ − 1) γ σ
l l
(δ − 1)
= ⎝1+ + ⎠ J ∗ (ek ),
j=1
(γ + 1) j
(γ + 1) l
(14.3.14), the iterative value function V̂i (ek ) is convergent to a finite neighborhood
of the optimal cost function J ∗ (ek ). This completes the proof of the theorem.
Define a new iterative value function as V̄i (ek ) = χi J ∗ (ek ), where χi is defined
as in (14.3.15). According to (14.3.13), we can get χi+1 − χi ≤ 0, which means
V̄i+1 (ek ) ≤ V̄i (ek ). Let
14.3 Design of Neuro-Optimal Temperature Controller 581
ξV i = cV̄ i V̄i (Fe (ek , v̄i (ek ))) + (1 − cV̄ i )V̄i (Fe (ek , v̂i (ek ))),
for 0 ≤ cV̄ i ≤ 1. Let |∇(ξV̄ i )εe | ≤ εV̄ i for a constant εV̄ i , and we can get
Define a new state error set Ωe = {ek : U (ek , v̂i (ek )) ≤ εV̄ i }. As U (ek , v̂i (ek )) is a
positive-definite function, |ek | is finite for ek ∈ Ωe . We define
eM = sup_{x ∈ Ωe} {|x|}.
Define two scalar functions α(|ek |) and β(|ek |) which satisfy the following two
conditions.
(1) If |ek | ≤ e M , then
(2) If |ek | > e M , then α(|ek |) and β(|ek |) are both monotonically increasing functions
and satisfy
For an arbitrary constant ς > eM, there exists an ℓ(ς) > eM such that β(ℓ) ≤ α(ς).
For T = 1, 2, . . ., if |ek| > eM and |ek+T| > eM, then V̄i(ek+T) − V̄i(ek) ≤ 0.
Hence, for all |ek| > eM satisfying eM < |ek| ≤ β(ℓ), there exists a T > 0 such that ς > |ek+T|. Therefore, for all |ek| > eM, there exists a T = 1, 2, . . .
such that |ek+T| ≤ ς. As ς is arbitrary, let ς → eM. Then, we can obtain |ek+T| ≤ eM.
According to the definition in [10], ek is UUB.
Next, for V̂i(ek) ≤ V̄i(ek), there exist time instants T0 and T1 such that
for all |ek| > eM, |ek+T0| > eM, and |ek+T1| > eM. Choose ς1 > 0 to satisfy V̂i(ek) ≥
α(ς1) ≥ V̄i(ek+T1). Then, there exists an ℓ1(ς1) > 0 such that
According to (14.3.18) and the definition of α(|ek |) and β(|ek |) in (14.3.16) and
(14.3.17), we have
α(ς) ≥ β(ℓ) ≥ V̂i(ek) ≥ α(ς1) ≥ β(ℓ1) ≥ V̄i(ek+T1) ≥ α(|ek+T1|).
For an arbitrary constant ς > e M , we can obtain |ek+T1 | ≤ ς , which shows that v̂i (ek )
is a UUB control law for the tracking error system (14.3.1). This completes the proof
of the theorem.
Corollary 14.3.1 For i = 0, 1, . . ., let V̂i (ek ) and v̂i (ek ) be obtained by (14.3.7) and
(14.3.8), respectively. If
∀ek , then the iterative control law v̂i (ek ) is an asymptotically stable control law for
system (14.3.1).
In this section, neural networks, including the critic network and the action network, are used to implement the present stable iterative ADP algorithm. The whole structural diagram is shown in Fig. 14.1.
For all i = 0, 1, . . ., the critic network is used to approximate the value function in (14.3.8). Collect an array of tracking errors Ek = {ek^1, . . . , ek^p}, where p is a large integer. For j = 1, . . . , p, let the output of the critic network be V̂i^j(ek^j) = Wci^{jT} σc(ek^j), where σc(ek) = σ(Yc ek) and Yc is an arbitrary matrix with a suitable dimension.
Fig. 14.1 The structural diagram of the stable iterative ADP algorithm
Define the error function of the critic network as ϑci^j(ek^j) = Vi(ek^j) − V̂i^j(ek^j). The weights of the critic network are updated as [15, 25]

Wci^{(j+1)} = Wci^{j} − lc ϑci^j(ek^j) σc(ek^j),     (14.4.1)

where ‖σc(ek^j)‖ ≤ σC for a constant σC and lc > 0 is the learning rate of the critic network.
The action network is used to approximate the iterative control law v̄i(ek), where v̄i(ek) is defined by (14.3.11). The output can be formulated as v̂i^j(ek^j) = Wai^{jT} σa(ek^j), where σa(ek) = σ(Ya ek). Let Ya be an arbitrary matrix with a suitable dimension. According to Ek, we can define the output error of the action network as ϑai^j(ek^j) = v̂i^j(ek^j) − v̄i^j(ek^j), j = 1, 2, . . . , p. The weights of the action network are updated as

Wai^{(j+1)} = Wai^{j} − la σa(ek^j) ϑai^j(ek^j),     (14.4.2)

where ‖σa(ek^j)‖ ≤ σA for a constant σA and la > 0 is the learning rate of the action network. The weight convergence property of the neural networks is shown in the following theorem.
Theorem 14.4.1 For j = 1, 2, . . . , p, let the ideal critic and action network functions be expressed by

Vi(ek^j) = Wci^{*T} σc(ek^j) + εci(ek^j)

and

v̄i(ek^j) = Wai^{*T} σa(ek^j) + εai(ek^j),

respectively, where εci(ek^j) and εai(ek^j) are the reconstruction errors. The critic and action networks are trained by (14.4.1) and (14.4.2), respectively. Let W̃ci^j = Wci^j − Wci^* and W̃ai^j = Wai^j − Wai^*. For all i = 1, 2, . . ., if there exist constants 0 < λc < 1 and 0 < λa < 1 such that

φcik^j εci(ek^j) ≤ λc (φcik^j)^2  and  φaik^j εai(ek^j) ≤ λa (φaik^j)^2,

respectively, where φcik^j = W̃ci^{jT} σc(ek^j) and φaik^j = W̃ai^{jT} σa(ek^j), then the error matrices W̃ci^j and W̃ai^j converge to zero as j → ∞.
Proof Consider the Lyapunov function candidate

L(W̃ci^j, W̃ai^j) = (1/lc) tr{W̃ci^{jT} W̃ci^j} + (1/la) tr{W̃ai^{jT} W̃ai^j}.
If the learning rates are chosen such that

lc < 2(1 − λc)/(σC^2 (1 + χc))  and  la < 2(1 − λa)/(σA^2 (1 + χa)),

then we have ΔL(W̃ci^j, W̃ai^j) ≤ 0, j = 1, 2, . . . , p. Letting j → ∞, we can obtain the conclusion. This completes the proof of the theorem.
Based on the above analysis, the whole data-driven stable iterative ADP algorithm
for the WGS system can be summarized in Algorithm 14.4.1.
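As a rough illustration of the weight updates (14.4.1) and (14.4.2), the sketch below trains single-hidden-layer critic and action networks with fixed, randomly chosen input weights Yc and Ya over a batch of p tracking errors. The network sizes, learning rates, and the simple targets standing in for Vi(ek^j) and v̄i(ek^j) are assumptions made only for this example; in Algorithm 14.4.1 those targets come from the iterative ADP algorithm itself, and both output errors are taken here as network output minus target so that the descent steps reduce them.

```python
import numpy as np

# Sketch of one ADP iteration's critic/action weight updates, cf. (14.4.1)-(14.4.2).
rng = np.random.default_rng(0)
n_e, n_u, n_h, p = 1, 1, 8, 200            # error dim, control dim, hidden size, batch size
lc, la = 0.05, 0.05                        # learning rates of the critic and action networks

Yc = rng.standard_normal((n_h, n_e))       # fixed input weights of the critic network
Ya = rng.standard_normal((n_h, n_e))       # fixed input weights of the action network
Wc = np.zeros((n_h, 1))                    # critic output weights W_ci
Wa = np.zeros((n_h, n_u))                  # action output weights W_ai

E = rng.uniform(-1.0, 1.0, size=(p, n_e))  # batch of tracking errors {e_k^1, ..., e_k^p}
V_target = np.sum(E**2, axis=1, keepdims=True)  # stand-in for V_i(e_k^j)
v_target = -0.5 * E                             # stand-in for v_bar_i(e_k^j)
act = np.tanh                              # bounded activation function

for j in range(p):
    e = E[j:j + 1].T                       # column vector e_k^j
    sc, sa = act(Yc @ e), act(Ya @ e)      # sigma_c(e_k^j), sigma_a(e_k^j)
    V_hat = Wc.T @ sc                      # critic output W_ci^T sigma_c(e_k^j)
    v_hat = Wa.T @ sa                      # action output W_ai^T sigma_a(e_k^j)
    theta_c = V_hat - V_target[j]          # critic output error (output minus target)
    theta_a = v_hat - v_target[j:j + 1].T  # action output error (output minus target)
    Wc = Wc - lc * theta_c * sc            # gradient step of the form (14.4.1)
    Wa = Wa - la * sa @ theta_a.T          # gradient step of the form (14.4.2)
```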
Remark 14.4.1 One property should be pointed out. For all i = 1, 2, . . ., if we define
the approximation error function εi (ek ) as
From (14.4.3), we can see that if |ek| is large, then the present iterative ADP algorithm permits convergence under large approximation errors, whereas if |ek| is small, then small approximation errors are required to ensure the convergence of the iterative ADP algorithm. Owing to the existence of the approximation errors and disturbances, the convergence criterion (14.4.3) cannot generally be satisfied for every ek. Define a new tracking error set

Θe = { ek : εi(ek) > V̂i(ek)(δ − 1)/(γδ + δ − 1) }.
As εi(ek) ≤ ε is finite, if we define Υ = sup_{ek ∈ Θe} {|ek|}, then Υ is finite. Thus, for all ek ∈ Θ̄e, where Θ̄e = R^n \ Θe, we can conclude that V̂i(ek) is convergent.
Fig. 14.2 Disturbances and iterative value function. a System disturbance. b Control disturbance.
c Iterative value function under ε̄ = 10−4 . d Iterative value function under ε̄ = 10−6 .
The iterative trajectories of the system states and controls under ε̄ = 10−4 and ε̄ = 10−6 are shown in Fig. 14.3a–d, respectively. From the numerical results, we can see that by the stable iterative ADP algorithm, the iterative control law can guarantee the tracking error system to be UUB, which shows the robustness of the present algorithm. Moreover, we can see that if we enhance the training precision of the NNs, for example by reducing ε̄ from 10−4 to 10−6, then the approximation errors can be reduced and the system state will be closer to the desired one. The optimal state and control trajectories for ε̄ = 10−6 are shown in Fig. 14.5a, b, respectively. In real-world neural network training, the training precision of NNs is generally set to a uniform value. Thus, it is recommended that the present iterative ADP algorithm be implemented with a high training precision, which allows the iterative value function to converge for most of the state space.
Fig. 14.3 Iterative trajectories of states and controls for different ε̄’s. a State for ε̄ = 10−4 . b Control
for ε̄ = 10−4 . c State for ε̄ = 10−6 . d Control for ε̄ = 10−6 .
Fig. 14.5 Comparisons by ADP and MFAC. a State trajectory by ADP. b Control trajectory by
ADP. c State trajectory by MFAC. d Control trajectory by MFAC.
On the other hand, to show the effectiveness of the stable iterative ADP algorithm,
numerical results by our algorithm will be compared with the ones by the data-driven
model-free adaptive control (MFAC) algorithm [7]. According to [7], the controller
is designed by
uk = uk−1 + ρΦk(τ − xk)/(λ + Φk^2),

and

Φk = Φk−1 + η(Δxk − Φk−1Δuk−1)Δuk−1/(μ + (Δuk−1)^2).
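For reference, a minimal sketch of one step of this MFAC law follows; the parameter values ρ, λ, η, μ, the initial pseudo-partial-derivative estimate Φ, and the numerical arguments in the usage comment are placeholders, not the settings used in the chapter's comparison.

```python
def mfac_step(x, x_prev, u_prev, u_prev2, Phi_prev, tau,
              rho=0.6, lam=1.0, eta=0.5, mu=1.0):
    """One step of the data-driven MFAC law of [7]: update the
    pseudo-partial-derivative estimate Phi_k, then the control u_k."""
    du, dx = u_prev - u_prev2, x - x_prev
    Phi = Phi_prev + eta * (dx - Phi_prev * du) * du / (mu + du**2)
    u = u_prev + rho * Phi * (tau - x) / (lam + Phi**2)
    return u, Phi

# Example call with placeholder numbers:
# u_k, Phi_k = mfac_step(x=300.0, x_prev=298.0, u_prev=60.0, u_prev2=59.0,
#                        Phi_prev=0.5, tau=320.0)
```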
From the numerical results, we can see that using the present stable iterative ADP algorithm, it takes 25 time steps for the system state to track the desired one, whereas by the MFAC algorithm in [7], it takes 50 time steps. Furthermore, there exist overshoots by the method of [7], while such overshoots are avoided by the present stable iterative ADP algorithm. These results illustrate the effectiveness of our algorithm.
14.6 Conclusions
References
14. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
15. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
16. Lu X, Wang T (2013) Water-gas shift modeling in coal gasification in an entrained-flow gasifier.
Part 1: development of methodology and model calibration. Fuel 108:629–638
17. Moe JM (1962) Design of water gas shift reactors. Chem Eng Prog 58(3):33–36
18. Olalla C, Queinnec I, Leyva R, Aroudi AE (2012) Optimal state-feedback control of bilinear
DC-DC converters with guaranteed regions of stability. IEEE Trans Ind Electron 59(10):3868–
3880
19. Rathore R, Holtz H, Boller T (2013) Generalized optimal pulsewidth modulation of multilevel
inverters for low-switching-frequency control of medium-voltage high-power industrial AC
drives. IEEE Trans Ind Electron 60(10):4215–4224
20. Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
21. Sudhakar M, Narasimhan S, Kaisare NS (2012) Approximate dynamic programming based
control for water gas shift reactor. Comput Aided Chem Eng 31:340–344
22. Ueyama Y, Miyashita E (2014) Optimal feedback control for predicting dynamic stiffness
during arm movement. IEEE Trans Ind Electron 61(2):1044–1052
23. Wei Q, Liu D (2012) An iterative ε-optimal control scheme for a class of discrete-time nonlinear
systems with unfixed initial state. Neural Netw 32:236–244
24. Wei Q, Liu D (2013) Numerical adaptive learning control scheme for discrete-time nonlinear
systems. IET Control Theory Appl 7(11):1472–1486
25. Wei Q, Liu D (2013) A novel iterative θ-adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
26. Wei Q, Liu D (2014) Data-driven neuro-optimal temperature control of water-gas shift reaction
using stable iterative adaptive dynamic programming. IEEE Trans Ind Electron 61(11):6399–
6408
27. Wei Q, Wang D, Zhang D (2013) Dual iterative adaptive dynamic programming for a class of
discrete-time nonlinear systems with time-delays. Neural Comput Appl 23(7–8):1851–1863
28. Wright GT, Edgar TF (1989) Adaptive control of a laboratory water-gas shift reactor with
dynamic inlet condition. In: Proceedings of the American control conference, pp 1828–1833
29. Xiao S, Li Y (2013) Optimal design, fabrication, and control of an micropositioning stage
driven by electromagnetic actuators. IEEE Trans Ind Electron 60(10):4613–4626
30. Yang Q, Jagannathan S (2012) Reinforcement learning controller design for affine nonlinear
discrete-time systems using online approximators. IEEE Trans Syst Man Cybern Part B Cybern
42(2):377–390
31. Yin S, Ding SX, Haghani A, Hao H, Zhang P (2012) A comparison study of basic data-driven
fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process.
J Process Control 22(9):1567–1581
32. Yin S, Luo H, Ding SX (2014) Real-time implementation of fault-tolerant control systems with
performance optimization. IEEE Trans Ind Electron 61(5):2402–2411
33. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
34. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans
Syst Man Cybern Part B Cybern 38(4):937–942