Advances in Industrial Control

Derong Liu
Qinglai Wei
Ding Wang
Xiong Yang
Hongliang Li

Adaptive Dynamic
Programming with
Applications in
Optimal Control
Advances in Industrial Control

Series editors
Michael J. Grimble, Glasgow, UK
Michael A. Johnson, Kidlington, UK
More information about this series at http://www.springer.com/series/1412
Derong Liu
Institute of Automation
Chinese Academy of Sciences
Beijing
China

Qinglai Wei
Institute of Automation
Chinese Academy of Sciences
Beijing
China

Ding Wang
Institute of Automation
Chinese Academy of Sciences
Beijing
China

Xiong Yang
Tianjin University
Tianjin
China

Hongliang Li
Tencent Inc.
Shenzhen
China

ISSN 1430-9491 ISSN 2193-1577 (electronic)


Advances in Industrial Control
ISBN 978-3-319-50813-9 ISBN 978-3-319-50815-3 (eBook)
DOI 10.1007/978-3-319-50815-3
Library of Congress Control Number: 2016959539

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword

Nowadays, nonlinearity is involved in all walks of life. It is a challenge for
engineers to design controllers for all kinds of nonlinear systems. To handle this
issue, various nonlinear control theories have been developed, such as theories of
adaptive control, optimal control, and robust control. Among these theories, the
theory of optimal control has drawn considerable attention over the past several
decades. This is mainly because optimal control provides an effective way to design
controllers with guaranteed robustness properties as well as capabilities of
optimization and resource conservation that are important in manufacturing, vehicle
emission control, aerospace systems, power systems, chemical engineering
processes, and many other applications.
The core challenge in deriving the solutions of nonlinear optimal control
problems is that it often boils down to solving certain Hamilton–Jacobi–Bellman
(HJB) equations. The HJB equations are nonlinear and difficult to solve for general
nonlinear dynamical systems. Indeed, no closed-form solution to such equations
exists, except for very special problems. Therefore, numerical solutions to HJB
equations have been developed by engineers. To obtain such numerical solutions, a
highly effective method known as adaptive/approximate dynamic programming
(ADP) can be used. A distinct advantage of ADP is that it can avoid the well-known
“curse of dimensionality” of dynamic programming while adaptively solving the
HJB equations. Due to this characteristic, many elegant ADP approaches and their
applications have been developed in the literature during the past several decades.
It is also notable that ADP techniques provide a link with cognitive
decision-making methods that are observed in the human brain, and thus, ADP has
become a main channel to achieve truly brain-like intelligence in human-engineered
automatic control systems.
Unlike most ADP books, the present book “Adaptive Dynamic Programming
with Applications in Optimal Control” focuses on the principles of emerging
optimal control techniques for nonlinear systems in both discrete-time and
continuous-time domains, and on creating applications of these optimal control
techniques. This book contains three themes:

1. Optimal control for discrete-time nonlinear dynamical systems, covering various
novel techniques used to derive optimal control in the discrete-time domain,
such as general value iteration, θ-ADP, finite approximation error-based value
iteration, policy iteration, generalized policy iteration, and error bounds analysis
of ADP.
2. Optimal control for continuous-time nonlinear systems, discussing the optimal
control for input-affine/input-nonaffine nonlinear systems, robust and optimal
guaranteed cost control for input-affine nonlinear systems, decentralized control
for interconnected nonlinear systems, and optimal control for differential games.
3. Applications, providing typical applications of optimal control approaches in the
areas of energy management in smart homes, coal gasification, and water gas
shift reaction.
This book provides timely and informative coverage of ADP, including both
rigorous derivations and insightful developments. It will help both specialists and
nonspecialists understand the new developments in the field of nonlinear optimal
control using online/offline learning techniques. Meanwhile, it will be beneficial for
engineers to apply the developed ADP methods to their own problems in practice.
I am sure you will enjoy reading this book.

Frank L. Lewis
Arlington, TX, USA
September 2016
Series Editors’ Foreword

The series Advances in Industrial Control aims to report and encourage technology
transfer in control engineering. The rapid development of control technology has an
impact on all areas of the control discipline: new theory, new controllers, actuators,
sensors, new industrial processes, computer methods, new applications, new design
philosophies, and new challenges. Much of this development work resides in
industrial reports, feasibility study papers, and the reports of advanced collaborative
projects. The series offers an opportunity for researchers to present an extended
exposition of such new work in all aspects of industrial control for wider and rapid
dissemination.
The method of dynamic programming has a long history in the field of optimal
control. It dates back to those days when the subject of control was emerging in a
modern form in the 1950s and 1960s. It was devised by Richard Bellman who gave
it a modern revision in a publication of 1954 [1]. The name of Bellman became
linked to an optimality equation, key to the method, and like the name of Kalman
became uniquely associated with the early development of optimal control. One
notable extension to the method was that of differential dynamic programming due
to David Q. Mayne in 1966 and developed at length in the book by Jacobson and
Mayne [2]. Their new technique used locally quadratic models for the system
dynamics and cost functions and improved the convergence of the dynamic
programming method for optimal trajectory control problems.
Since those early days, the subject of control has taken many different directions,
but dynamic programming has always retained a place in the theory of optimal
control fundamentals. It is therefore instructive for the Advances in Industrial
Control monograph series to have a contribution that presents new ways of solving
dynamic programming problems and demonstrates these methods with some up-to-date
industrial problems. This monograph, Adaptive Dynamic Programming with
Applications in Optimal Control, by Derong Liu, Qinglai Wei, Ding Wang, Xiong
Yang and Hongliang Li, has precisely that objective.
The authors open the monograph with a very interesting and relevant discussion
of another computationally difficult problem, namely devising a computer program
to defeat human master players at the Chinese game of Go. Inspiration from the
better programming techniques used in the Go-master problem was used by the
authors to defeat the “curse of dimensionality” that arises in dynamic programming
methods.
More formally, the objective of the techniques reported in the monograph is to
control in an optimal fashion an unknown or uncertain nonlinear multivariable
system using recorded and instantaneous output signals. The algorithms’ technical
framework is then constructed through different categories of the usual state-space
nonlinear ordinary differential system model. The system model can be continuous
or discrete, have affine or nonaffine control inputs, be subject to no constraints, or
have constraints present. A set of 11 chapters contains the theory for various
formulations of the system features.
Since standard dynamic programming schemes suffer from various implementation
obstacles, adaptive dynamic programming procedures have been developed
to find computable practical suboptimal control solutions. A key technique used by
the authors is that of neural networks which are trained using recorded data and
updated, or “adapted,” to accommodate uncertain system knowledge. The theory
chapters are arranged in two parts: Part 1 Discrete-Time Systems—five chapters;
and Part 2 Continuous-Time Systems—five chapters.
An important feature of the monographs of the Advances in Industrial Control
series is a demonstration of potential or actual application to industrial problems.
After a comprehensive presentation of the theory of adaptive dynamic programming,
the authors devote Part 3 of their monograph to three chapter-length application
studies. Chapter 12 examines the scheduling of energy supplies in a smart
home environment, a topic and problem of considerable contemporary interest.
Chapter 13 uses a coal gasification process that is suitably challenging to
demonstrate the authors’ techniques. And finally, Chapter 14 concerns the control of the
water gas shift reaction. In this example, the data used was taken from a real-world
operational system.
This monograph is very comprehensive in its presentation of the adaptive
dynamic programming theory and has demonstrations with three challenging
processes. It should find a wide readership in both the industrial control engineering
and the academic control theory communities. Readers in other fields such as
computer science and chemical engineering may also find the monograph of
considerable interest.
Michael J. Grimble
Michael A. Johnson
Industrial Control Centre
University of Strathclyde
Glasgow, Scotland, UK

References

1. Bellman R (1954) The theory of dynamic programming. Bulletin of the American Mathematical
Society 60(6):503–515
2. Jacobson DH, Mayne DQ (1970) Differential dynamic programming. American Elsevier,
New York
Preface

With the rapid development of information science and technology, many businesses
and industries have undergone great changes, including the chemical industry,
electric power engineering, the electronics industry, mechanical engineering,
transportation, and the logistics business. While the scale of industrial enterprises is
increasing, production equipment and industrial processes are becoming more and
more complex. For these complex systems, decision and control are necessary to
ensure that they perform properly and meet prescribed performance objectives.
Under these circumstances, designing safe, reliable, and efficient control for
complex systems is essential for our society. As modern systems become more
complex and performance requirements become more stringent, advanced control
methods are greatly needed to achieve guaranteed performance and satisfactory
goals.
In general, optimal control deals with the problem of finding a control law for a
given system such that a certain optimality criterion is achieved. The main
difference between optimal control of linear and nonlinear systems is that the latter
often requires solving the nonlinear Bellman equation instead of the Riccati
equation. Although dynamic programming is a conventional method for solving
optimization and optimal control problems, it often suffers from the “curse of
dimensionality.” To overcome this difficulty, adaptive/approximate dynamic
programming (ADP), which is based on function approximators such as neural
networks, was proposed by Werbos as a method for solving optimal control problems
forward-in-time.
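
For orientation, the Bellman and Riccati equations contrasted above can be written side by side in one common discrete-time notation. This is only a sketch: the symbols F, U, J*, A, B, Q, R, and P are assumptions chosen here for illustration and are not taken from the chapters that follow.

```latex
% Nonlinear case: system x_{k+1} = F(x_k, u_k), utility U(x_k, u_k) >= 0.
% The optimal value function J^* satisfies the Bellman equation
J^*(x_k) = \min_{u_k} \Bigl\{ U(x_k, u_k) + J^*\bigl(F(x_k, u_k)\bigr) \Bigr\}

% Linear-quadratic special case: F(x_k, u_k) = A x_k + B u_k and
% U(x_k, u_k) = x_k^\top Q x_k + u_k^\top R u_k. Substituting the quadratic
% form J^*(x) = x^\top P x reduces the Bellman equation to the algebraic
% Riccati equation for the matrix P:
P = Q + A^\top P A - A^\top P B \bigl( R + B^\top P B \bigr)^{-1} B^\top P A
```

In the linear-quadratic case the unknown is a finite matrix P, whereas in the general nonlinear case the unknown is the function J* itself, which is why function approximators such as neural networks enter the picture.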
This book presents the recent results of ADP with applications in optimal
control. It is composed of 14 chapters, which cover most of the hot research areas of
ADP and are divided into three parts. Part I concerns discrete-time systems and
comprises five chapters (Chaps. 2–6). Part II concerns continuous-time systems
and comprises five chapters (Chaps. 7–11). Part III concerns applications and
comprises three chapters (Chaps. 12–14).
In Chap. 1, an introduction to the history of ADP is provided, including the basic
and iterative forms of ADP. The review begins with the origin of ADP and
describes the basic structures and the algorithm development in detail. Connections
between ADP and reinforcement learning are also discussed.
Part I: Discrete-Time Systems (Chaps. 2–6)
In Chap. 2, optimal control problems of discrete-time nonlinear dynamical systems,
including optimal regulation, optimal tracking control, and constrained optimal
control, are studied using a series of value iteration ADP approaches. First, an ADP
scheme based on general value iteration is developed to obtain near-optimal control
for discrete-time affine nonlinear systems with continuous state and control spaces.
The present scheme is also employed to solve infinite-horizon optimal tracking
control problems for a class of discrete-time nonlinear systems. In particular, using
the globalized dual heuristic programming technique, a value iteration-based
optimal control strategy of unknown discrete-time nonlinear dynamical systems
with input constraints is established as a case study. Second, an iterative θ-ADP
algorithm is given to solve the optimal control problem of infinite-horizon
discrete-time nonlinear systems, which shows that each of the iterative controls can
stabilize the nonlinear dynamical systems and the condition of initial admissible
control is avoided effectively.
In Chap. 3, a series of iterative ADP algorithms are developed to solve the
infinite-horizon optimal control problems for discrete-time nonlinear dynamical
systems with finite approximation errors. Iterative control laws are obtained by
using the present algorithms such that the iterative value functions reach the
optimum. Then, the numerical optimal control problems are solved by a novel
numerical adaptive learning control scheme based on the ADP algorithm. Moreover, a
general value iteration algorithm with finite approximation errors is developed to
guarantee that the iterative value function converges to the solution of the Bellman
equation. The general value iteration algorithm permits an arbitrary positive
semidefinite function to initialize itself, which overcomes the disadvantage of
traditional value iteration algorithms.
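
The idea of starting value iteration from an arbitrary positive semidefinite function can be illustrated on a toy problem. The following is a minimal sketch, not the book's algorithm: it assumes a scalar linear system x_{k+1} = a x_k + b u_k with utility q x^2 + r u^2 and no approximation errors, so that each iterative value function is V_i(x) = p_i x^2 and the iteration reduces to a scalar Riccati recursion. The function names and the numerical values are hypothetical.

```python
def value_iteration(a, b, q, r, p0, iterations=200):
    """Iterate V_{i+1}(x) = min_u { q x^2 + r u^2 + V_i(a x + b u) }, V_i(x) = p_i x^2.

    p0 >= 0 is the coefficient of the (arbitrary) positive semidefinite
    initial value function V_0(x) = p0 * x^2.
    """
    p = p0
    history = [p]
    for _ in range(iterations):
        # The minimizing control is u = -k x with k = a b p / (r + b^2 p).
        # Substituting it back gives the scalar Riccati recursion below.
        p = q + a * a * r * p / (r + b * b * p)
        history.append(p)
    return p, history


def riccati_fixed_point(a, b, q, r):
    """Positive root of b^2 p^2 + (r - q b^2 - a^2 r) p - q r = 0."""
    c = r - q * b * b - a * a * r
    return (-c + (c * c + 4.0 * b * b * q * r) ** 0.5) / (2.0 * b * b)


# Hypothetical example: an open-loop unstable scalar system.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
p_star = riccati_fixed_point(a, b, q, r)

# Two different positive semidefinite initial value functions.
p_from_zero, history_from_zero = value_iteration(a, b, q, r, p0=0.0)
p_from_large, history_from_large = value_iteration(a, b, q, r, p0=10.0)

print("fixed point:", p_star)
print("from V_0 = 0:", p_from_zero)
print("from V_0 = 10 x^2:", p_from_large)
```

In this toy case both initializations approach the same fixed point, one from below and one from above. The nonlinear setting of Chap. 3 replaces the scalar p_i with an approximated value function and has to account for the approximation error at every iteration.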
In Chap. 4, a discrete-time policy iteration ADP method is developed to solve
the infinite-horizon optimal control problems for nonlinear dynamical systems. The
idea is to use an iterative ADP technique to obtain iterative control laws that
optimize the iterative value functions. The convergence, stability, and optimality
properties are analyzed for the policy iteration method for discrete-time nonlinear
dynamical systems, and it is shown that the iterative value functions converge
nonincreasingly to the optimal solution of the Bellman equation. It is also
proven that any of the iterative control laws obtained from the present policy
iteration algorithm can stabilize the nonlinear dynamical systems.
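
The two properties mentioned above (nonincreasing value functions and stabilizing iterative control laws) can be observed on a toy problem. The following is a minimal sketch under strong assumptions, not the method of Chap. 4: it uses a scalar linear system x_{k+1} = a x_k + b u_k with utility q x^2 + r u^2 and linear control laws u = -k x, for which policy evaluation has a closed form. All names and numerical values are hypothetical.

```python
def policy_iteration(a, b, q, r, k0, iterations=20):
    """Policy iteration for x_{k+1} = a x + b u, utility q x^2 + r u^2, u = -k x.

    k0 must be an admissible (stabilizing) gain, i.e. |a - b*k0| < 1.
    Returns the value coefficients p_i (V_i(x) = p_i x^2) and the gains k_i.
    """
    k = k0
    values, gains = [], []
    for _ in range(iterations):
        c = a - b * k  # closed-loop pole under the current control law
        if abs(c) >= 1.0:
            raise ValueError("control law is not admissible (closed loop unstable)")
        # Policy evaluation: V(x) = sum_j (q + r k^2) (c^j x)^2 = p x^2.
        p = (q + r * k * k) / (1.0 - c * c)
        values.append(p)
        gains.append(k)
        # Policy improvement: minimize q x^2 + r u^2 + p (a x + b u)^2 over u.
        k = a * b * p / (r + b * b * p)
    return values, gains


# Hypothetical example: open-loop unstable system, deadbeat initial gain.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
values, gains = policy_iteration(a, b, q, r, k0=1.2)

for i, (p, k) in enumerate(zip(values[:4], gains[:4])):
    print(f"iteration {i}: p = {p:.6f}, gain = {k:.6f}, closed-loop pole = {a - b * k:.6f}")
```

Unlike value iteration, this scheme needs an admissible control law to start from, which is why the initial gain is checked for stability before the first policy evaluation.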
In Chap. 5, a generalized policy iteration algorithm is developed to solve the
optimal control problems for infinite-horizon discrete-time nonlinear systems.
The generalized policy iteration algorithm is built on the interaction between the
policy iteration algorithm and the value iteration algorithm of ADP. It permits an
arbitrary positive semidefinite function to initialize the algorithm, where two iteration
indices are used for policy evaluation and policy improvement, respectively. The
monotonicity, convergence, admissibility, and optimality properties of the generalized
policy iteration algorithm are analyzed.
In Chap. 6, error bounds of ADP algorithms are established for solving
undiscounted infinite-horizon optimal control problems of discrete-time deterministic
nonlinear systems. The error bounds for approximate value iteration based on a
novel error condition are developed. The error bounds for approximate policy
iteration and approximate optimistic policy iteration algorithms are also provided. It
is shown that the iterative approximate value function can converge to a finite
neighborhood of the optimal value function under some conditions. In addition,
error bounds are established for the Q-function of approximate policy iteration for
optimal control of unknown discounted discrete-time nonlinear systems. Neural
networks are used to approximate the Q-function and the control policy.
Part II: Continuous-Time Systems (Chaps. 7–11)
In Chap. 7, optimal control problems of continuous-time affine nonlinear dynamical
systems are studied using ADP approaches. First, an identifier–critic architecture
based on ADP methods is presented to derive the approximate optimal control for
uncertain continuous-time nonlinear dynamical systems. The identifier neural
network and the critic neural network are tuned simultaneously, while the restrictive
persistence of excitation condition is relaxed. Second, an ADP-based algorithm is
developed to solve the optimal control problems for continuous-time nonlinear
dynamical systems with control constraints. Only a single critic neural network is
utilized to derive the optimal control, and there is no special requirement on the
initial control.
In Chap. 8, the optimal control problems are considered for continuous-time
nonaffine nonlinear dynamical systems with completely unknown dynamics via
ADP methods. First, an ADP-based novel identifier–actor–critic architecture is
developed to provide approximate optimal control solutions for continuous-time
unknown nonaffine nonlinear dynamical systems, where the identifier is constructed
by a dynamic neural network to transform nonaffine nonlinear systems into a class
of affine nonlinear systems. Second, an ADP-based observer–critic architecture is
presented to obtain the approximate optimal control for nonaffine nonlinear
dynamical systems in the presence of unknown dynamics, where the observer is
composed of a three-layer feedforward neural network aiming to get the knowledge
of system states.
In Chap. 9, robust control and optimal guaranteed cost control of
continuous-time uncertain nonlinear systems are studied using the idea of ADP.
First, a novel strategy is established to design the robust controller for a class of
nonlinear systems with uncertainties based on an online policy iteration algorithm.
By properly choosing a cost function that reflects the uncertainties, regulation, and
control, the robust control problem is transformed into an optimal control problem,
which can be solved effectively under the framework of ADP. Then, the
infinite-horizon optimal guaranteed cost control problem of uncertain nonlinear
systems is investigated by employing the formulation of ADP-based online optimal
control design, which extends the application scope of ADP methods to nonlinear
and uncertain environments.
In Chap. 10, by using a neural network-based online learning optimal control
approach, a decentralized control strategy is developed to stabilize a class of
continuous-time large-scale interconnected nonlinear systems. The decentralized
control strategy of the overall system can be established by adding appropriate
feedback gains to the optimal control laws of isolated subsystems. Then, an online
policy iteration algorithm is presented to solve the Hamilton–Jacobi–Bellman
equations related to the optimal control problems. Furthermore, as a generalization,
a neural network-based decentralized control law is developed to stabilize the
large-scale interconnected nonlinear systems with unknown dynamics by using an
online model-free integral policy iteration algorithm.
In Chap. 11, differential game problems of continuous-time systems, including
two-player zero-sum games, multiplayer zero-sum games, and multiplayer
nonzero-sum games, are studied via a series of ADP approaches. First, an integral
policy iteration algorithm is developed to learn online the Nash equilibrium solution
of two-player zero-sum differential games with completely unknown
continuous-time linear dynamics. Second, multiplayer zero-sum differential games
for a class of continuous-time uncertain nonlinear systems are solved by using an
iterative ADP algorithm. Finally, an online synchronous approximate optimal
learning algorithm based on policy iteration is developed to solve multiplayer
nonzero-sum games of continuous-time nonlinear systems without requiring exact
knowledge of system dynamics.
Part III: Applications (Chaps. 12–14)
In Chap. 12, intelligent optimization methods based on ADP are applied to the
challenges of intelligent price-responsive management of residential energy, with
an emphasis on home battery use connected to the power grid. First, an
action-dependent heuristic dynamic programming method is developed to obtain the
optimal control law for residential energy management. Second, a dual iterative
Q-learning algorithm is developed to solve the optimal battery management and
control problem in smart residential environments, where two iterations, internal
and external, are introduced. Based on the dual iterative Q-learning algorithm, the
convergence property of the iterative Q-learning method for the optimal battery
management and control problem is proven. Finally,
a distributed iterative ADP method is developed to solve the multibattery optimal
coordination control problem for home energy management systems.
In Chap. 13, a coal gasification optimal tracking control problem is solved
through a data-based iterative optimal learning control scheme by using an iterative
ADP approach. According to system data, neural networks are used to construct the
dynamics of the coal gasification process, coal quality, and reference control,
respectively. Via system transformation, the optimal tracking control problem with
approximation errors and disturbances is effectively transformed into a two-person
zero-sum optimal control problem. An iterative ADP algorithm is developed to
obtain the optimal control laws for the transformed system.
In Chap. 14, a data-driven stable iterative ADP algorithm is developed to solve
the optimal temperature control problem of the water gas shift reaction system.
According to the system data, neural networks are used to construct the dynamics of
the water gas shift reaction system and solve the reference control. Considering the
reconstruction errors of neural networks and the disturbances of the system and
control input, a stable iterative ADP algorithm is developed to obtain the optimal
control law. A convergence property is established to guarantee that the iterative value
function converges to a finite neighborhood of the optimal cost function. A stability
property is established so that each of the iterative control laws guarantees that the
tracking error is uniformly ultimately bounded.

Derong Liu
Qinglai Wei
Ding Wang
Xiong Yang
Hongliang Li

Beijing, China
Chicago, USA
September 2016
Acknowledgements

The authors would like to acknowledge the help and encouragement they have
received from colleagues in Beijing and Chicago during the course of writing this
book. Some materials presented in this book are based on the research conducted
with several Ph.D. students, including Yuzhu Huang, Dehua Zhang, Pengfei Yan,
Yancai Xu, Hongwen Ma, Chao Li, and Guang Shi. The authors also wish to thank
Oliver Jackson, Editor (Engineering) from Springer, for his patience and
encouragement.
The authors are very grateful to the National Natural Science Foundation of
China (NSFC) for providing necessary financial support to our research in the past
five years. The present book is the result of NSFC Grants 61034002, 61233001,
61273140, 61304086, and 61374105.

Contents

1 Overview of Adaptive Dynamic Programming
1.1 Introduction
1.2 Reinforcement Learning
1.3 Adaptive Dynamic Programming
1.3.1 Basic Forms of Adaptive Dynamic Programming
1.3.2 Iterative Adaptive Dynamic Programming
1.3.3 ADP for Continuous-Time Systems
1.3.4 Remarks
1.4 Related Books
1.5 About This Book
References

Part I Discrete-Time Systems


2 Value Iteration ADP for Discrete-Time Nonlinear Systems
2.1 Introduction
2.2 Optimal Control of Nonlinear Systems Using General Value Iteration
2.2.1 Convergence Analysis
2.2.2 Neural Network Implementation
2.2.3 Generalization to Optimal Tracking Control
2.2.4 Optimal Control of Systems with Constrained Inputs
2.2.5 Simulation Studies
2.3 Iterative θ-Adaptive Dynamic Programming Algorithm for Nonlinear Systems
2.3.1 Convergence Analysis
2.3.2 Optimality Analysis
2.3.3 Summary of Iterative θ-ADP Algorithm
2.3.4 Simulation Studies
2.4 Conclusions
References
3 Finite Approximation Error-Based Value Iteration ADP
3.1 Introduction
3.2 Iterative θ-ADP Algorithm with Finite Approximation Errors
3.2.1 Properties of the Iterative ADP Algorithm with Finite Approximation Errors
3.2.2 Neural Network Implementation
3.2.3 Simulation Study
3.3 Numerical Iterative θ-Adaptive Dynamic Programming
3.3.1 Derivation of the Numerical Iterative θ-ADP Algorithm
3.3.2 Properties of the Numerical Iterative θ-ADP Algorithm
3.3.3 Summary of the Numerical Iterative θ-ADP Algorithm
3.3.4 Simulation Study
3.4 General Value Iteration ADP Algorithm with Finite Approximation Errors
3.4.1 Derivation and Properties of the GVI Algorithm with Finite Approximation Errors
3.4.2 Designs of Convergence Criteria with Finite Approximation Errors
3.4.3 Simulation Study
3.5 Conclusions
References
4 Policy Iteration for Optimal Control of Discrete-Time Nonlinear Systems
4.1 Introduction
4.2 Policy Iteration Algorithm
4.2.1 Derivation of Policy Iteration Algorithm
4.2.2 Properties of Policy Iteration Algorithm
4.2.3 Initial Admissible Control Law
4.2.4 Summary of Policy Iteration ADP Algorithm
4.3 Numerical Simulation and Analysis
4.4 Conclusions
References
5 Generalized Policy Iteration ADP for Discrete-Time Nonlinear Systems
5.1 Introduction
5.2 Generalized Policy Iteration-Based Adaptive Dynamic Programming Algorithm
5.2.1 Derivation and Properties of the GPI Algorithm
5.2.2 GPI Algorithm and Relaxation of Initial Conditions
5.2.3 Simulation Studies
5.3 Discrete-Time GPI with General Initial Value Functions
5.3.1 Derivation and Properties of the GPI Algorithm
5.3.2 Relaxations of the Convergence Criterion and Summary of the GPI Algorithm
5.3.3 Simulation Studies
5.4 Conclusions
References
6 Error Bounds of Adaptive Dynamic Programming Algorithms
6.1 Introduction
6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems
6.2.1 Problem Formulation
6.2.2 Approximate Value Iteration
6.2.3 Approximate Policy Iteration
6.2.4 Approximate Optimistic Policy Iteration
6.2.5 Neural Network Implementation
6.2.6 Simulation Study
6.3 Error Bounds of Q-Function for Discounted Optimal Control Problems
6.3.1 Problem Formulation
6.3.2 Policy Iteration Under Ideal Conditions
6.3.3 Error Bound for Approximate Policy Iteration
6.3.4 Neural Network Implementation
6.3.5 Simulation Study
6.4 Conclusions
References

Part II Continuous-Time Systems


7 Online Optimal Control of Continuous-Time Affine Nonlinear
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.2 Online Optimal Control of Partially Unknown Affine
Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.2.1 Identifier–Critic Architecture for Solving HJB
Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
7.2.2 Stability Analysis of Closed-Loop System . . . . . . . . . . . 281
7.2.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
7.3 Online Optimal Control of Affine Nonlinear Systems
with Constrained Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.3.1 Solving HJB Equation via Critic Architecture . . . . . . . . 294
7.3.2 Stability Analysis of Closed-Loop System
with Constrained Inputs . . . . . . . . . . . . . . . . . . . . . . . . . 298
7.3.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
7.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
8 Optimal Control of Unknown Continuous-Time Nonaffine
Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.2 Optimal Control of Unknown Nonaffine Nonlinear Systems
with Constrained Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
8.2.1 Identifier Design via Dynamic Neural Networks . . . . . . 311
8.2.2 Actor–Critic Architecture
for Solving HJB Equation . . . . . . . . . . . . . . . . . . . . . . . 316
8.2.3 Stability Analysis of Closed-Loop System . . . . . . . . . . . 318
8.2.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
8.3 Optimal Output Regulation of Unknown Nonaffine Nonlinear
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
8.3.1 Neural Network Observer . . . . . . . . . . . . . . . . . . . . . . . 328
8.3.2 Observer-Based Optimal Control Scheme
Using Critic Network . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.3.3 Stability Analysis of Closed-Loop System . . . . . . . . . . . 337
8.3.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
8.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
9 Robust and Optimal Guaranteed Cost Control
of Continuous-Time Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . 345
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
9.2 Robust Control of Uncertain Nonlinear Systems. . . . . . . . . . . . . 346
9.2.1 Equivalence Analysis and Problem Transformation . . . . 348
9.2.2 Online Algorithm and Neural Network
Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
9.2.3 Stability Analysis of Closed-Loop System . . . . . . . . . . . 353
9.2.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
9.3.1 Optimal Guaranteed Cost Controller Design . . . . . . . . . 362
9.3.2 Online Solution of Transformed Optimal Control
Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
9.3.3 Stability Analysis of Closed-Loop System . . . . . . . . . . . 373


9.3.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
9.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
10 Decentralized Control of Continuous-Time Interconnected
Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
10.2 Decentralized Control of Interconnected Nonlinear Systems . . . . 388
10.2.1 Decentralized Stabilization via Optimal Control
Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
10.2.2 Optimal Controller Design of Isolated Subsystems . . . . 394
10.2.3 Generalization to Model-Free
Decentralized Control . . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.2.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
10.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
11 Learning Algorithms for Differential Games
of Continuous-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
11.2 Integral Policy Iteration for Two-Player Zero-Sum Games . . . . . 418
11.2.1 Derivation of Integral Policy Iteration . . . . . . . . . . . . . . 420
11.2.2 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 423
11.2.3 Neural Network Implementation . . . . . . . . . . . . . . . . . . 425
11.2.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
11.3 Iterative Adaptive Dynamic Programming for Multi-player
Zero-Sum Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
11.3.1 Derivation of the Iterative ADP Algorithm . . . . . . . . . . 433
11.3.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
11.3.3 Neural Network Implementation . . . . . . . . . . . . . . . . . . 444
11.3.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
11.4 Synchronous Approximate Optimal Learning for Multi-player
Nonzero-Sum Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
11.4.1 Derivation and Convergence Analysis . . . . . . . . . . . . . . 460
11.4.2 Neural Network Implementation . . . . . . . . . . . . . . . . . . 464
11.4.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
11.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
Part III Applications


12 Adaptive Dynamic Programming for Optimal Residential Energy
Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
12.2 A Self-learning Scheme for Residential Energy System
Control and Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
12.2.1 The ADHDP Method . . . . . . . . . . . . . . . . . . . . . . . . . . 488
12.2.2 A Self-learning Scheme for Residential Energy
System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
12.2.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
12.3 A Novel Dual Iterative Q-Learning Method for Optimal
Battery Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
12.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
12.3.2 Dual Iterative Q-Learning Algorithm . . . . . . . . . . . . . . . 497
12.3.3 Neural Network Implementation . . . . . . . . . . . . . . . . . . 503
12.3.4 Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.4 Multi-battery Optimal Coordination Control for Residential
Energy Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
12.4.1 Distributed Iterative ADP Algorithm . . . . . . . . . . . . . . . 515
12.4.2 Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
12.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
13 Adaptive Dynamic Programming for Optimal Control of Coal
Gasification Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
13.2 Data-Based Modeling and Properties . . . . . . . . . . . . . . . . . . . . . 538
13.2.1 Description of Coal Gasification Process
and Control Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
13.2.2 Data-Based Process Modeling and Properties . . . . . . . . 540
13.3 Design and Implementation of Optimal Tracking Control. . . . . . 546
13.3.1 Optimal Tracking Controller Design by Iterative ADP
Algorithm Under System and Iteration Errors . . . . . . . . 546
13.3.2 Neural Network Implementation . . . . . . . . . . . . . . . . . . 554
13.4 Numerical Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
13.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
14 Data-Based Neuro-Optimal Temperature Control
of Water Gas Shift Reaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
14.2 System Description and Data-Based Modeling . . . . . . . . . . . . . . 572
14.2.1 Water Gas Shift Reaction . . . . . . . . . . . . . . . . . . . . . . . 572
14.2.2 Data-Based Modeling and Properties . . . . . . . . . . . . . . . 573
14.3 Design of Neuro-Optimal Temperature Controller . . . . . . . . . . . . 575
14.3.1 System Transformation . . . . . . . . . . . . . . . . . . . . . . . . . 575
14.3.2 Derivation of Stable Iterative ADP Algorithm . . . . . . . . 576
14.3.3 Properties of Stable Iterative ADP Algorithm
with Approximation Errors and Disturbances . . . . . . . . 578
14.4 Neural Network Implementation for the Optimal Tracking
Control Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
14.5 Numerical Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
14.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
Abbreviations

ACD Adaptive critic designs


AD Action-dependent, e.g., ADHDP and ADDHP
ADP Adaptive dynamic programming, or approximate dynamic programming
ADPRL Adaptive dynamic programming and reinforcement learning
BP Backpropagation
DHP Dual heuristic programming
DP Dynamic programming
GDHP Globalized dual heuristic programming
GPI Generalized policy iteration
GVI General value iteration
HDP Heuristic dynamic programming
HJB Hamilton–Jacobi–Bellman, e.g., HJB equation
HJI Hamilton–Jacobi–Isaacs, e.g., HJI equation
NN Neural network
PE Persistence of excitation
PI Policy iteration
UUB Uniformly ultimately bounded
VI Value iteration
RL Reinforcement learning

Symbols

$^{\mathsf{T}}$    The transposition symbol, e.g., $A^{\mathsf{T}}$ is the transposition of matrix $A$
$\mathbb{N}$    The set of all natural numbers
$\mathbb{Z}^{+}$    The set of all positive integers, i.e., $\mathbb{N} = \{0, \mathbb{Z}^{+}\}$
$\mathbb{R}$    The set of all real numbers
$\mathbb{R}^{n}$    The Euclidean space of all real $n$-vectors, e.g., a vector $x \in \mathbb{R}^{n}$ is written as $x = (x_1, x_2, \ldots, x_n)^{\mathsf{T}}$
$\mathbb{R}^{m \times n}$    The space of all $m$ by $n$ real matrices, e.g., a matrix $A \in \mathbb{R}^{m \times n}$ is written as $A = (a_{ij}) \in \mathbb{R}^{m \times n}$
$\|\cdot\|$    The vector norm or matrix norm in $\mathbb{R}^{n}$ or $\mathbb{R}^{n \times m}$
$\|\cdot\|_F$    The Frobenius matrix norm, which is the Euclidean norm of a matrix, is defined as $\|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^{2}}$ for $A = (a_{ij}) \in \mathbb{R}^{n \times m}$
$\in$    Belongs to
$\forall$    For all
$\Rightarrow$    Implies
$\Leftrightarrow$    Equivalent, or if and only if
$\otimes$    Kronecker product
$\emptyset$    The empty set
$\triangleq$    Equal to by definition
$C^{n}(\Omega)$    The class of functions having continuous $n$th derivative on $\Omega$
$\mathcal{L}_2(\Omega)$    The $\mathcal{L}_2$ space defined on $\Omega$, i.e., $\sqrt{\int_{\Omega} \|f(x)\|^{2}\,dx} < \infty$ for $f \in \mathcal{L}_2(\Omega)$
$\mathcal{L}_\infty(\Omega)$    The $\mathcal{L}_\infty$ space defined on $\Omega$, i.e., $\sup_{x \in \Omega} \|f(x)\| < \infty$ for $f \in \mathcal{L}_\infty(\Omega)$
$\lambda_{\min}(A)$    The minimum eigenvalue of matrix $A$
$\lambda_{\max}(A)$    The maximum eigenvalue of matrix $A$
$I_n$    The $n$ by $n$ identity matrix
$A > 0$    Matrix $A$ is positive definite
$\det(A)$    Determinant of matrix $A$
$A^{-1}$    The inverse of matrix $A$
$\mathrm{tr}(A)$    The trace of matrix $A$
$\mathrm{vec}(A)$    The vectorization mapping from matrix $A$ into an $mn$-dimensional column vector for $A \in \mathbb{R}^{m \times n}$
$\mathrm{diag}\{\zeta_i\}$    Also written as $\mathrm{diag}\{\zeta_1, \zeta_2, \ldots, \zeta_n\}$, which is an $n \times n$ diagonal matrix with diagonal elements $\zeta_1, \zeta_2, \ldots, \zeta_n$
$\tanh(x)$    The hyperbolic tangent function of $x$
$\mathrm{sgn}(x)$    The sign function of $x$, i.e., $\mathrm{sgn}(x) = 1$ for $x > 0$, $\mathrm{sgn}(0) = 0$, and $\mathrm{sgn}(x) = -1$ for $x < 0$
$\mathcal{A}(\Omega)$    The set of admissible controls on $\Omega$
$J$    Performance index, or cost-to-go, or cost function
$J^{\mu}$    Performance index, or cost-to-go, or cost function associated with the policy $\mu$
$J^{*}$    Optimal performance index function or optimal cost function
$V$    Value function, or performance index associated with a specific policy
$V^{*}$    Optimal value function
$L$    Lyapunov function
$W_f, Y_f$    NN weights for function approximation
$W_m, Y_m$    Model/identifier NN weights
$W_o, Y_o$    Observer NN weights
$W_c, Y_c$    Critic NN weights
$W_a, Y_a$    Action NN weights
Chapter 1
Overview of Adaptive Dynamic
Programming

1.1 Introduction

Big data, artificial intelligence (AI), and deep learning are the three most talked-about
topics in information technology lately. The recent emergence of deep learning
[10, 17, 38, 68, 88] has made neural networks (NNs) a hot research topic again.
Deep learning has also achieved huge success in almost every branch of AI, including
machine learning, pattern recognition, speech recognition, computer vision, and
natural language processing [17, 25, 26, 35, 74]. On the other hand, the study of
big data often uses AI technologies such as machine learning [80] and deep learning
[17]. One particular subject of study in AI, i.e., the computer game of Go, faced
a great challenge of dealing with vast amounts of data. The ancient Chinese board
game Go has been studied for years with the hope that one day, computer programs
can defeat human professional players. The Go board consists of a 19 × 19
grid of squares. At the beginning of the game, each of the two players has roughly
360 options for placing each stone. However, the number of potential legal board
positions grows exponentially, and it quickly becomes greater than the total number
of atoms in the whole universe [103]. A given game can therefore move in so many
directions that it is impossible for a computer to play by brute-force
computation of all possible outcomes.
Previous computer programs focused less on evaluating the state of the board
positions and more on speeding up simulations of how the game might play out.
The Monte Carlo tree search approach was often used in computer game programs;
at each step it randomly samples only some of the possible sequences of plays to
choose between candidate moves, instead of trying to calculate every possible
one. Google DeepMind, an AI company in London acquired by Google in 2014,
developed a program called AlphaGo [92] that has shown performance previously
thought to be impossible for at least a decade. Instead of exploring various sequences
of moves, AlphaGo learns to make a move by evaluating the strength of its position on
the board. Such an evaluation was made possible by NN’s deep learning capabilities.

© Springer International Publishing AG 2017 1


D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_1
Position evaluation (for approximating the optimal cost-to-go function of the
game) is the key to success of AlphaGo. Such ideas have been used previously
by many researchers in computer games, such as backgammon (TD-Gammon) [100,
101], checkers [87], othello [13], and chess [16]. A reinforcement learning technique
called TD(λ) was employed in AlphaGo and TD-Gammon for position evaluation.
With TD-Gammon, the program has learned to play backgammon at a grandmaster
level [100, 101]. On the other hand, AlphaGo has defeated European Go champion
Fan Hui (professional 2 dan) by 5 games to 0 [92] and defeated world Go champion
Lee Sedol (professional 9 dan) by 4 games to 1 [71, 111].
The success of reinforcement learning (RL) technique in this case relied on NN’s
deep learning capabilities [10, 17, 38, 68, 88]. The NNs used in AlphaGo have a
deep structure with 13 layers. Even though there were reports on the use of RL and
related techniques for the computer game of Go [15, 89, 93, 137, 138], it is only
with AlphaGo [92] that value networks were obtained using deep NNs to achieve
high evaluation accuracy. Although position evaluation [8, 20, 24, 89, 93]
and deep learning [18, 66, 102] had been considered for building programs to play
the game of Go, none of them achieved the level of success of AlphaGo [92]. The
match of AlphaGo versus Lee Sedol in March 2016 is a history-making event and
a milestone in the quest for AI. The defeat of a top human player by a machine has also
generated huge public interest in AI technology around the world, especially in
China, Korea, US, and UK [111]. It will have a lasting impact on the research in AI,
deep learning, and RL [142].
RL is a very useful tool in solving optimization problems by employing the princi-
ple of optimality from dynamic programming (DP). In particular, in control systems
community, RL is an important approach to handle optimal control problems for
unknown nonlinear systems. DP provides an essential foundation for understanding
RL. Actually, most of the methods of RL can be viewed as attempts to achieve much
the same effect as DP, with less computation and without assuming a perfect model
of the environment. One class of RL methods is built upon the actor-critic structure,
namely adaptive critic designs, where an actor component applies an action or control
policy to the environment, and a critic component assesses the value of that action
and the state resulting from it. The combination of DP, NN, and actor-critic structure
results in the adaptive dynamic programming (ADP) algorithms.
The present book studies ADP with applications to optimal control. Significant
effort will be devoted to building value functions, which indicate how good
the predicted system performance is and which in turn are used for developing the
optimal control strategy. Both RL and ADP provide approximate solutions to dynamic
programming, and they are closely related to each other. Therefore, there has been a
trend to consider the two together as ADPRL (ADP and RL). Examples include IEEE
International Symposium on Adaptive Dynamic Programming and Reinforcement
Learning (started in 2007), IEEE CIS Technical Committee on Adaptive Dynamic
Programming and Reinforcement Learning (started in 2008), and a survey article
[42] published in 2009. A brief overview of RL will be given in the next section,
followed by a more detailed overview of ADP. We review both the basic forms of
ADP as well as iterative forms. A few related books will be briefly reviewed before
the end of this chapter.

1.2 Reinforcement Learning

The main research results in RL can be found in the book by Sutton and Barto
[98] and the references cited in the book. Even though both RL and the main topic
studied in the present book, i.e., ADP, provide approximate solutions to dynamic
programming, research in these two directions has been somewhat independent [7]
in the past. The most famous algorithms in RL are the temporal difference algorithm
[97] and the Q-learning algorithm [112, 113]. Compared to ADP, the area of RL is
more mature and has a vast amount of literature (cf. [27, 34, 47, 98]).
An RL system typically consists of the following four components: $\{S, A, R, F\}$,
where $S$ is the set of states, $A$ is the set of actions, $R$ is the set of scalar reinforcement
signals or rewards, and $F$ is the function describing the transition from one state
to the next under a given action, i.e., $F : S \times A \to S$. A policy $\pi$ is defined as a
mapping $\pi : S \to A$. At any given time $t$, the system can be in a state $s_t \in S$, take
an action $a_t \in A$ determined by the policy $\pi$, i.e., $a_t = \pi(s_t)$, transition to the next
state $s' = s_{t+1}$, which is denoted by $s_{t+1} = F(s_t, a_t)$, and at the same time, receive a
reward signal $r_{t+1} = r(s_t, a_t, s_{t+1}) \in R$. The goal of RL is to determine a policy to
maximize the accumulated reward starting from initial state $s_0$ at $t = 0$.
An RL task always involves estimating some kind of value functions. A value
function estimates how good it is to be in a given state $s$ and it is defined as

$$V^{\pi}(s) = \sum_{k=0}^{\infty} \gamma^{k} r_{k+1}\Big|_{s_0=s} = \sum_{k=0}^{\infty} \gamma^{k} r(s_k, a_k, s_{k+1})\Big|_{s_0=s},$$

where $0 \le \gamma \le 1$ is a discount factor and $a_k = \pi(s_k)$ and $s_{k+1} = F(s_k, a_k)$ for
$k = 0, 1, \ldots$. The definition of $V^{\pi}(s)$ can also be considered to start from $s_t$, i.e.,

$$V^{\pi}(s) = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\Big|_{s_t=s} = \sum_{k=0}^{\infty} \gamma^{k} r(s_{t+k}, a_{t+k}, s_{t+k+1})\Big|_{s_t=s}, \tag{1.2.1}$$

where $a_{t+k} = \pi(s_{t+k})$ and $s_{t+k+1} = F(s_{t+k}, a_{t+k})$ for $k = 0, 1, \ldots$. $V^{\pi}(s)$ is referred
to as the state-value function for policy $\pi$. On the other hand, the action-value function
for policy $\pi$ estimates how good it is to perform a given action $a$ in a given state $s$
under the policy $\pi$ and it is defined as

$$Q^{\pi}(s, a) = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\Big|_{s_t=s,\,a_t=a} = \sum_{k=0}^{\infty} \gamma^{k} r(s_{t+k}, a_{t+k}, s_{t+k+1})\Big|_{s_t=s,\,a_t=a}, \tag{1.2.2}$$

where $a_{t+k} = \pi(s_{t+k})$ for $k = 1, 2, \ldots$, and $s_{t+k+1} = F(s_{t+k}, a_{t+k})$ for $k = 0, 1, \ldots$.
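The definitions (1.2.1) and (1.2.2) can be evaluated directly by rolling the deterministic system forward and summing discounted rewards. The following is only a minimal sketch: the four-state chain, its reward, and the names `F`, `r`, `state_value`, `action_value`, and `always_right` are hypothetical illustrations that do not come from the book, and the infinite sum is truncated at a finite horizon.

```python
def F(s, a):
    """Hypothetical transition function: a 4-state chain where state 3 is absorbing."""
    return s if s == 3 else min(max(s + a, 0), 3)


def r(s, a, s_next):
    """Hypothetical reward: 1 when the chain first enters state 3, otherwise 0."""
    return 1.0 if (s != 3 and s_next == 3) else 0.0


def state_value(s, policy, gamma=0.9, horizon=200):
    """Truncated sum in (1.2.1): sum over k of gamma^k * r(s_{t+k}, a_{t+k}, s_{t+k+1})."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s_next = F(s, a)
        total += discount * r(s, a, s_next)
        discount *= gamma
        s = s_next
    return total


def action_value(s, a, policy, gamma=0.9, horizon=200):
    """Truncated sum in (1.2.2): take action a first, then follow the policy."""
    s_next = F(s, a)
    return r(s, a, s_next) + gamma * state_value(s_next, policy, gamma, horizon - 1)


def always_right(s):
    """Hypothetical policy that always moves toward state 3."""
    return +1


v0 = state_value(0, always_right)            # the reward arrives at k = 2
q0_left = action_value(0, -1, always_right)  # one wasted step, then follow the policy
```

In this sketch the reward is collected two steps after leaving state 0, so `v0` is approximately $\gamma^2$, and taking the wrong first action delays it by one more step.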


The goal of RL is to find a policy that is optimal, i.e., a policy that is better than or
equal to all other policies. The optimal policy may not be unique, and all the optimal
policies are denoted by $\pi^{*}$. They have the same optimal state-value function defined
by

$$V^{*}(s) = \sup_{\pi} \{V^{\pi}(s)\}, \tag{1.2.3}$$

or the same optimal action-value function defined by

$$Q^{*}(s, a) = \sup_{\pi} \{Q^{\pi}(s, a)\}. \tag{1.2.4}$$

They satisfy the Bellman optimality equation (or, Bellman equation)

$$V^{*}(s) = \max_{a} \{r(s, a, s') + \gamma V^{*}(s')\}, \tag{1.2.5}$$

or

$$Q^{*}(s, a) = r(s, a, s') + \gamma \max_{a'} \{Q^{*}(s', a')\}. \tag{1.2.6}$$

The optimal policy is determined as follows

$$\pi^{*}(s) = \arg\max_{a} \{r(s, a, s') + \gamma V^{*}(s')\}, \tag{1.2.7}$$

or

$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a). \tag{1.2.8}$$
If we were to consider a stochastic process instead of a deterministic process, the
transition from state $s$ to the next state $s'$ is described by a probability distribution

$$P(s' \mid s, a) = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\},$$

and the expected value of the next reward is given by

$$r(s, a, s') = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\},$$

where $E\{\cdot\}$ is the expected value. In this case, (1.2.1) is rewritten as

$$V^{\pi}(s) = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}, \tag{1.2.9}$$

where the expected value $E_{\pi}\{\cdot\}$ is taken under the policy $\pi$.


This book will not consider stochastic processes. Therefore, all of our descriptions
are for deterministic systems, including the introduction to RL in this section.
Next, we consider how to compute the state-value function $V^{\pi}$ for an arbitrary
policy $\pi$. This procedure is called policy evaluation in the literature, which is described
by

$$V^{\pi}(s) = r(s, a, s') + \gamma V^{\pi}(s') = r(s, \pi(s), F(s, \pi(s))) + \gamma V^{\pi}(F(s, \pi(s))). \tag{1.2.10}$$

The value function can also be determined using iterative policy evaluation starting
from an arbitrary initial approximation $V_0(\cdot)$,

$$V_{i+1}(s) = r(s, a, s') + \gamma V_i(s'), \quad i = 0, 1, 2, \ldots. \tag{1.2.11}$$

The goal is to obtain the value function $V^{\pi}(s)$ given policy $\pi$.

After $V^{\pi}(s)$ is obtained or evaluated, we obtain an improved policy by

$$\pi'(s) = \arg\max_{a} \{r(s, a, s') + \gamma V^{\pi}(s')\}. \tag{1.2.12}$$

The policy iteration procedure involves computation cycles between policy evaluation
(1.2.10) or (1.2.11) and policy improvement (1.2.12) in order to find the optimal
policy.
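The cycle between (1.2.11) and (1.2.12) can be sketched in a few lines for a small tabular problem. The four-state chain, the reward, the fixed number of evaluation sweeps, and all names below (`STATES`, `ACTIONS`, `F`, `r`, `evaluate`, `improve`, `policy_iteration`) are assumptions made for illustration and are not taken from the book.

```python
STATES = range(4)
ACTIONS = (-1, +1)
GAMMA = 0.9


def F(s, a):
    """Hypothetical transition function: a 4-state chain where state 3 is absorbing."""
    return s if s == 3 else min(max(s + a, 0), 3)


def r(s, a, s_next):
    """Hypothetical reward: 1 when the chain first enters state 3, otherwise 0."""
    return 1.0 if (s != 3 and s_next == 3) else 0.0


def backup(s, a, V):
    """The quantity r(s, a, s') + gamma * V(s') with s' = F(s, a)."""
    s_next = F(s, a)
    return r(s, a, s_next) + GAMMA * V[s_next]


def evaluate(policy, sweeps=500):
    """Iterative policy evaluation (1.2.11), starting from V_0 = 0."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: backup(s, policy[s], V) for s in STATES}
    return V


def improve(V):
    """Policy improvement (1.2.12): greedy action with respect to V."""
    return {s: max(ACTIONS, key=lambda a: backup(s, a, V)) for s in STATES}


def policy_iteration():
    policy = {s: -1 for s in STATES}  # arbitrary initial policy
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:      # no change: the policy is greedy w.r.t. its own value
            return policy, V
        policy = new_policy


policy, V = policy_iteration()
```

For this toy chain the loop stops after a handful of improvement steps with the policy that moves right in states 0, 1, and 2.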
Another way to determine the optimal policy is to use the value iteration procedure.
In this case, for each $i$, $i = 0, 1, \ldots$, the state-value functions $V_i$ for a policy $\pi_i$ can
also be computed using value iteration described by value function update

$$V_{i+1}(s) = \max_{a} \{r(s, a, s') + \gamma V_i(s')\}, \tag{1.2.13}$$

and policy improvement

$$\pi_{i+1}(s) = \arg\max_{a} \{r(s, a, s') + \gamma V_i(s')\}. \tag{1.2.14}$$

The computation shall continue until convergence, e.g., when $|V_{i+1}(s) - V_i(s)| \le \varepsilon$,
$\forall s$, to obtain $V^{*}(s) \approx V_{i+1}(s)$, where $\varepsilon$ is a small positive number. Then, the
optimal policy is approximated as $\pi^{*}(s) \approx \pi_{i+1}(s)$.
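A minimal sketch of the value iteration loop (1.2.13) and (1.2.14) with the stopping rule $|V_{i+1}(s) - V_i(s)| \le \varepsilon$ might look as follows. The four-state chain and every name in the code are hypothetical and serve only to make the example runnable.

```python
STATES = range(4)
ACTIONS = (-1, +1)
GAMMA = 0.9


def F(s, a):
    """Hypothetical transition function: a 4-state chain where state 3 is absorbing."""
    return s if s == 3 else min(max(s + a, 0), 3)


def r(s, a, s_next):
    """Hypothetical reward: 1 when the chain first enters state 3, otherwise 0."""
    return 1.0 if (s != 3 and s_next == 3) else 0.0


def backup(s, a, V):
    """The quantity r(s, a, s') + gamma * V(s') with s' = F(s, a)."""
    s_next = F(s, a)
    return r(s, a, s_next) + GAMMA * V[s_next]


def value_iteration(eps=1e-10):
    V = {s: 0.0 for s in STATES}                                  # V_0
    while True:
        # Value function update (1.2.13)
        V_new = {s: max(backup(s, a, V) for a in ACTIONS) for s in STATES}
        # Policy improvement (1.2.14), computed from V_i
        pi_new = {s: max(ACTIONS, key=lambda a: backup(s, a, V)) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) <= eps:      # convergence test
            return V_new, pi_new
        V = V_new


V_star, pi_star = value_iteration()
```

Unlike policy iteration, no policy is evaluated to convergence here; each sweep applies a single maximizing backup to every state.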
The temporal difference (TD) method [97] is developed to estimate the value
function for a given policy. The TD algorithm is described by

$$V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)], \tag{1.2.15}$$

or

$$V_{i+1}(s_t) = V_i(s_t) + \alpha [r_{t+1} + \gamma V_i(s_{t+1}) - V_i(s_t)], \quad i = 0, 1, 2, \ldots, \tag{1.2.16}$$

where $\alpha > 0$ is a step size. This algorithm is also called TD(0) (compared to TD(λ)
to be introduced later). The update in (1.2.15) is obtained according to the following
general formula:

$$\text{NewValue} \leftarrow \text{OldValue} + \text{Update} = \text{OldValue} + \text{StepSize} \times (\text{Target} - \text{OldValue}),$$

which indicates a step toward the 'Target.'
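The TD(0) update (1.2.15) can be sketched as a tabular learner that only sees transitions along trajectories. The four-state chain, the fixed policy, the step size of 0.5, and the names `F`, `r`, `always_right`, and `td0` are assumptions for illustration, not part of the book.

```python
TERMINAL = 3


def F(s, a):
    """Hypothetical transition function: a 4-state chain ending at state 3."""
    return min(max(s + a, 0), 3)


def r(s, a, s_next):
    """Hypothetical reward: 1 on reaching the terminal state, otherwise 0."""
    return 1.0 if s_next == TERMINAL else 0.0


def always_right(s):
    """Hypothetical policy to be evaluated."""
    return +1


def td0(policy, episodes=200, alpha=0.5, gamma=0.9):
    V = {s: 0.0 for s in range(4)}
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            a = policy(s)
            s_next = F(s, a)
            target = r(s, a, s_next) + gamma * V[s_next]  # the 'Target'
            V[s] += alpha * (target - V[s])               # OldValue + StepSize * (Target - OldValue)
            s = s_next
    return V


V_td = td0(always_right)
```

Because the toy system is deterministic, the estimates settle at the values given by (1.2.10); in a noisy setting a constant step size would leave residual fluctuation.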


Q-learning [112, 113] is an off-policy$^{1}$ [98] TD algorithm and it is described by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[r_{t+1} + \gamma \max_{a} \{Q(s_{t+1}, a)\} - Q(s_t, a_t)\big], \tag{1.2.17}$$

or

$$Q_{i+1}(s_t, a_t) = Q_i(s_t, a_t) + \alpha \big[r_{t+1} + \gamma \max_{a} \{Q_i(s_{t+1}, a)\} - Q_i(s_t, a_t)\big], \quad i = 0, 1, 2, \ldots. \tag{1.2.18}$$

Based on the value function obtained above, an improved policy can be determined
as

$$\pi'(s_t) = \arg\max_{a} Q(s_t, a), \tag{1.2.19}$$

or

$$\pi_{i+1}(s_t) = \arg\max_{a} Q_{i+1}(s_t, a). \tag{1.2.20}$$
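As a rough sketch of (1.2.17) and (1.2.19), the tabular learner below follows a uniformly random behavior policy while its update target uses the greedy action, which is what makes the method off-policy. The four-state chain, the random behavior policy, the seed, and all names are hypothetical choices made for this example.

```python
import random

ACTIONS = (-1, +1)
TERMINAL = 3


def F(s, a):
    """Hypothetical transition function: a 4-state chain ending at state 3."""
    return min(max(s + a, 0), 3)


def r(s, a, s_next):
    """Hypothetical reward: 1 on reaching the terminal state, otherwise 0."""
    return 1.0 if s_next == TERMINAL else 0.0


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            a = rng.choice(ACTIONS)                           # behavior policy: uniformly random
            s_next = F(s, a)
            best_next = max(Q[(s_next, b)] for b in ACTIONS)  # greedy value used in the target
            Q[(s, a)] += alpha * (r(s, a, s_next) + gamma * best_next - Q[(s, a)])  # (1.2.17)
            s = s_next
    # Improved policy (1.2.19): greedy with respect to the learned Q
    greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
    return Q, greedy


Q, greedy = q_learning()
```

Even though the agent never acts greedily while learning, the greedy policy read off from `Q` at the end moves toward the rewarding state.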

A modification to the above Q-learning algorithm so that it becomes an on-policy$^{2}$
[82] algorithm is called Sarsa [96], whose name comes from the fact that the algorithm uses
the following five elements of the problem: $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$. It updates the
action-value function as follows

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)], \tag{1.2.21}$$

or

$$Q_{i+1}(s_t, a_t) = Q_i(s_t, a_t) + \alpha [r_{t+1} + \gamma Q_i(s_{t+1}, a_{t+1}) - Q_i(s_t, a_t)], \quad i = 0, 1, 2, \ldots. \tag{1.2.22}$$
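The only change from Q-learning is that the target in (1.2.21) uses the action $a_{t+1}$ actually selected by the policy being followed. A minimal sketch, assuming an ε-greedy policy (an assumption of this example, since the text does not fix the policy), the same hypothetical four-state chain, and illustrative names throughout:

```python
import random

ACTIONS = (-1, +1)
TERMINAL = 3


def F(s, a):
    """Hypothetical transition function: a 4-state chain ending at state 3."""
    return min(max(s + a, 0), 3)


def r(s, a, s_next):
    """Hypothetical reward: 1 on reaching the terminal state, otherwise 0."""
    return 1.0 if s_next == TERMINAL else 0.0


def epsilon_greedy(Q, s, rng, eps):
    """Assumed exploration rule: random action with probability eps, else greedy."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])


def sarsa(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, rng, eps)
        while s != TERMINAL:
            s_next = F(s, a)
            a_next = epsilon_greedy(Q, s_next, rng, eps)
            # (1.2.21): uses the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
            Q[(s, a)] += alpha * (r(s, a, s_next) + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q


Q_sarsa = sarsa()
```

Because the target depends on the exploratory action, the learned values describe the ε-greedy policy itself rather than the optimal policy.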

A more general TD algorithm called TD(λ) [97] has been quite popular [8, 13, 15,
16, 20, 24, 87, 89, 92, 93, 100, 101, 137, 138]. Using TD(λ), the update equation
(1.2.15) now becomes

$$V_{t+1}(s) = V_t(s) + \Delta V_t(s) = V_t(s) + \alpha z_t(s) [r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)]. \tag{1.2.23}$$

In the above, $z_t(s)$ is defined as the eligibility trace of state $s$ given by

$$z_t(s) = \begin{cases} \gamma \lambda z_{t-1}(s) + 1, & \text{if } s = s_t, \\ \gamma \lambda z_{t-1}(s), & \text{if } s \ne s_t, \end{cases}$$

where $z_0(s) = 0$, $\forall s$, and $0 \le \lambda \le 1$ is called the trace-decay parameter.
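The eligibility-trace update (1.2.23) can be sketched as follows. Every state is updated at every step in proportion to its trace. The four-state chain, the policy, the parameter values, and the choice to reset the traces to zero at the start of each episode are assumptions of this example.

```python
TERMINAL = 3
STATES = range(4)


def F(s, a):
    """Hypothetical transition function: a 4-state chain ending at state 3."""
    return min(max(s + a, 0), 3)


def r(s, a, s_next):
    """Hypothetical reward: 1 on reaching the terminal state, otherwise 0."""
    return 1.0 if s_next == TERMINAL else 0.0


def always_right(s):
    """Hypothetical policy to be evaluated."""
    return +1


def td_lambda(policy, episodes=200, alpha=0.5, gamma=0.9, lam=0.8):
    V = {s: 0.0 for s in STATES}
    for _ in range(episodes):
        z = {s: 0.0 for s in STATES}        # z_0(s) = 0 for all s (reset per episode: assumption)
        s = 0
        while s != TERMINAL:
            a = policy(s)
            s_next = F(s, a)
            delta = r(s, a, s_next) + gamma * V[s_next] - V[s]  # TD error at time t
            for x in STATES:
                z[x] *= gamma * lam         # decay every trace
            z[s] += 1.0                     # bump the trace of the visited state
            for x in STATES:
                V[x] += alpha * z[x] * delta  # (1.2.23) applied to all states
            s = s_next
    return V


V_lam = td_lambda(always_right)
```

Setting `lam=0.0` leaves only the visited state with a nonzero trace, which reduces the loop to the TD(0) update (1.2.15).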

1 Off-policy learning is defined as evaluating one policy while following another.


2 On-policy learning estimates the value of a policy while using it for control.
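Similar ideas can be applied to Sarsa [82] to obtain the following Sarsa(λ) update
equation

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha z_t(s, a) [r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)], \tag{1.2.24}$$

where

$$z_t(s, a) = \begin{cases} \gamma \lambda z_{t-1}(s, a) + 1, & \text{if } s = s_t \text{ and } a = a_t, \\ \gamma \lambda z_{t-1}(s, a), & \text{otherwise,} \end{cases}$$

with $z_0(s, a) = 0$, $\forall s, a$. It can also be applied to Q-learning [112, 113] to obtain the
following Q(λ) update equation

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha \delta_t z_t(s, a), \tag{1.2.25}$$

where

$$\delta_t = r_{t+1} + \gamma \max_{a'} \{Q_t(s_{t+1}, a')\} - Q_t(s_t, a_t),$$

$$z_t(s, a) = I_{s, s_t} I_{a, a_t} + \begin{cases} \gamma \lambda z_{t-1}(s, a), & \text{if } Q_{t-1}(s_t, a_t) = \max_{a} \{Q_{t-1}(s_t, a)\}, \\ 0, & \text{otherwise,} \end{cases}$$

$$I_{x, y} = \begin{cases} 1, & \text{if } x = y, \\ 0, & \text{otherwise,} \end{cases}$$

and $z_0(s, a) = 0$, $\forall s, a$.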


TD, Q-learning, Sarsa, TD(λ), Sarsa(λ), and Q(λ) are developed to obtain the
value functions V (s) and Q(s, a). The next step after the value function is obtained
is to determine a policy or update an existing policy according to (1.2.12), (1.2.14),
(1.2.19), or (1.2.20) (cf. (1.2.7) and (1.2.8)). The process can be repeated until an
optimal policy is obtained. Such a procedure solves the Bellman equation (1.2.5) or
(1.2.6) iteratively to provide approximate solutions.

1.3 Adaptive Dynamic Programming

There are several schemes of dynamic programming [9, 11, 23, 41]. One can con-
sider discrete-time systems or continuous-time systems, linear systems or nonlin-
ear systems, time-invariant systems or time-varying systems, deterministic systems
or stochastic systems, etc. Discrete-time (deterministic) nonlinear (time-invariant)
dynamical systems will be discussed first. Time-invariant nonlinear systems cover
most of the application areas and discrete-time is the basic consideration for digital
implementation.
Consider the following discrete-time nonlinear systems described by

$$x_{k+1} = F(x_k, u_k), \quad k = 0, 1, 2, \ldots, \tag{1.3.1}$$

where $x_k \in \mathbb{R}^{n}$ is the state vector, $u_k \in \mathbb{R}^{m}$ is the control vector, and
$F : \mathbb{R}^{n} \times \mathbb{R}^{m} \to \mathbb{R}^{n}$ is the nonlinear system function. Starting from an initial state
$x_0 \in \mathbb{R}^{n}$, one needs to choose a control sequence $u_k$, $k = 0, 1, 2, \ldots$, such that certain
objectives can be achieved when the control sequence is applied to system (1.3.1).
Suppose that one associates with system (1.3.1) the following performance index
(or cost-to-go, or simply, cost)

$$J(x_k, \underline{u}_k) = \sum_{i=k}^{\infty} \gamma^{i-k} U(x_i, u_i), \tag{1.3.2}$$

where $\underline{u}_k = (u_k, u_{k+1}, \ldots)$ denotes the control sequence starting at time $k$, $U(\cdot, \cdot) \ge 0$
is called the utility function and $\gamma$ is the discount factor with $0 < \gamma \le 1$. Note that the
function $J$ is dependent on the initial time $k$ and the initial state $x_k$, and it is referred to
as the cost-to-go of state $x_k$. The cost in this case accumulates indefinitely; problems
of this kind are referred to as infinite-horizon problems in dynamic programming. On
the other hand, in finite-horizon problems, the cost accumulates over a finite number
of steps. Very often, it is desired to determine $\underline{u}_0 = (u_0, u_1, \ldots)$ so that $J(x_0, \underline{u}_0)$ is
optimized (i.e., maximized or minimized). We will use $\underline{u}_0^{*} = (u_0^{*}, u_1^{*}, \ldots)$ and $J^{*}(x_0)$
to denote the optimal control sequence and the optimal cost function, respectively.
The objective of dynamic programming problem in this book is to determine a
control sequence uk∗ , k = 0, 1, . . . , so that the function J (i.e., the cost) in (1.3.2) is
minimized. The optimal cost function is defined as

J ∗ (x0 ) = inf J(x0 , u0 ) = J(x0 , u∗0 ),


u0

which is dependent upon the initial state x0 .


The control action may be determined as a function of the state. In this case,
we write uk = u(xk ), ∀k. Such a relationship, or mapping u : Rn → Rm , is called
feedback control, or control policy, or policy. It is also called control law. For a given
control policy $\mu$, the cost function in (1.3.2) is rewritten as

\[ J^{\mu}(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} U(x_i, \mu(x_i)), \tag{1.3.3} \]

which is the cost function for system (1.3.1) starting at $x_k$ when the policy $u_k = \mu(x_k)$
is applied. The optimal cost for system (1.3.1) starting at $x_0$ is determined as

\[ J^*(x_0) = \inf_{\mu} J^{\mu}(x_0) = J^{\mu^*}(x_0), \]

where $\mu^*$ indicates the optimal policy.
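As a concrete illustration of (1.3.3), the cost of a fixed policy can be estimated by simulating the closed-loop system and truncating the infinite sum. The following is a minimal Python sketch; the scalar system, the utility, the policy gain, and the helper name `policy_cost` are hypothetical choices made for illustration and are not taken from the text.

```python
def policy_cost(F, U, mu, x0, gamma=0.95, horizon=500):
    """Truncated estimate of J^mu(x0) = sum_i gamma^i * U(x_i, mu(x_i))."""
    x, cost, discount = x0, 0.0, 1.0
    for _ in range(horizon):
        u = mu(x)                  # feedback control u_k = mu(x_k)
        cost += discount * U(x, u)
        x = F(x, u)                # system equation (1.3.1)
        discount *= gamma
    return cost


# Hypothetical scalar example (an assumption, not from the book).
F = lambda x, u: 0.9 * x + u       # system function
U = lambda x, u: x * x + u * u     # nonnegative utility
mu = lambda x: -0.4 * x            # fixed linear policy

J_mu = policy_cost(F, U, mu, x0=1.0)

# For this example the closed loop is x_{k+1} = 0.5 x_k, so the sum is a
# geometric series with the closed form below.
closed_form = 1.16 / (1.0 - 0.95 * 0.25)
```

Because the closed loop is a contraction and the utility is quadratic, the truncation error after 500 steps is negligible here; for slowly decaying trajectories or $\gamma$ close to 1, a longer horizon would be needed.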



Dynamic programming is based on Bellman’s principle of optimality [9, 11,
23, 41]: An optimal (control) policy has the property that no matter what previous
decisions have been, the remaining decisions must constitute an optimal policy with
regard to the state resulting from those previous decisions.
Suppose that one has computed the optimal cost $J^*(x_{k+1})$ from time $k + 1$ to the
terminal time, for all possible states $x_{k+1}$, and that one has also found the optimal
control sequences from time $k + 1$ to the terminal time. The optimal cost results
when the optimal control sequence $u_{k+1}^*, u_{k+2}^*, \ldots$, is applied to the system with
initial state $x_{k+1}$. Note that the optimal control sequence depends on $x_{k+1}$. If one
applies an arbitrary control action $u_k$ at time $k$ and then uses the known optimal
control sequence from $k + 1$ on, the resulting cost will be

\[ U(x_k, u_k) + \gamma U(x_{k+1}, u_{k+1}^*) + \gamma^2 U(x_{k+2}, u_{k+2}^*) + \cdots = U(x_k, u_k) + \gamma J^*(x_{k+1}), \]

where $x_k$ is the state of the system at time $k$ and $x_{k+1}$ is determined by (1.3.1).
According to Bellman, the optimal cost from time $k$ on is equal to

\[ J^*(x_k) = \min_{u_k} \{ U(x_k, u_k) + \gamma J^*(x_{k+1}) \} = \min_{u_k} \{ U(x_k, u_k) + \gamma J^*(F(x_k, u_k)) \}. \tag{1.3.4} \]

The optimal control $u_k^*$ at time $k$ is the $u_k$ that achieves this minimum, i.e.,

\[ u_k^* = \arg\min_{u_k} \{ U(x_k, u_k) + \gamma J^*(x_{k+1}) \}. \tag{1.3.5} \]

Equation (1.3.4) is the principle of optimality for discrete-time systems. Its impor-
tance lies in the fact that it allows one to optimize over only one control vector at a
time by working backward in time.
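The backward recursion implied by (1.3.4) can be checked on a small finite-horizon problem. The sketch below is an illustration under stated assumptions: the integer state grid, the saturating integrator `F`, the utility `U`, and the zero terminal cost are all hypothetical. It compares the backward recursion, which optimizes one control at a time, with a brute-force search over all control sequences.

```python
from itertools import product

STATES = range(-3, 4)      # hypothetical integer state grid
ACTIONS = (-1, 0, 1)       # hypothetical finite control set


def F(x, u):
    """Saturating integrator used as the system function."""
    return max(-3, min(3, x + u))


def U(x, u):
    """Nonnegative utility."""
    return x * x + abs(u)


def backward_dp(N, gamma=1.0):
    """Work backward from the terminal time using (1.3.4) and (1.3.5)."""
    J = {x: 0.0 for x in STATES}          # assumed terminal cost J_N = 0
    policy = []
    for _ in range(N):
        J_new, mu = {}, {}
        for x in STATES:
            best_u = min(ACTIONS, key=lambda u: U(x, u) + gamma * J[F(x, u)])
            mu[x] = best_u
            J_new[x] = U(x, best_u) + gamma * J[F(x, best_u)]
        J = J_new
        policy.insert(0, mu)              # policy[0] is the stage-0 control law
    return J, policy


def brute_force(x0, N, gamma=1.0):
    """Minimize the N-step cost over every control sequence."""
    best = float("inf")
    for seq in product(ACTIONS, repeat=N):
        x, cost = x0, 0.0
        for i, u in enumerate(seq):
            cost += gamma ** i * U(x, u)
            x = F(x, u)
        best = min(best, cost)
    return best


J0, policy = backward_dp(N=4)
```

The backward recursion evaluates 3 controls per state per stage, whereas the brute-force search evaluates $3^N$ sequences per initial state; this gap is what the principle of optimality buys, while the number of grid states is where the curse of dimensionality enters.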
Dynamic programming is a very useful tool in solving optimization and optimal
control problems. In particular, it can easily be applied to nonlinear systems with or
without constraints on the control and state variables. Equation (1.3.4) is called the
functional equation of dynamic programming or Bellman equation and is the basis
for computer implementation of dynamic programming. In the above, if the function
F in (1.3.1) and the cost function J in (1.3.2) are known, the solution for u∗ becomes
a simple optimization problem. However, it is often computationally untenable to
run true dynamic programming due to the backward numerical process required for
its solutions, i.e., as a result of the well-known “curse of dimensionality” [9, 23,
41]. The optimal cost function $J^*$, which is the theoretical solution to the Bellman
equation (1.3.4), is very difficult to obtain, except for systems satisfying favorable
conditions such as linear time-invariant systems. Over the years, progress has been
made to circumvent the curse of dimensionality by building a system, called critic, to
approximate the cost function in dynamic programming. The idea is to approximate
dynamic programming solutions using a function approximation structure such as
NNs to approximate the cost function. Such an approach is called adaptive dynamic
programming in this book, though it was previously called adaptive critic designs or
approximate dynamic programming.

1.3.1 Basic Forms of Adaptive Dynamic Programming

In 1977, Paul Werbos [124] introduced an approach for approximate dynamic pro-
gramming that was later called adaptive critic designs (ACDs). ACDs have received
increasing attention (cf. [2, 6, 7, 15, 21, 22, 36, 39, 50, 55, 65, 72, 73, 78, 79, 83,
90, 91, 104, 125–128, 137]). In the literature, there are several synonyms used for
“adaptive critic designs” including “approximate dynamic programming” [40, 76,
90, 128], “asymptotic dynamic programming” [83], “adaptive dynamic program-
ming” [72, 73, 139], “neuro-dynamic programming” [12], “neural dynamic pro-
gramming” [133], “relaxed dynamic programming” [46, 81], and “reinforcement
learning” [14, 40, 98, 106, 127]. No matter what we call it, in all these cases, the
goal is to approximate the solutions of dynamic programming. Because of this, the
term “approximate dynamic programming” has been quite popular in the past.
In this book, we will use the term ADP to represent “adaptive dynamic pro-
gramming” or “approximate dynamic programming.” ADP has potential applica-
tions in many fields, including controls, management, logistics, economy, military,
aerospace, etc. This book contains mostly applications in optimal control of nonlin-
ear systems. A typical design of ADP consists of three modules—critic, model, and
action [127, 128], as shown in Fig. 1.1. The critic network will give an estimation
of the cost function J, which is often a Lyapunov function, at least for some of the
deterministic systems.
Fig. 1.1 The three modules of ADP/ACD (the action network maps $x_k$ to $u_k$, the model network maps $(x_k, u_k)$ to $\hat{x}_{k+1}$, and the critic network maps $\hat{x}_{k+1}$ to $\hat{J}_{k+1}$)

The present book considers the case where each module is an NN (refer to, e.g.,
[5, 141, 145] for ADP implementations using fuzzy systems). In the ADP scheme
shown in Fig. 1.1, the critic network outputs the function $\hat{J}$, which is an estimate
of the function $J$ in (1.3.2). This is done by minimizing the following square error
measure over time

\[ E_h = \frac{1}{2} \sum_k E_k^2 = \frac{1}{2} \sum_k \big( \hat{J}_k - U_k - \gamma \hat{J}_{k+1} \big)^2, \tag{1.3.6} \]

where Ĵk = Ĵ(xk , Wc ) and Wc represents the parameters of the critic network. The
function Uk is the same utility function as the one in (1.3.2) which indicates the
performance of the overall system. The function Uk given in a problem is usually a
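function of $x_k$ and $u_k$, i.e., $U_k = U(x_k, u_k)$. When $E_k = 0$ for all $k$, (1.3.6) implies
that

\[
\begin{aligned}
\hat{J}_k &= U_k + \gamma \hat{J}_{k+1} \\
&= U_k + \gamma (U_{k+1} + \gamma \hat{J}_{k+2}) \\
&= \cdots \\
&= \sum_{i=k}^{\infty} \gamma^{i-k} U_i,
\end{aligned}
\tag{1.3.7}
\]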

which is exactly the same as the cost function in (1.3.2). It is therefore clear that
minimizing the error function in (1.3.6), we will have an NN trained so that its
output Ĵ becomes an estimate of the cost function J defined in (1.3.2).
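A minimal sketch of critic training in the sense of (1.3.6) is given below. It is an illustration only: the one-parameter critic $\hat{J}(x, W_c) = w x^2$ stands in for a neural network, the model, utility, and policy are hypothetical, and the quantity $U_k + \gamma \hat{J}_{k+1}$ is treated as a fixed training target during each update (an assumption about the training scheme, chosen here for simplicity).

```python
GAMMA = 0.95


def F_hat(x, u):
    """Hypothetical model network: predicts the next state."""
    return 0.9 * x + u


def U(x, u):
    """Utility function U_k = U(x_k, u_k)."""
    return x * x + u * u


def mu(x):
    """Hypothetical fixed control law whose cost is being estimated."""
    return -0.4 * x


def critic(x, w):
    """One-parameter stand-in for the critic network: J_hat(x, W_c) = w * x^2."""
    return w * x * x


def train_critic(w=0.0, lr=0.5, epochs=200):
    samples = [0.2, 0.4, 0.6, 0.8, 1.0]        # training states
    for _ in range(epochs):
        for x in samples:
            u = mu(x)
            # Target U_k + gamma * J_hat_{k+1}, held fixed for this update.
            target = U(x, u) + GAMMA * critic(F_hat(x, u), w)
            error = critic(x, w) - target       # E_k in (1.3.6)
            # Gradient step on 0.5 * E_k^2 with respect to w (target fixed);
            # d critic / d w = x^2.
            w -= lr * error * x * x
    return w


w_c = train_critic()

# For this example the exact cost of the policy is w_exact * x^2.
w_exact = 1.16 / (1.0 - GAMMA * 0.25)
```

In this example the error $E_k$ vanishes for all sampled states exactly when `w_c` equals `w_exact`, which mirrors the argument leading from (1.3.6) to (1.3.7).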
The model network in Fig. 1.1 learns the nonlinear function F given in equation
(1.3.1); it can be trained previously offline [79, 128], or trained in parallel with the
critic and action networks [83].
After the model network is obtained, the critic network will be trained. The critic
network gives an estimate of the cost function. The training of the critic network in
this case is achieved by minimizing the error function defined in (1.3.6), for which
many standard NN training algorithms can be utilized [29, 146]. Note that in Fig. 1.1,
the output of the critic network Ĵk+1 = Ĵ(x̂k+1 , Wc ) is an approximation to the cost
function J at time k + 1, where x̂k+1 is not a real trajectory but a prediction of the
states from the model network.
After the critic network’s training is finished, one can start the action network’s
training with the objective of minimizing Uk + γ Ĵk+1 , through the use of the control
signal uk = u(xk , Wa ), where Wa represents the parameters of the action network.
Once an action network is trained this way, we will have an NN trained so that it will
generate as its output an optimal, or at least a suboptimal, control signal, depending
on how good the performance of the critic network is. Recall that the goal of dynamic
programming is to obtain an optimal control sequence as in (1.3.5), which minimizes
the function J in (1.3.2). The key here is to interactively build a link between present
actions and future consequences via an estimate of the cost function.
After the action network’s training cycle is complete, one may check the system
performance, then stop or continue the training procedure by going back to the critic

network’s training cycle again, if the performance is not acceptable yet [78, 79].
This process will be repeated until an acceptable system performance is reached.
The three networks will be connected as shown in Fig. 1.1. As a part of the process,
the control signal $u_k$ is applied to the external environment, which yields a new
state $x_{k+1}$. Meanwhile, the model network gives an approximation of the next state,
$\hat{x}_{k+1}$. By minimizing $\|x_{k+1} - \hat{x}_{k+1}\|$, the model network can be trained.
The training of the action network is done through its parameter updates to mini-
mize the values of Uk + γ Ĵk+1 while keeping the parameters of the critic and model
networks fixed. The gradient information is propagated backward through the critic
network to the model network and then to the action network, as if the three networks
formed one large feedforward network (cf. Fig. 1.1). This implies that the model net-
work in Fig. 1.1 is required for the implementation of ADP in the present case. Even
in the case of a known function F, one still needs to build a model network so that the
action network can be trained by the backpropagation algorithm.
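The backward flow of gradient information through the critic, then the model, then the action network can be sketched with scalar stand-ins for the three networks. Everything below is a hypothetical illustration: the action network is $u = -kx$ with a single parameter $k$, the model is a known linear map, and the critic is a fixed quadratic with an assumed weight `P_CRITIC`.

```python
GAMMA = 0.95
P_CRITIC = 1.5     # assumed fixed critic weight: J_hat(x) = P_CRITIC * x^2


def train_action(k=0.0, lr=0.1, epochs=200):
    """Minimize U_k + gamma * J_hat_{k+1} over the action parameter k."""
    samples = [0.25, 0.5, 0.75, 1.0]              # training states
    for _ in range(epochs):
        for x in samples:
            u = -k * x                             # action network output
            x_next = 0.9 * x + u                   # model network prediction
            dJ_dxnext = 2.0 * P_CRITIC * x_next    # gradient through the critic
            dxnext_du = 1.0                        # gradient through the model
            du_dk = -x                             # gradient through the action network
            dU_du = 2.0 * u                        # direct term from U = x^2 + u^2
            grad = (dU_du + GAMMA * dJ_dxnext * dxnext_du) * du_dk
            k -= lr * grad                         # critic and model stay fixed
    return k


k_trained = train_action()

# Closed-form minimizer of u^2 + gamma * P * (0.9 x + u)^2 for this example.
k_star = 0.9 * GAMMA * P_CRITIC / (1.0 + GAMMA * P_CRITIC)
```

The factor `dxnext_du` is the only place where the model enters the update, which illustrates why a differentiable model is needed in this scheme.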
Two approaches for the training of the critic network are provided in [64]: a forward-
in-time approach and a backward-in-time approach. Figure 1.2 shows the diagram of
the forward-in-time approach. In this approach, we view $\hat{J}_k$ in (1.3.6) as the output of the
critic network to be trained and choose $U_k + \gamma \hat{J}_{k+1}$ as the training target. Note that
$\hat{J}_k$ and $\hat{J}_{k+1}$ are obtained using state variables at different time instants. Figure 1.3
shows the diagram of the backward-in-time approach. In this approach, we view $\hat{J}_{k+1}$ in
(1.3.6) as the output of the critic network to be trained and choose $(\hat{J}_k - U_k)/\gamma$ as
the training target. Note that both forward-in-time and backward-in-time approaches
try to minimize the error measure in (1.3.6) and satisfy the requirement in (1.3.7). In
Figs. 1.2 and 1.3, $\hat{x}_{k+1}$ is the output from the model network.
Fig. 1.2 Forward-in-time approach

Fig. 1.3 Backward-in-time approach

From the TD algorithm in (1.2.15), we can see that the learning objective is to
minimize $|r_{t+1} + \gamma V(s_{t+1}) - V(s_t)|$ by using $r_{t+1} + \gamma V(s_{t+1})$ as the learning target.
This gives the same idea as in the forward-in-time approach shown in Fig. 1.2, where
the target is $U_k + \gamma \hat{J}_{k+1}$. The only difference is the definition of reward function.
In (1.2.15), it is defined as $r_{t+1} = r(s_t, a_t, s_{t+1})$, whereas in Fig. 1.2, it is defined as
$U_k = U(x_k, u_k)$, where the current times are $t$ and $k$, respectively. We will make clear
later the reason behind this one-step time difference between $r_{t+1}$ and $U_k$. Even in
TD($\lambda$) given by (1.2.23), the same learning objective is utilized. However, in TD and
TD($\lambda$), the update of value functions at each step only makes a move according to
the step size toward the target, and presumably, it does not reach the target. On the
other hand, in the forward-in-time and backward-in-time approaches, the training
will only be performed for a certain number of steps, e.g., 3–5 steps [39] or 50 steps
[91]. Such a move may or may not reach the target, but it will certainly lead to a move
in the direction of the target.
The two most important advances of ADP for control start with [79] and [91].
Reference [79] provides a detailed summary of the major developments in ADP up to
1997. Before that, major references are papers by Werbos such as [124–128].
Reference [91] makes significant contributions to model-free ADP. Using the approach of
[91], the model network in Fig. 1.1 is not needed anymore. Several practical examples
are included in [91] for demonstration, including a single inverted pendulum
and a triple inverted pendulum. The training approach of [91] can be considered as
a backward-in-time approach. Reference [64] is also about model-free ADP. It is
a model-free, action-dependent adaptive critic design since we can view the model
network and the critic network together as another NN, which we still call a critic
network, as illustrated in Fig. 1.4.
Fig. 1.4 Definition of a new critic network for model-free ADP (left: action, model, and critic networks with output $\hat{J}_{k+1}$; right: action network and a new critic network with output $Q_k$)

The model-free ADP has been called action-dependent ACDs by Werbos. According
to [79, 128], ADP approaches were classified into several main schemes: heuristic
dynamic programming (HDP), action-dependent HDP (ADHDP; note the prefix
“action-dependent” (AD) used hereafter), dual heuristic dynamic programming
(DHP), ADDHP, globalized DHP (GDHP), and ADGDHP. HDP is the basic version
of ADP, which is described in Fig. 1.1 or the left side of Fig. 1.4. According to Werbos
[128], TD algorithms share the same idea as that of HDP to use the same learning
objective. On the other hand, the equivalence of ADHDP and Q-learning has been
argued by Werbos [128] as well. Both ADHDP and Q-learning (including Sarsa as
well) use value functions that are functions of the state and action. For ADHDP,
which is described in the right side of Fig. 1.4, the critic network training is done by
minimizing the following square error measure over time

\[ E_q = \frac{1}{2} \sum_k E_{qk}^2 = \frac{1}{2} \sum_k \big( Q_{k-1} - U_k - \gamma Q_k \big)^2, \tag{1.3.8} \]

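where $Q_k = Q(x_k, u_k, W_{qc})$ and $W_{qc}$ represents the parameters of the critic network.
When $E_{qk} = 0$ for all $k$, (1.3.8) implies that

\[
\begin{aligned}
Q_k &= U_{k+1} + \gamma Q_{k+1} \\
&= U_{k+1} + \gamma (U_{k+2} + \gamma Q_{k+2}) \\
&= \cdots \\
&= \sum_{i=k+1}^{\infty} \gamma^{i-k-1} U_i,
\end{aligned}
\tag{1.3.9}
\]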

which is exactly the cost function Ĵk+1 defined in (1.3.7). From Fig. 1.4, we can see
that Qk = Q(xk , uk , Wqc ) is equivalent to Ĵk+1 = Ĵ(x̂k+1 , Wc ) = Ĵ(F̂(xk , uk ), Wc ).
The two outputs will be the same given the same inputs xk and uk . However, the two
relationships are different. In model-free ADP, the output Qk is explicitly a function
of xk and uk , while the model network F̂ becomes totally hidden and internal. On
the other hand, with model network, the output Ĵk+1 is an explicit function of x̂k+1
which in turn is a function of xk and uk through the model network F̂.
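The relationship between the two critics can be made concrete with a small numerical check. In the sketch below, all functions are hypothetical scalar stand-ins: `J_hat` is a critic that takes the predicted next state, `Q` is the action-dependent critic obtained by hiding the model inside it, and the constant `P` is chosen so that `J_hat` is the exact cost of the assumed policy. The residual $Q_{k-1} - U_k - \gamma Q_k$ from (1.3.8) is then evaluated along a closed-loop trajectory.

```python
GAMMA = 0.95
P = 1.16 / (1.0 - GAMMA * 0.25)    # exact cost weight for the assumed policy below


def F_hat(x, u):
    """Hypothetical model network."""
    return 0.9 * x + u


def U(x, u):
    """Utility function."""
    return x * x + u * u


def mu(x):
    """Hypothetical fixed control law."""
    return -0.4 * x


def J_hat(x):
    """Critic used together with a model network."""
    return P * x * x


def Q(x, u):
    """Action-dependent critic: the model is hidden inside the critic."""
    return J_hat(F_hat(x, u))


def residuals(x0=1.0, steps=10):
    """E_qk = Q_{k-1} - U_k - gamma * Q_k along a closed-loop trajectory."""
    out = []
    x_prev, u_prev = x0, mu(x0)
    for _ in range(steps):
        x = F_hat(x_prev, u_prev)
        u = mu(x)
        out.append(Q(x_prev, u_prev) - U(x, u) - GAMMA * Q(x, u))
        x_prev, u_prev = x, u
    return out


errors = residuals()
```

With the same inputs $x_k$ and $u_k$, `Q` and the composition of `J_hat` with `F_hat` return the same number; the difference is only in what is exposed as an input.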
The one-step time difference here between Ĵk in (1.3.7) and Qk in (1.3.9) has
exactly the same reasoning behind the one-step time difference between the TD

algorithm’s rt+1 and the HDP algorithm’s Uk mentioned earlier. In the HDP structure
described above, a model network is used, which leads to

\[ \hat{J}_k \to U_k + \gamma U_{k+1} + \gamma^2 U_{k+2} + \cdots. \]

In the ADHDP structure, there is no model network, and thus the expression becomes

\[ Q_k \to U_{k+1} + \gamma U_{k+2} + \gamma^2 U_{k+3} + \cdots. \]

As a matter of fact, for the RL system discussed in the TD methods, the value function
is defined without model network [97, 98]. Therefore, the value function V (st ) in
(1.2.15) is defined similarly to the function Qk above and it starts with rt+1 instead
of rt .
In addition to the basic structures reviewed above for ADP, there are also other
structures proposed in the literature, such as [30, 75].

1.3.2 Iterative Adaptive Dynamic Programming

A popular method to determine the cost function and the optimal cost function is
to use value iteration. In this case, a different set of notation has been used in the
literature. The cost function $J^{\mu}$ defined above will be called value function $V^{\mu}$, i.e.,
$V^{\mu}(x_k) = J^{\mu}(x_k)$, $\forall x_k$. Similarly, $V^*(x_0) = \inf_{\mu} V^{\mu}(x_0)$ is called the optimal value
function.
Note that similar to the case of cost function $J$ defined above, the value function
$V$ will also have the following three forms. (i) $V(x_k, \underline{u}_k)$ represents the value of
the cost function of system (1.3.1) starting at $x_k$ when the control sequence
$\underline{u}_k = (u_k, u_{k+1}, \ldots)$ is applied. (ii) $V^{\mu}(x_k)$ tells the value of the cost function of system
(1.3.1) starting at $x_k$ when the control policy $u_k = \mu(x_k)$ is applied. (iii) $V^*(x_k)$ is
the optimal cost function of system (1.3.1) starting at $x_k$. When the context is clear,
for convenience in this book, the notation $V(x_k)$ will be used to represent $V(x_k, \underline{u}_k)$
and $V^{\mu}(x_k)$. Such a use of notation for value functions has been quite standard in the
literature. In subsequent chapters, there will also be cases where the value function
is a function of $x_k$ and $u_k$ (not explicitly as a function of $\underline{u}_k$). Thus, $V(x_k, u_k)$ will
be appropriate. Also, there are cases where the value function is time-varying, such
that $V(x_k, u_k, k)$ will be appropriate. As a convention, we will use $V(x_k)$ to represent
$V(x_k, u_k)$ and $V(x_k, u_k, k)$ when the context is clear.
As stated earlier in (1.2.11) and (1.2.13) as well as in (1.2.16), (1.2.18), and
(1.2.22), the Bellman equation can be solved using iterative approaches.
We rewrite the Bellman optimality equation (1.3.4) here,

\[ J^*(x_k) = \min_{u_k} \{ U(x_k, u_k) + \gamma J^*(x_{k+1}) \} = \min_{u_k} \{ U(x_k, u_k) + \gamma J^*(F(x_k, u_k)) \}. \tag{1.3.10} \]

Our goal is to solve for a function $J^*$ which satisfies (1.3.10) and which in turn leads
to the optimal control solution $u_k^*$ given by (1.3.5), rewritten below,

\[ u_k^* = \arg\min_{u_k} \{ U(x_k, u_k) + \gamma J^*(x_{k+1}) \}. \tag{1.3.11} \]

One way to solve (1.3.10) is to use the following iterative approach. Replace the
function $J^*$ in (1.3.10) using $V$, i.e., using a value function. Now, we need to solve
for function $V$ from

\[ V(x_k) = \min_{u_k} \{ U(x_k, u_k) + \gamma V(x_{k+1}) \}. \tag{1.3.12} \]

This equation can be further rewritten as

\[ V_i(x_k) = \min_{u_k} \{ U(x_k, u_k) + \gamma V_{i-1}(x_{k+1}) \}, \quad i = 1, 2, \ldots. \tag{1.3.13} \]

This is similar to solving the algebraic equation x = f (x) using iterative method
from xi = f (xi−1 ). Starting from x0 , we use the iteration to obtain x1 = f (x0 ), x2 =
f (x1 ), . . . . We can get the solution as x∞ = f (x∞ ), if the iterative process is conver-
gent. For (1.3.13), one can start with a function V0 and the iteration gives V1 , V2 ,
and so on. One would hope that a solution V∞ (xk ) is obtained when i reaches ∞. Of
course, such a solution can only be obtained if the iterative process is convergent.
Using the procedure above, we would hope to obtain a solution which is also the
optimal solution as required by the solution of dynamic programming. During the
iterative solution process, corresponding to each iterative value function Vi obtained,
there will also be a control signal determined as

\[ v_i(x_k) = \arg\min_{u_k} \{ U(x_k, u_k) + \gamma V_i(F(x_k, u_k)) \}. \tag{1.3.14} \]

This sequence of control signals {v0 , v1 , . . . } is called the iterative control law
sequence. We would hope to obtain a sequence of stable control laws, or at least
a stable control law when it is optimal.
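The iteration (1.3.13) and the control law (1.3.14) can be sketched on a finite problem where the minimization is an exhaustive search. The state grid, control set, system function, utility, and discount factor below are hypothetical choices made for illustration.

```python
STATES = list(range(-3, 4))     # hypothetical integer state grid
ACTIONS = (-1, 0, 1)            # hypothetical finite control set
GAMMA = 0.9


def F(x, u):
    """Saturating integrator used as the system function."""
    return max(-3, min(3, x + u))


def U(x, u):
    """Nonnegative utility."""
    return x * x + abs(u)


def value_iteration(tol=1e-10, max_iter=1000):
    V = {x: 0.0 for x in STATES}                      # V_0 = 0
    history = [V]
    for _ in range(max_iter):
        # Value update (1.3.13): V_i from V_{i-1}.
        V_new = {x: min(U(x, u) + GAMMA * V[F(x, u)] for u in ACTIONS)
                 for x in STATES}
        history.append(V_new)
        if max(abs(V_new[x] - V[x]) for x in STATES) < tol:
            break
        V = V_new
    V = history[-1]
    # Iterative control law (1.3.14) evaluated at the final value function.
    v = {x: min(ACTIONS, key=lambda u: U(x, u) + GAMMA * V[F(x, u)])
         for x in STATES}
    return V, v, history


V_inf, v_inf, history = value_iteration()
```

In this example the iteration stops after a handful of sweeps, the sequence $\{V_i\}$ is nondecreasing from $V_0 = 0$, and the resulting control law drives the state toward the origin; none of this is guaranteed in general without the kind of analysis discussed next.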
The above gives the rationale of iterative ADP based on value iteration. We need
theoretical results regarding qualitative analysis of the iterative solution including
stability, convergence, and optimality. Stability is the fundamental requirement of any
control system. Convergence of the iterative solution process is required in order for
the above procedure to be meaningful. Furthermore, only requiring the convergence
of {Vi } is not enough, since we will also need the iterative solutions to converge to
the optimal solution J ∗ .
The simplest choice of $V_0$ to start the iteration is $V_0(x_k) \equiv 0$, $\forall x_k$ [3, 46, 81]. In
[3, 46, 81], undiscounted optimal control problems are considered, where $\gamma = 1$. In
this case, (1.3.13) becomes

\[ V_i(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(x_{k+1}) \}, \quad i = 1, 2, \ldots. \tag{1.3.15} \]

Rantzer proved the following proposition.

Proposition 1.3.1 (Convergence of value iteration [46, 81]) Suppose that, for
all $x_k$ and for all $u_k$, the inequality $J^*(F(x_k, u_k)) \le \rho U(x_k, u_k)$ holds uniformly
for some $\rho < \infty$ and that $\eta J^*(x_k) \le V_0(x_k) \le J^*(x_k)$ for some $0 \le \eta \le 1$. Then,
the sequence $\{V_i\}$ defined iteratively by (1.3.15) approaches $J^*$ according to the
inequalities

\[ \left[ 1 + \frac{\eta - 1}{(1 + \rho^{-1})^i} \right] J^*(x_k) \le V_i(x_k) \le J^*(x_k), \quad \forall x_k. \tag{1.3.16} \]

Proposition 1.3.1 clearly shows the convergence of value iteration, i.e., Vi (xk ) →
J ∗ (xk ) as i → ∞.
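The bounds in (1.3.16) can be checked numerically on a scalar linear-quadratic example, which is an assumption introduced here for illustration. For $x_{k+1} = a x_k + b u_k$ and $U = q x_k^2 + r u_k^2$, a value function of the form $V(x) = p x^2$ is mapped by (1.3.15) to $p' x^2$ with $p' = q + a^2 p r / (r + b^2 p)$, and the Cauchy–Schwarz inequality gives $(ax + bu)^2 \le (a^2/q + b^2/r)(q x^2 + r u^2)$, so $\rho = p^* (a^2/q + b^2/r)$ satisfies the condition of the proposition. The names `vi_step` and `check_bounds` are hypothetical.

```python
A, B, Q, R = 1.2, 1.0, 1.0, 1.0     # hypothetical scalar LQ data


def vi_step(p):
    """One step of (1.3.15) for V(x) = p * x^2, minimized over u in closed form."""
    return Q + A * A * p * R / (R + B * B * p)


def check_bounds(n_iter=30, eta=0.0):
    # Approximate J*(x) = p_star * x^2 by running the iteration to convergence.
    p_star = 0.0
    for _ in range(2000):
        p_star = vi_step(p_star)
    # A valid rho for this example, obtained from the Cauchy-Schwarz inequality.
    rho = p_star * (A * A / Q + B * B / R)
    p, ok = eta * p_star, True      # V_0 = eta * J*, so eta J* <= V_0 <= J*
    for i in range(n_iter + 1):
        lower = (1.0 + (eta - 1.0) / (1.0 + 1.0 / rho) ** i) * p_star
        ok = ok and (lower - 1e-12 <= p <= p_star + 1e-12)
        p = vi_step(p)
    return ok, p_star, rho


ok, p_star, rho = check_bounds()
```

The lower bound is typically loose: in this example the iterates approach $p^*$ much faster than the geometric rate $(1 + \rho^{-1})^{-i}$ guaranteed by the proposition.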
On the other hand, the affine version of system (1.3.1) has been studied often,
which is given by

\[ x_{k+1} = f(x_k) + g(x_k) u(x_k), \quad k = 0, 1, 2, \ldots, \tag{1.3.17} \]

where $x_k \in \mathbb{R}^n$ is the state vector, $u_k \in \mathbb{R}^m$ is the control vector, and $f : \mathbb{R}^n \to \mathbb{R}^n$
and $g : \mathbb{R}^n \to \mathbb{R}^{n \times m}$ are nonlinear system functions. The utility function in this case
is usually chosen as a quadratic form given by

\[ U(x_k, u_k) = x_k^{\mathsf{T}} Q x_k + u_k^{\mathsf{T}} R u_k, \]

where $Q$ and $R$ are positive-definite matrices with appropriate dimensions. In a seminal
paper by Al-Tamimi, Lewis, and Abu-Khalaf [3], an iterative ADP algorithm is
derived as follows. Starting with the zero initial value function, i.e., $V_0(x_k) \equiv 0$, $\forall x_k$,
solve for $v_0$ as

\[ v_0(x_k) = \arg\min_{u_k} \{ x_k^{\mathsf{T}} Q x_k + u_k^{\mathsf{T}} R u_k + V_0(x_{k+1}) \}, \]

where $V_0(x_{k+1}) = 0$. Once the policy $v_0$ is determined, the next value function is
computed as

\[
\begin{aligned}
V_1(x_k) &= x_k^{\mathsf{T}} Q x_k + v_0^{\mathsf{T}}(x_k) R v_0(x_k) + V_0(x_{k+1}) \\
&= x_k^{\mathsf{T}} Q x_k + v_0^{\mathsf{T}}(x_k) R v_0(x_k) + V_0(f(x_k) + g(x_k) v_0(x_k)).
\end{aligned}
\]

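The iteration will then be performed between control policies

\[
\begin{aligned}
v_i(x_k) &= \arg\min_{u_k} \{ x_k^{\mathsf{T}} Q x_k + u_k^{\mathsf{T}} R u_k + V_i(x_{k+1}) \} \\
&= \arg\min_{u_k} \{ x_k^{\mathsf{T}} Q x_k + u_k^{\mathsf{T}} R u_k + V_i(f(x_k) + g(x_k) u_k) \} \\
&= -\frac{1}{2} R^{-1} g^{\mathsf{T}}(x_k) \frac{\partial V_i(x_{k+1})}{\partial x_{k+1}},
\end{aligned}
\]

and value functions

\[
\begin{aligned}
V_{i+1}(x_k) &= \min_{u_k} \{ x_k^{\mathsf{T}} Q x_k + u_k^{\mathsf{T}} R u_k + V_i(x_{k+1}) \} \\
&= x_k^{\mathsf{T}} Q x_k + v_i^{\mathsf{T}}(x_k) R v_i(x_k) + V_i(f(x_k) + g(x_k) v_i(x_k)),
\end{aligned}
\]

for $i = 1, 2, \ldots$. The following results were shown in [3].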


(1) Starting from V0 (xk ) = 0, the sequence {Vi (xk )} generated by the above iteration
will be monotonically nondecreasing and has an upper bound. Therefore, the
limit of Vi (xk ) when i → ∞, i.e., V∞ (xk ) exists.
(2) V∞ (xk ) = J ∗ (xk ), i.e., the iterative solution converges to the optimal solution.
(3) The iterative control law converges to the optimal control law, i.e., v∞ (xk ) =
u∗ (xk ).
Along the line of [3], some of the new developments can be found in [58, 59, 61,
63, 84, 108, 109, 115, 143, 144]. Along the line of [46, 81], there are also significant
developments [44, 54, 60, 114, 117, 121–123, 132].
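As a worked special case, added here for illustration, suppose that the system is linear, i.e., $f(x_k) = A x_k$ and $g(x_k) = B$ with constant matrices $A$ and $B$, and that $V_i(x) = x^{\mathsf{T}} P_i x$ with symmetric $P_i \ge 0$ (these are assumptions of this sketch). Setting the gradient of the minimized expression with respect to $u_k$ to zero gives the control law and the value update in closed form:

```latex
\begin{aligned}
0 &= 2 R v_i(x_k) + 2 B^{\mathsf{T}} P_i \big( A x_k + B v_i(x_k) \big)
\;\Longrightarrow\;
v_i(x_k) = -\big( R + B^{\mathsf{T}} P_i B \big)^{-1} B^{\mathsf{T}} P_i A \, x_k, \\
P_{i+1} &= Q + A^{\mathsf{T}} P_i A
  - A^{\mathsf{T}} P_i B \big( R + B^{\mathsf{T}} P_i B \big)^{-1} B^{\mathsf{T}} P_i A,
\qquad P_0 = 0.
\end{aligned}
```

Under these assumptions the iteration between control policies and value functions reduces to the Riccati difference equation started from $P_0 = 0$, so results (1)–(3) above correspond to the monotone convergence of $P_i$ to the stabilizing solution of the discrete-time algebraic Riccati equation.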
The introduction given above is for discrete-time nonlinear systems. We next
introduce briefly ADP for continuous-time nonlinear systems before we conclude
our review.

1.3.3 ADP for Continuous-Time Systems

For continuous-time systems, the cost function J is also the key to dynamic pro-
gramming. By minimizing J, one gets the optimal cost function J ∗ , which is often
a Lyapunov function of the system. As a consequence of the Bellman’s principle of
optimality, J ∗ satisfies the Hamilton–Jacobi–Bellman (HJB) equation. But usually,
one cannot get the analytical solution of the HJB equation. Even to find an accurate
numerical solution is very difficult due to the so-called curse of dimensionality.
Continuous-time nonlinear systems can be described by

\[ \dot{x}(t) = F(x(t), u(t)), \quad t \ge t_0, \]

where $x \in \mathbb{R}^n$ and $u \in \mathbb{R}^m$ are the state and the control vectors, and $F(x, u)$ is a
continuous nonlinear system function. The cost in this case is defined as

\[ J(x_0, u) = \int_{t_0}^{\infty} U(x(\tau), u(\tau))\,d\tau, \]

with nonnegative utility function $U(x, u) \ge 0$, where $x(t_0) = x_0$. Bellman’s principle
of optimality can also be applied to continuous-time systems. In this case, the
optimal cost

\[ J^*(x(t)) = \min_{u(t)} \{ J(x(t), u(t)) \}, \quad t \ge t_0, \]

satisfies the HJB equation

\[ -\frac{\partial J^*}{\partial t} = \min_{u(t)} \left\{ U(x, u) + \left( \frac{\partial J^*}{\partial x} \right)^{\mathsf{T}} F(x, u) \right\}. \tag{1.3.18} \]

The HJB equation in (1.3.18) can be derived from Bellman’s principle of optimality
(1.3.4) [41]. Meanwhile, the optimal control $u^*(t)$ will be the one that minimizes
the cost function,

\[ u^*(t) = \arg\min_{u(t)} \{ J(x(t), u(t)) \}, \quad t \ge t_0. \tag{1.3.19} \]

In 1994, Saridis and Wang [86] studied the nonlinear stochastic systems described
by

\[ dx = f(x, t)\,dt + g(x, t) u\,dt + h(x, t)\,dw, \quad t_0 \le t \le T, \tag{1.3.20} \]

with the cost function

\[ J(x_0, u) = E\left\{ \int_{t_0}^{T} \big( Q(x, t) + u^{\mathsf{T}} u \big)\,dt + \phi(x(T), T) : x(t_0) = x_0 \right\}, \]

where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$, and $w \in \mathbb{R}^k$ are the state vector, the control vector, and a
separable Wiener process; $f$, $g$, and $h$ are measurable system functions; and $Q$ and $\phi$
are nonnegative functions. A value function $V$ is defined as

\[ V(x, t) = E\left\{ \int_{t}^{T} \big( Q(x, t) + u^{\mathsf{T}} u \big)\,dt + \phi(x(T), T) : x(t_0) = x_0 \right\}, \quad t \in I, \]

where $I \triangleq [t_0, T]$. The HJB equation is modified to become the following equation

\[ \frac{\partial V}{\partial t} + \mathcal{L}_u V + Q(x, t) + u^{\mathsf{T}} u = \nabla V, \tag{1.3.21} \]

where $\mathcal{L}_u$ is the infinitesimal generator of the stochastic process specified by (1.3.20)
and is defined by

\[ \mathcal{L}_u V = \frac{1}{2} \operatorname{tr}\left\{ h(x, t) h^{\mathsf{T}}(x, t) \frac{\partial}{\partial x} \left( \frac{\partial V(x, t)}{\partial x} \right)^{\mathsf{T}} \right\} + \left( \frac{\partial V(x, t)}{\partial x} \right)^{\mathsf{T}} \big( f(x, t) + g(x, t) u \big). \]

Depending on whether $\nabla V \le 0$ or $\nabla V \ge 0$, an upper bound $\overline{V}$ or a lower bound $\underline{V}$
of the optimal cost $J^*$ is found by solving equation (1.3.21) such that $\underline{V} \le J^* \le \overline{V}$.
Using $\overline{V}$ (or $\underline{V}$) as an approximation to $J^*$, one can solve for a control law. This leads
to the so-called suboptimal control. It was proved that such controls are stable for the
infinite-time stochastic regulator optimal control problem, where the cost function
is defined as

\[ J(x_0, u) = \lim_{T \to \infty} \frac{1}{T} E\left\{ \int_{t_0}^{T} \big( Q(x, t) + u^{\mathsf{T}} u \big)\,dt : x(t_0) = x_0 \right\}. \]

The benefit of the suboptimal control is that the bound $\overline{V}$ of the optimal cost $J^*$ can
be approximated by an iterative process. Beginning from certain chosen functions
$u_0$ and $V_0$, let

\[ u_i(x, t) = -\frac{1}{2} g^{\mathsf{T}}(x, t) \frac{\partial V_{i-1}(x, t)}{\partial x}, \quad i = 1, 2, \ldots. \tag{1.3.22} \]

Then, by repeatedly applying (1.3.21) and (1.3.22), one will get a sequence of functions
$V_i$. This sequence $\{V_i\}$ will converge to the bound $\overline{V}$ (or $\underline{V}$) of the cost function
$J^*$. Consequently, $u_i$ will approximate the optimal control when $i$ tends to $\infty$. It is
important to note that the sequences $\{V_i\}$ and $\{u_i\}$ are obtainable by computation and
they approximate the optimal cost and the optimal control law, respectively.
Some further theoretical results for ADP have been obtained in [72, 73]. These
works investigated the stability and optimality for some special cases of ADP. In
[72, 73], Murray et al. studied the (deterministic) continuous-time affine nonlinear
systems

\[ \dot{x} = f(x) + g(x) u, \quad x(t_0) = x_0, \tag{1.3.23} \]

with the cost function

\[ J(x, u) = \int_{t_0}^{\infty} U(x, u)\,dt, \tag{1.3.24} \]

where $U(x, u) = Q(x) + u^{\mathsf{T}} R(x) u$, $Q(x) > 0$ for $x \ne 0$ and $Q(0) = 0$, and $R(x) > 0$
for all $x$. Similar to [86], an iterative procedure is proposed to find the control law as
follows. For the plant (1.3.23) and the cost function (1.3.24), the HJB equation leads
to the following optimal control law

\[ u^*(x) = -\frac{1}{2} R^{-1}(x) g^{\mathsf{T}}(x) \left[ \frac{dJ^*(x)}{dx} \right]. \tag{1.3.25} \]
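The step from the HJB equation to (1.3.25) can be written out explicitly. The following sketch assumes, in addition to what is stated in the text, that $J^*$ is differentiable and does not depend on time explicitly (so that $\partial J^*/\partial t = 0$) and that $R(x)$ is symmetric:

```latex
\begin{aligned}
0 &= \min_{u} \left\{ Q(x) + u^{\mathsf{T}} R(x) u
     + \left( \frac{dJ^*(x)}{dx} \right)^{\mathsf{T}} \big( f(x) + g(x) u \big) \right\}, \\
0 &= \frac{\partial}{\partial u} \{ \cdot \}
   = 2 R(x) u + g^{\mathsf{T}}(x) \frac{dJ^*(x)}{dx}
\;\Longrightarrow\;
u^*(x) = -\frac{1}{2} R^{-1}(x) g^{\mathsf{T}}(x) \frac{dJ^*(x)}{dx}.
\end{aligned}
```

Since $R(x) > 0$, the expression in braces is strictly convex in $u$, so the stationary point is the unique minimizer.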

Applying (1.3.24) and (1.3.25) repeatedly, one will get sequences of estimations of
the optimal cost function $J^*$ and the optimal control $u^*$. Starting from an initial
stabilizing control $v_0(x)$, for $i = 0, 1, \ldots$, the approximation is given by the following
iterations between value functions

\[ V_{i+1}(x) = \int_{t}^{\infty} U(x(\tau), v_i(\tau))\,d\tau \]

and control laws

\[ v_{i+1}(x) = -\frac{1}{2} R^{-1}(x) g^{\mathsf{T}}(x) \left[ \frac{dV_{i+1}(x)}{dx} \right]. \]

The following results were shown in [72, 73].

(1) The sequence of functions {Vi } obtained above converges to the optimal cost
function J ∗ .
(2) Each of the control laws vi+1 obtained above stabilizes the plant (1.3.23), for all
i = 0, 1, . . . .
(3) Each of the value functions Vi+1 (x) is a Lyapunov function of the plant, for all
i = 0, 1, . . . .

Abu-Khalaf and Lewis [1] also studied the system (1.3.23) with the following
value function

\[ V(x(t)) = \int_{t}^{\infty} U(x(\tau), u(\tau))\,d\tau = \int_{t}^{\infty} \big( x^{\mathsf{T}}(\tau) Q x(\tau) + u^{\mathsf{T}}(\tau) R u(\tau) \big)\,d\tau, \]

where $Q$ and $R$ are positive-definite matrices. The successive approximation to the
HJB equation starts with an initial stabilizing control law $v_0(x)$. For $i = 0, 1, \ldots$,
the approximation is given by the following iterations between policy evaluation

\[ 0 = x^{\mathsf{T}} Q x + v_i^{\mathsf{T}}(x) R v_i(x) + \nabla V_i^{\mathsf{T}}(x) \big( f(x) + g(x) v_i(x) \big) \]

and policy improvement

\[ v_{i+1}(x) = -\frac{1}{2} R^{-1} g^{\mathsf{T}}(x) \nabla V_i(x), \]
where ∇Vi (x) = ∂Vi (x)/∂x. In [1], the above iterative approach was applied to sys-
tems (1.3.23) with saturating actuators through a modified utility function, with
convergence and optimality proofs showing that Vi → J ∗ and vi → u∗ , as i → ∞.
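The alternation between policy evaluation and policy improvement can be traced by hand in the unconstrained scalar linear-quadratic case, which is used below purely as an illustration (it does not include the saturating actuators treated in [1]). With the hypothetical plant $\dot{x} = ax + bu$, utility $qx^2 + ru^2$, control law $v_i(x) = -k_i x$, and $V_i(x) = p_i x^2$, both steps have closed forms.

```python
import math

a, b, q, r = 1.0, 1.0, 1.0, 1.0     # hypothetical scalar plant and weights


def policy_iteration(k0=3.0, n_iter=8):
    """Successive approximation for x' = a x + b u with v_i(x) = -k_i x."""
    assert b * k0 - a > 0            # the initial control law must be stabilizing
    k, history = k0, []
    for _ in range(n_iter):
        # Policy evaluation: 0 = q + r k^2 + 2 p (a - b k), with V_i(x) = p x^2.
        p = (q + r * k * k) / (2.0 * (b * k - a))
        history.append((k, p))
        # Policy improvement: v_{i+1}(x) = -(1/2) r^{-1} b (2 p x) = -(b p / r) x.
        k = b * p / r
    return history


history = policy_iteration()

# Positive root of the scalar algebraic Riccati equation 2 a p - b^2 p^2 / r + q = 0.
p_star = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
```

In this example every gain in `history` is stabilizing and the values $p_i$ decrease toward `p_star`, which is the scalar counterpart of $V_i \to J^*$ and $v_i \to u^*$.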
For continuous-time optimal control problems, attempts have been going on for a
long time in the quest for successive solutions to the HJB equation. Published works
can date back to as early as 1967 by Leake and Liu [37]. The brief overview presented
here only serves as a beginning of many more recent results [1, 32, 33, 45, 49, 51,
52, 56, 57, 62, 69, 70, 105, 107, 134–136, 140].

1.3.4 Remarks

Most of the early applications of ADP are in the areas of aircraft flight control
[6, 21, 78] and missile guidance [22, 28]. Some other applications have also been
reported, e.g., for ship steering [55], in power systems [99, 104], in communication
networks [65, 67], in engine control [36, 50], and for locomotive planning [77].
The most successful application in industry is perhaps the fleet management and
truckload operation problems as reported by Forbes Magazine [19]. Warren Powell
from Princeton University has been working with Snyder International [94, 95], one
of the largest freight haulers in the USA, in order to find more efficient ways to
plan routes for parcels and freight. Applications reported in this book are related to

optimal control approaches in the areas of energy management in smart homes [31,
119, 120], coal gasification process [116], and water gas shift reaction system [118].
Newcomers to the field of ADP should first take a look at the challenging control
problems listed in [4]. Interested readers should also read reference [39], especially
the proposed training strategies for the critic network and the action network. There
are also several good survey papers to read, e.g., [42, 43, 48, 53, 110].
Paul Werbos, who is the inventor of the backpropagation algorithm and adaptive
critic designs, has often talked about brain-like intelligence. He has pointed out that
“ADP may be the only approach that can achieve truly brain-like intelligence” [85,
125, 129, 131]. More and more evidence has accumulated, suggesting that optimality
is an organizing principle for understanding brain intelligence [129–131]. There has
been a great interest in brain research around the world in recent years. We would
certainly hope ADP can make contributions to brain research in general and to brain-
like intelligence in particular. On the other hand, with more and more advances
in the understanding of brain-learning functions, new ADP algorithms can then be
developed.
Deep reinforcement learning has attracted great interest lately. With the current
trends in deep learning, big data, artificial intelligence, as well as cyber-physical
systems and Internet of things, we believe that ADP will have a bright future. There
are still many pending issues to be solved, and most of them are related to obtaining
good approximations to solutions of dynamic programming with less computation.
Deep reinforcement learning is able to output control signals directly based on input
images, combining the perception capability of deep learning with the decision-making
capability of reinforcement learning. This mechanism brings artificial
intelligence much closer to human thinking. Combining deep learning with reinforcement
learning/ADP will help us construct systems with more intelligence
and attain a higher level of brain-like intelligence.

1.4 Related Books

There have been a few books published on the topics of reinforcement learning and
adaptive dynamic programming. A quick overview of these books will be given in
this section.
In their book published in 1996 [12], Bertsekas and Tsitsiklis give an overview of
neuro-dynamic programming. The book draws on the theory of function approxima-
tion, iterative optimization, neural network training, and dynamic programming. It
provides the background, gives a detailed introduction to dynamic programming, dis-
cusses the neural network architectures and methods for training them, and develops
general convergence theorems for stochastic approximation methods as the founda-
tion for the analysis of various neuro-dynamic programming algorithms. It aims at
explaining with mathematical analysis, examples, speculative insight, and case stud-
ies, a number of computational ideas and phenomena that collectively can provide
the foundation for understanding and applying the methodology of neuro-dynamic
programming. It suggests many useful methodologies to apply in neuro-dynamic programming,
such as Monte Carlo simulation, online and offline temporal difference
methods, Q-learning algorithms, optimistic policy iteration, Bellman error methods,
approximate linear programming, approximate dynamic programming with cost-to-
go function, etc. First, the theory and computational methods on discounted problems
are developed. The computational methods for generalized discounted dynamic pro-
gramming, including the asynchronous optimistic policy iteration and its application
to game and minimax problems, constrained policy iteration, and Q-learning are pro-
vided. Then, the stochastic shortest path problems, the undiscounted problems, and
the average cost per stage problems are also discussed. The policy iteration methods
and the asynchronous optimistic versions for stochastic shortest path problems that
involve improper policies are given. Finally, detailed descriptions of ADP on discounted
models and nondiscounted models with generalizations are provided. Extensive
materials on various simulation-based, approximate value, and policy iteration
methods, Monte Carlo linear algebra, and new simulation techniques for multi-step
methods, such as geometric and free-form sampling, are highlighted.
The classical book by Sutton and Barto [98] provides a clear and simple account of
the main ideas and algorithms of reinforcement learning, which is much more focused
on goal-directed learning from interaction than other approaches in the machine
learning field. It shows how to map situations to actions, so as to maximize a reward
signal. The learner is not told which actions to take, as in most forms of machine
learning, but instead must discover which actions yield the most reward by trying
them. Generally, actions may affect not only the immediate reward but also the next
situation and all the subsequent rewards. This book was designed to be used as a text
in a one-semester course and can also be used as part of a broader course on machine
learning, artificial intelligence, or neural networks. The book consists of four parts:
Part I focuses on the simplest aspects of reinforcement learning and the main distin-
guishing features. A beginning chapter is presented to introduce the reinforcement
learning problem whose solution will be explored in the book. Part II presents tabular
versions of all the basic solution methods for finite state-space environments,
based on estimating action values. In this part, dynamic programming (including
policy iteration and value iteration), Monte Carlo methods, and temporal difference
learning are introduced in detail. It also covers eligibility traces, which unify
the latter two methods, and gives a unified view of planning methods (such as dynamic
programming and state-space search) and learning methods (such as Monte Carlo
and temporal difference learning). Part III is concerned with extending the tabular
methods to include various forms of approximation, such as function approxima-
tion, policy-gradient methods, and methods designed for solving off-policy learning
problems. Part IV presents some of the frontiers of reinforcement learning in biology
with practical applications. It shows that trial-and-error search and delayed reward
are the two most important distinguishing features of reinforcement learning.
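As a small illustration of the tabular dynamic programming methods mentioned above, the following sketch runs value iteration on a five-state chain. The chain, the reward of 1 for reaching the rightmost state, the discount factor 0.9, and all function names are assumptions made for this example only; they are not taken from the book by Sutton and Barto.

```python
# Hypothetical 5-state chain: states 0..4, where state 4 is a terminal goal.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (-1, +1)  # move left or move right
GAMMA = 0.9         # assumed discount factor


def step(state, action):
    """Deterministic model: return (next_state, reward)."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def value_iteration(tol=1e-10):
    """Sweep the Bellman optimality backup until the values stop changing."""
    values = [0.0] * N_STATES  # the terminal state keeps value 0
    while True:
        delta = 0.0
        for s in range(GOAL):
            outcomes = [step(s, a) for a in ACTIONS]
            best = max(r + GAMMA * values[s2] for s2, r in outcomes)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(values):
    """Pick, in every non-terminal state, the action with the best backup."""
    def backup(s, a):
        s2, r = step(s, a)
        return r + GAMMA * values[s2]
    return [max(ACTIONS, key=lambda a: backup(s, a)) for s in range(GOAL)]


values = value_iteration()
policy = greedy_policy(values)
```

In this toy problem the greedy policy moves right in every state, and the value of a state decays by the factor 0.9 for each step it lies further from the goal.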
With the growing levels of sophistication in modern operations, it is vital for
practitioners to understand how to approach, model, and solve complex industrial
problems. The book by Powell [76] on approximate dynamic programming draws on
decades of experience working in large industrial settings to develop practical
and high-quality solutions to problems that involve making decisions in the presence
of uncertainty. The book integrates the disciplines of Markov decision processes, mathematical
programming, simulation, and statistics, to demonstrate how to successfully
model and solve a wide range of real-life problems using the idea of approximate
dynamic programming (ADP). It starts with a simple introduction using a discrete
representation of states. The background of dynamic programming and Markov deci-
sion processes is given, and meanwhile the phenomenon of the curse of dimensional-
ity is discussed. A detailed description on how to model a dynamic program and some
important algorithmic strategies are presented next. The most important dimensions
of ADP, i.e., modeling real applications, the interface with stochastic approximation
methods, techniques for approximating general value functions, and a more in-depth
presentation of ADP algorithms for finite- and infinite-horizon applications are pro-
vided, respectively. Several specific problems, including information acquisition and
resource allocation, and algorithms that arise in this setting are introduced in the third
part. The well-known exploration versus exploitation problem is raised to discuss
how to decide which states to visit. These applications bring out the richness of ADP techniques. In
summary, it models complex, high-dimensional problems in a natural and practical
way; introduces and emphasizes the power of estimating a value function around the
post-decision state; and presents a thorough discussion of recursive estimation. The
book is an accessible introduction to dynamic modeling and
is also a valuable guide for the development of high-quality solutions to problems
that exist in operations research and engineering.
The book by Busoniu et al. [14] provides an accessible in-depth treatment of
dynamic programming (DP) and reinforcement learning (RL) methods using func-
tion approximators. Even though DP and RL are methods for solving problems where
actions are applied to a dynamical plant to achieve a desired goal, the former requires
a model of the system's behavior while the latter does not since it works using only
data obtained from the system. However, a core obstacle for both is that the
solutions cannot be represented exactly for problems with large discrete state-action
spaces or continuous spaces. As a result, compact representations relying on func-
tion approximators must be constructed. The book adopts a control-theoretic point of
view, employing control-theoretical notation and terminology and choosing control
systems as examples to illustrate the behavior of DP and RL algorithms. It starts
by introducing the basic problems and their solutions, the representative classical
algorithms, and the behavior of some algorithms via examples with discrete states
and actions. Then, it gives an extensive account of DP and RL methods with function
approximation, which are applicable to large- and continuous-space problems. The
three major classes of algorithms, including value iteration, policy iteration, and pol-
icy search, are presented, respectively. Next, a value iteration algorithm with fuzzy
approximation is discussed, and an extensive theoretical analysis of this algorithm is
given to illustrate how convergence and consistency guarantees can be developed to
perform approximate DP. Moreover, an algorithm for approximate policy iteration is
studied and an online version is also developed, in order to emphasize the important
issues of online RL. Finally, a policy search approach relying on the cross-entropy
method for optimization is described, which highlights the possibility to develop
techniques that scale to relatively high-dimensional state spaces, by focusing the
computation on important initial states. Some experimental studies via classical control
problems are included as performance verification.
The book edited by Frank Lewis and Derong Liu [40] gives an exposition of
recently developed reinforcement learning (RL) and adaptive dynamic program-
ming (ADP) techniques for decision and control for human-engineered systems.
Included are single-player decision and control, multi-player games, and founda-
tions in Markov decision processes. The first part of the book develops methods for
feedback control of nonlinear systems based on RL and ADP. Reviews are given for
the foundations, common misconceptions, and challenges ahead of RL and ADP.
Novel structures, such as hierarchical adaptive critic, single network adaptive critic,
actor-critic-identifier architecture, and robust ADP, are introduced to solve the opti-
mal control problems. Additionally, several RL algorithms, such as value gradient
learning and RL without using value and policy iterations, are employed to find
the optimal controller. A constrained backpropagation approach to function approx-
imation and a Q-learning-based method for learning unknown information about
the timescale properties of dynamical systems are also studied. The second part
treats learning and control in multi-agent games. The hybrid learning in stochastic
games and its application in network security are studied. Moreover, online learning
algorithms for dynamic games are proposed. The third part presents some ideas of
fundamental importance in understanding and implementing decision algorithms in
Markov processes. Surveys on lambda-policy iteration, event-based optimization,
and optimistic planning in Markov decision processes are given. In addition, differ-
ent strategies for learning policies and the backpropagation on timescales under the
framework of ADP are developed. In summary, this book establishes basic methods
for feedback control of modern complex systems based on RL and ADP, explains
learning algorithms in multi-agent games as well as single-player games, and outlines
key approaches for implementing the main algorithms in Markov decision processes.
The book by Vrabie, Vamvoudakis, and Lewis [106] presents new developments
in optimal adaptive control and reinforcement learning for human-engineered sys-
tems including aerospace systems, aircraft autopilots, vehicle engine controllers,
ship motion and engine control, and industrial processes. It shows how to use rein-
forcement learning techniques to design new structures of adaptive feedback control
systems that learn the solutions to optimal control problems. It also presents how
to design adaptive controllers for multi-player games that converge to optimal dif-
ferential game theoretic solutions online in real time by measuring data along the
system trajectories. Techniques based on reinforcement learning can be used to unify
optimal control and adaptive control. Reinforcement learning techniques are used to
design adaptive control with novel structures that learn the solutions to optimal con-
trol problems in real time by observing data for continuous-time dynamical systems.
The methods studied here depend on reinforcement learning techniques such as pol-
icy iteration and value iteration, which evaluate the performance of current control
policies and provide methods for improving those policies. The adaptive learning sys-
tems utilize the actor-critic structure, wherein there are two networks in two control
loops—a critic network and an actor network with parameters updated sequentially
or simultaneously. Integral reinforcement learning and synchronous tuning are combined
to yield a synchronous adaptive control structure that converges to optimal
control solutions without knowing the full system dynamics. Reinforcement learn-
ing methods are applied to design adaptive controllers for multi-player games that
converge online to optimal game theoretic solutions. This book presents synchro-
nous adaptive controllers that learn in real time the Nash equilibrium solution of
two-player zero-sum differential games and multi-player nonzero-sum differential
games. Integral reinforcement learning methods are used to learn the solution to the
two-player zero-sum games online without knowing the system drift dynamics.
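The alternation between evaluating a control policy and improving it, as described above, can be sketched on a very simple problem. The example below is only an illustration under strong assumptions: it uses a scalar discrete-time linear plant with a known model and quadratic cost, whereas the book by Vrabie, Vamvoudakis, and Lewis treats continuous-time systems and learns from measured data. The plant parameters, the initial gain, and all function names are hypothetical.

```python
# Hypothetical scalar plant x[k+1] = a*x[k] + b*u[k]
# with stage cost q*x^2 + r*u^2 (all numbers are assumptions).
a, b = 1.2, 1.0
q, r = 1.0, 1.0


def evaluate_policy(k):
    """Policy evaluation: return p such that the policy u = -k*x costs p*x^2."""
    closed_loop = a - b * k
    if abs(closed_loop) >= 1.0:
        raise ValueError("the policy must be stabilizing")
    # Solves p = q + r*k^2 + (a - b*k)^2 * p for p.
    return (q + r * k * k) / (1.0 - closed_loop ** 2)


def improve_policy(p):
    """Policy improvement: gain minimizing q*x^2 + r*u^2 + p*(a*x + b*u)^2."""
    return a * b * p / (r + b * b * p)


def policy_iteration(k0, iterations=20):
    """Alternate evaluation and improvement, starting from a stabilizing gain."""
    k = k0
    for _ in range(iterations):
        p = evaluate_policy(k)
        k = improve_policy(p)
    return k, evaluate_policy(k)


def riccati_residual(p):
    """Residual of the scalar discrete-time Riccati equation at p."""
    return q + a * a * p - (a * b * p) ** 2 / (r + b * b * p) - p


gain, cost = policy_iteration(k0=1.0)
```

Starting from the stabilizing gain `k0 = 1.0`, the iterates approach the solution of the Riccati equation, which can be checked by evaluating `riccati_residual(cost)`. In the data-driven methods surveyed here, the evaluation step would instead be carried out by a critic that is trained from measurements along the system trajectories.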
The book by Zhang, Liu, Luo, and Wang [139] studies the control algorithms
and stability issues of adaptive dynamic programming (ADP). ADP is a biologically
inspired computational method proposed to solve optimization and optimal control
problems. It is an efficient scheme for learning to approximate the optimal strategy of
action in the general case. With the development of ADP algorithms, more and more
people show interest in the convergence of the control algorithms and stability of
ADP-based systems. This book includes three parts, i.e., the optimal feedback con-
trol, the nonlinear games, and some applications of ADP. In the first part, various
recent results on ADP-based infinite-horizon and finite-horizon feedback control,
including stabilization and tracking problems, are presented in a systematic fashion
with convergence analysis, mainly for nonlinear discrete-time systems. In addition,
optimal feedback control and tracking control for nonlinear systems with time delays
are studied. The input-/output-data-based optimal robust feedback control strategy
for unknown general nonlinear continuous-time systems is also developed with sta-
bility proof. In addition, the optimal feedback control schemes for several special
systems, such as switched systems, descriptor systems, and singularly perturbed
systems, are also constructed. In the second part, both the zero-sum games and
nonzero-sum games are studied using the idea of ADP. For the zero-sum games, it
is proved for the first time that the iterative policies converge to the mixed optimal
ones when the saddle point does not exist. For the nonzero-sum games, a single
network ADP approach is established to seek the Nash equilibrium. In the third
part, a self-learning call admission control scheme is established for CDMA cellular
networks, while an engine torque and air–fuel ratio control scheme is provided using
ADP. In summary, the book establishes the fundamental theory of optimal control
based on ADP with convergence proofs and stability analyses, and shows that the ADP
method can be put to use both in simulation and in real applications.

1.5 About This Book

This book covers some of the most recent developments in adaptive dynamic programming
(ADP). After Derong Liu landed his first academic job in 1995, he was exposed to
ADP at the suggestion of Paul Werbos. Within four years, he was lucky enough to
receive an NSF CAREER Award, for a project on ADP for network traffic control.
He started publishing papers in ADP in the same year [55], with publications ranging
from theoretical developments to applications, some of which were included in his
book co-authored with Huaguang Zhang, Yanhong Luo, and Ding Wang in 2013
[139]. The present book reports some of the more recent results of Derong Liu’s
research group, which were mostly published after 2012.
The development of ADP has been on a fast track since 2006, when many
researchers joined the effort. Within Derong Liu’s research group alone, since 2012,
six PhD students and one post-doctoral fellow have graduated with a few more in
the pipeline. They explored new ideas of ADP and they expanded existing theories in
various dimensions. The 14 chapters in the present book cover theoretical develop-
ments in discrete-time systems and then continuous-time systems. Some application
examples are given at the end. Continuing efforts are still being made in the quest
for finding solutions to dynamic programming problems with a manageable amount
of computation and guarantees of stability, convergence, and optimality.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Al-Tamimi A, Abu-Khalaf M, Lewis FL (2007) Adaptive critic designs for discrete-time
zero-sum games with application to H∞ control. IEEE Trans Syst Man Cybern Part B Cybern
37(1):240–247
3. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part
B Cybern 38(4):943–949
4. Anderson CW, Miller WT III (1990) Challenging control problems. In: Miller WT III, Sutton
RS, Werbos PJ (eds) Neural networks for control (Appendix). MIT Press, Cambridge, MA
5. Bai X, Zhao D, Yi J (2009) Coordinated multiple ramps metering based on neuro-fuzzy
adaptive dynamic programming. In: Proceedings of the international joint conference on
neural networks, pp 241–248
6. Balakrishnan SN, Biega V (1996) Adaptive-critic-based neural networks for aircraft optimal
control. AIAA J Guid Control Dyn 19:893–898
7. Barto AG (1992) Reinforcement learning and adaptive critic methods. In: White DA, Sofge
DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive approaches (chapter
12). Van Nostrand Reinhold, New York
8. Baudis P, Gailly JL (2012) PACHI: state of the art open source Go program. In: Advances in
computer games (Lecture notes in computer science), vol 7168. pp 24–38
9. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton, NJ
10. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspec-
tives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
11. Bertsekas DP (2005) Dynamic programming and optimal control. Athena Scientific, Belmont,
MA
12. Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont,
MA
13. Buro M (1998) From simple features to sophisticated evaluation functions. In: Proceedings
of the international conference on computers and games (Lecture notes in computer science),
vol 1558. pp 126–145
14. Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement learning and dynamic
programming using function approximators. CRC Press, Boca Raton, FL
15. Cai X, Wunsch DC (2001) A parallel computer-Go player, using HDP method. In: Proceedings
of the international joint conference on neural networks, pp 2373–2375
16. Campbell M, Hoane AJ, Hsu FH (2002) Deep blue. Artif Intell 134(1–2):57–83
17. Chen XW, Lin X (2014) Big data deep learning: challenges and perspectives. IEEE Access
2:514–525
18. Clark C, Storkey AJ (2015) Training deep convolutional neural networks to play Go. In:
Proceedings of the international conference on machine learning, pp 1766–1774
19. Coster H (2011) Schneider National uses data to survive a bumpy economy. Forbes, 12 Sept
2011
20. Coulom R (2007) Computing Elo ratings of move patterns in the game of Go. ICGA J
30(4):198–208
21. Cox C, Stepniewski S, Jorgensen C, Saeks R, Lewis C (1999) On the design of a neural
network autolander. Int J Robust Nonlinear Control 9:1071–1096
22. Dalton J, Balakrishnan SN (1996) A neighboring optimal adaptive critic for missile guidance.
Math Comput Model 23:175–188
23. Dreyfus SE, Law AM (1977) The art and theory of dynamic programming. Academic Press,
New York
24. Enzenberger M (2004) Evaluation in Go by a neural network using soft segmentation. In:
Advances in computer games - many games, many challenges (Proceedings of the advances
in computer games conference), pp 97–108
25. Fu ZP, Zhang YN, Hou HY (2014) Survey of deep learning in face recognition. In: Proceedings
of the IEEE international conference on orange technologies, pp 5–8
26. Ghesu FC, Krubasik E, Georgescu B, Singh V, Zheng Y, Hornegger J, Comaniciu D (2016)
Marginal space deep learning: efficient architecture for volumetric image parsing. IEEE Trans
Med Imaging 35(5):1217–1228
27. Gosavi A (2009) Reinforcement learning: a tutorial survey and recent advances. INFORMS
J Comput 21(2):178–192
28. Han D, Balakrishnan SN (2002) State-constrained agile missile control with adaptive-critic-
based neural networks. IEEE Trans Control Syst Technol 10(4):481–489
29. Haykin S (2009) Neural networks and learning machines, 3rd edn. Prentice-Hall, Upper
Saddle River, NJ
30. He H, Ni Z, Fu J (2012) A three-network architecture for on-line learning and optimization
based on adaptive dynamic programming. Neurocomputing 78(1):3–13
31. Huang T, Liu D (2013) A self-learning scheme for residential energy system control and
management. Neural Comput Appl 22(2):259–269
32. Jiang Y, Jiang ZP (2012) Robust adaptive dynamic programming for large-scale systems with
an application to multimachine power systems. IEEE Trans Circuits Syst II: Express Briefs
59(10):693–697
33. Jiang Y, Jiang ZP (2013) Robust adaptive dynamic programming with an application to power
systems. IEEE Trans Neural Netw Learn Syst 24(7):1150–1156
34. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell
Res 4:237–285
35. Konoplich GV, Putin EO, Filchenkov AA (2016) Application of deep learning to the problem
of vehicle detection in UAV images. In: Proceedings of the IEEE international conference on
soft computing and measurements, pp 4–6
36. Kulkarni NV, KrishnaKumar K (2003) Intelligent engine control using an adaptive critic.
IEEE Trans Control Syst Technol 11:164–173
37. Leake RJ, Liu RW (1967) Construction of suboptimal control sequences. SIAM J Control
5(1):54–63
38. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
39. Lendaris GG, Paintz C (1997) Training strategies for critic and action neural networks in
dual heuristic programming method. In: Proceedings of the IEEE international conference on
neural networks, pp 712–717
40. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken, NJ
41. Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
42. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
43. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst
Mag 32(6):76–105
44. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
45. Li H, Liu D, Wang D (2014) Integral reinforcement learning for linear continuous-time zero-
sum games with completely unknown dynamics. IEEE Trans Autom Sci Eng 11(3):706–714
46. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
47. Littman ML (2015) Reinforcement learning improves behaviour from evaluative feedback.
Nature 521:445–451
48. Liu D (2005) Approximate dynamic programming for self-learning control. Acta Autom Sin
31(1):13–18
49. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
50. Liu D, Javaherian H, Kovalenko O, Huang T (2008) Adaptive critic learning techniques
for engine torque and air-fuel ratio control. IEEE Trans Syst Man Cybern Part B Cybern
38(4):988–993
51. Liu D, Li C, Li H, Wang D, Ma H (2015) Neural-network-based decentralized control of
continuous-time nonlinear interconnected systems with unknown dynamics. Neurocomputing
165:90–98
52. Liu D, Li H, Wang D (2014) Online synchronous approximate optimal learning algorithm for
multiplayer nonzero-sum games with unknown dynamics. IEEE Trans Syst Man Cybern Syst
44(8):1015–1027
53. Liu D, Li H, Wang D (2013) Data-based self-learning optimal control: research progress and
prospects. Acta Autom Sin 39(11):1858–1870
54. Liu D, Li H, Wang D (2015) Error bounds of adaptive dynamic programming algorithms
for solving undiscounted optimal control problems. IEEE Trans Neural Netw Learn Syst
26(6):1323–1334
55. Liu D, Patino HD (1999) A self-learning ship steering controller based on adaptive critic
designs. In: Proceedings of the IFAC triennial world congress, pp 367–372
56. Liu D, Wang D, Li H (2014) Decentralized stabilization for a class of continuous-time non-
linear interconnected systems using online learning optimal control approach. IEEE Trans
Neural Netw Learn Syst 25(2):418–428
57. Liu D, Wang D, Wang FY, Li H, Yang X (2014) Neural-network-based online HJB solution
for optimal robust guaranteed cost control of continuous-time uncertain nonlinear systems.
IEEE Trans Cybern 44(12):2834–2847
58. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
59. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
60. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
61. Liu D, Wei Q (2014) Policy iterative adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
62. Liu D, Wei Q (2014) Multi-person zero-sum differential games for a class of uncertain non-
linear systems. Int J Adapt Control Signal Process 28(3–5):205–231
63. Liu D, Wei Q, Yan P (2015) Generalized policy iteration adaptive dynamic programming for
discrete-time nonlinear systems. IEEE Trans Syst Man Cybern Syst 45(12):1577–1591
64. Liu D, Xiong X, Zhang Y (2001) Action-dependent adaptive critic designs. In: Proceedings
of the international joint conference on neural networks, pp 990–995
65. Liu D, Zhang Y, Zhang H (2005) A self-learning call admission control scheme for CDMA
cellular networks. IEEE Trans Neural Netw 16(5):1219–1228
66. Maddison CJ, Huang A, Sutskever I, Silver D (2015) Move evaluation in Go using deep con-
volutional neural networks. In: The 3rd international conference on learning representations.
http://arxiv.org/abs/1412.6564
67. Marbach P, Mihatsch O, Tsitsiklis JN (2000) Call admission control and routing in integrated
service networks using neuro-dynamic programming. IEEE J Sel Areas Commun 18(2):197–
208
68. Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement
learning. Nature 518:529–533
69. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural
Netw Learn Syst 24(10):1513–1525
70. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50:193–202
71. Moyer C (2016) How Google’s AlphaGo beat a Go world champion. The Atlantic, 28 Mar
2016
72. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE
Trans Syst Man Cybern Part C Appl Rev 32(2):140–153
73. Murray JJ, Cox CJ, Saeks RE (2003) The adaptive dynamic programming theorem. In: Liu
D, Antsaklis PJ (eds) Stability and control of dynamical systems with applications (chapter
19). Birkhäuser, Boston
74. Nguyen HD, Le AD, Nakagawa M (2015) Deep neural networks for recognizing online
handwritten mathematical symbols. In: Proceedings of the IAPR Asian conference on pattern
recognition, pp 121–125
75. Padhi R, Unnikrishnan N, Wang X, Balakrishnan SN (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
76. Powell WB (2007) Approximate dynamic programming: solving the curses of dimensionality.
Wiley, Hoboken, NJ
77. Powell WB, Bouzaiene-Ayari B, Lawrence C et al (2014) Locomotive planning at Nor-
folk Southern: an optimizing simulator using approximate dynamic programming. Interfaces
44(6):567–578
78. Prokhorov DV, Santiago RA, Wunsch DC (1995) Adaptive critic designs: a case study for
neurocontrol. Neural Netw 8:1367–1372
79. Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8:997–
1007
80. Qiu J, Wu Q, Ding G, Xu Y, Feng S (2016) A survey of machine learning for big data
processing. EURASIP J Adv Signal Process. doi:10.1186/s13634-016-0355-x
81. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc Control
Theory Appl 153(5):567–574
82. Rummery GA, Niranjan M (1994) On-line Q-learning using connectionist systems. Technical
Report CUED/F-INFENG/TR 166. Engineering Department, Cambridge University, UK
83. Saeks RE, Cox CJ, Mathia K, Maren AJ (1997) Asymptotic dynamic programming: prelim-
inary concepts and results. In: Proceedings of the IEEE international conference on neural
networks, pp 2273–2278
84. Sahoo A, Xu H, Jagannathan S (2016) Near optimal event-triggered control of nonlinear
discrete-time systems using neurodynamic programming. IEEE Trans Neural Netw Learn
Syst 27(9):1801–1815
85. Santiago RA, Werbos PJ (1994) New progress towards truly brain-like intelligent control. In:
Proceedings of the world congress on neural networks, vol I. pp 27–33
86. Saridis GN, Wang FY (1994) Suboptimal control of nonlinear stochastic systems. Control
Theory Adv Technol 10(4):847–871
87. Schaeffer J, Culberson J, Treloar N, Knight B, Lu P, Szafron D (1992) A world championship
caliber checkers program. Artif Intell 53(2–3):273–289
88. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
89. Schraudolph NN, Dayan P, Sejnowski TJ (1994) Temporal difference learning of position
evaluation in the game of Go. In: Advances in neural information processing systems 6 (NIPS
1993), pp 817–824
90. Si J, Barto AG, Powell WB, Wunsch DC (2004) Handbook of learning and approximate
dynamic programming. IEEE, Piscataway, NJ
91. Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(3):264–276
92. Silver D, Huang A, Maddison CJ et al (2016) Mastering the game of Go with deep neural
networks and tree search. Nature 529:484–489
93. Silver D, Sutton R, Müller M (2012) Temporal-difference search in computer Go. Machine
Learn 87(2):183–219
94. Simao HP, Day J, George AP, et al (2009) An approximate dynamic programming algorithm
for large-scale fleet management: a case application. Transp Sci 43(2):178–197
95. Simao HP, George A, Powell WB et al (2010) Approximate dynamic programming captures
fleet operations for Schneider National. Interfaces 40(5):342–352
96. Sutton RS (1996) Generalization in reinforcement learning: successful examples using sparse
coarse coding. In: Advances in neural information processing systems 8 (NIPS 1995), pp
1038–1044
97. Sutton RS (1988) Learning to predict by the methods of temporal differences. Mach Learn
3:9–44
98. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
99. Tang Y, He H, Wen J, Liu J (2015) Power system stability control for a wind farm based on
adaptive dynamic programming. IEEE Trans Smart Grid 6(1):166–177
100. Tesauro GJ (1992) Practical issues in temporal difference learning. Mach Learn 8:257–277
101. Tesauro G (1994) TD-gammon, self-teaching backgammon program, achieves master-level
play. Neural Comput 6:215–219
102. Tian YD (2016) A simple analysis of AlphaGo. Acta Autom Sin 42(5):671–675
103. Tromp J (2016) Number of legal Go positions. http://tromp.github.io/go/legal.html
104. Venayagamoorthy GK, Harley RG, Wunsch DC (2002) Comparison of heuristic dynamic
programming and dual heuristic programming adaptive critics for neurocontrol of a turbo-
generator. IEEE Trans Neural Netw 13(5):764–773
105. Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for
continuous-time linear systems based on policy iteration. Automatica 45(2):477–484
106. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
107. Wang D, Liu D, Li H (2014) Policy iteration algorithm for online design of robust control of
a class of continuous-time nonlinear systems. IEEE Trans Autom Sci Eng 11(2):627–632
108. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonlinear discrete-
time systems based on adaptive dynamic programming approach. Automatica 48(8):1825–
1832
109. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36

110. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
111. Wang FY, Zhang JJ, Zheng X et al (2016) Where does AlphaGo go: from Church-Turing
thesis to AlphaGo thesis and beyond. IEEE/CAA J Autom Sin 3(2):113–120
112. Watkins CJCH (1989) Learning from delayed rewards. Ph.D. Thesis, Cambridge University,
UK
113. Watkins CJCH, Dayan P (1992) Q-learning. Mach Learn 8:279–292
114. Wei Q, Liu D (2013) Numerical adaptive learning control scheme for discrete-time non-linear
systems. IET Control Theory Appl 7(11):1472–1486
115. Wei Q, Liu D (2014) A novel iterative θ-adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
116. Wei Q, Liu D (2014) Adaptive dynamic programming for optimal tracking control of
unknown nonlinear systems with application to coal gasification. IEEE Trans Autom Sci
Eng 11(4):1020–1036
117. Wei Q, Liu D (2014) Stable iterative adaptive dynamic programming algorithm with approx-
imation errors for discrete-time nonlinear systems. Neural Comput Appl 24(6):1355–1367
118. Wei Q, Liu D (2014) Data-driven neuro-optimal temperature control of water-gas shift reaction
using stable iterative adaptive dynamic programming. IEEE Trans Ind Electron 61(11):6399–
6408
119. Wei Q, Liu D, Shi G (2015) A novel dual iterative Q-learning method for optimal battery
management in smart residential environments. IEEE Trans Ind Electron 62(4):2509–2518
120. Wei Q, Liu D, Shi G, Liu Y (2015) Multibattery optimal coordination control for home energy
management systems via distributed iterative adaptive dynamic programming. IEEE Trans
Ind Electron 62(7):4203–4214
121. Wei Q, Liu D, Xu Y (2014) Neuro-optimal tracking control for a class of discrete-time non-
linear systems via generalized value iteration adaptive dynamic programming. Soft Comput
20(2):697–706
122. Wei Q, Liu D, Yang X (2015) Infinite horizon self-learning optimal control of nonaffine
discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 26(4):866–879
123. Wei Q, Wang FY, Liu D, Yang X (2014) Finite-approximation-error-based discrete-time iter-
ative adaptive dynamic programming. IEEE Trans Cybern 44(12):2820–2833
124. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. Gen Syst Yearb 22:25–38
125. Werbos PJ (1987) Building and understanding adaptive systems: a statistical/numerical
approach to factory automation and brain research. IEEE Trans Syst Man Cybern SMC
17(1):7–20
126. Werbos PJ (1990) Consistency of HDP applied to a simple reinforcement learning problem.
Neural Netw 3:179–189
127. Werbos PJ (1990) A menu of designs for reinforcement learning over time. In: Miller WT,
Sutton RS, Werbos PJ (eds) Neural networks for control (chapter 3). MIT Press, Cambridge,
MA
128. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural mod-
eling. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and
adaptive approaches (chapter 13). Van Nostrand Reinhold, New York
129. Werbos PJ (2007) Using ADP to understand and replicate brain intelligence: the next level
design. In: Proceedings of the IEEE symposium on approximate dynamic programming and
reinforcement learning, pp 209–216
130. Werbos PJ (2008) ADP: the key direction for future research in intelligent control and under-
standing brain intelligence. IEEE Trans Syst Man Cybern Part B Cybern 38(4):898–900
131. Werbos PJ (2009) Intelligence in the brain: a theory of how it works and how to build it.
Neural Netw 22(3):200–212
132. Yan P, Wang D, Li H, Liu D (2016) Error bound analysis of Q-function for discounted
optimal control problems with policy iteration. IEEE Trans Syst Man Cybern Syst. doi:10.
1109/TSMC.2016.2563982

133. Yang L, Enns R, Wang YT, Si J (2003) Direct neural dynamic programming. In: Liu D,
Antsaklis PJ (eds) Stability and control of dynamical systems with applications (chapter 10).
Birkhäuser, Boston
134. Yang X, Liu D, Huang Y (2013) Neural-network-based online optimal control for uncer-
tain non-linear continuous-time systems with control constraints. IET Control Theory Appl
7(17):2037–2047
135. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of
unknown continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–
566
136. Yang X, Liu D, Wei Q (2014) Online approximate optimal control for affine non-linear systems
with unknown internal dynamics using adaptive dynamic programming. IET Control Theory
Appl 8(16):1676–1688
137. Zaman R, Prokhorov D, Wunsch DC (1997) Adaptive critic design in learning to play game
of Go. In: Proceedings of the international conference on neural networks, pp 1–4
138. Zaman R, Wunsch DC (1999) TD methods applied to mixture of experts for learning 9×9 Go
evaluation function. In: Proceedings of the international joint conference on neural networks,
pp 3734–3739
139. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algo-
rithms and stability. Springer, London
140. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
141. Zhang H, Zhang J, Yang GH, Luo Y (2015) Leader-based optimal coordination control for the
consensus problem of multiagent differential games via fuzzy adaptive dynamic programming.
IEEE Trans Fuzzy Syst 23(1):152–163
142. Zhao DB, Shao K, Zhu YH et al (2016) Review of deep reinforcement learning and discussions
on the development of computer Go. Control Theory Appl 33(6):701–717
143. Zhao Q, Xu H, Jagannathan S (2014) Near optimal output feedback control of nonlinear
discrete-time systems based on reinforcement neural network learning. IEEE/CAA J Autom
Sin 1(4):372–384
144. Zhong X, He H, Zhang H, Wang Z (2014) Optimal control for unknown discrete-time nonlinear
Markov jump systems using adaptive dynamic programming. IEEE Trans Neural Netw Learn
Syst 25(12):2141–2155
145. Zhu Y, Zhao D, He H (2012) Integration of fuzzy controller with adaptive dynamic pro-
gramming. In: Proceedings of the world congress on intelligent control and automation, pp
310–315
146. Zurada JM (1992) Introduction to artificial neural systems. West, St. Paul, MN
Part I
Discrete-Time Systems
Chapter 2
Value Iteration ADP for Discrete-Time
Nonlinear Systems

2.1 Introduction

Nonlinear optimal control has been a focus of the control field for many decades
[7, 10, 23, 39]. It often requires solving the nonlinear Bellman equation. The Bellman
equation is more difficult to work with than the Riccati equation because it involves
solving nonlinear partial difference equations. Although dynamic programming has
been a useful technique in handling optimal control problems for nonlinear systems,
it is often computationally untenable to perform it to obtain the optimal solutions
because of the well-known “curse of dimensionality” [9, 14]. Fortunately, relying
on the strong abilities of self-learning and adaptivity of artificial neural networks
(ANNs), the ADP method was proposed by Werbos [46, 47] to deal with optimal
control problems forward-in-time. In recent years, ADP and related research have
gained much attention from scholars (see the recent books [22, 40, 50] and the
references cited therein).
It is important to note that iterative methods are often used in ADP to obtain the
solution of the Bellman equation indirectly, and they have received more and more
attention. In [24], iterative ADP algorithms were classified into two main schemes,
namely policy iteration (PI) and value iteration (VI) [38, 40]. PI algorithms consist of
policy evaluation and policy improvement [18, 38, 40]. An initial stabilizing control
law is required, which is often difficult to obtain. Compared with VI algorithms,
PI typically requires fewer iterations, since it behaves like Newton's method, but each
iteration is more computationally demanding. VI algorithms solve the optimal control
problem without requiring an initial stabilizing control law, which makes them easy to
implement. However, a stabilizing control law cannot be obtained until the value
function converges. This means that only the converged optimal control $u^*(x_k)$
(a function of the system state $x_k$) can be used to control the nonlinear system,
whereas the iterative controls $v_i(x_k)$, $i = 0, 1, \ldots$, may be invalid. Hence, the
computational efficiency of the VI ADP method is low. Besides, most VI algorithms are
implemented off-line, which greatly limits their applications. In this chapter, the
VI ADP approach is employed to solve the optimal control problems of discrete-time
nonlinear systems, where several value iteration schemes are developed to overcome
the above difficulties.

© Springer International Publishing AG 2017
D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_2

First, an ADP scheme based on general value iteration (GVI) is developed
to obtain optimal control for discrete-time affine nonlinear systems [25]. The
selection of the initial value function differs from that of the traditional VI algorithm,
and a new method is introduced to demonstrate the convergence property and the
convergence speed of the value functions. The control law obtained at each iteration can
stabilize the system under some conditions. To facilitate the implementation of the
iterative scheme, three NNs with the Levenberg–Marquardt (LM) training algorithm
are used to approximate the unknown system, the value function, and the control
law, respectively. Then, the GVI-based ADP method is generalized to solve the
infinite-horizon optimal tracking control problem for a class of discrete-time nonlinear
systems [45]. The GVI-based ADP algorithm can be initialized with an arbitrary
positive-semidefinite function, which is an advantage over traditional VI algorithms
that start from a zero function. Next, the ADP approach is used for designing
the optimal controller of discrete-time nonlinear systems with unknown dynamics
and constrained inputs [28]. The iterative ADP algorithm is developed to solve the
constrained optimal control problem based on the VI algorithm, which can be regarded
as a special case of GVI. Three NNs are employed for approximating the unknown
nonlinear system dynamics, the value function and its derivatives, and the control
law, respectively, under the framework of the globalized dual heuristic programming
(GDHP) technique. Finally, an iterative θ-ADP technique is developed to solve optimal
control problems for infinite-horizon discrete-time nonlinear systems [44]. The
condition of an initial admissible control required by the PI algorithm is avoided. It is
proved that all the iterative controls obtained in the iterative θ-ADP algorithm can
stabilize the nonlinear system, which means that the iterative θ-ADP algorithm is
feasible for implementation both online and off-line. Convergence analysis of the value
function is presented to guarantee that the iterative value function converges to the
optimum monotonically.

2.2 Optimal Control of Nonlinear Systems Using General Value Iteration

Consider the discrete-time nonlinear systems described by

$$x_{k+1} = F(x_k, u_k), \quad k = 0, 1, 2, \ldots, \tag{2.2.1}$$

where $x_k \in \mathbb{R}^n$ is the state vector at time $k$, $u_k = u(x_k) \in \mathbb{R}^m$ is the state feedback
control vector, and $F(\cdot, \cdot)$ is the nonlinear system function. Let $x_0$ be the initial state.
Let the following assumptions hold throughout this chapter.

Assumption 2.2.1 $F(0, 0) = 0$, and the state feedback control law $u(\cdot)$ satisfies
$u(0) = 0$, i.e., $x_k = 0$ is an equilibrium state of system (2.2.1) under the control
$u_k = 0$.

Assumption 2.2.2 $F(x_k, u_k)$ is Lipschitz continuous on a compact set $\Omega \subset \mathbb{R}^n$
containing the origin.

Assumption 2.2.3 System (2.2.1) is controllable in the sense that there exists a
continuous control law on $\Omega$ that asymptotically stabilizes the system.

First, in Sects. 2.2.1 and 2.2.2, we develop a GVI-based optimal control scheme
for discrete-time nonlinear systems with affine form [25]. Consider the following
affine nonlinear systems

$$x_{k+1} = f(x_k) + g(x_k) u_k, \quad k = 0, 1, 2, \ldots, \tag{2.2.2}$$

where $f(\cdot) \in \mathbb{R}^n$ and $g(\cdot) \in \mathbb{R}^{n \times m}$ are differentiable and $f(0) = 0$. Our goal is to
find a state feedback control law $u(\cdot)$ such that $u_k = u(x_k)$ can stabilize the system
(2.2.2) and simultaneously minimize the infinite-horizon cost function given by

$$J(x_0, u) = J^u(x_0) = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{2.2.3}$$

where $U(x_k, u_k)$ is a positive-definite utility function, i.e., $U(0, 0) = 0$ and
$U(x_k, u_k) > 0$ for all $(x_k, u_k) \neq (0, 0)$. Note that the control law $u(\cdot)$ must not only
stabilize the system on $\Omega$ but also guarantee (2.2.3) to be finite, i.e., the control law
must be admissible.

Definition 2.2.1 (cf. [5, 51]) A control law $u(\cdot)$ is said to be admissible with respect
to (2.2.2) (or (2.2.1)) on $\Omega$ if $u(\cdot)$ is continuous on $\Omega$, $u(0) = 0$, $u_k = u(x_k)$ stabilizes
(2.2.2) (or (2.2.1)) on $\Omega$, and $J(x_0, u)$ is finite, $\forall x_0 \in \Omega$.

Let $\mathcal{A}(\Omega)$ be the set of admissible control laws associated with the controllable
set $\Omega$ of states. For the optimal control problems we study in this book, the set $\mathcal{A}(\Omega)$
is assumed to be nonempty, i.e., $\mathcal{A}(\Omega) \neq \emptyset$.
Define the optimal cost function as

$$J^*(x_k) = \inf_{u} \left\{ J(x_k, u) : u \in \mathcal{A}(\Omega) \right\}.$$

According to [9, 11, 14, 23], the optimal cost function $J^*(x_k)$ satisfies the Bellman
equation

$$J^*(x_k) = \min_{u_k} \left\{ U(x_k, u_k) + J^*(x_{k+1}) \right\}. \tag{2.2.4}$$

Equation (2.2.4) is Bellman's principle of optimality for discrete-time systems.
Its importance lies in the fact that it allows one to optimize over only one control
vector at a time by working backward in time. The optimal control law $u^*(\cdot)$ should
satisfy

$$u_k^* = u^*(x_k) = \arg\min_{u_k} \left\{ U(x_k, u_k) + J^*(x_{k+1}) \right\}. \tag{2.2.5}$$

In general, the utility function can be chosen as the quadratic form given by

$$U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k, \tag{2.2.6}$$

where $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$ are positive-definite matrices. The optimal control
$u_k^*$ satisfies the first-order necessary condition, from which we obtain

$$u_k^* = -\frac{1}{2} R^{-1} \left( \frac{\partial x_{k+1}}{\partial u_k} \right)^T \frac{\partial J^*(x_{k+1})}{\partial x_{k+1}} = -\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial J^*(x_{k+1})}{\partial x_{k+1}}.$$

Equation (2.2.4) reduces to the Riccati equation in the case of the linear quadratic
regulator problem. However, in the nonlinear case, the cost function of the optimal
control problem cannot be obtained directly. Therefore, we will solve the Bellman
equation by the GVI algorithm.
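
As a brief worked step, the first-order condition behind the expression for $u_k^*$ can be written out explicitly. This is only a sketch under the affine form (2.2.2) and the quadratic utility (2.2.6), with the additional assumptions that $J^*$ is differentiable and $R$ is symmetric.

```latex
\begin{aligned}
0 &= \frac{\partial}{\partial u_k}
     \Big[ x_k^T Q x_k + u_k^T R u_k + J^*(x_{k+1}) \Big]
   = 2 R u_k + \Big( \frac{\partial x_{k+1}}{\partial u_k} \Big)^T
     \frac{\partial J^*(x_{k+1})}{\partial x_{k+1}}, \\
\frac{\partial x_{k+1}}{\partial u_k} &= g(x_k)
\;\Longrightarrow\;
u_k^* = -\tfrac{1}{2} R^{-1} g^T(x_k)\,
        \frac{\partial J^*(x_{k+1})}{\partial x_{k+1}} .
\end{aligned}
```

Note that the right-hand side still depends on $u_k^*$ through $x_{k+1} = f(x_k) + g(x_k) u_k^*$, so the expression is an implicit characterization rather than a closed-form solution.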

2.2.1 Convergence Analysis

Since direct solution of the Bellman equation is computationally intensive, we present
an iterative ADP algorithm in a general framework based on Bellman's principle of
optimality. Define the value function for system (2.2.2) as

$$V(x_k) = J^u(x_k).$$

As we have explained in Chap. 1, $V(x_k)$ is a short notation of $V(x_k, u)$ or $V^u(x_k)$ for
convenience of presentation.
First, the initial value function is chosen as a quadratic form given by

$$V_0(x_k) = x_k^T P_0 x_k, \tag{2.2.7}$$

where $P_0$ is a positive-definite matrix. Then, for $i = 0, 1, 2, \ldots$, the GVI-based ADP
algorithm iterates between a sequence of control laws $v_i(x_k)$,

$$\begin{aligned}
v_i(x_k) &= \arg\min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_i(x_{k+1}) \right\} \\
         &= \arg\min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_i(f(x_k) + g(x_k) u_k) \right\},
\end{aligned} \tag{2.2.8}$$

and a sequence of value functions $V_{i+1}(x_k)$,

$$\begin{aligned}
V_{i+1}(x_k) &= \min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_i(x_{k+1}) \right\} \\
             &= x_k^T Q x_k + v_i^T(x_k) R v_i(x_k) + V_i(f(x_k) + g(x_k) v_i(x_k)).
\end{aligned} \tag{2.2.9}$$

From the $i$th iteration of the algorithm in (2.2.8)–(2.2.9), we obtain $v_i(x_k)$ and
$V_{i+1}(x_k)$.
In the above VI algorithm, (2.2.8) is called policy improvement (or policy update)
and (2.2.9) is called value function update [38, 40]. In (2.2.8), an improved policy
that is better or at least not worse than the previous policy is obtained using the current
value function. In (2.2.9), an updated value function, to be used in the next iteration,
is calculated using the current policy. It is a one-step procedure for approximating
the value function corresponding to the current policy, and thus, (2.2.9) is also called
one-step policy evaluation [38].
If we want the outcomes of the $i$th iteration to be $V_i(x_k)$ and $v_i(x_k)$, the iterative
algorithm (2.2.8)–(2.2.9) can be rewritten as follows.
From the initial value function given in (2.2.7), we obtain the control law
$v_0(x_k)$ by

$$\begin{aligned}
v_0(x_k) &= \arg\min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_0(x_{k+1}) \right\} \\
         &= \arg\min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_0(f(x_k) + g(x_k) u_k) \right\},
\end{aligned} \tag{2.2.10}$$

where $V_0(x_{k+1}) = x_{k+1}^T P_0 x_{k+1}$ according to (2.2.7). For $i = 1, 2, \ldots$, the GVI-based
ADP algorithm iterates between value function update

$$\begin{aligned}
V_i(x_k) &= \min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_{i-1}(x_{k+1}) \right\} \\
         &= x_k^T Q x_k + v_{i-1}^T(x_k) R v_{i-1}(x_k) + V_{i-1}(f(x_k) + g(x_k) v_{i-1}(x_k)),
\end{aligned} \tag{2.2.11}$$

and policy improvement

$$\begin{aligned}
v_i(x_k) &= \arg\min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_i(x_{k+1}) \right\} \\
         &= \arg\min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_i(f(x_k) + g(x_k) u_k) \right\}.
\end{aligned} \tag{2.2.12}$$

Now, from the $i$th iteration of the algorithm in (2.2.10)–(2.2.12), we obtain $V_i(x_k)$
and $v_i(x_k)$. Note that (2.2.11) is a simple calculation that updates the value function
using the previous policy and the previous value function, while (2.2.12) performs
the minimization so that an improved policy that is better or at least not worse than
the previous policy is obtained using the newly updated value function.
Since our goal in optimal control design is to obtain an optimal controller, it is
desirable to have the outcome of an algorithm as $v_i(x_k)$ at the end of the $i$th iteration,
whereas $V_i(x_k)$ becomes an internal variable.
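
The iteration (2.2.10)–(2.2.12) can be sketched in code for the special case where it has a closed form. The following is a minimal illustration, assuming a linear system $x_{k+1} = A x_k + B u_k$ (so $f(x_k) = A x_k$, $g(x_k) = B$), for which $V_i(x_k) = x_k^T P_i x_k$ stays quadratic. The function names and the numerical values of `A`, `B`, `Q`, `R`, `P0` are hypothetical choices for illustration and do not come from the text.

```python
import numpy as np


def gvi_linear_quadratic(A, B, Q, R, P0, num_iters=300):
    """GVI (2.2.10)-(2.2.12) specialized to x_{k+1} = A x_k + B u_k.

    With V_i(x) = x^T P_i x, the minimizer in (2.2.12) is v_i(x) = -K_i x,
    and (2.2.11) becomes a matrix recursion for P_i.
    """
    P = np.array(P0, dtype=float)
    for _ in range(num_iters):
        # Policy improvement: K_i = (R + B^T P_i B)^{-1} B^T P_i A.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Value function update under v_i: P_{i+1} from P_i.
        A_cl = A - B @ K
        P = Q + K.T @ R @ K + A_cl.T @ P @ A_cl
    # Control gain associated with the final value function.
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K


def riccati_residual(A, B, Q, R, P):
    """Residual of the discrete-time algebraic Riccati equation at P."""
    gain_term = A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return Q + A.T @ P @ A - gain_term - P


# Hypothetical example data (not from the text).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
P0 = np.eye(2)  # positive-definite initial value function V_0(x) = x^T P0 x

P, K = gvi_linear_quadratic(A, B, Q, R, P0)
residual = np.abs(riccati_residual(A, B, Q, R, P)).max()
spectral_radius = np.abs(np.linalg.eigvals(A - B @ K)).max()
```

In this linear quadratic setting the converged `P` should satisfy the Riccati equation to which the Bellman equation (2.2.4) reduces, and the resulting gain should stabilize the closed loop. For general nonlinear $f$ and $g$ no such closed form is available, which is why function approximators are needed.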
The VI algorithm in (2.2.8)–(2.2.9) was originally given in [5] with $V_0(\cdot) = 0$.
The iterative process is shown in Table 2.1, where each column of blocks represents
an iteration. The iteration goes from top to bottom within each column and from the
bottom block to the top block in the next column, as shown in Fig. 2.1.

Table 2.1 The iterative process of the VI algorithm in (2.2.8)–(2.2.9)

V0 → v0 (2.2.8)      V1 → v1 (2.2.8)      V2 → v2 (2.2.8)      ···
minimization         minimization         minimization
v0 → V1 (2.2.9)      v1 → V2 (2.2.9)      v2 → V3 (2.2.9)      ···
calculation          calculation          calculation
i = 0                i = 1                i = 2                ···

Fig. 2.1 The iteration flowchart of the algorithm in Table 2.1

Similarly, the ADP algorithm in (2.2.10)–(2.2.12) can be described by Table 2.2.

Table 2.2 The iterative process of the VI algorithm in (2.2.10)–(2.2.12)

(empty)              v0 → V1 (2.2.11)     v1 → V2 (2.2.11)     ···
                     calculation          calculation
V0 → v0 (2.2.10)     V1 → v1 (2.2.12)     V2 → v2 (2.2.12)     ···
minimization         minimization         minimization
i = 0                i = 1                i = 2                ···

Comparing the two tables, one can see that they contain exactly the same
iterations, except that Table 2.2 does not start in the very first block.
Note that $i$ is the iteration index and $k$ is the time index. As a VI algorithm,
this iterative ADP algorithm does not require an initial stabilizing controller. The
value function and control law are updated until they converge to the optimal ones.
Furthermore, the iterates satisfy $V_i(0) = 0$ and $v_i(0) = 0$, $\forall i \ge 0$.
It should be mentioned that the initial value function here is chosen as
$V_0(x_k) = x_k^T P_0 x_k$ instead of $V_0(\cdot) = 0$ as in most traditional VI algorithms [3–5, 51, 52]. In
what follows, we will prove the convergence of the iterations between (2.2.11) and
(2.2.12), i.e., $V_i \to J^*$ and $v_i \to u^*$ as $i \to \infty$.

Lemma 2.2.1 Let $\mu_i$ be an arbitrary control law and let $\Lambda_i$ be obtained by

$$\Lambda_{i+1}(x_k) = x_k^T Q x_k + \mu_i^T(x_k) R \mu_i(x_k) + \Lambda_i(f(x_k) + g(x_k) \mu_i(x_k)),$$

for $i = 0, 1, 2, \ldots$. Let $V_i$ and $v_i$ be defined in (2.2.10)–(2.2.12). If
$\Lambda_0(x_k) = V_0(x_k) = x_k^T P_0 x_k$, then

$$V_i(x_k) \le \Lambda_i(x_k), \quad \forall i.$$

The lemma can easily be proved by noting that $V_i$ is the result of minimizing the
right-hand side of (2.2.11) with respect to the control input $u_k$, while $\Lambda_i$ is the result
of an arbitrary control input.
Theorem 2.2.1 Define the value function sequence $\{V_i(x_k)\}$ and the control law
sequence $\{v_i(x_k)\}$ as in (2.2.10)–(2.2.12) with $V_0(x_k) = x_k^T P_0 x_k$ in (2.2.7).
If $V_0(x_k) \ge V_1(x_k)$ holds for all $x_k$, the value function sequence $\{V_i\}$ is a monotonically
nonincreasing sequence, i.e., $V_{i+1}(x_k) \le V_i(x_k)$, $\forall x_k$, $\forall i \ge 0$. If $V_0(x_k) \le V_1(x_k)$
holds for all $x_k$, the value function sequence $\{V_i(x_k)\}$ is a monotonically nondecreasing
sequence, i.e., $V_i(x_k) \le V_{i+1}(x_k)$, $\forall x_k$, $\forall i \ge 0$.
Proof First, suppose that $V_0(x_k) \ge V_1(x_k)$ holds for any $x_k$. Define a new sequence
$\{\Phi_i\}$, which is updated according to

$$\begin{aligned}
\Phi_1(x_k) &= x_k^T Q x_k + v_0^T(x_k) R v_0(x_k) + \Phi_0(f(x_k) + g(x_k) v_0(x_k)), \\
\Phi_{i+1}(x_k) &= x_k^T Q x_k + v_{i-1}^T(x_k) R v_{i-1}(x_k) + \Phi_i(f(x_k) + g(x_k) v_{i-1}(x_k)), \quad i \ge 1,
\end{aligned}$$

where $\Phi_0(x_k) = V_0(x_k) = x_k^T P_0 x_k$ and $\{v_i\}$ are obtained by (2.2.10) and (2.2.12).
Now, we use mathematical induction to demonstrate

$$\Phi_{i+1}(x_k) \le V_i(x_k), \quad \forall i \ge 0.$$

Noticing $\Phi_1(x_k) = V_1(x_k)$, it is clear that $\Phi_1(x_k) \le V_0(x_k)$. Then, we assume that it
holds for $i - 1$, i.e., $\Phi_i(x_k) \le V_{i-1}(x_k)$, $\forall i \ge 1$, $\forall x_k$. According to

$$V_i(x_k) = x_k^T Q x_k + v_{i-1}^T(x_k) R v_{i-1}(x_k) + V_{i-1}(x_{k+1}), \quad i \ge 1,$$

and

$$\Phi_{i+1}(x_k) = x_k^T Q x_k + v_{i-1}^T(x_k) R v_{i-1}(x_k) + \Phi_i(x_{k+1}), \quad i \ge 1,$$

we have

$$V_i(x_k) - \Phi_{i+1}(x_k) = V_{i-1}(x_{k+1}) - \Phi_i(x_{k+1}) \ge 0, \quad i \ge 1,$$

which implies $\Phi_{i+1}(x_k) \le V_i(x_k)$, $i \ge 1$. Considering $\Phi_1(x_k) \le V_0(x_k)$, we have
$\Phi_{i+1}(x_k) \le V_i(x_k)$, $i \ge 0$. According to Lemma 2.2.1, it is clear that
$V_{i+1}(x_k) \le \Phi_{i+1}(x_k)$, $\forall i \ge 0$. Therefore,

$$V_{i+1}(x_k) \le V_i(x_k), \quad \forall i \ge 0, \ \forall x_k.$$

Thus, we complete the first part of the proof by mathematical induction.


Next, suppose that $V_0(x_k) \le V_1(x_k)$ holds for any $x_k$. Define a new sequence $\{\Gamma_i\}$,
which is updated according to

$$\Gamma_{i+1}(x_k) = x_k^T Q x_k + v_{i+1}^T(x_k) R v_{i+1}(x_k) + \Gamma_i(x_{k+1}), \quad i \ge 0,$$

with $\Gamma_0(x_k) = V_0(x_k) = x_k^T P_0 x_k$.
Similarly, we use mathematical induction to demonstrate

$$\Gamma_i(x_k) \le V_{i+1}(x_k), \quad \forall i \ge 0.$$

First, it is easy to see $\Gamma_0(x_k) = V_0(x_k) \le V_1(x_k)$. Then, we assume that it holds for
$i - 1$, i.e., $\Gamma_{i-1}(x_k) \le V_i(x_k)$, $\forall i \ge 1$, $\forall x_k$.
According to

$$\Gamma_i(x_k) = x_k^T Q x_k + v_i^T(x_k) R v_i(x_k) + \Gamma_{i-1}(x_{k+1}), \quad i \ge 1,$$

and

$$V_{i+1}(x_k) = x_k^T Q x_k + v_i^T(x_k) R v_i(x_k) + V_i(x_{k+1}), \quad i \ge 1,$$

we have

$$V_{i+1}(x_k) - \Gamma_i(x_k) = V_i(x_{k+1}) - \Gamma_{i-1}(x_{k+1}) \ge 0, \quad i \ge 1,$$

which implies $\Gamma_i(x_k) \le V_{i+1}(x_k)$, $i \ge 1$. Considering $\Gamma_0(x_k) \le V_1(x_k)$, we have
$\Gamma_i(x_k) \le V_{i+1}(x_k)$, $i \ge 0$. According to Lemma 2.2.1, it is easy to find
$V_i(x_k) \le \Gamma_i(x_k)$, $\forall i \ge 0$. Therefore,

$$V_i(x_k) \le V_{i+1}(x_k), \quad \forall i \ge 0, \ \forall x_k.$$

Thus, we complete the second part of the proof by mathematical induction.

Remark 2.2.1 From Theorem 2.2.1, we can see that the monotonicity property of
the value function $V_i$ is determined by the relationship between $V_0$ and $V_1$, i.e.,
$V_0(x_k) \ge V_1(x_k)$ or $V_0(x_k) \le V_1(x_k)$, $\forall x_k$. In the traditional VI algorithm, the initial
value function is selected as $V_0(\cdot) = 0$. We can easily find that this is just a special
case of our general scheme, i.e., $V_0(x_k) \le V_1(x_k)$, which leads to a nondecreasing
value function sequence. Furthermore, the monotonicity property is still valid starting
from $p$ if we can find that $V_p(x_k) \ge V_{p+1}(x_k)$ or $V_p(x_k) \le V_{p+1}(x_k)$ for all $x_k$ and
some $p$. For example,

$$V_p(x_k) \ge V_{p+1}(x_k) \ \text{for all } x_k \text{ and some } p \ge 0 \ \Rightarrow \ V_i(x_k) \ge V_{i+1}(x_k), \ \forall x_k, \ \forall i \ge p.$$
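
The two monotone cases of Theorem 2.2.1 can be observed numerically on a scalar example. The sketch below assumes the scalar linear system $x_{k+1} = a x_k + b u_k$ with utility $q x_k^2 + r u_k^2$, for which $V_i(x_k) = p_i x_k^2$ and the update (2.2.11)–(2.2.12) collapses to $p_{i+1} = q + a^2 r p_i / (r + b^2 p_i)$. The function name and the numerical values are hypothetical and only serve as an illustration.

```python
import math


def scalar_gvi_sequence(a, b, q, r, p0, num_iters=30):
    """Return [p_0, ..., p_N] where V_i(x) = p_i * x**2.

    Minimizing q x^2 + r u^2 + p (a x + b u)^2 over u gives
    u = -(a b p / (r + b^2 p)) x and the minimum value
    (q + a^2 r p / (r + b^2 p)) x^2.
    """
    seq = [float(p0)]
    for _ in range(num_iters):
        p = seq[-1]
        seq.append(q + a * a * r * p / (r + b * b * p))
    return seq


# Hypothetical example data (open-loop unstable since |a| > 1).
a, b, q, r = 1.2, 1.0, 1.0, 1.0

# V_0 = 0 gives V_0 <= V_1, so the sequence should be nondecreasing.
increasing = scalar_gvi_sequence(a, b, q, r, p0=0.0)
# A large V_0 gives V_0 >= V_1, so the sequence should be nonincreasing.
decreasing = scalar_gvi_sequence(a, b, q, r, p0=10.0)

# Fixed point of the recursion: positive root of
# b^2 p^2 + (r - a^2 r - q b^2) p - q r = 0.
c1 = r - a * a * r - q * b * b
p_star = (-c1 + math.sqrt(c1 * c1 + 4.0 * b * b * q * r)) / (2.0 * b * b)
```

Both sequences approach the same limit `p_star` from opposite sides, which is the scalar counterpart of $V_i \to J^*$ from below and from above.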

Next, we will demonstrate the uniform convergence of the value function using the
technique of [27, 35], and we will show that the control sequence converges to the
optimal control law by a corollary. The following theorem is due to Rantzer and his
coworkers [27, 35].

Theorem 2.2.2 Suppose the condition

$$0 \le J^*(f(x_k) + g(x_k) u_k) \le \gamma U(x_k, u_k)$$

holds uniformly for some $0 < \gamma < \infty$ and that $0 \le \alpha J^* \le V_0 \le \beta J^*$, $0 \le \alpha \le 1$,
and $1 \le \beta < \infty$. The value function sequence $\{V_i\}$ and the control law sequence $\{v_i\}$
are iteratively updated by (2.2.10)–(2.2.12). Then, the value function $V_i$ approaches
$J^*$ according to the following inequalities:

$$\left[ 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^i} \right] J^*(x_k) \le V_i(x_k) \le \left[ 1 + \frac{\beta - 1}{(1 + \gamma^{-1})^i} \right] J^*(x_k). \tag{2.2.13}$$

Moreover, the value function $V_i(x_k)$ converges to $J^*(x_k)$ uniformly on $\Omega$.

Proof First, we demonstrate that the system defined in this section satisfies the
conditions of Theorem 2.2.2. According to Assumption 2.2.2, the system state cannot
jump to infinity by any one step of finite control input, i.e., $f(x_k) + g(x_k) u_k$ is finite.
Because $U(x_k, u_k)$ is a positive-definite function, there exists some $0 < \gamma < \infty$ such
that $0 \le J^*(f(x_k) + g(x_k) u_k) \le \gamma U(x_k, u_k)$ holds uniformly. For any finite
positive-definite initial value function $V_0$, there exist $\alpha$ and $\beta$ such that
$0 \le \alpha J^* \le V_0 \le \beta J^*$ is satisfied, where $0 \le \alpha \le 1$ and $1 \le \beta < \infty$. Next, we will
demonstrate the lower bound of the inequality (2.2.13) by mathematical induction, i.e.,

$$\left[ 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^i} \right] J^*(x_k) \le V_i(x_k). \tag{2.2.14}$$

When $i = 1$, since

$$\frac{\alpha - 1}{1 + \gamma} \left[ \gamma U(x_k, u_k) - J^*(x_{k+1}) \right] \le 0, \quad 0 \le \alpha \le 1,$$

and $\alpha J^* \le V_0$, $\forall x_k$, we have

$$\begin{aligned}
V_1(x_k) &= \min_{u_k} \left\{ U(x_k, u_k) + V_0(x_{k+1}) \right\} \\
&\ge \min_{u_k} \left\{ U(x_k, u_k) + \alpha J^*(x_{k+1}) \right\} \\
&\ge \min_{u_k} \left\{ \left( 1 + \gamma \frac{\alpha - 1}{1 + \gamma} \right) U(x_k, u_k) + \left( \alpha - \frac{\alpha - 1}{1 + \gamma} \right) J^*(x_{k+1}) \right\} \\
&= \left[ 1 + \frac{\alpha - 1}{1 + \gamma^{-1}} \right] \min_{u_k} \left\{ U(x_k, u_k) + J^*(x_{k+1}) \right\} \\
&= \left[ 1 + \frac{\alpha - 1}{1 + \gamma^{-1}} \right] J^*(x_k).
\end{aligned}$$

Now, assume that the inequality (2.2.14) holds for $i - 1$. Then, we have

$$\begin{aligned}
V_i(x_k) &= \min_{u_k} \left\{ U(x_k, u_k) + V_{i-1}(x_{k+1}) \right\} \\
&\ge \min_{u_k} \left\{ U(x_k, u_k) + \left[ 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{i-1}} \right] J^*(x_{k+1}) \right\} \\
&\ge \min_{u_k} \left\{ \left[ 1 + \frac{(\alpha - 1) \gamma^i}{(\gamma + 1)^i} \right] U(x_k, u_k) + \left[ 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{i-1}} - \frac{(\alpha - 1) \gamma^{i-1}}{(\gamma + 1)^i} \right] J^*(x_{k+1}) \right\} \\
&= \left[ 1 + \frac{(\alpha - 1) \gamma^i}{(\gamma + 1)^i} \right] \min_{u_k} \left\{ U(x_k, u_k) + J^*(x_{k+1}) \right\} \\
&= \left[ 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^i} \right] J^*(x_k).
\end{aligned}$$

Thus, the lower bound of (2.2.13) is proved. The upper bound of (2.2.13) can be
shown by the same procedure.
Lastly, we demonstrate the uniform convergence of the value function as the iteration
index $i$ goes to $\infty$. When $i \to \infty$, for $0 < \gamma < \infty$, we have

$$\lim_{i \to \infty} \left[ 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^i} \right] J^*(x_k) = J^*(x_k),$$

and

$$\lim_{i \to \infty} \left[ 1 + \frac{\beta - 1}{(1 + \gamma^{-1})^i} \right] J^*(x_k) = J^*(x_k).$$

Define $V_\infty(x_k) = \lim_{i \to \infty} V_i(x_k)$. Then, we can get $V_\infty(x_k) = J^*(x_k)$. Hence, $V_i(x_k)$
converges pointwise to $J^*(x_k)$. Because $\Omega$ is compact, we can get the uniform
convergence of the value function immediately from Dini's theorem [6]. The proof is
complete.

From Theorem 2.2.2, we can determine the upper and lower bounds for every
iterative value function. As the iteration index i increases, the upper bound will
exponentially approach the lower bound. When the iteration index i goes to ∞, the
upper bound will be equal to the lower bound, which is just the optimal cost. Addi-
tionally, we can also analyze the convergence speed of the value function, which is
not available using the approaches in [3–5, 24, 51, 52]. According to the inequal-
ity (2.2.13), smaller γ will lead to faster convergence speed of the value function.
Moreover, it should be mentioned that conditions of Theorem 2.2.2 can be satisfied
according to Assumptions 2.2.1–2.2.3, which are mild for general control problems.
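
To make the role of $\gamma$, $\alpha$, and $\beta$ in (2.2.13) concrete, the bound coefficients can be evaluated directly. The sketch below is only an illustration: the helper names `gvi_bounds` and `iterations_for_tolerance` are hypothetical, and the second helper simply solves $\max(1 - \alpha, \beta - 1) / (1 + \gamma^{-1})^i \le \varepsilon$ for the smallest integer $i$.

```python
import math


def gvi_bounds(alpha, beta, gamma, i):
    """Coefficients (lower, upper) multiplying J*(x_k) in (2.2.13)."""
    rate = (1.0 + 1.0 / gamma) ** i
    return 1.0 + (alpha - 1.0) / rate, 1.0 + (beta - 1.0) / rate


def iterations_for_tolerance(alpha, beta, gamma, eps):
    """Smallest i such that both coefficients lie within eps of 1."""
    gap = max(1.0 - alpha, beta - 1.0)
    if gap <= eps:
        return 0
    return math.ceil(math.log(gap / eps) / math.log(1.0 + 1.0 / gamma))


# Hypothetical values: V_0 between 0 * J* and 2 * J*, with gamma = 1.
lower0, upper0 = gvi_bounds(alpha=0.0, beta=2.0, gamma=1.0, i=0)
lower1, upper1 = gvi_bounds(alpha=0.0, beta=2.0, gamma=1.0, i=1)

# Iterations needed for a 1% relative band, for several values of gamma.
n_small_gamma = iterations_for_tolerance(0.0, 2.0, gamma=0.5, eps=0.01)
n_unit_gamma = iterations_for_tolerance(0.0, 2.0, gamma=1.0, eps=0.01)
n_large_gamma = iterations_for_tolerance(0.0, 2.0, gamma=2.0, eps=0.01)
```

With these values the band shrinks by the factor $1 + \gamma^{-1}$ per iteration, so a smaller $\gamma$ needs fewer iterations, in line with the discussion above. The count is a worst-case estimate derived from the bound, not the actual number of iterations a given problem requires.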

In particular, when $0 \le V_0(x_k) \le V_1(x_k)$, $\forall x_k$, according to Theorems 2.2.1 and
2.2.2, we can deduce that $V_0(x_k) \le J^*(x_k)$. Thus, the constants $\alpha$ and $\beta$ satisfy
$0 < \alpha \le 1$ and $\beta = 1$. Then, the corresponding inequality becomes

$$\left[ 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^i} \right] J^*(x_k) \le V_i(x_k) \le J^*(x_k).$$

Note that a larger $\alpha$ will lead to a faster convergence speed of the value function.
When $V_0(x_k) \ge V_1(x_k)$, $\forall x_k$, according to Theorems 2.2.1 and 2.2.2, we can
deduce that $V_0(x_k) \ge J^*(x_k)$. So, the constants $\alpha$ and $\beta$ satisfy $\alpha = 1$ and $\beta \ge 1$.
Then, the corresponding inequality becomes

$$J^*(x_k) \le V_i(x_k) \le \left[ 1 + \frac{\beta - 1}{(1 + \gamma^{-1})^i} \right] J^*(x_k).$$

Note that a smaller $\beta$ will lead to a faster convergence speed of the value function.
According to the results of Theorem 2.2.2, we can derive the following corollary.

Corollary 2.2.1 Define the value function sequence $\{V_i\}$ and the control law
sequence $\{v_i\}$ as in (2.2.10)–(2.2.12) with $V_0(x_k) = x_k^T P_0 x_k$. If the system state
$x_k$ is controllable, then the control sequence $\{v_i\}$ converges to the optimal control
law $u^*$ as $i \to \infty$, i.e., $\lim_{i \to \infty} v_i(x_k) = u^*(x_k)$.

Proof According to Theorem 2.2.2, we have proved that
$\lim_{i \to \infty} V_i(x_k) = V_\infty(x_k) = J^*(x_k)$. Thus,

$$V_\infty(x_k) = \min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + V_\infty(x_{k+1}) \right\}.$$

That is to say, the value function sequence $\{V_i\}$ converges to the optimal value
function of the Bellman equation. Comparing (2.2.5) and (2.2.12), the corresponding
control law sequence $\{v_i\}$ converges to the optimal control law $u^*$ as $i \to \infty$. This
completes the proof of the corollary.

Next, we will complete the stability analysis for nonlinear systems under the
condition of control Lyapunov function.

Theorem 2.2.3 The value function sequence $\{V_i\}$ and the control law sequence $\{v_i\}$
are iteratively updated by (2.2.10)–(2.2.12). If $V_0(x_k) = x_k^T P_0 x_k \ge V_1(x_k)$ holds for
any controllable $x_k$, then the value function $V_i(x_k)$ is a Lyapunov function and the
system using the control law $v_i(x_k)$ is asymptotically stable.

Proof First, according to $V_0(x_k) \ge V_1(x_k)$ and Theorem 2.2.1, we have

$$V_i(x_k) \ge V_{i+1}(x_k) \ge U(x_k, v_i(x_k)), \quad \forall i.$$

Because $U(x_k, v_i(x_k))$ is a positive-definite function and $V_i(0) = 0$, $V_i(x_k)$ is also a
positive-definite function.

Second, we have

$$V_i(x_{k+1}) - V_i(x_k) \le V_i(x_{k+1}) - V_{i+1}(x_k) = -U(x_k, v_i(x_k)) \le 0.$$

By the Lyapunov stability criteria (Lyapunov's extension theorem [26] or the
Lagrange stability result [30]), $V_i(x_k)$ is a Lyapunov function, and the system using
the control law $v_i(x_k)$ is asymptotically stable. This completes the proof of the
theorem.

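Note that v0 (xk ) satisfies the first-order necessary condition, which is given by the
gradient of the right-hand side of (2.2.10) with respect to uk as
\[
\frac{\partial \left( x_k^T Q x_k + u_k^T R u_k \right)}{\partial u_k}
+ \left( \frac{\partial x_{k+1}}{\partial u_k} \right)^T \frac{\partial V_0(x_{k+1})}{\partial x_{k+1}} = 0.
\]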

That is,
2Ruk + 2gT (xk )P0 (f (xk ) + g(xk )uk ) = 0.

Then, we can solve for v0 (xk ) as
\[
v_0(x_k) = -\left( g^T(x_k) P_0 g(x_k) + R \right)^{-1} g^T(x_k) P_0 f(x_k).
\]

The control law v0 (xk ) exists since P0 and R are both positive-definite matrices.
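The closed-form expression for v0 (xk ) can be checked numerically. The following is a minimal sketch, assuming a hypothetical affine system (the functions `f` and `g`, the value κ = 5, and the test state are illustrative choices, not taken from the text). It evaluates the closed form and confirms that the first-order condition 2Ruk + 2gT (xk )P0 (f (xk ) + g(xk )uk ) = 0 holds.

```python
import numpy as np

# Hypothetical affine system x_{k+1} = f(x_k) + g(x_k) u_k with n = 2, m = 1.
# These functions are illustrative assumptions, not a system from the text.
def f(x):
    return np.array([0.8 * x[0] + 0.1 * x[1] ** 2,
                     0.5 * np.sin(x[0]) + 0.9 * x[1]])

def g(x):
    return np.array([[0.0], [1.0 + 0.5 * np.cos(x[0])]])

def v0(x, P0, R):
    """Closed-form minimizer of u'Ru + (f + g u)' P0 (f + g u) over u."""
    gx, fx = g(x), f(x)
    return -np.linalg.solve(gx.T @ P0 @ gx + R, gx.T @ P0 @ fx)

def stationarity_residual(x, u, P0, R):
    """Left-hand side of 2 R u + 2 g' P0 (f + g u) = 0."""
    gx, fx = g(x), f(x)
    return 2.0 * R @ u + 2.0 * gx.T @ P0 @ (fx + gx @ u)

P0 = 5.0 * np.eye(2)          # P0 = kappa * I_n with an assumed kappa = 5
R = np.array([[1.0]])
x = np.array([0.5, -1.0])
u = v0(x, P0, R)
print("v0(x) =", u)
print("stationarity residual =", stationarity_residual(x, u, P0, R))
```

Because P0 and R are positive definite, the matrix gT P0 g + R is invertible, so the linear solve in `v0` is always well posed.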

Remark 2.2.2 If the condition V0 (xk ) ≥ V1 (xk ) holds, V0 (xk ) = xkT P0 xk is called a
control Lyapunov function if the associated feedback control law v0 (xk ) can guarantee
that the closed-loop system is stable. Compared with the initial stabilizing control law
required by PI algorithms, the condition V0 (xk ) ≥ V1 (xk ) is easier to satisfy. In
particular, we can just choose P0 = κ In and κ ≥ 0, where In is the n × n identity matrix.
By choosing a large κ, V0 (xk ) ≥ V1 (xk ) is satisfied. Besides, similar to [12, 49], it
should be mentioned that the condition V0 (xk ) ≥ V1 (xk ) in Theorem 2.2.3 cannot be
replaced by V0 (xk ) ≥ J ∗ (xk ), because the nonincreasing property of the value function is
guaranteed by V0 (xk ) ≥ V1 (xk ). However, if the condition V0 (xk ) ≤ V1 (xk ) holds, we
cannot conclude that vi (xk ) is a stable and admissible control for nonlinear systems. For
linear discrete-time-invariant systems, Primbs and Nevistic [33] demonstrated that
there exists a finite iteration index i∗ such that the closed-loop system is asymptotically
stable for all i ≥ i∗ .

2.2.2 Neural Network Implementation

We have demonstrated the convergence of the value function above under the
assumption that control laws and value functions can be solved exactly at each
iteration. However, it is difficult to solve these equations for nonlinear systems.
Fortunately, we can use NNs to approximate vi and Vi at each iteration. In this section,
we will use heuristic dynamic programming (HDP, see definition in Chap. 1) to
implement the GVI algorithm.

Fig. 2.2 The structural diagram of HDP algorithm (action network, model network, and critic
networks connected by signal lines, back-propagating paths, and weight transmission)

The structural diagram of HDP algorithm is given in Fig. 2.2. In the HDP
algorithm, there are three NNs, which are model network, critic network, and action
network, respectively. The model network is used to approximate the unknown non-
linear system by using available input–output data. The critic network approximates
the relationship between state vector xk and value function V̂i (xk ), and the action
network approximates the relationship between state vector xk and control vector
v̂i (xk ).
We choose the popular backpropagation (BP) NN as our function approximation
scheme, although any other function approximation structures would also suffice.
The Levenberg–Marquardt (LM) algorithm is used to tune weights of NN, even
though any standard NN training methods would suffice, including the gradient
descent method. We find that LM algorithm can enormously improve the conver-
gence speed and decrease the approximation error, which will lead ADP to better
performance. LM algorithm, which combines steepest descent gradient and Gauss–
Newton method, mainly includes three processes: calculating the Jacobian matrix,
evaluating whether the parameters are getting closer to optimal ones or not, and
updating the damping parameter. The details of LM algorithm used here can be
found in [16].
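The three LM processes listed above (Jacobian computation, testing whether a step improves the fit, and updating the damping parameter) can be sketched on a toy curve-fitting problem. This is a generic textbook form of LM and not the implementation of [16]; the model a·tanh(bx), the data, the initial damping value, and the factor-of-10 damping schedule are all assumptions made for illustration.

```python
import numpy as np

def residuals(theta, x, y):
    """Fitting error e = y - a * tanh(b * x) for theta = (a, b)."""
    a, b = theta
    return y - a * np.tanh(b * x)

def jacobian(theta, x):
    """Jacobian of the model output a * tanh(b * x) with respect to (a, b)."""
    a, b = theta
    t = np.tanh(b * x)
    return np.column_stack([t, a * x * (1.0 - t ** 2)])

def levenberg_marquardt(theta, x, y, mu=1e-2, iters=50):
    e = residuals(theta, x, y)
    loss = 0.5 * e @ e
    for _ in range(iters):
        J = jacobian(theta, x)                       # process 1: Jacobian
        step = np.linalg.solve(J.T @ J + mu * np.eye(theta.size), J.T @ e)
        cand = theta + step
        e_cand = residuals(cand, x, y)
        loss_cand = 0.5 * e_cand @ e_cand
        if loss_cand < loss:                         # process 2: did we improve?
            theta, e, loss = cand, e_cand, loss_cand
            mu *= 0.1                                # process 3: closer to Gauss-Newton
        else:
            mu *= 10.0                               # process 3: closer to steepest descent
    return theta, loss

# Noise-free synthetic data generated with a = 2, b = 0.5 (assumed values).
x_data = np.linspace(-2.0, 2.0, 21)
y_data = 2.0 * np.tanh(0.5 * x_data)
theta_fit, final_loss = levenberg_marquardt(np.array([1.0, 1.0]), x_data, y_data)
print("fitted parameters:", theta_fit, "final loss:", final_loss)
```

A step is accepted only when it lowers the loss, so the loss never increases from one iteration to the next in this sketch.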
The first step is to train the model network. The output of model network is
denoted as
\[
\hat{x}_{k+1} = W_m^T \sigma(\bar{\chi}_k) = W_m^T \sigma\left( Y_m^T \chi_k \right),
\]

where χk = [xkT , v̂iT (xk )]T is the input vector of model network and χ̄k = YmT χk .
The input-to-hidden-layer weights Ym are an (n + m) × l matrix and the hidden-to-
output-layer weights Wm are an l ×n matrix, where l is the number of hidden neurons,
n is the dimension of state vector, and m is the dimension of control input vector.
The activation function is chosen as σ (z) = tanh(z), and its derivative is denoted as
\[
\dot{\sigma}(z) = \frac{d\sigma(z)}{dz} \in \mathbb{R}^{l \times l} \quad \text{for } z \in \mathbb{R}^{l}.
\]
The stopping criterion is that the performance function is within a prespecified
threshold, or the training step reaches the maximum value. When the weights of
model network converge, they are kept unchanged. Then, the estimated value of the
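control coefficient matrix ĝ(xk ) is given by
\[
\hat{g}(x_k) = \frac{\partial \left( W_m^T(k) \sigma(\bar{\chi}_k) \right)}{\partial \hat{v}_i}
= W_m^T(k) \, \dot{\sigma}(\bar{\chi}_k) \, Y_m^T(k) \, \frac{\partial \chi_k}{\partial \hat{v}_i},
\]
where
\[
\frac{\partial \chi_k}{\partial \hat{v}_i} = \begin{bmatrix} 0_{n \times m} \\ I_m \end{bmatrix}
\]
and Im is the m × m identity matrix.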
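As a sanity check on the chain rule for ĝ(xk ), the sketch below evaluates the model network and its Jacobian with respect to the control input, then compares the result with central finite differences. The weights are random placeholders standing in for a trained model network, and the helper names `model_forward` and `g_hat` are hypothetical.

```python
import numpy as np

def model_forward(W_m, Y_m, x, v):
    """x_hat_{k+1} = W_m' tanh(Y_m' chi) with chi = [x; v]."""
    chi = np.concatenate([x, v])
    return W_m.T @ np.tanh(Y_m.T @ chi)

def g_hat(W_m, Y_m, x, v):
    """W_m' * sigma_dot(chi_bar) * Y_m' * d(chi)/d(v), an n x m matrix."""
    n, m = x.size, v.size
    chi_bar = Y_m.T @ np.concatenate([x, v])
    sigma_dot = np.diag(1.0 - np.tanh(chi_bar) ** 2)      # l x l diagonal matrix
    dchi_dv = np.vstack([np.zeros((n, m)), np.eye(m)])    # (n + m) x m
    return W_m.T @ sigma_dot @ Y_m.T @ dchi_dv

# Random placeholder weights (assumption); a 3-9-2 structure with n = 2, m = 1.
rng = np.random.default_rng(0)
n, m, l = 2, 1, 9
Y_m = rng.uniform(-0.1, 0.1, size=(n + m, l))
W_m = rng.uniform(-0.1, 0.1, size=(l, n))
x = np.array([0.5, -1.0])
v = np.array([0.2])
print(g_hat(W_m, Y_m, x, v))
```

Since σ acts elementwise, σ̇ is a diagonal matrix whose entries are 1 − tanh²(·) evaluated at the hidden-layer inputs.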
Similarly, we use LM algorithm to train critic network and action network. The
output of the critic network is denoted as
\[
\hat{V}_i(x_k) = W_{c(i)}^T \sigma\left( Y_{c(i)}^T x_k \right).
\]

Note that V̂i (xk ) is the estimated value function of the iterative algorithm (2.2.10)–
(2.2.12) from the ith iteration, whereas Wc(i) and Yc(i) are the critic NN weights to be
obtained from NN training during the ith iteration. The target function for critic NN
training is given by
\[
V_i(x_k) = x_k^T Q x_k + \hat{v}_{i-1}^T(x_k) R \hat{v}_{i-1}(x_k) + \hat{V}_{i-1}(\hat{x}_{k+1}), \tag{2.2.15}
\]

where $\hat{V}_{i-1}(\hat{x}_{k+1}) = W_{c(i-1)}^T \sigma( Y_{c(i-1)}^T \hat{x}_{k+1} )$. Then, the error function for training
critic network is defined by ec(i) (xk ) = Vi (xk ) − V̂i (xk ), and the performance function
to be minimized is defined by
\[
E_{c(i)}(x_k) = \frac{1}{2} e_{c(i)}^2(x_k).
\]
The weight tuning algorithm of critic network is the same as model network.
In the action network, the state xk is used as input to obtain the optimal control.
The output can be formulated as $\hat{v}_i(x_k) = W_{ai}^T \sigma( Y_{ai}^T x_k )$, whereas Wai and Yai are the
action NN weights to be obtained from NN training during the ith iteration of the
ADP algorithm (2.2.10)–(2.2.12). The target of action NN training is given by
\[
v_i(x_k) = -\frac{1}{2} R^{-1} \hat{g}^T(x_k) \frac{\partial \hat{V}_i(\hat{x}_{k+1})}{\partial \hat{x}_{k+1}}, \tag{2.2.16}
\]

where $\hat{x}_{k+1} = W_m^T \sigma( Y_m^T [x_k^T, \hat{v}_i^T]^T )$. The convergence of action network weights is
shown in [13]. The error function of the action network can be defined as ea(i) (xk ) =
vi (xk ) − v̂i (xk ). The weights of the action network are updated to minimize the
following performance function:
\[
E_{a(i)}(x_k) = \frac{1}{2} e_{a(i)}^T(x_k) e_{a(i)}(x_k).
\]
The LM algorithm ensures that Ea(i) (xk ) decreases each time the parameters
of the action network are updated.
At last, a summary of the present general value iteration adaptive dynamic pro-
gramming algorithm for optimal control is given in Algorithm 2.2.1.

Algorithm 2.2.1 General value iteration adaptive dynamic programming algorithm

Step 1. Initialize the weights of critic and action neural networks and the parameters
$j_{\max}^m$, $j_{\max}^a$, $j_{\max}^c$, $\varepsilon_m$, $\varepsilon_a$, $\varepsilon_c$, $i_{\max}$, ξ, Q, R.
Step 2. Construct the model network $\hat{x}_{k+1} = W_m^T \sigma(Y_m^T \chi_k)$. Obtain the training data, and train the
model network until the given accuracy $\varepsilon_m$ or the maximum number of iterations $j_{\max}^m$ is reached.
Step 3. Set the iteration index i = 0 and P0 = κIn .
Step 4. Choose randomly an array of p state vectors $\{x_k^1, x_k^2, \ldots, x_k^p\}$. Compute the output of
the action network $\{\hat{v}_i(x_k^1), \hat{v}_i(x_k^2), \ldots, \hat{v}_i(x_k^p)\}$. Compute the output of the model network
$\{\hat{x}_{k+1}^1, \hat{x}_{k+1}^2, \ldots, \hat{x}_{k+1}^p\}$ and the output of the critic network
$\{\hat{V}_i(\hat{x}_{k+1}^1), \hat{V}_i(\hat{x}_{k+1}^2), \ldots, \hat{V}_i(\hat{x}_{k+1}^p)\}$.
Step 5. Set the iteration index i = i + 1. Then, compute the target of the critic network training
$\{V_i(x_k^1), V_i(x_k^2), \ldots, V_i(x_k^p)\}$
by (2.2.15). Train the critic network until the given accuracy $\varepsilon_c$ or the maximum number of
iterations $j_{\max}^c$ is reached.
Step 6. If i > 1, then go to Step 7. Else if V0 > V1 is true for all xk , go to Step 7; otherwise, increase
κ and go to Step 3.
Step 7. Compute the target of action network training
$\{v_i(x_k^1), v_i(x_k^2), \ldots, v_i(x_k^p)\}$
by (2.2.16), and train the action network until the given accuracy $\varepsilon_a$ or the maximum number of
iterations $j_{\max}^a$ is reached.
Step 8. If $i > i_{\max}$ or
$|V_i(x_k^s) - V_{i-1}(x_k^s)| \le \xi$, s = 1, 2, . . . , p,
go to Step 9; otherwise, go to Step 4.
Step 9. Compute the output of the action network $\{\hat{v}_i(x_k^1), \hat{v}_i(x_k^2), \ldots, \hat{v}_i(x_k^p)\}$. Obtain the final near
optimal control law
u∗ (·) = v̂i (·),
and stop the algorithm.

2.2.3 Generalization to Optimal Tracking Control

The above GVI-based ADP approach can be employed to solve the optimal tracking
control problem [45]. Consider the nonaffine nonlinear system (2.2.1). For the
infinite-time optimal tracking problem, the objective is to design an optimal control u∗ (xk ),
such that the state xk tracks the specified desired trajectory ξk ∈ Rn , k = 0, 1, . . .. In
this section, we assume that there exists a feedback control ue,k , which satisfies the
following equation:
\[
\xi_{k+1} = F(\xi_k, u_{e,k}), \tag{2.2.17}
\]
where ue,k is called the desired control.


Remark 2.2.3 It should be pointed out that for a large class of nonlinear systems,
there exists a feedback control ue,k that satisfies (2.2.17). For example, for all the
affine nonlinear systems (2.2.2) with invertible g(xk ), the desired control ue,k can be
expressed as
ue,k = g−1 (ξk )(ξk+1 − f (ξk )),

where g(ξk )g−1 (ξk ) = Im and Im is the m × m identity matrix.
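The expression for the desired control in Remark 2.2.3 can be illustrated as follows. This is a minimal sketch under assumed dynamics: the functions `f`, `g`, and the reference trajectory `xi` are illustrative choices for an affine system with invertible g, and `desired_control` is a hypothetical helper name.

```python
import numpy as np

# Illustrative affine system x_{k+1} = f(x_k) + g(x_k) u_k with invertible g (assumed).
def f(x):
    return np.array([0.2 * x[0] * np.exp(x[1] ** 2), 0.3 * x[1] ** 3])

def g(x):
    return np.array([[0.2, 0.0], [0.0, 0.2]])

def xi(k):
    """Assumed desired trajectory."""
    return np.array([np.sin(k + np.pi / 2.0), 0.5 * np.cos(k)])

def desired_control(k):
    """u_{e,k} = g^{-1}(xi_k) (xi_{k+1} - f(xi_k)), computed with a linear solve."""
    return np.linalg.solve(g(xi(k)), xi(k + 1) - f(xi(k)))

for k in range(3):
    u_e = desired_control(k)
    # Applying u_e at the state xi_k should reproduce xi_{k+1}, i.e., (2.2.17).
    print(k, u_e, f(xi(k)) + g(xi(k)) @ u_e - xi(k + 1))
```

Using a linear solve rather than forming the inverse explicitly is a numerical convenience; the result is the same desired control.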


Define the tracking error as zk = ξk − xk . The utility function is quadratic and is
given by
\[
U(z_k, \mu_k) = z_k^T Q z_k + \mu_k^T R \mu_k,
\]
where μk = uk − ue,k and ue,k is the desired control that satisfies (2.2.17). The
quadratic cost function is
\[
J(z_0, \underline{\mu}_0) = \sum_{k=0}^{\infty} U(z_k, \mu_k)
= \sum_{k=0}^{\infty} \left\{ z_k^T Q z_k + (u_k - u_{e,k})^T R (u_k - u_{e,k}) \right\}, \tag{2.2.18}
\]
where $\underline{\mu}_0 = (\mu_0, \mu_1, \ldots)$.


For system (2.2.1), our goal is to find an optimal tracking control scheme which
tracks the desired trajectory ξk and simultaneously minimizes the cost function
(2.2.18). The optimal cost function is defined as
\[
J^*(z_k) = \inf_{\underline{\mu}_k} \left\{ J(z_k, \underline{\mu}_k) \right\},
\]
where $\underline{\mu}_k = (\mu_k, \mu_{k+1}, \ldots)$. According to Bellman’s principle of optimality, J ∗ (zk )
satisfies the Bellman equation
\[
\begin{aligned}
J^*(z_k) &= \min_{\mu_k} \left\{ U(z_k, \mu_k) + J^*(z_{k+1}) \right\} \\
&= \min_{\mu_k} \left\{ U(z_k, \mu_k) + J^*(F(z_k, \mu_k)) \right\}.
\end{aligned} \tag{2.2.19}
\]

Then, the optimal control law is expressed as
\[
\mu^*(z_k) = \arg\min_{\mu_k} \left\{ U(z_k, \mu_k) + J^*(F(z_k, \mu_k)) \right\}.
\]
Hence, the Bellman equation (2.2.19) can be written as
\[
J^*(z_k) = U(z_k, \mu^*(z_k)) + J^*(F(z_k, \mu^*(z_k))). \tag{2.2.20}
\]
Generally speaking, J ∗ (zk ) is a highly nonlinear and nonanalytic function, which
cannot be obtained by directly solving the Bellman equation (2.2.20). Similar to
Sect. 2.2.1, a GVI-based ADP method can be developed to obtain J ∗ (zk ) iteratively.
Then, the optimal tracking control can be obtained.
Let Ψ (zk ) be an arbitrary positive-semidefinite function for zk ∈ Rn . Then, let the
initial value function be

V0 (zk ) = Ψ (zk ). (2.2.21)

The control law v0 (zk ) can be computed as follows:
\[
\begin{aligned}
v_0(z_k) &= \arg\min_{\mu_k} \left\{ U(z_k, \mu_k) + V_0(z_{k+1}) \right\} \\
&= \arg\min_{\mu_k} \left\{ U(z_k, \mu_k) + V_0(F(z_k, \mu_k)) \right\},
\end{aligned} \tag{2.2.22}
\]
where V0 (zk+1 ) = Ψ (zk+1 ). For i = 1, 2, . . ., the iterative ADP algorithm will iterate
between value function update
\[
\begin{aligned}
V_i(z_k) &= \min_{\mu_k} \left\{ U(z_k, \mu_k) + V_{i-1}(z_{k+1}) \right\} \\
&= U(z_k, v_{i-1}(z_k)) + V_{i-1}(F(z_k, v_{i-1}(z_k))),
\end{aligned} \tag{2.2.23}
\]
and policy improvement
\[
v_i(z_k) = \arg\min_{\mu_k} \left\{ U(z_k, \mu_k) + V_i(z_{k+1}) \right\}
= \arg\min_{\mu_k} \left\{ U(z_k, \mu_k) + V_i(F(z_k, \mu_k)) \right\}. \tag{2.2.24}
\]

Note that the ADP algorithm described above in (2.2.22)–(2.2.24) (for nonaffine
nonlinear systems) is essentially the same as that in (2.2.10)–(2.2.12) (for affine
nonlinear systems). The only difference between the two is the choice of initial value
function (see (2.2.7) and (2.2.21)).
Additional properties of the GVI-based ADP algorithm are given as follows.

Theorem 2.2.4 For i = 0, 1, . . ., let Vi (zk ) and vi (zk ) be obtained by (2.2.21)–
(2.2.24). Let ρ, γ , α, and β be constants such that
\[
0 < \rho \le \gamma < \infty, \tag{2.2.25}
\]
and 0 ≤ α ≤ β < 1, respectively. If ∀zk , the following conditions
\[
\rho U(z_k, v_k) \le J^*(F(z_k, v_k)) \le \gamma U(z_k, v_k) \tag{2.2.26}
\]
and
\[
\alpha J^*(z_k) \le V_0(z_k) \le \beta J^*(z_k) \tag{2.2.27}
\]
are satisfied uniformly, then the iterative value function Vi (zk ) satisfies
\[
\left[ 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^i} \right] J^*(z_k) \le V_i(z_k)
\le \left[ 1 + \frac{\beta - 1}{(1 + \rho^{-1})^i} \right] J^*(z_k). \tag{2.2.28}
\]

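Proof The theorem can be proved in two steps.

(1) Prove the lower bound of (2.2.28).
Mathematical induction is employed to prove the conclusion. Let i = 1. From
(2.2.26) and (2.2.27), we have
\[
\begin{aligned}
V_1(z_k) &= \min_{v_k} \left\{ U(z_k, v_k) + V_0(z_{k+1}) \right\} \\
&\ge \min_{v_k} \left\{ U(z_k, v_k) + \alpha J^*(z_{k+1}) \right\} \\
&\ge \min_{v_k} \left\{ \left( 1 + \gamma \frac{\alpha - 1}{1 + \gamma} \right) U(z_k, v_k)
+ \left( \alpha - \frac{\alpha - 1}{1 + \gamma} \right) J^*(z_{k+1}) \right\} \\
&= \left( 1 + \frac{\alpha - 1}{1 + \gamma^{-1}} \right) \min_{v_k} \left\{ U(z_k, v_k) + J^*(z_{k+1}) \right\} \\
&= \left( 1 + \frac{\alpha - 1}{1 + \gamma^{-1}} \right) J^*(z_k).
\end{aligned}
\]
Assume the conclusion holds for i = l − 1, l = 1, 2, . . .. Then, for i = l, we have
\[
\begin{aligned}
V_l(z_k) &= \min_{v_k} \left\{ U(z_k, v_k) + V_{l-1}(z_{k+1}) \right\} \\
&\ge \min_{v_k} \left\{ U(z_k, v_k) + \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{l-1}} \right) J^*(z_{k+1}) \right\} \\
&\ge \min_{v_k} \left\{ \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{l}} \right) U(z_k, v_k)
+ \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{l-1}} - \gamma^{-1} \frac{\alpha - 1}{(1 + \gamma^{-1})^{l}} \right) J^*(z_{k+1}) \right\} \\
&= \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{l}} \right) \min_{v_k} \left\{ U(z_k, v_k) + J^*(z_{k+1}) \right\} \\
&= \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{l}} \right) J^*(z_k).
\end{aligned}
\]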

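(2) Prove the upper bound of (2.2.28).
We also use mathematical induction to prove the conclusion. Let i = 1. We have
\[
\begin{aligned}
V_1(z_k) &= \min_{v_k} \left\{ U(z_k, v_k) + V_0(z_{k+1}) \right\} \\
&\le \min_{v_k} \left\{ U(z_k, v_k) + \beta J^*(z_{k+1}) \right\} \\
&\le \min_{v_k} \left\{ U(z_k, v_k) + \beta J^*(z_{k+1})
+ \frac{\beta - 1}{1 + \rho} \left( \rho U(z_k, v_k) - J^*(z_{k+1}) \right) \right\} \\
&= \left( 1 + \frac{\beta - 1}{1 + \rho^{-1}} \right) \min_{v_k} \left\{ U(z_k, v_k) + J^*(z_{k+1}) \right\} \\
&= \left( 1 + \frac{\beta - 1}{1 + \rho^{-1}} \right) J^*(z_k),
\end{aligned}
\]
where the third line holds because β − 1 < 0 and, by (2.2.26), ρU(zk , vk ) − J ∗ (zk+1 ) ≤ 0.
Assume that the conclusion holds for i = l − 1, l = 1, 2, . . .. Then, for i = l,
we have
\[
\begin{aligned}
V_l(z_k) &= \min_{v_k} \left\{ U(z_k, v_k) + V_{l-1}(z_{k+1}) \right\} \\
&\le \min_{v_k} \left\{ U(z_k, v_k) + \left( 1 + \frac{\beta - 1}{(1 + \rho^{-1})^{l-1}} \right) J^*(z_{k+1}) \right\} \\
&\le \left( 1 + \frac{\beta - 1}{(1 + \rho^{-1})^{l}} \right) \min_{v_k} \left\{ U(z_k, v_k) + J^*(z_{k+1}) \right\} \\
&= \left( 1 + \frac{\beta - 1}{(1 + \rho^{-1})^{l}} \right) J^*(z_k).
\end{aligned}
\]
The proof is complete.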
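The lower bound of (2.2.28) can be checked numerically on a small case. The sketch below is an illustration under stated assumptions, not part of the proof: it uses a hypothetical scalar linear system z_{k+1} = a z_k + b v_k with utility q z² + r v² (the constants `a`, `b`, `q`, `r` are made up), for which Vi (z) = p_i z² and J ∗ (z) = p∗ z². It takes V0 = 0 (so α = 0) and uses γ = p∗ (a²/q + b²/r), which satisfies the right-hand inequality of (2.2.26) by the Cauchy–Schwarz inequality. Only the lower bound is examined.

```python
import numpy as np

# Hypothetical scalar problem: z_{k+1} = a z + b v, U = q z^2 + r v^2 (assumed constants).
a, b, q, r = 1.2, 1.0, 1.0, 1.0

# J*(z) = p_star z^2, where p_star is the positive root of
# b^2 p^2 + (r - q b^2 - a^2 r) p - q r = 0 (scalar Riccati equation).
c1 = r - q * b ** 2 - a ** 2 * r
p_star = (-c1 + np.sqrt(c1 ** 2 + 4.0 * b ** 2 * q * r)) / (2.0 * b ** 2)

# J*(F(z, v)) = p_star (a z + b v)^2 <= gamma (q z^2 + r v^2) by Cauchy-Schwarz.
gamma = p_star * (a ** 2 / q + b ** 2 / r)

alpha = 0.0               # V_0 = 0, so alpha J* <= V_0 holds with alpha = 0
p = alpha * p_star
iterates, lower_bounds = [], []
for i in range(1, 31):
    # Value iteration: V_i(z) = min_v { q z^2 + r v^2 + p (a z + b v)^2 } = p_i z^2.
    p = q + a ** 2 * r * p / (r + b ** 2 * p)
    iterates.append(p)
    lower_bounds.append((1.0 + (alpha - 1.0) / (1.0 + 1.0 / gamma) ** i) * p_star)

print("p_star =", p_star, "gamma =", gamma)
print("first iterates:", iterates[:3])
print("first bounds:  ", lower_bounds[:3])
```

In this example the iterates stay above the lower bound at every iteration and approach p∗ as i grows.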


The following two results can readily be proved by following the same procedure.
Theorem 2.2.5 For i = 0, 1, . . ., let vi (zk ) and Vi (zk ) be obtained by (2.2.21)–
(2.2.24). Let ρ, γ , α, and β be constants that satisfy (2.2.25) and

1 ≤ α ≤ β < ∞,

respectively. If ∀zk , the inequalities (2.2.26) and (2.2.27) hold uniformly, then the
iterative value function Vi (zk ) satisfies (2.2.28).
Corollary 2.2.2 For i = 0, 1, . . ., let vi (zk ) and Vi (zk ) be obtained by (2.2.21)–
(2.2.24). Let ρ, γ , α, and β be constants that satisfy (2.2.25) and
\[
0 \le \alpha \le \beta < \infty, \tag{2.2.29}
\]
respectively. If ∀zk , the inequalities (2.2.26) and (2.2.27) hold uniformly, then the
iterative value function Vi (zk ) converges to the optimal cost function J ∗ (zk ), i.e.,
\[
\lim_{i \to \infty} V_i(z_k) = J^*(z_k).
\]
Remark 2.2.4 When 0 ≤ α ≤ 1 ≤ β < ∞, we can also obtain a result similar to
Theorem 2.2.2. Corollary 2.2.2 is obtained directly from Theorems 2.2.2, 2.2.4, and
2.2.5.
Remark 2.2.5 Note that the techniques employed in this section are extensions of those
in [27, 35]. From Theorem 2.2.2, we can see that the iterative value function will
converge to the optimum as i → ∞, independently of the initial value function
Ψ (zk ). Furthermore, for arbitrary constants ρ, γ , α, and β that satisfy (2.2.25)
and (2.2.29), respectively, the iterative value function Vi (zk ) can be guaranteed to
converge to the optimum as i → ∞. Hence, it is not necessary to estimate ρ, γ , α,
and β.

2.2.4 Optimal Control of Systems with Constrained Inputs

The VI-based optimal control [5, 41], constrained optimal control [28, 51], and optimal
tracking control [19, 52] methods are special cases of the results in Sect. 2.2.3,
obtained by choosing the initial value function as zero. Among these problems, input
constraints are often encountered in practice, and they make the design of the optimal
controller considerably more difficult [17, 28, 51]. Therefore, in this section, we develop
a VI-based constrained optimal control scheme via GDHP technique [28].
For the discrete-time nonaffine nonlinear system (2.2.1), we define
$\bar{\Omega}_u = \{u_k : u_k = [u_{1k}, u_{2k}, \ldots, u_{mk}]^T \in \mathbb{R}^m, |u_{lk}| \le \bar{u}_l, l = 1, 2, \ldots, m\}$, where ūl is the
saturation bound for the lth actuator. Let Ū = diag{ū1 , ū2 , . . . , ūm } be a constant
diagonal matrix.
In much of the optimal control literature [5, 13, 41, 42], the utility function is chosen
as the quadratic form of (2.2.6). However, this is no longer the case when dealing with
constrained optimal control problems. Inspired by the work of [1, 29, 51], we
can employ a generalized nonquadratic functional
\[
Y(u_k) = 2 \int_0^{u_k} \Phi^{-T}\left( \bar{U}^{-1} s \right) \bar{U} R \, ds \tag{2.2.30}
\]
to substitute the quadratic term of uk in (2.2.6). Note that in (2.2.30),
\[
\Phi^{-1}(u_k) = \left[ \phi^{-1}(u_{1k}), \phi^{-1}(u_{2k}), \ldots, \phi^{-1}(u_{mk}) \right]^T,
\]
R is positive-definite and assumed to be diagonal for simplicity of analysis,
s ∈ Rm , Φ ∈ Rm , Φ −T denotes (Φ −1 )T , and φ(·) is a strictly monotonic odd function
satisfying |φ(·)| < 1 and belonging to $C^p$ (p ≥ 1) and $L_2(\Omega)$. The well-known
hyperbolic tangent function φ(·) = tanh(·) is one example of such functions. Besides,
it is important to note that Y (uk ) is positive-definite since φ −1 (·) is a monotonic odd
function and R is positive-definite.
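For intuition, the functional Y (uk ) in (2.2.30) can be evaluated in the scalar case (m = 1) with φ = tanh. The sketch below is an illustration under these assumptions; the function names and the numerical values are hypothetical. Integrating by parts gives the closed form Y(u) = 2rūu·artanh(u/ū) + rū²·ln(1 − u²/ū²) for |u| < ū, which the code compares with a trapezoidal quadrature of the integral.

```python
import numpy as np

def Y_closed_form(u, u_bar, r):
    """Scalar case of (2.2.30) with phi = tanh, valid for |u| < u_bar."""
    t = u / u_bar
    return 2.0 * r * u_bar * u * np.arctanh(t) + r * u_bar ** 2 * np.log(1.0 - t ** 2)

def Y_quadrature(u, u_bar, r, n=20001):
    """Trapezoidal approximation of 2 * int_0^u artanh(s / u_bar) * u_bar * r ds."""
    s = np.linspace(0.0, u, n)
    integrand = 2.0 * np.arctanh(s / u_bar) * u_bar * r
    ds = s[1] - s[0]
    return ds * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

u_bar, r = 0.5, 1.0           # assumed saturation bound and weight
for u in (-0.3, 0.0, 0.01, 0.3):
    print(u, Y_closed_form(u, u_bar, r), Y_quadrature(u, u_bar, r), r * u ** 2)
```

For small |u| the functional is close to the quadratic term r u², and it grows faster than the quadratic as |u| approaches the bound ū, which is how the penalty discourages controls near saturation.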
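In this sense, the utility function becomes U(xk , uk ) = xkT Qxk + Y (uk ). Accordingly,
(2.2.4) and (2.2.5) become
\[
J^*(x_k) = \min_{u_k} \left\{ x_k^T Q x_k + 2 \int_0^{u_k} \Phi^{-T}\left( \bar{U}^{-1} s \right) \bar{U} R \, ds + J^*(x_{k+1}) \right\}
\]
and
\[
u^*(x_k) = \arg\min_{u_k} \left\{ x_k^T Q x_k + 2 \int_0^{u_k} \Phi^{-T}\left( \bar{U}^{-1} s \right) \bar{U} R \, ds + J^*(x_{k+1}) \right\},
\]
respectively.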
The traditional VI-based iterative ADP algorithm is performed as follows. First,
we start with the initial value function V0 (·) = 0 and solve Vi (xk ) and vi (xk ) using
the iterative algorithm described by (2.2.22)–(2.2.24).
In this section, the GDHP technique is employed to implement the iterative ADP
algorithm. In the iterative GDHP algorithm, there are three NNs, which are model
network, critic network, and action network. Here, all the NNs are chosen as three-
layer feedforward ones. It is important to note that the critic network of GDHP
outputs both the value function V (xk ) and its derivative ∂V (xk )/∂xk [34], which is
schematically depicted in Fig. 2.3. It is a combination of HDP and dual heuristic
dynamic programming (DHP).
The training of model network is complete after the system identification process,
and its weights will be kept unchanged. As a result, we avoid the requirement of
knowing F(xk , uk ) during the implementation of the iterative GDHP algorithm. Next,
the learned NN model will be used in the training process of critic network and action
network.
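We denote λi (xk ) = ∂Vi (xk )/∂xk in our discussion. Hence, the critic network
is used to approximate both Vi (xk ) and λi (xk ). The output of critic network is
expressed as
\[
\begin{bmatrix} \hat{V}_i(x_k) \\ \hat{\lambda}_i(x_k) \end{bmatrix}
= \begin{bmatrix} W_{c1}^{iT} \\ W_{c2}^{iT} \end{bmatrix} \sigma\left( Y_c^{iT} x_k \right)
= W_c^{iT} \sigma\left( Y_c^{iT} x_k \right),
\]
where $W_c^i = [W_{c1}^i, W_{c2}^i]$ and $Y_c^i$ are critic NN weights to be obtained during the ith
iteration of the ADP algorithm (2.2.22)–(2.2.24). Accordingly, we have
$\hat{V}_i(x_k) = W_{c1}^{iT} \sigma( Y_c^{iT} x_k )$ and $\hat{\lambda}_i(x_k) = W_{c2}^{iT} \sigma( Y_c^{iT} x_k )$.

Fig. 2.3 The critic network of GDHP technique

The target functions for critic NN training can be written as
\[
V_i(x_k) = U(x_k, \hat{v}_{i-1}(x_k)) + \hat{V}_{i-1}(\hat{x}_{k+1})
\]
and
\[
\begin{aligned}
\lambda_i(x_k) &= \frac{\partial U(x_k, \hat{v}_{i-1}(x_k))}{\partial x_k} + \frac{\partial \hat{V}_{i-1}(\hat{x}_{k+1})}{\partial x_k} \\
&= 2 Q x_k + 2 \left( \frac{\partial \hat{v}_{i-1}(x_k)}{\partial x_k} \right)^T \bar{U} R \, \Phi^{-1}\left( \bar{U}^{-1} \hat{v}_{i-1}(x_k) \right) \\
&\quad + \left( \frac{\partial \hat{x}_{k+1}}{\partial x_k} + \frac{\partial \hat{x}_{k+1}}{\partial \hat{v}_{i-1}(x_k)} \frac{\partial \hat{v}_{i-1}(x_k)}{\partial x_k} \right)^T \hat{\lambda}_{i-1}(\hat{x}_{k+1}).
\end{aligned}
\]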

Then, we define the error functions of critic network training as
$e_{cik}^{v} = V_i(x_k) - \hat{V}_i(x_k)$ and $e_{cik}^{\lambda} = \lambda_i(x_k) - \hat{\lambda}_i(x_k)$.
The objective function to be minimized in the critic network is
\[
E_{cik} = (1 - \tau) E_{cik}^{v} + \tau E_{cik}^{\lambda},
\]
where 0 ≤ τ ≤ 1 is a parameter that adjusts how HDP and DHP are combined in
GDHP,
\[
E_{cik}^{v} = \frac{1}{2} \left( e_{cik}^{v} \right)^2
\]
and
\[
E_{cik}^{\lambda} = \frac{1}{2} e_{cik}^{\lambda T} e_{cik}^{\lambda}.
\]
The weight update rule for training critic network is the gradient-based adaptation
which is given by
\[
W_c^i(p + 1) = W_c^i(p) - \alpha_c \left[ (1 - \tau) \frac{\partial E_{cik}^{v}}{\partial W_c^i(p)} + \tau \frac{\partial E_{cik}^{\lambda}}{\partial W_c^i(p)} \right],
\]
\[
Y_c^i(p + 1) = Y_c^i(p) - \alpha_c \left[ (1 - \tau) \frac{\partial E_{cik}^{v}}{\partial Y_c^i(p)} + \tau \frac{\partial E_{cik}^{\lambda}}{\partial Y_c^i(p)} \right],
\]
where αc > 0 is the learning rate of critic network and p is the inner-loop iteration
step for updating NN weight parameters. The detailed discussion on superiority of
GDHP-based iterative ADP algorithm can be found in [42].

Remark 2.2.6 The GDHP (globalized dual heuristic programming) is implemented
using
\[
E_{cik} = (1 - \tau) E_{cik}^{v} + \tau E_{cik}^{\lambda}.
\]
When τ = 0, the algorithm reduces to HDP (heuristic dynamic programming) where
critic NN training is based on value function V . When τ = 1, the algorithm becomes
DHP (dual heuristic programming) where critic NN training is based on the derivative
λ of the value function. In order to determine the minimum value of a function, we
can either estimate the function itself or estimate its derivatives. However, when
0 < τ < 1, the algorithm will use the estimates of both the value function and its
derivatives, which will usually lead to better results in optimization.
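The role of τ in the combined critic objective can be sketched with a small critic whose output stacks V̂ and λ̂. This is a minimal illustration: the weights are random placeholders, the targets are made-up numbers, and only the hidden-to-output weights are updated by one gradient step (the update of the input-to-hidden weights is omitted for brevity). The helper names are hypothetical.

```python
import numpy as np

def critic_forward(W_c, Y_c, x):
    """Return (V_hat, lambda_hat, hidden) for the output W_c' tanh(Y_c' x)."""
    hidden = np.tanh(Y_c.T @ x)
    out = W_c.T @ hidden
    return out[0], out[1:], hidden

def gdhp_loss(V_target, lam_target, V_hat, lam_hat, tau):
    """E = (1 - tau) * E^v + tau * E^lambda."""
    e_v = V_target - V_hat
    e_lam = lam_target - lam_hat
    return (1.0 - tau) * 0.5 * e_v ** 2 + tau * 0.5 * (e_lam @ e_lam)

def grad_output_weights(V_target, lam_target, W_c, Y_c, x, tau):
    """Gradient of the combined loss with respect to W_c (same shape as W_c)."""
    V_hat, lam_hat, hidden = critic_forward(W_c, Y_c, x)
    weighted_err = np.concatenate(([(1.0 - tau) * (V_target - V_hat)],
                                   tau * (lam_target - lam_hat)))
    return -np.outer(hidden, weighted_err)

# Placeholder 2-8-3 critic (n = 2 states, 8 hidden neurons, outputs V and lambda).
rng = np.random.default_rng(0)
n, l = 2, 8
Y_c = rng.uniform(-0.1, 0.1, size=(n, l))
W_c = rng.uniform(-0.1, 0.1, size=(l, 1 + n))
x = np.array([0.5, -1.0])
V_target, lam_target = 1.5, np.array([1.0, -2.0])    # made-up training targets
tau, alpha_c = 0.5, 0.05

before = gdhp_loss(V_target, lam_target, *critic_forward(W_c, Y_c, x)[:2], tau)
W_c = W_c - alpha_c * grad_output_weights(V_target, lam_target, W_c, Y_c, x, tau)
after = gdhp_loss(V_target, lam_target, *critic_forward(W_c, Y_c, x)[:2], tau)
print("loss before:", before, "loss after:", after)
```

Setting `tau = 0.0` leaves only the value-error term (HDP), and `tau = 1.0` leaves only the derivative-error term (DHP).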

In the action network, the state xk is used as input to obtain the approximate
optimal control as output of the network, which is formulated as
\[
\hat{v}_i(x_k) = W_{ai}^T \sigma\left( Y_{ai}^T x_k \right).
\]
The target function of control action is given by
\[
v_i(x_k) = \arg\min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_i(\hat{x}_{k+1}) \right\}.
\]
The error function of action network can be defined as
\[
e_{a(i)k} = v_i(x_k) - \hat{v}_i(x_k).
\]
The weights of action network are updated to minimize
\[
E_{a(i)k} = \frac{1}{2} e_{a(i)k}^T e_{a(i)k}.
\]
Similarly, the weight update algorithm is
\[
W_{ai}(p + 1) = W_{ai}(p) - \alpha_a \frac{\partial E_{a(i)k}}{\partial W_{ai}(p)},
\]
\[
Y_{ai}(p + 1) = Y_{ai}(p) - \alpha_a \frac{\partial E_{a(i)k}}{\partial Y_{ai}(p)},
\]
where αa > 0 is the learning rate of action network and p is the inner-loop iteration
step for updating weight parameters.

2.2.5 Simulation Studies

In this section, several examples are provided to demonstrate the effectiveness of the
present control methods.
Example 2.2.1 Consider the linear system
\[
x_{k+1} = \begin{bmatrix} 0 & 0.4 \\ 0.3 & 1 \end{bmatrix} x_k + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k, \tag{2.2.31}
\]
where xk = [x1k , x2k ]T . The weight matrices are chosen as
\[
Q = \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix}
\]
and R = 1. Note that the open-loop poles are −0.1083 and 1.1083, which indicates
that the system is unstable.
Algorithm 2.2.1 will be used here. To reduce the influence of the NN approximation
errors, we choose three-layer BP NNs as model network, critic network, and
action network with the structures of 3–9–2, 2–8–1, and 2–8–1, respectively. The
initial weights of NNs are chosen randomly in [−0.1, 0.1].
Before implementing the GVI algorithm, we need to train the model network first.
The operation region of system (2.2.31) is selected as −1 ≤ x1 ≤ 1 and −1 ≤ x2 ≤ 1.
One thousand samples are randomly chosen from this operation region as the training set,
and the model network is trained until the given accuracy $\varepsilon_m = 10^{-8}$ is reached with
$j_{\max}^m = 10000$. The inner-loop iteration number of critic network and action network
is $j_{\max}^c = j_{\max}^a = 1000$, and the given accuracy is $\varepsilon_c = \varepsilon_a = 10^{-6}$. The maximum
outer-loop iteration is selected as $i_{\max} = 10$, and the prespecified accuracy is selected
as $\xi = 10^{-6}$. The number of samples at each iteration is p = 2000.
Set P0 = I2 . We find that V0 ≥ V1 holds for all states, which can be seen from
Fig. 2.4. After 10 outer-loop iterations, the convergence of the value function is
observed. The 3-D plot of the approximate value function at i = 0
and i = 10 is given in Fig. 2.5, and the 3-D plot of the error between the optimal cost
function J ∗ and the approximate optimal value function V10 is given in Fig. 2.6. From
Fig. 2.6, we can see that this error is nearly within $10^{-3}$ in the operation region.
For the initial state x0 = [1, −1]T , the convergence process of value function is
given in Fig. 2.7. We apply the control law v10 to the system for 20 time steps. The
corresponding state trajectories are given in Fig. 2.8, and the control input is shown
in Fig. 2.9.
These simulation results indicate that our algorithm is effective in obtaining the
optimal control law via learning in a timely manner.
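Several statements of Example 2.2.1 can be cross-checked without neural networks. The sketch below is a simplification that assumes the exact linear model is available: with a quadratic value function Vi (x) = xT Pi x, one GVI step for a linear system reduces to a Riccati-type recursion on Pi . It recomputes the open-loop poles, checks V0 ≥ V1 for P0 = I2 , and iterates the recursion. The function name `gvi_step` and the iteration count are illustrative choices.

```python
import numpy as np

A = np.array([[0.0, 0.4], [0.3, 1.0]])
B = np.array([[0.0], [1.0]])
Q = 0.2 * np.eye(2)
R = np.array([[1.0]])

def gvi_step(P):
    """One value-iteration step for V_i(x) = x' P x under the exact linear model.

    Returns the updated matrix and the gain K of the minimizing control u = -K x.
    """
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return Q + A.T @ P @ A - A.T @ P @ B @ K, K

poles = np.sort(np.linalg.eigvals(A).real)
print("open-loop poles:", poles)

P = np.eye(2)                      # P0 = I2, as in the example
history = [P]
for _ in range(200):
    P, K = gvi_step(P)
    history.append(P)

# V0 >= V1 for all x is equivalent to P0 - P1 being positive semidefinite.
print("eigenvalues of P0 - P1:", np.linalg.eigvalsh(history[0] - history[1]))
print("converged P:", P)
print("closed-loop poles:", np.linalg.eigvals(A - B @ K))
```

For P0 = I2 this recursion gives P1 = [0.245, 0.15; 0.15, 0.86], so P0 − P1 is positive definite, in agreement with the observation that V0 ≥ V1 holds for all states.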

Fig. 2.4 3-D plot of V0 − V1 in the operation region (axes: x1 , x2 , V0 − V1 )

Fig. 2.5 3-D plot of approximate value function at i = 0, 10 (axes: x1 , x2 , value function)

Fig. 2.6 Error between the optimal cost function J ∗ and the approximate optimal value function
V10 (axes: x1 , x2 , J ∗ − V10 ; vertical scale ×10^{−4})

Fig. 2.7 Convergence process of the value function at x = [1, −1]T (value function versus
iteration index i)

Fig. 2.8 The state trajectories (x1 and x2 versus time steps)

Fig. 2.9 The control input (control input versus time steps)

Example 2.2.2 Consider the nonlinear system
\[
x_{k+1} = \begin{bmatrix} 0.2 x_{1k} \exp(x_{2k}^2) \\ 0.3 x_{2k}^3 \end{bmatrix}
+ \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} u_k, \tag{2.2.32}
\]
where xk = [x1k , x2k ]T and uk = [u1k , u2k ]T . The desired trajectory is set to
\[
\xi_k = \left[ \sin\left( k + \frac{\pi}{2} \right), \; 0.5 \cos(k) \right]^T. \tag{2.2.33}
\]
According to (2.2.32) and (2.2.33), we can easily obtain the desired control
\[
u_{e,k} = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix}
\left( \xi_{k+1} - \begin{bmatrix} 0.2 \xi_{1k} \exp(\xi_{2k}^2) \\ 0.3 \xi_{2k}^3 \end{bmatrix} \right).
\]
The value function is defined as in (2.2.18), where Q = R = I ∈ R2×2 and I denotes
the identity matrix.
We use NNs to implement the GVI ADP algorithm. The structures of the critic
and the action networks are chosen as 2–8–1 and 2–8–2, respectively. We choose
a random array of state variables in [−1, 1] to train the NNs. For each iterative
step, the critic network and the action network are trained for 2000 steps under
the learning rate 0.005 so that the approximation error limit $10^{-6}$ is reached. The
GVI algorithm runs for 30 iterations to guarantee the convergence of the iterative
value function. To illustrate the effectiveness of the algorithm, four different initial
value functions are considered. Let the initial value functions be quadratic forms
expressed by Ψ j (zk ) = zkT Pj zk , j = 1, 2, 3, 4. Let P1 = 0. Let P2 , P3 ,
and P4 be positive-definite matrices given by P2 = [9.07, −0.26; −0.26, 11.62],
P3 = [10.48, 2.16; 2.16, 13.24], and P4 = [11.59, 0.61; 0.61, 13.40], respectively.
According to Theorem 2.2.2, for an arbitrary positive-semidefinite function, the
iterative value function will converge to the optimum. The curves of the iterative
value functions under the four different initial value functions Ψ j (zk ), j = 1, 2, 3, 4,
are displayed in Fig. 2.10, which justifies the convergence property of our algorithm.
The tracking error trajectories are shown in Fig. 2.11. These results show good
convergence as well as good tracking control performance.

Fig. 2.10 The trajectories of the iterative value functions with initial value function given by Ψ j (zk ),
j = 1, 2, 3, 4. a Ψ 1 (zk ). b Ψ 2 (zk ). c Ψ 3 (zk ). d Ψ 4 (zk ) (value function versus iteration steps)

Fig. 2.11 The tracking error (z1 and z2 versus time steps)

Example 2.2.3 The following nonlinear system is a modification of the example
in [21]:
\[
x_{k+1} = \begin{bmatrix} x_{1k} + \sin(4 u_k - 2 x_{2k}) \\ x_{2k} - 2 u_k \end{bmatrix}, \tag{2.2.34}
\]
where xk = [x1k , x2k ]T ∈ R2 , uk ∈ R, k = 1, 2, . . . . We can see that xk = [0, 0]T is
an equilibrium state of system (2.2.34). However, the system (2.2.34) is marginally
stable at this equilibrium, since the eigenvalues of
\[
\left. \frac{\partial x_{k+1}}{\partial x_k} \right|_{(0,0)} = \begin{bmatrix} 1 & -2 \\ 0 & 1 \end{bmatrix}
\]
are all 1. It is desired to control the system with control constraint of |u| ≤ 0.5. The
cost function is chosen as
\[
J(x_0) = \sum_{k=0}^{\infty} \left\{ x_k^T Q x_k + 2 \int_0^{u_k} \tanh^{-T}\left( \bar{U}^{-1} s \right) \bar{U} R \, ds \right\},
\]
where Q and R are identity matrices with suitable dimensions and Ū = 0.5.
In this example, the three NNs are chosen with structures of 3–8–2, 2–8–3, and
2–8–1, respectively. Here, the initial weights of the critic network and action network
are all set to be random in [−0.1, 0.1]. Then, letting the parameter τ = 0.5 and the
learning rate αc = αa = 0.05, we train the critic network and action network for
26 iterations. When k = 0, the convergence process of the value function and its
derivatives is depicted in Fig. 2.12.

Fig. 2.12 a The convergence process of the value function. b The convergence process of the
derivatives of the value function (λ1 and λ2 versus iterations)

Fig. 2.13 Simulation results of Example 2.2.3. a The state trajectory x. b The control input u.
c The state trajectory x without considering the control constraint. d The control input u without
considering the control constraint

Next, for given initial state x0 = [0.5, −1]T , we apply the optimal control laws
designed by the iterative GDHP algorithm, with and without considering the control
constraints, to system (2.2.34) for 20 time steps, respectively. The simulation results
are shown in Fig. 2.13, which also exhibits excellent control results of the iterative
GDHP algorithm.

2.3 Iterative θ-Adaptive Dynamic Programming Algorithm for Nonlinear Systems

In this section, we present an iterative θ -ADP algorithm for optimal control of
discrete-time nonlinear systems [44]. Consider the deterministic discrete-time
systems
\[
x_{k+1} = F(x_k, u_k), \quad k = 0, 1, 2, \ldots, \tag{2.3.1}
\]

where xk ∈ Rn is the n-dimensional state vector and uk ∈ Rm is the m-dimensional
control vector. Let x0 be the initial state and F(xk , uk ) be the system function.
Let $\underline{u}_k = (u_k, u_{k+1}, \ldots)$ be an arbitrary sequence of controls from k to ∞. The
cost function for state x0 under the control sequence $\underline{u}_0 = (u_0, u_1, \ldots)$ is defined as
\[
J(x_0, \underline{u}_0) = \sum_{k=0}^{\infty} U(x_k, u_k),
\]
where U(xk , uk ) > 0, ∀xk , uk ≠ 0, is the utility function.
For convenience of analysis, results of this section are based on Assumptions
2.2.1–2.2.3 and the following assumption.

Assumption 2.3.1 The utility function U(xk , uk ) is a continuous positive-definite
function of xk and uk .

As system (2.3.1) is controllable, there exists a stable control sequence
$\underline{u}_k = (u_k, u_{k+1}, \ldots)$ that moves xk to zero. Let Ak denote the set which contains all the
stable control sequences, and let Ak be the set of the stable control laws. Then, the
optimal cost function can be defined as
\[
J^*(x_k) = \inf_{\underline{u}_k} \left\{ J(x_k, \underline{u}_k) : \underline{u}_k \in A_k \right\}. \tag{2.3.2}
\]
According to the Bellman’s principle of optimality, J ∗ (xk ) satisfies the Bellman
equation
\[
J^*(x_k) = \min_{u_k} \left\{ U(x_k, u_k) + J^*(F(x_k, u_k)) \right\}. \tag{2.3.3}
\]
The corresponding optimal control law is given by
\[
u^*(x_k) = \arg\min_{u_k} \left\{ U(x_k, u_k) + J^*(F(x_k, u_k)) \right\}.
\]
Hence, the Bellman equation (2.3.3) can be written as
\[
J^*(x_k) = U(x_k, u^*(x_k)) + J^*(F(x_k, u^*(x_k))). \tag{2.3.4}
\]

We can see that if we want to obtain the optimal control law u∗ (xk ), we must obtain the
optimal value function J ∗ (xk ). Generally speaking, J ∗ (xk ) is unknown before all the
controls uk ∈ Rm are considered. If we adopt the traditional dynamic programming
method to obtain the optimal value function one step at a time, then we have to face
the “curse of dimensionality.” In [5, 43], iterative algorithms of ADP were used to
obtain the solution of Bellman equation indirectly. However, we pointed out that the stability of the system cannot be guaranteed in [5] and an admissible control sequence is necessary to initialize the algorithm in [43]. To overcome these difficulties, a new iterative ADP algorithm will be developed in this section.

2.3.1 Convergence Analysis

In the present iterative θ -ADP algorithm, the value function and control law are
updated with the iteration index i increasing from 0 to ∞. The following definition
is necessary to begin the algorithm.
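Definition 2.3.1 For $x_k \in \mathbb{R}^n$, let

\[ \bar{\Psi}_{x_k} = \left\{ \Psi(x_k) : \Psi(x_k) > 0, \text{ and } \exists\, \bar{\nu}(x_k) \in \mathcal{A}_k, \text{ s.t. } \Psi(F(x_k, \bar{\nu}(x_k))) < \Psi(x_k) \right\} \tag{2.3.5} \]

be the set of initial positive-definite functions.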
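
As a rough illustration of Definition 2.3.1, the following sketch tests the decrease condition Ψ(F(x, ν̄(x))) < Ψ(x) on sampled states. The scalar system, the two candidate control laws, and the helper name `decreases_along` are assumptions made only for this example. A sampled check of this kind is a screening tool: it does not by itself show that ν̄ is a stable control law.

```python
import numpy as np

def decreases_along(Psi, F, nu_bar, xs):
    """Return True if Psi(F(x, nu_bar(x))) < Psi(x) at every sampled x != 0."""
    xs = xs[xs != 0.0]
    return bool(np.all(Psi(F(xs, nu_bar(xs))) < Psi(xs)))

# Hypothetical scalar example: x_{k+1} = 1.1 sin(x_k) + u_k, Psi(x) = x^2.
F = lambda x, u: 1.1 * np.sin(x) + u
Psi = lambda x: x ** 2
xs = np.linspace(-1.0, 1.0, 201)

# Feedback nu_bar(x) = -sin(x) leaves the next state at 0.1 sin(x).
ok_feedback = decreases_along(Psi, F, lambda x: -np.sin(x), xs)
# Zero control leaves the next state at 1.1 sin(x), which grows near the origin.
ok_zero = decreases_along(Psi, F, lambda x: np.zeros_like(x), xs)
```

In this example the condition holds under the feedback law and fails under zero control, so Ψ(x) = x² qualifies only together with a suitable ν̄.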


Let Ψ (xk ) be an arbitrary function such that Ψ (xk ) ∈ Ψ̄xk , ∀xk ∈ Rn . The existence
and properties of Ψ̄xk will be discussed later. Let the initial value function

\[ V_0(x_k) = \theta \Psi(x_k) \tag{2.3.6} \]

$\forall x_k \in \mathbb{R}^n$, where θ > 0 is a finite positive constant. The iterative control law $v_0(x_k)$ can be computed as follows:

\[
\begin{aligned}
v_0(x_k) &= \arg\min_{u_k} \{ U(x_k, u_k) + V_0(x_{k+1}) \} \\
&= \arg\min_{u_k} \{ U(x_k, u_k) + V_0(F(x_k, u_k)) \},
\end{aligned} \tag{2.3.7}
\]

where $V_0(x_{k+1}) = \theta \Psi(x_{k+1})$. For i = 1, 2, ..., the iterative θ-ADP algorithm will iterate between

\[
\begin{aligned}
V_i(x_k) &= \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(x_{k+1}) \} \\
&= U(x_k, v_{i-1}(x_k)) + V_{i-1}(F(x_k, v_{i-1}(x_k)))
\end{aligned} \tag{2.3.8}
\]

and

\[
\begin{aligned}
v_i(x_k) &= \arg\min_{u_k} \{ U(x_k, u_k) + V_i(x_{k+1}) \} \\
&= \arg\min_{u_k} \{ U(x_k, u_k) + V_i(F(x_k, u_k)) \}.
\end{aligned} \tag{2.3.9}
\]

We note that the ADP algorithm (2.3.7)–(2.3.9) described above is essentially the same as those in (2.2.10)–(2.2.12) and (2.2.22)–(2.2.24). The only differences are the choice of initial value function and the choice of utility function. Here, the utility function may be nonquadratic.
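
For a scalar linear system with quadratic utility, the iteration (2.3.6)–(2.3.9) can be carried out in closed form, which gives a convenient sanity check. The sketch below assumes F(x, u) = ax + bu, U(x, u) = qx² + ru² and Ψ(x) = x²; these choices and all names are illustrative and are not taken from the text. Under these assumptions, minimizing over u gives V_i(x) = p_i x² with p_0 = θ and p_i = q + a²r p_{i−1}/(r + b² p_{i−1}), and v_i(x) = −K_i x with K_i = ab p_i/(r + b² p_i).

```python
def theta_adp_lq(a, b, q, r, theta, n_iter):
    """Scalar LQ instance of (2.3.6)-(2.3.9).

    Returns lists p and K with V_i(x) = p[i] * x**2 and v_i(x) = -K[i] * x
    for i = 0, ..., n_iter.
    """
    p = [theta]                                   # V_0 = theta * x^2
    K = [a * b * theta / (r + b * b * theta)]     # v_0 from (2.3.7)
    for _ in range(n_iter):
        p_next = q + a * a * r * p[-1] / (r + b * b * p[-1])   # (2.3.8)
        p.append(p_next)
        K.append(a * b * p_next / (r + b * b * p_next))        # (2.3.9)
    return p, K

p, K = theta_adp_lq(a=1.2, b=1.0, q=1.0, r=1.0, theta=10.0, n_iter=30)
```

With these numbers the coefficients p_i decrease from θ = 10 toward the positive root of p² − 1.44p − 1 = 0, which is the fixed point of the recursion.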

Remark 2.3.1 Equations (2.3.7)–(2.3.9) in the iterative θ -ADP algorithm are similar
to the Bellman equation (2.3.4), but they are not the same. There are at least three
obvious differences.

(1) The Bellman equation (2.3.4) possesses a unique optimal cost function, i.e., $J^*(x_k)$, $\forall x_k$, while in the iterative ADP equations (2.3.7)–(2.3.9), the value functions are different for different iteration index i, i.e., $V_i(x_k) \neq V_j(x_k)$, $\forall i \neq j$.
(2) The control law obtained by Bellman equation (2.3.4) is the optimal control law, i.e., $u^*(x_k)$, $\forall x_k$, while the control laws from the iterative ADP equations (2.3.7)–(2.3.9) are different for each iteration index i, i.e., $v_i(x_k) \neq v_j(x_k)$, $\forall i \neq j$, which are not optimal in general.
(3) For any finite i, the iterative value function $V_i(x_k)$ is a sum of finite sequence with a terminal constraint term and the property of $V_i(x_k)$ can be seen in the following lemma (Lemma 2.3.1). But the optimal cost function $J^*(x_k)$ in (2.3.4) is a sum of an infinite sequence. So, in general, $V_i(x_k) \neq J^*(x_k)$.

Lemma 2.3.1 Let $x_k$ be an arbitrary state vector. If the iterative value function $V_i(x_k)$ and the control law $v_i(x_k)$ are obtained by (2.3.7)–(2.3.9), then $V_i(x_k)$ can be expressed as

\[ V_i(x_k) = \sum_{j=0}^{i} U\big(x_{k+j}, v_{i-j}(x_{k+j})\big) + \theta \Psi(x_{k+i+1}). \]

Proof According to (2.3.8), we have

\[
V_i(x_k) = \min_{u_k} \Big\{ U(x_k, u_k) + \min_{u_{k+1}} \Big\{ U(x_{k+1}, u_{k+1}) + \cdots + \min_{u_{k+i-1}} \big\{ U(x_{k+i-1}, u_{k+i-1}) + V_1(x_{k+i}) \big\} \cdots \Big\} \Big\}, \tag{2.3.10}
\]

where

\[ V_1(x_{k+i}) = \min_{u_{k+i}} \{ U(x_{k+i}, u_{k+i}) + \theta \Psi(x_{k+i+1}) \}. \]

Define

\[ \underline{u}_k^N = (u_k, u_{k+1}, \ldots, u_N) \]

as a finite sequence of controls from k to N, where N ≥ k is an arbitrary positive integer. Then, (2.3.10) can be written as

\[
\begin{aligned}
V_i(x_k) &= \min_{\underline{u}_k^{k+i}} \{ U(x_k, u_k) + U(x_{k+1}, u_{k+1}) + \cdots + U(x_{k+i}, u_{k+i}) + \theta \Psi(x_{k+i+1}) \} \\
&= \sum_{j=0}^{i} U\big(x_{k+j}, v_{i-j}(x_{k+j})\big) + \theta \Psi(x_{k+i+1}).
\end{aligned}
\]

The proof is complete.
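
The unrolling used in this proof can be checked numerically on a simple case. The sketch below again assumes a scalar system F(x, u) = ax + bu with U(x, u) = qx² + ru² and Ψ(x) = x² (illustrative choices, not from the text), for which V_i(x) = p_i x² is available in closed form. It applies the recursion (2.3.8) with V_0 = θΨ, so V_i is unrolled into i utility terms under the laws v_{i−1}, v_{i−2}, ..., v_0 followed by the terminal term θΨ(x_{k+i}); this is the indexing that also appears in (2.3.15).

```python
a, b, q, r, theta = 1.2, 1.0, 1.0, 1.0, 10.0   # assumed example data

def step_p(p):
    """One value update: V_{i}(x) = step_p(p_{i-1}) * x^2."""
    return q + a * a * r * p / (r + b * b * p)

def gain(p):
    """Minimizing control for V(x) = p x^2 is u = -gain(p) * x."""
    return a * b * p / (r + b * b * p)

i = 5
p = [theta]
for _ in range(i):
    p.append(step_p(p[-1]))

# Roll the system forward with v_{i-1}, v_{i-2}, ..., v_0 and accumulate utilities.
x_start = 0.7
x, total = x_start, 0.0
for j in range(i):
    u = -gain(p[i - 1 - j]) * x
    total += q * x * x + r * u * u
    x = a * x + b * u
total += theta * x * x            # terminal term theta * Psi(x_{k+i})

direct = p[i] * x_start ** 2      # V_i(x_k) from the recursion itself
```

The two numbers `total` and `direct` agree up to rounding, which is the finite-sum-plus-terminal-term structure the lemma describes.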

In the above, we can see that the optimal value function J ∗ (xk ) is replaced by
a sequence of iterative value functions Vi (xk ) and the optimal control law u∗ (xk ) is
replaced by a sequence of iterative control laws vi (xk ), where i ≥ 0 is the iteration
index. As (2.3.8) is not a Bellman equation, generally speaking, the iterative value
function Vi (xk ) is not optimal. However, we can prove that J ∗ (xk ) is the limit of
Vi (xk ) as i → ∞. Next, the convergence properties will be analyzed.

Lemma 2.3.2 Let μ(xk ) ∈ Ak be an arbitrary control law, and let Vi (xk ) and vi (xk )
be expressed as in (2.3.7)–(2.3.9), respectively. Define a new value function Pi (xk ) as

Pi+1 (xk ) = U(xk , μ(xk )) + Pi (xk+1 ), (2.3.11)

with P0 (xk ) = V0 (xk ) = θ Ψ (xk ), ∀xk , then Vi (xk ) ≤ Pi (xk ).

From the definition given in (2.3.11), if we let $\mu(x_k) = u^*(x_k)$, then

\[ \lim_{i \to \infty} P_i(x_k) = J^*(x_k). \]

In general, we have

\[ P_i(x_k) \ge J^*(x_k), \quad \forall i, x_k. \]

Theorem 2.3.1 Let xk be an arbitrary state vector. The iterative control law vi (xk )
and the iterative value function Vi (xk ) are obtained by (2.3.7)–(2.3.9). If Assumptions
2.2.1–2.2.3 and 2.3.1 hold, then for any finite i = 0, 1, . . ., there exists a finite
θ > 0 such that the iterative value function Vi (xk ) is a monotonically nonincreasing
sequence for i = 0, 1, . . ., i.e.,

Vi+1 (xk ) ≤ Vi (xk ), ∀i. (2.3.12)

Proof To obtain the conclusion, we will show that for an arbitrary finite i < ∞,
there exists a finite θi > 0 such that (2.3.12) holds. We prove this by mathematical
induction.
First, we let i = 0. Let $\mu(x_k) \in \mathcal{A}_k$ be an arbitrary stable control law. Define the value function $P_i(x_k)$ as in (2.3.11). For i = 0, we have

\[
\begin{aligned}
P_1(x_k) &= U(x_k, \mu(x_k)) + P_0(x_{k+1}) \\
&= U(x_k, \mu(x_k)) + \theta \Psi(F(x_k, \mu(x_k))).
\end{aligned}
\]

According to Definition 2.3.1, there exists a stable control law ν̄k = ν̄(xk ) such that

Ψ (xk ) − Ψ (F(xk , ν̄(xk ))) > 0.

As ν̄(xk ) is a stable control law, the utility function U(xk , ν̄(xk )) is finite. Then,
there exists a finite θ0 > 0 such that

θ0 [Ψ (xk ) − Ψ (F(xk , ν̄(xk )))] ≥ U(xk , ν̄(xk )).

As μ(xk ) ∈ Ak is arbitrary, we can let μ(xk ) = ν̄(xk ). Let θ = θ0 and

P0 (xk ) = V0 (xk ) = θ0 Ψ (xk ). (2.3.13)

We can get

θ0 Ψ (xk ) ≥ U(xk , ν̄(xk )) + θ0 Ψ (F(xk , ν̄(xk ))) = P1 (xk ).

According to Lemma 2.3.2, we have

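\[
\begin{aligned}
V_1(x_k) &= \min_{u_k} \{ U(x_k, u_k) + V_0(x_{k+1}) \} \\
&= \min_{u_k} \{ U(x_k, u_k) + \theta_0 \Psi(x_{k+1}) \} \\
&\le U(x_k, \bar{\nu}(x_k)) + \theta_0 \Psi(F(x_k, \bar{\nu}(x_k))) \\
&= P_1(x_k).
\end{aligned} \tag{2.3.14}
\]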

According to (2.3.13) and (2.3.14), we can obtain

V0 (xk ) = θ0 Ψ (xk ) ≥ P1 (xk ) ≥ V1 (xk ),

which proves V0 (xk ) ≥ V1 (xk ).


Hence, the conclusion holds for i = 0. Assume that for i = l − 1, l = 1, 2, ..., there exists a finite $\theta_{l-1}$ such that (2.3.12) holds. Now, we consider the situation for i = l. According to Lemma 2.3.1, for all $\tilde{\theta}_l > 0$, the iterative value function $V_l(x_k)$ can be expressed as

\[ V_l(x_k) = \sum_{j=0}^{l-1} U\big(x_{k+j}, v_{l-j-1}(x_{k+j})\big) + \tilde{\theta}_l \Psi(x_{k+l}), \tag{2.3.15} \]

where $v_l(x_k)$ is the iterative control law satisfying (2.3.9), and

\[ V_0(x_k) = P_0(x_k) = \tilde{\theta}_l \Psi(x_k). \]

Let
uk = vl−1 (xk ), uk+1 = vl−2 (xk+1 ), . . . , uk+l−1 = v0 (xk+l−1 ).

Then, the iterative value function $P_{l+1}(x_k)$ can be derived as

\[
\begin{aligned}
P_{l+1}(x_k) &= U(x_k, v_{l-1}(x_k)) + U(x_{k+1}, v_{l-2}(x_{k+1})) + \cdots + U(x_{k+l-1}, v_0(x_{k+l-1})) \\
&\quad + U(x_{k+l}, \mu(x_{k+l})) + P_0(x_{k+l+1}) \\
&= \sum_{j=0}^{l-1} U\big(x_{k+j}, v_{l-j-1}(x_{k+j})\big) + U(x_{k+l}, \mu(x_{k+l})) + \tilde{\theta}_l \Psi(x_{k+l+1}),
\end{aligned} \tag{2.3.16}
\]

where μ(xk+l ) ∈ Ak+l . According to Definition 2.3.1, there exists a stable control
law ν̄(xk+l ) ∈ Ak+l such that

Ψ (xk+l ) − Ψ (F(xk+l , ν̄(xk+l ))) > 0.

Thus, there exists a finite θl satisfying

θl [Ψ (xk+l ) − Ψ (F(xk+l , ν̄(xk+l )))] ≥ U(xk+l , ν̄(xk+l )). (2.3.17)

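Let $\mu(x_{k+l}) = \bar{\nu}(x_{k+l})$, and $\tilde{\theta}_l = \theta_l$. Then, according to (2.3.15)–(2.3.17), we can get

\[
\begin{aligned}
V_l(x_k) &= \sum_{j=0}^{l-1} U\big(x_{k+j}, v_{l-j-1}(x_{k+j})\big) + \theta_l \Psi(x_{k+l}) \\
&\ge \sum_{j=0}^{l-1} U\big(x_{k+j}, v_{l-j-1}(x_{k+j})\big) + U(x_{k+l}, \bar{\nu}(x_{k+l})) + \theta_l \Psi(x_{k+l+1}) \\
&= P_{l+1}(x_k).
\end{aligned}
\]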

According to Lemma 2.3.2, we have Vl+1 (xk ) ≤ Pl+1 (xk ). Therefore, we obtain

Vl+1 (xk ) ≤ Vl (xk ).

The mathematical induction is complete. On the other hand, as i is finite, if we let $\tilde{\theta} = \max\{\theta_0, \theta_1, \ldots, \theta_i\}$, then we can choose an arbitrary finite θ that satisfies $\theta \ge \tilde{\theta}$ such that (2.3.12) holds. The proof is complete.
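
To see how the size of θ decides the direction of the first step, consider once more an assumed scalar case F(x, u) = ax + bu, U(x, u) = qx² + ru², Ψ(x) = x² with a hypothetical stabilizing law ν̄(x) = −Kx. The condition θ_0[Ψ(x) − Ψ(F(x, ν̄(x)))] ≥ U(x, ν̄(x)) then reduces to θ_0 ≥ (q + rK²)/(1 − (a − bK)²), independent of x. The sketch below evaluates this bound and compares the first value update for a θ at the bound and for a θ that is too small.

```python
a, b, q, r = 1.2, 1.0, 1.0, 1.0          # assumed example data
K = 1.0                                   # assumed stabilizing gain, |a - b*K| = 0.2 < 1

# Smallest theta_0 satisfying the inequality used for i = 0 in the proof.
theta0 = (q + r * K * K) / (1.0 - (a - b * K) ** 2)

def value_update(p):
    """Coefficient of V_1 when V_0(x) = p * x^2, from (2.3.8)."""
    return q + a * a * r * p / (r + b * b * p)

p1_large = value_update(theta0)           # V_1 coefficient for theta = theta0
p1_small = value_update(1.0)              # V_1 coefficient for theta = 1
```

For θ = θ_0 the first update does not increase the value function, whereas for θ = 1 it does, so the monotonically nonincreasing behavior in (2.3.12) is lost when θ is chosen too small.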

Remark 2.3.2 In (2.3.17), for all i = 1, 2, ..., if we choose a $\theta_i$ such that

\[ \theta_i \left[ \Psi(x_{k+i}) - \Psi(F(x_{k+i}, \bar{\nu}(x_{k+i}))) \right] > U(x_{k+i}, \bar{\nu}(x_{k+i})) \]

holds, then we can obtain (2.3.12). In this situation, the iterative value function $V_i(x_k)$ is a monotonically decreasing sequence for i = 0, 1, ....
Theorem 2.3.2 Let $x_k$ be an arbitrary state vector. If Assumptions 2.2.1–2.2.3 and 2.3.1 hold and there exists a control law $\bar{\nu}(x_k) \in \mathcal{A}_k$ which satisfies (2.3.5) such that the following limit

\[ \lim_{x_k \to 0} \frac{U(x_k, \bar{\nu}(x_k))}{\Psi(x_k) - \Psi(F(x_k, \bar{\nu}(x_k)))} \tag{2.3.18} \]

exists, then there exists a finite θ > 0 such that (2.3.12) is true.
Proof According to (2.3.17) in Theorem 2.3.1, we can see that for any finite i < ∞, the parameter $\theta_i$ should satisfy

\[ \theta_i \ge \frac{U(x_{k+i}, \bar{\nu}(x_{k+i}))}{\Psi(x_{k+i}) - \Psi(F(x_{k+i}, \bar{\nu}(x_{k+i})))} \]

in order for (2.3.12) to be true. Let i → ∞. We have

\[ \lim_{i \to \infty} \theta_i \ge \lim_{i \to \infty} \frac{U(x_{k+i}, \bar{\nu}(x_{k+i}))}{\Psi(x_{k+i}) - \Psi(F(x_{k+i}, \bar{\nu}(x_{k+i})))}. \tag{2.3.19} \]

We can see that if the limit of the right-hand side of (2.3.19) exists, then $\theta_\infty = \lim_{i \to \infty} \theta_i$ can be defined. Therefore, if we define

θ̄ = sup{θ0 , θ1 , . . . , θ∞ }, (2.3.20)

then θ̄ can be well defined. Hence, we can choose an arbitrary finite θ which satisfies

θ ≥ θ̄, (2.3.21)

such that (2.3.12) is true.


On the other hand, $\bar{\nu}(x_k) \in \mathcal{A}_k$ is a stable control law. We have $x_k \to 0$ as k → ∞ under the stable control sequence $(\bar{\nu}_k, \bar{\nu}_{k+1}, \ldots)$, where $\bar{\nu}_k = \bar{\nu}(x_k)$ for all k = 0, 1, .... If (2.3.18) is finite, then according to (2.3.19) and (2.3.20), there exists a finite θ such that

\[ \theta \ge \lim_{x_k \to 0} \frac{U(x_k, \bar{\nu}(x_k))}{\Psi(x_k) - \Psi(F(x_k, \bar{\nu}(x_k)))}. \]

The proof is complete.
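
The limit (2.3.18) can be probed numerically by evaluating the ratio at states that approach the origin. The sketch below uses a hypothetical scalar system F(x, u) = 0.9 sin(x) + u with U(x, u) = x² + u², Ψ(x) = x² and ν̄(x) = 0; all of these are assumptions for illustration. For this choice the ratio tends to 1/(1 − 0.81) as x → 0.

```python
import numpy as np

# Hypothetical example data (not from the text).
F = lambda x, u: 0.9 * np.sin(x) + u
U = lambda x, u: x ** 2 + u ** 2
Psi = lambda x: x ** 2
nu_bar = lambda x: 0.0 * x        # zero control; |0.9 sin(x)| < |x| for x != 0

def ratio(x):
    """Right-hand side of the theta_i bound, evaluated at state x."""
    u = nu_bar(x)
    return U(x, u) / (Psi(x) - Psi(F(x, u)))

xs = np.array([1.0, 0.1, 0.01, 0.001])
values = ratio(xs)
limit = 1.0 / (1.0 - 0.9 ** 2)
```

In this example the sampled ratios increase toward the limit as the state shrinks, so any finite θ at or above the limit bounds all of them.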


Remark 2.3.3 In this section, we expect that the iterative value function $V_i(x_k) \to J^*(x_k)$ and the iterative control law $v_i(x_k) \to u^*(x_k)$. It is obvious that $u^*(x_k) \in \mathcal{A}_k$. If we put $u^*(x_k)$ into (2.3.11), then $\lim_{i \to \infty} P_i(x_k) = J^*(x_k)$ holds for any finite θ.
From Theorems 2.3.1 and 2.3.2, we can see that if there exists a finite θ such that
(2.3.12) holds, then Vi (xk ) ≥ 0 and it is a nonincreasing sequence with lower bound
for iteration index i = 0, 1, . . .. We can derive the following theorem.
Theorem 2.3.3 Let $x_k$ be an arbitrary state vector. Define the value function $V_\infty(x_k)$ as the limit of the iterative value function $V_i(x_k)$, i.e.,

\[ V_\infty(x_k) = \lim_{i \to \infty} V_i(x_k). \]

Then,

\[ V_\infty(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}. \tag{2.3.22} \]

Proof Let μ(xk ) be an arbitrary stable control law. According to Theorem 2.3.1,
∀i = 0, 1, . . ., we have

V∞ (xk ) ≤ Vi+1 (xk ) ≤ U(xk , μ(xk )) + Vi (xk+1 ).

Let i → ∞. We then have

V∞ (xk ) ≤ U(xk , μ(xk )) + V∞ (xk+1 ).

So,

\[ V_\infty(x_k) \le \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}. \tag{2.3.23} \]

Let ε > 0 be an arbitrary positive number. Since Vi (xk ) is nonincreasing for all i and
limi→∞ Vi (xk ) = V∞ (xk ), there exists a positive integer p such that

Vp (xk ) − ε ≤ V∞ (xk ) ≤ Vp (xk ).

Then, we let

\[ V_p(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_{p-1}(x_{k+1}) \} = U(x_k, v_{p-1}(x_k)) + V_{p-1}(x_{k+1}). \]

Hence,

\[
\begin{aligned}
V_\infty(x_k) &\ge V_p(x_k) - \varepsilon \\
&\ge U(x_k, v_{p-1}(x_k)) + V_{p-1}(x_{k+1}) - \varepsilon \\
&\ge U(x_k, v_{p-1}(x_k)) + V_\infty(x_{k+1}) - \varepsilon \\
&\ge \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \} - \varepsilon.
\end{aligned}
\]

Since ε is arbitrary, we have

\[ V_\infty(x_k) \ge \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}. \tag{2.3.24} \]

Combining (2.3.23) and (2.3.24), we have (2.3.22) which proves the conclusion of
this theorem.

Remark 2.3.4 We must point out two important properties. First, from the iterative θ-ADP algorithm (2.3.7)–(2.3.9), we see that the initial function $\Psi(x_k)$ is arbitrarily chosen in the set $\bar{\Psi}_{x_k}$. The parameter θ is also arbitrarily chosen if it satisfies
(2.3.21). Actually, it is not necessary to find all θi to construct the set in (2.3.20).
What we should do is to choose a θ large enough to run the iterative θ -ADP algorithm
(2.3.7)–(2.3.9) and guarantee the iterative value function to be convergent. This
allows for very convenient implementation of the present algorithm. Second, for
different initial value θ and different initial function Ψ (xk ), the iterative value function
of the iterative θ -ADP algorithm will converge to the same value function. We will
show this property after two necessary lemmas.

Lemma 2.3.3 Let $\bar{\nu}(x_k) \in \mathcal{A}_k$ be an arbitrary stable control law, and let the value function $P_i(x_k)$ be defined in (2.3.11) with $\mu = \bar{\nu}$. Define a new value function $P'_i(x_k)$ as

\[ P'_{i+1}(x_k) = U(x_k, \bar{\nu}(x_k)) + P'_i(x_{k+1}), \]

with $P'_0(x_k) = \theta' \Psi(x_k)$, $\forall x_k$. Let θ and θ′ be two different finite constants which satisfy (2.3.21), i.e., let $\theta \ge \bar{\theta}$ and $\theta' \ge \bar{\theta}$ such that (2.3.12) is true. Then, $P_\infty(x_k) = P'_\infty(x_k) = \Gamma_\infty(x_k)$, where $\Gamma_\infty(x_k)$ is defined as

\[ \Gamma_\infty(x_k) = \lim_{i \to \infty} \sum_{j=0}^{i} U(x_{k+j}, \bar{\nu}(x_{k+j})). \]

Proof According to the definitions of $P_i(x_k)$ and $P'_i(x_k)$, we have

\[ P_i(x_k) = \sum_{j=0}^{i} U\big(x_{k+j}, \bar{\nu}(x_{k+j})\big) + \theta \Psi(x_{k+i+1}), \]

\[ P'_i(x_k) = \sum_{j=0}^{i} U\big(x_{k+j}, \bar{\nu}(x_{k+j})\big) + \theta' \Psi(x_{k+i+1}), \]

where θ and θ′ both satisfy (2.3.21), and θ ≠ θ′. As $\bar{\nu}(x_k)$ is a stable control law, we have $x_{k+i} \to 0$ as i → ∞. Then, $\lim_{k \to \infty} \theta \Psi(x_k) = \lim_{k \to \infty} \theta' \Psi(x_k) = 0$ since $x_k \to 0$. So, we can get

\[ P_\infty(x_k) = P'_\infty(x_k) = \lim_{i \to \infty} \sum_{j=0}^{i} U(x_{k+j}, \bar{\nu}(x_{k+j})) = \Gamma_\infty(x_k). \]

The proof is complete.
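
The statement of Lemma 2.3.3 can be illustrated by rolling out a fixed stable control law with two different terminal weights. The sketch below assumes a scalar system F(x, u) = ax + bu, U(x, u) = qx² + ru², Ψ(x) = x² and ν̄(x) = −Kx; these choices and the helper name `P` are illustrative only. The function follows the recursion (2.3.11) with P_0 = θΨ, so P_i consists of i utility terms plus the terminal term.

```python
a, b, q, r, K = 1.2, 1.0, 1.0, 1.0, 1.0   # nu_bar(x) = -K x, closed loop 0.2 x

def P(theta, i, x):
    """P_i(x) from (2.3.11) with P_0 = theta * Psi: i utility terms plus terminal term."""
    total = 0.0
    for _ in range(i):
        u = -K * x
        total += q * x * x + r * u * u
        x = a * x + b * u
    return total + theta * x * x

x0 = 0.7
P_a = P(theta=3.0, i=40, x=x0)
P_b = P(theta=50.0, i=40, x=x0)

# Closed-form infinite sum of utilities under nu_bar (geometric series).
gamma = (q + r * K * K) * x0 ** 2 / (1.0 - (a - b * K) ** 2)
```

Both rollouts agree with the infinite sum up to rounding because the terminal term θΨ(x_{k+i}) vanishes as the state is driven to zero, whatever the value of θ.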

Next, we will prove that the iterative value function Vi (xk ) converges to the optimal
value function J ∗ (xk ) as i → ∞. Before we give the optimality theorem, the following
lemma is necessary.

2.3.2 Optimality Analysis

The following lemma is needed for the optimality analysis.


Lemma 2.3.4 Let $V_i(x_k)$ be defined in (2.3.7)–(2.3.9) and $P_i(x_k)$ be defined in (2.3.11), for i = 0, 1, .... Let θ satisfy (2.3.21). Then, for any ε > 0, there exists a finite positive integer q such that

\[ P_q(x_k) \le J^*(x_k) + \varepsilon. \]

Proof According to Lemma 2.3.1, we have

\[ V_\infty(x_k) = \lim_{i \to \infty} \sum_{j=0}^{i} U(x_{k+j}, v_{i-j}(x_{k+j})). \]

According to the definition of $J^*(x_k)$, we have $V_\infty(x_k) \ge J^*(x_k)$. Let q be an arbitrary finite positive integer. According to Theorems 2.3.1 and 2.3.3, we have $V_q(x_k) \ge V_\infty(x_k)$. According to Lemma 2.3.2, we have

\[ P_q(x_k) \ge V_q(x_k) \ge V_\infty(x_k) \ge J^*(x_k). \]

Next, let

\[ \underline{\mu}_k^{k+q-1} = \underline{u}_k^{*(k+q-1)}. \]

We can obtain

\[ P_q(x_k) - J^*(x_k) = \theta \Psi(x_{k+q}) - \sum_{j=q}^{\infty} U(x_{k+j}, u^*_{k+j}) \ge 0. \]

From (2.3.11), as $\mu(x_k)$ is a stable control law, the control sequence $\underline{\mu}_k = (\mu_k, \mu_{k+1}, \ldots)$ under the stable control law $\mu(x_k)$ is a stable control sequence. Hence, we can get $\theta \Psi(x_{k+q}) \to 0$ as q → ∞. Then, from the fact that

\[ \lim_{q \to \infty} \sum_{j=q}^{\infty} U(x_{k+j}, u^*_{k+j}) = 0, \]

we can obtain

\[ \lim_{q \to \infty} \Big[ \theta \Psi(x_{k+q}) - \sum_{j=q}^{\infty} U(x_{k+j}, u^*_{k+j}) \Big] = 0. \]

Therefore, ∀ε > 0, there exists a finite q such that Pq (xk ) − J ∗ (xk ) ≤ ε holds. This
completes the proof of the lemma.

Theorem 2.3.4 Let Vi (xk ) be defined by (2.3.8) where θ satisfies (2.3.21). If the
system state xk is controllable, then Vi (xk ) converges to the optimal cost function
J ∗ (xk ) as i → ∞, i.e.,

Vi (xk ) → J ∗ (xk ), as i → ∞, ∀xk .

Proof According to the definition of J ∗ (xk ) in (2.3.2), we have

J ∗ (xk ) ≤ Vi (xk ).

Then, let i → ∞. We have

J ∗ (xk ) ≤ V∞ (xk ). (2.3.25)

Let ε > 0 be an arbitrary positive number. According to Lemma 2.3.4, there exists
a finite positive integer q such that

Vq (xk ) ≤ Pq (xk ) ≤ J ∗ (xk ) + ε. (2.3.26)

On the other hand, according to Theorem 2.3.1, we have

V∞ (xk ) ≤ Vq (xk ). (2.3.27)

Combining (2.3.26) and (2.3.27), we have $V_\infty(x_k) \le J^*(x_k) + \varepsilon$. As ε is an arbitrary positive number, we have

V∞ (xk ) ≤ J ∗ (xk ). (2.3.28)

According to (2.3.25) and (2.3.28), we have

V∞ (xk ) = J ∗ (xk ).

The proof is complete.



We can now derive the following corollary.


Corollary 2.3.1 Let the value function Vi (xk ) be defined by (2.3.8). If the system
state xk is controllable and Theorem 2.3.4 holds, then the iterative control law vi (xk )
converges to the optimal control law u∗ (xk ).
Stability is one of the most basic and necessary properties of any control system. So, we will give the stability analysis for system (2.3.1) under the iterative θ-ADP algorithm (2.3.7)–(2.3.9).

Theorem 2.3.5 Let $x_k$ be an arbitrary controllable state. For i = 0, 1, ..., if Assumptions 2.2.1–2.2.3 and 2.3.1 hold and the iterative value function $V_i(x_k)$ and iterative control law $v_i(x_k)$ are defined by (2.3.7)–(2.3.9) where θ satisfies (2.3.21), then $v_i(x_k)$ is an asymptotically stable control law for system (2.3.1), ∀i = 0, 1, ....

Proof The theorem will be proved in two steps.


(1) Show that the iterative value function Vi (xk ) is a positive-definite function,
∀i = 0, 1, . . ..
For the iterative θ -ADP algorithm, we have V0 (xk ) = θ Ψ (xk ).
According to Assumption 2.3.1, Vi (xk ) is a positive-definite function for i = 0.
Assume that for i = l, Vl (xk ) is a positive-definite function. Then, for i = l + 1,
(2.3.9) holds. Let xk = 0, and we can get

Vl+1 (0) = U(0, vl (0)) + Vl (F(0, vl (0))).

According to Assumptions 2.2.1–2.2.3 and 2.3.1, we have $v_l(0) = 0$, $F(0, v_l(0)) = 0$, $U(0, v_l(0)) = 0$. As $V_l(x_k)$ is a positive-definite function, we have $V_l(0) = 0$. Then, we have $V_{l+1}(0) = 0$. If $x_k \neq 0$, according to Assumption 2.3.1, we have $V_{l+1}(x_k) > 0$. On the other hand, let $\|x_k\| \to \infty$. As $U(x_k, u_k)$ is a positive-definite function, $V_{l+1}(x_k) \to \infty$. So, $V_{l+1}(x_k)$ is a positive-definite function. The mathematical induction is complete.
(2) Show that vi (xk ) is an asymptotically stable control law for system (2.3.1).
As the iterative value function Vi (xk ) is a positive-definite function, ∀i = 0, 1, . . .,
according to (2.3.8), we have

Vi (F(xk , vi (xk ))) − Vi+1 (xk ) = −U(xk , vi (xk )) ≤ 0.

According to Theorem 2.3.1, we have $V_{i+1}(x_k) \le V_i(x_k)$, ∀i ≥ 0. Then, for all $x_k \neq 0$, we can obtain

\[
\begin{aligned}
V_i(F(x_k, v_i(x_k))) - V_i(x_k) &\le V_i(F(x_k, v_i(x_k))) - V_{i+1}(x_k) \\
&= -U(x_k, v_i(x_k)) < 0.
\end{aligned}
\]


For i = 0, 1, . . ., the iterative value function Vi (xk ) is a Lyapunov function [20, 26,
30]. Therefore, the conclusion is proved.

Next, we will prove that the optimal control law u∗ (xk ) is an admissible control
law for system (2.3.1).
Theorem 2.3.6 Let xk be an arbitrary controllable state. For i = 0, 1, . . ., if Assump-
tions 2.2.1–2.2.3 and 2.3.1 hold and the iterative value function Vi (xk ) and iterative
control law vi (xk ) are defined by (2.3.7)–(2.3.9) where θ satisfies (2.3.21), then the
optimal control law u∗ (xk ) is an admissible control law for system (2.3.1).
The proof of this theorem can be done by considering the fact that J ∗ (xk ) is finite.
Therefore, we omit the details here.

Remark 2.3.5 From the above analysis, we can see that the present iterative θ -ADP
algorithm is different from VI algorithms in [5, 52]. The main differences can be
summarized as follows.
(1) The initial conditions are different. In [5, 52], VI algorithms are initialized by zero, i.e., $V_0(x_k) \equiv 0$, $\forall x_k$. In this section, the iterative θ-ADP algorithm is initialized by $\theta \Psi(x_k) \neq 0$.
(2) The convergence properties are different. For VI algorithms in [5, 52], the iter-
ative value function Vi (xk ) is monotonically nondecreasing and converges to
the optimum. In this section, the iterative value function Vi (xk ) in the θ -ADP
algorithm is monotonically nonincreasing and converges to the optimal one.
(3) We emphasize that the properties of the iterative control laws are different. For
the VI algorithms in [5, 52], the stability of iterative control laws cannot be
guaranteed, which means the VI algorithm can only be implemented off-line. In
this section, it is proved that for all i = 0, 1, . . ., the iterative control law vi (xk )
is a stable control law. This means that the present iterative θ -ADP algorithm
is feasible for implementations both online and off-line. This is an obvious
merit of the present iterative θ -ADP algorithm. In the simulation study, we will
provide simulation comparisons between the VI algorithms in [5, 52] and the
present iterative θ -ADP algorithm. This conclusion echoes the observation in
Remark 2.2.2.
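
The difference in item (3) can be made concrete on an assumed open-loop unstable scalar system F(x, u) = ax + bu with |a| > 1, U(x, u) = qx² + ru² and Ψ(x) = x² (illustrative choices, not from the text). For V_i(x) = p_i x² the iterative control law is v_i(x) = −K_i x, so the closed-loop multiplier is a − bK_i. The sketch below lists these multipliers for a zero-initialized VI and for a θ-ADP run with a large θ.

```python
a, b, q, r = 1.2, 1.0, 1.0, 1.0           # open loop unstable since |a| > 1

def closed_loop_multipliers(p0, n_iter):
    """Return a - b*K_i for v_i computed from V_i(x) = p_i x^2, i = 0..n_iter-1."""
    p, multipliers = p0, []
    for _ in range(n_iter):
        K = a * b * p / (r + b * b * p)              # gain of v_i
        multipliers.append(a - b * K)
        p = q + a * a * r * p / (r + b * b * p)      # next value coefficient
    return multipliers

m_vi = closed_loop_multipliers(p0=0.0, n_iter=10)       # traditional VI, V_0 = 0
m_theta = closed_loop_multipliers(p0=10.0, n_iter=10)   # theta-ADP, V_0 = 10 x^2
```

In this example the first VI control law is v_0(x) = 0 and leaves the unstable open loop unchanged, while every θ-ADP multiplier has magnitude below one.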

2.3.3 Summary of Iterative θ-ADP Algorithm

In the previous development, we can see that an initial positive-definite function $\Psi(x_k) \in \bar{\Psi}_{x_k}$ is needed to start the iterative θ-ADP algorithm. So, the existence of the set $\bar{\Psi}_{x_k}$ is important for the algorithm. Next, we will show that $\bar{\Psi}_{x_k} \neq \emptyset$, where ∅ is the empty set.


Theorem 2.3.7 Let xk be an arbitrary controllable state, and let J ∗ (xk ) be the opti-
mal cost function expressed by (2.3.2). If Assumptions 2.2.1–2.2.3 and 2.3.1 hold,
then

J ∗ (xk ) ∈ Ψ̄xk .

Proof By Assumption 2.3.1 and the definition of $J^*(x_k)$ in (2.3.2) and (2.3.4), we can see that

\[ J^*(x_k) = \lim_{N \to \infty} \sum_{j=0}^{N} U(x_{k+j}, u^*(x_{k+j})) \]

is a positive-definite function of $x_k$. From (2.3.4), we can also obtain

\[ J^*(F(x_k, u^*(x_k))) \le J^*(x_k), \quad \forall x_k. \]

This completes the proof of the theorem.

According to Theorem 2.3.7, $\bar{\Psi}_{x_k}$ is not an empty set. In general, however, the optimal value function is difficult to obtain before the algorithm is complete. So, some other methods are established to obtain $\Psi(x_k)$.

Corollary 2.3.2 Let $x_k$ be an arbitrary state vector. If $\Psi(x_k)$ is a Lyapunov function of system (2.3.1), then $\Psi(x_k) \in \bar{\Psi}_{x_k}$.

Remark 2.3.6 According to the definition of admissible control law, we can see that $\Psi(x_k) \in \bar{\Psi}_{x_k}$ is equivalent to $\Psi(x_k)$ being a Lyapunov function. There are two properties
we should point out. First, the general purpose of choosing a Lyapunov function
Ψ (xk ) is to find a control ν̄(xk ) to stabilize the system. In this section, however, the
purpose of choosing the initial function θ Ψ (xk ) is to obtain the optimal control of
the system (not only to stabilize the system but also to minimize the value function).
Second, if we adopt V0 (xk ) = Ψ (xk ) to initialize the system, then the initial iterative
control law v0 (xk ) can be obtained by

\[ v_0(x_k) = \arg\min_{u_k} \{ U(x_k, u_k) + \Psi(x_{k+1}) \}. \]

We should point out that v0 (xk ) may not be a stable control law for the system,
although the algorithm is initialized by a Lyapunov function. Using the present
iterative θ -ADP algorithm (2.3.7)–(2.3.9) in this section, we can prove that all the
iterative controls vi (xk ) for i = 0, 1, . . ., are stable and simultaneously guarantee the
iterative value function to converge to the optimum. Hence, our present algorithm is
effective to obtain the optimal control law both online and off-line.

From Corollary 2.3.2, we can see that if we get a Lyapunov function of system (2.3.1), then $\Psi(x_k)$ can be obtained. As a Lyapunov function is also difficult to obtain, we will give some simple methods to choose the function $\Psi(x_k)$.
First, it is recommended to use the utility function U(xk , 0) to start the iterative
θ -ADP algorithm, where we set V0 (xk ) = θ U(xk , 0) with a large θ . If we get a V1 (xk )
such that V1 (xk ) < V0 (xk ), then U(xk , 0) ∈ Ψ̄xk .
Second, we can use NN structures of ADP to generate an initial function Ψ (xk ).
We first randomly initialize the weights of the action NN. Give an arbitrary positive-
definite function G(xk ) > 0 and train the critic NN to satisfy the equation

Ψ̂ (xk ) = G(xk ) + Ψ̂ (F(xk , v̂(xk ))),

where Ψ̂ (xk ) and v̂(xk ) are outputs of critic and action networks, respectively. The NN
structure and the training rule can be seen in the next section. If the critic network
training is convergent, then let Ψ (xk ) = Ψ̂ (xk ) and the initial value function is
determined.

Remark 2.3.7 For many nonlinear systems and utility functions, such as [2, 52], we
can obtain U(xk , 0) ∈ Ψ̄xk . In this situation, we only need to set a large θ for the initial
condition and run the iterative θ-ADP algorithm (2.3.7)–(2.3.9). This can reduce the amount of computation considerably. If there does not exist a stable control law such
that (2.3.18) is finite, then there may not exist a finite θ such that (2.3.12) is true.
In this case, we can find an initial admissible control law η(xk ) such that xk+N = 0,
where N ≥ 1 is an arbitrary positive integer. Let


\[ V_0(x_k) = \sum_{\tau=0}^{N} U(x_{k+\tau}, \eta(x_{k+\tau})). \]

Then, using the algorithm (2.3.7)–(2.3.9), we can also obtain $V_{i+1}(x_k) \le V_i(x_k)$. The details of the proof are available in [43].

Remark 2.3.8 The iterative θ -ADP algorithm is different from the policy itera-
tion algorithm in [1, 31]. For the policy iteration algorithm, an admissible control
sequence is necessary to initialize the algorithm, while for the iterative θ -ADP algo-
rithm developed in this section, the initial admissible control sequence is avoided.
Instead, we only need an arbitrary function Ψ (xk ) ∈ Ψ̄xk to start the algorithm.
Generally speaking, for nonlinear systems, admissible control sequences are diffi-
cult to obtain, while the function Ψ (xk ) can easily be obtained (for many cases,
U(xk , 0) ∈ Ψ̄xk ). Second, for PI algorithms in [1, 31], during every iteration step,
we need to solve a generalized Bellman equation to update the iterative control law,
while in the present iterative θ -ADP algorithm, the generalized Bellman equation is
unnecessary. Therefore, the iterative θ -ADP algorithm has more advantages than the
PI algorithm.

Now, we summarize the iterative θ -ADP algorithm as follows.

Algorithm 2.3.1 Iterative θ-adaptive dynamic programming algorithm

Step 1. Choose randomly an array of initial states $x_k$ and choose a computation precision ε. Choose an arbitrary positive-definite function $\Psi(x_k) \in \bar{\Psi}_{x_k}$. Choose a constant θ > 0.
Step 2. Let i = 0. Let the initial value function $V_0(x_k) = \theta \Psi(x_k)$.
Step 3. Compute $v_0(x_k)$ by (2.3.7) and obtain $V_1(x_k)$ by (2.3.8).
Step 4. If $V_1(x_k) \le V_0(x_k)$, then go to next step. Otherwise, θ is not large enough, and choose a larger θ′ > θ. Let θ = θ′ and go to Step 2.
Step 5. Let i = i + 1. Compute $V_i(x_k)$ by (2.3.8) and $v_i(x_k)$ by (2.3.9).
Step 6. If $V_i(x_k) \le V_{i-1}(x_k)$, go to Step 7; else, choose a larger θ′ > θ. Let θ = θ′ and go to Step 2.
Step 7. If $|V_i(x_k) - V_{i-1}(x_k)| \le \varepsilon$, then go to next step; else go to Step 5.
Step 8. Stop.
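
The steps above can be sketched on a state and control grid in place of neural networks. This is a minimal illustration under stated assumptions, not the NN implementation referred to in the text: the scalar system, the grids, the doubling rule for θ, the linear interpolation of the value function, and the function name `theta_adp_grid` are all choices made for this example.

```python
import numpy as np

def theta_adp_grid(F, U, Psi, xs, us, theta=0.5, eps=1e-6,
                   max_iter=500, max_restarts=30):
    """Grid-based sketch of Algorithm 2.3.1 for a scalar state and control."""
    X, Uc = np.meshgrid(xs, us, indexing="ij")
    cost = U(X, Uc)                  # utility at every (state, control) grid pair
    nxt = F(X, Uc).ravel()           # successor states, flattened for np.interp

    def backup(V):
        # U(x, u) + V(F(x, u)); V is linearly interpolated on the state grid and
        # held constant outside it (np.interp clamps to the end values).
        return cost + np.interp(nxt, xs, V).reshape(cost.shape)

    for _ in range(max_restarts):
        V = theta * Psi(xs)                          # Step 2
        history, monotone = [V], True
        for _ in range(max_iter):
            V_new = backup(V).min(axis=1)            # Steps 3 and 5, Eq. (2.3.8)
            if np.any(V_new > V + 1e-12):            # Steps 4 and 6
                monotone = False
                break
            history.append(V_new)
            converged = np.max(np.abs(V_new - V)) <= eps   # Step 7
            V = V_new
            if converged:
                break
        if monotone:
            policy = us[backup(V).argmin(axis=1)]    # Eq. (2.3.9)
            return theta, np.array(history), policy
        theta *= 2.0                                 # larger theta, back to Step 2
    raise RuntimeError("no suitable theta found")

# Hypothetical example: x_{k+1} = 1.1 sin(x_k) + u_k with quadratic utility.
xs = np.linspace(-1.0, 1.0, 201)
us = np.linspace(-1.0, 1.0, 201)
theta, history, policy = theta_adp_grid(
    F=lambda x, u: 1.1 * np.sin(x) + u,
    U=lambda x, u: x ** 2 + u ** 2,
    Psi=lambda x: x ** 2,
    xs=xs, us=us)
```

Starting from a deliberately small θ, the run is restarted with a doubled θ until the first update no longer increases the value function; the rows of `history` are then the nonincreasing iterates on the grid.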

Remark 2.3.9 Generally speaking, NNs are used to implement the present iterative
θ -ADP algorithm. In order to approximate the functions Vi (xk ) and vi (xk ), a large
number of xk in the state space is required to train NNs. In this situation, as we have
declared in Step 1, we should choose randomly an array of initial states xk in the
state space to initialize the algorithm. For all i = 0, 1, . . ., according to the array of
states xk , we can obtain the iterative value function Vi (xk ) and the iterative control
law vi (xk ) by training NNs, respectively. To the best of our knowledge, all the NN
implementations for ADP require a large number of xk in state space to approximate
the iterative control laws and the iterative value functions, such as [36, 48]. The
detailed NN implementation for the present iterative θ -ADP algorithm can be found
in [44].

2.3.4 Simulation Studies

To evaluate the performance of our iterative θ -ADP algorithm, we choose two exam-
ples with quadratic utility functions for numerical experiments.
Example 2.3.1 This example is chosen from [43, 52] with modifications. Consider

\[ x_{k+1} = \begin{bmatrix} 0.8 x_{1k} \exp(x_{2k}^2) \\ 0.9 x_{2k}^3 \end{bmatrix} + \begin{bmatrix} -0.2 & 0 \\ 0 & -0.2 \end{bmatrix} u_k, \]

where $x_k = [x_{1k}, x_{2k}]^T$ and $u_k = [u_{1k}, u_{2k}]^T$ are the state and control variables, respectively. The initial state is $x_0 = [1, -1]^T$. The cost function is the quadratic form expressed as

\[ J(x_0, \underline{u}_0^{\infty}) = \sum_{k=0}^{\infty} \left( x_k^T Q x_k + u_k^T R u_k \right), \]

where the matrices Q and R are given as identity matrices with suitable dimensions.
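
For readers who want to experiment with this example, the sketch below writes the system and utility as plain functions. It assumes the dynamics read x_{k+1} = [0.8 x_{1k} exp(x_{2k}²), 0.9 x_{2k}³]^T + diag(−0.2, −0.2) u_k with Q = R = I; the function names are hypothetical.

```python
import numpy as np

B = np.array([[-0.2, 0.0], [0.0, -0.2]])   # input matrix
Q = np.eye(2)                               # state weight
R = np.eye(2)                               # control weight

def F(x, u):
    """State update of Example 2.3.1 (as assumed in the lead-in)."""
    drift = np.array([0.8 * x[0] * np.exp(x[1] ** 2), 0.9 * x[1] ** 3])
    return drift + B @ u

def utility(x, u):
    """Quadratic utility x^T Q x + u^T R u."""
    return float(x @ Q @ x + u @ R @ u)

x0 = np.array([1.0, -1.0])
x1 = F(x0, np.zeros(2))     # one uncontrolled step from the initial state
```

The origin is an equilibrium of this map under zero control, and the utility at the initial state with zero control equals 2.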

Two NNs are used to implement the iterative θ -ADP algorithm. The critic and
action networks are both chosen as three-layer BP NNs with the structures of 2–8–1
and 2–8–2, respectively. For each iteration step, the critic and action networks are
trained for 200 steps using the learning rate of 0.02 so that the NN training errors
become less than $10^{-6}$. To show the effectiveness of the iterative θ-ADP algorithm,
we choose four θ ’s (including θ = 3.5, 5, 7, 10) to initialize the algorithm. Let the
algorithm run for 35 iteration steps for different θ ’s to obtain the optimal value
function. The convergence curves of the value functions are shown in Fig. 2.14.
From Fig. 2.14, we can see that all the convergence curves of value functions are
monotonically nonincreasing.

Fig. 2.14 The convergence of value functions for Example 2.3.1. a θ = 3.5. b θ = 5. c θ = 7. d θ = 10 (horizontal axes: iteration steps; vertical axes: value function)

For convenience of analysis, we let

\[ \theta_0 = \lim_{x_k \to 0} \frac{U(x_k, v_0(x_k))}{U(x_k, 0) - U(F(x_k, v_0(x_k)), 0)}, \]

where v0 (xk ) is obtained in (2.3.7). Then, for θ = 3.5, we have θ0 = 1.9015. For
θ = 5, we have θ0 = 1.90984. For θ = 7, we have θ0 = 2.04469. For θ = 10, we
have θ0 = 2.16256. We can also see that if the iterative value function is convergent,
then the iterative value function can converge to the optimum and the optimal value
function is independent from the parameter θ . We apply the optimal control law to
the system for Tf = 20 time steps and obtain the following results. The optimal state
and control trajectories are shown in Fig. 2.15a and b, respectively.
From the above simulation results, we can see that if we choose θ large enough
to initialize the iterative θ -ADP algorithm, the iterative value function Vi (xk ) will be
monotonically nonincreasing and converge to the optimum, which verifies the effec-
tiveness of the present algorithm. Next, we enhance the complexity of the system.
We will consider the situation where the autonomous system is unstable, and we will
show that the present iterative θ -ADP is still effective.

Example 2.3.2 This example is chosen from [32, 37]. Consider



(x1k
2
+ x2k
2
+ uk ) cos(x2k )
xk+1 = , (2.3.29)
(x1k + x2k
2 2
+ uk ) sin(x2k )

1
x
1
0.5 x2
States

−0.5

−1
0 2 4 6 8 10 12 14 16 18 20
Time steps
(a)
0.1
u
1
0.05 u2
Controls

−0.05

−0.1
0 2 4 6 8 10 12 14 16 18 20
Time steps
(b)

Fig. 2.15 The optimal trajectories. a Optimal state trajectories. b Optimal control trajectories
86 2 Value Iteration ADP for Discrete …

θ=3 θ=5
6 10

5 8
Value function

Value function
4 6

3 4

2 2
0 5 10 15 0 5 10 15
Iteration steps Iteration steps
(a) (b)

θ=7 θ = 10
12 16
14
10
Value function

Value function

8 10

6
6
4

2 2
0 5 10 15 0 5 10 15
Iteration steps Iteration steps
(c) (d)

Fig. 2.16 The convergence of value functions for Example 2.3.2. a θ = 3. b θ = 5. c θ = 7.


d θ = 10

where xk = [x1k , x2k ]T denotes the system state vector and uk denotes the control.
The value function is the same as the one in Example 2.3.1.
The initial state is x0 = [1, −1]T . From system (2.3.29), we can see that xk = 0
is an equilibrium state and the autonomous system F(xk , 0) is unstable. We again
use NNs to implement the iterative ADP algorithm, with four values of θ (θ =
3, 5, 7, 10) chosen to initialize the algorithm. The convergence curves of the
value functions are shown in Fig. 2.16.
Applying the optimal control law to the system for Tf = 15 time steps, the optimal
state and control trajectories are shown in Fig. 2.17a and b, respectively.
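
One way to see why the zero-input trajectory from x0 = [1, −1]T does not settle: since cos² + sin² = 1, system (2.3.29) gives ‖xk+1‖ = |x1k² + x2k² + uk|, so with uk = 0 the norm is squared at every step. The following Python sketch simulates a few zero-input steps under that reading of (2.3.29). The function names are illustrative and do not come from the text.

```python
import math

def step(x, u):
    """One step of system (2.3.29) for state x = (x1, x2) and scalar control u."""
    x1, x2 = x
    r = x1 ** 2 + x2 ** 2 + u
    return (r * math.cos(x2), r * math.sin(x2))

def zero_input_norms(x0, steps):
    """Euclidean norms of the state along the zero-input trajectory from x0."""
    x = x0
    norms = [math.hypot(*x)]
    for _ in range(steps):
        x = step(x, 0.0)
        norms.append(math.hypot(*x))
    return norms

# Initial state of Example 2.3.2; each norm is the square of the previous one.
norms = zero_input_norms((1.0, -1.0), 4)
print(norms)
```

Starting from ‖x0‖ = √2 > 1, the norm grows without bound when no control is applied, which is why a stabilizing control law is needed for this initial state.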

Fig. 2.17 The optimal trajectories (states x1, x2 and control versus time steps). a Optimal state trajectories. b Optimal control trajectory

2.4 Conclusions

In this chapter, we developed several VI-based ADP methods for optimal control
problems of discrete-time nonlinear systems. First, a GVI-based ADP scheme was
established to obtain optimal control for discrete-time affine nonlinear systems. Then,
the GVI ADP algorithm was used to solve the optimal tracking control problem for
discrete-time nonlinear systems as a generalization. Furthermore, as a case study, the
VI-based ADP approach was developed to derive optimal control for discrete-time
nonlinear systems with unknown dynamics and input constraints. It is emphasized that,
using the ADP approach, affine and nonaffine nonlinear systems can be treated
uniformly. Next, an iterative θ-ADP technique was presented to solve the optimal control
problem of discrete-time nonlinear systems. Convergence and optimality
analysis results were established for the iterative θ-ADP algorithm, and simulation results
were provided to show the effectiveness of the present algorithm.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791

2. Abu-Khalaf M, Lewis FL, Huang J (2008) Neurodynamic programming and zero-sum games
for constrained control systems. IEEE Trans Neural Netw 19(7):1243–1252
3. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2007) Adaptive critic designs for discrete-time zero-
sum games with application to H∞ control. IEEE Trans Syst Man Cybern.-Part B: Cybern
37(1):240–247
4. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2007) Model-free Q-learning designs for linear
discrete-time zero-sum games with application to H-infinity control. Automatica 43(3):473–
481
5. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B:
Cybern 38(4):943–949
6. Apostol TM (1974) Mathematical analysis: A modern approach to advanced calculus. Addison-
Wesley, Boston, MA
7. Athans M, Falb PL (1966) Optimal control: an introduction to the theory and its applications.
McGraw-Hill, New York
8. Beard R, Saridis G, Wen J (1997) Galerkin approximations of the generalized Hamilton–
Jacobi–Bellman equation. Automatica 33(12):2158–2177
9. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton, NJ
10. Berkovitz LD, Medhin NG (2013) Nonlinear optimal control theory. CRC Press, Boca Raton,
FL
11. Bertsekas DP (2005) Dynamic programming and optimal control. Athena Scientific, Belmont,
MA
12. Bitmead RR, Gevers M, Petersen IR (1985) Monotonicity and stabilizability properties of
solutions of the Riccati difference equation: Propositions, lemmas, theorems, fallacious conjectures
and counterexamples. Syst Control Lett 5:309–315
13. Dierks T, Thumati BT, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5):851–860
14. Dreyfus SE, Law AM (1977) The art and theory of dynamic programming. Academic Press,
New York
15. Fu J, He H, Zhou X (2011) Adaptive learning and control for MIMO system based on adaptive
dynamic programming. IEEE Trans Neural Netw 22(7):1133–1148
16. Hagan MT, Menhaj MB (1994) Training feedforward networks with the Marquardt algorithm.
IEEE Trans Neural Netw 5(6):989–993
17. Heydari A, Balakrishnan SN (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
18. Howard RA (1960) Dynamic programming and Markov processes. MIT Press, Cambridge,
MA
19. Huang Y, Liu D (2014) Neural-network-based optimal tracking control scheme for a class
of unknown discrete-time nonlinear systems using iterative ADP algorithm. Neurocomputing
125:46–56
20. Koppel LB (1968) Introduction to control theory with applications to process control. Prentice-
Hall, Englewood Cliffs, NJ
21. Levin AU, Narendra KS (1993) Control of nonlinear dynamical systems using neural networks:
controllability and stabilization. IEEE Trans Neural Netw 4(2):192–206
22. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken, NJ
23. Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
24. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
25. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
26. Liao X, Wang L, Yu P (2007) Stability of dynamical systems. Elsevier, Amsterdam, Netherlands

27. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
28. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
29. Lyshevski SE (1998) Optimal control of nonlinear continuous-time systems: design of bounded
controllers via generalized nonquadratic functionals. In: Proceedings of the American control
conference. pp 205–209
30. Michel AN, Hou L, Liu D (2015) Stability of dynamical systems: On the role of monotonic
and non-monotonic Lyapunov functions. Birkhäuser, Boston, MA
31. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern-Part C: Appl Rev 32(2):140–153
32. Navarro-Lopez EM (2007) Local feedback passivation of nonlinear discrete-time systems
through the speed-gradient algorithm. Automatica 43(7):1302–1306
33. Primbs JA, Nevistic V (2000) Feasibility and stability of constrained finite receding horizon
control. Automatica 36(7):965–971
34. Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8(5):997–
1007
35. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc-Control
Theory Appl 153(5):567–574
36. Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
37. Sira-Ramirez H (1991) Non-linear discrete variable structure systems in quasi-sliding mode.
Int J Control 54(5):1171–1187
38. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge,
MA
39. Vincent TL, Grantham WJ (1997) Nonlinear and optimal control systems. Wiley, New York
40. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
41. Wang D, Liu D (2013) Neuro-optimal control for a class of unknown nonlinear dynamic systems
using SN-DHP technique. Neurocomputing 121:218–225
42. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
43. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36
44. Wei Q, Liu D (2014) A novel iterative θ-adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
45. Wei Q, Liu D, Xu Y (2014) Neuro-optimal tracking control for a class of discrete-time non-
linear systems via generalized value iteration adaptive dynamic programming. Soft Comput
20(2):697–706
46. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. Gen Syst Yearbook 22:25–38
47. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
48. Yang Q, Jagannathan S (2012) Reinforcement learning controller design for affine nonlin-
ear discrete-time systems using online approximators. IEEE Trans Syst Man Cybern-Part B:
Cybern 42(2):377–390
49. Zhang H, Huang J, Lewis FL (2009) An improved method in receding horizon control with
updating of terminal cost function. In: Valavanis KP (ed) Applications of intelligent control to
engineering systems. Springer, New York, pp 365–393
50. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London

51. Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of
discrete-time affine nonlinear systems with control constraints. IEEE Trans Neural Netw
20(9):1490–1503
52. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans
Syst Man Cybern-Part B: Cybern 38(4):937–942
Chapter 3
Finite Approximation Error-Based Value
Iteration ADP

3.1 Introduction

The ADP approach, proposed by Werbos [45–47], has played an important role in
seeking approximate solutions to dynamic programming problems as a way to solve
optimal control problems [1, 8, 17, 19, 21, 25, 27–29, 34, 37, 40, 42, 52]. Among
these control strategies, iterative methods are an effective way for ADP to obtain
the solution of the Bellman equation indirectly and have received considerable attention. There
are two main classes of iterative ADP algorithms: policy iteration and value iteration
[16, 31].
Although iterative ADP algorithms attract more and more attention [14, 20, 24,
30, 43, 44, 48, 53], nearly all of them require the control at each
iteration to be obtained accurately. Such algorithms can
be called “accurate iterative ADP algorithms.” For most real-world control systems,
however, the accurate control laws in the iterative ADP algorithms cannot be
obtained. For example, when an iterative ADP algorithm is implemented,
approximation structures such as neural networks and fuzzy systems are used to
approximate the iterative value functions and the iterative control laws. No matter
what kind of neural network [12] or fuzzy system is used, and
no matter what approximation precision is achieved, there are always approximation
errors between the approximated functions and the expected ones. This shows
that accurate value functions and control laws cannot be obtained in iterative
ADP algorithms for real-world control systems [6, 9]. When the accurate iterative
control laws cannot be obtained, the convergence properties established for the accurate
iterative ADP algorithms may no longer hold. It is therefore necessary to study
the convergence and stability properties of iterative ADP algorithms when the
iterative control cannot be obtained accurately. In this chapter, several new iterative
ADP schemes with finite approximation errors are developed to solve infinite-horizon
optimal control problems, with convergence, stability, and optimality proofs
[22, 35, 37, 39, 43].

© Springer International Publishing AG 2017
D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_3

3.2 Iterative θ-ADP Algorithm with Finite Approximation Errors

Consider the following deterministic discrete-time nonlinear system

$$
x_{k+1} = F(x_k, u_k), \quad k = 0, 1, 2, \ldots, \tag{3.2.1}
$$

where $x_k \in \mathbb{R}^n$ is the $n$-dimensional state vector and $u_k \in \mathbb{R}^m$ is the $m$-dimensional control vector. Let $x_0$ be the initial state and $F(x_k, u_k)$ be the system function.
Let $\underline{u}_k = (u_k, u_{k+1}, \ldots)$ be an arbitrary sequence of controls from $k$ to $\infty$. The cost function for state $x_0$ under the control sequence $\underline{u}_0 = (u_0, u_1, \ldots)$ is defined as

$$
J(x_0, \underline{u}_0) = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{3.2.2}
$$

where $U(x_k, u_k) > 0$, $\forall x_k, u_k \neq 0$, is the positive definite utility function.


In this chapter, we will study optimal control problems for (3.2.1). The goal is to
find an optimal control scheme which stabilizes the system (3.2.1) and simultaneously
minimizes the cost function (3.2.2).
Assumptions 2.2.1–2.2.3 from Chap. 2 will be used in this chapter as well and
they are restated here for convenience.
Assumption 3.2.1 F(0, 0) = 0, and the state feedback control law u(·) satisfies
u(0) = 0, i.e., xk = 0 is an equilibrium state of system (3.2.1) under the control
u k = 0.

Assumption 3.2.2 F(xk , u k ) is Lipschitz continuous on a compact set Ω ⊂ Rn containing the origin.

Assumption 3.2.3 System (3.2.1) is controllable in the sense that there exists a
continuous control law on Ω that asymptotically stabilizes the system.

For optimal control problems, the designed feedback control must not only stabi-
lize the system (3.2.1) on Ω but also guarantee that the cost function (3.2.2) is finite,
i.e., the control must be admissible (cf. Definition 2.2.1 or [4]). Define A (Ω) as the
set of admissible control sequences associated with the controllable set Ω of states.
The optimal cost function is defined as

$$
J^*(x_k) = \inf_{\underline{u}_k} \left\{ J(x_k, \underline{u}_k) : \underline{u}_k \in \mathcal{A}(\Omega) \right\}.
$$

According to Bellman's principle of optimality, $J^*(x_k)$ satisfies the Bellman equation

$$
J^*(x_k) = \min_{u_k} \left\{ U(x_k, u_k) + J^*(F(x_k, u_k)) \right\}. \tag{3.2.3}
$$

Define the optimal control sequence as

$$
\underline{u}_k^* = \arg\inf_{\underline{u}_k} \left\{ J(x_k, \underline{u}_k) : \underline{u}_k \in \mathcal{A}(\Omega) \right\}.
$$

Then, the optimal control vector at time $k$ can be expressed as

$$
u_k^* = u^*(x_k) = \arg\min_{u_k} \left\{ U(x_k, u_k) + J^*(F(x_k, u_k)) \right\}.
$$

Hence, the Bellman equation (3.2.3) can be written as

$$
J^*(x_k) = U(x_k, u_k^*) + J^*(F(x_k, u_k^*)). \tag{3.2.4}
$$

We can see that if we want to obtain the optimal control u ∗k , we must obtain the
optimal cost function J ∗ (xk ). Generally speaking, J ∗ (xk ) is unknown before the
whole control sequence u k is considered. If we adopt the traditional dynamic pro-
gramming method to obtain the optimal cost function at every time step, then we have
to face the “curse of dimensionality.” This implies that the optimal control sequence
is nearly impossible to obtain analytically by the Bellman equation (3.2.4). Hence,
ADP approaches will be investigated. In particular, in this chapter, we study ADP
algorithms under a more realistic situation, where the neural network approximations
are not perfect, in contrast to the error-free setting of the previous chapter.

3.2.1 Properties of the Iterative ADP Algorithm with Finite Approximation Errors

In Chap. 2, a θ -ADP algorithm is developed to obtain the optimal cost function and
optimal control law iteratively. We have shown that in the iterative θ -ADP algorithm
(2.3.7)–(2.3.9), the iterative value function Vi (xk ) converges to the optimal cost
function J ∗ (xk ) and J ∗ (xk ) = inf u k {J (xk , u k ) : u k ∈ A (Ω)}, satisfying the Bellman
equation (3.2.4) for any controllable state xk ∈ Rn . In fact, due to the existence of
approximation errors, the accurate iterative control law cannot be obtained in general.
In this case, the iterative ADP algorithm with no approximation errors exists only
in the ideal situation, and new analysis methods need to be developed [22, 39].
Next, we present the iterative ADP algorithm with finite approximation errors.

A. Derivation of the Iterative ADP Algorithm with Finite Approximation Errors


In the present iterative ADP algorithm, the value function and control law are updated by recurrent iteration, with the iteration index $i$ increasing from 0 to infinity. For all $x_k \in \mathbb{R}^n$, let the initial function $\Psi(x_k)$ be an arbitrary function such that $\Psi(x_k) \in \bar{\Psi}_{x_k}$, where $\bar{\Psi}_{x_k}$ is given in Definition 2.3.1, i.e., $\Psi(x_k) \in \bar{\Psi}_{x_k}$ implies that $\Psi(x_k) > 0$ and there exists a stable control law $\bar{\nu}(x_k)$ such that $\Psi(F(x_k, \bar{\nu}(x_k))) < \Psi(x_k)$. To avoid duplication, we only mention that the θ-ADP algorithm (2.3.7)–(2.3.9) is used here, starting from the initial value function $V_0(x_k) = \theta \Psi(x_k)$, to obtain $V_i(x_k)$ and $v_i(x_k)$ for $i = 0, 1, \ldots$, where $\theta > 0$ is a finite constant that is large enough. These value functions $V_i(x_k)$ and control laws $v_i(x_k)$ are obtained assuming no approximation errors, i.e., all results in (2.3.7)–(2.3.9) are obtained accurately.
Next, we consider the same algorithm under approximation errors. For all $x_k \in \mathbb{R}^n$, let the initial value function be

$$
\hat{V}_0(x_k) = \theta \Psi(x_k). \tag{3.2.5}
$$

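The control law $\hat{v}_0(x_k)$ can be computed as

$$
\begin{aligned}
\hat{v}_0(x_k) &= \arg\min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_0(x_{k+1}) \right\} + \rho_0(x_k) \\
&= \arg\min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_0(F(x_k, u_k)) \right\} + \rho_0(x_k),
\end{aligned} \tag{3.2.6}
$$

where $\hat{V}_0(x_{k+1}) = \theta \Psi(x_{k+1})$ and $\rho_0(x_k)$ is the approximation error of the control law $\hat{v}_0(x_k)$. For $i = 1, 2, \ldots$, the iterative ADP algorithm will iterate between

$$
\begin{aligned}
\hat{V}_i(x_k) &= \min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_{i-1}(x_{k+1}) \right\} + \pi_i(x_k) \\
&= U(x_k, \hat{v}_{i-1}(x_k)) + \hat{V}_{i-1}(F(x_k, \hat{v}_{i-1}(x_k))) + \pi_i(x_k),
\end{aligned} \tag{3.2.7}
$$

and

$$
\begin{aligned}
\hat{v}_i(x_k) &= \arg\min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_i(x_{k+1}) \right\} + \rho_i(x_k) \\
&= \arg\min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_i(F(x_k, u_k)) \right\} + \rho_i(x_k),
\end{aligned} \tag{3.2.8}
$$

where $\pi_i(x_k)$ and $\rho_i(x_k)$ are approximation errors of the iterative value functions and the iterative control laws, respectively.
We are now in a position to present the following theorem.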
We are now in a position to present the following theorem.

Theorem 3.2.1 Let $x_k \in \mathbb{R}^n$ be an arbitrary controllable state and let Assumptions 3.2.1–3.2.3 hold. For $i = 1, 2, \ldots$, define new iterative value functions as

$$
\Gamma_i(x_k) = \min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_{i-1}(x_{k+1}) \right\}, \tag{3.2.9}
$$

where $\hat{V}_i(x_k)$ is defined in (3.2.7) and $u_k$ can be obtained accurately in $\mathbb{R}^m$. Let the initial value function be $\Gamma_0(x_k) = \hat{V}_0(x_k) = \theta \Psi(x_k)$. If for $i = 1, 2, \ldots$, there exists a finite constant $\bar{\varepsilon} \ge 0$ such that

$$
|\hat{V}_i(x_k) - \Gamma_i(x_k)| \le \bar{\varepsilon} \tag{3.2.10}
$$

holds uniformly, then we have

$$
|\hat{V}_i(x_k) - V_i(x_k)| \le i \bar{\varepsilon}. \tag{3.2.11}
$$

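Proof The theorem can be proved by mathematical induction. First, let $i = 1$. We have

$$
\begin{aligned}
\Gamma_1(x_k) &= \min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_0(x_{k+1}) \right\} \\
&= \min_{u_k} \left\{ U(x_k, u_k) + V_0(F(x_k, u_k)) \right\} = V_1(x_k).
\end{aligned}
$$

According to (3.2.10), we have

$$
-\bar{\varepsilon} \le \hat{V}_i(x_k) - \Gamma_i(x_k) \le \bar{\varepsilon} \tag{3.2.12}
$$

for $i = 1, 2, \ldots$. Then, we can get $-\bar{\varepsilon} \le \hat{V}_1(x_k) - V_1(x_k) \le \bar{\varepsilon}$, which proves $|\hat{V}_1(x_k) - V_1(x_k)| \le \bar{\varepsilon}$. Assume that (3.2.11) holds for $i = l - 1$, $l = 2, 3, \ldots$. Then,

$$
-(l-1)\bar{\varepsilon} \le \hat{V}_{l-1}(x_k) - V_{l-1}(x_k) \le (l-1)\bar{\varepsilon}. \tag{3.2.13}
$$

For $i = l$, considering (3.2.13), we can get

$$
\begin{aligned}
\Gamma_l(x_k) &= \min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_{l-1}(x_{k+1}) \right\} \\
&\ge \min_{u_k} \left\{ U(x_k, u_k) + V_{l-1}(x_{k+1}) - (l-1)\bar{\varepsilon} \right\} \\
&= V_l(x_k) - (l-1)\bar{\varepsilon},
\end{aligned}
$$

and

$$
\begin{aligned}
\Gamma_l(x_k) &= \min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_{l-1}(x_{k+1}) \right\} \\
&\le \min_{u_k} \left\{ U(x_k, u_k) + V_{l-1}(x_{k+1}) + (l-1)\bar{\varepsilon} \right\} \\
&= V_l(x_k) + (l-1)\bar{\varepsilon}.
\end{aligned}
$$

Combining these two bounds with (3.2.12) for $i = l$, we have $|\hat{V}_l(x_k) - V_l(x_k)| \le |\hat{V}_l(x_k) - \Gamma_l(x_k)| + |\Gamma_l(x_k) - V_l(x_k)| \le \bar{\varepsilon} + (l-1)\bar{\varepsilon} = l\bar{\varepsilon}$, which is (3.2.11). This completes the proof.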
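
The bound (3.2.11) can be checked numerically on a small problem. The sketch below is only an illustration under stated assumptions: it uses a hypothetical finite-state, finite-control toy problem (not from the text) so that the minimization over the control is exact, and it injects a bounded random error π_i with |π_i| ≤ ε̄ into iteration (3.2.7).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem (an assumption, not from the text):
# integer states -5..5, controls {-1, 0, 1}, F(x, u) = clip(x + u, -5, 5),
# utility U(x, u) = x^2 + u^2, and Psi(x) = x^2.
states = np.arange(-5, 6)
controls = np.array([-1, 0, 1])
nxt = np.clip(states[:, None] + controls[None, :], -5, 5) + 5  # next-state indices
U = states[:, None] ** 2 + controls[None, :] ** 2

def backup(V):
    """Exact minimization min_u {U(x, u) + V(F(x, u))} for every state."""
    return np.min(U + V[nxt], axis=1)

theta, eps_bar, iters = 3.0, 0.05, 30
V = theta * states.astype(float) ** 2   # V_0 = theta * Psi
V_hat = V.copy()                        # V_hat_0 = V_0, as in (3.2.5)
gaps = []
for i in range(1, iters + 1):
    V = backup(V)                       # error-free iteration
    Gamma = backup(V_hat)               # Gamma_i as in (3.2.9)
    pi = rng.uniform(-eps_bar, eps_bar, size=V.shape)
    V_hat = Gamma + pi                  # (3.2.7) with |pi_i| <= eps_bar
    gaps.append(np.max(np.abs(V_hat - V)))

# Check |V_hat_i - V_i| <= i * eps_bar for every iteration i.
bound_ok = all(g <= i * eps_bar + 1e-12 for i, g in enumerate(gaps, start=1))
print(bound_ok)
```

The bound is a worst case. In a run like this the observed gap is typically well below iε̄, but the theorem by itself does not exclude linear growth.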

From Theorem 3.2.1, we can see that if there is a finite constant $\varepsilon$ with $-\bar{\varepsilon} \le \varepsilon \le \bar{\varepsilon}$ such that

$$
\hat{V}_i(x_k) - \Gamma_i(x_k) \le \varepsilon \tag{3.2.14}
$$

holds uniformly, then we can immediately obtain

$$
\hat{V}_i(x_k) - V_i(x_k) \le i\varepsilon, \tag{3.2.15}
$$

where $\varepsilon$ can be defined as the uniform finite approximation error (finite approximation error in brief). For the iterative θ-ADP algorithm (2.3.7)–(2.3.9), it has been proved that the iterative value function converges to the optimum as $i \to \infty$. From (3.2.15), we can see that if we let $T_i = i\varepsilon$ be the least upper bound of the iterative approximation errors, then for all $\varepsilon \neq 0$, $|T_i| \to \infty$ as $i \to \infty$. This means that although the approximation error for each single step is finite and may be small, as the iteration index $i$ increases, it is possible that the approximation errors between $\hat{V}_i(x_k)$ and $V_i(x_k)$ increase to infinity. Hence, the convergence and stability analysis results in Chap. 2 [36, 38] are no longer applicable to the analysis of $\hat{V}_i(x_k)$ and $\hat{v}_i(x_k)$ in (3.2.6)–(3.2.8). To overcome these difficulties, new convergence and stability analysis must be established.

B. Properties of the Iterative ADP Algorithm with Finite Approximation Errors


From the iterative ADP algorithm (3.2.6)–(3.2.8), we can see that for $i = 0, 1, \ldots$, there exists an approximation error between the iterative value functions $\hat{V}_i(x_k)$ and $V_i(x_k)$. Furthermore, the value of the error at each iteration is unknown and nearly impossible to obtain, which makes it very difficult to analyze the properties of the iterative value function $\hat{V}_i(x_k)$ and the iterative control law $\hat{v}_i(x_k)$ directly. So, in this section, a new “error bound” analysis method is developed. The idea of the “error bound” analysis method is that for each iteration index $i = 0, 1, \ldots$, the least upper bound of the iterative value function $\hat{V}_i(x_k)$ is analyzed, which avoids analyzing the value of $\hat{V}_i(x_k)$ directly. Using the “error bound” method, it can be proved that the iterative value functions $\hat{V}_i(x_k)$ converge uniformly to a bounded neighborhood of the optimal cost function. The convergence and stability analysis will also be given in this section.
For convenience of analysis, we transform the expressions of the approximation errors. According to (3.2.14), we have $\hat{V}_i(x_k) \le \Gamma_i(x_k) + \varepsilon$. Then, for $i = 0, 1, \ldots$, there exists a finite constant $\eta > 0$ such that

$$
\hat{V}_i(x_k) \le \eta \Gamma_i(x_k) \tag{3.2.16}
$$

holds uniformly. Hence, we can give the following theorem.

Theorem 3.2.2 Let $x_k \in \mathbb{R}^n$ be an arbitrary controllable state and let Assumptions 3.2.1–3.2.3 hold. For $i = 0, 1, \ldots$, let $\Gamma_i(x_k)$ be expressed as in (3.2.9) and $\hat{V}_i(x_k)$ be expressed as in (3.2.7). Let $\gamma < \infty$ and $1 \le \beta < \infty$ be constants such that

$$
J^*(F(x_k, u_k)) \le \gamma U(x_k, u_k),
$$

and

$$
J^*(x_k) \le V_0(x_k) \le \beta J^*(x_k),
$$

hold uniformly. If there exists $0 < \eta < \infty$ such that (3.2.16) holds uniformly, then

$$
\hat{V}_i(x_k) \le \eta \left( 1 + \sum_{j=1}^{i} \frac{\gamma^j \eta^{j-1} (\eta - 1)}{(\gamma + 1)^j} + \frac{\gamma^i \eta^i (\beta - 1)}{(\gamma + 1)^i} \right) J^*(x_k), \tag{3.2.17}
$$

where we define $\sum_{j}^{i} (\cdot) = 0$ for all $j > i$ and $i, j = 0, 1, \ldots$.

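Proof The theorem can be proved by mathematical induction. First, let $i = 0$. Then, (3.2.17) becomes $\hat{V}_0(x_k) \le \eta \beta J^*(x_k)$. As $\Gamma_0(x_k) = V_0(x_k) \le \beta J^*(x_k)$, we can obtain $\hat{V}_0(x_k) \le \eta \Gamma_0(x_k) \le \eta \beta J^*(x_k)$. So, the conclusion holds for $i = 0$.
Next, let $i = 1$. By (3.2.9), we have

$$
\begin{aligned}
\Gamma_1(x_k) &= \min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_0(F(x_k, u_k)) \right\} \\
&\le \min_{u_k} \left\{ U(x_k, u_k) + \eta \beta J^*(F(x_k, u_k)) \right\} \\
&\le \min_{u_k} \left\{ \left( 1 + \gamma \frac{\eta \beta - 1}{\gamma + 1} \right) U(x_k, u_k) + \left( \eta \beta - \frac{\eta \beta - 1}{\gamma + 1} \right) J^*(F(x_k, u_k)) \right\} \\
&= \left( 1 + \gamma \frac{\eta \beta - 1}{\gamma + 1} \right) \min_{u_k} \left\{ U(x_k, u_k) + J^*(F(x_k, u_k)) \right\} \\
&= \left( 1 + \frac{\gamma (\eta - 1)}{\gamma + 1} + \frac{\gamma \eta (\beta - 1)}{\gamma + 1} \right) J^*(x_k).
\end{aligned}
$$

According to (3.2.16), we can obtain

$$
\hat{V}_1(x_k) \le \eta \left( 1 + \frac{\gamma (\eta - 1)}{\gamma + 1} + \frac{\gamma \eta (\beta - 1)}{\gamma + 1} \right) J^*(x_k),
$$

which shows that (3.2.17) holds for $i = 1$.
Assume that (3.2.17) holds for $i = l - 1$, where $l = 1, 2, \ldots$. Then, for $i = l$, we have

$$
\begin{aligned}
\Gamma_l(x_k) &= \min_{u_k} \left\{ U(x_k, u_k) + \hat{V}_{l-1}(F(x_k, u_k)) \right\} \\
&\le \min_{u_k} \left\{ U(x_k, u_k) + \eta \left( 1 + \sum_{j=1}^{l-1} \frac{\gamma^j \eta^{j-1} (\eta - 1)}{(\gamma + 1)^j} + \frac{\gamma^{l-1} \eta^{l-1} (\beta - 1)}{(\gamma + 1)^{l-1}} \right) J^*(x_{k+1}) \right\} \\
&\le \min_{u_k} \Bigg\{ \left( 1 + \gamma \left( \sum_{j=1}^{l-1} \frac{\gamma^{j-1} \eta^{j-1} (\eta - 1)}{(\gamma + 1)^j} + \frac{\gamma^{l-1} \eta^{l-1} (\eta \beta - 1)}{(\gamma + 1)^{l}} \right) \right) U(x_k, u_k) \\
&\qquad + \Bigg[ \eta \left( 1 + \sum_{j=1}^{l-1} \frac{\gamma^j \eta^{j-1} (\eta - 1)}{(\gamma + 1)^j} + \frac{\gamma^{l-1} \eta^{l-1} (\beta - 1)}{(\gamma + 1)^{l-1}} \right) \\
&\qquad\quad - \left( \sum_{j=1}^{l-1} \frac{\gamma^{j-1} \eta^{j-1} (\eta - 1)}{(\gamma + 1)^j} + \frac{\gamma^{l-1} \eta^{l-1} (\eta \beta - 1)}{(\gamma + 1)^{l}} \right) \Bigg] J^*(F(x_k, u_k)) \Bigg\} \\
&= \left( 1 + \sum_{j=1}^{l} \frac{\gamma^j \eta^{j-1} (\eta - 1)}{(\gamma + 1)^j} + \frac{\gamma^l \eta^l (\beta - 1)}{(\gamma + 1)^l} \right) J^*(x_k).
\end{aligned} \tag{3.2.18}
$$

Then, according to (3.2.16), we can obtain (3.2.17), which proves the conclusion for $i = 0, 1, \ldots$. The proof is complete.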

From (3.2.17), we can see that for arbitrary finite i, η, and β, there exists a bounded
error between the iterative value function V̂i (xk ) and the optimal cost function J ∗ (xk ).
As i → ∞, however, the bound of the approximation errors may increase to infinity.
Thus, in the following theorems, we will prove the convergence properties of the
iterative ADP algorithm (3.2.6)–(3.2.8) using the error bound method.

Theorem 3.2.3 Let $x_k \in \mathbb{R}^n$ be an arbitrary controllable state and let Assumptions 3.2.1–3.2.3 hold. Suppose Theorem 3.2.2 holds for all $x_k \in \mathbb{R}^n$. If for $\gamma < \infty$ and $\eta > 0$, the inequality

$$
0 < \eta < \frac{\gamma + 1}{\gamma} \tag{3.2.19}
$$

holds, then as $i \to \infty$, the value function $\hat{V}_i(x_k)$ in the iterative ADP algorithm (3.2.6)–(3.2.8) is uniformly convergent into a bounded neighborhood of the optimal cost function $J^*(x_k)$, i.e.,

$$
\lim_{i \to \infty} \hat{V}_i(x_k) = \hat{V}_\infty(x_k) \le \frac{\eta}{1 - \gamma (\eta - 1)} J^*(x_k). \tag{3.2.20}
$$

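Proof According to (3.2.18) in Theorem 3.2.2, we can see that for $j = 1, 2, \ldots$,

$$
\frac{\gamma^j \eta^{j-1} (\eta - 1)}{(\gamma + 1)^j}
$$

is a geometric sequence with common ratio $\gamma \eta / (\gamma + 1)$. Then, (3.2.18) can be written as

$$
\Gamma_i(x_k) \le \left( 1 + \frac{\gamma (\eta - 1) \left( 1 - \left( \frac{\gamma \eta}{\gamma + 1} \right)^i \right)}{1 + \gamma - \gamma \eta} + \frac{\gamma^i \eta^i (\beta - 1)}{(\gamma + 1)^i} \right) J^*(x_k). \tag{3.2.21}
$$

As $i \to \infty$, if $0 < \eta < (\gamma + 1)/\gamma$, then $\gamma \eta / (\gamma + 1) < 1$ and (3.2.21) becomes

$$
\lim_{i \to \infty} \Gamma_i(x_k) = \Gamma_\infty(x_k) \le \left( 1 + \frac{\gamma (\eta - 1)}{1 - \gamma (\eta - 1)} \right) J^*(x_k). \tag{3.2.22}
$$

Using (3.2.16) and letting $i \to \infty$, we have

$$
\hat{V}_\infty(x_k) \le \eta \Gamma_\infty(x_k). \tag{3.2.23}
$$

Combining (3.2.22) and (3.2.23), we can obtain (3.2.20). The proof is complete.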
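
To make the role of condition (3.2.19) concrete, the following Python sketch evaluates the coefficient of J∗(xk) on the right-hand side of (3.2.17) and compares it with the limit coefficient in (3.2.20). The numerical values of γ, β, and η are illustrative assumptions, not values from the text.

```python
def bound_coefficient(i, gamma, eta, beta):
    """Coefficient of J*(x_k) on the right-hand side of (3.2.17)."""
    series = sum(gamma ** j * eta ** (j - 1) * (eta - 1.0) / (gamma + 1.0) ** j
                 for j in range(1, i + 1))
    tail = (gamma * eta / (gamma + 1.0)) ** i * (beta - 1.0)
    return eta * (1.0 + series + tail)

def limit_coefficient(gamma, eta):
    """Coefficient in (3.2.20); meaningful only if 0 < eta < (gamma + 1) / gamma."""
    return eta / (1.0 - gamma * (eta - 1.0))

gamma, beta = 4.0, 3.0        # illustrative values; (gamma + 1) / gamma = 1.25
eta_ok, eta_bad = 1.1, 1.3    # eta_ok satisfies (3.2.19), eta_bad violates it

print(bound_coefficient(200, gamma, eta_ok, beta), limit_coefficient(gamma, eta_ok))
print(bound_coefficient(100, gamma, eta_bad, beta),
      bound_coefficient(200, gamma, eta_bad, beta))
```

With η inside the range (3.2.19), the finite-i coefficient settles at η/(1 − γ(η − 1)); outside that range the ratio γη/(γ + 1) exceeds one and the bound keeps growing with i.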

Corollary 3.2.1 Let $x_k \in \mathbb{R}^n$ be an arbitrary controllable state and let Assumptions 3.2.1–3.2.3 hold. Suppose Theorem 3.2.2 holds for all $x_k \in \mathbb{R}^n$. If for $\gamma < \infty$ and $\eta > 0$, the inequality (3.2.19) holds, then the iterative control law $\hat{v}_i(x_k)$ of the iterative ADP algorithm (3.2.6)–(3.2.8) is convergent, i.e.,

$$
\hat{v}_\infty(x_k) = \lim_{i \to \infty} \hat{v}_i(x_k).
$$

For the iterative ADP algorithm with finite approximation errors, we can see that
for different approximation errors, the limits of the iterative value function V̂i (xk )
are different. This property can be proved by the following theorem.

Theorem 3.2.4 Let $x_k \in \mathbb{R}^n$ be an arbitrary controllable state and let Assumptions 3.2.1–3.2.3 hold. Let

$$
\bar{V}_\infty(x_k) = \eta \left( 1 + \frac{\gamma (\eta - 1)}{1 - \gamma (\eta - 1)} \right) J^*(x_k) \tag{3.2.24}
$$

be the least upper bound of the limit of the iterative value function. If Theorem 3.2.3 holds for all $x_k \in \mathbb{R}^n$, then $\bar{V}_\infty(x_k)$ is a monotonically increasing function of the approximation error $\eta$.

Proof From (3.2.24), we can see that the least upper bound of the limit of the iterative value function, $\bar{V}_\infty(x_k)$, is a differentiable function of the approximation error $\eta$. Since the right-hand side of (3.2.24) simplifies to $\eta J^*(x_k) / (1 - \gamma (\eta - 1))$, differentiating both sides of (3.2.24) with respect to $\eta$ gives

$$
\frac{\partial \bar{V}_\infty(x_k)}{\partial \eta} = \frac{1 + \gamma}{(\gamma (\eta - 1) - 1)^2} J^*(x_k) > 0.
$$

The proof is complete.

According to the definitions of iterative value functions $\hat{V}_i(x_k)$ and $\Gamma_i(x_k)$ in (3.2.7) and (3.2.9), if for $i = 0, 1, \ldots$, we let

$$
\hat{V}_i(x_k) - \Gamma_i(x_k) = \varepsilon_i(x_k), \tag{3.2.25}
$$

then we can choose

$$
\varepsilon = \sup\{\varepsilon_0(x_k), \varepsilon_1(x_k), \ldots\}. \tag{3.2.26}
$$

In this section, the approximation error $\eta$, which satisfies (3.2.16), is used to analyze the convergence properties of the iterative ADP algorithm. Generally speaking, the approximation error is obtained as $\varepsilon$ through (3.2.14) rather than as $\eta$. So, if we use $\varepsilon$ to express the approximation error, then a transformation is needed between the two approximation errors $\varepsilon$ and $\eta$.
For $i = 0, 1, \ldots$, according to (3.2.14) and (3.2.16), for any $\varepsilon_i$ expressed in (3.2.25), there exists an $\eta_i$ such that

$$
\hat{V}_i(x_k) - \varepsilon_i(x_k) = \frac{\hat{V}_i(x_k)}{\eta_i(x_k)}.
$$

According to (3.2.26), we can obtain $\eta = \sup\{\eta_0(x_k), \eta_1(x_k), \ldots\}$.
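
The transformation between the two error measures can be sketched in a few lines of Python. Solving the relation above for ηi gives ηi = V̂i/(V̂i − εi) = V̂i/Γi, which is well defined whenever Γi(xk) > 0. The function name and the sample values below are hypothetical and serve only as an illustration.

```python
def eta_from_eps(v_hat, eps):
    """Multiplicative error eta_i = V_hat_i / (V_hat_i - eps_i) = V_hat_i / Gamma_i."""
    gamma_i = v_hat - eps  # Gamma_i(x_k), by (3.2.25)
    if gamma_i <= 0.0:
        raise ValueError("Gamma_i must be positive for the ratio to be defined")
    return v_hat / gamma_i

# Hypothetical samples of (V_hat_i(x_k), eps_i(x_k)) at states away from the origin.
samples = [(2.05, 0.05), (1.48, -0.02), (0.93, 0.03)]
etas = [eta_from_eps(v, e) for v, e in samples]
eta = max(etas)  # finite-sample stand-in for the supremum over i and x_k
print(etas, eta)
```

Because Γi(xk) tends to zero near the origin, a fixed additive error ε can correspond to a large multiplicative error η there, so in practice the supremum would need to be taken over the region of interest.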

3.2.2 Neural Network Implementation

In the case of linear systems, the control law is linear and the cost function is quadratic.
In the nonlinear case, this is not necessarily true, and therefore we use backpropaga-
tion (BP) neural networks to approximate Vi (xk ) and vi (xk ).
For a given nonlinear function ξ(x) ∈ Rm , a BP NN with two layers of weights
can be used to approximate it. Assume that the number of hidden layer neurons is
denoted by L, the weight matrix between the input layer and hidden layer is denoted
by Yξ ∈ R L×n , and the weight matrix between the hidden layer and output layer is
denoted by Wξ ∈ Rm×L . Then, the NN representation of ξ is given by

ξ(x) = Wξ∗ σ (Yξ∗ x) + ε(x),

where σ (·) = tanh(·) is the componentwise activation function, Yξ∗ and Wξ∗ are the
ideal weight parameters of Yξ and Wξ , respectively, and ε(x) is the reconstruction
error. The NN approximation of function ξ is expressed by

ξ̂ (x, Yξ , Wξ ) = Wξ σ (Yξ x).

Remark 3.2.1 In Chap. 2 and several other chapters, weight matrices in the NN
expressions are in transposed forms. It is a matter of convenience, for instance, to
use Wξ or WξT in NN expressions. In this chapter, to avoid cumbersome notations,
we opt to use weight matrices without transposition in NN expressions.

Here, two networks are used: the critic network and the action network.
Both are chosen as three-layer feedforward networks. The
whole structural diagram is shown in Fig. 3.1.

Fig. 3.1 The structural diagram of the algorithm

A. The Critic Network


The critic network is used to approximate the value function $V_i(x_k)$. The output of the critic network is denoted as

$$
\hat{V}_i(x_k) = W_c \sigma(Y_c x_k),
$$

where $W_c \in \mathbb{R}^{1 \times L_c}$, $Y_c \in \mathbb{R}^{L_c \times n}$, and $L_c$ is the number of hidden layer nodes in the critic network. The target function can be written as

$$
V_i(x_k) = U(x_k, v_{i-1}(x_k)) + \hat{V}_{i-1}(x_{k+1}).
$$

Then, we define the error function of the critic network training as

$$
e_{ci}(k) = V_i(x_k) - \hat{V}_i(x_k).
$$

The objective function to be minimized in the critic network is $E_{ci}(k) = (1/2) e_{ci}^2(k)$. So, the gradient-based weight update rule [13, 28] for the critic network is given by

$$
\begin{aligned}
X_c(p+1) &= X_c(p) + \Delta X_c(p), \\
\Delta X_c(p) &= \alpha_c \left[ -\frac{\partial E_{ci}(k)}{\partial X_c(p)} \right], \\
\frac{\partial E_{ci}(k)}{\partial X_c(p)} &= \frac{\partial E_{ci}(k)}{\partial \hat{V}_i(x_k)} \frac{\partial \hat{V}_i(x_k)}{\partial X_c(p)},
\end{aligned}
$$

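where $p$ represents the inner-loop iteration step for updating the critic NN weight parameters, $\alpha_c > 0$ is the learning rate of the critic network, and $X_c$ is the weight matrix of the critic network, which stands for either $W_c$ or $Y_c$.
If we define

$$
\begin{aligned}
q_l(p) &= \sum_{j=1}^{n} Y_{c,lj}(p) x_{jk}, \quad l = 1, 2, \ldots, L_c, \\
r_l(p) &= \tanh(q_l(p)), \quad l = 1, 2, \ldots, L_c,
\end{aligned}
$$

then

$$
\hat{V}_i(x_k) = \sum_{l=1}^{L_c} W_{cl}(p) r_l(p),
$$

where $x_k = (x_{1k}, \ldots, x_{nk})^T \in \mathbb{R}^n$, $Y_c = (Y_{c,lj}) \in \mathbb{R}^{L_c \times n}$, and $W_c = (W_{c1}, \ldots, W_{cL_c}) \in \mathbb{R}^{1 \times L_c}$. By applying the chain rule, and noting that $\partial E_{ci}(k) / \partial \hat{V}_i(x_k) = -e_{ci}(k)$ because $e_{ci}(k) = V_i(x_k) - \hat{V}_i(x_k)$, the adaptation of the critic network is summarized as follows.
The weights of the hidden-to-output layer of the critic network are updated as

$$
\begin{aligned}
W_{cl}(p+1) &= W_{cl}(p) - \alpha_c \frac{\partial E_{ci}(k)}{\partial \hat{V}_i(x_k)} \frac{\partial \hat{V}_i(x_k)}{\partial W_{cl}(p)} \\
&= W_{cl}(p) + \alpha_c e_{ci}(k) r_l(p), \quad l = 1, 2, \ldots, L_c.
\end{aligned}
$$

The weights of the input-to-hidden layer of the critic network are updated as

$$
\begin{aligned}
Y_{c,lj}(p+1) &= Y_{c,lj}(p) - \alpha_c \frac{\partial E_{ci}(k)}{\partial \hat{V}_i(x_k)} \frac{\partial \hat{V}_i(x_k)}{\partial Y_{c,lj}(p)} \\
&= Y_{c,lj}(p) - \alpha_c \frac{\partial E_{ci}(k)}{\partial \hat{V}_i(x_k)} \frac{\partial \hat{V}_i(x_k)}{\partial r_l(p)} \frac{\partial r_l(p)}{\partial q_l(p)} \frac{\partial q_l(p)}{\partial Y_{c,lj}(p)} \\
&= Y_{c,lj}(p) + \alpha_c e_{ci}(k) W_{cl}(p) \left( 1 - r_l^2(p) \right) x_{jk}, \\
&\qquad l = 1, 2, \ldots, L_c, \quad j = 1, 2, \ldots, n.
\end{aligned}
$$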
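
The critic adaptation can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, learning rate, sample state, and scalar target are hypothetical, the error is taken as target minus output, and each step performs gradient descent on E = (1/2)e².

```python
import numpy as np

def critic_forward(W, Y, x):
    """Return V_hat = W . tanh(Y x) together with the hidden activations r."""
    r = np.tanh(Y @ x)
    return float(W @ r), r

def critic_step(W, Y, x, target, alpha):
    """One gradient-descent step on E = 0.5 * e^2 with e = target - V_hat."""
    v_hat, r = critic_forward(W, Y, x)
    e = target - v_hat
    # dE/dW_l = -e * r_l and dE/dY_lj = -e * W_l * (1 - r_l^2) * x_j
    W_new = W + alpha * e * r
    Y_new = Y + alpha * e * np.outer(W * (1.0 - r ** 2), x)
    return W_new, Y_new, e

rng = np.random.default_rng(0)
L_c, n = 8, 2                         # illustrative layer sizes
W = 0.1 * rng.standard_normal(L_c)    # hidden-to-output weights (1 x L_c, flattened)
Y = 0.5 * rng.standard_normal((L_c, n))
x_k = np.array([1.0, -1.0])
target = 2.0                          # hypothetical target value V_i(x_k)

errors = []
for p in range(200):                  # inner-loop iterations indexed by p
    W, Y, e = critic_step(W, Y, x_k, target, alpha=0.05)
    errors.append(abs(e))
print(errors[0], errors[-1])
```

Training on a single state is only meant to show the direction of the update. In the algorithm the same step would be applied over samples of xk drawn from Ω.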

B. The Action Network


In the action network, the state xk is used as input to create the optimal control law
as the output of the network. The output can be formulated as

v̂i (xk ) = Wai σ (Yai xk ).


3.2 Iterative θ-ADP Algorithm with Finite Approximation Errors 103

So, we can define the output error of the action network as

eai (k) = vi (xk ) − v̂i (xk ),

where the target function of the action network training is given by

$$v_i(x_k) = \arg\min_{u_k}\left\{U(x_k,u_k) + \hat V_{i-1}(x_{k+1})\right\}.$$

The weights of the action network are updated to minimize the following performance error measure $E_{ai}(k) = (1/2)e_{ai}^{T}(k)e_{ai}(k)$. The weights updating algorithm is similar to the one for the critic network. By the gradient descent rule, we can obtain

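$$X_a(p+1) = X_a(p) + \Delta X_a(p),$$
$$\Delta X_a(p) = \beta_a\left[-\frac{\partial E_{ai}(k)}{\partial X_a(p)}\right],$$
$$\frac{\partial E_{ai}(k)}{\partial X_a(p)} = \frac{\partial E_{ai}(k)}{\partial e_{ai}(k)}\,\frac{\partial e_{ai}(k)}{\partial \hat v_i(x_k)}\,\frac{\partial \hat v_i(x_k)}{\partial X_a(p)},$$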

where βa > 0 is the learning rate of action network and X a is the weight matrix of
the action network which can be replaced by Wa and Ya .
If we define


$$g_l(p) = \sum_{j=1}^{n} Y_{a,lj}(p)\,x_{jk}, \quad l = 1, \ldots, L_a,$$
$$h_l(p) = \tanh(g_l(p)), \quad l = 1, \ldots, L_a,$$

then

$$\hat v_{it}(x_k) = \sum_{l=1}^{L_a} W_{a,tl}(p)\,h_l(p), \quad t = 1, 2, \ldots, m,$$

where L a is the number of hidden nodes in the action network, v̂i (xk ) = (v̂i1 (xk ), . . . ,
v̂im (xk ))T ∈ Rm , Ya = (Ya,lj ) ∈ R L a ×n , and Wa = (Wa,tl ) ∈ Rm×L a . By applying the
chain rule, the adaptation of the action network is summarized as follows.
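The weights of the hidden-to-output layer of the action network are updated as

$$W_{a,tl}(p+1) = W_{a,tl}(p) - \beta_a \frac{\partial E_{ai}(k)}{\partial \hat v_{it}(x_k)}\,\frac{\partial \hat v_{it}(x_k)}{\partial W_{a,tl}(p)} = W_{a,tl}(p) + \beta_a e_{ai,t}(k)\,h_l(p),$$

where $e_{ai,t}(k)$ denotes the $t$th component of $e_{ai}(k)$ and the sign follows from $\partial E_{ai}(k)/\partial \hat v_{it}(x_k) = -e_{ai,t}(k)$.

The weights of the input-to-hidden layer of the action network are updated as

$$\begin{aligned}
Y_{a,lj}(p+1) &= Y_{a,lj}(p) - \beta_a \sum_{t=1}^{m}\frac{\partial E_{ai}(k)}{\partial \hat v_{it}(x_k)}\,\frac{\partial \hat v_{it}(x_k)}{\partial Y_{a,lj}(p)} \\
&= Y_{a,lj}(p) - \beta_a \sum_{t=1}^{m}\frac{\partial E_{ai}(k)}{\partial \hat v_{it}(x_k)}\,\frac{\partial \hat v_{it}(x_k)}{\partial h_l(p)}\,\frac{\partial h_l(p)}{\partial g_l(p)}\,\frac{\partial g_l(p)}{\partial Y_{a,lj}(p)} \\
&= Y_{a,lj}(p) + \beta_a \sum_{t=1}^{m} e_{ai,t}(k)\,W_{a,tl}(p)\left(1 - h_l^2(p)\right)x_{jk}.
\end{aligned}$$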

The detailed NN training algorithm can also be seen in [22, 28].

3.2.3 Simulation Study

Consider the following discrete-time nonaffine nonlinear system given by [14, 33]

$$x_{k+1} = F(x_k,u_k) = x_k + \sin(0.1x_k^2 + u_k). \tag{3.2.27}$$

The initial state is $x_0 = 1$. Since $F(0,0) = 0$, $x_k = 0$ is an equilibrium state of system (3.2.27). But since

$$\frac{\partial F(x_k,u_k)}{\partial x_k}\bigg|_{(0,0)} = 1,$$

nonlinear system (3.2.27) is unstable at $x_k = 0$. Let the cost function be of the quadratic form

$$J(x_0,u_0) = \sum_{k=0}^{\infty}\left(x_k^{T}Qx_k + u_k^{T}Ru_k\right),$$

where Q and R are given as identity matrices with suitable dimensions.

Fig. 3.2 The curve of approximation errors (axes: state variables, iteration steps, approximation error)


Neural networks are used to implement the iterative ADP algorithm. The critic network and the action network are chosen as three-layer BP neural networks with the structures of 1–8–1 and 1–8–1, respectively. The neural network structural diagram of the algorithm can also be seen in [28, 32, 51]. Use θ = 8 and $\Psi(x_k) = x_k^{T}Qx_k$ to initialize the algorithm.
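The simulation setup can be written down in a few lines of Python. This is only a sketch for the scalar case with Q = R = 1; the helper names `F`, `utility` and `initial_value` are chosen here for illustration, and the central difference merely checks numerically that the sensitivity of F with respect to the state equals 1 at the origin.

```python
import math

def F(x, u):
    """System (3.2.27): x_{k+1} = x_k + sin(0.1 * x_k**2 + u_k)."""
    return x + math.sin(0.1 * x**2 + u)

def utility(x, u, Q=1.0, R=1.0):
    """Quadratic utility x'Qx + u'Ru for scalar state and control."""
    return Q * x * x + R * u * u

def initial_value(x, theta=8.0, Q=1.0):
    """Initial value function theta * Psi(x) with Psi(x) = x'Qx."""
    return theta * Q * x * x

# Central-difference estimate of dF/dx at (x, u) = (0, 0).
h = 1e-6
dF_dx = (F(h, 0.0) - F(-h, 0.0)) / (2 * h)
```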

Fig. 3.3 Trajectories of the iterative value functions (iterative value function versus iteration steps). a Approximation error $\varepsilon = 10^{-6}$. b Approximation error $\varepsilon = 10^{-4}$. c Approximation error $\varepsilon = 10^{-3}$. d Approximation error $\varepsilon = 10^{-1}$

The curve of the approximation errors is displayed in Fig. 3.2.


We choose a random array of state variables in [−1, 1] in order to train the neural networks. For each iteration step, the critic network and the action network are trained for 1000 steps under the learning rate αc = βa = 0.001, so that the given approximation error is reached. The iterative ADP algorithm runs for 35 iteration steps to guarantee the convergence of the iterative value function. To show the effectiveness of the present iterative ADP algorithm, we choose four different global training precisions of neural networks. First, let the approximation error of the neural networks be $\varepsilon = 10^{-6}$. The trajectory of the iterative value function is shown in Fig. 3.3a. Second, let the approximation error of the neural networks be $\varepsilon = 10^{-4}$. The trajectory of the iterative value function is shown in Fig. 3.3b. Third, let the approximation error of the neural networks be $\varepsilon = 10^{-3}$. The trajectory of the iterative value function is shown in Fig. 3.3c. Fourth, let the approximation error of the neural networks be $\varepsilon = 10^{-1}$. The trajectory of the iterative value function is shown in Fig. 3.3d.

For approximation error $\varepsilon = 10^{-6}$, implement the approximate optimal control for system (3.2.27). Let the implementation time $T_f = 20$. The trajectories of the control and state are displayed in Fig. 3.4a. For approximation error $\varepsilon = 10^{-4}$, the trajectories of the control and state are displayed in Fig. 3.4b. When the approximation error $\varepsilon = 10^{-3}$, we can see that the iterative value function is not monotonic. The trajectories of the control and state are displayed in Fig. 3.4c. When the approximation error $\varepsilon = 10^{-1}$, we can see that the iterative value function is not convergent. In this situation, the control system is not stable, and the trajectories of the control and state are displayed in Fig. 3.4d.

Fig. 3.4 Control and state trajectories (x and u versus time steps). a Approximation error $\varepsilon = 10^{-8}$. b Approximation error $\varepsilon = 10^{-4}$. c Approximation error $\varepsilon = 10^{-3}$. d Approximation error $\varepsilon = 10^{-1}$

3.3 Numerical Iterative θ-Adaptive Dynamic Programming

With the development of digital computers, numerical control has attracted more and more attention from researchers [15, 49, 50]. In real-world implementations, especially for numerical control systems, for each i, the accurate iterative value functions Vi (xk ) and the accurate iterative control laws vi (xk ) are generally impossible to obtain. For the situation that the control u k ∈ A, the iterative θ -ADP algorithm in Sect. 2.3 may be invalid, where A denotes the set of numerical controls and we assume 0 ∈ A. First, for numerical control systems, the set of numerical controls A is discrete. This means that there are only finitely many elements in the set of numerical controls A, which implies
that the iterative control law and iterative value function can only be obtained with
errors. Second, as u k ∈ A, the convergence property of the iterative value function
cannot be guaranteed, and the stability of the system under such controls cannot be
proved either. Furthermore, even if we solve the iterative control law and the iterative
value function at every iteration step, it is not clear whether the errors between the
iterative value functions Vi (xk ) and the optimal cost function J ∗ (xk ) are finite or
not for all i. Thus, the optimal cost function and optimal control law are nearly
impossible to obtain by the iterative θ -ADP algorithm in Sect. 2.3.
In this section, a numerical iterative θ -ADP algorithm is developed to obtain the
numerical optimal controller for nonaffine nonlinear system (3.2.1) [36, 37].

3.3.1 Derivation of the Numerical Iterative θ-ADP Algorithm

In the present numerical iterative θ -ADP algorithm, the value functions and control
laws are updated by iterations, with the iteration index i increasing from 0 to infinity.
For xk ∈ Rn , let the initial value function be

V̂0 (xk ) = θ Ψ (xk ), (3.3.1)

where θ > 0 is a large finite positive constant. The numerical iterative control law
v̂0 (xk ) can be computed as follows:
$$\begin{aligned}
\hat v_0(x_k) &= \arg\min_{u_k\in A}\left\{U(x_k,u_k) + \hat V_0(x_{k+1})\right\} \\
&= \arg\min_{u_k\in A}\left\{U(x_k,u_k) + \hat V_0(F(x_k,u_k))\right\},
\end{aligned} \tag{3.3.2}$$

where $\hat V_0(x_{k+1}) = \theta\Psi(x_{k+1})$. For $i = 1, 2, \ldots$, the numerical iterative θ-ADP algorithm will iterate between the iterative value functions

$$\begin{aligned}
\hat V_i(x_k) &= \min_{u_k\in A}\left\{U(x_k,u_k) + \hat V_{i-1}(x_{k+1})\right\} \\
&= U(x_k,\hat v_{i-1}(x_k)) + \hat V_{i-1}(F(x_k,\hat v_{i-1}(x_k))),
\end{aligned} \tag{3.3.3}$$

and the iterative control laws

$$\begin{aligned}
\hat v_i(x_k) &= \arg\min_{u_k\in A}\left\{U(x_k,u_k) + \hat V_i(x_{k+1})\right\} \\
&= \arg\min_{u_k\in A}\left\{U(x_k,u_k) + \hat V_i(F(x_k,u_k))\right\}.
\end{aligned} \tag{3.3.4}$$
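The iteration (3.3.1)–(3.3.4) can be sketched on a grid as follows. This is an illustration under several assumptions that are not part of the text: the scalar system (3.2.27) with Q = R = 1 is used as the plant, the value function is stored on a uniform state grid and evaluated by linear interpolation (clamped at the grid ends), and the function name `numerical_theta_adp`, the grids and the iteration count are hypothetical.

```python
import numpy as np

def F(x, u):
    """Plant used for illustration: system (3.2.27)."""
    return x + np.sin(0.1 * x**2 + u)

def U(x, u):
    """Quadratic utility with Q = R = 1."""
    return x**2 + u**2

def numerical_theta_adp(xs, A, theta=8.0, iters=10):
    """Iterate V_i = min over u in A of {U + V_{i-1}(F)} on the state grid xs."""
    V = theta * xs**2                               # V_0 = theta * Psi with Psi(x) = x^2
    history = [V]
    X, Ug = np.meshgrid(xs, A, indexing="ij")       # all (state, control) pairs
    stage, nxt = U(X, Ug), F(X, Ug)
    for _ in range(iters):
        # Q[s, a] = U(x_s, u_a) + V_{i-1}(F(x_s, u_a)); V is interpolated on xs
        Q = stage + np.interp(nxt, xs, V)
        V = Q.min(axis=1)                           # minimisation over the finite set A
        history.append(V)
    policy = A[Q.argmin(axis=1)]                    # control attaining the last minimisation
    return np.array(history), policy

xs = np.linspace(-2.0, 2.0, 81)                     # state grid (assumed)
A = np.linspace(-2.0, 2.0, 81)                      # finite control set containing 0 (assumed)
history, policy = numerical_theta_adp(xs, A)
```

With this particular choice of grids the stored iterates are nonincreasing in i. A coarser control set can break this property, which is the difficulty that the remainder of the section addresses.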

Remark 3.3.1 The ADP algorithm in (3.3.2)–(3.3.4) is different from (3.2.6)–(3.2.8) in the choice of control set. Here, the control set A is discrete. As the set of numerical controls A is discrete, for all i ≥ 0, v̂i (xk ) ≠ vi (xk ) in general. Then, for all i ≥ 1, the iterative value function V̂i (xk ) ≠ Vi (xk ), which means that there exists an error between V̂i (xk ) and Vi (xk ). It should be pointed out that the iterative approximation error is not a constant. In fact, as the iteration index i → ∞, the bound of the iterative approximation errors will also increase to infinity. The following lemma will show this property.
Lemma 3.3.1 Let xk ∈ Rn be an arbitrary controllable state and Assumptions
3.2.1–3.2.3 hold. For i = 1, 2, . . ., define a new iterative value function as

$$\Gamma_i(x_k) = \min_{u_k}\left\{U(x_k,u_k) + \hat V_{i-1}(x_{k+1})\right\}, \tag{3.3.5}$$

where V̂i (xk ) is defined in (3.3.3). If the initial value function

Γ0 (xk ) = V̂0 (xk ) = θ Ψ (xk )

and for i = 1, 2, . . ., there exists a finite constant ε such that

V̂i (xk ) − Γi (xk ) ≤ ε (3.3.6)

holds uniformly, then


V̂i (xk ) − Vi (xk ) ≤ iε, (3.3.7)

where ε is called uniform approximation error.



Proof The lemma can be proved by mathematical induction and by following similar
steps as in the proof of Theorem 3.2.1. First, let i = 1. We have

$$\begin{aligned}
\Gamma_1(x_k) &= \min_{u_k}\left\{U(x_k,u_k) + \hat V_0(x_{k+1})\right\} \\
&= \min_{u_k}\left\{U(x_k,u_k) + V_0(F(x_k,u_k))\right\} \\
&= V_1(x_k).
\end{aligned}$$

Then, according to (3.3.6), we can get V̂1 (xk ) − V1 (xk ) ≤ ε. Assume that (3.3.7)
holds for i − 1. Then, for i, we have

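$$\begin{aligned}
\Gamma_i(x_k) &= \min_{u_k}\left\{U(x_k,u_k) + \hat V_{i-1}(x_{k+1})\right\} \\
&\le \min_{u_k}\left\{U(x_k,u_k) + V_{i-1}(x_{k+1}) + (i-1)\varepsilon\right\} \\
&= V_i(x_k) + (i-1)\varepsilon.
\end{aligned}$$

Then, according to (3.3.6), we have $\hat V_i(x_k) \le \Gamma_i(x_k) + \varepsilon \le V_i(x_k) + i\varepsilon$, which is (3.3.7). The proof is complete.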

Lemma 3.3.1 shows that although the approximation error for each single step is finite and may be small, as the iteration index i → ∞, the bounds of the approximation errors between V̂i (xk ) and Vi (xk ) may increase to infinity. To overcome this difficulty, we must discuss the convergence and stability properties of the iterative ADP algorithm in numerical implementation with finite approximation errors. For convenience of analysis, we perform a transformation of the approximation errors. According to the definitions of V̂i (xk ) and Γi (xk ) in (3.3.3) and (3.3.5), we have Γi (xk ) ≤ V̂i (xk ). Then, for i = 0, 1, . . ., there exists an η ≥ 1 such that

Γi (xk ) ≤ V̂i (xk ) ≤ ηΓi (xk ) (3.3.8)

holds uniformly. Hence, we can give the following theorem.

Theorem 3.3.1 Let xk ∈ Rn be an arbitrary controllable state and Assumptions


3.2.1–3.2.3 hold. For i = 0, 1, . . ., let Γi (xk ) be expressed as (3.3.5) and V̂i (xk ) be
expressed as (3.3.3). Let γ < ∞ and 1 ≤ β < ∞ be constants such that

J ∗ (F(xk , u k )) ≤ γ U (xk , u k ), (3.3.9)

and
J ∗ (xk ) ≤ V0 (xk ) ≤ β J ∗ (xk ) (3.3.10)

hold uniformly. If there exists an η, 1 ≤ η < ∞, such that (3.3.8) is satisfied and

$$\eta \le 1 + \frac{\beta - 1}{\gamma\beta}, \tag{3.3.11}$$

then the iterative value function V̂i (xk ) converges to a bounded neighborhood of J ∗ (xk ), as i → ∞.

This theorem can be proved following the same procedure as Theorem 3.2.3 by
noting that condition (3.3.11) implies (3.2.19) of Theorem 3.2.3. We omit the details.

Theorem 3.3.2 Let xk ∈ Rn be an arbitrary controllable state. If ∀xk , Theorem 3.3.1


holds and η satisfies (3.3.11), then for i = 0, 1, . . ., the numerical iterative control
law v̂i (xk ) is an asymptotically stable control law for system (3.2.1).

Proof As V̂0 (xk ) = V0 (xk ) = θ Ψ (xk ), V̂0 (xk ) is a positive definite function for i = 0.
Using the mathematical induction, assume that the iterative value function V̂i (xk ),
i = 0, 1, . . ., is positive definite. Then, according to Assumptions 3.2.1–3.2.3, we
can get
V̂i (0) = U (0, v̂i−1 (0)) + V̂i−1 (F(0, v̂i−1 (0))) = 0

for $x_k = 0$. When $\|x_k\| \to \infty$, as the utility function $U(x_k,u_k)$ is a positive function of $x_k$, we have $\hat V_i(x_k) \to \infty$. Hence, $\hat V_i(x_k)$ is a positive definite function and the mathematical induction is complete. Next, let $\chi_i$ be defined as

$$\chi_i = \eta\left(1 + \sum_{j=1}^{i}\frac{\gamma^{j}\eta^{j-1}(\eta-1)}{(\gamma+1)^{j}} + \frac{\gamma^{i}\eta^{i}(\beta-1)}{(\gamma+1)^{i}}\right),$$

and $\bar V_i(x_k) = \chi_i J^*(x_k)$.


As 1 ≤ η ≤ 1 + (β − 1)/(γβ), we have χi+1 ≤ χi , which means V̄i+1 (xk ) ≤
V̄i (xk ). Define a new iterative value function Pi+1 (xk ) as

Pi+1 (xk ) = U (xk , v̄i (xk )) + V̄i (xk+1 ),

where v̄i (xk ) = arg minu k {U (xk , u k ) + V̄i (xk+1 )}. From Theorem 3.2.2, V̄i (xk ) =
χi J ∗ (xk ) is the upper bound of any iterative value function. Thus, we have Pi+1 (xk ) ≤
V̄i+1 (xk ) ≤ V̄i (xk ), which implies

V̄i (xk+1 ) − V̄i (xk ) ≤ −U (xk , v̄i (xk )) < 0.

Hence, V̄i (xk ) is a Lyapunov function and v̄i (xk ) is an asymptotically stable control
law for i = 0, 1, . . .. As v̄i (xk ) is an asymptotically stable control law, we have
xk+N → 0 as N → ∞, that is V̄i (xk+N ) → 0. Since 0 < V̂i (xk ) ≤ V̄i (xk ) holds for
all xk , then we can get 0 < V̂i (xk+N ) ≤ V̄i (xk+N ) as N → ∞. So, V̂i (xk+N ) → 0 as
N → ∞. As V̂i (xk ) is a positive definite function, we can conclude xk+N → 0 as
N → ∞ under the control law v̂i (xk ). The proof is complete.
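The role of the bound η ≤ 1 + (β − 1)/(γβ) in the proof can be checked numerically. The sketch below evaluates the sequence χ_i = η[1 + Σ_{j=1}^{i} γ^j η^{j−1}(η − 1)/(γ + 1)^j + γ^i η^i (β − 1)/(γ + 1)^i] for hypothetical values of γ and β, once with η below the bound and once above it. The function names and the numbers are illustrative assumptions.

```python
def eta_upper_bound(gamma, beta):
    """Right-hand side of the condition eta <= 1 + (beta - 1) / (gamma * beta)."""
    return 1.0 + (beta - 1.0) / (gamma * beta)

def chi(i, eta, gamma, beta):
    """Coefficient chi_i such that V_bar_i(x) = chi_i * J*(x)."""
    ratio = gamma / (gamma + 1.0)
    series = sum(ratio**j * eta**(j - 1) * (eta - 1.0) for j in range(1, i + 1))
    return eta * (1.0 + series + ratio**i * eta**i * (beta - 1.0))

gamma, beta = 2.0, 5.0                      # hypothetical constants
bound = eta_upper_bound(gamma, beta)        # 1 + 4 / 10
values_ok = [chi(i, 1.2, gamma, beta) for i in range(6)]    # eta below the bound
values_bad = [chi(i, 1.6, gamma, beta) for i in range(6)]   # eta above the bound
```

For η below the bound the coefficients decrease with i, while for η above it they increase, in line with the sign of γβ(η − 1) − (β − 1) that governs χ_{i+1} − χ_i.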

3.3.2 Properties of the Numerical Iterative θ-ADP Algorithm

Although Theorem 3.3.1 gives the convergence criterion, we can see that the parameters η, γ , and β are very difficult to obtain, and the convergence criterion (3.3.11) is quite difficult to verify. To overcome this difficulty, a new convergence condition must be developed to guarantee the convergence of the numerical iterative θ -ADP algorithm. For the convenience of analysis, we define a new value function as

$$\begin{aligned}
\hat{\mathcal V}_0(x_k,0) &= \hat V_0(x_k), \\
\hat{\mathcal V}_i(x_k,\hat v_i(x_k)) &= U(x_k,\hat v_i(x_k)) + \hat V_{i-1}(F(x_k,\hat v_i(x_k))), \quad i = 1, 2, \ldots,
\end{aligned} \tag{3.3.12}$$

where we can see that V̂i (xk ) = Vˆi (xk , v̂i (xk )). We have the following definition.

Definition 3.3.1 The iterative value function $\hat{\mathcal V}_i(x_k,\hat v_i(x_k))$ is a Lipschitz continuous function for all $\hat v_i(x_k)$, if there exists an $L \in \mathbb{R}$ such that

$$\left|\hat{\mathcal V}_i(x_k,\hat v_i(x_k)) - \hat{\mathcal V}_i(x_k,\hat v_i'(x_k))\right| \le L\left\|\hat v_i(x_k) - \hat v_i'(x_k)\right\|, \tag{3.3.13}$$

where $\hat v_i'(x_k) \in A$ and $\hat v_i'(x_k) \ne \hat v_i(x_k)$.

As Vˆi (xk , v̂i (xk )) is a function to be approximated by neural networks, the


Lipschitz assumption is reasonable. For the numerical control system, the set of
numerical controls A is discrete. Then, according to the grid principle, let P j ,
j = 1, 2, . . . , m, be the discrete grids for the jth dimension in A. Let Z be the
set of positive integers. As A ⊂ Rm , using the grid principle, we can define A as

$$A = \left\{u(p_1,\ldots,p_m) : p_1,\ldots,p_m \in Z,\ 1 \le p_1 \le P_1, \ldots, 1 \le p_m \le P_m\right\} \tag{3.3.14}$$

to denote all the control elements in A, where $P_1,\ldots,P_m$ are all positive integers. Thus, for all $x_k \in \mathbb{R}^n$ and $i = 0, 1, \ldots$, there exists a sequence of positive numbers $p_1^i,\ldots,p_m^i$ such that

$$u(p_1^i,\ldots,p_m^i) = \hat v_i(x_k), \tag{3.3.15}$$

where $1 \le p_1^i \le P_1, \ldots, 1 \le p_m^i \le P_m$. Then, according to (3.3.12), we can rewrite $\hat{\mathcal V}_i(x_k,\hat v_i(x_k))$ as

$$\begin{aligned}
\hat{\mathcal V}_i(x_k,\hat v_i(x_k)) &= \min_{u_k\in A}\left\{U(x_k,u_k) + \hat V_{i-1}(x_{k+1})\right\} \\
&= U(x_k,u(p_1^i,\ldots,p_m^i)) + \hat V_{i-1}(F(x_k,u(p_1^i,\ldots,p_m^i))) \\
&= \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)).
\end{aligned} \tag{3.3.16}$$

Next, $\forall u(p_1^i,\ldots,p_m^i) \in A$, $1 \le p_j^i \le P_j$, $j = 1,\ldots,m$, we can define a neighborhood set of $u(p_1^i,\ldots,p_m^i)$ as

$$\bar A(p_1^i,\ldots,p_m^i) = \left\{u(\bar p_1^i,\ldots,\bar p_m^i) : (\bar p_1^i,\ldots,\bar p_m^i) \in A,\ |p_j^i - \bar p_j^i| \le r\right\}, \tag{3.3.17}$$

where r ∈ Z is defined as the radius of Ā( p1i , . . . , pmi ).
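The grid description of A and the neighborhood set of radius r can be made concrete with a short sketch. The functions `neighborhood` and `control`, and the uniform spacing of the grid, are assumptions introduced only for illustration; the text does not prescribe how the index tuple (p_1, ..., p_m) is mapped to a control value.

```python
from itertools import product

def neighborhood(p, P, r):
    """Index tuples p_bar with |p_j - p_bar_j| <= r and 1 <= p_bar_j <= P_j."""
    ranges = [range(max(1, pj - r), min(Pj, pj + r) + 1) for pj, Pj in zip(p, P)]
    return list(product(*ranges))

def control(p, lo, step):
    """Hypothetical uniform grid: u_j = lo_j + (p_j - 1) * step_j."""
    return tuple(l + (pj - 1) * s for pj, l, s in zip(p, lo, step))

inner = neighborhood((3, 3), (5, 5), 1)     # interior point of a 5 x 5 grid
corner = neighborhood((1, 1), (5, 5), 1)    # corner point, clipped by the grid bounds
```

An interior point has (2r + 1)^m neighbors including itself, and points near the boundary of the grid have fewer.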


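Define a new value function $\Upsilon_i(x_k,\tilde v_i(x_k))$ as

$$\begin{aligned}
\Upsilon_i(x_k,\tilde v_{i-1}(x_k)) &= \min_{u_k}\left\{U(x_k,u_k) + \hat V_{i-1}(x_{k+1})\right\} \\
&= U(x_k,\tilde v_{i-1}(x_k)) + \hat V_{i-1}(F(x_k,\tilde v_{i-1}(x_k))).
\end{aligned} \tag{3.3.18}$$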

We can see that Γi (xk ) = Υi (xk , ṽi (xk )). As ṽi (xk ) cannot be obtained, it is very
difficult to analyze its properties. While ∀u( p1i , . . . , pmi ), we can get Ā( p1i , . . . , pmi )
by (3.3.17) immediately. So, if ṽi (xk ) is inside Ā( p1i , . . . , pmi ) with the radius r ≥ 1,
then the properties of ṽi (xk ) can be obtained. We will analyze the relationship between
ṽi (xk ) and Ā( p1i , . . . , pmi ) next. Before that, some lemmas are necessary.

Lemma 3.3.2 Let $O = (p_1^i,\ldots,p_m^i)$ denote the origin of the m-dimensional coordinate system, and let $\overrightarrow{OL}$ be an arbitrary vector in the m-dimensional space. If we let $\vartheta_j$, $j = 1, 2, \ldots, m$, be the intersection angle between $\overrightarrow{OL}$ and the jth coordinate axis, then we have

$$\sum_{j=1}^{m}\cos^2\vartheta_j = 1.$$

Proof Let $L = (l_1^i,\ldots,l_m^i)$ be an arbitrary point in the m-dimensional coordinate system. Then, $\overrightarrow{OL} = ((l_1^i - p_1^i),\ldots,(l_m^i - p_m^i))$. According to the definition of $\vartheta_j$, we have

$$\cos\vartheta_j = \frac{\left|l_j^i - p_j^i\right|}{\|\overrightarrow{OL}\|},$$

where

$$\|\overrightarrow{OL}\| = \sqrt{(l_1^i - p_1^i)^2 + \cdots + (l_m^i - p_m^i)^2}.$$

Then,

$$\sum_{j=1}^{m}\cos^2\vartheta_j = \frac{\left|l_1^i - p_1^i\right|^2 + \cdots + \left|l_m^i - p_m^i\right|^2}{(l_1^i - p_1^i)^2 + \cdots + (l_m^i - p_m^i)^2} = 1.$$

The proof is complete.

Lemma 3.3.3 Let $O = (p_1^i,\ldots,p_m^i)$ denote the origin of the m-dimensional coordinate system and let $\overrightarrow{OL}$ be an arbitrary vector in the m-dimensional space. Let $A_j$, $j = 1, 2, \ldots, m$, be points on the jth coordinate axis of the m-dimensional space such that, $\forall j = 1, 2, \ldots, m$, $\|OA_j\| = \|OL\|$. If we let $\vartheta_j = \min\{\vartheta_1,\ldots,\vartheta_m\}$, then we have

$$\|A_jL\| \le \frac{\sqrt{(m-1)/m}}{\cos\left(\frac{1}{2}\arcsin\sqrt{(m-1)/m}\right)}\,\|OL\|. \tag{3.3.19}$$

Proof Letting $\vartheta_1 = \cdots = \vartheta_m = \arccos(1/\sqrt{m})$, then $\vartheta_j$ reaches the maximum. Let $\alpha_j = \angle OA_jL$. We can get $\alpha_j = \frac{1}{2}(\pi - \vartheta_j)$. According to the sine rule, we have

$$\|A_jL\| \le \frac{\sin\vartheta_j}{\sin\alpha_j}\,\|OL\|. \tag{3.3.20}$$

Since $\cos\vartheta_j = 1/\sqrt{m}$, we have

$$\sin\vartheta_j = \sqrt{(m-1)/m}, \qquad \sin\alpha_j = \cos\left(\frac{1}{2}\arcsin\sqrt{(m-1)/m}\right).$$

Taking $\sin\vartheta_j$ and $\sin\alpha_j$ into (3.3.20), we can obtain the conclusion. The proof is complete.

Lemma 3.3.4 Let $O = (p_1^i,\ldots,p_m^i)$ denote the origin of the m-dimensional coordinate system and let $A^{\ell} = (p_1^{i\ell},\ldots,p_m^{i\ell})$ be an arbitrary point in $\bar A(p_1^i,\ldots,p_m^i)$, where $1 \le \ell \le (2r+1)^m$. Let $L = (\bar p_1^i,\ldots,\bar p_m^i)$ be an arbitrary point such that $\|OL\| \ge \max_{1\le\ell\le(2r+1)^m}\|OA^{\ell}\|$. If the radius r of $\bar A(p_1^i,\ldots,p_m^i)$ satisfies the following inequality

$$r \ge \frac{3(m-1) + \sqrt{3(m-1)}}{3m}, \tag{3.3.21}$$

then there exists an $\ell$ such that

$$\|A^{\ell}L\| \le \|OL\|. \tag{3.3.22}$$

Proof Without loss of generality, let $L = (\bar p_1^i,\ldots,\bar p_m^i)$ be located in the first quadrant. According to Lemmas 3.3.2 and 3.3.3, we can see that if the intersection angles satisfy $\vartheta_1 = \cdots = \vartheta_m = \arccos(1/\sqrt{m})$, the value on the left-hand side of (3.3.19) reaches its maximum for each coordinate axis. In this situation, we can let $L = (p_1^i + r,\ldots,p_m^i + r)$. Let $A' = (0,\ldots,0,p_m^i + r)$ and $A^{\ell} = (p_1^i + r - 1,\ldots,p_{m-1}^i + r - 1,p_m^i + r)$. Then, we can see that the points $A'$, $A^{\ell}$, and L are on the same line. Let $\vartheta = \angle BOL$. Then, we have

$$\cos\vartheta = \frac{\|OA^{\ell}\|^2 + \|OL\|^2 - \|LA^{\ell}\|^2}{2\|OA^{\ell}\|\,\|OL\|}. \tag{3.3.23}$$

Taking the coordinates of $A^{\ell}$ and L into (3.3.23), we can get

$$\cos\vartheta = \frac{(r-1)m + 1}{\sqrt{m}\sqrt{(r-1)^2m - (r-1)^2 + r^2}}. \tag{3.3.24}$$

Next, extend the line $OA^{\ell}$ to B so that it satisfies $\|OB\| = \|OL\|$. Then, $\triangle OBL$ is an isosceles triangle. It is obvious that when $\vartheta \le \pi/3$, that is $\cos\vartheta \ge \frac{1}{2}$, we can get $\|BL\| \le \|OL\|$. According to (3.3.24), we can obtain (3.3.21). On the other hand, according to the sine rule [2], we can get

$$\frac{\sin\vartheta}{\|BL\|} = \frac{\sin\angle OBL}{\|OL\|} \quad\text{and}\quad \frac{\sin\vartheta}{\|A^{\ell}L\|} = \frac{\sin\angle OA^{\ell}L}{\|OL\|}.$$

If $\angle OBL \le \angle OA^{\ell}L \le \frac{\pi}{2}$, we can easily obtain $\|A^{\ell}L\| \le \|BL\| \le \|OL\|$. If $\frac{\pi}{2} \le \angle OA^{\ell}L \le \pi$, then $\angle OA^{\ell}L$ is the maximum angle in $\triangle OA^{\ell}L$, which yields $\|A^{\ell}L\| \le \|OL\|$ directly. The proof is complete.
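As a quick numerical reading of the radius condition r ≥ [3(m − 1) + √(3(m − 1))]/(3m), the following sketch evaluates the bound for a given control dimension m and rounds it up to an integer radius of at least 1. The helper names are hypothetical, and the formula is the reconstruction obtained by setting cos ϑ ≥ 1/2 in (3.3.24).

```python
import math

def radius_bound(m):
    """Lower bound on the radius r for control dimension m."""
    return (3 * (m - 1) + math.sqrt(3 * (m - 1))) / (3 * m)

def min_integer_radius(m):
    """Smallest integer radius r >= 1 satisfying the bound."""
    return max(1, math.ceil(radius_bound(m)))

radii = {m: min_integer_radius(m) for m in (1, 2, 4, 10)}
```

For low-dimensional controls a radius of 1 suffices, whereas for m = 10 the bound slightly exceeds 1 and a radius of 2 is required.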

Given the above preparation, we derive the following theorem.


Theorem 3.3.3 Let $\hat v_i(x_k) = u(p_1^i,p_2^i,\ldots,p_m^i) \in A$ and let $u(p_1^{i\ell},p_2^{i\ell},\ldots,p_m^{i\ell}) \in \bar A(p_1^i,p_2^i,\ldots,p_m^i)$, $1 \le \ell \le (2r+1)^m$, be an arbitrary control vector. If the radius r of $\bar A(p_1^i,p_2^i,\ldots,p_m^i)$ satisfies (3.3.21), then there exists a positive number $L \in \mathbb{R}$ such that

$$\left|\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \Upsilon_i(x_k,\tilde v_i(x_k))\right| \le L\max_{1\le\ell\le(2r+1)^m}\left\|u(p_1^i,\ldots,p_m^i) - u(p_1^{i\ell},\ldots,p_m^{i\ell})\right\|.$$

Proof According to the definitions of the iterative value functions $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ and $\Upsilon_i(x_k,\tilde v_i(x_k))$ in (3.3.16) and (3.3.18), respectively, we can see that if we put the control $\tilde v_i(x_k)$ into the set of numerical controls A, then

$$\hat{\mathcal V}_i(x_k,\tilde v_i(x_k)) = \Upsilon_i(x_k,\tilde v_i(x_k)).$$

For the control $u(p_1^i,\ldots,p_m^i)$, according to Definition 3.3.1, there exists an L such that

$$\left|\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \Upsilon_i(x_k,\tilde v_i(x_k))\right| \le L\left\|u(p_1^i,\ldots,p_m^i) - \tilde v_i(x_k)\right\|.$$

Since $\tilde v_i(x_k)$ cannot be obtained accurately, the distance between $\tilde v_i(x_k)$ and $u(p_1^i,\ldots,p_m^i)$ is unknown. Hence, $\tilde v_i(x_k)$ must be replaced by another known vector. Next, we will show

$$\left\|u(p_1^i,\ldots,p_m^i) - \tilde v_i(x_k)\right\| \le \max_{1\le\ell\le(2r+1)^m}\left\|u(p_1^i,\ldots,p_m^i) - u(p_1^{i\ell},\ldots,p_m^{i\ell})\right\|. \tag{3.3.25}$$

As $\hat v_i(x_k) = u(p_1^i,p_2^i,\ldots,p_m^i) \in A$, $\tilde v_i(x_k)$ becomes the neighboring point of $u(p_1^i,\ldots,p_m^i)$ and we can put $\tilde v_i(x_k)$ into $\bar A(p_1^i,\ldots,p_m^i)$ such that $\tilde v_i(x_k) \in \bar A(p_1^i,\ldots,p_m^i)$. Next, we will prove the conclusion by contradiction. Assume that the inequality (3.3.25) does not hold. Then,

$$\left\|u(p_1^i,\ldots,p_m^i) - \tilde v_i(x_k)\right\| > \max_{1\le\ell\le(2r+1)^m}\left\|u(p_1^i,\ldots,p_m^i) - u(p_1^{i\ell},\ldots,p_m^{i\ell})\right\|$$

as $\tilde v_i(x_k)$ belongs to the set $\bar A(p_1^i,\ldots,p_m^i)$. As there are m dimensions in $\bar A(p_1^i,\ldots,p_m^i)$, we can divide it into $2^m$ quadrants. Without loss of generality, let $\tilde v_i(x_k)$ be located in the first quadrant where $L = (\bar p_1^i,\ldots,\bar p_m^i)$ is the corresponding coordinate. If we let $O = (p_1^i,\ldots,p_m^i)$ be the origin, then

$$\|OL\| = \left\|u(p_1^i,\ldots,p_m^i) - \tilde v_i(x_k)\right\|.$$

As (3.3.21) holds, according to Lemma 3.3.4, if $\overrightarrow{OL}$ is the max vector in $\bar A(p_1^i,\ldots,p_m^i)$, then there exists a vector $\overrightarrow{OA^{\ell}} \in \bar A(p_1^i,\ldots,p_m^i)$ with the coordinate $A^{\ell} = (p_1^{i\ell},\ldots,p_m^{i\ell})$, such that (3.3.22) is true. Let $L_1$ be the Lipschitz constant. Then, we can get

$$\begin{aligned}
\left|\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \Upsilon_i(x_k,\tilde v_i(x_k))\right| &= L_1\left\|u(p_1^i,\ldots,p_m^i) - \tilde v_i(x_k)\right\| \\
&\ge L_1\left\|u(p_1^{i\ell},\ldots,p_m^{i\ell}) - \tilde v_i(x_k)\right\| \\
&= \left|\hat{\mathcal V}_i(x_k,u(p_1^{i\ell},\ldots,p_m^{i\ell})) - \Upsilon_i(x_k,\tilde v_i(x_k))\right|.
\end{aligned} \tag{3.3.26}$$

According to the definition of $\Upsilon_i(x_k,\tilde v_i(x_k))$ in (3.3.18), we know $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) \ge \Upsilon_i(x_k,\tilde v_i(x_k))$ and $\hat{\mathcal V}_i(x_k,u(p_1^{i\ell},\ldots,p_m^{i\ell})) \ge \Upsilon_i(x_k,\tilde v_i(x_k))$, for $i = 0, 1, \ldots$. Thus, according to (3.3.26), we can obtain

$$\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) > \hat{\mathcal V}_i(x_k,u(p_1^{i\ell},\ldots,p_m^{i\ell})).$$

This contradicts the definition of $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ in (3.3.16). Therefore, the assumption is false and the inequality (3.3.25) holds. The proof is complete.

According to the definitions of the iterative value functions $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ and $\Upsilon_i(x_k,\tilde v_i(x_k))$ in (3.3.16) and (3.3.18), respectively, for $i = 0, 1, \ldots$, we can define

$$\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \Upsilon_i(x_k,\tilde v_i(x_k)) = \varepsilon_i(x_k), \tag{3.3.27}$$

where $\varepsilon_0(x_k) = 0$. Then, for any $\varepsilon_i(x_k)$ expressed in (3.3.27), there exists an $\eta_i(x_k)$ such that

$$\Upsilon_i(x_k,\tilde v_i(x_k)) = \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \varepsilon_i(x_k) = \frac{\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))}{\eta_i(x_k)}.$$

Theorem 3.3.4 Let the iterative value function $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ and the numerical iterative control $u(p_1^i,\ldots,p_m^i)$ be obtained by (3.3.15) and (3.3.16), respectively. If for $x_k \in \mathbb{R}^n$, we define the admissible approximation error as

$$\bar\varepsilon_i(x_k) = L_i(p_1^i,\ldots,p_m^i)\max_{1\le\ell\le(2r+1)^m}\left\|u(p_1^i,\ldots,p_m^i) - u(p_1^{i\ell},\ldots,p_m^{i\ell})\right\|, \tag{3.3.28}$$

where $L_i(p_1^i,\ldots,p_m^i)$ are Lipschitz constants and $\bar\varepsilon_0(x_k) = 0$, then for $i = 0, 1, \ldots$, we have

$$\varepsilon_i(x_k) \le \bar\varepsilon_i(x_k).$$

Proof As $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ are Lipschitz continuous, according to (3.3.13), we have

$$\left|\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \Upsilon_i(x_k,\tilde v_i(x_k))\right| \le L_i(p_1^i,\ldots,p_m^i)\left\|u(p_1^i,\ldots,p_m^i) - \tilde v_i(x_k)\right\|,$$

where $L_i(p_1^i,\ldots,p_m^i)$ are the Lipschitz constants. According to (3.3.25), we can draw the conclusion. The proof is complete.

For the set of numerical controls A expressed in (3.3.14), as A is known, then for any control $u(p_1^i,\ldots,p_m^i) \in A$, we can obtain $\bar A(p_1^i,\ldots,p_m^i)$ immediately. Hence, if the Lipschitz constants $L_i(p_1^i,\ldots,p_m^i)$ are known, then the error $\bar\varepsilon_i$ can be obtained by (3.3.28). Next, we will give a method to obtain $L_i(p_1^i,\ldots,p_m^i)$. Let $u(p_1^{i\ell},\ldots,p_m^{i\ell}) \in \bar A(p_1^i,\ldots,p_m^i)$ be an arbitrary control vector. Then, we can get

$$\left|\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \hat{\mathcal V}_i(x_k,u(p_1^{i\ell},\ldots,p_m^{i\ell}))\right| = \bar L_i^{\ell}(p_1^i,\ldots,p_m^i)\left\|u(p_1^i,\ldots,p_m^i) - u(p_1^{i\ell},\ldots,p_m^{i\ell})\right\|, \tag{3.3.29}$$

where $\bar L_i^{\ell}(p_1^i,\ldots,p_m^i) > 0$, $\ell = 1, 2, \ldots, (2r+1)^m$, $i = 0, 1, \ldots$. Let

$$\bar L_i(p_1^i,\ldots,p_m^i) = \max_{1\le\ell\le(2r+1)^m}\left\{\bar L_i^{\ell}(p_1^i,\ldots,p_m^i)\right\} \tag{3.3.30}$$

be the local Lipschitz constant. Then, (3.3.29) can be written as

$$\left|\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \hat{\mathcal V}_i(x_k,u(p_1^{i\ell},\ldots,p_m^{i\ell}))\right| \le \bar L_i(p_1^i,\ldots,p_m^i)\left\|u(p_1^i,\ldots,p_m^i) - u(p_1^{i\ell},\ldots,p_m^{i\ell})\right\|.$$

For all $u(p_1^i,\ldots,p_m^i) \in A$, we can define the global Lipschitz constant $\bar L_i$ as

$$\bar L_i = \max\left\{\bar L_i(p_1^i,\ldots,p_m^i) : 1 \le p_j^i \le P_j,\ j = 1, 2, \ldots, m\right\}. \tag{3.3.31}$$

Thus, from (3.3.30) and (3.3.31), we can easily obtain $L_i(p_1^i,\ldots,p_m^i) \le \bar L_i$.
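The procedure just described (a local Lipschitz constant as the largest slope to any neighbor, followed by the error bound of the form (3.3.28)) can be sketched as follows. The data layout is an assumption: `values[q]` holds the value function at the qth control for a fixed state, `controls[q]` is that control as a vector, and `neighbors` lists the indices in the neighborhood set. The quadratic test values are hypothetical.

```python
import numpy as np

def local_lipschitz(values, controls, center, neighbors):
    """Largest slope |V(u) - V(u_l)| / ||u - u_l|| over neighbors u_l != u."""
    slopes = [abs(values[center] - values[q])
              / np.linalg.norm(controls[center] - controls[q])
              for q in neighbors if q != center]
    return max(slopes)

def error_bound(L, controls, center, neighbors):
    """L times the largest distance from the center control to a neighbor."""
    return L * max(np.linalg.norm(controls[center] - controls[q]) for q in neighbors)

# Hypothetical scalar control grid with step 0.1 and a quadratic value profile.
controls = np.array([[-0.2], [-0.1], [0.0], [0.1], [0.2]])
values = (controls**2).sum(axis=1)
center, neighbors = 2, [1, 2, 3]            # radius r = 1 around u = 0

L_loc = local_lipschitz(values, controls, center, neighbors)
eps_bar = error_bound(L_loc, controls, center, neighbors)
```

The global constant of (3.3.31) would be the maximum of `local_lipschitz` over all grid points, which gives a more conservative error bound.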
In the above, we give an effective method to obtain the approximation error $\bar\varepsilon_{i+1}(x_k)$ of the numerical iterative θ-ADP algorithm. We will show how to obtain the admissible approximation error to guarantee the convergence criterion of the present numerical iterative ADP algorithm. According to (3.3.9), we can define $\gamma = \max\left\{J^*(F(x_k,u_k))/U(x_k,u_k) : x_k \in \mathbb{R}^n, u_k \in A\right\}$. If we let

$$\tilde\gamma_i = \max\left\{\frac{\hat V_i(F(x_k,u_k))}{U(x_k,u_k)} : x_k \in \mathbb{R}^n,\ u_k \in A\right\}, \tag{3.3.32}$$

then we can get $\tilde\gamma_i \ge \gamma$. Before the next theorem, we introduce some notation. Let

$$\bar\eta_i(x_k) = \frac{\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))}{\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \bar\varepsilon_i(x_k)}, \tag{3.3.33}$$

and

$$\beta_i(x_k) = \frac{\hat{\mathcal V}_0(x_k,0)}{\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))}. \tag{3.3.34}$$

Theorem 3.3.5 Let the iterative value function $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ be defined in (3.3.12) and the numerical iterative control $u(p_1^i,\ldots,p_m^i)$ be defined in (3.3.15). For $i = 0, 1, \ldots$, if the approximation error satisfies

$$\bar\varepsilon_i(x_k) \le \frac{\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))}{\hat{\mathcal V}_0(x_k,0)(\tilde\gamma_i + 1) - \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))}\left(\hat{\mathcal V}_0(x_k,0) - \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))\right), \tag{3.3.35}$$

then the numerical iterative control law $u(p_1^i,\ldots,p_m^i)$ stabilizes the nonlinear system (3.2.1) and simultaneously guarantees the iterative value function $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ to converge to a finite neighborhood of $J^*(x_k)$, as $i \to \infty$.

Proof From (3.3.10), we can define

$$\beta = \max\left\{\frac{V_0(x_k)}{J^*(x_k)} : x_k \in \mathbb{R}^n\right\}.$$

From (3.3.32)–(3.3.34), for all $x_k \in \mathbb{R}^n$, we can obtain

$$\begin{cases}
\eta_i(x_k) \le \bar\eta_i(x_k), \\[4pt]
1 + \dfrac{\beta_i(x_k) - 1}{\tilde\gamma_i\beta_i(x_k)} \le 1 + \dfrac{\beta - 1}{\gamma\beta}.
\end{cases} \tag{3.3.36}$$

So, if

$$\bar\eta_i(x_k) \le 1 + \frac{\beta_i(x_k) - 1}{\tilde\gamma_i\beta_i(x_k)}, \tag{3.3.37}$$

then (3.3.36) holds. Putting (3.3.33) and (3.3.34) into (3.3.37), we can obtain (3.3.35). On the other hand, according to (3.3.32) and the definition of η in (3.3.8), we have

$$\begin{aligned}
\eta = \max_{x_k\in\mathbb{R}^n,\,i=0,1,\ldots}\{\eta_i(x_k)\} &\le \max_{x_k\in\mathbb{R}^n,\,i=0,1,\ldots}\{\bar\eta_i(x_k)\} \\
&\le \max_{x_k\in\mathbb{R}^n,\,i=0,1,\ldots}\left\{1 + \frac{\beta_i(x_k) - 1}{\tilde\gamma_i\beta_i(x_k)}\right\} \\
&\le 1 + \frac{\beta - 1}{\gamma\beta}.
\end{aligned}$$

By Theorems 3.3.1 and 3.3.2, we can draw the conclusion. The proof is complete.

Theorem 3.3.6 Let the iterative value function $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ be defined in (3.3.12) and the numerical iterative control $u(p_1^i,\ldots,p_m^i)$ be defined in (3.3.15). If for all $x_k \in \mathbb{R}^n$, we have

$$U(x_k,u_k) \ge U(x_k,0), \tag{3.3.38}$$

and for $i = 0, 1, \ldots$, the approximation error satisfies

$$\bar\varepsilon_i(x_k) \le \frac{\hat{\mathcal V}_i^{2}(x_k,u(p_1^i,\ldots,p_m^i))\left(\hat{\mathcal V}_0(x_k,0) - \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))\right)}{\hat{\mathcal V}_0(x_k,0)\,\Delta_{i1}(x_k,u(p_1^i,\ldots,p_m^i)) + \Delta_{i2}(x_k,u(p_1^i,\ldots,p_m^i))}, \tag{3.3.39}$$

where

$$\Delta_{i1}(x_k,u(p_1^i,\ldots,p_m^i)) = \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - U(x_k,0)$$

and

$$\Delta_{i2}(x_k,u(p_1^i,\ldots,p_m^i)) = \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))\left(\hat{\mathcal V}_0(x_k,0) - \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))\right),$$

then the numerical iterative control law $u(p_1^i,\ldots,p_m^i)$ stabilizes the nonlinear system (3.2.1) and simultaneously guarantees the iterative value functions $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ to converge to a finite neighborhood of $J^*(x_k)$, as $i \to \infty$.

Proof If we let

$$\hat\gamma_i = \max\left\{\frac{\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))}{U(x_k,0)} - 1 : x_k \in \mathbb{R}^n,\ u_k \in A\right\}, \tag{3.3.40}$$

then we can obtain

$$\hat\gamma_i \ge \frac{J^*(x_k)}{U(x_k,u^*(x_k))} - 1 \ge \gamma.$$

So, if

$$\bar\eta_i(x_k) \le 1 + \frac{\beta_i(x_k) - 1}{\hat\gamma_i\beta_i(x_k)},$$

then (3.3.11) holds, which means that the iterative value functions $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ converge to a finite neighborhood of $J^*(x_k)$ according to Theorem 3.3.1. According to (3.3.33), (3.3.34), and (3.3.40), we can get (3.3.39). The proof is complete.

From Theorems 3.3.5 and 3.3.6, we can see that the information of the parameter
γi+1 should be used while the value of γi+1 is usually difficult to obtain. So, in the
next theorem, we give a more simplified convergence justification for the iterative
ADP algorithm.

Theorem 3.3.7 Let the iterative value function $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ and the numerical iterative control $u(p_1^i,\ldots,p_m^i)$ be obtained by (3.3.15) and (3.3.16), respectively. Let $\bar\varepsilon_i$ be expressed as in (3.3.28). For $i = 0, 1, \ldots$, if the utility function $U(x_k,u_k)$ satisfies (3.3.38) and the iterative approximation error satisfies

$$\bar\varepsilon_i(x_k) \le \hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - \frac{\hat{\mathcal V}_0(x_k,0)}{\hat{\mathcal V}_0(x_k,0) - U(x_k,0)}\left(\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i)) - U(x_k,0)\right), \tag{3.3.41}$$

then the numerical iterative control law $u(p_1^i,\ldots,p_m^i)$ stabilizes the nonlinear system (3.2.1) and simultaneously guarantees the iterative value function $\hat{\mathcal V}_i(x_k,u(p_1^i,\ldots,p_m^i))$ to converge to a finite neighborhood of $J^*(x_k)$, as $i \to \infty$.

Proof First, we look at (3.3.9) and (3.3.10). Without loss of generality, we let

$$\tilde\gamma(x_k) = \frac{J^*(x_k) - U(x_k,u^*(x_k))}{U(x_k,u^*(x_k))}, \qquad \tilde\beta(x_k) = \frac{V_0(x_k)}{J^*(x_k)}.$$

Then, we can get

$$\tilde\gamma(x_k)\tilde\beta(x_k) = \frac{V_0(x_k)}{U(x_k,u^*(x_k))} - \tilde\beta(x_k).$$

By (3.3.34), we can obtain

$$\max_{x_k\in\mathbb{R}^n,\,i=0,1,\ldots}\{\beta_i(x_k)\} \le \max_{x_k\in\mathbb{R}^n}\tilde\beta(x_k),$$

and

$$\max_{x_k\in\mathbb{R}^n,\,i=0,1,\ldots}\left\{\frac{\hat{\mathcal V}_0(x_k,0)}{U(x_k,0)} - \tilde\beta_i(x_k)\right\} \ge \max_{x_k\in\mathbb{R}^n}\left\{\tilde\gamma(x_k)\tilde\beta(x_k)\right\} = \gamma\beta.$$

Next, we notice that if $\bar\eta_i(x_k)$ satisfies

$$\bar\eta_i(x_k) \le 1 + \frac{(\tilde\beta_i(x_k) - 1)\,U(x_k,0)}{\hat{\mathcal V}_0(x_k,0) - \tilde\beta_i(x_k)\,U(x_k,0)}, \tag{3.3.42}$$

then

$$\begin{aligned}
\eta = \max_{x_k\in\mathbb{R}^n,\,i=0,1,\ldots}\bar\eta_i(x_k) &\le \max_{x_k\in\mathbb{R}^n,\,i=0,1,\ldots}\left\{1 + \frac{(\tilde\beta_i(x_k) - 1)\,U(x_k,0)}{\hat{\mathcal V}_0(x_k,0) - \tilde\beta_i(x_k)\,U(x_k,0)}\right\} \\
&\le 1 + \frac{\beta - 1}{\gamma\beta}.
\end{aligned}$$

Substituting (3.3.33) and (3.3.34) into (3.3.42), we can obtain (3.3.41). The proof is complete.
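The condition of Theorem 3.3.7 only involves quantities that are available during the iteration, so it is straightforward to evaluate pointwise. The sketch below computes the admissible bound from the current value V_i, the initial value V_0 and the zero-control utility U_0 = U(x_k, 0); the function names and the sample numbers are illustrative assumptions. The bound is algebraically equal to U_0 (V_0 − V_i)/(V_0 − U_0), which the second helper uses as a cross-check.

```python
def admissible_error(V_i, V_0, U_0):
    """Bound V_i - V_0 / (V_0 - U_0) * (V_i - U_0) on the approximation error."""
    return V_i - V_0 / (V_0 - U_0) * (V_i - U_0)

def admissible_error_alt(V_i, V_0, U_0):
    """Equivalent closed form U_0 * (V_0 - V_i) / (V_0 - U_0)."""
    return U_0 * (V_0 - V_i) / (V_0 - U_0)

def is_admissible(eps_bar, V_i, V_0, U_0):
    """True if the computed error bound eps_bar satisfies the condition."""
    return eps_bar <= admissible_error(V_i, V_0, U_0)

# Hypothetical numbers: V_0 = theta * x'Qx = 8 and U(x, 0) = 1 at x = 1, with V_i = 3.
bound = admissible_error(3.0, 8.0, 1.0)
```

The bound shrinks to zero as V_i approaches V_0 and grows as the iterative value function decreases, so the admissible error is smallest during the first iterations.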

3.3.3 Summary of the Numerical Iterative θ-ADP Algorithm

Now, we summarize the numerical iterative θ -ADP algorithm.


Step 1: Choose randomly an array of initial states x0 and choose the approximation
precision ζ . Give the set of numerical controls A as in expression (3.3.14).
Give the maximum iteration index i max . Choose a large θ . Let i = 0 and
V̂0 (xk ) = θ Ψ (xk ), where Ψ (xk ) ∈ Ψ̄xk . Obtain v̂0 (xk ) from (3.3.2). Obtain
u( p10 , . . . , pm0 ) = v̂0 (xk ) by (3.3.15) and Ā( p10 , p20 , . . . , pm0 ) by (3.3.17).
Obtain xk+1 . Calculate Vˆ1 (xk , u( p10 , . . . , pm0 )) by (3.3.16).
Step 2: Let i = i + 1. According to the numerical controls A, implement the numer-
ical iterative θ -ADP algorithm (3.3.2)–(3.3.4) to obtain V̂i (xk ) and v̂i (xk ).
Obtain u( p1i , . . . , pmi ) = v̂i (xk ) by (3.3.15) and Ā( p1i , . . . , pmi ) by (3.3.17).
Obtain xk+1 . Calculate Vˆi (xk , u( p1i , . . . , pmi )) by (3.3.16).
Step 3: Obtain the Lipschitz constant L i ( p1i , . . . , pmi ) according to (3.3.30). Com-
pute ε̄i (xk ) by (3.3.28).
Step 4: If ε̄i (xk ) satisfies (3.3.39), then go to next step. Otherwise, go to Step 6.
Step 5: If |V̂i (xk ) − V̂i−1 (xk )| ≤ ζ , then the iterative value function has converged;
go to Step 6. Else if i > i max , then go to Step 6; else, go to Step 2.
Step 6: Stop the algorithm.

3.3.4 Simulation Study

Consider the discrete-time nonlinear system given by

\[
\begin{aligned}
x_1(k+1) &= [x_1^2(k) + x_2^2(k) + u(k)] \cos(x_2(k)), \\
x_2(k+1) &= [x_1^2(k) + x_2^2(k) + u(k)] \sin(x_2(k)).
\end{aligned} \tag{3.3.43}
\]

Let xk = [x1 (k), x2 (k)]T denote the system state vector and u k = u(k) denote the
control. Let
A = {−2, −2 + ς, −2 + 2ς, . . . , 2},

where ς is the grid step. The cost function is defined as (3.2.3) with the utility function

\[
U(x_k, u_k) = x_k^{\mathsf{T}} Q x_k + u_k^{\mathsf{T}} R u_k,
\]

where $Q$ and $R$ are given as identity matrices with suitable dimensions. The initial
state is $x_0 = [1, -1]^{\mathsf{T}}$.

Fig. 3.5 The curve of the admissible approximation error obtained by (3.3.39) (surface over state variables $x_1$ and $x_2$; vertical axis: admissible approximation error)

Fig. 3.6 The curve of the admissible approximation error obtained by (3.3.41) (surface over state variables $x_1$ and $x_2$; vertical axis: admissible approximation error)

Fig. 3.7 The trajectories of the iterative value functions (iterative value function versus iteration steps). a $\varsigma = 10^{-8}$. b $\varsigma = 10^{-4}$. c $\varsigma = 10^{-2}$. d $\varsigma = 10^{-1}$

Fig. 3.8 The control and state trajectories (versus time steps). a State trajectories for $\varsigma = 10^{-8}$. b Control trajectory for $\varsigma = 10^{-8}$. c State trajectories for $\varsigma = 10^{-4}$. d Control trajectory for $\varsigma = 10^{-4}$
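The simulation setup above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`step`, `utility`, `control_grid`) are hypothetical, and the one-step greedy choice at the end uses $U(x_{k+1}, 0)$ as a stand-in for the learned iterative value function, which is an assumption made here and not the algorithm of the text.

```python
import numpy as np

def step(x, u):
    """One step of system (3.3.43) for state x = [x1, x2] and scalar control u."""
    s = x[0] ** 2 + x[1] ** 2 + u
    return np.array([s * np.cos(x[1]), s * np.sin(x[1])])

def utility(x, u, Q=np.eye(2), R=1.0):
    """Quadratic utility U(x, u) = x^T Q x + R u^2 with identity weights."""
    return float(x @ Q @ x + R * u * u)

def control_grid(step_size, lo=-2.0, hi=2.0):
    """Numerical control set A = {lo, lo + step_size, ..., hi}."""
    n = int(round((hi - lo) / step_size)) + 1
    return np.linspace(lo, hi, n)

A = control_grid(0.1)            # coarse grid; a step of 1e-8 would be far too large to enumerate
x0 = np.array([1.0, -1.0])

# Greedy one-step search over the grid. U(x_{k+1}, 0) replaces the iterative
# value function here purely for illustration (assumption).
costs = [utility(x0, u) + utility(step(x0, u), 0.0) for u in A]
u_best = A[int(np.argmin(costs))]
```

With a coarse grid the search is a plain enumeration; the finer grid steps used in the text trade computation for a smaller approximation error in the control.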
The iterative ADP algorithm runs for 30 iteration steps to guarantee the conver-
gence of the iterative value function. The curves of the admissible approximation
error obtained by (3.3.39) and (3.3.41) are displayed in Figs. 3.5 and 3.6, respectively.
From Fig. 3.6, we can see that for some states xk , the admissible approximation error
is smaller than zero so that the convergence criterion (3.3.41) is invalid. While from
Fig. 3.5, we can see that the admissible approximation error curve is above zero
which implies that the convergence criterion (3.3.39) is effective for all xk .
To show the effectiveness of the numerical iterative ADP algorithm, we choose
four different grid steps, $\varsigma = 10^{-8}, 10^{-4}, 10^{-2}, 10^{-1}$. The trajectories
of the iterative value function are shown in Fig. 3.7a, b, c and d, respectively.
For $\varsigma = 10^{-8}$ and $\varsigma = 10^{-4}$, the approximate optimal control is applied to
system (3.3.43) with implementation time $T_f = 40$. The trajectories
of the states and controls are displayed in Fig. 3.8a, b, c and d, respectively. When
$\varsigma = 10^{-2}$, we can see that the iterative value function is not completely converged
within 30 iteration steps. The trajectories of the state are displayed in Fig. 3.9a, and
the corresponding control trajectory is displayed in Fig. 3.9b.

Fig. 3.9 The state and control trajectories (versus time steps). a State trajectories for $\varsigma = 10^{-2}$. b Control trajectory for $\varsigma = 10^{-2}$

In this section, it is shown that if the inequality (3.3.35) holds, then for $i = 0, 1, \ldots$,
the numerical iterative control $\hat{v}_i(x_k)$ stabilizes the system (3.3.43), which means that
the numerical iterative θ-ADP algorithm can be implemented both online and offline.
In Fig. 3.10a–d, we give the state and control trajectories of the system (3.3.43)
under the iterative control law $\hat{v}_0(x_k)$ with $\varsigma = 10^{-8}$ and $\varsigma = 10^{-4}$, respectively. In
Fig. 3.11a–d, we give the state and control trajectories of the system (3.3.43)
under the iterative control law $\hat{v}_0(x_k)$ with $\varsigma = 10^{-2}$ and $\varsigma = 10^{-1}$, respectively.
When $\varsigma = 10^{-1}$, we can see that the iterative value function is not convergent and the
control system is not stable.

Fig. 3.10 The state and control trajectories under $\hat{v}_0(x_k)$ (versus time steps). a State trajectories for $\varsigma = 10^{-8}$. b Control trajectory for $\varsigma = 10^{-8}$. c State trajectories for $\varsigma = 10^{-4}$. d Control trajectory for $\varsigma = 10^{-4}$

3.4 General Value Iteration ADP Algorithm with Finite Approximation Errors

Generally speaking, $J^*(x_k)$ given in (3.2.4) is a highly nonlinear and nonanalytic
function. The optimal control law is nearly impossible to obtain by directly solving
the Bellman equation. In this section, a GVI ADP algorithm will be developed to
obtain the optimal solution of the Bellman equation indirectly.

3.4.1 Derivation and Properties of the GVI Algorithm with Finite Approximation Errors

In this section, a finite-approximation-error-based general value iteration algorithm
is developed. A new convergence analysis method will be established, and a new
design method for the convergence criteria will also be developed [43].

Fig. 3.11 The state and control trajectories under $\hat{v}_0(x_k)$ (versus time steps). a State trajectories for $\varsigma = 10^{-2}$. b Control trajectory for $\varsigma = 10^{-2}$. c State trajectories for $\varsigma = 10^{-1}$. d Control trajectory for $\varsigma = 10^{-1}$

A. Derivation of the GVI Algorithm with Finite Approximation Errors


In the present GVI algorithm, the value function and control law are updated by
iterations, with the iteration index $i$ increasing from 0 to $\infty$. Let $\Omega_x$ and $\Omega_u$ be
the domains of definition for the state and control, which are defined as
$\Omega_x = \{x_k : x_k \in \mathbb{R}^n \text{ and } \|x_k\| < \infty\}$ and
$\Omega_u = \{u_k : u_k \in \mathbb{R}^m \text{ and } \|u_k\| < \infty\}$, respectively, where $\|\cdot\|$
denotes the Euclidean norm. For all $x_k \in \Omega_x$, let the initial function $\Psi(x_k) \ge 0$
be an arbitrary positive semidefinite function. For all $x_k \in \Omega_x$, let the initial value
function

\[
\hat{V}_0(x_k) = \Psi(x_k). \tag{3.4.1}
\]

The iterative control law $\hat{v}_0(x_k)$ can be computed as

\[
\hat{v}_0(x_k) = \arg\min_{u_k} \bigl\{ U(x_k, u_k) + \hat{V}_0(x_{k+1}) \bigr\} + \rho_0(x_k), \tag{3.4.2}
\]

where V̂0 (xk+1 ) = Ψ (xk+1 ) and ρ0 (xk ) is the approximation error function of the
iterative control law v̂0 (xk ). For i = 1, 2, . . ., the iterative ADP algorithm will iterate
between

\[
\hat{V}_i(x_k) = U(x_k, \hat{v}_{i-1}(x_k)) + \hat{V}_{i-1}(F(x_k, \hat{v}_{i-1}(x_k))) + \pi_i(x_k) \tag{3.4.3}
\]

and

\[
\hat{v}_i(x_k) = \arg\min_{u_k} \bigl\{ U(x_k, u_k) + \hat{V}_i(x_{k+1}) \bigr\} + \rho_i(x_k), \tag{3.4.4}
\]

where πi (xk ) and ρi (xk ) are finite approximation error functions of the iterative value
function and control law, respectively. For convenience of analysis, for i = 0, 1, . . .,
we assume that ρi (xk ) = 0 and πi (xk ) = 0 for xk = 0.

B. Special Case of Accurate GVI Algorithm


Next, we will discuss the properties of GVI iterative ADP algorithm where the
iterative value function and the iterative control law can accurately be obtained in each
iteration. For i = 0, 1, . . ., the GVI algorithm is reduced to the following equations

\[
\begin{aligned}
V_i(x_k) &= \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \}, \\
v_i(x_k) &= \arg\min_{u_k} \{ U(x_k, u_k) + V_i(F(x_k, u_k)) \},
\end{aligned} \tag{3.4.5}
\]

where V0 (xk ) = Ψ (xk ) is a positive semidefinite function.
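As a hedged illustration of (3.4.5), consider a scalar linear system $x_{k+1} = a x_k + b u_k$ with $U(x_k, u_k) = q x_k^2 + r u_k^2$ and $\Psi(x_k) = p_0 x_k^2$. This example is an assumption introduced here, not taken from the text. In this case $V_i(x_k) = p_i x_k^2$ and the minimization in (3.4.5) has the closed form $p_i = q + a^2 r p_{i-1} / (r + b^2 p_{i-1})$, so the iteration can be run exactly.

```python
def gvi_scalar_lq(p0, a=1.0, b=1.0, q=1.0, r=1.0, iters=50):
    """Exact value iteration (3.4.5) for V_i(x) = p_i * x^2 on a scalar LQ problem.

    The inner minimization over u has the closed form
    p_i = q + a^2 * r * p_{i-1} / (r + b^2 * p_{i-1}).
    """
    p = p0
    history = [p]
    for _ in range(iters):
        p = q + a * a * r * p / (r + b * b * p)
        history.append(p)
    return history

# Two positive semidefinite initial functions Psi(x) = p0 * x^2.
from_zero = gvi_scalar_lq(p0=0.0)     # traditional value iteration, V_0 = 0
from_large = gvi_scalar_lq(p0=10.0)   # general initialization above the optimum
```

For $a = b = q = r = 1$ the fixed point solves $p^2 - p - 1 = 0$, so both runs approach the same coefficient from opposite sides, which is the behavior the general initialization is meant to allow.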

Remark 3.4.1 In [4], for V0 (xk ) = 0, it was proven that the iterative value func-
tion in (3.4.5) is monotonically nondecreasing and converges to the optimum. In
[3, 5, 10], the convergence property of value iteration for discounted case has been
investigated, and it was proven that the iterative value function will converge to the
optimum if the discount factor satisfies 0 < α < 1. For the undiscounted cost func-
tion (3.2.2) and the positive semidefinite function, i.e., α ≡ 1 and V0 (xk ) = Ψ (xk ),
the convergence analysis methods in [3–5, 7, 10, 18] are no longer applicable. Hence,
a new convergence analysis method needs to be developed. In [11, 18, 26], a “func-
tion bound” method was proposed for the traditional value iteration algorithm with
the zero initial value function. Based on the previous contribution of value iteration
algorithms [3–5, 7, 10, 18], and inspired by [18], a new convergence analysis method
for the general value iteration algorithm is developed in this section.

Lemma 3.4.1 Let $\Psi(x_k) \ge 0$ be an arbitrary positive semidefinite function. Under
Assumptions 3.2.1–3.2.3, for $i = 1, 2, \ldots$, the iterative value function $V_i(x_k)$ in
(3.4.5) is positive definite for all $x_k \in \Omega_x$.

Theorem 3.4.1 For i = 0, 1, . . ., let Vi (xk ) and vi (xk ) be updated by (3.4.5). Then,
the iterative value function Vi (xk ) converges to J ∗ (xk ) as i → ∞.

Proof First, according to Assumptions 3.2.1–3.2.3, $J^*(x_k)$ is a positive definite function
of $x_k$. Let $\varepsilon > 0$ be an arbitrary small positive number. Define

\[
\Omega_\varepsilon = \bigl\{ x_k : x_k \in \Omega_x,\ \|x_k\| < \varepsilon \bigr\}.
\]

For all $x_k \in \Omega_\varepsilon$, according to Lemma 3.4.1, for $i = 0, 1, \ldots$, we have
$V_i(x_k) \to J^*(x_k)$ as $\varepsilon \to 0$. Hence, the conclusion holds for all $x_k \in \Omega_\varepsilon$, $\varepsilon \to 0$. On the other
hand, for an arbitrary small $\varepsilon > 0$, $x_k \in \Omega_x \setminus \Omega_\varepsilon$, and $u_k \in \Omega_u$, we have

0 < U (xk , u k ) < ∞, 0 ≤ V0 (xk ) < ∞, 0 < J ∗ (xk ) < ∞, and 0 ≤ J ∗ (F(xk , u k )) < ∞.

Hence, there exist constants γ , α, and β such that J ∗ (F(xk , u k )) ≤ γ U (xk , u k ) and
α J ∗ (xk ) ≤ V0 (xk ) ≤ β J ∗ (xk ), where 0 < γ < ∞ and 0 ≤ α ≤ 1 ≤ β, respectively.
As J ∗ (xk ) is unknown, the values of γ , α, and β cannot be obtained directly. We will
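first prove that for $i = 0, 1, \ldots$, the following inequality

\[
V_i(x_k) \ge \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^i} \right) J^*(x_k)
\]

holds. Mathematical induction is introduced to prove the conclusion. Let $i = 0$. We
have

\[
\begin{aligned}
V_1(x_k) &= \min_{u_k} \bigl\{ U(x_k, u_k) + V_0(x_{k+1}) \bigr\} \\
&\ge \min_{u_k} \left\{ \left( 1 + \gamma \frac{\alpha - 1}{1 + \gamma} \right) U(x_k, u_k)
+ \left( \alpha - \frac{\alpha - 1}{1 + \gamma} \right) J^*(x_{k+1}) \right\} \\
&= \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})} \right) J^*(x_k).
\end{aligned} \tag{3.4.6}
\]

Assume that the conclusion holds for $i = l - 1$, $l = 1, 2, \ldots$. Then, for $i = l$, we have

\[
\begin{aligned}
V_{l+1}(x_k) &= \min_{u_k} \bigl\{ U(x_k, u_k) + V_l(x_{k+1}) \bigr\} \\
&\ge \min_{u_k} \left\{ U(x_k, u_k) + \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{l-1}} \right) J^*(x_{k+1}) \right\} \\
&\ge \left( 1 + \frac{\alpha - 1}{(1 + \gamma^{-1})^{l}} \right) J^*(x_k).
\end{aligned} \tag{3.4.7}
\]

The mathematical induction is complete. On the other hand, according to (3.4.6) and
(3.4.7), for $i = 0, 1, \ldots$, we can get

\[
V_i(x_k) \le \left( 1 + \frac{\beta - 1}{(1 + \gamma^{-1})^i} \right) J^*(x_k).
\]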

Letting i → ∞, we can get the conclusion. The proof is complete.
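The two bounds in the proof squeeze $V_i(x_k)$ toward $J^*(x_k)$ at a geometric rate governed by $1 + \gamma^{-1}$. The short sketch below evaluates both multipliers; the numerical values of $\alpha$, $\beta$, and $\gamma$ are hypothetical, chosen only to satisfy $0 \le \alpha \le 1 \le \beta$ and $0 < \gamma < \infty$.

```python
def gvi_bounds(alpha, beta, gamma, i):
    """Lower and upper multipliers of J*(x_k) that bound V_i(x_k) after i iterations."""
    shrink = (1.0 + 1.0 / gamma) ** i
    lower = 1.0 + (alpha - 1.0) / shrink
    upper = 1.0 + (beta - 1.0) / shrink
    return lower, upper

# Hypothetical constants, for illustration only.
for i in (0, 1, 10, 100):
    lo, hi = gvi_bounds(alpha=0.2, beta=5.0, gamma=3.0, i=i)
    print(i, lo, hi)
```

At $i = 0$ the multipliers are $\alpha$ and $\beta$ themselves; a larger $\gamma$ makes $1 + \gamma^{-1}$ closer to 1 and therefore slows the contraction.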



C. Properties of the Finite-Approximation-Error-based GVI Algorithm


Considering approximation errors, we generally have $\hat{V}_i(x_k) \ne V_i(x_k)$,
$\hat{v}_i(x_k) \ne v_i(x_k)$, $i = 0, 1, \ldots$, and the convergence property in Theorem 3.4.1 for the
accurate general value iteration algorithm no longer holds. Hence, in this section, a new
“design method of the convergence criteria” will be established. It permits the present
general value iteration algorithm to design a suitable approximation error adaptively
so that the iterative value function converges to a finite neighborhood of the optimal
one.
Define the target iterative value function as

\[
\Gamma_i(x_k) = \min_{u_k} \bigl\{ U(x_k, u_k) + \hat{V}_{i-1}(x_{k+1}) \bigr\}, \tag{3.4.8}
\]

where
V̂0 (xk ) = Γ0 (xk ) = Ψ (xk ).

For i = 1, 2, . . ., let ηi > 0 be a finite approximation error such that

V̂i (xk ) ≤ ηi Γi (xk ). (3.4.9)

From (3.4.9), we can see that the iterative value function Γi (xk ) can be larger or
smaller than V̂i (xk ) and the convergence properties are different for different values
of ηi . As the convergence analysis methods for 0 < ηi < 1 and ηi ≥ 1 are different,
the convergence properties for different ηi will be discussed separately.

Theorem 3.4.2 Let $x_k \in \Omega_x$ be an arbitrary controllable state and Assumptions
3.2.1–3.2.3 hold. For $i = 1, 2, \ldots$, let $\Gamma_i(x_k)$ be expressed as in (3.4.8) and $\hat{V}_i(x_k)$
be expressed as in (3.4.3). If for $i = 1, 2, \ldots$, there exists $0 < \eta_i < 1$ such that (3.4.9)
holds, then the iterative value function converges to a bounded neighborhood of the
optimal cost function $J^*(x_k)$.

Proof If 0 < ηi < 1, according to (3.4.9), we have 0 ≤ V̂i (xk ) < Γi (xk ). Using
mathematical induction, we can prove that for i = 1, 2, . . ., the following inequality

0 < V̂i (xk ) < Vi (xk ) (3.4.10)

holds. According to Theorem 3.4.1, we have $V_i(x_k) \to J^*(x_k)$. Then, for $i = 0, 1, \ldots$,
$\hat{V}_i(x_k)$ is upper bounded and

\[
0 < \lim_{i \to \infty} \hat{V}_i(x_k) \le \lim_{i \to \infty} V_i(x_k) = J^*(x_k). \tag{3.4.11}
\]

The proof is complete.



Theorem 3.4.3 Let $x_k \in \Omega_x$ be an arbitrary controllable state and Assumptions
3.2.1–3.2.3 hold. For $i = 1, 2, \ldots$, let $\Gamma_i(x_k)$ be expressed as in (3.4.8) and $\hat{V}_i(x_k)$ be
expressed as in (3.4.3). Let $0 < \gamma_i < \infty$ be a constant such that

\[
V_i(F(x_k, u_k)) \le \gamma_i U(x_k, u_k). \tag{3.4.12}
\]

If for $i = 1, 2, \ldots$, there exists $1 \le \eta_i < \infty$ such that (3.4.9) holds uniformly, then

\[
\hat{V}_i(x_k) \le \eta_i \Biggl[ 1 + \sum_{j=1}^{i-1} \eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}(\eta_{i-j} - 1)
\frac{\gamma_{i-1}\gamma_{i-2}\cdots\gamma_{i-j}}{(\gamma_{i-1}+1)(\gamma_{i-2}+1)\cdots(\gamma_{i-j}+1)} \Biggr] V_i(x_k), \tag{3.4.13}
\]

where we define $\sum_{j}^{i}(\cdot) = 0$ for all $j > i$, $i, j = 0, 1, \ldots$, and

\[
\eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}(\eta_{i-j} - 1) = (\eta_{i-1} - 1) \quad \text{for } j = 1.
\]

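Proof The theorem can be proven by mathematical induction. First, let $i = 1$. We
can easily obtain $\Gamma_1(x_k) = V_1(x_k)$. According to (3.4.9), we have

\[
\hat{V}_1(x_k) \le \eta_1 V_1(x_k).
\]

Thus, the conclusion holds for $i = 1$. Next, let $i = 2$. According to (3.4.8), we have

\[
\begin{aligned}
\Gamma_2(x_k) &= \min_{u_k} \bigl\{ U(x_k, u_k) + \hat{V}_1(F(x_k, u_k)) \bigr\} \\
&\le \min_{u_k} \left\{ \left( 1 + \gamma_1 \frac{\eta_1 - 1}{\gamma_1 + 1} \right) U(x_k, u_k)
+ \left( \eta_1 - \frac{\eta_1 - 1}{\gamma_1 + 1} \right) V_1(x_{k+1}) \right\} \\
&= \left( 1 + \gamma_1 \frac{\eta_1 - 1}{\gamma_1 + 1} \right) \min_{u_k} \bigl\{ U(x_k, u_k) + V_1(x_{k+1}) \bigr\} \\
&= \left( 1 + \gamma_1 \frac{\eta_1 - 1}{\gamma_1 + 1} \right) V_2(x_k).
\end{aligned}
\]

By (3.4.9), we can obtain

\[
\hat{V}_2(x_k) \le \eta_2 \left( 1 + \gamma_1 \frac{\eta_1 - 1}{\gamma_1 + 1} \right) V_2(x_k),
\]

which shows that (3.4.13) holds for $i = 2$. Assume that (3.4.13) holds for $i = l - 1$,
where $l = 2, 3, \ldots$. Then, for $i = l$, we obtain

\[
\begin{aligned}
\Gamma_l(x_k) &= \min_{u_k} \bigl\{ U(x_k, u_k) + \hat{V}_{l-1}(F(x_k, u_k)) \bigr\} \\
&\le \min_{u_k} \Biggl\{ U(x_k, u_k) + \eta_{l-1} \Biggl[ 1 + \sum_{j=1}^{l-2} \eta_{l-2}\cdots\eta_{l-j}(\eta_{l-j-1} - 1)
\frac{\gamma_{l-2}\gamma_{l-3}\cdots\gamma_{l-j-1}}{(\gamma_{l-2}+1)(\gamma_{l-3}+1)\cdots(\gamma_{l-j-1}+1)} \Biggr] V_{l-1}(x_{k+1}) \Biggr\} \\
&\le \Biggl[ 1 + \sum_{j=1}^{l-1} \eta_{l-1}\eta_{l-2}\cdots\eta_{l-j+1}(\eta_{l-j} - 1)
\frac{\gamma_{l-1}\cdots\gamma_{l-j}}{(\gamma_{l-1}+1)\cdots(\gamma_{l-j}+1)} \Biggr]
\min_{u_k} \bigl\{ U(x_k, u_k) + V_{l-1}(x_{k+1}) \bigr\} \\
&= \Biggl[ 1 + \sum_{j=1}^{l-1} \eta_{l-1}\eta_{l-2}\cdots\eta_{l-j+1}(\eta_{l-j} - 1)
\frac{\gamma_{l-1}\cdots\gamma_{l-j}}{(\gamma_{l-1}+1)\cdots(\gamma_{l-j}+1)} \Biggr] V_l(x_k).
\end{aligned}
\]

Then, by (3.4.9), we can obtain (3.4.13), which proves the conclusion for $i = 0, 1, \ldots$.
The proof is complete.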

From (3.4.13), we can see that for i = 0, 1, . . ., there exists an error between
V̂i (xk ) and Vi (xk ). As i → ∞, the bound of the approximation errors may increase to
infinity. Thus, we will give the convergence properties of the iterative ADP algorithm
(3.4.2)–(3.4.4) using error bound method. Before presenting the next theorem, the
following lemma is necessary.

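Lemma 3.4.2 Let $\{b_i\}$, $i = 1, 2, \ldots$, be a sequence of positive numbers. Let
$0 < \lambda_i < \infty$ be a bounded positive constant for $i = 1, 2, \ldots$ and let $a_i = \lambda_i b_i$. If
$\sum_{i=1}^{\infty} b_i$ is finite, then $\sum_{i=1}^{\infty} a_i$ is finite.

Proof Since for $i = 1, 2, \ldots$, $\lambda_i$ is finite, we can let $\bar{\lambda} = \sup\{\lambda_1, \lambda_2, \ldots\}$. Then,

\[
\sum_{i=1}^{\infty} a_i = \sum_{i=1}^{\infty} \lambda_i b_i \le \bar{\lambda} \sum_{i=1}^{\infty} b_i
\]

is finite. The proof is complete.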

Theorem 3.4.4 Let $x_k \in \Omega_x$ be an arbitrary controllable state and Assumptions
3.2.1–3.2.3 hold. If for $i = 1, 2, \ldots$, $\hat{V}_i(x_k)$ satisfies (3.4.13) and the approximation
error $\eta_i$ satisfies

\[
1 \le \eta_i \le q_{i-1} \frac{\gamma_{i-1} + 1}{\gamma_{i-1}}, \tag{3.4.14}
\]

where $q_i$ is an arbitrary constant such that $\dfrac{\gamma_i}{\gamma_i + 1} < q_i < 1$, then as $i \to \infty$, the
iterative value function $\hat{V}_i(x_k)$ of the general value iteration algorithm (3.4.2)–(3.4.4)
converges to a bounded neighborhood of the optimal cost function $J^*(x_k)$.
Proof For (3.4.13) in Theorem 3.4.3, if we let

\[
\Delta_i = \sum_{j=1}^{i-1} \eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}(\eta_{i-j} - 1)
\frac{\gamma_{i-1}\gamma_{i-2}\cdots\gamma_{i-j}}{(\gamma_{i-1}+1)(\gamma_{i-2}+1)\cdots(\gamma_{i-j}+1)},
\]

then we can get

\[
\Delta_i = \sum_{j=1}^{i-1} \frac{\eta_{i-1}\eta_{i-2}\cdots\eta_{i-j}\,\gamma_{i-1}\gamma_{i-2}\cdots\gamma_{i-j}}{(\gamma_{i-1}+1)(\gamma_{i-2}+1)\cdots(\gamma_{i-j}+1)}
- \sum_{j=1}^{i-1} \frac{\eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}\,\gamma_{i-1}\gamma_{i-2}\cdots\gamma_{i-j}}{(\gamma_{i-1}+1)(\gamma_{i-2}+1)\cdots(\gamma_{i-j}+1)}, \tag{3.4.15}
\]

where we define $\eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1} = 1$ for $j = 1$. If we let

\[
\begin{aligned}
a_{ij} &= \frac{\eta_{i-1}\eta_{i-2}\cdots\eta_{i-j}\,\gamma_{i-1}\gamma_{i-2}\cdots\gamma_{i-j}}{(\gamma_{i-1}+1)(\gamma_{i-2}+1)\cdots(\gamma_{i-j}+1)}, \\
b_{ij} &= \frac{\eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}\,\gamma_{i-1}\gamma_{i-2}\cdots\gamma_{i-j}}{(\gamma_{i-1}+1)(\gamma_{i-2}+1)\cdots(\gamma_{i-j}+1)},
\end{aligned} \tag{3.4.16}
\]

where $i = 1, 2, \ldots$, and $j = 1, 2, \ldots, i - 1$, then
$\Delta_i = \sum_{j=1}^{i-1} a_{ij} - \sum_{j=1}^{i-1} b_{ij}$. If $\sum_{j=1}^{i-1} a_{ij}$
and $\sum_{j=1}^{i-1} b_{ij}$ are both finite as $i \to \infty$, then $\lim_{i \to \infty} \Delta_i$ is finite. According to (3.4.16),
we have

\[
\frac{b_{ij}}{b_{i(j-1)}} = \frac{\eta_{i-j+1}\gamma_{i-j}}{\gamma_{i-j} + 1}.
\]

If $\dfrac{b_{ij}}{b_{i(j-1)}} \le q_{i-j} < 1$, then we can get
$\eta_{i-j+1} \le q_{i-j} \dfrac{\gamma_{i-j} + 1}{\gamma_{i-j}}$. Let $\ell = i - j$. We can obtain
$\eta_{\ell+1} \le q_\ell \dfrac{\gamma_\ell + 1}{\gamma_\ell}$, where $\ell = 1, 2, \ldots, i - 1$.
Letting $i \to \infty$, we can obtain (3.4.14).

Let $q = \sup\{q_1, q_2, \ldots\}$, where $0 < q < 1$. We can get
$b_{ij} \le \left( \dfrac{\gamma_{i-1} + 1}{\gamma_{i-1}} \right) q^{j-1}$.
Thus, we can obtain

\[
\sum_{j=1}^{i-1} b_{ij} \le \sum_{j=1}^{i-1} \left( \frac{\gamma_{i-1} + 1}{\gamma_{i-1}} \right) q^{j-1}.
\]

As $\gamma_i / (\gamma_i + 1) < q < 1$ and $\gamma_{i-1}$ is finite for $i = 1, 2, \ldots$, letting $i \to \infty$,
$\lim_{i \to \infty} \sum_{j=1}^{i-1} b_{ij}$ is finite. On the other hand, for $i = 1, 2, \ldots$, and for
$j = 1, 2, \ldots, i - 1$, we have $a_{ij} = \eta_{i-j} b_{ij}$. As for $i = 1, 2, \ldots$ and for
$j = 1, 2, \ldots, i - 1$, $1 \le \eta_{i-j} < \infty$ is finite, according to Lemma 3.4.2,
$\lim_{i \to \infty} \sum_{j=1}^{i-1} a_{ij}$ must be finite. Therefore, $\lim_{i \to \infty} \Delta_i$ is
finite. According to Theorem 3.4.1, we have $\lim_{i \to \infty} V_i(x_k) = J^*(x_k)$. Hence, the iterative
value function $\hat{V}_i(x_k)$ converges to a bounded neighborhood of the optimal cost
function $J^*(x_k)$. The proof is complete.

Remark 3.4.2 From Theorem 3.4.4, we can see that there is no constraint for the
approximation error η1 in (3.4.14). If we let the parameter γ 0 satisfy

V0 (F(xk , u k )) ≤ γ 0 U (xk , u k ), (3.4.17)

where V0 (F(xk , u k )) = Ψ (F(xk , u k )) and let the approximation error η1 satisfy


1 ≤ η1 < (γ 0 + 1)/γ 0 , then the convergence criterion (3.4.14) is valid for i =
0, 1, . . ..
Combining Theorems 3.4.2 and 3.4.4, the convergence criterion of the finite-
approximation-error-based general value iteration algorithm can be established.
Theorem 3.4.5 Let $x_k \in \Omega_x$ be an arbitrary controllable state and Assumptions
3.2.1–3.2.3 hold. If for $i = 0, 1, \ldots$, inequality

\[
0 < \eta_i \le q_{i-1} \frac{\gamma_{i-1} + 1}{\gamma_{i-1}} \tag{3.4.18}
\]

holds, where 0 < qi < 1 is an arbitrary constant, then as i → ∞, the iterative value
function V̂i (xk ) converges to a bounded neighborhood of the optimal cost function
J ∗ (xk ).
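The criterion (3.4.18) is cheap to evaluate once $\gamma_{i-1}$ and $q_{i-1}$ are available. The following minimal sketch checks it; the helper names are hypothetical and the numbers in the usage note are illustrative only.

```python
def eta_upper_bound(gamma_prev, q_prev):
    """Right-hand side of (3.4.18): q_{i-1} * (gamma_{i-1} + 1) / gamma_{i-1}."""
    if gamma_prev <= 0:
        raise ValueError("gamma_{i-1} must be positive")
    if not (0 < q_prev < 1):
        raise ValueError("q_{i-1} must lie in (0, 1)")
    return q_prev * (gamma_prev + 1.0) / gamma_prev

def criterion_holds(eta, gamma_prev, q_prev):
    """True when 0 < eta_i <= q_{i-1} * (gamma_{i-1} + 1) / gamma_{i-1}."""
    return 0 < eta <= eta_upper_bound(gamma_prev, q_prev)
```

For example, with $\gamma_{i-1} = 4$ and $q_{i-1} = 0.9$ the bound is $1.125$, so $\eta_i = 1.1$ passes and $\eta_i = 1.2$ does not. As $\gamma_{i-1}$ grows the bound approaches $q_{i-1} < 1$, so less overestimation of the target value function is tolerated.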

3.4.2 Designs of Convergence Criteria with Finite Approximation Errors

In the previous section, we have discussed the convergence property of the general
value iteration algorithm. From (3.4.18), for i = 0, 1, . . ., the approximation error
ηi is a function of γi . Hence, if we can obtain the parameter γi for i = 1, 2, . . ., then
we can design an iterative approximation error ηi+1 to guarantee the iterative value
function V̂i (xk ) to be convergent. Next, the detailed design method of the convergence
criteria will be discussed.
According to (3.4.12), we give the following definition. Define $\Omega_{\gamma_i}$ as

\[
\Omega_{\gamma_i} = \bigl\{ \gamma_i : \gamma_i U(x_k, u_k) \ge V_i(F(x_k, u_k)),\ \forall x_k \in \Omega_x,\ \forall u_k \in \Omega_u \bigr\}.
\]

Owing to the existence of approximation errors, the accurate iterative value function $V_i(x_k)$
cannot be obtained in general. Hence, the parameter $\gamma_i$ cannot be obtained directly
from (3.4.12). In this section, several methods for estimating $\gamma_i$ are introduced.
Lemma 3.4.3 Let νi (xk ) be an arbitrary control law. Define Vi (xk ) as in (3.4.5) and
define Λi (xk ) as
Λi (xk ) = U (xk , νi−1 (xk )) + Λi−1 (xk+1 ). (3.4.19)

If V0 (xk ) = Λ0 (xk ) = Ψ (xk ), then for i = 1, 2, . . ., Vi (xk ) ≤ Λi (xk ).


Theorem 3.4.6 Let μ(xk ) be an arbitrary admissible control law of nonlinear system
(3.2.1). Let Pi (xk ) be the iterative value function satisfying

Pi (xk ) = U (xk , μ(xk )) + Pi−1 (xk+1 ) (3.4.20)

where P0 (xk ) = V0 (xk ) = Ψ (xk ). If there exists a constant γ̃i such that

γ̃i U (xk , u k ) ≥ Pi (F(xk , u k )), (3.4.21)

then γ̃i ∈ Ωγ i .
Proof As μ(xk ) is an arbitrary admissible control law, according to Lemma 3.4.3,
we have Pi (xk ) ≥ Vi (xk ). If γ̃i satisfies (3.4.21), then we can get

γ̃i U (xk , u k ) ≥ Pi (F(xk , u k )) ≥ Vi (F(xk , u k )).

The proof is complete.


From Theorem 3.4.6, the first estimation method of the parameter γi can be
described in Algorithm 3.4.1.

Algorithm 3.4.1 Method I for estimating γi


1: Give an arbitrary admissible control law μ(xk ) for nonlinear system (3.2.1).
2: Construct an iterative value function Pi (xk ) by (3.4.20), i = 1, 2, . . ., where P0 (xk ) = V0 (xk ) =
Ψ (xk ).
3: Compute the constant γ̃i by (3.4.21).
4: Return γi = γ̃i .
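A hedged sketch of Method I on a toy problem follows. It assumes the scalar system $x_{k+1} = a x_k + b u_k$ with $U(x_k, u_k) = q x_k^2 + r u_k^2$, the admissible law $\mu(x_k) = -k x_k$, and $\Psi(x_k) = p_0 x_k^2$; none of these come from the text. Under these assumptions $P_i(x_k) = c_i x_k^2$ with $c_i = (q + r k^2) + c_{i-1}(a - bk)^2$, and $\tilde{\gamma}_i$ is taken as the largest ratio $P_i(F(x_k, u_k)) / U(x_k, u_k)$ over a sample grid.

```python
import numpy as np

def method_one_gamma(p0, k=0.5, a=1.0, b=1.0, q=1.0, r=1.0, iters=5, n=41):
    """Estimate gamma_1..gamma_iters via (3.4.20)-(3.4.21) on a grid of (x, u) samples."""
    xs = np.linspace(-1.0, 1.0, n)
    X, U = np.meshgrid(xs, xs)
    util = q * X**2 + r * U**2
    mask = util > 0                        # skip the origin, where U(x, u) = 0
    c = p0
    gammas = []
    for _ in range(iters):
        c = (q + r * k * k) + c * (a - b * k) ** 2          # recursion (3.4.20)
        ratio = c * (a * X + b * U) ** 2 / np.where(mask, util, 1.0)
        gammas.append(float(ratio[mask].max()))             # smallest constant meeting (3.4.21) on the grid
    return gammas

gammas = method_one_gamma(p0=1.0)
```

A grid maximum can only underestimate the supremum over all of $\Omega_x \times \Omega_u$, so in practice the grid should be dense enough for the problem at hand. In this symmetric example the maximum is attained on the diagonal $x_k = u_k$, which the grid contains.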

Lemma 3.4.4 For i = 1, 2, . . ., let Γi (xk ) be expressed as in (3.4.8) and V̂i (xk ) be
expressed as in (3.4.3). If for i = 1, 2, . . .,

V̂i (xk ) ≥ Γi (xk ), (3.4.22)

then
V̂i (xk ) ≥ Vi (xk ),

where Vi (xk ) is defined in (3.4.5).



Proof This conclusion can be proven by mathematical induction. First, consider
$i = 1$. We have

\[
\hat{V}_1(x_k) \ge \Gamma_1(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_0(x_{k+1}) \} = V_1(x_k).
\]

So, the conclusion holds for $i = 1$. Assume that the conclusion holds for $i = l - 1$,
$l = 2, 3, \ldots$. Then, for $i = l$, we can obtain

\[
\begin{aligned}
\hat{V}_l(x_k) &\ge \Gamma_l(x_k) \\
&= \min_{u_k} \bigl\{ U(x_k, u_k) + \hat{V}_{l-1}(x_{k+1}) \bigr\} \\
&\ge \min_{u_k} \{ U(x_k, u_k) + V_{l-1}(x_{k+1}) \} \\
&= V_l(x_k).
\end{aligned}
\]

The proof is complete.

Now, we can derive the following theorem.


Theorem 3.4.7 For i = 0, 1, . . ., V̂i (xk ) is obtained by (3.4.2)–(3.4.4). If for
i = 0, 1, . . ., the approximation error of the iterative value function πi (xk ) ≥ 0, and
there exists γ̂i such that

γ̂i U (xk , u k ) ≥ V̂i (F(xk , u k )), (3.4.23)

then γ̂i ∈ Ωγi .

Proof Consider the general value iteration algorithm (3.4.2)–(3.4.4). Owing to the
approximation error $\rho_0(x_k)$ in the iterative control law, for $i = 1$, we
have

\[
\Gamma_1(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_0(x_{k+1}) \} \le U(x_k, \hat{v}_0(x_k)) + \hat{V}_0(x_{k+1}),
\]

where $\hat{V}_0(x_k) = \Gamma_0(x_k) = \Psi(x_k)$. If $\pi_0(x_k) \ge 0$, then we can get $\hat{V}_1(x_k) \ge V_1(x_k)$.
Using mathematical induction, we can prove that $\hat{V}_i(x_k) \ge V_i(x_k)$ holds for
$i = 1, 2, \ldots$. According to (3.4.23), we can get $\hat{\gamma}_i U(x_k, u_k) \ge V_i(F(x_k, u_k))$, which
proves $\hat{\gamma}_i \in \Omega_{\gamma_i}$. This completes the proof of the theorem.

From Theorem 3.4.7, we can see that if for i = 0, 1, . . ., we can force the approxi-
mation error πi (xk ) ≥ 0, then the parameter γi can be estimated. An effective method
is to add another approximation error |πi (xk )| to the iterative value function V̂i+1 (xk ).
The new iterative value function can be defined as

\[
\begin{aligned}
\hat{\mathcal{V}}_i(x_k) &= U(x_k, \hat{v}_{i-1}(x_k)) + \hat{V}_{i-1}(x_{k+1}) + \pi_i(x_k) + |\pi_i(x_k)| \\
&= \hat{V}_i(x_k) + |\pi_i(x_k)|.
\end{aligned} \tag{3.4.24}
\]

Hence, the second estimation method of the parameter $\gamma_i$ can be described in
Algorithm 3.4.2.

Algorithm 3.4.2 Method II for estimating γi


1: For i = 1, 2, . . ., obtain V̂i (xk ) by (3.4.3).
2: Obtain the iterative value function Vˆi (xk ) by (3.4.24).
3: Let V̂i (xk ) = Vˆi (xk ).
4: Obtain the parameter γ̂i by (3.4.23).
5: Return γi = γ̂i .

For i = 1, 2, . . ., there exists a finite constant 0 < ηi ≤ 1 such that

V̂i (xk ) ≥ ηi Γi (xk ). (3.4.25)

If V̂i (xk ) ≥ Γi (xk ), then we define ηi = 1. We can derive the following results.

Lemma 3.4.5 Let $x_k \in \Omega_x$ be an arbitrary controllable state and Assumptions
3.2.1–3.2.3 hold. Let $\hat{V}_0(x_k) = V_0(x_k) = \Psi(x_k)$. For $i = 1, 2, \ldots$, let $\Gamma_i(x_k)$ be
expressed as in (3.4.8) and $\hat{V}_i(x_k)$ be expressed as in (3.4.3). Let $0 < \gamma_i < \infty$ be
a constant such that (3.4.12) holds. If for $i = 1, 2, \ldots$, there exists $0 < \eta_i \le 1$ such that
(3.4.25) holds, then

\[
\hat{V}_i(x_k) \ge \eta_i \Biggl[ 1 + \sum_{j=1}^{i-1} \eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}(\eta_{i-j} - 1)
\frac{\gamma_{i-1}\gamma_{i-2}\cdots\gamma_{i-j}}{(\gamma_{i-1}+1)(\gamma_{i-2}+1)\cdots(\gamma_{i-j}+1)} \Biggr] V_i(x_k),
\]

where we define $\sum_{j}^{i}(\cdot) = 0$ for all $j > i$, $i, j = 0, 1, \ldots$, and we let

\[
\eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}(\eta_{i-j} - 1) = (\eta_{i-1} - 1) \quad \text{for } j = 1.
\]

Theorem 3.4.8 For $i = 1, 2, \ldots$, let $0 < \gamma_i < \infty$ be a constant such that (3.4.12) holds.
Define a new iterative value function as

\[
\underline{V}_i(x_k) = \eta_i \Biggl[ 1 + \sum_{j=1}^{i-1} \eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}(\eta_{i-j} - 1)
\frac{\gamma_{i-1}\gamma_{i-2}\cdots\gamma_{i-j}}{(\gamma_{i-1}+1)(\gamma_{i-2}+1)\cdots(\gamma_{i-j}+1)} \Biggr] V_i(x_k),
\]

where $0 < \eta_i \le 1$. Then, for $i = 1, 2, \ldots$, we have

\[
0 < \underline{V}_i(x_k) \le V_i(x_k). \tag{3.4.26}
\]

Proof The conclusion can be proven by mathematical induction. First, let $i = 1$.
We have $\underline{V}_1(x_k) \le \eta_1 V_1(x_k)$, which shows (3.4.26) holds for $i = 1$. Assume that the
inequality (3.4.26) holds for $i = l - 1$, $l = 2, 3, \ldots$. Then, for $i = l$, we can get

\[
\begin{aligned}
&\eta_l \Biggl[ 1 + \sum_{j=1}^{l-1} \eta_{l-1}\eta_{l-2}\cdots\eta_{l-j+1}(\eta_{l-j} - 1)
\frac{\gamma_{l-1}\gamma_{l-2}\cdots\gamma_{l-j}}{(\gamma_{l-1}+1)(\gamma_{l-2}+1)\cdots(\gamma_{l-j}+1)} \Biggr] \\
&\quad = \eta_l \Biggl[ 1 + \frac{\gamma_{l-1}}{\gamma_{l-1}+1} \Biggl( \eta_{l-1} \Biggl[ 1 + \sum_{j=1}^{l-2} \eta_{l-2}\eta_{l-3}\cdots\eta_{l-j}
\frac{(\eta_{l-j-1} - 1)\gamma_{l-2}\gamma_{l-3}\cdots\gamma_{l-j-1}}{(\gamma_{l-2}+1)(\gamma_{l-3}+1)\cdots(\gamma_{l-j-1}+1)} \Biggr] - 1 \Biggr) \Biggr].
\end{aligned} \tag{3.4.27}
\]

As for $i = 1, 2, \ldots$, $0 < \eta_i \le 1$ and $0 < \gamma_i < \infty$, we have

\[
0 < \eta_{l-1} \Biggl[ 1 + \sum_{j=1}^{l-2} \eta_{l-2}\eta_{l-3}\cdots\eta_{l-j}(\eta_{l-j-1} - 1)
\frac{\gamma_{l-2}\gamma_{l-3}\cdots\gamma_{l-j-1}}{(\gamma_{l-2}+1)(\gamma_{l-3}+1)\cdots(\gamma_{l-j-1}+1)} \Biggr] \le 1. \tag{3.4.28}
\]

Substituting (3.4.28) into (3.4.27), we can obtain the conclusion for $i = l$. The proof
is complete.

From Theorem 3.4.8, we can estimate the lower bound of the iterative value function
$V_i(x_k)$. Hence, we can derive the following theorem.

Theorem 3.4.9 For $i = 1, 2, \ldots$, let $\hat{V}_i(x_k)$ be the iterative value function such that
(3.4.22) holds. If for $i = 1, 2, \ldots$, there exists $\hat{\gamma}_i$ such that (3.4.23) holds, then

\[
\underline{\gamma}_i = \frac{\hat{\gamma}_i}{\Delta_i} \in \Omega_{\gamma_i}, \tag{3.4.29}
\]

where

\[
\Delta_i = \eta_i \Biggl[ 1 + \sum_{j=1}^{i-1} \eta_{i-1}\eta_{i-2}\cdots\eta_{i-j+1}(\eta_{i-j} - 1)
\frac{\underline{\gamma}_{i-1}\underline{\gamma}_{i-2}\cdots\underline{\gamma}_{i-j}}{(\underline{\gamma}_{i-1}+1)(\underline{\gamma}_{i-2}+1)\cdots(\underline{\gamma}_{i-j}+1)} \Biggr]. \tag{3.4.30}
\]

Proof The conclusion can be proven by mathematical induction. First, let $i = 1$.
According to (3.4.30), we have

\[
\hat{\gamma}_1 U(x_k, u_k) \ge \hat{V}_1(F(x_k, u_k)) \ge \eta_1 V_1(F(x_k, u_k)).
\]

Assume that the conclusion holds for $i = l - 1$, $l = 2, 3, \ldots$. For $i = l$, according
to Lemma 3.4.5, we can get $\hat{V}_l(x_k) \ge \Delta_l V_l(x_k)$. From (3.4.23), we can obtain
$\hat{\gamma}_l U(x_k, u_k) \ge \hat{V}_l(F(x_k, u_k)) \ge \Delta_l V_l(F(x_k, u_k))$, which shows that
$\underline{\gamma}_l = \hat{\gamma}_l / \Delta_l \in \Omega_{\gamma_l}$.
The proof is complete.

The third estimation method of the parameter γi is described in Algorithm 3.4.3.

Algorithm 3.4.3 Method III for estimating γi


1: For $i = 1, 2, \ldots$, record $\eta_0, \eta_1, \ldots, \eta_{i-1}$ and $\underline{\gamma}_0, \underline{\gamma}_1, \ldots, \underline{\gamma}_{i-1}$.
2: Obtain $\hat{V}_i(x_k)$ and $\Gamma_i(x_k)$ by (3.4.3) and (3.4.8), respectively.
3: If $\hat{V}_i(x_k) \ge \Gamma_i(x_k)$, then let $\eta_i = 1$; else, obtain the approximation error $\eta_i$ by (3.4.25).
4: Obtain $\hat{\gamma}_i$ according to $\hat{V}_i(x_k)$ and $U(x_k, u_k)$. Solve $\underline{\gamma}_i$ by (3.4.29).
5: Return $\gamma_i = \underline{\gamma}_i$.
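The bookkeeping in Method III amounts to evaluating $\Delta_i$ from the recorded sequences and rescaling $\hat{\gamma}_i$. The sketch below is one possible way to code (3.4.29) and (3.4.30); the function names and the dictionary-based storage are assumptions made for illustration.

```python
def delta(i, eta, gamma):
    """Evaluate Delta_i from (3.4.30).

    eta and gamma map an iteration number to the recorded eta_m and to the
    recorded estimate of gamma_m, respectively (eta_1..eta_i, gamma_1..gamma_{i-1}).
    """
    total = 1.0
    for j in range(1, i):
        eta_prod = 1.0
        for m in range(i - 1, i - j, -1):        # eta_{i-1} ... eta_{i-j+1}; empty when j = 1
            eta_prod *= eta[m]
        gamma_prod = 1.0
        for m in range(i - 1, i - j - 1, -1):    # gamma_{i-1} ... gamma_{i-j}
            gamma_prod *= gamma[m] / (gamma[m] + 1.0)
        total += eta_prod * (eta[i - j] - 1.0) * gamma_prod
    return eta[i] * total

def gamma_estimate(gamma_hat, delta_i):
    """Rescale gamma_hat_i by Delta_i as in (3.4.29)."""
    return gamma_hat / delta_i
```

For instance, with $\eta_1 = 0.9$, $\eta_2 = 0.8$, and a recorded estimate of $3$ for iteration 1, the sketch gives $\Delta_1 = 0.9$ and $\Delta_2 = 0.8\,(1 - 0.1 \cdot 0.75) = 0.74$. Since $0 < \Delta_i \le 1$, the rescaled estimate is never smaller than $\hat{\gamma}_i$.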

Remark 3.4.3 We can see that if we record $\eta_j$ and $\underline{\gamma}_j$, $j = 1, 2, \ldots, i - 2$, then
we can compute $\Delta_{i-1}$ by (3.4.30). Then, we can obtain $\underline{\gamma}_{i-1}$ by (3.4.29). For
$i = 2, 3, \ldots$, the convergence criterion (3.4.18) can be verified. Hence, it is unnecessary
to construct a new iterative value function to estimate the parameter $\gamma_i$. On
the other hand, from the expression of (3.4.30), $0 < \Delta_i \le 1$ holds for $i = 1, 2, \ldots$.
It shows that the convergence criterion (3.4.29) is feasible. Therefore, it is recommended
to use Algorithm 3.4.3 to estimate $\gamma_i$. In this section, we will give comparisons
of these three methods.

Now, we summarize the finite-approximation-error-based general value iteration
algorithm in Algorithm 3.4.4.

Algorithm 3.4.4 General value iteration algorithm with finite approximation errors
Initialization:
Choose randomly an array of system states xk in Ωx . Choose a positive semidefinite function
Ψ (xk ) ≥ 0. Choose a convergence precision ε. Give a sequence {qi }, i = 0, 1, . . ., where 0 <
qi < 1. Give two constants 0 < ς < 1, 0 < ξ < 1.
Iteration:
Step 1. Let the iteration index i = 0.
Step 2. Let V0 (xk ) = Ψ (xk ) and obtain γ 0 by (3.4.17). Compute v̂0 (xk ) by (3.4.2).
Step 3. Let i = i + 1.
Step 4. Compute V̂i (xk ) by (3.4.3) and obtain v̂i (xk ) by (3.4.4).
Step 5. Obtain ηi by (3.4.9). If ηi (xk ) satisfies (3.4.14), then estimate γi by Algorithm Υ ,
Υ = 3.4.1, 3.4.2 or 3.4.3, and goto next step. Otherwise, decrease ρi−1 (xk ) and πi−1 (xk ), i.e.,
ρi−1 (xk ) = ςρi−1 (xk ) and πi−1 (xk ) = ξ πi−1 (xk ), respectively, e.g., by further NN training.
Goto Step 4.
Step 6. If |V̂i (xk ) − V̂i−1 (xk )| ≤ ε, then the optimal cost function is obtained and goto Step 7; else,
goto Step 3.
Step 7. Return V̂i (xk ) and v̂i (xk ).
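The control flow of Algorithm 3.4.4 can be sketched on a toy problem. The sketch assumes the scalar system $x_{k+1} = x_k + u_k$ with $U(x_k, u_k) = x_k^2 + u_k^2$ and $\Psi(x_k) = p_0 x_k^2$, so every value function is $p\,x_k^2$ and the target in (3.4.8) reduces to $1 + \hat{p}/(1 + \hat{p})$. The approximation error is modelled as a fixed multiplicative factor $\eta$, and $\gamma_{i-1} = 2 p_{i-1}$ is taken from the exact iteration instead of Algorithms 3.4.1 to 3.4.3. All of these are simplifying assumptions, not the procedure of the text.

```python
def gvi_with_errors(p0=1.0, eta=1.05, q_const=0.9, eps=1e-8, max_iter=200):
    """Toy version of Algorithm 3.4.4 with value functions V(x) = p * x^2."""

    def target(p):
        # Coefficient of Gamma_i in (3.4.8) for x_{k+1} = x + u, U = x^2 + u^2.
        return 1.0 + p / (1.0 + p)

    p_exact, p_hat = p0, p0
    for i in range(1, max_iter + 1):
        gamma_prev = 2.0 * p_exact                # p (x + u)^2 <= 2 p (x^2 + u^2)
        bound = q_const * (gamma_prev + 1.0) / gamma_prev   # criterion (3.4.14)
        if bound <= 1.0:
            raise ValueError("q_const must exceed gamma / (gamma + 1)")
        eta_i = eta
        while eta_i > bound:                      # shrink the error, as in Step 5
            eta_i = 1.0 + 0.5 * (eta_i - 1.0)
        p_new = eta_i * target(p_hat)             # V_hat_i = eta_i * Gamma_i
        p_exact = target(p_exact)                 # exact V_i, used only to obtain gamma_i
        if abs(p_new - p_hat) <= eps:             # Step 6 stopping test
            return p_new, i
        p_hat = p_new
    return p_hat, max_iter

p_final, steps = gvi_with_errors()
```

With the default 5% overestimation the iteration settles slightly above the exact optimum $(1 + \sqrt{5})/2$, which is the "bounded neighborhood" behavior the convergence criterion is designed to guarantee; setting `eta=1.0` recovers the exact limit.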

Remark 3.4.4 Generally speaking, in iterative ADP algorithms, the difference
between $\hat{V}_i(x_k)$ and $\Gamma_i(x_k)$ is obtained, i.e.,

\[
\hat{V}_i(x_k) - \Gamma_i(x_k) = \varepsilon_i(x_k),
\]

where $\varepsilon_i(x_k)$ is the approximation error function. According to the definition of $\eta_i$
in (3.4.9) and the convergence criterion (3.4.14), we can easily obtain the following
equivalent convergence criterion

\[
\varepsilon_i(x_k) \le \frac{1}{\gamma_{i-1} + 1} \hat{V}_i(x_k). \tag{3.4.31}
\]

From (3.4.31), we can see that different system states $x_k$ require different approximation errors $\varepsilon_i$ to guarantee the convergence of $\hat{V}_i(x_k)$. If $x_k$ is large, the present general value iteration algorithm tolerates large approximation errors and still converges; if $x_k$ is small, small approximation errors are required to ensure convergence. This is an important property for the neural network implementation of the finite-approximation-error-based general value iteration algorithm. For large $x_k$, the approximation errors of the neural networks can be large. As $x_k \to 0$, the approximation errors of the neural networks are also required to approach zero. As is known, the approximation of neural networks cannot be globally accurate with no approximation errors, so the implementation of the general value iteration algorithm by neural networks may be invalid as $x_k \to 0$. On the other hand, in the real world, neural networks are generally trained under a global uniform training precision or approximation error. Thus, the present general value iteration algorithm requires a high training precision and small approximation errors, which guarantee that the iterative value function converges on most of the state space.
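The state dependence of the bound in (3.4.31) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the quadratic value function, the matrix `P`, and the constant `gamma_prev` are hypothetical stand-ins chosen only to show how the admissible error shrinks as the state approaches the origin.

```python
import numpy as np

def admissible_error(v_hat, gamma_prev):
    """Right-hand side of (3.4.31): largest approximation error allowed at a state."""
    return v_hat / (gamma_prev + 1.0)

# Hypothetical stand-ins (not taken from the text): a quadratic value
# function x^T P x and a fixed value for gamma_{i-1}.
P = np.diag([1.0, 2.0])
gamma_prev = 50.0

def v_hat(x):
    return float(x @ P @ x)

direction = np.array([1.0, -1.0])
scales = [1.0, 0.1, 0.01, 0.0]
bounds = [admissible_error(v_hat(s * direction), gamma_prev) for s in scales]

for s, bound in zip(scales, bounds):
    print(f"scale = {s:<5} admissible error <= {bound:.3e}")
```

Under these assumptions the bound scales with the square of the state magnitude and vanishes at the origin, which is the situation the remark warns about for networks trained to a uniform precision.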

3.4.3 Simulation Study

In this section, we examine the performance of the present algorithm in a torsional pendulum system [22, 23, 28]. The dynamics of the pendulum is as follows

$$\begin{cases} \dfrac{d\theta}{dt} = \omega, \\[2mm] J\dfrac{d\omega}{dt} = u - Mgl\sin\theta - f_d\dfrac{d\theta}{dt}, \end{cases}$$

where $M = 1/3$ kg and $l = 2/3$ m are the mass and length of the pendulum bar, respectively. The system states are the current angle $\theta$ and the angular velocity $\omega$. Let $J = \frac{4}{3}Ml^2$ and $f_d = 0.2$ be the rotary inertia and frictional factor, respectively. Let $g = 9.8$ m/s$^2$ be the gravity.

Fig. 3.12 The trajectories of $\gamma$'s by $\Psi^1(x_k)$–$\Psi^4(x_k)$ ($\gamma$ versus iteration steps). a $\gamma_{\mathrm{I}}$, $\gamma_{\mathrm{II}}$, and $\gamma_{\mathrm{III}}$ by $\Psi^1(x_k)$. b $\gamma_{\mathrm{I}}$, $\gamma_{\mathrm{II}}$, and $\gamma_{\mathrm{III}}$ by $\Psi^2(x_k)$. c $\gamma_{\mathrm{I}}$, $\gamma_{\mathrm{II}}$, and $\gamma_{\mathrm{III}}$ by $\Psi^3(x_k)$. d $\gamma_{\mathrm{I}}$, $\gamma_{\mathrm{II}}$, and $\gamma_{\mathrm{III}}$ by $\Psi^4(x_k)$

Fig. 3.13 The admissible errors corresponding to $\Psi^1(x_k)$ and $\Psi^2(x_k)$. a Admissible errors by $\gamma_{\mathrm{I}}$ and $\Psi^1(x_k)$. b Admissible errors by $\gamma_{\mathrm{II}}$ and $\Psi^1(x_k)$. c Admissible errors by $\gamma_{\mathrm{III}}$ and $\Psi^1(x_k)$. d Admissible errors by $\gamma_{\mathrm{I}}$ and $\Psi^2(x_k)$. e Admissible errors by $\gamma_{\mathrm{II}}$ and $\Psi^2(x_k)$. f Admissible errors by $\gamma_{\mathrm{III}}$ and $\Psi^2(x_k)$

Discretization of the system function and value function using Euler and trapezoidal methods with the sampling interval $\Delta t = 0.1$ s leads to

$$\begin{bmatrix} x_{1(k+1)} \\ x_{2(k+1)} \end{bmatrix} = \begin{bmatrix} 0.1x_{2k} + x_{1k} \\ -0.49\sin(x_{1k}) - 0.1 f_d\, x_{2k} + x_{2k} \end{bmatrix} + \begin{bmatrix} 0 \\ 0.1 \end{bmatrix} u_k,$$

where $x_{1k} = \theta_k$ and $x_{2k} = \omega_k$. Let the initial state be $x_0 = [1, -1]^{\mathsf T}$. We choose 10000 states in $\Omega_x$. The critic and the action networks are also chosen as three-layer backpropagation (BP) neural networks, where the structures are set as 2–12–1 and 2–12–1, respectively.
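The discretized model above can be stepped forward directly. The following is a minimal sketch that uses the coefficients exactly as printed in the text; the function names and the zero-control placeholder policy are illustrative assumptions and do not correspond to the iterative control laws computed by the algorithm.

```python
import numpy as np

F_D = 0.2  # frictional factor f_d given in the text

def pendulum_step(x, u):
    """One step of the discretized torsional pendulum model as printed in the text."""
    x1, x2 = x
    x1_next = 0.1 * x2 + x1
    x2_next = -0.49 * np.sin(x1) - 0.1 * F_D * x2 + x2 + 0.1 * u
    return np.array([x1_next, x2_next])

def rollout(x0, policy, steps):
    """Simulate the closed loop x_{k+1} = F(x_k, policy(x_k)) for a number of steps."""
    states = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        x = states[-1]
        states.append(pendulum_step(x, policy(x)))
    return np.array(states)

def zero_policy(x):
    """Hypothetical placeholder control law (u = 0), used only to exercise the model."""
    return 0.0

trajectory = rollout([1.0, -1.0], zero_policy, steps=100)
print(trajectory[:3])
```

Any state-feedback function with the same signature, for instance a trained action network, could be passed in place of `zero_policy`.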
To illustrate the effectiveness of the present algorithm, we also choose four different initial value functions which are expressed by $\Psi^j(x_k) = x_k^{\mathsf T} P_j x_k$, $j = 1, \ldots, 4$. Let $P_1 = 0$. Let $P_2$–$P_4$ be initialized by positive definite matrices given by

Fig. 3.14 The plots of the admissible errors corresponding to $\Psi^3(x_k)$ and $\Psi^4(x_k)$. a Admissible errors by $\gamma_{\mathrm{I}}$ and $\Psi^3(x_k)$. b Admissible errors by $\gamma_{\mathrm{II}}$ and $\Psi^3(x_k)$. c Admissible errors by $\gamma_{\mathrm{III}}$ and $\Psi^3(x_k)$. d Admissible errors by $\gamma_{\mathrm{I}}$ and $\Psi^4(x_k)$. e Admissible errors by $\gamma_{\mathrm{II}}$ and $\Psi^4(x_k)$. f Admissible errors by $\gamma_{\mathrm{III}}$ and $\Psi^4(x_k)$

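$$P_2 = \begin{bmatrix} 2.35 & 3.31 \\ 3.31 & 9.28 \end{bmatrix}, \quad P_3 = \begin{bmatrix} 5.13 & -5.72 \\ -5.72 & 15.13 \end{bmatrix}, \quad P_4 = \begin{bmatrix} 100.78 & 5.96 \\ 5.96 & 20.51 \end{bmatrix},$$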

respectively. Let $q_i = 0.9999$ for $i = 0, 1, \ldots$, and let $\varsigma = \xi = 0.5$. We use the action network to obtain the admissible control law [23] with the expression given by $\mu(x_k) = W_{\text{initial}}\,\phi(V_{\text{initial}} x_k + b_{\text{initial}})$, where $\phi(\cdot)$ denotes the sigmoid activation function [23, 28]. The weight matrices are obtained as

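$$W_{\text{initial}} = \begin{bmatrix} 4.15 & -0.14 & 2.64 & 0.93 & -3.74 & 0.39 & 1.13 & 1.40 & 0.24 & 1.13 & -2.69 & -4.77 \\ -3.67 & -0.67 & -0.25 & -2.59 & 0.27 & -1.58 & 3.49 & 4.91 & 0.60 & -3.55 & 2.97 & 0.22 \end{bmatrix}^{\mathsf T},$$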
Fig. 3.15 The trajectories of the iterative value functions (iterative value functions versus iteration steps). a $V^{\gamma_{\mathrm{I}}}$, $V^{\gamma_{\mathrm{II}}}$ and $V^{\gamma_{\mathrm{III}}}$ by $\Psi^1(x_k)$. b $V^{\gamma_{\mathrm{I}}}$, $V^{\gamma_{\mathrm{II}}}$ and $V^{\gamma_{\mathrm{III}}}$ by $\Psi^2(x_k)$. c $V^{\gamma_{\mathrm{I}}}$, $V^{\gamma_{\mathrm{II}}}$ and $V^{\gamma_{\mathrm{III}}}$ by $\Psi^3(x_k)$. d $V^{\gamma_{\mathrm{I}}}$, $V^{\gamma_{\mathrm{II}}}$ and $V^{\gamma_{\mathrm{III}}}$ by $\Psi^4(x_k)$

$$V_{\text{initial}} = [\,0.01,\ 1.89,\ -0.12,\ 0.10,\ -0.02,\ 0.18,\ -0.01,\ -0.01,\ -1.75,\ 0.03,\ -0.00,\ 0.04\,],$$

and

$$b_{\text{initial}} = [\,-5.88,\ 0.81,\ -2.72,\ -2.07,\ -0.03,\ -0.23,\ -0.41,\ 1.59,\ 0.88,\ 1.97,\ -2.92,\ -4.28\,]^{\mathsf T}.$$

Initialized by $\Psi^j(x_k)$, $j = 1, \ldots, 4$, we choose 50000 data points in $\Omega_x \times \Omega_u$ to compute $\gamma_{\mathrm{I}}$, $\gamma_{\mathrm{II}}$, and $\gamma_{\mathrm{III}}$. The corresponding trajectories are presented in Fig. 3.12a, b, c and d, respectively.
According to $\gamma_{\mathrm{I}}$, $\gamma_{\mathrm{II}}$, and $\gamma_{\mathrm{III}}$, the admissible errors with $\Psi^1(x_k)$ are shown in Fig. 3.13a, b and c, respectively. The admissible errors corresponding to $\Psi^2(x_k)$ are shown in Fig. 3.13d, e and f, respectively. According to $\gamma_{\mathrm{I}}$, $\gamma_{\mathrm{II}}$, and $\gamma_{\mathrm{III}}$, the admissible errors corresponding to $\Psi^3(x_k)$ are shown in Fig. 3.14a, b and c, respectively. The admissible errors corresponding to $\Psi^4(x_k)$ are shown in Fig. 3.14d, e and f, respectively.

Fig. 3.16 Iterative control signals (controls versus time steps; curves labeled In and Lm). a Control signals by $\Psi^1(x_k)$ and $\gamma_{\mathrm{I}}$. b Control signals by $\Psi^2(x_k)$ and $\gamma_{\mathrm{II}}$. c Control signals by $\Psi^3(x_k)$ and $\gamma_{\mathrm{III}}$. d Control signals by $\Psi^4(x_k)$ and $\gamma_{\mathrm{III}}$
Initialized by Ψ 1 (xk )–Ψ 4 (xk ), the trajectories of the iterative value functions for
V γ I , V γ II and V γ III are shown in Fig. 3.15a, b, c and d, respectively.
The iterative control and state trajectories by Ψ 1 (xk ) and γ I are shown in
Figs. 3.16a and 3.17a, respectively. The iterative control and state trajectories by
Ψ 2 (xk ) and γ II are shown in Figs. 3.16b and 3.17b, respectively. The iterative control
and state trajectories by Ψ 3 (xk ) and γ III are shown in Figs. 3.16c and 3.17c, respec-
tively. The iterative control and state trajectories by Ψ 4 (xk ) and γ III are shown in
Figs. 3.16d and 3.17d, respectively.
From Figs. 3.15–3.17, we can see that for different initial value functions under
γ I , γ II and γ III , the iterative value functions can converge to a finite neighborhood of
the optimal cost function. The corresponding iterative controls and iterative states
are also convergent. Therefore, the effectiveness of the present finite-approximation-
error-based general value iteration algorithm has been shown for nonlinear systems.
Fig. 3.17 Iterative state trajectories (system states versus time steps; curves labeled In $x_1$, In $x_2$, Lm $x_1$, and Lm $x_2$). a Trajectories by $\Psi^1(x_k)$ and $\gamma_{\mathrm{I}}$. b Trajectories by $\Psi^2(x_k)$ and $\gamma_{\mathrm{II}}$. c Trajectories by $\Psi^3(x_k)$ and $\gamma_{\mathrm{III}}$. d Trajectories by $\Psi^4(x_k)$ and $\gamma_{\mathrm{III}}$

On the other hand, if the initial weights of the neural networks, such as the weights of the initial admissible control law, are changed, then the convergence property will also be different. Now, we also use the action network to obtain the admissible control law [23] with the expression given by

$$\mu'(x_k) = W_{\text{initial}}\,\phi(V_{\text{initial}} x_k + b_{\text{initial}}),$$

and the weight matrices are obtained as

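$$W_{\text{initial}} = \begin{bmatrix} -1.92 & -2.35 & 4.59 & -0.44 & -2.10 & -3.97 & 1.41 & -4.61 & -3.37 & 2.99 & -3.48 & -2.39 \\ 4.45 & 4.27 & -1.55 & 4.89 & 4.49 & -2.81 & 4.64 & 1.53 & 3.32 & 3.81 & 3.30 & 4.24 \end{bmatrix}^{\mathsf T},$$

$$V_{\text{initial}} = [\,-0.064,\ 0.002,\ 0.013,\ 0.012,\ -0.101,\ 0.015,\ 0.008,\ 0.019,\ -0.178,\ -0.002,\ -0.123,\ -0.091\,],$$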
Fig. 3.18 Comparisons for different $\mu(x_k)$. a $\gamma_{\mathrm{I}}$ and $\gamma'_{\mathrm{I}}$ by $\mu(x_k)$ and $\mu'(x_k)$ ($\gamma$ versus iteration steps). b $V^{\gamma_{\mathrm{I}}}$ and $V^{\gamma'_{\mathrm{I}}}$ (iterative value function versus iteration steps). c System states by $\gamma_{\mathrm{I}}$ and $\gamma'_{\mathrm{I}}$ (versus time steps). d Optimal control by $\gamma_{\mathrm{I}}$ and $\gamma'_{\mathrm{I}}$ (versus time steps)

and

$$b_{\text{initial}} = [\,-4.87,\ 3.90,\ -3.09,\ 2.21,\ 1.35,\ 0.44,\ -0.45,\ -1.33,\ -2.16,\ 3.08,\ -3.92,\ -4.86\,]^{\mathsf T}.$$

We choose $\Psi^3(x_k)$ as the initial value function to facilitate the analysis of the simulation results. The trajectories of $\gamma_{\mathrm{I}}$ by $\mu(x_k)$ and $\gamma'_{\mathrm{I}}$ by $\mu'(x_k)$ are shown in Fig. 3.18a. We can see that if the initial weights of the neural network for the admissible control law are changed, then the parameter $\gamma_{\mathrm{I}}$ also changes. For different $\gamma_{\mathrm{I}}$, according to Theorem 3.4.4, the iterative value function can be convergent under different approximation errors. The trajectories of the iterative value function by $\gamma_{\mathrm{I}}$ and $\gamma'_{\mathrm{I}}$ are shown in Fig. 3.18b, where we can see that for different initial weights of the neural network for the admissible control law, the iterative value function converges to different values. Hence, the initial weights of the neural network lead to different convergence properties. The optimal trajectories of system states and control by $\gamma_{\mathrm{I}}$ and $\gamma'_{\mathrm{I}}$ are shown in Fig. 3.18c and d, respectively.

3.5 Conclusions

In this chapter, several optimal control problems are solved by using value iteration ADP algorithms for infinite-horizon discrete-time nonlinear systems with finite approximation errors. The iterative control laws that guarantee that the iterative value functions reach the optimum are obtained by using the iterative ADP algorithms. Then, a novel numerical adaptive learning control scheme based on the iterative ADP algorithms is presented to solve numerical optimal control problems, which extends the implementation to both online and offline use. Finally, a GVI ADP algorithm is developed with the necessary analysis results, which overcomes the disadvantage of the traditional VI algorithm.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Agricola I, Friedrich T (2008) Elementary geometry. American Mathematical Society, Providence
3. Almudevar A, Arruda EF (2012) Optimal approximation schedules for a class of iterative algo-
rithms with an application to multigrid value iteration. IEEE Trans Autom Control 57(12):3132–
3146
4. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B
Cybern 38(4):943–949
5. Arruda EF, Ourique FO, LaCombe J, Almudevar A (2013) Accelerating the convergence of
value iteration by using partial transition functions. Eur J Oper Res 229(1):190–198
6. Beard R (1995) Improving the closed-loop performance of nonlinear systems. Ph.D. thesis,
Rensselaer Polytechnic Institute
7. Bertsekas DP (2007) Dynamic programming and optimal control, 3rd edn. Athena Scientific,
Belmont
8. Cheng T, Lewis FL, Abu-Khalaf M (2007) A neural network solution for fixed-final time
optimal control of nonlinear systems. Automatica 43(3):482–490
9. Dorf RC, Bishop RH (2011) Modern control systems, 12th edn. Prentice-Hall, Upper Saddle
River
10. Feinberg EA, Huang J (2014) The value iteration algorithm is not strongly polynomial for
discounted dynamic programming. Oper Res Lett 42(2):130–131
11. Grune L, Rantzer A (2008) On the infinite horizon performance of receding horizon controllers.
IEEE Trans Autom Control 53(9):2100–2111
12. Hagan MT, Demuth HB, Beale MH (1996) Neural network design. PWS Publishing Company,
Boston
13. Haykin S (2009) Neural networks and learning machines, 3rd edn. Prentice-Hall, Upper Saddle
River
14. Jin N, Liu D, Huang T, Pang Z (2007) Discrete-time adaptive dynamic programming using
wavelet basis function neural networks. In: Proceedings of the IEEE symposium on approximate
dynamic programming and reinforcement learning, pp 135–142
15. Kushner HJ (2010) Numerical algorithms for optimal controls for nonlinear stochastic systems
with delays. IEEE Trans Autom Control 55(9):2170–2176
16. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50

17. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag
32(6):76–105
18. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
19. Liu D, Zhang Y, Zhang H (2005) A self-learning call admission control scheme for CDMA
cellular networks. IEEE Trans Neural Netw 16(5):1219–1228
20. Liu D, Javaherian H, Kovalenko O, Huang T (2008) Adaptive critic learning techniques for
engine torque and air-fuel ratio control. IEEE Trans Syst Man Cybern Part B Cybern 38(4):988–
993
21. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
22. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
23. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
24. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern Part C Appl Rev 32(2):140–153
25. Navarro-Lopez EM (2007) Local feedback passivation of nonlinear discrete-time systems
through the speed-gradient algorithm. Automatica 43(7):1302–1306
26. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc Control
Theory Appl 153(5):567–574
27. Seiffertt J, Sanyal S, Wunsch DC (2008) Hamilton-Jacobi-Bellman equations and approximate
dynamic programming on time scales. IEEE Trans Syst Man Cybern Part B Cybern 38(4):918–
923
28. Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
29. Sira-Ramirez H (1991) Non-linear discrete variable structure systems in quasi-sliding mode.
Int J Control 54(5):1171–1187
30. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learn-
ing solution of coupled Hamilton-Jacobi equations. Automatica 47(8):1556–1569
31. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
32. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
33. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36
34. Wei Q, Liu D (2012) An iterative ε-optimal control scheme for a class of discrete-time nonlinear
systems with unfixed initial state. Neural Netw 32:236–244
35. Wei Q, Liu D (2012) A new optimal control method for discrete-time nonlinear systems with
approximation errors. In: Proceedings of the world congress on intelligent control and automa-
tion, pp 185–190
36. Wei Q, Liu D (2012) Adaptive dynamic programming with stable value iteration algorithm for
discrete-time nonlinear systems. In: Proceedings of the IEEE international joint conference on
neural networks, pp 1–6
37. Wei Q, Liu D (2013) Numerical adaptive learning control scheme for discrete-time non-linear
systems. IET Control Theory Appl 7(18):1472–1486
38. Wei Q, Liu D (2014) A novel iterative θ-Adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
39. Wei Q, Liu D (2014) Stable iterative adaptive dynamic programming algorithm with approxi-
mation errors for discrete-time nonlinear systems. Neural Comput Appl 24(6):1355–1367

40. Wei Q, Liu D (2015) A novel policy iteration based deterministic Q-learning for discrete-time
nonlinear systems. Sci China Inf Sci 58(12):1–15
41. Wei Q, Liu D (2015) Neural-network-based adaptive optimal tracking control scheme for
discrete-time nonlinear systems with approximation errors. Neurocomputing 149:106–115
42. Wei Q, Liu D, Lewis FL (2015) Optimal distributed synchronization control for continuous-
time heterogeneous multi-agent differential graphical games. Inf Sci 317:96–113
43. Wei Q, Wang FY, Liu D, Yang X (2014) Finite-approximation-error-based discrete-time itera-
tive adaptive dynamic programming. IEEE Trans Cybern 44(12):2820–2833
44. Wei Q, Zhang H, Dai J (2009) Model-free multiobjective approximate dynamic programming
for discrete-time nonlinear systems with general performance index functions. Neurocomputing
72(7–9):1839–1848
45. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. Gen Syst Yearb 22:25–38
46. Werbos PJ (1991) A menu of designs for reinforcement learning over time. In: Miller WT,
Sutton RS, Werbos PJ (eds) Neural networks for control. MIT Press, Cambridge, pp 67–95
47. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: Neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
48. Yang Q, Vance JB, Jagannathan S (2008) Control of nonaffine nonlinear discrete-time systems
using reinforcement-learning-based linearly parameterized neural networks. IEEE Trans Syst
Man Cybern Part B Cybern 38(4):994–1001
49. Younkin G, Hesla E (2008) Origin of numerical control. IEEE Ind Appl Mag 14(5):10–12
50. Zhang C, Ordonez R (2007) Numerical optimization-based extremum seeking control with
application to ABS design. IEEE Trans Autom Control 52(3):454–467
51. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans
Syst Man Cybern Part B Cybern 38(4):937–942
52. Zhang H, Luo Y, Liu D (2009) The RBF neural network-based near-optimal control for a class
of discrete-time affine nonlinear systems with control constraint. IEEE Trans Neural Netw
20(9):1490–1503
53. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
Chapter 4
Policy Iteration for Optimal Control
of Discrete-Time Nonlinear Systems

4.1 Introduction

Value iteration and policy iteration ADP algorithms are effective methods for obtaining solutions of the Bellman equation indirectly [6, 8–11, 13, 19–21, 24]. Value iteration has been discussed at length in the last two chapters. Policy iteration is the main focus of this chapter.
A policy iteration algorithm for continuous-time systems was proposed in [14]. In [14], Murray et al. proved that, for continuous-time affine nonlinear systems, the iterative value function converges to the optimum non-increasingly and each of the iterative controls stabilizes the nonlinear system under the policy iteration algorithm. This is a merit of the policy iteration algorithm, and it has therefore found many applications in solving optimal control problems of nonlinear systems. In 2005, Abu-Khalaf and Lewis [1] proposed a policy iteration algorithm for continuous-time nonlinear systems with control constraints. Zhang et al. [25] applied a policy iteration algorithm to solve continuous-time nonlinear two-person zero-sum games. Vamvoudakis et al. [17] proposed multiagent differential graphical games for continuous-time linear systems using policy iteration algorithms. Bhasin et al. [3] proposed an online actor–critic–identifier architecture to obtain the optimal control law for uncertain continuous-time nonlinear systems by policy iteration algorithms. To date, nearly all online iterative ADP algorithms are policy iteration algorithms. However, most discussions of policy iteration algorithms are based on continuous-time systems [3, 17, 23, 25]. Discussions of policy iteration algorithms for discrete-time control systems are scarce. Only in [8, 9, 16] was a brief sketch of the policy iteration algorithm for discrete-time systems given, and the stability and convergence properties were not discussed. To the best of our knowledge, policy iteration algorithms for discrete-time systems have received little dedicated attention. Therefore, in this chapter, a policy iteration method for optimal control is developed for discrete-time nonlinear systems [12].

© Springer International Publishing AG 2017
D. Liu et al., Adaptive Dynamic Programming with Applications in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_4

4.2 Policy Iteration Algorithm

Consider the following deterministic discrete-time systems

$$x_{k+1} = F(x_k, u_k), \quad k = 0, 1, 2, \ldots, \tag{4.2.1}$$

where $x_k \in \mathbb{R}^n$ is the $n$-dimensional state vector, and $u_k \in \mathbb{R}^m$ is the $m$-dimensional control vector. Let $x_0$ be the initial state, and $F(x_k, u_k)$ be the system function.
Let

$$\underline{u}_k = (u_k, u_{k+1}, \ldots)$$

be an arbitrary sequence of controls from $k$ to $\infty$. The cost function for state $x_0$ under the control sequence $\underline{u}_0 = (u_0, u_1, \ldots)$ is defined as

$$J(x_0, \underline{u}_0) = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{4.2.2}$$

where U(xk , uk ) is the utility function.
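For a fixed feedback law, the cost (4.2.2) can be approximated by truncating the infinite sum. The following is a minimal sketch under stated assumptions: the scalar linear system, the quadratic utility, and the gain `K` are hypothetical and chosen only because the infinite-horizon cost then has a closed form to compare against.

```python
def truncated_cost(x0, step, utility, policy, horizon):
    """Partial sum of (4.2.2) over k = 0, ..., horizon - 1 along the closed loop."""
    x, total = x0, 0.0
    for _ in range(horizon):
        u = policy(x)
        total += utility(x, u)
        x = step(x, u)
    return total

# Hypothetical scalar example: x_{k+1} = a x_k + b u_k, U = q x^2 + r u^2, u = -K x.
a, b, q, r, K = 0.9, 0.5, 1.0, 1.0, 0.4

def step(x, u):
    return a * x + b * u

def utility(x, u):
    return q * x**2 + r * u**2

def policy(x):
    return -K * x

x0 = 2.0
J_trunc = truncated_cost(x0, step, utility, policy, horizon=200)

# Closed form for this special case: J = p x0^2 with p = (q + r K^2) / (1 - (a - b K)^2).
J_closed = (q + r * K**2) / (1.0 - (a - b * K)**2) * x0**2
print(J_trunc, J_closed)
```

The truncated sum approaches the closed-form value because the closed-loop factor $|a - bK| = 0.7$ is less than one, so the tail of the series decays geometrically.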


We will study optimal control problems for (4.2.1). The goal is to find an optimal
control scheme which stabilizes system (4.2.1) and simultaneously minimizes the
cost function (4.2.2). For convenience of analysis, results of this chapter are based
on the following assumption.

Assumption 4.2.1 The origin $x_k = 0$ is an equilibrium state of system (4.2.1) under the control $u_k = 0$, i.e., $F(0, 0) = 0$; the feedback control $u_k = u(x_k)$ satisfies $u_k = u(x_k) = 0$ for $x_k = 0$; the system function $F(x_k, u_k)$ is Lipschitz continuous on a compact set $\Omega \subset \mathbb{R}^n$ containing the origin; the system (4.2.1) is controllable on $\Omega$; the utility function $U(x_k, u_k)$ is a positive-definite function of $x_k$ and $u_k$.

As system (4.2.1) is assumed to be controllable, there exists a stable control sequence $\underline{u}_k = (u_k, u_{k+1}, \ldots)$ that moves the state $x_k$ to zero. For optimal control problems, the designed feedback control must not only stabilize the system (4.2.1) on $\Omega$, but also guarantee that the cost function (4.2.2) is finite, i.e., the control must be admissible (cf. Definition 2.2.1 or [2]). Let $\mathcal{A}(\Omega)$ denote the set which contains all the admissible control sequences associated with the controllable set $\Omega$ of states. Then, the optimal cost function can be defined as

$$J^*(x_k) = \inf_{\underline{u}_k} \left\{ J(x_k, \underline{u}_k) : \underline{u}_k \in \mathcal{A}(\Omega) \right\}.$$

According to Bellman’s principle of optimality, the optimal cost function $J^*(x_k)$ satisfies the following Bellman equation

$$J^*(x_k) = \min_{u_k} \left\{ U(x_k, u_k) + J^*(F(x_k, u_k)) \right\}. \tag{4.2.3}$$

Define the law of optimal control as $u^*(\cdot)$. Each $u_k^* = u^*(x_k)$ is obtained by

$$u_k^* = \arg\min_{u_k} \left\{ U(x_k, u_k) + J^*(F(x_k, u_k)) \right\}.$$

Hence, the Bellman equation (4.2.3) can be written as

$$\begin{aligned} J^*(x_k) &= U(x_k, u_k^*) + J^*(F(x_k, u_k^*)) \\ &= U(x_k, u^*(x_k)) + J^*(F(x_k, u^*(x_k))). \end{aligned} \tag{4.2.4}$$

We can see that if we want to obtain the optimal control sequence uk , we must
obtain the optimal cost function J ∗ (xk ). Generally speaking, J ∗ (xk ) is unknown before
the whole control sequence uk is considered. If we adopt the traditional dynamic
programming method to obtain the optimal cost function at every time step, then
we have to face the “curse of dimensionality”. Furthermore, the optimal control
is discussed in infinite horizon. This means the length of the control sequence is
infinite, which implies that the optimal control sequence is nearly impossible to
obtain analytically by the Bellman equation (4.2.4). To overcome this difficulty,
a novel discrete-time policy iteration ADP algorithm will be developed to obtain
the optimal control sequence for nonlinear systems. The goal of the present policy
iteration ADP algorithm is to construct iterative control laws vi (xk ), which move an
arbitrary initial state x0 to the equilibrium xk = 0, and simultaneously guarantee that
the iterative value functions reach the optimum.

4.2.1 Derivation of Policy Iteration Algorithm

In the present policy iteration algorithm, the value functions and control laws are
updated by iterations, with the iteration index i increasing from 0 to ∞. Starting
from an arbitrary admissible control law v0 (xk ), the PI algorithm can be described
as follows.
For $i = 1, 2, \ldots$, the iterative value function $V_i(x_k)$ is constructed from $v_{i-1}(x_k)$ by solving the following generalized Bellman equation

$$V_i(x_k) = U(x_k, v_{i-1}(x_k)) + V_i(F(x_k, v_{i-1}(x_k))), \tag{4.2.5}$$

and the iterative control law $v_i(x_k)$ is updated by

$$\begin{aligned} v_i(x_k) &= \arg\min_{u_k} \left\{ U(x_k, u_k) + V_i(x_{k+1}) \right\} \\ &= \arg\min_{u_k} \left\{ U(x_k, u_k) + V_i(F(x_k, u_k)) \right\}. \end{aligned} \tag{4.2.6}$$

In the above PI algorithm, (4.2.5) is called policy evaluation (or value update) and (4.2.6) is called policy improvement (or policy update) [16, 18]. The iterative process of the PI algorithm in (4.2.5) and (4.2.6) is illustrated in Table 4.1, where the iteration flows as in Fig. 2.1. The algorithm starts with an admissible control $v_0(x_k)$, and it obtains $v_1(x_k)$ at the end of the first iteration. Comparing Tables 2.2 and 4.1, we note that the “calculation” performed in Table 2.2 using (2.2.11) is a simpler operation than the “evaluation” in Table 4.1 based on (4.2.5). In (4.2.5), one needs to solve for $V_i(x_k)$ from a nonlinear functional equation, whereas (2.2.11) is a simple calculation.

Table 4.1 The iterative process of the PI algorithm in (4.2.5) and (4.2.6)

Iteration               i = 1       i = 2       i = 3       ···
Evaluation (4.2.5)      v0 → V1     v1 → V2     v2 → V3     ···
Minimization (4.2.6)    V1 → v1     V2 → v2     V3 → v3     ···
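The alternation between (4.2.5) and (4.2.6) can be sketched for a special case in which both steps have closed forms. The example below assumes a linear system $x_{k+1} = Ax_k + Bu_k$ with quadratic utility $U = x^{\mathsf T}Qx + u^{\mathsf T}Ru$, so that $V_i(x) = x^{\mathsf T}P_ix$ and $v_i(x) = -K_ix$; the matrices are hypothetical, and this is an illustration of the iteration pattern rather than the neural-network implementation used in the book.

```python
import numpy as np

def evaluate_policy(A, B, Q, R, K):
    """Policy evaluation (4.2.5) for u = -K x: solve P = Q + K'RK + Ac' P Ac."""
    Ac = A - B @ K
    n = A.shape[0]
    rhs = (Q + K.T @ R @ K).reshape(-1)
    # vec(Ac' P Ac) = kron(Ac', Ac') vec(P), so the equation is linear in vec(P).
    lhs = np.eye(n * n) - np.kron(Ac.T, Ac.T)
    return np.linalg.solve(lhs, rhs).reshape(n, n)

def improve_policy(A, B, R, P):
    """Policy improvement (4.2.6): minimize x'Qx + u'Ru + (Ax+Bu)'P(Ax+Bu) over u."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Hypothetical system and weights; A is stable, so K = 0 is an admissible start.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.zeros((1, 2))                      # v_0
Ps = []
for i in range(1, 11):
    P = evaluate_policy(A, B, Q, R, K)    # V_i from v_{i-1}
    K = improve_policy(A, B, R, P)        # v_i from V_i
    Ps.append(P)

print(Ps[-1])
```

In this linear-quadratic setting the evaluation step reduces to a linear matrix equation, whereas for general nonlinear systems (4.2.5) has to be solved approximately, which is the difficulty the paragraph above points out.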
From the policy iteration algorithm (4.2.5) and (4.2.6), we can see that the iterative value function $V_i(x_k)$ is used to approximate $J^*(x_k)$, and the iterative control law $v_i(x_k)$ is used to approximate $u^*(x_k)$. As (4.2.5) is generally not a Bellman equation, the iterative value functions and the iterative control laws generally satisfy

$$V_i(x_k) \neq J^*(x_k), \quad v_i(x_k) \neq u^*(x_k), \quad \forall i = 1, 2, \ldots.$$

Therefore, it is necessary to determine whether the PI algorithm is convergent, that is, whether $V_i(x_k)$ and $v_i(x_k)$ converge to the optimal ones as $i \to \infty$. In the next section, we will show such properties of the above policy iteration algorithm.

4.2.2 Properties of Policy Iteration Algorithm

For the policy iteration algorithm of continuous-time nonlinear systems, it is shown


that any of the iterative control laws can stabilize the system [14]. This is a merit of
the policy iteration algorithm. For discrete-time systems, stability of the system can
also be guaranteed under the iterative control law.
Lemma 4.2.1 For i = 0, 1, . . ., let Vi (xk ) and vi (xk ) be obtained by the policy iter-
ation algorithm (4.2.5) and (4.2.6), where v0 (xk ) is an arbitrary admissible control
law. If Assumption 4.2.1 holds, then the iterative control law vi (xk ), ∀i = 0, 1, . . .,
stabilizes the nonlinear system (4.2.1).
The lemma can easily be proved by considering (4.2.5), which shows that Vi (xk )
is a Lyapunov function.
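The Lyapunov argument can be written out in one line. This is only a sketch: it assumes that $V_i$ is positive definite (which the positive definiteness of $U$ in Assumption 4.2.1 suggests) and reads $x_{k+1}$ as the closed-loop successor $F(x_k, v_{i-1}(x_k))$. Rearranging (4.2.5) then gives

```latex
\begin{aligned}
V_i(x_{k+1}) - V_i(x_k)
  &= V_i\big(F(x_k, v_{i-1}(x_k))\big) - V_i(x_k) \\
  &= -\,U\big(x_k, v_{i-1}(x_k)\big) < 0, \qquad x_k \neq 0,
\end{aligned}
```

so $V_i$ decreases strictly along trajectories generated by $v_{i-1}(x_k)$, which is the usual Lyapunov decrease condition for that control law.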

For continuous-time policy iteration algorithms [14], according to the derivative


of the difference of iterative value functions with respect to time t, it is proven that
the iterative value functions are non-increasingly convergent to the optimal solution
of the Bellman equation. For discrete-time systems, the analysis method for the
continuous-time is no longer applicable. Thus, a new convergence proof is established
in this chapter. We will show that using the policy iteration algorithm of discrete-time
nonlinear systems, the iterative value functions are also nonincreasingly convergent
to the optimum.

Theorem 4.2.1 For $i = 0, 1, \ldots$, let $V_i(x_k)$ and $v_i(x_k)$ be obtained by (4.2.5) and (4.2.6). If Assumption 4.2.1 holds, then the iterative value function $V_i(x_k)$, $\forall x_k \in \mathbb{R}^n$, is a monotonically nonincreasing sequence, i.e.,

$$V_{i+1}(x_k) \le V_i(x_k), \quad \forall i \ge 0. \tag{4.2.7}$$

Proof For $i = 0, 1, \ldots$, define a new iterative value function $\Gamma_{i+1}(x_k)$ as

$$\Gamma_{i+1}(x_k) = U(x_k, v_i(x_k)) + V_i(x_{k+1}), \tag{4.2.8}$$

where $v_i(x_k)$ is obtained by (4.2.6). According to (4.2.5), (4.2.6), and (4.2.8), we can obtain

$$\Gamma_{i+1}(x_k) \le V_i(x_k), \quad \forall x_k. \tag{4.2.9}$$

Then, we derive

$$\begin{aligned} \Gamma_{i+1}(x_k) &= U(x_k, v_i(x_k)) + V_i(x_{k+1}) \\ &\ge U(x_k, v_i(x_k)) + \Gamma_{i+1}(x_{k+1}) \\ &\ge \sum_{j=0}^{N-1} U(x_{k+j}, v_i(x_{k+j})) + \Gamma_{i+1}(x_{k+N}), \end{aligned}$$

where $N > 0$ is an integer. Let $N \to \infty$. We have

$$V_i(x_k) \ge \Gamma_{i+1}(x_k) \ge \sum_{j=0}^{\infty} U(x_{k+j}, v_i(x_{k+j})) + \lim_{N \to \infty} \Gamma_{i+1}(x_{k+N}). \tag{4.2.10}$$

From (4.2.10), we get

$$V_{i+1}(x_k) = \sum_{j=0}^{\infty} U(x_{k+j}, v_i(x_{k+j})) \le \Gamma_{i+1}(x_k). \tag{4.2.11}$$

Therefore, according to (4.2.9), we have

$$V_{i+1}(x_k) \le \Gamma_{i+1}(x_k) \le V_i(x_k), \quad \forall x_k.$$

The proof is complete.
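The sandwich $V_{i+1} \le \Gamma_{i+1} \le V_i$ used in the proof can be checked numerically in a simple case. The sketch below assumes a hypothetical scalar linear system with quadratic utility, for which $V_i(x) = p_i x^2$, $v_i(x) = -K_i x$, and $\Gamma_{i+1}(x) = g_{i+1} x^2$ all have closed forms; it illustrates the inequality and is not part of the proof.

```python
# Hypothetical scalar plant x_{k+1} = a x + b u with utility U = q x^2 + r u^2.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
K = 1.0  # initial gain v_0(x) = -K x, admissible since |a - b*K| = 0.2 < 1

def evaluate(K):
    """Coefficient p of V(x) = p x^2 solving (4.2.5) under u = -K x."""
    return (q + r * K**2) / (1.0 - (a - b * K)**2)

def improve(p):
    """Gain minimizing q x^2 + r u^2 + p (a x + b u)^2, i.e. (4.2.6)."""
    return a * b * p / (r + b**2 * p)

def gamma_coeff(p, K):
    """Coefficient g of Gamma_{i+1}(x) = g x^2 from (4.2.8)."""
    return q + r * K**2 + p * (a - b * K)**2

p = evaluate(K)                     # V_1 from v_0
history = []
for i in range(1, 8):
    K = improve(p)                  # v_i from V_i
    g = gamma_coeff(p, K)           # Gamma_{i+1}
    p_next = evaluate(K)            # V_{i+1} from v_i
    history.append((p_next, g, p))
    p = p_next

for p_next, g, p_prev in history:
    print(f"V_next = {p_next:.6f} <= Gamma = {g:.6f} <= V = {p_prev:.6f}")
```

Each printed row corresponds to one iteration index $i$, with the three coefficients ordered as in the final inequality of the proof.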


Since $V_i(x_k)$ is finite, we have $U(x_{k+N}, v_i(x_{k+N})) \to 0$ as $N \to \infty$, which implies $x_{k+N} \to 0$ as $N \to \infty$. Thus, $v_i(x_k)$ is a stable control law. On the other hand, we get

$$\sum_{j=0}^{\infty} U(x_{k+j}, v_i(x_{k+j})) < \infty.$$

Hence, $v_i(x_k)$ is an admissible control law.


Corollary 4.2.1 For i = 0, 1, . . ., let Vi (xk ) and vi (xk ) be obtained by (4.2.5) and
(4.2.6), where v0 (xk ) is an arbitrary admissible control law. If Assumption 4.2.1 holds,
then the iterative control law vi (xk ) is admissible, ∀i = 0, 1, . . ..
From Theorem 4.2.1, the iterative value function Vi (xk ) ≥ 0 is monotonically non-
increasing and bounded below for iteration index i = 1, 2, . . .. Hence, we can derive
the following theorem.
Theorem 4.2.2 For $i = 0, 1, \ldots$, let $V_i(x_k)$ and $v_i(x_k)$ be obtained by (4.2.5) and (4.2.6). If Assumption 4.2.1 holds, then the iterative value function $V_i(x_k)$ converges to the optimal cost function $J^*(x_k)$, as $i \to \infty$, i.e.,

$$\lim_{i \to \infty} V_i(x_k) = J^*(x_k), \quad \forall x_k, \tag{4.2.12}$$

which satisfies the Bellman equation (4.2.3).


Proof The statement can be proven by the following three steps.
(1) Show that the limit of the iterative value function Vi (xk ) satisfies the Bellman
equation, as i → ∞.
According to Theorem 4.2.1, as Vi (xk ) is non-increasing and lower bounded, the
limit of the iterative value function Vi (xk ) exists as i → ∞. Define the value function
V∞ (xk ) as the limit of the iterative function Vi (xk ), i.e.,

V∞ (xk ) = lim Vi (xk ).


i→∞

According to the definition of Γi+1 (xk ) in (4.2.8), we have

Γi+1 (xk ) = U(xk , vi (xk )) + Vi (xk+1 )


= min {U(xk , uk ) + Vi (xk+1 )} .
uk

According to (4.2.11), we can get

Vi+1 (xk ) ≤ Γi+1 (xk ), ∀xk .


4.2 Policy Iteration Algorithm 157

We can obtain

V∞ (xk ) ≤ Vi+1 (xk ) ≤ Γi+1 (xk ) = min {U(xk , uk ) + Vi (xk+1 )} .


uk

Thus, letting i → ∞, we obtain

V∞ (xk ) ≤ min{U(xk , uk ) + V∞ (xk+1 )}. (4.2.13)


uk

Let ε > 0 be an arbitrary positive number. Since Vi (xk ) is nonincreasing and converges to V∞ (xk ), there exists a positive integer p such that

$$V_p(x_k) - \varepsilon \le V_\infty(x_k) \le V_p(x_k). \tag{4.2.14}$$

Hence, we can get

$$\begin{aligned}
V_\infty(x_k) &\ge U(x_k, v_p(x_k)) + V_p(F(x_k, v_p(x_k))) - \varepsilon \\
&\ge U(x_k, v_p(x_k)) + V_\infty(F(x_k, v_p(x_k))) - \varepsilon \\
&\ge \min_{u_k}\{U(x_k, u_k) + V_\infty(x_{k+1})\} - \varepsilon.
\end{aligned}$$

Since ε is arbitrary, we have

$$V_\infty(x_k) \ge \min_{u_k}\{U(x_k, u_k) + V_\infty(x_{k+1})\}. \tag{4.2.15}$$

Combining (4.2.13) and (4.2.15), we can obtain

$$V_\infty(x_k) = \min_{u_k}\{U(x_k, u_k) + V_\infty(x_{k+1})\}, \tag{4.2.16}$$

i.e., V∞ (xk ) satisfies the Bellman equation (4.2.3).


Next, let μ(xk ) be an arbitrary admissible control law, and define a new value
function P(xk ), which satisfies

P(xk ) = U(xk , μ(xk )) + P(xk+1 ). (4.2.17)

Then, we can proceed to the second step of the proof.


(2) Show that for an arbitrary admissible control law μ(xk ), the value function
V∞ (xk ) ≤ P(xk ).
For i = 1, 2, . . . , define a new iterative value function Πi (xk ) as

$$\Pi_i(x_k) = U(x_k, \mu(x_k)) + \Pi_{i-1}(x_{k+1}),$$

where Π0 (xk ) = V∞ (xk ). Then, for i = 1, we get

$$\begin{aligned}
\Pi_1(x_k) &= U(x_k, \mu(x_k)) + \Pi_0(x_{k+1}) \\
&= U(x_k, \mu(x_k)) + V_\infty(x_{k+1}) \\
&\ge \min_{u_k}\{U(x_k, u_k) + V_\infty(x_{k+1})\} \\
&= V_\infty(x_k).
\end{aligned}$$

Using mathematical induction, we can obtain

V∞ (xk ) ≤ Πi (xk ), ∀i. (4.2.18)

We can also obtain


Π∞ (xk ) = U(xk , μ(xk )) + Π∞ (xk+1 )

and P(xk ) = Π∞ (xk ), which together with (4.2.18) implies

V∞ (xk ) ≤ P(xk ). (4.2.19)
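
The identity P(xk ) = Π∞ (xk ) used in this step can be spelled out by unrolling the recursion for Πi (xk ). The following is only a sketch of the omitted reasoning; it assumes (this is not restated at this point in the text) that V∞ (x) → 0 as x → 0, and that P(xk ) is the cost of μ(xk ), i.e., the solution of (4.2.17) with P(0) = 0.

$$\Pi_i(x_k) = \sum_{j=0}^{i-1} U(x_{k+j}, \mu(x_{k+j})) + V_\infty(x_{k+i}), \qquad i = 1, 2, \ldots$$

Since μ(xk ) is admissible, x_{k+i} → 0 as i → ∞, so the last term vanishes in the limit and

$$\Pi_\infty(x_k) = \lim_{i\to\infty}\Pi_i(x_k) = \sum_{j=0}^{\infty} U(x_{k+j}, \mu(x_{k+j})) = P(x_k).$$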

(3) Show that the value function V∞ (xk ) equals the optimal cost function J ∗ (xk ).
According to the definition of J ∗ (xk ) in (4.2.3), for i = 0, 1, . . ., we have

Vi (xk ) ≥ J ∗ (xk ).

Let i → ∞. We can obtain

V∞ (xk ) ≥ J ∗ (xk ). (4.2.20)

On the other hand, for an arbitrary admissible control law μ(xk ), (4.2.19) holds. Let
μ(xk ) = u∗ (xk ), where u∗ (xk ) is the optimal control law. Thus, when μ(xk ) = u∗ (xk ),
we can get

V∞ (xk ) ≤ P(xk ) = J ∗ (xk ). (4.2.21)

According to (4.2.20) and (4.2.21), we can obtain (4.2.12). The proof is complete.

Corollary 4.2.2 Let xk ∈ Rn be an arbitrary state vector. If Theorem 4.2.2 holds, then the iterative control law vi (xk ) converges to the optimal control law as i → ∞, i.e.,

$$u^*(x_k) = \lim_{i\to\infty} v_i(x_k).$$

Remark 4.2.1 In Theorems 4.2.1 and 4.2.2, we have proven that the iterative value functions are monotonically non-increasing and convergent to the optimal cost function as i → ∞. The stability of the nonlinear systems can also be guaranteed under the iterative control laws. Hence, the convergence and stability properties of the policy iteration algorithms for continuous-time nonlinear systems are also valid for the policy iteration algorithms for discrete-time nonlinear systems. However, we emphasize that the analysis techniques for the continuous-time and discrete-time policy iteration algorithms are quite different. First, the Hamilton–Jacobi–Bellman equation of continuous-time systems and the Bellman equation of discrete-time systems are inherently different. Second, the analysis method for the continuous-time policy iteration algorithm is based on derivative techniques [1, 3, 14, 17, 23, 25] and is therefore no longer applicable to the discrete-time situation. In this chapter, we establish properties for the discrete-time policy iteration algorithms, which form the theoretical basis for applications to optimal control problems of discrete-time nonlinear systems.

Remark 4.2.2 In [2], a value iteration algorithm of ADP is proposed to obtain the optimal solution of the Bellman equation for the following discrete-time affine nonlinear systems

$$x_{k+1} = f(x_k) + g(x_k)u_k,$$

with the cost function

$$J(x_k) = \sum_{j=k}^{\infty}\left(x_j^T Q x_j + u_j^T R u_j\right),$$

where Q and R are defined as the penalizing matrices for system state and control vectors, respectively. Let Q ∈ Rn×n and R ∈ Rm×m be both positive-definite matrices. Starting from V0 (xk ) ≡ 0, the value iteration algorithm can be expressed as

$$\begin{cases}
u_i(x_k) = -\dfrac{1}{2}R^{-1}g^T(x_k)\dfrac{\partial V_i(x_{k+1})}{\partial x_{k+1}}, \\[2mm]
V_{i+1}(x_k) = x_k^T Q x_k + u_i^T(x_k) R u_i(x_k) + V_i(x_{k+1}).
\end{cases} \tag{4.2.22}$$

It was proven in [2] that Vi (xk ) is a monotonically nondecreasing and bounded sequence, and hence converges to J ∗ (xk ). However, using the value iteration equation (4.2.22), the stability of the system under the iterative control law cannot be guaranteed. We should point out that using the policy iteration algorithm (4.2.5) and (4.2.6), the stability property can be guaranteed. In Sect. 4.3, numerical results and comparisons between the policy and value iteration algorithms will be presented to show these properties.
Therefore, if an admissible control of the nonlinear system is known, then policy
iteration algorithm is preferred to obtain the optimal control law with good conver-
gence and stability properties. Accordingly, the initial admissible control law is a
key to the success of the policy iteration algorithm. In the next section, an effective
method will be discussed to obtain the initial admissible control law using neural
networks.
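
The contrast drawn in Remark 4.2.2 can be illustrated on a linear-quadratic special case, where Vi (xk ) = xkT Pi xk and (4.2.22) reduces to a Riccati-type recursion. The sketch below is only an illustration under that assumption, not the implementation of [2]: it borrows the system matrices and weights of Example 4.3.1 in Sect. 4.3, and the names `value_iteration_lq` and `spectral_radius` are hypothetical. It records, for each iterate, the spectral radius of the closed loop under the iterative control law.

```python
import numpy as np

# Linear system and quadratic utility of Example 4.3.1 (Sect. 4.3).
A = np.array([[0.0, 0.1], [0.3, -1.0]])
B = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = np.array([[0.5]])


def spectral_radius(M):
    """Largest eigenvalue magnitude; < 1 means x_{k+1} = M x_k is stable."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))


def value_iteration_lq(num_iters):
    """Value iteration (4.2.22) specialised to V_i(x) = x^T P_i x, P_0 = 0."""
    P = np.zeros((2, 2))
    history = []
    for _ in range(num_iters):
        # Iterative control law u_i(x) = -K x obtained from V_i.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        Ac = A - B @ K
        history.append((P.copy(), spectral_radius(Ac)))
        # V_{i+1}(x) = x^T Q x + u_i^T R u_i + V_i(x_{k+1}).
        P = Q + K.T @ R @ K + Ac.T @ P @ Ac
    return history


history = value_iteration_lq(30)
x0 = np.array([1.0, -1.0])
values = [float(x0 @ P @ x0) for P, _ in history]
radii = [rho for _, rho in history]
print("closed-loop spectral radius under u_0:", radii[0])
print("closed-loop spectral radius under u_29:", radii[-1])
```

In this instance the value sequence is nondecreasing, but the first iterative control law (u = 0, obtained from V0 ≡ 0) leaves the open-loop-unstable system unstable, which is the kind of behavior the remark warns about.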

4.2.3 Initial Admissible Control Law

As the policy iteration algorithm must start with an admissible control law, the method of obtaining the admissible control law is important to the implementation of the algorithm. In fact, all the policy iteration algorithms of ADP, including those in [1, 3, 14, 17, 23, 25], need an initial admissible control law. Unfortunately, to the best of our knowledge, there is no theory on how to design the initial admissible control law. In this section, we will give an effective method to obtain the initial admissible control law by experiments using neural networks.
First, let Ψ (xk ) ≥ 0 be an arbitrary semipositive definite function. For i = 0, 1, . . .,
we define a new value function as

Φi+1 (xk ) = U(xk , μ(xk )) + Φi (xk+1 ), (4.2.23)

where Φ0 (xk ) = Ψ (xk ). Then, we can derive the following theorem.


Theorem 4.2.3 Suppose Assumption 4.2.1 holds. Let Ψ (xk ) ≥ 0 be an arbitrary
semipositive-definite function. Let μ(xk ) be an arbitrary control law for system (4.2.1)
which satisfies μ(0) = 0. Define the iterative value function Φi (xk ) as in (4.2.23),
where Φ0 (xk ) = Ψ (xk ). Then, μ(xk ) is an admissible control law if and only if the
limit of Φi (xk ) exists as i → ∞.
Proof We first prove the sufficiency of the statement. Assume that μ(xk ) is an admissible control law. According to (4.2.23),

$$\begin{aligned}
\Phi_{i+1}(x_k) &= U(x_k, \mu(x_k)) + \Phi_i(x_{k+1}) \\
&= U(x_k, \mu(x_k)) + U(x_{k+1}, \mu(x_{k+1})) + \Phi_{i-1}(x_{k+2}) \\
&= \cdots \\
&= \sum_{j=0}^{i} U(x_{k+j}, \mu(x_{k+j})) + \Psi(x_{k+i+1}),
\end{aligned} \tag{4.2.24}$$

where Ψ (xk ) = Φ0 (xk ). Letting i → ∞, it follows that

$$\lim_{i\to\infty}\Phi_i(x_k) = \sum_{j=0}^{\infty} U(x_{k+j}, \mu(x_{k+j})) + \lim_{i\to\infty}\Psi(x_{k+i+1}). \tag{4.2.25}$$

Since μ(xk ) is an admissible control law,

$$\sum_{j=0}^{\infty} U(x_{k+j}, \mu(x_{k+j})) \tag{4.2.26}$$

is finite and Ψ (xk+i+1 ) → 0 as i → ∞ (since xk+i+1 → 0). Hence, $\lim_{i\to\infty}\Phi_i(x_k)$ is finite.

On the other hand, if the limit of Φi (xk ) exists as i → ∞, according to (4.2.25), as Ψ (xk ) is finite, we can get that (4.2.26) is finite. Since the utility function U(xk , uk ) is positive definite, we can obtain

$$U(x_{k+j}, \mu(x_{k+j})) \to 0 \quad \text{as } j \to \infty.$$

As μ(xk ) = 0 for xk = 0, we can get that xk → 0 as k → ∞, which means that system (4.2.1) is stable and μ(xk ) is an admissible control law. The necessity of the statement is proven, and the proof of the theorem is complete.

According to Theorem 4.2.3, we can establish an effective iteration algorithm by repeating experiments using neural networks. The detailed implementation of the iteration algorithm is expressed in Algorithm 4.2.1.

Algorithm 4.2.1 Computation of the initial admissible control law


Initialization:
Choose a semi-positive definite function Ψ (xk ) ≥ 0.
Initialize two neural networks (critic networks) cnet1 and cnet2 with small random weights.
Let Φ0 (xk ) = Ψ (xk ).
Choose a computation precision ε.
Give the maximum iteration number of computation imax .
Iteration:
Step 1. Establish an action neural network with small random weights to generate an initial control law μ(xk ) with μ(xk ) = 0 for xk = 0. Obtain {xk } and {xk+1 }.
Step 2. Let i = 0. Train the critic network cnet1 so that its output becomes {U(xk , μ(xk )) + Φ0 (xk+1 )}.
Step 3. Copy cnet1 to cnet2, i.e., let cnet2 = cnet1.
Step 4. Let i = i + 1. Use cnet2 to get Φi (xk+1 ) and train the critic network cnet1 so that its output becomes {U(xk , μ(xk )) + Φi (xk+1 )}.
Step 5. Use the trained cnet1 to get Φi+1 (xk ) and use cnet2 to get Φi (xk ). If |Φi+1 (xk ) − Φi (xk )| < ε, then go to Step 7; else, go to the next step.
Step 6. If i > imax , then go to Step 1; else, go to Step 3.
Step 7. Let v0 (xk ) = μ(xk ).
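
The convergence test at the core of Algorithm 4.2.1 can be sketched without neural networks when the system model is available. The code below is a simplified illustration, not the authors' implementation: it replaces the critic networks by exact rollouts, evaluates Φi+1 (xk ) through the sum in (4.2.24) for a single initial state, and applies the stopping rule of Step 5. The linear system of Example 4.3.1 is used as a test bed; the choice Ψ (x) = xT x and the names `phi_limit_exists`, `mu_feedback`, and `mu_zero` are assumptions made for the example. A single small increment does not prove convergence in general, so the result should be read as a heuristic check.

```python
import numpy as np

# Linear system of Example 4.3.1, used here only as a test bed.
A = np.array([[0.0, 0.1], [0.3, -1.0]])
B = np.array([0.0, 0.5])


def utility(x, u):
    """U(x, u) = x^T Q x + u^T R u with Q = I and R = 0.5."""
    return float(x @ x + 0.5 * u * u)


def psi(x):
    """Assumed semi-positive definite function Psi(x) = x^T x."""
    return float(x @ x)


def phi_limit_exists(mu, x0, eps=1e-8, i_max=500):
    """Return True if |Phi_{i+1}(x0) - Phi_i(x0)| < eps within i_max steps."""
    x = np.array(x0, dtype=float)
    running_cost = 0.0
    phi_prev = psi(x)                      # Phi_0(x_k) = Psi(x_k)
    for _ in range(i_max):
        u = mu(x)
        running_cost += utility(x, u)
        x = A @ x + B * u
        phi = running_cost + psi(x)        # Phi_{i+1}(x_k), see (4.2.24)
        if abs(phi - phi_prev) < eps:
            return True
        phi_prev = phi
    return False


def mu_feedback(x):
    """Candidate law u = x2, which stabilises this system."""
    return x[1]


def mu_zero(x):
    """Candidate law u = 0; the open-loop system is unstable."""
    return 0.0


print(phi_limit_exists(mu_feedback, [1.0, -1.0]))
print(phi_limit_exists(mu_zero, [1.0, -1.0]))
```

In Algorithm 4.2.1 the same test is applied to critic-network outputs over sampled states, and a failed candidate leads back to Step 1 with a freshly initialized action network.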

Remark 4.2.3 We can see that the above training procedure can easily be implemented by a computer program. After creating an action network, the weights of the critic networks cnet1 and cnet2 can be updated iteratively. If the weights of the critic network are convergent, then μ(xk ), which is generated by a neural network with random weights, must be an admissible control law. As the weights of the action network are chosen randomly, the admissibility of the control law is unknown before the weights of the critic networks are convergent. Hence, the iteration process of Algorithm 4.2.1 is implemented off-line.

4.2.4 Summary of Policy Iteration ADP Algorithm

According to the above preparations, we can summarize the discrete-time policy iteration algorithm in Algorithm 4.2.2.

Algorithm 4.2.2 Discrete-time policy iteration algorithm


Initialization:
Choose randomly an array of initial states x0 .
Choose a computation precision ε.
Give the initial admissible control law v0 (xk ).
Give the maximum iteration number of computation imax .
Let the iteration index i = 0.
Iteration:
Step 1. Let i = i + 1. Construct the iterative value function Vi (xk ), which satisfies the following generalized Bellman equation

$$V_i(x_k) = U(x_k, v_{i-1}(x_k)) + V_i(F(x_k, v_{i-1}(x_k))).$$

Step 2. Update the iterative control law vi (xk ) by

$$v_i(x_k) = \arg\min_{u_k}\{U(x_k, u_k) + V_i(x_{k+1})\}.$$

Step 3. If |Vi (xk ) − Vi+1 (xk )| < ε, go to Step 5; else, go to Step 4.
Step 4. If i < imax , then go to Step 1; else, go to Step 6.
Step 5. The optimal control law is achieved as u∗ (xk ) = vi (xk ). Stop the algorithm.
Step 6. The optimal control law is not achieved within imax iterations.
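
The two steps of Algorithm 4.2.2 can be sketched for a linear-quadratic special case, where the value function is xT P x, the generalized Bellman equation of Step 1 becomes a linear (Lyapunov) equation in P, and Step 2 has a closed form. This is a minimal sketch under that assumption and not the neural-network implementation used in this chapter. The matrices are those of Example 4.3.1, the initial admissible law u = x2 is a hand-picked assumption, and the function names are hypothetical.

```python
import numpy as np

# Linear system and quadratic utility of Example 4.3.1.
A = np.array([[0.0, 0.1], [0.3, -1.0]])
B = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = np.array([[0.5]])


def evaluate_policy(K):
    """Step 1: solve P = Q + K^T R K + (A - B K)^T P (A - B K) for u = -K x."""
    Ac = A - B @ K
    S = Q + K.T @ R @ K
    n = A.shape[0]
    # vec(Ac^T P Ac) = kron(Ac^T, Ac^T) vec(P), so the equation is linear in P.
    lhs = np.eye(n * n) - np.kron(Ac.T, Ac.T)
    return np.linalg.solve(lhs, S.reshape(-1)).reshape(n, n)


def improve_policy(P):
    """Step 2: minimise x^T Q x + u^T R u + x_{k+1}^T P x_{k+1} over u."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)


def policy_iteration(K0, eps=1e-10, i_max=50):
    """Return the list of (gain, value matrix) pairs produced by the loop."""
    K, history = K0, []
    for _ in range(i_max):
        P = evaluate_policy(K)
        history.append((K, P))
        if len(history) > 1 and np.max(np.abs(P - history[-2][1])) < eps:
            break
        K = improve_policy(P)
    return history


K0 = np.array([[0.0, -1.0]])          # assumed admissible law u = -K0 x = x2
history = policy_iteration(K0)
x0 = np.array([1.0, -1.0])
values = [float(x0 @ P @ x0) for _, P in history]
print("values at x0:", values)
print("final gain K (u = -K x):", history[-1][0])
```

In this setting the recorded values at x0 are nonincreasing and every intermediate gain keeps the closed loop stable, in line with Theorems 4.2.1 and 4.2.2.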

4.3 Numerical Simulation and Analysis

To evaluate the performance of our policy iteration algorithm, four examples are
chosen: (1) a linear system, (2) a discrete-time nonlinear system, (3) a torsional
pendulum system, and (4) a complex nonlinear system.

Example 4.3.1 In the first example, a linear system is considered. The results will
be compared to the traditional linear quadratic regulation (LQR) method to verify
the effectiveness of the present algorithm. We consider the following linear system

xk+1 = Axk + Buk , (4.3.1)


where xk = [x1k , x2k ]T and uk ∈ R. Let the system matrices be expressed as

$$A = \begin{bmatrix} 0 & 0.1 \\ 0.3 & -1 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}. \tag{4.3.2}$$

The initial state is x0 = [1, −1]T . Let the cost function be expressed by (4.2.2). The
utility function is the quadratic form that is expressed as U(xk , uk ) = xkTQxk + ukTRuk ,
where Q is the 2 × 2 identity matrix, and R = 0.5.
Neural networks are used to implement the present policy iteration algorithm. The critic network and the action network are chosen as three-layer BP neural networks with the structures of 2–8–1 and 2–8–1, respectively. For each iteration step, the critic network and the action network are trained for 80 steps so that the neural network training error becomes less than $10^{-5}$. The initial admissible control law is obtained by Algorithm 4.2.1, where the weights and thresholds of the action network are obtained as

$$V_{a,\mathrm{initial}} = \begin{bmatrix}
-4.1525 & -1.1980 \\
0.3693 & -0.8828 \\
1.8071 & 2.8088 \\
0.4104 & -0.9845 \\
0.7319 & -1.7384 \\
1.2885 & -2.5911 \\
-0.3403 & 0.8154 \\
-0.5647 & 1.3694
\end{bmatrix},$$

$$W_{a,\mathrm{initial}} = [-0.0010, -0.2566, 0.0001, -0.1409, -0.0092, 0.0001, 0.3738, 0.0998],$$

and

$$b_{a,\mathrm{initial}} = [3.5272, -0.9609, -1.8038, -0.0970, 0.8526, 1.1966, -1.0948, 2.5641]^T,$$

respectively. The policy iteration algorithm is run for 6 iterations to reach the computation precision ε = $10^{-5}$, and the convergence trajectory of the iterative value functions is shown in Fig. 4.1a. During each iteration, the iterative control law is updated. Applying the iterative control law to the given system (4.3.1) for Tf = 15 time steps, we can obtain the states and the controls, which are displayed in Fig. 4.1b, c, respectively.

[Fig. 4.1 image. Panels: (a) iterative value function vs. iteration steps; (b) system states vs. time steps, initial and limiting iterative states x1 and x2; (c) iterative controls vs. time steps, initial and limiting iterative control; (d) system state x2 vs. time steps, PI and ARE]

Fig. 4.1 Numerical results of Example 4.3.1. a The convergence trajectory of iterative value func-
tion. b The states. c The controls. d The trajectories of x2 under the optimal control of policy
iteration and ARE

We can see that the optimal control law of system (4.3.1) is obtained after 6 iterations. On the other hand, it is known that the solution of the optimal control problem for linear systems is quadratic in the state, given by J ∗ (xk ) = xkT Pxk , where P is the solution of the algebraic Riccati equation (ARE) [7]. The solution of the ARE for the given linear system (4.3.1) can be determined as

$$P = \begin{bmatrix} 1.091 & -0.309 \\ -0.309 & 2.055 \end{bmatrix}.$$

The optimal control can be obtained as u∗ (xk ) = [−0.304, 1.029 ]xk . Applying this optimal control law to the linear system (4.3.1), we can obtain the same optimal control results. The trajectories of system state x2 under the optimal control laws obtained by the present policy iteration algorithm and by the ARE are displayed in Fig. 4.1d.
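
The reported ARE solution can be checked numerically by substituting it into the discrete-time Riccati equation. The sketch below assumes the standard form P = Q + AT P A − AT P B (R + BT P B)−1 BT P A with the data of Example 4.3.1; the name `riccati_map` is hypothetical, and the reported P is rounded to three decimals, so only agreement to about that precision should be expected.

```python
import numpy as np

# Data of Example 4.3.1.
A = np.array([[0.0, 0.1], [0.3, -1.0]])
B = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = np.array([[0.5]])

# ARE solution as reported in the text (rounded to three decimals).
P_reported = np.array([[1.091, -0.309], [-0.309, 2.055]])


def riccati_map(P):
    """Right-hand side of the discrete-time ARE and the associated gain."""
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u = -G x
    return Q + A.T @ P @ A - A.T @ P @ B @ G, G


P_next, G = riccati_map(P_reported)
residual = float(np.max(np.abs(P_next - P_reported)))
feedback_row = -G                                       # u*(x) = feedback_row @ x
print("max ARE residual:", residual)
print("feedback row:", feedback_row)
```

A residual at the level of the rounding error and a feedback row close to [−0.304, 1.029] indicate that the reported P and u∗ are mutually consistent.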

Example 4.3.2 The second example is chosen from the example in [2, 20]. We
consider the following nonlinear system

xk+1 = f (xk ) + g(xk )uk , (4.3.3)

where xk = [x1k , x2k ]T and uk are the state and control variables, respectively. The system functions are given as

$$f(x_k) = \begin{bmatrix} 0.2x_{1k}\exp(x_{2k}^2) \\ 0.3x_{2k}^3 \end{bmatrix}, \quad g(x_k) = \begin{bmatrix} 0 \\ -0.2 \end{bmatrix}.$$

The initial state is x0 = [2, −1]T . In this example, two utility functions, one quadratic and one nonquadratic, will be considered. The first utility function is a quadratic form which is the same as the one in Example 4.3.1, where Q and R are chosen as identity matrices. The configurations of the critic network and the action network are chosen the same as the ones in Example 4.3.1. For each iteration step, the critic network and the action network are trained for 100 steps so that the neural network training error becomes less than $10^{-5}$. The policy iteration algorithm is run for 6 iterations to reach the computation precision of $10^{-5}$. The convergence trajectory of the iterative value function is shown in Fig. 4.2a. Applying the optimal control law to the given system (4.3.3) for Tf = 10 time steps, we can obtain the state trajectories and the control, which are displayed in Fig. 4.2b–d, respectively.
In [2], it is proven that the optimal control law can be obtained by the value
iteration algorithm described in (4.2.22). The convergence trajectory of the iterative
value function is shown in Fig. 4.3a. The optimal trajectories of system states and
control are displayed in Fig. 4.3b–d, respectively.
Next, we change the utility function into a nonquadratic form as in [21] with
modifications, where the utility function is expressed as

U(xk , uk ) = ln(xkTQxk + 1) + ln(xkTQxk + 1)ukTRuk .

[Fig. 4.2 image. Panels: (a) iterative value function vs. iteration steps; (b) system state x1 vs. time steps; (c) system state x2 vs. time steps; (d) optimal control vs. time steps]

Fig. 4.2 Numerical results using policy iteration algorithm. a The convergence trajectory of iterative
value function. b The optimal trajectory of state x1 . c The optimal trajectory of state x2 . d The optimal
control

[Fig. 4.3 image. Panels: (a) iterative value function vs. iteration steps; (b) system state x1 vs. time steps; (c) system state x2 vs. time steps; (d) optimal control vs. time steps]

Fig. 4.3 Numerical results using value iteration algorithm. a The convergence trajectory of iterative
value function. b The optimal trajectory of state x1 . c The optimal trajectory of state x2 . d The optimal
control

Let the other parameters remain unchanged. Using the present policy iteration algorithm, we can also obtain the optimal control law for the system. The value function is shown in Fig. 4.4a. The iterative states, the iterative controls, and the optimal trajectories of states and control are shown in Fig. 4.4b–d, respectively.

Remark 4.3.1 From the numerical results, we can see that for quadratic and non-
quadratic utility functions, the optimal control law of the nonlinear system can both be
effectively obtained. On the other hand, we have shown that using the value iteration
algorithm in [2], we can also obtain the optimal control law of the system. But
we should point out that the convergence properties of the iterative value functions
by the policy and value iteration algorithms are obviously different. Thus, stability
behaviors of control systems under the two iterative control algorithms will be quite
different. In the next example, detailed comparisons will be displayed.

[Fig. 4.4 image. Panels: (a) iterative value function vs. iteration steps; (b) system states vs. time steps, initial and limiting iterative states x1 and x2; (c) iterative controls vs. time steps, initial and limiting iterative control; (d) optimal control and states x1, x2, u vs. time steps]

Fig. 4.4 Numerical results for nonquadratic utility function using policy iteration algorithm. a The convergence trajectory of iterative value function. b The states. c The controls. d The optimal states and control

Example 4.3.3 We now examine the performance of the present algorithm in a torsional pendulum system [15]. The dynamics of the pendulum are as follows

$$\begin{cases}
\dfrac{d\theta}{dt} = \omega, \\[2mm]
J\dfrac{d\omega}{dt} = u - Mgl\sin\theta - f_d\dfrac{d\theta}{dt},
\end{cases}$$

where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar, respectively. The system states are the current angle θ and the angular velocity ω. Let J = 4/3 Ml² and fd = 0.2 be the rotary inertia and frictional factor, respectively. Let g = 9.8 m/s² be the gravity. Discretization of the system function and cost function using Euler and trapezoidal methods [4] with the sampling interval Δt = 0.1 s leads to

$$\begin{bmatrix} x_{1(k+1)} \\ x_{2(k+1)} \end{bmatrix} = \begin{bmatrix} 0.1x_{2k} + x_{1k} \\ -0.49\sin(x_{1k}) - 0.1 f_d x_{2k} + x_{2k} \end{bmatrix} + \begin{bmatrix} 0 \\ 0.1 \end{bmatrix} u_k, \tag{4.3.4}$$
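
For simulation purposes, the discretized model (4.3.4) can be written as a one-step map. The sketch below transcribes the coefficients of (4.3.4) exactly as printed and does not re-derive them from M, l, J, and g; the function names `pendulum_step` and `simulate` and the policy interface are assumptions made for illustration.

```python
import math

FD = 0.2  # frictional factor f_d given in the text


def pendulum_step(x1, x2, u):
    """One step of (4.3.4): x1 is the angle, x2 the angular velocity."""
    x1_next = 0.1 * x2 + x1
    x2_next = -0.49 * math.sin(x1) - 0.1 * FD * x2 + x2 + 0.1 * u
    return x1_next, x2_next


def simulate(policy, x0, steps):
    """Roll out (4.3.4) under a state-feedback policy u = policy(x1, x2)."""
    x1, x2 = x0
    trajectory = [(x1, x2)]
    for _ in range(steps):
        x1, x2 = pendulum_step(x1, x2, policy(x1, x2))
        trajectory.append((x1, x2))
    return trajectory


# One uncontrolled step from the initial state x0 = [1, -1]^T of the example.
first = simulate(lambda x1, x2: 0.0, (1.0, -1.0), 1)[1]
print(first)
```

Any control law, for instance one produced by an action network, can be passed as `policy` to generate state trajectories of the kind plotted for this example.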

where x1k = θk and x2k = ωk . Let the initial state be x0 = [1, −1]T . Let the utility function be of quadratic form, and let the structures of the critic and action networks be 2–12–1 and 2–12–1, respectively. The initial admissible control law is obtained by Algorithm 4.2.1, where the weight matrices are obtained as

$$V_{a,\mathrm{initial}} = \begin{bmatrix}
-0.6574 & 1.1803 \\
1.5421 & -2.9447 \\
-4.3289 & -3.7448 \\
5.7354 & 2.8933 \\
0.4354 & -0.8078 \\
-1.9680 & 3.6870 \\
1.9285 & 1.4044 \\
-4.9011 & -4.3527 \\
1.0914 & -0.0344 \\
-1.5746 & 2.8033 \\
1.4897 & -0.0315 \\
0.2992 & -0.0784
\end{bmatrix},$$

$$W_{a,\mathrm{initial}} = [-2.1429, 1.9276, 0.0060, 0.0030, 4.5618, 3.2266, 0.0005, -0.0012, 1.3796, -0.5338, 0.5043, -5.1110],$$

and

$$b_{a,\mathrm{initial}} = [0.8511, 4.2189, 5.7266, -6.0599, 0.4998, -5.7323, -0.6220, -0.5142, -0.3874, 3.3985, 0.6668, -0.2834]^T.$$

For each iteration step, the critic network and the action network are trained for 400 steps so that the neural network training error becomes less than $10^{-5}$. The policy iteration algorithm is run for 16 iterations to reach the computation precision $10^{-5}$, and the convergence trajectory of the iterative value function is shown in Fig. 4.5a. The iterative control laws are applied to the given system for Tf = 100 time steps, and the trajectories of the iterative controls and states are displayed in Fig. 4.5b, c, respectively. The optimal trajectories of system states and control are displayed in Fig. 4.5d.
From the numerical results, we can see that using the present policy iteration algorithm, any of the iterative control laws can stabilize the system. But for the value iteration algorithm, the situation is quite different. For the value iteration algorithm, we choose the initial value function V0 (xk ) ≡ 0 as in [2]. The value iteration algorithm (4.2.22) is run for 30 iterations to reach the computation precision $10^{-5}$, and the trajectory of the value function is shown in Fig. 4.6a. Applying the iterative control law to the given system (4.3.4) for Tf = 100 time steps, we can obtain the iterative states and iterative controls, which are displayed in Fig. 4.6b, c, respectively. The optimal trajectories of control and system states are displayed in Fig. 4.6d.
For unstable control systems, although the optimal control law can be obtained by both value and policy iteration algorithms, we can see that under the value iteration algorithm not all the iterative controls can stabilize the system. Moreover, the properties of the iterative controls obtained by the value iteration algorithm cannot be analyzed, and thus, the value iteration algorithm can only be implemented off-line. For the present policy iteration algorithm, the stability property can be guaranteed. These results confirm the effectiveness of the policy iteration algorithm in this chapter.

[Fig. 4.5 image. Panels: (a) iterative value function vs. iteration steps; (b) iterative controls vs. time steps, initial and limiting iterative control; (c) system states vs. time steps, initial and limiting iterative states x1 and x2; (d) optimal control and states x1, x2, u vs. time steps]

Fig. 4.5 Numerical results of Example 4.3.3 using policy iteration algorithm. a The convergence trajectory of iterative value function. b The controls. c The states. d The optimal states and control

[Fig. 4.6 image. Panels: (a) iterative value function vs. iteration steps; (b) iterative controls vs. time steps, initial and limiting iterative control; (c) system states vs. time steps, initial and limiting iterative states x1 and x2; (d) optimal control and states x1, x2, u vs. time steps]

Fig. 4.6 Numerical results of Example 4.3.3 using value iteration algorithm. a The convergence trajectory of iterative value function. b The controls. c The states. d The optimal states and control

Example 4.3.4 As a real-world application of the present method, the problem of nonlinear satellite attitude control has been selected [5, 22]. Satellite dynamics are represented by

$$\frac{d\omega}{dt} = \Upsilon^{-1}\left(N_{net} - \omega \times \Upsilon\omega\right),$$

where Υ , ω, and Nnet are the inertia tensor, the angular velocity vector of the body frame with respect to the inertial frame, and the vector of the total torque applied on the satellite, respectively. The selected satellite is an inertial pointing satellite. Hence, we are interested in its attitude with respect to the inertial frame. All vectors are represented in the body frame and the sign × denotes the cross product of two vectors. Let Nnet = Nctrl + Ndis , where Nctrl is the control and Ndis is the disturbance. Following [22] and its order of transformation, the kinematic equation of the satellite becomes

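$$\frac{d}{dt}\begin{bmatrix} \phi \\ \theta \\ \Psi \end{bmatrix} = \begin{bmatrix}
1 & \sin\phi\tan\theta & \cos\phi\tan\theta \\
0 & \cos\phi & -\sin\phi \\
0 & \sin\phi/\cos\theta & \cos\phi/\cos\theta
\end{bmatrix}\begin{bmatrix} \omega_x \\ \omega_y \\ \omega_z \end{bmatrix}, \tag{4.3.5}$$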

where φ, θ , and Ψ are the three Euler angles describing the attitude of the satellite with
respect to x, y, and z axes of the inertial coordinate system, respectively. Subscripts
x, y, and z are the corresponding elements of the angular velocity vector ω. The three
Euler angles and the three elements of the angular velocity form the elements of the
state space for the satellite attitude control problem. The state equation is given as
follows,

$$\dot{x} = f(x) + g(x)u,$$

where

$$\begin{aligned}
x &= [\phi, \theta, \Psi, \omega_x, \omega_y, \omega_z]^T, \\
u &= [u_x, u_y, u_z]^T, \\
f(x) &= \begin{bmatrix} M_{3\times 1} \\ \Upsilon^{-1}(N_{dis} - \omega \times \Upsilon\omega) \end{bmatrix}, \\
g(x) &= \begin{bmatrix} 0_{3\times 3} \\ \Upsilon^{-1} \end{bmatrix},
\end{aligned}$$

and M3×1 denotes the right-hand side of (4.3.5). The three-by-three null matrix is denoted by 03×3 . The moment of inertia matrix of the satellite is chosen as

$$\Upsilon = \begin{bmatrix} 100 & 2 & 0.5 \\ 2 & 100 & 1 \\ 0.5 & 1 & 110 \end{bmatrix} \ \mathrm{kg \cdot m^2}.$$

[Fig. 4.7 image. Critic network weights wc1 to wc15 vs. iteration steps]

Fig. 4.7 The convergence trajectories of the critic network

The initial states are 60◦ , 20◦ , and 70◦ for the Euler angles of φ, θ , and Ψ ,
respectively, and −0.001, 0.001, and 0.001 rad/s for the angular rates around x, y,
and z axes, respectively. For convenience of analysis, we assume that there is no
disturbance in the system. Discretization of the system function and value function
using Euler and trapezoidal methods [4] with the sampling interval Δt = 0.25 s leads
to
xk+1 = (Δtf (xk ) + xk ) + Δt g(xk )uk .
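
The discretized satellite model can be sketched directly from the definitions of f (x), g(x), and Υ given above. The code below is an illustrative sketch only: it assumes the kinematic matrix of (4.3.5) in its standard form, Euler angles expressed in radians, zero disturbance by default, and hypothetical function names (`kinematics_matrix`, `f`, `g`, `euler_step`).

```python
import numpy as np

INERTIA = np.array([[100.0, 2.0, 0.5],
                    [2.0, 100.0, 1.0],
                    [0.5, 1.0, 110.0]])      # moment of inertia, kg m^2
INERTIA_INV = np.linalg.inv(INERTIA)
DT = 0.25                                    # sampling interval, s


def kinematics_matrix(phi, theta):
    """Matrix of (4.3.5) mapping body rates to Euler-angle rates."""
    s, c = np.sin(phi), np.cos(phi)
    return np.array([[1.0, s * np.tan(theta), c * np.tan(theta)],
                     [0.0, c, -s],
                     [0.0, s / np.cos(theta), c / np.cos(theta)]])


def f(x, n_dis=None):
    """Drift term: Euler-angle rates stacked on the angular acceleration."""
    n_dis = np.zeros(3) if n_dis is None else n_dis
    omega = x[3:]
    euler_rates = kinematics_matrix(x[0], x[1]) @ omega
    omega_dot = INERTIA_INV @ (n_dis - np.cross(omega, INERTIA @ omega))
    return np.concatenate([euler_rates, omega_dot])


def g(x):
    """Input matrix: the control torque acts on the angular velocity only."""
    return np.vstack([np.zeros((3, 3)), INERTIA_INV])


def euler_step(x, u):
    """x_{k+1} = (dt f(x_k) + x_k) + dt g(x_k) u_k."""
    return x + DT * f(x) + DT * (g(x) @ u)


# Initial state of the example, with angles converted to radians.
x0 = np.concatenate([np.deg2rad([60.0, 20.0, 70.0]), [-0.001, 0.001, 0.001]])
x1 = euler_step(x0, np.zeros(3))
print(x1)
```

With zero angular velocity and zero torque the state is a fixed point of this map, which gives a quick sanity check on the implementation.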

[Fig. 4.8 image. Action network weights wa1 to wa15 vs. iteration steps]

Fig. 4.8 The convergence trajectories of the first column of weights of the action network

Let the utility function be quadratic form where the state and control weight matrices
are selected as
Q = diag{0.25, 0.25, 0.25, 25, 25, 25}

and
R = diag{25, 25, 25},

respectively.
Neural networks are used to implement the present policy iteration algorithm. The critic network and the action network are chosen as three-layer BP neural networks with the structures of 6–15–1 and 6–15–3, respectively. For each iteration step, the critic network and the action network are trained for 800 steps so that the neural network training error becomes less than $10^{-5}$. The policy iteration algorithm is run for 150 iterations. We have proven that the weights of critic and action networks are convergent in each iteration and thus convergent to the optimal ones. The convergence trajectories of critic network weights are shown in Fig. 4.7. The weight convergence trajectories of the first column of the action network are shown in Fig. 4.8. The iterative value functions are shown in Fig. 4.9a. After the weights of the critic and action networks are convergent, we apply the neuro-optimal controller to the given system for Tf = 1500 time steps. The optimal state trajectories of φ, θ and Ψ are shown in Fig. 4.9b. The trajectories of angular velocities ωx , ωy , and ωz are shown in Fig. 4.9c, and the optimal control signals are shown in Fig. 4.9d. The numerical results illustrate the effectiveness of the present policy iteration algorithm.

[Fig. 4.9 image. Panels: (a) iterative value function vs. iteration steps; (b) Euler angles φ, θ, Ψ (deg.) vs. time steps; (c) angular velocities ωx, ωy, ωz (rad/s) vs. time steps; (d) controls ux, uy, uz (N.m) vs. time steps]

Fig. 4.9 Numerical results of Example 4.3.4 using policy iteration algorithm. a The convergence trajectory of iterative value function. b The trajectories of Euler angles φ, θ, and Ψ . c The trajectories of angular velocities ωx , ωy , and ωz . d The optimal control trajectories

4.4 Conclusions

In this chapter, an effective policy iteration ADP algorithm is developed to solve the infinite horizon optimal control problems for discrete-time nonlinear systems.
It is shown that any of the iterative control laws can stabilize the control system.
It is proven that the iterative value functions are monotonically nonincreasing and
convergent to the optimal solution of the Bellman equation. Neural networks are
used to approximate the iterative value functions and compute the iterative control
laws, respectively, for facilitating the implementation of the iterative ADP algorithm.
Finally, four numerical examples are given to illustrate the performance of the method
developed in this chapter.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B:
Cybern 38(4):943–949
3. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
4. Gupta SK (1995) Numerical methods for engineers. New Age International, India
5. Heydari A, Balakrishnan SN (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
6. Huang T, Liu D (2013) A self-learning scheme for residential energy system control and
management. Neural Comput Appl 22(2):259–269
7. Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
8. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
9. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst
32(6):76–105
10. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
11. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
12. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
13. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw
Learn Syst 24(10):1513–1525
14. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern-Part C: Appl Rev 32(2):140–153
15. Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
16. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
17. Vamvoudakis KG, Lewis FL, Hudas GR (2012) Multi-agent differential graphical games: online
adaptive learning solution for synchronization with optimality. Automatica 48(8):1598–1611
18. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
19. Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of
discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocom-
puting 78(1):14–22
20. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36
21. Wei Q, Zhang H, Dai J (2009) Model-free multiobjective approximate dynamic programming
for discrete-time nonlinear systems with general performance index functions. Neurocomputing
72(7):1839–1848
22. Wertz JR (1978) Spacecraft attitude determination and control. Kluwer, Netherlands
23. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
24. Zhang H, Song R, Wei Q, Zhang T (2011) Optimal tracking control for a class of nonlinear
discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans
Neural Netw 22(12):1851–1862
25. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
Chapter 5
Generalized Policy Iteration ADP
for Discrete-Time Nonlinear Systems

5.1 Introduction

According to [14, 15], most discrete-time reinforcement learning methods can be
described as generalized policy iteration (GPI) algorithms. GPI algorithms revolve
around two iteration procedures: policy evaluation, which makes the value function
consistent with the current policy, and policy improvement, which constructs a new
policy that improves on the previous one [14]. GPI allows either of these two
iterations to be performed without first completing the other. Investigations of GPI
algorithms are important for the development of ADP. There exist inherent
differences between GPI algorithms and the value and policy iteration algorithms, so
the analysis methods for traditional value and policy iteration algorithms are no
longer valid for GPI algorithms. For a long time, discussions on the properties of GPI
algorithms for discrete-time control systems have been scarce. To the best of our
knowledge, the properties of GPI algorithms were analyzed only in [18], and the
stability of the system under the iterative control law in [18] cannot be guaranteed.
Therefore, together with the GPI algorithms to be developed in this chapter, we will
also establish convergence, admissibility, and optimality results [11, 18].

5.2 Generalized Policy Iteration-Based Adaptive Dynamic Programming Algorithm

Consider the following discrete-time nonlinear system

$$ x_{k+1} = F(x_k, u_k), \quad k = 0, 1, \ldots, \tag{5.2.1} $$

© Springer International Publishing AG 2017 177


D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_5
178 5 Generalized Policy Iteration ADP for Discrete-Time Nonlinear Systems

where $x_k \in \mathbb{R}^n$ is the state vector and $u_k \in \mathbb{R}^m$ is the control vector. Let $x_0$ be the
initial state. Let $F(x_k, u_k)$ denote the system function, which is known. For any
$k = 0, 1, \ldots$, let $\underline{u}_k = (u_k, u_{k+1}, \ldots)$ be an arbitrary sequence of controls from $k$ to
$\infty$. The cost function for state $x_0$ under the control sequence $\underline{u}_0 = (u_0, u_1, \ldots)$ is
defined as

$$ J(x_0, \underline{u}_0) = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{5.2.2} $$

where $U(x_k, u_k) > 0$ is the utility function.


We will study optimal control problems for (5.2.1). The goal is to find an optimal
control law which stabilizes system (5.2.1) and simultaneously minimizes the cost
function (5.2.2). For convenience of analysis, results of this chapter are based on the
following assumptions (cf. Assumptions 2.2.1–2.2.3 in Chap. 2).
Assumption 5.2.1 $F(0, 0) = 0$, and the state feedback control law $u(\cdot)$ satisfies
$u(0) = 0$, i.e., $x_k = 0$ is an equilibrium state of system (5.2.1) under the control
$u_k = 0$.
Assumption 5.2.2 $F(x_k, u_k)$ is Lipschitz continuous on a compact set $\Omega \subset \mathbb{R}^n$
containing the origin.
Assumption 5.2.3 System (5.2.1) is controllable in the sense that there exists a
continuous control law on $\Omega$ that asymptotically stabilizes the system.
Assumption 5.2.4 The utility function $U(x_k, u_k)$ is a continuous positive-definite
function of $x_k$ and $u_k$.
For a given control law $\mu(\cdot)$, the control in (5.2.1) is given by $u_k = \mu(x_k)$ and the
cost function (5.2.2) is rewritten as

$$ J^{\mu}(x_0) = \sum_{k=0}^{\infty} U(x_k, \mu(x_k)). $$

The optimal cost function is denoted by

$$ J^{*}(x_0) = \inf_{\mu} J^{\mu}(x_0) = \inf_{\underline{u}_0} \{ J(x_0, \underline{u}_0) \}, \tag{5.2.3} $$

where $\underline{u}_0 = (u_0, u_1, \ldots) = (\mu(x_0), \mu(x_1), \ldots)$.

According to Bellman’s principle of optimality, for all $x_k \in \Omega$, $J^{*}(x_k)$ satisfies the
Bellman equation

$$ \begin{aligned}
J^{*}(x_k) &= \min_{u_k} \{ U(x_k, u_k) + J^{*}(x_{k+1}) \} \\
           &= \min_{u_k} \{ U(x_k, u_k) + J^{*}(F(x_k, u_k)) \}.
\end{aligned} \tag{5.2.4} $$
5.2 Generalized Policy Iteration-Based Adaptive Dynamic Programming Algorithm 179

The optimal control law at time $k$ is given by

$$ u_k^{*} = \arg\min_{u_k} \{ U(x_k, u_k) + J^{*}(F(x_k, u_k)) \}. $$

Then, for all $x_k \in \Omega$, the Bellman equation can be written as

$$ J^{*}(x_k) = U(x_k, u_k^{*}) + J^{*}(F(x_k, u_k^{*})). $$
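As a small numerical illustration of the Bellman equation (5.2.4), the following sketch uses a scalar linear system $x_{k+1} = a x_k + b u_k$ with utility $U = q x_k^2 + r u_k^2$. This system, the numbers, and all function names are illustrative assumptions and do not come from the chapter. For this case the optimal cost is $J^{*}(x) = p x^2$, where $p$ solves the scalar Riccati equation $p = q + a^2 r p/(r + b^2 p)$, and the right-hand side of (5.2.4) is approximated by a brute-force search over a grid of controls.

```python
def riccati_scalar(a, b, q, r, iters=1000):
    """Fixed-point iteration for p = q + a^2 r p / (r + b^2 p)."""
    p = q
    for _ in range(iters):
        p = q + a * a * r * p / (r + b * b * p)
    return p


def bellman_rhs(x, p, a, b, q, r, n_grid=4001, u_max=5.0):
    """Grid approximation of min_u { U(x, u) + J*(F(x, u)) } with J*(x) = p x^2."""
    best = float("inf")
    for i in range(n_grid):
        u = -u_max + 2.0 * u_max * i / (n_grid - 1)
        x_next = a * x + b * u                     # F(x, u)
        cost = q * x * x + r * u * u + p * x_next * x_next
        best = min(best, cost)
    return best


# Hypothetical problem data (not from the chapter).
A, B, Q, R = 1.2, 1.0, 1.0, 1.0
P_STAR = riccati_scalar(A, B, Q, R)
```

For any state `x` whose minimizing control lies inside the grid, `P_STAR * x * x` and `bellman_rhs(x, P_STAR, A, B, Q, R)` should agree up to the grid resolution.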

We have studied value iteration (VI) algorithms as well as policy iteration (PI)
algorithms in previous chapters. Both VI and PI algorithms have their own advantages.
It is desirable to develop a scheme that combines the two so that we can enjoy
the benefits of both algorithms. This motivates the work in this chapter, and we call
the new scheme generalized policy iteration (GPI) [11].

5.2.1 Derivation and Properties of the GPI Algorithm

A. Derivation of the GPI Algorithm


The present generalized policy iteration algorithm contains two iteration procedures,
which are called the i-iteration and the j-iteration, respectively. We introduce two
iteration indices $i$ and $j_i$, both of which increase from 0.
Let $\mathcal{A}(\Omega)$ be the set of admissible control laws. Let $v_{(-1)}(x_k) \in \mathcal{A}(\Omega)$ be an
arbitrary admissible control law for all $x_k \in \Omega$. Let $\{N_1, N_2, \ldots\}$ be a sequence of
positive integers.
For $i = 0$, construct a value function $V_0(x_k)$ based on $v_{(-1)}(x_k)$ from the following
generalized Bellman equation

$$ V_0(x_k) = U(x_k, v_{(-1)}(x_k)) + V_0(F(x_k, v_{(-1)}(x_k))), \tag{5.2.5} $$

and calculate the control law $v_0(x_k)$ for all $x_k \in \Omega$ by

$$ \begin{aligned}
v_0(x_k) &= \arg\min_{u_k} \{ U(x_k, u_k) + V_0(x_{k+1}) \} \\
         &= \arg\min_{u_k} \{ U(x_k, u_k) + V_0(F(x_k, u_k)) \}.
\end{aligned} \tag{5.2.6} $$

For $i = 1, 2, \ldots$ and for all $x_k \in \Omega$, the generalized policy iteration algorithm can
be expressed by the following two iterative procedures.
j-iteration:
For $j_i = 1, 2, \ldots, N_i$ and for all $x_k \in \Omega$, we update the iterative value function
by

$$ \begin{aligned}
V_{i,j_i}(x_k) &= U(x_k, v_{i-1}(x_k)) + V_{i,j_i-1}(x_{k+1}) \\
               &= U(x_k, v_{i-1}(x_k)) + V_{i,j_i-1}(F(x_k, v_{i-1}(x_k))),
\end{aligned} \tag{5.2.7} $$

where

$$ V_{i,0}(x_{k+1}) = V_{i-1}(x_{k+1}) $$

and $V_0(\cdot)$ is obtained in (5.2.5). For all $x_k \in \Omega$, define the iterative value
function at the end of the j-iteration as

$$ V_i(x_k) = V_{i,N_i}(x_k). \tag{5.2.8} $$

i-iteration:
For all $x_k \in \Omega$, the iterative control law is improved by

$$ \begin{aligned}
v_i(x_k) &= \arg\min_{u_k} \{ U(x_k, u_k) + V_i(x_{k+1}) \} \\
         &= \arg\min_{u_k} \{ U(x_k, u_k) + V_i(F(x_k, u_k)) \}.
\end{aligned} \tag{5.2.9} $$

From the above generalized policy iteration algorithm, we can see that each
j-iteration computes the iterative value function for the control law $v_{i-1}(x_k)$, which
tries to solve the following generalized Bellman equation

$$ V_{i,j_i}(x_k) = U(x_k, v_{i-1}(x_k)) + V_{i,j_i}(F(x_k, v_{i-1}(x_k))). \tag{5.2.10} $$

This iterative procedure is called “policy evaluation” [8, 14]. In this procedure, the
iterative value function $V_{i,j_i}(x_k)$ is updated, while the control law is kept unchanged.
In each i-iteration, the iterative value function $V_i(x_k)$ obtained for some control law
is used to find another policy that is better, or at least not worse. This procedure is
known as “policy improvement” [8, 14]. In this iterative procedure, the control law
is updated.
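The two procedures can be sketched on a scalar linear-quadratic problem, where every quantity has a closed form. The sketch below is only an illustration under stated assumptions: the system $x_{k+1} = a x_k + b u_k$, the utility $q x_k^2 + r u_k^2$, and the function names are hypothetical and not taken from the chapter. Value functions are represented as $V(x) = p x^2$ and control laws as $v(x) = -k x$, so (5.2.5), (5.2.7), and (5.2.9) reduce to updates of the scalars $p$ and $k$.

```python
def evaluate_exact(k, a, b, q, r):
    """Solve (5.2.5) for V(x) = p x^2 under the control law v(x) = -k x."""
    lam = a - b * k
    assert abs(lam) < 1.0, "gain k must be stabilizing (admissible)"
    return (q + r * k * k) / (1.0 - lam * lam)


def improve(p, a, b, r):
    """Minimizer of q x^2 + r u^2 + p (a x + b u)^2 is u = -k x, cf. (5.2.9)."""
    return a * b * p / (r + b * b * p)


def gpi(k_init, n_steps, a, b, q, r):
    """Run GPI; n_steps[i-1] plays the role of N_i. Returns the p_i history and last gain."""
    p = evaluate_exact(k_init, a, b, q, r)      # V_0 from the initial admissible law
    k = improve(p, a, b, r)                     # v_0
    history = [p]
    for n_i in n_steps:
        for _ in range(n_i):                    # j-iteration (5.2.7)
            p = q + r * k * k + p * (a - b * k) ** 2
        history.append(p)                       # V_i = V_{i,N_i} as in (5.2.8)
        k = improve(p, a, b, r)                 # i-iteration (5.2.9)
    return history, k
```

Calling `gpi(1.0, [3, 1, 2, 4], 1.2, 1.0, 1.0, 1.0)` returns a history of coefficients that can be inspected for the nonincreasing behavior discussed in the chapter; the control law is held fixed inside the inner loop and changed only after it.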
B. Properties of the GPI Algorithm
In this section, under the assumption that perfect function approximation is available,
we will prove that for any $N_i > 0$ and for all $x_k \in \Omega$, the iterative value function
$V_{i,j_i}(x_k)$ will converge to $J^{*}(x_k)$ as $i \to \infty$. The admissibility property of the iterative
control law $v_i(x_k)$ will also be presented.

Theorem 5.2.1 Let $v_0(x_k) \in \mathcal{A}(\Omega)$ be the admissible control law obtained by
(5.2.6). For $i = 1, 2, \ldots$ and for all $x_k \in \Omega$, let the iterative value function $V_{i,j_i}(x_k)$
and the iterative control law $v_i(x_k)$ be obtained by (5.2.7)–(5.2.9). Let $\{N_1, N_2, \ldots\}$
be a sequence of nonnegative integers. Then, we have the following properties.
(i) For $i = 1, 2, \ldots$ and $j_i = 1, 2, \ldots, N_i + 1$, we have

$$ V_{i,j_i}(x_k) \le V_{i,j_i-1}(x_k), \quad \forall x_k \in \Omega, \tag{5.2.11} $$

where $V_{i,N_i+1}(x_k)$ is also defined by (5.2.7).

(ii) For $i = 1, 2, \ldots$, let $j_i$ and $j_{(i+1)}$ be arbitrary constant integers which satisfy
$0 \le j_i \le N_i$ and $0 \le j_{(i+1)} \le N_{i+1}$, respectively. Then, we have

$$ V_{i+1,j_{(i+1)}}(x_k) \le V_{i,j_i}(x_k), \quad \forall x_k \in \Omega. \tag{5.2.12} $$

Proof The theorem can be proven in two steps. We first prove (5.2.11) by mathematical
induction. Let $i = 1$. From $V_{1,0}(\cdot) = V_0(\cdot)$ and (5.2.7), we have for $j_1 = 1$,

$$ \begin{aligned}
V_{1,1}(x_k) &= U(x_k, v_0(x_k)) + V_{1,0}(F(x_k, v_0(x_k))) \\
             &= U(x_k, v_0(x_k)) + V_0(F(x_k, v_0(x_k))) \\
             &= \min_{u_k} \{ U(x_k, u_k) + V_0(F(x_k, u_k)) \} \\
             &\le U(x_k, v_{(-1)}(x_k)) + V_0(F(x_k, v_{(-1)}(x_k))) \\
             &= V_0(x_k) \\
             &= V_{1,0}(x_k).
\end{aligned} \tag{5.2.13} $$

Assume that

$$ V_{1,j_1}(x_k) \le V_{1,j_1-1}(x_k) $$

holds for $i = 1$ and $j_1 = l_1$, $l_1 = 1, 2, \ldots, N_1$. Then, for $j_1 = l_1 + 1$, we have

$$ \begin{aligned}
V_{1,l_1+1}(x_k) &= U(x_k, v_0(x_k)) + V_{1,l_1}(F(x_k, v_0(x_k))) \\
                 &\le U(x_k, v_0(x_k)) + V_{1,l_1-1}(F(x_k, v_0(x_k))) \\
                 &= V_{1,l_1}(x_k).
\end{aligned} \tag{5.2.14} $$

Hence, (5.2.11) holds for $i = 1$. According to (5.2.9), the iterative control law $v_1(x_k)$
is expressed by

$$ \begin{aligned}
v_1(x_k) &= \arg\min_{u_k} \{ U(x_k, u_k) + V_1(x_{k+1}) \} \\
         &= \arg\min_{u_k} \{ U(x_k, u_k) + V_1(F(x_k, u_k)) \},
\end{aligned} \tag{5.2.15} $$

where $V_1(x_{k+1}) = V_{1,N_1}(x_{k+1})$.

Next, let $i = 2$. From $V_{2,0}(\cdot) = V_1(\cdot)$, $V_1(\cdot) = V_{1,N_1}(\cdot)$, and (5.2.7), we can get for
$j_2 = 1$ and for all $x_k \in \Omega$,

$$ \begin{aligned}
V_{2,1}(x_k) &= U(x_k, v_1(x_k)) + V_{2,0}(F(x_k, v_1(x_k))) \\
             &= U(x_k, v_1(x_k)) + V_1(F(x_k, v_1(x_k))) \\
             &= \min_{u_k} \{ U(x_k, u_k) + V_1(F(x_k, u_k)) \} \\
             &\le U(x_k, v_0(x_k)) + V_1(F(x_k, v_0(x_k))) \\
             &= U(x_k, v_0(x_k)) + V_{1,N_1}(F(x_k, v_0(x_k))) \\
             &= V_{1,N_1+1}(x_k) \\
             &\le V_{1,N_1}(x_k) \\
             &= V_1(x_k) \\
             &= V_{2,0}(x_k).
\end{aligned} \tag{5.2.16} $$

So, (5.2.11) holds for $i = 2$ and $j_2 = 1$. Assume that (5.2.11) holds for $j_2 = l_2$,
$l_2 = 1, 2, \ldots, N_2$. Then, for $j_2 = l_2 + 1$, we obtain

$$ \begin{aligned}
V_{2,l_2+1}(x_k) &= U(x_k, v_1(x_k)) + V_{2,l_2}(F(x_k, v_1(x_k))) \\
                 &\le U(x_k, v_1(x_k)) + V_{2,l_2-1}(F(x_k, v_1(x_k))) \\
                 &= V_{2,l_2}(x_k).
\end{aligned} \tag{5.2.17} $$

Then, (5.2.11) holds for $i = 2$. Assume that (5.2.11) holds for $i = r$, $r = 1, 2, \ldots$,
i.e.,

$$ V_{r,j_r}(x_k) \le V_{r,j_r-1}(x_k). $$

For all $x_k \in \Omega$, the iterative control law can be updated by

$$ \begin{aligned}
v_r(x_k) &= \arg\min_{u_k} \{ U(x_k, u_k) + V_r(x_{k+1}) \} \\
         &= \arg\min_{u_k} \{ U(x_k, u_k) + V_r(F(x_k, u_k)) \},
\end{aligned} $$

where $V_r(x_{k+1}) = V_{r,N_r}(x_{k+1})$. Then, for $i = r + 1$, we have $V_{r+1,0}(x_k) = V_r(x_k)$.
According to (5.2.7), for $j_{r+1} = 1$, we have

$$ \begin{aligned}
V_{r+1,1}(x_k) &= U(x_k, v_r(x_k)) + V_{r+1,0}(F(x_k, v_r(x_k))) \\
               &= U(x_k, v_r(x_k)) + V_r(F(x_k, v_r(x_k))) \\
               &= \min_{u_k} \{ U(x_k, u_k) + V_r(F(x_k, u_k)) \} \\
               &\le U(x_k, v_{r-1}(x_k)) + V_r(F(x_k, v_{r-1}(x_k))) \\
               &= U(x_k, v_{r-1}(x_k)) + V_{r,N_r}(F(x_k, v_{r-1}(x_k))) \\
               &= V_{r,N_r+1}(x_k) \\
               &\le V_{r,N_r}(x_k) \\
               &= V_r(x_k) \\
               &= V_{r+1,0}(x_k).
\end{aligned} $$

Table 5.1 The iterative process of the GPI algorithm in (5.2.5)–(5.2.9)

Iteration   Evaluation                                                                  Minimization
i = 0       v(−1) → V0, by (5.2.5)                                                      V0 → v0, by (5.2.6)
i = 1       v0 → V1,0 → V1,1 → ··· → V1,N1 = V1, by (5.2.7)–(5.2.8) (N1-step evaluation)   V1 → v1, by (5.2.9)
i = 2       v1 → V2,0 → V2,1 → ··· → V2,N2 = V2, by (5.2.7)–(5.2.8) (N2-step evaluation)   V2 → v2, by (5.2.9)
···         ···                                                                         ···

So, (5.2.11) holds for $i = r + 1$ and $j_{r+1} = 1$. Assume that (5.2.11) holds for
$j_{r+1} = l_{r+1}$, $l_{r+1} = 1, 2, \ldots, N_{r+1}$. Then, for $j_{r+1} = l_{r+1} + 1$, we have

$$ \begin{aligned}
V_{r+1,l_{r+1}+1}(x_k) &= U(x_k, v_r(x_k)) + V_{r+1,l_{r+1}}(F(x_k, v_r(x_k))) \\
                       &\le U(x_k, v_r(x_k)) + V_{r+1,l_{r+1}-1}(F(x_k, v_r(x_k))) \\
                       &= V_{r+1,l_{r+1}}(x_k).
\end{aligned} $$

Hence, (5.2.11) holds for $i = 1, 2, \ldots$ and $j_i = 1, 2, \ldots, N_i + 1$. The mathematical
induction is complete.
Next, we will prove inequality (5.2.12). For $i = 1$, let $1 \le j_1 \le N_1$ and
$1 \le j_2 \le N_2$. According to (5.2.13)–(5.2.17), for all $x_k \in \Omega$, we have

$$ \begin{aligned}
V_1(x_k) &= V_{1,N_1}(x_k) \le V_{1,j_1}(x_k) \le V_{1,0}(x_k) = V_0(x_k), \\
V_2(x_k) &= V_{2,N_2}(x_k) \le V_{2,j_2}(x_k) \le V_{2,0}(x_k) \le V_1(x_k),
\end{aligned} \tag{5.2.18} $$

which shows that (5.2.12) holds for $i = 1$. Using mathematical induction, it is easy
to prove that (5.2.12) holds for $i = 1, 2, \ldots$. This completes the proof of the theorem.
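Taken together, (5.2.11) and (5.2.12) place all iterates in a single nonincreasing chain. The display below is only a restatement of the theorem, written out for the first two i-iterations and using $V_{i,0} = V_{i-1}$ and $V_i = V_{i,N_i}$ from (5.2.7) and (5.2.8); the argument $x_k$ is omitted for brevity.

```latex
V_0 = V_{1,0} \ge V_{1,1} \ge \cdots \ge V_{1,N_1} = V_1
    = V_{2,0} \ge V_{2,1} \ge \cdots \ge V_{2,N_2} = V_2 \ge \cdots
```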

The iterative process of the GPI algorithm in (5.2.5)–(5.2.9) is illustrated in
Table 5.1. The algorithm starts with an admissible control law $v_{(-1)}(x_k)$ and it obtains
$v_0(x_k)$ at the initialization step. Comparing Tables 4.1 and 5.1, we note that
starting from $i = 1$, “evaluation” in Table 4.1 is replaced by an iterative process in
Table 5.1 for the calculation of $V_i(x_k)$. Clearly, “$N_i$-step evaluation” in Table 5.1 is
an approximation to the “evaluation” in Table 4.1. It can be shown, following similar
steps as in the proof of Theorem 2.3.3, that the limit $\lim_{j_i \to \infty} V_{i,j_i}(x_k) = V_{i,\infty}(x_k)$
exists and $V_{i,\infty}(x_k)$ is a solution to the evaluation equation in (5.2.10), since it is the
same update as in VI algorithms (see Theorem 5.2.2). On the other hand, at the
initialization step, it is also possible to design algorithms so that $V_0(\cdot)$ is obtained
by an iterative evaluation procedure instead of using (5.2.5) (see Algorithm 5.2.1).

Lemma 5.2.1 Suppose that Assumptions 5.2.1–5.2.4 hold. For $i = 1, 2, \ldots$, and for
$j_i = 0, 1, \ldots, N_i$, the iterative value function $V_{i,j_i}(x_k)$ is a positive-definite function
of $x_k$.

Proof Let $\underline{v}_k^{(-1)} = (v_{(-1)}(x_k), v_{(-1)}(x_{k+1}), \ldots)$. As $v_{(-1)}(x_k) \in \mathcal{A}(\Omega)$ is admissible,
according to (5.2.5), for all $x_k \in \Omega$, the iterative value function

$$ V_0(x_k) = \sum_{l=0}^{\infty} U(x_{k+l}, v_{(-1)}(x_{k+l})) \tag{5.2.19} $$

is finite. For $x_k = 0$, we have $v_{(-1)}(x_k) = 0$. According to Assumption 5.2.1, we
can get $x_{k+1} = F(x_k, v_{(-1)}(x_k)) = F(0, 0) = 0$. By mathematical induction, for
$l = 0, 1, \ldots$, we have $x_{k+l} = 0$. According to (5.2.19) and Assumption 5.2.4, we can
get $V_0(x_k) = 0$ when $x_k = 0$. On the other hand, according to Assumption 5.2.4, we
have $V_0(x_k) \to \infty$ as $x_k \to \infty$. As $U(x_k, u_k) > 0$ for all $x_k \ne 0$, we have $V_0(x_k) > 0$
for all $x_k \ne 0$. Hence, $V_0(x_k)$ is a positive-definite function. According to (5.2.7)–
(5.2.9), using mathematical induction, we can easily obtain that $V_{i,j_i}(x_k)$ is also
positive definite. The proof is complete.

Theorem 5.2.2 For $i = 1, 2, \ldots$ and $j_i = 0, 1, \ldots, N_i$, let the iterative value function
$V_{i,j_i}(x_k)$ and iterative control law $v_i(x_k)$ be obtained by (5.2.7)–(5.2.9). If for
$i = 1, 2, \ldots$, we let $N_i \to \infty$, then for all $x_k \in \Omega$ the iterative value function $V_{i,j_i}(x_k)$
is convergent as $j_i \to \infty$, i.e.,

$$ V_{i,\infty}(x_k) = U(x_k, v_{i-1}(x_k)) + V_{i,\infty}(F(x_k, v_{i-1}(x_k))), \tag{5.2.20} $$

where

$$ V_{i,\infty}(x_k) \triangleq \lim_{j_i \to \infty} V_{i,j_i}(x_k). $$

Proof According to (5.2.11), for $i = 1, 2, \ldots$ and for all $x_k \in \Omega$, the iterative value
function $V_{i,j_i}(x_k)$ is monotonically nonincreasing as $j_i$ increases from 0 to $N_i$. On
the other hand, according to Lemma 5.2.1, $V_{i,j_i}(x_k)$ is a positive-definite function for
$i = 1, 2, \ldots$ and $j_i = 0, 1, \ldots, N_i$, i.e., $V_{i,j_i}(x_k) > 0$, $\forall x_k \ne 0$. This means that the
iterative value function $V_{i,j_i}(x_k)$ is monotonically nonincreasing and lower bounded.
Hence, for all $x_k \in \Omega$, the limit of $V_{i,j_i}(x_k)$ exists when $j_i \to \infty$. Then, we can obtain
(5.2.20) directly. This completes the proof of the theorem.
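The limit in Theorem 5.2.2 can be observed numerically in a scalar setting. The sketch below is an illustration under assumptions that are not part of the chapter: a hypothetical linear system $x_{k+1} = a x_k + b u_k$, utility $q x_k^2 + r u_k^2$, a fixed control law $v(x) = -k x$, and value functions of the form $V(x) = p x^2$. Under these assumptions the j-iteration (5.2.7) becomes $p_j = (q + r k^2) + (a - b k)^2 p_{j-1}$, whose fixed point $(q + r k^2)/(1 - (a - b k)^2)$ plays the role of (5.2.20).

```python
def evaluation_sweeps(p_start, k, a, b, q, r, tol=1e-10, max_sweeps=10_000):
    """Repeat the j-iteration (5.2.7) for V(x) = p x^2 under u = -k x until it stalls."""
    lam2 = (a - b * k) ** 2          # squared closed-loop gain
    stage = q + r * k * k            # coefficient of the utility under u = -k x
    trace = [p_start]
    for _ in range(max_sweeps):
        trace.append(stage + lam2 * trace[-1])
        if abs(trace[-1] - trace[-2]) < tol:
            break
    return trace
```

When `p_start` lies above the fixed point, which mirrors the condition $V_{i,1}(x_k) \le V_{i,0}(x_k)$ in (5.2.11), the returned trace decreases toward it; the stopping tolerance `tol` is a practical stand-in for letting $N_i \to \infty$.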

Corollary 5.2.1 For $i = 1, 2, \ldots$ and $j_i = 0, 1, \ldots, N_i$, let the iterative value function
$V_{i,j_i}(x_k)$ and iterative control law $v_i(x_k)$ be obtained by (5.2.7)–(5.2.9). Then,
for $i = 1, 2, \ldots$ and for all $x_k \in \Omega$, the iterative control law $v_i(x_k)$ is admissible.

Proof Consider $N_i > 0$. For $\bar{j}_i = N_i + 1, N_i + 2, \ldots$ and for all $x_k \in \Omega$, we
construct a value function $V_{i,\bar{j}_i}(x_k)$ as

$$ V_{i,\bar{j}_i}(x_k) = U(x_k, v_{i-1}(x_k)) + V_{i,\bar{j}_i-1}(F(x_k, v_{i-1}(x_k))), \tag{5.2.21} $$

where the recursion starts from $V_{i,N_i}(x_k)$ obtained by (5.2.7). According to (5.2.21), we can obtain

$$ V_{i,\bar{j}_i}(x_k) = \sum_{l=0}^{\bar{j}_i - N_i - 1} U(x_{k+l}, v_{i-1}(x_{k+l})) + V_{i,N_i}(x_{k+\bar{j}_i-N_i}). $$

According to Theorem 5.2.2, for all $x_k \in \Omega$, the iterative value function $V_{i,\infty}(x_k)$,
which is expressed by

$$ V_{i,\infty}(x_k) = \lim_{\bar{j}_i \to \infty} \sum_{l=0}^{\bar{j}_i - N_i - 1} U(x_{k+l}, v_{i-1}(x_{k+l})) + \lim_{\bar{j}_i \to \infty} V_{i,N_i}(x_{k+\bar{j}_i-N_i}), $$

is finite. According to Assumption 5.2.4, the utility function $U(x_k, v_{i-1}(x_k)) > 0$,
$\forall x_k \ne 0$. Then,

$$ \lim_{k \to \infty} U(x_k, v_{i-1}(x_k)) = 0, $$

which implies $x_k \to 0$ as $k \to \infty$. On the other hand, according to Lemma 5.2.1,
$V_{i,N_i}(x_k)$ is positive definite. Thus, we can get

$$ \lim_{\bar{j}_i \to \infty} V_{i,N_i}(x_{k+\bar{j}_i-N_i}) = 0. $$

As

$$ \sum_{l=0}^{N_i} U(x_{k+l}, v_{i-1}(x_{k+l})) $$

is finite, we have that

$$ \sum_{l=0}^{\infty} U(x_{k+l}, v_{i-1}(x_{k+l})) = \sum_{l=0}^{N_i} U(x_{k+l}, v_{i-1}(x_{k+l})) + V_{i,\infty}(x_{k+N_i+1}) $$

is also finite. The above shows that $v_{i-1}(x_k)$ is admissible. It is easy to conclude that
$v_i(x_k)$ is also admissible. The proof is complete.

Next, the convergence property of the generalized policy iteration algorithm will
be established. As the iteration index i increases to ∞, we will show that the optimal
cost function and optimal control law can be achieved using the present generalized
policy iteration algorithm. Before the main theorem, some lemmas are necessary.
 
Lemma 5.2.2 (cf. [3]) If a monotonically nonincreasing sequence $\{a_n\}$, $n = 0, 1, \ldots$,
contains an arbitrary convergent subsequence, then the sequence $\{a_n\}$ is convergent.

Lemma 5.2.3 For $i = 1, 2, \ldots$, let the iterative value function $V_i(x_k)$ be defined
as in (5.2.8). Then, the iterative value function sequence $\{V_i(x_k)\}$ is monotonically
nonincreasing and convergent.

Lemma 5.2.3 is a direct consequence of Theorem 5.2.1.

Theorem 5.2.3 For $i = 0, 1, \ldots$ and for all $x_k \in \Omega$, let $V_{i,j_i}(x_k)$ and $v_i(x_k)$ be
obtained by (5.2.5)–(5.2.9). If Assumptions 5.2.1–5.2.4 hold, then for any $N_i > 0$,
the iterative value function $V_{i,j_i}(x_k)$ converges to the optimal cost function $J^{*}(x_k)$,
as $i \to \infty$, i.e.,

$$ \lim_{i \to \infty} V_{i,j_i}(x_k) = J^{*}(x_k), \tag{5.2.22} $$

which satisfies the Bellman equation (5.2.4).

Proof Define a sequence of iterative value functions as

$$ \{V_{i,j_i}(x_k)\} \triangleq \{ V_0(x_k), V_{1,0}(x_k), V_{1,1}(x_k), \ldots, V_{1,N_1}(x_k),
   V_1(x_k), V_{2,0}(x_k), \ldots, V_{2,N_2}(x_k), \ldots \}. $$

If we let $\{V_i(x_k)\} \triangleq \{V_0(x_k), V_1(x_k), \ldots\}$, then $\{V_i(x_k)\}$ is a subsequence of
$\{V_{i,j_i}(x_k)\}$, which is monotonically nonincreasing. According to Lemma 5.2.3, the
limit of $\{V_i(x_k)\}$ exists. From Lemma 5.2.2, we can get that if the sequence $\{V_i(x_k)\}$
is convergent, then $\{V_{i,j_i}(x_k)\}$ is also convergent. Since a sequence $\{V_{i,j_i}(x_k)\}$ can
converge to at most one point [3], the sequence $\{V_{i,j_i}(x_k)\}$ and its subsequence
$\{V_i(x_k)\}$ possess the same limit, i.e.,

$$ \lim_{i \to \infty} V_{i,j_i}(x_k) = \lim_{i \to \infty} V_i(x_k). $$

Thus, we will prove

$$ \lim_{i \to \infty} V_i(x_k) = J^{*}(x_k). \tag{5.2.23} $$

The statement (5.2.23) can be proven in three steps.


(1) Show that the limit of the iterative value function $V_i(x_k)$ satisfies the Bellman
equation, as $i \to \infty$.
According to Lemma 5.2.3, for all $x_k \in \Omega$, we can define the value function
$V_{\infty}(x_k)$ as the limit of the iterative value function $V_i(x_k)$, i.e., $V_{\infty}(x_k) = \lim_{i \to \infty} V_i(x_k)$.
According to (5.2.7)–(5.2.9), we have

$$ \begin{aligned}
V_i(x_k) &= V_{i,N_i}(x_k) \\
         &\le V_{i,1}(x_k) \\
         &= U(x_k, v_{i-1}(x_k)) + V_{i-1}(F(x_k, v_{i-1}(x_k))) \\
         &= \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \}.
\end{aligned} $$

Then, we can obtain

$$ V_{\infty}(x_k) = \lim_{i \to \infty} V_i(x_k) \le V_i(x_k) \le \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \}. $$

Let $i \to \infty$. For all $x_k \in \Omega$, we can obtain

$$ V_{\infty}(x_k) \le \min_{u_k} \{ U(x_k, u_k) + V_{\infty}(F(x_k, u_k)) \}. \tag{5.2.24} $$

Let $\varepsilon > 0$ be an arbitrary positive number. Since $V_i(x_k)$ is nonincreasing for $i \ge 0$
and

$$ \lim_{i \to \infty} V_i(x_k) = V_{\infty}(x_k), $$

there exists a positive integer $p$ such that $V_p(x_k) - \varepsilon \le V_{\infty}(x_k) \le V_p(x_k)$. Hence,
we can get

$$ \begin{aligned}
V_{\infty}(x_k) &\ge U(x_k, v_{p-1}(x_k)) + V_{p-1}(F(x_k, v_{p-1}(x_k))) - \varepsilon \\
                &\ge U(x_k, v_{p-1}(x_k)) + V_{\infty}(F(x_k, v_{p-1}(x_k))) - \varepsilon \\
                &\ge \min_{u_k} \{ U(x_k, u_k) + V_{\infty}(F(x_k, u_k)) \} - \varepsilon.
\end{aligned} $$

Since $\varepsilon > 0$ is arbitrary, for all $x_k \in \Omega$, we have

$$ V_{\infty}(x_k) \ge \min_{u_k} \{ U(x_k, u_k) + V_{\infty}(F(x_k, u_k)) \}. \tag{5.2.25} $$

Combining (5.2.24) and (5.2.25), for all $x_k \in \Omega$, we can obtain

$$ V_{\infty}(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_{\infty}(F(x_k, u_k)) \}. \tag{5.2.26} $$
Next, for all $x_k \in \Omega$, let $\mu(x_k)$ be an arbitrary admissible control law, and define
a new value function $P(x_k)$ as

$$ P(x_k) = U(x_k, \mu(x_k)) + P(F(x_k, \mu(x_k))). \tag{5.2.27} $$

Then, we can proceed to the second step of the proof.

(2) Show that for an arbitrary admissible control law $\mu(x_k)$, the value function
$P(x_k)$ satisfies

$$ P(x_k) \ge V_{\infty}(x_k). \tag{5.2.28} $$

Please refer to part (2) of the proof of Theorem 4.2.2.

(3) Show that the value function $V_{\infty}(x_k)$ equals the optimal cost function $J^{*}(x_k)$.

According to the definition of the optimal cost function J ∗ (xk ) given in (5.2.3),
for i = 0, 1, . . . and for all xk ∈ Ω, we have Vi (xk ) ≥ J ∗ (xk ). Then, let i → ∞. We
can obtain V∞ (xk ) ≥ J ∗ (xk ).
On the other hand, for an arbitrary admissible control law μ(xk ), (5.2.28) holds.
For all xk ∈ Ω, let μ(xk ) = u ∗ (xk ), where u ∗ (xk ) is the optimal control law. Then,
we can get V∞ (xk ) ≤ J ∗ (xk ). Hence, (5.2.22) holds. The proof is complete.
Corollary 5.2.2 For $i = 0, 1, \ldots$, let $V_{i,j_i}(x_k)$ and $v_i(x_k)$ be obtained by (5.2.5)–
(5.2.9). If, for $i = 1, 2, \ldots$, we let $j_i \equiv 1$, then the iterative value function $V_{i,j_i}(x_k)$
converges to the optimal cost function $J^{*}(x_k)$, as $i \to \infty$.

5.2.2 GPI Algorithm and Relaxation of Initial Conditions

In the previous section, the monotonicity, convergence, optimality, and admissibility
properties of the generalized policy iteration algorithm have been analyzed. From the
generalized policy iteration algorithm (5.2.5)–(5.2.9), we can see that implementing
the present algorithm requires an admissible control law $v_{(-1)}(x_k) \in \mathcal{A}(\Omega)$ to
construct the initial value function $V_0(x_k)$ that satisfies (5.2.5). Usually, $v_{(-1)}(x_k) \in \mathcal{A}(\Omega)$
and $V_0(x_k)$ in (5.2.5) are difficult to obtain, which creates difficulties in the
implementation of the present algorithm. In this section, some effective methods will
be presented to relax the initial value function of the algorithm.
First, we consider the situation that the admissible control law v(−1) (xk ) is known.
We develop a finite-step policy evaluation algorithm to relax the initial value function
V0 (xk ), where it is approximated by a finite iteration procedure instead of (5.2.5).
The detailed implementation of the algorithm is expressed in Algorithm 5.2.1.
Lemma 5.2.4 For $x_k \in \Omega$, let $\Psi(x_k) \ge 0$ be an arbitrary positive semidefinite function.
Let $v_{(-1)}(x_k)$ be an arbitrary admissible control law and let $V_{0,j_0}(x_k)$ be the iterative
value function updated by (5.2.29)–(5.2.31), where $V_{0,0}(x_k) = \Psi(x_k)$. Then,
$V_{0,j_0}(x_k)$ is convergent as $j_0 \to \infty$.
Proof According to (5.2.31), for all $x_k \in \Omega$, we have

$$ \begin{aligned}
V_{0,j_0+1}(x_k) - V_{0,j_0}(x_k) &= U(x_k, v_{(-1)}(x_k)) + V_{0,j_0}(x_{k+1}) \\
  &\quad - \big( U(x_k, v_{(-1)}(x_k)) + V_{0,j_0-1}(x_{k+1}) \big) \\
  &= V_{0,j_0}(x_{k+1}) - V_{0,j_0-1}(x_{k+1}).
\end{aligned} \tag{5.2.32} $$

According to (5.2.32), we can get

$$ \begin{cases}
V_{0,j_0+1}(x_k) - V_{0,j_0}(x_k) = V_{0,1}(x_{k+j_0}) - V_{0,0}(x_{k+j_0}), \\
V_{0,j_0}(x_k) - V_{0,j_0-1}(x_k) = V_{0,1}(x_{k+j_0-1}) - V_{0,0}(x_{k+j_0-1}), \\
\quad \vdots \\
V_{0,1}(x_k) - V_{0,0}(x_k) = V_{0,1}(x_k) - V_{0,0}(x_k).
\end{cases} $$

Algorithm 5.2.1 Finite-step policy evaluation algorithm for initial value function
Initialization:
Choose randomly an array of system states $x_k$ in $\Omega$, i.e., $X_k = (x_k^{(1)}, x_k^{(2)}, \ldots, x_k^{(p)})$, where $p$
is a large positive integer.
Choose an arbitrary positive semidefinite function $\Psi(x_k) \ge 0$.
Determine an initial admissible control law $v_{(-1)}(x_k)$.
Iteration:
Step 1. Let the iteration index $j_0 = 0$ and let $V_{0,0}(x_k) = \Psi(x_k)$.
Step 2. For all $x_k \in X_k$, update the control law $v_0^{j_0}(x_k)$ by

$$ v_0^{j_0}(x_k) = \arg\min_{u_k} \{ U(x_k, u_k) + V_{0,j_0}(F(x_k, u_k)) \}, \tag{5.2.29} $$

and improve the iterative value function by

$$ \begin{aligned}
V_{1,0}^{j_0}(x_k) &= \min_{u_k} \{ U(x_k, u_k) + V_{0,j_0}(F(x_k, u_k)) \} \\
                   &= U(x_k, v_0^{j_0}(x_k)) + V_{0,j_0}(F(x_k, v_0^{j_0}(x_k))).
\end{aligned} \tag{5.2.30} $$

Step 3. If $V_{1,0}^{j_0}(x_k) - V_{0,j_0}(x_k) \le 0$ for all $x_k \in X_k$, goto Step 5; else, goto Step 4.
Step 4. Let $j_0 = j_0 + 1$. For all $x_k \in X_k$, update the iterative value function by

$$ V_{0,j_0}(x_k) = U(x_k, v_{(-1)}(x_k)) + V_{0,j_0-1}(F(x_k, v_{(-1)}(x_k))). \tag{5.2.31} $$

Goto Step 2.
Step 5. Let $V_0(x_k) = V_{1,0}^{j_0}(x_k)$ and $v_0(x_k) = v_0^{j_0}(x_k)$.

The right-hand side can be written as

$$ \begin{cases}
V_{0,1}(x_{k+j_0}) - V_{0,0}(x_{k+j_0}) = U(x_{k+j_0}, v_{(-1)}(x_{k+j_0})) + V_{0,0}(x_{k+j_0+1}) - V_{0,0}(x_{k+j_0}), \\
V_{0,1}(x_{k+j_0-1}) - V_{0,0}(x_{k+j_0-1}) = U(x_{k+j_0-1}, v_{(-1)}(x_{k+j_0-1})) + V_{0,0}(x_{k+j_0}) - V_{0,0}(x_{k+j_0-1}), \\
\quad \vdots \\
V_{0,1}(x_k) - V_{0,0}(x_k) = U(x_k, v_{(-1)}(x_k)) + V_{0,0}(x_{k+1}) - V_{0,0}(x_k).
\end{cases} $$

By adding all these equations together, we get

$$ V_{0,j_0+1}(x_k) = \sum_{l=0}^{j_0} U(x_{k+l}, v_{(-1)}(x_{k+l})) + \Psi(x_{k+j_0+1}). $$

Let $j_0 \to \infty$. We can obtain

$$ \lim_{j_0 \to \infty} V_{0,j_0+1}(x_k) = \sum_{l=0}^{\infty} U(x_{k+l}, v_{(-1)}(x_{k+l})). $$

As $v_{(-1)}(x_k)$ is an admissible control law,

$$ \sum_{l=0}^{\infty} U(x_{k+l}, v_{(-1)}(x_{k+l})) $$

is finite. Hence, we conclude that $\lim_{j_0 \to \infty} V_{0,j_0}(x_k)$ is finite, which means $V_{0,j_0}(x_k)$ is
convergent as $j_0 \to \infty$. This completes the proof of the lemma.
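The unrolled expression obtained in the proof of Lemma 5.2.4 can be checked numerically. The sketch below is an illustration only: the scalar system `F`, the control law `v_init` (playing the role of $v_{(-1)}$), the utility `U`, and the starting function `Psi` are hypothetical choices, not taken from the chapter. It compares the recursive update (5.2.31) with the sum of utilities along the closed-loop trajectory plus $\Psi$ evaluated at the final state.

```python
import math


def F(x, u):
    """Hypothetical scalar system with F(0, 0) = 0."""
    return 0.8 * math.sin(x) + u


def v_init(x):
    """Hypothetical control law, plays the role of v_(-1)."""
    return -0.3 * x


def U(x, u):
    """Quadratic utility, positive definite in (x, u)."""
    return x * x + u * u


def Psi(x):
    """Positive semidefinite starting function V_{0,0}."""
    return 5.0 * x * x


def V0_recursive(j0, x):
    """V_{0,j0}(x) computed by recursing on the update (5.2.31)."""
    if j0 == 0:
        return Psi(x)
    u = v_init(x)
    return U(x, u) + V0_recursive(j0 - 1, F(x, u))


def V0_unrolled(j0, x):
    """Sum of j0 utilities along the trajectory plus Psi at the final state."""
    total = 0.0
    for _ in range(j0):
        u = v_init(x)
        total += U(x, u)
        x = F(x, u)
    return total + Psi(x)
```

For states in $[-1, 1]$ this closed loop contracts toward the origin, so successive values `V0_recursive(j0, x)` settle as `j0` grows, which is the convergence stated in the lemma.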
Using the admissible control law $v_{(-1)}(x_k)$, from Lemma 5.2.4, $V_{0,j_0+1}(x_k) = V_{0,j_0}(x_k)$
holds as $j_0 \to \infty$. It means that there must exist $N_0 > 0$ such that
$V_{1,0}^{N_0}(x_k) \le V_{0,N_0}(x_k)$. Hence, if we have obtained an admissible control law, then
we can construct the initial value function by finite-step policy evaluation to replace
the value function $V_0(x_k)$ in (5.2.5). On the other hand, Algorithm 5.2.1 requires an
admissible control law $v_{(-1)}(x_k)$ to implement. Usually, the admissible control law
of the nonlinear system is difficult to obtain. To overcome this difficulty, a policy
improvement algorithm can be implemented by experiment. The details are given in
Algorithm 5.2.2.

Algorithm 5.2.2 Policy improvement algorithm for initial value function
Initialization:
Choose randomly an array of system states $x_k$ in $\Omega$, i.e., $X_k = (x_k^{(1)}, x_k^{(2)}, \ldots, x_k^{(p)})$, where $p$
is a large positive integer. Let $\varsigma_0 = 0$.
Iteration:
Step 1. Choose arbitrarily a large positive definite function $\bar{\Psi}^{\varsigma_0}(x_k) \ge 0$, and let
$V_{0,0}^{\varsigma_0}(x_k) = \bar{\Psi}^{\varsigma_0}(x_k)$.
Step 2. For all $x_k \in X_k$, update the control law $v_0^{\varsigma_0}(x_k)$ by

$$ v_0^{\varsigma_0}(x_k) = \arg\min_{u_k} \{ U(x_k, u_k) + \bar{\Psi}^{\varsigma_0}(F(x_k, u_k)) \}, \tag{5.2.33} $$

and for all $x_k \in X_k$, improve the iterative value function by

$$ \begin{aligned}
V_{1,0}^{\varsigma_0}(x_k) &= \min_{u_k} \{ U(x_k, u_k) + V_{0,0}^{\varsigma_0}(F(x_k, u_k)) \} \\
  &= U(x_k, v_0^{\varsigma_0}(x_k)) + V_{0,0}^{\varsigma_0}(F(x_k, v_0^{\varsigma_0}(x_k))).
\end{aligned} \tag{5.2.34} $$

Step 3. For all $x_k \in X_k$, if the inequality

$$ V_{1,0}^{\varsigma_0}(x_k) - \bar{\Psi}^{\varsigma_0}(x_k) \le 0 \tag{5.2.35} $$

holds, then goto Step 4; else, let $\varsigma_0 = \varsigma_0 + 1$ and goto Step 1.
Step 4. Let $V_0(x_k) = V_{1,0}^{\varsigma_0}(x_k)$ and $v_0(x_k) = v_0^{\varsigma_0}(x_k)$.
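A minimal sketch of the trial loop in Algorithm 5.2.2 is given below for a scalar linear-quadratic case. Everything specific here is an assumption made for illustration: the system $x_{k+1} = a x_k + b u_k$, the utility $q x_k^2 + r u_k^2$, the candidate functions $\bar{\Psi}^{\varsigma_0}(x) = c\,x^2$, and the doubling schedule for $c$ (the algorithm itself only says the function is chosen arbitrarily). With these choices (5.2.33) has the closed-form minimizer $u = -k x$.

```python
def initial_value_by_trial(a, b, q, r, states, c_first=0.5, growth=2.0, max_trials=60):
    """Sketch of Algorithm 5.2.2 with Psi-bar(x) = c x^2 on a scalar linear system."""
    c = c_first
    for trial in range(max_trials):
        k = a * b * c / (r + b * b * c)                 # (5.2.33): v_0(x) = -k x
        c_new = q + r * k * k + c * (a - b * k) ** 2    # (5.2.34): V_{1,0}(x) = c_new x^2
        # (5.2.35): V_{1,0}(x) - Psi-bar(x) <= 0 on all sampled states
        if all(c_new * x * x - c * x * x <= 0.0 for x in states):
            return c_new, k, trial
        c *= growth                                     # hypothetical schedule: try a larger Psi-bar
    raise RuntimeError("no suitable initial function found")
```

With `a = 1.2`, `b = q = r = 1.0`, the candidates `c = 0.5` and `c = 1.0` fail the test (5.2.35) and `c = 2.0` passes, so no admissible control law has to be supplied in advance.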

Theorem 5.2.4 For all $x_k \in \Omega$, let the iterative control law $v_0^{\varsigma_0}(x_k)$ be expressed as
in (5.2.33) and let the iterative value function $V_{1,0}^{\varsigma_0}(x_k)$ be expressed as in (5.2.34). If
the iterative value functions satisfy (5.2.35), then the convergence properties (5.2.11)
and (5.2.12) of Theorem 5.2.1 hold for $i = 1, 2, \ldots$ and $j_i = 0, 1, \ldots, N_i$.
Proof Let $i = 1$ and $j_1 = 0$. As $V_0(x_k) = V_{1,0}^{\varsigma_0}(x_k)$ and $v_0(x_k) = v_0^{\varsigma_0}(x_k)$, according
to (5.2.7) and (5.2.35), we have

$$ \begin{aligned}
V_{1,1}(x_k) &= U(x_k, v_0(x_k)) + V_{1,0}(F(x_k, v_0(x_k))) \\
             &= U(x_k, v_0^{\varsigma_0}(x_k)) + V_{1,0}^{\varsigma_0}(F(x_k, v_0^{\varsigma_0}(x_k))) \\
             &\le U(x_k, v_0^{\varsigma_0}(x_k)) + V_{0,0}^{\varsigma_0}(F(x_k, v_0^{\varsigma_0}(x_k))) \\
             &= V_{1,0}(x_k).
\end{aligned} $$

Following similar steps as in the proof of Theorem 5.2.1, the convergence properties
(5.2.11) and (5.2.12) can be shown for $i = 1, 2, \ldots$ and $j_i = 0, 1, \ldots, N_i$. This
completes the proof of the theorem.

Remark 5.2.1 From Algorithm 5.2.2, we can see that the admissible control law
v(−1) (xk ) in Algorithm 5.2.1 is avoided. This is a merit of Algorithm 5.2.2. However,
in Algorithm 5.2.2, we should find a positive-definite function Ψ̄ ς0 (xk ) that satisfies
(5.2.35). As Ψ̄ ς0 (xk ) is randomly chosen, it may take some iterations to determine
Ψ̄ ς0 (xk ). This is a disadvantage of the algorithm.

Based on the above preparations, we summarize the generalized policy iteration
ADP algorithm in Algorithm 5.2.3.

Algorithm 5.2.3 Generalized policy iteration algorithm
Initialization:
Choose randomly an array of system states $x_k$ in $\Omega$, i.e., $X_k = (x_k^{(1)}, x_k^{(2)}, \ldots, x_k^{(p)})$, where $p$
is a large positive integer.
Choose a computation precision $\varepsilon$.
Construct a sequence $\{N_i\}$, where $N_i > 0$, $i = 1, 2, \ldots$, are positive integers.
Iteration:
Step 1. Let the iteration index $i = 1$.
Step 2. Obtain $V_0(x_k)$ and $v_0(x_k)$ by Algorithm ϒ, ϒ = 5.2.1 or 5.2.2, instead of (5.2.5) and (5.2.6).
Step 3. Let $j_i$ increase from 1 to $N_i$. For all $x_k \in X_k$, do Policy Evaluation according to (5.2.7),

$$ V_{i,j_i}(x_k) = U(x_k, v_{i-1}(x_k)) + V_{i,j_i-1}(F(x_k, v_{i-1}(x_k))), $$

with $V_{i,0}(x_k) = V_{i-1}(x_k)$.
Step 4. Let $V_i(x_k) = V_{i,N_i}(x_k)$ as in (5.2.8).
Step 5. For all $x_k \in X_k$, do Policy Improvement as in (5.2.9),

$$ v_i(x_k) = \arg\min_{u_k} \{ U(x_k, u_k) + V_i(F(x_k, u_k)) \}. $$

Step 6. For all $x_k \in X_k$, if $|V_{i-1}(x_k) - V_i(x_k)| < \varepsilon$, then the optimal cost function and optimal
control law are obtained, and goto Step 7; else, let $i = i + 1$ and goto Step 3.
Step 7. Return $V_i(x_k)$ and $v_i(x_k)$ as the optimal cost function $J^{*}(x_k)$ and the optimal control law
$u^{*}(x_k)$.
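The overall loop of Algorithm 5.2.3 can be sketched on a small tabular problem, where value functions are stored exactly and no neural networks are needed. The example below is an assumption-laden illustration, not the chapter's implementation: the finite state set, the control set, the system `F`, and the utility `U` are hypothetical, and Step 2 is replaced by an exact evaluation of a known admissible control law as in (5.2.5) and (5.2.6).

```python
STATES = range(0, 9)          # hypothetical finite state set, 0 is the equilibrium
CONTROLS = (0, 1, 2)          # hypothetical control set


def F(x, u):
    """Hypothetical system with F(0, 0) = 0."""
    return max(x - u, 0)


def U(x, u):
    """Utility with U(0, 0) = 0 and U > 0 otherwise."""
    return float(x + u * u)


def greedy(V):
    """Policy improvement (5.2.9): minimize U(x, u) + V(F(x, u)) over u."""
    return {x: min(CONTROLS, key=lambda u: U(x, u) + V[F(x, u)]) for x in STATES}


def gpi(N, eps=1e-9, max_outer=200):
    """GPI loop on a table; N(i) returns the number of evaluation sweeps N_i."""
    v_init = {x: (1 if x > 0 else 0) for x in STATES}     # admissible: reaches 0
    V = {0: 0.0}
    for x in STATES:                                      # exact evaluation, cf. (5.2.5)
        if x > 0:
            V[x] = U(x, v_init[x]) + V[F(x, v_init[x])]
    v = greedy(V)                                         # cf. (5.2.6)
    for i in range(1, max_outer + 1):
        V_prev = dict(V)
        for _ in range(N(i)):                             # policy evaluation (5.2.7)
            V = {x: U(x, v[x]) + V[F(x, v[x])] for x in STATES}
        v = greedy(V)                                     # policy improvement (5.2.9)
        if max(abs(V_prev[x] - V[x]) for x in STATES) < eps:   # stopping test of Step 6
            return V, v, i
    return V, v, max_outer
```

Calling `gpi(lambda i: 1)` performs one evaluation sweep per outer iteration, while `gpi(lambda i: 20)` evaluates each control law almost completely before improving it; both calls can be compared to see how the choice of $\{N_i\}$ changes the number of outer iterations but not the limit.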

5.2.3 Simulation Studies

To evaluate the performance of our generalized policy iteration algorithm, two exam-
ples are given for numerical experiments to solve the approximate optimal control
problems.
Example 5.2.1 First, let us consider the following spring–mass–damper system [7]

$$ M \frac{d^2 y}{dt^2} + b \frac{dy}{dt} + \kappa y = u, $$

where $y$ is the position and $u$ is the control input. Let $M = 0.1$ kg denote the mass
of the object. Let $\kappa = 2$ kgf/m be the stiffness coefficient of the spring and let $b = 0.1$ be
the wall friction. Let $x_1 = y$, $x_2 = dy/dt$. Discretizing the system function with the
sampling interval $\Delta t = 0.1$ s leads to

$$ \begin{bmatrix} x_{1(k+1)} \\ x_{2(k+1)} \end{bmatrix}
 = \begin{bmatrix} 1 & \Delta t \\ -\dfrac{\kappa}{M}\Delta t & 1 - \dfrac{b}{M}\Delta t \end{bmatrix}
   \begin{bmatrix} x_{1k} \\ x_{2k} \end{bmatrix}
 + \begin{bmatrix} 0 \\ \dfrac{\Delta t}{M} \end{bmatrix} u_k. \tag{5.2.36} $$

Let the initial state be $x_0 = [1, -1]^{T}$. Let the cost function be expressed by (5.2.2).
The utility function is expressed as $U(x_k, u_k) = x_k^{T} x_k + u_k^2$.
Let the state space be $\Omega = \{x_k : -1 \le x_{1k} \le 1, -1 \le x_{2k} \le 1\}$. We randomly
choose an array of $p = 5000$ states in $\Omega$ to implement the generalized policy iteration
algorithm to obtain the optimal control law. Neural networks are used to implement
the present generalized policy iteration algorithm. The critic network and the action
network are chosen as three-layer backpropagation (BP) neural networks with the
structures of 2–8–1 and 2–8–1, respectively. Define the two neural networks as group
“NN1”. For system (5.2.36), we can obtain an admissible control law $u(x_k) = K x_k$,
where $K = [0.13, -0.17]$. Let $\Psi(x_k) = x_k^{T} P_0 x_k$, where

$$ P_0 = \begin{bmatrix} 80 & 1 \\ 1 & 2 \end{bmatrix}. $$
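The discretized model (5.2.36) and the control law $u(x_k) = K x_k$ can be simulated directly. The sketch below uses the numerical values given in Example 5.2.1; the function names and the truncation of the infinite sum (5.2.2) to a finite number of steps are assumptions made for illustration.

```python
M, KAPPA, B_FRIC, DT = 0.1, 2.0, 0.1, 0.1      # parameter values from Example 5.2.1

# System matrices of (5.2.36).
A = [[1.0, DT],
     [-KAPPA / M * DT, 1.0 - B_FRIC / M * DT]]
B = [0.0, DT / M]


def step(x, u):
    """One step of the discretized model (5.2.36)."""
    return [A[0][0] * x[0] + A[0][1] * x[1] + B[0] * u,
            A[1][0] * x[0] + A[1][1] * x[1] + B[1] * u]


def utility(x, u):
    """U(x_k, u_k) = x_k^T x_k + u_k^2."""
    return x[0] ** 2 + x[1] ** 2 + u ** 2


def rollout_cost(x0, gain, n_steps):
    """Truncated cost (5.2.2) under the linear law u = gain[0] x1 + gain[1] x2."""
    x, total = list(x0), 0.0
    for _ in range(n_steps):
        u = gain[0] * x[0] + gain[1] * x[1]
        total += utility(x, u)
        x = step(x, u)
    return total, x
```

Running `rollout_cost([1.0, -1.0], [0.13, -0.17], 500)` drives the state toward the origin with a finite accumulated cost, whereas the zero gain `[0.0, 0.0]` lets the state grow, which illustrates why an admissible control law is needed to start the evaluation.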

As the initial admissible control law $u(x_k) = K x_k$ is known, the finite-step policy
evaluation in Algorithm 5.2.1 is implemented. We can see that it takes 3
iterations to obtain $V_0(x_k)$ and $v_0(x_k)$, and the results for the initial iteration are
displayed in Fig. 5.1 (see the trajectories of the iterative value functions for
$i = 0$). Let the iteration index $i_{\max} = 10$. To illustrate the effectiveness of the
algorithm, we choose four different iteration sequences $\{N_i^{\gamma}\}$, $\gamma = 1, 2, 3, 4$. For $\gamma = 1$
and $i = 1, 2, \ldots, 10$, we let $N_i^{1} = 1$. For $\gamma = 2$, the iteration sequence is chosen
as $\{N_i^{2}\} = \{3, 4, 4, 1, 2, 2, 3, 3, 2, 4\}$. For $\gamma = 3$, the iteration sequence is chosen as
$\{N_i^{3}\} = \{6, 1, 9, 3, 5, 7, 5, 4, 1, 3\}$. For $\gamma = 4$ and $i = 1, 2, \ldots, 10$, let $N_i^{4} = 20$.
Train the critic and the action networks under the learning rate 0.01 and set
the neural network training error threshold as 10−6 . Under the iteration indices
i and ji , the trajectories of iterative value functions Vi, ji (xk ) for xk = x0 are
5.2 Generalized Policy Iteration-Based Adaptive Dynamic Programming Algorithm 193

[Figure 5.1: four surface plots, panels (a) to (d); axes: i, j_i, and iterative value function]
Fig. 5.1 Iterative value functions $V_{i,j_i}(x_k)$ for $i = 1, 2, \ldots, 10$ and $x_k = x_0$. a $V_{i,j_i}(x_k)$ for $\{N_i^1\}$.
b $V_{i,j_i}(x_k)$ for $\{N_i^2\}$. c $V_{i,j_i}(x_k)$ for $\{N_i^3\}$. d $V_{i,j_i}(x_k)$ for $\{N_i^4\}$

shown in Fig. 5.1. The curves of the iterative value functions $V_i(x_k)$ are shown in
Fig. 5.2, where we use “In” to denote the initial iteration and “Lm” to denote the limiting
iteration.
For $N_i^1 = 1$, the generalized policy iteration algorithm reduces to the value iteration
algorithm [5, 6, 16, 17]. From Figs. 5.1a and 5.2a, we can see that the iterative
value function converges to the approximate optimum, which verifies the effectiveness
of the present algorithm. For $N_i^4 = 20$, we can see that for each i = 1, ..., 10,
the iterative value function $V_{i,j_i}(x_k)$ is convergent in $j_i$. In this case, the generalized
policy iteration algorithm can be considered as a policy iteration algorithm [5, 8,
10] for each i, where the convergence can be verified. For arbitrary sequences $\{N_i\}$,
such as $\{N_i^2\}$ and $\{N_i^3\}$, Figs. 5.1b, c and 5.2b, c show that the iterative value function
also approaches the optimum. Hence, value and policy iteration algorithms are
special cases of the present generalized policy iteration algorithm, and the convergence
properties of the present algorithm are verified. The stability of system (5.2.36) under
the iterative control law $v_i(x_k)$ is illustrated by the control and state trajectories in
Figs. 5.3 and 5.4, respectively.

Fig. 5.2 Iterative value functions Vi (xk ), for i = 0, 1, . . . , 10. a Vi (xk ) for {Ni1 }. b Vi (xk ) for
{Ni2 }. c Vi (xk ) for {Ni3 }. d Vi (xk ) for {Ni4 }

From the above simulation results, we can see that for i = 0, 1, ..., the iterative
control laws $v_i(x_k)$ are admissible. For linear system (5.2.36), the optimal cost
function is quadratic and in this case given by $J^*(x_k) = x_k^T P^* x_k$. According to the
discrete algebraic Riccati equation (DARE) [7], we obtain

\[
P^* = \begin{bmatrix} 26.61 & 1.81 \\ 1.81 & 1.90 \end{bmatrix}
\]

and the effectiveness of the present algorithm can be verified for linear systems.
On the other hand, the structure of the neural networks is important for their
approximation performance. To show the influence of the neural network structure, we
change the structures of the critic and action networks to 2–4–1 and 2–4–1, respectively,
and keep the other parameters of the neural networks unchanged. Define the
two neural networks as group “NN2”. Choose the same $\{N_i^2\}$ for the j-iteration.
Implement the present algorithm for i = 10 iterations. The iterative value functions
by NN1 and NN2 are shown in Fig. 5.5a. We can see that if the number of hidden-layer
neurons is reduced, the neural network approximation accuracy for the value function may
[Figure 5.3: four panels (a) to (d); axes: time steps (0 to 100) and control; curves labeled In and Lm]
Fig. 5.3 Trajectories of iterative control law $v_i(x_k)$, $i = 1, 2, \ldots, 10$. a $v_i(x_k)$ for $\{N_i^1\}$. b $v_i(x_k)$
for $\{N_i^2\}$. c $v_i(x_k)$ for $\{N_i^3\}$. d $v_i(x_k)$ for $\{N_i^4\}$

decrease. The plot of Vi (xk ) is shown in Fig. 5.5b. The corresponding trajectories of
states and control are shown in Fig. 5.5c, d, respectively. We can see that if the struc-
ture of neural networks is not chosen appropriately, the performance of the control
system will be poor.

Example 5.2.2 We now examine the performance of the present algorithm in a torsional
pendulum system [10, 13] with modifications. The dynamics of the pendulum
is given by

\[
\begin{cases}
\dfrac{d\theta}{dt} = \omega, \\[2mm]
J \dfrac{d\omega}{dt} = u - M g l \sin\theta - f_d \dfrac{d\theta}{dt},
\end{cases}
\]
[Figure 5.4: four panels (a) to (d); axes: time steps (0 to 100) and system states; curves labeled In x1, In x2, Lm x1, Lm x2]
Fig. 5.4 Trajectories of system states. a State trajectories for $\{N_i^1\}$. b State trajectories for $\{N_i^2\}$. c
State trajectories for $\{N_i^3\}$. d State trajectories for $\{N_i^4\}$

where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar,
respectively. Let $J = \frac{4}{3} M l^2$ and $f_d = 0.2$ be the rotary inertia and frictional
factor, respectively. Let $x_1 = \theta$ and $x_2 = \omega$. Let g = 9.8 m/s² be the gravitational
acceleration and let the sampling time interval be Δt = 0.1 s. Then, the discretized system
can be expressed by

\[
\begin{bmatrix} x_{1(k+1)} \\ x_{2(k+1)} \end{bmatrix}
=
\begin{bmatrix} 0.1 x_{2k} + x_{1k} \\ -0.49 \sin(x_{1k}) - 0.1 f_d x_{2k} + x_{2k} \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0.1 \end{bmatrix} u_k .
\tag{5.2.37}
\]

Let the initial state be x0 = [1, −1]T and let the utility function be the same as the
one in Example 5.2.1.
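
To make the discretized model (5.2.37) concrete, the following minimal sketch codes one step of the system and the quadratic utility, and accumulates the utility along a finite rollout. The names `pendulum_step`, `utility`, and `rollout_cost` are illustrative helpers, not part of the text, and the zero-control policy is used only to exercise the code.

```python
import math

F_D = 0.2   # frictional factor f_d from the example

def pendulum_step(x, u):
    """One step of the discretized torsional pendulum (5.2.37)."""
    x1, x2 = x
    x1_next = 0.1 * x2 + x1
    x2_next = -0.49 * math.sin(x1) - 0.1 * F_D * x2 + x2 + 0.1 * u
    return (x1_next, x2_next)

def utility(x, u):
    """Quadratic utility U(x, u) = x^T x + u^2 used in both examples."""
    return x[0] ** 2 + x[1] ** 2 + u ** 2

def rollout_cost(x0, policy, steps):
    """Accumulated utility over a finite rollout under a state-feedback `policy`."""
    x, total = x0, 0.0
    for _ in range(steps):
        u = policy(x)
        total += utility(x, u)
        x = pendulum_step(x, u)
    return total, x

x_next = pendulum_step((1.0, -1.0), 0.0)                       # one step from x0 with u = 0
cost, x_final = rollout_cost((1.0, -1.0), lambda x: 0.0, 100)  # zero-control rollout
print(x_next, cost)
```

A finite rollout cost of this kind is the quantity that the critic network approximates state by state; the sketch evaluates it only along a single trajectory.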
Neural networks are also used to implement the generalized policy iteration algo-
rithm, where the structures of the critic network and the action network are the
same as the ones in Example 5.2.1. We choose p = 10000 states in Ω to implement
the generalized policy iteration algorithm. For nonlinear system (5.2.37), the initial

Fig. 5.5 Simulation results for i = 0, 1, . . . , 10 and {Ni2 }. a Value function at x = x0 for NN1 and
NN2. b Vi (xk ) by NN2. c Iterative control law by NN2. d System states by NN2

admissible control law is difficult to obtain. Thus, we implement Algorithm 5.2.2


for policy improvement, and we obtain the initial value function Ψ̄ ς0 (xk ) = xkTP̄0 xk ,
where

\[
\bar{P}_0 = \begin{bmatrix} 145.31 & 8.43 \\ 8.43 & 28.42 \end{bmatrix}.
\]

Let the maximum iteration index be i = 30. To illustrate the effectiveness of the
algorithm, we choose four different iteration sequences $\{N_i^{\gamma}\}$, γ = 1, 2, 3, 4. For
γ = 1 and for i = 0, 1, ..., 30, we let $N_i^1 = 1$. For γ = 2, let $N_i^2$, i = 1, 2, ..., 30,
be an arbitrary positive integer such that $N_i^2 \le 4$. For γ = 3, let $N_i^3$, i = 1, 2, ..., 30,
be an arbitrary positive integer such that $N_i^3 \le 10$. For γ = 4 and for i = 0, 1, ..., 30, let
$N_i^4 = 20$. Train the critic and the action networks under the learning rate 0.01 and
set the threshold of the neural network training error as $10^{-6}$. Under the iteration indices
i and $j_i$, the trajectories of iterative value functions $V_{i,j_i}(x_k)$ for $x_k = x_0$ are shown
in Fig. 5.6. The curves of the iterative value functions $V_i(x_k)$ are shown in Fig. 5.7.
[Figure 5.6: four surface plots, panels (a) to (d); axes: i, j_i, and iterative value function]
Fig. 5.6 Iterative value functions $V_{i,j_i}(x_k)$ for $i = 0, 1, \ldots, 30$ and $x_k = x_0$. a $V_{i,j_i}(x_k)$ for $\{N_i^1\}$.
b $V_{i,j_i}(x_k)$ for $\{N_i^2\}$. c $V_{i,j_i}(x_k)$ for $\{N_i^3\}$. d $V_{i,j_i}(x_k)$ for $\{N_i^4\}$

From Figs. 5.6 and 5.7, we can see that given an arbitrary nonnegative integer
sequence {Ni }, i = 0, 1, . . ., the iterative value function Vi, ji (xk ) is monotonically
nonincreasing and converges to the approximate optimum using the present general-
ized policy iteration algorithm. The convergence property of the generalized policy
iteration algorithm for nonlinear systems can be verified. The convergence properties
of value and policy iteration algorithms can also be verified by the present algorithm.
The stability of system (5.2.37) under the iterative control law $v_i(x_k)$ is
illustrated by the state and control trajectories in Figs. 5.8 and 5.9, respectively.
We can see that for i = 0, 1, . . ., the iterative control law vi (xk ) is admissible,
and hence, the effectiveness of the present algorithm can be verified for nonlinear
systems.

Fig. 5.7 Iterative value functions Vi (xk ), i = 0, 1, . . . , 30. a Vi (xk ) for {Ni1 }. b Vi (xk ) for {Ni2 }.
c Vi (xk ) for {Ni3 }. d Vi (xk ) for {Ni4 }

5.3 Discrete-Time GPI with General Initial Value Functions

In this section, a novel discrete-time GPI-based optimal control algorithm is developed
for nonlinear systems. The developed GPI algorithm permits an arbitrary positive
semidefinite function to initialize the algorithm, with iteration indices iterating
between policy evaluation and policy improvement. The convergence,
admissibility, and optimality properties of the GPI algorithm for discrete-time nonlinear
systems are analyzed [18].

5.3.1 Derivation and Properties of the GPI Algorithm

A. Derivation of the GPI Algorithm


The present generalized policy iteration algorithm contains two iteration procedures,
which are named the i-iteration and the j-iteration, respectively. Both of the
[Figure 5.8: four panels (a) to (d); axes: time steps (0 to 100) and system states; curves labeled In x1, In x2, Lm x1, Lm x2]
Fig. 5.8 Trajectories of system state. a State trajectories for $\{N_i^1\}$. b State trajectories for $\{N_i^2\}$. c
State trajectories for $\{N_i^3\}$. d State trajectories for $\{N_i^4\}$

iteration indices increase from 0. The detailed generalized policy iteration algorithm
is described as follows.
Let Ψ (xk ) be a positive semidefinite function. Let

V0 (xk ) = Ψ (xk ) (5.3.1)

be the initial value function. Let $\{N_1, N_2, \ldots\}$ be a sequence of positive integers,
i.e., $N_i > 0$ for i = 1, 2, .... Then, for i = 0, 1, ..., the iterative control laws and
the iterative value functions are obtained as in (5.2.6)–(5.2.9). We refer to
(5.2.6)–(5.2.9) for details, since the only difference of the algorithm here is the initial condition.

Remark 5.3.1 From the generalized policy iteration algorithm (5.2.6)–(5.2.9) with
initial condition (5.3.1), for i = 1, 2, . . ., if we let Ni ≡ 1, then the generalized policy
iteration algorithm is reduced to a value iteration algorithm [2, 6, 9] (see Chaps. 2
and 3). For i = 1, 2, . . ., if we let Ni → ∞, then the generalized policy iteration
[Figure 5.9: four panels (a) to (d); axes: time steps (0 to 100) and control; curves labeled In and Lm]
Fig. 5.9 Trajectories of iterative control law $v_i(x_k)$. a $v_i(x_k)$ for $\{N_i^1\}$. b $v_i(x_k)$ for $\{N_i^2\}$. c $v_i(x_k)$
for $\{N_i^3\}$. d $v_i(x_k)$ for $\{N_i^4\}$

algorithm becomes a policy iteration algorithm [10] (see Chap. 4). Hence, the value
and policy iteration algorithms are special cases of the present generalized policy
iteration algorithm. On the other hand, we can see that the generalized policy itera-
tion algorithm (5.2.6)–(5.2.9) is inherently different from policy and value iteration
algorithms. Analysis results for (5.2.5)–(5.2.9) have been established in Sect. 5.2. In
this section, further analysis results will be established for (5.2.6)–(5.2.9) with initial
condition (5.3.1) using the approach due to Rantzer et al. [9, 12].
B. Properties of the GPI Algorithm
In this section, the properties of the GPI algorithm are analyzed. First, for i =
1, 2, . . ., the convergence property of the iterative value function in j-iteration (local
convergence property) is analyzed. The local convergence criterion is obtained. Sec-
ond, the convergence property of the iterative value function for i → ∞ (global
convergence property) is developed and the corresponding convergence criterion is
obtained. The admissibility and optimality analysis are also presented in this section.

Theorem 5.3.1 For i = 1, 2, . . ., let Vi, ji (xk ) and vi (xk ) be obtained by (5.2.6)–
(5.2.9) with initial condition (5.3.1). Let 0 < γi < ∞ and 1 ≤ σi < ∞ be constants
such that

γi U (xk , u k ) ≥ Vi (xk ), (5.3.2)

and

Vi,0 (xk ) ≤ σi Vi−1 (xk ). (5.3.3)

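Then, for i = 1, 2, ... and $j_i = 0, 1, \ldots, N_i$, we have

\[
V_{i,j_i}(x_k) \le \left[ 1 + \sum_{\rho=1}^{j_i} \frac{\gamma_i^{\rho}\,\sigma_i^{\rho-1}(\sigma_i - 1)}{(1+\gamma_i)^{\rho}} \right] V_{i,0}(x_k),
\tag{5.3.4}
\]

where we define $\sum_{\rho=1}^{j_i}(\cdot) = 0$ for $j_i < 1$.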

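Proof The theorem can be proven by mathematical induction. Obviously, inequality
(5.3.4) holds for $j_i = 0$. For $j_i = 1$, we have

\[
\begin{aligned}
V_{i,1}(x_k) &= U(x_k, v_{i-1}(x_k)) + V_{i,0}(x_{k+1}) \\
&\le U(x_k, v_{i-1}(x_k)) + \sigma_i V_{i-1}(x_{k+1}) \\
&\le \left(1 + \gamma_i \frac{\sigma_i - 1}{1+\gamma_i}\right) U(x_k, v_{i-1}(x_k))
   + \left(\sigma_i - \frac{\sigma_i - 1}{1+\gamma_i}\right) V_{i-1}(x_{k+1}) \\
&\le \left(1 + \gamma_i \frac{\sigma_i - 1}{1+\gamma_i}\right) V_{i,0}(x_k).
\end{aligned}
\]

Thus, (5.3.4) holds for $j_i = 1$. Assume that the conclusion holds for $j_i = l - 1$,
$l = 1, 2, \ldots, N_i$. Then, for $j_i = l$, we have

\[
\begin{aligned}
V_{i,l}(x_k) &= U(x_k, v_{i-1}(x_k)) + V_{i,l-1}(x_{k+1}) \\
&\le U(x_k, v_{i-1}(x_k)) + \left[1 + \sum_{\rho=1}^{l-1} \frac{\gamma_i^{\rho}\sigma_i^{\rho-1}(\sigma_i - 1)}{(1+\gamma_i)^{\rho}}\right] V_{i,0}(x_{k+1}) \\
&\le \left[1 + \frac{\gamma_i}{1+\gamma_i}\left(\sigma_i - 1 + \sum_{\rho=1}^{l-1} \frac{\gamma_i^{\rho}\sigma_i^{\rho}(\sigma_i - 1)}{(1+\gamma_i)^{\rho}}\right)\right] U(x_k, v_{i-1}(x_k)) \\
&\quad + \left[\sigma_i + \sum_{\rho=1}^{l-1} \frac{\gamma_i^{\rho}\sigma_i^{\rho}(\sigma_i - 1)}{(1+\gamma_i)^{\rho}}
   - \left(\frac{\sigma_i - 1}{1+\gamma_i} + \sum_{\rho=1}^{l-1} \frac{\gamma_i^{\rho}\sigma_i^{\rho}(\sigma_i - 1)}{(1+\gamma_i)^{\rho+1}}\right)\right] V_{i-1}(x_{k+1}) \\
&= \left[1 + \sum_{\rho=1}^{l} \frac{\gamma_i^{\rho}\sigma_i^{\rho-1}(\sigma_i - 1)}{(1+\gamma_i)^{\rho}}\right] V_{i,0}(x_k).
\end{aligned}
\]

Hence, (5.3.4) holds for $j_i = l$. The mathematical induction is complete. This completes
the proof of the theorem.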

Theorem 5.3.2 (Local convergence criterion) For i = 0, 1, ... and $j_i = 1, 2, \ldots, N_i$,
let $V_{i,j_i}(x_k)$ and $v_i(x_k)$ be obtained by (5.2.6)–(5.2.9) with initial condition (5.3.1).
Let $0 < \gamma_i < \infty$ and $1 \le \sigma_i < \infty$ be constants that satisfy (5.3.2) and (5.3.3),
respectively. If for i = 1, 2, ..., the constant $\sigma_i$ satisfies

\[
\sigma_i < \frac{1 + \gamma_i}{\gamma_i},
\tag{5.3.5}
\]

then the iterative value function $V_{i,j_i}(x_k)$ is convergent as $j_i \to \infty$, i.e.,

\[
\lim_{j_i \to \infty} V_{i,j_i}(x_k) \le \frac{1}{1 + \gamma_i - \gamma_i \sigma_i}\, V_{i,0}(x_k).
\tag{5.3.6}
\]

Proof According to (5.3.4), the sequence $\{\gamma_i^{\rho}\sigma_i^{\rho-1}(\sigma_i - 1)/(1+\gamma_i)^{\rho}\}$
is a geometric sequence in $\rho$. Then, (5.3.4) can be written as

\[
V_{i,j_i}(x_k) \le \left[ 1 + \frac{\dfrac{\gamma_i(\sigma_i - 1)}{\gamma_i + 1}\left(1 - \left(\dfrac{\gamma_i \sigma_i}{\gamma_i + 1}\right)^{j_i}\right)}{1 - \dfrac{\gamma_i \sigma_i}{\gamma_i + 1}} \right] V_{i,0}(x_k).
\]

For

\[
1 \le \sigma_i < \frac{\gamma_i + 1}{\gamma_i},
\]

letting $j_i \to \infty$, we can obtain (5.3.6). The proof is complete.
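
The geometric-series step in this proof is easy to check numerically. The sketch below evaluates the bracketed factor of (5.3.4) term by term, compares it with the closed form used in the proof, and compares a long partial sum with the limit in (5.3.6). The function names and the values γ_i = 2 and σ_i = 1.2 are illustrative choices, not taken from the text.

```python
def bound_factor(gamma_i, sigma_i, j):
    """Bracketed factor in (5.3.4), summed term by term up to rho = j."""
    total = 1.0
    for rho in range(1, j + 1):
        total += (gamma_i ** rho * sigma_i ** (rho - 1) * (sigma_i - 1.0)
                  / (1.0 + gamma_i) ** rho)
    return total

def closed_form(gamma_i, sigma_i, j):
    """Geometric-series form of the same factor, as used in the proof of Theorem 5.3.2."""
    r = gamma_i * sigma_i / (gamma_i + 1.0)
    return 1.0 + gamma_i * (sigma_i - 1.0) / (gamma_i + 1.0) * (1.0 - r ** j) / (1.0 - r)

def limit_factor(gamma_i, sigma_i):
    """Limit 1 / (1 + gamma_i - gamma_i * sigma_i) from (5.3.6); needs criterion (5.3.5)."""
    if not sigma_i < (1.0 + gamma_i) / gamma_i:
        raise ValueError("criterion (5.3.5) is violated")
    return 1.0 / (1.0 + gamma_i - gamma_i * sigma_i)

gamma_i, sigma_i = 2.0, 1.2      # illustrative; (1 + gamma_i) / gamma_i = 1.5
finite_gap = abs(bound_factor(gamma_i, sigma_i, 7) - closed_form(gamma_i, sigma_i, 7))
limit_gap = abs(bound_factor(gamma_i, sigma_i, 400) - limit_factor(gamma_i, sigma_i))
print(finite_gap, limit_gap)
```

When σ_i approaches (1 + γ_i)/γ_i from below, the common ratio approaches one and the limit factor grows without bound, which is why (5.3.5) is a strict inequality.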

For optimal control problems, the present control scheme must not only stabilize
the control systems, but also guarantee the cost function to be finite, i.e., the control
law must be admissible [2]. Next, the admissibility property of the iterative control
law vi (xk ) will be analyzed.

Theorem 5.3.3 (Admissibility property) For i = 1, 2, ... and $j_i = 0, 1, \ldots, N_i$, let
$V_{i,j_i}(x_k)$ and $v_i(x_k)$ be obtained by (5.2.6)–(5.2.9) with initial condition (5.3.1). Let
$0 < \gamma_i < \infty$ and $1 \le \sigma_i < \infty$ be constants that satisfy (5.3.2) and (5.3.3),
respectively. If for i = 1, 2, ..., the constant $\sigma_i$ satisfies (5.3.5), and the iterative value
function $V_i(x_k)$ is finite, then $v_i(x_k)$ is admissible, i.e., $v_i(x_k) \in \mathcal{A}(\Omega)$.

Proof Let j̄i = 0, 1, . . .. We construct a value function Vi, j̄i (xk ) as

Vi, j̄i (xk ) = U (xk , vi−1 (xk )) + Vi, j̄i −1 (F(xk , vi−1 (xk ))), (5.3.7)

where

Vi,0 (xk ) = Vi−1 (xk ). (5.3.8)

From (5.3.7) and (5.3.8), we have Vi, j̄i (xk ) = Vi, j̄i (xk ) for j̄i ≤ Ni . For j̄i = 0, 1, . . .,
according to (5.3.7), we can obtain

\[
V_{i,\bar{j}_i}(x_k) = \sum_{l=0}^{\bar{j}_i - 1} U(x_{k+l}, v_{i-1}(x_{k+l})) + V_{i-1}(x_{k+\bar{j}_i}).
\]

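Let $\sigma_i$ satisfy (5.3.5). According to Theorem 5.3.2, letting $\bar{j}_i \to \infty$, we can obtain

\[
\begin{aligned}
\lim_{\bar{j}_i \to \infty} V_{i,\bar{j}_i}(x_k)
&= \lim_{\bar{j}_i \to \infty} \sum_{l=0}^{\bar{j}_i - 1} U(x_{k+l}, v_{i-1}(x_{k+l})) + \lim_{\bar{j}_i \to \infty} V_{i-1}(x_{k+\bar{j}_i}) \\
&\le \frac{\sigma_i}{1 + \gamma_i - \gamma_i \sigma_i}\, V_{i-1}(x_k).
\end{aligned}
\]

Since $V_{i-1}(x_k)$ is finite for i = 1, 2, ..., $\sum_{l=0}^{\infty} U(x_{k+l}, v_i(x_{k+l}))$ is also finite. Define
$V_{i,\infty}(x_k) = \lim_{\bar{j}_i \to \infty} V_{i,\bar{j}_i}(x_k)$. If $\sigma_i$ satisfies (5.3.5), letting $\bar{j}_i \to \infty$, the iterative control
law $v_i(x_k)$ satisfies the following generalized Bellman equation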

Vi,∞ (xk ) = U (xk , vi−1 (xk )) + Vi,∞ (F(xk , vi−1 (xk ))). (5.3.9)

According to Lemma 5.2.1, we can derive that Vi,∞ (xk ) is a positive-definite function.
From (5.3.9), we get

Vi,∞ (F(xk , vi−1 (xk ))) − Vi,∞ (xk ) ≤ 0,

which means that Vi,∞ (xk ) is a Lyapunov function. Thus, xk → 0 as k → ∞, which


means that the iterative control law vi (xk ) ∈ Ωu . The proof is complete.

Theorem 5.3.4 For i = 0, 1, ..., let $V_{i,j_i}(x_k)$ and $v_i(x_k)$ be obtained by (5.2.6)–
(5.2.9) with initial condition (5.3.1). Let $0 < \gamma < \infty$ and $1 \le \sigma < \infty$ be constants
such that $J^*(F(x_k, u_k)) \le \gamma U(x_k, u_k)$ and $V_0(x_k) \le \sigma J^*(x_k)$. For i = 0, 1, ..., if
$\sigma_i$ satisfies (5.3.5), then the iterative value function $V_{i,j_i}(x_k)$ satisfies

\[
\begin{aligned}
V_{i,j_i}(x_k) \le{}& \left(1 + \sum_{\rho=1}^{j_i} \frac{\gamma_i^{\rho}\sigma_i^{\rho-1}(\sigma_i - 1)}{(1+\gamma_i)^{\rho}}\right) \\
&\times \left[1 + \sum_{l=1}^{i-1} \frac{\gamma^{i-l}\gamma_l(\sigma_l - 1)}{(1+\gamma)^{i-l}\prod_{\eta=l}^{i-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}
 + \frac{\gamma^{i}(\sigma - 1)}{(1+\gamma)^{i}\prod_{\eta=1}^{i-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}\right] J^*(x_k),
\end{aligned}
\tag{5.3.10}
\]

where we define $\sum_{k}^{l}(\cdot) = 0$ and $\prod_{k}^{l}(\cdot) = 1$ for $k > l$.

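Proof The statement can be proven by mathematical induction. First, the conclusion
is obviously true for i = 0. Let i = 1. For $j_1 = 1, 2, \ldots$, we have from (5.3.4),

\[
\begin{aligned}
V_{1,j_1}(x_k)
&\le \left(1 + \sum_{\rho=1}^{j_1} \frac{\gamma_1^{\rho}\sigma_1^{\rho-1}(\sigma_1 - 1)}{(1+\gamma_1)^{\rho}}\right) V_{1,0}(x_k) \\
&\le \left(1 + \sum_{\rho=1}^{j_1} \frac{\gamma_1^{\rho}\sigma_1^{\rho-1}(\sigma_1 - 1)}{(1+\gamma_1)^{\rho}}\right)
   \min_{u_k}\left\{ U(x_k, u_k) + \sigma J^*(x_{k+1}) \right\} \\
&\le \left(1 + \sum_{\rho=1}^{j_1} \frac{\gamma_1^{\rho}\sigma_1^{\rho-1}(\sigma_1 - 1)}{(1+\gamma_1)^{\rho}}\right)
   \min_{u_k}\left\{ \left(1 + \gamma\frac{\sigma - 1}{1+\gamma}\right) U(x_k, u_k)
   + \left(\sigma - \frac{\sigma - 1}{1+\gamma}\right) J^*(x_{k+1}) \right\} \\
&\le \left(1 + \sum_{\rho=1}^{j_1} \frac{\gamma_1^{\rho}\sigma_1^{\rho-1}(\sigma_1 - 1)}{(1+\gamma_1)^{\rho}}\right)
   \left(1 + \frac{\gamma(\sigma - 1)}{1+\gamma}\right) J^*(x_k),
\end{aligned}
\]

which proves inequality (5.3.10) for i = 1. Assume that the conclusion holds for
$i = \vartheta - 1$, $\vartheta = 1, 2, \ldots$. As $\sigma_i$ satisfies (5.3.5), we can get

\[
\begin{aligned}
V_{\vartheta-1,j_{\vartheta-1}}(x_k) \le{}& \frac{1}{1 - \gamma_{\vartheta-1}(\sigma_{\vartheta-1} - 1)} \\
&\times \left[1 + \sum_{l=1}^{\vartheta-2} \frac{\gamma^{\vartheta-l-1}\gamma_l(\sigma_l - 1)}{(1+\gamma)^{\vartheta-l-1}\prod_{\eta=l}^{\vartheta-2}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}
 + \frac{\gamma^{\vartheta-1}(\sigma - 1)}{(1+\gamma)^{\vartheta-1}\prod_{\eta=1}^{\vartheta-2}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}\right] J^*(x_k).
\end{aligned}
\]

For $i = \vartheta$, we can get

\[
\begin{aligned}
V_{\vartheta,j_\vartheta}(x_k)
&\le \left(1 + \sum_{\rho=1}^{j_\vartheta} \frac{\gamma_\vartheta^{\rho}\sigma_\vartheta^{\rho-1}(\sigma_\vartheta - 1)}{(1+\gamma_\vartheta)^{\rho}}\right) V_{\vartheta,0}(x_k) \\
&\le \left(1 + \sum_{\rho=1}^{j_\vartheta} \frac{\gamma_\vartheta^{\rho}\sigma_\vartheta^{\rho-1}(\sigma_\vartheta - 1)}{(1+\gamma_\vartheta)^{\rho}}\right)
   \min_{u_k}\left\{ U(x_k, u_k) + V_{\vartheta-1}(x_{k+1}) \right\} \\
&\le \left(1 + \sum_{\rho=1}^{j_\vartheta} \frac{\gamma_\vartheta^{\rho}\sigma_\vartheta^{\rho-1}(\sigma_\vartheta - 1)}{(1+\gamma_\vartheta)^{\rho}}\right)
   \min_{u_k}\Biggl\{ U(x_k, u_k) + \frac{1}{1 - \gamma_{\vartheta-1}(\sigma_{\vartheta-1} - 1)} \\
&\qquad \times \Biggl[1 + \sum_{l=1}^{\vartheta-2} \frac{\gamma^{\vartheta-l-1}\gamma_l(\sigma_l - 1)}{(1+\gamma)^{\vartheta-l-1}\prod_{\eta=l}^{\vartheta-2}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}
 + \frac{\gamma^{\vartheta-1}(\sigma - 1)}{(1+\gamma)^{\vartheta-1}\prod_{\eta=1}^{\vartheta-2}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}\Biggr] J^*(x_{k+1}) \Biggr\} \\
&\le \left(1 + \sum_{\rho=1}^{j_\vartheta} \frac{\gamma_\vartheta^{\rho}\sigma_\vartheta^{\rho-1}(\sigma_\vartheta - 1)}{(1+\gamma_\vartheta)^{\rho}}\right)
   \min_{u_k}\Biggl\{ \Biggl[\sum_{l=1}^{\vartheta-1} \frac{\gamma^{\vartheta-l}\gamma_l(\sigma_l - 1)}{(1+\gamma)^{\vartheta-l}\prod_{\eta=l}^{\vartheta-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)} \\
&\qquad + \frac{\gamma^{\vartheta}(\sigma - 1)}{(1+\gamma)^{\vartheta}\prod_{\eta=1}^{\vartheta-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)} + 1\Biggr] U(x_k, u_k) \\
&\qquad + \Biggl[\frac{1}{1 - \gamma_{\vartheta-1}(\sigma_{\vartheta-1} - 1)}
   \Biggl(1 + \sum_{l=1}^{\vartheta-2} \frac{\gamma^{\vartheta-l-1}\gamma_l(\sigma_l - 1)}{(1+\gamma)^{\vartheta-l-1}\prod_{\eta=l}^{\vartheta-2}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}
   + \frac{\gamma^{\vartheta-1}(\sigma - 1)}{(1+\gamma)^{\vartheta-1}\prod_{\eta=1}^{\vartheta-2}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}\Biggr) \\
&\qquad - \Biggl(\sum_{l=1}^{\vartheta-1} \frac{\gamma^{\vartheta-l-1}\gamma_l(\sigma_l - 1)}{(1+\gamma)^{\vartheta-l}\prod_{\eta=l}^{\vartheta-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}
   + \frac{\gamma^{\vartheta-1}(\sigma - 1)}{(1+\gamma)^{\vartheta}\prod_{\eta=1}^{\vartheta-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}\Biggr)\Biggr] J^*(x_{k+1}) \Biggr\} \\
&= \left(1 + \sum_{\rho=1}^{j_\vartheta} \frac{\gamma_\vartheta^{\rho}\sigma_\vartheta^{\rho-1}(\sigma_\vartheta - 1)}{(1+\gamma_\vartheta)^{\rho}}\right)
   \Biggl[1 + \sum_{l=1}^{\vartheta-1} \frac{\gamma^{\vartheta-l}\gamma_l(\sigma_l - 1)}{(1+\gamma)^{\vartheta-l}\prod_{\eta=l}^{\vartheta-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)} \\
&\qquad + \frac{\gamma^{\vartheta}(\sigma - 1)}{(1+\gamma)^{\vartheta}\prod_{\eta=1}^{\vartheta-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}\Biggr] J^*(x_k).
\end{aligned}
\]

The mathematical induction is complete for inequality (5.3.10). This completes the
proof of the theorem.

Fig. 5.10 Iterative value function $V_i(x_k)$ for different $\Psi(x_k)$’s. a $V_i(x_k)$ for $\Psi^1(x_k)$. b $V_i(x_k)$ for
$\Psi^2(x_k)$. c $V_i(x_k)$ for $\Psi^3(x_k)$. d $V_i(x_k)$ for $\Psi^4(x_k)$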
[Figure 5.11: four panels (a) to (d); axes: iteration steps (5 to 15) and Θ_i]
Fig. 5.11 The trajectories of $\Theta_i$ for different $\Psi(x_k)$’s. a $\Theta_i$ for $\Psi^1(x_k)$. b $\Theta_i$ for $\Psi^2(x_k)$. c $\Theta_i$ for
$\Psi^3(x_k)$. d $\Theta_i$ for $\Psi^4(x_k)$

Theorem 5.3.5 (Global convergence criterion) For i = 0, 1, ..., let $V_{i,j_i}(x_k)$ and
$v_i(x_k)$ be obtained by (5.2.6)–(5.2.9) with initial condition (5.3.1). If for i = 0, 1, ...,
$\sigma_i$ satisfies

\[
\sigma_i < q_i \frac{1}{\gamma_i(1 + \gamma)} + 1,
\tag{5.3.11}
\]

where $0 < q_i < 1$ is a constant, then for $j_i = 0, 1, \ldots, N_i$, the iterative value function
$V_{i,j_i}(x_k)$ is convergent as $i \to \infty$.

Proof Define $\Omega_{\mathrm{IN}} = \{i : i = 1, 2, \ldots\}$ as an iteration index set. For $i \in \Omega_{\mathrm{IN}}$, if we
let

\[
q = \max_{i \in \Omega_{\mathrm{IN}}} \{q_i\},
\]

then $0 < q < 1$. According to (5.3.10), we can get

[Figure 5.12: four surface plots, panels (a) to (d); axes: i, j_i, and V_{i,j_i}(x_0)]
Fig. 5.12 Iterative value function $V_{i,j_i}(x_k)$ for $x_k = x_0$ and different $\Psi(x_k)$’s. a $V_{i,j_i}(x_k)$ for
$\Psi^1(x_k)$. b $V_{i,j_i}(x_k)$ for $\Psi^2(x_k)$. c $V_{i,j_i}(x_k)$ for $\Psi^3(x_k)$. d $V_{i,j_i}(x_k)$ for $\Psi^4(x_k)$

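\[
\begin{aligned}
\lim_{i \to \infty} \sum_{l=1}^{i-1} \frac{\gamma^{i-l}\gamma_l(\sigma_l - 1)}{(1+\gamma)^{i-l}\prod_{\eta=l}^{i-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)}
&< \lim_{i \to \infty} \sum_{l=1}^{i-1} \frac{\gamma^{i-l}\,\dfrac{q}{1+\gamma}}{(1+\gamma)^{i-l}\left(1 - \dfrac{q}{1+\gamma}\right)^{i-l}} \\
&= \frac{q\gamma}{(1 - q)(1 + \gamma)}.
\end{aligned}
\tag{5.3.12}
\]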

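If $\sigma_i$ satisfies (5.3.11), then by the necessary condition for a convergent series, we can get

\[
\lim_{i \to \infty} \frac{\gamma^{i}(\sigma - 1)}{(1+\gamma)^{i}\prod_{\eta=1}^{i-1}\bigl(1 - \gamma_\eta(\sigma_\eta - 1)\bigr)} = 0.
\tag{5.3.13}
\]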

Following the fact that


[Figure 5.13: four panels (a) to (d); axes: time steps (0 to 100) and control; curves labeled In and Lm]
Fig. 5.13 Trajectories of iterative control law $v_i(x_k)$ for different $\Psi(x_k)$’s. a $v_i(x_k)$ for $\Psi^1(x_k)$. b
$v_i(x_k)$ for $\Psi^2(x_k)$. c $v_i(x_k)$ for $\Psi^3(x_k)$. d $v_i(x_k)$ for $\Psi^4(x_k)$

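\[
q_i \frac{1}{\gamma_i(1 + \gamma)} + 1 < \frac{1 + \gamma_i}{\gamma_i},
\tag{5.3.14}
\]

we can obtain

\[
\sum_{\rho=1}^{j_i} \frac{\gamma_i^{\rho}\sigma_i^{\rho-1}(\sigma_i - 1)}{(1+\gamma_i)^{\rho}}
< \lim_{j_i \to \infty} \sum_{\rho=1}^{j_i} \frac{\gamma_i^{\rho}\sigma_i^{\rho-1}(\sigma_i - 1)}{(1+\gamma_i)^{\rho}}
< \frac{q}{\gamma + 1 - q}.
\tag{5.3.15}
\]

According to (5.3.12)–(5.3.15), we can obtain

\[
\lim_{i \to \infty} V_{i,j_i}(x_k) < \left(1 + \frac{q}{\gamma + 1 - q}\right)\left(1 + \frac{q\gamma}{(1 - q)(1 + \gamma)}\right) J^*(x_k).
\]

The proof is complete.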


[Figure 5.14: four panels (a) to (d); axes: time steps (0 to 100) and system states; curves labeled In x1, In x2, Lm x1, Lm x2]
Fig. 5.14 Trajectories of system states for different $\Psi(x_k)$’s. a State trajectories for $\Psi^1(x_k)$. b State
trajectories for $\Psi^2(x_k)$. c State trajectories for $\Psi^3(x_k)$. d State trajectories for $\Psi^4(x_k)$

5.3.2 Relaxations of the Convergence Criterion and Summary of the GPI Algorithm

From Theorem 5.3.5, we can see that the convergence criterion (5.3.11) is independent
of the parameter σ. For i = 1, 2, ..., $\gamma_i$ and $\sigma_i$ can be obtained by (5.3.2) and
(5.3.3), respectively. If the parameter γ can be estimated, then the convergence criterion
(5.3.11) can be implemented. As the optimal cost function $J^*(x_k)$ is unknown,
the parameter γ cannot be obtained directly. In this section, relaxations of the convergence
criterion are discussed. First, we define a new set $\Omega_\gamma$ as

\[
\Omega_\gamma = \left\{ \gamma : \gamma U(x_k, u_k) \ge J^*(F(x_k, u_k)) \right\}.
\]

Then, we can obtain the following statements.



Lemma 5.3.1 Let Pi (xk ) be the iterative value function that satisfies

Pi+1 (xk ) = U (xk , μ(xk )) + Pi (xk+1 ),

where $P_0(x_k)$ is an arbitrary positive semidefinite function and $\mu(x_k)$ is an arbitrary
admissible control law. Let $P_\infty(x_k) = \lim_{i \to \infty} P_i(x_k)$. If there exists a constant $\tilde{\gamma}$ such
that

γ̃ U (xk , u k ) ≥ P∞ (F(xk , u k )), (5.3.16)

then γ̃ ∈ Ωγ .
Proof As μ(xk ) is an admissible control law, according to Theorem 5.2.3, we have
P∞ (xk ) ≥ V∞ (xk ). If γ̃ satisfies (5.3.16), then we can get

γ̃ U (xk , u k ) ≥ P∞ (F(xk , u k )) ≥ V∞ (F(xk , u k )) = J ∗ (F(xk , u k )).

The proof is complete.


Theorem 5.3.6 Let Φ0 (xk ) be a positive-definite function and let γ̄ be a constant
such that

γ̄ U (xk , u k ) ≥ Φ0 (F(xk , u k )). (5.3.17)

For l = 0, 1, ..., let $\Phi_l(x_k)$ be the iterative value function such that

\[
\Phi_{l+1}(x_k) = \min_{u_k} \left\{ U(x_k, u_k) + \Phi_l(x_{k+1}) \right\}.
\]

If Φ1 (xk ) ≤ Φ0 (xk ), then γ̄ ∈ Ωγ .


Proof The statement can be proven in two steps. First, for l = 0, 1, . . ., we prove
the inequality

Φl (xk ) ≥ Φl+1 (xk ), ∀xk . (5.3.18)

Obviously, (5.3.18) holds for l = 0. For l = 1, we have

\[
\begin{aligned}
\Phi_2(x_k) &= \min_{u_k} \{ U(x_k, u_k) + \Phi_1(x_{k+1}) \} \\
&\le \min_{u_k} \{ U(x_k, u_k) + \Phi_0(x_{k+1}) \} \\
&= \Phi_1(x_k).
\end{aligned}
\]

According to mathematical induction, for l = 0, 1, . . ., we have

Φl (xk ) ≥ Φl+1 (xk ) ≥ Φl+2 (xk ) ≥ · · · .


Let $l \to \infty$. According to Corollary 5.2.2, we can obtain

\[
\Phi_l(x_k) \ge \lim_{l \to \infty} \Phi_l(x_k) = J^*(x_k).
\]

Hence, we have
Φ0 (xk ) ≥ J ∗ (xk ).

If γ̄ satisfies (5.3.17), then we can obtain γ̄ ∈ Ωγ . The proof is complete.

Corollary 5.3.1 Let Φ0 (xk ) be a positive-definite function and let γ̄ and σ̄ be con-
stants that satisfy (5.3.17) and

Φ1 (xk ) ≤ σ̄ Φ0 (xk ), (5.3.19)

respectively. For l = 1, 2, . . ., let the value function Pl (xk ) satisfy

Pl (xk ) = U (xk , ν(xk )) + Pl−1 (xk+1 ), (5.3.20)

where P0 (xk ) = Φ0 (xk ) and

\[
\nu(x_k) = \arg\min_{u_k} \left\{ U(x_k, u_k) + P_0(x_{k+1}) \right\}.
\tag{5.3.21}
\]

If $\bar{\gamma}$ and $\bar{\sigma}$ satisfy

\[
\bar{\sigma} < \frac{1 + \bar{\gamma}}{\bar{\gamma}},
\tag{5.3.22}
\]

and there exists a constant $\hat{\gamma}$ such that $\hat{\gamma} U(x_k, u_k) \ge \lim_{l \to \infty} P_l(x_k)$, then $\hat{\gamma} \in \Omega_\gamma$.

Proof If $\bar{\gamma}$ and $\bar{\sigma}$ satisfy (5.3.22), according to Theorem 5.3.3, the iterative control
law $\nu(x_k)$ in (5.3.21) is admissible. Then, the limit of the iterative value function $P_l(x_k)$
exists as $l \to \infty$. According to Lemma 5.3.1, we can obtain $\hat{\gamma} \in \Omega_\gamma$. This completes
the proof of the corollary.

Algorithm 5.3.1 Estimation of the parameter γ

Initialization:
Give the computation precision ε.
Iteration:
Step 1. Choose a positive semidefinite function $\Phi_0(x_k) \ge 0$.
Step 2. Obtain the iterative control law and the iterative value function by (5.3.21) and

\[
\Phi_1(x_k) = U(x_k, \nu(x_k)) + \Phi_0(x_{k+1}).
\]

Step 3. If $\Phi_1(x_k) \le \Phi_0(x_k)$, then goto Step 7; else, goto next step.
Step 4. Choose two parameters $0 < \bar{\gamma} < \infty$ and $1 \le \bar{\sigma} < \infty$ that satisfy (5.3.17) and (5.3.19).
Step 5. If $\bar{\gamma}$ and $\bar{\sigma}$ satisfy (5.3.22), then for l = 1, 2, ..., implement (5.3.20) until the computation
precision is achieved, i.e.,

\[
|P_l(x_k) - P_{l-1}(x_k)| < \varepsilon,
\]

where $P_0(x_k) = \Phi_0(x_k)$, and goto next step; else goto Step 1.
Step 6. Let $\Phi_0(x_k) = P_l(x_k)$.
Step 7. Choose γ such that $\Phi_0(F(x_k, u_k)) \le \gamma U(x_k, u_k)$.
Step 8. Return γ.

According to Theorem 5.3.6 and Corollary 5.3.1, we can establish an effective
iteration algorithm to estimate γ by experiments. The detailed implementation of the
iteration algorithm is provided in Algorithm 5.3.1.
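
Step 7 of Algorithm 5.3.1 can be illustrated on sampled data. The sketch below is only an illustration under stated assumptions: it uses a hypothetical scalar system $x_{k+1} = a x_k + b u_k$ with utility $q x^2 + r u^2$ and $\Phi_0(x) = p_0 x^2$ (none of which comes from the text), and it takes γ as the largest sampled ratio $\Phi_0(F(x,u))/U(x,u)$. For this toy case the exact smallest γ is $p_0(a^2/q + b^2/r)$ by the Cauchy–Schwarz inequality, which gives a reference value. The function name `estimate_gamma` is illustrative.

```python
def estimate_gamma(phi0, dynamics, utility, states, controls):
    """Smallest gamma on the sample set such that phi0(F(x, u)) <= gamma * U(x, u)."""
    gamma = 0.0
    for x in states:
        for u in controls:
            cost = utility(x, u)
            if cost > 0.0:          # skip the origin, where both sides vanish
                gamma = max(gamma, phi0(dynamics(x, u)) / cost)
    return gamma

# Hypothetical scalar test problem (not from the text).
a, b, q, r, p0 = 1.1, 1.0, 1.0, 1.0, 3.0
grid = [i / 20.0 for i in range(-20, 21)]   # samples of x and u in [-1, 1]

gamma_hat = estimate_gamma(lambda x: p0 * x * x,
                           lambda x, u: a * x + b * u,
                           lambda x, u: q * x * x + r * u * u,
                           grid, grid)
gamma_exact = p0 * (a * a / q + b * b / r)  # Cauchy-Schwarz value for this toy case
print(gamma_hat, gamma_exact)
```

A sampled maximum can only underestimate the true supremum, so in practice a safety margin on the returned γ is a reasonable precaution; the text does not prescribe one.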
Based on the above preparations, we summarize the generalized policy iteration-
based iterative ADP algorithm in Algorithm 5.3.2.

Algorithm 5.3.2 The generalized policy iteration algorithm

Initialization:
Choose randomly an array of initial states $x_0$. Choose a computation precision ε. Give a positive
semidefinite function $\Psi(x_k)$. Construct a sequence $\{N_i\}$, where $N_i > 0$, i = 1, 2, ..., are positive
integers. Estimate γ by Algorithm 5.3.1.
Iteration:
Step 1. Let the iteration index i = 0 and $V_0(x_k) = \Psi(x_k)$.
Step 2. Let i = i + 1.
Step 3. Let $V_{i,0}(x_k) = V_{i-1}(x_k)$. Obtain $\gamma_i$ and $\sigma_i$ by (5.3.2) and (5.3.3), respectively.
Step 4. If $\sigma_i$ satisfies the convergence criterion (5.3.11), then for $j_i = 1, 2, \ldots, N_i$, do Policy Evaluation

\[
V_{i,j_i}(x_k) = U(x_k, v_{i-1}(x_k)) + V_{i,j_i-1}(F(x_k, v_{i-1}(x_k)));
\]

else, let

\[
V_i(x_k) = U(x_k, v_{i-1}(x_k)) + V_{i-1}(F(x_k, v_{i-1}(x_k))),
\]

and goto Step 2.
Step 5. Do Policy Improvement

\[
v_i(x_k) = \arg\min_{u_k} \left\{ U(x_k, u_k) + V_i(x_{k+1}) \right\}.
\]

Step 6. Let $V_i(x_k) = V_{i,N_i}(x_k)$.
Step 7. If $|V_i(x_k) - V_{i-1}(x_k)| < \varepsilon$, then the optimal cost function and optimal control law are
obtained, and goto Step 8; else goto Step 2.
Step 8. Return $V_i(x_k)$ and $v_i(x_k)$.
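
The interplay of the i-iteration and the j-iteration can be sketched on a problem where everything is available in closed form. The example below assumes a hypothetical scalar linear-quadratic problem (not from the text), represents the value function as $V(x) = p x^2$, and omits the criterion check of Step 4 as well as the neural network approximation; it is a simplified illustration of the evaluation/improvement loop, not the authors' implementation. The initial value $p_0 = 5$ is chosen above the optimal value so that the iteration is monotone.

```python
import math

# Hypothetical scalar linear-quadratic problem (not from the text):
# x_{k+1} = a x_k + b u_k, U(x, u) = q x^2 + r u^2, V(x) = p x^2.
a, b, q, r = 1.1, 1.0, 1.0, 1.0

def greedy_gain(p):
    """Policy improvement: minimizer u = k x of q x^2 + r u^2 + p (a x + b u)^2."""
    return -a * b * p / (r + b * b * p)

def evaluate(p, k, sweeps):
    """Policy evaluation: `sweeps` steps of V_{i,j} = U + V_{i,j-1}(x_{k+1}) under u = k x."""
    for _ in range(sweeps):
        p = q + r * k * k + (a + b * k) ** 2 * p
    return p

def gpi(p0, n_sequence, eps=1e-12, max_outer=10000):
    """Generalized policy iteration with j-iteration lengths N_i cycled from n_sequence."""
    p, k = p0, greedy_gain(p0)
    for i in range(max_outer):
        n_i = n_sequence[i % len(n_sequence)]
        p_new = evaluate(p, k, n_i)      # V_i = V_{i,N_i}
        k = greedy_gain(p_new)           # v_i = argmin {U + V_i(x_{k+1})}
        if abs(p_new - p) < eps:
            return p_new, k
        p = p_new
    return p, k

# Positive root of the scalar Riccati equation b^2 p^2 + (r - a^2 r - q b^2) p - q r = 0.
c1 = r - a * a * r - q * b * b
p_star = (-c1 + math.sqrt(c1 * c1 + 4.0 * b * b * q * r)) / (2.0 * b * b)

p_vi, _ = gpi(5.0, [1])             # N_i = 1: value iteration
p_mix, _ = gpi(5.0, [3, 1, 4, 2])   # arbitrary positive N_i
p_pi, _ = gpi(5.0, [50])            # large N_i: close to policy iteration
print(p_star, p_vi, p_mix, p_pi)
```

In this toy setting all three choices of $\{N_i\}$ reach the same limit, which mirrors the statement that value iteration and policy iteration are special cases of the generalized scheme.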

5.3.3 Simulation Studies

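To evaluate the performance of our generalized policy iteration algorithm, we choose
two examples for numerical experiments.
Example 5.3.1 The first example we choose is an inverted pendulum system [4].
The dynamics of the pendulum is expressed as

\[
\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix}
=
\begin{bmatrix} x_2 \\ \dfrac{g}{\ell} \sin(x_1) - \kappa \ell x_2 \end{bmatrix}
+
\begin{bmatrix} 0 \\ \dfrac{1}{m \ell^2} \end{bmatrix} u,
\]

where m = 1/2 kg and $\ell$ = 1/3 m are the mass and length of the pendulum bar,
respectively. Let κ = 0.2 and g = 9.8 m/s² be the frictional factor and the gravitational
acceleration, respectively. Discretization of the system function with the
sampling interval Δt = 0.1 s leads to

\[
\begin{bmatrix} x_{1(k+1)} \\ x_{2(k+1)} \end{bmatrix}
=
\begin{bmatrix} x_{1k} + \Delta t\, x_{2k} \\ \dfrac{g}{\ell} \Delta t \sin(x_{1k}) + (1 - \kappa \ell \Delta t) x_{2k} \end{bmatrix}
+
\begin{bmatrix} 0 \\ \dfrac{\Delta t}{m \ell^2} \end{bmatrix} u_k .
\]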

Let the initial state be $x_0 = [1, -1]^T$. Let the state space be $\Xi = \{x : -1 \le x_1 \le 1, -1 \le x_2 \le 1\}$.
NNs are used to implement the present generalized policy iteration
algorithm. The critic network and the action network are chosen as three-layer
BP NNs with the structures of 2–8–1 and 2–8–1, respectively. We choose
p = 5000 states in Ξ to train the action and critic networks. To illustrate the
effectiveness of the algorithm, four different initial value functions are chosen,
which are expressed by $\Psi^{\varsigma}(x_k) = x_k^T P_{\varsigma} x_k$, ς = 1, 2, 3, 4. Let $P_1 = 0$. Let $P_2$–$P_4$
be initialized by positive-definite matrices given by $P_2$ = [2.98, 1.05; 1.05, 5.78],
$P_3$ = [6.47, −0.33; −0.33, 6.55], and $P_4$ = [22.33, 4.26; 4.26, 7.18], respectively.
For i = 0, 1, ..., let $q_i = 0.9999$. First, implement Algorithm 5.3.1, which returns
γ = 5.40. Let the iteration sequence be $\{N_i^{\varsigma}\}$, where $N_i^{\varsigma} \in [1, 10]$ is a random
positive integer. Then, initialized by $\Psi^{\varsigma}(x_k)$, ς = 1, 2, 3, 4, the generalized policy
iteration algorithm in Algorithm 5.3.2 is implemented for i = 15 iterations. Train
the critic and action networks under the learning rate of 0.01 and set the NN training
error threshold as $10^{-6}$. The curves of the iterative value functions $V_i(x_k)$ are shown in
Fig. 5.10, where we let “In” denote “initial iteration” and let “Lm” denote “limiting
iteration”.
For i = 1, 2, ..., 15, define the function $\Theta_i$ as

\[
\Theta_i = \frac{\sigma_i \gamma_i (1 + \gamma)}{q_i + \gamma_i (1 + \gamma)}.
\tag{5.3.23}
\]

It can easily be shown that Θi < 1 implies σi satisfies (5.3.11). The trajectories of
the function Θi are shown in Fig. 5.11. Under the iteration indices i and ji , the trajec-
tories of iterative value functions Vi, ji (xk ) for xk = x0 are shown in Fig. 5.12. From
[Figure 5.15: four surface plots, panels (a) to (d); axes: i, j_i, and V_{i,j_i}(x_0)]
Fig. 5.15 Iterative value function $V_{i,j_i}(x_k)$ for $x_k = x_0$ and different $\bar{\Psi}(x_k)$’s. a $V_{i,j_i}(x_k)$ for
$\bar{\Psi}^1(x_k)$. b $V_{i,j_i}(x_k)$ for $\bar{\Psi}^2(x_k)$. c $V_{i,j_i}(x_k)$ for $\bar{\Psi}^3(x_k)$. d $V_{i,j_i}(x_k)$ for $\bar{\Psi}^4(x_k)$

Figs. 5.10 and 5.12, we can see that given an arbitrary positive semidefinite func-
tion, the iterative value function Vi, ji (xk ) converges to the optimum using the present
generalized policy iteration algorithm. The convergence property of the present gen-
eralized policy iteration algorithm for nonlinear systems can be verified. From Figs.
5.11 and 5.12, we can see that when Θi < 1, the GPI algorithm is convergent and both
the policy evaluation and improvement procedures can be implemented. If Θi < 1,
the iterative control law vi (xk ) is admissible. Let the execution time T f be 100 time
steps. The trajectories of the iterative control laws and system states are shown in
Figs. 5.13 and 5.14, respectively, where the effectiveness of the present generalized
policy iteration algorithm for nonlinear systems can be verified.

Fig. 5.16 The trajectories of $\Theta_i$ for different $\bar{\Psi}(x_k)$'s. a $\Theta_i$ for $\bar{\Psi}^{1}(x_k)$. b $\Theta_i$ for $\bar{\Psi}^{2}(x_k)$. c $\Theta_i$ for
$\bar{\Psi}^{3}(x_k)$. d $\Theta_i$ for $\bar{\Psi}^{4}(x_k)$ (each panel plots $\Theta_i$ against the iteration steps)

Example 5.3.2 Our second example is chosen as a nonaffine nonlinear system in [1]
with modifications. The system is expressed by

\[
\begin{aligned}
\dot{x}_1 &= 2 \sin x_1 + x_2 + x_3, \\
\dot{x}_2 &= x_1 + \sin(-x_2 + u_2), \\
\dot{x}_3 &= x_3 + \sin u_1 .
\end{aligned}
\]
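The sampled-data model used in this example can be sketched in a few lines of Python. This is a minimal illustration, assuming a forward-Euler discretization with Δt = 0.1 s (the same scheme that appears in the discretized model of Example 5.3.1); the function names `f_continuous`, `step`, and `simulate` are hypothetical and not taken from the book.

```python
import numpy as np

DT = 0.1  # sampling interval in seconds, as stated in the example


def f_continuous(x, u):
    """Right-hand side of the continuous-time system of Example 5.3.2."""
    x1, x2, x3 = x
    u1, u2 = u
    return np.array([
        2.0 * np.sin(x1) + x2 + x3,
        x1 + np.sin(-x2 + u2),
        x3 + np.sin(u1),
    ])


def step(x, u, dt=DT):
    """One forward-Euler step x_{k+1} = x_k + dt * f(x_k, u_k) (assumed scheme)."""
    return np.asarray(x, dtype=float) + dt * f_continuous(x, u)


def simulate(x0, policy, n_steps=100):
    """Roll the discretized model forward under a state-feedback policy."""
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        traj.append(step(traj[-1], policy(traj[-1])))
    return np.array(traj)


x0 = np.array([1.0, -1.0, 1.0])            # initial state from the example
x1 = step(x0, np.zeros(2))                 # one step under zero control
open_loop = simulate(x0, lambda x: np.zeros(2))  # 100 steps, zero control
```

Any control law, for instance the output of a trained action network, can be passed to `simulate` as `policy`; the zero policy above is only a placeholder.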

Let $x = [x_1, x_2, x_3]^{T}$ be the system state and let $u = [u_1, u_2]^{T}$ be the system control
input. Let the initial state be $x_0 = [1, -1, 1]^{T}$. Let the sampling interval be Δt = 0.1 s.
Let the cost function be expressed by

\[
J(x_0) = \sum_{k=0}^{\infty} \big( x_k^{T} Q x_k + R(u_k) \big),
\]

where

\[
R(u_k) = \int_{0}^{u_k} \big(\Phi^{-1}(v)\big)^{T} R \, dv .
\]
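For the choice Φ(·) = tanh(·) made in this example, the control cost admits a closed form. The worked equation below is a sketch under one added assumption that the text does not state: the weight $R$ is diagonal, $R = \mathrm{diag}(r_1, \ldots, r_m)$ with $r_j > 0$, so that the integral separates componentwise.

```latex
\[
\int_{0}^{u_j} \tanh^{-1}(v)\, dv
  = u_j \tanh^{-1}(u_j) + \tfrac{1}{2}\ln\!\big(1 - u_j^{2}\big), \qquad |u_j| < 1,
\]
\[
R(u_k) = \sum_{j=1}^{m} r_j \Big[ u_j \tanh^{-1}(u_j) + \tfrac{1}{2}\ln\!\big(1 - u_j^{2}\big) \Big].
\]
```

Differentiating the right-hand side of the first line with respect to $u_j$ recovers $\tanh^{-1}(u_j)$, and the expression vanishes at $u_j = 0$. Each summand is therefore nonnegative and equals zero only at $u_j = 0$.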

Let $Q$ be an identity matrix with suitable dimension. Let Φ(·) be a sigmoid function,
i.e., Φ(·) = tanh(·). Let the state space be $\bar{\Xi} = \{x : -1 \le x_1 \le 1,\ -1 \le x_2 \le 1,\ -1 \le x_3 \le 1\}$.
NNs are used to implement the present generalized policy iteration algorithm. The critic
and action networks are chosen as three-layer BP NNs with the structures of 3–10–1 and
3–10–2, respectively. We randomly choose p = 10000 states in $\bar{\Xi}$ to train the action and
critic networks. To illustrate the effectiveness of the algorithm, we also choose four
different initial value functions, which are expressed by $\bar{\Psi}^{\varsigma}(x_k) = x_k^{T} \bar{P}_{\varsigma} x_k$, ς = 1, . . . , 4.
Let $\bar{P}_1$–$\bar{P}_4$ be positive-definite matrices given by
$\bar{P}_1$ = 0.01 × [0.11, 0, −0.18; 0, 0.12, 0.04; −0.18, 0.04, 0.19],
$\bar{P}_2$ = [9.09, −2.06, 1.77; −2.06, 2.93, −1.89; 1.77, −1.89, 7.69],
$\bar{P}_3$ = [6.89, −8.49, −0.46; −8.49, 15.60, 0.44; −0.46, 0.45, 6.81], and
$\bar{P}_4$ = [51.39, −18.53, −4.64; −18.53, 36.72, 6.22; −4.64, 6.22, 25.34], respectively.
For i = 0, 1, . . ., let $q_i = 0.9999$. First, implement Algorithm 5.3.1, which returns
γ = 6.3941. Let the iteration sequence be $\{N_i^{\varsigma}\}$, where each $N_i^{\varsigma} \in [1, 10]$ is a random
nonnegative integer. Then, initialized by $\bar{\Psi}^{\varsigma}(x_k)$, ς = 1, . . . , 4, the generalized policy
iteration algorithm (Algorithm 5.3.2) is implemented for a total of i = 20 iterations. The
critic and action networks are trained with a learning rate of 0.01, and the NN training
error threshold is set to $10^{-6}$. Under the iteration indices $i$ and $j_i$, the trajectories of
the iterative value functions $V_{i,j_i}(x_k)$ for $x_k = x_0$ are shown in Fig. 5.15, where we can see
that, initialized by an arbitrary positive semidefinite function, the iterative value function
$V_{i,j_i}(x_k)$ converges to the optimum using the present generalized policy iteration algorithm.
For i = 1, 2, . . . , 15, define the function $\Theta_i$ as in (5.3.23); the trajectories of the
function $\Theta_i$ are shown in Fig. 5.16.

Fig. 5.17 Trajectories of weights for the critic network for different $\bar{\Psi}(x_k)$'s. a $W_{cl}$, l = 1, 3, 6, 9,
for $\bar{\Psi}^{1}(x_k)$. b $W_{cl}$, l = 1, 3, 6, 9, for $\bar{\Psi}^{2}(x_k)$. c $W_{cl}$, l = 1, 3, 6, 9, for $\bar{\Psi}^{3}(x_k)$. d $W_{cl}$, l = 1, 3, 6, 9,
for $\bar{\Psi}^{4}(x_k)$ (each panel plots the weight trajectories against the iteration indices $i$ and $j_i$)

Fig. 5.18 Trajectories of iterative control law $v_i(x_k)$ for different $\bar{\Psi}(x_k)$'s. a $v_i(x_k)$ for $\bar{\Psi}^{1}(x_k)$. b
$v_i(x_k)$ for $\bar{\Psi}^{2}(x_k)$. c $v_i(x_k)$ for $\bar{\Psi}^{3}(x_k)$. d $v_i(x_k)$ for $\bar{\Psi}^{4}(x_k)$ (each panel plots the initial ("In") and
limiting ("Lm") controls $u_1$ and $u_2$ against time steps)
Let Wcl , l = 1, 2, . . . , 10, be the first column of the hidden-output weight matrix
for the critic network. The convergence trajectories of weights Wc1 , Wc3 , Wc6 , Wc9
for the critic network are shown in Fig. 5.17, where we can see that the weights for
critic network are convergent to the optimum. Other weights of the NNs are omitted
here.
Fig. 5.19 Trajectories of system states for different $\bar{\Psi}(x_k)$'s. a State trajectories for $\bar{\Psi}^{1}(x_k)$. b State
trajectories for $\bar{\Psi}^{2}(x_k)$. c State trajectories for $\bar{\Psi}^{3}(x_k)$. d State trajectories for $\bar{\Psi}^{4}(x_k)$ (each panel
plots the initial ("In") and limiting ("Lm") states $x_1$, $x_2$, $x_3$ against time steps)

From Figs. 5.15 and 5.16, the convergence property of the present generalized
policy iteration algorithm for nonlinear systems can be verified. We can see that
for Θi < 1, the GPI algorithm is convergent and both the policy evaluation and
improvement procedures can be implemented. Let the execution time T f be 100 time
steps. The trajectories of the iterative control laws and system states are shown in
Figs. 5.18 and 5.19, respectively. From Fig. 5.16d, we can see that for $\bar{\Psi}^{4}(x_k)$, $\Theta_i < 1$
for i = 1, 2, . . . , 15. In this case, the iterative control law vi (xk ) is admissible. This
property can be verified from results shown in Figs. 5.18d and 5.19d, respectively.
From these simulation results, the effectiveness of the present generalized policy
iteration algorithm for nonlinear systems can be verified.

5.4 Conclusions

In this chapter, an effective generalized policy iteration algorithm is developed to
solve infinite-horizon optimal control problems for discrete-time nonlinear systems.
The iterative ADP algorithm can be initialized by an arbitrary positive semidefinite
function. It has been proven that the iterative value function for the generalized
policy iteration algorithm converges to the optimum. Convergence criteria of the
algorithm are obtained. Admissibility of the iterative control laws is analyzed. Effec-
tive methods are presented to estimate the parameter γ of the present algorithm.
NNs are used to implement the generalized policy iteration algorithm. Finally, two
numerical examples are given to illustrate the effectiveness of the present algorithm.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B
Cybern 38(4):943–949
3. Apostol TM (1974) Mathematical analysis, 2nd edn. Addison-Wesley, Boston
4. Beard RW (1995) Improving the closed-loop performance of nonlinear systems. Ph.D. thesis,
Rensselaer Polytechnic Institute, Troy, NY
5. Bertsekas DP (2007) Dynamic programming and optimal control, 3rd edn. Athena Scientific,
Belmont
6. Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont
7. Dorf RC, Bishop RH (2011) Modern control systems, 12th edn. Prentice-Hall, Upper Saddle
River
8. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag
32(6):76–105
9. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
10. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Network Learn Syst 25(3):621–634
11. Liu D, Wei Q, Yan P (2015) Generalized policy iteration adaptive dynamic programming for
discrete-time nonlinear systems. IEEE Trans Syst Man Cybern Syst 45(12):1577–1591
12. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc Control
Theory Appl 153(5):567–574
13. Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans
Neural Network 12(2):264–276
14. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
15. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
16. Wei Q, Liu D (2014) A novel iterative θ-adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
17. Wei Q, Liu D, Lin H (2016) Value iteration adaptive dynamic programming for optimal control
of discrete-time nonlinear systems. IEEE Trans Cybern 46(3):840–853
18. Wei Q, Liu D, Yang X (2015) Infinite horizon self-learning optimal control of nonaffine discrete-
time nonlinear systems. IEEE Trans Neural Network Learn Syst 26(4):866–879
Chapter 6
Error Bounds of Adaptive Dynamic
Programming Algorithms

6.1 Introduction

Value iteration and policy iteration are two basic classes of ADP algorithms which
can solve optimal control problems for nonlinear dynamical systems with continu-
ous state and action spaces. Value iteration algorithms can solve the optimal con-
trol problems for nonlinear systems without requiring an initial stabilizing control
policy. Value iteration algorithms iterate between value function update and policy
improvement until the iterative value functions converge to the optimal one. Value
iteration-based ADP algorithms have been used for solving the Bellman equation
[3, 8, 17, 19, 30, 31], the near-optimal control of discrete-time affine nonlinear sys-
tems with control constraints [20, 34], the finite-horizon optimal control problem
[10, 28], the optimal tracking control problem [33, 35], and the optimal control of
unknown nonaffine nonlinear discrete-time systems with discount factor in the cost
function [19, 29]. In contrast to value iteration algorithms, policy iteration algorithms
[1, 11, 14, 18, 27] always require an initial stabilizing control policy. The policy
iteration is built to iterate between policy evaluation and policy improvement until
the policy converges to the optimal one. The policy iteration-based ADP approaches
were also developed for optimal control of discrete-time nonlinear dynamical sys-
tems [7, 18]. For all the iterative ADP algorithms mentioned above, it is assumed
that the value function and control policy update equations can be exactly solved at
each iteration. Furthermore, optimistic policy iteration [25] represents a spectrum of
iterative algorithms which includes value iteration and policy iteration algorithms,
and it is also known as generalized policy iteration [24, 26] or modified policy iter-
ation [22]. The optimistic policy iteration algorithm has not been widely applied to
ADP for solving optimal control problems of nonlinear dynamical systems.
On the other hand, an inequality version of the Bellman equation was used to derive
bounds on the optimal cost function [12]. For undiscounted optimal control problems
of discrete-time nonlinear systems, a relaxed value iteration scheme [23] based on
the inequality version of the Bellman equation [12] was introduced to derive the upper
and lower bounds of the optimal cost function, where the distance from optimal
values can be kept within prescribed bounds. In [16], the relaxed value iteration
scheme was used to solve the optimal switching between linear systems, the optimal
control of a linear system with piecewise linear cost, and a partially observable
Markov decision problem. In [9], the relaxed value iteration scheme was applied
to receding horizon control schemes for discrete-time nonlinear systems. Based on
[23], a new expression of approximation errors at each iteration was introduced to
establish convergence analysis results for the approximate value iteration algorithm
[17].

© Springer International Publishing AG 2017. D. Liu et al., Adaptive Dynamic Programming
with Applications in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_6
For optimal control problems with continuous state and action spaces, ADP meth-
ods use a critic neural network to approximate the value function, and use an action
neural network to approximate the control policy. Iterating on these approximate
schemes will inevitably give rise to approximation errors. Therefore, we establish
error bounds for the approximate value iteration, approximate policy iteration, and
approximate optimistic policy iteration in this chapter [21]. A new assumption is uti-
lized instead of the contraction assumption in discounted optimal control problems
[16]. It is shown that the iterative approximate value function can converge to a finite
neighborhood of the optimal value function under some mild conditions. To imple-
ment the present algorithms, two multilayer feedforward neural networks are used
to approximate the value function and control policy. A simulation example is given
to demonstrate the present results. Furthermore, we also present error bound analy-
sis results of Q-function for the action-dependent ADP to solve discounted optimal
control problems of unknown discrete-time nonlinear systems [32]. It is shown that
the approximate Q-function will converge to a finite neighborhood of the optimal
Q-function. Results in this chapter will complement the analysis results developed in
Chaps. 2–5 by introducing various error conditions for different VI and PI algorithms
[17–19, 28–30].

6.2 Error Bounds of ADP Algorithms for Undiscounted Optimal Control Problems

6.2.1 Problem Formulation

Consider the discrete-time deterministic nonlinear dynamical systems described by

xk+1 = F(xk , u k ), k = 0, 1, 2, . . . , (6.2.1)

where $x_k \in \mathbb{R}^{n}$ is the system state at time $k$ and $u_k \in \mathbb{R}^{m}$ is the control input. Let $x_0$ be
the initial state. Assume that the system function $F(x_k, u_k)$ is Lipschitz continuous
on a compact set $\Omega \subset \mathbb{R}^{n}$ containing the origin, and $F(0, 0) = 0$. Hence, $x = 0$
is an equilibrium state of the system (6.2.1) under the control $u = 0$. Assume that
system (6.2.1) is stabilizable on the compact set Ω [3].

Define the undiscounted infinite-horizon cost function as

\[
J(x_0, u) = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{6.2.2}
\]

where $U$ is a positive-definite utility function. In this chapter, the utility function
is chosen as the quadratic form $U(x_k, u_k) = x_k^{T} Q x_k + u_k^{T} R u_k$, where $Q$ and $R$ are
positive-definite matrices with suitable dimensions. Our goal is to find a control
policy $u_k = u(x_k)$ which minimizes the cost function (6.2.2) for every initial state
$x_0 \in \Omega$. For optimal control problems, the designed feedback control must not only
stabilize the system (6.2.1) on Ω but also guarantee that the cost function (6.2.2) is
finite, i.e., the control must be admissible (cf. Definition 2.2.1 or [3]).
For any admissible control policy $u = \mu(x) \in \mathcal{A}(\Omega)$, the mapping from any
state $x$ to the value of (6.2.2) is the value function defined by $V^{\mu}(x) \triangleq J(x, \mu(x))$,
where $\mathcal{A}(\Omega)$ is the set of admissible control policies associated with the controllable
set Ω of states. Then, we define the optimal value function as

\[
V^{*}(x) = \inf_{\mu} \big\{ V^{\mu}(x) \big\}. \tag{6.2.3}
\]

Note that the optimal value function defined here is the same as the optimal cost
function, i.e.,

\[
V^{*}(x) = J^{*}(x) \triangleq \inf_{u} \{ J(x, u) : u \in \mathcal{A}(\Omega) \}.
\]

According to Bellman's principle of optimality [4, 13], the optimal value function
$V^{*}(x)$ satisfies the Bellman equation

\[
V^{*}(x) = \inf_{\mu} \big\{ U(x, \mu) + V^{*}\big(F(x, \mu)\big) \big\}. \tag{6.2.4}
\]

If it can be solved for $V^{*}$, the optimal control policy $\mu^{*}(x)$ can be obtained by

\[
\mu^{*}(x) = \arg\inf_{\mu} \big\{ U(x, \mu) + V^{*}\big(F(x, \mu)\big) \big\}.
\]

Similar to [6], we define the Hamiltonian as

\[
H(x, u, V) = U(x, u) + V\big(F(x, u)\big).
\]

Then, we define an operator $T_{\mu}$ as

\[
(T_{\mu} V)(x) = H\big(x, \mu(x), V\big),
\]

and define an operator $T$ as

\[
(T V)(x) = \min_{\mu} H\big(x, \mu(x), V\big).
\]

For convenience, $T_{\mu}^{k}$ denotes the $k$-fold composition of the operator $T_{\mu}$, i.e.,

\[
(T_{\mu}^{k} V)(x) = \big(T_{\mu} (T_{\mu}^{k-1} V)\big)(x). \tag{6.2.5}
\]

Similarly, the mapping $T^{k}$ is defined by

\[
(T^{k} V)(x) = \big(T (T^{k-1} V)\big)(x).
\]

Therefore, the Bellman equation (6.2.4) can be written compactly as

\[
V^{*} = T V^{*},
\]

i.e., $V^{*}$ is the fixed point of $T$.


In this chapter, we assume the following monotonicity property holds, which was
also used in [6].
Assumption 6.2.1 If $V \le V'$, then $H(x, u, V) \le H(x, u, V')$, ∀x, u.
Besides the above monotonicity assumption, the contraction assumption in [5] is
often required for discounted optimal control problems. However, for undiscounted
optimal control problems, we utilize the following assumption from [16] instead of
the contraction assumption.

Assumption 6.2.2 Suppose the condition $0 \le V^{*}\big(F(x, u)\big) \le \lambda U(x, u)$ holds
uniformly for some $0 < \lambda < \infty$.
The positive constant λ is usually larger than 1 and gives a measure on how
“contractive” the optimally controlled system is, i.e., how close the value function is
to the cost of a single step [16]. A smaller λ implies a faster convergence.
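As an illustration of Assumption 6.2.2, a valid λ can be computed explicitly for a scalar linear-quadratic problem. The system $x_{k+1} = a x_k + b u_k$, the utility $U(x,u) = q x^2 + r u^2$ with $q, r > 0$, and the optimal value function $V^{*}(x) = p^{*} x^{2}$ are assumptions of this example and do not come from the text. By the Cauchy–Schwarz inequality,

```latex
\[
(a x + b u)^{2}
  = \Big( \tfrac{a}{\sqrt{q}}\,\sqrt{q}\,x + \tfrac{b}{\sqrt{r}}\,\sqrt{r}\,u \Big)^{2}
  \le \Big( \tfrac{a^{2}}{q} + \tfrac{b^{2}}{r} \Big)\big( q x^{2} + r u^{2} \big),
\]
\[
V^{*}\big(F(x,u)\big) = p^{*} (a x + b u)^{2}
  \le \underbrace{p^{*}\Big( \tfrac{a^{2}}{q} + \tfrac{b^{2}}{r} \Big)}_{\lambda}\; U(x,u).
\]
```

For $a = 1.2$ and $b = q = r = 1$, the Riccati solution is $p^{*} \approx 1.952$, which gives $\lambda \approx 4.76$, in line with the observation that λ is usually larger than 1.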
The Bellman equation (6.2.4) reduces to the Riccati equation in the case of linear
quadratic regulator, which can be efficiently solved. In the case of general nonlinear
systems, it is very difficult to solve the Bellman equation analytically. ADP meth-
ods using function approximation structures are proposed to learn the near-optimal
control policy and value function associated with the Bellman equation. Because of
the approximation errors, the value function and control policy are generally impos-
sible to obtain accurately at each iteration. Therefore, in this chapter, we analyze the
convergence and establish the error bounds for ADP algorithms considering func-
tion approximation errors. We note that the analysis conducted in this chapter uses a
different notation. Most notably, the use of “operators” T and Tμ can significantly
simplify our presentation.
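The operator notation can be made concrete on a scalar linear-quadratic problem, where every quadratic value function $V(x) = p x^2$ is represented by its coefficient $p$. The Python sketch below is an illustration only: the system $x_{k+1} = a x_k + b u_k$, the weights, and the names `T_mu`, `T`, `greedy_gain`, and `riccati_solution` are assumptions introduced here, not part of the chapter.

```python
import math

# Hypothetical scalar problem (not from the text):
#   x_{k+1} = A*x + B*u,   U(x, u) = Q*x**2 + R*u**2.
# A quadratic value function V(x) = p*x**2 is stored as the number p.
A, B, Q, R = 1.2, 1.0, 1.0, 1.0


def T_mu(p, k):
    """Coefficient of (T_mu V)(x) for the linear policy mu(x) = -k*x."""
    return Q + R * k**2 + p * (A - B * k) ** 2


def greedy_gain(p):
    """Gain k of the policy u = -k*x that minimizes U(x, u) + V(F(x, u))."""
    return p * A * B / (R + p * B**2)


def T(p):
    """Coefficient of (T V)(x) = min_u {U(x, u) + V(F(x, u))}."""
    return T_mu(p, greedy_gain(p))


def riccati_solution():
    """Positive root of B^2 p^2 + (R - Q B^2 - R A^2) p - Q R = 0, i.e., p = T(p)."""
    c1 = R - Q * B**2 - R * A**2
    return (-c1 + math.sqrt(c1**2 + 4.0 * B**2 * Q * R)) / (2.0 * B**2)


p_star = riccati_solution()
fixed_point_residual = abs(T(p_star) - p_star)  # checks V* = T V*
monotone = T(1.0) <= T(2.0)                     # Assumption 6.2.1 on this example
```

Representing $V$ by a single coefficient keeps the example exact; for general nonlinear systems the same two operators would act on function approximators instead.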

6.2.2 Approximate Value Iteration

We first present the algorithm for exact value iteration and then the algorithm for
approximate value iteration with analysis results for error bounds.

A. Value Iteration
The value iteration algorithm starts with any initial positive-definite value
function $V_0(x)$ or $V_0(\cdot) = 0$. Then, the control policy $v_0(x)$ can be obtained by

\[
v_0(x) = \arg\min_{u} \big\{ U(x, u) + V_0\big(F(x, u)\big) \big\}. \tag{6.2.6}
\]

For i = 1, 2, . . . , the value iteration algorithm iterates between the value function
update

\[
\begin{aligned}
V_i(x) &= T V_{i-1}(x) \\
       &= \min_{u} \big\{ U(x, u) + V_{i-1}\big(F(x, u)\big) \big\} \\
       &= U\big(x, v_{i-1}(x)\big) + V_{i-1}\big(F(x, v_{i-1}(x))\big),
\end{aligned} \tag{6.2.7}
\]

and policy improvement

\[
v_i(x) = \arg\min_{u} \big\{ U(x, u) + V_i\big(F(x, u)\big) \big\}. \tag{6.2.8}
\]

It is required that $V_i(0) = 0$ and $v_i(0) = 0$, ∀i ≥ 1. The value iteration algorithm
does not require an initial stabilizing control policy. Note that the algorithm described
in (6.2.6)–(6.2.8) is the same as the one given earlier in (2.2.10)–(2.2.12) or (2.2.22)–
(2.2.24).
We list some key expressions in Table 6.1 which can be used in this section for
establishing analysis results.
According to Assumption 6.2.1, it is easy to give the following lemma.

Lemma 6.2.1 If $V_0 \ge T V_0$, the value function sequence $\{V_i\}$ is a monotonically
nonincreasing sequence, i.e., $V_i \ge V_{i+1}$, ∀i ≥ 0. If $V_0 \le T V_0$, the value function
sequence $\{V_i\}$ is a monotonically nondecreasing sequence, i.e., $V_i \le V_{i+1}$, ∀i ≥ 0.

The lemma can be proved by considering the key expressions in Table 6.1.
For undiscounted optimal control problems, the convergence of value iteration
algorithm has been given in the following theorem [23] (cf. Theorem 2.2.2).

Theorem 6.2.1 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that $0 \le \alpha V^{*} \le
V_0 \le \beta V^{*}$, $0 \le \alpha \le 1$, and $1 \le \beta < \infty$. The value function $V_i$ and the control
policy $v_{i+1}$ are iteratively updated by (6.2.7) and (6.2.8). Then, the value function
sequence $\{V_i\}$ approaches $V^{*}$ according to the inequalities

\[
\left[ 1 - \frac{1 - \alpha}{(1 + \lambda^{-1})^{i}} \right] V^{*}
\le V_i \le
\left[ 1 + \frac{\beta - 1}{(1 + \lambda^{-1})^{i}} \right] V^{*}, \quad \forall i \ge 1.
\]

Moreover, the value function $V_i$ converges to $V^{*}$ uniformly on Ω as i → ∞.

Table 6.1 Key expressions used in Sect. 6.2.2
Key expression                                   Location
$V \le V' \Rightarrow T_u V \le T_u V'$          Assumption 6.2.1
$V_i = T V_{i-1}$                                (6.2.7)
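The bounds of Theorem 6.2.1 can be checked numerically on a small example. The sketch below assumes a hypothetical scalar linear-quadratic problem (not taken from the text) in which $V_i(x) = p_i x^2$, so that value iteration reduces to a scalar recursion, and it uses $\lambda = p^{*}(a^2/q + b^2/r)$, which satisfies Assumption 6.2.2 for this problem by the Cauchy–Schwarz inequality. The names `vi_with_bounds`, `P_STAR`, and `LAM` are illustrative.

```python
import math

# Hypothetical scalar problem: x_{k+1} = A*x + B*u, U = Q*x^2 + R*u^2.
A, B, Q, R = 1.2, 1.0, 1.0, 1.0


def T(p):
    """Value-iteration operator on the coefficient p of V(x) = p*x^2."""
    return Q + A**2 * p * R / (R + B**2 * p)


# Optimal coefficient p* (fixed point of T) and a valid lambda for Assumption 6.2.2.
_c1 = R - Q * B**2 - R * A**2
P_STAR = (-_c1 + math.sqrt(_c1**2 + 4.0 * B**2 * Q * R)) / (2.0 * B**2)
LAM = P_STAR * (A**2 / Q + B**2 / R)


def vi_with_bounds(p0, alpha, beta, n_iter):
    """Run V_i = T V_{i-1} and return (i, lower bound, p_i, upper bound) per iteration."""
    rows, p = [], p0
    for i in range(1, n_iter + 1):
        p = T(p)
        shrink = (1.0 + 1.0 / LAM) ** i
        lower = (1.0 - (1.0 - alpha) / shrink) * P_STAR
        upper = (1.0 + (beta - 1.0) / shrink) * P_STAR
        rows.append((i, lower, p, upper))
    return rows


# V_0 = 0 corresponds to alpha = 0, beta = 1; V_0 = 5 V* to alpha = 1, beta = 5.
rows_zero = vi_with_bounds(p0=0.0, alpha=0.0, beta=1.0, n_iter=30)
rows_big = vi_with_bounds(p0=5.0 * P_STAR, alpha=1.0, beta=5.0, n_iter=30)
```

In both runs the iterate stays between the two bounds, and the bounds close in on $p^{*}$ at the rate $(1 + \lambda^{-1})^{-i}$; the actual iterate typically converges faster, since the theorem gives a worst-case rate.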

B. Error Bounds for Approximate Value Iteration


For the approximate value iteration algorithm, function approximation structures
like neural networks are usually used to approximate the value function Vi and the
control policy vi . In general, the value function and control policy update equations
(6.2.7) and (6.2.8) cannot be solved accurately, because we only have some samples
from the state space and there exist approximation errors for function approximation
structures. Here, we use V̂i and v̂i , i = 0, 1, . . . , to stand for the approximate
expressions of Vi and vi , respectively, where V̂0 = V0 .
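We assume that there exist finite positive constants $\underline{\delta} \le 1$ and $\overline{\delta} \ge 1$ such that

\[
\underline{\delta}\, T_{\hat{v}_i} \hat{V}_i \le \hat{V}_{i+1} \le \overline{\delta}\, T_{\hat{v}_i} \hat{V}_i, \quad \forall i = 0, 1, \ldots, \tag{6.2.9}
\]

hold uniformly. Similarly, we also assume that there exist finite positive constants
$\underline{\zeta} \le 1$ and $\overline{\zeta} \ge 1$ such that

\[
\underline{\zeta}\, T \hat{V}_i \le T_{\hat{v}_i} \hat{V}_i \le \overline{\zeta}\, T \hat{V}_i, \quad \forall i = 0, 1, \ldots, \tag{6.2.10}
\]

hold uniformly. Combining (6.2.9) and (6.2.10), we can get

\[
\underline{\zeta}\,\underline{\delta}\, T \hat{V}_i \le \hat{V}_{i+1} \le \overline{\zeta}\,\overline{\delta}\, T \hat{V}_i. \tag{6.2.11}
\]

For simplicity, (6.2.11) can be written as

\[
\underline{\eta}\, T \hat{V}_i \le \hat{V}_{i+1} \le \overline{\eta}\, T \hat{V}_i, \quad \forall i = 0, 1, \ldots, \tag{6.2.12}
\]

by denoting

\[
\underline{\eta} \triangleq \underline{\zeta}\,\underline{\delta} \le 1, \qquad \overline{\eta} \triangleq \overline{\zeta}\,\overline{\delta} \ge 1. \tag{6.2.13}
\]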

Based on Assumptions 6.2.1 and 6.2.2, we can establish the error bounds for the
approximate value iteration by the following theorem.
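Theorem 6.2.2 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that $0 \le \alpha V^{*} \le
V_0 \le \beta V^{*}$, $0 \le \alpha \le 1$ and $1 \le \beta < \infty$. The approximate value function $\hat{V}_i$
satisfies the iterative error condition (6.2.12). Then, the value function sequence
$\{\hat{V}_i\}$ approaches $V^{*}$ according to the following inequalities

\[
\begin{aligned}
&\underline{\eta} \left[ 1 - \sum_{j=1}^{i} \frac{\lambda^{j} \underline{\eta}^{\,j-1} (1 - \underline{\eta})}{(\lambda + 1)^{j}}
 - \frac{\lambda^{i} \underline{\eta}^{\,i} (1 - \alpha)}{(\lambda + 1)^{i}} \right] V^{*} \\
&\quad \le \hat{V}_{i+1} \le
\overline{\eta} \left[ 1 + \sum_{j=1}^{i} \frac{\lambda^{j} \overline{\eta}^{\,j-1} (\overline{\eta} - 1)}{(\lambda + 1)^{j}}
 + \frac{\lambda^{i} \overline{\eta}^{\,i} (\beta - 1)}{(\lambda + 1)^{i}} \right] V^{*}, \quad i \ge 0,
\end{aligned} \tag{6.2.14}
\]

where we define $\sum_{j=1}^{i} (\cdot) = 0$ for $i < 1$. Moreover, the value function sequence $\{\hat{V}_i\}$
converges to a finite neighborhood of $V^{*}$ uniformly on Ω as i → ∞, i.e.,

\[
\frac{\underline{\eta}}{1 + \lambda - \lambda \underline{\eta}}\, V^{*}
\le \lim_{i \to \infty} \hat{V}_i \le
\frac{\overline{\eta}}{1 + \lambda - \lambda \overline{\eta}}\, V^{*}, \tag{6.2.15}
\]

under the condition $\overline{\eta} < \dfrac{1}{\lambda} + 1$.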
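Proof First, we prove the lower bound of the approximate value function $\hat{V}_{i+1}$ by
mathematical induction. Letting i = 0 in (6.2.12), we can obtain $\hat{V}_1 \ge \underline{\eta}\, T \hat{V}_0 =
\underline{\eta}\, T V_0$. Considering $\alpha V^{*} \le V_0$, we can get

\[
\hat{V}_1 \ge \underline{\eta}\, T V_0 \ge \alpha \underline{\eta}\, T V^{*} = \alpha \underline{\eta}\, V^{*}.
\]

Thus, the lower bound of $\hat{V}_{i+1}$ holds for i = 0. According to (6.2.12), Assumptions
6.2.1 and 6.2.2, we can get

\[
\begin{aligned}
\hat{V}_2 &\ge \underline{\eta}\, T \hat{V}_1 \\
&= \underline{\eta} \min_{u} \big\{ U(x, u) + \hat{V}_1\big(F(x, u)\big) \big\} \\
&\ge \underline{\eta} \min_{u} \big\{ U(x, u) + \alpha \underline{\eta}\, V^{*}\big(F(x, u)\big) \big\} \\
&\ge \underline{\eta} \min_{u} \left\{ \left( 1 + \lambda \frac{\alpha \underline{\eta} - 1}{\lambda + 1} \right) U(x, u)
   + \left( \alpha \underline{\eta} - \frac{\alpha \underline{\eta} - 1}{\lambda + 1} \right) V^{*}\big(F(x, u)\big) \right\} \\
&= \underline{\eta} \left( 1 + \lambda \frac{\alpha \underline{\eta} - 1}{\lambda + 1} \right)
   \min_{u} \big\{ U(x, u) + V^{*}\big(F(x, u)\big) \big\} \\
&= \underline{\eta} \left[ 1 - \frac{\lambda (1 - \underline{\eta})}{\lambda + 1} - \frac{\lambda \underline{\eta} (1 - \alpha)}{\lambda + 1} \right] V^{*}.
\end{aligned}
\]

Hence, the lower bound of $\hat{V}_{i+1}$ holds for i = 1. Assume the lower bound in (6.2.14)
holds for i = l, i.e.,

\[
\hat{V}_{l+1} \ge \underline{\eta} \left[ 1 - \sum_{j=1}^{l} \frac{\lambda^{j} \underline{\eta}^{\,j-1} (1 - \underline{\eta})}{(\lambda + 1)^{j}}
 - \frac{\lambda^{l} \underline{\eta}^{\,l} (1 - \alpha)}{(\lambda + 1)^{l}} \right] V^{*} \triangleq \Theta V^{*}.
\]

When i = l + 1, according to (6.2.12) and the above inequality, we have

\[
\begin{aligned}
\hat{V}_{l+2} &\ge \underline{\eta}\, T \hat{V}_{l+1} \\
&= \underline{\eta} \min_{u} \big\{ U(x, u) + \hat{V}_{l+1}\big(F(x, u)\big) \big\} \\
&\ge \underline{\eta} \min_{u} \big\{ U(x, u) + \Theta V^{*}\big(F(x, u)\big) \big\}.
\end{aligned}
\]

Let

\[
\Xi \triangleq - \sum_{j=1}^{l+1} \frac{\lambda^{j-1} \underline{\eta}^{\,j-1} (1 - \underline{\eta})}{(\lambda + 1)^{j}}
 - \frac{\lambda^{l} \underline{\eta}^{\,l+1} (1 - \alpha)}{(\lambda + 1)^{l+1}}. \tag{6.2.16}
\]

According to the definitions of Θ and Ξ, we have

\[
\begin{aligned}
1 + \lambda \Xi + \Xi
&= 1 - (\lambda + 1) \left[ \frac{1 - \underline{\eta}}{\lambda + 1}
  + \sum_{j=2}^{l+1} \frac{\lambda^{j-1} \underline{\eta}^{\,j-1} (1 - \underline{\eta})}{(\lambda + 1)^{j}}
  + \frac{\lambda^{l} \underline{\eta}^{\,l+1} (1 - \alpha)}{(\lambda + 1)^{l+1}} \right] \\
&= \underline{\eta} - \sum_{j=1}^{l} \frac{\lambda^{j} \underline{\eta}^{\,j} (1 - \underline{\eta})}{(\lambda + 1)^{j}}
  - \frac{\lambda^{l} \underline{\eta}^{\,l+1} (1 - \alpha)}{(\lambda + 1)^{l}} \\
&= \Theta.
\end{aligned}
\]

According to Assumption 6.2.2 and $\Xi \le 0$, we have

\[
\lambda \Xi\, U(x, u) - \Xi\, V^{*}\big(F(x, u)\big) \le 0. \tag{6.2.17}
\]

Therefore, noticing $1 + \lambda \Xi = \Theta - \Xi$, we have

\[
\begin{aligned}
\hat{V}_{l+2} &\ge \underline{\eta} \min_{u} \big\{ (1 + \lambda \Xi)\, U(x, u) + (\Theta - \Xi)\, V^{*}\big(F(x, u)\big) \big\} \\
&= \underline{\eta}\, (1 + \lambda \Xi) \min_{u} \big\{ U(x, u) + V^{*}\big(F(x, u)\big) \big\} \\
&= \underline{\eta} \left[ 1 - \sum_{j=1}^{l+1} \frac{\lambda^{j} \underline{\eta}^{\,j-1} (1 - \underline{\eta})}{(\lambda + 1)^{j}}
  - \frac{\lambda^{l+1} \underline{\eta}^{\,l+1} (1 - \alpha)}{(\lambda + 1)^{l+1}} \right] V^{*}.
\end{aligned}
\]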
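Also, the upper bound can be proved similarly. Therefore, the lower and upper
bounds of $\hat{V}_{i+1}$ in (6.2.14) have been proved.
Finally, we prove that the value function sequence $\{\hat{V}_i\}$ converges to a finite
neighborhood of $V^{*}$ uniformly on Ω as i → ∞ if $\overline{\eta} < 1/\lambda + 1$. Since the sequence
$\big\{ \lambda^{j} \underline{\eta}^{\,j-1} (1 - \underline{\eta}) / (\lambda + 1)^{j} \big\}$ is geometric, we have

\[
\sum_{j=1}^{i} \frac{\lambda^{j} \underline{\eta}^{\,j-1} (1 - \underline{\eta})}{(\lambda + 1)^{j}}
= \frac{\dfrac{\lambda (1 - \underline{\eta})}{\lambda + 1} \left[ 1 - \left( \dfrac{\lambda \underline{\eta}}{\lambda + 1} \right)^{i} \right]}
       {1 - \dfrac{\lambda \underline{\eta}}{\lambda + 1}}.
\]

Considering $\dfrac{\lambda \underline{\eta}}{\lambda + 1} \le \dfrac{\lambda \overline{\eta}}{\lambda + 1} < 1$, we have

\[
\lim_{i \to \infty} \hat{V}_i \ge \frac{\underline{\eta}}{1 + \lambda - \lambda \underline{\eta}}\, V^{*}.
\]

For the upper bound, if $\lambda \overline{\eta} / (\lambda + 1) < 1$, i.e., $\overline{\eta} < 1/\lambda + 1$, we can show

\[
\lim_{i \to \infty} \hat{V}_i \le \frac{\overline{\eta}}{1 + \lambda - \lambda \overline{\eta}}\, V^{*}.
\]

Thus, we complete the proof.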
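The passage from (6.2.14) to the lower limit in (6.2.15) is a geometric-series computation, spelled out here as a worked step. The lower error constant is written $\underline{\eta}$ and the upper one $\overline{\eta}$; the ratio $r = \lambda \underline{\eta} / (\lambda + 1) < 1$ is a symbol introduced only for this sketch.

```latex
\[
\lim_{i \to \infty} \sum_{j=1}^{i} \frac{\lambda^{j} \underline{\eta}^{\,j-1} (1 - \underline{\eta})}{(\lambda + 1)^{j}}
 = \frac{\lambda (1 - \underline{\eta})}{\lambda + 1} \cdot \frac{1}{1 - r}
 = \frac{\lambda (1 - \underline{\eta})}{1 + \lambda - \lambda \underline{\eta}},
 \qquad
 \lim_{i \to \infty} r^{\,i} (1 - \alpha) = 0,
\]
\[
\underline{\eta} \left[ 1 - \frac{\lambda (1 - \underline{\eta})}{1 + \lambda - \lambda \underline{\eta}} \right]
 = \underline{\eta} \cdot \frac{1}{1 + \lambda - \lambda \underline{\eta}}
 = \frac{\underline{\eta}}{1 + \lambda - \lambda \underline{\eta}} .
\]
```

The upper limit follows in the same way with $\overline{\eta}$ in place of $\underline{\eta}$ and the signs of the two correction terms reversed, provided $\lambda \overline{\eta} / (\lambda + 1) < 1$ so that the series converges.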

Remark 6.2.1 We can find that the lower and upper bounds in (6.2.15) are
monotonically increasing functions of $\underline{\eta}$ and $\overline{\eta}$, respectively. The condition
$\overline{\eta} < 1/\lambda + 1$ should be satisfied such that the upper bound in (6.2.15) is finite and positive.
Since $\underline{\eta} \le 1$, the lower bound in (6.2.15) is always positive. The values of $\underline{\eta}$ and $\overline{\eta}$
may gradually be refined during the iterative process similar to [2], in which a crude
initial operator approximation is gradually refined with new iterations. We can also
derive that a larger λ will lead to a slower convergence rate and a larger error bound. A
larger λ also requires more accurate iteration to converge. When $\underline{\eta} = \overline{\eta} = 1$, Theorem
6.2.2 reduces to Theorem 6.2.1 and the value function sequence $\{\hat{V}_i\}$ converges to
$V^{*}$ uniformly on Ω as i → ∞.
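The limiting neighborhood in (6.2.15) can be illustrated with a short simulation. The sketch below assumes a hypothetical scalar linear-quadratic problem (not from the text) with $V(x) = p x^2$, and it models the approximation error as a random multiplicative factor drawn from $[\underline{\eta}, \overline{\eta}]$ at every iteration, which is one simple way to satisfy (6.2.12). The names `approximate_vi`, `ETA_LOW`, and `ETA_UP` are illustrative.

```python
import math
import random

# Hypothetical scalar problem: x_{k+1} = A*x + B*u, U = Q*x^2 + R*u^2.
A, B, Q, R = 1.2, 1.0, 1.0, 1.0


def T(p):
    """Exact value-iteration operator on the coefficient p of V(x) = p*x^2."""
    return Q + A**2 * p * R / (R + B**2 * p)


_c1 = R - Q * B**2 - R * A**2
P_STAR = (-_c1 + math.sqrt(_c1**2 + 4.0 * B**2 * Q * R)) / (2.0 * B**2)
LAM = P_STAR * (A**2 / Q + B**2 / R)  # valid lambda for Assumption 6.2.2 here

ETA_LOW, ETA_UP = 0.95, 1.05          # assumed error constants
condition_holds = ETA_UP < 1.0 / LAM + 1.0


def approximate_vi(eta_low, eta_up, n_iter, seed=0):
    """Iterate p_hat <- eta * T(p_hat) with eta drawn from [eta_low, eta_up]."""
    rng = random.Random(seed)
    p_hat = 0.0  # V_hat_0 = V_0 = 0
    for _ in range(n_iter):
        p_hat = rng.uniform(eta_low, eta_up) * T(p_hat)
    return p_hat


# Limits of the bounds in (6.2.15), expressed for the coefficient p.
lower_limit = ETA_LOW / (1.0 + LAM - LAM * ETA_LOW) * P_STAR
upper_limit = ETA_UP / (1.0 + LAM - LAM * ETA_UP) * P_STAR
p_final = approximate_vi(ETA_LOW, ETA_UP, n_iter=200)
```

With these numbers the iterate ends well inside the interval `[lower_limit, upper_limit]`, which reflects that (6.2.15) is a worst-case bound. Pushing `ETA_UP` toward `1/LAM + 1` makes `upper_limit` grow without bound, matching the condition in the theorem.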

6.2.3 Approximate Policy Iteration

We first present and analyze the exact policy iteration and then establish the error
bounds for approximate policy iteration.

A. Policy Iteration
For the policy iteration algorithm, an initial stabilizing control policy is usually
required. In this section, we start the policy iteration from an initial value function
$V_0$ such that $V_0 \ge T V_0$. We can see that the control policy $\pi_0(x)$ obtained by

\[
\pi_0(x) = \arg\min_{u} \big\{ U(x, u) + V_0\big(F(x, u)\big) \big\} \tag{6.2.18}
\]

is an asymptotically stable control law for the system (6.2.1), because the difference

\[
V_0\big(F(x, \pi_0(x))\big) - V_0(x) \le V_0\big(F(x, \pi_0(x))\big) - (T V_0)(x)
= -U\big(x, \pi_0(x)\big) \le 0
\]

is negative definite. For i = 1, 2, . . ., the policy iteration algorithm iterates between
policy evaluation

\[
V_i^{\pi}(x) = U\big(x, \pi_{i-1}(x)\big) + V_i^{\pi}\big(F(x, \pi_{i-1}(x))\big), \tag{6.2.19}
\]

and policy improvement

\[
\pi_i(x) = \arg\min_{u} \big\{ U(x, u) + V_i^{\pi}\big(F(x, u)\big) \big\}. \tag{6.2.20}
\]

Note that a short expression for (6.2.19) is given by $V_i^{\pi} = T_{\pi_{i-1}} V_i^{\pi}$.

Table 6.2 The iterative process of the PI algorithm in (6.2.18)–(6.2.20)
Step            i = 0                          i = 1                                i = 2                                ···
Evaluation      (empty)                        $\pi_0 \to V_1^{\pi}$ by (6.2.19)    $\pi_1 \to V_2^{\pi}$ by (6.2.19)    ···
Minimization    $V_0 \to \pi_0$ by (6.2.18)    $V_1^{\pi} \to \pi_1$ by (6.2.20)    $V_2^{\pi} \to \pi_2$ by (6.2.20)    ···
The flow of iterations of the PI algorithm in (6.2.18)–(6.2.20) is illustrated in
Table 6.2, where the iteration flows as in Fig. 2.1. The algorithm starts with a special
initial value function so that an admissible control law π 0 (xk ) is obtained from
(6.2.18). Comparing between Tables 6.2 and 4.1, we note that the two descriptions
are the same except the initialization step. The algorithm in (6.2.18)–(6.2.20) starts
at i = 0 using an initial value function V0 satisfying V0 ≥ T V0 , whereas the
PI algorithm in (4.2.5)–(4.2.6) starts with an admissible control law v0 (xk ). For a
particular problem, the one with initial condition easier to obtain will usually be
chosen.
When the policy evaluation equation (6.2.19) cannot be solved directly, the following
iterative process can be used to solve the value function at the policy evaluation
step

\[
V_i^{\pi(j)}(x) = U\big(x, \pi_{i-1}(x)\big) + V_i^{\pi(j-1)}\big(F(x, \pi_{i-1}(x))\big), \quad j > 0, \tag{6.2.21}
\]

with $V_i^{\pi(0)} = V_{i-1}^{\pi}$, ∀i ≥ 1 and $V_1^{\pi(0)} = V_0^{\pi} = V_0$. The solution of (6.2.21) for the
value function can be obtained after an infinite number of iterations as $V_i^{\pi} = V_i^{\pi(\infty)}$
(cf. Lemma 6.2.2 presented next). If a finite number of iterations is employed in the
calculation of (6.2.21), it gives approximations to the evaluation in (6.2.19) and the
algorithm becomes the optimistic policy iteration algorithm to be studied later in
Sect. 6.2.4.
We list in Table 6.3 some key expressions which will be used in this section for
establishing analysis results.

Table 6.3 Key expressions used in Sect. 6.2.3
Key expression                                                         Location
$V \le V' \Rightarrow T_u V \le T_u V'$                                Assumption 6.2.1
$V_i^{\pi} = T_{\pi_{i-1}} V_i^{\pi}$                                  (6.2.19)
$T V_i^{\pi} = T_{\pi_i} V_i^{\pi}$                                    (6.2.20)
$V_i^{\pi(j)} = T_{\pi_{i-1}} V_i^{\pi(j-1)}$                          (6.2.21)
$V_i^{\pi(0)} = V_{i-1}^{\pi}$, $V_1^{\pi(0)} = V_0^{\pi} = V_0$       (6.2.21)
Now, using the expressions in Table 6.3, we can prove the following lemma.

Lemma 6.2.2 Let Assumption 6.2.1 hold. Suppose that V0 ≥ T V0 . Let πi+1 and
π( j) π( j)
Vi be updated by (6.2.20) and (6.2.21). Then, the sequence {Vi } is a monoton-
π( j) π( j−1)
ically nonincreasing sequence, i.e., Vi ≤ Vi , ∀i ≥ 1. Moreover, as j → ∞,
π( j) π(∞)
the limit of Vi denoted by Vi exists, and it is equal to Viπ , ∀i ≥ 1.

Proof We prove the lemma by mathematical induction. Letting i = 1 and j = 1 in


(6.2.21), we can obtain

V1π(1) (x) = Tπ 0 V1π(0) (x)


    
= U x, π 0 (x) + V1π(0) F x, π 0 (x)
    
= U x, π 0 (x) + V0 F x, π 0 (x)
= T V0 ≤ V0 = V1π(0) (x). (6.2.22)

For j = 2, according to (6.2.21) and Assumption 6.2.1, we have

V1π(2) (x) = Tπ 0 V1π(1) (x) ≤ Tπ 0 V1π(0) (x) = V1π(1) (x).

π( j) π( j−1)
Assume that V1 ≤ V1 is true for j = l, l = 1, 2, . . . , i.e., V1π(l) ≤ V1π(l−1) .
For j = l + 1, we have

V1π(l+1) = Tπ 0 V1π(l) ≤ Tπ 0 V1π(l−1) = V1π(l) .

π( j) π( j−1)
Therefore, we obtain Vi ≤ Vi for i = 1 by induction. Since the sequence
π( j) π( j) π( j)
{V1 } is a monotonically nonincreasing sequence and V1 ≥ 0, the limit of V1
π( j)
exists, which is denoted by V1π(∞) and satisfies V1π(∞) ≤ V1 . Considering
234 6 Error Bounds of Adaptive Dynamic Programming Algorithms

π( j+1)   π( j)   
V1 (x) = U x, π 0 (x) + V1 F x, π 0 (x) , j ≥ 0,

we have
π( j+1)     
V1 (x) ≥ U x, π0 (x) + V1π(∞) F x, π 0 (x) , j ≥ 0. (6.2.23)

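Letting $j \to \infty$ in (6.2.23), we have
\[ V_1^{\pi(\infty)}(x) \ge U\big(x, \pi_0(x)\big) + V_1^{\pi(\infty)}\big(F(x, \pi_0(x))\big). \tag{6.2.24} \]
Similarly, we obtain
\[ V_1^{\pi(\infty)}(x) \le V_1^{\pi(j+1)}(x) = U\big(x, \pi_0(x)\big) + V_1^{\pi(j)}\big(F(x, \pi_0(x))\big), \quad j \ge 0. \tag{6.2.25} \]
Letting $j \to \infty$ in (6.2.25), we can obtain
\[ V_1^{\pi(\infty)}(x) \le U\big(x, \pi_0(x)\big) + V_1^{\pi(\infty)}\big(F(x, \pi_0(x))\big). \tag{6.2.26} \]
Combining (6.2.24) and (6.2.26), we get
\[ V_1^{\pi(\infty)}(x) = U\big(x, \pi_0(x)\big) + V_1^{\pi(\infty)}\big(F(x, \pi_0(x))\big). \]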

Considering (6.2.19), we can obtain that $V_i^{\pi(\infty)}(x) = V_i^{\pi}(x)$ holds for $i = 1$. We assume that $V_i^{\pi(j)} \ge V_i^{\pi(j+1)}$ holds and $V_i^{\pi(\infty)}(x) = V_i^{\pi}(x)$, $\forall i \ge 1$. Then, considering (6.2.19)–(6.2.21), for $j = 0$, we can get
\[ V_{i+1}^{\pi(1)} = T_{\pi_i} V_{i+1}^{\pi(0)} = T_{\pi_i} V_i^{\pi} = T V_i^{\pi} \le V_i^{\pi} = V_{i+1}^{\pi(0)}. \]

According to (6.2.21) and Assumption 6.2.1, for $j = 1$, we have
\[ V_{i+1}^{\pi(2)} = T_{\pi_i} V_{i+1}^{\pi(1)} \le T_{\pi_i} V_{i+1}^{\pi(0)} = V_{i+1}^{\pi(1)}. \]
Similarly, we can obtain that $V_{i+1}^{\pi(j+1)} \le V_{i+1}^{\pi(j)}$ holds for $j = 0, 1, \ldots$, by induction, and $V_{i+1}^{\pi(\infty)}(x) = V_{i+1}^{\pi}(x)$. We have now shown $V_i^{\pi(j)} \le V_i^{\pi(j-1)}$ for all $i = 1, 2, \ldots$, and all $j = 1, 2, \ldots$, and $V_i^{\pi(\infty)} = V_i^{\pi}$ for all $i = 1, 2, \ldots$. Therefore, the proof is complete.
Note that Lemma 6.2.2 presented the same result as Theorems 5.2.1(i) and 5.2.2.
We state and prove it here using our new notation.
Lemma 6.2.3 (cf. Lemma 5.2.3) Let Assumption 6.2.1 hold. Suppose that $V_0 \ge T V_0$. Let $\pi_i$ and $V_i^{\pi(j)}$ be updated by (6.2.20) and (6.2.21). Then, the sequence $\{V_i^{\pi}\}$ is a monotonically nonincreasing sequence, i.e., $V_i^{\pi} \ge V_{i+1}^{\pi}$, $\forall i \ge 0$.
Proof According to Lemma 6.2.2, we can get

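\[ V_{i+1}^{\pi(0)} \ge V_{i+1}^{\pi(\infty)} = V_{i+1}^{\pi}. \]
Then, considering
\[ V_i^{\pi} = T_{\pi_{i-1}} V_i^{\pi} \ge T V_i^{\pi} = T_{\pi_i} V_i^{\pi} = T_{\pi_i} V_{i+1}^{\pi(0)}, \]
and Assumption 6.2.1, we can obtain
\[ V_i^{\pi} \ge T_{\pi_i} V_{i+1}^{\pi(0)} \ge T_{\pi_i} V_{i+1}^{\pi} = V_{i+1}^{\pi}. \]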

Therefore, the sequence {Viπ }, i ≥ 0, is a monotonically nonincreasing sequence.


This completes the proof of the lemma.

Based on the lemmas above, an extended result of Proposition 5 in [23] is given


in the following theorem.
Theorem 6.2.3 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that $V^* \le V_0 \le \beta V^*$, $1 \le \beta < \infty$, and that $V_0 \ge T V_0$. Let $\pi_i$ and $V_i^{\pi(j)}$ be updated by (6.2.20) and (6.2.21). Then, the value function sequence $\{V_i^{\pi}\}$ approaches $V^*$ according to the following inequalities
\[ V^* \le V_i^{\pi} \le \left[ 1 + \frac{\beta - 1}{(1 + \lambda^{-1})^i} \right] V^*, \quad i \ge 1. \]

Proof First, we prove that $V_i^{\pi} \le V_i$, $\forall i \ge 1$, holds by induction, where $V_i$ is defined in (6.2.7). According to Lemma 6.2.2 and (6.2.22), we have
\[ V_1^{\pi} = T_{\pi_0} V_1^{\pi} = T_{\pi_0} V_1^{\pi(\infty)} \le T_{\pi_0} V_1^{\pi(0)} = T_{\pi_0} V_0 = T V_0 = V_1. \]
Assume $V_i^{\pi} \le V_i$ holds, $\forall i \ge 1$. Considering Assumption 6.2.1 and Lemma 6.2.2, we have
\[ V_{i+1}^{\pi} = T_{\pi_i} V_{i+1}^{\pi} = T_{\pi_i} V_{i+1}^{\pi(\infty)} \le T_{\pi_i} V_{i+1}^{\pi(0)} \le T V_{i+1}^{\pi(0)} = T V_i^{\pi} \le T V_i = V_{i+1}. \]
Therefore, considering (6.2.3) and Theorem 6.2.1, we can obtain the conclusion. This completes the proof of the theorem.
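To get a feel for how quickly the bound of Theorem 6.2.3 tightens, the following minimal Python sketch evaluates the factor $1 + (\beta - 1)/(1 + \lambda^{-1})^i$. The function name and the sample values λ = 4 and β = 3 are illustrative assumptions and do not come from the text.

```python
def pi_bound_factor(lam: float, beta: float, i: int) -> float:
    """Factor c_i in V* <= V_i^pi <= c_i * V* (bound of Theorem 6.2.3)."""
    return 1.0 + (beta - 1.0) / (1.0 + 1.0 / lam) ** i


if __name__ == "__main__":
    # Illustrative values only: lam = 4, beta = 3.
    for i in (1, 5, 10, 20):
        print(i, pi_bound_factor(4.0, 3.0, i))
```

The factor decreases geometrically toward 1 with ratio $(1 + \lambda^{-1})^{-1}$, so a smaller λ gives a faster guaranteed rate.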

B. Error Bounds for Approximate Policy Iteration


Similar to Sect. 6.2.2B, we use function approximation structures to approximate the value function and control policy. Here, we use $\hat{V}_i^{\hat\pi}$ and $\hat\pi_i$ to stand for the approximate expressions of $V_i^{\pi}$ and $\pi_i$, respectively. We assume that there exist finite positive constants $\underline{\delta} \le 1$ and $\overline{\delta} \ge 1$ such that
\[ \underline{\delta} V_i^{\hat\pi} \le \hat{V}_i^{\hat\pi} \le \overline{\delta} V_i^{\hat\pi}, \quad \forall i = 1, 2, \ldots, \tag{6.2.27} \]



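hold uniformly, where $V_i^{\hat\pi}$ is the exact value function associated with $\hat\pi_i$. Considering Lemma 6.2.2, we have
\[ \hat{V}_i^{\hat\pi} \le \overline{\delta} V_i^{\hat\pi} \le \overline{\delta} V_i^{\hat\pi(1)} = \overline{\delta} T_{\hat\pi_i} V_i^{\hat\pi(0)} = \overline{\delta} T_{\hat\pi_i} \hat{V}_{i-1}^{\hat\pi}. \tag{6.2.28} \]
Similarly, we assume that there exist finite positive constants $\underline{\zeta} \le 1$ and $\overline{\zeta} \ge 1$ such that
\[ \underline{\zeta} T \hat{V}_{i-1}^{\hat\pi} \le T_{\hat\pi_i} \hat{V}_{i-1}^{\hat\pi} \le \overline{\zeta} T \hat{V}_{i-1}^{\hat\pi}, \quad \forall i = 1, 2, \ldots, \tag{6.2.29} \]
hold uniformly. Combining (6.2.28) and (6.2.29), we can get
\[ \hat{V}_i^{\hat\pi} \le \overline{\zeta}\, \overline{\delta}\, T \hat{V}_{i-1}^{\hat\pi}. \]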

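On the other hand, considering (6.2.27), (6.2.29), and Assumption 6.2.1, we can obtain
\[
\begin{aligned}
\hat{V}_i^{\hat\pi} &\ge \underline{\delta} V_i^{\hat\pi} \\
&= \underline{\delta} (T_{\hat\pi_i} \cdots T_{\hat\pi_i} T_{\hat\pi_i}) \hat{V}_{i-1}^{\hat\pi} \\
&\ge \underline{\delta} (T \cdots T\, T_{\hat\pi_i}) \hat{V}_{i-1}^{\hat\pi} \\
&\ge \underline{\zeta}\, \underline{\delta} (T \cdots T\, T) \hat{V}_{i-1}^{\hat\pi} \\
&\ge \underline{\zeta}\, \underline{\delta} V^*.
\end{aligned}
\]
Therefore, the whole approximation errors in the value function and control policy update equations can be expressed by
\[ \underline{\zeta}\, \underline{\delta} V^* \le \hat{V}_i^{\hat\pi} \le \overline{\zeta}\, \overline{\delta}\, T \hat{V}_{i-1}^{\hat\pi}. \tag{6.2.30} \]
Using the notation in (6.2.13), (6.2.30) can be written as
\[ \underline{\eta} V^* \le \hat{V}_i^{\hat\pi} \le \overline{\eta}\, T \hat{V}_{i-1}^{\hat\pi}. \tag{6.2.31} \]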

Similar to Sect. 6.2.2B, we can establish the error bounds for approximate policy
iteration by the following theorem.

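Theorem 6.2.4 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that $V^* \le V_0 \le \beta V^*$, $1 \le \beta < \infty$, and that $V_0 \ge T V_0$. The approximate value function $\hat{V}_i^{\hat\pi}$ satisfies the iterative error condition (6.2.31). Then, the value function sequence $\{\hat{V}_i^{\hat\pi}\}$ approaches $V^*$ according to the following inequalities
\[ \underline{\eta} V^* \le \hat{V}_{i+1}^{\hat\pi} \le \overline{\eta} \left[ 1 + \sum_{j=1}^{i} \frac{\lambda^j \overline{\eta}^{\,j-1} (\overline{\eta} - 1)}{(\lambda + 1)^j} + \frac{\lambda^i \overline{\eta}^{\,i} (\beta - 1)}{(\lambda + 1)^i} \right] V^*, \quad i \ge 0. \]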

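Moreover, the approximate value function sequence $\{\hat{V}_i^{\hat\pi}\}$ converges to a finite neighborhood of $V^*$ uniformly on Ω as $i \to \infty$, i.e.,
\[ \underline{\eta} V^* \le \lim_{i \to \infty} \hat{V}_i^{\hat\pi} \le \frac{\overline{\eta}}{1 + \lambda - \lambda \overline{\eta}} V^*, \tag{6.2.32} \]
under the condition $\overline{\eta} < \dfrac{1}{\lambda} + 1$.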

6.2.4 Approximate Optimistic Policy Iteration

We will prove the convergence of exact optimistic policy iteration and establish the
error bounds for approximate optimistic policy iteration.

A. Optimistic Policy Iteration


According to Lemma 6.2.2, we can see that the policy evaluation can be obtained as $j \to \infty$ in (6.2.21). However, this process will usually take a long time to converge. To avoid this problem, the optimistic policy iteration algorithm performs only a finite number of iterations in (6.2.21).
For the optimistic policy iteration algorithm, we start the iteration from an initial value function $V_0$ such that $V_0 \ge T V_0$. The initial control policy $\mu_0 = \pi_0$ is obtained by (6.2.18). For $i = 1, 2, \ldots$, and for some positive integer $N_i$, the optimistic policy iteration algorithm updates the value function by
\[ V_i^{\mu}(x) = V_i^{\mu(N_i)}(x), \tag{6.2.33} \]
where
\[ V_i^{\mu(j)}(x) = U\big(x, \mu_{i-1}(x)\big) + V_i^{\mu(j-1)}\big(F(x, \mu_{i-1}(x))\big), \quad 0 < j \le N_i, \tag{6.2.34} \]
$V_i^{\mu(0)} = V_{i-1}^{\mu}$, $\forall i \ge 1$, and $V_1^{\mu(0)} = V_0^{\mu} = V_0$. Using the definition in (6.2.5), the value function $V_i^{\mu}(x)$ can be expressed by $V_i^{\mu}(x) = T_{\mu_{i-1}}^{N_i} V_{i-1}^{\mu}(x)$. The optimistic policy iteration algorithm updates the control policy by
\[ \mu_i(x) = \arg\min_{u} \big\{ U(x, u) + V_i^{\mu}\big(F(x, u)\big) \big\}. \tag{6.2.35} \]

In this case, the optimistic policy iteration algorithm becomes the generalized policy iteration algorithm studied in Chap. 4. The above algorithm becomes value iteration as $N_i = 1$ and becomes the policy iteration as $N_i \to \infty$. For policy iteration, it solves the value function associated with the current control policy at each iteration, while it takes only one iteration toward that value function for value iteration. However, the value function update in (6.2.34) has to stop before $j \to \infty$ in practical implementations.
We list some key expressions in Table 6.4 which will be used in this section for establishing analysis results.

Table 6.4 Key expressions used in Sect. 6.2.4
Key expression                                                                      Location
$V \le V' \Rightarrow T_u V \le T_u V'$                                             Assumption 6.2.1
$V_i^{\mu}(x) = V_i^{\mu(N_i)}(x) = T_{\mu_{i-1}} V_i^{\mu(N_i - 1)}$               (6.2.33)
$V_i^{\mu(j)}(x) = T_{\mu_{i-1}} V_i^{\mu(j-1)}$                                    (6.2.34)
$V_i^{\mu(0)} = V_{i-1}^{\mu}$, $V_1^{\mu(0)} = V_0^{\mu} = V_0$                    (6.2.34)
$T V_i^{\mu} = T_{\mu_i} V_i^{\mu}$                                                 (6.2.35)
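The exact optimistic policy iteration in (6.2.33)–(6.2.35) can be sketched on a small tabular problem. Everything specific below is a hypothetical toy setup rather than the book's example: six states on a line, state 0 as the zero-cost equilibrium, step sizes 1 and 2 as controls, and $V_0(x) = 2x$ as an initial value function that satisfies $V_0 \ge T V_0$ for this problem. The parameter `n_eval` plays the role of a constant $N_i$.

```python
# Hypothetical toy problem (not from the text): states 0..5 on a line,
# state 0 is the zero-cost equilibrium, and the control is a step size.
STATES = range(6)
ACTIONS = (1, 2)


def F(x, u):
    """System function: move u steps toward 0, never past it."""
    return max(x - u, 0)


def U(x, u):
    """Utility: zero at the equilibrium, else 1.0 for step 1 and 1.5 for step 2."""
    if x == 0:
        return 0.0
    return 1.0 if u == 1 else 1.5


def greedy_policy(V):
    """Policy update in the spirit of (6.2.35): minimize U(x, u) + V(F(x, u))."""
    return [min(ACTIONS, key=lambda u: U(x, u) + V[F(x, u)]) for x in STATES]


def optimistic_policy_iteration(V0, n_eval, n_iter):
    """Run n_iter outer iterations with N_i = n_eval evaluation sweeps each."""
    V = list(V0)
    history = [list(V)]
    for _ in range(n_iter):
        mu = greedy_policy(V)
        for _ in range(n_eval):  # partial policy evaluation as in (6.2.34)
            V = [U(x, mu[x]) + V[F(x, mu[x])] for x in STATES]
        history.append(list(V))
    return V, history


if __name__ == "__main__":
    V0 = [2.0 * x for x in STATES]  # satisfies V0 >= T V0 for this toy problem
    for n_eval in (1, 3, 50):
        V, _ = optimistic_policy_iteration(V0, n_eval, n_iter=20)
        print(n_eval, V)
```

With `n_eval=1` the loop reduces to value iteration, and a large `n_eval` approximates full policy evaluation, mirroring the two limiting cases described above.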
Next, we will show the monotonicity property of value function which is given in [6] and then establish the convergence property of optimistic policy iteration.
Lemma 6.2.4 Let Assumption 6.2.1 hold. Suppose that $V_0 \ge T V_0$. Let $V_i^{\mu}$ and $\mu_i$ be updated by (6.2.34) and (6.2.35). Then, the value function sequence $\{V_i^{\mu}\}$ is a monotonically nonincreasing sequence, i.e., $V_i^{\mu} \ge V_{i+1}^{\mu}$, $\forall i \ge 0$.

Proof According to Assumption 6.2.1 and Lemma 6.2.2, for $i = 0$, we have
\[ V_0^{\mu} = V_0 \ge T V_0 = T_{\mu_0} V_0 = T_{\mu_0} V_1^{\mu(0)} \ge T_{\mu_0} V_1^{\mu(N_1 - 1)} = V_1^{\mu}. \]
Thus, it holds for $i = 0$. Similarly, for $i = 1$, we can get
\[ V_1^{\mu} = T_{\mu_0} V_1^{\mu(N_1 - 1)} \ge T_{\mu_0} V_1^{\mu(N_1)} = T_{\mu_0} V_1^{\mu} \ge T V_1^{\mu} \tag{6.2.36} \]

and
\[ T V_1^{\mu} = T_{\mu_1} V_1^{\mu} = T_{\mu_1} V_2^{\mu(0)} \ge T_{\mu_1} V_2^{\mu(N_2 - 1)} = V_2^{\mu}. \]
Therefore, the conclusion can be proved by induction.

Theorem 6.2.5 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that $V^* \le V_0 \le \beta V^*$, $1 \le \beta < \infty$, and that $V_0 \ge T V_0$. The value function $V_i^{\mu}$ and the control policy $\mu_i$ are updated by (6.2.34) and (6.2.35). Then, the value function sequence $\{V_i^{\mu}\}$ approaches $V^*$ according to the inequalities
\[ V^* \le V_i^{\mu} \le \left[ 1 + \frac{\beta - 1}{(1 + \lambda^{-1})^i} \right] V^*, \quad i \ge 1. \tag{6.2.37} \]
Moreover, the value function $V_i^{\mu}$ converges to $V^*$ uniformly on Ω.

Proof First, we prove that $V_i^{\mu} \le V_i$, $\forall i \ge 1$, holds by mathematical induction, where $V_i$ is defined in (6.2.7). According to Lemma 6.2.2, we have
\[ V_1^{\mu} = V_1^{\mu(N_1)} \le V_1^{\mu(1)} = T_{\mu_0} V_0 = T V_0 = V_1. \]
Thus, the inequality holds for $i = 1$. Assume that it holds for some $i \ge 1$, i.e., $V_i^{\mu} \le V_i$. According to Lemma 6.2.2, we have
\[ V_{i+1}^{\mu} = V_{i+1}^{\mu(N_{i+1})} \le V_{i+1}^{\mu(1)} = T_{\mu_i} V_{i+1}^{\mu(0)} = T_{\mu_i} V_i^{\mu} = T V_i^{\mu}. \]
Considering Assumption 6.2.1, we can get
\[ T V_i^{\mu} \le T V_i = V_{i+1}. \]
Thus, we can obtain $V_{i+1}^{\mu} \le V_{i+1}$. Then, it can also be proved that $V_i^{\mu} \ge V^*$, $\forall i \ge 1$, by mathematical induction. Therefore, considering Theorem 6.2.1, we can obtain (6.2.37). As $i \to \infty$, the value function $V_i^{\mu}$ converges to $V^*$ uniformly on Ω. The proof is complete.

B. Error Bounds for Approximate Optimistic Policy Iteration


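Here, we use $\hat{V}_i^{\hat\mu}$ and $\hat\mu_i$ to stand for the approximate expressions of $V_i^{\mu}$ and $\mu_i$, respectively. Without loss of generality, we let $N_i$ be a constant integer $K$ for all iteration steps.
We assume that there exist finite positive constants $\underline{\delta} \le 1$ and $\overline{\delta} \ge 1$ such that
\[ \underline{\delta}\, T_{\hat\mu_i}^{K} \hat{V}_{i-1}^{\hat\mu} \le \hat{V}_i^{\hat\mu} \le \overline{\delta}\, T_{\hat\mu_i}^{K} \hat{V}_{i-1}^{\hat\mu}, \quad \forall i = 1, 2, \ldots, \tag{6.2.38} \]
hold uniformly. Considering Lemma 6.2.2, we have
\[ \hat{V}_i^{\hat\mu} \le \overline{\delta}\, T_{\hat\mu_i}^{K} \hat{V}_{i-1}^{\hat\mu} \le \overline{\delta}\, T_{\hat\mu_i} \hat{V}_{i-1}^{\hat\mu}. \tag{6.2.39} \]
Similarly, we assume that there exist finite positive constants $\underline{\zeta} \le 1$ and $\overline{\zeta} \ge 1$ such that
\[ \underline{\zeta}\, T \hat{V}_{i-1}^{\hat\mu} \le T_{\hat\mu_i} \hat{V}_{i-1}^{\hat\mu} \le \overline{\zeta}\, T \hat{V}_{i-1}^{\hat\mu}, \quad \forall i = 1, 2, \ldots, \tag{6.2.40} \]
hold uniformly. Combining (6.2.39) and (6.2.40), we can get
\[ \hat{V}_i^{\hat\mu} \le \overline{\zeta}\, \overline{\delta}\, T \hat{V}_{i-1}^{\hat\mu}. \]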

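On the other hand, considering (6.2.38), (6.2.40), and Assumption 6.2.1, we get
\[ \hat{V}_i^{\hat\mu} \ge \underline{\delta}\, T_{\hat\mu_i}^{K} \hat{V}_{i-1}^{\hat\mu} \ge \underline{\delta}\, T \big( T_{\hat\mu_i}^{K-1} \hat{V}_{i-1}^{\hat\mu} \big) \ge \cdots \ge \underline{\delta}\, T^{K-1} \big( T_{\hat\mu_i} \hat{V}_{i-1}^{\hat\mu} \big) \ge \underline{\zeta}\, \underline{\delta}\, T^{K} \hat{V}_{i-1}^{\hat\mu}. \]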

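Therefore, the whole approximation errors in the value function and control policy update equations can be expressed by
\[ \underline{\zeta}\, \underline{\delta}\, T^{K} \hat{V}_{i-1}^{\hat\mu} \le \hat{V}_i^{\hat\mu} \le \overline{\zeta}\, \overline{\delta}\, T \hat{V}_{i-1}^{\hat\mu}. \tag{6.2.41} \]
Using the notation in (6.2.13), (6.2.41) can be written as
\[ \underline{\eta}\, T^{K} \hat{V}_{i-1}^{\hat\mu} \le \hat{V}_i^{\hat\mu} \le \overline{\eta}\, T \hat{V}_{i-1}^{\hat\mu}. \tag{6.2.42} \]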

Next, we can establish the error bounds for the approximate optimistic policy
iteration by the following theorem.
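Theorem 6.2.6 Let Assumptions 6.2.1 and 6.2.2 hold. Suppose that $V^* \le V_0 \le \beta V^*$, $1 \le \beta < \infty$, and that $V_0 \ge T V_0$. The approximate value function $\hat{V}_i^{\hat\mu}$ satisfies the iterative error condition (6.2.42). Then, the value function sequence $\{\hat{V}_i^{\hat\mu}\}$ approaches $V^*$ according to the following inequalities
\[
\begin{aligned}
\underline{\eta} \left[ 1 - \sum_{j=1}^{i} \frac{\lambda^j \underline{\eta}^{\,j-1} (1 - \underline{\eta})}{(\lambda + 1)^j} \right] V^*
&\le \hat{V}_{i+1}^{\hat\mu} \\
&\le \overline{\eta} \left[ 1 + \sum_{j=1}^{i} \frac{\lambda^j \overline{\eta}^{\,j-1} (\overline{\eta} - 1)}{(\lambda + 1)^j} + \frac{\lambda^i \overline{\eta}^{\,i} (\beta - 1)}{(\lambda + 1)^i} \right] V^*, \quad i \ge 0.
\end{aligned} \tag{6.2.43}
\]
Moreover, the approximate value function sequence $\{\hat{V}_i^{\hat\mu}\}$ converges to a finite neighborhood of $V^*$ uniformly on Ω as $i \to \infty$, i.e.,
\[ \frac{\underline{\eta}}{1 + \lambda - \lambda \underline{\eta}} V^* \le \lim_{i \to \infty} \hat{V}_i^{\hat\mu} \le \frac{\overline{\eta}}{1 + \lambda - \lambda \overline{\eta}} V^*, \tag{6.2.44} \]
under the condition $\overline{\eta} < \dfrac{1}{\lambda} + 1$.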
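Proof First, we prove the lower bound in (6.2.43). According to (6.2.42) and Assumption 6.2.1, we have
\[ \hat{V}_1^{\hat\mu} \ge \underline{\eta}\, T^{K} V_0 \ge \underline{\eta}\, T^{K} V^* = \underline{\eta} V^*, \]
i.e., it holds for $i = 0$. Similarly, we can get
\[
\begin{aligned}
\hat{V}_2^{\hat\mu} &\ge \underline{\eta}\, T^{K} \hat{V}_1^{\hat\mu} \\
&= \underline{\eta} \min_{u} \big\{ U(x, u) + T^{K-1} \hat{V}_1^{\hat\mu}\big(F(x, u)\big) \big\} \\
&\ge \underline{\eta} \min_{u} \big\{ U(x, u) + \underline{\eta}\, T^{K-1} V^*\big(F(x, u)\big) \big\} \\
&= \underline{\eta} \min_{u} \big\{ U(x, u) + \underline{\eta}\, V^*\big(F(x, u)\big) \big\} \\
&\ge \underline{\eta} \min_{u} \left\{ \left( 1 - \lambda \frac{1 - \underline{\eta}}{\lambda + 1} \right) U(x, u) + \left( \underline{\eta} + \frac{1 - \underline{\eta}}{\lambda + 1} \right) V^*\big(F(x, u)\big) \right\} \\
&= \underline{\eta} \left[ 1 - \frac{\lambda (1 - \underline{\eta})}{\lambda + 1} \right] V^*,
\end{aligned}
\]
i.e., it holds for $i = 1$. Therefore, we can obtain the lower bound by continuing the above process. Similar to Theorem 6.2.2, we can obtain the upper bound in (6.2.43) and the conclusion in (6.2.44). The proof is complete.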

6.2.5 Neural Network Implementation

We have just proven that the approximate value iteration, approximate policy itera-
tion, and approximate optimistic policy iteration algorithms can converge to a finite
neighborhood of the optimal value function associated with the Bellman equation.
It should be mentioned that we consider approximation errors in both value function
and control policy update equations at each iteration. This makes it feasible to use
neural network approximation for solving undiscounted optimal control problems of
nonlinear systems. Since the optimistic policy iteration contains the value iteration
and policy iteration, we only present a detailed implementation of the approximate
optimistic policy iteration using neural networks in this section. The neural net-
work implementation of approximate value iteration has been discussed in previous
chapters [15, 17].
The whole structural diagram of the approximate optimistic policy iteration is
shown in Fig. 6.1 (cf. (6.2.34)), where two multilayer feedforward neural networks
are used. The critic neural network is used to approximate the value function, and
the action neural network is used to approximate the control policy.
Fig. 6.1 Structural diagram of approximate optimistic policy iteration (action network, system, and critic networks connected by signal lines, back-propagating paths, and weight transmission)

A neural network can be used to approximate some smooth function on a prescribed compact set. The value function $V_i^{\mu(j)}(x_k)$ in (6.2.34) is approximated by the critic neural network
\[ \hat{V}_i^{\hat\mu(j)}(x_k) = \big(W_{c(i)}^{j}\big)^{T} \phi\Big( \big(Y_{c(i)}^{j}\big)^{T} x_k \Big), \]
where the activation functions are selected as tanh(·). The target function of the critic neural network training is given by
\[ V_i^{\hat\mu(j)}(x_k) = U\big(x_k, \hat\mu_{i-1}(x_k)\big) + \hat{V}_i^{\hat\mu(j-1)}(x_{k+1}), \]
where
\[ x_{k+1} = F\big(x_k, \hat\mu_{i-1}(x_k)\big). \]
Then, the error function for training critic neural network is defined by
\[ e_{c(i)}^{j}(x_k) = V_i^{\hat\mu(j)}(x_k) - \hat{V}_i^{\hat\mu(j)}(x_k), \]
and the performance function to be minimized is defined by
\[ E_{c(i)}^{j}(x_k) = \frac{1}{2} \big( e_{c(i)}^{j}(x_k) \big)^2. \tag{6.2.45} \]
The control policy $\mu_i(x_k)$ in (6.2.35) is approximated by the action neural network
\[ \hat\mu_i(x_k) = W_{a(i)}^{T} \phi\big( Y_{a(i)}^{T} x_k \big). \]
The target function of the action neural network training is defined by
\[ d_i(x_k) = \arg\min_{u_k} \big\{ U(x_k, u_k) + \hat{V}_i^{\hat\mu}(x_{k+1}) \big\}. \]
Then, the error function for training the action neural network is given by
\[ e_{a(i)}(x_k) = d_i(x_k) - \hat\mu_i(x_k). \]
The weights of the action neural network are updated to minimize the following performance function
\[ E_{a(i)}(x_k) = \frac{1}{2} e_{a(i)}^{T}(x_k)\, e_{a(i)}(x_k). \tag{6.2.46} \]
We use the gradient descent method to tune the weights of neural networks on a
training set constructed from the compact set Ω. The details of this tuning method
can be found in [15]. Some other tuning methods can also be used, such as Newton’s
method and the Levenberg–Marquardt method, in order to increase the convergence
rate of neural network training.
A detailed process of the approximate optimistic policy iteration is given in
Algorithm 6.2.1, where the approximate value iteration can be regarded as a spe-
cial case. If we have an initial stabilizing control policy, the algorithm can iterate
between Steps 4 and 5 directly. It should be mentioned that Algorithm 6.2.1 runs in
an off-line manner. Note that it can also be implemented online but a persistence of
excitation condition is usually required.
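A single gradient-descent update of the critic weights for the performance function (6.2.45) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the tuning procedure of [15]: the function names, the learning rate `lr`, the representation of the hidden-layer weights as a list of per-unit weight vectors, and the sample target value are all hypothetical.

```python
import math
import random


def critic_forward(W, Y, x):
    """V_hat(x) = W^T tanh(Y^T x); Y holds one input-weight vector per hidden unit."""
    phi = [math.tanh(sum(y_i * x_i for y_i, x_i in zip(y, x))) for y in Y]
    return sum(w * p for w, p in zip(W, phi)), phi


def critic_gradient_step(W, Y, x, target, lr):
    """One gradient-descent step on E = 0.5 * (target - V_hat(x))**2."""
    v_hat, phi = critic_forward(W, Y, x)
    e = target - v_hat
    # dE/dW_h = -e * phi_h and dE/dY_h = -e * W_h * (1 - phi_h**2) * x
    new_W = [w + lr * e * p for w, p in zip(W, phi)]
    new_Y = [
        [y_i + lr * e * w * (1.0 - p * p) * x_i for y_i, x_i in zip(y, x)]
        for y, w, p in zip(Y, W, phi)
    ]
    return new_W, new_Y


if __name__ == "__main__":
    random.seed(0)
    # Eight hidden units, two inputs, weights drawn from [-0.1, 0.1].
    W = [random.uniform(-0.1, 0.1) for _ in range(8)]
    Y = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(8)]
    x, target = [1.0, -1.0], 1.0  # hypothetical sample and target value
    for step in range(5):
        v_hat, _ = critic_forward(W, Y, x)
        print(step, 0.5 * (target - v_hat) ** 2)
        W, Y = critic_gradient_step(W, Y, x, target, lr=0.1)
```

In a full implementation the same update would be averaged over the whole training set, and the action network would be tuned analogously with (6.2.46).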

Algorithm 6.2.1 Approximate optimistic policy iteration


Step 1. Initialization:
Initialize critic and action neural networks.
Select an initial value function V0 satisfying V0 ≥ T V0 .
Set policy evaluation steps K and maximum number of iteration steps i max .
Step 2. Update the control policy μ̂0 (xk ) by minimizing (6.2.46) on a training set {xk } randomly
selected from the compact set Ω.
Step 3. Set i = 1.
Step 4. For $j = 0, 1, \ldots, K - 1$, update the value function $\hat{V}_i^{\hat\mu(j)}(x_k)$ by minimizing (6.2.45) on a training set $\{x_k\}$ randomly selected from the compact set Ω. After convergence, set $\hat{V}_i^{\hat\mu}(x_k) = \hat{V}_i^{\hat\mu(K)}(x_k)$.
Step 5. Update the control policy μ̂i (xk ) by minimizing (6.2.46) on a training set {xk } randomly
selected from the compact set Ω.
Step 6. Set i ← i + 1.
Step 7. Repeat Steps 4–6 until the convergence conditions are met or i > i max .
Step 8. Obtain the approximate optimal control policy μ̂i (xk ) or the optimal control policy is not
obtained if i > i max .

6.2.6 Simulation Study

In this section, we provide a simulation example to demonstrate the effectiveness of the present algorithms. Consider the following discrete-time nonlinear system $x_{k+1} = h(x_k) + g(x_k) u_k$, where
\[
h(x_k) = \begin{bmatrix} 0.9 x_{1k} + 0.1 x_{2k} \\ -0.05 \Big( x_{1k} + x_{2k} \big( 1 - (\cos(2 x_{1k}) + 2)^2 \big) \Big) + x_{2k} \end{bmatrix}, \quad
g(x_k) = \begin{bmatrix} 0 \\ 0.1 \cos(2 x_{1k}) + 0.2 \end{bmatrix}, \tag{6.2.47}
\]
$x_k = [x_{1k}, x_{2k}]^T \in \mathbb{R}^2$, and $u_k \in \mathbb{R}$, $k = 0, 1, \ldots$. Define the cost function as
\[ J(x_0, u) = \sum_{k=0}^{\infty} \big( x_k^T Q x_k + u_k^T R u_k \big), \tag{6.2.48} \]
where $Q = [0.1, 0; 0, 0.1]$ and $R = 0.1$.


To implement the present algorithms, we choose three-layer feedforward neural
networks as function approximation structures. The structures of the critic and action
neural networks are both chosen as 2–8–1. The initial weights of the critic and
action neural networks are chosen randomly in [−0.1, 0.1]. The maximum number
of iteration steps is selected as i max = 20. The compact set Ω or the operation region
of the system is selected as −1 ≤ x1 ≤ 1 and −1 ≤ x2 ≤ 1. The training set {xk }
is constructed by randomly choosing 1000 samples from the compact set Ω at each
iteration.
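The simulation setup can be written down in a few lines of Python. The sketch below assumes that the second component of $h$ in (6.2.47) reads $-0.05(x_1 + x_2(1 - (\cos 2x_1 + 2)^2)) + x_2$; the function names and the use of the standard `random` module for sampling Ω are illustrative choices, not part of the original study.

```python
import math
import random

Q_WEIGHT = 0.1  # Q = diag(0.1, 0.1) in (6.2.48)
R_WEIGHT = 0.1  # R = 0.1 in (6.2.48)


def h(x):
    """Drift term of (6.2.47), under the bracket structure assumed above."""
    x1, x2 = x
    return (
        0.9 * x1 + 0.1 * x2,
        -0.05 * (x1 + x2 * (1.0 - (math.cos(2.0 * x1) + 2.0) ** 2)) + x2,
    )


def g(x):
    """Input gain of (6.2.47)."""
    return (0.0, 0.1 * math.cos(2.0 * x[0]) + 0.2)


def step(x, u):
    """One step of x_{k+1} = h(x_k) + g(x_k) u_k for a scalar control u."""
    hx, gx = h(x), g(x)
    return (hx[0] + gx[0] * u, hx[1] + gx[1] * u)


def utility(x, u):
    """Stage cost x^T Q x + u^T R u."""
    return Q_WEIGHT * (x[0] ** 2 + x[1] ** 2) + R_WEIGHT * u ** 2


def sample_training_set(n, rng):
    """Draw n states uniformly from the compact set [-1, 1] x [-1, 1]."""
    return [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    training_set = sample_training_set(1000, rng)
    x0 = (1.0, -1.0)
    print(step(x0, 0.0), utility(x0, 0.0), len(training_set))
```

A fresh call to `sample_training_set` at each iteration matches the description of redrawing 1000 samples from Ω per iteration.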
The initial value function is selected as $V_0 = 2 x_{1k}^2 + 2 x_{2k}^2$. According to Fig. 6.2, it can be seen that $V_0 \ge V_1$ holds for all states in the compact set Ω.
We implement Algorithm 6.2.1 by letting $K = 1$, $K = 3$, and $K = 10$, respectively. For the initial state $x_0 = [1, -1]^T$, the convergence curve of the value function sequence $\{\hat{V}_1^{\hat\mu(j)}\}$ is shown in Fig. 6.3. We can see that $\{\hat{V}_1^{\hat\mu(j)}\}$ is a monotonically nonincreasing sequence, and it is convergent at $j = 10$. Thus, the algorithm for $K = 10$ can be regarded as the approximate policy iteration. The algorithms for $K = 1$ and $K = 3$ are the approximate value iteration and approximate optimistic policy iteration, respectively. After implementing the algorithms for $i_{\max} = 20$, the convergence curves of the value functions $\hat{V}_i^{\hat\mu}$ at $x_0$ are shown in Fig. 6.4. It can be seen that all the value functions are basically convergent with the iteration index $i > 10$, and the obtained approximate optimal value functions at $i = 20$ are quite close. Although there exist approximation errors in both value function and control policy update steps, the approximate value function can converge to a finite neighborhood of the optimal value function.
Finally, we apply the approximate optimal control policies obtained above to the system (6.2.47) for 60 time steps. The corresponding state trajectories are displayed in Fig. 6.5, and the control inputs are displayed in Fig. 6.6. Examining the results, it is observed that all the control policies obtain very good performance, and the differences between the three trajectories are quite small.

Fig. 6.2 3-D plot of $V_0 - V_1$ in the compact set Ω (axes: $x_1$, $x_2$, $V_0 - V_1$)
Fig. 6.3 The convergence curve of the value function $\hat{V}_1^{\hat\mu(j)}$ at $x_0$ (value function versus iteration index $j$)
Fig. 6.4 The convergence curves of the value function $\hat{V}_i^{\hat\mu}$ at $x_0$ when $K = 1$, $K = 3$, and $K = 10$ (value function versus iteration index $i$)
Fig. 6.5 The state trajectories ($x_2$ versus $x_1$ for $K = 1$, $K = 3$, and $K = 10$)
Fig. 6.6 The control inputs (control input versus time step $k$ for $K = 1$, $K = 3$, and $K = 10$)

6.3 Error Bounds of Q-Function for Discounted Optimal Control Problems

6.3.1 Problem Formulation

In this section, we consider the following deterministic discrete-time nonlinear dynamical systems
\[ x_{k+1} = F(x_k, u_k), \quad k = 0, 1, 2, \ldots, \tag{6.3.1} \]
where $x_k \in \mathbb{R}^n$ is the state vector, and $u_k \in \mathbb{R}^m$ is the control input. The system (6.3.1) is assumed to be controllable which implies that there exists a continuous control policy on a compact set $\Omega \subseteq \mathbb{R}^n$ that stabilizes the system asymptotically. We assume that $x_k = 0$ is an equilibrium state of the system (6.3.1) and $F(0, 0) = 0$. The system function $F(x_k, u_k)$ is Lipschitz continuous with respect to $x_k$ and $u_k$. The infinite-horizon cost function with discount factor for any initial state $x_0$ is given by
\[ J(x_0, \underline{u}_0) = \sum_{k=0}^{\infty} \gamma^k U(x_k, u_k), \tag{6.3.2} \]

where $\underline{u}_0$ is the control sequence defined as $\underline{u}_0 = (u_0, u_1, u_2, \ldots)$, $U$ is the utility function which is continuous and positive definite, and γ is the discount factor such that $0 < \gamma \le 1$. For any feedback control policy $\mu(\cdot)$, the value function $V^{\mu}(x_k)$, i.e., the mapping from any state $x_k$ to the value of (6.3.2), is defined as
\[ V^{\mu}(x_k) = J(x_k, \mu(x_k)) = J(x_k, \underline{u}_k), \tag{6.3.3} \]
where $\underline{u}_k = (\mu(x_k), \mu(x_{k+1}), \ldots)$. A feedback control policy $u = \mu(x)$ is admissible with respect to system (6.3.1), if $\mu(0) = 0$, $\mu(x)$ is continuous and stabilizes the system, and the value function in (6.3.3) is finite.
Our goal is to find an admissible control which can minimize the value function such that
\[ V^*(x_k) = \inf_{\mu} \{ V^{\mu}(x_k) \}, \quad \forall x_k. \tag{6.3.4} \]

According to Bellman’s principle of optimality, the optimal value function satisfies the Bellman equation
\[ V^*(x_k) = \min_{u_k} \{ U(x_k, u_k) + \gamma V^*(F(x_k, u_k)) \}, \tag{6.3.5} \]
and the optimal control policy $\mu^*(\cdot)$ is given by
\[ \mu^*(x_k) = u_k^* = \arg\min_{u_k} \{ U(x_k, u_k) + \gamma V^*(F(x_k, u_k)) \}. \tag{6.3.6} \]

In the case of action-dependent ADP schemes, which will be studied in this section, Q-function (also known as the state-action value function) will be used and it is defined as
\[ Q^{\mu}(x, u) = U(x, u) + \gamma Q^{\mu}(x^+, \mu(x^+)), \tag{6.3.7} \]
where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$, and $x^+$ is the state of the next moment, i.e., $x^+ = F(x, u)$. According to (6.3.3), the relationship between value function and Q-function is
\[
\begin{aligned}
Q^{\mu}(x, \mu(x)) &= U(x, \mu(x)) + \gamma Q^{\mu}(x^+, \mu(x^+)) \\
&= J(x, \mu(x)) \\
&= V^{\mu}(x).
\end{aligned}
\]
Then, the optimal Q-function is defined as
\[ Q^*(x, u) = \inf_{\mu} Q^{\mu}(x, u), \]
and the optimal control policy $\mu^*(\cdot)$ can be obtained by
\[ \mu^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k), \quad \forall x_k. \tag{6.3.8} \]
The optimal Q-function satisfies the following Bellman optimality equation
\[ Q^*(x_k, u_k) = U(x_k, u_k) + \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}), \tag{6.3.9} \]
where $u_{k+1}$ is the action of the next moment. The connection between the optimal value function and the optimal Q-function is
\[ V^*(x) = \min_{u} Q^*(x, u). \]

In most situations, the optimal control problem for nonlinear systems has no
analytical solution and the traditional dynamic programming faces the “curse of
dimensionality.” In this section, we develop an iterative ADP algorithm by using
Q-function which depends on the state and action to solve the nonlinear optimal
control problem. Similar to most ADP methods, function approximation structures
such as neural networks are used to approximate the Q-function (or state-action
value function) and the control policy. The approximation errors may increase along
with the iteration processes. Therefore, it is necessary to establish the error bounds of

Q-function for the present iteration algorithm considering the function approximation
errors.

6.3.2 Policy Iteration Under Ideal Conditions

We assume that the nonlinear dynamical system (6.3.1) is unknown, and only an off-line data set $\{x_k, u_k, x_{k+1}\}_N$ is available, where $x_{k+1}$ is the next state given $x_k$ and $u_k$, and $N$ is the number of samples in the data set. Here, $x_k$ and $x_{k+1}$ describe the one-step dynamic behavior of a single sample and do not necessarily mean that the data set has to take samples from one trajectory. In general, the data set contains a variety of trajectories and scattered data.
The policy iteration of action-dependent ADP algorithms starts with an initial admissible control $\mu_0$. For $i = 0, 1, \ldots$, the policy iteration algorithm contains a policy evaluation phase and a policy improvement phase given as follows.
Policy evaluation:
\[ Q_{j+1}^{\mu_i}(x_k, u_k) = U(x_k, u_k) + \gamma Q_j^{\mu_i}(x_{k+1}, \mu_i(x_{k+1})) \tag{6.3.10} \]
Policy improvement:
\[ \mu_{i+1}(x_k) = \arg\min_{u_k} Q^{\mu_i}(x_k, u_k) \tag{6.3.11} \]
where $j$ is the policy evaluation index and $i$ is the policy improvement index. $Q_j^{\mu_i}$ represents the $j$th evaluation for the $i$th control policy $\mu_i$, and
\[ Q_0^{\mu_i} = Q_{\infty}^{\mu_{i-1}}. \]
Let $Q^{\mu_i}$ denote the Q-function for $\mu_i$. Next, we will prove that the limit of $Q_j^{\mu_i}$ as $j \to \infty$ exists and $Q_{\infty}^{\mu_i} = Q^{\mu_i}$.
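The two phases (6.3.10) and (6.3.11) can be sketched in tabular form. The example below is a hypothetical toy problem rather than the book's setting: the "data set" is simply every state-action pair of a six-state chain, the discount factor is set to 0.9, the infinite evaluation limit is replaced by a fixed number of sweeps `n_eval`, and the initial Q-function of the admissible policy $\mu_0$ is computed by running the evaluation recursion from zero.

```python
GAMMA = 0.9          # illustrative discount factor
STATES = range(6)    # hypothetical chain: state 0 is the equilibrium
ACTIONS = (1, 2)     # control = step size toward 0


def F(x, u):
    """Next state x_{k+1} = F(x_k, u_k)."""
    return max(x - u, 0)


def U(x, u):
    """Utility: zero at the equilibrium, else 1.0 for step 1 and 1.5 for step 2."""
    if x == 0:
        return 0.0
    return 1.0 if u == 1 else 1.5


def evaluate(Q, mu, n_eval):
    """Policy evaluation (6.3.10), truncated after n_eval sweeps."""
    for _ in range(n_eval):
        Q = {
            (x, u): U(x, u) + GAMMA * Q[(F(x, u), mu[F(x, u)])]
            for x in STATES
            for u in ACTIONS
        }
    return Q


def improve(Q):
    """Policy improvement (6.3.11): greedy control with respect to Q."""
    return {x: min(ACTIONS, key=lambda u: Q[(x, u)]) for x in STATES}


def q_policy_iteration(mu0, n_iter, n_eval=200):
    """Alternate improvement and evaluation, starting from the Q-function of mu0."""
    zero = {(x, u): 0.0 for x in STATES for u in ACTIONS}
    Q = evaluate(zero, mu0, n_eval)   # approximates Q^{mu_0}
    mu = dict(mu0)
    for _ in range(n_iter):
        mu = improve(Q)               # mu_{i+1} from Q^{mu_i}
        Q = evaluate(Q, mu, n_eval)   # warm start: Q_0^{mu_{i+1}} = Q^{mu_i}
    return Q, mu


if __name__ == "__main__":
    mu0 = {x: 1 for x in STATES}      # always step by 1: reaches 0, so admissible
    Q, mu = q_policy_iteration(mu0, n_iter=20)
    print({x: round(min(Q[(x, u)] for u in ACTIONS), 4) for x in STATES})
    print(mu)
```

Because each evaluation starts from the previous policy's Q-function, the iterates are nonincreasing in this toy problem, which is the behavior established in the lemma that follows.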

Assumption 6.3.1 There exists a finite positive constant λ such that the condition
\[ \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \le \lambda U(x_k, u_k) \]
holds uniformly on Ω for some $0 < \lambda < \infty$.

Remark 6.3.1 Assumption 6.3.1 is a basic assumption which ensures the conver-
gence of ADP algorithms. For most nonlinear systems, it is easy to find a sufficiently
large number λ to satisfy this assumption as Q ∗ (·) and U (·) are finite and positive.

Lemma 6.3.1 Let Assumption 6.3.1 hold. Suppose that $\mu_0$ is an admissible control policy and $Q^{\mu_0}$ is the Q-function of $\mu_0$. Let $Q_j^{\mu_i}$ and $\mu_i$ be updated by (6.3.10) and (6.3.11). Then we can obtain the following conclusions:
(i) the sequence $\{Q_j^{\mu_i}\}$ is monotonically nonincreasing, i.e., $Q_j^{\mu_i} \ge Q_{j+1}^{\mu_i}$, $\forall i \ge 1$. Moreover, as $j \to \infty$, the limit of $Q_j^{\mu_i}$ exists and is denoted by $Q_{\infty}^{\mu_i}$, and equals to $Q^{\mu_i}$, $\forall i \ge 1$.
(ii) the sequence $\{Q^{\mu_i}\}$ is monotonically nonincreasing, i.e., $Q^{\mu_i} \ge Q^{\mu_{i+1}}$, $\forall i \ge 1$.

Proof (i) $\mu_0$ is an admissible control policy. According to (6.3.10) and (6.3.11), we can obtain
\[
\begin{aligned}
Q_0^{\mu_1}(x_k, u_k) &= Q^{\mu_0}(x_k, u_k) \\
&= U(x_k, u_k) + \gamma Q^{\mu_0}(x_{k+1}, \mu_0(x_{k+1})) \\
&\ge U(x_k, u_k) + \gamma \min_{u_{k+1}} Q^{\mu_0}(x_{k+1}, u_{k+1}) \\
&= U(x_k, u_k) + \gamma Q^{\mu_0}(x_{k+1}, \mu_1(x_{k+1})) \\
&= Q_1^{\mu_1}(x_k, u_k).
\end{aligned} \tag{6.3.12}
\]
So $Q_j^{\mu_1} \ge Q_{j+1}^{\mu_1}$ holds for $j = 0$. If $Q_j^{\mu_1} \ge Q_{j+1}^{\mu_1}$ holds for $j = m - 1$, when $j = m$, we obtain
\[
\begin{aligned}
Q_m^{\mu_1}(x_k, u_k) &= U(x_k, u_k) + \gamma Q_{m-1}^{\mu_1}(x_{k+1}, \mu_1(x_{k+1})) \\
&\ge U(x_k, u_k) + \gamma Q_m^{\mu_1}(x_{k+1}, \mu_1(x_{k+1})) \\
&= Q_{m+1}^{\mu_1}(x_k, u_k).
\end{aligned} \tag{6.3.13}
\]
According to the mathematical induction, we can obtain that $Q_j^{\mu_i} \ge Q_{j+1}^{\mu_i}$ holds for $i = 1$. Since $\{Q_j^{\mu_1}\}$ is a monotonically nonincreasing sequence and $Q_j^{\mu_1} \ge 0$, the limit of $\{Q_j^{\mu_1}\}$ exists, which is denoted by $Q_{\infty}^{\mu_1}$, and $Q_j^{\mu_1} \ge Q_{\infty}^{\mu_1}$, $\forall j$. According to (6.3.10), we have
\[ Q_{j+1}^{\mu_1}(x_k, u_k) \ge U(x_k, u_k) + \gamma Q_{\infty}^{\mu_1}(x_{k+1}, \mu_1(x_{k+1})), \quad j \ge 0. \tag{6.3.14} \]
Letting $j \to \infty$, we have $Q_{\infty}^{\mu_1}(x_k, u_k) \ge U(x_k, u_k) + \gamma Q_{\infty}^{\mu_1}(x_{k+1}, \mu_1(x_{k+1}))$. Similarly, we can obtain
\[ Q_{\infty}^{\mu_1}(x_k, u_k) \le U(x_k, u_k) + \gamma Q_j^{\mu_1}(x_{k+1}, \mu_1(x_{k+1})), \quad j \ge 0. \tag{6.3.15} \]
Letting $j \to \infty$, we have $Q_{\infty}^{\mu_1}(x_k, u_k) \le U(x_k, u_k) + \gamma Q_{\infty}^{\mu_1}(x_{k+1}, \mu_1(x_{k+1}))$. Hence, we can obtain
\[ Q_{\infty}^{\mu_1}(x_k, u_k) = U(x_k, u_k) + \gamma Q_{\infty}^{\mu_1}(x_{k+1}, \mu_1(x_{k+1})). \tag{6.3.16} \]


Thus, we can obtain that $Q_{\infty}^{\mu_i} = Q^{\mu_i}$ holds for $i = 1$. We assume that $Q_j^{\mu_i} \ge Q_{j+1}^{\mu_i}$ and $Q_{\infty}^{\mu_i} = Q^{\mu_i}$ hold for any $i = l$, $l \ge 1$. According to (6.3.10) and (6.3.11), we can obtain
\[
\begin{aligned}
Q_0^{\mu_{l+1}}(x_k, u_k) &= Q^{\mu_l}(x_k, u_k) \\
&= U(x_k, u_k) + \gamma Q^{\mu_l}(x_{k+1}, \mu_l(x_{k+1})) \\
&\ge U(x_k, u_k) + \gamma \min_{u_{k+1}} Q^{\mu_l}(x_{k+1}, u_{k+1}) \\
&= U(x_k, u_k) + \gamma Q^{\mu_l}(x_{k+1}, \mu_{l+1}(x_{k+1})) \\
&= Q_1^{\mu_{l+1}}(x_k, u_k).
\end{aligned} \tag{6.3.17}
\]
Considering (6.3.10) and (6.3.17), we have
\[
\begin{aligned}
Q_1^{\mu_{l+1}}(x_k, u_k) &= U(x_k, u_k) + \gamma Q_0^{\mu_{l+1}}(x_{k+1}, \mu_{l+1}(x_{k+1})) \\
&\ge U(x_k, u_k) + \gamma Q_1^{\mu_{l+1}}(x_{k+1}, \mu_{l+1}(x_{k+1})) \\
&= Q_2^{\mu_{l+1}}(x_k, u_k).
\end{aligned} \tag{6.3.18}
\]
Similarly, we can obtain that $Q_j^{\mu_i} \ge Q_{j+1}^{\mu_i}$ holds for $i = l + 1$ by induction and $Q_{\infty}^{\mu_i} = Q^{\mu_i}$. Therefore, this completes the proof by induction.
(ii) According to (i), we have
\[ Q^{\mu_i} = Q_0^{\mu_{i+1}} \ge Q_{\infty}^{\mu_{i+1}} = Q^{\mu_{i+1}}. \tag{6.3.19} \]
Therefore, $\{Q^{\mu_i}\}$ is a monotonically nonincreasing sequence and the proof is complete.

Theorem 6.3.1 Let Assumption 6.3.1 hold. Suppose that $Q^* \le Q^{\mu_0} \le \beta Q^*$, $1 \le \beta < \infty$, where $\mu_0$ is an admissible control policy and $Q^{\mu_0}$ is the Q-function of $\mu_0$. Let $Q_j^{\mu_i}$ and $\mu_i$ be updated by (6.3.10) and (6.3.11). Then the Q-function sequence $\{Q^{\mu_i}\}$ approaches $Q^*$ according to the inequalities
\[ Q^* \le Q^{\mu_i} \le \left[ 1 + \frac{\beta - 1}{(1 + \lambda^{-1})^i} \right] Q^*, \quad \forall i \ge 1. \tag{6.3.20} \]

Proof According to the definitions of $Q^*$ and $Q^{\mu_i}$, the left-hand side of the inequality (6.3.20) always holds for any $i \ge 1$. Next, we prove the right-hand side of (6.3.20) by induction. According to Assumption 6.3.1, we have
\[ \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \le \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \le \lambda U(x_k, u_k). \tag{6.3.21} \]
Considering (6.3.10), (6.3.11), and (6.3.21), we can obtain
\[
\begin{aligned}
Q_1^{\mu_1}(x_k, u_k) &= U(x_k, u_k) + \gamma \min_{u_{k+1}} Q_0^{\mu_1}(x_{k+1}, u_{k+1}) \\
&= U(x_k, u_k) + \gamma \min_{u_{k+1}} Q^{\mu_0}(x_{k+1}, u_{k+1}) \\
&\le U(x_k, u_k) + \gamma \beta \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \\
&\le U(x_k, u_k) + \gamma \beta \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) + \frac{\beta - 1}{\lambda + 1} \Big[ \lambda U(x_k, u_k) - \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \Big] \\
&= \left( 1 + \lambda \frac{\beta - 1}{\lambda + 1} \right) U(x_k, u_k) + \left( \beta - \frac{\beta - 1}{\lambda + 1} \right) \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \\
&= \left[ 1 + \frac{\beta - 1}{1 + \lambda^{-1}} \right] Q^*(x_k, u_k).
\end{aligned} \tag{6.3.22}
\]

According to Lemma 6.3.1, we have
\[ Q^{\mu_1} = Q_{\infty}^{\mu_1} \le Q_1^{\mu_1}(x_k, u_k) \le \left[ 1 + \frac{\beta - 1}{1 + \lambda^{-1}} \right] Q^*(x_k, u_k), \tag{6.3.23} \]
which shows that the right-hand side of (6.3.20) holds for $i = 1$. Assume that
\[ Q^{\mu_i}(x_k, u_k) \le \left[ 1 + \frac{\beta - 1}{(1 + \lambda^{-1})^i} \right] Q^*(x_k, u_k) \]
holds for any $i = l$, $l \ge 1$, then we can obtain
\[
\begin{aligned}
Q^{\mu_{l+1}}(x_k, u_k) &\le Q_1^{\mu_{l+1}}(x_k, u_k) \\
&= U(x_k, u_k) + \gamma \min_{u_{k+1}} Q^{\mu_l}(x_{k+1}, u_{k+1}) \\
&\le U(x_k, u_k) + \gamma \left[ 1 + \frac{\beta - 1}{(1 + \lambda^{-1})^l} \right] \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \\
&\quad + \frac{\beta - 1}{(\lambda + 1)(1 + \lambda^{-1})^{l}} \Big[ \lambda U(x_k, u_k) - \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \Big] \\
&\le \left[ 1 + \frac{\beta - 1}{(1 + \lambda^{-1})^{l+1}} \right] Q^*(x_k, u_k),
\end{aligned} \tag{6.3.24}
\]
which shows that the right-hand side of (6.3.20) holds for $i = l + 1$. According to the mathematical induction, the right-hand side of (6.3.20) holds. The proof is complete.
In policy iteration, an initial admissible control policy is required, which is usually obtained by experience or trial. However, for most nonlinear systems, it is hard to obtain an admissible control policy, especially in action-dependent ADP for unknown systems. So we present a novel initial condition for policy iteration.
Lemma 6.3.2 Let Assumption 6.3.1 hold. Suppose that there is a positive-definite function $Q_0$ satisfying $\gamma Q_0 \ge Q_1^{\mu_1}$ for any $x_k$, $u_k$. Let $Q_j^{\mu_1}$ and $\mu_1$ be obtained by (6.3.10) and (6.3.11). Then $\mu_1(x)$ is an admissible control policy and $Q^{\mu_1} = Q_{\infty}^{\mu_1}$ is the Q-function of $\mu_1(x)$.

Proof Considering (6.3.10) and (6.3.11), we have
\[ \mu_1(x_k) = \arg\min_{u_k} Q_0(x_k, u_k), \tag{6.3.25} \]
and
\[ Q_1^{\mu_1}(x_k, u_k) = U(x_k, u_k) + \gamma Q_0(x_{k+1}, \mu_1(x_{k+1})). \tag{6.3.26} \]
Using the assumption about $Q_0$, we obtain
\[
\begin{aligned}
\gamma Q_0(x_k, \mu_1(x_k)) &\ge Q_1^{\mu_1}(x_k, \mu_1(x_k)) \\
&= U(x_k, \mu_1(x_k)) + \gamma Q_0(x_{k+1}, \mu_1(x_{k+1})).
\end{aligned} \tag{6.3.27}
\]
Considering $Q_0(x_k, \mu_1(x_k)) \ge 0$ and
\[ Q_0(x_{k+1}, \mu_1(x_{k+1})) - Q_0(x_k, \mu_1(x_k)) \le -(1/\gamma)\, U(x_k, \mu_1(x_k)), \tag{6.3.28} \]
we can conclude that the control policy $\mu_1$ asymptotically stabilizes the system (6.3.1). Then, similar to (6.3.12)–(6.3.16), we can obtain that $Q^{\mu_1} = Q_{\infty}^{\mu_1} \le Q_0$. Thus, the value function of $\mu_1$ satisfies
\[ V^{\mu_1}(x_k) = Q^{\mu_1}(x_k, \mu_1(x_k)) \le Q_0(x_k, \mu_1(x_k)). \tag{6.3.29} \]
Therefore, we can conclude that $\mu_1(x_k)$ is an admissible control. The proof is complete.

According to Lemma 6.3.2, we can obtain an admissible control by using an initial positive-definite function $Q_0$. Thus, considering Theorem 6.3.1, we obtain the following corollary.
Corollary 6.3.1 Let Assumption 6.3.1 hold. Suppose that there is an initial positive-definite function $Q_0$ such that $\gamma Q_0 \ge Q_1^{\mu_1}$ for any $x_k$, $u_k$. Let $Q_j^{\mu_1}$ and $\mu_1$ be updated by (6.3.10) and (6.3.11). Then the Q-function sequence $\{Q^{\mu_i}\}$ approaches $Q^*$ according to the following inequalities
\[ Q^* \le Q^{\mu_i} \le \left[ 1 + \frac{\beta - 1}{(1 + \lambda^{-1})^i} \right] Q^*, \quad \forall i \ge 1. \tag{6.3.30} \]

From Theorem 6.3.1 and Corollary 6.3.1, we can see that as i → ∞, Q μi con-
verges to Q ∗ under ideal conditions, i.e., the control policy and Q-function in each
iteration can be obtained accurately. They also give a convergence rate of Q μi with
policy iteration. When the discount factor γ = 1, the discounted optimal control
problem turns into an undiscounted optimal control problem, and Theorem 6.3.1 and
Corollary 6.3.1 still hold.
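
To get a feel for the convergence rate stated in (6.3.30), the following minimal sketch evaluates the factor 1 + (β − 1)/(1 + λ^{-1})^i for a few iteration indices. The function names and the values β = 3 and λ = 2 are hypothetical choices for illustration only; they do not come from the text.

```python
def exact_pi_factor(beta: float, lam: float, i: int) -> float:
    """Upper-bound factor 1 + (beta - 1) / (1 + 1/lam)**i from (6.3.30)."""
    return 1.0 + (beta - 1.0) / (1.0 + 1.0 / lam) ** i


def iterations_to_reach(beta: float, lam: float, tol: float) -> int:
    """Smallest i >= 1 such that the factor exceeds 1 by at most tol."""
    i = 1
    while exact_pi_factor(beta, lam, i) - 1.0 > tol:
        i += 1
    return i


if __name__ == "__main__":
    beta, lam = 3.0, 2.0  # hypothetical values, not taken from the text
    for i in (1, 5, 10, 20):
        print(i, exact_pi_factor(beta, lam, i))
    print("iterations for tol=1e-3:", iterations_to_reach(beta, lam, 1e-3))
```

The factor shrinks geometrically with ratio 1/(1 + λ^{-1}) = λ/(λ + 1), so a larger λ gives a ratio closer to one and hence slower convergence of the bound.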
However, in practice, considering that the iteration indices i and j cannot reach
infinity as the algorithm must stop in finite steps, there exist convergence errors in
the iteration process. In addition, the control policy and Q-function in each iteration
are obtained by approximation structures, so there exist approximate errors between
approximate and accurate values. Hence, Theorem 6.3.1 and Corollary 6.3.1 may
be invalid and the policy-iteration-based action-dependent ADP may even be diver-
gent. To overcome this difficulty, in the following section we establish new error
bound analysis results for Q-function considering the convergence and approxima-
tion errors.

6.3.3 Error Bound for Approximate Policy Iteration

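For the approximate policy iteration, function approximation structures are used to
approximate the Q-function and the control policy. The approximate expressions
of $\mu_i$ and $Q^{\mu_i}$ are $\hat\mu_i$ and $\hat Q^{\hat\mu_i}$, respectively. We assume that there exist two finite
positive constants $\underline\delta \le 1$ and $\overline\delta \ge 1$ such that

\[ \underline\delta\, Q^{\hat\mu_i} \le \hat Q^{\hat\mu_i} \le \overline\delta\, Q^{\hat\mu_i} \tag{6.3.31} \]

holds uniformly for any $i \ge 1$, where $Q^{\hat\mu_i}$ is the exact Q-function associated with
$\hat\mu_i$. The constants $\underline\delta$ and $\overline\delta$ capture the convergence error in the j-iteration and the
approximation error of $Q^{\hat\mu_i}$ in the policy evaluation phase. When $\underline\delta = \overline\delta = 1$, both errors
are zero. Considering Lemma 6.3.1, we obtain

\[ \hat Q^{\hat\mu_i} \le \overline\delta\, Q^{\hat\mu_i} \le \overline\delta\, \hat Q_1^{\hat\mu_i}, \tag{6.3.32} \]

where

\[ \hat Q_1^{\hat\mu_i}(x_k, u_k) = U(x_k, u_k) + \gamma \hat Q^{\hat\mu_{i-1}}(x_{k+1}, \hat\mu_i(x_{k+1})). \]

Similarly, we assume that there exist two finite positive constants $\underline\zeta \le 1$ and $\overline\zeta \ge 1$
such that

\[ \underline\zeta\, Q_1^{\hat\mu_i} \le \hat Q_1^{\hat\mu_i} \le \overline\zeta\, Q_1^{\hat\mu_i} \tag{6.3.33} \]

holds uniformly, $\forall i \ge 1$, where

\[ Q_1^{\hat\mu_i}(x_k, u_k) = U(x_k, u_k) + \gamma \hat Q^{\hat\mu_{i-1}}(x_{k+1}, \hat\mu_i(x_{k+1})). \]

The constants $\underline\zeta$ and $\overline\zeta$ capture the approximation errors of $\hat\mu_i$ in the policy improvement phase. If
the iterative control policy can be obtained accurately, then $\underline\zeta = \overline\zeta = 1$. Considering
(6.3.32) and (6.3.33), we can get

\[ \hat Q^{\hat\mu_i} \le \overline\zeta\,\overline\delta\, Q_1^{\hat\mu_i}. \tag{6.3.34} \]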

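On the other hand, considering (6.3.31) and (6.3.33), we have

\[ \hat Q^{\hat\mu_i} \ge \underline\delta\, Q^{\hat\mu_i} \ge \underline\delta\, Q^*. \tag{6.3.35} \]

Therefore, the approximation errors in the Q-function and control policy update step
can be expressed by

\[ \underline\eta\, Q^* \le \hat Q^{\hat\mu_i} \le \overline\eta\, Q_1^{\hat\mu_i}, \tag{6.3.36} \]

where $\underline\eta = \underline\delta$ and $\overline\eta = \overline\zeta\,\overline\delta$. We establish the error bounds for approximate policy
iteration by the following theorem.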
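Theorem 6.3.2 Let Assumption 6.3.1 hold. Suppose that $Q^* \le Q^{\mu_0} \le \beta Q^*$, $1 \le \beta \le \infty$,
where $\mu_0$ is an admissible control policy and $Q^{\mu_0}$ is the Q-function of $\mu_0$. The
approximate Q-function $\hat Q^{\hat\mu_i}$ satisfies the iterative error condition (6.3.36). Then, the
Q-function sequence $\{\hat Q^{\hat\mu_i}\}$ approaches $Q^*$ according to the following inequalities

\[ \underline\eta\, Q^* \le \hat Q^{\hat\mu_{i+1}} \le \overline\eta \left[ 1 + \sum_{j=1}^{i} \frac{\lambda^{j}\, \overline\eta^{\,j-1} (\overline\eta - 1)}{(\lambda + 1)^{j}} + \frac{\lambda^{i}\, \overline\eta^{\,i} (\beta - 1)}{(\lambda + 1)^{i}} \right] Q^*, \quad \forall i \ge 0. \tag{6.3.37} \]

Moreover, the approximate Q-function sequence $\{\hat Q^{\hat\mu_i}\}$ converges to a finite neigh-
borhood of $Q^*$ uniformly on Ω as $i \to \infty$, i.e.,

\[ \underline\eta\, Q^* \le \lim_{i \to \infty} \hat Q^{\hat\mu_i} \le \frac{\overline\eta}{1 + \lambda - \lambda \overline\eta}\, Q^*, \tag{6.3.38} \]

under the condition $\overline\eta \le \frac{1}{\lambda} + 1$.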
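Proof First, the left-hand side of (6.3.37) holds clearly according to (6.3.36). Next,
we prove the right-hand side of (6.3.37) for i ≥ 1. Considering (6.3.36), we obtain

\[
\begin{aligned}
\hat Q^{\hat\mu_1}(x_k, u_k) &\le \overline\eta\, Q_1^{\hat\mu_1}(x_k, u_k) \\
&\le \overline\eta\, Q_0^{\hat\mu_1}(x_k, u_k) \\
&= \overline\eta\, Q^{\mu_0}(x_k, u_k) \\
&\le \overline\eta \beta\, Q^*(x_k, u_k),
\end{aligned}
\tag{6.3.39}
\]

which means (6.3.37) is true for i = 0. Then, considering (6.3.36) and Assumption
6.3.1, we can derive

\[
\begin{aligned}
\hat Q^{\hat\mu_2}(x_k, u_k) &\le \overline\eta\, Q_1^{\hat\mu_2}(x_k, u_k) \\
&= \overline\eta \left[ U(x_k, u_k) + \gamma \hat Q^{\hat\mu_1}(x_{k+1}, \hat\mu_2(x_{k+1})) \right] \\
&= \overline\eta \left[ U(x_k, u_k) + \gamma \min_{u_{k+1}} \hat Q^{\hat\mu_1}(x_{k+1}, u_{k+1}) \right] \\
&\le \overline\eta \left[ U(x_k, u_k) + \gamma \overline\eta \beta \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \right] \\
&\quad + \overline\eta\, \frac{\overline\eta \beta - 1}{\lambda + 1} \left[ \lambda U(x_k, u_k) - \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \right] \\
&= \overline\eta \left( 1 + \lambda \frac{\overline\eta \beta - 1}{\lambda + 1} \right) U(x_k, u_k) \\
&\quad + \overline\eta \left( \overline\eta \beta - \frac{\overline\eta \beta - 1}{\lambda + 1} \right) \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \\
&= \overline\eta \left( 1 + \lambda \frac{\overline\eta \beta - 1}{\lambda + 1} \right) Q^*(x_k, u_k) \\
&= \overline\eta \left( 1 + \frac{\lambda (\overline\eta - 1)}{\lambda + 1} + \frac{\lambda \overline\eta (\beta - 1)}{\lambda + 1} \right) Q^*(x_k, u_k).
\end{aligned}
\tag{6.3.40}
\]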

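Hence, the upper bound of (6.3.37) holds for i = 1. Suppose that the upper bound
of (6.3.37) holds for $\hat Q^{\hat\mu_i}$ (i ≥ 1). Then, for $\hat Q^{\hat\mu_{i+1}}$, we derive

\[
\begin{aligned}
\hat Q^{\hat\mu_{i+1}}(x_k, u_k) &\le \overline\eta \left[ U(x_k, u_k) + \gamma \min_{u_{k+1}} \hat Q^{\hat\mu_i}(x_{k+1}, u_{k+1}) \right] \\
&\le \overline\eta \left[ U(x_k, u_k) + \gamma \overline\eta \rho \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \right] \\
&\le \overline\eta \left[ U(x_k, u_k) + \gamma \overline\eta \rho \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \right] \\
&\quad + \overline\eta \Delta \left[ \lambda U(x_k, u_k) - \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \right] \\
&= \overline\eta (1 + \Delta \lambda) \left[ U(x_k, u_k) + \gamma \min_{u_{k+1}} Q^*(x_{k+1}, u_{k+1}) \right] \\
&= \overline\eta (1 + \Delta \lambda)\, Q^*(x_k, u_k),
\end{aligned}
\tag{6.3.41}
\]

where

\[ \rho = 1 + \sum_{j=1}^{i-1} \frac{\lambda^{j}\, \overline\eta^{\,j-1} (\overline\eta - 1)}{(\lambda + 1)^{j}} + \frac{\lambda^{i-1}\, \overline\eta^{\,i-1} (\beta - 1)}{(\lambda + 1)^{i-1}} \]

(for $\hat Q^{\hat\mu_i}$ from (6.3.37)), and Δ satisfies Δ ≥ 0 and $1 + \Delta\lambda = \overline\eta\rho - \Delta$. Hence, we can
calculate

\[
\begin{aligned}
\Delta &= \frac{\overline\eta \rho - 1}{1 + \lambda} \\
&= \frac{\overline\eta - 1}{1 + \lambda} + \sum_{j=1}^{i-1} \frac{\lambda^{j}\, \overline\eta^{\,j} (\overline\eta - 1)}{(\lambda + 1)^{j+1}} + \frac{\lambda^{i-1}\, \overline\eta^{\,i} (\beta - 1)}{(\lambda + 1)^{i}} \\
&= \sum_{j=1}^{i} \frac{\lambda^{j-1}\, \overline\eta^{\,j-1} (\overline\eta - 1)}{(\lambda + 1)^{j}} + \frac{\lambda^{i-1}\, \overline\eta^{\,i} (\beta - 1)}{(\lambda + 1)^{i}}.
\end{aligned}
\tag{6.3.42}
\]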

Substituting (6.3.42) into (6.3.41), we obtain

\[ \hat Q^{\hat\mu_{i+1}} \le \overline\eta \left[ 1 + \sum_{j=1}^{i} \frac{\lambda^{j}\, \overline\eta^{\,j-1} (\overline\eta - 1)}{(\lambda + 1)^{j}} + \frac{\lambda^{i}\, \overline\eta^{\,i} (\beta - 1)}{(\lambda + 1)^{i}} \right] Q^*. \tag{6.3.43} \]

Thus, the upper bound in (6.3.37) holds for i + 1. By mathematical induction,
the right-hand side of (6.3.37) holds.
The conclusion in (6.3.38) can be proved following the same steps as in the proof
of Theorem 6.2.2. The proof of the theorem is complete.
Remark 6.3.2 We can find that the upper bound is a monotonically increasing func-
tion of $\overline\eta$. The condition $\overline\eta \le 1/\lambda + 1$ ensures that the upper bound in (6.3.38) is
finite and positive. A larger λ will lead to a slower convergence rate and a larger error
bound. Besides, a larger λ also requires more accurate iteration to converge. When
$\underline\eta = \overline\eta = 1$, the approximate Q-function sequence $\hat Q^{\hat\mu_i}$ converges to $Q^*$ uniformly
on Ω as i → ∞.
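
The behavior described in the remark can be checked numerically. The sketch below evaluates the upper-bound factor of (6.3.37) and the limiting factor of (6.3.38); the function names and the sample values (upper error constant 1.1, λ = 2, β = 3) are assumptions made for illustration. Note that the code requires the upper error constant to be strictly smaller than 1 + 1/λ, because the denominator 1 + λ − λη̄ vanishes at equality.

```python
def approx_pi_factor(eta_bar: float, lam: float, beta: float, i: int) -> float:
    """Factor multiplying Q* on the right-hand side of (6.3.37)."""
    r = lam / (lam + 1.0)
    series = sum(r ** j * eta_bar ** (j - 1) * (eta_bar - 1.0)
                 for j in range(1, i + 1))
    tail = (r * eta_bar) ** i * (beta - 1.0)
    return eta_bar * (1.0 + series + tail)


def limit_factor(eta_bar: float, lam: float) -> float:
    """Limiting factor eta_bar / (1 + lam - lam * eta_bar) from (6.3.38)."""
    denom = 1.0 + lam - lam * eta_bar
    if denom <= 0.0:
        raise ValueError("need eta_bar < 1 + 1/lam for a finite positive bound")
    return eta_bar / denom


if __name__ == "__main__":
    eta_bar, lam, beta = 1.1, 2.0, 3.0  # hypothetical values
    for i in (0, 1, 5, 20, 100):
        print(i, approx_pi_factor(eta_bar, lam, beta, i))
    print("limit:", limit_factor(eta_bar, lam))
```

With the upper error constant set to 1, the series term vanishes and the factor reduces to 1 + (β − 1)λ^i/(λ + 1)^i, which is the exact policy iteration bound of Theorem 6.3.1.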
For the undiscounted optimal control problem, the discount factor γ = 1, and the
Q-function is redefined as

\[ Q^{\mu}(x, u) = U(x, u) + Q^{\mu}(x^{+}, \mu(x^{+})), \tag{6.3.44} \]

and the optimal Q-function satisfies

\[ Q^*(x, u) = U(x, u) + \min_{u^{+}} Q^*(x^{+}, u^{+}). \tag{6.3.45} \]

From Theorems 6.3.1 and 6.3.2, when γ = 1, all the deductions still hold. So we
have the following corollary.
Corollary 6.3.2 For the undiscounted optimal control problem with Assumption
6.3.1 and the admissible control policy $\mu_0$ satisfying $Q^* \le Q^{\mu_0} \le \beta Q^*$, $1 \le \beta \le \infty$,
if the approximate Q-function $\hat Q^{\hat\mu_i}$ satisfies the iterative error condition (6.3.36), the
approximate Q-function sequence $\{\hat Q^{\hat\mu_i}\}$ converges to a finite neighborhood of $Q^*$
uniformly on Ω as i → ∞, i.e.,

\[ \underline\eta\, Q^* \le \lim_{i \to \infty} \hat Q^{\hat\mu_i} \le \frac{\overline\eta}{1 + \lambda - \overline\eta \lambda}\, Q^*, \tag{6.3.46} \]

under the condition $\overline\eta \le 1/\lambda + 1$.

6.3.4 Neural Network Implementation

In the previous section, the approximate Q-function with policy iteration is proven
to converge to a finite neighborhood of the optimal one. Hence, it is feasible to
approximate the Q-function and the control policy using neural networks. We present
the detailed implementation of the present algorithm using neural networks in this
section.
The structural diagram of the action-dependent iterative ADP algorithm in this
section is shown in Fig. 6.7. The outputs of critic network and the action network are
the approximations of the Q-function and the control policy, respectively.
The approximate Q-function $\hat Q_j^{\hat\mu_i}(x_k, u_k)$ is expressed by

\[ \hat Q_j^{\hat\mu_i}(x_k, u_k) = W_{c(ij)}^{\mathsf T}\, \zeta\!\left( V_{c(ij)}^{\mathsf T} [x_k^{\mathsf T}, u_k^{\mathsf T}]^{\mathsf T} \right), \tag{6.3.47} \]

where ζ(·) is the activation function, which is selected as tanh(·). The target function
of the critic neural network training is given by

\[ \hat Q^{*}_{i,j+1}(x_k, u_k) = U(x_k, u_k) + \hat Q_j^{\hat\mu_i}(x_{k+1}, \hat\mu_i(x_{k+1})), \tag{6.3.48} \]

where $x_{k+1} = F(x_k, \hat\mu_i(x_k))$. Then, the error function for the critic network training
is defined by

\[ e_{c(i,j+1)}(x_k) = \hat Q^{*}_{i,j+1}(x_k, u_k) - \hat Q_{j+1}^{\hat\mu_i}(x_k, u_k), \tag{6.3.49} \]

and the performance function to be minimized is defined by

\[ E_{c(i,j+1)} = \frac{1}{2} e_{c(i,j+1)}^{2}. \tag{6.3.50} \]
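
The critic computation in (6.3.47)–(6.3.50) can be sketched in a few lines. The example below is a simplified illustration, not the authors' implementation: it uses plain Python lists, and for brevity it applies one gradient-descent step to the output weights only, whereas a full implementation would update both layers. The names `critic_forward`, `critic_output_step`, the learning rate, and the sample target value are all hypothetical.

```python
import math
import random


def critic_forward(W, V, z):
    """Return (q, h) with h = tanh(V^T z) and q = W^T h, as in (6.3.47)."""
    n_hidden = len(W)
    h = [math.tanh(sum(V[r][c] * z[r] for r in range(len(z))))
         for c in range(n_hidden)]
    q = sum(W[c] * h[c] for c in range(n_hidden))
    return q, h


def critic_output_step(W, V, z, target, lr):
    """One gradient-descent step on W for E = 0.5 * e**2 with e = target - q."""
    q, h = critic_forward(W, V, z)
    e = target - q
    # dE/dW = -e * h, so gradient descent adds lr * e * h to W.
    W_new = [w + lr * e * hc for w, hc in zip(W, h)]
    return W_new, 0.5 * e * e


if __name__ == "__main__":
    random.seed(0)
    n_in, n_hidden = 3, 8  # a 3-8-1 critic with input [x1, x2, u]
    V = [[random.uniform(-0.01, 0.01) for _ in range(n_hidden)]
         for _ in range(n_in)]
    W = [random.uniform(-0.01, 0.01) for _ in range(n_hidden)]
    z, target = [0.5, -0.5, 0.1], 1.0  # hypothetical sample and target value
    for _ in range(5):
        W, loss = critic_output_step(W, V, z, target, lr=0.1)
        print(loss)
```

Because the hidden activations are bounded by one in magnitude, a step size below 2 divided by the squared norm of the hidden vector shrinks the error on this sample at every step.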

Fig. 6.7 Structural diagram of action-dependent iterative ADP (blocks: two critic networks and an action network; legend: signal line, back-propagating path, weight transmission)


The approximate control policy $\hat\mu_i$ is expressed by the action network

\[ \hat\mu_i(x_k) = W_{a(i)}^{\mathsf T}\, \zeta\!\left( V_{a(i)}^{\mathsf T} x_k \right). \tag{6.3.51} \]

The target function of the action network training is given by

\[ \hat\mu^{*}_{i+1}(x_k) = \arg\min_{u} \hat Q^{\hat\mu_i}(x_k, u). \tag{6.3.52} \]

Then, the error function for training the action network is defined by

\[ e_{a(i+1)}(x_k) = \hat\mu^{*}_{i+1}(x_k) - \hat\mu_{i+1}(x_k). \tag{6.3.53} \]

The performance function of the action network to be minimized is defined by

\[ E_{a(i+1)} = \frac{1}{2} e_{a(i+1)}^{\mathsf T} e_{a(i+1)}. \tag{6.3.54} \]

We use the gradient descent method to update the weights of critic and action
networks on a training data set. The detailed process of the approximate policy
iteration is given in Algorithm 6.3.1.

Algorithm 6.3.1 Approximate policy iteration for action-dependent ADP


1: Initialization:
Initialize critic and action networks with random weights;
Select an initial stabilizing control policy μ0 ;
Set the approximation errors of policy evaluation step and policy improvement step as θ and ξ ,
and maximum iteration numbers of policy evaluation step and policy improvement step as jmax
and i max .
2: Set i = 0.
3: For j = 0, 1, . . . , jmax , update the Q-function $\hat Q_{j+1}^{\mu_i}(x_k)$ by minimizing (6.3.50) on the training
set {xk }. When j = jmax or the convergence conditions are met, set $\hat Q^{\mu_i}(x_k) = \hat Q_{j+1}^{\mu_i}(x_k)$ and
go to Step 4.
4: Update the control policy μ̂i+1 (xk ) by minimizing (6.3.54) on the training set {xk }.
5: Set i ← i + 1.
6: Repeat Steps 3–5 until the convergence conditions are met or i = i max .
7: Obtain the approximate optimal control policy μ̂i (xk ).
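
The evaluation/improvement loop of Steps 3–6 can be illustrated on a problem where every quantity is available in closed form. The sketch below is not the neural-network implementation of Algorithm 6.3.1: it assumes a hypothetical scalar linear system x_{k+1} = a x_k + b u_k with utility q x^2 + r u^2 and a linear policy u = −k x, so that the Q-function is Q(x, u) = q x^2 + r u^2 + γ P (a x + b u)^2 and no function approximation is needed. All names and numerical values are illustrative assumptions.

```python
def evaluate_policy(k, a, b, q, r, gamma, j_max=200, tol=1e-12):
    """Policy evaluation (Step 3): iterate P <- q + r k^2 + gamma (a - b k)^2 P."""
    c = gamma * (a - b * k) ** 2
    if c >= 1.0:
        raise ValueError("policy u = -k x is not admissible for this gamma")
    p = 0.0
    for _ in range(j_max):
        p_next = q + r * k * k + c * p
        if abs(p_next - p) <= tol:
            return p_next
        p = p_next
    return p


def improve_policy(p, a, b, r, gamma):
    """Policy improvement (Step 4): gain k minimizing Q(x, u) over u = -k x."""
    return gamma * a * b * p / (r + gamma * b * b * p)


def policy_iteration(k0, a, b, q, r, gamma, i_max=50, tol=1e-12):
    """Repeat evaluation and improvement (Steps 3-6); return gain and P history."""
    k, history = k0, []
    for _ in range(i_max):
        p = evaluate_policy(k, a, b, q, r, gamma)
        history.append(p)
        k_next = improve_policy(p, a, b, r, gamma)
        if abs(k_next - k) <= tol:
            return k_next, history
        k = k_next
    return k, history


if __name__ == "__main__":
    # Hypothetical scalar problem; k0 = 1.0 is an admissible initial gain here.
    k, hist = policy_iteration(1.0, a=1.0, b=0.5, q=0.1, r=0.1, gamma=0.95)
    print("gain:", k)
    print("P per iteration:", hist)
```

In this setting the sequence of P values is nonincreasing, which mirrors the monotonicity of the exact policy iteration; the inner loop plays the role of the j-iteration and the outer loop the i-iteration.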

6.3.5 Simulation Study

In this section, we use a simulation example to demonstrate the effectiveness of the
developed algorithm. Consider the mass-spring system whose dynamics is

\[
\begin{bmatrix} x_{1,k+1} \\ x_{2,k+1} \end{bmatrix}
=
\begin{bmatrix} 0.05 x_{2,k} \\ -0.0005 x_{1,k} - 0.0335 x_{1,k}^{3} + x_{2,k} \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0.05 \end{bmatrix} u_k.
\tag{6.3.55}
\]

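Define the Q-function as

\[ Q(x_0, u) = \sum_{k=0}^{\infty} \gamma^{k} \left( x_k^{\mathsf T} Q x_k + u_k^{\mathsf T} R u_k \right), \tag{6.3.56} \]

where

\[ Q = \begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix}, \]

R = 0.1 and γ = 0.95.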
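
As a small illustration of (6.3.56), the following sketch evaluates the discounted cost over a finite, already recorded trajectory, using the weights Q = diag(0.1, 0.1), R = 0.1 and γ = 0.95 given above. The function names are hypothetical, and truncating the infinite sum to the length of the trajectory is an assumption made so that the sum can be computed.

```python
def stage_cost(x, u, q_diag=(0.1, 0.1), r=0.1):
    """Utility x^T Q x + u^T R u for a diagonal Q and a scalar R."""
    return sum(qi * xi * xi for qi, xi in zip(q_diag, x)) + r * u * u


def discounted_cost(states, controls, gamma=0.95):
    """Finite-horizon truncation of the discounted sum in (6.3.56)."""
    return sum(gamma ** k * stage_cost(x, u)
               for k, (x, u) in enumerate(zip(states, controls)))


if __name__ == "__main__":
    # Hypothetical two-step trajectory, used only to exercise the functions.
    states = [(1.0, -1.0), (0.9, -0.8)]
    controls = [-2.0, -1.5]
    print(discounted_cost(states, controls))
```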


In the simulation, we choose two three-layer neural networks as the approximation
structures of the controller and the Q-function. The structures of the critic and action neural
networks are chosen as 3–8–1 and 2–8–1, respectively. The activation function is
selected as tanh(·). For the convergence of neural network training, the initial weights
are chosen randomly around zero. We set the initial weights
of both the critic and action networks as random values with uniform distribution in
[−0.01, 0.01]. The preset approximation errors are θ = ξ = 10^{−8}, and the maximum
iteration steps of policy evaluation and policy improvement are jmax = 40 and
imax = 10. The compact subset Ω of the state space is chosen as 0 ≤ x1 ≤ 1 and
0 ≤ x2 ≤ 1. The training set {xk } contains 1000 samples chosen randomly from the
compact set Ω. The initial admissible control policy is chosen as μ0 = [−3, −1]xk .
We train the action and critic neural networks off-line with Algorithm 6.3.1.
Figure 6.8 illustrates the convergence process of the Q-function $\hat Q_j^{\hat\mu_i}$ with the iteration
index j on state x = [−0.5, 0.5]^T at i = 1. Figure 6.9 shows the convergence curve
of Q-function $\hat Q^{\hat\mu_i}$ on state [0.5, 0.5]^T with the iteration index i. Figure 6.10 shows
the state trajectories from the initial state [1, −1]^T to the equilibrium under the initial
control policy and the approximate optimal control policy obtained by our method,
respectively. Figure 6.11 shows the action trajectories of the initial control policy and
the approximate optimal control policy obtained by our method, respectively.

Fig. 6.8 The convergence of Q-function $\hat Q_j^{\hat\mu_i}$ on state [0.5, −0.5]^T at i = 1 (horizontal axis: iteration index j; vertical axis: Q-function)

Fig. 6.9 The convergence of Q-function $\hat Q^{\hat\mu_i}$ on state [0.5, −0.5]^T (horizontal axis: iteration index i; vertical axis: Q-function)

Fig. 6.10 The state trajectories starting from the state [1, −1]^T: (a) x1; (b) x2 (horizontal axis: time steps; curves: trajectory under the initial control policy and trajectory under the optimal control policy)

Fig. 6.11 The control signals corresponding to the states from [1, −1]^T (horizontal axis: time steps; vertical axis: control action; curves: initial control policy and optimal control policy)

6.4 Conclusions

In this chapter, we established error bounds for approximate value iteration, approx-
imate policy iteration, and approximate optimistic policy iteration. We considered
approximation errors in both value function and control policy update equations.
It was shown that the iterative approximate value function can converge to a finite
neighborhood of the optimal value function under some mild conditions. The results
provided theoretical guarantees for using neural network approximation for solving
undiscounted optimal control problems. We also developed the error bound analy-
sis method of Q-function with approximate policy iteration for optimal control of
unknown discounted discrete-time nonlinear systems.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Almudevar A, Arruda EF (2012) Optimal approximation schedules for a class of iterative
algorithms, with an application to multigrid value iteration. IEEE Trans Autom Control
57(12):3132–3146
3. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B:
Cybern 38(4):943–949
4. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton
5. Bertsekas DP (2012) Weighted sup-norm contractions in dynamic programming: a review and
some new applications. Technical report LIDS-P-2884, MIT
6. Bertsekas DP (2013) Abstract dynamic programming. Athena Scientific, Belmont
7. Bertsekas DP (2016) Value and policy iterations in optimal control and adaptive dynamic
programming. IEEE Trans Neural Netw Learn Syst (Online Available). doi:10.1109/TNNLS.
2015.2503980
8. Dierks T, Thumati BT, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5):851–860
9. Grune L, Rantzer A (2008) On the infinite horizon performance of receding horizon controllers.
IEEE Trans Autom Control 53(9):2100–2111
10. Heydari A, Balakrishnan SN (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
11. Howard RA (1960) Dynamic programming and Markov processes. MIT Press, Cambridge
12. Leake RJ, Liu RW (1967) Construction of suboptimal control sequences. SIAM J Control
Optim 5(1):54–63
13. Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
14. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
15. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
16. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
17. Liu D, Wei Q (2013) Finite-approximation-error based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
18. Liu D, Wei Q (2014) Policy iterative adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
19. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
20. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
21. Liu D, Li H, Wang D (2015) Error bounds of adaptive dynamic programming algorithms
for solving undiscounted optimal control problems. IEEE Trans Neural Netw Learn Syst
26(6):1323–1334
22. Puterman M, Shin M (1978) Modified policy iteration algorithms for discounted Markov deci-
sion problems. Manag Sci 24(11):1127–1137
23. Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc-Control
Theory Appl 153(5):567–574
24. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
25. Tsitsiklis JN (2002) On the convergence of optimistic policy iteration. J Mach Learn Res
3:59–72
26. Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games
by reinforcement learning principles. IET, London
27. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
28. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36
29. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
30. Wei Q, Wang FY, Liu D, Yang X (2014) Finite-approximation-error based discrete-time iterative
adaptive dynamic programming. IEEE Trans Cybern 44(12):2820–2833
31. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: Neural, fuzzy, and adaptive
approaches (Chap. 13). Van Nostrand Reinhold, New York
32. Yan P, Wang D, Li H, Liu D (2016) Error bound analysis of Q-function for discounted optimal
control problems with policy iteration. IEEE Trans Syst Man Cybern Syst (Online Available)
doi:10.1109/TSMC.2016.2563982
33. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans
Syst Man Cybern-Part B: Cybern 38(4):937–942
34. Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of
discrete-time affine nonlinear systems with control constraints. IEEE Trans Neural Netw
20(9):1490–1503
35. Zhang H, Song R, Wei Q, Zhang T (2011) Optimal tracking control for a class of nonlinear
discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans
Neural Netw 22(1):24–36
Part II
Continuous-Time Systems
Chapter 7
Online Optimal Control of Continuous-Time
Affine Nonlinear Systems

7.1 Introduction

Optimal control problems for continuous-time nonlinear dynamical systems have
been intensively studied during the past several decades [4, 14, 15, 18–20, 30,
31, 34, 39, 40]. A core challenge in deriving solutions of nonlinear optimal control
problems is that it often reduces to solving the Hamilton–Jacobi–Bellman (HJB) equation.
It is well known that the HJB equation is a nonlinear partial differential
equation, which is difficult or impossible to solve by analytical methods. To cope
with the problem, in this chapter, we develop approximate optimal control schemes
for continuous-time affine nonlinear systems using the adaptive dynamic programming
(ADP) approach.
First, we consider an optimal control problem for a class of continuous-time
partially unknown affine nonlinear systems. By employing an identifier–critic archi-
tecture, an online algorithm based on ADP approach is developed to derive the
approximate optimal control [34]. Then, an ADP algorithm is presented to obtain the
optimal control for continuous-time nonlinear systems with control constraints [31].

7.2 Online Optimal Control of Partially Unknown Affine Nonlinear Systems

Consider the continuous-time input-affine nonlinear systems described by

ẋ(t) = f (x(t)) + g(x(t))u(x(t)), x(0) = x0 , (7.2.1)

where x(t) ∈ Rn is the state vector available for measurement, u(t) ∈ Rm is the
control vector, f (x) ∈ Rn is an unknown nonlinear function with f (0) = 0, and
g(x) ∈ Rn×m is a matrix of known nonlinear functions. For the convenience of
subsequent analysis, we provide two assumptions as follows.
© Springer International Publishing AG 2017 267
D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_7

Assumption 7.2.1 $f(x) + g(x)u$ is Lipschitz continuous on a compact set $\Omega \subset \mathbb{R}^n$
containing the origin, such that the solution x(t) of system (7.2.1) is unique for all
$x_0 \in \Omega$ and $u \in \mathbb{R}^m$. In addition, system (7.2.1) is controllable.

Assumption 7.2.2 The control matrix g(x) is bounded over a compact set Ω; that
is, there exist constants $g_m$ and $g_M$ ($0 < g_m < g_M$) such that $g_m \le \|g(x)\|_F \le g_M$
for all $x \in \Omega$.

Note that $g(\cdot) \in \mathbb{R}^{n \times m}$ is a matrix function. Our analysis results in this chapter
(and Chap. 8) will use the Frobenius matrix norm, which is defined as

\[ \|A\|_F = \sqrt{ \sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^{2} } \]

for matrix $A = (a_{ij}) \in \mathbb{R}^{n \times m}$. For the convenience of presentation, we will drop the
subscript “F” for the Frobenius matrix norm in this chapter and the next. Accordingly,
the vector norm will also be chosen as the compatible one, i.e., the Euclidean norm for
vectors, which is defined as

\[ \|x\| = \sqrt{ \sum_{i=1}^{n} x_i^{2} } \]

for $x \in \mathbb{R}^n$. In other words, whenever we use norms in this chapter and Chap. 8, we
mean the Euclidean norm for vectors and the Frobenius norm for matrices.
The value function for system (7.2.1) is defined by

\[ V(x(t)) = \int_{t}^{\infty} r\big( x(s), u(s) \big)\, ds, \tag{7.2.2} \]

where

\[ r(x, u) = x^{\mathsf T} Q x + u^{\mathsf T} R u, \]

and Q and R are constant symmetric positive-definite matrices with appropriate
dimensions.

Definition 7.2.1 (cf. [11]) The solution x(t) of system (7.2.1) is said to be uniformly
ultimately bounded (UUB), if there exist positive constants b and c, independent of
$t_0 \ge 0$, and for every $a \in (0, c)$, there is $T = T(a, b) > 0$, independent of $t_0$, such
that

\[ \|x(t_0)\| \le a \;\Rightarrow\; \|x(t)\| \le b, \quad \forall t \ge t_0 + T. \]

The control objective of this chapter is to obtain an online adaptive control that not
only stabilizes system (7.2.1) but also minimizes the value function (7.2.2), while
ensuring that all signals in the closed-loop system are UUB.

7.2.1 Identifier–Critic Architecture for Solving HJB Equation

In control engineering, NNs are considered as powerful tools for approximating
nonlinear functions due to their nonlinearity, adaptivity, self-learning, and fault toler-
ance. In this section, a single-hidden layer feedforward NN is applied to approximate
$\mathcal{F}(x) \in C^{n}(\Omega)$ ($\mathcal{F}(x)$ is a nonlinear function to be described later) as follows [10]:

\[ \mathcal{F}(x) = W_m^{\mathsf T} \sigma_m\!\left( Y_m^{\mathsf T} x \right) + \varepsilon_m(x), \tag{7.2.3} \]

where $\sigma_m(\cdot) \in \mathbb{R}^{N_0}$ is the activation function, $\varepsilon_m(x) \in \mathbb{R}^{n}$ is the NN function
reconstruction error, and $W_m \in \mathbb{R}^{N_0 \times n}$ and $Y_m \in \mathbb{R}^{n \times N_0}$ are the weight matrices of the
hidden layer to the output layer and the input layer to the hidden layer, respectively.
The number of the hidden layer nodes is denoted by $N_0$. The activation function
$\sigma_m(\cdot)$ is a componentwise bounded, measurable, and nondecreasing function from
the real numbers onto [−1, 1]. Without loss of generality, in this chapter, we choose
the hyperbolic tangent function as the activation function, i.e., $\sigma_m(\alpha) = \tanh(\alpha) =
(e^{\alpha} - e^{-\alpha})/(e^{\alpha} + e^{-\alpha})$.
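From system (7.2.1), we have

\[ \dot x(t) = A x + \mathcal{F}(x) + g(x) u, \tag{7.2.4} \]

where $\mathcal{F}(x) = f(x) - A x$ and $A \in \mathbb{R}^{n \times n}$ is a known Hurwitz matrix. Using (7.2.3),
(7.2.4) can be rewritten as

\[ \dot x(t) = A x + W_m^{\mathsf T} \sigma_m\!\left( Y_m^{\mathsf T} x \right) + g(x) u + \varepsilon_m(x). \tag{7.2.5} \]

The NN identifier approximates system (7.2.1) as

\[ \dot{\hat x}(t) = A \hat x + \hat W_m^{\mathsf T} \sigma_m\!\left( \hat Y_m^{\mathsf T} \hat x \right) + g(\hat x) u + C \tilde x(t), \tag{7.2.6} \]

where $\hat x(t) \in \mathbb{R}^{n}$ is the identifier NN state, $\hat W_m \in \mathbb{R}^{N_0 \times n}$ and $\hat Y_m \in \mathbb{R}^{n \times N_0}$ are
estimated weights, $\tilde x(t)$ is the identification error $\tilde x(t) \triangleq x(t) - \hat x(t)$, and the design
matrix $C \in \mathbb{R}^{n \times n}$ is selected such that $A - C$ is a Hurwitz matrix.
Using (7.2.5) and (7.2.6), we obtain the identification error dynamics as

\[ \dot{\tilde x}(t) = A_c \tilde x(t) + \tilde W_m^{\mathsf T} \sigma_m\!\left( \hat Y_m^{\mathsf T} \hat x \right) + \delta(x), \tag{7.2.7} \]

where $A_c = A - C$, $\tilde W_m = W_m - \hat W_m$, and δ(x) is given as

\[ \delta(x) = W_m^{\mathsf T} \left[ \sigma_m\!\left( Y_m^{\mathsf T} x \right) - \sigma_m\!\left( \hat Y_m^{\mathsf T} \hat x \right) \right] + \left[ g(x) - g(\hat x) \right] u + \varepsilon_m(x). \]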
Before proceeding further, we provide some assumptions and facts, which have
been used in the literature [16, 17, 23, 25, 31–33, 35, 38].

Assumption 7.2.3 The ideal identifier NN weights $W_m$ and $Y_m$ are bounded by
known positive constants $\bar W_M$ and $\bar Y_M$, respectively. That is,

\[ \|W_m\| \le \bar W_M, \quad \|Y_m\| \le \bar Y_M. \]

Assumption 7.2.4 The NN function approximation error $\varepsilon_m(x)$ is bounded over Ω
as $\|\varepsilon_m(x)\| \le \varepsilon_M$ for every $x \in \Omega$, where $\varepsilon_M > 0$ is a known constant.

Fact 7.2.1 The NN activation function is bounded over Ω, i.e., there exists a known
constant $\sigma_M > 0$ such that $\|\sigma_m(x)\| \le \sigma_M$ for every $x \in \Omega$.

Fact 7.2.2 Since $A_c$ is a Hurwitz matrix, there exists a positive-definite symmetric
matrix $P \in \mathbb{R}^{n \times n}$ satisfying the Lyapunov equation

\[ A_c^{\mathsf T} P + P A_c = -\theta I_n, \]

where θ > 0 is a design parameter and $I_n$ denotes the n × n identity matrix.

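Theorem 7.2.1 Let Assumptions 7.2.1–7.2.4 hold. If the identifier NN estimated
weights $\hat W_m$ and $\hat Y_m$ are updated as

\[ \dot{\hat W}_m = -l_1 \sigma_m\!\left( \hat Y_m^{\mathsf T} \hat x \right) \tilde x^{\mathsf T} A_c^{-1} - \tau_1 \|\tilde x\| \hat W_m, \tag{7.2.8} \]

\[ \dot{\hat Y}_m = -l_2\, \mathrm{sgn}(\hat x)\, \tilde x^{\mathsf T} A_c^{-1} \hat W_m^{\mathsf T} \left[ I_{N_0} - \Phi\!\left( \hat Y_m^{\mathsf T} \hat x \right) \right] - \tau_2 \|\tilde x\| \hat Y_m, \tag{7.2.9} \]

where $l_i > 0$ (i = 1, 2) are given constant parameters, $\tau_i$ (i = 1, 2) satisfy

\[ \tau_1 > l_1 \|A_c^{-1}\|^{2} / 4, \quad \tau_2 > l_2, \tag{7.2.10} \]

\[ \Phi\!\left( \hat Y_m^{\mathsf T} \hat x \right) = \mathrm{diag}\left\{ \sigma_{m1}^{2}\!\left( \hat Y_{m1}^{\mathsf T} \hat x \right), \ldots, \sigma_{mN_0}^{2}\!\left( \hat Y_{mN_0}^{\mathsf T} \hat x \right) \right\}, \]

$I_{N_0}$ is the $N_0 \times N_0$ identity matrix, $\mathrm{sgn}(\hat x) = [\mathrm{sgn}(\hat x_1), \ldots, \mathrm{sgn}(\hat x_n)]^{\mathsf T}$, and sgn(·) is
the componentwise sign function. Then, the identifier NN given in (7.2.6) can ensure
that the identification error $\tilde x(t)$ in (7.2.7) converges to a compact set

\[ \Omega_{\tilde x} = \left\{ \tilde x : \|\tilde x\| \le 2\varsigma/\theta \right\}, \tag{7.2.11} \]

where ς > 0 is a constant to be determined later (see (7.2.20) below). In addition, the
NN weight estimation errors $\tilde W_m = W_m - \hat W_m$ and $\tilde Y_m = Y_m - \hat Y_m$ are all guaranteed
to be UUB.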

Proof Consider the Lyapunov function candidate

L 1 (x) = L 11 (x) + L 12 (x), (7.2.12)


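where

\[ L_{11}(x) = \frac{1}{2} \tilde x^{\mathsf T} P \tilde x \quad \text{and} \quad L_{12}(x) = \frac{1}{2} \mathrm{tr}\!\left( \tilde W_m^{\mathsf T} l_1^{-1} \tilde W_m \right) + \frac{1}{2} \mathrm{tr}\!\left( \tilde Y_m^{\mathsf T} l_2^{-1} \tilde Y_m \right). \]

Taking the time derivative of $L_{11}(x)$ along the solutions of (7.2.7) and using Facts
7.2.1 and 7.2.2, it follows that

\[
\begin{aligned}
\dot L_{11}(x) &= \frac{1}{2} \left( \dot{\tilde x}^{\mathsf T} P \tilde x + \tilde x^{\mathsf T} P \dot{\tilde x} \right) \\
&= -\frac{\theta}{2} \tilde x^{\mathsf T} \tilde x + \tilde x^{\mathsf T} P \left[ \tilde W_m^{\mathsf T} \sigma_m\!\left( \hat Y_m^{\mathsf T} \hat x \right) + \delta(x) \right] \\
&\le -\frac{\theta}{2} \|\tilde x\|^{2} + \|\tilde x\| \|P\| \left( \|\tilde W_m\| \sigma_M + \delta_M \right),
\end{aligned}
\tag{7.2.13}
\]

where $\delta_M$ is the upper bound of $\|\delta(x)\|$, i.e., $\|\delta(x)\| \le \delta_M$. Actually, noticing that u(x)
is a continuous function defined on Ω, there exists a constant $u_M > 0$ such that
$\|u(x)\| \le u_M$ for every $x \in \Omega$. Then, by Assumptions 7.2.2–7.2.4 and Fact 7.2.1,
we can conclude that δ(x) given in (7.2.7) is an upper bounded function.
Taking the time derivative of $L_{12}(x)$ and using (7.2.8) and (7.2.9), we obtain

\[
\begin{aligned}
\dot L_{12}(x) &= \mathrm{tr}\!\left( \tilde W_m^{\mathsf T} l_1^{-1} \dot{\tilde W}_m \right) + \mathrm{tr}\!\left( \tilde Y_m^{\mathsf T} l_2^{-1} \dot{\tilde Y}_m \right) \\
&= \mathrm{tr}\!\left[ \tilde W_m^{\mathsf T} \sigma_m\!\left( \hat Y_m^{\mathsf T} \hat x \right) \tilde x^{\mathsf T} A_c^{-1} + \frac{\tau_1}{l_1} \|\tilde x\| \tilde W_m^{\mathsf T} \left( W_m - \tilde W_m \right) \right] \\
&\quad + \mathrm{tr}\!\left[ \tilde Y_m^{\mathsf T} \mathrm{sgn}(\hat x) \tilde x^{\mathsf T} A_c^{-1} \left( W_m - \tilde W_m \right)^{\mathsf T} \left[ I_{N_0} - \Phi\!\left( \hat Y_m^{\mathsf T} \hat x \right) \right] + \frac{\tau_2}{l_2} \|\tilde x\| \tilde Y_m^{\mathsf T} \left( Y_m - \tilde Y_m \right) \right].
\end{aligned}
\tag{7.2.14}
\]

Note that

\[ \mathrm{tr}\!\left( X_1 X_2^{\mathsf T} \right) = \mathrm{tr}\!\left( X_2^{\mathsf T} X_1 \right) = X_2^{\mathsf T} X_1, \quad \forall X_1, X_2 \in \mathbb{R}^{n \times 1}, \]

and

\[ \mathrm{tr}\!\left[ \tilde Z^{\mathsf T} (Z - \tilde Z) \right] \le \|\tilde Z\|_F \|Z\|_F - \|\tilde Z\|_F^{2}, \quad \forall Z, \tilde Z \in \mathbb{R}^{m \times n}. \tag{7.2.15} \]

We emphasize that (7.2.15) is true for the Frobenius matrix norm, but it is not true for
other matrix norms in general. As we have declared earlier, the Frobenius norm for
matrices and the Euclidean norm for vectors are used in this chapter. We do not use the
subscript “F” for the Frobenius matrix norm for the convenience of presentation in this
chapter.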
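
Inequality (7.2.15) follows from the Cauchy–Schwarz inequality for the trace inner product, since tr(Z̃ᵀZ) ≤ ‖Z̃‖_F ‖Z‖_F. The sketch below checks it numerically on random matrices; the function names, the matrix sizes, and the small floating-point slack are assumptions made for illustration.

```python
import math
import random


def frobenius(m):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in m for v in row))


def trace_term(z_tilde, z):
    """tr(Z~^T (Z - Z~)), computed as an elementwise sum."""
    return sum(zt * (zv - zt)
               for row_t, row in zip(z_tilde, z)
               for zt, zv in zip(row_t, row))


def check_inequality(z_tilde, z, slack=1e-12):
    """Return True if (7.2.15) holds up to a floating-point slack."""
    lhs = trace_term(z_tilde, z)
    rhs = frobenius(z_tilde) * frobenius(z) - frobenius(z_tilde) ** 2
    return lhs <= rhs + slack


if __name__ == "__main__":
    random.seed(1)
    rows, cols = 3, 4  # hypothetical matrix size
    for _ in range(5):
        zt = [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
        z = [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
        print(check_inequality(zt, z))
```

For comparison, with the induced 2-norm the analogous bound can fail: taking Z̃ as the 2 × 2 identity and Z as twice the identity gives a trace of 2 on the left, while the right-hand side would be 1 · 2 − 1 = 1.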
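Then, from (7.2.14), it follows that

\[
\begin{aligned}
\dot L_{12}(x) &= \tilde x^{\mathsf T} A_c^{-1} \tilde W_m^{\mathsf T} \sigma_m\!\left( \hat Y_m^{\mathsf T} \hat x \right) + \frac{\tau_1}{l_1} \|\tilde x\|\, \mathrm{tr}\!\left[ \tilde W_m^{\mathsf T} \left( W_m - \tilde W_m \right) \right] \\
&\quad + \tilde x^{\mathsf T} A_c^{-1} \left( W_m - \tilde W_m \right)^{\mathsf T} \left[ I_{N_0} - \Phi\!\left( \hat Y_m^{\mathsf T} \hat x \right) \right] \tilde Y_m^{\mathsf T} \mathrm{sgn}(\hat x) \\
&\quad + \frac{\tau_2}{l_2} \|\tilde x\|\, \mathrm{tr}\!\left[ \tilde Y_m^{\mathsf T} \left( Y_m - \tilde Y_m \right) \right] \\
&\le \alpha \sigma_M \|\tilde x\| \|\tilde W_m\| + \frac{\tau_1}{l_1} \|\tilde x\| \left( \bar W_M \|\tilde W_m\| - \|\tilde W_m\|^{2} \right) \\
&\quad + \alpha \left\| I_{N_0} - \Phi\!\left( \hat Y_m^{\mathsf T} \hat x \right) \right\| \|\tilde x\| \left( \bar W_M + \|\tilde W_m\| \right) \|\tilde Y_m\| \\
&\quad + \frac{\tau_2}{l_2} \|\tilde x\| \left( \bar Y_M \|\tilde Y_m\| - \|\tilde Y_m\|^{2} \right),
\end{aligned}
\tag{7.2.16}
\]

where $\alpha = \|A_c^{-1}\|$. Combining (7.2.13) and (7.2.16) and noticing $\| I_{N_0} - \Phi( \hat Y_m^{\mathsf T} \hat x ) \| \le 1$,
we obtain the time derivative of the Lyapunov function as

\[
\begin{aligned}
\dot L_1(x) &\le -\frac{\theta}{2} \|\tilde x\|^{2} + \bigg\{ \delta_M \|P\| + \left[ (\|P\| + \alpha) \sigma_M + \frac{\tau_1}{l_1} \bar W_M \right] \|\tilde W_m\| \\
&\quad + \left( \alpha \bar W_M + \frac{\tau_2}{l_2} \bar Y_M \right) \|\tilde Y_m\| - \left( \frac{\tau_1}{l_1} - \frac{\alpha^{2}}{4} \right) \|\tilde W_m\|^{2} \\
&\quad - \left( \frac{\tau_2}{l_2} - 1 \right) \|\tilde Y_m\|^{2} - \left( \frac{\alpha}{2} \|\tilde W_m\| - \|\tilde Y_m\| \right)^{2} \bigg\} \|\tilde x\| \\
&= -\frac{\theta}{2} \|\tilde x\|^{2} + \bigg\{ \delta_M \|P\| + \left( \frac{\tau_1}{l_1} - \frac{\alpha^{2}}{4} \right) \beta_1^{2} + \left( \frac{\tau_2}{l_2} - 1 \right) \beta_2^{2} \\
&\quad - \left( \frac{\tau_1}{l_1} - \frac{\alpha^{2}}{4} \right) \left( \|\tilde W_m\| + \beta_1 \right)^{2} - \left( \frac{\tau_2}{l_2} - 1 \right) \left( \|\tilde Y_m\| + \beta_2 \right)^{2} \\
&\quad - \left( \frac{\alpha}{2} \|\tilde W_m\| - \|\tilde Y_m\| \right)^{2} \bigg\} \|\tilde x\|,
\end{aligned}
\tag{7.2.17}
\]

where

\[ \beta_1 = \frac{2 l_1 (\alpha + \|P\|) \sigma_M + 2 \tau_1 \bar W_M}{\alpha^{2} l_1 - 4 \tau_1}, \quad \beta_2 = \frac{\alpha l_2 \bar W_M + \tau_2 \bar Y_M}{2 (l_2 - \tau_2)}. \]

In the above derivation, we have used the following identity,

\[ ab = \frac{1}{2} \left[ \left( \nu a + \frac{b}{\nu} \right)^{2} - \left( \nu^{2} a^{2} + \frac{b^{2}}{\nu^{2}} \right) \right] \tag{7.2.18} \]

for arbitrary $a, b \in \mathbb{R}$ and $\nu \ne 0$. Using (7.2.10), we derive (7.2.17) as

\[
\begin{aligned}
\dot L_1(x) &\le -\frac{\theta}{2} \|\tilde x\|^{2} + \left[ \delta_M \|P\| + \left( \frac{\tau_1}{l_1} - \frac{\alpha^{2}}{4} \right) \beta_1^{2} + \left( \frac{\tau_2}{l_2} - 1 \right) \beta_2^{2} \right] \|\tilde x\| \\
&= -\left( \frac{\theta}{2} \|\tilde x\| - \varsigma \right) \|\tilde x\|,
\end{aligned}
\tag{7.2.19}
\]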

where

\[ \varsigma = \delta_M \|P\| + \left( \frac{\tau_1}{l_1} - \frac{\alpha^{2}}{4} \right) \beta_1^{2} + \left( \frac{\tau_2}{l_2} - 1 \right) \beta_2^{2}. \tag{7.2.20} \]

Hence, $\dot L_1(x)$ is negative as long as $\|\tilde x\| > 2\varsigma/\theta$. That is, the identification error
$\tilde x(t)$ converges to $\Omega_{\tilde x}$ defined as in (7.2.11). Meanwhile, according to the standard
Lyapunov’s extension theorem [12, 13] (or the Lagrange stability result [22]), this
verifies the uniform ultimate boundedness of the NN weight estimation errors $\tilde W_m$
and $\tilde Y_m$, which completes the proof of the theorem.

Remark 7.2.1 The first terms of (7.2.8) and (7.2.9) are derived from the standard
backpropagation algorithm, and the last terms are employed to ensure the bounded-
ness of parameter estimations. The size of $\Omega_{\tilde x}$ in (7.2.11) can be kept sufficiently
small by properly choosing parameters, for example, θ, $\tau_i$, and $l_i$ (i = 1, 2), such that
higher accuracy of identification is guaranteed. Although (7.2.8) and (7.2.9) share
similar features with those in [16], a significant difference between [16] and the present work
is that, in our case, we do not use Taylor series in the process of identification. Since
a Taylor series expansion introduces residual errors, our method is considered to be
more accurate in estimating unknown system dynamics.

Since system (7.2.1) can be approximated well by (7.2.6) outside of the compact
set Ωx̃ , in what follows we replace system (7.2.1) with (7.2.6). Meanwhile, we replace
the actual system state x(t) with the estimated state x̂(t). In this circumstance, system
(7.2.1) can be represented using (7.2.6) as
$$\dot{\hat{x}}(t) = h(\hat{x}) + g(\hat{x})u, \tag{7.2.21}$$
where
$$h(\hat{x}) = A\hat{x} + \hat{W}_m^T \sigma_m\big(\hat{Y}_m^T \hat{x}\big) + C\tilde{x}.$$

The value function (7.2.2) is rewritten as
$$V(\hat{x}(t)) = \int_t^{\infty} \big(Q(\hat{x}(s)) + u^T(s) R u(s)\big)\, ds, \tag{7.2.22}$$
where $Q(\hat{x}) = \hat{x}^T Q \hat{x}$.

Definition 7.2.2 (cf. [2]) A control $u(\hat{x}) : \mathbb{R}^n \to \mathbb{R}^m$ is said to be admissible with respect to (7.2.22) on $\Omega$, written as $u(\hat{x}) \in \mathcal{A}(\Omega)$, if $u(\hat{x})$ is continuous on $\Omega$, $u(0) = 0$, $u(\hat{x})$ stabilizes system (7.2.21) on $\Omega$, and $V(\hat{x})$ is finite for every $\hat{x} \in \Omega$.

If the control $u(\hat{x}) \in \mathcal{A}(\Omega)$ and the value function $V(\hat{x}) \in C^1(\Omega)$, then (7.2.21) and (7.2.22) are equivalent to
$$V_{\hat{x}}^T \big(h(\hat{x}) + g(\hat{x})u\big) + Q(\hat{x}) + u^T R u = 0,$$
where $V_{\hat{x}} \in \mathbb{R}^n$ represents the partial derivative of $V(\hat{x})$ with respect to $\hat{x}$, i.e.,
$$V_{\hat{x}} = \frac{\partial V(\hat{x})}{\partial \hat{x}}.$$
Now, define the Hamiltonian for the control $u(\hat{x})$ and the value function $V(\hat{x})$ as
$$H(\hat{x}, V_{\hat{x}}, u) = V_{\hat{x}}^T \big(h(\hat{x}) + g(\hat{x})u\big) + Q(\hat{x}) + u^T R u.$$

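Then, the optimal value function $V^*(\hat{x})$ is obtained by solving the Hamilton–Jacobi–Bellman (HJB) equation
$$\min_{u(\hat{x}) \in \mathcal{A}(\Omega)} H\big(\hat{x}, V^*_{\hat{x}}, u\big) = 0. \tag{7.2.23}$$
Setting $\partial H/\partial u = g^T(\hat{x})V^*_{\hat{x}} + 2Ru = 0$, the closed-form expression for the optimal control can be derived as
$$u^*(\hat{x}) = -\frac{1}{2} R^{-1} g^T(\hat{x}) V^*_{\hat{x}}. \tag{7.2.24}$$
Substituting (7.2.24) into (7.2.23), we get the HJB equation as
$$\big(V^*_{\hat{x}}\big)^T h(\hat{x}) + Q(\hat{x}) - \frac{1}{4}\big(V^*_{\hat{x}}\big)^T g(\hat{x}) R^{-1} g^T(\hat{x}) V^*_{\hat{x}} = 0. \tag{7.2.25}$$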
In this sense, one shall find that (7.2.25) is actually a nonlinear partial differential
equation with respect to V ∗ (x̂), which is difficult to solve by analytical methods. To
confront the challenge, an online NN-based optimal control scheme will be devel-
oped. Prior to proceeding further, we provide the following required assumption.
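As a sanity check on (7.2.24) and (7.2.25), the following minimal sketch specializes them to a scalar linear system, where the HJB equation collapses to a quadratic in one unknown. The names `riccati_root`, `hjb_residual`, and `optimal_control`, the model $h(x) = ax$, $g(x) = b$, $V(x) = px^2$, and the numerical values are assumptions chosen for illustration; they do not come from the text.

```python
import math

def riccati_root(a, b, q, r):
    """Positive root p of 2*a*p + q - (b**2 / r) * p**2 = 0."""
    return r * (a + math.sqrt(a * a + q * b * b / r)) / (b * b)

def hjb_residual(x, a, b, q, r, p):
    """Left-hand side of (7.2.25) for h(x) = a*x, g(x) = b, V(x) = p*x**2."""
    v_x = 2.0 * p * x  # gradient of the candidate value function
    return v_x * a * x + q * x * x - 0.25 * v_x * b * (1.0 / r) * b * v_x

def optimal_control(x, b, r, p):
    """Control from (7.2.24): u = -(1/2) R^{-1} g^T V_x."""
    return -0.5 * (1.0 / r) * b * (2.0 * p * x)

# Hypothetical scalar problem: an unstable drift (a > 0) with cost q*x^2 + r*u^2.
a, b, q, r = 1.0, 2.0, 3.0, 0.5
p = riccati_root(a, b, q, r)
closed_loop_slope = a + b * optimal_control(1.0, b, r, p)
```

With these values the residual vanishes for every state and the closed-loop slope is negative, which is the scalar counterpart of the statement that $u^*$ both solves the HJB equation and stabilizes the system.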

Assumption 7.2.5 $L_2(\hat{x})$ is a continuously differentiable Lyapunov function candidate for system (7.2.21) and satisfies
$$\dot{L}_2(\hat{x}) = L_{2\hat{x}}^T \big(h(\hat{x}) + g(\hat{x})u^*\big) < 0$$
with $L_{2\hat{x}}$ the partial derivative of $L_2(\hat{x})$ with respect to $\hat{x}$. Meanwhile, there exists a symmetric positive definite matrix $\Lambda_2(\hat{x}) \in \mathbb{R}^{n \times n}$ defined on $\Omega$ such that
$$L_{2\hat{x}}^T \big(h(\hat{x}) + g(\hat{x})u^*\big) = -L_{2\hat{x}}^T \Lambda_2(\hat{x}) L_{2\hat{x}}. \tag{7.2.26}$$

Remark 7.2.2 $h(\hat{x}) + g(\hat{x})u^*$ is often assumed to be bounded by a positive constant [3, 23, 25, 29], i.e., there exists a constant $\rho > 0$ such that $\|h(\hat{x}) + g(\hat{x})u^*\| \le \rho$. To relax the condition, in this section, $h(\hat{x}) + g(\hat{x})u^*$ is assumed to be bounded by a function of $\hat{x}$. Since $L_{2\hat{x}}$ is a function of $\hat{x}$, without loss of generality, we assume that
$$\|h(\hat{x}) + g(\hat{x})u^*\| \le \rho \|L_{2\hat{x}}\|,$$
where $\rho > 0$. Noticing that $L_{2\hat{x}}^T \big(h(\hat{x}) + g(\hat{x})u^*\big) < 0$, one shall find that (7.2.26) in Assumption 7.2.5 is reasonable. In addition, $L_2(\hat{x})$ can be obtained by properly selecting functions, such as polynomials.

In what follows, an online optimal control scheme is constructed using a single critic NN. Due to the universal approximation property of NNs [8, 10], the optimal value function $V^*(\hat{x})$ can be represented by a single-layer NN on a compact set $\Omega$ as [1, 29]
$$V^*(\hat{x}) = W_c^T \sigma(\hat{x}) + \varepsilon_{N_1}(\hat{x}),$$
where $W_c \in \mathbb{R}^{N_1}$ is the ideal NN weight vector,
$$\sigma(\hat{x}) = [\sigma_1(\hat{x}), \ldots, \sigma_{N_1}(\hat{x})]^T \in \mathbb{R}^{N_1}$$
is the vector of activation functions with $\sigma_j(\hat{x}) \in C^1(\Omega)$ and $\sigma_j(0) = 0$, the set $\{\sigma_j(\hat{x})\}_1^{N_1}$ is often selected to be linearly independent, $N_1$ is the number of neurons, and $\varepsilon_{N_1}(\hat{x})$ is the NN function reconstruction error. The derivative of $V^*(\hat{x})$ with respect to $\hat{x}$ is given as
$$V^*_{\hat{x}} = \nabla\sigma^T(\hat{x}) W_c + \nabla\varepsilon_{N_1}(\hat{x}), \tag{7.2.27}$$
where $\nabla\sigma(\hat{x}) = \partial\sigma(\hat{x})/\partial\hat{x}$, $\nabla\varepsilon_{N_1} = \partial\varepsilon_{N_1}(\hat{x})/\partial\hat{x}$, and $\nabla\sigma(0) = 0$.


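Using (7.2.27), (7.2.24) can be represented as
$$u^*(\hat{x}) = -\frac{1}{2} R^{-1} g^T(\hat{x}) \nabla\sigma^T W_c - \frac{1}{2} R^{-1} g^T(\hat{x}) \nabla\varepsilon_{N_1}. \tag{7.2.28}$$
By the same token, (7.2.25) can be rewritten as
$$W_c^T \nabla\sigma h(\hat{x}) + Q(\hat{x}) - \frac{1}{4} W_c^T \nabla\sigma g(\hat{x}) R^{-1} g^T(\hat{x}) \nabla\sigma^T W_c + \varepsilon_{HJB} = 0, \tag{7.2.29}$$
where $\varepsilon_{HJB}$ is the residual error converging to zero when $N_1$ is large enough [1]; that is, there exists a small constant $\varepsilon_{a1} > 0$ such that $\|\varepsilon_{HJB}\| \le \varepsilon_{a1}$.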
Since the ideal NN weight vector $W_c$ is often unavailable, (7.2.28) cannot be implemented in a real control process. Hence, we employ a critic NN to approximate the value function given in (7.2.22) as
$$\hat{V}(\hat{x}) = \hat{W}_c^T \sigma(\hat{x}), \tag{7.2.30}$$
where $\hat{W}_c$ is the estimate of $W_c$. The weight estimation error for the critic NN is defined as
$$\tilde{W}_c = W_c - \hat{W}_c. \tag{7.2.31}$$
Using (7.2.30), the estimate of (7.2.24) is given by
$$\hat{u}(\hat{x}) = -\frac{1}{2} R^{-1} g^T(\hat{x}) \nabla\sigma^T \hat{W}_c. \tag{7.2.32}$$
The approximate Hamiltonian is derived as
$$H(\hat{x}, \hat{W}_c) = \hat{W}_c^T \nabla\sigma h(\hat{x}) + Q(\hat{x}) - \frac{1}{4} \hat{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \hat{W}_c \triangleq e_1, \tag{7.2.33}$$
where
$$G(\hat{x}) = g(\hat{x}) R^{-1} g^T(\hat{x}).$$

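Combining (7.2.28), (7.2.29), and (7.2.33), we have
$$\begin{aligned}
e_1 ={}& (W_c - \tilde{W}_c)^T \nabla\sigma h(\hat{x}) + Q(\hat{x}) - \frac{1}{4}(W_c - \tilde{W}_c)^T \nabla\sigma G(\hat{x}) \nabla\sigma^T (W_c - \tilde{W}_c) \\
={}& \underbrace{W_c^T \nabla\sigma h(\hat{x}) + Q(\hat{x}) - \frac{1}{4} W_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T W_c}_{-\varepsilon_{HJB}} \\
& - \tilde{W}_c^T \nabla\sigma h(\hat{x}) + \frac{1}{2}\tilde{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T W_c - \frac{1}{4}\tilde{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \tilde{W}_c \\
={}& -\tilde{W}_c^T \nabla\sigma h(\hat{x}) + \tilde{W}_c^T \nabla\sigma g(\hat{x})\Big({-u^*(\hat{x})} - \frac{1}{2} R^{-1} g^T(\hat{x}) \nabla\varepsilon_{N_1}\Big) \\
& - \frac{1}{4}\tilde{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \tilde{W}_c - \varepsilon_{HJB} \\
={}& -\tilde{W}_c^T \nabla\sigma \Big(\xi(\hat{x}) + \frac{1}{2} G(\hat{x}) \nabla\varepsilon_{N_1}\Big) - \frac{1}{4}\tilde{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \tilde{W}_c - \varepsilon_{HJB},
\end{aligned} \tag{7.2.34}$$
where $\xi(\hat{x}) = h(\hat{x}) + g(\hat{x})u^*(\hat{x})$.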


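To obtain the minimum value of $e_1$, one often chooses $\hat{W}_c$ to minimize the squared residual error $E = \frac{1}{2}e_1^2$. Using the gradient descent algorithm and (7.2.33), the weight tuning law for the critic NN is generally given as [3, 29, 37]
$$\begin{aligned}
\dot{\hat{W}}_c &= -\frac{\eta_1}{(1 + \phi^T\phi)^2}\frac{\partial E}{\partial \hat{W}_c} \\
&= -\frac{\eta_1}{(1 + \phi^T\phi)^2}\frac{\partial e_1}{\partial \hat{W}_c}\, e_1 \\
&= -\eta_1 \frac{\phi}{(1 + \phi^T\phi)^2}\, e_1,
\end{aligned} \tag{7.2.35}$$
where $\phi = \nabla\sigma\big(h(\hat{x}) + g(\hat{x})\hat{u}(\hat{x})\big)$, $\eta_1 > 0$ is a design constant, and the term $(1 + \phi^T\phi)^2$ is employed for normalization.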
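The normalized gradient law (7.2.35) can be sketched as one explicit-Euler step. This is a minimal illustration rather than the authors' implementation: the function name `critic_gradient_step`, the step size `dt`, the quadratic basis, and every numerical value below (including the drift `h` and input matrix `g`) are assumptions chosen only so the snippet runs.

```python
import numpy as np

def critic_gradient_step(w_hat, grad_sigma, h, g, q_cost, r_inv, eta, dt):
    """One explicit-Euler step of the normalized gradient law (7.2.35)."""
    G = g @ r_inv @ g.T                                  # G = g R^{-1} g^T
    u_hat = -0.5 * r_inv @ g.T @ grad_sigma.T @ w_hat    # estimated control (7.2.32)
    # Approximate Hamiltonian e1 from (7.2.33).
    e1 = (w_hat @ grad_sigma @ h + q_cost
          - 0.25 * w_hat @ grad_sigma @ G @ grad_sigma.T @ w_hat)
    phi = grad_sigma @ (h + g @ u_hat)                   # regressor, equals d e1 / d w_hat
    w_dot = -eta * phi / (1.0 + phi @ phi) ** 2 * e1
    return w_hat + dt * w_dot, e1

# Hypothetical data at a single state, with basis sigma(x) = [x1^2, x2^2, x1*x2].
x = np.array([1.0, -1.0])
grad_sigma = np.array([[2 * x[0], 0.0], [0.0, 2 * x[1]], [x[1], x[0]]])
h = np.array([-2.0, -1.25])   # assumed drift value at x
g = np.array([[0.0], [1.5]])  # assumed input matrix at x
r_inv = np.array([[1.0]])
w0 = np.zeros(3)
w1, e_before = critic_gradient_step(w0, grad_sigma, h, g, float(x @ x), r_inv, eta=2.5, dt=0.01)
_, e_after = critic_gradient_step(w1, grad_sigma, h, g, float(x @ x), r_inv, eta=2.5, dt=0.01)
```

For a small step the residual magnitude shrinks from `e_before` to `e_after`, which is all the gradient law promises at a single state. It says nothing about closed-loop stability, which is the first issue raised next.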

However, there exist two issues about the tuning rule (7.2.35):

(i) If the initial control is not admissible, then tuning the critic NN weights based on
(7.2.35) to minimize E = (1/2)e1T e1 may not guarantee the stability of system
(7.2.21) during the critic NN learning process.
(ii) The persistence of excitation (PE) of φ/(1 + φ T φ) is required to guarantee the
weights of the critic NN to converge to actual optimal values
[3, 7, 26, 29, 36, 37]. Nevertheless, the PE condition is often intractable to
verify. In addition, a small exploratory signal is often added to the control input
in order to satisfy the PE condition, which might cause instability of the closed-
loop system during the implementation of the algorithm.

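To address the above two issues, a novel weight update law for the critic NN is developed as follows:
$$\begin{aligned}
\dot{\hat{W}}_c ={}& -\eta_1 \bar{\phi}\Big(\Xi(\hat{x}) - \frac{1}{4}\hat{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \hat{W}_c\Big) \\
& - \eta_1 \sum_{j=1}^{N_1} \bar{\phi}_{(j)}\Big(\Xi(\hat{x}_{t_j}) - \frac{1}{4}\hat{W}_c^T \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T \hat{W}_c\Big) \\
& + \frac{\eta_1}{2}\Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}},
\end{aligned} \tag{7.2.36}$$
where $\Xi(\hat{x}) = \hat{W}_c^T \nabla\sigma h(\hat{x}) + Q(\hat{x})$, $G(\hat{x})$ is given in (7.2.33), $\bar{\phi} = \phi/m_s^2$, $m_s = 1 + \phi^T\phi$, $j \in \{1, \ldots, N_1\}$ denotes the index of a stored/recorded data point $\hat{x}(t_j)$ (written as $\hat{x}_{t_j}$), $\bar{\phi}_{(j)} = \bar{\phi}(\hat{x}_{t_j})$, $m_{sj} = 1 + \phi^T(\hat{x}_{t_j})\phi(\hat{x}_{t_j})$, $\nabla\sigma_{(j)} = \nabla\sigma(\hat{x}_{t_j})$, $L_{2\hat{x}}$ is defined as in Assumption 7.2.5, and $\Sigma(\hat{x}, \hat{u})$ is given as
$$\Sigma(\hat{x}, \hat{u}) = \begin{cases} 0, & \text{if } L_{2\hat{x}}^T\Big(h(\hat{x}) - \dfrac{1}{2} G(\hat{x}) \nabla\sigma^T \hat{W}_c\Big) < 0, \\ 1, & \text{otherwise.} \end{cases} \tag{7.2.37}$$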

Remark 7.2.3 Several notes about the weight tuning rule (7.2.36) are listed as follows:
(a) The first term in (7.2.36) shares the same feature as (7.2.35), which aims to minimize the objective function $E = (1/2)e_1^T e_1$.
(b) The second term in (7.2.36) is utilized to relax the PE condition. If there is no second term in (7.2.36), one shall find $\dot{\hat{W}}_c = 0$ when $\hat{x} = 0$. That is, the weight of the critic NN will not be updated. In this circumstance, the critic NN might not converge to the optimal weights. Therefore, the PE condition is often employed [3, 7, 26, 29, 36, 37]. Interestingly, the second term given in (7.2.36) can also avoid this issue as long as the set $\{\bar{\phi}_{(j)}\}_1^{N_1}$ is selected to be linearly independent. Now, we show this fact by contradiction. Suppose that $\dot{\hat{W}}_c = 0$ when $\hat{x} = 0$. From (7.2.36), we obtain
$$\sum_{j=1}^{N_1} \bar{\phi}_{(j)} e_{1j} = 0,$$
where
$$e_{1j} = \Xi(\hat{x}_{t_j}) - \frac{1}{4}\hat{W}_c^T \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T \hat{W}_c.$$
Since $\{\bar{\phi}_{(j)}\}_1^{N_1}$ is linearly independent, we can obtain $e_{1j} = 0$ ($j = 1, \ldots, N_1$). However, this case will not happen until the system state stays at the equilibrium point, for the points $\hat{x}_{t_j}$, $j \in \{1, \ldots, N_1\}$, are randomly selected. In other words, there exists at least one $j_0 \in \{1, \ldots, N_1\}$ such that $e_{1j_0} \neq 0$ during the critic NN learning process. So, there is a contradiction. Hence, the second term given in (7.2.36) guarantees that $\dot{\hat{W}}_c \neq 0$ during the critic NN learning process.
(c) The last term in (7.2.36) is employed to ensure stability of the closed-loop system while the critic NN learns optimal weights. Consider system (7.2.21) with the control given in (7.2.32). In this case, we denote the derivative of the Lyapunov function candidate by $\Theta$, and it can be calculated as
$$\begin{aligned}
\Theta = \dot{L}_2(\hat{x}) &= L_{2\hat{x}}^T \dot{\hat{x}}(t) \\
&= L_{2\hat{x}}^T \big(h(\hat{x}) + g(\hat{x})\hat{u}\big) \\
&= L_{2\hat{x}}^T \Big(h(\hat{x}) - \frac{1}{2} g(\hat{x}) R^{-1} g^T(\hat{x}) \nabla\sigma^T \hat{W}_c\Big) \\
&= L_{2\hat{x}}^T \Big(h(\hat{x}) - \frac{1}{2} G(\hat{x}) \nabla\sigma^T \hat{W}_c\Big),
\end{aligned} \tag{7.2.38}$$
where $G(\hat{x})$ is defined in (7.2.33). If the closed-loop system is unstable, then there exists $\Theta > 0$. To keep the closed-loop system stable (i.e., $\Theta < 0$), we calculate the gradient of $\Theta$ as
$$-\eta_1 \frac{\partial \Theta}{\partial \hat{W}_c} = -\eta_1 \frac{\partial \Big[L_{2\hat{x}}^T \Big(h(\hat{x}) - \frac{1}{2} G(\hat{x}) \nabla\sigma^T \hat{W}_c\Big)\Big]}{\partial \hat{W}_c} = \frac{\eta_1}{2} \nabla\sigma G(\hat{x}) L_{2\hat{x}}. \tag{7.2.39}$$
Equation (7.2.39) shows the reason that we employ the last term of (7.2.36). In fact, observing the definition of $\Sigma(\hat{x}, \hat{u})$ given in (7.2.37), we find that if system (7.2.21) is stable (i.e., $\Theta < 0$), then $\Sigma(\hat{x}, \hat{u}) = 0$ and the last term given in (7.2.36) disappears. If system (7.2.21) is unstable, then $\Sigma(\hat{x}, \hat{u}) = 1$ and the last term given in (7.2.36) is activated. Due to the existence of the last term of (7.2.36), an initial stabilizing control law is not required for system (7.2.21). This property shall be shown in the subsequent simulation study.
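The three terms of (7.2.36) discussed in this remark can be sketched as a single function returning the right-hand side of the update law. This is a minimal sketch under stated assumptions: the function names, the tuple layout `(grad_sigma, h, q_cost, G)` for a data point, and all numerical values are hypothetical and are not taken from the text.

```python
import numpy as np

def bellman_residual(w_hat, grad_sigma, h, q_cost, G):
    """Xi(x) - (1/4) w^T grad_sigma G grad_sigma^T w, the bracket in (7.2.36)."""
    return (w_hat @ grad_sigma @ h + q_cost
            - 0.25 * w_hat @ grad_sigma @ G @ grad_sigma.T @ w_hat)

def normalized_regressor(w_hat, grad_sigma, h, G):
    """phi_bar = phi / (1 + phi^T phi)^2 with phi = grad_sigma (h + g u_hat)."""
    phi = grad_sigma @ (h - 0.5 * G @ grad_sigma.T @ w_hat)
    return phi / (1.0 + phi @ phi) ** 2

def critic_w_dot(w_hat, current, recorded, lyap_grad, eta):
    """Right-hand side of (7.2.36); lyap_grad is L_{2x} at the current state."""
    grad_sigma, h, q_cost, G = current
    # First term: instantaneous data.
    w_dot = (-eta * normalized_regressor(w_hat, grad_sigma, h, G)
             * bellman_residual(w_hat, grad_sigma, h, q_cost, G))
    # Second term: recorded data points.
    for gs_j, h_j, q_j, G_j in recorded:
        w_dot -= (eta * normalized_regressor(w_hat, gs_j, h_j, G_j)
                  * bellman_residual(w_hat, gs_j, h_j, q_j, G_j))
    # Third term: active only when Theta in (7.2.38) is not negative.
    theta = lyap_grad @ (h - 0.5 * G @ grad_sigma.T @ w_hat)
    indicator = 0.0 if theta < 0.0 else 1.0          # Sigma(x, u_hat) in (7.2.37)
    return w_dot + 0.5 * eta * indicator * grad_sigma @ G @ lyap_grad

# Hypothetical data: the current state is the origin, one recorded point is not.
x_rec = np.array([1.0, -1.0])
gs_rec = np.array([[2 * x_rec[0], 0.0], [0.0, 2 * x_rec[1]], [x_rec[1], x_rec[0]]])
G_rec = np.array([[0.0, 0.0], [0.0, 2.25]])
recorded = [(gs_rec, np.array([-2.0, -1.25]), float(x_rec @ x_rec), G_rec)]
origin = (np.zeros((3, 2)), np.zeros(2), 0.0, G_rec)
w0 = np.zeros(3)
w_dot_without = critic_w_dot(w0, origin, [], np.zeros(2), eta=2.5)
w_dot_with = critic_w_dot(w0, origin, recorded, np.zeros(2), eta=2.5)
```

At the origin the instantaneous term alone leaves the weights frozen (`w_dot_without` is zero), whereas the recorded point keeps the update alive (`w_dot_with` is nonzero), mirroring note (b).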

By Remark 7.2.3, the set $\{\bar{\phi}_{(j)}\}_1^{N_1}$ should be linearly independent. Nevertheless, it is not an easy task to directly check this condition. Hence, we introduce a lemma as follows.

Lemma 7.2.1 If the set $\{\sigma(\hat{x}_{t_j})\}_1^{N_1}$ is linearly independent and $\hat{u}(\hat{x})$ stabilizes system (7.2.21), then the following set
$$\big\{\nabla\sigma_{(j)}\big(h(\hat{x}_{t_j}) + g(\hat{x}_{t_j})\hat{u}\big)\big\}_1^{N_1}$$
is also linearly independent.

Since the proof is similar to [2], we omit it here.

Notice that
$$\bar{\phi}_{(j)} = \frac{\phi(\hat{x}_{t_j})}{\big(1 + \phi^T(\hat{x}_{t_j})\phi(\hat{x}_{t_j})\big)^2},$$
where
$$\phi(\hat{x}_{t_j}) = \nabla\sigma_{(j)}\big(h(\hat{x}_{t_j}) + g(\hat{x}_{t_j})\hat{u}\big).$$
By Lemma 7.2.1, we find that if $\{\sigma(\hat{x}_{t_j})\}_1^{N_1}$ is linearly independent, then $\{\bar{\phi}_{(j)}\}_1^{N_1}$ is also linearly independent. That is, to ensure that $\{\bar{\phi}_{(j)}\}_1^{N_1}$ is linearly independent, the following condition should be satisfied.

Condition 7.2.1 Let $D = [\sigma(\hat{x}_{t_1}), \ldots, \sigma(\hat{x}_{t_{N_1}})] \in \mathbb{R}^{N_1 \times N_1}$ be the recorded data matrix. There exists a sufficiently large amount of recorded data such that $D$ can be chosen nonsingular, that is, $\det D \neq 0$.

Remark 7.2.4 Condition 7.2.1 can be satisfied by selecting and recording data dur-
ing the learning process of NNs over a finite time interval. Compared with the PE
condition, a clear advantage of Condition 7.2.1 is that it can easily be checked online
[5]. In addition, Condition 7.2.1 makes full use of history data, which can improve the
speed of convergence of parameters. This feature will be also shown in the simulation
study.
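Condition 7.2.1 is a rank test on the recorded data matrix, so it can be checked numerically while data are being collected. The sketch below uses a greedy rule that keeps a state only if it raises the rank of $D$. That rule, the function names, and the candidate states are assumptions made for illustration and are not the selection method of [5]. The quadratic basis is the one used later in the simulation study.

```python
import numpy as np

def sigma(x):
    """Quadratic critic basis [x1^2, x2^2, x1*x2]."""
    return np.array([x[0] ** 2, x[1] ** 2, x[0] * x[1]])

def select_recorded_points(candidates, n_basis, tol=1e-8):
    """Greedily keep states whose basis vectors raise the rank of D."""
    kept = []
    for x in candidates:
        trial = np.column_stack([sigma(p) for p in kept + [x]])
        if np.linalg.matrix_rank(trial, tol=tol) == len(kept) + 1:
            kept.append(x)
        if len(kept) == n_basis:
            break
    return kept

# Hypothetical visited states; the second one is redundant with the first.
candidates = [np.array([1.0, 0.0]), np.array([2.0, 0.0]),
              np.array([0.0, 1.0]), np.array([1.0, 1.0])]
kept = select_recorded_points(candidates, n_basis=3)
D = np.column_stack([sigma(p) for p in kept])   # recorded data matrix
det_D = np.linalg.det(D)
```

Here the state `[2, 0]` is rejected because its basis vector is a multiple of the one already stored, and the three retained points give a nonsingular `D`.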

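By the definition of $\phi$ given in (7.2.35) and using (7.2.28), (7.2.31), and (7.2.32), we have
$$\begin{aligned}
\phi &= \nabla\sigma\big(h(\hat{x}) + g(\hat{x})\hat{u}(\hat{x})\big) \\
&= \nabla\sigma\Big(h(\hat{x}) - \frac{1}{2} g(\hat{x}) R^{-1} g^T(\hat{x}) \nabla\sigma^T (W_c - \tilde{W}_c)\Big) \\
&= \nabla\sigma\Big(\xi(\hat{x}) + \frac{1}{2} G(\hat{x}) \nabla\varepsilon_{N_1}\Big) + \frac{1}{2}\nabla\sigma G(\hat{x}) \nabla\sigma^T \tilde{W}_c,
\end{aligned} \tag{7.2.40}$$
where $\xi(\hat{x})$ is given in (7.2.34). From (7.2.31), (7.2.34), (7.2.36), and (7.2.40), we obtain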
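$$\begin{aligned}
\dot{\tilde{W}}_c ={}& \eta_1 \bar{\phi}\Big(\Xi(\hat{x}) - \frac{1}{4}\hat{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \hat{W}_c\Big) \\
& + \eta_1 \sum_{j=1}^{N_1} \bar{\phi}_{(j)}\Big(\Xi(\hat{x}_{t_j}) - \frac{1}{4}\hat{W}_c^T \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T \hat{W}_c\Big) - \frac{\eta_1}{2}\Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}} \\
={}& \eta_1 \frac{\phi}{m_s^2}\Big(\hat{W}_c^T \nabla\sigma h(\hat{x}) + Q(\hat{x}) - \frac{1}{4}\hat{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \hat{W}_c\Big) \\
& + \eta_1 \sum_{j=1}^{N_1} \frac{\phi_{(j)}}{m_{sj}^2}\Big(\hat{W}_c^T \nabla\sigma_{(j)} h(\hat{x}_{t_j}) + Q(\hat{x}_{t_j}) - \frac{1}{4}\hat{W}_c^T \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T \hat{W}_c\Big) \\
& - \frac{\eta_1}{2}\Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}} \\
={}& \eta_1 \frac{\phi}{m_s^2}\Big(W_c^T \nabla\sigma h(\hat{x}) + Q(\hat{x}) - \frac{1}{4} W_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T W_c \\
& \qquad - \tilde{W}_c^T \nabla\sigma h(\hat{x}) - \frac{1}{4}\tilde{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \tilde{W}_c + \frac{1}{2}\tilde{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T W_c\Big) \\
& + \eta_1 \sum_{j=1}^{N_1} \frac{\phi_{(j)}}{m_{sj}^2}\Big(W_c^T \nabla\sigma_{(j)} h(\hat{x}_{t_j}) + Q(\hat{x}_{t_j}) - \frac{1}{4} W_c^T \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T W_c \\
& \qquad - \tilde{W}_c^T \nabla\sigma_{(j)} h(\hat{x}_{t_j}) - \frac{1}{4}\tilde{W}_c^T \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T \tilde{W}_c + \frac{1}{2}\tilde{W}_c^T \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T W_c\Big) \\
& - \frac{\eta_1}{2}\Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}} \\
={}& -\frac{\eta_1}{m_s^2}\Big[\nabla\sigma\Big(\xi(\hat{x}) + \frac{1}{2} G(\hat{x}) \nabla\varepsilon_{N_1}\Big) + \frac{1}{2}\nabla\sigma G(\hat{x}) \nabla\sigma^T \tilde{W}_c\Big] \\
& \quad \times \Big[\tilde{W}_c^T \nabla\sigma h(\hat{x}) + \tilde{W}_c^T \nabla\sigma g(\hat{x})\Big(u^*(\hat{x}) + \frac{1}{2} R^{-1} g^T(\hat{x}) \nabla\varepsilon_{N_1}\Big) + \frac{1}{4}\tilde{W}_c^T \nabla\sigma G(\hat{x}) \nabla\sigma^T \tilde{W}_c + \varepsilon_{HJB}\Big] \\
& - \sum_{j=1}^{N_1} \frac{\eta_1}{m_{sj}^2}\Big[\nabla\sigma_{(j)}\Big(\xi(\hat{x}_{t_j}) + \frac{1}{2} G(\hat{x}_{t_j}) \nabla\varepsilon_{N_1(j)}\Big) + \frac{1}{2}\nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T \tilde{W}_c\Big] \\
& \quad \times \Big[\tilde{W}_c^T \nabla\sigma_{(j)} h(\hat{x}_{t_j}) + \tilde{W}_c^T \nabla\sigma_{(j)} g(\hat{x}_{t_j})\Big(u^*(\hat{x}_{t_j}) + \frac{1}{2} R^{-1} g^T(\hat{x}_{t_j}) \nabla\varepsilon_{N_1(j)}\Big) \\
& \qquad + \frac{1}{4}\tilde{W}_c^T \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T \tilde{W}_c + \varepsilon_{HJB}\Big] - \frac{\eta_1}{2}\Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}} \\
={}& -\frac{\eta_1}{m_s^2}\Big(\nabla\sigma \zeta(\hat{x}) + \frac{1}{2}\bar{G}(\hat{x})\tilde{W}_c\Big)\Big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x}) + \frac{1}{4}\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c + \varepsilon_{HJB}\Big) \\
& - \sum_{j=1}^{N_1} \frac{\eta_1}{m_{sj}^2}\Big(\nabla\sigma_{(j)} \zeta(\hat{x}_{t_j}) + \frac{1}{2}\bar{G}(\hat{x}_{t_j})\tilde{W}_c\Big)\Big(\tilde{W}_c^T \nabla\sigma_{(j)} \zeta(\hat{x}_{t_j}) + \frac{1}{4}\tilde{W}_c^T \bar{G}(\hat{x}_{t_j})\tilde{W}_c + \varepsilon_{HJB}\Big) \\
& - \frac{\eta_1}{2}\Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}},
\end{aligned} \tag{7.2.41}$$
where
$$\zeta(\hat{x}) = \xi(\hat{x}) + \frac{1}{2} G(\hat{x}) \nabla\varepsilon_{N_1}, \quad \bar{G}(\hat{x}) = \nabla\sigma G(\hat{x}) \nabla\sigma^T, \quad \bar{G}(\hat{x}_{t_j}) = \nabla\sigma_{(j)} G(\hat{x}_{t_j}) \nabla\sigma_{(j)}^T.$$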
The schematic diagram of the present control algorithm is shown in Fig. 7.1.

7.2.2 Stability Analysis of Closed-Loop System

We start with an assumption and then present the stability analysis result.

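Assumption 7.2.6 There exist known constants $b_\sigma > 0$ and $\varepsilon_{b1} > 0$ such that $\|\nabla\sigma(\hat{x})\| < b_\sigma$ and $\|\nabla\varepsilon_{N_1}(\hat{x})\| < \varepsilon_{b1}$ for every $\hat{x} \in \Omega$.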

Fig. 7.1 The schematic diagram of the identifier–critic architecture



Theorem 7.2.2 Given the input-affine nonlinear dynamics described by (7.2.1) with
associated HJB equation (7.2.25), let Assumptions 7.2.1–7.2.6 hold and take the
control input for system (7.2.1) as in (7.2.32). Moreover, let weight update laws of
the identifier NN be (7.2.8) and (7.2.9), and let weight tuning rule for the critic NN
be (7.2.36). Then, the identification error x̃(t), and the weight estimation errors W̃m ,
Ỹm , and W̃c are all UUB.

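Proof Consider the Lyapunov function candidate
$$L(x) = L_1(x) + L_2(\hat{x}) + \frac{1}{2}\tilde{W}_c^T \eta_1^{-1} \tilde{W}_c, \tag{7.2.42}$$
where $L_1(x)$ is defined in (7.2.12) and $L_2(\hat{x})$ is defined as in Assumption 7.2.5.

Taking the time derivative of (7.2.42) using (7.2.19) and (7.2.38), we derive
$$\begin{aligned}
\dot{L}(x) &= \dot{L}_1(x) + \dot{L}_2(\hat{x}) + \tilde{W}_c^T \eta_1^{-1} \dot{\tilde{W}}_c \\
&\le -\Big(\frac{\theta}{2}\|\tilde{x}\| - \varsigma\Big)\|\tilde{x}\| + L_{2\hat{x}}^T\Big(h(\hat{x}) - \frac{1}{2} G(\hat{x}) \nabla\sigma^T \hat{W}_c\Big) + \tilde{W}_c^T \eta_1^{-1} \dot{\tilde{W}}_c,
\end{aligned} \tag{7.2.43}$$
where $\varsigma$ is given in (7.2.20). Using (7.2.41), the last term of (7.2.43) is developed as
$$\tilde{W}_c^T \eta_1^{-1} \dot{\tilde{W}}_c = \mathcal{N}_1 + \mathcal{N}_2 - \frac{1}{2}\tilde{W}_c^T \Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}}, \tag{7.2.44}$$
where
$$\mathcal{N}_1 = -\frac{1}{m_s^2}\Big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x}) + \frac{1}{2}\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c\Big)\Big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x}) + \frac{1}{4}\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c + \varepsilon_{HJB}\Big),$$
$$\mathcal{N}_2 = -\sum_{j=1}^{N_1} \frac{1}{m_{sj}^2}\Big(\tilde{W}_c^T \nabla\sigma_{(j)} \zeta(\hat{x}_{t_j}) + \frac{1}{2}\tilde{W}_c^T \bar{G}(\hat{x}_{t_j})\tilde{W}_c\Big)\Big(\tilde{W}_c^T \nabla\sigma_{(j)} \zeta(\hat{x}_{t_j}) + \frac{1}{4}\tilde{W}_c^T \bar{G}(\hat{x}_{t_j})\tilde{W}_c + \varepsilon_{HJB}\Big).$$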
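Now, we consider the first term $\mathcal{N}_1$. We have
$$\begin{aligned}
\mathcal{N}_1 = -\frac{1}{m_s^2}\Big[&\big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x})\big)^2 + \frac{1}{8}\big(\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c\big)^2 + \frac{3}{4}\big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x})\big)\big(\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c\big) \\
& + \tilde{W}_c^T \nabla\sigma \zeta(\hat{x})\,\varepsilon_{HJB} + \frac{1}{2}\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c\,\varepsilon_{HJB}\Big].
\end{aligned} \tag{7.2.45}$$
Applying (7.2.18) (with $\nu = 1$) to the last three terms of (7.2.45), it follows that
$$\begin{aligned}
\mathcal{N}_1 ={}& -\frac{1}{m_s^2}\Big[\frac{1}{2}\Big(3\tilde{W}_c^T \nabla\sigma \zeta(\hat{x}) + \frac{1}{4}\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c\Big)^2 + \frac{1}{2}\big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x}) + \varepsilon_{HJB}\big)^2 \\
& \qquad + \frac{1}{2}\Big(\frac{1}{4}\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c + 2\varepsilon_{HJB}\Big)^2 - 4\big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x})\big)^2 + \frac{1}{16}\big(\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c\big)^2 - \frac{5}{2}\varepsilon_{HJB}^2\Big] \\
\le{}& -\frac{1}{m_s^2}\Big[\frac{1}{16}\big(\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c\big)^2 - 4\big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x})\big)^2 - \frac{5}{2}\varepsilon_{HJB}^2\Big].
\end{aligned} \tag{7.2.46}$$
Similarly, we have
$$\mathcal{N}_2 \le -\sum_{j=1}^{N_1} \frac{1}{m_{sj}^2}\Big[\frac{1}{16}\big(\tilde{W}_c^T \bar{G}(\hat{x}_{t_j})\tilde{W}_c\big)^2 - 4\big(\tilde{W}_c^T \nabla\sigma_{(j)} \zeta(\hat{x}_{t_j})\big)^2 - \frac{5}{2}\varepsilon_{HJB}^2\Big]. \tag{7.2.47}$$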
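The coefficients in (7.2.46) can be checked term by term. The following worked expansion is a reader's sketch rather than part of the original proof; the shorthand $A$, $B$, and $e$ is introduced here only for brevity.

```latex
% Shorthand (introduced for this check only):
%   A = \tilde{W}_c^T \nabla\sigma \zeta(\hat{x}), \quad
%   B = \tilde{W}_c^T \bar{G}(\hat{x}) \tilde{W}_c, \quad
%   e = \varepsilon_{HJB}.
% Identity (7.2.18) with \nu = 1 reads ab = \tfrac12 (a+b)^2 - \tfrac12 a^2 - \tfrac12 b^2.
\begin{align*}
\tfrac{3}{4}AB &= (3A)\big(\tfrac14 B\big)
   = \tfrac12\big(3A + \tfrac14 B\big)^2 - \tfrac92 A^2 - \tfrac{1}{32}B^2, \\
Ae &= \tfrac12 (A + e)^2 - \tfrac12 A^2 - \tfrac12 e^2, \\
\tfrac12 Be &= \big(\tfrac14 B\big)(2e)
   = \tfrac12\big(\tfrac14 B + 2e\big)^2 - \tfrac{1}{32}B^2 - 2e^2, \\
A^2 + \tfrac18 B^2 + \tfrac34 AB + Ae + \tfrac12 Be
  &= \tfrac12\big(3A + \tfrac14 B\big)^2 + \tfrac12 (A + e)^2
     + \tfrac12\big(\tfrac14 B + 2e\big)^2
     - 4A^2 + \tfrac{1}{16}B^2 - \tfrac52 e^2 .
\end{align*}
```

Dropping the three nonnegative squares after multiplying by $-1/m_s^2$ gives the inequality in (7.2.46).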

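Noticing that the PE condition guarantees
$$m_s = 1 + \phi^T\phi$$
to be bounded, there exists a positive constant $\tau_0$ such that
$$\tau_0 \le \frac{1}{m_s^2} = \frac{1}{(1 + \phi^T\phi)^2} \le 1$$
and
$$\tau_0 \le \frac{1}{m_{sj}^2} = \frac{1}{\big(1 + \phi^T(\hat{x}_{t_j})\phi(\hat{x}_{t_j})\big)^2} \le 1.$$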

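Substituting (7.2.46) and (7.2.47) into (7.2.44), we obtain
$$\begin{aligned}
\tilde{W}_c^T \eta_1^{-1} \dot{\tilde{W}}_c \le{}& -\frac{1}{16}\Big[\sum_{j=1}^{N_1} \frac{1}{m_{sj}^2}\big(\tilde{W}_c^T \bar{G}(\hat{x}_{t_j})\tilde{W}_c\big)^2 + \frac{1}{m_s^2}\big(\tilde{W}_c^T \bar{G}(\hat{x})\tilde{W}_c\big)^2\Big] \\
& + 4\Big[\sum_{j=1}^{N_1} \frac{1}{m_{sj}^2}\big(\tilde{W}_c^T \nabla\sigma_{(j)} \zeta(\hat{x}_{t_j})\big)^2 + \frac{1}{m_s^2}\big(\tilde{W}_c^T \nabla\sigma \zeta(\hat{x})\big)^2\Big] \\
& + \frac{5}{2}\Big[\frac{1}{m_s^2} + \sum_{j=1}^{N_1} \frac{1}{m_{sj}^2}\Big]\varepsilon_{HJB}^2 - \frac{1}{2}\tilde{W}_c^T \Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}} \\
\le{}& -\frac{\tau_0}{16}\Big[\sum_{j=1}^{N_1} \mu_{\inf}^2\big(\bar{G}(\hat{x}_{t_j})\big) + \mu_{\inf}^2\big(\bar{G}(\hat{x})\big)\Big]\|\tilde{W}_c\|^4 \\
& + 4 b_\sigma^2\Big[\sum_{j=1}^{N_1} \vartheta_{\sup}^2\big(\zeta(\hat{x}_{t_j})\big) + \vartheta_{\sup}^2\big(\zeta(\hat{x})\big)\Big]\|\tilde{W}_c\|^2 \\
& + \frac{5}{2}(N_1 + 1)\varepsilon_{a1}^2 - \frac{1}{2}\tilde{W}_c^T \Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}},
\end{aligned} \tag{7.2.48}$$
where $\mu_{\inf}(Y)$ denotes the lower bound of $Y$ ($Y = \bar{G}(\hat{x}_{t_j}), \bar{G}(\hat{x})$), $\vartheta_{\sup}(Z)$ represents the upper bound of $Z$ ($Z = \zeta(\hat{x}_{t_j}), \zeta(\hat{x})$), $N_1$ is the number of nodes in the hidden layer of the critic NN, and $\|\varepsilon_{HJB}\| \le \varepsilon_{a1}$ as assumed earlier in (7.2.29).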
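Combining (7.2.43) and (7.2.48), we derive
$$\begin{aligned}
\dot{L}(x) \le{}& L_{2\hat{x}}^T\Big(h(\hat{x}) - \frac{1}{2} G(\hat{x}) \nabla\sigma^T \hat{W}_c\Big) - \frac{1}{2}\tilde{W}_c^T \Sigma(\hat{x}, \hat{u}) \nabla\sigma G(\hat{x}) L_{2\hat{x}} \\
& - \frac{T_1}{16}\|\tilde{W}_c\|^4 + 4T_2\|\tilde{W}_c\|^2 - \frac{\theta}{2}\Big(\|\tilde{x}\| - \frac{\varsigma}{\theta}\Big)^2 + \frac{\varsigma^2}{2\theta} + \frac{5}{2}(N_1 + 1)\varepsilon_{a1}^2,
\end{aligned} \tag{7.2.49}$$
where
$$T_1 = \Big[\mu_{\inf}^2\big(\bar{G}(\hat{x})\big) + \sum_{j=1}^{N_1} \mu_{\inf}^2\big(\bar{G}(\hat{x}_{t_j})\big)\Big]\tau_0$$
and
$$T_2 = b_\sigma^2 \vartheta_{\sup}^2\big(\zeta(\hat{x})\big) + \sum_{j=1}^{N_1} b_\sigma^2 \vartheta_{\sup}^2\big(\zeta(\hat{x}_{t_j})\big).$$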

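In view of the definition of $\Sigma(\hat{x}, \hat{u})$ in (7.2.37), we divide (7.2.49) into the following two cases for discussion.
Case 1: $\Sigma(\hat{x}, \hat{u}) = 0$. In this case, we can derive that the first term in (7.2.49) is negative. Observing the fact that $L_{2\hat{x}}^T \dot{\hat{x}} < 0$ and by using the dense property of $\mathbb{R}$ [28], we can conclude that there exists a constant $\tau > 0$ such that
$$L_{2\hat{x}}^T \dot{\hat{x}} < -\tau \|L_{2\hat{x}}\| \le 0.$$
Then, (7.2.49) becomes
$$\begin{aligned}
\dot{L}(x) \le{}& -\tau \|L_{2\hat{x}}\| - \frac{\theta}{2}\Big(\|\tilde{x}\| - \frac{\varsigma}{\theta}\Big)^2 - \frac{T_1}{16}\Big(\|\tilde{W}_c\|^2 - \frac{32T_2}{T_1}\Big)^2 \\
& + \frac{64T_2^2}{T_1} + \frac{1}{2\theta}\big[\varsigma^2 + 5\theta(N_1 + 1)\varepsilon_{a1}^2\big].
\end{aligned} \tag{7.2.50}$$
Hence, (7.2.50) yields $\dot{L}(x) < 0$ as long as one of the following conditions holds:
$$\|L_{2\hat{x}}\| > \frac{64T_2^2}{\tau T_1} + \frac{\varsigma^2 + 5\theta(N_1 + 1)\varepsilon_{a1}^2}{2\tau\theta}, \tag{7.2.51}$$
or
$$\|\tilde{x}\| > \sqrt{\frac{128T_2^2}{\theta T_1} + \frac{\varsigma^2 + 5\theta(N_1 + 1)\varepsilon_{a1}^2}{\theta^2}} + \frac{\varsigma}{\theta}, \tag{7.2.52}$$
or
$$\|\tilde{W}_c\| > \sqrt{\frac{\sqrt{8T_1\big[\varsigma^2/\theta + 5(N_1 + 1)\varepsilon_{a1}^2\big] + 1024T_2^2}}{T_1} + \frac{32T_2}{T_1}}. \tag{7.2.53}$$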

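Case 2: $\Sigma(\hat{x}, \hat{u}) = 1$. By the definition of $\Sigma(\hat{x}, \hat{u})$ given in (7.2.37), we find that, in this case, the first term in (7.2.49) is nonnegative, which implies that the control (7.2.32) may not stabilize system (7.2.21). Then, using (7.2.29) and (7.2.34), (7.2.49) becomes
$$\begin{aligned}
\dot{L}(x) \le{}& L_{2\hat{x}}^T\Big(\xi(\hat{x}) + \frac{1}{2} G(\hat{x}) \nabla\varepsilon_{N_1}\Big) - \frac{\theta}{2}\Big(\|\tilde{x}\| - \frac{\varsigma}{\theta}\Big)^2 + \frac{\varsigma^2}{2\theta} \\
& - \frac{T_1}{16}\Big(\|\tilde{W}_c\|^2 - \frac{32T_2}{T_1}\Big)^2 + \frac{64T_2^2}{T_1} + \frac{5}{2}(N_1 + 1)\varepsilon_{a1}^2.
\end{aligned} \tag{7.2.54}$$
Using Assumptions 7.2.5 and 7.2.6, (7.2.54) can be rewritten as
$$\begin{aligned}
\dot{L}(x) \le{}& -\lambda_{\min}\big(\Lambda_2(\hat{x})\big)\Big(\|L_{2\hat{x}}\| - \frac{\varepsilon_{b1}\vartheta_{\sup}\big(G(\hat{x})\big)}{4\lambda_{\min}\big(\Lambda_2(\hat{x})\big)}\Big)^2 - \frac{\theta}{2}\Big(\|\tilde{x}\| - \frac{\varsigma}{\theta}\Big)^2 - \frac{T_1}{16}\Big(\|\tilde{W}_c\|^2 - \frac{32T_2}{T_1}\Big)^2 \\
& + \frac{\varepsilon_{b1}^2 \vartheta_{\sup}^2\big(G(\hat{x})\big)}{16\lambda_{\min}\big(\Lambda_2(\hat{x})\big)} + \frac{64T_2^2}{T_1} + \frac{1}{2\theta}\big[\varsigma^2 + 5\theta(N_1 + 1)\varepsilon_{a1}^2\big],
\end{aligned} \tag{7.2.55}$$
where $\vartheta_{\sup}(\cdot)$ is defined as in (7.2.48).

Thus, (7.2.55) implies $\dot{L}(x) < 0$ as long as one of the following conditions holds:
$$\|L_{2\hat{x}}\| > \frac{\varepsilon_{b1}\vartheta_{\sup}\big(G(\hat{x})\big)}{4\lambda_{\min}\big(\Lambda_2(\hat{x})\big)} + \sqrt{\frac{d}{\lambda_{\min}\big(\Lambda_2(\hat{x})\big)}}, \tag{7.2.56}$$
or
$$\|\tilde{x}\| > \sqrt{\frac{2d}{\theta}} + \frac{\varsigma}{\theta}, \tag{7.2.57}$$
or
$$\|\tilde{W}_c\| > 2\sqrt{\frac{8T_2}{T_1} + \sqrt{\frac{d}{T_1}}}, \tag{7.2.58}$$
where
$$d = \frac{\varepsilon_{b1}^2 \vartheta_{\sup}^2\big(G(\hat{x})\big)}{16\lambda_{\min}\big(\Lambda_2(\hat{x})\big)} + \frac{64T_2^2}{T_1} + \frac{1}{2\theta}\big[\varsigma^2 + 5\theta(N_1 + 1)\varepsilon_{a1}^2\big].$$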
16λmin Λ2 (x̂) T1 2θ

Combining Case 1 and Case 2 and using the standard Lyapunov’s extension the-
orem [12, 13] (or the Lagrange stability result [22]), one can conclude that the state
identification error x̃(t) and NN weight estimation errors W̃m , Ỹm , and W̃c are all
UUB. This completes the proof.
Remark 7.2.5 The uniform ultimate boundedness of $\tilde{W}_m$ and $\tilde{Y}_m$ is obtained as follows: Inequalities (7.2.51)–(7.2.53) (or (7.2.56)–(7.2.58)) guarantee $\dot{L}(x) < 0$. Then, we can conclude that $L(x(t))$ is a strictly decreasing function of $t$ ($t \ge 0$). Hence, we can derive $L(x(t)) < L(x(0))$, where $L(x(0))$ is a bounded positive constant. Using $L(x)$ defined as in (7.2.42), we have
$$\frac{1}{2}\mathrm{tr}\big(\tilde{W}_m^T l_1^{-1} \tilde{W}_m\big) + \frac{1}{2}\mathrm{tr}\big(\tilde{Y}_m^T l_2^{-1} \tilde{Y}_m\big) < L(x(0)).$$
Using the definition of the Frobenius matrix norm [9, 16], we obtain that $\tilde{W}_m$ and $\tilde{Y}_m$ are UUB.

7.2.3 Simulation Study

Consider the continuous-time input-affine nonlinear system described by
$$\dot{x} = f(x) + g(x)u, \tag{7.2.59}$$
where
$$f(x) = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2 + 0.5x_2[\cos(2x_1) + 2]^2 \end{bmatrix}, \qquad g(x) = \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}.$$
The value function is defined as in (7.2.2), where $Q$ and $R$ are chosen as identity matrices of appropriate dimensions. The prior knowledge of $f(x)$ is assumed to be unavailable. To obtain the knowledge of system dynamics, an identifier NN given in (7.2.6) is employed. The gains for the identifier NN are selected as
$$A = \begin{bmatrix} -1 & 1 \\ -0.5 & -0.5 \end{bmatrix}, \qquad C = \begin{bmatrix} 1 & 0 \\ -0.5 & 0 \end{bmatrix},$$
$$l_1 = 20, \quad l_2 = 10, \quad \tau_1 = 6.1, \quad \tau_2 = 15, \quad N_0 = 8,$$
and the gain for the critic NN is selected as $\eta_1 = 2.5$. The activation function of the critic NN is chosen with $N_1 = 3$ neurons as $\sigma(x) = [x_1^2, x_2^2, x_1 x_2]^T$, and the critic NN weight is denoted as $\hat{W}_c = [\hat{W}_{c1}, \hat{W}_{c2}, \hat{W}_{c3}]^T$.
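The example system and the critic basis are simple enough to code directly, which allows the HJB residual (7.2.25) to be evaluated for any candidate weight vector. In the sketch below the true drift $f(x)$ stands in for the identifier output $h(\hat{x})$, and the candidate weights `[0.5, 1.0, 0.0]` are a test vector chosen here. Both are assumptions for illustration and are not values reported in the text.

```python
import numpy as np

def f(x):
    """Drift of system (7.2.59)."""
    c = np.cos(2.0 * x[0]) + 2.0
    return np.array([-x[0] + x[1], -0.5 * x[0] - 0.5 * x[1] + 0.5 * x[1] * c ** 2])

def g(x):
    """Input matrix of system (7.2.59), shape (2, 1)."""
    return np.array([[0.0], [np.cos(2.0 * x[0]) + 2.0]])

def grad_sigma(x):
    """Jacobian of sigma(x) = [x1^2, x2^2, x1*x2], shape (3, 2)."""
    return np.array([[2.0 * x[0], 0.0], [0.0, 2.0 * x[1]], [x[1], x[0]]])

def hjb_residual(w, x):
    """Left-hand side of (7.2.25) with V(x) = w^T sigma(x), Q = I, R = 1."""
    gs = grad_sigma(x)
    G = g(x) @ g(x).T
    return w @ gs @ f(x) + x @ x - 0.25 * w @ gs @ G @ gs.T @ w

# Assumed candidate weights, evaluated at the initial state used in the text
# and at two arbitrary test states.
w_candidate = np.array([0.5, 1.0, 0.0])
test_states = [np.array([3.5, -3.5]), np.array([1.0, 2.0]), np.array([-0.3, 0.7])]
residuals = [hjb_residual(w_candidate, p) for p in test_states]
```

For this candidate the residual vanishes at every test state up to rounding, so a function of this kind gives a quick way to judge how close the learned critic weights in a simulation run are to a solution of the HJB equation.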
Remark 7.2.6 It is important to emphasize that the number of neurons required for any particular application is still an open problem. Selecting the proper number of neurons for NNs is more of an art than a science [8, 27]. In this example, the number of neurons is obtained by computer simulations. We find that selecting eight neurons in the hidden layer for the identifier NN can lead to satisfactory simulation results. Meanwhile, in order to compare our algorithm with the algorithms proposed in [3, 29], we choose three neurons in the hidden layer for the critic NN, and the simulation results are satisfactory.
The initial weights $\hat{W}_m$ and $\hat{Y}_m$ for the identifier NN are selected randomly within the intervals $[-10, 10]$ and $[-5, 5]$, respectively. Meanwhile, the initial weights of the critic NN are chosen to be zeros, and the initial system state is set to $x_0 = [3.5, -3.5]^T$. In this circumstance, the initial control cannot stabilize system (7.2.59). In the present algorithm, no initial stabilizing control is required. Moreover, by using the method proposed in [5, 6], the recorded data can easily be made to satisfy Condition 7.2.1.

Fig. 7.2 System identification errors $\tilde{x}_1(t)$ and $\tilde{x}_2(t)$ (plotted against time, 0 to 15 s)

Computer simulation results are shown in Figs. 7.2–7.8. Figure 7.2 illustrates the performance of the system identification errors $\tilde{x}_1(t)$ and $\tilde{x}_2(t)$. Figure 7.3 shows the trajectories of system states $x(t)$. Figure 7.4 shows the Frobenius norms $\|\hat{W}_m\|$ and $\|\hat{Y}_m\|$ of the identifier NN weights. Figure 7.5 shows the convergence of the critic NN weights. Figure 7.6 shows the control $u$. Figure 7.7 illustrates the system states without considering the third term in (7.2.36), where the system becomes unstable.
To compare with [7], we use Fig. 7.8 to show the system states with the algorithm proposed in [7]. It should be mentioned that the PE condition is necessary in [7]. To guarantee the PE condition, a small exploratory signal $N(t) = \sin^5(t)\cos(t) + \sin^5(2t)\cos(0.1t)$ is added to the control $u$ for the first 9 s. That is, Fig. 7.8 is obtained with the exploratory signal added to the control input during the first 9 s. In addition, it is worth pointing out that by using the methods proposed in [3] and [29] and employing the small exploratory signal $N(t)$, one can also get stable system states. We omit the simulation results here since the results are similar to Fig. 7.8 [3, 29]. It is straightforward to notice that the trajectories of system states given in [3, 29] share a common feature with Fig. 7.8: they oscillate before converging to the equilibrium point. This is caused by the PE exploratory signal.
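For reference, the exploratory signal used in the comparison can be written as a small function. The name `exploratory_signal` and the choice to return exactly zero from 9 s onward are assumptions of this sketch; the text only states that the signal is added for the first 9 s.

```python
import math

def exploratory_signal(t, duration=9.0):
    """N(t) = sin^5(t) cos(t) + sin^5(2t) cos(0.1 t), switched off after `duration` seconds."""
    if t >= duration:
        return 0.0
    return math.sin(t) ** 5 * math.cos(t) + math.sin(2.0 * t) ** 5 * math.cos(0.1 * t)

# Sample the signal on a coarse grid over the 15 s simulation horizon.
samples = [exploratory_signal(0.05 * k) for k in range(301)]
```

Each of the two products is bounded by 1 in magnitude, so the probing input never exceeds 2 in absolute value.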
Several notes about the simulation results are listed as follows.

• From Fig. 7.2, it is observed that the identifier NN can approximate the real system very fast and well.
• From Fig. 7.3, one can observe that there are almost no oscillations of system state. As aforementioned, the PE signal always leads to oscillations of the system states (see Fig. 7.8 and the simulation results presented in [3, 7, 29]). This verifies that the restrictive PE condition is relaxed by using recorded and instantaneous data simultaneously. Hence, an advantage of the present algorithm as compared to the methods in [3, 7, 29] lies in that the PE condition is relaxed.
• From Figs. 7.4 and 7.5, we can find that the identifier NN and the critic NN are tuned simultaneously. Meanwhile, the estimated weights of the identifier NN and the critic NN are all guaranteed to be UUB.
• From Figs. 7.5 and 7.6, one can find that the initial weights of the critic NN and the initial control are all zeros. In this circumstance, the initial control cannot stabilize system (7.2.59). Nevertheless, the initial control must stabilize the system in [3, 29]. Therefore, in comparison with the methods proposed in [3, 29], a distinct advantage of the algorithm developed in this section lies in that the initial stabilizing control is not required any more.
• From Fig. 7.7, one shall find that without the third term in (7.2.36), the system is unstable during the learning process of the critic NN.

Fig. 7.3 Evolution of system state $x(t)$ (states $x_1(t)$ and $x_2(t)$ plotted against time, 0 to 15 s)

Fig. 7.4 Frobenius norm of identifier NN weights $\|\hat{W}_m\|$ and $\|\hat{Y}_m\|$ (plotted against time, 0 to 15 s)

Fig. 7.5 Convergence of critic NN weights ($\hat{W}_{c1}$, $\hat{W}_{c2}$, and $\hat{W}_{c3}$ plotted against time, 0 to 15 s)

Fig. 7.6 Control input $u$ corresponding to Fig. 7.3 (plotted against time, 0 to 15 s)

Fig. 7.7 Trajectories of states without considering the third term in (7.2.36) (states $x_1(t)$ and $x_2(t)$ plotted against time, 0 to 3.2 s)

Fig. 7.8 System states with the algorithm proposed in [7] (states $x_1(t)$ and $x_2(t)$ plotted against time, 0 to 15 s)

7.3 Online Optimal Control of Affine Nonlinear Systems with Constrained Inputs

Consider the continuous-time nonlinear systems described by the form
$$\dot{x}(t) = f(x(t)) + g(x(t))u(x(t)) \tag{7.3.1}$$
with the state $x(t) \in \mathbb{R}^n$ and the control input $u(t) \in \mathcal{U} \subset \mathbb{R}^m$, $\mathcal{U} = \{u \in \mathbb{R}^m : |u_i| \le \tau,\ i = 1, \ldots, m\}$, where $\tau > 0$ is the saturation bound, and $f(x(t)) \in \mathbb{R}^n$ and $g(x(t)) \in \mathbb{R}^{n \times m}$ are known continuous functions of the state $x(t)$. It is assumed that $f(0) = 0$ and $f(x) + g(x)u$ is Lipschitz continuous on $\Omega$ containing the origin, such that the solution of system (7.3.1) is unique for any given initial state $x_0 \in \Omega$ and control $u \in \mathcal{U}$. System (7.3.1) is stable in the sense that there exists a continuous control $u \in \mathcal{U}$ such that the system is UUB over $\Omega$.
The value function for system (7.3.1) is defined by
$$V(x(t)) = \int_t^{\infty} \big(Q(x(s)) + Y(u(s))\big)\, ds, \tag{7.3.2}$$
where $Q(x)$ is continuously differentiable and positive definite, i.e., $Q(x) \in C^1(\Omega)$, $Q(x) > 0$ for all $x \neq 0$ and $x = 0 \Leftrightarrow Q(x) = 0$, and $Y(u)$ is positive definite. To confront bounded controls in system (7.3.1) and inspired by the work of [1, 20, 24], we define $Y(u)$ as
$$Y(u) = 2\tau \int_0^u \Psi^{-T}(\upsilon/\tau)\, d\upsilon = 2\tau \sum_{i=1}^m \int_0^{u_i} \psi^{-1}(\upsilon_i/\tau)\, d\upsilon_i,$$
where $\Psi^{-1}(\upsilon/\tau) = \big[\psi^{-1}(\upsilon_1/\tau), \psi^{-1}(\upsilon_2/\tau), \ldots, \psi^{-1}(\upsilon_m/\tau)\big]^T$, $\Psi \in \mathbb{R}^m$, and $\Psi^{-T}$ denotes $(\Psi^{-1})^T$. Meanwhile, $\psi(\cdot)$ is a strictly monotonic odd function satisfying $|\psi(\cdot)| < 1$ and belonging to $C^p$ ($p \ge 1$) and $L_2(\Omega)$. It is significant to state that $Y(u)$ is positive definite since $\psi^{-1}(\cdot)$ is a monotonic odd function. Without loss of generality, in this section, we choose $\psi(\cdot) = \tanh(\cdot)$.
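For the choice $\psi = \tanh$ and a scalar input, the penalty $Y(u)$ has an elementary antiderivative, $2\tau u \tanh^{-1}(u/\tau) + \tau^2 \ln(1 - u^2/\tau^2)$, which can be checked against direct numerical integration. The sketch below is illustrative only: the function names, the midpoint rule, and the sample values are assumptions, and the closed form is a standard calculus result rather than a formula quoted from this page.

```python
import math

def control_penalty(u, tau):
    """Closed form of Y(u) = 2*tau * int_0^u atanh(v/tau) dv for scalar |u| < tau."""
    return 2.0 * tau * u * math.atanh(u / tau) + tau ** 2 * math.log(1.0 - (u / tau) ** 2)

def control_penalty_quadrature(u, tau, steps=20000):
    """Midpoint-rule approximation of the same integral."""
    dv = u / steps
    total = sum(math.atanh((k + 0.5) * dv / tau) for k in range(steps))
    return 2.0 * tau * total * dv

y_closed = control_penalty(0.5, 1.0)
y_numeric = control_penalty_quadrature(0.5, 1.0)
y_negative = control_penalty(-0.5, 1.0)
```

The two evaluations agree, and the penalty is even in $u$, zero at $u = 0$, and positive elsewhere inside the saturation bound, consistent with the positive definiteness claimed above.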
Given a control $u(x) \in \mathcal{A}(\Omega)$, if the associated value function $V(x) \in C^1(\Omega)$, the infinitesimal version of (7.3.2) is the so-called Lyapunov equation
$$V_x^T \big(f(x) + g(x)u\big) + Q(x) + 2\tau \int_0^u \tanh^{-T}(\upsilon/\tau)\, d\upsilon = 0,$$
where $V_x \in \mathbb{R}^n$ denotes the partial derivative of $V(x)$ with respect to $x$. Define the Hamiltonian for the control $u(x) \in \mathcal{A}(\Omega)$ and the associated value function $V(x)$ as
$$H(x, V_x, u) = V_x^T \big(f(x) + g(x)u\big) + Q(x) + 2\tau \int_0^u \tanh^{-T}(\upsilon/\tau)\, d\upsilon.$$
The optimal value $V^*(x)$ can be obtained by solving the HJB equation
$$\min_{u(x) \in \mathcal{A}(\Omega)} H(x, V_x^*, u) = H(x, V_x^*, u^*) = 0. \tag{7.3.3}$$
The closed-form expression for the constrained optimal control is
$$u^*(x) = -\tau \tanh\Big(\frac{1}{2\tau} g^T(x) V_x^*\Big). \tag{7.3.4}$$
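The saturating structure of (7.3.4) is easy to see numerically. In this minimal sketch the function name `constrained_control` and the input matrix and gradient values are hypothetical, chosen only to exercise the two regimes of the hyperbolic tangent.

```python
import numpy as np

def constrained_control(g_x, v_x, tau):
    """u = -tau * tanh(g^T V_x / (2 tau)), applied elementwise as in (7.3.4)."""
    return -tau * np.tanh(g_x.T @ v_x / (2.0 * tau))

g_x = np.array([[0.0], [1.5]])  # hypothetical input matrix with n = 2, m = 1
u_large = constrained_control(g_x, np.array([0.0, 400.0]), tau=1.0)
u_small = constrained_control(g_x, np.array([0.0, 1e-3]), tau=1.0)
```

A large value-function gradient drives the control to the bound $\pm\tau$ without exceeding it, while for a small gradient the law is practically indistinguishable from the linear expression $-\frac{1}{2} g^T V_x$.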

Substituting (7.3.4) into (7.3.3), the HJB equation for constrained nonlinear systems becomes
$$\big(V_x^*\big)^T f(x) - 2\tau^2 C^T(x) \tanh(C(x)) + Q(x) + 2\tau \int_0^{-\tau\tanh(C(x))} \tanh^{-T}(\upsilon/\tau)\, d\upsilon = 0, \tag{7.3.5}$$
where $C(x) = \frac{1}{2\tau} g^T(x) V_x^*$. Using the integration formulas of inverse hyperbolic functions, we note that
$$\begin{aligned}
2\tau \int_0^{-\tau\tanh(C(x))} \tanh^{-T}(\upsilon/\tau)\, d\upsilon
&= 2\tau \sum_{i=1}^m \int_0^{-\tau\tanh(C_i(x))} \tanh^{-1}(\upsilon_i/\tau)\, d\upsilon_i \\
&= 2\tau^2 C^T(x) \tanh(C(x)) + \tau^2 \sum_{i=1}^m \ln\big[1 - \tanh^2(C_i(x))\big],
\end{aligned}$$
where $C(x) = (C_1(x), \ldots, C_m(x))^T$ with $C_i(x) \in \mathbb{R}$, $i = 1, \ldots, m$. Then, we can rewrite (7.3.5) as
$$\big(V_x^*\big)^T f(x) + Q(x) + \tau^2 \sum_{i=1}^m \ln\big[1 - \tanh^2(C_i(x))\big] = 0. \tag{7.3.6}$$

Nevertheless, (7.3.6) is intractable to solve since it is a nonlinear partial differential equation with respect to $V^*(x)$. To overcome the difficulty, we develop an online control scheme to solve (7.3.6) based on ADP approaches. Before proceeding further, we provide the following required assumption.

Assumption 7.3.1 Assume that $L_3(x)$ is a continuously differentiable Lyapunov function candidate for system (7.3.1) and satisfies
$$\dot{L}_3(x) = L_{3x}^T \big(f(x) + g(x)u^*\big) < 0$$
with $L_{3x}$ the partial derivative of $L_3(x)$ with respect to $x$. Meanwhile, there exists a positive-definite matrix $\Lambda_2(x) \in \mathbb{R}^{n \times n}$ defined on $\Omega$ such that
$$L_{3x}^T \big(f(x) + g(x)u^*\big) = -L_{3x}^T \Lambda_2(x) L_{3x}. \tag{7.3.7}$$

7.3.1 Solving HJB Equation via Critic Architecture

In this section, we design an online optimal control scheme using a single critic NN. According to [8, 10], the optimal value $V^*(x)$ can be represented by a single-layer NN on a compact set $\Omega$ as
$$V^*(x) = W_c^T \sigma(x) + \varepsilon_{N_2}(x),$$
where $W_c \in \mathbb{R}^{N_2}$ is the ideal NN weight vector, $\sigma(x) = [\sigma_1(x), \ldots, \sigma_{N_2}(x)]^T \in \mathbb{R}^{N_2}$ is the activation function with $\sigma_j(x) \in C^1(\Omega)$ and $\sigma_j(0) = 0$, the set $\{\sigma_j(x)\}_1^{N_2}$ is selected to be linearly independent, $N_2$ is the number of neurons, and $\varepsilon_{N_2}(x)$ is the NN function reconstruction error. The derivative of $V^*(x)$ with respect to $x$ is obtained as
$$V_x^* = \nabla\sigma^T(x) W_c + \nabla\varepsilon_{N_2}(x) \tag{7.3.8}$$
with $\nabla\sigma(x) = \partial\sigma(x)/\partial x$, $\nabla\varepsilon_{N_2}(x) = \partial\varepsilon_{N_2}(x)/\partial x$, and $\nabla\sigma(0) = 0$.

Substituting (7.3.8) into (7.3.4), and using the Taylor series expansion, we have
$$u^*(x) = -\tau \tanh\Big(\frac{1}{2\tau} g^T(x) \nabla\sigma^T W_c\Big) + \varepsilon_{u^*}, \tag{7.3.9}$$
where $\varepsilon_{u^*} = -\frac{1}{2}\big(\mathbf{1} - \tanh^2(\chi)\big) g^T(x) \nabla\varepsilon_{N_2}(x)$, $\chi \in \mathbb{R}^m$ is chosen between
$$\frac{1}{2\tau} g^T(x) \nabla\sigma^T W_c \quad \text{and} \quad \frac{1}{2\tau} g^T(x)\big(\nabla\sigma^T W_c + \nabla\varepsilon_{N_2}(x)\big),$$
and $\mathbf{1} = (1, \ldots, 1)^T \in \mathbb{R}^m$.

Using (7.3.8), (7.3.6) can be rewritten as
$$W_c^T \nabla\sigma f(x) + Q(x) + \tau^2 \sum_{i=1}^m \ln\big[1 - \tanh^2(B_{1i}(x))\big] + \varepsilon_{HJB} = 0, \tag{7.3.10}$$
where
$$B_1(x) = \frac{1}{2\tau} g^T(x) \nabla\sigma^T W_c,$$
and $B_1(x) = (B_{11}(x), \ldots, B_{1m}(x))^T$ with $B_{1i}(x) \in \mathbb{R}$, $i = 1, \ldots, m$, and $\varepsilon_{HJB}$ is the HJB approximation error [1].

Remark 7.3.1 It was shown in [1] that there exists a small constant εh > 0 and a
positive integer N  (depending only on εh ) such that N2 > N  implies εHJB  ≤ εh .

Since the ideal critic NN weight matrix Wc is typically unknown, (7.3.9) cannot
be implemented in the real control process. Therefore, we employ a critic NN to

approximate the optimal value function V ∗ (x) as

V̂ (x) = ŴcTσ (x), (7.3.11)

where Ŵc is the estimate of Wc . The weight estimation error for the critic NN is
defined as
W̃c = Wc − Ŵc . (7.3.12)

Using (7.3.11), the estimates of (7.3.4) is given by

1 T
û(x) = −τ tanh g (x)∇σ T Ŵc . (7.3.13)

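From (7.3.11) and (7.3.13), we derive the approximate Hamiltonian as

\[ H(x, \hat{W}_c) = \hat{W}_c^{T}\nabla\sigma f(x) + Q(x) + \tau^2 \sum_{i=1}^{m} \ln\bigl[1 - \tanh^2(B_{2i}(x))\bigr] \triangleq e_2, \tag{7.3.14} \]

where

\[ B_2(x) = \frac{1}{2\tau} g^{T}(x)\nabla\sigma^{T}\hat{W}_c, \]

and $B_2(x) = (B_{21}(x), \ldots, B_{2m}(x))^{T}$ with $B_{2i}(x) \in \mathbb{R}$, $i = 1, \ldots, m$. Subtracting
(7.3.10) from (7.3.14) and using (7.3.12), we obtain

\[ e_2 = -\tilde{W}_c^{T}\nabla\sigma f(x) + \tau^2 \sum_{i=1}^{m} \bigl[\Gamma(B_{2i}(x)) - \Gamma(B_{1i}(x))\bigr] - \varepsilon_{\mathrm{HJB}}, \tag{7.3.15} \]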

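where $\Gamma(B_{\ell i}(x)) = \ln\bigl[1 - \tanh^2(B_{\ell i}(x))\bigr]$, $\ell = 1, 2$ and $i = 1, \ldots, m$. Note
that for every $B_{\ell i}(x) \in \mathbb{R}$, $\Gamma(B_{\ell i}(x))$ can be represented exactly as

\[ \Gamma(B_{\ell i}(x)) = \ln\bigl[1 - \tanh^2(B_{\ell i}(x))\bigr] =
\begin{cases}
\ln 4 - 2B_{\ell i}(x) - 2\ln\bigl[1 + \exp(-2B_{\ell i}(x))\bigr], & B_{\ell i}(x) > 0, \\
\ln 4 + 2B_{\ell i}(x) - 2\ln\bigl[1 + \exp(2B_{\ell i}(x))\bigr], & B_{\ell i}(x) < 0.
\end{cases} \]

That is,

\[ \Gamma(B_{\ell i}(x)) = \ln 4 - 2B_{\ell i}(x)\operatorname{sgn}(B_{\ell i}(x)) - 2\ln\bigl[1 + \exp\bigl(-2B_{\ell i}(x)\operatorname{sgn}(B_{\ell i}(x))\bigr)\bigr], \]

where $\operatorname{sgn}(B_{\ell i}(x))$ is the sign function.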
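As a quick numerical check of this identity, the following Python sketch compares the literal expression with the closed form. The function names are illustrative and not from the source; the only fact used beyond the text is that $B\operatorname{sgn}(B) = |B|$.

```python
import math


def gamma_direct(b: float) -> float:
    """Gamma(b) = ln(1 - tanh^2(b)), evaluated literally."""
    return math.log(1.0 - math.tanh(b) ** 2)


def gamma_closed_form(b: float) -> float:
    """Gamma(b) = ln 4 - 2|b| - 2 ln(1 + exp(-2|b|)), using b*sgn(b) = |b|."""
    a = abs(b)
    return math.log(4.0) - 2.0 * a - 2.0 * math.log1p(math.exp(-2.0 * a))


if __name__ == "__main__":
    for b in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(b, gamma_direct(b), gamma_closed_form(b))
```

One practical reason to prefer the closed form in an implementation: for large $|B|$, `tanh` rounds to 1 in double precision, so the literal expression asks for the logarithm of zero, while the closed form stays finite.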


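Note that

\[ \sum_{i=1}^{m} \Gamma(B_{\ell i}(x)) = m\ln 4 - 2B_{\ell}^{T}(x)\operatorname{sgn}(B_{\ell}(x)) - 2\sum_{i=1}^{m} \ln\bigl[1 + \exp\bigl(-2B_{\ell i}(x)\operatorname{sgn}(B_{\ell i}(x))\bigr)\bigr]. \tag{7.3.16} \]

Therefore, combining (7.3.15) and (7.3.16), we get

\[ \begin{aligned}
e_2 &= 2\tau^2\bigl[B_1^{T}(x)\operatorname{sgn}(B_1(x)) - B_2^{T}(x)\operatorname{sgn}(B_2(x))\bigr] - \tilde{W}_c^{T}\nabla\sigma f(x) + \tau^2\Delta_B - \varepsilon_{\mathrm{HJB}} \\
&= \tau\bigl[W_c^{T}\nabla\sigma g(x)\operatorname{sgn}(B_1(x)) - \hat{W}_c^{T}\nabla\sigma g(x)\operatorname{sgn}(B_2(x))\bigr] - \tilde{W}_c^{T}\nabla\sigma f(x) + \tau^2\Delta_B - \varepsilon_{\mathrm{HJB}} \\
&= -\tilde{W}_c^{T}\nabla\sigma f(x) + \tau\tilde{W}_c^{T}\nabla\sigma g(x)\operatorname{sgn}(B_2(x)) + D_1(x),
\end{aligned} \tag{7.3.17} \]

where

\[ \Delta_B = 2\sum_{i=1}^{m} \ln\frac{1 + \exp[-2B_{1i}(x)\operatorname{sgn}(B_{1i}(x))]}{1 + \exp[-2B_{2i}(x)\operatorname{sgn}(B_{2i}(x))]}, \]

\[ D_1(x) = \tau W_c^{T}\nabla\sigma g(x)\bigl[\operatorname{sgn}(B_1(x)) - \operatorname{sgn}(B_2(x))\bigr] + \tau^2\Delta_B - \varepsilon_{\mathrm{HJB}}. \]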

Remark 7.3.2 From the expression of $\Delta_B$, we have $\Delta_B \in [-m\ln 4, m\ln 4]$.
Meanwhile, by Remark 7.3.1, we can conclude that $D_1(x)$ is a bounded function since $W_c$
is generally bounded and $\tau > 0$ is a constant.

To drive $e_2$ toward its minimum, it is desired to minimize the objective function
$E = (1/2)e_2^{T}e_2$ with the gradient descent algorithm. However, tuning the critic
NN weights to minimize $E$ alone does not guarantee the stability of system (7.3.1)
during the learning process of NNs if the initial control is not admissible. To make
the algorithm suitable for online implementation, a novel weight update law for the critic NN
is developed as

\[ \begin{aligned}
\dot{\hat{W}}_c = {} & -\eta_2\bar{\varphi}\Bigl(Q(x) + \hat{W}_c^{T}\nabla\sigma f(x) + \tau^2\sum_{i=1}^{m}\ln\bigl[1 - \tanh^2(B_{2i}(x))\bigr]\Bigr) \\
& + \frac{\eta_2}{2}\Sigma(x,\hat{u})\nabla\sigma g(x)\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)L_{3x} \\
& + \eta_2\Bigl(\tau\nabla\sigma g(x)\bigl[\tanh(B_2(x)) - \operatorname{sgn}(B_2(x))\bigr]\frac{\phi^{T}}{m_s}\hat{W}_c - \bigl(F_2\hat{W}_c - F_1\phi^{T}\hat{W}_c\bigr)\Bigr),
\end{aligned} \tag{7.3.18} \]

where $\bar{\varphi} = \varphi/m_s^2$, $m_s = 1 + \varphi^{T}\varphi$,

\[ \varphi = \nabla\sigma f(x) - \tau\nabla\sigma g(x)\tanh(B_2(x)), \]

$\phi = \varphi/m_s$, $\eta_2 > 0$ is a design parameter,

\[ \mathcal{N}(B_2(x)) = \operatorname{diag}\bigl\{\tanh^2(B_{21}(x)), \ldots, \tanh^2(B_{2m}(x))\bigr\}, \]

$I_m$ is the $m \times m$ identity matrix, $L_{3x}$ is defined as in Assumption 7.3.1, $F_1$ and $F_2$
are tuning parameters with suitable dimensions, and $\Sigma(x,\hat{u})$ is described by

\[ \Sigma(x,\hat{u}) =
\begin{cases}
0, & \text{if } L_{3x}^{T}\bigl(f(x) - \tau g(x)\tanh(B_2(x))\bigr) < 0, \\
1, & \text{otherwise.}
\end{cases} \tag{7.3.19} \]

Remark 7.3.3 The first term in (7.3.18) is used to minimize $E = (1/2)e_2^{T}e_2$ and
is obtained by utilizing the normalized gradient descent algorithm. The rest are
employed to ensure the stability of the optimal closed-loop system while the critic
NN learns the optimal weights. $\Sigma(x,\hat{u})$ is designed based on Lyapunov's condition
for stability. Observing the expression of $\Sigma(x,\hat{u})$ and noticing that $L_3(x)$ is the
Lyapunov function candidate for system (7.3.1) defined in Assumption 7.3.1, if
system (7.3.1) is stable (i.e., $L_{3x}^{T}\bigl(f(x) - \tau g(x)\tanh(B_2(x))\bigr) < 0$), then $\Sigma(x,\hat{u}) = 0$
and the second term given in (7.3.18) disappears. If system (7.3.1) is unstable, then
$\Sigma(x,\hat{u}) = 1$ and the second term given in (7.3.18) is activated. Due to this property
of $\Sigma(x,\hat{u})$, it does not require an initial stabilizing control law for system (7.3.1).
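The switching rule (7.3.19) amounts to a sign test on the directional derivative of $L_3$ along the closed-loop vector field. A minimal Python sketch is given below; the function and argument names are illustrative, and the caller is assumed to supply $L_{3x}$, $f(x)$, $g(x)$ (as an $n \times m$ nested list) and $\tanh(B_2(x))$ evaluated at the current state.

```python
def closed_loop_field(fx, gx, tanh_b2, tau):
    """Return f(x) - tau * g(x) * tanh(B2(x)) as a list of length n."""
    m = len(tanh_b2)
    return [
        fx[i] - tau * sum(gx[i][j] * tanh_b2[j] for j in range(m))
        for i in range(len(fx))
    ]


def sigma_switch(l3x, fx, gx, tanh_b2, tau):
    """Indicator (7.3.19): 0 if L3 decreases along the closed loop, else 1."""
    xdot = closed_loop_field(fx, gx, tanh_b2, tau)
    decay = sum(l3x[i] * xdot[i] for i in range(len(l3x)))
    return 0 if decay < 0 else 1
```

With this convention the stabilizing term in (7.3.18) is multiplied by `sigma_switch(...)`, so it contributes only while the Lyapunov condition is violated.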

Remark 7.3.4 From (7.3.14), one finds that $x = 0$ gives rise
to $H(x, \hat{W}_c) = e_2 = 0$. If $F_2 = F_1\phi^{T}$ in (7.3.18), then $\dot{\hat{W}}_c = 0$. In this sense,
the critic NN will no longer be updated. However, the optimal control might not
have been obtained at the finite time $t_0$ such that $x(t_0) = 0$. To avoid this
circumstance, a small exploratory signal is needed, i.e., the PE condition is
required. It is worth pointing out that by using the method presented in Sect. 7.2, the
PE condition can be removed. However, due to the purpose of this section, we do
not employ the approach of Sect. 7.2.

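Using $\varphi$ given in (7.3.18), we obtain $\nabla\sigma f(x) = \varphi + \tau\nabla\sigma g(x)\tanh(B_2(x))$.
Therefore, (7.3.17) can be represented as

\[ e_2 = -\tilde{W}_c^{T}\varphi + \tau\tilde{W}_c^{T}\nabla\sigma g(x)\mathfrak{N}(x) + D_1(x), \tag{7.3.20} \]

where $\mathfrak{N}(x) = \operatorname{sgn}(B_2(x)) - \tanh(B_2(x))$.

From (7.3.12), (7.3.14), (7.3.18), and (7.3.20), we have

\[ \begin{aligned}
\dot{\tilde{W}}_c = {} & \eta_2\frac{\phi}{m_s}\bigl(-\tilde{W}_c^{T}\varphi + \tau\tilde{W}_c^{T}\nabla\sigma g(x)\mathfrak{N}(x) + D_1(x)\bigr) \\
& - \frac{\eta_2}{2}\Sigma(x,\hat{u})\nabla\sigma g(x)\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)L_{3x} \\
& + \eta_2\Bigl[\tau\nabla\sigma g(x)\mathfrak{N}(x)\frac{\phi^{T}}{m_s}\hat{W}_c + \bigl(F_2\hat{W}_c - F_1\phi^{T}\hat{W}_c\bigr)\Bigr].
\end{aligned} \tag{7.3.21} \]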

7.3.2 Stability Analysis of Closed-Loop System with Constrained Inputs

We introduce an assumption before establishing the stability analysis result.

Assumption 7.3.2 The ideal critic NN weight $W_c$ is bounded by a known constant
$W_{Mc} > 0$, i.e., $\|W_c\| \le W_{Mc}$. Meanwhile, the NN approximation error $\varepsilon_{N_2}(x)$ is
bounded by a known constant $\varepsilon_M > 0$ over $\Omega$, i.e., $\|\varepsilon_{N_2}(x)\| \le \varepsilon_M$ for every $x \in \Omega$.
In addition, $\varepsilon_{u^*}$ in (7.3.9) is upper bounded by a known constant $\varepsilon_{a_2} > 0$, i.e.,
$\|\varepsilon_{u^*}\| \le \varepsilon_{a_2}$.

Theorem 7.3.1 Consider the continuous-time nonlinear system described by (7.3.1)
with associated HJB equation (7.3.6). Let Assumptions 7.2.2, 7.3.1, and 7.3.2 hold,
and let the control input for system (7.3.1) be given by (7.3.13). Moreover, let the
critic NN weight tuning law be given by (7.3.18). Then, the function $L_{3x}$ and the
weight estimation error $\tilde{W}_c$ are guaranteed to be UUB.

Proof Consider the Lyapunov function candidate

\[ L(x) = L_3(x) + \frac{1}{2}\tilde{W}_c^{T}\eta_2^{-1}\tilde{W}_c, \tag{7.3.22} \]

where $L_3(x)$ is defined as in Assumption 7.3.1.
Taking the time derivative of (7.3.22) using Assumption 7.3.1 and (7.3.13), we
have

\[ \dot{L}(x) = L_{3x}^{T}\bigl(f(x) - \tau g(x)\tanh(B_2(x))\bigr) + \dot{\tilde{W}}_c^{T}\eta_2^{-1}\tilde{W}_c. \tag{7.3.23} \]

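Using (7.3.21), we derive the last term in (7.3.23) as

\[ \begin{aligned}
\dot{\tilde{W}}_c^{T}\eta_2^{-1}\tilde{W}_c = {} & \bigl(-\tilde{W}_c^{T}\varphi + \tau\tilde{W}_c^{T}\nabla\sigma g(x)\mathfrak{N}(x) + D_1(x)\bigr)\frac{\phi^{T}}{m_s}\tilde{W}_c \\
& - \frac{1}{2}\Sigma(x,\hat{u})L_{3x}^{T}g(x)\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)\nabla\sigma^{T}\tilde{W}_c \\
& + \tau\tilde{W}_c^{T}\nabla\sigma g(x)\mathfrak{N}(x)\frac{\phi^{T}}{m_s}\hat{W}_c + \tilde{W}_c^{T}\bigl(F_2\hat{W}_c - F_1\phi^{T}\hat{W}_c\bigr) \\
= {} & -\tilde{W}_c^{T}\phi\phi^{T}\tilde{W}_c + \bar{D}_1(x)\phi^{T}\tilde{W}_c + \tilde{W}_c^{T}\bar{D}_2(x) \\
& - \frac{1}{2}\Sigma(x,\hat{u})L_{3x}^{T}g(x)\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)\nabla\sigma^{T}\tilde{W}_c \\
& + \tilde{W}_c^{T}\bigl(F_2\hat{W}_c - F_1\phi^{T}\hat{W}_c\bigr),
\end{aligned} \tag{7.3.24} \]

where $\bar{D}_1(x) = \dfrac{D_1(x)}{m_s}$ and $\bar{D}_2(x) = \tau\nabla\sigma g(x)\mathfrak{N}(x)\dfrac{\phi^{T}}{m_s}W_c$.
Observe that

\[ \tilde{W}_c^{T}\bigl(F_2\hat{W}_c - F_1\phi^{T}\hat{W}_c\bigr) = \tilde{W}_c^{T}F_2W_c - \tilde{W}_c^{T}F_2\tilde{W}_c - \tilde{W}_c^{T}F_1\phi^{T}W_c + \tilde{W}_c^{T}F_1\phi^{T}\tilde{W}_c. \]

Denote $Z^{T} = \bigl[\tilde{W}_c^{T}\phi,\ \tilde{W}_c^{T}\bigr]$. Then, (7.3.24) can be developed as

\[ \dot{\tilde{W}}_c^{T}\eta_2^{-1}\tilde{W}_c = -Z^{T}KZ + Z^{T}G - \frac{1}{2}\Sigma(x,\hat{u})L_{3x}^{T}g(x)\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)\nabla\sigma^{T}\tilde{W}_c, \tag{7.3.25} \]

where

\[ K = \begin{bmatrix} I & -\frac{1}{2}F_1^{T} \\ -\frac{1}{2}F_1 & F_2 \end{bmatrix}, \qquad
G = \begin{bmatrix} \bar{D}_1(x) \\ \bar{D}_2(x) + F_2W_c - F_1\phi^{T}W_c \end{bmatrix}. \]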

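Combining (7.3.23) with (7.3.25) and selecting $F_1$ and $F_2$ such that $K$ is positive
definite, we obtain

\[ \begin{aligned}
\dot{L}(x) \le {} & L_{3x}^{T}\bigl(f(x) - \tau g(x)\tanh(B_2(x))\bigr) - \lambda_{\min}(K)\|Z\|^2 + \vartheta_M\|Z\| \\
& - \frac{1}{2}\Sigma(x,\hat{u})L_{3x}^{T}g(x)\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)\nabla\sigma^{T}\tilde{W}_c,
\end{aligned} \tag{7.3.26} \]

where $\vartheta_M$ is the upper bound of $\|G\|$, i.e., $\|G\| \le \vartheta_M$.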
Due to the definition of $\Sigma(x,\hat{u})$ in (7.3.19), (7.3.26) is divided into the following
two cases for discussion.
Case 1: $\Sigma(x,\hat{u}) = 0$. By the definition of $\Sigma(x,\hat{u})$ in (7.3.19), the first term
in (7.3.26) is negative. Since $\|x\| > 0$ is guaranteed by persistent excitation, one
can draw the conclusion that there exists a constant $\tau$ such that $0 < \tau \le \|\dot{x}\|$
implies $L_{3x}^{T}\dot{x} \le -\tau\|L_{3x}\|$ based on the dense property of $\mathbb{R}$ [28]. Then, (7.3.26) can be
developed as

\[ \dot{L}(x) \le -\tau\|L_{3x}\| - \lambda_{\min}(K)\Bigl(\|Z\| - \frac{\vartheta_M}{2\lambda_{\min}(K)}\Bigr)^2 + \frac{\vartheta_M^2}{4\lambda_{\min}(K)}. \tag{7.3.27} \]

Therefore, (7.3.27) implies $\dot{L}(x) < 0$ as long as one of the following conditions
holds:

\[ \|L_{3x}\| > \frac{\vartheta_M^2}{4\tau\lambda_{\min}(K)}, \quad\text{or}\quad \|Z\| > \frac{\vartheta_M}{\lambda_{\min}(K)}. \tag{7.3.28} \]
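Since

\[ \frac{a}{(1 + a)^2} \le \frac{1}{4}, \quad \forall a, \]

from the definition of $\phi$, we have

\[ \|\phi\|^2 = \frac{\varphi^{T}\varphi}{(1 + \varphi^{T}\varphi)^2} \]

and thus $\|\phi\| \le \dfrac{1}{2}$. Noticing that $\|Z\| \le \sqrt{1 + \|\phi\|^2}\,\|\tilde{W}_c\|$, we obtain

\[ \|Z\| \le \frac{\sqrt{5}}{2}\|\tilde{W}_c\|. \]

Then, from (7.3.28), we have

\[ \|\tilde{W}_c\| > \frac{2\vartheta_M}{\sqrt{5}\lambda_{\min}(K)}. \]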
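The passage from the bound on $\phi$ to the bound on $\tilde{W}_c$ can be spelled out as follows. This is a short worked derivation using only the definition of $Z$ and the Cauchy–Schwarz inequality; it adds no new assumption.

```latex
\begin{aligned}
\|Z\|^2 &= \bigl(\tilde{W}_c^{T}\phi\bigr)^2 + \|\tilde{W}_c\|^2
         \le \|\phi\|^2\|\tilde{W}_c\|^2 + \|\tilde{W}_c\|^2
         = \bigl(1 + \|\phi\|^2\bigr)\|\tilde{W}_c\|^2
         \le \tfrac{5}{4}\|\tilde{W}_c\|^2, \\
\|Z\| > \frac{\vartheta_M}{\lambda_{\min}(K)}
  &\;\Longrightarrow\;
  \|\tilde{W}_c\| \ge \frac{2}{\sqrt{5}}\|Z\| > \frac{2\vartheta_M}{\sqrt{5}\,\lambda_{\min}(K)} .
\end{aligned}
```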

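Case 2: $\Sigma(x,\hat{u}) = 1$. In this case, the first term given in (7.3.26) is nonnegative,
which implies that the control (7.3.13) may not stabilize system (7.3.1). Then,
(7.3.26) becomes

\[ \begin{aligned}
\dot{L}(x) \le {} & L_{3x}^{T}f(x) - \tau L_{3x}^{T}g(x)\Bigl(\tanh(B_2(x)) + \frac{1}{2\tau}\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)\nabla\sigma^{T}\tilde{W}_c\Bigr) \\
& - \lambda_{\min}(K)\Bigl(\|Z\| - \frac{\vartheta_M}{2\lambda_{\min}(K)}\Bigr)^2 + \frac{\vartheta_M^2}{4\lambda_{\min}(K)}.
\end{aligned} \tag{7.3.29} \]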

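Denote $\mathcal{B}(B_{\ell}(x)) = \tanh(B_{\ell}(x))$, $\ell = 1, 2$. By using the Taylor series, we have

\[ \begin{aligned}
\mathcal{B}(B_1(x)) - \mathcal{B}(B_2(x))
&= \dot{\mathcal{B}}(B_2(x))\bigl(B_1(x) - B_2(x)\bigr) + O\bigl((B_1(x) - B_2(x))^2\bigr) \\
&= \frac{1}{2\tau}\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)\nabla\sigma^{T}\tilde{W}_c + O\bigl((B_1(x) - B_2(x))^2\bigr).
\end{aligned} \tag{7.3.30} \]

From (7.3.30), we get

\[ \tanh(B_2(x)) + \frac{1}{2\tau}\bigl[I_m - \mathcal{N}(B_2(x))\bigr]g^{T}(x)\nabla\sigma^{T}\tilde{W}_c
= \tanh(B_1(x)) - O\bigl((B_1(x) - B_2(x))^2\bigr). \tag{7.3.31} \]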
Substituting (7.3.31) into (7.3.29) and adding and subtracting $L_{3x}^{T}g(x)\varepsilon_{u^*}$ to the
right-hand side of (7.3.29), we derive

\[ \begin{aligned}
\dot{L}(x) \le {} & L_{3x}^{T}\bigl(f(x) + g(x)u^{*}\bigr) - L_{3x}^{T}g(x)\varepsilon_{u^*} + \tau L_{3x}^{T}g(x)\,O\bigl((B_1(x) - B_2(x))^2\bigr) \\
& - \lambda_{\min}(K)\Bigl(\|Z\| - \frac{\vartheta_M}{2\lambda_{\min}(K)}\Bigr)^2 + \frac{\vartheta_M^2}{4\lambda_{\min}(K)}.
\end{aligned} \tag{7.3.32} \]

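Using Assumption 7.3.1, we can rewrite (7.3.32) as

\[ \dot{L}(x) \le -\lambda_{\min}(\Lambda_2(x))\Bigl(\|L_{3x}\| - \frac{\rho_M}{2\lambda_{\min}(\Lambda_2(x))}\Bigr)^2
- \lambda_{\min}(K)\Bigl(\|Z\| - \frac{\vartheta_M}{2\lambda_{\min}(K)}\Bigr)^2 + \mu_0, \tag{7.3.33} \]

where $\rho_M = g_M(\varepsilon_{a_2} + \tau\varepsilon_{b_2})$, $\varepsilon_{b_2}$ is the upper bound of $\bigl\|O\bigl((B_1(x) - B_2(x))^2\bigr)\bigr\|$, and
$\mu_0$ is given as

\[ \mu_0 = \frac{\rho_M^2}{4\lambda_{\min}(\Lambda_2(x))} + \frac{\vartheta_M^2}{4\lambda_{\min}(K)}. \]

Hence, (7.3.33) yields $\dot{L}(x) < 0$ as long as one of the following conditions holds:

\[ \|L_{3x}\| > \frac{\rho_M}{2\lambda_{\min}(\Lambda_2(x))} + \sqrt{\frac{\mu_0}{\lambda_{\min}(\Lambda_2(x))}}, \]

or

\[ \|Z\| > \frac{\vartheta_M}{2\lambda_{\min}(K)} + \sqrt{\frac{\mu_0}{\lambda_{\min}(K)}}. \tag{7.3.34} \]

Observe that $\|Z\| \le \dfrac{\sqrt{5}}{2}\|\tilde{W}_c\|$. Then, from (7.3.34), we have

\[ \|\tilde{W}_c\| > \frac{\vartheta_M}{\sqrt{5}\lambda_{\min}(K)} + 2\sqrt{\frac{\mu_0}{5\lambda_{\min}(K)}}. \]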

Combining Case 1 and Case 2 and by the standard Lyapunov’s extension theorem
[12, 13] (or the Lagrange stability result [22]), we conclude that the function L 3x
and the weight estimation error W̃c are UUB. This completes the proof.

Remark 7.3.5 Because $L_3(x)$ given in Assumption 7.3.1 is often constructed by
selecting polynomials, one can conclude that $L_{3x}$ is also a polynomial with respect
to $x$. Since Theorem 7.3.1 has verified that $L_{3x}$ is UUB, one can easily obtain that
the trajectory of the closed-loop system is also UUB.

7.3.3 Simulation Study

Consider the continuous-time affine nonlinear system described by [29]

\[ \dot{x}(t) = f(x) + g(x)u, \tag{7.3.35} \]

where

\[ f(x) = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2 + 0.5x_2[\cos(2x_1) + 2]^2 \end{bmatrix}, \qquad
g(x) = \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}. \]

The objective is to control the system with control limits of $|u| \le 1.2$. The value
function is given by

\[ V(x(0)) = \int_0^{\infty}\Bigl(Q(x) + 2\tau\int_0^{u}\tanh^{-T}(\upsilon/\tau)\,d\upsilon\Bigr)dt, \]

where $Q(x) = x_1^2 + x_2^2$.

The gains of the critic NN are chosen as $\eta_2 = 38$, $\tau = 1.2$, $N_2 = 8$. The activation
function of the critic NN is chosen as

\[ \sigma(x) = \bigl[x_1^2,\ x_2^2,\ x_1x_2,\ x_1^4,\ x_2^4,\ x_1^3x_2,\ x_1^2x_2^2,\ x_1x_2^3\bigr]^{T}, \]

and the critic NN weights are denoted as

\[ \hat{W}_c = \bigl[\hat{W}_{c1}, \ldots, \hat{W}_{c8}\bigr]^{T}. \]
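To make the example concrete, the following Python sketch encodes the plant, the eight activation functions, their Jacobian, and the bounded control estimate (7.3.13) for this single-input case. It is only an illustration under the definitions above: the function names and the list-based layout are assumptions, and the weight update law (7.3.18) is not implemented here.

```python
import math

TAU = 1.2  # saturation bound used in the simulation study


def f(x):
    """Drift term of the example system (7.3.35)."""
    x1, x2 = x
    return [-x1 + x2,
            -0.5 * x1 - 0.5 * x2 + 0.5 * x2 * (math.cos(2 * x1) + 2) ** 2]


def g(x):
    """Input gain vector of the example system (m = 1)."""
    return [0.0, math.cos(2 * x[0]) + 2.0]


def sigma(x):
    """Critic activation functions (N2 = 8)."""
    x1, x2 = x
    return [x1**2, x2**2, x1 * x2, x1**4, x2**4,
            x1**3 * x2, x1**2 * x2**2, x1 * x2**3]


def grad_sigma(x):
    """Jacobian of sigma: 8 rows, 2 columns."""
    x1, x2 = x
    return [[2 * x1, 0.0], [0.0, 2 * x2], [x2, x1],
            [4 * x1**3, 0.0], [0.0, 4 * x2**3],
            [3 * x1**2 * x2, x1**3],
            [2 * x1 * x2**2, 2 * x1**2 * x2],
            [x2**3, 3 * x1 * x2**2]]


def u_hat(x, w_hat):
    """Control estimate -tau * tanh(g^T grad_sigma^T w_hat / (2 tau))."""
    gx, jac = g(x), grad_sigma(x)
    s = sum(w_hat[j] * (jac[j][0] * gx[0] + jac[j][1] * gx[1])
            for j in range(len(w_hat)))
    return -TAU * math.tanh(s / (2.0 * TAU))
```

Because of the `tanh`, `u_hat` never leaves $[-\tau, \tau]$ regardless of the weight values, which is the point of the nonquadratic cost.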

The initial weights of the critic NN are chosen as zeros, and the initial system state is
selected as x0 = [3, −0.5]T . It is significant to point out that under this circumstance,
the initial control cannot stabilize system (7.3.35). That is, no initial stabilizing
control is required for the implementation of this algorithm. To guarantee the PE
condition, a small exploratory signal

\[ N(t) = \sin^5(t)\cos(t) + \sin^5(2t)\cos(0.2t) \]

is added to the control law u(t) for the first 10 s.
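A minimal Python sketch of this probing signal follows. The function name and the `duration` parameter are illustrative; the 10 s cutoff is the value stated above.

```python
import math


def exploration_signal(t: float, duration: float = 10.0) -> float:
    """N(t) = sin^5(t) cos(t) + sin^5(2t) cos(0.2 t), switched off after `duration` seconds."""
    if t < 0.0 or t >= duration:
        return 0.0
    return (math.sin(t) ** 5 * math.cos(t)
            + math.sin(2.0 * t) ** 5 * math.cos(0.2 * t))
```

In a simulation loop the applied input would be the control estimate plus `exploration_signal(t)`; how the sum is treated with respect to the saturation bound is not specified in the text and is left to the implementer.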


Using the algorithm developed in Sect. 7.3, we obtain the computer simulation
results as in Figs. 7.9, 7.10, 7.11, 7.12 and 7.13. Figure 7.9 shows the trajectories of
system state x(t), where the exploratory signal N(t) is added to the control input
during the first 10 s. It is shown that the learning process is finished before the end
of the exploratory signal. Figure 7.10 shows the convergence of critic NN weights.
Figure 7.11 shows the optimal value function. Figure 7.12 illustrates the optimal
controller with control constraints. In order to make comparisons, we use Fig. 7.13
to show the controller designed without considering control constraints. The actuator
saturation actually exists; therefore, the control input is limited to the bounded value
when it overruns the saturation bound. From Figs. 7.9–7.13, it is observed that the
optimal control can be obtained using a single NN. Meanwhile, the system state and
the estimated weights of the critic NN are all guaranteed to be UUB, while keeping
the closed-loop system stable. Moreover, comparing Fig. 7.12 with Fig. 7.13, one can
find that the restriction of control constraints has been overcome.

Fig. 7.9 Evolution of system state x(t) (x1(t) and x2(t) versus time in s)

Fig. 7.10 Convergence of critic NN weights (Ŵc1, ..., Ŵc8 versus time in s)

Fig. 7.11 Optimal value function (V* over the (x1, x2) plane)

Fig. 7.12 Control input u (versus time in s)

Fig. 7.13 Control input u without considering control constraints (versus time in s, with upper and lower bounds marked)

7.4 Conclusions

In this chapter, we developed an identifier–critic architecture based on ADP methodology
to derive the approximate optimal control for partially unknown continuous-
time nonlinear systems. Meanwhile, an ADP algorithm is developed to obtain the
optimal control for constrained continuous-time nonlinear systems. A limitation of
the present methods is that the prior knowledge of g(x) is required to be available.
In addition, it should be mentioned that the present approaches are applicable to
continuous-time input-affine nonlinear systems. Nevertheless, if the system is input-
nonaffine, the present methods might not work. Therefore, new ADP methods should
be developed to obtain the optimal control of continuous-time nonaffine nonlinear
systems.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Beard RW, Saridis GN, Wen JT (1997) Galerkin approximations of the generalized Hamilton-
Jacobi-Bellman equation. Automatica 33(12):2159–2177
3. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
4. Bryson AE, Ho YC (1975) Applied optimal control: optimization, estimation and control. CRC
Press, Boca Raton
5. Chowdhary G (2010) Concurrent learning for convergence in adaptive control without persis-
tency of excitation. Ph.D. Thesis, Georgia Institute of Technology, USA
6. Chowdhary G, Johnson E (2011) A singular value maximizing data recording algorithm for
concurrent learning. In: Proceedings of the American control conference. pp 3547–3552
7. Dierks T, Jagannathan S (2010) Optimal control of affine nonlinear continuous-time systems.
In: Proceedings of the American control conference. pp 1568–1573
8. Haykin S (2009) Neural networks and learning machines, 3rd edn. Prentice-Hall, Upper Saddle
River
9. Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press, New York
10. Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping
and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
11. Khalil HK (2001) Nonlinear systems. Prentice-Hall, Upper Saddle River
12. LaSalle JP, Lefschetz S (1967) Stability by Liapunov’s direct method with applications. Aca-
demic Press, New York
13. Lewis FL, Jagannathan S, Yesildirak A (1999) Neural network control of robot manipulators
and nonlinear systems. Taylor & Francis, London
14. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control. Wiley, Hoboken
15. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag
32(6):76–105
16. Lewis FL, Yesildirek A, Liu K (1996) Multilayer neural-net robot controller with guaranteed
tracking performance. IEEE Trans Neural Netw 7(2):388–399
17. Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
18. Liu D, Wang D, Wang FY, Li H, Yang X (2014) Neural-network-based online HJB solution for
optimal robust guaranteed cost control of continuous-time uncertain nonlinear systems. IEEE
Trans Cybern 44(12):2834–2847
19. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
20. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
21. Lyshevski SE (1998) Optimal control of nonlinear continuous-time systems: design of bounded
controllers via generalized nonquadratic functionals. In: Proceedings of the American Control
Conference. pp 205–209
22. Michel AN, Hou L, Liu D (2015) Stability of dynamical systems: on the role of monotonic
and non-monotonic Lyapunov functions. Birkhäuser, Boston
23. Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown
constrained-input systems using integral reinforcement learning. Automatica 50(7):1780–1792
24. Modares H, Lewis F, Naghibi-Sistani MB (2014) Online solution of nonquadratic two-player
zero-sum games arising in the H∞ control of constrained input systems. Int J Adapt Control
Signal Process 28(3–5):232–254

25. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
26. Nodland D, Zargarzadeh H, Jagannathan S (2013) Neural network-based optimal adaptive
output feedback control of a helicopter UAV. IEEE Trans Neural Netw Learn Syst 24(7):1061–
1073
27. Padhi R, Unnikrishnan N, Wang X, Balakrishnan S (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
28. Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, New York
29. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
30. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
31. Yang X, Liu D, Huang Y (2013) Neural-network-based online optimal control for uncertain non-
linear continuous-time systems with control constraints. IET Control Theory Appl 7(17):2037–
2047
32. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of unknown
continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–566
33. Yang X, Liu D, Wang D, Wei Q (2014) Discrete-time online learning control for a class of
unknown nonaffine nonlinear systems using reinforcement learning. Neural Netw 55:30–41
34. Yang X, Liu D, Wei Q (2014) Online approximate optimal control for affine non-linear systems
with unknown internal dynamics using adaptive dynamic programming. IET Control Theory
Appl 8(16):1676–1688
35. Yu W (2009) Recent advances in intelligent control systems. Springer, London
36. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
37. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
38. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
39. Zhang H, Qin C, Luo Y (2014) Neural-network-based constrained optimal control scheme for
discrete-time switched nonlinear system using dual heuristic programming. IEEE Trans Autom
Sci Eng 11(3):839–849
40. Zhong X, He H, Zhang H, Wang Z (2014) Optimal control for unknown discrete-time nonlinear
markov jump systems using adaptive dynamic programming. IEEE Trans Neural Netw Learn
Syst 25(12):2141–2155
Chapter 8
Optimal Control of Unknown
Continuous-Time Nonaffine Nonlinear
Systems

8.1 Introduction

The output of nonaffine nonlinear systems depends nonlinearly on the control signal,
which is significantly different from affine nonlinear systems. Due to this feature,
control approaches of affine nonlinear systems do not always hold for nonaffine non-
linear systems. Therefore, it is necessary to develop control methods for nonaffine
nonlinear systems. Recently, optimal control problems for nonaffine nonlinear sys-
tems have attracted considerable attention. Many remarkable approaches have been
proposed to tackle the problem [6, 19, 29, 32].
In this chapter, we present two novel control algorithms based on ADP methods
to obtain the optimal control of continuous-time nonaffine nonlinear systems with
completely unknown dynamics. First, we construct an ADP-based identifier–actor–
critic architecture to achieve the approximate optimal control of continuous-time
unknown nonaffine nonlinear systems [28]. The identifier is constructed by a dynamic
neural network, which transforms nonaffine nonlinear systems into a kind of affine
nonlinear systems. Actor–critic dual networks are employed to derive the optimal
control for the newly formulated affine nonlinear systems. Then, we develop a novel
ADP-based observer–critic architecture to obtain the approximate optimal control
of nonaffine nonlinear systems with unknown dynamics [19]. The present observer
is composed of a three-layer feedforward neural network, which aims to get the
knowledge of system states. Meanwhile, a single critic neural network is employed
for estimating the performance of the systems as well as for constructing the optimal
control signal.

© Springer International Publishing AG 2017 309


D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_8

8.2 Optimal Control of Unknown Nonaffine Nonlinear Systems with Constrained Inputs

Consider the continuous-time nonlinear systems given by

\[ \dot{x}(t) = F\bigl(x(t), u(t)\bigr), \tag{8.2.1} \]

where x(t) ∈ Rn is the state vector available for measurement and u(t) ∈ U ⊂ Rm
is the control vector, and U = {u ∈ Rm : |u i | ≤ τ, i = 1, . . . , m} with the saturation
bound τ > 0. F(x, u) is an unknown nonaffine nonlinear function. We assume that
F(x, u) is Lipschitz continuous on Ω containing the origin, such that the solution
x(t) of system (8.2.1) is unique for arbitrarily given initial state x0 ∈ Ω and control
u ∈ U, and F(0, 0) = 0.
The value function for system (8.2.1) is defined by

\[ V(x(t)) = \int_t^{\infty}\bigl(Q(x(s)) + Y(u(s))\bigr)\,ds, \tag{8.2.2} \]

where $Q(x)$ is continuously differentiable and positive definite, i.e., $Q(x) \in C^1(\Omega)$,
$Q(x) > 0$ for all $x \neq 0$ and $x = 0 \Leftrightarrow Q(x) = 0$, and $Y(u)$ is defined as

\[ Y(u) = 2\int_0^{u}\tau\bigl(\Psi^{-1}(\upsilon/\tau)\bigr)^{T}R\,d\upsilon \triangleq 2\tau\sum_{i=1}^{m}\int_0^{u_i} r_i\psi^{-1}(\upsilon_i/\tau)\,d\upsilon_i, \]

where $\Psi^{-1}(\upsilon/\tau) = \bigl[\psi^{-1}(\upsilon_1/\tau), \psi^{-1}(\upsilon_2/\tau), \ldots, \psi^{-1}(\upsilon_m/\tau)\bigr]^{T} \in \mathbb{R}^m$, and $\Psi^{-T}$
denotes $(\Psi^{-1})^{T}$. Meanwhile, $\psi(\cdot)$ is a strictly monotonic odd function satisfying
$|\psi(\cdot)| < 1$ and belonging to $C^p$ ($p \ge 1$) and $L_2(\Omega)$. $R = \operatorname{diag}\{r_1, \ldots, r_m\}$ with
$r_i > 0$, $i = 1, \ldots, m$. It is necessary to state that $Y(u)$ is positive definite since
$\psi^{-1}(\cdot)$ is a monotonic odd function and $R$ is positive definite. Without loss of
generality, we choose $\psi(\cdot) = \tanh(\cdot)$ and $R$ is assumed to be the $m \times m$ identity matrix,
i.e., $R = I_m$.
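For the scalar case with $\psi = \tanh$ and $R = 1$, the integral defining $Y(u)$ has the closed form $Y(u) = 2\tau u\tanh^{-1}(u/\tau) + \tau^2\ln(1 - u^2/\tau^2)$ for $|u| < \tau$, which follows from the standard antiderivative of the inverse hyperbolic tangent. The Python sketch below (illustrative names, not from the source) checks this against a midpoint-rule quadrature.

```python
import math


def control_cost_closed_form(u: float, tau: float) -> float:
    """Y(u) = 2 tau u atanh(u/tau) + tau^2 ln(1 - (u/tau)^2), valid for |u| < tau."""
    return (2.0 * tau * u * math.atanh(u / tau)
            + tau**2 * math.log(1.0 - (u / tau) ** 2))


def control_cost_quadrature(u: float, tau: float, n: int = 20000) -> float:
    """Midpoint-rule approximation of 2 tau * integral_0^u atanh(v/tau) dv."""
    h = u / n
    return 2.0 * tau * h * sum(math.atanh((k + 0.5) * h / tau) for k in range(n))


if __name__ == "__main__":
    print(control_cost_closed_form(0.6, 1.2), control_cost_quadrature(0.6, 1.2))
```

The cost is zero at $u = 0$, even in $u$, and grows without bound as $|u| \to \tau$, which is how the penalty keeps the minimizing control inside the saturation limits.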
If the value function $V(x) \in C^1$, by taking the time derivative of both sides of
(8.2.2) and moving the terms on the right-hand side to the left, we have

\[ V_x^{T}F(x,u) + Q(x) + 2\tau\int_0^{u}\tanh^{-T}(\upsilon/\tau)\,d\upsilon = 0, \tag{8.2.3} \]

where $V_x \in \mathbb{R}^n$ denotes the partial derivative of $V(x)$ with respect to $x$. It should be
mentioned that (8.2.3) is a sort of a Lyapunov equation for nonlinear systems [2].
Define the Hamiltonian for the control $u \in \mathcal{A}(\Omega)$ and the value function $V(x)$ by

\[ H(x, V_x, u) = V_x^{T}F(x,u) + Q(x) + 2\tau\int_0^{u}\tanh^{-T}(\upsilon/\tau)\,d\upsilon. \tag{8.2.4} \]

Then, the optimal cost $V^*(x)$ can be obtained by solving the Hamilton–Jacobi–
Bellman (HJB) equation

\[ \min_{u(x) \in \mathcal{A}(\Omega)} H(x, V_x^*, u) = 0. \tag{8.2.5} \]

When considering the input-affine form, system (8.2.1) is often given as

\[ \dot{x}(t) = f(x(t)) + g(x(t))u(x(t)) \triangleq F\bigl(x(t), u(t)\bigr), \]

where $f(x(t)) \in \mathbb{R}^n$ and $g(x(t)) \in \mathbb{R}^{n \times m}$. Then, using (8.2.4) and (8.2.5), the closed-
form expression for constrained optimal control is derived as

\[ u^*(x) = -\tau\tanh\Bigl(\frac{1}{2\tau}g^{T}(x)V_x^*\Bigr). \tag{8.2.6} \]

However, for continuous-time nonaffine nonlinear system (8.2.1), the optimal
control $u^*(x)$ cannot be obtained as given in (8.2.6) using (8.2.4) and (8.2.5). The
main reason is that

\[ \partial H(x, V_x, u)/\partial u = 0 \]

might be a high-order nonlinear differential equation with respect to $u$, which is
intractable to solve analytically. In addition, the dynamics of system (8.2.1) are
unavailable, which makes it even more difficult to obtain the optimal control $u^*(x)$.
Therefore, in order to deal with the optimal control problem through (8.2.4) and
(8.2.5), we should first obtain the knowledge of the system dynamics. In what follows,
we present a dynamic NN to identify the unknown system dynamics.

8.2.1 Identifier Design via Dynamic Neural Networks

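According to [30], system (8.2.1) can be approximated by a dynamic NN as

\[ \dot{x}(t) = Ax(t) + W_{m_1}^{T}\varphi(x(t)) + W_{m_2}^{T}\rho(x(t))u(t) + \varepsilon(t), \tag{8.2.7} \]

where $A \in \mathbb{R}^{n \times n}$, $W_{m_1} \in \mathbb{R}^{n \times n}$, and $W_{m_2} \in \mathbb{R}^{n \times n}$ are ideal NN weight matrices,
and $\varepsilon(t) \in \mathbb{R}^n$ is a bounded dynamic NN function reconstruction error. The vector
function $\varphi(x) \in \mathbb{R}^n$ is assumed to be $n$-dimensional with the elements increasing
monotonically. The matrix function $\rho(x) \in \mathbb{R}^{n \times m}$ is assumed to be
$\rho(x) = \bigl[\rho_1(Y_1^{T}x), \ldots, \rho_n(Y_n^{T}x)\bigr]^{T}$, where $Y_i \in \mathbb{R}^{n \times m}$ is a constant matrix and $\rho_i(\cdot)$ is a
nondecreasing function. The typical representations of $\varphi(x)$ and $\rho(x)$ are sigmoid
functions, such as the hyperbolic tangent function. From the property of sigmoid
functions, for every $\zeta_1, \zeta_2 \in \mathbb{R}^n$, there exists $\kappa_i > 0$ ($i = 1, 2$) such that

\[ \begin{aligned}
\|\varphi(\zeta_1) - \varphi(\zeta_2)\| &\le \kappa_1\|\zeta_1 - \zeta_2\|, \\
\|\rho(\zeta_1) - \rho(\zeta_2)\| &\le \kappa_2\|\zeta_1 - \zeta_2\|.
\end{aligned} \tag{8.2.8} \]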

Remark 8.2.1 Though (8.2.7) shares the same feature as in [30], the matrix A is
assumed to be unknown in (8.2.7), rather than a stable matrix as in [30]. Therefore,
(8.2.7) can be considered as a more general case to estimate system (8.2.1) using
dynamic NNs.

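In this chapter, we define the dynamic NN identifier to approximate system
(8.2.1) as

\[ \dot{\hat{x}}(t) = \hat{A}(t)\hat{x}(t) + \hat{W}_{m_1}^{T}(t)\varphi(\hat{x}(t)) + \hat{W}_{m_2}^{T}(t)\rho(\hat{x}(t))u(t) + \eta\tilde{x}(t), \tag{8.2.9} \]

where $\hat{x}(t) \in \mathbb{R}^n$ is the dynamic NN state, $\hat{A}(t) \in \mathbb{R}^{n \times n}$, $\hat{W}_{m_1}(t) \in \mathbb{R}^{n \times n}$, and
$\hat{W}_{m_2}(t) \in \mathbb{R}^{n \times n}$ are estimated dynamic NN weight matrices, and $\eta\tilde{x}(t)$ is the residual error
term with the design parameter $\eta > 0$ and the identification error $\tilde{x}(t) = x(t) - \hat{x}(t)$.
By (8.2.7) and (8.2.9), the identification error dynamics can be developed as

\[ \begin{aligned}
\dot{\tilde{x}}(t) = {} & A\tilde{x}(t) + \tilde{A}(t)\hat{x}(t) + W_{m_1}^{T}\tilde{\varphi} + W_{m_2}^{T}\tilde{\rho}u(t) - \eta\tilde{x}(t) \\
& + \tilde{W}_{m_1}^{T}(t)\varphi(\hat{x}(t)) + \tilde{W}_{m_2}^{T}(t)\rho(\hat{x}(t))u(t) + \varepsilon(t),
\end{aligned} \tag{8.2.10} \]

where $\tilde{A}(t) = A - \hat{A}(t)$, $\tilde{W}_{m_1}(t) = W_{m_1} - \hat{W}_{m_1}(t)$, $\tilde{W}_{m_2}(t) = W_{m_2} - \hat{W}_{m_2}(t)$,
$\tilde{\varphi} = \varphi(x(t)) - \varphi(\hat{x}(t))$, and $\tilde{\rho} = \rho(x(t)) - \rho(\hat{x}(t))$.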
Prior to showing the stability of the identification error x̃(t), two assumptions are
introduced. It should be mentioned that these assumptions are common techniques,
which have been used in [9, 31, 32].

Assumption 8.2.1 The ideal NN weight matrices $A$, $W_{m_1}$, and $W_{m_2}$ satisfy

\[ AA^{T} \le \bar{P}_1, \quad W_{m_1}^{T}W_{m_1} \le \bar{P}_2, \quad\text{and}\quad W_{m_2}^{T}W_{m_2} \le \bar{P}_3, \]

where $\bar{P}_i$ ($i = 1, 2, 3$) are prior known symmetric positive-definite matrices.

Assumption 8.2.2 The dynamic NN function reconstruction error $\varepsilon(t)$ is upper
bounded by the state estimation error $\tilde{x}(t)$, i.e.,

\[ \varepsilon^{T}(t)\varepsilon(t) \le \delta_M\tilde{x}^{T}(t)\tilde{x}(t), \]

where $\delta_M$ is a positive constant.

Theorem 8.2.1 Let Assumptions 8.2.1 and 8.2.2 be satisfied. If the estimated
dynamic NN weight matrices $\hat{A}(t)$, $\hat{W}_{m_1}(t)$, and $\hat{W}_{m_2}(t)$ are updated as

\[ \begin{aligned}
\dot{\hat{A}}(t) &= \Lambda_1\hat{x}(t)\tilde{x}^{T}(t), \\
\dot{\hat{W}}_{m_1}(t) &= \Lambda_2\varphi(\hat{x}(t))\tilde{x}^{T}(t), \\
\dot{\hat{W}}_{m_2}(t) &= \Lambda_3\rho(\hat{x}(t))u(t)\tilde{x}^{T}(t),
\end{aligned} \tag{8.2.11} \]

where $\Lambda_i \in \mathbb{R}^{n \times n}$ ($i = 1, 2, 3$) are design matrices, then the identifier developed in
(8.2.9) can ensure identification of the state, in the sense that the identification error
in (8.2.10) goes to zero, i.e., $\lim_{t\to\infty}\tilde{x}(t) = 0$.

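Proof Consider the Lyapunov function candidate

\[ L(x) = L_1(x) + L_2(x), \tag{8.2.12} \]

where

\[ \begin{aligned}
L_1(x) &= \frac{1}{2}\tilde{x}^{T}(t)\tilde{x}(t), \\
L_2(x) &= \frac{1}{2}\operatorname{tr}\bigl\{\tilde{A}^{T}(t)\Lambda_1^{-1}\tilde{A}(t) + \tilde{W}_{m_1}^{T}(t)\Lambda_2^{-1}\tilde{W}_{m_1}(t) + \tilde{W}_{m_2}^{T}(t)\Lambda_3^{-1}\tilde{W}_{m_2}(t)\bigr\}.
\end{aligned} \]

Taking the time derivative of $L_1(x)$ and using the identification error (8.2.10), we
obtain

\[ \begin{aligned}
\dot{L}_1(x) = {} & \tilde{x}^{T}(t)A\tilde{x}(t) + \tilde{x}^{T}(t)\tilde{A}(t)\hat{x}(t) + \tilde{x}^{T}(t)W_{m_1}^{T}\tilde{\varphi} \\
& + \tilde{x}^{T}(t)W_{m_2}^{T}\tilde{\rho}u(t) - \eta\tilde{x}^{T}(t)\tilde{x}(t) + \tilde{x}^{T}(t)\tilde{W}_{m_1}^{T}(t)\varphi(\hat{x}(t)) \\
& + \tilde{x}^{T}(t)\tilde{W}_{m_2}^{T}(t)\rho(\hat{x}(t))u(t) + \tilde{x}^{T}(t)\varepsilon(t).
\end{aligned} \tag{8.2.13} \]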

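Applying the Cauchy–Schwarz inequality $a^{T}b \le (1/2)a^{T}a + (1/2)b^{T}b$ (cf. [4]) to
$\tilde{x}^{T}(t)\varepsilon(t)$ and $\tilde{x}^{T}(t)A\tilde{x}(t)$ in (8.2.13), it follows

\[ \begin{aligned}
\tilde{x}^{T}(t)\varepsilon(t) &\le \frac{1}{2}\tilde{x}^{T}(t)\tilde{x}(t) + \frac{1}{2}\varepsilon^{T}(t)\varepsilon(t), \\
\tilde{x}^{T}(t)A\tilde{x}(t) &\le \frac{1}{2}\tilde{x}^{T}(t)AA^{T}\tilde{x}(t) + \frac{1}{2}\tilde{x}^{T}(t)\tilde{x}(t).
\end{aligned} \tag{8.2.14} \]

Note that from (8.2.1), $u(t)$ is bounded, i.e.,

\[ \|u(t)\| \le \sqrt{\sum_{i=1}^{m}\tau^2} \triangleq \alpha. \]

Using (8.2.8) and the Cauchy–Schwarz inequality, we get

\[ \begin{aligned}
\tilde{x}^{T}(t)W_{m_1}^{T}\tilde{\varphi} &\le \frac{1}{2}\tilde{x}^{T}(t)W_{m_1}^{T}W_{m_1}\tilde{x}(t) + \frac{1}{2}\tilde{\varphi}^{T}\tilde{\varphi} \\
&\le \frac{1}{2}\tilde{x}^{T}(t)W_{m_1}^{T}W_{m_1}\tilde{x}(t) + \frac{\kappa_1^2}{2}\tilde{x}^{T}(t)\tilde{x}(t), \\
\tilde{x}^{T}(t)W_{m_2}^{T}\tilde{\rho}u(t) &\le \frac{1}{2}\tilde{x}^{T}(t)W_{m_2}^{T}W_{m_2}\tilde{x}(t) + \frac{1}{2}u^{T}(t)\tilde{\rho}^{T}\tilde{\rho}u(t) \\
&\le \frac{1}{2}\tilde{x}^{T}(t)W_{m_2}^{T}W_{m_2}\tilde{x}(t) + \frac{(\alpha\kappa_2)^2}{2}\tilde{x}^{T}(t)\tilde{x}(t).
\end{aligned} \tag{8.2.15} \]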
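Substituting (8.2.14) and (8.2.15) into (8.2.13), we have

\[ \begin{aligned}
\dot{L}_1(x) \le {} & \frac{1}{2}\tilde{x}^{T}(t)\bigl(AA^{T} + W_{m_1}^{T}W_{m_1} + W_{m_2}^{T}W_{m_2}\bigr)\tilde{x}(t) \\
& + \frac{1}{2}\bigl(2 + \kappa_1^2 + \alpha^2\kappa_2^2 - 2\eta\bigr)\tilde{x}^{T}(t)\tilde{x}(t) \\
& + \tilde{x}^{T}(t)\tilde{A}(t)\hat{x}(t) + \tilde{x}^{T}(t)\tilde{W}_{m_1}^{T}(t)\varphi(\hat{x}(t)) \\
& + \tilde{x}^{T}(t)\tilde{W}_{m_2}^{T}(t)\rho(\hat{x}(t))u(t) + \frac{1}{2}\varepsilon^{T}(t)\varepsilon(t).
\end{aligned} \tag{8.2.16} \]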
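On the other hand, taking the time derivative of $L_2(x)$ and using the weight update
law (8.2.11), we obtain

\[ \dot{L}_2(x) = -\operatorname{tr}\bigl\{\tilde{A}^{T}(t)\hat{x}(t)\tilde{x}^{T}(t) + \tilde{W}_{m_1}^{T}(t)\varphi(\hat{x}(t))\tilde{x}^{T}(t) + \tilde{W}_{m_2}^{T}(t)\rho(\hat{x}(t))u(t)\tilde{x}^{T}(t)\bigr\}. \tag{8.2.17} \]

Observe that $\operatorname{tr}\bigl(X_1X_2^{T}\bigr) = \operatorname{tr}\bigl(X_2^{T}X_1\bigr) = X_2^{T}X_1$, $\forall X_1, X_2 \in \mathbb{R}^{n \times 1}$. Then, (8.2.17) can
be rewritten as

\[ \dot{L}_2(x) = -\tilde{x}^{T}(t)\tilde{A}^{T}(t)\hat{x}(t) - \tilde{x}^{T}(t)\tilde{W}_{m_1}^{T}(t)\varphi(\hat{x}(t)) - \tilde{x}^{T}(t)\tilde{W}_{m_2}^{T}(t)\rho(\hat{x}(t))u(t). \tag{8.2.18} \]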

Combining (8.2.16) and (8.2.18) and employing Assumptions 8.2.1 and 8.2.2, we
obtain
1 T  T 
L̇(x) ≤ x̃ (t) A A + WmT1 Wm 1 + WmT2 Wm 2 x̃(t)
2
1  1
+ κ12 + κ22 α 2 + 2 − 2η x̃ T(t)x̃(t) + εT(t)ε(t)
2 2
1 T   3
≤ − x̃ (t) 2η − κ1 − α κ2 − δ M − 2 In −
2 2 2
P̄i x̃(t).
2 i=1

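Denote

$$
B = \big(2\eta - \kappa_1^{2} - \alpha^{2}\kappa_2^{2} - \delta_M - 2\big)I_n - \sum_{i=1}^{3}\bar{P}_i.
$$

Selecting η such that B is positive definite, we have

$$
\dot{L}(x) \le -\tfrac12\tilde{x}^{T}(t)B\tilde{x}(t) \le -\tfrac12\lambda_{\min}(B)\,\|\tilde{x}(t)\|^{2}.
\tag{8.2.19}
$$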

Equations (8.2.12) and (8.2.19) guarantee that x̃(t), Ã(t), W̃m 1 (t), and W̃m 2 (t) are
bounded since L(x) is decreasing. Integrating both sides of (8.2.19), we obtain

$$
\int_0^{\infty}\|\tilde{x}(t)\|^{2}\,dt \le \frac{2\big(L(0) - L(\infty)\big)}{\lambda_{\min}(B)}.
$$

Note that the right-hand side of the above expression is finite since $\lambda_{\min}(B) > 0$ and $L(\infty) < L(0) < \infty$ by virtue of (8.2.19). By Barbalat's lemma [13], $\lim_{t\to\infty}\|\tilde{x}(t)\| = 0$, i.e., $\lim_{t\to\infty}\tilde{x}(t) = 0$. This completes the proof.

Remark 8.2.2 It is worth pointing out that the eigenvalues of a positive-definite matrix are all positive. Therefore, in order to guarantee that the right-hand side of (8.2.19) is negative, η should be selected such that B is positive definite. Meanwhile, $\bar{P}_1$, $\bar{P}_2$, and $\bar{P}_3$ are selected sufficiently large to guarantee the validity of Assumption 8.2.1, since the ideal NN weights $A$, $W_{m_1}$, and $W_{m_2}$ are typically unknown. In this sense, η can only be chosen by trial and error.
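The condition that B be positive definite can be illustrated with a small sketch. It assumes, purely for illustration, that upper bounds on the largest eigenvalues of the matrices P̄_i are available; the function names and all numerical values are hypothetical and not taken from the text. Because λ_min(cI − ΣP̄_i) ≥ c − Σλ_max(P̄_i), the returned η is sufficient but not necessary.

```python
def sufficient_eta(kappa1, kappa2, alpha, delta_M, p_bar_max_eigs, margin=1e-3):
    """Return an eta for which B is positive definite (sufficient condition).

    B = (2*eta - kappa1^2 - alpha^2*kappa2^2 - delta_M - 2) I_n - sum_i P_bar_i.
    p_bar_max_eigs holds assumed upper bounds on lambda_max(P_bar_i).
    """
    offset = kappa1 ** 2 + (alpha * kappa2) ** 2 + delta_M + 2.0
    return 0.5 * (offset + sum(p_bar_max_eigs)) + margin


def lambda_min_lower_bound(eta, kappa1, kappa2, alpha, delta_M, p_bar_max_eigs):
    """Lower bound on lambda_min(B); a positive value implies B > 0."""
    c = 2.0 * eta - kappa1 ** 2 - (alpha * kappa2) ** 2 - delta_M - 2.0
    return c - sum(p_bar_max_eigs)


# Hypothetical constants, chosen only to exercise the functions.
eta = sufficient_eta(1.0, 1.0, 0.45, 0.1, [1.5, 1.2, 1.1])
bound = lambda_min_lower_bound(eta, 1.0, 1.0, 0.45, 0.1, [1.5, 1.2, 1.1])
```

In practice the constants are unknown, which is why the remark recommends trial and error; the sketch only shows how the terms of B trade off against η.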

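Remark 8.2.3 By (8.2.11) and $\lim_{t\to\infty}\tilde{x}(t) = 0$, we derive $\lim_{t\to\infty}\dot{\hat{A}}(t) = 0$, $\lim_{t\to\infty}\dot{\hat{W}}_{m_1}(t) = 0$, and $\lim_{t\to\infty}\dot{\hat{W}}_{m_2}(t) = 0$. Therefore, $\hat{A}(t)$, $\hat{W}_{m_1}(t)$, and $\hat{W}_{m_2}(t)$ will all tend to constants, written as $\hat{A}$, $\hat{W}_{m_1}$, and $\hat{W}_{m_2}$, respectively. Hence, using the identifier (8.2.9), system (8.2.1) can be represented as

$$
\dot{x}(t) = \hat{A}x(t) + \hat{W}_{m_1}^{T}\phi(x(t)) + \hat{W}_{m_2}^{T}\rho(x(t))u(t).
\tag{8.2.20}
$$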

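By Remark 8.2.3, and using (8.2.6), we derive the optimal control for the given problem as

$$
u^{*}(x) = -\tau\tanh\Big(\frac{1}{2\tau}\rho^{T}(x)\hat{W}_{m_2}V_x^{*}\Big).
\tag{8.2.21}
$$

Then, the HJB equation (8.2.5) becomes

$$
V_x^{*T}\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x)\big) - 2\tau^{2}\Phi^{T}(x)\tanh\big(\Phi(x)\big) + Q(x) + 2\tau\int_0^{-\tau\tanh(\Phi(x))}\tanh^{-T}(\upsilon/\tau)\,d\upsilon = 0,
\tag{8.2.22}
$$

where

$$
\Phi(x) = \frac{1}{2\tau}\rho^{T}(x)\hat{W}_{m_2}V_x^{*}.
$$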

Denote

Φ(x) = (Φ1 (x), . . . , Φm (x))T ∈ Rm with Φi (x) ∈ R, i = 1, . . . , m.

According to the integration formulas of inverse hyperbolic tangent function, we note that

$$
\begin{aligned}
2\tau\int_0^{-\tau\tanh(\Phi(x))}\tanh^{-T}(\upsilon/\tau)\,d\upsilon
&= 2\tau\sum_{i=1}^{m}\int_0^{-\tau\tanh(\Phi_i(x))}\tanh^{-T}(\upsilon_i/\tau)\,d\upsilon_i \\
&= 2\tau^{2}\Phi^{T}(x)\tanh\big(\Phi(x)\big) + \tau^{2}\sum_{i=1}^{m}\ln\big(1 - \tanh^{2}\Phi_i(x)\big).
\end{aligned}
\tag{8.2.23}
$$

Therefore, (8.2.22) can be rewritten as

$$
V_x^{*T}\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x)\big) + Q(x) + \tau^{2}\sum_{i=1}^{m}\ln\big(1 - \tanh^{2}\Phi_i(x)\big) = 0.
$$
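The closed form in (8.2.23) can be checked numerically in the scalar case m = 1. The following is a minimal sketch using a midpoint rule; the function names, the step count, and the test values of Φ and τ are assumptions made for illustration only.

```python
import math


def integral_term(phi, tau, steps=20000):
    """Midpoint-rule value of 2*tau * int_0^{-tau*tanh(phi)} atanh(v/tau) dv."""
    upper = -tau * math.tanh(phi)
    dv = upper / steps
    total = 0.0
    for k in range(steps):
        v = (k + 0.5) * dv  # midpoint of the k-th subinterval
        total += math.atanh(v / tau)
    return 2.0 * tau * total * dv


def closed_form(phi, tau):
    """Right-hand side of (8.2.23) for m = 1."""
    t = math.tanh(phi)
    return 2.0 * tau ** 2 * phi * t + tau ** 2 * math.log(1.0 - t * t)


# The two values should agree closely for moderate phi (illustrative check).
gap = abs(integral_term(0.7, 0.45) - closed_form(0.7, 0.45))
```

Both expressions are nonnegative, which is consistent with the integral term acting as a penalty on the control effort.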

8.2.2 Actor–Critic Architecture for Solving HJB Equation

According to the universal approximation property of feedforward NNs [11], the value function V(x) given in (8.2.2) can accurately be represented by an NN on a compact set Ω as

$$
V(x) = W_c^{T}\sigma(x) + \varepsilon_{N_0}(x),
\tag{8.2.24}
$$

where $W_c \in \mathbb{R}^{N_0}$ is the ideal NN weight vector, $N_0$ is the number of neurons,

$$
\sigma(x) = \big[\sigma_1(x), \sigma_2(x), \ldots, \sigma_{N_0}(x)\big]^{T} \in \mathbb{R}^{N_0}
$$

is the activation function with $\sigma_j(x) \in C^{1}(\Omega)$ and $\sigma_j(0) = 0$, the set $\{\sigma_j(x)\}_{1}^{N_0}$ is often selected to be linearly independent, and $\varepsilon_{N_0}(x)$ is a bounded NN function reconstruction error. The derivative of V(x) with respect to x is given by

$$
V_x = \nabla\sigma^{T}(x)W_c + \nabla\varepsilon_{N_0}(x),
\tag{8.2.25}
$$

where $\nabla\sigma(x) = \partial\sigma(x)/\partial x$, $\nabla\varepsilon_{N_0}(x) = \partial\varepsilon_{N_0}(x)/\partial x$, and $\nabla\sigma(0) = 0$.


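Using (8.2.25) and Remark 8.2.3, (8.2.3) becomes

$$
W_c^{T}\nabla\sigma\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) + \hat{W}_{m_2}^{T}\rho(x)u\big) + 2\tau\sum_{i=1}^{m}\int_0^{u_i}\tanh^{-T}(\upsilon_i/\tau)\,d\upsilon_i + Q(x) = \varepsilon_{LE},
\tag{8.2.26}
$$

where $\varepsilon_{LE}$ is the residual error and defined as

$$
\varepsilon_{LE} = -\nabla\varepsilon_{N_0}^{T}\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) + \hat{W}_{m_2}^{T}\rho(x)u\big).
$$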

Since the ideal NN weight $W_c$ is typically unknown, (8.2.24) cannot be implemented in a real-time control process. Therefore, we employ a critic NN to approximate the value function V(x) as

$$
\hat{V}(x) = \hat{W}_c^{T}\sigma(x),
\tag{8.2.27}
$$

where $\hat{W}_c$ is the estimate of $W_c$. The weight estimation error for the critic NN is defined as

$$
\tilde{W}_c = W_c - \hat{W}_c.
$$

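Then, (8.2.26) can be rewritten as

$$
\hat{W}_c^{T}\nabla\sigma\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) + \hat{W}_{m_2}^{T}\rho(x)u\big) + 2\tau\sum_{i=1}^{m}\int_0^{u_i}\tanh^{-T}(\upsilon_i/\tau)\,d\upsilon_i + Q(x) = \delta_{\varepsilon},
\tag{8.2.28}
$$

where $\delta_{\varepsilon}$ is the Bellman residual error. From (8.2.26) and (8.2.28), we derive

$$
\delta_{\varepsilon} = -\tilde{W}_c^{T}\nabla\sigma\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) + \hat{W}_{m_2}^{T}\rho(x)u\big) + \varepsilon_{LE}.
\tag{8.2.29}
$$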

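To get the minimum value of $\delta_{\varepsilon}$, it is desired to minimize the objective function

$$
E = \tfrac12\delta_{\varepsilon}^{2}.
$$

Using the gradient descent algorithm, the weight update law for the critic NN is developed as

$$
\dot{\hat{W}}_c = -\frac{l_c}{(1 + h^{T}h)^{2}}\frac{\partial E}{\partial\hat{W}_c} = -l_c\frac{h}{(1 + h^{T}h)^{2}}\delta_{\varepsilon},
\tag{8.2.30}
$$

where

$$
h = \nabla\sigma\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) + \hat{W}_{m_2}^{T}\rho(x)u\big),
$$

$l_c > 0$ is the critic NN learning rate, and the term $(1 + h^{T}h)^{2}$ is employed for normalization.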
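The normalized gradient law (8.2.30) can be sketched as a discrete-time update. The code below assumes an explicit Euler discretization with step `dt`, which is an implementation choice not stated in the text; `critic_step`, `running_cost`, and the numerical values are hypothetical names and data used only for illustration.

```python
def critic_step(w_hat, h, running_cost, lc, dt):
    """One explicit-Euler step of the normalized gradient law (8.2.30).

    w_hat: current critic weights (list of floats)
    h: regressor vector, same length as w_hat
    running_cost: scalar Q(x) plus the nonquadratic control penalty in (8.2.28)
    Returns the updated weights and the Bellman residual before the update.
    """
    delta = sum(w * hi for w, hi in zip(w_hat, h)) + running_cost
    norm = (1.0 + sum(hi * hi for hi in h)) ** 2  # normalization (1 + h^T h)^2
    new_w = [w - dt * lc * hi * delta / norm for w, hi in zip(w_hat, h)]
    return new_w, delta


# Repeated steps on frozen data shrink the residual (hypothetical numbers).
w = [0.0, 0.0]
w, d0 = critic_step(w, [1.0, 2.0], 1.0, lc=0.3, dt=0.1)
w, d1 = critic_step(w, [1.0, 2.0], 1.0, lc=0.3, dt=0.1)
```

The normalization keeps the effective step bounded even when the regressor h is large, which is the role the text assigns to the term (1 + h^T h)^2.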
On the other hand, by (8.2.21) and (8.2.27), the control policy can be approximated by an action NN as

$$
\hat{u}_c(x) = -\tau\tanh\Big(\frac{1}{2\tau}\rho^{T}(x)\hat{W}_{m_2}\nabla\sigma^{T}\hat{W}_c\Big).
$$

From the expression $\hat{u}_c(x)$, one can find the action NN shares the same weights $\hat{W}_c$ as the critic NN. However, in standard weight tuning laws [2, 26], the critic NN and the action NN are tuned sequentially, with the weights of the other NN being kept constant. It is generally considered that this type of weight update law is more time-consuming than tuning the two NN weights simultaneously as in the present case.
To tune simultaneously the weights of the critic NN and the action NN, a separate action NN is employed. In this sense, the control input is redefined as

$$
\hat{u}_a(x) = -\tau\tanh\Big(\frac{1}{2\tau}\rho^{T}(x)\hat{W}_{m_2}\nabla\sigma^{T}\hat{W}_a\Big),
\tag{8.2.31}
$$

where $\hat{W}_a \in \mathbb{R}^{N_0}$ denotes the current estimation of $W_a = W_c$. The weight estimation error for the action NN is defined as $\tilde{W}_a = W_c - \hat{W}_a$. The weight update law for the action NN shall be developed in the subsequent section based on the stability analysis.

Remark 8.2.4 Unlike the actor update law derived in [5, 32] by minimizing the Bellman residual error, we develop the actor tuning law based on the stability analysis, which is similar in spirit to [21, 25]. The concrete form of the action NN update law is developed next.

8.2.3 Stability Analysis of Closed-Loop System

We start with three assumptions and then establish the stability result.

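Assumption 8.2.3 The parameters defined in (8.2.20) are upper bounded such that $\|\hat{A}\| \le a_1$, $\|\hat{W}_{m_1}\| \le a_2$, and $\|\hat{W}_{m_2}\| \le a_3$.

Assumption 8.2.4 There exist known constants $b_{\phi} > 0$ and $b_{\rho} > 0$ such that $\|\phi(x)\| \le b_{\phi}\|x\|$ and $\|\rho(x)\| \le b_{\rho}\|x\|$ for every $x \in \Omega$. Meanwhile, there exist known constants $b_{\sigma} > 0$ and $b_{\sigma x} > 0$ such that $\|\sigma(x)\| < b_{\sigma}$ and $\|\nabla\sigma(x)\| < b_{\sigma x}$ for every $x \in \Omega$.

Assumption 8.2.5 The NN reconstruction error $\varepsilon_{N_0}(x)$ and its derivative with respect to x are upper bounded as $\|\varepsilon_{N_0}(x)\| < b_{\varepsilon}$ and $\|\nabla\varepsilon_{N_0}(x)\| < b_{\varepsilon x}$.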

Theorem 8.2.2 Consider the nonlinear system described by (8.2.1) with the structure unknown. Let Assumptions 8.2.3–8.2.5 hold. Take the critic NN and the action NN as in (8.2.27) and (8.2.31), respectively. Let the weight update law for the critic NN be

$$
\dot{\hat{W}}_c = -l_c\frac{h}{(1 + h^{T}h)^{2}}\Big(h^{T}\hat{W}_c + 2\tau\int_0^{\hat{u}_a}\tanh^{-T}(\upsilon/\tau)\,d\upsilon + Q(x)\Big),
\tag{8.2.32}
$$

where $h = \nabla\sigma\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) + \hat{W}_{m_2}^{T}\rho(x)\hat{u}_a\big)$. Define $\bar{h} \triangleq h/(1 + h^{T}h)$. Let the action NN be tuned by

$$
\dot{\hat{W}}_a = -l_a\Big\{K_2\hat{W}_a - K_1\bar{h}^{T}\hat{W}_c - \tau\nabla\sigma(x)\hat{W}_{m_2}^{T}\rho(x)\big[\tanh(A_a) - \mathrm{sgn}(A_a)\big]\frac{\bar{h}^{T}}{1 + h^{T}h}\hat{W}_c\Big\},
\tag{8.2.33}
$$

where

$$
A_a = \frac{1}{2\tau}\rho^{T}(x)\hat{W}_{m_2}\nabla\sigma^{T}(x)\hat{W}_a,
$$

$K_1$ and $K_2$ are the tuning vector and the tuning matrix with suitable dimensions which will be detailed in the proof, and $l_a > 0$ is the learning rate of the action NN. Then, the system state x(t) and the actor–critic weight estimation errors $\tilde{W}_a$ and $\tilde{W}_c$ are all guaranteed to be UUB when the number of neurons $N_0$ is selected large enough.

Proof Consider the Lyapunov function candidate

$$
L(x) = V(x) + L_1(x) + L_2(x),
\tag{8.2.34}
$$

where

$$
L_1 = \tfrac12\tilde{W}_c^{T}l_c^{-1}\tilde{W}_c
\quad\text{and}\quad
L_2 = \tfrac12\tilde{W}_a^{T}l_a^{-1}\tilde{W}_a.
$$
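Taking the time derivative of V(x) based on (8.2.20) and (8.2.24), we have

$$
\begin{aligned}
\dot{V}(x) &= \big(W_c^{T}\nabla\sigma + \nabla\varepsilon_{N_0}^{T}\big)\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) + \hat{W}_{m_2}^{T}\rho(x)\hat{u}_a\big) \\
&= W_c^{T}\nabla\sigma\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x)\big) - \tau W_c^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\tanh(A_a) + \Xi,
\end{aligned}
\tag{8.2.35}
$$

where $A_a$ is defined as in (8.2.33), and Ξ is given as

$$
\Xi = \nabla\varepsilon_{N_0}^{T}\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) - \tau\hat{W}_{m_2}^{T}\rho(x)\tanh(A_a)\big).
$$

Using Assumptions 8.2.3–8.2.5, we obtain

$$
\|\Xi\| \le b_{\varepsilon x}\big(a_1 + a_2b_{\phi} + \tau a_3b_{\rho}\big)\|x\|.
\tag{8.2.36}
$$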

On the other hand, from (8.2.21), (8.2.22), and (8.2.25), we define the residual error as

$$
W_c^{T}\nabla\sigma\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x)\big) - \tau W_c^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\tanh(A_c) + 2\tau\int_0^{-\tau\tanh(A_c)}\tanh^{-T}(\upsilon/\tau)\,d\upsilon + Q(x) = \varepsilon_{HJB},
\tag{8.2.37}
$$

where

$$
A_c = \frac{1}{2\tau}\rho^{T}(x)\hat{W}_{m_2}\nabla\sigma^{T}(x)W_c,
$$

and $\varepsilon_{HJB}$ converges to zero when the number of neurons $N_0$ is large enough [2, 25]. Assume that $\varepsilon_{HJB}$ is upper bounded such that $\|\varepsilon_{HJB}\| \le \varepsilon_{\max}$, where $\varepsilon_{\max} > 0$ is a small number.
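According to (8.2.23),

$$
2\tau\int_0^{-\tau\tanh(A_c)}\tanh^{-T}(\upsilon/\tau)\,d\upsilon
$$

is positive definite, and combining (8.2.35)–(8.2.37), we have

$$
\begin{aligned}
\dot{V}(x(t)) &= -Q(x) - 2\tau\int_0^{-\tau\tanh(A_c)}\tanh^{-T}(\upsilon/\tau)\,d\upsilon \\
&\quad + \tau W_c^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big[\tanh(A_c) - \tanh(A_a)\big] + \varepsilon_{HJB} + \Xi \\
&\le -Q(x) + 2\tau\big\|W_c^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big\| + \varepsilon_{HJB} + \Xi \\
&\le -Q(x) + \beta\|x\| + \varepsilon_{\max},
\end{aligned}
\tag{8.2.38}
$$

where $\beta = 2\tau a_3b_{\sigma x}b_{\rho}\|W_c\| + b_{\varepsilon x}\big(a_1 + a_2b_{\phi} + \tau a_3b_{\rho}\big)$.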
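Taking the time derivative of $L_1(t)$, we derive

$$
\begin{aligned}
\dot{L}_1(t) &= \tilde{W}_c^{T}l_c^{-1}\dot{\tilde{W}}_c = \tilde{W}_c^{T}\frac{h}{(1 + h^{T}h)^{2}}\Big[\hat{W}_c^{T}h + 2\tau\int_0^{\hat{u}_a}\tanh^{-T}(\upsilon/\tau)\,d\upsilon + Q(x)\Big] \\
&= \tilde{W}_c^{T}\frac{h}{(1 + h^{T}h)^{2}}\Big[\hat{W}_c^{T}h + 2\tau\int_0^{-\tau\tanh(A_a)}\tanh^{-T}(\upsilon/\tau)\,d\upsilon \\
&\qquad - W_c^{T}\xi - 2\tau\int_0^{-\tau\tanh(A_c)}\tanh^{-T}(\upsilon/\tau)\,d\upsilon + \varepsilon_{HJB}\Big] \\
&= \frac{\tilde{W}_c^{T}h}{(1 + h^{T}h)^{2}}\Big[\Gamma(A_a) - \Gamma(A_c) - \tilde{W}_c^{T}h + W_c^{T}(h - \xi) + \varepsilon_{HJB}\Big],
\end{aligned}
\tag{8.2.39}
$$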

where h is defined in (8.2.32),

$$
\begin{aligned}
\xi &= \nabla\sigma\big(\hat{A}x + \hat{W}_{m_1}^{T}\phi(x) - \tau\hat{W}_{m_2}^{T}\rho(x)\tanh(A_c)\big), \\
\Gamma(A_i) &= 2\tau\int_0^{-\tau\tanh(A_i)}\tanh^{-T}(\upsilon/\tau)\,d\upsilon, \quad i = 1, 2.
\end{aligned}
$$

Denote $A_i(x) = (A_{i1}(x), \ldots, A_{im}(x))^{T}$ with $A_{ij}(x) \in \mathbb{R}$, $i = 1, 2$; $j = 1, \ldots, m$. Then, $\Gamma(A_i)$ can be expressed as

$$
\begin{aligned}
\Gamma(A_i) &= 2\tau\sum_{j=1}^{m}\int_0^{-\tau\tanh(A_{ij})}\tanh^{-1}(\upsilon_j/\tau)\,d\upsilon_j \\
&= 2\tau^{2}A_i^{T}\tanh(A_i) + \tau^{2}\sum_{j=1}^{m}\ln\big(1 - \tanh^{2}(A_{ij})\big).
\end{aligned}
\tag{8.2.40}
$$

 
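For every $A_{ij} \in \mathbb{R}$, $\ln\big(1 - \tanh^{2}(A_{ij})\big)$ can be expressed as

$$
\ln\big(1 - \tanh^{2}(A_{ij})\big) =
\begin{cases}
\ln 4 - 2A_{ij} - 2\ln\big(1 + \exp(-2A_{ij})\big), & A_{ij} > 0; \\
\ln 4 + 2A_{ij} - 2\ln\big(1 + \exp(2A_{ij})\big), & A_{ij} < 0.
\end{cases}
$$

That is,

$$
\ln\big(1 - \tanh^{2}(A_{ij})\big) = \ln 4 - 2A_{ij}\,\mathrm{sgn}(A_{ij}) - 2\ln\big(1 + \exp(-2A_{ij}\,\mathrm{sgn}(A_{ij}))\big),
\tag{8.2.41}
$$

where $\mathrm{sgn}(A_{ij}) \in \mathbb{R}$ is the sign function with respect to $A_{ij}$.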
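Besides its role in the proof, the identity (8.2.41) gives a numerically robust way to evaluate ln(1 − tanh²(·)) for large arguments, where tanh rounds to 1 in floating point and the direct formula fails. The sketch below is illustrative only; the function name is hypothetical.

```python
import math


def log_one_minus_tanh_sq(a):
    """ln(1 - tanh(a)^2) computed via (8.2.41).

    Uses |a| = a * sgn(a), so the exponential argument is never positive
    and cannot overflow.
    """
    s = abs(a)
    return math.log(4.0) - 2.0 * s - 2.0 * math.log1p(math.exp(-2.0 * s))


# For moderate arguments this matches the direct evaluation.
direct = math.log(1.0 - math.tanh(0.8) ** 2)
stable = log_one_minus_tanh_sq(0.8)
```

For an argument such as 30, the direct expression evaluates the logarithm of zero in double precision, whereas the rewritten form remains finite.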
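Combining (8.2.40) and (8.2.41), it follows

$$
\Gamma(A_a) - \Gamma(A_c) = 2\tau^{2}\big[A_a^{T}\tanh(A_a) - A_c^{T}\tanh(A_c)\big] + 2\tau^{2}\big[A_c^{T}\mathrm{sgn}(A_c) - A_a^{T}\mathrm{sgn}(A_a)\big] + \Theta_A,
\tag{8.2.42}
$$

where

$$
\Theta_A = \sum_{j=1}^{m}2\ln\frac{1 + \exp\big(-2A_{2j}^{T}\,\mathrm{sgn}(A_{2j})\big)}{1 + \exp\big(-2A_{1j}^{T}\,\mathrm{sgn}(A_{1j})\big)}.
$$

From the expression of $\Theta_A$, one can conclude that

$$
\Theta_A \in [-m\ln 4,\; m\ln 4].
$$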

Using $A_a$ defined as in (8.2.33) and $A_c$ defined as in (8.2.37), (8.2.42) can be written as

$$
\begin{aligned}
\Gamma(A_a) - \Gamma(A_c) ={}& \tau\hat{W}_a^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\tanh(A_a) - \tau W_c^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\tanh(A_c) \\
&- \tau W_c^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big[\mathrm{sgn}(A_a) - \mathrm{sgn}(A_c)\big] \\
&+ \tau\tilde{W}_a^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\,\mathrm{sgn}(A_a) + \Theta_A.
\end{aligned}
\tag{8.2.43}
$$

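Therefore, from (8.2.39) and (8.2.43), it follows

$$
\begin{aligned}
\dot{L}_1(x) ={}& \tilde{W}_c^{T}\frac{h}{(1 + h^{T}h)^{2}}\Big\{\tau\tilde{W}_a^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big[\mathrm{sgn}(A_a) - \tanh(A_a)\big] \\
&\quad - \tau W_c^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big[\mathrm{sgn}(A_a) - \mathrm{sgn}(A_c)\big] - \tilde{W}_c^{T}h + \varepsilon_{HJB} + \Theta_A\Big\} \\
={}& \tau\tilde{W}_a^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big[\tanh(A_a) - \mathrm{sgn}(A_a)\big]\frac{\bar{h}^{T}}{1 + h^{T}h}\hat{W}_c \\
&- \tilde{W}_c^{T}\bar{h}\bar{h}^{T}\tilde{W}_c + \tilde{W}_c^{T}\bar{h}\bar{D}_1(x) + \tilde{W}_a^{T}\bar{D}_2(x),
\end{aligned}
\tag{8.2.44}
$$

where

$$
\begin{aligned}
\bar{D}_1(x) &= \frac{1}{1 + h^{T}h}\Big(\tau W_c^{T}\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big[\mathrm{sgn}(A_c) - \mathrm{sgn}(A_a)\big] - \varepsilon_{HJB} - \Theta_A\Big), \\
\bar{D}_2(x) &= \tau\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big[\mathrm{sgn}(A_a) - \tanh(A_a)\big]\frac{\bar{h}^{T}}{1 + h^{T}h}W_c.
\end{aligned}
$$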

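Using (8.2.34), (8.2.38), and (8.2.44), we have

$$
\begin{aligned}
\dot{L}(x) <{}& -Q(x) - \tilde{W}_c^{T}\bar{h}\bar{h}^{T}\tilde{W}_c + \tilde{W}_c^{T}\bar{h}\bar{D}_1(x) + \tilde{W}_a^{T}\bar{D}_2(x) + \beta\|x\| + \varepsilon_{\max} \\
&- \tilde{W}_a^{T}\Big\{l_a^{-1}\dot{\hat{W}}_a - \tau\nabla\sigma\hat{W}_{m_2}^{T}\rho(x)\big[\tanh(A_a) - \mathrm{sgn}(A_a)\big]\frac{\bar{h}^{T}}{1 + h^{T}h}\hat{W}_c\Big\}.
\end{aligned}
\tag{8.2.45}
$$

To guarantee $\dot{L}(x) < 0$, we use the weight update law for the action NN as in (8.2.33). Observe that

$$
\begin{aligned}
\tilde{W}_a^{T}K_2\hat{W}_a - \tilde{W}_a^{T}K_1\bar{h}^{T}\hat{W}_c &= \tilde{W}_a^{T}K_2(W_c - \tilde{W}_a) - \tilde{W}_a^{T}K_1\bar{h}^{T}(W_c - \tilde{W}_c) \\
&= \tilde{W}_a^{T}K_2W_c - \tilde{W}_a^{T}K_2\tilde{W}_a - \tilde{W}_a^{T}K_1\bar{h}^{T}W_c + \tilde{W}_a^{T}K_1\bar{h}^{T}\tilde{W}_c.
\end{aligned}
\tag{8.2.46}
$$

From (8.2.33), (8.2.45), and (8.2.46), we derive

$$
\begin{aligned}
\dot{L}(x) <{}& -Q(x) - \tilde{W}_c^{T}\bar{h}\bar{h}^{T}\tilde{W}_c - \tilde{W}_a^{T}K_2\tilde{W}_a + \tilde{W}_c^{T}\bar{h}\bar{D}_1(x) + \tilde{W}_a^{T}K_1\bar{h}^{T}\tilde{W}_c \\
&+ \beta\|x\| + \tilde{W}_a^{T}\big(\bar{D}_2(x) + K_2W_c - K_1\bar{h}^{T}W_c\big) + \varepsilon_{\max}.
\end{aligned}
\tag{8.2.47}
$$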

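Since Q(x) > 0, there exists a positive value $q \in \mathbb{R}$ such that $x^{T}qx < Q(x)$. Let $Z^{T} = [x^{T}, \tilde{W}_c^{T}\bar{h}, \tilde{W}_a^{T}]$. Then, (8.2.47) can be rewritten as

$$
\dot{L}(x) < -Z^{T}GZ + Z^{T}M + \varepsilon_{\max},
\tag{8.2.48}
$$

where

$$
G = \begin{bmatrix} qI & 0 & 0 \\ 0 & I & -\dfrac{K_1^{T}}{2} \\ 0 & -\dfrac{K_1}{2} & K_2 \end{bmatrix},
\qquad
M = \begin{bmatrix} \beta \\ \bar{D}_1(x) \\ \bar{D}_2(x) + K_2W_c - K_1\bar{h}^{T}W_c \end{bmatrix}.
$$

Select $K_1$ and $K_2$ such that G is positive definite. Then, (8.2.48) implies

$$
\dot{L}(x) < -\lambda_{\min}(G)\|Z\|^{2} + \|M\|\|Z\| + \varepsilon_{\max}.
$$

Therefore, $\dot{L}(x)$ is negative as long as $\|Z\|$ exceeds the positive root of the quadratic on the right-hand side, i.e., as long as the following condition holds:

$$
\|Z\| > \frac{1}{2\lambda_{\min}(G)}\Big(\|M\| + \sqrt{\|M\|^{2} + 4\lambda_{\min}(G)\varepsilon_{\max}}\Big).
$$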
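The threshold on ‖Z‖ is the positive root of the quadratic bound −λ_min(G)‖Z‖² + ‖M‖‖Z‖ + ε_max. A minimal sketch of that computation follows; the function names and the numerical values are hypothetical and serve only to illustrate how the ultimate bound scales with ‖M‖, ε_max, and λ_min(G).

```python
import math


def lyapunov_rate_bound(z_norm, lambda_min_G, norm_M, eps_max):
    """Upper bound -lambda*z^2 + ||M||*z + eps_max on the Lyapunov derivative."""
    return -lambda_min_G * z_norm ** 2 + norm_M * z_norm + eps_max


def uub_radius(lambda_min_G, norm_M, eps_max):
    """Positive root of the bound; it is negative for any larger ||Z||."""
    disc = norm_M ** 2 + 4.0 * lambda_min_G * eps_max
    return (norm_M + math.sqrt(disc)) / (2.0 * lambda_min_G)


# Hypothetical numbers: a larger lambda_min(G) shrinks the ultimate bound.
r_small_gain = uub_radius(1.0, 1.0, 0.5)
r_large_gain = uub_radius(4.0, 1.0, 0.5)
```

This makes the design trade-off visible: choosing K1 and K2 so that λ_min(G) is larger reduces the radius of the set to which Z is ultimately confined.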

According to the standard Lyapunov's extension theorem [16, 17] (or the Lagrange stability result [20]), this demonstrates the uniform ultimate boundedness of Z. Therefore, the system state x(t) and the actor–critic weight estimation errors $\tilde{W}_a(t)$ and $\tilde{W}_c(t)$ are guaranteed to be UUB. This completes the proof.

8.2.4 Simulation Study

Consider the continuous-time input-nonaffine nonlinear system described by

$$
\begin{aligned}
\dot{x}_1 &= -0.4\sin x_1 + 0.6x_2, \\
\dot{x}_2 &= -\sin x_1\cos x_2 - 0.5u - 0.2\tan u.
\end{aligned}
\tag{8.2.49}
$$

It is desired to control the system with constrained inputs $|u| \le 0.45$. The nonquadratic value function is given as

$$
V(x) = \int_0^{\infty}\Big(x_1^{2} + x_2^{2} + 2\tau\int_0^{u}\tanh^{-T}(\upsilon/\tau)\,d\upsilon\Big)dt.
$$

The prior knowledge of system (8.2.49) is assumed to be unavailable. First, a dynamic NN given in (8.2.9) is employed for identification of the system dynamics. The identifier gains are selected as η = 20, Λ1 = [1, 0.5; 0.5, 1], Λ2 = [1, 0.2; 0.2, 1], and Λ3 = [1, 0.1; 0.1, 1], and the vector function $\phi(\hat{x})$ and $\rho_i(\zeta_i^{T}\hat{x})$ are chosen as hyperbolic tangent functions $\tanh(\hat{x})$ and $\tanh(\zeta_i^{T}\hat{x})$, respectively. Each $\zeta_i$ (i = 1, 2) is selected randomly within the interval of [−1, 1] and held constant. The computer simulation result of the system identification error is illustrated in Fig. 8.1. From Fig. 8.1, one observes that the dynamic NN identifier can ensure asymptotic identification of the unknown nonaffine nonlinear system (8.2.49). Then, the process of identifying the unknown system dynamics is finished and the identifier NN weights are kept unchanged.

Fig. 8.1 Errors in estimating system state x(t) by the dynamic NN identifier (identification errors x̃1(t) and x̃2(t) versus time in seconds)

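The activation function of the critic NN is chosen with $N_0 = 24$ neurons as

$$
\begin{aligned}
\sigma(x) = \big[&x_1^{2},\, x_2^{2},\, x_1x_2,\, x_1^{4},\, x_2^{4},\, x_1^{3}x_2,\, x_1^{2}x_2^{2},\, x_1x_2^{3},\, x_1^{6},\, x_2^{6},\, x_1^{5}x_2,\, x_1^{4}x_2^{2}, \\
&x_1^{3}x_2^{3},\, x_1^{2}x_2^{4},\, x_1x_2^{5},\, x_1^{8},\, x_2^{8},\, x_1^{7}x_2,\, x_1^{6}x_2^{2},\, x_1^{5}x_2^{3},\, x_1^{4}x_2^{4},\, x_1^{3}x_2^{5},\, x_1^{2}x_2^{6},\, x_1x_2^{7}\big]^{T}.
\end{aligned}
\tag{8.2.50}
$$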
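The basis in (8.2.50) consists of all monomials of total degree 2, 4, 6, and 8 in (x1, x2), with the pure powers listed first within each degree. A short sketch that generates the 24 features in that order is given below; the function name `critic_basis` is hypothetical.

```python
def critic_basis(x1, x2):
    """Monomial features of (8.2.50): total degrees 2, 4, 6, 8.

    Within each degree the order is x1^d, x2^d, then the mixed terms
    x1^(d-k) * x2^k for k = 1, ..., d-1.
    """
    feats = []
    for deg in (2, 4, 6, 8):
        feats.append(x1 ** deg)
        feats.append(x2 ** deg)
        feats.extend(x1 ** (deg - k) * x2 ** k for k in range(1, deg))
    return feats


# Initial state used in the simulation study.
sigma0 = critic_basis(0.2, 0.3)
```

All features vanish at the origin, in line with the requirement σ_j(0) = 0 stated for the critic activation functions.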

It should be mentioned that, in this example, the number of neurons is obtained by computer simulations. We find that selecting 24 neurons for the hidden layer can lead to satisfactory simulation results. The weights of the critic NN and the action NN are denoted as $\hat{W}_c = [\hat{W}_{c1}, \ldots, \hat{W}_{c24}]^{T}$ and $\hat{W}_a = [\hat{W}_{a1}, \ldots, \hat{W}_{a24}]^{T}$, respectively. The gains for the actor–critic learning laws are selected as $l_a = 0.7$, $l_c = 0.3$, and τ is given as 0.45. The initial state is chosen to be $x_0 = [0.2, 0.3]^{T}$, and the initial weights for both the action NN and the critic NN are selected randomly within the interval of [−1, 1]. To maintain the persistence of excitation condition, a small exploratory signal $N(t) = 3\sin^{2}(t)\cos(0.5t) - 0.6\sin^{3}(0.1t) - 0.4\sin(2.4t)\cos(2.4t)$ is added to the control u(t) for the first 40 s. The present optimal control algorithm is implemented using (8.2.27) and (8.2.31)–(8.2.33).
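The exploratory signal and its 40 s window can be sketched as follows. The function names are hypothetical, and switching the signal off exactly at t = 40 s is an assumption about how "the first 40 s" is implemented.

```python
import math


def exploration_signal(t, t_stop=40.0):
    """Exploratory signal N(t), set to zero from t_stop onward (assumed cutoff)."""
    if t >= t_stop:
        return 0.0
    return (3.0 * math.sin(t) ** 2 * math.cos(0.5 * t)
            - 0.6 * math.sin(0.1 * t) ** 3
            - 0.4 * math.sin(2.4 * t) * math.cos(2.4 * t))


def applied_control(u_actor, t):
    """Control sent to the plant: actor output plus exploration."""
    return u_actor + exploration_signal(t)


u_learning = applied_control(-0.1, 5.0)   # exploration active
u_after = applied_control(-0.1, 45.0)     # exploration switched off
```

Mixing several frequencies is a common way to keep the regressor sufficiently rich during learning; after the cutoff the plant is driven by the actor output alone.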
Computer simulation results are shown in Figs. 8.2, 8.3, 8.4 and 8.5. Figure 8.2 shows the trajectories of system states x1(t) and x2(t), where the exploratory signal is added to the control input during the first 40 s. Clearly, the learning process is completed before the end of the exploratory signal. Figure 8.3 indicates the convergence curves of the first 8 weights of the action NN. In fact, after 40 s the action NN weight vector converges to Ŵa = [0.1371, 0.5229, −0.5539, −0.2209, −0.1081, −0.0534, −0.4634, 0.7872, −0.8714, −0.0970, 0.8020, 0.8287, −0.6761, 0.1722, −0.1026, 0.9681, 0.5566, 0.9214, −0.5205, −0.8689, −0.5387, 0.5650, 0.9316, −0.5487]^T. Meanwhile, after 40 s the critic NN weight vector converges to the same vector as the action NN. Figure 8.4 shows the optimal controller designed using the present algorithm with the consideration of control constraints. To make comparison, we use Fig. 8.5 to illustrate the controller designed without considering control constraints, where the control signal is saturated for a short period of time.

Fig. 8.2 Trajectories of system state xi(t) (i = 1, 2)

Fig. 8.3 Convergence of the first 8 weights of the action NN

Fig. 8.4 Control input with constraints

Fig. 8.5 Control input without constraints

From Figs. 8.2 and 8.3, it is observed that the system states and the estimated weights of the action NN are guaranteed to be UUB, while keeping the closed-loop system stable. Meanwhile, one can find that persistence of excitation ensures the weights converge to their optimal values ($\hat{W}_a^{*}$) after 40 s. That is, the final weights $\hat{W}_a^{*}$ are obtained. Under this circumstance, based on (8.2.31) and (8.2.50), we can derive the optimal control for system (8.2.49). In addition, comparing Fig. 8.4 with Fig. 8.5, we find that the restriction of control constraints has been successfully overcome.

8.3 Optimal Output Regulation of Unknown Nonaffine Nonlinear Systems

Consider the continuous-time nonlinear systems described by

ẋ(t) = F(x(t), u(t)), y(t) = C x(t), (8.3.1)

where x(t) = [x1 (t), . . . , xn (t)]T ∈ Rn is the state, u(t) = [u 1 (t), . . . , u m (t)]T ∈ Rm
is the control input, y(t) = [y1 (t), . . . , yl (t)]T ∈ Rl is the output, and F(x, u) is
an unknown nonaffine nonlinear smooth function with F(0, 0) = 0. The state of
system (8.3.1) is not available, only the system output y(t) can be measured. For
convenience of later analysis, we provide an assumption as follows (for definition of
various concepts, see [1, 15]).

Assumption 8.3.1 System (8.3.1) is observable and the system state is bounded in
L∞ (Ω). In addition, C ∈ Rl×n (l ≤ n) is a full row rank matrix, i.e., rank(C) = l.

For optimal output regulator problems, the control objective is to find an admissible control for system (8.3.1) which minimizes the infinite-horizon value function

$$
V(x(t)) = \int_t^{\infty}\big(y^{T}(s)Qy(s) + u^{T}(s)Ru(s)\big)\,ds,
\tag{8.3.2}
$$

where Q and R are positive-definite matrices with appropriate dimensions. Noticing that y(t) = Cx(t), (8.3.2) can be rewritten as

$$
V(x(t)) = \int_t^{\infty}r(x(s), u(s))\,ds,
\tag{8.3.3}
$$

where $r(x, u) = x^{T}Q_cx + u^{T}Ru$ with $Q_c = C^{T}QC$, and $Q_c$ is symmetric positive semidefinite since $l \le n$.


8.3.1 Neural Network Observer

Due to the fact that the system dynamics and system states are completely unknown,
we cannot directly apply existing ADP methods to system (8.3.1). In this section,
we employ a multilayer feedforward NN state observer to obtain estimated states of
system (8.3.1).
From (8.3.1), we have

ẋ(t) = Ax + G(x(t), u(t)), y(t) = C x(t), (8.3.4)

where G(x, u) = F(x, u) − Ax, A is a Hurwitz matrix, and the pair (C, A) is observable. Then, the state observer for system (8.3.1) is given by

$$
\dot{\hat{x}}(t) = A\hat{x}(t) + \hat{G}(\hat{x}(t), u(t)) + K\big(y(t) - C\hat{x}(t)\big), \quad \hat{y}(t) = C\hat{x}(t),
\tag{8.3.5}
$$

where x̂(t) and ŷ(t) denote the state and output of the observer, respectively, and the
observer gain K ∈ Rn×l is selected such that A − K C is a Hurwitz matrix.
To design an NN state observer, one often uses an NN to identify the nonlinearity
and a conventional observer to estimate system states [1, 3, 23]. The structure of the
designed NN observer is shown in Fig. 8.6.
It has been proved that a three-layer NN with a single hidden layer can approximate
nonlinear systems with any degree of accuracy [11, 12]. According to the universal
approximation property of NNs, G(x, u) can be represented on a compact set Ω as

G(x, u) = WoTσ (YoTx̄) + εo (x), (8.3.6)

where $W_o \in \mathbb{R}^{k\times n}$ and $Y_o \in \mathbb{R}^{(n+m)\times k}$ are the ideal weight matrices for the hidden layer to the output layer and the input layer to the hidden layer, respectively, k is the number of neurons in the hidden layer, $\bar{x} = [x^{T}, u^{T}]^{T}$ is the NN input, and $\varepsilon_o(x)$ is the bounded NN functional approximation error. It is often assumed that there exists a constant $\varepsilon_M > 0$ such that $\|\varepsilon_o(x)\| \le \varepsilon_M$. σ(·) is the NN activation function which is componentwise and selected to be hyperbolic tangent function. Besides, NN activation functions are also bounded such that $\|\sigma(\cdot)\| \le \sigma_M$ for a constant $\sigma_M > 0$.

Fig. 8.6 The structural diagram of the NN observer

For the weights of the three-layer NN, an assumption has been used often as follows [18, 33].

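Assumption 8.3.2 The ideal NN weights $W_o$ and $Y_o$ are bounded by known positive constants $\bar{W}_M$ and $\bar{Y}_M$, respectively. That is,

$$
\|W_o\| \le \bar{W}_M \quad\text{and}\quad \|Y_o\| \le \bar{Y}_M.
$$

Note that G(x, u) can be approximated by an NN on the compact set Ω as

$$
\hat{G}(\hat{x}, u) = \hat{W}_o^{T}\sigma\big(\hat{Y}_o^{T}\hat{\bar{x}}\big),
$$

where $\hat{x}$ is the estimated state vector, $\hat{\bar{x}} = [\hat{x}^{T}, u^{T}]^{T}$, $\hat{W}_o$ and $\hat{Y}_o$ are the corresponding estimates of the ideal weight matrices. Then, the NN state observer (8.3.5) can be represented as

$$
\dot{\hat{x}} = A\hat{x} + \hat{W}_o^{T}\sigma\big(\hat{Y}_o^{T}\hat{\bar{x}}\big) + K(y - C\hat{x}), \quad \hat{y} = C\hat{x}.
\tag{8.3.7}
$$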

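Define the state and output estimation errors as $\tilde{x} = x - \hat{x}$ and $\tilde{y} = y - \hat{y}$. Then, using (8.3.4), (8.3.6) and (8.3.7), the error dynamics is obtained as

$$
\dot{\tilde{x}} = (A - KC)\tilde{x} + W_o^{T}\sigma\big(Y_o^{T}\bar{x}\big) - \hat{W}_o^{T}\sigma\big(\hat{Y}_o^{T}\hat{\bar{x}}\big) + \varepsilon_o(x), \quad \tilde{y} = C\tilde{x}.
\tag{8.3.8}
$$

Adding and subtracting $W_o^{T}\sigma(\hat{Y}_o^{T}\hat{\bar{x}})$ to (8.3.8), it follows

$$
\dot{\tilde{x}} = A_o\tilde{x} + \tilde{W}_o^{T}\sigma\big(\hat{Y}_o^{T}\hat{\bar{x}}\big) + \zeta(x), \quad \tilde{y} = C\tilde{x},
\tag{8.3.9}
$$

where $\tilde{W}_o = W_o - \hat{W}_o$, $A_o = A - KC$, and $\zeta(x) = W_o^{T}\big[\sigma(Y_o^{T}\bar{x}) - \sigma(\hat{Y}_o^{T}\hat{\bar{x}})\big] + \varepsilon_o(x)$.

Remark 8.3.1 It is worth pointing out that ζ(x) is a bounded disturbance term. That is, there exists a known constant $\zeta_M > 0$ such that $\|\zeta(x)\| \le \zeta_M$, because of the boundedness of the hyperbolic tangent function, the NN approximation error $\varepsilon_o(x)$, and the ideal NN weights $W_o$ and $Y_o$.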

To guarantee stability of the NN observer, a suitable tuning algorithm should be provided for the NN weights in the design. Inspired by [1], in what follows we design the weight tuning algorithm based on the error backpropagation algorithm plus some modification terms to guarantee the stability of the state observer and the NN weight estimation errors.

Theorem 8.3.1 Consider system (8.3.1) and the observer dynamics (8.3.7). Let Assumptions 8.3.1 and 8.3.2 hold. If the NN weight tuning algorithms are given as

$$
\begin{aligned}
\dot{\hat{W}}_o &= -\eta_1\sigma\big(\hat{Y}_o^{T}\hat{\bar{x}}\big)\tilde{y}^{T}CA_o^{-1} - \theta_1\|\tilde{y}\|\hat{W}_o, \\
\dot{\hat{Y}}_o &= -\eta_2\,\mathrm{sgn}(\hat{\bar{x}})\,\tilde{y}^{T}CA_o^{-1}\hat{W}_o^{T}\big(I_k - \Gamma(\hat{Y}_o^{T}\hat{\bar{x}})\big) - \theta_2\|\tilde{y}\|\hat{Y}_o,
\end{aligned}
\tag{8.3.10}
$$

where

$$
\Gamma\big(\hat{Y}_o^{T}\hat{\bar{x}}\big) = \mathrm{diag}\big\{\sigma_1^{2}(\hat{Y}_{o1}^{T}\hat{\bar{x}}), \ldots, \sigma_k^{2}(\hat{Y}_{ok}^{T}\hat{\bar{x}})\big\},
$$

$\hat{Y}_{oi}$ is the i-th column of $\hat{Y}_o$, sgn is the componentwise sign function, $\eta_i > 0$ (i = 1, 2) are learning rates, and $\theta_i > 0$ (i = 1, 2) are design parameters, then the state estimation error $\tilde{x}$ converges to the compact set

$$
\Omega_{\tilde{x}} = \Big\{\tilde{x} : \|\tilde{x}\| \le \frac{2d}{\rho\|C\|\lambda_{\min}\big((C^{+})^{T}C^{+}\big)}\Big\},
\tag{8.3.11}
$$

where d > 0 is a constant to be determined later (see (8.3.24)), $C^{+}$ is the Moore–Penrose pseudoinverse [10] of the matrix C. In addition, the NN weight estimation errors $\tilde{W}_o = W_o - \hat{W}_o$ and $\tilde{Y}_o = Y_o - \hat{Y}_o$ are UUB.
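One way to read the factor I_k − Γ in the tuning law for the input-layer weights: with hyperbolic tangent activations, the derivative of tanh(z) is 1 − tanh²(z), so the diagonal of I_k − Γ is the activation slope that appears in error backpropagation. The sketch below checks this numerically; the function names and test points are hypothetical.

```python
import math


def gamma_diag(z):
    """Diagonal of Gamma for tanh activations: tanh(z_i)^2 for each neuron."""
    return [math.tanh(zi) ** 2 for zi in z]


def activation_slope(z):
    """Diagonal of I_k - Gamma, equal to d tanh(z_i) / d z_i."""
    return [1.0 - g for g in gamma_diag(z)]


def finite_difference_slope(zi, eps=1e-6):
    """Central-difference estimate of the tanh derivative at zi."""
    return (math.tanh(zi + eps) - math.tanh(zi - eps)) / (2.0 * eps)


z = [-1.2, 0.0, 0.4, 2.0]  # hypothetical hidden-layer pre-activations
slopes = activation_slope(z)
```

Every slope lies in (0, 1], which is the property behind the bound 1 − σ_M² ≤ 1 used in the proof.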
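Proof Consider the Lyapunov function candidate

$$
L_o(x) = \tfrac12\tilde{x}^{T}P\tilde{x} + \tfrac12\mathrm{tr}\big(\tilde{W}_o^{T}\tilde{W}_o\big) + \tfrac12\mathrm{tr}\big(\tilde{Y}_o^{T}\tilde{Y}_o\big),
\tag{8.3.12}
$$

where P is a symmetric positive-definite matrix satisfying

$$
A_o^{T}P + PA_o = -\rho I_n
\tag{8.3.13}
$$

for the Hurwitz matrix $A_o$ and a positive constant ρ.
Taking the time derivative of $L_o$, it follows

$$
\dot{L}_o(x) = \tfrac12\dot{\tilde{x}}^{T}P\tilde{x} + \tfrac12\tilde{x}^{T}P\dot{\tilde{x}} + \mathrm{tr}\big(\tilde{W}_o^{T}\dot{\tilde{W}}_o\big) + \mathrm{tr}\big(\tilde{Y}_o^{T}\dot{\tilde{Y}}_o\big).
\tag{8.3.14}
$$

Using (8.3.10), we obtain

$$
\begin{aligned}
\dot{\tilde{W}}_o &= \eta_1\sigma\big(\hat{Y}_o^{T}\hat{\bar{x}}\big)\tilde{y}^{T}CA_o^{-1} + \theta_1\|\tilde{y}\|\hat{W}_o, \\
\dot{\tilde{Y}}_o &= \eta_2\,\mathrm{sgn}(\hat{\bar{x}})\,\tilde{y}^{T}CA_o^{-1}\hat{W}_o^{T}\big(I_k - \Gamma(\hat{Y}_o^{T}\hat{\bar{x}})\big) + \theta_2\|\tilde{y}\|\hat{Y}_o.
\end{aligned}
\tag{8.3.15}
$$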

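Substituting (8.3.9), (8.3.13), and (8.3.15) into (8.3.14), we obtain

$$
\begin{aligned}
\dot{L}_o(x) ={}& -\frac{\rho}{2}\tilde{x}^{T}\tilde{x} + \tilde{x}^{T}P\big(\tilde{W}_o^{T}\sigma(\hat{Y}_o^{T}\hat{\bar{x}}) + \zeta\big) \\
&+ \mathrm{tr}\Big(\tilde{W}_o^{T}\sigma(\hat{Y}_o^{T}\hat{\bar{x}})\tilde{y}^{T}l_1 + \tilde{W}_o^{T}\theta_1\|\tilde{y}\|\big(W_o - \tilde{W}_o\big)\Big) \\
&+ \mathrm{tr}\Big(\tilde{Y}_o^{T}\mathrm{sgn}(\hat{\bar{x}})\tilde{y}^{T}l_2\hat{W}_o^{T}\big(I_k - \Gamma(\hat{Y}_o^{T}\hat{\bar{x}})\big) + \tilde{Y}_o^{T}\theta_2\|\tilde{y}\|\big(Y_o - \tilde{Y}_o\big)\Big),
\end{aligned}
\tag{8.3.16}
$$

where $l_1 = \eta_1CA_o^{-1}$ and $l_2 = \eta_2CA_o^{-1}$.
Replace $\tilde{x}$ using $\tilde{x} = C^{+}\tilde{y}$, where $C^{+}$ is the Moore–Penrose pseudoinverse of the matrix C [10]. Then, (8.3.16) can be represented as

$$
\begin{aligned}
\dot{L}_o(x) ={}& -\frac{\rho}{2}\tilde{y}^{T}(C^{+})^{T}C^{+}\tilde{y} + \big(C^{+}\tilde{y}\big)^{T}P\big(\tilde{W}_o^{T}\sigma(\hat{Y}_o^{T}\hat{\bar{x}}) + \zeta\big) \\
&+ \mathrm{tr}\Big(\tilde{W}_o^{T}\sigma(\hat{Y}_o^{T}\hat{\bar{x}})\tilde{y}^{T}l_1 + \tilde{W}_o^{T}\theta_1\|\tilde{y}\|\big(W_o - \tilde{W}_o\big)\Big) \\
&+ \mathrm{tr}\Big(\tilde{Y}_o^{T}\mathrm{sgn}(\hat{\bar{x}})\tilde{y}^{T}l_2\hat{W}_o^{T}\big(I_k - \Gamma(\hat{Y}_o^{T}\hat{\bar{x}})\big) + \tilde{Y}_o^{T}\theta_2\|\tilde{y}\|\big(Y_o - \tilde{Y}_o\big)\Big).
\end{aligned}
$$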

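Before proceeding further, we provide the following inequalities:

$$
\begin{aligned}
\mathrm{tr}\big(\tilde{W}_o^{T}(W_o - \tilde{W}_o)\big) &\le \bar{W}_M\|\tilde{W}_o\|_F - \|\tilde{W}_o\|_F^{2}, \\
\mathrm{tr}\big(\tilde{Y}_o^{T}(Y_o - \tilde{Y}_o)\big) &\le \bar{Y}_M\|\tilde{Y}_o\|_F - \|\tilde{Y}_o\|_F^{2}, \\
\mathrm{tr}\big(\tilde{W}_o^{T}\sigma(\hat{Y}_o^{T}\hat{\bar{x}})\tilde{y}^{T}l_1\big) &\le \sigma_M\|l_1\|\|\tilde{y}\|\|\tilde{W}_o\|_F.
\end{aligned}
\tag{8.3.17}
$$

Note that (8.3.17) is true for Frobenius matrix norm, but it is not true for other matrix norms in general. As we have declared earlier, Frobenius norm for matrices and Euclidean norm for vectors are used in this chapter. We do not use the subscript "F" for Frobenius matrix norm for convenience of presentation. The last inequality in (8.3.17) is obtained based on the fact that, for given matrices A and B, the following relationship holds:

$$
\mathrm{tr}(AB) = \mathrm{tr}(BA).
\tag{8.3.18}
$$

Observing that

$$
\|\hat{W}_o\| \le \bar{W}_M + \|\tilde{W}_o\|, \quad 1 - \sigma_M^{2} \le 1,
$$

and using (8.3.18), we get

$$
\mathrm{tr}\Big(\tilde{Y}_o^{T}\mathrm{sgn}(\hat{\bar{x}})\tilde{y}^{T}l_2\hat{W}_o^{T}\big(I_k - \Gamma(\hat{Y}_o^{T}\hat{\bar{x}})\big)\Big) \le \|l_2\|\|\tilde{y}\|\|\tilde{Y}_o\|\big(\bar{W}_M + \|\tilde{W}_o\|\big).
\tag{8.3.19}
$$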

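Then, using (8.3.17) and (8.3.19), we have

$$
\begin{aligned}
\dot{L}_o(x) \le{}& -\frac{\rho}{2}\lambda_{\min}\big((C^{+})^{T}C^{+}\big)\|\tilde{y}\|^{2} + \|\tilde{y}\|\|P\|\|C^{+}\|\big(\sigma_M\|\tilde{W}_o\| + \zeta_M\big) \\
&+ \|\tilde{y}\|\sigma_M\|l_1\|\|\tilde{W}_o\| + \|\tilde{y}\|\theta_1\big(\bar{W}_M\|\tilde{W}_o\| - \|\tilde{W}_o\|^{2}\big) \\
&+ \|\tilde{y}\|\|l_2\|\|\tilde{Y}_o\|\big(\bar{W}_M + \|\tilde{W}_o\|\big) + \|\tilde{y}\|\theta_2\big(\bar{Y}_M\|\tilde{Y}_o\| - \|\tilde{Y}_o\|^{2}\big).
\end{aligned}
\tag{8.3.20}
$$

Denote $K_1 = \|l_2\|/2$. Adding and subtracting $K_1^{2}\|\tilde{W}_o\|^{2}\|\tilde{y}\|$ and $\|\tilde{Y}_o\|^{2}\|\tilde{y}\|$ to the right-hand side of (8.3.20), it follows

$$
\begin{aligned}
\dot{L}_o \le{}& -\frac{\rho}{2}\lambda_{\min}\big((C^{+})^{T}C^{+}\big)\|\tilde{y}\|^{2} + \Big[\zeta_M\|P\|\|C^{+}\| - \big(\theta_1 - K_1^{2}\big)\|\tilde{W}_o\|^{2} \\
&+ \big(\sigma_M\|P\|\|C^{+}\| + \sigma_M\|l_1\| + \theta_1\bar{W}_M\big)\|\tilde{W}_o\| + \big(\theta_2\bar{Y}_M + \|l_2\|\bar{W}_M\big)\|\tilde{Y}_o\| \\
&- (\theta_2 - 1)\|\tilde{Y}_o\|^{2} - \big(K_1\|\tilde{W}_o\| - \|\tilde{Y}_o\|\big)^{2}\Big]\|\tilde{y}\|.
\end{aligned}
\tag{8.3.21}
$$

Denote $K_2$ and $K_3$ as

$$
K_2 = \frac{\sigma_M\|P\|\|C^{+}\| + \sigma_M\|l_1\| + \theta_1\bar{W}_M}{2\big(\theta_1 - K_1^{2}\big)}, \qquad K_3 = \frac{\theta_2\bar{Y}_M + \|l_2\|\bar{W}_M}{2(\theta_2 - 1)}.
$$

To complete the squares for the terms $\|\tilde{W}_o\|$ and $\|\tilde{Y}_o\|$, $K_2^{2}\|\tilde{y}\|$ and $K_3^{2}\|\tilde{y}\|$ are added to and subtracted from (8.3.21), and we obtain

$$
\begin{aligned}
\dot{L}_o(x) \le{}& -\frac{\rho}{2}\lambda_{\min}\big((C^{+})^{T}C^{+}\big)\|\tilde{y}\|^{2} + \Big[\zeta_M\|P\|\|C^{+}\| + \big(\theta_1 - K_1^{2}\big)K_2^{2} \\
&+ (\theta_2 - 1)K_3^{2} - \big(\theta_1 - K_1^{2}\big)\big(K_2 - \|\tilde{W}_o\|\big)^{2} \\
&- (\theta_2 - 1)\big(K_3 - \|\tilde{Y}_o\|\big)^{2} - \big(K_1\|\tilde{W}_o\| - \|\tilde{Y}_o\|\big)^{2}\Big]\|\tilde{y}\|.
\end{aligned}
\tag{8.3.22}
$$

Select $\theta_1 \ge K_1^{2}$ and $\theta_2 \ge 1$. Then, (8.3.22) becomes

$$
\dot{L}_o \le -\frac{\rho}{2}\lambda_{\min}\big((C^{+})^{T}C^{+}\big)\|\tilde{y}\|^{2} + d\|\tilde{y}\|,
\tag{8.3.23}
$$

where

$$
d = \zeta_M\|P\|\|C^{+}\| + \big(\theta_1 - K_1^{2}\big)K_2^{2} + (\theta_2 - 1)K_3^{2}.
\tag{8.3.24}
$$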

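Therefore, for guaranteeing $\dot{L}_o < 0$, the following condition should hold, i.e.,

$$
\|\tilde{y}\| > \frac{2d}{\rho\lambda_{\min}\big((C^{+})^{T}C^{+}\big)}.
\tag{8.3.25}
$$

Note that $\|\tilde{y}\| \le \|C\|\|\tilde{x}\|$. Then, (8.3.25) implies

$$
\|\tilde{x}\| > \frac{2d}{\rho\|C\|\lambda_{\min}\big((C^{+})^{T}C^{+}\big)}.
$$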

That is, the state estimation error x̃ converges to Ωx̃ defined as in (8.3.11). Mean-
while, by using the standard Lyapunov’s extension theorem [16, 17] (or the Lagrange
stability result [20]), we conclude that the weight estimation errors W̃o and Ỹo are
UUB. This completes the proof.
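The constants appearing in the proof can be collected into a small helper that evaluates d from (8.3.24) and the radius of the set in (8.3.11). This is a sketch only: the function name and all numerical inputs are hypothetical, and the norms and bounds are assumed to be known. The code requires strict inequalities θ1 > K1² and θ2 > 1 because K2 and K3 divide by θ1 − K1² and θ2 − 1.

```python
def observer_bound(zeta_M, norm_P, norm_C_pinv, sigma_M, norm_l1, norm_l2,
                   theta1, theta2, W_bar_M, Y_bar_M, rho, norm_C, lam_min):
    """Return (d, radius): d from (8.3.24) and the bound on ||x_tilde|| in (8.3.11).

    lam_min stands for lambda_min((C^+)^T C^+); all arguments are scalars.
    """
    K1 = norm_l2 / 2.0
    if theta1 <= K1 ** 2 or theta2 <= 1.0:
        raise ValueError("need theta1 > K1^2 and theta2 > 1 to evaluate K2, K3")
    K2 = (sigma_M * norm_P * norm_C_pinv + sigma_M * norm_l1
          + theta1 * W_bar_M) / (2.0 * (theta1 - K1 ** 2))
    K3 = (theta2 * Y_bar_M + norm_l2 * W_bar_M) / (2.0 * (theta2 - 1.0))
    d = (zeta_M * norm_P * norm_C_pinv
         + (theta1 - K1 ** 2) * K2 ** 2 + (theta2 - 1.0) * K3 ** 2)
    return d, 2.0 * d / (rho * norm_C * lam_min)


# Hypothetical values, chosen only to exercise the formulas.
d, radius = observer_bound(0.1, 1.0, 1.0, 1.0, 0.5, 1.0,
                           1.25, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0)
```

The radius grows with the disturbance bound ζ_M and shrinks as ρ increases, which indicates how the observer gain and the Lyapunov equation (8.3.13) affect the size of the residual set.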

Remark 8.3.2 By linear matrix theory [7, 10], we can obtain that $\mathrm{rank}(C) = \mathrm{rank}(C^{+})$ and $\mathrm{rank}(C^{+}) = \mathrm{rank}\big((C^{+})^{T}C^{+}\big)$. Accordingly, using Assumption 8.3.1, we derive $\mathrm{rank}\big((C^{+})^{T}C^{+}\big) = \mathrm{rank}(C) = l$. Noticing that $(C^{+})^{T}C^{+} \in \mathbb{R}^{l\times l}$ and $(C^{+})^{T}C^{+}$ is a symmetric matrix, we can conclude that $(C^{+})^{T}C^{+}$ is positive definite. Therefore, $\lambda_{\min}\big((C^{+})^{T}C^{+}\big) > 0$. This shows that the compact set $\Omega_{\tilde{x}}$ makes sense.

Remark 8.3.3 The explanation about selecting an NN observer rather than system
identification technique is given here. In control engineering, a common approach is
to start from the measurement of system behavior and external influences (inputs to
the system) and try to determine a mathematical relationship between them without
going into the details of what is actually happening inside the system [8, 27]. This
approach is called system identification. So, we can conclude that based on system
identification, we are generally able to obtain a “black box” model of the nonlinear
system [14], but do not get any in depth knowledge about system states because
they are the internal properties of the system. In most real cases, the state variables
are unavailable for direct online measurements, and merely input and output of the
system are measurable. Therefore, estimating the state variables by observers plays
an important role in the control of processes to achieve better performances. Once
the estimated states are obtained, we can directly design a state feedback controller to
achieve the optimization of system performance [24]. In conclusion, here we employ
an NN observer rather than system identification techniques.

8.3.2 Observer-Based Optimal Control Scheme Using Critic Network

By (8.3.1) and (8.3.3) and using the observed state $\hat{x}$ to replace the system state x, we obtain the Hamiltonian as

$$
H(\hat{x}, u, V_{\hat{x}}) = r(\hat{x}(t), u(t)) + V_{\hat{x}}^{T}F(\hat{x}(t), u(t)),
$$

where $r(\hat{x}, u) = \hat{x}^{T}Q_c\hat{x} + u^{T}Ru$ as in (8.3.3) and $V_{\hat{x}} = \partial V(\hat{x})/\partial\hat{x}$. The optimal value function $V^{*}(\hat{x}(t))$ is

$$
V^{*}(\hat{x}(t)) = \min_{u\in\mathcal{A}(\Omega)}\int_t^{\infty}r(\hat{x}(\tau), u(\tau))\,d\tau,
$$

and satisfies the HJB equation

$$
\min_{u\in\mathcal{A}(\Omega)}H(\hat{x}, u, V_{\hat{x}}^{*}) = 0,
\tag{8.3.26}
$$

where $V_{\hat{x}}^{*} = \partial V^{*}(\hat{x})/\partial\hat{x}$. Assume that the minimum in (8.3.26) exists and is unique. Then, by solving the equation $\partial H(\hat{x}, u, V_{\hat{x}}^{*})/\partial u = 0$, the optimal control can be obtained as

$$
u^{*} = -\frac12R^{-1}\Big(\frac{\partial F(\hat{x}, u)}{\partial u}\Big)^{T}V_{\hat{x}}^{*}.
\tag{8.3.27}
$$
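The stationarity step behind (8.3.27) can be written out as follows, assuming R is symmetric (it is positive definite in the text). The input-affine special case in the last line is an illustrative assumption added here for comparison and is not the setting of this section.

$$
\begin{aligned}
\frac{\partial H(\hat{x}, u, V_{\hat{x}}^{*})}{\partial u}
&= \frac{\partial}{\partial u}\Big(\hat{x}^{T}Q_c\hat{x} + u^{T}Ru + V_{\hat{x}}^{*T}F(\hat{x}, u)\Big)
= 2Ru + \Big(\frac{\partial F(\hat{x}, u)}{\partial u}\Big)^{T}V_{\hat{x}}^{*} = 0 \\
&\Longrightarrow\quad u = -\frac12R^{-1}\Big(\frac{\partial F(\hat{x}, u)}{\partial u}\Big)^{T}V_{\hat{x}}^{*}, \\
F(\hat{x}, u) = f(\hat{x}) + g(\hat{x})u
&\Longrightarrow\quad u^{*} = -\frac12R^{-1}g^{T}(\hat{x})V_{\hat{x}}^{*}.
\end{aligned}
$$

For a nonaffine F, the Jacobian ∂F/∂u still depends on u, so the stationarity condition defines the optimal control only implicitly.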

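Substituting (8.3.27) into (8.3.26), it follows

$$
\begin{aligned}
0 ={}& \hat{x}^{T}Q_c\hat{x} + \frac14V_{\hat{x}}^{*T}\frac{\partial F(\hat{x}, u)}{\partial u}R^{-1}\Big(\frac{\partial F(\hat{x}, u)}{\partial u}\Big)^{T}V_{\hat{x}}^{*} \\
&+ V_{\hat{x}}^{*T}F\Big(\hat{x}, -\frac12R^{-1}\Big(\frac{\partial F(\hat{x}, u)}{\partial u}\Big)^{T}V_{\hat{x}}^{*}\Big).
\end{aligned}
\tag{8.3.28}
$$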

To obtain solutions of the optimal control problem, we only need to solve (8.3.28).
However, due to the nonlinear nature of the HJB equation, finding its solutions is
generally difficult or impossible. Therefore, a scheme shall be developed using NNs
for solving the above optimal control problems. The structural diagram of the NN
observer-based controller is shown in Fig. 8.7.
According to the universal approximation property of NNs, the value function $V^{*}(\hat{x})$ can be represented on a compact set Ω as

$$
V^{*}(\hat{x}) = W_c^{T}\sigma\big(Y_c^{T}\hat{x}\big) + \varepsilon_c(\hat{x}),
\tag{8.3.29}
$$

where $W_c \in \mathbb{R}^{k_c}$ and $Y_c \in \mathbb{R}^{n\times k_c}$ are the ideal weight matrices for the hidden layer to the output layer and the input layer to the hidden layer, respectively, $k_c$ is the number of neurons in the hidden layer, and $\varepsilon_c$ is the bounded NN functional approximation error. In our design, based on [12], for both simplicity of learning and efficiency of approximation, the output layer weight $W_c$ is adapted online, whereas the input layer weight $Y_c$ is selected initially at random and held fixed during the entire learning process. It is demonstrated in [11, 12] that if the number of neurons $k_c$ is sufficiently large, then the NN approximation error $\varepsilon_c$ can be kept arbitrarily small.
The output of the critic NN is given as

$$
\hat{V}(\hat{x}) = \hat{W}_c^{T}\sigma\big(Y_c^{T}\hat{x}\big) = \hat{W}_c^{T}\sigma(z),
$$

where $\hat{W}_c$ is the estimate of $W_c$. Since the hidden layer weight matrix $Y_c$ is fixed, we write the activation function $\sigma(Y_c^{T}\hat{x})$ as σ(z) with $z = Y_c^{T}\hat{x}$.
The derivative of the value function V (x̂) with respect to x̂ is

$$V_{\hat{x}} = \nabla\sigma_c^{T} W_c + \nabla\varepsilon_c, \tag{8.3.30}$$


Fig. 8.7 The structural diagram of the NN observer-based controller

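where $\nabla\sigma_c^{T} = Y_c\left(\partial\sigma^{T}(z)/\partial z\right)$ and $\nabla\varepsilon_c = \partial\varepsilon_c/\partial\hat{x}$. In addition, the derivative of $\hat{V}(\hat{x})$ with respect to $\hat{x}$ is obtained as $\hat{V}_{\hat{x}} = \nabla\sigma_c^{T} \hat{W}_c$. Then, the approximate Hamiltonian is derived as

$$H(\hat{x}, u, \hat{W}_c) = \hat{W}_c^{T} \nabla\sigma_c F(\hat{x}, u) + r(\hat{x}, u) = e_c. \tag{8.3.31}$$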

It is worth pointing out that, to get the error ec , the knowledge of the system dynamics
is required. To overcome this limitation, the NN observer developed in (8.3.7) is used
to replace F(x̂, u). Then, (8.3.31) becomes

$$e_c = \hat{W}_c^{T} \nabla\sigma_c \dot{\hat{x}} + r(\hat{x}, u). \tag{8.3.32}$$
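The critic output, its gradient, and the residual in (8.3.32) can be sketched numerically as follows. This is only an illustration: the function name `critic_terms`, the tanh activation for the critic, and the utility $r(\hat{x}, u) = \hat{x}^{T} Q_c \hat{x} + u^{T} R u$ are assumptions made for the example, and `x_hat_dot` stands for the observer-supplied derivative.

```python
import numpy as np

def critic_terms(x_hat, x_hat_dot, u, W_hat, Y_c, Q_c, R):
    """Return (V_hat, grad_V_hat, e_c) for a critic with fixed input weights Y_c.

    Assumed shapes: x_hat (n,), u (m,), W_hat (k_c,), Y_c (n, k_c).
    """
    z = Y_c.T @ x_hat                       # hidden-layer input z = Y_c^T x_hat
    sigma = np.tanh(z)                      # assumed activation sigma(z)
    # grad_sigma_c = d sigma / d x_hat, shape (k_c, n)
    grad_sigma = (1.0 - sigma ** 2)[:, None] * Y_c.T
    V_hat = W_hat @ sigma                   # critic output
    grad_V = grad_sigma.T @ W_hat           # derivative of V_hat w.r.t. x_hat
    r = x_hat @ Q_c @ x_hat + u @ R @ u     # assumed quadratic utility
    e_c = W_hat @ (grad_sigma @ x_hat_dot) + r   # residual as in (8.3.32)
    return V_hat, grad_V, e_c
```

Because only `x_hat_dot` enters the residual, no model of $F(\hat{x}, u)$ is needed at this step.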

Given that u ∈ A (Ω), it is desired to select Ŵc to minimize the squared residual
error E c (Ŵc ) as
$$E_c(\hat{W}_c) = \frac{1}{2} e_c^{2}.$$

Using the normalized gradient descent algorithm, the weight update law for the critic
NN is derived as
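$$\dot{\hat{W}}_c = -\frac{\alpha}{(\psi^{T}\psi + 1)^{2}} \frac{\partial E_c}{\partial \hat{W}_c} = -\alpha \frac{\psi}{(\psi^{T}\psi + 1)^{2}} \left(\psi^{T} \hat{W}_c + r(\hat{x}, u)\right), \tag{8.3.33}$$

where $\alpha > 0$ is the learning rate, and

$$\psi = \nabla\sigma_c \dot{\hat{x}}.$$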

This is a modified Levenberg–Marquardt algorithm, and $(\psi^{T}\psi + 1)^{2}$ is used for normalization. Define the weight estimation error of the critic NN as $\tilde{W}_c = W_c - \hat{W}_c$. By (8.3.26) and (8.3.29) and replacing $F(x, u)$ by $\dot{\hat{x}}$, we get

$$0 = r(\hat{x}, u) + W_c^{T} \nabla\sigma_c \dot{\hat{x}} - \vartheta, \tag{8.3.34}$$

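where $\vartheta = -(\nabla\varepsilon_c)^{T} \dot{\hat{x}}$ is the residual error due to the NN approximation. Substituting (8.3.34) into (8.3.33), and using the notation

$$\psi_1 = \frac{\psi}{\psi^{T}\psi + 1}, \qquad \psi_2 = \psi^{T}\psi + 1,$$

the critic NN weight estimation error dynamics becomes

$$\dot{\tilde{W}}_c = -\alpha \psi_1 \psi_1^{T} \tilde{W}_c + \alpha \psi_1 \frac{\vartheta}{\psi_2}.$$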

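Considering (8.3.34), we can further derive

$$\begin{aligned}
\dot{\hat{W}}_c &= -\dot{\tilde{W}}_c \\
&= \alpha \psi_1 \psi_1^{T} \tilde{W}_c - \alpha \psi_1 \frac{\vartheta}{\psi_2} \\
&= -\alpha \frac{\psi_1}{\psi_2} \left(\psi^{T} (\hat{W}_c - W_c) + \vartheta\right) \\
&= -\alpha \frac{\psi_1}{\psi^{T}\psi + 1} \left(\psi^{T} \hat{W}_c + r(\hat{x}, \hat{u})\right).
\end{aligned} \tag{8.3.35}$$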

From the expression of $\psi_1$, there exists a constant $\psi_{1M} > 1$ such that $\|\psi_1\| < \psi_{1M}$. It is important to note that the persistence of excitation (PE) condition is required for tuning the critic NN. To satisfy the PE condition, a small exploratory signal is added to the control input [25]. Furthermore, the PE condition ensures $\|\psi_1\| \ge \psi_{1m}$, with $\psi_{1m}$ being a positive constant.
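A minimal numerical sketch of the normalized update (8.3.33) is given below. It is not the simulation used in this chapter: the ideal weights `W_true`, the random regressor standing in for a persistently exciting $\psi$, the forward-Euler discretization, and the choice $\vartheta = 0$ in (8.3.34) are all assumptions made for illustration.

```python
import numpy as np

def critic_update(W_hat, psi, r, alpha, dt):
    """One forward-Euler step of the normalized gradient law (8.3.33)."""
    e_c = psi @ W_hat + r                   # residual e_c = psi^T W_hat + r
    norm = (psi @ psi + 1.0) ** 2           # normalization (psi^T psi + 1)^2
    return W_hat - dt * alpha * psi / norm * e_c

def run_demo(steps=20000, dt=0.01, alpha=0.5, seed=0):
    """Drive the update with random regressors and report the weight error."""
    rng = np.random.default_rng(seed)
    W_true = np.array([1.0, -2.0, 0.5])     # hypothetical ideal weights
    W_hat = np.zeros(3)
    err0 = np.linalg.norm(W_true - W_hat)
    for _ in range(steps):
        psi = rng.normal(size=3)            # stand-in for a PE regressor
        r = -W_true @ psi                   # (8.3.34) with zero residual
        W_hat = critic_update(W_hat, psi, r, alpha, dt)
    return err0, np.linalg.norm(W_true - W_hat)
```

With zero residual the error dynamics reduce to $\dot{\tilde{W}}_c = -\alpha\psi_1\psi_1^{T}\tilde{W}_c$, so the weight error norm cannot grow, and it shrinks whenever the regressor excites all directions.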

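Using (8.3.27) and (8.3.30), the control $u$ is given as

$$u = -\frac{1}{2} R^{-1} \left(\frac{\partial F(x, u)}{\partial u}\right)^{T} \left(\nabla\sigma_c^{T} W_c + \nabla\varepsilon_c\right). \tag{8.3.36}$$

The approximate expression of $u$ is obtained as

$$\hat{u} = -\frac{1}{2} R^{-1} \left(\frac{\partial \hat{F}(\hat{x}, u)}{\partial u}\right)^{T} \nabla\sigma_c^{T} \hat{W}_c. \tag{8.3.37}$$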

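It should be mentioned that, to compute $u$ in (8.3.36), the knowledge of $\partial F(x, u)/\partial u$ is required. However, for unknown nonlinear systems, this term cannot be obtained directly. In this chapter, using the observer NN, its estimate can be derived as

$$\frac{\partial \hat{F}(\hat{x}, u)}{\partial u} = \frac{\partial \hat{G}(\hat{x}, u)}{\partial u} = \frac{\partial \left(\hat{W}_o^{T} \sigma(\hat{Y}_o^{T} \hat{\bar{x}})\right)}{\partial u} = \hat{W}_o^{T} \frac{\partial \sigma(\hat{Y}_o^{T} \hat{\bar{x}})}{\partial (\hat{Y}_o^{T} \hat{\bar{x}})} \hat{Y}_o^{T} \frac{\partial \hat{\bar{x}}}{\partial u}.$$

Therefore, $\partial \hat{F}(\hat{x}, u)/\partial u$ can be obtained by the backpropagation from the output of the observer NN to its input $u$.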
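The chain rule above can be checked numerically with a short sketch. It assumes, for illustration only, that the observer NN input is the stacked vector $\hat{\bar{x}} = [\hat{x}^{T}, u^{T}]^{T}$ (consistent with the three input neurons used in the simulation), that the activation is tanh, and that $\hat{x}$ is treated as independent of $u$; the function names are hypothetical.

```python
import numpy as np

def observer_nn(x_hat, u, W_o, Y_o):
    """NN part of the observer: W_o^T sigma(Y_o^T x_bar), x_bar = [x_hat; u]."""
    x_bar = np.concatenate([x_hat, u])      # assumed augmented input
    return W_o.T @ np.tanh(Y_o.T @ x_bar)

def dF_du(x_hat, u, W_o, Y_o):
    """Backpropagate from the NN output to the input u; result has shape (n, m).

    Assumed shapes: Y_o (n + m, k), W_o (k, n).
    """
    n, m = x_hat.size, u.size
    x_bar = np.concatenate([x_hat, u])
    s = np.tanh(Y_o.T @ x_bar)
    dsigma = np.diag(1.0 - s ** 2)          # d sigma / d (Y_o^T x_bar)
    dxbar_du = np.vstack([np.zeros((n, m)), np.eye(m)])  # d x_bar / d u
    return W_o.T @ dsigma @ Y_o.T @ dxbar_du
```

Comparing `dF_du` with a finite difference of `observer_nn` is a quick way to validate the dimensions and the ordering of the factors.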

8.3.3 Stability Analysis of Closed-Loop System

Before proceeding further, we provide two assumptions as follows.


Assumption 8.3.3 The NN approximation error and its gradient are bounded on a compact set containing Ω. That is, $\|\varepsilon_c\| < \varepsilon_{cM}$ and $\|\nabla\varepsilon_c\| < \varepsilon_{dM}$, where $\varepsilon_{cM}$ and $\varepsilon_{dM}$ are known positive constants.

Assumption 8.3.4 The NN activation function and its gradient are bounded. That is, $\|\sigma\| < \sigma_{cM}$ and $\|\nabla\sigma_c\| < \sigma_{dM}$, where $\sigma_{cM}$ and $\sigma_{dM}$ are known positive constants.

Theorem 8.3.2 Consider the NN observer (8.3.7) and the feedback control given in
(8.3.37). Let weight tuning laws for the observer NN be provided as in (8.3.10) and
for the critic NN be provided as in (8.3.35). Then, x, x̃, W̃o , Ỹo and W̃c in the NN
observer-based control system are all UUB.

Proof Consider the Lyapunov function candidate

$$L(x) = L_o(x) + \underbrace{\frac{1}{2\alpha} \tilde{W}_c^{T} \tilde{W}_c}_{L_{c1}} + \underbrace{\alpha\left(x^{T}x + \gamma V(x)\right)}_{L_{c2}},$$

where L o is defined as in (8.3.12) and γ > 0.



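Note the facts that

$$\psi_2 = \psi^{T}\psi + 1 \ge 1$$

and the residual error is upper bounded, i.e., there exists a constant $\vartheta_M > 0$ such that $\|\vartheta\| \le \vartheta_M$. Taking the time derivative of $L_{c1}$, we obtain

$$\begin{aligned}
\dot{L}_{c1} &= \frac{1}{\alpha} \tilde{W}_c^{T} \dot{\tilde{W}}_c \\
&= \frac{1}{\alpha} \tilde{W}_c^{T} \left(-\alpha \psi_1 \psi_1^{T} \tilde{W}_c + \alpha \psi_1 \frac{\vartheta}{\psi_2}\right) \\
&= -\|\tilde{W}_c^{T} \psi_1\|^{2} + \tilde{W}_c^{T} \psi_1 \frac{\vartheta}{\psi_2} \\
&\le -\|\tilde{W}_c^{T} \psi_1\|^{2} + \|\tilde{W}_c^{T} \psi_1\| \left\|\frac{\vartheta}{\psi_2}\right\| \\
&\le -\|\tilde{W}_c^{T} \psi_1\|^{2} + \|\tilde{W}_c^{T} \psi_1\| \vartheta_M \\
&= -\left(\|\tilde{W}_c^{T} \psi_1\| - \frac{\vartheta_M}{2}\right)^{2} + \frac{\vartheta_M^{2}}{4}.
\end{aligned}$$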
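Using Assumptions 8.3.3 and 8.3.4, and

$$(a + b + c)^{T}(a + b + c) \le 3(a^{T}a + b^{T}b + c^{T}c), \quad \forall a, b, c \in \mathbb{R}^{n},$$

we have

$$\begin{aligned}
\dot{L}_{c2} &= 2\alpha x^{T}\dot{x} + \alpha\gamma\left(-x^{T}Q_c x - \hat{u}^{T}R\hat{u}\right) \\
&= 2\alpha x^{T}\left(Ax + W_o^{T}\sigma(Y_o^{T}\bar{x}) + \varepsilon\right) + \alpha\gamma\left(-x^{T}Q_c x - \hat{u}^{T}R\hat{u}\right) \\
&\le \alpha\left[x^{T}x + \left(Ax + W_o^{T}\sigma(Y_o^{T}\bar{x}) + \varepsilon\right)^{T}\left(Ax + W_o^{T}\sigma(Y_o^{T}\bar{x}) + \varepsilon\right) + \gamma\left(-x^{T}Q_c x - \hat{u}^{T}R\hat{u}\right)\right] \\
&\le \alpha\left[(1 + 3\|A\|^{2})\|x\|^{2} + 3\|W_o^{T}\sigma(Y_o^{T}\bar{x})\|^{2} + 3\|\varepsilon\|^{2} - \gamma\lambda_{\min}(Q_c)\|x\|^{2} - \gamma\lambda_{\min}(R)\|\hat{u}\|^{2}\right] \\
&\le -\alpha\left(\gamma\lambda_{\min}(Q_c) - 1 - 3\|A\|^{2}\right)\|x\|^{2} - \alpha\gamma\lambda_{\min}(R)\|\hat{u}\|^{2} + 3\alpha\bar{W}_M^{2}\sigma_M^{2} + 3\alpha\varepsilon_M^{2}.
\end{aligned}$$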

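Thus, we obtain

$$\dot{L}_{c1} + \dot{L}_{c2} \le -\alpha\left(\gamma\lambda_{\min}(Q_c) - 1 - 3\|A\|^{2}\right)\|x\|^{2} - \left(\|\tilde{W}_c^{T}\psi_1\| - \frac{\vartheta_M}{2}\right)^{2} - \alpha\gamma\lambda_{\min}(R)\|\hat{u}\|^{2} + D_M, \tag{8.3.38}$$

where

$$D_M = \frac{\vartheta_M^{2}}{4} + 3\alpha\bar{W}_M^{2}\sigma_M^{2} + 3\alpha\varepsilon_M^{2}.$$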
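Combining (8.3.23) and (8.3.38), it follows

$$\begin{aligned}
\dot{L}(x) \le{} & -\frac{\rho}{2}\lambda_{\min}\left((C^{+})^{T}C^{+}\right)\|C\tilde{x}\|^{2} + d\|C\tilde{x}\| - \alpha\left(\gamma\lambda_{\min}(Q_c) - 1 - 3\|A\|^{2}\right)\|x\|^{2} \\
& - \left(\|\tilde{W}_c^{T}\psi_1\| - \frac{\vartheta_M}{2}\right)^{2} - \alpha\gamma\lambda_{\min}(R)\|\hat{u}\|^{2} + D_M \\
\le{} & -\frac{\rho}{2}\lambda_{\min}\left((C^{+})^{T}C^{+}\right)\left(\|C\tilde{x}\| - \frac{d}{\rho\lambda_{\min}\left((C^{+})^{T}C^{+}\right)}\right)^{2} \\
& - \alpha\left(\gamma\lambda_{\min}(Q_c) - 1 - 3\|A\|^{2}\right)\|x\|^{2} - \left(\|\tilde{W}_c^{T}\psi_1\| - \frac{\vartheta_M}{2}\right)^{2} \\
& - \alpha\gamma\lambda_{\min}(R)\|\hat{u}\|^{2} + \underbrace{D_M + \frac{d^{2}}{2\rho\lambda_{\min}\left((C^{+})^{T}C^{+}\right)}}_{\tilde{D}_M}.
\end{aligned}$$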

Therefore, if θ1 , θ2 , γ , and α are selected to satisfy

3 A 2+1
θ1 ≥ K 12 , θ2 ≥ 1, γ > , (8.3.39)
λmin (Q c )

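and if one of the following inequalities holds

$$\|C\tilde{x}\| > \frac{d}{\rho\lambda_{\min}\left((C^{+})^{T}C^{+}\right)} + \sqrt{\frac{2\tilde{D}_M}{\rho\lambda_{\min}\left((C^{+})^{T}C^{+}\right)}} \triangleq B_1 \tag{8.3.40}$$

or

$$\|\tilde{W}_c^{T}\psi_1\| > \sqrt{\tilde{D}_M} + \frac{\vartheta_M}{2} \triangleq B_2,$$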

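then, we derive $\dot{L}(x) < 0$. By the dense property of $\mathbb{R}$ [22], we can obtain a positive constant $\kappa_1$ ($0 < \kappa_1 \le \|C\|$) such that $\|C\tilde{x}\| > \kappa_1\|\tilde{x}\| > B_1$. Similarly, we can derive that there exists a positive constant $\kappa_2$ ($0 < \kappa_2 \le \psi_{1m}$) such that $\|\tilde{W}_c^{T}\psi_1\| > \kappa_2\|\tilde{W}_c\| > B_2$. Then, it follows that $\dot{L}(x) < 0$, if (8.3.39) is true and if one of the following inequalities is true

$$\|\tilde{x}\| > \frac{B_1}{\kappa_1} = \frac{d}{\kappa_1\rho\lambda_{\min}\left((C^{+})^{T}C^{+}\right)} + \frac{1}{\kappa_1}\sqrt{\frac{2\tilde{D}_M}{\rho\lambda_{\min}\left((C^{+})^{T}C^{+}\right)}} \tag{8.3.41}$$

or

$$\|\tilde{W}_c\| > \frac{B_2}{\kappa_2} = \frac{1}{\kappa_2}\left(\sqrt{\tilde{D}_M} + \frac{\vartheta_M}{2}\right). \tag{8.3.42}$$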

By using Lyapunov’s extension theorem [16, 17] (or the Lagrange stability result
[20]), it can be concluded that the observer error x̃, the state x, and the NN weight
estimation errors W̃o , Ỹo , and W̃c are all UUB. This completes the proof.

Remark 8.3.4 It should be mentioned that, in (8.3.39) and (8.3.40), the constraints for θ1 , θ2 and x̃ are the same as those for the NN observer designed earlier. A nonlinear separation principle is not applicable here. However, because the closed-loop dynamics considered in the proof of the NN observer-based control system incorporate the observer dynamics, we can develop simultaneous weight tuning laws for the observer NN and the critic NN.

8.3.4 Simulation Study

Consider the continuous-time input-nonaffine nonlinear system given by

$$\begin{aligned}
\dot{x}_1 &= -x_1 + x_2, \\
\dot{x}_2 &= -x_1 - \left(1 - \sin^{2}(x_1)\right)x_2 + \sin(x_1)u + 0.1u^{2}, \\
y &= x_1,
\end{aligned}$$

with initial conditions x1 (0) = 1 and x2 (0) = −0.5. The value function is given by
(8.3.2), where Q and R are chosen as identity matrices with appropriate dimensions. It
is assumed that the system dynamics is unknown, the system states are unavailable for
measurements and only the input and output of the system are measurable. To estimate
the system states, an NN observer is employed and the corresponding parameters are
chosen as
$$A = \begin{bmatrix} 0 & 1 \\ -6 & -5 \end{bmatrix},$$

and

$$K = \begin{bmatrix} 10 \\ -2 \end{bmatrix}.$$

The observer NN is constructed by a three-layer NN with one hidden layer containing


eight neurons. The input layer involves three neurons and the output layer contains
two neurons. The activation function σ (·) is selected as hyperbolic tangent function
tanh(·). Let the design parameters be θ1 = θ2 = 1.5. The initial weights of Wo and
Yo are all set to be random within [0.5, 1]. From Figs. 8.8 and 8.9, it is clear that the
errors between the actual states x1 , x2 and the observed states x̂1 , x̂2 quickly approach
zero, which shows the good performance of the NN observer.
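For readers who want to reproduce the setup, the plant of this example can be coded as below. This is only a sketch of the plant and a fixed-step integrator: the RK4 scheme, the step size, and the `controller(t, y)` callback signature are assumptions for illustration, and the NN observer and critic themselves are not implemented here.

```python
import numpy as np

def plant(x, u):
    """Input-nonaffine plant of the simulation study (state x, scalar input u)."""
    x1, x2 = x
    return np.array([
        -x1 + x2,
        -x1 - (1.0 - np.sin(x1) ** 2) * x2 + np.sin(x1) * u + 0.1 * u ** 2,
    ])

def rk4_step(f, x, u, dt):
    """One classical Runge-Kutta step with the input held constant."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def simulate(x0, controller, t_end=10.0, dt=0.01):
    """Integrate the plant; the controller only sees time and the output y = x1."""
    x = np.array(x0, dtype=float)
    ys = []
    for i in range(int(round(t_end / dt))):
        u = controller(i * dt, x[0])
        x = rk4_step(plant, x, u, dt)
        ys.append(x[0])
    return x, np.array(ys)

# Observer matrix from the text; its eigenvalues are the roots of s^2 + 5s + 6.
A = np.array([[0.0, 1.0], [-6.0, -5.0]])
```

For example, `simulate([1.0, -0.5], lambda t, y: 0.0)` integrates the unforced plant from the initial condition used in the text.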
Fig. 8.8 The error between the actual state x1 and observed state x̂1

Fig. 8.9 The error between the actual state x2 and observed state x̂2

Fig. 8.10 System output y

Fig. 8.11 Control input u

The activation functions of the critic NN are chosen from the sixth-order series expansion of the value function. Only polynomial terms of even order are considered, that is,

$$\sigma_c = \left[x_1^{2}, x_1x_2, x_2^{2}, x_1^{4}, x_1^{3}x_2, x_1^{2}x_2^{2}, x_1x_2^{3}, x_2^{4}, x_1^{6}, x_1^{5}x_2, x_1^{4}x_2^{2}, x_1^{3}x_2^{3}, x_1^{2}x_2^{4}, x_1x_2^{5}, x_2^{6}\right]^{T}.$$

The critic NN weight vector is denoted as $\hat{W}_c = [\hat{W}_{c1}, \ldots, \hat{W}_{c15}]^{T}$. The learning rate for the critic NN is selected as $\alpha = 0.5$, and the initial weight vector of $\hat{W}_c$ is set to be $[1, \ldots, 1]^{T}$. In order to maintain the PE condition, a small exploratory signal

$$N(t) = e^{-0.1t}\left[\sin^{2} t \cos t + \sin^{2}(2t)\cos(0.1t)\right]$$

is added to the control input for the first 10 s as in [25].
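The 15 even-order polynomial activations and the exploratory signal can be generated programmatically. The sketch below is illustrative; the function names are hypothetical, and the Jacobian is included because the critic update needs the gradient of the activation vector.

```python
import numpy as np

def sigma_c(x1, x2):
    """Even-order monomials of total degree 2, 4, and 6 (15 terms)."""
    return np.array([x1 ** (order - j) * x2 ** j
                     for order in (2, 4, 6) for j in range(order + 1)])

def grad_sigma_c(x1, x2):
    """Jacobian of sigma_c with respect to (x1, x2), shape (15, 2)."""
    rows = []
    for order in (2, 4, 6):
        for j in range(order + 1):
            p = order - j                   # exponent of x1
            d1 = p * x1 ** (p - 1) * x2 ** j if p > 0 else 0.0
            d2 = j * x1 ** p * x2 ** (j - 1) if j > 0 else 0.0
            rows.append([d1, d2])
    return np.array(rows)

def probing_noise(t):
    """Exploratory signal N(t) added to the control input."""
    return np.exp(-0.1 * t) * (np.sin(t) ** 2 * np.cos(t)
                               + np.sin(2.0 * t) ** 2 * np.cos(0.1 * t))
```

The ordering of the monomials follows the vector $\sigma_c$ given in the text, so the weight indices $\hat{W}_{c1}, \ldots, \hat{W}_{c15}$ line up with the entries returned by `sigma_c`.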


Figures 8.10 and 8.11 depict the system output trajectory y and the nearly optimal
control signal u, respectively. It can be seen from Figs. 8.10 and 8.11 that the present
NN observer-based optimal control algorithm is effective for output regulation of
unknown nonlinear systems.

8.4 Conclusions

In this chapter, using the ADP methods, an identifier–actor–critic architecture is presented to obtain the approximate optimal control for continuous-time unknown nonaffine nonlinear systems. Meanwhile, a novel ADP-based observer–critic architecture is developed to obtain the optimal output regulation for nonaffine nonlinear systems with unknown dynamics. A limitation of the present methods is that they do not consider external disturbances acting on nonaffine nonlinear systems, although such disturbances might destabilize the closed-loop system. Accordingly, new ADP methods should be studied to derive the robust optimal control for continuous-time nonaffine nonlinear systems.

References

1. Abdollahi F, Talebi HA, Patel RV (2006) A stable neural network-based observer with appli-
cation to flexible-joint manipulators. IEEE Trans Neural Netw 17(1):118–129
2. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
3. Ahmed M, Riyaz S (2000) Dynamic observer-a neural net approach. J Intell Fuzzy Syst
9(1–2):113–127
4. Apostol TM (1974) Mathematical analysis. Addison-Wesley, Boston, MA
5. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
6. Bian T, Jiang Y, Jiang ZP (2014) Adaptive dynamic programming and optimal control of
nonlinear nonaffine systems. Automatica 50(10):2624–2632
7. Campbell SL, Meyer CD (1991) Generalized inverses of linear transformations. Dover Publi-
cations, New York
8. Goodwin GC, Payne RL (1977) Dynamic system identification: experiment design and data
analysis. Academic Press, New York

9. Hayakawa T, Haddad WM, Hovakimyan N (2008) Neural network adaptive control for a class
of nonlinear uncertain dynamical systems with asymptotic stability guarantees. IEEE Trans
Neural Netw 19(1):80–89
10. Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press, New York
11. Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping
and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
12. Igelnik B, Pao YH (1995) Stochastic choice of basis functions in adaptive function approxi-
mation and the functional-link net. IEEE Trans Neural Netw 6(6):1320–1329
13. Ioannou P, Sun J (1996) Robust adaptive control. Prentice-Hall, Upper Saddle River, NJ
14. Jin G, Sain M, Pham K, Spencer B, Ramallo J (2001) Modeling MR-dampers: A nonlinear
blackbox approach. In: Proceedings of the American control conference. pp 429–434
15. Khalil HK (2001) Nonlinear systems. Prentice-Hall, Upper Saddle River, NJ
16. LaSalle JP, Lefschetz S (1967) Stability by Liapunov’s direct method with applications. Aca-
demic Press, New York
17. Lewis F, Jagannathan S, Yesildirak A (1999) Neural network control of robot manipulators and
nonlinear systems. Taylor & Francis, London
18. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control. Wiley, Hoboken, NJ
19. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
20. Michel AN, Hou L, Liu D (2015) Stability of dynamical systems: on the role of monotonic
and non-monotonic Lyapunov functions. Birkhäuser, Boston, MA
21. Modares H, Lewis F, Naghibi-Sistani MB (2014) Online solution of nonquadratic two-player
zero-sum games arising in the H∞ control of constrained input systems. Int J Adap Control
Signal Process 28(3–5):232–254
22. Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, New York
23. Selmic RR, Lewis FL (2001) Multimodel neural networks identification and failure detection
of nonlinear systems. In: Proceedings of the IEEE conference on decision and control. pp
3128–3133
24. Theocharis J, Petridis V (1994) Neural network observer for induction motor control. IEEE
Control Syst Mag 14(2):26–37
25. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
26. Vrabie D, Lewis F (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
27. Walter E, Pronzato L (1997) Identification of parametric models from experimental data.
Springer, Heidelberg
28. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of unknown
continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–566
29. Yang X, Liu D, Wang D, Wei Q (2014) Discrete-time online learning control for a class of
unknown nonaffine nonlinear systems using reinforcement learning. Neural Netw 55:30–41
30. Yu W (2009) Recent advances in intelligent control systems. Springer, London
31. Yu W, Li X (2001) Some new results on system identification with dynamic neural networks.
IEEE Trans Neural Netw 12(2):412–417
32. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
33. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
Chapter 9
Robust and Optimal Guaranteed Cost
Control of Continuous-Time Nonlinear
Systems

9.1 Introduction

Practical control systems are always subject to model uncertainties, exogenous


disturbances, or other changes in their lifetime. They are often considered during
the controller design process in order to avoid the performance deterioration of nom-
inal closed-loop systems. A controller is said to be robust if it works even if the actual
system deviates from its nominal model on which the controller design is based. The
importance of the robust control problems is evident, and it has been recognized by
control scientists for several decades [9–11, 19]. In [19], it was shown that the robust
control problems could be solved by studying an optimal control problem of the
nominal system. Hence, optimal control methods can be employed to design robust
controllers. However, the results are restricted to a class of systems with special form
of uncertainties. Though Adhyaru et al. [2, 3] proposed an HJB equation-based opti-
mal control algorithm to deal with the nonlinear robust control problem, the algorithm
was constructed using the least square method and performed offline, not to mention
the stability analysis of the closed-loop optimal control system. On the other hand,
when controlling a real plant, it is desirable to design a controller which guarantees
not only the asymptotic stability of the closed-loop system but also an adequate level
of performance under uncertainties. The so-called guaranteed cost control approach
[6] has the advantage of providing an upper bound on a given cost, and thus the
system performance degradation incurred by the model parameter uncertainties is
guaranteed to be less than this bound [58, 59]. The optimal robust guaranteed cost
control problem arises when discussing the optimality of the guaranteed cost func-
tion. In this chapter, we study the robust control and optimal guaranteed cost control
of uncertain nonlinear systems using the online ADP strategy.
The ADP algorithm was first proposed by Werbos [51, 52] as an effective method
to solve optimization and optimal control problems. In general, it is implemented by
solving the HJB equation based on function approximators, such as neural networks
(NNs). It is one of the key directions for future research in intelligent control
and understanding brain intelligence [53, 54]. As a result, the ADP and related
research have gained much attention from scholars across many disciplines; see, e.g.,
[14, 16, 24, 37, 39, 49, 61] and the numerous references therein. Significantly, the

© Springer International Publishing AG 2017 345


D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_9
ADP method has often been used in feedback control applications, both for discrete-
time systems [8, 12, 17, 20, 22, 25, 34, 35, 43–46, 50, 57] and for continuous-time
systems [1, 5, 7, 21, 23, 26, 27, 31, 36, 41, 42, 55, 60]. Besides, various traditional
control problems, such as robust control [2, 3], decentralized control [29], networked
control [56], power system control [18], are studied under the new framework, which
greatly extends the application scope of ADP methods.
In this chapter, we first employ an online policy iteration (PI) algorithm to tackle
the robust control problem [47]. The robust control problem is transformed into an
optimal control problem with the cost function modified to account for uncertainties.
Then, an online PI algorithm is developed to solve the HJB equation by constructing
and training a critic network. It is shown that an approximate closed-form expression
of the optimal control law is available. Hence, there is no need to build an action net-
work. The uniform ultimate boundedness of the closed-loop system is analyzed by
using the Lyapunov approach. Since the ADP method is effective in solving optimal
control problem and NNs can be constructed to facilitate the implementation process,
it is convenient to employ the PI algorithm to handle robust control problem. Thus,
the present robust control approach is easy to understand and implement. It can be
used to solve a broad class of nonlinear robust control problems. Next, we inves-
tigate the optimal guaranteed cost control of continuous-time uncertain nonlinear
systems using NN-based online solution of HJB equation [28]. The optimal robust
guaranteed cost control problem is transformed into an optimal control problem by
introducing an appropriate cost function. It can be proved that the optimal cost func-
tion of the nominal system is the optimal guaranteed cost of the uncertain system.
Then, a critic network is constructed for facilitating the solution of modified HJB
equation. Moreover, inspired by the work of [7, 36], an additional stabilizing term
is introduced to ensure the stability, which relaxes the need for an initial stabilizing
control. The uniform ultimate boundedness of the closed-loop system is also proved
using Lyapunov’s direct approach. Besides, it is shown that the approximate control
input can converge to the optimal control within a small bound.

9.2 Robust Control of Uncertain Nonlinear Systems

Consider a class of continuous-time nonlinear systems described by

ẋ(t) = f (x(t)) + g(x(t))u(t) + Δf (x(t)), (9.2.1)

where x(t) ∈ Rn is the state vector and u(t) ∈ Rm is the control vector, f (·) and g(·)
are differentiable in their arguments with f (0) = 0, and Δf (x(t)) is the unknown
perturbation. Here, we let x(0) = x0 be the initial state.
Denote
f¯(x) = f (x) + Δf (x).

Suppose that the function f¯(x) is known only up to an additive perturbation, which
is bounded by a known function in the range of g(x). Note that the condition for
the unknown perturbation to be in the range space of g(x) is called the matching
condition. Thus, we write $\Delta f(x) = g(x)d(x)$ with $d(x) \in \mathbb{R}^{m}$, which represents
the matched uncertainty of the system dynamics. Assume that the function $d(x)$ is
bounded by a known function $d_M(x)$, i.e., $\|d(x)\| \le d_M(x)$ with $d_M(0) = 0$. We also
assume that $d(0) = 0$, so that $x = 0$ is an equilibrium.
For system (9.2.1), to deal with the robust control problem, we should find a feed-
back control law u(x), such that the closed-loop system is globally asymptotically
stable for all admissible uncertainties d(x). Here, we will show that this problem
can be converted into designing an optimal controller for the corresponding nominal
system with appropriate cost function.
Consider the nominal system corresponding to (9.2.1)

ẋ(t) = f (x(t)) + g(x(t))u(t), (9.2.2)

where we assume that f + gu is Lipschitz continuous on a set Ω in Rn containing


the origin, and that system (9.2.2) is controllable in the sense that there exists a
continuous control law on Ω that asymptotically stabilizes the system. It is desired
to find a control law $u(x)$ which minimizes the infinite horizon cost function given by

$$J(x_0, u) = \int_{0}^{\infty} \left\{\rho d_M^{2}(x(\tau)) + U(x(\tau), u(x(\tau)))\right\} \mathrm{d}\tau, \tag{9.2.3}$$

where ρ > 0 is a constant, U (x, u) ≥ 0 is the utility function with U (0, 0) = 0. In this
section, the utility function is chosen as the quadratic form U (x, u) = x TQx + u TRu,
where Q and R are positive definite matrices with Q ∈ Rn×n and R ∈ Rm×m . The
cost function described in (9.2.3) gives a modification with respect to the ordinary
optimal control problem, which appropriately reflects the uncertainties, regulation,
and control simultaneously.
When dealing with the optimal control problem, the designed feedback control
must be admissible. A control u(x) : Rn → Rm is said to be admissible with respect
to (9.2.2) on Ω, written as u(x) ∈ A (Ω), if u(x) is continuous on Ω, u(0) = 0,
u(x) stabilizes system (9.2.2) on Ω and J (x0 , u) is finite for every x0 ∈ Ω. Given a
control $\mu \in \mathcal{A}(\Omega)$, if the associated value function

$$V(x(t)) = \int_{t}^{\infty} \left\{\rho d_M^{2}(x(\tau)) + U(x(\tau), \mu(x(\tau)))\right\} \mathrm{d}\tau \tag{9.2.4}$$

is continuously differentiable, then an infinitesimal version of (9.2.4) is the so-called nonlinear Lyapunov equation

$$\rho d_M^{2}(x) + U(x, \mu(x)) + (\nabla V(x))^{T}\left(f(x) + g(x)\mu(x)\right) = 0 \tag{9.2.5}$$

with $V(0) = 0$. In (9.2.5), the term $\nabla V(x)$ denotes the partial derivative of the value function $V(x)$ with respect to $x$, i.e., $\nabla V(x) = \partial V(x)/\partial x$.
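Define the Hamiltonian of the problem and the optimal cost function as

$$H(x, \mu, \nabla V(x)) = \rho d_M^{2}(x) + U(x, \mu) + (\nabla V(x))^{T}\left(f(x) + g(x)\mu\right)$$

and

$$V^{*}(x) = \min_{\mu \in \mathcal{A}(\Omega)} \int_{t}^{\infty} \left\{\rho d_M^{2}(x(\tau)) + U(x(\tau), \mu(x(\tau)))\right\} \mathrm{d}\tau. \tag{9.2.6}$$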

The optimal value function defined above will be used interchangeably with optimal cost function $J^{*}(x_0)$. The optimal value function $V^{*}(x)$ satisfies the HJB equation

$$\min_{\mu \in \mathcal{A}(\Omega)} H(x, \mu, \nabla V^{*}(x)) = 0, \tag{9.2.7}$$

where $\nabla V^{*}(x) = \partial V^{*}(x)/\partial x$. Assume that the minimum on the left-hand side of (9.2.7) exists and is unique. Then, the optimal control law for the given problem is

$$u^{*}(x) = -\frac{1}{2} R^{-1} g^{T}(x) \nabla V^{*}(x). \tag{9.2.8}$$

Substituting (9.2.8) into the nonlinear Lyapunov equation (9.2.5), we can obtain the HJB equation in terms of $\nabla V^{*}(x)$ as follows:

$$\rho d_M^{2}(x) + x^{T}Qx + (\nabla V^{*}(x))^{T} f(x) - \frac{1}{4} (\nabla V^{*}(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) = 0 \tag{9.2.9}$$

with $V^{*}(0) = 0$.

9.2.1 Equivalence Analysis and Problem Transformation

In this section, we establish the following theorem for the transformation between
the robust control problem and the optimal control problem.
Theorem 9.2.1 For nominal system (9.2.2) with the cost function (9.2.3), assume
that the HJB equation (9.2.7) has a solution V ∗ (x). Then, using this solution, the
optimal control law obtained in (9.2.8) ensures closed-loop asymptotic stability of
uncertain nonlinear system (9.2.1), provided that the following condition is satisfied:

$$\rho d_M^{2}(x) \ge d^{T}(x) R d(x). \tag{9.2.10}$$

Proof Let V ∗ (x) be the optimal solution of the HJB equation (9.2.7) and u ∗ (x) be
the optimal control law defined by (9.2.8). Now, we prove that u ∗ (x) is a solution to
the robust control problem, namely system (9.2.1) is asymptotically stable under the
control u ∗ (x) for all possible uncertainties d(x). To do this, one needs to prove that
V ∗ (x) is a Lyapunov function of (9.2.1).
According to (9.2.6), $V^{*}(x) > 0$ for any $x \ne 0$ and $V^{*}(x) = 0$ when $x = 0$. This means that $V^{*}(x)$ is positive definite. Using (9.2.7), we have

$$(\nabla V^{*}(x))^{T}\left(f(x) + g(x)u^{*}(x)\right) = -\rho d_M^{2}(x) - U(x, u^{*}(x)). \tag{9.2.11}$$

Using (9.2.8), it follows

2(u ∗ (x))T R = −(∇V ∗ (x))T g(x). (9.2.12)

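Considering (9.2.11) and (9.2.12), we find

$$\begin{aligned}
\dot{V}^{*}_{(9.2.1)}(x) &\triangleq (\nabla V^{*}(x))^{T}\left(f(x) + \Delta f(x) + g(x)u^{*}(x)\right) \\
&= -\rho d_M^{2}(x) - U(x, u^{*}(x)) - 2(u^{*}(x))^{T} R d(x).
\end{aligned} \tag{9.2.13}$$

By adding and subtracting the term $d^{T}(x) R d(x)$, (9.2.13) becomes

$$\begin{aligned}
\dot{V}^{*}_{(9.2.1)}(x) &= -\rho d_M^{2}(x) - x^{T}Qx + d^{T}(x) R d(x) - \left(u^{*}(x) + d(x)\right)^{T} R \left(u^{*}(x) + d(x)\right) \\
&\le -\left(\rho d_M^{2}(x) - d^{T}(x) R d(x)\right) - x^{T}Qx.
\end{aligned}$$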

Using (9.2.10), we can conclude that $\dot{V}^{*}_{(9.2.1)}(x) \le -x^{T}Qx < 0$ for any $x \ne 0$. Then, the conditions for Lyapunov local stability theory are satisfied. Thus, there exists a neighborhood $\Phi = \{x : \|x(t)\| < c\}$ for some $c > 0$ such that if $x(t) \in \Phi$, then $\lim_{t\to\infty} x(t) = 0$.
However, $x(t)$ cannot remain forever outside $\Phi$. Otherwise, $\|x(t)\| \ge c$ for all $t \ge 0$. Denote

$$q = \inf_{\|x(t)\| \ge c} \left\{x^{T}Qx\right\} > 0.$$

Clearly,

$$\dot{V}^{*}_{(9.2.1)}(x) \le -x^{T}Qx \le -q.$$

Then,

$$V^{*}(x(t)) - V^{*}(x(0)) = \int_{0}^{t} \dot{V}^{*}_{(9.2.1)}(x(\tau))\,\mathrm{d}\tau \le -qt. \tag{9.2.14}$$

From (9.2.14), we obtain

$$V^{*}(x(t)) \le V^{*}(x(0)) - qt \to -\infty \quad \text{as } t \to \infty.$$

This contradicts the fact that $V^{*}(x(t)) > 0$ for any $x \ne 0$. Therefore, $\lim_{t\to\infty} x(t) = 0$ no matter where the trajectory starts from in Ω. This completes the proof.

Remark 9.2.1 According to Theorem 9.2.1, if we set $\rho = 1$ and $R = I_m$, where $I_m$ denotes the $m \times m$ identity matrix, then (9.2.10) becomes $d_M^{2}(x) \ge d^{T}(x)d(x) = \|d(x)\|^{2}$. Hence, in this special case, the robust control problem is equivalent to the optimal control problem without the need for any additional conditions. Otherwise, the formula (9.2.10) should be satisfied in order to ensure the equivalence of problem transformation.
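Theorem 9.2.1 can be illustrated on a scalar example in which the HJB equation (9.2.9) has a closed-form solution. Everything specific in this sketch is an assumption chosen for illustration: the system $f(x) = ax$, $g(x) = 1$, the perturbation $d(x) = \delta x \sin(5x)$ with bound $d_M(x) = \delta|x|$, the numerical values, and the forward-Euler integration.

```python
import numpy as np

# Hypothetical scalar problem data; rho >= r makes (9.2.10) hold for this d_M.
a, q, r, rho, delta = 1.0, 1.0, 1.0, 1.0, 0.8

def d(x):
    """Matched uncertainty, bounded by d_M(x) = delta * |x|."""
    return delta * x * np.sin(5.0 * x)

def d_M(x):
    return delta * abs(x)

# With V*(x) = P x^2, (9.2.9) reduces to rho*delta^2 + q + 2 a P - P^2 / r = 0.
P = r * (a + np.sqrt(a ** 2 + (q + rho * delta ** 2) / r))

def u_star(x):
    """Optimal control (9.2.8) with g = 1 and grad V* = 2 P x."""
    return -P * x / r

def simulate(x0=2.0, t_end=10.0, dt=1e-3):
    """Forward-Euler integration of the uncertain system (9.2.1) under u*."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        x += dt * (a * x + u_star(x) + d(x))
    return x
```

Although the nominal plant is open-loop unstable ($a > 0$), the closed-loop gain $a - P/r = -\sqrt{a^{2} + (q + \rho\delta^{2})/r}$ dominates the perturbation gain $\delta$ when $\rho \ge r$, so the perturbed state still decays.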

In light of Theorem 9.2.1, by acquiring the solution of the HJB equation (9.2.9)
and then deriving the optimal control law (9.2.8), we can obtain the robust control
law for system (9.2.1) in the presence of matched uncertainties. However, due to
the nonlinear nature of the HJB equation, finding its solution is often difficult. In
what follows, we will introduce an online policy iteration (PI) algorithm to solve the
problem based on NN techniques.

9.2.2 Online Algorithm and Neural Network Implementation

According to [40], PI algorithms consist of policy evaluation based on (9.2.5) and


policy improvement based on (9.2.8). Its iteration procedure can be described as
follows.

Step 1: Choose a small number $\varepsilon > 0$. Let $i = 0$ and $V^{(0)} = 0$. Then, start with an initial admissible control law $\mu^{(0)}(x) \in \mathcal{A}(\Omega)$.

Step 2: Let $i = i + 1$. Based on the control law $\mu^{(i-1)}(x)$, solve the nonlinear Lyapunov equation

$$\rho d_M^{2}(x) + U\left(x, \mu^{(i-1)}(x)\right) + \left(\nabla V^{(i)}(x)\right)^{T}\left(f(x) + g(x)\mu^{(i-1)}(x)\right) = 0$$

with $V^{(i)}(0) = 0$.

Step 3: Update the control law via

$$\mu^{(i)}(x) = -\frac{1}{2} R^{-1} g^{T}(x) \nabla V^{(i)}(x).$$

Step 4: If $|V^{(i)}(x) - V^{(i-1)}(x)| \le \varepsilon$, stop and obtain the approximate optimal control law $\mu^{(i)}(x)$; else, go back to Step 2.

The algorithm will converge to the optimal value function and optimal control law,
i.e., V (i) (x) → V ∗ (x) and μ(i) (x) → u ∗ (x) as i → ∞. The convergence proofs of
the PI algorithm above have been given in [1, 4, 21, 32].
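Steps 1 to 4 can be sketched for a special case in which each step is solvable exactly. The sketch assumes a linear nominal system $f(x) = Ax$, $g(x) = B$ and a quadratic bound $d_M^{2}(x) = x^{T}Dx$, so that $\rho d_M^{2}(x) + x^{T}Qx = x^{T}\bar{Q}x$ with $\bar{Q} = Q + \rho D$, and $V^{(i)}(x) = x^{T}P_i x$. These restrictions, the matrix names, and the Kronecker-product Lyapunov solver are assumptions for illustration; the general nonlinear case requires a function approximator, as developed next.

```python
import numpy as np

def lyap(Ac, M):
    """Solve Ac^T P + P Ac + M = 0 for P using Kronecker products."""
    n = Ac.shape[0]
    I = np.eye(n)
    L = np.kron(Ac.T, I) + np.kron(I, Ac.T)     # acts on the row-major vec of P
    return np.linalg.solve(L, -M.reshape(-1)).reshape(n, n)

def policy_iteration(A, B, Qbar, R, K0, eps=1e-10, max_iter=50):
    """Policy iteration with mu^(i)(x) = -K_i x and V^(i)(x) = x^T P_i x."""
    K, P_old = K0, np.zeros_like(A)
    for i in range(1, max_iter + 1):
        Ac = A - B @ K                           # closed loop under mu^(i-1)
        P = lyap(Ac, Qbar + K.T @ R @ K)         # Step 2: policy evaluation
        K = np.linalg.solve(R, B.T @ P)          # Step 3: policy improvement
        if np.max(np.abs(P - P_old)) <= eps:     # Step 4: stopping test
            break
        P_old = P
    return P, K, i
```

Here Step 3 uses $\nabla V^{(i)}(x) = 2P_i x$, which turns $-\frac{1}{2}R^{-1}g^{T}\nabla V^{(i)}$ into $-R^{-1}B^{T}P_i x$. The initial gain `K0` must make $A - BK_0$ Hurwitz, which is the admissibility requirement of Step 1.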
Next, we present the implementation process of the PI algorithm based on a critic
NN. Assume that the value function V (x) is continuously differentiable. According
to the universal approximation property of NNs, V (x) can be reconstructed by an
NN on a compact set Ω as

V (x) = WcTσc (x) + εc (x), (9.2.15)

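where $W_c \in \mathbb{R}^{l}$ is the ideal weight vector, $\sigma_c(x) \in \mathbb{R}^{l}$ is the activation function, $l$ is the number of neurons in the hidden layer, and $\varepsilon_c(x)$ is the approximation error of the NN. Then,

$$\nabla V(x) = (\nabla\sigma_c(x))^{T} W_c + \nabla\varepsilon_c(x), \tag{9.2.16}$$

where $\nabla V(x) = \partial V(x)/\partial x$, $\nabla\sigma_c(x) = \partial\sigma_c(x)/\partial x \in \mathbb{R}^{l \times n}$ and $\nabla\varepsilon_c(x) = \partial\varepsilon_c(x)/\partial x \in \mathbb{R}^{n}$ are the gradients of the activation function and the NN approximation error, respectively.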
Using (9.2.16), the Lyapunov equation (9.2.5) takes the following form:

$$\rho d_M^{2}(x) + U(x, \mu) + \left(W_c^{T} \nabla\sigma_c(x) + (\nabla\varepsilon_c(x))^{T}\right) \dot{x}_{(9.2.2)} = 0, \tag{9.2.17}$$

where we use ẋ(9.2.2) to indicate it is from (9.2.2). Assume that the weight vector Wc ,
the gradient ∇σc (x), and the approximation error εc (x) and its derivative ∇εc (x)
are all bounded on a compact set Ω. We also have εc (x) → 0 and ∇εc (x) → 0 as
l → ∞ [41].
Since the ideal weights are unknown, a critic NN can be built in terms of the
estimated weights as
V̂ (x) = ŴcT σc (x) (9.2.18)

to approximate the value function. In (9.2.18), $\sigma_c(x)$ is selected such that $\hat{V}(x) > 0$ for any $x \ne 0$ and $\hat{V}(x) = 0$ when $x = 0$. Then,

∇ V̂ (x) = (∇σc (x))T Ŵc , (9.2.19)

where ∇ V̂ (x) = ∂ V̂ (x)/∂ x.


The approximate Hamiltonian can be derived as

$$H(x, \mu, \hat{W}_c) = \rho d_M^{2}(x) + U(x, \mu) + \hat{W}_c^{T} \nabla\sigma_c(x) \dot{x}_{(9.2.2)} \triangleq e_c. \tag{9.2.20}$$

To train the critic NN, it is desired to design $\hat{W}_c$ to minimize the objective function

$$E_c = \frac{1}{2} e_c^{2}.$$

We employ the standard steepest descent algorithm to tune the weights of the critic network, i.e.,

$$\dot{\hat{W}}_c = -\alpha_c \left(\frac{\partial E_c}{\partial \hat{W}_c}\right) = -\alpha_c e_c \left(\frac{\partial e_c}{\partial \hat{W}_c}\right), \tag{9.2.21}$$

where $\alpha_c > 0$ is the learning rate of the critic NN.


Using (9.2.17), the Hamiltonian becomes

$$H(x, \mu, W_c) = \rho d_M^{2}(x) + U(x, \mu) + W_c^{T} \nabla\sigma_c(x) \dot{x}_{(9.2.2)} = e_{cH}, \tag{9.2.22}$$

where $e_{cH} = -(\nabla\varepsilon_c(x))^{T} \dot{x}$ is the residual error due to the NN approximation.
Denote $\theta = \nabla\sigma_c(x)\dot{x}$. Assume that there exists a constant $\theta_M > 0$ such that $\|\theta\| \le \theta_M$, and let the weight estimation error of the critic NN be $\tilde{W}_c = W_c - \hat{W}_c$. Then, considering (9.2.20) and (9.2.22), we have $e_{cH} - e_c = \tilde{W}_c^{T}\theta$. Therefore, the dynamics of weight estimation error is

$$\dot{\tilde{W}}_c = -\dot{\hat{W}}_c = \alpha_c\left(e_{cH} - \tilde{W}_c^{T}\theta\right)\theta. \tag{9.2.23}$$

The PE condition is required to tune the critic network, ensuring $\|\theta\| \ge \theta_m$, where $\theta_m > 0$ is a constant. A small exploratory signal will be added to the system in order to satisfy the PE condition.
When implementing the online PI algorithm, for the purpose of policy improve-
ment, we should obtain the policy that can minimize the value function in (9.2.15).
Hence, according to (9.2.8) and (9.2.16),
\[
\mu(x) = -\frac{1}{2} R^{-1} g^{T}(x) \big( (\nabla\sigma_c(x))^{T} W_c + \nabla\varepsilon_c(x) \big). \tag{9.2.24}
\]
The approximate control law can be formulated as
\[
\hat{\mu}(x) = -\frac{1}{2} R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} \hat{W}_c. \tag{9.2.25}
\]
Equation (9.2.25) implies that based on the trained critic NN, the approximate control
law can be derived directly. The actor-critic architecture is maintained, but training
of the action NN is not required in this case since we have closed-form solution
available. The diagram of the online PI algorithm is depicted in Fig. 9.1.

Fig. 9.1 Diagram of the online PI algorithm (solid lines represent signals and the dashed line
represents the backpropagation path)

9.2.3 Stability Analysis of Closed-Loop System

The weight estimation dynamics and the closed-loop system based on the approxi-
mate optimal control law are uniformly ultimately bounded (UUB) as shown in the
following two theorems.
Theorem 9.2.2 For the controlled system (9.2.2), the weight update law for tuning
the critic NN is given by (9.2.21). Then, the dynamics of the weight estimation error
of the critic NN is UUB.
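Proof Select the Lyapunov function candidate as L(x) = (1/αc ) W̃cT W̃c . Taking the
time derivative of L(x) along the trajectory of the error dynamics (9.2.23) and
considering the Cauchy–Schwarz inequality, it follows that
\[
\begin{aligned}
\dot{L}(x) &= \frac{2}{\alpha_c} \tilde{W}_c^{T} \dot{\tilde{W}}_c \\
&= \frac{2}{\alpha_c} \tilde{W}_c^{T} \alpha_c \big( e_{cH} - \tilde{W}_c^{T}\theta \big) \theta \\
&= \frac{2}{\alpha_c} \Big( e_{cH}\, \alpha_c \tilde{W}_c^{T}\theta - \alpha_c \big( \tilde{W}_c^{T}\theta \big)^2 \Big) \\
&\le \frac{2}{\alpha_c} \Big( \frac{1}{2} e_{cH}^2 + \frac{1}{2} \alpha_c^2 \big( \tilde{W}_c^{T}\theta \big)^2 - \alpha_c \big( \tilde{W}_c^{T}\theta \big)^2 \Big) \\
&= \frac{1}{\alpha_c} e_{cH}^2 - (2 - \alpha_c) \big( \tilde{W}_c^{T}\theta \big)^2 .
\end{aligned}
\]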

We can conclude that L̇(x) < 0 as long as 0 < αc < 2 and
\[
|\tilde{W}_c^{T}\theta| > \sqrt{\frac{e_{cH}^2}{\alpha_c (2 - \alpha_c)}} .
\]

By employing the dense property of real numbers [38], we derive that there exists a
positive constant κ (0 < κ ≤ θM ) such that
\[
|\tilde{W}_c^{T}\theta| > \kappa \|\tilde{W}_c\| > \sqrt{\frac{e_{cH}^2}{\alpha_c (2 - \alpha_c)}} . \tag{9.2.26}
\]

By noticing (9.2.26), we can conclude that L̇(x) < 0 as long as
\[
\begin{cases}
0 < \alpha_c < 2, \\[4pt]
\|\tilde{W}_c\| > \sqrt{\dfrac{e_{cH}^2}{\alpha_c \kappa^2 (2 - \alpha_c)}} .
\end{cases} \tag{9.2.27}
\]

By Lyapunov’s extension theorem [13, 15] (or the Lagrange stability result [30]), the
dynamics of the weight estimation error is UUB. The norm of the weight estimation
error is bounded as well. This completes the proof.
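The key inequality in the proof of Theorem 9.2.2 can be spot-checked numerically. The sketch below writes s for the scalar W̃cT θ and compares the exact rate 2·ecH·s − 2s² with the bound ecH²/αc − (2 − αc)s² on random samples; the function names and sampling ranges are illustrative assumptions, not part of the text.

```python
import random


def lyapunov_rate(e_cH, s, alpha_c):
    """Exact L_dot = (2/alpha_c) * (alpha_c*e_cH*s - alpha_c*s^2), with s = W_tilde^T theta."""
    return (2.0 / alpha_c) * (alpha_c * e_cH * s - alpha_c * s * s)


def rate_bound(e_cH, s, alpha_c):
    """Upper bound e_cH^2/alpha_c - (2 - alpha_c)*s^2 from the last line of the derivation."""
    return e_cH ** 2 / alpha_c - (2.0 - alpha_c) * s * s


random.seed(0)
samples = [
    (random.uniform(-1.0, 1.0), random.uniform(-5.0, 5.0), random.uniform(0.05, 1.95))
    for _ in range(1000)
]
# The gap equals (e_cH/sqrt(alpha_c) - sqrt(alpha_c)*s)^2, so it is never negative
bound_holds = all(
    lyapunov_rate(e, s, a) <= rate_bound(e, s, a) + 1e-9 for e, s, a in samples
)
```

With ecH = 0.1 and αc = 0.1, for instance, the bound becomes negative once |s| exceeds roughly 0.23, in line with the threshold stated before (9.2.26).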

Theorem 9.2.3 For system (9.2.2), the weight update law of the critic NN given by
(9.2.21) and the approximate optimal control law obtained by (9.2.25) ensure that,
for any initial state x0 , there exists a time T (x0 , M) such that x(t) is UUB. Here, the
bound M > 0 is given by

\[
\|x(t)\| \le \sqrt{\frac{\beta_M}{\rho\rho_0^2 + \lambda_{\min}(Q)}} \equiv M, \quad t \ge T,
\]

where β M > 0 and ρ0 > 0 are constants to be determined later and λmin (Q) is the
smallest eigenvalue of Q.

Proof Taking the time derivative of V (x) along the trajectory of (9.2.2) generated
by the approximate control law μ̂(x), we can obtain

V̇(9.2.2) = (∇V (x))T ( f (x) + g(x)μ̂). (9.2.28)

Using (9.2.9), we can find
\[
0 = \rho d_M^2(x) + x^{T} Q x + (\nabla V(x))^{T} f(x) - \frac{1}{4} (\nabla V(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V(x). \tag{9.2.29}
\]

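Considering (9.2.29), (9.2.28) becomes
\[
\begin{aligned}
\dot{V}_{(9.2.2)} = {} & -\rho d_M^2(x) - x^{T} Q x + \frac{1}{4} (\nabla V(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V(x) \\
& + (\nabla V(x))^{T} g(x) \hat{\mu}.
\end{aligned} \tag{9.2.30}
\]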

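Adding and subtracting (∇V (x))T g(x)μ to (9.2.30) and using (9.2.24) and (9.2.25),
it follows that
\[
\begin{aligned}
\dot{V}_{(9.2.2)} = {} & -\rho d_M^2(x) - x^{T} Q x + \frac{1}{4} (\nabla V(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V(x) \\
& + (\nabla V(x))^{T} g(x) \mu + (\nabla V(x))^{T} g(x) \hat{\mu} - (\nabla V(x))^{T} g(x) \mu \\
= {} & -\rho d_M^2(x) - x^{T} Q x - \frac{1}{4} (\nabla V(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V(x) \\
& + \frac{1}{2} (\nabla V(x))^{T} g(x) R^{-1} g^{T}(x) \big( \nabla V(x) - \nabla \hat{V}(x) \big).
\end{aligned} \tag{9.2.31}
\]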
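Substituting (9.2.16) and (9.2.19) into (9.2.31), we can further obtain
\[
\begin{aligned}
\dot{V}_{(9.2.2)} = {} & -\rho d_M^2(x) - x^{T} Q x - \frac{1}{4} (\nabla V(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V(x) \\
& + \frac{1}{2} \big( W_c^{T} \nabla\sigma_c(x) + (\nabla\varepsilon_c(x))^{T} \big) g(x) R^{-1} g^{T}(x) \\
& \times \big( (\nabla\sigma_c(x))^{T} \tilde{W}_c + \nabla\varepsilon_c(x) \big).
\end{aligned}
\]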

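Here, we denote
\[
\beta = \frac{1}{2} \big( W_c^{T} \nabla\sigma_c(x) + (\nabla\varepsilon_c(x))^{T} \big) g(x) R^{-1} g^{T}(x) \big( (\nabla\sigma_c(x))^{T} \tilde{W}_c + \nabla\varepsilon_c(x) \big).
\]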

Considering the fact that R −1 is positive definite, the assumption that Wc , ∇σc (x), and
∇εc (x) are bounded, and Theorem 9.2.2, we can conclude that β is upper bounded
by β ≤ β M , where β M > 0 is a constant. Therefore, V̇(9.2.2) takes the following form:

\[
\dot{V}_{(9.2.2)} \le -\rho d_M^2(x) - x^{T} Q x + \beta_M. \tag{9.2.32}
\]

In many cases, we can determine a quadratic bound of d(x). Under such circumstances,
we assume that dM (x) = ρ0 ‖x‖, where ρ0 > 0 is a constant. Then, (9.2.32)
becomes
\[
\dot{V}_{(9.2.2)} \le -\big( \rho\rho_0^2 + \lambda_{\min}(Q) \big) \|x\|^2 + \beta_M.
\]

Hence, we can observe that V̇(9.2.2) < 0 whenever x(t) lies outside the compact set
\[
\Omega_x = \left\{ x : \|x\| \le \sqrt{\frac{\beta_M}{\rho\rho_0^2 + \lambda_{\min}(Q)}} \right\}.
\]
Therefore, based on the approximate optimal control law, the state trajectories of the
closed-loop system are UUB and ‖x(t)‖ ≤ M. This completes the proof.
In the next theorem, the equivalence of the NN-based HJB solution of the optimal
control problem and the solution of robust control problem is established.

Theorem 9.2.4 Assume that the NN-based HJB solution of the optimal control prob-
lem exists. Then, the control law defined by (9.2.25) ensures closed-loop asymptotic
stability of uncertain nonlinear system (9.2.1) if the condition described in (9.2.10)
is satisfied.

Proof Let V̂ (x) be the solution of
\[
\rho d_M^2(x) + U(x, \hat{\mu}(x)) + \big( \nabla \hat{V}(x) \big)^{T} \big( f(x) + g(x)\hat{\mu}(x) \big) = 0
\]
and μ̂(x) be the approximate optimal control law defined by (9.2.25). Then,
\[
2\hat{\mu}^{T}(x) R = -\big( \nabla \hat{V}(x) \big)^{T} g(x).
\]

Now, we show that with the approximate optimal control μ̂(x), the closed-loop
system remains asymptotically stable for all possible uncertainties d(x). According
to (9.2.18) and the selection of σc (x), we have V̂ (0) = 0 and V̂ (x) > 0 when x ≠ 0.
Taking the manipulations similar to the proof of Theorem 9.2.1, we can easily obtain

\[
\dot{\hat{V}}(x) \le -x^{T} Q x,
\]
which implies that V̂˙ (x) < 0 for any x ≠ 0. This completes the proof.

Remark 9.2.2 In [47], an iterative algorithm for online design of robust control for a
class of continuous-time nonlinear systems was developed; however, the optimality
of the robust controller with respect to a specified cost function was not discussed. In
fact, recently, there are few results on robust optimal control of uncertain nonlinear
systems based on ADP, not to mention the decentralized optimal control of large-
scale systems. In [48], the robust optimal control scheme for a class of uncertain
nonlinear systems via ADP technique and without using an initial admissible control
was established. In addition, the developed results of [48] were also extended to
deal with the decentralized optimal control for a class of continuous-time nonlinear
interconnected systems.

9.2.4 Simulation Study

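Consider the following continuous-time nonlinear system:
\[
\dot{x} = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2 \big( 1 - (\cos(2x_1) + 2)^2 \big) \end{bmatrix}
+ \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix} (u + 0.5\, p\, x_1 \sin x_2), \tag{9.2.33}
\]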

where x = [x1 , x2 ]T ∈ R2 and u ∈ R are the state and control variables, respectively,
and p is an unknown parameter. The term d(x) = 0.5 px1 sin x2 reflects the uncertainty
of the control plant. For simplicity, we assume that p ∈ [−1, 1]. Here, we
choose dM (x) = ‖x‖ and we select ρ = 1 for the purpose of simulation.
We aim at obtaining a robust control law that can stabilize system (9.2.33) for
all possible p. This problem can be formulated as the following optimal control
problem. For the nominal system, we need to find a feedback control law u(x) that
minimizes the cost function
\[
J(x_0) = \int_0^{\infty} \big( \|x\|^2 + x^{T} Q x + u^{T} R u \big)\, d\tau, \tag{9.2.34}
\]

where Q = I2 and R = 2. Based on the procedure proposed in [33], the optimal
value function and the optimal control law of the problem are
\[
V^{*}(x) = x_1^2 + 2x_2^2
\]
and
\[
u^{*}(x) = -(\cos(2x_1) + 2)\, x_2 .
\]
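As a sanity check on the stated solution, the following sketch evaluates the HJB residual of the form (9.2.29) for the nominal part of (9.2.33), using ρ = 1, dM(x) = ‖x‖, Q = I2, and R = 2 from the text. The function names are hypothetical; the residual should vanish at every state if V∗ and u∗ are as given.

```python
import math


def f(x1, x2):
    """Drift of the nominal system in (9.2.33)."""
    c = math.cos(2.0 * x1) + 2.0
    return (-x1 + x2, -0.5 * x1 - 0.5 * x2 * (1.0 - c * c))


def g(x1, x2):
    """Input vector field of (9.2.33)."""
    return (0.0, math.cos(2.0 * x1) + 2.0)


def hjb_residual(x1, x2, R=2.0):
    """rho*d_M^2 + x^T Q x + grad_V^T f - (1/4) grad_V^T g R^-1 g^T grad_V for V* = x1^2 + 2 x2^2."""
    grad_v = (2.0 * x1, 4.0 * x2)
    fx, gx = f(x1, x2), g(x1, x2)
    g_dot_grad = gx[0] * grad_v[0] + gx[1] * grad_v[1]
    running = (x1 * x1 + x2 * x2) + (x1 * x1 + x2 * x2)  # rho*||x||^2 + x^T I x
    return running + grad_v[0] * fx[0] + grad_v[1] * fx[1] - 0.25 * g_dot_grad ** 2 / R


def u_star(x1, x2, R=2.0):
    """Optimal control -(1/2) R^-1 g^T grad_V*, which reduces to -(cos(2 x1) + 2) x2."""
    gx = g(x1, x2)
    return -0.5 / R * (gx[0] * 2.0 * x1 + gx[1] * 4.0 * x2)
```

Evaluating hjb_residual on any grid of states returns values at rounding-error level, and u_star agrees with the closed-form expression above.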

We adopt the online PI algorithm to tackle the optimal control problem for the
nominal system, where a critic network is constructed to approximate the value func-
tion. We denote the weight vector of the critic network as Ŵc = [Ŵc1 , Ŵc2 , Ŵc3 ]T .
During the simulation process, the initial weights of the critic network are chosen
randomly in [0, 2] and the weight normalization is not used. The activation function
of the critic network is chosen as σc (x) = [x1², x1 x2 , x2²]T , so the ideal weight is
[1, 0, 2]T . Let the learning rate of the critic network be αc = 0.1 and the initial state
of the controlled plant be x0 = [1, −1]T .
During the implementation process of the PI algorithm, the following small sinu-
soidal exploratory signal with various frequencies will be added to satisfy the PE
condition,

\[
\begin{aligned}
N(t) = {} & \sin^2(t)\cos(t) + \sin^2(2t)\cos(0.1t) + \sin^2(-1.2t)\cos(0.5t) \\
& + \sin^5(t) + \sin^2(1.12t) + \cos(2.4t)\sin^3(2.4t).
\end{aligned} \tag{9.2.35}
\]
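The exploratory signal (9.2.35) is straightforward to code. The sketch below also includes a hypothetical wrapper, probing_signal, that switches the excitation off at t_off = 750 s; that switch time is an assumption taken from the convergence time reported in this simulation, not a general rule.

```python
import math


def exploratory_signal(t):
    """Multi-frequency probing signal N(t) from (9.2.35)."""
    s, c = math.sin, math.cos
    return (
        s(t) ** 2 * c(t)
        + s(2.0 * t) ** 2 * c(0.1 * t)
        + s(-1.2 * t) ** 2 * c(0.5 * t)
        + s(t) ** 5
        + s(1.12 * t) ** 2
        + c(2.4 * t) * s(2.4 * t) ** 3
    )


def probing_signal(t, t_off=750.0):
    """N(t) while learning is active, zero afterwards (t_off is an assumed switch time)."""
    return exploratory_signal(t) if t < t_off else 0.0
```

Each of the six terms has magnitude at most one, so |N(t)| ≤ 6 for all t.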

It is introduced into the control input and thus affects the system states. The weights
of the critic network converge to [0.9978, 0.0008, 1.9997]T as shown in Fig. 9.2a,
which is a good approximation of the ideal ones. In fact, we can observe
that the weights have converged after 750 s. Then, the exploratory signal
is turned off. The evolution of the state trajectory is depicted in Fig. 9.2b. We see that
the state converges to zero quickly after the exploratory signal is turned off.
Using (9.2.18) and (9.2.25), the value function and control law can be approxi-
mated as

[Figure 9.2: panel (a) plots the weights of the critic network against time (s); panel (b) plots the state trajectory x1 , x2 against time (s).]

Fig. 9.2 Simulation results. a Convergence of the weight vector of the critic network (ωac1 , ωac2 ,
and ωac3 represent Ŵc1 , Ŵc2 , and Ŵc3 , respectively). b Evolution of the state trajectory during the
implementation process

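\[
\hat{V}(x) = \begin{bmatrix} 0.9978 \\ 0.0008 \\ 1.9997 \end{bmatrix}^{T} \begin{bmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix}
\]
and
\[
\hat{\mu}(x) = -\frac{1}{2} R^{-1} \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}^{T} \begin{bmatrix} 2x_1 & 0 \\ x_2 & x_1 \\ 0 & 2x_2 \end{bmatrix}^{T} \begin{bmatrix} 0.9978 \\ 0.0008 \\ 1.9997 \end{bmatrix}, \tag{9.2.36}
\]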

respectively. The error between the optimal cost function and the approximate one is
presented in Fig. 9.3a. The error between the optimal control law and the approximate
version is displayed in Fig. 9.3b. Both approximation errors are close to zero, which
verifies good performance of the learning algorithm.
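The two approximation errors can be reproduced from the reported weights. The sketch below evaluates V∗ − V̂ and u∗ − μ̂ on a grid over [−2, 2] × [−2, 2]; the grid range and spacing are assumptions read from the axes of Fig. 9.3, and the function names are illustrative.

```python
import math

W_HAT = (0.9978, 0.0008, 1.9997)  # converged critic weights reported in the text
R = 2.0


def v_hat(x1, x2):
    """Approximate value function W_hat^T [x1^2, x1*x2, x2^2]."""
    return W_HAT[0] * x1 * x1 + W_HAT[1] * x1 * x2 + W_HAT[2] * x2 * x2


def mu_hat(x1, x2):
    """Approximate control law (9.2.36); only the second entry of g(x) is nonzero."""
    g2 = math.cos(2.0 * x1) + 2.0
    grad2 = W_HAT[1] * x1 + 2.0 * W_HAT[2] * x2  # second entry of (grad sigma_c)^T W_hat
    return -0.5 / R * g2 * grad2


def v_star(x1, x2):
    return x1 * x1 + 2.0 * x2 * x2


def u_star(x1, x2):
    return -(math.cos(2.0 * x1) + 2.0) * x2


grid = [-2.0 + 0.1 * k for k in range(41)]
max_v_err = max(abs(v_star(a, b) - v_hat(a, b)) for a in grid for b in grid)
max_u_err = max(abs(u_star(a, b) - mu_hat(a, b)) for a in grid for b in grid)
```

On this grid the value-function error stays below 0.015 and the control error is of order 10⁻³, which matches the scales visible in Fig. 9.3.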

[Figure 9.3: panels (a) and (b) are 3D plots of the approximation error over x1 and x2.]

Fig. 9.3 Simulation results. a 3D plot of the approximation error of the cost function, i.e., V ∗ (x) −
V̂ (x). b 3D plot of the approximation error of the control law, i.e., u ∗ (x) − μ̂(x)

[Figure 9.4: 3D plot of ρdM²(x) and dT (x)Rd(x) over x1 and x2.]

Fig. 9.4 Verification of condition (9.2.10)

[Figure 9.5: 3D plot of Va and dVa over x1 and x2.]

Fig. 9.5 The Lyapunov function and its derivative (Va and d Va represent V̂ and V̂˙ , respectively)

[Figure 9.6: state trajectory x1 , x2 against time (s), from 0 to 15 s.]

Fig. 9.6 The state trajectory under the robust control law μ̂(x) when setting p = 1

Next, the scalar parameter p = 1 is chosen for evaluating the robust control per-
formance. On the one hand, as shown in Fig. 9.4, the condition of Theorem 9.2.1 is
satisfied. On the other hand, for all the values of x, the Lyapunov function V̂ (x) ≥ 0
and its derivative V̂˙ (x) ≤ 0, which are described in Fig. 9.5. Therefore, the control
law designed by solving the NN-based HJB equation is the robust control of the orig-
inal uncertain nonlinear system (9.2.33). In other words, we can employ the control
law (9.2.36) to stabilize system (9.2.33). In fact, under the action of the control law,
the state vector converges to the equilibrium point. Figure 9.6 presents the evolution
process of the state trajectory when applying the control law to system (9.2.33) for
15 s. These simulation results demonstrate the effectiveness of the robust control
method established in this section.
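The robustness test above can be reproduced with a short closed-loop simulation. This is a minimal sketch, assuming a fixed-step fourth-order Runge–Kutta integrator, a step of 0.01 s, and the initial state x0 = [1, −1]T ; the text does not state which integrator or step size was used. The control law is (9.2.36) with the reported weights.

```python
import math

W_HAT = (0.9978, 0.0008, 1.9997)  # converged critic weights reported in the text
R = 2.0


def control(x1, x2):
    """Approximate robust control law (9.2.36)."""
    c = math.cos(2.0 * x1) + 2.0
    return -0.5 / R * c * (W_HAT[1] * x1 + 2.0 * W_HAT[2] * x2)


def dynamics(x, p):
    """Uncertain system (9.2.33) in closed loop with the control law above."""
    x1, x2 = x
    c = math.cos(2.0 * x1) + 2.0
    d = 0.5 * p * x1 * math.sin(x2)  # matched uncertainty d(x)
    return (-x1 + x2, -0.5 * x1 - 0.5 * x2 * (1.0 - c * c) + c * (control(x1, x2) + d))


def rk4_step(x, p, dt):
    """One classical Runge-Kutta step (assumed integrator)."""
    def shifted(a, k, s):
        return (a[0] + s * k[0], a[1] + s * k[1])

    k1 = dynamics(x, p)
    k2 = dynamics(shifted(x, k1, dt / 2.0), p)
    k3 = dynamics(shifted(x, k2, dt / 2.0), p)
    k4 = dynamics(shifted(x, k3, dt), p)
    return (
        x[0] + dt / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]),
        x[1] + dt / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]),
    )


def simulate(x0, p, t_end=15.0, dt=0.01):
    """Integrate the closed loop from x0 and return the final state."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        x = rk4_step(x, p, dt)
    return x


final_states = {p: simulate((1.0, -1.0), p) for p in (-1.0, 0.0, 1.0)}
```

Running the sketch for p = −1, 0, and 1 drives the state close to the origin within 15 s in each case, consistent with Fig. 9.6 for p = 1.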

9.3 Optimal Guaranteed Cost Control of Uncertain Nonlinear Systems

In this section, we study the optimal guaranteed cost control of uncertain nonlinear
systems using the ADP approach [28]. Consider a class of continuous-time uncertain
nonlinear systems given by

\[
\dot{x}(t) = \bar{F}(x(t), u(t)) = f(x(t)) + g(x(t))u(t) + \Delta f(x(t)), \tag{9.3.1}
\]

where x(t) ∈ Rn is the state vector and u(t) ∈ Rm is the control input. The known
functions f (·) and g(·) are differentiable in their arguments with f (0) = 0, and
Δf (x(t)) is the nonlinear perturbation of the corresponding nominal system

ẋ(t) = F(x(t), u(t)) = f (x(t)) + g(x(t))u(t). (9.3.2)

Here, we let x(0) = x0 be the initial state. In addition, we assume that f + gu is
Lipschitz continuous on a set Ω in Rn containing the origin and that the system
(9.3.2) is controllable on Ω.
Before proceeding further, we assign an explicit structure to the system uncer-
tainty. The following assumption is given, which has been used in [9, 11].

Assumption 9.3.1 Assume that the uncertainty Δf (x) has the form

Δf (x) = G(x) d(ϕ(x)), (9.3.3)

where
d T(ϕ(x))d(ϕ(x)) ≤ h T(ϕ(x))h(ϕ(x)). (9.3.4)

In (9.3.3) and (9.3.4), G(·) ∈ Rn×r and ϕ(·) satisfying ϕ(0) = 0 are known functions
denoting the structure of the uncertainty, d(·) ∈ Rr is an uncertain function with
d(0) = 0, and h(·) ∈ Rr is a given function with h(0) = 0.

Consider system (9.3.1) with infinite horizon cost function
\[
\bar{J}(x_0, u) = \int_0^{\infty} U(x(\tau), u(\tau))\, d\tau, \tag{9.3.5}
\]

where
U (x, u) = Q(x) + u TRu,

Q(x) ≥ 0, and R = R T > 0 is a constant matrix.


In this section, the aim of solving the robust guaranteed cost control problem is
to find a feedback control function u(x) and determine a finite upper bound function
Φ(u), i.e., Φ(u) < +∞, such that the closed-loop system is robustly stable and the
cost function (9.3.5) satisfies J¯ ≤ Φ. Here, the upper bound function Φ(u) is termed
the robust guaranteed cost function. Only when Φ(u) is minimized, it is named the
optimal robust guaranteed cost and is denoted as Φ ∗ , i.e., Φ ∗ = inf u Φ(u). Addition-
ally, the corresponding control function ū ∗ is called the optimal robust guaranteed
cost control, i.e.,
\[
\bar{u}^{*} = \arg\inf_{u} \Phi(u).
\]

Next, we will prove that the optimal robust guaranteed cost control problem of
system (9.3.1) can be transformed into an optimal control problem of the nominal
system (9.3.2). The ADP technique can be employed to deal with the optimal control

problem of system (9.3.2). Note that in this section, the feedback control u(x) is
often written as u for simplicity.

9.3.1 Optimal Guaranteed Cost Controller Design

In this section, we show that the guaranteed cost of the uncertain nonlinear system is
closely related to the modified cost function of the nominal system. The next theorem
is from [10] with relaxed conditions.

Theorem 9.3.1 Assume that there exist a continuously differentiable and radially
unbounded value function V (x) satisfying V (x) > 0 for all x ≠ 0 and V (0) = 0, a
bounded function Γ (x) satisfying Γ (x) ≥ 0, and a feedback control function u(x)
such that
\[
(\nabla V(x))^{T} \bar{F}(x, u) \le (\nabla V(x))^{T} F(x, u) + \Gamma(x), \tag{9.3.6}
\]
\[
(\nabla V(x))^{T} F(x, u) + \Gamma(x) < 0, \quad x \ne 0, \tag{9.3.7}
\]
\[
U(x, u) + (\nabla V(x))^{T} F(x, u) + \Gamma(x) = 0, \tag{9.3.8}
\]

where the symbol ∇V (x) denotes the partial derivative of the value function V (x)
with respect to x, i.e., ∇V (x) = ∂ V (x)/∂ x. Then, with the feedback control function
u(x), there exists a neighborhood of the origin such that system (9.3.1) is asymptot-
ically stable. Furthermore,

\[
\bar{J}(x_0, u) \le J(x_0, u) = V(x_0), \tag{9.3.9}
\]
where J (x0 , u) is defined as
\[
J(x_0, u) = \int_0^{\infty} \big( U(x(\tau), u(x(\tau))) + \Gamma(x(\tau)) \big)\, d\tau, \tag{9.3.10}
\]

and is called the modified cost function of system (9.3.2).

Proof First, we show the asymptotic stability of system (9.3.1) under the feedback
control u(x). Let
V̇(9.3.1) (x) = (∇V (x))T F̄(x, u).

Considering (9.3.6) and (9.3.7), we obtain V̇(9.3.1) (x) < 0 for any x ≠ 0. This implies
that V (·) is a Lyapunov function for system (9.3.1), which proves the asymptotic
stability.

According to the definitions of J (x0 , u) and J¯(x0 , u), we can see that

J¯(x0 , u) ≤ J (x0 , u), (9.3.11)

since Γ (x) ≥ 0.
When Δf (x) = 0, we can still find that (9.3.6)–(9.3.8) are true since Γ (x) ≥ 0.
In this case, we derive

V̇(9.3.2) (x) = (∇V (x))T F(x, u).

Then,

U (x, u) + Γ (x) = −V̇(9.3.2) (x) + (∇V (x))T F(x, u) + U (x, u) + Γ (x).

Based on (9.3.8), we obtain

U (x, u) + Γ (x) = −V̇(9.3.2) (x).

Similarly, by integrating over [0, t) and noting that V (x(t)) → 0 as t → ∞, we have
\[
\int_0^{t} \big( U(x, u) + \Gamma(x) \big)\, d\tau = -V(x(t)) + V(x_0).
\]

Here, letting t → ∞ yields


J (x0 , u) = V (x0 ). (9.3.12)

Based on (9.3.11) and (9.3.12), we can easily find that (9.3.9) is true. This completes
the proof.

Theorem 9.3.1 shows that the bounded function Γ (x) takes an important role in
deriving the guaranteed cost of the controlled system. The following lemma presents
a specific form of Γ (x).

Lemma 9.3.1 For any continuously differentiable and radially unbounded function
V (x), define
\[
\Gamma(x) = h^{T}(\varphi(x))\, h(\varphi(x)) + \frac{1}{4} (\nabla V(x))^{T} G(x) G^{T}(x) \nabla V(x). \tag{9.3.13}
\]
Then,
(∇V (x))T Δf (x) ≤ Γ (x). (9.3.14)

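Proof Considering (9.3.3), (9.3.4), and (9.3.13), since
\[
\begin{aligned}
0 &\le \Big( d(\varphi(x)) - \frac{1}{2} G^{T}(x) \nabla V(x) \Big)^{T} \Big( d(\varphi(x)) - \frac{1}{2} G^{T}(x) \nabla V(x) \Big) \\
&= d^{T}(\varphi(x)) d(\varphi(x)) + \frac{1}{4} (\nabla V(x))^{T} G(x) G^{T}(x) \nabla V(x) - (\nabla V(x))^{T} G(x) d(\varphi(x)) \\
&\le h^{T}(\varphi(x)) h(\varphi(x)) + \frac{1}{4} (\nabla V(x))^{T} G(x) G^{T}(x) \nabla V(x) - (\nabla V(x))^{T} \Delta f(x) \\
&= \Gamma(x) - (\nabla V(x))^{T} \Delta f(x),
\end{aligned}
\]
we can see that (9.3.14) holds. This completes the proof.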
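Lemma 9.3.1 can be spot-checked numerically. The sketch below draws random gradients, structure matrices G, and bounds h, then generates an uncertainty d with dT d ≤ hT h as required by (9.3.4), and evaluates the gap Γ(x) − (∇V (x))T G(x)d. The dimensions, sampling ranges, and function names are illustrative assumptions.

```python
import math
import random


def gamma_gap(grad_v, G, d, h):
    """Gamma - grad_V^T G d, with Gamma = h^T h + (1/4) grad_V^T G G^T grad_V as in (9.3.13)."""
    n, r = len(grad_v), len(d)
    gt_grad = [sum(G[i][j] * grad_v[i] for i in range(n)) for j in range(r)]  # G^T grad_V
    gamma = sum(v * v for v in h) + 0.25 * sum(v * v for v in gt_grad)
    cross = sum(gt_grad[j] * d[j] for j in range(r))  # grad_V^T G d = grad_V^T Delta_f
    return gamma - cross


def random_case(rng, n=3, r=2):
    """Random data satisfying the norm bound d^T d <= h^T h of (9.3.4)."""
    grad_v = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    G = [[rng.uniform(-2.0, 2.0) for _ in range(r)] for _ in range(n)]
    h = [rng.uniform(-1.0, 1.0) for _ in range(r)]
    raw = [rng.uniform(-1.0, 1.0) for _ in range(r)]
    h_norm = math.sqrt(sum(v * v for v in h))
    raw_norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    scale = rng.uniform(0.0, 1.0) * h_norm / raw_norm  # shrink d inside the ball of radius ||h||
    return grad_v, G, [scale * v for v in raw], h


rng = random.Random(1)
gaps = [gamma_gap(*random_case(rng)) for _ in range(2000)]
lemma_holds = min(gaps) >= -1e-9
```

The gap equals ‖d − ½GT ∇V‖² + (hT h − dT d), so it is nonnegative for every admissible sample, which is exactly the completion-of-squares step in the proof.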

Remark 9.3.1 For any continuously differentiable and radially unbounded function
V (x), since

(∇V (x))T F̄(x, u) = (∇V (x))T F(x, u) + (∇V (x))T Δf (x), (9.3.15)

we can easily find that the bounded function (9.3.13) satisfies (9.3.6). Note that
Lemma 9.3.1 seems only to imply (9.3.6), but in fact, it presents a specific form of
Γ (x) satisfying (9.3.6), (9.3.7), and (9.3.8). The reason is that (9.3.7) and (9.3.8) are
implicit assumptions of Theorem 9.3.1, noticing the framework of the generalized
HJB equation [4] and the fact that (∇V (x))T F(x, u) + Γ (x) = −U (x, u) < 0 when
x ≠ 0. Hence, it can be used for problem transformation. In fact, based on (9.3.6)
and (9.3.15), we can find that the positive semidefinite bounded function Γ (x) gives
an upper bound of the term (∇V (x))T Δf (x), which facilitates us to solve the opti-
mal robust guaranteed cost control problem of a class of nonlinear systems with
uncertainties.

Remark 9.3.2 It is important to note that Theorem 9.3.1 indicates the existence of
the guaranteed cost of the uncertain nonlinear system (9.3.1). In addition, in order to
derive the optimal guaranteed cost controller, we should minimize the upper bound
J (x0 , u) with respect to u. Therefore, we should solve the optimal control problem
of system (9.3.2) with V (x0 ) considered as the value function.

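For system (9.3.2), observing that
\[
\begin{aligned}
V(x_0) &= \int_0^{\infty} \big( U(x, u) + \Gamma(x) \big)\, d\tau \\
&= \int_0^{T} \big( U(x, u) + \Gamma(x) \big)\, d\tau + V(x(T)),
\end{aligned} \tag{9.3.16}
\]
we have
\[
\lim_{T \to 0} \frac{1}{T} \left[ V(x(T)) - V(x_0) + \int_0^{T} \big( U(x, u) + \Gamma(x) \big)\, d\tau \right] = 0. \tag{9.3.17}
\]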

Clearly, (9.3.17) is equivalent to (9.3.8). Hence, (9.3.8) is an infinitesimal version
of the modified value function (9.3.16) and is the so-called nonlinear Lyapunov
equation.
For system (9.3.2) with modified value function (9.3.16), define the Hamiltonian
of the optimal control problem as

H (x, u, ∇V (x)) = U (x, u) + (∇V (x))T F(x, u) + Γ (x). (9.3.18)

Define the optimal cost function of system (9.3.2) as
\[
V^{*}(x_0) = \min_{u \in \mathcal{A}(\Omega)} J(x_0, u),
\]
where J (x0 , u) is given in (9.3.10). Note that V ∗ (x) satisfies the modified HJB
equation
\[
0 = \min_{u \in \mathcal{A}(\Omega)} H(x, u, \nabla V^{*}(x)), \tag{9.3.19}
\]

where ∇V ∗ (x) = ∂ V ∗ (x)/∂ x. Assume that the minimum on the right-hand side of
(9.3.19) exists and is unique. Then, the optimal control of system (9.3.2) is

\[
\begin{aligned}
u^{*}(x) &= \arg\min_{u \in \mathcal{A}(\Omega)} H(x, u, \nabla V^{*}(x)) \\
&= -\frac{1}{2} R^{-1} g^{T}(x) \nabla V^{*}(x).
\end{aligned} \tag{9.3.20}
\]
Hence, the modified HJB equation becomes

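\[
\begin{aligned}
0 = {} & U(x, u^{*}) + (\nabla V^{*}(x))^{T} F(x, u^{*}) + h^{T}(\varphi(x))\, h(\varphi(x)) \\
& + \frac{1}{4} (\nabla V^{*}(x))^{T} G(x) G^{T}(x) \nabla V^{*}(x)
\end{aligned} \tag{9.3.21}
\]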
with V ∗ (0) = 0. Substituting (9.3.20) into (9.3.21), we can obtain the formulation
of the modified HJB equation in terms of ∇V ∗ (x) as follows:

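\[
\begin{aligned}
0 = {} & Q(x) + (\nabla V^{*}(x))^{T} f(x) + h^{T}(\varphi(x))\, h(\varphi(x)) \\
& - \frac{1}{4} (\nabla V^{*}(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) \\
& + \frac{1}{4} (\nabla V^{*}(x))^{T} G(x) G^{T}(x) \nabla V^{*}(x)
\end{aligned} \tag{9.3.22}
\]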
with V ∗ (0) = 0.
Now, we give the following assumption, which is helpful to derive the optimal
control with regard to system (9.3.2) and prove the stability of the closed-loop system.

Assumption 9.3.2 Consider system (9.3.2) with value function (9.3.16) and the
optimal feedback control function (9.3.20). Let L s (x) be a continuously differentiable
Lyapunov function candidate formed as a polynomial and satisfying

L˙s (x) = (∇ L s (x))T ẋ(9.3.2) = (∇ L s (x))T ( f (x) + g(x)u ∗ ) < 0, (9.3.23)

where ∇ L s (x) = ∂ L s (x)/∂ x. Assume there exists a positive definite matrix Λ(x)
such that the following relation holds:

(∇ L s (x))T ( f (x) + g(x)u ∗ ) = −(∇ L s (x))T Λ(x)∇ L s (x).

Remark 9.3.3 This is a common assumption that has been used in the literature, for
instance [7, 36, 60], to facilitate discussing the stability issue of closed-loop system.
According to [7], we assume that the closed-loop dynamics with optimal control is
bounded by a function of system state on the compact set of this section. Without
loss of generality, we assume that
\[
\| f(x) + g(x) u^{*} \| \le \eta \| \nabla L_s(x) \|
\]
with η > 0. Hence, we can further obtain
\[
(\nabla L_s(x))^{T} \big( f(x) + g(x) u^{*} \big) \le \eta \| \nabla L_s(x) \|^2 .
\]
Let λm and λM be the minimum and maximum eigenvalues of matrix Λ(x). Then,
\[
\lambda_m \| \nabla L_s(x) \|^2 \le (\nabla L_s(x))^{T} \Lambda(x) \nabla L_s(x) \le \lambda_M \| \nabla L_s(x) \|^2 . \tag{9.3.24}
\]

Therefore, by noticing (9.3.23) and (9.3.24), we can conclude that the Assump-
tion 9.3.2 is reasonable. Specifically, in this section, L s (x) can be obtained by prop-
erly selecting a polynomial when implementing the ADP method.

The following theorem illustrates how to develop the optimal robust guaranteed
cost control scheme for system (9.3.1).

Theorem 9.3.2 Consider system (9.3.1) with cost function (9.3.5). Suppose the mod-
ified HJB equation (9.3.22) has a continuously differentiable solution V ∗ (x). Then,
for any admissible control function u, the cost function (9.3.5) satisfies

J¯(x0 , u) ≤ Φ(u),

where
\[
\Phi(u) \triangleq V^{*}(x_0) + \int_0^{\infty} (u - u^{*})^{T} R (u - u^{*})\, d\tau .
\]

Moreover, the optimal robust guaranteed cost of the controlled uncertain nonlinear
system is given by

Φ ∗ = Φ(u ∗ ) = V ∗ (x0 ).

Accordingly, the optimal robust guaranteed cost control is given by ū ∗ = u ∗ .


Proof For any admissible control function u(x), the cost function (9.3.5) can be
written in the following form:
\[
\bar{J}(x_0, u) = V^{*}(x_0) + \int_0^{\infty} \big( U(x, u) + \dot{V}^{*}(x) \big)\, d\tau . \tag{9.3.25}
\]

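Along the closed-loop trajectories of system (9.3.1) and according to (9.3.22), we
find that
\[
\begin{aligned}
U(x, u) + \dot{V}^{*}_{(9.3.1)}(x)
= {} & Q(x) + u^{T} R u + (\nabla V^{*}(x))^{T} \big( f(x) + g(x) u + \Delta f(x) \big) \\
= {} & u^{T} R u + (\nabla V^{*}(x))^{T} \big( g(x) u + \Delta f(x) \big) \\
& - h^{T}(\varphi(x)) h(\varphi(x)) - \frac{1}{4} (\nabla V^{*}(x))^{T} G(x) G^{T}(x) \nabla V^{*}(x) \\
& + \frac{1}{4} (\nabla V^{*}(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V^{*}(x).
\end{aligned} \tag{9.3.26}
\]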
For the optimal cost function V ∗ (x), in light of Lemma 9.3.1, the following inequality
holds:

\[
(\nabla V^{*}(x))^{T} \Delta f(x) \le h^{T}(\varphi(x)) h(\varphi(x)) + \frac{1}{4} (\nabla V^{*}(x))^{T} G(x) G^{T}(x) \nabla V^{*}(x). \tag{9.3.27}
\]
Substituting (9.3.27) into (9.3.26), we can further obtain

\[
U(x, u) + \dot{V}^{*}(x) \le u^{T} R u + (\nabla V^{*}(x))^{T} g(x) u + \frac{1}{4} (\nabla V^{*}(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V^{*}(x). \tag{9.3.28}
\]
Considering the expression of the optimal control in (9.3.20), (9.3.28) is in fact

U (x, u) + V̇ ∗ (x) ≤ (u − u ∗ )TR(u − u ∗ ). (9.3.29)
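The step from (9.3.28) to (9.3.29) is a completion of squares. The following worked expansion is a sketch that uses only u∗ = −½R⁻¹gT(x)∇V∗(x) from (9.3.20) and the symmetry of R:

```latex
\[
\begin{aligned}
(u - u^{*})^{T} R (u - u^{*})
&= u^{T} R u - 2 u^{T} R u^{*} + u^{*T} R u^{*} \\
&= u^{T} R u + u^{T} g^{T}(x) \nabla V^{*}(x)
 + \frac{1}{4} (\nabla V^{*}(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V^{*}(x) \\
&= u^{T} R u + (\nabla V^{*}(x))^{T} g(x) u
 + \frac{1}{4} (\nabla V^{*}(x))^{T} g(x) R^{-1} g^{T}(x) \nabla V^{*}(x).
\end{aligned}
\]
```

The last line is the right-hand side of (9.3.28), so the bound can be written compactly as in (9.3.29).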

Thus, combining (9.3.25) with (9.3.29), we can find
\[
\bar{J}(x_0, u) \le V^{*}(x_0) + \int_0^{\infty} (u - u^{*})^{T} R (u - u^{*})\, d\tau .
\]

Clearly, the optimal robust guaranteed cost can be obtained when setting u = u ∗ ,
i.e., Φ(u ∗ ) = V ∗ (x0 ). Furthermore, we can derive that

\[
\Phi^{*} = \inf_{u} \Phi(u) = V^{*}(x_0)
\]
and
\[
\bar{u}^{*} = \arg\inf_{u} \Phi(u) = u^{*} .
\]

This completes the proof.

Remark 9.3.4 According to Theorem 9.3.2, the optimal robust guaranteed cost con-
trol of uncertain nonlinear system is transformed into the optimal control of nominal
system, where the modified cost function is considered as the upper bound function. In
other words, once the solution of the modified HJB equation (9.3.22) corresponding
to nominal system (9.3.2) is derived, we can establish the optimal robust guaranteed
cost control scheme of system (9.3.1).

9.3.2 Online Solution of Transformed Optimal Control Problem

In this section, inspired by the work of [5, 7, 41], an improved online technique
without utilizing the iterative strategy and an initial stabilizing control is developed
by constructing a single network, namely the critic network. Here, the ADP method
is introduced to the framework of infinite horizon optimal robust guaranteed cost
control of nonlinear systems with uncertainties.
Assume that the value function V (x) is continuously differentiable. According
to the universal approximation property of NNs, V (x) can be reconstructed by a
single-layer NN on a compact set Ω as

V (x) = WcTσc (x) + εc (x), (9.3.30)

where Wc ∈ R^l is the ideal weight, σc (x) ∈ R^l is the activation function, l is the
number of neurons in the hidden layer, and εc (x) is the unknown approximation
error of the NN. Then,
\[
\nabla V(x) = (\nabla\sigma_c(x))^{T} W_c + \nabla\varepsilon_c(x) \tag{9.3.31}
\]
is also unknown, where
\[
\nabla\sigma_c(x) = \frac{\partial \sigma_c(x)}{\partial x}, \qquad \nabla\varepsilon_c(x) = \frac{\partial \varepsilon_c(x)}{\partial x}
\]
are the gradients of the activation function and NN approximation error, respectively.
Based on (9.3.31), the Lyapunov equation (9.3.8) takes the following form:
 
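\[
\begin{aligned}
0 = {} & U(x, u) + \big( W_c^{T} \nabla\sigma_c(x) + (\nabla\varepsilon_c(x))^{T} \big) F(x, u) \\
& + h^{T}(\varphi(x)) h(\varphi(x)) + \frac{1}{4} \big( W_c^{T} \nabla\sigma_c(x) + (\nabla\varepsilon_c(x))^{T} \big) \\
& \times G(x) G^{T}(x) \big( (\nabla\sigma_c(x))^{T} W_c + \nabla\varepsilon_c(x) \big).
\end{aligned}
\]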

Following the framework of [5, 7, 41], we assume that the weight vector Wc , the
gradient ∇σc (x), and the approximation error εc (x) and its derivative ∇εc (x) are all
bounded on a compact set Ω.
Since the ideal weights are unknown, a critic NN can be built in terms of the
estimated weights as
V̂ (x) = ŴcTσc (x) (9.3.32)

to approximate the value function. Under the framework of ADP method, the selec-
tion of the activation function of the critic network is often a natural choice guided
by engineering experience and intuition [1, 4]. Then,

∇ V̂ (x) = (∇σc (x))T Ŵc , (9.3.33)

where ∇ V̂ (x) = ∂ V̂ (x)/∂ x.
According to (9.3.20) and (9.3.31), we derive
\[
u(x) = -\frac{1}{2} R^{-1} g^{T}(x) \big( (\nabla\sigma_c(x))^{T} W_c + \nabla\varepsilon_c(x) \big), \tag{9.3.34}
\]
which, in fact, represents the expression of optimal control u ∗ (x) if the value function
in (9.3.30) is considered as the optimal one V ∗ (x). Besides, in light of (9.3.20) and
(9.3.33), the approximate control function can be given as
\[
\hat{u}(x) = -\frac{1}{2} R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} \hat{W}_c . \tag{9.3.35}
\]
Applying (9.3.35) to system (9.3.2), the closed-loop system dynamics is expressed
as
\[
\dot{x} = f(x) - \frac{1}{2} g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} \hat{W}_c . \tag{9.3.36}
\]
Recalling the definition of the Hamiltonian (9.3.18) and the modified HJB equa-
tion (9.3.19), we can easily obtain

H (x, u ∗ , ∇V ∗ ) = 0.

The NN expressions (9.3.31) and (9.3.34) imply that u ∗ and ∇V ∗ can be formulated
based on the ideal weight of the critic network, i.e., Wc . As a result, the Hamiltonian
becomes
H (x, Wc ) = 0,

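which, specifically, can be written as
\[
\begin{aligned}
H(x, W_c) = {} & Q(x) + W_c^{T} \nabla\sigma_c(x) f(x) \\
& - \frac{1}{4} W_c^{T} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} W_c + h^{T}(\varphi(x)) h(\varphi(x)) \\
& + \frac{1}{4} W_c^{T} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} W_c + e_{cH} \\
= {} & 0,
\end{aligned} \tag{9.3.37}
\]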

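where
\[
\begin{aligned}
e_{cH} = {} & (\nabla\varepsilon_c(x))^{T} f(x) - \frac{1}{2} (\nabla\varepsilon_c(x))^{T} g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} W_c \\
& - \frac{1}{4} (\nabla\varepsilon_c(x))^{T} g(x) R^{-1} g^{T}(x) \nabla\varepsilon_c(x) \\
& + \frac{1}{2} (\nabla\varepsilon_c(x))^{T} G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} W_c \\
& + \frac{1}{4} (\nabla\varepsilon_c(x))^{T} G(x) G^{T}(x) \nabla\varepsilon_c(x).
\end{aligned} \tag{9.3.38}
\]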
In Eq. (9.3.38), ecH denotes the residual error generated due to NN approximation.
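Then, using the estimated weight vector, the approximate Hamiltonian can be
derived as
\[
\begin{aligned}
\hat{H}(x, \hat{W}_c) = {} & Q(x) + \hat{W}_c^{T} \nabla\sigma_c(x) f(x) \\
& - \frac{1}{4} \hat{W}_c^{T} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} \hat{W}_c + h^{T}(\varphi(x)) h(\varphi(x)) \\
& + \frac{1}{4} \hat{W}_c^{T} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} \hat{W}_c .
\end{aligned} \tag{9.3.39}
\]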

Let ec = Ĥ (x, Ŵc ) − H (x, Wc ). Considering (9.3.37), we have ec = Ĥ (x, Ŵc ).
Let the weight estimation error of the critic network be
\[
\tilde{W}_c = W_c - \hat{W}_c . \tag{9.3.40}
\]

Then, based on (9.3.37), (9.3.39), and (9.3.40), we can obtain the formulation of ec
in terms of W̃c as follows:

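\[
\begin{aligned}
e_c = {} & \hat{H}(x, \hat{W}_c) - H(x, W_c) \\
= {} & -\tilde{W}_c^{T} \nabla\sigma_c(x) f(x) - \frac{1}{4} \tilde{W}_c^{T} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} \tilde{W}_c \\
& + \frac{1}{2} \tilde{W}_c^{T} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} W_c \\
& + \frac{1}{4} \tilde{W}_c^{T} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} \tilde{W}_c \\
& - \frac{1}{2} \tilde{W}_c^{T} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} W_c - e_{cH} .
\end{aligned} \tag{9.3.41}
\]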

For training the critic network, it is desired to design Ŵc to minimize the objective
function
\[
E_c = \frac{1}{2} e_c^2 . \tag{9.3.42}
\]
Here, the weights of the critic network are tuned based on the standard steepest
descent algorithm with an additional term introduced to ensure the boundedness of
system state, i.e.,
 
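\[
\dot{\hat{W}}_c = -\alpha_c \left( \frac{\partial E_c}{\partial \hat{W}_c} \right) + \frac{1}{2} \alpha_s \Pi(x, \hat{u}) \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) \nabla L_s(x), \tag{9.3.43}
\]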

where αc > 0 is the learning rate of the critic network, αs > 0 is the learning rate of
the additional term, and L s (x) is the Lyapunov function candidate given in Assump-
tion 9.3.2. In (9.3.43), the Π (x, û) is the additional stabilizing term defined as

\[
\Pi(x, \hat{u}) = \begin{cases} 0, & \text{if } \dot{L}_s(x) = (\nabla L_s(x))^{T} F(x, \hat{u}) < 0, \\ 1, & \text{else.} \end{cases}
\]
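To make the roles of the two terms in (9.3.43) concrete, the sketch below evaluates the update law for a scalar state with a single basis function σc (x) = x². Everything about the example plant (f (x) = x, g = 1, G = 0.5, h(x) = x, Q(x) = x², R = 1, L s (x) = x²/2) is a hypothetical choice for illustration, as are the function and variable names.

```python
def critic_update_rate(x, w, f, g, G, h, Q, R, grad_Ls, alpha_c, alpha_s):
    """Right-hand side of (9.3.43) for a scalar state and sigma_c(x) = x^2.

    Returns (w_dot, pi, u_hat), where pi is the stabilizing switch Pi(x, u_hat).
    """
    ds = 2.0 * x                      # gradient of the single basis function x^2
    fx, gx, Gx = f(x), g(x), G(x)
    a = ds * gx * gx * ds / R         # grad_sigma * g * R^-1 * g^T * grad_sigma^T
    b = ds * Gx * Gx * ds             # grad_sigma * G * G^T * grad_sigma^T
    # Approximate Hamiltonian (9.3.39), which equals e_c
    e_c = Q(x) + w * ds * fx - 0.25 * a * w * w + h(x) ** 2 + 0.25 * b * w * w
    # Gradient of e_c with respect to the weight, as in (9.3.44)
    de_dw = ds * fx - 0.5 * a * w + 0.5 * b * w
    # Approximate control (9.3.35)
    u_hat = -0.5 / R * gx * ds * w
    # Pi = 0 when L_s already decreases along F(x, u_hat), 1 otherwise
    pi = 0.0 if grad_Ls(x) * (fx + gx * u_hat) < 0.0 else 1.0
    w_dot = -alpha_c * e_c * de_dw + 0.5 * alpha_s * pi * ds * gx * gx / R * grad_Ls(x)
    return w_dot, pi, u_hat


# Hypothetical scalar plant used only for illustration
plant = dict(
    f=lambda x: x,          # open-loop unstable drift
    g=lambda x: 1.0,
    G=lambda x: 0.5,
    h=lambda x: x,
    Q=lambda x: x * x,
    R=1.0,
    grad_Ls=lambda x: x,    # L_s(x) = x^2 / 2
    alpha_c=0.5,
    alpha_s=1.0,
)

rate_unstable = critic_update_rate(1.0, 0.0, **plant)   # zero weight: u_hat = 0
rate_stable = critic_update_rate(1.0, 2.0, **plant)     # u_hat = -2x stabilizes x_dot = x + u
rate_origin = critic_update_rate(0.0, 2.0, **plant)
```

With zero weight the loop is unstable and the switch is active (Π = 1), so the extra term pushes the weight in a stabilizing direction; once the weight yields a stabilizing û the switch drops to 0 and only the steepest descent term remains. At x = 0 the update rate is zero, which is the situation discussed in Remark 9.3.5.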

The structural diagram of the implementation process using NN is displayed in
Fig. 9.7.
Next, we will determine the dynamics of the weight estimation error W̃c . According
to (9.3.39), we have

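\[
\begin{aligned}
\frac{\partial e_c}{\partial \hat{W}_c} = {} & \nabla\sigma_c(x) f(x) - \frac{1}{2} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} \hat{W}_c \\
& + \frac{1}{2} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} \hat{W}_c .
\end{aligned} \tag{9.3.44}
\]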
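In light of (9.3.40), (9.3.42), and (9.3.43), the dynamics of the weight estimation
error is
\[
\dot{\tilde{W}}_c = \alpha_c e_c \left( \frac{\partial e_c}{\partial \hat{W}_c} \right) - \frac{1}{2} \alpha_s \Pi(x, \hat{u}) \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) \nabla L_s(x). \tag{9.3.45}
\]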

Fig. 9.7 Structural diagram of NN implementation (solid lines represent the signals and dashed
lines represent the backpropagation paths)

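Then, combining (9.3.40), (9.3.41), and (9.3.44), the error dynamics (9.3.45) becomes
\[
\begin{aligned}
\dot{\tilde{W}}_c = {} & \alpha_c \Big( -\tilde{W}_c^{T} \nabla\sigma_c(x) f(x) - \frac{1}{4} \tilde{W}_c^{T} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} \tilde{W}_c \\
& \quad + \frac{1}{2} \tilde{W}_c^{T} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} W_c \\
& \quad + \frac{1}{4} \tilde{W}_c^{T} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} \tilde{W}_c \\
& \quad - \frac{1}{2} \tilde{W}_c^{T} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} W_c - e_{cH} \Big) \\
& \times \Big( \nabla\sigma_c(x) f(x) - \frac{1}{2} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} W_c \\
& \quad + \frac{1}{2} \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) (\nabla\sigma_c(x))^{T} \tilde{W}_c \\
& \quad + \frac{1}{2} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} W_c \\
& \quad - \frac{1}{2} \nabla\sigma_c(x) G(x) G^{T}(x) (\nabla\sigma_c(x))^{T} \tilde{W}_c \Big) \\
& - \frac{1}{2} \alpha_s \Pi(x, \hat{u}) \nabla\sigma_c(x) g(x) R^{-1} g^{T}(x) \nabla L_s(x).
\end{aligned} \tag{9.3.46}
\]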
For continuous-time uncertain nonlinear systems (9.3.1) satisfying (9.3.3) and
(9.3.4), we summarize the design procedure of optimal robust guaranteed cost control
as follows.

Step 1: Select G(x) and ϕ(x), determine h(ϕ(x)), and conduct the problem trans-
formation based on the bounded function Γ (x) as in (9.3.13).

Step 2: Choose the Lyapunov function candidate L s (x), construct a critic network
as (9.3.32), and set its initial weights to zeros.
Step 3: Solve the transformed optimal control problem via online solution of the
modified HJB equation, using the expressions of approximate control func-
tion (9.3.35), approximate Hamiltonian (9.3.39), and weights update crite-
rion (9.3.43).
Step 4: Derive the optimal robust guaranteed cost and optimal robust guaranteed
cost control of original uncertain nonlinear system based on the converged
weights of critic network.

Remark 9.3.5 It is observed from (9.3.32) and (9.3.39) that both the approximate value
function and the approximate Hamiltonian become zero when x = 0. In this case,
we can find that Ŵ˙ c = 0. Thus, when the system state converges to zero, the weights
of the critic network are no longer updated. This can be viewed as a PE requirement
of the NN inputs. In other words, the system state must be persistently exciting
long enough in order to ensure the critic network to learn the optimal value function
as accurately as possible. In this chapter, the PE condition is satisfied by adding a
small exploratory signal to the control input. The condition can be removed once the
weights of the critic network converge to their target values. Actually, it is for this
reason that there always exists a trade-off between computational accuracy and time
consumption for practical realization.
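
The observation in Remark 9.3.5 can be checked numerically. The sketch below uses the data of Example 9.3.1 (quadratic basis, $R = 1$, $L_s(x) = (1/2)x^{T}x$). The tuning law (9.3.43) and the Hamiltonian (9.3.39) are not reproduced in this part of the text, so the forms coded here are assumptions inferred from (9.3.45) and (9.3.46) with $\tilde{W}_c = W_c - \hat{W}_c$; all function names are hypothetical.

```python
import math

# Data of Example 9.3.1: x' = A x + g u + uncertainty, with G = [1, 0]^T.
A = [[-1.0, -2.0], [1.0, -4.0]]
g = [1.0, -3.0]
G = [1.0, 0.0]

def grad_sigma(x):
    """Jacobian (3x2) of the basis sigma_c(x) = [x1^2, x1*x2, x2^2]."""
    x1, x2 = x
    return [[2 * x1, 0.0], [x2, x1], [0.0, 2 * x2]]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def critic_rate(w_hat, x, alpha_c=0.8, alpha_s=0.5):
    """Assumed tuning law (inferred, not quoted from the text):
    dW_hat/dt = -alpha_c * e_c * de_c/dW_hat
                + 0.5 * alpha_s * Pi * grad_sigma * g * g^T * grad_Ls  (R = 1)."""
    ds = grad_sigma(x)
    f = matvec(A, x)
    sf, sg, sG = matvec(ds, f), matvec(ds, g), matvec(ds, G)
    a, b = dot(sg, w_hat), dot(sG, w_hat)
    h = 0.5 * x[0] * math.sin(x[1])
    # Approximate Hamiltonian and its gradient with respect to the weights.
    e_c = dot(x, x) + h * h + dot(w_hat, sf) - 0.25 * a * a + 0.25 * b * b
    de = [sf[i] - 0.5 * sg[i] * a + 0.5 * sG[i] * b for i in range(3)]
    u_hat = -0.5 * a
    x_dot = [f[i] + g[i] * u_hat for i in range(2)]
    grad_ls = list(x)  # gradient of L_s(x) = 0.5 * x^T x
    pi = 0.0 if dot(grad_ls, x_dot) < 0 else 1.0
    gl = dot(g, grad_ls)
    return [-alpha_c * e_c * de[i] + 0.5 * alpha_s * pi * sg[i] * gl
            for i in range(3)]

rate_at_origin = critic_rate([0.3, -0.1, 0.1], [0.0, 0.0])
rate_at_x0 = critic_rate([0.0, 0.0, 0.0], [1.0, -1.0])
print(rate_at_origin, rate_at_x0)
```

At the origin every term of the assumed law vanishes for any weight vector, while away from the origin the update is nonzero, which is why an exploratory signal is needed to keep the critic learning.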

Next, the stability analysis of the NN-based feedback control system is presented
using the Lyapunov theory.

9.3.3 Stability Analysis of Closed-Loop System

In this section, the error dynamics of the critic network and the closed-loop system
based on the approximate optimal control are proved to be UUB.

Theorem 9.3.3 Consider the nonlinear system given by (9.3.2). Let the control input
be provided by (9.3.35) and the weights of the critic network be tuned by (9.3.43).
Then, the state x of the closed-loop system and the weight estimation error W̃c of the
critic network are UUB.

Proof We choose the following Lyapunov function candidate:

$$
L(x) = \frac{1}{2\alpha_c}\tilde{W}_c^{T}\tilde{W}_c + \frac{\alpha_s}{\alpha_c}L_s(x), \tag{9.3.47}
$$

where $L_s(x)$ is presented in Assumption 9.3.2. The derivative of the Lyapunov function
candidate (9.3.47) with respect to time along the solutions of (9.3.36) and (9.3.46)
is

$$
\dot{L}(x) = \frac{1}{\alpha_c}\tilde{W}_c^{T}\dot{\tilde{W}}_c + \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\dot{x}. \tag{9.3.48}
$$

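Substituting (9.3.36) and (9.3.46) into (9.3.48), we obtain

$$
\begin{aligned}
\dot{L}(x) ={}& \tilde{W}_c^{T}\Big(-\tilde{W}_c^{T}\nabla\sigma_c(x)f(x) - \frac{1}{4}\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}\tilde{W}_c \\
&\quad + \frac{1}{2}\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}W_c \\
&\quad + \frac{1}{4}\tilde{W}_c^{T}\nabla\sigma_c(x)G(x)G^{T}(x)(\nabla\sigma_c(x))^{T}\tilde{W}_c \\
&\quad - \frac{1}{2}\tilde{W}_c^{T}\nabla\sigma_c(x)G(x)G^{T}(x)(\nabla\sigma_c(x))^{T}W_c - e_{cH}\Big) \\
&\times\Big(\nabla\sigma_c(x)f(x) - \frac{1}{2}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}W_c \\
&\quad + \frac{1}{2}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}\tilde{W}_c \\
&\quad + \frac{1}{2}\nabla\sigma_c(x)G(x)G^{T}(x)(\nabla\sigma_c(x))^{T}W_c \\
&\quad - \frac{1}{2}\nabla\sigma_c(x)G(x)G^{T}(x)(\nabla\sigma_c(x))^{T}\tilde{W}_c\Big) \\
&- \frac{\alpha_s}{2\alpha_c}\Pi(x,\hat{u})\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla L_s(x) + \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\dot{x}.
\end{aligned}
\tag{9.3.49}
$$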

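For simplicity, we denote

$$
A = \nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}, \qquad
B = \nabla\sigma_c(x)G(x)G^{T}(x)(\nabla\sigma_c(x))^{T}.
$$

Then, (9.3.49) becomes

$$
\begin{aligned}
\dot{L}(x) ={}& -\Big(\tilde{W}_c^{T}\nabla\sigma_c(x)f(x) + \frac{1}{4}\tilde{W}_c^{T}A\tilde{W}_c - \frac{1}{2}\tilde{W}_c^{T}AW_c - \frac{1}{4}\tilde{W}_c^{T}B\tilde{W}_c + \frac{1}{2}\tilde{W}_c^{T}BW_c + e_{cH}\Big) \\
&\times\Big(\tilde{W}_c^{T}\nabla\sigma_c(x)f(x) + \frac{1}{2}\tilde{W}_c^{T}A\tilde{W}_c - \frac{1}{2}\tilde{W}_c^{T}AW_c - \frac{1}{2}\tilde{W}_c^{T}B\tilde{W}_c + \frac{1}{2}\tilde{W}_c^{T}BW_c\Big) \\
&- \frac{\alpha_s}{2\alpha_c}\Pi(x,\hat{u})\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla L_s(x) + \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\dot{x}.
\end{aligned}
$$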
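
Considering (9.3.36), we have

$$
\begin{aligned}
\dot{L}(x) ={}& -\Big(\tilde{W}_c^{T}\nabla\sigma_c(x)\dot{x} - \frac{1}{4}\tilde{W}_c^{T}A\tilde{W}_c - \frac{1}{4}\tilde{W}_c^{T}B\tilde{W}_c + \frac{1}{2}\tilde{W}_c^{T}BW_c + e_{cH}\Big) \\
&\times\Big(\tilde{W}_c^{T}\nabla\sigma_c(x)\dot{x} - \frac{1}{2}\tilde{W}_c^{T}B\tilde{W}_c + \frac{1}{2}\tilde{W}_c^{T}BW_c\Big) \\
&- \frac{\alpha_s}{2\alpha_c}\Pi(x,\hat{u})\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla L_s(x) + \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\dot{x}.
\end{aligned}
$$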

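Using the formulations

$$
\dot{x}^{*} = f(x) + g(x)u^{*}
$$

and

$$
\dot{x} = f(x) + g(x)\hat{u},
$$

we can obtain the relationship between $\dot{x}^{*}$ and $\dot{x}$ as

$$
\dot{x} = \dot{x}^{*} + \frac{1}{2}g(x)R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}\tilde{W}_c + \frac{1}{2}g(x)R^{-1}g^{T}(x)\nabla\varepsilon_c(x).
$$

Then, we have

$$
\begin{aligned}
\dot{L}(x) ={}& -\Big(\tilde{W}_c^{T}\nabla\sigma_c(x)\dot{x}^{*} + \frac{1}{4}\tilde{W}_c^{T}A\tilde{W}_c + \frac{1}{2}\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla\varepsilon_c(x) \\
&\quad - \frac{1}{4}\tilde{W}_c^{T}B\tilde{W}_c + \frac{1}{2}\tilde{W}_c^{T}BW_c + e_{cH}\Big) \\
&\times\Big(\tilde{W}_c^{T}\nabla\sigma_c(x)\dot{x}^{*} + \frac{1}{2}\tilde{W}_c^{T}A\tilde{W}_c + \frac{1}{2}\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla\varepsilon_c(x) \\
&\quad - \frac{1}{2}\tilde{W}_c^{T}B\tilde{W}_c + \frac{1}{2}\tilde{W}_c^{T}BW_c\Big) \\
&- \frac{\alpha_s}{2\alpha_c}\Pi(x,\hat{u})\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla L_s(x) + \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\dot{x}.
\end{aligned}
\tag{9.3.50}
$$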

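As in the work of [7], we assume that $\lambda_{1m} > 0$ and $\lambda_{1M} > 0$ are the lower
and upper bounds of the norm of matrix $A$. Similarly, assume that $\lambda_{2m} > 0$ and
$\lambda_{2M} > 0$ are the lower and upper bounds of the norm of matrix $B$. Assume that
$\|R^{-1}\| \le R_M^{-1}$, $\|g(x)\| \le g_M$, $\|\nabla\sigma_c(x)\| \le \sigma_{dM}$, $\|BW_c\| \le \lambda_4$,
$\|\nabla\varepsilon_c(x)\| \le \lambda_{10}$, and $\|e_{cH}\| \le \lambda_{12}$, where $R_M^{-1}$, $g_M$, $\sigma_{dM}$, $\lambda_4$,
$\lambda_{10}$, and $\lambda_{12}$ are positive constants. In addition, assume that
$\|\nabla\sigma_c(x)\dot{x}^{*}\| \le \lambda_3$, where $\lambda_3$ is a positive constant. Let
$\lambda_5 = (\sqrt{6}/2)\lambda_{12}$, $\lambda_9 = g_M^{2}R_M^{-1}$, and $\lambda_{11} = \sigma_{dM}g_M^{2}R_M^{-1}\lambda_{10}$; then
$\|g(x)R^{-1}g^{T}(x)\| \le \lambda_9$ and $\|\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla\varepsilon_c(x)\| \le \lambda_{11}$.
Using the relations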
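
$$
ab = \frac{1}{2}\left[-\left(\phi_{+}a - \frac{b}{\phi_{+}}\right)^{2} + \phi_{+}^{2}a^{2} + \frac{b^{2}}{\phi_{+}^{2}}\right],
$$

$$
-ab = -\frac{1}{2}\left[\left(\phi_{-}a + \frac{b}{\phi_{-}}\right)^{2} - \phi_{-}^{2}a^{2} - \frac{b^{2}}{\phi_{-}^{2}}\right],
$$

we have

$$
\begin{aligned}
-\frac{3}{4}\big(\tilde{W}_c^{T}\nabla\sigma_c(x)\dot{x}^{*}\big)\big(\tilde{W}_c^{T}A\tilde{W}_c\big)
&= -\frac{3}{8}\left[\left(\phi_1\tilde{W}_c^{T}\nabla\sigma_c(x)\dot{x}^{*} + \frac{\tilde{W}_c^{T}A\tilde{W}_c}{\phi_1}\right)^{2} - \phi_1^{2}\big(\tilde{W}_c^{T}\nabla\sigma_c(x)\dot{x}^{*}\big)^{2} - \frac{\big(\tilde{W}_c^{T}A\tilde{W}_c\big)^{2}}{\phi_1^{2}}\right] \\
&\le \frac{3}{8}\left[\phi_1^{2}\big(\tilde{W}_c^{T}\nabla\sigma_c(x)\dot{x}^{*}\big)^{2} + \frac{\big(\tilde{W}_c^{T}A\tilde{W}_c\big)^{2}}{\phi_1^{2}}\right] \\
&\le \frac{3}{8\phi_1^{2}}\lambda_{1M}^{2}\|\tilde{W}_c\|^{4} + \frac{3}{8}\phi_1^{2}\lambda_3^{2}\|\tilde{W}_c\|^{2},
\end{aligned}
$$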

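where $\phi_{+}$, $\phi_{-}$, and $\phi_1$ are nonzero constants. Other terms in the expression of (9.3.50)
can be handled the same way. Then, we can find that

$$
\begin{aligned}
\dot{L}(x) \le{}& -\lambda_7\|\tilde{W}_c\|^{4} + \lambda_8\|\tilde{W}_c\|^{2} + \lambda_5^{2} \\
&- \frac{\alpha_s}{2\alpha_c}\Pi(x,\hat{u})\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla L_s(x) + \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\dot{x},
\end{aligned}
\tag{9.3.51}
$$

where

$$
\begin{aligned}
\lambda_7 ={}& \frac{1}{8}\lambda_{1m}^{2} + \frac{1}{8}\lambda_{2m}^{2} - \frac{3}{8\phi_1^{2}}\lambda_{1M}^{2} - \frac{3}{8\phi_2^{2}}\lambda_{2M}^{2} \\
&- \frac{3}{16}\phi_3^{2}\lambda_{1M}^{2} - \frac{3}{16}\phi_4^{2}\lambda_{1M}^{2} - \frac{3}{16}\phi_5^{2}\lambda_{11}^{2} - \frac{3}{16}\phi_6^{2}\lambda_{2M}^{2}, \\
\lambda_8 ={}& \frac{3}{8}\phi_1^{2}\lambda_3^{2} + \frac{3}{8}\phi_2^{2}\lambda_3^{2} + \frac{3}{16\phi_3^{2}}\lambda_{11}^{2} + \frac{3}{16\phi_4^{2}}\lambda_4^{2} \\
&+ \frac{3}{16\phi_5^{2}}\lambda_{11}^{2} + \frac{3}{16\phi_6^{2}}\lambda_4^{2},
\end{aligned}
$$

and $\phi_i$, $i = 1, 2, \ldots, 6$, are nonzero design constants. Note that with proper choices
of $\phi_i$, $i = 1, 2, \ldots, 6$, the relation $\lambda_7 > 0$ can be guaranteed.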
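
To make the role of the design constants concrete, the following sketch evaluates $\lambda_7$ and $\lambda_8$ from the definitions above. The numerical bounds are hypothetical values chosen only for illustration (they are not taken from the text), and the function names are likewise assumptions.

```python
def lambda7(l1m, l2m, l1M, l2M, l11, phi):
    """lambda_7 as defined after (9.3.51); phi = (phi_1, ..., phi_6)."""
    p1, p2, p3, p4, p5, p6 = (p * p for p in phi)
    return (l1m**2 / 8 + l2m**2 / 8
            - 3 * l1M**2 / (8 * p1) - 3 * l2M**2 / (8 * p2)
            - 3 * p3 * l1M**2 / 16 - 3 * p4 * l1M**2 / 16
            - 3 * p5 * l11**2 / 16 - 3 * p6 * l2M**2 / 16)

def lambda8(l3, l4, l11, phi):
    """lambda_8 as defined after (9.3.51)."""
    p1, p2, p3, p4, p5, p6 = (p * p for p in phi)
    return (3 * p1 * l3**2 / 8 + 3 * p2 * l3**2 / 8
            + 3 * l11**2 / (16 * p3) + 3 * l4**2 / (16 * p4)
            + 3 * l11**2 / (16 * p5) + 3 * l4**2 / (16 * p6))

# Hypothetical bounds (illustration only).
bounds = dict(l1m=1.0, l2m=1.0, l1M=2.0, l2M=2.0, l11=1.0)

phi_naive = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
phi_tuned = (10.0, 10.0, 0.1, 0.1, 0.1, 0.1)  # large phi_1, phi_2; small phi_3..phi_6

l7_naive = lambda7(phi=phi_naive, **bounds)
l7_tuned = lambda7(phi=phi_tuned, **bounds)
l8_tuned = lambda8(l3=1.0, l4=1.0, l11=1.0, phi=phi_tuned)
print(l7_naive, l7_tuned, l8_tuned)
```

With these sample numbers the naive choice gives a negative $\lambda_7$, while large $\phi_1, \phi_2$ and small $\phi_3, \ldots, \phi_6$ make it positive. The price is a larger $\lambda_8$, and hence a larger ultimate bound in the analysis that follows.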
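We will consider next the cases of $\Pi(x,\hat{u}) = 0$ and $\Pi(x,\hat{u}) = 1$, respectively.
Case 1: $\Pi(x,\hat{u}) = 0$. Since $(\nabla L_s(x))^{T}\dot{x} < 0$, we have $-(\nabla L_s(x))^{T}\dot{x} > 0$.
According to the density property of real numbers [38], there exists a positive constant
$\lambda_6$ such that $0 < \lambda_6\|\nabla L_s(x)\| \le -(\nabla L_s(x))^{T}\dot{x}$ holds for all $x \in \Omega$, i.e.,

$$
(\nabla L_s(x))^{T}\dot{x} \le -\lambda_6\|\nabla L_s(x)\|.
$$

Hence, the inequality (9.3.51) becomes

$$
\begin{aligned}
\dot{L}(x) &\le -\lambda_7\|\tilde{W}_c\|^{4} + \lambda_8\|\tilde{W}_c\|^{2} + \lambda_5^{2} + \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\dot{x} \\
&\le -\lambda_7\|\tilde{W}_c\|^{4} + \lambda_8\|\tilde{W}_c\|^{2} + \lambda_5^{2} - \frac{\alpha_s}{\alpha_c}\lambda_6\|\nabla L_s(x)\|.
\end{aligned}
$$

Therefore, given that the following inequality:

$$
\|\tilde{W}_c\| \ge \sqrt{\frac{\lambda_8 + \sqrt{4\lambda_5^{2}\lambda_7 + \lambda_8^{2}}}{2\lambda_7}} \triangleq A_1
$$

or

$$
\|\nabla L_s(x)\| \ge \frac{\alpha_c\big(4\lambda_5^{2}\lambda_7 + \lambda_8^{2}\big)}{4\alpha_s\lambda_6\lambda_7} \triangleq B_1 \tag{9.3.52}
$$

holds, we conclude $\dot{L}(x) < 0$. Note that (9.3.52) is derived from

$$
\frac{\alpha_s}{\alpha_c}\lambda_6\|\nabla L_s(x)\| \ge \max_{\|\tilde{W}_c\|^{2}}\left\{-\lambda_7\|\tilde{W}_c\|^{4} + \lambda_8\|\tilde{W}_c\|^{2} + \lambda_5^{2}\right\} = \lambda_5^{2} + \frac{\lambda_8^{2}}{4\lambda_7}.
$$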

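Case 2: $\Pi(x,\hat{u}) = 1$. Subtracting and adding

$$
\frac{\alpha_s}{2\alpha_c}(\nabla L_s(x))^{T}g(x)R^{-1}g^{T}(x)\nabla\varepsilon_c(x)
$$

to the right-hand side of (9.3.51) and taking Assumption 9.3.2 into consideration
yield

$$
\begin{aligned}
\dot{L}(x) \le{}& -\lambda_7\|\tilde{W}_c\|^{4} + \lambda_8\|\tilde{W}_c\|^{2} + \lambda_5^{2} \\
&- \frac{\alpha_s}{2\alpha_c}\tilde{W}_c^{T}\nabla\sigma_c(x)g(x)R^{-1}g^{T}(x)\nabla L_s(x) + \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\big(f(x) + g(x)\hat{u}\big) \\
={}& -\lambda_7\|\tilde{W}_c\|^{4} + \lambda_8\|\tilde{W}_c\|^{2} + \lambda_5^{2} \\
&+ \frac{\alpha_s}{\alpha_c}(\nabla L_s(x))^{T}\big(f(x) + g(x)u^{*}\big) + \frac{\alpha_s}{2\alpha_c}(\nabla L_s(x))^{T}g(x)R^{-1}g^{T}(x)\nabla\varepsilon_c(x) \\
\le{}& -\lambda_7\|\tilde{W}_c\|^{4} + \lambda_8\|\tilde{W}_c\|^{2} + \lambda_5^{2} - \frac{\alpha_s}{\alpha_c}\lambda_m\|\nabla L_s(x)\|^{2} + \frac{\alpha_s}{2\alpha_c}\lambda_9\lambda_{10}\|\nabla L_s(x)\| \\
={}& -\lambda_7\|\tilde{W}_c\|^{4} + \lambda_8\|\tilde{W}_c\|^{2} + \lambda_5^{2} - \frac{\alpha_s}{\alpha_c}\lambda_m\left[\left(\|\nabla L_s(x)\| - \frac{\lambda_9\lambda_{10}}{4\lambda_m}\right)^{2} - \frac{\lambda_9^{2}\lambda_{10}^{2}}{16\lambda_m^{2}}\right].
\end{aligned}
$$

Therefore, given that the following inequality:

$$
\|\tilde{W}_c\| \ge \sqrt{\frac{\lambda_8}{2\lambda_7} + \sqrt{\frac{\lambda_5^{2}}{\lambda_7} + \frac{\lambda_8^{2}}{4\lambda_7^{2}} + \frac{\alpha_s\lambda_9^{2}\lambda_{10}^{2}}{16\alpha_c\lambda_m\lambda_7}}} \triangleq A_2
$$

or

$$
\|\nabla L_s(x)\| \ge \frac{\lambda_9\lambda_{10}}{4\lambda_m} + \sqrt{\frac{\alpha_c\big(4\lambda_5^{2}\lambda_7 + \lambda_8^{2}\big)}{4\alpha_s\lambda_m\lambda_7} + \frac{\lambda_9^{2}\lambda_{10}^{2}}{16\lambda_m^{2}}} \triangleq B_2
$$

holds, we obtain $\dot{L}(x) < 0$.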
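
As a sanity check on the four thresholds $A_1$, $B_1$, $A_2$, and $B_2$, the sketch below evaluates them for one set of constants and confirms that each sits exactly on the boundary of the corresponding inequality. All numerical constants are hypothetical and serve only to exercise the formulas; the helper names are assumptions.

```python
import math

def thresholds(l5, l6, l7, l8, l9, l10, lm, alpha_c, alpha_s):
    """Return (A1, B1, A2, B2) from the proof of Theorem 9.3.3."""
    c0 = 4 * l5**2 * l7 + l8**2
    a1 = math.sqrt((l8 + math.sqrt(c0)) / (2 * l7))
    b1 = alpha_c * c0 / (4 * alpha_s * l6 * l7)
    extra = alpha_s * l9**2 * l10**2 / (16 * alpha_c * lm * l7)
    a2 = math.sqrt(l8 / (2 * l7)
                   + math.sqrt(l5**2 / l7 + l8**2 / (4 * l7**2) + extra))
    b2 = (l9 * l10 / (4 * lm)
          + math.sqrt(alpha_c * c0 / (4 * alpha_s * lm * l7)
                      + l9**2 * l10**2 / (16 * lm**2)))
    return a1, b1, a2, b2

def weight_part(w_norm, l5, l7, l8):
    """-lambda7*||W||^4 + lambda8*||W||^2 + lambda5^2."""
    return -l7 * w_norm**4 + l8 * w_norm**2 + l5**2

# Hypothetical constants (illustration only).
consts = dict(l5=1.0, l6=1.0, l7=1.0, l8=2.0, l9=1.0, l10=1.0, lm=1.0,
              alpha_c=0.8, alpha_s=0.5)
A1, B1, A2, B2 = thresholds(**consts)

# Case 1: the weight-dependent part vanishes at ||W|| = A1, negative beyond it.
on_boundary = weight_part(A1, 1.0, 1.0, 2.0)
beyond = weight_part(1.01 * A1, 1.0, 1.0, 2.0)
# Case 2: at ||W|| = A2 the weight part cancels the residual constant
# alpha_s * l9^2 * l10^2 / (16 * alpha_c * lm).
residual = 0.5 / (16 * 0.8)
case2_boundary = weight_part(A2, 1.0, 1.0, 2.0) + residual
print(A1, B1, A2, B2)
```

Because the residual term in Case 2 is positive, $A_2 > A_1$ for any admissible constants, so the bound on the weight error is dictated by Case 2.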


To summarize, if the inequality $\|\tilde{W}_c\| > \max(A_1, A_2) \triangleq A$ or $\|\nabla L_s(x)\| >
\max(B_1, B_2) \triangleq B$ holds, then $\dot{L}(x) < 0$. Considering the fact that $L_s(x)$ is chosen
as a polynomial and in accordance with the standard Lyapunov's extension theorem
[13, 15] (or the Lagrange stability result [30]), we can derive the conclusion that the
state $x$ and the weight estimation error $\tilde{W}_c$ are UUB. This completes the proof.

Corollary 9.3.1 The approximate control input $\hat{u}$ in (9.3.35) converges to a
neighborhood of the optimal control input $u^{*}$ with a finite bound.

Proof According to (9.3.34) and (9.3.35), we have

$$
u^{*} - \hat{u} = -\frac{1}{2}R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}\tilde{W}_c - \frac{1}{2}R^{-1}g^{T}(x)\nabla\varepsilon_c(x).
$$

In light of Theorem 9.3.3, we have $\|\tilde{W}_c\| < A$, where $A$ is defined in the proof above.
Then, the terms $R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}\tilde{W}_c$ and $R^{-1}g^{T}(x)\nabla\varepsilon_c(x)$ are both bounded. Thus,
we can further determine

$$
\|u^{*} - \hat{u}\| \le \frac{1}{2}R_M^{-1}g_M\sigma_{dM}A + \frac{1}{2}R_M^{-1}g_M\lambda_{10} \triangleq \varepsilon_u,
$$

where $\lambda_{10}$ is given in the proof of Theorem 9.3.3 and $\varepsilon_u$ is the finite bound. This
completes the proof.

9.3.4 Simulation Studies

In this section, two simulation examples are provided to demonstrate the effectiveness
of the optimal robust guaranteed cost control strategy derived based on the online
HJB solution. We first consider a continuous-time linear system and then a nonlinear
system, both with system uncertainty.

Example 9.3.1 Consider the continuous-time linear system

$$
\dot{x} = \begin{bmatrix} -1 & -2 \\ 1 & -4 \end{bmatrix}x + \begin{bmatrix} 1 \\ -3 \end{bmatrix}u + \Delta f(x),
$$

where $x = [x_1, x_2]^{T}$ and $\Delta f(x) = [p x_1\sin x_2, 0]^{T}$ with $p \in [-0.5, 0.5]$. According
to the form of system uncertainty, we choose $G(x) = [1, 0]^{T}$ and $\varphi(x) = x$. Then,
$d(\varphi(x)) = p x_1\sin x_2$. Besides, we select $h(\varphi(x)) = 0.5x_1\sin x_2$.
In this example, we first choose $Q(x) = x^{T}x$, $R = I$, where $I$ is an identity matrix
with suitable dimension. In order to solve the transformed optimal control problem,
a critic network is constructed to approximate the value function as

$$
\hat{V}(x) = \hat{W}_{c1}x_1^{2} + \hat{W}_{c2}x_1x_2 + \hat{W}_{c3}x_2^{2}.
$$

Let the initial state of the controlled plant be $x_0 = [1, -1]^{T}$. Select the Lyapunov
function candidate of the weights tuning criterion as $L_s(x) = (1/2)x^{T}x$. Let the
learning rate of the critic network and the additional term be $\alpha_c = 0.8$ and $\alpha_s = 0.5$,
respectively. During the NN implementation process, the exploratory signal $N(t)$
given in (9.2.35) is added to the control input to satisfy the PE condition. It is
introduced into the control input and thus affects the system state. After a learning
session, the weights of the critic network converge to $[0.3461, -0.1330, 0.1338]^{T}$
as shown in Fig. 9.8. Here, it is important to note that the initial weights of the critic
network are all set to zero, which implies that no initial stabilizing control is needed
for implementing the control strategy. This can be verified by observing Fig. 9.9,
which displays the updating process of the weight vector during the first 10 s.

Fig. 9.8 Convergence of weight vector of the critic network (ωac1, ωac2, and ωac3 represent
$\hat{W}_{c1}$, $\hat{W}_{c2}$, and $\hat{W}_{c3}$, respectively). Axes: weight of the critic network versus time (s), 0 to 200 s

Fig. 9.9 Updating process of weight vector during the first 10 s (ωac1, ωac2, and ωac3 represent
$\hat{W}_{c1}$, $\hat{W}_{c2}$, and $\hat{W}_{c3}$, respectively). Axes: weight of the critic network versus time (s), 0 to 10 s
Based on the converged weight vector, the optimal robust guaranteed cost of the
controlled system is Φ(u ∗ ) = V ∗ (x0 ) = 0.6129. Next, the scalar parameter p = 0.5
is chosen for evaluating the control performance. Under the action of the obtained
control function, the system trajectory during the first 20 s is presented in Fig. 9.10,
which shows good performance of the control approach.
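
The reported cost can be reproduced directly from the converged weights, and the closed-loop response can be sketched by simulation. The code below is an illustrative reconstruction, not the authors' implementation: it assumes the approximate control has the form $\hat{u} = -\frac{1}{2}R^{-1}g^{T}(x)(\nabla\sigma_c(x))^{T}\hat{W}_c$ with $R = 1$ (consistent with Corollary 9.3.1), and it uses a fixed-step Runge-Kutta integrator chosen here for convenience.

```python
import math

W_HAT = [0.3461, -0.1330, 0.1338]  # converged critic weights reported in the text
A = [[-1.0, -2.0], [1.0, -4.0]]
g = [1.0, -3.0]

def value(x, w=W_HAT):
    """Critic output V_hat(x) = w1*x1^2 + w2*x1*x2 + w3*x2^2."""
    x1, x2 = x
    return w[0] * x1**2 + w[1] * x1 * x2 + w[2] * x2**2

def control(x, w=W_HAT):
    """Assumed form u_hat = -0.5 * g^T * grad V_hat(x), with R = 1."""
    x1, x2 = x
    grad_v = [2 * w[0] * x1 + w[1] * x2, w[1] * x1 + 2 * w[2] * x2]
    return -0.5 * (g[0] * grad_v[0] + g[1] * grad_v[1])

def closed_loop(x, p):
    """Right-hand side of the uncertain system under u_hat."""
    x1, x2 = x
    u = control(x)
    return [A[0][0] * x1 + A[0][1] * x2 + g[0] * u + p * x1 * math.sin(x2),
            A[1][0] * x1 + A[1][1] * x2 + g[1] * u]

def simulate(x0, p, t_end=20.0, dt=0.01):
    """Classical fourth-order Runge-Kutta integration (illustrative choice)."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        k1 = closed_loop(x, p)
        k2 = closed_loop([x[i] + 0.5 * dt * k1[i] for i in range(2)], p)
        k3 = closed_loop([x[i] + 0.5 * dt * k2[i] for i in range(2)], p)
        k4 = closed_loop([x[i] + dt * k3[i] for i in range(2)], p)
        x = [x[i] + dt / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
             for i in range(2)]
    return x

x0 = [1.0, -1.0]
cost = value(x0)               # guaranteed cost V_hat(x0)
x_final = simulate(x0, p=0.5)  # state after 20 s with p = 0.5
print(round(cost, 4), x_final)
```

Evaluating the critic at $x_0$ gives $0.3461 + 0.1330 + 0.1338 = 0.6129$, matching the value quoted above, and under these assumptions the simulated state decays toward the origin within the 20 s window.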
Next, we set Q(x) = 8x Tx, R = 5I , and conduct the NN implementation again by
increasing the learning rates of the critic network and the additional term properly. In
this case, the weights of the critic network converge to [5.4209, −3.5088, 1.2605]T ,
which is depicted in Fig. 9.11. Similarly, the system trajectory during the first 20 s
when choosing p = 0.5 is displayed in Fig. 9.12. These simulation results show that
the parameters Q(x) and R play an important role in the design process. In addition,
the power of the present control technique is demonstrated again.
Fig. 9.10 The system state ($p = 0.5$). Axes: system state ($x_1$, $x_2$) versus time (s), 0 to 20 s

Fig. 9.11 Convergence of weight vector of the critic network (ωac1, ωac2, and ωac3 represent
$\hat{W}_{c1}$, $\hat{W}_{c2}$, and $\hat{W}_{c3}$, respectively). Axes: weight of the critic network versus time (s), 0 to 200 s

Fig. 9.12 The system state ($p = 0.5$). Axes: system state ($x_1$, $x_2$) versus time (s), 0 to 20 s

Example 9.3.2 Consider the following continuous-time nonlinear system:

$$
\dot{x} = \begin{bmatrix} -x_1 + x_2 \\ 0.1x_1 - x_2 - x_1x_3 \\ x_1x_2 - x_3 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}u + \Delta f(x), \tag{9.3.53}
$$

where $x = [x_1, x_2, x_3]^{T}$,

$$
\Delta f(x) = [0, 0, p x_1\sin x_2\cos x_3]^{T},
$$

and $p \in [-1, 1]$. Similarly, if we choose $G(x) = [0, 0, 1]^{T}$ and $\varphi(x) = x$ based on
the form of system uncertainty, then $d(\varphi(x)) = p x_1\sin x_2\cos x_3$. Clearly, we can
select $h(\varphi(x)) = x_1\sin x_2\cos x_3$.
In this example, $Q(x)$ and $R$ are chosen the same as in the first case of Example 9.3.1.
However, the critic network is constructed by using the following equation:

$$
\begin{aligned}
\hat{V}(x) ={}& \hat{W}_{c1}x_1^{2} + \hat{W}_{c2}x_2^{2} + \hat{W}_{c3}x_3^{2} + \hat{W}_{c4}x_1x_2 + \hat{W}_{c5}x_1x_3 + \hat{W}_{c6}x_2x_3 + \hat{W}_{c7}x_1^{4} \\
&+ \hat{W}_{c8}x_2^{4} + \hat{W}_{c9}x_3^{4} + \hat{W}_{c10}x_1^{2}x_2^{2} + \hat{W}_{c11}x_1^{2}x_3^{2} + \hat{W}_{c12}x_2^{2}x_3^{2} + \hat{W}_{c13}x_1^{2}x_2x_3 \\
&+ \hat{W}_{c14}x_1x_2^{2}x_3 + \hat{W}_{c15}x_1x_2x_3^{2} + \hat{W}_{c16}x_1^{3}x_2 + \hat{W}_{c17}x_1^{3}x_3 + \hat{W}_{c18}x_1x_2^{3} \\
&+ \hat{W}_{c19}x_2^{3}x_3 + \hat{W}_{c20}x_1x_3^{3} + \hat{W}_{c21}x_2x_3^{3}.
\end{aligned}
$$

Here, let the initial state of the controlled system be $x_0 = [1, -1, 0.5]^{T}$, and
let the learning rate of the critic network and the additional term be $\alpha_c = 0.3$
and $\alpha_s = 0.5$, respectively. As before, a small exploratory signal $N(t)$ (cf.
(9.2.35)) is added to satisfy the PE condition during the NN implementation process,
and all the elements of the weight vector of the critic network are initialized
to zero. After a sufficient learning session, the weights of the critic network converge
to [0.4759, 0.5663, 0.1552, 0.4214, 0.0911, 0.0375, 0.0886, −0.0099, 0.0986,
0.1539, 0.0780, −0.0192, −0.1335, −0.0052, −0.0639, −0.1583, 0.0456,
0.0576, −0.0535, 0.0885, −0.0227]^T.

Similarly, the optimal robust guaranteed cost of the nonlinear system is $\Phi(u^{*}) =
V^{*}(x_0) = 1.1841$. In this example, the scalar parameter $p = -1$ is chosen for
evaluating the robust control performance. The system trajectory is depicted in Fig. 9.13
when applying the obtained control to system (9.3.53) for 20 s. These simulation
results verify the effectiveness of the present control approach.

Fig. 9.13 The system state ($p = -1$). Axes: system state ($x_1$, $x_2$, $x_3$) versus time (s), 0 to 20 s
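
The quoted cost for Example 9.3.2 can be checked by evaluating the 21-term critic at $x_0$ with the converged weights listed above. This is only an arithmetic check; the function names are hypothetical, and the basis ordering follows the expression for $\hat{V}(x)$ given in this example.

```python
# Converged critic weights reported for Example 9.3.2.
W_HAT = [0.4759, 0.5663, 0.1552, 0.4214, 0.0911, 0.0375, 0.0886,
         -0.0099, 0.0986, 0.1539, 0.0780, -0.0192, -0.1335, -0.0052,
         -0.0639, -0.1583, 0.0456, 0.0576, -0.0535, 0.0885, -0.0227]

def sigma_c(x):
    """Polynomial basis of the critic, in the order used in the text."""
    x1, x2, x3 = x
    return [x1**2, x2**2, x3**2, x1 * x2, x1 * x3, x2 * x3,
            x1**4, x2**4, x3**4,
            x1**2 * x2**2, x1**2 * x3**2, x2**2 * x3**2,
            x1**2 * x2 * x3, x1 * x2**2 * x3, x1 * x2 * x3**2,
            x1**3 * x2, x1**3 * x3, x1 * x2**3,
            x2**3 * x3, x1 * x3**3, x2 * x3**3]

def value(x, w=W_HAT):
    """Critic output V_hat(x) = w^T sigma_c(x)."""
    return sum(wi * si for wi, si in zip(w, sigma_c(x)))

x0 = [1.0, -1.0, 0.5]
cost = value(x0)
print(round(cost, 4))
```

The result agrees with the reported value 1.1841 to the four decimal places given for the weights.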

9.4 Conclusions

In this chapter, a novel strategy is developed to solve the robust control problem of a
class of uncertain nonlinear systems. The robust control problem is transformed into
an optimal control problem with appropriate cost function. The online PI algorithm
is presented to solve the HJB equation by constructing a critic network. Then, a
strategy is developed to derive the optimal guaranteed cost control of uncertain
nonlinear systems. This is accomplished by properly modifying the cost function to
account for system uncertainty, so that the solution of the transformed optimal control
problem also solves the optimal robust guaranteed cost control problem of the original
system. A critic network is constructed to solve the modified HJB equation online.
Several simulation examples are presented to reinforce the theoretical results.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Adhyaru DM, Kar IN, Gopal M (2009) Fixed final time optimal control approach for bounded
robust controller design using Hamilton-Jacobi-Bellman solution. IET Control Theory Appl
3(9):1183–1195
3. Adhyaru DM, Kar IN, Gopal M (2011) Bounded robust control of nonlinear systems using
neural network-based HJB solution. Neural Comput Appl 20(1):91–103
4. Beard RW, Saridis GN, Wen JT (1997) Galerkin approximations of the generalized Hamilton-
Jacobi-Bellman equation. Automatica 33(12):2159–2177
5. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
6. Chang SSL, Peng TKC (1972) Adaptive guaranteed cost control of systems with uncertain
parameters. IEEE Trans Autom Control 17(4):474–483
7. Dierks T, Jagannathan S (2010) Optimal control of affine nonlinear continuous-time systems.
In: Proceedings of the American Control Conference, pp 1568–1573
8. Dierks T, Jagannathan S (2012) Online optimal control of affine nonlinear discrete-time systems
with unknown internal dynamics by using time-based policy update. IEEE Trans Neural Netw
Learn Syst 23(7):1118–1129
9. Haddad WM, Chellaboina V (2008) Nonlinear dynamical systems and control: a Lyapunov-
based approach. Princeton University Press, Princeton
10. Haddad WM, Chellaboina V, Fausz JL (1998) Robust nonlinear feedback control for uncertain
linear systems with nonquadratic performance criteria. Syst Control Lett 33(5):327–338
11. Haddad WM, Chellaboina V, Fausz JL, Leonessa A (2000) Optimal non-linear robust control
for nonlinear uncertain systems. Int J Control 73(4):329–342
12. Heydari A, Balakrishnan SN (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
13. LaSalle JP, Lefschetz S (1967) Stability by Liapunov’s direct method with applications. Aca-
demic Press, New York
14. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
15. Lewis FL, Jagannathan S, Yesildirek A (1999) Neural network control of robot manipulators
and nonlinear systems. Taylor & Francis, London
16. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control:
using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag
32(6):76–105
17. Li H, Liu D (2012) Optimal control for discrete-time affine nonlinear systems using general
value iteration. IET Control Theory Appl 6(18):2725–2736
18. Liang J, Venayagamoorthy GK, Harley RG (2012) Wide-area measurement based dynamic
stochastic optimal power flow control for smart grids with high variability and uncertainty.
IEEE Trans Smart Grid 3(1):59–69
19. Lin F, Brand RD, Sun J (1992) Robust control of nonlinear systems: compensating for uncer-
tainty. Int J Control 56(6):1453–1459
20. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
21. Liu D, Yang X, Li H (2013) Adaptive optimal control for a class of continuous-time affine
nonlinear systems with unknown internal dynamics. Neural Comput Appl 23:1843–1850
22. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342

23. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
24. Liu D, Li H, Wang D (2013) Data-based self-learning optimal control: Research progress and
prospects. Acta Autom Sinica 39(11):1858–1870
25. Liu D, Li H, Wang D (2013) Neural-network-based zero-sum game for discrete-time nonlinear
systems via iterative adaptive dynamic programming algorithm. Neurocomputing 110:92–100
26. Liu D, Li H, Wang D (2014) Online synchronous approximate optimal learning algorithm for
multi-player non-zero-sum games with unknown dynamics. IEEE Trans Syst Man Cybern Syst
44(8):1015–1027
27. Liu D, Wang D, Li H (2014) Decentralized stabilization for a class of continuous-time nonlinear
interconnected systems using online learning optimal control approach. IEEE Trans Neural
Netw Learn Syst 25(2):418–428
28. Liu D, Wang D, Wang FY, Li H, Yang X (2014) Neural-network-based online HJB solution for
optimal robust guaranteed cost control of continuous-time uncertain nonlinear systems. IEEE
Trans Cybern 44(12):2834–2847
29. Mehraeen S, Jagannathan S (2011) Decentralized optimal control of a class of interconnected
nonlinear discrete-time systems by using online Hamilton-Jacobi-Bellman formulation. IEEE
Trans Neural Netw 22(11):1757–1769
30. Michel AN, Hou L, Liu D (2015) Stability of dynamical systems: On the role of monotonic
and non-monotonic Lyapunov functions. Birkhäuser, Boston
31. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw
Learn Syst 24(10):1513–1525
32. Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern Part C Appl Rev 32(2):140–153
33. Nevistic V, Primbs JA (1996) Constrained nonlinear optimal control: a converse HJB approach.
Technical Memorandum No. CIT-CDS, California Institute of Technology, Pasadena, CA, pp
96–021
34. Ni Z, He H, Wen J (2013) Adaptive learning in tracking control based on the dual critic network
design. IEEE Trans Neural Netw Learn Syst 24(6):913–928
35. Ni Z, He H, Wen J, Xu X (2013) Goal representation heuristic dynamic programming on maze
navigation. IEEE Trans Neural Netw Learn Syst 24(12):2038–2050
36. Nodland D, Zargarzadeh H, Jagannathan S (2013) Neural network-based optimal adaptive
output feedback control of a helicopter UAV. IEEE Trans Neural Netw Learn Syst 24(7):1061–
1073
37. Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8(5):997–
1007
38. Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, New York
39. Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
40. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
41. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
42. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
43. Wang D, Liu D (2013) Neuro-optimal control for a class of unknown nonlinear dynamic systems
using SN-DHP technique. Neurocomputing 121:218–225
44. Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of
discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocom-
puting 78(1):14–22
45. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832

46. Wang D, Liu D, Zhao D, Huang Y, Zhang D (2013) A neural-network-based iterative GDHP
approach for solving a class of nonlinear optimal control problems with control constraints.
Neural Comput Appl 22(2):219–227
47. Wang D, Liu D, Li H (2014) Policy iteration algorithm for online design of robust control for
a class of continuous-time nonlinear systems. IEEE Trans Autom Sci Eng 11(2):627–632
48. Wang D, Liu D, Li H, Ma H (2014) Neural-network-based robust optimal control design for a
class of uncertain nonlinear systems via adaptive dynamic programming. Inf Sci 282:167–179
49. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
50. Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon
optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural
Netw 22(1):24–36
51. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. General Syst Yearb 22:25–38
52. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
53. Werbos PJ (2008) ADP: The key direction for future research in intelligent control and under-
standing brain intelligence. IEEE Trans Syst Man Cybern Part B Cybern 38(4):898–900
54. Werbos PJ (2009) Intelligence in the brain: a theory of how it works and how to build it. Neural
Netw 22(3):200–212
55. Wu HN, Luo B (2012) Neural network based online simultaneous policy update algorithm
for solving the HJI equation in nonlinear H∞ control. IEEE Trans Neural Netw Learn Syst
23(12):1884–1895
56. Xu H, Jagannathan S, Lewis FL (2012) Stochastic optimal control of unknown linear networked
control system in the presence of random delays and packet losses. Automatica 48(6):1017–
1030
57. Xu X, Lian C, Zuo L, He H (2014) Kernel-based approximate dynamic programming for
real-time online learning control: an experimental study. IEEE Trans Control Syst Technol
22(1):146–156
58. Yu L, Chu J (1999) An LMI approach to guaranteed cost control of linear uncertain time-delay
systems. Automatica 35(6):1155–1159
59. Yu L, Han QL, Sun MX (2005) Optimal guaranteed cost control of linear uncertain systems
with input constraints. Int J Control Autom Syst 3(3):397–402
60. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
61. Zhang H, Zhang X, Luo Y, Yang J (2013) An overview of research on adaptive dynamic
programming. Acta Autom Sinica 39(4):303–311
Chapter 10
Decentralized Control of Continuous-Time
Interconnected Nonlinear Systems

10.1 Introduction

Various complex systems in social and engineering areas, such as ecosystems,
transportation systems, and power systems, are considered as large-scale systems.
Generally speaking, a large-scale system is comprised of several subsystems with
obvious interconnections, which leads to the increasing difficulty of analysis and
synthesis when using classical centralized control techniques. Therefore, it is neces-
sary to partition the design issue of the overall system into manageable subproblems,
as Bakule pointed out in [2]. Then, the overall plant is no longer controlled by a
single controller but by an array of independent controllers which together repre-
sent a decentralized controller. Consequently, the decentralized control has been a
control of choice for large-scale systems since it is computationally efficient to for-
mulate control laws that use only locally available subsystem states or outputs [20].
Actually, considerable attention has been paid to the decentralized stabilization of
large-scale systems during the last several decades. A decentralized strategy consists
of some noninteracting local controllers corresponding to the isolated subsystems,
not the overall system. Thus, in many situations, the design of isolated subsystems
is a matter of great significance. In [18], it was shown that the decentralized control
of the interconnected system is related to the optimal control of the isolated subsys-
tems. Therefore, the optimal control method can be employed to facilitate the design
process of the decentralized control strategy. However, in that work, the cost func-
tions of the isolated subsystems were not chosen as the general forms, not to mention
that the detailed procedure was not given. For this reason, in this chapter, we will
investigate the decentralized stabilization problem using neural network (NN)-based
optimal learning control approach.
The optimal control of nonlinear systems often leads to solving the Hamilton–
Jacobi–Bellman (HJB) equation instead of the Riccati equation as in the case
of linear systems. Adaptive dynamic programming (ADP) was first proposed by
Werbos as an alternative method to solve optimal control problems
forward-in-time [27, 28]. In recent years, great efforts have been devoted to ADP-related
research in theory and applications [1, 5–8, 10–12, 15, 16, 23–25]. In light of [9, 10,
29], the ADP technique is closely related to reinforcement learning when engaging
in the research of feedback control. Policy iteration (PI) is a fundamental algorithm
for reinforcement learning-based ADP in optimal control. Vrabie and Lewis [24]
derived an integral reinforcement learning method to obtain direct adaptive optimal
control for nonlinear input-affine continuous-time systems with partially unknown
dynamics. Jiang and Jiang [5] presented a novel PI approach for continuous-time
linear systems with completely unknown dynamics. Lee et al. [8] presented an integral
Q-learning algorithm for continuous-time systems without the exact knowledge of
system dynamics.

© Springer International Publishing AG 2017. D. Liu et al., Adaptive Dynamic Programming
with Applications in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_10
In this chapter, we first employ the online PI algorithm to tackle the decentralized
control problem for a class of large-scale interconnected nonlinear systems [13]. The
decentralized control strategy can be established by adding appropriate feedback
gains to the local optimal control laws. The online PI algorithm is developed to
solve the HJB equations related to the optimal control problem by constructing and
training some critic neural networks. The uniform ultimate boundedness (UUB) of
the dynamics of the NN weight estimation errors is analyzed by using the Lyapunov
approach. Since it is difficult to obtain the exact knowledge of the system dynamics for
large-scale nonlinear systems, such as chemical engineering processes, transportation
systems, and power systems, we relax the assumptions of exact knowledge of the
system dynamics required in the optimal controller design presented in [13]. We
use an online model-free integral PI to solve the decentralized control of a class
of continuous-time large-scale interconnected nonlinear systems [14]. The optimal
control problems for the isolated subsystems with unknown dynamics are related to
the development of decentralized control laws. To implement this algorithm, a critic
NN and an action NN are used to approximate the value function and control law of
each isolated subsystem, respectively.

10.2 Decentralized Control of Interconnected Nonlinear Systems

In this section, we study the decentralized control of interconnected nonlinear
systems based on ADP technique [13]. Consider a class of continuous-time large-scale
nonlinear systems composed of N interconnected subsystems described by

$$
\dot{x}_i(t) = f_i(x_i(t)) + g_i(x_i(t))\big(\bar{u}_i(t) + \bar{Z}_i(x(t))\big), \quad i = 1, \ldots, N, \tag{10.2.1}
$$

where $x_i(t) \in \mathbb{R}^{n_i}$ and $\bar{u}_i(t) \in \mathbb{R}^{m_i}$ are the state vector and the control vector of the ith
subsystem, respectively. In large-scale system (10.2.1), $x = [x_1^{T}, x_2^{T}, \ldots, x_N^{T}]^{T} \in \mathbb{R}^{n}$
denotes the overall state, where $n = \sum_{i=1}^{N} n_i$. Correspondingly, $x_1, x_2, \ldots, x_N$ are
called local states while $\bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N$ are local controls. Note that for subsystem
$i$, $f_i(x_i)$, $g_i(x_i)$, and $g_i(x_i)\bar{Z}_i(x)$ represent the nonlinear internal dynamics, the input
gain matrix, and the interconnected term, respectively.
Let xi (0) = xi0 be the initial state of the ith subsystem, i = 1, . . . , N. Additionally,
we let the following assumptions hold throughout this chapter.
Assumption 10.2.1 The state vector xi = 0 is the equilibrium of the ith subsystem,
i = 1, . . . , N.
Assumption 10.2.2 The functions fi (·) and gi (·) are differentiable in their arguments
with fi (0) = 0, where i = 1, . . . , N.
Assumption 10.2.3 The feedback control vector ūi (xi ) = 0 when xi = 0, where
i = 1, . . . , N.
Let $R_i \in \mathbb{R}^{m_i \times m_i}$, $i = 1, \ldots, N$, be symmetric positive-definite matrices. Then, we
denote $Z_i(x) = R_i^{1/2}\bar{Z}_i(x)$, where $Z_i(x) \in \mathbb{R}^{m_i}$, $i = 1, \ldots, N$, are bounded as follows:

$$
\|Z_i(x)\| \le \sum_{j=1}^{N}\rho_{ij}h_{ij}(x_j), \quad i = 1, \ldots, N. \tag{10.2.2}
$$

In (10.2.2), $\rho_{ij}$ are nonnegative constant parameters and $h_{ij}(x_j)$ are
positive-semidefinite functions with $i, j = 1, \ldots, N$.
If we define $h_i(x_i) = \max\{h_{1i}(x_i), h_{2i}(x_i), \ldots, h_{Ni}(x_i)\}$, $i = 1, \ldots, N$, then
(10.2.2) can be formulated as

$$
\|Z_i(x)\| \le \sum_{j=1}^{N}\lambda_{ij}h_j(x_j), \quad i = 1, \ldots, N, \tag{10.2.3}
$$

where

$$
\lambda_{ij} \ge \frac{\rho_{ij}h_{ij}(x_j)}{h_j(x_j)}, \quad i, j = 1, \ldots, N,
$$

are nonnegative constant parameters.


When dealing with the decentralized control problem, we aim at finding N con-
trol laws ū1 (x1 ), ū2 (x2 ), . . . , ūN (xN ) to stabilize the large-scale system (10.2.1). It
is important to note that in the control vector (ū1 (x1 ), ū2 (x2 ), . . . , ūN (xN )), ūi (xi ) is
only a function of the corresponding local state, namely xi , where i = 1, . . . , N. The
schematic diagram of the decentralized control problem is displayed in Fig. 10.1.

10.2.1 Decentralized Stabilization via Optimal Control Approach

In this section, we investigate the methodology for decentralized controller design. The optimal control of the isolated subsystems is described under the framework of HJB equations. Then, the decentralized control strategy can be constructed based on the optimal control laws.

Fig. 10.1 The schematic diagram of the decentralized control problem of the interconnected system

Consider the N isolated subsystems corresponding to (10.2.1) given by

ẋi (t) = fi (xi (t)) + gi (xi (t))ui (xi (t)), i = 1, . . . , N. (10.2.4)

For the ith isolated subsystem, we further assume that fi +gi ui is Lipschitz continuous
on a set Ωi in Rni containing the origin, and the subsystem is controllable in the sense
that there exists a continuous control policy on Ωi that asymptotically stabilizes the
subsystem.
In this section, in order to deal with the infinite-horizon optimal control problem,
we need to find the control laws ui (xi ), i = 1, . . . , N, which minimize the cost
functions

\[
J_i(x_{i0}) = \int_0^{\infty} \big\{ Q_i^2(x_i(\tau)) + u_i^T(x_i(\tau)) R_i u_i(x_i(\tau)) \big\}\, d\tau, \quad i = 1, \dots, N, \tag{10.2.5}
\]

where Qi (xi ), i = 1, . . . , N, are positive-definite functions satisfying

hi (xi ) ≤ Qi (xi ), i = 1, . . . , N. (10.2.6)

Based on the theory of optimal control, the designed feedback controls must not
only stabilize the subsystems on Ωi , i = 1, . . . , N, but also guarantee that the cost
functions (10.2.5) are finite. In other words, the control laws must be admissible. We
use Ai (Ωi ) to denote the set of admissible controls of subsystem i on Ωi .
For any admissible control laws μi ∈ Ai (Ωi ), i = 1, . . . , N, if the associated
value functions

\[
V_i(x_i) = \int_t^{\infty} \big\{ Q_i^2(x_i(\tau)) + \mu_i^T(x_i(\tau)) R_i \mu_i(x_i(\tau)) \big\}\, d\tau, \quad i = 1, \dots, N, \tag{10.2.7}
\]

are continuously differentiable, then the infinitesimal versions of (10.2.7) are the so-called nonlinear Lyapunov equations

\[
0 = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) + \nabla V_i^T(x_i)\big(f_i(x_i) + g_i(x_i)\mu_i(x_i)\big), \quad i = 1, \dots, N, \tag{10.2.8}
\]

with $V_i(0) = 0$, $i = 1, \dots, N$. In (10.2.8), the terms $\nabla V_i(x_i)$, $i = 1, \dots, N$, denote the partial derivatives of the value functions $V_i(x_i)$ with respect to local states $x_i$, i.e., $\nabla V_i(x_i) = \partial V_i(x_i)/\partial x_i$, where $i = 1, \dots, N$.
Define the Hamiltonians of the N isolated subsystems as

\[
H_i(x_i, \mu_i, \nabla V_i(x_i)) = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) + \nabla V_i^T(x_i)\big(f_i(x_i) + g_i(x_i)\mu_i(x_i)\big),
\]

where $i = 1, \dots, N$.
The optimal cost functions of the N isolated subsystems can be formulated as

\[
J_i^*(x_i) = \min_{\mu_i \in \mathcal{A}_i(\Omega_i)} \int_t^{\infty} \big\{ Q_i^2(x_i(\tau)) + \mu_i^T(x_i(\tau)) R_i \mu_i(x_i(\tau)) \big\}\, d\tau, \quad i = 1, \dots, N. \tag{10.2.9}
\]

In view of the theory of optimal control, the optimal cost functions $J_i^*(x_i)$, $i = 1, \dots, N$, satisfy the HJB equations

\[
\min_{\mu_i \in \mathcal{A}_i(\Omega_i)} H_i(x_i, \mu_i, \nabla J_i^*(x_i)) = 0, \quad i = 1, \dots, N, \tag{10.2.10}
\]

where

\[
\nabla J_i^*(x_i) = \frac{\partial J_i^*(x_i)}{\partial x_i}, \quad i = 1, \dots, N.
\]

Assume that the minima on the left-hand side of (10.2.10) exist and are unique. Then, the optimal control laws for the N isolated subsystems are

\[
u_i^*(x_i) = \arg\min_{\mu_i \in \mathcal{A}_i(\Omega_i)} H_i(x_i, \mu_i, \nabla J_i^*(x_i)) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i), \quad i = 1, \dots, N. \tag{10.2.11}
\]

Substituting the optimal control laws (10.2.11) into the nonlinear Lyapunov equations (10.2.8), we can obtain the formulation of the HJB equations in terms of $\nabla J_i^*(x_i)$, $i = 1, \dots, N$, as

\[
Q_i^2(x_i) + (\nabla J_i^*(x_i))^T f_i(x_i) - \frac{1}{4} (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i) = 0 \tag{10.2.12}
\]

with $J_i^*(0) = 0$ and $i = 1, \dots, N$.
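As a quick sanity check of (10.2.11) and (10.2.12), consider a hypothetical scalar isolated subsystem $\dot x = a x + b u$ with $Q(x) = q|x|$ and $R = r$. The subsystem and the numerical values are illustrative assumptions, not taken from the text. Trying $J^*(x) = p x^2$ reduces (10.2.12) to the scalar Riccati equation $q^2 + 2ap - (b^2/r)p^2 = 0$. The sketch below solves for the positive root and evaluates the HJB residual at a few states.

```python
import math

def riccati_root(a, b, q, r):
    """Positive root p of q^2 + 2*a*p - (b^2/r)*p^2 = 0."""
    c = b * b / r
    return (a + math.sqrt(a * a + c * q * q)) / c

def hjb_residual(x, p, a, b, q, r):
    """Left-hand side of (10.2.12) for J*(x) = p*x^2, f(x) = a*x, g(x) = b."""
    grad = 2.0 * p * x                       # dJ*/dx
    return (q * x) ** 2 + grad * a * x - 0.25 * grad * b * (1.0 / r) * b * grad

def optimal_control(x, p, b, r):
    """u*(x) from (10.2.11) in the scalar case."""
    return -0.5 * (1.0 / r) * b * (2.0 * p * x)

# Assumed illustrative values (open-loop unstable plant).
a, b, q, r = 1.0, 1.0, 1.0, 1.0
p = riccati_root(a, b, q, r)
residuals = [hjb_residual(x, p, a, b, q, r) for x in (-2.0, -0.5, 0.3, 1.5)]
```

For these values $p = 1 + \sqrt{2}$, and the residuals vanish up to rounding, which is what (10.2.12) requires of the optimal cost.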
According to (10.2.11), we have expressed the optimal control policies, i.e., $u_1^*(x_1), u_2^*(x_2), \dots, u_N^*(x_N)$, for the N isolated subsystems (10.2.4). We will show that by proportionally increasing some local feedback gains, a stabilizing decentralized control scheme can be established for the interconnected system (10.2.1). Now, we give the following lemma, indicating how the feedback gains can be added, in order to guarantee the asymptotic stability of the isolated subsystems.

Lemma 10.2.1 Consider the isolated subsystems (10.2.4), the feedback controls

\[
\bar u_i(x_i) = \pi_i u_i^*(x_i) = -\frac{1}{2} \pi_i R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i), \quad i = 1, \dots, N, \tag{10.2.13}
\]

ensure that the N closed-loop isolated subsystems are asymptotically stable for all $\pi_i \ge 1/2$, where $i = 1, \dots, N$.

Proof The lemma can be proved by showing that $J_i^*(x_i)$, $i = 1, \dots, N$, are Lyapunov functions. First of all, in light of (10.2.9), we can find that $J_i^*(x_i) > 0$ for any $x_i \ne 0$ and $J_i^*(x_i) = 0$ when $x_i = 0$, which implies that $J_i^*(x_i)$, $i = 1, \dots, N$, are positive-definite functions. Next, the derivatives of $J_i^*(x_i)$, $i = 1, \dots, N$, along the corresponding trajectories of the closed-loop isolated subsystems are given by

\[
\dot J_i^*(x_i) = (\nabla J_i^*(x_i))^T \dot x_i = (\nabla J_i^*(x_i))^T \big(f_i(x_i) + g_i(x_i)\bar u_i(x_i)\big), \tag{10.2.14}
\]

where $i = 1, \dots, N$. Then, by adding and subtracting $(1/2)(\nabla J_i^*(x_i))^T g_i(x_i) u_i^*(x_i)$ to (10.2.14) and considering (10.2.11)–(10.2.13), we have

\[
\begin{aligned}
\dot J_i^*(x_i) &= (\nabla J_i^*(x_i))^T f_i(x_i) - \frac{1}{4} (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i) \\
&\quad - \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i) \\
&= -Q_i^2(x_i) - \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| R_i^{-1/2} g_i^T(x_i) \nabla J_i^*(x_i) \big\|^2,
\end{aligned}
\tag{10.2.15}
\]

where $i = 1, \dots, N$. Observing (10.2.15), we can obtain $\dot J_i^*(x_i) < 0$ for all $\pi_i \ge 1/2$ and $x_i \ne 0$, where $i = 1, \dots, N$. Therefore, the conditions for Lyapunov local stability theory are satisfied and the proof is complete.
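Lemma 10.2.1 can be illustrated on a hypothetical scalar subsystem $\dot x = a x + b \bar u$ with $Q(x) = q|x|$ and $R = r$ (an assumed example, not from the text). There $J^*(x) = p x^2$ with $p$ the positive root of $q^2 + 2ap - (b^2/r)p^2 = 0$, the scaled control is $\bar u = -\pi (b p / r) x$, and the closed-loop pole is $a - \pi b^2 p / r$. The sketch compares the direct derivative (10.2.14) with the closed form (10.2.15) and checks the pole for several gains.

```python
import math

def value_coeff(a, b, q, r):
    """p such that J*(x) = p*x^2 solves the scalar HJB equation."""
    c = b * b / r
    return (a + math.sqrt(a * a + c * q * q)) / c

def closed_loop_pole(pi_gain, a, b, q, r):
    """Pole of xdot = a*x + b*(pi * u*(x)) with u*(x) = -(b*p/r)*x."""
    return a - pi_gain * (b * b / r) * value_coeff(a, b, q, r)

def jdot_direct(x, pi_gain, a, b, q, r):
    """(10.2.14): gradient of J* times the closed-loop vector field."""
    p = value_coeff(a, b, q, r)
    return 2.0 * p * x * closed_loop_pole(pi_gain, a, b, q, r) * x

def jdot_formula(x, pi_gain, a, b, q, r):
    """(10.2.15): -Q^2 - (1/2)(pi - 1/2) * ||R^{-1/2} g^T grad J*||^2."""
    p = value_coeff(a, b, q, r)
    grad = 2.0 * p * x
    return -(q * x) ** 2 - 0.5 * (pi_gain - 0.5) * (b * b / r) * grad * grad

params = (1.0, 1.0, 1.0, 1.0)   # assumed a, b, q, r
poles = {g: closed_loop_pole(g, *params) for g in (0.5, 1.0, 2.0, 10.0)}
```

In this assumed example every gain $\pi \ge 1/2$ gives a negative pole, while a smaller gain such as $\pi = 0.2$ leaves the pole positive, so the threshold matters here.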

Remark 10.2.1 Lemma 10.2.1 reveals that any feedback controls ūi (xi ), i = 1, . . . , N,
can ensure the asymptotic stability of the closed-loop isolated subsystems as long as
πi ≥ 1/2, i = 1, . . . , N. However, only when πi = 1, i = 1, . . . , N, the feedback
controls are optimal. In fact, similar results have been given in [3, 4, 22], showing
that the optimal controls ui∗ (xi ), i = 1, . . . , N, are robust in the sense that they have
infinite gain margins.

Now, we present the main theorem of this section, based on which the acquired
decentralized control strategy can be established.

Theorem 10.2.1 For the interconnected system (10.2.1), there exist N positive num-
bers πi∗ > 0, i = 1, 2, . . . , N, such that for any πi ≥ πi∗ , i = 1, 2, . . . , N, the feed-
back controls developed by (10.2.13) ensure that the closed-loop interconnected sys-
tem is asymptotically stable. In other words, the control vector (ū1 (x1 ), ū2 (x2 ), . . . ,
ūN (xN )) is the decentralized control strategy of large-scale system (10.2.1).

Proof In accordance with Lemma 10.2.1, we observe that Ji∗ (xi ), i = 1, . . . , N, are
all Lyapunov functions. Here, we choose a composite Lyapunov function given by

\[
L(x) = \sum_{i=1}^{N} \theta_i J_i^*(x_i),
\]

where $\theta_i$, $i = 1, \dots, N$, are arbitrary positive constants.

Taking the time derivative of $L(x)$ along the trajectories of the closed-loop interconnected system, we obtain

\[
\dot L(x) = \sum_{i=1}^{N} \theta_i \Big\{ (\nabla J_i^*(x_i))^T \big(f_i(x_i) + g_i(x_i)\bar u_i(x_i)\big) + (\nabla J_i^*(x_i))^T g_i(x_i) \bar Z_i(x) \Big\}. \tag{10.2.16}
\]

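Using (10.2.3), (10.2.6), and (10.2.15), from (10.2.16), it follows

\[
\begin{aligned}
\dot L(x) &\le -\sum_{i=1}^{N} \theta_i \Big\{ Q_i^2(x_i) + \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| R_i^{-1/2} g_i^T(x_i) \nabla J_i^*(x_i) \big\|^2 - \big\| (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} \big\| \, \|Z_i(x)\| \Big\} \\
&\le -\sum_{i=1}^{N} \theta_i \Big\{ Q_i^2(x_i) + \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} \big\|^2 - \big\| (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} \big\| \sum_{j=1}^{N} \lambda_{ij} Q_j(x_j) \Big\}.
\end{aligned}
\tag{10.2.17}
\]

For convenience, we denote

\[
\Theta = \mathrm{diag}\{\theta_1, \theta_2, \dots, \theta_N\}, \quad
\Lambda = \begin{bmatrix}
\lambda_{11} & \lambda_{12} & \cdots & \lambda_{1N} \\
\lambda_{21} & \lambda_{22} & \cdots & \lambda_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{N1} & \lambda_{N2} & \cdots & \lambda_{NN}
\end{bmatrix},
\]

and

\[
\Pi = \mathrm{diag}\Big\{ \frac{1}{2}\Big(\pi_1 - \frac{1}{2}\Big), \frac{1}{2}\Big(\pi_2 - \frac{1}{2}\Big), \dots, \frac{1}{2}\Big(\pi_N - \frac{1}{2}\Big) \Big\}.
\]

Then, by introducing a 2N-dimensional vector

\[
\xi = \begin{bmatrix}
Q_1(x_1) \\ Q_2(x_2) \\ \vdots \\ Q_N(x_N) \\
\big\| (\nabla J_1^*(x_1))^T g_1(x_1) R_1^{-1/2} \big\| \\
\big\| (\nabla J_2^*(x_2))^T g_2(x_2) R_2^{-1/2} \big\| \\
\vdots \\
\big\| (\nabla J_N^*(x_N))^T g_N(x_N) R_N^{-1/2} \big\|
\end{bmatrix},
\]

we can transform (10.2.17) into compact form as

\[
\dot L(x) \le -\xi^T \begin{bmatrix} \Theta & -\frac{1}{2}\Lambda^T \Theta \\ -\frac{1}{2}\Theta \Lambda & \Theta \Pi \end{bmatrix} \xi \triangleq -\xi^T \Xi \xi. \tag{10.2.18}
\]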

According to (10.2.18), sufficiently large $\pi_i$, $i = 1, \dots, N$, can be chosen such that the matrix $\Xi$ is positive definite. That is to say, there exist $\pi_i^*$, $i = 1, \dots, N$, such that any $\pi_i \ge \pi_i^*$, $i = 1, \dots, N$, are large enough to guarantee the positive definiteness of $\Xi$. Then, we have $\dot L(x) < 0$. Therefore, the closed-loop interconnected system is asymptotically stable under the action of the control vector $(\bar u_1(x_1), \bar u_2(x_2), \dots, \bar u_N(x_N))$. The proof is complete.
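The positive definiteness of $\Xi$ in (10.2.18) is easy to check numerically for given $\Theta$, $\Lambda$, and gains $\pi_i$. The sketch below is one possible way to do so: it assembles $\Xi$ block by block and applies a plain Cholesky test. The helper names are hypothetical, and the numbers are the ones used later in Example 10.2.1 ($\theta_1 = \theta_2 = 1$, $\lambda_{11} = \lambda_{12} = 1$, $\lambda_{21} = \lambda_{22} = 1/2$).

```python
import math

def build_xi(theta, lam, pi_gains):
    """Assemble the 2N x 2N matrix Xi of (10.2.18) as a list of rows."""
    n = len(theta)
    xi = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        xi[i][i] = theta[i]                                      # block Theta
        xi[n + i][n + i] = theta[i] * 0.5 * (pi_gains[i] - 0.5)  # block Theta*Pi
        for j in range(n):
            # lower-left block: -(1/2) * Theta * Lambda
            xi[n + i][j] = -0.5 * theta[i] * lam[i][j]
            # upper-right block is its transpose: -(1/2) * Lambda^T * Theta
            xi[j][n + i] = xi[n + i][j]
    return xi

def is_positive_definite(m):
    """Cholesky test for a symmetric matrix; False if a pivot is not positive."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True

theta = [1.0, 1.0]
lam = [[1.0, 1.0], [0.5, 0.5]]
pd_with_gain_2 = is_positive_definite(build_xi(theta, lam, [2.0, 2.0]))
pd_with_gain_1 = is_positive_definite(build_xi(theta, lam, [1.0, 1.0]))
```

With these numbers the test passes for $\pi_1 = \pi_2 = 2$ and fails for $\pi_1 = \pi_2 = 1$, which illustrates why the gains of the isolated optimal controllers may have to be increased for the interconnected system.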

Clearly, on the basis of Theorem 10.2.1, the design of the decentralized control strategy reduces to deriving the optimal controllers for the N isolated subsystems. We should therefore focus on solving the HJB equations, which is regarded as a difficult task [10, 25]. Hence, in what follows we employ a more pragmatic approach and obtain approximate solutions based on the online PI algorithm and NN techniques.

10.2.2 Optimal Controller Design of Isolated Subsystems

In this section, the online PI algorithm is introduced to solve the HJB equations. The
PI algorithm consists of policy evaluation based on (10.2.8) and policy improvement
based on (10.2.11) [21]. Specifically, its iteration procedure can be described as
follows.

Step 1: Choose a small positive number $\varepsilon$. Let $p = 0$ and $V_i^{(0)}(x_i) = 0$, where $i = 1, \dots, N$. Then, start with N initial admissible control laws $\mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \dots, \mu_N^{(0)}(x_N)$.
Step 2: Let $p = p + 1$. Based on the control laws $\mu_i^{(p-1)}(x_i)$, $i = 1, \dots, N$, solve the following nonlinear Lyapunov equations

\[
0 = Q_i^2(x_i) + \big(\mu_i^{(p-1)}(x_i)\big)^T R_i \mu_i^{(p-1)}(x_i) + \big(\nabla V_i^{(p)}(x_i)\big)^T \big(f_i(x_i) + g_i(x_i)\mu_i^{(p-1)}(x_i)\big) \tag{10.2.19}
\]

with $V_i^{(p)}(0) = 0$ and $i = 1, \dots, N$.
Step 3: Update the control laws via

\[
\mu_i^{(p)}(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla V_i^{(p)}(x_i), \tag{10.2.20}
\]

where $i = 1, \dots, N$.
Step 4: If $\big\| V_i^{(p)}(x_i) - V_i^{(p-1)}(x_i) \big\| \le \varepsilon$, $i = 1, \dots, N$, stop and obtain the approximate optimal controls of the N isolated subsystems; else, go back to Step 2.
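The four steps above can be sketched for a hypothetical scalar subsystem $\dot x = a x + b u$ with $Q(x) = q|x|$ and $R = r$ (an assumed example, not from the text). With $V^{(p)}(x) = P_p x^2$ and $\mu^{(p)}(x) = -k_p x$, the Lyapunov equation (10.2.19) has the closed-form solution $P_p = (q^2 + r k_{p-1}^2) / (2(b k_{p-1} - a))$ and the update (10.2.20) becomes $k_p = b P_p / r$.

```python
import math

def policy_iteration(a, b, q, r, k0, eps=1e-10, max_iter=50):
    """Scalar version of Steps 1-4; returns the final P, gain k and the P history."""
    if a - b * k0 >= 0.0:
        raise ValueError("initial gain k0 is not stabilizing, hence not admissible")
    k, p_prev = k0, 0.0                      # Step 1: V^(0) = 0, admissible mu^(0)
    history = []
    for _ in range(max_iter):
        # Step 2: policy evaluation (10.2.19) with V = p*x^2 and mu = -k*x
        p = (q * q + r * k * k) / (2.0 * (b * k - a))
        # Step 3: policy improvement (10.2.20)
        k = b * p / r
        history.append(p)
        # Step 4: stopping test on successive value functions
        if abs(p - p_prev) <= eps:
            break
        p_prev = p
    return p, k, history

# Assumed values: unstable plant a = 1, initial admissible gain k0 = 2.
p_final, k_final, p_history = policy_iteration(1.0, 1.0, 1.0, 1.0, k0=2.0)
```

For these values the iterates decrease from $P_1 = 2.5$ toward the Riccati solution $1 + \sqrt{2}$, in line with the convergence result stated next.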
Note that N initial admissible control laws are required in the above algorithm.
In the following theorems, we present the convergence analysis of the online PI
algorithm for the isolated subsystems.
Theorem 10.2.2 Consider the N isolated subsystems (10.2.4), given N initial admissible control laws $\mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \dots, \mu_N^{(0)}(x_N)$. Then, using the PI algorithm established in (10.2.19) and (10.2.20), the value functions and control laws converge to the optimal ones as $p \to \infty$, i.e., $V_i^{(p)}(x_i) \to J_i^*(x_i)$ and $\mu_i^{(p)}(x_i) \to u_i^*(x_i)$ as $p \to \infty$, where $i = 1, \dots, N$.
Proof First, we consider the subsystem i. According to [1], when given an initial admissible control law $\mu_i^{(0)}(x_i)$, we have $\mu_i^{(p)}(x_i) \in \mathcal{A}_i(\Omega_i)$ for any $p \ge 0$. Additionally, for any $\zeta > 0$, there exists an integer $p_{0i}$, such that for any $p \ge p_{0i}$, the inequalities

\[
\sup_{x_i \in \Omega_i} \big| V_i^{(p)}(x_i) - J_i^*(x_i) \big| < \zeta \tag{10.2.21}
\]

and

\[
\sup_{x_i \in \Omega_i} \big\| \mu_i^{(p)}(x_i) - u_i^*(x_i) \big\| < \zeta \tag{10.2.22}
\]

hold simultaneously.
Next, we consider the N isolated subsystems. When given $\mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \dots, \mu_N^{(0)}(x_N)$, where $\mu_i^{(0)}(x_i)$ is the initial admissible control law corresponding to the ith subsystem, we can acquire that $\mu_i^{(p)}(x_i) \in \mathcal{A}_i(\Omega_i)$ for any $p \ge 0$, where $i = 1, \dots, N$. Moreover, we denote $p_0 = \max\{p_{01}, p_{02}, \dots, p_{0N}\}$. Thus, we can conclude that for any $\zeta > 0$, there exists an integer $p_0$, such that for any $p \ge p_0$, (10.2.21) and (10.2.22) are true with $i = 1, \dots, N$. In other words, the algorithm will converge to the optimal cost functions and optimal control laws of the N isolated subsystems. The proof is complete.

Next, we describe the NN implementation process of the online PI algorithm. For the N isolated subsystems, assume that the value functions $V_i(x_i)$, $i = 1, \dots, N$, are continuously differentiable. Then, according to the universal approximation property of NNs, $V_i(x_i)$ can be reconstructed by an NN on a compact set $\Omega_i$ as

\[
V_i(x_i) = (W_c^i)^T \sigma_c^i(x_i) + \varepsilon_c^i(x_i), \quad i = 1, \dots, N,
\]

where $W_c^i \in \mathbb{R}^{l_i}$ is the ideal weight, $\sigma_c^i(x_i) \in \mathbb{R}^{l_i}$ is the activation function, $l_i$ is the number of neurons in the hidden layer, and $\varepsilon_c^i(x_i) \in \mathbb{R}$ is the approximation error of the ith NN, $i = 1, \dots, N$.
The derivatives of the value functions with respect to their state vectors are formulated as

\[
\nabla V_i(x_i) = (\nabla \sigma_c^i(x_i))^T W_c^i + \nabla \varepsilon_c^i(x_i), \quad i = 1, \dots, N, \tag{10.2.23}
\]

where $\nabla \sigma_c^i(x_i) = \partial \sigma_c^i(x_i)/\partial x_i \in \mathbb{R}^{l_i \times n_i}$ and $\nabla \varepsilon_c^i(x_i) = \partial \varepsilon_c^i(x_i)/\partial x_i \in \mathbb{R}^{n_i}$ are the gradients of the activation function and approximation error of the ith NN, respectively, $i = 1, \dots, N$. Based on (10.2.23), the Lyapunov equations (10.2.8) become

\[
0 = Q_i^2(x_i) + \mu_i^T R_i \mu_i + \big( (W_c^i)^T \nabla \sigma_c^i(x_i) + (\nabla \varepsilon_c^i(x_i))^T \big) \dot x_i,
\]

where $i = 1, \dots, N$ and $\dot x_i$ is given in (10.2.4).

For the ith NN, $i = 1, \dots, N$, assume that the NN weight vector $W_c^i$, the gradient $\nabla \sigma_c^i(x_i)$, the approximation error $\varepsilon_c^i(x_i)$, and its derivative $\nabla \varepsilon_c^i(x_i)$ are all bounded on the compact set $\Omega_i$. Moreover, according to [23], we have $\varepsilon_c^i(x_i) \to 0$ and $\nabla \varepsilon_c^i(x_i) \to 0$ as $l_i \to \infty$, where $i = 1, \dots, N$.
Because the ideal weights are unknown, N critic NNs can be built to approximate the value functions as

\[
\hat V_i(x_i) = (\hat W_c^i)^T \sigma_c^i(x_i), \quad i = 1, \dots, N,
\]

where $\hat W_c^i$, $i = 1, \dots, N$, are the estimated weights. Here, $\sigma_c^i(x_i)$, $i = 1, \dots, N$, are selected such that $\hat V_i(x_i) > 0$ for any $x_i \ne 0$ and $\hat V_i(x_i) = 0$ when $x_i = 0$.
Similarly, the derivatives of the approximate value functions with respect to the state vectors can be expressed as

\[
\nabla \hat V_i(x_i) = (\nabla \sigma_c^i(x_i))^T \hat W_c^i, \quad i = 1, \dots, N,
\]

where $\nabla \hat V_i(x_i) = \partial \hat V_i(x_i)/\partial x_i$, $i = 1, \dots, N$. Then, the approximate Hamiltonians can be obtained as

\[
H_i(x_i, \mu_i, \hat W_c^i) = Q_i^2(x_i) + \mu_i^T R_i \mu_i + (\hat W_c^i)^T \nabla \sigma_c^i(x_i) \dot x_i = e_{ci}, \quad i = 1, \dots, N. \tag{10.2.24}
\]

For the purpose of training the critic networks of the isolated subsystems, it is desired to obtain $\hat W_c^i$, $i = 1, \dots, N$, to minimize the following objective functions:

\[
E_{ci} = \frac{1}{2} e_{ci}^2, \quad i = 1, \dots, N.
\]

The standard steepest descent algorithm is introduced to tune the critic networks. Then, their weights are updated through

\[
\dot{\hat W}_c^i = -\alpha_{ci} \bigg[ \frac{\partial E_{ci}}{\partial \hat W_c^i} \bigg], \quad i = 1, \dots, N, \tag{10.2.25}
\]

where αci > 0, i = 1, . . . , N, are the learning rates of the critic networks.
On the other hand, based on (10.2.23), the Hamiltonians take the following forms:

\[
H_i(x_i, \mu_i, W_c^i) = Q_i^2(x_i) + \mu_i^T R_i \mu_i + (W_c^i)^T \nabla \sigma_c^i(x_i) \dot x_i = e_{cHi}, \quad i = 1, \dots, N, \tag{10.2.26}
\]

where

\[
e_{cHi} = -(\nabla \varepsilon_c^i(x_i))^T \dot x_i, \quad i = 1, \dots, N,
\]

are the residual errors due to the NN approximation.

Denote

\[
\delta_i = \nabla \sigma_c^i(x_i) \dot x_i, \quad i = 1, \dots, N.
\]

We assume that there exist N positive constants $\delta_{Mi}$, $i = 1, \dots, N$, such that

\[
\|\delta_i\| \le \delta_{Mi}, \quad i = 1, \dots, N. \tag{10.2.27}
\]

Moreover, we define the weight estimation errors of the critic networks as $\tilde W_c^i = W_c^i - \hat W_c^i$, where $i = 1, \dots, N$. Then, combining (10.2.24) with (10.2.26) yields

\[
e_{cHi} - e_{ci} = (\tilde W_c^i)^T \delta_i, \quad i = 1, \dots, N.
\]

Therefore, the dynamics of the weight estimation errors can be given as

\[
\dot{\tilde W}_c^i = \alpha_{ci} \big( e_{cHi} - (\tilde W_c^i)^T \delta_i \big) \delta_i, \quad i = 1, \dots, N. \tag{10.2.28}
\]

Incidentally, the persistency of excitation condition is required to tune the ith critic network to guarantee that $\|\delta_i\| \ge \delta_{mi}$, where $\delta_{mi}$, $i = 1, \dots, N$, are positive constants. Thus, a set of small exploratory signals will be added to the isolated subsystems in order to satisfy the condition in practice.
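The tuning rule (10.2.25) can be sketched in discrete time for a hypothetical scalar subsystem $\dot x = a x + b\mu$ with a fixed policy $\mu(x) = -k x$, $Q(x) = q|x|$, $R = r$, and a one-neuron critic $\hat V(x) = w x^2$. All of these choices are assumptions for illustration. Instead of an exploratory signal, the sketch cycles through a fixed list of sample states, which plays the role of the excitation here; each gradient step uses $\partial E_c / \partial w = e_c \delta$.

```python
def train_critic(a, b, q, r, k, alpha=0.1, sweeps=200):
    """Tune the scalar critic weight w of V_hat(x) = w*x^2 for mu(x) = -k*x."""
    samples = [1.0, -0.8, 0.6, -0.4, 0.9, -0.7]    # assumed training states
    w = 0.0
    for _ in range(sweeps):
        for x in samples:
            mu = -k * x
            xdot = a * x + b * mu                  # isolated subsystem (10.2.4)
            delta = 2.0 * x * xdot                 # grad(sigma)*xdot for sigma(x) = x^2
            e_c = (q * x) ** 2 + r * mu * mu + w * delta   # residual (10.2.24)
            w -= alpha * e_c * delta               # discrete gradient step from (10.2.25)
    return w

# Assumed values; for k = 2 the Lyapunov equation (10.2.8) gives V(x) = 2.5*x^2.
w_hat = train_critic(a=1.0, b=1.0, q=1.0, r=1.0, k=2.0)
```

Because the one-neuron critic can represent the value function of this policy exactly, the residual error $e_{cH}$ is zero in this sketch and the weight settles at 2.5.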
When implementing the online PI algorithm, in order to accomplish the policy improvement, we should obtain the control policies that minimize the value functions. Hence, according to (10.2.11) and (10.2.23), we have

\[
\mu_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla V_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \big( (\nabla \sigma_c^i(x_i))^T W_c^i + \nabla \varepsilon_c^i(x_i) \big),
\]

where $i = 1, \dots, N$. Hence, the approximate control policies can be obtained as

\[
\hat\mu_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla \hat V_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) (\nabla \sigma_c^i(x_i))^T \hat W_c^i, \tag{10.2.29}
\]

where $i = 1, \dots, N$.
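To make (10.2.29) concrete, the sketch below evaluates the approximate control law for subsystem 1 of Example 10.2.1, which appears later in this section. It uses the input gain $g_1(x_1) = [0, \cos(2x_{11}) + 2]^T$, the activation vector $\sigma_c^1(x_1) = [x_{11}^2, x_{11}x_{12}, x_{12}^2]^T$, $R_1 = 1$, and the converged weights reported there. The function names are hypothetical.

```python
import math

# Converged critic weights of subsystem 1 as reported in Example 10.2.1.
W_C1 = [0.498969, 0.000381, 0.999843]

def g1(x11):
    """Input gain vector of subsystem 1."""
    return [0.0, math.cos(2.0 * x11) + 2.0]

def grad_sigma_T_w(x11, x12, w):
    """(grad sigma)^T w for sigma = [x11^2, x11*x12, x12^2]^T."""
    return [2.0 * w[0] * x11 + w[1] * x12,
            w[1] * x11 + 2.0 * w[2] * x12]

def mu_hat(x11, x12, w=W_C1):
    """Approximate control law (10.2.29) with R_1 = 1."""
    g = g1(x11)
    grad_v = grad_sigma_T_w(x11, x12, w)
    return -0.5 * (g[0] * grad_v[0] + g[1] * grad_v[1])

def u_star(x11, x12):
    """Optimal control law of isolated subsystem 1 stated in Example 10.2.1."""
    return -(math.cos(2.0 * x11) + 2.0) * x12

grid = [-2.0 + 0.5 * n for n in range(9)]
max_error = max(abs(mu_hat(p, s) - u_star(p, s)) for p in grid for s in grid)
```

Only the critic weights enter the computation, so no separate action network is needed to evaluate the control.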

Remark 10.2.2 According to (10.2.29), the approximate control laws of the N isolated subsystems can be derived directly from the trained critic networks. Consequently, unlike the traditional actor–critic architecture, the action NNs are no longer required.

When considering the critic networks, the weight estimation dynamics are uni-
formly ultimately bounded (UUB) as described in the following theorem.

Theorem 10.2.3 For the N isolated subsystems (10.2.4), the weight update laws for
tuning the critic networks are given by (10.2.25). Then, the dynamics of the weight
estimation errors of the critic networks are UUB.

Proof Choose N Lyapunov function candidates described as follows:

\[
L_i(x) = \frac{1}{\alpha_{ci}} \mathrm{tr}\big\{ (\tilde W_c^i)^T \tilde W_c^i \big\} = \frac{1}{\alpha_{ci}} (\tilde W_c^i)^T \tilde W_c^i, \quad i = 1, \dots, N.
\]

The time derivatives of the Lyapunov functions $L_i(x)$, $i = 1, \dots, N$, along the trajectories of the error dynamics (10.2.28) are

\[
\dot L_i(x) = \frac{2}{\alpha_{ci}} (\tilde W_c^i)^T \dot{\tilde W}_c^i = \frac{2}{\alpha_{ci}} (\tilde W_c^i)^T \alpha_{ci} \big( e_{cHi} - (\tilde W_c^i)^T \delta_i \big) \delta_i,
\]

where $i = 1, \dots, N$. After some basic mathematical manipulations and considering the Cauchy–Schwarz inequality, it yields

\[
\begin{aligned}
\dot L_i(x) &= \frac{2}{\alpha_{ci}} \Big[ e_{cHi}\, \alpha_{ci} (\tilde W_c^i)^T \delta_i - \alpha_{ci} \big( (\tilde W_c^i)^T \delta_i \big)^2 \Big] \\
&\le \frac{1}{\alpha_{ci}} e_{cHi}^2 - (2 - \alpha_{ci}) \big( (\tilde W_c^i)^T \delta_i \big)^2,
\end{aligned}
\]

where $i = 1, \dots, N$. It can be found that $\dot L_i(x) < 0$ as long as

\[
\begin{cases}
0 < \alpha_{ci} < 2, \\[4pt]
\big| (\tilde W_c^i)^T \delta_i \big| > \sqrt{\dfrac{e_{cHi}^2}{\alpha_{ci}(2 - \alpha_{ci})}},
\end{cases}
\tag{10.2.30}
\]

where $i = 1, \dots, N$. By employing the dense property of real numbers [17], we derive that there exist some positive constants $\kappa_i$ ($0 < \kappa_i \le \delta_{Mi}$) such that

\[
\big| (\tilde W_c^i)^T \delta_i \big| > \kappa_i \|\tilde W_c^i\| > \sqrt{\frac{e_{cHi}^2}{\alpha_{ci}(2 - \alpha_{ci})}}. \tag{10.2.31}
\]

By noticing (10.2.31), we can conclude that $\dot L_i(x) < 0$ as long as

\[
\begin{cases}
0 < \alpha_{ci} < 2, \\[4pt]
\|\tilde W_c^i\| > \sqrt{\dfrac{e_{cHi}^2}{\alpha_{ci}\kappa_i^2(2 - \alpha_{ci})}},
\end{cases}
\tag{10.2.32}
\]

where $i = 1, \dots, N$. In accordance with the Lyapunov stability theory, we conclude that the dynamics of the weight estimation errors of the critic networks are all UUB. Meanwhile, the norms of the weight estimation errors are bounded as well. The proof is complete.

Remark 10.2.3 Let

\[
\hat{\bar u}_i(x_i) = \pi_i \hat\mu_i(x_i),
\]

where $\hat\mu_i(x_i)$, $i = 1, \dots, N$, are obtained by (10.2.29). According to the selections of the activation functions of the critic networks, we can easily find that the approximate value functions $\hat V_i(x_i)$, $i = 1, \dots, N$, are also Lyapunov functions. Furthermore, similar to the proof of Theorem 10.2.1, we have

\[
\dot L(x) \le -\xi^T \Xi \xi + \Sigma_e,
\]

where $\Sigma_e$ is the sum of the approximation errors. Hence, we can conclude that based on the approximate optimal control laws $\hat\mu_i(x_i)$, $i = 1, \dots, N$, the present control vector $(\hat{\bar u}_1(x_1), \hat{\bar u}_2(x_2), \dots, \hat{\bar u}_N(x_N))$ can ensure the uniform ultimate boundedness of the state trajectories of the closed-loop interconnected system. It is in this sense that we accomplish the design of the decentralized control scheme by adopting the learning optimal control approach based on online PI algorithm.
Remark 10.2.4 Note that the controller presented here is a decentralized stabilizing one. Though the optimal decentralized controller of interconnected systems has been studied before [19], in this section, we aim at developing a novel decentralized control strategy based on ADP. How to extend the present results to the design of optimal decentralized control for interconnected nonlinear systems is a topic for our future research. In fact, in recent work, Wang et al. [26] have provided some preliminary results on this topic.

10.2.3 Generalization to Model-Free Decentralized Control

Now, we generalize the above results to model-free decentralized control of interconnected nonlinear systems [14]. We develop an online model-free integral PI algorithm for optimal control problems with completely unknown system dynamics. To deal with exploration which relaxes the assumptions of exact knowledge on $f_i(x_i)$ and $g_i(x_i)$, we consider the following nonlinear subsystem explored by a known bounded piecewise continuous signal $e_i(t)$

\[
\dot x_i(t) = f_i(x_i(t)) + g_i(x_i(t))\big[u_i(x_i(t)) + e_i(t)\big]. \tag{10.2.33}
\]

The derivative of the value function with respect to time along the trajectory of the subsystem (10.2.33) is calculated using (10.2.8) as

\[
\begin{aligned}
\dot V_i(x_i) &= \nabla V_i^T(x_i)\big( f_i(x_i) + g_i(x_i)[\mu_i(x_i) + e_i] \big) \\
&= -r_i(x_i, \mu_i) + \nabla V_i^T(x_i) g_i(x_i) e_i,
\end{aligned}
\tag{10.2.34}
\]

where

\[
r_i(x_i, \mu_i) = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i).
\]

We present a lemma which is essential to prove the convergence of the model-free PI algorithm for the isolated subsystems.
Lemma 10.2.2 Solving for $V_i(x_i)$ in the following equation

\[
V_i(x_i(t + T)) - V_i(x_i(t)) = \int_t^{t+T} \nabla V_i^T(x_i) g_i(x_i) e_i \, d\tau - \int_t^{t+T} r_i(x_i, \mu_i(x_i)) \, d\tau \tag{10.2.35}
\]

is equivalent to finding the solution of (10.2.34).

Proof Since $\mu_i(x_i) \in \mathcal{A}_i(\Omega_i)$, the value function $V_i(x_i)$ is a Lyapunov function for the subsystem (10.2.33), and it satisfies (10.2.34) with $r_i(x_i, \mu_i) > 0$, $x_i \ne 0$. We integrate (10.2.34) over the interval $[t, t + T]$ to obtain (10.2.35). This means that the unique solution of (10.2.34), $V_i(x_i)$, also satisfies (10.2.35). To complete the proof, we show that (10.2.35) has a unique solution by contradiction.
Thus, we assume that there exists another value function $\bar V_i(x_i)$ which satisfies (10.2.35) with the end condition (boundary condition) $\bar V_i(0) = 0$. This value function also satisfies

\[
\dot{\bar V}_i(x_i) = -r_i(x_i, \mu_i) + \nabla \bar V_i^T(x_i) g_i(x_i) e_i.
\]

Subtracting this from (10.2.34), we obtain

\[
\dot V_i(x_i) - \dot{\bar V}_i(x_i) = \bigg( \frac{d[V_i(x_i) - \bar V_i(x_i)]}{dx_i} \bigg)^T \dot x_i = \big( \nabla V_i(x_i) - \nabla \bar V_i(x_i) \big)^T g_i(x_i) e_i,
\]

or,

\[
\begin{aligned}
0 &= \bigg( \frac{d[V_i(x_i) - \bar V_i(x_i)]}{dx_i} \bigg)^T \big( \dot x_i - g_i(x_i) e_i \big) \\
&= \bigg( \frac{d[V_i(x_i) - \bar V_i(x_i)]}{dx_i} \bigg)^T \big( f_i(x_i) + g_i(x_i)\mu_i(x_i) \big),
\end{aligned}
\tag{10.2.36}
\]

which must hold for any $x_i$ on the system trajectories generated by the stabilizing policy $\mu_i(x_i)$. According to (10.2.36), we have $\bar V_i(x_i) = V_i(x_i) + c$. As this relation must hold for $x_i(t) = 0$, we have $\bar V_i(0) = V_i(0) + c \Rightarrow c = 0$. Thus, $\bar V_i(x_i) = V_i(x_i)$, i.e., (10.2.35) has a unique solution which is equal to the solution of (10.2.34). The proof is complete.

Integrating (10.2.34) from $t$ to $t + T$ with any time interval $T > 0$, and considering (10.2.19) and (10.2.20), we have

\[
\begin{aligned}
V_i^{(p)}(x_i(t + T)) - V_i^{(p)}(x_i(t)) &= -2 \int_t^{t+T} \big( \mu_i^{(p)}(x_i) \big)^T R_i e_i \, d\tau \\
&\quad - \int_t^{t+T} \Big\{ Q_i^2(x_i) + \big( \mu_i^{(p)}(x_i) \big)^T R_i \mu_i^{(p)}(x_i) \Big\} d\tau.
\end{aligned}
\tag{10.2.37}
\]

Equation (10.2.37) which is derived by (10.2.20) and (10.2.35) plays an important role in relaxing the knowledge of the system dynamics, since $f_i(x_i)$ and $g_i(x_i)$ do not appear in the equation. It means that the iteration can be done without knowing the system dynamics. Thus, we obtain the online model-free integral PI algorithm as in Algorithm 10.2.1.

Algorithm 10.2.1 Online model-free integral PI algorithm

Step 1. Give a small positive real number $\varepsilon$. Let $p = 0$ and start with an initial admissible control law $\mu_i^{(0)}(x_i)$.
Step 2. Policy Evaluation and Improvement:
Based on the control law $\mu_i^{(p)}(x_i)$, solve the following nonlinear Lyapunov equations for $V_i^{(p)}(x_i)$ and $\mu_i^{(p+1)}(x_i)$

\[
\begin{aligned}
V_i^{(p)}(x_i(t)) &= \int_t^{t+T} \Big\{ Q_i^2(x_i) + \big( \mu_i^{(p)}(x_i) \big)^T R_i \mu_i^{(p)}(x_i) \Big\} d\tau \\
&\quad + 2 \int_t^{t+T} \big( \mu_i^{(p+1)}(x_i) \big)^T R_i e_i \, d\tau + V_i^{(p)}(x_i(t + T)).
\end{aligned}
\tag{10.2.38}
\]

Step 3. If $\big| V_i^{(p)}(x_i) - V_i^{(p-1)}(x_i) \big| \le \varepsilon$, stop and obtain the approximate optimal control law of the ith isolated subsystem; else, set $p = p + 1$ and go to Step 2.

Note that N initial admissible control laws are required in this algorithm, and we let $V_i^{(p)}(x_i) = 0$, when $p = 0$.

Theorem 10.2.4 Considering the isolated subsystems (10.2.4), we give N initial admissible control laws $\mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \dots, \mu_N^{(0)}(x_N)$. Then, using the PI algorithm established in (10.2.38), the value functions and control laws converge to the optimal ones as $p \to \infty$, i.e., $V_i^{(p)}(x_i) \to J_i^*(x_i)$, $\mu_i^{(p)}(x_i) \to u_i^*(x_i)$.

Proof In [1], it was shown that in the iteration process of (10.2.20) and (10.2.34), if the initial control policy $\mu_i^{(0)}(x_i)$ is admissible, all the subsequent control laws will be admissible. Moreover, the iteration result will converge to the solution of the HJB equation. Based on (10.2.37) and the proven equivalence between (10.2.34) and (10.2.35), we can conclude that the present online model-free PI algorithm will converge to the solution of the optimal control problem for subsystem (10.2.33) without using the knowledge of system dynamics. The proof is complete.

Now, we discuss the NN-based implementation method of the established model-free PI algorithm. A critic NN and an action NN are used to approximate the value function and the control law of the subsystem, respectively. We assume that for the ith subsystem, $V_i(x_i)$ and $\mu_i(x_i)$ are represented on a compact set $\Omega_i$ by single-layer NNs as $V_i(x_i) = (W_c^i)^T \phi_c^i(x_i) + \varepsilon_c^i(x_i)$ and $\mu_i(x_i) = (W_a^i)^T \phi_a^i(x_i) + \varepsilon_a^i(x_i)$, where $W_c^i \in \mathbb{R}^{N_c^i}$ and $W_a^i \in \mathbb{R}^{N_a^i}$ are unknown bounded ideal NN weight parameters which will be determined by the established model-free PI algorithm, $\phi_c^i(x_i) \in \mathbb{R}^{N_c^i}$ and $\phi_a^i(x_i) \in \mathbb{R}^{N_a^i}$ are the continuously differentiable nonlinear activation functions, and $\varepsilon_c^i(x_i) \in \mathbb{R}$ and $\varepsilon_a^i(x_i) \in \mathbb{R}^{m_i}$ are the bounded NN approximation errors. Here, the subscripts “c” and “a” denote the critic and the action, respectively. Since the ideal weights are unknown, the outputs of the critic NN and the action NN are estimated as

\[
\hat V_i(x_i) = (\hat W_c^i)^T \phi_c^i(x_i), \tag{10.2.39}
\]

\[
\hat\mu_i(x_i) = (\hat W_a^i)^T \phi_a^i(x_i), \tag{10.2.40}
\]

where $\hat W_c^i$ and $\hat W_a^i$ are the current estimated weights.


Using (10.2.39) and (10.2.40), (10.2.38) can be rewritten in a general form as

\[
[\psi_k^i]^T \begin{bmatrix} \hat W_c^i \\ \hat W_a^i \end{bmatrix} = \theta_k^i \tag{10.2.41}
\]

with

\[
\begin{aligned}
\theta_k^i &= \int_{t+(k-1)T}^{t+kT} \big\{ Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) \big\} d\tau, \\
\psi_k^i &= \bigg[ \big( \phi_c^i(x_i(t + (k-1)T)) - \phi_c^i(x_i(t + kT)) \big)^T, \; -2 \int_{t+(k-1)T}^{t+kT} \big( R_i e_i \phi_a^i(x_i) \big)^T d\tau \bigg]^T,
\end{aligned}
\]

where the measurement time is from $t + (k-1)T$ to $t + kT$. Since (10.2.41) is only a 1-dimensional equation, we cannot guarantee the uniqueness of the solution. Similar to [8], we use the least squares method to solve the parameter vector over a compact set $\Omega_i$. For any positive integer $K_i$, we denote $\Phi_i = [\psi_1^i, \psi_2^i, \dots, \psi_{K_i}^i]$ and $\Theta_i = [\theta_1^i, \theta_2^i, \dots, \theta_{K_i}^i]^T$. Then, we have the following $K_i$-dimensional equation

\[
\Phi_i^T \begin{bmatrix} \hat W_c^i \\ \hat W_a^i \end{bmatrix} = \Theta_i.
\]

If $\Phi_i^T$ has full column rank, the parameters can be obtained by solving the equation

\[
\begin{bmatrix} \hat W_c^i \\ \hat W_a^i \end{bmatrix} = (\Phi_i \Phi_i^T)^{-1} \Phi_i \Theta_i. \tag{10.2.42}
\]

Therefore, we need to guarantee that the number of collected data points $K_i$ satisfies $K_i \ge \mathrm{rank}(\Phi_i) = N_c^i + N_a^i$, which will guarantee the existence of $(\Phi_i \Phi_i^T)^{-1}$. The least squares problem in (10.2.42) can be solved in real time by collecting enough data points generated from the system (10.2.33).
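A minimal sketch of Algorithm 10.2.1 with the least squares step (10.2.42) follows, for a hypothetical scalar subsystem $\dot x = a x + b(u + e)$ with $Q(x) = q|x|$ and $R = r$. The plant, the exploration signal, and the basis functions $\phi_c(x) = x^2$, $\phi_a(x) = x$ are all assumptions made for illustration. The plant parameters are used only inside the simulator that generates data; the learning step sees only states, the cost integral, and the exploration integral.

```python
import math

def explore(t):
    # Hypothetical bounded exploration signal e(t).
    return math.sin(3.0 * t) + 0.5 * math.cos(7.0 * t)

def simulate_interval(x, t, gain, T, a, b, q, r, steps=100):
    """RK4 over [t, t+T]; returns x(t+T), the cost integral, and the integral of R*e*phi_a."""
    def deriv(tt, s):
        xx = s[0]
        u = -gain * xx                       # current policy mu^(p)
        e = explore(tt)
        return (a * xx + b * (u + e),        # explored subsystem (10.2.33)
                (q * xx) ** 2 + r * u * u,   # integrand of theta_k
                r * e * xx)                  # integrand of R*e*phi_a with phi_a(x) = x
    h = T / steps
    s = (x, 0.0, 0.0)
    for n in range(steps):
        tt = t + n * h
        k1 = deriv(tt, s)
        k2 = deriv(tt + h / 2, tuple(v + h / 2 * d for v, d in zip(s, k1)))
        k3 = deriv(tt + h / 2, tuple(v + h / 2 * d for v, d in zip(s, k2)))
        k4 = deriv(tt + h, tuple(v + h * d for v, d in zip(s, k3)))
        s = tuple(v + h / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
                  for v, d1, d2, d3, d4 in zip(s, k1, k2, k3, k4))
    return s

def model_free_pi(a=1.0, b=1.0, q=1.0, r=1.0, gain0=2.0, T=0.1, K=20, iters=6):
    """Iterate (10.2.41)-(10.2.42); returns the critic weight and the feedback gain."""
    x, t, gain, w_c = 1.0, 0.0, gain0, 0.0
    for _ in range(iters):
        rows, rhs = [], []
        for _ in range(K):
            x_next, theta, int_e_phi = simulate_interval(x, t, gain, T, a, b, q, r)
            rows.append((x * x - x_next * x_next, -2.0 * int_e_phi))   # psi_k
            rhs.append(theta)                                           # theta_k
            x, t = x_next, t + T
        # Normal equations of the least squares problem (two unknowns).
        s11 = sum(p[0] * p[0] for p in rows)
        s12 = sum(p[0] * p[1] for p in rows)
        s22 = sum(p[1] * p[1] for p in rows)
        b1 = sum(p[0] * y for p, y in zip(rows, rhs))
        b2 = sum(p[1] * y for p, y in zip(rows, rhs))
        det = s11 * s22 - s12 * s12
        w_c = (s22 * b1 - s12 * b2) / det
        w_a = (s11 * b2 - s12 * b1) / det
        gain = -w_a                          # mu^(p+1)(x) = w_a * x
    return w_c, gain

w_c, gain = model_free_pi()
```

For the assumed values the critic weight and the gain should both approach $1 + \sqrt{2}$, the solution of the corresponding scalar HJB equation, although the learner never uses $a$ or $b$.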
Clearly, the problem of designing the decentralized control law reduces to deriving the optimal controllers for the N isolated subsystems. Based on the online model-free integral PI algorithm and NN techniques, we obtain the approximate solutions of the HJB equations, and hence the approximate optimal control policies $\hat\mu_i(x_i)$. Therefore, according to Theorem 10.2.1 and Remark 10.2.3, the stabilizing decentralized control law of the large-scale interconnected system can be derived.

10.2.4 Simulation Studies

Two simulation examples are provided to show the applicability of the decentralized
control strategy established in this section.

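Example 10.2.1 Consider the following continuous-time large-scale nonlinear system consisting of two interconnected subsystems given by

\[
\begin{aligned}
\dot x_1 &= \begin{bmatrix} -x_{11} + x_{12} \\ -0.5x_{11} - 0.5x_{12} + 0.5x_{12}(\cos(2x_{11}) + 2)^2 \end{bmatrix}
+ \begin{bmatrix} 0 \\ \cos(2x_{11}) + 2 \end{bmatrix} \Big( \bar u_1(x_1) + (x_{11} + x_{22}) \sin x_{12}^2 \cos(0.5x_{21}) \Big), \\
\dot x_2 &= \begin{bmatrix} x_{22} \\ -x_{21} - 0.5x_{22} + 0.5x_{21}^2 x_{22} \end{bmatrix}
+ \begin{bmatrix} 0 \\ x_{21} \end{bmatrix} \Big( \bar u_2(x_2) + 0.5(x_{12} + x_{22}) \cos e^{x_{21}^2} \Big),
\end{aligned}
\tag{10.2.43}
\]

where $x_1 = [x_{11}, x_{12}]^T \in \mathbb{R}^2$ and $\bar u_1(x_1) \in \mathbb{R}$ are the state and control variables of subsystem 1, and $x_2 = [x_{21}, x_{22}]^T \in \mathbb{R}^2$ and $\bar u_2(x_2) \in \mathbb{R}$ are the state and control variables of subsystem 2. Let $R_1 = I$ and $R_2 = I$ be identity matrices with suitable dimensions. Additionally, let $h_1(x_1) = \|x_1\|$ and $h_2(x_2) = |x_{22}|$. Then, we find that $Z_1(x)$ and $Z_2(x)$ with $x = [x_1^T, x_2^T]^T$ are upper bounded as in (10.2.3). For example, we can select $\lambda_{11} = \lambda_{12} = 1$ and $\lambda_{21} = \lambda_{22} = 1/2$.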
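The bound (10.2.3) for this example can be spot-checked numerically. The sketch below samples random states and tests $|Z_1(x)| \le \|x_1\| + |x_{22}|$ and $|Z_2(x)| \le 0.5\|x_1\| + 0.5|x_{22}|$. It assumes the interconnection terms read $(x_{11} + x_{22}) \sin(x_{12}^2) \cos(0.5x_{21})$ and $0.5(x_{12} + x_{22}) \cos(e^{x_{21}^2})$; since the sine and cosine factors are bounded by one, the check does not depend on the exact arguments of these functions. With $R_1 = R_2 = I$ we have $Z_i = \bar Z_i$.

```python
import math
import random

def z1(x11, x12, x21, x22):
    """Interconnection term of subsystem 1 (assumed reading of the example)."""
    return (x11 + x22) * math.sin(x12 ** 2) * math.cos(0.5 * x21)

def z2(x11, x12, x21, x22):
    """Interconnection term of subsystem 2 (assumed reading of the example)."""
    return 0.5 * (x12 + x22) * math.cos(math.exp(x21 ** 2))

def bounds_hold(x11, x12, x21, x22, lam=((1.0, 1.0), (0.5, 0.5))):
    """Check (10.2.3) with h1 = ||x1|| and h2 = |x22| at one state."""
    h1 = math.hypot(x11, x12)
    h2 = abs(x22)
    tol = 1e-12
    ok1 = abs(z1(x11, x12, x21, x22)) <= lam[0][0] * h1 + lam[0][1] * h2 + tol
    ok2 = abs(z2(x11, x12, x21, x22)) <= lam[1][0] * h1 + lam[1][1] * h2 + tol
    return ok1 and ok2

rng = random.Random(0)   # fixed seed so the check is repeatable
all_ok = all(bounds_hold(*[rng.uniform(-2.0, 2.0) for _ in range(4)])
             for _ in range(10000))
```

A sampled check of this kind does not replace the analytical argument, but it is a cheap guard against a mistyped interconnection term or bound.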
In order to design the decentralized controller of interconnected system (10.2.43), we first deal with the optimal control problem of two isolated subsystems. Here, we choose $Q_1(x_1) = \|x_1\|$ and $Q_2(x_2) = |x_{22}|$. Hence, the cost functions of the optimal control problem are, respectively,

\[
J_1(x_{10}) = \int_0^{\infty} \big\{ x_{11}^2 + x_{12}^2 + u_1^T u_1 \big\} d\tau
\]

and

\[
J_2(x_{20}) = \int_0^{\infty} \big\{ x_{22}^2 + u_2^T u_2 \big\} d\tau.
\]

We adopt the online PI algorithm to tackle the optimal control problem, where two critic networks are constructed to approximate the cost functions. We denote the weight vectors of the two critic networks as $\hat W_c^1 = [\hat W_{c1}^1, \hat W_{c2}^1, \hat W_{c3}^1]^T$ and $\hat W_c^2 = [\hat W_{c1}^2, \hat W_{c2}^2, \hat W_{c3}^2]^T$. During the simulation process, the initial weights of the critic networks are chosen randomly in $[0, 2]$. Moreover, the activation functions of the two critic networks are chosen as $\sigma_c^1(x_1) = [x_{11}^2, x_{11}x_{12}, x_{12}^2]^T$ and $\sigma_c^2(x_2) = [x_{21}^2, x_{21}x_{22}, x_{22}^2]^T$, respectively. Besides, choose the learning rates of the critic networks as $\alpha_{c1} = \alpha_{c2} = 0.1$ and the initial states of the two isolated subsystems as $x_{10} = x_{20} = [1, -1]^T$.
During the implementation process of the online PI algorithm, for each isolated subsystem, we add the following small exploratory signals to satisfy the persistency of excitation condition:

\[
\begin{aligned}
N_1(t) &= \sin^2(t)\cos(t) + \sin^2(2t)\cos(0.1t) + \sin^2(-1.2t)\cos(0.5t) \\
&\quad + \sin^5(t) + \sin^2(1.12t) + \cos(2.4t)\sin^3(2.4t)
\end{aligned}
\]

and $N_2(t) = 1.6N_1(t)$. We can observe that the convergence of the weights occurred after 750 and 180 s, respectively. Then, the exploratory signals are turned off. Actually, the weights of the critic networks converge to $\hat W_c^1 = [0.498969, 0.000381, 0.999843]^T$ and $\hat W_c^2 = [1.000002, -0.000021, 0.999992]^T$, respectively, which are depicted in Figs. 10.2 and 10.3.
Based on the converged weights $\hat W_{c1}$ and $\hat W_{c2}$, we can obtain the approximate value function and control law for each isolated subsystem, namely, $\hat V_1(x_1)$, $\hat\mu_1(x_1)$, $\hat V_2(x_2)$, and $\hat\mu_2(x_2)$. For comparison, using the method proposed in [23], the optimal cost function and control law of isolated subsystem 1 are $J_1^*(x_1) = 0.5x_{11}^2 + x_{12}^2$ and $u_1^*(x_1) = -(\cos(2x_{11}) + 2)x_{12}$, respectively. Similarly, the optimal cost function and control law of isolated subsystem 2 are $J_2^*(x_2) = x_{21}^2 + x_{22}^2$ and $u_2^*(x_2) = -x_{21}x_{22}$.
As a result, for isolated subsystem 1, the error between the optimal cost function and the approximate one is presented in Fig. 10.4. Moreover, the error between the optimal control law and the approximate version is shown in Fig. 10.5. Both approximation errors are close to zero, which verifies the good performance of the online learning algorithm. For isolated subsystem 2, we obtain similar simulation results, shown in Figs. 10.6 and 10.7.

Fig. 10.2 Convergence of the weight vector of the critic network 1 ($\omega_{ac11}$, $\omega_{ac12}$, and $\omega_{ac13}$ represent $\hat W^1_{c1}$, $\hat W^1_{c2}$, and $\hat W^1_{c3}$, respectively). Axes: weights of the critic network 1 versus time (s)

Fig. 10.3 Convergence of the weight vector of the critic network 2 ($\omega_{ac21}$, $\omega_{ac22}$, and $\omega_{ac23}$ represent $\hat W^2_{c1}$, $\hat W^2_{c2}$, and $\hat W^2_{c3}$, respectively). Axes: weights of the critic network 2 versus time (s)

Fig. 10.4 3-D plot of the approximation error of the cost function of isolated subsystem 1, i.e., $J_1^*(x_1) - \hat V_1(x_1)$. Axes: approximation error versus $x_{11}$ and $x_{12}$

Fig. 10.5 3-D plot of the approximation error of the control law of isolated subsystem 1, i.e., $u_1^*(x_1) - \hat\mu_1(x_1)$. Axes: approximation error versus $x_{11}$ and $x_{12}$
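The size of the cost-function error for isolated subsystem 1 can be checked from the numbers reported in the text alone. The sketch below evaluates $\hat V_1(x_1) = \hat W_{c1}^T \sigma_{c1}(x_1)$ with the converged weights and compares it with $J_1^*(x_1)$ on a grid over $[-2, 2]^2$. The grid range and resolution are assumptions chosen to match the axes of Fig. 10.4, and the function names are hypothetical.

```python
# Converged critic weights for subsystem 1, as reported in the text.
W_C1 = [0.498969, 0.000381, 0.999843]

def sigma_c1(x11: float, x12: float) -> list:
    """Critic activation functions [x11^2, x11*x12, x12^2]."""
    return [x11 ** 2, x11 * x12, x12 ** 2]

def v_hat_1(x11: float, x12: float) -> float:
    """Approximate value function: weights times activations."""
    return sum(w * s for w, s in zip(W_C1, sigma_c1(x11, x12)))

def j_star_1(x11: float, x12: float) -> float:
    """Optimal cost function of isolated subsystem 1 from the text."""
    return 0.5 * x11 ** 2 + x12 ** 2

def max_abs_error(lo: float = -2.0, hi: float = 2.0, n: int = 41) -> float:
    """Largest |J1* - V1_hat| over an n-by-n grid on [lo, hi]^2 (assumed grid)."""
    grid = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    return max(abs(j_star_1(a, b) - v_hat_1(a, b)) for a in grid for b in grid)
```

On this grid the largest error is of order $10^{-3}$, attained at the corners where $x_{11}$ and $x_{12}$ have opposite signs.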
Next, by choosing θ1 = θ2 = 1 and π1 = π2 = 2, we can guarantee the
positive definiteness of the matrix Ξ . Thus, (π1 μ̂1 (x1 ), π2 μ̂2 (x2 )) is the decentralized
control strategy of the original interconnected system (10.2.43). Here, we apply the
decentralized control scheme to plant (10.2.43) for 40 s and obtain the evolution
processes of the state trajectories illustrated in Figs. 10.8 and 10.9. By zooming in
on the state trajectories near zero, it is demonstrated that the state trajectories of the
closed-loop system are UUB. These simulation results support the validity of the
decentralized control approach developed in this section.

Example 10.2.2 Consider the classical multi-machine power system with governor
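controllers [6]

$$
\begin{aligned}
\dot\delta_i(t) &= \omega_i(t), \\
\dot\omega_i(t) &= -\frac{D_i}{2H_i}\,\omega_i(t) + \frac{\omega_0}{2H_i}\left[P_{mi}(t) - P_{ei}(t)\right], \\
\dot P_{mi}(t) &= \frac{1}{T_i}\left[-P_{mi}(t) + u_{gi}(t)\right], \\
P_{ei}(t) &= E_{qi}\sum_{j=1}^{N} E_{qj}\left[B_{ij}\sin\delta_{ij}(t) + G_{ij}\cos\delta_{ij}(t)\right],
\end{aligned}
$$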

Fig. 10.6 3-D plot of the approximation error of the cost function of isolated subsystem 2, i.e., $J_2^*(x_2) - \hat V_2(x_2)$. Axes: approximation error versus $x_{21}$ and $x_{22}$

Fig. 10.7 3-D plot of the approximation error of the control law of isolated subsystem 2, i.e., $u_2^*(x_2) - \hat\mu_2(x_2)$. Axes: approximation error versus $x_{21}$ and $x_{22}$

Fig. 10.8 The state trajectories of subsystem 1 ($x_{11}$, $x_{12}$) under the action of the decentralized control strategy $(\pi_1\hat\mu_1(x_1), \pi_2\hat\mu_2(x_2))$. Axes: state trajectories of subsystem 1 versus time (s), with an inset zooming in on 36 to 40 s

Fig. 10.9 The state trajectories of subsystem 2 ($x_{21}$, $x_{22}$) under the action of the decentralized control strategy $(\pi_1\hat\mu_1(x_1), \pi_2\hat\mu_2(x_2))$. Axes: state trajectories of subsystem 2 versus time (s), with an inset zooming in on 36 to 40 s

where, for $1 \le i, j \le N$, $\delta_i(t)$ represents the angle of the $i$th generator; $\delta_{ij}(t) = \delta_i(t) - \delta_j(t)$ is the angular difference between the $i$th and $j$th generators; $\omega_i(t)$ is the relative rotor speed; $P_{mi}(t)$ and $P_{ei}(t)$ are the mechanical power and the electrical power, respectively; $E_{qi}$ is the transient electromotive force in the quadrature axis and is assumed to be constant under high-gain SCR (silicon-controlled rectifier) controllers; $D_i$, $H_i$, and $T_i$ are the damping constant, the inertia constant, and the governor time constant, respectively; $B_{ij}$ and $G_{ij}$ are the imaginary and real parts of the admittance matrix, respectively; $u_{gi}(t)$ is the speed governor control signal for the $i$th generator; and $\omega_0$ is the steady-state frequency.
A three-machine power system is considered in our numerical simulation. The parameters of the system are the same as those in [6]. The weighting matrices are set to be $Q_i^2(x_i) = x_i^T (1000 I_3)\, x_i$ and $R_i = 1$, for $i = 1, 2, 3$. As in [6], the multi-machine power system can be rewritten in the following form:

$$
\begin{aligned}
\Delta\dot\delta_i(t) &= \Delta\omega_i(t), \\
\Delta\dot\omega_i(t) &= -\frac{D_i}{2H_i}\,\Delta\omega_i(t) + \frac{\omega_0}{2H_i}\,\Delta P_{mi}(t), \\
\Delta\dot P_{mi}(t) &= \frac{1}{T_i}\left[-\Delta P_{mi}(t) + u_i(t) - d_i(t)\right].
\end{aligned}
$$

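We define the state

$$
x_i = [\Delta\delta_i(t), \Delta\omega_i(t), \Delta P_{mi}(t)]^T = [x_{i1}, x_{i2}, x_{i3}]^T,
$$

where

$$
\begin{aligned}
\Delta\delta_i(t) &= \delta_i(t) - \delta_{i0}, \\
\Delta\omega_i(t) &= \omega_i(t) - \omega_{i0}, \\
\Delta P_{mi}(t) &= P_{mi}(t) - P_{ei}(t), \\
u_i(t) &= u_{gi}(t) - P_{ei}(t),
\end{aligned}
$$

and

$$
d_i(t) = E_{qi}\sum_{j=1,\, j\neq i}^{N} E_{qj}\left[B_{ij}\cos\delta_{ij}(t) - G_{ij}\sin\delta_{ij}(t)\right]\left[\Delta\omega_i(t) - \Delta\omega_j(t)\right].
$$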
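To make the structure of the rewritten subsystem model concrete, the following sketch codes its right-hand side for one generator. The numerical values of `D`, `H`, `T`, and `omega0` are placeholders chosen for illustration; the text takes the actual parameters from [6], and they are not reproduced here. The interconnection enters only through the scalar `d`, which the caller must supply.

```python
def deviation_dynamics(x, u, d, D=1.0, H=4.0, T=5.0, omega0=314.159):
    """Right-hand side of the rewritten model for one subsystem.

    x = [delta deviation, omega deviation, mechanical power deviation].
    D, H, T, omega0 are placeholder values, not the parameters used in [6].
    """
    d_delta, d_omega, d_pm = x
    return [
        d_omega,                                            # angle deviation rate
        -D / (2 * H) * d_omega + omega0 / (2 * H) * d_pm,   # speed deviation rate
        (-d_pm + u - d) / T,                                # governor dynamics
    ]
```

With $u_i = d_i = 0$, the origin is an equilibrium of this model, and the control and the interconnection term affect the dynamics only through the difference $u_i - d_i$.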

For each isolated subsystem, we denote the weight vectors of the action and critic
networks as

$$
\hat W_{ai} = \left[\hat W^i_{a1}, \hat W^i_{a2}, \hat W^i_{a3}\right]^T,
$$

$$
\hat W_{ci} = \left[\hat W^i_{c1}, \hat W^i_{c2}, \hat W^i_{c3}, \hat W^i_{c4}, \hat W^i_{c5}, \hat W^i_{c6}\right]^T.
$$

The activation functions are chosen as

$$
\phi_{ai}(x_i) = [x_{i1}, x_{i2}, x_{i3}]^T,
$$

$$
\phi_{ci}(x_i) = \left[x_{i1}^2, x_{i1}x_{i2}, x_{i1}x_{i3}, x_{i2}^2, x_{i2}x_{i3}, x_{i3}^2\right]^T.
$$

From these parameters, Nci = 6 and Nai = 3. So, we conduct the simulation with
Ki = 10. We set the initial state and the initial weights of the critic networks as
xi0 = [1, 1, 1]T , Ŵci = 100 × [1, 1, 1, 1, 1, 1]T , for i = 1, 2, 3. The initial weights
of the action networks are chosen as Ŵa1 = −[30, 30, 30]T , Ŵa2 = −[10, 20, 50]T
and Ŵa3 = −[10, 20, 30]T , respectively. The period time T = 0.1 s and exploratory
signals
Ni (t) = 0.01(sin(2π t) + cos(2π t))

are used in the learning process. The least squares problem is solved after 10 samples
are acquired, and thus, the weights of the NNs are updated every 1 s. Figs. 10.10,
10.11, and 10.12 illustrate the evolutions of the weights of the action network for
the isolated subsystems 1, 2, and 3, respectively. It is clear that the weights converge
after some update steps.
Then, we can choose $\pi_1 = \pi_2 = \pi_3 = 1$ to obtain a combined control vector $(\pi_1\hat\mu_1(x_1), \pi_2\hat\mu_2(x_2), \pi_3\hat\mu_3(x_3))$, which can be regarded as the stabilizing decentralized control law of the interconnected system. By applying the decentralized control law to the interconnected power system for 10 s, we obtain the evolution of the power angle deviations and frequencies of the generators shown in Figs. 10.13 and 10.14, respectively. These simulation results support the applicability of the decentralized control law developed in this section.

Fig. 10.10 Evolutions of the weights of the action network 1 ($W^1_{a1}$, $W^1_{a2}$, $W^1_{a3}$). Axes: weight of the action network 1 versus time (s)

Fig. 10.11 Evolutions of the weights of the action network 2 ($W^2_{a1}$, $W^2_{a2}$, $W^2_{a3}$). Axes: weight of the action network 2 versus time (s)

Fig. 10.12 Evolutions of the weights of the action network 3 ($W^3_{a1}$, $W^3_{a2}$, $W^3_{a3}$). Axes: weight of the action network 3 versus time (s)

Fig. 10.13 Angles of the generators under the action of the decentralized control law. Panels: angle of G1, G2, and G3 (degree) versus time (s)

Fig. 10.14 Frequencies of the generators under the action of the decentralized control law. Panels: frequency of G1, G2, and G3 (Hz) versus time (s)

10.3 Conclusions

In this chapter, a decentralized control strategy is developed to deal with the stabilization problem of a class of continuous-time large-scale nonlinear systems using an online PI algorithm. It is shown that the decentralized control strategy of the overall system can be established by adding feedback gains to the obtained optimal control policies. Then, a stabilizing decentralized control law for a class of large-scale nonlinear systems with unknown dynamics is established by using an NN-based online model-free integral PI algorithm with exploration, which solves the HJB equations related to the optimal control problem of the isolated subsystems.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Bakule L (2008) Decentralized control: an overview. Ann Rev Control 32(1):87–98
3. Beard RW, Saridis GN, Wen JT (1997) Galerkin approximations of the generalized Hamilton–
Jacobi–Bellman equation. Automatica 33(12):2159–2177
4. Glad ST (1984) On the gain margin of nonlinear and optimal regulators. IEEE Trans Autom
Control AC–29(7):615–620
5. Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear
systems with completely unknown dynamics. Automatica 48(10):2699–2704
6. Jiang Y, Jiang ZP (2012) Robust adaptive dynamic programming for large-scale systems
with an application to multimachine power systems. IEEE Trans Circ Syst II Express Briefs
59(10):693–697
7. Khan SG, Herrmann G, Lewis FL, Pipe T, Melhuish C (2012) Reinforcement learning and optimal
adaptive control: an overview and implementation examples. Ann Rev Control 36(1):42–59
8. Lee JY, Park JB, Choi YH (2012) Integral Q-learning and explorized policy iteration for adaptive
optimal control of continuous-time linear systems. Automatica 48(11):2850–2859
9. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken
10. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circ Syst Mag 9(3):32–50
11. Liang J, Venayagamoorthy GK, Harley RG (2012) Wide-area measurement based dynamic
stochastic optimal power flow control for smart grids with high variability and uncertainty.
IEEE Trans Smart Grid 3(1):59–69
12. Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class
of unknown discrete-time nonlinear systems using globalized dual heuristic programming.
IEEE Trans Autom Sci Eng 9(3):628–634
13. Liu D, Wang D, Li H (2014) Decentralized stabilization for a class of continuous-time nonlinear
interconnected systems using online learning optimal control approach. IEEE Trans Neural
Network Learn Syst 25(2):418–428
14. Liu D, Li C, Li H, Wang D, Ma H (2015) Neural-network-based decentralized control of
continuous-time nonlinear interconnected systems with unknown dynamics. Neurocomputing
165:90–98
15. Mehraeen S, Jagannathan S (2011) Decentralized optimal control of a class of interconnected
nonlinear discrete-time systems by using online Hamilton-Jacobi-Bellman formulation. IEEE
Trans Neural Network 22(11):1757–1769

16. Park JW, Harley RG, Venayagamoorthy GK (2005) Decentralized optimal neuro-controllers
for generation and transmission devices in an electric power network. Eng Appl Artif Intell
18(1):37–46
17. Rudin W (1976) Principles of mathematical analysis. McGraw-Hill, New York
18. Saberi A (1988) On optimality of decentralized control for a class of nonlinear interconnected
systems. Automatica 24(1):101–104
19. Siljak DD (1991) Decentralized control of complex systems. Academic Press, Boston
20. Siljak DD, Zecevic AI (2005) Control of large-scale systems: beyond decentralized feedback.
Ann Rev Control 29(2):169–179
21. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
22. Tsitsiklis JN, Athans M (1984) Guaranteed robustness properties of multivariable nonlinear
stochastic optimal regulators. IEEE Trans Autom Control AC–29(8):690–696
23. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
24. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Network 22(3):237–246
25. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
26. Wang D, Liu D, Li H, Ma H (2014) Neural-network-based robust optimal control design for a
class of uncertain nonlinear systems via adaptive dynamic programming. Inf Sci 282:167–179
27. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. General Syst Yearb 22:25–38
28. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
29. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
Chapter 11
Learning Algorithms for Differential
Games of Continuous-Time Systems

11.1 Introduction

Adaptive dynamic programming (ADP) [15, 16, 31, 33] has received significantly
increasing attention owing to its self-learning capability. Value iteration-based ADP
algorithms [6, 20, 32, 37, 40] can solve optimal control problems for discrete-time
nonlinear systems without the requirement of complete knowledge of the dynamics
by building a model neural network (NN). Policy iteration (PI)-based ADP algorithms
can be used to solve optimal control problems of continuous-time nonlinear systems
with the requirement of full knowledge [1, 24] or only partial knowledge [27, 30] of
the dynamics. Some extended PI algorithms can solve optimal control problems of
continuous-time nonlinear systems with completely unknown dynamics by building
NN identifiers [10, 38] or with no NN identifiers [13, 14].
However, most of the previous works on ADP solving the optimal control prob-
lems assume that the system is only affected by a single control law. Game theory
provides an ideal environment to study multi-player optimal decision and control
problems. The two-player non-cooperative zero-sum game [9] has received much attention since it also provides the solution of H∞ optimal control problems [8]. Solving a zero-sum game relies on solving the Hamilton–Jacobi–Isaacs (HJI) equation, which reduces to the game algebraic Riccati equation (GARE) [9] when the system has linear dynamics and the cost function is quadratic. Different from the non-cooperative zero-sum game, the nonzero-sum game offers a suitable theoretical framework for considering both cooperative and non-cooperative objectives. A multi-player nonzero-sum game requires solving the coupled Hamilton–Jacobi (HJ) equations, which reduce to the coupled algebraic Riccati equations in the linear quadratic case. Generally speaking, neither the HJI equation nor the coupled HJ equations can be solved analytically due to their nonlinear nature.
ADP algorithms have been used to solve the zero-sum game problems for discrete-
time nonlinear systems [4, 5, 19]. For continuous-time linear systems, an online
ADP algorithm was proposed for two-player zero-sum games without requiring the

© Springer International Publishing AG 2017 417


D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_11

knowledge of internal system dynamics [29]. For continuous-time nonlinear systems
with constrained inputs, an offline PI scheme was proposed for two-player zero-sum
games [2, 3]. In [39], four action networks and two critic networks were used to
solve two-player zero-sum games without L2 gain condition. In [25], an online PI
algorithm with two iterative loops was presented to solve two-player zero-sum games
for continuous-time nonlinear systems with known dynamics. In [34, 35], an online
simultaneous policy update algorithm with only one iterative loop was proposed for
two-player zero-sum games by updating policies of both control and disturbance
players simultaneously. Moreover, an online ADP approach was proposed to solve
two-player nonzero-sum games of continuous-time linear systems without using
complete information [28]. In [26], an online adaptive control algorithm based on PI
was presented to solve multi-player nonzero-sum games of continuous-time nonlin-
ear systems with known system dynamics. However, it is difficult for many practical
problems to obtain the knowledge of the system dynamics.
In this chapter, we will develop ADP algorithms for differential games of
continuous-time systems. First, an online integral PI algorithm is developed to learn
the Nash equilibrium solution for two-player zero-sum linear differential games
with completely unknown dynamics [17]. It results in a fully model-free method
solving the GARE forward in time, where both internal and drift system dynamics
are not required. The present algorithm updates value function, control policy, and
disturbance policy simultaneously. Second, an iterative ADP algorithm is developed
for multi-player zero-sum differential games of continuous-time uncertain nonlinear
systems [18]. It is proved that the iterative value functions converge to the opti-
mal solution if the optimal solution of the multi-player zero-sum differential games
exists. By using NNs, the iterative control pairs can be obtained without know-
ing the system function where the stability and convergence properties are proved.
Finally, an online synchronous approximate optimal learning algorithm based on PI
is developed to solve the Nash equilibrium of multi-player nonzero-sum games with
unknown nonlinear dynamics [21]. It is proved that the PI algorithm for nonzero-sum
games with nonlinear dynamics is mathematically equivalent to the quasi-Newton’s
iteration. A model NN is established to identify the unknown continuous-time non-
linear systems using data measured along the system trajectories. For each player, a
critic NN and an action NN are used to approximate its value function and control
policy, respectively, and only critic NN weights need to be tuned. The uniform ulti-
mate boundedness (UUB) of the closed-loop system is proved based on Lyapunov
approach. Simulation examples are given to demonstrate the effectiveness of the
present schemes.

11.2 Integral Policy Iteration for Two-Player Zero-Sum Games

Consider a class of continuous-time linear dynamical systems described by



ẋ = Ax + B1 u + B2 w, (11.2.1)

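where $x \in \mathbb{R}^n$ is the system state with initial state $x_0$, $u \in \mathbb{R}^m$ is the control input, and $w \in \mathbb{R}^q$ is the external disturbance input with $w \in L_2[0, \infty)$. $A \in \mathbb{R}^{n\times n}$, $B_1 \in \mathbb{R}^{n\times m}$, and $B_2 \in \mathbb{R}^{n\times q}$ are the unknown system matrices.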
Define the infinite horizon performance index (or cost function)

$$
J(x_0, u, w) = \int_0^\infty \left( x^TQx + u^TRu - \gamma^2 w^Tw \right) d\tau \triangleq \int_0^\infty r(x, u, w)\, d\tau
$$

with $Q = Q^T \ge 0$, $R = R^T > 0$, and a predetermined constant $\gamma \ge \gamma^* \ge 0$, where $\gamma^*$ denotes the smallest $\gamma$ for which the system (11.2.1) is stable. For feedback control policy $u(x)$ and disturbance policy $w(x)$, we define the value function of the policies as

$$
V(x) = \int_t^\infty \left( x^TQx + u^TRu - \gamma^2 w^Tw \right) d\tau. \tag{11.2.2}
$$

Then, we define the two-player zero-sum differential game as

$$
J^*(x_0) = \min_u \max_w J(x_0, u, w) = \min_u \max_w \int_0^\infty \left( x^TQx + u^TRu - \gamma^2 w^Tw \right) d\tau,
$$

where the control policy player $u$ seeks to minimize the value function while the disturbance policy player $w$ desires to maximize it. The goal is to find the saddle point $(u^*, w^*)$ which satisfies the following inequalities

$$
J(x_0, u, w^*) \ge J(x_0, u^*, w^*) \ge J(x_0, u^*, w)
$$

for any state feedback control policy $u$ and disturbance policy $w$.


We use the notations $u = -Kx$ and $w = Lx$ for the state feedback control policy and the disturbance policy, respectively. Then, the value function (11.2.2) can be represented as $V(x) = x^TPx$, where the matrix $P$ is determined by $K$ and $L$. The saddle point can be obtained by solving the following continuous-time GARE [9]

$$
A^TP^* + P^*A + Q - P^*B_1R^{-1}B_1^TP^* + \gamma^{-2}P^*B_2B_2^TP^* = 0. \tag{11.2.3}
$$

Defining $P^*$ as the unique positive definite solution of (11.2.3), the saddle point of the zero-sum game is

$$
\begin{aligned}
u^* &= -K^*x = -R^{-1}B_1^TP^*x, \\
w^* &= L^*x = \gamma^{-2}B_2^TP^*x,
\end{aligned} \tag{11.2.4}
$$

and the optimal game value function is

$$
V^*(x_0) = x_0^TP^*x_0.
$$
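For a scalar system the GARE (11.2.3) is a quadratic equation in $P^*$ and can be solved in closed form, which gives a quick way to check the saddle-point inequalities numerically. The sketch below is only an illustration under stated assumptions: the plant and weight values are hypothetical, and the saddle check is restricted to linear policies $u = -Kx$, $w = Lx$ that keep the closed loop stable, for which $J = x_0^2 (q + rK^2 - \gamma^2L^2)/(-2(a - b_1K + b_2L))$.

```python
import math

# Hypothetical scalar data (n = m = q = 1), chosen only for illustration.
a, b1, b2 = -1.0, 1.0, 1.0
q, r, gamma = 1.0, 1.0, 2.0

def gare_solution() -> float:
    """Positive root of 2aP + q - c P^2 = 0 with c = b1^2/r - b2^2/gamma^2 > 0."""
    c = b1 ** 2 / r - b2 ** 2 / gamma ** 2
    return (a + math.sqrt(a ** 2 + c * q)) / c

def cost(K: float, L: float, x0: float = 1.0) -> float:
    """J(x0, u, w) for u = -Kx, w = Lx; valid only for a stable closed loop."""
    a_cl = a - b1 * K + b2 * L
    assert a_cl < 0, "closed loop must be stable for the integral to converge"
    return x0 ** 2 * (q + r * K ** 2 - gamma ** 2 * L ** 2) / (-2 * a_cl)

P_star = gare_solution()
K_star = b1 * P_star / r            # control gain from (11.2.4)
L_star = b2 * P_star / gamma ** 2   # disturbance gain from (11.2.4)
```

With these values, `cost(K_star, L_star)` equals `P_star`, perturbing `K` away from `K_star` raises the cost, and perturbing `L` away from `L_star` lowers it, as the saddle-point inequalities require.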

The solution of the $H_\infty$ control problem can be obtained by solving the saddle point of the equivalent two-player zero-sum game problem. The following $H_\infty$ norm (the $L_2$-gain) is used to measure the performance of the control system.

Definition 11.2.1 Let $\gamma \ge 0$ be a certain prescribed level of disturbance attenuation. The system (11.2.1) is said to have $L_2$-gain less than or equal to $\gamma$ if

$$
\int_0^\infty \left( x^TQx + u^TRu \right) d\tau \le \gamma^2 \int_0^\infty w^Tw\, d\tau \tag{11.2.5}
$$

for all $w \in L_2[0, \infty)$.

The $H_\infty$ control problem is to find a state feedback control policy $u$ such that the closed-loop system is stable and satisfies the condition (11.2.5). For every $\gamma \ge \gamma^*$, the GARE has a unique positive definite solution [9].

A stabilizing control policy exists with the following standard assumption.

Assumption 11.2.1 The pair $(A, B_1)$ is stabilizable and the pair $(A, Q^{1/2})$ is observable.

11.2.1 Derivation of Integral Policy Iteration

In this section, we will develop an online model-free integral PI algorithm for the linear continuous-time zero-sum differential game with completely unknown dynamics. First, we assume that an initial stabilizing control matrix $K_1$ is given. Define $V_i(x) = x^TP_ix$, $u_i(x) = -K_ix$, and $w_i(x) = L_ix$ as the value function, control policy, and disturbance policy, respectively, for each iterative step $i \ge 0$.

To relax the assumption of exact knowledge of $A$, $B_1$, and $B_2$, we use $N_1$ and $N_2$ to denote the small exploratory signals added to the control policy $u_i$ and disturbance policy $w_i$, respectively. The exploratory signals are assumed to be any nonzero measurable signals bounded by $e_M > 0$, i.e., $\|N_1\| \le e_M$, $\|N_2\| \le e_M$. Then, the original system (11.2.1) becomes

$$
\dot x = Ax + B_1(u_i + N_1) + B_2(w_i + N_2). \tag{11.2.6}
$$

The derivative of the value function with respect to time is calculated as

$$
\dot V_i(x) = -x^TQx - x^TK_i^TRK_ix + \gamma^2x^TL_i^TL_ix + 2x^TK_{i+1}^TRN_1 + 2\gamma^2x^TL_{i+1}^TN_2, \tag{11.2.7}
$$

where we have used $K_{i+1} = R^{-1}B_1^TP_i$ and $L_{i+1} = \gamma^{-2}B_2^TP_i$ according to (11.2.4). Integrating (11.2.7) from $t$ to $t + T$ with any time interval $T > 0$, we have

$$
x_{t+T}^TP_ix_{t+T} - x_t^TP_ix_t = -\int_t^{t+T} r(x, u_i, w_i)\, d\tau + 2\int_t^{t+T} x^TK_{i+1}^TRN_1\, d\tau + 2\gamma^2\int_t^{t+T} x^TL_{i+1}^TN_2\, d\tau,
$$

where the values of the state $x$ at time $t$ and $t + T$ are denoted by $x_t$ and $x_{t+T}$. Therefore, we obtain the online model-free integral PI algorithm (Algorithm 11.2.1) for zero-sum differential games.

Algorithm 11.2.1 Online model-free integral PI for zero-sum games

Step 1. Give an initial stabilizing policy $u_1 = -K_1x$ and $w_1 = L_1x$. Set $i = 1$ and $P_0 = 0$.
Step 2. Policy Evaluation and Improvement: For the system (11.2.6) with policies $u_i = -K_ix$ and $w_i = L_ix$, and exploratory signals $N_1$ and $N_2$, solve the following equation for $P_i$, $K_{i+1}$, and $L_{i+1}$:

$$
x_t^TP_ix_t = x_{t+T}^TP_ix_{t+T} + \int_t^{t+T} r(x, u_i, w_i)\, d\tau - 2\int_t^{t+T} x^TK_{i+1}^TRN_1\, d\tau - 2\gamma^2\int_t^{t+T} x^TL_{i+1}^TN_2\, d\tau. \tag{11.2.8}
$$

Step 3. If $\|P_i - P_{i-1}\| \le \xi$ ($\xi$ is a prescribed small positive real number), stop and return $P_i$; else, set $i = i + 1$ and go to Step 2.

Remark 11.2.1 Equation (11.2.8) plays an important role in relaxing the assumption
of the knowledge of system dynamics, since A, B1 , and B2 do not appear in (11.2.8).
Only online data measured along the system trajectories are required to run this
algorithm. Our method avoids the identification of A, B1 , and B2 whose information
is embedded in the online measured data. In other words, the lack of knowledge
about the system dynamics does not have any impact on our method to obtain the
Nash equilibrium. Thus, our method will not be affected by the errors between the
identification model and the real system, and it can respond fast to the changes of
the system dynamics.
Remark 11.2.2 This algorithm is actually a PI method, but the policy evaluation and
policy improvement are performed at the same time. Compared with the model-based
method [25] and partially model-free method [35], our algorithm is a fully model-
free method which does not require knowledge of the system dynamics. Different
from the iterative method with inner loop on disturbance policy and outer loop on

control policy [25], and the method with only one iterative loop by updating control
and disturbance policies simultaneously [35], the method developed here updates the
value function, control, and disturbance policies at the same time.

Remark 11.2.3 Small exploratory signals can satisfy the persistence of excitation
(PE) condition to efficiently update the value function and the policies. To guarantee
the PE condition, the state may need to be reset during the iterative process, but
it results in technical problems for stability analysis of the closed-loop system. An
alternative way is to add exploratory signals. Because the effects of the exploratory
signals are accounted for in (11.2.8), the solution obtained by our method is exactly
the same as the one determined by the GARE.

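Next, we will show the relationship between the present algorithm and the Q-learning algorithm by extending the concept of Q-function to zero-sum games that are continuous in time, state, and action space. The optimal continuous-time Q-function for zero-sum games is defined as the following quadratic form:

$$
\begin{aligned}
\mathcal{Q}^*(x, u, w) &= \begin{bmatrix} x^T & u^T & w^T \end{bmatrix} H^* \begin{bmatrix} x^T & u^T & w^T \end{bmatrix}^T \\
&= \begin{bmatrix} x^T & u^T & w^T \end{bmatrix}
\begin{bmatrix} H_{11}^* & H_{12}^* & H_{13}^* \\ H_{21}^* & H_{22}^* & H_{23}^* \\ H_{31}^* & H_{32}^* & H_{33}^* \end{bmatrix}
\begin{bmatrix} x \\ u \\ w \end{bmatrix} \\
&= \begin{bmatrix} x^T & u^T & w^T \end{bmatrix}
\begin{bmatrix} A^TP^* + P^*A + Q & P^*B_1 & P^*B_2 \\ B_1^TP^* & R & 0 \\ B_2^TP^* & 0 & -\gamma^2 \end{bmatrix}
\begin{bmatrix} x \\ u \\ w \end{bmatrix}.
\end{aligned} \tag{11.2.9}
$$

It can be seen that the matrix $H^*$ is associated with $P^*$ in the GARE. By solving

$$
\nabla_u \mathcal{Q}^*(x, u, w) = 0 \quad \text{and} \quad \nabla_w \mathcal{Q}^*(x, u, w) = 0,
$$

we can obtain

$$
\begin{aligned}
u^* &= -(H_{22}^*)^{-1}(H_{12}^*)^Tx, \\
w^* &= -(H_{33}^*)^{-1}(H_{13}^*)^Tx,
\end{aligned}
$$

which are the same as the equations in (11.2.4). Since

$$
\mathcal{Q}^*(x_0, u^*, w^*) = V^*(x_0),
$$

the relationship between $P^*$ and $H^*$ can be represented as follows:

$$
P^* = \begin{bmatrix} I_n & -K^T & L^T \end{bmatrix} H^* \begin{bmatrix} I_n \\ -K \\ L \end{bmatrix}.
$$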

According to (11.2.9), we can obtain

$$
H_{11}^* = H_{12}^*(H_{22}^*)^{-1}H_{21}^* + H_{13}^*(H_{33}^*)^{-1}H_{31}^*,
$$

and thus $H_{11}^*$ is a redundant term. Define

$$
H^i = \begin{bmatrix} H_{11}^i & H_{12}^i & H_{13}^i \\ H_{21}^i & R & 0 \\ H_{31}^i & 0 & -\gamma^2 \end{bmatrix},
$$

where $H_{21}^i = (H_{12}^i)^T$ and $H_{31}^i = (H_{13}^i)^T$.

Now, we are in a position to develop the online integral Q-learning algorithm (Algorithm 11.2.2).

Algorithm 11.2.2 Online integral Q-learning for zero-sum games

Step 1. Give an initial stabilizing policy $u_1 = -K_1x$ and $w_1 = L_1x$.
Step 2. Set $i = 0$, $H_{11}^0 = 0$, $H_{12}^0 = K_1^TR$, and $H_{13}^0 = \gamma^2L_1^T$.
Step 3. Policy Evaluation: Let $i = i + 1$. For the system (11.2.6) with policies $u_i = -K_ix$ and $w_i = L_ix$, and exploratory signals $N_1$ and $N_2$, solve the following equation for $H_{11}^i$, $H_{12}^i$, and $H_{13}^i$:

$$
x_t^TH_{11}^ix_t = x_{t+T}^TH_{11}^ix_{t+T} + \int_t^{t+T} r(x, u_i, w_i)\, d\tau - 2\int_t^{t+T} x^TH_{12}^iN_1\, d\tau - 2\int_t^{t+T} x^TH_{13}^iN_2\, d\tau. \tag{11.2.10}
$$

Step 4. Policy Improvement: Update the following parameters

$$
K_{i+1} = R^{-1}(H_{12}^i)^T, \quad L_{i+1} = \gamma^{-2}(H_{13}^i)^T.
$$

Step 5. If $\|H_{11}^i - H_{11}^{i-1}\| + \|H_{12}^i - H_{12}^{i-1}\| + \|H_{13}^i - H_{13}^{i-1}\| \le \zeta$ ($\zeta$ is a prescribed small positive real number), stop and return $H^i$; else, go to Step 3.

Remark 11.2.4 Note that the model-free integral PI algorithm (Algorithm 11.2.1)
is equivalent to the integral Q-learning algorithm (Algorithm 11.2.2) for zero-sum
games. As PI methods, the algorithms developed above require an initial stabilizing
control policy which is usually obtained by experience. We can also obtain $B_1 = (H_{11}^i)^{-1}H_{12}^i$ and $B_2 = (H_{11}^i)^{-1}H_{13}^i$.

11.2.2 Convergence Analysis

In this section, we will provide a convergence analysis of the present algorithms for
two-player zero-sum differential games. It can be shown that the present model-free
integral PI and Q-learning algorithms are equivalent to Newton’s method.

Theorem 11.2.1 For an initial stabilizing control policy $u_1 = -K_1x$, the sequences $\{P_i\}_{i=1}^\infty$, $\{K_i\}_{i=1}^\infty$, and $\{L_i\}_{i=1}^\infty$ obtained by solving (11.2.8) in Algorithm 11.2.1 converge to the optimal solution $P^*$ of the GARE and the saddle point $K^*$ and $L^*$, respectively, as $i \to \infty$.

Proof First, for an initial stabilizing control policy $u_1 = -K_1x$, we can prove that the present Algorithm 11.2.1 is equivalent to the following Lyapunov equation

$$
A_i^TP_i + P_iA_i = -M_i, \tag{11.2.11}
$$

where $A_i = A - B_1K_i + B_2L_i$ and $M_i = Q + K_i^TRK_i - \gamma^2L_i^TL_i$.

With the control policy $u_i = -K_ix$, the disturbance policy $w_i = L_ix$, and the exploratory signals $N_1$ and $N_2$, the closed-loop system (11.2.1) becomes

$$
\dot x = A_ix + B_1N_1 + B_2N_2.
$$

For the Lyapunov function $V_i(x) = x^TP_ix$, the derivative along the system trajectories can be calculated as

$$
\begin{aligned}
\dot V_i(x) &= \dot x^TP_ix + x^TP_i\dot x \\
&= x^TA_i^TP_ix + x^TP_iA_ix + (B_1N_1 + B_2N_2)^TP_ix + x^TP_i(B_1N_1 + B_2N_2) \\
&= x^T(A_i^TP_i + P_iA_i)x + 2x^TP_iB_1N_1 + 2x^TP_iB_2N_2 \\
&= x^T(A_i^TP_i + P_iA_i)x + 2x^TK_{i+1}^TRN_1 + 2\gamma^2x^TL_{i+1}^TN_2.
\end{aligned} \tag{11.2.12}
$$

Integrating (11.2.12) from $t$ to $t + T$ yields

$$
V_i(x_{t+T}) - V_i(x_t) = \int_t^{t+T} x^T(A_i^TP_i + P_iA_i)x\, d\tau + 2\int_t^{t+T} x^TK_{i+1}^TRN_1\, d\tau + 2\int_t^{t+T} \gamma^2x^TL_{i+1}^TN_2\, d\tau. \tag{11.2.13}
$$

According to (11.2.8), we have

$$
V_i(x_{t+T}) - V_i(x_t) = -\int_t^{t+T} r(x, u_i, w_i)\, d\tau + 2\int_t^{t+T} x^TK_{i+1}^TRN_1\, d\tau + 2\int_t^{t+T} \gamma^2x^TL_{i+1}^TN_2\, d\tau. \tag{11.2.14}
$$

Therefore, combining (11.2.13) and (11.2.14), we have

$$
x^T(A_i^TP_i + P_iA_i)x = -r(x, u_i, w_i) = -x^T(Q + K_i^TRK_i - \gamma^2L_i^TL_i)x,
$$

i.e.,

$$
A_i^TP_i + P_iA_i = -M_i,
$$

where $M_i = Q + K_i^TRK_i - \gamma^2L_i^TL_i$.

Then, according to the results in [35], the sequence $\{P_i\}_{i=1}^\infty$ generated by (11.2.11) is equivalent to Newton's method and converges to the optimal solution $P^*$ of the GARE as $i \to \infty$. Furthermore, the sequences $\{K_i\}_{i=1}^\infty$ and $\{L_i\}_{i=1}^\infty$ converge to the saddle point $K^*$ and $L^*$, respectively, as $i \to \infty$. This completes the proof of the theorem.
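The equivalence with Newton's method can be checked by hand in the scalar case. With $c = b_1^2/r - b_2^2/\gamma^2$, the GARE residual is $f(P) = 2aP + q - cP^2$; substituting $K = b_1P/r$ and $L = b_2P/\gamma^2$ into (11.2.11) gives $A_i = a - cP$ and $M_i = q + cP^2$, so the Lyapunov update $-M_i/(2A_i)$ coincides with the Newton update $P - f(P)/f'(P)$. The sketch below compares the two sequences numerically; the plant and weight values are hypothetical, and this is a model-based check, not the data-driven algorithm.

```python
# Hypothetical scalar data, chosen only for illustration.
a, b1, b2 = -1.0, 1.0, 1.0
q, r, gamma = 1.0, 1.0, 2.0
c = b1 ** 2 / r - b2 ** 2 / gamma ** 2

def lyapunov_step(P: float) -> float:
    """One pass of (11.2.11): gains from P, then solve 2*A_i*P_new = -M_i."""
    K, L = b1 * P / r, b2 * P / gamma ** 2
    A_i = a - b1 * K + b2 * L
    M_i = q + r * K ** 2 - gamma ** 2 * L ** 2
    return -M_i / (2 * A_i)

def newton_step(P: float) -> float:
    """One Newton step on the scalar GARE residual f(P) = 2aP + q - c*P^2."""
    f = 2 * a * P + q - c * P ** 2
    df = 2 * a - 2 * c * P
    return P - f / df

def iterate(step, P: float = 0.0, n: int = 6) -> list:
    """Apply `step` n times starting from P (P = 0 corresponds to K_1 = L_1 = 0)."""
    history = [P]
    for _ in range(n):
        history.append(step(history[-1]))
    return history
```

Starting from $P = 0$, both `iterate(lyapunov_step)` and `iterate(newton_step)` produce the same sequence, and the last element satisfies the scalar GARE to rounding error.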
The next theorem will show the convergence of the model-free integral Q-learning algorithm for zero-sum games.

Theorem 11.2.2 For an initial stabilizing control policy $u_1 = -K_1x$, the sequence $\{H^i\}_{i=1}^\infty$ obtained by solving (11.2.10) in Algorithm 11.2.2 converges to the optimal solution $H^*$; i.e., $\mathcal{Q}^*$ can be achieved, as $i \to \infty$.

Proof Because the iteration process of $H^i$ obtained by solving (11.2.10) in Algorithm 11.2.2 is equivalent to that of $P_i$ obtained by solving (11.2.8) in Algorithm 11.2.1, as $i \to \infty$,

$$
H^i \to \begin{bmatrix} A^TP^* + P^*A + Q & P^*B_1 & P^*B_2 \\ B_1^TP^* & R & 0 \\ B_2^TP^* & 0 & -\gamma^2 \end{bmatrix} = H^*.
$$

The proof is complete.

11.2.3 Neural Network Implementation

In this section, an online implementation of Algorithm 11.2.1 is developed based


on ADP with the least squares method. Algorithm 11.2.2 can be implemented in
the same way. Here parametric structures are used to approximate the game value
function, control policy, and disturbance policy.
Given a stabilizing control policy $u_i = -K_i x$, a triplet $(P_i, K_{i+1}, L_{i+1})$ with $P_i = P_i^{T} > 0$ can be uniquely determined by (11.2.8). We now define the following two operators: $P \in \mathbb{R}^{n \times n} \to \hat{P} \in \mathbb{R}^{\frac{1}{2} n(n+1)}$ and $x \in \mathbb{R}^{n} \to \bar{x} \in \mathbb{R}^{\frac{1}{2} n(n+1)}$, where

$$\hat{P} = [p_{11}, 2p_{12}, \ldots, 2p_{1n}, p_{22}, 2p_{23}, \ldots, 2p_{(n-1)n}, p_{nn}]^{T},$$

$$\bar{x} = [x_1^2, x_1 x_2, \ldots, x_1 x_n, x_2^2, x_2 x_3, \ldots, x_{n-1} x_n, x_n^2]^{T}.$$

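Hence,

$$x_{t+(k-1)T}^{T} P_i x_{t+(k-1)T} - x_{t+kT}^{T} P_i x_{t+kT} = (\bar{x}_{t+(k-1)T} - \bar{x}_{t+kT})^{T} \hat{P}_i,$$

where $k \in \mathbb{Z}_+$ and $k \ge 1$. Using the Kronecker product $\otimes$, we obtain

$$x^{T} K_{i+1}^{T} R N_1 = (x \otimes N_1)^{T} (I_n \otimes R) \, \mathrm{vec}(K_{i+1}),$$

$$x^{T} L_{i+1}^{T} N_2 = (x \otimes N_2)^{T} \mathrm{vec}(L_{i+1}).$$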
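The two reparameterizations above can be checked numerically. The following is a minimal sketch, assuming a column-stacking `vec` and a symmetric `R`; the helper names `quad_basis`, `sym_to_vec`, and `vec` are hypothetical and the test data are random, not taken from the text.

```python
import numpy as np

def quad_basis(x):
    """x_bar = [x1^2, x1*x2, ..., x1*xn, x2^2, ..., xn^2] (hypothetical helper)."""
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

def sym_to_vec(P):
    """P_hat = [p11, 2*p12, ..., 2*p1n, p22, ..., pnn] for a symmetric P."""
    n = P.shape[0]
    return np.array([P[i, j] if i == j else 2.0 * P[i, j]
                     for i in range(n) for j in range(i, n)])

def vec(M):
    """Column-stacking vectorization (assumed convention)."""
    return M.flatten(order="F")

rng = np.random.default_rng(0)
n, m = 4, 2                      # state and control dimensions (illustrative)
x = rng.standard_normal(n)
S = rng.standard_normal((n, n))
P = S + S.T                      # symmetric matrix
K = rng.standard_normal((m, n))  # feedback gain
G = rng.standard_normal((m, m))
R = G @ G.T + np.eye(m)          # symmetric positive definite weight
N1 = rng.standard_normal(m)      # exploratory signal sample

# Quadratic form written with the two operators.
quad_direct = x @ P @ x
quad_param = quad_basis(x) @ sym_to_vec(P)

# Cross term written with the Kronecker product.
cross_direct = x @ K.T @ R @ N1
cross_param = np.kron(x, N1) @ np.kron(np.eye(n), R) @ vec(K)
```

Both pairs agree to floating-point precision, which is what allows the unknowns $\hat{P}_i$, $\mathrm{vec}(K_{i+1})$, and $\mathrm{vec}(L_{i+1})$ to enter the data equation linearly.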

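Using the expressions established above, (11.2.8) can be rewritten in a general compact form as

$$\psi_k^{T} \begin{bmatrix} \hat{P}_i \\ \mathrm{vec}(K_{i+1}) \\ \mathrm{vec}(L_{i+1}) \end{bmatrix} = \theta_k, \quad \forall i \in \mathbb{Z}_+, \tag{11.2.15}$$

where

$$\theta_k = \int_{t+(k-1)T}^{t+kT} r(x, u_i, w_i) \, d\tau,$$

$$\psi_k = \left[ (\bar{x}_{t+(k-1)T} - \bar{x}_{t+kT})^{T}, \; 2 \int_{t+(k-1)T}^{t+kT} (x \otimes N_1)^{T} d\tau \, (I_n \otimes R), \; 2 \gamma^{2} \int_{t+(k-1)T}^{t+kT} (x \otimes N_2)^{T} d\tau \right]^{T},$$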

and the measurement time is from $t + (k-1)T$ to $t + kT$. Since (11.2.15) is only a one-dimensional equation, we cannot guarantee the uniqueness of the solution. We will use the least squares method to solve this problem, where the parameter vector is determined in a least squares sense over a compact set $\Omega$.

For any positive integer $N$, denote $\Phi = [\psi_1, \ldots, \psi_N]$ and $\Theta = [\theta_1, \ldots, \theta_N]^{T}$. Then, we have the following $N$-dimensional equation

$$\Phi^{T} \begin{bmatrix} \hat{P}_i \\ \mathrm{vec}(K_{i+1}) \\ \mathrm{vec}(L_{i+1}) \end{bmatrix} = \Theta, \quad \forall i \in \mathbb{Z}_+.$$

If $\Phi^{T}$ has full column rank, the parameters can be solved by

$$\begin{bmatrix} \hat{P}_i \\ \mathrm{vec}(K_{i+1}) \\ \mathrm{vec}(L_{i+1}) \end{bmatrix} = (\Phi \Phi^{T})^{-1} \Phi \Theta. \tag{11.2.16}$$

Fig. 11.1 Flowchart of Algorithm 11.2.1

Therefore, the number of collected data points $N$ needs to be at least

$$N_{\min} = \mathrm{rank}(\Phi),$$

i.e.,

$$N_{\min} = \frac{n(n+1)}{2} + nm + nq,$$

which will guarantee the existence of $(\Phi \Phi^{T})^{-1}$.
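The batch least squares step in (11.2.16) can be sketched as follows. This is an illustration under stated assumptions only: the regressors in `Phi` are random stand-ins for the measured $\psi_k$, `true_params` is a synthetic parameter vector, and the function names `n_min` and `solve_parameters` are hypothetical.

```python
import numpy as np

def n_min(n, m, q):
    """Minimum number of data points: n(n+1)/2 + n*m + n*q."""
    return n * (n + 1) // 2 + n * m + n * q

def solve_parameters(Phi, Theta):
    """Batch least squares (Phi Phi^T)^{-1} Phi Theta.

    Phi has one column psi_k per measurement interval; Theta stacks theta_k.
    """
    return np.linalg.solve(Phi @ Phi.T, Phi @ Theta)

rng = np.random.default_rng(1)
n, m, q = 4, 1, 1                      # dimensions of the power system example
d = n_min(n, m, q)                     # number of unknown parameters
N = 50                                 # collected data points, N >= d
true_params = rng.standard_normal(d)   # synthetic stand-in for [P_hat; vec K; vec L]
Phi = rng.standard_normal((d, N))      # synthetic regressors (assumption)
Theta = Phi.T @ true_params            # noise-free right-hand side

estimate = solve_parameters(Phi, Theta)
# Numerically safer alternative that avoids forming Phi Phi^T explicitly.
estimate_lstsq = np.linalg.lstsq(Phi.T, Theta, rcond=None)[0]
```

For $n = 4$, $m = 1$, $q = 1$ the count is $10 + 4 + 4 = 18$ unknowns, so the 50 samples per update used in the simulation below comfortably exceed $N_{\min}$. In practice `np.linalg.lstsq` is usually preferable to the normal equations when $\Phi$ is poorly conditioned.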


The least squares problem in (11.2.16) can be solved in real time after collecting
enough data points generated from system (11.2.6). The flowchart of this algorithm
is shown in Fig. 11.1. The solution can be obtained using the batch least squares, the
recursive least squares algorithms, or the gradient descent algorithms.

Remark 11.2.5 The sequence $\{P_i\}_{i=1}^{\infty}$ calculated by the least squares method converges to the solution of GARE. The PE condition is required in adaptive control to perform system identification. Several types of exploratory signals have been used for that purpose, such as piecewise constant signals [14], sinusoidal signals with different frequencies [13], random noise [4, 39], and exponentially decreasing probing noise [25].

11.2.4 Simulation Studies

In this section, we demonstrate the effectiveness of the present algorithm by designing


an H∞ state feedback controller for a power system.
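Example 11.2.1 Consider the following linear model of a power system that was studied in [29]

$$\begin{aligned}
\dot{x} &= A x + B_1 u + B_2 w \\
&= \begin{bmatrix} -0.0665 & 8 & 0 & 0 \\ 0 & -3.663 & 3.663 & 0 \\ -6.86 & 0 & -13.736 & -13.736 \\ 0.6 & 0 & 0 & 0 \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 13.736 \\ 0 \end{bmatrix} u + \begin{bmatrix} -8 \\ 0 \\ 0 \\ 0 \end{bmatrix} w,
\end{aligned} \tag{11.2.17}$$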

where the state vector is

$$x = [\Delta f, \Delta P_g, \Delta X_g, \Delta E]^{T},$$

$\Delta f$ (Hz) is the incremental frequency deviation, $\Delta P_g$ (p.u. MW) is the incremental change in generator output, $\Delta X_g$ (p.u. MW) is the incremental change in governor valve position, and $\Delta E$ is the incremental change in integral control. We assume that the dynamics of system (11.2.17) are unknown. The matrices $Q$ and $R$ in the cost function are identity matrices of appropriate dimensions, and $\gamma = 3.5$. Using the system model (11.2.17), the matrix in the optimal value function of the zero-sum game is

$$P^* = \begin{bmatrix} 0.8335 & 0.9649 & 0.1379 & 0.8005 \\ 0.9649 & 1.4751 & 0.2358 & 0.8046 \\ 0.1379 & 0.2358 & 0.0696 & 0.0955 \\ 0.8005 & 0.8046 & 0.0955 & 2.6716 \end{bmatrix}.$$

Now we will use the present online model-free integral PI algorithm to solve this
problem. The initial state is selected as x0 = [0.1, 0.2, 0.2, 0.1]T . The simulation is
conducted using data obtained along the system trajectory at every 0.01 s. The least
squares problem is solved after 50 data samples are acquired, and thus, the parameters
of the control policy are updated every 0.5 s. The parameters of the critic network, the control action network, and the disturbance action network are all initialized to zero. Similar to [39], the PE condition is ensured by adding small zero-mean Gaussian noises with variances to the control and disturbance inputs.

Fig. 11.2 Convergence of the game value function matrix $\hat{P}_i$ (weights of critic network versus time in seconds)
Figure 11.2 presents the evolution of the parameters of the game value function
during the learning process. It is clear that Algorithm 11.2.1 is convergent after 10
iterative steps. The obtained approximate game value function is given by the matrix

$$P_{10} = \begin{bmatrix} 0.8335 & 0.9649 & 0.1379 & 0.8005 \\ 0.9649 & 1.4752 & 0.2359 & 0.8047 \\ 0.1379 & 0.2359 & 0.0696 & 0.0956 \\ 0.8005 & 0.8047 & 0.0956 & 2.6718 \end{bmatrix},$$

and $\| P_{10} - P^* \| = 2.9375 \times 10^{-4}$. We can find that the solution obtained by the online model-free integral PI algorithm is quite close to the exact one obtained by solving GARE. Figures 11.3 and 11.4 show the convergence process of the control and disturbance action network parameters. The obtained $H_\infty$ state feedback control policy is $u_{11} = -[1.8941, 3.2397, 0.9563, 1.3126] x$.
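The reported matrices can be sanity-checked against the model. The sketch below assumes the GARE takes the form $A^{T}P + PA + Q - P B_1 R^{-1} B_1^{T} P + \gamma^{-2} P B_2 B_2^{T} P = 0$, which is what the Lyapunov equation $A_i^{T} P_i + P_i A_i = -M_i$ reduces to when $K = R^{-1} B_1^{T} P$ and $L = \gamma^{-2} B_2^{T} P$. Because the matrices are printed to four decimals, the residual is only expected to be small, not zero.

```python
import numpy as np

# System (11.2.17) and cost parameters from the example.
A = np.array([[-0.0665, 8.0, 0.0, 0.0],
              [0.0, -3.663, 3.663, 0.0],
              [-6.86, 0.0, -13.736, -13.736],
              [0.6, 0.0, 0.0, 0.0]])
B1 = np.array([[0.0], [0.0], [13.736], [0.0]])
B2 = np.array([[-8.0], [0.0], [0.0], [0.0]])
Q = np.eye(4)
R = np.eye(1)
gamma = 3.5

# Matrices as printed in the text (rounded to four decimals).
P_star = np.array([[0.8335, 0.9649, 0.1379, 0.8005],
                   [0.9649, 1.4751, 0.2358, 0.8046],
                   [0.1379, 0.2358, 0.0696, 0.0955],
                   [0.8005, 0.8046, 0.0955, 2.6716]])
P_10 = np.array([[0.8335, 0.9649, 0.1379, 0.8005],
                 [0.9649, 1.4752, 0.2359, 0.8047],
                 [0.1379, 0.2359, 0.0696, 0.0956],
                 [0.8005, 0.8047, 0.0956, 2.6718]])

def gare_residual(P):
    """Left-hand side of the assumed GARE; zero at the exact solution."""
    return (A.T @ P + P @ A + Q
            - P @ B1 @ np.linalg.inv(R) @ B1.T @ P
            + (P @ B2 @ B2.T @ P) / gamma**2)

res_star = np.max(np.abs(gare_residual(P_star)))  # small, limited by rounding
K_11 = np.linalg.inv(R) @ B1.T @ P_10             # control gain from P_10
L_11 = (B2.T @ P_10) / gamma**2                   # disturbance gain from P_10
gap = np.linalg.norm(P_10 - P_star)               # Frobenius norm of rounded data
```

The gain `K_11` computed from the rounded $P_{10}$ matches the reported feedback vector to about three decimals; exact agreement is not expected because the printed entries are rounded.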
Fig. 11.3 Convergence of the control action network parameters $K_i$ (weights of control action network versus time in seconds)

Fig. 11.4 Convergence of the disturbance action network parameters $L_i$ (weights of disturbance action network versus time in seconds)



11.3 Iterative Adaptive Dynamic Programming for Multi-player Zero-Sum Games

Consider the following multi-player zero-sum differential game. The state trajectory of the game at time $t$, denoted by $x = x(t)$, is described by the following continuous-time affine uncertain nonlinear function

$$\begin{aligned}
\dot{x} &= f(x, u_1, u_2, \ldots, u_p, w_1, w_2, \ldots, w_q) \\
&= a(x) + b_1(x) u_1 + b_2(x) u_2 + \cdots + b_p(x) u_p \\
&\quad + c_1(x) w_1 + c_2(x) w_2 + \cdots + c_q(x) w_q,
\end{aligned} \tag{11.3.1}$$

where $a(x)$, $b_k(x)$, $k = 1, \ldots, p$, and $c_j(x)$, $j = 1, \ldots, q$, are unknown system functions, and $p, q > 0$ are positive integers. Let $x \in \mathbb{R}^{n}$ be the system state. Let $u_k \in \mathbb{R}^{n_k}$ and $w_j \in \mathbb{R}^{m_j}$ be the controls, where $n_k$, $k = 1, \ldots, p$, and $m_j$, $j = 1, \ldots, q$, are positive integers. The initial condition $x(0) = x_0$ is given. The value function is a generalized quadratic form given by

$$\begin{aligned}
V(x(0), u_1, \ldots, u_p, w_1, \ldots, w_q) = \int_0^{\infty} \big( & x^{T} A x + u_1^{T} B_1 u_1 + u_2^{T} B_2 u_2 + \cdots + u_p^{T} B_p u_p \\
& + w_1^{T} C_1 w_1 + w_2^{T} C_2 w_2 + \cdots + w_q^{T} C_q w_q \big) \, dt,
\end{aligned} \tag{11.3.2}$$

where $A$, $B_k$, $C_j$, $k = 1, 2, \ldots, p$, $j = 1, 2, \ldots, q$, are matrices with suitable dimensions, and $A > 0$, $B_k > 0$, $C_j < 0$. We assume that for all $t \in [0, \infty)$, the value function $V(x(t), u_1, u_2, \ldots, u_p, w_1, w_2, \ldots, w_q)$ (denoted by $V(x)$ for brevity) is strictly convex in all $u_k$, $k = 1, 2, \ldots, p$, and concave in all $w_j$, $j = 1, 2, \ldots, q$. Define the quadratic utility function as

$$\begin{aligned}
l(x, u_1, u_2, \ldots, u_p, w_1, w_2, \ldots, w_q) ={}& x^{T} A x + u_1^{T} B_1 u_1 + \cdots + u_p^{T} B_p u_p \\
& + w_1^{T} C_1 w_1 + \cdots + w_q^{T} C_q w_q.
\end{aligned}$$

For the above multi-player zero-sum differential game, there are two groups of con-
trollers or players where group I (including u 1 , u 2 , . . . , u p ) tries to minimize the value
function V (x), while group II (including w1 , w2 , . . . , wq ) attempts to maximize it.
According to the situation of the two groups, we have the following definitions.
Let

$$\overline{V}(x) \triangleq \inf_{u_1} \cdots \inf_{u_p} \sup_{w_1} \cdots \sup_{w_q} \{ V(x(t), u_1, \ldots, u_p, w_1, \ldots, w_q) \} \tag{11.3.3}$$

be the upper value function and

$$\underline{V}(x) \triangleq \sup_{w_1} \cdots \sup_{w_q} \inf_{u_1} \cdots \inf_{u_p} \{ V(x(t), u_1, \ldots, u_p, w_1, \ldots, w_q) \} \tag{11.3.4}$$

be the lower value function with the obvious inequality $\overline{V}(x) \ge \underline{V}(x)$ (see [9] for details). Define optimal control vectors as $(\overline{u}_1, \overline{u}_2, \ldots, \overline{u}_p, \overline{w}_1, \overline{w}_2, \ldots, \overline{w}_q)$ and $(\underline{u}_1, \underline{u}_2, \ldots, \underline{u}_p, \underline{w}_1, \underline{w}_2, \ldots, \underline{w}_q)$ for upper and lower value functions, respectively. Then,

$$\overline{V}(x) = V(x, \overline{u}_1, \ldots, \overline{u}_p, \overline{w}_1, \ldots, \overline{w}_q),$$

and

$$\underline{V}(x) = V(x, \underline{u}_1, \ldots, \underline{u}_p, \underline{w}_1, \ldots, \underline{w}_q).$$

If both $\overline{V}(x)$ and $\underline{V}(x)$ exist, and

$$\overline{V}(x) = \underline{V}(x) = V^*(x)$$

holds, then the optimal value function of the zero-sum differential game or the optimal solution exists and the corresponding optimal control vector is denoted by $(u_1^*, u_2^*, \ldots, u_p^*, w_1^*, w_2^*, \ldots, w_q^*)$.
The following assumptions and lemmas are needed.

Assumption 11.3.1 The nonlinear system (11.3.1) is controllable.

Assumption 11.3.2 The upper value function and the lower value function both
exist.

Assumption 11.3.3 The controls $u_k$, $k = 1, 2, \ldots, p$, in Group I choose their policies independently of each other. The controls $w_j$, $j = 1, 2, \ldots, q$, in Group II choose their policies independently of each other.

Based on the above assumptions, the following two lemmas are important in
applications of the ADP method.
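Lemma 11.3.1 If Assumptions 11.3.1–11.3.3 hold, then for $0 \le t \le \hat{t} < \infty$, $x \in \mathbb{R}^{n}$, $u \in \mathbb{R}^{k}$, $w \in \mathbb{R}^{m}$, we have

$$\overline{V}(x(t)) = \inf_{u_1} \cdots \inf_{u_p} \sup_{w_1} \cdots \sup_{w_q} \left\{ \int_t^{\hat{t}} l(x, u_1, \ldots, u_p, w_1, \ldots, w_q) \, d\tau + \overline{V}(x(\hat{t})) \right\},$$

$$\underline{V}(x(t)) = \sup_{w_1} \cdots \sup_{w_q} \inf_{u_1} \cdots \inf_{u_p} \left\{ \int_t^{\hat{t}} l(x, u_1, \ldots, u_p, w_1, \ldots, w_q) \, d\tau + \underline{V}(x(\hat{t})) \right\}.$$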

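Lemma 11.3.2 If the upper and lower value functions are defined as (11.3.3) and (11.3.4), respectively, we can obtain the following HJI equations

$$\mathrm{HJI}(\overline{V}(x), \overline{u}_1, \ldots, \overline{u}_p, \overline{w}_1, \ldots, \overline{w}_q) = \overline{V}_t(x) + \overline{H}(\overline{V}_x(x), \overline{u}_1, \ldots, \overline{u}_p, \overline{w}_1, \ldots, \overline{w}_q) = 0, \tag{11.3.5}$$

where $\overline{V}_t(x) = d\overline{V}(x)/dt$, $\overline{V}_x(x) = d\overline{V}(x)/dx$, and $\overline{H}(\overline{V}_x(x), \overline{u}_1, \ldots, \overline{u}_p, \overline{w}_1, \ldots, \overline{w}_q)$ is called the upper Hamiltonian, and

$$\mathrm{HJI}(\underline{V}(x), \underline{u}_1, \ldots, \underline{u}_p, \underline{w}_1, \ldots, \underline{w}_q) = \underline{V}_t(x) + \underline{H}(\underline{V}_x(x), \underline{u}_1, \ldots, \underline{u}_p, \underline{w}_1, \ldots, \underline{w}_q) = 0, \tag{11.3.6}$$

where $\underline{V}_t(x) = d\underline{V}(x)/dt$, $\underline{V}_x(x) = d\underline{V}(x)/dx$, and $\underline{H}(\underline{V}_x(x), \underline{u}_1, \ldots, \underline{u}_p, \underline{w}_1, \ldots, \underline{w}_q)$ is called the lower Hamiltonian.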

Remark 11.3.1 Optimal control problems do not necessarily have smooth or even continuous value functions [1]. In [7], it is shown that if the Hamiltonians are strictly convex in $u$ and concave in $w$, then $\overline{V}(x) \in C^1$ and $\underline{V}(x) \in C^1$ satisfy the HJI equations (11.3.5) and (11.3.6) everywhere. If the smoothness property is removed, the theory of viscosity solutions [7] shows that for infinite horizon optimal control problems with unbounded value functionals, $\overline{V}(x)$ and $\underline{V}(x)$ are the unique viscosity solutions that satisfy the HJI equations (11.3.5) and (11.3.6), respectively, under Assumptions 11.3.1–11.3.3.

11.3.1 Derivation of the Iterative ADP Algorithm

The optimal control vector can be obtained by solving the HJI equations (11.3.5) and (11.3.6), but these equations cannot be solved analytically in general. There is currently no rigorous method for solving this type of equation to find the optimal value functions of the system. We introduce the iterative ADP method to tackle this problem. In this section, the iterative ADP method for multi-player zero-sum differential games is developed.

Theorem 11.3.1 Suppose Assumptions 11.3.1–11.3.3 hold. If the upper and lower value functions $\overline{V}(x)$ and $\underline{V}(x)$ are defined as (11.3.3) and (11.3.4), respectively, then

$$\begin{aligned}
\overline{V}(x) &= \inf_{u_1} \cdots \inf_{u_m} \cdots \inf_{u_n} \cdots \inf_{u_p} \sup_{w_1} \cdots \sup_{w_{\bar{m}}} \cdots \sup_{w_{\bar{n}}} \cdots \sup_{w_q} \{ V(x, u_1, \ldots, u_p, w_1, \ldots, w_q) \} \\
&= \inf_{u_1} \cdots \inf_{u_n} \cdots \inf_{u_m} \cdots \inf_{u_p} \sup_{w_1} \cdots \sup_{w_{\bar{n}}} \cdots \sup_{w_{\bar{m}}} \cdots \sup_{w_q} \{ V(x, u_1, \ldots, u_p, w_1, \ldots, w_q) \}
\end{aligned} \tag{11.3.7}$$

and

$$\begin{aligned}
\underline{V}(x) &= \sup_{w_1} \cdots \sup_{w_{\bar{m}}} \cdots \sup_{w_{\bar{n}}} \cdots \sup_{w_q} \inf_{u_1} \cdots \inf_{u_m} \cdots \inf_{u_n} \cdots \inf_{u_p} \{ V(x, u_1, \ldots, u_p, w_1, \ldots, w_q) \} \\
&= \sup_{w_1} \cdots \sup_{w_{\bar{n}}} \cdots \sup_{w_{\bar{m}}} \cdots \sup_{w_q} \inf_{u_1} \cdots \inf_{u_n} \cdots \inf_{u_m} \cdots \inf_{u_p} \{ V(x, u_1, \ldots, u_p, w_1, \ldots, w_q) \}
\end{aligned} \tag{11.3.8}$$

hold, for any $m$, $n$, $\bar{m}$ and $\bar{n}$.

Proof We consider the upper value function. For any $w_j$, $j = 1, 2, \ldots, q$, differentiating the HJI equation (11.3.5) with respect to the control $w_j$ through the upper value function yields

$$\frac{\partial \overline{H}}{\partial w_j} = \overline{V}_x^{T} \frac{\partial f(x, u_1, \ldots, u_p, w_1, \ldots, w_q)}{\partial w_j} + \frac{\partial l(x, u_1, \ldots, u_p, w_1, \ldots, w_q)}{\partial w_j} = 0.$$

According to Assumption 11.3.3, we can get

$$\overline{w}_j = -\frac{1}{2} C_j^{-1} c_j^{T}(x) \overline{V}_x. \tag{11.3.9}$$

Taking the derivative with respect to $u_k$, we have

$$\frac{\partial \overline{H}}{\partial u_k} = \overline{V}_x^{T} \frac{\partial f(x, u_1, \ldots, u_p, w_1, \ldots, w_q)}{\partial u_k} + \frac{\partial l(x, u_1, \ldots, u_p, w_1, \ldots, w_q)}{\partial u_k} = 0.$$

Substituting (11.3.9) into (11.3.5) and using Assumption 11.3.3, we can obtain

$$\overline{u}_k = -\frac{1}{2} B_k^{-1} b_k^{T}(x) \overline{V}_x. \tag{11.3.10}$$

From (11.3.9), we can see that for any $j = 1, \ldots, q$, $\overline{w}_j$ is independent of $\overline{w}_{j'}$, where $j \ne j'$. For any $k = 1, \ldots, p$, $\overline{u}_k$ is independent of $\overline{u}_{k'}$, where $k \ne k'$, which proves the conclusion in (11.3.7).

On the other hand, according to $\partial \underline{H} / \partial u_k = 0$ and $\partial \underline{H} / \partial w_j = 0$, we can get

$$\underline{u}_k = -\frac{1}{2} B_k^{-1} b_k^{T}(x) \underline{V}_x \tag{11.3.11}$$

and

$$\underline{w}_j = -\frac{1}{2} C_j^{-1} c_j^{T}(x) \underline{V}_x. \tag{11.3.12}$$

From (11.3.12) we can also see that for any $j = 1, \ldots, q$, $\underline{w}_j$ is independent of $\underline{w}_{j'}$, where $j \ne j'$. For any $k = 1, \ldots, p$, $\underline{u}_k$ is independent of $\underline{u}_{k'}$, where $k \ne k'$, which proves the conclusion in (11.3.8). This completes the proof of the theorem.
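The stationarity step behind (11.3.9) can be written out explicitly. The following worked derivation is a sketch that assumes the Hamiltonian has the usual form $\overline{H} = \overline{V}_x^{T} f + l$ and that each $C_j$ is symmetric and invertible; neither assumption is stated in this form in the text.

```latex
% Assumed form of the upper Hamiltonian, with f from (11.3.1) and l the utility:
%   H = V_x^T ( a + sum_k b_k u_k + sum_j c_j w_j )
%       + x^T A x + sum_k u_k^T B_k u_k + sum_j w_j^T C_j w_j
\begin{aligned}
\frac{\partial \overline{H}}{\partial w_j}
  &= c_j^{T}(x)\,\overline{V}_x + 2\,C_j\,w_j = 0
  && \text{(only the $j$-th terms depend on } w_j\text{)} \\
\Longrightarrow\quad
\overline{w}_j &= -\tfrac{1}{2}\,C_j^{-1} c_j^{T}(x)\,\overline{V}_x ,
  && \frac{\partial^2 \overline{H}}{\partial w_j^2} = 2\,C_j < 0 .
\end{aligned}
```

Since $C_j < 0$, the stationary point is a maximizer in $w_j$, and because no other $w_{j'}$ appears in the first-order condition, each maximizer can be computed separately. The same argument with $B_k > 0$ gives a minimizer in $u_k$ and yields (11.3.10).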

From (11.3.9) to (11.3.12), we can see that if the system functions $b_k(x)$ and $c_j(x)$ are obtained, then the upper and lower control pairs $(\overline{u}_k, \overline{w}_j)$ and $(\underline{u}_k, \underline{w}_j)$ can be well defined. Next, NNs are introduced to construct the uncertain nonlinear system (11.3.1). For convenience of training the NNs, discretization of the continuous-time system function is necessary. According to Euler and trapezoidal methods [11], we have $\dot{x}(t) = (x(t+1) - x(t))/\Delta t$, where $\Delta t$ is the sampling time interval which satisfies Shannon's sampling theorem [22]. Then, the uncertain nonlinear system can be constructed as

$$x(t+1) = x(t) + \left( a(x(t)) + \sum_{k=1}^{p} b_k(x(t)) u_k(t) + \sum_{j=1}^{q} c_j(x(t)) w_j(t) \right) \Delta t. \tag{11.3.13}$$
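The discretization in (11.3.13) amounts to one forward Euler step of the affine dynamics. A minimal sketch follows; the function name `euler_step` and the toy system functions `a`, `b`, `c` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def euler_step(x, u_list, w_list, a, b_list, c_list, dt):
    """One step of x(t+1) = x(t) + (a + sum_k b_k u_k + sum_j c_j w_j) * dt."""
    drift = a(x)
    for b, u in zip(b_list, u_list):
        drift = drift + b(x) @ u      # contribution of minimizing player k
    for c, w in zip(c_list, w_list):
        drift = drift + c(x) @ w      # contribution of maximizing player j
    return x + drift * dt

# Toy two-state system with one player in each group (illustrative only).
a = lambda x: -x
b = lambda x: np.array([[0.0], [1.0]])
c = lambda x: np.array([[0.0], [0.5]])

x0 = np.array([1.0, 0.0])
x1 = euler_step(x0, [np.array([2.0])], [np.array([1.0])], a, [b], [c], dt=0.1)
```

Here the drift is $[-1, 0] + [0, 2] + [0, 0.5] = [-1, 2.5]$, so one step with $\Delta t = 0.1$ moves the state from $[1, 0]$ to $[0.9, 0.25]$.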

Using NNs, the nonlinear system can be constructed by

$$x(t+1) = W_m^{T} \sigma(V_m^{T} X(t)) + e_m(t), \tag{11.3.14}$$

where $X(t) = [x^{T}(t), U^{T}(t), W^{T}(t)]^{T}$, $U(t) = [u_1^{T}, u_2^{T}, \ldots, u_p^{T}]^{T}$, $W(t) = [w_1^{T}, w_2^{T}, \ldots, w_q^{T}]^{T}$, and $e_m(t)$ is the bounded approximation error. The parameters $W_m$, $V_m$ are the NN weights and $\sigma$ is the activation function. Then, for $k = 1, 2, \ldots, p$ and $j = 1, 2, \ldots, q$, we can obtain

$$b_k(x(t)) = \frac{\partial (W_m^{T} \sigma(V_m^{T} X(t)))}{\partial u_k(t)} = W_m^{T} \frac{\partial \sigma(V_m^{T} X(t))}{\partial (V_m^{T} X(t))} V_m^{T} \frac{\partial X(t)}{\partial u_k(t)}, \tag{11.3.15}$$

and

$$c_j(x(t)) = \frac{\partial (W_m^{T} \sigma(V_m^{T} X(t)))}{\partial w_j(t)} = W_m^{T} \frac{\partial \sigma(V_m^{T} X(t))}{\partial (V_m^{T} X(t))} V_m^{T} \frac{\partial X(t)}{\partial w_j(t)}. \tag{11.3.16}$$

We can see that the system functions $b_k(x(t))$ and $c_j(x(t))$ are constructed by an NN if the weights $W_m$ and $V_m$ are obtained. This guarantees that the upper and lower control pairs $(\overline{u}_k, \overline{w}_j)$ and $(\underline{u}_k, \underline{w}_j)$ in (11.3.9)–(11.3.12) are well defined.
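The chain rule in (11.3.15) and (11.3.16) can be illustrated numerically. The sketch below assumes a hyperbolic tangent activation and random weights; the dimensions and the names `model` and `input_jacobian` are hypothetical. It builds the Jacobian of the network output with respect to the stacked input $X$, reads off the column blocks that play the role of $b_k$ and $c_j$, and compares against central finite differences.

```python
import numpy as np

def model(X, Wm, Vm):
    """NN model W_m^T sigma(V_m^T X) with tanh activation (assumed)."""
    return Wm.T @ np.tanh(Vm.T @ X)

def input_jacobian(X, Wm, Vm):
    """W_m^T * d sigma / d(V_m^T X) * V_m^T, i.e. d(model)/dX."""
    z = Vm.T @ X
    dsigma = np.diag(1.0 - np.tanh(z) ** 2)   # derivative of tanh, elementwise
    return Wm.T @ dsigma @ Vm.T

n, nu, nw, hidden = 3, 2, 1, 6                # illustrative dimensions
rng = np.random.default_rng(2)
Vm = 0.5 * rng.standard_normal((n + nu + nw, hidden))
Wm = rng.standard_normal((hidden, n))
X = rng.standard_normal(n + nu + nw)          # X = [x; U; W]

J = input_jacobian(X, Wm, Vm)
b_hat = J[:, n:n + nu]                        # columns selected by dX/dU
c_hat = J[:, n + nu:]                         # columns selected by dX/dW

# Central finite differences as an independent check of the chain rule.
eps = 1e-6
J_fd = np.column_stack([
    (model(X + eps * e, Wm, Vm) - model(X - eps * e, Wm, Vm)) / (2 * eps)
    for e in np.eye(len(X))
])
```

Because $\partial X / \partial u_k$ is a selection matrix, multiplying by it is the same as picking the corresponding columns of the full input Jacobian, which is what the slicing does.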
According to (11.3.13)–(11.3.16), the nonlinear system (11.3.1) can be written as

$$\begin{aligned}
\dot{x} &= f(x, U, W) \\
&= a(x) + \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial U} \right) U + \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial W} \right) W.
\end{aligned} \tag{11.3.17}$$

The upper and lower value functions can be, respectively, rewritten as

$$\overline{V}(x) = \inf_{U} \sup_{W} V(x, U, W),$$

and

$$\underline{V}(x) = \sup_{W} \inf_{U} V(x, U, W).$$

Hence, we can see that if Theorem 11.3.1 holds, the multi-player zero-sum differential game reduces to a two-player differential game, which simplifies the problem. However, the upper and lower control vectors still cannot be solved directly. For example, if we want to obtain the upper control vector $(\overline{u}_1, \ldots, \overline{u}_p, \overline{w}_1, \ldots, \overline{w}_q)$, we must obtain the upper value function $\overline{V}(x)$. Generally speaking, $\overline{V}(x)$ is unknown before all the control vectors

$$(u_1, \ldots, u_p, w_1, \ldots, w_q) \in \mathbb{R}^{n_1 + \cdots + n_p + m_1 + \cdots + m_q}$$

are considered. It is not possible to adopt the traditional dynamic programming method to obtain the optimal cost function at every time step due to the "curse of dimensionality." Furthermore, the optimal control is discussed over an infinite horizon. This means the length of the control sequence is infinite, which implies that the upper optimal control vector is nearly impossible to obtain from the HJI equation (11.3.5). To overcome this difficulty, an iterative ADP algorithm is developed next.
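In the iterative ADP algorithm, the value function and control policy are updated by recurrent iterations, with the iteration number $i$ increasing from 0 to $\infty$. The present algorithm initializes with a stabilizing control pair $(U^{(0)}, W^{(0)})$, where Assumptions 11.3.1–11.3.3 hold. Then, for $i = 0, 1, \ldots$, let the upper iterative value function be expressed as

$$\overline{V}^{(i)}(x(0)) = \int_0^{\infty} l\big(x, \overline{U}^{(i)}, \overline{W}^{(i)}\big) \, dt, \tag{11.3.18}$$

where

$$l\big(x, \overline{U}^{(i)}, \overline{W}^{(i)}\big) = x^{T} A x + \big(\overline{U}^{(i)}\big)^{T} B \, \overline{U}^{(i)} + \big(\overline{W}^{(i)}\big)^{T} C \, \overline{W}^{(i)}$$

with $B$ and $C$ defined as

$$B = \begin{bmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_p \end{bmatrix}, \quad C = \begin{bmatrix} C_1 & 0 & \cdots & 0 \\ 0 & C_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C_q \end{bmatrix}.$$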
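The upper iterative control pair is formulated as

$$\overline{U}^{(i)} = -\frac{1}{2} B^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{U}^{(i)}} \right)^{T} \overline{V}_x^{(i)}, \tag{11.3.19}$$

and

$$\overline{W}^{(i)} = -\frac{1}{2} C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right)^{T} \overline{V}_x^{(i)}, \tag{11.3.20}$$

where $\big(\overline{U}^{(i)}, \overline{W}^{(i)}\big)$ satisfies the HJI equation $\mathrm{HJI}\big(\overline{V}^{(i)}(x), \overline{U}^{(i)}, \overline{W}^{(i)}\big) = 0$.

For $i = 0, 1, \ldots$, let the lower iterative value function be

$$\underline{V}^{(i)}(x(0)) = \int_0^{\infty} l\big(x, \underline{U}^{(i)}, \underline{W}^{(i)}\big) \, dt. \tag{11.3.21}$$

The lower iterative control pair is formulated as

$$\underline{U}^{(i)} = -\frac{1}{2} B^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \underline{U}^{(i)}} \right)^{T} \underline{V}_x^{(i)}, \tag{11.3.22}$$

and

$$\underline{W}^{(i)} = -\frac{1}{2} C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \underline{W}^{(i)}} \right)^{T} \underline{V}_x^{(i)}, \tag{11.3.23}$$

where $\big(\underline{U}^{(i)}, \underline{W}^{(i)}\big)$ satisfies the HJI equation $\mathrm{HJI}\big(\underline{V}^{(i)}(x), \underline{U}^{(i)}, \underline{W}^{(i)}\big) = 0$.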

Remark 11.3.2 We should point out the following important fact. In [39], for two-player zero-sum differential games, it is proved that the iterative value functions converge to the optimal solution when the system function is given. As the nonlinear system (11.3.1) is replaced by (11.3.17), the upper control pair $(\overline{u}_k, \overline{w}_j)$ in (11.3.9) and (11.3.10) is replaced by (11.3.19) and (11.3.20), respectively. The lower control pair $(\underline{u}_k, \underline{w}_j)$ in (11.3.11) and (11.3.12) is replaced by (11.3.22) and (11.3.23), respectively. From (11.3.9) to (11.3.12), we can see that the system functions $b_k(x)$, $k = 1, 2, \ldots, p$, and $c_j(x)$, $j = 1, 2, \ldots, q$, are functions of $x$ only, whereas for (11.3.19) and (11.3.20), the NN-based functions are functions of $x$, $\overline{U}^{(i)}$ and $\overline{W}^{(i)}$. For (11.3.22) and (11.3.23), the NN-based functions are functions of $x$, $\underline{U}^{(i)}$ and $\underline{W}^{(i)}$. On the other hand, in [39], the system function is invariant for all $i$, whereas in this chapter the system functions differ for different $i = 0, 1, \ldots$ These are the two obvious differences from the results in [39].

In the next section, we will show that using NNs to construct the uncertain non-
linear system (11.3.1), the iterative control pairs can also guarantee the upper and
lower iterative value functions to converge to the optimal solution of the game.

11.3.2 Properties

In this section, we show that the present iterative ADP algorithm for multi-player
zero-sum differential games can be used to improve the properties of the nonlinear
system.
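Theorem 11.3.2 Let Assumptions 11.3.1–11.3.3 hold, and $\overline{U}^{(i)} \in \mathbb{R}^{k}$, $\overline{W}^{(i)} \in \mathbb{R}^{m}$, $\overline{V}^{(i)}(x) \in C^1$ satisfy the HJI equation

$$\mathrm{HJI}\big(\overline{V}^{(i)}(x), \overline{U}^{(i)}, \overline{W}^{(i)}\big) = 0, \quad i = 0, 1, \ldots.$$

If for any $t$, $l\big(x, \overline{U}^{(i)}, \overline{W}^{(i)}\big) \ge 0$, then the new control pairs $\big(\overline{U}^{(i+1)}, \overline{W}^{(i+1)}\big)$ given by (11.3.19) and (11.3.20) which satisfy (11.3.18) guarantee the asymptotic stability of the nonlinear system (11.3.1).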
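Proof Since $\overline{V}^{(i)}(x) \in C^1$, according to (11.3.20), we can get

$$\begin{aligned}
\frac{d\overline{V}^{(i)}(x)}{dt} ={}& \big(\overline{V}_x^{(i)}\big)^{T} a(x) + \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{U}^{(i)}} \right) \overline{U}^{(i+1)} \\
& - \frac{1}{2} \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right) C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right)^{T} \overline{V}_x^{(i)}.
\end{aligned} \tag{11.3.24}$$

From the HJI equation (11.3.5), we have

$$\begin{aligned}
0 ={}& \big(\overline{V}_x^{(i)}\big)^{T} a(x) + \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{U}^{(i)}} \right) \overline{U}^{(i)} + x^{T} A x + \big(\overline{U}^{(i)}\big)^{T} B \, \overline{U}^{(i)} \\
& - \frac{1}{4} \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right) C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right)^{T} \overline{V}_x^{(i)}.
\end{aligned} \tag{11.3.25}$$

Combining (11.3.24) and (11.3.25), we obtain

$$\begin{aligned}
\frac{d\overline{V}^{(i)}(x)}{dt} ={}& -\big(\overline{U}^{(i+1)} - \overline{U}^{(i)}\big)^{T} B \big(\overline{U}^{(i+1)} - \overline{U}^{(i)}\big) - \big(\overline{U}^{(i+1)}\big)^{T} B \, \overline{U}^{(i+1)} - x^{T} A x \\
& - \frac{1}{4} \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right) C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right)^{T} \overline{V}_x^{(i)}.
\end{aligned} \tag{11.3.26}$$

On the other hand, substituting (11.3.20) into the utility function, it follows that

$$\begin{aligned}
l\big(x, \overline{U}^{(i+1)}, \overline{W}^{(i+1)}\big) ={}& \big(\overline{U}^{(i+1)}\big)^{T} B \, \overline{U}^{(i+1)} + x^{T} A x \\
& + \frac{1}{4} \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right) C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right)^{T} \overline{V}_x^{(i)} \ge 0.
\end{aligned} \tag{11.3.27}$$

Combining (11.3.26) and (11.3.27), we can derive $d\overline{V}^{(i)}(x)/dt \le 0$.

As $\overline{V}^{(i)}(x)$ is positive definite (since $l(x, u, w)$ is positive definite) and $d\overline{V}^{(i)}(x)/dt \le 0$, $\overline{V}^{(i)}(x)$ is a Lyapunov function. We can conclude that the system (11.3.1) is asymptotically stable. The proof is complete.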
Theorem 11.3.3 Let Assumptions 11.3.1–11.3.3 hold, and $\underline{U}^{(i)} \in \mathbb{R}^{k}$, $\underline{W}^{(i)} \in \mathbb{R}^{m}$, $\underline{V}^{(i)}(x) \in C^1$ satisfy the HJI equation $\mathrm{HJI}\big(\underline{V}^{(i)}(x), \underline{U}^{(i)}, \underline{W}^{(i)}\big) = 0$, $i = 0, 1, \ldots$ If for all $t$, $l\big(x, \underline{U}^{(i)}, \underline{W}^{(i)}\big) < 0$, then the control pairs $\big(\underline{U}^{(i)}, \underline{W}^{(i)}\big)$ formulated by (11.3.22) and (11.3.23) which satisfy the value function (11.3.21) guarantee system (11.3.1) to be asymptotically stable.

Since the proof is similar to that of Theorem 11.3.2, we omit it here.

Corollary 11.3.1 Let Assumptions 11.3.1–11.3.3 hold, and $\underline{U}^{(i)} \in \mathbb{R}^{k}$, $\underline{W}^{(i)} \in \mathbb{R}^{m}$, $\underline{V}^{(i)}(x) \in C^1$ satisfy the HJI equation $\mathrm{HJI}\big(\underline{V}^{(i)}(x), \underline{U}^{(i)}, \underline{W}^{(i)}\big) = 0$, $i = 0, 1, \ldots$ If for all $t$, $l\big(x, \underline{U}^{(i)}, \underline{W}^{(i)}\big) \ge 0$, then the control pairs $\big(\underline{U}^{(i)}, \underline{W}^{(i)}\big)$ which satisfy the value function (11.3.21) guarantee system (11.3.1) to be asymptotically stable.

Corollary 11.3.2 Let Assumptions 11.3.1–11.3.3 hold, and $\overline{U}^{(i)} \in \mathbb{R}^{k}$, $\overline{W}^{(i)} \in \mathbb{R}^{m}$, $\overline{V}^{(i)}(x) \in C^1$ satisfy the HJI equation $\mathrm{HJI}\big(\overline{V}^{(i)}(x), \overline{U}^{(i)}, \overline{W}^{(i)}\big) = 0$, $i = 0, 1, \ldots$ If for any $t$, $l\big(x, \overline{U}^{(i)}, \overline{W}^{(i)}\big) < 0$, then the control pairs $\big(\overline{U}^{(i)}, \overline{W}^{(i)}\big)$ which satisfy the value function (11.3.18) guarantee system (11.3.1) to be asymptotically stable.

Theorem 11.3.4 Let Assumptions 11.3.1–11.3.3 hold. Also, let $\overline{U}^{(i)} \in \mathbb{R}^{k}$, $\overline{W}^{(i)} \in \mathbb{R}^{m}$, and $\overline{V}^{(i)}(x) \in C^1$ satisfy the HJI equation $\mathrm{HJI}\big(\overline{V}^{(i)}(x), \overline{U}^{(i)}, \overline{W}^{(i)}\big) = 0$, for $i = 0, 1, \ldots$ Let $l\big(x, \overline{U}^{(i)}, \overline{W}^{(i)}\big)$ be the utility function. Then, the control pairs $\big(\overline{U}^{(i)}, \overline{W}^{(i)}\big)$ which satisfy the upper value function (11.3.18) guarantee system (11.3.1) to be asymptotically stable.

The proof is similar to that of Theorem 4 in [39], and thus it is omitted here.

Theorem 11.3.5 Let Assumptions 11.3.1–11.3.3 hold, and $\underline{U}^{(i)} \in \mathbb{R}^{k}$, $\underline{W}^{(i)} \in \mathbb{R}^{m}$, $\underline{V}^{(i)}(x) \in C^1$ satisfy the HJI equation $\mathrm{HJI}\big(\underline{V}^{(i)}(x), \underline{U}^{(i)}, \underline{W}^{(i)}\big) = 0$, $i = 0, 1, \ldots$ Let $l\big(x, \underline{U}^{(i)}, \underline{W}^{(i)}\big)$ be the utility function. Then, the control pair $\big(\underline{U}^{(i)}, \underline{W}^{(i)}\big)$ which satisfies the lower value function (11.3.21) is a pair of asymptotically stable controls for system (11.3.1).

Next, the analysis of convergence property for the zero-sum differential games is
presented to guarantee that the iterative control pairs reach the optimal solution.
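Proposition 11.3.1 Let Assumptions 11.3.1–11.3.3 hold, and $\overline{U}^{(i)} \in \mathbb{R}^{k}$, $\overline{W}^{(i)} \in \mathbb{R}^{m}$, $\overline{V}^{(i)}(x) \in C^1$ satisfy the HJI equation $\mathrm{HJI}\big(\overline{V}^{(i)}(x), \overline{U}^{(i)}, \overline{W}^{(i)}\big) = 0$. Then, the iterative control pairs $\big(\overline{U}^{(i)}, \overline{W}^{(i)}\big)$ formulated by (11.3.19) and (11.3.20) guarantee the upper value function $\overline{V}^{(i)}(x) \to \overline{V}(x)$ as $i \to \infty$.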

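Proof The proof consists of the following two steps.

(1) Show the convergence of the upper value function. From the HJI equation $\mathrm{HJI}\big(\overline{V}^{(i)}(x), \overline{U}^{(i)}, \overline{W}^{(i)}\big) = 0$, we can obtain $d\overline{V}^{(i+1)}(x)/dt$ by replacing the index "$i$" by the index "$i+1$"

$$\begin{aligned}
\frac{d\overline{V}^{(i+1)}(x)}{dt} = -\Bigg( & x^{T} A x + \big(\overline{U}^{(i+1)}\big)^{T} B \, \overline{U}^{(i+1)} \\
& + \frac{1}{4} \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right) C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right)^{T} \overline{V}_x^{(i)} \Bigg).
\end{aligned}$$

According to (11.3.26), we can obtain

$$\begin{aligned}
\frac{d\big(\overline{V}^{(i+1)}(x) - \overline{V}^{(i)}(x)\big)}{dt} ={}& \frac{d\overline{V}^{(i+1)}(x)}{dt} - \frac{d\overline{V}^{(i)}(x)}{dt} \\
={}& -\Bigg( x^{T} A x + \big(\overline{U}^{(i+1)}\big)^{T} B \, \overline{U}^{(i+1)} \\
& \quad + \frac{1}{4} \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right) C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right)^{T} \overline{V}_x^{(i)} \Bigg) \\
& - \Bigg( -\big(\overline{U}^{(i+1)} - \overline{U}^{(i)}\big)^{T} B \big(\overline{U}^{(i+1)} - \overline{U}^{(i)}\big) - x^{T} A x \\
& \quad - \frac{1}{4} \big(\overline{V}_x^{(i)}\big)^{T} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right) C^{-1} \left( W_m^{T} \frac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T} \frac{\partial X}{\partial \overline{W}^{(i)}} \right)^{T} \overline{V}_x^{(i)} \Bigg) \\
={}& \big(\overline{U}^{(i+1)}\big)^{T} B \, \overline{U}^{(i+1)} \ge 0.
\end{aligned}$$

Since the system (11.3.1) is asymptotically stable, its state trajectory $x$ converges to zero, and so does $\overline{V}^{(i+1)}(x) - \overline{V}^{(i)}(x)$. Since $d\big(\overline{V}^{(i+1)}(x) - \overline{V}^{(i)}(x)\big)/dt \ge 0$ on these trajectories, it implies $\overline{V}^{(i+1)}(x) - \overline{V}^{(i)}(x) \le 0$, i.e., $\overline{V}^{(i+1)}(x) \le \overline{V}^{(i)}(x)$. Thus, $\overline{V}^{(i)}(x)$ is convergent as $i \to \infty$ since $\overline{V}^{(i)}(x) \ge 0$.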
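(2) Show that $\overline{V}^{(i)}(x) \to \overline{V}(x)$ as $i \to \infty$. We define

$$\lim_{i \to \infty} \overline{V}^{(i)}(x) = \overline{V}^{(\infty)}(x).$$

For any $i$, let

$$\overline{W}^{*} = \arg\max_{W} \left\{ \int_t^{\hat{t}} l(x, U, W) \, d\tau + \overline{V}^{(i)}(x(\hat{t})) \right\}.$$

Then, according to the principle of optimality expressed in Lemma 11.3.1, we have

$$\begin{aligned}
\overline{V}^{(i)}(x) &\le \sup_{W} \left\{ \int_t^{\hat{t}} l(x, U, W) \, d\tau + \overline{V}^{(i)}(x(\hat{t})) \right\} \\
&= \int_t^{\hat{t}} l(x, U, \overline{W}^{*}) \, d\tau + \overline{V}^{(i)}(x(\hat{t})).
\end{aligned}$$

Since $\overline{V}^{(i+1)}(x) \le \overline{V}^{(i)}(x)$, we have

$$\overline{V}^{(\infty)}(x) \le \int_t^{\hat{t}} l(x, U, \overline{W}^{*}) \, d\tau + \overline{V}^{(i)}(x(\hat{t})).$$

Let $i \to \infty$. We can then obtain

$$\overline{V}^{(\infty)}(x) \le \int_t^{\hat{t}} l(x, U, \overline{W}^{*}) \, d\tau + \overline{V}^{(\infty)}(x(\hat{t})).$$

So, it follows that

$$\overline{V}^{(\infty)}(x) \le \inf_{U} \sup_{W} \left\{ \int_t^{\hat{t}} l(x, U, W) \, d\tau + \overline{V}^{(\infty)}(x(\hat{t})) \right\}. \tag{11.3.28}$$

Let $\varepsilon > 0$ be an arbitrary positive number. Since the upper value function is nonincreasing and convergent, there exists a positive integer $i$ such that

$$\overline{V}^{(i)}(x) - \varepsilon \le \overline{V}^{(\infty)}(x) \le \overline{V}^{(i)}(x).$$

Let $\overline{U}^{*} = \arg\min_{U} \left\{ \int_t^{\hat{t}} l(x, U, \overline{W}^{*}) \, d\tau + \overline{V}^{(i)}(x(\hat{t})) \right\}$. Then, we can get

$$\overline{V}^{(i)}(x) = \int_t^{\hat{t}} l\big(x, \overline{U}^{*}, \overline{W}^{*}\big) \, d\tau + \overline{V}^{(i)}(x(\hat{t})).$$

Thus,

$$\begin{aligned}
\overline{V}^{(\infty)}(x) &\ge \int_t^{\hat{t}} l(x, \overline{U}^{*}, \overline{W}^{*}) \, d\tau + \overline{V}^{(i)}(x(\hat{t})) - \varepsilon \\
&\ge \int_t^{\hat{t}} l(x, \overline{U}^{*}, \overline{W}^{*}) \, d\tau + \overline{V}^{(\infty)}(x(\hat{t})) - \varepsilon \\
&= \inf_{U} \sup_{W} \left\{ \int_t^{\hat{t}} l(x, U, W) \, d\tau + \overline{V}^{(\infty)}(x(\hat{t})) \right\} - \varepsilon.
\end{aligned}$$

Since $\varepsilon$ is arbitrary, we obtain

$$\overline{V}^{(\infty)}(x) \ge \inf_{U} \sup_{W} \left\{ \int_t^{\hat{t}} l(x, U, W) \, d\tau + \overline{V}^{(\infty)}(x(\hat{t})) \right\}. \tag{11.3.29}$$

Combining (11.3.28) and (11.3.29), it follows that

$$\overline{V}^{(\infty)}(x) = \inf_{U} \sup_{W} \left\{ \int_t^{\hat{t}} l(x, U, W) \, d\tau + \overline{V}^{(\infty)}(x(\hat{t})) \right\}.$$

Letting $\hat{t} \to \infty$, we have

$$\overline{V}^{(\infty)}(x) = \inf_{U} \sup_{W} V(x, U, W),$$

which is the same as (11.3.3). So, $\overline{V}^{(i)}(x) \to \overline{V}(x)$ as $i \to \infty$. The proof is complete.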

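Proposition 11.3.2 Let Assumptions 11.3.1–11.3.3 hold, and $\underline{U}^{(i)} \in \mathbb{R}^{k}$, $\underline{W}^{(i)} \in \mathbb{R}^{m}$ and $\underline{V}^{(i)}(x) \in C^1$ satisfy the HJI equation $\mathrm{HJI}\big(\underline{V}^{(i)}(x), \underline{U}^{(i)}, \underline{W}^{(i)}\big) = 0$. Then the iterative control pair $\big(\underline{U}^{(i)}, \underline{W}^{(i)}\big)$ formulated by (11.3.22) and (11.3.23) guarantees the lower value function $\underline{V}^{(i)}(x) \to \underline{V}(x)$ as $i \to \infty$.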

Remark 11.3.3 In [39], for the known nonlinear systems, the convergence property
was proved for two-player zero-sum differential games. We should point out that in
[39], the system function is invariant for any i. For multi-player zero-sum differential
games, the NN-based system functions are also updated for any i = 0, 1, . . . While
from Propositions 11.3.1 and 11.3.2, we can see that by the approximation of NNs
for nonlinear systems, the convergence property can also be guaranteed which shows
the effectiveness of the present method.

Theorem 11.3.6 If the optimal value function of the zero-sum differential game or the optimal solution exists, then the control pairs $\big(\overline{U}^{(i+1)}, \overline{W}^{(i+1)}\big)$ and $\big(\underline{U}^{(i+1)}, \underline{W}^{(i+1)}\big)$ guarantee $\overline{V}^{(i)}(x) \to V^*(x)$ and $\underline{V}^{(i)}(x) \to V^*(x)$, respectively, as $i \to \infty$.

Proof For the upper value function, according to Proposition 11.3.1, we have $\overline{V}^{(i)}(x) \to \overline{V}(x)$ under the control pair $\big(\overline{U}^{(i+1)}, \overline{W}^{(i+1)}\big)$ as $i \to \infty$. So, the optimal control pair for the upper value function satisfies

$$\overline{V}(x) = V(x, \overline{U}, \overline{W}) = \inf_{U} \sup_{W} V(x, U, W). \tag{11.3.30}$$

On the other hand, there exists an optimal control pair $(U^*, W^*)$ under which the value function reaches the optimal solution. According to the property of the optimal solution, the optimal control pair $(U^*, W^*)$ satisfies

$$V^*(x) = V(x, U^*, W^*) = \inf_{U} \sup_{W} V(x, U, W),$$

which is the same as (11.3.30). So, $\overline{V}(x) \to V^*(x)$ under the control pair $\big(\overline{U}^{(i+1)}, \overline{W}^{(i+1)}\big)$ as $i \to \infty$.

Similarly, we can derive $\underline{V}(x) \to V^*(x)$ under the control pair $\big(\underline{U}^{(i+1)}, \underline{W}^{(i+1)}\big)$ as $i \to \infty$. This completes the proof of the theorem.

Remark 11.3.4 From Theorem 11.3.6, we can see that if the optimal solution exists, the upper and lower iterative value functions converge to the optimum under the iterative control pairs $\big(\overline{U}^{(i)}, \overline{W}^{(i)}\big)$ and $\big(\underline{U}^{(i)}, \underline{W}^{(i)}\big)$, respectively. Since the existence criteria for the optimal solution of the games in [2, 3, 8] are unnecessary, we can see that the present iterative ADP algorithm is more effective for solving zero-sum differential games.

11.3.3 Neural Network Implementation

As a computer can only handle digital, discrete signals, it is necessary to transform the continuous-time value function to a corresponding discrete-time form. Discretization of the value function using Euler and trapezoidal methods [11] leads to

$$V(x(0)) = \sum_{t=0}^{\infty} \big( x^{T}(t) A x(t) + U^{T}(t) B U(t) + W^{T}(t) C W(t) \big) \Delta t,$$

where $\Delta t$ is the sample time interval.


In [1, 24, 25], polynomial approximation is used to construct single-layer NNs
that approximate the value function and the control policy, respectively, to implement
the ADP algorithm. While using the polynomial approximation method, the order
of the polynomial sequence and the length of the polynomial sequence are both
difficult to determine. As is known a three-layer BP NN can approximate any MIMO
continuous functions [12]. Three-layer BP NNs will be used to implement the present
iterative ADP algorithm.
Let the number of hidden-layer neurons be denoted by $\lambda$, the weight matrix
between the input layer and the hidden layer by $Y_f$, and the weight matrix
between the hidden layer and the output layer by $W_f$. Then, the output of the
three-layer NN is represented by

\[
\hat{F}(X, Y_f, W_f) = W_f^{T} \sigma\left( Y_f^{T} X \right),
\]

where $\sigma(Y_f^{T} X) \in \mathbb{R}^{\lambda}$ and

\[
[\sigma(z)]_i = \frac{e^{z_i} - e^{-z_i}}{e^{z_i} + e^{-z_i}}, \quad i = 1, \ldots, \lambda,
\]

are the activation functions. Using the NN, the function to be approximated can be
expressed as

\[
F(X) = F(X, Y_f^{*}, W_f^{*}) + \xi(X),
\]

where $Y_f^{*}$ and $W_f^{*}$ are the ideal weight parameters, and $\xi(X)$ is the
estimation error.
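The forward pass of the three-layer network described above can be sketched in a few lines. The function name `forward` and the small 2-3-1 weight matrices below are hypothetical and serve only to show the order of operations: hidden pre-activations, element-wise tanh, then a linear output layer.

```python
import math


def forward(X, Y_f, W_f):
    """Compute W_f^T tanh(Y_f^T X).

    X   : input vector of length n_in
    Y_f : input-to-hidden weights, n_in rows by lam columns
    W_f : hidden-to-output weights, lam rows by n_out columns
    """
    lam = len(W_f)
    n_out = len(W_f[0])
    # Hidden pre-activations z = Y_f^T X, followed by the tanh activation.
    z = [sum(Y_f[r][h] * X[r] for r in range(len(X))) for h in range(lam)]
    hidden = [math.tanh(v) for v in z]
    # The output layer is linear: W_f^T hidden.
    return [sum(W_f[h][o] * hidden[h] for h in range(lam)) for o in range(n_out)]


# Hypothetical 2-3-1 network.
Y_f = [[0.1, -0.2, 0.3],
       [0.4, 0.0, -0.1]]
W_f = [[0.5], [-0.3], [0.2]]
y = forward([1.0, -1.0], Y_f, W_f)
print(y)
```

Note that `math.tanh` is exactly the ratio $(e^{z} - e^{-z})/(e^{z} + e^{-z})$ used for the activation function in the text.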

A. Neural Network Identifier for Uncertain Nonlinear Systems


In this case, the uncertain nonlinear system will be identified by an NN using the
input–output data. For convenience of analysis, only the output weights are updated
during training, while the hidden weights are kept fixed. Let
$\bar{X}(t) = Y_m^{T} X(t)$. The activation function $\sigma(Y_m^{T} X(t))$ can then
be rewritten as $\sigma(\bar{X}(t))$. Thus, the NN model for the system is constructed
as

\[
\hat{x}(t+1) = \hat{W}_m^{T}(t) \sigma\left( \bar{X}(t) \right), \tag{11.3.31}
\]

where $\hat{x}(t+1)$ is the estimated system state vector and $\hat{W}_m(t)$ is the
estimate of the ideal weight matrix $W_m$. According to (11.3.14), we can define the
system identification error as

\[
\tilde{x}(t+1) = \hat{x}(t+1) - x(t+1) = \tilde{W}_m^{T}(t) \sigma(\bar{Z}(t)) - e_m(t),
\]

where $\tilde{W}_m(t) = \hat{W}_m(t) - W_m$. Let
$\phi(t) = \tilde{W}_m^{T}(t) \sigma(\bar{Z}(t))$. Then, we can get

\[
\tilde{x}(t+1) = \phi(t) - e_m(t). \tag{11.3.32}
\]

Define the performance error index as

\[
E_m(t) = \frac{1}{2} \tilde{x}^{T}(t+1) \tilde{x}(t+1).
\]

Then, the gradient-based weight update rule for the model network can be described
as

\[
\tilde{W}_m(t+1) = \tilde{W}_m(t) + \Delta \tilde{W}_m(t), \tag{11.3.33}
\]

\[
\Delta \tilde{W}_m(t) = \eta_m \left( -\frac{\partial E_m(t)}{\partial \tilde{W}_m(t)} \right), \tag{11.3.34}
\]

where $\eta_m > 0$ is the learning rate of the model network. After the model network is
trained, its weights are kept unchanged.
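As an illustration of the identifier above, the following sketch trains only the output weights of a small tanh network with the gradient rule (11.3.33)–(11.3.34), while the hidden weights stay fixed. The plant `plant(x, u)`, the network size, the learning rate and the random data are all hypothetical choices made for this sketch; they are not taken from the text.

```python
import math
import random

random.seed(0)

n_in, lam = 2, 8          # input is X = [x, u]; lam hidden neurons
eta_m = 0.05              # learning rate (hypothetical value)

# Fixed random hidden weights Y_m and trainable output weights W_hat.
Y_m = [[random.uniform(-0.5, 0.5) for _ in range(lam)] for _ in range(n_in)]
W_hat = [random.uniform(-0.5, 0.5) for _ in range(lam)]


def plant(x, u):
    """Hypothetical discrete-time plant, used only to generate training data."""
    return 0.8 * x + 0.2 * u


def hidden(X):
    """sigma(Y_m^T X) with tanh activations."""
    return [math.tanh(sum(Y_m[r][h] * X[r] for r in range(n_in))) for h in range(lam)]


def mean_squared_error(samples):
    err = 0.0
    for x, u in samples:
        s = hidden([x, u])
        x_hat = sum(W_hat[h] * s[h] for h in range(lam))
        err += (x_hat - plant(x, u)) ** 2
    return err / len(samples)


samples = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
mse_before = mean_squared_error(samples)

for epoch in range(200):
    for x, u in samples:
        s = hidden([x, u])
        x_tilde = sum(W_hat[h] * s[h] for h in range(lam)) - plant(x, u)
        # The gradient of E_m = 0.5 * x_tilde^2 w.r.t. the output weights is s * x_tilde.
        for h in range(lam):
            W_hat[h] -= eta_m * s[h] * x_tilde

mse_after = mean_squared_error(samples)
print(mse_before, mse_after)
```

Keeping the hidden layer fixed makes the model linear in the trainable weights, which is what allows the Lyapunov argument of the next theorem to go through.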
Next, we will give the convergence theorem of NNs.

Theorem 11.3.7 Let the NN be expressed by (11.3.31) and let the NN weights be
updated by (11.3.33) and (11.3.34). The approximation error $e_m$ is expressed as in
(11.3.14). If there exists a constant $0 < \lambda_M < 1$ such that

\[
e_m^{T}(t) e_m(t) \le \lambda_M \tilde{x}^{T}(t) \tilde{x}(t), \tag{11.3.35}
\]

then the system identification error $\tilde{x}(t)$ is asymptotically stable while the
parameter estimation error $\tilde{W}_m(t)$ is bounded.

Proof First, we construct the following Lyapunov function candidate:

\[
L(x) = \tilde{x}^{T}(t) \tilde{x}(t) + \frac{1}{\eta_m} \mathrm{tr}\left( \tilde{W}_m^{T}(t) \tilde{W}_m(t) \right).
\]

The first difference of the Lyapunov function candidate is given by

\[
\Delta L(x) = \tilde{x}^{T}(t+1) \tilde{x}(t+1) - \tilde{x}^{T}(t) \tilde{x}(t)
+ \frac{1}{\eta_m} \mathrm{tr}\left( \tilde{W}_m^{T}(t+1) \tilde{W}_m(t+1) - \tilde{W}_m^{T}(t) \tilde{W}_m(t) \right).
\]

From (11.3.33)–(11.3.34), the weights of the NN are updated as

\[
\hat{W}_m(t+1) = \hat{W}_m(t) - \eta_m \sigma(\bar{Z}(t)) \tilde{x}^{T}(t+1).
\]

According to (11.3.32), we can obtain

\[
\Delta L(x) = \phi^{T}(t) \phi(t) - 2 \phi^{T}(t) e_m(t)
+ \eta_m \sigma^{T}(\bar{Z}(t)) \sigma(\bar{Z}(t)) \tilde{x}^{T}(t+1) \tilde{x}(t+1)
+ e_m^{T}(t) e_m(t) - \tilde{x}^{T}(t) \tilde{x}(t) - 2 \phi^{T}(t) \tilde{x}(t+1).
\]

Using the Cauchy–Schwarz inequality, we can get

\[
\Delta L(x) \le -\phi^{T}(t) \phi(t) + e_m^{T}(t) e_m(t) - \tilde{x}^{T}(t) \tilde{x}(t)
+ 2 \eta_m \sigma^{T}(\bar{Z}(t)) \sigma(\bar{Z}(t)) \left( \phi^{T}(t) \phi(t) + e_m^{T}(t) e_m(t) \right).
\]

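As $\|\sigma(\bar{Z}(t))\|$ is finite, we can write $\|\sigma(\bar{Z}(t))\| \le \sigma_M$,
where $\sigma_M > 0$ is the upper bound. According to (11.3.35), we can get

\[
\Delta L(x) \le -\left( 1 - 2 \eta_m \sigma_M^{2} \right) \|\phi(t)\|^{2}
- \left( 1 - \lambda_M - 2 \eta_m \lambda_M \sigma_M^{2} \right) \|\tilde{x}(t)\|^{2}. \tag{11.3.36}
\]

Let $\eta_m$ be selected as $\eta_m \le \beta^{2} / (2 \sigma_M^{2})$. Then, (11.3.36) becomes

\[
\Delta L(x) \le -\left( 1 - \beta^{2} \right) \|\phi(t)\|^{2}
- \left( 1 - \lambda_M - \lambda_M \beta^{2} \right) \|\tilde{x}(t)\|^{2}. \tag{11.3.37}
\]

Therefore, $\Delta L(x) \le 0$ provided that $0 < \lambda_M \le 1$ and

\[
\max\left\{ -\sqrt{\frac{1 - \lambda_M}{\lambda_M}}, -1 \right\} \le \beta \le
\min\left\{ \sqrt{\frac{1 - \lambda_M}{\lambda_M}}, 1 \right\},
\]

where $\beta \ne 0$. As long as the parameters are selected as discussed above, we have
$\Delta L(x) \le 0$. Therefore, the identification error $\tilde{x}(t)$ and the weight
estimation error $\tilde{W}_m(t)$ are bounded if $\tilde{x}(0)$ and $\tilde{W}_m(0)$ are
bounded. Furthermore, by summing both sides of (11.3.37) to infinity and taking limits
as $t \to \infty$, it can be concluded that the estimation error $\tilde{x}(t)$ approaches
zero as $t \to \infty$. This completes the proof of the theorem.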
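The conditions at the end of the proof translate into a concrete ceiling on the learning rate. The sketch below evaluates the largest admissible $|\beta|$ and the resulting bound $\eta_m \le \beta^{2}/(2\sigma_M^{2})$; the function names and the numerical values of $\lambda_M$ and $\sigma_M$ are hypothetical, and the square root in the bound on $\beta$ is the one that follows from requiring $1 - \lambda_M - \lambda_M \beta^{2} \ge 0$.

```python
import math


def beta_upper_bound(lambda_M):
    """Largest admissible |beta|: min{ sqrt((1 - lambda_M) / lambda_M), 1 }."""
    if not 0.0 < lambda_M <= 1.0:
        raise ValueError("lambda_M must lie in (0, 1]")
    return min(math.sqrt((1.0 - lambda_M) / lambda_M), 1.0)


def max_learning_rate(lambda_M, sigma_M):
    """Largest eta_m allowed by eta_m <= beta^2 / (2 * sigma_M^2)."""
    beta = beta_upper_bound(lambda_M)
    return beta ** 2 / (2.0 * sigma_M ** 2)


# Hypothetical numbers: eight tanh neurons, so ||sigma|| is at most sqrt(8).
sigma_M = math.sqrt(8.0)
for lambda_M in (0.2, 0.5, 0.8):
    print(lambda_M, max_learning_rate(lambda_M, sigma_M))
```

A larger $\lambda_M$ (a poorer approximation) shrinks the admissible range of $\beta$ and therefore forces a smaller learning rate.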
Remark 11.3.5 In [26], an ADP algorithm was used to solve a nonlinear multi-player
nonzero-sum game that can be converted to a zero-sum one. The method in [26] requires
the system function in order to obtain the optimal control pair. In this chapter, the
control pair is obtained without the system model. If the conditions of
Theorem 11.3.7 are satisfied, the system dynamics can be well reconstructed by an NN,
which guarantees the effectiveness of the present algorithm.

B. Neural Networks for the Iterative ADP


We use eight NNs to implement the iterative ADP method, where two are model
networks (the weights of the two model NNs are the same), two are critic networks,
and four are action networks, respectively. All the NNs are chosen as three-layer
feedforward NNs. The structural diagram is shown in Fig. 11.5.

Fig. 11.5 Structural diagram of the iterative ADP method for multi-person zero-sum games

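The critic networks are used to approximate the upper and lower iterative value
functions, i.e., $\overline{V}^{(i)}(x)$ and $\underline{V}^{(i)}(x)$. The output of a
critic network is denoted as

\[
\hat{V}^{(i)}(x(t)) = \left( W_c^{(i)} \right)^{T} \sigma\left( \left( V_c^{(i)} \right)^{T} X(t) \right),
\]

where $X(t) = \left[ x^{T}(t), u^{T}(t), w^{T}(t) \right]^{T}$ is the input of the critic
networks. The two critic networks have the following target functions. For the upper
value function, the target function can be written as

\[
\overline{V}^{(i)}(x(t)) = \left( x^{T}(t) A x(t)
+ \left( \overline{U}^{(i+1)}(t) \right)^{T} B \overline{U}^{(i+1)}(t)
+ \left( \overline{W}^{(i+1)}(t) \right)^{T} C \overline{W}^{(i+1)}(t) \right) \Delta t
+ \hat{\overline{V}}^{(i)}(x(t+1)),
\]

where $\hat{\overline{V}}^{(i)}(x(t+1))$ is the output of the upper critic network. For
the lower value function, the target function can be written as

\[
\underline{V}^{(i)}(x(t)) = \left( x^{T}(t) A x(t)
+ \left( \underline{U}^{(i+1)}(t) \right)^{T} B \underline{U}^{(i+1)}(t)
+ \left( \underline{W}^{(i+1)}(t) \right)^{T} C \underline{W}^{(i+1)}(t) \right) \Delta t
+ \hat{\underline{V}}^{(i)}(x(t+1)), \tag{11.3.38}
\]

where $\hat{\underline{V}}^{(i)}(x(t+1))$ is the output of the lower critic network.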
Then, for the upper value function, we define the error function of the critic
network by

\[
e_c^{(i)}(t) = \overline{V}^{(i)}(x(t)) - \hat{\overline{V}}^{(i)}(x(t)),
\]

where $\hat{\overline{V}}^{(i)}(x(t))$ is the output of the upper critic network. The
objective function to be minimized in the critic network is

\[
E_c^{(i)}(t) = \frac{1}{2} \left( e_c^{(i)}(t) \right)^{2}.
\]

So, the gradient-based weight update rule for the critic network is given by

\[
W_c^{(i)}(t+1) = W_c^{(i)}(t) + \Delta W_c^{(i)}(t), \qquad
\Delta W_c^{(i)}(t) = \eta_c \left( -\frac{\partial E_c^{(i)}(t)}{\partial W_c^{(i)}(t)} \right),
\]

\[
\frac{\partial E_c^{(i)}(t)}{\partial W_c^{(i)}(t)} =
\frac{\partial E_c^{(i)}(t)}{\partial \hat{\overline{V}}^{(i)}(x(t))}
\frac{\partial \hat{\overline{V}}^{(i)}(x(t))}{\partial W_c^{(i)}(t)},
\]

where $\eta_c > 0$ is the learning rate of the critic network and $W_c(t)$ is the weight
vector of the critic network, which can be replaced by $W_c^{(i)}$ and $V_c^{(i)}$.
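One gradient step of the critic update can be sketched as follows. For brevity this sketch adapts only the output-layer weights $W_c$ and keeps the hidden weights $V_c$ fixed, which is a simplification of the rule in the text; the function names, the 1-4-1 network and the constant target value are hypothetical.

```python
import math


def critic_output(W_c, V_c, X):
    """Return V_hat = W_c^T tanh(V_c^T X) and the hidden activations."""
    s = [math.tanh(sum(V_c[r][h] * X[r] for r in range(len(X)))) for h in range(len(W_c))]
    return sum(w * sh for w, sh in zip(W_c, s)), s


def critic_step(W_c, V_c, X, target, eta_c):
    """One gradient step on E_c = 0.5 * e_c^2 with e_c = target - V_hat."""
    v_hat, s = critic_output(W_c, V_c, X)
    e_c = target - v_hat
    # dE_c/dW_c = (dE_c/dV_hat) * (dV_hat/dW_c) = (-e_c) * s, so W_c moves along +e_c * s.
    return [w + eta_c * e_c * sh for w, sh in zip(W_c, s)], e_c


V_c = [[0.3, -0.2, 0.5, 0.1]]    # 1 input, 4 hidden neurons (hypothetical)
W_c = [0.0, 0.0, 0.0, 0.0]
X = [1.0]
target = 0.7                      # stands in for the stage cost times dt plus V_hat(x(t+1))
errors = []
for _ in range(300):
    W_c, e_c = critic_step(W_c, V_c, X, target, eta_c=0.1)
    errors.append(abs(e_c))
print(errors[0], errors[-1])
```

In the full algorithm the target changes at every time step and iteration, so in practice this step is applied over many samples rather than to a single fixed pair.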
For the lower value function, the error function of the critic network is defined by

\[
e_c^{(i)}(t) = \underline{V}^{(i)}(x(t)) - \hat{\underline{V}}^{(i)}(x(t)).
\]

For the lower iterative value function, the weight update rule of the critic network
is the same as the one for the upper value function, and the details are omitted here.
To implement the present iterative ADP algorithm, four action networks are used to
approximate the laws of the upper and lower iterative control pairs: two approximate
the laws of the upper iterative control pairs, and the other two approximate the laws
of the lower iterative control pairs.
The target functions of the upper $U$ and $W$ action networks are the discretized
forms of (11.3.19) and (11.3.20), respectively, which can be written as

\[
\overline{U}^{(i+1)}(t) = -\frac{1}{2} B^{-1} \left[ W_m^{T}
\frac{\partial \sigma(V_m^{T} X(t))}{\partial (V_m^{T} X(t))} V_m^{T}
\frac{\partial X(t)}{\partial \overline{U}^{(i)}(t)} \right]^{T}
\frac{d \overline{V}^{(i)}(x(t+1))}{d x(t+1)}, \tag{11.3.39}
\]

and

\[
\overline{W}^{(i+1)}(t) = -\frac{1}{2} C^{-1} \left[ W_m^{T}
\frac{\partial \sigma(V_m^{T} X(t))}{\partial (V_m^{T} X(t))} V_m^{T}
\frac{\partial X(t)}{\partial \overline{W}^{(i)}(t)} \right]^{T}
\frac{d \overline{V}^{(i)}(x(t+1))}{d x(t+1)}.
\]

The target functions of the lower $U$ and $W$ action networks are the discretized
forms of (11.3.22) and (11.3.23), respectively, which can be written as

\[
\underline{U}^{(i+1)}(t) = -\frac{1}{2} B^{-1} \left[ W_m^{T}
\frac{\partial \sigma(V_m^{T} X(t))}{\partial (V_m^{T} X(t))} V_m^{T}
\frac{\partial X(t)}{\partial \underline{U}^{(i)}(t)} \right]^{T}
\frac{d \underline{V}^{(i)}(x(t+1))}{d x(t+1)},
\]

and

\[
\underline{W}^{(i+1)}(t) = -\frac{1}{2} C^{-1} \left[ W_m^{T}
\frac{\partial \sigma(V_m^{T} X(t))}{\partial (V_m^{T} X(t))} V_m^{T}
\frac{\partial X(t)}{\partial \underline{W}^{(i)}(t)} \right]^{T}
\frac{d \underline{V}^{(i)}(x(t+1))}{d x(t+1)}.
\]

The output of the action network (the upper $U$ action network, for example) can
be formulated as

\[
\hat{\overline{U}}^{(i)}(t) = \left( \overline{W}_{ua}^{(i)} \right)^{T}
\sigma\left( \left( \overline{Y}_{ua}^{(i)} \right)^{T} x(t) \right).
\]

The target of the output of the action network is given by (11.3.39). So, we can define
the output error of the action network as

\[
e_{ua}^{(i)}(t) = \overline{U}^{(i)}(t) - \hat{\overline{U}}^{(i)}(t).
\]

The weights of the action network are updated to minimize the following performance
error measure:

\[
E_{ua}^{(i)}(t) = \frac{1}{2} \left( e_{ua}^{(i)}(t) \right)^{2}.
\]

The weight update algorithm is similar to the one for the critic network. Using the
gradient descent rule, we can obtain

\[
W_a^{(i)}(t+1) = W_a^{(i)}(t) + \Delta W_a^{(i)}(t),
\]

\[
\Delta W_a^{(i)}(t) = \eta_a \left( -\frac{\partial E_a^{(i)}(t)}{\partial W_a^{(i)}(t)} \right),
\]

\[
\frac{\partial E_a^{(i)}(t)}{\partial W_a^{(i)}(t)} =
\frac{\partial E_a^{(i)}(t)}{\partial e_a^{(i)}(t)}
\frac{\partial e_a^{(i)}(t)}{\partial \hat{\overline{U}}^{(i)}(t)}
\frac{\partial \hat{\overline{U}}^{(i)}(t)}{\partial W_a^{(i)}(t)},
\]

where $\eta_a > 0$ is the learning rate of the action network, and $W_a^{(i)}(t)$ is the
weight vector of the action network, which can be replaced by
$\overline{W}_{ua}^{(i)}$ and $\overline{Y}_{ua}^{(i)}$, respectively. The weight update
rules for the other action networks are similar and are omitted.
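Given the above preparation, we now formulate the iterative ADP algorithm for
the nonlinear multi-player zero-sum differential games as follows.

Step 1: Initialize the algorithm with a stabilizing control pair $(U^{(0)}, W^{(0)})$, where
Assumptions 11.3.1–11.3.3 hold. Choose the computation precision $\zeta > 0$.

Step 2: Discretize the nonlinear system (11.3.1) as (11.3.13) and construct a model
NN. Train the model NN according to (11.3.31)–(11.3.34), and obtain the
system functions $b_k(x(t))$, $k = 1, 2, \ldots, p$, and $c_j(x(t))$, $j = 1, 2, \ldots, q$,
according to (11.3.15) and (11.3.16), respectively.

Step 3: For $i = 0, 1, \ldots$, for the upper value function, let

\[
\overline{V}^{(i)}(x(0)) = \sum_{t=0}^{\infty}
l\left( x(t), \overline{U}^{(i+1)}(t), \overline{W}^{(i+1)}(t) \right) \Delta t,
\]

where

\[
l\left( x(t), \overline{U}^{(i+1)}(t), \overline{W}^{(i+1)}(t) \right) = x^{T}(t) A x(t)
+ \left( \overline{U}^{(i+1)}(t) \right)^{T} B \overline{U}^{(i+1)}(t)
+ \left( \overline{W}^{(i+1)}(t) \right)^{T} C \overline{W}^{(i+1)}(t).
\]

Update the upper iterative control pairs by

\[
\begin{cases}
\overline{U}^{(i+1)} = -\dfrac{1}{2} B^{-1} \left[ W_m^{T}
\dfrac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T}
\dfrac{\partial X}{\partial \overline{U}^{(i)}} \right]^{T} \overline{V}_x^{(i)}, \\[2ex]
\overline{W}^{(i+1)} = -\dfrac{1}{2} C^{-1} \left[ W_m^{T}
\dfrac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T}
\dfrac{\partial X}{\partial \overline{W}^{(i)}} \right]^{T} \overline{V}_x^{(i)},
\end{cases}
\]

where the HJI equation
$\mathrm{HJI}\left( \overline{V}^{(i)}(x(t)), \overline{U}^{(i)}(t), \overline{W}^{(i)}(t) \right) = 0$
is satisfied for any $i = 0, 1, \ldots$.

Step 4: If

\[
\left| \overline{V}^{(i+1)}(x(0)) - \overline{V}^{(i)}(x(0)) \right| < \zeta,
\]

let

\[
\overline{U}(t) = \overline{U}^{(i)}(t), \quad \overline{W}(t) = \overline{W}^{(i)}(t),
\quad \text{and} \quad \overline{V}(x(t)) = \overline{V}^{(i+1)}(x(t)),
\]

and go to the next step; else, set $i = i + 1$ and go to Step 3.

Step 5: For $i = 0, 1, \ldots$, for the lower value function, let

\[
\underline{V}^{(i)}(x(0)) = \sum_{t=0}^{\infty}
l\left( x(t), \underline{U}^{(i+1)}(t), \underline{W}^{(i+1)}(t) \right) \Delta t,
\]

and let the lower iterative control pairs be updated by

\[
\begin{cases}
\underline{U}^{(i+1)} = -\dfrac{1}{2} B^{-1} \left[ W_m^{T}
\dfrac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T}
\dfrac{\partial X}{\partial \underline{U}^{(i)}} \right]^{T} \underline{V}_x^{(i)}, \\[2ex]
\underline{W}^{(i+1)} = -\dfrac{1}{2} C^{-1} \left[ W_m^{T}
\dfrac{\partial \sigma(V_m^{T} X)}{\partial (V_m^{T} X)} V_m^{T}
\dfrac{\partial X}{\partial \underline{W}^{(i)}} \right]^{T} \underline{V}_x^{(i)},
\end{cases}
\]

where the HJI equation

\[
\mathrm{HJI}\left( \underline{V}^{(i)}(x(t)), \underline{U}^{(i)}(t), \underline{W}^{(i)}(t) \right) = 0
\]

is satisfied for any $i = 0, 1, \ldots$.

Step 6: If

\[
\left| \underline{V}^{(i+1)}(x(0)) - \underline{V}^{(i)}(x(0)) \right| < \zeta,
\]

let

\[
\underline{U}(t) = \underline{U}^{(i)}(t), \quad \underline{W}(t) = \underline{W}^{(i)}(t),
\quad \text{and} \quad \underline{V}(x(t)) = \underline{V}^{(i+1)}(x(t)),
\]

and go to the next step; else, set $i = i + 1$ and go to Step 5.

Step 7: If

\[
\left| \overline{V}(x(0)) - \underline{V}(x(0)) \right| < \zeta,
\]

stop, and the optimal solution is achieved; else, stop, and the optimal solution
does not exist.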
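The control flow of Steps 3–7 can be summarized in a short sketch: iterate the upper value until it settles, iterate the lower value until it settles, then compare the two limits. The functions `iterate_until_converged` and `solve_zero_sum_game`, and the scalar contraction maps used as stand-ins for the NN-based value updates, are hypothetical and only illustrate the loop structure.

```python
def iterate_until_converged(step, v0, zeta, max_iter=1000):
    """Repeat v <- step(v) until successive values differ by less than zeta."""
    v = v0
    for _ in range(max_iter):
        v_next = step(v)
        if abs(v_next - v) < zeta:
            return v_next
        v = v_next
    raise RuntimeError("no convergence within max_iter iterations")


def solve_zero_sum_game(upper_step, lower_step, v_upper0, v_lower0, zeta):
    """Steps 3-7: converge the upper and lower values, then compare them."""
    v_upper = iterate_until_converged(upper_step, v_upper0, zeta)   # Steps 3-4
    v_lower = iterate_until_converged(lower_step, v_lower0, zeta)   # Steps 5-6
    saddle_point_exists = abs(v_upper - v_lower) < zeta             # Step 7
    return v_upper, v_lower, saddle_point_exists


# Hypothetical scalar stand-ins for the value updates at x(0):
# the upper iteration decreases toward 2.0 and the lower one increases toward 2.0.
upper_step = lambda v: 2.0 + 0.1 * (v - 2.0)
lower_step = lambda v: 2.0 + 0.1 * (v - 2.0)
print(solve_zero_sum_game(upper_step, lower_step, 7.0, 1.0, zeta=1e-6))
```

When the two iterations settle on different limits, the final comparison fails and the sketch reports that no saddle point was found, mirroring the second branch of Step 7.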

11.3.4 Simulation Studies

In this section, two examples will be provided to demonstrate the effectiveness of
the present optimal control scheme.

Example 11.3.1 Our first example is chosen from [9] with modifications. Consider
the following linear system

ẋ = x + u + w,

with the initial state $x(0) = 1$. The value function is defined by (11.3.2) with

\[
A = 1, \quad B = \frac{1}{4}, \quad \text{and} \quad C = -1.
\]

The value function is discretized using the Euler and trapezoidal methods with the
sampling time interval $\Delta t = 10^{-2}$. NNs are used to implement the iterative ADP
algorithm.
To obtain a good approximation, it is important to choose the NN structure
carefully. As we use three-layer NNs to implement the iterative ADP algorithm,
the number of hidden-layer neurons $\lambda$ can be chosen by the following empirical
equation:

\[
\lambda = \sqrt{N_I + N_O} + a, \tag{11.3.40}
\]

where $N_I$ and $N_O$ are the dimensions of the input and output vectors, respectively,
and the constant $a$ is an integer between 1 and 10. Equation (11.3.40) shows that for
a low-dimensional system, the number of hidden-layer neurons can be small, while
for high-dimensional complex nonlinear systems, the number of hidden-layer
neurons should increase correspondingly.
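As a quick illustration of the empirical rule (11.3.40), the sketch below evaluates it for a given network size. Rounding to the nearest integer is an assumption made here, since the rule itself does not produce an integer in general; the function name `hidden_neuron_count` is likewise hypothetical.

```python
import math


def hidden_neuron_count(n_in, n_out, a):
    """Empirical rule lambda = sqrt(N_I + N_O) + a, rounded to the nearest integer."""
    if not 1 <= a <= 10:
        raise ValueError("a is expected to be an integer between 1 and 10")
    return round(math.sqrt(n_in + n_out) + a)


# A 2-input, 1-output network: sqrt(3) is about 1.73, so a = 6 gives 8 neurons.
print(hidden_neuron_count(2, 1, 6))
```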
According to (11.3.40), the number of hidden-layer neurons is chosen as 8. The
structure of the model NN is chosen as 2–8–1. The structures of the critic networks for
the upper and lower value functions are chosen as 1–8–1. For the upper value functions,
the structures of the U action network and the W action network are chosen as 1–8–1
and 2–8–1, respectively. For the lower value functions, the structures of the U action
network and the W action network are chosen as 2–8–1 and 1–8–1, respectively. The
initial weights of the action networks, critic networks, and model network are all set
randomly in $[-0.5, 0.5]$. The model network should be trained first. For the given
initial state $x(0) = 1$, we train the model network for 1000 steps with the learning
rate $\eta_m = 0.002$ to reach the given accuracy of $\varepsilon = 10^{-6}$. After the
training of the model network is complete, its weights are kept unchanged. During each
iteration step, the critic networks and the action networks are trained for 200 steps
to reach the given accuracy of $\varepsilon = 10^{-6}$. In the training process, the
learning rates are $\eta_a = \eta_c = 0.01$. The convergence curves of the upper and lower
value functions are shown in Fig. 11.6.
The convergence trajectories of the first row of the weights of the critic network
are shown in Fig. 11.7. The convergence trajectories of the hidden layer weights for
the U action network are shown in Fig. 11.8. The corresponding state and control
trajectories are displayed in Fig. 11.9a and b, respectively.

Fig. 11.6 Convergence of the upper and lower value functions (horizontal axis: time steps; vertical axis: iterative value functions; labeled curves: 1st iteration of upper value function, the optimal value function, 1st iteration of lower value function)


Fig. 11.7 Convergence trajectories of the first row of weights of the critic network (horizontal axis: iteration steps; vertical axis: first-row weights of the critic NN; curves: wc1−1 to wc1−8)

Fig. 11.8 Convergence trajectories of the first row of weights of the U action network (horizontal axis: iteration steps; vertical axis: weights wau1; curves: wau1−1 to wau1−8)
Fig. 11.9 State and control trajectories (panels (a) and (c): state trajectory versus time steps; panels (b) and (d): control trajectories u and w versus time steps)

On the other hand, it is known that the solution of the optimal control problem for
the linear system is quadratic in the state [8]. We can easily obtain $P^{*} = 2.4142$,
and the optimal control laws are $u^{*} = -2P^{*}x$ and $w^{*} = P^{*}x$, respectively. The
state and control trajectories obtained by the GARE approach are displayed in Fig. 11.9c
and d, respectively.
From the comparison results, we can see that the present ADP algorithm is effective
for the zero-sum differential games of linear systems.

Example 11.3.2 The system function is expressed as

\[
\dot{x} = \begin{bmatrix} 0.1 x_1^{2} + 0.05 x_2 \\ 0.2 x_1^{2} - 0.15 x_2 \end{bmatrix}
+ \begin{bmatrix} 0.1 + x_1 + x_2^{2} & 0.1 + x_2 + x_1^{2} & 0 \\ 0 & 0 & 0.8 + x_1 + x_1 x_2 \end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix}
+ \begin{bmatrix} 0.1 + x_1 + x_1 x_2 & 0 \\ 0 & 0.2 + x_1 + x_1 x_2 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}
\]

with $x(0) = [-1, 0.5]^{T}$.

Fig. 11.10 Convergence trajectories of the first row of weights of the critic network (horizontal axis: iteration steps; vertical axis: output layer weights of the critic NN; curves: wc2−1 to wc2−8)

The value function is defined as (11.3.2), where $C = -2I$ and $A$, $B$, and $I$ are
identity matrices of appropriate dimensions. The value function is discretized
using the Euler and trapezoidal methods with the sampling time $\Delta t = 10^{-3}$.
According to (11.3.40), in order to obtain good approximations, the number of hidden
neurons is chosen as 10. The structure of the model NN is chosen as 2–8–1. The structures
of the critic networks for the upper and lower value functions are both chosen as 2–8–1.
For the upper value functions, the structures of the U action network and the W
action network are chosen as 3–8–1 and 5–8–1, respectively. For the lower value
functions, the structures of the U action network and the W action network are
chosen as 5–8–1 and 3–8–1, respectively. The initial weights of the action networks,
critic networks, and model network are all set randomly in $[-1, 1]$. The model network
should be trained first. For the given initial state $x(0) = [-1, 0.5]^{T}$, we train the
model network for 10,000 steps with the learning rate $\eta_m = 0.005$ to reach the given
accuracy of $\varepsilon = 10^{-6}$. After the training of the model network is complete,
its weights are kept unchanged. During each iteration step, the critic networks and the
action networks are trained for 3000 steps to reach the given accuracy of
$\varepsilon = 10^{-6}$. In the training process, the learning rates are
$\eta_a = \eta_c = 0.005$. The convergence trajectories of the first row of weights of the
critic network are shown in Fig. 11.10. The convergence curves of the upper and
lower value functions are shown in Fig. 11.11.

Fig. 11.11 Convergence of value functions (horizontal axis: iteration steps; vertical axis: value functions; curves: upper, lower)
We can see that the upper and lower value functions converge to the optimal solution
of the zero-sum game and that the optimal value function exists. The iterative control
pairs also converge to the optimal ones. The convergence trajectories of the first
row of weights of the U action network are shown in Fig. 11.12. Next, we give the
convergence results of the upper iterative control pairs for the upper value function.
The convergence curves of the iterative controls $w_1$ and $w_2$ for the upper value
functions are shown in Fig. 11.13a and b, respectively. The convergence curves of the
iterative controls $u_1$, $u_2$, and $u_3$ for the upper value functions are shown in
Fig. 11.14a–c, respectively. The curves of the optimal controls are shown in Fig. 11.14d.
Then, we apply the optimal control pair to the system for $T_f = 500$ time steps, and the
corresponding state trajectories are given in Fig. 11.15.
Fig. 11.12 Convergence trajectories of the first row of weights of the U action network (horizontal axis: iteration steps; vertical axis: first-row weights of the action W NN; curves: waw1−1 to waw1−8)

Fig. 11.13 Convergence of upper iterative controls w1 and w2 (panel (a): w1 versus time steps; panel (b): w2 versus time steps; each panel marks the 1st iteration and the limiting iteration)


Fig. 11.14 Convergence of upper iterative controls u 1 , u 2 , u 3 and optimal control trajectories (panels (a)–(c): u1, u2, u3 versus time steps, each marking the 1st iteration and the limiting iteration; panel (d): controls u1, u2, u3, w1, w2 versus time steps)

Fig. 11.15 State trajectories (horizontal axis: time steps; vertical axis: states; curves: x1, x2)


11.4 Synchronous Approximate Optimal Learning for Multi-player Nonzero-Sum Games

Consider the following $N$-player nonlinear differential game described by

\[
\dot{x} = f(x) + \sum_{j=1}^{N} g_j(x) u_j, \tag{11.4.1}
\]

where $x(t) \in \mathbb{R}^{n}$ is the system state with initial state $x_0$,
$f(x) \in \mathbb{R}^{n}$, $g_j(x) \in \mathbb{R}^{n \times m_j}$, and
$u_j \in \mathbb{R}^{m_j}$ is the controller or player. We assume that $f(0) = 0$, $f(x)$ and
$g_j(x)$ are Lipschitz continuous on a compact set $\Omega \subseteq \mathbb{R}^{n}$
containing the origin, and the system is stabilizable on $\Omega$. The system dynamics,
i.e., $f(x)$ and $g_j(x)$, $j = 1, 2, \ldots, N$, are assumed to be unknown.
Define the infinite horizon cost functions associated with each player as

\[
J_i(x_0, u_1, \ldots, u_N) = \int_0^{\infty} \left( x^{T} Q_i x + \sum_{j=1}^{N} u_j^{T} R_{ij} u_j \right) dt
\triangleq \int_0^{\infty} r_i(x, u_1, \ldots, u_N) \, dt, \quad i \in \mathbb{N}, \tag{11.4.2}
\]

where $Q_i \in \mathbb{R}^{n \times n}$ and $R_{ij} \in \mathbb{R}^{m_j \times m_j}$ are
symmetric positive definite matrices.

The value functions associated with admissible control policies
$\mu_i(x) \in \mathcal{A}(\Omega)$ are defined as

\[
V_i(x(t), \mu_1, \ldots, \mu_N) = \int_t^{\infty} \left( x^{T} Q_i x + \sum_{j=1}^{N} \mu_j^{T} R_{ij} \mu_j \right) d\tau
\triangleq \int_t^{\infty} r_i(x(\tau), \mu_1, \ldots, \mu_N) \, d\tau, \quad i \in \mathbb{N}. \tag{11.4.3}
\]

It is desirable to find the optimal admissible control vector
$\{\mu_1^{*}, \ldots, \mu_N^{*}\}$ such that the cost functions (11.4.2) are minimized. The
control vector $\{\mu_1^{*}, \ldots, \mu_N^{*}\}$ corresponds to the Nash equilibrium of the
differential game.

Definition 11.4.1 ([9]) An $N$-tuple of policies $\{\mu_1^{*}, \ldots, \mu_N^{*}\}$ with
$\mu_i^{*} \in \mathcal{A}(\Omega)$ is said to constitute a Nash equilibrium for an
$N$-player game, if the following $N$ inequalities are satisfied:

\[
J_i(\mu_1^{*}, \ldots, \mu_i^{*}, \ldots, \mu_N^{*}) \le J_i(\mu_1^{*}, \ldots, \mu_i, \ldots, \mu_N^{*}), \quad i \in \mathbb{N}.
\]

Assuming that the value functions (11.4.3) are continuously differentiable, the
infinitesimal version of (11.4.3) is

\[
0 = r_i(x, \mu_1, \ldots, \mu_N) + (\nabla V_i)^{T} \left( f(x) + \sum_{j=1}^{N} g_j(x) \mu_j \right), \quad i \in \mathbb{N}, \tag{11.4.4}
\]

with $V_i(0) = 0$. Define the Hamiltonians as

\[
H_i(x, \nabla V_i, \mu_1, \ldots, \mu_N) = r_i(x, \mu_1, \ldots, \mu_N)
+ (\nabla V_i)^{T} \left( f(x) + \sum_{j=1}^{N} g_j(x) \mu_j \right), \quad i \in \mathbb{N}. \tag{11.4.5}
\]

The optimal value functions $V_i^{*}(x_0)$ are defined as

\[
V_i^{*}(x_0) = \min_{\mu_i \in \mathcal{A}(\Omega)} \int_0^{\infty}
\left( x^{T} Q_i x + \sum_{j=1}^{N} \mu_j^{T} R_{ij} \mu_j \right) d\tau, \quad i \in \mathbb{N}.
\]

Then, the associated state feedback control policies can be obtained by

\[
\frac{\partial H_i}{\partial \mu_i} = 0 \;\Rightarrow\;
\mu_i(x) = -\frac{1}{2} R_{ii}^{-1} g_i^{T}(x) \nabla V_i^{*}, \quad i \in \mathbb{N}. \tag{11.4.6}
\]

Therefore, the coupled HJ equations can be written as

\[
0 = x^{T} Q_i x + (\nabla V_i^{*})^{T} f(x)
- \frac{1}{2} (\nabla V_i^{*})^{T} \sum_{j=1}^{N} g_j(x) R_{jj}^{-1} g_j^{T}(x) \nabla V_j^{*}
+ \frac{1}{4} \sum_{j=1}^{N} (\nabla V_j^{*})^{T} g_j(x) R_{jj}^{-1} R_{ij} R_{jj}^{-1} g_j^{T}(x) \nabla V_j^{*},
\quad i \in \mathbb{N}, \tag{11.4.7}
\]

with $V_i^{*}(0) = 0$.

The coupled HJ equations reduce to the well-known coupled algebraic
Riccati equations for the linear quadratic regulator problem. However, the coupled HJ
equations cannot generally be solved due to their nonlinear nature. In the next section,
we will present an online PI algorithm with convergence analysis for the $N$-player
nonzero-sum game.

11.4.1 Derivation and Convergence Analysis

In this section, a PI algorithm (Algorithm 11.4.1) [26] is presented to solve the
coupled HJ equations.

Algorithm 11.4.1 PI for $N$-player nonzero-sum games

Step 1. Start with an initial admissible control vector
$\mu^{0} = \{\mu_1^{0}(x), \ldots, \mu_N^{0}(x)\}$, set $V_i^{0}(\cdot) = 0$, and let $k = 1$.

Step 2. Policy Evaluation: With the policies
$\{\mu_1^{k-1}(x), \ldots, \mu_N^{k-1}(x)\}$, solve the following $N$ nonlinear Lyapunov
equations:

\[
0 = r_i(x, \mu_1^{k-1}, \ldots, \mu_N^{k-1})
+ (\nabla V_i^{k})^{T} \left( f(x) + \sum_{j=1}^{N} g_j(x) \mu_j^{k-1} \right),
\quad V_i^{k}(0) = 0, \quad i \in \mathbb{N}. \tag{11.4.8}
\]

Step 3. Policy Improvement: Update the $N$-tuple of control policies simultaneously by

\[
\mu_i^{k}(x) = \arg\min_{\mu_i \in \mathcal{A}(\Omega)} H_i\left( x, \nabla V_i^{k}, \mu_1, \ldots, \mu_N \right)
= -\frac{1}{2} R_{ii}^{-1} g_i^{T}(x) \nabla V_i^{k}, \quad i \in \mathbb{N}. \tag{11.4.9}
\]

Step 4. If $\max\left\{ \|V_1^{k} - V_1^{k-1}\|, \ldots, \|V_N^{k} - V_N^{k-1}\| \right\} \le \varepsilon$
($\varepsilon$ is a small positive number), stop and obtain the approximate optimal control
vector $\mu^{k} = \{\mu_1^{k}(x), \ldots, \mu_N^{k}(x)\}$; else, set $k = k + 1$, and go back
to Step 2.

For linear systems, it is shown that the Nash equilibrium of nonzero-sum games
can be determined by a quasi-Newton method [28]. However, this property has not
been demonstrated for nonlinear systems [26, 28]. In this chapter, we will prove
the convergence of the online PI algorithm for $N$-player nonzero-sum games with
nonlinear dynamics. It can also be shown that the online PI algorithm for the $N$-player
nonzero-sum game is a quasi-Newton method. The analysis is based on the work
of [28, 34].
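To make the steps of Algorithm 11.4.1 concrete, the following sketch runs them on a scalar linear-quadratic game, where each value function is $V_i = p_i x^{2}$ and each policy is $u_j = -k_j x$, so both policy evaluation and policy improvement have closed forms. The function names, the choice of a scalar system and the numerical values are assumptions made for illustration; convergence is observed for this particular example and is not claimed in general.

```python
def policy_iteration(a, b, q, r, tol=1e-12, max_iter=200):
    """PI for x' = a*x + sum_j b[j]*u_j with V_i = p[i]*x^2 and u_j = -k[j]*x.

    r[i][j] is the weight of u_j in the cost of player i. The open-loop pole a is
    assumed negative so that the zero policy (k = 0) is an admissible start.
    """
    n = len(b)
    k = [0.0] * n                      # Step 1: initial admissible policies
    p = [0.0] * n
    for _ in range(max_iter):
        a_cl = a - sum(b[j] * k[j] for j in range(n))   # closed-loop pole
        if a_cl >= 0.0:
            raise RuntimeError("policies are no longer stabilizing")
        # Step 2: Lyapunov equation 0 = q_i + sum_j r_ij k_j^2 + 2 p_i a_cl.
        p_new = [(q[i] + sum(r[i][j] * k[j] ** 2 for j in range(n))) / (-2.0 * a_cl)
                 for i in range(n)]
        # Step 3: mu_i = -(1/2) R_ii^{-1} g_i dV_i/dx = -(b_i p_i / r_ii) x.
        k = [b[i] * p_new[i] / r[i][i] for i in range(n)]
        # Step 4: stop when the value coefficients have settled.
        if max(abs(p_new[i] - p[i]) for i in range(n)) <= tol:
            return p_new, k
        p = p_new
    raise RuntimeError("no convergence within max_iter iterations")


def hj_residual(i, a, b, q, r, p):
    """Residual of the coupled HJ equation (11.4.7) for player i, divided by x^2."""
    n = len(b)
    k = [b[j] * p[j] / r[j][j] for j in range(n)]
    a_cl = a - sum(b[j] * k[j] for j in range(n))
    return q[i] + sum(r[i][j] * k[j] ** 2 for j in range(n)) + 2.0 * p[i] * a_cl


# Hypothetical two-player example.
a, b = -1.0, [1.0, 1.0]
q, r = [1.0, 2.0], [[1.0, 1.0], [1.0, 1.0]]
p, k = policy_iteration(a, b, q, r)
print(p, k, [hj_residual(i, a, b, q, r, p) for i in range(2)])
```

The residual check at the end confirms that the fixed point of the iteration satisfies the scalar form of the coupled HJ equations, which is the sense in which PI solves (11.4.7).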
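Consider a Banach space $\mathbb{V} \subset \{V(x) : \Omega \to \mathbb{R}, V(0) = 0\}$ with a
norm $\|\cdot\|_{\Omega}$. Define a mapping
$\mathcal{G}_i : \underbrace{\mathbb{V} \times \mathbb{V} \times \cdots \times \mathbb{V}}_{N} \to \mathbb{V}$
as follows:

\[
\mathcal{G}_i = x^{T} Q_i x + (\nabla V_i)^{T} f(x)
- \frac{1}{2} (\nabla V_i)^{T} \sum_{j=1}^{N} g_j(x) R_{jj}^{-1} g_j^{T}(x) \nabla V_j
+ \frac{1}{4} \sum_{j=1}^{N} (\nabla V_j)^{T} g_j(x) R_{jj}^{-1} R_{ij} R_{jj}^{-1} g_j^{T}(x) \nabla V_j,
\quad i \in \mathbb{N}. \tag{11.4.10}
\]

A mapping $\mathcal{T}_i : \mathbb{V} \to \mathbb{V}$ is defined as

\[
\mathcal{T}_i V_i = V_i - \left( \mathcal{G}'_{i V_i} \right)^{-1} \mathcal{G}_i, \quad i \in \mathbb{N}, \tag{11.4.11}
\]

where $\mathcal{G}'_{i V_i}$ denotes the Fréchet derivative of $\mathcal{G}_i$ taken with
respect to $V_i$. Since the Fréchet derivative is difficult to compute directly, we
introduce the following Gâteaux derivative [34, 36].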
Definition 11.4.2 Let $\mathcal{G} : \mathcal{U}(V) \subseteq \mathcal{X} \to \mathcal{Y}$ be
a given map, where $\mathcal{X}$ and $\mathcal{Y}$ are Banach spaces, and $\mathcal{U}(V)$
denotes a neighborhood of $V$. The map $\mathcal{G}$ is Gâteaux differentiable at $V$ if and
only if there exists a bounded linear operator $\mathcal{L} : \mathcal{X} \to \mathcal{Y}$
such that $\mathcal{G}(V + sM) - \mathcal{G}(V) = s\mathcal{L}(M) + o(s)$, $s \to 0$, for all
$M$ with $\|M\|_{\Omega} = 1$ and all real numbers $s$ in some neighborhood of zero, where
$\lim_{s \to 0} (o(s)/s) = 0$. $\mathcal{L}$ is called the Gâteaux derivative of
$\mathcal{G}$ at $V$. Then, the Gâteaux derivative at $V$ is defined as

\[
\mathcal{L}(M) = \lim_{s \to 0} \frac{\mathcal{G}(V + sM) - \mathcal{G}(V)}{s}. \tag{11.4.12}
\]

To compute the Fréchet derivative, we introduce the following Lemma 11.4.1
(Chap. 4.2, [36]).

Lemma 11.4.1 If the Gâteaux derivative $\mathcal{L}$ exists in some neighborhood of $V$,
and if $\mathcal{L}$ is continuous at $V$, then $\mathcal{G}' = \mathcal{L}$ is also a Fréchet
derivative at $V$.

Therefore, we can compute the Fréchet derivative $\mathcal{G}'_{i V_i}$, which is given in
the following lemma, by (11.4.12).
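Lemma 11.4.2 Let $\mathcal{G}_i$ be a mapping defined in (11.4.10). Then,
$\forall V_i \in \mathbb{V}$, the Fréchet differential of $\mathcal{G}_i$ at $V_i$ is

\[
\mathcal{G}'_{i V_i} M = \mathcal{L}_{i V_i}(M) = (\nabla M)^{T} f
- \frac{1}{2} (\nabla M)^{T} \sum_{j=1}^{N} g_j R_{jj}^{-1} g_j^{T} \nabla V_j.
\]

Proof First, according to the definition of $\mathcal{G}_i$ in (11.4.10),
$\forall V_i \in \mathbb{V}$, we have

\[
\begin{aligned}
\mathcal{G}_i(V_i + sM) - \mathcal{G}_i(V_i)
={}& x^{T} Q_i x - \frac{1}{4} \left( \nabla (V_i + sM) \right)^{T} g_i R_{ii}^{-1} g_i^{T} \nabla (V_i + sM)
+ \left( \nabla (V_i + sM) \right)^{T} f \\
& - \frac{1}{2} \sum_{j=1, j \ne i}^{N} \left( \nabla (V_i + sM) \right)^{T} g_j R_{jj}^{-1} g_j^{T} \nabla V_j
+ \frac{1}{4} \sum_{j=1, j \ne i}^{N} (\nabla V_j)^{T} g_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} g_j^{T} \nabla V_j \\
& - \Bigg( x^{T} Q_i x - \frac{1}{4} (\nabla V_i)^{T} g_i R_{ii}^{-1} g_i^{T} \nabla V_i + (\nabla V_i)^{T} f \\
& \qquad - \frac{1}{2} \sum_{j=1, j \ne i}^{N} (\nabla V_i)^{T} g_j R_{jj}^{-1} g_j^{T} \nabla V_j
+ \frac{1}{4} \sum_{j=1, j \ne i}^{N} (\nabla V_j)^{T} g_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} g_j^{T} \nabla V_j \Bigg) \\
={}& s (\nabla M)^{T} f - \frac{s}{4} (\nabla M)^{T} g_i R_{ii}^{-1} g_i^{T} \nabla V_i
- \frac{s}{4} (\nabla V_i)^{T} g_i R_{ii}^{-1} g_i^{T} \nabla M \\
& - \frac{s^{2}}{4} (\nabla M)^{T} g_i R_{ii}^{-1} g_i^{T} \nabla M
- \frac{s}{2} \sum_{j=1, j \ne i}^{N} (\nabla M)^{T} g_j R_{jj}^{-1} g_j^{T} \nabla V_j \\
={}& s (\nabla M)^{T} f - \frac{s}{2} \sum_{j=1}^{N} (\nabla M)^{T} g_j R_{jj}^{-1} g_j^{T} \nabla V_j
- \frac{s^{2}}{4} (\nabla M)^{T} g_i R_{ii}^{-1} g_i^{T} \nabla M.
\end{aligned}
\]

Therefore, the Gâteaux differential at $V_i$ is

\[
\mathcal{L}_{i V_i}(M) = \lim_{s \to 0} \frac{\mathcal{G}_i(V_i + sM) - \mathcal{G}_i(V_i)}{s}
= (\nabla M)^{T} f - \frac{1}{2} (\nabla M)^{T} \sum_{j=1}^{N} g_j R_{jj}^{-1} g_j^{T} \nabla V_j. \tag{11.4.13}
\]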

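Next, we will prove that $\mathcal{L}_{i V_i}$ is continuous at $V_i$. For any
$M_0 \in \mathbb{V}$, we have

\[
\begin{aligned}
\left\| \mathcal{L}_{i V_i}(M) - \mathcal{L}_{i V_i}(M_0) \right\|_{\Omega}
&= \left\| \left( \nabla (M - M_0) \right)^{T} f
- \frac{1}{2} \left( \nabla (M - M_0) \right)^{T} \sum_{j=1}^{N} g_j R_{jj}^{-1} g_j^{T} \nabla V_j \right\|_{\Omega} \\
&\le \left\| \left( \nabla (M - M_0) \right)^{T} f \right\|_{\Omega}
+ \left\| \frac{1}{2} \left( \nabla (M - M_0) \right)^{T} \sum_{j=1}^{N} g_j R_{jj}^{-1} g_j^{T} \nabla V_j \right\|_{\Omega} \\
&\le \left( \| f \|_{\Omega} + \left\| \frac{1}{2} \sum_{j=1}^{N} g_j R_{jj}^{-1} g_j^{T} \nabla V_j \right\|_{\Omega} \right)
\left\| \nabla (M - M_0) \right\|_{\Omega} \\
&\le \left( \| f \|_{\Omega} + \left\| \frac{1}{2} \sum_{j=1}^{N} g_j R_{jj}^{-1} g_j^{T} \nabla V_j \right\|_{\Omega} \right)
\alpha \| M - M_0 \|_{\Omega} \\
&\triangleq \Phi \| M - M_0 \|_{\Omega},
\end{aligned}
\]

where $\alpha > 0$. Therefore, $\forall \varepsilon > 0$, there exists a
$\delta = \varepsilon / \Phi$ such that

\[
\left\| \mathcal{L}_{i V_i}(M) - \mathcal{L}_{i V_i}(M_0) \right\|_{\Omega} \le \Phi \| M - M_0 \|_{\Omega} < \varepsilon,
\]

when $\| M - M_0 \|_{\Omega} < \delta$; i.e., $\mathcal{L}_{i V_i}$ is continuous at $V_i$.
Then, according to Lemma 11.4.1, $\mathcal{G}'_{i V_i} M = \mathcal{L}_{i V_i}(M)$ is the
Fréchet differential as in (11.4.13), and $\mathcal{G}'_{i V_i} = \mathcal{L}_{i V_i}$
is the Fréchet derivative. This completes the proof of the lemma.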
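The formula of Lemma 11.4.2 can be checked numerically in a scalar linear-quadratic setting, where $V_j = p_j x^{2}$ and $M = m x^{2}$, so that every term of (11.4.10) is a multiple of $x^{2}$. The sketch below compares a finite-difference Gâteaux quotient with the closed-form derivative; the function names, the scalar reduction and the numerical values are assumptions made for illustration.

```python
def G(i, p, a, b, q, r):
    """Scalar form of (11.4.10) divided by x^2, for V_j = p[j] * x^2."""
    n = len(b)
    coupling = sum(b[j] ** 2 / r[j][j] * p[j] for j in range(n))
    quad = sum(r[i][j] * (b[j] * p[j] / r[j][j]) ** 2 for j in range(n))
    return q[i] + 2.0 * a * p[i] - 2.0 * p[i] * coupling + quad


def frechet_derivative(i, p, m, a, b, r):
    """Lemma 11.4.2 applied to M = m * x^2, again divided by x^2."""
    coupling = sum(b[j] ** 2 / r[j][j] * p[j] for j in range(len(b)))
    return m * (2.0 * a - 2.0 * coupling)


# Hypothetical two-player data and perturbation direction.
a, b = -1.0, [1.0, 1.0]
q, r = [1.0, 2.0], [[1.0, 1.0], [1.0, 1.0]]
p, m, s, i = [0.4, 0.6], 1.0, 1e-6, 0

p_shift = list(p)
p_shift[i] += s * m                 # V_i + s*M, with the other players' values fixed
gateaux = (G(i, p_shift, a, b, q, r) - G(i, p, a, b, q, r)) / s
print(gateaux, frechet_derivative(i, p, m, a, b, r))
```

The two numbers differ only by a term of order $s$, which corresponds to the $s^{2}$ remainder in the expansion used in the proof.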

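With the results in Lemma 11.4.2, we can prove that the online PI algorithm is
mathematically equivalent to the quasi-Newton's iteration in a Banach space $\mathbb{V}$.

Theorem 11.4.1 Let $\mathcal{T}_i$ be a mapping defined in (11.4.11). Then, the iteration
between (11.4.8) and (11.4.9) is equivalent to the following quasi-Newton's iteration:

\[
V_i^{k+1} = \mathcal{T}_i V_i^{k} = V_i^{k} - \left( \mathcal{G}'_{i V_i^{k}} \right)^{-1} \mathcal{G}_i, \quad k = 0, 1, \ldots. \tag{11.4.14}
\]

Proof According to (11.4.9) and Lemma 11.4.2, we have

\[
\begin{aligned}
\mathcal{G}'_{i V_i^{k}} V_i^{k+1}
&= \left( \nabla V_i^{k+1} \right)^{T} f
- \frac{1}{2} \left( \nabla V_i^{k+1} \right)^{T} \sum_{j=1}^{N} g_j R_{jj}^{-1} g_j^{T} \nabla V_j^{k} \\
&= \left( \nabla V_i^{k+1} \right)^{T} \left( f - \sum_{j=1}^{N} \frac{1}{2} g_j R_{jj}^{-1} g_j^{T} \nabla V_j^{k} \right) \\
&= \left( \nabla V_i^{k+1} \right)^{T} \left( f + \sum_{j=1}^{N} g_j \mu_j^{k} \right),
\end{aligned}
\]

and

\[
\mathcal{G}'_{i V_i^{k}} V_i^{k} = \left( \nabla V_i^{k} \right)^{T} \left( f + \sum_{j=1}^{N} g_j \mu_j^{k} \right).
\]

From (11.4.9) and (11.4.10), we have

\[
\mathcal{G}_i = r_i(x, \mu_1^{k}, \ldots, \mu_N^{k})
+ \left( \nabla V_i^{k} \right)^{T} \left( f(x) + \sum_{j=1}^{N} g_j(x) \mu_j^{k} \right).
\]

Thus,

\[
\mathcal{G}'_{i V_i^{k}} V_i^{k} - \mathcal{G}_i = -r_i\left( x, \mu_1^{k}, \ldots, \mu_N^{k} \right).
\]

Considering (11.4.8), we have
$\mathcal{G}'_{i V_i^{k}} V_i^{k+1} = \mathcal{G}'_{i V_i^{k}} V_i^{k} - \mathcal{G}_i$, which is
the same as (11.4.14). The proof is complete.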

Since the PI algorithm for N -player nonzero-sum games is equivalent to the quasi-
Newton’s iteration, the value function Vik+1 will converge to the optimal value func-
tion Vi∗ as k → ∞, ∀i ∈ N.

11.4.2 Neural Network Implementation

Based on the above results, we will develop an online synchronous approximate
optimal learning algorithm using NN approximation for the multi-player nonzero-sum
game with unknown dynamics (see Fig. 11.16). By using a model NN for nonzero-sum
game problems, neither the internal dynamics nor the drift dynamics are required.
Compared with the algorithm in [26], there are fewer parameters to tune in the present
algorithm.
A. Model NN Design
In this section, a model NN is used to reconstruct the unknown system dynamics by
using input–output data [38]. The unknown nonlinear system dynamics (11.4.1) can
be represented as

\[
\dot{x} = A x + W_f^{T} \sigma_f(x) + \varepsilon_f(x)
+ \sum_{j=1}^{N} \left( W_{g_j}^{T} \sigma_{g_j}(x) + \varepsilon_{g_j}(x) \right) u_j, \tag{11.4.15}
\]

Fig. 11.16 Structural diagram of the online synchronous approximate optimal learning algorithm


where $A$ is a stable matrix, $W_f \in \mathbb{R}^{n \times n}$ and
$W_{g_j} \in \mathbb{R}^{n \times n}$ are the unknown bounded ideal weight matrices,
$\sigma_f(\cdot) \in \mathbb{R}^{n}$ and $\sigma_{g_j}(\cdot) \in \mathbb{R}^{n}$ are the
activation functions, and $\varepsilon_f(\cdot) \in \mathbb{R}^{n}$ and
$\varepsilon_{g_j}(\cdot) \in \mathbb{R}^{n}$ are the bounded reconstruction errors,
respectively. The model (11.4.15) is obtained by letting $m_j = 1$ in (11.4.1), and it can
easily be extended to the general case. Moreover, $f(x)$ is approximated by

\[
f(x) = A x + W_f^{T} \sigma_f(x) + \varepsilon_f(x),
\]

and $g_j(x)$ is approximated by

\[
g_j(x) = W_{g_j}^{T} \sigma_{g_j}(x) + \varepsilon_{g_j}(x). \tag{11.4.16}
\]

The activation function $\sigma(\cdot)$ is selected as a monotonically increasing function
satisfying

\[
0 \le \sigma(x) - \sigma(y) \le k (x - y), \tag{11.4.17}
\]

$\forall x, y \in \mathbb{R}$ with $x \ge y$ and $k > 0$, such as $\sigma(x) = \tanh(x)$.



Assumption 11.4.1 The ideal NN weights are bounded by positive constraints, i.e.,
$W_f^{T} W_f \le \bar{W}_f^{T} \bar{W}_f$ and $W_{g_j}^{T} W_{g_j} \le \bar{W}_{g_j}^{T} \bar{W}_{g_j}$,
where $\bar{W}_f$ and $\bar{W}_{g_j}$ are known positive definite matrices.

Assumption 11.4.2 The reconstruction errors $\varepsilon_f(x)$ and $\varepsilon_{g_j}(x)$ are
assumed to be upper bounded by a function of the modeling error such that

\[
\varepsilon_f^{T}(x) \varepsilon_f(x) \le \lambda \tilde{x}^{T} \tilde{x}, \qquad
\varepsilon_{g_j}^{T}(x) \varepsilon_{g_j}(x) \le \lambda \tilde{x}^{T} \tilde{x},
\]

where $\lambda$ is a constant value, and $\tilde{x} = x - \hat{x}$ is the system modeling
error ($\hat{x}$ is the estimated system state).

Assumption 11.4.3 The control inputs are bounded, i.e., $\|u_j\| \le \bar{u}_j$.

Based on (11.4.15), the model NN used to identify the system (11.4.1) is given
by

N
x̂˙ = A x̂ + Ŵ Tf σ f (x̂) + ŴgTj σg j (x̂)u j , (11.4.18)
j=1

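where $\hat{W}_f$ and $\hat{W}_{g_j}$ are the estimates of the ideal weight matrices $W_f$ and $W_{g_j}$,
respectively. Then, the modeling error dynamics is written as

\[ \begin{aligned}
\dot{\tilde{x}} ={}& A\tilde{x} + \tilde{W}_f^{T}\sigma_f(\hat{x}) + W_f^{T}\big[\sigma_f(x) - \sigma_f(\hat{x})\big] + \varepsilon_f(x) \\
&+ \sum_{j=1}^{N}\Big[\tilde{W}_{g_j}^{T}\sigma_{g_j}(\hat{x}) + W_{g_j}^{T}\big(\sigma_{g_j}(x) - \sigma_{g_j}(\hat{x})\big) + \varepsilon_{g_j}(x)\Big]u_j,
\end{aligned} \tag{11.4.19} \]

where $\tilde{W}_f = W_f - \hat{W}_f$ and $\tilde{W}_{g_j} = W_{g_j} - \hat{W}_{g_j}$.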

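Theorem 11.4.2 The modeling error $\tilde{x}$ will asymptotically converge to zero as
$t \to \infty$, if the weight matrices are updated through the following equations

\[ \begin{aligned}
\dot{\hat{W}}_f &= \Gamma_f\,\sigma_f(\hat{x})\,\tilde{x}^{T}, \\
\dot{\hat{W}}_{g_j} &= \Gamma_{g_j}\,\sigma_{g_j}(\hat{x})\,u_j\,\tilde{x}^{T}, \quad j = 1, \ldots, N,
\end{aligned} \tag{11.4.20} \]

where $\Gamma_f$ and $\Gamma_{g_j}$ are symmetric positive definite matrices.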
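The identifier (11.4.18) with the update laws (11.4.20) can be sketched with forward-Euler integration. Everything numeric below is an assumption made for illustration: the single-input toy plant, its "true" weights, the input sin(t), the gains, and the step size are not taken from the text. The plant has exactly the model structure, so the reconstruction errors are zero. Because this toy plant is itself stable, both the plant state and the model state decay toward the origin, so the run illustrates the mechanics of the update laws rather than a demanding identification task.

```python
import numpy as np

def simulate_identifier(t_final=5.0, dt=1e-3):
    """Euler simulation of the model NN (11.4.18) with tuning laws (11.4.20)."""
    n = 2
    A = -10.0 * np.eye(n)                        # stable design matrix
    W_f = np.array([[1.0, -0.5], [0.3, 0.8]])    # hypothetical ideal weights
    W_g = 0.5 * np.eye(n)                        # hypothetical ideal weights
    gamma_f = np.eye(n)                          # adaptation gains (assumed)
    gamma_g = np.eye(n)

    x = np.array([1.0, -1.0])                    # plant state
    x_hat = np.array([0.5, -0.5])                # model state
    W_f_hat = np.zeros((n, n))
    W_g_hat = np.zeros((n, n))

    n_steps = int(round(t_final / dt))
    errors = np.empty(n_steps)
    for step in range(n_steps):
        u = np.sin(step * dt)                    # bounded scalar input
        x_tilde = x - x_hat                      # modeling error
        errors[step] = np.linalg.norm(x_tilde)
        sig_x, sig_hat = np.tanh(x), np.tanh(x_hat)

        x_dot = A @ x + W_f.T @ sig_x + (W_g.T @ sig_x) * u
        x_hat_dot = A @ x_hat + W_f_hat.T @ sig_hat + (W_g_hat.T @ sig_hat) * u
        W_f_hat_dot = gamma_f @ np.outer(sig_hat, x_tilde)
        W_g_hat_dot = gamma_g @ np.outer(sig_hat, x_tilde) * u

        x = x + dt * x_dot
        x_hat = x_hat + dt * x_hat_dot
        W_f_hat = W_f_hat + dt * W_f_hat_dot
        W_g_hat = W_g_hat + dt * W_g_hat_dot
    return errors

errors = simulate_identifier()
print(errors[0], errors[-1])
```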

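Proof Consider the following Lyapunov function candidate

\[ L(x) = \frac{1}{2}\tilde{x}^{T}\tilde{x} + \frac{1}{2}\operatorname{tr}\big\{\tilde{W}_f^{T}\Gamma_f^{-1}\tilde{W}_f\big\} + \frac{1}{2}\sum_{j=1}^{N}\operatorname{tr}\big\{\tilde{W}_{g_j}^{T}\Gamma_{g_j}^{-1}\tilde{W}_{g_j}\big\}. \tag{11.4.21} \]

The time derivative of the Lyapunov function (11.4.21) along the trajectories of the
modeling error system (11.4.19) is computed by

\[ \dot{L}(t) = \tilde{x}^{T}\dot{\tilde{x}} + \operatorname{tr}\big\{\tilde{W}_f^{T}\Gamma_f^{-1}\dot{\tilde{W}}_f\big\} + \sum_{j=1}^{N}\operatorname{tr}\big\{\tilde{W}_{g_j}^{T}\Gamma_{g_j}^{-1}\dot{\tilde{W}}_{g_j}\big\}. \tag{11.4.22} \]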

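Substituting (11.4.19) and (11.4.20) into (11.4.22), we have

\[ \begin{aligned}
\dot{L}(t) ={}& \tilde{x}^{T}A\tilde{x} + \tilde{x}^{T}W_f^{T}\big[\sigma_f(x) - \sigma_f(\hat{x})\big] + \tilde{x}^{T}\varepsilon_f(x) \\
&+ \tilde{x}^{T}\sum_{j=1}^{N}W_{g_j}^{T}\big(\sigma_{g_j}(x) - \sigma_{g_j}(\hat{x})\big)u_j + \tilde{x}^{T}\sum_{j=1}^{N}\varepsilon_{g_j}(x)u_j.
\end{aligned} \tag{11.4.23} \]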

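According to (11.4.17), we have

\[ \tilde{x}^{T}W_f^{T}\big[\sigma_f(x) - \sigma_f(\hat{x})\big] \le \frac{1}{2}\tilde{x}^{T}W_f^{T}W_f\tilde{x} + \frac{1}{2}k^{2}\tilde{x}^{T}\tilde{x}, \]
and
\[ \tilde{x}^{T}W_{g_j}^{T}\big[\sigma_{g_j}(x) - \sigma_{g_j}(\hat{x})\big] \le \frac{1}{2}\tilde{x}^{T}W_{g_j}^{T}W_{g_j}\tilde{x} + \frac{1}{2}k^{2}\tilde{x}^{T}\tilde{x}. \]
According to Assumption 11.4.2, we obtain

\[ \tilde{x}^{T}\varepsilon_f(x) \le \frac{1}{2}\tilde{x}^{T}\tilde{x} + \frac{1}{2}\lambda\tilde{x}^{T}\tilde{x} \quad\text{and}\quad \tilde{x}^{T}\varepsilon_{g_j}(x) \le \frac{1}{2}\tilde{x}^{T}\tilde{x} + \frac{1}{2}\lambda\tilde{x}^{T}\tilde{x}. \]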
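Therefore, (11.4.23) can be upper bounded as
\[ \begin{aligned}
\dot{L}(t) \le{}& \tilde{x}^{T}A\tilde{x} + \frac{1}{2}\tilde{x}^{T}W_f^{T}W_f\tilde{x} + \frac{1}{2}k^{2}\tilde{x}^{T}\tilde{x} + \frac{1}{2}\tilde{x}^{T}\tilde{x} + \frac{1}{2}\lambda\tilde{x}^{T}\tilde{x} \\
&+ \frac{1}{2}\tilde{x}^{T}\sum_{j=1}^{N}u_jW_{g_j}^{T}W_{g_j}\tilde{x} + \frac{1}{2}k^{2}\sum_{j=1}^{N}u_j\tilde{x}^{T}\tilde{x} + \frac{1+\lambda}{2}\sum_{j=1}^{N}u_j\tilde{x}^{T}\tilde{x} \\
={}& \tilde{x}^{T}\Xi\tilde{x},
\end{aligned} \]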

where

\[ \Xi = A + \frac{1}{2}W_f^{T}W_f + \frac{1}{2}\sum_{j=1}^{N}u_jW_{g_j}^{T}W_{g_j} + \bigg(\frac{1}{2} + \frac{1}{2}\lambda + \frac{1}{2}k^{2} + \frac{1+\lambda}{2}\sum_{j=1}^{N}u_j + \frac{1}{2}k^{2}\sum_{j=1}^{N}u_j\bigg)I_n. \]

If $A$ is selected such that $\Xi \le 0$, it can be concluded that $\dot{L}(t) \le 0$, and then
$\tilde{x}(t) \to 0$ as $t \to \infty$. The proof is complete.
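The design condition on A can be checked numerically once bounds on the other quantities are available. The sketch below assembles Ξ from the expression in the proof and tests whether its symmetric part is negative semidefinite. It rests on several assumptions: the input bounds of Assumption 11.4.3 are used in place of u_j, the weight matrices and constants are hypothetical values, and the function names are introduced here for illustration only.

```python
import numpy as np

def xi_matrix(A, W_f, W_g_list, u_bar, k, lam):
    """Assemble Xi from the proof, with u_j replaced by its bound u_bar[j]."""
    n = A.shape[0]
    u_sum = float(np.sum(u_bar))
    quad = 0.5 * W_f.T @ W_f
    for u_j, W_g in zip(u_bar, W_g_list):
        quad = quad + 0.5 * u_j * (W_g.T @ W_g)
    scalar = (0.5 + 0.5 * lam + 0.5 * k**2
              + 0.5 * (1.0 + lam) * u_sum + 0.5 * k**2 * u_sum)
    return A + quad + scalar * np.eye(n)

def is_negative_semidefinite(M, tol=1e-12):
    """x^T M x <= 0 for all x iff the symmetric part of M has no positive eigenvalue."""
    sym = 0.5 * (M + M.T)
    return bool(np.max(np.linalg.eigvalsh(sym)) <= tol)

# Hypothetical bounds for a 3-player, 2-state problem.
W_f = np.array([[1.0, -0.5], [0.3, 0.8]])
W_g_list = [0.5 * np.eye(2)] * 3
u_bar = [1.0, 1.0, 1.0]

for a in (-10.0, -1.0):
    Xi = xi_matrix(a * np.eye(2), W_f, W_g_list, u_bar, k=1.0, lam=0.1)
    print(a, is_negative_semidefinite(Xi))
```

With these numbers, A = -10 I satisfies the condition while A = -I does not, which is in line with the strongly stable A used in the simulation study.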

The model NN is a stable and asymptotic identifier, and thus, the exact knowledge
of the system dynamics can be removed. Consequently, we can obtain the following
model NN

\[ \dot{x} = Ax + W_f^{T}\sigma_f(x) + \sum_{j=1}^{N}\big(W_{g_j}^{T}\sigma_{g_j}(x)\big)u_j. \tag{11.4.24} \]

B. Online Synchronous Approximate Optimal Learning Algorithm


Assume that the value functions $V_i(x)$ are continuously differentiable. Then, the
value functions $V_i(x)$ are approximated on a compact set $\Omega$ by feedforward NNs as

\[ V_i(x) = W_c^{iT}\phi_i(x) + \varepsilon_i(x), \quad i \in \mathbb{N}, \tag{11.4.25} \]

where $W_c^{i} \in \mathbb{R}^{K}$ are the unknown bounded ideal weights ($\|W_c^{i}\| \le \bar{W}_c^{i}$), $\phi_i(x) \in \mathbb{R}^{K}$
are the activation functions, $K$ is the number of neurons in the hidden layer, and
$\varepsilon_i(x) \in \mathbb{R}$ are the bounded NN approximation errors. The activation functions can
be selected as polynomial, sigmoid, tanh, etc.
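The derivatives of value functions with respect to $x$ are represented as
\[ \frac{\partial V_i(x)}{\partial x} = \bigg(\frac{\partial\phi_i(x)}{\partial x}\bigg)^{T}W_c^{i} + \frac{\partial\varepsilon_i(x)}{\partial x} = \nabla\phi_i^{T}W_c^{i} + \nabla\varepsilon_i, \quad i \in \mathbb{N}, \tag{11.4.26} \]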

where $\nabla\phi_i \in \mathbb{R}^{K\times n}$ and $\nabla\varepsilon_i \in \mathbb{R}^{n}$ are bounded gradients of the activation functions
and approximation errors, respectively. As the number of neurons in the hidden layer
$K \to \infty$, the approximation errors $\varepsilon_i \to 0$, and the derivatives $\nabla\varepsilon_i \to 0$ uniformly.
The approximation errors $\varepsilon_i$ and the derivatives $\nabla\varepsilon_i$ are bounded by constants on a
compact set $\Omega$. Thus, (11.4.4) can be rewritten as
\[ 0 = r_i(x, \mu_1, \ldots, \mu_N) + \big(W_c^{iT}\nabla\phi_i + (\nabla\varepsilon_i)^{T}\big)\dot{x}, \quad i \in \mathbb{N}. \tag{11.4.27} \]

Since the ideal weights are unknown, the critic NNs can be written in terms of
the weight estimates as
\[ \hat{V}_i(x) = \hat{W}_c^{iT}\phi_i(x), \quad i \in \mathbb{N}. \]

To avoid using the knowledge of the system dynamics, the model NN is used to
approximate the system dynamics. Then, (11.4.27) can be rewritten as

\[ 0 = r_i(x, \mu_1, \ldots, \mu_N) + \big(W_c^{iT}\nabla\phi_i + (\nabla\varepsilon_i)^{T}\big)\bigg(Ax + W_f^{T}\sigma_f(x) + \sum_{j=1}^{N}\big(W_{g_j}^{T}\sigma_{g_j}(x)\big)u_j\bigg). \]

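The approximate Hamiltonians can be derived as follows:

\[ \begin{aligned}
H_i(x, \hat{W}_c^{i}, u_1, \ldots, u_N) ={}& r_i(x, \mu_1, \ldots, \mu_N) + \hat{W}_c^{iT}\nabla\phi_i(x) \\
&\times\bigg(Ax + W_f^{T}\sigma_f(x) + \sum_{j=1}^{N}\big(W_{g_j}^{T}\sigma_{g_j}(x)\big)u_j\bigg) \\
\triangleq{}& e_i, \quad i \in \mathbb{N}.
\end{aligned} \]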

It is desired to design $\hat{W}_c^{i}$ to minimize the following objective functions

\[ E_i(\hat{W}_c^{i}) = \frac{1}{2}e_i^{2}, \quad i \in \mathbb{N}. \]
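The tuning law for the critic NN weights is a standard steepest descent algorithm,
which is given by
\[ \dot{\hat{W}}_c^{i} = -\eta_i\bigg(\frac{\partial E_i}{\partial\hat{W}_c^{i}}\bigg)^{T} = -\eta_i\theta_i\big(r_i + \hat{W}_c^{iT}\theta_i\big), \quad i \in \mathbb{N}, \tag{11.4.28} \]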

where $\eta_i > 0$ is the learning rate of the critic NN, and

\[ \theta_i = \nabla\phi_i\bigg(Ax + W_f^{T}\sigma_f(x) + \sum_{j=1}^{N}\big(W_{g_j}^{T}\sigma_{g_j}(x)\big)u_j\bigg). \]
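One Euler step of the steepest-descent law (11.4.28) can be sketched as follows. The quadratic basis matches the one chosen later in the simulation study; the state, the model-predicted state derivative, the utility value, the learning rate, and the step size are hypothetical numbers chosen only to make the step concrete, and the function names are introduced here. For a single step with a fixed regressor, the residual shrinks by the factor 1 - η dt ‖θ‖².

```python
import numpy as np

def grad_phi(x):
    """Jacobian (K x n) of phi(x) = [x1^2, x1*x2, x2^2]^T."""
    x1, x2 = x
    return np.array([[2.0 * x1, 0.0],
                     [x2, x1],
                     [0.0, 2.0 * x2]])

def critic_regressor(x, x_dot_model):
    """theta_i = grad(phi_i)(x) times the model-predicted state derivative."""
    return grad_phi(x) @ x_dot_model

def critic_residual(W_hat, theta, r):
    """Approximate Hamiltonian e_i = r_i + W_hat^T theta_i."""
    return r + W_hat @ theta

def critic_step(W_hat, theta, r, eta, dt):
    """One Euler step of W_hat_dot = -eta * theta * (r + W_hat^T theta)."""
    return W_hat - dt * eta * theta * critic_residual(W_hat, theta, r)

x = np.array([1.0, -1.0])           # hypothetical state
x_dot_model = np.array([-3.0, 0.5])  # hypothetical model output
r = 2.0                              # hypothetical utility value
W_hat = np.array([0.5, 0.5, 0.5])

theta = critic_regressor(x, x_dot_model)
e_before = critic_residual(W_hat, theta, r)
W_next = critic_step(W_hat, theta, r, eta=0.1, dt=0.01)
e_after = critic_residual(W_next, theta, r)
print(e_before, e_after)
```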

There exists a positive constant $\theta_{iM}$ such that

\[ \|\theta_i\| \le \theta_{iM}. \]

According to (11.4.5), the Hamiltonians become

\[ \begin{aligned}
H_i(x, W_c^{i}, u_1, \ldots, u_N) ={}& r_i(x, \mu_1, \ldots, \mu_N) + W_c^{iT}\nabla\phi_i(x) \\
&\times\bigg(Ax + W_f^{T}\sigma_f(x) + \sum_{j=1}^{N}\big(W_{g_j}^{T}\sigma_{g_j}(x)\big)u_j\bigg) \\
\triangleq{}& e_{H_i}, \quad i \in \mathbb{N},
\end{aligned} \]

where the residual errors due to the NN approximation are

\[ e_{H_i} = -(\nabla\varepsilon_i)^{T}\bigg(Ax + W_f^{T}\sigma_f(x) + \sum_{j=1}^{N}\big(W_{g_j}^{T}\sigma_{g_j}(x)\big)u_j\bigg), \quad i \in \mathbb{N}. \]

Define the weight estimation errors of the critic NN as $\tilde{W}_c^{i} = W_c^{i} - \hat{W}_c^{i}$. Thus, we
can have the following error dynamics
\[ \dot{\tilde{W}}_c^{i} = \eta_i\theta_i\big(e_{H_i} - \tilde{W}_c^{iT}\theta_i\big). \tag{11.4.29} \]

The PE condition is needed to tune the critic NNs, ensuring $\|\theta_i\| \ge \theta_{im}$, $i \in \mathbb{N}$, where
$\theta_{im}$ are positive constants. In order to satisfy the PE condition, a small exploratory
signal can be injected into the system, or the system states can be reset.
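The error dynamics (11.4.29) follow in a few lines from the tuning law (11.4.28) and the definition of the residual of the Hamiltonian with ideal weights. The steps below are a reconstruction from the surrounding equations and assume, as is implicit in the text, that the ideal weights are constant.

```latex
\begin{aligned}
r_i + W_c^{iT}\theta_i &= e_{H_i}
  && \text{(Hamiltonian with ideal weights)} \\
r_i + \hat{W}_c^{iT}\theta_i &= e_{H_i} - \tilde{W}_c^{iT}\theta_i
  && \text{(substitute } \hat{W}_c^{i} = W_c^{i} - \tilde{W}_c^{i}\text{)} \\
\dot{\tilde{W}}_c^{i} = -\dot{\hat{W}}_c^{i}
  &= \eta_i\theta_i\big(e_{H_i} - \tilde{W}_c^{iT}\theta_i\big)
  && \text{(constant } W_c^{i}\text{ and (11.4.28))}
\end{aligned}
```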
The objective of the action NN is to select a policy which minimizes the current
estimate of the value functions in (11.4.25). Since a closed-form expression for the
optimal control is available, there is no need for training action NNs. Substituting
$g_i(x)$ and $\nabla V_i(x)$, the expressions in (11.4.6) can be rewritten as

\[ u_i = -\frac{1}{2}R_{ii}^{-1}\big(W_{g_i}^{T}\sigma_{g_i} + \varepsilon_{g_i}\big)^{T}\big(\nabla\phi_i^{T}W_c^{i} + \nabla\varepsilon_i\big), \quad i \in \mathbb{N}. \]

The approximate control policies $\hat{u}_i$ are given by

\[ \hat{u}_i = -\frac{1}{2}R_{ii}^{-1}\big(W_{g_i}^{T}\sigma_{g_i}\big)^{T}\nabla\phi_i^{T}\hat{W}_c^{i}, \quad i \in \mathbb{N}. \tag{11.4.30} \]
The UUB stability of the closed-loop system can be proved based on Lyapunov
approach.
Theorem 11.4.3 Consider the system described by (11.4.24). The weight-updating
laws of the critic NNs are given by (11.4.28), and the control policies are updated
by (11.4.30). The initial weights of the critic NNs are chosen to generate an initial
admissible control pair. Then, the weight estimation errors of the critic NNs are UUB.

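Proof Choose the following Lyapunov function

\[ L(x) = \sum_{i=1}^{N}\frac{1}{2\eta_i}\operatorname{tr}\big\{\tilde{W}_c^{iT}\tilde{W}_c^{i}\big\} \triangleq \sum_{i=1}^{N}S_i(t). \]

The time derivatives of the Lyapunov functions along the trajectories of the error
dynamics (11.4.29) are computed as

\[ \dot{L}_i(x) = \frac{1}{\eta_i}\operatorname{tr}\big\{\tilde{W}_c^{iT}\dot{\tilde{W}}_c^{i}\big\} = \frac{1}{\eta_i}\operatorname{tr}\big\{\tilde{W}_c^{iT}\eta_i\theta_i\big(e_{H_i} - \tilde{W}_c^{iT}\theta_i\big)\big\}, \quad i \in \mathbb{N}. \]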

According to $\theta_{im} \le \|\theta_i\| \le \theta_{iM}$, we have

\[ \dot{L}_i(x) \le -\Big(\theta_{im}^{2} - \frac{\eta_i}{2}\theta_{iM}^{2}\Big)\|\tilde{W}_c^{i}\|^{2} + \frac{e_{H_i}^{2}}{2\eta_i}, \quad i \in \mathbb{N}. \]

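If the learning rates $\eta_i$ are selected to satisfy

\[ \eta_i \le \frac{2\theta_{im}^{2}}{\theta_{iM}^{2}}, \quad i \in \mathbb{N}, \]

and given that the following inequalities hold

\[ \|\tilde{W}_c^{i}\| > \sqrt{\frac{e_{H_i}^{2}}{\eta_i\big(2\theta_{im}^{2} - \eta_i\theta_{iM}^{2}\big)}}, \quad i \in \mathbb{N}, \]

then
\[ \dot{L}_i(x) < 0. \]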

Using Lyapunov theory, it can be concluded that the weight estimation errors of the
critic NNs $\|\tilde{W}_c^{i}\|$ are UUB. This completes the proof of the theorem.
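The two quantities in the proof, the learning-rate ceiling and the radius outside which the Lyapunov derivative is negative, are simple to evaluate. The sketch below does so for hypothetical regressor bounds and a hypothetical residual bound; the function names and numbers are assumptions made for illustration.

```python
import math

def max_learning_rate(theta_min, theta_max):
    """Ceiling on eta_i from the proof: 2 * theta_im^2 / theta_iM^2."""
    return 2.0 * theta_min**2 / theta_max**2

def critic_error_radius(e_h, eta, theta_min, theta_max):
    """Radius beyond which the Lyapunov derivative of player i is negative."""
    denom = eta * (2.0 * theta_min**2 - eta * theta_max**2)
    if denom <= 0.0:
        raise ValueError("eta must be strictly below the learning-rate ceiling")
    return math.sqrt(e_h**2 / denom)

# Hypothetical bounds: theta_im = 1, theta_iM = 2, |e_Hi| <= 0.1, eta_i = 0.1.
print(max_learning_rate(1.0, 2.0))
print(critic_error_radius(0.1, 0.1, 1.0, 2.0))
```

The radius grows without bound as the learning rate approaches the ceiling, so in practice a rate strictly below it is needed for a finite bound.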

Theorem 11.4.4 Consider the system described by (11.4.24). The weight-updating
laws of the critic NNs are given by (11.4.28), and the control policies are updated
by (11.4.30). The initial weights of the critic NNs are chosen to generate an initial
admissible control pair. For some initial condition $x_0$, there exists a time $T(x_0, B)$
such that $x(t)$ is UUB, where the bound $B$ is given by
\[ \|x(t)\| \le \max\Bigg\{\sqrt{\frac{\xi_1}{\lambda_{\min}(Q_1)}}, \ldots, \sqrt{\frac{\xi_N}{\lambda_{\min}(Q_N)}}\Bigg\} \triangleq B, \quad t \ge t_0 + T, \]

where $\xi_i \in \mathbb{R}^{+}$, $i \in \mathbb{N}$.

Proof To show the stability of the approximate control policies in (11.4.30), we take
the derivatives of $V_i(x)$ along the trajectories generated by the approximate control
policies $\hat{u}_i$ as
\[ \dot{V}_i(t) = \big(\nabla V_i(x)\big)^{T}\bigg(f(x) + \sum_{j=1}^{N}g_j(x)\hat{u}_j\bigg), \quad i \in \mathbb{N}. \tag{11.4.31} \]

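Considering (11.4.7), we have

\[ \begin{aligned}
\big(\nabla V_i(x)\big)^{T}f ={}& -x^{T}Q_ix + \frac{1}{2}(\nabla V_i)^{T}\sum_{j=1}^{N}g_j(x)R_{jj}^{-1}g_j^{T}(x)\nabla V_j \\
&- \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{T}(x)\nabla V_j.
\end{aligned} \]

Substituting $\big(\nabla V_i(x)\big)^{T}f$ into (11.4.31) yields

\[ \begin{aligned}
\dot{V}_i(t) ={}& -x^{T}Q_ix + (\nabla V_i)^{T}\sum_{j=1}^{N}g_j(x)\hat{u}_j + \frac{1}{2}(\nabla V_i)^{T}\sum_{j=1}^{N}g_j(x)R_{jj}^{-1}g_j^{T}(x)\nabla V_j \\
&- \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{T}(x)\nabla V_j.
\end{aligned} \]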

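Considering (11.4.6), we have

\[ \begin{aligned}
\dot{V}_i(t) ={}& -x^{T}Q_ix - \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{T}(x)\nabla V_j \\
&- \sum_{j=1}^{N}(\nabla V_i)^{T}g_j(x)(u_j - \hat{u}_j).
\end{aligned} \tag{11.4.32} \]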

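Substituting (11.4.6), (11.4.16), (11.4.26), and (11.4.30) into (11.4.32) yields

\[ \begin{aligned}
\dot{V}_i(t) ={}& -x^{T}Q_ix - \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{T}(x)\nabla V_j \\
&+ \frac{1}{2}\big(\nabla\phi_i^{T}W_c^{i} + \nabla\varepsilon_i\big)^{T}\sum_{j=1}^{N}\big(W_{g_j}^{T}\sigma_{g_j}(x)\big)R_{jj}^{-1}\big(\sigma_{g_j}^{T}W_{g_j}\nabla\phi_j^{T}\tilde{W}_j + \sigma_{g_j}^{T}W_{g_j}\nabla\varepsilon_j\big) \\
\triangleq{}& -x^{T}Q_ix - \frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{T}(x)\nabla V_j + \Lambda_i,
\end{aligned} \]

where $\tilde{W}_j = W_j - \hat{W}_j$ denotes the critic weight estimation error of player $j$. According to
the boundedness assumptions on $W_{g_j}$, $\varepsilon_{g_j}$, $W_c^{i}$, and $\varepsilon_i$, and the fact that $\tilde{W}_c^{i}$ is UUB,
the term $\Lambda_i$ has an upper bound, i.e.,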

\[ \Lambda_i \le \bigg\|\frac{1}{2}\big(\nabla\phi_i^{T}W_c^{i} + \nabla\varepsilon_i\big)^{T}\sum_{j=1}^{N}\big(W_{g_j}^{T}\sigma_{g_j}(x)\big)R_{jj}^{-1}\big(\sigma_{g_j}^{T}W_{g_j}\nabla\phi_j^{T}\tilde{W}_j + \sigma_{g_j}^{T}W_{g_j}\nabla\varepsilon_j\big)\bigg\| \le \xi_i, \tag{11.4.33} \]

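where $\xi_i \in \mathbb{R}^{+}$ is a computable constant. Since $R_{ij}$ is a symmetric positive definite
matrix, the following term is positive definite

\[ \begin{aligned}
&\frac{1}{4}\sum_{j=1}^{N}(\nabla V_j)^{T}g_j(x)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^{T}(x)\nabla V_j \\
&\quad= \frac{1}{4}\sum_{j=1}^{N}\big(R_{jj}^{-1}g_j^{T}(x)\nabla V_j\big)^{T}R_{ij}\big(R_{jj}^{-1}g_j^{T}(x)\nabla V_j\big) > 0.
\end{aligned} \tag{11.4.34} \]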

Combining (11.4.33) and (11.4.34), we can obtain that $\dot{V}_i(t)$ is upper bounded by

\[ \dot{V}_i(t) \le -x^{T}Q_ix + \xi_i \le -\lambda_{\min}(Q_i)\|x\|^{2} + \xi_i. \]

For each value function $V_i(x(t))$, it can be shown that its derivative $\dot{V}_i(t)$ is negative
whenever $x(t)$ lies outside the compact set $\Omega_i \triangleq \big\{x : \|x\| \le \sqrt{\xi_i/\lambda_{\min}(Q_i)}\big\}$. Denote the
compact set $\Omega_x$ as
\[ \Omega_x \triangleq \Bigg\{x : \|x\| \le \max\Bigg\{\sqrt{\frac{\xi_1}{\lambda_{\min}(Q_1)}}, \ldots, \sqrt{\frac{\xi_N}{\lambda_{\min}(Q_N)}}\Bigg\}\Bigg\}. \]

Then, all the derivatives $\dot{V}_i(t)$ are negative whenever $x(t)$ lies outside the compact
set $\Omega_x$; i.e., $x(t)$ is UUB. If we increase $\lambda_{\min}(Q_i)$, the size of $\Omega_x$ can be made
smaller. This completes the proof of the theorem.
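The bound B is straightforward to evaluate once the constants ξ_i are known. In the sketch below, the weighting matrices are the ones used in the simulation study, while the ξ_i values are hypothetical placeholders (the text only states that they are computable); the function name is also introduced here. The second call illustrates the closing remark of the proof: scaling every Q_i up shrinks the bound.

```python
import math
import numpy as np

def state_bound(xi, Q_list):
    """B = max_i sqrt(xi_i / lambda_min(Q_i)) from Theorem 11.4.4."""
    radii = []
    for xi_i, Q_i in zip(xi, Q_list):
        lam_min = float(np.min(np.linalg.eigvalsh(Q_i)))
        radii.append(math.sqrt(xi_i / lam_min))
    return max(radii)

Q_list = [0.5 * np.eye(2), 2.0 * np.eye(2), np.eye(2)]
xi = [0.02, 0.08, 0.01]                           # hypothetical constants

B = state_bound(xi, Q_list)
B_scaled = state_bound(xi, [4.0 * Q for Q in Q_list])
print(B, B_scaled)
```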

11.4.3 Simulation Study

In this section, we give a simulation example to demonstrate the effectiveness of
the present scheme. This example is constructed by the converse HJB method [23],
which can give the optimal value functions and control policies.

Example 11.4.1 Consider a 3-player continuous-time affine nonlinear differential
game given by
\[ \dot{x} = f(x) + g_1(x)u_1 + g_2(x)u_2 + g_3(x)u_3, \]
where

\[ f(x) = \begin{bmatrix} -2x_1 + x_2 \\ -\tfrac{1}{2}x_1 - x_2 + x_1^{2}x_2 + \tfrac{1}{4}x_2(\cos(2x_1) + 2)^{2} + \tfrac{1}{4}x_2(\sin(4x_1^{2}) + 2)^{2} \end{bmatrix}, \]

\[ g_1(x) = \begin{bmatrix} 0 \\ 2x_1 \end{bmatrix}, \quad g_2(x) = \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}, \quad g_3(x) = \begin{bmatrix} 0 \\ \sin(4x_1^{2}) + 2 \end{bmatrix}. \]

Fig. 11.17 Curve of the modeling error (modeling error versus time in s)
Fig. 11.18 Evolution of the states ($x_1$ and $x_2$ versus time in s)
Fig. 11.19 Critic NN weights for player 1 ($W_1^1$, $W_2^1$, $W_3^1$ versus time in s)
Fig. 11.20 Critic NN weights for player 2 ($W_1^2$, $W_2^2$, $W_3^2$ versus time in s)
Fig. 11.21 Critic NN weights for player 3 ($W_1^3$, $W_2^3$, $W_3^3$ versus time in s)
Fig. 11.22 3-D plot of the approximation error of the value function for player 1 ($V_1 - V_1^*$ over $x_1$ and $x_2$)
Fig. 11.23 3-D plot of the approximation error of the control policy for player 1 ($u_1 - u_1^*$ over $x_1$ and $x_2$)

Select $Q_1 = 0.5I_2$, $Q_2 = 2I_2$, $Q_3 = I_2$, $R_{11} = R_{12} = R_{13} = 0.5$, $R_{21} = R_{22} =
R_{23} = 2$, and $R_{31} = R_{32} = R_{33} = 1$. The optimal value functions for the three
players are

\[ V_1^{*}(x) = \frac{1}{8}x_1^{2} + \frac{1}{4}x_2^{2}, \quad V_2^{*}(x) = \frac{1}{2}x_1^{2} + x_2^{2}, \quad V_3^{*}(x) = \frac{1}{4}x_1^{2} + \frac{1}{2}x_2^{2}. \]
The corresponding optimal control policies for three players are

\[ u_1^{*}(x) = -x_1x_2, \quad u_2^{*}(x) = -\frac{1}{2}(\cos(2x_1) + 2)x_2, \tag{11.4.35} \]
and
\[ u_3^{*}(x) = -\frac{1}{2}(\sin(4x_1^{2}) + 2)x_2. \tag{11.4.36} \]
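The example can be sanity-checked numerically: the stated value functions, together with the closed-form policies of the form used in (11.4.30) with exact gradients, should make the coupled Hamilton-Jacobi residuals vanish. The sketch below assumes the usual utility for this problem class, r_i = xᵀQ_i x + Σ_j R_ij u_j² with scalar inputs; that form and all function names are assumptions of this sketch, since the defining equations lie outside this section.

```python
import numpy as np

def f(x):
    x1, x2 = x
    c, s = np.cos(2 * x1) + 2, np.sin(4 * x1**2) + 2
    return np.array([-2 * x1 + x2,
                     -0.5 * x1 - x2 + x1**2 * x2
                     + 0.25 * x2 * c**2 + 0.25 * x2 * s**2])

def g(x):
    x1, _ = x
    return [np.array([0.0, 2 * x1]),
            np.array([0.0, np.cos(2 * x1) + 2]),
            np.array([0.0, np.sin(4 * x1**2) + 2])]

q = [0.5, 2.0, 1.0]                         # Q_i = q[i] * I_2
R = np.array([[0.5, 0.5, 0.5],              # R[i][j] = R_ij
              [2.0, 2.0, 2.0],
              [1.0, 1.0, 1.0]])
P = [np.diag([1 / 8, 1 / 4]),               # V_i*(x) = x^T P[i] x
     np.diag([1 / 2, 1.0]),
     np.diag([1 / 4, 1 / 2])]

def grad_V(i, x):
    return 2.0 * P[i] @ x

def policies(x):
    """u_j = -(1/2) R_jj^{-1} g_j(x)^T grad V_j(x)."""
    return [-0.5 / R[j, j] * float(g(x)[j] @ grad_V(j, x)) for j in range(3)]

def hj_residual(i, x):
    """x^T Q_i x + sum_j R_ij u_j^2 + grad V_i^T (f + sum_j g_j u_j)."""
    u = policies(x)
    x_dot = f(x) + sum(g_j * u_j for g_j, u_j in zip(g(x), u))
    cost = q[i] * float(x @ x) + sum(R[i, j] * u[j]**2 for j in range(3))
    return cost + float(grad_V(i, x) @ x_dot)

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=(100, 2))
max_residual = max(abs(hj_residual(i, x)) for i in range(3) for x in points)
print(max_residual)
```

The residuals are zero up to rounding, and `policies` reproduces (11.4.35) and (11.4.36) at any test point.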

First, we use a model NN to identify the unknown nonlinear system. The model
NN is selected as in (11.4.18) with $A = [-10, 0; 0, -10]$. The activation functions
are selected as the hyperbolic tangent function $\tanh(\cdot)$. Select the parameters in Theorem
11.4.2 as $\Gamma_f = [1, 0.1; 0.1, 1]$, $\Gamma_{g_j} = [1, 0.1; 0.1, 1]$. The curves of modeling error
are shown in Fig. 11.17. We can observe that the obtained model NN can reconstruct
the unknown nonlinear system successfully.
The activation functions for the critic NNs are selected as $\phi_1(x) = \phi_2(x) =
\phi_3(x) = [x_1^{2}, x_1x_2, x_2^{2}]^{T}$. The critic NN weight vectors for the three players are denoted
as $\hat{W}^{1} = [\hat{W}_1^{1}, \hat{W}_2^{1}, \hat{W}_3^{1}]^{T}$, $\hat{W}^{2} = [\hat{W}_1^{2}, \hat{W}_2^{2}, \hat{W}_3^{2}]^{T}$, and $\hat{W}^{3} = [\hat{W}_1^{3}, \hat{W}_2^{3}, \hat{W}_3^{3}]^{T}$.
The initial weights of the three critic NNs are randomly selected in [0, 1.5], and
the learning rates for the three critic NNs are all 0.1. The initial state is selected
as $x_0 = [1, -1]^{T}$. A small exploratory signal is used to satisfy the PE condition.
After the exploratory signal is turned off at t = 950 s, the states converge
to zero, and Fig. 11.18 presents the evolution of the system states. From
Figs. 11.19, 11.20 and 11.21, we can observe that the weight vector $\hat{W}^{1}$ converges to
$[0.1277, -0.0012, 0.2503]^{T}$, the weight vector $\hat{W}^{2}$ converges to $[0.5022, -0.0009,
1.0002]^{T}$, and the weight vector $\hat{W}^{3}$ converges to $[0.2516, -0.0007, 0.5001]^{T}$ at
t = 1000 s. For player 1, Fig. 11.22 shows the 3-D plot of the difference between
the approximated value function and the optimal one, and Fig. 11.23 shows the 3-D
plot of the difference between the approximated control policy and the optimal one.
We can find that these errors are close to zero, and other players also have similar
results. Thus, the approximate value functions converge to the optimal ones within
a small bound.
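The reported weights can be compared directly with the weights implied by the optimal value functions, since each optimal value function is an exact combination of the chosen basis. The sketch below uses only the numbers quoted above; the dictionary layout, the function name, and the evaluation grid on [-2, 2]² are assumptions of this illustration.

```python
import numpy as np

W_reported = {1: [0.1277, -0.0012, 0.2503],
              2: [0.5022, -0.0009, 1.0002],
              3: [0.2516, -0.0007, 0.5001]}
W_ideal = {1: [1 / 8, 0.0, 1 / 4],      # from V_1*, V_2*, V_3*
           2: [1 / 2, 0.0, 1.0],
           3: [1 / 4, 0.0, 1 / 2]}

def max_value_error(player, half_width=2.0, n=41):
    """Largest |V_hat - V*| for the given player on a square grid."""
    dW = np.array(W_reported[player]) - np.array(W_ideal[player])
    axis = np.linspace(-half_width, half_width, n)
    x1, x2 = np.meshgrid(axis, axis)
    err = dW[0] * x1**2 + dW[1] * x1 * x2 + dW[2] * x2**2
    return float(np.max(np.abs(err)))

weight_errors = {p: float(np.max(np.abs(np.array(W_reported[p]) - np.array(W_ideal[p]))))
                 for p in W_reported}
value_errors = {p: max_value_error(p) for p in W_reported}
print(weight_errors)
print(value_errors)
```

Every weight is within 0.003 of its ideal value, and the implied value-function error stays below 0.02 on this grid.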

11.5 Conclusions

In this chapter, we developed some ADP algorithms for differential games of
continuous-time systems with unknown dynamics. First, we developed an online
model-free integral PI algorithm for two-player zero-sum differential games of
continuous-time linear systems. Second, we developed an iterative ADP algorithm
for multi-player zero-sum differential games of continuous-time nonlinear systems.
Finally, we developed an online synchronous approximate optimal learning algo-
rithm based on PI for multi-player nonzero-sum games of continuous-time nonlinear
systems.

References

1. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
2. Abu-Khalaf M, Lewis FL, Huang J (2006) Policy iterations and the Hamilton-Jacobi-Isaacs
equation for H∞ state feedback control with input saturation. IEEE Trans Autom Control
51(12):1989–1995

3. Abu-Khalaf M, Lewis FL, Huang J (2008) Neurodynamic programming and zero-sum games
for constrained control systems. IEEE Trans Neural Netw 19(7):1243–1252
4. Al-Tamimi A, Abu-Khalaf M, Lewis FL (2007) Adaptive critic designs for discrete-time zero-
sum games with application to H∞ control. IEEE Trans Syst Man Cybern-Part B: Cybern
37(1):240–247
5. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2007) Model-free Q-learning designs for linear
discrete-time zero-sum games with application to H∞ control. Automatica 43(3):473–481
6. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B:
Cybern 38(4):943–949
7. Bardi M, Capuzzo-Dolcetta I (1997) Optimal control and viscosity solutions of Hamilton-
Jacobi-Bellman equations. Birkhäuser, Boston
8. Basar T, Bernhard P (1995) H∞ optimal control and related minimax design problems: a
dynamic game approach. Birkhäuser, Boston
9. Basar T, Olsder GJ (1999) Dynamic noncooperative game theory. SIAM, Philadelphia
10. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
11. Gupta SK (1995) Numerical methods for engineers. Wiley, New York
12. Hecht-Nielsen R (1989) Theory of the backpropagation neural network. In: Proceedings of the
international joint conference on neural networks, pp 593–605
13. Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear
systems with completely unknown dynamics. Automatica 48(10):2699–2704
14. Lee JY, Park JB, Choi YH (2012) Integral Q-learning and explorized policy iteration for adaptive
optimal control of continuous-time linear systems. Automatica 48(11):2850–2859
15. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken
16. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
17. Li H, Liu D, Wang D (2014) Integral reinforcement learning for linear continuous-time zero-
sum games with completely unknown dynamics. IEEE Trans Autom Sci Eng 11(3):706–714
18. Liu D, Wei Q (2014) Multi-person zero-sum differential games for a class of uncertain nonlinear
systems. Int J Adapt Control Signal Process 28(3–5):205–231
19. Liu D, Li H, Wang D (2013) Neural-network-based zero-sum game for discrete-time nonlinear
systems via iterative adaptive dynamic programming algorithm. Neurocomputing 110:92–100
20. Liu D, Wang D, Yang X (2013) An iterative adaptive dynamic programming algorithm for
optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci
220:331–342
21. Liu D, Li H, Wang D (2014) Data-based online synchronous optimal learning algorithm for
multi-player non-zero-sum games. IEEE Trans Syst Man Cybern: Syst 44(8):1015–1027
22. Marks RJ (1991) Introduction to Shannon sampling and interpolation theory. Springer, New
York
23. Nevisti V, Primbs JA (1996) Constrained nonlinear optimal control: a converse HJB approach.
California Institute of Technology, TR96-021
24. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
25. Vamvoudakis KG, Lewis FL (2011) Online solution of nonlinear two-player zero-sum games
using synchronous policy iteration. Int J Robust Nonlinear Control 22(13):1460–1483
26. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learn-
ing solution of coupled Hamilton-Jacobi equations. Automatica 47(8):1556–1569
27. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
28. Vrabie D, Lewis FL (2010) Integral reinforcement learning for online computation of feedback
Nash strategies of nonzero-sum differential games. In: Proceedings of the IEEE conference on
decision and control, pp 3066–3071

29. Vrabie D, Lewis FL (2011) Adaptive dynamic programming for online solution of a zero-sum
differential game. J Control Theory Appl 9(3):353–360
30. Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for
continuous-time linear systems based on policy iteration. Automatica 45(2):477–484
31. Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE
Comput Intell Mag 4(2):39–47
32. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
33. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive
approaches (Chapter 13). Van Nostrand Reinhold, New York
34. Wu H, Luo B (2012) Neural network based online simultaneous policy update algorithm
for solving the HJI equation in nonlinear H∞ control. IEEE Trans Neural Netw Learn Syst
23(12):1884–1895
35. Wu H, Luo B (2013) Simultaneous policy update algorithms for learning the solution of linear
continuous-time H∞ state feedback control. Inf Sci 222(10):472–485
36. Zeidler E (1985) Nonlinear functional analysis. Fixed point theorems, vol 1. Springer, New
York
37. Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of
discrete-time affine nonlinear systems with control constraints. IEEE Trans Neural Netw
20(9):1490–1503
38. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
39. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
40. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control: algorithms
and stability. Springer, London
Part III
Applications
Chapter 12
Adaptive Dynamic Programming
for Optimal Residential Energy
Management

12.1 Introduction

With the rising cost, environmental concerns, and reliability issues, there is an
increasing need to develop optimal control and management systems for residential
environments. Smart residential energy systems, composed of power grids, battery
systems, and residential loads which are interconnected over a power management
unit, provide end users with the optimal management of energy usage to improve
the operational efficiency of power systems [4, 23, 31]. On the other hand, with
the rapidly evolving technology of electric storage devices, energy storage-based
optimal management has attracted much attention [3, 19, 36]. Along with the devel-
opment of smart grids, increasing intelligence is required in the optimal design of
residential energy systems [10, 34, 39]. Hence, the intelligent optimization of battery
management becomes a key tool for saving the power expense in smart residential
environments.
Different techniques have been used to implement optimal controllers in residen-
tial energy management systems; for example, dynamic programming is used in [32,
37] and a genetic algorithm is proposed in [11]. In addition, Liu and Huang [21] pro-
posed an ADP scheme that uses only a critic network and considers only three possible
controls for the battery (charging mode, discharging mode, idle) to choose the best
one for every time slot, while a particle swarm optimization method is chosen in [18]
and a mixed integer linear programming procedure is chosen in [33].
In this chapter, the optimal management of the total electrical system is viewed as
optimal battery management for each time slot: Step by step, the controller provides
the best energy management decision, charging or discharging the battery so as to
reduce the total cost according to the external environment. First, an action-
dependent heuristic dynamic programming method is developed to obtain the optimal
residential energy control law [21, 22]. Second, a dual iterative Q-learning algorithm
is developed to solve the optimal battery management and control problem in smart
residential environments where two iterations, internal and external iterations, are
employed [42]. Based on the dual iterative Q-learning algorithm, the convergence
property of iterative Q-learning method for the optimal battery management and
control problem is proven. Finally, a distributed iterative ADP method is developed
to solve the multi-battery optimal coordination control problems for home energy
management systems [43].

© Springer International Publishing AG 2017
D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_12

12.2 A Self-learning Scheme for Residential Energy System Control and Management

The objective of this section is to apply an action-dependent heuristic dynamic
programming (ADHDP) algorithm to obtain the optimal control for home energy
management systems which minimizes the sum of system operational cost over the
scheduling period, subject to technological and operational constraints of power grids
and storage resources and subject to the system constraints such as power balance
and reliability. For this purpose, we focus our research on finding the optimal battery
charge/discharge strategy of the residential energy system including batteries and
power grids [21, 22].
The residential energy system uses AC utility grid as the primary source of
electricity and is intended to operate in parallel with the battery storage system.
Figure 12.1 depicts the schematic diagram of a residential energy system. The sys-
tem consists of power grids, a sine wave inverter, a battery system, and a power
management unit. The battery storage system is connected to power management
system through an inverter. The inverter functions as both charger and discharger for
the battery. The construction of the inverter is based upon power MOSFET technol-
ogy and pulse width modulation technique [16]. The quality of the inverter output is
comparable to that delivered from the power grids. The battery storage system con-
sists of lead acid batteries, which are the most commonly used rechargeable battery
type. The optimum battery size for a particular residential household can be obtained
by performing various test scenarios, which is beyond the scope of the present book.
Generally speaking, the battery is sized to enable it to supply power to the residential
load for a period of twelve hours.

Fig. 12.1 Grid-connected residential energy system with battery storage
There are three operational modes for the batteries in residential energy system
under consideration.

(1) Charging mode: When system load is low and the electricity price is inexpensive,
the power grids will supply the residential load directly and, at the same time,
charge the batteries.
(2) Idle mode: the power grids will directly supply the residential load at certain
hours when, from the economical point of view, it is more cost effective to use
the fully charged batteries in the evening peak hours.
(3) Discharging mode: By taking the subsequent load demands and time-varying
electricity rate into account, batteries alone supply the residential load at hours
when the price of electricity from the grid is high.

This system can easily be expanded; i.e., other power sources along with the
power grid and sources such as photovoltaic (PV) panels or wind generators can be
integrated into the system when they are available.
For this section, the optimal scheduling problem is treated as a discrete-time
problem with the time step as one hour, and it is assumed that the residential load
over each hourly time step varies with noise. Thus, the daily load profile is divided
into twenty-four one-hour periods, one for each hour of the day. Each day can be
divided into a greater number of periods to have higher resolution. However, for
simplicity and agreement with existing literature [6, 9, 13, 30], we use a twenty-
four-hour period each day in this work. A typical weekday load profile is shown in
Fig. 12.2. The load factor PL is expressed as PLt during hour t (t = 1, 2, . . . , 24).
For instance, at time t = 19, the load is 7.8 kW which would require 7.8 kWh of
energy. Since the load profile is divided into one-hour steps, the units of the power
of energy sources can be represented equally by kW or kWh. Residential real-time
pricing is one of the load management policies used to shift electricity usage from
peak load hours to light load hours in order to improve the power system efficiency
and allow the new power system construction projects [24]. With real-time pricing,
the electricity rate varies from hour to hour based on the wholesale market prices.
Hourly, market-based electricity prices typically change as the demand for electricity
changes; higher demand usually means higher hourly prices. In general, there tends
to be a small price spike in the morning and another slightly larger spike in the
evening when the corresponding demand is high. Figure 12.3 demonstrates a typical
daily real-time pricing from [12]. The varying electricity rate is expressed as Ct ,
representing the energy cost during the hour t in cents. For the residential customer
with real-time pricing, energy charges are functions of the time of electricity usage.
Therefore, for the situation where batteries are charged during the low rate hours
and discharged during high rate hours, one may expect, from an economical point
of view, the cost savings will be made by storing energy during low rate hours and
releasing it during the high rate hours. In this way, the battery storage system can be
used to reduce the total electricity cost for residential household.

Fig. 12.2 A typical residential load profile (load in kW versus time in hours)
Fig. 12.3 A typical daily real-time pricing (rate in cents versus time in hours)

The energy stored in a battery can be expressed as [24, 44]:


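\[ \begin{aligned}
E_{bt} &= E_{b0} - \sum_{\tau=0}^{t}P_{b\tau}, \\
P_{b\tau} &= V_bI_b\alpha_\tau, \quad \tau = 0, 1, \ldots, t, \\
\alpha_\tau &= \begin{cases} 1, & \tau \le \tau_0, \\ K_1(I_b)\,dV_b/d\tau, & \tau > \tau_0, \end{cases} \\
V_{b\tau} &= V_s - \big(K_c(\Delta/(\Delta - J_c\tau) + N)J_c + A\exp(-B\Delta^{-1}J_c\tau)\big),
\end{aligned} \]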

where $E_{bt}$ is the battery energy at time $t$, $E_{b0}$ is the peak energy level when the battery
is fully charged (capacity of the battery), $P_{b\tau}$ is the battery power output at time $\tau$,
$V_{b\tau}$ is the terminal voltage of the battery, $I_b$ is the battery discharge current, $\alpha_\tau$ is the
current weight factor as a function of discharge time, $\tau_0$ is the battery manufacturer
specified length of time for constant power output under constant discharge current
rate, $K_1(I_b)$ is the weight factor as a function of the magnitude of the current, $V_s$ is
the battery internal voltage, $K_c$ is the polarization coefficient (ohm × cm²), $\Delta$ is the
available amount of active material (coulombs per cm²), $J_c$ is the apparent current
density (amperes per cm²), $N$ is the internal resistance per cm², and $A$ and $B$ are
constants.
Apart from the battery itself, the losses of other equipment such as inverters, trans-
formers, and transmission lines should also be considered in the battery model. The
efficiency of these devices was derived in [44] as follows:

\[ \eta(P_{bt}) = 0.898 - 0.173\frac{|P_{bt}|}{P_{\mathrm{rate}}}, \quad P_{\mathrm{rate}} > 0, \tag{12.2.1} \]

where $P_{\mathrm{rate}}$ is the rated power output of the battery and $\eta(P_{bt})$ is the total efficiency
of all the auxiliary equipment in the battery system.
Assume that all the loss caused by these equipments occur during the charging
period. The battery model used in this work is expressed as follows: When the battery
charges,
Eb(t+1) = Ebt − Pb(t+1) × η(Pb(t+1) ), Pb(t+1) < 0,

and when the battery discharges,

Eb(t+1) = Ebt − Pb(t+1) × η(Pb(t+1) ), Pb(t+1) > 0.



In general, to improve the battery efficiency and extend the battery's lifetime as
far as possible, two constraints need to be considered:
(1) The battery has a storage limit. A battery's lifetime may be reduced if it operates
at a low level of charge. To avoid damage, the energy stored in the battery must
always satisfy

\[
E_b^{\min} \le E_{bt} \le E_b^{\max}. \tag{12.2.2}
\]

(2) For safety, the battery cannot be charged or discharged at a rate outside the
allowed range. This constraint represents the upper and lower limits for the hourly
charging and discharging power. A negative Pbt means that the battery is being
charged, while a positive Pbt means that the battery is discharging:

\[
P_b^{\min} \le P_{bt} \le P_b^{\max}. \tag{12.2.3}
\]

At any time, the sum of the power from the power grids and the batteries must be
equal to the demand of the residential user:

\[
P_{Lt} = P_{bt} + P_{gt}, \tag{12.2.4}
\]

where Pgt is the power from the power grids and Pbt can be positive (batteries
discharging), negative (batteries charging), or zero (idle). Equation (12.2.4) expresses
the fact that the power supply (power grids and batteries) must balance the load
demand for each hour in the scheduling period. We assume here that the supply from
the power grids is sufficient for the residential demand. Given the residential load
profile and real-time pricing, the objective of the optimization policy is to find the
battery charge/discharge/idle schedule at each time step that minimizes the total cost

\[
C_T = \sum_{t=1}^{T} C_t \times P_{gt},
\]

while satisfying the load balance equation (12.2.4) and the operational constraints
(12.2.1)–(12.2.3). Here, CT represents the operational cost to the residential customer
in a period of T hours. Making the best possible use of batteries for the benefit of
residential customers under the time-of-day pricing signals Ct is a complex multistage
stochastic optimization problem. Adaptive dynamic programming (ADP), which provides
approximate optimal solutions to dynamic programming problems, is applicable to this
problem. Using ADP, we will develop a self-learning optimization strategy for
residential energy system control and management. During real-time operations under
uncertain changes in the environment, the performance of the strategy can be further
refined and improved through continuous learning and adaptation.
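As a rough illustration of the battery model and cost defined above, the following Python sketch applies the efficiency (12.2.1), a one-step energy update checked against the limits (12.2.2) and (12.2.3), and the total cost with Pgt = PLt − Pbt from (12.2.4). The function names and all numerical values are assumptions made for illustration only; they are not taken from the original text.

```python
def efficiency(p_b, p_rate):
    """Auxiliary-equipment efficiency eta(P_b) from (12.2.1); requires p_rate > 0."""
    return 0.898 - 0.173 * abs(p_b) / p_rate


def battery_step(e_b, p_b, p_rate, e_min, e_max, p_min, p_max):
    """Return E_b after one step, or None if (12.2.2) or (12.2.3) would be violated.

    p_b < 0 charges the battery, p_b > 0 discharges it, p_b = 0 keeps it idle.
    """
    if not p_min <= p_b <= p_max:
        return None
    e_next = e_b - p_b * efficiency(p_b, p_rate)
    if not e_min <= e_next <= e_max:
        return None
    return e_next


def total_cost(rates, loads, battery_powers):
    """C_T = sum_t C_t * P_gt, with P_gt = P_Lt - P_bt from the load balance (12.2.4)."""
    return sum(c * (p_l - p_b) for c, p_l, p_b in zip(rates, loads, battery_powers))


# Illustrative values only (assumed): 100 kWh capacity, 20 kWh floor, 16 kW rated power.
limits = dict(p_rate=16.0, e_min=20.0, e_max=100.0, p_min=-16.0, p_max=16.0)
after_charge = battery_step(80.0, -16.0, **limits)    # charging at rated power
after_discharge = battery_step(80.0, 8.0, **limits)   # discharging at half rated power
cost = total_cost(rates=[2.0, 5.0], loads=[3.0, 4.0], battery_powers=[-1.0, 2.0])
```

Returning `None` for an infeasible step is a design choice of this sketch; a scheduler could equally clip the requested power to the feasible range.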

12.2.1 The ADHDP Method

Fig. 12.4 A typical scheme of an ADHDP

In this section, we consider the following discrete-time nonlinear systems:

\[
x_{t+1} = F(x_t, u_t), \quad t = 0, 1, 2, \ldots, \tag{12.2.5}
\]

where $x_t \in \mathbb{R}^n$ denotes the state vector of the system, $u_t \in \mathbb{R}^m$ represents the control
action, and F is a transition from the current state xt to the next state xt+1 under a
given state feedback control action ut = u(xt) at time t. Suppose that this system is
associated with the performance index

\[
J(x_{t_0}, u) = J^{u}(x_{t_0}) = \sum_{k=t_0}^{\infty} \gamma^{k-t_0}\, U(x_k, u_k), \tag{12.2.6}
\]

where U is called the utility function and γ is the discount factor with 0 < γ ≤ 1. It
is important to realize that J depends on the initial time t0 and the initial state xt0. The
performance index J is also referred to as the cost-to-go of state xt0. The objective
of the dynamic programming problem is to choose a sequence of control actions ut =
u(xt), t = t0, t0 + 1, . . . , so that the performance index J in (12.2.6) is minimized.
According to Bellman, the optimal cost from time t on is equal to

\[
J^{*}(x_t) = \min_{u_t} \bigl\{ U(x_t, u_t) + \gamma J^{*}(F(x_t, u_t)) \bigr\}.
\]

The optimal control $u_t^{*}$ at time t is the ut that achieves this minimum.
Generally speaking, there are three design families of ADP: heuristic dynamic pro-
gramming (HDP), dual heuristic programming (DHP), and globalized dual heuristic
programming (GDHP) as well as their action-dependent versions. A typical ADHDP
is shown in Fig. 12.4 [28]. Both the critic and action networks can be trained using
the strategy in [25] as described in Sect. 1.3.1 of this book.
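To make the Bellman recursion above concrete, here is a minimal tabular sketch that iterates J(x) ← min_u [U(x, u) + γ J(F(x, u))] on a small finite problem. The three-state system, its utilities, and the function name are hypothetical and chosen only so that the result can be checked by hand; ADP methods such as ADHDP replace the table with neural network approximations.

```python
import numpy as np


def value_iteration(next_state, utility, gamma, n_iter=100):
    """Iterate J(x) <- min_u [U(x, u) + gamma * J(F(x, u))] on a finite toy problem.

    next_state[x, u] is the index of F(x, u); utility[x, u] is U(x, u).
    """
    cost_to_go = np.zeros(next_state.shape[0])
    for _ in range(n_iter):
        q_values = utility + gamma * cost_to_go[next_state]  # action-dependent values
        cost_to_go = q_values.min(axis=1)
    return cost_to_go, q_values.argmin(axis=1)


# Hypothetical 3-state, 2-action system; state 2 is absorbing with zero utility.
F = np.array([[0, 1], [1, 2], [2, 2]])
U = np.array([[1.0, 2.0], [1.0, 0.5], [0.0, 0.0]])
J_star, u_star = value_iteration(F, U, gamma=0.9)
```

The intermediate array `q_values` plays the role of an action-dependent value function: minimizing it over the action index gives the cost-to-go, and the minimizing index gives the control.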

12.2.2 A Self-learning Scheme for Residential Energy System

The learning control architecture for residential energy system control and
management is based on ADHDP. However, as described below, only a single module (single
critic approach) will be used instead of the two or three modules in the original scheme.
The single critic module technique retains all the powerful features of the original
ADP, while eliminating the action module completely. There is no need for iterative
training loops between the action and the critic networks, which greatly
simplifies the training process.
There exists a class of problems in realistic applications that have a finite
control action space. Typical examples include the inverted pendulum or
the cart-pole problem, where the control action takes only a few values. When
the control action space is finite, the decisions that can be made are constrained
to a limited number of choices. In the residential energy control and management
problem, when there is a power demand from the residential household, the decision
is a ternary choice: to discharge batteries, to charge batteries, or to do nothing to
batteries. Let us denote the three options using ut = 1 for “discharge,” ut = −1 for
“charge,” and ut = 0 for “idle.” Because the control actions are limited to these three
options, we can further simplify the ADHDP introduced in Fig. 12.4 so that only the
critic network is needed in the design.
Figure 12.5 illustrates our self-learning control scheme for residential energy
system control and management using ADHDP. The control scheme works as follows:
When there is a power demand from the residential household, we first query the
critic network to see which action (discharge, charge, or idle) generates the
smallest output value of the critic network; then, the control action from ut = 1, −1, 0
that generates the smallest critic network output is chosen. As in the case of
Fig. 12.4, the critic network in our design also needs the system states as input
variables. It is important to realize that Fig. 12.5 is only a diagrammatic layout that
illustrates how the computation takes place while making battery control and
management decisions. In Fig. 12.5, the three blocks for the critic network stand for the
same critic network or computer program. From the block diagram in Fig. 12.5, it is
clear that the critic network is evaluated three times with different
values of ut to decide whether to discharge batteries, charge them, or
keep them idle. The previous description is based on the assumption that the critic
network has been successfully trained. Once the critic network has been trained
(off-line or online), it is applied to perform the task of residential energy
system control and management as in Fig. 12.5. The performance of the overall system
can be further refined and improved through continuous learning as it gains more
experience in real-time operations when needed. In this way, the overall residential
energy system can achieve optimal individual performance both now and in future
environments under uncertain changes.
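The decision rule described above can be sketched in a few lines of Python: the same critic is evaluated once for each of the three actions, and the action with the smallest output is selected. The function `toy_critic` is a hypothetical stand-in for a trained critic network, and the state ordering (rate, load, energy) is an assumption of this sketch.

```python
ACTIONS = (1, -1, 0)  # discharge, charge, idle


def select_action(critic, state):
    """Query the same critic once per action and return the action with the smallest output."""
    outputs = {u: critic(state, u) for u in ACTIONS}
    best = min(outputs, key=outputs.get)
    return best, outputs


def toy_critic(state, u):
    """Hypothetical stand-in for a trained critic network (not from the original text)."""
    rate, load, energy = state
    # Made-up cost estimate: grid purchases cost `rate`, stored energy is valued at a flat 10.
    return rate * (load - 4.0 * u) + 10.0 * u


high_rate_choice, high_rate_outputs = select_action(toy_critic, state=(8.0, 4.0, 60.0))
low_rate_choice, low_rate_outputs = select_action(toy_critic, state=(1.0, 4.0, 60.0))
```

With this made-up critic, the rule picks "discharge" when the rate is high and "charge" when it is low, which mirrors the behavior expected of a trained critic.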
Fig. 12.5 Block diagram of the single critic approach

In a stationary environment, where the residential energy system configuration remains
unchanged, a set of simple static if-then rules is able to achieve the optimal
scheduling as described previously. However, the system configuration, including user
power demand, capacity of the battery, and power rate, may be significantly different
from time to time. A static energy control and management algorithm is not suited to
such uncertain changes in the environment. The present control and
management scheme based on ADHDP is capable of coping with uncertain
changes in the environment through continuous learning. Another advantage of the
present self-learning scheme is that, as it gains more and more experience in
real-time operations, the algorithm can adapt itself and improve performance. We note
that continuous learning and adaptation over the entire operating regime and system
conditions to improve the performance of the overall system is one of the key
promising attributes of the present method.
The development of the present self-learning scheme for residential energy system
control and management involves the following four steps.

Step 1: Collecting data: During this stage, whenever there is a power demand from
the residential household, we can take any of the following actions: discharge
batteries, charge batteries, or keep batteries idle, and calculate the utility
function for the system. The utility function is chosen as follows:

\[
U_t = \frac{\text{the electricity cost at time } t}{\text{the possible maximum cost}}.
\]

During the data collection step, we simply choose actions 1, −1, 0 randomly
with the same probability of 1/3. Meanwhile, the states corresponding to
each action are collected. The environmental states we collect for each action
are the electricity rate, the residential load, and the energy level of the battery.
Step 2: Training the critic network: We use the data collected to train the critic
network as presented in Chap. 1. The input variables chosen for the critic
network are states including the electricity rate, the residential load, the
energy level of the battery, and the action.
Step 3: Applying the critic network: We apply the trained critic network as illustrated
in Fig. 12.5. Three values of action ut will be provided to the critic network
at each time step. The action with the smallest output of the critic network
is the one the system is going to take.
Step 4: Further updating the critic network: We will update the critic network as
needed while it is applied in the residential energy system to cope with envi-
ronmental changes, for example, user demand changes or new requirements
for the system. We note that the data has to be collected again and the training
of critic network has to be performed as well. In such a case, the previous
three steps will be repeated.
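Step 1 of the procedure above can be sketched as follows: actions are drawn uniformly at random, and each sample stores the state (rate, load, battery energy), the action, and the normalized utility. The lossless battery model `toy_battery_step`, the rate and load values, and the choice of maximum cost are all assumptions made for this illustration; they do not reproduce the simulation settings of the text.

```python
import random

ACTIONS = (1, -1, 0)  # discharge, charge, idle


def toy_battery_step(energy, load, action, rate_kw=16.0, e_min=20.0, e_max=100.0):
    """Hypothetical lossless one-hour battery model; returns (next energy, grid power)."""
    if action == 1 and energy - load >= e_min:   # discharge: battery covers the load
        return energy - load, 0.0
    if action == -1:                             # charge: grid covers load and charging
        p_charge = min(rate_kw, e_max - energy)
        return energy + p_charge, load + p_charge
    return energy, load                          # idle, or discharge not feasible


def collect_samples(rates, loads, energy0, max_cost, seed=0):
    """Step 1: pick actions uniformly at random and record (state, action, utility)."""
    rng = random.Random(seed)
    energy, samples = energy0, []
    for rate, load in zip(rates, loads):
        action = rng.choice(ACTIONS)             # each action has probability 1/3
        next_energy, grid_power = toy_battery_step(energy, load, action)
        utility = rate * grid_power / max_cost   # cost at time t / possible maximum cost
        samples.append(((rate, load, energy), action, utility))
        energy = next_energy
    return samples


rates = [2.0, 2.0, 6.0, 9.0, 6.0, 3.0]           # assumed electricity rates
loads = [1.0, 1.5, 4.0, 7.0, 5.0, 2.0]           # assumed loads in kW
max_cost = max(rates) * (max(loads) + 16.0)      # assumed bound on the hourly cost
samples = collect_samples(rates, loads, energy0=80.0, max_cost=max_cost)
```

The resulting list is the kind of data set that Step 2 would use to train the critic network.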

12.2.3 Simulation Study

The performance of the present algorithm is demonstrated by simulation studies
for a typical residential family. The objective is to minimize the electricity cost
from power grids over a one-week horizon by finding the optimal battery operational
strategy of the energy system while satisfying the load conditions and the system
constraints. The focus of the present section is on a residential energy system with home
batteries connected to the power grids. For the residential energy system, the cost to
be minimized is a function of real-time pricing and residential power demands. The
optimal battery operation strategy refers to the strategy of when to charge batteries,
when to discharge batteries, and when to keep batteries idle to achieve minimum
electricity cost for the residential user.
The residential energy system consists of power grids, an inverter, batteries, and
a power management unit as shown in Fig. 12.1. We assume that the supply from
the power grids is guaranteed for the residential user demand at any time. The capacity of
batteries used in the simulations is 100 kWh, and a minimum of 20% of the charge is
to be retained. The rated power output of batteries and the maximum charge/discharge
rate are both 16 kW (16 kWh per one-hour step). The initial charge of batteries is 80%
of the batteries' full charge. We assume that the batteries and the power grids will not
simultaneously provide power to the residential user. At any time, residential power
demand is supplied by either batteries or power grids. When batteries are being charged,
the power grids provide the supply to the residential user and, at the same time,
charge batteries. It is expected that batteries are charged
during the low rate hours, idle in some mid-rate hours, and discharged during high
rate hours. In this way, both energy and cost savings are achieved.
The critic network in the present application is a multilayer feedforward neural
network with 4–9–1 structure, i.e., four neurons at the input layer, nine neurons
at the hidden layer, and one linear neuron at the output layer. The hidden layer
uses the hyperbolic tangent function as the activation function. The critic network
outputs function Q, which is an approximation to the function J(xt , u) as defined
in (12.2.6). The four inputs to the critic network are as follows: energy level of
batteries, residential power demand, real-time pricing, and the action of operation
(1 for discharging batteries, −1 for charging batteries, 0 for keeping batteries idle).
The local utility function defined in (12.2.6) is

\[
U_t = \frac{C_t \times P_{gt}}{U_{\max}},
\]

where Ct is real-time electricity rate, Pgt is the supply from power grids for residential
power demand, and Umax is the possible maximum cost for all time. The utility
function chosen in this way will lead to a control objective of minimizing the overall
cost for the residential user.
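The 4–9–1 critic structure described above can be sketched as a plain NumPy forward pass with a hyperbolic tangent hidden layer and one linear output neuron. The random initial weights, the bias terms, and the scaled input values are assumptions of this sketch; the text does not specify them, and training is not shown.

```python
import numpy as np


class CriticNetwork:
    """4-9-1 feedforward network: tanh hidden layer, linear output neuron."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        # Random initial weights and zero biases are assumptions for this sketch.
        self.w_hidden = rng.normal(scale=0.1, size=(9, 4))
        self.b_hidden = np.zeros(9)
        self.w_out = rng.normal(scale=0.1, size=9)
        self.b_out = 0.0

    def __call__(self, energy, load, rate, action):
        inputs = np.array([energy, load, rate, action], dtype=float)
        hidden = np.tanh(self.w_hidden @ inputs + self.b_hidden)
        return float(self.w_out @ hidden + self.b_out)   # network output Q


critic = CriticNetwork()
# Inputs are assumed to be scaled to comparable ranges before being fed to the network.
q_values = {u: critic(energy=0.8, load=0.5, rate=0.3, action=u) for u in (1, -1, 0)}
```

Evaluating the network for the three action values, as in the last line, is exactly the computation the single critic approach repeats at each time step.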
The typical residential load profile in one week is shown in Fig. 12.6 [12], with random
noise added to the load curve. From the load curve, we can see that, during weekdays, there
are two load peaks occurring in the periods 7:00–8:00 and 18:00–20:00, while
during the weekend, the residential demand gradually increases until the peak appears
at 19:00. Thus, the residential demand patterns during weekdays and during the weekend
are different.

Fig. 12.6 A typical residential load profile in one week (load in kW versus time in hours)

Figure 12.7 shows the change in the electrical energy level in batteries
during a typical one-week residential load cycle. From Fig. 12.7, it can be seen that
batteries are fully charged around midnight, when the price of electricity is low.
After that, batteries discharge during peak load hours or medium load hours and are
charged again during the light load hours around midnight. This cycle repeats, which means
that the scheme charges and discharges batteries evenly. Therefore, the
peak of the load curve is shaved by the output of batteries, which results in a lower
cost of power from the power grids. Figure 12.8 illustrates the optimal scheduling
of home batteries. The bars in Fig. 12.8 represent the power output of batteries,
while the dotted line denotes the electricity rate in real time. From Fig. 12.8, we can
see that batteries are charged during the hours from 23:00 to 5:00 the next day, when the
electricity rate is in the lowest range, and discharge when the price of electricity is
high. It is observed that batteries discharge from 6:00 to 20:00 during weekdays
and from 7:00 to 19:00 during the weekend to supply the residential power demand. The
difference lies in the fact that the power demand during the weekend is generally
larger than the weekdays' demand, which demonstrates that the present scheme can
adapt to varying load conditions. From Fig. 12.8, we can also see that there are some
hours in which the batteries are idle, such as from 3:00 to 5:00 and from 21:00 to 22:00.
Considering the subsequent load demand and electricity rate, the self-learning algorithm
determines that keeping batteries idle during these hours achieves the
most economic return, which results in the lowest overall cost to the customer. The cost
of serving this typical residential load in one week is 2866.64 cents. Compared with
the cost of using the power grids alone to supply the residential load, which is 4124.13
cents, this gives a saving of 1257.49 cents over one week. This illustrates that a
considerable saving on the electricity cost is achieved. In this case, the self-learning
scheme has the ability to learn the system characteristics and provide the minimum
cost to the residential user.

Fig. 12.7 Energy changes in batteries (energy in kWh versus time in hours)

Fig. 12.8 Optimal scheduling of batteries in one week (battery power status in kW and electricity rate versus time in hours)
In order to better evaluate the performance of the self-learning scheme, we
conduct comparison studies with a fixed daily cycle scheme. The daily cycle scheme
charges batteries during the day time and releases the energy into the residential user
load when required during the expensive peak hours at night. Figure 12.9 shows the
scheduling of batteries by the fixed daily cycle scheme. The overall cost is 3284.37
cents, compared with 2866.64 cents for the present ADHDP scheme, which demonstrates
that the present scheme has a lower cost. Comparing Fig. 12.8 with Fig. 12.9, we can see
that the self-learning scheme discharges batteries one hour later, from 7:00 to 19:00,
during the weekend instead of from 6:00 to 20:00 as during weekdays to achieve optimal
performance, while the fixed daily cycle scheme ignores the differences in the demand
between weekdays and the weekend due to the static nature of the algorithm. Therefore,
we conclude that the present self-learning algorithm performs better than the fixed
algorithm because the self-learning scheme can adapt to varying load conditions and
environmental changes.

Fig. 12.9 Scheduling of batteries of the fixed daily cycle scheme (battery power status in kW and electricity rate versus time in hours)
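The weekly costs reported in this simulation study can be compared with a few lines of arithmetic. The snippet below uses only the three cost figures quoted in the text (in cents); the relative savings it computes are derived quantities and are not stated in the original.

```python
# Weekly electricity costs in cents, as quoted in the simulation study.
costs = {
    "grid only": 4124.13,
    "fixed daily cycle": 3284.37,
    "self-learning (ADHDP)": 2866.64,
}

baseline = costs["grid only"]
savings = {name: baseline - cost for name, cost in costs.items()}
relative_savings = {name: saving / baseline for name, saving in savings.items()}

for name in costs:
    print(f"{name}: {costs[name]:.2f} cents, "
          f"saving {savings[name]:.2f} cents ({100 * relative_savings[name]:.1f}%)")
```

The saving of the self-learning scheme relative to the grid-only case reproduces the 1257.49 cents quoted in the text.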

12.3 A Novel Dual Iterative Q-Learning Method for Optimal Battery Management

12.3.1 Problem Formulation

The smart residential energy system described by (12.2.5) is composed of the power
grid, the residential load, a battery system located at the side of the residential
load (including a battery and a sine wave inverter), and a power management unit
(controller). The schematic diagram of the smart residential energy system is
shown in Fig. 12.1. The battery model used in this work is based on [22, 24, 44]
and is expressed by

\[
E_{b(t+1)} = E_{bt} - P_{bt} \times \eta(P_{bt}).
\]

Let Pbt > 0 denote battery discharging, Pbt < 0 denote battery charging, and
Pbt = 0 denote the battery being idle. Let the efficiency of battery charging/discharging
be given by (12.2.1).
In this section, the power flow from the battery to the grid is not permitted,
i.e., we require Pgt ≥ 0, to guarantee the power quality of the grid. For convenience
of analysis, we introduce delays in Pbt and PLt, and then, we can define the load
balance as PL(t−1) = Pb(t−1) + Pgt. The total cost function to be minimized
is defined as

\[
\sum_{t=0}^{\infty} \gamma^{t} \bigl( m_1 (C_t P_{gt})^2 + m_2 (E_{bt} - E_b^{o})^2 + r (P_{bt})^2 \bigr), \tag{12.3.1}
\]

where 0 < γ < 1 and $E_b^{o} = \frac{1}{2}(E_b^{\min} + E_b^{\max})$. The first term
of the cost function penalizes the cost of electricity from the grid. The second term
keeps the stored energy of the battery close to the middle of the storage limit,
which avoids fully charging/discharging the battery. The third term prevents
large charging/discharging power of the battery. Hence, the second and third terms
aim to extend the lifetime of the battery. Let $x_{1t} = P_{gt}$ and $x_{2t} = E_{bt} - E_b^{o}$. Letting
$x_t = [x_{1t}, x_{2t}]^{\mathsf{T}}$ and $u_t = P_{bt}$, the equation of the residential energy system can then
be written as

\[
x_{t+1} = F(x_t, u_t, t) = \begin{bmatrix} P_{Lt} - u_t \\ x_{2t} - \eta(u_t)\,u_t \end{bmatrix}. \tag{12.3.2}
\]

Let $\underline{u}_t = (u_t, u_{t+1}, \ldots)$ denote the control sequence from t to ∞. Let

\[
M_t = \begin{bmatrix} m_1 C_t^2 & 0 \\ 0 & m_2 \end{bmatrix}.
\]

Let x0 be the initial state. Then, the cost function (12.3.1) can be written as

\[
J(x_0, \underline{u}_0, 0) = \sum_{t=0}^{\infty} \gamma^{t}\, U(x_t, u_t, t),
\]

where $U(x_t, u_t, t) = x_t^{\mathsf{T}} M_t x_t + r u_t^2$. Generally speaking, the residential
load and the real-time electricity rate are periodic functions of time. For convenience
of analysis, our discussion is based on the following assumption.

Assumption 12.3.1 The residential load PLt and the electricity rate Ct are periodic
functions with the period λ = 24 h.
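The state equation (12.3.2) and the utility U(xt, ut, t) = xt^T Mt xt + r ut^2 can be written out directly. The sketch below does so with NumPy; the function names, the rated power, and the weights m1, m2, and r are illustrative assumptions, since the text does not fix their values at this point.

```python
import numpy as np


def eta(u, p_rate):
    """Charging/discharging efficiency from (12.2.1)."""
    return 0.898 - 0.173 * abs(u) / p_rate


def next_state(x, u, load, p_rate):
    """State equation (12.3.2) with x = [P_g, E_b - E_b^o] and u = P_b."""
    return np.array([load - u, x[1] - eta(u, p_rate) * u])


def utility(x, u, rate, m1, m2, r):
    """U(x, u, t) = x^T M_t x + r u^2 with M_t = diag(m1 * C_t^2, m2)."""
    m_t = np.diag([m1 * rate**2, m2])
    return float(x @ m_t @ x + r * u**2)


x = np.array([2.0, 5.0])                 # assumed grid power and energy offset
x_next = next_state(x, u=4.0, load=6.0, p_rate=16.0)
cost = utility(x, u=4.0, rate=3.0, m1=1.0, m2=0.2, r=0.1)
```

Expanding the quadratic form gives m1 (Ct x1)^2 + m2 x2^2 + r u^2, which is the summand of (12.3.1).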

Define the control sequence set as $\mathfrak{U}_t = \{\underline{u}_t : \underline{u}_t = (u_t, u_{t+1}, \ldots),\ u_{t+i} \in \mathbb{R},\ i = 0, 1, \ldots\}$.
Then, the optimal cost function can be defined as follows:

\[
J^{*}(x_t, t) = \inf_{\underline{u}_t} \bigl\{ J(x_t, \underline{u}_t, t) : \underline{u}_t \in \mathfrak{U}_t \bigr\}.
\]

Define the optimal Q-function as $Q^{*}(x_t, u_t, t)$ such that $\min_{u_t} Q^{*}(x_t, u_t, t) = J^{*}(x_t, t)$.
Hence, the Q-function is also an action-dependent value function. According to [40,
41], the optimal Q-function satisfies the following Bellman equation:

\[
Q^{*}(x_t, u_t, t) = U(x_t, u_t, t) + \gamma \min_{u_{t+1}} Q^{*}(x_{t+1}, u_{t+1}, t+1). \tag{12.3.3}
\]

12.3.2 Dual Iterative Q-Learning Algorithm

In this section, a novel dual iterative Q-learning algorithm is developed to obtain the
optimal control law for residential energy systems [42]. A new convergence analysis
method will also be developed in this section. From (12.3.3), we can see that the
optimal Q-function Q∗ (xt , ut , t) is a nonlinear function which is difficult to obtain.
According to Assumption 12.3.1, there exist ρ = 0, 1, . . . and θ = 0, 1, . . . , 23 such
that t = ρλ + θ , ∀t = 0, 1, . . .. Let k = ρλ. Then, PLt = PL(k+θ) = PLθ and Ct =
Ck+θ = Cθ , respectively. Define Uk as the control sequence in 24 h from k to k +
λ − 1, i.e., Uk = (uk , uk+1 , . . . , uk+λ−1 ). We can define a new utility function as

\[
\Pi(x_k, U_k) = \sum_{\theta=0}^{\lambda-1} \gamma^{\theta}\, U(x_{k+\theta}, u_{k+\theta}, \theta), \quad \forall k \in \{0, \lambda, 2\lambda, \ldots\}. \tag{12.3.4}
\]

The utility function in (12.3.4) is time-invariant for k = 0, λ, 2λ, . . . , since the matrix
Mt used in the definition of U(xt, ut, t) is periodic with period λ. Then, (12.3.3)
can be expressed as

\[
Q^{*}(x_k, U_k) = \Pi(x_k, U_k) + \tilde{\gamma} \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}),
\]

where $\tilde{\gamma} = \gamma^{\lambda}$ and $Q(x_k, U_k)$ is defined according to $\Pi(x_k, U_k)$. The optimal control
law sequence can be expressed as

\[
U^{*}(x_k) = \arg\min_{U_k} Q^{*}(x_k, U_k).
\]
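The periodic decomposition used above can be checked numerically: writing t = ρλ + θ, the discounted sum of hourly utilities over several periods equals the sum of the period utilities Π discounted by γ̃ = γ^λ. The sketch below uses a short period of λ = 3 and made-up hourly utility values purely for illustration; the helper names are assumptions.

```python
def split_time(t, lam=24):
    """Write t = rho * lam + theta and return (rho, theta, k) with k = rho * lam."""
    rho, theta = divmod(t, lam)
    return rho, theta, rho * lam


def period_utility(hourly_utilities, gamma):
    """Pi = sum over theta of gamma^theta * U_theta for one period, as in (12.3.4)."""
    return sum(gamma**theta * u for theta, u in enumerate(hourly_utilities))


# Illustrative check with lam = 3 and made-up hourly utilities over two periods.
gamma, lam = 0.9, 3
hourly = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
direct = sum(gamma**t * u for t, u in enumerate(hourly))
gamma_tilde = gamma**lam
by_period = (period_utility(hourly[:lam], gamma)
             + gamma_tilde * period_utility(hourly[lam:], gamma))
```

Both `direct` and `by_period` evaluate the same discounted cost, which is why the hourly Bellman equation (12.3.3) can be rewritten over whole periods.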

Based on the preparations above, a new dual iterative Q-learning algorithm can
be developed. In the present algorithm, two iterations are utilized, which are external
iterations (i-iterations in brief) and internal iterations (j-iterations in brief), respec-
tively. Let i = 0, 1, . . . be the external iteration index.
Let Ψ(xk, uk) be an arbitrary positive-semidefinite function. Define the initial
Q-function $Q_0(x_k, U_k)$ as

\[
Q_0(x_k, U_k) = \Pi(x_k, U_k) + \tilde{\gamma} \min_{u_{k+\lambda}} \Psi(x_{k+\lambda}, u_{k+\lambda}). \tag{12.3.5}
\]

The control law $A_0$ can be computed as

\[
A_0(x_k) = \arg\min_{U_k} Q_0(x_k, U_k). \tag{12.3.6}
\]

For i = 1, 2, . . ., the i-iteration will proceed between

\[
\begin{aligned}
Q_i(x_k, U_k) &= \Pi(x_k, U_k) + \tilde{\gamma} \min_{U_{k+\lambda}} Q_{i-1}(x_{k+\lambda}, U_{k+\lambda}) \\
&= \Pi(x_k, U_k) + \tilde{\gamma}\, Q_{i-1}(x_{k+\lambda}, A_{i-1}(x_{k+\lambda})),
\end{aligned}
\tag{12.3.7}
\]

and

\[
A_i(x_k) = \arg\min_{U_k} Q_i(x_k, U_k), \tag{12.3.8}
\]

where $Q_0(\cdot, \cdot)$ and $A_0(\cdot)$ are determined by (12.3.5) and (12.3.6). Let j = 0, 1, . . . , 24
be the internal iteration index. For i = 0 and j = 0, let the initial Q-function be

\[
Q_0^{0}(x_k, u_k) = \Psi(x_k, u_k) \tag{12.3.9}
\]

and the corresponding control law is obtained by

\[
u_0^{0}(x_k) = \arg\min_{u_k} Q_0^{0}(x_k, u_k). \tag{12.3.10}
\]

For i = 0 and j = 1, 2, . . . , 24, the j-iteration will proceed between

\[
\begin{aligned}
Q_0^{j}(x_k, u_k) &= U(x_k, u_k, \lambda - j) + \gamma \min_{u_{k+1}} Q_0^{j-1}\bigl(x_{k+1}^{(j)}, u_{k+1}\bigr) \\
&= U(x_k, u_k, \lambda - j) + \gamma\, Q_0^{j-1}\bigl(x_{k+1}^{(j)}, u_0^{j-1}(x_{k+1}^{(j)})\bigr),
\end{aligned}
\tag{12.3.11}
\]

and

\[
u_0^{j}(x_k) = \arg\min_{u_k} Q_0^{j}(x_k, u_k), \tag{12.3.12}
\]

where

\[
x_{k+1}^{(j)} = F(x_k, u_k, \lambda - j) = \begin{bmatrix} P_{L(\lambda-j)} - u_k \\ x_{2k} - \eta(u_k)\,u_k \end{bmatrix} \tag{12.3.13}
\]

and

\[
U(x_k, u_k, \lambda - j) = x_k^{\mathsf{T}} M_{\lambda-j}\, x_k + r u_k^2.
\]

Note that there are 24 such systems in (12.3.13) according to j = 1, 2, . . . , 24, and
they are system (12.3.2) working at different hours according to λ − j.
For i = 1, 2, . . ., let $Q_i^{0}(x_k, u_k) = Q_{i-1}^{24}(x_k, u_k)$. For j = 0, we calculate

\[
u_i^{0}(x_k) = \arg\min_{u_k} Q_i^{0}(x_k, u_k).
\]

For j = 1, 2, . . . , 24, the j-iteration will proceed between

\[
\begin{aligned}
Q_i^{j}(x_k, u_k) &= U(x_k, u_k, \lambda - j) + \gamma \min_{u_{k+1}} Q_i^{j-1}\bigl(x_{k+1}^{(j)}, u_{k+1}\bigr) \\
&= U(x_k, u_k, \lambda - j) + \gamma\, Q_i^{j-1}\bigl(x_{k+1}^{(j)}, u_i^{j-1}(x_{k+1}^{(j)})\bigr),
\end{aligned}
\tag{12.3.14}
\]

and

\[
u_i^{j}(x_k) = \arg\min_{u_k} Q_i^{j}(x_k, u_k). \tag{12.3.15}
\]

Then, we can obtain the iterative control law sequence as

\[
A_i(x_k) = \{u_i^{0}(x_k), u_i^{23}(x_k), u_i^{22}(x_k), \ldots, u_i^{1}(x_k)\}, \quad \forall i = 0, 1, \ldots. \tag{12.3.16}
\]

Such an ordering of control laws can be understood with the proof of the next theorem
(cf. (12.3.17) when j = 24).
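The structure of the two nested iterations can be sketched on a finite toy problem. The code below is a tabular simplification and not the neural network implementation used in the text: states and actions are assumed to be discretized, `utility[h, s, a]` stands for U at hour h, and `next_state[h, s, a]` stands for the index of F at hour h. The random toy data, the short period of 4 hours, and all names are assumptions for illustration.

```python
import numpy as np


def dual_iterative_q(utility, next_state, gamma, outer_iters):
    """Tabular sketch of the i-iteration (outer) and j-iteration (inner).

    utility[h, s, a] and next_state[h, s, a] describe hour h of one period.
    Returns the last Q-table (hour 0), per-hour greedy controls, and the
    largest change of the Q-table over the final outer iteration.
    """
    lam, n_states, n_actions = utility.shape
    q = np.zeros((n_states, n_actions))          # Psi = 0 as the initial Q-function
    policy = np.zeros((lam, n_states), dtype=int)
    change = np.inf
    for _ in range(outer_iters):                 # external i-iteration
        q_start = q.copy()                       # Q_i^0 = Q_{i-1}^lam
        for j in range(1, lam + 1):              # internal j-iteration
            hour = lam - j                       # the j-th sweep works at hour lam - j
            best_next = q.min(axis=1)            # min over u_{k+1} of Q_i^{j-1}
            q = utility[hour] + gamma * best_next[next_state[hour]]
            policy[hour] = q.argmin(axis=1)      # greedy control at this hour
        change = np.max(np.abs(q - q_start))
    return q, policy, change


# Hypothetical toy problem: 4-hour period, 5 states, 3 actions, random data.
rng = np.random.default_rng(0)
toy_utility = rng.uniform(0.0, 1.0, size=(4, 5, 3))
toy_next = rng.integers(0, 5, size=(4, 5, 3))
q_final, controls, last_change = dual_iterative_q(toy_utility, toy_next, gamma=0.9, outer_iters=80)
```

Each pass of the inner loop sweeps backward through the hours of one period, and each outer iteration restarts from the Q-function produced by the previous one; in this toy setting the change between outer iterations shrinks by roughly a factor of γ^λ per pass.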

In the present section, the convergence property of the dual iterative Q-learning
algorithm will be established. First, we will show that iterative control law sequence
Ai (xk ) obtained by the j-iteration can minimize the total cost in each 24-h period.
Theorem 12.3.1 For i = 0, 1, . . . and j = 0, 1, . . . , 24, let the iterative Q-functions
$Q_i(x_k, U_k)$ and $Q_i^{j}(x_k, u_k)$ be obtained by (12.3.5)–(12.3.15). Then,

\[
\min_{U_k} Q_i(x_k, U_k) = \min_{u_k} Q_i^{24}(x_k, u_k).
\]

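Proof The statement can be proven by mathematical induction. First, for i = 0, we
have

\[
\begin{aligned}
\min_{u_k} Q_0^{j}(x_k, u_k)
&= \min_{u_k} \Bigl\{ U(x_k, u_k, \lambda - j) + \gamma \min_{u_{k+1}} Q_0^{j-1}\bigl(x_{k+1}^{(j)}, u_{k+1}\bigr) \Bigr\} \\
&= \min_{u_k} \Bigl\{ U(x_k, u_k, \lambda - j) + \gamma \min_{u_{k+1}} \Bigl\{ U\bigl(x_{k+1}^{(j)}, u_{k+1}, \lambda - j + 1\bigr) \\
&\qquad + \gamma \min_{u_{k+2}} \Bigl\{ U\bigl(x_{k+2}^{(j-1)}, u_{k+2}, \lambda - j + 2\bigr) + \cdots \\
&\qquad + \gamma \min_{u_{k+j-1}} \Bigl\{ U\bigl(x_{k+j-1}^{(2)}, u_{k+j-1}, \lambda - 1\bigr) + \gamma \min_{u_{k+j}} \Psi\bigl(x_{k+j}^{(1)}, u_{k+j}\bigr) \Bigr\} \cdots \Bigr\} \Bigr\} \Bigr\} \\
&= \min_{(u_k, u_{k+1}, \ldots, u_{k+j-1})} \Bigl\{ U(x_k, u_k, \lambda - j) + \sum_{l=1}^{j-1} \gamma^{l}\, U\bigl(x_{k+l}^{(j-l+1)}, u_{k+l}, \lambda - j + l\bigr) \\
&\qquad + \gamma^{j} \min_{u_{k+j}} \Psi\bigl(x_{k+j}^{(1)}, u_{k+j}\bigr) \Bigr\}.
\end{aligned}
\tag{12.3.17}
\]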
Let j = 24. According to (12.3.4) and (12.3.5), we have

\[
\min_{u_k} Q_0^{24}(x_k, u_k) = \min_{U_k} \Bigl\{ \Pi(x_k, U_k) + \tilde{\gamma} \min_{u_{k+\lambda}} \Psi(x_{k+\lambda}, u_{k+\lambda}) \Bigr\} = \min_{U_k} Q_0(x_k, U_k).
\]

Note that the superscripts used for $x_{k+1}, x_{k+2}, \ldots$ can be dropped when j = 24. For
example, when j = 24, the calculation of $x_{k+2}^{(j-1)} = x_{k+2}^{(23)}$ requires system (12.3.13) at
hour 1 (or equivalently, at k + 1), and the superscript 23 indicates exactly the same.
The conclusion holds for i = 0. Assume that the conclusion holds for i = τ − 1, i.e.,

\[
\min_{U_k} Q_{\tau-1}(x_k, U_k) = \min_{u_k} Q_{\tau-1}^{24}(x_k, u_k).
\]

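Then, for i = τ, we have

\[
\begin{aligned}
\min_{u_k} Q_\tau^{24}(x_k, u_k)
&= \min_{u_k} \Bigl\{ U(x_k, u_k, 0) + \gamma \min_{u_{k+1}} \Bigl\{ U\bigl(x_{k+1}^{(24)}, u_{k+1}, 1\bigr) + \gamma \min_{u_{k+2}} Q_\tau^{22}\bigl(x_{k+2}^{(23)}, u_{k+2}\bigr) \Bigr\} \Bigr\} \\
&= \min_{u_k} \Bigl\{ U(x_k, u_k, 0) + \gamma \min_{u_{k+1}} \Bigl\{ U\bigl(x_{k+1}^{(24)}, u_{k+1}, 1\bigr) \\
&\qquad + \gamma \min_{u_{k+2}} \Bigl\{ U\bigl(x_{k+2}^{(23)}, u_{k+2}, 2\bigr) + \gamma \min_{u_{k+3}} \Bigl\{ U\bigl(x_{k+3}^{(22)}, u_{k+3}, 3\bigr) + \cdots \\
&\qquad + \gamma \min_{u_{k+23}} \Bigl\{ U\bigl(x_{k+23}^{(2)}, u_{k+23}, 23\bigr) + \gamma \min_{u_{k+\lambda}} Q_\tau^{0}\bigl(x_{k+\lambda}^{(1)}, u_{k+\lambda}\bigr) \Bigr\} \cdots \Bigr\} \Bigr\} \Bigr\} \Bigr\} \\
&= \min_{U_k} \Bigl\{ \Pi(x_k, U_k) + \tilde{\gamma} \min_{u_{k+\lambda}} Q_{\tau-1}^{24}(x_{k+\lambda}, u_{k+\lambda}) \Bigr\} \\
&= \min_{U_k} \Bigl\{ \Pi(x_k, U_k) + \tilde{\gamma} \min_{U_{k+\lambda}} Q_{\tau-1}(x_{k+\lambda}, U_{k+\lambda}) \Bigr\} \\
&= \min_{U_k} Q_\tau(x_k, U_k).
\end{aligned}
\]

The mathematical induction is complete.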

From Theorem 12.3.1, we can obtain the following corollary.

Corollary 12.3.1 Let μ(xk) be an arbitrary control law. For i = 0, 1, . . . and j =
0, 1, . . . , 24, define a new value function as

\[
P_i^{j}(x_k, u_k) = U(x_k, u_k) + \gamma P_i^{j-1}(x_{k+1}, \mu(x_{k+1})),
\]

and define $Q_i^{j}(x_k, u_k)$ as in (12.3.14). For i = 0, 1, . . ., let $P_i^{0}(x_k, u_k) = Q_i^{0}(x_k, u_k)$.
Then, for j = 0, 1, . . . , 24, we have

\[
Q_i^{j}(x_k, u_k) \le P_i^{j}(x_k, u_k).
\]

From Theorem 12.3.1 and Corollary 12.3.1, for i = 0, 1, . . ., the total cost in each
period can be minimized by the iterative control law sequence Ai (xk ) according to
j-iteration (12.3.9)–(12.3.16). Next, the convergence property of i-iteration will be
developed.

Theorem 12.3.2 For i = 0, 1, . . ., let $Q_i(x_k, U_k)$ and $A_i(x_k)$ be obtained by
i-iteration (12.3.5)–(12.3.8). Then, the iterative Q-function $Q_i(x_k, U_k)$ converges to
its optimum, i.e.,

\[
\lim_{i \to \infty} Q_i(x_k, U_k) = Q^{*}(x_k, U_k). \tag{12.3.18}
\]

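Proof For functions $Q^{*}(x_k, U_k)$, $\Pi(x_k, U_k)$, and $Q_0(x_k, U_k)$, inspired by [26], let
$\underline{\varsigma}$, $\overline{\varsigma}$, $\underline{\delta}$, and $\overline{\delta}$ be constants such that

\[
\underline{\varsigma}\, \Pi(x_k, U_k) \le \tilde{\gamma} \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}) \le \overline{\varsigma}\, \Pi(x_k, U_k),
\]

and

\[
\underline{\delta}\, Q^{*}(x_k, U_k) \le Q_0(x_k, U_k) \le \overline{\delta}\, Q^{*}(x_k, U_k),
\]

respectively, where $0 < \underline{\varsigma} \le \overline{\varsigma} < \infty$ and $0 \le \underline{\delta} \le \overline{\delta} < \infty$. Since $Q^{*}(x_k, U_k)$ is
unknown, the values of $\underline{\varsigma}$, $\overline{\varsigma}$, $\underline{\delta}$, and $\overline{\delta}$ cannot be obtained directly. We will prove
that for any such constants, the iterative Q-function $Q_i(x_k, U_k)$ will
converge to the optimum, so that the estimation of these constants can be omitted. The
proof proceeds in four steps. First, we show that if $0 \le \underline{\delta} \le \overline{\delta} < 1$, then for
i = 0, 1, . . ., the iterative Q-function $Q_i(x_k, U_k)$ satisfies

\[
\left(1 + \frac{\underline{\delta} - 1}{(1 + \overline{\varsigma}^{-1})^{i}}\right) Q^{*}(x_k, U_k) \le Q_i(x_k, U_k) \le \left(1 + \frac{\overline{\delta} - 1}{(1 + \underline{\varsigma}^{-1})^{i}}\right) Q^{*}(x_k, U_k). \tag{12.3.19}
\]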

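The inequality (12.3.19) can be proven by mathematical induction. For i = 0, (12.3.19)
reduces to the definition of $\underline{\delta}$ and $\overline{\delta}$. For i = 1, we
have

\[
\begin{aligned}
Q_1(x_k, U_k) &= \Pi(x_k, U_k) + \tilde{\gamma} \min_{U_{k+\lambda}} Q_0(x_{k+\lambda}, U_{k+\lambda}) \\
&\ge \Pi(x_k, U_k) + \underline{\delta}\, \tilde{\gamma} \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}) \\
&\ge \left(1 + \overline{\varsigma}\, \frac{\underline{\delta} - 1}{1 + \overline{\varsigma}}\right) \Pi(x_k, U_k) + \tilde{\gamma} \left(\underline{\delta} - \frac{\underline{\delta} - 1}{1 + \overline{\varsigma}}\right) \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}) \\
&= \left(1 + \frac{\overline{\varsigma}(\underline{\delta} - 1)}{1 + \overline{\varsigma}}\right) \Bigl( \Pi(x_k, U_k) + \tilde{\gamma} \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}) \Bigr) \\
&= \left(1 + \frac{\underline{\delta} - 1}{1 + \overline{\varsigma}^{-1}}\right) Q^{*}(x_k, U_k).
\end{aligned}
\tag{12.3.20}
\]

Similarly, we can get

\[
Q_1(x_k, U_k) \le \left(1 + \frac{\overline{\delta} - 1}{1 + \underline{\varsigma}^{-1}}\right) Q^{*}(x_k, U_k).
\]

Thus, (12.3.19) holds for i = 1. Assume that (12.3.19) holds for i = l − 1, l =
1, 2, . . .. Then, for i = l, we have

\[
\begin{aligned}
Q_l(x_k, U_k) &= \Pi(x_k, U_k) + \tilde{\gamma} \min_{U_{k+\lambda}} Q_{l-1}(x_{k+\lambda}, U_{k+\lambda}) \\
&\ge \Pi(x_k, U_k) + \tilde{\gamma} \left(1 + \frac{\overline{\varsigma}^{\,l-1}(\underline{\delta} - 1)}{(1 + \overline{\varsigma})^{l-1}}\right) \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}) \\
&\ge \left(1 + \frac{\overline{\varsigma}^{\,l}(\underline{\delta} - 1)}{(1 + \overline{\varsigma})^{l}}\right) \Bigl( \Pi(x_k, U_k) + \tilde{\gamma} \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}) \Bigr) \\
&= \left(1 + \frac{\underline{\delta} - 1}{(1 + \overline{\varsigma}^{-1})^{l}}\right) Q^{*}(x_k, U_k).
\end{aligned}
\tag{12.3.21}
\]

Similarly, we can also get

\[
Q_l(x_k, U_k) \le \left(1 + \frac{\overline{\delta} - 1}{(1 + \underline{\varsigma}^{-1})^{l}}\right) Q^{*}(x_k, U_k).
\]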

Hence, (12.3.19) holds for i = 0, 1, . . .. The mathematical induction is complete.
Second, we show that if $0 \le \underline{\delta} \le 1 \le \overline{\delta} < \infty$, then the iterative Q-function
$Q_i(x_k, U_k)$ satisfies

\[
\left(1 + \frac{\underline{\delta} - 1}{(1 + \overline{\varsigma}^{-1})^{i}}\right) Q^{*}(x_k, U_k) \le Q_i(x_k, U_k) \le \left(1 + \frac{\overline{\delta} - 1}{(1 + \overline{\varsigma}^{-1})^{i}}\right) Q^{*}(x_k, U_k). \tag{12.3.22}
\]

The lower bound of (12.3.22) can be proven according to the steps similar to (12.3.20)
and (12.3.21). For the upper bound of (12.3.22), for i = 1, we have

\[
\begin{aligned}
Q_1(x_k, U_k) &= \Pi(x_k, U_k) + \tilde{\gamma} \min_{U_{k+\lambda}} Q_0(x_{k+\lambda}, U_{k+\lambda}) \\
&\le \Pi(x_k, U_k) + \overline{\delta}\, \tilde{\gamma} \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}) \\
&\qquad + \frac{\overline{\delta} - 1}{1 + \overline{\varsigma}} \Bigl( \overline{\varsigma}\, \Pi(x_k, U_k) - \tilde{\gamma} \min_{U_{k+\lambda}} Q^{*}(x_{k+\lambda}, U_{k+\lambda}) \Bigr) \\
&\le \left(1 + \frac{\overline{\delta} - 1}{1 + \overline{\varsigma}^{-1}}\right) Q^{*}(x_k, U_k).
\end{aligned}
\]

According to mathematical induction, we can obtain the upper bound of (12.3.22).
Third, for the situation $1 \le \underline{\delta} \le \overline{\delta} < \infty$, according to the steps similar to (12.3.20)
and (12.3.21), we can prove that, for i = 0, 1, . . ., the iterative Q-function
$Q_i(x_k, U_k)$ satisfies (12.3.19).
Finally, considering the three situations above, for the given constants $\underline{\varsigma}$, $\overline{\varsigma}$, $\underline{\delta}$, and
$\overline{\delta}$, according to (12.3.19) and (12.3.22), we obtain (12.3.18), as i → ∞. The proof
is complete.

Corollary 12.3.2 For i = 0, 1, . . ., let $Q_i(x_k, U_k)$ and $A_i(x_k)$ be obtained by
i-iteration (12.3.5)–(12.3.8). Then, the iterative control law sequence $A_i(x_k)$ converges
to the optimal control law sequence, i.e.,

\[
\lim_{i \to \infty} A_i(x_k) = U^{*}(x_k).
\]
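The bounds (12.3.19) and (12.3.22) multiply Q* by factors that approach 1 as i grows, which is what drives the convergence in Theorem 12.3.2. The short sketch below evaluates these factors for the first two cases of the proof (lower constant δ between 0 and 1). The function name and the numerical constants are assumptions chosen only to show the trend.

```python
def bound_factors(delta_lo, delta_hi, sigma_lo, sigma_hi, i):
    """Multipliers of Q* bounding Q_i in (12.3.19) and (12.3.22); assumes 0 <= delta_lo <= 1."""
    if not 0.0 <= delta_lo <= 1.0 or delta_hi < delta_lo or not 0.0 < sigma_lo <= sigma_hi:
        raise ValueError("constants outside the range covered by this sketch")
    lower = 1.0 + (delta_lo - 1.0) / (1.0 + 1.0 / sigma_hi) ** i
    sigma_up = sigma_lo if delta_hi < 1.0 else sigma_hi   # (12.3.19) versus (12.3.22)
    upper = 1.0 + (delta_hi - 1.0) / (1.0 + 1.0 / sigma_up) ** i
    return lower, upper


# Assumed constants: a poor initial guess with 0 * Q* <= Q_0 <= 5 * Q*.
factors = [bound_factors(0.0, 5.0, 0.5, 2.0, i) for i in (0, 1, 5, 20, 60)]
```

At i = 0 the factors equal the constants that bound the initial Q-function, and they tighten geometrically with ratio 1/(1 + ς^{-1}), so a larger ς gives slower convergence.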

12.3.3 Neural Network Implementation

In this section, neural networks are introduced to implement the dual iterative Q-learning algorithm. Two neural networks are used, namely the critic network and the action network. Both neural networks are chosen as three-layer backpropagation (BP) networks. The whole structural diagram is shown in Fig. 12.10.

Fig. 12.10 The structural diagram of the dual iterative Q-learning algorithm (the discount factor γ is not shown in the diagram)

The role of the action network is to approximate the iterative control law sequence $\mathcal{A}_i(x_k)$, $i = 0, 1, \ldots$, defined in (12.3.8). The target of the action network can be defined as in (12.3.12) and (12.3.15). The action network can be constructed by 2 input neurons, 10 sigmoidal hidden neurons, and 1 linear output neuron. Let $l = 0, 1, \ldots$ be the training step. The output of the action network can be expressed as

$$\hat{u}_i^{j,l}(x_k) = W_a^{ijT}(l)\,\sigma(Z_a(x_k)),$$

where $Z_a(x_k) = Y_a^T x_k$ and $\sigma(\cdot)$ is a sigmoid function [38]. To enhance the training speed, only the hidden–output weight $W_a^{ij}(l)$ is updated during the neural network training, while the input–hidden weight is fixed [15]. According to [38], the action network's weight update is expressed as follows:

$$W_a^{ij}(l+1) = W_a^{ij}(l) - \beta_a \frac{\partial E_{ai}^{j}(l)}{\partial W_a^{ij}(l)},$$

where

$$E_{ai}^{j}(l) = \frac{1}{2}\left(e_{ai}^{j}(l)\right)^2, \qquad e_{ai}^{j}(l) = \hat{u}_i^{j,l}(x_k) - u_i^{j}(x_k),$$

and $\beta_a > 0$ is the learning rate of the action network. The goal of the critic network is to obtain $Q_i(x_k, \mathcal{U}_k)$ by updating $Q_i^j(x_k, u_k)$ in (12.3.14) for $i = 0, 1, \ldots$ and $j = 0, 1, \ldots, 24$, iteratively. The critic network can be constructed by 3 input neurons, 15 sigmoidal hidden neurons, and 1 linear output neuron. Let $Z_{ck} = [x_k^T, u_k]^T$ be the input vector of the critic network. Then, the output of the critic network can be expressed as

$$\hat{Q}_i^{j,l}(x_k, u_k) = W_c^{ijT}(l)\,\sigma(\bar{Z}_{ck}),$$

where $\bar{Z}_{ck} = Y_c^T Z_{ck}$ and $\sigma(\cdot)$ is a sigmoid function [38]. During the neural network training, the hidden–output weight $W_c^{ij}(l)$ is updated, while the input–hidden weight $Y_c$ is fixed. According to [38], the critic network weight update is expressed as follows:

$$W_c^{ij}(l+1) = W_c^{ij}(l) - \alpha_c \frac{\partial E_{ci}^{j}(l)}{\partial W_c^{ij}(l)},$$

where

$$E_{ci}^{j}(l) = \frac{1}{2}\left(e_{ci}^{j}(l)\right)^2, \qquad e_{ci}^{j}(l) = \hat{Q}_i^{j,l}(x_k, u_k) - Q_i^{j}(x_k, u_k),$$

and αc > 0 is the learning rate of the critic network. The dual iterative Q-learning
algorithm implemented by action and critic networks is explained step by step and
shown in Algorithm 12.3.1.
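The weight update described above can be illustrated with a small NumPy sketch. It is only a minimal illustration under stated assumptions: the function name `train_output_weights`, the random initialization of the fixed input–hidden weights, the batch averaging of the gradient, and the stopping rule are all hypothetical choices, not the authors' implementation. Only the structure (sigmoidal hidden layer, linear output, gradient descent on the hidden–output weights while the input–hidden weights stay fixed) follows the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_output_weights(X, targets, n_hidden, lr=0.01, tol=1e-6,
                         max_steps=20000, seed=0):
    """Fit W in y_hat = W^T sigma(Y^T x) by gradient descent; Y stays fixed."""
    rng = np.random.default_rng(seed)
    # Fixed input-hidden weights (random initialization is an assumption).
    Y = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    W = np.zeros(n_hidden)            # trainable hidden-output weights
    H = sigmoid(X @ Y)                # hidden activations, one row per sample
    E = np.inf
    for _ in range(max_steps):
        e = H @ W - targets           # e(l) = network output - target
        E = 0.5 * np.mean(e ** 2)     # E(l) = (1/2) e(l)^2, averaged over samples
        if E <= tol:                  # training precision reached
            break
        W = W - lr * (H.T @ e) / len(e)   # W(l+1) = W(l) - lr * dE/dW
    return Y, W, E
```

For an action-network-sized example one would call it with two inputs and ten hidden neurons, e.g. `train_output_weights(X, targets, n_hidden=10)`, where `X` has one row per sampled state.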

Algorithm 12.3.1 Dual iterative Q-learning algorithm


Initialization:
Collect an array of system data for the residential energy system (12.3.2).
Give a positive semidefinite function Ψ (xk , uk ).
Give the computation precision ε > 0.
Iteration:
Step 1. Let $i = 0$. For $j = 0$, let $Q_0^0(x_k, u_k) = \Psi(x_k, u_k)$.
Step 2. For $j = 0, 1, \ldots, 24$, train the critic and action networks to obtain $Q_0^j(x_k, u_k)$ and $u_0^j(x_k)$ that satisfy (12.3.9)–(12.3.12).
Step 3. Let $Q_0(x_k, \mathcal{U}_k) = Q_0^{24}(x_k, u_k)$. Obtain $\mathcal{A}_0(x_k)$ by (12.3.6).
Step 4. Let $i = i + 1$.
Step 5. For $j = 0, 1, \ldots, 24$, train the critic and action networks to obtain $Q_i^j(x_k, u_k)$ and $u_i^j(x_k)$ that satisfy (12.3.14) and (12.3.15), respectively.
Step 6. Let $Q_i(x_k, \mathcal{U}_k) = Q_i^{24}(x_k, u_k)$. Obtain $Q_i(x_k, \mathcal{U}_k)$ and $\mathcal{A}_i(x_k)$ by (12.3.7) and (12.3.8), respectively.
Step 7. If $|Q_i(x_k, \mathcal{U}_k) - Q_{i-1}(x_k, \mathcal{U}_k)| \le \varepsilon$, then go to the next step; else, go to Step 4.
Step 8. Obtain $\mathcal{U}_i(x_k) = [u_i^{23}(x_k), \ldots, u_i^{0}(x_k)]$.
Step 9. Return $Q_i(x_k, \mathcal{U}_k)$ and $\mathcal{U}_i(x_k)$.

12.3.4 Numerical Analysis

In this section, the performance of the dual iterative Q-learning algorithm will be
examined by numerical experiments. Comparisons will also be given to show the
superiority of the present algorithm. The profiles of the residential load demand
and the real-time electricity rate are taken from [8, 17, 22], where the real-time
electricity rate and the residential load demand for one week (168 h) are shown in
Fig. 12.11a and c, respectively. We can see that the real-time electricity rate and the
residential load demand are both periodic-like functions with the period λ = 24. The
average trajectories of the electricity rate and the residential load demand are shown
in Fig. 12.11b, d. In this section, we use average residential load demand and average
electricity rate as the periodic residential load demand and electricity rate.

Fig. 12.11 Residential electricity rate and load demand. a Real-time electricity rate (cents/kWh) for 168 h. b Average electricity rate (cents/kWh). c Residential load demand (kW) for 168 h. d Average residential load demand (kW)

We assume that the supply from the power grid guarantees the residential load demand at any time. Define the capacity of the battery as 100 kWh. Let the lower and upper storage limits of the battery be $E_b^{\min} = 20$ kWh and $E_b^{\max} = 80$ kWh, respectively. The rated power output of the battery and the maximum charging/discharging rate are both 16 kW. The initial level of the battery is 60 kWh. Let the cost function be expressed as in (12.3.1), where we set $m_1 = 1$, $m_2 = 0.2$, $r = 0.1$, and $\gamma = 0.995$. Let the initial function be $\Psi(x_k, u_k) = [x_k^T, u_k^T]P[x_k^T, u_k^T]^T$, where $P = I$ is the identity matrix with a suitable dimension. Let the initial state be $x_0 = [8, 60]^T$. After normalizing the data of the residential load demand and the electricity rate [1, 38], we implement the present dual iterative Q-learning algorithm by neural networks for $i = 20$ iterations to guarantee the computation precision $\varepsilon = 10^{-4}$. The learning rates of the action and critic networks are 0.01, and the training precisions of the neural networks are $10^{-6}$. Let $Q_i^j(x_0, \bar{u}) = \min_u Q_i^j(x_0, u)$. The trajectory of $Q_i^j(x_0, \bar{u})$ is shown in Fig. 12.12. After $i = 20$ iterations, we get $Q_i^j(x_0, \bar{u}) = Q_{i-1}^j(x_0, \bar{u})$, $j = 0, 1, \ldots, 24$, which means the iterative Q-function is convergent to the optimum. According to the one week's residential load demand and electricity rate, the optimal control of the battery is shown in Fig. 12.13.
In the present study, the time-based Q-learning (TBQL) algorithm [22] and the particle swarm optimization (PSO) algorithm [17] are used for comparison to illustrate the superiority of the present dual iterative Q-learning algorithm. For $t = 0, 1, \ldots$, the goal of the TBQL algorithm [22, 38] is to design an iterative control that satisfies the following optimality equation

$$Q(x_{t-1}, u_{t-1}) = U(x_t, u_t) + \gamma Q(x_t, u_t).$$

Fig. 12.12 The trajectory of the iterative Q-function

Fig. 12.13 Optimal control of the battery in one week

Let the initial function and the structures of the action and critic networks which implement the TBQL algorithm be the same as those in our example. For the PSO algorithm [17], let $G = 30$ be the swarm size. The position of each particle at time $t$ is represented by $x_{\ell t}$, $\ell = 1, 2, \ldots, G$, and its movement by the velocity vector $\nu_{\ell t}$. Then, the update rule of PSO can be expressed as

$$x_{\ell t} = x_{\ell(t-1)} + \nu_{\ell t},$$

$$\nu_{\ell t} = \omega\nu_{\ell(t-1)} + \varphi_1\rho_1^T(p_\ell - x_{\ell(t-1)}) + \varphi_2\rho_2^T(p_g - x_{\ell(t-1)}).$$

Let the inertia factor be $\omega = 0.7$. Let the correction factors be $\rho_1 = \rho_2 = [1, 1]^T$. Let $\varphi_1$ and $\varphi_2$ be random numbers in $[0, 1]$. Let $p_\ell$ be the best position of particle $\ell$, and let $p_g$ be the global best position. Implement the TBQL algorithm for 100 time steps, and implement the PSO algorithm for 100 iterations. Let the real-time cost function be $R_{ct} = C_t P_{gt}$. The corresponding real-time cost functions are shown in Fig. 12.14a, where the term "original" denotes "no battery system." The comparison of the total cost for 168 h is displayed in Table 12.1. From Table 12.1, the superiority of our dual iterative Q-learning algorithm can be verified. The trajectories of the battery energy by the dual iterative Q-learning and TBQL algorithms are shown in Fig. 12.14b. We can see that using the TBQL algorithm, the battery is fully charged each day, while the battery level is more reasonable by the dual iterative Q-learning algorithm.

Fig. 12.14 Numerical comparisons. a Real-time cost comparison among dual iterative Q-learning, TBQL, and PSO algorithms. b Battery energy comparison between dual iterative Q-learning and TBQL algorithms

Table 12.1 Cost comparison

                      Original    PSO        TBQL       Dual iterative Q-learning
Total cost (cents)    4124.13     3029.96    2866.64    2797.86
Saving (%)            -           26.53      30.49      32.16
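As a rough illustration of the PSO update rule used in the comparison, the sketch below applies it to a generic objective. The swarm size, inertia factor, and number of iterations follow the text. Everything else is an assumption made for the example: the function name `pso`, the search bounds, the zero initial velocities, and the reading of the correction factors as element-wise weights equal to one. The objective passed in is hypothetical and is not the battery cost of this section.

```python
import numpy as np

def pso(objective, dim, swarm_size=30, iters=100, omega=0.7, rho=1.0,
        bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with the position/velocity update quoted in the text."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(bounds[0], bounds[1], size=(swarm_size, dim))  # positions
    v = np.zeros_like(x)                                           # velocities
    p = x.copy()                                  # best position of each particle
    p_val = np.array([objective(xi) for xi in x])
    g = p[np.argmin(p_val)].copy()                # global best position
    for _ in range(iters):
        phi1 = rng.uniform(0.0, 1.0, size=(swarm_size, 1))  # random numbers in [0, 1]
        phi2 = rng.uniform(0.0, 1.0, size=(swarm_size, 1))
        v = omega * v + phi1 * rho * (p - x) + phi2 * rho * (g - x)
        x = x + v
        val = np.array([objective(xi) for xi in x])
        better = val < p_val                      # particles that improved
        p[better] = x[better]
        p_val[better] = val[better]
        g = p[np.argmin(p_val)].copy()
    return g, float(p_val.min())
```

A call such as `pso(lambda x: float(np.sum(x ** 2)), dim=2)` returns the best position found and its objective value. Because the personal and global bests are only replaced on improvement, the returned value never exceeds the best value in the initial swarm.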
In the above optimizations, we give more importance to the electricity rate than
the cost of the battery system, i.e., m1 in the cost function is large. On the other hand,
the discharging rate and depth are also important for the battery system to be kept
“alive” for as long as possible. Hence, we enlarge the parameters m2 and r in the
cost function. Let m2 = 1, r = 1, and let m1 be unchanged. The iterative Q-function
is shown in Fig. 12.15. The optimal battery control is shown in Fig. 12.16, and the
battery energy under the new cost function is shown in Fig. 12.17a. Enlarging $m_2$ and $r$, we can see that the value of the iterative Q-function increases. The battery output power is reduced, and the battery energy is closer to $E_b^o$, which extends the lifetime of the battery. However, the total cost of one week is 2955.35 cents, which means the cost saving is reduced.

Fig. 12.15 The trajectory of the iterative Q-function

Fig. 12.16 Optimal control of the battery under $m_1 = 1$, $m_2 = 1$ and $r = 1$

Fig. 12.17 Batteries' energy. a New cost function with $m_1 = 1$, $m_2 = 1$ and $r = 1$. b Battery I. c Battery II
On the other hand, the battery model is important to the optimal control law of the
battery. To illustrate the effectiveness of the present algorithm, different elements of
the battery will be considered.
For convenience of analysis, we let $m_1 = 1$, $m_2 = 0.2$, $r = 0.1$. First, let the efficiency of battery charging/discharging be $\eta(P_{bt}) = 0.698 - 0.173|P_{bt}|/P_{rate}$ and let the capacity of the battery be 80 kWh. Define the battery as Battery I. Implementing the dual iterative Q-learning algorithm with Battery I, the trajectory of $Q_i^j(x_0, \bar{u})$ is shown in Fig. 12.18. We can see that the iterative Q-function is also convergent to the optimum after $i = 20$ iterations and the values of the Q-functions are larger than the ones in Fig. 12.12, which indicates that the optimization ability decreases. The optimal control trajectory for Battery I is shown in Fig. 12.19. The battery energy of Battery I is shown in Fig. 12.17b, and the total cost in one week is 2914.70 cents.

Fig. 12.18 The trajectory of the iterative Q-function

Fig. 12.19 Optimal control of the battery in one week
Next, we keep on reducing the performance of the battery. Let the capacity of
the battery decrease to 60 kWh. Let the rated power output of the battery and the
maximum charging/discharging rate be 12 kW. Define the battery as Battery II. The
optimal control trajectory for Battery II is shown in Fig. 12.19. The battery energy
of Battery II is shown in Fig. 12.17c, and the total cost in one week is 3027.17 cents.
From the numerical results, we can see that for different battery models, the
present dual iterative Q-learning algorithm will guarantee the iterative value function
to converge to the optimum and obtain the optimal battery control law. We can also
see that as the performance of the battery decreases, the optimization ability of the
battery also decreases.
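The saving percentages in Table 12.1 follow from the weekly totals by simple arithmetic, and the same formula can be applied to the other weekly costs reported in this section. The short sketch below shows the calculation. The helper name `saving_percent` and the dictionary labels are introduced here for illustration only; the numbers are the ones quoted in the text and in Table 12.1.

```python
ORIGINAL_COST = 4124.13  # weekly cost without a battery system (cents), Table 12.1

WEEKLY_COSTS = {
    "PSO": 3029.96,
    "TBQL": 2866.64,
    "Dual iterative Q-learning": 2797.86,
    "Dual iterative Q-learning, m2 = 1, r = 1": 2955.35,
    "Battery I": 2914.70,
    "Battery II": 3027.17,
}

def saving_percent(cost, baseline=ORIGINAL_COST):
    """Relative saving with respect to the no-battery cost, in percent."""
    return 100.0 * (baseline - cost) / baseline

for name, cost in WEEKLY_COSTS.items():
    print(f"{name}: {saving_percent(cost):.2f} %")
```

The first three entries reproduce the "Saving (%)" row of Table 12.1; the remaining entries put the weaker battery configurations on the same scale.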

12.4 Multi-battery Optimal Coordination Control for Residential Energy Systems

In this section, the multi-battery home energy management system will be described and the optimization objective of the multi-battery coordination control will be introduced [43].
The optimal multi-battery control problem is treated as a discrete-time problem with the time step of 1 h, and it is assumed that the load demand varies hourly. The schematic diagram of the smart home energy system is described in Fig. 12.20, which is composed of the power grid, the load demand, the multi-battery system (including $N$ batteries, $N \in \mathbb{Z}^+$, and sine wave inverters), and the power management unit (controller). Given the real-time home load and electricity rate, our goal is to find the optimal coordination control laws for the $N$ batteries which minimize the total expense of the power from the grid.

Fig. 12.20 Smart home energy system
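The models of batteries are taken from [22, 24, 44]. The battery characteristics, battery hardware, battery software, and inverter/rectifier model of the battery were presented in [44], and the derivation of the battery model was presented in [24]. Define $\mathcal{N}_\varsigma$ as $\mathcal{N}_\varsigma = \{\varsigma : \varsigma \in \mathbb{Z}^+, \varsigma \le N\}$. Let $E_{b\varsigma,t}$ be the energy of battery $\varsigma$ at time $t$, and let $\eta_\varsigma(\cdot)$ be the charging/discharging efficiency of battery $\varsigma$, $\forall \varsigma \in \mathcal{N}_\varsigma$. Then, the model of battery $\varsigma$ can be expressed as

$$E_{b\varsigma,t+1} = E_{b\varsigma,t} - P_{b\varsigma,t} \times \eta(P_{b\varsigma,t}),$$

where $P_{b\varsigma,t}$ is the power output of battery $\varsigma$ at time $t$. Let $P_{b\varsigma,t} > 0$ denote battery $\varsigma$ discharging, $\forall \varsigma \in \mathcal{N}_\varsigma$. Let $P_{b\varsigma,t} < 0$ denote battery $\varsigma$ charging, and let $P_{b\varsigma,t} = 0$ denote battery $\varsigma$ idle. Let the efficiency of battery charging/discharging be derived as

$$\eta(P_{b\varsigma,t}) = \bar{\eta}_\varsigma - 0.173\,|P_{b\varsigma,t}|/P_\varsigma^{rate},$$

where $P_\varsigma^{rate} > 0$ is the rated power output of battery $\varsigma$ and $\bar{\eta}_\varsigma$ is the maximum efficiency constant such that $\bar{\eta}_\varsigma \le 0.898$. The storage limit is defined as $E_{b\varsigma}^{\min} \le E_{b\varsigma,t} \le E_{b\varsigma}^{\max}$, where $E_{b\varsigma}^{\min}$ and $E_{b\varsigma}^{\max}$ are the minimum and maximum storage energies of battery $\varsigma$, respectively. The corresponding charging and discharging power limits are defined as $P_{b\varsigma}^{\min} \le P_{b\varsigma,t} \le P_{b\varsigma}^{\max}$, where $P_{b\varsigma}^{\min}$ and $P_{b\varsigma}^{\max}$ are the minimum and maximum charging/discharging powers of battery $\varsigma$, respectively.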
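The battery model above is simple enough to state as a one-step update. The following sketch is a minimal illustration rather than the authors' code: the function names `efficiency` and `battery_step` are hypothetical, and raising an error when a limit is violated is just one possible way to handle the constraints. The numerical values in the usage note are borrowed from the single-battery example of Sect. 12.3.4.

```python
def efficiency(p_b, p_rate, eta_bar=0.898):
    """Charging/discharging efficiency eta(P_b) = eta_bar - 0.173 |P_b| / P_rate."""
    return eta_bar - 0.173 * abs(p_b) / p_rate

def battery_step(e_b, p_b, p_rate, e_min, e_max, p_min, p_max, eta_bar=0.898):
    """One-hour update E_{t+1} = E_t - P_b * eta(P_b); P_b > 0 means discharging."""
    if not (p_min <= p_b <= p_max):
        raise ValueError("power outside the charging/discharging limits")
    e_next = e_b - p_b * efficiency(p_b, p_rate, eta_bar)
    if not (e_min <= e_next <= e_max):
        raise ValueError("stored energy outside the storage limits")
    return e_next
```

With a 16 kW rated battery at 60 kWh and storage limits of 20 and 80 kWh, `battery_step(60.0, 16.0, 16.0, 20.0, 80.0, -16.0, 16.0)` discharges at the rated power with efficiency 0.898 - 0.173 = 0.725, so the stored energy drops by 11.6 kWh.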
Based on the models of batteries, the optimization objectives can be formulated.
Let Ct be the electricity rate at time t. Let PL,t be the power of the home load at time
t, and let Pg,t be the power from the power grid. For the convenience of analysis, we
assume that the home load PL,t and the electricity rate Ct are periodic functions with
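the period $\lambda = 24$ h. For $\varsigma \in \mathcal{N}_\varsigma$, with delays in $P_{L,t}$ and $P_{b\varsigma,t}$ for battery $\varsigma$, the load balance can be expressed as

$$P_{L,t-1} = \sum_{\varsigma=1}^{N} P_{b\varsigma,t-1} + P_{g,t}.$$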

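In this section, power flow from the batteries to the grid is not permitted, i.e., we define $P_{g,t} \ge 0$, to guarantee the power quality of the grid. To extend the lifetime of the batteries, for $\varsigma \in \mathcal{N}_\varsigma$, we desire that the stored energy of battery $\varsigma$ is near the middle of the storage limit $E_{b\varsigma}^{o} = \frac{1}{2}(E_{b\varsigma}^{\min} + E_{b\varsigma}^{\max})$. Define $\Delta E_{b\varsigma,t}$ as $\Delta E_{b\varsigma,t} = E_{b\varsigma,t} - E_{b\varsigma}^{o}$. Let $\alpha$, $\beta_\varsigma$, $r_\varsigma$, $\varsigma \in \mathcal{N}_\varsigma$, be given positive constants, and let $0 < \gamma < 1$ be the discount factor. Then, the total cost function to be minimized can be defined as

$$\sum_{t=0}^{\infty} \gamma^t \left( \alpha (C_t P_{g,t})^2 + \sum_{\varsigma=1}^{N} \left( \beta_\varsigma \Delta E_{b\varsigma,t}^2 + r_\varsigma P_{b\varsigma,t}^2 \right) \right). \tag{12.4.1}$$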

The physical meaning of the first term of the cost function is to minimize the total cost from the grid. The second term aims to guarantee the stored energy of batteries to be close to the middle of the storage limit, which avoids fully charging/discharging of the batteries. The third term is to prevent large charging/discharging power of the batteries. Hence, the second and third terms aim to extend the lifetime of the batteries. Let $x_{1,t} = P_{g,t}$ and $x_{2\varsigma,t} = \Delta E_{b\varsigma,t}$, $\varsigma \in \mathcal{N}_\varsigma$, be the system states. Let $u_{\varsigma,t} = P_{b\varsigma,t}$ be the control input. Let $M_2 = \mathrm{diag}\{\beta_\varsigma\}$ and $M_t = \mathrm{diag}\{\alpha C_t^2, M_2\}$. Let $x_t = [x_{1,t}, x_{21,t}, \ldots, x_{2N,t}]^T$ be the state vector and $u_t = [u_{1,t}, \ldots, u_{N,t}]^T$ be the control vector. The home energy management system is defined as

$$x_{t+1} = F(x_t, u_t, t) = \begin{bmatrix} P_{L,t} - \sum_{\varsigma=1}^{N} u_{\varsigma,t} \\ x_{21,t} - \eta(u_{1,t})u_{1,t} \\ \vdots \\ x_{2N,t} - \eta(u_{N,t})u_{N,t} \end{bmatrix}. \tag{12.4.2}$$

Let $x_0$ be the initial state. Then, the cost function (12.4.1) can be written as $J(x_0, \underline{u}_0, 0) = \sum_{t=0}^{\infty} \gamma^t U(x_t, u_t, t)$, where $U(x_t, u_t, t) = x_t^T M_t x_t + u_t^T R u_t$, $R = \mathrm{diag}\{r_\varsigma\}$, and $\underline{u}_t = (u_t, u_{t+1}, \ldots)$ denotes the control sequence from $t$ to $\infty$. The optimal cost function can be defined as $J^*(x_t, t) = \inf_{\underline{u}_t} \{J(x_t, \underline{u}_t, t)\}$. According to Bellman's principle of optimality [7], we can obtain the following Bellman equation

$$J^*(x_t, t) = \min_{u_t} \left\{ U(x_t, u_t, t) + \gamma J^*(x_{t+1}, t+1) \right\}. \tag{12.4.3}$$

12.4.1 Distributed Iterative ADP Algorithm

In this section, a novel distributed iterative ADP algorithm is developed to solve the optimal multi-battery coordination control problem for home energy management systems. Convergence analysis results will be developed. From (12.4.2), the system state $x_t$ is an $(N+1)$-dimensional vector and the control $u_t$ is an $N$-dimensional vector. Generally speaking, $J^*(x_t, t)$ in (12.4.3) is a highly nonlinear and nonanalytic function. For the multi-battery home energy management system (12.4.2), $J^*(x_t, t)$ is so complex that obtaining it by directly solving the Bellman equation (12.4.3) is computationally intractable. To overcome these difficulties, system transformations are necessary. First, for $t = 0, 1, \ldots$, there exist $\rho = 0, 1, \ldots$ and $\theta = 0, 1, \ldots, 23$ such that $t = \rho\lambda + \theta$. Let $k = \rho\lambda$. Then, $P_{L,t} = P_{L,k+\theta} = P_{L,\theta}$ and $C_t = C_{k+\theta} = C_\theta$, respectively. Let $\mathcal{U}_k$ denote the control sequence for $N$ batteries in 24 h, i.e., $\mathcal{U}_k = [u_k, u_{k+1}, \ldots, u_{k+\lambda-1}]^T \in \mathbb{R}^{\lambda \times N}$. Similar to (12.3.4), define a new utility function as

$$\Upsilon(x_k, \mathcal{U}_k) = \sum_{\theta=0}^{\lambda-1} \gamma^\theta U(x_{k+\theta}, u_{k+\theta}, \theta).$$

Then, for all $k \in \{0, \lambda, 2\lambda, \ldots\}$, the Bellman equation (12.4.3) can be expressed as

$$J^*(x_k) = \min_{\mathcal{U}_k} \left\{ \Upsilon(x_k, \mathcal{U}_k) + \tilde{\gamma} J^*(x_{k+\lambda}) \right\}, \tag{12.4.4}$$

and the optimal control law sequence can be expressed by

$$\mathcal{U}^*(x_k) = \arg\min_{\mathcal{U}_k} \left\{ \Upsilon(x_k, \mathcal{U}_k) + \tilde{\gamma} J^*(x_{k+\lambda}) \right\},$$

where $\tilde{\gamma} = \gamma^{24}$. Note that $J^*(x_t, t)$ in (12.4.3) is time-varying. The function $J^*(x_k)$ in (12.4.4) is the same function, but it is time-invariant when the time step is $\lambda$, due to the periodic nature of the system parameters in (12.4.2). Therefore, the time dependency of $J^*(x_k)$ will be dropped when the context is clear.
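The time transformation can be made concrete with a few helper functions: splitting a time index into the start of its period and the hour within it, accumulating the discounted hourly utilities over one period, and computing the per-period discount factor. This is only a sketch; the names `split_time`, `aggregated_utility`, and `period_discount` are hypothetical, and the hourly utilities are assumed to be supplied as a plain list of numbers.

```python
def split_time(t, lam=24):
    """Write t = rho * lam + theta and return (k, theta) with k = rho * lam."""
    rho, theta = divmod(t, lam)
    return rho * lam, theta

def aggregated_utility(stage_costs, gamma):
    """Sum of gamma**theta * U_theta over the hours theta = 0, ..., lam - 1 of one period."""
    return sum(gamma ** theta * cost for theta, cost in enumerate(stage_costs))

def period_discount(gamma, lam=24):
    """Discount factor applied across one full period, i.e. gamma**lam."""
    return gamma ** lam
```

For instance, `split_time(50)` places hour 50 in the period starting at k = 48 with θ = 2, so the periodic load and rate at that hour equal those at hour 2 of the first day.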
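Next, system transformations are developed to derive a system with lower dimensions. Let $\bar{E}_b = \min_{\varsigma \in \mathcal{N}_\varsigma} \{E_{b\varsigma}^{\max} - E_{b\varsigma}^{\min}\}$ and $\underline{E}_b = 0$. Let $\bar{P}^{rate} = \min_{\varsigma \in \mathcal{N}_\varsigma} \{P_\varsigma^{rate}\}$ and $\underline{\eta}_\varsigma = \min_{\varsigma \in \mathcal{N}_\varsigma} \bar{\eta}_\varsigma$. Let $\bar{P}_b^{\min} = \max_{\varsigma \in \mathcal{N}_\varsigma} \{P_{b\varsigma}^{\min}\}$ and $P_b^{\max} = \min_{\varsigma \in \mathcal{N}_\varsigma} \{P_{b\varsigma}^{\max}\}$. Let

$$\eta(P_{b,t}) = \underline{\eta}_\varsigma - 0.173\,\frac{|P_{b,t}|}{\bar{P}^{rate}},$$

and let $P_{b,t}$ be the charging/discharging power such that $\bar{P}_b^{\min} \le P_{b,t} \le P_b^{\max}$. Here, we assume that $\bar{P}_b^{\min} < P_b^{\max}$ and $\eta(P_{b,t}) > 0$ to facilitate our analysis. According to the above notations and assumptions, we can construct a new battery, called Battery ℵ, which is expressed as

$$\Delta E_{b,t+1} = \Delta E_{b,t} - P_{b,t} \times \eta(P_{b,t}), \tag{12.4.5}$$

where $\Delta E_{b,t} = E_{b,t} - E_b^o$, $E_b^o = \frac{1}{2}(\bar{E}_b - \underline{E}_b)$, and $E_{b,t}$ is the energy of the new battery at time $t$. From (12.4.5), we can see that Battery ℵ has the worst performance of all the $N$ batteries (e.g., it is the "smallest" battery). If we replace all batteries $\varsigma$ by Battery ℵ, i.e., $x_{2\varsigma,t} = \Delta E_{b,t}$ and $u_{\varsigma,t} = P_{b,t}$, $\forall \varsigma \in \mathcal{N}_\varsigma$, then the home energy management system (12.4.2) can be simplified as

$$z_{t+1} = \mathcal{F}(z_t, v_t, t) = \begin{bmatrix} P_{L,t} - N v_t \\ x_{2,t} - \eta(v_t) v_t \end{bmatrix}, \tag{12.4.6}$$

where $z_t = [x_{1,t}, x_{2,t}]^T$, $x_{2,t} = \Delta E_{b,t}$, and $v_t = P_{b,t}$. The cost function can be rewritten as

$$\sum_{t=0}^{\infty} \gamma^t \left( z_t^T \bar{M}_t z_t + v_t^2 \sum_{\varsigma=1}^{N} r_\varsigma \right), \tag{12.4.7}$$

where $\bar{M}_t = \mathrm{diag}\left\{ \alpha C_t^2, \sum_{\varsigma=1}^{N} \beta_\varsigma \right\}$. Next, we aim to design an optimal control law for Battery ℵ.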
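The construction of Battery ℵ amounts to taking the most restrictive value of each parameter across the N batteries. A minimal sketch follows, assuming a simple `Battery` record whose field names are chosen here for illustration; it is not taken from the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Battery:
    e_min: float    # minimum storage energy (kWh)
    e_max: float    # maximum storage energy (kWh)
    p_rate: float   # rated power output (kW)
    eta_bar: float  # maximum efficiency constant
    p_min: float    # lower charging/discharging power limit (kW)
    p_max: float    # upper charging/discharging power limit (kW)

def worst_case_battery(batteries):
    """Combine the most restrictive parameters of all batteries into one battery."""
    e_bar = min(b.e_max - b.e_min for b in batteries)  # smallest usable storage range
    aleph = Battery(
        e_min=0.0,
        e_max=e_bar,
        p_rate=min(b.p_rate for b in batteries),
        eta_bar=min(b.eta_bar for b in batteries),
        p_min=max(b.p_min for b in batteries),         # tightest lower power limit
        p_max=min(b.p_max for b in batteries),         # tightest upper power limit
    )
    if not aleph.p_min < aleph.p_max:
        raise ValueError("the power limits of the batteries do not overlap")
    return aleph
```

Any control that is feasible for the battery returned by `worst_case_battery` respects the power limits of every individual battery, which is what allows its optimal law to serve as a common starting point for all N batteries.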
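Define $\mathcal{U}_k = [v_k, v_{k+1}, \ldots, v_{k+\lambda-1}]^T$, which is the control sequence for Battery ℵ in 24 h. According to (12.4.6), there exists a function $\tilde{F}$ such that $z_{k+\lambda} = \tilde{F}(z_k, \mathcal{U}_k)$. Let $\bar{U}(z_t, v_t, t) = z_t^T \bar{M}_t z_t + \bar{r} v_t^2$, where $\bar{r} = \sum_{\varsigma=1}^{N} r_\varsigma$. We can define another new utility function as $\Gamma(z_k, \mathcal{U}_k) = \sum_{\theta=0}^{\lambda-1} \gamma^\theta \bar{U}(z_{k+\theta}, v_{k+\theta}, \theta)$, $\forall k \in \{0, \lambda, 2\lambda, \ldots\}$. Then, the Bellman equation (12.4.4) can be expressed as

$$J^o(z_k) = \min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} J^o(z_{k+\lambda}) \right\}, \tag{12.4.8}$$

and the optimal control law sequence can be expressed by

$$\mathcal{U}^o(z_k) = \arg\min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} J^o(z_{k+\lambda}) \right\}.$$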

In this section, an iterative ADP algorithm will be developed to obtain the optimal control laws for Battery ℵ. Let $i = 0, 1, \ldots$ be the iteration index. Let $\Psi(z_k)$ be an arbitrary positive-semidefinite function and choose the initial value function $V^0(z_k) = \Psi(z_k)$. We obtain

$$\mathcal{A}^0(z_k) = \arg\min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} V^0(z_{k+\lambda}) \right\}. \tag{12.4.9}$$

For $i = 1, 2, \ldots$, the iterative value function and the iterative control law sequence will be obtained by

$$\begin{aligned}
V^i(z_k) &= \min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} V^{i-1}(z_{k+\lambda}) \right\} \\
&= \Gamma(z_k, \mathcal{A}^{i-1}(z_k)) + \tilde{\gamma} V^{i-1}(\tilde{F}(z_k, \mathcal{A}^{i-1}(z_k))),
\end{aligned} \tag{12.4.10}$$

and

$$\mathcal{A}^i(z_k) = \arg\min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} V^i(z_{k+\lambda}) \right\}. \tag{12.4.11}$$

For a given $V^i(z_k)$, as $\lambda$ is finite, according to the principle of dynamic programming [7], the iterative value function $V^i(z_k)$ can be updated in (12.4.10) and the iterative control law sequence

$$\mathcal{A}^i(z_k) = [v^i(z_k), \ldots, v^i(z_{k+\lambda-1})]^T$$

can be obtained in (12.4.11).
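The structure of the iteration (12.4.9)–(12.4.11) is that of value iteration started from a positive-semidefinite function. The sketch below shows that structure on a toy problem and is not the battery problem itself: it assumes a scalar state on a grid, a finite set of scalar controls, one control per step rather than a 24 h control sequence, and the initial function Ψ = 0. The names `value_iteration`, `step`, and `cost` are hypothetical.

```python
import numpy as np

def value_iteration(states, controls, step, cost, gamma, max_iters=500, tol=1e-8):
    """Iterate V^i(z) = min_v { cost(z, v) + gamma * V^{i-1}(step(z, v)) } on a grid."""
    states = np.asarray(states, dtype=float)
    controls = np.asarray(controls, dtype=float)
    # Index of the grid point nearest to each successor state (toy discretization).
    nxt = np.array([[int(np.abs(states - step(z, v)).argmin()) for v in controls]
                    for z in states])
    c = np.array([[cost(z, v) for v in controls] for z in states])
    V = np.zeros(len(states))                    # V^0 = Psi = 0
    for i in range(1, max_iters + 1):
        Q = c + gamma * V[nxt]                   # right-hand side for every (z, v)
        V_new = Q.min(axis=1)                    # V^i
        policy = controls[Q.argmin(axis=1)]      # minimizer with respect to V^{i-1}
        done = np.max(np.abs(V_new - V)) <= tol  # stop when the update is negligible
        V = V_new
        if done:
            break
    return V, policy, i
```

A call such as `value_iteration(np.linspace(-5, 5, 41), [-1, -0.5, 0, 0.5, 1], lambda z, v: z - 0.9 * v, lambda z, v: z * z + 0.1 * v * v, gamma=0.9)` returns the converged value function on the grid, the greedy control at each grid point, and the number of iterations used.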



Theorem 12.4.1 If for $i = 0, 1, \ldots$, $V^i(z_k)$ and $\mathcal{A}^i(z_k)$ are obtained by (12.4.9)–(12.4.11), then $V^i(z_k)$ will converge to the solution of Bellman equation (12.4.8), i.e.,

$$\lim_{i\to\infty} V^i(z_k) = J^o(z_k).$$

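Proof For functions $J^o(z_k)$, $\Gamma(z_k, \mathcal{U}_k)$, and $V^0(z_k)$, inspired by [26], there exist constants $\zeta$, $\underline{\delta}$, and $\bar{\delta}$ such that $\tilde{\gamma} J^o(z_{k+\lambda}) \le \zeta \Gamma(z_k, \mathcal{U}_k)$ and $\underline{\delta} J^o(z_k) \le V^0(z_k) \le \bar{\delta} J^o(z_k)$, respectively, where $0 < \zeta < \infty$ and $0 \le \underline{\delta} \le 1 \le \bar{\delta} < \infty$. We will show that for $i = 0, 1, \ldots$, the iterative value function $V^i(z_k)$ satisfies

$$\left(1 + \frac{\underline{\delta}-1}{(1+\zeta^{-1})^{i}}\right) J^o(z_k) \le V^i(z_k) \le \left(1 + \frac{\bar{\delta}-1}{(1+\zeta^{-1})^{i}}\right) J^o(z_k). \tag{12.4.12}$$

Inequality (12.4.12) can be proven by mathematical induction. For $i = 0$, it reduces to the assumption on $V^0(z_k)$. Let $i = 1$. We have

$$\begin{aligned}
V^1(z_k) &= \min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} V^0(z_{k+\lambda}) \right\} \\
&\ge \min_{\mathcal{U}_k} \left\{ \left(1 + \zeta\frac{\underline{\delta}-1}{1+\zeta}\right) \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} \left(\underline{\delta} - \frac{\underline{\delta}-1}{1+\zeta}\right) J^o(z_{k+\lambda}) \right\} \\
&= \left(1 + \frac{\underline{\delta}-1}{1+\zeta^{-1}}\right) J^o(z_k).
\end{aligned} \tag{12.4.13}$$

Thus, the lower bound of (12.4.12) holds for $i = 1$. Assume that it holds for $i = l - 1$, $l = 1, 2, \ldots$. Then, for $i = l$, we have

$$\begin{aligned}
V^l(z_k) &= \min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} V^{l-1}(z_{k+\lambda}) \right\} \\
&\ge \min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} \left(1 + \frac{\underline{\delta}-1}{(1+\zeta^{-1})^{l-1}}\right) J^o(z_{k+\lambda}) + \frac{\underline{\delta}-1}{(1+\zeta)(1+\zeta^{-1})^{l-1}} \left( \zeta \Gamma(z_k, \mathcal{U}_k) - \tilde{\gamma} J^o(z_{k+\lambda}) \right) \right\} \\
&= \left(1 + \frac{\underline{\delta}-1}{(1+\zeta^{-1})^{l}}\right) \min_{\mathcal{U}_k} \left\{ \Gamma(z_k, \mathcal{U}_k) + \tilde{\gamma} J^o(z_{k+\lambda}) \right\} \\
&= \left(1 + \frac{\underline{\delta}-1}{(1+\zeta^{-1})^{l}}\right) J^o(z_k).
\end{aligned} \tag{12.4.14}$$

Based on the mathematical induction (12.4.13) and (12.4.14), we obtain the lower bound of (12.4.12). The upper bound of (12.4.12) follows by the same steps with $\underline{\delta}$ replaced by $\bar{\delta}$ and the inequalities reversed. On the other hand, if $\underline{\delta} \ge 1$ and $\bar{\delta} \le 1$, we can let $\underline{\delta} = 1$ and $\bar{\delta} = 1$, where we can see that (12.4.12) can also be verified. Hence, according to (12.4.12), we can obtain

$$\lim_{i\to\infty} V^i(z_k) = J^o(z_k).$$

The proof is complete.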
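The bounds in (12.4.12) squeeze the iterative value function toward the optimum at a geometric rate governed by ζ. A few lines of Python make this visible by evaluating the two multiplicative factors for increasing i. The function name `bound_factors` and the sample constants in the usage note are chosen only for illustration.

```python
def bound_factors(delta_low, delta_high, zeta, i):
    """Lower and upper factors multiplying J^o in (12.4.12) at iteration i."""
    shrink = (1.0 + 1.0 / zeta) ** i   # (1 + zeta^{-1})^i
    lower = 1.0 + (delta_low - 1.0) / shrink
    upper = 1.0 + (delta_high - 1.0) / shrink
    return lower, upper
```

With, say, `delta_low = 0`, `delta_high = 3`, and `zeta = 1`, the gap between the two factors is 3 at i = 0 and is halved at every iteration, so both factors approach 1. A larger ζ makes the shrink factor closer to 1 and the guaranteed convergence slower.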



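Considering the worst-performance optimization for Battery ℵ as the initial condition, a new distributed iterative ADP algorithm will be developed. Convergence properties will be established to guarantee that the iterative value function converges to the optimum. Let $\mu_{\varsigma,k} = [u_{\varsigma,k}, \ldots, u_{\varsigma,k+\lambda-1}]^T$. Define $\bar{\mathcal{U}}_\varsigma \in \mathbb{R}^{\lambda \times (N-1)}$ as $\bar{\mathcal{U}}_\varsigma = [(\mu_{\bar{\varsigma},k} : \bar{\varsigma} \in \mathcal{N}_\varsigma, \bar{\varsigma} \ne \varsigma)]$, i.e., $[\mu_{1,k}, \mu_{2,k}, \ldots, \mu_{N,k}]$ with $\mu_{\varsigma,k}$ removed, $\forall \varsigma \in \mathcal{N}_\varsigma$. Let $\ell = 0, 1, \ldots$ increase from 0 to $\infty$. Let $\{\vartheta_\ell\}$ be a sequence such that $\vartheta_\ell \in \mathcal{N}_\varsigma$, $\forall \ell = 0, 1, \ldots$. According to (12.4.2), define a function $\bar{F}$ that satisfies $x_{k+\lambda} = \bar{F}(x_k, \mathcal{U}_{\vartheta_\ell})$. According to (12.4.7) and (12.4.8), for $k = 0, \lambda, 2\lambda, \ldots$, we construct a cost function $J^o(x_k)$ such that $J^o(x_k) = J^o(z_k)$. Let $\mu_\varsigma^o(x_k) = \mathcal{U}^o(z_k)$, $\forall \varsigma \in \mathcal{N}_\varsigma$, and let $\mathcal{U}_{\vartheta_0}(x_k) = [\mu_1^o(x_k), \ldots, \mu_N^o(x_k)]$.
Now for $\ell = 0$ and $\tau = 0$, let the initial value function be given by

$$V_{\vartheta_0}^{0}(x_k) = J^o(x_k). \tag{12.4.15}$$

We calculate

$$\mu_{\vartheta_0}^{0}(x_k) = \arg\min_{\mu_{\vartheta_0,k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_0}^{0}\left(\bar{F}(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)])\right) \right\}. \tag{12.4.16}$$

Note that $\bar{\mathcal{U}}_{\vartheta_0}(x_k)$ is $\mathcal{U}_{\vartheta_0}(x_k)$ with its $\vartheta_0$th column removed. For $\ell = 0$, $\vartheta_0 \in \mathcal{N}_\varsigma$ and $\tau = 1, 2, \ldots$, the distributed iterative ADP algorithm will proceed between

$$V_{\vartheta_0}^{\tau}(x_k) = \Upsilon\left(x_k, [\mu_{\vartheta_0}^{\tau-1}(x_k), \bar{\mathcal{U}}_{\vartheta_0}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_0}^{\tau-1}\left(\bar{F}(x_k, [\mu_{\vartheta_0}^{\tau-1}(x_k), \bar{\mathcal{U}}_{\vartheta_0}(x_k)])\right), \tag{12.4.17}$$

and

$$\mu_{\vartheta_0}^{\tau}(x_k) = \arg\min_{\mu_{\vartheta_0,k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_0}^{\tau}\left(\bar{F}(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)])\right) \right\}. \tag{12.4.18}$$

For $\tau \to \infty$, let $V_{\vartheta_0}^{\infty}(x_k) = \lim_{\tau\to\infty} V_{\vartheta_0}^{\tau}(x_k)$ and $\mu_{\vartheta_0}^{\infty}(x_k) = \lim_{\tau\to\infty} \mu_{\vartheta_0}^{\tau}(x_k)$. Let $\mu_\varsigma^{1}(x_k) = \mu_\varsigma^{0}(x_k)$ for $\varsigma \ne \vartheta_0$ and $\mu_\varsigma^{1}(x_k) = \mu_\varsigma^{\infty}(x_k)$ for $\varsigma = \vartheta_0$. In (12.4.17) and (12.4.18), we update only the 24 h control sequence of the battery indicated by $\vartheta_0$, while the control sequences for all other batteries remain unchanged.
Then, for all $\vartheta_\ell$, $\ell = 1, 2, \ldots$, let $V_{\vartheta_\ell}^{0}(x_k) = V_{\vartheta_{\ell-1}}^{\infty}(x_k)$ and $\mathcal{U}_{\vartheta_\ell}(x_k) = [\mu_1^{\ell}(x_k), \ldots, \mu_N^{\ell}(x_k)]$. We calculate

$$\mu_{\vartheta_\ell}^{0}(x_k) = \arg\min_{\mu_{\vartheta_\ell,k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_\ell,k}, \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_\ell}^{0}\left(\bar{F}(x_k, [\mu_{\vartheta_\ell,k}, \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)])\right) \right\}. \tag{12.4.19}$$

For $\vartheta_\ell \in \mathcal{N}_\varsigma$ and $\tau = 1, 2, \ldots$, the distributed iterative ADP algorithm will proceed between

$$V_{\vartheta_\ell}^{\tau}(x_k) = \Upsilon\left(x_k, [\mu_{\vartheta_\ell}^{\tau-1}(x_k), \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_\ell}^{\tau-1}\left(\bar{F}(x_k, [\mu_{\vartheta_\ell}^{\tau-1}(x_k), \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)])\right), \tag{12.4.20}$$

and

$$\mu_{\vartheta_\ell}^{\tau}(x_k) = \arg\min_{\mu_{\vartheta_\ell,k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_\ell,k}, \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_\ell}^{\tau}\left(\bar{F}(x_k, [\mu_{\vartheta_\ell,k}, \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)])\right) \right\}. \tag{12.4.21}$$

For $\tau \to \infty$, let $V_{\vartheta_\ell}^{\infty}(x_k) = \lim_{\tau\to\infty} V_{\vartheta_\ell}^{\tau}(x_k)$ and $\mu_{\vartheta_\ell}^{\infty}(x_k) = \lim_{\tau\to\infty} \mu_{\vartheta_\ell}^{\tau}(x_k)$. Let $\mu_\varsigma^{\ell+1}(x_k) = \mu_\varsigma^{\ell}(x_k)$ for $\varsigma \ne \vartheta_\ell$ and $\mu_\varsigma^{\ell+1}(x_k) = \mu_\varsigma^{\infty}(x_k)$ for $\varsigma = \vartheta_\ell$.
As all the $N$ batteries are operated by the same initial control law, the initial control variable is $\mathcal{U}_{\vartheta_0}(x_k)$, composed of $N$ identical columns (one 24 h control sequence per battery). In (12.4.15)–(12.4.21), however, it is emphasized that as the control law of battery $\vartheta_\ell$, $\ell = 0, 1, \ldots$, is updated, the control laws of the other batteries are kept unchanged. Then, the control variable becomes $[\mu_{\vartheta_\ell}^{\tau-1}(x_k), \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)]$ and $\bar{F}$ in (12.4.15)–(12.4.21) varies according to $\vartheta_\ell$. To avoid cumbersome notation, we did not write this explicitly, i.e., $\bar{F}_{\vartheta_\ell}$ would be more appropriate than $\bar{F}$. When the iteration index $\tau$ increases, it is desired that the iterative value function is convergent by updating the control law of battery $\vartheta_\ell$, $\ell = 0, 1, \ldots$.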
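The idea of updating one battery at a time while the others are frozen can be illustrated on a much simpler static problem. The sketch below is a toy analogue and not the distributed iterative ADP algorithm: it minimizes a one-step quadratic cost with a grid term and a power penalty per battery by exact coordinate-wise minimization in round-robin order. The cost form, the closed-form coordinate update, and the names `joint_cost` and `coordinate_sweeps` are assumptions made only for this example.

```python
def joint_cost(u, load, alpha, r):
    """Toy one-step cost: alpha * (grid power)^2 + sum_i r_i * u_i^2."""
    grid = load - sum(u)
    return alpha * grid ** 2 + sum(ri * ui ** 2 for ri, ui in zip(r, u))

def coordinate_sweeps(u0, load, alpha, r, sweeps=20):
    """Update one battery per step in round-robin order, keeping the others fixed."""
    u = list(u0)
    history = [joint_cost(u, load, alpha, r)]
    for step in range(sweeps * len(u)):
        i = step % len(u)                   # battery selected at this step
        others = sum(u) - u[i]              # total power of the frozen batteries
        # Exact minimizer of the joint cost with respect to u[i] alone.
        u[i] = alpha * (load - others) / (alpha + r[i])
        history.append(joint_cost(u, load, alpha, r))
    return u, history
```

Because each step minimizes the joint cost exactly over a single coordinate, the recorded cost never increases, and the round-robin order visits every battery again and again. These two features, monotone decrease under single-battery updates and every battery being selected infinitely often, are the toy counterparts of the properties the analysis in this section relies on.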

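Theorem 12.4.2 (Local convergence property) For $\ell = 0, 1, \ldots$ and $\tau = 0, 1, \ldots$, let $V_{\vartheta_\ell}^{\tau}(x_k)$ and $\mu_{\vartheta_\ell}^{\tau}(x_k)$ be obtained by (12.4.15)–(12.4.21). Then, for $\ell = 0, 1, \ldots$, the iterative value function $V_{\vartheta_\ell}^{\tau}(x_k)$ is convergent as $\tau \to \infty$, i.e.,

$$V_{\vartheta_\ell}^{\infty}(x_k) = \lim_{\tau\to\infty} V_{\vartheta_\ell}^{\tau}(x_k),$$

where $V_{\vartheta_\ell}^{\infty}(x_k)$ satisfies the following Bellman equation

$$\begin{aligned}
V_{\vartheta_\ell}^{\infty}(x_k) &= \min_{\mu_{\vartheta_\ell,k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_\ell,k}, \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_\ell}^{\infty}\left(\bar{F}(x_k, [\mu_{\vartheta_\ell,k}, \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)])\right) \right\} \\
&= \Upsilon\left(x_k, [\mu_{\vartheta_\ell}^{\infty}(x_k), \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_\ell}^{\infty}\left(\bar{F}(x_k, [\mu_{\vartheta_\ell}^{\infty}(x_k), \bar{\mathcal{U}}_{\vartheta_\ell}(x_k)])\right).
\end{aligned} \tag{12.4.22}$$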

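Proof The statement can be proven by mathematical induction. Considering the situation of $\ell = 0$, we will prove that for $\tau = 0, 1, \ldots$, the inequality

$$V_{\vartheta_0}^{\tau+1}(x_k) \le V_{\vartheta_0}^{\tau}(x_k) \tag{12.4.23}$$

holds. First, for $\tau = 0$, according to (12.4.15) and (12.4.16), we can get

$$\begin{aligned}
V_{\vartheta_0}^{1}(x_k) &= \Upsilon\left(x_k, [\mu_{\vartheta_0}^{0}(x_k), \bar{\mathcal{U}}_{\vartheta_0}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_0}^{0}\left(\bar{F}(x_k, [\mu_{\vartheta_0}^{0}(x_k), \bar{\mathcal{U}}_{\vartheta_0}(x_k)])\right) \\
&= \min_{\mu_{\vartheta_0,k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_0}^{0}\left(\bar{F}(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)])\right) \right\} \\
&\le \Upsilon\left(x_k, [\mu_{\vartheta_0}^{o}(x_k), \bar{\mathcal{U}}_{\vartheta_0}(x_k)]\right) + \tilde{\gamma} J^o\left(\bar{F}(x_k, [\mu_{\vartheta_0}^{o}(x_k), \bar{\mathcal{U}}_{\vartheta_0}(x_k)])\right) \\
&= \Gamma(z_k, \mathcal{U}^o(z_k)) + \tilde{\gamma} J^o(\tilde{F}(z_k, \mathcal{U}^o(z_k))) \\
&= V_{\vartheta_0}^{0}(x_k).
\end{aligned} \tag{12.4.24}$$

Assume that (12.4.23) holds for $\tau = \bar{\tau} - 1$, $\bar{\tau} = 1, 2, \ldots$, i.e., $V_{\vartheta_0}^{\bar{\tau}}(x_k) \le V_{\vartheta_0}^{\bar{\tau}-1}(x_k)$. Then, for $\tau = \bar{\tau}$, we can obtain

$$\begin{aligned}
V_{\vartheta_0}^{\bar{\tau}+1}(x_k) &= \min_{\mu_{\vartheta_0,k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_0}^{\bar{\tau}}\left(\bar{F}(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)])\right) \right\} \\
&\le \min_{\mu_{\vartheta_0,k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_0}^{\bar{\tau}-1}\left(\bar{F}(x_k, [\mu_{\vartheta_0,k}, \bar{\mathcal{U}}_{\vartheta_0}(x_k)])\right) \right\} \\
&= V_{\vartheta_0}^{\bar{\tau}}(x_k).
\end{aligned} \tag{12.4.25}$$

Hence, for $\ell = 0$, inequality (12.4.23) holds for $\tau = 0, 1, \ldots$. As the utility function $\Upsilon(x_k, \mathcal{U}_k) \ge 0$ and $J^o(x_k) \ge 0$, we have $V_{\vartheta_0}^{\tau}(x_k) \ge 0$, i.e., the sequence is bounded from below. Thus, the iterative value function is convergent as $\tau \to \infty$. According to (12.4.17) and (12.4.18), letting $\tau \to \infty$, we can obtain (12.4.22) for $\ell = 0$. For $\ell = \bar{\ell} - 1$, $\bar{\ell} = 1, 2, \ldots$, we can obtain

$$V_{\vartheta_{\bar{\ell}-1}}^{\infty}(x_k) = \min_{\mu_{\vartheta_{\bar{\ell}-1},k}} \left\{ \Upsilon\left(x_k, [\mu_{\vartheta_{\bar{\ell}-1},k}, \bar{\mathcal{U}}_{\vartheta_{\bar{\ell}-1}}(x_k)]\right) + \tilde{\gamma} V_{\vartheta_{\bar{\ell}-1}}^{\infty}\left(\bar{F}(x_k, [\mu_{\vartheta_{\bar{\ell}-1},k}, \bar{\mathcal{U}}_{\vartheta_{\bar{\ell}-1}}(x_k)])\right) \right\}. \tag{12.4.26}$$

Then, for $\ell = \bar{\ell}$, we have

$$V_{\vartheta_{\bar{\ell}}}^{0}(x_k) = V_{\vartheta_{\bar{\ell}-1}}^{\infty}(x_k),$$

where $V_{\vartheta_{\bar{\ell}-1}}^{\infty}(x_k)$ satisfies (12.4.26). According to the steps in (12.4.24) and (12.4.25), we can obtain

$$V_{\vartheta_{\bar{\ell}}}^{\tau+1}(x_k) \le V_{\vartheta_{\bar{\ell}}}^{\tau}(x_k),$$

which shows that for $\ell = \bar{\ell}$, the iterative value function $V_{\vartheta_{\bar{\ell}}}^{\tau}(x_k)$ is convergent as $\tau \to \infty$. According to (12.4.20) and (12.4.21), letting $\tau \to \infty$, we can obtain (12.4.22). This completes the proof of the theorem.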
Theorem 12.4.2 shows that for $\ell = 0, 1, \ldots$, the limit

$$\lim_{\tau\to\infty} V_{\vartheta_\ell}^{\tau}(x_k) = V_{\vartheta_\ell}^{\infty}(x_k)$$

can be defined and $V_{\vartheta_\ell}^{\tau}(x_k)$ converges to the solution of the Bellman equation (12.4.22) as $\tau \to \infty$. However, the optimization by a single battery cannot guarantee the iterative value function to converge to the solution of the Bellman equation (12.4.4). Hence, a global convergence analysis will be needed.
Before the next theorem, we define some notations. Let $T_\varsigma = \{\ell : \vartheta_\ell = \varsigma\}$ and let $\pi_\varsigma$ be the number of elements in $T_\varsigma$.
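Theorem 12.4.3 (Global convergence property) For $\ell = 0, 1, \ldots$ and $\tau = 0, 1, \ldots$, let $V_{\vartheta_\ell}^{\tau}(x_k)$ and $\mu_{\vartheta_\ell}^{\tau}(x_k)$ be obtained by (12.4.15)–(12.4.21). If the sequence $\{\vartheta_\ell\}$ satisfies
(1) for all $\ell = 0, 1, \ldots$, $\vartheta_\ell \in \mathcal{N}_\varsigma$,
(2) for all $\varsigma \in \mathcal{N}_\varsigma$, $\pi_\varsigma \to \infty$,
then for all $\tau = 0, 1, \ldots$, the iterative value function $V_{\vartheta_\ell}^{\tau}(x_k)$ converges to the optimum, as $\ell \to \infty$, i.e.,

$$\lim_{\ell\to\infty} V_{\vartheta_\ell}^{\tau}(x_k) = J^*(x_k).$$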

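Proof The statement can be proven in four steps.

(1) Show that for $\tau = 0, 1, \ldots$, $V_{\vartheta_\ell}^{\tau}(x_k)$ is convergent as $\ell \to \infty$.
For $\ell = 0, 1, \ldots$ and $\tau = 0, 1, \ldots$, define a sequence of iterative value functions
as

\[ \{V_{\vartheta_\ell}^{\tau}(x_k)\} \triangleq \{V_{\vartheta_0}^{0}(x_k), V_{\vartheta_0}^{1}(x_k), \ldots, V_{\vartheta_0}^{\infty}(x_k), V_{\vartheta_1}^{0}(x_k), V_{\vartheta_1}^{1}(x_k), \ldots, V_{\vartheta_1}^{\infty}(x_k), \ldots\}. \]

Let $\sigma_1, \sigma_2, \ldots, \sigma_{\pi_\varsigma}$ be the elements in $T_\varsigma$ such that $\sigma_1 < \sigma_2 < \cdots < \sigma_{\pi_\varsigma}$, i.e.,
$T_\varsigma = \{\sigma_\ell : \ell = 1, 2, \ldots, \pi_\varsigma\}$, where $\pi_\varsigma \to \infty$. For $\tilde\tau = 0, 1, \ldots$, we define

\[ \{V_{\sigma_\ell}^{\tilde\tau}(x_k)\} \triangleq \{V_{\sigma_1}^{\tilde\tau}(x_k), V_{\sigma_2}^{\tilde\tau}(x_k), \ldots, V_{\sigma_\ell}^{\tilde\tau}(x_k), \ldots\}, \tag{12.4.27} \]

then $\{V_{\sigma_\ell}^{\tilde\tau}(x_k)\}$ is a subsequence of $\{V_{\vartheta_\ell}^{\tau}(x_k)\}$. According to (12.4.22) and (12.4.23),

\[ V_{\vartheta_0}^{0}(x_k) \ge V_{\vartheta_0}^{1}(x_k) \ge \cdots \ge V_{\vartheta_0}^{\infty}(x_k) \ge V_{\vartheta_1}^{0}(x_k) \ge V_{\vartheta_1}^{1}(x_k) \ge \cdots \ge V_{\vartheta_1}^{\infty}(x_k) \ge \cdots, \]

which means $\{V_{\vartheta_\ell}^{\tau}(x_k)\}$ is a monotonically nonincreasing and convergent sequence.
For $\tilde\tau = 0, 1, \ldots$, we can also get $V_{\sigma_1}^{\tilde\tau}(x_k) \ge V_{\sigma_2}^{\tilde\tau}(x_k) \ge \cdots$, which shows that
(12.4.27) is a monotonically nonincreasing and convergent sequence. Since a
sequence can only converge to at most one point [5], the sequence $\{V_{\vartheta_\ell}^{\tau}(x_k)\}$ and
its subsequence $\{V_{\sigma_\ell}^{\tilde\tau}(x_k)\}$ possess the same limit, i.e.,

\[ \lim_{\ell\to\infty} V_{\vartheta_\ell}^{\tau}(x_k) = \lim_{\ell\to\infty} V_{\sigma_\ell}^{\tilde\tau}(x_k), \quad \forall x_k. \]

(2) Show that the limit of the iterative value function $V_{\vartheta_\ell}^{\tau}(x_k)$ satisfies the Bellman
equation, as $\ell \to \infty$.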
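For all $\tau = 0, 1, \ldots$, we define

\[ V_\infty(x_k) = \lim_{\ell\to\infty} V_{\sigma_\ell}^{\tau}(x_k). \tag{12.4.28} \]

According to the definition of $T_\varsigma$, for $\tau = 0, 1, \ldots$, the iterative value function
$V_{\sigma_\ell}^{\tau}(x_k)$ is updated by battery $\varsigma$. As $\ell \to \infty$, the control law of battery $\varsigma$ is updated
infinitely many times. Thus, for $\ell \to \infty$, the control law of battery $\varsigma$ can be defined as

\[ \mu_\varsigma^{\tau}(x_k) = \arg\min_{\mu_{\varsigma,k}} \big\{ \Upsilon(x_k, [\mu_{\varsigma,k}, \bar U_\varsigma(x_k)]) + \tilde\gamma V_\infty\big(\bar F(x_k, [\mu_{\varsigma,k}, \bar U_\varsigma(x_k)])\big) \big\}. \tag{12.4.29} \]

From (12.4.22) and (12.4.23), for $\ell = 0, 1, \ldots$, and for $\tau = 1, 2, \ldots$, we have

\[
\begin{aligned}
&\min_{\mu_{\sigma_{\ell-1},k}} \big\{ \Upsilon(x_k, [\mu_{\sigma_{\ell-1},k}, \bar U_{\sigma_{\ell-1}}(x_k)]) + \tilde\gamma V_{\sigma_{\ell-1}}^{\infty}\big(\bar F(x_k, [\mu_{\sigma_{\ell-1},k}, \bar U_{\sigma_{\ell-1}}(x_k)])\big) \big\} \\
&\quad = V_{\sigma_{\ell-1}}^{\infty}(x_k) \ge V_{\sigma_\ell}^{\tau+1}(x_k) \\
&\quad = \Upsilon(x_k, [\mu_{\sigma_\ell}^{\tau}(x_k), \bar U_{\sigma_\ell}(x_k)]) + \tilde\gamma V_{\sigma_\ell}^{\tau}\big(\bar F(x_k, [\mu_{\sigma_\ell}^{\tau}(x_k), \bar U_{\sigma_\ell}(x_k)])\big) \\
&\quad \ge V_{\sigma_{\ell+1}}^{\infty}(x_k) \\
&\quad = \min_{\mu_{\sigma_{\ell+1},k}} \big\{ \Upsilon(x_k, [\mu_{\sigma_{\ell+1},k}, \bar U_{\sigma_{\ell+1}}(x_k)]) + \tilde\gamma V_{\sigma_{\ell+1}}^{\infty}\big(\bar F(x_k, [\mu_{\sigma_{\ell+1},k}, \bar U_{\sigma_{\ell+1}}(x_k)])\big) \big\}.
\end{aligned}
\]

Let $\ell \to \infty$. According to (12.4.28) and (12.4.29), we can obtain

\[ V_\infty(x_k) = \min_{\mu_{\varsigma,k}} \big\{ \Upsilon(x_k, [\mu_{\varsigma,k}, \bar U_\varsigma(x_k)]) + \tilde\gamma V_\infty\big(\bar F(x_k, [\mu_{\varsigma,k}, \bar U_\varsigma(x_k)])\big) \big\}. \]

For every $\varsigma \in N_\varsigma$, we have $\pi_\varsigma \to \infty$ and

\[
\begin{aligned}
V_\infty(x_k) &= \min_{\mu_{1,k}} \big\{ \Upsilon(x_k, [\mu_{1,k}, \bar U_1(x_k)]) + \tilde\gamma V_\infty\big(\bar F(x_k, [\mu_{1,k}, \bar U_1(x_k)])\big) \big\}, \\
&\;\;\vdots \\
V_\infty(x_k) &= \min_{\mu_{N,k}} \big\{ \Upsilon(x_k, [\mu_{N,k}, \bar U_N(x_k)]) + \tilde\gamma V_\infty\big(\bar F(x_k, [\mu_{N,k}, \bar U_N(x_k)])\big) \big\},
\end{aligned}
\]

which means

\[ V_\infty(x_k) = \min_{U_k} \big\{ \Upsilon(x_k, U_k) + \tilde\gamma V_\infty\big(\bar F(x_k, U_k)\big) \big\}. \]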

Next, let $\tilde U(x_k)$ be an arbitrary admissible control law sequence [2, 27] for the N
batteries. Define a new value function $P(x_k)$, which satisfies

\[ P(x_k) = \Upsilon(x_k, \tilde U(x_k)) + \tilde\gamma P(\bar F(x_k, \tilde U(x_k))). \tag{12.4.30} \]

Then, we can proceed to the third step of the proof.

(3) Show that for an arbitrary admissible control law $\tilde U(x_k)$, the value function
$P(x_k) \ge V_\infty(x_k)$.
Following similar steps as in part (2) of the proof of Theorem 4.2.2, we can
show that for all $x_k$, $k = 0, 1, \ldots$, the inequality

\[ P(x_k) \ge V_\infty(x_k) \tag{12.4.31} \]

holds.
(4) Show that the value function $V_\infty(x_k)$ equals the optimal cost function $J^{*}(x_k)$.
According to the definition of $J^{*}(x_k)$ in (12.4.4), for $\ell = 0, 1, \ldots$, we have

\[ V_{\vartheta_\ell}^{\tau}(x_k) \ge J^{*}(x_k). \]

Fig. 12.21 Electricity rate and home load demand. a Typical electricity rate for nonsummer seasons
(rate in cents/kWh versus time in hours). b Typical home load demand (load in kW versus time in hours)

Fig. 12.22 The trajectory of the iterative value function (plotted against the battery sequence
ℵ, I, II, III, IV, I, ... and the iteration steps)

Fig. 12.23 Optimal usum of the batteries in one week (power in kW versus time in hours; curves:
Rate/3, Load, usum for Battery ℵ, Optimal usum)

Fig. 12.24 Optimal control of the batteries in one week (control in kW versus time in hours).
a Battery I. b Battery II. c Battery III. d Battery IV

Let $\ell \to \infty$. We can obtain $V_\infty(x_k) \ge J^{*}(x_k)$. On the other hand, for an arbitrary
admissible control law $\tilde U(x_k)$, (12.4.31) holds. Let $\tilde U(x_k) = U^{*}(x_k)$, where $U^{*}(x_k)$ is
an optimal control law. Then, we can get $V_\infty(x_k) \le J^{*}(x_k)$. Hence, we can obtain
$\lim_{\ell\to\infty} V_{\vartheta_\ell}^{\tau}(x_k) = J^{*}(x_k)$. The proof is complete.

Based on the above preparations, we now summarize the distributed iterative
ADP algorithm for the multi-battery home energy management systems in
Algorithm 12.4.1. We can see that there are two iteration blocks in Algorithm 12.4.1. In
Block 1, iterations are implemented to obtain the optimal control for Battery ℵ. In
Block 2, the optimal multi-battery coordination control is obtained.
Fig. 12.25 The energy of batteries in one week (stored energy in kWh versus time in hours).
a Battery I. b Battery II. c Battery III. d Battery IV

Table 12.2 Cost comparison

                     Original   Battery ℵ   DIADP    TBQL     PSO
Total cost (cents)   683.10     590.35      499.86   548.33   556.71
Saving (%)           -          13.57       26.82    19.73    18.50
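
The Saving (%) row of Table 12.2 can be cross-checked against the Total cost row. The short sketch below assumes that the saving is measured relative to the no-battery ("Original") cost, a definition that is not spelled out in this passage; the names `TOTAL_COST`, `TABLE_SAVING`, and `saving_percent` are illustrative only.

```python
# Weekly total costs (cents) and savings (%) copied from Table 12.2.
TOTAL_COST = {
    "Original": 683.10,
    "Battery aleph": 590.35,
    "DIADP": 499.86,
    "TBQL": 548.33,
    "PSO": 556.71,
}
TABLE_SAVING = {
    "Battery aleph": 13.57,
    "DIADP": 26.82,
    "TBQL": 19.73,
    "PSO": 18.50,
}


def saving_percent(original_cost, cost):
    """Saving in percent relative to the no-battery cost (assumed definition)."""
    return 100.0 * (original_cost - cost) / original_cost


# Recompute the saving of every scheme that uses batteries.
recomputed = {
    name: saving_percent(TOTAL_COST["Original"], cost)
    for name, cost in TOTAL_COST.items()
    if name != "Original"
}

for name, value in recomputed.items():
    print(f"{name:14s} table: {TABLE_SAVING[name]:6.2f}   recomputed: {value:6.2f}")
```

Under this assumed definition, the recomputed values agree with the tabulated ones to within 0.01 percentage points.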

12.4.2 Numerical Analysis

In this section, the performance of the present distributed iterative ADP algorithm is
examined by numerical experiments. The real-time electricity rate profile for nonsummer
seasons is taken from ComEd Company [14], and the home load demand is taken from the
NAHB Research Report [35]. The trajectories of the electricity rate and the home load
demand are shown in Fig. 12.21a and b, respectively.
Fig. 12.26 Real-time cost comparison (real-time cost in cents versus time in hours; curves:
Original, TBQL, DIADP, Battery ℵ, PSO)

We assume that there are four batteries in the home energy management system,
denoted by Battery I, ..., Battery IV. Define the capacities of the batteries as
[8, 6, 5, 3] kWh. Let the upper and lower storage limits of the batteries, i.e.,
$E_{b\varsigma}^{\max}$ and $E_{b\varsigma}^{\min}$, $\varsigma$ = I, ..., IV, be given by [6.2, 5.2, 3.9, 2.4] kWh
and [1.8, 1.6, 1.4, 0.6] kWh, respectively. Let the charging and discharging power
limits, i.e., $P_{b\varsigma}^{\min}$ and $P_{b\varsigma}^{\max}$, be given by [−1, −0.8, −0.6, −0.4] kW and
[1, 0.8, 0.6, 0.4] kW, respectively. Let the rated power output of all batteries be 0.8 kW
and let the maximum efficiency constants be [0.898, 0.798, 0.698, 0.598]. Let the cost
function be expressed as in (12.4.1), where for $\varsigma$ = I, ..., IV, we set $\alpha = 1$,
$\beta_\varsigma = 0.2$, $r_\varsigma = 0.1$, and $\gamma = 0.995$. Let the initial function be
$\Psi(z_k, u_k) = [z_k^T, u_k^T] P [z_k^T, u_k^T]^T$, where $P = I$ is the identity matrix with a
suitable dimension. Choose the computation precision $\varepsilon = 10^{-3}$. First, transform all
four batteries into Battery ℵ, with the parameters $\bar E_b = 1.8$ kWh, $\underline{E}_b = 0$ kWh,
$\bar P^{\rm rate} = 0.8$ kW, $\bar P_b^{\min} = -0.4$ kW, and $P_b^{\max} = 0.4$ kW.
After data normalization, the iterative ADP algorithm (12.4.9)–(12.4.11) is run for
15 iterations. The resulting iterative value function is shown in Fig. 12.22, where we can
see that it converges to its optimum. Let $u_{\rm sum} = \sum_{i=1}^{4} u_i$ be the sum of controls.
The optimal control $u_{\rm sum}$ based on Battery ℵ for 168 h (one week) is shown in Fig. 12.23.
Fig. 12.27 Typical electricity rate in summer (rate in cents/kWh versus time in hours)

Fig. 12.28 The trajectory of the iterative value function in summer (plotted against the battery
sequence ℵ, I, II, III, IV, I, ... and the iteration steps)

Fig. 12.29 Optimal usum of the batteries for one week in summer (power in kW versus time in hours;
curves: Rate/3, Load, usum for Battery ℵ, Optimal usum)

Based on the optimal control law of Battery ℵ, the distributed iterative ADP
algorithm is implemented. Let the optimization sequence of the batteries be chosen
as {I, II, ..., IV, I, II, ..., IV, ...}. The trajectory of the iterative value function is
shown in Fig. 12.22, where we can see that the iterative value function is monotonically
nonincreasing and converges to the optimum. The optimal sum control $u_{\rm sum}$
for the four batteries is shown in Fig. 12.23, where the charging/discharging power
is clearly larger than that of Battery ℵ. The optimal control laws for Batteries
I, ..., IV are shown in Fig. 12.24, and the energies of the batteries are displayed in
Fig. 12.25.
Next, the time-based Q-learning (TBQL) algorithm [22] and the particle swarm
optimization (PSO) algorithm [17] will be compared to illustrate the superiority of our
ADP algorithm. For $t = 0, 1, \ldots$, the goal of the TBQL algorithm [22] is to design
an iterative control that satisfies the optimality equation

\[ Q(x_{t-1}, u_{t-1}) = U(x_t, u_t) + \gamma Q(x_t, u_t). \]

Three-layer backpropagation (BP) neural networks are implemented to approximate
the Q-function. The detailed neural network implementation of TBQL can be
found in [8, 22, 38] and is omitted here. For the PSO algorithm [17], let $G = 100$
be the swarm size. The position of each particle at time $t$ is represented by $x_{\ell t}$,
$\ell = 1, 2, \ldots, G$, and its movement by the velocity vector $\nu_{\ell t}$. Then, the updating
rule of PSO can be expressed as
Fig. 12.30 Optimal control of the batteries for one week in summer (control in kW versus time in
hours). a Battery I. b Battery II. c Battery III. d Battery IV

\[
\begin{aligned}
x_{\ell t} &= x_{\ell(t-1)} + \nu_{\ell t}, \\
\nu_{\ell t} &= \omega \nu_{\ell(t-1)} + \varphi_1 \rho_1^T (p_\ell - x_{\ell(t-1)}) + \varphi_2 \rho_2^T (p_g - x_{\ell(t-1)}).
\end{aligned}
\]

Let the inertia factor be $\omega = 0.7$. Let the correction factors be $\rho_1 = \rho_2 = 1$. Let $\varphi_1$ and
$\varphi_2$ be random numbers in [0, 1]. Let $p_\ell$ be the best position of particle $\ell$, and let $p_g$
be the global best position. Implement the TBQL for 500 time steps, and implement
the PSO algorithm for 500 iterations. The comparison of the total cost for 168 h
is displayed in Table 12.2, where we can see that the minimum cost is obtained by
our distributed iterative ADP algorithm. Let the real-time cost function be $R_{ct} = C_t P_{gt}$.
The corresponding real-time cost functions are shown in Fig. 12.26, where
the term "original" denotes "no battery system" and "DIADP" denotes "distributed
iterative ADP algorithm." According to the numerical comparisons, the superiority
of our distributed iterative ADP algorithm can be verified.
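
The PSO updating rule above can be illustrated with a minimal sketch. It is not the implementation used for the comparison: the objective (a simple quadratic), the search box, the scalar treatment of the correction factors, and all names such as `pso_minimize` are assumptions made for illustration. Only the inertia factor 0.7, unit correction factors, and random numbers drawn from [0, 1] follow the text.

```python
import random


def pso_minimize(cost, dim, swarm_size=100, iters=500, omega=0.7,
                 rho1=1.0, rho2=1.0, lo=-1.0, hi=1.0, seed=0):
    """Minimal particle swarm optimizer following the update rule in the text.

    velocity: v <- omega*v + phi1*rho1*(p_l - x) + phi2*rho2*(p_g - x)
    position: x <- x + v
    The correction factors rho1, rho2 are treated as scalars (assumption).
    """
    rng = random.Random(seed)
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(swarm_size)]
    v = [[0.0] * dim for _ in range(swarm_size)]
    p = [xi[:] for xi in x]                      # best position of each particle
    p_cost = [cost(xi) for xi in x]
    g = min(range(swarm_size), key=p_cost.__getitem__)
    pg, pg_cost = p[g][:], p_cost[g]             # global best position and cost
    history = [pg_cost]
    for _ in range(iters):
        for l in range(swarm_size):
            phi1, phi2 = rng.random(), rng.random()   # random numbers in [0, 1]
            for d in range(dim):
                v[l][d] = (omega * v[l][d]
                           + phi1 * rho1 * (p[l][d] - x[l][d])
                           + phi2 * rho2 * (pg[d] - x[l][d]))
                x[l][d] += v[l][d]
            c = cost(x[l])
            if c < p_cost[l]:                    # update the particle's own best
                p[l], p_cost[l] = x[l][:], c
                if c < pg_cost:                  # update the global best
                    pg, pg_cost = x[l][:], c
        history.append(pg_cost)
    return pg, pg_cost, history


def sphere(point):
    """Hypothetical test objective: sum of squares, minimized at the origin."""
    return sum(value * value for value in point)


best, best_cost, history = pso_minimize(sphere, dim=2, swarm_size=20, iters=50)
print("best cost found:", best_cost)
```

Because the global best is replaced only when a strictly better position is found, the recorded best cost never increases from one iteration to the next.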
Algorithm 12.4.1 The distributed iterative ADP algorithm

Initialization:
Collect an array of system data for the residential energy system (12.4.2).
Give a positive semidefinite function $\Psi(x_k)$.
Give the computation precision $\varepsilon > 0$.
Give a sequence $\{\vartheta_0, \vartheta_1, \ldots\}$, where $\vartheta_\ell \in N_\varsigma$, $\ell = 0, 1, \ldots$.
Iteration:
Step 1. Construct Battery ℵ by (12.4.5). Establish the home energy management system by (12.4.6)
and transform the cost function by (12.4.7).
Block 1
Step 2. For $i = 0$, define the initial value function $V^{0}(z_k) = \Psi(z_k)$. Calculate $A^{0}(z_k)$ by (12.4.9).
Step 3. Let $i = i + 1$. The iterative value function $V^{i}(z_k)$ is obtained by (12.4.10).
Step 4. Obtain the iterative control sequence $A^{i}(z_k)$ by (12.4.11).
Step 5. If $|V^{i}(z_k) - V^{i-1}(z_k)| \le \varepsilon$, then goto the next step; else, goto Step 3.
Step 6. Let $J^{o}(z_k) = V^{i}(z_k)$.
Block 2
Step 8. For $\ell = 0$, let $V_{\vartheta_0}^{0}(x_k) = J^{o}(z_k)$. Calculate $\mu_{\vartheta_0}^{0}(x_k)$ by (12.4.16). Let $\tau = 1$.
Step 9. Let the iterative value function $V_{\vartheta_0}^{\tau}(x_k)$ and the iterative control law $\mu_{\vartheta_0}^{\tau}(x_k)$ be obtained
by (12.4.17) and (12.4.18), respectively.
Step 10. If $|V_{\vartheta_0}^{\tau}(x_k) - V_{\vartheta_0}^{\tau-1}(x_k)| \le \varepsilon$, then let $V_{\vartheta_1}^{0}(x_k) = V_{\vartheta_0}^{\tau}(x_k)$, let $\ell = 1$, let $\tau = 1$, and goto
Step 11; else, let $\tau = \tau + 1$ and goto Step 9.
Step 11. Obtain the iterative value function $V_{\vartheta_\ell}^{\tau}(x_k)$ and the iterative control law $\mu_{\vartheta_\ell}^{\tau}(x_k)$ by
(12.4.20) and (12.4.21), respectively.
Step 12. If $|V_{\vartheta_\ell}^{\tau}(x_k) - V_{\vartheta_\ell}^{\tau-1}(x_k)| \le \varepsilon$, then let $V_{\vartheta_{\ell+1}}^{0}(x_k) = V_{\vartheta_\ell}^{\tau}(x_k)$ and goto the next step; else, let
$\tau = \tau + 1$ and goto Step 11.
Step 13. If $|V_{\vartheta_{\ell+1}}^{0}(x_k) - V_{\vartheta_\ell}^{0}(x_k)| \le \varepsilon$, then goto the next step; else, let $\tau = 1$, $\ell = \ell + 1$, and goto
Step 11.
Step 14. Construct the control law sequence as $U_{\vartheta_\ell}(x_k) = [\mu_1(x_k), \ldots, \mu_N(x_k)]$. Determine the
optimal control law sequence $U^{*}(x_k) = U_{\vartheta_\ell}(x_k)$ and the optimal value function $J^{*}(x_k) =
V_{\vartheta_{\ell+1}}^{0}(x_k)$.
Step 15. Return $U^{*}(x_k)$ and $J^{*}(x_k)$.
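
The two-loop structure of Block 2 can be illustrated on a toy tabular problem. The sketch below is only an analogue under stated assumptions: the update equations (12.4.15)–(12.4.21) are not reproduced in this passage, so the inner update is taken to be a value-iteration step in which one battery's control is minimized while the others are held fixed. The state space, dynamics, cost, and every name here are hypothetical; the initial value function is a constant upper bound rather than the Battery ℵ solution; and the stopping test waits for a full quiet cycle over all batteries, which differs from Step 13. No optimality guarantee is claimed for this toy.

```python
import itertools


def distributed_value_iteration(states, controls, n_batt, step, cost, gamma,
                                order, eps=1e-6, max_blocks=200, max_inner=10000):
    """Toy coordinate-wise value iteration: one battery is optimized per block."""
    joint = list(itertools.product(controls, repeat=n_batt))
    c_max = max(cost(x, u) for x in states for u in joint)
    # Constant upper bound on any discounted cost (assumption of this sketch).
    V = {x: c_max / (1.0 - gamma) + 1.0 for x in states}
    U = {x: [controls[0]] * n_batt for x in states}    # stored joint control law
    history = [dict(V)]
    quiet = 0                                          # consecutive blocks without change
    for block in range(max_blocks):
        b = order[block % len(order)]                  # battery optimized in this block
        V_start = dict(V)
        for _ in range(max_inner):                     # inner (tau) iteration
            V_new = {}
            for x in states:
                best_u, best_q = None, float("inf")
                for ub in controls:
                    u = list(U[x])
                    u[b] = ub                          # the other batteries stay fixed
                    q = cost(x, u) + gamma * V[step(x, u)]
                    if q < best_q:
                        best_u, best_q = ub, q
                V_new[x] = best_q
                U[x][b] = best_u
            delta = max(abs(V_new[x] - V[x]) for x in states)
            V = V_new
            if delta <= eps:
                break
        history.append(dict(V))
        change = max(abs(V[x] - V_start[x]) for x in states)
        quiet = quiet + 1 if change <= eps else 0
        if quiet >= n_batt:                            # a full cycle brought no change
            break
    return V, U, history


# Hypothetical toy problem: x is a shared storage level in {0, ..., 4};
# each of two batteries discharges (-1), idles (0), or charges (+1).
STATES = list(range(5))
CONTROLS = (0, -1, 1)


def toy_step(x, u):
    return min(max(x + sum(u), 0), 4)


def toy_cost(x, u):
    return (x - 2) ** 2 + 0.1 * sum(ui * ui for ui in u)


V, U, history = distributed_value_iteration(
    STATES, CONTROLS, n_batt=2, step=toy_step, cost=toy_cost,
    gamma=0.9, order=[0, 1])
print({x: round(v, 4) for x, v in V.items()})
```

Starting from an upper bound makes the stored value function nonincreasing from block to block, which mirrors the monotonicity property used in the proof of Theorem 12.4.3.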

It should be pointed out that if the residential environment, such as the electricity
rate, changes, the optimal multi-battery control changes correspondingly. ComEd
Company [14] points out that the electricity rate in summer differs from that in
other seasons; the summer rate is shown in Fig. 12.27. Let all the other parameters remain
unchanged, and implement the iterative ADP algorithm. The iterative value function
is shown in Fig. 12.28, where the nonincreasing monotonicity and optimality are
verified. The optimal control $u_{\rm sum}$ for 168 h in summer is shown in Fig. 12.29. The
corresponding optimal multi-battery coordination controls are shown in Fig. 12.30a–d,
respectively. Next, the TBQL and PSO algorithms are applied to the summer data.
The total costs for one week obtained by TBQL, PSO, and the present distributed
iterative ADP algorithm are shown in Table 12.3, where the superiority of our
distributed iterative ADP algorithm can be verified.
Table 12.3 Cost comparison

                     Original   Battery ℵ   DIADP    TBQL     PSO
Total cost (cents)   722.62     607.17      530.21   577.86   590.18
Saving (%)           -          15.98       26.63    20.03    18.33

12.5 Conclusions

Given the residential load and the real-time electricity rate, the objective of the opti-
mal control in this chapter is to find the optimal battery charging/discharging/idle
control law at each time step which minimizes the total expense of the power from
the grid while considering the battery limitations. First, the ADP scheme based on
ADHDP that is suitable for applications to residential energy system control and
management problem is introduced. Then, a new iterative ADP algorithm, called
dual iterative Q-learning algorithm, is developed to solve the optimal battery man-
agement and control problem in smart residential environments. The main idea of
the present dual iterative Q-learning algorithm is to update the iterative value func-
tion and iterative control laws by ADP technique according to the i-iteration and the
j-iteration, respectively. The convergence and optimality of the present algorithm
are established. Next, optimal multi-battery management problems in smart home
energy management systems are solved by iterative ADP algorithms. To obtain the
optimal coordination control law for multi-batteries, a new distributed iterative ADP
algorithm is developed, which avoids the increasing dimension of controls. Conver-
gence properties are developed to guarantee the optimality of the algorithm. Neural
networks are introduced to implement these ADP algorithms. Effectiveness of the
algorithms is verified by numerical results.

References

1. Aksoy S, Haralick RM (2001) Feature normalization and likelihood-based similarity measures
for image retrieval. Pattern Recognit Lett 22(5):563–582
2. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B:
Cybern 38(4):943–949
3. Amjadi Z, Williamson SS (2010) Power-electronics-based solutions for plug-in hybrid electric
vehicle energy storage and management systems. IEEE Trans Ind Electron 57(2):608–616
4. Angelis FD, Boaro M, Fuselli D, Squartini S, Piazza F, Wei Q (2013) Optimal home energy
management under dynamic electrical and thermal constraints. IEEE Trans Ind Inf 9(3):1518–1527
5. Apostol TM (1974) Mathematical analysis, 2nd edn. Addison-Wesley, Boston
6. Bakirtzis AG, Dokopoulos PS (1988) Short term generation scheduling in a small autonomous
system with unconventional energy system. IEEE Trans Power Syst 3(3):1230–1236
7. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton
8. Boaro M, Fuselli D, Angelis FD, Liu D, Wei Q, Piazza F (2013) Adaptive dynamic programming
algorithm for renewable energy scheduling and battery management. Cogn Comput 5(2):264–277
9. Chacra FA, Bastard P, Fleury G, Clavreul R (2005) Impact of energy storage costs on economical
performance in a distribution substation. IEEE Trans Power Syst 20:684–691
10. Chaouachi A, Kamel RM, Andoulsi R, Nagasaka K (2013) Multiobjective intelligent energy
management for a microgrid. IEEE Trans Ind Electron 60(4):1688–1699
11. Chen C, Duan S, Cai T, Liu B, Yin J (2009) Energy trading model for optimal microgrid
scheduling based on genetic algorithm. In: Proceedings of the IEEE International Power
Electronics and Motion Control Conference, pp 2136–2139
12. ComEd, Real time pricing in USA. [Online Available] http://www.thewattspot.com
13. Corrigan PM, Heydt GT (2007) Optimized dispatch of a residential solar energy system. In:
Proceedings of the North American power symposium. pp 4183–4188
14. Data of electricity rate from ComEd Company, USA. [Online Available]
https://rrtp.comed.com/live-prices
15. Dierks T, Jagannathan S (2012) Online optimal control of affine nonlinear discrete-time systems
with unknown internal dynamics by using time-based policy update. IEEE Trans Neural Netw
Learn Syst 23(7):1118–1129
16. Fung CC, Ho SCY, Nayar CV (1993) Optimisation of a hybrid energy system using simulated
annealing technique. In: Proceedings of the IEEE region 10 conference on computer,
communication, control and power engineering, pp 235–238
17. Fuselli D, Angelis FD, Boaro M, Liu D, Wei Q, Squartini S, Piazza F (2013) Action dependent
heuristic dynamic programming for home energy resource scheduling. Int J Electr Power
Energy Syst 48:148–160
18. Gudi N, Wang L, Devabhaktuni V, Depuru SSSR (2011) A demand-side management
simulation platform incorporating optimal management of distributed renewable resources. In:
Proceedings of the IEEE/PES power systems conference and exposition, pp 1–7
19. Guerrero JM, Loh PC, Lee TL, Chandorkar M (2013) Advanced control architectures for
intelligent microgrids-part II: power quality, energy storage, and AC/DC microgrids. IEEE
Trans Ind Electron 60(4):1263–1270
20. Hagan MT, Menhaj MB (1994) Training feedforward networks with the Marquardt algorithm.
IEEE Trans Neural Netw 5(6):989–993
21. Huang T, Liu D (2011) Residential energy system control and management using adaptive
dynamic programming. In: Proceedings of the IEEE international joint conference on neural
networks, pp 119–124
22. Huang T, Liu D (2013) A self-learning scheme for residential energy system control and
management. Neural Comput Appl 22(2):259–269
23. Jian L, Xue H, Xu G, Zhu X, Zhao D, Shao ZY (2013) Regulated charging of plug-in hybrid
electric vehicles for minimizing load variance in household smart microgrid. IEEE Trans Ind
Electron 60(8):3218–3226
24. Lee TY (2007) Operating schedule of battery energy storage system in a time-of-use rate
industrial user with wind turbine generators: A multipass iteration particle swarm optimization
approach. IEEE Trans Energy Convers 22(3):774–782
25. Lendaris GG, Paintz C (1997) Training strategies for critic and action neural networks in dual
heuristic programming method. In: Proceedings of the IEEE international conference on neural
networks, pp 712–717
26. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
27. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
28. Liu D, Xiong X, Zhang Y (2001) Action-dependent adaptive critic designs. In: Proceedings of
the international joint conference on neural networks, pp 990–995
29. Liu D, Zhang Y, Zhang H (2005) A self-learning call admission control scheme for CDMA
cellular networks. IEEE Trans Neural Netw 16(5):1219–1228
30. Lu B, Shahidehpour M (2005) Short-term scheduling of battery in a grid-connected PV/battery
system. IEEE Trans Power Syst 20(2):1053–1061
31. Lu X, Sun K, Guerrero JM, Vasquez JC, Huang L (2014) State-of-charge balance using adaptive
droop control for distributed energy storage systems in DC microgrid applications. IEEE Trans
Ind Electron 61(6):2804–2815
32. Maly DK, Kwan KS (1995) Optimal battery energy storage system (BESS) charge scheduling
with dynamic programming. IEE Proc Sci Meas Technol 142(6):453–458
33. Morais H, Kádár P, Faria P, Vale ZA, Khodr HM (2010) Optimal scheduling of a renewable
micro-grid in an isolated load area using mixed-integer linear programming. Renew Energy
35(1):151–156
34. Motapon SN, Dessaint LA, Al-Haddad K (2014) A comparative study of energy management
schemes for a fuel-cell hybrid emergency power system of more-electric aircraft. IEEE Trans
Ind Electron 61(3):1320–1334
35. NAHB Research Center (2014) Review of residential electrical energy use data. 400 Prince
George's Boulevard, Upper Marlboro, MD, USA, July 16, 2001. [Online Available]
http://www.toolbase.org/PDF/CaseStudies/Res-Electrical-EnergyUseData.pdf
36. Rahimi-Eichi H, Baronti F, Chow MY (2014) Online adaptive parameter identification and
state-of-charge coestimation for lithium-polymer battery cells. IEEE Trans Ind Electron
61(4):2053–2061
37. Riffonneau Y, Bacha S, Barruel F, Ploix S (2011) Optimal power flow management for grid
connected PV systems with batteries. IEEE Trans Sustain Energy 2(3):309–320
38. Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
39. Smolenski R, Bojarski J, Kempski A, Lezynski P (2014) Time-domain-based assessment of
data transmission error probability in smart grids with electromagnetic interference. IEEE
Trans Ind Electron 61(4):1882–1890
40. Watkins C (1989) Learning from delayed rewards. Ph.D. Thesis, Cambridge University,
Cambridge, England
41. Watkins C, Dayan P (1992) Q-learning. Mach Learn 8(3–4):279–292
42. Wei Q, Liu D, Shi G (2015) A novel dual iterative Q-learning method for optimal battery
management in smart residential environments. IEEE Trans Ind Electron 62(4):2509–2518
43. Wei Q, Liu D, Shi G, Liu Y (2015) Multibattery optimal coordination control for home energy
management systems via distributed iterative adaptive dynamic programming. IEEE Trans Ind
Electron 62(7):4203–4214
44. Yau T, Walker LN, Graham HL, Raithel R (1981) Effects of battery storage devices on power
system dispatch. IEEE Trans Power Appar Syst PAS-100(1):375–383
Chapter 13
Adaptive Dynamic Programming
for Optimal Control of Coal Gasification
Process

13.1 Introduction

Coal is the world's most abundant energy resource and the cheapest fossil fuel. The
development of coal gasification technologies, which are a primary component of the
carbon-based process industries, is of primary importance in dealing with the limited
petroleum reserves [12]. Hence, optimal control of the coal gasification process is a
key problem for carbon-based process industries. To describe the coal gasification
process, many studies focus on coal gasification modeling approaches [1, 13,
14, 19]. The established models are usually very complex and highly nonlinear.
To simplify the controller design, the traditional control method for the coal
gasification process is feedback linearization [6, 10, 18]. However, a controller
designed by feedback linearization is effective only in the neighborhood of the
equilibrium point. When the required operating range is large, the nonlinearities in
the system cannot be properly compensated by using a linear model. Therefore, it is
necessary to study an optimal control approach for the original nonlinear system
[4, 7–9, 17]. To the best of our knowledge, however, there are no discussions on
optimal controller design for nonlinear coal gasification processes. One of the
difficulties is the complexity of the coal gasification process, which leads to very
complex expressions for the optimal control law. Generally speaking, the optimal
control law cannot be expressed analytically. Another difficulty in obtaining the
optimal control law lies in the time-varying Bellman equation, which is usually too
difficult to solve analytically. Moreover, in real-world control systems of coal
gasification processes, the coal quality is unknown to the control system. This makes
it even more difficult to obtain the optimal control law of the coal gasification
system. To overcome these difficulties, an iterative ADP algorithm will be employed.
For the coal gasification process, an accurate system model is complex and cannot
be obtained in general. In each iteration of the iterative ADP algorithm, the
iterative control laws and the cost function cannot be obtained accurately
either. In this situation, approximation structures, such as neural networks, can be
used to approximate the system model, the iterative control law, and the iterative
value function, respectively. Hence, approximation errors between the approximated
functions and the expected ones necessarily exist, whatever approximation precision
is achieved. When the accurate system model, iterative control laws, and
the iterative value function cannot be obtained, the convergence properties of the
accurate iterative ADP algorithms may no longer hold. Until now, approximation errors
for the iterative control law and iterative value function in the iterative
ADP algorithm have been considered only in [11], where an accurate system model is
still required. To the best of our knowledge, there are no discussions on optimal
control schemes based on iterative ADP algorithms in which modeling errors of unknown
systems and iteration errors are both considered. In this chapter, an integrated
self-learning optimal control method for the coal gasification process using iterative
ADP is developed, where modeling errors and iteration errors are considered [16].

© Springer International Publishing AG 2017
D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_13

13.2 Data-Based Modeling and Properties

13.2.1 Description of Coal Gasification Process and Control Systems

In coal gasification, the coal water slurry (coal and water) is fed into the gasifier
together with oxygen. The coal gasification process in the gasifier
operates at a high temperature, and the outputs of the process include
synthesis gas and char. The diagram of the coal gasification process is given in Fig. 13.1.
Coal is composed of carbon (C), hydrogen (H), oxygen (O), and char
(Char), which is expressed by

\[ \mathrm{Coal} \to \theta_{k1}\,\mathrm{C} + \theta_{k2}\,\mathrm{H} + \theta_{k3}\,\mathrm{O} + \theta_{k4}\,\mathrm{Char}, \]

where $\sum_{i=1}^{4} \theta_{ki} = 1$ and $k = 0, 1, \ldots$ is the discrete time. Let
$\Theta_k = [\theta_{k1}, \theta_{k2}, \theta_{k3}, \theta_{k4}]^T$ denote the coal quality function. The coal
gasification reaction can be classified into two phases [13]. The first phase is the coal
combustion reaction, and the chemical equations are expressed by

\[ \mathrm{C} + \tfrac{1}{2}\mathrm{O_2} = \mathrm{CO} - 123.1\ \mathrm{kJ/mol}, \qquad \mathrm{CO} + \tfrac{1}{2}\mathrm{O_2} = \mathrm{CO_2} - 282.9\ \mathrm{kJ/mol}, \tag{13.2.1} \]

where CO is carbon monoxide and CO2 is carbon dioxide. The other phase is the water
gas shift reaction, which is reversible and mildly exothermic:

\[ \mathrm{CO} + \mathrm{H_2O} = \mathrm{CO_2} + \mathrm{H_2} - 42.3\ \mathrm{kJ/mol}, \tag{13.2.2} \]

where H2O is water.
Fig. 13.1 Flow diagram of coal gasification process

The coal combustion reaction is instantaneous and irreversible. The water gas
shift reaction is reversible and depends strongly on the reaction temperature.
Let $x_k$ be the reaction temperature and let $T_k$ denote the reaction equilibrium
coefficient. Then, we have the following empirical formula [13]

\[ T_k = \frac{n_{\mathrm{CO_2}} \cdot n_{\mathrm{H_2}}}{n_{\mathrm{CO}} \cdot n_{\mathrm{H_2O}}} = \left( \frac{202.362}{x_k - 635.52} \right)^{0.921914}, \tag{13.2.3} \]

where $n_{(\cdot)}$ denotes the molar quantity in the synthesis gas.


For the coal gasification process, it is pointed out that the reaction temperature is
the key parameter [5, 13]. Hence, in this chapter, an optimal control scheme will be
established to guarantee that the reaction temperature effectively tracks a desired one.
Let $P_k^{(\cdot)}$ denote the flow of the control input (kg/h) and $R_k^{(\cdot)}$ denote the flow of the
output (kg/h). Then, the control input vector is defined as

\[ u_k = [u_{1k}, u_{2k}, u_{3k}]^T = [P_k^{\mathrm{coal}}, P_k^{\mathrm{H_2O}}, P_k^{\mathrm{O_2}}]^T. \tag{13.2.4} \]

The system output vector is defined as

\[ y_k = [y_{1k}, y_{2k}, y_{3k}, y_{4k}, y_{5k}]^T = [R_k^{\mathrm{CO}}, R_k^{\mathrm{CO_2}}, R_k^{\mathrm{H_2}}, R_k^{\mathrm{H_2O}}, R_k^{\mathrm{Char}}]^T. \tag{13.2.5} \]
According to (13.2.1)–(13.2.5), the coal gasification control system can now be
expressed as

\[
\begin{aligned}
x_{k+1} &= F(x_k, u_k, \Theta_k), \\
y_k &= G(x_k, u_k, \Theta_k),
\end{aligned} \tag{13.2.6}
\]

where $F$ and $G$ are unknown system functions.
Let the desired state trajectory be $\eta$. Then, our goal is to design an optimal
state-feedback tracking control law $u_k^{*} = u^{*}(x_k)$, such that the system state tracks the
desired state trajectory. However, it is nearly impossible to obtain a direct optimal
tracking controller for system (13.2.6). First, the system functions $F$ and $G$ are
unknown nonlinear functions. Second, for the desired trajectory $\eta$, the corresponding
reference control is also difficult to obtain for the unknown system. Furthermore,
the coal quality $\Theta_k$ is also an unknown and uncontrollable parameter. Thus, novel
methods must be established to solve these problems.

13.2.2 Data-Based Process Modeling and Properties

In this section, three-layer backpropagation (BP) NNs are employed to approximate
the system (13.2.6). We also use NNs to solve for the reference control and to obtain the
coal quality. Let the number of hidden layer neurons be denoted by $L$. Let the weight
matrix between the input layer and hidden layer be denoted by $Y_f$. Let the weight
matrix between the hidden layer and output layer be denoted by $W_f$. Let the input
vector of the NN be denoted as $X$. Then, the output of the three-layer NN is represented
by

\[ \hat F_N(X, Y_f, W_f) = W_f \bar\sigma(Y_f X), \tag{13.2.7} \]

where $\bar\sigma(Y_f X) \in \mathbb{R}^{L}$ and

\[ [\bar\sigma(\zeta)]_i = \frac{\exp(\zeta_i) - \exp(-\zeta_i)}{\exp(\zeta_i) + \exp(-\zeta_i)}, \quad i = 1, \ldots, L, \]

is the activation function.
The NN estimation error can be expressed by

\[ F_N(X) = \hat F_N(X, Y_f^{*}, W_f^{*}) + \varepsilon(X), \]

where $Y_f^{*}$ and $W_f^{*}$ are the ideal weight parameters and $\varepsilon(X)$ is the reconstruction error.
For the convenience of analysis, only the output weights $W_f$ are updated during
the training, while the hidden weights $Y_f$ are kept fixed [3, 20]. Hence, the NN
function (13.2.7) simplifies to $\hat F_N(X, W_f) = W_f \sigma_f(X)$,
where $\sigma_f(X) = \bar\sigma(Y_f X)$.
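
The forward pass in (13.2.7) can be written out in a few lines. The sketch below is an illustration only: the function name `three_layer_nn`, the list-of-rows matrix layout, and the example weights are assumptions, and the activation is evaluated with `math.tanh`, which equals the exponential ratio given above.

```python
import math


def three_layer_nn(X, Y_f, W_f):
    """Evaluate W_f * sigma_bar(Y_f * X) for a three-layer network.

    X   : input vector (list of floats)
    Y_f : input-to-hidden weights, one row per hidden neuron (L rows)
    W_f : hidden-to-output weights, one row per output (each row has L entries)
    """
    # Hidden layer: tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)).
    hidden = [math.tanh(sum(y * x for y, x in zip(row, X))) for row in Y_f]
    # Output layer: linear combination of the hidden activations.
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_f]


# Hypothetical example: 2 inputs, L = 3 hidden neurons, 1 output.
Y_example = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
W_example = [[1.0, -1.0, 0.5]]
print(three_layer_nn([0.5, -1.0], Y_example, W_example))
```

With $Y_f$ held fixed, as in the text, the hidden activations play the role of $\sigma_f(X)$ and only the rows of `W_f` would be adapted.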
Next, using input-state-output data, two BP NNs are used to reconstruct the system (13.2.6). Let the numbers of hidden layer neurons be denoted by $L_{m1}$ and $L_{m2}$. Let the ideal weights be denoted by $W_{m1}^*$ and $W_{m2}^*$, respectively. According to the universal approximation property of NNs, the NN representation of the system (13.2.6) can be written as

\[
\begin{aligned}
x_{k+1} &= W_{m1}^{*T}\sigma_1(z_k) + \varepsilon_{m1k},\\
y_k &= W_{m2}^{*T}\sigma_2(z_k) + \varepsilon_{m2k},
\end{aligned}
\]

where $z_k = [x_k, u_k^T, \Theta_k^T]^T$ is the NN input.


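Let $\sigma_1(z_k) = \bar\sigma(Y_1 z_k)$ and $\sigma_2(z_k) = \bar\sigma(Y_2 z_k)$, where $Y_1$ and $Y_2$ are arbitrary matrices with suitable dimensions. Let $\max\{\|\sigma_1(\cdot)\|, \|\sigma_2(\cdot)\|\} \le \sigma_M$ for a constant $\sigma_M$. Let $\varepsilon_{mik}$ be the bounded NN reconstruction errors such that $\|\varepsilon_{mik}\| \le \varepsilon_M$, $i = 1, 2$, for a positive constant $\varepsilon_M$. The NN model for the system is constructed as

\[
\begin{aligned}
\hat x_{k+1} &= \hat W_{m1k}^{T}\sigma_1(z_k),\\
\hat y_k &= \hat W_{m2k}^{T}\sigma_2(z_k),
\end{aligned}
\tag{13.2.8}
\]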

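where $\hat x_k$ is the estimated system state vector and $\hat y_k$ is the estimated system output vector. Let $\hat W_{m1k}$ be the estimate of the ideal weight matrix $W_{m1}^*$ and let $\hat W_{m2k}$ be the estimate of the ideal weight matrix $W_{m2}^*$. Then, we define the system identification errors as

\[
\begin{aligned}
\tilde x_{k+1} &= \hat x_{k+1} - x_{k+1} = \tilde W_{m1k}^{T}\sigma_1(z_k) - \varepsilon_{m1k},\\
\tilde y_k &= \hat y_k - y_k = \tilde W_{m2k}^{T}\sigma_2(z_k) - \varepsilon_{m2k},
\end{aligned}
\tag{13.2.9}
\]

where $\tilde W_{m1k} = \hat W_{m1k} - W_{m1}^*$ and $\tilde W_{m2k} = \hat W_{m2k} - W_{m2}^*$. Let $\phi_{m1k} = \tilde W_{m1k}^{T}\sigma_1(z_k)$ and $\phi_{m2k} = \tilde W_{m2k}^{T}\sigma_2(z_k)$. Then, we can get

\[
\begin{aligned}
\tilde x_{k+1} &= \phi_{m1k} - \varepsilon_{m1k},\\
\tilde y_k &= \phi_{m2k} - \varepsilon_{m2k}.
\end{aligned}
\]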

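The weights are adjusted to minimize the following error

\[
E_{mk} = \frac{1}{2}\tilde x_{k+1}^2 + \frac{1}{2}\tilde y_k^{T}\tilde y_k.
\]

By a gradient-based adaptation rule, the weights are updated as

\[
\begin{aligned}
\hat W_{m1,k+1} &= \hat W_{m1k} - l_{m1}\sigma_1(z_k)\tilde x_{k+1},\\
\hat W_{m2,k+1} &= \hat W_{m2k} - l_{m2}\sigma_2(z_k)\tilde y_k^{T},
\end{aligned}
\tag{13.2.10}
\]

where $l_{m1} > 0$ and $l_{m2} > 0$ are learning rates.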
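The first update in (13.2.10) can be tried on synthetic data. The sketch below is an illustration only: it assumes a scalar state, a zero reconstruction error, a hypothetical "ideal" weight vector `W_star` that generates the data, and randomly drawn inputs. None of these come from the gasification plant. Under these assumptions the weight error norm cannot increase from one step to the next when the learning rate is below $1/(2\sigma_M^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 20                            # illustrative sizes (assumed)
Y1 = rng.uniform(-1.0, 1.0, (n_hidden, n_in))     # fixed hidden weights Y_1
W_star = rng.normal(size=n_hidden)                # hypothetical ideal weights W*_m1
W_hat = np.zeros(n_hidden)                        # estimate W_hat_m1k

sigma_M_sq = float(n_hidden)     # tanh lies in (-1, 1), so ||sigma_1||^2 <= n_hidden
l_m1 = 0.4 / sigma_M_sq          # chosen below 1 / (2 sigma_M^2)

weight_error = [np.linalg.norm(W_hat - W_star)]
for k in range(5000):
    z_k = rng.uniform(-1.0, 1.0, n_in)            # synthetic NN input z_k
    s = np.tanh(Y1 @ z_k)                         # sigma_1(z_k)
    x_next = W_star @ s                           # "measured" x_{k+1}, no reconstruction error
    x_tilde = W_hat @ s - x_next                  # identification error (estimate minus actual)
    W_hat = W_hat - l_m1 * s * x_tilde            # first line of (13.2.10)
    weight_error.append(np.linalg.norm(W_hat - W_star))
```

With a nonzero reconstruction error the same loop only drives the weights to a neighborhood of the ideal values unless a condition such as Assumption 13.2.1 holds.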


Before proceeding, the following assumption is necessary.

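Assumption 13.2.1 The NN approximation errors $\varepsilon_{m1k}$ and $\varepsilon_{m2k}$ are assumed to be upper bounded by a function of the estimation error such that

\[
\begin{aligned}
\varepsilon_{m1k}^2 &\le \lambda_{m1}\tilde x_k^2,\\
\phi_{m2k}^{T}\varepsilon_{m2k} &\le \lambda_{m2}\phi_{m2k}^{T}\phi_{m2k},
\end{aligned}
\]

where $0 < \lambda_{m1} < 1$ and $0 < \lambda_{m2} < 1$ are constants.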
Then, we have the following theorem.
Theorem 13.2.1 Let the identification scheme (13.2.8) be used to identify the non-
linear system (13.2.6), and let the NN weights be updated by (13.2.10). If Assump-
tion 13.2.1 holds, then the system identification error x̃k approaches zero asymptoti-
cally and the error matrices W̃m1k and W̃m2k both converge to zero, as k → ∞.

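Proof Consider the following Lyapunov function candidate defined as

\[
L(\tilde x_k, \tilde W_{m1k}, \tilde W_{m2k}) = \tilde x_k^2 + \frac{1}{l_{m1}}\mathrm{tr}\big\{\tilde W_{m1k}^{T}\tilde W_{m1k}\big\} + \frac{1}{l_{m2}}\mathrm{tr}\big\{\tilde W_{m2k}^{T}\tilde W_{m2k}\big\}.
\]

The difference of the Lyapunov function candidate is given by

\[
\begin{aligned}
\Delta L(\tilde x_k, \tilde W_{m1k}, \tilde W_{m2k}) ={}& \tilde x_{k+1}^2 - \tilde x_k^2\\
&+ \frac{1}{l_{m1}}\mathrm{tr}\big\{\tilde W_{m1,k+1}^{T}\tilde W_{m1,k+1} - \tilde W_{m1k}^{T}\tilde W_{m1k}\big\}\\
&+ \frac{1}{l_{m2}}\mathrm{tr}\big\{\tilde W_{m2,k+1}^{T}\tilde W_{m2,k+1} - \tilde W_{m2k}^{T}\tilde W_{m2k}\big\}.
\end{aligned}
\]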

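With the identification error dynamics (13.2.9) and the weight tuning rules of $\hat W_{m1,k+1}$ and $\hat W_{m2,k+1}$ in (13.2.10), we can obtain

\[
\begin{aligned}
\Delta L(\tilde x_k, \tilde W_{m1k}, \tilde W_{m2k}) ={}& \phi_{m1k}^2 - 2\phi_{m1k}\varepsilon_{m1k} + \varepsilon_{m1k}^2 + l_{m1}\sigma_1^{T}(z_k)\sigma_1(z_k)\tilde x_{k+1}^2 - \tilde x_k^2\\
&+ l_{m2}\sigma_2^{T}(z_k)\sigma_2(z_k)\tilde y_k^{T}\tilde y_k - 2\tilde y_k^{T}\tilde W_{m2k}^{T}\sigma_2(z_k) - 2\phi_{m1k}\tilde x_{k+1}.
\end{aligned}
\]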

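Applying the Cauchy–Schwarz inequality, and using $\tilde x_{k+1} = \phi_{m1k} - \varepsilon_{m1k}$ and $\tilde y_k = \phi_{m2k} - \varepsilon_{m2k}$, we get

\[
\begin{aligned}
\Delta L(\tilde x_k, \tilde W_{m1k}, \tilde W_{m2k}) \le{}& -\phi_{m1k}^2 + \varepsilon_{m1k}^2 - \tilde x_k^2 + 2l_{m1}\sigma_1^{T}(z_k)\sigma_1(z_k)\big(\phi_{m1k}^2 + \varepsilon_{m1k}^2\big)\\
&- 2\big(\phi_{m2k}^{T}\phi_{m2k} - \phi_{m2k}^{T}\varepsilon_{m2k}\big)\\
&+ l_{m2}\sigma_2^{T}(z_k)\sigma_2(z_k)(\phi_{m2k} - \varepsilon_{m2k})^{T}(\phi_{m2k} - \varepsilon_{m2k}).
\end{aligned}
\]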

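Considering $\max\{\|\sigma_1(z_k)\|, \|\sigma_2(z_k)\|\} \le \sigma_M$, we can get

\[
\begin{aligned}
\Delta L(\tilde x_k, \tilde W_{m1k}, \tilde W_{m2k}) \le{}& -(1 - 2l_{m1}\sigma_M^2)\phi_{m1k}^2 - (1 - \lambda_{m1} - 2l_{m1}\lambda_{m1}\sigma_M^2)\tilde x_k^2\\
&- 2\big(\|\phi_{m2k}\|^2 - \phi_{m2k}^{T}\varepsilon_{m2k}\big) + l_{m2}\sigma_M^2\|\phi_{m2k} - \varepsilon_{m2k}\|^2.
\end{aligned}
\tag{13.2.11}
\]

As $\varepsilon_{m2k}$ is finite, there exists a $\chi_{m2} > 0$ such that

\[
\varepsilon_{m2k}^{T}\varepsilon_{m2k} \le \chi_{m2}\phi_{m2k}^{T}\phi_{m2k}.
\]

Then, (13.2.11) can be written as

\[
\begin{aligned}
\Delta L(\tilde x_k, \tilde W_{m1k}, \tilde W_{m2k}) \le{}& -(1 - 2l_{m1}\sigma_M^2)\phi_{m1k}^2 - (1 - \lambda_{m1} - 2l_{m1}\lambda_{m1}\sigma_M^2)\tilde x_k^2\\
&- 2(1 - \lambda_{m2})\|\phi_{m2k}\|^2 + 2l_{m2}\sigma_M^2(1 + \chi_{m2})\|\phi_{m2k}\|^2.
\end{aligned}
\]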

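Let $l_{m1}$ be selected as

\[
l_{m1} < \min\left\{\frac{1}{2\sigma_M^2}, \frac{1 - \lambda_{m1}}{2\lambda_{m1}\sigma_M^2}\right\}
\]

and $l_{m2}$ be selected as

\[
l_{m2} < \frac{1 - \lambda_{m2}}{\sigma_M^2(1 + \chi_{m2})}.
\]

Then, $\Delta L < 0$. The proof is complete.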
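The two learning-rate conditions at the end of the proof are easy to evaluate numerically. The helper below is a small sketch. The function name is hypothetical, and the constants $\sigma_M$, $\lambda_{m1}$, $\lambda_{m2}$, and $\chi_{m2}$ must be supplied by the user, since the text does not give numerical values for them.

```python
def identifier_learning_rate_bounds(sigma_M, lambda_m1, lambda_m2, chi_m2):
    """Return the strict upper bounds on l_m1 and l_m2 used in the proof of Theorem 13.2.1."""
    if not (0.0 < lambda_m1 < 1.0 and 0.0 < lambda_m2 < 1.0):
        raise ValueError("lambda_m1 and lambda_m2 must lie in (0, 1)")
    if sigma_M <= 0.0 or chi_m2 <= 0.0:
        raise ValueError("sigma_M and chi_m2 must be positive")
    s2 = sigma_M ** 2
    # l_m1 < min{ 1 / (2 sigma_M^2), (1 - lambda_m1) / (2 lambda_m1 sigma_M^2) }
    l_m1_max = min(1.0 / (2.0 * s2), (1.0 - lambda_m1) / (2.0 * lambda_m1 * s2))
    # l_m2 < (1 - lambda_m2) / (sigma_M^2 (1 + chi_m2))
    l_m2_max = (1.0 - lambda_m2) / (s2 * (1.0 + chi_m2))
    return l_m1_max, l_m2_max
```

For example, with the illustrative values $\sigma_M = 2$, $\lambda_{m1} = \lambda_{m2} = 0.5$, and $\chi_{m2} = 1$, the bounds are $0.125$ and $0.0625$; any learning rates strictly below them satisfy the conditions.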

Next, NN will be used to identify the coal quality function Θk and solve the
reference control law u f k using the system data. Different from the system modeling,
the coal quality data cannot generally be detected and identified in real-time coal
gasification process. This means that the coal quality data can only be obtained
offline. Noticing this feature, an iterative training method of the neural networks can
be adopted.
According to (13.2.6), we can solve Θk , which is expressed as

Θk = FΘ (xk , xk+1 , yk , u k ). (13.2.12)

Usually, FΘ (·) is a highly nonlinear function and the analytical expression of FΘ (·)
is nearly impossible to obtain. Thus, a BP NN (Θ network for brief) is established
to identify the coal quality function Θk .
Let the number of hidden layer neurons be denoted by $L_\Theta$. Let the ideal weights be denoted by $W_\Theta^*$. The NN representation of (13.2.12) can be written as

\[
\Theta_k = W_\Theta^{*T}\sigma_\Theta(z_{\Theta k}) + \varepsilon_{\Theta k}, \tag{13.2.13}
\]

where $z_{\Theta k} = [x_k, x_{k+1}, y_k^{T}, u_k^{T}]^{T}$ and $\varepsilon_{\Theta k}$ is the reconstruction error. Let $\sigma_\Theta(z_{\Theta k}) = \bar\sigma(Y_\Theta z_{\Theta k})$, where $Y_\Theta$ is an arbitrary matrix. The NN coal quality function is constructed as

\[
\hat\Theta_k = \hat W_{\Theta k}^{T}\sigma_\Theta(z_{\Theta k}), \tag{13.2.14}
\]

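where $\hat\Theta_k$ is the estimated coal quality function and $\hat W_{\Theta k}$ is the estimated weight matrix. According to (13.2.12), we notice that solving $\Theta_k$ needs the data $x_{k+1}$. As we adopt offline data to train the NN, the corresponding data can be obtained. Define the identification error as

\[
\tilde\Theta_k^{j} = \hat\Theta_k^{j} - \Theta_k = \phi_{\Theta k}^{j} - \varepsilon_{\Theta k},
\]

where $\phi_{\Theta k}^{j} = \tilde W_{\Theta k}^{jT}\sigma_\Theta(z_{\Theta k})$ and $\tilde W_{\Theta k}^{j} = \hat W_{\Theta k}^{j} - W_\Theta^*$. Similarly, the weights are updated as

\[
\hat W_{\Theta k}^{j+1} = \hat W_{\Theta k}^{j} - l_\Theta\sigma_\Theta(z_{\Theta k})\tilde\Theta_k^{j}, \tag{13.2.15}
\]

where $l_\Theta > 0$ is the learning rate.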


Next, we will solve the reference control using NN (u f network for brief). In
this chapter, as we aim to design a state-feedback controller to guarantee the system
state to track the desired one, according to the state equation in (13.2.6), we utilize
xk , xk+1 , Θk to approximate the reference control function u f k , which is expressed
as

u f k = Fu (xk , xk+1 , Θk ). (13.2.16)

Let the number of hidden layer neurons be denoted by $L_u$. Let the ideal weights be denoted by $W_u^*$. The NN representation of (13.2.16) can be written as

\[
u_{fk} = W_u^{*T}\sigma_u(z_{uk}) + \varepsilon_{uk}, \tag{13.2.17}
\]

where $z_{uk} = [x_k, x_{k+1}, \Theta_k^{T}]^{T}$ and $\varepsilon_{uk}$ is the reconstruction error. Let $\sigma_u(z_{uk}) = \bar\sigma(Y_u z_{uk})$, where $Y_u$ is an arbitrary matrix. The NN reference control is constructed as

\[
\hat u_{fk} = \hat F_u(x_k, x_{k+1}, \Theta_k) = \hat W_{uk}^{T}\sigma_u(z_{uk}), \tag{13.2.18}
\]

where $\hat u_{fk}$ is the estimated reference control and $\hat W_{uk}$ is the estimated weight matrix.
Define the identification error as

\[
\tilde u_{fk}^{j} = \hat u_{fk}^{j} - u_{fk} = \phi_{uk}^{j} - \varepsilon_{uk}, \tag{13.2.19}
\]

where $\phi_{uk}^{j} = \tilde W_{uk}^{jT}\sigma_u(z_{uk})$ and $\tilde W_{uk}^{j} = \hat W_{uk}^{j} - W_u^*$. Similarly, the weights are updated as

\[
\hat W_{uk}^{j+1} = \hat W_{uk}^{j} - l_u\sigma_u(z_{uk})\tilde u_{fk}^{j}, \tag{13.2.20}
\]

where $l_u > 0$ is the learning rate.


Next, we give the convergence properties of Θ network and u f network.
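Theorem 13.2.2 Let the identification schemes (13.2.14) and (13.2.18) be used to identify $\Theta_k$ and $u_{fk}$ in (13.2.12) and (13.2.17), respectively. Let the NN weights be updated by (13.2.15) and (13.2.20), respectively. If for $j = 1, 2, \ldots$, the inequalities

\[
\begin{aligned}
\phi_{\Theta k}^{jT}\varepsilon_{\Theta k} &\le \lambda_\Theta\phi_{\Theta k}^{jT}\phi_{\Theta k}^{j},\\
\phi_{uk}^{jT}\varepsilon_{uk} &\le \lambda_u\phi_{uk}^{jT}\phi_{uk}^{j}
\end{aligned}
\tag{13.2.21}
\]

hold, where $0 < \lambda_\Theta < 1$ and $0 < \lambda_u < 1$, then the error matrices $\tilde W_{\Theta k}^{j}$ and $\tilde W_{uk}^{j}$ both converge to zero, as $j \to \infty$.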

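Proof Consider the following Lyapunov function candidate

\[
L\big(\tilde W_{\Theta k}^{j}, \tilde W_{uk}^{j}\big) = \frac{1}{l_\Theta}\mathrm{tr}\big\{\tilde W_{\Theta k}^{jT}\tilde W_{\Theta k}^{j}\big\} + \frac{1}{l_u}\mathrm{tr}\big\{\tilde W_{uk}^{jT}\tilde W_{uk}^{j}\big\}.
\]

As the activation functions $\sigma_\Theta(z_{\Theta k})$ and $\sigma_u(z_{uk})$ are both bounded, we can let $\|\sigma_\Theta(z_{\Theta k})\| \le \sigma_{\bar\Theta}$ and $\|\sigma_u(z_{uk})\| \le \sigma_{\bar u}$, respectively. The difference of the Lyapunov function candidate is given by

\[
\begin{aligned}
\Delta L\big(\tilde W_{\Theta k}^{j}, \tilde W_{uk}^{j}\big) ={}& \frac{1}{l_\Theta}\mathrm{tr}\big\{\tilde W_{\Theta k}^{(j+1)T}\tilde W_{\Theta k}^{j+1} - \tilde W_{\Theta k}^{jT}\tilde W_{\Theta k}^{j}\big\}\\
&+ \frac{1}{l_u}\mathrm{tr}\big\{\tilde W_{uk}^{(j+1)T}\tilde W_{uk}^{j+1} - \tilde W_{uk}^{jT}\tilde W_{uk}^{j}\big\}\\
={}& -2\tilde\Theta_k^{jT}\tilde W_{\Theta k}^{jT}\sigma_{\Theta k} + l_\Theta\tilde\Theta_k^{jT}\sigma_{\Theta k}^{T}\sigma_{\Theta k}\tilde\Theta_k^{j}\\
&- 2\tilde u_{fk}^{jT}\tilde W_{uk}^{jT}\sigma_{uk} + l_u\tilde u_{fk}^{jT}\sigma_{uk}^{T}\sigma_{uk}\tilde u_{fk}^{j}\\
\le{}& -2\big(\phi_{\Theta k}^{jT}\phi_{\Theta k}^{j} - \phi_{\Theta k}^{jT}\varepsilon_{\Theta k}\big) + l_\Theta\sigma_{\bar\Theta}^2\|\phi_{\Theta k}^{j} - \varepsilon_{\Theta k}\|^2\\
&- 2\big(\phi_{uk}^{jT}\phi_{uk}^{j} - \phi_{uk}^{jT}\varepsilon_{uk}\big) + l_u\sigma_{\bar u}^2\|\phi_{uk}^{j} - \varepsilon_{uk}\|^2,
\end{aligned}
\tag{13.2.22}
\]

where $\sigma_{\Theta k} = \sigma_\Theta(z_{\Theta k})$ and $\sigma_{uk} = \sigma_u(z_{uk})$. As $\varepsilon_{\Theta k}$ and $\varepsilon_{uk}$ are bounded, there exist $\chi_\Theta > 0$ and $\chi_u > 0$ such that

\[
\begin{aligned}
\varepsilon_{\Theta k}^{T}\varepsilon_{\Theta k} &\le \chi_\Theta\phi_{\Theta k}^{jT}\phi_{\Theta k}^{j},\\
\varepsilon_{uk}^{T}\varepsilon_{uk} &\le \chi_u\phi_{uk}^{jT}\phi_{uk}^{j}.
\end{aligned}
\]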

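Then, (13.2.22) can be written as

\[
\begin{aligned}
\Delta L\big(\tilde W_{\Theta k}^{j}, \tilde W_{uk}^{j}\big) \le{}& -2(1 - \lambda_\Theta)\phi_{\Theta k}^{jT}\phi_{\Theta k}^{j} + 2l_\Theta\sigma_{\bar\Theta}^2(1 + \chi_\Theta)\|\phi_{\Theta k}^{j}\|^2\\
&- 2(1 - \lambda_u)\phi_{uk}^{jT}\phi_{uk}^{j} + 2l_u\sigma_{\bar u}^2(1 + \chi_u)\|\phi_{uk}^{j}\|^2.
\end{aligned}
\]

According to (13.2.21), we can select the learning rates $l_\Theta$ and $l_u$ as

\[
l_\Theta < \frac{1 - \lambda_\Theta}{\sigma_{\bar\Theta}^2(1 + \chi_\Theta)}, \qquad
l_u < \frac{1 - \lambda_u}{\sigma_{\bar u}^2(1 + \chi_u)}.
\]

Hence, we can obtain $\Delta L\big(\tilde W_{\Theta k}^{j}, \tilde W_{uk}^{j}\big) \le 0$. The proof is complete.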
Remark 13.2.1 From Theorem 13.2.2, the coal quality function Θk and the reference
control law u f k can be approximated by neural networks. It should be pointed out
that, in real-world applications, the coal quality is generally a slow time-varying
function. It implies that when the coal quality function Θk is identified, it can be
considered as a constant vector. Hence, from the current coal gasification system,
we can obtain the current state, input and output data. Then, we can first use neural
network to identify Θk according to (13.2.12)–(13.2.15). Then, taking the estimated
Θ̂k and the state xk into the u f network, we can obtain the reference control law û f k
immediately, according to (13.2.16)–(13.2.20).

13.3 Design and Implementation of Optimal Tracking Control

In the previous section, we have shown how to use the system data to approximate
the dynamics of system (13.2.6). NNs are also adopted to solve the reference control
and obtain the coal quality function, respectively. In this section, we will present the
iterative ADP algorithm to obtain the optimal tracking control law under system and
iteration errors.

13.3.1 Optimal Tracking Controller Design by Iterative ADP Algorithm Under System and Iteration Errors

Although the control system, the reference control, and the coal quality function are
approximated by NNs, the system errors are still unknown. It is difficult to design
the optimal tracking control system with unknown system errors. Thus, an effective
system transformation is performed in this section.

In order to transform the system, for the desired system state η, a desired reference
control (desired control for brief) can be obtained. Substituting the desired state
trajectory η into (13.2.16), we can obtain the reference control trajectory

u dk = Fu (η, η, Θk ),

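where $u_{dk}$ is defined as the desired control or reference control. Let $\hat F_u(\eta, \eta, \Theta_k) = \hat W_{uk}^{T}\sigma_u(\eta, \eta, \Theta_k)$ be the neural network function which approximates the reference control $u_{dk}$. If the weights $\hat W_{uk}$ converge to $W_u^*$ sufficiently, then

\[
u_{dk} = \hat F_u(\eta, \eta, \Theta_k) + \varepsilon_{uk}.
\]

As $\Theta_k$ cannot be obtained directly, the Θ network is used to approximate $\Theta_k$. According to (13.2.13) and (13.2.14), letting the weights $\hat W_{\Theta k}$ converge to $W_\Theta^*$ sufficiently, we have

\[
\Theta_k = \hat\Theta_k + \varepsilon_{\Theta k}.
\]

Let $z_{duk} = [\eta, \eta, \Theta_k^{T}]^{T}$ and $\hat z_{duk} = [\eta, \eta, \hat\Theta_k^{T}]^{T}$. According to the mean value theorem, we have

\[
\hat F_u(\hat z_{duk}) = \hat F_u(z_{duk}) - \nabla(\xi_\Theta)\varepsilon_{\Theta k},
\]

where

\[
\nabla(\xi_\Theta) = \frac{\partial\hat F_u(\eta, \eta, \xi_\Theta)}{\partial\xi_\Theta},
\]

$\xi_\Theta = c_\Theta\hat\Theta_k + (1 - c_\Theta)(\hat\Theta_k - \varepsilon_{\Theta k})$, and $0 \le c_\Theta \le 1$ is some constant. As $\varepsilon_{\Theta k}$ is bounded and $\sigma_u(\cdot)$ is smooth, $\|\nabla(\xi_\Theta)\varepsilon_{\Theta k}\|$ is bounded. If we let the neural network-generated reference control be expressed by

\[
\hat u_{dk} = \hat F_u(\eta, \eta, \hat\Theta_k), \tag{13.3.1}
\]

then we can get

\[
u_{dk} = \hat u_{dk} + \hat\varepsilon_{uk}, \tag{13.3.2}
\]

where $\hat\varepsilon_{uk} = \nabla(\xi_\Theta)\varepsilon_{\Theta k} + \varepsilon_{uk}$. Let $u_{\delta k}$ be the error between the control $u_k$ and the reference control $u_{dk}$. Then, we can obtain

\[
u_k = \hat u_{dk} + u_{\delta k} + \hat\varepsilon_{uk} = \hat u_k + \hat\varepsilon_{uk}, \tag{13.3.3}
\]

where $\hat u_k = \hat u_{dk} + u_{\delta k}$ is the estimated control.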

Remark 13.3.1 From (13.3.3), we can see that if we have obtained the estimated control $\hat u_k$ and the control error $\hat\varepsilon_{uk}$, then the control input $u_k$ can be determined.

In real-world industrial processes, however, the control performance will be influenced by lower level controllers. For coal gasification, the fluctuation of the flow of the control inputs is important and cannot be ignored. Let $\Delta\varepsilon_{uk}$ be the bounded disturbance on the control signal. The control input can be written as

\[
\bar u_k = \hat u_k + \Delta\varepsilon_{uk} = u_k + \bar\varepsilon_{uk}, \tag{13.3.4}
\]

where $\bar\varepsilon_{uk} = \Delta\varepsilon_{uk} - \hat\varepsilon_{uk}$. Next, the disturbance of the control is considered. According to (13.2.8), let the NN weights converge to $W_{m1}^*$ and $W_{m2}^*$ sufficiently. If we let $\hat F(z_k) = \hat W_{m1k}^{T}\sigma_1(z_k)$, then the system state equation can be written as

\[
x_{k+1} = \hat F(z_k) + \varepsilon_{m1k}. \tag{13.3.5}
\]

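As $u_k$ and $\Theta_k$ cannot be obtained directly, approximations are adopted. Let $\hat z_k = [x_k, \bar u_k^{T}, \hat\Theta_k^{T}]^{T}$. As the activation function $\sigma_1(\cdot)$ is smooth, according to the mean value theorem, we have

\[
\hat F(\hat z_k) = \hat F(z_k) + \nabla(\xi_u)\bar\varepsilon_{uk} + \nabla(\xi_\Theta)\varepsilon_{\Theta k},
\]

where

\[
\nabla(\xi_u) = \frac{\partial\hat F(x_k, \xi_u, \hat\Theta_k)}{\partial\xi_u},
\]

$\xi_u = c_u\bar u_k + (1 - c_u)(\bar u_k - \bar\varepsilon_{uk})$, and $0 \le c_u \le 1$ is some constant. Let $\nabla(\xi_\Theta) = \partial\hat F(x_k, \bar u_k, \xi_\Theta)/\partial\xi_\Theta$ and $\xi_\Theta = c_\Theta\hat\Theta_k + (1 - c_\Theta)(\hat\Theta_k - \varepsilon_{\Theta k})$, where $0 \le c_\Theta \le 1$ is some constant. As $\bar\varepsilon_{uk}$ and $\varepsilon_{\Theta k}$ are both bounded and $\sigma_\Theta(\cdot)$ and $\sigma_u(\cdot)$ are both smooth, $\|\nabla(\xi_u)\bar\varepsilon_{uk}\|$ and $\|\nabla(\xi_\Theta)\varepsilon_{\Theta k}\|$ are both bounded. So, we let $\|\nabla(\xi_u)\bar\varepsilon_{uk}\| \le \|\bar\varepsilon_u\|$ and $\|\nabla(\xi_\Theta)\varepsilon_{\Theta k}\| \le \|\bar\varepsilon_\Theta\|$. Then, (13.3.5) can be written as

\[
x_{k+1} = \hat F(x_k, \bar u_k, \hat\Theta_k) + \nabla(\xi_u)\bar\varepsilon_{uk} + \nabla(\xi_\Theta)\varepsilon_{\Theta k} + \varepsilon_{m1k}. \tag{13.3.6}
\]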

Let the tracking error be defined as

ek = xk − η, (13.3.7)

where η is the desired state trajectory. Let

u ek = u k − u dk , (13.3.8)

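where $\hat u_{dk}$ is the neural network-generated reference control trajectory expressed by (13.3.1). According to (13.3.2), (13.3.4), and (13.3.8), we can get

\[
\bar u_k = u_{ek} + \hat u_{dk} + \tilde\varepsilon_{uk}, \tag{13.3.9}
\]

where $\tilde\varepsilon_{uk} = \bar\varepsilon_{uk} + \varepsilon_{uk}$. According to (13.3.7) and (13.3.9), we have

\[
\hat F(x_k, \bar u_k, \hat\Theta_k) = \hat F\big((e_k + \eta), (u_{ek} + \hat u_{dk}), \hat\Theta_k\big) + \nabla(\tilde\xi_u)\tilde\varepsilon_{uk},
\]

where

\[
\nabla(\tilde\xi_u) = \partial\hat F\big((e_k + \eta), \tilde\xi_u, \hat\Theta_k\big)/\partial\tilde\xi_u,
\]

\[
\tilde\xi_u = \tilde c_u(u_{ek} + \hat u_{dk}) + (1 - \tilde c_u)\big(u_{ek} + \hat u_{dk} + (\bar\varepsilon_{uk} + \varepsilon_{uk})\big),
\]

and $0 \le \tilde c_u \le 1$. Thus, (13.3.6) can be written as

\[
e_{k+1} = \hat F\big((e_k + \eta), (u_{ek} + \hat u_{dk}), \hat\Theta_k\big) - \eta + w_k, \tag{13.3.10}
\]

where $w_k = \nabla(\xi_u)\bar\varepsilon_{uk} + \nabla(\xi_\Theta)\varepsilon_{\Theta k} + \varepsilon_{m1k} + \nabla(\tilde\xi_u)\tilde\varepsilon_{uk}$. As $\nabla(\xi_u)\bar\varepsilon_{uk}$, $\nabla(\xi_\Theta)\varepsilon_{\Theta k}$, $\nabla(\tilde\xi_u)\tilde\varepsilon_{uk}$, and $\varepsilon_{m1k}$ are all bounded, the system disturbance will also be bounded. Let $|\varepsilon_{m1k}| \le |\bar\varepsilon_{m1}|$ and $\|\nabla(\tilde\xi_u)\tilde\varepsilon_{uk}\| \le \|\tilde\varepsilon_u\|$. Then, we can get

\[
\|w_k\| \le \|\bar\varepsilon_u\| + \|\bar\varepsilon_\Theta\| + |\bar\varepsilon_{m1}| + \|\tilde\varepsilon_u\|.
\]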

On the other hand, as mentioned in Remark 13.2.1, Θ̂k is a constant vector after it
is identified. Hence, according to (13.3.1), û dk can also be seen as a constant vector.
Then, system (13.3.10) can be transformed into the following regulation system

ek+1 = F̄(ek , u ek , Θ̂k ) + wk , (13.3.11)

where

\[
\bar F(e_k, u_{ek}, \hat\Theta_k) = \hat F\big((e_k + \eta), (u_{ek} + \hat u_{dk}), \hat\Theta_k\big) - \eta.
\]

From (13.3.11), we can see that the nonlinear tracking control system (13.2.6) is transformed into a regulation system, where the system errors and the control fluctuation are transformed into an unknown bounded system disturbance.
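The change of variables behind (13.3.11) amounts to wrapping the identified model in a shifted function. The sketch below illustrates this with a hypothetical scalar toy model. `toy_model`, its coefficients, and the numerical values of `eta`, `u_d_hat`, and `theta_hat` are invented for the example and are unrelated to the gasifier data.

```python
import numpy as np

def make_error_dynamics(model, eta, u_d_hat, theta_hat):
    """Build F_bar(e, u_e) = F_hat(e + eta, u_e + u_d_hat, theta_hat) - eta."""
    def f_bar(e_k, u_ek):
        return model(e_k + eta, u_ek + u_d_hat, theta_hat) - eta
    return f_bar

def toy_model(x, u, theta):
    """Hypothetical stand-in for the trained model network F_hat (scalar state)."""
    return 0.9 * x + 0.1 * float(np.sum(u)) + float(np.sum(theta)) - 1.0

eta = 10.0                                  # desired state (illustrative)
u_d_hat = np.array([2.0, 3.0, 5.0])         # reference control holding x = eta in the toy model
theta_hat = np.array([0.6, 0.4])            # identified parameter, treated as constant
f_bar = make_error_dynamics(toy_model, eta, u_d_hat, theta_hat)
```

In this toy setting the reference control keeps the state at `eta`, so `f_bar(0.0, np.zeros(3))` is zero up to rounding: the tracking problem has become a regulation problem about the origin, with the neglected errors collected in the disturbance $w_k$.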
Our goal is to obtain an optimal control such that the tracking error ek converges
to zero under the system disturbance wk . As the system disturbance wk is unknown,
the design of the optimal controller becomes very difficult. In [2], the optimal control
problem for system (13.3.11) was transformed into a two-person zero-sum optimal
control problem, where the system disturbance wk was defined as a control variable.
The optimal control law is obtained under the worst case of the disturbance (the
disturbance control maximizes the cost function). Inspired by [2], we define wk as a
disturbance control of the system and the two controls u ek and wk of system (13.3.11)
are designed to optimize the following quadratic cost function

 
\[
J(e_0, \underline{u}_{e0}, \underline{w}_0) = \sum_{k=0}^{\infty}\big(Ae_k^2 + u_{ek}^{T}Bu_{ek} - Cw_k^2\big), \tag{13.3.12}
\]

where we let $\underline{u}_{ek} = (u_{ek}, u_{e,k+1}, \ldots)$ and $\underline{w}_k = (w_k, w_{k+1}, \ldots)$. Let $A > 0$ and $C > 0$ be constants, and let $B$ be a positive definite matrix. Then, the optimal cost function can be defined as

\[
J^*(e_k) = \inf_{\underline{u}_{ek}}\sup_{\underline{w}_k}\big\{J(e_k, \underline{u}_{ek}, \underline{w}_k)\big\}.
\]

Let $U(e_k, u_{ek}, w_k) = Ae_k^2 + u_{ek}^{T}Bu_{ek} - Cw_k^2$ be the utility function. In this chapter, we assume that the utility function satisfies $U(e_k, u_{ek}, w_k) > 0$, $\forall e_k, u_{ek}, w_k \ne 0$. Generally speaking, the system errors are small. This requires the system disturbance $w_k$ to be small and the utility function to be larger than zero. If $w_k$ is large, we can reduce the value of $C$ or enlarge the value of $A$ and the matrix $B$. Hence, the assumption $U(e_k, u_{ek}, w_k) > 0$ can be guaranteed.
According to Bellman's principle of optimality, $J^*(e_k)$ satisfies the discrete-time Hamilton–Jacobi–Isaacs (HJI) equation

\[
J^*(e_k) = \min_{u_{ek}}\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + J^*(e_{k+1})\big\}. \tag{13.3.13}
\]

Define the laws of optimal controls as

\[
\begin{aligned}
w^*(e_k) &= \arg\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + J^*(e_{k+1})\big\},\\
u_e^*(e_k) &= \arg\min_{u_{ek}}\big\{U(e_k, u_{ek}, w^*(e_k)) + J^*(e_{k+1})\big\}.
\end{aligned}
\]

Hence, the HJI equation (13.3.13) can be written as

\[
J^*(e_k) = U(e_k, u_e^*(e_k), w^*(e_k)) + J^*(e_{k+1}).
\]

We can see that if we want to obtain the optimal control laws u ∗e (ek ) and w∗ (ek ), we
must obtain the optimal cost function J ∗ (ek ). Generally speaking, J ∗ (ek ) is unknown
before all the controls u ek and wk are considered, which means that the HJI equation
is generally unsolvable. In this chapter, an iterative ADP algorithm with system
and approximation errors is employed to overcome these difficulties. In the present
iterative ADP algorithm, the cost function and control law are updated by iterations,
with the iteration index i increasing from 0 to infinity. Let the initial value function
V̂0 (ek ) ≡ 0, ∀ek .
From $\hat V_0(\cdot) = 0$, we calculate

\[
\omega_0(e_k) = \arg\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + \hat V_0(e_{k+1})\big\}, \tag{13.3.14}
\]

\[
\hat v_0(e_k) = \arg\min_{u_{ek}}\big\{U(e_k, u_{ek}, \omega_0(e_k)) + \hat V_0(e_{k+1})\big\} + \rho_0(e_k), \tag{13.3.15}
\]

where $\hat V_0(e_{k+1}) = 0$ and $\rho_0(e_k)$ is the finite approximation error function of the iterative control $\hat v_0(e_k)$. For $i = 1, 2, \ldots$, the iterative algorithm calculates the iterative value function $\hat V_i(e_k)$,

\[
\begin{aligned}
\hat V_i(e_k) &= \min_{u_{ek}}\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + \hat V_{i-1}(e_{k+1})\big\} + \pi_i(e_k)\\
&= U(e_k, \hat v_{i-1}(e_k), \omega_{i-1}(e_k)) + \hat V_{i-1}(e_{k+1}) + \pi_i(e_k),
\end{aligned}
\tag{13.3.16}
\]

where $e_{k+1}$ is expressed as in (13.3.11) and $\pi_i(e_k)$ is the finite approximation error function of the iterative value function, and the control laws $\omega_i(e_k)$ and $\hat v_i(e_k)$,

\[
\omega_i(e_k) = \arg\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + \hat V_i(e_{k+1})\big\}, \tag{13.3.17}
\]

\[
\hat v_i(e_k) = \arg\min_{u_{ek}}\big\{U(e_k, u_{ek}, \omega_i(e_k)) + \hat V_i(e_{k+1})\big\} + \rho_i(e_k), \tag{13.3.18}
\]

where $\rho_i(e_k)$ is the finite approximation error function of the iterative control.
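To see the min-max iteration at work in a case that can be checked by hand, consider a scalar linear error system $e_{k+1} = ae_k + bu_{ek} + w_k$ with scalar control and no approximation errors ($\pi_i = \rho_i = 0$). This is an assumption made only for illustration; the system treated in this chapter is nonlinear and is approximated by neural networks. In this special case $\hat V_i(e) = p_ie^2$, and the iteration collapses to a scalar recursion in $p_i$. The function names and parameter values below are hypothetical.

```python
def zero_sum_value_iteration(a, b, A, B, C, n_iter=200):
    """Iterate V_i(e) = p_i e^2 for e_{k+1} = a e + b u + w, U = A e^2 + B u^2 - C w^2."""
    p = 0.0                              # V_0(e) = 0
    history = [p]
    for _ in range(n_iter):
        if p >= C:
            # -C w^2 + p (a e + b u + w)^2 is no longer concave in w
            raise ValueError("inner maximization over w is unbounded (need p < C)")
        denom = 1.0 + p * b * b / B - p / C
        p = A + a * a * p / denom        # value update from the saddle-point conditions
        history.append(p)
    return p, history

def saddle_point_gains(p, a, b, B, C):
    """Linear laws u = u_gain * e and w = w_gain * e obtained from V(e) = p e^2."""
    denom = 1.0 + p * b * b / B - p / C
    s_gain = a / denom                   # closed loop: e_{k+1} = s_gain * e_k
    u_gain = -p * b * s_gain / B         # minimizing control
    w_gain = p * s_gain / C              # maximizing disturbance
    return u_gain, w_gain
```

With the illustrative values $a = 0.8$, $b = 1$, $A = B = 1$, and $C = 5$, the sequence $p_i$ starts at $0$, increases monotonically, and settles near $1.43$. The check `p >= C` makes explicit that the inner maximization is well posed only while the value function is not too steep relative to $C$.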

Remark 13.3.2 From (13.3.11), the system is affine in the disturbance control $w_k$ (it is actually linear in this case). According to (13.3.14) and (13.3.17), using the necessary condition of optimality, for $i = 0, 1, 2, \ldots$, $\omega_i(e_k)$ can be obtained as

\[
\omega_i(e_k) = \frac{1}{2}C^{-1}\frac{d\hat V_i(e_{k+1})}{de_{k+1}}.
\]
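The expression in Remark 13.3.2 follows from the first-order condition of the inner maximization. A brief sketch, assuming that $\hat V_i$ is differentiable and that the error is scalar, so that $\partial e_{k+1}/\partial w_k = 1$ by (13.3.11):

\[
\frac{\partial}{\partial w_k}\Big[Ae_k^2 + u_{ek}^{T}Bu_{ek} - Cw_k^2 + \hat V_i(e_{k+1})\Big]
= -2Cw_k + \frac{d\hat V_i(e_{k+1})}{de_{k+1}}\,\frac{\partial e_{k+1}}{\partial w_k}
= -2Cw_k + \frac{d\hat V_i(e_{k+1})}{de_{k+1}} = 0,
\]

which gives $w_k = \frac{1}{2}C^{-1}\,d\hat V_i(e_{k+1})/de_{k+1}$. The relation is implicit, since $e_{k+1}$ on the right-hand side itself depends on $w_k$, and it identifies a maximum only where the bracketed expression is concave in $w_k$.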

Next, we consider the properties of the iterative ADP algorithm with system errors,
iteration errors, and control disturbance. For the two-person zero-sum iterative ADP
algorithm described in (13.3.14)–(13.3.18), as the iteration errors are unknown, the
properties of the iterative value functions V̂i (ek ) and the iterative control laws ωi (ek )
and v̂i (ek ) are very difficult to analyze, for i = 0, 1, . . .. On the other hand, in [11],
for nonlinear systems with a single controller, an “error bound” analysis method is
proposed to prove the convergence of the iterative value function. In this chapter,
we will establish similar “error bound” convergence analysis results for the iterative
value functions for nonlinear two-person zero-sum optimal control problems.
For $i = 1, 2, \ldots$, define an iterative value function as

\[
\begin{aligned}
V_i(e_k) &= \min_{u_{ek}}\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + \hat V_{i-1}(e_{k+1})\big\}\\
&= U(e_k, v_{i-1}(e_k), \omega_{i-1}(e_k)) + \hat V_{i-1}(e_{k+1}),
\end{aligned}
\tag{13.3.19}
\]

where

\[
V_0(e_{k+1}) = \hat V_0(e_{k+1}) = 0
\]

and

\[
v_i(e_k) = \arg\min_{u_{ek}}\big\{U(e_k, u_{ek}, \omega_i(e_k)) + \hat V_i(e_{k+1})\big\} \tag{13.3.20}
\]

is the accurate iterative control law. According to (13.3.16), there exists a finite constant $\tau \ge 1$ such that

\[
\hat V_i(e_k) \le \tau V_i(e_k), \quad \forall i = 0, 1, \ldots, \tag{13.3.21}
\]

holds uniformly. Hence, we have the following theorems.


Theorem 13.3.1 For all $i = 0, 1, \ldots$, let $V_i(e_k)$ be expressed as in (13.3.19) and $\hat V_i(e_k)$ be expressed as in (13.3.16). Let $e_{k+1}$ be expressed as in (13.3.11). Let $0 < \gamma < \infty$ be a constant such that

\[
J^*(e_{k+1}) \le \gamma U(e_k, u_{ek}, w_k)
\]

holds uniformly. If there exists $1 \le \tau < \infty$ such that (13.3.21) holds uniformly, then

\[
\hat V_i(e_k) \le \tau\left(1 + \sum_{j=1}^{i}\frac{\gamma^{j}\tau^{j-1}(\tau - 1)}{(\gamma + 1)^{j}}\right)J^*(e_k), \tag{13.3.22}
\]

where we define $\sum_{j}^{i}(\cdot) = 0$, $\forall j > i$ and $i, j = 0, 1, \ldots$.

Proof The theorem can be proved by mathematical induction. First, let $i = 0$. We have $\hat V_0(e_k) = 0$. So, the conclusion holds for $i = 0$. Next, for $i = 1$, according to (13.3.19), we have

\[
\begin{aligned}
V_1(e_k) &= \min_{u_{ek}}\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + \hat V_0(e_{k+1})\big\}\\
&\le \min_{u_{ek}}\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + \tau J^*(e_{k+1})\big\}\\
&\le \left(1 + \frac{\gamma(\tau - 1)}{\gamma + 1}\right)\min_{u_{ek}}\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + J^*(e_{k+1})\big\}\\
&= \left(1 + \frac{\gamma(\tau - 1)}{\gamma + 1}\right)J^*(e_k).
\end{aligned}
\]

According to (13.3.21), we can obtain

\[
\hat V_1(e_k) \le \tau\left(1 + \frac{\gamma(\tau - 1)}{\gamma + 1}\right)J^*(e_k),
\]

which shows that (13.3.22) holds for $i = 1$. Assume that (13.3.22) holds for $i = l - 1$, where $l = 1, 2, \ldots$. Then, for $i = l$, we have

\[
\begin{aligned}
V_l(e_k) ={}& \min_{u_{ek}}\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + \hat V_{l-1}(e_{k+1})\big\}\\
\le{}& \min_{u_{ek}}\max_{w_k}\Bigg\{\left(1 + \gamma\sum_{j=1}^{l}\frac{\gamma^{j-1}\tau^{j-1}(\tau - 1)}{(\gamma + 1)^{j}}\right)U(e_k, u_{ek}, w_k)\\
&+ \left[\tau\left(1 + \sum_{j=1}^{l-1}\frac{\gamma^{j}\tau^{j-1}(\tau - 1)}{(\gamma + 1)^{j}}\right) - \sum_{j=1}^{l}\frac{\gamma^{j-1}\tau^{j-1}(\tau - 1)}{(\gamma + 1)^{j}}\right]J^*(e_{k+1})\Bigg\}\\
={}& \left(1 + \sum_{j=1}^{l}\frac{\gamma^{j}\tau^{j-1}(\tau - 1)}{(\gamma + 1)^{j}}\right)\min_{u_{ek}}\max_{w_k}\big\{U(e_k, u_{ek}, w_k) + J^*(e_{k+1})\big\}\\
={}& \left(1 + \sum_{j=1}^{l}\frac{\gamma^{j}\tau^{j-1}(\tau - 1)}{(\gamma + 1)^{j}}\right)J^*(e_k).
\end{aligned}
\]

According to (13.3.21), we can obtain (13.3.22), which proves the conclusion for $i = 0, 1, \ldots$. This completes the proof of the theorem.
Theorem 13.3.2 Suppose Theorem 13.3.1 holds. If for $0 < \gamma < \infty$, the inequality

\[
1 \le \tau < \frac{\gamma + 1}{\gamma} \tag{13.3.23}
\]

holds, then as $i \to \infty$, the iterative value function $\hat V_i(e_k)$ in the iterative ADP algorithm (13.3.14)–(13.3.18) is uniformly convergent to a bounded neighborhood of the optimal cost function $J^*(e_k)$, i.e.,

\[
\lim_{i\to\infty}\hat V_i(e_k) = \hat V_\infty(e_k) \le \tau\left(1 + \frac{\gamma(\tau - 1)}{1 - \gamma(\tau - 1)}\right)J^*(e_k). \tag{13.3.24}
\]

Proof According to (13.3.22) in Theorem 13.3.1, we can see that for $j = 1, 2, \ldots$, the sequence $\left\{\dfrac{\gamma^{j}\tau^{j-1}(\tau - 1)}{(\gamma + 1)^{j}}\right\}$ is a geometric series. If $1 \le \tau < (\gamma + 1)/\gamma$, then, for $i \to \infty$, (13.3.22) becomes

\[
\lim_{i\to\infty}V_i(e_k) = V_\infty(e_k) \le \left(1 + \frac{\gamma(\tau - 1)}{1 - \gamma(\tau - 1)}\right)J^*(e_k). \tag{13.3.25}
\]

According to (13.3.21), letting $i \to \infty$, we have

\[
\hat V_\infty(e_k) \le \tau V_\infty(e_k). \tag{13.3.26}
\]

According to (13.3.25) and (13.3.26), we can obtain (13.3.24). This completes the proof of the theorem.
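The bounds (13.3.22) and (13.3.24) can be evaluated numerically to see how the finite-iteration factor approaches its limit. The sketch below uses hypothetical function names, and the values of $\gamma$ and $\tau$ in the usage note are chosen only for illustration.

```python
def error_bound_factor(gamma, tau, i):
    """Factor multiplying J*(e_k) in (13.3.22) after i iterations."""
    total = sum(
        gamma ** j * tau ** (j - 1) * (tau - 1.0) / (gamma + 1.0) ** j
        for j in range(1, i + 1)
    )
    return tau * (1.0 + total)

def limiting_bound_factor(gamma, tau):
    """Factor multiplying J*(e_k) in (13.3.24); requires condition (13.3.23)."""
    if not (gamma > 0.0 and 1.0 <= tau < (gamma + 1.0) / gamma):
        raise ValueError("condition 1 <= tau < (gamma + 1) / gamma is violated")
    return tau * (1.0 + gamma * (tau - 1.0) / (1.0 - gamma * (tau - 1.0)))
```

For instance, with $\gamma = 2$ and $\tau = 1.2$, condition (13.3.23) reads $1.2 < 1.5$, the one-step factor is $1.36$, and the limiting factor is $2$: under these values the approximate value function stays within twice the optimal cost. The series has ratio $\gamma\tau/(\gamma + 1)$, which is why (13.3.23) is exactly the condition for convergence.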

Corollary 13.3.1 Suppose Theorem 13.3.1 holds. If for $0 < \gamma < \infty$ and $1 \le \tau < \infty$, the inequality (13.3.23) holds, then the iterative control laws $\omega_i(e_k)$ and $\hat v_i(e_k)$ of the iterative ADP algorithm (13.3.14)–(13.3.18) are convergent, i.e.,

\[
\begin{cases}
\omega_\infty(e_k) = \lim\limits_{i\to\infty}\omega_i(e_k),\\[4pt]
\hat v_\infty(e_k) = \lim\limits_{i\to\infty}\hat v_i(e_k).
\end{cases}
\]

13.3.2 Neural Network Implementation

In this section, neural networks, including action network and critic network, are
used to implement the present iterative ADP algorithm. Both the neural networks
are chosen as three-layer BP networks. The whole structural diagram is shown in
Fig. 13.2.
For all $i = 1, 2, \ldots$, the critic network is used to approximate the value function $V_i(e_k)$ in (13.3.19). The output of the critic network is denoted by

\[
\hat V_i^{j}(e_k) = W_{ci}^{jT}\sigma_c(e_k)
\]

for $j = 0, 1, \ldots$. Let $W_{ci}^{0}$ be random weight matrices. Let $\sigma_c(e_k) = \bar\sigma(Y_ce_k)$, where $Y_c$ is an arbitrary matrix. Then, $\sigma_c(e_k)$ is upper bounded, i.e., $\|\sigma_c(e_k)\| \le \bar\sigma_c$ for a positive constant $\bar\sigma_c > 0$. The target function can be written as

\[
V_i(e_k) = U(e_k, v_{i-1}(e_k), \omega_{i-1}(e_k)) + \hat V_{i-1}(e_{k+1}).
\]

Fig. 13.2 The structural diagram of the algorithm (blocks: action network, model network, critic network, and control module)

Then, we define the error function of the critic network as

\[
e_{cik}^{j} = \hat V_i^{j}(e_k) - V_i(e_k).
\]

The objective function to be minimized in the critic network training is

\[
E_{cik}^{j} = \frac{1}{2}\big(e_{cik}^{j}\big)^2.
\]

The gradient-based weight update rule [15] can be applied here to train the critic network:

\[
\begin{aligned}
W_{ck}^{i(j+1)} &= W_{ck}^{ij} + \Delta W_{ck}^{ij}\\
&= W_{ck}^{ij} - l_c\left[\frac{\partial E_{cik}^{j}}{\partial\hat V_i^{j}(e_k)}\frac{\partial\hat V_i^{j}(e_k)}{\partial W_{ck}^{ij}}\right]\\
&= W_{ck}^{ij} - l_ce_{cik}^{j}\sigma_c(e_k),
\end{aligned}
\tag{13.3.27}
\]

where $l_c > 0$ is the learning rate of the critic network. If the training precision is achieved, then $V_i(e_k)$ can be approximated by the critic network.
The action network is used to approximate the iterative control law $v_i(e_k)$, where $v_i(e_k)$ is defined by (13.3.20). The output can be formulated as

\[
\hat v_i^{j}(e_k) = W_{ai}^{jT}\sigma_a(e_k).
\]

Let $\sigma_a(e_k) = \bar\sigma(Y_ae_k)$, where $Y_a$ is an arbitrary matrix. Then, $\sigma_a(e_k)$ is upper bounded, i.e., $\|\sigma_a(e_k)\| \le \bar\sigma_a$ for a positive constant $\bar\sigma_a > 0$. So, we can define the output error of the action network as

\[
e_{aik}^{j} = \hat v_i^{j}(e_k) - v_i(e_k).
\]

The weights in the action network are updated to minimize the following performance error measure:

\[
E_{aik}^{j} = \frac{1}{2}e_{aik}^{jT}e_{aik}^{j}.
\]

The weight updating algorithm is similar to the one for the critic network. By the gradient descent rule, we can obtain

\[
\begin{aligned}
W_{ak}^{i(j+1)} &= W_{ak}^{ij} + \Delta W_{ak}^{ij}\\
&= W_{ak}^{ij} - l_a\left[\frac{\partial E_{aik}^{j}}{\partial e_{aik}^{j}}\frac{\partial e_{aik}^{j}}{\partial\hat v_{ik}^{j}}\frac{\partial\hat v_{ik}^{j}}{\partial W_{ak}^{ij}}\right]\\
&= W_{ak}^{ij} - l_ae_{aik}^{j}\sigma_a(e_k),
\end{aligned}
\tag{13.3.28}
\]

where $l_a > 0$ is the learning rate of the action network.



To guarantee the effectiveness of the neural network implementation, the convergence of the neural network weights should be proved, which allows the approximation of the iterative value function and iterative control using critic and action networks, respectively. The weight convergence property of the neural networks is shown in the following theorem.

Theorem 13.3.3 Let the target value function and the target iterative control law be expressed by

\[
\begin{aligned}
V_{i+1}(e_k) &= W_c^{*iT}\sigma_c(e_k) + \varepsilon_{cik},\\
v_i(e_k) &= W_a^{*iT}\sigma_a(e_k) + \varepsilon_{aik},
\end{aligned}
\]

respectively, where $\varepsilon_{cik}$ and $\varepsilon_{aik}$ are reconstruction errors. Let the critic and action networks be trained by (13.3.27) and (13.3.28), respectively. Let $\tilde W_c^{ij} = W_c^{ij} - W_c^{*i}$ and $\tilde W_a^{ij} = W_a^{ij} - W_a^{*i}$. If for $j = 1, 2, \ldots$, there exist $0 < \lambda_c < 1$ and $0 < \lambda_a < 1$ such that

\[
\phi_{cik}^{j}\varepsilon_{cik} \le \lambda_c\big(\phi_{cik}^{j}\big)^2, \qquad
\phi_{aik}^{jT}\varepsilon_{aik} \le \lambda_a\phi_{aik}^{jT}\phi_{aik}^{j}, \tag{13.3.29}
\]

where $\phi_{cik}^{j} = \tilde W_{ck}^{ijT}\sigma_c(e_k)$ and $\phi_{aik}^{j} = \tilde W_{ak}^{ijT}\sigma_a(e_k)$, then the error matrices $\tilde W_{ck}^{ij}$ and $\tilde W_{ak}^{ij}$ both converge to zero, as $j \to \infty$.

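Proof From (13.3.27) and (13.3.28), we have

\[
\begin{aligned}
\tilde W_{ck}^{i(j+1)} &= \tilde W_{ck}^{ij} - l_ce_{cik}^{j}\sigma_c(e_k),\\
\tilde W_{ak}^{i(j+1)} &= \tilde W_{ak}^{ij} - l_ae_{aik}^{j}\sigma_a(e_k).
\end{aligned}
\]

Consider the following Lyapunov function candidate

\[
L\big(\tilde W_c^{ij}, \tilde W_a^{ij}\big) = \frac{1}{l_c}\mathrm{tr}\big\{\tilde W_c^{ijT}\tilde W_c^{ij}\big\} + \frac{1}{l_a}\mathrm{tr}\big\{\tilde W_a^{ijT}\tilde W_a^{ij}\big\}. \tag{13.3.30}
\]

Let $\tilde V_i^{j}(e_k) = \hat V_i^{j}(e_k) - V_i(e_k)$ and $\tilde v_i^{j}(e_k) = \hat v_i^{j}(e_k) - v_i(e_k)$. Then, the difference of the Lyapunov function candidate (13.3.30) is given by

\[
\begin{aligned}
\Delta L\big(\tilde W_c^{ij}, \tilde W_a^{ij}\big) ={}& \frac{1}{l_c}\mathrm{tr}\big\{\tilde W_c^{i(j+1)T}\tilde W_c^{i(j+1)} - \tilde W_c^{ijT}\tilde W_c^{ij}\big\} + \frac{1}{l_a}\mathrm{tr}\big\{\tilde W_a^{i(j+1)T}\tilde W_a^{i(j+1)} - \tilde W_a^{ijT}\tilde W_a^{ij}\big\}\\
={}& -2\tilde W_{ck}^{ijT}\sigma_c(e_k)\tilde V_{i+1}^{j}(e_k) - 2\tilde v_i^{jT}(e_k)\tilde W_{ak}^{ijT}\sigma_a(e_k)\\
&+ l_c\tilde V_{i+1}^{j}(e_k)\sigma_c^{T}(e_k)\sigma_c(e_k)\tilde V_{i+1}^{j}(e_k) + l_a\tilde v_i^{jT}(e_k)\sigma_a^{T}(e_k)\sigma_a(e_k)\tilde v_i^{j}(e_k)\\
\le{}& -2\big((\phi_{cik}^{j})^2 - \phi_{cik}^{j}\varepsilon_{cik}\big) - 2\big(\phi_{aik}^{jT}\phi_{aik}^{j} - \phi_{aik}^{jT}\varepsilon_{aik}\big)\\
&+ l_c\|\sigma_c\|^2\big(\phi_{cik}^{j} - \varepsilon_{cik}\big)^2 + l_a\|\sigma_a\|^2\|\phi_{aik}^{j} - \varepsilon_{aik}\|^2.
\end{aligned}
\tag{13.3.31}
\]

As $\varepsilon_{cik}$ and $\varepsilon_{aik}$ are both finite, there exist $\chi_c > 0$ and $\chi_a > 0$ such that

\[
\varepsilon_{cik}^2 \le \chi_c\big(\phi_{cik}^{j}\big)^2, \qquad
\varepsilon_{aik}^{T}\varepsilon_{aik} \le \chi_a\phi_{aik}^{jT}\phi_{aik}^{j}. \tag{13.3.32}
\]

Then, (13.3.31) can be written as

\[
\begin{aligned}
\Delta L\big(\tilde W_c^{ij}, \tilde W_a^{ij}\big) \le{}& -2(1 - \lambda_c)\big(\phi_{cik}^{j}\big)^2 + 2l_c\bar\sigma_c^2(1 + \chi_c)\big(\phi_{cik}^{j}\big)^2\\
&- 2(1 - \lambda_a)\phi_{aik}^{jT}\phi_{aik}^{j} + 2l_a\bar\sigma_a^2(1 + \chi_a)\|\phi_{aik}^{j}\|^2.
\end{aligned}
\]

Selecting the learning rates $l_c$ and $l_a$ such that

\[
l_c < \frac{1 - \lambda_c}{\bar\sigma_c^2(1 + \chi_c)}, \qquad
l_a < \frac{1 - \lambda_a}{\bar\sigma_a^2(1 + \chi_a)},
\]

we can obtain $\Delta L\big(\tilde W_c^{ij}, \tilde W_a^{ij}\big) < 0$. The proof is complete.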

13.4 Numerical Analysis

In this section, numerical experiments are studied to show the effectiveness of our
iterative ADP algorithm. Let the coal gasification control system be expressed as
in (13.2.6). We let the initial reaction temperature in the gasifier be x0 = 1000 ◦ C.
Observe the corresponding system input and output data (kg/h) which are

u 0 = [60960, 47572, 44752]T and y0 = [74690, 34381, 4265, 29653, 10295]T .

Let the desired reaction temperature be $\eta = 1320$ °C. To model the coal gasification control system (13.2.6), we collect 20,000 temperature data from the real-world coal gasification system. The corresponding 20,000 system input data and 20,000 system output data are also recorded. Then, a three-layer BP NN is established with the structure of 8–20–1 to approximate the state equation in (13.2.6) and the NN is the model network. The control input is expressed by (13.2.4). We also use a three-layer BP NN with the structure of 8–20–5 to approximate the input–output equation in (13.2.6) and the NN is the input–output network. Let the learning rates of the model network and input–output network be $l_m = 0.002$. Use the gradient-based weight update rule [15] to train the neural networks for 20,000 iteration steps to reach the training precision of $10^{-6}$. The converged weights, respectively, are given by

\[
\begin{aligned}
\hat W_{m1} = [&-0.0030,\ 0.0005,\ -0.0032,\ -3.7356,\ -1.8093,\ 0.0120,\ 0.0020,\\
&-0.0030,\ 4.3394,\ 0.0001,\ -0.5405,\ 0.0200,\ 0.0080,\ 0.0310,\\
&-0.0600,\ -0.014,\ 0.025,\ -0.3138,\ 0.0059,\ -2.1526]
\end{aligned}
\tag{13.4.1}
\]

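and the $5 \times 20$ array $\hat W_{m2}$, whose columns 1–5, 6–10, 11–15, and 16–20 are, in turn,

\[
\begin{bmatrix}
0.1062 & 1.0652 & 0.2797 & -0.0631 & -0.1974\\
-0.0843 & 1.6798 & -0.9755 & 0.0069 & 0.5187\\
0.9926 & 0.3194 & -0.3022 & -0.0434 & 0.0509\\
-0.7054 & 1.4218 & 0.3699 & 0.0479 & 0.0666\\
0.0301 & 4.9841 & 0.0175 & 0.1136 & -0.0059
\end{bmatrix},
\]

\[
\begin{bmatrix}
-0.0047 & 0.2946 & 0.0254 & -0.0129 & -0.1453\\
-0.0232 & -0.9079 & -0.0723 & 0.0366 & 0.0950\\
-0.0266 & -0.2629 & -0.0252 & 0.0132 & 0.0765\\
-0.1865 & 0.2426 & 0.0268 & -0.0136 & -0.0804\\
0.0179 & -0.0199 & 0.0004 & 0.0001 & -0.0089
\end{bmatrix},
\]

\[
\begin{bmatrix}
0.1064 & -0.9205 & -0.1178 & -0.3625 & 0.0570\\
-0.0113 & 2.4362 & 0.5149 & 0.7794 & -0.0446\\
0.0751 & -1.2588 & 0.4976 & -0.6136 & 0.0411\\
-0.0808 & 2.3609 & -0.8929 & 0.3444 & -0.0466\\
-0.1947 & 9.6033 & 0.0280 & -0.0772 & 0.0162
\end{bmatrix},
\]

\[
\begin{bmatrix}
0.4106 & 0.0056 & -0.7702 & -2.1900 & 0.0363\\
-0.7316 & -0.0802 & 2.1927 & 0.9994 & -0.2945\\
-0.3442 & 0.0189 & 0.1892 & -1.0204 & -0.1601\\
0.3453 & -0.0478 & 0.3243 & 1.0497 & -0.4577\\
0.0149 & 0.0103 & -0.0156 & 3.6508 & 0.0482
\end{bmatrix}.
\]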

Next, we adopt three-layer BP NNs to identify the coal quality equation (13.2.13) and the reference control equation (13.2.16). The structures of the Θ network and the $u_f$ network are chosen as 10–20–4 and 6–20–3, respectively. Using the gradient-based weight update rule, train the two neural networks for 20,000 iteration steps under the learning rate 0.002 to reach the training precision of $10^{-6}$. The converged weights are given by the $4 \times 20$ array $\hat W_\Theta$, whose columns 1–5, 6–10, 11–15, and 16–20 are, in turn,

\[
\begin{bmatrix}
0.003 & 0.2629 & -0.0030 & 0.0010 & -0.0868\\
0.001 & 0.4992 & -0.0001 & 0.0001 & 0.0020\\
0.001 & -0.0391 & 0.0010 & -0.0001 & -0.0285\\
-0.004 & 0.2651 & 0.0020 & -0.0010 & 0.1151
\end{bmatrix},
\]

\[
\begin{bmatrix}
0.0010 & 0.0001 & -0.001 & 0.0126 & -0.0001\\
0.0001 & -0.0001 & 0.001 & -0.0079 & -0.0001\\
0.0001 & -0.0001 & -0.002 & -0.0179 & -0.0010\\
-0.0010 & -0.0001 & 0.003 & 0.0131 & -0.0002
\end{bmatrix},
\]

\[
\begin{bmatrix}
0.0064 & 0.1072 & -0.2253 & 0.0640 & 0.0015\\
0.0002 & 0.0024 & -0.0040 & -0.0016 & -0.0001\\
-0.0014 & -0.0184 & 0.0852 & 0.0301 & 0.0004\\
-0.0055 & -0.0910 & 0.1438 & -0.0923 & -0.0004
\end{bmatrix},
\]

\[
\begin{bmatrix}
0.2770 & -0.004 & -0.002 & 0.017 & -0.0400\\
0.0026 & 0.002 & -0.001 & -0.001 & 0.4850\\
-0.0064 & -0.011 & -0.001 & 0.001 & -0.2182\\
-0.2653 & 0.016 & 0.003 & -0.017 & -0.2320
\end{bmatrix}
\]

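and the $3 \times 20$ array $\hat W_u$, whose columns 1–5, 6–10, 11–15, and 16–20 are, in turn,

\[
\begin{bmatrix}
0.0108 & -0.0011 & -0.0013 & 0.0464 & -0.0515\\
-0.0389 & -0.0038 & -0.0126 & -0.1154 & -0.0345\\
0.0990 & 0.0010 & -0.0011 & 0.5440 & 0.1863
\end{bmatrix},
\]

\[
\begin{bmatrix}
6.9212 & -0.1401 & -0.0266 & 0.1011 & -0.0022\\
0.6060 & 5.4144 & 0.0902 & -0.4428 & -0.0039\\
-0.2832 & 0.2710 & 0.1602 & 0.4762 & 0.0006
\end{bmatrix},
\]

\[
\begin{bmatrix}
0.0276 & 0.0074 & -0.1958 & -0.1729 & 0.1457\\
-0.0427 & 0.0081 & 0.3125 & 0.1914 & -0.3870\\
0.1827 & 0.1162 & 0.4931 & 0.4594 & 1.7896
\end{bmatrix},
\]

\[
\begin{bmatrix}
-0.0058 & -0.0077 & -0.0010 & -0.0000 & 0.4769\\
0.0839 & 0.1626 & -0.0229 & 0.0344 & 0.4023\\
-0.0482 & -0.3090 & -0.0133 & -0.0392 & 0.8065
\end{bmatrix}.
\]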

Feeding the current system data $x_0$, $u_0$, and $y_0$ into the Θ network, we
obtain the coal quality as

\[
\hat{\Theta}_k = [0.6789,\ 0.0373,\ 0.1149,\ 0.1689]^T .
\]

Feeding the desired state η = 1320 and the coal quality $\hat{\Theta}_k$ into the
$u_f$ network, we obtain the desired control input
$\hat{u}_{dk} = [61408.74,\ 44430.69,\ 51200]^T$. According to the weights of the
model network, the Θ network, and the $u_f$ network, we obtain the bound
w ≤ 26.44 on the system disturbance.
Next, the present iterative ADP algorithm is applied to obtain the optimal
tracking control law. Let the cost function be defined as in (13.3.12), where A = 1,
C = 0.5, and B is the identity matrix with suitable dimensions. The critic and
action networks are both chosen as three-layer BP neural networks with the structures
of 1–8–1 and 1–8–3, respectively. For each iteration, the critic and action networks
are trained for 1000 steps using the learning rate $l_c = l_a = 0.01$ so that the
neural network training errors become less than $10^{-6}$. Let the iteration index
i = 100. The converged weights of the critic network and the action network are
expressed as

\[
W_c = [0.2194,\ 0.2555,\ 0.2797,\ 0.4137,\ 0.1777,\ 0.2165,\ 0.0092,\ 0.1841]
\]

and

\[
W_a = \begin{bmatrix}
-0.0625 & 0.0746 & -0.0122 & -0.0717 & -0.0263 & -0.0236 & -0.0311 & 0.0031 \\
0.0497 & -0.2990 & -0.1878 & 0.0750 & -0.1385 & 0.0146 & 0.1016 & 0.0823 \\
0.0265 & 0.5098 & 0.3543 & 0.0404 & 0.2876 & 0.1056 & -0.0604 & -0.0862
\end{bmatrix},
\]

respectively. The convergence trajectory of the iterative value function is shown in
Fig. 13.3. We apply the optimal control law to the system for $T_f = 100$ time steps
and obtain the following results. The optimal state trajectory is shown in Fig. 13.4.
The corresponding control trajectories and system output trajectories are shown in
Figs. 13.5 and 13.6, respectively.

[Fig. 13.3 Convergence trajectory of iterative value function (iterative value function versus iteration steps)]

[Fig. 13.4 Trajectory of state (state in °C versus time steps)]

[Fig. 13.5 Trajectories of control inputs and system output. a Coal input trajectory. b H2O input
trajectory. c O2 input trajectory. d CO output trajectory (all in kg/h versus time steps)]
From the above numerical results, we can see that under the given NN training
precisions, the system state successfully tracks the desired temperature using the
optimal control derived by the present iterative ADP algorithm with system and
iteration errors, which shows the effectiveness of the present algorithm. In the
following, we change the NN training precisions to show the performance of the
present iterative ADP algorithm. First, we change the training precisions of the
model network, Θ network, and $u_f$ network to $10^{-3}$, while the training
precisions of the critic and action networks are kept at $10^{-6}$.
Let the iteration index i = 200. The convergence trajectory of the iterative value
function is shown in Fig. 13.7. We apply the optimal control law to the system for
$T_f = 100$ time steps and obtain the following results. The optimal state trajectory
is shown in Fig. 13.8. The corresponding control trajectories and system output
trajectories are shown in Figs. 13.9 and 13.10, respectively.

[Fig. 13.6 Trajectories of system outputs. a CO2 output trajectory. b H2 output trajectory. c H2O
output trajectory. d Char output trajectory (all in kg/h versus time steps)]

[Fig. 13.7 Convergence trajectory of iterative value function (iterative value function versus iteration steps)]

[Fig. 13.8 Trajectory of state (state in °C versus time steps)]

[Fig. 13.9 Trajectories of control inputs and system output. a Coal input trajectory. b H2O input
trajectory. c O2 input trajectory. d CO output trajectory (all in kg/h versus time steps)]

[Fig. 13.10 Trajectories of system outputs. a CO2 output trajectory. b H2 output trajectory. c H2O
output trajectory. d Char output trajectory (all in kg/h versus time steps)]
As is known, in real-world coal gasification, the flow fluctuation of the control
inputs is important and cannot be ignored. We therefore examine the control system
performance under a control disturbance. Let $\Delta u_k$ be a zero-expectation white
noise on the control input, with $|\Delta u_{1k}| \le 100$, $|\Delta u_{2k}| \le 40$,
and $|\Delta u_{3k}| \le 40$. The disturbance trajectories of the control input are
displayed in Fig. 13.11a, b, and c, respectively. Let the training precisions of the
model network, Θ network, and $u_f$ network be $10^{-3}$, and let the training
precisions of the critic and action networks be kept at $10^{-6}$. The convergence
trajectory of the iterative value function is shown in Fig. 13.12. The optimal
state trajectory is shown in Fig. 13.13. The corresponding control trajectories and
system output trajectories are shown in Figs. 13.14 and 13.15, respectively. From the
numerical results, we can see that the optimal tracking control of the system can
still be obtained under the disturbance of the control input, which shows the
effectiveness and robustness of the present iterative ADP method. To verify the
correctness of the model and the present method, the mass errors between the input
and output are given in Fig. 13.11d.

[Fig. 13.11 Control disturbance and the input–output mass error. a Control disturbance Δu_1k.
b Control disturbance Δu_2k. c Control disturbance Δu_3k. d The error between the input and output
mass (all in kg/h versus time steps)]

[Fig. 13.12 Convergence trajectory of iterative value function (iterative value function versus iteration steps)]

[Fig. 13.13 Trajectory of state (state in °C versus time steps)]

[Fig. 13.14 Trajectories of control inputs and system output. a Coal input trajectory. b H2O input
trajectory. c O2 input trajectory. d CO output trajectory (all in kg/h versus time steps)]

[Fig. 13.15 Trajectories of the system outputs. a CO2 output trajectory. b H2 output trajectory.
c H2O output trajectory. d Char output trajectory (all in kg/h versus time steps)]
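The bounded control disturbance described above can be sketched as follows. This is only an illustration: the text specifies a zero-expectation white noise with the bounds 100, 40, and 40 kg/h, but not its distribution, so the uniform distribution, the function name `control_disturbance`, and the seed are assumptions.

```python
import numpy as np

def control_disturbance(n_steps, bounds=(100.0, 40.0, 40.0), seed=0):
    """Draw a zero-mean, bounded, temporally independent disturbance.

    Each column k satisfies |delta_u[:, k]| <= bounds[k]. The uniform
    distribution is an assumption; the text only states zero expectation
    and the bounds.
    """
    rng = np.random.default_rng(seed)
    b = np.asarray(bounds, dtype=float)
    # Samples are independent across time steps (white) and symmetric
    # about zero (zero expectation).
    return rng.uniform(-b, b, size=(n_steps, b.size))

delta_u = control_disturbance(100)
```

Each row of `delta_u` would be added to the three control inputs (coal, H2O, O2) at one time step.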
From the numerical results, we can see that when the system errors and the
disturbance of the control input are enlarged, the iterative ADP algorithm is still
effective in finding the optimal tracking control scheme for the system. On the other
hand, if we enlarge the iteration errors, the control performance is quite different.
Let the disturbance of the control be $\Delta u_k = 0$. Let the training precisions
for the model network, Θ network, and $u_f$ network be kept at $10^{-3}$. We change
the training precisions of the critic and action networks to $10^{-3}$. Let the
iteration index i = 100. The convergence trajectory of the iterative value function
is shown in Fig. 13.16a, where we can see that the iterative value function no longer
converges. The corresponding state trajectory is shown in Fig. 13.16b, where we
notice that the desired state is not achieved.
[Fig. 13.16 Simulation results. a The trajectory of iterative value function (versus iteration steps).
b The trajectory of state (state in °C versus time steps)]

13.5 Conclusions

In this chapter, an effective iterative ADP algorithm is established to solve the optimal
tracking control problem for coal gasification systems. Using the input-state-output
data of the system, NNs are used to approximate the system model, the coal quality,
and the reference control, respectively, so that a mathematical model of the coal
gasification process is unnecessary. Considering the system errors of the NNs and the
control disturbance, the optimal tracking control problem is transformed into a
two-person zero-sum optimal regulation control problem. An iterative ADP algorithm is
then established to obtain the optimal control law, where the approximation errors in
each iteration are considered. Convergence analysis is given to guarantee that the
iterative value functions converge to a finite neighborhood of the optimal cost function.

References

1. Abani N, Ghoniem AF (2013) Large eddy simulations of coal gasification in an entrained flow
gasifier. Fuel 104:664–680
2. Başar T, Bernhard P (1995) H∞-optimal control and related minimax design problems.
Birkhäuser, Boston
3. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):82–92
4. Chen Y, Li Z, Zhou M (2014) Optimal supervisory control of flexible manufacturing systems
by petri nets: a set classification approach. IEEE Trans Autom Sci Eng 11(2):549–563
5. Gopalsami N, Raptis AC (1984) Acoustic velocity and attenuation measurements in thin rods
with application to temperature profiling in coal gasification systems. IEEE Trans Sonics
Ultrason 31(1):32–39
6. Guo R, Cheng G, Wang Y (2006) Texaco coal gasification quality prediction by neural estimator
based on dynamic PCA. In: Proceedings of the IEEE international conference on mechatronics
and automation, pp 2241–2246
7. Jia QS (2011) An adaptive sampling algorithm for simulation-based optimization with descrip-
tive complexity preference. IEEE Trans Autom Sci Eng 8(4):720–731
8. Jin X, Hu SJ, Ni J, Xiao G (2013) Assembly strategies for remanufacturing systems with
variable quality returns. IEEE Trans Autom Sci Eng 10(1):76–85
9. Kang Q, Zhou M, An J, Wu Q (2013) Swarm intelligence approaches to optimal power flow
problem with distributed generator failures in power networks. IEEE Trans Autom Sci Eng
10(2):343–353
10. Kostur K, Kacur J (2012) Developing of optimal control system for UCG. In: Proceedings of
the international carpathian control conference, pp 347–352
11. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
12. Matveev IB, Messerle VE, Ustimenko AB (2009) Investigation of plasma-aided bituminous
coal gasification. IEEE Trans Plasma Sci 37(4):580–585
13. Ruprecht P, Schafer W, Wallace P (1988) A computer model of entrained coal gasification.
Fuel 67(6):739–742
14. Serbin SI, Matveev IB (2010) Theoretical investigations of the working processes in a plasma
coal gasification system. IEEE Trans Plasma Sci 38(12):3300–3305
15. Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
16. Wei Q, Liu D (2014) Adaptive dynamic programming for optimal tracking control of unknown
nonlinear systems with application to coal gasification. IEEE Trans Autom Sci Eng
11(4):1020–1036
17. Wigstrom O, Lennartson B, Vergnano A, Breitholtz C (2013) High-level scheduling of energy
optimal trajectories. IEEE Trans Autom Sci Eng 10(1):57–64
18. Wilson JA, Chew M, Jones WE (2006) State estimation-based control of a coal gasifier. IEE
Proc-Control Theory Appl 153(3):268–276
19. Xu J, Qiao L, Gore J (2013) Multiphysics well-stirred reactor modeling of coal gasification
under intense thermal radiation. Int J Hydrog Energy 38(17):7007–7015
20. Yang Q, Jagannathan S (2012) Reinforcement learning controller design for affine nonlinear
discrete-time systems using online approximators. IEEE Trans Syst, Man, Cybern-Part B:
Cybern 42(2):377–390
Chapter 14
Data-Based Neuro-Optimal Temperature
Control of Water Gas Shift Reaction

14.1 Introduction

The water gas shift (WGS) reactor is an essential component of the coal-based chemical
industry [11]. The WGS reactor combines carbon monoxide (CO) and water (H2O)
in the reactant stream to produce carbon dioxide (CO2) and hydrogen (H2). Proper
regulation of the operating temperature is critical to achieving adequate CO
conversion during transients [3]. Hence, optimal control of the reaction temperature
is a key problem for the WGS reaction process. To describe the dynamics of the WGS
reaction process, many studies have focused on WGS modeling approaches [6, 16].
Unfortunately, the established WGS models are generally complex with high
nonlinearities. Thus, the traditional linearized control method [28, 31, 32] is only
effective in the neighborhood of the equilibrium point. When the required operating
range is large, the nonlinearities in the system cannot be properly compensated by
using a linear model. Therefore, it is necessary to study optimal control approaches
for the original nonlinear system [3, 11]. Although optimal control of nonlinear
systems has been a focus of the control field in the last several decades
[1, 2, 4, 5, 9, 18, 19, 22, 29], the optimal controller design for WGS reaction
systems (WGS systems in brief) is still challenging, due to the complexity of the
WGS reaction process.
The iterative ADP method has played an important role as an effective way to solve
the Bellman equation indirectly and has received considerable attention
[12, 14, 23, 24, 27, 30, 33]. Most previous ADP algorithms require that the system
model, the iterative control, and the cost function be approximated accurately, which
guarantees the convergence of the proposed algorithms. In real-world implementations
of ADP, e.g., for WGS systems, reconstruction errors of the approximators and
disturbances of the system states and controls inherently exist. Thus, the system
models, iterative control laws, and cost functions cannot be obtained accurately.
Although ADP was explored in [8, 21] to design an optimal temperature controller
for the WGS system, the effects of approximation errors and disturbances were not
considered. Furthermore, the convergence and stability properties were not discussed.

© Springer International Publishing AG 2017


D. Liu et al., Adaptive Dynamic Programming with Applications
in Optimal Control, Advances in Industrial Control,
DOI 10.1007/978-3-319-50815-3_14

In this chapter, a stable iterative ADP algorithm is developed to obtain the optimal
control law for the WGS system, such that the temperature of WGS system tracks
the desired temperature [26].

14.2 System Description and Data-Based Modeling

14.2.1 Water Gas Shift Reaction

In the WGS reaction, the water gas, which includes CO, CO2, H2, and H2O, is fed into
the WGS reactor. The WGS reaction, which is slightly exothermic, converts CO to
CO2 and H2 as shown in the following equation:

\[
\mathrm{CO} + \mathrm{H_2O} \leftrightarrow \mathrm{CO_2} + \mathrm{H_2} - 41.9\ \mathrm{kJ/mol}.
\]

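The WGS reaction rate [3] can be described as follows:

\[
r_{\mathrm{WGS}} = \rho_{\mathrm{cat}} k_r \exp\left(-\frac{5126\ \mathrm{K}}{T}\right)
[\mathrm{CO}]^{0.78} [\mathrm{H_2O}]^{0.15}
\left(1 - \frac{[\mathrm{CO_2}][\mathrm{H_2}]}{[\mathrm{CO}][\mathrm{H_2O}]\, K_T}\right),
\tag{14.2.1}
\]

where the rate is in kmol/m$^3$/s. The catalyst density is
$\rho_{\mathrm{cat}} = 1.8 \times 10^{-4}$ kg/m$^3$. $T$ is the current reaction
temperature in K (Kelvin). The rate constant is $k_r = 1.32 \times 10^{9}$ kmol/kg/s.
The reaction equilibrium coefficient $K_T$ is given as in [17] and is expressed by

\[
K_T = \exp\left(\frac{4577.7\ \mathrm{K}}{T} - 4.33\right).
\]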
For the WGS reaction (14.2.1), we can see that the reaction temperature is the
key parameter [3, 11].
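As a rough illustration of how strongly the rate (14.2.1) depends on temperature, the expression can be evaluated directly. The sketch below uses the constants quoted above; the function names, the argument names, and the choice of concentration units are assumptions made for illustration only.

```python
import math

RHO_CAT = 1.8e-4  # catalyst density quoted in the text (kg/m^3)
K_R = 1.32e9      # rate constant quoted in the text (kmol/kg/s)

def equilibrium_coefficient(T):
    """K_T = exp(4577.7 K / T - 4.33), with T in Kelvin."""
    return math.exp(4577.7 / T - 4.33)

def wgs_rate(T, co, h2o, co2, h2):
    """Evaluate r_WGS from (14.2.1).

    co, h2o, co2, h2 are the bracketed concentrations [CO], [H2O],
    [CO2], [H2]; their units are an assumption (e.g., kmol/m^3) and
    must be positive for co and h2o.
    """
    k_t = equilibrium_coefficient(T)
    # Approach-to-equilibrium factor: zero at equilibrium, positive
    # when the forward reaction is favored.
    driving = 1.0 - (co2 * h2) / (co * h2o * k_t)
    arrhenius = math.exp(-5126.0 / T)
    return RHO_CAT * K_R * arrhenius * co**0.78 * h2o**0.15 * driving
```

Because the reaction is exothermic, `equilibrium_coefficient` decreases as `T` grows, while the Arrhenius factor increases; the operating temperature therefore trades reaction speed against achievable conversion.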
Let $u_k$ denote the control input representing the flow of water gas (m$^3$/s). Let

\[
P(u_k) = u_k [\theta_{\mathrm{CO}}, \theta_{\mathrm{CO_2}}, \theta_{\mathrm{H_2}}, \theta_{\mathrm{H_2O}}]^T ,
\]

where $\theta_{\mathrm{CO}}$, $\theta_{\mathrm{CO_2}}$, $\theta_{\mathrm{H_2}}$, and
$\theta_{\mathrm{H_2O}}$ denote the given percentage compositions of CO, CO2, H2, and
H2O. Generally speaking, the water gas of WGS systems comes from the previous
reaction process, such as coal gasification [16]. This means that the composition
ratios of the mixed gas are uncontrollable for the WGS systems, and the water gas
flow is the only quantity to be controlled. Let $x_k$ denote the temperature of the
WGS reactor. The WGS system can be expressed as

\[
x_{k+1} = F(x_k, u_k), \tag{14.2.2}
\]

where $F(\cdot)$ is an unknown system function. Let $x_k \in \mathbb{R}$ and
$u_k \in \mathbb{R}$. Let the desired state be $\tau$. Then, our goal is to design an
optimal state feedback tracking control law $u_k^* = u^*(x_k)$ such that the system
state tracks the desired state trajectory.

14.2.2 Data-Based Modeling and Properties

Three-layer backpropagation (BP) NNs are introduced to construct the dynamics of
the WGS system and solve the reference control, respectively. Let $L$ be the number
of hidden layer neurons. Let $X \in \mathbb{R}^N$ be the input of the NN, and let
$Z \in \mathbb{R}^M$ be the output. Then, the function of the BP NNs can be expressed by

\[
Z = \hat{F}_N(X, Y_f, W_f) = W_f^T \sigma(Y_f X),
\]

where $Y_f \in \mathbb{R}^{L \times N}$ is the input–hidden layer weight matrix and
$W_f \in \mathbb{R}^{L \times M}$ is the hidden–output layer weight matrix. Let
$\sigma$ be a sigmoid activation function [20, 25]. For the convenience of analysis,
only the hidden–output layer weight $W_f$ is updated during the NN training, while
the input–hidden weight is fixed [30]. Hence, the NN function is simplified to
$\hat{F}_N(X, W_f) = W_f^T \sigma_N(X)$, where $\sigma_N(X) = \sigma(Y_f X)$.
The model NN of system (14.2.2) can be written as

\[
x_{k+1} = W_m^{*T} \sigma_m(z_k) + \varepsilon_{m1k},
\]

where $z_k = [x_k, u_k]^T$ denotes the NN input and
$W_m^* \in \mathbb{R}^{L_m \times 1}$ denotes the ideal weight matrix of the model NN,
where $L_m$ is the number of hidden layer neurons. Let
$\sigma_m(z_k) = \sigma(Y_m z_k)$, where $Y_m$ is an arbitrary weight matrix with a
suitable dimension. Let $\|\sigma_m(\cdot)\| \le \bar{\sigma}_M$ for a constant
$\bar{\sigma}_M > 0$, and let $\varepsilon_{m1k}$ be the bounded NN reconstruction
error such that $|\varepsilon_{m1k}| \le \bar{\varepsilon}_{m1}$ for a constant
$\bar{\varepsilon}_{m1} > 0$. Training the model NN requires an array of WGS system
and control data, such as the data from a period of time.
The NN model for the system is constructed as

\[
\hat{x}_{k+1} = \hat{W}_{mk}^T \sigma_m(z_k), \tag{14.2.3}
\]

where $\hat{x}_k$ is the estimated system state vector. Let $\hat{W}_{mk}$ be the
estimation of the ideal weight matrix $W_m^*$. Then, we define the system
identification error as

\[
\tilde{x}_{k+1} = \hat{x}_{k+1} - x_{k+1} = \tilde{W}_{mk}^T \sigma_m(z_k) - \varepsilon_{m1k},
\]

where $\tilde{W}_{mk} = \hat{W}_{mk} - W_m^*$. Let
$\phi_{mk} = \tilde{W}_{mk}^T \sigma_m(z_k)$. We can get

\[
\tilde{x}_{k+1} = \phi_{mk} - \varepsilon_{m1k}.
\]

The weights are adjusted to minimize the following error function

\[
E_{mk} = \frac{1}{2} \tilde{x}_{k+1}^T \tilde{x}_{k+1} = \frac{1}{2} \tilde{x}_{k+1}^2 .
\]

By a gradient descent adaptation rule [20, 25], the weights are updated as

\[
\hat{W}_{m,k+1} = \hat{W}_{mk} - l_m \sigma_m(z_k) \tilde{x}_{k+1}, \tag{14.2.4}
\]

where $l_m > 0$ is the learning rate.
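The update rule (14.2.4) can be sketched in a few lines. The example below is a minimal illustration, not the implementation used in the chapter: the toy plant in `make_toy_data` stands in for the unknown WGS dynamics, and the function names, the hidden-layer size, the learning rate, and the logistic choice of sigmoid are all assumptions. Only the hidden–output weights are trained; the input–hidden matrix stays fixed, as described above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_toy_data(n=200, seed=0):
    """Hypothetical stand-in for recorded (x_k, u_k, x_{k+1}) data."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1.0, 1.0, size=(n, 2))          # z_k = [x_k, u_k]
    x_next = 0.8 * z[:, 0] + 0.2 * np.tanh(z[:, 1])  # assumed toy plant
    return z, x_next

def train_model_nn(z, x_next, n_hidden=8, lr=0.05, epochs=200, seed=1):
    """Train W_m with the rule W <- W - l_m * sigma_m(z_k) * x_tilde_{k+1}."""
    rng = np.random.default_rng(seed)
    Y = rng.uniform(-1.0, 1.0, size=(n_hidden, z.shape[1]))  # fixed Y_m
    W = np.zeros(n_hidden)                                   # estimate of W_m
    for _ in range(epochs):
        for zk, target in zip(z, x_next):
            s = sigmoid(Y @ zk)       # sigma_m(z_k)
            x_tilde = W @ s - target  # identification error
            W -= lr * s * x_tilde     # gradient step as in (14.2.4)
    return Y, W

def predict(Y, W, z):
    """Evaluate the model NN output for each row of z."""
    return sigmoid(z @ Y.T) @ W
```

With eight hidden units each bounded by 1, the squared norm of the hidden-layer output is below 8, so the learning rate 0.05 stays under the bound 1/(2 · 8) that a condition of the form $l_m < 1/(2\bar{\sigma}_M^2)$ would give for this toy setting.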


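Theorem 14.2.1 Let the model network (14.2.3) be used to identify the WGS system
(14.2.2). If there exists a constant $0 < \lambda_m < 1$ such that
$\varepsilon_{m1k}^2 \le \lambda_m \tilde{x}_k^2$, then the system identification
error $\tilde{x}_k$ is asymptotically stable and the error matrix $\tilde{W}_{mk}$
converges to zero, as $k \to \infty$.

Proof Consider the following Lyapunov function candidate defined as

\[
L(\tilde{x}_k, \tilde{W}_{mk}) = \tilde{x}_k^2 + \frac{1}{l_m} \mathrm{tr}\left\{ \tilde{W}_{mk}^T \tilde{W}_{mk} \right\}.
\]

Taking the difference of the Lyapunov function candidate, we can obtain

\begin{align*}
\Delta L(\tilde{x}_k, \tilde{W}_{mk})
&\le -\phi_{mk}^2 + \varepsilon_{m1k}^2 - \tilde{x}_k^2
 + 2 l_m \sigma_m^T(z_k) \sigma_m(z_k) \left( \phi_{mk}^2 + \varepsilon_{m1k}^2 \right) \\
&\le -\left(1 - 2 l_m \bar{\sigma}_M^2\right) \phi_{mk}^2
 - \left(1 - \lambda_m \left(1 + 2 l_m \bar{\sigma}_M^2\right)\right) \tilde{x}_k^2 .
\end{align*}

Selecting the learning rate

\[
l_m < \min\left\{ \frac{1}{2 \bar{\sigma}_M^2},\ \frac{1 - \lambda_m}{2 \lambda_m \bar{\sigma}_M^2} \right\},
\]

we can obtain $\Delta L(\tilde{x}_k, \tilde{W}_{mk}) \le 0$. The proof is complete.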
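The first bound in the proof of Theorem 14.2.1 can be checked step by step. The following worked derivation is a sketch that uses only $\tilde{x}_{k+1} = \phi_{mk} - \varepsilon_{m1k}$ and the weight error recursion $\tilde{W}_{m,k+1} = \tilde{W}_{mk} - l_m \sigma_m(z_k) \tilde{x}_{k+1}$ implied by (14.2.4); the shorthand $\sigma_m = \sigma_m(z_k)$ is introduced here for brevity.

```latex
\begin{align*}
\Delta L
&= \tilde{x}_{k+1}^2 - \tilde{x}_k^2
 + \frac{1}{l_m}\,\mathrm{tr}\!\left\{ \tilde{W}_{m,k+1}^T \tilde{W}_{m,k+1}
 - \tilde{W}_{mk}^T \tilde{W}_{mk} \right\} \\
&= \tilde{x}_{k+1}^2 - \tilde{x}_k^2
 - 2 \phi_{mk} \tilde{x}_{k+1}
 + l_m \sigma_m^T \sigma_m \, \tilde{x}_{k+1}^2 \\
&= -\phi_{mk}^2 + \varepsilon_{m1k}^2 - \tilde{x}_k^2
 + l_m \sigma_m^T \sigma_m \left( \phi_{mk} - \varepsilon_{m1k} \right)^2 \\
&\le -\phi_{mk}^2 + \varepsilon_{m1k}^2 - \tilde{x}_k^2
 + 2 l_m \sigma_m^T \sigma_m \left( \phi_{mk}^2 + \varepsilon_{m1k}^2 \right).
\end{align*}
```

The third line uses $(\phi_{mk} - \varepsilon_{m1k})^2 - 2\phi_{mk}(\phi_{mk} - \varepsilon_{m1k}) = -\phi_{mk}^2 + \varepsilon_{m1k}^2$, and the last line uses $(a - b)^2 \le 2a^2 + 2b^2$.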


Next, we solve the reference control by an NN ($u_f$ network in brief). According
to the state equation in (14.2.2), we use $x_k$ and $x_{k+1}$ to approximate the
reference control function $u_{fk}$, which is expressed as
$u_{fk} = F_u(x_k, x_{k+1})$. We notice that solving $u_{fk}$ needs the data of
$x_{k+1}$. Hence, off-line or historical data must be used to train the $u_f$
network. Let the number of hidden layer neurons be $L_u$. Let $W_u^*$ be the ideal
weight matrix. The NN representation of the $u_f$ network can be written as

\[
u_{fk} = W_u^{*T} \sigma_u(z_{uk}) + \varepsilon_{u1k},
\]

where $z_{uk} = [x_k, x_{k+1}]^T$ and $\varepsilon_{u1k}$ is the NN reconstruction
error such that $|\varepsilon_{u1k}| \le \bar{\varepsilon}_{u1}$ for a constant
$\bar{\varepsilon}_{u1} > 0$. Let $\sigma_u(z_{uk}) = \sigma(Y_u z_{uk})$, where
$Y_u$ is an arbitrary weight matrix with a suitable dimension.
The NN reference control is constructed as

\[
\hat{u}_{fk} = \hat{F}_u(x_k, x_{k+1}) = \hat{W}_{uk}^T \sigma_u(z_{uk}),
\]

where $\hat{u}_{fk}$ is the estimated reference control, and $\hat{W}_{uk}$ is the
estimated weight matrix. Define the identification error as

\[
\tilde{u}_{fk} = \hat{u}_{fk} - u_{fk} = \phi_{uk} - \varepsilon_{u1k},
\]

where $\phi_{uk} = \tilde{W}_{uk}^T \sigma_u(z_{uk})$ and
$\tilde{W}_{uk} = \hat{W}_{uk} - W_u^*$. The weight of the $u_f$ network is
adjusted to minimize the error function

\[
E_{uk} = \frac{1}{2} \tilde{u}_{fk}^2 .
\]

By the gradient-based adaptation rule, the weight is updated as

\[
\hat{W}_{u,k+1} = \hat{W}_{uk} - l_u \sigma_u(z_{uk}) \tilde{u}_{fk}, \tag{14.2.5}
\]

where $l_u > 0$ is the learning rate.


Theorem 14.2.2 Let the NN weight of the $u_f$ network be updated by (14.2.5). If there
exists a constant $0 < \lambda_u < 1$ such that
$\phi_{uk} \varepsilon_{u1k} \le \lambda_u \phi_{uk}^2$, then the error matrix
$\tilde{W}_{uk}$ asymptotically converges to zero, as $k \to \infty$.

14.3 Design of Neuro-Optimal Temperature Controller

In this section, a stable iterative ADP algorithm will be employed to obtain the
optimal control law such that the temperature of WGS system tracks the desired one
with convergence and stability analysis.

14.3.1 System Transformation

For WGS system (14.2.2), if we let the desired state be $\tau$, then we can define the
tracking error $e_k = x_k - \tau$. Let $u_{dk}$ be the corresponding desired reference
control (desired control in brief) for the desired state $\tau$. As the system
function is unknown, the desired control $u_{dk}$ cannot directly be obtained by the
WGS system (14.2.2). On the other hand, in real-world WGS systems, the disturbances
of the system and control input are both unavoidable. Thus, the system transformation
method with an accurate system model [34] is difficult to implement. To overcome these
difficulties, a system transformation with NN reconstruction errors and disturbances
is developed. First, according to the desired state $\tau$, we can obtain
$u_{dk} = F_u(\tau, \tau)$. Let

\[
\hat{u}_{dk} = \hat{F}_u(\tau, \tau) = \hat{W}_{uk}^T \sigma_u(\tau, \tau)
\]

be the output of the $u_f$ network. Let $\varepsilon_{u2k}$ be an unknown bounded
control disturbance such that $|\varepsilon_{u2k}| \le \bar{\varepsilon}_{u2}$ for a
constant $\bar{\varepsilon}_{u2} > 0$. Then, we can define the control error
$u_{ek}$ as
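
\[
u_{ek} = u_k - \hat{u}_{dk} - \varepsilon_{uk},
\]

where $\varepsilon_{uk} = \varepsilon_{u1k} + \varepsilon_{u2k}$. As
$\varepsilon_{u1k}$ and $\varepsilon_{u2k}$ are bounded, there exists a constant
$\bar{\varepsilon}_u > 0$ such that $|\varepsilon_{uk}| \le \bar{\varepsilon}_u$. On
the other hand, let $\hat{F}(z_k) = \hat{W}_{mk}^T \sigma_m(z_k)$ be the model NN
function. Let $\varepsilon_{m2k}$ be an unknown bounded system disturbance such that
$|\varepsilon_{m2k}| \le \bar{\varepsilon}_{m2}$ for a constant
$\bar{\varepsilon}_{m2} > 0$. Then, the tracking error system can be defined as

\begin{align*}
e_{k+1} &= \bar{F}(e_k, u_{ek}) \\
&= \hat{F}((e_k + \tau), (u_{ek} + \hat{u}_{dk})) - \tau + \nabla \hat{F}(\xi_u) \varepsilon_{uk} + \varepsilon_{mk}, \tag{14.3.1}
\end{align*}

where

\[
\nabla \hat{F}(\xi_u) = \frac{\partial \hat{F}((e_k + \tau), \xi_u)}{\partial \xi_u},
\]

\[
\xi_u = c_u (u_{ek} + \hat{u}_{dk}) + (1 - c_u)(u_{ek} + \hat{u}_{dk} + \varepsilon_{uk}),
\]

and $0 \le c_u \le 1$. Let $\varepsilon_{mk} = \varepsilon_{m1k} + \varepsilon_{m2k}$.
We have $|\varepsilon_{mk}| \le \bar{\varepsilon}_m$ for a constant
$\bar{\varepsilon}_m > 0$. Let the NN tracking error $\hat{e}_{k+1}$ be expressed as

\begin{align*}
\hat{e}_{k+1} &= F_e(e_k, \hat{u}_{ek}) \\
&= \hat{F}((e_k + \tau), (u_{ek} + \hat{u}_{dk})) - \tau. \tag{14.3.2}
\end{align*}

We can get $e_{k+1} = \hat{e}_{k+1} + \varepsilon_{ek}$, where we define

\[
\varepsilon_{ek} = \nabla \hat{F}(\xi_u) \varepsilon_{uk} + \varepsilon_{mk}
\]

as the system error such that $|\varepsilon_{ek}| \le \bar{\varepsilon}_e$ for a
constant $\bar{\varepsilon}_e > 0$.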

14.3.2 Derivation of Stable Iterative ADP Algorithm

In this section, our goal is to design an optimal control scheme such that the tracking
error $e_k$ converges to zero. Let

\[
U(e_k, u_{ek}) = Q e_k^2 + R u_{ek}^2
\]

be the utility function, where $Q$ and $R$ are positive constants. Define the cost
function as

\[
J(e_0, \underline{u}_{e0}) = \sum_{k=0}^{\infty} U(e_k, u_{ek}),
\]

where we let $\underline{u}_{ek} = (u_{ek}, u_{e,k+1}, \ldots)$. The optimal cost
function can be defined as

\[
J^*(e_k) = \inf_{\underline{u}_{ek}} \left\{ J(e_k, \underline{u}_{ek}) \right\}.
\]

According to the principle of optimality, $J^*(e_k)$ satisfies the Bellman equation

\[
J^*(e_k) = \min_{u_{ek}} \left\{ U(e_k, u_{ek}) + J^*(e_{k+1}) \right\}. \tag{14.3.3}
\]

Define the optimal control law as

\[
u_e^*(e_k) = \arg\min_{u_{ek}} \left\{ U(e_k, u_{ek}) + J^*(e_{k+1}) \right\}.
\]

Hence, the Bellman equation (14.3.3) can be written as

\[
J^*(e_k) = U(e_k, u_e^*(e_k)) + J^*(e_{k+1}). \tag{14.3.4}
\]

Generally speaking, $J^*(e_k)$ is a highly nonlinear and nonanalytical function,
which is nearly impossible to obtain by solving (14.3.4) directly. To overcome
this difficulty, a new ADP algorithm is developed to obtain the optimal control law
iteratively.
In the present stable iterative ADP algorithm, the value function and control law
are updated by iterations, with the iteration index $i$ increasing from 0 to infinity.
First, let $\mu(e_k)$ be an arbitrary admissible control law, and let $P(e_k)$ be the
corresponding value function which satisfies

\[
P(e_k) = U(e_k, \mu(e_k)) + P(e_{k+1}). \tag{14.3.5}
\]

Let the initial value function be $\hat{V}_0(e_k) = P(e_k)$. The control law
$\hat{v}_0(e_k)$ is obtained by

\[
\hat{v}_0(e_k) = \arg\min_{\hat{u}_{ek}} \left\{ U(e_k, \hat{u}_{ek}) + \hat{V}_0(\hat{e}_{k+1}) \right\} + \rho_0(e_k). \tag{14.3.6}
\]

Then, for $i = 1, 2, \ldots$, the iterative ADP algorithm will iterate between

\[
\hat{V}_i(e_k) = U(e_k, \hat{v}_{i-1}(e_k)) + \hat{V}_{i-1}(\hat{e}_{k+1}) + \pi_i(e_k), \tag{14.3.7}
\]

and

\[
\hat{v}_i(e_k) = \arg\min_{\hat{u}_{ek}} \left\{ U(e_k, \hat{u}_{ek}) + \hat{V}_i(\hat{e}_{k+1}) \right\} + \rho_i(e_k), \tag{14.3.8}
\]

where $\pi_i(e_k)$ and $\rho_i(e_k)$ are iteration errors and
$\hat{u}_{ek} = u_k - \hat{u}_{dk}$.

From the stable iterative ADP algorithm (14.3.6)–(14.3.8), we can see that the
iterative value function $\hat{V}_i(e_k)$ is used to approximate $J^*(e_k)$ and the
iterative control law $\hat{v}_i(e_k)$ is used to approximate $u_e^*(e_k)$. Therefore,
when $i \to \infty$, the algorithm should be convergent, i.e., $\hat{V}_i(e_k)$ and
$\hat{v}_i(e_k)$ converge to the optimal ones. In the next section, we will show the
properties of the present iterative ADP algorithm.
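The iteration (14.3.6)–(14.3.8) can be sketched on a toy problem. The example below is only an illustration under strong assumptions: the error dynamics are taken as the hypothetical scalar linear model $\hat{e}_{k+1} = a e_k + b \hat{u}_{ek}$, the iteration errors are set to $\pi_i = \rho_i = 0$, and the value function is parameterized exactly as $\hat{V}_i(e_k) = p_i e_k^2$ instead of by a critic network. The function name and all parameter values are illustrative choices.

```python
def iterative_adp(a=0.9, b=0.5, Q=1.0, R=1.0, iterations=100):
    """Iterate (14.3.7)-(14.3.8) for V_i(e) = p_i * e**2 and v_i(e) = -gain_i * e.

    Assumes |a| < 1 so that mu(e) = 0 is an admissible initial control law.
    """
    # Initial value function V_0 = P from mu(e) = 0:
    # P(e) = sum_k Q * (a**k * e)**2 = Q * e**2 / (1 - a**2).
    p = Q / (1.0 - a * a)
    # v_0 minimizes Q e^2 + R u^2 + p (a e + b u)^2, cf. (14.3.6).
    gain = a * b * p / (R + b * b * p)
    history = [p]
    for _ in range(iterations):
        closed_loop = a - b * gain
        # Value update (14.3.7) with pi_i = 0: utility under v_{i-1}
        # plus the previous value function at the next error.
        p_new = Q + R * gain ** 2 + p * closed_loop ** 2
        # Control update (14.3.8) with rho_i = 0, using the new value function.
        gain = a * b * p_new / (R + b * b * p_new)
        p = p_new
        history.append(p)
    return history, gain

history, gain = iterative_adp()
```

In this error-free setting the sequence `history` is nonincreasing and settles at the solution of the scalar Riccati equation $p = Q + a^2 p R / (R + b^2 p)$; with nonzero $\pi_i$ and $\rho_i$, only convergence to a neighborhood of the optimum can be expected, which is the subject of the analysis that follows.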

14.3.3 Properties of Stable Iterative ADP Algorithm with Approximation Errors and Disturbances

In the iterative ADP algorithm (14.3.6)–(14.3.8), because of the existence of system
errors, iteration errors, and disturbances, the convergence analysis methods for
accurate ADP algorithms are no longer valid. In this chapter, inspired by [13, 14],
an "error bound"-based convergence and stability analysis is developed. First, we
define a new value function

\[
\Gamma_i(e_k) = \min_{u_{ek}} \left\{ U(e_k, u_{ek}) + \hat{V}_i(e_{k+1}) \right\}. \tag{14.3.9}
\]

Then, we can derive the following theorem.

Theorem 14.3.1 For $i = 0, 1, \ldots$, let the iterative value function
$\hat{V}_i(e_k)$ and the iterative control law $\hat{v}_i(e_k)$ be obtained by
(14.3.6)–(14.3.8). Let $\Gamma_i(e_k)$ be expressed as in (14.3.9). Then, there exists
a constant $\sigma \ge 1$ such that

\[
\hat{V}_i(e_k) \le \sigma \Gamma_i(e_k) \tag{14.3.10}
\]

holds uniformly.

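Proof For all $i = 0, 1, \ldots$, if we let

\[
\bar{v}_i(e_k) = \arg\min_{\hat{u}_{ek}} \left\{ U(e_k, \hat{u}_{ek}) + \hat{V}_i(\hat{e}_{k+1}) \right\}, \tag{14.3.11}
\]

then

\[
\bar{v}_i(e_k) = \hat{v}_i(e_k) - \rho_i(e_k).
\]

According to (14.3.7), we have

\[
\hat{V}_i(e_k) = U(e_k, \bar{v}_{i-1}(e_k) + \rho_{i-1}(e_k)) + \pi_i(e_k)
+ \hat{V}_{i-1}(F_e(e_k, \bar{v}_{i-1}(e_k) + \rho_{i-1}(e_k))).
\]

Let

\[
\nabla U(\xi) = \frac{\partial U(e_k, \xi)}{\partial \xi}
\quad \text{and} \quad
\nabla V_i(\xi) = \frac{\partial \hat{V}_i(F_e(e_k, \xi))}{\partial \xi}.
\]

Let $0 \le c_{Ui} \le 1$, $0 \le \bar{c}_{Ui} \le 1$, $0 \le c_{Vi} \le 1$, and
$0 \le \bar{c}_{Vi} \le 1$ be constants and let

\[
\xi_{Ui} = c_{Ui} \hat{v}_i(e_k) + (1 - c_{Ui}) \bar{v}_i(e_k), \qquad
\bar{\xi}_{Ui} = \bar{c}_{Ui} \hat{u}_{ek} + (1 - \bar{c}_{Ui}) u_{ek},
\]

\[
\xi_{Vi} = c_{Vi} \hat{V}_i(F_e(e_k, \bar{v}_i(e_k))) + (1 - c_{Vi}) \hat{V}_i(F_e(e_k, \hat{v}_i(e_k))),
\]

and

\[
\bar{\xi}_{Vi} = \bar{c}_{Vi} \hat{V}_i(F_e(e_k, \hat{u}_e(e_k))) + (1 - \bar{c}_{Vi}) \hat{V}_i(F_e(e_k, u_e(e_k))).
\]

Then, the iterative value function $\hat{V}_i(e_k)$ can be expressed as

\begin{align*}
\hat{V}_i(e_k) &= U(e_k, \bar{v}_{i-1}(e_k) + \rho_{i-1}(e_k)) + \pi_i(e_k)
 + \hat{V}_{i-1}(F_e(e_k, \bar{v}_{i-1}(e_k) + \rho_{i-1}(e_k))) \\
&= \min_{u_{ek}} \left\{ U(e_k, u_{ek}) + \nabla U(\bar{\xi}_{Ui}) \varepsilon_{uk}
 + \hat{V}_{i-1}(e_{k+1}) + \nabla V_i(\bar{\xi}_{Vi}) \varepsilon_{ek} \right\} + \pi_i(e_k) \\
&\quad + \nabla U(\xi_{Ui}) \rho_i(e_k) + \nabla V_i(\xi_{Vi}) \rho_i(e_k).
\end{align*}

As $\nabla U(\bar{\xi}_{Ui})$, $\nabla V_i(\bar{\xi}_{Vi})$, $\nabla U(\xi_{Ui})$, and
$\nabla V_i(\xi_{Vi})$ are upper bounded, if we let
$|\nabla U(\bar{\xi}_{Ui}) \varepsilon_{uk}| \le \bar{\varepsilon}_{Ui}$,
$|\nabla V_i(\bar{\xi}_{Vi}) \varepsilon_{ek}| \le \bar{\varepsilon}_{Vi}$,
$|\nabla U(\xi_{Ui}) \rho_i(e_k)| \le \varepsilon_{Ui}$,
$|\nabla V_i(\xi_{Vi}) \rho_i(e_k)| \le \varepsilon_{Vi}$, and
$|\pi_i(e_k)| \le \varepsilon_{\pi i}$ for constants $\bar{\varepsilon}_{Ui}$,
$\bar{\varepsilon}_{Vi}$, $\varepsilon_{Ui}$, $\varepsilon_{Vi}$, and
$\varepsilon_{\pi i}$, then

\[
\hat{V}_i(e_k) \le \Gamma_i(e_k) + \varepsilon_i, \tag{14.3.12}
\]

where $\varepsilon_i = \bar{\varepsilon}_{Ui} + \bar{\varepsilon}_{Vi} + \varepsilon_{Ui} + \varepsilon_{Vi} + \varepsilon_{\pi i}$
is finite. Hence, for $i = 0, 1, \ldots$, there exists a $\sigma \ge 1$ such that
(14.3.10) holds uniformly. The proof is complete.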
From Theorem 14.3.1, we can see that, for $i = 0, 1, \ldots$, there must exist a
finite $\sigma \ge 1$ such that (14.3.10) holds uniformly. Thus, $\sigma$ can be seen
as a uniform approximation error. Then, we can derive the following theorem.

Theorem 14.3.2 For all $i = 0, 1, \ldots$, let $\Gamma_i(e_k)$ be expressed as in
(14.3.9) and satisfy (14.3.10), where $\sigma \ge 1$ is a constant. Let
$0 < \gamma < \infty$ and $1 \le \delta < \infty$ both be constants such that

\begin{align*}
J^*(\bar{F}(e_k, u_{ek})) &\le \gamma U(e_k, u_{ek}), \\
\hat{V}_0(e_k) &\le \delta J^*(e_k)
\end{align*}

hold uniformly. If the constant $\sigma$ in (14.3.10) satisfies

\[
\sigma \le 1 + \frac{\delta - 1}{\gamma \delta}, \tag{14.3.13}
\]

then the iterative value function $\hat{V}_i(e_k)$ converges to a finite neighborhood
of the optimal cost function $J^*(e_k)$.
Proof The theorem can be proven in two steps. First, using mathematical induction,
we will prove that, for i = 0, 1, . . ., the iterative value function V̂i (ek ) satisfies
⎛ ⎞
i
γ j σ j−1 (σ − 1) γ i σ i (δ − 1) ⎠ ∗
V̂i (ek ) ≤ σ ⎝1 + + J (ek ). (14.3.14)
j=1
(γ + 1) j (γ + 1)i
580 14 Data-Based Neuro-Optimal Temperature Control …

Let i = 0. Then, (14.3.14) becomes V̂0 (ek ) ≤ σ δ J ∗ (ek ). We have the conclusion
holds for i = 0. Assume that (14.3.14) holds for i = l − 1, l = 1, 2, . . .. Then, for
i = l, we have
⎧⎛
⎨ l−1
γ j−1 σ j−1 (σ − 1)
Γl (ek ) ≤ min ⎝1 + γ
u ek ⎩ (γ + 1) j
j=1

γ l−1 σ l−1 (σ δ − 1)
+ U (ek , u ek )
(γ + 1)l
⎡ ⎛ ⎞
l
γ σ
j j−1
(σ − 1) γ σ
l l
(δ − 1)
+ ⎣σ ⎝1 + + ⎠
j=1
(γ + 1) j (γ + 1)l
⎛ ⎞⎤
l−1
γ σ
j−1 j−1
(σ − 1)s γ σ
l−1 l−1
(σ δ − 1)
−⎝ + ⎠⎦
j=1
(γ + 1) j (γ + 1)l


×J ∗ ( F̄(ek , u ek ))

⎛ ⎞
l
γ σ
j j−1
(σ − 1) γ σ
l l
(δ − 1)
= ⎝1+ + ⎠ J ∗ (ek ),
j=1
(γ + 1) j
(γ + 1) l

which proves (14.3.14). The mathematical induction is complete.


Second, according to (14.3.13), we have $\gamma\sigma/(\gamma+1) < 1$, and hence, the geometric series $\sum_{j=1}^{i} \gamma^{j}\sigma^{j-1}(\sigma-1)/(\gamma+1)^{j}$ converges to a finite value as i → ∞. According to (14.3.14), the iterative value function V̂i (ek ) is convergent to a finite neighborhood of the optimal cost function J ∗ (ek ). This completes the proof of the theorem.
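The bound in (14.3.14) can be checked numerically. The following is a minimal sketch, not part of the original analysis: it evaluates the coefficient of J ∗ (ek ) on the right-hand side of (14.3.14) and compares it with the closed-form limit σ/(1 + γ − γσ), which is obtained here by summing the geometric series. The function names and the values of γ, δ, and σ are illustrative assumptions.

```python
def bound_coefficient(i, gamma, sigma, delta):
    """Coefficient multiplying J*(e_k) on the right-hand side of (14.3.14)."""
    r = gamma / (gamma + 1.0)
    series = sum(r ** j * sigma ** (j - 1) * (sigma - 1.0) for j in range(1, i + 1))
    tail = (r * sigma) ** i * (delta - 1.0)
    return sigma * (1.0 + series + tail)


def sigma_upper_limit(gamma, delta):
    """Largest sigma permitted by condition (14.3.13)."""
    return 1.0 + (delta - 1.0) / (gamma * delta)


def limiting_coefficient(gamma, sigma):
    """Limit of the coefficient as i -> infinity (requires gamma*sigma < gamma + 1)."""
    return sigma / (1.0 + gamma - gamma * sigma)


# Illustrative values only; sigma = 1.1 is below the limit 4/3 given by (14.3.13).
gamma, delta, sigma = 2.0, 3.0, 1.1
coefficients = [bound_coefficient(i, gamma, sigma, delta) for i in range(80)]
print(coefficients[0], coefficients[-1], limiting_coefficient(gamma, sigma))
```

With these values the coefficient starts at σδ for i = 0 and decreases toward the limit, which is the size of the neighborhood of J ∗ (ek ) referred to in the theorem. When σ equals the upper limit in (14.3.13), the coefficient stays at σδ for every i.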

Next, we can derive the stability property.


Theorem 14.3.3 For i = 0, 1, . . ., let V̂i (ek ) and v̂i (ek ) be obtained by (14.3.7) and
(14.3.8), respectively. Then, the tracking error system (14.3.1) is UUB under the
iterative control law v̂i (ek ).

Proof According to (14.3.14), for i = 0, 1, . . ., let


\[
\chi_i = \sigma\left(1 + \sum_{j=1}^{i} \frac{\gamma^{j}\sigma^{j-1}(\sigma-1)}{(\gamma+1)^{j}} + \frac{\gamma^{i}\sigma^{i}(\delta-1)}{(\gamma+1)^{i}}\right). \tag{14.3.15}
\]

Define a new iterative value function as V̄i (ek ) = χi J ∗ (ek ), where χi is defined
as in (14.3.15). According to (14.3.13), we can get χi+1 − χi ≤ 0, which means
V̄i+1 (ek ) ≤ V̄i (ek ). Let

ξV̄ i = cV̄ i V̄i (Fe (ek , v̄i (ek ))) + (1 − cV̄ i )V̄i (Fe (ek , v̂i (ek ))),

for 0 ≤ cV̄ i ≤ 1. Let |∇(ξV̄ i )εe | ≤ εV̄ i for a constant εV̄ i , and we can get

V̄i (ek+1 ) − V̄i (ek ) ≤ −U (ek , v̂i (ek )) + ∇(ξV̄ i )εe
≤ −U (ek , v̂i (ek )) + εV̄ i .

Define a new state error set Ωe = {ek : U (ek , v̂i (ek )) ≤ εV̄ i }. As U (ek , v̂i (ek )) is a
positive-definite function, |ek | is finite for ek ∈ Ωe . We define

\[
e_M = \sup_{x \in \Omega_e} \{|x|\}.
\]

Define two scalar functions α(|ek |) and β(|ek |) which satisfy the following two
conditions.
(1) If |ek | ≤ e M , then

α(|ek |) = β(|ek |) = V̄i (ek ). (14.3.16)

(2) If |ek | > e M , then α(|ek |) and β(|ek |) are both monotonically increasing functions
and satisfy

0 < α(|ek |) ≤ V̄i (ek ) ≤ β(|ek |). (14.3.17)

For an arbitrary constant ς > e M , there exists a ϱ(ς ) > e M such that β(ϱ) ≤ α(ς ).
For T = 1, 2, . . ., if |ek | > e M and |ek+T | > e M , then V̄i (ek+T ) − V̄i (ek ) ≤ 0.
Hence, for all |ek | > e M satisfying e M < |ek | ≤ β(ϱ), there exists a T > 0 such
that

α(ς ) ≥ β(ϱ) ≥ V̄i (ek ) ≥ V̄i (ek+T ) ≥ α(|ek+T |),

which yields |ek+T | ≤ ς . Therefore, for all |ek | > e M , there exists a T = 1, 2, . . .
such that |ek+T | ≤ ς . As ς is arbitrary, let ς → e M . Then, we can obtain |ek+T | ≤ e M .
According to the definition in [10], ek is UUB.
Next, for V̂i (ek ) ≤ V̄i (ek ), there exist time instants T0 and T1 such that

V̄i (ek ) ≥ V̄i (ek+T0 ) ≥ V̂i (ek ) ≥ V̄i (ek+T1 ) (14.3.18)

for all |ek | > e M , |ek+T0 | > e M , and |ek+T1 | > e M . Choose ς1 > 0 to satisfy V̂i (ek ) ≥
α(ς1 ) ≥ V̄i (ek+T1 ). Then, there exists ϱ1 (ς1 ) > 0 such that

α(ς1 ) ≥ β(ϱ1 ) ≥ V̄i (ek+T1 ).



According to (14.3.18) and the definition of α(|ek |) and β(|ek |) in (14.3.16) and
(14.3.17), we have

α(ς ) ≥ β(ϱ) ≥ V̂i (ek ) ≥ α(ς1 ) ≥ β(ϱ1 ) ≥ V̄i (ek+T1 ) ≥ α(|ek+T1 |).

For an arbitrary constant ς > e M , we can obtain |ek+T1 | ≤ ς , which shows that v̂i (ek )
is a UUB control law for the tracking error system (14.3.1). This completes the proof
of the theorem.

Corollary 14.3.1 For i = 0, 1, . . ., let V̂i (ek ) and v̂i (ek ) be obtained by (14.3.7) and
(14.3.8), respectively. If

U (ek , v̂i (ek )) > ∇Vi (ξV̄ i )εek ,

∀ek , then the iterative control law v̂i (ek ) is an asymptotically stable control law for
system (14.3.1).

14.4 Neural Network Implementation for the Optimal Tracking Control Scheme

In this section, neural networks, including action network and critic network, are
used to implement the present stable iterative ADP algorithm. The whole structural
diagram is shown in Fig. 14.1.
For all i = 0, 1, . . ., the critic network is used to approximate the value function in
(14.3.8). Collect an array of tracking errors $E_k = \{e_k^1, \ldots, e_k^p\}$, where p is a large integer. For j = 0, 1, . . . , p, let the output of the critic network be $\hat V_i^j(e_k) = W_c^{ijT}\sigma_c(e_k)$,
where σc (ek ) = σ (Yc ek ) and Yc is an arbitrary matrix with a suitable dimension.

Fig. 14.1 The structural diagram of the stable iterative ADP algorithm

The target function can be written as

\[
V_i(e_k) = U(e_k, \hat v_{i-1}(e_k)) + \hat V_{i-1}(\hat e_{k+1}).
\]

Define the error function of the critic network as the difference between the network output and the target, $\vartheta_{ci}^j(e_k^j) = \hat V_i^j(e_k^j) - V_i(e_k^j)$. The
weights of the critic network are updated as [15, 25]

\[
W_c^{i(j+1)} = W_c^{ij} - l_c\,\vartheta_{ci}^j(e_k^j)\,\sigma_c(e_k^j), \tag{14.4.1}
\]

where $\|\sigma_c(e_k^j)\| \le \sigma_C$ for a constant σC and lc > 0 is the learning rate of the critic
network.
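The update rule (14.4.1) can be illustrated with a small, self-contained sketch. It is only an illustration under stated assumptions: the tracking error is taken to be scalar, tanh is used as the activation σ, the matrix Yc is drawn at random and kept fixed, and the targets are generated from hypothetical ideal weights `W_star` with zero residual. The error is computed as network output minus target, which is the sign convention under which (14.4.1) is a gradient-descent step.

```python
import math
import random


def sigma_c(e, Y_c):
    """Hidden-layer output sigma(Y_c e) for a scalar e; tanh is an assumed activation."""
    return [math.tanh(y * e) for y in Y_c]


def critic_output(W, e, Y_c):
    """Critic output W^T sigma_c(e)."""
    return sum(w * s for w, s in zip(W, sigma_c(e, Y_c)))


def critic_step(W, e, target, Y_c, l_c):
    """One update of (14.4.1): W <- W - l_c * theta * sigma_c(e)."""
    s = sigma_c(e, Y_c)
    theta = sum(w * si for w, si in zip(W, s)) - target   # output minus target
    return [w - l_c * theta * si for w, si in zip(W, s)]


def train_critic(samples, targets, Y_c, l_c, sweeps):
    """Sweep repeatedly over the sample array E_k, starting from zero weights."""
    W = [0.0] * len(Y_c)
    for _ in range(sweeps):
        for e, v in zip(samples, targets):
            W = critic_step(W, e, v, Y_c, l_c)
    return W


rng = random.Random(0)
Y_c = [rng.gauss(0.0, 1.0) for _ in range(8)]          # fixed, arbitrary hidden weights
samples = [rng.uniform(-1.0, 1.0) for _ in range(50)]  # hypothetical tracking errors
W_star = [rng.gauss(0.0, 1.0) for _ in range(8)]       # hypothetical ideal weights
targets = [critic_output(W_star, e, Y_c) for e in samples]

W = train_critic(samples, targets, Y_c, l_c=0.1, sweeps=200)
print(max(abs(critic_output(W, e, Y_c) - v) for e, v in zip(samples, targets)))
```

With eight tanh units the squared norm of σc is at most 8, so the step size 0.1 keeps lc σC² below 2, and the distance between the weights and `W_star` cannot increase from one update to the next.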
The action network is used to approximate the iterative control law v̄i (ek ), where
v̄i (ek ) is defined by (14.3.11). The output can be formulated as $\hat v_i^j(e_k) = W_a^{ijT}\sigma_a(e_k)$,
where σa (ek ) = σ (Ya ek ). Let Ya be an arbitrary matrix with a suitable dimension.
According to Ek , we can define the output error of the action network as $\vartheta_{ai}^j(e_k^j) = \hat v_i^j(e_k^j) - \bar v_i(e_k^j)$, j = 1, 2, . . . , p. The weights of the action network can be updated as

\[
W_a^{i(j+1)} = W_a^{ij} - l_a\,\sigma_a(e_k^j)\,\vartheta_{ai}^j(e_k^j), \tag{14.4.2}
\]

where $\|\sigma_a(e_k^j)\| \le \sigma_A$ for a constant σ A and la > 0 is the learning rate of the action
network. The weight convergence property of the neural networks is shown in the
following theorem.
Theorem 14.4.1 For j = 1, 2, . . . , p, let the ideal critic and action network functions be expressed by

\[
V_i(e_k^j) = W_c^{*iT}\sigma_c(e_k^j) + \varepsilon_{ci}(e_k^j)
\]

and

\[
\bar v_i(e_k^j) = W_a^{*iT}\sigma_a(e_k^j) + \varepsilon_{ai}(e_k^j),
\]

respectively. The critic and action networks are trained by (14.4.1) and (14.4.2),
respectively. Let $\tilde W_c^{ij} = W_c^{ij} - W_c^{*i}$ and $\tilde W_a^{ij} = W_a^{ij} - W_a^{*i}$. For all i = 1, 2, . . ., if
there exist constants 0 < λc < 1 and 0 < λa < 1 such that

\[
\phi_{cik}^j\,\varepsilon_{ci}(e_k^j) \le \lambda_c \big(\phi_{cik}^j\big)^2 \quad\text{and}\quad \phi_{aik}^j\,\varepsilon_{ai}(e_k^j) \le \lambda_a \big(\phi_{aik}^j\big)^2,
\]

respectively, where $\phi_{cik}^j = \tilde W_c^{ijT}\sigma_c(e_k^j)$ and $\phi_{aik}^j = \tilde W_a^{ijT}\sigma_a(e_k^j)$, then the error matrices $\tilde W_c^{ij}$ and $\tilde W_a^{ij}$ converge to zero, as j → ∞.

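Proof Consider the following Lyapunov function candidate

\[
L\big(\tilde W_c^{ij}, \tilde W_a^{ij}\big) = \frac{1}{l_c}\,\mathrm{tr}\big(\tilde W_c^{ijT}\tilde W_c^{ij}\big) + \frac{1}{l_a}\,\mathrm{tr}\big(\tilde W_a^{ijT}\tilde W_a^{ij}\big).
\]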

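The difference of the Lyapunov function candidate is given by

\[
\begin{aligned}
\Delta L\big(\tilde W_c^{ij}, \tilde W_a^{ij}\big) \le{}& -2\Big(\big(\phi_{cik}^j\big)^2 - \phi_{cik}^j\,\varepsilon_{ci}(e_k^j)\Big) - 2\Big(\big(\phi_{aik}^j\big)^2 - \phi_{aik}^j\,\varepsilon_{ai}(e_k^j)\Big) \\
&+ l_c\sigma_C^2\big(\phi_{cik}^j - \varepsilon_{ci}(e_k^j)\big)^2 + l_a\sigma_A^2\big(\phi_{aik}^j - \varepsilon_{ai}(e_k^j)\big)^2 \\
\le{}& \big(-2(1-\lambda_c) + l_c\sigma_C^2(1+\chi_c)\big)\big(\phi_{cik}^j\big)^2 \\
&+ \big(-2(1-\lambda_a) + l_a\sigma_A^2(1+\chi_a)\big)\big(\phi_{aik}^j\big)^2,
\end{aligned}
\]

where we let χc > 0 and χa > 0 be constants such that $\big(\varepsilon_{ci}(e_k^j)\big)^2 \le \chi_c\big(\phi_{cik}^j\big)^2$ and $\big(\varepsilon_{ai}(e_k^j)\big)^2 \le \chi_a\big(\phi_{aik}^j\big)^2$, respectively. Selecting lc and la such that

\[
l_c < \frac{2(1-\lambda_c)}{\sigma_C^2(1+\chi_c)}, \qquad l_a < \frac{2(1-\lambda_a)}{\sigma_A^2(1+\chi_a)},
\]

we have $\Delta L\big(\tilde W_c^{ij}, \tilde W_a^{ij}\big) \le 0$, j = 1, 2, . . . , p. Let j → ∞, and we can obtain the
conclusion. This completes the proof of the theorem.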
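The learning-rate bounds in the proof are simple to evaluate. The sketch below is a worked example rather than a prescription: it assumes a hidden layer of 15 tanh units (so that the squared norm of the activation vector is at most 15) and uses hypothetical values for λc and χc, which the text does not specify.

```python
import math


def max_learning_rate(lam, sigma_bound, chi):
    """Upper bound 2(1 - lam) / (sigma_bound^2 (1 + chi)) from the proof of Theorem 14.4.1."""
    return 2.0 * (1.0 - lam) / (sigma_bound ** 2 * (1.0 + chi))


def delta_L_coefficient(rate, lam, sigma_bound, chi):
    """Coefficient of phi^2 in the bound on Delta L; a negative value means L decreases."""
    return -2.0 * (1.0 - lam) + rate * sigma_bound ** 2 * (1.0 + chi)


sigma_C = math.sqrt(15.0)   # assumed: 15 tanh units, each bounded by 1 in magnitude
lam_c, chi_c = 0.5, 1.0     # hypothetical constants

print(max_learning_rate(lam_c, sigma_C, chi_c))            # about 1/30
print(delta_L_coefficient(0.001, lam_c, sigma_C, chi_c))   # negative for rate 0.001
```

Under these assumed constants, a learning rate of 0.001, the value used in the numerical study of this chapter, lies well inside the admissible range.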

Based on the above analysis, the whole data-driven stable iterative ADP algorithm
for the WGS system can be summarized in Algorithm 14.4.1.

Algorithm 14.4.1 Data-driven stable iterative ADP algorithm


NN modeling and system transformation:
Step 1. Collect an array of system data of the WGS system (14.2.2).
Step 2. Establish model network, where the NN training rule is expressed as in (14.2.4).
Step 3. Establish u f network, where the NN training rule is expressed as in (14.2.5).
Step 4. Transform the WGS tracking system (14.2.2) into an error regulation system (14.3.2).
Stable iterative ADP algorithm:
Step 5. Let i = 0 and V0 (ek ) = P(ek ), where P(ek ) satisfies (14.3.5).
Step 6. Update the iterative value function V̂i (ek ) by (14.3.7). Compute the iterative control law
v̂i (ek ) by (14.3.8).
Step 7. If the approximation error σ satisfies (14.3.13), then goto Step 8; else, reduce the approxi-
mation error σ and goto Step 6.
Step 8. If |V̂i (ek ) − V̂i−1 (ek )| ≤ ζ , then goto next step; else, let i = i + 1 and goto Step 6.
Step 9. Return V̂i (ek ) and v̂i (ek ).
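Steps 5, 6, 8, and 9 of Algorithm 14.4.1 can be sketched on a toy problem. The code below is a simplified illustration and not the implementation used in this chapter: the value function is tabulated on a grid instead of being represented by a critic network, so the check of the approximation error in Step 7 is omitted, and the scalar error dynamics `F`, the utility `U`, and the initial function `P` are hypothetical stand-ins for the trained model network and for the quantities defined earlier in the chapter.

```python
def make_grid(half_width, n):
    """Uniform grid on [-half_width, half_width] with 2n+1 points (0 is included)."""
    return [half_width * (k - n) / n for k in range(2 * n + 1)]


def interp(grid, values, x):
    """Piecewise-linear interpolation on a uniform grid, clamped at both ends."""
    x = min(max(x, grid[0]), grid[-1])
    h = grid[1] - grid[0]
    k = min(int((x - grid[0]) / h), len(grid) - 2)
    t = (x - grid[k]) / h
    return (1.0 - t) * values[k] + t * values[k + 1]


def stable_iterative_adp(F, U, P, e_grid, u_grid, zeta=1e-4, max_iter=200):
    """Steps 5, 6, 8 and 9 of Algorithm 14.4.1 on a tabulated value function."""
    V = [P(e) for e in e_grid]                       # Step 5: V_0 = P
    policy = [0.0] * len(e_grid)
    iterations = 0
    for i in range(1, max_iter + 1):
        iterations = i
        V_new = []
        for idx, e in enumerate(e_grid):             # Step 6: value and control update
            costs = [U(e, u) + interp(e_grid, V, F(e, u)) for u in u_grid]
            best = min(range(len(u_grid)), key=costs.__getitem__)
            V_new.append(costs[best])
            policy[idx] = u_grid[best]
        gap = max(abs(a - b) for a, b in zip(V_new, V))
        V = V_new
        if gap <= zeta:                              # Step 8: stopping test
            break
    return V, policy, iterations                     # Step 9


def F(e, u):
    """Hypothetical scalar error dynamics standing in for the NN model."""
    return 0.9 * e + 0.5 * u


def U(e, u):
    """Hypothetical quadratic utility function."""
    return e * e + u * u


def P(e):
    """Hypothetical initial value function V_0."""
    return 5.0 * e * e


e_grid = make_grid(2.0, 40)
u_grid = make_grid(2.0, 40)
V, policy, iterations = stable_iterative_adp(F, U, P, e_grid, u_grid)
print(iterations, V[60], policy[60])   # e_grid[60] is the grid point e = 1
```

For this linear-quadratic toy problem the optimal cost is known from the scalar Riccati equation to be approximately 2.12 e², which gives a reference value for the tabulated result at e = 1.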

Remark 14.4.1 One property should be pointed out. For all i = 1, 2, . . ., if we define
the approximation error function εi (ek ) as

V̂i (ek ) = Γi (ek ) + εi (ek ),

then according to (14.3.12), we have εi (ek ) ≤ ε, where ε = sup{εi }, i = 0, 1, . . ..


According to (14.3.10) and (14.3.13), we can obtain the following equivalent
convergence criterion

\[
\varepsilon_i(e_k) \le \frac{\hat V_i(e_k)(\delta-1)}{\gamma\delta+\delta-1}. \tag{14.4.3}
\]

From (14.4.3), we can see that if |ek | is large, then the present iterative ADP algorithm
permits convergence under large approximation errors, and if |ek | is small, then
small approximation errors are required to ensure the convergence of the iterative
ADP algorithm. Owing to the existence of the approximation errors and disturbances, the
convergence criterion (14.4.3) cannot generally be satisfied for every ek . Define a
new tracking error set

\[
\Theta_e = \left\{ e_k : \varepsilon_i(e_k) > \frac{\hat V_i(e_k)(\delta-1)}{\gamma\delta+\delta-1} \right\}.
\]

As εi (ek ) ≤ ε is finite, if we define $\Upsilon = \sup_{e_k\in\Theta_e}\{|e_k|\}$, then Υ is finite. Thus, for all
$e_k \in \bar\Theta_e$, where $\bar\Theta_e \triangleq \mathbb{R}^n \setminus \Theta_e$, we can get that V̂i (ek ) is convergent, i.e.,

\[
\hat V_\infty(e_k) = \lim_{i\to\infty} \hat V_i(e_k).
\]
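The criterion (14.4.3) can be evaluated directly. The following sketch computes the admissible approximation error for a given value V̂i (ek ) and tests membership in the set Θe. The quadratic value function and the constants γ and δ used in the example are hypothetical and serve only to show how the admissible error shrinks as |ek | decreases.

```python
def admissible_error(v_hat, gamma, delta):
    """Right-hand side of the convergence criterion (14.4.3)."""
    return v_hat * (delta - 1.0) / (gamma * delta + delta - 1.0)


def violates_criterion(eps, v_hat, gamma, delta):
    """True if the approximation error eps places e_k in the set Theta_e."""
    return eps > admissible_error(v_hat, gamma, delta)


gamma, delta = 2.0, 3.0                   # illustrative constants
for e in (0.01, 0.1, 1.0):
    v_hat = 2.0 * e * e                   # hypothetical quadratic value function
    print(e, admissible_error(v_hat, gamma, delta))
```

For these constants the admissible error is one quarter of V̂i (ek ), which equals V̂i (ek )(σ − 1)/σ with σ taken at the upper limit in (14.3.13).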

14.5 Numerical Analysis

In this section, numerical experiments are conducted to show the effectiveness
of the present stable iterative ADP algorithm with approximation errors and disturbances. For the WGS system (14.2.2), we let the initial reaction temperature be
x0 = 273 °C and the desired reaction temperature be τ = 375 °C. Observing the volume percentage compositions in the inlet water gas of the WGS system, we
obtain [θCO , θCO2 , θH2 , θH2 O ] = [24.39%, 14.71%, 22.79%, 38.11%].
To model the WGS system (14.2.2), we collect 20,000 input-state data from the actual
WGS operational system. Then, three-layer BP NNs are established with the structures of 2–15–1 and 2–15–1 to approximate the WGS system and the reference control,
respectively. The disturbances of the system and the control input are given in Fig. 14.2a, b,
respectively. Let the learning rates of the NNs be 0.001 and implement our iterative ADP
algorithm for 25 iterations. The curve of the admissible approximation errors for the
present ADP algorithm is displayed in Fig. 14.4. From Fig. 14.4, we can see that
different ek and iteration indices i require different approximation errors to guarantee the convergence of our iterative ADP algorithm. Let ε̄ = max{εm , εu , ρi , πi }
be the maximum reconstruction error of NNs for i = 0, 1, . . . , 25. We choose two
reconstruction errors ε̄, namely $10^{-4}$ and $10^{-6}$, to train the neural
networks. The convergence trajectories of the iterative value functions are shown in
Fig. 14.2c, d, respectively.
Implement the iterative control law for the WGS system (14.2.2). Let the implementation time be T f = 100; the trajectories of the states and controls are displayed
in Fig. 14.3a–d, respectively. From the numerical results, we can see that by the stable
iterative ADP algorithm, the iterative control law can guarantee the tracking error
system to be UUB, which shows the robustness of the present algorithm. Moreover,
we can see that if we enhance the training precisions of the NNs, such as reducing
ε̄ from $10^{-4}$ to $10^{-6}$, then the approximation errors can be reduced and the system
state will be closer to the desired one. The optimal state and control trajectories for
ε̄ = $10^{-6}$ are shown in Fig. 14.5a, b, respectively. In real-world neural network
training, the training precision of NNs is generally set to a uniform one. Thus, it is
recommended that the present iterative ADP algorithm be implemented with a high
training precision which allows the iterative value function to converge for most of
the state space.

Fig. 14.2 Disturbances and iterative value function. a System disturbance. b Control disturbance.
c Iterative value function under ε̄ = $10^{-4}$. d Iterative value function under ε̄ = $10^{-6}$. (Horizontal axes: time steps, ×10^4, in a and b; iteration steps in c and d.)
Fig. 14.3 Iterative trajectories of states and controls for different ε̄’s. a State for ε̄ = $10^{-4}$. b Control
for ε̄ = $10^{-4}$. c State for ε̄ = $10^{-6}$. d Control for ε̄ = $10^{-6}$. (Axes: system state (°C) and iterative control (m³/s) versus time steps; curves are shown for the initial iteration and the limiting iteration.)

Fig. 14.4 The curve of the admissible approximation errors. (Axes: admissible approximation error versus state and iteration steps.)
Fig. 14.5 Comparisons by ADP and MFAC. a State trajectory by ADP. b Control trajectory by
ADP. c State trajectory by MFAC. d Control trajectory by MFAC. (Axes: system state (°C) and control (m³/s) versus time steps.)

On the other hand, to show the effectiveness of the stable iterative ADP algorithm,
numerical results by our algorithm will be compared with the ones by the data-driven
model-free adaptive control (MFAC) algorithm [7]. According to [7], the controller
is designed by

\[
u_k = u_{k-1} + \frac{\rho\,\Phi_k(\tau - x_k)}{\lambda + \Phi_k^2},
\]

and

\[
\Phi_k = \Phi_{k-1} + \frac{\eta(\Delta x_k - \Phi_{k-1}\Delta u_{k-1})\Delta u_{k-1}}{\mu + (\Delta u_{k-1})^2},
\]

where η = ρ = μ = 1, λ = 0.5. Let Φ0 be initialized by an arbitrary positive-definite
matrix. The corresponding state and control trajectories are shown in Fig. 14.5c, d,
respectively.
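The two MFAC recursions can be exercised on a toy example. The sketch below is only an illustration of the update order and is not the WGS experiment: it assumes a scalar state and control, and a hypothetical plant whose increments satisfy Δxk+1 = bΔuk with a constant gain b (here 0.5), chosen so that the behavior of the recursion is easy to verify by hand. The parameter values η = ρ = μ = 1, λ = 0.5, x0 = 273, and τ = 375 are those given in the text; Φ0 = 1 is an arbitrary positive choice.

```python
def simulate_mfac(plant_gain=0.5, tau=375.0, x0=273.0, phi0=1.0,
                  eta=1.0, rho=1.0, mu=1.0, lam=0.5, steps=100):
    """Run the MFAC recursions on a hypothetical plant x_{k+1} = x_k + b (u_k - u_{k-1})."""
    x_prev, x = x0, x0            # x_{k-1}, x_k
    u_prev2, u_prev = 0.0, 0.0    # u_{k-2}, u_{k-1}
    phi = phi0
    xs, us, phis = [x], [], []
    for _ in range(steps):
        dx = x - x_prev           # Delta x_k
        du = u_prev - u_prev2     # Delta u_{k-1}
        # Estimate of Phi_k from the latest input-output increments
        phi = phi + eta * (dx - phi * du) * du / (mu + du * du)
        # Control update toward the desired temperature tau
        u = u_prev + rho * phi * (tau - x) / (lam + phi * phi)
        x_next = x + plant_gain * (u - u_prev)   # toy plant (assumption)
        x_prev, x = x, x_next
        u_prev2, u_prev = u_prev, u
        xs.append(x)
        us.append(u)
        phis.append(phi)
    return xs, us, phis


xs, us, phis = simulate_mfac()
print(xs[-1], us[-1], phis[-1])
```

On this toy plant the estimate Φk moves from Φ0 toward the plant gain b and the tracking error contracts at every step, so no overshoot occurs; the overshoot reported for MFAC in this chapter arises on the WGS system, which this sketch does not model.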

From the numerical results, we can see that using the present stable iterative
ADP algorithm, it takes 25 time steps for the system state to track the desired one,
whereas with the MFAC algorithm in [7], it takes 50 time steps. Furthermore, overshoots occur with the method of [7], while they are avoided by the present stable iterative ADP algorithm. These results illustrate
the effectiveness of our algorithm.

14.6 Conclusions

In this chapter, an effective data-driven stable iterative ADP algorithm is established
to solve optimal temperature control problems for WGS systems. Using the WGS
system data, NNs are used to approximate the system model and the reference control,
respectively. The stable iterative ADP algorithm is established to obtain the optimal
control law where the approximation errors of NNs and the disturbances are both
considered. The convergence and stability properties are analyzed.

References

1. Alonso-Martinez J, Eloy-Garcia J, Santos-Martin D, Arnaltes S (2013) A new variable-
frequency optimal direct power control algorithm. IEEE Trans Ind Electron 60(4):1442–1451
2. Andrikopoulos G, Nikolakopoulos G, Arvanitakis I, Manesis S (2014) Piecewise affine model-
ing and constrained optimal control for a pneumatic artificial muscle. IEEE Trans Ind Electron
61(2):904–916
3. Baier T, Kolb G (2007) Temperature control of the water gas shift reaction in microstructured
reactors. Chem Eng Sci 62(17):4602–4611
4. Barros JD, Silva JFA, Jesus EGA (2013) Fast-predictive optimal control of NPC multilevel
converters. IEEE Trans Ind Electron 60(2):619–627
5. Do TD, Choi HH, Jung JW (2012) SDRE-based near optimal control system design for PM
synchronous motor. IEEE Trans Ind Electron 59(11):4063–4074
6. Falco MD, Piemonte V, Basile A (2012) Performance assessment of water gas shift membrane
reactors by a two-dimensional model. Comput Aided Chem Eng 31:610–614
7. Hou Z, Jin S (2011) Data-driven model-free adaptive control for a class of MIMO nonlinear
discrete-time systems. IEEE Trans Neural Netw 22(12):2173–2188
8. Huang Y, Liu D, Wei Q (2012) Temperature control in water-gas shift reaction with adaptive
dynamic programming. In: Proceedings of the international symposium on neural networks,
pp 478–487
9. Jing X, Cheng L (2013) An optimal PID control algorithm for training feedforward neural
networks. IEEE Trans Ind Electron 60(6):2273–2283
10. Khalil HK (2002) Nonlinear systems. Prentice-Hall, Upper Saddle River
11. Kim GY, Mayor JR, Ni J (2005) Parametric study of microreactor design for water gas shift
reactor using an integrated reaction and heat exchange model. Chem Eng J 110(1):1–10
12. Lewis FL, Liu D (2012) Reinforcement learning and approximate dynamic programming for
feedback control. Wiley, Hoboken
13. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260

14. Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-
time nonlinear systems. IEEE Trans Cybern 43(2):779–789
15. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
16. Lu X, Wang T (2013) Water-gas shift modeling in coal gasification in an entrained-flow gasifier.
Part 1: development of methodology and model calibration. Fuel 108:629–638
17. Moe JM (1962) Design of water gas shift reactors. Chem Eng Prog 58(3):33–36
18. Olalla C, Queinnec I, Leyva R, Aroudi AE (2012) Optimal state-feedback control of bilinear
DC-DC converters with guaranteed regions of stability. IEEE Trans Ind Electron 59(10):3868–3880
19. Rathore R, Holtz H, Boller T (2013) Generalized optimal pulsewidth modulation of multilevel
inverters for low-switching-frequency control of medium-voltage high-power industrial AC
drives. IEEE Trans Ind Electron 60(10):4215–4224
20. Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
21. Sudhakar M, Narasimhan S, Kaisare NS (2012) Approximate dynamic programming based
control for water gas shift reactor. Comput Aided Chem Eng 31:340–344
22. Ueyama Y, Miyashita E (2014) Optimal feedback control for predicting dynamic stiffness
during arm movement. IEEE Trans Ind Electron 61(2):1044–1052
23. Wei Q, Liu D (2012) An iterative ε-optimal control scheme for a class of discrete-time nonlinear
systems with unfixed initial state. Neural Netw 32:236–244
24. Wei Q, Liu D (2013) Numerical adaptive learning control scheme for discrete-time nonlinear
systems. IET Control Theory Appl 7(11):1472–1486
25. Wei Q, Liu D (2013) A novel iterative θ-adaptive dynamic programming for discrete-time
nonlinear systems. IEEE Trans Autom Sci Eng 11(4):1176–1190
26. Wei Q, Liu D (2014) Data-driven neuro-optimal temperature control of water-gas shift reaction
using stable iterative adaptive dynamic programming. IEEE Trans Ind Electron 61(11):6399–
6408
27. Wei Q, Wang D, Zhang D (2013) Dual iterative adaptive dynamic programming for a class of
discrete-time nonlinear systems with time-delays. Neural Comput Appl 23(7–8):1851–1863
28. Wright GT, Edgar TF (1989) Adaptive control of a laboratory water-gas shift reactor with
dynamic inlet condition. In: Proceedings of the American control conference, pp 1828–1833
29. Xiao S, Li Y (2013) Optimal design, fabrication, and control of an micropositioning stage
driven by electromagnetic actuators. IEEE Trans Ind Electron 60(10):4613–4626
30. Yang Q, Jagannathan S (2012) Reinforcement learning controller design for affine nonlinear
discrete-time systems using online approximators. IEEE Trans Syst Man Cybern Part B Cybern
42(2):377–390
31. Yin S, Ding SX, Haghani A, Hao H, Zhang P (2012) A comparison study of basic data-driven
fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process.
J Process Control 22(9):1567–1581
32. Yin S, Luo H, Ding SX (2014) Real-time implementation of fault-tolerant control systems with
performance optimization. IEEE Trans Ind Electron 61(5):2402–2411
33. Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47(1):207–214
34. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans
Syst Man Cybern Part B Cybern 38(4):937–942
